[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Chicken-users] utf8 and string-ref performance
From: |
Peter Bex |
Subject: |
Re: [Chicken-users] utf8 and string-ref performance |
Date: |
Wed, 24 Nov 2010 17:05:18 +0100 |
User-agent: |
Mutt/1.4.2.3i |
On Wed, Nov 24, 2010 at 08:37:37AM -0700, Alan Post wrote:
> gentufa'i works by storing the entire input port in a string, and
> ceating position objects to refer to the "rest of the string" as I
> parse.
>
> This means I need to perform the following:
>
> 1) reference a character by index
> 2) compare a character, string, or regular expression starting
> at an index.
Are you sure you need this? If I understood the sentence above the
list correctly, it might be enough to use string->list and then work
with a list of characters. This can be done pretty fast, and you
can store pointers into arbitrary places of the input simply by storing
the relevant cons cell.
> Without utf8, step 1 is O(1). With utf8 enabled, that step becomes
> O(n). step 2 is also more expensive with utf8, as I have to pay that
> same O(n) to get to the correct index, and I suspect I have some
> O(n*1) operations that become O(n*m), namely character class
> comparisons.
utf8 uses the fantastic "iset" egg for dealing with character sets.
membership tests are as fast as O(1) for short ranges, and become
O(log(n)) for longer ones, where n is the number of ranges it needs to
store. My knowledge of complexity theory is a little rusty so I'm
probably describing this in a very roundabout fashion, but the point is
that you shouldn't worry about charset performance. It'll be slower
than built-in srfi-14 (which is insanely fast - because it's a damn hack :P)
but not by too much.
> Essentially, I take characters off the front of the string as I
> parse, with the caveat that PEG parsers support full backtracking,
> so I sometimes retrieve previous position objects and work from
> there--I can't just throw away the prefix of the string I've
> matched.
With cons cells you should be able to implement this efficiently enough.
We don't have anything like string-pointers which can store arbitrary
indices in a string AFAIK. That would be useful to have, I guess.
> Also, I'm not 100% sure what the utf8 gets me, compared to treating
> the string like binary data. I suspect it would work if I had a
> utf8 input file, a utf8 string to match, but that I compare them as
> binary data. I couldn't compare a utf8 *character*, like
> #\¿, but I think I could compare "¿". (Those are both inverted
> question marks, if you don't have utf8 e-mail support) Am I wrong
> about that?
No, that's correct. It's only when you start operating on the character
level (or when splitting strings for example) that utf8 makes a difference.
HTH,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
is especially attractive, not only because it can be economically
and scientifically rewarding, but also because it can be an aesthetic
experience much like composing poetry or music."
-- Donald Knuth
- [Chicken-users] utf8 and string-ref performance, Alan Post, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, F. Wittenberger, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance,
Peter Bex <=
- Re: [Chicken-users] utf8 and string-ref performance, Alan Post, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, Felix, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, Peter Bex, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, Alan Post, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, Felix, 2010/11/24
- Re: [Chicken-users] utf8 and string-ref performance, Peter Bex, 2010/11/25
- Re: [Chicken-users] utf8 and string-ref performance, Felix, 2010/11/25
- Re: [Chicken-users] utf8 and string-ref performance, Peter Bex, 2010/11/25
- Re: [Chicken-users] utf8 and string-ref performance, Felix, 2010/11/25