guile-devel
Re: add regexp-split: a summary and new proposal


From: Eli Barzilay
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 16:13:41 -0500

11 hours ago, Daniel Hartwig wrote:
> On 31 December 2011 15:30, Eli Barzilay <address@hidden> wrote:
> > But there's one more point that bugs me about the python thing: the
> > resulting list has both the matches and the non-matching gaps, and
> > knowing which is which is tricky.  For example, if you do this (I'll
> > use our syntax here, so note the minor differences):
> >
> >  (define (foo rx)
> >    (regexp-split rx "some string"))
> >
> > then you can't tell which is which in its output without knowing how
> > many grouping parens are in the input regexp.  It therefore makes
> > sense to me to have this instead:
> >
> >  > (regexp-explode #rx"([^0-9])" "123+456*/")
> >  '("123" ("+") "456" ("*") "" ("/") "")
> >
> > and now it's easy to know which is which.  This is of course a simple
> > example with a single group so it doesn't look like much help, but
> > with more than one group things can get confusing otherwise: for
> > example, in python you can get `None's in the result:
> >
> >  >>> re.split('([^0-9](4)?)', '123+456*/')
> >  ['123', '+4', '4', '56', '*', None, '', '/', None, '']
> >
> > but with the above, this becomes:
> >
> >  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
> >  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
> >
> > so you can rely on the odd-numbered elements to be strings.  This is
> > probably going to be different for you, since you allow string
> > predicates instead of regexps.
> >
> > Finally, the Racket implementation will probably be a little different
> > still -- our `regexp-match' returns a list with the matched substring
> > first, and then the matches for the capturing groups.  Following this,
> 
> The format is the same in Guile, substring followed by capturing
> groups:
> 
> scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
> $7 = #("123+456*/" (3 . 4) (3 . 4))
> 
> Though that is more of an analogue to `regexp-match-positions'.

(I guess, if I understand the output to have yet another first
value, which is the string that the positions apply to.  We'd get only
the two pairs.)


> > a more uniform behavior for a `regexp-explode' would be to return
> > these lists, so we'd actually get:
> >
> >  > (regexp-explode #rx"[^0-9]" "123+456*/")
> >  '("123" ("+") "456" ("*") "" ("/") "")
> >  > (regexp-explode #rx"([^0-9])" "123+456*/")
> >  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")
> 
> This is a very interesting way to return the results.
> 
> Now that the `explode' has been separated from `split' I am actually
> quite partial to always including the matched substring in the result.
> This makes even more sense considering the output would be the same
> using a char-predicate or regexp with no capturing groups:
> 
> scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
> $8 = ("123" "+" "456" "*" "" "/" "")
> scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
> $9 = ("123" "+" "456" "*" "" "/" "")
> 
> And the result is compatible with using `string-concatenate' as an
> inverse operation:
> 
> scheme@(guile-user)> (string-concatenate $9)
> $10 = "123+456*/"
> 
> Bonus!

You mean keep the python thing, or have only the full matches rather
than the groups?

(If you keep the groups, then you get that bonus only when there are
no groups, of course, otherwise you get a semi-random character
salad.)
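
To make the round-trip point concrete, here is a rough Python sketch.  It
is not code from Guile, Racket, or any proposal in this thread;
`explode_full_matches' is a made-up name, and the sketch assumes the
pattern never matches the empty string.  It keeps only the full match
between gaps, and compares that with what python's `re.split' gives when
the regexp has groups:

```python
import re

def explode_full_matches(pattern, text):
    """Split text, keeping each full match (not its groups) between the gaps.

    Hypothetical helper for illustration; assumes no empty matches.
    """
    pieces, pos = [], 0
    for m in re.finditer(pattern, text):
        pieces.append(text[pos:m.start()])  # gap before the match
        pieces.append(m.group(0))           # the whole match only
        pos = m.end()
    pieces.append(text[pos:])               # trailing gap (possibly empty)
    return pieces

text = "123+456*/"

# Full matches only: concatenation gives the input back.
full = explode_full_matches(r"([^0-9](4)?)", text)
print(full)                    # ['123', '+4', '56', '*', '', '/', '']
print("".join(full) == text)   # True

# Python-style flat output with groups: concatenation does not round-trip.
python_style = re.split(r"([^0-9](4)?)", text)
salad = "".join(s for s in python_style if s is not None)
print(salad)                   # 123+4456*/
```

So the `string-concatenate' bonus holds whenever only the full matches are
kept, regardless of how many groups the regexp has.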


> WRT all the capturing groups as a list:
> 
>  + as you mention earlier the user can be somewhat ignorant of the
>    number of capturing groups (why not just use `split'?);

(Because of the usual reasons...  It's hiding as some random utility
that takes in a string from an api-level function, and now it needs to
parse it if you need to know the number of groups.)


>  + easier to handle collectively;
> 
>  - result is no longer a flat list (I *do* like sexps, really);

Well, given a `flatten' function it's trivial to get the flat form
back...

>  - moving away from *all* existing implementations;
> 
>  * trivial to transform between styles assuming one knows how many
>    capturing groups;

...but the flattened form loses information, which means that getting
from it to the nested one is impossible without information about the
(number of groups in the) regexp.
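
A small Python sketch of that asymmetry, under the assumption that the
nested form is "gap, then a list of full match plus groups" as in the
examples above.  Both `regexp_explode' and `flatten' are illustrative
names here, not an existing API, and empty matches are not handled:

```python
import re

def regexp_explode(pattern, text):
    """Alternate gaps (str) with one list per match: [full match, group 1, ...].

    Illustrative sketch only; unmatched groups show up as None.
    """
    result, pos = [], 0
    for m in re.finditer(pattern, text):
        result.append(text[pos:m.start()])
        result.append([m.group(0), *m.groups()])
        pos = m.end()
    result.append(text[pos:])
    return result

def flatten(nested):
    """Splice each per-match list into the surrounding list (one level)."""
    out = []
    for item in nested:
        if isinstance(item, list):
            out.extend(item)
        else:
            out.append(item)
    return out

print(regexp_explode(r"([^0-9])", "123+456*/"))
# ['123', ['+', '+'], '456', ['*', '*'], '', ['/', '/'], '']

# Two different nested results that flatten to the same list:
a = regexp_explode(r"\+", "a++b")      # no groups, two matches
b = regexp_explode(r"()(\+)", "a+b")   # two groups, one match
print(a)                               # ['a', ['+'], '', ['+'], 'b']
print(b)                               # ['a', ['+', '', '+'], 'b']
print(flatten(a) == flatten(b))        # True
```

Given only the flat list, there is no way to tell which of the two nested
shapes it came from without knowing the number of groups.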


> So now I am thinking about both `string-explode' (flat output) and
> `regexp-explode' with the nested output.

(I'm not familiar enough with your conventional differences between
`string-x' and `regexp-x', but that seems potentially confusing...)


> > And again, this looks silly in this simple example, but would be
> > more useful in more complex ones.  We would also have a similar
> > `regexp-explode-positions' function that returns position pairs
> > for cases where you don't want to allocate all substrings.
> 
> ... or need to know the positioning information.

Obviously.
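
For what it's worth, a positions variant could look roughly like the
following Python sketch.  The name `regexp_explode_positions' mirrors the
one mentioned above, but the exact output shape is my assumption: gaps as
(start, end) pairs, and each match as a list of pairs with None for groups
that did not participate.  Empty matches are again not handled:

```python
import re

def regexp_explode_positions(pattern, text):
    """Like the nested explode, but with (start, end) pairs, not substrings.

    Sketch under assumed conventions; no substrings are allocated here.
    """
    result, pos = [], 0
    for m in re.finditer(pattern, text):
        result.append((pos, m.start()))                      # gap
        spans = [m.span(i) for i in range(m.re.groups + 1)]  # match + groups
        # re reports (-1, -1) for a group that did not take part in the match
        result.append([s if s != (-1, -1) else None for s in spans])
        pos = m.end()
    result.append((pos, len(text)))                          # trailing gap
    return result

positions = regexp_explode_positions(r"([^0-9](4)?)", "123+456*/")
print(positions)
# [(0, 3), [(3, 5), (3, 5), (4, 5)], (5, 7), [(7, 8), (7, 8), None],
#  (8, 8), [(8, 9), (8, 9), None], (9, 9)]
```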


> [BTW, substrings in Guile share copy-on-write memory with their super
> so I don't see string allocation as an issue on the Guile front.  Not
> sure about substrings in Racket.]

We have the ability to share substrings, but I don't think that we're
using it for these things.  It seems dangerous to me -- what if I do
something like:

  (define x (substring (make-string 1000000000 #\space) 0 1))

?  With a naive implementation you'd get the whole GB in memory just
for that tiny string...

In any case, we also allow regexp operations on ports, and in that
case allocation is an issue no matter what you do.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


