help-octave

regexp: how to split a cellstr array into substring arrays, each matching regular expressions


From: Philip Nienhuis
Subject: regexp: how to split a cellstr array into substring arrays, each matching regular expressions
Date: Sun, 20 May 2012 22:00:53 +0200

Having a cellstr array like this:

octave:178> ar = {'abcdefguvwxAny' ; 'acegxyzTrailing'; 'vxzJunk'}
ar =
{
  [1,1] = abcdefguvwxAny
  [2,1] = acegxyzTrailing
  [3,1] = vxzJunk
}

how can I efficiently split it into two columns using regular expressions like
'[abcdefg]'  and  '[uvwxyz]'

to obtain

{ 'abcdefg', 'uvwxAny'; 'aceg', 'xyzTrailing'; '', 'vxzJunk'}  ?

In other words, I'd like to split each string in the cellstr array at the position where '[uvwxyz]' first matches (even if there is no match; see the last example far below).


The closest I get is:

## Invert pattern and use 'split' keyword
octave:179> ss = regexp (ar, '[^abcdefg]', 'split')
ss =
{
  [1,1] =
  {
    [1,1] = abcdefg
    [1,2] =
    [1,3] =
    [1,4] =
    [1,5] =
    [1,6] =
    [1,7] =
    [1,8] =
  }
  [2,1] =
  {
    [1,1] = aceg
    [1,2] =
    [1,3] =
    [1,4] =
    [1,5] =
    [1,6] = a
    [1,7] =
    [1,8] =
    [1,9] =
    [1,10] = g
  }
  [3,1] =
  {
    [1,1] =
    [1,2] =
    [1,3] =
    [1,4] =
    [1,5] =
    [1,6] =
    [1,7] =
    [1,8] =
  }
}
octave:180> col1 = cellfun (@(x) x{1}, {ss{:}}, 'uni', false)
col1 =
{
  [1,1] = abcdefg
  [1,2] = aceg
  [1,3] =
}
octave:181> col2 = regexp (ar, '[uvwxyz].*', 'match', 'once')
col2 =
{
  [1,1] = uvwxAny
  [2,1] = xyzTrailing
  [3,1] = vxzJunk
}

## ...or the latter statement, perhaps more robust, as:
octave:182> tt = regexp (ar, '[uvwxyz].*', 'match')
tt =
{
  [1,1] =
  {
    [1,1] = uvwxAny
  }
  [2,1] =
  {
    [1,1] = xyzTrailing
  }
  [3,1] =
  {
    [1,1] = vxzJunk
  }
}
octave:183> col2 = cellfun (@(x) [x{:}], {tt{:}}, 'Uni', false)
col2 =
{
  [1,1] = uvwxAny
  [1,2] = xyzTrailing
  [1,3] = vxzJunk
}
octave:184>


(cellfun() was used so that I could apply repeated indexing; I couldn't find another way to extract the first/last entries of ss and tt.)
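As a side note on the cellfun() calls: the {ss{:}} wrapper is probably not needed, since cellfun() accepts the cell array directly. A minimal sketch, assuming the same ar and pattern as above; passing ss as-is should also keep the result as a column instead of turning it into a row:

```octave
ar = {'abcdefguvwxAny'; 'acegxyzTrailing'; 'vxzJunk'};

## Split at every character that is NOT in [abcdefg], as above
ss = regexp (ar, '[^abcdefg]', 'split');

## Take the first piece of each split; ss is passed directly,
## so col1 keeps the 3x1 shape of ar
col1 = cellfun (@(x) x{1}, ss, 'UniformOutput', false);
```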
I think my method isn't very robust.
So I hope there's a less convoluted and more reliable way.
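One possible alternative, offered as a sketch rather than a definitive answer: ask regexp() only for the position of the first '[uvwxyz]' match ('start' together with 'once') and do the actual splitting by plain indexing. The choice to put the whole string into the first column when there is no match is an assumption on my part; the variable names k and result are just for illustration.

```octave
ar = {'abcdefguvwxAny'; 'acegxyzTrailing'; 'vxzJunk'};

result = cell (numel (ar), 2);
for i = 1:numel (ar)
  s = ar{i};
  ## Index of the first character in [uvwxyz]; empty if there is none
  k = regexp (s, '[uvwxyz]', 'start', 'once');
  if (isempty (k))
    ## Assumption: without a match, everything goes into column 1
    k = numel (s) + 1;
  endif
  ## Column 1: text before the match; column 2: from the match onward
  result(i,:) = {s(1:k-1), s(k:end)};
endfor
result
```

This avoids the inverted pattern and the nested cell arrays, at the cost of an explicit loop.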


BTW,
octave:184> ar = {'abcdefguvwxAny' ; 'acegxyzTrailing'; 'Junk'}
ar =
{
  [1,1] = abcdefguvwxAny
  [2,1] = acegxyzTrailing
  [3,1] = Junk
}
octave:186> tt = regexp (ar, '[uvwxyz].*', 'match', 'once')
tt =
{
  [1,1] = uvwxAny
  [2,1] = xyzTrailing
  [3,1] = unk
}

=> is this a bug? (swallowing the "J" from the last entry)
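On second thought this is probably expected behaviour rather than a bug: 'u' is itself a member of the class [uvwxyz], so the first match in 'Junk' starts at the 'u' and only the 'J' in front of it is left out. A quick check, as a sketch using the last entry as displayed above:

```octave
s = 'Junk';

## Position of the first character that belongs to [uvwxyz]
k = regexp (s, '[uvwxyz]', 'start', 'once');

## The character at that position, and the match from there to the end
c = s(k);
m = regexp (s, '[uvwxyz].*', 'match', 'once');
```

Here k should come out as 2, c as 'u' and m as 'unk'.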

Thanks,

Philip


