bug-parallel

Re: GNU Parallel Bug Reports Unexpected behavior when handling binary da


From: Andreas Bernauer
Subject: Re: GNU Parallel Bug Reports Unexpected behavior when handling binary data with --regexp and --recstart
Date: Thu, 18 Jun 2015 09:21:21 +0200
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.7.0

On 18/06/15 1:12, Tim Mattison wrote:
> I have some data that I want to process with GNU Parallel that is pure
> binary data.  I saw an example somewhere that showed how I could use
> --regexp and --recstart to specify a binary pattern and it seemed like
> it worked at first but after running it for a while I noticed that it
> appeared to be missing the binary pattern sometimes.  I wrote a script
> that reproduces this issue and wanted to see if someone could explain if
> this is expected or not.

While I can reproduce your 'bug', your script does not do what you think
it does. :-)

(The dd output never reaches test.raw, because the trailing
'&> /dev/null' redirects stdout as well as stderr; macOS's grep does not
understand the -P (--perl-regexp) option; and the script should be run
with bash (for echo's '-n').)

Anyhow, I have attached an updated script, test2.sh, which still shows
the 'bug': parallel seems to split off only the last record.

I put 'bug' in quotes, as it does not seem to be related to binary
data. The 'bug' also appears with regular (printable) data; see the
attached test3.sh script. I suppose the record-splitting feature is
tested, so perhaps we are not using it properly?

-Andreas
~~~~~~~~~~~~~~~~
$ ls
test.sh*  test2.sh*
$ ./test2.sh
Creating test file... done.
Instances of pattern found with grep:        3
Output files from GNU Parallel:        2
test.raw:
0000000 01 67 aa 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010 01 67 bb 00 00 00 00 00 00 00 00 00 00 00 00 00
0000020 01 67 cc 00 00 00 00 00 00 00 00 00 00 00 00 00
0000030
parallel's results:
1.test-result
0000000 01 67 aa 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010 01 67 bb 00 00 00 00 00 00 00 00 00 00 00 00 00
0000020
2.test-result
0000000 01 67 cc 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010
$ parallel --version
GNU parallel 20150522
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.
~~~~~~~~~~~~~~~~

With ASCII data in test.raw:
~~~~~~~~~~~~~~~
$ ./test3.sh
Creating test file... done.
Instances of pattern found with grep:        3
Output files from GNU Parallel:        2
test.raw:
0000000 41 42 43 78 61 61 31 32 33 34 35 36 37 38 39 30
0000010 41 42 43 78 62 62 31 32 33 34 35 36 37 38 39 30
0000020 41 42 43 78 63 63 31 32 33 34 35 36 37 38 39 30
0000030
parallel's results:
1.test-result
0000000 41 42 43 78 61 61 31 32 33 34 35 36 37 38 39 30
0000010 41 42 43 78 62 62 31 32 33 34 35 36 37 38 39 30
0000020
2.test-result
0000000 41 42 43 78 63 63 31 32 33 34 35 36 37 38 39 30
0000010
$ cat test.raw
ABCxaa1234567890ABCxbb1234567890ABCxcc1234567890
~~~~~~~~~~~~~~~~
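
test3.sh itself is only attached, not shown. The following is a guess at
a minimal equivalent, rebuilt from the 'cat test.raw' output above and
the parallel options in the quoted script. The function names are
invented for illustration, and the parallel step is skipped unless
'parallel --version' identifies itself as GNU parallel.

```shell
#!/bin/sh
# Sketch only: an assumed reconstruction, not the attached test3.sh.

# Three 16-byte records, each starting with "ABCx"; no trailing newline.
make_ascii_file() {
    printf 'ABCxaa1234567890ABCxbb1234567890ABCxcc1234567890' > "$1"
}

# Count record starts; grep -o prints each match on its own line.
count_starts() {
    grep -o 'ABCx' "$1" | wc -l | tr -d ' '
}

# Split on the record start, using the options from the quoted script.
# Any extra arguments are passed straight through to parallel.
split_records() {
    input=$1
    shift
    parallel -k --pipe --regexp --recstart 'ABCx' --recend '' "$@" \
        'cat > {#}.test-result' < "$input"
}

make_ascii_file test.raw
echo "Instances of pattern found with grep: $(count_starts test.raw)"

# Run the split only when GNU parallel (not another 'parallel') is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    rm -f ./*.test-result
    split_records test.raw
    echo "Output files from GNU Parallel: $(ls ./*.test-result 2>/dev/null | wc -l | tr -d ' ')"
fi
```

According to the parallel manual, -N combined with --pipe sets the
number of records per job, so 'split_records test.raw -N 1' might be
worth trying if one file per record is wanted. Whether that changes the
result seen here is not established in this thread.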


> 
> The script creates a file that has the binary pattern 0000000167 in it
> three times.  Each instance of the pattern is followed immediately by
> AA, BB, or CC, and then 1000 bytes of zeroes.
> 
> GNU grep reports that it sees this binary pattern three times.  GNU
> Parallel splits this up into two files though and the first file has the
> AA and BB instances of the pattern in it.
> 
> Do I need to do something else to make sure this pattern is checked for
> in a different way?
> 
> This script is written to run on Mac OS.  If you are running on Linux
> you'll need to change "ggrep" to "grep".
> 
> Thanks,
> Tim
> 
> -- CUT HERE --
> # Remove any old test results
> rm -f *.test-result
> 
> # Create the test file
> echo -n "Creating test file... "
> echo -ne '\x00\x00\x00\x01\x67\xaa' > test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo -ne '\x00\x00\x00\x01\x67\xbb' >> test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo -ne '\x00\x00\x00\x01\x67\xcc' >> test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo "done."
> 
> # Count the number of times grep finds this pattern (using ggrep since
> # we're on Mac OS)
> echo -n "Instances of pattern found with grep: "
> ggrep -obUaP '\x00\x00\x00\x01\x67' test.raw | wc -l
> 
> # Have GNU Parallel split up the file based on the given pattern as a regexp
> cat test.raw | parallel -k --pipe --regexp --recstart \
>   '\x00\x00\x00\x01\x67' --recend '' cat\>{#}.test-result &> /dev/null
> 
> # Count the number of output files GNU Parallel created
> echo -n "Output files from GNU Parallel: "
> ls -la *.test-result | wc -l
> 
> # Remove the test results.  Comment this out if you want to examine them
> # after the fact.
> rm *.test-result
> 

Attachment: test2.sh
Description: Bourne shell script

Attachment: test3.sh
Description: Bourne shell script

