Re: GNU Parallel Bug Reports imperfect parallelization

On Fri, Jun 26, 2015 at 3:22 AM, FrithMartin <address@hidden> wrote:

Hi again,

I would like to use GNU parallel to analyze genome sequences, processing the chromosomes in parallel.
Let's make a fake genome, with 12 equal-sized chromosomes, in FASTA format:

seq 12000000 | awk 'NR % 1000000 == 1 {print ">"} {print "aaaaaaaaa"}' > fake-genome.fasta
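
As a sanity check on the record sizes, here is a minimal Python sketch of a scaled-down version of the same file. The helper names fake_genome and wc are made up for illustration. Each chromosome is one ">" line followed by N lines of nine "a" characters, so a record has N + 1 lines, N + 1 words and 10 * N + 2 bytes. With N = 1000000 that is 10000002 bytes per chromosome.

```python
def fake_genome(n_chrom, lines_per_chrom):
    """Scaled-down equivalent of the seq | awk pipeline (hypothetical helper)."""
    record = b">\n" + b"aaaaaaaaa\n" * lines_per_chrom
    return record * n_chrom


def wc(record):
    """Return (lines, words, bytes), like the default output of wc."""
    return record.count(b"\n"), len(record.split()), len(record)


# Small genome: 12 chromosomes of 1000 sequence lines each.
genome = fake_genome(n_chrom=12, lines_per_chrom=1000)

# Split on the record start and put the ">" back on each record.
records = [b">" + r for r in genome.split(b">")[1:]]

print(len(records), wc(records[0]))
```

Perfect splitting should therefore give 12 identical wc lines, one per chromosome.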

If I have >=12 CPUs, I should be able to get a 12-fold speedup, by analyzing all the chromosomes in parallel.
Let's try:

parallel --pipe --recstart '>' -k wc < fake-genome.fasta

parallel: Warning: A record was longer than 1048576. Increasing to --blocksize 1363150.
parallel: Warning: A record was longer than 1363150. Increasing to --blocksize 1772096.
parallel: Warning: A record was longer than 1772096. Increasing to --blocksize 2303726.
parallel: Warning: A record was longer than 2303726. Increasing to --blocksize 2994845.
parallel: Warning: A record was longer than 2994845. Increasing to --blocksize 3893300.
parallel: Warning: A record was longer than 3893300. Increasing to --blocksize 5061291.
parallel: Warning: A record was longer than 5061291. Increasing to --blocksize 6579680.
parallel: Warning: A record was longer than 6579680. Increasing to --blocksize 8553585.
parallel: Warning: A record was longer than 8553585. Increasing to --blocksize 11119662.
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
2000002 2000002 20000004
1000001 1000001 10000002
1000001 1000001 10000002

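It did not separate the 9th and 10th chromosomes. That job has twice as much work as the others, so the whole run takes as long as 2 chromosomes instead of 1, and I only get a 6-fold speedup (12 / 2).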
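
The speedup arithmetic can be sketched as follows. This assumes there are at least as many CPUs as chunks and that the cost of a chunk is proportional to the number of chromosomes in it. The function name ideal_speedup is made up for illustration.

```python
def ideal_speedup(chunk_costs):
    """Serial time divided by parallel time when every chunk has its own CPU.

    Serial time is the sum of all chunk costs; parallel time is the cost
    of the slowest chunk.
    """
    return sum(chunk_costs) / max(chunk_costs)


perfect = [1] * 12                   # one chromosome per chunk
observed = [1] * 8 + [2] + [1, 1]    # 9th chunk holds two chromosomes

print(ideal_speedup(perfect))
print(ideal_speedup(observed))
```

Under these assumptions the perfect split gives 12.0 and the observed split gives 6.0.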

The root cause is that parallel reads blocksize bytes and *adds* them to the partial record already in memory, instead of filling the buffer up to blocksize. This means that the chunk size increases even when the blocksize does not increase. To fix this, instead of reading blocksize bytes, read (blocksize minus partial-record-size) bytes. I attach a patch that fixes this.
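
The effect can be illustrated with a toy model in Python. This is a simplified sketch, not GNU parallel's actual code and not the attached patch. The function split_records and its parameters are made up for illustration. The growth of about 30% when a record does not fit is only inferred from the warnings above.

```python
def split_records(data, blocksize, fixed, recstart=b">"):
    """Toy model of cutting a stream into chunks at record starts.

    fixed=False: every read appends a full `blocksize` bytes to the
                 partial record kept from the previous read.
    fixed=True:  every read appends only (blocksize - len(partial))
                 bytes, so the buffer never exceeds `blocksize`.
    """
    chunks = []
    partial = b""
    pos = 0
    while pos < len(data):
        want = blocksize - len(partial) if fixed else blocksize
        want = max(1, want)  # defensive: always make progress
        buf = partial + data[pos:pos + want]
        pos += want
        # Last record start in the buffer, ignoring the one at index 0.
        cut = buf.rfind(recstart, 1)
        if cut > 0:
            chunks.append(buf[:cut])   # emit all complete records
            partial = buf[cut:]        # keep the incomplete tail
        else:
            partial = buf              # record longer than the buffer
            blocksize = int(blocksize * 1.3) + 1  # assumed growth rule
    if partial:
        chunks.append(partial)         # whatever is left at end of input
    return chunks


# 12 records of 100 bytes each, blocksize slightly above one record.
data = (b">" + b"a" * 99) * 12
buggy = [c.count(b">") for c in split_records(data, 112, fixed=False)]
patched = [c.count(b">") for c in split_records(data, 112, fixed=True)]
print(buggy)
print(patched)
```

In this model, with 100-byte records and a blocksize of 112, the leftover tail grows by 12 bytes per read in the unfixed mode until the buffer spans two complete records, giving [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1]. With the smaller read, every chunk holds exactly one record.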

My "parallel --version" is:
GNU parallel 20150622

Have a nice day,
Martin Frith

P.S. I also request removing the increasing-blocksize warnings when the user did not specify a blocksize, because they are harmless and just cause needless concern.
<read-patch.txt>

From:	Tim Mattison
Subject:	Re: GNU Parallel Bug Reports imperfect parallelization
Date:	Fri, 26 Jun 2015 02:45:31 -0700 (PDT)