bug-parallel

GNU Parallel Bug Reports imperfect parallelization


From: FrithMartin
Subject: GNU Parallel Bug Reports imperfect parallelization
Date: Fri, 26 Jun 2015 06:50:08 +0000

Hi again,

I would like to use GNU parallel to analyze genome sequences, processing the 
chromosomes in parallel.
Let's make a fake genome, with 12 equal-sized chromosomes, in FASTA format:

seq 12000000 | awk 'NR % 1000000 == 1 {print ">"} {print "aaaaaaaaa"}' > fake-genome.fasta

With 12 or more CPUs, I should be able to get up to a 12-fold speedup by 
analyzing all the chromosomes in parallel.
Let's try:

parallel --pipe --recstart '>' -k wc < fake-genome.fasta

parallel: Warning: A record was longer than 1048576. Increasing to --blocksize 1363150.
parallel: Warning: A record was longer than 1363150. Increasing to --blocksize 1772096.
parallel: Warning: A record was longer than 1772096. Increasing to --blocksize 2303726.
parallel: Warning: A record was longer than 2303726. Increasing to --blocksize 2994845.
parallel: Warning: A record was longer than 2994845. Increasing to --blocksize 3893300.
parallel: Warning: A record was longer than 3893300. Increasing to --blocksize 5061291.
parallel: Warning: A record was longer than 5061291. Increasing to --blocksize 6579680.
parallel: Warning: A record was longer than 6579680. Increasing to --blocksize 8553585.
parallel: Warning: A record was longer than 8553585. Increasing to --blocksize 11119662.
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
2000002 2000002 20000004
1000001 1000001 10000002
1000001 1000001 10000002

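It did not separate the 9th and 10th chromosomes: they went to a single job. 
That job has twice the work of the others, so I only get a 6-fold speedup 
(12 chromosomes / 2 in the slowest job).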
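The 6-fold figure can be checked from the wc output itself. Here is a small 
sketch: speedup_bound is a hypothetical helper name (not part of GNU parallel), 
and it assumes the work per job is proportional to the line count in the first 
column.

```shell
# speedup_bound: read `wc` output (one line per job) on stdin and print
# total lines divided by the largest single job's lines.
# Assumption: run time of a job is proportional to its line count.
speedup_bound() {
    awk '{ total += $1; if ($1 > max) max = $1 }
         END { if (max > 0) print total / max }'
}
```

For example, piping the command above into it, as in 
"parallel --pipe --recstart '>' -k wc < fake-genome.fasta | speedup_bound", 
should print 6 for the output shown (the warnings are not part of the piped 
output as far as I can tell).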

The root cause is that, on each read, it *adds* blocksize bytes to the partial 
record already in memory. This means that the chunk held in memory keeps 
growing even when the blocksize does not increase, until it contains two 
complete records, which are then passed to a single job. To fix this, instead 
of reading blocksize bytes, read (blocksize minus partial-record-size) bytes. 
I attach a patch that fixes this.
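To illustrate the effect, here is a simplified model of the two read 
strategies in plain shell arithmetic. It is only a sketch, not GNU parallel's 
code: simulate_chunks is a made-up name, the blocksize is assumed to stay fixed 
at the final value from the warnings (11119662), every record is assumed to be 
10000002 bytes as in the fake genome, and a chunk is assumed to end at the last 
record start found in memory.

```shell
#!/bin/sh
# simulate_chunks MODE: print how many records each chunk would contain.
#   MODE "add":  read BLOCK bytes on top of the partial record (reported behaviour)
#   MODE "fill": read BLOCK minus the partial-record size (proposed fix)
# Simplified model with assumed constants; not GNU parallel's actual code.
simulate_chunks() {
    mode=$1
    record=10000002            # bytes per chromosome in fake-genome.fasta
    remaining=$((12 * record)) # bytes of input not yet read
    block=11119662             # final --blocksize from the warnings
    partial=0                  # bytes of the incomplete record kept in memory
    out=""
    while [ "$remaining" -gt 0 ] || [ "$partial" -gt 0 ]; do
        if [ "$mode" = add ]; then
            want=$block
        else
            want=$((block - partial))
        fi
        # A read cannot return more than what is left in the input.
        [ "$want" -gt "$remaining" ] && want=$remaining
        remaining=$((remaining - want))
        buffer=$((partial + want))
        if [ "$remaining" -eq 0 ]; then
            # End of input: everything in memory is flushed as one chunk.
            n=$((buffer / record))
            partial=0
        else
            # Complete records before the last record start in memory.
            n=$(( (buffer - 1) / record ))
            partial=$((buffer - n * record))
        fi
        out="$out${out:+ }$n"
    done
    echo "$out"
}

simulate_chunks add
simulate_chunks fill
```

Under these assumptions, "add" prints 1 1 1 1 1 1 1 1 2 1 1, which matches the 
wc output above (the 9th chunk gets two chromosomes), and "fill" prints twelve 
1s. In "add" mode the partial record grows by blocksize minus record size 
(1119660 bytes) per read, so after 8 reads there is room for two records.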

My "parallel --version" is:
GNU parallel 20150622

Have a nice day,
Martin Frith

P.S. I would also like to request that the "Increasing to --blocksize" 
warnings be suppressed when the user did not specify a blocksize, because 
they are harmless and just cause needless concern.

Attachment: read-patch.txt
Description: read-patch.txt

