GNU Parallel Bug Reports imperfect parallelization
From: FrithMartin
Subject: GNU Parallel Bug Reports imperfect parallelization
Date: Fri, 26 Jun 2015 06:50:08 +0000
Hi again,
I would like to use GNU parallel to analyze genome sequences, so that I can
analyze the chromosomes in parallel.
Let's make a fake genome, with 12 equal-sized chromosomes, in FASTA format:
seq 12000000 | awk 'NR % 1000000 == 1 {print ">"} {print "aaaaaaaaa"}' > fake-genome.fasta
If I have >=12 CPUs, I should be able to get a 12-fold speedup, by analyzing
all the chromosomes in parallel.
Let's try:
parallel --pipe --recstart '>' -k wc < fake-genome.fasta
parallel: Warning: A record was longer than 1048576. Increasing to --blocksize 1363150.
parallel: Warning: A record was longer than 1363150. Increasing to --blocksize 1772096.
parallel: Warning: A record was longer than 1772096. Increasing to --blocksize 2303726.
parallel: Warning: A record was longer than 2303726. Increasing to --blocksize 2994845.
parallel: Warning: A record was longer than 2994845. Increasing to --blocksize 3893300.
parallel: Warning: A record was longer than 3893300. Increasing to --blocksize 5061291.
parallel: Warning: A record was longer than 5061291. Increasing to --blocksize 6579680.
parallel: Warning: A record was longer than 6579680. Increasing to --blocksize 8553585.
parallel: Warning: A record was longer than 8553585. Increasing to --blocksize 11119662.
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
2000002 2000002 20000004
1000001 1000001 10000002
1000001 1000001 10000002
It did not separate the 9th and 10th chromosomes, so one job has to process two
chromosomes and I only get a 6-fold speedup instead of 12-fold.
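The 6-fold figure can be checked with a short calculation. This is only a sketch: it assumes that analysis time is proportional to the number of (equal-sized) chromosomes in a chunk and that every chunk runs on its own CPU. The helper name `best_speedup` is made up for illustration.

```python
# Chromosomes per job, read off the third column of the wc output above
# (each chromosome is 10000002 bytes).
chunk_sizes = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1]


def best_speedup(chunks):
    """Upper bound on speedup: total work divided by the largest chunk."""
    return sum(chunks) / max(chunks)


print(best_speedup(chunk_sizes))  # 6.0
print(best_speedup([1] * 12))     # 12.0, if every chromosome were separate
```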
The root cause is that it *adds* blocksize bytes to the partial record already
in memory. This means that the chunk size increases even when the blocksize
does not increase. To fix this, instead of reading blocksize bytes, read
(blocksize minus partial-record-size) bytes. I attach a patch that fixes this.
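The effect described above can be reproduced with a small byte-counting model. This is a simplified sketch, not GNU Parallel's actual code and not the attached patch: the function name `simulate_chunks` is hypothetical, the input is modelled only by offsets, and the blocksize growth rule is an assumption fitted to the warning values in the log above.

```python
def simulate_chunks(total_bytes, record_bytes, blocksize, subtract_partial):
    """Return the number of records in each chunk handed to a job.

    subtract_partial=False models reading `blocksize` bytes on every read;
    subtract_partial=True models reading (blocksize - partial record) bytes.
    """
    chunks = []
    start = 0  # offset where the buffered data begins (always a record start)
    pos = 0    # offset just past the last byte read so far
    while pos < total_bytes:
        partial = pos - start  # bytes of the partial record already in memory
        want = blocksize - partial if subtract_partial else blocksize
        pos = min(pos + want, total_bytes)
        if pos == total_bytes:
            # End of input: everything still buffered becomes the last chunk.
            chunks.append(-(-(pos - start) // record_bytes))
            break
        # Offset of the last record start that lies inside the buffer.
        last = (pos - 1) // record_bytes * record_bytes
        if last > start:
            # At least one complete record: emit up to the last record start.
            chunks.append((last - start) // record_bytes)
            start = last
        else:
            # No complete record yet: grow the blocksize.
            # ASSUMPTION: ceil(1.3 * blocksize) + 1, fitted to the warnings.
            blocksize = (13 * blocksize + 9) // 10 + 1
    return chunks


RECORD = 10000002  # bytes per fake chromosome, from the wc output
TOTAL = 12 * RECORD

print(simulate_chunks(TOTAL, RECORD, 1048576, subtract_partial=False))
# [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1]
print(simulate_chunks(TOTAL, RECORD, 1048576, subtract_partial=True))
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Under these assumptions, the first call yields the same chunk pattern as the wc output above, with the 9th and 10th chromosomes merged, while subtracting the partial record size yields twelve single-chromosome chunks.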
My "parallel --version" is:
GNU parallel 20150622
Have a nice day,
Martin Frith
P.S. I would also like to request that the blocksize-increase warnings be
removed when the user did not specify a blocksize, because they are harmless
and just cause needless concern.
Attachment: read-patch.txt