bug-parallel

GNU Parallel Bug Reports Truncated large records


From: Johannes Dröge
Subject: GNU Parallel Bug Reports Truncated large records
Date: Mon, 23 Feb 2015 14:28:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

Hi Ole and GNU parallel devs,

I'm processing large files (~50 GiB) with variable record sizes and have the 
following issues:

1) The run-time for processing an individual block grows faster than linearly 
with the input size. It would therefore be best if GNU parallel allowed passing 
single records or a fixed number of records to each job, or at least did not 
automatically increase the block size. Instead, the block size auto-detection 
keeps increasing the block size when it meets large individual records, until 
only very few processes run in parallel, and these then dominate the overall 
run-time. This behavior strongly reduces the granularity of the parallel 
execution.

2) Large records (>2 GiB) are being truncated at 2 GiB and are thus passed 
incompletely via stdin. You can find my compressed input under

https://elefant.bifo.helmholtz-hzi.de/public.php?service=files&t=48fb2c2e7ba7ace340acf37ffe9803f3
 (~1.2 GiB, valid until March 2015)

and I'm processing the data as follows:

zcat debug.maf.gz | parallel --halt-on-error --no-notice --gnu --pipe \
  --recstart '# batch ' --recend '\n\n' 'cat > "$PARALLEL_SEQ".maf'

You will see that only one job and one output file are created, because the 
first record is the largest one. The output is then truncated after exactly 
2 GiB. I think this is a serious issue, since it is silent data corruption and 
will affect the analysis if, for instance, biological sequence data is 
shortened before analysis.
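
The truncation can be confirmed by comparing the size of the output file with 
2 GiB in bytes. The following is only a minimal sketch: has_exact_size is a 
hypothetical helper (not part of GNU parallel), and 1.maf is assumed to be the 
file written by the first job of the command above.

```shell
# Hypothetical helper (not part of GNU parallel):
# succeeds if FILE is exactly BYTES bytes long.
has_exact_size() {
    file=$1
    bytes=$2
    # wc -c reports the size in bytes; fail with status 2 if FILE is unreadable.
    actual=$(wc -c < "$file") || return 2
    [ "$actual" -eq "$bytes" ]
}

# 2 GiB expressed in bytes (2 * 1024^3 = 2147483648).
two_gib=$((2 * 1024 * 1024 * 1024))

# Assumption: 1.maf is the output of the first job (PARALLEL_SEQ = 1).
if [ -e 1.maf ] && has_exact_size 1.maf "$two_gib"; then
    echo "1.maf is exactly 2 GiB long, so the record was probably truncated"
fi
```

A size of exactly 2147483648 bytes for a record that is known to be larger 
points to a cut at the 2 GiB boundary rather than at a record separator.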

Info: I'm using the latest version of GNU parallel (20150122) on 64-bit Linux 
(Debian 7).

Thanks for your help.

Gruß Johannes

-- 
Johannes Dröge, M.Sc.
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf
25.12.01.50, Universitätsstraße 1, 40225 Düsseldorf, Germany
PGP: http://keys.fungs.de/6ea5e4.asc (55F2720303A7F236A94666F20E2360727A6EA5E4)
Web: algbio.cs.uni-duesseldorf.de | Tel/Fax: +49 211 81-12644/13464


