UNIX join command bug
From: Guillaume Smits
Subject: UNIX join command bug
Date: Thu, 21 Aug 2008 16:45:12 +0100
Dear GNU,
I have two files in exactly the same format:
6 fields, tab-separated, with a \n at the end of each line, sorted
numerically on the key identifier (field #2).
Here is the head of the files:
File1
CHR SNP A1 A2 MAF NCHROBS
13 rs4 G A 0.0648148 216
7 rs8 T C 0.166667 216
7 rs16 T C 0.475962 208
...
File2
CHR SNP A1 A2 MAF NCHROBS
7 rs8 A G 0.215674 9876
7 rs16 G A 0.477102 9870
7 rs19 G A 0.385628 9880
...
The first file is ~ 1,400,000 lines long
The second file is ~ 330,000 lines long
There should be ~322,000 lines in common (i.e., with the same SNP
identifier - field #2).
When I perform a very simple join command as follows:
join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
I obtain a joined file of ~213,000 lines instead of the expected
~322,000 lines (about 65% of the expected lines).
The missing lines are scattered throughout the original files (at the
beginning, middle or end). I also cannot find any pattern in the SNP
identifiers of the missing lines.
For example, the following line is missing:
File 1
11 rs1535 G A 0.348624 218
File 2
11 rs1535 G A 0.440218 9886
As one can see, the key field identifier is identical (rs1535), hence
this line should be printed in the output.
I can't find any difference between the files (e.g., no hidden
characters) or between the key identifiers. The files are sorted in the
same way, tab-delimited in the same way, and so on.
The only difference is the number of lines (1.4 million in file 1; about
300 thousand in file 2). While large, these line counts should not be a
limiting factor for the join command (and why would the missing
lines be scattered throughout the files?).
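The sort-order claim above can be checked against what join expects. According to the GNU coreutils documentation, join requires both inputs to be sorted on the join field in the order that sort produces without -n, that is, as strings rather than as numbers. What follows is only a minimal Python sketch of such a check, not a diagnosis of the files in question. The function name first_out_of_order is hypothetical, and plain Python string comparison is assumed to stand in for the C locale; other locales may collate differently.

```python
def first_out_of_order(keys):
    """Return (index, previous_key, key) for the first key that sorts
    before its predecessor under plain string comparison, or None."""
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i, keys[i - 1], keys[i]
    return None


# Key column (field #2) from the head of File1, in its numeric order.
keys = ["rs4", "rs8", "rs16"]

problem = first_out_of_order(keys)
print(problem)        # (2, 'rs8', 'rs16')
print(sorted(keys))   # ['rs16', 'rs4', 'rs8']
```

Under these assumptions, the three keys shown at the head of File1 are in numeric order but not in string order, since "rs16" compares lower than "rs4" character by character.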
Using a Perl script to print the lines that have the same field 2
identifier, I obtain the expected ~322,000 lines, which suggests that
this is almost certainly a join command bug.
Question: Is there any trivial (or less trivial) explanation for this
join command bug?
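The Perl script itself is not shown. As an illustration only, a lookup-based comparison of that kind might be sketched in Python as follows. The function name join_on_field, the assumption that each key occurs at most once in the second file, and the join-like output layout (key first, then the remaining fields of each line) are all assumptions, not details from the original script. Unlike join, this approach does not depend on the order of the input lines.

```python
def join_on_field(lines1, lines2, key_index=1):
    """Inner-join two iterables of tab-separated lines on one column.

    key_index is 0-based, so 1 means field #2. Assumes each key occurs
    at most once in lines2 (a later duplicate would overwrite an earlier one).
    """
    def split(line):
        return line.rstrip("\n").split("\t")

    # Index the second input by its key field.
    by_key = {}
    for line in lines2:
        fields = split(line)
        by_key[fields[key_index]] = fields

    # Walk the first input and emit one row per matching key.
    joined = []
    for line in lines1:
        fields = split(line)
        key = fields[key_index]
        match = by_key.get(key)
        if match is not None:
            rest1 = fields[:key_index] + fields[key_index + 1:]
            rest2 = match[:key_index] + match[key_index + 1:]
            joined.append([key] + rest1 + rest2)
    return joined


# Data lines from the heads of File1 and File2 (headers omitted).
file1 = [
    "13\trs4\tG\tA\t0.0648148\t216\n",
    "7\trs8\tT\tC\t0.166667\t216\n",
    "7\trs16\tT\tC\t0.475962\t208\n",
]
file2 = [
    "7\trs8\tA\tG\t0.215674\t9876\n",
    "7\trs16\tG\tA\t0.477102\t9870\n",
    "7\trs19\tG\tA\t0.385628\t9880\n",
]

rows = join_on_field(file1, file2)
for row in rows:
    print("\t".join(row))
```

On the sample lines this yields the rows for rs8 and rs16, the two identifiers present in both heads.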
Thanks for your help,
G
Guillaume Smits
Team 108 - Human Genetics Program
The Sanger Institute
Office: N3-33, Morgan Building
Tel: + 44 (0)1223 834244 (ext 8643)
Email: address@hidden