[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Sorting csv with embedded commas problem
From: |
Bernhard Voelker |
Subject: |
Re: Sorting csv with embedded commas problem |
Date: |
Fri, 2 Jul 2021 00:05:46 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 |
On 6/30/21 9:52 PM, Boyd Kelly wrote:
> Hello,
>
> Not sure if I am doing this correctly, but I seem to have a problem sorting
> when the first column of my data contains a uuid often with embedded
> commas. Does sort deal with this properly?
>
> My file contains this: (where the first and last fields may contain
> embedded commas.
>
> "}~,)4",nɛgɛso,yq,,vélo,,,,,0630
> "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
> ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
>
> The following is not correctly sorted by the second field
> $ sort -t "," -k 2 test2.csv
> "}~,)4",nɛgɛso,yq,,vélo,,,,,0630
> "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
> ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
When using the comma ',' character as separator, then the data in the
literally goes until the first comma:
"}~
"wjv(
ARm}q0bjb#
The --debug option shows that as well:
$ sort --debug -t "," -k 2 test2.csv
sort: text ordering performed using simple byte comparison
"}~,)4",nɛgɛso,yq,,vélo,,,,,0630
_______________________________
___________________________________
"wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
________________________________________________
______________________________________________________
ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
____________________________________________________
_______________________________________________________________
In that sense, it is not foreseen that the delimiter character is part of
the fields. Obviously, the data comes from a tool which tries to work around
this by enclosing such data with a " as string delimiter, but that is not what
sort(1) is expecting. And to my knowledge, there's no way to tell sort that
it should treat the delimiter character somewhere in there special.
Instead, you should use a delimiter character which does or can not occur in
the original data; e.g. a tab '\t', a semicolon ';', a hash '#', a '@' or
maybe even a non-printable character.
Alternatively, you could also enforce that all column values get quoted by
"....":
"}~,)4","nɛgɛso","yq","","vél","","","","","0630"
"wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens"
"ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630,
phrase"
and then use the decorate-sort-undecorate (DSU) approach by changing ","
(quote-comma-quote)
to a single and safe delimiter, sort by that delimiter, and then change the
delimiter back
to ",".
$ sed 's/","/\t/g' < test2.csv-all-columns-quoted \
| sort -t $'\t' -k2,2 \
| sed 's/\t/","/g'
"ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630,
phrase"
"wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens"
"}~,)4","nɛgɛso","yq","","vél","","","","","0630"
Still, this is only working under certain assumptions. E.g. how does it look
when the
original data has a "," inside? Or what if a field contains a newline
character?
An CSV export is not really optimal for this.
If you have the data e.g. in LibreOffice Calc, the best bet is maybe to use the
CSV export
with "fixed column width" (which is not really comma-separated anymore), and
then sort by
specific offset-numbers
}~,)4 nɛgɛso yq vélo
630
wjv(,4q_S}badenmuso cj soeur
0630, fam, gen
ARm}q0bjb#a be baara kɛbk il travai
0630, phrase
Then running sort with the DSU idiom:
$ tr '\n' '\0' < test-with-fixed-column-width.csv \
| sort -z -k 1.11,1.23 \
| tr '\0' '\n'
Of course, this still only works when none of the fields contain a newline
(which is a ambiguity
in the input, not in sort).
Have a nice day,
Berny
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- Re: Sorting csv with embedded commas problem,
Bernhard Voelker <=