bug-gawk

Quotes being stripped by "--csv"


From: Ed Morton
Subject: Quotes being stripped by "--csv"
Date: Fri, 17 Nov 2023 06:09:46 -0600
User-agent: Mozilla Thunderbird

Someone posted a question on stackoverflow about how to print just the first 2 fields from a CSV file, so that given this input:

   "foo,""bar""",2,3
   1,"foo,bar",3
   1,"foo,
   bar",3

the expected output would be:

   "foo,""bar""",2
   1,"foo,bar"
   1,"foo,
   bar"

I thought I'd answer with "--csv" but when I tried it I got this output:

   $ awk --csv -v OFS=',' '{print $1, $2}' file.csv
   foo,"bar",2
   1,foo,bar
   1,foo,
   bar

The quotes around the fields that need to be quoted (and were quoted in the input) are missing, and each escaped double quote (`""`) around the first `bar` has become a single `"`, so the output is no longer valid CSV.

I could get it back to valid CSV and produce the expected output by writing this or similar:

   $ awk --csv -v OFS=',' '{
         for (i=1; i<=NF; i++) {
             gsub(/"/,"\"\"",$i)
             if ($i ~ /[,\n"]/) { $i="\"" $i "\"" }
         }
         print $1, $2
     }' file.csv
   "foo,""bar""",2
   1,"foo,bar"
   1,"foo,
   bar"

but that's counter-intuitive and frustrating to have to write and I think many users wouldn't know how to, or understand why they need to, write that code to get valid CSV output.
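For what it's worth, the same workaround can be packaged as an awk function so the fields themselves are left unmodified. This is only a sketch: `csvq` and `first_two_requoted` are hypothetical names I made up for illustration, not anything gawk provides, and it assumes a gawk new enough to support `--csv` (5.3 or later).

```shell
# Sketch only: csvq() and first_two_requoted are hypothetical names, not part of gawk.
# Assumes a gawk with --csv support (5.3 or later).
first_two_requoted() {
    gawk --csv -v OFS=',' '
    # Return s as a valid CSV field: double any embedded quotes and wrap
    # the value in quotes only if it contains a comma, newline or quote.
    function csvq(s) {
        if (s ~ /[,\n"]/) {
            gsub(/"/, "\"\"", s)
            s = "\"" s "\""
        }
        return s
    }
    { print csvq($1), csvq($2) }
    ' "$@"
}
```

Running `first_two_requoted file.csv` should give the expected output for the sample input. Note that it re-creates quoting rather than preserving it, so a field that was quoted unnecessarily in the input would come out unquoted (still valid CSV, but not byte-for-byte identical).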

I understand there is a benefit to stripping double quotes when working on field contents, and I appreciate that you need to make this work with existing functionality (`OFS` values, etc.), so I understand why `--csv` can't simply always output valid CSV. I also understand the "don't provide constructs to do things that are easy to do with existing constructs" awk mantra for avoiding code bloat. But there has to be a way to make it easier for people to just print a couple of fields from valid CSV input and have the output still be valid CSV.

If there were a way to have `--csv` optionally NOT strip double quotes when reading the fields, that would solve the problem, e.g. `--csv=q` or `--csvq` or similar to indicate that quotes in and around fields should be retained. If we had that then I could write something like:

    awk --csv=q -v OFS=',' '{print $1, $2}' file.csv

or, less desirably as it's longer and can't be set on the command line but would be better than nothing:

    awk --csv -v OFS=',' 'BEGIN{PROCINFO["CSV"]="q"} {print $1, $2}' file.csv

to get the desired output above. There are almost certainly other use cases for retaining the quotes, and there is no simple alternative today: without `--csv` you have to set `FPAT`, count double quotes to know whether a newline is inside or outside a field, and keep adding lines to `$0` until you have a complete record.
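To show what that non-`--csv` alternative involves, here is a rough sketch of it. The function name `first_two_keep_quotes` is hypothetical, the `FPAT` value is my own variation on the usual CSV pattern, and it assumes well-formed input in which every quoted field is eventually closed.

```shell
# Sketch only: emulates multi-line CSV records without --csv.
# Needs gawk for FPAT; first_two_keep_quotes is a hypothetical name.
first_two_keep_quotes() {
    gawk -v OFS=',' -v FPAT='([^,]*)|("([^"]|"")*")' '
    {
        rec = $0
        # An odd number of double quotes means a quoted field is still open,
        # so keep appending input lines until the quotes balance.
        # gsub() with "&" leaves rec unchanged and returns the quote count.
        while ((gsub(/"/, "&", rec) % 2) && (getline line) > 0)
            rec = rec "\n" line
        $0 = rec          # re-split the completed record using FPAT
        print $1, $2      # fields still carry their original quotes
    }' "$@"
}
```

Called as `first_two_keep_quotes file.csv`, this should produce the expected output for the sample input, but it is a lot to ask of someone who just wants two columns.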

I don't think that would be hard for users to understand or result in language bloat or introduce any additional complexity working with existing constructs - you simply wouldn't strip quotes when reading the input and so they'd still be there when producing output.

Regards,

    Ed.

