bug-gawk

Re: Quotes being stripped by "--csv"


From: Ed Morton
Subject: Re: Quotes being stripped by "--csv"
Date: Sun, 19 Nov 2023 07:04:30 -0600
User-agent: Mozilla Thunderbird

Done, thanks:

https://github.com/onetrueawk/awk/issues/215

On 11/19/2023 2:37 AM, arnold@skeeve.com wrote:
Hi.

I understand what you're saying. I don't have an answer at this point.
I think it would be helpful for you to open an issue on the Github repo
for Brian Kernighan's awk, as CSV handling was his idea. Maybe he can
come up with something.

In any case, opening an issue there will allow for wider discussion amongst
AWK implementors.

Thanks,

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

Someone posted a question on Stack Overflow about how to print just the
first two fields from a CSV file, so given this input:

     "foo,""bar""",2,3
     1,"foo,bar",3
     1,"foo,
     bar",3

the expected output would be:

     "foo,""bar""",2
     1,"foo,bar"
     1,"foo,
     bar"

I thought I'd answer with "--csv" but when I tried it I got this output:

     $ awk --csv -v OFS=',' '{print $1, $2}' file.csv
     foo,"bar",2
     1,foo,bar
     1,foo,
     bar

The quotes around the fields that need to be quoted (and were quoted in
the input) are missing, and the escaped double quotes (`""`) around the
first `bar` have each become a single `"`, so the output is no longer
valid CSV.

I could get it back to valid CSV and produce the expected output by
writing this or similar:

     $ awk --csv -v OFS=',' '{for (i=1; i<=NF; i++) {
           gsub(/"/,"\"\"",$i); if ($i ~ /[,\n"]/) { $i="\"" $i "\""} }
           print $1, $2}' file.csv
     "foo,""bar""",2
     1,"foo,bar"
     1,"foo,
     bar"

but that's counter-intuitive and frustrating to have to write and I
think many users wouldn't know how to, or understand why they need to,
write that code to get valid CSV output.

I understand there is a benefit to stripping double quotes when working
on field contents, and I appreciate that you need to make this work with
existing functionality (`OFS` values, etc.), so I understand why `--csv`
can't simply always output valid CSV. I also understand the "don't
provide constructs to do things that are easy to do with existing
constructs" awk mantra for avoiding code bloat. But there has to be a way
to make it easier for people to just print a couple of fields from valid
CSV input and have the output still be valid CSV.

If there were a way to have `--csv` optionally NOT strip double quotes
when reading the fields, that would solve the problem, e.g. `--csv=q` or
`--csvq` or similar to indicate that quotes in and around fields should
be retained. If we had that, I could write something like:

      awk --csv=q -v OFS=',' '{print $1, $2}' file.csv

or, less desirably as it's longer and can't be set on the command line
but would be better than nothing:

      awk --csv -v OFS=',' 'BEGIN{PROCINFO["CSV"]="q"} {print $1, $2}' file.csv

to get the desired output above. There are almost certainly other use
cases for retaining the quotes, and there is no simple alternative
today: without `--csv` you have to set FPAT, count double quotes to know
whether a newline is inside or outside of a field, and keep adding lines
to $0 until you have a complete record.
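
For illustration, the workaround described in the previous paragraph (no
`--csv`, count double quotes, join lines until the record is complete)
might look roughly like the sketch below. It rests on several
assumptions: it targets any POSIX awk, so it swaps gawk's FPAT for a
hand-written match() loop; the names `first_two_csv_fields` and
`split_csv` are invented for this example; and it assumes well-formed CSV
with LF line endings.

```shell
# Hypothetical helper: print the first two CSV fields with quotes retained.
first_two_csv_fields() {
    awk -v OFS=',' '
    # Split rec into f[1..n], keeping quotes exactly as they were read.
    function split_csv(rec, f,    n) {
        split("", f)             # clear any fields left from the last record
        n = 0
        rec = rec ","            # sentinel: every field now ends with a comma
        while (rec != "") {
            # Try a quoted field first, then fall back to an unquoted one.
            if (!match(rec, /^"([^"]|"")*",/))
                match(rec, /^[^,]*,/)
            f[++n] = substr(rec, 1, RLENGTH - 1)
            rec = substr(rec, RLENGTH + 1)
        }
        return n
    }
    {
        # Append this line to any partial record carried over.
        rec = (pending ? rec "\n" : "") $0
        tmp = rec
        # An odd number of double quotes means a quoted field is still open.
        pending = (gsub(/"/, "&", tmp) % 2)
        if (pending) next
        split_csv(rec, f)
        print f[1], f[2]
    }' "$@"
}
```

Calling `first_two_csv_fields file.csv` on the sample input should give
the expected output shown earlier, which also shows how much code a user
currently needs for what looks like a trivial task.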

I don't think that would be hard for users to understand or result in
language bloat or introduce any additional complexity working with
existing constructs - you simply wouldn't strip quotes when reading the
input and so they'd still be there when producing output.

Regards,

      Ed.

