bug-gawk

Re: Quotes being stripped by "--csv"


From: Ben Hoyt
Subject: Re: Quotes being stripped by "--csv"
Date: Mon, 20 Nov 2023 10:32:21 +1300

Hi Ed and Arnold,

Yes, this shows the difference between CSV for input mode and for output
mode. The --csv option is only about "CSV input mode", and doesn't affect
output, as you observe.

I think --csv unescaping doubled quotes and stripping the start and end
quotes (as it does now) is the correct behaviour, because those extra
quotes aren't part of the actual field value. So I think the behaviour you
propose with --csv=q would be an unhelpful patch over the problem: you
couldn't actually use the value in the field without first unescaping and
removing the quotes yourself.
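
To make that concrete, here is a minimal sketch of the unescaping a program
would have to do for itself if fields arrived with their CSV quoting intact.
The helper name csv_unquote and the file name unquote.awk are made up for
illustration; they are not part of any awk. Each input line stands in for one
raw, still-quoted field.

```shell
# Hypothetical helper: recover the real field value from raw CSV field text.
cat > unquote.awk <<'EOF'
function csv_unquote(s) {
  if (s ~ /^".*"$/) {                  # field is wrapped in quotes
    s = substr(s, 2, length(s) - 2)    # drop the surrounding quotes
    gsub(/""/, "\"", s)                # collapse doubled quotes
  }
  return s
}
{ print csv_unquote($0) }
EOF

# Two raw fields taken from the sample input in this thread.
printf '%s\n' '"foo,""bar"""' '"foo,bar"' | awk -f unquote.awk
```

Without a step like this, a comparison such as $2 == "foo,bar" would fail
against the raw text "foo,bar" with its quotes still attached.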

GoAWK and Frawk both had more extensive CSV support before awk and Gawk. (I
based GoAWK's approach on Frawk's.) We have separate CSV controls for input
mode and output mode. So you can say "goawk -i csv" for CSV input mode
("goawk --csv" is equivalent to that) but you can also say "goawk -o csv"
for CSV output mode, and of course "goawk -i csv -o csv" for CSV input mode
and output mode. Full GoAWK CSV docs here:
https://github.com/benhoyt/goawk/blob/master/docs/csv.md

Original awk and Gawk don't have the concept of CSV output mode. I suspect
they probably won't add it in a hurry. In the new second edition of "The
AWK Programming Language" by Kernighan et al, chapter 3 page 39 basically
says you should do the output part manually. To quote:

-----
By the way, generating CSV is straightforward. Here’s a function to_csv
that converts a string to a properly quoted string by doubling each quote
and surrounding the result with quotes. It’s an example of a function that
could go into a personal library.

# to_csv - convert s to proper "..."
function to_csv(s) {
  gsub(/"/, "\"\"", s)
  return "\"" s "\""
}

(Note how quotes are quoted with backslashes.)

We can use this function within a loop to insert commas between elements of
an array to create a properly formatted CSV record for an associative array,
or for an indexed array like the fields of a line, as illustrated in the
functions rec_to_csv and arr_to_csv:

# rec_to_csv - convert a record to csv
function rec_to_csv(    s, i) {
  for (i = 1; i < NF; i++)
    s = s to_csv($i) ","
  s = s to_csv($NF)
  return s
}

# arr_to_csv - convert an indexed array to csv
function arr_to_csv(arr,    s, i, n) {
  n = length(arr)
  for (i = 1; i <= n; i++)
    s = s to_csv(arr[i]) ","
  return substr(s, 1, length(s)-1) # remove trailing comma
}
-----
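
The book's to_csv quotes every field, so a plain 2 comes out as "2". As a
minimal sketch (not from the book), a variant can quote only the fields that
need it, using the same test Ed used in his loop. The names to_csv_min and
csvout.awk are made up for illustration. Tab-separated input stands in for
already-parsed fields here so the example runs in any POSIX awk.

```shell
# Hypothetical variant of to_csv: quote a field only when CSV requires it.
cat > csvout.awk <<'EOF'
function to_csv_min(s) {
  if (s ~ /[",\n\r]/) {        # contains a quote, comma, or line break
    gsub(/"/, "\"\"", s)       # double each embedded quote
    return "\"" s "\""         # wrap the result in quotes
  }
  return s                     # safe to emit as-is
}
{ print to_csv_min($1) "," to_csv_min($2) }
EOF

# Field values as they would look after the input quotes have been stripped.
printf 'foo,"bar"\t2\t3\n1\tfoo,bar\t3\n' | awk -F'\t' -f csvout.awk
```

With an awk that supports --csv, the same program file should be usable as
"awk --csv -f csvout.awk file.csv" without the -F option, on the assumption
that fields then arrive unquoted as described above.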

I don't love the lack of a built-in way to do this, hence the support for
"CSV output mode" in GoAWK. But it is what it is for now. I'd definitely be
interested to hear what Kernighan has to say.

Cheers,
Ben.

On Sun, 19 Nov 2023 at 21:37, <arnold@skeeve.com> wrote:

> Hi.
>
> I understand what you're saying. I don't have an answer at this point.
> I think it would be helpful for you to open an issue on the Github repo
> for Brian Kernighan's awk, as CSV handling was his idea. Maybe he can
> come up with something.
>
> In any case, opening an issue there will allow for wider discussion amongst
> AWK implementors.
>
> Thanks,
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
> > Someone posted a question on stackoverflow about how to print just the
> > first 2 fields from a CSV so given this input:
> >
> >     "foo,""bar""",2,3
> >     1,"foo,bar",3
> >     1,"foo,
> >     bar",3
> >
> > the expected output would be:
> >
> >     "foo,""bar""",2
> >     1,"foo,bar"
> >     1,"foo,
> >     bar"
> >
> > I thought I'd answer with "--csv" but when I tried it I got this output:
> >
> >     $ awk --csv -v OFS=',' '{print $1, $2}' file.csv
> >     foo,"bar",2
> >     1,foo,bar
> >     1,foo,
> >     bar
> >
> > The quotes around the fields that need to be quoted (and were quoted in
> > the input) are missing and the escaped double quotes (`""`) around the
> > first `bar` have become individual (`"`) so the output is no longer
> > valid CSV.
> >
> > I could get it back to valid CSV and produce the expected output by
> > writing this or similar:
> >
> >     $ awk --csv -v OFS=',' '{
> >         for (i=1; i<=NF; i++) {
> >             gsub(/"/,"\"\"",$i)
> >             if ($i ~ /[,\n"]/) { $i="\"" $i "\"" }
> >         }
> >         print $1, $2
> >     }' file.csv
> >     "foo,""bar""",2
> >     1,"foo,bar"
> >     1,"foo,
> >     bar"
> >
> > but that's counter-intuitive and frustrating to have to write and I
> > think many users wouldn't know how to, or understand why they need to,
> > write that code to get valid CSV output.
> >
> > I understand there is a benefit to stripping double quotes for working
> > on field contents and I appreciate that you need to make this work with
> > existing functionality (`OFS` values, etc.) so I understand why `--csv`
> > can't simply always output valid CSV and I also understand the "don't
> > provide constructs to do things that are easy to do with existing
> > constructs" awk mantra to avoid code bloat, but there has to be a way to
> > make it easier for people to just print a couple of fields from valid
> > CSV input and have the output still be valid CSV.
> >
> > If there was a way to have `--csv` optionally NOT strip double quotes
> > when reading the fields then that'd solve the problem, e.g. `--csv=q` or
> > `--csvq` or similar to indicate quotes in and around fields should be
> > retained. If we had that then I could write something like:
> >
> >      awk --csv=q -v OFS=',' '{print $1, $2}' file.csv
> >
> > or, less desirably as it's longer and can't be set on the command line
> > but would be better than nothing:
> >
> >      awk --csv -v OFS=',' 'BEGIN{PROCINFO["CSV"]="q"} {print $1, $2}' file.csv
> >
> > to get the desired output above and there are almost certainly other use
> > cases for people wanting to retain the quotes and there is no simple
> > alternative today (not using --csv but instead setting FPAT and counting
> > double quotes to know if a newline is inside or outside of a field, and
> > adding lines to $0 until you have a complete record).
> >
> > I don't think that would be hard for users to understand or result in
> > language bloat or introduce any additional complexity working with
> > existing constructs - you simply wouldn't strip quotes when reading the
> > input and so they'd still be there when producing output.
> >
> > Regards,
> >
> >      Ed.
>

