Re: CSV extension status
From: Ed Morton
Subject: Re: CSV extension status
Date: Tue, 25 May 2021 12:53:35 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2
Manuel - thanks for replying; my responses are inline below.
On 5/25/2021 3:09 AM, Manuel Collado wrote:
On 25/05/2021 at 4:14, Ed Morton wrote:
I see the conversation has continued at bug-gawk and Arnold had
suggested spinning it off into an email chain which, if it happened, I'm
not on. I see a lot of complexity being discussed in the thread that
just doesn't seem to be necessary. Is there any reason why the simple
"buildRec()" function I posted at
https://stackoverflow.com/a/45420607/1745001 (and which could be written
more concisely if I used gawk extensions) isn't all we'd need to parse
CSVs? No modes, no extra/ambiguous terminology - just reading a CSV into
fields by calling 1 function each time a record is read.
The goal is to allow beginner awk users to process CSV data as if they
were regular awk records. No need to tamper with predefined variables
like FS, OFS, NR, FPAT etc. Just put -i csvmode in the command line or
add @include "csvmode" to the script.
Setting the field separators using FS and OFS is IMHO desired behavior,
not tampering, as that'd be the intuitive way to specify which character
separates the fields of the CSV. There's no need to do anything with
FPAT, and I agree it'd be good to have NR work as it does today (my
buildRec() function would undesirably have record numbers get out of
sync with `NR` for CSV fields that contain newlines). I'm not advocating
for using my script as the solution to CSV parsing, by the way, just
using it as an example of something simple that gets the main job of
separating a CSV into fields done without any need for configuration
variables and explanation.
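As a rough illustration of the "one function per record" idea, here is a minimal sketch in portable awk. It is not the buildRec() from the linked answer: the names split_csv and splitCsv are hypothetical, quotes are left on the fields, and a quoted field that contains a newline is not handled (a fuller version would have to read continuation lines, which is presumably where the NR drift mentioned above comes from).

```shell
split_csv() {
    awk '
    # Hypothetical helper, not the buildRec() from the linked answer.
    # Split one CSV record "rec" into flds[1..n] on "sep" and return n.
    # Enclosing quotes and doubled quotes are kept exactly as read.
    function splitCsv(rec, flds, sep,    n, c, i, inq, cur) {
        n = 0; cur = ""; inq = 0
        for (i = 1; i <= length(rec); i++) {
            c = substr(rec, i, 1)
            if (c == "\"") { inq = !inq; cur = cur c }       # toggle quoted state
            else if (c == sep && !inq) { flds[++n] = cur; cur = "" }
            else cur = cur c
        }
        flds[++n] = cur                                      # last field
        return n
    }
    {
        n = splitCsv($0, f, ",")
        for (i = 1; i <= n; i++) printf "$%d=<%s>\n", i, f[i]
    }'
}

# Usage: a quoted comma, an empty field, and doubled quotes inside a field
printf '%s\n' '"a, b",,"c ""d"" e",f' | split_csv
```

Toggling the quoted state on every quote character is enough for well-formed input, because a doubled quote inside a quoted field toggles twice and leaves the state unchanged.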
By using the CSVMODE library your example becomes:
$ cat decsv2.awk
{
    printf "Record %d:\n", NR
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf " $%d=<%s>\n", i, $i
    }
    print "----"
}
$ gawk -icsvmode-1 -f decsv2 file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Please note that the modified decsv2.awk script is not CSV specific.
It can be used unmodified to process regular awk records.
I don't think there's a use case for a single script that has to process
both CSVs and non-CSVs, but if there were, it's easily done by separating
the input/output field identification from the control logic, so a CSV
library for awk that does this isn't useful. To me, a script that can
handle both CSV and non-CSV data is in the same ballpark as a script
that can handle both FS-separated and FPAT-matched data: it's almost
never going to be wanted, but if it is, there are simple ways to write it
using existing constructs.
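One way to read "separating the field identification from the control logic" is sketched below, under stated assumptions: show_fields, getFields, and the fmt variable are hypothetical names, and the CSV branch is deliberately naive (it splits on every comma and ignores quoting). Only getFields knows about the input format; the main rule is the same for both.

```shell
show_fields() {
    # $1 = "csv" for comma-separated input, anything else for whitespace-separated
    awk -v fmt="$1" '
    # Field identification: the only format-specific part (hypothetical helper).
    function getFields(rec, flds) {
        if (fmt == "csv") return split(rec, flds, ",")   # naive: no quoted fields
        return split(rec, flds, " ")                     # default whitespace splitting
    }
    # Control logic: identical for both formats.
    {
        n = getFields($0, f)
        printf "%d fields, last=<%s>\n", n, f[n]
    }'
}

# Usage: the same line interpreted both ways
printf 'a,b c,d\n' | show_fields csv
printf 'a,b c,d\n' | show_fields text
```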
Of course, different users have different needs and tastes. This is why
the library in question attempts to satisfy as many users as possible,
by offering a rich set of configuration options.
IMHO a rich set of configuration options isn't necessary and just makes
the usage much more complicated than it should be. All you need is:
FS = a character (usually , or tab or ;)
OFS = a character (usually same as FS)
If you WANTED to allow other quotes than " and methods of escaping them
other than as "" then you could also have:
CSVQUOTE = a character (usually ", rarely ')
CSVESCAPE = a character that appears before a CSVQUOTE within a field to
escape it (usually ", rarely \).
but that's it. Anything else the user needed to do (strip quotes from
input fields, add quotes to output fields, replace newlines in fields
with spaces, etc., etc.) is all easy for them to do in their code.
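To give a sense of how little user code the quote handling needs, here is a minimal sketch in portable awk. The helpers unquote and quote (and the wrapper csv_requote) are hypothetical names, not part of gawk or of any CSV library, and they assume " as the quote character with "" as its escape. Each input line is treated as a single field for demonstration purposes.

```shell
csv_requote() {
    awk '
    # Hypothetical helpers, assuming the quote character is " and its escape is "".
    # Strip the enclosing quotes from a field and collapse "" to ".
    function unquote(s) {
        if (s ~ /^".*"$/) {
            s = substr(s, 2, length(s) - 2)
            gsub(/""/, "\"", s)
        }
        return s
    }
    # Quote a field only when it contains the separator, a quote, or a newline.
    function quote(s, sep) {
        if (index(s, sep) || index(s, "\"") || index(s, "\n")) {
            gsub(/"/, "\"\"", s)
            s = "\"" s "\""
        }
        return s
    }
    {
        v = unquote($0)
        printf "<%s> <%s>\n", v, quote(v, ",")
    }'
}

# Usage: prints the unquoted value, then the value re-quoted for output
printf '%s\n' '"say ""hi"", ok"' | csv_requote
```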
What's more, the library lets you modify fields and records in the usual
way. For instance, to add a new field "val" at position "pos":
if (pos>NF) {$pos = val} else {$pos = val OFS $pos}; $0 = $0;
Good, that's as it should be.
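The field-insertion idiom quoted above can be tried on ordinary whitespace-separated input with plain awk, no library involved. In this small sketch the wrapper name insert_field is hypothetical, and pos and val are passed in with -v. The trailing `$0 = $0` forces awk to re-split the record so that NF and the later fields are renumbered.

```shell
insert_field() {
    # $1 = position of the new field, $2 = its value
    awk -v pos="$1" -v val="$2" '
    {
        if (pos > NF) { $pos = val } else { $pos = val OFS $pos }
        $0 = $0                  # re-split so NF and later fields are renumbered
        print NF ": " $0
    }'
}

# Usage: insert X as the new second field of a three-field record
printf 'a b c\n' | insert_field 2 X
```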
And this code works transparently for both CSV data and regular text
data.
Again, I just don't see why that's useful and if it adds even the
tiniest bit of complexity, or additional configuration, or performance
overhead then IMHO it shouldn't be done.
Ed.
Regards.