bug-gawk

Re: CSV extension status


From: Ed Morton
Subject: Re: CSV extension status
Date: Tue, 25 May 2021 12:53:35 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2

Manuel - thanks for replying; my responses are inline below.

On 5/25/2021 3:09 AM, Manuel Collado wrote:
On 5/25/2021 at 4:14, Ed Morton wrote:
I see the conversation has continued at bug-gawk and Arnold had
suggested spinning it off into an email chain which, if it happened, I'm
not on. I see a lot of complexity being discussed in the thread that
just doesn't seem to be necessary. Is there any reason why the simple
"buildRec()" function I posted at
https://stackoverflow.com/a/45420607/1745001 (and which could be written
more concisely if I used gawk extensions) isn't all we'd need to parse
CSVs? No modes, no extra/ambiguous terminology - just reading a CSV into
fields by calling 1 function each time a record is read.

The goal is to allow beginner awk users to process CSV data as if they were regular awk records. No need to tamper with predefined variables like FS, OFS, NR, FPAT etc. Just put -i csvmode in the command line or add @include "csvmode" to the script.
Setting the field separators using FS and OFS is IMHO desired behavior, not tampering, as that'd be the intuitive way to specify which character separates the fields of the CSV. There's no need to do anything with FPAT, and I agree it'd be good to have NR work as it does today (my buildRec() function would undesirably have record numbers get out of sync with `NR` for CSV fields that contain newlines). I'm not advocating for using my script as the solution to CSV parsing, by the way, just using it as an example of something simple that gets the main job of separating a CSV into fields done without any need for configuration variables and explanation.

By using the CSVMODE library your example becomes:

$ cat decsv2.awk
{
    printf "Record %d:\n", NR
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}

$ gawk -i csvmode -f decsv2.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----

Please note that the modified decsv2.awk script is not CSV specific. It can be used unmodified to process regular awk records.
I don't think there's a use case for a single script that has to process both CSVs and non-CSVs, but if there were, it'd be easily handled by separating the input/output field identification from the control logic, so it's not useful for a CSV library for awk to do this. To me, a script that can handle both CSV and non-CSV data is in the same ballpark as a script that can handle both FS-separated and FPAT-matched data: it's almost never going to be wanted, and if it is, there are simple ways to write it using existing constructs.

Of course, different users have different needs and tastes. This is why the library in question attempts to satisfy as many users as possible by offering a rich set of configuration options.
IMHO a rich set of configuration options isn't necessary and just makes the usage much more complicated than it should be. All you need is:

FS = a character (usually , or tab or ;)
OFS = a character (usually same as FS)

If you WANTED to allow quote characters other than " and ways of escaping them other than doubling ("") then you could also have:

CSVQUOTE = a character (usually ", rarely ')
CSVESCAPE = a character that appears before a CSVQUOTE within a field to escape it (usually ", rarely \).

but that's it. Anything else the user needed to do (strip quotes from input fields, add quotes to output fields, replace newlines in fields with spaces, etc., etc.) is all easy for them to do in their code.


Even more: the library allows fields and records to be modified in the usual way. For instance, to add a new field "val" at position "pos":

  if (pos>NF) {$pos = val} else {$pos = val OFS $pos}; $0 = $0;
Good, that's as it should be.

And this code works transparently for both CSV data and regular text data.
Again, I just don't see why that's useful; if it adds even the tiniest bit of complexity, additional configuration, or performance overhead, then IMHO it shouldn't be done.

    Ed.

Regards.



