bug-gawk

Re: manual section 4.7.1


From: Ed Morton
Subject: Re: manual section 4.7.1
Date: Tue, 4 Apr 2023 10:32:49 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

Are you sure in the FPAT output you're not just seeing the expected effects of there being a CR in your data? The `--csv` output is the one that looks wrong to me if you have `CR`s at the end of each line, unless `--csv` is documented to strip `CR`s from the output.

Please provide the input file you used, as it's hard to tell what's going on from just the output. Also pipe the output to `cat -v` or `od -c` or similar so we can see where the CRs are in the output. My best guess right now is that `FPAT` is retaining the CRs as expected while `--csv` is stripping them (which may or may not be expected - I'm not familiar with that option).
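
For illustration only, here is a minimal sketch of that kind of check. It uses a made-up two-record sample rather than your real file, and `count_cr_lines` is just a hypothetical helper name:

```shell
# Count the records that end in a carriage return. With the default
# RS of "\n", a CR LF line ending leaves the CR at the end of $0.
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}

# Hypothetical two-record sample with CR LF line endings.
# od -c shows every byte, so each CR appears as \r just before \n.
printf 'ID,CASRN\r\n1,50-00-0\r\n' | od -c

# Prints 2 for this sample: both records end in a CR.
printf 'ID,CASRN\r\n1,50-00-0\r\n' | count_cr_lines
```

Running the same two commands on the real input and on the output of each gawk invocation should show whether the CRs survive.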

    Ed.

On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
The regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV file's
records end in both CR and NL [0x0D 0x0A]. I believe this is a common feature
of Windows files.
A simple fix, however, is to use the gawk --csv option.

❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
Note here that the last '>' is the first character on the next line.

output using the --csv option:
❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
<1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>

much better :-)

❯ cat print-fields.awk
{
     print "<" $0 ">"
     printf("NF = %s ", NF)
     for (i = 1; i <= NF; i++) {
         printf("<%s>", $i)
     }
     print ""
}


from section 4.7.1:
BEGIN {
      fp[0] = "([^,]+)|(\"[^\"]+\")"
      fp[1] = "([^,]*)|(\"[^\"]+\")"
      fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
      FPAT = fp[fpat+0]
}



kind regards,

cph1968


