Re: Insertion of extra OFS character into output string
From: H
Subject: Re: Insertion of extra OFS character into output string
Date: Tue, 14 Mar 2023 15:09:56 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
On 03/14/2023 02:41 AM, david kerns wrote:
>
>
> On Mon, Mar 13, 2023 at 5:59 PM H <agents@meddatainc.com> wrote:
>
> On March 14, 2023 12:41:16 AM GMT+01:00, "Neil R. Ormos"
> <ormos-gnulists17@ormos.org> wrote:
> >H wrote:
> >
> >> I am a newcomer to awk and have run into an
> >> issue I have not figured out yet... My platform
> >> is CentOS 7 running awk 4.0.2, the default
> >> version.
> >
> >> The following awk statement generates an extra
> >> tab character between fields 1 and 2, regardless
> >> of the data in the file:
> >
> >> awk 'BEGIN{FS=","; FPAT="([^,]*)|(\"[^\"]+\")"; OFS="\t"} {$1=$1; gsub(/"/, ""); print}' somefile.csv
> >
> >> If I change the statement to:
> >
> >> awk 'BEGIN{FS=","; FPAT="([^,]*)|(\"[^\"]+\")"; OFS="\t"} {$2=$2; gsub(/"/, ""); print}' somefile.csv
> >
> >> an extra OFS character is inserted between
> >> fields two and three. I can add that removing
> >> the gsub() in either of the two examples does
> >> not affect the results.
> >
> >> Might this be a bug in 4.0.2 or a feature I have
> >> not yet understood?
> >
> >I don't have 4.0.2 available to test, but I tested with older and newer
> >versions.
> >
> >When I test, I get the result I think I expect from the code you
> >posted.
> >
> >Also, setting FPAT overrides the effect of having earlier set FS. (I
> >believe that the most-recently set one among FS, FPAT, and FIELDWIDTHS
> >controls the field splitting operation.)
> >
> >echo "1,2" | awk 'BEGIN{FS=","; FPAT="([^,]*)|(\"[^\"]+\")"; OFS="\t"}
> >{$1=$1; print}' | hexdump -c
> >0000000 1 \t 2 \n
> >0000004
> >
> >It would be easier to help if you would please provide:
> >
> > the simplest input line that reproduces the problem;
> >
> > the output you expect; and
> >
> > the output you are getting.
>
> I am not on my computer but typing this on my phone. With that caveat, a
> /minimal/ example would be:
> echo "Alpha,Beta,Charlie,Delta" | awk 'BEGIN{FS=",";
> FPAT="([^,]*)|(\"[^\"]+\")"; OFS="\t"} {$1=$1; gsub(/"/, ""); print}'
>
> I would expect to see:
> Alpha<TAB>Beta<TAB>Charlie<TAB>Delta
> but instead see
> Alpha<TAB><TAB>Beta<TAB>Charlie<TAB>Delta
>
> If you change $1=$1 to $2=$2 you will find that the extra tab character
> then moves to the next field.
>
> I believe I had also tried without the definition of FS with the same
> result.
>
> Finally, note that the FPAT expression comes from the awk documentation
> and is thus expected to work.
>
> Can anyone try this with the most recent version of awk?
>
>
> I think there is a bug here: (I fixed your FPAT, but that issue is unrelated
> to what you're reporting)
> $ cat somefile.csv
> 1,"this field, has a comma",3,4
> $ cat p11
> gawk 'BEGIN {
>     FPAT="[^,]*|[\"][^\"]+[\"]"
>     OFS="\t"
> }
> {
>     # if you comment the next line out, you will get the extra tab on output
>     for (i = 1; i <= NF; i++) x=$i
>     $1=$1;
>     gsub(/"/, "");
>     print
> }' somefile.csv
> $ bash p11 | xxd
> 0000000: 3109 7468 6973 2066 6965 6c64 2c20 6861 1.this field, ha
> 0000010: 7320 6120 636f 6d6d 6109 3309 340a s a comma.3.4.
>
> However, it does seem to be fixed in 5.2.1
>
>
Why the need to "fix" my FPAT? As I stated earlier, the FPAT I used is from the
awk documentation.
Also, it is better to keep this discussion on the mailing list where it
belongs; there is no need to pollute my personal email.
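
For comparison, here is a minimal baseline sketch that leaves FPAT out entirely and splits on FS alone, so it should behave the same in any POSIX awk. It reuses the sample line from the minimal example quoted above; the variable name `out` is only for illustration. If this prints single tabs while the FPAT version prints a doubled one, that would point at FPAT-based splitting rather than at `$1=$1` or OFS as such.

```shell
# Baseline without FPAT: split on commas via FS, rebuild the record with OFS.
# Assigning $1 to itself forces awk to rejoin the fields using OFS.
out=$(printf 'Alpha,Beta,Charlie,Delta\n' |
    awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print }')

# Dump the bytes so that each tab shows up as \t.
printf '%s\n' "$out" | od -c
```

Plain FS splitting does not handle quoted fields that contain commas, which is the case FPAT is meant for, so this is only a control and not a replacement.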