bug-gawk

Re: Quotes being stripped by "--csv"


From: Neil R. Ormos
Subject: Re: Quotes being stripped by "--csv"
Date: Sun, 26 Nov 2023 12:58:40 -0600 (CST)
User-agent: Alpine 2.20 (DEB 67 2015-01-07)

Ed Morton wrote:

> [...] but not all CSV-processing applications require
> modifying fields and not all applications that do modify
> fields are allowed to produce output with different quotes
> than the input had even if they have to strip those quotes
> temporarily while modifying the fields.

> I get CSVs from multiple sources and need to
> compare/manipulate them and return them to those sources
> or send to other destinations that would otherwise receive
> the original exported CSV. Some of those CSVs are exported
> from Excel or other Windows tools, some are exported from
> various applications that run on various web sites, some
> are created by various Unix tools that have evolved over
> the years. I see various quoting styles/rules applied
> across those CSVs - quote only when needed, quote all
> fields, quote all strings but do not quote numbers, quote
> only specific columns, quote the data rows but not the
> header row, etc., etc. [...]

> [...] but people have been writing tools to parse various
> subsets of CSVs with various subsets of allowed/required
> quoting for 50+ years and CSVs are used in many varied
> applications with no 1 common standard they all follow,
> despite the existence of RFC4180, so I expect I'm not
> alone in having a need for CSV parsing that simply doesn't
> strip quotes.

I've had many use cases that are in a category similar to what Ed describes.  
The producer or the ultimate consumer of the CSV file exhibits idiosyncratic 
CSV-handling behavior, that behavior cannot be changed, the full extent of the 
idiosyncrasy is unknown or tedious to duplicate, and the practical requirement 
is that the output from awk shall be identical to the input except for specific 
intended changes.

The optional behavior Ed requested, where the fields of CSV input records are 
separated but otherwise unmolested, would simplify handling of these use cases, 
e.g., in one-liners and similar small-scale scripts.  Because I was 
processing CSV files long before the --csv option was added, I already have 
ways of dealing with these situations.  But each user newly confronted with 
these use cases would have to analyze the problem and craft a new solution.  
The optional non-stripping --csv behavior would avoid that duplicative effort 
for many potential users, while advantageously providing a standard, 
easy-to-use facility that exhibits performance and behavior consistent with 
gawk's conventional field and record processing.

I do not seek to relitigate Arnold's decision; he has to weigh and reconcile 
many competing considerations, of which utility is but one.  This post is 
offered simply to balance the record in view of doubts expressed earlier as to 
whether the requested optional --csv behavior would be "useful".  Although Ed 
took the initiative to make and advocate for the request, he is not the only 
one confronted with this category of CSV-handling problem, and the optional 
behavior Ed requested would indeed be useful to others.


