
From: Michael Klement
Subject: bug#26574: v4.4: POSIX violation with respect to output of a trailing newline, even with --posix
Date: Thu, 20 Apr 2017 12:36:03 -0400

Thanks for the detailed feedback, Eric.

The POSIX spec. is, unfortunately, vague on this topic:

The definition of a line (which you quote) is complemented with the definition 
of an incomplete line 
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_195>:

> A sequence of one or more non- <newline> characters at the end of the file.


So the standard is aware of this possibility and gives it a name suggesting 
a kind of line with something missing, yet it prescribes precious little 
behavior with respect to such incomplete lines.

So we have:

sed's "input files shall be text files."
a text file contains "characters organized into zero or more lines"

Beyond the "zero or more lines", the only restrictions placed on what 
constitutes a text file 
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403>
 are:
"The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes 
in length, including the <newline> character."

If you interpret the word "lines" in the phrase "zero or more lines" to mean 
complete lines only (which is reasonable), then indeed any file that ends in an 
incomplete line is not a text file.
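
Under that strict reading, whether a file qualifies comes down to its last 
byte. A minimal sketch of such a check follows; the function name 
ends_in_complete_line is made up for illustration, and it tests only the 
trailing-newline condition, not the NUL or {LINE_MAX} restrictions:

```sh
# Hypothetical helper: succeed if the file is empty (zero lines) or its
# last byte is a <newline>, i.e. it does not end in an incomplete line.
ends_in_complete_line() {
    # od prints the last byte in hex; tr strips od's padding and newline.
    last=$(tail -c 1 -- "$1" | od -An -tx1 | tr -d ' \n')
    [ -z "$last" ] || [ "$last" = "0a" ]
}

# Demo on two throwaway files (mktemp is assumed to be available).
tmpdir=$(mktemp -d)
printf 'one\ntwo\n' > "$tmpdir/complete"
printf 'one\ntwo'   > "$tmpdir/incomplete"

if ends_in_complete_line "$tmpdir/complete"; then
    echo "complete: only complete lines"
fi
if ! ends_in_complete_line "$tmpdir/incomplete"; then
    echo "incomplete: ends in an incomplete line"
fi
rm -rf "$tmpdir"
```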

I really wish the spec. were more explicit about incomplete lines.

>   If anything, the only
> change I would make is have 'sed --posix' error out on non-text input,
> to call attention to the user's attempt to feed non-posix-compliant data
> to sed.


That is definitely an option, but perhaps intuitive understanding and 
historical practice / other implementations could be considered instead:

Intuitively, a file containing text with an incomplete line is obviously still 
a text file - just one that has no trailing \n, so treating incomplete lines 
(mostly) like lines makes sense.
In practice, most utilities still read the incomplete line - the shell's read 
builtin being a notable exception.
wc is an interesting case: it doesn't count an incomplete line as a line 
(the spec <http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html> 
is actually unambiguous there and mandates counting the newlines), yet it 
still counts the incomplete line's words and characters/bytes.
BSD/macOS Sed is a mostly POSIX-features-only implementation, and it always 
appends a trailing \n, even when encountering an incomplete line. (On the flip 
side, that makes it fundamentally unsuited to operating on binary files - 
unlike GNU Sed).
I'm not sure about other implementations (or even if there are any that still 
matter today).
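
The read and wc behavior described above can be reproduced with a two-line 
input whose second line is incomplete. This is only an illustrative sketch; 
the helper name sample is made up, and the tr calls merely strip the padding 
that some wc implementations add:

```sh
# Input: one complete line, then an incomplete line (no trailing newline).
sample() { printf 'one\ntwo'; }

# A plain read loop drops the incomplete line: read returns non-zero at EOF.
sample | while IFS= read -r line; do printf '[%s]\n' "$line"; done
# prints: [one]

# read still stores the partial data, so testing "$line" recovers it.
sample | while IFS= read -r line || [ -n "$line" ]; do
    printf '[%s]\n' "$line"
done
# prints: [one] and then [two]

# wc -l counts <newline> characters; words and bytes still include "two".
sample | wc -l | tr -d ' '   # 1
sample | wc -w | tr -d ' '   # 2
sample | wc -c | tr -d ' '   # 7
```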

So, as a compromise, GNU sed --posix could treat files with an incomplete line 
as text files, as long as the incomplete line contains no NULs and is at most 
{LINE_MAX} - 1 bytes long (the limit reported by getconf LINE_MAX, minus the 
missing <newline>).
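
To make the proposed rule concrete, here is a rough sketch of such a check 
as a shell function. The name acceptable_last_line is invented for this 
example, it inspects only the final line, and the fallback to 2048 when 
getconf is unavailable is an assumption; it says nothing about how sed 
itself would implement the rule:

```sh
# Hypothetical check for the proposed compromise: the final line, complete
# or not, must contain no NUL bytes and at most LINE_MAX - 1 non-newline bytes.
acceptable_last_line() {
    # Assumption: fall back to 2048 if getconf is missing.
    max=$(getconf LINE_MAX 2>/dev/null || echo 2048)
    # tail -n 1 yields the last line whether or not it ends in a newline;
    # the newline (if any) is dropped before counting bytes.
    len=$(tail -n 1 -- "$1" | tr -d '\n' | wc -c | tr -d ' ')
    # Keep only NUL bytes and count them.
    nuls=$(tail -n 1 -- "$1" | tr -cd '\000' | wc -c | tr -d ' ')
    [ "$nuls" -eq 0 ] && [ "$len" -le $((max - 1)) ]
}

# Demo on a throwaway file (mktemp is assumed to be available).
tmp=$(mktemp)
printf 'one\ntwo' > "$tmp"
if acceptable_last_line "$tmp"; then
    echo "incomplete last line would be accepted"
fi
rm -f "$tmp"
```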

Maybe the issue at hand is rarely of concern in the real world, but I've 
stumbled over it on several occasions when writing portable Sed commands (at 
least portable between Linux and macOS).
This issue and the infamous -i option incompatibility (which probably will 
never go away) are what get in the way of writing such commands.
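
For what it's worth, one way to sidestep both differences in a script is to 
add the missing final newline before sed sees the input, and to edit in 
place via a temporary file instead of -i. The following is a sketch only; 
with_final_newline and sed_inplace are made-up helper names, and mktemp is 
assumed to be available even though it is not specified by POSIX:

```sh
# Hypothetical helper: emit the file, adding a final <newline> only if it
# is missing, so that GNU and BSD sed both receive complete lines.
with_final_newline() {
    cat -- "$1"
    last=$(tail -c 1 -- "$1" | od -An -tx1 | tr -d ' \n')
    if [ -n "$last" ] && [ "$last" != "0a" ]; then
        printf '\n'
    fi
}

# Hypothetical helper: in-place edit without relying on -i.
# Usage: sed_inplace FILE SED_ARGS...
sed_inplace() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    if with_final_newline "$file" | sed "$@" > "$tmp"; then
        # Overwrite the contents rather than renaming, which keeps the
        # original file's permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A call such as sed_inplace notes.txt 's/foo/bar/' would then behave the 
same with either implementation, at the cost of always writing a trailing 
newline.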

Thanks,

Michael






> On Apr 20, 2017, at 6:42 AM, Eric Blake <address@hidden> wrote:
> 
> tag 26574 notabug
> thanks
> 
> On 04/19/2017 08:43 PM, Michael Klement wrote:
>> $ sed --version
>> sed (GNU sed) 4.4
>> 
>> The POSIX spec. 
>> <http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html> states:
>> "Whenever the pattern space is written to standard output or a named file, 
>> sed shall immediately follow it with a <newline>."
>> 
>> While GNU Sed's default behavior of preserving the trailing-newline status 
>> of the input's last line is defensible and can be helpful,
>> it should exhibit POSIX-compliant behavior when invoked with --posix.
> 
> POSIX also requires that input given to sed be text files:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
> "The input files shall be text files."
> 
> And per the definition of text file, ALL input lines must have a
> trailing newline in the first place:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
> "3.403 Text File
> A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline> character. Although POSIX.1-2008 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections."
> 
> "3.206 Line
> A sequence of zero or more non- <newline> characters plus a terminating
> <newline> character."
> 
> Input that does NOT end in a trailing newline is NOT a text file, and
> therefore is NOT a POSIX-compliant use of sed, and therefore, sed
> --posix need not do anything different with it because you are already
> outside the bounds of what POSIX requires.
> 
> Therefore, I don't think you have a case for changing any behavior, at
> least not on the grounds of appealing to POSIX, so I'm marking this as
> not a bug, but feel free to continue discussion.  If anything, the only
> change I would make is have 'sed --posix' error out on non-text input,
> to call attention to the user's attempt to feed non-posix-compliant data
> to sed.
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 


