bug-sed

bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Dick Dunbar
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 12:30:29 -0700

Hi Assaf and Eric,
Thanks for your remarks.  Very thoughtful and helpful.

1. I hadn't realized sed had a -z option.  Here's how I used it:
   find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"
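
   (Just to illustrate what that produces -- the file names below are made
   up -- piping the result through od shows the trailing NUL from -print0 is
   still there after each appended newline, even though it looks fine on a
   terminal:)

    $ touch a 'b c'
    $ find . -type f -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/" | od -An -c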

2. Rather than fighting with sed behaviour, it's just easier to use Eric's
    suggestion to strip the \r in a separate stage.  But this doesn't do
    that.  It replaces \r with a null character followed by \n

    $ cat t.out | tr -d '\r' | od -xc
     0000000    3a43    535c    6163    5c6e    2069    322e    000a
                   C   :   \   S   c   a   n   \   i       .   2  \n

    And running a stream through d2u will cause the entire pipe to
    stall until eof on the stream.
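
    (A per-line alternative that strips only a CR at end-of-line -- just a
    sketch, assuming GNU sed so that \r is understood -- and, unlike d2u,
    it works line by line rather than waiting for eof:)

    $ cat t.out | sed -e 's/\r$//' | od -xc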

3. Eric: the discussion of binary file open confused me.
    Does sed default to binary open?  How would you suggest I
    fix this in user-land?

I don't really understand what  'info sed' is saying because sed
can operate on a stream -or- a file.  It's not just mixing Win programs
and cygwin programs that causes problems.  It is very common to
get files from multiple platforms.

Editing a 'sh' script with notepad will definitely ruin your day:

#!/bin/bash \r\n

vim identifies an edited file as "dos" if it encounters a \r\n line ending.
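
If a script has already been saved with \r\n endings, one way to repair it in
place (a sketch, assuming GNU sed's -i; the script name is just an example):

    $ sed -i 's/\r$//' myscript.sh
    $ file myscript.sh    # should no longer report "with CRLF line terminators"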

'-b'
'--binary'
     This option is available on every platform, but is only effective
     where the operating system makes a distinction between text files
     and binary files.  When such a distinction is made--as is the case
     for MS-DOS, Windows, Cygwin--text files are composed of lines
     separated by a carriage return _and_ a line feed character, and
     'sed' does not see the ending CR. When this option is specified,
     'sed' will open input files in binary mode, thus not requesting
     this special processing and considering lines to end at a line feed.

Throughout the sed documentation, '\n' is called "new line".
In the --binary description it is correctly called "line feed" (as distinct
from CR+LF).

https://en.wikipedia.org/wiki/Newline
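
A quick way to check whether the sed you are running actually receives the
CR (a sketch; on GNU/Linux the \r is always there, while on Cygwin, per the
--binary text above, it may be stripped unless -b is given):

    $ printf 'abc\r\n' | sed -n l
    abc\r$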

Several years ago, I switched from the O'Reilly sed/awk book to this one,
which I like: http://www.thegeekstuff.com/sed-awk-101-hacks-ebook

I'm not sure I would have ever picked up on this cygwin change
by reading release notes or info sed.

It doesn't hurt until it bites you.

-Cheers guys;  thanks for being friendly



On Thu, May 11, 2017 at 7:05 PM, Assaf Gordon <address@hidden> wrote:

> Hello,
>
> > On May 11, 2017, at 18:39, Dick Dunbar <address@hidden> wrote:
> >
> > To round out this discussion:
> > I wanted a simple filter to ensure filename paths didn't contain spaces.
>
> There's a nuance here to verify:
> Did you want a filter to ensure none of your files have spaces (e.g. detect
> if some files do have spaces and then fail),
> or
> Did you want a robust way to use the 'mv' command (as below), even in
> the case of files with spaces?
>
> If you just wanted to detect files with spaces,
> something like this would work:
>     find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces
>
> If you wanted to print files that have spaces, something like this would
> work:
>     find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
>
> > For example:
> >   find /foo -maxdepth 1 -atime +366 -print0 |
> >      xargs -0 sh -c 'mv "$@" /archive' move
>
> I'm not sure what the purpose of 'move' is in the above command.
> But if you wanted to move all the found files to the directory /archive,
> even if they have spaces in them, a more efficient way would be:
>
>     find /foo -maxdepth 1 -atime +366 -print0 | \
>        xargs -0 --no-run-if-empty mv -t /archive
>
> This GNU extension (-t DEST) works great with find/xargs,
> as xargs by default adds the parameters at the end of the command line,
> and "-t" means the destination directory is specified first.
>
> > So why are there different flags to indicate null-terminated lines?
> >   find -print0
> >   xargs -0
> >   sed  -z
> >
> > Seems silly.  To make a non-breaking-code-change,
> > why not add "-z" to the find and xargs command so they are compatible?
>
> Putting aside the naming convention for a moment (remember that each
> program is developed by different people) - I'll focus on find/xargs,
> which are part of the same package (findutils) and developed by the
> same people.
>
> These two are designed to work closely together - that's why they have
> "-print0" and "-0".
>
> The whole point of the following construct:
>
>    find [criteria] -print0 | xargs -0 ANY-PROGRAM
>
> is that 'ANY-PROGRAM' doesn't need to understand NUL line-endings at all.
> The main reason find and xargs need the NULs is to ensure file names are
> not broken by whitespace or even newlines. But once xargs reads the entire
> filename, it passes each filename as a single parameter to ANY-PROGRAM, so
> there's no need to worry any more about filenames with whitespace.
>
> This useful construct breaks down if ANY-PROGRAM is 'sh', which might
> do further parameter splitting based on whitespace.
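>
> (If 'sh -c' really is needed, the re-splitting can be avoided by passing the
> filenames as positional parameters and expanding them with a quoted "$@",
> as your original command did; a sketch of the fragile vs. safe forms:)
>
>     # fragile: the unquoted $@ is re-split on whitespace by the inner shell
>     find . -type f -print0 | xargs -0 sh -c 'mv $@ /archive' sh
>     # safe: "$@" keeps each filename as a single argument
>     find . -type f -print0 | xargs -0 sh -c 'mv "$@" /archive' sh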
>
> > And ... because we're dealing with the same issue of executables
> > creating stream data, why doesn't sed/awk/grep have an option
> > to deal with null delimited lines such that "$" would find them.
>
> I'm not sure I understand: sed and grep have "-z" exactly for this purpose.
> (also: sort -z, perl -0).
> gawk has a slightly different syntax, where you simply set the RS (input
> record separator) to NUL:
>
>    find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'
>
> But remember that when you use 'sed -z', the output also uses NULs as
> line-terminators, so it won't look good on the terminal or in a file.
>
> > Having sed recognize \r, \n, \0 as end of line might cause some
> > breakage if you have to deal with data that has embedded nulls.
>
> Instead of thinking in general about "data that has embedded nulls",
> it'll be easier to consider concrete cases.
> Text files do not have embedded NULs (by definition, otherwise they are
> not text files). So standard text programs (sed/grep/awk) do not need to
> deal with NULs as line separators.
>
> The main use case of having NUL as line separator is precisely with
> "find -print0".
> In this case, either use "xargs -0" and then the actual program doesn't
> need to worry about NULs at all, or use the GNU extensions (e.g. 'sed -z'
> or 'grep -z').
>
> > Had to check:
> > find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
> >
> > Doesn't work.  One very long string of null terminated filenames is
> > returned.
>
> It works perfectly:
> 1. sed without -z treats newlines (\n) as line terminators.
> 2. 'find -print0' did not generate '\n' character at all.
> 3. 'sed' read the entire input (i.e. all files separated by NULs),
>    treated it as one line, and added quotes at the beginning and the end
>    of the entire buffer.
> 4. NULs were kept as-is, and are printed on your terminal.
>
> Example:
>     $ touch a b 'c d'
>     $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>       27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
>        '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '
>
> > So we now know that sed does not check for \0 as a line terminator.
> > And the sed -z flag produces the same long string.
> >
> > find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
>
> It also produces the correct output:
> This time, because of the '-z', sed indeed reads each filename until
> the NUL, and adds quotes around each file.
> But it also uses NULs as line terminators on the OUTPUT,
> so newline characters are not used at all.
> Notice that each file is surrounded by quotes, exactly as you've asked:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>     27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
>      '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
>     20  64  27  00
>          d   '  \0
>
> The missing piece is that after you've processed each file using 'sed -z',
> if you want to print them to the terminal, you still need to convert NULs
> to newlines:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
>   './a'
>   './b'
>   './c d'
>
> Or, if you wanted to use sed/grep as an intermediate filter between
> 'find' and 'xargs', then something like this:
>
>   find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
>   find [criteria] -print0 | sed -z [SCRIPT] | xargs -0 ANYPROGRAM
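>
> (A concrete instance of that pattern -- the '.log' suffix is just an
> example -- keeping only the file names that end in '.log' and listing them:)
>
>   find . -type f -print0 | grep -z '\.log$' | xargs -0 ls -l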
>
>
> In most of my examples above, whitespace doesn't actually cause problems -
> because sed/grep will not be confused by whitespace and won't break the
> line (it is mostly shell argument parsing that will get terribly confused
> by whitespace, and also "xargs" with certain parameters).
>
> The real 'kick' is that using NULs allows handling files that have
> embedded newlines.
>
> Consider the following:
>
>   $ touch a b 'c d' "$(printf 'e\nf')"
>   $ ls -log
>   total 0
>   -rw-r--r-- 1 0 May 12 01:43 a
>   -rw-r--r-- 1 0 May 12 01:43 b
>   -rw-r--r-- 1 0 May 12 01:43 c d
>   -rw-r--r-- 1 0 May 12 01:43 e?f
>
> The last file has an embedded newline, which will mess up the plain 'find | xargs' pipeline:
>
>   ## incorrect output: the 'e\nf' file is broken, 'echo' is executed
>   ## wrong number of times with non-existing file names:
>   $ find -type f | xargs -I% echo ==%==
>   ==./e==
>   ==f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> Using 'xargs -0' will solve it. This output is correct, but perhaps
> confusing when displayed on the terminal:
>
>   $ find -type f -print0 | xargs -0 -I% echo ==%==
>   ==./e
>   f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> And similarly with 'sed -z':
>
>   $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
>   <<<./e
>   f>>>
>   <<<./a>>>
>   <<<./b>>>
>   <<<./c d>>>
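>
> (As an aside, the same NUL trick also works with a plain shell loop, if you
> prefer that to xargs -- a sketch assuming bash:)
>
>   while IFS= read -r -d '' f; do
>       printf 'file = %s\n' "$f"   # handles spaces and embedded newlines
>   done < <(find . -type f -print0)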
>
>
>
> One last tip:
> Sometimes you want to find and operate on files based on their content
> instead of their attributes (e.g. with 'grep').
>
> Here too, a file with spaces or newlines will cause trouble:
>
>   $ echo yes  > "$(printf 'hello\nworld')"
>   $ ls -log
>   total 4
>   -rw-r--r-- 1 4 May 12 01:57 hello?world
>
> If you wanted to find all files containing 'yes',
> grep alone would print a confusing output:
>
>   $ grep -l yes *
>   hello
>   world
>
> And using it with "xargs" will fail:
>
>   $ grep -l yes * | xargs -I% echo 'handling file ===%==='
>   handling file ===hello===
>   handling file ===world===
>
> Grep has a separate option (upper case -Z) to print the matched filenames
> with a NUL instead of a newline. This enables correct handling:
>
>   $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
>   handling file ===hello
>   world===
>
> And later:
>
>   $ grep -lZ yes * | xargs -0 mv -t /destination
>
>
>
> Hope this helps,
> regards,
>  - assaf
>
>
>
>

