bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Assaf Gordon
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 22:05:33 -0400

Hello,

> On May 11, 2017, at 18:39, Dick Dunbar <address@hidden> wrote:
> 
> To round out this discussion:
> I wanted a simple filter to ensure filename paths didn't contain spaces.

There's a nuance here to verify:
Did you want a filter to ensure none of your files have spaces (e.g. detect
if some files do have spaces and then fail),
or
did you want a robust way to use the 'mv' command (as below), even in
the case of files with spaces?

If you just wanted to detect files with spaces,
something like this would work:
    find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces

If you wanted to print files that have spaces, something like this would
work:
    find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
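
As a rough sketch of both variants side by side, assuming GNU find and grep
and a throwaway directory (the file names are made up for illustration):

```shell
# Scratch directory with one plain name and one name containing a space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch plain 'with space'

# Detection only: the exit status of 'grep -qz' says whether any
# NUL-terminated name contains whitespace.
if find . -type f -print0 | grep -qz '[[:space:]]'; then
    status=have-files-with-spaces
else
    status=no-spaces
fi
echo "$status"

# Listing: keep the matching names, then convert NULs to newlines for display.
spaced=$(find . -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n')
echo "$spaced"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Note that '[[:space:]]' also matches tabs and newlines, not just the space
character.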

> For example:
>   find /foo -maxdepth 1 -atime +366 -print0 |
>      xargs -0 sh -c 'mv "$@" /archive' move

I'm not sure what the purpose of 'move' is in the above command.
But if you wanted to move all the found files to the directory /archive,
even if they have spaces in them, a more efficient way would be:

    find /foo -maxdepth 1 -atime +366 -print0 | \
       xargs -0 --no-run-if-empty mv -t /archive

This GNU extension (-t DEST) works great with find/xargs,
as xargs by default adds the parameters at the end of the command line,
and "-t" means the destination directory is specified first.
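
A small self-contained sketch of the same idea, assuming GNU mv and xargs;
the 'foo' and 'archive' directories here are scratch stand-ins created under
a temporary directory, and the -atime test is left out so the example has
something to move:

```shell
# Scratch layout: a source directory with two files, one with a space.
workdir=$(mktemp -d)
mkdir "$workdir/foo" "$workdir/archive"
touch "$workdir/foo/a" "$workdir/foo/c d"

# --no-run-if-empty is an xargs option: skip 'mv' when find prints nothing.
# 'mv -t DIR' takes the destination first, so xargs can append the names.
find "$workdir/foo" -maxdepth 1 -type f -print0 |
    xargs -0 --no-run-if-empty mv -t "$workdir/archive"

moved=$(ls "$workdir/archive")
left=$(ls "$workdir/foo")
echo "$moved"
```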

> So why are there different flags to indicate null-terminated lines?
>   find -print0
>   xargs -0
>   sed  -z
> 
> Seems silly.  To make a non-breaking-code-change,
> why not add "-z" to the find and xargs command so they are compatible?

Putting aside the naming convention for a moment (remember that each program
is developed by different people), I'll focus on find/xargs, which are part of
the same package (findutils) and developed by the same people.

These two are designed to work closely together - that's why they have
"-print0" and "-0".

The whole point of the following construct:

   find [criteria] -print0 | xargs -0 ANY-PROGRAM

is that 'ANY-PROGRAM' doesn't need to understand NUL line endings at all.
The main reason find and xargs need the NULs is to ensure
file names are not broken by whitespace or even newlines. But once xargs reads
the entire filename, it passes each filename as a single parameter to
ANY-PROGRAM, and so there's no need to worry any more about filenames
with whitespace.

This useful construct breaks down if ANY-PROGRAM is 'sh', which might then
do further parameter splitting based on whitespace.
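
To illustrate, here is a minimal sketch that counts how many arguments the
inner shell ends up with. The word 'inner' is just a placeholder: with
'sh -c', the first word after the script text becomes $0, and the remaining
words become the positional parameters.

```shell
# One file name containing a space, NUL-terminated, handed to 'sh' by xargs.
# Quoted "$@" keeps the name as a single argument.
count_quoted=$(printf 'c d\0' | xargs -0 sh -c 'set -- "$@"; echo $#' inner)

# Unquoted $@ lets the inner shell split the name again on whitespace.
count_unquoted=$(printf 'c d\0' | xargs -0 sh -c 'set -- $@; echo $#' inner)

echo "quoted: $count_quoted, unquoted: $count_unquoted"
```

So the file name survives only as long as the script given to 'sh -c' quotes
its parameters.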

> And ... because we're dealing with the same issue of executables
> creating stream data, why doesn't sed/awk/grep have an option
> to deal with null delimited lines such that "$" would find them.

I'm not sure I understand: sed and grep have "-z" exactly for this purpose
(also: sort -z, perl -0).
gawk has a slightly different syntax, where you simply set the RS (input
record separator) to NUL:

   find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'

But remember that when you use 'sed -z', the output also uses NULs as
line terminators, so it won't look good on the terminal or in a file.

> Having sed recognize \r, \n, \0 as end of line might cause some 
> breakage if you have to deal with data that has embedded nulls.

Instead of thinking in general about "data that has embedded nulls",
it'll be easier to consider concrete cases.
Text files do not have embedded NULs (by definition, otherwise they are not
text files). So standard text programs (sed/grep/awk) do not need to deal
with NULs as line separators.

The main use case of having NUL as line separator is precisely with
"find -print0".
In this case, either use "xargs -0" and then the actual program doesn't need
to worry about NULs at all, or use the GNU extensions (e.g. 'sed -z' or
'grep -z').

> Had to check:
> find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
> 
> Doesn't work.  One very long string of null terminated filenames is returned.

It works perfectly:
1. sed without -z treats newlines (\n) as line terminators.
2. 'find -print0' did not generate any '\n' characters at all.
3. 'sed' read the entire input (i.e. all files separated by NULs),
   treated it as one line, and added quotes at the beginning and the end
   of the entire buffer.
4. NULs were kept as-is, and are printed on your terminal.

Example:
    $ touch a b 'c d'
    $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
      27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
       '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '

> So we now know that sed does not check for \0 as a line terminator.
> And the sed -z flag produces the same long string.
> 
> find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"

It also produces the correct output:
This time, because of the '-z', sed indeed reads each filename until the NUL,
and adds quotes around each file.
But it also uses NULs as line terminators on the OUTPUT,
so newline characters are not used at all.
Notice that each file is surrounded by quotes, exactly as you've asked:

  $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
    27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
     '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
    20  64  27  00
         d   '  \0

The missing piece is that after you've processed each file using 'sed -z',
if you want to print them to the terminal, you still need to convert NULs to
newlines:

  $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
  './a'
  './b'
  './c d'

Or, if you wanted to use sed/grep as an intermediate filter between 'find' and
'xargs', then something like this:

  find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
  find [criteria] -print0 | sed -z [SCRIPT] | xargs -0 ANYPROGRAM
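
A concrete sketch of the grep variant, assuming GNU grep and made-up file
names in a scratch directory. The filter keeps only names ending in '.txt'
and 'ls' stands in for ANYPROGRAM:

```shell
# Scratch directory with three files; one name contains a space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'old report.txt' 'notes.txt' 'image.png'

# With -z, '$' anchors at the end of each NUL-terminated name, and the
# output stays NUL-terminated so 'xargs -0' can read it unchanged.
kept=$(find . -type f -print0 | grep -z '\.txt$' | xargs -0 ls -1 | sort)
echo "$kept"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```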


In most of my examples above, whitespace doesn't actually cause problems,
because sed/grep will not be confused by whitespace and won't break the line
(it is mostly shell argument parsing that will get terribly confused by
whitespace, and also "xargs" with certain parameters).

The real 'kick' is that using NULs allows handling files that have embedded
newlines.

Consider the following:

  $ touch a b 'c d' "$(printf 'e\nf')"
  $ ls -log
  total 0
  -rw-r--r-- 1 0 May 12 01:43 a
  -rw-r--r-- 1 0 May 12 01:43 b
  -rw-r--r-- 1 0 May 12 01:43 c d
  -rw-r--r-- 1 0 May 12 01:43 e?f

The last file has an embedded newline, which will mess up the 'find | xargs'
pipeline:

  ## incorrect output: the 'e\nf' file name is broken in two, and 'echo' is
  ## executed the wrong number of times, with non-existent file names:
  $ find -type f | xargs -I% echo ==%==
  ==./e==
  ==f==
  ==./a==
  ==./b==
  ==./c d==

Using 'xargs -0' will solve it. This output is correct, but perhaps confusing
when displayed on the terminal:

  $ find -type f -print0 | xargs -0 -I% echo ==%==
  ==./e
  f==
  ==./a==
  ==./b==
  ==./c d==

And similarly with 'sed -z':

  $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
  <<<./e
  f>>>
  <<<./a>>>
  <<<./b>>>
  <<<./c d>>>
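
The same effect shows up when simply counting files. This is only a sketch,
assuming GNU find and tr, run in a scratch directory holding the same four
files as above:

```shell
# Scratch directory with four files, one of which has an embedded newline.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a b 'c d' "$(printf 'e\nf')"

# Newline-delimited: the embedded newline is counted as an extra file.
by_newline=$(find . -type f | wc -l)

# NUL-delimited: delete every byte except NUL, then count what is left.
by_nul=$(find . -type f -print0 | tr -cd '\0' | wc -c)

echo "newline count: $by_newline, NUL count: $by_nul"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

The newline-based count comes out as 5 for 4 files, while counting NUL
terminators gives 4.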



One last tip:
Sometimes you want to find and operate on files based on their content
instead of their attributes (e.g. with 'grep').

Here too, a file with spaces or newlines will cause trouble:

  $ echo yes  > "$(printf 'hello\nworld')"
  $ ls -log
  total 4
  -rw-r--r-- 1 4 May 12 01:57 hello?world

If you wanted to find all files containing 'yes',
grep alone would print confusing output:

  $ grep -l yes *
  hello
  world

And using it with "xargs" will fail:

  $ grep -l yes * | xargs -I% echo 'handling file ===%==='
  handling file ===hello===
  handling file ===world===

Grep has a separate option (upper case -Z) to print the matched filenames
with a NUL instead of a newline. This enables correct handling:

  $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
  handling file ===hello
  world===

And later:

  $ grep -lZ yes * | xargs -0 mv -t /destination
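
Put together as a self-contained sketch, assuming GNU grep, xargs and mv;
the 'src' and 'destination' directories are scratch stand-ins, and 'other'
is an extra file that should not match:

```shell
# Scratch layout: one matching file with a newline in its name, one non-match.
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/destination"
cd "$workdir/src" || exit 1
echo yes > "$(printf 'hello\nworld')"
echo no > other

# -l prints only the names of matching files, -Z terminates each with a NUL.
# '--' protects against file names that start with a dash.
grep -lZ yes -- * | xargs -0 --no-run-if-empty mv -t "$workdir/destination"

remaining=$(ls)
moved_count=$(find "$workdir/destination" -type f -print0 | tr -cd '\0' | wc -c)
echo "remaining: $remaining, moved: $moved_count"
```

Only the file containing 'yes' is moved, newline in its name and all.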



Hope this helps,
regards,
 - assaf







