
bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Dick Dunbar
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 02:17:39 -0700

There is nothing tricky about this sed filter.
I want to render filenames emitted by a program (not find) in single
quotes so that no special characters are interpreted by the shell
(space, $, etc.).
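
For what it's worth, one way to do that quoting robustly (a sketch, assuming
bash and newline-delimited filenames from a hypothetical "someprogram") is to
let printf do the escaping instead of sed:

  someprogram | while IFS= read -r f; do
      printf '%q\n' "$f"    # bash's %q escapes spaces, $, quotes, etc.
  done

(%q does not always use single quotes, but its output is always shell-safe.)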

The "mv" example was just another type of filter used by different
(known)  cygwin/unix programs.  The current problem remains with sed.
I remain mystified why the semantics of "$" ( end of line ) was changed,
and still cannot imagine any program that would benefit from such a change.

Yes, I understood Eric's explanation that it was to make Linux users
more comfortable.  What was wrong with the previous sed implementation?

Linux users have to deal with Windows files containing CRLF line endings
all the time.  What, exactly, is the problem you were trying to solve?
The cygwin definition should work fine on POSIX systems.
In all my cross-platform experience, I never had to question sed's
definition of "$".   It just worked.
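
(For anyone wanting to check what a particular sed build does with CRLF
input, a quick test along these lines - feed it one CRLF-terminated line
and dump the bytes:

  printf 'hello\r\n' | sed 's/$/X/' | od -An -c

If the X lands after the \r, that sed treats the carriage return as ordinary
line content; if it lands before the \r, the build is special-casing it.)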

Sorry for expanding the conversation to flags in other pgms;  that's a
separate discussion.


On Thu, May 11, 2017 at 7:05 PM, Assaf Gordon <address@hidden> wrote:

> Hello,
>
> > On May 11, 2017, at 18:39, Dick Dunbar <address@hidden> wrote:
> >
> > To round out this discussion:
> > I wanted a simple filter to ensure filename paths didn't contain spaces.
>
> There's a nuance here to verify:
> Did you want a filter to ensure none of your files have spaces (e.g. detect
> if some files do have spaces and then fail),
> or
> Did you want a robust way to use the 'mv' command (as below), even in
> the case of files with spaces ?
>
> If you just wanted to detect files with spaces,
> something like this would work:
>     find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces
>
> If you wanted to print files that have spaces, something like this would
> work:
>     find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
>
> > For example:
> >   find /foo -maxdepth 1 -atime +366 -print0 |
> >      xargs -0 sh -c 'mv "$@" /archive' move
>
> I'm not sure what the purpose of 'move' is in the above command.
> But if you wanted to move all the found files to the directory /archive,
> even if they have spaces in them, a more efficient way would be:
>
>     find /foo -maxdepth 1 -atime +366 -print0 | \
>        xargs -0 --no-run-if-empty mv -t /archive
>
> This GNU extension (-t DEST) works great with find/xargs,
> as xargs by default adds the parameters at the end of the command line,
> and "-t" means the destination directory is specified first.
>
> > So why are there different flags to indicate null-terminated lines?
> >   find -print0
> >   xargs -0
> >   sed  -z
> >
> > Seems silly.  To make a non-breaking-code-change,
> > why not add "-z" to the find and xargs command so they are compatible?
>
> Putting aside the naming convention for a moment (remember that each
> program is developed by different people) - I'll focus on find/xargs -
> which are part of the same package (findutils) and developed by the same
> people.
>
> These two are designed to work closely together - that's why they have
> "-print0" and "-0".
>
> The whole point of the following construct:
>
>    find [criteria] -print0 | xargs -0 ANY-PROGRAM
>
> is that 'ANY-PROGRAM' doesn't need to understand NUL line endings at all.
> The main reason find and xargs need the NULLs is to ensure
> file names are not broken by whitespace or even newlines. But once xargs
> reads
> the entire filename, it passes each filename as a single parameter to
> ANY-PROGRAM,
> and so there's no need to worry any more about filenames with whitespace.
>
> This useful construct breaks down if ANY-PROGRAM is 'sh', which might
> do further parameter splitting based on whitespace.
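>
> If you really do need a shell in the middle (say, to run several commands
> per batch), quoting "$@" keeps the filenames intact; a rough sketch, reusing
> the /archive example from above:
>
>     find /foo -maxdepth 1 -atime +366 -print0 | \
>        xargs -0 sh -c 'mv -- "$@" /archive' move
>
> (the trailing word - 'move' here - only fills in $0 for the inline script).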
>
> > And ... because we're dealing with the same issue of executables
> > creating stream data, why doesn't sed/awk/grep have an option
> > to deal with null delimited lines such that "$" would find them.
>
> I'm not sure I understand: sed and grep have "-z" exactly for this purpose.
> (also: sort -z , perl -0).
> gawk has a slightly different syntax, where you simply set the RS (input
> record separator) to NUL:
>
>    find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'
>
> But remember that when you use 'sed -z', the output also uses NULs as
> line-terminators,
> so it won't look good on the terminal or in a file.
>
> > Having sed recognize \r, \n, \0 as end of line might cause some
> > breakage if you have to deal with data that has embedded nulls.
>
> Instead of thinking in general "data that has embedded nulls",
> it'll be easier to consider concrete cases.
> Text files do not have embedded nuls (by definition, otherwise they are not
> text files). So standard text programs (sed/grep/awk) do not need to deal
> with NULs
> as line separators.
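>
> (If in doubt whether a given file contains NULs, counting them is a quick
> check - for example, with GNU tr:
>
>     tr -cd '\0' < somefile | wc -c    # 0 means no NUL bytes
>
> where "somefile" is whatever file you want to inspect.)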
>
> The main use case of having NUL as line separator is precisely with "find
> -print0".
> In this case, either use "xargs -0" and then the actual program doesn't
> need
> to worry about NULs at all, or use the gnu extensions (e.g. 'sed -z' or
> 'grep -z').
>
> > Had to check:
> > find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
> >
> > Doesn't work.  One very long string of null terminated filenames is
> returned.
>
> It works perfectly:
> 1. sed without -z treats newlines (\n) as line terminators.
> 2. 'find -print0' did not generate '\n' character at all.
> 3. 'sed' read the entire input (i.e. all files separated by NULs),
>    treated it as one line, and added quotes at the beginning and the end
>    of the entire buffer.
> 4. NULs were kept as-is, and are printed on your terminal.
>
> Example:
>     $ touch a b 'c d'
>     $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>       27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
>        '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '
>
> > So we now know that sed does not check for \0 as a line terminator.
> > And the sed -z flag produces the same long string.
> >
> > find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
>
> It also produces the correct output:
> This time, because of the '-z', sed indeed reads each filename until the
> NUL,
> and adds quotes around each file.
> But it also uses NULs as line terminators on the OUTPUT,
> so newline characters are not used at all.
> Notice that each file is surrounded by quotes, exactly as you've asked:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>     27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
>      '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
>     20  64  27  00
>          d   '  \0
>
> The missing piece is that after you've processed each file using 'sed -z',
> if you want to print them to the terminal, you still need to convert NULs
> to newlines:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
>   './a'
>   './b'
>   './c d'
>
> Or, if you wanted to use sed/grep as an intermediate filter between
> 'find' and 'xargs',
> then something like this:
>
>   find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
>   find [criteria] -print0 | sed -z [REGEX] | xargs -0 ANYPROGRAM
>
>
> In most of my examples above, whitespace doesn't actually cause problems -
> because sed/grep will not be confused by whitespace and won't break the
> line
> (it is mostly shell argument parsing that will get terribly confused by
> whitespace,
> and also "xargs" with certain parameters).
>
> The real 'kick' is that using NULs allows handling files that have
> embedded newlines.
>
> Consider the following:
>
>   $ touch a b 'c d' "$(printf 'e\nf')"
>   $ ls -log
>   total 0
>   -rw-r--r-- 1 0 May 12 01:43 a
>   -rw-r--r-- 1 0 May 12 01:43 b
>   -rw-r--r-- 1 0 May 12 01:43 c d
>   -rw-r--r-- 1 0 May 12 01:43 e?f
>
> The last file has an embedded newline, which will mess up a plain
> 'find | xargs' pipeline:
>
>   ## incorrect output: the 'e\nf' file is broken, 'echo' is executed
>   ## wrong number of times with non-existing file names:
>   $ find -type f | xargs -I% echo ==%==
>   ==./e==
>   ==f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> Using 'xargs -0' will solve it. This output is correct, but perhaps
> confusing
> when displayed on the terminal:
>
>   $ find -type f -print0 | xargs -0 -I% echo ==%==
>   ==./e
>   f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> And similarly with 'sed -z':
>
>   $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
>   <<<./e
>   f>>>
>   <<<./a>>>
>   <<<./b>>>
>   <<<./c d>>>
>
>
>
> One last tip:
> Sometimes you want to find and operate on files based on their content
> instead of their attributes (e.g. with 'grep').
>
> Here too, a file with spaces or newlines will cause trouble:
>
>   $ echo yes  > "$(printf 'hello\nworld')"
>   $ ls -log
>   total 4
>   -rw-r--r-- 1 4 May 12 01:57 hello?world
>
> If you wanted to find all files containing 'yes',
> grep alone would print a confusing output:
>
>   $ grep -l yes *
>   hello
>   world
>
> And using it with "xargs" will fail:
>
>   $ grep -l yes * | xargs -I% echo 'handling file ===%==='
>   handling file ===hello===
>   handling file ===world===
>
> Grep has a separate option (upper case -Z) to print the matched filenames
> with a NUL instead of a newline. This enables correct handling:
>
>   $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
>   handling file ===hello
>   world===
>
> And later:
>
>   $ grep -lZ yes * | xargs -0 mv -t /destination
>
>
>
> Hope this helps,
> regards,
>  - assaf
>
>
>
>

