bug-sed

bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Dick Dunbar
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 02:26:18 -0700

It is still unexplained how sed finds the end-of-line correctly
when there are no control characters ( \r, \n ) at all.

In the original sedtest.sh script I posted,
   fn

Running that script again with two additional blank characters at the
end of $fn produces the desired result:
the single quote follows the "2", even though there are two blanks at the
end of the string.

fn="C:\Scan\i .2  "

$ ./sedtest.sh
1. simple string works
 C:\Scan\i .2
'C:\Scan\i .2'
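For comparison, GNU sed on Linux anchors "$" at the true end of the line, so the closing quote lands after the trailing blanks. A minimal sketch (the filename string is illustrative):

```shell
# Two trailing blanks at the end of the string:
fn='C:\Scan\i .2  '
# Wrap the value in single quotes, as in sedtest.sh:
printf '%s\n' "$fn" | sed -e "s/^/'/" -e "s/\$/'/"
# With GNU sed this prints: 'C:\Scan\i .2  '  (quote AFTER the blanks)
```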


On Fri, May 12, 2017 at 2:17 AM, Dick Dunbar <address@hidden> wrote:

> There is nothing tricky about this sed filter.
> I want to render filenames emitted by a program ( not find) in single
> quotes so that no special characters are interpreted by the shell:
>   ( space, $, etc )
>
> The "mv" example was just another type of filter used by different
> (known)  cygwin/unix programs.  The current problem remains with sed.
> I remain mystified why the semantics of "$" ( end of line ) were changed,
> and still cannot imagine any program that would benefit from such a change.
>
> Yes, I understood Eric's explanation that it was to make Linux users
> more comfortable.  But what was wrong with the previous sed implementation?
>
> Linux users have to deal with Windows files containing CRLF line endings
> all the time.  What, exactly, is the problem you were trying to solve?
> The cygwin definition should work fine on POSIX systems.
> In all my cross-platform experience, I never had to question sed's
> definition of "$".   It just worked.
>
> Sorry for expanding the conversation to flags in other programs;  that's a
> separate discussion.
>
>
> On Thu, May 11, 2017 at 7:05 PM, Assaf Gordon <address@hidden>
> wrote:
>
>> Hello,
>>
>> > On May 11, 2017, at 18:39, Dick Dunbar <address@hidden> wrote:
>> >
>> > To round out this discussion:
>> > I wanted a simple filter to ensure filename paths didn't contain spaces.
>>
>> There's a nuance here to verify:
>> Did you want a filter to ensure none of your files have spaces (e.g. detect
>> if some names do have spaces and then fail),
>> or
>> Did you want a robust way to use the 'mv' command (as below), even in
>> the case of files with spaces ?
>>
>> If you just wanted to detect files with spaces,
>> something like this would work:
>>     find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces
>>
>> If you wanted to print files that have spaces, something like this would
>> work:
>>     find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
>>
>> > For example:
>> >   find /foo -maxdepth 1 -atime +366 -print0 |
>> >      xargs -0 sh -c 'mv "$@" /archive' move
>>
>> I'm not sure what the purpose of 'move' is in the above command.
>> But if you wanted to move all the found files to the directory /archive,
>> even if they have spaces in them, a more efficient way would be:
>>
>>     find /foo -maxdepth 1 -atime +366 -print0 | \
>>        xargs -0 --no-run-if-empty mv -t /archive
>>
>> This GNU extension (-t DEST) works great with find/xargs:
>> xargs by default appends the arguments at the end of the command line,
>> and "-t" lets the destination directory be specified up front.
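A quick runnable sketch of the "-t DEST" behavior (throwaway names; GNU coreutils mv):

```shell
# Scratch directory with two files, one containing a space:
d=$(mktemp -d); cd "$d"
mkdir dest
touch f1 'f 2'
# -t names the destination up front, so sources may follow at the end -
# exactly where xargs appends its arguments:
mv -t dest f1 'f 2'        # same effect as: mv f1 'f 2' dest
ls dest
```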
>>
>> > So why are there different flags to indicate null-terminated lines?
>> >   find -print0
>> >   xargs -0
>> >   sed  -z
>> >
>> > Seems silly.  To make a non-breaking-code-change,
>> > why not add "-z" to the find and xargs command so they are compatible?
>>
>> Putting aside the naming convention for a moment (remember that each
>> program is developed by different people) - I'll focus on find/xargs,
>> which are part of the same package (findutils) and developed by the
>> same people.
>>
>> These two are designed to work closely together - that's why they have
>> "-print0" and "-0".
>>
>> The whole point of the following construct:
>>
>>    find [criteria] -print0 | xargs -0 ANY-PROGRAM
>>
>> is that 'ANY-PROGRAM' doesn't need to understand NUL line endings at all.
>> The main reason find and xargs need the NULs is to ensure file names
>> are not broken by whitespace or even newlines. But once xargs reads
>> the entire filename, it passes each filename as a single parameter to
>> ANY-PROGRAM, so there's no need to worry any more about filenames
>> with whitespace.
>>
>> This useful construct breaks down if ANY-PROGRAM is 'sh', which might
>> do further word splitting based on whitespace.
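The point about intact parameters can be checked directly; a small sketch (scratch filenames are illustrative):

```shell
# A filename with a space survives the find|xargs pipeline as ONE argument:
d=$(mktemp -d); cd "$d"
touch 'a b' c
find . -type f -print0 | xargs -0 printf '[%s]\n'
# prints [./a b] and [./c] (order may vary)
```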
>>
>> > And ... because we're dealing with the same issue of executables
>> > creating stream data, why doesn't sed/awk/grep have an option
>> > to deal with null delimited lines such that "$" would find them.
>>
>> I'm not sure I understand: sed and grep have "-z" exactly for this
>> purpose (also: sort -z, perl -0).
>> gawk has a slightly different syntax, where you simply set RS (the
>> input record separator) to NUL:
>>
>>    find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'
>>
>> But remember that when you use 'sed -z', the output also uses NULs as
>> line terminators, so it won't look good on the terminal or in a file.
>>
>> > Having sed recognize \r, \n, \0 as end of line might cause some
>> > breakage if you have to deal with data that has embedded nulls.
>>
>> Instead of thinking in general terms of "data that has embedded NULs",
>> it's easier to consider concrete cases.
>> Text files do not have embedded NULs (by definition; otherwise they are
>> not text files), so standard text programs (sed/grep/awk) do not need
>> to deal with NULs as line separators.
>>
>> The main use case of having NUL as line separator is precisely with
>> "find -print0". In this case, either use "xargs -0" (and then the actual
>> program doesn't need to worry about NULs at all), or use the GNU
>> extensions (e.g. 'sed -z' or 'grep -z').
>>
>> > Had to check:
>> > find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
>> >
>> > Doesn't work.  One very long string of null terminated filenames is returned.
>>
>> It works perfectly:
>> 1. sed without -z treats newlines (\n) as line terminators.
>> 2. 'find -print0' did not generate any '\n' characters at all.
>> 3. 'sed' read the entire input (i.e. all files separated by NULs),
>>    treated it as one line, and added quotes at the beginning and the end
>>    of the entire buffer.
>> 4. NULs were kept as-is, and were printed on your terminal.
>>
>> Example:
>>     $ touch a b 'c d'
>>     $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>>       27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
>>        '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '
>>
>> > So we now know that sed does not check for \0 as a line terminator.
>> > And the sed -z flag produces the same long string.
>> >
>> > find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
>>
>> It also produces the correct output:
>> This time, because of the '-z', sed indeed reads each filename until the
>> NUL,
>> and adds quotes around each file.
>> But it also uses NULs as line terminators on the OUTPUT,
>> so newline characters are not used at all.
>> Notice that each file is surrounded by quotes, exactly as you've asked:
>>
>>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>>     27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
>>      '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
>>     20  64  27  00
>>          d   '  \0
>>
>> The missing piece is that after you've processed each file using 'sed -z',
>> if you want to print them to the terminal, you still need to convert NULs
>> to newlines:
>>
>>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
>>   './a'
>>   './b'
>>   './c d'
>>
>> Or, if you wanted to use sed/grep as an intermediate filter between
>> 'find' and 'xargs', then something like this:
>>
>>   find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
>>   find [criteria] -print0 | sed -z [REGEX] | xargs -0 ANYPROGRAM
>>
>>
>> In most of my examples above, whitespace doesn't actually cause problems -
>> sed/grep will not be confused by whitespace and won't break the line
>> (it is mostly shell argument parsing that gets terribly confused by
>> whitespace, and also "xargs" with certain parameters).
>>
>> The real 'kick' is that using NULs allows handling files that have
>> embedded newlines.
>>
>> Consider the following:
>>
>>   $ touch a b 'c d' "$(printf 'e\nf')"
>>   $ ls -log
>>   total 0
>>   -rw-r--r-- 1 0 May 12 01:43 a
>>   -rw-r--r-- 1 0 May 12 01:43 b
>>   -rw-r--r-- 1 0 May 12 01:43 c d
>>   -rw-r--r-- 1 0 May 12 01:43 e?f
>>
>> The last file has an embedded newline, which will mess up newline-based processing:
>>
>>   ## incorrect output: the 'e\nf' file is broken, 'echo' is executed
>>   ## wrong number of times with non-existing file names:
>>   $ find -type f | xargs -I% echo ==%==
>>   ==./e==
>>   ==f==
>>   ==./a==
>>   ==./b==
>>   ==./c d==
>>
>> Using 'xargs -0' will solve it. This output is correct, but perhaps
>> confusing when displayed on the terminal:
>>
>>   $ find -type f -print0 | xargs -0 -I% echo ==%==
>>   ==./e
>>   f==
>>   ==./a==
>>   ==./b==
>>   ==./c d==
>>
>> And similarly with 'sed -z':
>>
>>   $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
>>   <<<./e
>>   f>>>
>>   <<<./a>>>
>>   <<<./b>>>
>>   <<<./c d>>>
>>
>>
>>
>> One last tip:
>> Sometimes you want to find and operate on files based on their content
>> (e.g. with 'grep') instead of their attributes.
>>
>> Here too, a file with spaces or newlines will cause troubles:
>>
>>   $ echo yes  > "$(printf 'hello\nworld')"
>>   $ ls -log
>>   total 4
>>   -rw-r--r-- 1 4 May 12 01:57 hello?world
>>
>> If you wanted to find all files containing 'yes',
>> grep alone would print a confusing output:
>>
>>   $ grep -l yes *
>>   hello
>>   world
>>
>> And using it with "xargs" will mishandle the name:
>>
>>   $ grep -l yes * | xargs -I% echo 'handling file ===%==='
>>   handling file ===hello===
>>   handling file ===world===
>>
>> Grep has a separate option (upper case -Z) to print the matched filenames
>> with a NUL instead of a newline. This enables correct handling:
>>
>>   $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
>>   handling file ===hello
>>   world===
>>
>> And later:
>>
>>   $ grep -lZ yes * | xargs -0 mv -t /destination
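End to end, that pipeline behaves like this sketch (scratch files are illustrative; 'dst' stands in for /destination):

```shell
d=$(mktemp -d); mkdir "$d/src" "$d/dst"; cd "$d/src"
echo yes > 'a file'        # matches, despite the space in its name
echo no  > other           # does not match
# -lZ emits NUL-terminated names; -0 consumes them; -t takes the dest first:
grep -lZ yes * | xargs -0 mv -t ../dst
ls ../dst                  # prints: a file
```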
>>
>>
>>
>> Hope this helps,
>> regards,
>>  - assaf
>>
>>
>>
>>
>

