bug-sed
bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Eric Blake
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 16:12:38 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0

On 05/12/2017 02:30 PM, Dick Dunbar wrote:
> Hi Assaf and Eric,
> Thanks for your remarks.  Very thoughtful and helpful.
> 
> 1. I hadn't realized sed had a -z option.  Here's how I used it:
>    find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"
> 
> 2. Rather than fighting with sed behaviour, it's just easier to use
>     Eric's suggestion to strip the \r in a separate stage.  But this
>     doesn't do that.  It replaces \r with a null character followed
>     by \n
> 
>     $ cat t.out | tr -d '\r' | od -xc

od -xc is not the nicest format; I prefer od -tx1z.  And for good reason:

>      0000000    3a43    535c    6163    5c6e    2069    322e    000a
>                    C   :   \   S   c   a   n   \   i       .   2  \n

Note that this includes things like '6163' corresponding to 'c' 'a',
which (if you think about it) looks backwards.  It really means that od
-x defaults to printing 2 bytes at a time, and based on your machine's
endianness, those bytes appear in swapped order compared to the
per-character display from -c.

You are misreading the output to assume that the final 000a means that
tr inserted a \0 in place of the \r.  What REALLY happened was that tr
DID delete the \r, but now od has an odd number of bytes, and has to PAD
the final -x output (to a 2-byte boundary) by adding \0 as the padding
for display purposes, and the padding happens to appear in the same
2-byte group where you used to see the \r character (note that the \r\n
appeared as 0a0d if you omit tr from the pipeline).  If you use a
per-byte format such as od -tx1z, you will see that tr is indeed
stripping \r after all.
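
A minimal sketch of that check, assuming a POSIX shell with od and tr
available.  The sample string is reconstructed from the bytes in the
dump above, and the variable names are only illustrative:

```shell
# Illustrative sample matching the dump above: 12 bytes plus a DOS line ending.
sample='C:\Scan\i .2'

# Per-byte hex dump after tr: neither 0d nor a padding 00 should appear.
stripped_hex=$(printf '%s\r\n' "$sample" | tr -d '\r' | od -An -tx1 | tr -d ' \n')
echo "$stripped_hex"

# Same data without tr: the 0d 0a pair is still present at the end.
raw_hex=$(printf '%s\r\n' "$sample" | od -An -tx1 | tr -d ' \n')
echo "$raw_hex"

# For comparison, the 2-byte view pads the odd-length input for display;
# the byte order inside each group depends on the machine's endianness.
printf '%s\r\n' "$sample" | tr -d '\r' | od -An -x
```

The -tx1 form prints one byte per group, so there is no padding and no
byte swapping to misread.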

> 
>     And running a stream through d2u will cause the entire pipe to
>     stall until eof on the stream.

That merely means that d2u is not the best filter. I didn't say it was
the only viable filter (and I'm sure that upstream d2u maintainers would
welcome a patch to make it not stall pipelines - but that's a topic for
that list rather than this one).

> 
> 3. Eric: the discussion of binary file open confused me.
>     Does sed default to binary open?  How would you suggest I
>     fix this in user-land?

sed, and most other cygwin programs designed for text processing,
default to using open("r") semantics (which is text-open if the file
lives on a text mount, and binary-open if the file lives on a binary
mount).  Since pipelines do not live in the file system, there is no
mount point to control whether you want the pipeline to be treated as
text or binary, so cygwin defaults pipelines to behave as if they were
in binary mode.  (At one point, the CYGWIN environment variable had an
option to choose whether all pipelines should be force-text or
force-binary, but I think it got ripped out years ago.)

Prior to Feb 2017, SOME cygwin programs (sed included) had
downstream-only patches to FORCE open("rt") semantics on stdin,
including when stdin came from a pipeline.  But forcing this behavior,
while nice for text input coming from a non-cygwin program, was a
data-corruption agent for binary input coming from a cygwin program, and
could not be overridden.  So the decision was to drop the downstream
behavior. Now pipes are treated as binary mode, but YOU can override the
behavior by pre-filtering data before handing it through the pipe to sed.
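
As a rough sketch of such pre-filtering, assuming tr is an acceptable
filter for the data (it deletes every \r, not only those at end of
line).  The helper name strip_cr and the sample input are hypothetical:

```shell
# Hypothetical helper: drop every carriage return from stdin.
strip_cr() { tr -d '\r'; }

# With the filter, sed sees plain \n line endings and appends cleanly.
filtered=$(printf 'one\r\ntwo\r\n' | strip_cr | sed 's/$/;/')
printf '%s\n' "$filtered"

# Without the filter, the \r is part of the line, so the appended
# character lands after it (shown here as a per-byte hex dump).
unfiltered_hex=$(printf 'one\r\n' | sed 's/$/;/' | od -An -tx1 | tr -d ' \n')
echo "$unfiltered_hex"
```

The filter stage is explicit in the pipeline, so binary data passed
between cygwin programs is left alone unless you ask for the conversion.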

> 
> I don't really understand what  'info sed' is saying because sed
> can operate on a stream -or- a file.  It's not just mixing Win programs
> and cygwin programs that causes problems.  It is very common to
> get files from multiple platforms.
> 
> Editing a 'sh' script with notepad will definitely ruin your day:
> 
> #!/bin/bash \r\n

Yes, that DOES ruin your day: if you try to execute that script on
Linux, it will fail.  The same script fails on Cygwin too, unless you use
cygwin bash's downstream 'igncr' option to tell bash to ignore all \r.
Cygwin's approach is that you should opt in to ignoring \r: the default
is to behave like Linux, and only by doing something explicit can
you make life easier if you are going to be littering your data with \r
that should be ignored.
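
One explicit way to handle such a script without igncr is to strip the
\r before running it.  This is only a sketch: the file names are made
up, and it assumes mktemp, tr, wc and sh are available:

```shell
# Create a throwaway script with DOS line endings (illustrative content).
tmpdir=$(mktemp -d)
printf '#!/bin/sh\r\necho hello\r\n' > "$tmpdir/dos.sh"

# Count the carriage returns so the problem is visible before fixing it.
cr_count=$(tr -cd '\r' < "$tmpdir/dos.sh" | wc -c | tr -d ' ')
echo "carriage returns found: $cr_count"

# Write a cleaned copy and run that instead of the original.
tr -d '\r' < "$tmpdir/dos.sh" > "$tmpdir/unix.sh"
fixed_output=$(sh "$tmpdir/unix.sh")
echo "$fixed_output"

rm -rf "$tmpdir"
```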

> 
> vim identifies an edited file as "dos" if it encounters one.

That's because vim ALWAYS opens files in binary mode (open("rb") rather
than open("r")), and then uses its OWN code to deal with line
endings.  Not every program wants to copy vim's bloat by dealing with
line endings itself.

https://cygwin.com/cygwin-ug-net/using-textbinary.html is also a useful
resource to read.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



reply via email to

[Prev in Thread] Current Thread [Next in Thread]