Re: Multibyte and unibyte file names
From: Eli Zaretskii
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 09:03:08 +0200
> From: Stefan Monnier <address@hidden>
> Cc: address@hidden, address@hidden, address@hidden
> Date: Sat, 26 Jan 2013 17:11:25 -0500
>
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
>
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either. Basically Elisp code which
> provides unibyte file names does it at its own risk.
What about C code that calls these primitives? Can we consider every
such instance a bug in the caller? If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.
> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars. If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'. In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
>
> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
> than '\\'). I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs. And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.
Then why did you regard this:
  (let ((file-name-coding-system 'cp932))
    (expand-file-name "表" "C:/"))
  => "c:/\225/"
as a bug? This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.
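To make the failure concrete, here is a minimal sketch in C, not the actual Emacs code. It assumes the usual cp932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC) and that "表" encodes as the two bytes 0x95 0x5C; the function names are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Assumption: cp932 lead bytes fall in 0x81-0x9F and 0xE0-0xFC.  */
static int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-at-a-time scan: returns the index of the first byte equal to
   '\\' or '/', or -1 if there is none.  This is the "unsafe" way.  */
static long
first_sep_bytewise (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (s[i] == '\\' || s[i] == '/')
      return (long) i;
  return -1;
}

/* DBCS-aware scan: a byte that follows a lead byte is a trail byte
   and is never treated as a separator.  */
static long
first_sep_dbcs_aware (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        {
          i++;                  /* skip the trail byte */
          continue;
        }
      if (s[i] == '\\' || s[i] == '/')
        return (long) i;
    }
  return -1;
}
```

For the two bytes 0x95 0x5C, first_sep_bytewise reports a separator at index 1, while first_sep_dbcs_aware reports none (-1), which matches how the w32 filesystem sees that name.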
> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names is based on characters, so a '\\' (or a '/')
> appearing with a multibyte sequence is not considered by the OS as
> a separator.
>
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.
Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics. And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.
> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > He, I don't think this will be well accepted.
>
> I haven't heard too many screams about this over the years.
I heard 2 this week, from 2 different users. Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.
> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
>
> Hmm.. but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?
That's not how the code was written. w32.c functions get the strings
that are already encoded.
> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
>
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?
That's how the code works, yes.
> If so, we need to immediately decode the result.
We already do. Example:
      else if (STRING_MULTIBYTE (filename))
        {
          tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
          dostounix_filename (SSDATA (tem_fn));
          tem_fn = DECODE_FILE (tem_fn);
        }
> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).
Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter -- independently of
w32-downcase-file-names. Since dostounix_filename currently operates
only on encoded file names, the encode/convert/decode sequence above
is always performed whenever the file name is a decoded (multibyte)
string.
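A simplified sketch of the two unconditional steps just mentioned, assuming plain ASCII handling. The name normalize_sketch is hypothetical and this is not the real dostounix_filename; in particular, its byte-wise backslash test is only safe for encodings in which '\\' never occurs as a trail byte.

```c
/* Hypothetical stand-in for the unconditional part of the job:
   downcase an ASCII drive letter and turn backslashes into slashes.
   Works in place on a NUL-terminated byte string.  */
static void
normalize_sketch (char *p)
{
  /* Downcase the drive letter, ASCII only, independent of locale.  */
  if (p[0] >= 'A' && p[0] <= 'Z' && p[1] == ':')
    {
      p[0] += 'a' - 'A';
      p += 2;
    }

  /* Mirror the backslashes.  Byte-wise, hence unsafe for DBCS
     encodings such as cp932.  */
  for (; *p; p++)
    if (*p == '\\')
      *p = '/';
}
```

Called on a writable buffer holding "C:\Users\foo", this leaves "c:/Users/foo" in the buffer; the rest of the name keeps its case because the optional downcasing step is not modelled here.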
> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.
That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.
Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC. Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot pass a 'char *' to dostounix_filename. Not
rocket science, of course, but still...
Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.
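To illustrate the byte-count point, here is a small sketch using standard UTF-8 and Unicode's simple case mappings, not Emacs's internal representation or its downcase_table. The helper utf8_encode is hypothetical. The Unicode examples are U+212A (KELVIN SIGN), whose lowercase is U+006B, and U+023A, whose lowercase is U+2C65.

```c
#include <stddef.h>

/* Encode code point CP as UTF-8 into OUT (at least 4 bytes) and
   return the number of bytes written.  No validation is done; this
   is only meant to compare encoded lengths.  */
static size_t
utf8_encode (unsigned long cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

U+212A takes 3 bytes and its lowercase U+006B takes 1, so the string shrinks; U+023A takes 2 bytes and its lowercase U+2C65 takes 3, so the string grows. Either way, rewriting the text in place through a 'char *' of fixed length does not work in general.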
> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.
I think I already fixed all of them.
> > UTF-8 precludes them. Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
>
> Could you specify a bit more precisely which primitives you have
> in mind?
Those in fileio.c and in dired.c. I could give an explicit list, if
you want.