Re: Multibyte and unibyte file names

emacs-devel

From:	Eli Zaretskii
Subject:	Re: Multibyte and unibyte file names
Date:	Sun, 27 Jan 2013 09:03:08 +0200

> From: Stefan Monnier <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden
> Date: Sat, 26 Jan 2013 17:11:25 -0500
> 
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
> 
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either.  Basically Elisp code which
> provides unibyte file names does so at its own risk.

What about C code that calls these primitives?  Can we consider every
such instance a bug in the caller?  If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.

> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'.  In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
> 
> But that's exactly the behavior stipulated by POSIX (though for '/' rather
> than '\\').  I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs.  And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.

Then why did you regard this:

 (let ((file-name-coding-system 'cp932))
   (expand-file-name "表" "C:/"))

  => "c:/\225/"

as a bug?  This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.
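
To make the failure mode concrete, here is a minimal standalone C sketch
(not Emacs code).  It assumes the usual cp932 lead-byte ranges
(0x81-0x9F and 0xE0-0xFC) and that "表" encodes to the two bytes
0x95 0x5C; the function names are made up for illustration.

```c
#include <stddef.h>

/* Nonzero if C is the lead byte of a two-byte cp932 character.
   The ranges are an assumption of this sketch.  */
int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-at-a-time scan: return the index of the first '\\' or '/'
   among the LEN bytes at S, or -1 if there is none.  */
long
first_sep_bytewise (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (s[i] == '\\' || s[i] == '/')
      return (long) i;
  return -1;
}

/* Same scan, but step over the trail byte of every two-byte
   character, so a trail byte that equals '\\' is never reported.  */
long
first_sep_dbcs (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        i++;                    /* skip the trail byte */
      else if (s[i] == '\\' || s[i] == '/')
        return (long) i;
    }
  return -1;
}
```

For the two bytes 0x95 0x5C, the byte-wise scan reports a separator at
index 1, while the lead-byte-aware scan reports none.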

> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names are based on characters, so a '\\' (or a '/')
> appearing within a multibyte sequence is not considered by the OS as
> a separator.
> 
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.

Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics.  And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.

> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > Heh, I don't think this will be well accepted.
> 
> I haven't heard too many screams about this over the years.

I heard 2 this week, from 2 different users.  Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.

> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
> 
> Hmm... but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?

That's not how the code was written.  w32.c functions get the strings
that are already encoded.

> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
> 
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?

That's how the code works, yes.

> If so, we need to immediately decode the result.

We already do.  Example:

  else if (STRING_MULTIBYTE (filename))
    {
      tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
      dostounix_filename (SSDATA (tem_fn));
      tem_fn = DECODE_FILE (tem_fn);
    }

> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).

Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter -- independently of
w32-downcase-file-names.  Since dostounix_filename currently operates
only on encoded file names, the encode/convert/decode round trip shown
above is always performed for decoded (multibyte) file names.
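
As a rough illustration of the two unconditional steps, here is a
hypothetical stand-in written for this message; it is not the real
dostounix_filename from w32.c, and it deliberately works one byte at a
time to show why that is unsafe on DBCS-encoded names.

```c
#include <stddef.h>

/* Hypothetical sketch, NOT the w32.c function: downcase an ASCII
   drive letter and mirror backslashes to forward slashes in the
   NUL-terminated string P, in place, one byte at a time.  */
void
sketch_dostounix (char *p)
{
  /* "X:" at the start is taken to be a drive letter.  p[1] is safe
     to read because p[0] is known to be non-NUL here.  */
  if (p[0] >= 'A' && p[0] <= 'Z' && p[1] == ':')
    p[0] += 'a' - 'A';

  /* Byte-wise mirroring: this also rewrites a 0x5C byte that is
     really the trail byte of a two-byte character.  */
  for (; *p; p++)
    if (*p == '\\')
      *p = '/';
}
```

Fed the bytes 'C' ':' '/' 0x95 0x5C, this sketch produces
'c' ':' '/' 0x95 '/', the same byte pattern as the "c:/\225/" result
quoted earlier in this message.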

> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.

That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.

Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC.  Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot simply pass a 'char *' to dostounix_filename
and have it modify the string in place.  Not rocket science, of course,
but still...
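
The byte-count issue can be illustrated with plain UTF-8, which is only
an analogy for Emacs's internal representation.  The sketch below
assumes well-formed input of at most 3 bytes per character and uses a
toy case table (toy_downcase is invented here; it is not Emacs's
downcase_table).  It writes into a separate buffer because the result
can be shorter or longer than the input.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence of 1 to 3 bytes at S into *CP and return
   its length.  Well-formed input is assumed.  */
size_t
u8_decode (const unsigned char *s, unsigned *cp)
{
  if (s[0] < 0x80)
    { *cp = s[0]; return 1; }
  if ((s[0] & 0xE0) == 0xC0)
    { *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu); return 2; }
  *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
  return 3;
}

/* Encode CP (below U+10000) as UTF-8 into OUT and return its length.  */
size_t
u8_encode (unsigned cp, unsigned char *out)
{
  if (cp < 0x80)
    { out[0] = cp; return 1; }
  if (cp < 0x800)
    { out[0] = 0xC0 | (cp >> 6); out[1] = 0x80 | (cp & 0x3F); return 2; }
  out[0] = 0xE0 | (cp >> 12);
  out[1] = 0x80 | ((cp >> 6) & 0x3F);
  out[2] = 0x80 | (cp & 0x3F);
  return 3;
}

/* Toy case table, for illustration only.  */
unsigned
toy_downcase (unsigned cp)
{
  if (cp >= 'A' && cp <= 'Z')
    return cp + ('a' - 'A');
  if (cp == 0x212A)             /* KELVIN SIGN: 3 bytes -> 1 byte */
    return 'k';
  if (cp == 0x023A)             /* A WITH STROKE: 2 bytes -> 3 bytes */
    return 0x2C65;
  return cp;
}

/* Downcase NUL-terminated SRC into DST, which holds DSTSIZE bytes.
   Return the number of bytes written, not counting the terminating
   NUL, or (size_t) -1 if the result does not fit.  */
size_t
downcase_copy (const unsigned char *src, unsigned char *dst, size_t dstsize)
{
  size_t n = 0;

  if (dstsize == 0)
    return (size_t) -1;
  while (*src)
    {
      unsigned cp;
      unsigned char buf[3];
      src += u8_decode (src, &cp);
      size_t len = u8_encode (toy_downcase (cp), buf);
      if (n + len >= dstsize)   /* leave room for the NUL */
        return (size_t) -1;
      for (size_t i = 0; i < len; i++)
        dst[n++] = buf[i];
    }
  dst[n] = '\0';
  return n;
}
```

With this table, the 2-byte sequence C8 BA becomes the 3-byte sequence
E2 B1 A5, and the 3-byte sequence E2 84 AA becomes the single byte 'k',
so a caller cannot assume the output fits in the input's storage.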

Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.

> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.

I think I already fixed all of them.

> > UTF-8 precludes them.  Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
> 
> Could you specify a bit more precisely which primitives you have
> in mind?

Those in fileio.c and in dired.c.  I could give an explicit list, if
you want.
