guile-devel

Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.


From: Mark H Weaver
Date: Tue, 17 May 2011 16:03:18 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

Noah Lavine <address@hidden> writes:
> Mark is right that paths are basically just strings, even though
> occasionally they're not. I sort of like the idea of the PEP-383
> encoding (making paths strings that can potentially contain unused
> codepoints, which represent non-character bytes), but would that make
> path strings break under some Guile string operations?

Yes, this is indeed a problem.  Rather than using isolated surrogate code
points as recommended by PEP-383, I think we should use one of the
alternative mappings proposed in section 3.7.4 of Unicode Technical
Report #36 <http://www.unicode.org/reports/tr36/>:

1. Use 256 private-use code points, somewhere in the ranges F0000..FFFFD
   or 100000..10FFFD. This would probably cause the fewest security and
   interoperability problems. There is, however, some possibility of
   collision with other uses of private-use characters.

2. Use pairs of noncharacter code points in the range FDD0..FDEF. These
   are "super" private-use characters, and are discouraged for general
   interchange. The transformation would take each nibble of a byte Y,
   and add to FDD0 and FDE0, respectively. However, noncharacter code
   points may be replaced by U+FFFD ( � ) REPLACEMENT CHARACTER by some
   implementations, especially when they use them internally. (Again,
   incoming characters must never be deleted, because that can cause
   security problems.)
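
As a rough illustration of the two mappings above, here is a minimal
Python sketch.  It is not taken from Guile or from TR36 itself.  The
base code point PUA_BASE, the function names, and the choice to emit the
high nibble first in the second mapping are all assumptions made for the
example; TR36 does not fix any of them.

```python
# Hypothetical base for mapping 1: the start of Supplementary Private Use
# Area-A. Any 256 consecutive private-use code points would serve.
PUA_BASE = 0xF0000

# Bases for mapping 2, as named in the quoted TR36 text.
NONCHAR_HIGH_BASE = 0xFDD0
NONCHAR_LOW_BASE = 0xFDE0


def byte_to_private_use(b: int) -> str:
    """Mapping 1: represent one undecodable byte as one private-use code point."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    return chr(PUA_BASE + b)


def private_use_to_byte(ch: str) -> int:
    """Inverse of byte_to_private_use."""
    offset = ord(ch) - PUA_BASE
    if not 0 <= offset <= 0xFF:
        raise ValueError("not an escaped byte: %r" % (ch,))
    return offset


def byte_to_noncharacter_pair(b: int) -> str:
    """Mapping 2: represent one byte as two noncharacters, one per nibble.

    Assumption: the high nibble is added to FDD0 and comes first, the low
    nibble is added to FDE0 and comes second.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    return chr(NONCHAR_HIGH_BASE + (b >> 4)) + chr(NONCHAR_LOW_BASE + (b & 0x0F))


def noncharacter_pair_to_byte(pair: str) -> int:
    """Inverse of byte_to_noncharacter_pair."""
    if len(pair) != 2:
        raise ValueError("expected exactly two code points")
    high = ord(pair[0]) - NONCHAR_HIGH_BASE
    low = ord(pair[1]) - NONCHAR_LOW_BASE
    if not (0 <= high <= 0x0F and 0 <= low <= 0x0F):
        raise ValueError("not an escaped byte pair: %r" % (pair,))
    return (high << 4) | low
```

Under these assumptions the byte 0xE9 becomes U+F00E9 with the first
mapping and the pair U+FDDE U+FDE9 with the second.  The first mapping
keeps one code point per byte; the second doubles the length of each
escaped byte but avoids the private-use ranges.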

> Also, when we convert strings to paths, we need to know what encoding
> the local filesystem uses. That will usually be UTF-8, but potentially
> might not be, correct?

Yes, that is correct.  I haven't looked deeply into this, but clearly a
lot of software uses the current locale encoding to interpret these
POSIX byte strings, and I suspect at least some software uses UTF-8 to
interpret filenames.  Fortunately, most popular modern distributions of
GNU are now using UTF-8 locales by default, which basically makes the
problem disappear.

Regardless, this method of mapping ill-formed byte sequences to
private-use code points can be used with _any_ encoding, not just UTF-8.
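
To make the encoding-independence concrete, here is a small Python
sketch (not Guile code) that decodes a POSIX byte string with whatever
encoding the caller names and maps each ill-formed byte to a private-use
code point.  The base PUA_BASE, the handler name "pua-escape", and the
function decode_filename are hypothetical names chosen for the example.
It relies only on Python's standard codecs.register_error hook.

```python
import codecs

# Hypothetical base: 256 code points starting at Supplementary Private Use
# Area-A. This choice is an assumption of the example.
PUA_BASE = 0xF0000


def _pua_escape(exc):
    """Decode-error handler: replace each ill-formed byte with PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape", _pua_escape)


def decode_filename(raw: bytes, encoding: str) -> str:
    """Decode raw with the given encoding, escaping bytes that do not decode."""
    return raw.decode(encoding, errors="pua-escape")
```

For instance, decode_filename(b"caf\xe9", "utf-8") and
decode_filename(b"caf\xe9", "ascii") both keep "caf" and escape the
stray 0xE9 byte as U+F00E9, while decoding the same bytes as "latin-1"
needs no escaping at all.  The sketch does not address the collision
risk noted above: a well-formed input that already contains code points
in the chosen private-use range is indistinguishable from an escaped
byte.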

    Best,
     Mark
