Re: [Gnu-arch-users] Patch Logs vs. character sets

From: Tom Lord
Subject: Re: [Gnu-arch-users] Patch Logs vs. character sets
Date: Fri, 4 Jun 2004 12:12:31 -0700 (PDT)
Stephen:

    Tom> Too bad.  Welcome to string processing in the 21st century.
    Tom> Get used to it.

    You mean "last quarter of the 20th century".  In _this_ century,
    sane people will use Unicode nearly exclusively for the internals
    of new I18N software, and mostly for I18N external use, too.
I pretty much agree with that, except that I have (I think) a broader
perspective on what it means.
People are used to the various Unicode encoding forms: UTF-8, UTF-16,
UTF-32, and their endian variants.
My view is that there are some additional encoding forms which are
important to support in some cases.  Most of the additional encoding
forms are "degenerate" in the sense that they might not be able to
represent arbitrary Unicode strings, or they might contain codepoints
which are not actually Unicode characters.
Three of the degenerate encoding forms for Unicode that I think about
and support in at least some of my code are:

  iso-8859-1

    Only represents a subset of Unicode, but stores that subset
    as bytes, each containing a Unicode codepoint.

  ascii+non-specific

    An encoding in which programs cannot obtain an answer to the
    question "What is the Nth codepoint?" if the Nth byte of a
    sequence of characters is an integer in the closed range
    128..255.  In other words, in this "encoding form", you know
    that bytes 0..127 represent actual Unicode codepoints, but all
    that you know about the other byte values is that they each
    somehow represent a single codepoint.

  bogus-32

    An encoding that could be described as "unicode+non-specific"
    -- it can handle any 32-bit character set of which Unicode is
    a subset.
I sometimes think about whether to add the other iso-8859-* variants
to the list, but haven't reached any conclusion (other than having not
bothered to do it).
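
To make those distinctions concrete, here is a minimal sketch of how
the "What is the Nth codepoint?" question plays out for each form.
This is not code from tla or hackerlab; the enum names and
nth_codepoint() are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum enc_form
      {
        ENC_ISO_8859_1,          /* one byte == one Unicode codepoint (a subset) */
        ENC_ASCII_NON_SPECIFIC,  /* bytes 0..127 are codepoints; 128..255 opaque */
        ENC_BOGUS_32             /* 32-bit units; values need not be Unicode */
      };

    /* Return the Nth codepoint, or -1 when the encoding form cannot answer
       the question (the "non-specific" case) or when N is out of range.  */
    static int64_t
    nth_codepoint (enum enc_form form, const unsigned char *data,
                   size_t len, size_t n)
    {
      switch (form)
        {
        case ENC_ISO_8859_1:
          /* Each byte *is* a Unicode codepoint (U+0000..U+00FF).  */
          return (n < len) ? (int64_t) data[n] : -1;

        case ENC_ASCII_NON_SPECIFIC:
          if (n >= len)
            return -1;
          /* A byte in 128..255 stands for *some* single codepoint, but we
             cannot say which one, so we refuse to answer.  */
          return (data[n] < 128) ? (int64_t) data[n] : -1;

        case ENC_BOGUS_32:
          {
            uint32_t unit;
            if ((n + 1) * 4 > len)
              return -1;
            memcpy (&unit, data + n * 4, 4);  /* host byte order, for brevity */
            return (int64_t) unit;            /* may lie outside Unicode */
          }
        }
      return -1;
    }

    int
    main (void)
    {
      const unsigned char msg[] = "caf\xe9";   /* "café" stored as iso-8859-1 */

      /* 233 (U+00E9): iso-8859-1 bytes are codepoints.  */
      printf ("%lld\n", (long long) nth_codepoint (ENC_ISO_8859_1, msg, 4, 3));
      /* -1: under ascii+non-specific the byte 0xE9 is opaque.  */
      printf ("%lld\n", (long long) nth_codepoint (ENC_ASCII_NON_SPECIFIC, msg, 4, 3));
      return 0;
    }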
Some rules of thumb:
1) Unless there is a really good reason, don't require anything more
than ascii+non-specific.  Arch -- both tla and the (conceptual)
specification -- is (deliberately) still in this category.
2) Applications inevitably will have to deal with mixed encoding
forms.
There is a myth going around that, in the future, everything
will be UTF-8.  That would be nice, if it were true, but it
surely will not be.  Some applications (e.g., for high-performance
text processing) will most assuredly _not_ want to use UTF-8
internally. For example, such applications may want to trade
space for speed by using only fixed-width encodings. On
the other hand, many _interfaces_ (e.g., to network protocols)
are very likely to insist on UTF-8 (for example, due to its compact
nature for many kinds of data).
A simple-minded solution would just impose a thin layer over
interfaces where that layer converts from one encoding form to
another. That would be inefficient and in some cases (e.g.,
interfaces to side-effecting functions) inadequate.
Instead, the (relatively small) set of core string-manipulation
primitives should have interfaces which are encoding-system
agnostic.  For example, one should be able to (properly)
concatenate a UTF-8 string and a UTF-16 string (there is a
sketch of what that could look like after these rules of thumb).
That's a pain in the neck to do -- you need a good library of
encoding-agnostic string primitives to make it practical.
But on the other hand, in the end, all of your higher-level code
is encoding-form agnostic -- expressed just in terms of
abstract string primitives.
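
Here is a rough sketch of what that looks like in practice.  None of
this is tla or hackerlab code -- the type and function names are
invented, the decoders do no error checking, and the result encoding
is arbitrarily fixed at UTF-8 -- but it shows the shape of the idea:
the higher-level code (concat_to_utf8) only ever touches the abstract
"next codepoint" primitive, never the encoding forms themselves:

    #include <stdint.h>
    #include <stdio.h>

    /* A string is a byte sequence plus a "next codepoint" primitive.  */
    typedef struct abstract_string
    {
      const unsigned char *data;
      size_t len;
      size_t pos;
      int (*next) (struct abstract_string *s, uint32_t *cp);  /* 1 = got one, 0 = end */
    } abstract_string;

    /* Decode the next codepoint from UTF-8 (no validation, for brevity).  */
    static int
    utf8_next (abstract_string *s, uint32_t *cp)
    {
      unsigned char b;
      int extra, i;

      if (s->pos >= s->len)
        return 0;
      b = s->data[s->pos++];
      if (b < 0x80)      { *cp = b;        extra = 0; }
      else if (b < 0xe0) { *cp = b & 0x1f; extra = 1; }
      else if (b < 0xf0) { *cp = b & 0x0f; extra = 2; }
      else               { *cp = b & 0x07; extra = 3; }
      for (i = 0; i < extra && s->pos < s->len; ++i)
        *cp = (*cp << 6) | (s->data[s->pos++] & 0x3f);
      return 1;
    }

    /* Decode the next codepoint from little-endian UTF-16, joining surrogates.  */
    static int
    utf16le_next (abstract_string *s, uint32_t *cp)
    {
      uint32_t unit, low;

      if (s->pos + 1 >= s->len)
        return 0;
      unit = s->data[s->pos] | (s->data[s->pos + 1] << 8);
      s->pos += 2;
      if (unit >= 0xd800 && unit <= 0xdbff && s->pos + 1 < s->len)
        {
          low = s->data[s->pos] | (s->data[s->pos + 1] << 8);
          s->pos += 2;
          unit = 0x10000 + ((unit - 0xd800) << 10) + (low - 0xdc00);
        }
      *cp = unit;
      return 1;
    }

    /* Append one codepoint to a UTF-8 buffer; return the bytes written.  */
    static size_t
    utf8_put (uint32_t cp, unsigned char *out)
    {
      if (cp < 0x80)    { out[0] = cp; return 1; }
      if (cp < 0x800)   { out[0] = 0xc0 | (cp >> 6);
                          out[1] = 0x80 | (cp & 0x3f); return 2; }
      if (cp < 0x10000) { out[0] = 0xe0 | (cp >> 12);
                          out[1] = 0x80 | ((cp >> 6) & 0x3f);
                          out[2] = 0x80 | (cp & 0x3f); return 3; }
      out[0] = 0xf0 | (cp >> 18);         out[1] = 0x80 | ((cp >> 12) & 0x3f);
      out[2] = 0x80 | ((cp >> 6) & 0x3f); out[3] = 0x80 | (cp & 0x3f); return 4;
    }

    /* Concatenate two strings of *different* encoding forms into UTF-8,
       going only through the abstract next-codepoint primitive.  */
    static size_t
    concat_to_utf8 (abstract_string *a, abstract_string *b, unsigned char *out)
    {
      uint32_t cp;
      size_t n = 0;

      while (a->next (a, &cp))
        n += utf8_put (cp, out + n);
      while (b->next (b, &cp))
        n += utf8_put (cp, out + n);
      return n;
    }

    int
    main (void)
    {
      /* "caf" as UTF-8 and "é!" as UTF-16LE.  */
      abstract_string a = { (const unsigned char *) "caf", 3, 0, utf8_next };
      static const unsigned char e_bang[] = { 0xe9, 0x00, '!', 0x00 };
      abstract_string b = { e_bang, 4, 0, utf16le_next };

      unsigned char out[64];
      size_t n = concat_to_utf8 (&a, &b, out);
      fwrite (out, 1, n, stdout);   /* prints "café!" */
      putchar ('\n');
      return 0;
    }

Supporting a new encoding form then means writing one new iterator,
not rewriting the higher-level code.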
Applying those rules of thumb to arch, present and future, the log
file question in particular:
Log files have two kinds of data: (1) data that is supposed to be
parsable by programs, such as arch itself; (2) data that is supposed
to be human readable but nothing more.
An easy way to solve (1) without having to think too hard about (2) is
to say that, at least at some layers of arch, all log files are
"ascii+non-specific" text.
What about (2)? Arch itself doesn't care about how that's encoded
other than that it has to be compatible with ascii+non-specific.
Other programs, such as an archive browser, need to actually care
about data of type (2).
There is a standard solution in this kind of situation: from the
ascii+non-specific point of view there should be some data that says
how the "non-specific" data is encoded. In this case, that means
adding an encoding header to log messages, picking a namespace for
encodings, making iso-8859-1 the retroactive default, and that's that.
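
As a sketch of what a program reading log messages would do with that
(the header name "Charset:" below is only a placeholder -- the real
header name and the encoding namespace are exactly the things that
would have to be picked):

    #include <stdio.h>
    #include <string.h>

    /* Return the declared encoding of a log message's human-readable parts,
       falling back to the retroactive default of iso-8859-1.  Log messages
       are RFC-822-ish: "Name: value" headers, a blank line, then the body.  */
    static const char *
    log_body_encoding (const char *log)
    {
      static char value[64];
      const char *p = log;

      while (*p && *p != '\n')       /* a blank line ends the headers */
        {
          if (strncmp (p, "Charset:", 8) == 0)
            {
              const char *v = p + 8;
              size_t n = 0;

              while (*v == ' ')
                ++v;
              while (v[n] && v[n] != '\n' && n < sizeof value - 1)
                ++n;
              memcpy (value, v, n);
              value[n] = '\0';
              return value;
            }
          p = strchr (p, '\n');      /* skip to the next header line */
          if (!p)
            break;
          ++p;
        }
      return "iso-8859-1";           /* no header: the retroactive default */
    }

    int
    main (void)
    {
      const char *with    = "Summary: fix typo\nCharset: utf-8\n\nbody text\n";
      const char *without = "Summary: fix typo\n\nbody text\n";

      printf ("%s\n", log_body_encoding (with));     /* utf-8 */
      printf ("%s\n", log_body_encoding (without));  /* iso-8859-1 */
      return 0;
    }

Arch itself can keep treating the whole file as ascii+non-specific;
only programs that want to display the human-readable parts need to
look at this header.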
The funny thing about that solution is that it means arch will work
perfectly well not only for Unicode and subsets of Unicode, but for
_any_ character set that happens to conform to "ascii+non-specific".
I would have to go out of my way to forbid people to use such
character sets for log messages.
-t