bug-teseq
Re: [Bug-teseq] codesets are not shown


From: Micah Cowan
Subject: Re: [Bug-teseq] codesets are not shown
Date: Sat, 23 Aug 2008 11:40:03 -0700
User-agent: Thunderbird 2.0.0.16 (X11/20080724)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bruno Haible wrote:
> Hello Micah,
> 
> You wrote on 2008-08-06:
>> It was a conscious decision not to include descriptions for each
>> registered character set defined in the IR registry. Mainly motivated by
>> laziness ;)
>> ...
>> I'll be ecstatic to accept any
>> patches anyone might offer to provide these details.
> 
> Find attached a patch that implements these descriptions.

Excellent, thanks!

>> The single-byte encodings have some coverage, but also not complete, and
>> also not particularly distinguished. For example, the final bytes 5/10
>> and 6/8 are both reported as designating "Spanish" charsets, without any
>> distinguishing information between them.
> 
> The modified code makes this distinction:
> 
> $ printf '\x1b\x29\x5A' | teseq 
> : Esc ) Z
> & G1D4: G1-DESIGNATE 94-SET
> " Designate 94-character set Z (ISO646-ES) to G1.
> $ printf '\x1b\x29\x68' | teseq 
> : Esc ) h
> & G1D4: G1-DESIGNATE 94-SET
> " Designate 94-character set h (ISO646-ES2) to G1.
> 
> The names ISO646-ES, ISO646-ES2 are taken from glibc in this case.

Which in turn come from the IANA registry.

I was hoping, though, to keep the simpler names ("Spanish", etc.) so that
it is crystal clear to the reader which language the character set was
intended for (where this is known). But I can add these back in later.
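
A minimal sketch of how both kinds of name could sit side by side in one
lookup. It is written in Python for brevity and is not Teseq's actual code;
the table CHARSET_94 and the helpers col_row and describe are hypothetical.
Only the two registry-style names are taken from the examples above.

```python
# Hypothetical table: final byte -> (registry-style name, intended language).
# The registry-style names are the ones shown in the examples in this thread;
# the language label is the kind of "simpler name" discussed above.
CHARSET_94 = {
    "Z": ("ISO646-ES", "Spanish"),
    "h": ("ISO646-ES2", "Spanish"),
}


def col_row(byte):
    """Format a byte in column/row notation, e.g. 0x5A -> '5/10'."""
    return f"{byte >> 4}/{byte & 0x0F}"


def describe(final):
    """Describe a 94-character set, giving both names when they are known."""
    entry = CHARSET_94.get(final)
    if entry is None:
        return f"94-character set {final}"
    name, language = entry
    return f"94-character set {final} ({name}, {language})"


for final in "Zh":
    print(col_row(ord(final)), describe(final))
```

With both fields present, the two Spanish sets stay distinguishable while
the description still says which language they were meant for.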

>> Also in my mind, was that in typical multibyte charset usage, the user
>> will know whether the multibyte charset being designated is Chinese,
>> Japanese, Korean or whatever (though, of course, they won't know which
>> plane will be invoked, unless they knew the final bytes anyway).
> 
> I don't agree. The education level of CJK people on these topics is not
> high: I sometimes get libiconv support emails from people in Japan
> who don't know that ISO-2022-JP is based on JISX0208; they ask me to change
> the converter so that it supports characters in CP932.

What I meant was that a user who puts Japanese text through Teseq will
have very good reason to expect that the multibyte charset being
switched to is most likely Japanese. :)

>> What'd be potentially even cooler would be if Teseq remembered the
>> currently-invoked charsets, and used them in rendering the content of
>> the text lines (that's discussed in the Future Enhancements section of
>> the manual). Then it's obvious what characters are being printed to Ecma
>> 48 / ISO 2022-capable terminals.
> 
> Well, I think that would be an entirely new program. teseq as-is is a
> debugging tool. If it was to convert text lines, it would be a converter
> program (and would have to include a lot more tables for ISO IR registered
> character sets than GNU libiconv or GNU libc contain).

That's one way to look at it. Another point of view is that, by
replacing all functional Ecma 35 escapes with "explanations" and leaving
only garbled ASCII behind (which will be quite unreadable, since those
bytes aren't meant to represent themselves), Teseq produces output that
can be _less_ easy to read, for debugging or any other purpose. That is,
the original was readable in Japanese (or what have you); the "annotated"
version ought to be, as well.

Of course, doing this conversion directly in Teseq, replacing the
literal characters of a text line with the viewable characters, which
can no longer be converted back to their original identities, means
information loss. Since Teseq normally produces output with no
information loss, a separate "viewer" program that groks Teseq output
could instead perform the appropriate conversions for just the text
lines, using state gleaned from the escape and control-character lines.
Still, since Teseq _already_ has to recognize escape sequences, there's a
good case for putting the state-specific code there.
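
To make the idea of "state gleaned from escape lines" concrete, here is a
rough sketch in Python (not Teseq's code). It tracks only single-byte
94-set designations (Esc followed by "(", ")", "*" or "+"), ignores
invocation via shift functions, and assumes G0 starts out as set "B". The
function name track_designations and the DESIGNATORS table are hypothetical.

```python
ESC = 0x1B
# Esc followed by one of these bytes designates a 94-character set to the
# given G-set. Multibyte designations are left out of this sketch.
DESIGNATORS = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}


def track_designations(data):
    """Return (text_run, state) pairs; state is a snapshot of the set
    identifiers designated to G0..G3 at the time the text was seen."""
    state = {"G0": "B", "G1": None, "G2": None, "G3": None}  # assumed start
    runs = []
    text = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and data[i + 1] in DESIGNATORS:
            j = i + 2
            # Skip any further intermediate bytes (0x20-0x2F).
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1
            if j < len(data) and 0x30 <= data[j] <= 0x7E:
                if text:
                    runs.append((bytes(text), dict(state)))
                    text.clear()
                gset = DESIGNATORS[data[i + 1]]
                state[gset] = data[i + 2:j + 1].decode("ascii")
                i = j + 1
                continue
        # Anything else is treated as plain text in this sketch.
        text.append(data[i])
        i += 1
    if text:
        runs.append((bytes(text), dict(state)))
    return runs
```

Each text run comes back paired with the designations in force when it was
read, which is the information a converter or viewer would need.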

As to finding the code to perform the conversions, the program called
"luit" (which is apparently now a built-in part of xterm?) performs
ISO-2022 to UTF-8 conversions, and could probably form a solid
foundation for such a mode. But I never said it would have to have full
support at any rate: it might well be that it would just handle whatever
iconv can.
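
As an illustration of the "handle whatever iconv can" approach, the sketch
below uses Python's built-in iso2022_jp codec as a stand-in for iconv. The
helper render_text_line is hypothetical, and the fallback to an escaped
ASCII rendering is an assumption about what a reasonable default might be.

```python
def render_text_line(raw, codec="iso2022_jp"):
    """Decode raw bytes with a stateful ISO-2022 codec. If the codec is
    unknown or cannot handle the input, fall back to ASCII with escapes."""
    try:
        return raw.decode(codec)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("ascii", errors="backslashreplace")


# The bytes between the two escape sequences are the "garbled ASCII"
# ($3$s$K$A$O) that is left once the escapes are stripped out.
raw = b"\x1b$B$3$s$K$A$O\x1b(B"
print(render_text_line(raw))
```

Anything the codec does not cover would simply stay in its escaped form.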

Aside from character conversions, though, Teseq would benefit from state
memory for other reasons as well: for instance, it currently makes
potentially faulty assumptions about the settings of various modes, in
its hard-coded descriptions of various control functions which depend on
those modes. It would be beneficial for Teseq to render more accurate
settings in the event that it has already seen relevant mode-setting
sequences.
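
A small sketch of what mode memory could look like, again in Python and
not taken from Teseq. It follows only the standard SET MODE (CSI ... h)
and RESET MODE (CSI ... l) control sequences with numeric parameters, and
ignores private-parameter forms. The function name track_modes is
hypothetical.

```python
import re

# CSI Ps ; Ps ... h is SET MODE (SM); the same with final byte l is
# RESET MODE (RM). Only the 7-bit "Esc [" form of CSI is matched here.
SM_RM = re.compile(rb"\x1b\[([0-9;]*)([hl])")


def track_modes(data, initially_set=frozenset()):
    """Return the set of mode numbers left set after processing data."""
    modes = set(initially_set)
    for match in SM_RM.finditer(data):
        params = [int(p) for p in match.group(1).split(b";") if p]
        if match.group(2) == b"h":
            modes.update(params)
        else:
            modes.difference_update(params)
    return modes


# Mode 4 is IRM (insertion replacement mode) in Ecma 48.
print(track_modes(b"\x1b[4hsome text"))
```

A description of a control function that depends on, say, IRM could then
consult this set instead of assuming the default.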

As cool as such a feature might be, though, it's likely I'd never get
around to it: Teseq already does what I wish it to, with a handful of
minor exceptions, so my motivation to add major enhancements is fairly
low. Wget and Screen certainly take up enough of my time, without having
to share it with Teseq. :)

> About the appended patch:
> 
> - It is correct as far as I can tell.
> 
> - It also fixes a bug regarding the printed name of 94-character code sets
>   with intermediate byte 0x21. Example:
> 
>   $ printf '\x1b\x29\x21\x41\n' | teseq 
>   : Esc ) ! A
>   & G1D4: G1-DESIGNATE 94-SET
>   " Designate 94-character set A (ISO646, British) to G1.
>   . LF/^J
> 
>   Wrong! Fixed to say this:
> 
>   $ printf '\x1b\x29\x21\x41\n' | teseq 
>   : Esc ) ! A
>   & G1D4: G1-DESIGNATE 94-SET
>   " Designate 94-character set !A (ISO646-CU) to G1.
>   . LF/^J

Hm. I took the approach for Teseq of always writing the naïve,
common-case solution first, and then fleshing out the rest. I had
intended to do this, but apparently missed it before the release. Thanks!
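
For reference, the fix amounts to treating every intermediate byte after
the designator as part of the set's identifier. A minimal Python sketch
(not the patch itself, and covering single-byte designations only); the
names split_escape and charset_id are hypothetical:

```python
ESC = 0x1B


def split_escape(seq):
    """Split an escape sequence into (intermediate bytes, final byte).

    Intermediate bytes are 0x20-0x2F; the final byte is 0x30-0x7E.
    """
    if not seq or seq[0] != ESC:
        raise ValueError("not an escape sequence")
    intermediates = bytearray()
    for b in seq[1:]:
        if 0x20 <= b <= 0x2F:
            intermediates.append(b)
        elif 0x30 <= b <= 0x7E:
            return bytes(intermediates), b
        else:
            break
    raise ValueError("incomplete or malformed escape sequence")


def charset_id(seq):
    """Return the identifier of the designated set, e.g. 'A' or '!A'."""
    intermediates, final = split_escape(seq)
    # The first intermediate says which G-set is being designated; any
    # further intermediates belong to the set's identifier.
    return (intermediates[1:] + bytes([final])).decode("ascii")


print(charset_id(b"\x1b\x29\x41"))      # identifier without an extra intermediate
print(charset_id(b"\x1b\x29\x21\x41"))  # identifier with intermediate 0x21
```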

> - In the iso_ir_names array I tried to put the most well-known name of
>   each encoding. This may be the one used by glibc, or from other known
>   sources. In cases where I could not even make up a reasonable name by
>   myself - I marked these as "not an official name" - I used "ISO-IR-nn"
>   with a number.
> 
> - If I were you, I would move out the variables and functions
>   iso_ir_names ... iso_ir_c1_name to a new file, say "iso-ir.h", and include
>   that in teseq.c. I did not do it in this patch because this is not a
>   decision in my domain.

That seems reasonable to me. I'll probably do this, if I get around to it.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIsFmD7M8hyUobTrERAnRAAJ0UkK+lNy9o4blritE0QzURGVtepgCfZxSj
O6DyFeiLlKIZjmiwnm2PNnc=
=FvUD
-----END PGP SIGNATURE-----



