Re: [Bug-teseq] codesets are not shown
From: Bruno Haible
Date: Sat, 23 Aug 2008 16:15:27 +0200
User-agent: KMail/1.5.4
Hello Micah,
You wrote on 2008-08-06:
> It was a conscious decision not to include descriptions for each
> registered character set defined in the IR registry. Mainly motivated by
> laziness ;)
> ...
> I'll be ecstatic to accept any
> patches anyone might offer to provide these details.
Find attached a patch that implements these descriptions.
> The single-byte encodings have some coverage, but also not complete, and
> also not particularly distinguished. For example, the final bytes 5/10
> and 6/8 are both reported as designating "Spanish" charsets, without any
> distinguishing information between them.
The modified code makes this distinction:
$ printf '\x1b\x29\x5A' | teseq
: Esc ) Z
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set Z (ISO646-ES) to G1.
$ printf '\x1b\x29\x68' | teseq
: Esc ) h
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set h (ISO646-ES2) to G1.
The names ISO646-ES and ISO646-ES2 are taken from glibc in this case.
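The quoted text gives the final bytes in column/row notation (5/10, 6/8), while the commands above use hex. A minimal Python sketch of the conversion follows; the function name is hypothetical, and this is an illustration rather than code from teseq:

```python
def final_byte(column: int, row: int) -> int:
    """Convert column/row notation (e.g. 5/10) to a byte value.

    The column is the high nibble and the row is the low nibble.
    """
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return (column << 4) | row


for col, row in [(5, 10), (6, 8)]:
    b = final_byte(col, row)
    print(f"{col}/{row} = 0x{b:02X} = {chr(b)!r}")
# prints:
# 5/10 = 0x5A = 'Z'
# 6/8 = 0x68 = 'h'
```

So 5/10 is the "Z" and 6/8 is the "h" seen in the two teseq runs above.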
> Also in my mind, was that in typical multibyte charset usage, the user
> will know whether the multibyte charset being designated is Chinese,
> Japanese, Korean or whatever (though, of course, they won't know which
> plane will be invoked, unless they knew the final bytes anyway).
I don't agree. Awareness of these topics among CJK users is not high:
I sometimes get libiconv support emails from people in Japan who don't
know that ISO-2022-JP is based on JISX0208; they ask me to change the
converter so that it supports characters in CP932.
> What'd be potentially even cooler would be if Teseq remembered the
> currently-invoked charsets, and used them in rendering the content of
> the text lines (that's discussed in the Future Enhancements section of
> the manual). Then it's obvious what characters are being printed to Ecma
> 48 / ISO 2022-capable terminals.
Well, I think that would be an entirely new program. As it stands, teseq is a
debugging tool. If it were to convert text lines, it would be a converter
program (and would have to include many more tables for ISO-IR registered
character sets than GNU libiconv or GNU libc contain).
About the appended patch:
- It is correct as far as I can tell.
- It also fixes a bug regarding the printed name of 94-character code sets
with intermediate byte 0x21. Example:
$ printf '\x1b\x29\x21\x41\n' | teseq
: Esc ) ! A
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set A (ISO646, British) to G1.
. LF/^J
Wrong! Fixed to say this:
$ printf '\x1b\x29\x21\x41\n' | teseq
: Esc ) ! A
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set !A (ISO646-CU) to G1.
. LF/^J
- In the iso_ir_names array I tried to put the most well-known name of
each encoding. This may be the one used by glibc, or one from other known
sources. Where I made up a name myself, I marked it as "not an official
name". Where I could not even make up a reasonable name, I used
"ISO-IR-nn" with the registration number.
- If I were you, I would move the variables and functions
iso_ir_names ... iso_ir_c1_name out into a new file, say "iso-ir.h", and
include that in teseq.c. I did not do so in this patch because that decision
is not mine to make.
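
The bug fix and the naming fallback described in the list above can be sketched as follows. This is an illustrative Python sketch, not the C code from the patch: the function names and the tiny table are hypothetical, the three names are the ones shown in the teseq output in this message, and the registration numbers 17, 85 and 151 are assumptions taken from the ISO-IR registry. The point is that any intermediate byte after the first belongs to the character set identifier, so "A" and "!A" must be looked up as different sets.

```python
from typing import Optional, Tuple

# Hypothetical miniature name table: identifier -> (ISO-IR number, name).
# Names are copied from the teseq output in this message; the numbers are
# assumed from the ISO-IR registry.
SETS_94 = {
    "Z": (17, "ISO646-ES"),
    "h": (85, "ISO646-ES2"),
    "!A": (151, "ISO646-CU"),
}


def split_designation(seq: bytes) -> Tuple[str, str]:
    """Split ESC I1 [I2 ...] F into (designator, set identifier)."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("expected ESC, an intermediate byte and a final byte")
    intermediates, final = seq[1:-1], seq[-1]
    if not all(0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate bytes must be in 0x20..0x2F")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30..0x7E")
    designator = chr(intermediates[0])
    # Keep the extra intermediates: "!A" is not the same set as "A".
    set_id = (intermediates[1:] + bytes([final])).decode("ascii")
    return designator, set_id


def display_name(reg_no: int, name: Optional[str]) -> str:
    """Use the known name, else fall back to "ISO-IR-nn"."""
    return name if name is not None else f"ISO-IR-{reg_no}"


def describe(seq: bytes) -> str:
    _, set_id = split_designation(seq)
    entry = SETS_94.get(set_id)
    if entry is None:
        return f"{set_id} (not in this sketch's table)"
    reg_no, name = entry
    return f"{set_id} ({display_name(reg_no, name)})"


print(describe(b"\x1b\x29\x21\x41"))  # prints: !A (ISO646-CU)
print(describe(b"\x1b\x29\x5a"))      # prints: Z (ISO646-ES)
```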
Bruno
Attachment: teseq-patch2 (text document)