Re: [Bug-teseq] codesets are not shown


From: Bruno Haible
Subject: Re: [Bug-teseq] codesets are not shown
Date: Sat, 23 Aug 2008 16:15:27 +0200
User-agent: KMail/1.5.4

Hello Micah,

You wrote on 2008-08-06:
> It was a conscious decision not to include descriptions for each
> registered character set defined in the IR registry. Mainly motivated by
> laziness ;)
> ...
> I'll be ecstatic to accept any
> patches anyone might offer to provide these details.

Find attached a patch that implements these descriptions.

> The single-byte encodings have some coverage, but also not complete, and
> also not particularly distinguished. For example, the final bytes 5/10
> and 6/8 are both reported as designating "Spanish" charsets, without any
> distinguishing information between them.

The modified code makes this distinction:

$ printf '\x1b\x29\x5A' | teseq 
: Esc ) Z
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set Z (ISO646-ES) to G1.
$ printf '\x1b\x29\x68' | teseq 
: Esc ) h
& G1D4: G1-DESIGNATE 94-SET
" Designate 94-character set h (ISO646-ES2) to G1.

The names ISO646-ES, ISO646-ES2 are taken from glibc in this case.
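For readers unfamiliar with the column/row notation quoted above, here is a minimal sketch in C of how "5/10" and "6/8" map to the final bytes 0x5A ('Z') and 0x68 ('h') used in the two commands. The helper name colrow_to_byte is made up for illustration and is not part of teseq or of the patch.

```c
#include <assert.h>

/* Hypothetical helper, not taken from teseq: convert the column/row
   notation used by ECMA-35 / ISO 2022 (e.g. 5/10) into a byte value.
   The column is the high nibble, the row is the low nibble. */
static unsigned char colrow_to_byte(unsigned col, unsigned row)
{
    assert(col < 16 && row < 16);
    return (unsigned char)((col << 4) | row);
}
```

With this, colrow_to_byte(5, 10) yields 0x5A, the 'Z' in "Esc ) Z", and colrow_to_byte(6, 8) yields 0x68, the 'h' in "Esc ) h".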

> Also in my mind, was that in typical multibyte charset usage, the user
> will know whether the multibyte charset being designated is Chinese,
> Japanese, Korean or whatever (though, of course, they won't know which
> plane will be invoked, unless they knew the final bytes anyway).

I don't agree. The education level of CJK people on these topics is not
high: I sometimes get libiconv support emails from people in Japan
who don't know that ISO-2022-JP is based on JISX0208; they ask me to change
the converter so that it supports characters in CP932.

> What'd be potentially even cooler would be if Teseq remembered the
> currently-invoked charsets, and used them in rendering the content of
> the text lines (that's discussed in the Future Enhancements section of
> the manual). Then it's obvious what characters are being printed to Ecma
> 48 / ISO 2022-capable terminals.

Well, I think that would be an entirely new program. teseq as-is is a debugging
tool. If it were to convert text lines, it would be a converter program (and
would have to include a lot more tables for ISO-IR registered character sets
than GNU libiconv or GNU libc contain).


About the appended patch:

- It is correct as far as I can tell.

- It also fixes a bug regarding the printed name of 94-character code sets
  with intermediate byte 0x21. Example:

  $ printf '\x1b\x29\x21\x41\n' | teseq 
  : Esc ) ! A
  & G1D4: G1-DESIGNATE 94-SET
  " Designate 94-character set A (ISO646, British) to G1.
  . LF/^J

  Wrong! Fixed to say this:

  $ printf '\x1b\x29\x21\x41\n' | teseq 
  : Esc ) ! A
  & G1D4: G1-DESIGNATE 94-SET
  " Designate 94-character set !A (ISO646-CU) to G1.
  . LF/^J

- In the iso_ir_names array I tried to put the most well-known name of
  each encoding. This may be the one used by glibc, or from other known
  sources. In cases where I could not even make up a reasonable name by
  myself - I marked these as "not an official name" - I used "ISO-IR-nn"
  with a number.

- If I were you, I would move the variables and functions
  iso_ir_names ... iso_ir_c1_name out to a new file, say "iso-ir.h", and include
  that in teseq.c. I did not do it in this patch because that decision is
  not mine to make.
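To illustrate the intermediate-byte issue from the second point above: the name lookup has to be keyed on the intermediate byte together with the final byte, otherwise "! A" is reported as plain "A". The following is only a sketch of that idea and not the code from the patch. The struct, the table names_94 and the function lookup_94_name are hypothetical, and the table holds only the four names that appear in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry (not the layout of iso_ir_names in the patch):
   a 94-character set is identified by the intermediate byte(s) that
   follow the designator, if any, plus the final byte. */
struct charset_name {
    const char *intermediates;  /* "" for none, "!" for 0x21 */
    unsigned char final;        /* final byte of the escape sequence */
    const char *name;
};

/* Only the names mentioned in this mail; a real table is far larger. */
static const struct charset_name names_94[] = {
    { "",  'A', "ISO646, British" },  /* as printed by the unpatched teseq */
    { "",  'Z', "ISO646-ES" },
    { "",  'h', "ISO646-ES2" },
    { "!", 'A', "ISO646-CU" },
};

/* Return the name for (intermediates, final), or NULL if the pair is
   not in the table. Matching on the final byte alone would be the bug. */
static const char *lookup_94_name(const char *intermediates,
                                  unsigned char final)
{
    size_t i;

    assert(intermediates != NULL);
    for (i = 0; i < sizeof names_94 / sizeof names_94[0]; i++) {
        if (names_94[i].final == final
            && strcmp(names_94[i].intermediates, intermediates) == 0)
            return names_94[i].name;
    }
    return NULL;
}
```

Here lookup_94_name("", 'A') and lookup_94_name("!", 'A') return different names, which is the distinction the fixed output shows. An unknown pair returns NULL, so a caller can fall back to printing no name at all.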

Bruno

Attachment: teseq-patch2
Description: Text document

