Re: lynx-dev lynx and other character sets


From: Klaus Weide
Subject: Re: lynx-dev lynx and other character sets
Date: Wed, 30 Jun 1999 08:53:38 -0500 (CDT)

On Wed, 30 Jun 1999, Leonid Pauzner wrote:

> 26-Jun-99 20:20 Klaus Weide wrote:
> >        ----
> 
> > When display character set is NOT 'UNICODE (UTF-8)' (and not CJK or
> > transparent either), I notice something strange for all the scripts
> > Lynx doesn't understand (Armenian, Devanagari, Bengali, ...):
> > Those characters are not shown in any way, there is no indication
> > that something was missing.   Some earlier version would show
> > something like
> 
> >       Armenian
> >              U531 U532 U533 U534 U535 U536 U537 U538 U539 ...
> 
> > instead.  Leonid, was this a conscious decision?  Seems like a bug
> > to me.
> 
> I thought such indication was too technical for average lynx user
> and not very useful in fact (say, I run into japanese text
> with any european display charset).  Instead, this can be indicated
> from Info Page: [7bit chars only] / [7bit approximation was used]
> / [few not recognized characters filtered out]  or so.

But completely dropping the characters without any indication is
definitely not the right thing, in my opinion.  Whole sections of
a document may go missing.  And sometimes omission of single important
characters may be just as bad.  The user has no idea that he should
go to the Info Page in order to find out that he missed something.

The following is from the HTML 4.0 spec:

5.4 Undisplayable characters

   A user agent may not be able to render all characters in a document
   meaningfully, for instance, because the user agent lacks a suitable
   font, a character has a value that may not be expressed in the user
   agent's internal character encoding, etc.

   Because there are many different things that may be done in such
   cases, this document does not prescribe any specific behavior.
   Depending on the implementation, undisplayable characters may also be
   handled by the underlying display system and not the application
   itself. In the absence of more sophisticated behavior, for example
   tailored to the needs of a particular script or language, we recommend
   the following behavior for user agents:
    1. Adopt a clearly visible, but unobtrusive mechanism to alert the
               ^^^^^^^^^^^^^^^
       user of missing resources.
    2. If missing characters are presented using their numeric
       representation, use the hexadecimal (not decimal) form since this
       is the form used in character set standards.

> From the other hand, this hide a bug:
> when we switch "\" for source mode we got a different output
> for few notrecognized 8-bit characters when we uncomment the code
> you are asking for (have not remember details now).

Hiding a bug is not a good enough reason.  (Even if I may be responsible
for it...)

I think the current behavior, dropping those characters completely
without notice, is the worst of possible choices.  For example, if
you think 'Uxxx' is too bad (I don't, but I can understand disagreement),
showing a '?' for each missing character would be better than nothing.
(A Warning message, probably combined with that, would be better.
But of course it should not appear for each character, maybe only once
per loaded document.)
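
To make the alternatives concrete, here is a minimal sketch in Python.
It is not Lynx code: the function name `render`, its parameters, and the
wording of the warning are all made up for illustration, and Python's
codec lookup merely stands in for Lynx's translation tables.  Each
character the display character set cannot represent is replaced by a
hexadecimal 'Uxxx' placeholder (or by '?'), and a single warning is
produced per document rather than one per character.

```python
def render(text, display_charset="iso-8859-1", style="hex"):
    """Return (output, warning) for text shown in display_charset.

    Hypothetical helper: characters that cannot be encoded in the
    display character set are replaced instead of being dropped.
    """
    out = []
    missing = 0
    for ch in text:
        try:
            ch.encode(display_charset)
            out.append(ch)
        except UnicodeEncodeError:
            missing += 1
            # Hexadecimal placeholder, as the HTML 4.0 spec recommends,
            # or a plain '?' when 'Uxxx' is considered too technical.
            out.append("U%X" % ord(ch) if style == "hex" else "?")
    # At most one warning per document, not one per character.
    warning = None
    if missing:
        warning = "%d character(s) could not be displayed" % missing
    return "".join(out), warning


if __name__ == "__main__":
    armenian = "\u0531 \u0532 \u0533"
    print(render(armenian))             # hexadecimal placeholders
    print(render(armenian, style="?"))  # one '?' per missing character
```

Either style leaves a visible trace in the rendered text, which is the
point; which one is less annoying is a matter of taste.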

> >       ----
> 
> > Another observation: in the situation of the previous section,
> > force Raw Mode on.  This has to be done from the 'O'ptions screen,
> > since '@' is now disabled for explicit charset.  The missing characters
> > (or some of them) are now shown in some kind of 'raw' way.  This is
> > also the case in an earlier lynx version I keep around for reference
> > ("2.7.1ac-0.91"), but in a different way.  I think I found this
> > somewhat useful a long time ago for certain kinds of broken "utf-8"
> > documents, that's why it was there, and apparently it has survived.
> 
> I have a little experience with "utf-8" pages
> but seems documents in normal "8bit charsets" feel good without this mode.

Expect that "utf-8" will become more "normal".

I don't insist that the "some kind of 'raw' way" is a good thing or
should be kept.  I was just telling you it's still there. :)

Actually, since 'Uxxx' was dropped, the number of characters to which
it is applied seems to have increased... On the other hand, if we had an
agreed way for handling "missing" characters there wouldn't be a place
for this kind of thing anyway.

> > If you want to pursue this further, I can try to dig up the page(s)
> > where I found this useful.
> 
> Please give examples.

Still looking for them...

> You mean to overload "Raw Mode" key for `visualizing' (few)
> unrecoverable characters while the usual meaning of that key is another...

The '@' key to which it corresponds (and the -raw flag) has always been
overloaded.  One could say it isn't needed any more if 'assumed charset'
and 'document character set' can be selected separately.  Especially
if it doesn't do some other things too...

It is still a sometimes convenient way for toggling between two states
though, and should be kept for (more-or-less) backward compatibility.


    Klaus

