[Freecats-Dev] HTML support & conversion filters (cont.)
From: Henri Chorand
Subject: [Freecats-Dev] HTML support & conversion filters (cont.)
Date: Sun, 09 Feb 2003 14:42:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Thierry Sourbier wrote:
>> - another closely related issue is that JCAT only understands
>> well-formed HTML and XML. Due to this, it won't be able to work
>> on at least 80% of existing HTML files. This is why we prefer a
>> "dumb" approach.
> 1. It is easier to have a "dumb" parser read well formed HTML than a
> "smart" parser able to read "dumb" HTML.
Well, for me, the question was rather: how can we read malformed HTML
at all?
;-)
> Indeed, for malformed HTML it is not only a matter of tags being
> misplaced or missing, but also of knowing what is a tag and what is
> not, e.g.: "<b> This character "<" can mess up everything </b>".
I think we can:
- start from a comprehensive list of HTML tags as defined in
http://www.w3.org/TR/html401/ (version of 24 December 1999)
- possibly add a number of widely used custom tags, like the "KADOV"
stuff in RoboHelp's files, and known extensions of MS IE and Netscape
- enable the user to add a number of "custom" tags (<something>)
When converting a supposedly HTML file:
- for each of the above tags, apply our "recognize as tag" processing
rule according to its category:
  internal (formatting), e.g. <B>
  external (structure), e.g. <P>
- consider anything else as text, replacing any "<" and ">" found in the
source file by the corresponding HTML entities, &lt; and &gt;
- convert it to Unicode according to the character encoding specified,
defaulting to Western
That way, we should be able to process most, if not all, files.
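
To make this concrete, here is a rough sketch of such a recognition pass
in Python. Everything here is illustrative: the tag lists are
abbreviated, the names are mine, and cp1252 merely stands in for
"Western"; a real filter would load the complete HTML 4.01 inventory
plus the user's custom tags.

import re

# Abbreviated tag lists, for illustration only; the real filter would
# load the complete HTML 4.01 inventory plus user-defined custom tags.
INTERNAL = {"b", "i", "u", "em", "strong", "font", "span"}      # formatting
EXTERNAL = {"p", "div", "h1", "h2", "table", "tr", "td", "br"}  # structure

TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*)?/?>")

def escape_text(text):
    # Stray "<" and ">" in plain text become entities, so they can no
    # longer be mistaken for markup.
    return text.replace("<", "&lt;").replace(">", "&gt;")

def tokenize(source):
    """Yield (kind, chunk) pairs; kind is 'internal', 'external' or 'text'."""
    pos = 0
    for m in TAG_RE.finditer(source):
        if m.start() > pos:
            yield ("text", escape_text(source[pos:m.start()]))
        name = m.group(1).lower()
        if name in INTERNAL:
            yield ("internal", m.group(0))
        elif name in EXTERNAL:
            yield ("external", m.group(0))
        else:
            # Anything unrecognized is demoted to plain text and escaped.
            yield ("text", escape_text(m.group(0)))
        pos = m.end()
    if pos < len(source):
        yield ("text", escape_text(source[pos:]))

def read_source(path, encoding="cp1252"):
    # Decode with the declared charset when there is one; cp1252 stands
    # in here for the "Western" default mentioned above.
    with open(path, "rb") as f:
        return f.read().decode(encoding, errors="replace")

Note that on Thierry's example, the stray "<" before the quote cannot
match TAG_RE (no tag name follows it), so it falls through as text and
gets escaped.
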
Remember that we don't want to alter the file's structure & text
contents. We only want to enable the translator to edit text contents.
That way, the translator will still be able to deal manually with any
unrecognized, weird "tag", simply by leaving it untouched wherever it
appears.
So, with a "garbage" file, the only risk is keeping such "weird" "tags"
in the segments; we won't try to understand them. My localization
experience indicates that such custom "tags" are more often found in the
document's structure (e.g. "kadov" tags in headers) than actually mixed
with the text.
Internal tags are to be kept within TUs' source & target segments. The
translator will freely decide whether to keep them (possibly moving
them), to delete them, or to add new ones (only if considered internal)
in the target segment.
External tags are to be kept outside TUs, then restored as they were.
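
A tiny worked example (the skeleton representation below is only my way
of picturing it, not a decided format):

source = "<p>Press <b>OK</b> to continue.</p>"

# External tags stay in the file's skeleton, outside any TU, and are
# restored verbatim on export:
skeleton = ["<p>", "{TU 1}", "</p>"]

# Internal tags travel inside the TU's source segment, where the
# translator may keep, move or delete them:
tu_1_source = "Press <b>OK</b> to continue."
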
That way, even with "very bad" formatting, we don't risk producing a
worse output.
> 2. Most malformed HTML files can be made compliant to the standard by
> running them through Tidy. See http://tidy.sourceforge.net/. In a web
> l10n product I worked on before, Tidy was part of the workflow.
Well, why not?
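
If we ever want such a pre-pass, it could be as simple as the call below
(a sketch only; this particular option set is just a plausible choice,
and Tidy's habit of exiting non-zero on mere warnings is deliberately
tolerated):

import subprocess

# Optional pre-pass: let HTML Tidy repair the markup in place before our
# own filter runs ("-q" quiet, "-m" modify in place, "-asxhtml" output
# well-formed XHTML, "-utf8" output UTF-8).
subprocess.run(["tidy", "-q", "-m", "-asxhtml", "-utf8", "input.html"],
               check=False)
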
Anyway, I suggest we first begin writing a simple pair of conversion
filters for ANSI text-only files (Notepad text files), in order to be
able to test the server.
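
Such a pair could look roughly like this (one segment per non-empty line
is just my assumption here; segmentation rules and the skeleton format
are still open):

def text_import(path, encoding="cp1252"):
    """Read an ANSI (Notepad) file; return a skeleton plus its segments."""
    with open(path, "rb") as f:
        lines = f.read().decode(encoding).splitlines()
    skeleton, segments = [], []
    for line in lines:
        if line.strip():
            skeleton.append(len(segments))  # placeholder: segment index
            segments.append(line)           # translatable text
        else:
            skeleton.append(line)           # keep empty lines verbatim
    return skeleton, segments

def text_export(path, skeleton, segments, encoding="cp1252"):
    """Rebuild the file, substituting the (translated) segments back in."""
    out = [segments[item] if isinstance(item, int) else item
           for item in skeleton]
    with open(path, "wb") as f:
        f.write("\n".join(out).encode(encoding))
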
The next useful "exercise" could be to develop a number of still very
simple conversion filters for some common resource files (including
common online help formats), such as:
.CNT, .HPJ, .HHK, .HHC, .HHP, .RC
Note that nothing exists for most of these in proprietary CAT tools.
>> Thierry, what would you think about establishing a contact for us
>> with Yves?
> I've already pointed him to the project home page.
Thanks for this!
Regards,
Henri