[Freecats-Dev] HTML support & conversion filters (cont.)
From: Henri Chorand
Subject: [Freecats-Dev] HTML support & conversion filters (cont.)
Date: Sun, 09 Feb 2003 14:42:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Thierry Sourbier wrote:
>> - another closely related issue is that JCAT only understands
>> well-formed HTML and XML. Due to this, it won't be able to work
>> on at least 80% of existing HTML files. This is why we prefer a
>> "dumb" approach.
> 1. It is easier to have a "dumb" parser read well formed HTML than a
> "smart" parser able to read "dumb" HTML.
Well, for me, the question was rather: how can we read malformed HTML
at all?
;-)
> Indeed, for malformed HTML it is not only a matter of tags being
> misplaced or missing, but also of knowing what is a tag and what is
> not, e.g.: "<b> This character "<" can mess up everything </b>".
I think we can:
- start from a comprehensive list of HTML tags as defined in
http://www.w3.org/TR/html401/ (version of 24 December 1999)
- possibly add a number of widely used custom tags, like the "KADOV"
stuff in RoboHelp's files, and known extensions of MS IE and Netscape
- enable the user to add a number of "custom" tags (<something>)
When converting a supposedly HTML file:
- for each of the above tags, apply our "recognize as tag" processing
rule according to its category:
  internal (formatting), e.g. <B>
  external (structure), e.g. <P>
- consider anything else as text, replacing any "<" and ">" found in the
source file by the corresponding HTML entities, &lt; and &gt;
- convert it to Unicode according to the character encoding specified,
defaulting to Western
That way, we should be able to process most, if not all, files.
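
To make this concrete, here is a rough sketch of such a recognition pass
in Python. Everything here is illustrative: the tag lists are
abbreviated, the names are mine, and cp1252 merely stands in for
"Western"; a real filter would load the complete HTML 4.01 inventory
plus the user's custom tags.

import re

# Abbreviated tag lists, for illustration only; the real filter would
# load the complete HTML 4.01 inventory plus user-defined custom tags.
INTERNAL = {"b", "i", "u", "em", "strong", "font", "span"}      # formatting
EXTERNAL = {"p", "div", "h1", "h2", "table", "tr", "td", "br"}  # structure

TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*)?/?>")

def escape_text(text):
    # Stray "<" and ">" in plain text become entities, so they can no
    # longer be mistaken for markup.
    return text.replace("<", "&lt;").replace(">", "&gt;")

def tokenize(source):
    """Yield (kind, chunk) pairs; kind is 'internal', 'external' or 'text'."""
    pos = 0
    for m in TAG_RE.finditer(source):
        if m.start() > pos:
            yield ("text", escape_text(source[pos:m.start()]))
        name = m.group(1).lower()
        if name in INTERNAL:
            yield ("internal", m.group(0))
        elif name in EXTERNAL:
            yield ("external", m.group(0))
        else:
            # Anything unrecognized is demoted to plain text and escaped.
            yield ("text", escape_text(m.group(0)))
        pos = m.end()
    if pos < len(source):
        yield ("text", escape_text(source[pos:]))

def read_source(path, encoding="cp1252"):
    # Decode with the declared charset when there is one; cp1252 stands
    # in here for the "Western" default mentioned above.
    with open(path, "rb") as f:
        return f.read().decode(encoding, errors="replace")

Note that on Thierry's example, the stray "<" before the quote cannot
match TAG_RE (no tag name follows it), so it falls through as text and
gets escaped.
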
Remember that we don't want to alter the file's structure & text
contents. We only want to enable the translator to edit text contents.
That way, the translator will still be able to deal manually with any
unrecognized, weird "tag", simply by leaving it untouched wherever it
appears.
So, with a "garbage" file, the only risk is keeping such "weird" "tags"
in the segments; we won't try to understand them. My localization
experience indicates that such custom "tags" are more often found in the
document's structure (e.g. "kadov" tags in headers) than actually mixed
with the text.
Internal tags are to be kept within TUs' source & target segments. The
translator will freely decide whether to keep them (possibly moving
them), to delete them, or to add new ones (only if considered internal)
in the target segment.
External tags are to be kept outside TUs, then restored as they were.
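
A tiny worked example (the skeleton representation below is only my way
of picturing it, not a decided format):

source = "<p>Press <b>OK</b> to continue.</p>"

# External tags stay in the file's skeleton, outside any TU, and are
# restored verbatim on export:
skeleton = ["<p>", "{TU 1}", "</p>"]

# Internal tags travel inside the TU's source segment, where the
# translator may keep, move or delete them:
tu_1_source = "Press <b>OK</b> to continue."
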
That way, even with "very bad" formatting, we don't risk producing a
worse output.
> 2. Most malformed HTML files can be made compliant to the standard by
> running them through Tidy. See http://tidy.sourceforge.net/. In a web
> l10n product I worked on before, Tidy was part of the workflow.
Well, why not?
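
If we ever want such a pre-pass, it could be as simple as the call below
(a sketch only; this particular option set is just a plausible choice,
and Tidy's habit of exiting non-zero on mere warnings is deliberately
tolerated):

import subprocess

# Optional pre-pass: let HTML Tidy repair the markup in place before our
# own filter runs ("-q" quiet, "-m" modify in place, "-asxhtml" output
# well-formed XHTML, "-utf8" output UTF-8).
subprocess.run(["tidy", "-q", "-m", "-asxhtml", "-utf8", "input.html"],
               check=False)
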
Anyway, I suggest we first begin writing a simple pair of conversion
filters for ANSI text-only files (Notepad text files), in order to be
able to test the server.
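
Such a pair could look roughly like this (one segment per non-empty line
is just my assumption here; segmentation rules and the skeleton format
are still open):

def text_import(path, encoding="cp1252"):
    """Read an ANSI (Notepad) file; return a skeleton plus its segments."""
    with open(path, "rb") as f:
        lines = f.read().decode(encoding).splitlines()
    skeleton, segments = [], []
    for line in lines:
        if line.strip():
            skeleton.append(len(segments))  # placeholder: segment index
            segments.append(line)           # translatable text
        else:
            skeleton.append(line)           # keep empty lines verbatim
    return skeleton, segments

def text_export(path, skeleton, segments, encoding="cp1252"):
    """Rebuild the file, substituting the (translated) segments back in."""
    out = [segments[item] if isinstance(item, int) else item
           for item in skeleton]
    with open(path, "wb") as f:
        f.write("\n".join(out).encode(encoding))
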
The next useful "exercise" could be to develop a number of still very
simple conversion filters for some common resource files (including
common online help formats), such as:
.CNT, .HPJ, .HHK, .HHC, .HHP, .RC
Note that nothing exists for most of these in proprietary CAT tools.
>> Thierry, what would you think about establishing a contact for us
>> with Yves?
> I've already pointed him to the project home page.
Thanks for this!
Regards,
Henri