Re: [Freecats-Dev] Source segments indexing method
From: Henri Chorand
Subject: Re: [Freecats-Dev] Source segments indexing method
Date: Sun, 26 Jan 2003 17:08:51 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Charles Stewart wrote:
> A general question: how ambitious is the CATS going to be?
As we don't have many developers yet, I believe we can agree on first
building a working prototype with a bare-bones server and translation
client, just to demonstrate that we can make it and to attract more people.
Even if, tomorrow, IBM or someone else knocks at the door to offer
"unlimited" resources, I still think we must treat one of our project's
strengths (starting from user requirements) as an asset, and carefully
write specifications centered around CAT so that they address
translators' needs.
I sometimes tend to consider natural language processing a potential
nest of vaporware. I mean, I started Free CATS because I felt it was
urgent to stop Mikro$oft-owned Trados from ruling the CAT world;
otherwise, I would never have dared to start this project and might be
doing less ambitious things to help free software, like translating
interesting software to help it spread (if only more end-user tools
were already available, we would not be trying to make one ;-)
> Are we going to model the hierarchical structure of language,
> or do we think this task is too hard?
> If we don't, how are we going to spot common idioms such
> as "neither... nor..."?
This is what I called "semantic level processing" in my previous post. I
really have nothing against it, but from my limited working experience
in this field, I know it's even more ambitious than CAT.
When I worked as documentation manager at the publisher of the SPIRIT
software, we used to call such words (articles, adverbs, auxiliary
verbs, etc.) "tool words" ("mots outils"), and they were not indexed.
This distinction is much less relevant for CAT, as the translator uses
CAT to translate similar sentences consistently. Imagine several similar
short sentences where only a key term differs (for instance "user code",
"client code", "supplier code", etc.):
The user code field is now highlighted.
The client code field is now highlighted.
(...)
If we compare these two sentences, 6 out of 7 words are identical and
they appear in the same order, so the fuzzy rate is going to be quite
high (6/7*100, around 86% to simplify).
If you remove the N-gram index entries which correspond to these tool
words, the translator can no longer expect to retrieve fuzzy matches as
well as before:
The user code field is now highlighted.
The client code field is now highlighted.
If we erase the tool words, the two sentences become:
user code field highlighted.
client code field highlighted.
The fuzzy rate is now around 75% (3/4*100, to simplify).
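To make the arithmetic above concrete for the coders, here is a rough
sketch (Python, purely illustrative; the word-overlap formula and the
small "tool word" list are mine, not the Free CATS matching algorithm):

    # Toy sketch of the word-overlap "fuzzy rate" discussed above.
    TOOL_WORDS = {"the", "is", "now"}   # illustrative tool words only

    def fuzzy_rate(a, b, drop_tool_words=False):
        """Naive similarity: words shared at the same position, over the
        length of the longer segment, as a percentage."""
        wa = [w.lower().strip(".") for w in a.split()]
        wb = [w.lower().strip(".") for w in b.split()]
        if drop_tool_words:
            wa = [w for w in wa if w not in TOOL_WORDS]
            wb = [w for w in wb if w not in TOOL_WORDS]
        shared = sum(1 for x, y in zip(wa, wb) if x == y)
        return 100.0 * shared / max(len(wa), len(wb))

    s1 = "The user code field is now highlighted."
    s2 = "The client code field is now highlighted."
    print(round(fuzzy_rate(s1, s2)))                         # 86
    print(round(fuzzy_rate(s1, s2, drop_tool_words=True)))   # 75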
Sorry for being so long; I wanted to be clear for non-coders.
> I'll just restrict my comment to the Source segments indexing (SSI)
> method at the moment. I've attached some excerpts from earlier emails
> to the end of this message:
> #1. What components of FreeCATS make use of the SSI method?
In my draft indexing specs document, only the server.
> Is it only to be used in building the corpus for the
> Translation Memory server, or do we use it as a preprocessor
> in translating text?
In a classic CAT tool, we don't have such preprocessors.
> #2. I agree with David Welton about starting with European languages
> for now, but I think we should make an effort to attract someone
> who knows Asian character sets. I don't think we should figure
> this stuff out for ourselves, if none of us speaks an Asian
> language. We shouldn't wait too long: if we work only with
> Indo-European languages, we might have some nasty surprises
> when we find that Korean, say, violates some assumptions we
> thought applied to all language texts;
True. See the draft document (when available); I think it's OK at least
for Chinese, but I'm dubious about Thai and all languages which don't
clearly separate words within a sentence. Still, Thai computer users
(and translators) may already have found another way of dealing with
this problem (I mean, how could one even implement a spell checker in
Thai otherwise?).
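To make the word-boundary question concrete for non-coders: character-level
N-grams can be extracted without knowing where words start and stop. A toy
sketch (Python, only the general idea, not the draft spec); whether that
alone is good enough for Thai is exactly the open question above:

    # Extract overlapping character trigrams without relying on
    # word boundaries (illustrative only).
    def char_ngrams(text, n=3):
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("user code"))
    # ['use', 'ser', 'er ', 'r c', ' co', 'cod', 'ode']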
I can try to contact two French localization companies specialized in
Eastern languages and ask for help, but ideally, we would get more
assistance from language scientists. There is a French newsgroup,
fr.sci.linguistique, on which we could post for help, and we might also
ask where to post in other languages (I can't find any similar one in
English on Usenet, but maybe I don't know where to search).
> #3. Unicode character properties: clearly it is the right thing
> to use these;
Fine, this is one of our non-ambiguous areas.
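For non-coders: in practice this means every character already carries a
general category (letter, digit, punctuation, separator...), so we can
split segments without per-language tables. A minimal sketch using
Python's standard unicodedata module (the function name is mine, nothing
here is specified yet):

    import unicodedata

    # Keep only runs of Unicode "letter" characters, whatever the script
    # (illustrative only).
    def letter_runs(text):
        tokens, current = [], ""
        for ch in text:
            if unicodedata.category(ch).startswith("L"):  # Lu, Ll, Lo...
                current += ch
            else:
                if current:
                    tokens.append(current)
                current = ""
        if current:
            tokens.append(current)
        return tokens

    print(letter_runs("«Код клиента» = the client code."))
    # ['Код', 'клиента', 'the', 'client', 'code']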
> #4. I think it is better to work directly from the source text:
We thought about it at our last Breton meeting in Quimper, and I can't
pretend our proposal is a final one.
We rejected it for a variety of reasons:
- As we want the server to only manage a TM (for performance), we prefer
converting a source file into our bilingual working format (still to be
defined in detail; it will be Unicode-based and probably look a lot like
HTML strings embedded in custom tags, with a little extra information; a
purely hypothetical example is sketched after this list).
- We want the translator to retain control of the project files locally
and avoid any Taylorist web-service approach where the translator
cannot, for instance, manually process the files one way or another.
- As we want to work with a variety of formats, the only reasonable
thing to do is to convert any of these (back and forth) into a
(bilingual) working format; otherwise even the server has to learn each
of these formats, and we need/want a lot of them: text-only, flat
resource files (Windows .RC / .H), WinHelp & HTML Help source files,
HTML (even if badly formatted), XML (idem), RTF, OpenOffice, LaTeX...
not to mention MS Office or proprietary DTP files.
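To give a rough idea of what I have in mind for that working format, a
segment might look something like this (purely hypothetical; every tag
and attribute name below is invented for the example, nothing is
specified yet):

    <seg id="42">
      <source lang="en">The <b>user code</b> field is now highlighted.</source>
      <target lang="fr">Le champ <b>code utilisateur</b> est maintenant en surbrillance.</target>
    </seg>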
> (...)
> with case-folded texts, we work, e.g., with case-insensitive matches.
Yes, we'll probably lower-case letters, but we have to keep accents.
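In other words (a toy illustration only), Unicode lower-casing keeps the
accents, which is exactly what we want:

    # Lower-casing does not strip the accents (illustrative only).
    print("Édition du CODE Utilisateur".lower())
    # 'édition du code utilisateur'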
> - We will lose potentially valuable information if we
> do things like throw away email headers;
Any info embedded within a tag will be kept in the bilingual working
format (source & target segments) but could be dropped from the TM if we
choose to use simplified internal tags.
> (...)
> #5. N-grams: easy to do this if we represent the lexicon using
> a state-transition diagram or even a recursive descent parser
> (the best are almost as fast as lexing regexps).
I need to become better acquainted with these terms.
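What I had in mind for the source segment index itself is much more
naive: roughly an inverted table from N-grams to the segments that
contain them. A toy sketch in Python (not the draft spec, just the
general shape):

    from collections import defaultdict

    # Toy N-gram inverted index over source segments (illustrative only).
    def char_ngrams(text, n=3):
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    segments = [
        "The user code field is now highlighted.",
        "The client code field is now highlighted.",
        "Enter your user code.",
    ]

    index = defaultdict(set)
    for seg_id, seg in enumerate(segments):
        for gram in char_ngrams(seg):
            index[gram].add(seg_id)

    # Candidate segments for a query are those sharing the most N-grams.
    scores = defaultdict(int)
    for gram in char_ngrams("user code"):
        for seg_id in index.get(gram, ()):
            scores[seg_id] += 1
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    # Segments 0 and 2 score highest; segment 1 only shares a few grams.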
> #6. I'm against using fuzzy matching: if we build up a big
> corpus in a language, then we will have almost all actually
> occurring misspellings in that language. Exact matching is
> much faster than fuzzy matching, and easier to design around.
Well, CAT is also about fuzzy matching :-)
More to the point, building up a large TM from legacy materials always
brings up questions about their quality. If, tomorrow, I'm contacted by
the project team of a major free software project that is interested in
Free CATS (and I hope it will happen), I'll advise them to fully review
their legacy translations, especially now that they have a better tool
at hand.
In my work at Kemper DOC, major localization agencies often offer us
projects in which the legacy TM is half-filled with rubbish, or simply
inconsistent with the project terminology, so we try to do a better job
(more handicraft than fully automated processing, but in the end the
work gets done), and we could not do it as easily without CAT and fuzzy
matching. It's a genuine Real Life Situation :-(
In a nutshell, if the user is not a qualified translator then, to parody
Murphy's Laws, garbage out is never far away, even without garbage in
(read: legacy), once a donkey takes control of the handles.
But of course, fuzzy matching ONLY comes into play if we have no exact
match to retrieve in the TM - or when we do a context search.
This is also an instance where we can first provide a crude function, to
be refined later on.
For instance:
The user code field is now highlighted.
Suppose I want to check how "user code" is translated in the TM (it's
not included in the terminology database used, or this project does not
use one). I'll highlight this string from within the translation client
and press a function key. The translation client will then display, in a
separate window, all sentences which contain this string, so that I can
provide a consistent translation, even if there is no perfect match, if
there is no fuzzy match (or none above my current threshold anyway), or
if the only fuzzy matches match some other part of my current segment
than the "user code" bit in it.
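A first, crude version of that context search is really nothing more
than a case-insensitive substring scan over the source side of the TM.
A toy sketch (Python; the data layout and names are invented for the
example, not the client's actual design):

    # Toy "context search": list every TM unit whose source segment
    # contains the highlighted string (illustrative only).
    tm = [
        ("The user code field is now highlighted.",
         "Le champ code utilisateur est maintenant en surbrillance."),
        ("Enter your user code.",
         "Saisissez votre code utilisateur."),
        ("The supplier code is optional.",
         "Le code fournisseur est facultatif."),
    ]

    def context_search(tm, text):
        needle = text.lower()
        return [(src, tgt) for src, tgt in tm if needle in src.lower()]

    for src, tgt in context_search(tm, "user code"):
        print(src, "=>", tgt)
    # Prints the first two units; "supplier code" does not match.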
Pardon me for being long and forgive my rants!
Henri