Re: [Freecats-Dev] Source segments indexing method
From: Henri Chorand
Subject: Re: [Freecats-Dev] Source segments indexing method
Date: Sun, 26 Jan 2003 17:08:51 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Charles Stewart wrote:
> A general question: how ambitious is the CATS going to be?
As we don't have many developers yet, I believe we can agree on first
building a working prototype with a bare-bones server and translation
client, just to demonstrate that we can make it and to attract more people.
Even if, tomorrow, IBM or someone else knocks at the door to offer
"unlimited" resources, I still think we must treat one of our project's
strengths (starting from user requirements) as an asset, and carefully
write specifications centered around CAT so that they address
translators' needs.
I sometimes tend to consider natural language processing a potential
nest of vaporware. I mean, I started Free CATS because I felt it was
urgent to stop Mikro$oft-owned Trados from ruling the CAT world;
otherwise, I would never have dared to start this project and might be
doing less ambitious things to help free software, like translating
interesting software to help it spread (if only more end-user tools
were already available, we would not be trying to make one ;-)
> Are we going to model the hierarchical structure of language,
> or do we think this task is too hard?
> If we don't, how are we going to spot common idioms such
> as "neither... nor..."?
This is what I called "semantic level processing" in my previous post. I
really have nothing against it, but from my limited working experience
in this field, I know it's even more ambitious than CAT.
When I worked as documentation manager at the publisher of the SPIRIT
software, we used to call such words (articles, adverbs, auxiliary
verbs, etc.) "tool words" ("mots outils"), and they were not indexed.
This distinction is much less relevant for CAT, as the translator uses
CAT to translate similar sentences consistently. Imagine several similar
short sentences where only a key term differs (for instance "user code",
"client code", "supplier code", etc.):
The user code field is now highlighted.
The client code field is now highlighted.
(...)
If we compare these two sentences, 6 out of 7 words are identical and
they appear in the same order, so the fuzzy rate is going to be quite
high (6/7*100, around 86% to simplify).
If you remove the N-gram index entries which correspond to these tool
words, the translator can no longer expect to retrieve fuzzy matches as
well as before:
The user code field is now highlighted.
The client code field is now highlighted.
If we erase the tool words, the two sentences become:
user code field highlighted.
client code field highlighted.
The fuzzy rate is now around 75% (3/4*100, to simplify).
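To make the arithmetic above concrete for the coders, here is a rough
sketch (Python, purely illustrative; the word-overlap formula and the
small "tool word" list are mine, not the Free CATS matching algorithm):

    # Toy sketch of the word-overlap "fuzzy rate" discussed above.
    TOOL_WORDS = {"the", "is", "now"}   # illustrative tool words only

    def fuzzy_rate(a, b, drop_tool_words=False):
        """Naive similarity: words shared at the same position, over the
        length of the longer segment, as a percentage."""
        wa = [w.lower().strip(".") for w in a.split()]
        wb = [w.lower().strip(".") for w in b.split()]
        if drop_tool_words:
            wa = [w for w in wa if w not in TOOL_WORDS]
            wb = [w for w in wb if w not in TOOL_WORDS]
        shared = sum(1 for x, y in zip(wa, wb) if x == y)
        return 100.0 * shared / max(len(wa), len(wb))

    s1 = "The user code field is now highlighted."
    s2 = "The client code field is now highlighted."
    print(round(fuzzy_rate(s1, s2)))                         # 86
    print(round(fuzzy_rate(s1, s2, drop_tool_words=True)))   # 75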
Sorry for being so long; I wanted to be clear for non-coders.
> I'll just restrict my comment to the Source segments indexing (SSI)
> method at the moment. I've attached some excerpts from earlier emails
> to the end of this message:
> #1. What components of FreeCATS make use of the SSI method?
In my draft indexing specs document, only the server.
> Is it only to be used in building the corpus for the
> Translation Memory server, or do we use it as a preprocessor
> in translating text?
In a classic CAT tool, we don't have such preprocessors.
> #2. I agree with David Welton about starting with European languages
> for now, but I think we should make an effort to attract someone
> who knows Asian character sets. I don't think we should figure
> this stuff out for ourselves, if none of us speaks an Asian
> language. We shouldn't wait too long: if we work only with
> Indo-European languages, we might have some nasty surprises
> when we find that Korean, say, violates some assumptions we
> thought applied to all language texts;
True. See the draft document (when available); I think it's OK at least
for Chinese, but I'm dubious about Thai and all languages which don't
clearly separate words within a sentence. Still, Thai computer users
(and translators) may already have found another way of dealing with
this problem (I mean, how could one even implement a spell checker in
Thai otherwise?).
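To make the word-boundary question concrete for non-coders: character-level
N-grams can be extracted without knowing where words start and stop. A toy
sketch (Python, only the general idea, not the draft spec); whether that
alone is good enough for Thai is exactly the open question above:

    # Extract overlapping character trigrams without relying on
    # word boundaries (illustrative only).
    def char_ngrams(text, n=3):
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("user code"))
    # ['use', 'ser', 'er ', 'r c', ' co', 'cod', 'ode']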
I can try to contact two French localization companies specialized in
Eastern languages and ask for help, but ideally, we would get more
assistance from language scientists. There is a French newsgroup,
fr.sci.linguistique, on which we could post for help, and we might also
ask where to post in other languages (I can't find any similar one in
English on Usenet, but maybe I don't know where to search).
> #3. Unicode character properties: clearly it is the right thing
> to use these;
Fine, this is one of our non-ambiguous areas.
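For non-coders: in practice this means every character already carries a
general category (letter, digit, punctuation, separator...), so we can
split segments without per-language tables. A minimal sketch using
Python's standard unicodedata module (the function name is mine, nothing
here is specified yet):

    import unicodedata

    # Keep only runs of Unicode "letter" characters, whatever the script
    # (illustrative only).
    def letter_runs(text):
        tokens, current = [], ""
        for ch in text:
            if unicodedata.category(ch).startswith("L"):  # Lu, Ll, Lo...
                current += ch
            else:
                if current:
                    tokens.append(current)
                current = ""
        if current:
            tokens.append(current)
        return tokens

    print(letter_runs("«Код клиента» = the client code."))
    # ['Код', 'клиента', 'the', 'client', 'code']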
> #4. I think it is better to work directly from the source text:
We thought about it at our last Breton meeting in Quimper, and I can't
pretend our proposal is a final one.
We rejected it for a variety of reasons:
- As we want the server to only manage a TM (for performance), we prefer
converting a source file into our bilingual working format (still to be
defined in detail; it will be Unicode-based and probably look a lot like
HTML strings embedded in custom tags, with a little extra information; a
purely hypothetical example is sketched after this list).
- We want the translator to retain control of the project files locally
and avoid any Taylorist web-service approach where the translator
cannot, for instance, manually process the files one way or another.
- As we want to work with a variety of formats, the only reasonable
thing to do is to convert any of these (back and forth) into a
(bilingual) working format; otherwise even the server has to learn each
of these formats, and we need/want a lot of them: text-only, flat
resource files (Windows .RC / .H), WinHelp & HTML Help source files,
HTML (even if badly formatted), XML (idem), RTF, OpenOffice, LaTeX...
not to mention MS Office or proprietary DTP files.
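To give a rough idea of what I have in mind for that working format, a
segment might look something like this (purely hypothetical; every tag
and attribute name below is invented for the example, nothing is
specified yet):

    <seg id="42">
      <source lang="en">The <b>user code</b> field is now highlighted.</source>
      <target lang="fr">Le champ <b>code utilisateur</b> est maintenant en surbrillance.</target>
    </seg>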
> (...)
> with case-folded texts, we work, e.g., with case-insensitive matches.
Yes, we'll probably lower-case letters, but we have to keep accents.
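In other words (a toy illustration only), Unicode lower-casing keeps the
accents, which is exactly what we want:

    # Lower-casing does not strip the accents (illustrative only).
    print("Édition du CODE Utilisateur".lower())
    # 'édition du code utilisateur'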
> - We will lose potentially valuable information if we
> do things like throw away email headers;
Any info embedded within a tag will be kept in the bilingual working
format (source & target segments) but could be dropped from the TM if we
choose to use simplified internal tags.
> (...)
> #5. N-grams: easy to do this if we represent the lexicon using
> a state-transition diagram or even a recursive descent parser
> (the best are almost as fast as lexing regexps).
I need to become better acquainted with these terms.
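What I had in mind for the source segment index itself is much more
naive: roughly an inverted table from N-grams to the segments that
contain them. A toy sketch in Python (not the draft spec, just the
general shape):

    from collections import defaultdict

    # Toy N-gram inverted index over source segments (illustrative only).
    def char_ngrams(text, n=3):
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    segments = [
        "The user code field is now highlighted.",
        "The client code field is now highlighted.",
        "Enter your user code.",
    ]

    index = defaultdict(set)
    for seg_id, seg in enumerate(segments):
        for gram in char_ngrams(seg):
            index[gram].add(seg_id)

    # Candidate segments for a query are those sharing the most N-grams.
    scores = defaultdict(int)
    for gram in char_ngrams("user code"):
        for seg_id in index.get(gram, ()):
            scores[seg_id] += 1
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    # Segments 0 and 2 score highest; segment 1 only shares a few grams.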
> #6. I'm against using fuzzy matching: if we build up a big
> corpus in a language, then we will have almost all actually
> occurring misspellings in that language. Exact matching is
> much faster than fuzzy matching, and easier to design around.
Well, CAT is also about fuzzy matching :-)
More to the point, building up a large TM from legacy materials always
brings up questions about their quality. If, tomorrow, I'm contacted by
the project team of a major free software project that is interested in
Free CATS (and I hope it will happen), I'll advise them to fully review
their legacy translations, especially now that they have a better tool
at hand.
In my work at Kemper DOC, major localization agencies often offer us
projects in which the legacy TM is half-filled with rubbish, or simply
inconsistent with the project terminology, so we try to do a better job
(more handicraft than fully automated processing, but in the end the
work gets done), and we could not do it as easily without CAT and fuzzy
matching. It's a genuine Real Life Situation :-(
In a nutshell, if the user is not a qualified translator then, to parody
Murphy's Laws, garbage out is never far away, even without garbage in
(read: legacy), once a donkey takes control of the handles.
But of course, fuzzy matching ONLY comes into play if we have no exact
match to retrieve in the TM - or when we do a context search.
This is also an instance where we can first provide a crude function, to
be refined later on.
For instance:
The user code field is now highlighted.
Suppose I want to check how "user code" is translated in the TM (it's
not included in the terminology database used, or this project does not
use one). I'll highlight this string from within the translation client
and press a function key. The translation client will then display, in a
separate window, all sentences which contain this string, so that I can
provide a consistent translation, even if there is no perfect match, if
there is no fuzzy match (or none above my current threshold anyway), or
if the only fuzzy matches match some other part of my current segment
than the "user code" bit in it.
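A first, crude version of that context search is really nothing more
than a case-insensitive substring scan over the source side of the TM.
A toy sketch (Python; the data layout and names are invented for the
example, not the client's actual design):

    # Toy "context search": list every TM unit whose source segment
    # contains the highlighted string (illustrative only).
    tm = [
        ("The user code field is now highlighted.",
         "Le champ code utilisateur est maintenant en surbrillance."),
        ("Enter your user code.",
         "Saisissez votre code utilisateur."),
        ("The supplier code is optional.",
         "Le code fournisseur est facultatif."),
    ]

    def context_search(tm, text):
        needle = text.lower()
        return [(src, tgt) for src, tgt in tm if needle in src.lower()]

    for src, tgt in context_search(tm, "user code"):
        print(src, "=>", tgt)
    # Prints the first two units; "supplier code" does not match.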
Pardon me for being long and forgive my rants!
Henri