[Freecats-Dev] Translation Units indexing - a first draft
From: Henri Chorand
Subject: [Freecats-Dev] Translation Units indexing - a first draft
Date: Thu, 23 Jan 2003 00:55:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Hi all,
I posted my previous message in answer to some very helpful feedback from
Charles Stewart, who replied to David's message on Advogato.
Charles is a very good example of the kind of senior developer we are
presently trying to get interested in our project in order to see it
really lift off the ground. As a scientist, he is involved in various
computing-related research work, as you can see at:
http://www.linearity.org/cas/
So, I hope he (and several others) can join us and provide most valuable
contributions. In any case, it's another encouraging event, and it shows
that the Free CATS project's aim is relevant, to begin with.
Today, I had a couple of phone talks with Julien Poireau, and (among
other things) we tried to work out how we could index source segments in
translation units. This helped me produce a first draft (see below).
While still a long way from a detailed algorithm, I hope my suggestion
can at least help stimulate our brains.
David Welton recently sent us a link to Zebra, a text database server
released under the GPL which, judging from its documentation pages,
seems to incorporate many of the features we're looking for.
As with any large existing piece of software we examine in order to
decide whether we could adapt and use it, simply reading the
documentation takes a lot of time before we understand how it works and
how we could adapt it; meanwhile, nothing prevents us from thinking on
our own about how to build a database server from scratch.
If we are to adapt such a piece of software, we still need to be able to
map our concepts onto the existing product's, and to express very
carefully, and in some detail, what still needs to be done to it in
order to obtain what we want - and this, whether or not we get help from
that project's team.
----------------------------------------------------------------
Source segments indexing method by a Translation Memory server
----------------------------------------------------------------
(Apologies for any improper English terms - this is a DRAFT)
1) Parsing
We need to parse the source segment and split it into a sequence (an
ordered list) of items (words, separators and tags).
Definitions:
Word: a sequence of contiguous alphabetic and/or numeric characters.
Separator: a sequence of contiguous non-alphabetic, non-numeric
characters, such as:
- space
- tab
- punctuation marks: . , ; : ! ? and the inverted marks ¿ ¡
- ' " < > + - * / = _ ( ) [ ] { } hyphens and similar
- various other symbols (etc.)
- non-breaking space
Tag: a tag belonging to our list of internal tags (let's consider the
file already converted).
For each item:
If it is a word, we extract the list of all sub-words (sub-strings)
whose length is >= a language-specific minimum (for example, 3 for
French or English). We consider each word and each of its sub-words to
be alphabetic strings.
If it is a tag, we convert it into one of the following items:
- standalone tag
- beginning-type tag
- end-type tag
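To make the parsing step concrete, here is a minimal Python sketch. All
the names are mine, and the internal tag syntax is only an assumption
(tags shaped like <x>, </x> and <x/>); the real converted format may
differ:

```python
import re

MIN_SUBWORD_LEN = 3  # language-specific minimum (e.g. 3 for French, English)

# One alternative per item kind; the tag shape <x>, </x>, <x/> is assumed.
TOKEN_RE = re.compile(
    r"(?P<tag></?\w+/?>)"      # internal tag (assumed syntax)
    r"|(?P<word>\w+)"          # word: contiguous alphanumeric characters
    r"|(?P<sep>[^\w<]+|<)"     # separator: everything else, incl. a lone '<'
)

def parse(segment):
    """Split a source segment into an ordered list of (kind, text) items."""
    items = []
    for m in TOKEN_RE.finditer(segment):
        kind, text = m.lastgroup, m.group()
        if kind == "tag":
            if text.endswith("/>"):
                kind = "standalone-tag"
            elif text.startswith("</"):
                kind = "end-tag"
            else:
                kind = "begin-tag"
        items.append((kind, text))
    return items

def subwords(word, min_len=MIN_SUBWORD_LEN):
    """All sub-strings of `word` whose length is >= min_len (the word
    itself included)."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + min_len, n + 1)}
```

For instance, parse("Hello, <b>world</b>!") yields a word, a separator,
a beginning-type tag, a word, an end-type tag and a separator, and
subwords("cats") yields {"cat", "ats", "cats"}.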
2) Indexing
For each word and sub-word, we create an index entry pointing to the
TU's ID (the basic step that makes it possible to retrieve the TU during
queries).
For each TU, we also create the following index entry: the comprehensive
list of all values indexed for this TU (this will make queries faster).
When creating a TU, the server automatically assigns it an ID. Note that
we do NOT care about the sequence of items in the source segment.
This may seem weird but, in fact, the more words two source segments
share, the more likely they are to carry the same, or similar, contents.
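A minimal sketch of this indexing step, as an in-memory inverted index
(the class and attribute names are hypothetical, not taken from any
existing product):

```python
from collections import defaultdict

class TMIndex:
    """Inverted index from words/sub-words to TU IDs, plus the per-TU
    comprehensive term list used to speed up queries."""

    def __init__(self):
        self.next_id = 1
        self.by_term = defaultdict(set)  # word or sub-word -> set of TU IDs
        self.by_tu = {}                  # TU ID -> set of indexed terms

    def add_tu(self, terms):
        """Store a TU; the server assigns its ID automatically."""
        tu_id = self.next_id
        self.next_id += 1
        self.by_tu[tu_id] = set(terms)   # comprehensive list for this TU
        for t in terms:
            self.by_term[t].add(tu_id)   # entry pointing to the TU's ID
        return tu_id
```

Note that nothing here records the sequence of items - exactly as in the
draft, order only comes into play later, as a penalty.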
3) Looking for matches (fuzzies)
For ANY query (whether looking up a TU or performing a context search):
Let's call the segment for which we are looking for matches the starting
segment.
We build a comprehensive list of all words and sub-words in the starting
segment.
We look for all the TUs that have:
- the highest number of matches at the string level (i.e. an index entry
exists for the starting string considered)
AND
- the smallest number of non-matches (index entries of other TUs not
found in the list of the starting segment's index entries).
We then apply penalties for:
- any variation in the respective SEQUENCES of words and sub-words
within TUs
- any variation in separators and tags
Of course, the score of a sub-word is lower than that of a full word (by
half, for instance).
We can then sort the full contents of the TM by decreasing order of
relevance against our starting segment.
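The scoring above could be sketched as follows, leaving out the
sequence, separator and tag penalties; the half score for sub-words is
only the example value from the draft, and all names are mine:

```python
def rank_matches(tm_terms, start_terms, full_words, subword_weight=0.5):
    """Rank TUs by decreasing relevance against the starting segment.

    tm_terms:    dict mapping TU ID -> set of indexed words and sub-words
    start_terms: all words and sub-words of the starting segment
    full_words:  the subset of start_terms that are full words
    """
    def w(term):
        # A sub-word scores lower than a full word (by half, for instance).
        return 1.0 if term in full_words else subword_weight

    start = set(start_terms)
    scores = {}
    for tu_id, tu_terms in tm_terms.items():
        match_score = sum(w(t) for t in tu_terms & start)   # matches
        penalty = sum(w(t) for t in tu_terms - start)       # non-matches
        scores[tu_id] = match_score - penalty
    # Sort the whole TM by decreasing relevance.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

With tm_terms = {1: {"cat", "cats", "ats"}, 2: {"dog", "dogs", "ogs"}}
and a starting segment reduced to "cats", TU 1 scores 2.0 and TU 2
scores -1.5, so TU 1 comes out first.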
Apart from that, we need to build the indexes in a way that allows a
TM's number of TUs to grow substantially. Think about it: Trados often
forces translators to reorganize its TM indexes...
I hope I'm clear enough.
Regards,
Henri Chorand