[Freecats-Dev] Translation Units indexing - a first draft
From: Henri Chorand
Subject: [Freecats-Dev] Translation Units indexing - a first draft
Date: Thu, 23 Jan 2003 00:55:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Hi all,
I posted my previous message in answer to some very helpful feedback from
Charles Stewart, who replied to David's message on Advogato.
Charles is a very good example of the kind of senior developer we are
presently trying to get interested in our project in order to see it
really lift off the ground. As a scientist, he is involved in various
computing-related research work, as you can see at:
http://www.linearity.org/cas/
So, I hope he (and several others) can join us and provide most valuable
contributions. In any case, it's another encouraging event, and it shows
that the Free CATS project's aim is relevant, to begin with.
Today, I had a couple of phone talks with Julien Poireau, and (among
other things) we tried to work out how we could index source segments in
translation units. This helped me produce a first draft (see below).
While still a long way from a detailed algorithm, I hope my suggestion
can at least help stimulate our brains.
David Welton recently sent us a link to Zebra, a text database server
released under the GPL which, judging from its documentation pages,
seems to incorporate many of the features we're looking for.
As with any large existing piece of software we examine in order to
decide whether we could adapt and use it, simply reading the
documentation takes a lot of time before we understand how it works and
how we could adapt it; meanwhile, nothing prevents us from thinking on
our own about how to build a database server from scratch.
If we are to adapt such a piece of software, we still need to be able to
map our concepts onto the existing product's, and to express very
carefully, and in some detail, what still needs to be done to it in
order to obtain what we want - and this, whether or not we get help from
that project's team.
----------------------------------------------------------------
Source segments indexing method by a Translation Memory server
----------------------------------------------------------------
(Apologies for any improper English terms - this is a DRAFT)
1) Parsing
We need to parse the source segment and split it into a sequence (an
ordered list) of items (words, separators and tags).
Definitions:
Word: a sequence of contiguous alphabetic and/or numeric characters.
Separator: a sequence of contiguous non-alphabetic, non-numeric
characters, such as:
- space
- tab
- punctuation marks: . , ; : ! ? and the inverted marks ¿ ¡
- ' " < > + - * / = _ ( ) [ ] { } hyphens and similar
- various other symbols (etc.)
- non-breaking space
Tag: a tag belonging to our list of internal tags (let's consider the
file already converted).
For each item:
If it is a word, we extract the list of all sub-words (sub-strings)
whose length is >= a language-specific minimum (for example, 3 for
French or English). We consider each word and each of its sub-words to
be alphabetic strings.
If it is a tag, we convert it into one of the following items:
- standalone tag
- beginning-type tag
- end-type tag
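To make the parsing step concrete, here is a minimal Python sketch. All
the names are mine, and the internal tag syntax is only an assumption
(tags shaped like <x>, </x> and <x/>); the real converted format may
differ:

```python
import re

MIN_SUBWORD_LEN = 3  # language-specific minimum (e.g. 3 for French, English)

# One alternative per item kind; the tag shape <x>, </x>, <x/> is assumed.
TOKEN_RE = re.compile(
    r"(?P<tag></?\w+/?>)"      # internal tag (assumed syntax)
    r"|(?P<word>\w+)"          # word: contiguous alphanumeric characters
    r"|(?P<sep>[^\w<]+|<)"     # separator: everything else, incl. a lone '<'
)

def parse(segment):
    """Split a source segment into an ordered list of (kind, text) items."""
    items = []
    for m in TOKEN_RE.finditer(segment):
        kind, text = m.lastgroup, m.group()
        if kind == "tag":
            if text.endswith("/>"):
                kind = "standalone-tag"
            elif text.startswith("</"):
                kind = "end-tag"
            else:
                kind = "begin-tag"
        items.append((kind, text))
    return items

def subwords(word, min_len=MIN_SUBWORD_LEN):
    """All sub-strings of `word` whose length is >= min_len (the word
    itself included)."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + min_len, n + 1)}
```

For instance, parse("Hello, <b>world</b>!") yields a word, a separator,
a beginning-type tag, a word, an end-type tag and a separator, and
subwords("cats") yields {"cat", "ats", "cats"}.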
2) Indexing
For each word and sub-word, we create an index entry pointing to the
TU's ID (the basic step that makes it possible to retrieve the TU during
queries).
For each TU, we also create the following index entry: the comprehensive
list of all values indexed for this TU (this will make queries faster).
When creating a TU, the server automatically assigns it an ID. Note that
we do NOT care about the sequence of items in the source segment.
This may seem weird but, in fact, the more words two source segments
share, the more likely they are to carry the same, or similar, contents.
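A minimal sketch of this indexing step, as an in-memory inverted index
(the class and attribute names are hypothetical, not taken from any
existing product):

```python
from collections import defaultdict

class TMIndex:
    """Inverted index from words/sub-words to TU IDs, plus the per-TU
    comprehensive term list used to speed up queries."""

    def __init__(self):
        self.next_id = 1
        self.by_term = defaultdict(set)  # word or sub-word -> set of TU IDs
        self.by_tu = {}                  # TU ID -> set of indexed terms

    def add_tu(self, terms):
        """Store a TU; the server assigns its ID automatically."""
        tu_id = self.next_id
        self.next_id += 1
        self.by_tu[tu_id] = set(terms)   # comprehensive list for this TU
        for t in terms:
            self.by_term[t].add(tu_id)   # entry pointing to the TU's ID
        return tu_id
```

Note that nothing here records the sequence of items - exactly as in the
draft, order only comes into play later, as a penalty.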
3) Looking for matches (fuzzies)
For ANY query (whether looking up a TU or performing a context search):
Let's call the segment for which we are looking for matches the starting
segment.
We build a comprehensive list of all words and sub-words in the starting
segment.
We look for all the TUs that have:
- the highest number of matches at the string level (i.e. an index entry
exists for the starting string considered)
AND
- the smallest number of non-matches (index entries of other TUs not
found in the list of the starting segment's index entries).
We then apply penalties for:
- any variation in the respective SEQUENCES of words and sub-words
within TUs
- any variation in separators and tags
Of course, the score of a sub-word is lower than that of a full word (by
half, for instance).
We can then sort the full contents of the TM by decreasing order of
relevance against our starting segment.
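The scoring above could be sketched as follows, leaving out the
sequence, separator and tag penalties; the half score for sub-words is
only the example value from the draft, and all names are mine:

```python
def rank_matches(tm_terms, start_terms, full_words, subword_weight=0.5):
    """Rank TUs by decreasing relevance against the starting segment.

    tm_terms:    dict mapping TU ID -> set of indexed words and sub-words
    start_terms: all words and sub-words of the starting segment
    full_words:  the subset of start_terms that are full words
    """
    def w(term):
        # A sub-word scores lower than a full word (by half, for instance).
        return 1.0 if term in full_words else subword_weight

    start = set(start_terms)
    scores = {}
    for tu_id, tu_terms in tm_terms.items():
        match_score = sum(w(t) for t in tu_terms & start)   # matches
        penalty = sum(w(t) for t in tu_terms - start)       # non-matches
        scores[tu_id] = match_score - penalty
    # Sort the whole TM by decreasing relevance.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

With tm_terms = {1: {"cat", "cats", "ats"}, 2: {"dog", "dogs", "ogs"}}
and a starting segment reduced to "cats", TU 1 scores 2.0 and TU 2
scores -1.5, so TU 1 comes out first.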
Apart from that, we need to build the indexes in a way that allows a
TM's number of TUs to grow substantially. Think about it: Trados often
forces translators to reorganize its TM indexes...
I hope I'm clear enough.
Regards,
Henri Chorand