From: Henri Chorand
Subject: [Freecats-Dev] TM data structures and Indexing (first attempt)
Date: Mon, 27 Jan 2003 00:06:47 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Hi all,

I've been thinking about how to perform TU indexing so as to allow quick and accurate matching. While I've not solved everything, I still consider this a small step forward.
Please read it (unfinished parts are clearly indicated). If you know the answers... let me know!

Henri
Free CATS
This document suggests a TU indexing algorithm and a possible data & index storage implementation built on top of a basic database (e.g. Berkeley DB) or even directly on the filesystem. It should help determine a fuzzy match algorithm (to be refined in this document when we see things more clearly). Please consider this data structure as incomplete and partially inadequate. The most serious part of this document is the N-Gram management, which I consider well advanced. So, take it for what it is: a first attempt.

Note: Berkeley DB only allows a program to store records made up of a key and value pair, without any restriction on each item's type or length. This corresponds to a flat file's name & contents. For more information, see the Berkeley DB Reference Guide.

Note: I don't detail the TM creation/import/export/deletion features here, as they are rather mainstream. Basically:
The only four basic operations we need our DBMS to perform are:
This first version of the document suggests how to perform three of these four operations (see below). The fourth is the most difficult one, but it has been taken into account as best as possible; I will try to provide some guidelines.

Note that for each TU, we only need to index its source segment and the N-Grams it contains. Therefore, when a TU's target segment is updated by a user, there is no need to modify its indexing data, which will require considerably fewer CPU resources. We can also consider deletion of individual TUs within a TM as quite rare.

Note: The proposed data structure comes from a feature implemented in a (then highly advanced) full-text indexing system, which was once partly explained to me by the R&D manager of the software's publisher. So it does not come out of my own mind; I was only clever enough to remember it since I heard about it nearly ten years ago.

Note: Sorry, the HTML format of this document prevents me from properly indenting pseudocode (see below), at least using Open Office.
Each TU and its N-Grams will be handled as follows:
Concurrent access is dealt with later in this document, so as to keep things simple (at least at first). Anyway, it is quite rare in statistical terms, and it may be handled in an appropriate way later on, during proofreading (where all target segments are reviewed and sometimes edited). We must remember that in real life, the proofreader(s) will be able to decide which translation to keep, so it's not a real problem from the user's point of view, even though it would be with various other kinds of databases used in a different domain. But, as you will see here, there might be a handy way to handle it that provides faster response times and spares CPU time.

Each time a user requests a TU creation in a given TM, the TM server will perform the following steps:
Each time a user requests a TU update in a given TM, the TM server will perform the following steps:
Each time a user requests a TU deletion in a given TM, the TM server will perform the following steps:
So, we'll end up with an unused TU number in the database. It might be possible to recycle it, but it's really not worth it. Reorganizing a TM (full reindexing, if we ever implement it) will automatically redo all TU numbering from scratch, and therefore eliminate any such "holes".
(Short reminder for non-coders) A concurrent access problem is a situation where several users attempt to simultaneously update the same data, resulting in inconsistencies. Here, unless we apply my advance indexing suggestion, it would happen when a first user opens a TU which has not yet been created on the TM server, and another user opens the same TU (which happens to exist in another file of the project) and begins to create it as well before the first one is stored, giving birth to one of the two following situations:
The real problem is that, if we did nothing about it, in this (rare) case we would end up with two TUs (each with its own ID) containing the same source segment, without any good reason for it. I then had an idea: perform indexing in advance (as soon as the TU is opened, but before its target segment is known). This is possible (all required data is available to the server) because TU indexing is only based on the source segment's contents. Here, probably, is our solution. Each time a user opens a TU, he automatically requests the TM server to immediately perform a fuzzy match and return the results, for good reasons:
The only situation where this advance indexing must be cancelled is when the user interactively modifies segmentation, either by joining two contiguous segments together (always allowed within the same paragraph, as determined by the translation client) or by splitting a segment in two (most probably because it was first created by joining two segments). As a rule, this will not happen often. In this case, the TM server will un-index the TU, thus generating a hole (unused number) in the sequence of TU IDs.

So, I suggest the server should index a new TU as soon as a translation client asks whether one exists, when the user opens the TU. Then, any other user requesting the creation of the same TU will receive a warning message and won't be able to save his translation in the server's TM; he will have to wait until the update is performed (a security timeout will help ensure the TU does not remain open forever on the first user's translation client - see below). Any other user will then automatically close and skip the TU; the translation client could also keep a "non-synchronized" flag within the TU. In this case, the TM server must also index each of this source segment's N-Grams.

The great advantage is much better response times: when the user has finished entering a new translation or validating an existing one, all index data will already have been updated in advance in the TM tables. Also, in the case of a roll-back, I don't believe it's useful to also delete N-Gram entries in tables (we should only update the applicable records in the NGRIDX_tmid table, see below). Should some N-Grams be unique to the old TU's source segment, they will come back into play as soon as the missing bit (if put into another segment) is also translated. So, temporarily keeping records corresponding to unused N-Grams is NOT going to be a problem, as long as the corresponding indexes are properly updated.
Of course, we must then handle the situation where the translation client never sends the target segment to the server (if the user stops working in the middle of a TU). In this case, the server might:
We will "only" need to manage connections well enough to allow the user to reconnect to the server after a forced disconnection (network down for too long, client application crash). In this case, I suggest that when a client attempts to reconnect, the server should send it a message detailing which TU needs to be re-opened (if it was left in "open" state when something went wrong). All this makes me think our TM server would use a single process (a loop) to process all TM update requests each in turn (in synchronous mode, if I understand correctly), whereas the simple context searches can be processed in asynchronous mode (by another process), so all TM update requests would be queued. Here, too, the queue's maximum length would be equal to the number of users, as a translation client simply can't go any further until it has finished performing its current task. If experienced database coders believe I'm wrong here, please provide a better alternative ASAP.
After thinking about it over and over again, I have come up with a highly customizable solution. We'll begin by defining a set of rules for each kind of language. The two I'll describe here should suit the needs of Western languages and of ideogram-based languages (like Chinese). Each time we want to add support for a new language, we'll do as follows:
1. Parsing the source segment and extracting, one by one, all full words found according to the set of delimiters to be defined here:
A simple way to do this is to transform every delimiter into a space character, trim spaces (remove all spaces at the segment's beginning and end, and collapse duplicate spaces), then extract words from the segment with any adequate substring extraction function (to be investigated in Tcl later on).

2. Removing word duplicates, for optimization.

3. For each word, determining its list of N-Grams according to the language parameter value lngrset (rule set). I suggest calling the same algorithm with a specific set of parameters fitting the relevant language. These parameters, defined in the [RuleSet] section (see below) of the TM server's configuration parameters (sample values provide an example), are as follows:

Western language default (proposal)
Algorithm

    // Initialization
    word = "amnesia"                  // just an example here, length 7
    if lowcas then word = LowerCase(word)  // in fact to be done at segment level (faster)
    ngram_list = (empty)
    word_length = StringLength(word)  // I consider the rank of the first character in a string to be 1 here
    Append word to ngram_list         // a full word is the biggest of its N-Grams

    For i_length = ngminl to (word_length - 1) step +1
        // with the sample values: loop for i_length = 3 to 6
        // i_length is the length of the N-Gram extracted, starting at the
        // minimum length specified, ending at (word length - 1)

        // Beginning of word
        if ngbeg = True then
            Append SubString(1, i_length) to ngram_list
            // where 1 is the start position and i_length the substring length
        end if

        // Middle of word (turned off in my proposal, ngmid = False, so nothing is done here)
        if ngmid = True then
            For j_posinword = 2 to (word_length - i_length) step +1
                // loop within the word, beginning at position 2 (the second character)
                Append SubString(j_posinword, i_length) to ngram_list
            Next j_posinword
        end if

        // End of word
        if ngend = True then
            Append SubString((word_length - i_length + 1), i_length) to ngram_list
            // where (word_length - i_length + 1) is the start position and i_length
            // the substring length, so we extract the i_length LAST characters here
        end if
    Next i_length

There, we only have to create our N-Gram entries in DB tables / filesystem / whatever.

Ideogram language default (proposal)

We have NOTHING to do with N-Grams; we only need to consider the individual "words", each of which is made up of a single Unicode character. The list of words (duplicates removed) makes up the full list of N-Grams (all are whole words) to be indexed.
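The word extraction (steps 1-2) and the N-Gram algorithm above can be sketched in Python as follows. This is only an illustration, not the planned Tcl implementation; the delimiter set is a hypothetical example, and the parameter names follow the [RuleSet] section described later in this document.

```python
import re

# Illustrative delimiter set; the real set of delimiters remains to be defined.
DELIMITERS = r"[ \t,;:.!?()\"']+"

def segment_words(segment):
    """Steps 1-2: split the segment on delimiters and drop duplicate words."""
    return {w for w in re.split(DELIMITERS, segment) if w}

def word_ngrams(word, ngminl=3, ngbeg=True, ngmid=False, ngend=True, lowcas=True):
    """Step 3: the set of N-Grams to index for one word (Western rule set)."""
    if lowcas:
        word = word.lower()          # in practice done once at segment level
    ngrams = {word}                  # a full word is the biggest of its N-Grams
    for length in range(ngminl, len(word)):        # ngminl .. word length - 1
        if ngbeg:
            ngrams.add(word[:length])              # anchored at word beginning
        if ngmid:                                  # strictly inside the word
            for start in range(1, len(word) - length):
                ngrams.add(word[start:start + length])
        if ngend:
            ngrams.add(word[-length:])             # anchored at word end
    return ngrams
```

With the proposed Western defaults (3 / begin / end / lowercase), "amnesia" yields the full word plus the four prefixes "amn" to "amnesi" and the four suffixes "sia" to "mnesia", nine N-Grams in all.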
This issue is precisely the one for which I would be happy to receive advice from algorithm experts. Ideally, we should be able to define a math function that would return V, a uniquely derived value (quite possibly a multi-dimensional value, a vector) from a segment's ordered list of words (seen as N-Gram IDs), punctuation marks and simplified tags, so as to search for the closest values of V already stored in the TM. I'm not strong enough in maths to suggest anything, but I hope this document will be enough for some reader to answer "I know how to do this". We need to find such a function in order to make the Free CATS server "work"; otherwise we will start using half-baked, dirty tricks so as to come up with "something" that will not be close enough to what we're trying to do. But I want to believe [(c) X-Files] that a mathematically elegant solution exists, that somebody will tell me about it and, an even greater miracle, will explain it to me in terms I can grasp. So, I have not yet devised a true algorithm. Here are a few hints from which I suggest building it. We start from a query string which may (TU storage in TM) or may not (fuzzy searching) be a full source segment. It's a sequence of items (words, punctuation marks and/or internal (simplified) tags). In order to keep things simple, I suggest working on words first, then refining our near-final results with the other types of items. We need to find a set made up of up to MaxTUNumber (client-defined threshold) TUs in the TM, each of which has a matching rate no lower than MinFuzzyRate (client-defined threshold) as compared to our query string. We have already found (thanks to Thierry) that dealing with N-Grams instead of words works. Therefore, we are looking for TUs which are closest to our query string, so they:
We might try to work on the last criterion after the first two, once we have built a subset on which to work. The main advantage of series of binary flags like those found in the SEGIDX (and NGRIDX) tables is that they allow us to perform binary AND and OR operations on their contents very quickly. In order to use a single algorithm for a query string, whether it's a TU source segment or a context search, in the latter case, once we have found this famous V function, we'll simply compute V(query string). To fulfill the first two criteria, and as we can't browse the entire contents of the SEGIDX table in order to find the best match, we must first find a quick way to find it; this may indicate that my draft data structure needs some improvement.
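The binary AND/OR idea can be illustrated with Python integers used as bit vectors. The layout is an assumption for illustration only: for each N-Gram, a NGRIDX-style record holds one bit per TU, set when that TU's source segment contains the N-Gram; the sample data is made up.

```python
# Hypothetical NGRIDX-style records: N-Gram -> bit vector, bit i set when
# TU number i contains that N-Gram in its source segment (sample data).
ngridx = {
    "amn":   0b01101,   # present in TUs 0, 2 and 3
    "nesia": 0b00101,   # present in TUs 0 and 2
    "sia":   0b10100,   # present in TUs 2 and 4
}

def tus_with_all(ngrams):
    """Binary AND: TUs whose source segment contains every query N-Gram."""
    result = ~0                      # all bits set, then narrowed down
    for ng in ngrams:
        result &= ngridx.get(ng, 0)
    return result

def tus_with_any(ngrams):
    """Binary OR: TUs sharing at least one N-Gram with the query."""
    result = 0
    for ng in ngrams:
        result |= ngridx.get(ng, 0)
    return result
```

A single machine-word AND or OR tests the whole TU population at once, which is why such flag series are cheap to combine even for large TMs.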
If the source segments of two TUs are equal, they must have the same list of N-Grams, so the corresponding records in SEGIDX are equal. However, if two SEGIDX records are equal, there might still be some discrepancies between the segments, for instance if:
A way to do this could also be to generate, from a source segment, an ordered list of its words (seen as N-Gram IDs). We must try to determine, for a given source segment, the TUs whose source segment contains all our N-Grams. To do this:
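One way to find TUs sharing N-Grams with a query, and to rank them, is an in-memory inverted index. The sketch below is a hypothetical stand-in for the NGRIDX table (the mapping and sample data are mine), and the shared-N-Gram count is only a crude stand-in for the fuzzy-match rate discussed above:

```python
from collections import Counter

# Hypothetical in-memory stand-in for the NGRIDX table: N-Gram -> TU IDs
# whose source segment contains it (sample data for illustration).
ngram_to_tus = {
    "amn":   {1, 3, 4},
    "nesia": {1, 3},
    "sia":   {3, 5},
}

def candidate_tus(query_ngrams, min_shared=1):
    """Rank TUs by how many of the query's N-Grams they share.

    Returns (tu_id, shared_count) pairs, best candidates first; a TU
    containing ALL query N-Grams scores len(query_ngrams).
    """
    counts = Counter()
    for ng in query_ngrams:
        for tu in ngram_to_tus.get(ng, ()):
            counts[tu] += 1
    return [(tu, n) for tu, n in counts.most_common() if n >= min_shared]
```

With the sample data, TU 3 contains all three query N-Grams and comes out first; the lower-scoring TUs form the "close enough" subset on which the finer word-order comparison could then run.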
******** to be reworked or scrapped ********

So, we may start from SEGIDX records and OR operations between them as probably the best way to select "close enough" segments, but this will generate a quite large number of fuzzy matches. We would refine our selection with this ordered list. If the result of an AND operation is null, we have to look in this ordered list (Yet Another Index Table). There, we would still drop punctuation marks and internal (simplified) tags, as their discrepancies must be considered "minor" compared to words being identical. It's only after fetching the "best" segments (from word contents AND sequence) that we would slightly lower fuzzy rates based on punctuation marks and internal tags, and therefore build the "definitive" list of best matches.
It seems that Berkeley DB does not have a database (logical set of tables) concept, so I guess we must manually append the TM ID (a 32-bit integer, further detailed below, indicated by _tmid here) to the table name. If we work directly on top of a filesystem, we could use directories for this and thereby simplify the whole thing. For each TM, we need to use the following tables:
Note: This first approach does not fully represent all the data found in a source segment, as we can't tell, from the index tables, the number of occurrences of an N-Gram within the source segment, nor its location(s) within it. It was designed with the aim of quickly determining whether or not a given N-Gram is found in a source segment. If need be, we might also use another table to store a source segment's representation as a sequence of N-Grams (full words) and internal (simplified) tags, separated by a delimiter.
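Such a sequence representation could be encoded and decoded as sketched below. Everything here is an assumption for illustration: "/" as the delimiter, integers for word (N-Gram) IDs, and arbitrary strings for simplified tag markers.

```python
# Hypothetical encoding of a source segment as a delimiter-separated sequence
# of word IDs (full-word N-Gram IDs, as integers) and simplified tag markers
# (as strings); "/" is an arbitrary delimiter chosen for illustration.
DELIM = "/"

def encode_sequence(items):
    """items: ordered list mixing word IDs (int) and tag markers (str)."""
    return DELIM.join(str(i) for i in items)

def decode_sequence(value):
    """Recover the ordered list: numeric parts back to ints, tags kept as-is."""
    return [int(p) if p.isdigit() else p for p in value.split(DELIM)]
```

Unlike the flag tables, this value preserves word order and occurrence counts, so it could feed the finer sequence-based comparison mentioned above.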
Let's not forget to store the following information for each TU:
For simplicity's sake, I have not included them in any table, but (in order to limit the overall number of tables used by the server) they could fit in the TRGSEG table; the beginning of the value contents could be, for instance, as follows (where / is a delimiter):

Apart from these, the client could use the following general-purpose tables (flat .CFG-type files would also do):
I suggest implementing miscellaneous TM-level parameters either as a table or as a separate .CFG-type file, as follows (at least as an example):

[Admin] ; server administrator data
admid = (string value) ; administrator's ID
admpwd = (string value) ; administrator's password
admname = (string value) ; administrator's full name
admmail = (string value) ; administrator's email address
[Languages] ; natural languages reference data
lnglist = (list of delimiter-separated string values) ; list of ISO 639-2 codes of languages currently supported by the Free CATS server
lngrset = (list of delimiter-separated string values) ; list of rule set codes applicable to the language at the corresponding position in lnglist
lnflag = (list of delimiter-separated string values) ; list of file URIs corresponding to country flags (only needed by translation clients)

Note: A rule set code is a parameter which determines the algorithm chosen to build the list of N-Grams from a word in a given language.
Note: The language names table is only stored on the clients, and the server can specify which languages are currently implemented on it; this may vary between servers.
[RuleSet] ; N-Gram rule set (to be followed for a set of languages)
ngminl = (integer value > 0) ; minimum N-Gram length
ngbeg = (boolean value) ; do we index all N-Grams starting at the word's beginning (0 = false, 1 = true)
ngend = (boolean value) ; do we index all N-Grams ending at the word's end (0 = false, 1 = true)
ngmid = (boolean value) ; do we index all N-Grams in the "middle" (not starting at the word's beginning and not ending at the word's end) (0 = false, 1 = true)
lowcas = (boolean value) ; do we transform all capital letters in N-Grams to lower case (0 = false (keep capital letters), 1 = true)

For instance, I would suggest, for Western languages: 3 1 0 1, or possibly 3 1 1 1. As I see it, this still-simple parameterization should allow great flexibility, probably more than we'll ever need.
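If these parameters end up in a .CFG-type file, they are straightforward to read with a standard INI parser. A minimal Python sketch (the key names come from the [RuleSet] section above; the file-in-a-string is just for demonstration):

```python
import configparser

# Sample [RuleSet] section with the proposed Western-language defaults.
SAMPLE = """
[RuleSet]
ngminl = 3
ngbeg = 1
ngend = 1
ngmid = 0
lowcas = 1
"""

cfg = configparser.ConfigParser(inline_comment_prefixes=(";",))
cfg.read_string(SAMPLE)

rules = cfg["RuleSet"]
ngminl = rules.getint("ngminl")
ngbeg = rules.getboolean("ngbeg")    # configparser accepts 0/1 as booleans
ngmid = rules.getboolean("ngmid")
```

The ";" inline-comment prefix matches the comment style used in the sections of this document, so the same file could keep its explanatory comments.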
[General] ; server data
lasttm = (32-bit integer) ; ID of last TM created on the Free CATS server (0 at initial setup)
tmlist = (list of delimiter-separated integers) ; list of IDs of TMs stored on the Free CATS server (starting at 1)
svsegd = (list of delimiter-separated string values) ; list of optional segment delimiters allowed (unique strings: Tab, ":")
[User] ; user data
lastusr = (32-bit integer) ; ID of last user created on the Free CATS server (0 at initial setup)
usrlist = (list of delimiter-separated integers) ; list of IDs of users created on the Free CATS server (starting at 1)
usrnam = (list of delimiter-separated strings) ; list of user names on the Free CATS server
usrpwd = (list of delimiter-separated strings) ; list of user passwords on the Free CATS server
usreml = (list of delimiter-separated strings) ; list of user email addresses on the Free CATS server (may be used some day)
usrtm = (list of delimiter-separated integers) ; list of IDs of TMs to which user ID usr has been granted access on the Free CATS server (usr being a unique user ID)
[TMnn] (nn being a unique TM ID) ; TM data
tmdesc = (string value) ; TM description
srclng = (string value) ; source language (3-letter ISO 639-2 code)
tgtlng = (string value) ; target language (3-letter ISO 639-2 code)
srccty = (string value) ; source country (3-letter ISO 3166 code)
tgtcty = (string value) ; target country (3-letter ISO 3166 code)
crtmstp = (date & time) ; TM creation timestamp
lutmstp = (date & time) ; TM last update timestamp
lasttu = (32-bit integer) ; ID of last TU created on the Free CATS server (up to more than 4,294,000,000)
lastng = (32-bit integer) ; ID of last N-Gram created on the Free CATS server (idem)
tucnt = (32-bit integer) ; TU counter on the Free CATS server (useful to determine the number of "holes" in numbering)
subprj = (list of delimiter-separated string values) ; list of sub-project values (unique strings)
categ = (list of delimiter-separated string values) ; list of category values (unique strings)
tmsegd = (list of delimiter-separated string values) ; list of optional segment delimiters activated (unique strings, any of: Tab, ":" - must be defined at TM level for consistency)
(other TM parameters ...)

**** check against the requirements specification