From: Henri Chorand
Subject: [Freecats-Dev] TM data structures and Indexing (first attempt)
Date: Mon, 27 Jan 2003 00:06:47 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003
Hi all,

I've been thinking about how to perform TU indexing so as to allow quick and accurate matching. While I've not solved everything, I still consider this a small step forward.
Please read it (unfinished parts are clearly indicated). If you know the answers... let me know!

Henri
Free CATS
This document suggests a TU indexing algorithm and a possible data & index storage implementation built on top of a basic database (e.g. Berkeley DB) or even directly on the filesystem. It should help determine a fuzzy match algorithm (to be refined in this document when we see things more clearly). Please consider this data structure as incomplete and partially inadequate. The most serious part of this document is the N-Gram management, which I consider well advanced. So, take it for what it is: a first attempt.

Note: Berkeley DB only allows a program to store records made up of a key and value pair, without any restriction on each item's type or length. This corresponds to a flat file's name & contents. For more information, see the Berkeley DB Reference Guide.

Note: I don't detail the TM creation/import/export/deletion features here, as they are rather mainstream. Basically:
The only four basic operations we need our DBMS to perform are:
This first version of the document suggests how to perform three of these four operations (see below). The fourth is the most difficult one, but it has been taken into account as best as possible; I will try to provide some guidelines.

Note that for each TU, we only need to index its source segment and the N-Grams it contains. Therefore, when a TU's target segment is updated by a user, there is no need to modify its indexing data, which will require considerably fewer CPU resources. We can also consider deletion of individual TUs within a TM as quite rare.

Note: The proposed data structure comes from a feature implemented in a (then highly advanced) full-text indexing system, which was once partly explained to me by the R&D manager of the software's publisher. So it does not come out of my own mind; I was only clever enough to remember it since I heard about it nearly ten years ago.

Note: Sorry, the HTML format of this document prevents me from properly indenting pseudocode (see below), at least using Open Office.
Each TU and its N-Grams will be handled as follows:
Concurrent access is dealt with later in this document, so as to keep things simple (at least at first). Anyway, it is quite rare in statistical terms, and it may be handled in an appropriate way later on, during proofreading (where all target segments are reviewed and sometimes edited). We must remember that in real life, the proofreader(s) will be able to decide which translation to keep, so it's not a real problem from the user's point of view, even though it would be with various other kinds of databases used in a different domain. But, as you will see here, there might be a handy way to handle it that provides faster response times and spares CPU time.

Each time a user requests a TU creation in a given TM, the TM server will perform the following steps:
Each time a user requests a TU update in a given TM, the TM server will perform the following steps:
Each time a user requests a TU deletion in a given TM, the TM server will perform the following steps:
So, we'll end up with an unused TU number in the database. It might be possible to recycle it, but it's really not worth it. Reorganizing a TM (full reindexing, if we ever implement it) will automatically redo all TU numbering from scratch, and therefore eliminate any such "holes".
(Short reminder for non-coders) A concurrent access problem is a situation where several users attempt to simultaneously update the same data, resulting in inconsistencies. Here, unless we apply my advance indexing suggestion, it would happen when a first user opens a TU which has not yet been created on the TM server, and another user opens the same TU (which happens to exist in another file of the project) and begins to create it as well before the first one is stored, giving birth to one of the two following situations:
The real problem is that, if we did nothing about it, in this (rare) case we would end up with two TUs (each with its own ID) containing the same source segment, without any good reason for it. I then had an idea: perform indexing in advance (as soon as the TU is opened, but before its target segment is known). This is possible (all required data is available to the server) because TU indexing is only based on the source segment's contents. Here, probably, is our solution. Each time a user opens a TU, he automatically requests the TM server to immediately perform a fuzzy match and return the results, for good reasons:
The only situation where this advance indexing must be cancelled is when the user interactively modifies segmentation, either by joining two contiguous segments together (always allowed within the same paragraph, as determined by the translation client) or by splitting a segment in two (most probably because it was first created by joining two segments). As a rule, this will not happen often. In this case, the TM server will un-index the TU, thus generating a hole (unused number) in the sequence of TU IDs.

So, I suggest the server should index a new TU as soon as a translation client asks whether one exists, when the user opens the TU. Then, any other user requesting the creation of the same TU will receive a warning message and won't be able to save his translation in the server's TM; he will have to wait until the update is performed (a security timeout will help ensure the TU does not remain open forever on the first user's translation client - see below). Any other user will then automatically close and skip the TU; the translation client could also keep a "non-synchronized" flag within the TU. In this case, the TM server must also index each of this source segment's N-Grams.

The great advantage is much better response times: when the user has finished entering a new translation or validating an existing one, all index data will already have been updated in advance in the TM tables. Also, in the case of a roll-back, I don't believe it's useful to also delete N-Gram entries in tables (we should only update the applicable records in the NGRIDX_tmid table, see below). Should some N-Grams be unique to the old TU's source segment, they will come back into play as soon as the missing bit (if put into another segment) is also translated. So, temporarily keeping records corresponding to unused N-Grams is NOT going to be a problem, as long as the corresponding indexes are properly updated.
Of course, we must then handle the situation where the translation client never sends the target segment to the server (if the user stops working in the middle of a TU). In this case, the server might:
We will "only" need to manage connections well enough to allow the user to reconnect to the server after a forced disconnection (network down for too long, client application crash). In this case, I suggest that when a client attempts to reconnect, the server should send it a message detailing which TU needs to be re-opened (if it was left in "open" state when something went wrong). All this makes me think our TM server would use a single process (a loop) to process all TM update requests each in turn (in synchronous mode, if I understand correctly), whereas the simple context searches can be processed in asynchronous mode (by another process), so all TM update requests would be queued. Here, too, the queue's maximum length would be equal to the number of users, as a translation client simply can't go any further until it has finished performing its current task. If experienced database coders believe I'm wrong here, please provide a better alternative ASAP.
After thinking about it over and over again, I have come up with a highly customizable solution. We'll begin by defining a set of rules for each kind of language. The two I'll describe here should suit the needs of Western languages and of ideogram-based languages (like Chinese). Each time we want to add support for a new language, we'll do as follows:
1. Parsing the source segment and extracting, one by one, all full words found according to the set of delimiters to be defined here:
A simple way to do this is to transform every delimiter into a space character, trim spaces (remove all spaces at the segment's beginning and end, and collapse duplicate spaces), then extract words from the segment with any adequate substring extraction function (to be investigated in Tcl later on).

2. Removing word duplicates, for optimization.

3. For each word, determining its list of N-Grams according to the language parameter value lngrset (rule set). I suggest calling the same algorithm with a specific set of parameters fitting the relevant language. These parameters, defined in the [RuleSet] section (see below) of the TM server's configuration parameters (sample values provide an example), are as follows:

Western language default (proposal)
Algorithm

    // Initialization
    word = "amnesia"                  // just an example here, length 7
    if lowcas then word = LowerCase(word)  // in fact to be done at segment level (faster)
    ngram_list = (empty)
    word_length = StringLength(word)  // I consider the rank of the first character in a string to be 1 here
    Append word to ngram_list         // a full word is the biggest of its N-Grams

    For i_length = ngminl to (word_length - 1) step +1
        // with the sample values: loop for i_length = 3 to 6
        // i_length is the length of the N-Gram extracted, starting at the
        // minimum length specified, ending at (word length - 1)

        // Beginning of word
        if ngbeg = True then
            Append SubString(1, i_length) to ngram_list
            // where 1 is the start position and i_length the substring length
        end if

        // Middle of word (turned off in my proposal, ngmid = False, so nothing is done here)
        if ngmid = True then
            For j_posinword = 2 to (word_length - i_length) step +1
                // loop within the word, beginning at position 2 (the second character)
                Append SubString(j_posinword, i_length) to ngram_list
            Next j_posinword
        end if

        // End of word
        if ngend = True then
            Append SubString((word_length - i_length + 1), i_length) to ngram_list
            // where (word_length - i_length + 1) is the start position and i_length
            // the substring length, so we extract the i_length LAST characters here
        end if
    Next i_length

There, we only have to create our N-Gram entries in DB tables / filesystem / whatever.

Ideogram language default (proposal)

We have NOTHING to do with N-Grams; we only need to consider the individual "words", each of which is made up of a single Unicode character. The list of words (duplicates removed) makes up the full list of N-Grams (all are whole words) to be indexed.
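The word extraction (steps 1-2) and the N-Gram algorithm above can be sketched in Python as follows. This is only an illustration, not the planned Tcl implementation; the delimiter set is a hypothetical example, and the parameter names follow the [RuleSet] section described later in this document.

```python
import re

# Illustrative delimiter set; the real set of delimiters remains to be defined.
DELIMITERS = r"[ \t,;:.!?()\"']+"

def segment_words(segment):
    """Steps 1-2: split the segment on delimiters and drop duplicate words."""
    return {w for w in re.split(DELIMITERS, segment) if w}

def word_ngrams(word, ngminl=3, ngbeg=True, ngmid=False, ngend=True, lowcas=True):
    """Step 3: the set of N-Grams to index for one word (Western rule set)."""
    if lowcas:
        word = word.lower()          # in practice done once at segment level
    ngrams = {word}                  # a full word is the biggest of its N-Grams
    for length in range(ngminl, len(word)):        # ngminl .. word length - 1
        if ngbeg:
            ngrams.add(word[:length])              # anchored at word beginning
        if ngmid:                                  # strictly inside the word
            for start in range(1, len(word) - length):
                ngrams.add(word[start:start + length])
        if ngend:
            ngrams.add(word[-length:])             # anchored at word end
    return ngrams
```

With the proposed Western defaults (3 / begin / end / lowercase), "amnesia" yields the full word plus the four prefixes "amn" to "amnesi" and the four suffixes "sia" to "mnesia", nine N-Grams in all.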
This issue is precisely the one for which I would be happy to receive advice from algorithm experts. Ideally, we should be able to define a math function that would return V, a uniquely derived value (quite possibly a multi-dimensional value, a vector) from a segment's ordered list of words (seen as N-Gram IDs), punctuation marks and simplified tags, so as to search for the closest values of V already stored in the TM. I'm not strong enough in maths to suggest anything, but I hope this document will be enough for some reader to answer "I know how to do this". We need to find such a function in order to make the Free CATS server "work"; otherwise we will start using half-baked, dirty tricks so as to come up with "something" that will not be close enough to what we're trying to do. But I want to believe [(c) X-Files] that a mathematically elegant solution exists, that somebody will tell me about it and, an even greater miracle, will explain it to me in terms I can grasp. So, I have not yet devised a true algorithm. Here are a few hints from which I suggest building it. We start from a query string which may (TU storage in TM) or may not (fuzzy searching) be a full source segment. It's a sequence of items (words, punctuation marks and/or internal (simplified) tags). In order to keep things simple, I suggest working on words first, then refining our near-final results with the other types of items. We need to find a set made up of up to MaxTUNumber (client-defined threshold) TUs in the TM, each of which has a matching rate no lower than MinFuzzyRate (client-defined threshold) as compared to our query string. We have already found (thanks to Thierry) that dealing with N-Grams instead of words works. Therefore, we are looking for TUs which are closest to our query string, so they:
We might try to work on the last criterion after the first two, once we have built a subset on which to work. The main advantage of series of binary flags like those found in the SEGIDX (and NGRIDX) tables is that they allow us to perform binary AND and OR operations on their contents very quickly. In order to use a single algorithm for a query string, whether it's a TU source segment or a context search, in the latter case, once we have found this famous V function, we'll simply compute V(query string). To fulfill the first two criteria, and as we can't browse the entire contents of the SEGIDX table in order to find the best match, we must first find a quick way to find it; this may indicate that my draft data structure needs some improvement.
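The binary AND/OR idea can be illustrated with Python integers used as bit vectors. The layout is an assumption for illustration only: for each N-Gram, a NGRIDX-style record holds one bit per TU, set when that TU's source segment contains the N-Gram; the sample data is made up.

```python
# Hypothetical NGRIDX-style records: N-Gram -> bit vector, bit i set when
# TU number i contains that N-Gram in its source segment (sample data).
ngridx = {
    "amn":   0b01101,   # present in TUs 0, 2 and 3
    "nesia": 0b00101,   # present in TUs 0 and 2
    "sia":   0b10100,   # present in TUs 2 and 4
}

def tus_with_all(ngrams):
    """Binary AND: TUs whose source segment contains every query N-Gram."""
    result = ~0                      # all bits set, then narrowed down
    for ng in ngrams:
        result &= ngridx.get(ng, 0)
    return result

def tus_with_any(ngrams):
    """Binary OR: TUs sharing at least one N-Gram with the query."""
    result = 0
    for ng in ngrams:
        result |= ngridx.get(ng, 0)
    return result
```

A single machine-word AND or OR tests the whole TU population at once, which is why such flag series are cheap to combine even for large TMs.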
If the source segments of two TUs are equal, they must have the same list of N-Grams, so the corresponding records in SEGIDX are equal. However, if two SEGIDX records are equal, there might still be some discrepancies between the segments, for instance if:
A way to do this could also be to generate, from a source segment, an ordered list of its words (seen as N-Gram IDs). We must try to determine, for a given source segment, the TUs whose source segment contains all our N-Grams. To do this:
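One way to find TUs sharing N-Grams with a query, and to rank them, is an in-memory inverted index. The sketch below is a hypothetical stand-in for the NGRIDX table (the mapping and sample data are mine), and the shared-N-Gram count is only a crude stand-in for the fuzzy-match rate discussed above:

```python
from collections import Counter

# Hypothetical in-memory stand-in for the NGRIDX table: N-Gram -> TU IDs
# whose source segment contains it (sample data for illustration).
ngram_to_tus = {
    "amn":   {1, 3, 4},
    "nesia": {1, 3},
    "sia":   {3, 5},
}

def candidate_tus(query_ngrams, min_shared=1):
    """Rank TUs by how many of the query's N-Grams they share.

    Returns (tu_id, shared_count) pairs, best candidates first; a TU
    containing ALL query N-Grams scores len(query_ngrams).
    """
    counts = Counter()
    for ng in query_ngrams:
        for tu in ngram_to_tus.get(ng, ()):
            counts[tu] += 1
    return [(tu, n) for tu, n in counts.most_common() if n >= min_shared]
```

With the sample data, TU 3 contains all three query N-Grams and comes out first; the lower-scoring TUs form the "close enough" subset on which the finer word-order comparison could then run.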
******** to be reworked or scrapped ********

So, we may start from SEGIDX records and OR operations between them as probably the best way to select "close enough" segments, but this will generate a quite large number of fuzzy matches. We would refine our selection with this ordered list. If the result of an AND operation is null, we have to look in this ordered list (Yet Another Index Table). There, we would still drop punctuation marks and internal (simplified) tags, as their discrepancies must be considered "minor" compared to words being identical. It's only after fetching the "best" segments (from word contents AND sequence) that we would slightly lower fuzzy rates based on punctuation marks and internal tags, and therefore build the "definitive" list of best matches.
It seems that Berkeley DB does not have a database (logical set of tables) concept, so I guess we must manually append the TM ID (a 32-bit integer, further detailed below, indicated by _tmid here) to the table name. If we work directly on top of a filesystem, we could use directories for this and thereby simplify the whole thing. For each TM, we need to use the following tables:
Note: This first approach does not fully represent all the data found in a source segment, as we can't tell, from the index tables, the number of occurrences of an N-Gram within the source segment, nor its location(s) within it. It was designed with the aim of quickly determining whether or not a given N-Gram is found in a source segment. If need be, we might also use another table to store a source segment's representation as a sequence of N-Grams (full words) and internal (simplified) tags, separated by a delimiter.
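Such a sequence representation could be encoded and decoded as sketched below. Everything here is an assumption for illustration: "/" as the delimiter, integers for word (N-Gram) IDs, and arbitrary strings for simplified tag markers.

```python
# Hypothetical encoding of a source segment as a delimiter-separated sequence
# of word IDs (full-word N-Gram IDs, as integers) and simplified tag markers
# (as strings); "/" is an arbitrary delimiter chosen for illustration.
DELIM = "/"

def encode_sequence(items):
    """items: ordered list mixing word IDs (int) and tag markers (str)."""
    return DELIM.join(str(i) for i in items)

def decode_sequence(value):
    """Recover the ordered list: numeric parts back to ints, tags kept as-is."""
    return [int(p) if p.isdigit() else p for p in value.split(DELIM)]
```

Unlike the flag tables, this value preserves word order and occurrence counts, so it could feed the finer sequence-based comparison mentioned above.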
Let's not forget to store the following information for each TU:
For simplicity's sake, I have not included them in any table, but (in order to limit the overall number of tables used by the server) they could fit in the TRGSEG table; the beginning of the value contents could be, for instance, as follows (where / is a delimiter):

Apart from these, the client could use the following general-purpose tables (flat .CFG-type files would also do):
I suggest implementing miscellaneous TM-level parameters either as a table or as a separate .CFG-type file, as follows (at least as an example):

[Admin] ; server administrator data
admid = (string value) ; administrator's ID
admpwd = (string value) ; administrator's password
admname = (string value) ; administrator's full name
admmail = (string value) ; administrator's email address
[Languages] ; natural languages reference data
lnglist = (list of delimiter-separated string values) ; list of ISO 639-2 codes of languages currently supported by the Free CATS server
lngrset = (list of delimiter-separated string values) ; list of rule set codes applicable to the language at the corresponding position in lnglist
lnflag = (list of delimiter-separated string values) ; list of file URIs corresponding to country flags (only needed by translation clients)

Note: A rule set code is a parameter which determines the algorithm chosen to build the list of N-Grams from a word in a given language.
Note: The language names table is only stored on the clients, and the server can specify which languages are currently implemented on it; this may vary between servers.
[RuleSet] ; N-Gram rule set (to be followed for a set of languages)
ngminl = (integer value > 0) ; minimum N-Gram length
ngbeg = (boolean value) ; do we index all N-Grams starting at the word's beginning (0 = false, 1 = true)
ngend = (boolean value) ; do we index all N-Grams ending at the word's end (0 = false, 1 = true)
ngmid = (boolean value) ; do we index all N-Grams in the "middle" (not starting at the word's beginning and not ending at the word's end) (0 = false, 1 = true)
lowcas = (boolean value) ; do we transform all capital letters in N-Grams to lower case (0 = false (keep capital letters), 1 = true)

For instance, I would suggest, for Western languages: 3 1 0 1, or possibly 3 1 1 1. As I see it, this still-simple parameterization should allow great flexibility, probably more than we'll ever need.
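If these parameters end up in a .CFG-type file, they are straightforward to read with a standard INI parser. A minimal Python sketch (the key names come from the [RuleSet] section above; the file-in-a-string is just for demonstration):

```python
import configparser

# Sample [RuleSet] section with the proposed Western-language defaults.
SAMPLE = """
[RuleSet]
ngminl = 3
ngbeg = 1
ngend = 1
ngmid = 0
lowcas = 1
"""

cfg = configparser.ConfigParser(inline_comment_prefixes=(";",))
cfg.read_string(SAMPLE)

rules = cfg["RuleSet"]
ngminl = rules.getint("ngminl")
ngbeg = rules.getboolean("ngbeg")    # configparser accepts 0/1 as booleans
ngmid = rules.getboolean("ngmid")
```

The ";" inline-comment prefix matches the comment style used in the sections of this document, so the same file could keep its explanatory comments.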
[General] ; server data
lasttm = (32-bit integer) ; ID of last TM created on the Free CATS server (0 at initial setup)
tmlist = (list of delimiter-separated integers) ; list of IDs of TMs stored on the Free CATS server (starting at 1)
svsegd = (list of delimiter-separated string values) ; list of optional segment delimiters allowed (unique strings: Tab, ":")
[User] ; user data
lastusr = (32-bit integer) ; ID of last user created on the Free CATS server (0 at initial setup)
usrlist = (list of delimiter-separated integers) ; list of IDs of users created on the Free CATS server (starting at 1)
usrnam = (list of delimiter-separated strings) ; list of user names on the Free CATS server
usrpwd = (list of delimiter-separated strings) ; list of user passwords on the Free CATS server
usreml = (list of delimiter-separated strings) ; list of user email addresses on the Free CATS server (may be used some day)
usrtm = (list of delimiter-separated integers) ; list of IDs of TMs to which user ID usr has been granted access on the Free CATS server (usr being a unique user ID)
[TMnn] (nn being a unique TM ID) ; TM data
tmdesc = (string value) ; TM description
srclng = (string value) ; source language (3-letter ISO 639-2 code)
tgtlng = (string value) ; target language (3-letter ISO 639-2 code)
srccty = (string value) ; source country (3-letter ISO 3166 code)
tgtcty = (string value) ; target country (3-letter ISO 3166 code)
crtmstp = (date & time) ; TM creation timestamp
lutmstp = (date & time) ; TM last update timestamp
lasttu = (32-bit integer) ; ID of last TU created on the Free CATS server (up to more than 4,294,000,000)
lastng = (32-bit integer) ; ID of last N-Gram created on the Free CATS server (idem)
tucnt = (32-bit integer) ; TU counter on the Free CATS server (useful to determine the number of "holes" in numbering)
subprj = (list of delimiter-separated string values) ; list of sub-project values (unique strings)
categ = (list of delimiter-separated string values) ; list of category values (unique strings)
tmsegd = (list of delimiter-separated string values) ; list of optional segment delimiters activated (unique strings, any of: Tab, ":" - must be defined at TM level for consistency)
(other TM parameters ...)

**** check against the requirements specification