[Freecats-Dev] Source segments indexing method
From: Charles Stewart
Subject: [Freecats-Dev] Source segments indexing method
Date: Sun, 26 Jan 2003 07:32:01 -0500 (EST)
A general question: how ambitious is FreeCATS going to be? Are we going
to model the hierarchical structure of language, or do we think that
task is too hard? If we don't, how are we going to spot discontinuous
idioms such as "neither... nor..."?
I'll restrict my comments to the source segments indexing (SSI)
method for the moment. I've appended some excerpts from earlier emails
at the end of this message:
#1. What components of FreeCATS make use of the SSI method?
Is it only to be used in building the corpus for the
Translation Memory server, or do we use it as a preprocessor in
translating text?
#2. I agree with David Welton about starting with European
languages for now, but I think we should make an effort to
attract someone who knows Asian character sets. I don't think
we should figure this stuff out for ourselves if none of us
speaks an Asian language. We shouldn't wait too long: if we
work only with Indo-European languages, we might get some
nasty surprises when we find that Korean, say, violates some
assumption we thought applied to all language texts;
#3. Unicode character properties: clearly it is the right thing
to use these;
#4. I think it is better to work directly from the source text:
it might sound like a harder problem to work with raw source
files without any preprocessing, but:
- It isn't as hard as it sounds. Rather than work with
case-folded texts, we work with case-insensitive matches
(see the sketch after this list). When case is useful, it
is there to use (e.g. in German all nouns are capitalised,
and can be the best token to distinguish otherwise
ambiguous words);
- We will lose potentially valuable information if we
do things like throw away email headers;
- We can be smarter without it. E.g. if we translate an
apparently English-language email into French, the
preprocessor is unlikely to be smart enough to spot
a C program fragment hidden in the body, but the translation
software can be. Let's call this the "envelope problem":
figuring out all the ways in which to-be-translated text
might be interwoven with to-be-passed-on verbatim text.
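To make the case-insensitive matching point concrete, here is a
minimal Tcl sketch (the segment contents are made up for
illustration):

    # Match against the raw, unmodified source text with a
    # case-insensitive comparison instead of pre-folding the corpus.
    set stored "Die Bank steht am Fluss."
    set query  "die bank steht am fluss."
    if {[string equal -nocase $stored $query]} {
        puts "segments match (ignoring case)"
    }
    # The original casing stays available when it carries meaning,
    # e.g. the capitalised German nouns "Bank" and "Fluss".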
#5. N-grams: this is easy to do if we represent the lexicon using
a state-transition diagram or even a recursive descent parser
(the best of these are almost as fast as lexing with regexps).
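Assuming "n-gram" here means character n-grams over a word (the
exact definition is still open), a minimal Tcl extractor might look
like this; it doesn't address the state-transition representation
of the lexicon:

    # Return all substrings of length n from a word.
    proc ngrams {word n} {
        set result {}
        set last [expr {[string length $word] - $n}]
        for {set i 0} {$i <= $last} {incr i} {
            lappend result [string range $word $i [expr {$i + $n - 1}]]
        }
        return $result
    }
    puts [ngrams "indexing" 3]   ;# ind nde dex exi xin ing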
#6. I'm against using fuzzy matching: if we build up a big
corpus in a language, it will contain almost all misspellings
that actually occur in that language. Exact matching is
much faster than fuzzy matching, and easier to design around.
Henri Chorand <address@hidden>:
1) Parsing
We need to parse the source segment and split it into a sequence (an
ordered list) of items (words, separators and tags).
Definitions:
Word       sequence of contiguous alphabetic and/or numeric
           characters
Separator  sequence of contiguous non-alphabetic, non-numeric
           characters:
           space
           tab
           punctuation marks . , ; : ! ? (and inverted ¿ ¡)
           ' " < > + - * / = _ ( ) [ ] { } hyphens & similar
           various symbols (etc.)
           non-breaking space
Tag        a tag belonging to our list of internal tags (let's
           assume our file is already converted)
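A rough Tcl sketch of a tokenizer along these lines; the
angle-bracket tag syntax is an assumption, since the internal tag
format is still to be decided:

    # Split a segment into {type text} pairs: tags, words
    # (alphanumeric runs) and separators (everything else).
    proc tokenize {segment} {
        set re {(<[^>]*>)|([[:alnum:]]+)|([^[:alnum:]<]+|<)}
        set tokens {}
        foreach {all tag word sep} [regexp -all -inline $re $segment] {
            if {$tag ne ""} {
                lappend tokens [list tag $tag]
            } elseif {$word ne ""} {
                lappend tokens [list word $word]
            } else {
                lappend tokens [list separator $sep]
            }
        }
        return $tokens
    }
    # tokenize {Click <b>OK</b>, then wait.}
    # => {word Click} {separator { }} {tag <b>} {word OK} {tag </b>} ...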
"Thierry Sourbier" <address@hidden>:
1) Unicode character properties should be used to determine if a
character is a letter/digit/punctuation mark.
2) Word breaking is a challenging problem (think Japanese, Thai...).
3) You'll need to apply some kind of normalization such as case folding,
otherwise "text" and "TEXT" will never fuzzy match. What about accents?
4) If you want the TM to scale, only store in the index the smallest
"sub-words" (N-grams in the literature).
address@hidden (David N. Welton):
> 1) Unicode character properties should be used to determine if a
> character is a letter/digit/punctuation mark.
Tcl deals with this out of the box, in theory:
    string is punct $foo
where "punct" accepts any Unicode punctuation character.
> 2) Word breaking is a challenging problem (think Japanese, Thai...).
It also has a 'string wordend'.
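For example (arbitrary sample strings; both commands are standard
Tcl built-ins):

    puts [string is punct ";"]   ;# 1 - Unicode punctuation test
    puts [string is alpha "é"]   ;# 1 - accented letters count as alpha
    # string wordend gives the index just past the word containing
    # the character at the given index:
    puts [string wordend "neither fish nor fowl" 0]   ;# 7, end of "neither"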
Don't know if these do what they should for Asian character
sets, although my inclination on any open source project is to
get it running first. Maybe that means getting the project
started with European languages, and then mixing in others.
This has the disadvantage that you might have to rework things
later, but at least you get something people can use and
then they get interested in your project...
> 4) If you want the TM to scale, only store in the index the smallest
> "sub-words" (N-grams in the literature).
This is not really my department:-)
Henri Chorand again:
> 2) Word breaking is a challenging problem (think Japanese, Thai...).
As I see it, it's not too much of a problem for Chinese or
Japanese: the kana (hiragana & katakana) are syllabic scripts,
so they should be dealt with like other alphabet-based languages.
See:
http://kanji.free.fr/tabl_kana.php3?type=hira
http://kanji.free.fr/tabl_kana.php3?type=kata
Well, I know those pages are in French, but who cares at this stage :-)
Kanji use (possibly adapted, differently articulated) Chinese
characters. Mainland China uses simplified Chinese characters,
Taiwan uses traditional ones. Chinese characters (ideograms)
are in fact one-"letter" words, so each ideogram can be indexed
directly, without even needing to extract n-grams from it.
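A one-line Tcl sketch of that point (the sample string is
arbitrary):

    puts [split "日本語" ""]   ;# 日 本 語 - one index entry per ideogram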
Thai is much more of a problem in that, at least traditionally,
all words (which are monosyllabic) are glued together to form
a sentence.
> 3) You'll need to apply some kind of normalization such as case folding,
> otherwise "text" and "TEXT" will never fuzzy match. What about accents?
I would say no for accents: let's treat them as they are in any
character code set, different from the corresponding
non-accented letters. And what about accented capital letters
(can anybody confirm they are specific characters in Unicode,
like in ANSI?) (see below about case management)
Also remember that in French, for instance, an accent can be
the ONLY difference between two words with different meanings,
as in:
"hue" ("gee up", to a horse) and "hué" ("jeered")
"rue" (street) and "rué" (kicked)
(sorry if these are the only examples my horse and I can think of).
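On the Unicode question above: yes, accented capitals are their own
code points (É is U+00C9, carried over from Latin-1). A quick Tcl
check, with accented letters kept distinct as argued:

    puts [string equal "hue" "hué"]    ;# 0 - distinct index entries
    puts [format U+%04X [scan "É" %c]] ;# U+00C9 - its own code point
    puts [string toupper "é"]          ;# É - case mapping keeps the accent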
Case might be different. We should interpret a case difference
as a "less close" match, but we might also:
- ignore case during all indexing steps (index entries
normalized into all-lower-case strings);
- re-inject this difference at the final stage (so as to lower
match values).
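A minimal Tcl sketch of that two-stage idea; the penalty figure is
an arbitrary placeholder:

    # Index entries are normalized to lower case; a case difference
    # only lowers the final match value.
    proc indexKey {word} {
        return [string tolower $word]
    }
    proc matchValue {stored query} {
        if {[indexKey $stored] ne [indexKey $query]} {
            return 0     ;# different index entries altogether
        }
        if {[string equal $stored $query]} {
            return 100   ;# exact match, case included
        }
        return 95        ;# same word, case differs: "less close"
    }
    puts [matchValue "Text" "text"]   ;# 95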