help-gift

Re: [help-GIFT] Adding text features to Viper/GIFT


From: Wolfgang Müller
Subject: Re: [help-GIFT] Adding text features to Viper/GIFT
Date: Tue, 15 May 2001 08:36:06 +0200

On Tuesday 15 May 2001 03:14, David Squire wrote:
> Hi all,
>
> I am just about to spend a few hours integrating my text indexing code with
> the feature extraction code for Viper/GIFT. One of the fundamental issues
> here (as has been discussed earlier) is that the number and nature of the
> features (word stems) which will be encountered in indexing a collection is
> not known in advance.
>
> The currently suggested solution is to maintain a file with each collection
> which maps words to feature IDs - feature IDs would not correspond directly
> between collections (whereas they do now).

Yes, I think there is no real choice. Otherwise each word would need a unique
hash value that depends only on the word itself, which would mean very long
feature ID values.
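
To make the idea concrete, here is a minimal sketch of such a per-collection
mapping. It is written in Python purely for illustration and is not taken from
the GIFT code; the class name TermIdMap, the first_id offset and the JSON file
format are all assumptions.

```python
import json
from pathlib import Path


class TermIdMap:
    """Per-collection mapping from word stems to small integer feature IDs."""

    def __init__(self, first_id=0):
        # Assumption: first_id is an offset that keeps text feature IDs
        # clear of any IDs already used by other feature types.
        self.first_id = first_id
        self.ids = {}

    def get_id(self, term):
        # Hand out the next free ID the first time a term is seen.
        if term not in self.ids:
            self.ids[term] = self.first_id + len(self.ids)
        return self.ids[term]

    def save(self, path):
        # Assumption: one JSON file per collection holds the whole map.
        data = {"first_id": self.first_id, "ids": self.ids}
        Path(path).write_text(json.dumps(data), encoding="utf-8")

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        mapping = cls(data["first_id"])
        mapping.ids = data["ids"]
        return mapping
```

The IDs stay short because they only have to be unique within one collection;
the price is that the same word gets different IDs in different collections.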

>
> My current (quick and dirty) text indexing software accepts *all* the .txt
> files to index as command line arguments. Statistics are then gathered for
> term frequencies in the documents (in fact they are presently treated on a
> paragraph by paragraph basis) and the entire collection as a whole. The
> advantage of this is that a single hash mapping terms to their IDs and
> collection frequencies can be maintained throughout the entire process.
>

I think indexing in one run is the way to go. The current solution is a hack 
(which works well :-) ).
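
A rough sketch of what indexing in one run could look like, again in Python and
only as an illustration: one pass over all documents keeps a single term-to-ID
hash plus the frequency statistics. The function names are made up, the
tokenizer is a stand-in for real stemming, and documents are treated as whole
files rather than paragraph by paragraph.

```python
import re
from collections import Counter


def tokenize(text):
    # Placeholder for real stemming: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def index_collection(documents):
    """Index all documents in one pass.

    documents maps a document name to its text. Returns the term -> ID hash,
    per-document term frequencies keyed by term ID, collection frequencies
    and document frequencies.
    """
    term_ids = {}
    term_freqs = {}
    collection_freq = Counter()
    document_freq = Counter()
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        for term in counts:
            # New terms get the next free ID; known terms keep theirs.
            term_ids.setdefault(term, len(term_ids))
        term_freqs[name] = {term_ids[t]: n for t, n in counts.items()}
        collection_freq.update(counts)         # total occurrences per term
        document_freq.update(counts.keys())    # number of documents per term
    return term_ids, term_freqs, collection_freq, document_freq
```

Because the hash lives in memory for the whole run, nothing has to be loaded,
updated and saved again for every single file.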


> If this were to be changed to work on a file by file basis, as the image
> indexing currently works, then a file storing this hash would have to be
> loaded, updated and then saved each time features were extracted for a
> given .txt file.
>
> I am planning a work-around where an initial text indexing phase will index
> all .txt files in a collection, and write a summary file containing term ID
> and term document frequency information for each .txt file. These can then
> be read when the individual images are indexed. I think that this will work
> quite well, but I think that we should think about how this should be
> handled in the gift-add-collection.pl, gift-extract-features,
> gift-generate-inverted-file framework.
>

I think the feature extraction framework we need should be quite close to what
the GIFT itself does:

Have a kernel and some plugins.

The kernel should slurp in a list of URLs expressed in XML (to incorporate 
filenames with spaces and other ugly things), and this list of URLs should 
then be processed by the appropriate plugins. The kernel would do the 
dispatching.

As a result we would have only one invocation which gives us lots of speed 
advantages. 
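
Very roughly, the kernel/plugin split could look like the sketch below. This is
Python for illustration only and nothing here exists in the GIFT; the
<url-list>/<url> element names, the plugin registry and the choice to dispatch
on the file extension are all assumptions.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse

# Hypothetical registry: file extension -> feature extraction plugin.
PLUGINS = {}


def plugin(*extensions):
    """Decorator that registers a function as the plugin for some extensions."""
    def register(func):
        for ext in extensions:
            PLUGINS[ext] = func
        return func
    return register


@plugin(".txt")
def extract_text_features(url):
    # A real plugin would do the text indexing here.
    return ("text", url)


@plugin(".jpg", ".png")
def extract_image_features(url):
    # A real plugin would do the image feature extraction here.
    return ("image", url)


def run_kernel(xml_text):
    """Read the URL list once and dispatch each URL to the matching plugin."""
    results = []
    for element in ET.fromstring(xml_text).iter("url"):
        url = (element.text or "").strip()
        path = unquote(urlparse(url).path)
        handler = PLUGINS.get(os.path.splitext(path)[1].lower())
        if handler is not None:  # URLs without a matching plugin are skipped
            results.append(handler(url))
    return results
```

Since the list is XML, a filename with spaces is just element text and needs no
shell quoting, and the whole collection is handled by one process.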

I think nobody is working on this at present. Whoever volunteers for this
part should be ready to take pains and be creative, so that the framework
enables more than she/he has in mind right now.

For the time being, I guess the best you can do is:

1) Slurp in a file containing a list of all your images. url2fts.xml may
serve as inspiration.
2) Make your feature extractor parse gift-config.mrml.
3) Add references to the files you generated to gift-config.mrml.

This way, people should be able to add a text index as easily as they can add
a "normal" gift index.

Wolfgang


