Re: [help-GIFT] Searching for similarities

From: Wolfgang Mueller
Subject: Re: [help-GIFT] Searching for similarities
Date: Fri, 19 Oct 2001 10:49:28 +0200

On Friday 19 October 2001 04:43, David Squire wrote:
> Carsten Pfeiffer wrote:
> > On Freitag, 19. Oktober 2001 02:28 David Squire wrote:
> > > I have that in the pipeline. At present it handles pdf, ps, doc, txt
> > > and html. I have also written a tool to heuristically correct text that
> > > is mangled by ps2txt or pdf2txt (i.e. hyphens, missing ligatures,
> > > etc.).
> I should have mentioned that it makes use of ispell to do this, and the
> correction is *slow*.

I guess it's in Perl, right?

> > > It's the last week of semester here, so there is a real chance that
> > > I'll be able to integrate this stuff into GIFT soon.

> > Wow, excellent! Looking forward to trying that out. How would you query
> > for such things? I don't remember all of MRML off the top of my head --
> > were there any elements for querying textual metadata?
> MRML does not explicitly support keyword-based queries at present, but I
> think that we will have to consider such an extension (after all, it's been
> a pretty standard example of how to extend MRML in talks and papers). I
> consider the problem solved :-)

OK, we have already discussed this on the list. What remains to be done is to 
write a "string"->"feature-id" function; that is the missing bit in viper.
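
As a rough illustration of what such a "string"->"feature-id" function could 
look like (a sketch only: the class name, the lowercasing, and the sequential 
ID scheme are hypothetical and not taken from viper):

```python
class FeatureVocabulary:
    """Hypothetical mapping from keyword strings to integer feature IDs."""

    def __init__(self, first_id=0):
        self._ids = {}            # normalized keyword -> feature ID
        self._next_id = first_id  # next unused feature ID

    def feature_id(self, word):
        """Return the ID for a keyword, assigning a new one on first sight."""
        key = word.strip().lower()
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]

    def features(self, text):
        """Return the sorted, de-duplicated feature IDs of a text."""
        return sorted({self.feature_id(w) for w in text.split()})
```

A real implementation would presumably have to persist the mapping and keep 
the text feature IDs apart from the IDs already used for image features.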

The other thing needed for integrating text into GIFT is a feature extraction 
framework. I am still looking for a tool to use as a base, but I would like 
to have a web crawler that constructs a list of feature extractors before 
crawling, then crawls (feeding each document to every feature extractor at 
each step), and finally destroys the feature extractors after crawling.

The feature extractors should be shared libs, similar to GIFT plugins.
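
The construct/feed/destroy life cycle described above can be sketched as 
follows. This is only an illustration under stated assumptions: the 
feed()/close() interface, the WordCountExtractor class and the crawl() 
function are all hypothetical, the "crawler" is just a list of (url, text) 
pairs, and loading extractors from shared libraries is left out.

```python
class WordCountExtractor:
    """Hypothetical extractor: counts whitespace-separated words per document."""

    def __init__(self):
        self.counts = {}
        self.closed = False

    def feed(self, url, text):
        # Called once for every document the crawler visits.
        self.counts[url] = len(text.split())

    def close(self):
        # Called once after crawling, e.g. to flush results to disk.
        self.closed = True


def crawl(documents, extractor_factories):
    """Run every extractor over every (url, text) pair in documents."""
    # 1. Construct the list of feature extractors before crawling.
    extractors = [make() for make in extractor_factories]
    try:
        # 2. Crawl, feeding each extractor at each step.
        for url, text in documents:
            for extractor in extractors:
                extractor.feed(url, text)
    finally:
        # 3. Tear the extractors down after crawling, even on errors.
        for extractor in extractors:
            extractor.close()
    return extractors
```

For example, crawl([("a.html", "hello gift world")], [WordCountExtractor]) 
returns one closed extractor holding the word count for a.html.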

Currently I am looking into ht://dig and wget as a basis. ht://dig is 
attractive for its HTML/pdf/ps parsers. OK, David, I just learned that you 
have those, too. Now the question is whether you would like to team up with 
the ht://dig people to enhance their tool. They also seem to have something 
similar to the plugin mechanism I proposed above; however, I haven't yet 
found it in the code.

I see some real advantages in teaming up with either wget or ht://dig, 
because they solve quite a lot of infrastructural problems, and they solve 
them in a truly impressive manner.


Dr. Wolfgang Müller, assistant == teaching assistant
Maintainer, GNU Image Finding Tool
