Re: [help-GIFT] Searching for similarities

From: David Squire
Subject: Re: [help-GIFT] Searching for similarities
Date: Fri, 19 Oct 2001 12:43:26 +1000

Carsten Pfeiffer wrote:

> On Friday, 19 October 2001 02:28 David Squire wrote:
> > I have that in the pipeline. At present it handles pdf, ps, doc, txt and
> > html. I have also written a tool to heuristically correct text that is
> > mangled by ps2txt or pdf2txt (i.e. hyphens, missing ligatures, etc.).

I should have mentioned that it makes use of ispell to do this, and the 
correction is *slow*.
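
To give a feel for the kind of heuristic involved, here is a minimal sketch 
in Python. It is not the actual tool: a small hard-coded word set 
(KNOWN_WORDS) stands in for the ispell lookup, the function names are 
hypothetical, and it assumes that a missing ligature was dropped from the 
extracted text entirely.

```python
import re

# Hypothetical stand-in for an ispell lookup: a tiny set of known words.
KNOWN_WORDS = {"retrieval", "office", "efficient", "document", "the", "of"}

# Ligatures assumed to have been dropped by the text extraction step.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def is_word(word, known=KNOWN_WORDS):
    """Case-insensitive dictionary check (the real tool would ask ispell)."""
    return word.lower() in known


def join_hyphenated(text, known=KNOWN_WORDS):
    """Rejoin a word split by a hyphen at a line break, but only when the
    joined form is a known word; otherwise leave the text untouched."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        return joined if is_word(joined, known) else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)


def restore_ligature(word, known=KNOWN_WORDS):
    """Try inserting each ligature at each position of an unknown word and
    return the first candidate that is a known word."""
    if is_word(word, known):
        return word
    for i in range(len(word) + 1):
        for lig in LIGATURES:
            candidate = word[:i] + lig + word[i:]
            if is_word(candidate, known):
                return candidate
    return word  # no plausible repair found
```

Every unknown word triggers one dictionary lookup per candidate repair, so 
the number of lookups grows quickly with the length of the document.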

> > It's the last week of semester here, so there is a real chance that I'll be
> > able to integrate this stuff into GIFT soon.
> Wow, excellent! Looking forward to trying that out. How'd you query for such
> things? I don't remember all of MRML off the top of my head -- were there any
> elements for querying for textual meta-data?

MRML does not explicitly support keyword-based queries at present, but I think 
that we will have to consider such an extension (after all, it's been a pretty 
standard example of how to extend MRML in talks and papers).

On the other hand, the query by example stuff still works. For a text document 
the text is not meta-data, it *is* the data. Doing classic "bag of words" text 
retrieval works fine with the existing GIFT.
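
As a rough illustration of query by example over text, here is a minimal 
"bag of words" sketch in Python. It uses plain term counts and cosine 
similarity, which is an assumption made for the example and not GIFT's 
actual feature weighting; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a document to term counts, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_text, collection):
    """Rank documents in `collection` (name -> text) against an example
    document, most similar first."""
    query = bag_of_words(query_text)
    scored = [(cosine(query, bag_of_words(text)), name)
              for name, text in collection.items()]
    return sorted(scored, reverse=True)
```

The example document plays the role of the query, so no keyword syntax is 
needed: calling rank() with the text of one document returns the others 
ordered by how many terms they share with it.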

> Is there [going to be] some way to not only perform queries, but also access
> all the data, so that one could try to visualize it in some tree or graph
> structure?

I'm not sure that I understand you here. If you are talking about meta-data, 
then it is not something I have considered at all (since I have not considered 
the text to be meta-data). If you're talking about the documents themselves - 
feel free to write such a thing :)



