Re: [libextractor] extractor metadata and XML/RDF
From: Christian Grothoff
Subject: Re: [libextractor] extractor metadata and XML/RDF
Date: Mon, 9 Jul 2007 23:43:16 -0600
User-agent: KMail/1.9.5

On Monday 09 July 2007 10:07, Andreas Harth wrote:
> Hello,
>
> I'm working on SWSE [1], a Semantic Web Search Engine. The aim
> is to collect arbitrary content from the Web and make the metadata
> available for search and query.
>
> Extractor looks like exactly the right tool for extracting metadata
> from legacy formats. However, the resulting metadata are name-value
> pairs, which makes post-processing difficult.
I don't see how it makes post-processing difficult. It is pretty much the
simplest format possible. Now, certainly having data such as dates and
numbers in a highly standardized format would help certain forms of
post-processing. However, given that some of the file formats are a bit
vague in how they encode the data in the first place, I don't see how it
would be possible to always achieve this.
> Do you have (or are there efforts in that direction) a more formal
> way of returning metadata? I can see XML or better RDF fitting there.
> I'd like to add some terms from standard ontologies (such as Dublin
> Core and Friend of a Friend) to the output, probably using sed
> scripts in the beginning if there is currently nothing else available.
The metadata types used by LE were motivated by Dublin Core. Additional terms
are added as needed by particular formats. Improvements to the set of
available metadata types are welcome but should be driven by adding new
plugins or modifying existing ones to produce better terms, not by just
adding terms that will never be extracted. I am not aware of any effort to
add support for RDF or XML.
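
For illustration only, post-processing of this kind could be done outside LE,
on the name-value pairs themselves. The following is a minimal sketch in
Python, not part of LE: the keyword names in KEYWORD_TO_DC, the helper names,
and the example file URI are all assumptions made for the example. Only the
Dublin Core element namespace and element names are standard. It emits one
N-Triples line per pair that has a mapping and skips everything else.

```python
# Dublin Core "elements" namespace (standard).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from extractor-style keyword names to Dublin Core
# element names. The keys are assumptions, not LE's actual type list.
KEYWORD_TO_DC = {
    "title": "title",
    "author": "creator",
    "creation date": "date",
    "subject": "subject",
    "description": "description",
    "publisher": "publisher",
    "mimetype": "format",
    "language": "language",
}


def escape_literal(value):
    """Escape a string for use inside an N-Triples double-quoted literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def pairs_to_ntriples(subject_uri, pairs):
    """Turn (name, value) pairs into N-Triples lines about subject_uri.

    Pairs whose name has no entry in KEYWORD_TO_DC are skipped.
    """
    lines = []
    for name, value in pairs:
        dc_term = KEYWORD_TO_DC.get(name.strip().lower())
        if dc_term is None:
            continue
        lines.append('<%s> <%s%s> "%s" .'
                     % (subject_uri, DC_NS, dc_term, escape_literal(value)))
    return lines


if __name__ == "__main__":
    # Made-up pairs standing in for extracted metadata.
    example = [("title", "Annual Report"),
               ("author", "A. Example"),
               ("software", "no mapping, so skipped")]
    for line in pairs_to_ntriples("file:///tmp/report.pdf", example):
        print(line)
```

Values are passed through as plain string literals, so vaguely encoded dates
or numbers stay exactly as vague as they were in the source file.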
Best regards,
Christian