Re: [libextractor] return of getKeywords()
From: Christian Grothoff
Subject: Re: [libextractor] return of getKeywords()
Date: Fri, 30 Mar 2007 15:27:49 -0600
User-agent: KMail/1.9.5
On Friday 30 March 2007 14:56, Ryan Underwood wrote:
> > As you can see, the plugin first checks if this is a JPEG, and then if it
> > is, instantly adds the MIME type. So even if the JPEG does not contain
> > any other metadata, you'll always get at least the MIME type.
>
> I was referring to the "return prev" and friends after preliminary
> sanity checks. This is still I/O, but I see the point; the file is
> already opened, so most of the damage is done by that point, especially
> on a network filesystem which caches the whole file on open.
I am not sure that this is actually true in general (that NFS and other
network file systems cache the entire file on mmap/open). Maybe you want to
profile this (generate a huge 2 GB file, mmap it, and see what happens).
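One way to run such an experiment is sketched below. It is not part of libextractor; the helper names resident_pages() and probe_mmap() are made up for illustration, and the sketch assumes a Linux/glibc system where mincore() is available. It maps a file read-only, asks the kernel how many pages of the mapping are resident right after mmap(), and asks again after touching the first byte.

```c
#define _DEFAULT_SOURCE /* exposes mincore() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count the pages of [addr, addr+len) that the kernel
   reports as resident. Returns -1 if the query fails. */
long resident_pages(void *addr, size_t len)
{
  size_t page = (size_t) sysconf(_SC_PAGESIZE);
  size_t n = (len + page - 1) / page;
  unsigned char *vec = malloc(n);
  long count = 0;
  size_t i;

  if (vec == NULL)
    return -1;
  if (mincore(addr, len, vec) != 0) {
    free(vec);
    return -1;
  }
  for (i = 0; i < n; i++)
    if (vec[i] & 1) /* lowest bit set: page is resident */
      count++;
  free(vec);
  return count;
}

/* Hypothetical helper: map `path` read-only and record residency right after
   mmap() (*before) and after reading a single byte (*after).
   Returns the total number of pages in the mapping, or -1 on error. */
long probe_mmap(const char *path, long *before, long *after)
{
  struct stat st;
  volatile unsigned char sink;
  unsigned char *map;
  size_t len;
  long page = sysconf(_SC_PAGESIZE);
  int fd;

  assert(path != NULL && before != NULL && after != NULL);
  fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  if (fstat(fd, &st) != 0 || st.st_size == 0) {
    close(fd);
    return -1;
  }
  len = (size_t) st.st_size;
  map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd); /* the mapping stays valid after close() */
  if (map == MAP_FAILED)
    return -1;
  *before = resident_pages(map, len);
  sink = map[0]; /* fault in the first page only */
  (void) sink;
  *after = resident_pages(map, len);
  munmap(map, len);
  return (long) ((len + (size_t) page - 1) / (size_t) page);
}
```

Run against a large file on the network mount that has not been read recently: if *before is small compared to the returned total, mmap/open did not pull in the whole file. The numbers only reflect the local page cache, so treat them as a hint rather than a proof.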
> > Actually, you can already do this with the existing API. All you need is
> > (manually) construct a mapping of mime-types/file-extensions to LE plugin
> > names (based on your assumptions of what mime-types/extensions could
> > possibly be handled by a particular plugin) and then just
> > use "EXTRACTOR_addLibrary(NULL, "pluginname")" to load just the right
> > plugin for each extension (also avoids the cost of loading useless
> > plugins!). Keep the resulting ExtractorList's in memory (and re-use for
> > all files of that type/extension). So this can easily be done without
> > changing the API at all.
>
> This sounds good; are the LE plugin names static enough to rely upon in
> compiled code?
Yes. We do not change those around -- after all, users can use them in
configuration files (see for example the GNUnet FS EXTRACTOR configuration
option where users specify which plugins they want to load). Naturally,
there may be new plugins from time to time, so ideally you may want to put
this information not into the binary but into a configuration file (mapping
extensions/mime-types to LE plugins). Ship with a reasonable default, and
most users will never have to worry about it.
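A minimal sketch of such a mapping, in plain C and independent of libextractor, could look like the following. The "ext = plugin" line format, the le_map names and the size limits are all assumptions made up for this example, and the plugin names in the usage note are placeholders; check them against the plugins actually installed. The string returned by the lookup is what the application would pass to EXTRACTOR_addLibrary(NULL, ...).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

#define LE_MAX_RULES 64

/* One rule: file extension (without the dot) -> LE plugin name. */
struct le_rule {
  char ext[16];
  char plugin[64];
};

struct le_map {
  struct le_rule rules[LE_MAX_RULES];
  size_t count;
};

/* Parse one configuration line of the assumed form "ext = plugin".
   Lines starting with '#', blank lines and malformed lines are ignored.
   Returns 1 if a rule was added, 0 otherwise. */
int le_map_add_line(struct le_map *map, const char *line)
{
  struct le_rule r;

  assert(map != NULL && line != NULL);
  if (map->count >= LE_MAX_RULES || line[0] == '#')
    return 0;
  if (sscanf(line, " %15[^ =\t\n] = %63[^ \t\n]", r.ext, r.plugin) != 2)
    return 0;
  map->rules[map->count++] = r;
  return 1;
}

/* Return the plugin name configured for the extension of `filename`,
   or NULL if LE should not be run for this file at all. */
const char *le_map_lookup(const struct le_map *map, const char *filename)
{
  const char *dot;
  size_t i;

  assert(map != NULL && filename != NULL);
  dot = strrchr(filename, '.');
  if (dot == NULL || dot[1] == '\0')
    return NULL;
  for (i = 0; i < map->count; i++)
    if (strcasecmp(map->rules[i].ext, dot + 1) == 0)
      return map->rules[i].plugin;
  return NULL;
}
```

With lines such as "jpg = libextractor_jpeg" loaded from the shipped default file, le_map_lookup(&map, "photo.JPG") yields the plugin name to load, while a NULL result lets the application skip LE without ever opening the file.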
> > Well, the above optimization allows you to avoid calling plugins that you
> > do not like to call. But again, note that no IO is done if you use
> > getKeywords2, and even with getKeywords, IO is only done once
> > per "getKeywords" call, never once per plugin.
>
> Yeah, as I said above, noticing that mmap is used instead of reading
> into buffer, alleviates another concern.
>
> My main concern is still avoiding that initial file open, because it is
> quite expensive here. In order to do that in my application, I have to
> be able to tell if libextractor handled the file or ignored it.
I think it is better to configure your application to specifically use or avoid
LE for particular extensions/mime-types (with a configuration file) instead
of changing LE to provide heuristic information. Also, this way you will be
guaranteed deterministic behavior from your application: for extensions where
LE is enabled, it will *always* be run, because it *always* makes sense. For
those where it does not make sense, LE will *never* be run, instead of once
for a (random) first file.
> Based on your previous comments noting that the plugins set the mime
> type, it seems like an EXTRACTOR_MIMETYPE keyword type will only ever be
> set if any plugin claims the file. So could I then assume that if no
> keywords of type EXTRACTOR_MIMETYPE exist, then the file was effectively
> ignored by all plugins that were loaded?
Not exactly, since some plugins do not set MIMETYPE because they cannot be
sure (HTML is one example). Also, some plugins are not mime-specific
(filename, printable, hash). These plugins may or may not add metadata, but
will never set a mimetype.
Christian