libextractor

Re: [libextractor] Hachoir project and some comments about libextractor


From: Christian Grothoff
Subject: Re: [libextractor] Hachoir project and some comments about libextractor
Date: Sat, 2 Sep 2006 14:09:13 -0700
User-agent: KMail/1.9.4

Dear Victor,

First of all, sorry for the very late reply, but I've been incredibly busy.
For LE, we would certainly want to extract more metadata if possible, so
having another project that shows systematically how to get to it is great --
I look forward to studying your parsers to improve ours.

As for performance, I have measured it by extracting metadata from dozens to
thousands of files in the same process.  Usually, to avoid measuring disk I/O
performance, I use the same file and run the extraction many times.  Just
like Python, LE has some startup overhead since it needs to load the plugins.
So if you want to compare performance, a good approach is to take various
files (different formats and sizes) and repeatedly extract the metadata
in-process (without printing).
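
To make the idea concrete, here is a minimal sketch of such an in-process
loop, written in Python for brevity.  The names `benchmark` and
`dummy_extract` are made up for illustration and are not part of LE or
Hachoir; a real comparison would pass in a wrapper around the actual
extractor.  The data is held in memory, so disk I/O is not measured, and one
warm-up call keeps one-time setup cost out of the timing.

```python
import time


def benchmark(extract, data, repeats=1000):
    """Call extract(data) `repeats` times and return mean seconds per call."""
    extract(data)  # warm-up call: keeps one-time setup out of the timing
    start = time.perf_counter()
    for _ in range(repeats):
        extract(data)  # result is discarded: no printing inside the loop
    elapsed = time.perf_counter() - start
    return elapsed / repeats


def dummy_extract(data):
    # Hypothetical stand-in for a real metadata extractor.
    return {"size": len(data)}


if __name__ == "__main__":
    sample = b"\x89PNG\r\n\x1a\n" + bytes(1024)  # in-memory test input
    mean = benchmark(dummy_extract, sample)
    print(f"{mean * 1e6:.2f} microseconds per call")
```

Running the same loop over inputs of different formats and sizes gives
per-call numbers that do not include interpreter or plugin startup.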

Another good benchmark (for libextractor) that includes I/O is to use doodle
(http://gnunet.org/doodle/), which will run the extractor on all files on the
local hard disk.  A doodle-style test is useful if you want to measure I/O
(and not just CPU).  After all, extractors can be much faster if they can
avoid reading the entire file into memory.
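
As an illustration of that last point, the following Python sketch reads only
the first 24 bytes of a PNG file to obtain its dimensions, instead of loading
the whole file.  The function name `png_dimensions` is made up for this
example and is not taken from LE or Hachoir; the layout it relies on (an
8-byte signature followed by the IHDR chunk, which starts with width and
height as big-endian 32-bit integers) comes from the PNG specification.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(stream):
    """Return (width, height) of a PNG, reading only its first 24 bytes.

    `stream` is any binary file-like object.  Returns None if the data does
    not start with a PNG signature followed by an IHDR chunk.
    """
    header = stream.read(24)
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Bytes 16..23 hold width and height as big-endian unsigned 32-bit ints.
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Called on a file opened in binary mode, this touches a fixed number of bytes
no matter how large the image is.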

I also have one question for you: why did you start Hachoir?  I mean, which
goals of your project could not have been achieved within the context of GNU
libextractor?


Best regards,

Christian


On Wednesday 02 August 2006 08:16, Victor STINNER wrote:
> Hi,
>
> I'm one of the authors of Hachoir project:
>   http://hachoir.python-hosting.com/
>
> This project is a generic binary (and only binary) file parser. It has been
> in development for 10 months, but it is already worth testing.
>
> I'm writing to you because I wrote a small tool based on Hachoir:
> hachoir-metadata, which extracts a lot of information from known files.
> "Known" means that it needs a Hachoir parser and a metadata extractor. The
> list of supported files is here:
> http://hachoir.python-hosting.com/wiki/Metadata
>
> It's hard to say whether it's fast or not since I don't have a good test,
> but on supported files it gives more information than extract. I don't know
> if your goal is to extract as much information as possible or just to
> extract information useful for finding a specific file.
>
> We worked on optimisation over the last few weeks. The best result was with
> svn version 479: on one file, Hachoir was just 4 times slower than extract.
> The test is "time extract file.png" and "time hachoir --metadata file.png".
> But this test is stupid because Python takes some milliseconds to load
> (whereas extract is pure C code).
>
> --
>
> I think that you could use the Hachoir source code to improve your parsers.
> Example: your PNG parser is poor. It extracts neither the creation date nor
> comments. You can look at "parser/image/png.py" and "metadata/image.py".
>
> To download Hachoir:
>   svn co https://svn.hachoir.python-hosting.com/hachoir/trunk hachoir
>
> To test Hachoir:
>   cd <hachoir directory>
>   export PYTHONPATH=$(cd src; pwd)
>   script/hachoir-metadata file
>   script/hachoir-metadata file1 file2 ...
> Options:
>   script/hachoir-metadata --level LEVEL file, filter information
>   script/hachoir-metadata --mime LEVEL file, just display MIME type
>
> You can also test the file explorer (needs the Python "urwid" module):
>   script/hachoir-urwid file
>
> Or you can install it using "./setup.py install" ;-) (but right now it's
> broken, I will fix it in the next few hours)
>
> Haypo



