bug-recutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-recutils] Indexing plan


From: Jose E. Marchesi
Subject: Re: [bug-recutils] Indexing plan
Date: Wed, 09 May 2012 19:53:11 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Hi.

Some notes on your plan (which I generally like).
    
    The design and implementation of indexing support could be split
    into the following parts (each of them would also include tests and
    documentation if it's user-visible):
    
    - design index file structure
    
    A single file would be used, with record descriptors in binary form
    and several indices of various types.

This must probably be the first step.  Given the structure and format of
recfiles, it is important to determine _what_ can be indexed.  The data
structure to use for each index is an implementation detail at this
point.
    
    - find the index file, using recfile metadata to check if it's up to
      date

A combination of the last modified time and the size of the file will
probably be enough.  But it must be a portable solution.
    
    - delay reading the recfile
    
    If indices are found, do it on first operation requiring whole rsets
    or not being doable with the indices found.

Sorry, I don't understand.  Do what?
    
    - API for building indices and a recfix command to use it
    - API for parsing indices
    
    Maybe support a textual serialization of an index file for debugging
    and tests.

Those APIs can go into librec, maybe under the rec_idx_* namespace.
    
    The above parts could be done without indexing of records.  There will
    be other uses for indexing rset descriptors and maybe their offsets in
    the file, so this can be used to test the above parts before choosing
    record index data structures.

I like the approach of first supporting the indexing of rsets, and then
to figure out what to do for indexing records.
    
    - design and implement tree-based index of value of a specific field
      -> record mapping
    
    Could use a memory-mapped btree with offline build and search
    operations.
    
    A new record descriptor field would be added to specify which fields
    (or tuples of fields) to index using a specific index type.

I would go simple and use %key: for that purpose.  The support for
multi-field keys is another sub-project that will probably be done
before 2.0.
    
    - find what index to use for a query
    
    An easy solution would be to find a comparison operation on a field
    that has an index.  I think much more could be done, but this would be
    enough to work for simple queries and to test other parts.
    
    E.g. for selection expression '(field1 = X) && (field2 = Y) && ...'
    an index could be used to find all records with a field1 equal to X
    and evaluate the expression only on them.

That sounds like a good approach.
    
    - more performance tests
    
    Should help make indexing improve specific operations.  Won't find
    which operations to improve, unlike real applications of big recfiles.
    
    - other index structures if they would be useful

Please keep in mind that we want to use as much as stuff from gnulib as
possible.  This applies to that kind of data structures as well.  Since
you will probably find problems/improvements in the modules providing
data structures, it would be a good idea if you fill a copyright
assignment for gnulib right now.  Would that be ok for you?  As soon as
you confirm I will start the process with the gnulib maintainers.

-- 
Jose E. Marchesi         http://www.jemarch.net
GNU Project              http://www.gnu.org



reply via email to

[Prev in Thread] Current Thread [Next in Thread]