Re: [bug-recutils] Index file structure


From: Jose E. Marchesi
Subject: Re: [bug-recutils] Index file structure
Date: Tue, 15 May 2012 21:56:57 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Hi.
    
    I was thinking what the binary index file structure would need to
    specify without having indices of records.  It could be used for lazy
    loading of rsets.  I believe implementing it won't be useful now, before
    supporting seekable parsing (probably of mmapped files) and other
    changes needed to lazily load rsets.

It is good to start defining the index file format.

Note that, for the lazy loading of rsets, the implementation could do
the following in order to use an optional index file:

1. If the index file exists, then read it and use it to locate the
   record set.

2. Otherwise search for ^%rec: TYPENAME without parsing the whole file.
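
The fallback in step 2 can be sketched roughly as follows.  This is only
an illustration in Python, not librec code: the function name
find_rset_offset and the sample data are made up, and it assumes the
whole recfile content is available as a bytes object (for example from
an mmap).

```python
import re

def find_rset_offset(data: bytes, typename: str) -> int:
    """Return the byte offset of the '%rec: TYPENAME' line, or -1 if absent.

    Only a line-anchored search is done; no record or field is parsed.
    """
    pattern = re.compile(
        rb"^%rec:[ \t]*" + re.escape(typename.encode()) + rb"[ \t]*$",
        re.MULTILINE,
    )
    match = pattern.search(data)
    return match.start() if match else -1

# Made-up sample recfile with two record sets.
sample = b"%rec: Book\n\nTitle: A\n\n%rec: Author\n\nName: B\n"
```

Here find_rset_offset(sample, "Author") gives the position of the
second descriptor, and a type name that is only a prefix of an existing
one (such as "Auth") is not matched.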

    In the following description, all numbers are little endian, probably
    64bit, and aligned to their size so they can be quickly accessed in
    memory mapped files.

Regarding the endianness, I would rather use the host endianness (either
little or big endian) and mark which endianness is being used in a field
in the file header.

An index file is something that can be regenerated at will, and thus I
don't think that binary portability to other systems is very important.

I would also encode a version of the index file in the header, which can
be maintained by librec when generating the index.

In this way, the recutils can detect whether an index file with the
wrong endianness and/or the wrong version is opened, and regenerate it.
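
A minimal sketch of that check, written in Python for brevity.
Everything specific here is an assumption made for illustration: the
magic bytes "RIDX", the version number 1, the flag values and the
12-byte layout are placeholders, not a proposed format.

```python
import struct
import sys

MAGIC = b"RIDX"            # hypothetical placeholder, not a chosen magic number
INDEX_VERSION = 1          # hypothetical current version of the index format
ENDIAN_LITTLE, ENDIAN_BIG = 0, 1

def host_endian_flag() -> int:
    """Flag describing the byte order of the machine writing the index."""
    return ENDIAN_LITTLE if sys.byteorder == "little" else ENDIAN_BIG

def pack_header(version: int = INDEX_VERSION) -> bytes:
    # Layout: magic (4 bytes), endianness flag (1 byte), 3 padding bytes,
    # version (unsigned 32-bit, stored in host byte order).
    return MAGIC + struct.pack("=B3xI", host_endian_flag(), version)

def header_is_usable(header: bytes) -> bool:
    """Return False when the index should be regenerated."""
    if len(header) < 12 or header[:4] != MAGIC:
        return False
    if header[4] != host_endian_flag():
        return False       # written on a host with the other byte order
    (version,) = struct.unpack_from("=I", header, 8)
    return version == INDEX_VERSION
```

The endianness flag is a single byte, so it can be read before knowing
how to decode any multi-byte field.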

    All offsets in the recfile would point to the empty line before the
    start of a record (unless it's the start of the file), to be more sure
    that the index is up to date.

    The file would start with these fields:
    
    - a magic number

This is probably the most fun part of this task!  Can you think of any
funny hexspeak magic number applicable here?
    
    - recfile modification time in seconds since the start of 1970
    - recfile size
    - recfile name length
    - number of rsets
    - number of indices

I would insert some padding here for future use.
    
    Then for each rset:
    
    - rset type name length
    - offset in the recfile to the "\n\n%rec" starting it
    
    The file name and rset names would follow.

    Then (maybe after padding) the binary descriptor of each index would
    be included.  Each descriptor would start with its type and length so
    descriptors of unknown types could be skipped.
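
Skipping descriptors of unknown types could look roughly like the
sketch below.  It assumes, purely for illustration, that each descriptor
starts with two 64-bit numbers in host byte order (type, then payload
length in bytes) and that type 1 is the only known type; padding between
descriptors is ignored here.

```python
import struct

KNOWN_TYPES = {1}          # hypothetical: the only descriptor type understood

def iter_known_descriptors(buf: bytes, count: int, start: int = 0):
    """Yield (type, payload) for known descriptors, skipping the others."""
    pos = start
    for _ in range(count):
        dtype, length = struct.unpack_from("=QQ", buf, pos)
        pos += 16                      # size of the (type, length) prefix
        if dtype in KNOWN_TYPES:
            yield dtype, buf[pos:pos + length]
        pos += length                  # the length alone is enough to skip
```

Because the length comes right after the type, a reader never has to
understand the payload of a descriptor it does not know.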

I am not sure it is a good idea to store the name of the indexed file
into the index file.  The application will open foo.rix, if it exists,
before opening foo.rec.  If later on the user wants to rename foo.rec
and foo.rix to bar.rec bar.rix then she must be able to do that without
regenerating the index file.
    
    The index file would be ignored if the specified recfile modification
    time, size and name don't match the ones of opened recfile, or when a
    record offset used doesn't point to an empty line.

So the reason you want to point to the line before the start of the
record is that it is faster to check for ^\n than for a field name.
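
The two checks discussed above can be sketched as follows, in Python for
brevity.  The function names are made up, and the sketch assumes that a
recorded offset points at the newline byte of the empty line itself, so
the byte before it is the newline ending the previous line.

```python
import os

def stat_matches(path: str, recorded_mtime: int, recorded_size: int) -> bool:
    """Compare the recfile's whole-second mtime and size with the index."""
    st = os.stat(path)
    return int(st.st_mtime) == recorded_mtime and st.st_size == recorded_size

def offset_looks_valid(data: bytes, offset: int) -> bool:
    """Cheap staleness check: the offset must land on an empty line."""
    if offset == 0:
        return True                    # start of file: no empty line expected
    if not 0 < offset < len(data):
        return False
    return data[offset] == 0x0A and data[offset - 1] == 0x0A
```

Only two bytes are inspected per offset, which is why this is cheaper
than comparing a field name.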
    
    I think these issues need discussing (and probably aren't the only
    ones):
    
    - should more precise timestamps be used?  Python uses only whole
      seconds and doesn't check file size, I had no problems with
      reliability of this check.

Note that it is common to use recutils in scripts.  Using whole seconds
in the index file could be problematic with code like the following:

VALUE=`recsel -e "foo = 10" -P bar foo.rec`
recins -f new -v "$VALUE" foo.rec

if the recsel invocation is fast enough.
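
To illustrate the granularity problem, here is a small Python sketch
comparing two modification times given in nanoseconds (the kind of value
os.stat() exposes as st_mtime_ns).  The function name and the sample
timestamps are made up.

```python
NS_PER_SECOND = 10**9

def changed_since(recorded_ns: int, current_ns: int, whole_seconds: bool) -> bool:
    """Tell whether a file looks modified, at the chosen timestamp precision."""
    if whole_seconds:
        # Truncate both stamps to seconds, as a whole-second index field would.
        return recorded_ns // NS_PER_SECOND != current_ns // NS_PER_SECOND
    return recorded_ns != current_ns

# Two modifications half a second apart, inside the same wall-clock second.
first = 1_337_111_817 * NS_PER_SECOND + 100_000_000
second = first + 500_000_000
```

With whole seconds, changed_since(first, second, True) reports no
change, while the nanosecond comparison does detect it.  The real
precision also depends on what the file system stores.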
    
    - should we use 64 bit or 32 bit offsets in the file?  I think most
      advantages of recutils apply only to files that are small enough to be
      edited in a text editor and index preparation would be too slow for
      larger files, so SQL databases or other solutions would be more
      practical for larger files.

Your assumption is reasonable, but what would be the advantage of using
32 bit offsets, apart from the size of the index file?
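
For a sense of scale, a rough Python calculation of the per-rset table
size under both choices.  It assumes each entry holds just the two
numbers listed above (name length and offset), both of the same width;
the function name is made up.

```python
import struct

MAX_OFFSET_32 = 2**32 - 1      # largest recfile offset a 32-bit field can hold

def rset_table_bytes(n_rsets: int, int_format: str) -> int:
    """Size of the rset table: one (name length, offset) pair per rset."""
    return n_rsets * struct.calcsize("=" + int_format * 2)

small = rset_table_bytes(100, "I")   # 32-bit fields
large = rset_table_bytes(100, "Q")   # 64-bit fields
```

For 100 record sets the difference is 800 bytes against 1600 bytes,
while 32-bit offsets would cap the recfile at 4 GiB.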
    
-- 
Jose E. Marchesi         http://www.jemarch.net
GNU Project              http://www.gnu.org


