Re: [bug-recutils] Index file structure


From: Michał Masłowski
Subject: Re: [bug-recutils] Index file structure
Date: Wed, 16 May 2012 22:33:26 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)

>     In the following description, all numbers are little endian, probably
>     64bit, and aligned to their size so they can be quickly accessed in
>     memory mapped files.
>
> Regarding the endianness, I would rather use the host endianness (either
> little or big endian) and mark which endianness is being used in a field
> in the file header.
>
> An index file is something that can be regenerated at will, and thus I
> don't think that binary portability to other systems is very important.

Ok, it's probably not a common issue (I don't have any big endian
machines).

> I would also encode a version of the index file in the header, which can
> be maintained by librec when generating the index.

Adding new fields after the number of indices would need it.  Adding new
index types won't need a version and I haven't planned other file format
changes after implementing this.

> In this way, the recutils can detect whether an index file with the
> wrong endianness and/or the wrong version is opened, and regenerate it.

Should it regenerate the index, or warn and not use it?  I don't expect
a mismatch to happen repeatedly when a script runs several queries, so
regenerating might be efficient enough.
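
A minimal sketch of such a header check, in Python for brevity (librec
itself is C).  The 16-byte layout, the field widths, and the names
FORMAT_VERSION, pack_header and header_usable are assumptions made for
illustration only; nothing here is an agreed format, and the magic
number is left out.

```python
import struct
import sys

# Hypothetical version number; no value was agreed on in this thread.
FORMAT_VERSION = 1

# Byte-order marker stored as a single byte so it reads the same on any host.
LITTLE, BIG = 0, 1
HOST_ORDER = LITTLE if sys.byteorder == "little" else BIG

def pack_header(num_indices, order=HOST_ORDER, version=FORMAT_VERSION):
    """Build a 16-byte header: order marker, 3 padding bytes,
    32-bit version, 64-bit number of indices."""
    prefix = "<" if order == LITTLE else ">"
    body = struct.pack(prefix + "IQ", version, num_indices)
    return bytes([order]) + b"\0\0\0" + body

def header_usable(header):
    """Return True only if the header was written with this host's
    byte order and with the format version this code understands."""
    if len(header) < 16:
        return False
    if header[0] != HOST_ORDER:
        return False
    version, _num_indices = struct.unpack("=IQ", header[4:16])
    return version == FORMAT_VERSION
```

When header_usable returns False, the caller could either regenerate
the index or warn and fall back to scanning the recfile.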

>     - a magic number
>
> This is probably the funniest part of this task!  Can you think of any
> funny hexspeak magic number applicable here?

I don't have any specific ideas.  It could be written in the host byte
order and be different for each incompatible format version.
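
As a rough illustration of why a magic number written in host byte
order also reveals the writer's endianness, here is a Python sketch.
The value of MAGIC and the function name classify_magic are made up for
this example; the only requirement is that the value is not the same
when its bytes are reversed.

```python
import struct

# Hypothetical magic value; the thread did not settle on one.
MAGIC = 0x5245C1DE

def classify_magic(first4):
    """Classify the first four bytes of an index file."""
    if len(first4) != 4:
        return "invalid"
    (native,) = struct.unpack("=I", first4)
    if native == MAGIC:
        return "native"
    # Reinterpret the same four bytes in the opposite byte order.
    swapped = int.from_bytes(native.to_bytes(4, "little"), "big")
    if swapped == MAGIC:
        return "foreign-endian"
    return "invalid"
```

A different MAGIC per incompatible format version would make an old
index file show up as "invalid" in the same check.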

>     The file name and rset names would follow.
[...]
> I am not sure it is a good idea to store the name of the indexed file
> into the index file.  The application will open foo.rix, if it exists,
> before opening foo.rec.  If later on the user wants to rename foo.rec
> and foo.rix to bar.rec bar.rix then she must be able to do that without
> regenerating the index file.

Ok, not sure why I considered the name check to be useful.

>     The index file would be ignored if the specified recfile modification
>     time, size and name don't match the ones of opened recfile, or when a
>     record offset used doesn't point to an empty line.
>
> So the reason why you want to point to the line before the start of the
> record is because it is faster to check for ^\n than for a field name.

The check would make sure that a syntactically valid record starts at
that offset; checking for a field name would be done by the parser.  A
problem with this method is that a long comment preceding a record would
need to be parsed.  This could be avoided by removing the check and
storing the offset of the first field in each record.
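
The empty-line check itself is cheap.  A sketch in Python, assuming the
recfile contents are available as bytes (e.g. from a memory mapped
file) and that "empty line" means a lone newline at the start of a
line; the function name is hypothetical:

```python
def points_to_empty_line(data, offset):
    """Return True if `offset` is the position of an empty line in `data`."""
    if offset < 0 or offset >= len(data):
        return False
    # The offset must be at the start of a line...
    at_line_start = offset == 0 or data[offset - 1:offset] == b"\n"
    # ...and that line must consist of the newline only.
    return at_line_start and data[offset:offset + 1] == b"\n"
```

An index whose stored offset fails this check would be ignored.  Any
comment lines between the empty line and the first field are still left
to the parser, which is the cost mentioned above.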

>     - should more precise timestamps be used?  Python uses only whole
>       seconds and doesn't check file size, I had no problems with
>       reliability of this check.
>
> Note that it is a common activity to use recutils in scripts.  To use
> whole seconds in the index file could be problematic with code like the
> following:
>
> VALUE=`recsel -e "foo = 10" -P bar foo.rec`
> recins -f new -v "$VALUE" foo.rec
>
> if the recsel invocation is fast enough.

So it should use e.g. a nanosecond precision timestamp on systems
supporting it.  This could require regenerating the index when the files
are moved to a file system with a different timestamp precision.
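
For comparison, a sketch of the staleness check in Python, which
exposes nanosecond timestamps as st_mtime_ns.  The helper names and the
idea of storing the pair (mtime in nanoseconds, size) in the index are
assumptions for illustration:

```python
import os

def recfile_stamp(path):
    """Return (modification time in nanoseconds, size in bytes) of a recfile."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)

def index_is_fresh(stored_stamp, recfile_path):
    """Return True if the stamp stored in the index still matches the recfile."""
    return tuple(stored_stamp) == recfile_stamp(recfile_path)
```

On a file system that only keeps whole seconds, the nanosecond part is
simply zero, which is why a copied index could stop matching.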

>     - should we use 64 bit or 32 bit offsets in the file?  I think most
>       advantages of recutils apply only to files that are small enough to be
>       edited in a text editor and index preparation would be too slow for
>       larger files, so SQL databases or other solutions would be more
>       practical for larger files.
>
> Your assumption is reasonable but, what would be the advantage of using
> 32 bit offsets, apart from the size of the index file?

There would be no other advantage.  It just looked like a common file
format design pattern, maybe because most file format specifications
known to me use 32 bit sizes or unlimited decimal.
