Re: [bug-recutils] Index file structure
Wed, 16 May 2012 22:33:26 +0200
Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)
> In the following description, all numbers are little endian, probably
> 64bit, and aligned to their size so they can be quickly accessed in
> memory mapped files.
> Regarding the endianness, I would rather use the host endianness (either
> little or big endian) and mark which endianness is being used in a field
> in the file header.
> An index file is something that can be regenerated at will, and thus I
> don't think that binary portability to other systems is very important.
Ok, it's probably not a common issue (I don't have any big endian
> I would also encode a version of the index file in the header, which can
> be maintained by librec when generating the index.
Adding new fields after the number of indices would need it. Adding new
index types won't need a version and I haven't planned other file format
changes after implementing this.
> In this way, the recutils can detect whether an index file with the
> wrong endianness and/or the wrong version is opened, and regenerate it.
Regenerate or warn and not use it? I don't expect it to happen multiple
times when doing multiple queries in a script, so it might be effective
> - a magic number
> This is probably the funnier part of this task! Can you think of any
> funny hexspeak magic number applicable here?
I don't have any specific ideas. It could be written in host byte order
and be different for each incompatible format version.
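As a rough sketch of that idea (Python only for brevity): the magic value
and the two-field header layout below are made up for illustration, nothing
here is decided. Writing the magic in host byte order means a reader on a
machine of the other endianness sees the bytes swapped, the comparison
fails, and the index is ignored.

```python
import struct
import sys

# Hypothetical placeholder; a real format would pick one value per
# incompatible format version.
MAGIC = 0x5245435F49445831

HEADER_SIZE = 16  # two unsigned 64-bit integers


def pack_header(num_indices, byteorder=sys.byteorder):
    """Pack magic and index count as 64-bit integers in the given byte order."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(prefix + "QQ", MAGIC, num_indices)


def read_header(data):
    """Return the index count, or None if the index file should be ignored."""
    if len(data) < HEADER_SIZE:
        return None
    # "=" reads in the byte order of the machine doing the reading.
    magic, num_indices = struct.unpack("=QQ", data[:HEADER_SIZE])
    if magic != MAGIC:
        return None  # wrong endianness or wrong format version
    return num_indices
```

With this, a single comparison covers both the endianness and the version
check, at the cost of not being able to tell the two cases apart.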
> The file name and rset names would follow.
> I am not sure it is a good idea to store the name of the indexed file
> into the index file. The application will open foo.rix, if it exists,
> before opening foo.rec. If later on the user wants to rename foo.rec
> and foo.rix to bar.rec bar.rix then she must be able to do that without
> regenerating the index file.
Ok, not sure why I considered the name check to be useful.
> The index file would be ignored if the specified recfile modification
> time, size and name don't match the ones of opened recfile, or when a
> record offset used doesn't point to an empty line.
> So the reason why you want to point to the line before the start of the
> record is because it is faster to check for ^\n than for a field name.
The check would make sure that a syntactically valid record starts at
that offset; checking for a field name would be done by the parser. A
problem with this method is that a long comment preceding a record would
need to be parsed. This could be avoided by removing this check and
storing the offset of the first field in each record.
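To make the empty-line check concrete, here is a minimal sketch. The
function name is hypothetical, and it assumes the recfile contents are
available as a byte string (e.g. from a memory-mapped file) and that an
"empty line" is a newline found at the start of a line.

```python
def offset_is_blank_line(data: bytes, offset: int) -> bool:
    """Return True if `offset` is the start of an empty line in `data`."""
    if offset < 0 or offset >= len(data):
        return False
    # A line starts at the beginning of the file or right after a newline.
    at_line_start = offset == 0 or data[offset - 1:offset] == b"\n"
    return at_line_start and data[offset:offset + 1] == b"\n"
```

For b"a: 1\n\nb: 2\n" the only offset that passes is 5, the empty line
between the two records. A first record with no empty line before it would
need separate handling.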
> - should more precise timestamps be used? Python uses only whole
> seconds and doesn't check file size, I had no problems with
> reliability of this check.
> Note that it is a common activity to use recutils in scripts. To use
> whole seconds in the index file could be problematic with code like the
> VALUE=`recsel -e "foo = 10" -P bar foo.rec`
> recins -f new -v $VALUE foo.rec
> if the recsel invocation is fast enough.
So it should use e.g. a nanosecond-precision timestamp on systems
supporting it. This could require a regeneration of the index when it's
moved to a file system with a different timestamp precision.
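A sketch of what the staleness check could look like with nanosecond
timestamps, again in Python for brevity. The function names are
hypothetical; os.stat and its st_mtime_ns and st_size fields are standard.
The index would store the pair at generation time and compare it when the
recfile is opened.

```python
import os


def recfile_signature(path):
    """Return (mtime in nanoseconds, size in bytes) for the recfile."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def index_is_stale(stored_signature, path):
    """True if the recfile no longer matches what the index recorded."""
    return tuple(stored_signature) != recfile_signature(path)
```

On a file system that only keeps whole seconds, st_mtime_ns is simply a
multiple of 10**9, so the comparison still works there but gives no extra
protection against fast successive writes.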
> - should we use 64 bit or 32 bit offsets in the file? I think most
> advantages of recutils apply only to files that are small enough to be
> edited in a text editor and index preparation would be too slow for
> larger files, so SQL databases or other solutions would be more
> practical for larger files.
> Your assumption is reasonable, but what would be the advantage of using
> 32 bit offsets, apart from the size of the index file?
There would be no other advantage. It just looked like a common file
format design pattern, maybe since most file format specifications known
to me use 32 bit sizes or unlimited decimal.