Re: [bug-recutils] Index file structure

bug-recutils

From:	Michał Masłowski
Subject:	Re: [bug-recutils] Index file structure
Date:	Wed, 16 May 2012 22:33:26 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)

>     In the following description, all numbers are little endian, probably
>     64bit, and aligned to their size so they can be quickly accessed in
>     memory mapped files.
>
> Regarding the endianness, I would rather use the host endianness (either
> little or big endian) and mark which endianness is being used in a field
> in the file header.
>
> An index file is something that can be regenerated at will, and thus I
> don't think that binary portability to other systems is very important.

Ok, it's probably not a common issue (I don't have any big endian
machines).

> I would also encode a version of the index file in the header, which can
> be maintained by librec when generating the index.

Adding new fields after the number of indices would need a version
field.  Adding new index types would not, and I haven't planned any
other file format changes after implementing this.

> In this way, the recutils can detect whether an index file with the
> wrong endianness and/or the wrong version is opened, and regenerate it.

Regenerate it, or warn and not use it?  I don't expect a mismatch to
happen repeatedly when a script runs multiple queries, so regenerating
might be effective enough.

>     - a magic number
>
> This is probably the funnier part of this task!  Can you think of any
> funny hexspeak magic number applicable here?

I don't have any specific ideas.  It could be written in the host byte
order and be different for each incompatible format version.
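
A minimal sketch of how a reader could classify such a magic number,
written in Python only for illustration.  The value MAGIC_BASE, the
convention of keeping the format version in the low byte, and the
function names are all assumptions made for this example; nothing in
this thread fixes them.

```python
import struct

# Hypothetical layout: upper three bytes spell "RIX", low byte is the
# format version.  Neither the value nor the layout comes from the thread.
MAGIC_BASE = 0x52495800
CURRENT_VERSION = 1


def write_magic(version: int) -> bytes:
    """Encode the magic number in the host byte order ("=" in struct)."""
    return struct.pack("=I", MAGIC_BASE | version)


def classify_magic(raw: bytes) -> str:
    """Tell apart a usable index, a version mismatch and a byte-order mismatch."""
    (value,) = struct.unpack("=I", raw[:4])
    # The same four bytes as a host with the opposite byte order would read them.
    swapped = int.from_bytes(value.to_bytes(4, "little"), "big")
    if value == MAGIC_BASE | CURRENT_VERSION:
        return "ok"
    if value & 0xFFFFFF00 == MAGIC_BASE:
        return "wrong-version"
    if swapped & 0xFFFFFF00 == MAGIC_BASE:
        return "wrong-endianness"
    return "not-an-index"
```

With a scheme like this, a separate byte-order field in the header is
not strictly needed, since a byte-swapped magic number already reveals
an index written on a host with the other byte order.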

>     The file name and rset names would follow.
[...]
> I am not sure it is a good idea to store the name of the indexed file
> into the index file.  The application will open foo.rix, if it exists,
> before opening foo.rec.  If later on the user wants to rename foo.rec
> and foo.rix to bar.rec bar.rix then she must be able to do that without
> regenerating the index file.

Ok, not sure why I considered the name check to be useful.

>     The index file would be ignored if the specified recfile modification
>     time, size and name don't match the ones of opened recfile, or when a
>     record offset used doesn't point to an empty line.
>
> So the reason why you want to point to the line before the start of the
> record is because it is faster to check for ^\n than for a field name.

The check would make sure that a syntactically valid record starts at
that offset; checking for a field name would be done by the parser.  A
problem with this method is that a long comment preceding a record would
need to be parsed.  This could be avoided by removing this check and
storing the offset of the first field in each record.
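
The empty-line check itself is cheap.  A rough sketch in Python,
assuming the recfile contents are available as a bytes object and that
a stored offset is the position of the newline forming the empty line
(the function name is made up for this example):

```python
def offset_points_to_blank_line(data: bytes, offset: int) -> bool:
    """Return True if `offset` is the position of an empty line in `data`."""
    if not 0 <= offset < len(data):
        return False
    # An empty line is a newline that sits at the very start of a line.
    at_line_start = offset == 0 or data[offset - 1:offset] == b"\n"
    return at_line_start and data[offset:offset + 1] == b"\n"
```

For the contents "a: 1\n\nb: 2\n" only offset 5 passes, since it is the
empty line separating the two records.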

>     - should more precise timestamps be used?  Python uses only whole
>       seconds and doesn't check file size, I had no problems with
>       reliability of this check.
>
> Note that it is a common activity to use recutils in scripts.  To use
> whole seconds in the index file could be problematic with code like the
> following:
>
> VALUE=$(recsel -e "foo = 10" -P bar foo.rec)
> recins -f new -v "$VALUE" foo.rec
>
> if the recsel invocation is fast enough.

So it should use, e.g., a nanosecond-precision timestamp on systems that
support it.  This could require regenerating the index when the files
are moved to a file system with a different timestamp precision.
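
As an illustration of the staleness check, here is a small Python
sketch that compares a stored (modification time, size) pair against
the recfile on disk.  The function names and the idea of storing the
pair as a tuple are assumptions for this example; st_mtime_ns is the
integer nanosecond timestamp reported by Python's os.stat, and its real
resolution depends on the file system.

```python
import os


def recfile_fingerprint(path):
    """Return (mtime in nanoseconds, size in bytes) for the file at `path`."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def index_is_stale(stored_fingerprint, path):
    """The index must be ignored when either the timestamp or the size differs."""
    return stored_fingerprint != recfile_fingerprint(path)
```

Checking the size as well catches an insertion that lands within the
same timestamp tick as the previous index generation.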

>     - should we use 64 bit or 32 bit offsets in the file?  I think most
>       advantages of recutils apply only to files that are small enough to be
>       edited in a text editor and index preparation would be too slow for
>       larger files, so SQL databases or other solutions would be more
>       practical for larger files.
>
> Your assumption is reasonable but, what would be the advantage of using
> 32 bit offsets, apart from the size of the index file?

There would be no other advantage.  It just looked like a common file
format design pattern, maybe because most file format specifications
known to me use 32-bit sizes or unbounded decimal numbers.
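
To put numbers on the trade-off, a back-of-the-envelope sketch in
Python, assuming one stored offset per record and nothing else in the
table (the names are invented for this example):

```python
import struct

# Fixed-width unsigned offsets in host byte order, without padding.
OFFSET_32 = struct.Struct("=I")
OFFSET_64 = struct.Struct("=Q")


def offset_table_bytes(record_count, fmt):
    """Bytes needed to store one offset per record."""
    return record_count * fmt.size


def max_addressable_size(fmt):
    """Largest recfile size whose byte offsets all fit in the given width."""
    return 2 ** (8 * fmt.size)
```

For a million records the table takes 4000000 bytes with 32-bit offsets
and 8000000 bytes with 64-bit ones, while 32-bit offsets limit the
recfile to 4 GiB.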

Current Thread

[bug-recutils] Index file structure, Michał Masłowski, 2012/05/15
- Re: [bug-recutils] Index file structure, Jose E. Marchesi, 2012/05/15
  - Re: [bug-recutils] Index file structure, Michał Masłowski <=