bug-recutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-recutils] Additional issues to think about for the indexes supp


From: Michał Masłowski
Subject: Re: [bug-recutils] Additional issues to think about for the indexes support
Date: Thu, 17 May 2012 12:37:10 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)

> - The index file structure must allow to have indexes also for the
>   "default" (or unnamed) record set, if any.

It could use an invalid name for it, e.g. one with a negative length (or
zero length, although it would be different for rec_db_query).

> - Does it make sense to support secondary indexes, additionally to
>   primary (clustered) indexes?

In my understanding of these terms, the indices I proposed would be
non-clustered.  Queries which could use secondary indices are common for
relational databases, might be useful also for recutils.

> - The indexes must be dense, because nothing guarantees the contents in
>   the recfile to be ordered in any way.  Sparse indexes could be used
>   along with %sort:... but it would be too dangerous I think.

Haven't thought of sparse indices due to this.  If there are
files that would benefit from it, it could be implemented later,
although I don't expect this to be needed.

> - Since most (all?) of the keys will be strings, it is good to consider
>   to apply some prefix compression to the indexes algorithms to make
>   them more efficient.

Like a trie?

> - 1.6 (which will be out in a couple of week) will be shipping support
>   for composite primary keys, i.e. keys featuring more than one field,
>   such as:
>
>      %rec: Transaction
>      %key: Provider,Consumer
>
>   It follows that the indexes must support composite keys.  This is
>   usually achieved by using a lexicographic order, such as:
>
>    (a1, a2, ...) < (b1, b2, ...) IFF a1 < b2 OR (a1 = b1 AND a2 < b2) OR ...

Originally wanted to describe the index as mapping tuples of field
values to record offsets, it doesn't seem to introduce additional
difficulties.

> - We must determine which algorithm would be better for our ordered
>   indexes: b-tree or b+-tree (the second is probably a better
>   candidate).

The difference known to me is that b+-trees store values (record offsets
in our case) only in leaves, while duplicating some keys (tuples of
field values).  I'm not sure if it would have other advantages than
smaller size in cases opposite to our (unless the keys are smaller than
8 byte offsets).

> - I would not implement hash indexes, because they can't be used with
>   ranges, only with equality.  Ordered indexes are much better for this
>   application IMO.  Also, b-tree indexes can be used with prefixes of
>   the key, so the index can be used in much more selection expressions.

Selection expressions don't support lexicographical comparison of
strings, so hash indices wouldn't have this problem (although b-trees
would be needed for numbers or dates).  Prefix search could be useful.

> - The indexes can also be used in sorting operations, where the sorting
>   criteria involves an index.

Sorting would be avoided when the index finds records in the correct
order, in some cases a different index should be used if many records
are expected to be found.

Attachment: pgpJkafdKlMRh.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]