[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-recutils] Index binary file format and interface

From: Michał Masłowski
Subject: [bug-recutils] Index binary file format and interface
Date: Tue, 31 Jul 2012 21:35:07 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.1 (gnu/linux)

Hello and sorry for the delay.

Previously I have written that several similar index formats could be
useful, ones mapping tuples of field values to records and ones mapping
indices (like for recsel -n) to records (this would be useful for the
default sorting).  The field values would be represented as strings,
case-insensitive strings, integers, reals or dates, all of these would
have different optimal binary formats and sort orders.

Noticing that the record positions depend on sorting (which can be
chosen for a recsel invocation) I think a separate index format for
indexed queries or default sorting won't be needed and this would be
more useful in the typical format (for comparison queries, sorting and
indexed queries).

This could be the index byte array format:

- 1 byte rset number (assuming there won't be more than 256 rsets in a
  file, unsure if there would be any benefit in having more)

- 1 byte depth; 0 is used if the root is leaf

- key type:

  - 1 byte number of fields in the type

  - 1 byte field value type: string, case-insensitive string, integer,
    real or date (initially only string)

  - null-terminated field name

  ... value types and names of the other fields

- padding to eight bytes

- nodes, starting from root

Non-leaf node format:

- number of keys

- node_0 key_0 num_records_0

- ...

- node_n

where node_i is an offset in this byte array to the target node.
Assume there is -infinity before node_0 and +infinity after node_n,
then key=x node=n key=y num_records=m means that all keys in the range
[x, y) are in the subtree of root n and that subtree would have m
records for interval search (records with multiple fields of the same
name would be duplicated in the tree, but sorting would use only the
first occurence of the field).

Leaf node format:

- line_0 character_0 key_0

- ...

- line_n character_n key_n next_node

where next_node is 0 in the last leaf and line_i has its highest bit
set to 1 if this node element shouldn't be used for sorting (i.e. if a
non-first field is used for comparison).

Public interfaces (with types and arguments to be defined):

- rec_idx_tree_new(rset, byte_array, size)

- rec_idx_tree_get_rset_number(idx_tree)

- rec_idx_tree_get_num_fields(idx_tree)

- rec_idx_tree_get_field_name(idx_tree, n) for n-th field in the key

- rec_idx_tree_get_field_type(idx_tree, n) for n-th field in the key
  type, will have an enum of supported values

- rec_idx_tree_scan(idx_tree, left, right) returns an abstract
  iterator of records in key range [left, right] (left and right will
  be probably pointers to arrays of pointers to values of appropriate

- rec_idx_tree_scan_interval(idx_tree, left, right) for index or
  random queries (recsel's -n and -m)

- rec_idx_tree_destroy(idx_tree)

- rec_idx_tree_build(rset, &byte_array, &index) builds an index for an
  rset, it's stored in memory to be written to an index file

After implementing and testing this I will implement recfix integration
and a very simple query planner for rec_db_query.

Indexes with partially unsupported key types would not be used, I think
there is no need to handle them.

Any comments?

Attachment: pgpE7SsrSv7ZM.pgp
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]