Another alternative string representation proposal 1.3

guile-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Another alternative string representation proposal 1.3

From:	Dirk Herrmann
Subject:	Another alternative string representation proposal 1.3
Date:	Fri, 6 Oct 2000 11:18:28 +0200 (MEST)
Hello everybody.

This third version of the string representation proposal incorporates some
of the suggested improvements, but also some other things:

 * I generalized the idea of a char-field to a memory-field.  The idea is,
   that in principle its not only strings and symbols that may share
   memory regions, but one could also think of sharable uniform arrays
   etc.  Even if there is currently little use for it, it's something that
   does not seem too far out of scope.

 * The concept of owned and not-owned memory-fields is still in there.  I
   still think that a mutable but not-owned memory-field is a useful
   concept.  Applications of such a thing are for example memory regions
   on the stack or static memory regions like C arrays.  The possibility
   to capture contiguous memory regions from other languages (like for
   example Fortran) without necessarily having to put them under gc
   control seems also to be a good justification.


String representation proposal 1.3:
===================================

Memory Field Type
=================

A number of container types store their elements in a contiguous memory
region, for example strings, symbols, vectors, uniform vectors etc.  For some
of these types it is desirable to be able to share memory with other languages
(like strings shared between C code and guile, or arrays shared between
Fortran and guile).  For some of these types it is desirable to be able to
share parts of the memory region between different objects (like strings and
substrings, strings and symbols or n-dimensional arrays and n-1-dimensional
arrays).

As a low-level representation of contiguous memory regions that may be shared
among guile objects and between guile and other languages, guile provides a
memory-field type.  Objects of type memory-field are only used to implement
other data types and will never be seen on the scheme level.  Types that use
memory-fields for their low-level representation will be called client-types.
Since a memory-field is an ordinary scheme object, it will automatically be
garbage collected if it is not used any more.

A memory-field has the following attributes:
 * base is a void* pointing to the base address of the memory-field.
 * length is a scm_sizet denoting the size of the memory-field.
 * owner_p is a boolean value that indicates whether the memory-field is
   actually the owner of the memory-region, i. e. whether the memory region
   should be freed if the memory-field is garbage collected.

Possible double cell layout:
   <memory-field type tag/owner_p, length, base, <type-dependent-data>>

The type-dependent-data word allows different uses, depending on the client
types.  If the type-dependent-data word or the contents of the memory region
contain scheme objects, it is the responsibility of the client types to
protect those objects from garbage protection.  For the special case of
shared memory-fields with a copy-on-write policy, there is additional
support provided, as explained in the following section.


Shared copy-on-write memory-fields
----------------------------------

For client types that allow sharing of memory-fields, but demand a
copy-on-write strategy for memory-fields that are actually shared, the
following support is provided:

 * The type-dependent-data holds an integer value that denotes the number of
   client objects that share the memory-field object.  A special case for the
   client count is a negative value, which indicates that the memory region
   associated with the memory-field object is read-only.  For all other cases,
   i. e. for non-negative client counts, the number has to be increased when a
   new client object is created, and decreased if a client object is garbage
   collected.

 * For each set of types that may share the same memory-field objects, a
   mutex has to be provided which must be locked for the modification of
   the memory-fields.  This includes modifications of the number of client
   objects.  For example, the types `string' and `symbol' would use a
   common mutex.

 * Memory-fields that have more than one client may not be modified.
   Instead, a copy of the memory field has to be made to which the
   modififications are then applied.

This, for example, allows for a thread-safe implementation of copy-on-write
strings, substrings and symbols (see below).


String Type
===========

Strings in guile are represented by shared copy-on-write memory-fields.
Thus, a string object in guile is defined by the following attributes:
 * a memory-field object
 * an unsigned integer denoting the offset from the memory-field's base
   address where the characters of the string start
 * an unsigned integer denoting the string's length.

Possible double cell layout:
   <string type tag/length, memory-field, base, <unused>>
Alternatively, if longer strings shall be used:
   <string type tag, length, memory-field, base>

Creating a substring of a string can then be performed by creating a new
string object, which points to the same memory-field object, but has an
adjusted length and offset.  Thus, strings and substrings share the same
layout and are handled transparently for the user.  There are, however,
some problems in the context of implicitly shared substrings:

 * Since a string object may actually be just a substring, it is not
   guaranteed that the last character of a string is a \0 character.  In
   fact, this should be no problem anyway in scheme, as \0 characters may
   be perfectly valid elements of a string.  However, some parts of guile
   and maybe also user code expects strings to be stored in C style.

 * Small substrings might be generated from large, temporary strings and
   that the small substrings are kept in a data structure such that the
   heap will still have to host the large original strings.  A
   single-character substring may thus keep memory-fields of arbitrary
   size from getting collected.

 * A performance problem can occur in a similar constellation:  If a small
   substring is generated from a larger string, and the larger string is
   to be modified afterwards, a copy of the memory region belonging to the
   large string has to be made.  In such cases, it would be better to copy
   the small substring instead.

Whether or not such situations occur is application dependent.  Thus, the
string implementation provides two functions to generate substrings:
 * scm_shared_substring, shared-substring
 * scm_copied_substring, copied-substring
The scheme function `substring' can be bound to either one, which allows
the user to customize the behavior of `substring' for an application or
even parts of an application.  The functions can also be called explicitly
from user code.  Moreover, a compiler can choose between the two variants,
according to which of them seems more appropriate due to static analysis.


Symbol Type
===========

The attributes for symbols are the same as for strings, except that with
every symbol a hash value has to be stored.  On the other hand, symbols
can be expected to be of limited length, which makes the following layout
feasible:

Possible double cell layout:
   <string type tag/length, memory-field, base, hash-value>




Garbage collection issues:
==========================

If there are no further users of a memory-field, the memory-field will
simply be garbage collected.  Depending on whether the memory-field is the
owner of the corresponding memory regeion, this region will be freed or
not.  This distinction allows to create memory-fields from memory regions
that lie on the stack or in static memory (for example global C arrays).
The notion of read-only memory-fields also allows to use memory regions
from read-only memory, for example constant strings in the C code.

Some care has to be taken for the following constellation:  All users of
some memory-field as well as the memory-field itself are to be collected.  
In this special case, the client count of the memory-field should
potentially not be decremented, since the _ordering_ of the sweeps is
unclear:  The memory-field may be swept _before_ some of its
clients.  Thus, before during gc the client count is changed, it has to be
checked whether the memory-field itself is already swept.  [Does this
cause problems with a generational gc, or can a similar mechanism be used
there?]


Performance:
============

With the given proposal, every access to the character field of a string
or a symbol requires an additional memory access due to the added level of
indirection.  Further, the number of scheme objects increases since a lot
of strings then consist of two objects:  the string and its memory-field.
Also, modifying operations on memory-fields require the use of a mutex, as
well as adding or removing clients of a memory-field.

On the positive side, the proposed solution allows a large number of
typical operations to be sped up enormously:  substring, string-copy,
string->symbol, symbol->string etc. should be much faster.  Only little
care about garbage collection issues has to be taken, and due to the
possibility to create memory-fields from constant C strings, some initial
string-copying operations can be avoided.


Best regards
Dirk Herrmann
[Prev in Thread]
Current Thread
[Next in Thread]
Another alternative string representation proposal 1.3, Dirk Herrmann <=
- Re: Another alternative string representation proposal 1.3, Keisuke Nishida, 2000/10/06
  - Re: Another alternative string representation proposal 1.3, Dirk Herrmann, 2000/10/09
Prev by Date: Re: CVS Guile and configure, Makefile.in, and other auxiliary files...
Next by Date: Re: Another alternative string representation proposal 1.3
Previous by thread: Re: CVS Guile and configure, Makefile.in, and other auxiliary files...
Next by thread: Re: Another alternative string representation proposal 1.3
Index(es):
- Date
- Thread