guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Another alternative string representation proposal


From: Dirk Herrmann
Subject: Another alternative string representation proposal
Date: Wed, 20 Sep 2000 19:17:21 +0200 (MEST)

Mikael Djurfeldt wrote:

> The following proposal has the probllem that it isn't thread safe.
> Can anyone come up with an alternative which is thread-safe?

I have a modified version of my proposal, which should also be thread safe.

The differences to the previous proposal are:
* I renamed string-rep to char-field (seems more general).
* I added a guarded `users' counter for the char-field, which allows to
  determine the number of scheme objects that share a char-field.  [Note:
  This requires to take some care during garbage collection.]

The previous proposal made initial string modifications expensive due to
the need for an initial copying, independent of whether a string was
actually shared or not.  Also, once a string was modified, further sharing
of such a string became impossible and instead copying became necessary.

In contrast, this proposal avoids these kinds of copying, but instead
requires to use a mutex for every string modification.  Obtaining a mutex
can be expected (hopedfully) to be much cheaper than copying a string, but
in cases where code modifies the same string very often the accumulated
mutex overhead might outweigh the cost of the copying.


String representation proposal 1.2:
-----------------------------------

Character Field Type
--------------------------

The guile implementations of strings, symbols and possibly other data
types that need a field of C characters as its low level representation
use a common type `char-field'.  Objects of type char-field are only used
to implement other data types and will never be seen on the scheme level.
Character fields may be shared between different scheme objects.  For
example, a symbol SYM and the string returned by (symbol->string SYM) may
share the same char-field, thus avoiding to allocate a new memory region
and to copy the contents of the original character field there.  Since a
char-field is an ordinary scheme object, it will automatically be garbage
collected if it is not used any more.

A char-field has the following attributes:
 * chars is a char* pointing to the first character of the character
   field.
 * length is an unsigned int denoting the number of characters in the
   character field.
 * owner_p is a boolean value that indicates whether the char-field is
   actually the owner of the character field, i. e. whether the character
   field should be freed if the char-field is garbage collected.
 * mutability is a value from '(mutable, immutable, read-only) that
   determines whether modifications to the characters in the character
   field are allowed.
 * users is an unsigned integer value that denotes the number of scheme
   objects that share the char-field.  This attribute is only used if
   mutability is 'immutable.
 * mutex is a mutex object that has to be locked for any write access to
   the char-field, and when changing the mutability from `mutable' to
   `immutable'.  This attribute is only used if mutability is 'mutable.
   [Alternatively, one common mutex could be used, but this would mean
   to serialize string modifications across all threads.]

Possible double cell layout:
   <typ/length, chars, flags = (owner_p << 2) + mutability, users|mutex>
A good encoding for mutability seems to be:
   mutable   = #b01,
   immutable = #b10,
   read-only = #b00,
=> char-field-mutable = flags & 1, mutability-changeable = flags & 3 

The following assertions hold:

 * If mutability is 'mutable, then there is at most one scheme object
   that uses the char-field.  In other words, a mutable char-field is
   never shared, the number of users is either 0 or 1, and thus does not
   have to be stored.  It is legal to change the status of a char-field
   from mutable to immutable, but this requires to obtain the mutex.  The
   mutex has also to be obtained for any changes to the char-field.
   Thus, the mutex guarantees that the status of a char-field does not
   change from mutable to immutable during a modification.  After a
   char-field has been changed from mutable to immutable, the mutex is
   not needed any longer.  Instead, the `users' attribute has to be
   initialized appropriately.

 * If mutability is 'immutable, and the number of users is larger than 1,
   then any operation that intends to modify the character field has to
   create a new mutable char-field object with its own copy of the
   character field.  If the number of users is 0 or 1, it is legal to
   change the status of the char-field from immutable to mutable, but
   this change has to be guarded by a mutex scm_mutex_char_field_users.
   After a char-field has been changed from immutable to mutable, the
   users attribute is not needed any more, but instead a mutex has to be
   created for the mutable char-field.  Further, if mutability is
   'immutable, then any scheme object, that wants to share the char-field
   has to increment the number of users, which is guarded by the mutex
   scm_mutex_char_field_users.  Analogously, if a scheme object does not
   any longer share the char-field, the number of users has to be
   decremented.

 * If mutability is 'read-only, then any operation that intends to modify
   the character field has to create a new mutable char-field object,
   with its own copy of the character field.  It is not allowed to change
   the mutability from 'read-only to anything else.  Read-only should be
   used to indicate that the character field lies in a memory region that
   can not be modified, like a ROM area.

A new char-field can be created as 'mutable, 'immutable, or 'read-only.
If the char-field lies in a memory region that can not be modified, then
the char-field has to be created as 'read-only.  Otherwise, except it is
known in advance that no modification will be performed, a char-field
will typically be created as mutable.  When used for a symbol, for
example, an initial creation as immutable may make sense.  (Note: Whether
a char-field is 'mutable, 'immutable or 'read-only does not have any
implications on the mutability of the client that uses the char-field.
The fact that a char-field is 'immutable or 'read-only simply indicates
that a client will have to do a copy on write.)

Changing the status from 'mutable to 'immutable may make sense in a
response to a (for example) call to string->symbol, which otherwise would
require to copy the string.  Changing the status from 'immutable to
'mutable makes sense if a modification is to be performed on a string
that once was shared but is not shared any more.  Again, this allows to
avoid copying the string.


String Type
-----------

Strings in guile come in several flavours.  They may be mutable or not,
and they may use a char-field as a whole or only a part of the underlying
character field.  The data type representation is as follows:

* mutable/immutable strings:  These need to hold only a reference to the
  char-field.  However, although the length of such a string is the same 
  as the length of the underlying char-field, it makes sense to also
  duplicate the length parameter from the char-field, in order to save
  some references and to allow unified handling of all string and symbol
  types.
  Cell layout: <typ/length, char-field>
* mutable/immutable substrings:  These hold a reference to the char-field,
  a relative position or a pointer into the chacter field, and a length.
  Double cell layout: <typ/length, char-field, pos-or-ptr, <unused>>

The mutability of a string is orthogonal to the mutability of the
underlying char-field:

* If a string is immutable, this is visible on the scheme level.  I. e. an
  attempt to modify such a string will cause an error to be signalled.  An
  immutable string uses an immutable or read-only char-field as its
  representation.

  [This does not have to be the case:  If a string is immutable, that
  means that it can not modified from the scheme level.  There may,
  however, be situations where such a string is to be modified from the C
  level, which means that a mutable char-field would be used.  I can't
  currently think of a situation where this would make sense.  We could
  therefore also just say that an immutable string always uses an
  immutable or read-only char-field.]

* If a string is mutable, it's representation may be any kind of
  char-field.  If a mutable string is represented by an immutable or
  read-only char-field and has to be copied to be modified, this is done
  transparently for the user of the mutable string.  Note: Except for
  cases where it is known that the string object is only accessed from a
  single thread, changing the char-field of a string object requires the
  use of the char-field's mutex.


Symbol Type
-----------

Since the representation of symbols requires the use of double cells
anyway, a symbol is for simplicity always represented similar to an
immutable substring, with the hash value stored in the fourth cell
element.
Double cell layout:  <typ/length, char-field, pos-or-ptr, raw-hash>


Thread safety:
--------------

Since no two distinct scheme objects share a mutable char-field, accesses
to mutable char-fields are thread-safe.  A status change from a mutable
to an immutable char-field can also easily be performed.  The only
threading problem that remains with respect to mutable char-fields is the
case of two different threads that access the same char-field via the same
scheme object.  This, however, is a user-level problem and its solution
does not lie in the scope of this proposal.

An immutable char-field can be shared by several scheme objects.  This is
thread safe since the immutable char-field is not modified.  The only
exception is when the status of the char-field is to be changed from
immutable to mutable.  This, however, can only happen while the immutable
char-field is not shared.

The only threading issue with strings is when an immutable char-field of a
mutable string is replaced with a mutable char-field, in order to allow
for modifications of the string.  This replacement of an immutable with a
mutable char-field is an operation that potentially has to be guarded by
the use of a mutex.


Garbage collection issues:
--------------------------

If there are no further users of a char-field, the char-field will simply
be garbage collected.  Depending on whether the char-field is the owner of
its character field, this field will be freed or not.  This distinction
allows to create char-fields from character fields that lie on the stack
or in read-only memory regions, for example constant strings in the C
code.

Some care has to be taken for the following constellation:  All users of
some char-field as well as the char-field itself are to be collected.  In
this special case, the users count of the char-field should not be
decremented, since the _ordering_ of the sweeps is unclear:  The
char-field may be swept _before_ some of its users.  To avoid changing
the contents of an already swept cell, the users attribute of a
char-field should only be decremented if the mark-bit of the char-field
is set during sweep, i. e. when the char-field is known to survive the
collection.  [Does this cause problems with a generational gc, or can a
similar mechanism be used there?]

There is still the problem, that small substrings might be generated from
large, temporary strings and that the small substrings are kept in a data
structure such that the heap will still have to host the large original
strings.  This problem can only be addressed by extending the proposal by
some means to protocol the used regions of large strings.  Since there is
no list of users of a char-field, this would have to be done during
garbage collection.

Performance:
------------

With the given proposal, every access to the character field of a string
or a symbol requires an additional memory access due to the added level of
indirection.  Further, the number of scheme objects increases since a lot
of strings then consist of two objects:  the string and its char-field.
Also, modifying operations on char-fields require the use of a mutex, as
well as every sharing of a string or changing the mutability.

On the positive side, the proposed solution allows a large number of
typical operations to be sped up enormously:  substring, string-copy,
string->symbol, symbol->string etc. should be much faster.  Only little
care about garbage collection issues has to be taken, and due to the
possibility to create char-fields from constant C strings, some initial
string-copying operations can be avoided.


Best regards
Dirk Herrmann




reply via email to

[Prev in Thread] Current Thread [Next in Thread]