guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Alternative string representation proposal


From: Dirk Herrmann
Subject: Alternative string representation proposal
Date: Fri, 8 Sep 2000 18:29:58 +0200 (MEST)

Mikael Djurfeldt wrote:

> The following proposal has the probllem that it isn't thread safe.
> Can anyone come up with an alternative which is thread-safe?

I have a proposal for a different solution, which should be thread safe.

The underlying assumption is, that strings are typically rarely modified.  
With respect to symbols, this assumption is obviously true.  Further it is
assumed that a string that already got modified is likely to be modified
again.


Proposal:
---------

String Representation Type
--------------------------

The guile implementations of strings, symbols and possibly other data
types that need a field of C characters as its low level representation
use a common type `string-rep'.  Objects of type string-rep are only used
to implement other data types and will never be seen on the scheme level.
String reps may be shared between different scheme objects.  For example,
a symbol SYM and the string returned by (symbol->string SYM) may share the
same string rep, thus avoiding to copy the character field.  Since a
string rep is an ordinary scheme object, it will automatically be
garbage collected if it is not used any more.

A string-rep has the following attributes:
 * chars is a char* pointing to the first character of the character
   field.
 * length is an unsigned int denoting the number of characters in the
   character field.
 * owner_p is a boolean value that indicates whether the string-rep is
   actually the owner of the character field, i. e. whether the character
   field should be freed if the string-rep is garbage collected.
 * mutable_p is a boolean value that determines whether modifications to
   the characters in the character field are allowed.

The following assertions hold:
 * If mutable_p is true, then there is at most one scheme object that
   uses the string-rep.  In other words, a mutable string-rep is never
   shared.
 * If mutable_p is false, then any operation that intends to modify the
   character field has to create a new mutable string-rep object, with its
   own copy of the character field.

A new string-rep can be created as mutable or immutable.  Except it is
known in advance that a modification will be performed, a string-rep will
typically be created as immutable.  A compiler might, for example,
determine that a modification will occur shortly after the creation and
thus create the string-rep as mutable.  (Note:  The mutability of a
string-rep does not have any implications on the mutability of the client 
that uses the string-rep.  The fact that a string-rep is immutable simply 
indicates that a client will have to do a copy on write.)

On the C level, the status of a string-rep may be changed from mutable to
immutable and vice versa.  The thread safety of such an operation is not
guaranteed by the string-rep implementation.  Instead, the code that
performs such a change has to take the necessary measures to guarantee the
thread safety.  An example for the use of a status change is the following
situation:  A function creates a new string, performs some initial
modifications on that string and passes it as a return value.  Here, it
makes sense to create the corresponding string-rep as mutable, and, after
all initial modifications have been performed and before the string is
returned, change the status of the string-rep object to immutable.  Again,
this may also be done by a compiler when compiling scheme code.


String Type
-----------

Strings in guile come in several flavours.  They may be mutable or not,
and they may use a string-rep as a whole or only a part of the underlying
character field.  The data type representation is as follows:

* mutable/immutable strings:  These need to hold only a reference to the
  string-rep.  However, although the length of such a string is the same 
  as the length of the underlying string-rep, it makes sense to also
  duplicate the length parameter from the string-rep, in order so save
  some references and to allow unified handling of all string and symbol
  types.
  Cell layout: <typ/length, string-rep>
* mutable/immutable substrings:  These hold a reference to the string rep,
  a relative position or a pointer into the chacter field, and a length.
  Double cell layout: <typ/length, string-rep, pos-or-ptr <unused>>

The mutability of a string is orthogonal to the mutability of the
underlying string-rep:

* If a string is immutable, this is visible on the scheme level.  I. e. an
  attempt to modify such a string will cause an error to be signalled.  An
  immutable string uses an immutable string-rep as its representation.

  [This does not have to be the case:  If a string is immutable, that
  means that it can not modified from the scheme level.  There may,
  however, be situations where such a string is to be modified from the C
  level, which means that a mutable string-rep would be used.  I can't
  currently think of a situation where this would make sense.  We could
  therefore also just say that an immutable string always uses an
  immutable string-rep.]

* If a string is mutable, it's representation may be a mutable string-rep,
  or an immutable string-rep.  This depends on whether the string has ever
  been modified.  If a mutable string that uses an immutable string-rep is
  modified, a mutable copy of the original string-rep (or rather the
  required parts of it) is created, and the modifications are applied 
  to this string.  This is done transparently for the user of the mutable
  string.  Except for cases where it is known that the string object is
  only accessed from a single thread, changing the string-rep of a string
  object requires the use of a mutex.


Symbol Type
-----------

Since the representation of symbols requires the use of double cells
anyway, a symbol is for simplicity always represented similar to an
immutable substring, with the hash value stored in the fourth cell
element.
Double cell layout:  <typ/length, string-rep, pos-or-ptr, raw-hash>


Thread safety:
--------------

Since no two distinct scheme objects share a mutable string-rep, accesses
to mutable string-reps are thread-safe.  A status change from a mutable
to an immutable string rep can also easily be performed.  The only
threading problem that remains with respect to mutable string reps is the
case of two different threads that acess the same string-rep via the same
scheme object.  This, however, is a user-level problem and its solution
does not lie in the scope of this proposal.

An immutable string rep can be shared by several scheme objects.  This is
thread safe since the immutable string-rep is not modified.  The only
exception is when the status of the string-rep is to be changed from
immutable to mutable.  This, however, is only provided for optimization
purposes and the threading problems that such a change may cause have to
be solved in the corresponding user code.

The only threading issue with strings is when an immutable string-rep of a
mutable string is replaced with a mutable string-rep, in order to allow
for modifications of the string.  This replacement of an immutable with a
mutable string-rep is an operation that potentially has to be guarded by
the use of a mutex.


Garbage collection issues:
--------------------------

If there are no further users of a string-rep, the string-rep will simply
be garbage collected.  Depending on whether the string rep is the owner of
its character field, this field will be freed or not.  This distinction
allows to create string reps from character fields that lie in read-only
memory regions, for example constant strings in the C code.

There is still the problem, that small substrings might be generated from
large, temporary strings and that the small substrings are kept in a data
structure such that the heap will still have to host the large original
strings.  This problem can be addressed by an extension to the above
proposal:

An immutable string-rep gets an additional field `usage'.  Every client of
the string-rep adds on construction its own usage of the string, i. e. the
number of characters that are shared to the `usage' field.  This typically
requires to use a mutex.  Similarly, if a client changes to a different
string-rep or gets garbage collected, the corresponding `usage' fields are
updated.  During garbage collection, it is checked whether the usage of a
string-rep falls below a certain threshold.  In that case, after gc the
corresponding clients of the string-rep can be given their own private
string-rep.  Further alternatives may be to protocol the bounds of the
used parts of a string-rep during gc.  In any case, it is questionable
whether any of this is worth the effort.


Performance:
------------

With the given proposal, every access to the character field of a string
or a symbol requires an additional memory access due to the added level of
indirection.  Further, the number of scheme objects increases since a lot
of strings then consist of two objects:  the string and its string-rep.
Also, an initial modifying operation on an object that uses an immutable
string-rep is time consuming, since a duplicate of the string-rep has to
be created.  All further modifications, however, are quite fast.

On the positive side, the proposed solution allows a large number of
typical operations to be sped up enormously:  substring, string-copy,
string->symbol, symbol->string etc. should be much faster.  No special
care about garbage collection issues have to be taken, and due to the
possibility to create string-reps from constant C strings some initial
string-copying operations can be avoided.  Further, mutecis are only
rarely needed, namely when a string is modified for the first time.
Certain optimizations can be used in the C code to reduce the number of
initial copy-on-write operations when a string is known to be modified,
namely to use a mutable string-rep from the beginning.


Alternative ideas:
------------------

Mutable string-reps are never shared.  Thus, in this case the indirection
string->string-rep->chars could be avoided, by storing the pointer to the
character field directly in the string type.  Then, there would be string
types that point to a string-rep, and when the string is about to modify
something, the string type would change to a string type that directly
points to its private character field.  I don't know whether this is
advantageous, since it complicates all read-operations, since they have to
distinguish between the alternative representations.



Best regards
Dirk Herrmann




reply via email to

[Prev in Thread] Current Thread [Next in Thread]