guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Another alternative string representation proposal


From: Dirk Herrmann
Subject: Re: Another alternative string representation proposal
Date: Sat, 23 Sep 2000 06:39:35 +0200 (MEST)

On 22 Sep 2000, Keisuke Nishida wrote:

> Dirk Herrmann <address@hidden> writes:
> 
> > A char-field has the following attributes:
> >  * chars is a char* pointing to the first character of the character
> >    field.
> >  * length is an unsigned int denoting the number of characters in the
> >    character field.
> >  * owner_p is a boolean value that indicates whether the char-field is
> >    actually the owner of the character field, i. e. whether the character
> >    field should be freed if the char-field is garbage collected.
> 
> Is owner_p really necessary?  If a char-field is read-only, it is not
> the owner of the character sequence.  If it is mutable or immutable,
> we can force the char-field to own the sequence.

Yes, I agree that this is simpler.  There are only few situations where
anything else would make sense, but these are probably not worth the
effort.  (Think of, for example, a fixed size C array on the stack that
is used to initialize a string:  This would be writable, but not
owned.  The benefit would be that an initial string-copying could be
avoided.  In C loops, where some temporary scheme strings are required,
for example, because some function expects a scheme string as a parameter,
this could make sense, but, again, its not sure whether it is worth the
effort.)

> >  * mutability is a value from '(mutable, immutable, read-only) that
> >    determines whether modifications to the characters in the character
> >    field are allowed.
> >  * users is an unsigned integer value that denotes the number of scheme
> >    objects that share the char-field.  This attribute is only used if
> >    mutability is 'immutable.
> 
> Why do you need the mutability field when you have the users count?
> It seems to me that the following definition is sufficient:
> 
>   users = -1   read-only
>   users = 0    no reference (can be GCed)

users = 0 does not really imply that the char-field can be GCed:  It only
states that there is currently no scheme object using this
char-field.  There may, however, be references on the stack.  But,
otherwise you are right, this would be a valid encoding of the attributes
if read-only and owned get merged.

>   users = 1    mutable
>   users > 1    immutable

This encoding is fine, if there is either just one mutex for all
modifications to all char-fields, or if there is one mutex for each
char-field, independent of whether mutable or not.  For the mixed
situation that I described, there are (slight) disadvantages due to the
fact that a char-field with exactly one user is always mutable:  For the
first sharing you would need to lock two instead of just one mutex.  Well,
that may be negligible :-)

The single mutex solution, although serializing all modifications to all
char-fields, is probably most preferrable.  Otherwise, it is easy to
create deadlock situations.  Well, you can avoid these by defining an
order in which mutexes have to be locked, but it is still a source for
errors.

> >  * If mutability is 'immutable, and the number of users is larger than 1,
> >    then any operation that intends to modify the character field has to
> >    create a new mutable char-field object with its own copy of the
> >    character field.  [...]
> 
> If the immutable char-field is very large, copying it costs much more
> than copying its substring, while modifying the larger char-field more
> likely happens.  Probably creating a small substring from a large string
> should be treated specially...

Yes, but that is a decision which has to be taken by the 'substring'
function.  It would also be possible to add a function to the API that
explicitly states to make a 'fresh' substring.  A 'trick solution is, to
make a substring and immediately afterwards 'write' to the substring
without modifying it:
  (let* ((sub (substring s from to)))
    (string-set! sub 0 (string-ref sub 0)))
But, this is a hack.  However, I doubt that there is a good heuristic for
when a substring should really be copied, and when not.  This is
application dependen, because in application where the 'worst-case'
scenario that you have sketched does not occur, it would be a pity to
waste the performance potential by copying the substrings.

All in all, I think your proposed simplifications make sense.  

There is even another simplification that I think would be helpful:  We
don't distinguish between substrings and strings, but store all strings as
char-field + offset + length, i. e. in a double cell.  This simplifies the
code and should not have a big negative performance impact, maybe even on
the contrary:  For every access to the chars, the offset would have to be
considered, but not having to distinguish between two string types does
eliminate some branches in the code.

And, another thing:  The concept of read-only _strings_ (not
char-fields) can probably also be dropped.  I realized that guile's
current support for read-only strings is just because of the lack of a
proper copy-on-write string implementation, and regarding recent changes
to the CVS, support for read-only strings will therefore be dropped.

Many thanks for reading the proposal, and your very good suggestions for
improvement.

Best regards
Dirk



reply via email to

[Prev in Thread] Current Thread [Next in Thread]