Re: utf8 and emacs text/string multibyte representation

emacs-devel

From:	Eli Zaretskii
Subject:	Re: utf8 and emacs text/string multibyte representation
Date:	Thu, 30 Oct 2014 18:06:41 +0200

> From: Camm Maguire <address@hidden>
> Cc: address@hidden,  address@hidden
> Date: Thu, 30 Oct 2014 10:13:20 -0400
> 
> >> Does every string access in emacs proceed through the utf8 decoder?
> >
> > If you need to look at the character, yes.  E.g., if you need some
> > property of the character, you need to index the appropriate table by
> > that character's codepoint.  But in most operations that is not
> > needed.  You just need to recognize several specific characters, like
> > the null character, the slash, etc., most of which are ASCII.
> >
> 
> Do you allocate a fresh boxed character on each aref, or output an
> integer referring to a fixed ~2^22 sized table?

I'm not sure what you mean by a "boxed character".  A character in
Emacs is just an int.
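
To illustrate "a character is just an int", here is a minimal sketch of decoding one multibyte sequence into an integer codepoint. It assumes plain, well-formed UTF-8; the Emacs internal representation is a superset of UTF-8 and is implemented in C, so the function name `decode_char` and its details are illustrative assumptions, not the Emacs code.

```python
def decode_char(buf, pos):
    """Decode one UTF-8 sequence starting at byte offset pos.

    Returns (codepoint, length_in_bytes); the codepoint is a plain int.
    Assumes buf is well-formed UTF-8 and pos is on a character boundary.
    """
    b0 = buf[pos]
    if b0 < 0x80:
        # ASCII fast path: the byte value is the character itself.
        return b0, 1
    if b0 >> 5 == 0b110:
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("not a lead byte")
    # Each continuation byte contributes its low 6 bits.
    for b in buf[pos + 1 : pos + n]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, n
```

Nothing is allocated per character here: the result is an integer that can be used directly as a table index.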

> Do you maintain such a table in core?

We have a lot of tables indexed by characters.  Their implementation
is memory efficient: it can store identical values for a range of
characters, and also store the default value with minimal overhead.
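
As a rough illustration of the idea (one stored value per range of characters, plus a default), here is a small sketch. The class `RangeTable` and its methods are hypothetical names; the real Emacs char-tables are written in C and are structured differently, so this only shows why such a table need not have one slot per codepoint.

```python
import bisect


class RangeTable:
    """Map character codes to values, storing one entry per range."""

    def __init__(self, default=None):
        self.default = default
        self.starts = []   # sorted first codepoints of the stored ranges
        self.ends = []     # matching last codepoints (inclusive)
        self.values = []   # one value per range

    def set_range(self, start, end, value):
        # Assumption: the new range does not overlap an existing one.
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.ends.insert(i, end)
        self.values.insert(i, value)

    def get(self, ch):
        # Find the last range starting at or before ch, then check its end.
        i = bisect.bisect_right(self.starts, ch) - 1
        if i >= 0 and ch <= self.ends[i]:
            return self.values[i]
        return self.default
```

With this layout, two `set_range` calls cover the ASCII digits and uppercase letters, and every other character costs no storage beyond the single default.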

> >> > We indeed maintain a cache for byte-to-character and character-to-byte
> >> > conversions.
> >> 
> >> How big is this cache?
> >
> > Its size is dynamic, and depends on how frequently the conversion is
> > needed in places that are far away.  The cache stores byte-to-char
> > correspondence in places that are far away, and Emacs uses binary
> > search in between them.
> >
> 
> How far is 'far away'?

The current heuristic value is 5000 characters.
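
A minimal sketch of such a cache, assuming plain UTF-8 text: it remembers (character position, byte position) pairs roughly every 5000 characters, finds the nearest remembered pair at or before the target by bisection, and scans forward from there. The names `PosCache` and `STRIDE` are hypothetical, and how Emacs itself combines the cached positions with the search between them is not shown here.

```python
import bisect

STRIDE = 5000  # the heuristic distance mentioned above


class PosCache:
    """Cache of character-position to byte-position correspondences."""

    def __init__(self, data):
        self.data = data
        self.chars = [0]   # cached character positions, kept sorted
        self.bytes = [0]   # byte position for each cached character position

    def char_to_byte(self, charpos):
        # Assumes 0 <= charpos <= number of characters in data.
        i = bisect.bisect_right(self.chars, charpos) - 1
        c, b = self.chars[i], self.bytes[i]
        while c < charpos:
            b += 1
            # Skip continuation bytes (10xxxxxx) of the current character.
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
            # Remember a new pair once we are far past the last cached one.
            if c - self.chars[-1] >= STRIDE:
                self.chars.append(c)
                self.bytes.append(b)
        return b
```

After one long scan, later conversions near an already visited position only have to walk at most about `STRIDE` characters.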

> If you had this to do all over again, would you still opt for the
> multibyte? 

Yes, I think so.  As far as I know, nobody has ever suggested switching.

> While you have buffers to consider too, which probably relate to
> strings, it seems to me that the dominant costs are always memory
> allocation/gc related, making the memory footprint important but not at
> the expense of allocating characters, and that the most frequent
> operations are removals/pattern substitutions, which can proceed
> bytewise with the same gc overhead.

We don't allocate characters, they are just integers.
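
On the point about bytewise operations: in UTF-8, bytes below 0x80 never occur inside a multibyte sequence, and a complete encoded character can only match at a character boundary, so searching and substituting can work directly on the bytes without decoding. A small sketch under that assumption (standard UTF-8 only; the helper names are hypothetical):

```python
def split_on_slash(encoded):
    """Split UTF-8 bytes at '/' without decoding any character."""
    return encoded.split(b"/")


def substitute(encoded, old, new):
    """Replace the string old with new directly in UTF-8 bytes."""
    return encoded.replace(old.encode("utf-8"), new.encode("utf-8"))
```

For example, `split_on_slash` applied to the UTF-8 encoding of a path with non-ASCII components yields the same components as splitting the decoded string would.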

As for strings, Emacs allocates small strings specially, to minimize
overhead.  And of course, there's GC that takes care of freeing
memory.

> GCL also supports regular expressions -- how is this modified for utf-8?

We use GNU regexp, slightly modified for Emacs.  I suggest taking a
look at the source.
