guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Reducing iconv-induced memory usage


From: Mark H Weaver
Subject: Re: Reducing iconv-induced memory usage
Date: Tue, 26 Apr 2011 23:47:41 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

Hi Ludovic!

address@hidden (Ludovic Courtès) writes:
> So, here’s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn’t benchmark output):

Thanks for working on this.  I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.

First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.

> -/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> -   with the byte representation of the codepoint in PORT's encoding, and
> -   set *LEN to the length in bytes of that representation.  Return 0 on
> -   success and an errno value on error.  */
> +/* Read a UTF-8 sequence from PORT.  On success, return 0 and set
> +   *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> +   representation, and set *LEN to the length in bytes.  Return
> +   `EILSEQ' on error.  */
>  static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> -            char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> +                 scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int byte;
> +
> +  *len = 0;
> +
> +  byte = scm_get_byte_or_eof (port);
> +  if (byte == EOF)
> +    {
> +      *codepoint = EOF;
> +      return 0;
> +    }
> +
> +  buf[0] = (scm_t_uint8) byte;
> +  *len = 1;
> +
> +  if (buf[0] <= 0x7f)
> +    *codepoint = buf[0];
> +  else if ((buf[0] & 0xe0) == 0xc0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> +     | (buf[1] & 0x3f);
> +    }

The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.

Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form.  For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.

We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.

Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences.  The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.

   code points       byte[0]  byte[1]   byte[2]  byte[3]
---------------------------------------------------------
U+000000..U+00007F | 00..7F |         |        |        |
U+000080..U+0007FF | C2..DF | 80..BF  |        |        |
U+000800..U+000FFF |   E0   | A0..BF* | 80..BF |        |
U+001000..U+00CFFF | E1..EC | 80..BF  | 80..BF |        |
U+00D000..U+00D7FF |   ED   | 80..9F* | 80..BF |        |
U+00E000..U+00FFFF | EE..EF | 80..BF  | 80..BF |        |
U+010000..U+03FFFF |   F0   | 90..BF* | 80..BF | 80..BF |
U+040000..U+0FFFFF | F1..F3 | 80..BF  | 80..BF | 80..BF |
U+100000..U+10FFFF |   F4   | 80..8F* | 80..BF | 80..BF |
---------------------------------------------------------

So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >= 0xC2.  The 3- and 4-byte cases are
somewhat more constrained.

> +  else if ((buf[0] & 0xf0) == 0xe0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> +     | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> +     | (buf[2] & 0x3f);
> +    }
> +  else
> +    {

That ^^^ should not simply be an "else".  It must check that the first
byte is valid.

> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +     goto invalid_seq;
> +
> +      buf[3] = (scm_t_uint8) byte;
> +      *len = 4;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
> +     | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> +     | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> +     | (buf[3] & 0x3f);
> +    }
> +
> +  return 0;
> +
> + invalid_seq:
> +  /* Return the faulty byte.  */
> +  scm_unget_byte (byte, port);

This ungets only the last byte, but there may be up to 4 bytes to unget.

> +
> +  return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> +   input conversion descriptor.  */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> +                  char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {
> +  scm_t_port *pt;
>    int err, byte_read;
>    size_t bytes_consumed, output_size;
>    char *output;
>    scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> -  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>  
> -  if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
> -    /* Initialize the conversion descriptors.  */
> -    scm_i_set_port_encoding_x (port, pt->encoding);
> +  pt = SCM_PTAB_ENTRY (port);
>  
>    for (output_size = 0, output = (char *) utf8_buf,
>        bytes_consumed = 0, err = 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
>       output_size = sizeof (utf8_buf) - output_left;
>      }
>  
> -  if (SCM_UNLIKELY (err != 0))
> +
> +  if (SCM_LIKELY (err == 0))
> +    {
> +      /* Convert the UTF8_BUF sequence to a Unicode code point.  */
> +      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
> +      *len = bytes_consumed;
> +    }
> +
> +  return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> +   with the byte representation of the codepoint in PORT's encoding, and
> +   set *LEN to the length in bytes of that representation.  Return 0 on
> +   success and an errno value on error.  */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> +            char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int err;
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    /* Initialize the conversion descriptors, if needed.  */
> +    scm_i_set_port_encoding_x (port, pt->encoding);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
> +  else
> +    err = get_iconv_codepoint (port, codepoint, buf, len);

>From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read.  This seems rather inefficient.  Also, if we wish to support
Latin-1 without iconv as well, the simple method above will not work.

I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8.  Later, we could add LATIN1 and
maybe ASCII as well.  Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).

Thanks again for working on this :)

     Mark



reply via email to

[Prev in Thread] Current Thread [Next in Thread]