Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

guile-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

From:	Mark H Weaver
Subject:	Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date:	Wed, 03 Apr 2013 15:28:27 -0400
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

Hi Ludovic,

Thanks for the quick review!  An improved patch is attached below.

address@hidden (Ludovic Courtès) writes:
> Woow, well thought out.  The semantics seem good.  (It’s interesting to
> see how BOMs complicate things, but that’s life, I guess.)
>
> The patch looks good to me.  The test suite is nice.  It doesn’t seem to
> cover all the corner cases listed above, but that can be added later on
> perhaps?

Yes, the tests are still a work-in-progess, but I've added quite a few
more since you last looked.

> Perhaps the text above could be added to the manual,

In the attached patch, I've added a new node to the "Input and Output"
section.

> Mark H Weaver <address@hidden> skribis:
>> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
>> index 73a788f..cd1746b 100644
>> --- a/libguile/ports-internal.h
>> +++ b/libguile/ports-internal.h
>> @@ -48,14 +48,19 @@ struct scm_port_internal
>>  {
>>    scm_t_port_encoding_mode encoding_mode;
>>    scm_t_iconv_descriptors *iconv_descriptors;
>> +  int at_stream_start_for_bom_read;
>> +  int at_stream_start_for_bom_write;
>
> Add “:1”?

Good idea.

[...more good suggestions that I've incorporated in the new patch...]

>> +{
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +  int result;
>> +  int i = 0;
>> +
>> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
>> +    {
>> +      pt->read_pos++;
>> +      i++;
>> +    }
>> +
>> +  result = (i == len);
>> +
>> +  while (i > 0)
>> +    scm_unget_byte (bytes[--i], port);
>> +
>> +  return result;
>> +}
>
> Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

Yes.  Bytes are only consumed if are equal to bytes[i], so an EOF will
never be consumed or passed to scm_unget_byte.

> What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

All of those details are handled by 'scm_peek_byte_or_eof', which is
guaranteed to leave 'pt->read_pos' pointing at the byte that's returned
(if not EOF).  Therefore, it's always safe to increment that pointer by
one (but no more than one) after calling 'scm_peek_byte_or_eof' if it
returned non-EOF.

Look at the code for 'scm_peek_byte_or_eof' and this will be clear.
Also note that you did the same thing in 'scm_utf8_codepoint' :)

[...more good suggestions, incorporated...]

>> +      /* If the specified encoding is UTF-16 or UTF-32, then make
>> +         that more precise by deciding what endianness to use.  */
>> +      if (strcmp (pt->encoding, "UTF-16") == 0)
>> +        precise_encoding = decide_utf16_encoding (port, mode);
>> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
>> +        precise_encoding = decide_utf32_encoding (port, mode);
>
> Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
> already, but I think it’s a mistake.)

Ouch, good catch!  Indeed, we already had some bugs because of this.  I
pushed a fix for the existing bugs to stable-2.0, and updated this patch
accordingly.

Here's the new patch.  Any more suggestions?

    Thanks!
      Mark

>From c0d7228824dcaf7edcbc2de2cdef5c091ef2fc2f Mon Sep 17 00:00:00 2001
From: Mark H Weaver <address@hidden>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* doc/ref/api-io.texi (BOM Handling): New node.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 doc/ref/api-io.texi         |   67 ++++++++++
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  136 +++++++++++++++++---
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  291 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 500 insertions(+), 19 deletions(-)

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 8c974be..b5c78d0 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
+* BOM Handling::                Handling of Unicode byte order marks.
 @end menu
 
 
@@ -2373,6 +2374,72 @@ Set using
 
 @end table
 
address@hidden BOM Handling
address@hidden Handling of Unicode byte order marks.
address@hidden BOM
address@hidden byte order mark
+
+This section documents the finer points of Guile's handling of Unicode
+byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
+at the start of a UTF-16 or UTF-32 stream, so that the reader can
+reliably determine the byte order.  Occasionally, a BOM is found at the
+start of a UTF-8 stream, but this is much less common and not generally
+recommended.
+
+Guile attempts to handle BOMs automatically, and in accordance with the
+recommendations of the Unicode Standard, when the port encoding is set
+to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
+automatically writes a BOM at the start of a UTF-16 and UTF-32 stream,
+and automatically consumes one from the start of a UTF-8, UTF-16, or
+UTF-32 stream.
+
+As specified in the Unicode Standard, BOMs are only handled specially at
+the start of a stream, and only if the port encoding is set to
address@hidden or @code{UTF-32}.  If the port encoding is set to
address@hidden, @code{UTF-16LE}, @code{UTF-16BE}, or @code{UTF-16LE},
+then BOMs are @emph{not} handled specially, and none of the special
+handling described in this section applies.
+
address@hidden @bullet
address@hidden
+Guile looks for a BOM only if the user performs a textual read before
+writing or seeking.  If the user writes to a port before reading,
+big-endian is assumed.
+
address@hidden
+If @code{set-port-encoding!} is called in the middle of a stream, Guile
+treats this as a new logical ``start of stream'' for purposes of BOM
+handling.  This is intended to multiple logical text streams embedded
+within a larger binary stream.
+
address@hidden
+Binary I/O operations are not guaranteed to update Guile's notion of
+whether the port is at the ``start of the stream'', nor are they
+guaranteed to produce or consume BOMs.  More generally, the handling of
+BOMs is unspecified if binary I/O is performed before textual I/O.
+
address@hidden
+For ports that support seeking (e.g. normal files), the input and output
+streams are considered linked: if the user reads first, then a BOM will
+be consumed (if appropriate), but later writes will @emph{not} produce a
+BOM.  Similarly, if the user writes first, then later reads will
address@hidden consume a BOM.
+
address@hidden
+For ports that do not support seeking (e.g. pipes, sockets, and
+terminals), the input and output streams are considered
address@hidden for purposes of BOM handling: the first read will
+consume a BOM (if appropriate), @emph{and} the first write will produce
+a BOM (if appropriate).
+
address@hidden
+Seeks to the beginning of the file set the ``start of stream'' flags.
+Seeks anywhere else clear the ``start of stream'' flags.  Note that
+seeking before reading or writing, when the encoding is set to
address@hidden or @code{UTF-32}, will generally cause big-endian to be
+used, unless your first read is at a BOM.
address@hidden itemize
+
 @c Local Variables:
 @c TeX-master: "guile.texi"
 @c End:
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..70e8c45 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 
 struct scm_port_internal
 {
+  unsigned at_stream_start_for_bom_read  : 1;
+  unsigned at_stream_start_for_bom_write : 1;
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
   SCM alist;
@@ -53,9 +55,12 @@ struct scm_port_internal
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 61f8006..a19e869 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
                     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcasecmp (pt->encoding, "UTF-8") == 0
+                                || strcasecmp (pt->encoding, "UTF-16") == 0
+                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
        rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,68 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from PORT are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static const char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static const char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2431,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      const char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcasecmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,6 +2480,14 @@ scm_i_set_port_encoding_x (SCM port, const char 
*encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
@@ -2387,19 +2498,14 @@ scm_i_set_port_encoding_x (SCM port, const char 
*encoding)
     {
       pt->encoding = "UTF-8";
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
     }
   else
     {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
       pt->encoding = scm_gc_strdup (encoding, "port");
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
     }
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..3f72810 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, 
size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
+                        || strcasecmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..d81e864 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,290 @@
 
 
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 
0"
+      '(#\a "a")
+    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 
0"
+      '(#\a "a")
+    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "utF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 
0"
+      '(#\a "a")
+    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after 
set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after 
set-port-encoding!"
+      "ab"
+    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after 
set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after 
set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 
0"
+      '(#\a "a")
+    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00))))
+
+
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/03
- Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/03
- Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Ludovic Courtès, 2013/04/03
  - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver <=
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Ludovic Courtès, 2013/04/03
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/03
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mike Gran, 2013/04/03
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/03
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/04
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Andy Wingo, 2013/04/04
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/05
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mike Gran, 2013/04/05
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Ludovic Courtès, 2013/04/05
    - Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs), Mark H Weaver, 2013/04/05

Prev by Date: Re: array-copy! is slow & array-map.c
Next by Date: Re: array-copy! is slow & array-map.c
Previous by thread: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Next by thread: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Index(es):
- Date
- Thread