[PATCH] Improve handling of Unicode byte-order marks (BOMs)

guile-devel

From:	Mark H Weaver
Subject:	[PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date:	Wed, 03 Apr 2013 06:44:19 -0400

Hello all,

I've attached a proposed patch to improve our handling of BOMs.
Here are a few notable aspects:

* All kinds of streams are supported in a uniform way: files, pipes,
  sockets, terminals, etc.

* As specified in Unicode 6.2, BOMs are only handled specially at the
  start of a stream, and only if the encoding is set to "UTF-16" or
  "UTF-32" (or, when reading, "UTF-8").  BOMs are *not* handled
  specially if the encoding is set to "UTF-16LE", etc.

* This code never tries to read a BOM until the user has asked to read.
  If the user writes before reading, it chooses big-endian and writes a
  BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").

* The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
  because BOM handling varies between iconv implementations.  Creation
  of the iconv descriptors is always postponed until the first read or
  write, at which point a decision is made about the endianness, and
  then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
  iconv.

* If 'rw_random' is zero, then the input and output streams are
  considered independent: the first read will consume a BOM if
  appropriate, *and* the first write will produce a BOM if appropriate.

* If 'rw_random' is non-zero, then the input and output streams are
  considered linked: if the user reads first, then a BOM will be
  consumed if appropriate, but later writes will *not* produce a BOM.
  Similarly, if the user writes first, then later reads will *not*
  consume a BOM.

* If 'set-port-encoding!' is called in the middle of a stream, that
  point is treated as a new logical "start of stream", i.e. if the
  encoding is set to "UTF-16" or "UTF-32" then a BOM will be consumed
  the next time you read and/or produced the next time you write.

* Seeks to the beginning of the file set the "start of stream" flags.
  Seeks anywhere else clear the "start of stream" flags.
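
The rules in the list above can be modelled in a few lines of plain C.
This is only a sketch for illustration: 'struct port_model' and the
'model_*' functions are hypothetical names that do not exist in
libguile, and the real logic is in the patch itself.  The model tracks
the two "start of stream" flags and shows how 'rw_random', seeks, and
encoding changes affect them, plus how the precise UTF-16 encoding
would be chosen before anything is handed to iconv.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a port; this is not Guile's scm_t_port.  */
struct port_model
{
  const char *encoding;  /* encoding as set by the user, e.g. "UTF-16" */
  int rw_random;         /* non-zero: input and output are linked */
  int start_for_read;    /* next read is at a logical stream start */
  int start_for_write;   /* next write is at a logical stream start */
};

/* Changing the encoding begins a new logical stream.  */
static void
model_set_encoding (struct port_model *p, const char *encoding)
{
  p->encoding = encoding;
  p->start_for_read = 1;
  p->start_for_write = 1;
}

/* Only a seek to offset 0 counts as returning to stream start.  */
static void
model_seek (struct port_model *p, long new_position)
{
  p->start_for_read = (new_position == 0);
  p->start_for_write = (new_position == 0);
}

/* Pick the precise UTF-16 encoding.  HEAD holds the next LEN input
   bytes and is consulted only when READING is non-zero.  */
static const char *
model_decide_utf16 (const unsigned char *head, size_t len, int reading)
{
  if (reading && len >= 2 && head[0] == 0xFF && head[1] == 0xFE)
    return "UTF-16LE";
  return "UTF-16BE";  /* default; also used when writing first */
}

/* Return 1 if this write should be preceded by a BOM, else 0.  */
static int
model_write_emits_bom (struct port_model *p)
{
  int at_start = p->start_for_write;

  assert (p->encoding != NULL);
  p->start_for_write = 0;
  if (at_start && p->rw_random)
    p->start_for_read = 0;  /* linked: later reads keep any BOM */
  return at_start && (strcmp (p->encoding, "UTF-16") == 0
                      || strcmp (p->encoding, "UTF-32") == 0);
}

/* Return 1 if a BOM seen by this read would be consumed, else 0.  */
static int
model_read_consumes_bom (struct port_model *p)
{
  int at_start = p->start_for_read;

  assert (p->encoding != NULL);
  p->start_for_read = 0;
  if (at_start && p->rw_random)
    p->start_for_write = 0;  /* linked: later writes emit no BOM */
  return at_start && (strcmp (p->encoding, "UTF-8") == 0
                      || strcmp (p->encoding, "UTF-16") == 0
                      || strcmp (p->encoding, "UTF-32") == 0);
}
```

For example, with 'rw_random' set and the encoding "UTF-16", a first
write reports that a BOM is due and a subsequent read reports that no
BOM will be consumed; after 'model_seek (&p, 0)' both flags are set
again.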

Okay, here's the patch.  Comments and suggestions solicited.

     Mark

From 008b89c7ba4637e2d6323f02b6b8b6284a533857 Mon Sep 17 00:00:00 2001
From: Mark H Weaver <address@hidden>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  134 +++++++++++++++++++---
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  259 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 399 insertions(+), 19 deletions(-)

diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..cd1746b 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -48,14 +48,19 @@ struct scm_port_internal
 {
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
+  int at_stream_start_for_bom_read;
+  int at_stream_start_for_bom_write;
   SCM alist;
 };
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 51145e6..382867a 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
                     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcmp(pt->encoding, "UTF-8") == 0
+                                || strcmp(pt->encoding, "UTF-16") == 0
+                                || strcmp(pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
        rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,66 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from port are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2429,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,6 +2478,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
@@ -2387,19 +2496,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
     {
       pt->encoding = "UTF-8";
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
     }
   else
     {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
       pt->encoding = scm_gc_strdup (encoding, "port");
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
     }
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..5795c8e 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
+                        || strcmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_question_mark);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..69a4ea7 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,258 @@
 
 
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  )
+
+
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4
