guile-devel

Re: Unicode I/O


From: Ludovic Courtès
Subject: Re: Unicode I/O
Date: Sun, 23 Jan 2011 00:42:23 +0100
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.2 (gnu/linux)

Hello!

Ludovic Courtès writes:

> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input.  Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that they behave in the
> same way with respect to escapes and error handling.

I just merged ‘wip-iconv’ into ‘master’.  It uses ‘iconv’ for
‘display’/‘write’ and ‘peek-char’/‘read-char’, but not yet for
‘scm_{to,from}_stringn’ and ‘read-line’.  Caveat: it has only been tested
on GNU/Linux.

Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.
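
To illustrate what reporting the location could look like, here is a
minimal sketch, not the code in ‘master’.  It relies on the fact that when
iconv fails, it leaves the input pointer at the offending sequence.
‘first_conversion_failure’ is a hypothetical helper name, and the 256-byte
scratch buffer is an arbitrary choice.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of Guile: decode LEN bytes of INPUT with
   CD, discarding the output.  Return -1 if everything converted, otherwise
   the byte offset of the first sequence that iconv rejected.  */
long
first_conversion_failure (iconv_t cd, const char *input, size_t len)
{
  char *in = (char *) input;
  size_t in_left = len;

  /* Put CD back into its initial shift state.  */
  iconv (cd, NULL, NULL, NULL, NULL);

  while (in_left > 0)
    {
      char out_buf[256];        /* scratch output, on the stack */
      char *out = out_buf;
      size_t out_left = sizeof out_buf;

      if (iconv (cd, &in, &in_left, &out, &out_left) == (size_t) -1
          && errno != E2BIG)
        /* EILSEQ or EINVAL: IN points at the invalid or incomplete
           sequence.  E2BIG only means OUT_BUF filled up, so we loop.  */
        return (long) (in - input);
    }

  return -1;
}
```

A caller that knows the port's file name and current position could turn
the returned offset into a line/column for the error message.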

Overall, it improves performance, except on Latin-1 ports, since I chose
not to special-case them (i.e., I/O on Latin-1 ports goes through
iconv).  The trick is that iconv conversion descriptors are opened once
and for all, and no heap allocation happens on the conversion path
(whereas ‘u32_conv_from_encoding’ and friends typically malloc).
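
The idea can be sketched roughly as follows.  This is an illustration
under assumptions, not the actual port code: ‘struct sketch_port’ and both
functions are hypothetical names, and stateful encodings are ignored.  The
descriptor is opened once, and each character is decoded into a 4-byte
stack buffer, feeding iconv one more byte whenever it reports an incomplete
sequence.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port; these are not Guile's actual fields.  */
struct sketch_port
{
  iconv_t input_cd;             /* opened once, reused for every character */
};

/* Open the descriptor once, e.g. when the port's encoding is set.
   Return 0 on success, -1 on failure.  */
int
sketch_port_init (struct sketch_port *port, const char *encoding)
{
  port->input_cd = iconv_open ("UTF-32LE", encoding);
  return port->input_cd == (iconv_t) -1 ? -1 : 0;
}

/* Decode the first character of BUF, of which LEN bytes are available.
   Store its code point in *CP and return the number of bytes it occupies,
   or 0 on a decoding error.  No heap allocation takes place here.  */
size_t
sketch_port_decode_char (struct sketch_port *port, const char *buf,
                         size_t len, uint32_t *cp)
{
  unsigned char utf32[4];       /* room for exactly one UTF-32 character */
  size_t n;

  for (n = 1; n <= len; n++)
    {
      char *in = (char *) buf;
      size_t in_left = n;
      char *out = (char *) utf32;
      size_t out_left = sizeof utf32;

      if (iconv (port->input_cd, &in, &in_left, &out, &out_left)
          == (size_t) -1)
        {
          if (errno == EINVAL)
            continue;           /* incomplete sequence: try one more byte */
          return 0;             /* EILSEQ or anything else: give up */
        }

      if (out_left != 0)
        return 0;               /* input consumed but no character produced */

      /* Assemble the little-endian UTF-32 bytes portably.  */
      *cp = (uint32_t) utf32[0] | ((uint32_t) utf32[1] << 8)
        | ((uint32_t) utf32[2] << 16) | ((uint32_t) utf32[3] << 24);
      return n;
    }

  return 0;                     /* ran out of input mid-character */
}
```

With this shape, the per-character cost is one or more iconv calls on an
already-open descriptor; nothing is opened, allocated, or freed per call.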

Benchmark results:

--8<---------------cut here---------------start------------->8---
;; with iconv:

("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)

;; with libunistring:

("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---

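So ‘peek-char’ is much faster on UTF-8 ports (roughly 7 times for ASCII
characters and 11 times for Korean ones) but somewhat slower on Latin-1
ports (0.38 vs. 0.25), whereas ‘read-char’ gives the same results (to my
surprise, I must say).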

The ‘peek-char’ improvement is beneficial to SSAX.  When loading a 4 MiB
XML file in UTF-8, it’s ~3.6 times faster than the old method:

--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'

real    0m20.509s
user    0m20.437s
sys     0m0.064s

$ time ./meta/guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'

real    0m5.676s
user    0m5.599s
sys     0m0.076s
--8<---------------cut here---------------end--------------->8---

For ‘write.bm’:

--8<---------------cut here---------------start------------->8---
;; with iconv:

("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)

;; with libunistring:

("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---

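‘write’ is about 10 times faster.  In the nominal case (strings without
escapes), ‘display’ is ~30% faster here, though it is slower on strings
with escapes (3.39 vs. 1.96), and ‘sxml->xml’ is ~60% faster on this
4 MiB XML file: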

--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 2.48  2.44  0.02   0.00   0.00   0.00

$ guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 6.43  6.39  0.04   0.00   0.00   0.00
--8<---------------cut here---------------end--------------->8---

Thanks,
Ludo’.



