unicode string functions
From: Bruno Haible
Subject: unicode string functions
Date: Tue, 2 Jan 2007 22:39:18 +0100
User-agent: KMail/1.9.1
Hi,
Since 2000 the need for elementary functions on Unicode strings has been
apparent and increasing:
- some utility functions exist in GNOME's glib,
- clisp, gettext, emacs, python, ... need programmatic access to the
character name tables,
- gettext's linebreak module relies on several utility functions for
Unicode strings,
- any program printing line + column numbers of characters in a file
needs to consider the width of the character, e.g. libiconv does this,
- clisp would like to have Unicode regular expressions that work even
when the locale is in ISO-8859-1 encoding,
...
Since 2001 I've been working on a library covering such topics. But two issues
kept me from releasing this library:
- It should be more lightweight than IBM's ICU library. It should contain
many functions, and support all 3 kinds of in-memory representation (UTF-8,
UTF-16 and UTF-32), but without installing a multi-megabyte library.
Someone wanting 2 or 3 Unicode string functions does not want to link
with a megabyte-sized library.
- The basic character type, ucs4_t, is an alias of uint32_t. But
one could not assume that <stdint.h> is available.
Gnulib solves both issues: 1. by providing an infrastructure for a source-code
library, 2. by providing a package-independent <stdint.h>.
These data types are actually suitable for gnulib, since they are basic
and project independent.
I'll therefore add a set of modules for Unicode text handling.
The choice of the in-memory representation (UTF-8, UTF-16 or UTF-32) is up to
the application; libunistring supports all three equally.
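
To illustrate what the three in-memory representations mean for a single
character, here is a minimal sketch in C. It is not code from the modules
themselves: the typedef only mirrors the statement above that ucs4_t is an
alias of uint32_t, and the sketch_encode_* helpers are hypothetical names
made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors the mail's "ucs4_t is an alias of uint32_t". */
typedef uint32_t ucs4_t;

/* Hypothetical helper, not part of any library.
   Stores the UTF-8 form of uc in out and returns the number of bytes
   written (1..4), or 0 if uc is a surrogate or above U+10FFFF. */
static size_t sketch_encode_utf8(ucs4_t uc, uint8_t out[4])
{
    if (uc < 0x80) {
        out[0] = (uint8_t) uc;
        return 1;
    }
    if (uc < 0x800) {
        out[0] = (uint8_t) (0xC0 | (uc >> 6));
        out[1] = (uint8_t) (0x80 | (uc & 0x3F));
        return 2;
    }
    if (uc < 0x10000) {
        if (uc >= 0xD800 && uc <= 0xDFFF)
            return 0; /* surrogates are not characters */
        out[0] = (uint8_t) (0xE0 | (uc >> 12));
        out[1] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (uc & 0x3F));
        return 3;
    }
    if (uc < 0x110000) {
        out[0] = (uint8_t) (0xF0 | (uc >> 18));
        out[1] = (uint8_t) (0x80 | ((uc >> 12) & 0x3F));
        out[2] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
        out[3] = (uint8_t) (0x80 | (uc & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper, not part of any library.
   Stores the UTF-16 form of uc in out and returns the number of 16-bit
   units written (1 or 2), or 0 if uc is invalid. */
static size_t sketch_encode_utf16(ucs4_t uc, uint16_t out[2])
{
    if (uc < 0x10000) {
        if (uc >= 0xD800 && uc <= 0xDFFF)
            return 0;
        out[0] = (uint16_t) uc;
        return 1;
    }
    if (uc < 0x110000) {
        uc -= 0x10000; /* 20 bits remain, split into two halves */
        out[0] = (uint16_t) (0xD800 | (uc >> 10));
        out[1] = (uint16_t) (0xDC00 | (uc & 0x3FF));
        return 2;
    }
    return 0;
}

/* In UTF-32 the unit is the ucs4_t value itself, so no helper is needed. */
```

A character therefore occupies between one and four units in UTF-8, one or
two in UTF-16, and always exactly one in UTF-32, which is why an application
may prefer one representation over another.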
The modules are organized in the following directories:
  unistr     elementary string functions
  uniconv    conversion from/to legacy encodings
  unistdio   formatted output to strings
  uniname    character names
  uniwidth   string width when using nonproportional fonts
  unilbrk    line breaking algorithm
  unictype   character classification and properties
  --
  unicase    case folding
  unicomp    composition and decomposition
  uniregex   regular expressions
  unibidi    bidirectional reordering (use FriBidi in the meantime)
The last four are planned, not yet implemented.
Copyright is held by the FSF and the license is LGPL, as usual.
Bruno