
Re: [Libcdio-devel] Iconv usage and string handling


From: Burkhard Plaum
Subject: Re: [Libcdio-devel] Iconv usage and string handling
Date: Tue, 25 Apr 2006 19:21:51 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.7.12-1.3.1

Hi,

Peter Creath wrote:
> Well, that's just the point.  The <string.h> functions aren't
> unicode-aware -- do you know what their behavior will be for non-ASCII
> UTF-8?  If not, you shouldn't use them for UTF-8 strings.

If a string can be in an arbitrary charset, I agree that the <string.h>
functions will fail. But if you force everything to UTF-8, most of them
can be used (probably one of the reasons for the popularity of UTF-8).
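
As a rough illustration of that point, here is a minimal sketch of byte-oriented <string.h> calls applied to UTF-8 data. The helper names utf8_join and utf8_find are hypothetical and not part of libcdio; the sketch assumes its inputs are valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: concatenates two UTF-8 strings into a
   caller-supplied buffer. Returns 0 on success, -1 if dst is too small.
   Sizes are in bytes, which is exactly what strlen() reports. */
int utf8_join(char *dst, size_t dst_size, const char *a, const char *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    if (strlen(a) + strlen(b) + 1 > dst_size)
        return -1;
    strcpy(dst, a);
    strcat(dst, b);
    return 0;
}

/* Hypothetical helper: byte-wise substring search. Lead bytes and
   continuation bytes use disjoint value ranges in UTF-8, so a match of
   a valid needle always starts on a character boundary. */
const char *utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle);
}
```

Neither function needs to decode anything; both only move or compare bytes, which is why the plain C library calls are sufficient here.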

The only possible pitfall with UTF-8 is that there is no 1:1 mapping from
bytes to printed characters. strlen() will return the number of *bytes*, not
*characters*. If you say that this is nonstandard behaviour, I agree.
But most programs assume anyway that strlen() returns the number of *bytes*
(e.g. for a later malloc).
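
To make the byte/character difference concrete, here is a small sketch that counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The function name utf8_codepoints is made up for this example, and the input is assumed to be valid UTF-8. Note that even code points are not always printed characters (think of combining marks), so this is only an approximation of what a renderer would see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the code points in a valid,
   NUL-terminated UTF-8 string. Every byte that is not a continuation
   byte (10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na\xC3\xAF" "ve" (the word "naive" with a diaeresis on the i), strlen() reports 6 bytes while utf8_codepoints() reports 5. The strlen() value is the one you want for malloc.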

I can't think of many situations anywhere near the libcdio usage
where the number of characters actually matters, except text rendering
in GUI toolkits and the stdio implementation in glibc.

Another issue is alphabetic sorting, but
AFAIK this is strongly locale dependent anyway.
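
A minimal sketch of the sorting issue, with hypothetical function names: strcmp() orders UTF-8 strings by raw byte value, which for valid UTF-8 is the same as code point order, while strcoll() follows whatever LC_COLLATE locale the program selected with setlocale(). Which collation locales are installed is system dependent, so no particular strcoll() result is assumed here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comparator: raw byte order. strcmp() compares bytes as
   unsigned char, so "z" (0x7A) sorts before "\xC3\xA9" (e with acute
   accent), which is not what most alphabets expect. */
int utf8_cmp_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Hypothetical comparator: locale-aware order. The result depends on
   the current LC_COLLATE setting; in the default "C" locale it behaves
   like strcmp(). */
int utf8_cmp_locale(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b);
}
```

Either function can be wrapped for qsort(); the choice only matters when the sorted output is shown to a user.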

[...]
> And the compiler barks at you if you "accidentally" pass a non-ASCII
> string to a routine that expects an ASCII string.

Actually, it doesn't:

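#include <string.h>

/* utf8 is only an alias for char, so the compiler sees no difference. */
typedef char utf8;

int foo(void)
{
    const utf8 *str = "Hello World";

    /* strlen() takes a const char *; passing a utf8 pointer draws no warning. */
    return (int) strlen(str);
}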

This compiles without barking using "gcc -c -pedantic -Wall".
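
For contrast, here is a sketch of what it would take to get a diagnostic: a struct is a distinct type, unlike a typedef of char, so passing it to strlen() directly is rejected at compile time. The names utf8_str, utf8_from_bytes and utf8_byte_length are invented for this example and do not come from libcdio or libxml2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper type. Because it is a struct and not an alias
   for char, strlen(s) with s of type utf8_str does not compile. */
typedef struct {
    const char *bytes;  /* NUL-terminated UTF-8 data, not owned */
} utf8_str;

/* Hypothetical constructor: the single place where a plain char
   pointer is declared to hold UTF-8. */
utf8_str utf8_from_bytes(const char *bytes)
{
    utf8_str s;

    assert(bytes != NULL);
    s.bytes = bytes;
    return s;
}

/* Hypothetical accessor: length in bytes, delegating to strlen(). */
size_t utf8_byte_length(utf8_str s)
{
    return strlen(s.bytes);
}
```

The price is that every <string.h> call now needs a wrapper like utf8_byte_length(), which is the extra complexity being weighed in this thread.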

> You can explicitly override it, but it doesn't fail silently.  Silent
> failure causes many many bugs.

Agree :)

> The way libXml2 is doing it is actually the right way.  If it turns
> out that strcpy breaks on non-ASCII data, your wrapper can do the
> right thing.

They document that xmlChar must be UTF-8 and then have xmlStrcat(),
xmlStrdup() etc. IMO completely unnecessary.

> Is UTF-8 guaranteed not to have any internal null bytes?

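Yes. Every byte of a multi-byte sequence has its high bit set, so a zero
byte can only ever encode U+0000 itself.
It's quite a simple encoding: http://en.wikipedia.org/wiki/UTF8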
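
A sketch of the encoding rule behind that answer, assuming the standard 1 to 4 byte UTF-8 forms; the function name utf8_encode is hypothetical. Lead bytes of multi-byte sequences are at least 0xC0 and continuation bytes are at least 0x80, so the only code point that produces a 0x00 byte is U+0000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoder: writes the UTF-8 form of code point cp into
   out (room for 4 bytes) and returns the number of bytes written, or 0
   if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80UL) {                      /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)   /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000UL) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Even U+0100, whose low eight bits are all zero, encodes as the two bytes 0xC4 0x80, so NUL-terminated strings stay intact.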

I won't fight to the death over this issue, but I would like to see a
case where a dedicated utf8 datatype is absolutely necessary. Otherwise,
I prefer to keep things simple.

Cheers

Burkhard



