Re: [Help-source-highlight] Unicode files ?
From: Dario Teixeira
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 06:26:03 -0700 (PDT)
Hi,
> Thanks for your answers. Again, as I discovered Source-highlight very
> recently, I don't know if Unicode is an important feature for you or
> not... I sometimes read source code from Japanese or Chinese
> developers, and am French myself, so it's not unusual to store code
> or text files in Unicode (I mostly work with Visual Studio).
I would say that Unicode is an essential feature. In fact, I thought
Source-highlight was already Unicode-compliant, since this is 2010 and
it is hard to imagine an application that isn't.
> Unicode files (UTF-8 for example, which is widely used on the Internet)
> can store characters on 1 to 6 bytes. So of course it's very difficult
> to use (length() and so on are difficult)
I think you are exaggerating the difficulty of dealing with variable-length
encodings such as UTF-8. In fact, almost every library I know that deals
with Unicode does so using the UTF-8 encoding. Sure, finding the Nth
character of a string is an O(n) operation instead of O(1), but many
other common operations such as strcpy() and strcat() work exactly the
same way as with a fixed-length encoding.
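
To illustrate the point, here is a minimal sketch in C. The names
utf8_nth() and utf8_concat() are made up for this example and do not
come from Source-highlight or any library; the code also assumes that
its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the n-th character (0-based) of a UTF-8 string,
 * or NULL if the string is shorter than that. This is the O(n) walk:
 * every byte before the target has to be inspected.
 * Hypothetical helper; assumes valid UTF-8 input. */
const char *utf8_nth(const char *s, size_t n)
{
    for (; *s != '\0'; s++) {
        /* Bytes of the form 10xxxxxx continue the previous character,
         * so only the other bytes start a new one. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            n--;
        }
    }
    return NULL;
}

/* Concatenate two UTF-8 strings with the ordinary byte-oriented
 * strcpy() and strcat(). Returns 0 on success, or -1 if the buffer
 * is too small. Hypothetical helper. */
int utf8_concat(char *buf, size_t bufsize, const char *a, const char *b)
{
    if (strlen(a) + strlen(b) + 1 > bufsize)
        return -1;
    strcpy(buf, a);   /* copies bytes; multi-byte sequences stay intact */
    strcat(buf, b);
    return 0;
}
```

The copy and the concatenation need no knowledge of the encoding,
because joining two valid UTF-8 strings byte by byte gives another
valid UTF-8 string. Only the indexing has to look at the bytes.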
> 1) First you have to know if the file is Unicode or not. They
> should have a header, described here:
Some of us use Source-highlight as a library, and therefore that
determination should be made by the main application. I suggest
that the core functions of Source-highlight be parameterised over
the encoding used. Almost everyone uses either a single-byte encoding
(and thus not Unicode) or Unicode in the form of UTF-8. There's also
some UTF-16 out there and even UTF-32 (aka UCS-4), but these are
less common. In fact, if Source-highlight were to support only
single-byte encodings and UTF-8, I would deem it Unicode-compliant.
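
As a rough sketch of what "parameterised over the encoding" could mean,
consider the C fragment below. The enum text_encoding and the function
char_width() are purely hypothetical and are not part of
Source-highlight's actual interface; the calling application would pass
in whichever encoding it has determined.

```c
#include <stddef.h>

/* Hypothetical encoding tag chosen by the calling application. */
enum text_encoding { ENC_SINGLE_BYTE, ENC_UTF8 };

/* Number of bytes occupied by the character that starts at p.
 * For UTF-8 the width is read off the lead byte. Assumes p points at
 * the start of a character; an invalid lead byte is stepped over as
 * a single byte. */
size_t char_width(enum text_encoding enc, const unsigned char *p)
{
    if (enc == ENC_SINGLE_BYTE)
        return 1;
    if (*p < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((*p & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((*p & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((*p & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 1;                   /* stray continuation or invalid byte */
}
```

Code that steps through the text one character at a time would call
char_width() instead of assuming one byte per character, and nothing
else in it would need to know the encoding.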
> 2) The second thing is to convert the whole file to a "fixed bytes
> per character" format, so you can work with it. A wide char format
> (16 bits wchar) is a good choice most of the time.
Actually, a 16-bit wchar is a terrible choice, since Unicode code points
do not all fit in 16 bits; a fixed-width representation needs 32 bits.
Also, you don't necessarily need to convert the whole file to a
fixed-length encoding. Why not simply work natively in UTF-8?
It's really not as difficult as you make it out to be...
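
For what it's worth, decoding one character straight from UTF-8 takes
only a few lines. The sketch below is a simplified illustration, and
utf8_decode() is a made-up name: it checks the shape of the byte
sequence but does not reject overlong forms or surrogate values, which
a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the UTF-8 sequence starting at s into *cp. Returns the
 * number of bytes consumed, or 0 if the sequence is malformed.
 * Simplified sketch: overlong encodings and surrogates are not
 * rejected. The input must be NUL-terminated. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else
        return 0;               /* not a valid lead byte */

    for (i = 1; i < len; i++) {
        /* Each following byte must look like 10xxxxxx. A terminating
         * NUL fails this test, so the loop never reads past it. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}
```

Feeding it the four bytes F0 9D 84 9E gives the code point U+1D11E,
which is larger than 0xFFFF and therefore cannot be stored in a single
16-bit unit.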
> Don't know too much on the Linux side, but it's simply a matter of
> wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat() with
> Visual Studio.
Using UTF-8, you will need a special length() function, but you can
use the regular strcpy() and strcat(). I don't use C++, but a quick
Google search tells me there are libraries out there that provide
UTF-8 support for C++.
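
The special length function can be very small. Here is one possible
sketch in C; the name utf8_strlen() is invented for this example, and
the code assumes that the input is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (10xxxxxx). Hypothetical helper; assumes valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For a string such as "café" encoded in UTF-8, strlen() reports 5 bytes
while utf8_strlen() reports 4 characters.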
Cheers,
Dario Teixeira