From: | Lionel Fumery |
Subject: | Re: [Help-source-highlight] Unicode files ? |
Date: | Tue, 30 Mar 2010 16:09:10 +0200 |
User-agent: | Thunderbird 2.0.0.24 (Windows/20100228) |
Hi,

I'm not a Unicode expert either, so it's great if this feature is easier than I thought...

About working in UTF-8 all the time: do you mean (for example) converting UTF-16 or anything else to UTF-8 and then working in UTF-8? Or only supporting UTF-8 files?

Thanks for your help,
Lionel

Dario Teixeira wrote:
> Hi,
>
> > Thanks for your answers. Again, as I discovered Source-highlight very recently, I don't know if Unicode is an important feature for you or not... I sometimes read source code from Japanese or Chinese developers, and am French myself, so it's not unusual to store code or text files in Unicode (I mostly work with Visual Studio).
>
> I would say that Unicode is an essential feature. In fact, I thought Source-highlight was already Unicode-compliant, since this is 2010 and it is hard to imagine an application that isn't.
>
> > Unicode files (UTF-8 for example, which is widely used on the Internet) can store characters on 1 to 4 bytes. So of course it's very difficult to use (length() and so on are difficult).
>
> I think you are exaggerating the difficulty of dealing with variable-length encodings such as UTF-8. In fact, almost every library I know that deals with Unicode does so using the UTF-8 encoding. Sure, finding the Nth element of a string is an O(n) operation instead of O(1), but many other common operations such as strcpy() and strcat() work the same way as with a fixed-length encoding.
>
> > 1) First you have to know if the file is Unicode or not. They should have a header, described here:
>
> Some of us use Source-highlight as a library, and therefore that determination should be made by the main application. I suggest that the core functions of Source-highlight be parameterised over the encoding used. Almost everyone uses either a single-byte (thus non-Unicode) encoding or Unicode in the form of UTF-8. There's also some UTF-16 out there and even UTF-32 (aka UCS-4), but these are less common.
>
> In fact, if Source-highlight were to support only single-byte encodings and UTF-8, I would deem it Unicode-compliant.
>
> > 2) The second thing is to convert the whole file to a "fixed bytes per character" format, so you can work with it. A wide char format (16-bit wchar) is a good choice most of the time.
>
> Actually, a 16-bit wchar is a terrible choice, since Unicode code points do not all fit in 16 bits; a fixed-width encoding needs 32-bit units. Also, you don't necessarily need to convert the whole file to a fixed-length encoding. Why not simply work natively in UTF-8? It's really not as difficult as you make it out to be...
>
> > Don't know too much on the Linux side, but it's simply a matter of wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat() with Visual Studio.
>
> Using UTF-8, you will need a special length() function, but you can use the regular strcpy() and strcat(). I don't use C++, but a quick Google search tells me there are libraries out there that provide UTF-8 support for C++.
>
> Cheers,
> Dario Teixeira
>
> _______________________________________________
> Help-source-highlight mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/help-source-highlight
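
To illustrate the point made in the thread about UTF-8 needing a special length() while the regular strcat() still works, here is a minimal C++ sketch. It assumes the input is already valid UTF-8. The names utf8_length and concat_and_count are hypothetical helpers made up for this illustration; they are not part of Source-highlight or of any library mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts Unicode code points in a NUL-terminated
// UTF-8 string. Assumes valid UTF-8, where every byte that is NOT a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8_length(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: appends src to dest with the ordinary strcat()
// and returns the code point count of the result. UTF-8 never uses a
// zero byte inside a multi-byte sequence, so a byte-wise copy keeps
// every sequence intact. The caller must make sure dest is large enough.
std::size_t concat_and_count(char* dest, const char* src) {
    assert(dest != nullptr && src != nullptr);
    std::strcat(dest, src);
    return utf8_length(dest);
}
```

For the 5-byte string "caf\xC3\xA9" (the word "café"), std::strlen() reports 5 bytes while utf8_length() reports 4 code points; the O(n) scan is the only extra cost compared with a fixed-width encoding.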