[GNUnet-developers] libextractor and UTF-8
From: Milan
Subject: [GNUnet-developers] libextractor and UTF-8
Date: Mon, 20 Dec 2004 18:19:02 +0100
User-agent: Mozilla Thunderbird 0.9 (X11/20041124)
Hi!
Here are three small patches that convert the strings in ID3v2 tags to
UTF-8. I've tested them a little, but they need to be tested under more
varied conditions. I haven't been able to create tags for versions 2.4
and 2.2: the problem is finding a program that lets you choose the exact
sub-version, or else making the tag by hand. The UCS-2 and UTF-16BE
encodings are hard to find in any tagger because their use is not
necessary.
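For anyone who wants to make a test tag by hand, here is a minimal sketch
in C. It assumes the ID3v2.3 layout (a 10-byte tag header with a synchsafe
size, followed by a 10-byte frame header with a plain big-endian size). The
helper name make_ucs2_tag is hypothetical, it only handles ASCII titles, and
it is not part of libextractor or of the patches.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libextractor.
   Writes a minimal ID3v2.3 tag containing a single TIT2 (title) frame.
   The text uses encoding byte 0x01: a little-endian byte order mark
   followed by two bytes per character. Only ASCII titles are accepted.
   Returns the number of bytes written, or 0 if the tag cannot be built. */
static size_t make_ucs2_tag(const char *title, unsigned char *out, size_t cap)
{
  size_t n = strlen(title);
  size_t payload = 1 + 2 + 2 * n;  /* encoding byte + BOM + text */
  size_t frame = 10 + payload;     /* frame header + payload */
  size_t i;
  size_t p = 0;

  if (n > 1000 || 10 + frame > cap)
    return 0;
  for (i = 0; i < n; i++)
    if ((unsigned char) title[i] >= 0x80)
      return 0;                    /* this sketch is ASCII-only */

  /* tag header: "ID3", version 2.3.0, no flags, synchsafe size */
  memcpy(out, "ID3", 3);
  p = 3;
  out[p++] = 0x03;
  out[p++] = 0x00;
  out[p++] = 0x00;
  out[p++] = (unsigned char) ((frame >> 21) & 0x7F);
  out[p++] = (unsigned char) ((frame >> 14) & 0x7F);
  out[p++] = (unsigned char) ((frame >> 7) & 0x7F);
  out[p++] = (unsigned char) (frame & 0x7F);

  /* frame header: ID, 32-bit big-endian size, two flag bytes */
  memcpy(out + p, "TIT2", 4);
  p += 4;
  out[p++] = (unsigned char) ((payload >> 24) & 0xFF);
  out[p++] = (unsigned char) ((payload >> 16) & 0xFF);
  out[p++] = (unsigned char) ((payload >> 8) & 0xFF);
  out[p++] = (unsigned char) (payload & 0xFF);
  out[p++] = 0x00;
  out[p++] = 0x00;

  /* payload: encoding byte, BOM (FF FE = little-endian), then the text */
  out[p++] = 0x01;
  out[p++] = 0xFF;
  out[p++] = 0xFE;
  for (i = 0; i < n; i++) {
    out[p++] = (unsigned char) title[i];
    out[p++] = 0x00;
  }
  return p;
}
```

Writing the returned bytes at the start of an MP3 file should give a file
whose title frame carries encoding byte 0x01, which is the case that is
hard to produce with ordinary taggers.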
For versions 2.0 and 2.3, I assume that the encoding byte is always
present in text frames (which are the ones we're looking for). I think
that's what the specification says, but it isn't what libextractor did
before. If the encoding byte was 0x01, i.e. the encoding is UCS-2 (very
rare!), the previous code would have included this byte in the string.
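To make that assumption concrete, here is a small sketch in plain C (no
glib) that treats the first payload byte of a text frame as the encoding
marker and never copies it into the string. The function names are
hypothetical and this is not the patched libextractor code. Only the 0x00
(ISO-8859-1) case is converted here; any other value is rejected rather
than guessed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: converts up to `len` bytes of ISO-8859-1 into a
   newly allocated, NUL-terminated UTF-8 string. Conversion stops early
   at a 0x00 byte. Returns NULL if the allocation fails. */
static char *latin1_to_utf8(const unsigned char *src, size_t len)
{
  /* every ISO-8859-1 byte becomes at most two UTF-8 bytes */
  char *dst = malloc(2 * len + 1);
  size_t i;
  size_t o = 0;

  if (dst == NULL)
    return NULL;
  for (i = 0; i < len && src[i] != 0; i++) {
    unsigned char c = src[i];
    if (c < 0x80) {
      dst[o++] = (char) c;
    } else {
      dst[o++] = (char) (0xC0 | (c >> 6));
      dst[o++] = (char) (0x80 | (c & 0x3F));
    }
  }
  dst[o] = '\0';
  return dst;
}

/* Hypothetical helper: `payload` points at the frame data that follows
   the frame header, so payload[0] is the encoding byte. The caller
   frees the result. Returns NULL for empty frames and for encodings
   that this sketch does not handle. */
static char *text_frame_to_utf8(const unsigned char *payload, size_t size)
{
  if (size < 1)
    return NULL;
  switch (payload[0]) {
  case 0x00:
    /* skip the encoding byte and shrink the length to match */
    return latin1_to_utf8(payload + 1, size - 1);
  default:
    return NULL;
  }
}
```

The point of the design is that the offset and the length move together:
the text starts one byte after the encoding byte and is one byte shorter
than the frame payload.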
I'm going on holiday tomorrow (yes, I'm a student, so I can take a
break! ;-)). I'll send you the results of my tests when I come back.
In the meantime, if anybody wants to try the patches on some files,
you're welcome to!
--
Milan
:-:-:-:
GnuPG key: 0xB4A12547
:-:-:-:
"Free yourself from proprietary systems... Switch to GNU/Linux"
{ www.gnu.org/home.fr.html --- www.lea-linux.org/intro/ }
170c170,198
< if (data[pos+10] == '\0') {
---
> if (data[pos+10] <= 0x01) {
> /* this byte describes the encoding
> try to convert strings to UTF-8
> if it fails, then forget it */
> switch (data[pos+10]) {
> case 0x00 :
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> break;
> case 0x01 :
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "UCS-2",
> NULL, NULL, NULL);
> break;
> }
> } else {
> /* bad encoding byte,
> try to convert from iso-8859-1 */
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> }
173,180c201,205
< }
< word = malloc(csize+1);
<
< memcpy(word,
< &data[pos+10],
< csize);
< word[csize] = '\0';
< if (strlen(word) > 0) {
---
> if ((word != NULL) &&
> (g_utf8_strlen(word, -1) > 0) &&
> g_utf8_validate (word,
> -1,
> NULL)) {
168,171c170,195
< /* FIXME: this bit describes the encoding! */
< pos++;
< csize--;
< }
---
> /* this byte describes the encoding
> try to convert strings to UTF-8
> if it fails, then forget it */
> switch (data[pos+10]) {
> case 0x00 :
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> break;
> case 0x01 :
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "UTF-16",
> NULL, NULL, NULL);
> break;
> case 0x02 :
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "UTF-16BE",
> NULL, NULL, NULL);
> break;
> case 0x03 :
175c198
< &data[pos+10],
---
> &data[pos+11],
178c201,218
< if (strlen(word) > 0) {
---
> break;
> }
> } else {
> /* bad encoding byte,
> try to convert from iso-8859-1 */
> word = g_convert(&data[pos+11],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> }
> pos++;
> csize--;
> if ((word != NULL) &&
> (g_utf8_strlen(word, -1) > 0) &&
> g_utf8_validate (word,
> -1,
> NULL)) {
133,134c134,162
< word = malloc(csize+1);
< if (data[pos+6] == '\0') {
---
> if (data[pos+6] <= 0x01) {
> /* this byte describes the encoding
> try to convert strings to UTF-8
> if it fails, then forget it */
> switch (data[pos+6]) {
> case 0x00 :
> word = g_convert(&data[pos+7],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> break;
> case 0x01 :
> word = g_convert(&data[pos+7],
> csize,
> "UTF-8",
> "UCS-2",
> NULL, NULL, NULL);
> break;
> }
> } else {
> /* bad encoding byte,
> try to convert from iso-8859-1 */
> word = g_convert(&data[pos+7],
> csize,
> "UTF-8",
> "ISO-8859-1",
> NULL, NULL, NULL);
> }
137,142c165,169
< }
< memcpy(word,
< &data[pos+6],
< csize);
< word[csize] = '\0';
< if (strlen(word) > 0) {
---
> if ((word != NULL) &&
> (g_utf8_strlen(word, -1) > 0) &&
> g_utf8_validate (word,
> -1,
> NULL)) {