[GNUnet-developers] libextractor and UTF-8

From:	Milan
Subject:	[GNUnet-developers] libextractor and UTF-8
Date:	Mon, 20 Dec 2004 18:19:02 +0100
User-agent:	Mozilla Thunderbird 0.9 (X11/20041124)

Hi!

Here are 3 small patches that convert the strings in id3v2 tags to UTF-8. I've tested them a little, but they need to be tested under more varied conditions. I'm unable to create tags for versions 2.4 and 2.2. The problem is finding a program that lets you choose the exact subversion, or else making the tag by hand. The UCS-2 and UTF-16BE encodings are hard to find in any tagger because their use is not necessary.
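For anyone who wants to build such a test tag by hand, the layout is small enough to write out directly. What follows is only a minimal sketch in plain C, assuming the ID3v2.4 layout (a 10-byte tag header, a 10-byte frame header, and sizes stored as 7-bit "synchsafe" bytes). The names make_id3v24_tag and put_synchsafe are made up for this illustration and are not part of libextractor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 28-bit value as four "synchsafe" bytes (7 bits each, top bit clear). */
static void put_synchsafe(uint8_t *out, uint32_t v) {
  out[0] = (uint8_t)((v >> 21) & 0x7F);
  out[1] = (uint8_t)((v >> 14) & 0x7F);
  out[2] = (uint8_t)((v >> 7) & 0x7F);
  out[3] = (uint8_t)(v & 0x7F);
}

/* Hypothetical helper: write an ID3v2.4 tag that holds one text frame.
   id       4-character frame ID, e.g. "TIT2"
   enc      encoding byte (0x00 .. 0x03)
   text     the string, already encoded to match enc
   Returns the number of bytes written, or 0 if buf is too small. */
size_t make_id3v24_tag(uint8_t *buf, size_t buf_len, const char *id,
                       uint8_t enc, const uint8_t *text, size_t text_len) {
  size_t frame_body = 1 + text_len;   /* encoding byte + string */
  size_t tag_body = 10 + frame_body;  /* frame header + frame body */

  if (tag_body >= ((size_t)1 << 28) || buf_len < 10 + tag_body)
    return 0;

  /* Tag header: "ID3", version 2.4.0, no flags, synchsafe size. */
  memcpy(buf, "ID3", 3);
  buf[3] = 4;
  buf[4] = 0;
  buf[5] = 0;
  put_synchsafe(buf + 6, (uint32_t)tag_body);

  /* Frame header: ID, synchsafe size, two flag bytes. */
  memcpy(buf + 10, id, 4);
  put_synchsafe(buf + 14, (uint32_t)frame_body);
  buf[18] = 0;
  buf[19] = 0;

  /* Frame body: the encoding byte sits 10 bytes after the frame start. */
  buf[20] = enc;
  memcpy(buf + 21, text, text_len);
  return 10 + tag_body;
}
```

Called with id "TIT2", enc 0x02 and the four bytes 00 48 00 69, this should produce a 25-byte tag whose title is "Hi" in UTF-16BE. Writing that buffer in front of some MP3 data should give a test file for the 2.4 case.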

For versions 2.0 and 2.3, I assume that the encoding byte is always present in text frames (which are the ones we're looking for). I think that's what the text says, but it isn't what libextractor did before. If the encoding byte was 0x01, i.e. the encoding is UCS-2 (very rare!), this byte would have been included in the string by the previous code.
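To make the role of the encoding byte concrete, here is a rough sketch in plain C that does the conversion by hand instead of calling GLib. It assumes the frame body starts with the encoding byte, that UCS-2 strings carry a byte order mark, and that an unknown encoding byte is treated as ISO-8859-1. The names text_frame_to_utf8 and put_utf8 are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Write code point cp (below 0x10000) as UTF-8; returns the bytes written. */
static size_t put_utf8(uint8_t *out, uint32_t cp) {
  if (cp < 0x80) {
    out[0] = (uint8_t)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (uint8_t)(0xC0 | (cp >> 6));
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  out[0] = (uint8_t)(0xE0 | (cp >> 12));
  out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
  out[2] = (uint8_t)(0x80 | (cp & 0x3F));
  return 3;
}

/* Hypothetical helper: body points at the encoding byte of a text frame,
   len is the frame body size including that byte. Returns a malloc'ed,
   NUL-terminated UTF-8 string, or NULL on failure. */
char *text_frame_to_utf8(const uint8_t *body, size_t len) {
  if (len < 1)
    return NULL;

  const uint8_t *s = body + 1; /* the string starts after the encoding byte */
  size_t n = len - 1;
  size_t o = 0;
  uint8_t *out = malloc(2 * n + 1); /* Latin-1 at most doubles in size */
  if (out == NULL)
    return NULL;

  if (body[0] == 0x01) {
    /* UCS-2: the byte order mark decides the byte order. */
    int big_endian;
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF) {
      big_endian = 1;
    } else if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE) {
      big_endian = 0;
    } else {
      free(out); /* no byte order mark: give up */
      return NULL;
    }
    for (size_t i = 2; i + 1 < n; i += 2) {
      uint32_t hi = big_endian ? s[i] : s[i + 1];
      uint32_t lo = big_endian ? s[i + 1] : s[i];
      uint32_t cp = (hi << 8) | lo;
      if (cp == 0)
        break; /* string terminator */
      if (cp >= 0xD800 && cp <= 0xDFFF) {
        free(out); /* surrogates are not valid UCS-2 */
        return NULL;
      }
      o += put_utf8(out + o, cp);
    }
  } else {
    /* 0x00, or an unknown byte treated as ISO-8859-1. */
    for (size_t i = 0; i < n && s[i] != 0; i++)
      o += put_utf8(out + o, s[i]);
  }
  out[o] = '\0';
  return (char *)out;
}
```

The point of the sketch is that the encoding byte is consumed before conversion and never ends up in the resulting string; the caller frees the returned buffer.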

I'm going on holiday tomorrow (yeah, I'm a student, so I can take a breather! ;-)). I'll give you the results of my tests when I come back. In the meantime, if somebody wants to try with some files, you're welcome to!


--
Milan
:-:-:-:
GnuPG key: 0xB4A12547
:-:-:-:

"Free yourself from proprietary systems... Switch to GNU/Linux"
{ www.gnu.org/home.fr.html --- www.lea-linux.org/intro/ }

170c170,198
<       if (data[pos+10] == '\0') {
---
>       if (data[pos+10] <= 0x01) { 
>         /* this byte describes the encoding
>         try to convert strings to UTF-8
>         if it fails, then forget it */
>         switch (data[pos+10]) {
>            case 0x00 :
>               word = g_convert(&data[pos+11],
>                                csize,
>                                "UTF-8",      /* to_codeset */
>                                "ISO-8859-1", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>            case 0x01 :
>               word = g_convert(&data[pos+11],
>                                csize,
>                                "UTF-8", /* to_codeset */
>                                "UCS-2", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>          }
>       } else {
>         /* bad encoding byte,
>            try to convert from iso-8859-1 */
>           word =  g_convert(&data[pos+11],
>                             csize,
>                             "UTF-8",      /* to_codeset */
>                             "ISO-8859-1", /* from_codeset */
>                             NULL, NULL, NULL);
>       }
173,180c201,205
<       }
<       word = malloc(csize+1);
< 
<       memcpy(word,
<              &data[pos+10],
<              csize);
<       word[csize] = '\0';
<       if (strlen(word) > 0) {
---
>       if ((word != NULL) &&
>               (g_utf8_strlen(word, -1) > 0) &&
>               g_utf8_validate (word,
>                               -1,
>                               NULL)) {

168,171c170,195
<         /* FIXME: this bit describes the encoding! */
<         pos++;
<         csize--;
<       }
---
>         /* this byte describes the encoding
>         try to convert strings to UTF-8
>         if it fails, then forget it */
>         switch (data[pos+10]) {
>            case 0x00 :
>               word = g_convert(&data[pos+11],
>                                csize,
>                                "UTF-8",      /* to_codeset */
>                                "ISO-8859-1", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>            case 0x01 :
>               word = g_convert(&data[pos+11],
>                                csize,
>                                "UTF-8",  /* to_codeset */
>                                "UTF-16", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>            case 0x02 :
>               word = g_convert(&data[pos+11],
>                                csize,
>                                "UTF-8",    /* to_codeset */
>                                "UTF-16BE", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>           case 0x03 :
175c198
<              &data[pos+10],
---
>                     &data[pos+11],
178c201,218
<       if (strlen(word) > 0) {
---
>               break;
>          }
>       } else {
>         /* bad encoding byte,
>            try to convert from iso-8859-1 */
>           word =  g_convert(&data[pos+11],
>                             csize,
>                             "UTF-8",      /* to_codeset */
>                             "ISO-8859-1", /* from_codeset */
>                             NULL, NULL, NULL);
>       }
>       pos++;
>       csize--;
>       if ((word != NULL) &&
>               (g_utf8_strlen(word, -1) > 0) &&
>               g_utf8_validate (word,
>                               -1,
>                               NULL)) {

133,134c134,162
<       word = malloc(csize+1);
<       if (data[pos+6] == '\0') {
---
>       if (data[pos+6] <= 0x01) { 
>         /* this byte describes the encoding
>         try to convert strings to UTF-8
>         if it fails, then forget it */
>         switch (data[pos+6]) {
>            case 0x00 :
>               word = g_convert(&data[pos+7],
>                                csize,
>                                "UTF-8",      /* to_codeset */
>                                "ISO-8859-1", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>            case 0x01 :
>               word = g_convert(&data[pos+7],
>                                csize,
>                                "UTF-8", /* to_codeset */
>                                "UCS-2", /* from_codeset */
>                                NULL, NULL, NULL);
>               break;
>          }
>       } else {
>         /* bad encoding byte,
>            try to convert from iso-8859-1 */
>           word =  g_convert(&data[pos+7],
>                             csize,
>                             "UTF-8",      /* to_codeset */
>                             "ISO-8859-1", /* from_codeset */
>                             NULL, NULL, NULL);
>       }
137,142c165,169
<       }
<       memcpy(word,
<              &data[pos+6],
<              csize);
<       word[csize] = '\0';
<       if (strlen(word) > 0) {
---
>       if ((word != NULL) &&
>               (g_utf8_strlen(word, -1) > 0) &&
>               g_utf8_validate (word,
>                               -1,
>                               NULL)) {
