grub-devel

Re: [PATCH] * grub-core/fs/udf.c: Add support for UUID


From: Pali Rohár
Subject: Re: [PATCH] * grub-core/fs/udf.c: Add support for UUID
Date: Fri, 12 May 2017 16:39:14 +0200
User-agent: KMail/1.13.7 (Linux/3.13.0-117-generic; KDE/4.14.2; x86_64; ; )

On Monday 08 May 2017 15:13:28 Vladimir 'phcoder' Serbinenko wrote:
> On Mon, Apr 10, 2017, 23:17 Pali Rohár <address@hidden> wrote:
> > -read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf)
> > +read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf, int normalize_utf8)
> 
> Normalize isn't the right word. And it's not utf-8 but latin1 (called
> compressed utf-16 by udf docs).
> Are you sure you handle utf-16 case correctly? What is the expected
> behavior in those cases? Ideally you may want to just parse raw
> string in caller

Hi! I have looked at the OSTA UDF spec again and found the source of my
confusion: libblkid implements 8bit OSTA compressed unicode incorrectly,
and I was just trying to mimic libblkid in grub.

libblkid handles 16bit OSTA compressed unicode as UTF-16BE and 8bit OSTA 
compressed unicode as UTF-8.

The UDF 2.01 specification says:
====
For a CompressionID of 8 or 16, the value of the CompressionID shall 
specify the number of BitsPerCharacter for the d-characters defined in 
the CharacterBitStream field. Each sequence of CompressionID bits in the 
CharacterBitStream field shall represent an OSTA Compressed Unicode d-
character. The bits of the character being encoded shall be added to the 
CharacterBitStream from most- to least-significant-bit. The bits shall 
be added to the CharacterBitStream starting from the most significant 
bit of the current byte being encoded into. The value of the OSTA 
Compressed Unicode d-character interpreted as a Uint16 defines the value 
of the corresponding d-character in the Unicode 2.0 standard.
====

So an 8bit OSTA compressed unicode buffer contains a sequence of Unicode
code points, one per 8 bits, which is effectively equivalent to the
Latin1 (ISO-8859-1) encoding.

And 16bit OSTA compressed unicode is a sequence of Unicode code points,
one per 16 bits, in big endian. That is probably only UCS-2 and not full
UTF-16.
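
To make the reading above concrete, here is a minimal sketch of a decoder
that follows it. It is not the grub read_string() and not the libblkid
code; osta_to_utf8() and put_utf8() are hypothetical names, and the
sketch assumes that raw[0] holds the CompressionID and that the 16bit
form is plain UCS-2 (surrogates are not combined).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write one BMP code point (at most 0xFFFF) as UTF-8.
   Returns the number of bytes written (1 to 3). */
static size_t
put_utf8 (uint16_t cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (char) (0xE0 | (cp >> 12));
  out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Hypothetical helper: raw[0] is the CompressionID, the remaining
   sz - 1 bytes are the CharacterBitStream.  outbuf must have room for
   3 * (sz - 1) + 1 bytes.  Returns the UTF-8 length, or -1 if the
   CompressionID is neither 8 nor 16.  */
static long
osta_to_utf8 (const uint8_t *raw, size_t sz, char *outbuf)
{
  size_t i, n = 0;

  assert (raw != NULL && outbuf != NULL);
  if (sz == 0)
    return -1;

  if (raw[0] == 8)
    {
      /* One code point per byte: same mapping as Latin1.  */
      for (i = 1; i < sz; i++)
        n += put_utf8 (raw[i], outbuf + n);
    }
  else if (raw[0] == 16)
    {
      /* One code point per big-endian 16bit unit (treated as UCS-2);
         a trailing odd byte is ignored.  */
      for (i = 1; i + 1 < sz; i += 2)
        n += put_utf8 ((uint16_t) ((raw[i] << 8) | raw[i + 1]), outbuf + n);
    }
  else
    return -1;

  outbuf[n] = '\0';
  return (long) n;
}
```

With this sketch, the 8bit input { 8, 'a', 0xE9 } becomes the three UTF-8
bytes 'a', 0xC3, 0xA9, and the 16bit input { 16, 0x00, 0xE9, 0x20, 0xAC }
becomes 0xC3 0xA9 0xE2 0x82 0xAC.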

So the problem is with 8bit OSTA compressed unicode when it contains
bytes that are not UTF-8 invariants (ASCII), as those are decoded
differently under Latin1 and UTF-8.
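
A small sketch of where the two interpretations diverge. Both helpers
are hypothetical and only for illustration: is_ascii_only() reports
whether a buffer decodes identically under Latin1 and UTF-8, and
count_utf8_starts() counts the bytes that are not UTF-8 continuation
bytes (it does not validate the sequence).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every byte is below 0x80, i.e. the buffer decodes to
   the same code points under Latin1 and under UTF-8.  */
static int
is_ascii_only (const uint8_t *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] >= 0x80)
      return 0;
  return 1;
}

/* Counts bytes that are not of the form 10xxxxxx.  For well-formed
   UTF-8 this is the number of code points; under Latin1 the number of
   code points is simply len.  No validation is done here.  */
static size_t
count_utf8_starts (const uint8_t *buf, size_t len)
{
  size_t i, n = 0;

  for (i = 0; i < len; i++)
    if ((buf[i] & 0xC0) != 0x80)
      n++;
  return n;
}
```

For example, the two bytes 0xC3 0xA9 are two code points under Latin1
(U+00C3, U+00A9) but a single code point under UTF-8 (U+00E9), while a
lone 0xE9 is U+00E9 under Latin1 and an incomplete sequence under UTF-8.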

(Please correct me if I'm wrong here)

For now, please drop or suspend this patch until we decide what to do
about it, given the different (wrong) implementation of string reading
in libblkid from util-linux.

-- 
Pali Rohár
address@hidden
