Re: [mtools] Short filenames, codepages and possible mtools/kernel bug


From: Jaime
Subject: Re: [mtools] Short filenames, codepages and possible mtools/kernel bug
Date: Wed, 31 May 2006 01:38:07 +0100

On Mon, 2006-05-29 at 12:57 +0200, Alain Knaff wrote:
> David C Niemi wrote:
> > 
> > This probably goes back to a shortcut I took in 1994 to get VFAT working 
> > on Mtools.  As you may know VFAT uses 16-bit Unicode, and I assumed that 
> > the high bits would always be zero.  So there's no support for special 
> > code pages unless Alain has since added it.
> > 
> > However, mounting the floppy with the kernel MSDOS file system support 
> > is a totally separate implementation, by different people.
> > 
> > DCN
> 
> Nope, I didn't add any full Unicode support since then...
> 
> However, the problem still is weird, because for Ç you don't need 
> unicode. ISO-8859-1, which is supported, should be enough.
> 
> VFAT uses a constant 2 byte format for its unicode (UCS-2?), and in this 
> representation, all ISO-8859-1 characters (which include Ç) have their 
> high byte equal to zero.
> 
> The same is not true with variable-length unicode encoding (UTF-8), 
> which adds an escape byte to all characters from 0x80 to 0xff.
> 
> 
> I tried reproducing the problem here, but I do get a Ç as I should.
> 
> [...]
> >> Then I swap over to Linux, and run "mdir a:". What I now see is:
> >>
> >> AB�DE    TXT         0 2006-05-28  16:00  AB�DE.TXT
> >>        1 file                    0 bytes
> >>                          1 457 664 bytes free
> 
> It's not necessarily an mtools problem, it could also be a terminal 
> (konsole, gterm, ...) issue.
> 
> Try doing mdir a: | hexdump -C
> 

mdir a: | hexdump -C
00000000  20 56 6f 6c 75 6d 65 20  69 6e 20 64 72 69 76 65  | Volume in drive|
00000010  20 41 20 68 61 73 20 6e  6f 20 6c 61 62 65 6c 0a  | A has no label.|
00000020  20 56 6f 6c 75 6d 65 20  53 65 72 69 61 6c 20 4e  | Volume Serial N|
00000030  75 6d 62 65 72 20 69 73  20 32 43 32 46 2d 35 45  |umber is 2C2F-5E|
00000040  44 42 0a 44 69 72 65 63  74 6f 72 79 20 66 6f 72  |DB.Directory for|
00000050  20 41 3a 2f 0a 0a 41 42  c7 44 45 20 20 20 20 54  | A:/..AB.DE    T|
00000060  58 54 20 20 20 20 20 20  20 20 20 30 20 32 30 30  |XT         0 200|
00000070  36 2d 30 35 2d 32 38 20  20 31 36 3a 30 30 20 20  |6-05-28  16:00  |
00000080  41 42 c7 44 45 2e 54 58  54 0a 20 20 20 20 20 20  |AB.DE.TXT.      |
00000090  20 20 31 20 66 69 6c 65  20 20 20 20 20 20 20 20  |  1 file        |
000000a0  20 20 20 20 20 20 20 20  20 20 20 20 30 20 62 79  |            0 by|
000000b0  74 65 73 0a 20 20 20 20  20 20 20 20 20 20 20 20  |tes.            |
000000c0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 31 20  |              1 |
000000d0  34 35 37 20 36 36 34 20  62 79 74 65 73 20 66 72  |457 664 bytes fr|
000000e0  65 65 0a 0a                                       |ee..|
000000e4


> If you see C7 for the Ç, it is ok (and the mess up only happened on 
> display), if something else, then it is indeed an mtools bug.

Yup, the C7s are there in all the right places.
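A minimal sketch of why a correct C7 can still show up as a black diamond. It assumes (this is not confirmed anywhere above) that the terminal decodes its input as UTF-8; the byte string is copied from the hexdump, and the variable names are only illustrative.

```python
# Bytes of the long-name column as mdir printed them (from the hexdump above).
raw = b"AB\xc7DE.TXT"

# Read as ISO-8859-1, every byte maps to one character and 0xC7 is C-cedilla.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8, 0xC7 announces a two-byte sequence, but the next byte (0x44,
# "D") is not a continuation byte. A decoder asked to substitute rather than
# fail emits U+FFFD, the "question mark in a black diamond".
as_utf8 = raw.decode("utf-8", errors="replace")

# What a UTF-8 terminal would need to receive in order to draw C-cedilla.
utf8_cedilla = "\u00c7".encode("utf-8")

print(as_latin1)
print(as_utf8)
print(utf8_cedilla.hex())
```

If that assumption holds, the output of mdir is fine as ISO-8859-1 and only the interpretation on the display side differs.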

> 
> >>
> >> (the capital C cedilla has been replaced by a tiny white question mark
> >> inside a black diamond/lozenge). Just to check, I mount the filesystem
> >> using the following command:
> >>
> >> mount -t msdos -o codepage=850 /dev/fd0 temp
> 
> Try mount -t vfat instead (to get long names and extended characters)

Er, I don't think I want long names. But please bear with me here - I'm
a complete "character encoding" newbie, and I'm trying to learn how it
works.

> >>
> >> Then, ls shows me a question mark where the capital C cedilla should be.
> 
> That's an ls issue (not an msdos/vfat filesystem issue). Ls replaces, 
> _on_display_ , those characters that it thinks are unprintable with 
> question marks. Depending on your settings (LANG, LC_CTYPE and LC_ALL 
> environment variables), ls may think that the Ç is an unprintable 
> character, and replace it by a question mark. This even happens on 
> native Linux filesystems (reiserfs, etc...). Try it by creating a file 
> with a Ç in it, and then doing ls.
> 
> I've found that with LC_ALL=en_US , the Ç is displayed correctly.
> 
> If that doesn't help, try ls -b instead. Ls -b substitutes "unprintable" 
> characters with their octal code (Should be \307 in case of Ç).
> 

"ls -b" returns "ab\200de.txt" so it's using octal 200 rather than octal
307, but I assume that's because the 80s (hex) on the disk are 200s in
octal (my "mount -t msdos" means that it's the short filenames which are
used, rather than the long filenames, so I get the 8.3 "codepaged"
version, rather than the long filename in unicode).
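As a quick check of the arithmetic and the codepage claim, here is a small sketch using Python's built-in cp437, cp850 and iso-8859-1 codecs. The byte string stands in for the base of the short name as stored on the diskette; the variable names are made up for illustration.

```python
# Base of the 8.3 name as stored in the directory entry (0x80 between B and D).
on_disk = b"AB\x80DE"

# Hex 80 is octal 200, which is what "ls -b" printed; hex C7 is octal 307.
octal_of_80 = format(0x80, "o")
octal_of_c7 = format(0xC7, "o")

# Both DOS codepages mentioned in this thread put C-cedilla at 0x80.
name_850 = on_disk.decode("cp850")
name_437 = on_disk.decode("cp437")

# The same character sits at 0xC7 in ISO-8859-1.
latin1_byte = "\u00c7".encode("iso-8859-1")

# A different letter gets a different code, e.g. e-acute.
e_acute_850 = "\u00e9".encode("cp850")
e_acute_latin1 = "\u00e9".encode("iso-8859-1")

print(octal_of_80, octal_of_c7, name_850, name_437)
print(latin1_byte.hex(), e_acute_850.hex(), e_acute_latin1.hex())
```

So \200 from "ls -b" is consistent with the raw codepage byte being passed through untranslated, rather than with a damaged name.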

> [...]
> >> 25F0  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00  ................
> >> 2600  E5 41 00 42  00 C7 00 44   00 45 00 0F  00 19 2E 00  .A.B...D.E......
>                          ^^
> 
> The C7 is the proper (unicode, iso-8859-1) representation for Ç, so 
> everything should be ok there...
> 
> 
> >> 2610  54 00 58 00  54 00 00 00   FF FF 00 00  FF FF FF FF  T.X.T...........
> >> 2620  E5 42 80 44  45 20 20 20   54 58 54 20  00 30 03 80  .B.DE   TXT .0..
> >> 2630  BC 34 BC 34  00 00 04 80   BC 34 00 00  00 00 00 00  .4.4.....4......
> >> 2640  41 41 00 42  00 C7 00 44   00 45 00 0F  00 19 2E 00  AA.B...D.E......
> >> 2650  54 00 58 00  54 00 00 00   FF FF 00 00  FF FF FF FF  T.X.T...........
> >> 2660  41 42 80 44  45 20 20 20   54 58 54 20  00 30 03 80  AB.DE   TXT .0..
> >> 2670  BC 34 BC 34  00 00 04 80   BC 34 00 00  00 00 00 00  .4.4.....4......
> >> 2680  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00  ................
> >>
> >> I assume that the "80"s in between the "42"s and the "44"s are my
> >> missing capital C cedillas (both codepages 437 and 850 list the capital
> >> C cedilla as occupying point 80 hex).
> 
> The 80 is indeed very confusing. At first I was confused by this too, as 
> I assumed this to be an "unknown character" placeholder.
> 
> However, after further analysis, I noticed that 0x80 is indeed the 
> correct legacy MS-DOS code for Ç, as surprising as it sounds. (MS-DOS 
> didn't use standard ISO-8859-1, but its own proprietary encoding, as 
> specified in the codepage...)
> 
> If you use a different example than Ç (such as for example é), you see a 
> different code there.
> 
> 
> >> In case it helps, I've left a truncated binary disk image of the
> >> diskette here:
> >> http://www.carbon.eclipse.co.uk/msdosfs.diskImage
> 
> Just tried to do an mdir on it... and indeed, it showed me Ç:
> 
>  > mdir -i msdosfs.diskImage ::
>   Volume in drive : has no label
>   Volume Serial Number is 2C2F-5EDB
> Directory for ::/
> 
> ABÇDE    TXT         0 2006-05-28  16:00  ABÇDE.TXT
>          1 file                    0 bytes
>                            1 457 664 bytes free
> 
> 
> >> Could anyone please tell me whether this is my error, or is it a bug
> >> (possibly in mtools, possibly in the kernel)?
> 
> I suspect the error might be in the terminal program that you are using 
> (which might be set to display UTF-8). Try changing that to ISO-8859-1, 
> a.k.a. ISO Latin-1.

Apart from not knowing (yet) how to do this, wouldn't this affect the
output for other mounted filesystems that _do_ use utf-8?

I read somewhere that short filenames on FAT filesystems (on Windows
systems) are encoded using the local codepage (which, when I created the
file under Windows, was 850). This at least agrees with the hexdump of
my raw diskette (seeing the 80s makes sense to me). What I'm really after
is the ability to view the short filenames on the diskette as they were
typed in under Windows.
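To make the two representations concrete, here is a rough sketch that decodes the live directory entries at offsets 2640 and 2660 of the hexdump quoted above. The field offsets follow the usual description of FAT long-filename entries as I understand it, cp850 is assumed for the short name, and the function names are invented for this example.

```python
# The 32-byte long-name entry (offset 2640) and short-name entry (offset 2660).
lfn = bytes.fromhex(
    "41 41 00 42 00 C7 00 44 00 45 00 0F 00 19 2E 00 "
    "54 00 58 00 54 00 00 00 FF FF 00 00 FF FF FF FF"
)
sfn = bytes.fromhex(
    "41 42 80 44 45 20 20 20 54 58 54 20 00 30 03 80 "
    "BC 34 BC 34 00 00 04 80 BC 34 00 00 00 00 00 00"
)

def long_name(entry):
    # Name characters live in three runs of 16-bit little-endian units.
    raw = entry[1:11] + entry[14:26] + entry[28:32]
    units = [raw[i:i + 2] for i in range(0, len(raw), 2)]
    # The name ends at the first 0x0000 unit; the rest is 0xFFFF padding.
    if b"\x00\x00" in units:
        units = units[:units.index(b"\x00\x00")]
    return b"".join(units).decode("utf-16-le")

def short_name(entry, codepage="cp850"):
    # 8 bytes of base name and 3 bytes of extension, space padded.
    base = entry[0:8].decode(codepage).rstrip(" ")
    ext = entry[8:11].decode(codepage).rstrip(" ")
    return base + "." + ext if ext else base

def lfn_checksum(name11):
    # Rotate-right-and-add checksum over the 11 raw short-name bytes.
    total = 0
    for byte in name11:
        total = (((total & 1) << 7) + (total >> 1) + byte) & 0xFF
    return total

print(long_name(lfn), short_name(sfn), hex(lfn_checksum(sfn[:11])))
```

Both entries come out as the same name, one via 16-bit units (C7 00) and one via the codepage byte (80), and the checksum computed over the short name matches the 19 stored at byte 13 of the long-name entry.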

After doing some more investigation, I found the following statement here:
http://svn.haxx.se/dev/archive-2005-05/0406.shtml

"The POSIX way of making filename encoding locale-dependent is
fundamentally broken IMO. But I don't think each tool can solve a system
problem. On POSIX systems, I think the best solution is to rely on the
locale like we currently do. People should set up their locale correctly
and ensure that filenames are in the encoding of the locale."

I don't fully understand this, but I think it means that POSIX (and
therefore Unix/Linux?) assumes that filenames are stored in a
character-encoding which is "represented" by the user's locale. But then
I have a problem: I only have one locale (at a time) but I have several
mounted filesystems - the majority have unicodish (<<new word?)
filenames (as they're ext3) but I want a filesystem with codepage 850
filenames mounted at the same time (my dos diskette). And there's my
problem. Many different simultaneous filename encoding mechanisms, but
only one locale.
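What seems to be missing is a per-filesystem conversion step. The sketch below only illustrates that idea; it does not describe what mtools or the kernel actually do, and the function name and its parameters are hypothetical.

```python
def to_locale_bytes(name_bytes, fs_encoding, locale_encoding="utf-8"):
    """Re-encode a filename from a filesystem's encoding to the locale's.

    fs_encoding would have to be known per mounted filesystem (cp850 for
    the diskette in this thread); locale_encoding is the single encoding
    the user's locale expects.
    """
    return name_bytes.decode(fs_encoding).encode(locale_encoding)

# The short name as stored on the diskette.
dos_name = b"AB\x80DE.TXT"

# For a UTF-8 locale the cedilla becomes the two bytes C3 87.
for_utf8_locale = to_locale_bytes(dos_name, "cp850")

# For an ISO-8859-1 locale it becomes the single byte C7.
for_latin1_locale = to_locale_bytes(dos_name, "cp850", "iso-8859-1")

print(for_utf8_locale, for_latin1_locale)
```

With one such conversion per filesystem, several on-disk encodings could coexist under a single locale, as long as each filesystem's encoding is declared correctly.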

I'm now beginning to think that mtools really isn't at fault here (and
that it's more a POSIX/Linux limitation). Many thanks, and apologies for
the noise.

Jaime


