Re: [Bug-wget] bad filename
From: Bykov Aleksey
Subject: Re: [Bug-wget] bad filename
Date: Fri, 25 Apr 2014 21:39:55 +0300
Greetings, Andries E. Brouwer
>>- the patch is inside #ifdef WINDOWS ... #endif while the problem
>> occurs on all systems, also on Unix.
Yes, it is.
>> - Presently, 0-31 and 127-159 are considered "control".
Sorry, I prefer converting, at least for uppercase/lowercase conversion (with
towlower() and towupper()). It is sometimes useful, for example when a site
mirrored with Wget is moved between case-sensitive and case-insensitive
filesystems (ext3 and NTFS).
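A minimal sketch of that case conversion, assuming the current locale's
encoding matches the encoding of the filename. The helper name
lowercase_filename is hypothetical and is not taken from the attached patch.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of Wget: return a newly allocated
   lowercased copy of NAME, or NULL if NAME is not valid in the current
   locale's encoding or memory runs out.  The caller frees the result.  */
char *
lowercase_filename (const char *name)
{
  /* Every wide char consumes at least one byte of NAME, so
     strlen (name) + 1 wide chars are always enough.  */
  size_t max_wide = strlen (name) + 1;
  wchar_t *wide = malloc (max_wide * sizeof *wide);
  if (!wide)
    return NULL;

  size_t wlen = mbstowcs (wide, name, max_wide);
  if (wlen == (size_t) -1)
    {
      free (wide);
      return NULL;
    }

  for (size_t i = 0; i < wlen; i++)
    wide[i] = (wchar_t) towlower ((wint_t) wide[i]);

  /* Each wide char needs at most MB_CUR_MAX bytes, plus the final NUL.  */
  size_t max_bytes = wlen * MB_CUR_MAX + 1;
  char *out = malloc (max_bytes);
  if (out && wcstombs (out, wide, max_bytes) == (size_t) -1)
    {
      free (out);
      out = NULL;
    }
  free (wide);
  return out;
}
```

In the default "C" locale this only handles ASCII names; a caller would
normally run setlocale (LC_CTYPE, "") first. Swapping towlower() for
towupper() gives the uppercase variant.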
I reworked the patch so that it has some chance of working on non-Windows
systems. Tested with Cyrillic names on FAT32 and NTFS on a win32 system.
mswindows.diff contains only the Windows-related stuff.
Best regards, Bykov Aleksey
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
From: address@hidden
To: address@hidden
Date: 17:28:10, 04.23.2014
Subject: Re: [Bug-wget] bad filename
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
>>On Wed, Apr 23, 2014 at 04:57:11PM +0300, Bykov Aleksey wrote:
>> > Greetings, Darshit Shah
>> > This was discussed some (or a long) time ago.
>> > Possible logic:
>> > If the locale isn't UTF-8, process as before; otherwise:
>> > 1. Convert the string to a wide-char string with mbstowcs().
>> > 2. For each wide char, check its size with wctomb(). If the size is 1,
>> > compare it with the mask. If the char is restricted, then "quoted++;"
>> > 3. If needed, convert to lower/upper case with towlower()/towupper()
>> > 4. Recreate the string char by char with wctomb(): convert the char to a
>> > temporary buffer. If the char size is 1, compare with the mask and
>> > replace. Else "memcpy(q, char_buffer, char_size); q+=char_size;"
>> > On Windows I can't check it (mbstowcs() doesn't work with UTF-8, so
>> > MultiByteToWideChar() must be used...)
>> > Patch for Windows (unstructured, unclear, unfinished, but working) is
>> > attached.
>> > Best Regards, Bykov Aleksey.
>>
>> Good!
>>
>> However:
>> - the patch is inside #ifdef WINDOWS ... #endif while the problem
>> occurs on all systems, also on Unix.
>> - I think all of this is needlessly complicated. Repeatedly
>> converting filenames is not a good plan if the goal is to
>> keep them unchanged.
>> - UTF-8 has the nice property that the only 7-bit bytes that occur
>> inside a character code are those in the ASCII set. So, no
>> conversion is needed to test the length: every byte in 0-127
>> always represents a full character.
>> - Presently, 0-31 and 127-159 are considered "control". That is
>> wrong on UTF-8 systems, where 128-159 are part of a multibyte character.
>> If one wants to preserve the filename mangling in the 0-31,127 range,
>> but wants to do the mangling to 128-159 only when some option asks
>> for it, then 0-31,127 and 128-159 should have different flags in
>> url.c:static const unsigned char urlchr_table[256], e.g.
>> ..
>> #define D filechr_highcontrol
>> ..
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 128-143 */
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 144-159 */
>> ..
>> #undef D
>>
>> Andries
>>
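A rough sketch of the split suggested above, for illustration only. The name
filechr_highcontrol comes from the quoted suggestion; the flag values, the
function names, and the use of a function instead of a 256-entry table are
assumptions and do not reproduce url.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the real ones in url.c may differ.  */
enum
{
  filechr_control = 1,      /* bytes 0-31 and 127 */
  filechr_highcontrol = 2   /* bytes 128-159, per the quoted proposal */
};

/* Classify one byte.  A real patch would put these flags in a table.  */
static int
filechr_flags (unsigned char c)
{
  if (c < 32 || c == 127)
    return filechr_control;
  if (c >= 128 && c <= 159)
    return filechr_highcontrol;
  return 0;
}

/* True if C belongs to one of the classes selected by MASK.  */
static bool
needs_escape (unsigned char c, int mask)
{
  return (filechr_flags (c) & mask) != 0;
}

/* Count the bytes of NAME that would be escaped under MASK.  No decoding
   is needed: in UTF-8 every byte below 128 is a complete character.  */
static size_t
count_escaped (const char *name, int mask)
{
  size_t n = 0;
  for (const unsigned char *p = (const unsigned char *) name; *p; p++)
    if (needs_escape (*p, mask))
      n++;
  return n;
}
```

With mask set to filechr_control only, the Cyrillic letter U+0410 (bytes
0xD0 0x90) passes through untouched; adding filechr_highcontrol to the mask
restores the old behaviour of escaping its second byte.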
mswindows.diff
Description: Binary data
url.diff
Description: Binary data