bug-global

Re: Binary recognition is to narrow [new suggestion]


From: Shigio YAMAGUCHI
Subject: Re: Binary recognition is to narrow [new suggestion]
Date: Sat, 21 Nov 2009 15:42:11 +0900

> Instead of counting characters over 127 the only test is that the first
> 511 bytes don't contain any of the control characters 0-8, 14-31. No
> normal textfile would contain these.
> 
> Assuming that binary data is random the probability of an incorrectly
> tagged binary would be
> 
> ((256-8-18)/256)^511=.00000000000000000000000170726
> 
> just testing 127 bytes would be a bit too little
> 
> ((256-8-18)/256)^127=.00000123868

This is a very interesting idea.
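
To make the proposal concrete, here is a minimal sketch of the test and of the probability estimate quoted above. It is not the gtags implementation; the names looks_binary and miss_probability are made up for illustration. Note that the range 0-8 covers nine byte values, so the sketch flags 27 values in total, while the quoted formula subtracts 8 + 18 = 26. The probability is vanishingly small either way.

```python
# Byte values the proposal treats as evidence of binary data: 0-8 and 14-31.
# TAB, LF, VT, FF and CR (9-13) are deliberately left out.
CONTROL_BYTES = frozenset(range(0, 9)) | frozenset(range(14, 32))


def looks_binary(data: bytes, size: int = 512) -> bool:
    """Return True if a flagged control byte occurs in the first `size` bytes."""
    return any(b in CONTROL_BYTES for b in data[:size])


def miss_probability(size: int, flagged: int = len(CONTROL_BYTES)) -> float:
    """Chance that `size` uniformly random bytes contain no flagged value."""
    return ((256 - flagged) / 256) ** size


if __name__ == "__main__":
    print(looks_binary(b"int main(void) { return 0; }\n"))  # plain C source
    print(looks_binary(b"\x7fELF\x02\x01\x01\x00"))          # start of an ELF header
    print(miss_probability(511, flagged=26))  # the quoted formula
    print(miss_probability(127, flagged=26))  # the quoted shorter prefix
    print(miss_probability(511))              # same prefix with all 27 values
```

The assumption of uniformly random bytes is only a rough model of real binary files, so these figures are best read as an order-of-magnitude argument.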

> One of the benefits is that this will correctly tag files in Unicode as
> text as well, since those control characters never appear in Unicode
> either.

This is a big advantage.
Most other multi-byte character sets are surely designed that way too.
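
This expectation can be checked per encoding with a small experiment. The sketch below is only an illustration: the sample string, the list of encodings, and the helper name flagged_bytes are my own choices, not part of the proposal.

```python
# Byte values the proposed test flags: 0-8 and 14-31.
CONTROL_BYTES = frozenset(range(0, 9)) | frozenset(range(14, 32))


def flagged_bytes(text, encoding):
    """Return the sorted flagged byte values present in `text` once encoded."""
    return sorted(set(text.encode(encoding)) & CONTROL_BYTES)


# Three kanji (written as escapes) followed by ASCII text and a newline.
SAMPLE = "\u65e5\u672c\u8a9e text\n"

if __name__ == "__main__":
    for enc in ("utf-8", "euc_jp", "shift_jis", "utf-16-le", "iso2022_jp"):
        print(enc, flagged_bytes(SAMPLE, enc))
```

On this sample, UTF-8, EUC-JP and Shift_JIS produce no flagged bytes. UTF-16 stores ASCII characters with a NUL byte, and ISO-2022-JP switches character sets with ESC (27), so files in those encodings would probably need separate handling.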

I would also like to make the size (512) a configurable variable.

$ gtags                         ... use conventional test

[File gtags.conf]
+----------------------------
|...
|       :binarytest_size=512:...  ----------------------------------+
|                                                                   |
                                                                    v
$ gtags                         ... use new test using the first n=512 bytes
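
The switch pictured above could be sketched as follows. This is a rough illustration in Python, not gtags code: binarytest_size is only the variable name proposed in this mail, and get_conf_number and choose_test are hypothetical helpers. The sketch assumes the usual colon-separated :name=value: fields of gtags.conf.

```python
from typing import Optional, Tuple


def get_conf_number(record: str, key: str) -> Optional[int]:
    """Return the integer in a `:key=value:` field of the record, or None."""
    for field in record.split(":"):
        name, sep, value = field.strip().partition("=")
        if sep and name == key and value.isdigit():
            return int(value)
    return None


def choose_test(record: str) -> Tuple[str, Optional[int]]:
    """Pick the binary test: the new one only when binarytest_size is set."""
    size = get_conf_number(record, "binarytest_size")
    if size is None:
        return ("conventional", None)
    return ("control-bytes", size)


if __name__ == "__main__":
    without = "default:\\\n\t:skip=GPATH:"
    with_size = "default:\\\n\t:binarytest_size=512:"
    print(choose_test(without))
    print(choose_test(with_size))
```

Keeping the conventional test as the default when the variable is absent matches the diagram: existing configurations behave as before, and the new test is opt-in.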

After testing for a while, we can decide what we should do.
Thank you for your valuable suggestion.
--
Shigio YAMAGUCHI <address@hidden>
PGP fingerprint: D1CB 0B89 B346 4AB6 5663  C4B6 3CA5 BBB3 57BE DDA3



