
Re: Binary recognition is to narrow [new suggestion]


From: Hideki IWAMOTO
Subject: Re: Binary recognition is to narrow [new suggestion]
Date: Sat, 21 Nov 2009 19:21:25 +0900

Hi.

>                 if (c <= 8)
>                         return 1;
>                 if (c >= 14 && c < 32)
>                         return 1;

You had better use a table lookup, as in the attached patch.
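
The attached patch is not reproduced in this message. As a rough illustration of the table-lookup idea only, here is a minimal sketch; the names binary_char and has_binary_char are hypothetical and are not taken from the patch or from the GLOBAL sources. It marks the same byte values as the quoted range tests (0-8 and 14-31) and scans the buffer by length, so a NUL byte is caught like any other control character.

```c
#include <stddef.h>

/*
 * Hypothetical lookup table: a 1 marks a byte value treated as binary.
 * Bytes 0-8 and 14-31 are marked; 9-13 (tab, LF, VT, FF, CR) and
 * everything from 32 upwards are left at 0.
 */
static const unsigned char binary_char[256] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,   /*  0-15 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1    /* 16-31 */
        /* remaining entries are implicitly zero */
};

/* Return 1 if any of the first size bytes is marked in the table. */
static int
has_binary_char(const unsigned char *buf, size_t size)
{
        size_t i;

        for (i = 0; i < size; i++)
                if (binary_char[buf[i]])
                        return 1;
        return 0;
}
```

One table access per byte replaces the two range comparisons, and the set of rejected bytes can be changed by editing the table alone.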

On Fri, 20 Nov 2009 17:33:01 +0100 (CET), Erik Jonsson wrote...
> Hi again,
> 
> I have done some more testing and calculations now. The probability that a
> binary file will pass as a text file is quite high if one only tests the
> first 32 bytes. I have therefore tested the performance if one were to
> use the first 512 bytes. What I found was that the performance hit was
> minimal, while there are several benefits.
> 
> Instead of counting characters over 127, the only test is that the first
> 511 bytes don't contain any of the control characters 0-8 or 14-31. No
> normal text file would contain these.
> 
> Assuming that binary data is random, the probability of an incorrectly
> tagged binary would be
> 
> ((256-8-18)/256)^511=.00000000000000000000000170726
> 
> Testing just 127 bytes would be a bit too little:
> 
> ((256-8-18)/256)^127=.00000123868
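
The two figures above can be rechecked with a few lines of C. This is only a sketch under the same assumption as the quoted text (every byte is independent and uniformly random); pass_probability is a hypothetical helper name. Note that the range 0-8 covers nine byte values, so together with 14-31 the test rejects 27 values; the quoted figures correspond to excluded = 26, and with excluded = 27 the result comes out smaller.

```c
/*
 * Probability that nbytes uniformly random bytes all avoid
 * `excluded` of the 256 possible byte values.
 * Plain repeated multiplication, so no math library is needed.
 */
static double
pass_probability(int excluded, int nbytes)
{
        double keep = (256.0 - excluded) / 256.0;
        double p = 1.0;
        int i;

        for (i = 0; i < nbytes; i++)
                p *= keep;
        return p;
}
```

For example, pass_probability(26, 127) and pass_probability(26, 511) correspond to the two expressions quoted above.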
> 
> One of the benefits is that this will correctly tag Unicode files as
> text as well, since those control characters never appear in Unicode
> either.
> 
> The performance hit seems minimal on my computer.
> 
> 511 byte version
> address@hidden:~/source/dps/src$ time ~/install/global-5.7.6/gtags/gtags
> real    0m34.425s
> user    0m8.337s
> sys     0m3.080s
> 
> 32 byte version
> address@hidden:~/source/dps/src$ time gtags
> real    0m32.120s
> user    0m8.361s
> sys     0m2.820s
> 
> 
> I have tried to clear the cache as well as possible between the runs.
> 
> Here is the 511 byte is_binary that I'm using.
> 
> static int
> is_binary(const char *path)
> {
>         int ip;
>         char buf[512];
>         char *cp;
>         int i, c, size;
> 
>         ip = open(path, O_RDONLY);
>         if (ip < 0)
>                 die("cannot open file '%s' in read mode.", path);
>         size = read(ip, buf, sizeof(buf)-1);
>         close(ip);
> 
>         if (size <= 0)
>                 return 1;
> 
>         buf[size] = 0; /* terminate the data; only safe once size > 0 */
>         if (size >= 7 && locatestring(buf, "!<arch>", MATCH_AT_FIRST))
>                 return 1;
>         cp = buf;
>         while ((c = (unsigned char) *cp)) {
>                 if (c <= 8)
>                         return 1;
>                 if (c >= 14 && c < 32)
>                         return 1;
>                 cp++;
>         }
> 
>         return cp != buf+size;
> }
> 
> Feel free to use the code as you like.
> 
> /Erik J.
> 
> _______________________________________________
> Bug-global mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/bug-global

----
Hideki IWAMOTO  address@hidden

Attachment: 20091121-binarychar.patch
Description: Binary data

