bug-global

Binary recognition is too narrow [new suggestion]


From: Erik Jonsson
Subject: Binary recognition is too narrow [new suggestion]
Date: Fri, 20 Nov 2009 17:33:01 +0100 (CET)
User-agent: SquirrelMail/1.4.9a

Hi again,

I have done some more testing and calculations now. The probability that a
binary file will pass as a text file is quite high if one only tests the
first 32 bytes. I have therefore tested the performance if one were to
use a 512-byte buffer instead. I found that the performance hit is
minimal, while there are several benefits.

Instead of counting characters over 127, the only test is that the first
511 bytes don't contain any of the control characters 0-8 or 14-31. No
normal text file would contain these.
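
The byte rule described above can be sketched as a small predicate over an in-memory buffer. This is only an illustration under my own assumptions: the names `is_forbidden_control` and `looks_binary` are hypothetical and are not part of GLOBAL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero for byte values 0-8 and 14-31.
 * Tab (9), LF (10), VT (11), FF (12) and CR (13) are allowed. */
static int
is_forbidden_control(unsigned char c)
{
        return c <= 8 || (c >= 14 && c <= 31);
}

/* Hypothetical helper: nonzero if any of the first `len` bytes of
 * `buf` is a forbidden control character, zero otherwise. */
static int
looks_binary(const unsigned char *buf, size_t len)
{
        size_t i;

        assert(buf != NULL || len == 0);
        for (i = 0; i < len; i++)
                if (is_forbidden_control(buf[i]))
                        return 1;
        return 0;
}
```

Passing the length explicitly, rather than relying on a terminating NUL, lets a NUL byte inside the data be treated like any other forbidden value.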

Assuming that binary data is random, the probability of an incorrectly
tagged binary would be (0-8 is nine values, 14-31 is eighteen)

((256-9-18)/256)^511 = approx. 1.8e-25

Testing just 127 bytes would be a bit too little:

((256-9-18)/256)^127 = approx. 7.1e-7
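
The estimate above can be re-derived numerically. The following is a minimal sketch, assuming uniformly random and independent bytes as the text does; `pass_probability` is a hypothetical name, and the number of forbidden byte values is a parameter so that different counts can be compared.

```c
#include <assert.h>

/* Hypothetical helper: probability that `n` independent, uniformly
 * random bytes all avoid `forbidden` of the 256 possible values. */
static double
pass_probability(int forbidden, int n)
{
        double p = 1.0;
        int i;

        assert(forbidden >= 0 && forbidden <= 256 && n >= 0);
        /* Multiply the per-byte survival chance n times. */
        for (i = 0; i < n; i++)
                p *= (256.0 - forbidden) / 256.0;
        return p;
}
```

For example, `pass_probability(27, 511)` covers the ranges 0-8 and 14-31 over 511 bytes, and `pass_probability(27, 127)` gives the figure for a 127-byte prefix. Real binaries are not uniformly random, so these are rough guides only.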

One of the benefits is that this will also correctly tag Unicode files as
text, since those control characters never appear in UTF-8 encoded text
either.
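
To illustrate the Unicode point, here is a small sketch that counts forbidden bytes in two encodings of the same two-character text. It assumes the claim is about UTF-8, where every byte of a multi-byte sequence is 128 or higher; UTF-16 is shown for contrast because its code units for ASCII-range characters contain 0x00 bytes. The names `count_forbidden`, `sample_utf8` and `sample_utf16le` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count bytes in the ranges 0-8 and 14-31. */
static size_t
count_forbidden(const unsigned char *buf, size_t len)
{
        size_t i, n = 0;

        assert(buf != NULL || len == 0);
        for (i = 0; i < len; i++)
                if (buf[i] <= 8 || (buf[i] >= 14 && buf[i] <= 31))
                        n++;
        return n;
}

/* 'P' followed by U+00E5, encoded as UTF-8 (0xC3 0xA5). */
static const unsigned char sample_utf8[] = { 0x50, 0xC3, 0xA5 };

/* The same two characters as UTF-16LE: each code unit has a 0x00
 * high byte, which falls in the forbidden range. */
static const unsigned char sample_utf16le[] = { 0x50, 0x00, 0xE5, 0x00 };
```

With these samples, the UTF-8 bytes contain no forbidden values, while the UTF-16LE bytes contain two, so a UTF-16 file would still be tagged as binary by this test.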

The performance hit seems minimal on my computer.

511 byte version
address@hidden:~/source/dps/src$ time ~/install/global-5.7.6/gtags/gtags
real    0m34.425s
user    0m8.337s
sys     0m3.080s

32 byte version
address@hidden:~/source/dps/src$ time gtags
real    0m32.120s
user    0m8.361s
sys     0m2.820s


I have tried to clear the cache as well as possible between the runs.

Here is the 511-byte is_binary that I'm using.

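static int
is_binary(const char *path)
{
        int ip;
        char buf[512];
        char *cp;
        int c, size;

        ip = open(path, O_RDONLY);
        if (ip < 0)
                die("cannot open file '%s' in read mode.", path);
        /* Read at most 511 bytes, leaving room for the terminator. */
        size = read(ip, buf, sizeof(buf) - 1);
        close(ip);

        /* Empty or unreadable files are treated as binary. Check this
         * before indexing buf, since read() returns -1 on error. */
        if (size <= 0)
                return 1;
        buf[size] = 0;  /* terminate the data */

        /* ar archives begin with printable text but are binary. */
        if (size >= 7 && locatestring(buf, "!<arch>", MATCH_AT_FIRST))
                return 1;
        cp = buf;
        while ((c = (unsigned char) *cp)) {
                if (c <= 8)
                        return 1;
                if (c >= 14 && c < 32)
                        return 1;
                cp++;
        }

        /* Stopping before the end means an embedded NUL: binary. */
        return cp != buf + size;
}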

Feel free to use the code as you like.

/Erik J.






