
From: Antonio Diaz Diaz
Subject: Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip
Date: Thu, 18 May 2017 01:45:25 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14

Hello Damir,

Damir wrote:
> Have you considered choosing a different polynomial for CRC32 calculation
> in the lzip file format?

Yes, but I found no compelling reason to change.


> Some recent CPUs (x86_64 SSE4.2, PowerPC ISA 2.07, ARM v8.1) offer
> hardware accelerated calculation of CRC32 with a different polynomial
> (crc32c) than used in lzip (ethernet crc32).

Maybe hardware-accelerated calculation of the Ethernet CRC32 also exists. After all, it is the same polynomial used by gzip and zlib.
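
For reference, this is roughly what that shared table-driven Ethernet CRC32 looks like (a minimal sketch using the reflected polynomial 0xEDB88320; the class name and interface are illustrative, not lzip's actual code):

#include <cstdint>
#include <cstddef>

struct EthernetCrc32 {                  // illustrative name, not lzip's class
  uint32_t table[256];
  EthernetCrc32() {
    for (unsigned n = 0; n < 256; ++n) {   // build the byte-at-a-time table
      uint32_t c = n;
      for (int k = 0; k < 8; ++k)
        c = (c & 1) ? 0xEDB88320U ^ (c >> 1) : c >> 1;
      table[n] = c;
    }
  }
  uint32_t crc(const uint8_t* buf, size_t len) const {
    uint32_t c = 0xFFFFFFFFU;              // standard pre-inversion
    for (size_t i = 0; i < len; ++i)
      c = table[(c ^ buf[i]) & 0xFF] ^ (c >> 8);
    return ~c;                             // standard post-inversion
  }
};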


> So, picking crc32c poly instead has two benefits:
> 1) hardware accelerated integrity checking

Hardware acceleration of CRC calculation makes sense for storage devices because the data is just moved, not processed; calculating the CRC is the only computation involved.
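
For illustration, the kind of hardware CRC32C loop being proposed looks roughly like this on x86_64 (a sketch assuming SSE4.2 and the -msse4.2 compiler flag; note that it computes the Castagnoli polynomial, not the one lzip uses):

#include <cstdint>
#include <cstddef>
#include <cstring>
#include <nmmintrin.h>   // SSE4.2 intrinsics

uint32_t crc32c(const uint8_t* buf, size_t len) {
  uint32_t crc = 0xFFFFFFFFU;              // standard CRC32C pre-inversion
  while (len >= 8) {                       // one instruction per 8 bytes
    uint64_t word;
    std::memcpy(&word, buf, 8);
    crc = (uint32_t)_mm_crc32_u64(crc, word);
    buf += 8; len -= 8;
  }
  while (len > 0) {                        // finish the tail byte by byte
    crc = _mm_crc32_u8(crc, *buf++);
    --len;
  }
  return ~crc;                             // standard post-inversion
}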

But calculating the CRC is just a small part of the total decompression time. So, even if you accelerate it, the total speed gain is small (probably smaller than 5%). For compression the speed gain is even smaller.
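
A back-of-the-envelope check, assuming (hypothetically) that the CRC accounts for 5% of decompression time and that hardware makes that part ten times faster:

  overall speedup = 1 / ((1 - 0.05) + 0.05 / 10) = 1 / 0.955 ≈ 1.047

That is, less than a 5% gain even under generous assumptions.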


> 2) better protection against undetected errors

You will need to prove this one.

CRC32C has a slightly larger Hamming distance (the minimum number of bit errors that can go undetected) than the Ethernet CRC32 for "small" packet sizes (see pages 3-4 of [1]). But beyond some size, perhaps not much larger than 128 KiB, both have the same HD of 2. For files larger than that (uncompressed) size, there is little difference between the two CRCs.

[1] http://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf

Even more important, we are talking about the interaction between compression and integrity checking. The difference between a Hamming distance of 2 and one of 3 is probably immaterial here. Maybe you would like to read section 2.10 of [2]. I quote:

"Verification of data integrity in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are verified (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce any random number of errors in the decompressed data, or even modify the size of the decompressed data."

[2] http://www.nongnu.org/lzip/xz_inadequate.html
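
To see that error multiplication in practice, here is a small self-contained sketch. It uses zlib's raw deflate instead of lzip's LZMA stream, purely because the API is compact; the effect it shows is the same in kind. It flips a single bit in the compressed data and then reports how much of the decompressed output differs (decompression may also simply fail partway):

#include <cstdio>
#include <vector>
#include <zlib.h>

// Raw deflate (windowBits = -15): no container header and no checksum, so
// nothing in the format itself catches the corruption.
static std::vector<unsigned char> raw_deflate(std::vector<unsigned char>& in) {
  z_stream s = {};
  deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8,
               Z_DEFAULT_STRATEGY);
  std::vector<unsigned char> out(deflateBound(&s, in.size()));
  s.next_in = in.data(); s.avail_in = in.size();
  s.next_out = out.data(); s.avail_out = out.size();
  deflate(&s, Z_FINISH);
  out.resize(s.total_out);
  deflateEnd(&s);
  return out;
}

int main() {
  std::vector<unsigned char> original(1 << 16);       // 64 KiB of test data
  for (size_t i = 0; i < original.size(); ++i)
    original[i] = (unsigned char)(i % 251);           // compressible pattern

  std::vector<unsigned char> compressed = raw_deflate(original);
  compressed[compressed.size() / 2] ^= 0x01;          // flip one single bit

  std::vector<unsigned char> out(2 * original.size());
  z_stream s = {};
  inflateInit2(&s, -15);
  s.next_in = compressed.data(); s.avail_in = compressed.size();
  s.next_out = out.data(); s.avail_out = out.size();
  int rc = inflate(&s, Z_FINISH);                     // may stop with an error
  out.resize(s.total_out);
  inflateEnd(&s);

  size_t diffs = 0;
  const size_t n = out.size() < original.size() ? out.size() : original.size();
  for (size_t i = 0; i < n; ++i)
    if (out[i] != original[i]) ++diffs;
  diffs += original.size() > out.size() ? original.size() - out.size()
                                        : out.size() - original.size();
  std::printf("inflate returned %d; %zu of %zu bytes differ or are missing\n",
              rc, diffs, original.size());
}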


> The downside is the compatibility problem, but changing the version byte in
> the file header can help with that.

This is a very large downside, most probably to gain almost nothing. IMO, one of the big problems in today's software development is that too many people are willing to complicate the code without the slightest proof that the proposed change is indeed an improvement.


Best regards,
Antonio.


