lzip-bug
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip


From: Antonio Diaz Diaz
Subject: Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip
Date: Fri, 19 May 2017 19:53:22 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14

Damir wrote:
Basically, for x86, hardware version is ~10 times faster than Sarwate and 3
times faster than Slice-By-8.

Thanks for the link. This gives a total speed gain smaller than 5% compared to slice-by-8. And it only works on some CPUs.


Also, at a size of 4294967263 bits (537 MB), ethernet CRC32 has been
confirmed to feature the longest dataword at HD 3 for all possible CRC32
polynomials. Twice that of CRC32C.

That's a choice between better protection of smaller files (90Kbit to
2Gbit) and better protection of larger files (2Gbit to 4 Gbit).
If someone cares about safety, they can limit member size to 2Gbit,
and get much better protection with CRC32c at a price of slightly
less compression rate, which is not possible with Ethernet poly
(member size of 92kbit is nothing).

"much better protection"? 'HD = 3' does not mean that all 3 bit errors go undetected. Just that a small fraction of them aren't detected.

CRC32C is just marginally better than ethernet CRC32 in the range from 92 kbit to 2 Gbit in the sense that it catches all the errors consisting of exactly 3 bits flipped *in the decompressed data* spread more than 32 bits apart, instead of catching almost all such errors. This is pretty immaterial because such errors are very rare in the first place.

You also forget that CRC is just one of the several safety factors provided by the lzip format. And that a larger file increases the probability of corruption.

Koopman himself is careful in his affirmations about what CRC is better. He states at 'Best CRC Polynomials'[1]:

"IMPORTANT NOTE: These are "BEST" polynomials under an assumption of a low, constant random independent BERsuch as you'd find in communication networks. If you have a BER that is higher than, say, 1 bit in 100,000, or is non-constant, or is non-random/non-independent, then you need to understand more before using these polynomials."

[1] http://users.ece.cmu.edu/~koopman/crc/index.html


In my experience with LZMA compression, there are two kinds of errors
not caught by the decoder:
1) Without error multiplication. The error affects a single byte and is
guaranteed to be detected by both CRCs (ethernet and CRC32C).
2) With error multiplication. The total number of bits flipped is larger
than 4, which both CRCs can detect about equally well.

That sounds reasonable and calls for CRC64 (which is enabled by default in
xz, but is done terribly wrong). Also, CRC64 is much less supported in
hardware than CRC32c.

CRC64 is much less studied than CRC32, and causes twice as many false positives as CRC32.


I have performed many millions of trial decompressions and never found
any indication that the current CRC is not perfectly adequate.

It would be interesting to see undetected error rate for such trials, and
compare it with crc32c.

You can find a preliminary result in section 2.10.4 of [2]. I have made more tests since then, but I have never found a single error undetected by lzip's CRC32. As of now the undetected error rate remains at 0.

"More precisely, 36.4 million trial decompressions on files ranging from 1 kB to 217 MB of compressed size resulted in 27 errors undetected by the decoder (all of them detected by the CRC)."

[2] http://www.nongnu.org/lzip/xz_inadequate.html


$ lzip -t v2.lz
    v2.lz: Version 2 member format not supported.

Ah, that's a shame and calls for .lz2 file extension. :)

How is a foo_v1 tool supposed to know how to decode a foo_v2 file? Maybe v2 uses a different algorithm, not only a different CRC.


Maybe you're right, and it's better for safety-related reasons to just use
a homegrown container format with safety-oriented CRC instead of using
existing container format like LZIP, since it really has a burden
of forward compatibility.

But it is not yet proven that CRC32C is safer than ethernet CRC32 in a compression context. Because of the error multiplication caused by decompression, the error model seen by the CRC is more one of unconstrained random data corruption than one of low, constant random independent bit error rate.


Best regards,
Antonio.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]