lzip-bug

Re: Re: LZSS is not LZ77


From: wrotycz
Subject: Re: Re: LZSS is not LZ77
Date: Wed, 11 Oct 2023 15:02:41 +0200
User-agent: GWP-Draft

> I have never seen neither the papers from Lempel and Ziv nor the paper from Storer and Szymanski,


How can you even say that, then quote wikipedia, and know that what it says is true?

> LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream.

That is a very general idea, and in general it is true, but the very next sentence is untrue:

> A match is encoded by a pair of numbers called a length-distance pair

Half true, but half a truth is not the whole truth; here it amounts to not true at all. To dig out the essence, one has to analyze every single piece of it.

The LZ77 scheme is not based on a 'length-distance pair'. Such a pair can only encode a reference to the past; and how is the literal data itself encoded? According to this 'definition', it is not.
One cannot use wikipedia as a source of information. Any information, in fact. There is one thing they do well, though: they give references for their claims. And there is the original Lempel and Ziv paper [1]. It is only 7 pages long, it gives no concept of a literal, and the only possible encoding is as triplets: `(distance, length, character)'.
And the pseudocode you quoted makes this explicit in line 14:
```
output (d, l, c)
```

So let me give an example of coding with LZ77:
Given the string below, which is 31*8 = 248 bits long, let us encode it with the LZ77 scheme according to Lempel and Ziv's paper and wikipedia itself, as (distance, length, character) == (d,l,c) triplets, where the character is the literal that follows the copied match:

Str:  "Alice has cat and cat has Alice"
LZ77: (0,0,A)(0,0,l)(0,0,i)(0,0,c)(0,0,e)(0,0, )(0,0,h)(0,0,a)(0,0,s)(0,0, )(0,0,c)(0,0,a)(0,0,t)(0,0, )(0,0,a)(0,0,n)(0,0,d)(8,5,h)(16,3,A)(26,3,e)

If distance is 5 bits, length 3 bits, and the character 8 bits, then 20*(5+3+8) = 320 bits. Not much of an improvement over 248 bits, is it?
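The parse can be checked mechanically. Below is a toy LZ77 triple encoder/decoder (my own sketch, not code from the paper); the greedy longest-match search and the MIN_MATCH threshold of 2 are simplifying assumptions, and note that because every triple must end in a literal, a mechanical parse reserves the last byte of each run for the trailing character:

```python
# Toy LZ77: each token is a triple (d, l, c) meaning "copy l bytes from
# d bytes back, then emit literal c". MIN_MATCH = 2 is my simplification:
# anything shorter is emitted as a pure-literal triple (0, 0, c).

MIN_MATCH = 2

def lz77_encode(s):
    tokens, i, n = [], 0, len(s)
    while i < n:
        best_d = best_l = 0
        for j in range(i):  # try every earlier start, keep the longest match
            l = 0
            # reserve the final input byte: every triple must end in a literal
            while i + l < n - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        if best_l >= MIN_MATCH:
            tokens.append((best_d, best_l, s[i + best_l]))
            i += best_l + 1
        else:
            tokens.append((0, 0, s[i]))
            i += 1
    return tokens

def lz77_decode(tokens):
    out = []
    for d, l, c in tokens:
        for _ in range(l):
            out.append(out[-d])  # byte-by-byte copy, so overlaps work
        out.append(c)
    return "".join(out)

s = "Alice has cat and cat has Alice"
t = lz77_encode(s)
assert lz77_decode(t) == s
print(len(t))  # 20 triples, i.e. 20*(5+3+8) = 320 bits
```

With 5-bit distances, 3-bit lengths and 8-bit characters this gives the same 20 * 16 = 320 bits as the hand count.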

Now let me give you an example of coding using the Storer-Szymanski method described in their paper [2]. The paper is much longer, but I will save you time and take the example from page 3 (page 930 of the journal), applied to the string used before:

Str:   "Alice has cat and cat has Alice"
LZSS: Alice has cat and(8,5)(16,4)(26,5)

Again, the raw string is 31*8 = 248 bits. If distance is 5 bits, length 3 bits, and the literal/match flag 1 bit, then 17*(1+8) + 3*(1+5+3) = 180 bits.
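The same toy approach works for LZSS (again my own sketch, not the paper's code): a token is either a bare literal byte or a bare (distance, length) pair, with no literal glued to a match. MIN_MATCH = 2 plays the role of the break-even point here, since a 9-bit match must replace at least two 9-bit literals to pay off:

```python
# Toy LZSS: tokens are either a literal character or a (distance, length) pair.

MIN_MATCH = 2  # break-even: a 9-bit match must cover >= 2 literals (18 bits)

def lzss_encode(s):
    tokens, i, n = [], 0, len(s)
    while i < n:
        best_d = best_l = 0
        for j in range(i):  # greedy longest-match search, as before
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        if best_l >= MIN_MATCH:
            tokens.append((best_d, best_l))   # match: no trailing literal
            i += best_l
        else:
            tokens.append(s[i])               # plain literal
            i += 1
    return tokens

def lzss_cost_bits(tokens):
    # 1 flag + 8 bits per literal; 1 flag + 5-bit distance + 3-bit length per match
    return sum(9 if isinstance(t, str) else 1 + 5 + 3 for t in tokens)

s = "Alice has cat and cat has Alice"
t = lzss_encode(s)
print(lzss_cost_bits(t))  # 17*9 + 3*9 = 180 bits
```

The encoder reproduces the stream above: 17 literals followed by (8,5)(16,4)(26,5), for a total of 180 bits against 248 raw.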

Subtle difference.

As you can see, wikipedia's statement
> The main difference between LZ77 and LZSS is that in LZ77 the dictionary reference could actually be longer than the string it was replacing.
although true, does not show the difference between the methods, but the result of that difference. An effect, not a cause. The cause is explained in the next sentences:
> In LZSS, such references are omitted if the length is less than the "break even" point. Furthermore, LZSS uses one-bit flags to indicate whether the next chunk of data is a literal (byte) or a reference to an offset/length pair.

Which explain the real difference between the methods.
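The one-bit flags in the quoted sentence can be made concrete by packing the example's LZSS stream into an actual bit string. The flag polarity and field widths below are my own choices for this sketch; a real coder would also bias the length field by the minimum match length to widen its range:

```python
# Pack the example stream: '0' + 8-bit char for a literal,
# '1' + 5-bit distance + 3-bit length for a match.
# The three matches follow the seventeen literals, as in the example.

literals = "Alice has cat and"           # the 17 literals
matches  = [(8, 5), (16, 4), (26, 5)]    # the 3 (distance, length) pairs

bits = ""
for c in literals:
    bits += "0" + format(ord(c), "08b")
for d, l in matches:
    bits += "1" + format(d, "05b") + format(l, "03b")

print(len(bits))  # 17*9 + 3*9 = 180 bits, against 31*8 = 248 for the raw string
```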
LZ77 demonstrates the general idea that back-referencing strings can be a method of compression, but on its own it is practically inapplicable; only LZSS made the scheme a practical, useful tool.
If I prove that an electrically heated wire can be a source of light, am I suddenly the inventor of the light bulb? Far from it. Until Storer and Szymanski solved the problem of inefficient encoding, LZ77 was practically impractical. In fact, their invention of the literal, and of distinguishing it from a match with a single flag bit, was the launch pad for the method.

LZMA is LZSS + Markov chains + arithmetic coding. In comparison, Deflate is LZSS + prefix coding (Huffman-derived, but rarely actual Huffman).
As hopefully demonstrated, if not by me then by prof. Bird, whom I quoted before and who would be an authority here [3][4], LZMA uses LZSS just like the others: "PKZip, ARJ, RAR, ZOO, LHarc use LZSS rather than LZ77 as the primary compression algorithm; the encoding of literal characters and of length-distance pairs varies".


PS.
One cannot treat wikipedia as a source of any kind. There are good articles among them, but the majority are written by laymen and amateurs. Specialists do not have time to correct amateur, not to say amateurish, articles in popular media. Therefore, going to the references is a must.

----

1. A Universal Algorithm for Sequential Data Compression; Ziv, Lempel; IEEE Transactions on Information Theory, Volume 23, Issue 3, pp. 337-343; May 1977
https://scholar.archive.org/work/gfdvh3vf5rb55a62mgsrvory2i

2. Data Compression via Textual Substitution; Storer, Szymanski; Journal of the ACM, Volume 29, Issue 4, pp. 928-951; Oct 1982
https://doi.org/10.1145/322344.322346
https://dl.acm.org/doi/10.1145/322344.322346
https://dl.acm.org/doi/pdf/10.1145/322344.322346

3. LZ schemes
https://www.youtube.com/watch?v=VDXBnmr8AY0&t=16m16s

4. LZ schemes (slide)
https://encode.su/attachment.php?attachmentid=10843&d=1697027101

