Uncompressed mbox format storage
From: wrotycz
Subject: Uncompressed mbox format storage
Date: Wed, 11 Dec 2024 21:54:44 +0100
User-agent: GWP-Draft
Just wanted to point out that for the search option it does not matter whether these files are compressed or not.
Here is a test on rather old hardware with an HDD, where the search greps for the query 'matrix', together with the corresponding execution times.
Yes, the hardware is old, but that is deliberate, to show the difference, or lack of it, for the given task.
Also, `grep -i' was used, as that is what the search does: it clearly ignores letter case.
Tools used are grep, zgrep = zutils/zgrep, gzgrep = gzip/zgrep.
~~~
## unbuffered $ grep -i
$ sync; echo 3 > /proc/sys/vm/drop_caches
$ time grep -i matrix bug-gnuzilla/* > /dev/null
real 0m8.178s
user 0m6.477s
sys 0m0.250s
$ sync; echo 3 > /proc/sys/vm/drop_caches
$ time ./zgrep -i matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m8.882s
user 0m7.722s
sys 0m1.140s
$ sync; echo 3 > /proc/sys/vm/drop_caches
$ time ./gzgrep -i matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m9.196s
user 0m8.021s
sys 0m1.372s
## buffered $ grep -i
$ time grep -i matrix bug-gnuzilla/* > /dev/null
real 0m8.077s
user 0m6.480s
sys 0m0.056s
$ time ./zgrep -i matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m8.151s
user 0m7.692s
sys 0m0.883s
$ time ./gzgrep -i matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m8.477s
user 0m7.964s
sys 0m1.083s
~~~
As can be seen, with grep/zgrep -i there is practically no difference between uncompressed and gz-compressed searches.
That is because decompression is faster than the search itself, so the decompressed data just waits in the pipe anyway.
On first use (unbuffered, nothing cached yet), compressed files can even be beneficial: there is less data to read from the slow disk and decompression is many times faster than the read, so the read bottleneck is eased by having less (compressed) data to read.
With an SSD the results move towards the buffered case, but the overall difference is marginal.
For reference, the 'same' buffered grep in tmpfs:
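To illustrate the point about the pipe, here is a minimal sketch of what a zgrep-style search boils down to: the decompressor writes to a pipe and grep reads from it. This is not the actual zutils or gzip zgrep code; the sample file and the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch: search a plain file and its gzip copy and show both give the same hits.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical sample mailbox, not the bug-gnuzilla archive.
printf 'From a@example.org\nSubject: Matrix client\n\nthe matrix room\nno match here\n' > "$tmp/sample.mbox"
gzip -c "$tmp/sample.mbox" > "$tmp/sample.mbox.gz"

# Plain search: count matching lines, ignoring case.
plain=$(grep -ic matrix "$tmp/sample.mbox")

# Compressed search: gzip decompresses into a pipe, grep reads from the pipe.
piped=$(gzip -cd "$tmp/sample.mbox.gz" | grep -ic matrix)

echo "plain=$plain piped=$piped"
```

Both runs do the same matching work in grep; the compressed variant only adds the decompressor in front of it.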
~~~
$ time grep -i matrix bug-gnuzilla/* > /dev/null
real 0m6.607s
user 0m6.565s
sys 0m0.044s
$ time ./zgrep matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m1.977s
user 0m1.304s
sys 0m0.809s
$ time ./gzgrep matrix bug-gnuzilla-gz/*.gz > /dev/null
real 0m2.389s
user 0m1.533s
sys 0m1.064s
~~~
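This shows that using compressed files can be faster, even much faster, than uncompressed. Note, however, that the zgrep and gzgrep runs above were made without `-i', so these timings are not a like-for-like comparison with the `grep -i' run.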
--
Last, a comparison of zgrep between compressors.
~~~
$ time ./zgrep -i matrix bug-gnuzilla/* > /dev/null
66 794 6253
real 0m8.017s
user 0m7.046s
sys 0m0.907s
$ time ./zgrep -i matrix bug-gnuzilla-gz/*.gz > /dev/null
# gzip VmPeak: 2516 kB
real 0m8.115s
user 0m7.709s
sys 0m0.853s
$ time ./zgrep -i matrix bug-gnuzilla-bz2/*.bz2 > /dev/null
# bzip2 VmPeak: 5596 kB
real 0m9.654s
user 0m10.654s
sys 0m1.009s
$ time ./zgrep -i matrix bug-gnuzilla-lz/*.lz > /dev/null
# lzip VmPeak: 4224 kB - 7204 kB
real 0m9.355s
user 0m9.215s
sys 0m1.005s
~~~
The reasons I use gzip as the showcase are these: it is a decent compressor, good enough for the job, it uses the least memory of all three, and it is the fastest of them.
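For anyone who wants to repeat this kind of comparison on their own machine, here is a rough sketch. It assumes nothing about how the numbers above were produced: the sample data is generated on the spot, the file names are hypothetical, and compressors that are not installed are skipped.

```shell
#!/bin/sh
# Sketch: compress one sample file with each available tool and run the same search.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical sample data: 2000 lines, every 10th one mentions "Matrix".
i=0
while [ "$i" -lt 2000 ]; do
    if [ $((i % 10)) -eq 0 ]; then
        echo "line $i: Matrix bridge"
    else
        echo "line $i: nothing of interest"
    fi
    i=$((i + 1))
done > "$tmp/sample.mbox"

results=""
for tool in gzip bzip2 lzip; do
    # Skip compressors that are not installed.
    command -v "$tool" >/dev/null 2>&1 || continue
    "$tool" -c "$tmp/sample.mbox" > "$tmp/sample.$tool"
    size=$(wc -c < "$tmp/sample.$tool")
    # Decompress to a pipe and count matching lines, ignoring case.
    hits=$("$tool" -cd "$tmp/sample.$tool" | grep -ic matrix)
    echo "$tool: $size bytes compressed, $hits matching lines"
    results="$results $tool=$hits"
done
```

Prefixing the decompress-and-grep pipeline with `time` on a real archive gives figures comparable in kind to the ones above; the tiny generated file here is only meant to check that every compressor yields the same hits.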