lrzip 0.612 on freecode
This time the main update is a new zpaq library back end, replacing the ageing zpipe code. Using the libzpaq library brings a number of advantages over the old zpipe code.
First, the old code required a FILE type stream since it was written with stdio in mind, so it was the only compression back end that needed some lesser known but handy, (virtually) Linux-only memory features like fmemopen, open_memstream and friends (a quick sketch of what they provide is below). These were not portable to OS X and others, so on those platforms they were emulated through the incredibly clunky use of temporary files on disk. Using the new library has killed off the need for these features, making the code more portable.
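For anyone unfamiliar with them, here is a minimal sketch (not lrzip's actual code) of what those glibc/POSIX calls provide: a FILE * stream backed by memory rather than by a file on disk, which is exactly what a stdio-only back end like the old zpipe code needed.

/* Sketch only: fmemopen/open_memstream are the POSIX.1-2008/glibc calls
 * the old zpipe back end relied on, and they are not available everywhere. */
#include <cstdio>
#include <cstdlib>

int main()
{
    char input[] = "data already held in memory";

    /* Read an in-memory buffer through a FILE * as if it were a file. */
    FILE *in = fmemopen(input, sizeof(input) - 1, "r");

    /* Capture anything written to a FILE * into a growing memory buffer. */
    char *outbuf = NULL;
    size_t outlen = 0;
    FILE *out = open_memstream(&outbuf, &outlen);

    int c;
    while ((c = fgetc(in)) != EOF)  /* a stdio back end would filter the data here */
        fputc(c, out);

    fclose(in);
    fclose(out);                    /* outbuf and outlen are now valid */
    printf("%zu bytes: %s\n", outlen, outbuf);
    free(outbuf);
    return 0;
}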
Second, the code is significantly faster since it is the latest full C++ version of the zpaq code. Unfortunately it also means this part of lrzip now takes a LOT longer to compile, but that's not a big deal since you usually only compile it once ;)
Third, it supports three different compression levels, one of which is higher than the level lrzip previously supported. As lrzip uses nine levels of compression, I've mapped the three zpaq levels to -L 1-3, 4-7 and 8-9; since -L 7 is the default, that provides the "mid level" compression from zpaq (see the sketch below).
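For illustration only (the real mapping lives inside lrzip's stream handling and may differ in detail), the level translation described above boils down to something like this:

/* Illustrative only: map lrzip's -L 1..9 onto the three zpaq levels as
 * described above (1-3 -> fast, 4-7 -> mid, 8-9 -> max). */
static int lrzip_level_to_zpaq(int lrzip_level)
{
        if (lrzip_level <= 3)
                return 1;       /* fast profile */
        if (lrzip_level <= 7)
                return 2;       /* mid profile, the -L 7 default */
        return 3;               /* max profile */
}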
Finally, the beauty of the zpaq compression algorithm is that the reference decoder can decompress zpaq compressed data of any profile. This means you are able to use the latest version of lrzip with compression -L 9 (max profile), yet it remains backward compatible with older 0.6x versions of lrzip, without requiring a new minor version and file format. The release archive I provide of lrzip-0.612.tar.lrz is compressed with lrzip itself using the new max profile. Even though there is significantly more code than ever in the lrzip release tarball, it has actually shrunk for the first time in a while.
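For those curious about the library-style interface itself, here is a rough memory-to-memory sketch. It is only an approximation: the exact libzpaq entry points have changed between releases (the string-method compress() shown here is from later versions), and this is not lrzip's actual wrapper code. What it illustrates is that the library talks to user-supplied Reader/Writer objects, so no FILE * or temporary files are needed, and that decompression takes no profile argument at all because the decoding model is stored in the zpaq stream itself.

// Rough sketch only, not lrzip code. Assumes the string-method compress()
// API of later libzpaq releases; older releases took an integer level.
#include <libzpaq.h>
#include <string>
#include <cstdio>
#include <cstdlib>

// libzpaq reports errors through this user-defined hook.
void libzpaq::error(const char *msg)
{
    fprintf(stderr, "zpaq error: %s\n", msg);
    exit(1);
}

// Minimal in-memory stream wrappers.
struct MemReader : public libzpaq::Reader {
    const std::string &s;
    size_t pos;
    explicit MemReader(const std::string &src) : s(src), pos(0) {}
    int get() { return pos < s.size() ? (unsigned char)s[pos++] : -1; }
};

struct MemWriter : public libzpaq::Writer {
    std::string s;
    void put(int c) { s += (char)c; }
};

int main()
{
    std::string original(1 << 20, 'x');

    MemReader in(original);
    MemWriter compressed;
    libzpaq::compress(&in, &compressed, "3");   // "3" ~ the max profile

    // No profile is passed here: the decoder is driven by the model
    // embedded in the stream, which is why any profile stays readable.
    MemReader zin(compressed.s);
    MemWriter restored;
    libzpaq::decompress(&zin, &restored);

    printf("%zu -> %zu -> %zu bytes\n",
           original.size(), compressed.s.size(), restored.s.size());
    return 0;
}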
All that talk is boring though, so let's throw around some benchmark results, which are much more fun.
From the original readme benchmarks, I had compressed the Linux 2.6.37 tarball, so I used that again for comparison. Tests were performed on a quad core 3GHz Intel Core 2 CPU.
Compression   Size        Percentage   Compress   Decompress
None          430612480   100
7z            63636839    14.8         2m28s      0m6.6s
xz            63291156    14.7         4m02s      0m8.7s
lrzip         64561485    14.9         1m12s      0m4.3s
lrzip -z      51588423    12.0         2m02s      2m08s
lrzip -l      137515997   31.9         0m14s      0m2.7s
lrzip -g      86142459    20.0         0m17s      0m3.0s
lrzip -b      72103197    16.7         0m21s      0m6.5s
bzip2         74060625    17.2         0m48s      0m12.8s
gzip          94512561    21.9         0m17s      0m4.0s
As you can see, the improvements in speed of the rzip stage have made all the compression back ends pretty snappy, and most fun of all, lrzip -z on this workload compresses even faster than the multithreaded 7z and produces a significantly smaller file. Alas, the major disadvantage of zpaq remains that it takes about as long to decompress as it does to compress. However, with the trend towards more CPU cores as time goes on, one could argue that zpaq compression, as used within lrzip, is reaching a speed where it can see regular use instead of just research/experimental use, especially on small files like the lrzip tarball I distribute.
I also repeated my old classic 10GB virtual image benchmarks
Compression   Size          Percentage   Compress Time   Decompress Time
None          10737418240   100.0
gzip          2772899756    25.8         05m47s          2m46s
bzip2         2704781700    25.2         16m15s          6m19s
xz            2272322208    21.2         50m58s          3m52s
7z            2242897134    20.9         26m36s          5m41s
lrzip         1372218189    12.8         10m23s          2m53s
lrzip -U      1095735108    10.2         08m44s          2m45s
lrzip -l      1831894161    17.1         04m53s          2m37s
lrzip -lU     1414959433    13.2         04m48s          2m38s
lrzip -zU     1067169419    9.9          39m32s          39m46s
Using "U"nlimited "z"paq options, it is actually faster than xz now. Note that about 30% of this image is blank space but that's a not-uncommon type of virtual image. If it were full of data, the difference would be less. Anyway I think it's fair to say that it's worth watching zpaq in the future. Edit: I've sent Matt Mahoney (zpaq author) the latest benchmarks for lrzip and how it performs on the large file benchmark and he's updated his site: http://mattmahoney.net/dc/text.html I think it's performing pretty well for a general compression utility.
Nice, CK. For those who are unaware, can you comment on the status and implications of incorporating lrzip into libarchive? I believe you are/were working with Michael Blumenkrantz (author of liblrzip) to do this.
Thanks again!
The GPL license in lrzip will prevent the full library support from being merged into libarchive; only the simple compress/decompress features can go in. Michael has forwarded me a patch for separate-binary support of lrzip in libarchive to do this. I have not yet submitted it on his behalf.
Hopefully, your efforts will result in lrzip support in pacman/makepkg. I ran a simpler version of your benchmark compressing the kernel source. Results are like yours, but I included some graphs of the key metrics.
http://repo-ck.com/bench/lrzip.pdf
Thanks for those. Be aware that -L 7 is the same as no options.
I posted a first draft patch to libarchive for basic lrzip inclusion, and the news is even better: they can actually accept libraries that are GPL licensed, which means that we can work on full lrzip support in libarchive.
See:
https://github.com/libarchive/libarchive/pull/7#issuecomment-4559190
Have you tried srep and exdupe?
srep was based on lrzip ideas...
rep is based on the rzip idea, and srep is based on the idea mentioned in the rzip paper (a dictionary larger than memory), but both are much more efficient than the rzip implementation. lrzip, afaik, just adds lzma compression to the rzip code, while I've developed a completely improved algorithm.
Now rep (in-memory deduplication) processes more than 1 GB/s on an i7-2600, and srep (full-file deduplication) processes about 100 MB/s, with a final compression ratio (rep/srep + lzma) much better than with rzip/lrzip. Right now I'm working on 1 GB/s full-file deduplication.
Have you actually tested lrzip or are you basing your observation on theory alone? Lrzip has unlimited size compression windows.
Afair, rzip finds 32+ byte matches and then omits some of them when it runs out of memory. Has lrzip changed the algorithm? I looked at it back in 2007-09 when I started working on rep. Look at http://encode.ru/threads/1726-Deduplication-X-Files
It would be great if you could join the forum and start a thread about lrzip. In order to post here I have to solve the Google captcha every time, and I hate it.
Maybe I'm missing something: I tried stdin/stdout compression and decompression and found something I can't explain.
$echo -n '1234567890123456789012345678901' |lrzip -q |lrzip -d -q |cat
1234567890123456789012345678901 # ok
$echo -n '12345678901234567890123456789012' |lrzip -q |lrzip -d -q |cat
Failed to decompress buffer - lzmaerr=1
Failed to decompress buffer - lzmaerr=1
Failed to decompress in ucompthread
Fatal error - exiting
$echo -n '000000000000000000000000000000000' | lrzip -q | lrzip -q -d |cat
000000000000000000000000000000000 #ok
$echo -n '0000000000000000000000000000000000' | lrzip -q | lrzip -q -d |cat
Failed to decompress buffer - lzmaerr=6
Failed to decompress buffer - lzmaerr=6
Failed to decompress in ucompthread
Fatal error - exiting
$echo -n 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzab' | lrzip -q | lrzip -q -d |cat
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzab # ok
$echo -n 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc' | lrzip -q | lrzip -q -d |cat
Failed to decompress buffer - lzmaerr=1
Failed to decompress buffer - lzmaerr=1
Failed to decompress in ucompthread
Fatal error - exiting
31 is ok for numbers 0-9
33 is ok with zeros
54 is ok with letters
thanks.
LS
Interesting bug, thanks for pointing it out. It probably makes no sense to even use a back end when the input is below a certain size, and the back ends seem to be unreliable below some limit which I haven't worked out yet. I'll try to work on it for the next version, whenever that comes out.
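Purely as a hypothetical illustration of that idea (this is not lrzip code, and the threshold is an invented placeholder), the fix could be as simple as storing tiny chunks raw instead of handing them to a back end:

/* Hypothetical sketch only: below some minimum size, skip the compression
 * back end entirely and store the chunk as-is. MIN_BACKEND_SIZE is an
 * assumed placeholder, not a value taken from lrzip. */
#include <cstddef>
#include <cstring>

enum ChunkType { CHUNK_STORED, CHUNK_COMPRESSED };

static const size_t MIN_BACKEND_SIZE = 64;

static ChunkType write_chunk(const unsigned char *buf, size_t len,
                             unsigned char *out, size_t *outlen)
{
        if (len < MIN_BACKEND_SIZE) {
                memcpy(out, buf, len);  /* store raw, no back end involved */
                *outlen = len;
                return CHUNK_STORED;
        }
        /* ... otherwise hand the chunk to the configured back end ... */
        *outlen = 0;                    /* placeholder for back end output size */
        return CHUNK_COMPRESSED;
}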