Monday, 8 November 2010

lrzip-0.520 and massive kernel compression experiment.

So it's another day and it's time for a new version. Actually version 0.520 is the same as version 0.5.2; it was pointed out to me that distros (and github and most everything else) don't really like 3 point numbering schemes, so I have to go back to 2 point version numbering. This is otherwise the same code. I'm not planning on any more changes in the immediate future, so hopefully things will calm down for a while and give people a chance to actually use it.

Anyway I had a bit of fun with testing out lrzip at the extremes of what it does well and ended up posting my results to the linux kernel mailing list since it involved linux tarballs. Here is a copy of the email as I sent it.

---

Let's do this backwards. Data first.


tarball of every 3 point linux kernel from 2.6.0 to 2.6.36

9748285440 linux-2.6.0-2.6.36.tar


compressed file

136516013 linux-2.6.0-2.6.36.tar.lrz


Compressed size: 1.4%


Compression time:

00:19:22.086


Decompression time:

00:03:02.564


lrzip.kolivas.org


Now for the introduction

Lrzip is a compression application I've been working on that is based on the
original excellent application rzip by Andrew Tridgell. Rzip worked by having
a preprocessor stage with a massive compression window up to 900MB for long
distance redundancy compaction and then compressing the data with bz2. Lrzip
was a project I began based on rzip which tried to extend the initial
compression window beyond 900MB and to use lzma as the back end for the 2nd
stage compression. The idea was that as file sizes got bigger, and machines
had more ram, it would keep getting more and more useful.

After many iterations on lrzip, I've been able to significantly expand on this
idea and make it address 64 bit sized files, ram sizes, and bigger windows.
Previously the limit was based on available ram and how much of the file being
compressed could be mmapped. The current version (0.5.2) is able to use
unlimited sized compression windows for the preprocessor stage using two
sliding mmap windows I invented to match the hash search algorithm of rzip
which can go beyond the size of the ram available. Basically, the larger the
file and the more redundant information in it, the more compression you can
achieve.

Anyway the relevance of this to kernel folk is that the linux kernel is a
perfect test case for long distance redundancy when multiple kernel trees are
compressed.

So here's the lowdown. All of these tests were run on a 3GHz quad core
64 bit machine with 8GB ram and a 160GB SSD, so hard drive speed is a
(relatively) insignificant part of the times.


Starting with linux kernel 2.6.36

413511680 linux-2.6.36.tar
70277083 linux-2.6.36.tar.bz2
59905365 linux-2.6.36.tar.lrz

Compression:
Compression Ratio: 6.903. Average Compression Speed: 1.582MB/s.
Total time: 00:04:08.896

Decompression:
Average DeCompression Speed: 17.130MB/s
Total time: 00:00:22.557


So trying my best to show off what lrzip is good at, I grabbed every 3 point
2.6 linux kernel release. That's every release from 2.6.0 to 2.6.36 as a
tarball, not as patches. It came to a 9.1GB file. Now each kernel tree is
going to be repeating an -awful- lot of data, but given that they have gotten
progressively larger, and span gigabytes, normal compression doesn't really
offer much. This is where the rzip compaction stage over the full size of the
file comes in handy. The more redundant information there is over large
distances, the better. [insert joke about the linux kernel and redundant
information here]. (-MU means Maximum Unlimited; maximum ram and unlimited
window size).


9748285440 linux-2.6.0-2.6.36.tar

lrzip -MU linux-2.6.0-2.6.36.tar

136516013 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 71.408. Average Compression Speed: 8.000MB/s.
Total time: 00:19:22.086

That's 9.1GB to 130MB (1.4%) in 20 minutes.

Decompression:
Average DeCompression Speed: 51.359MB/s
Total time: 00:03:02.564


At 130MB, it's small enough for me to even offer the entire 3 point stable
release for download from my piddly little server. So here's the file, if
anyone wanted to confirm its validity, or just to download them all :)

linux-2.6.0-2.6.36.tar.lrz


lrzip also has lzo and zpaq backends, the latter for those who want to extract
every last drop of compression out of it at the cost of a lot more time. zpaq is
EIGHT TIMES slower during the backend compression phase compared to lzma,
whereas lzo is almost instant after the rzip phase. Note that this file loses
most of its size during the preprocessor stage, so it doesn't end up taking 8x
longer to compress with zpaq:


lrzip -MUz linux-2.6.0-2.6.36.tar

121021214 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 80.550. Average Compression Speed: 3.041MB/s.
Total time: 00:50:56.537

That's 9.1GB to 115MB (1.2%) in 51 mins.

Decompression time is about 52 minutes (yes it's longer than compression!)


Now, I am NOT proposing lrzip be used as a drop in replacement for the
existing compression formats in use. It does work on stdin/stdout, but
inefficiently in every case except compression from stdin. It does not have any archive
facilities beyond compressing the one file, so it comes with an lrztar and
lrzuntar wrapper to do basic directory archiving. It uses a LOT of ram, and
although it has the ability to use unlimited sized compression windows
unconstrained by ram, the bigger the discrepancy between the file size and the
ram, the slower it will get. (Decompression typically uses 10x less ram than
compression).

Nevertheless, I hope some people with lots of similar files lying around, like
kernel hackers, will find it a useful tool, as will people with
database dumps and the like. I also wonder if features of this compression
approach might be helpful for other projects that deal with huge amounts of
data and have large memory requirements. When I mentioned the sliding mmap
feature online, git using lots of memory during its compression phase was
mentioned, so perhaps there are other uses for the sliding mmap idea as a way
of reducing memory usage. There's some documentation of it in the code in
rzip.c and I wrote briefly about it in my blog: http://ck-hack.blogspot.com

Saturday, 6 November 2010

lrzip-0.5.2 with unlimited size compression windows unconstrained by ram.

After my last blog, all I needed to do was sleep on the "sliding mmap" implementation I had so far to think about how to improve it to make it work in some reasonable time frame instead of 100 to 1000 times slower. Yes, I know lrzip isn't as exciting as BFS to everyone else, but I'm having fun hacking on it.

The hash search in the rzip phase of lrzip moves progressively along the file offset, and the match lengths are up to 65536 bytes long. So I changed my 2 sliding buffers to have the one large one move along with the progressive file offset, and the smaller one to be 65536 bytes long. Well the results were dramatic. Instead of being 100 times slower, my first attempt showed it to be 3 times slower. So I cleaned it up and started benchmarking. Not only was it faster than the earlier implementation, surprisingly, sometimes it's faster to compress with it enabled than not! So it was clear I was onto a winner. I tried a few different memory configuration machines and different file sizes, and I can see that as the amount of redundancy goes up in the original file, and as the file deviates further from the ram size, it slows down more and more, but it's still a usable speed.
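For illustration, here's a rough sketch of the two-window idea (not lrzip's actual code: the struct, names, and the FILE-based refill are hypothetical stand-ins, whereas the real implementation slides actual mmaps):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SMALL_WIN (64 * 1024)

/* Hypothetical sliding-buffer reader: `base` holds the first
 * `base_len` bytes of the file in ram (standing in for the large
 * mmapped buffer); any read beyond it goes through a small 64KB
 * window refilled from the backing file on demand. */
struct sliding_buf {
    const unsigned char *base;    /* large buffer: file prefix in ram */
    size_t base_len;
    FILE *f;                      /* backing file for the tail */
    unsigned char win[SMALL_WIN]; /* small window, sized to the 64KB
                                   * maximum match length */
    size_t win_off;               /* file offset the window starts at */
    size_t win_len;               /* valid bytes currently in win */
};

/* Return the byte at virtual file offset `off`, or -1 on failure. */
static int sb_get(struct sliding_buf *sb, size_t off)
{
    if (off < sb->base_len)
        return sb->base[off];     /* fast path: plain memory access */
    if (off < sb->win_off || off >= sb->win_off + sb->win_len) {
        /* slide the small window so `off` becomes its first byte */
        if (fseek(sb->f, (long)off, SEEK_SET))
            return -1;
        sb->win_len = fread(sb->win, 1, SMALL_WIN, sb->f);
        sb->win_off = off;
        if (!sb->win_len)
            return -1;
    }
    return sb->win[off - sb->win_off];
}
```

Since the hash search advances mostly linearly and matches are at most 64KB long, the small window presumably only needs to be refilled when a reference lands outside both buffers, rather than on every access.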

Using the classic 10GB virtual image file on an 8GB machine example, here's how it fared with the various options:
Options Size       Compress Decompress
-l      1793312108  5m13s    3m12s
-lM     1413268368  4m18s    2m54s
-lU     1413268368  6m05s    2m54s

Not bad at all really, considering it manages to mmap about 5.5GB and relies on the sliding feature for the other 4.5GB.

So I fixed a swag of other things, updated the docs and made all new benchmarks, and decided I couldn't sit on it any longer. At some stage you have to release, so version 0.5.1 is out, with the goodness of unlimited size compression windows, unconstrained by ram.

EDIT: Updated link to 0.5.2
lrzip-0.5.2

Here's some more fun benchmarks:

These are benchmarks performed on a 2.53GHz dual core Intel Core2 with 4GB ram using lrzip v0.5.1. Note that it was running with a 32 bit userspace so only 2GB addressing was possible. However the benchmark was run with the -U option allowing the whole file to be treated as one large compression window.

Tarball of 6 consecutive kernel trees, linux-2.6.31 to linux-2.6.36

Compression Size        Percentage Compress Decompress
None        2373713920   100
7z          344088002    14.5       17m26s   1m22s
lrzip -l    223130711    9.4        05m21s   1m01s
lrzip -U    73356070     3.1        08m53s   43s
lrzip -Ul   158851141    6.7        04m31s   35s
lrzip -Uz   62614573     2.6        24m42s   25m30s

Kick arse.

Update: Again I managed to break the Darwin build in 0.5.1. This time it's a simple missing ; at the end of line 536 in main.c. I'll confirm that it otherwise works on Darwin and do a quick bump to a new version. (edit: done)

EDIT: Just for grins, I tried compressing a 50GB freenet datastore on the 8GB ram machine with just the one window as a test. It succeeded with flying colours in about 2 hours, despite the data being of very low compressibility (it's all encrypted) and the window being more than 6x the size of the ram.

Compression Ratio: 1.022. Average Compression Speed: 7.266MB/s.
Total time: 01:59:3563.609

It also decompressed fine:
Total time: 01:02:3566.894

Note that it was on an external USB hard drive which isn't the fastest thing to read/write from and reading and writing 50GB to the same drive probably would take an hour itself, so that's why it takes so long. The compression time frame looks good in light of that :)

Friday, 5 November 2010

Lrzip and sliding mmap

After my last blog I completed work on releasing a new version of lrzip. The most user visible changes would be the new file format which is slightly more compact and a little faster on encoding. Behind the scenes, though, quite a lot of work went into fixing random minor issues, improving the memory allocation code and making the maximum compression windows larger with less chance of failure by falling back as much as possible.

One other major change was to try and implement stdin/stdout properly in lrzip instead of faking it with temporary files as the last version did. As you may, or more likely may not, be aware, the original rzip which lrzip was based on did not work with stdin/stdout at all. It was one of the most commonly requested features of lrzip, and I implemented it by simply creating a temporary file on readin and working off that, and a temporary file on compression which was dumped to stdout once complete. I've been thinking for a long time about how I could do it without the temporary files, as there were certain limitations. As far as stdin goes: on compression, the file being compressed needs to be mmappable, and on decompression, the compressed file needs to be seekable. Neither of those can be done on stdin. As far as stdout goes: the output file needs to be seekable/readable while it's being written, and on compression the header needs to be written at the start of the file as well. Neither of those can be done on stdout.

After much fluffing around, the best I could do was to implement stdin without a temporary file. As lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem with this is we can't know how big the file will be as it's being read, so the anonymous mapping is made as big as is possible. There is some slight compression lost by doing this as one cannot map any more than how much real ram can be allocated, and the block headers need to be generous in predicting how long the next data block will be when they're written, but these are not major problems.
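A minimal sketch of that approach (hypothetical function, not lrzip's actual code; MAP_ANONYMOUS is the Linux spelling, and in practice `max` would be the largest mapping the system will grant):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read from fd (stdin in lrzip's case) into a large anonymous
 * mapping until EOF or the mapping is full. The size must be chosen
 * up front because a pipe's length cannot be known in advance. */
static void *slurp_fd(int fd, size_t max, size_t *len)
{
    unsigned char *buf = mmap(NULL, max, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t total = 0;
    ssize_t n;

    if (buf == MAP_FAILED)
        return NULL;
    while (total < max &&
           (n = read(fd, buf + total, max - total)) > 0)
        total += (size_t)n;
    *len = total;
    return buf;
}
```

Untouched pages of an anonymous mapping cost nothing until they are written, which is what makes asking for a generous size workable under linux's overcommit behaviour.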

After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So with the help of a few people online who had OS X, we got to fixing the build problem, and I also noticed that due to a wrongly made workaround for OS X not having fmemopen, even the linux builds were getting the fake fmemopen, so I fixed that and committed it to the git repo. Not enough for a new version yet.

Anyway all of that got me thinking about how to manage the whole compression window failing issue since I started implementing things to workaround failing to allocate enough ram. Up to version 0.5, a compression window is "chosen" by lrzip based on a 2/3 of ram rule, or if set by the user. This was rather prone to failure for numerous obvious reasons, and the fact that further ram needed to be allocated as the backend compression phase would kick in.

When I first started playing with lrzip as an entity, one thing struck me. The major advantage of using the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now with linux using virtual addressing and being an aggressive overcommitter, it turns out that amount can be quite a lot, but it is still limited by how much real ram you have. Also, if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma benefits quite a bit from a lot of ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which could look like you have the whole file mmapped, yet did not need to dedicate your entire ram to it, as the "window" of whichever part of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.

So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. Well this would be fine if you were mostly reading linearly from a file, but the hash search over a huge compression window does NOT do that. It reads from one end to the other in 2 sort of streams at a time, so there were literally billions of mmaps/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those two buffers up and down. Still no better. I tried tossing the mmap out completely and using single reads from disk per byte, either via pread or fread. That was even worse. I tried mallocing just one page and moving that, and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally I found what worked best was to still have as much ram mapped as possible, since this is fast, and have a one page sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.

How well does it perform? Well somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast during that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, being that it's very fast, but then again lrzip does also do the ultra slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.

Probably more useful than that feature is that I put a lot more effort into maximising the size of windows by throwing away the "guessing" of how big a window it can use. Now it just allocates as much as it can through testing, and then does the same in the backend compression phase so it won't run out of ram later on. Simply put, the larger the window for the preprocessor stage, the better, and if it's as big as the file, even better; this does its best to make the window as big as possible (with -M).
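That "allocate by testing" strategy amounts to something like the following (a simplified sketch with hypothetical names; the real code also has to budget ram for the backend compressor):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Find the largest buffer we can actually get: try the ideal size
 * and shrink the request until malloc succeeds, instead of guessing
 * a window size from total ram up front. */
static void *alloc_largest(size_t want, size_t floor, size_t *got)
{
    while (want >= floor) {
        void *buf = malloc(want);

        if (buf) {
            *got = want;
            return buf;
        }
        want /= 2;      /* fall back to a smaller window and retry */
    }
    *got = 0;
    return NULL;
}
```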

So here's some numbers, choosing what lrzip is best at; a 10GB virtual image (so lots of redundant data over a big file) being compressed on a machine with 8GB ram on an SSD drive (this is an excerpt from the benchmarks on my site):

Compression Size      Compress   Decompress 
None      10737418240 
gzip      2772899756  05m47.35s  2m46.77s
bzip2     2704781700  20m34.26s  7m51.36s
xz        2272322208  58m26.82s  4m46.15s
7z        2242897134  29m28.15s  6m35.95s
lrzip     1354237684  29m13.40s  6m55.44s
lrzip -M  1079528708  23m44.22s  4m05.46s
lrzip -l  1793312108  05m13.24s  3m12.88s
lrzip -lM 1413268368  04m18.33s  2m54.65s
lrzip -z  1299844906  04h32m14s  04h33m
lrzip -zM 1066902006  04h07m14s  04h08m

-l mode is using lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with the numbers. (edit: done)

Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.

lrzip.kolivas.org

UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact usable speed now, as I just tested a 5GB file on a laptop that could normally only use a 2GB window, but used unlimited mode having an effective 5GB window, and it completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.

Sunday, 31 October 2010

On lrzip and packing bytes

As some of you may be aware, I maintain a compression application called lrzip, an evolution of Tridge's rzip, which does a long-range redundancy pre-processing pass and then compresses the data with bzip2. Lrzip does the compression with a selection of backend compressors, using lzma by default but offering extremely fast compression with lzo or extremely slow with zpaq. The other thing lrzip does differently is that I updated all the code to work properly on 64 bit machines with super large sizes. The idea behind lrzip is that as machines get more ram and file sizes get bigger, the compression it is capable of achieving will keep getting better, as it uses compression windows limited in size only by available ram. It is not a complex mature application with lots of features like 7z; it simply offers solid compression at good speeds.

See more about it with benchmarks and other advertising blurbs here:
lrzip.kolivas.org

Anyway when I updated it to 64 bit, my conversion of the code was a little inefficient, simply changing all the offsets in the pre-processor stage from 16 and 32 bit variables to 64 bit ones. This lost some of the efficiency of the compression because suddenly 2 and 4 byte entries were now 8 bytes long. However, since the 8 bytes were mostly zeroes, they would compress quite well since the backend compression pass was done next. Still, it was clear this was inefficient. So I did some experimenting.

The first thing I did was to base the byte width on the file size of the archive, thus allowing very small archives to have even less than 4 byte wide entries, while most archives would likely be 4-5 bytes wide. 5 bytes gives us 40 bits, or files up to 1,099,511,627,776 bytes in size. The compression gains were small but very real, and given that with zpaq compression lrzip's overall compression is rather extreme, any gains are worth chasing.

The next thing I tried was packing bits. As most offsets were likely to be a lot less than the full file size, it is possible to make each entry only as many bytes wide as the data itself requires. To do this, I encoded how many bytes wide the data was within the data itself, by adding 3 bits to it. Having only 61 bits of a 64 bit variable left for the data itself is not a problem, since 61 bits can address files up to 2,305,843,009,213,693,952 bytes in size.

Here's a rundown of how the code looked for encode/decode:
static inline void put_vbytes(void *ss, int stream, i64 s)
{
    uchar encoded_bytes;
    int bytes, bits = 0;

    while (s > (1LL << bits))
        bits++;
    /* Add one bit for bit 0, and 3 bits for the encoded number of bytes */
    bits += 4;

    bytes = bits / 8;
    /* Round up to byte sized */
    if (bits % 8)
        bytes++;

    /* Always assume there will be one byte, so we only need 3 bits to count
     * the remaining 7 possible. Left shift s by 3 bits and then encode the
     * 3 bits of remaining bytes into the right-most bits of s that will be
     * written thus making them first */
    encoded_bytes = bytes - 1;
    s <<= 3;
    s |= (i64)encoded_bytes;
    put_vchars(ss, stream, s, bytes);
}

static inline i64 read_vbytes(void *ss, int stream)
{
    uchar encoded_bytes, sb = read_u8(ss, stream);
    i64 s;
    int bytes;

    /* Grab the right most 3 bits. It has encoded how many remaining
     * bytes this entry is */
    encoded_bytes = sb & 7;
    s = sb;
    for (bytes = 1; bytes <= encoded_bytes; bytes++) {
        int bits = bytes * 8;

        sb = read_u8(ss, stream);
        s |= (i64)sb << bits;
    }
    s >>= 3;
    return s;
}
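To see the scheme in isolation, here's a self-contained adaptation (my sketch, not the lrzip code itself): the stream becomes a plain byte buffer, and put_vchars/read_u8 become trivial helpers writing the bytes lowest-first, the order read_vbytes expects:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t i64;

/* Tiny stand-in for lrzip's stream: a byte buffer with separate
 * write and read cursors. */
struct buf { unsigned char b[16]; int w, r; };

static void put_u8(struct buf *s, unsigned char c) { s->b[s->w++] = c; }
static unsigned char get_u8(struct buf *s) { return s->b[s->r++]; }

/* Encode a non-negative value < 2^61 with a 3-bit "remaining bytes"
 * count packed into its lowest bits, as in put_vbytes. */
static void encode(struct buf *s, i64 v)
{
    int bits = 0, bytes;

    while (v > (1LL << bits))
        bits++;
    bits += 4;                      /* bit 0 plus the 3-bit count */
    bytes = bits / 8 + (bits % 8 ? 1 : 0);

    v = (v << 3) | (i64)(bytes - 1);
    while (bytes--) {
        put_u8(s, (unsigned char)(v & 0xff));  /* lowest byte first */
        v >>= 8;
    }
}

/* Mirror of read_vbytes: the first byte carries the count in its low
 * 3 bits, and the remaining bytes are ORed in at increasing shifts. */
static i64 decode(struct buf *s)
{
    unsigned char sb = get_u8(s);
    int encoded_bytes = sb & 7, bytes;
    i64 v = sb;

    for (bytes = 1; bytes <= encoded_bytes; bytes++)
        v |= (i64)get_u8(s) << (bytes * 8);
    return v >> 3;
}
```

So 0 encodes in a single byte, 1000 in two, and so on, with the decoder never needing to be told the width out of band.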

Anyway, after a lot of fluffing around to actually make this code work (I made numerous mistakes getting this far), I finally got around to testing it. The results of the pre-processing stage were very promising, with the data getting a lot smaller. Then after the backend compression stage kicked in, the final compressed file size was almost the same as it was with the plain 8 byte wide entries already in use in the current lrzip version 0.47 release. I sat there looking at it thinking "well that was a complete waste of time". After thinking about it a little more, I realised that this bit-packing is basically just another form of compression, so it was making the backend compression stage less efficient. Incidentally, I believe some file formats and compression formats use a similar form of bit packing.

So I'll be throwing away the bit-packing idea and going back to the fixed width idea, since the final compressed size will be smaller that way. However, I wanted to do something useful with the code since I made it, so I figured I should put it on this blog as I'm about to delete it from the actual code :) So back to the fixed byte width coding, hopefully with a new version of lrzip not too far away.