Friday, 5 November 2010

Lrzip and sliding mmap

After my last blog post I completed work on releasing a new version of lrzip. The most user-visible change is the new file format, which is slightly more compact and a little faster to encode. Behind the scenes, though, quite a lot of work went into fixing random minor issues, improving the memory allocation code, and making the maximum compression windows larger, with less chance of failure by falling back as much as possible.

One other major change was to try and implement stdin/stdout properly in lrzip, instead of faking it with temporary files as the last version did. As you may, or more likely may not, be aware, the original rzip that lrzip was based on did not work with stdin/stdout. It was one of the most commonly requested features of lrzip, and I implemented it by simply creating a temporary file on read-in and then working off that, and by compressing to a temporary file and then dumping it to stdout once complete. I've been thinking for a long time about how I could do it without the temporary files, as there are certain limitations. As far as stdin goes, on compression the file being compressed needs to be mmappable, and on decompression the compressed file needs to be seekable. Neither of those can be done on stdin. As far as stdout goes, the output file needs to be seekable/readable while it's being written, and on compression the header needs to be written at the start of the file as well. Neither of those can be done on stdout.

After much fluffing around, the best I could do was to implement stdin without a temporary file. As lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem with this is that we can't know how big the file will be as it's being read, so the anonymous mapping is made as big as possible. Some compression is lost by doing this, as one cannot map any more than the amount of real ram that can be allocated, and the block headers need to be generous in predicting how long the next data block will be when they're written, but these are not major problems.
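For anyone curious, the general shape of it is something like this (just a simplified sketch of the idea rather than the actual lrzip code; the 1GB cap is an arbitrary number for illustration):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read as much of stdin as will fit into an anonymous mapping.
 * We cannot know the stream length in advance, so the mapping is
 * made as big as we dare back with real ram. */
static void *slurp_stdin(size_t max_size, size_t *len)
{
        unsigned char *buf;
        size_t total = 0;
        ssize_t ret;

        buf = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return NULL;

        while (total < max_size) {
                ret = read(STDIN_FILENO, buf + total, max_size - total);
                if (ret <= 0)
                        break;  /* EOF or read error */
                total += (size_t)ret;
        }
        *len = total;
        return buf;
}

int main(void)
{
        size_t len;
        void *buf = slurp_stdin(1UL << 30, &len);       /* arbitrary 1GB cap */

        if (!buf) {
                perror("mmap");
                return 1;
        }
        fprintf(stderr, "read %zu bytes from stdin\n", len);
        /* ... hand buf and len to the compression stages here ... */
        munmap(buf, 1UL << 30);
        return 0;
}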

After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So with the help of a few people online who had OS X, we got to fixing the build problem, and I also noticed that, due to a wrongly made workaround for OS X not having memopen, even the Linux builds would be getting the fake memopen, so I fixed that and committed it to the git repo. Not enough for a new version yet.

Anyway, all of that got me thinking about how to manage the whole compression-window-failure issue, since I had already started implementing things to work around failing to allocate enough ram. Up to version 0.5, the compression window was "chosen" by lrzip based on a 2/3-of-ram rule, unless it was set by the user. This was rather prone to failure for numerous obvious reasons, not least the fact that further ram needed to be allocated once the backend compression phase kicked in.
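That heuristic was never much more than this sort of thing (a rough sketch of the general idea, not lrzip's exact code; _SC_PHYS_PAGES is how glibc reports physical ram):

#include <unistd.h>

/* Old-style window guess: take 2/3 of detected physical ram.
 * This ignores what's already in use and what the backend
 * compressor will need later, which is why it proved fragile. */
static unsigned long long guess_window(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        long pages = sysconf(_SC_PHYS_PAGES);

        if (pagesize < 1 || pages < 1)
                return 0;
        return (unsigned long long)pagesize * pages / 3 * 2;
}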

When I first started playing with lrzip as an entity, one thing struck me. The major advantage of using the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now with Linux using virtual addressing and being an aggressive overcommitter, it turns out that amount can be quite a lot, but it is still limited by how much real ram you have. Also, if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma benefits quite a lot from ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which would make it look like you have the whole file mmapped, yet not need your entire ram dedicated to it, as the "window" onto whichever part of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.

So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. Well, that would be fine if you were mostly reading linearly from a file, but the hash search of a huge compression window does NOT do that. It reads from one end to the other in two sort-of streams at a time, so there were literally billions of mmap/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half-sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those buffers up and down. Still no better. I tried tossing the mmap out completely and using single reads from disk per byte, either via pread or fread. That was even worse. I tried mallocing just one page and moving that, and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally I found what worked best was to still have as much ram mapped as possible, since that is fast, and to have a one-page-sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.
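To make that concrete, here's roughly what the final scheme looks like (a simplified sketch of the idea only; the names and layout are mine, not what's in the lrzip source):

#include <sys/mman.h>
#include <sys/types.h>

/* Sliding mmap sketch: a large base mapping covers the start of the
 * file, and a single page-sized window is remapped on demand for any
 * offset beyond it. win_len is assumed to be the page size (4096),
 * i.e. a power of two. */
struct sliding_map {
        int fd;
        unsigned char *base;    /* big mapping of the file's start */
        size_t base_len;
        unsigned char *win;     /* small window for everything else */
        off_t win_off;          /* file offset the window covers */
        size_t win_len;
};

/* Return the byte at file offset off, sliding the small window if the
 * offset lies outside both the base mapping and the current window.
 * Returns -1 if the remap fails. */
static int sliding_get_byte(struct sliding_map *sm, off_t off)
{
        if ((size_t)off < sm->base_len)
                return sm->base[off];

        if (!sm->win || off < sm->win_off ||
            off >= sm->win_off + (off_t)sm->win_len) {
                off_t aligned = off & ~((off_t)sm->win_len - 1);

                if (sm->win)
                        munmap(sm->win, sm->win_len);
                sm->win = mmap(NULL, sm->win_len, PROT_READ, MAP_SHARED,
                               sm->fd, aligned);
                if (sm->win == MAP_FAILED) {
                        sm->win = NULL;
                        return -1;
                }
                sm->win_off = aligned;
        }
        return sm->win[off - sm->win_off];
}

The point is that any access within the base mapping costs nothing extra; only accesses beyond it pay for a munmap/mmap pair, which is where the slowdown comes from.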

How well does it perform? Well, somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast within that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, namely that it's very fast, but then again lrzip already does the ultra-slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.

Probably more useful than that feature is that I put a lot more effort into maximising the size of the windows by throwing away the "guessing" of how big a window it can use. Now it just allocates as much as it can by testing, and then does the same for the backend compression phase so it won't run out of ram later on. Simply put, the larger the window for the preprocessor stage the better, and if it's as big as the file, better still, and this does its best to make it as big as possible (with -M).
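The "testing" is essentially just trying the biggest mapping first and backing off until something sticks, along these lines (only an illustrative sketch, not the actual code, which also has to leave room for the backend compressor):

#include <sys/mman.h>
#include <stddef.h>

/* Map the largest window of the file that we can get: try the full
 * requested length and back off in steps whenever mmap() fails.
 * want would normally start at the file size. */
static void *map_largest(int fd, size_t want, size_t step, size_t *got)
{
        void *buf = MAP_FAILED;

        while (want >= step) {
                buf = mmap(NULL, want, PROT_READ, MAP_SHARED, fd, 0);
                if (buf != MAP_FAILED)
                        break;
                want -= step;
        }
        *got = (buf == MAP_FAILED) ? 0 : want;
        return (buf == MAP_FAILED) ? NULL : buf;
}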

So here are some numbers, choosing what lrzip is best at: a 10GB virtual image (so lots of redundant data over a big file) being compressed on a machine with 8GB of ram and an SSD drive (this is an excerpt from the benchmarks on my site):

Compression  Size (bytes)  Compress   Decompress
None         10737418240
gzip         2772899756    05m47.35s  2m46.77s
bzip2        2704781700    20m34.26s  7m51.36s
xz           2272322208    58m26.82s  4m46.15s
7z           2242897134    29m28.15s  6m35.95s
lrzip        1354237684    29m13.40s  6m55.44s
lrzip -M     1079528708    23m44.22s  4m05.46s
lrzip -l     1793312108    05m13.24s  3m12.88s
lrzip -lM    1413268368    04m18.33s  2m54.65s
lrzip -z     1299844906    04h32m14s  04h33m
lrzip -zM    1066902006    04h07m14s  04h08m

-l mode uses lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with the numbers. (edit: done)

Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.

lrzip.kolivas.org

UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact a usable speed now: I just tested a 5GB file on a laptop that could normally only use a 2GB window, but with unlimited mode it had an effective 5GB window, and it completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.
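In terms of the earlier sketch, the change boils down to something like this (purely illustrative again; the constant and helper name are my own):

#include <sys/mman.h>
#include <sys/types.h>

/* The small window grows from one page to 64k to match the search
 * function, and the large base mapping gets remapped periodically so
 * it tracks the main hash search position instead of staying pinned
 * at the start of the file. new_off must be page aligned. */
#define SLIDE_WIN_LEN   (64 * 1024)

static void *remap_base(int fd, void *old, size_t len, off_t new_off)
{
        if (old)
                munmap(old, len);
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, new_off);
}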
