
Monday, 22 November 2010

lrzip-0.542, mmap windows and overcommit

I started out this blog entry with a lot more about the tty patch thingy and then decided against it. Suffice to say I don't like heuristics being special coded into the scheduler.

So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because, after fixing it, a very large file compressed in a quarter of the time it had been taking. Overall, sliding mmap has become a lot faster and more useful.

I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop: the default window took 20 minutes to compress, while an unlimited sized window with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.

Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo), I wanted to check what happens with memory allocation on linux and just how much the virtual memory subsystem will happily give you, since lrzip tests before it allocates a window. With a file backed mmap, it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. This is a disaster, because imagine reading that file linearly into ram as lrzip works on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram), everything already read is sitting there as a complete waste, and the kernel only pretends to cache the rest, dropping pages behind it as fast as it reads new ones. It never has a useful amount in ram at any one time. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip worked exactly that way, using whatever mmap would give it.
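For what it's worth, here's a minimal sketch (my own illustration, not lrzip code) of just how permissive a file backed mmap is. On a 64 bit machine this will happily map a file many times larger than physical ram, because nothing is paged in until the pages are touched:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	unsigned char *buf;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd == -1 || fstat(fd, &st) == -1)
		return 1;
	/* Map the whole file regardless of size; the kernel hands back an
	 * address even when the file dwarfs real ram. */
	buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped %lld bytes at %p\n", (long long)st.st_size, (void *)buf);
	munmap(buf, st.st_size);
	close(fd);
	return 0;
}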

So I spent days trying various combinations of window sizes to work out the optimum (with and without swap) and kept coming up with the same answer: a main buffer of ~2/3 of ram gives the best speed. That left me without an answer for what to do with -M mode. After enough playing around I decided to make sliding mmap kick in whenever the compression window is larger than 2/3 of ram, since that is noticeably faster than mapping in the whole file and never really caching anything. But I still had to pick some upper limit on how big -M would go, and 1.5 times the ram size worked out to be a reasonable compromise between how much it slows down and how much it hits swap.
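The sizing rule boils down to something like this, using sysconf to find physical ram (the names here are mine for illustration, not the actual lrzip variables):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long long ram = (long long)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
	long long window = ram * 2 / 3;		/* main buffer: best speed in testing */
	long long m_limit = ram * 3 / 2;	/* cap for -M before swap thrashing */

	printf("ram %lld, window %lld, -M limit %lld\n", ram, window, m_limit);
	return 0;
}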

The conclusion, by the way, was that lrzip can hit swap really hard, and swap doesn't speed anything up or allow any more ram to be used compared to simply disabling it. The only time swap made a difference in testing was in how much anonymous ram could be mmapped (i.e. plain free ram, as opposed to caching a file that's on disk). The linux vm pretty much lets you allocate everything you have as real ram plus swap. You should have seen what the desktop was like after it finished compressing a huge file: clicking on anything took ages, since pretty much every useful application was sitting in swap instead of ram. Since my main desktop now uses an SSD, I had to test with a real hard drive to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure, be vigilant with backups and so on, because people don't trust them, blah blah.)

Despite choosing 2/3 of ram as the amount to allocate, lrzip still tests to ensure the allocation won't fail and decreases the amount until it succeeds. It's rather tricky making sure there will be enough ram for all the compression components: enough to cache the file on disk, enough for two compression streams possibly as large as the first, and then enough to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect most things in ram to be trashed (all for a good cause, of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use sliding mmap for anything bigger than 700MB there, and that's made a big difference to how effectively a 32 bit machine can compress large files too.
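The test-then-shrink idea is roughly this (a sketch only; the real code also has to reserve ram for the compression streams and the backend compressor):

#include <stdlib.h>

size_t biggest_window(size_t wanted)
{
	while (wanted) {
		void *buf = malloc(wanted);
		if (buf) {
			free(buf);		/* it fits; use this size */
			return wanted;
		}
		wanted = wanted / 20 * 19;	/* knock ~5% off and retry */
	}
	return 0;
}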

Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it, since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but when you run it, it says "function not implemented". That's pretty sad... named semaphores are much clunkier, and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but it's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage, if I can find the enthusiasm, I might try to get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or using them selectively on osx.
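For anyone unfamiliar with the difference, a rough sketch (not the lrzip code, and the semaphore name is made up):

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main(void)
{
	sem_t unnamed;
	sem_t *named;

	/* Unnamed: works on linux, fails with ENOSYS ("function not
	 * implemented") at runtime on osx even though it builds fine. */
	if (sem_init(&unnamed, 0, 1) == -1)
		perror("sem_init");

	/* Named: the portable fallback, but it must be unlinked explicitly
	 * or it can linger until the system shuts down. */
	named = sem_open("/lrzip_example", O_CREAT, 0600, 1);
	if (named == SEM_FAILED) {
		perror("sem_open");
		return 1;
	}
	sem_close(named);
	sem_unlink("/lrzip_example");
	return 0;
}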

All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.

Lrzip project page on freshmeat

EDIT: I keep getting people asking me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB of ram. That may well be the case, but malloc and mmap in the gnu libraries on a 32 bit userspace take a size_t, which is the width of a long: 4 bytes on 32 bits and 8 bytes on 64 bits. Between that and the limited per-process address space, the most you can address in a single malloc with a 32 bit userspace is around 2^31 bytes, or a bit over 2GB.
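A trivial way to check this on your own box:

#include <stdio.h>

int main(void)
{
	/* Prints 4 on a 32 bit userspace, 8 on 64 bit; combined with the
	 * limited per-process address space, a single malloc()/mmap() on
	 * 32 bits tops out at a bit over 2GB regardless of PAE. */
	printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));
	return 0;
}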

Friday, 5 November 2010

Lrzip and sliding mmap

After my last blog I completed work on releasing a new version of lrzip. The most user visible changes would be the new file format which is slightly more compact and a little faster on encoding. Behind the scenes, though, quite a lot of work went into fixing random minor issues, improving the memory allocation code and making the maximum compression windows larger with less chance of failure by falling back as much as possible.

One other major change was to implement stdin/stdout properly in lrzip, instead of faking it with temporary files as the last version did. As you may, or more likely may not, be aware, the original rzip which lrzip was based on did not work with stdin/stdout. It was one of the most commonly requested features of lrzip, and I first implemented it by simply creating a temporary file on read-in and working off that, and a temporary file on compression which was dumped to stdout once complete. I've been thinking for a long time about how I could do it without the temporary files, as there are certain limitations. As far as stdin goes: on compression, the file being compressed needs to be mmappable, and on decompression, the compressed file needs to be seekable. Neither of those can be done on stdin. As far as stdout goes, the output file needs to be seekable/readable while it's being written, and the header needs to be written at the start of the file on compression as well. Neither of those can be done on stdout.
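A quick way to see the stdin limitation for yourself (just an illustration, not lrzip code): lseek fails with ESPIPE on a pipe, so data read from it can never be revisited.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* "prog < file" is seekable; "cat file | prog" is not. */
	if (lseek(STDIN_FILENO, 0, SEEK_CUR) == -1)
		perror("stdin is not seekable");
	else
		printf("stdin is seekable (a regular file was redirected)\n");
	return 0;
}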

After much fluffing around, the best I could do was to implement stdin without a temporary file. Since lrzip works off an mmapped file, I implemented it with an anonymous mapping that reads as much as it can from stdin before working with it. The problem is that we can't know how big the file will be while it's being read, so the anonymous mapping is made as big as possible. Some compression is lost by doing this, as you cannot map any more than the real ram that can be allocated, and the block headers have to be generous in predicting how long the next data block will be when they're written, but these are not major problems.
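The approach amounts to something like this (a simplified sketch with my own naming, not the actual lrzip read-in code):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = (size_t)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
	unsigned char *buf = MAP_FAILED;
	size_t total = 0;
	ssize_t r;

	/* Shrink the request until the anonymous mapping really succeeds. */
	while (len && buf == MAP_FAILED) {
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			len = len / 10 * 9;
	}
	if (buf == MAP_FAILED)
		return 1;

	/* Fill the mapping from stdin; total is the size of the "file". */
	while (total < len &&
	       (r = read(STDIN_FILENO, buf + total, len - total)) > 0)
		total += (size_t)r;

	fprintf(stderr, "buffered %zu bytes from stdin\n", total);
	munmap(buf, len);
	return 0;
}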

After releasing version 0.50, it didn't take long before someone spotted a bug to do with building it on OS X. It turns out that OS X doesn't have mremap implemented. So, with the help of a few people online who had OS X, we got the build problem fixed. I also noticed that, due to a wrongly made workaround for OS X not having fmemopen, even the linux builds were getting the fake fmemopen, so I fixed that and committed it to the git repo. Not enough for a new version yet.
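For the curious, one possible fallback for a missing mremap looks something like this. I'm not claiming this is the workaround lrzip actually uses; it only makes sense for anonymous mappings and loses mremap's cheapness by copying:

#include <string.h>
#include <sys/mman.h>

void *fake_mremap(void *old, size_t old_len, size_t new_len)
{
	/* Map a fresh region, copy what fits, then drop the old mapping. */
	void *new_map = mmap(NULL, new_len, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANON, -1, 0);
	if (new_map == MAP_FAILED)
		return MAP_FAILED;
	memcpy(new_map, old, old_len < new_len ? old_len : new_len);
	munmap(old, old_len);
	return new_map;
}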

Anyway, all of that got me thinking about how to manage the whole compression-window-failure issue, since I had started implementing workarounds for failing to allocate enough ram. Up to version 0.5, a compression window was "chosen" by lrzip based on a 2/3-of-ram rule, or set by the user. That was rather prone to failure for numerous obvious reasons, not least the fact that further ram needs to be allocated when the backend compression phase kicks in.

When I first started playing with lrzip as an entity, one thing struck me. The major advantage of using the rzip first stage is proportional to the size of the compression window, and that window is limited by how much of a file you can mmap. Now, with linux using virtual addressing and being an aggressive overcommitter, that can be quite a lot, but it's still limited by how much real ram you have. Also, if you dedicate all your ram to the first stage, you're left with not much ram for the backend compressor, and lzma benefits from a lot of ram too. So a few years back I half-heartedly mentioned the idea of a "sliding mmap" which would look like you have the whole file mmapped, yet would not need your entire ram dedicated to it: the "window" of the buffer you were currently accessing would simply move over that part of the file, and you'd just have a virtual offset into one massive buffer.

So I set out to build my sliding mmap. After screwing around for many hours just to sort out pointers and variable types and all sorts of other nonsense that I'm not much good with, I finally had something that started taking shape. The first thing I experimented with was one large buffer that would move as per my original concept above. That would be fine if you were mostly reading linearly from a file, but the hash search of a huge compression window does anything but: it reads from one end to the other in two streams of sorts at a time, so there were literally billions of mmaps/munmaps occurring on a large file (since that's what I was designing it for) and almost no progress was being made. So then I experimented with splitting the buffer into two half sized ones and moving just the top one. Not really any better. Then I created a mini algorithm for sliding both of those buffers up and down. Still no better. I tried tossing the mmap out completely and doing single reads from disk per byte, via either pread or fread. That was even worse. I tried mallocing just one page and moving that, and it was marginally better. I kept hoping I'd make progress, as I'm nothing if not persistent :P Finally, I found what worked best was to still have as much ram mapped as possible, since this is fast, and to have a one page sized buffer (4096 bytes) move up and down once we read beyond the base mmapped buffer.
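Put together, the scheme looks roughly like this (an illustrative sketch with my own names, not the lrzip source; error handling omitted):

#include <sys/mman.h>
#include <sys/types.h>

struct sliding_map {
	int fd;			/* the file being compressed */
	unsigned char *base;	/* big mapping of the start of the file */
	size_t base_len;	/* as much as real ram comfortably allows */
	unsigned char *page;	/* single page window beyond the base mapping */
	off_t page_off;		/* file offset the page window currently covers */
	size_t page_size;	/* sysconf(_SC_PAGESIZE), typically 4096 */
};

unsigned char sliding_get_byte(struct sliding_map *sm, off_t off)
{
	off_t aligned;

	if (off < (off_t)sm->base_len)
		return sm->base[off];	/* fast path: inside the real mapping */

	/* Slow path: slide the one page window over the wanted offset. */
	aligned = off & ~((off_t)sm->page_size - 1);
	if (!sm->page || sm->page_off != aligned) {
		if (sm->page)
			munmap(sm->page, sm->page_size);
		sm->page = mmap(NULL, sm->page_size, PROT_READ, MAP_SHARED,
				sm->fd, aligned);
		sm->page_off = aligned;
	}
	return sm->page[off - aligned];
}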

How well does it perform? Well somewhere between 100 and 1000 times slower than a real mmap once reading beyond the initial buffer, but just as fast during that initial buffer. So if you can manage to fit most of a file into ram, this will get you the rest of the way. It takes away one of the major advantages of the rzip preprocessor stage, being that it's very fast, but then again lrzip does also do the ultra slow thing when in zpaq mode. I have ideas for how I might be able to tweak it further, so we'll see :) It was fun to implement, and it's nice to know that I can claim that lrzip has "unlimited sized compression windows" now. I've since committed that to the git repo.

Probably more useful than that feature is that I put a lot more effort into maximising window sizes by throwing away the "guessing" of how big a window can be. Now it just allocates as much as it can by testing, and then does the same for the backend compression phase so it won't run out of ram later on. Simply put, the larger the window for the preprocessor stage the better, and if it's as big as the file, better still; lrzip now does its best to make it as big as possible (with -M).

So here are some numbers, choosing what lrzip is best at: a 10GB virtual image (so lots of redundant data in a big file) being compressed on a machine with 8GB ram and an SSD (this is an excerpt from the benchmarks on my site):

Compression Size      Compress   Decompress 
None      10737418240 
gzip      2772899756  05m47.35s  2m46.77s
bzip2     2704781700  20m34.26s  7m51.36s
xz        2272322208  58m26.82s  4m46.15s
7z        2242897134  29m28.15s  6m35.95s
lrzip     1354237684  29m13.40s  6m55.44s
lrzip -M  1079528708  23m44.22s  4m05.46s
lrzip -l  1793312108  05m13.24s  3m12.88s
lrzip -lM 1413268368  04m18.33s  2m54.65s
lrzip -z  1299844906  04h32m14s  04h33m
lrzip -zM 1066902006  04h07m14s  04h08m

-l mode is using lzo as the backend
-M mode uses the maximum possible ram. In this case, lrzip succeeded in fitting the entire 10GB file into one compression window.
I'd love to tell you how well zpaq fared, but it hasn't finished yet since it's so slow :P It used to take around 4 hours to do this benchmark. I'll update this post tomorrow with the numbers. (edit: done)

Hopefully I can finish cleaning a few more things up and release version 0.5.1 soon with these changes, but I need to go to bed. They're all committed into the git repo already.

lrzip.kolivas.org

UPDATE: I've played with the sliding mmap implementation some more, in line with how lrzip works, and found that if I make the smaller buffer 64k it matches the size of the search function, and if I slide the larger buffer along with the main hash search it makes for a MASSIVE speed improvement over the earlier implementation. It is in fact a usable speed now: I just tested a 5GB file on a laptop that could normally only use a 2GB window, but in unlimited mode it had an effective 5GB window and completed in 15 minutes, which is very reasonable! It decompressed in under 4 minutes. The changes have been committed to the git repo.