Tuesday 7 December 2010

lrzip-0.544 and fudging with memory allocation on 32 bits

It never ceases to amaze me that no matter how much I try and keep the memory allocation paths in check on lrzip, userspace will continue to fail me somehow. I had reports of memory allocation failures on 32 bit machines running lrzip with big files and started investigating again. Even if you have 64GB of ram running on a 32 bit kernel with PAE enabled, you can only allocate up to 2GB maximum with malloc and friends, due to ssize_t being a 32 bit signed long, giving you only 31 bits of positive to work with (which works out to 2GB). I had taken that into account, but it seems that 32 bit userspace really doesn't even like allocating more than 2GB in total for the one application.

Since lrzip can use much smaller allocation now with the use of sliding mmap, I set out to just use even smaller base mmap on 32 bits rather than allocating up to 2GB, and relying on sliding mmap for the rest. This fixed the memory allocation errors on 32 bits, but left me with the lzma back end failing for some users out there, reporting "try decreasing window size". Even though there is plenty of ram available now, it occasionally returns with a memory error. Since I haven't actually written any the lzma library myself, I really didn't know where to begin looking. I tried decreasing the window size passed to lzma but it was failing sometimes even on windows of only 60MB despite gigabytes of ram spare. So after much battling with it I decided to just let it fail, and then fall back to bzip2 compression if lzma failed to compress that block. This means that lrzip will generate a mixed compression file which is perfectly fine, but at least it will not leave blocks uncompressed. None of this is a problem on 64 bits mind you, but there are plenty of people running 32 bits still out there and will be for some time to come.

The other issue that arose was that uclibc doesn't appear to support the sysconf() function fully, not returning a valid value for ram (it returns a negative value) so I had to add a fallback to reading /proc for that.

The multithreaded compression (till now) worked off dividing up the size of the file by the number of threads, and then when a chunk of that size was reached, it would pass it onto a thread and then continue. The problem with this approach is the file is rzip preprocessed before passing it on to a back end compression thread, so it's usually smaller than this. I modified it to put data to threads at regular intervals during the rzip hash search instead, thus spawning threads at even timeframes to hopefully use more CPU time on SMP machines. What's interesting about this, though, is that the chunks compress markedly differently because they're different parts of the rzip processed data and can be tiny or massive after that processing. It made for only very modest improvements in compression times. It has an advantage on decompression, though, because each chunk after decompression is the same size, and it can take quite a while to write say a 2.5GB chunk at a time, thus allowing the next thread to keep working on the chunk after that while it's busy writing the first chunk.

When I last posted, I was unhappy about the multithreaded decompression component so I rewote part of that as well. Now, instead of guessing what stream is likely to be read next, it simply spawns two groups of threads, one group for each stream. The vast majority of the time, archives end up with just one block of stream 0 and lots of blocks of stream 1, so it won't make a big difference, but hopefully it's more robust than it was before. It does mean, however, that you can potentially spawn twice as many threads as you have CPUs which could be a problem with memory usage.

So what's next for lrzip? Well I keep trying to let it stabilise instead of doing more work on it, but there are more people using it now which means I get more bug reports. That's both good and bad ;) The things I see that still need addressing are more ability to cope with failures to do with memory allocation. Unlike other compression applications which use fixed upper sized compression windows, the main challenge with lrzip is to try and use as much memory as possible without running out of memory. This means that the main challenges aren't to do with the algorithms themselves, but memory usage. Multithreading has just made that even more complex since each thread needs its own memory to work with. I need to put more work into coping with trying to allocate too much ram and then cut back on thread usage, but backtracking like that is a little tricky with the code the way it is at the moment. If you do get memory problems, try setting -p values to less than your CPUs manually.

Also, it still will not work on macOSX, with the last good version for that being 0.520. I need a little rest before tackling that :s

Anyway, the new version is out there, and to indicate that it's pretty much just a bug fix and that everyone should upgrade, I've only pushed the minor subversion up to 0.544.

http://freshmeat.net/projects/lrzip

4 comments:

  1. 7zip needs really a lot of memory to compress a chunk. As a baseline, you can consider it needs roughly x12 memory, which means for a 60MB chunk that it will need about 700MB.
    This is really a lot for a 32bit system, especially if you want to scatter several threads, then memory usage adds up.

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. As a side comment, regarding your point on memory usage with lot of threads,
    it reminds me of an old compression project which was based on efficient memory usage with large dictionary sizes.

    It's called Etincelle, with a working win32 demo available here :
    http://phantasie.tonempire.net/pc-compression-f2/etincelle-fast-efficient-compression-x280-82mb-s-t102.htm#160

    The nice thing about Etincelle is that it uses a 2MB table, whatever the dictionary size (which could be the size of chunk it is fed with). Therefore, its memory footprint is pretty much under control, even with many threads : it will simply need about 2MB per thread, (plus obviously input & output buffers).

    Etincelle still needs some code clean up to reach open source status; but obviously, if you think it could be an interesting back end test for your project, it's reason enough to speed-up its release.

    ReplyDelete
  4. Thanks for your comments. Knowing it needs 12x the ram is VERY helpful. The rzip first stage already does use very little ram to make the most of big windows as you described with your own app which sounds very interesting. Compressing with bzip2 as a backend on lrzip is pretty efficient with ram usage but I'm trying for bigger and better :P I'll certainly take this extra 12x info on board.

    ReplyDelete