It never ceases to amaze me that no matter how much I try and keep the memory allocation paths in check on lrzip, userspace will continue to fail me somehow. I had reports of memory allocation failures on 32 bit machines running lrzip with big files and started investigating again. Even if you have 64GB of ram running on a 32 bit kernel with PAE enabled, you can only allocate up to 2GB maximum with malloc and friends, due to ssize_t being a 32 bit signed long, giving you only 31 bits of positive to work with (which works out to 2GB). I had taken that into account, but it seems that 32 bit userspace really doesn't even like allocating more than 2GB in total for the one application.
Since lrzip can use much smaller allocations now with the use of sliding mmap, I set out to just use an even smaller base mmap on 32 bits rather than allocating up to 2GB, and relying on sliding mmap for the rest. This fixed the memory allocation errors on 32 bits, but left me with the lzma back end failing for some users out there, reporting "try decreasing window size". Even though there is plenty of ram available now, it occasionally returns with a memory error. Since I haven't actually written any of the lzma library myself, I really didn't know where to begin looking. I tried decreasing the window size passed to lzma but it was failing sometimes even on windows of only 60MB despite gigabytes of ram spare. So after much battling with it I decided to just let it fail, and then fall back to bzip2 compression if lzma failed to compress that block. This means that lrzip will generate a mixed compression file which is perfectly fine, but at least it will not leave blocks uncompressed. None of this is a problem on 64 bits mind you, but there are plenty of people running 32 bits still out there and will be for some time to come.
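The "let it fail, then fall back" idea can be sketched roughly as follows. This is only an illustration in Python using its standard lzma and bz2 modules, not lrzip's C code; the one-byte tags, the function names and the simulated failure are all assumptions made up for the example.

```python
import bz2
import lzma

# Hypothetical one-byte tags recording which back end compressed a block.
TAG_LZMA = b"L"
TAG_BZIP2 = b"B"


def compress_block(block, primary=lzma.compress):
    """Try the primary back end; fall back to bzip2 if it fails."""
    try:
        return TAG_LZMA + primary(block)
    except (MemoryError, lzma.LZMAError):
        # Let the primary fail, then compress the same block with bzip2
        # so that the block is not stored uncompressed.
        return TAG_BZIP2 + bz2.compress(block)


def decompress_block(stored):
    """Pick the decompressor from the tag written by compress_block."""
    tag, payload = stored[:1], stored[1:]
    if tag == TAG_LZMA:
        return lzma.decompress(payload)
    if tag == TAG_BZIP2:
        return bz2.decompress(payload)
    raise ValueError("unknown block tag")


def failing_lzma(block):
    """Stand-in for an lzma call that runs out of memory."""
    raise MemoryError("simulated lzma allocation failure")
```

Because every block carries its own tag in this sketch, a file that mixes lzma and bzip2 blocks decompresses without any special handling.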
The other issue that arose was that uclibc doesn't appear to support the sysconf() function fully, not returning a valid value for ram (it returns a negative value) so I had to add a fallback to reading /proc for that.
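A minimal sketch of that kind of fallback, in Python for brevity rather than lrzip's C. It assumes a Linux style /proc/meminfo with a "MemTotal:" line reported in kB; the function names are hypothetical and the sysconf and file readers are passed in so the fallback path can be exercised without a broken libc.

```python
import os


def parse_meminfo_total(text):
    """Return MemTotal from /proc/meminfo contents, in bytes, or None."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            fields = line.split()
            # The kernel reports this value in kB.
            return int(fields[1]) * 1024
    return None


def sysconf_ram():
    """Physical RAM according to sysconf(); may be negative or raise."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def read_proc_meminfo():
    with open("/proc/meminfo") as f:
        return f.read()


def physical_ram(sysconf=sysconf_ram, read_meminfo=read_proc_meminfo):
    """Use sysconf() when it gives a sane answer, else fall back to /proc."""
    try:
        ram = sysconf()
    except (ValueError, OSError, AttributeError):
        ram = -1
    if ram > 0:
        return ram
    return parse_meminfo_total(read_meminfo())
```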
The multithreaded compression (till now) worked off dividing up the size of the file by the number of threads, and then when a chunk of that size was reached, it would pass it onto a thread and then continue. The problem with this approach is the file is rzip preprocessed before passing it on to a back end compression thread, so it's usually smaller than this. I modified it to put data to threads at regular intervals during the rzip hash search instead, thus spawning threads at even timeframes to hopefully use more CPU time on SMP machines. What's interesting about this, though, is that the chunks compress markedly differently because they're different parts of the rzip processed data and can be tiny or massive after that processing. It made for only very modest improvements in compression times. It has an advantage on decompression, though, because each chunk after decompression is the same size, and it can take quite a while to write say a 2.5GB chunk at a time, thus allowing the next thread to keep working on the chunk after that while it's busy writing the first chunk.
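To illustrate why equal input intervals give unequal chunks, here is a toy sketch in Python. The run-length encoder is only a stand-in for the rzip stage (which works quite differently), bz2 stands in for the back end, and all names and the interval size are assumptions for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def toy_rle(piece):
    """Toy stand-in for rzip preprocessing: (run length, byte) pairs."""
    out = bytearray()
    i = 0
    while i < len(piece):
        run = 1
        while i + run < len(piece) and piece[i + run] == piece[i] and run < 255:
            run += 1
        out += bytes((run, piece[i]))
        i += run
    return bytes(out)


def compress_at_intervals(data, interval, preprocess, threads=4):
    """Hand preprocessed output to a worker every `interval` input bytes."""
    futures = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for start in range(0, len(data), interval):
            # Each piece covers the same amount of input, but what comes
            # out of preprocessing can be tiny or large.
            piece = preprocess(data[start:start + interval])
            futures.append(pool.submit(bz2.compress, piece))
        return [f.result() for f in futures]
```

With highly repetitive input the first chunk handed to a thread might be a few bytes while the next is larger than its input, which is the uneven behaviour described above.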
When I last posted, I was unhappy about the multithreaded decompression component so I rewrote part of that as well. Now, instead of guessing what stream is likely to be read next, it simply spawns two groups of threads, one group for each stream. The vast majority of the time, archives end up with just one block of stream 0 and lots of blocks of stream 1, so it won't make a big difference, but hopefully it's more robust than it was before. It does mean, however, that you can potentially spawn twice as many threads as you have CPUs which could be a problem with memory usage.
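The two-group arrangement could look something like this sketch. It is Python with zlib standing in for the real back ends, and the (stream, payload) block layout and the function name are assumptions, not the lrzip archive format.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_streams(blocks, threads_per_stream=2):
    """blocks: iterable of (stream_number, compressed_bytes), stream 0 or 1."""
    # One group of worker threads per stream, so nothing has to guess
    # which stream the next block belongs to.
    pools = {s: ThreadPoolExecutor(max_workers=threads_per_stream) for s in (0, 1)}
    pending = {0: [], 1: []}
    try:
        for stream, payload in blocks:
            pending[stream].append(pools[stream].submit(zlib.decompress, payload))
        # Collect each stream's results in the order they were submitted.
        return {s: [f.result() for f in futs] for s, futs in pending.items()}
    finally:
        for pool in pools.values():
            pool.shutdown()
```

Note that up to 2 * threads_per_stream workers can be alive at once here, which is the doubling of threads (and memory) mentioned above.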
So what's next for lrzip? Well I keep trying to let it stabilise instead of doing more work on it, but there are more people using it now which means I get more bug reports. That's both good and bad ;) The things I see that still need addressing are more ability to cope with failures to do with memory allocation. Unlike other compression applications which use fixed upper sized compression windows, the main challenge with lrzip is to try and use as much memory as possible without running out of memory. This means that the main challenges aren't to do with the algorithms themselves, but memory usage. Multithreading has just made that even more complex since each thread needs its own memory to work with. I need to put more work into coping with trying to allocate too much ram and then cut back on thread usage, but backtracking like that is a little tricky with the code the way it is at the moment. If you do get memory problems, try setting -p values to less than your CPUs manually.
Also, it still will not work on Mac OS X, with the last good version for that being 0.520. I need a little rest before tackling that :s
Anyway, the new version is out there, and to indicate that it's pretty much just a bug fix and that everyone should upgrade, I've only pushed the minor subversion up to 0.544.
http://freshmeat.net/projects/lrzip
A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.
Tuesday, 7 December 2010
Sunday, 5 December 2010
Automated per session task groups comment.
Against my better judgement I sent an email to lkml. Here's the transcript since it summarises my position.
---
Greets.
I applaud your efforts to continue addressing interactivity and responsiveness but, I know I'm going to regret this, I feel strongly enough to speak up about this change.
On Sun, 5 Dec 2010 10:43:44 Colin Walters wrote:
> On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
> wrote:
> > What's your point again? It's a heuristic.
>
> So if it's a heuristic the OS can get wrong,
This is precisely what I see as the flaw in this approach. The whole reason you have CFS now is that we had a scheduler which was pretty good for all the other things in the O(1) scheduler, but needed heuristics to get interactivity right. I put them there. Then I spent the next few years trying to find a way to get rid of them. The reason is precisely what Colin says above. Heuristics get it wrong sometimes. So no matter how smart you think your heuristics are, it is impossible to get it right 100% of the time. If the heuristics make it better 99% of the time, and introduce disastrous corner cases, regressions and exploits 1% of the time, that's unforgivable. That's precisely what we had with the old O(1) scheduler and that's what you got rid of when you put CFS into mainline. The whole reason CFS was better was it was mostly fair and concentrated on ensuring decent latency rather than trying to guess what would be right, so it was predictable and reliable.
So if you introduce heuristics once again into the scheduler to try and improve the desktop by unfairly distributing CPU, you will go back to where you once were. Mostly better but sometimes really badly wrong. No matter how smart you think you can be with heuristics they cannot be right all the time. And there are regressions with these tty followed by per session group patches. Search forums where desktop users go and you'll see that people are afraid to speak up on lkml but some users are having mplayer and amarok skipping under light load when trying them. You want to program more intelligence in to work around these regressions, you'll just get yourself deeper and deeper into the same quagmire. The 'quick fix' you seek now is not something you should be defending so vehemently. The "I have a solution now" just doesn't make sense in this light. I for one do not welcome our new heuristic overlords.
If you're serious about really improving the desktop from within the kernel, as you seem to be with this latest change, then make a change that's predictable and gets it right ALL the time and is robust for the future. Stop working within all the old fashioned concepts and allow userspace to tell the kernel what it wants, and give the user the power to choose. If you think this is too hard and not doable, or that the user is too uninformed or want to modify things themselves, then allow me to propose a relatively simple change that can expedite this.
There are two aspects to getting good desktop behaviour, enough CPU and low latency. 'nice' by your own admission is too crude and doesn't really describe how either of these should really be modified. Furthermore there are 40 levels of it and only about 4 or 5 are ever used. We also know that users don't even bother using it.
What I propose is a new syscall latnice for "latency nice". It only need have 4 levels, 1 for default, 0 for latency insensitive, 2 for relatively latency sensitive gui apps, and 3 for exquisitely latency sensitive uses such as audio. These should not require extra privileges to use and thus should also not be usable for "exploiting" extra CPU by default. It's simply a matter of working with lower latencies yet shorter quota (or timeslices) which would mean throughput on these apps is sacrificed due to cache trashing but then that's not what latency sensitive applications need. These can then be encouraged to be included within the applications themselves, making this a more long term change. 'Firefox' could set itself 2, 'Amarok' and 'mplayer' 3, and 'make' - bless its soul - 0, and so on. Keeping the range simple and defined will make it easy for userspace developers to cope with, and users to fiddle with.
But that would only be the first step. The second step is to take the plunge and accept that we DO want selective unfairness on the desktop, but where WE want it, not where the kernel thinks we might want it. It's not an exploit if my full screen HD video continues to consume 80% of the CPU while make is running - on a desktop. Take a leaf out of other desktop OSs and allow the user to choose say levels 0, 1, or 2 for desktop interactivity with a simple /proc/sys/kernel/interactive tunable, a bit like the "optimise for foreground applications" seen elsewhere. This could then be used to decide whether to use the scheduling hints from latnice to either just ensure low latency but keep the same CPU usage - 0, or actually give progressively more CPU for latniced tasks as the interactive tunable is increased. Then distros can set this on installation and make it part of the many funky GUIs to choose between the different levels. This then takes the user out of the picture almost entirely, yet gives them the power to change it if they so desire.
The actual scheduler changes required to implement this are absurdly simple and doable now, and will not cost in overhead the way cgroups do. It also should cause no regressions when interactive mode is disabled and would have no effect till changes are made elsewhere, or the users use the latnice utility.
Move away from the fragile heuristic tweaks and find a longer term robust solution.
Regards,
Con
--
-ck
P.S. I'm very happy for someone else to do it. Alternatively you could include BFS and I'd code it up for that in my spare time.
---
EDIT:
And just for the sake of it I hacked up what a latnice patch would look like. Of course being unsupported by userspace means there's no point me supporting and promoting this change even on BFS.
http://ck.kolivas.org/patches/bfs/latnice/
Monday, 29 November 2010
lrzip-0.543, random fixes
Yep, there are random fixes in this one.
Oh wait, you probably want more detail than that.
Lrzip splits the data into two streams during its rzip stage, one with the "dictionary" parts it ends up finding multiple copies of, and the rest. It turns out that during the multithreaded decompression modification I made the naive assumption that there would be only one chunk of stream 0 data per compressed window. Unfortunately I can't tell what the data is until I've already read it, and throwing lots of chunks into threads to decompress means I have a whole lot of data potentially in a different order to how the rzip re-expansion stage will expect it. I was loath to modify the lrzip archive format yet again so soon after changing it, so I used the help of the only person I could find reproducing a decompression problem to try and find how to predict what stream the data came from and then modified the decompression stage accordingly. I'm still not happy with it because it still feels fragile, but I need more time to investigate to ensure I'm doing the right thing. This is unlikely to hit anyone trying to decompress anything less than many many gigabytes in size (the archive is still fine, and the previous non-multithreaded one should decompress it ok).
Much more importantly for the common case usage, I limited the lrzip compression window on 0.542 to only 300MB because it seemed to break on larger windows on 32 bit. Unfortunately, I did it -before- chopping it up into smaller chunks for multithreading, and for all architectures, not just 32 bit. So I fixed that which means the default compression window should be significantly larger on 0.543.
Finally it was clear that the rate limiting step on multithreaded workloads was the rzip stage on extra large files, and since the compression/decompression can now begin as a separate thread to the rzip component, I made the rzip process run at a lower 'nice' level than the back end threads, which afforded a small speed up too.
I still have more planned for lrzip but was hoping to take a small break for a while to do other things now that it's more or less stable. In the meantime, apparently when it's run on a system with uclibc instead of glibc, sysconf() fails to report ram properly which needs fixing. That should be easy enough to work around, but is a little disappointing from uclibc. The other planned changes involve committing a more robust fix for the multiple stream 0 archives, and more multithreading improvements. Currently one chunk at a time is either compressed or decompressed at the same time as the rzip preprocessing/expansion stage and then all the threads need to complete before moving onto the next chunk. The next logical change is to move the threading even higher up the chain to be able to process multiple chunks concurrently, keeping all CPUs busy at all times. This last change is unlikely to make a huge difference on default settings, but should speed up zpaq based de/compression even more when a file is large enough (bigger than available ram).
Oh and working on OSX still needs fixing (see previous post about named semaphores).
Monday, 22 November 2010
lrzip-0.542, mmap windows and overcommit
I started out this blog entry with a lot more about the tty patch thingy and then decided against it. Suffice to say I don't like heuristics being special coded into the scheduler.
So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because after fixing it, a very large file took 1/4 the time it was taking to compress down. Overall sliding mmap has gotten a lot faster and more useful.
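For anyone unfamiliar with the idea, a sliding mmap can be sketched as a fixed-size window over a file that is remapped whenever a read falls outside it. The Python below is a much simplified illustration (lrzip's real implementation is in C and differs in detail); the class name, the maps counter and the single-byte accessor are assumptions for the example.

```python
import mmap
import os
import tempfile


class SlidingMap:
    """Read bytes from a file through a fixed-size mmap window that moves."""

    def __init__(self, fileno, file_size, window):
        gran = mmap.ALLOCATIONGRANULARITY
        # mmap offsets must be multiples of the allocation granularity,
        # so round the window down to a multiple of it (at least one unit).
        self.window = max(gran, window - window % gran)
        self.fileno = fileno
        self.file_size = file_size
        self.start = 0
        self.map = None
        self.maps = 0  # how many times the window has been (re)mapped
        self._slide(0)

    def _slide(self, offset):
        """Drop the old mapping and map the window containing `offset`."""
        if self.map is not None:
            self.map.close()
        self.start = offset - offset % mmap.ALLOCATIONGRANULARITY
        length = min(self.window, self.file_size - self.start)
        self.map = mmap.mmap(self.fileno, length, access=mmap.ACCESS_READ,
                             offset=self.start)
        self.maps += 1

    def byte_at(self, offset):
        """Return the byte at a file offset, sliding the window if needed."""
        if not self.start <= offset < self.start + len(self.map):
            self._slide(offset)
        return self.map[offset - self.start]

    def close(self):
        self.map.close()
```

A window that never slides would amount to the sort of bug described above: every read outside the first window has to be served some slower way.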
I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop and the default windows take 20 minutes to compress, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.
Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo) I wanted to check what happened with memory allocation on linux and just how much the Virtual Memory subsystem will happily give you since lrzip tests before it allocates a window. With a file backed mmap it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. Now this is a disaster because imagine reading that file linearly into ram as lrzip is working on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), it has all this stuff in ram which is now a complete waste and starts faking caching the rest. It never really has any useful amount in ram at any one time since it ends up dropping everything behind. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip just worked that way by using whatever mmap would give you.
So I spent days trying various combinations of sizes of window that would work out optimal (with and without swap) and kept coming up with the same answer: that having a main buffer of ~2/3 of the ram gives the best speed performance. It left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 ram as it speeds up the performance noticeably compared to just mapping in the whole file and never really caching anything. But I had to make up some upper limit to how big -M would work with and 1.5 times the ram size worked out to be a reasonable compromise in how much it would slow down, and how much it would hit swap.
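Those rules of thumb boil down to a few lines. This is a sketch of the policy as described here, not the actual lrzip code; the function name and the `unlimited` flag (standing in for -M) are assumptions.

```python
def plan_window(file_size, ram, unlimited=False):
    """Return (window_bytes, use_sliding_mmap) from the 2/3 and 1.5x rules."""
    main_buffer = ram * 2 // 3
    if unlimited:
        # -M style: as large as the file, capped at 1.5 times the ram size.
        window = min(file_size, ram * 3 // 2)
    else:
        window = min(file_size, main_buffer)
    # Sliding mmap kicks in whenever the window is larger than 2/3 of ram.
    return window, window > main_buffer
```

For example, with 6GB of ram and a 50GB file this gives a 4GB window without sliding mmap by default, and a 9GB window with sliding mmap in the unlimited case.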
Lrzip can hit swap real hard was the conclusion by the way... and it's not like it speeds anything up or allows any more ram to be used compared to just disabling it. The only time swap made a difference to the testing was how much anonymous ram could be mmapped (as in just free ram versus caching a file that's on disk). The linux vm allows you to pretty much allocate everything you have as real ram and swap. You should see what the desktop was like after it finished compressing a huge file - clicking on anything took ages since pretty much every useful application was sitting on swap instead of in ram. I had to test it with a real HD since my main desktop now uses an SSD to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure be vigilant with backups and so on cause people don't trust them blah blah).
Despite choosing 2/3 as the amount of ram to allocate, lrzip still actually tests to ensure it won't fail and then decreases the amount till it succeeds. It's rather tricky making sure there will be enough ram for all the compression components because it needs enough ram to allocate to caching the file on disk, enough for 2 compression streams possibly as large as that first one, and then enough ram to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect that most things in ram will be trashed (all for a good cause of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use the sliding mmap for anything bigger than 700MB there and that's made a big difference to how effectively a 32 bit machine can compress large files too.
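The test-then-shrink loop can be sketched like this. It is Python using an anonymous mmap as the allocation; the 10% shrink step, the 1MB floor and the function names are assumptions picked for the example, not values taken from lrzip.

```python
import mmap


def anonymous_map(size):
    """Default allocator: an anonymous mmap of `size` bytes."""
    return mmap.mmap(-1, size)


def allocate_with_backoff(wanted, allocate=anonymous_map, shrink=0.9, floor=1 << 20):
    """Try `wanted` bytes, then progressively less, until it succeeds."""
    size = wanted
    while True:
        try:
            return allocate(size), size
        except (MemoryError, OSError, ValueError, OverflowError):
            if size <= floor:
                raise MemoryError("could not allocate even the minimum window")
            # Cut the request back and try again, never going below floor.
            size = max(floor, int(size * shrink))
```

The caller gets back both the buffer and the size that actually succeeded, so later stages can budget their own memory from what is left.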
Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but then when you run it, it says function not implemented. That's pretty sad... named semaphores are much clunkier and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but that's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage if I can find the enthusiasm, I might try and get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or use them selectively on osx.
All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.
Lrzip project page on freshmeat
EDIT: I keep getting people asking me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB ram. That may well be the case, but the gnu libraries on the 32 bit userspace take a size_t on malloc and mmap, and they are the size of a signed long, which is 4 bytes on 32 bits, and 8 bytes on 64 bits. So the most you can address in one malloc with 32 bit userspace is up to 31 bits, or a bit over 2GB.
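The 31 bit arithmetic is easy to check. The snippet below is plain integer arithmetic in Python, with 2GB read as 2 GiB (about 2.15GB in decimal units).

```python
GIB = 1 << 30

# Largest value of a signed 32 bit integer: 31 bits of magnitude.
max_signed_32 = (1 << 31) - 1
# Largest value of a signed 64 bit integer, for comparison.
max_signed_64 = (1 << 63) - 1

print(max_signed_32)               # 2147483647
print((max_signed_32 + 1) // GIB)  # 2
```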
So back to lrzip progress. I found a lovely little bug in it which would make the sliding mmap buffer not slide from the 2nd compression window onwards. It was a lovely bug to find because after fixing it, a very large file took 1/4 the time it was taking to compress down. Overall sliding mmap has gotten a lot faster and useful.
I did a bit of further investigating to see if it was fast enough to enable unlimited compression windows by default. It hasn't quite gotten that much faster, but it's very usable. I tried it on that 9.1GB all-kernel archive on a 32 bit laptop and the default windows take 20 minutes to compress, while unlimited sized windows with sliding mmap took 80 minutes. Not bad given how much more compression it ends up giving.
Now because I run my main desktop with swap disabled (swap is a steaming pile of dodo's doodoo) I wanted to check what happened with memory allocation on linux and just how much the Virtual Memory subsystem will happily give you since lrzip tests before it allocates a window. With a file backed mmap it turns out you can allocate whatever you bloody well like. It happily let me mmap 50GB on a machine with 4GB ram. Now this is a disaster because imagine reading that file linearly into ram as lrzip is working on it. As soon as it stops being able to fit any more into real ram (which happens at about 2/3 of the ram used), it has all this stuff in ram which is now a complete waste and starts faking caching the rest. It never really has any useful amount in ram at any one time since it ends up dropping everything behind. Then if you read up to the end of the file in short bursts, and back again (as lrzip does), it gets worse and worse. Unfortunately, the -M mode in lrzip just worked that way by using whatever mmap would give you.
So I spent days trying various combinations of sizes of window that would work out optimal (with and without swap) and kept coming up with the same answer: that having a main buffer of ~2/3 of the ram gives the best speed performance. It left me without an answer for what to do with -M mode. After enough playing around I decided to make the sliding mmap mode kick in whenever the compression window is larger than 2/3 ram as it speeds up the performance noticeably compared to just mapping in the whole file and never really caching anything. But I had to make up some upper limit to how big -M would work with and 1.5 times the ram size worked out to be a reasonable compromise in how much it would slow down, and how much it would hit swap.
Lrzip can hit swap real hard was the conclusion by the way... and it's not like it speeds anything up or allows any more ram to be used compared to just disabling it. The only time swap made a difference to the testing was how much anonymous ram could be mmapped (as in just free ram versus caching a file that's on disk). The linux vm allows you to pretty much allocate everything you have as real ram and swap. You should see what the desktop was like after it finished compressing a huge file - clicking on anything took ages since pretty much every useful application was sitting on swap instead of in ram. I had to test it with a real HD since my main desktop now uses an SSD to see how it would affect regular users. By the way, if you can afford an SSD for your desktop, get one ASAP. It's the greatest desktop speedup I've seen in years. (Sure be vigilant with backups and so on cause people don't trust them blah blah).
Despite choosing 2/3 as the amount of ram to allocate, lrzip still actually tests to ensure it won't fail and then decreases the amount till it succeeds. It's rather tricky making sure there will be enough ram for all the compression components because it needs enough ram to allocate to caching the file on disk, enough for 2 compression streams possibly as large as that first one, and then enough ram to dedicate to the back end compression phase. All in all it's a real ram hog. If you compress a big file with lrzip, expect that most things in ram will be trashed (all for a good cause of course). Making sure it doesn't fail on 32 bits is also rather annoying, but I now use the sliding mmap for anything bigger than 700MB there and that's made a big difference to how effectively a 32 bit machine can compress large files too.
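The "test, then decrease till it succeeds" idea can be sketched roughly like this. This is an illustration only, not what lrzip does internally: the function name `alloc_window`, the 10% step and the `min_size` floor are all assumptions of mine, and it only covers one anonymous mapping rather than the full set of buffers described above.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try to map `want` bytes of anonymous memory and keep
 * shrinking the request by roughly 10% until mmap succeeds or the size drops
 * below `min_size`. On success the size obtained is stored in *got.
 */
static void *alloc_window(size_t want, size_t min_size, size_t *got)
{
    size_t size = want;

    while (size > 0 && size >= min_size) {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) {
            *got = size;
            return p;
        }
        /* Back off; never step by zero so the loop always terminates. */
        size_t step = size / 10;
        if (step == 0)
            step = 1;
        size -= step;
    }
    *got = 0;
    return NULL;
}
```

A caller would release the mapping with `munmap(p, got)` once done. Keep in mind that a successful mmap only says the address space was handed out, not that the pages will all fit in real ram, which is why the 2/3 starting point still matters.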
Lrzip has been building successfully on and off on mac osx for a while now. Every 2nd or 3rd release I break it since I can't test it myself. It turns out that I recently broke it again. When I added the massive multithreading capabilities to the compression side, I used a fairly standard posix tool, the unnamed semaphore. It turns out that mac osx just doesn't support them. It builds fine, but then when you run it, it says function not implemented. That's pretty sad... named semaphores are much clunkier and according to the manpage "POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down." I'm not sure if that's adhered to, but that's rather awkward if you don't clean up after yourself very well when you abort your program. At some stage if I can find the enthusiasm, I might try and get the multithreaded lrzip working on osx by converting the unnamed semaphores to named ones, or use them selectively on osx.
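For what it's worth, using named semaphores selectively on osx could look something like the sketch below. The `xsem` wrapper and its functions are hypothetical names, not anything in lrzip. On osx it opens a named semaphore and unlinks the name straight away, so nothing is left behind in the kernel if the program aborts; everywhere else it uses a plain unnamed semaphore.

```c
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical wrapper so callers don't care which kind of semaphore it is. */
struct xsem {
    sem_t *sem;     /* what sem_wait()/sem_post() should be called on */
    sem_t  storage; /* backing store for the unnamed case */
    int    named;   /* non-zero when sem came from sem_open() */
};

static int xsem_init(struct xsem *s, unsigned int value)
{
#ifdef __APPLE__
    /* Unnamed semaphores are not implemented there, so use a named one. */
    static int counter; /* not thread safe; call from one thread only */
    char name[32];

    snprintf(name, sizeof(name), "/xsem-%ld-%d", (long)getpid(), counter++);
    s->sem = sem_open(name, O_CREAT | O_EXCL, 0600, value);
    if (s->sem == SEM_FAILED)
        return -1;
    sem_unlink(name); /* drop the name now; the semaphore lives until sem_close */
    s->named = 1;
    return 0;
#else
    if (sem_init(&s->storage, 0, value) != 0)
        return -1;
    s->sem = &s->storage; /* don't copy the struct after this point */
    s->named = 0;
    return 0;
#endif
}

static int xsem_destroy(struct xsem *s)
{
    return s->named ? sem_close(s->sem) : sem_destroy(s->sem);
}
```

The rest of the code would then call `sem_wait(s.sem)` and `sem_post(s.sem)` as usual. On older glibc versions this needs linking with -pthread.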
All in all I'm very pleased with how lrzip has shaped up. I went back and tested the huge kernel tarball with zpaq multithreaded compression and managed to get 9.1GB down to 116MB in just 40 minutes with zpaq on an external USB drive. Gotta be happy with that :) It's a shame people don't find this remotely as interesting as anything I have to say on the linux kernel.
Lrzip project page on freshmeat
EDIT: I keep getting people asking me what the big deal is with 32 bits and lrzip, since 32 bits with PAE should be able to address up to 64GB ram. That may well be the case for the kernel, but each process in 32 bit userspace still only has 32 bit pointers. The gnu libraries take a size_t for the length on malloc and mmap, which is the size of a long: 4 bytes on 32 bits, and 8 bytes on 64 bits. size_t itself is unsigned, but its signed counterpart ssize_t only has 31 bits of positive range, so the most you can reliably address in one malloc with 32 bit userspace is 31 bits, or 2GB.