Friday 11 February 2011

lrzip-0.552 Random fixes.

It had been a while since lrzip had received any attention, and a number of things came up that made me look at it again. First, it wouldn't compile on FreeBSD because FreeBSD doesn't implement mremap. So I wrapped that in a fake mremap, the same way I handled the OSX one. Then OSX would compile the existing code fine but would fail to actually work: it would say "unimplemented" when you tried to run it, because OSX doesn't implement unnamed semaphores. I've mentioned this before, but named semaphores are a real pain to work with and unnamed ones are very convenient. So on OSX I wrapped all the threading calls so they don't actually thread, and all the calls to semaphore functions are ignored. It means OSX doesn't benefit from the extra threading added after version 0.530 but at least it actually works now.
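For anyone wondering what a "fake mremap" amounts to, the usual idea is: map a new region of the wanted size, copy the old contents across, and unmap the old region. This is not lrzip's actual wrapper (that is C and isn't shown here); it's only a rough Python sketch of the same idea on an anonymous mapping, and the name grow_mapping is made up for illustration.

```python
import mmap

def grow_mapping(old, new_size):
    """Emulate a resize without mremap: new mapping, copy, release the old one.

    Hypothetical helper for illustration; the C equivalent would be
    mmap() + memcpy() + munmap().
    """
    new = mmap.mmap(-1, new_size)      # fresh anonymous mapping
    keep = min(len(old), new_size)     # copy only what fits
    new[:keep] = old[:keep]
    old.close()                        # drop the old mapping
    return new

if __name__ == "__main__":
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    buf[:5] = b"lrzip"
    buf = grow_mapping(buf, 4 * mmap.PAGESIZE)
    print(len(buf), buf[:5])
```

The obvious cost compared with a real mremap is the copy, and that both mappings exist at once while it happens.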

Finally, I received an alarming report of file corruption on decompression from a user who was compressing a 20GB log. The generated archive would decompress fine with version 0.530, but with any later version the output would randomly and silently be corrupted somewhere during decompression, and it was only noticeable on comparing it to the original file. Now, try as hard as I could at home, I couldn't reproduce this bug. So with the help of this user doing possibly hundreds of tests for me, and with debugging code, I found a trail of possible causes. Disabling the decompression threading code surprisingly didn't fix it, confirming the bug was elsewhere. After an extensive search I found there were some mmap calls in lrzip that weren't being careful about being aligned to page size, hence writing to the resulting buffer would have random results. It's surprising that more corruption wasn't seen, but presumably the particular buffer in question never held that much data, so it would fit in whatever amount happened to be allocated. Presumably the larger file is what made it easier to trigger. That would certainly have explained random position failures, but it didn't really explain why it only started happening after 0.530.
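To make "aligned to page size" concrete: mappings are handed out in whole pages, so a length is normally rounded up to a page multiple and a file offset rounded down to a page boundary. Which of the mmap arguments was at fault in lrzip isn't something I'm stating here; the following is only the generic arithmetic, in Python rather than lrzip's C, and the helper names are made up for illustration.

```python
import mmap

PAGE = mmap.PAGESIZE  # page size of the running system

def round_up_to_page(length, page=PAGE):
    """Smallest multiple of `page` that is >= length."""
    return (length + page - 1) // page * page

def align_offset(offset, page=PAGE):
    """Split an offset into (page-aligned base, remainder within that page)."""
    base = offset - (offset % page)
    return base, offset - base

if __name__ == "__main__":
    wanted = 5000
    buf = mmap.mmap(-1, round_up_to_page(wanted))  # anonymous mapping
    print(PAGE, len(buf), align_offset(wanted))
    buf.close()
```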

Anyway, after converting the mmap calls to ordinary malloc calls, decompression would now actually fail with a crc error. Lrzip does a crc check on the data generated, and compares it to the crc stored in the archive. However if a string of zeroes is generated, and then the crc is also read as zero, it can look ok. I'm assuming this is why it was silently corrupting it. The crc errors at least gave me a trail of possible sources for the error. After much scratching of heads, heavy drinking and sleepless nights, I found that in the original rzip code that lrzip came from there was a function that was designed to read a certain amount of data, then return how much it had actually read. The thing is, the function that called it (unzip_literal) didn't check the return value and just assumed it always read all the data asked of it. Hitting this codepath is obviously extremely rare but there it was. Fixing that resolved the instant crc error and made it mostly reliable.
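The short-read problem is a classic one, so here is a small illustration of it. This is not the rzip/lrzip code (which is C); it's a Python sketch where TrickleStream is a made-up stream that returns fewer bytes than asked for, and read_fully is a made-up helper showing the kind of check the caller needs: loop until everything has arrived, or fail loudly.

```python
import io

class TrickleStream:
    """Hypothetical stream that hands back at most `step` bytes per read()."""
    def __init__(self, data, step):
        self._buf = io.BytesIO(data)
        self._step = step

    def read(self, n):
        return self._buf.read(min(n, self._step))

def read_fully(stream, length):
    """Read exactly `length` bytes, looping over short reads; raise if data runs out."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of data before we got everything
            raise EOFError(f"wanted {length} bytes, got {length - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

if __name__ == "__main__":
    data = b"0123456789"
    # A single read() can silently come up short...
    print(TrickleStream(data, 3).read(10))
    # ...whereas checking the amount returned recovers all of it.
    print(read_fully(TrickleStream(data, 3), 10))
```

A caller that ignores the returned length carries on with a partly filled buffer, which is consistent with the kind of rare, position-dependent corruption described above.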

Now I'd love to say it's 100% reliable, but after running the decompression hundreds of times just to be certain, it failed on one occasion. So there's still potential for failure, possibly somewhere else, in the code on decompressing extremely large archives of 20GB or more. Given the number of fixes I put into lrzip, and that there were obvious bugs in the previous version, I've released the cumulative fixes as version 0.552. I've also put a warning in there that all very large decompressed files should be checked with md5sum or equivalent to ensure they are not corrupted.
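If md5sum isn't to hand, the same check is easy to script. This is only a sketch using Python's standard hashlib, reading in chunks so a 20GB file doesn't need to fit in memory; the function names and the file names in the usage part are placeholders, not anything lrzip provides.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunk_size pieces."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_match(original, decompressed):
    """True if both files have the same MD5 digest."""
    return md5_of_file(original) == md5_of_file(decompressed)

if __name__ == "__main__":
    import sys
    # Usage (placeholder names): python check.py original.log decompressed.log
    if len(sys.argv) == 3:
        print("OK" if files_match(sys.argv[1], sys.argv[2]) else "MISMATCH")
```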

A handful of other minor fixes went in too, so this is basically a bugfix release only. So if you're using lrzip, I highly recommend upgrading to version 0.552.

Freshmeat link (sometimes takes a while for new release to appear):

Now for more debugging.
