Monday 26 September 2011

lrzip-0.608

Right on the heels of the lrzip 0.607 bugfix release is yet another release, this time with a performance boost!

http://lrzip.kolivas.org

TL;DR: Faster compression on lrzip

When I converted lrzip to work with stdin/stdout and with effectively unlimited mmap windows, I needed to add an awful lot of layering to the internal data access: on every access, the code has to decide whether to read the data directly from the original mmapped region of ram, to mmap a new region of ram, or to use the smaller moving upper mmap over just the portion being accessed. While this made it possible to implement features that were originally considered virtually impossible, it unfortunately also slowed things down substantially during the rzip dictionary search stage.
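
For reference, the bookkeeping behind the sliding mmap boils down to something like this (a simplified sketch inferred from the field names in the snippets below, not the exact lrzip struct):

    /* Simplified sketch of the sliding buffer state used by get_sb() below */
    static struct sliding_buffer {
        uchar *buf_low;    /* large mapping covering the low end of the file */
        uchar *buf_high;   /* small window remapped wherever it is needed */
        i64 offset_low;    /* file offset that buf_low begins at */
        i64 offset_high;   /* file offset that buf_high begins at */
        i64 size_low;      /* size of the low mapping */
        i64 size_high;     /* size of the high window */
        i64 offset_search; /* current position of the rzip search */
    } sb;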

For example, a simple memcpy had to turn into this convoluted byte-by-byte loop:


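        /* Copy n bytes into the stream buffer one byte at a time, since
         * each byte may live in a different mapping */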
        for (i = 0; i < n; i++) {
            memcpy(sinfo->s[stream].buf + sinfo->s[stream].buflen + i,
                   get_sb(control, p + i), 1);
        }

As you can see, it had to go through the get_sb() function byte by byte to work out where each byte actually lived in memory. I had tried my best to make get_sb() as small as possible since I knew this was a hot path, and it had ended up looking like this:

static uchar *get_sb(rzip_control *control, i64 p)
{
    i64 low_end = sb.offset_low + sb.size_low;

    if (unlikely(sb.offset_search > low_end))
        remap_low_sb(control);
    if (p >= sb.offset_low && p < low_end)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));
}

Not too bad, but clearly this is going to cost a heck of a lot more than just doing a simple memcpy!

Anyway, it had concerned me that the more "features" I added to lrzip, the slower it got. Of course this layering is what allows unlimited compression windows, making for unparalleled compression ratios on massive files with long range redundancy, but most of the time people won't be using that feature, and the extra code just costs them performance :\

A little bit of playing with profiling recently showed me that the situation was even worse than I had imagined: during the rzip phase, get_sb() accounted for 37% of the consumed CPU cycles! A set of proxy functions seemed the simplest answer, using different versions of memcpy and get_sb depending on whether the file fits in ram or not.

static uchar *sliding_get_sb(rzip_control *control, i64 p)
{
    if (p >= sb.offset_low && p < sb.offset_low + sb.size_low)
        return (sb.buf_low + p - sb.offset_low);
    if (p >= sb.offset_high && p < (sb.offset_high + sb.size_high))
        return (sb.buf_high + (p - sb.offset_high));
    /* p is not within the low or high buffer range */
    remap_high_sb(control, p);
    return (sb.buf_high + (p - sb.offset_high));
}

static uchar *single_get_sb(rzip_control *control, i64 p)
{
    return (sb.buf_low + p);
}

/* We use a pointer to the function we actually want, and only select the
 * sliding mmap version when we need sliding mmap functionality, since this
 * is a hot function during the rzip phase */
static uchar *(*get_sb)(rzip_control *control, i64 p);

static void sliding_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    i64 i;

    for (i = 0; i < len; i++)
        memcpy(buf + i, sliding_get_sb(control, offset + i), 1);
}

static void single_mcpy(rzip_control *control, unsigned char *buf, i64 offset, i64 len)
{
    memcpy(buf, sb.buf_low + offset, len);
}

/* The sliding get_sb only lets us safely access one byte at a time, so we
 * take the same function pointer approach with memcpy: when the whole file
 * is in ram, one large memcpy is much faster than numerous 1 byte memcpys */
static void (*do_mcpy)(rzip_control *control, unsigned char *buf, i64 offset, i64 len);
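
The last piece is wiring the pointers up. Here's a minimal sketch of the selection logic, assuming a flag that says whether the whole file was mapped into ram in one go (the function name and flag are illustrative, not the actual lrzip code):

    /* Illustrative sketch: pick the fast single-buffer versions when the
     * whole file fits in one mapping, and the sliding versions otherwise */
    static void init_sb_accessors(int whole_file_in_ram)
    {
        if (whole_file_in_ram) {
            get_sb = single_get_sb;
            do_mcpy = single_mcpy;
        } else {
            get_sb = sliding_get_sb;
            do_mcpy = sliding_mcpy;
        }
    }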


Voila, a 30% speedup for the rzip stage on small files. This does not bring us back to the previous performance, but close to it. Unfortunately, the extra level of indirection actually slowed things down even more on larger files. Then I discovered I could move one compare out of the get_sb function and recover that last bit of speed, so overall the large file rzip stage is now actually slightly faster too.
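
For the curious, the compare that moved is the remap_low_sb() check visible in the original get_sb() above. Hoisted out of the hot function, it only needs to run when the search position advances, along the lines of this sketch (advance_search() is illustrative, not the actual lrzip code):

    /* Illustrative sketch: do the low buffer remap check once per advance
     * of the search position instead of on every single byte access */
    static inline void advance_search(rzip_control *control, i64 n)
    {
        sb.offset_search += n;
        if (unlikely(sb.offset_search > sb.offset_low + sb.size_low))
            remap_low_sb(control);
    }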

Finally, I decided not to install the bash completion script by default, since some distributions ship a bash completion package and the two conflict on install. The script is still included in the tarball, just not installed automatically. A couple of other trivial changes, and I figured I should just release a stable, faster version, so here's lrzip 0.608 (when freshmeat syncs up).

http://freshmeat.net/projects/long-range-zip
