Monday, December 23, 2013

Zhuff v0.9, or a first FSE application

 As a quick follow up to last week release of FSE, I wanted to produce a quick demo to show the kind of gain to expect with this new entropy algorithm.

And a simple example shall be Zhuff.

Zhuff, is basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, as an inheritance from LZ4, but that's something we'll talk again in a later post.

For the time being, my only intention is to swap huff0 with FSE, and look at the results.

And the results are positive.
Here are some benchmark numbers, on a Core i5-3340M @2.7GHz, Windows  Seven 64 bits, using single-thread, fast algorithm

Filename Compressor   Ratio Decoding speed
silesia  zhuff v0.8   2.783    275 MB/s
silesia  zhuff v0.9   2.805    311 MB/s

enwik8   zhuff v0.8   2.440    200 MB/s
enwik8   zhuff v0.9   2.460    230 MB/s

calgary  zhuff v0.8   2.672    220 MB/s
calgary  zhuff v0.9   2.693    250 MB/s

(Note : I've not mentioned improvements in compression speed, they are measurable but unrelated to FSE, so out of this report)

So, in a nutshell, we have a moderate (>10%) increase of decoding speed, and a small improvement to compression ratio. Nothing earth-shattering, but still worthy to grab.
The situation would have been even better for FSE if higher probabilities had to be compressed, but Zhuff is a simple fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".

The purpose of this demo was just to show that FSE can easily replace and improve results compared to Huffman. As a side-effect, it makes Zhuff an interesting playground to test future FSE improvements.

With this part now settled, the next articles shall describe FSE principles, especially how and why it works.

Monday, December 16, 2013

Finite State Entropy - A new breed of entropy coder

 In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.

The purpose of the entropy stage is to reduce a set of flags/symbol to their optimal space given their probability. As a simple example, if a flag has 50% to be set, you want to encode it using 1 bit. If a symbol has 25% probability to have value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.

A solution to this problem has been worked for decades, starting with Claude Shannon own work, which were efficient but not optimal. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.

Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : an hidden assumption is that a symbol must be encoded using an integer number of bits. To say it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has 90% probability to happen for example should be encoded using 0.15 bits. You can't do that using Huffman trees.

A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of Arithmetic coder. Explaining how it works is outside of the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding,  solved the fractional bit issue of Huffman, and with only some minimal losses to complain about due to rounding, get closer to Shannon limit. So close in fact that entropy encoding is, since then, considered a "solved problem".

Which is terrible because it gives the feeling that nothing better can be invented.

Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications, and divisions, and strictly defined rounding errors.
This is serious requirement for CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.

Even today, with most of the patent issues cleared, modern CPU will still take a hit due to the divisions. Optimized versions can sometimes get away with the division during the encoding stage, but not the decoding stage (with the exception of the Binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are quite slower than Huffman ones. For low-end or even "retro" CPU, it's simply out of range.

It's been a long time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPU. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPU.

So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading into the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an update version of his paper, which should be easier to understand). Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.

And here is the result today, the Finite State Entropy coder, which is proposed in a BSD-license package at Github.

In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.

Benchmark platform : Core i5-3340M (2.7GHz), Window Seven 64-bits
Benchmarked file : win98-lz4-run
Algorithm Ratio Compression Decompression
FSE       2.688   290 MS/s     415 MS/s
zlib      2.660   200 MS/s     220 MS/s

Benchmarked file : proba70.bin
Algorithm Ratio Compression Decompression
FSE       6.316   300 MS/s     420 MS/s
zlib      5.575   250 MS/s     270 MS/s

Benchmarked file : proba90.bin
Algorithm Ratio Compression Decompression
FSE       15.21   300 MS/s     420 MS/s
zlib      7.175   250 MS/s     285 MS/s

As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.

I'm quite please with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.


A little bit of History :

Jarek Duda's ANS theory was first published in 2007, and the paper received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory, and morph it into a working code. However, Matt concentrated on the only use case of interest to him, the Binary version, called ABS, limited to 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.

FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabet (>2), concentrating its perspective as a viable alternative to block-based Huffman.

Thanks to promising early results from FSE, Jarek concentrated back its attention on multi-symbol alphabet. As we were chatting about perspectives and limitations of ANS, I underlined the requirement of a decoding table as a memory cost, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took the challenge, and successfully created a new variant. He then published an updated version of his paper. The new method would be called rANS. He would later invent the terms tANS and rANS to distinguish the different methods.

rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.

But as said, this is not the direction selected for FSE, created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note such radically different implementations can emerge from a common starting theory.

For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.

Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.

Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.


List of Blog posts explaining FSE algorithm :

[Edit] : replaced huff0 by zlib on the comparison table
[Edit] : added entry on rANS variants by Fabian & Charles
[Edit] : added list of blog entries

Saturday, October 19, 2013

Compression stream (part 3)

 After taking care of the decompression stream, it's now time to have a look at the compression side.

As stated earlier, a compression stream delivering independent blocks is fairly easy to setup. So the objective of today is to define a proper behavior using chained block methodology, which uses data from previous blocks to better compress the next one. By extension, this strategy also works with static dictionaries.

An ideal compression stream should be able to :
- accept input from any source and any size
- deliver compressed output into destination buffer of any size.
- use as little working memory as possible

The last point is the most interesting one.
A (relatively) easy way to create a streaming interface is to use intermediate buffers, for both input and output. These buffers would easily absorb side effects, such as : providing a ridiculously small output buffer, channeling more input than output, etc.
Obviously, this methodology has downsides too, such as performance slowdown, due to the need to copy data from and into intermediate buffers. But more importantly, it increases memory footprint.

Let's focus on the output buffer for now.

It's relatively easy to skip the intermediate output buffer in the simple case when the remaining space into the destination buffer is "large enough". Then, it's possible to compress directly into the destination buffer, because the compressed block can fit in. It saves a useless memory copy operation, and potentially a memory allocation, if the intermediate output buffer is not yet allocated.

"Large enough" condition is met when one of these conditions is reached :
- Remaining space is larger than a full block size.
- Remaining space is larger than input.

However, if none of these conditions is fulfilled, it becomes harder to decide.

If remaining space into output buffer is smaller than input size (or block size), we could eventually try compression, just in case it would work. After all, if compression ratio is, for example 2:1, an output buffer with barely more than half the size of input could be enough.
The problem is, if it's not enough, then the compression operation will fail, and compression tables would become unusable. This is a major issue. In contrast with decoding, we can't "try and retry" compression, due to the modifications of compression tables.

A potential idea to solve this issue could be to duplicate compression tables before attempting the compression. However, let's remind that the objective is to perform better than "compressing into an intermediate buffer", to avoid some memory copy and allocation. Duplicating the compression tables would create the very same issues, it's doubtful if this technique achieve any saving.

Another idea is to dynamically adapt the amount of input to compress. Basically, since the worst case is that input may be not compressible, compress as much input data as there is remaining space into output buffer. This is possible, because blocks don't need to be full, they just have a "maximum size" (currently, selectable from 64KB to 4MB).

OK, this method works, up to a point.

As stated previously, LZ4 is currently able to produce compressed blocks, which need to be "properly terminated". Small blocks compress worse than larger ones. Fortunately, chained block is an excellent methodology to mitigate this effect.

However, when block sizes start to be become very small (for example, < 100 bytes), some inefficiencies become sensible. For example, the last 5 bytes of any block must be uncompressed. A match cannot start at a distance < 12 bytes from the end of the block. Each block starts with a 4 bytes header, and potentially ends with a 4 bytes checksum. These costs are small, but repeated small blocks will create them more often.

Consequently, there is a "trade-off" to find.

From a compression ratio perspective, an intermediate buffer is more optimal. For example, if the user application provides a destination buffer of 150 bytes, it's likely preferable to compress first into a (larger) temporary buffer, and then deliver the result 150 bytes at a time.

Now, it's possible to imagine some embedded systems with very limited available memory, where allocating another temporary buffer is not possible. In this case, these systems might prefer the inefficiency of the "small blocks" method, rather than no compression at all.

So, maybe, the better way is to provide both methods, letting the programmer select which one works best for his use case.

In the end, trying to manage the complexity of all border cases might be a bit too large for a first delivery. After all, the first delivery is not the last, it's still possible to provide improvements in future releases.

So, a probably faster way to deliver a working streaming interface is to work with intermediate buffers methodology by default, avoiding them in the most obvious use cases. These use cases, by the way, should be the most common ones. By properly documenting them, it increases the odd that the Streaming interface is used in a more optimal way.

Friday, September 27, 2013

Stream Decoding (part 2)

 I've been considering a complementary idea since writing the first part of LZ4 streaming interface.

A major point which bothered me was the need to shuffle data around, and the impossibility to decode data directly at final destination. Let's go into detail.

Currently, the LZ4 decoding functions either accept of refuse references outside of current block. Refusing is the default behavior, compatible with independent blocks.

For chained block, however, there is a major restriction. The decoding function will expect to find data from previous block *just before* current block. The good news is that it makes the decoding function fairly simple. The downside is that it pushes some requirements into user territory.

Addressing this requirement is now a major concern. It can be accomplished by decoding into an intermediate buffer, where the last decoded 64KB can be saved, and decompression operation can be performed in a contiguous space after it.

Decoding into an intermediate buffer makes the work done, but also results in the following issues :
1) It's necessary to allocate more memory for stream-decoding. Memory-constrained systems may not afford it. Worse, this cost is multiplied by the number of streams decoded in parallel.
2) It's necessary to copy decoded data, from the intermediate buffer to final destination. This is obviously bad for performance.
3) Keeping the last decoded 64KB just before the new block to decode require either to repeatedly copy 64KB at this place, which can become a serious waste for small blocks (think about small network packets), or require an even bigger intermediate buffer to limit the number of times data is copied.

Trying to fix these issues, I came up with a different design, with much more desirable properties.

The main idea is to work with a rotating memory buffer. It reduces the memory requirement for stream-decoding to just and only 64KB.

Even more importantly, it authorizes to decode directly at final destination, with no requirement regarding size or continuity. This is way more flexible for a user program.

To reach these properties, it's necessary to create a new family of decoding functions, able to use an external dictionary. The main idea is : when a reference is generated "outside of current block", look into another memory space, provided as an input argument.

As a positive side effect, this capability saves a lot of memory for server systems, handling many clients' connections. When using full adaptive streams, the memory requirement is only 64KB per decoding stream.
But even better, when using a static dictionary (which means, each block is compressed with the same prefix data), memory requirement per stream is reduced to zero. All decoding streams will directly read into the same memory block. For systems handling huge number of streams in parallel, it is a major memory allocation saver.

With the main properties of the decoding stream settled, now is the time to talk about the compressing stream.

(To be continued...)

Wednesday, September 11, 2013

Towards a Streaming Interface for LZ4

 After settling a Streaming Format for LZ4, it seems about time to define a proper API for it.

LZ4 is quite stable in "block mode" scenarios. They are quite simple. "Take a memory zone here, and compress it there"; with obviously, its reverse operation : "take a compressed memory zone there, and decompress it here". These scenarios and close variants seem today properly supported.

The streaming API objective is to extend usages beyond block mode.
The API shall serve a fairly typical communication scenario between distant machines, in which data may not be easily prepared into a single source buffer (in case of multiple sources), and must be sent at arbitrary (application-defined) moments, with the goal to reduce latency.
It's not necessarily the only scenario the API will serve, but "live communication" is interesting enough; it requires a strong set of properties, difficult to fulfill, and is therefore a likely superset of many other, simpler, use cases.

A streaming interface requires to define a direction. For each direction, there is a data producer, acting as a compressing pipe, and a data consumer, requiring a decompressing pipe. (A full bi-directional communication, therefore, requires 2 directions.)
As a consequence, two roles, and two interfaces, have to be defined.

From a definition perspective, starting from the end can be more instructive, in order to properly see and take into consideration desirable properties. So let's start by the decoding side.

Streaming API for decompression

The LZ4 streaming format describes a stream as a series of blocks.
The fundamental assumption is that each block is complete, to be decoded.
This can be easily checked, thanks to the block prefix, which tells its size. When enough data is provided, the compressed block can be considered "complete", and decompression can be performed.

The main advantage of this method is that it can use existing field-tested decoding functions. However, it also introduces a few limitations.

  • Unfinished input block :
    If the full compressed block is available directly from source buffer, there is no problem.
    But if not, there is a need to copy the piece of block into a temporary buffer, in order to complete it later on.

    This situation could for example happen in a packet scenario, where each packet might be small (for example about 1500 bytes over Ethernet), and several ones are needed to represent even a small compressed block.

    Nonetheless, just the possibility that it "may" happen translates into a requirement to allocate some memory for a temporary input buffer, just in case. Another more complex solution would be to allocate it "just in time", which is, the first time an unfinished block is received.

    Fortunately, the streaming format defines an upper bound to this temporary buffer size, since each block has a maximum size. Therefore, this memory requirement is predictable.
    A way to limit the amount of temporary memory is to use small maximum block sizes. The decoder could then simply refuse to decode streams with a maximum block size above a selected limit.

    Currently, the LZ4 streaming format defines a few maximum block sizes values, from 64KB to 4MB. There may be a need to define more sizes (specifically at the lower end), and, maybe, let users define their own sizes. In this case, the format specification will have to evolve to integrate this requirement.

  • Size of destination buffer :
    The only thing that the decoder knows when it starts decoding a block is its maximum decoded size.
    Should the destination buffer be not large enough to contain the largest possible block size, there is a risk that the decoding process will fail.

    There are several ways to handle this situation. The easiest one is to impose a restriction on authorized destination buffer size. Simply put, if the destination buffer is not large enough, the decoding process will not even start.

    Such a restriction certainly makes the streaming interface easier to code, but can be considered too limiting. For example, maybe the user process wants to decode data directly at its final destination. In this case, the destination buffer is just big enough to handle the exact amount of data expected, not a single byte more.

    To face such a circumstance, a temporary output buffer would be allocated. Whenever a decoded block is too large to fit into the destination buffer, the block will be decoded into this temporary output buffer instead; then, the relevant piece of block will be copied into the destination buffer.
    Such "memory copy" operation seems obviously sub-optimal, but it does the job done.

    This mechanism still lets opened the choice for the priority buffer.
    If the destination buffer is not large enough to contain the largest possible block size,
    should the block be decoded first into the temporary output buffer, and then the relevant piece get copied into the destination buffer,
    should it be decoded first into the destination buffer, just in case it would be large enough, and then backup to the temporary output buffer when it does not work ?

    The second choice basically trusts more the user programmer, at the cost of heavier performance penalty if the bet wasn't correct (2 decoding function calls instead of one). On the other hand, if the bet was correct, it saves one memory copy operation.
    The first choice is more middle-ground, costing one memory copy operation and a single decoding function call in all circumstances.

    A complex proposal could be to "heuristically guess" the best choice. For example, the algorithm would start with choice 2 (trust the programmer, and decode directly into destination buffer), and then revert to choice 1 after a few fails.
    Another possibility is to have 2 separate functions, so that the programmer can directly select what he wants.

  • Chained blocks :
    In order to improve compression ratio, sometimes dramatically for small packets, the LZ4 streaming format is able to encode new blocks using the previously decoded 64KB from previous blocks.
    This mechanism produces terrific compression improvements, especially for packets which have a lot of headers, fields and/or identifiers in common.

    The logical consequence is that the previous 64KB must be known by the decoding process. Since there is no guarantee that prior decoded data still sits at its decoded position (and it is most probably not), it seems there is no other way than to save these 64KB into a temporary buffer.

    A fairly logical choice would be to put this 64KB into the "temporary output buffer" mentioned in previous paragraph, thus merging both requirements. It inflates the size of this temporary output buffer by 64KB.

    Another consequence is that choice 2 (of previous paragraph) does no longer make sense, therefore only choice 1 seems relevant. Choice 2 can still make sense for in-place decompression of independent blocks. However, for a streaming scenario, chained block is really the advised setup, since it greatly improves compression ratio.

    If there is no other possibility than to first decode into a temporary output buffer, then it seems to make sense to provide, as a result, a pointer and a length into this buffer, rather than copying the result into another destination buffer. It will be up to the user process to decide what to do with this data.

  • Too much input for your output
    This situation can happen when input contains several full size blocks.
    Since decoding is going to be done in a temporary output buffer, which size is controlled by the decoding stream, it is guaranteed to decode at least one full block.
    But beyond that point, there is no such guarantee.

    As a consequence, with not enough output buffer left to decompress remaining input data, the streaming interface must deliver what it can, and then indicate that the job is not completed.
    One relatively simple way to achieve this is to output a Boolean "more_to_come" signal. When it is set, the user process is informed that more data needs to be decoded, and should therefore call again the decoding function, after disposing of current output (since the next output is likely to overwrite it).

So here we have a reasonably complex set of requirements for the decoding process. And it already has a few perspectives for improvements.

For example, most of requirements regarding temporary buffers come from the fact that decoding function must handle complete blocks.
Should a function able to decode partial blocks exist, it would eliminate the need for a temporary input buffer.
The size of the temporary output buffer, which currently must be 64KB + MaxBlockSize, could receive an upper limit of 64KB + 64KB = 128KB (This obviously would only matter for large values of MaxBlockSize).

Going one step further, a decoding function able to handle data copy from "out of buffer" positions would reduce memory needs to just 64KB, on top of offering some perspectives for in-place decompression. By avoiding the requirement to decompress first into a temporary output buffer, and then copy the relevant result at its final destination, it could improve performance.

On the other hand, a new decoding function able to decode "partial blocks" would require some more complex logic, tests, branching, and state information. As a consequence, all this complexity costs a fair share of performance, losing some speed in the process. It's unclear if the final result would be faster than using intermediate buffer. But at least, it would shave off a few buffers.

There are apparently several optimization steps which could be attempted beyond the first delivery. The main idea is that, such future improvements will, ideally, have no impact on the API itself.

(To be continued...)

Thursday, August 15, 2013

Inter-block compression

 One of the new capabilities defined into the LZ4 Streaming format is inter-block compression. While the concept is simple to formulate (the next block shall be compressed using information from previous block(s)), its execution has been non existent so far.

Up to now, the LZ4 compression utility is doing something relatively simple : cut the data to be processed into independent blocks, compress them one by one (or decode them one by one), assemble the result into destination file or stream.
It is simple and it works.

The problem : the smaller the block size is, the worse the compression ratio.
It can be clearly noticed on the following table :

File : Silesia.tar / Independant blocks
Block Size     LZ4     LZ4 HC
4 MB          47.97%   36.82%   
1 MB          48.03%   36.99%
256 KB        48.27%   37.68%
64 KB          (*)     40.41%

(*) Note : the 64KB version of LZ4 is using a dedicated algorithm which improves compression ratio, and is therefore not comparable with previous block sizes. Nonetheless, for the record, the result is 48.34%.

As can be seen, the situation gets worse and worse with each block size reduction. That's because the first 64KB of each block starts with an empty dictionary, worsening compression capabilities. LZ4 HC is, by the way, much more affected than the Fast LZ4, since it achieves better benefit from earlier data (and therefore loses the most) thanks to its better parsing methodology.
One can also note that, starting with 64KB block size, the decrease in compression ratio is very sensible. It can only get worse from this point towards smaller sizes.

One work around solution is to simply not use small block. It works fairly well : lz4c uses the maximum block size (4MB) for file compression as default.
But there are circumstances when this choice is not possible.

For example, let's suppose your application is sending packets in real time to a server. Because the application is supposed to react in real-time with low-latency, you can't wait to amass 4 MB of data before compressing it and sending it to the server. Data must be sent at application-defined moment, no later.

Such a packet of data is typically small, at most a few KB.
This is very small : if compressing just the packet, the compression is probably going to be poor.

Even more importantly, data packets tend to have a lot of common elements between them, such as headers, and identifiers. But since the repetition can only be noticed across multiple packets, it is lost if each packet is processed independently.

So the idea is to compress the next packet using information from previous packets.

It may seem simple, but this directly contradicts some current assumption in the LZ4 algorithm, which states that each compression function starts from a clean state.

Well, no longer. The soon-to-be release r102 of LZ4 features new compression functions to continue the compression process from where it was stopped previously.
To realize this operation, the compression context is now an explicit argument of the new functions. The context must be created manually, using a dedicated function, and then provided at each round to the new compression functions.
(Note : The usual compression functions are still there, and work the same, so there is no modification for existing LZ4 users.)

As a consequence, these new functions should improve compression ratio. And that's what can be measured :

File : Silesia.tar / Inter-block compression
Block Size     LZ4     LZ4 HC  (HC Independent blocks)
4 MB          47.95%   36.76%      (36.82%) 
1 MB          47.95%   36.76%      (36.99%)
256 KB        47.96%   36.77%      (37.68%)
64 KB         47.97%   36.78%      (40.41%)

There are a few learnings from this experiment :

1) Compression ratio do improve a little bit. This was expected : with previous data available, the initial 64KB of data of each block can be fully compressed at its potential.
One can notice that, once again, LZ4 HC benefit more from it. That's still for the same reason : better parsing, making better usage of available dictionary.

2) Compression ratio do not degrade with smaller block sizes.
Or, to be fair, it does worsen, but by a tiny amount compared to previous situation.

This property is key : it opens the perspective for very small block sizes (such as network packets, of a few KB), while keeping a good compression ratio, close to optimal conditions.

This was the target property of this exercise. It represents a significant step towards an LZ4 Streaming API.

Friday, August 9, 2013

Part 3 : Inlining big functions

 After the success of converting LZ4_decompress() to inline function, I was curious to repeat the performance on LZ4_compress().

The attempt is more complex, because LZ4_compress() is a much larger function, typically unfit for inlining. Also, the different variants use different types (table of U16, U32, or const BYTE*, depending on circumstances).

The result is somewhat mixed, although I do now have a working version using inline instead of macro. Here are a few learnings from this experiment :

1) No allocation within inline functions

It's of course possible to allocate a few variables, but it's not the point to allocate big tables on stack inside inline functions. This is because inline functions can be inlined several times, bloating the memory requirement on stack.

It's an issue for LZ4, since its hash table can be allocated on heap, or on stack, depending on "#define HEAPMODE" trigger. This choice will have to be done outside of the main inlined function.

The good news is that if you allocate on stack and then call the inline function, the inline function will keep the properties of accessing stack variables & tables (most importantly speed) even though the code tells it's working on pointers.

2) Inlining big function is generally not a win

Inlining big function is not "natural" for the compiler.
Inline functions were initially created as a safer replacement to macro, for small snippets of code. The main idea is that calling a function requires some push/pop/jmp operations, costing both CPU cycles and stack memory, which could be avoided. In general, function calling is fast enough, and does not require such optimization. But for small code sections inside hot loops, which are going to be called millions of times, this can make a sensible performance difference.

The capability to "remove branches" inside such code thanks to optimizations is merely a consequence, not the main initial objective.
As a consequence, the language does not even take it into consideration, and only consider the "function call delay" as the main variable to be optimized. Hence, when the function is too large, the usual choice for the compiler is to not inline it.

Therefore, to inline big functions, you have to "force" it (using compiler-specific extensions).

The situation differs sharply, depending if we are compiling for 32-bits or 64-bits systems.
- 64-bits programs tend to inline fairly well big functions. A small boost to performance can even be expected, apparently the compiler is able to find even more optimizations.
- 32-bits programs, in contrast, don't like such (forced) inlining. Here also the situation differs depending on the compiler, with Visual making a much better job than GCC.

I tentatively attribute such difference to the number of available registers. In 32-bits (x86) mode, registers are scarce (6, which are supposed to be general nowadays, but were previously specific). In 64-bits mode, they are plentiful (16). It means this is much easier for the compiler to combine all these codes together, with enough space to keep important variables into registers and avoid memory accesses.

If the hypothesis is correct, this means this issue is more related to number of available registers and compiler cleverness, than it is to 32 vs 64 bits. This can be important for ARM systems, which tend to be 32-bits one, but get 16 general-purpose registers at their disposal (like x64). They should suffer less from inlining issues.

Coming back to LZ4, sometimes, the loss of performance on 32-bits x86 is so important that it's better to not inline the function, even if it results in more branching. It's the current choice for LZ4_compressHC().

3) Type selection is okay

I've been hiding the different types inside another inline function, which selects the right pointer type and the appropriate operation depending on a parameter. The compiler makes a fine job at eliminating useless branches, keeping only the path with correct type to be inlined.

4) Clear selection signal is mandatory

In order for the compiler to only keep the right code path, eliminating all branches, it is essential it correctly understand which branch is the right one.
There must be no room for confusion.

As a simple example, suppose we've got this code :

inline int functionA(int parameter)
{ (... bunch of code with one branching depending on parameter...) }

int functionB()
int parameter=1;
return functionA(parameter);

You'll be surprised to notice that the branching is not necessarily removed, even though the parameter has a very clear value.

This seems like a limitation of the compiler, which doesn't want to "bet" on the value of parameter. Therefore, prefer something along this line :

void functionB()
return functionA(1);

This one will ensure that only the path corresponding to parameter==1 will be kept, eliminating branching.

5) Predictable branching don't cost that much

Modern CPU are fairly complex beasts, and one thing they do fairly well is to correctly guess what will be the result of a branching, anticipating the next operations by filling the pipeline.
This is even more true if the result is always the same.
So, even if a branching is not eliminated, but lead to always the same path, the CPU will quickly understand it, and next time will speculative start the right branch even before the result of the test.
As a consequence, the cost of branching is minimal (only the test operation itself and a jmp), with no pipeline stall.
This means it's not always necessary to worry about every branching. Only those with (almost) random pattern will have a big hit to performance. Predictable ones are fairly fast.

So, the result :

The current developer version is a mixed bag. It results in a small performance boost in 64-bits mode (as compared to macro), and a small performance loss in 32-bits modes.

using a Corei5-3340M, on LZ4_compress() (fast variant) :
64-bits GCC : +2% (Ubuntu)
32-bits Visual : -1% (Win7)
32-bits GCC : -5% (Ubuntu)

On the other hand, it eliminates the need for external files (lz4_encoder.h & lz4hc_encoder.h), making the source code both easier to read and to integrate into user projects. And this was the main objective all along.

Monday, June 10, 2013

Fighting Code Bloat - Part 2 - Inline functions

 A while ago, I've written a blog post on a method to decrease source code size, improving maintainability, when the source contains multiple functions which are almost identical, except for a few differences in conditions or types.

The method uses a separate *.h file, which is included as many times as necessary, with a few #define to trigger the relevant sections of the code.

Although it works, I couldn't help but get the feeling that it looks almost like a hack. Also, as a drawback, the method slightly increases source complexity, by increasing the total number of files to be included into an external project (granted, this number remains fairly small for LZ4, but still, it's a step in the wrong direction).

An insightful comment from Bryce Schober kindly reminded that another method was rightly accessible, using inline functions. Yes, inline functions would remove the issue of separated #include files, but it comes with its own set of limitations; namely, inline is merely an "hint" to the compiler, not a guaranteed that the function will effectively get inlined, and it doesn't solve the issue of manipulating different types within the function.

However, I wanted to give this method a try, in case it would result in a cleaner solution. Inline functions are regularly advised as a better alternative to macro (whenever applicable), since inline functions are still compiled, with all the benefits of strong typing and semantic check. It greatly improves debugging and code maintainability.

For this attempt, I selected the set of decompression functions, which has two big advantages :
1) The function is small. As an heuristic, the smallest a function is, the more probable it will get inlined.
2) There are no 'type' manipulations between the different version, only different sets of tests.

The key aspect for this trick to work is to expect the compiler's optimizer to do its job properly. Namely, whenever a test is guaranteed to be 'true' or 'false', we want the associated branch to be eliminated and the relevant piece of code to be instantiated in place. That's why it's key for this function to be inlined : without it, the compiler can't remove the branches, resulting in sensible performance penalty.

Ultimately, the experiment proved successful, and is the basis of LZ4 newest version, r97.

A few key things that were learned in the process :

- Make sure that branches to be removed receive a very clear '1 or 0' signal.
For example, if one branch depends on (variable > 0), don't write it this way. You should know if variable will fulfill the condition, and add another input, such as testtrigger, which will receive the value 1 or 0. Now, the test becomes if (testtrigger) : this will ensure the compiler will understand the test's result, and therefore only instantiate the correct branch.

- Make sure the function will really get inlined. In my tests, GCC happily inlined the generic function, but Visual Studio would not. Hopefully, this can be solve by using __forceinline keyword (forceinline is not part of C99 specification, it is therefore compiler-specific. It's a portability issue).

- Impact on performance is not zero. Here, it is a mixed bag. Inlined functions perform slightly worse than macros for Visual, while performing slightly better for GCC. It's difficult to explain why. The resulting assembler code is fairly close, but definitely not identical. The very small assembler differences could be enough to explain the performance delta. We are hopefully talking about small deltas, in the range of 2%.

- Code readability is definitely improved, as compared to a separate file with macros. This is a big boost to code maintenance. As an added bonus, it even proved easier to find some more optimization.

As a result, this version features a small boost to decoding speed, as measured below :
(fullbench, compiled with GCC v4.6.1 64-bits, running on Core i5 560M with Linux Ubuntu 11.10)
Summary (MB/s): r96 r97

LZ4_decompress_fast 1412 1460
LZ4_decompress_fast_withPrefix64k 1408 1457
LZ4_decompress_safe 1369 1391
LZ4_decompress_safe_withPrefix64k 1404 1416
LZ4_decompress_safe_partial 1327 1387

All in all, this is a step in the right direction.
Can the experiment be extended to the compression function itself ? Well, that's going to be a later story. It certainly looks much more complex.

Friday, April 26, 2013

Fighting code bloat (C template-style)

 A little detail has always annoyed me within the LZ4 source code : the presence of 2 compression functions, and 2 decompression functions.

In both cases, the 2 variants are very close to each other. But the differences are large enough to justify 2 separate functions (such as different underlying types). While it's a minor annoyance, the situation could not last.

Creating the second function was relatively easy : just copy/paste the first function, and modify whatever is necessary. Problems start with code maintenance. Whenever modifying or correcting one function, it's necessary to not forget to also modify the second one, whenever it's applicable. Doing this multiple times, it's likely that a few minor changes get their way in, especially when the code is large (which, thankfully, it is not for LZ4 yet).
But that's only the beginning of the problems.
What if other variants are needed ?
Then, it will be necessary to create a new function almost similar to previous ones, or multiply the current number of variants by 2 if it is an additional on/off parameter. What if I need another on/off parameter ? That's 8 similar functions. Obviously, this doesn't scale.

C++ template
Dealing with a similar issue in C++ has its solution : it's called Template. Getting a list of parameters with predictable values in front of the function template will instruct the compiler to create as many versions of the function as required. It's a very handy way to write the function once, and have it instantiated many times with combination of modifications.

Alas, C coders are not blessed with such a comprehensive tool. C99, which is the latest language update to care about, doesn't define them. (The more recent C11 is still too young for widespread deployment, and anyway, even the new Type-generic defined in C11 is a long shot from Template).

Googling around shows this is a recurrent issue, with already multiple attempts at mimicking template behavior using C macros. With current limitations, it is the sole strategy to adopt. Most of these attempts are fairly complex, which doesn't help for debug, and come with various limitations, depending on attempted objectives (mostly focused on parameter types).

LZ4 is going to require new functions very soon, so the problem of duplicated lines of code is going to be adamant. It becomes urgent to solve it.

The case for inline functions
One potential way to do it is to use inline functions. Such functions will be instantiated in-place, where they are called. In many ways, inline functions behave the same as macros with parameters, but with the big advantage of being typed and compiled, resulting in much cleaner code to debug. By defining some parameters which mere objective is to enable or disable some parts of the code (typically some checks), the compiler will automatically create the right optimal code, removing useless branches. This is a good start.

However, inline functions also come with limitations, especially in C.
First, inline is merely a compilation hint. The compiler is free to decide if the function will be instantiated or referenced, the programmer has no direct control on this decision. A basic rule of thumb is : if the function is very small, it will most probably be instantiated in-place, while if it is large, it most likely won't. However, there is a large gap between these two extreme situations, where it's more difficult to guarantee anything.

This limitation push towards small inline functions, which by the way is the intention of the standard. So, instead of creating a very large all-encompassing function, it could be a good idea to cut it into smaller pieces. Alas, there is another problem : in contrast with macros, inline function do not modify their parameters (like normal functions do). But most of the time, the snippets of code have the responsibility to modify several variables, a single result is simply not enough. Since it's not possible to pass variables by references (this is C++ only), to reach this goal, it's possible to pass parameters as pointers to variables. It works (there is such an example into lz4hc). It just makes it even harder for the compiler to understand the intend and optimize such a mess, and therefore less likely to create the best fastest code.

The last limitation is even less forgiving : what can be done when the difference between two versions concerns underlying types ?
A good example is the HTYPE of LZ4, which can be either an U16 (for the 64K version), a BYTE* pointer (for 32-bits), or an U32 (for 64-bits). The code is almost the same, just this particular type changes depending on the version. How to do that with an inline function ? The simple answer is : you can't.

LZ4 solution
So here is my attempt. In order to reduce the number of functions written into LZ4 source code, preparing the creation of newer ones sharing almost the same code, each function is written into a dedicated file, included several times into the main .c source file, preceded by a set of #define which trigger specific behaviors.

Hence were created lz4_encoder.h and lz4_decoder.h.
Both files are unusual : they are neither real "headers", nor compilable piece of code. They are meant to be included into another source file (*.c), preceded by a list of #define, some of them being mandatory (such as FUNCTION_NAME), and others being optional (such as LIMITED_OUTPUT).
Each time the *.h file is included, it creates a new function.
It becomes possible to write the function once, and create multiple variations of it, greatly simplifying code maintenance.

The situation is not all rosy. First, the main function becomes more complex to write, since it encompasses all possible variations. It's sometimes necessary to refactor the code just to make it more readable, since a collection of :
can quickly impact readability, resulting in a negative impact on code maintenance.

Another drawback is the necessity for any user program to import several files (lz4.c + lz4_encoder.h +  lz4_decoder.h), while a single lz4.c was enough up to now.
This increased file management complexity was the main reason I've avoided this strategy up to now. But, with the streaming interface in the near future, it was a necessity to ensure that new functions can easily be created while keeping the source code size under control.

In the end, the code refactoring effort also created some immediate win. Performance of several functions improved, notably for the fast variant of the decompression function, the compression of small packets, and the High Compression variant. This is a consequence of "sanitizing" the code, removing useless tests from variants which don't need them, and finding minor differences between 2 versions, keeping only the better one.

Next goal in list is inter-dependent block compression and decompression, an essential piece of the puzzle towards Streaming Interface.

Tuesday, April 9, 2013

LZ4 Frame format : Final specifications

The LZ4 Framing Format specification has progressed quite a bit since last post, taking into consideration most issues raised by commenters. It has now reached version 1.5 (see edit below), which looks stable enough.

As a consequence, save any last-minute important item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to become more friendly with read-only media, hopefully improving clarity in the process. Specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduce frame content checksum. It's not correct to assume that block checksum makes frame content checksum redundant : block checksum only validates that each block has no error, while frame content checksum verify that all blocks are present and in correct order. Finally, frame content checksum also validates the encoding & decoding stages.
v1.4 also introduces the definition of "skippable frames", which can be used to encapsulate any kind of user-defined data into a flow of appended LZ4 frames.

[Edit 4] : Changed naming convention in v1.4.1 to "frame".

[Edit 5] : Removed Dict_ID from specification

Thursday, March 21, 2013

A Streaming format for LZ4

It is a long time since I'm willing to produce a suitable streaming format for LZ4. As stated earlier, the format defined into lz4demo was not meant to become widespread, and is too limited. It's also completely unsuitable for network-oriented protocols.

As a consequence, you'll find in the following link a nearly-final specification of the LZ4 streaming format, in OpenDocument format.

It's only "nearly" final, since there are still a few questions left, summarized at the end of each chapter.

However, the main properties seem settled. The format accomodates for a large selection of buffer sizes, authorizes sequential and independant blocks, embed a few checksum options, accept arbitrary flushes at any point, and even define provisions for preset dictionaries mode.

At this stage, the very last questions seem ready to be settled in the next few weeks. Inputs, comments are welcomed.

[Edit] progresses :
Settled thanks to your comments (here and direct email):

  • Endian convention : more votes for Little Endian.
  • Stream Size : 8 bytes seems okay
  • Stream checksum : removed (v0.7), block-level checksum seems enough
  • High compression flag : removed (v0.8), seems not useful enough
  • Block Maximum size : reduced table (v0.9), from 1KB to 4MB
  • Block size : simplified compressed/uncompressed flags (v1.0)

[Edit] answering spec-related questions directly into the post

Jim> suggestion is to allow different checksums, with a 16-bit word identifying which hash

Actually, it was my initial intention.
But i eventually backed off. Why ?

One of LZ4 strong points is its simple specification, which makes it possible for any programmer to produce an LZ4-compatible algorithm of its own, in any language. To reach this goal, complexity is always fought and reduced to a minimum.

If multiple hash algorithms are possible, then the decoder will have to support them all to be compatible with the specification. It's a significant burden, which will deter or slow down people willing to produce, test and maintain their own decoding routine.

Obviously, the streaming format proposed here is not supposed to be "the most appropriate for any usage". I intend it to become "mainstream", cross-platform, and to replace the current "static format" of "lz4demo". But there will always be some specific circumstances in which another format will do a better job.

Matt> deflate format supports preset dictionaries, but nobody uses them. I would drop it.

Actually, I had a few requests for such a feature. The idea is that pre-set dictionaries can make a great difference when it comes to sending a lot of small independant blocks.

Matt> Do you need checksums at all?

I think so. The format is for both short-lived transmission data, and for storage one. Checksum is almost mandatory for the second use-case. Anyway, in the proposal, Checksum is still an "option", so it can be disabled by the sender if it seems better for its own use case.

Matt> Do you need both block and stream checksums? Probably not.
Mark> Stream Checksum: I don't see the point if data block checksum gives the appropriate protection 

That's the good question. It's kind of 50/50 when it comes to evaluating comments.
The simplicity argument looks good though.
[Edit 2] Stream checksum is removed (v0.7+), keeping only block-level checksum

Mark> Do you really think your block maximum size values make sense (...) All in all, I tend to think it is a too wide choice anyway

Good point. In v0.9, the choice of values for block maximum size has been reduced.

Matt> Do you need variable sized fields just to save a few bytes in the header? Probably not. 
Mark> I would make that "compressed flag" explicit, and thus keep only one "data size" field disregarding if the data was compressed or not. (...) . I'm not even sure you would need two possible sizes just to save one byte per block. 

Good points. Apparently, this part looks too complex, and would deserve some re-thinking.
[Edit 2] : the specification regarding block size is simplified in v1.0.

Adrien> Would it be better to let users select their own dictionary identifiers, rather than requiring Dict-ID to be the result of passing the dictionnary through xxHash-32 ?

Good point . This behavior mimics the spec of RFC1950 (zlib). The only advantage i can think of is that it allows the decoder (and encoder) to check if the dictionary is the right one. I'm unsure if this is really useful though....

Takayuki> LZ4 stream may be (pre) loaded on Ready Only Memory. In this case, temporal masking 0 for BC.Checkbits is difficult.

Correct. The current proposal is to load the header into RAM in order to perform the masking and checksum there. Is that an issue ?
Another proposal could be to not check the checkbits for ROM pre-loaded streams, since potential transmission error is nullified for such scenario.