Friday, April 26, 2013

Fighting code bloat (C template-style)

 A little detail has always annoyed me within the LZ4 source code : the presence of 2 compression functions, and 2 decompression functions.

In both cases, the 2 variants are very close to each other. But the differences are large enough to justify 2 separate functions (such as different underlying types). While it's a minor annoyance, the situation could not last.

Creating the second function was relatively easy : just copy/paste the first function, and modify whatever is necessary. Problems start with code maintenance. Whenever modifying or correcting one function, it's necessary to not forget to also modify the second one, whenever it's applicable. Doing this multiple times, it's likely that a few minor changes get their way in, especially when the code is large (which, thankfully, it is not for LZ4 yet).
But that's only the beginning of the problems.
What if other variants are needed ?
Then, it will be necessary to create a new function almost similar to previous ones, or multiply the current number of variants by 2 if it is an additional on/off parameter. What if I need another on/off parameter ? That's 8 similar functions. Obviously, this doesn't scale.

C++ template
Dealing with a similar issue in C++ has its solution : it's called Template. Getting a list of parameters with predictable values in front of the function template will instruct the compiler to create as many versions of the function as required. It's a very handy way to write the function once, and have it instantiated many times with combination of modifications.

Alas, C coders are not blessed with such a comprehensive tool. C99, which is the latest language update to care about, doesn't define them. (The more recent C11 is still too young for widespread deployment, and anyway, even the new Type-generic defined in C11 is a long shot from Template).

Googling around shows this is a recurrent issue, with already multiple attempts at mimicking template behavior using C macros. With current limitations, it is the sole strategy to adopt. Most of these attempts are fairly complex, which doesn't help for debug, and come with various limitations, depending on attempted objectives (mostly focused on parameter types).

LZ4 is going to require new functions very soon, so the problem of duplicated lines of code is going to be adamant. It becomes urgent to solve it.

The case for inline functions
One potential way to do it is to use inline functions. Such functions will be instantiated in-place, where they are called. In many ways, inline functions behave the same as macros with parameters, but with the big advantage of being typed and compiled, resulting in much cleaner code to debug. By defining some parameters which mere objective is to enable or disable some parts of the code (typically some checks), the compiler will automatically create the right optimal code, removing useless branches. This is a good start.

However, inline functions also come with limitations, especially in C.
First, inline is merely a compilation hint. The compiler is free to decide if the function will be instantiated or referenced, the programmer has no direct control on this decision. A basic rule of thumb is : if the function is very small, it will most probably be instantiated in-place, while if it is large, it most likely won't. However, there is a large gap between these two extreme situations, where it's more difficult to guarantee anything.

This limitation push towards small inline functions, which by the way is the intention of the standard. So, instead of creating a very large all-encompassing function, it could be a good idea to cut it into smaller pieces. Alas, there is another problem : in contrast with macros, inline function do not modify their parameters (like normal functions do). But most of the time, the snippets of code have the responsibility to modify several variables, a single result is simply not enough. Since it's not possible to pass variables by references (this is C++ only), to reach this goal, it's possible to pass parameters as pointers to variables. It works (there is such an example into lz4hc). It just makes it even harder for the compiler to understand the intend and optimize such a mess, and therefore less likely to create the best fastest code.

The last limitation is even less forgiving : what can be done when the difference between two versions concerns underlying types ?
A good example is the HTYPE of LZ4, which can be either an U16 (for the 64K version), a BYTE* pointer (for 32-bits), or an U32 (for 64-bits). The code is almost the same, just this particular type changes depending on the version. How to do that with an inline function ? The simple answer is : you can't.

LZ4 solution
So here is my attempt. In order to reduce the number of functions written into LZ4 source code, preparing the creation of newer ones sharing almost the same code, each function is written into a dedicated file, included several times into the main .c source file, preceded by a set of #define which trigger specific behaviors.

Hence were created lz4_encoder.h and lz4_decoder.h.
Both files are unusual : they are neither real "headers", nor compilable piece of code. They are meant to be included into another source file (*.c), preceded by a list of #define, some of them being mandatory (such as FUNCTION_NAME), and others being optional (such as LIMITED_OUTPUT).
Each time the *.h file is included, it creates a new function.
It becomes possible to write the function once, and create multiple variations of it, greatly simplifying code maintenance.

The situation is not all rosy. First, the main function becomes more complex to write, since it encompasses all possible variations. It's sometimes necessary to refactor the code just to make it more readable, since a collection of :
can quickly impact readability, resulting in a negative impact on code maintenance.

Another drawback is the necessity for any user program to import several files (lz4.c + lz4_encoder.h +  lz4_decoder.h), while a single lz4.c was enough up to now.
This increased file management complexity was the main reason I've avoided this strategy up to now. But, with the streaming interface in the near future, it was a necessity to ensure that new functions can easily be created while keeping the source code size under control.

In the end, the code refactoring effort also created some immediate win. Performance of several functions improved, notably for the fast variant of the decompression function, the compression of small packets, and the High Compression variant. This is a consequence of "sanitizing" the code, removing useless tests from variants which don't need them, and finding minor differences between 2 versions, keeping only the better one.

Next goal in list is inter-dependent block compression and decompression, an essential piece of the puzzle towards Streaming Interface.


  1. Why not simply define more inline functions? I've had great success using inline functions to implement conditional behaviors in different contexts, and literal arguments to inline functions have given me the exact equivalent to the original. I suppose that some of my success may have been due to only needing to support gcc, so the need to use the always_inline attribute didn't bother me.

    1. Actually, that's a very good comment. I initially planned to have a few paragraphs on a comparison with inline functions, but finally went for the fastest (and laziest) alternative.
      I believe inline functions have their advantages, but can't solve all situations. When I get time, I'll update the blog post with a more detailed explanation.

    2. The blog entry has been updated with a paragraph on inline functions.

    3. Well, finally, I decided to use the "inline function" methodology for the set of decompression functions. It proved successful. See

  2. It's pretty interesting that a lot of compression code tends towards the same solutions. I remember seeing the lzop code for the first time and seeing lots of small files #including one (relatively speaking) monster file.

  3. I find the C code difficult to read. It is a little better now that you have consolidated the versions, but it still requires expanding many things in one's head as we read the code.

    There is nothing wrong with this, given that performance is the goal. But I would have liked a pseudo-code version kept up to date somewhere, that is more readable. I have looked at other language implementations; most copy the C-code version, except the Go implementation.

    1. Adrien's Java also has its own implementation.

      For the decoder, your best guide is the format specification, which can be "mapped" fairly easily into any language. (See for example the 8088 assembler version of Jim).

      For the encoder, it's a bit more complex...