Friday, February 5, 2016

Compressing small data

 Data compression is primarily seen as a file compression algorithm. After all, the main objective is to save storage space, is it ?
With this background in mind, it's also logical to focus on bigger files. Good compression achieved on a single large archive is worth the savings for countless smaller ones.

However, this is no longer where the bulk of compression happen. Today, compression is everywhere, embedded within systems, achieving its space and transmission savings without user intervention, nor awareness. The key to these invisible gains is to remain below the end-user perception threshold. To achieve this objective, it's not possible to wait for some large amount of data to process. Instead, data is processed in small amounts.

This would be all good and well if it wasn't for a simple observation : the smaller the amount to compress, the worse the compression ratio.
The reason is pretty simple : data compression works by finding redundancy within the processed source. When a new source starts, there is not yet any redundancy to build upon. And it takes time for any algorithm to achieve meaningful outcome.

Therefore, as the issue comes from starting from a blank history, what about starting from an already populated history ?

Streaming to the rescue

A first solution is streaming : data is cut into smaller blocks, but each block can make reference to previously sent ones. And it works quite well. In spite of some minor losses at block borders, most of the compression opportunities of a single large data source are preserved, but now with the advantage to process, send, and receive tiny blocks on the fly, making the experience smooth.

However, this scenario only works with serial data, a communication channel for example, where order is known and preserved.

For a large category of applications, such as database and storage, this cannot work : data must remain accessible in a random fashion, no known "a priori" order. Reaching a specific block sector should not require to decode all preceding ones just to rebuild the dynamic context.

For such use case, a common work-around is to create some "not too small blocks". Say there are many records of a few hundred bytes each. Group them in packs of at least 16 KB. Now this achieves some nice middle-ground between not-to-poor compression ratio and good enough random access capability.
This is still not ideal though, since it's required to decompress a full block just to get a single random record out of it. Therefore, each application will settle for its own middle ground, using block sizes of 4 KB, 16 KB or even 128 KB, depending on usage pattern.

Dictionary compression

Preserving random access at record level and good compression ratio, is hard. But it's achievable too, using a dictionary. To summarize, it's a kind of common prefix, shared by all compressed objects. It makes every compression and decompression operation start from the same populated history.

Dictionary compression has the great property to be compatible with random access. Even for communication scenarios, it can prove easier to manage at scale than "per-connection streaming", since instead of storing one different context per connection, there is always the same context to start from when compressing or decompressing any new data block.

A good dictionary can compress small records into tiny compressed blobs. Sometimes, the current record can be found "as is" entirely within the dictionary, reducing it to a single reference. More likely, some critical redundant elements will be detected (header, footer, keywords) leaving only variable ones to be described (ID fields, date, etc.).

For this situation to work properly, the dictionary needs to be tuned for the underlying structure of objects to compress. There is no such thing as a "universal dictionary". One must be created and used for a target data type.

Fortunately, this condition can be met quite often.
Just created some new protocol for a transaction engine or an online game ? It's likely based on a few common important messages and keywords (even binary ones). Have some event or log records ? There is likely a grammar for them (json, xml maybe). The same can be said of digital resources, be it html files, css stylesheets, javascript programs, etc.
If you know what you are going to compress, you can create a dictionary for it.

The key is, since it's not possible to create a meaningful "universal dictionary", one must create one dictionary per resource type.

Example of a structured JSON message

How to create a dictionary from a training set ? Well, even though one could be tempted to manually create one, by compacting all keywords and repeatable sequences into a file, this can be a tedious task. Moreover, there is always a chance that the dictionary will have to be updated regularly due to moving conditions.
This is why, starting from v0.5, zstd offers a dictionary builder capability.

Using the builder, it's possible to quickly create a dictionary from a list of samples. The process is relatively fast (a matter of seconds), which makes it possible to generate and update multiple dictionaries for multiple targets.

But what good can achieve dictionary compression ?
To answer this question, a few tests were run on some typical samples. A flow of JSON records from a probe, some Mercurial log events, and a collection of large JSON documents, provided by @KryzFr.

Collection Namedirect
Small JSON recordsx1.331 - x1.366x5.860 - x6.830~ x4.7300200 - 400
Mercurial eventsx2.322 - x2.538x3.377 - x4.462~ x1.51.5 KB20 - 200 KB
Large JSON docsx3.813 - x4.043x8.935 - x13.366~ x2.86 KB800 - 20 KB

These compression gains are achieved without any speed loss, and even feature faster decompression processing. As one can see, it's no "small improvement". This method can achieve transformative gains, especially for very small records.

Large documents will benefit proportionally less, since dictionary gains are mostly effective in the first few KB. Then there is enough history to build upon, and the compression algorithm can rely on it to compress the rest of the file.

Dictionary compression will work if there is some correlation in a family of small data (common keywords and structure). Hence, deploying one dictionary per type of data will provide the greater benefits.

Anyway, if you are in a situation where compressing small data can be useful for your use case (databases and contextless communication scenarios come to mind, but there are likely other ones), you are welcomed to have a look at this new open source tool and compression methodology and report your experience or feature requests.

Zstd is now getting closer to v1.0 release, it's a good time to provide feedback and integrate them into final specification.


  1. Very cool! Will there be an API for the dictionary builder or will it only be a command-line utility?

    1. There is an API already availableefor the dictionary builder : look into dictBuilder.h.
      It's still preliminary though, hence perfectible, but one has to start somewhere...

  2. Hi Yann.

    Small files reinforced with dictionaries, deliver better ratio but are kinda bloaty. My point here is to add one more nasty viewpoint for your either small or huge files approaching - the encoding scheme.
    Needless to say, the big bunch of unicode alone variants bloats the processing a lot, below one UNICODE, the worst of all, example is given: the famous and revered "Большая Советская Энциклопедия" - 30 volumes of Wikipedia-like content - semi-dictionary, semi-encyclopedia.

    XZ uses level 9, LZMA2:26, 64MB dictionary, Zstd is v0.4.7:


    396,120,046 GreatSovietEncyclopaediaRuRu.dsl
    67,790,530 GreatSovietEncyclopaediaRuRu.dsl.Nakamichi
    44,591,268 GreatSovietEncyclopaediaRuRu.dsl.xz

    D:\_Deathship_textual_corpus\Goldenboy_Turbobench>turbobench_singlefile.bat GreatSovietEncyclopaediaRuRu.dsl
    Decompressing 23 times, for more precise overall...

    D:\_Deathship_textual_corpus\Goldenboy_Turbobench>turbobenchs.exe GreatSovietEncyclopaediaRuRu.dsl -ezlib,9/fastlz,2/chameleon,2/snappy_c/zstd,1,20/lz4,9,16/lz5,15/brieflz/brotli,11/crush,2/lzma,9/zpaq,2/lzf/shrinker/yappy/trl
    e/memcpy/naka/lzturbo,19,29,39 -k0 -J23
    83424253 21.1 3.16 DDDDDDDDDDDDDD 281.94 zlib 9 GreatSovietEncyclopaediaRuRu.dsl
    149995831 37.9 196.67 DDDDDDDDDDDDDD 394.19 fastlz 2 GreatSovietEncyclopaediaRuRu.dsl
    210522291 53.1 950.24 DDDDDDDDDDDDDD 1670.38 chameleon 2 GreatSovietEncyclopaediaRuRu.dsl
    158204122 39.9 283.65 DDDDDDDDDDDDDD 640.74 snappy_c GreatSovietEncyclopaediaRuRu.dsl
    95033271 24.0 137.71 DDDDDDDDDDDDDD 419.31 zstd 1 GreatSovietEncyclopaediaRuRu.dsl
    52389046 13.2 1.39 DDDDDDDDDDDDDD 273.64 zstd 20 GreatSovietEncyclopaediaRuRu.dsl
    94374416 23.8 7.61 DDDDDDDDDDDDDD 1098.03 lz4 9 GreatSovietEncyclopaediaRuRu.dsl
    94200951 23.8 6.87 DDDDDDDDDDDDDD 1099.60 lz4 16 GreatSovietEncyclopaediaRuRu.dsl
    83027430 21.0 1.25 DDDDDDDDDDDDDD 599.31 lz5 15 GreatSovietEncyclopaediaRuRu.dsl
    125356638 31.6 85.31 DDDDDDDDDDDDDD 173.12 brieflz GreatSovietEncyclopaediaRuRu.dsl
    60458801 15.3 0.26 DDDDDDDDDDDDDD 321.02 brotli 11 GreatSovietEncyclopaediaRuRu.dsl
    69685413 17.6 0.12 DDDDDDDDDDDDDD 311.04 crush 2 GreatSovietEncyclopaediaRuRu.dsl
    42372850 10.7 0.38 DDDDDDDDDDDDDD 113.60 lzma 9 GreatSovietEncyclopaediaRuRu.dsl
    65972471 16.7 3.17 DDDDDDDDDDDDD 70.41 zpaq 2 GreatSovietEncyclopaediaRuRu.dsl
    146599615 37.0 253.62 DDDDDDDDDDDDDD 538.42 lzf GreatSovietEncyclopaediaRuRu.dsl
    172927017 43.7 208.34 DDDDDDDDDDDDDD 447.19 shrinker GreatSovietEncyclopaediaRuRu.dsl
    121893595 30.8 50.52 DDDDDDDDDDDDDD 843.75 yappy GreatSovietEncyclopaediaRuRu.dsl
    396120050 100.0 103.35 DDDDDDDDDDDDDD 1682.47 trle GreatSovietEncyclopaediaRuRu.dsl
    396120050 100.0 2400.28 DDDDDDDDDDDDDD 1680.56 memcpy GreatSovietEncyclopaediaRuRu.dsl
    ^CTerminate batch job (Y/N)? ye encoded; Done 3%; Compression Ratio: 5.23:1


    1. Hi Sanmayce,

      Thanks for the benchmarks, they are indeed very interesting, as it's not frequent to have Cyrillic Unicode content for benchmark.
      It seems, btw, that `zstd -20` scores pretty well in this test, being only second in compression ratio to `lzma -9` (which is slower).

      As a minor comment, I don't think it is related to this post's topic, dictionary compression.

      Dictionary compression is meant to improve compression of small blocks / packets. Here, the sample is very large (> 100 MB) so I guess it's a case of large file compression. It's a valid use case, just a different one.

    2. >Here, the sample is very large (> 100 MB) so I guess it's a case of large file compression. It's a valid use case, just a different one.

      Even more than valid - a serious one it is.

      >As a minor comment, I don't think it is related to this post's topic, dictionary compression.

      Three things.

      First, wanted to illustrate how your Zstd delivers awesome results overshadowing Brotli's tricks (Brotli v16-01) - 52,389,046 vs 60,458,801.
      Second, excuse me, I tend to go off-topic too often, but when I foresee [yet] not arrived difficulties it's natural to me to ambush them.
      Different character encodings go hand-in-hand with dictionaries, yes?
      Simply, I wanted to make you aware of some useful datasets encoded in some non-UTF-8, non-ANSI tables, one of those is a Japanese dictionary:

      15,023,476 edict_(JIS_X_0208_coding_in_EUC-JP_encapsulation)
      60,681,877 JMdict_(Unicode-ISO-10646_coding_using_UTF-8_encapsulation)

      Donloadable at:

      Very important (being Japanese-English) bilingual datasets they are, to tweak a compressor as well.
      What about Chinese, it would be nice some Chinaman to throw some kilobytes of hieroglyphs at Zstd, no?

      Back to the GSE, one non-UNICODE variant of "Большая Советская Энциклопедия" is the two times smaller "GSE_Russian_(Windows_Cyrillic_encoding).tar" dataset:

      150,634,496 GSE_Russian_(Windows_Cyrillic_encoding).tar
      40,279,156 GSE_Russian_(Windows_Cyrillic_encoding).tar.xz

      D:\_Deathship_textual_corpus\Goldenboy_Turbobench>turbobenchs.exe "GSE_Russian_(Windows_Cyrillic_encoding).tar" -ezlib,9/fastlz,2/chameleon,2/snappy_c/zstd,1,20/lz4,9,16/lz5,15/brieflz/brotli,11/lzma,9 -k0 -J23
      63854446 42.4 11.25 DDDDDDDDDDDDDD 169.89 zlib 9 GSE_Russian_(Windows_Cyrillic_encoding).tar
      96798385 64.3 115.99 DDDDDDDDDDDDDD 248.71 fastlz 2 GSE_Russian_(Windows_Cyrillic_encoding).tar
      95493422 63.4 790.74 DDDDDDDDDDDDDD 829.60 chameleon 2 GSE_Russian_(Windows_Cyrillic_encoding).tar
      101553672 67.4 174.06 DDDDDDDDDDDDDD 471.32 snappy_c GSE_Russian_(Windows_Cyrillic_encoding).tar
      71735736 47.6 94.11 DDDDDDDDDDDDDD 402.46 zstd 1 GSE_Russian_(Windows_Cyrillic_encoding).tar
      43545917 28.9 1.16 DDDDDDDDDDDDDD 150.88 zstd 20 GSE_Russian_(Windows_Cyrillic_encoding).tar
      72474962 48.1 20.29 DDDDDDDDDDDDDD 862.45 lz4 9 GSE_Russian_(Windows_Cyrillic_encoding).tar
      72474661 48.1 20.27 DDDDDDDDDDDDDD 861.96 lz4 16 GSE_Russian_(Windows_Cyrillic_encoding).tar
      59387025 39.4 0.38 DDDDDDDDDDDDDD 336.68 lz5 15 GSE_Russian_(Windows_Cyrillic_encoding).tar
      76483483 50.8 51.41 DDDDDDDDDDDDDD 113.83 brieflz GSE_Russian_(Windows_Cyrillic_encoding).tar
      45510754 30.2 0.16 DDDDDDDDDDDDDD 165.58 brotli 11 GSE_Russian_(Windows_Cyrillic_encoding).tar
      39793313 26.4 0.67 DDDDDDDDDDDDDD 52.10 lzma 9 GSE_Russian_(Windows_Cyrillic_encoding).tar


      Again, superb tightness for -20, overpowering brotli 11 - 43,545,917 vs 45,510,754.

      Third, the strong ratio yielded by -20 suggests the need for arrival of more aggressive modes, say 30, why not? Compression rates at 1KB/s scare me not, at all.

      Coolness rises as Zstd enters higher modes. Icy stuff.

    3. Got it.

      So you wanted to underline that file containing non-english text content would not benefit from a compressor using english-oriented dictionary.

      Yes, you are right, this is a good illustration of : "there is no universal dictionary".

      > the strong ratio yielded by -20 suggests the need for arrival of more aggressive modes, say 30, why not?

      Indeed, why not ... stay tuned ...

    4. >So you wanted to underline that file containing non-english text content would not benefit from a compressor using english-oriented dictionary.

      Sure, but not that, I meant that character encodings along with dedicated dictionaries not only for natural languages but for descriptive/unnatural ones form a duo that bloats the implementation, the other thing was that ANS perhaps need NOT additional support for not kilobytish data, in my view ANS is planet-breaking, not merely ground-breaking. It alone can do amazing [de]compression with finess, LzTurbo -39 and Zstd proved already that while and not being burdened by patches/dictionaries/tricks and other bloated overhead.
      For small files and having many dictionaries (with adequate encodings) the benefit will be there but then the simplicity goes out of the window.

      >Indeed, why not ... stay tuned ...

      Thanks, I will, sharing your experience/views freely rocks.
      I see you target some small content in range 4-32KB, but in my world the atomic size is the size of a generic novel, or ~500KB. My secondary target is Wikipedia 50,000,000KB. My tertiary target is n-grams corpora 500,000,000KB. Long ago I ran out of drive space, those ~500GB n-grams along with my 2+TB of x-grams cry for a f-f-fast decompressor.

      Hate to be calptrappy, so 2 favorite novels of mine and 3 big corpora are benchmarked below:

      To me, Zstd -20 smokes Brotli 11, both in novels and corpora datasets. For the definitive English philosophical/religious corpus 'Sacred-Texts' mode 20 bends a knee before Brotli's 11, maybe 22 will do.

      Performance for ~800 popular novels tarred is very indicative of the real power of one [de]compressor.
      By the way, the context of the two Russian corpora is military/action/detective/criminal/antiterrorism. SPETSNAZ stands for 'special forces' or 'special designation' - the elite of army/navy.

      Many good movies were filmed based on those books, so the texts are fully ... operational not some nonsensical dump.

      I am short of computational power, otherwise I would have juxtaposed 100+ datasets using 3-4 generations CPU-RAM subsystems already. Grmbl.

  3. Kinda felt that, regardless of off-topicness (the reason - the key word 'dictionary' triggered in me the actual meaning not the compression term), some heavy DSLs have to be tested with level 21, of course along with other performers, LZSSE2 is in the mix too:



    Superfast is Zstd, the tests show that 21 brings more, achieving such superb results WITH SO LITTLE RAM FOOTPRINT is extra-superb, something, Yann, you should emphasize on paper in future, I mean as XZ reports in verbose mode (needed RAM for compression), my 2 cents.

    I am preparing one insane 97,504,190,079 bytes long corpus, called OOW, stands for Ocean-Of-Words, consisted of 12 corpora, which will throw a beam of light on the textual [de]compression topic...

  4. And my last comment inhere, finally on-topic:

    A quick juxtaposition between Brotli/Zstd/LZ5/Goldenboy, latest (bro_Feb-10-2016, 0.5.1, 1.4) revisions.

    I see no such big discrepancy, taking into account Zstd's lightweightness, after the 2MB mark it vanishes altogether:

    2,120,123 Thomas_Mann_-_Der_Zauberberg_(German).txt
    832,507 Thomas_Mann_-_Der_Zauberberg_(German).txt.15_256MB.lz5
    640,268 Thomas_Mann_-_Der_Zauberberg_(German).txt.L11_W24.brotli
    644,930 Thomas_Mann_-_Der_Zauberberg_(German).txt.L21.zst
    855,771 Thomas_Mann_-_Der_Zauberberg_(German).txt.Nakamichi
    2,227,200 Thomas_Mann_-_La_Montagne_magique_(French).tar
    797,378 Thomas_Mann_-_La_Montagne_magique_(French).tar.15_256MB.lz5
    600,425 Thomas_Mann_-_La_Montagne_magique_(French).tar.L11_W24.brotli
    603,811 Thomas_Mann_-_La_Montagne_magique_(French).tar.L21.zst
    787,830 Thomas_Mann_-_La_Montagne_magique_(French).tar.Nakamichi
    2,043,974 Thomas_Mann_-_The_Magic_Mountain_(English).txt
    785,959 Thomas_Mann_-_The_Magic_Mountain_(English).txt.15_256MB.lz5
    595,306 Thomas_Mann_-_The_Magic_Mountain_(English).txt.L11_W24.brotli
    605,680 Thomas_Mann_-_The_Magic_Mountain_(English).txt.L21.zst
    803,412 Thomas_Mann_-_The_Magic_Mountain_(English).txt.Nakamichi

    Being a Nobel Prize laureate means ... Thomas Mann should be included in every textual benchmark.

    When I finish the ~400 files torture then more vivid picture will appear...

    1. Thanks for these very interesting detailed results Sanmayce.

      We expect next version (0.6.0) to produce noticeably better compression ratio on small files.
      Hopefully, it will show on your compression corpus.

  5. My yesterday comment was treated as spam, maybe due to the 2 URLs, anyway, just for sake of showing the improved strength of Zstd here it is again:

    Hi again, just tested your stronger v0.6.0, for the first time I see Zstd outperforming LzTurbo v1.3, the file is semi-small, namely, 'the_meaning_of_zen.pdf' 54,272 bytes in size, not some mumbo-jumbo but an excellent book excerpt by MATSUMOTO Shirõ.

    In my quick (only 29 testfiles) decompression showdown your latest Zstd 0.6.0 is on par (sizewise) with with LzTurbo 1.3, however falls behind in decompression rate department, the full console log (Laptop i5-2430M @3GHz, DDR3 666 MHz):
    The full bench (DECOMPRESS_all.bat) is here:

    Maybe, on newer CPUs with more caches those ~200MB/s will jump considerably.

    On a try-and-fail note, hee-hee, I tried only YMM registers variant of Nakamichi reading/writing within 1MB window, it disappoints, 2x/3x slower than Conor's LZSSE2 level 17.
    Yet, I read, at Agner Fog's blog:

    "Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address."

    This very line prompted me to write this YMM variant, maybe Skylake will execute Tsubame faster than the ... mobile ... Sandy-Bridge, ugh.

    Some months separate me from having a much richer picture of side-by-side list, in my view heavy comparison is needed in order to clarify the strengths and weaknesses of so many good performers, in my view LZSSE2 (despite being weaker than LzTurbo 19 in non-textual data) changed the WORLDSCAPE, it has to be tested on 5th/6th gen CPUs...

    1. Thanks for your interesting tests.
      You were right on your analysis, your previous comment has been automatically categorized as spam, probably due to presence of links.