RealTime Data Compression: Compressing small data

Friday, February 5, 2016

Compressing small data

Data compression is primarily seen as a way to compress files. After all, the main objective is to save storage space, isn't it ?
With this background in mind, it's also logical to focus on bigger files. Good compression achieved on a single large archive is worth the savings for countless smaller ones.

However, this is no longer where the bulk of compression happens. Today, compression is everywhere, embedded within systems, achieving its space and transmission savings without user intervention or awareness. The key to these invisible gains is to remain below the end-user perception threshold. To achieve this objective, it's not possible to wait for some large amount of data to process. Instead, data is processed in small amounts.

This would be all good and well if it wasn't for a simple observation : the smaller the amount to compress, the worse the compression ratio.
The reason is pretty simple : data compression works by finding redundancy within the processed source. When a new source starts, there is not yet any redundancy to build upon. And it takes time for any algorithm to achieve a meaningful outcome.
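The effect is easy to observe. The sketch below is only an illustration : it uses zlib from the Python standard library as a stand-in compressor (not zstd), and the JSON record layout is invented for the example. It compares the ratio obtained on one record with the ratio obtained on a thousand similar records compressed together.

```python
import json
import zlib


def make_record(i: int) -> bytes:
    """Build one hypothetical JSON record; the field names are invented."""
    record = {
        "id": i,
        "timestamp": f"2016-02-05T10:{i % 60:02d}:00Z",
        "status": "ok" if i % 3 else "error",
        "message": "probe measurement received",
    }
    return json.dumps(record).encode("utf-8")


def ratio(data: bytes) -> float:
    """Compression ratio = original size / compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# One record alone: almost no redundancy to exploit.
single = make_record(1)

# Many records together: later records can reference earlier ones.
batch = b"\n".join(make_record(i) for i in range(1000))

print(f"one record   ({len(single)} bytes): x{ratio(single):.2f}")
print(f"1000 records ({len(batch)} bytes): x{ratio(batch):.2f}")
```

The exact figures depend on the compressor and on the data, but the single record should barely compress, while the batch compresses several times better.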

Therefore, as the issue comes from starting from a blank history, what about starting from an already populated history ?

Streaming to the rescue

A first solution is streaming : data is cut into smaller blocks, but each block can make reference to previously sent ones. And it works quite well. In spite of some minor losses at block borders, most of the compression opportunities of a single large data source are preserved, but now with the advantage of processing, sending, and receiving tiny blocks on the fly, making the experience smooth.
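As a minimal sketch of the idea, here is streaming done with zlib from the Python standard library (again a stand-in, not the zstd streaming API). Each message is flushed as its own block with `Z_SYNC_FLUSH`, so the receiver can decode it immediately, yet the compressor keeps its history between blocks. The message layout is hypothetical.

```python
import zlib


def make_message(i: int) -> bytes:
    """A hypothetical small protocol message."""
    return (
        '{"seq": %d, "event": "position_update", "player": "player-%d", '
        '"x": %d, "y": %d}' % (i, i % 7, i * 3 % 100, i * 7 % 100)
    ).encode("utf-8")


messages = [make_message(i) for i in range(200)]

# Baseline: every message is compressed from a blank history.
independent = sum(len(zlib.compress(m)) for m in messages)

# Streaming: one long-lived context on each side of the channel.
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()
streamed = 0
received = []
for m in messages:
    # Z_SYNC_FLUSH emits everything pending, so this block is decodable
    # right away, while the history of previous messages is preserved.
    block = compressor.compress(m) + compressor.flush(zlib.Z_SYNC_FLUSH)
    streamed += len(block)
    received.append(decompressor.decompress(block))

print("independent blocks:", independent, "bytes")
print("streamed blocks   :", streamed, "bytes")
```

Note that the receiver must decode blocks in the exact order they were produced : this is the constraint discussed next.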

However, this scenario only works with serial data, a communication channel for example, where order is known and preserved.

For a large category of applications, such as database and storage, this cannot work : data must remain accessible in a random fashion, with no order known a priori. Reaching a specific block sector should not require decoding all preceding ones just to rebuild the dynamic context.

For such use cases, a common work-around is to create some "not too small blocks". Say there are many records of a few hundred bytes each. Group them in packs of at least 16 KB. Now this achieves some nice middle-ground between not-too-poor compression ratio and good enough random access capability.
This is still not ideal though, since it's required to decompress a full block just to get a single random record out of it. Therefore, each application will settle for its own middle ground, using block sizes of 4 KB, 16 KB or even 128 KB, depending on usage pattern.
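The block work-around can be sketched as follows. This is an illustration under assumptions, not how any particular database does it : zlib stands in for the compressor, and the `pack` / `fetch` helpers and the index layout are hypothetical.

```python
import zlib

BLOCK_SIZE = 16 * 1024  # target uncompressed size of one block


def pack(records):
    """Group records into blocks of at least BLOCK_SIZE bytes.

    Returns (blocks, index), where blocks holds independently compressed
    blocks and index[i] = (block number, offset, length) for record i.
    """
    blocks, index = [], []
    current = bytearray()
    for rec in records:
        index.append((len(blocks), len(current), len(rec)))
        current += rec
        if len(current) >= BLOCK_SIZE:
            blocks.append(zlib.compress(bytes(current)))
            current = bytearray()
    if current:  # last, possibly smaller, block
        blocks.append(zlib.compress(bytes(current)))
    return blocks, index


def fetch(blocks, index, i):
    """Read record i: the whole block must be decompressed first."""
    block_no, offset, length = index[i]
    data = zlib.decompress(blocks[block_no])
    return data[offset:offset + length]


# Hypothetical records of a few hundred bytes each.
records = [("record-%05d," % i).encode() + b"field=value;" * 25
           for i in range(500)]
blocks, index = pack(records)
print(len(blocks), "blocks,", sum(map(len, blocks)), "compressed bytes")
print(fetch(blocks, index, 321)[:13])
```

The cost of random access is visible in `fetch` : a full block is decoded to return a single record, which is why the block size ends up being a per-application trade-off.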

Dictionary compression

Preserving both random access at record level and a good compression ratio is hard. But it's achievable too, using a dictionary. To summarize, it's a kind of common prefix, shared by all compressed objects. It makes every compression and decompression operation start from the same populated history.

Dictionary compression has the great property of being compatible with random access. Even for communication scenarios, it can prove easier to manage at scale than "per-connection streaming", since instead of storing one different context per connection, there is always the same context to start from when compressing or decompressing any new data block.

A good dictionary can compress small records into tiny compressed blobs. Sometimes, the current record can be found "as is" entirely within the dictionary, reducing it to a single reference. More likely, some critical redundant elements will be detected (header, footer, keywords) leaving only variable ones to be described (ID fields, date, etc.).
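To make this concrete, here is a minimal sketch using the preset dictionary feature of zlib (the `zdict` parameter of `zlib.compressobj` and `zlib.decompressobj` in the Python standard library). It is not the zstd dictionary API, and both the dictionary content and the record are hand-written assumptions for the example.

```python
import zlib

# Hypothetical dictionary: the skeleton shared by every record.
DICTIONARY = (
    b'{"id": 0, "timestamp": "2016-02-05T10:00:00Z", "status": "error", '
    b'"status": "ok", "message": "probe measurement received"}'
)

RECORD = (
    b'{"id": 4217, "timestamp": "2016-02-05T10:17:00Z", "status": "ok", '
    b'"message": "probe measurement received"}'
)


def compress_with_dict(record: bytes, dictionary: bytes) -> bytes:
    """Compress one record, starting from the dictionary as history."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(record) + c.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress one record; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()


plain = zlib.compress(RECORD, 9)
with_dict = compress_with_dict(RECORD, DICTIONARY)

print("record          :", len(RECORD), "bytes")
print("no dictionary   :", len(plain), "bytes")
print("with dictionary :", len(with_dict), "bytes")
```

Each record is compressed and decompressed on its own, so any record can be read without touching the others. Only the variable parts (here the id and the minute field) remain to be described.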

For this situation to work properly, the dictionary needs to be tuned for the underlying structure of objects to compress. There is no such thing as a "universal dictionary". One must be created and used for a target data type.

Fortunately, this condition can be met quite often.
Just created some new protocol for a transaction engine or an online game ? It's likely based on a few common important messages and keywords (even binary ones). Have some event or log records ? There is likely a grammar for them (json, xml maybe). The same can be said of digital resources, be it html files, css stylesheets, javascript programs, etc.
If you know what you are going to compress, you can create a dictionary for it.

The key is, since it's not possible to create a meaningful "universal dictionary", one must create one dictionary per resource type.

Example of a structured JSON message

How to create a dictionary from a training set ? Well, even though one could be tempted to manually create one, by compacting all keywords and repeatable sequences into a file, this can be a tedious task. Moreover, there is always a chance that the dictionary will have to be updated regularly due to changing conditions.
This is why, starting from v0.5, zstd offers a dictionary builder capability.

Using the builder, it's possible to quickly create a dictionary from a list of samples. The process is relatively fast (a matter of seconds), which makes it possible to generate and update multiple dictionaries for multiple targets.
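To give an intuition of what "building a dictionary from samples" means, here is a deliberately naive builder. It is not the algorithm used by the zstd dictionary builder, which is considerably more elaborate; it merely keeps the byte fragments that appear in the largest number of samples. The function names, the fragment length and the sample records are all assumptions, and zlib's preset dictionary is used to measure the result.

```python
import json
import zlib
from collections import Counter


def make_record(i: int) -> bytes:
    """Build one hypothetical JSON record; the field names are invented."""
    record = {
        "id": i,
        "timestamp": f"2016-02-05T10:{i % 60:02d}:00Z",
        "status": "ok" if i % 3 else "error",
        "message": "probe measurement received",
    }
    return json.dumps(record).encode("utf-8")


def build_dictionary(samples, max_size=1024, ngram=12):
    """Naive builder: keep the fragments seen in the most samples."""
    counts = Counter()
    for sample in samples:
        seen = set()  # count each fragment at most once per sample
        for pos in range(len(sample) - ngram + 1):
            fragment = sample[pos:pos + ngram]
            if fragment not in seen:
                seen.add(fragment)
                counts[fragment] += 1

    dictionary = b""
    for fragment, hits in counts.most_common():
        if hits < 2:
            break  # seen in a single sample: not shared, not useful
        if fragment in dictionary:
            continue
        if len(dictionary) + len(fragment) > max_size:
            break
        # Prepend, so the most frequent fragments stay at the end,
        # where deflate can reach them with the shortest distances.
        dictionary = fragment + dictionary
    return dictionary


def compressed_size(data: bytes, dictionary: bytes = None) -> int:
    if dictionary is None:
        c = zlib.compressobj(9)
    else:
        c = zlib.compressobj(9, zdict=dictionary)
    return len(c.compress(data) + c.flush())


samples = [make_record(i) for i in range(200)]   # training set
dictionary = build_dictionary(samples)
new_record = make_record(4217)                   # not part of the training set

print("dictionary size :", len(dictionary), "bytes")
print("no dictionary   :", compressed_size(new_record), "bytes")
print("with dictionary :", compressed_size(new_record, dictionary), "bytes")
```

Even such a crude dictionary should help on a record it was never trained on, because the record shares its structure with the samples. A real builder selects and orders segments far more carefully.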

But what good can dictionary compression achieve ?
To answer this question, a few tests were run on some typical samples. A flow of JSON records from a probe, some Mercurial log events, and a collection of large JSON documents, provided by @KryzFr.

Collection name	Direct compression	Dictionary compression	Gains	Average unit size	Range
Small JSON records	x1.331 - x1.366	x5.860 - x6.830	~ x4.7	300	200 - 400
Mercurial events	x2.322 - x2.538	x3.377 - x4.462	~ x1.5	1.5 KB	20 - 200 KB
Large JSON docs	x3.813 - x4.043	x8.935 - x13.366	~ x2.8	6 KB	800 - 20 KB

These compression gains are achieved without any speed loss, and even feature faster decompression processing. As one can see, it's no "small improvement". This method can achieve transformative gains, especially for very small records.

Large documents will benefit proportionally less, since dictionary gains are mostly effective in the first few KB. Then there is enough history to build upon, and the compression algorithm can rely on it to compress the rest of the file.
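This diminishing effect can be reproduced with a small experiment. The sketch below is an assumption-laden illustration : zlib with a preset dictionary stands in for zstd, the dictionary is hand-written, and the documents are made of invented JSON records. It measures the gain (size without dictionary divided by size with dictionary) for documents of growing size.

```python
import json
import zlib

# Hypothetical dictionary: the skeleton shared by every record.
DICTIONARY = (
    b'{"id": 0, "timestamp": "2016-02-05T10:00:00Z", "status": "error", '
    b'"status": "ok", "message": "probe measurement received"}'
)


def make_document(n_records: int) -> bytes:
    """A document made of n_records hypothetical JSON lines."""
    lines = []
    for i in range(1, n_records + 1):
        record = {
            "id": i * 7919 % 100000,
            "timestamp": f"2016-02-05T10:{i % 60:02d}:00Z",
            "status": "ok" if i % 3 else "error",
            "message": "probe measurement received",
        }
        lines.append(json.dumps(record))
    return "\n".join(lines).encode("utf-8")


def compressed_size(data: bytes, dictionary: bytes = None) -> int:
    if dictionary is None:
        c = zlib.compressobj(9)
    else:
        c = zlib.compressobj(9, zdict=dictionary)
    return len(c.compress(data) + c.flush())


def dictionary_gain(n_records: int) -> float:
    """Size without dictionary divided by size with dictionary."""
    doc = make_document(n_records)
    return compressed_size(doc) / compressed_size(doc, DICTIONARY)


gains = {n: dictionary_gain(n) for n in (1, 10, 100, 1000)}
for n, gain in gains.items():
    print(f"{n:5d} records: gain x{gain:.2f}")
```

The gain should be largest for the single-record document and shrink toward x1 as the document grows, since the document's own history progressively takes over the role of the dictionary.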

Dictionary compression will work if there is some correlation in a family of small data (common keywords and structure). Hence, deploying one dictionary per type of data will provide the greatest benefits.

Anyway, if you are in a situation where compressing small data can be useful for your use case (databases and contextless communication scenarios come to mind, but there are likely other ones), you are welcome to have a look at this new open source tool and compression methodology and report your experience or feature requests.

Zstd is now getting closer to its v1.0 release, so it's a good time to provide feedback and have it integrated into the final specification.

12 comments:

Unknown, February 5, 2016 at 10:52 PM
Very cool! Will there be an API for the dictionary builder or will it only be a command-line utility?
Sanmayce, February 8, 2016 at 10:53 AM
Hi Yann.

Small files reinforced with dictionaries, deliver better ratio but are kinda bloaty. My point here is to add one more nasty viewpoint for your either small or huge files approaching - the encoding scheme.
Needless to say, the big bunch of unicode alone variants bloats the processing a lot, below one UNICODE, the worst of all, example is given: the famous and revered "Большая Советская Энциклопедия" - 30 volumes of Wikipedia-like content - semi-dictionary, semi-encyclopedia.

XZ uses level 9, LZMA2:26, 64MB dictionary, Zstd is v0.4.7:

D:\_Deathship_textual_corpus\Goldenboy_Turbobench>dir

396,120,046 GreatSovietEncyclopaediaRuRu.dsl
67,790,530 GreatSovietEncyclopaediaRuRu.dsl.Nakamichi
44,591,268 GreatSovietEncyclopaediaRuRu.dsl.xz

D:\_Deathship_textual_corpus\Goldenboy_Turbobench>turbobench_singlefile.bat GreatSovietEncyclopaediaRuRu.dsl
Decompressing 23 times, for more precise overall...

D:\_Deathship_textual_corpus\Goldenboy_Turbobench>turbobenchs.exe GreatSovietEncyclopaediaRuRu.dsl -ezlib,9/fastlz,2/chameleon,2/snappy_c/zstd,1,20/lz4,9,16/lz5,15/brieflz/brotli,11/crush,2/lzma,9/zpaq,2/lzf/shrinker/yappy/trle/memcpy/naka/lzturbo,19,29,39 -k0 -J23
83424253 21.1 3.16 DDDDDDDDDDDDDD 281.94 zlib 9 GreatSovietEncyclopaediaRuRu.dsl
149995831 37.9 196.67 DDDDDDDDDDDDDD 394.19 fastlz 2 GreatSovietEncyclopaediaRuRu.dsl
210522291 53.1 950.24 DDDDDDDDDDDDDD 1670.38 chameleon 2 GreatSovietEncyclopaediaRuRu.dsl
158204122 39.9 283.65 DDDDDDDDDDDDDD 640.74 snappy_c GreatSovietEncyclopaediaRuRu.dsl
95033271 24.0 137.71 DDDDDDDDDDDDDD 419.31 zstd 1 GreatSovietEncyclopaediaRuRu.dsl
52389046 13.2 1.39 DDDDDDDDDDDDDD 273.64 zstd 20 GreatSovietEncyclopaediaRuRu.dsl
94374416 23.8 7.61 DDDDDDDDDDDDDD 1098.03 lz4 9 GreatSovietEncyclopaediaRuRu.dsl
94200951 23.8 6.87 DDDDDDDDDDDDDD 1099.60 lz4 16 GreatSovietEncyclopaediaRuRu.dsl
83027430 21.0 1.25 DDDDDDDDDDDDDD 599.31 lz5 15 GreatSovietEncyclopaediaRuRu.dsl
125356638 31.6 85.31 DDDDDDDDDDDDDD 173.12 brieflz GreatSovietEncyclopaediaRuRu.dsl
60458801 15.3 0.26 DDDDDDDDDDDDDD 321.02 brotli 11 GreatSovietEncyclopaediaRuRu.dsl
69685413 17.6 0.12 DDDDDDDDDDDDDD 311.04 crush 2 GreatSovietEncyclopaediaRuRu.dsl
42372850 10.7 0.38 DDDDDDDDDDDDDD 113.60 lzma 9 GreatSovietEncyclopaediaRuRu.dsl
65972471 16.7 3.17 DDDDDDDDDDDDD 70.41 zpaq 2 GreatSovietEncyclopaediaRuRu.dsl
146599615 37.0 253.62 DDDDDDDDDDDDDD 538.42 lzf GreatSovietEncyclopaediaRuRu.dsl
172927017 43.7 208.34 DDDDDDDDDDDDDD 447.19 shrinker GreatSovietEncyclopaediaRuRu.dsl
121893595 30.8 50.52 DDDDDDDDDDDDDD 843.75 yappy GreatSovietEncyclopaediaRuRu.dsl
396120050 100.0 103.35 DDDDDDDDDDDDDD 1682.47 trle GreatSovietEncyclopaediaRuRu.dsl
396120050 100.0 2400.28 DDDDDDDDDDDDDD 1680.56 memcpy GreatSovietEncyclopaediaRuRu.dsl
^CTerminate batch job (Y/N)? ye encoded; Done 3%; Compression Ratio: 5.23:1

http://www.sanmayce.com/Downloads/BRE_turbobench.png

Regards
Sanmayce, March 6, 2016 at 10:33 PM
Kinda felt that, regardless of off-topicness (the reason - the key word 'dictionary' triggered in me the actual meaning not the compression term), some heavy DSLs have to be tested with level 21, of course along with other performers, LZSSE2 is in the mix too:

Zstd21.png:
https://onedrive.live.com/redir?resid=8439CC8C71159665!127&authkey=!AMjXR_pCyKvT_DQ&v=3&ithint=photo%2cpng

log-Laptop-Toshiba_DSLs.txt:
https://onedrive.live.com/redir?resid=8439CC8C71159665!128&authkey=!AH_sf4Ss_6YtFFU&ithint=file%2ctxt

Superfast is Zstd, the tests show that 21 brings more, achieving such superb results WITH SO LITTLE RAM FOOTPRINT is extra-superb, something, Yann, you should emphasize on paper in future, I mean as XZ reports in verbose mode (needed RAM for compression), my 2 cents.

I am preparing one insane 97,504,190,079 bytes long corpus, called OOW, stands for Ocean-Of-Words, consisted of 12 corpora, which will throw a beam of light on the textual [de]compression topic...
Sanmayce, March 7, 2016 at 2:00 AM
And my last comment in here, finally on-topic:
www.sanmayce.com/Downloads/13_filelets.png

A quick juxtaposition between Brotli/Zstd/LZ5/Goldenboy, latest (bro_Feb-10-2016, 0.5.1, 1.4) revisions.

I see no such big discrepancy, taking into account Zstd's lightweightness, after the 2MB mark it vanishes altogether:

2,120,123 Thomas_Mann_-_Der_Zauberberg_(German).txt
832,507 Thomas_Mann_-_Der_Zauberberg_(German).txt.15_256MB.lz5
640,268 Thomas_Mann_-_Der_Zauberberg_(German).txt.L11_W24.brotli
644,930 Thomas_Mann_-_Der_Zauberberg_(German).txt.L21.zst
855,771 Thomas_Mann_-_Der_Zauberberg_(German).txt.Nakamichi
2,227,200 Thomas_Mann_-_La_Montagne_magique_(French).tar
797,378 Thomas_Mann_-_La_Montagne_magique_(French).tar.15_256MB.lz5
600,425 Thomas_Mann_-_La_Montagne_magique_(French).tar.L11_W24.brotli
603,811 Thomas_Mann_-_La_Montagne_magique_(French).tar.L21.zst
787,830 Thomas_Mann_-_La_Montagne_magique_(French).tar.Nakamichi
2,043,974 Thomas_Mann_-_The_Magic_Mountain_(English).txt
785,959 Thomas_Mann_-_The_Magic_Mountain_(English).txt.15_256MB.lz5
595,306 Thomas_Mann_-_The_Magic_Mountain_(English).txt.L11_W24.brotli
605,680 Thomas_Mann_-_The_Magic_Mountain_(English).txt.L21.zst
803,412 Thomas_Mann_-_The_Magic_Mountain_(English).txt.Nakamichi

Being a Nobel Prize laureate means ... Thomas Mann should be included in every textual benchmark.

When I finish the ~400 files torture then more vivid picture will appear...
Sanmayce, April 16, 2016 at 10:53 AM
My yesterday comment was treated as spam, maybe due to the 2 URLs, anyway, just for sake of showing the improved strength of Zstd here it is again:

Hi again, just tested your stronger v0.6.0, for the first time I see Zstd outperforming LzTurbo v1.3, the file is semi-small, namely, 'the_meaning_of_zen.pdf' 54,272 bytes in size, not some mumbo-jumbo but an excellent book excerpt by MATSUMOTO Shirõ.

In my quick (only 29 testfiles) decompression showdown your latest Zstd 0.6.0 is on par (sizewise) with LzTurbo 1.3, however falls behind in decompression rate department, the full console log (Laptop i5-2430M @3GHz, DDR3 666 MHz):
URL1
The full bench (DECOMPRESS_all.bat) is here:
URL2

Maybe, on newer CPUs with more caches those ~200MB/s will jump considerably.

On a try-and-fail note, hee-hee, I tried only YMM registers variant of Nakamichi reading/writing within 1MB window, it disappoints, 2x/3x slower than Conor's LZSSE2 level 17.
Yet, I read, at Agner Fog's blog:

"Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address."

This very line prompted me to write this YMM variant, maybe Skylake will execute Tsubame faster than the ... mobile ... Sandy-Bridge, ugh.

Some months separate me from having a much richer picture of side-by-side list, in my view heavy comparison is needed in order to clarify the strengths and weaknesses of so many good performers, in my view LZSSE2 (despite being weaker than LzTurbo 19 in non-textual data) changed the WORLDSCAPE, it has to be tested on 5th/6th gen CPUs...