[HN Gopher] Time-Series Compression Algorithms
___________________________________________________________________
 
Time-Series Compression Algorithms
 
Author : todsacerdoti
Score  : 114 points
Date   : 2022-05-14 16:08 UTC (6 hours ago)
 
web link (www.timescale.com)
w3m dump (www.timescale.com)
 
| [deleted]
 
| infogulch wrote:
| I found TerminusDB's white paper on graph compression interesting,
| and I think it could be useful for time series compression:
| 
| Succinct Data Structures and Delta Encoding for Modern Databases
| - https://assets.terminusdb.com/research/succinct-data-structu...
 
  | sujayakar wrote:
  | Navarro's book Compact Data Structures is an excellent survey
  | of this literature!
  | https://www.cambridge.org/core/books/compact-data-structures...
  | 
  | I love the perspective of computing the information theoretic
  | lower bounds for a data structure and then trying to achieve
  | that with as little overhead as possible while still supporting
  | efficient operations on the structure.
 
| naturalplasmoid wrote:
| One notable omission from this piece is a technique to compress
| integer time series with both positive and _negative_ values.
| 
| If you naively apply bit-packing using the Simple8b algorithm,
| you'll find that negative integers barely compress at all. This
| is due to how signed integers are represented in modern
| computers: in two's complement, negative integers have their
| most significant bits set, so they look like very large unsigned
| values to a bit-packer [1].
| 
| Zigzag encoding is a neat transform that circumvents this issue.
| It works by mapping signed integers to unsigned integers so that
| numbers with a small absolute value can be encoded using a small
| number of bits. Put another way, it encodes negative numbers
| using the least significant bit for sign. [2]
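| 
| For illustration, a minimal Python sketch of the zigzag transform
| (hypothetical helper names, assuming 64-bit signed inputs):
| 
|     def zigzag_encode(n):
|         # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
|         return (n << 1) ^ (n >> 63)
| 
|     def zigzag_decode(z):
|         # Inverse: un-interleave negatives and positives.
|         return (z >> 1) ^ -(z & 1)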
| 
| If you're looking for a quick way to experiment with various time
| series compression algorithms, I highly recommend Daniel Lemire's
| FastPFor repository [3] (as linked in the article). I've used the
| Python bindings [4] to quickly evaluate various compression
| algorithms with great success.
| 
| Finally I'd like to humbly mention my own tiny contribution [5],
| an adaptation of Lemire's C++ Simple8b implementation (including
| basic methods for delta & zigzag encoding/decoding).
| 
| I used C++ templates to make the encoding and decoding routines
| generic over integer bit-width, which expands support up to
| 64-bit integers and offers efficient usage with smaller integers
| (e.g. 16-bit). I made a couple of other minor tweaks, including
| support for arrays up to 2^64 in length, and adjusted the
| API/method signatures so they can be used in a more functional
| style. This implementation is slightly simpler to invoke via FFI,
| and I intend to add examples showing how to compile for usage via
| JS (WebAssembly), Python, and C#. I threw my code up quickly in
| order to share with you all; hopefully someone finds it useful. I
| intend to expand on usage examples, test cases, etc., and am
| looking forward to any comments or contributions.
| 
| [1] https://en.wikipedia.org/wiki/Signed_number_representation
| 
| [2] https://en.wikipedia.org/wiki/Variable-
| length_quantity#Zigza...
| 
| [3] https://github.com/lemire/FastPFor
| 
| [4] https://github.com/searchivarius/PyFastPFor
| 
| [5] https://github.com/naturalplasmoid/simple8b-timeseries-
| compr...
 
  | jart wrote:
  | I love zigzag encoding. The way it uses shift and xor is
  | beautiful. It gives it a performance edge of 3 cycles on my CPU
  | versus signed LEB128 encoding. One thing I did once, to encode
  | Unicode lookup tables in a smaller form (to reduce Python
  | binary size), was to (1) delta encode; then (2) zigzag encode;
  | and finally (3) apply deflate. For example:
  |     _PyUnicode_PhrasebookOffset2: size         is  178,176 bytes
  |     _PyUnicode_PhrasebookOffset2: packed size  is  100,224 bytes
  |     _PyUnicode_PhrasebookOffset2: rle size     is  282,216 bytes
  |     _PyUnicode_PhrasebookOffset2: deflate size is   52,200 bytes
  |     _PyUnicode_PhrasebookOffset2: bz2 size     is   76,876 bytes
  |     _PyUnicode_PhrasebookOffset2: dleb size    is   47,198 bytes
  |     _PyUnicode_PhrasebookOffset2: dzd size     is   12,748 bytes
  | 
  | It's so cool.
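  | 
  | A rough Python sketch of that three-step pipeline (zlib stands
  | in for deflate; the fixed 8-byte packing, the 64-bit signed
  | deltas, and the helper name are my assumptions, not the
  | original code):
  | 
  |     import zlib
  | 
  |     def delta_zigzag_deflate(values):
  |         # (1) delta encode: keep differences between neighbours
  |         deltas = [values[0]] + [b - a for a, b
  |                                 in zip(values, values[1:])]
  |         # (2) zigzag encode so small negative deltas become
  |         #     small unsigned integers
  |         zz = [(d << 1) ^ (d >> 63) for d in deltas]
  |         # (3) serialize and deflate
  |         raw = b"".join(v.to_bytes(8, "little") for v in zz)
  |         return zlib.compress(raw, 9)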
 
| spullara wrote:
| For Wavefront I got this down to about 1.5 bytes per metric
| reported (on average across customers) where each metric was a
| tuple of:
| 
| name, source, timestamp (s), value (int) and a set of name/value
| pairs
| 
| Key to optimizing it was having a database that supported
| different compression strategies based on the data.
 
  | londons_explore wrote:
  | That actually sounds pretty bad to me, given that many metrics
  | will be constant, very rarely changing, or perfectly correlated
  | with another metric.
  | 
  | Were you arithmetic coding stuff, or still doing bit-twiddling
  | tricks?
 
    | spullara wrote:
    | If you think that sounds bad, I would love to see what you
    | think is good and show your work; then they can just add it
    | as one of the options. It does basically all the tricks in
    | the article and then some. This is a mean, not a median. The
    | median is likely much lower, but no one cares about the
    | median when calculating how much space you need.
    | 
    | RE: correlated with another metric. That would be an
    | extremely expensive one to calculate. You need an algorithm
    | that can operate while ingesting millions of points per
    | second, 24/7.
 
| awild wrote:
| Question for people who've implemented these kinds of compression
| schemes: some of these mechanisms require local context (RLE and
| Simple8b), so how do you handle random access in these cases? Is
| the metadata embedded as a sort of key frame to which we have to
| advance before decoding the value? Or as a separate block in the
| headers?
| 
| RLE especially sounds like it could degenerate into a binary
| search per lookup, maybe aided by caching the bounds of the last
| run and assuming locality and linear-ish usage?
| 
| This might sound obvious, but it wasn't mentioned in the article:
| you can also apply integer-based compression schemes to
| dictionary-encoded data.
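| 
| (For illustration, a minimal Python sketch of that idea; the
| helper name is hypothetical, and a real integer scheme such as
| Simple8b or RLE would be applied to the resulting codes:)
| 
|     def dictionary_encode(labels):
|         # Map each distinct string to a small integer code.
|         codes, table = [], {}
|         for label in labels:
|             codes.append(table.setdefault(label, len(table)))
|         # "codes" is now an integer column that can be
|         # delta/zigzag/Simple8b encoded like any other.
|         return table, codes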
| 
| And floating-point values don't always need all 64 bits when the
| use case and input data don't require that level of precision.
 
  | nu11ptr wrote:
  | I was thinking the same thing as I was reading this. I doubt
  | you could retain 100% random access. Instead, I think you would
  | generally create "blocks" compressed over a time range that is
  | compatible with your application (i.e. 1-day blocks, 1-week
  | blocks, etc.), and then when picking date/times you load the
  | relevant blocks and further refine the query after
  | decompression (example: load the May 5th block --> May 5th
  | 8:10AM - 11:13AM). At least, that is what I have done in the
  | past in my own home-grown apps. Essentially each block starts
  | and terminates compression independently; the number of blocks
  | is a trade-off between compression efficiency and granularity.
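  | 
  | A toy sketch of that blocking idea in Python (names are mine;
  | assume blocks maps a day index to a zlib-compressed JSON list
  | of [timestamp, value] pairs):
  | 
  |     import json, zlib
  | 
  |     BLOCK_SECONDS = 86_400  # one block per day
  | 
  |     def block_key(ts):
  |         return ts // BLOCK_SECONDS
  | 
  |     def query(blocks, start, end):
  |         # Decompress only the blocks overlapping the range,
  |         # then refine to the exact time window.
  |         out = []
  |         lo, hi = block_key(start), block_key(end)
  |         for key in range(lo, hi + 1):
  |             if key in blocks:
  |                 pts = json.loads(zlib.decompress(blocks[key]))
  |                 out.extend(p for p in pts
  |                            if start <= p[0] <= end)
  |         return out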
 
  | arinlen wrote:
  | > (...) how do you handle random access in these cases?
  | 
  | Given we're discussing time series, random access is something
  | that's not required at all. There is no use case where you're
  | looking for a single, specific data point of a time series.
  | Time series are used to store and retrieve batches of data
  | points in order to process and/or analyze/visualize in batches.
  | Time series queries consist mainly of "get me all the data
  | points of a time series reported between timestamps A and B".
 
    | [deleted]
 
  | londons_explore wrote:
  | Most applications don't require true random access.
  | 
  | Instead you're typically reading a range of data, and then you
  | can decompress just the blocks required for the data you want
  | to see.
  | 
  | Caching of partial queries can also help substantially. For
  | example, if many queries involve querying the max() of some
  | per-second data grouped by minute, it is well worth caching
  | that rather than reading the source data every time to
  | calculate the max().
  | 
  | Typically the query engine can keep counts of every subquery,
  | how frequently it's used, and how many data points it involves,
  | in order to decide how long to cache it for. As far as I'm
  | aware no open-source TSDB does this, despite it being a
  | massive, simple win, especially for alerting systems and
  | dashboards that run very similar queries frequently.
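  | 
  | A minimal sketch of that cached rollup in Python (names are
  | assumptions; a real engine would also track hit counts and
  | evict stale entries):
  | 
  |     _cache = {}
  | 
  |     def max_per_minute(series_id, points):
  |         # points: iterable of (timestamp_seconds, value) pairs
  |         if series_id not in _cache:
  |             rollup = {}
  |             for ts, value in points:
  |                 minute = ts - ts % 60
  |                 prev = rollup.get(minute, value)
  |                 rollup[minute] = max(prev, value)
  |             _cache[series_id] = rollup
  |         return _cache[series_id]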
 
| anonymoushn wrote:
| Are there similar algorithms useful for compressing stuff like "a
| list of the top 100 scores in a game"? We already store these as
| diffs of the list, but we could probably do better than zstd-ing
| the list-diffs.
 
  | pizza wrote:
  | You could look into multiset sequence compression methods,
  | where the multiset elements are the sequences max_score(i, t)
  | over time t, one for each rank i:
  | https://arxiv.org/abs/1401.6410
 
  | tyingq wrote:
  | This discussion about a Barclays Bank JSON list of 74,000
  | numbers might have some ideas:
  | https://news.ycombinator.com/item?id=28326036
  | 
  | I was able to get it from a 1.3 MB JSON file (uncompressed) to
  | a 151 KB uncompressed but 4 KB compressed file using mostly
  | deltas.
  | https://news.ycombinator.com/item?id=28348965
 
    | anonymoushn wrote:
    | This is great! I'll have to think about how to combine
    | "storing deltas within the list" and "storing deltas of the
    | list," but there should be some significant gains available.
 
___________________________________________________________________
(page generated 2022-05-14 23:00 UTC)