[HN Gopher] Time-Series Compression Algorithms
___________________________________________________________________
 
Time-Series Compression Algorithms
 
Author : todsacerdoti
Score  : 114 points
Date   : 2022-05-14 16:08 UTC (6 hours ago)
 
web link (www.timescale.com)
w3m dump (www.timescale.com)
 
| [deleted]
 
| infogulch wrote:
| I found TerminusDB's white paper on graph compression interesting,
| and I think it could be useful for time series compression:
| 
| Succinct Data Structures and Delta Encoding for Modern Databases
| - https://assets.terminusdb.com/research/succinct-data-structu...
 
  | sujayakar wrote:
  | Navarro's book Compact Data Structures is an excellent survey
  | of this literature!
  | https://www.cambridge.org/core/books/compact-data-structures...
  | 
  | I love the perspective of computing the information theoretic
  | lower bounds for a data structure and then trying to achieve
  | that with as little overhead as possible while still supporting
  | efficient operations on the structure.
 
| naturalplasmoid wrote:
| One notable omission from this piece is a technique to compress
| integer time series with both positive and _negative_ values.
| 
| If you naively apply bit-packing using the Simple8b algorithm,
| you'll find that negative integers barely compress at all. This
| is due to how signed integers are represented in modern
| computers: in two's complement, negative integers have their
| most significant bits set, so they look like very large unsigned
| values to a bit-packer [1].
| 
| Zigzag encoding is a neat transform that circumvents this issue.
| It works by mapping signed integers to unsigned integers so that
| numbers with a small absolute value can be encoded using a small
| number of bits. Put another way, it encodes negative numbers
| using the least significant bit for sign. [2]
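| 
| For illustration, a minimal Python sketch of the zigzag transform
| (hypothetical helper names, assuming 64-bit signed inputs):
| 
|     def zigzag_encode(n):
|         # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
|         return (n << 1) ^ (n >> 63)
| 
|     def zigzag_decode(z):
|         # Inverse: un-interleave negatives and positives.
|         return (z >> 1) ^ -(z & 1)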
| 
| If you're looking for a quick way to experiment with various time
| series compression algorithms, I highly recommend Daniel Lemire's
| FastPFor repository [3] (as linked in the article). I've used the
| Python bindings [4] to quickly evaluate various compression
| algorithms with great success.
| 
| Finally I'd like to humbly mention my own tiny contribution [5],
| an adaptation of Lemire's C++ Simple8b implementation (including
| basic methods for delta & zigzag encoding/decoding).
| 
| I used C++ templates to make the encoding and decoding routines
| generic over integer bit-width, which expands support up to
| 64-bit integers and offers efficient usage with smaller integers
| (e.g. 16-bit). I made a couple of other minor tweaks, including
| support for arrays up to 2^64 in length, and adjusted the
| API/method signatures so they can be used in a more functional
| style. This implementation is slightly simpler to invoke via FFI,
| and I intend to add examples showing how to compile for usage via
| JS (WebAssembly), Python, and C#. I threw my code up quickly in
| order to share with you all; hopefully someone finds it useful. I
| intend to expand on usage examples, test cases, etc., and am
| looking forward to any comments or contributions.
| 
| [1] https://en.wikipedia.org/wiki/Signed_number_representation
| 
| [2] https://en.wikipedia.org/wiki/Variable-
| length_quantity#Zigza...
| 
| [3] https://github.com/lemire/FastPFor
| 
| [4] https://github.com/searchivarius/PyFastPFor
| 
| [5] https://github.com/naturalplasmoid/simple8b-timeseries-
| compr...
 
  | jart wrote:
  | I love zigzag encoding. The way it uses shift and xor is
  | beautiful. It gives it a performance edge of 3 cycles on my CPU
  | versus signed LEB128 encoding. One thing I did once, to encode
  | Unicode lookup tables in a smaller form (to reduce Python
  | binary size), was to (1) delta encode; then (2) zigzag encode;
  | and finally (3) apply deflate. For example:
  |     _PyUnicode_PhrasebookOffset2: size         is  178,176 bytes
  |     _PyUnicode_PhrasebookOffset2: packed size  is  100,224 bytes
  |     _PyUnicode_PhrasebookOffset2: rle size     is  282,216 bytes
  |     _PyUnicode_PhrasebookOffset2: deflate size is   52,200 bytes
  |     _PyUnicode_PhrasebookOffset2: bz2 size     is   76,876 bytes
  |     _PyUnicode_PhrasebookOffset2: dleb size    is   47,198 bytes
  |     _PyUnicode_PhrasebookOffset2: dzd size     is   12,748 bytes
  | 
  | It's so cool.
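  | 
  | A rough Python sketch of that three-step pipeline (zlib stands
  | in for deflate; the fixed 8-byte packing, the 64-bit signed
  | deltas, and the helper name are my assumptions, not the
  | original code):
  | 
  |     import zlib
  | 
  |     def delta_zigzag_deflate(values):
  |         # (1) delta encode: keep differences between neighbours
  |         deltas = [values[0]] + [b - a for a, b
  |                                 in zip(values, values[1:])]
  |         # (2) zigzag encode so small negative deltas become
  |         #     small unsigned integers
  |         zz = [(d << 1) ^ (d >> 63) for d in deltas]
  |         # (3) serialize and deflate
  |         raw = b"".join(v.to_bytes(8, "little") for v in zz)
  |         return zlib.compress(raw, 9)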
 
| spullara wrote:
| For Wavefront I got this down to about 1.5 bytes per metric
| reported (on average across customers) where each metric was a
| tuple of:
| 
| name, source, timestamp (s), value (int) and a set of name/value
| pairs
| 
| Key to optimizing it was having a database that supported
| different compression strategies based on the data.
 
  | londons_explore wrote:
  | That actually sounds pretty bad to me, given that many metrics
  | will be constant, very rarely changing, or perfectly correlated
  | with another metric.
  | 
  | Were you arithmetic coding stuff, or still doing bit-twiddling
  | tricks?
 
    | spullara wrote:
    | If you think that sounds bad, I would love to see what you
    | think is good and show your work; then they can just add it
    | as one of the options. It does basically all the tricks in
    | the article and then some. This is a mean, not a median. The
    | median is likely much lower, but no one cares about the
    | median when calculating how much space you need.
    | 
    | RE: correlated with another metric. That would be an
    | extremely expensive one to calculate. You need an algorithm
    | that can operate while ingesting millions of points per
    | second, 24/7.
 
| awild wrote:
| Question for people who've implemented these kinds of compression
| schemes: some of these mechanisms require local context (RLE and
| Simple8b), so how do you handle random access in these cases? Is
| the metadata embedded as a sort of key frame to which we have to
| advance before decoding the value? Or as a separate block in the
| headers?
| 
| RLE especially sounds like it could degenerate into a binary
| search per lookup, maybe aided by caching the bounds of the last
| run and assuming locality and linear-ish usage?
| 
| This might sound obvious, but it wasn't mentioned in the article:
| you can also apply integer-based compression schemes to
| dictionary-encoded data.
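| 
| (For illustration, a minimal Python sketch of that idea; the
| helper name is hypothetical, and a real integer scheme such as
| Simple8b or RLE would be applied to the resulting codes:)
| 
|     def dictionary_encode(labels):
|         # Map each distinct string to a small integer code.
|         codes, table = [], {}
|         for label in labels:
|             codes.append(table.setdefault(label, len(table)))
|         # "codes" is now an integer column that can be
|         # delta/zigzag/Simple8b encoded like any other.
|         return table, codes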
| 
| And floating-point values don't always need all 64 bits when the
| use case and input data don't require that level of precision.
 
  | nu11ptr wrote:
  | I was thinking the same thing as I was reading this. I doubt
  | you could retain 100% random access. Instead, I think you would
  | generally create "blocks" compressed over a time range that is
  | compatible with your application (i.e. 1-day blocks, 1-week
  | blocks, etc.), and then when picking date/times you load the
  | relevant blocks and further refine the query after
  | decompression (example: load the May 5th block --> May 5th
  | 8:10AM - 11:13AM). At least, that is what I have done in the
  | past in my own home-grown apps. Essentially each block starts
  | and terminates compression independently; the number of blocks
  | is a trade-off between compression efficiency and granularity.
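  | 
  | A toy sketch of that blocking idea in Python (names are mine;
  | assume blocks maps a day index to a zlib-compressed JSON list
  | of [timestamp, value] pairs):
  | 
  |     import json, zlib
  | 
  |     BLOCK_SECONDS = 86_400  # one block per day
  | 
  |     def block_key(ts):
  |         return ts // BLOCK_SECONDS
  | 
  |     def query(blocks, start, end):
  |         # Decompress only the blocks overlapping the range,
  |         # then refine to the exact time window.
  |         out = []
  |         lo, hi = block_key(start), block_key(end)
  |         for key in range(lo, hi + 1):
  |             if key in blocks:
  |                 pts = json.loads(zlib.decompress(blocks[key]))
  |                 out.extend(p for p in pts
  |                            if start <= p[0] <= end)
  |         return out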
 
  | arinlen wrote:
  | > (...) how do you handle random access in these cases?
  | 
  | Given we're discussing time series, random access is something
  | that's not required at all. There is no use case where you're
  | looking for a single, specific data point of a time series.
  | Time series are used to store and retrieve batches of data
  | points in order to process and/or analyze/visualize in batches.
  | Time series queries consist mainly of "get me all the data
  | points of a time series reported between timestamps A and B".
 
    | [deleted]
 
  | londons_explore wrote:
  | Most applications don't require true random access.
  | 
  | Instead you're typically reading a range of data, and then you
  | can decompress just the blocks required for the data you want
  | to see.
  | 
  | Caching of partial queries can also help substantially. For
  | example, if many queries involve querying the max() of some
  | per-second data grouped by minute, it is well worth caching
  | that rather than reading the source data every time to
  | calculate the max().
  | 
  | Typically the query engine can keep counts of every subquery,
  | how frequently it's used, and how many data points it involves,
  | in order to decide how long to cache it for. As far as I'm
  | aware no open-source TSDB does this, despite it being a
  | massive, simple win, especially for alerting systems and
  | dashboards that run very similar queries frequently.
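  | 
  | A minimal sketch of that cached rollup in Python (names are
  | assumptions; a real engine would also track hit counts and
  | evict stale entries):
  | 
  |     _cache = {}
  | 
  |     def max_per_minute(series_id, points):
  |         # points: iterable of (timestamp_seconds, value) pairs
  |         if series_id not in _cache:
  |             rollup = {}
  |             for ts, value in points:
  |                 minute = ts - ts % 60
  |                 prev = rollup.get(minute, value)
  |                 rollup[minute] = max(prev, value)
  |             _cache[series_id] = rollup
  |         return _cache[series_id]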
 
| anonymoushn wrote:
| Are there similar algorithms useful for compressing stuff like "a
| list of the top 100 scores in a game"? We already store these as
| diffs of the list, but we could probably do better than zstd-ing
| the list-diffs.
 
  | pizza wrote:
  | You could look into multiset sequence compression methods,
  | where the multiset elements are the sequences max_score(i, t)
  | over time t, one for each rank i:
  | https://arxiv.org/abs/1401.6410
 
  | tyingq wrote:
  | This discussion about a Barclays Bank JSON list of 74,000
  | numbers might have some ideas:
  | https://news.ycombinator.com/item?id=28326036
  | 
  | I was able to get it from a 1.3 MB JSON file (uncompressed) to
  | a 151 KB uncompressed but 4 KB compressed file using mostly
  | deltas.
  | https://news.ycombinator.com/item?id=28348965
 
    | anonymoushn wrote:
    | This is great! I'll have to think about how to combine
    | "storing deltas within the list" and "storing deltas of the
    | list," but there should be some significant gains available.
 
___________________________________________________________________
(page generated 2022-05-14 23:00 UTC)