| infogulch wrote:
| I found TerminusDB's white paper on graph compression interesting,
| and I think it could be useful for time series compression:
|
| Succinct Data Structures and Delta Encoding for Modern Databases
| - https://assets.terminusdb.com/research/succinct-data-structu...
| sujayakar wrote:
| Navarro's book Compact Data Structures is an excellent survey
| of this literature!
| https://www.cambridge.org/core/books/compact-data-structures...
|
| I love the perspective of computing the information-theoretic
| lower bound for a data structure and then trying to achieve it
| with as little overhead as possible while still supporting
| efficient operations on the structure.
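|
| (A concrete instance of that lower bound, for anyone unfamiliar:
| representing an arbitrary set of m elements drawn from a universe
| of size n takes at least ceil(log2 C(n, m)) bits, and a succinct
| rank/select bitvector aims for that plus only o(n) extra while
| still answering rank and select queries in O(1) time.)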
| naturalplasmoid wrote:
| One notable omission from this piece is a technique to compress
| integer time series with both positive and _negative_ values.
|
| If you naively apply bit-packing using the Simple8b algorithm,
| you'll find that negative integers are not compressed. This is
| due to how signed integers are represented in modern computers:
| negative integers will have their most significant bit set [1].
|
| Zigzag encoding is a neat transform that circumvents this issue.
| It works by mapping signed integers to unsigned integers so that
| numbers with a small absolute value can be encoded using a small
| number of bits. Put another way, it encodes negative numbers
| using the least significant bit for sign. [2]
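|
| A minimal sketch of the transform (my own illustration, assuming
| 64-bit values; the same idea works at any width):
|
|     // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
|     #include <cstdint>
|
|     uint64_t zigzag_encode(int64_t v) {
|         return (static_cast<uint64_t>(v) << 1) ^
|                static_cast<uint64_t>(v >> 63);  // arithmetic shift of sign
|     }
|
|     int64_t zigzag_decode(uint64_t u) {
|         return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
|     }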
|
| If you're looking for a quick way to experiment with various time
| series compression algorithms, I highly recommend Daniel Lemire's
| FastPFor repository [3] (as linked in the article). I've used the
| Python bindings [4] to quickly evaluate various compression
| algorithms with great success.
|
| Finally I'd like to humbly mention my own tiny contribution [5],
| an adaptation of Lemire's C++ Simple8b implementation (including
| basic methods for delta & zigzag encoding/decoding).
|
| I used C++ templates to make the encoding and decoding routines
| generic over integer bit-width, which expands support up to
| 64-bit integers and offers efficient usage with smaller integers
| (e.g. 16-bit). I made a couple of other minor tweaks, including
| support for arrays up to 2^64 in length and adjusting the
| API/method signatures so they can be used in a more functional
| style. This implementation is slightly simpler to invoke via FFI,
| and I intend to add examples showing how to compile it for usage
| from JS (WebAssembly), Python, and C#. I threw my code up quickly
| in order to share it with you all; hopefully someone finds it
| useful. I intend to expand the usage examples/test cases/etc.,
| and am looking forward to any comments or contributions.
|
| [1] https://en.wikipedia.org/wiki/Signed_number_representation
|
| [2] https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...
|
| [3] https://github.com/lemire/FastPFor
|
| [4] https://github.com/searchivarius/PyFastPFor
|
| [5] https://github.com/naturalplasmoid/simple8b-timeseries-compr...
| jart wrote:
| I love zigzag encoding. The way it uses shift and XOR is
| beautiful. It gives it a performance edge of 3 cycles on my CPU
| versus signed LEB128 encoding. One thing I did once to encode
| Unicode lookup tables in a smaller form (to reduce Python binary
| size): I found it can be very profitable to (1) delta encode,
| then (2) zigzag encode, and finally (3) apply deflate. For
| example:
|     _PyUnicode_PhrasebookOffset2: size is 178,176 bytes
|     _PyUnicode_PhrasebookOffset2: packed size is 100,224 bytes
|     _PyUnicode_PhrasebookOffset2: rle size is 282,216 bytes
|     _PyUnicode_PhrasebookOffset2: deflate size is 52,200 bytes
|     _PyUnicode_PhrasebookOffset2: bz2 size is 76,876 bytes
|     _PyUnicode_PhrasebookOffset2: dleb size is 47,198 bytes
|     _PyUnicode_PhrasebookOffset2: dzd size is 12,748 bytes
|
| It's so cool.
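|
| A rough sketch of that delta + zigzag pre-pass (my illustration,
| not the actual CPython tooling); step (3) would then run deflate
| over the resulting byte stream, e.g. via zlib:
|
|     #include <cstdint>
|     #include <vector>
|
|     std::vector<uint64_t> delta_zigzag(const std::vector<int64_t>& xs) {
|         std::vector<uint64_t> out;
|         out.reserve(xs.size());
|         int64_t prev = 0;
|         for (int64_t x : xs) {
|             int64_t d = x - prev;  // (1) delta encode
|             prev = x;
|             out.push_back((static_cast<uint64_t>(d) << 1) ^
|                           static_cast<uint64_t>(d >> 63));  // (2) zigzag
|         }
|         return out;
|     }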
| spullara wrote:
| For Wavefront I got this down to about 1.5 bytes per metric
| reported (on average across customers) where each metric was a
| tuple of:
|
| name, source, timestamp (s), value (int) and a set of name/value
| pairs
|
| Key to optimizing it was having a database that supported
| different compression strategies based on the data.
| londons_explore wrote:
| That actually sounds pretty bad to me, given that many metrics
| will be constant, very rarely changing, or perfectly correlated
| with another metric.
|
| Were you arithmetic coding stuff, or still doing bit-twiddling
| tricks?
| spullara wrote:
| If you think that sounds bad, I would love to see what you think
| is good and show your work; then they can just add it as one of
| the options. It does basically all the tricks in the article and
| then some. This is a mean, not a median. The median is likely
| much lower, but no one cares about the median when calculating
| how much space you need.
|
| RE: correlated with another metric. That would be an extremely
| expensive one to calculate. You need an algorithm that can
| operate while ingesting millions of points per second, 24/7.
| awild wrote:
| Question for people who've implemented these kinds of compression
| schemes: some of these mechanisms require local context (RLE and
| Simple8b), so how do you handle random access in those cases? Is
| the metadata embedded as a sort of keyframe to which we have to
| advance before decoding the value, or stored as a separate block
| in the headers?
|
| RLE especially sounds like it could degenerate into a binary
| search per lookup, maybe aided by caching the bounds of the last
| run and assuming locality and linear-ish usage?
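|
| That binary-search idea in minimal form (my sketch, not from any
| particular engine): keep a prefix sum of run lengths and
| upper_bound into it per lookup.
|
|     #include <algorithm>
|     #include <cstdint>
|     #include <vector>
|
|     struct RleColumn {
|         std::vector<int64_t> values;     // value stored in each run
|         std::vector<uint64_t> run_ends;  // cumulative run lengths
|
|         // Random access: find the first run whose end is past i.
|         // Assumes i is less than the total decoded length.
|         int64_t at(uint64_t i) const {
|             auto it = std::upper_bound(run_ends.begin(),
|                                        run_ends.end(), i);
|             return values[it - run_ends.begin()];
|         }
|     };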
|
| This might sound obvious, but it wasn't mentioned in the article:
| you can also apply integer-based compression schemes to
| dictionary-encoded data.
|
| And floating-point values don't always need all 64 bits when the
| use case and input data don't require that level of precision.
| nu11ptr wrote:
| I was thinking the same thing as I was reading this. I doubt you
| could retain 100% random access. Instead, I think you would
| generally create "blocks" that are each compressed over a time
| range compatible with your application (i.e. 1-day blocks, 1-week
| blocks, etc.). Then, when picking date/times, you take the X
| blocks that cover the range and further refine the query after
| decompression (example: load the May 5th block --> May 5th 8:10AM
| - 11:13AM). At least, that is what I have done in the past in my
| own home-grown apps. Essentially each block then starts and
| terminates compression; the number of blocks is a trade-off
| between compression efficiency and granularity.
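|
| A bare-bones sketch of that block layout (names and types are
| mine, purely illustrative):
|
|     #include <cstdint>
|     #include <utility>
|     #include <vector>
|
|     struct Block {
|         int64_t start_ts, end_ts;      // time range this block covers
|         std::vector<uint8_t> payload;  // independently compressed points
|     };
|
|     // Stub: a real implementation would decode Simple8b/Gorilla here.
|     std::vector<std::pair<int64_t, double>> decompress(const Block&) {
|         return {};
|     }
|
|     std::vector<std::pair<int64_t, double>>
|     query(const std::vector<Block>& blocks, int64_t start, int64_t end) {
|         std::vector<std::pair<int64_t, double>> out;
|         for (const auto& b : blocks) {
|             if (b.end_ts <= start || b.start_ts >= end)
|                 continue;  // block entirely outside the query range
|             for (const auto& [ts, v] : decompress(b))  // decode whole block
|                 if (ts >= start && ts < end)
|                     out.emplace_back(ts, v);  // refine within the block
|         }
|         return out;
|     }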
| arinlen wrote:
| > (...) how do you handle random access in these cases?
|
| Given we're discussing time series, random access is something
| that's not required at all. There is no use case where you're
| looking for a single, specific data point of a time series.
| Time series are used to store and retrieve batches of data
| points in order to process and/or analyze/visualize them in
| batches. Time series queries consist mainly of "get me all the
| data points of a time series reported between timestamps A and B".
| londons_explore wrote:
| Most applications don't require true random access.
|
| Instead you're typically reading a range of data, and then you
| can decompress just the blocks required for the data you want
| to see.
|
| Caching of partial queries can also help substantially. For
| example, if many queries involve querying the max() of some
| per-second data grouped by minute, it is well worth caching
| that rather than reading the source data every time to
| calculate the max().
|
| Typically the query engine can keep counts of every subquery, how
| frequently it's used, and how many data points it involves, to
| decide how long to cache it for. As far as I'm aware no
| open-source TSDB does this, despite it being a massive, simple
| win, especially for alerting systems and dashboards that run very
| similar queries frequently.
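|
| A toy sketch of that kind of partial-result cache (illustrative
| only; a real engine would also track usage counts and eviction):
|
|     #include <cstdint>
|     #include <map>
|     #include <vector>
|
|     struct Sample { int64_t ts_sec; double value; };
|
|     // Pre-aggregate per-second samples into a per-minute max() so
|     // repeated dashboard/alerting queries can skip the raw data.
|     std::map<int64_t, double> max_by_minute(const std::vector<Sample>& xs) {
|         std::map<int64_t, double> cache;  // minute bucket -> max value
|         for (const auto& s : xs) {
|             auto [it, inserted] = cache.emplace(s.ts_sec / 60, s.value);
|             if (!inserted && s.value > it->second) it->second = s.value;
|         }
|         return cache;
|     }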
| anonymoushn wrote:
| Are there similar algorithms useful for compressing stuff like "a
| list of the top 100 scores in a game?" We already store these as
| diffs of the list, but we could probably do better than zstd-ing
| the list-diffs.
| pizza wrote:
| You could look into multiset sequence compression methods, where
| the multiset elements are the sequences max_score(i, t) over time
| t, one for each rank i: https://arxiv.org/abs/1401.6410
| tyingq wrote:
| This discussion about a Barclays Bank JSON list of 74,000
| numbers might have some ideas:
| https://news.ycombinator.com/item?id=28326036
|
| I was able to get it from a 1.3 MB JSON file (uncompressed) down
| to 151 KB uncompressed and 4 KB compressed, using mostly deltas.
| https://news.ycombinator.com/item?id=28348965
| anonymoushn wrote:
| This is great! I'll have to think about how to combine
| "storing deltas within the list" and "storing deltas of the
| list," but there should be some significant gains available