[HN Gopher] Show HN: Data Diff - compare tables of any size across databases
___________________________________________________________________
 
Show HN: Data Diff - compare tables of any size across databases
 
Gleb, Alex, Erez and Simon here - we are building an open-source
tool for comparing data within and across databases at any scale.
The repo is at https://github.com/datafold/data-diff, and our home
page is https://datafold.com/.

As a company, Datafold builds tools for data engineers to automate
the most tedious and error-prone tasks falling through the cracks
of the modern data stack, such as data testing and lineage. We
launched two years ago with a tool for regression-testing changes
to ETL code: https://news.ycombinator.com/item?id=24071955. It
compares the produced data before and after the code change and
shows the impact on values, aggregate metrics, and downstream data
applications.

While working with many customers on improving their data
engineering experience, we kept hearing that they needed to diff
their data across databases to validate data replication between
systems. There were 3 main use cases for such replication:

(1) To perform analytics on transactional data in an OLAP engine
(e.g. PostgreSQL > Snowflake)

(2) To migrate between transactional stores (e.g. MySQL >
PostgreSQL)

(3) To leverage data in a specialized engine (e.g. PostgreSQL >
ElasticSearch)

Despite multiple vendors (e.g., Fivetran, Stitch) and open-source
products (Airbyte, Debezium) solving data replication, there was no
tooling for validating the correctness of such replication. When we
researched how teams were going about this, we found that most were
either:

- Running manual checks: e.g., starting with COUNT(*) and then
digging into the discrepancies, which often took hours to pinpoint
the inconsistencies.

- Using distributed MPP engines such as Spark or Trino to download
the complete datasets from both databases and then comparing them
in memory - an expensive process requiring complex infrastructure.

Our users wanted a tool that could:

(1) Compare datasets quickly (seconds/minutes) at large scale
(millions/billions of rows) across different databases

(2) Have minimal network IO and database workload overhead

(3) Provide straightforward output: basic stats and which rows are
different

(4) Be embedded into a data orchestrator such as Airflow to run
right after the replication process

So we built Data Diff as an open-source package available through
pip. It can be run from the CLI or wrapped into any data
orchestrator such as Airflow, Dagster, etc.

To solve for speed at scale with minimal overhead, Data Diff relies
on checksumming the data in both databases and uses binary search
to identify diverging records. That way, it can compare arbitrarily
large datasets in logarithmic time and IO, transferring only a tiny
fraction of the data over the network. For example, it can diff
tables with 25M rows in ~10s and 1B+ rows in ~5m across two
physically separate PostgreSQL databases while running on a typical
laptop.
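
To make that concrete, here's a minimal sketch of the
checksum-and-bisect loop in Python (illustrative only - the real
implementation splits key ranges into more than two segments,
handles non-integer keys, and adapts its queries per database; the
fetch_rows helper and the PostgreSQL-flavored md5-to-bigint cast
are assumptions for this example):

    # Checksum a key range [lo, hi) in one query; only the aggregate
    # crosses the network, never the rows themselves.
    # (PostgreSQL-flavored SQL; the cast trick varies by database.)
    CHECKSUM_SQL = """
        SELECT coalesce(sum(
            ('x' || substring(md5(concat(id, col1)), 1, 16))::bit(64)::bigint
        ), 0)
        FROM {table} WHERE id >= %s AND id < %s
    """

    def checksum(conn, table, lo, hi):
        cur = conn.cursor()
        cur.execute(CHECKSUM_SQL.format(table=table), (lo, hi))
        return cur.fetchone()[0]

    def diff(conn_a, conn_b, table, lo, hi, threshold=10_000):
        # Matching checksums: treat the whole segment as identical.
        if checksum(conn_a, table, lo, hi) == checksum(conn_b, table, lo, hi):
            return []
        if hi - lo <= threshold:
            # Small diverging segment: download just this slice from
            # both sides and compare row by row (hypothetical helper).
            rows_a = set(fetch_rows(conn_a, table, lo, hi))
            rows_b = set(fetch_rows(conn_b, table, lo, hi))
            return sorted(rows_a ^ rows_b)
        # Large diverging segment: bisect the key range and recurse,
        # so time and IO stay logarithmic in the table size.
        mid = (lo + hi) // 2
        return (diff(conn_a, conn_b, table, lo, mid)
                + diff(conn_a, conn_b, table, mid, hi))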

We've launched this tool under the MIT license so that any
developer can use it, and to encourage contributions of other
database connectors. We didn't want to charge engineers for such a
fundamental use case. We make money by charging a license fee for
advanced solutions such as column-level data lineage, CI workflow
automation, and ML-powered alerts.
 
Author : hichkaker
Score  : 67 points
Date   : 2022-06-22 15:28 UTC (7 hours ago)
 
| michalg wrote:
| Can you diff a table against a view, or are only tables
| supported?
 
  | Sirupsen wrote:
  | If the view is materialized into a table, it will work! Diffing
  | queries is not supported yet.
  | 
  | data-diff only relies on a `sum(md5(concat(col1, col2)))`
  | aggregation as well as `min(id), max(id)` to get the id
  | bookends.
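  | 
  | (Since md5() returns hex text, the digest has to be folded into
  | something summable first - e.g. on Postgres something along the
  | lines of `sum(('x' || substring(md5(concat(col1, col2)), 1,
  | 16))::bit(64)::bigint)`; the exact casting trick varies by
  | database.)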
 
| higeorge13 wrote:
| Awesome tool, we will definitely give it a try! 2 questions:
| 
| - How do you handle the data replication lag in the comparison?
| 
| - I assume this works on identical tables between 2 databases,
| right? Any support for "similar" tables based on a column set?
| Imagine a use case where we have a table X in one db, and another
| table Y in another db with some columns from X plus enhanced
| attributes.
 
  | Sirupsen wrote:
  | (1) You can specify `--min-age` to exclude records in the
  | replication lag window
  | 
  | (2) We can do that, as long as the shared columns have the same
  | names and compatible-ish types. There isn't support for
  | differently named columns yet (but it's on the list)
  | 
  | Cheers. Open an issue if you run into trouble!
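  | 
  | To illustrate (1), an invocation might look roughly like the
  | following (made-up URIs and table name - see the README for the
  | exact syntax):
  | 
  |   data-diff postgresql://host/db1 orders mysql://host/db2 orders \
  |     --min-age=5min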
 
| cryptonector wrote:
| Does FDW let you do performant `FULL OUTER JOIN`s and/or `NATURAL
| FULL OUTER JOIN`s? If so then I would think that would be a
| decent place to start for remote DB diffs for PG. It might not be
| enough, of course, if the tables are huge, in which case taking a
| page from rsync and using some sort of per-row checksum as TFA
| does is clearly a good idea.
 
  | Sirupsen wrote:
  | I'm not completely sure I understand your comment, so pardon me
  | if I misunderstand. I don't think a foreign data wrapper would
  | fundamentally be more efficient with whichever table is
  | *foreign*, especially for an OUTER JOIN - unless you're
  | basically implementing something similar to data-diff on top of
  | an OUTER JOIN with FDW, which seems possible.
  | 
  | If you're doing in-database diffs, however, a join-based
  | approach will likely outperform data-diff.
  | 
  | Ideally databases would support a standard MERKLE TREE INDEX so
  | we could get extremely fast comparisons.
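  | 
  | (For the in-database case, the join-based diff I have in mind is
  | roughly `SELECT * FROM a FULL OUTER JOIN b USING (id) WHERE a IS
  | DISTINCT FROM b` - a sketch; exact row-comparison syntax varies
  | by database.)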
 
___________________________________________________________________
(page generated 2022-06-22 23:00 UTC)