[HN Gopher] Optimizers: The Low-Key MVP
___________________________________________________________________
 
Optimizers: The Low-Key MVP
 
Author : tosh
Score  : 116 points
Date   : 2024-11-16 13:53 UTC (1 days ago)
 
web link (duckdb.org)
w3m dump (duckdb.org)
 
| adrian17 wrote:
| > The SQL above results in a plan similar to the DuckDB optimized
| plan, but it is wordier and more error-prone to write, which can
| potentially lead to bugs.
| 
| FWIW, aside from manual filter pushdown, I consider the JOIN
| variant the canonical / "default" way to merge multiple tables;
| it keeps all the join-related logic in one place, while mixing
| both joining conditions and filtering conditions in WHERE always
| felt more error-prone to me.
 
  | Sesse__ wrote:
  | It is also the only way to represent join conditions for outer
  | joins.
 
  | camgunz wrote:
  | Same here; I've always intuited that this would limit the
  | generated tuples. I'm too lazy to do it now, but I wonder if
  | other DB engines also perform this optimization that
  | effectively makes filtering in JOIN conditions equivalent to
  | filtering in WHERE clauses. I'd also be interested in some
  | example queries that were hand-optimized to the point of
  | obvious obscurity--my guess is it's harder to do this in SQL
  | than in something like C.
 
    | Sesse__ wrote:
    | Any non-toy SQL engine is going to do condition pullup and
    | pushdown. (Well, I know at least one that only does explicit
    | pullup, but does pushdown more implicitly because it doesn't
    | use a tree. The effect is largely the same, modulo some
    | shenanigans around outer joins.)
    | 
    | Pushdown is actually a more subtle problem than people give
    | it credit for; it inhabits this weird space where things are
    | too trivial to be covered in papers but is too hard to cover
    | properly in textbooks (e.g., no coverage of multiple
    | equalities or outer joins).
 
| ikesau wrote:
| > This means your optimizations need to be applied by hand, which
| is sustainable if your data starts changing.
| 
| Seems like a missing "un" here
| 
| Compelling article! I've already found DuckDB to be the most
| ergonomic tool for quick and dirty wrangling, it's good to know
| it can handle massive jobs too.
 
  | Nihilartikel wrote:
  | I regularly use duckdb on datasets of 1B+ rows, with nasty
  | strong columns that may be over 10MB per value in the outliers.
  | Mostly it just works, and fast too! When it doesn't, I'll
  | usually just dump to parquet and hit it with sparksql, but that
  | is the exception rather than the rule.
 
| unwind wrote:
| Meta: I think the title would be better here if it came out and
| said _query_ optimizers.
| 
| That gives a less subtle clue that it's about databases than
| looking at the domain.
 
| lmeyerov wrote:
| Always a fan of query plan articles!
| 
| Note: the dig at dataframe libs is worth some care in case you
| think that means duckdb can optimize and they cannot
| 
| Dask, Polars, and others pick a lazy default in order to make
| distribution and other optimizations easier. When staying in
| their pure fragments ('vectorized'), the same scheduler rewriting
| opportunity is here.
| 
| This is a subtle but important distinction when looking at these
| frameworks. We are making our new graph query language 'gfql' to
| be dataframe-native so it can run naturally & natively as a step
| of pipelines people are already doing, but also to ensure we
| automatically run as optimized CPU/GPU columnar opts. At the same
| time, because of the intent to allow room for query plan
| optimization, we are staying declarative / lazy, even if the
| generated & interpreted code uses an eager DF runtime . I'm
| optimistic about output target lazy DF systems doing query
| planner work for us long-term here, but for the eager framework
| targets, the query planning has to be on our side.
 
| kwillets wrote:
| One possible hand optimization is to push the aggregation below
| the joins, which makes the latter a few hundred rows.
 
___________________________________________________________________
(page generated 2024-11-17 16:01 UTC)