|
| adrian17 wrote:
| > The SQL above results in a plan similar to the DuckDB optimized
| plan, but it is wordier and more error-prone to write, which can
| potentially lead to bugs.
|
| FWIW, aside from manual filter pushdown, I consider the JOIN
| variant the canonical / "default" way to merge multiple tables;
| it keeps all the join-related logic in one place, while mixing
| both joining conditions and filtering conditions in WHERE always
| felt more error-prone to me.
| Sesse__ wrote:
| It is also the only way to represent join conditions for outer
| joins.
| camgunz wrote:
| Same here; I've always intuited that this would limit the
| generated tuples. I'm too lazy to do it now, but I wonder if
| other DB engines also perform this optimization that
| effectively makes filtering in JOIN conditions equivalent to
| filtering in WHERE clauses. I'd also be interested in some
| example queries that were hand-optimized to the point of
| obvious obscurity--my guess is it's harder to do this in SQL
| than in something like C.
| Sesse__ wrote:
| Any non-toy SQL engine is going to do condition pullup and
| pushdown. (Well, I know at least one that only does explicit
| pullup, but does pushdown more implicitly because it doesn't
| use a tree. The effect is largely the same, modulo some
| shenanigans around outer joins.)
|
| Pushdown is actually a more subtle problem than people give
| it credit for; it inhabits this weird space where things are
| too trivial to be covered in papers but is too hard to cover
| properly in textbooks (e.g., no coverage of multiple
| equalities or outer joins).
| ikesau wrote:
| > This means your optimizations need to be applied by hand, which
| is sustainable if your data starts changing.
|
| Seems like a missing "un" here
|
| Compelling article! I've already found DuckDB to be the most
| ergonomic tool for quick and dirty wrangling, it's good to know
| it can handle massive jobs too.
| Nihilartikel wrote:
| I regularly use duckdb on datasets of 1B+ rows, with nasty
| strong columns that may be over 10MB per value in the outliers.
| Mostly it just works, and fast too! When it doesn't, I'll
| usually just dump to parquet and hit it with sparksql, but that
| is the exception rather than the rule.
| unwind wrote:
| Meta: I think the title would be better here if it came out and
| said _query_ optimizers.
|
| That gives a less subtle clue that it's about databases than
| looking at the domain.
| lmeyerov wrote:
| Always a fan of query plan articles!
|
| Note: the dig at dataframe libs is worth some care in case you
| think that means duckdb can optimize and they cannot
|
| Dask, Polars, and others pick a lazy default in order to make
| distribution and other optimizations easier. When staying in
| their pure fragments ('vectorized'), the same scheduler rewriting
| opportunity is here.
|
| This is a subtle but important distinction when looking at these
| frameworks. We are making our new graph query language 'gfql' to
| be dataframe-native so it can run naturally & natively as a step
| of pipelines people are already doing, but also to ensure we
| automatically run as optimized CPU/GPU columnar opts. At the same
| time, because of the intent to allow room for query plan
| optimization, we are staying declarative / lazy, even if the
| generated & interpreted code uses an eager DF runtime . I'm
| optimistic about output target lazy DF systems doing query
| planner work for us long-term here, but for the eager framework
| targets, the query planning has to be on our side.
| kwillets wrote:
| One possible hand optimization is to push the aggregation below
| the joins, which makes the latter a few hundred rows.
___________________________________________________________________
(page generated 2024-11-17 16:01 UTC) |