proxy70

	[HN Gopher] Optimizers: The Low-Key MVP ___________________________________________________________________ Optimizers: The Low-Key MVP Author : tosh Score : 116 points Date : 2024-11-16 13:53 UTC (1 days ago)
	web link (duckdb.org)
	w3m dump (duckdb.org)
	\| adrian17 wrote: \| > The SQL above results in a plan similar to the DuckDB optimized \| plan, but it is wordier and more error-prone to write, which can \| potentially lead to bugs. \| \| FWIW, aside from manual filter pushdown, I consider the JOIN \| variant the canonical / "default" way to merge multiple tables; \| it keeps all the join-related logic in one place, while mixing \| both joining conditions and filtering conditions in WHERE always \| felt more error-prone to me. \| Sesse__ wrote: \| It is also the only way to represent join conditions for outer \| joins. \| camgunz wrote: \| Same here; I've always intuited that this would limit the \| generated tuples. I'm too lazy to do it now, but I wonder if \| other DB engines also perform this optimization that \| effectively makes filtering in JOIN conditions equivalent to \| filtering in WHERE clauses. I'd also be interested in some \| example queries that were hand-optimized to the point of \| obvious obscurity--my guess is it's harder to do this in SQL \| than in something like C. \| Sesse__ wrote: \| Any non-toy SQL engine is going to do condition pullup and \| pushdown. (Well, I know at least one that only does explicit \| pullup, but does pushdown more implicitly because it doesn't \| use a tree. The effect is largely the same, modulo some \| shenanigans around outer joins.) \| \| Pushdown is actually a more subtle problem than people give \| it credit for; it inhabits this weird space where things are \| too trivial to be covered in papers but is too hard to cover \| properly in textbooks (e.g., no coverage of multiple \| equalities or outer joins). \| ikesau wrote: \| > This means your optimizations need to be applied by hand, which \| is sustainable if your data starts changing. \| \| Seems like a missing "un" here \| \| Compelling article! I've already found DuckDB to be the most \| ergonomic tool for quick and dirty wrangling, it's good to know \| it can handle massive jobs too. \| Nihilartikel wrote: \| I regularly use duckdb on datasets of 1B+ rows, with nasty \| strong columns that may be over 10MB per value in the outliers. \| Mostly it just works, and fast too! When it doesn't, I'll \| usually just dump to parquet and hit it with sparksql, but that \| is the exception rather than the rule. \| unwind wrote: \| Meta: I think the title would be better here if it came out and \| said _query_ optimizers. \| \| That gives a less subtle clue that it's about databases than \| looking at the domain. \| lmeyerov wrote: \| Always a fan of query plan articles! \| \| Note: the dig at dataframe libs is worth some care in case you \| think that means duckdb can optimize and they cannot \| \| Dask, Polars, and others pick a lazy default in order to make \| distribution and other optimizations easier. When staying in \| their pure fragments ('vectorized'), the same scheduler rewriting \| opportunity is here. \| \| This is a subtle but important distinction when looking at these \| frameworks. We are making our new graph query language 'gfql' to \| be dataframe-native so it can run naturally & natively as a step \| of pipelines people are already doing, but also to ensure we \| automatically run as optimized CPU/GPU columnar opts. At the same \| time, because of the intent to allow room for query plan \| optimization, we are staying declarative / lazy, even if the \| generated & interpreted code uses an eager DF runtime . I'm \| optimistic about output target lazy DF systems doing query \| planner work for us long-term here, but for the eager framework \| targets, the query planning has to be on our side. \| kwillets wrote: \| One possible hand optimization is to push the aggregation below \| the joins, which makes the latter a few hundred rows. ___________________________________________________________________ (page generated 2024-11-17 16:01 UTC)