|
| paulddraper wrote:
| Excellent article.
|
| I did not realize that PostgreSQL now supported MERGE
| compumike wrote:
| Is anyone aware of performance benchmarks comparing Postgres's
| new MERGE to the old INSERT ... ON CONFLICT?
|
| The "WHEN NOT MATCHED THEN DO NOTHING" is very appealing, and we
| currently implement this with a transaction and a DELETE after
| the upsert because we want to do updates only, not
| unintentionally insert new rows.
| hn_throwaway_99 wrote:
| > we currently implement this with a transaction and a DELETE
| after the upsert because we want to do updates only, not
| unintentionally insert new rows.
|
| Huh??? Why not just do an UPDATE then? After all, you seem to
| be saying "we want to do an upsert but without the insert
| part".
|
| Edit: I'm also trying to understand the value of "WHEN NOT
| MATCHED THEN DO NOTHING". How is that any different that just
| leaving off the "WHEN NOT MATCHED" clause altogether?
| compumike wrote:
| Batch update many many rows at once with entirely different
| values.
|
| (Agree that the "WHEN NOT MATCHED DO NOTHING" is effectively
| the default behavior.)
|
| Use case is the Outbound Prober processes for
| https://heiioncall.com/ which can do thousands of HTTP
| requests per second, then aggregate all those row updates
| into a single query (or single transaction, which in practice
| is a single query).
|
| We can do these batched updates very, very fast using INSERT
| ... ON CONFLICT. No need to re-parse a differently shaped
| query, only a single index rebuild. Smooth :)
| hn_throwaway_99 wrote:
| Still not really understanding - you can do all of that
| with plain UPDATEs (using join syntax if you want to "batch
| update many many rows at once with entirely different
| values.") I would be hella confused as a developer if I saw
| an INSERT ... ON CONFLICT statement where you specifically
| wanted to prevent any inserts.
| gomezjdaniel wrote:
| Thanks for the article.
|
| You have ton of interesting posts in your personal website!
| gregw2 wrote:
| FYI, article doesn't say it but Redshift nowdays supports MERGE:
| https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html
| yellow_lead wrote:
| Its annoying that a standard upset cannot be part of SQL.
| hn_throwaway_99 wrote:
| But MERGE is now the upsert that is part of the SQL standard.
| nicoburns wrote:
| Unfortunately MERGE doesn't seem to have the transactional
| semantics that people want from an UPSERT feature (I believe
| in any implementation, but perhaps Oracle supports properly
| transactional MERGE?)
| MarkusWinand wrote:
| I recently covered MERGE on modern-sql.com: https://modern-
| sql.com/caniuse/merge
|
| There I also look at limitations of some implementations and
| problems such as not reporting ambiguous column names -- just
| guessing what you mean ;)
| bafe wrote:
| Great work as always! I still remember looking at your site
| when I was working on an ETL system for sensor data and
| realising in frustration how MariaDB was missing all the nice
| modern SQL functions like merge.
| sebazzz wrote:
| I've noticed that MERGE in MSSQL can have a performance penalty
| as opposed to have separate insert and update statements.
| swasheck wrote:
| mssql merge has a lot of problems. it's good for the most basic
| and straightforward of use cases, but it struggles with a
| variety of features:
|
| https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-...
| mwigdahl wrote:
| We stripped MERGE out of all of our production code. Not
| worth the many corner cases and footguns that remain even
| after over 15 years of being in the product.
| NicoJuicy wrote:
| In ms sql you can do an update query, check the rowcount and if
| it's 0, then do an insert.
|
| That's how I got an improved query performance for syncing
| data.
| hot_gril wrote:
| I think you need a unique index if you're doing something
| like this in MS-SQL, Postgres, or MySQL. Otherwise, the
| default isolation level won't prevent this data race:
| writer A: begin xact, update, see 0 rows updated writer
| B: begin xact, insert same thing, end xact writer A:
| insert, end xact
|
| One reason I use ON CONFLICT in Postgres is for explicitness.
| You're specifically telling the DBMS to upsert, and it'll
| throw an error if you don't have a unique index to support
| this.
| hot_gril wrote:
| Also mentioned above, ON CONFLICT DO UPDATE would be able
| to update the concurrent writer's row instead of throwing a
| uniqueness violation.
| polygotdomain wrote:
| The challenge with UPSERTs is ambiguity, and the variety of
| syntax options you see on the page indicates different attempts
| to try to address that ambiguity while still keeping a simple
| syntax.
|
| Personally, I work with MSSQL which has skipped adoption of
| UPSERT syntax and supported MERGE statements for probably 15
| years now. While there's a certain degree of complexity in a
| MERGE statement, there's also a strong level of explicitness that
| is helpful in making sure certain conditions don't fall through
| the cracks or don't perform an operation that isn't expected.
|
| The ability to handle different conditions in match statements is
| incredibly powerful and allows potentially complex ETL statements
| to be handled within a single merge statement. I'm not saying
| that's always the best practice, but if things are structured in
| the right way, it may be.
|
| I believe the transaction isolation of a MERGE statement (in
| MSSQL at least) means that the lock on the rows is not released
| between the read and the write, which means you'll have fewer
| issues with reading data, attempting an update, then having it
| fail bc another process made a change in between the two calls.
|
| I regularly use output tables to debug the results of a MERGE
| statement, and in this case the $action keyword is incredibly
| helpful for knowing what was done.
| TheCycoONE wrote:
| It's been many years since I worked at a MSSQL shop but at
| least at the time (2008R2) there were issues with MERGE upsert
| statements deadlocking with themselves when run concurrently.
|
| Based on https://dba.stackexchange.com/questions/187776/merge-
| deadloc... it seems that's still an issue as of 5 years ago,
| and not the recommended way of doing an upsert on MSSQL.
| sixbrx wrote:
| Good article, but I wouldn't characterize MERGE as strictly
| better than the older "ON CONFLICT ..." for Postgres, because
| MERGE doesn't support any sort of RETURNING clause.
| majkinetor wrote:
| Yeah, we were happy to see MERGE in latest postgres but
| abandoned it after we found there is no RETURNING...
| Nezteb wrote:
| I found a post on the mailing list about this:
| https://www.postgresql.org/message-id/CANbhV-EXmw9i4YhJTgsrL...
|
| I'm not sure how to go about making an official feature/change
| request though.
| keosak wrote:
| INSERT ... ON CONFLICT has a problem with CHECK constraints.
| The constraint must be satisfied in the INSERT tuple even if
| the row already exists and UPDATE will be executed. I recently
| dealt with this at work and the resulting CTE query was very
| similar to MERGE, except MERGE has better syntax.
| bakugo wrote:
| Something I find somewhat annoying about MySQL's INSERT...ON
| DUPLICATE KEY UPDATE is that it increments auto-increment columns
| even if no new row is inserted. Does anyone know if Postgres' ON
| CONFLICT and MERGE do the same?
| sbuttgereit wrote:
| I expect this behavior to be pretty common. In PostgreSQL, auto
| incrementing keys like this are backed by sequences.
|
| Sequences, at least in the database products I'm familiar with,
| do not exhibit transactional behavior; I'm not sure if this is
| actually part of the standard, but if not, I at least expect
| this to be pretty common. Part of the reason for this is to
| avoid introducing a point of serialization in concurrent
| transaction processing. If we were to do what you're asking
| for, you can get quite a performance hit. Consider a long
| running transaction involving an insert into a table using an
| auto-incrementing key. Any other transaction that wanted to get
| an ID would have to wait for that long running transaction to
| finish to get a number from the sequence... the waiting
| transaction wouldn't know if the long running transaction was
| going to commit or rollback and so the sequence number would be
| undefined until the waiting transaction could get its own lock.
|
| If we set forth the expectation that a sequence will only
| guarantee you that you'll get a unique number from within the
| range, but that any other desire like gapless numbering is out
| of scope for the sequence you can avoid that whole waiting
| game. Sure, you might have gaps as a result; if the long
| running transaction finally rolls back, you'd get a gap for the
| number it consumed.... but no one waited to find out if that
| was the case and all the records got unique IDs.
|
| If you need to ascribe meaning beyond merely "unique identity"
| to such a record ID, it might be that an auto-sequencing is
| simply the wrong tool. I think the compromise and expectation
| that sequences, and extended functionality depending on
| sequence mechanics, are not transactional is a good one given
| the performance costs of ensuring gapless numbering whether you
| really need it or not.
| tucnak wrote:
| Yes, unfortunately it does and I don't think you can work
| around this...
| andorov wrote:
| worth re-upping -
|
| postgres requests an id from the sequence for each row of the
| incoming data ahead of time since it doesn't know which rows
| are updates and which are inserts (although presumably this
| could be changed?). the sequence doesn't reset down for the
| unused so this can eat through it unexpectedly quickly if you
| have a table with a relatively large volume of updates.
|
| also as a tip if you hit the max integer for the sequence and
| need space to implement a fundamental fix you can quickly
| change the sequence to start at -1 and go down. there's no
| issue with negative ids since they're also integers.
| anarazel wrote:
| > postgres requests an id from the sequence for each row of
| the incoming data ahead of time since it doesn't know which
| rows are updates and which are inserts (although presumably
| this could be changed?).
|
| Not in any sort of general way - the sequence can be part of
| the unique key that you're conflicting on.
| dewey wrote:
| PG upsert also increments the counter.
|
| I somehow feel it's wasteful and feel a bit conflicted. I also
| realize that while it feels natural to somehow get a feeling
| for the row numbers and it can help with debugging because you
| somewhat know some IDs / ranges after a while it's probably
| more something I just got used to and maybe shouldn't rely on.
| gregw2 wrote:
| The problem with plain/primitive upset/merge operations is they
| don't always support handling of what should happen when a row is
| deleted from the source. People new to data engineering don't
| always recognize this.
|
| Im those cases, one needs "delete-then-upsert" or "soft-delete-
| then-upsert" in your upset/merge implementation. I was a tad
| disappointed to see this only peripherally alluded to.
|
| With some flavor of merge implementations you can do this with a
| pair of merge calls but in my opinion, ideally it should be one
| merge operation and db vendors should wise up to this.
|
| Per https://modern-sql.com/caniuse/merge only SQL Server and
| BigQuery currently support "when not matched by source". But
| Snowflake lets you have multiple "when not matched" clauses with
| "and" expressions so perhaps you can do it on that. Redshift docs
| show examples of using two merge statements.
| hn_throwaway_99 wrote:
| > The problem with plain/primitive upset/merge operations is
| they don't always support handling of what should happen when a
| row is deleted from the source.
|
| I didn't understand what you meant by this - you seem to be
| referencing a larger workflow (e.g. keeping 2 tables in sync),
| and I'm not understanding how is relevant to the "upsert"
| primitive. Can you give an example?
| gregw2 wrote:
| You are bringing data from a transactional (OLTP) system into
| a reporting/analytics (OLAP, lakehouse, data warehouse, data
| hub) system.
|
| The analytic database is expected to maintain a list of all
| customers and their transactions going back years. The
| transactional system stores a more limited history for
| operational performance reasons.
|
| You daily pull the list of customers from source
| transactional system and upsert/merge it into your analytic
| database. Over time the source transaction team removes old
| unused or mistakenly created customers.
|
| If you use a pure upsert without awareness/handling of
| source-side deletes, your analytic list of customers grows
| without any way to indicate to reporting users which
| customers are still in your transaction system (or when they
| got deleted from the transactional system.) But various
| analytics and/or APIs from your analytic system need to
| exclude such no-longer-present customers...
|
| "Why am I seeing these customers in analytics that aren't in
| the source system?" "Can I just get the current customer list
| from the analytic system?" "Can your analytics component X
| quit transmitting to me these old/bogus customers I don't see
| in the transactional system?" people will ask you...
| nicoburns wrote:
| UPSERT support is really for OLTP databases with frequent
| transactions that may conflict. If you're doing periodic
| batch jobs then you can just use plain insert/update/delete
| (and I believe a common pattern for syncing to an analytics
| database would be to first dump all the data from the
| transactional system into a staging table, from which you
| can run easily comparative queries (e.g. to check for
| records no longer present in the transactional database)
| before updating the main tables.
| LorenPechtel wrote:
| That will save me a few headaches--I wasn't aware of the MYSQL
| ability to reference the values in the update part.
| petergeoghegan wrote:
| > insert...on conflict is a fine way to do upserts on older
| PostgreSQL versions. For modern ones (15+) there is a better
| alternative [SQL Standard: MERGE]
|
| This is incorrect. To quote the Postgres MERGE docs:
|
| "When MERGE is run concurrently with other commands that modify
| the target table, the usual transaction isolation rules apply;
| see Section 13.2 for an explanation on the behavior at each
| isolation level. You may also wish to consider using INSERT ...
| ON CONFLICT as an alternative statement which offers the ability
| to run an UPDATE if a concurrent INSERT occurs. There are a
| variety of differences and restrictions between the two statement
| types and they are not interchangeable."
|
| MERGE can give duplicate violation errors when concurrent inserts
| take place (at the default READ COMMITTED isolation level). MERGE
| is no more capable of avoiding these sorts of race conditions
| than totally naively application code. In short, upserting is
| precisely what you're not supposed to use MERGE for -- ON
| CONFLICT is expressly designed for that use case.
| kardianos wrote:
| I agree. In SQL Server, merge comes with a number of quirks
| that cause me more pain when I try to use it. The ONLY reason I
| use MERGE today, isn't to merge, but to include non-inserted
| columns in the OUTPUT clause; only the merge statement can do
| that. This can be useful when you want to output a lookup from
| a source to a newly inserted identity (auto-increment) column.
|
| Generally, stay away from MERGE. At least in SQL Server.
| singingfish wrote:
| I've done a lot of work on MERGE queries recently. For postgres
| I suspect that self updating CTEs that report back what
| actually happened are maybe a better idea.
| chasil wrote:
| MERGE is standardized in ANSI SQL 2003, and updated in 2008:
|
| https://en.wikipedia.org/wiki/Merge_(SQL)
|
| No question, everyone should use standardized, cross-platform
| syntax if at all possible.
|
| Maybe someday ANSI will standardize some kind of upsert syntax;
| until that time use upsert only if you are explicitly tying
| your SQL to your particular database.
| roenxi wrote:
| > No question, everyone should use standardized, cross-
| platform syntax if at all possible.
|
| There are big questionmarks around this statement. SQL isn't
| an interchange format. Standard _table design_ goes a long
| way because data has lots of different consumers and needs
| frequently to be moved around between systems, standard SQL
| is nothing useful to aim for. Everyone already uses DB-
| specific drivers and syntax because it is better than the
| alternative.
|
| - The standard is not generally available. Most of us will
| never learn what is in it.
|
| - SQL standardises terrible syntax. The SQL standards
| committee has a unique view on programming where
| "function_name(a, b, c);" is some sort of weird alternative
| syntax. They haven't gotten as far as working out variables
| or function composition either [0].
|
| - Tying SQL to your specific database is the best option for
| performance. Writing database-independent SQL is somethign of
| a fools errand, it is either trivial performance-insensitive
| enough that you should be using a code generator or complex
| enough to deserve DB-specific turning.
|
| [0] I assume. Not going to pay for a copy of the standard to
| check.
| _a_a_a_ wrote:
| Before you recommend anything, read this about MSSQL's
| dangerously borked MERGE
| https://www.mssqltips.com/sqlservertip/3074/use-caution-
| with...
|
| Anyone using MSSQL should read it.
| victor106 wrote:
| Great explanation,
|
| I highly recommend the author's book
|
| https://antonz.org/sql-window-functions-book/
| mikece wrote:
| I don't know how recently the "UPSERT" nomenclature came into use
| but I laughed the first time I heard the term used several years
| ago, thinking it was a verbal contraction for a concept that,
| while logical, didn't actually exist in database engines (my
| ignorance was quickly remedied after my expression of mirth). It
| certainly wasn't part of the SQL courses back in 1998.
| ZiggerZZ wrote:
| An old Python package with the implementation of MERGE in
| BigQuery - https://github.com/ZiggerZZ/pygbq
___________________________________________________________________
(page generated 2023-09-25 23:02 UTC) |