proxy70

	[HN Gopher] Upsert in SQL ___________________________________________________________________ Upsert in SQL Author : ingve Score : 101 points Date : 2023-09-25 10:29 UTC (12 hours ago)
	web link (antonz.org)
	w3m dump (antonz.org)
	\| paulddraper wrote: \| Excellent article. \| \| I did not realize that PostgreSQL now supported MERGE \| compumike wrote: \| Is anyone aware of performance benchmarks comparing Postgres's \| new MERGE to the old INSERT ... ON CONFLICT? \| \| The "WHEN NOT MATCHED THEN DO NOTHING" is very appealing, and we \| currently implement this with a transaction and a DELETE after \| the upsert because we want to do updates only, not \| unintentionally insert new rows. \| hn_throwaway_99 wrote: \| > we currently implement this with a transaction and a DELETE \| after the upsert because we want to do updates only, not \| unintentionally insert new rows. \| \| Huh??? Why not just do an UPDATE then? After all, you seem to \| be saying "we want to do an upsert but without the insert \| part". \| \| Edit: I'm also trying to understand the value of "WHEN NOT \| MATCHED THEN DO NOTHING". How is that any different that just \| leaving off the "WHEN NOT MATCHED" clause altogether? \| compumike wrote: \| Batch update many many rows at once with entirely different \| values. \| \| (Agree that the "WHEN NOT MATCHED DO NOTHING" is effectively \| the default behavior.) \| \| Use case is the Outbound Prober processes for \| https://heiioncall.com/ which can do thousands of HTTP \| requests per second, then aggregate all those row updates \| into a single query (or single transaction, which in practice \| is a single query). \| \| We can do these batched updates very, very fast using INSERT \| ... ON CONFLICT. No need to re-parse a differently shaped \| query, only a single index rebuild. Smooth :) \| hn_throwaway_99 wrote: \| Still not really understanding - you can do all of that \| with plain UPDATEs (using join syntax if you want to "batch \| update many many rows at once with entirely different \| values.") I would be hella confused as a developer if I saw \| an INSERT ... ON CONFLICT statement where you specifically \| wanted to prevent any inserts. \| gomezjdaniel wrote: \| Thanks for the article. \| \| You have ton of interesting posts in your personal website! \| gregw2 wrote: \| FYI, article doesn't say it but Redshift nowdays supports MERGE: \| https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html \| yellow_lead wrote: \| Its annoying that a standard upset cannot be part of SQL. \| hn_throwaway_99 wrote: \| But MERGE is now the upsert that is part of the SQL standard. \| nicoburns wrote: \| Unfortunately MERGE doesn't seem to have the transactional \| semantics that people want from an UPSERT feature (I believe \| in any implementation, but perhaps Oracle supports properly \| transactional MERGE?) \| MarkusWinand wrote: \| I recently covered MERGE on modern-sql.com: https://modern- \| sql.com/caniuse/merge \| \| There I also look at limitations of some implementations and \| problems such as not reporting ambiguous column names -- just \| guessing what you mean ;) \| bafe wrote: \| Great work as always! I still remember looking at your site \| when I was working on an ETL system for sensor data and \| realising in frustration how MariaDB was missing all the nice \| modern SQL functions like merge. \| sebazzz wrote: \| I've noticed that MERGE in MSSQL can have a performance penalty \| as opposed to have separate insert and update statements. \| swasheck wrote: \| mssql merge has a lot of problems. it's good for the most basic \| and straightforward of use cases, but it struggles with a \| variety of features: \| \| https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-... \| mwigdahl wrote: \| We stripped MERGE out of all of our production code. Not \| worth the many corner cases and footguns that remain even \| after over 15 years of being in the product. \| NicoJuicy wrote: \| In ms sql you can do an update query, check the rowcount and if \| it's 0, then do an insert. \| \| That's how I got an improved query performance for syncing \| data. \| hot_gril wrote: \| I think you need a unique index if you're doing something \| like this in MS-SQL, Postgres, or MySQL. Otherwise, the \| default isolation level won't prevent this data race: \| writer A: begin xact, update, see 0 rows updated writer \| B: begin xact, insert same thing, end xact writer A: \| insert, end xact \| \| One reason I use ON CONFLICT in Postgres is for explicitness. \| You're specifically telling the DBMS to upsert, and it'll \| throw an error if you don't have a unique index to support \| this. \| hot_gril wrote: \| Also mentioned above, ON CONFLICT DO UPDATE would be able \| to update the concurrent writer's row instead of throwing a \| uniqueness violation. \| polygotdomain wrote: \| The challenge with UPSERTs is ambiguity, and the variety of \| syntax options you see on the page indicates different attempts \| to try to address that ambiguity while still keeping a simple \| syntax. \| \| Personally, I work with MSSQL which has skipped adoption of \| UPSERT syntax and supported MERGE statements for probably 15 \| years now. While there's a certain degree of complexity in a \| MERGE statement, there's also a strong level of explicitness that \| is helpful in making sure certain conditions don't fall through \| the cracks or don't perform an operation that isn't expected. \| \| The ability to handle different conditions in match statements is \| incredibly powerful and allows potentially complex ETL statements \| to be handled within a single merge statement. I'm not saying \| that's always the best practice, but if things are structured in \| the right way, it may be. \| \| I believe the transaction isolation of a MERGE statement (in \| MSSQL at least) means that the lock on the rows is not released \| between the read and the write, which means you'll have fewer \| issues with reading data, attempting an update, then having it \| fail bc another process made a change in between the two calls. \| \| I regularly use output tables to debug the results of a MERGE \| statement, and in this case the $action keyword is incredibly \| helpful for knowing what was done. \| TheCycoONE wrote: \| It's been many years since I worked at a MSSQL shop but at \| least at the time (2008R2) there were issues with MERGE upsert \| statements deadlocking with themselves when run concurrently. \| \| Based on https://dba.stackexchange.com/questions/187776/merge- \| deadloc... it seems that's still an issue as of 5 years ago, \| and not the recommended way of doing an upsert on MSSQL. \| sixbrx wrote: \| Good article, but I wouldn't characterize MERGE as strictly \| better than the older "ON CONFLICT ..." for Postgres, because \| MERGE doesn't support any sort of RETURNING clause. \| majkinetor wrote: \| Yeah, we were happy to see MERGE in latest postgres but \| abandoned it after we found there is no RETURNING... \| Nezteb wrote: \| I found a post on the mailing list about this: \| https://www.postgresql.org/message-id/CANbhV-EXmw9i4YhJTgsrL... \| \| I'm not sure how to go about making an official feature/change \| request though. \| keosak wrote: \| INSERT ... ON CONFLICT has a problem with CHECK constraints. \| The constraint must be satisfied in the INSERT tuple even if \| the row already exists and UPDATE will be executed. I recently \| dealt with this at work and the resulting CTE query was very \| similar to MERGE, except MERGE has better syntax. \| bakugo wrote: \| Something I find somewhat annoying about MySQL's INSERT...ON \| DUPLICATE KEY UPDATE is that it increments auto-increment columns \| even if no new row is inserted. Does anyone know if Postgres' ON \| CONFLICT and MERGE do the same? \| sbuttgereit wrote: \| I expect this behavior to be pretty common. In PostgreSQL, auto \| incrementing keys like this are backed by sequences. \| \| Sequences, at least in the database products I'm familiar with, \| do not exhibit transactional behavior; I'm not sure if this is \| actually part of the standard, but if not, I at least expect \| this to be pretty common. Part of the reason for this is to \| avoid introducing a point of serialization in concurrent \| transaction processing. If we were to do what you're asking \| for, you can get quite a performance hit. Consider a long \| running transaction involving an insert into a table using an \| auto-incrementing key. Any other transaction that wanted to get \| an ID would have to wait for that long running transaction to \| finish to get a number from the sequence... the waiting \| transaction wouldn't know if the long running transaction was \| going to commit or rollback and so the sequence number would be \| undefined until the waiting transaction could get its own lock. \| \| If we set forth the expectation that a sequence will only \| guarantee you that you'll get a unique number from within the \| range, but that any other desire like gapless numbering is out \| of scope for the sequence you can avoid that whole waiting \| game. Sure, you might have gaps as a result; if the long \| running transaction finally rolls back, you'd get a gap for the \| number it consumed.... but no one waited to find out if that \| was the case and all the records got unique IDs. \| \| If you need to ascribe meaning beyond merely "unique identity" \| to such a record ID, it might be that an auto-sequencing is \| simply the wrong tool. I think the compromise and expectation \| that sequences, and extended functionality depending on \| sequence mechanics, are not transactional is a good one given \| the performance costs of ensuring gapless numbering whether you \| really need it or not. \| tucnak wrote: \| Yes, unfortunately it does and I don't think you can work \| around this... \| andorov wrote: \| worth re-upping - \| \| postgres requests an id from the sequence for each row of the \| incoming data ahead of time since it doesn't know which rows \| are updates and which are inserts (although presumably this \| could be changed?). the sequence doesn't reset down for the \| unused so this can eat through it unexpectedly quickly if you \| have a table with a relatively large volume of updates. \| \| also as a tip if you hit the max integer for the sequence and \| need space to implement a fundamental fix you can quickly \| change the sequence to start at -1 and go down. there's no \| issue with negative ids since they're also integers. \| anarazel wrote: \| > postgres requests an id from the sequence for each row of \| the incoming data ahead of time since it doesn't know which \| rows are updates and which are inserts (although presumably \| this could be changed?). \| \| Not in any sort of general way - the sequence can be part of \| the unique key that you're conflicting on. \| dewey wrote: \| PG upsert also increments the counter. \| \| I somehow feel it's wasteful and feel a bit conflicted. I also \| realize that while it feels natural to somehow get a feeling \| for the row numbers and it can help with debugging because you \| somewhat know some IDs / ranges after a while it's probably \| more something I just got used to and maybe shouldn't rely on. \| gregw2 wrote: \| The problem with plain/primitive upset/merge operations is they \| don't always support handling of what should happen when a row is \| deleted from the source. People new to data engineering don't \| always recognize this. \| \| Im those cases, one needs "delete-then-upsert" or "soft-delete- \| then-upsert" in your upset/merge implementation. I was a tad \| disappointed to see this only peripherally alluded to. \| \| With some flavor of merge implementations you can do this with a \| pair of merge calls but in my opinion, ideally it should be one \| merge operation and db vendors should wise up to this. \| \| Per https://modern-sql.com/caniuse/merge only SQL Server and \| BigQuery currently support "when not matched by source". But \| Snowflake lets you have multiple "when not matched" clauses with \| "and" expressions so perhaps you can do it on that. Redshift docs \| show examples of using two merge statements. \| hn_throwaway_99 wrote: \| > The problem with plain/primitive upset/merge operations is \| they don't always support handling of what should happen when a \| row is deleted from the source. \| \| I didn't understand what you meant by this - you seem to be \| referencing a larger workflow (e.g. keeping 2 tables in sync), \| and I'm not understanding how is relevant to the "upsert" \| primitive. Can you give an example? \| gregw2 wrote: \| You are bringing data from a transactional (OLTP) system into \| a reporting/analytics (OLAP, lakehouse, data warehouse, data \| hub) system. \| \| The analytic database is expected to maintain a list of all \| customers and their transactions going back years. The \| transactional system stores a more limited history for \| operational performance reasons. \| \| You daily pull the list of customers from source \| transactional system and upsert/merge it into your analytic \| database. Over time the source transaction team removes old \| unused or mistakenly created customers. \| \| If you use a pure upsert without awareness/handling of \| source-side deletes, your analytic list of customers grows \| without any way to indicate to reporting users which \| customers are still in your transaction system (or when they \| got deleted from the transactional system.) But various \| analytics and/or APIs from your analytic system need to \| exclude such no-longer-present customers... \| \| "Why am I seeing these customers in analytics that aren't in \| the source system?" "Can I just get the current customer list \| from the analytic system?" "Can your analytics component X \| quit transmitting to me these old/bogus customers I don't see \| in the transactional system?" people will ask you... \| nicoburns wrote: \| UPSERT support is really for OLTP databases with frequent \| transactions that may conflict. If you're doing periodic \| batch jobs then you can just use plain insert/update/delete \| (and I believe a common pattern for syncing to an analytics \| database would be to first dump all the data from the \| transactional system into a staging table, from which you \| can run easily comparative queries (e.g. to check for \| records no longer present in the transactional database) \| before updating the main tables. \| LorenPechtel wrote: \| That will save me a few headaches--I wasn't aware of the MYSQL \| ability to reference the values in the update part. \| petergeoghegan wrote: \| > insert...on conflict is a fine way to do upserts on older \| PostgreSQL versions. For modern ones (15+) there is a better \| alternative [SQL Standard: MERGE] \| \| This is incorrect. To quote the Postgres MERGE docs: \| \| "When MERGE is run concurrently with other commands that modify \| the target table, the usual transaction isolation rules apply; \| see Section 13.2 for an explanation on the behavior at each \| isolation level. You may also wish to consider using INSERT ... \| ON CONFLICT as an alternative statement which offers the ability \| to run an UPDATE if a concurrent INSERT occurs. There are a \| variety of differences and restrictions between the two statement \| types and they are not interchangeable." \| \| MERGE can give duplicate violation errors when concurrent inserts \| take place (at the default READ COMMITTED isolation level). MERGE \| is no more capable of avoiding these sorts of race conditions \| than totally naively application code. In short, upserting is \| precisely what you're not supposed to use MERGE for -- ON \| CONFLICT is expressly designed for that use case. \| kardianos wrote: \| I agree. In SQL Server, merge comes with a number of quirks \| that cause me more pain when I try to use it. The ONLY reason I \| use MERGE today, isn't to merge, but to include non-inserted \| columns in the OUTPUT clause; only the merge statement can do \| that. This can be useful when you want to output a lookup from \| a source to a newly inserted identity (auto-increment) column. \| \| Generally, stay away from MERGE. At least in SQL Server. \| singingfish wrote: \| I've done a lot of work on MERGE queries recently. For postgres \| I suspect that self updating CTEs that report back what \| actually happened are maybe a better idea. \| chasil wrote: \| MERGE is standardized in ANSI SQL 2003, and updated in 2008: \| \| https://en.wikipedia.org/wiki/Merge_(SQL) \| \| No question, everyone should use standardized, cross-platform \| syntax if at all possible. \| \| Maybe someday ANSI will standardize some kind of upsert syntax; \| until that time use upsert only if you are explicitly tying \| your SQL to your particular database. \| roenxi wrote: \| > No question, everyone should use standardized, cross- \| platform syntax if at all possible. \| \| There are big questionmarks around this statement. SQL isn't \| an interchange format. Standard _table design_ goes a long \| way because data has lots of different consumers and needs \| frequently to be moved around between systems, standard SQL \| is nothing useful to aim for. Everyone already uses DB- \| specific drivers and syntax because it is better than the \| alternative. \| \| - The standard is not generally available. Most of us will \| never learn what is in it. \| \| - SQL standardises terrible syntax. The SQL standards \| committee has a unique view on programming where \| "function_name(a, b, c);" is some sort of weird alternative \| syntax. They haven't gotten as far as working out variables \| or function composition either [0]. \| \| - Tying SQL to your specific database is the best option for \| performance. Writing database-independent SQL is somethign of \| a fools errand, it is either trivial performance-insensitive \| enough that you should be using a code generator or complex \| enough to deserve DB-specific turning. \| \| [0] I assume. Not going to pay for a copy of the standard to \| check. \| _a_a_a_ wrote: \| Before you recommend anything, read this about MSSQL's \| dangerously borked MERGE \| https://www.mssqltips.com/sqlservertip/3074/use-caution- \| with... \| \| Anyone using MSSQL should read it. \| victor106 wrote: \| Great explanation, \| \| I highly recommend the author's book \| \| https://antonz.org/sql-window-functions-book/ \| mikece wrote: \| I don't know how recently the "UPSERT" nomenclature came into use \| but I laughed the first time I heard the term used several years \| ago, thinking it was a verbal contraction for a concept that, \| while logical, didn't actually exist in database engines (my \| ignorance was quickly remedied after my expression of mirth). It \| certainly wasn't part of the SQL courses back in 1998. \| ZiggerZZ wrote: \| An old Python package with the implementation of MERGE in \| BigQuery - https://github.com/ZiggerZZ/pygbq ___________________________________________________________________ (page generated 2023-09-25 23:02 UTC)