[HN Gopher] Upsert in SQL
___________________________________________________________________
 
Upsert in SQL
 
Author : ingve
Score  : 101 points
Date   : 2023-09-25 10:29 UTC (12 hours ago)
 
web link (antonz.org)
w3m dump (antonz.org)
 
| paulddraper wrote:
| Excellent article.
| 
| I did not realize that PostgreSQL now supported MERGE
 
| compumike wrote:
| Is anyone aware of performance benchmarks comparing Postgres's
| new MERGE to the old INSERT ... ON CONFLICT?
| 
| The "WHEN NOT MATCHED THEN DO NOTHING" is very appealing, and we
| currently implement this with a transaction and a DELETE after
| the upsert because we want to do updates only, not
| unintentionally insert new rows.
 
  | hn_throwaway_99 wrote:
  | > we currently implement this with a transaction and a DELETE
  | after the upsert because we want to do updates only, not
  | unintentionally insert new rows.
  | 
  | Huh??? Why not just do an UPDATE then? After all, you seem to
  | be saying "we want to do an upsert but without the insert
  | part".
  | 
  | Edit: I'm also trying to understand the value of "WHEN NOT
  | MATCHED THEN DO NOTHING". How is that any different that just
  | leaving off the "WHEN NOT MATCHED" clause altogether?
 
    | compumike wrote:
    | Batch update many many rows at once with entirely different
    | values.
    | 
    | (Agree that the "WHEN NOT MATCHED DO NOTHING" is effectively
    | the default behavior.)
    | 
    | Use case is the Outbound Prober processes for
    | https://heiioncall.com/ which can do thousands of HTTP
    | requests per second, then aggregate all those row updates
    | into a single query (or single transaction, which in practice
    | is a single query).
    | 
    | We can do these batched updates very, very fast using INSERT
    | ... ON CONFLICT. No need to re-parse a differently shaped
    | query, only a single index rebuild. Smooth :)
 
      | hn_throwaway_99 wrote:
      | Still not really understanding - you can do all of that
      | with plain UPDATEs (using join syntax if you want to "batch
      | update many many rows at once with entirely different
      | values.") I would be hella confused as a developer if I saw
      | an INSERT ... ON CONFLICT statement where you specifically
      | wanted to prevent any inserts.
 
| gomezjdaniel wrote:
| Thanks for the article.
| 
| You have ton of interesting posts in your personal website!
 
| gregw2 wrote:
| FYI, article doesn't say it but Redshift nowdays supports MERGE:
| https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html
 
| yellow_lead wrote:
| Its annoying that a standard upset cannot be part of SQL.
 
  | hn_throwaway_99 wrote:
  | But MERGE is now the upsert that is part of the SQL standard.
 
    | nicoburns wrote:
    | Unfortunately MERGE doesn't seem to have the transactional
    | semantics that people want from an UPSERT feature (I believe
    | in any implementation, but perhaps Oracle supports properly
    | transactional MERGE?)
 
| MarkusWinand wrote:
| I recently covered MERGE on modern-sql.com: https://modern-
| sql.com/caniuse/merge
| 
| There I also look at limitations of some implementations and
| problems such as not reporting ambiguous column names -- just
| guessing what you mean ;)
 
  | bafe wrote:
  | Great work as always! I still remember looking at your site
  | when I was working on an ETL system for sensor data and
  | realising in frustration how MariaDB was missing all the nice
  | modern SQL functions like merge.
 
| sebazzz wrote:
| I've noticed that MERGE in MSSQL can have a performance penalty
| as opposed to have separate insert and update statements.
 
  | swasheck wrote:
  | mssql merge has a lot of problems. it's good for the most basic
  | and straightforward of use cases, but it struggles with a
  | variety of features:
  | 
  | https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-...
 
    | mwigdahl wrote:
    | We stripped MERGE out of all of our production code. Not
    | worth the many corner cases and footguns that remain even
    | after over 15 years of being in the product.
 
  | NicoJuicy wrote:
  | In ms sql you can do an update query, check the rowcount and if
  | it's 0, then do an insert.
  | 
  | That's how I got an improved query performance for syncing
  | data.
 
    | hot_gril wrote:
    | I think you need a unique index if you're doing something
    | like this in MS-SQL, Postgres, or MySQL. Otherwise, the
    | default isolation level won't prevent this data race:
    | writer A: begin xact, update, see 0 rows updated       writer
    | B: begin xact, insert same thing, end xact       writer A:
    | insert, end xact
    | 
    | One reason I use ON CONFLICT in Postgres is for explicitness.
    | You're specifically telling the DBMS to upsert, and it'll
    | throw an error if you don't have a unique index to support
    | this.
 
      | hot_gril wrote:
      | Also mentioned above, ON CONFLICT DO UPDATE would be able
      | to update the concurrent writer's row instead of throwing a
      | uniqueness violation.
 
| polygotdomain wrote:
| The challenge with UPSERTs is ambiguity, and the variety of
| syntax options you see on the page indicates different attempts
| to try to address that ambiguity while still keeping a simple
| syntax.
| 
| Personally, I work with MSSQL which has skipped adoption of
| UPSERT syntax and supported MERGE statements for probably 15
| years now. While there's a certain degree of complexity in a
| MERGE statement, there's also a strong level of explicitness that
| is helpful in making sure certain conditions don't fall through
| the cracks or don't perform an operation that isn't expected.
| 
| The ability to handle different conditions in match statements is
| incredibly powerful and allows potentially complex ETL statements
| to be handled within a single merge statement. I'm not saying
| that's always the best practice, but if things are structured in
| the right way, it may be.
| 
| I believe the transaction isolation of a MERGE statement (in
| MSSQL at least) means that the lock on the rows is not released
| between the read and the write, which means you'll have fewer
| issues with reading data, attempting an update, then having it
| fail bc another process made a change in between the two calls.
| 
| I regularly use output tables to debug the results of a MERGE
| statement, and in this case the $action keyword is incredibly
| helpful for knowing what was done.
 
  | TheCycoONE wrote:
  | It's been many years since I worked at a MSSQL shop but at
  | least at the time (2008R2) there were issues with MERGE upsert
  | statements deadlocking with themselves when run concurrently.
  | 
  | Based on https://dba.stackexchange.com/questions/187776/merge-
  | deadloc... it seems that's still an issue as of 5 years ago,
  | and not the recommended way of doing an upsert on MSSQL.
 
| sixbrx wrote:
| Good article, but I wouldn't characterize MERGE as strictly
| better than the older "ON CONFLICT ..." for Postgres, because
| MERGE doesn't support any sort of RETURNING clause.
 
  | majkinetor wrote:
  | Yeah, we were happy to see MERGE in latest postgres but
  | abandoned it after we found there is no RETURNING...
 
  | Nezteb wrote:
  | I found a post on the mailing list about this:
  | https://www.postgresql.org/message-id/CANbhV-EXmw9i4YhJTgsrL...
  | 
  | I'm not sure how to go about making an official feature/change
  | request though.
 
  | keosak wrote:
  | INSERT ... ON CONFLICT has a problem with CHECK constraints.
  | The constraint must be satisfied in the INSERT tuple even if
  | the row already exists and UPDATE will be executed. I recently
  | dealt with this at work and the resulting CTE query was very
  | similar to MERGE, except MERGE has better syntax.
 
| bakugo wrote:
| Something I find somewhat annoying about MySQL's INSERT...ON
| DUPLICATE KEY UPDATE is that it increments auto-increment columns
| even if no new row is inserted. Does anyone know if Postgres' ON
| CONFLICT and MERGE do the same?
 
  | sbuttgereit wrote:
  | I expect this behavior to be pretty common. In PostgreSQL, auto
  | incrementing keys like this are backed by sequences.
  | 
  | Sequences, at least in the database products I'm familiar with,
  | do not exhibit transactional behavior; I'm not sure if this is
  | actually part of the standard, but if not, I at least expect
  | this to be pretty common. Part of the reason for this is to
  | avoid introducing a point of serialization in concurrent
  | transaction processing. If we were to do what you're asking
  | for, you can get quite a performance hit. Consider a long
  | running transaction involving an insert into a table using an
  | auto-incrementing key. Any other transaction that wanted to get
  | an ID would have to wait for that long running transaction to
  | finish to get a number from the sequence... the waiting
  | transaction wouldn't know if the long running transaction was
  | going to commit or rollback and so the sequence number would be
  | undefined until the waiting transaction could get its own lock.
  | 
  | If we set forth the expectation that a sequence will only
  | guarantee you that you'll get a unique number from within the
  | range, but that any other desire like gapless numbering is out
  | of scope for the sequence you can avoid that whole waiting
  | game. Sure, you might have gaps as a result; if the long
  | running transaction finally rolls back, you'd get a gap for the
  | number it consumed.... but no one waited to find out if that
  | was the case and all the records got unique IDs.
  | 
  | If you need to ascribe meaning beyond merely "unique identity"
  | to such a record ID, it might be that an auto-sequencing is
  | simply the wrong tool. I think the compromise and expectation
  | that sequences, and extended functionality depending on
  | sequence mechanics, are not transactional is a good one given
  | the performance costs of ensuring gapless numbering whether you
  | really need it or not.
 
  | tucnak wrote:
  | Yes, unfortunately it does and I don't think you can work
  | around this...
 
  | andorov wrote:
  | worth re-upping -
  | 
  | postgres requests an id from the sequence for each row of the
  | incoming data ahead of time since it doesn't know which rows
  | are updates and which are inserts (although presumably this
  | could be changed?). the sequence doesn't reset down for the
  | unused so this can eat through it unexpectedly quickly if you
  | have a table with a relatively large volume of updates.
  | 
  | also as a tip if you hit the max integer for the sequence and
  | need space to implement a fundamental fix you can quickly
  | change the sequence to start at -1 and go down. there's no
  | issue with negative ids since they're also integers.
 
    | anarazel wrote:
    | > postgres requests an id from the sequence for each row of
    | the incoming data ahead of time since it doesn't know which
    | rows are updates and which are inserts (although presumably
    | this could be changed?).
    | 
    | Not in any sort of general way - the sequence can be part of
    | the unique key that you're conflicting on.
 
  | dewey wrote:
  | PG upsert also increments the counter.
  | 
  | I somehow feel it's wasteful and feel a bit conflicted. I also
  | realize that while it feels natural to somehow get a feeling
  | for the row numbers and it can help with debugging because you
  | somewhat know some IDs / ranges after a while it's probably
  | more something I just got used to and maybe shouldn't rely on.
 
| gregw2 wrote:
| The problem with plain/primitive upset/merge operations is they
| don't always support handling of what should happen when a row is
| deleted from the source. People new to data engineering don't
| always recognize this.
| 
| Im those cases, one needs "delete-then-upsert" or "soft-delete-
| then-upsert" in your upset/merge implementation. I was a tad
| disappointed to see this only peripherally alluded to.
| 
| With some flavor of merge implementations you can do this with a
| pair of merge calls but in my opinion, ideally it should be one
| merge operation and db vendors should wise up to this.
| 
| Per https://modern-sql.com/caniuse/merge only SQL Server and
| BigQuery currently support "when not matched by source". But
| Snowflake lets you have multiple "when not matched" clauses with
| "and" expressions so perhaps you can do it on that. Redshift docs
| show examples of using two merge statements.
 
  | hn_throwaway_99 wrote:
  | > The problem with plain/primitive upset/merge operations is
  | they don't always support handling of what should happen when a
  | row is deleted from the source.
  | 
  | I didn't understand what you meant by this - you seem to be
  | referencing a larger workflow (e.g. keeping 2 tables in sync),
  | and I'm not understanding how is relevant to the "upsert"
  | primitive. Can you give an example?
 
    | gregw2 wrote:
    | You are bringing data from a transactional (OLTP) system into
    | a reporting/analytics (OLAP, lakehouse, data warehouse, data
    | hub) system.
    | 
    | The analytic database is expected to maintain a list of all
    | customers and their transactions going back years. The
    | transactional system stores a more limited history for
    | operational performance reasons.
    | 
    | You daily pull the list of customers from source
    | transactional system and upsert/merge it into your analytic
    | database. Over time the source transaction team removes old
    | unused or mistakenly created customers.
    | 
    | If you use a pure upsert without awareness/handling of
    | source-side deletes, your analytic list of customers grows
    | without any way to indicate to reporting users which
    | customers are still in your transaction system (or when they
    | got deleted from the transactional system.) But various
    | analytics and/or APIs from your analytic system need to
    | exclude such no-longer-present customers...
    | 
    | "Why am I seeing these customers in analytics that aren't in
    | the source system?" "Can I just get the current customer list
    | from the analytic system?" "Can your analytics component X
    | quit transmitting to me these old/bogus customers I don't see
    | in the transactional system?" people will ask you...
 
      | nicoburns wrote:
      | UPSERT support is really for OLTP databases with frequent
      | transactions that may conflict. If you're doing periodic
      | batch jobs then you can just use plain insert/update/delete
      | (and I believe a common pattern for syncing to an analytics
      | database would be to first dump all the data from the
      | transactional system into a staging table, from which you
      | can run easily comparative queries (e.g. to check for
      | records no longer present in the transactional database)
      | before updating the main tables.
 
| LorenPechtel wrote:
| That will save me a few headaches--I wasn't aware of the MYSQL
| ability to reference the values in the update part.
 
| petergeoghegan wrote:
| > insert...on conflict is a fine way to do upserts on older
| PostgreSQL versions. For modern ones (15+) there is a better
| alternative [SQL Standard: MERGE]
| 
| This is incorrect. To quote the Postgres MERGE docs:
| 
| "When MERGE is run concurrently with other commands that modify
| the target table, the usual transaction isolation rules apply;
| see Section 13.2 for an explanation on the behavior at each
| isolation level. You may also wish to consider using INSERT ...
| ON CONFLICT as an alternative statement which offers the ability
| to run an UPDATE if a concurrent INSERT occurs. There are a
| variety of differences and restrictions between the two statement
| types and they are not interchangeable."
| 
| MERGE can give duplicate violation errors when concurrent inserts
| take place (at the default READ COMMITTED isolation level). MERGE
| is no more capable of avoiding these sorts of race conditions
| than totally naively application code. In short, upserting is
| precisely what you're not supposed to use MERGE for -- ON
| CONFLICT is expressly designed for that use case.
 
  | kardianos wrote:
  | I agree. In SQL Server, merge comes with a number of quirks
  | that cause me more pain when I try to use it. The ONLY reason I
  | use MERGE today, isn't to merge, but to include non-inserted
  | columns in the OUTPUT clause; only the merge statement can do
  | that. This can be useful when you want to output a lookup from
  | a source to a newly inserted identity (auto-increment) column.
  | 
  | Generally, stay away from MERGE. At least in SQL Server.
 
  | singingfish wrote:
  | I've done a lot of work on MERGE queries recently. For postgres
  | I suspect that self updating CTEs that report back what
  | actually happened are maybe a better idea.
 
  | chasil wrote:
  | MERGE is standardized in ANSI SQL 2003, and updated in 2008:
  | 
  | https://en.wikipedia.org/wiki/Merge_(SQL)
  | 
  | No question, everyone should use standardized, cross-platform
  | syntax if at all possible.
  | 
  | Maybe someday ANSI will standardize some kind of upsert syntax;
  | until that time use upsert only if you are explicitly tying
  | your SQL to your particular database.
 
    | roenxi wrote:
    | > No question, everyone should use standardized, cross-
    | platform syntax if at all possible.
    | 
    | There are big questionmarks around this statement. SQL isn't
    | an interchange format. Standard _table design_ goes a long
    | way because data has lots of different consumers and needs
    | frequently to be moved around between systems, standard SQL
    | is nothing useful to aim for. Everyone already uses DB-
    | specific drivers and syntax because it is better than the
    | alternative.
    | 
    | - The standard is not generally available. Most of us will
    | never learn what is in it.
    | 
    | - SQL standardises terrible syntax. The SQL standards
    | committee has a unique view on programming where
    | "function_name(a, b, c);" is some sort of weird alternative
    | syntax. They haven't gotten as far as working out variables
    | or function composition either [0].
    | 
    | - Tying SQL to your specific database is the best option for
    | performance. Writing database-independent SQL is somethign of
    | a fools errand, it is either trivial performance-insensitive
    | enough that you should be using a code generator or complex
    | enough to deserve DB-specific turning.
    | 
    | [0] I assume. Not going to pay for a copy of the standard to
    | check.
 
    | _a_a_a_ wrote:
    | Before you recommend anything, read this about MSSQL's
    | dangerously borked MERGE
    | https://www.mssqltips.com/sqlservertip/3074/use-caution-
    | with...
    | 
    | Anyone using MSSQL should read it.
 
| victor106 wrote:
| Great explanation,
| 
| I highly recommend the author's book
| 
| https://antonz.org/sql-window-functions-book/
 
| mikece wrote:
| I don't know how recently the "UPSERT" nomenclature came into use
| but I laughed the first time I heard the term used several years
| ago, thinking it was a verbal contraction for a concept that,
| while logical, didn't actually exist in database engines (my
| ignorance was quickly remedied after my expression of mirth). It
| certainly wasn't part of the SQL courses back in 1998.
 
| ZiggerZZ wrote:
| An old Python package with the implementation of MERGE in
| BigQuery - https://github.com/ZiggerZZ/pygbq
 
___________________________________________________________________
(page generated 2023-09-25 23:02 UTC)