[HN Gopher] TIL-Python has a built-in persistent key-value store...
___________________________________________________________________
 
TIL-Python has a built-in persistent key-value store (2018)
 
Author : zora_goron
Score  : 227 points
Date   : 2022-09-15 10:10 UTC (12 hours ago)
 
web link (remusao.github.io)
w3m dump (remusao.github.io)
 
| nonethewiser wrote:
| Huh. I am working on a little utility to deduplicate csvs using
| sqlite. Seems like this would likely be better.
| 
| I used sqlite in the project's name. This makes me think of an
| article I read recently that suggested not using descriptive
| names for projects, for reasons such as this.
 
  | ALittleLight wrote:
  | You may want to look at pandas too (unless the issue is your
  | CSVs are too large to fit in memory).
 
  | ivvve wrote:
  | This is probably a dumb intervention, but forgive me because it
  | made me giddy to find: the pandas drop_duplicates() method is
  | pretty great, if you're in a pinch. Again -- forgive me if you
  | know about it!
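  | 
  | A minimal sketch of that (assuming the CSV fits in memory;
  | file names are hypothetical):
  | 
  |     import pandas as pd
  |     
  |     # Read, drop exact duplicate rows, write back out
  |     df = pd.read_csv("input.csv")
  |     df = df.drop_duplicates()
  |     df.to_csv("deduped.csv", index=False)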
 
| amelius wrote:
| This is going to fall flat when you try to use multithreading.
| The GIL will rear its head.
 
| giraffe_lady wrote:
| This is a wrapper around dbm, the unix library? Having a
| portable interface to that tool is nothing small, and most
| languages don't have one in their standard library, though I
| think all languages with a stdlib as big as python's probably do.
| It's not inherently a python thing though, unless I'm badly
| misunderstanding.
 
  | collyw wrote:
  | I look at it the other way: this is the least-effort disk
  | persistence that I have seen in python. A fair amount of dev
  | work is reading from disk and putting it into some kind of
  | object / dictionary format. I am not really bothered about what
  | it uses under the hood (ok, it is interesting enough that I
  | have looked up dbm). To me the main appeal of this is that you
  | can use what appears to be a normal python dictionary and have
  | it save to disk with minimal effort. This is a lot less
  | friction than using a database.
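  | 
  | A minimal sketch of that dictionary-style usage (dbm keys and
  | values are bytes/str only; the file name is arbitrary):
  | 
  |     import dbm
  |     
  |     # "c" opens the file read/write, creating it if needed
  |     with dbm.open("cache", "c") as db:
  |         db["hello"] = "world"  # str is encoded to bytes
  |         print(db["hello"])     # b'world' - values come back as bytes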
 
    | giraffe_lady wrote:
    | Oh yeah for sure. I don't mean to convey any disdain for dbm
    | itself or for having it rolled into the standard library.
    | Reading and writing custom files for "convenience
    | persistence" or configuration is one of those really common
    | things that you don't usually need to actually do if you know
    | about tools like this.
    | 
    | I was just pointing out that the title and even the article
    | seem to associate this system with python per se, rather than
    | understanding it as python's interface to a common and pre-
    | existing system.
 
| Someone wrote:
| FTA: Why is sqlite3 so slow?
| 
| My money is on it being mostly because sqlite actually stores
| the data. I don't think _dbm_ guarantees any of the data written
| makes it to disk until you _sync_ or _close_ the database.
| 
| This writes, guesstimating, on the order of 100 kilobytes, so
| chances are the _dbm_ data never hits the disk.
 
| akubera wrote:
| There's also the shelve[0] module which allows storing any
| pickleable object in a persistent key-value store, not just
| string/bytes. I've found it very handy for caching while
| developing scripts which query remote resources, without having
| to worry about serialization.
| 
| [0] https://docs.python.org/3.10/library/shelve.html
| 
| Obligatory pickle note: one should be aware of pickle security
| implications and should not open a "Shelf" provided by untrusted
| sources, or rather should treat opening a shelf (or any pickle
| deserialization operation for that matter) as running an
| arbitrary Python script (which cannot be read).
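| 
| A minimal sketch of that caching pattern (the cache file name
| and fetch helper are made up):
| 
|     import shelve
|     import urllib.request
|     
|     # Any picklable object can be stored; keys must be str
|     def fetch(url):
|         with shelve.open("dev_cache") as db:
|             if url not in db:
|                 db[url] = urllib.request.urlopen(url).read()
|             return db[url]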
 
  | cristoperb wrote:
  | I once wrote a locking wrapper around the shelve module so I
  | could use it as a thread- and multiprocess-safe key-value cache
  | (including a wrapper around the requests modules `get()` to
  | transparently cache/validate http resources according to
  | headers):
  | 
  | https://github.com/cristoper/shelfcache
  | 
  | It works despite some cross-platform issues (flock and macos's
  | version of gdbm interacting to create a deadlock), but if I
  | were to do it again I would just use sqlite (which Python's
  | standard library has an interface for).
 
    | TillE wrote:
    | > but if I were to do again I would just use sqlite
    | 
    | Yeah, I tried to use shelve for some very simple stuff
    | because it seemed like a great fit, but ultimately found that
    | I had a much better time with tortoise-orm on top of sqlite.
    | 
    | If you need _any_ kind of real feature, just use sqlite.
 
  | PartiallyTyped wrote:
  | I like to use `joblib.Memory` for my caching, as it catches
  | changes in functions, which prompts a rerun instead of loading
  | from the cache, and then overwrites the old result.
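  | 
  | Roughly like this (joblib is third-party; the cache directory
  | is arbitrary):
  | 
  |     from joblib import Memory
  |     
  |     memory = Memory("./joblib_cache", verbose=0)
  |     
  |     @memory.cache
  |     def slow_square(x):
  |         # edits to this body invalidate the cached result
  |         return x * x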
 
  | mrweasel wrote:
  | Can you safely "pickle" Python objects across different
  | architectures and Python versions (assuming we forget about
  | Python 2)?
 
    | NegativeLatency wrote:
    | There's a version of the pickle format, so you might be able
    | to do it across versions, but I suspect the version has
    | changed for security reasons over time?
 
    | Spivak wrote:
    | Pickle could _in theory_ be architecture dependent
    | since __getstate__ and __setstate__ are user provided
    | hooks. But you would have to try to do that on purpose.
    | 
    | And you don't even have to forget about Python 2! If you use
    | format version 2 you can pickle objects from every version
    | from Python 2.3+ and all pickle formats are promised to be
    | backwards compatible. If you only care about Python 3 then
    | you can use version 3 and it will work for all Python 3.0+.
    | 
    | https://docs.python.org/3/library/pickle.html#data-stream-
    | fo...
    | 
    | The reason against using pickle hasn't changed though, if you
    | wouldn't exec() it, don't unpickle it. If you're going to
    | send it over the network use MAC use MAC use MAC. Seriously,
    | it's built in -- the hmac module.
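    | 
    | A sketch of that sign-then-verify pattern with the stdlib
    | hmac module (the key is a placeholder):
    | 
    |     import hashlib, hmac, pickle
    |     
    |     KEY = b"shared-secret"  # placeholder; use a real secret
    |     
    |     def pack(obj):
    |         data = pickle.dumps(obj)
    |         return hmac.new(KEY, data, hashlib.sha256).digest() + data
    |     
    |     def unpack(blob):
    |         mac, data = blob[:32], blob[32:]  # sha256 digest = 32 bytes
    |         good = hmac.new(KEY, data, hashlib.sha256).digest()
    |         if not hmac.compare_digest(mac, good):
    |             raise ValueError("bad MAC; refusing to unpickle")
    |         return pickle.loads(data)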
 
      | mmcgaha wrote:
      | We had a program that was sending a pickled session state
      | as a cookie. We solved that by packing the cookie as a
      | random string, a timestamp, the object, and a MAC. We
      | validated the MAC, then checked the timestamp, and finally
      | unpickled the object. It still bothers me that we are
      | unpickling data passed by the client, but I ran out of
      | arguments against doing it.
 
  | clysm wrote:
  | Another problem with pickles in any sort of living code base
  | arises when one makes modifications to the type itself -
  | renaming, refactoring, whatever. Pickling objects (and nested
  | objects) that aren't explicitly meant for data
  | storage/retention/transmission leads to headaches.
  | 
  | It's best to use another dedicated type or library specifically
  | for this task.
 
  | formerly_proven wrote:
  | There's even an object database built around pickle, ZODB. It
  | even has a networked implementation with replication /
  | failover. Used to be part of Zope, originally written in the
  | 1998 time frame or so - GvR actually committed a bunch of stuff
  | to it.
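  | 
  | A minimal sketch of the idea (ZODB is third-party; the file
  | name is arbitrary):
  | 
  |     import ZODB, ZODB.FileStorage
  |     import transaction
  |     
  |     db = ZODB.DB(ZODB.FileStorage.FileStorage("data.fs"))
  |     root = db.open().root()  # behaves like a persistent dict
  |     root["counter"] = root.get("counter", 0) + 1
  |     transaction.commit()
  |     db.close()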
 
    | nas wrote:
    | ZODB is awesome and overlooked, IMHO. I'm biased I guess
    | because I was involved in making Durus which is inspired by
    | ZODB. The ZODB model is not appropriate for all applications
    | (optimistic concurrency control) but for the applications it
    | works for, it's great. Very easy to develop with (no
    | relational-to-OO mismatch) and performance can be great if
    | you design your model carefully. The client caching of data
    | is great. It is a bit like memcache but the data is already
    | there as objects in RAM. The database server will invalidate
    | the cache for you, no manual invalidation needed.
 
      | mwcampbell wrote:
      | I developed a web application in 2004 using Quixote and
      | Durus. I wonder how many developers outside of MEMS
      | Exchange ever used both of those packages. Somehow I had
      | not yet encountered a proper ORM (I didn't discover
      | SQLObject until later, and Django wasn't out yet), so I
      | liked Durus at the time. That web application is still
      | running, and in the years since then, I've had to handle
      | several escalated customer support cases, and I often
      | wished I had chosen a relational database so I could easily
      | do ad-hoc queries. So Durus probably wasn't the best choice
      | for my application, but that's not your fault. And one
      | thing I liked about Durus was that the implementation was
      | simple enough that I felt I could really understand how it
      | worked. So thanks for that.
 
| nijave wrote:
| Wonder how this compares to using the filesystem as the store,
| where keys are either filenames or file paths (and the value is
| stored in the file)?
 
| j0hnyl wrote:
| I wonder if it's still faster than sqlite in 2022.
 
  | cbdumas wrote:
  | The script linked at the bottom of this page uses sqlite
  | without an index on the table. In my testing, if you add an
  | index, sqlite quickly outperforms dbm in terms of read
  | performance.
 
  | calpaterson wrote:
  | According to his benchmark run today on Python 3.9 on linux,
  | dbm is still considerably faster for random access.
  | 
  | That said, which database dbm uses varies a lot by platform.
  | I think on linux it's usually BerkeleyDB.
  | 
  | I use dbm a lot to cache pickles of objects that were slow to
  | create, e.g. pandas dataframes.
 
    | curiousgal wrote:
    | Sorry if it's a dumb question, but since dbm writes stuff to
    | disk, how is that different from loading the pickles?
 
      | dfox wrote:
      | You get one file for the whole store, not one per key.
      | Also, it tends to handle lots of keys significantly
      | better than filesystems handle lots of files in one
      | directory.
 
        | curiousgal wrote:
        | Thank you!
 
    | qalmakka wrote:
    | I think on Linux it's often GNU DBM; Berkeley DB is a
    | different, better implementation by Oracle that requires a
    | third party module.
 
| gerty wrote:
| It's not built-in, of course, but if you can live with the quirky
| API, LMDB as an embedded key-value store will be hard to beat in
| terms of speed. It can also do "one-writer, many-readers" and
| transactions.
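| 
| A minimal sketch with the third-party lmdb binding (keys and
| values are bytes; the path is arbitrary):
| 
|     import lmdb
|     
|     env = lmdb.open("./my_lmdb")        # a directory, not a file
|     with env.begin(write=True) as txn:  # one writer at a time
|         txn.put(b"key", b"value")
|     with env.begin() as txn:            # readers don't block
|         print(txn.get(b"key"))          # b'value'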
 
| cbdumas wrote:
| So I read the script used to compare 'dbm' and 'sqlite3', and in
| sqlite it creates a table with no index. Hard to take that
| comparison seriously. I wrote a little benchmark script of my own
| just now and sqlite3 beats DBM handily if you add an index on the
| key.
 
  | simonw wrote:
  | Yeah I tried that myself - I took that benchmark script and
  | changed the CREATE TABLE lines to look like this (adding the
  | primary key):
  | 
  |     CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)
  | 
  | Here are the results before I made that change:
  | 
  |     sqlite             Took 0.032 seconds, 3.19541 microseconds / record
  |     dict_open          Took 0.002 seconds, 0.20261 microseconds / record
  |     dbm_open           Took 0.043 seconds, 4.26550 microseconds / record
  |     sqlite3_mem_open   Took 2.240 seconds, 224.02620 microseconds / record
  |     sqlite3_file_open  Took 7.119 seconds, 711.87410 microseconds / record
  | 
  | And here's what I got after adding the primary keys:
  | 
  |     sqlite             Took 0.040 seconds, 3.97618 microseconds / record
  |     dict_open          Took 0.002 seconds, 0.19641 microseconds / record
  |     dbm_open           Took 0.042 seconds, 4.18961 microseconds / record
  |     sqlite3_mem_open   Took 0.116 seconds, 11.58359 microseconds / record
  |     sqlite3_file_open  Took 5.571 seconds, 557.13968 microseconds / record
  | 
  | My code is here:
  | https://gist.github.com/simonw/019ddf08150178d49f4967cc38356...
 
    | cbdumas wrote:
    | Yeah now that I dig in a little further it looks like it's
    | not as clear cut as I thought. DBM read performance is better
    | for me with larger test cases than I was initially using,
    | though I am getting some weird performance hiccups when I use
    | very large test cases. It looks like I'm too late to edit my
    | top level response to reflect that.
 
    | divbzero wrote:
    | Just to spell out the results: Adding the primary key appears
    | to improve SQLite performance but still falls short of DBM.
 
  | banana_giraffe wrote:
  | The index is the least of the issues with the SQLite
  | implementation. It's calling one INSERT per record in that
  | version, so the benchmark is spending something like 99.8% of
  | its time opening and closing transactions as it sets up the
  | database.
  | 
  | Fixing that on my machine took the sqlite3_file_open benchmark
  | from 16.910 seconds to 1.033 seconds. Adding the index brought
  | it down to 0.040 seconds.
  | 
  | Also, I've never really dug into what's going on, but the dbm
  | implementation is pretty slow on Windows, at least when I've
  | tried to use it.
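  | 
  | For reference, a sketch of the batched version (table and
  | data names are made up):
  | 
  |     import sqlite3
  |     
  |     conn = sqlite3.connect("bench.db")
  |     conn.execute("CREATE TABLE IF NOT EXISTS store"
  |                  "(key TEXT PRIMARY KEY, value TEXT)")
  |     rows = [("key%d" % i, "value%d" % i) for i in range(100_000)]
  |     
  |     # One transaction for all rows instead of one per INSERT
  |     with conn:
  |         conn.executemany("INSERT INTO store VALUES (?, ?)", rows)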
 
    | avinassh wrote:
    | I was working on a project to insert a billion rows in SQLite
    | under a minute; batching the inserts made it crazy fast
    | compared to individual transactions.
    | 
    | link: https://avi.im/blag/2021/fast-sqlite-inserts/
 
    | Steltek wrote:
    | For use cases where you want a very simple key-value store,
    | working with single records is probably a good test?
 
      | banana_giraffe wrote:
      | Maybe?
      | 
      | Sure, mutating data sets might be a useful use case. But
      | inserting thousands of items one at a time in a tight
      | loop, then asking for all of them, is testing an unusual
      | use case in my opinion.
      | 
      | My point was that we're comparing apples and oranges. By
      | default, I think, Python's dbm implementation doesn't do
      | any sort of transaction or even sync after every insert,
      | whereas SQLite does have a pretty hefty atomic guarantee
      | after each INSERT, so they're quite different actions.
 
    | [deleted]
 
    | OGWhales wrote:
    | > Also, I've never really dug into what's going on, but the
    | dbm implementation is pretty slow on Windows, at least when
    | I've tried to use it.
    | 
    | Seems like this would be why:
    | https://news.ycombinator.com/item?id=32852333
 
      | banana_giraffe wrote:
      | "the slow-but-simple implementation in module dbm.dumb will
      | be used"
      | 
      | No kidding.
      | 
      | https://github.com/python/cpython/blob/main/Lib/dbm/dumb.py
      | 
      | It uses a text file to store keys and pointers into a
      | binary file for values. It works .. most of the time .. but
      | yeah, that's not going to win any speed awards.
 
    | jacobr1 wrote:
    | Also you can use WAL
 
| MrBuddyCasino wrote:
| PHP also has support for several Berkeley-style databases:
| https://www.php.net/manual/en/dba.requirements.php
 
| pmarreck wrote:
| For the record, so does Erlang/Elixir, right in their stdlib
| (OTP), as either ETS, DETS, or Mnesia
| https://en.wikipedia.org/wiki/Mnesia
| https://www.erlang.org/doc/man/mnesia.html
 
| Snawoot wrote:
| There is no point in using sqlite3 in the default journaling
| mode. I bet the results may be even better than dbm if you use
| PRAGMA journal_mode=WAL;
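| 
| From Python that's one statement on the connection (a sketch;
| the file name is arbitrary):
| 
|     import sqlite3
|     
|     conn = sqlite3.connect("store.db")
|     # WAL mode is persistent: it sticks to the database file
|     conn.execute("PRAGMA journal_mode=WAL")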
 
  | bob1029 wrote:
  | > There is no point in using sqlite3 in the default journaling
  | > mode.
  | 
  | There are situations where you wouldn't want to, but they are
  | probably very uncommon outside of the embedded computing world.
  | Copying 3 files vs 1 is not a gigantic deal. Most of the time
  | you aren't even moving a SQLite database around.
 
    | quietbritishjim wrote:
    | There's never a need to copy 3 files anyway. If the database
    | is closed cleanly then the WAL file and lock file are
    | deleted. If not (either it's still open or not closed
    | cleanly) then I think any half-finished transactions will be
    | discarded if you copy all files to a new location. Certainly
    | safest not to copy the 2 extra files in any case.
 
    | WorldMaker wrote:
    | You can still copy _just_ the primary file and not the WAL or
    | any other files if you want to live dangerously and possibly
    | lose transactions if there are write operations in parallel.
    | 
    | Of course, you'd be better off with WAL-based replication
    | tech like litestream instead of plain file copies if you are
    | truly worried about parallel operations during your file
    | copies.
 
      | epilys wrote:
      | You can even distribute individual WAL frames and arrange
      | distributed writes with raft consensus and a time lease. I
      | never formally modelled this but it seemed to work
      | perfectly: the approach only lacked checkpointing the WAL
      | file and synchronising the database file across raft nodes.
 
  | nonethewiser wrote:
  | more info https://sqlite.org/wal.html
 
  | kuschku wrote:
  | That only works if all processes accessing it share a memory
  | namespace (aka, are not in containers/jails/VMs and on the same
  | physical machine).
 
    | singron wrote:
    | You are probably thinking of issues with networked
    | filesystems like nfs, 9p, or vboxsf where you can't mmap a
    | file and actually share memory. Basically any other real
    | filesystem that 2 processes actually open the same file on
    | will allow shared memory.
 
    | joppy wrote:
    | WAL works for different processes accessing the same SQLite
    | database. It's meant to improve concurrency, in fact.
 
      | kuschku wrote:
      | Yes, but the processes still need to share a memory
      | namespace. In WAL mode, SQLite creates a shared memory file
      | which gets mapped into any process trying to access it
      | (that's why WAL mode creates additional .shm files).
      | 
      | But that only works if all given processes run on the same
      | physical machine, and not in containers/jails/VM.
 
        | formerly_proven wrote:
        | I dunno what SQLite is doing specifically but you can
        | certainly mmap a file across containers and it behaves
        | like you'd think. SELinux and other hardening options
        | might interfere though, because shared mmaps should imho
        | be viewed as a very likely vector for cross-process
        | exploitation.
 
    | fuckstick wrote:
    | There isn't really a thing called a memory namespace, in
    | Linux or the BSDs at least. In Linux there are cgroups that
    | deal with accounting and you can share VM space among cloned
    | processes (which we then typically refer to as threads).
    | Neither of these affect the ability to share memory among
    | processes.
    | 
    | The only namespace that matters here is the filesystem/mount
    | namespace.
    | 
    | There is no reason you can't access the same shared SQLite
    | database on a common volume between containers on Linux for
    | instance.
 
| wodenokoto wrote:
| According to the documentation, it is not so much an included db
| as an interface to one:
| 
| > dbm is a generic interface to variants of the DBM database --
| dbm.gnu or dbm.ndbm. If none of these modules is installed, the
| slow-but-simple implementation in module dbm.dumb will be used.
| There is a third party interface to the Oracle Berkeley DB
| 
| https://docs.python.org/3/library/dbm.html
 
  | tootie wrote:
  | Yeah, dbm is a posix database utility that's been around
  | forever. There's a few implementations available and there are
  | bindings for most languages.
  | 
  | https://en.wikipedia.org/wiki/DBM_(computing)
 
  | chrismorgan wrote:
  | When it includes a fallback implementation so that you can rely
  | on it working, I think it's fair to call it "included".
 
    | cozzyd wrote:
    | Right but the on-disk formats between versions won't be
    | compatible...
 
    | qalmakka wrote:
    | The main issue here is that, given it relies on whatever
    | implementation of DBM you have installed, its performance
    | will vary considerably from OS to OS.
    | 
    | I don't know how fast or slow the "dumb" implementation is,
    | but I can bet it is way slower than gdbm. I can see someone
    | using this module, considering it "fast enough" on GNU/Linux
    | and then finding out that it's painfully slow on Windows and
    | shipping binaries of another implementation is a massive
    | PITA.
 
  | tyingq wrote:
  | Each of which has its own set of limitations and watch-outs. I
  | would not use the python dbm interface for anything serious.
  | Ndbm, for example, has a per-item value limit that varies
  | between ~1-4k.
 
| isoprophlex wrote:
| There exists the 'diskcache' library too, which gives you a
| persistent dictionary that stores any kind of pickle-able python
| object. There's also a nice persistent memoization decorator,
| with optional TTL... I've always found 'diskcache' to be very
| ergonomic.
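| 
| A minimal sketch (diskcache is third-party; names are made up):
| 
|     from diskcache import Cache
|     
|     cache = Cache("./my_cache")        # a directory on disk
|     cache["user:1"] = {"name": "Ada"}  # any picklable value
|     
|     @cache.memoize(expire=3600)        # persistent memoization + TTL
|     def expensive(x):
|         return x ** 2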
 
  | BiteCode_dev wrote:
  | If you can pip install something, I would also recommend
  | diskcache:
  | 
  | - it's backed by sqlite, so the data is much more secure
  | 
  | - you can access it from several processes
  | 
  | - it's more portable
 
| syncsynchalt wrote:
| This is libdbm, a relatively ancient db library that you could
| find on most unixes.
| 
| In the 90s you might target dbm for your portable Unix
| application because you could reasonably expect it to be
| implemented on your platform. It lost a lot of share to gdbm and
| sleepycat's BerkeleyDB, both of which I consider its successors.
| 
| Of course all of this is now sqlite3.
 
| [deleted]
 
| blibble wrote:
| the gnu version of dbm has some nice surprises
| 
| if you go and look at the binary database you'll see random other
| bits of memory in there, because it doesn't initialise most of
| its buffers
 
| ridiculous_fish wrote:
| In AppleScript you can mark a variable as a property, and then
| its value is automatically remembered across runs. For example a
| script can easily remember how many times it has run:
|     property runCount : 0
|     set runCount to runCount + 1
| 
| So natural to use!
| 
| https://developer.apple.com/library/archive/documentation/Ap...
 
  | nathancahill wrote:
  | I feel like AppleScript could have really shone if things had
  | played out differently in Mac OS automation. I learned to code
  | with AppleScript on OS 9 and it really got me into
  | programming. MacScripter.net was my old hangout. Good times.
 
| im3w1l wrote:
| As far as I can tell this doesn't support transactions, which
| imho is table stakes for a persistence solution in 2022.
 
  | eesmith wrote:
  | FWIW, it was added in 1992.
  | https://github.com/python/cpython/commit/dd9ed839d6958f73704...
  | with gdbm added in 1994.
  | https://github.com/python/cpython/commit/4b4c664d2e93279c8d7...
  | .
 
    | im3w1l wrote:
    | The article makes it sound like it's something to consider
    | today (well, 2018); that's why I think it's fair to see how
    | well it satisfies modern requirements.
 
      | ok123456 wrote:
      | If you just need a simple key value store that stays on
      | disk, it's fine.
 
      | jonnycomputer wrote:
      | requirements _for what_ is the question. "modern
      | requirements" doesn't mean that much.
 
  | pjc50 wrote:
  | DBM is the persistence solution of 1979.
  | https://en.wikipedia.org/wiki/DBM_(computing)
 
| Zababa wrote:
| > Why is sqlite3 so slow? Well, the benchmark is probably not
| representative of the typical workload for sqlite (lots of
| individual insertions and selections). If we perform the same
| operations using executemany and one select of all the keys at
| once, we get:
| 
| I think this is because each time you call execute (which is
| probably sqlite3_exec under the hood), your statement is prepared
| again, and then deleted, while with executemany it's prepared
| once and then used with the data. According to the SQLite3
| documentation of sqlite3_exec:
| 
| > The sqlite3_exec() interface is a convenience wrapper around
| sqlite3_prepare_v2(), sqlite3_step(), and sqlite3_finalize(),
| that allows an application to run multiple statements of SQL
| without having to use a lot of C code.
| 
| I knew that prepared statements are faster when you execute a
| request a lot, but it seems that's not exactly the case: all
| statements are prepared, and the performance improvement comes
| from preparing (and deleting) only once. The documentation page
| about the Prepared Statement Object has a good explanation of
| the lifecycle of a prepared statement, which also seems to be
| the lifecycle of all statements.
| 
| [1]: https://www.sqlite.org/c3ref/exec.html
| 
| [2]: https://www.sqlite.org/c3ref/stmt.html
 
  | quietbritishjim wrote:
  | The Python sqlite3 module caches prepared statements; I think
  | it's the last 100 by default, but that can be overridden. It
  | does require you to use proper parameters rather than string
  | interpolation for the cache to be effective, though (I didn't
  | check their benchmark code to see if that's the case).
  | 
  | As others have pointed out, it's the lack of a suitable index
  | that's the real problem here.
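  | 
  | Roughly, the difference (a sketch; table and data made up):
  | 
  |     import sqlite3
  |     
  |     conn = sqlite3.connect(":memory:")
  |     conn.execute("CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)")
  |     rows = [("k%d" % i, "v%d" % i) for i in range(1000)]
  |     
  |     # Cache-friendly: same SQL string, prepared once, reused
  |     for k, v in rows:
  |         conn.execute("INSERT INTO store VALUES (?, ?)", (k, v))
  |     
  |     # Cache-busting: a new SQL string, re-prepared every call
  |     # (and an injection risk):
  |     #   conn.execute(f"INSERT INTO store VALUES ('{k}', '{v}')")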
 
| jonnycomputer wrote:
| I did not know this either. Wow.
 
| collyw wrote:
| I thought Perl's Tie::DB_file module was unique for that sort of
| thing. That's probably quite useful.
 
___________________________________________________________________
(page generated 2022-09-15 23:00 UTC)