[HN Gopher] TIL-Python has a built-in persistent key-value store...
___________________________________________________________________
 
TIL-Python has a built-in persistent key-value store (2018)
 
Author : zora_goron
Score  : 227 points
Date   : 2022-09-15 10:10 UTC (12 hours ago)
 
web link (remusao.github.io)
w3m dump (remusao.github.io)
 
| nonethewiser wrote:
| Huh. I am working on a little utility to deduplicate csvs using
| sqlite. Seems like this would likely be better.
| 
| I used sqlite in the project's name. This makes me think of an
| article I read recently that suggested not using descriptive
| names for projects, for reasons such as this.
 
  | ALittleLight wrote:
  | You may want to look at pandas too (unless the issue is your
  | CSVs are too large to fit in memory).
 
  | ivvve wrote:
  | This is probably a dumb intervention, but forgive me because it
  | made me giddy to find: the pandas drop_duplicates() method is
  | pretty great, if you're in a pinch. Again -- forgive me if you
  | know about it!
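  | 
  | A minimal sketch of that (assuming the CSV fits in memory;
  | file names are hypothetical):
  | 
  |     import pandas as pd
  |     
  |     # Read, drop exact duplicate rows, write back out
  |     df = pd.read_csv("input.csv")
  |     df = df.drop_duplicates()
  |     df.to_csv("deduped.csv", index=False)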
 
| amelius wrote:
| This is going to fall flat when you try to use multithreading.
| The GIL will rear its head.
 
| giraffe_lady wrote:
| This is a wrapper around dbm, the unix library? Having a
| portable interface to that tool is nothing small, and most
| languages don't have one in their standard library, though I
| think all languages with a stdlib as big as python's probably do.
| It's not inherently a python thing though, unless I'm badly
| misunderstanding.
 
  | collyw wrote:
  | I look at it the other way: this is the least-effort disk
  | persistence that I have seen in python. A fair amount of dev
  | work is reading from disk and putting it into some kind of
  | object / dictionary format. I am not really bothered about what
  | it uses under the hood (ok, it is interesting enough that I
  | have looked up dbm). To me the main appeal of this is that you
  | can use what appears to be a normal python dictionary and have
  | it save to disk with minimal effort. This is a lot less
  | friction than using a database.
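  | 
  | A minimal sketch of that dictionary-style usage (dbm keys and
  | values are bytes/str only; the file name is arbitrary):
  | 
  |     import dbm
  |     
  |     # "c" opens the file read/write, creating it if needed
  |     with dbm.open("cache", "c") as db:
  |         db["hello"] = "world"  # str is encoded to bytes
  |         print(db["hello"])     # b'world' - values come back as bytes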
 
    | giraffe_lady wrote:
    | Oh yeah for sure. I don't mean to convey any disdain for dbm
    | itself or for having it rolled into the standard library.
    | Reading and writing custom files for "convenience
    | persistence" or configuration is one of those really common
    | things that you don't usually need to actually do if you know
    | about tools like this.
    | 
    | I was just pointing out that the title and even the article
    | seem to associate this system with python per se, rather than
    | understanding it as python's interface to a common and pre-
    | existing system.
 
| Someone wrote:
| FTA: Why is sqlite3 so slow?
| 
| My money is on it being mostly because sqlite actually stores
| the data. I don't think _dbm_ guarantees any of the data written
| makes it to disk until you _sync_ or _close_ the database.
| 
| This writes, guesstimating, on the order of 100 kilobytes, so
| chances are the _dbm_ data never hits the disk.
 
| akubera wrote:
| There's also the shelve[0] module which allows storing any
| pickleable object in a persistent key-value store, not just
| string/bytes. I've found it very handy for caching while
| developing scripts which query remote resources, without having
| to worry about serialization.
| 
| [0] https://docs.python.org/3.10/library/shelve.html
| 
| Obligatory pickle note: one should be aware of pickle security
| implications and should not open a "Shelf" provided by untrusted
| sources, or rather should treat opening a shelf (or any pickle
| deserialization operation for that matter) as running an
| arbitrary Python script (which cannot be read).
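| 
| A minimal sketch of that caching pattern (the cache file name
| and fetch helper are made up):
| 
|     import shelve
|     import urllib.request
|     
|     # Any picklable object can be stored; keys must be str
|     def fetch(url):
|         with shelve.open("dev_cache") as db:
|             if url not in db:
|                 db[url] = urllib.request.urlopen(url).read()
|             return db[url]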
 
  | cristoperb wrote:
  | I once wrote a locking wrapper around the shelve module so I
  | could use it as a thread- and multiprocess-safe key-value cache
  | (including a wrapper around the requests modules `get()` to
  | transparently cache/validate http resources according to
  | headers):
  | 
  | https://github.com/cristoper/shelfcache
  | 
  | It works despite some cross-platform issues (flock and macos's
  | version of gdbm interacting to create a deadlock), but if I
  | were to do it again I would just use sqlite (which Python's
  | standard library has an interface for).
 
    | TillE wrote:
    | > but if I were to do again I would just use sqlite
    | 
    | Yeah, I tried to use shelve for some very simple stuff
    | because it seemed like a great fit, but ultimately found that
    | I had a much better time with tortoise-orm on top of sqlite.
    | 
    | If you need _any_ kind of real feature, just use sqlite.
 
  | PartiallyTyped wrote:
  | I like to use `joblib.Memory` for my caching, as it catches
  | changes in functions, which prompts a rerun instead of loading
  | from the cache, and then overwrites the old result.
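  | 
  | Roughly like this (joblib is third-party; the cache directory
  | is arbitrary):
  | 
  |     from joblib import Memory
  |     
  |     memory = Memory("./joblib_cache", verbose=0)
  |     
  |     @memory.cache
  |     def slow_square(x):
  |         # edits to this body invalidate the cached result
  |         return x * x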
 
  | mrweasel wrote:
  | Can you safely "pickle" Python objects across different
  | architectures and Python versions (assuming we forget about
  | Python 2)?
 
    | NegativeLatency wrote:
    | There's a version of the pickle format, so you might be able
    | to do it across versions, but I suspect the version has
    | changed for security reasons over time?
 
    | Spivak wrote:
    | Pickle could _in theory_ be architecture dependent
    | since __getstate__ and __setstate__ are user provided
    | hooks. But you would have to try to do that on purpose.
    | 
    | And you don't even have to forget about Python 2! If you use
    | format version 2 you can pickle objects from every version
    | from Python 2.3+ and all pickle formats are promised to be
    | backwards compatible. If you only care about Python 3 then
    | you can use version 3 and it will work for all Python 3.0+.
    | 
    | https://docs.python.org/3/library/pickle.html#data-stream-
    | fo...
    | 
    | The reason against using pickle hasn't changed though, if you
    | wouldn't exec() it, don't unpickle it. If you're going to
    | send it over the network use MAC use MAC use MAC. Seriously,
    | it's built in -- the hmac module.
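    | 
    | A sketch of that sign-then-verify pattern with the stdlib
    | hmac module (the key is a placeholder):
    | 
    |     import hashlib, hmac, pickle
    |     
    |     KEY = b"shared-secret"  # placeholder; use a real secret
    |     
    |     def pack(obj):
    |         data = pickle.dumps(obj)
    |         return hmac.new(KEY, data, hashlib.sha256).digest() + data
    |     
    |     def unpack(blob):
    |         mac, data = blob[:32], blob[32:]  # sha256 digest = 32 bytes
    |         good = hmac.new(KEY, data, hashlib.sha256).digest()
    |         if not hmac.compare_digest(mac, good):
    |             raise ValueError("bad MAC; refusing to unpickle")
    |         return pickle.loads(data)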
 
      | mmcgaha wrote:
      | We had a program that was sending a pickled session state
      | as a cookie. We solved that by packing the cookie as a
      | random string, a timestamp, the object, and a MAC. We
      | validated the MAC, then checked the timestamp, and finally
      | unpickled the object. It still bothers me that we are
      | unpickling data passed by the client, but I ran out of
      | arguments against doing it.
 
  | clysm wrote:
  | Another problem with pickles in any sort of living code base
  | arises when one makes modifications to the type itself -
  | renaming, refactoring, whatever. Pickling objects (and nested
  | objects) that aren't explicitly meant for data
  | storage/retention/transmission leads to headaches.
  | 
  | It's best to use another dedicated type or library specifically
  | for this task.
 
  | formerly_proven wrote:
  | There's even an object database built around pickle, ZODB. It
  | even has a networked implementation with replication /
  | failover. Used to be part of Zope, originally written in the
  | 1998 time frame or so - GvR actually committed a bunch of stuff
  | to it.
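  | 
  | A minimal sketch of the idea (ZODB is third-party; the file
  | name is arbitrary):
  | 
  |     import ZODB, ZODB.FileStorage
  |     import transaction
  |     
  |     db = ZODB.DB(ZODB.FileStorage.FileStorage("data.fs"))
  |     root = db.open().root()  # behaves like a persistent dict
  |     root["counter"] = root.get("counter", 0) + 1
  |     transaction.commit()
  |     db.close()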
 
    | nas wrote:
    | ZODB is awesome and overlooked, IMHO. I'm biased I guess
    | because I was involved in making Durus which is inspired by
    | ZODB. The ZODB model is not appropriate for all applications
    | (optimistic concurrency control) but for the applications it
    | works for, it's great. Very easy to develop with (no
    | relational-to-OO mismatch) and performance can be great if
    | you design your model carefully. The client caching of data
    | is great. It is a bit like memcache but the data is already
    | there as objects in RAM. The database server will invalidate
    | the cache for you, no manual invalidation needed.
 
      | mwcampbell wrote:
      | I developed a web application in 2004 using Quixote and
      | Durus. I wonder how many developers outside of MEMS
      | Exchange ever used both of those packages. Somehow I had
      | not yet encountered a proper ORM (I didn't discover
      | SQLObject until later, and Django wasn't out yet), so I
      | liked Durus at the time. That web application is still
      | running, and in the years since then, I've had to handle
      | several escalated customer support cases, and I often
      | wished I had chosen a relational database so I could easily
      | do ad-hoc queries. So Durus probably wasn't the best choice
      | for my application, but that's not your fault. And one
      | thing I liked about Durus was that the implementation was
      | simple enough that I felt I could really understand how it
      | worked. So thanks for that.
 
| nijave wrote:
| Wonder how this compares to using the filesystem as the store,
| where keys are either filenames or file paths (and the value is
| stored in the file)?
 
| j0hnyl wrote:
| I wonder if it's still faster than sqlite in 2022.
 
  | cbdumas wrote:
  | The script linked at the bottom of this page uses sqlite
  | without an index on the table. In my testing, if you add an
  | index, sqlite quickly outperforms dbm in terms of read
  | performance.
 
  | calpaterson wrote:
  | According to his benchmark run today on Python 3.9 on linux,
  | dbm is still considerably faster for random access.
  | 
  | That said, which database dbm uses varies a lot by platform.
  | I think on linux it's usually BerkeleyDB.
  | 
  | I use dbm a lot to cache pickles of objects that were slow to
  | create, e.g. pandas dataframes.
 
    | curiousgal wrote:
    | Sorry if it's a dumb question, but since dbm writes stuff to
    | disk, how is that different from loading the pickles?
 
      | dfox wrote:
      | You get one file for the whole store, not one per key.
      | Also, it tends to handle lots of keys significantly
      | better than filesystems handle lots of files in one
      | directory.
 
        | curiousgal wrote:
        | Thank you!
 
    | qalmakka wrote:
    | I think on Linux it's often GNU DBM; Berkeley DB is a
    | different, better implementation by Oracle that requires a
    | third party module.
 
| gerty wrote:
| It's not built-in, of course, but if you can live with the quirky
| API, LMDB as an embedded key-value store will be hard to beat in
| terms of speed. It can also do "one-writer, many-readers" and
| transactions.
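| 
| A minimal sketch with the third-party lmdb binding (keys and
| values are bytes; the path is arbitrary):
| 
|     import lmdb
|     
|     env = lmdb.open("./my_lmdb")        # a directory, not a file
|     with env.begin(write=True) as txn:  # one writer at a time
|         txn.put(b"key", b"value")
|     with env.begin() as txn:            # readers don't block
|         print(txn.get(b"key"))          # b'value'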
 
| cbdumas wrote:
| So I read the script used to compare 'dbm' and 'sqlite3', and in
| sqlite it creates a table with no index. Hard to take that
| comparison seriously. I wrote a little benchmark script of my own
| just now and sqlite3 beats DBM handily if you add an index on the
| key.
 
  | simonw wrote:
  | Yeah I tried that myself - I took that benchmark script and
  | changed the CREATE TABLE lines to look like this (adding the
  | primary key):
  | 
  |     CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)
  | 
  | Here are the results before I made that change:
  | 
  |     sqlite             Took 0.032 seconds, 3.19541 microseconds / record
  |     dict_open          Took 0.002 seconds, 0.20261 microseconds / record
  |     dbm_open           Took 0.043 seconds, 4.26550 microseconds / record
  |     sqlite3_mem_open   Took 2.240 seconds, 224.02620 microseconds / record
  |     sqlite3_file_open  Took 7.119 seconds, 711.87410 microseconds / record
  | 
  | And here's what I got after adding the primary keys:
  | 
  |     sqlite             Took 0.040 seconds, 3.97618 microseconds / record
  |     dict_open          Took 0.002 seconds, 0.19641 microseconds / record
  |     dbm_open           Took 0.042 seconds, 4.18961 microseconds / record
  |     sqlite3_mem_open   Took 0.116 seconds, 11.58359 microseconds / record
  |     sqlite3_file_open  Took 5.571 seconds, 557.13968 microseconds / record
  | 
  | My code is here:
  | https://gist.github.com/simonw/019ddf08150178d49f4967cc38356...
 
    | cbdumas wrote:
    | Yeah now that I dig in a little further it looks like it's
    | not as clear cut as I thought. DBM read performance is better
    | for me with larger test cases than I was initially using,
    | though I am getting some weird performance hiccups when I use
    | very large test cases. It looks like I'm too late to edit my
    | top level response to reflect that.
 
    | divbzero wrote:
    | Just to spell out the results: Adding the primary key appears
    | to improve SQLite performance but still falls short of DBM.
 
  | banana_giraffe wrote:
  | The index is the least of the issues with the SQLite
  | implementation. It's calling one INSERT per record in that
  | version, so the benchmark is spending something like 99.8% of
  | its time opening and closing transactions as it sets up the
  | database.
  | 
  | Fixing that on my machine took the sqlite3_file_open benchmark
  | from 16.910 seconds to 1.033 seconds. Adding the index brought
  | it down to 0.040 seconds.
  | 
  | Also, I've never really dug into what's going on, but the dbm
  | implementation is pretty slow on Windows, at least when I've
  | tried to use it.
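  | 
  | For reference, a sketch of the batched version (table and
  | data names are made up):
  | 
  |     import sqlite3
  |     
  |     conn = sqlite3.connect("bench.db")
  |     conn.execute("CREATE TABLE IF NOT EXISTS store"
  |                  "(key TEXT PRIMARY KEY, value TEXT)")
  |     rows = [("key%d" % i, "value%d" % i) for i in range(100_000)]
  |     
  |     # One transaction for all rows instead of one per INSERT
  |     with conn:
  |         conn.executemany("INSERT INTO store VALUES (?, ?)", rows)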
 
    | avinassh wrote:
    | I was working on a project to insert a billion rows in SQLite
    | under a minute; batching the inserts made it crazy fast
    | compared to individual transactions.
    | 
    | link: https://avi.im/blag/2021/fast-sqlite-inserts/
 
    | Steltek wrote:
    | For use cases where you want a very simple key-value store,
    | working with single records is probably a good test?
 
      | banana_giraffe wrote:
      | Maybe?
      | 
      | Sure, mutating data sets might be a useful use case. But
      | inserting thousands of items one at a time in a tight
      | loop, then asking for all of them, is testing an unusual
      | use case in my opinion.
      | 
      | My point was that we're comparing apples and oranges. By
      | default, I think, Python's dbm implementation doesn't do
      | any sort of transaction or even sync after every insert,
      | whereas SQLite does have a pretty hefty atomic guarantee
      | after each INSERT, so they're quite different actions.
 
    | [deleted]
 
    | OGWhales wrote:
    | > Also, I've never really dug into what's going on, but the
    | dbm implementation is pretty slow on Windows, at least when
    | I've tried to use it.
    | 
    | Seems like this would be why:
    | https://news.ycombinator.com/item?id=32852333
 
      | banana_giraffe wrote:
      | "the slow-but-simple implementation in module dbm.dumb will
      | be used"
      | 
      | No kidding.
      | 
      | https://github.com/python/cpython/blob/main/Lib/dbm/dumb.py
      | 
      | It uses a text file to store keys and pointers into a
      | binary file for values. It works .. most of the time .. but
      | yeah, that's not going to win any speed awards.
 
    | jacobr1 wrote:
    | Also you can use WAL
 
| MrBuddyCasino wrote:
| PHP also has support for several Berkeley-style databases:
| https://www.php.net/manual/en/dba.requirements.php
 
| pmarreck wrote:
| For the record, so does Erlang/Elixir, right in their stdlib
| (OTP), as either ETS, DETS, or Mnesia
| https://en.wikipedia.org/wiki/Mnesia
| https://www.erlang.org/doc/man/mnesia.html
 
| Snawoot wrote:
| There is no point in using sqlite3 in the default journaling
| mode. I bet the results may be even better than dbm if you use
| PRAGMA journal_mode=WAL;
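| 
| From Python that's one statement on the connection (a sketch;
| the file name is arbitrary):
| 
|     import sqlite3
|     
|     conn = sqlite3.connect("store.db")
|     # WAL mode is persistent: it sticks to the database file
|     conn.execute("PRAGMA journal_mode=WAL")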
 
  | bob1029 wrote:
  | > There is no point in using sqlite3 in the default journaling
  | > mode.
  | 
  | There are situations where you wouldn't want to, but they are
  | probably very uncommon outside of the embedded computing world.
  | Copying 3 files vs 1 is not a gigantic deal. Most of the time
  | you aren't even moving a SQLite database around.
 
    | quietbritishjim wrote:
    | There's never a need to copy 3 files anyway. If the database
    | is closed cleanly then the WAL file and lock file are
    | deleted. If not (either it's still open or not closed
    | cleanly) then I think any half-finished transactions will be
    | discarded if you copy all files to a new location. Certainly
    | safest not to copy the 2 extra files in any case.
 
    | WorldMaker wrote:
    | You can still copy _just_ the primary file and not the WAL or
    | any other files if you want to live dangerously and possibly
    | lose transactions if there are write operations in parallel.
    | 
    | Of course, you'd be better off with WAL-based replication
    | tech like litestream instead of plain file copies if you are
    | truly worried about parallel operations during your file
    | copies.
 
      | epilys wrote:
      | You can even distribute individual WAL frames and arrange
      | distributed writes with raft consensus and a time lease. I
      | never formally modelled this but it seemed to work
      | perfectly: the approach only lacked checkpointing the WAL
      | file and synchronising the database file across raft nodes.
 
  | nonethewiser wrote:
  | more info https://sqlite.org/wal.html
 
  | kuschku wrote:
  | That only works if all processes accessing it share a memory
  | namespace (aka, are not in containers/jails/VMs and on the same
  | physical machine).
 
    | singron wrote:
    | You are probably thinking of issues with networked
    | filesystems like nfs, 9p, or vboxsf where you can't mmap a
    | file and actually share memory. Basically any other real
    | filesystem that 2 processes actually open the same file on
    | will allow shared memory.
 
    | joppy wrote:
    | WAL works for different processes accessing the same SQLite
    | database. It's meant to improve concurrency, in fact.
 
      | kuschku wrote:
      | Yes, but the processes still need to share a memory
      | namespace. In WAL mode, SQLite creates a shared memory file
      | which gets mapped into any process trying to access it
      | (that's why WAL mode creates additional .shm files).
      | 
      | But that only works if all given processes run on the same
      | physical machine, and not in containers/jails/VM.
 
        | formerly_proven wrote:
        | I dunno what SQLite is doing specifically but you can
        | certainly mmap a file across containers and it behaves
        | like you'd think. SELinux and other hardening options
        | might interfere though, because shared mmaps should imho
        | be viewed as a very likely vector for cross-process
        | exploitation.
 
    | fuckstick wrote:
    | There isn't really a thing called a memory namespace, in
    | Linux or the BSDs at least. In Linux there are cgroups that
    | deal with accounting and you can share VM space among cloned
    | processes (which we then typically refer to as threads).
    | Neither of these affect the ability to share memory among
    | processes.
    | 
    | The only namespace that matters here is the filesystem/mount
    | namespace.
    | 
    | There is no reason you can't access the same shared SQLite
    | database on a common volume between containers on Linux for
    | instance.
 
| wodenokoto wrote:
| According to the documentation, it is not so much an included db
| as an interface to one:
| 
| > dbm is a generic interface to variants of the DBM database --
| dbm.gnu or dbm.ndbm. If none of these modules is installed, the
| slow-but-simple implementation in module dbm.dumb will be used.
| There is a third party interface to the Oracle Berkeley DB
| 
| https://docs.python.org/3/library/dbm.html
 
  | tootie wrote:
  | Yeah, dbm is a posix database utility that's been around
  | forever. There's a few implementations available and there are
  | bindings for most languages.
  | 
  | https://en.wikipedia.org/wiki/DBM_(computing)
 
  | chrismorgan wrote:
  | When it includes a fallback implementation so that you can rely
  | on it working, I think it's fair to call it "included".
 
    | cozzyd wrote:
    | Right but the on-disk formats between versions won't be
    | compatible...
 
    | qalmakka wrote:
    | The main issue here is that, given it relies on whatever
    | implementation of DBM you have installed, its performance
    | will vary considerably from OS to OS.
    | 
    | I don't know how fast or slow the "dumb" implementation is,
    | but I can bet it is way slower than gdbm. I can see someone
    | using this module, considering it "fast enough" on GNU/Linux
    | and then finding out that it's painfully slow on Windows and
    | shipping binaries of another implementation is a massive
    | PITA.
 
  | tyingq wrote:
  | Each of which has its own set of limitations and watch-outs. I
  | would not use the python dbm interface for anything serious.
  | Ndbm, for example, has a per-item value limit that varies
  | between ~1-4k.
 
| isoprophlex wrote:
| There exists the 'diskcache' library too, which gives you a
| persistent dictionary that stores any kind of pickle-able python
| object. There's also a nice persistent memoization decorator,
| with optional TTL... I've always found 'diskcache' to be very
| ergonomic.
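| 
| A minimal sketch (diskcache is third-party; names are made up):
| 
|     from diskcache import Cache
|     
|     cache = Cache("./my_cache")        # a directory on disk
|     cache["user:1"] = {"name": "Ada"}  # any picklable value
|     
|     @cache.memoize(expire=3600)        # persistent memoization + TTL
|     def expensive(x):
|         return x ** 2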
 
  | BiteCode_dev wrote:
  | If you can pip install something, I would also recommend
  | diskcache:
  | 
  | - it's backed by sqlite, so the data is much more secure
  | 
  | - you can access it from several processes
  | 
  | - it's more portable
 
| syncsynchalt wrote:
| This is libdbm, a relatively ancient db library that you could
| find on most unixes.
| 
| In the 90s you might target dbm for your portable Unix
| application because you could reasonably expect it to be
| implemented on your platform. It lost a lot of share to gdbm and
| sleepycat's BerkeleyDB, both of which I consider its successors.
| 
| Of course all of this is now sqlite3.
 
| [deleted]
 
| blibble wrote:
| the gnu version of dbm has some nice surprises
| 
| if you go and look at the binary database you'll see random other
| bits of memory in there, because it doesn't initialise most of
| its buffers
 
| ridiculous_fish wrote:
| In AppleScript you can mark a variable as a property, and then
| its value is automatically remembered across runs. For example a
| script can easily remember how many times it has run:
|     property runCount : 0
|     set runCount to runCount + 1
| 
| So natural to use!
| 
| https://developer.apple.com/library/archive/documentation/Ap...
 
  | nathancahill wrote:
  | I feel like AppleScript could have really shone if things had
  | played out differently in Mac OS automation. I learned to code
  | with AppleScript on OS 9 and it really got me into
  | programming. MacScripter.net was my old hangout. Good times.
 
| im3w1l wrote:
| As far as I can tell this doesn't support transactions, which
| imho is table stakes for a persistence solution in 2022.
 
  | eesmith wrote:
  | FWIW, it was added in 1992.
  | https://github.com/python/cpython/commit/dd9ed839d6958f73704...
  | with gdbm added in 1994.
  | https://github.com/python/cpython/commit/4b4c664d2e93279c8d7...
  | .
 
    | im3w1l wrote:
    | The article makes it sound like it's something to consider
    | today (well, 2018); that's why I think it's fair to see how
    | well it satisfies modern requirements.
 
      | ok123456 wrote:
      | If you just need a simple key value store that stays on
      | disk, it's fine.
 
      | jonnycomputer wrote:
      | requirements _for what_ is the question. "modern
      | requirements" doesn't mean that much.
 
  | pjc50 wrote:
  | DBM is the persistence solution of 1979.
  | https://en.wikipedia.org/wiki/DBM_(computing)
 
| Zababa wrote:
| > Why is sqlite3 so slow? Well, the benchmark is probably not
| representative of the typical workload for sqlite (lots of
| individual insertions and selections). If we perform the same
| operations using executemany and one select of all the keys at
| once, we get:
| 
| I think this is because each time you call execute (which is
| probably sqlite3_exec under the hood), your statement is prepared
| again, and then deleted, while with executemany it's prepared
| once and then used with the data. According to the SQLite3
| documentation of sqlite3_exec:
| 
| > The sqlite3_exec() interface is a convenience wrapper around
| sqlite3_prepare_v2(), sqlite3_step(), and sqlite3_finalize(),
| that allows an application to run multiple statements of SQL
| without having to use a lot of C code.
| 
| I knew that prepared statements are faster when you execute a
| request a lot, but it seems that's not exactly the case: all
| statements are prepared, and the performance improvement comes
| from preparing (and deleting) only once. The documentation page
| about the Prepared Statement Object has a good explanation of
| the lifecycle of a prepared statement, which also seems to be
| the lifecycle of all statements.
| 
| [1]: https://www.sqlite.org/c3ref/exec.html
| 
| [2]: https://www.sqlite.org/c3ref/stmt.html
 
  | quietbritishjim wrote:
  | The Python sqlite3 module caches prepared statements; I think
  | it's the last 100 by default, but that can be overridden. It
  | does require you to use proper parameters rather than string
  | interpolation for the cache to be effective, though (I didn't
  | check their benchmark code to see if that's the case).
  | 
  | As others have pointed out, it's the lack of a suitable index
  | that's the real problem here.
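  | 
  | Roughly, the difference (a sketch; table and data made up):
  | 
  |     import sqlite3
  |     
  |     conn = sqlite3.connect(":memory:")
  |     conn.execute("CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)")
  |     rows = [("k%d" % i, "v%d" % i) for i in range(1000)]
  |     
  |     # Cache-friendly: same SQL string, prepared once, reused
  |     for k, v in rows:
  |         conn.execute("INSERT INTO store VALUES (?, ?)", (k, v))
  |     
  |     # Cache-busting: a new SQL string, re-prepared every call
  |     # (and an injection risk):
  |     #   conn.execute(f"INSERT INTO store VALUES ('{k}', '{v}')")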
 
| jonnycomputer wrote:
| I did not know this either. Wow.
 
| collyw wrote:
| I thought Perl's Tie::DB_file module was unique for that sort of
| thing. That's probably quite useful.
 
___________________________________________________________________
(page generated 2022-09-15 23:00 UTC)