|
| nonethewiser wrote:
| Huh. I am working on a little utility to deduplicate csvs using
| sqlite. Seems like this would likely be better.
|
| I used sqlite in the project's name. This makes me think of an
| article I read recently that suggested not using descriptive
| names for projects, for reasons such as this.
| ALittleLight wrote:
| You may want to look at pandas too (unless the issue is your
| CSVs are too large to fit in memory).
| ivvve wrote:
| This is probably a dumb intervention, but forgive me, because it
| made me giddy to find: the pandas drop_duplicates() method is
| pretty great if you're in a pinch. Again -- forgive me if you
| know about it!
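|
| Something like this, roughly (untested sketch; file names are
| made up):
|
|     import pandas as pd
|
|     # read the CSV, drop exact-duplicate rows, write it back out
|     df = pd.read_csv("input.csv")
|     df = df.drop_duplicates()  # subset=[...] to dedupe on specific columns
|     df.to_csv("deduped.csv", index=False)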
| amelius wrote:
| This is going to fall flat when you try to use multithreading.
| The GIL will rear its head.
| giraffe_lady wrote:
| This is a wrapper around dbm, the unix library? Having a
| portable interface to that tool is nothing small, though, and
| most languages don't have one in their standard library
| (although I think all languages with a stdlib as big as
| python's probably do). It's not inherently a python thing,
| unless I'm badly misunderstanding.
| collyw wrote:
| I look at it the other way: this is the least-effort disk
| persistence that I have seen in python. A fair amount of dev
| stuff is reading from disk and putting it into some kind of
| object / dictionary format. I am not really bothered about what
| it uses under the hood (ok, it is interesting enough that I have
| looked up DBM). To me the main appeal of this is that you can
| use what appears to be a normal python dictionary and have it
| save to disk with minimal effort. This is a lot less friction
| than using a database.
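|
| For illustration, a minimal sketch of that dict-like API (file
| name made up):
|
|     import dbm
|
|     # "c" opens for read/write, creating the file if needed
|     with dbm.open("cache.db", "c") as db:
|         db["greeting"] = "hello"   # keys/values are stored as bytes
|         print(db["greeting"])      # b'hello'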
| giraffe_lady wrote:
| Oh yeah for sure. I don't mean to convey any disdain for dbm
| itself or for having it rolled into the standard library.
| Reading and writing custom files for "convenience
| persistence" or configuration is one of those really common
| things that you don't usually need to actually do if you know
| about tools like this.
|
| I was just pointing out that the title and even the article
| seem to associate this system with python per se, rather than
| understanding it as python's interface to a common and pre-
| existing system.
| Someone wrote:
| FTA: Why is sqlite3 so slow?
|
| My money is on it being mostly because sqlite3 actually stores
| the data. I don't think _dbm_ guarantees any of the data written
| makes it to disk until you _sync_ or _close_ the database.
|
| This writes, guesstimating, on the order of 100 kilobytes, so
| chances are the _dbm_ data never hits the disk.
| akubera wrote:
| There's also the shelve[0] module, which allows storing any
| pickleable object in a persistent key-value store, not just
| strings/bytes. I've found it very handy for caching while
| developing scripts which query remote resources, without having
| to worry about serialization.
|
| [0] https://docs.python.org/3.10/library/shelve.html
|
| Obligatory pickle note: one should be aware of pickle security
| implications and should not open a "Shelf" provided by untrusted
| sources, or rather should treat opening a shelf (or any pickle
| deserialization operation for that matter) as running an
| arbitrary Python script (which cannot be read).
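|
| A rough sketch of that caching pattern (fetch_remote stands in
| for whatever slow remote call you're caching):
|
|     import shelve
|
|     def cached_fetch(url):
|         # keys must be strings; values can be any pickleable object
|         with shelve.open("devcache") as cache:  # made-up file name
|             if url not in cache:
|                 cache[url] = fetch_remote(url)
|             return cache[url]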
| cristoperb wrote:
| I once wrote a locking wrapper around the shelve module so I
| could use it as a thread- and multiprocess-safe key-value cache
| (including a wrapper around the requests module's `get()` to
| transparently cache/validate http resources according to
| headers):
|
| https://github.com/cristoper/shelfcache
|
| It works despite some cross-platform issues (flock and macos's
| version of gdbm interacting to create a deadlock), but if I
| were to do it again I would just use sqlite (which Python's
| standard library has an interface for).
| TillE wrote:
| > but if I were to do it again I would just use sqlite
|
| Yeah, I tried to use shelve for some very simple stuff
| because it seemed like a great fit, but ultimately found that
| I had a much better time with tortoise-orm on top of sqlite.
|
| If you need _any_ kind of real feature, just use sqlite.
| PartiallyTyped wrote:
| I like to use `joblib.Memory` for my caching, as it detects
| changes in a function's code, reruns it instead of loading from
| the cache, and then overwrites the old result.
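|
| A minimal sketch (cache directory name made up):
|
|     from joblib import Memory
|
|     memory = Memory("./cachedir", verbose=0)
|
|     @memory.cache
|     def expensive(x):
|         # rerun if the function's code changed, else loaded from disk
|         return x ** 2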
| mrweasel wrote:
| Can you safely "pickle" Python objects across different
| architectures and Python version (assuming we forget about
| Python 2)?
| NegativeLatency wrote:
| There's a version number on the pickle format, so you might be
| able to do it across versions, but I suspect the version has
| changed for security reasons over time?
| Spivak wrote:
| Pickle could _in theory_ be architecture dependent since
| __getstate__ and __setstate__ are user provided methods. But
| you would have to try to do that on purpose.
|
| And you don't even have to forget about Python 2! If you use
| format version 2 you can pickle objects from every version
| from Python 2.3+, and all pickle formats are promised to be
| backwards compatible. If you only care about Python 3 then
| you can use version 3 and it will work for all Python 3.0+.
|
| https://docs.python.org/3/library/pickle.html#data-stream-
| fo...
|
| The reason against using pickle hasn't changed, though: if you
| wouldn't exec() it, don't unpickle it. If you're going to
| send it over the network use MAC use MAC use MAC. Seriously,
| it's built in -- the hmac module.
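|
| A rough sketch (the key and the framing are made up; a sha256
| MAC is 32 bytes):
|
|     import hashlib
|     import hmac
|     import pickle
|
|     SECRET = b"shared-secret-key"
|
|     def sign(obj):
|         payload = pickle.dumps(obj)
|         mac = hmac.new(SECRET, payload, hashlib.sha256).digest()
|         return mac + payload  # 32-byte MAC, then the pickle
|
|     def verify_and_load(blob):
|         mac, payload = blob[:32], blob[32:]
|         expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
|         if not hmac.compare_digest(mac, expected):
|             raise ValueError("bad MAC, refusing to unpickle")
|         return pickle.loads(payload)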
| mmcgaha wrote:
| We had a program that was sending a pickled session state
| as a cookie. We solved that by packing the cookie as a
| random string, a timestamp, the object, and a MAC. We
| validated the MAC, then checked the timestamp, and finally
| unpickled the object. It still bothers me that we were
| unpickling data passed by the client, but I ran out of
| arguments against doing it.
| clysm wrote:
| Another problem with pickles in any sort of living code base
| is when one makes modifications to the type itself - renaming,
| refactoring, whatever. Pickling objects (and nested objects)
| that aren't explicitly meant for data
| storage/retention/transmission leads to headaches.
|
| It's best to use another dedicated type or library specifically
| for this task.
| formerly_proven wrote:
| There's even an object database built around pickle, ZODB. It
| even has a networked implementation with replication /
| failover. Used to be part of Zope, originally written in the
| 1998 time frame or so - GvR actually committed a bunch of stuff
| to it.
| nas wrote:
| ZODB is awesome and overlooked, IMHO. I'm biased I guess
| because I was involved in making Durus which is inspired by
| ZODB. The ZODB model is not appropriate for all applications
| (optimistic concurrency control) but for the applications it
| works for, it's great. Very easy to develop with (no
| relational-to-OO mismatch) and performance can be great if
| you design your model carefully. The client caching of data
| is great. It is a bit like memcache but the data is already
| there as objects in RAM. The database server will invalidate
| the cache for you, no manual invalidation needed.
| mwcampbell wrote:
| I developed a web application in 2004 using Quixote and
| Durus. I wonder how many developers outside of MEMS
| Exchange ever used both of those packages. Somehow I had
| not yet encountered a proper ORM (I didn't discover
| SQLObject until later, and Django wasn't out yet), so I
| liked Durus at the time. That web application is still
| running, and in the years since then, I've had to handle
| several escalated customer support cases, and I often
| wished I had chosen a relational database so I could easily
| do ad-hoc queries. So Durus probably wasn't the best choice
| for my application, but that's not your fault. And one
| thing I liked about Durus was that the implementation was
| simple enough that I felt I could really understand how it
| worked. So thanks for that.
| nijave wrote:
| Wonder how this compares to using the filesystem as the store,
| with keys as filenames or file paths (and the value stored in
| the file)?
| j0hnyl wrote:
| I wonder if it's still faster than sqlite in 2022.
| cbdumas wrote:
| The script linked at the bottom of this page uses sqlite
| without an index on the table. In my testing if you add an
| index sqlite quickly outperforms dbm in terms of read
| performance
| calpaterson wrote:
| According to his benchmark, run today on Python 3.9 on linux,
| dbm is still considerably faster for random access.
|
| That said, what database dbm is using varies a lot by platform.
| I think on linux it's usually BerkeleyDB.
|
| I use dbm a lot to cache pickles of objects that were slow to
| create, e.g. pandas dataframes
| curiousgal wrote:
| Sorry if it's a dumb question, but since dbm writes stuff to
| disk, how is that different than loading the pickles?
| dfox wrote:
| You get one file for the whole store, not one per key.
| Also, it tends to handle lots of keys significantly better
| than filesystems handle lots of files in one directory.
| curiousgal wrote:
| Thank you!
| qalmakka wrote:
| I think on Linux it's often GNU DBM; Berkeley DB is a
| different, better implementation, by Oracle, that requires a
| third-party module.
| gerty wrote:
| It's not built-in, of course, but if you can live with the quirky
| API, LMDB as an embedded key-value store will be hard to beat in
| terms of speed. It can also do "one-writer, many-readers" and
| transactions.
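|
| A rough sketch with the third-party lmdb package (path and
| map_size are made up):
|
|     import lmdb  # pip install lmdb
|
|     env = lmdb.open("./mydb", map_size=10 * 2**20)
|     with env.begin(write=True) as txn:  # one writer at a time
|         txn.put(b"key", b"value")
|     with env.begin() as txn:            # readers don't block
|         print(txn.get(b"key"))          # b'value'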
| cbdumas wrote:
| So I read the script used to compare 'dbm' and 'sqlite3', and in
| sqlite it creates a table with no index. Hard to take that
| comparison seriously. I wrote a little benchmark script of my own
| just now and sqlite3 beats DBM handily if you add an index on the
| key.
| simonw wrote:
| Yeah I tried that myself - I took that benchmark script and
| changed the CREATE TABLE lines to look like this (adding the
| primary key):
|
|     CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)
|
| Here are the results before I made that change:
|
|     sqlite             Took 0.032 seconds, 3.19541 microseconds / record
|     dict_open          Took 0.002 seconds, 0.20261 microseconds / record
|     dbm_open           Took 0.043 seconds, 4.26550 microseconds / record
|     sqlite3_mem_open   Took 2.240 seconds, 224.02620 microseconds / record
|     sqlite3_file_open  Took 7.119 seconds, 711.87410 microseconds / record
|
| And here's what I got after adding the primary keys:
|
|     sqlite             Took 0.040 seconds, 3.97618 microseconds / record
|     dict_open          Took 0.002 seconds, 0.19641 microseconds / record
|     dbm_open           Took 0.042 seconds, 4.18961 microseconds / record
|     sqlite3_mem_open   Took 0.116 seconds, 11.58359 microseconds / record
|     sqlite3_file_open  Took 5.571 seconds, 557.13968 microseconds / record
|
| My code is here:
| https://gist.github.com/simonw/019ddf08150178d49f4967cc38356...
| cbdumas wrote:
| Yeah now that I dig in a little further it looks like it's
| not as clear cut as I thought. DBM read performance is better
| for me with larger test cases than I was initially using,
| though I am getting some weird performance hiccups when I use
| very large test cases. It looks like I'm too late to edit my
| top level response to reflect that.
| divbzero wrote:
| Just to spell out the results: adding the primary key appears
| to improve SQLite performance, but it still falls short of DBM.
| banana_giraffe wrote:
| The index is the least of the issues with the SQLite
| implementation. It's calling one INSERT per record in that
| version, so the benchmark is spending something like 99.8% of
| its time opening and closing transactions as it sets up the
| database.
|
| Fixing that on my machine took the sqlite3_file_open benchmark
| from 16.910 seconds to 1.033 seconds. Adding the index brought
| it down to 0.040 seconds.
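|
| A sketch of that one-transaction version ("store" matches the
| benchmark's table; rows stands in for the generated data):
|
|     import sqlite3
|
|     conn = sqlite3.connect("bench.db")  # made-up file name
|     conn.execute(
|         "CREATE TABLE IF NOT EXISTS store (key TEXT PRIMARY KEY, value TEXT)")
|
|     rows = [(str(i), "value") for i in range(100_000)]
|
|     with conn:  # one transaction for the whole batch
|         conn.executemany("INSERT INTO store VALUES (?, ?)", rows)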
|
| Also, I've never really dug into what's going on, but the dbm
| implementation is pretty slow on Windows, at least when I've
| tried to use it.
| avinassh wrote:
| I was working on a project to insert a billion rows in SQLite
| under a minute; batching the inserts made it crazy fast
| compared to individual transactions.
|
| link: https://avi.im/blag/2021/fast-sqlite-inserts/
| Steltek wrote:
| For use cases where you want a very simple key-value store,
| working with single records is probably a good test?
| banana_giraffe wrote:
| Maybe?
|
| Sure, mutating data sets might be a useful use case. But
| inserting thousands of items one at a time in a tight loop,
| then asking for all of them, is testing an unusual use case
| in my opinion.
|
| My point was that we're comparing apples and oranges. By
| default, I think, Python's dbm implementation doesn't do
| any sort of transaction or even sync after every insert,
| whereas SQLite does have a pretty hefty atomic guarantee
| after each INSERT, so they're quite different actions.
| [deleted]
| OGWhales wrote:
| > Also, I've never really dug into what's going on, but the
| dbm implementation is pretty slow on Windows, at least when
| I've tried to use it.
|
| Seems like this would be why:
| https://news.ycombinator.com/item?id=32852333
| banana_giraffe wrote:
| "the slow-but-simple implementation in module dbm.dumb will
| be used"
|
| No kidding.
|
| https://github.com/python/cpython/blob/main/Lib/dbm/dumb.py
|
| It uses a text file to store keys and pointers into a
| binary file for values. It works .. most of the time .. but
| yeah, that's not going to win any speed awards.
| jacobr1 wrote:
| Also you can use WAL
| MrBuddyCasino wrote:
| PHP also has support for several Berkeley-style databases:
| https://www.php.net/manual/en/dba.requirements.php
| pmarreck wrote:
| For the record, so does Erlang/Elixir, right in their stdlib
| (OTP), as either ETS, DETS, or Mnesia
| https://en.wikipedia.org/wiki/Mnesia
| https://www.erlang.org/doc/man/mnesia.html
| Snawoot wrote:
| There is no point in using sqlite3 in default journaling mode. I
| bet results may be even better than dbm if you use
|
|     PRAGMA journal_mode=WAL;
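|
| From Python's sqlite3 that's just (a sketch; file name made up):
|
|     import sqlite3
|
|     conn = sqlite3.connect("bench.db")
|     conn.execute("PRAGMA journal_mode=WAL")  # sticky: persists in the db file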
| bob1029 wrote:
| > There is no point to use sqlite3 in default journaling mode.
|
| There are situations where you wouldn't want to, but they are
| probably very uncommon outside of the embedded computing world.
| Copying 3 files vs 1 is not a gigantic deal. Most of the time
| you aren't even moving a SQLite database around.
| quietbritishjim wrote:
| There's never a need to copy 3 files anyway. If the database
| is closed cleanly then the WAL file and lock file are
| deleted. If not (either it's still open or not closed
| cleanly) then I think any half-finished transactions will be
| discarded if you copy all files to a new location. Certainly
| safest not to copy the 2 extra files in any case.
| WorldMaker wrote:
| You can still copy _just_ the primary file and not the WAL or
| any other files if you want to live dangerously and possibly
| lose transactions if there are write operations in parallel.
|
| Of course, you'd be better off with WAL-based replication
| tech like litestream instead of plain file copies if you are
| truly worried about parallel operations during your file
| copies.
| epilys wrote:
| You can even distribute individual WAL frames and arrange
| distributed writes with raft consensus and a time lease. I
| never formally modelled this but it seemed to work
| perfectly: the approach only lacked checkpointing the WAL
| file and synchronising the database file across raft nodes.
| nonethewiser wrote:
| more info https://sqlite.org/wal.html
| kuschku wrote:
| That only works if all processes accessing it share a memory
| namespace (aka, are not in containers/jails/VMs and on the same
| physical machine).
| singron wrote:
| You are probably thinking of issues with networked
| filesystems like nfs, 9p, or vboxsf where you can't mmap a
| file and actually share memory. Basically any other real
| filesystem that 2 processes actually open the same file on
| will allow shared memory.
| joppy wrote:
| WAL works for different processes accessing the same SQLite
| database. It's meant to improve concurrency, in fact.
| kuschku wrote:
| Yes, but the processes still need to share a memory
| namespace. In WAL mode, SQLite creates a shared memory file
| which gets mapped into any process trying to access it
| (that's why WAL mode creates additional .shm files).
|
| But that only works if all given processes run on the same
| physical machine, and not in containers/jails/VM.
| formerly_proven wrote:
| I dunno what SQLite is doing specifically but you can
| certainly mmap a file across containers and it behaves
| like you'd think. SELinux and other hardening options
| might interfere though, because shared mmaps should imho
| be viewed as a very likely vector for cross-process
| exploitation.
| fuckstick wrote:
| There isn't really a thing called a memory namespace, in
| Linux or the BSDs at least. In Linux there are cgroups that
| deal with accounting and you can share VM space among cloned
| processes (which we then typically refer to as threads).
| Neither of these affect the ability to share memory among
| processes.
|
| The only namespace that matters here is the filesystem/mount
| namespace.
|
| There is no reason you can't access the same shared SQLite
| database on a common volume between containers on Linux for
| instance.
| wodenokoto wrote:
| According to the documentation it is not so much a db included
| as an interface to one:
|
| > dbm is a generic interface to variants of the DBM database --
| dbm.gnu or dbm.ndbm. If none of these modules is installed, the
| slow-but-simple implementation in module dbm.dumb will be used.
| There is a third party interface to the Oracle Berkeley DB
|
| https://docs.python.org/3/library/dbm.html
| tootie wrote:
| Yeah, dbm is a posix database utility that's been around
| forever. There's a few implementations available and there are
| bindings for most languages.
|
| https://en.wikipedia.org/wiki/DBM_(computing)
| chrismorgan wrote:
| When it includes a fallback implementation so that you can rely
| on it working, I think it's fair to call it "included".
| cozzyd wrote:
| Right but the on-disk formats between versions won't be
| compatible...
| qalmakka wrote:
| The main issue here is that, given it relies on whatever
| implementation of DBM you have installed, its performance
| will vary considerably from OS to OS.
|
| I don't know how fast or slow the "dumb" implementation is,
| but I can bet it is way slower than gdbm. I can see someone
| using this module, considering it "fast enough" on GNU/Linux,
| and then finding out that it's painfully slow on Windows,
| where shipping binaries of another implementation is a massive
| PITA.
| tyingq wrote:
| Each of which has its own set of limitations and watch-outs. I
| would not use the python dbm interface for anything serious.
| Ndbm, for example, has a per-item value limit that varies
| between ~1-4k.
| isoprophlex wrote:
| There exists the 'diskcache' library too, which gives you a
| persistent dictionary that stores any kind of pickle-able python
| object. There's also a nice persistent memoization decorator,
| with optional TTL... I've always found 'diskcache' to be very
| ergonomic.
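|
| A rough sketch (directory name made up):
|
|     from diskcache import Cache  # pip install diskcache
|
|     cache = Cache("./cachedir")
|     cache["key"] = {"any": "pickleable object"}
|
|     @cache.memoize(expire=3600)  # the TTL'd memoization decorator
|     def slow(x):
|         return x * 2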
| BiteCode_dev wrote:
| If you can pip install something, I would also recommend
| diskcache:
|
| - it's backed by sqlite, so the data is much more secure
|
| - you can access it from several processes
|
| - it's more portable
| syncsynchalt wrote:
| This is libdbm, a relatively ancient db library that you could
| find on most unixes.
|
| In the 90s you might target dbm for your portable Unix
| application because you could reasonably expect it to be
| implemented on your platform. It lost a lot of share to gdbm and
| sleepycat's BerkeleyDB, both of which I consider its successors.
|
| Of course all of this is now sqlite3.
| [deleted]
| blibble wrote:
| the gnu version of dbm has some nice surprises
|
| if you go and look at the binary database you'll see random other
| bits of memory in there, because it doesn't initialise most of
| its buffers
| ridiculous_fish wrote:
| In AppleScript you can mark a variable as a property, and then
| its value is automatically remembered across runs. For example a
| script can easily remember how many times it has run:
|     property runCount : 0
|     set runCount to runCount + 1
|
| So natural to use!
|
| https://developer.apple.com/library/archive/documentation/Ap...
| nathancahill wrote:
| I feel like AppleScript could have really shone if things had
| played out differently in Mac OS automation. I learned to code
| with AppleScript on OS 9 and it really got me into
| programming. MacScripter.net was my old hangout. Good times.
| im3w1l wrote:
| As far as I can tell this doesn't support transactions, which
| imho is table stakes for a persistence solution in 2022.
| eesmith wrote:
| FWIW, it was added in 1992.
| https://github.com/python/cpython/commit/dd9ed839d6958f73704...
| with gdbm added in 1994.
| https://github.com/python/cpython/commit/4b4c664d2e93279c8d7...
| .
| im3w1l wrote:
| The article makes it sound like it's something to consider
| today (well 2018), that's why I think it's fair to see how
| well it satisfies modern requirements.
| ok123456 wrote:
| If you just need a simple key value store that stays on
| disk, it's fine.
| jonnycomputer wrote:
| requirements _for what_ is the question. "modern
| requirements" doesn't mean that much.
| pjc50 wrote:
| DBM is the persistence solution of 1979.
| https://en.wikipedia.org/wiki/DBM_(computing)
| Zababa wrote:
| > Why is sqlite3 so slow? Well, the benchmark is probably not
| representative of the typical workload for sqlite (lots of
| individual insertions and selections). If we perform the same
| operations using executemany and one select of all the keys at
| once, we get:
|
| I think this is because each time you call execute (which is
| probably sqlite3_exec under the hood), your statement is prepared
| again, and then deleted, while with executemany it's prepared
| once and then used with the data. According to the SQLite3
| documentation of sqlite3_exec:
|
| > The sqlite3_exec() interface is a convenience wrapper around
| sqlite3_prepare_v2(), sqlite3_step(), and sqlite3_finalize(),
| that allows an application to run multiple statements of SQL
| without having to use a lot of C code.
|
| I knew that prepared statements are faster when you execute a
| request a lot, but it seems that's not exactly the case: all
| statements are prepared, and the performance improvement comes
| from preparing (and deleting) only once. The documentation page
| about the Prepared Statement Object has a good explanation of
| the lifecycle of a prepared statement, which also seems to be
| the lifecycle of all statements.
|
| [1]: https://www.sqlite.org/c3ref/exec.html
|
| [2]: https://www.sqlite.org/c3ref/stmt.html
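|
| For illustration, the two call patterns being compared (a
| sketch; table and data made up):
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     conn.execute("CREATE TABLE store (key TEXT, value TEXT)")
|     rows = [(str(i), "x") for i in range(10_000)]
|
|     # one execute() per row: statement handled on every call
|     for row in rows:
|         conn.execute("INSERT INTO store VALUES (?, ?)", row)
|
|     # executemany(): prepared once, stepped with each row's data
|     conn.executemany("INSERT INTO store VALUES (?, ?)", rows)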
| quietbritishjim wrote:
| The Python sqlite3 module caches prepared statements; I think
| it's the last 100 by default but that can be overridden. It
| does require you to use proper parameters rather than string
| interpolation for the cache to be effective, though (I didn't
| check their benchmark code to see if that's the case).
|
| As others have pointed out, it's the lack of a suitable index
| that's the real problem here.
| jonnycomputer wrote:
| I did not know this either. Wow.
| collyw wrote:
| I thought Perl's Tie::DB_file module was unique for that sort of
| thing. That's probably quite useful.
___________________________________________________________________
(page generated 2022-09-15 23:00 UTC) |