[HN Gopher] Rqlite: The lightweight, distributed relational data...
___________________________________________________________________
 
Rqlite: The lightweight, distributed relational database built on
SQLite
 
Author : pavanyara
Score  : 175 points
Date   : 2021-01-22 13:48 UTC (9 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| mech422 wrote:
| So it looks like you can now distribute SQLite at the:
| 
| Stmt level: https://github.com/rqlite/rqlite
| 
| VFS Level: https://github.com/canonical/dqlite
| 
| Block Level: https://github.com/benbjohnson/litestream
| 
| Really cool enhancements to an awesome project!
 
| f430 wrote:
| Could you use this to build a decentralized p2p app? If so, what
| gotchas and limitations are there?
 
  | otoolep wrote:
  | No, rqlite is not suitable for that kind of application. All
  | writes must go through the leader.
 
| hermitcrab wrote:
| So if:
| 
| Alice, Bob and Charlie have a synced copy of the same database
| 
| Charlie goes on a plane and adds loads of records without a
| connection to the other databases
| 
| Alice and Bob make no changes
| 
| Charlie comes home and syncs
| 
| Will Charlie lose all his changes, as his database is different
| to Alice and Bob's?
| 
| What happens if Alice, Bob and Charlie all make changes offline
| then resync?
 
  | mrkurt wrote:
  | It doesn't work that way. rqlite is effectively a
  | leader/follower model that uses raft for leader consensus.
  | Writes can only happen online, and only to the leader.
 
    | hermitcrab wrote:
    | Ok, thanks!
 
  | teraflop wrote:
  | As per the description, all updates must be replicated to a
  | quorum of instances. If Charlie is on a plane without the
  | ability to contact a quorum, he can't add records in the first
  | place.
  | 
  | This is a fundamentally necessary tradeoff to provide strong
  | consistency, as described by the CAP theorem.
 
    | unnouinceput wrote:
    | In which case Charlie would need an additional local DB to
    | record those records, and when he gets back he'd use another
    | protocol/method/system/whatever to add those new records? How
    | about if everybody goes and adds records to the same table?
    | 
    | Here is a real-life scenario that I had to deal with in the
    | past. A technician (carpenter) goes to a client's home to
    | repair furniture in the middle of nowhere, so no internet. He
    | adds the necessary paperwork (pictures, declarations, the
    | signed and scanned contract) to the Clients table. This
    | company was employing hundreds of such technicians throughout
    | the many counties of Germany, each with a laptop running this
    | app, which was the backbone for getting paid and doing the
    | work. And it was not uncommon for more than one carpenter to
    | go to a client's home and do the repairs. Since each
    | carpenter was paid according to his own work, each of them
    | would create entries in their local Clients table, and when
    | they got back to HQ their work was manually uploaded to the
    | central DB; only after that did they get paid. I automated
    | that (that was the job: to eliminate the thousands of hours
    | that carpenters were wasting on manual uploads).
    | 
    | So given the above scenario, how is this system going to
    | achieve that? Same table, and same client details even in
    | table Clients, just different rows for different carpenters
    | (foreign key to table Carpenters).
 
      | wtallis wrote:
      | > So given the above scenario, how is this system going to
      | achieve that?
      | 
      | I don't think it is. You're describing a use case that is
      | distributed but explicitly does not want to enforce
      | consistency--you want offline workers to all be able to
      | keep working, and you're enforcing consistency after the
      | fact and outside of the database itself.
 
      | renewiltord wrote:
      | This tool does not handle that problem. It is not meant to.
      | It's for simultaneously available replicas. And this is the
      | rare moment where thinking about replication vs
      | synchronization as different is worthwhile.
      | 
      | You usually replicate for failure tolerance and performance
      | (this project only aims for the former).
 
      | vorpalhex wrote:
      | As other commenters have mentioned, this tool is not
      | intended for that kind of case. You want something like
      | PouchDB, which handles this kind of setup but comes with a
      | different set of tradeoffs (it's eventually consistent, not
      | strongly consistent).
 
  | adsharma wrote:
  | Charlie doesn't have to lose the data he saved on the plane.
  | Don't know what the rqlite implementation does.
  | 
  | In the second case, Alice-Bob consensus overrides Charlie
 
  | [deleted]
 
  | fipar wrote:
  | I have not read the full description of this project yet, but
  | it does mention the use of Raft for consensus, so in your
  | example I would expect Charlie to not be able to add any
  | records while disconnected, because, if my understanding is
  | correct:
  | 
  | - Charlie would either be the leader, but then, without getting
  | confirmation of writes from enough followers, he would not be
  | able to do any writes himself, or
  | 
  | - Charlie would be a follower, and while disconnected would
  | obviously get no writes from the leader.
 
  | NDizzle wrote:
  | What's messed up is that I was doing this kind of thing with
  | Lotus Domino in the late 90s. I'm sure others were doing it
  | before me, too.
  | 
  | Sometimes you had conflicts that needed resolution, but those
  | weren't that frequent for our use case.
 
| mshenfield wrote:
| Going from a local db to one over a network has at least one
| risk. The SQLite docs give developers the okay to write "n+1"
| style queries (https://www.sqlite.org/np1queryprob.html). When
| the db is on the same file system as the application this
| pattern is fine. But as soon as you add a network call it
| becomes a potential bottleneck.
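| 
| A minimal sketch of the pattern in Go (hypothetical users/orders
| schema, database/sql with any registered driver, "database/sql"
| import assumed):
| 
|     // N+1: one query for the ids, then one query per id. Cheap
|     // against a local SQLite file, a bottleneck once every call
|     // is a network round trip.
|     func loadNames(db *sql.DB) ([]string, error) {
|         rows, err := db.Query(`SELECT id FROM users`)
|         if err != nil {
|             return nil, err
|         }
|         defer rows.Close()
|         var names []string
|         for rows.Next() {
|             var id int64
|             if err := rows.Scan(&id); err != nil {
|                 return nil, err
|             }
|             var name string
|             err = db.QueryRow(
|                 `SELECT name FROM orders WHERE user_id = ?`,
|                 id).Scan(&name)
|             if err != nil {
|                 return nil, err
|             }
|             names = append(names, name)
|         }
|         // Network-friendly alternative: one round trip, e.g.
|         // SELECT o.name FROM users u
|         //   JOIN orders o ON o.user_id = u.id
|         return names, rows.Err()
|     }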
 
| alberth wrote:
| @otoolep
| 
| SQLite has a great post on "When to Use" (and not use) SQLite.
| 
| Would be great if you included these same use cases in the ReadMe
| docs and made it clear whether Rqlite can address them.
| 
| https://www.sqlite.org/whentouse.html
 
| scottlamb wrote:
| It looks like this layers Raft on top of SQLite. I don't like it
| when systems replicate high-level changes like "update users set
| salary = salary + 1000 where ...;". Instead, I prefer that they
| replicate low-level changes like "replace key/block X, which
| should have contents C_x, with C_x'".
| 
| Why? Imagine you're doing a rolling update. Some of your replicas
| are running the newer version of SQLite and some are running the
| older version. They may not execute the high-level query in
| exactly the same way. For example, in the absence of an "order
| by" clause, the order of select results is unspecified. So
| imagine someone makes a mutation that depends on this: "insert
| ... select ... limit". (Maybe a dumb example but it can happen
| anyway.) Now the databases start to diverge, not only in
| underlying bytes and implementation-defined ordering but in
| actual row data.
| 
| I worked on a major distributed system that originally replicated
| high-level changes and switched to replicating low-level changes
| for this reason. We had a system for detecting when replicas
| didn't match, and replication of high-level changes was the
| biggest reason for diffs. (Hardware error was the second biggest
| reason; we added a lot of checksumming because of that.)
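| 
| To make the "insert ... select ... limit" example concrete (table
| names are hypothetical; the Go wrapper just holds the SQL):
| 
|     // Without an ORDER BY, which rows LIMIT picks is
|     // implementation-defined, so replicas running different
|     // SQLite versions may not insert the same rows.
|     const nonDeterministic = `
|         INSERT INTO archive (id, body)
|         SELECT id, body FROM messages LIMIT 100;`
| 
|     // Deterministic rewrite: pin the ordering so every replica
|     // selects the same 100 rows.
|     const deterministic = `
|         INSERT INTO archive (id, body)
|         SELECT id, body FROM messages ORDER BY id LIMIT 100;`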
 
  | lrossi wrote:
  | If you replicate low level changes, you might not be able to do
  | a live upgrade/downgrade if the version change affects the on
  | disk format.
  | 
  | Another downside is that you might also propagate data
  | corruption in case of bugs in the DB software (e.g. memory
  | corruption) or hardware defects.
 
    | scottlamb wrote:
    | > If you replicate low level changes, you might not be able
    | to do a live upgrade/downgrade if the version change affects
    | the on disk format.
    | 
    | It certainly requires care to ensure all the replicas have
    | software capable of understanding the new format before it's
    | actually written, but it can be done. Likewise after writing
    | the new format, you want to have a roll-back plan.
    | 
    | In SQLite's case, https://sqlite.org/formatchng.html says:
    | "Since 2004, there have been enhancements to SQLite such that
    | newer database files are unreadable by older versions of the
    | SQLite library. But the most recent versions of the SQLite
    | library should be able to read and write any older SQLite
    | database file without any problems." I don't believe
    | upgrading SQLite automatically starts using any of those
    | enhancements; you'd have to do a schema change like "PRAGMA
    | journal_mode=WAL;" first.
    | 
    | > Another downside is that you might also propagate data
    | corruption in case of bugs in the DB software (e.g. memory
    | corruption) or hardware defects.
    | 
    | This happens regardless.
 
  | otoolep wrote:
  | rqlite creator here.
  | 
  | I understand what you're saying, but I don't think it's a
  | compelling objection. Obviously, differences between versions
  | -- even patched versions -- can result in subtle, unintended
  | differences in how the code works for a given program. But
  | there is no reason to think a system that operates at a lower
  | level ("replace key/block X, which should have contents C_x,
  | with C_x'") is less vulnerable to this kind of issue than one
  | that operates at a higher level, i.e. statement-based
  | replication, which rqlite uses. In fact I would argue that the
  | system that operates at a higher level of abstraction is
  | _less_ vulnerable, i.e. less likely to be affected by such
  | subtle changes.
 
  | [deleted]
 
  | xd wrote:
  | In MySQL/MariaDB this is what's known as non-deterministic
  | behaviour, so row or mixed replication is used to mitigate it.
  | 
  | Statement-based (high-level) replication is very useful for
  | e.g. "insert into tbl0 select col0 from tbl1 order by col1", as
  | you only need to send the query, not the individual row data.
 
  | tyingq wrote:
  | Dqlite replicates at the VFS layer of sqlite, which sounds like
  | what you're looking for. https://github.com/canonical/dqlite
 
    | hinkley wrote:
    | I haven't gotten a straight answer out of the k3s people
    | about why they dumped dqlite, just some comment about bugs.
    | 
    | I could see myself using dqlite in the future, so I'd like
    | some more user reports from the trenches. Can anyone shed
    | some light on this?
 
      | tyingq wrote:
      | The initial issue seems to be saying that it's because they
      | need to have etcd anyway, so consolidating on that removes
      | a dependency. Which fits their simplicity goal. Though the
      | issue appears to have been created by a user, not a
      | maintainer.
      | 
      |  _" Since the HA direction needs etcd anyway.. I'm
      | proposing dropping support for sqlite as the default
      | embedded non-HA option and switch to embedded etcd as the
      | default. This will reduce overall effort of maintainability
      | of two entirely different datastores."_
      | 
      | https://github.com/k3s-io/k3s/issues/845
 
        | hinkley wrote:
        | I accepted that reason for them, but as I don't benefit
        | directly from switching to etcd, I'd rather know about
        | what started the conversation than how it was concluded.
 
        | merb wrote:
        | dqlite support was flaky and went through their
        | translation layer, which probably added complexity.
 
  | ttul wrote:
  | Sounds like you need to submit a patch!
 
    | scottlamb wrote:
    | lol, this is an unhelpful reflex answer to any criticism of
    | open source software. What I'm describing is a redesign.
    | There's no point in submitting a patch for that. It'd be much
    | more effort than a few lines of code, and it wouldn't be
    | accepted. Open source means that anyone can fork. It doesn't
    | mean that maintainers will automatically merge patches
    | replacing their software with completely different software.
    | The only way that will happen is if the maintainers decide
    | for themselves it needs to be redesigned, and that starts
    | with discussion rather than a patch. It's also a long shot
    | compared to just finding different software that already has
    | the design I prefer.
    | 
    | If I want replicated SQLite, I'll look at dqlite or
    | litestream instead, which sound more compatible with my
    | design sensibilities. (Thanks, tyingq and benbjohnson!)
 
      | monadic3 wrote:
      | Frankly your bad-faith commentary isn't helping the
      | conversation either. I sincerely appreciate your cleaning
      | up the tone at the end.
 
  | benbjohnson wrote:
  | I just open-sourced a streaming replication tool for SQLite
  | called Litestream that does physical replication (raw pages)
  | instead of logical replication (SQL commands). Each approach
  | has its pros and cons. Physical replication logs tend to be
  | larger than logical logs but I agree that you avoid a lot of
  | issues if you do physical replication.
  | 
  | https://github.com/benbjohnson/litestream
 
    | hinkley wrote:
    | Do you use compression? And if so, how does that affect the
    | relative amount of network traffic vs logical replication?
 
      | benbjohnson wrote:
      | Yes, Litestream uses LZ4 compression. I originally used
      | gzip but the compression speed was pretty slow. B-tree
      | pages tend to compress well because they're usually only
      | 50-75% full: they need space to insert new records, and
      | pages split when they get full.
      | 
      | I'm seeing files shrink down to 14% of their size (1.7MB
      | WAL compressed to 264KB). However, your exact compression
      | will vary depending on your data.
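      | 
      | The general shape of that step in Go, as a sketch only (it
      | assumes the pierrec/lz4 package and a plain file-to-file
      | copy, not Litestream's actual code):
      | 
      |     import (
      |         "io"
      |         "os"
      | 
      |         "github.com/pierrec/lz4/v4"
      |     )
      | 
      |     // compressWAL streams a WAL segment through an LZ4
      |     // writer; half-empty b-tree pages give the compressor
      |     // plenty to work with.
      |     func compressWAL(src, dst string) error {
      |         in, err := os.Open(src)
      |         if err != nil {
      |             return err
      |         }
      |         defer in.Close()
      |         out, err := os.Create(dst)
      |         if err != nil {
      |             return err
      |         }
      |         defer out.Close()
      |         zw := lz4.NewWriter(out)
      |         if _, err := io.Copy(zw, in); err != nil {
      |             return err
      |         }
      |         return zw.Close()
      |     }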
 
        | hinkley wrote:
        | Ah, that makes sense. Most inserts don't split pages, so
        | are around n worst case pages, but once in a while you
        | get 2n updates where most of them are half full, and so
        | compress better.
        | 
        | So how does that compare to logical replication? (Also I
        | imagine packet size plays a role, since you have to flush
        | the stream quite frequently, right? 1000 bytes isn't much
        | more expensive than 431)
 
        | benbjohnson wrote:
        | Litestream defaults to flushing out to S3 every 10
        | seconds but that's mainly because of PUT costs. Each
        | request costs $0.000005, so it comes to about $1.30 per
        | month. If you flushed every second then it'd cost you
        | $13/month.
        | 
        | Logical replication would have significantly smaller
        | sizes, although the size cost isn't a huge deal on S3.
        | Data transfer in to S3 is free and so are DELETE
        | requests. The data only stays on S3 for as long as your
        | Litestream retention specifies. So if you're retaining
        | for a day then you're just keeping one day's worth of WAL
        | changes on S3 at any given time.
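        | 
        | (Back-of-the-envelope check of those figures, assuming
        | S3's roughly $0.005 per 1,000 PUT requests:)
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     func main() {
        |         // One flush every 10s: 6/min * 60 * 24 * 30 days.
        |         puts := 6 * 60 * 24 * 30 // 259,200 PUTs/month
        |         // Priced at about $0.005 per 1,000 requests.
        |         cost := float64(puts) * 0.005 / 1000
        |         fmt.Printf("about $%.2f/month\n", cost) // ~$1.30
        |     }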
 
    | mrkurt wrote:
    | I've been following this, and am anxious for the direct-to-
    | sqlite replication.
    | 
    | One of rqlite's big limitations is that it resyncs the entire
    | DB at startup time. Being able to start with a "snapshot" and
    | then incrementally replicate changes would be a big help.
 
      | CuriouslyC wrote:
      | This is tangential, but depending on your sql needs,
      | CouchDB's replication story is amazing, and you can
      | replicate to the browser using PouchDB. There is an
      | optional SQL layer, but obviously the good replication
      | story comes with some trade-offs.
 
      | otoolep wrote:
      | rqlite creator here.
      | 
      | I'm not sure I follow why it's a "big limitation"? Is it
      | causing you long start-up times? I'm definitely interested
      | in improving this, if it's an issue. What are you actually
      | seeing?
      | 
      | Also, rqlite does do log truncation (as per the Raft spec),
      | so after a certain number of log entries (8192 by default)
      | node restarts work _exactly_ like you suggested. The SQLite
      | database is restored from a snapshot, and any remaining
      | Raft Log entries are applied to the database.
 
        | mrkurt wrote:
        | Ah, ok that's some nuance I didn't know about!
        | 
        | We're storing a few GB of data in the sqlite DB.
        | Rebuilding that when rqlite restarts is a slow and
        | intensive process compared to just reusing the file
        | already on disk.
        | 
        | Our particular use case means we'll end up restarting
        | 100+ replica nodes all at once, so the way we're doing
        | things makes it more painful than necessary.
 
        | otoolep wrote:
        | But how do you know it's intensive? Are you watching disk
        | IO? Is there a noticeable delay when the node starts
        | before it's ready to receive requests?
        | 
        | Try setting "-raft-snap" to a lower number, maybe 1024,
        | and see if it helps. You'll have far fewer log entries
        | to apply on startup. However, the node will perform a
        | snapshot more often, and writes are blocked during
        | snapshotting. It's a trade-off.
        | 
        | It might be possible to always restart using some sort of
        | snapshot, independent of Raft, but that would add
        | significant complexity to rqlite. The fact that the
        | SQLite database is built from scratch on startup, from
        | the data in the Raft log, means rqlite is much more
        | robust.
 
        | mrkurt wrote:
        | Oh, we're reading the sqlite files directly. rqlite is
        | really just a mechanism for us to propagate read only
        | data to a bunch of clients.
        | 
        | We need that sqlite file to never go away. Even a few
        | seconds is bad. And since our replicas are spread all
        | over the world, it's not feasible to move 1GB+ data from
        | the "servers" fast enough.
        | 
        | Is there a way for us to use that sqlite file without it
        | ever going away? We've thought about hardlinking it
        | elsewhere and replacing the hardlink when rqlite is up,
        | but haven't built any tooling to do that.
 
        | otoolep wrote:
        | Hmmmm, that's a different issue. :-)
        | 
        | Today the rqlite code deletes the SQLite database (if
        | present) and then rebuilds it from the Raft log. It makes
        | things so simple, and ensures the node can always
        | recover, regardless of the prior state of the SQLite
        | database -- basically the Raft log is the only thing that
        | matters and that is _guaranteed_ to be the same under
        | each node.
        | 
        | The fundamental issue here is that Raft can only
        | guarantee that the Raft log is in consensus, so rqlite
        | can rely on that. It's always possible that the copy of
        | SQLite under a single node gets into a different state
        | than all the other nodes. This is because the change to
        | the Raft log, and the corresponding change to SQLite, are
        | not atomic. Blowing away the SQLite database means a
        | restart would fix this.
        | 
        | If this is important -- and what you ask sounds
        | reasonable for the read-only case that rqlite can support
        | -- I guess the code could rebuild the SQLite database in
        | a temporary place, wait until that's done, and then
        | quickly swap any existing SQLite file with the rebuilt
        | copy. That would minimize the time the file is not
        | present. But the file has to go away at some point.
        | 
        | Alternatively rqlite could open any existing SQLite file
        | and DROP all data first. At least that way the _file_
        | wouldn't disappear, but the data in the database would
        | wink out of existence and then come back. WDYT?
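        | 
        | Roughly the shape of that swap in Go (a sketch, not
        | rqlite code; assumes "os" is imported and that the temp
        | file is on the same filesystem as the target, so the
        | rename is atomic):
        | 
        |     // swapInRebuiltDB rebuilds the SQLite file at a
        |     // temporary path and then renames it over dbPath, so
        |     // readers only ever see the old state or the new
        |     // one, never a missing file.
        |     func swapInRebuiltDB(dbPath string,
        |         rebuild func(path string) error) error {
        | 
        |         tmp := dbPath + ".rebuild"
        |         if err := rebuild(tmp); err != nil {
        |             os.Remove(tmp) // best-effort cleanup
        |             return err
        |         }
        |         return os.Rename(tmp, dbPath)
        |     }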
 
        | mrkurt wrote:
        | Rebuilding and then moving it in place sounds pretty nice
        | to me.
 
      | jlongster wrote:
      | I built an app (https://actualbudget.com/) that uses a
      | local sqlite db and syncs changes, and that's exactly how
      | it works. It takes quite a different approach though, using
      | CRDTs to represent changes and those are synced around.
      | When a fresh client comes into play, it downloads the
      | latest sqlite snapshot from a server and then syncs up.
 
      | benbjohnson wrote:
      | That can be painful for sure. Litestream will do a snapshot
      | on startup if it detects that it can't pick up from where
      | it left off in the WAL. That can happen if Litestream is
      | shut down and another process performs a checkpoint. But
      | generally a restart will just use the existing snapshot &
      | continue with the WAL replication.
 
    | jgraettinger1 wrote:
    | Here's another approach to the problem [0]:
    | 
    | This package is part of Gazette [1], and uses a gazette
    | journal (known as a "recovery log") to power raw bytestream
    | replication & persistence.
    | 
    | On top of journals, there's a recovery log "hinting"
    | mechanism [2] that is aware of file layouts on disk, and
    | keeps metadata around the portions of the journal which must
    | be read to recover a particular on-disk state (e.g. what are
    | the current live files, and which segments of the log hold
    | them?). You can read and even live-tail a recovery log to
    | "play back" / maintain the on-disk file state of a database
    | that's processing somewhere else.
    | 
    | Then, there's a package providing RocksDB with an environment
    | that's configured to transparently replicate all database
    | file writes into a recovery log [3]. Because RocksDB is a
    | continuously compacted LSM-tree and we're tracking live
    | files, it's regularly deleting files, which allows chunks of
    | the recovery log to be "dropped" from what must be read or
    | stored in order to recover the full database.
    | 
    | For the SQLite implementation, SQLite journals and WAL's are
    | well-suited to recovery logs & their live file tracking,
    | because they're short-lived ephemeral files. The SQLite page
    | DB is another matter, however, because it's a super-long
    | lived and randomly written file. Naively tracking the page DB
    | means you must re-play the _entire history_ of page mutations
    | which have occurred.
    | 
    | This implementation solves this by using a SQLite VFS which
    | actually uses RocksDB under the hood for the SQLite page DB,
    | and regular files (recorded to the same recovery log) for
    | SQLite journals / WALs. In effect, we're leveraging RocksDB's
    | regular compaction mechanisms to remove old versions of
    | SQLite pages which must be tracked / read & replayed.
    | 
    | [0] https://godoc.org/go.gazette.dev/core/consumer/store-
    | sqlite
    | 
    | [1] https://gazette.readthedocs.io/en/latest/
    | 
    | [2] https://gazette.readthedocs.io/en/latest/consumers-
    | concepts....
    | 
    | [3] https://godoc.org/go.gazette.dev/core/consumer/store-
    | rocksdb
 
    | webmaven wrote:
    | _> I just open-sourced a streaming replication tool for
    | SQLite called Litestream that does physical replication (raw
    | pages) instead of logical replication (SQL commands). Each
    | approach has its pros and cons. Physical replication logs
    | tend to be larger than logical logs but I agree that you
    | avoid a lot of issues if you do physical replication._
    | 
    | Hmm. Not having dug into your solution much, is it safe to
    | say that the physical replication logs have something like
    | logical checkpoints? If so, would it make sense to keep
    | physical logs only for a relatively short rolling window, and
    | logical logs (i.e. only the interleaved logical checkpoints)
    | longer?
 
      | benbjohnson wrote:
      | I suppose you could save both the physical and logical logs
       | if you really needed long-term retention. SQLite databases
      | (and b-trees in general) tend to compress well so the
      | physical logging isn't as bad as it sounds. You could also
      | store a binary diff of the physical page which would shrink
      | it even smaller.
      | 
      | One benefit to using physical logs is that you end up with
      | a byte-for-byte copy of the original data so it makes it
      | easy to validate that your recovery is correct. You'd need
      | to iterate all the records in your database to validate a
      | logical log.
      | 
      | However, all that being said, Litestream runs as a separate
      | daemon process so it actually doesn't have access to the
      | SQL commands from the application.
 
| szszrk wrote:
| rqlite is mentioned here quite often, multiple times last year. I
| don't think this entry brings anything new.
 
| foolinaround wrote:
| We currently use browsers on several devices (both laptops and
| android) and rely on google sync currently. Maybe this could be
| used to sync bookmarks, history etc across my devices but still
| keep my data local to me?
 
  | JoachimSchipper wrote:
  | This uses Raft, so a quorum of devices would need to be online
  | at the same time. That's not what you want for browser sync.
 
| blackbear_ wrote:
| I know nothing of consensus algorithms and distributed systems so
| bear with me please.
| 
| > rqlite uses Raft to achieve consensus across all the instances
| of the SQLite databases, ensuring that every change made to the
| system is made to a quorum of SQLite databases, or none at all.
| 
| What I understood from this sentence is that, if we have three
| instances, rqlite will make sure that every change is written to
| at least two. But what if two changes are written to two
| different pairs of instances? Then the three instances will have
| three different versions of the data. For example, change X is
| written to instances A and B, and change Y is written to B and C.
| Now A has X, B has X and Y, and C has Y only. How do you decide
| who is right?
 
  | edoceo wrote:
  | Raft consensus: https://raft.github.io/
  | 
  | Surprisingly easy to understand, and a cool visual.
 
  | teraflop wrote:
  | In brief: at any point in time, one of the replicas is a
  | "leader" which controls the order in which operations are
  | committed. The changes occur in a defined sequence, and other
  | replicas may lag behind the leader, but cannot be inconsistent
  | with it.
  | 
  | Your example can't happen, because if (for instance) A is the
  | leader, then C will not apply change Y without contacting the
  | leader, which will tell it to apply X first.
  | 
  | If you want more details about how this handles all the edge
  | cases -- for instance, what happens if the leader crashes --
  | the Raft paper is quite accessible:
  | https://raft.github.io/raft.pdf
 
    | hinkley wrote:
    | TL;DR: Raft updates are serialized (as in sequential).
 
  | whizzter wrote:
  | The semantics of Raft have a "simple" (compared to the harder
  | to understand Paxos) forward motion of events that is supposed
  | to guarantee that you won't get into weird states regardless of
  | whether any particular node(s) go down (I think it can survive
  | up to (N-1)/2 dead machines, rounded down, in a cluster of N).
  | 
  | Raft is based on having a leader decide what the next COMMIT is
  | going to be, so B could never have X and Y at the same time
  | (they could both be queued but other mechanisms could reject
  | them).
  | 
  | Also, data is not considered committed until more than half the
  | cluster has acknowledged it (at which point the leader knows it
  | and can move forward). Leader election also works in a similar
  | way, iirc.
  | 
  | As others mentioned, the visualization on
  | https://raft.github.io/ is really good (You can affect it to
  | create commits and control downtime of machines)
 
    | hinkley wrote:
    | It's 1/2 + 1, isn't it? So if the leader goes down at the
    | exact moment of quorum, you can still get quorum again.
    | 
    | That would mean in 3 servers you need 2.5 aka 3 machines to
    | commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a
    | wash anyway, because as the servers go up the fraction you
    | need for quorum goes down, but the odds of falling behind or
    | failing outright go up too. Not to mention the time during
    | which 1/n machines are down due to an upgrade gets longer and
    | longer the more machines you have, increasing the chances of
    | double fault.
 
      | simtel20 wrote:
      | > It's 1/2 + 1 isn't it?
      | 
      | The parent post is talking about the number that can go
      | down while maintaining quorum, and you're talking about the
      | number that need to remain up to maintain quorum. So you're
      | both correct.
      | 
      | However:
      | 
      | > That would mean in 3 servers you need 2.5 aka 3 machines
      | to commit a change.
      | 
      | That seems wrong. You need N//2 +1 where "//" is floor
      | division, so in a 3 node cluster, you need 3//2 +1, or 1+1
      | or 2 nodes to commit a change.
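      | 
      | In Go terms (integer division already floors for positive
      | values):
      | 
      |     // quorum returns the minimum number of nodes that must
      |     // agree in a cluster of n: a strict majority.
      |     func quorum(n int) int { return n/2 + 1 }
      | 
      |     // quorum(3) == 2, quorum(4) == 3, quorum(5) == 3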
 
        | hinkley wrote:
        | I think I see the problem.
        | 
        | 'Simple majority' is based on the number of the machines
        | that the leader knows about. You can only change the
        | membership by issuing a write. Write quorum and
        | leadership quorum are two different things, and if I've
        | got it right, they can diverge after a partition.
        | 
        | I'm also thinking of double faults, because the point of
        | Raft is to get past single fault tolerance.
        | 
        | [edit: shortened]
        | 
        | After a permanent fault (broken hardware) in a cluster of
        | 5, the replacement quorum member can't vote for writes
        | until it has caught up. It can vote for leaders, but it
        | can't nominate itself. Catching up leaves a window for
        | additional faults.
        | 
        | It's always 3/5 for writes and elections; the difference
        | is that the _ratio_ of original machines that have to
        | confirm a write can go to 100% of survivors, instead of
        | the 3/4 of reachable machines. Meaning network jitter and
        | packet loss slow down writes until it recovers, and an
        | additional partition can block writes altogether, even
        | with 3/5 surviving the partition.
 
      | teraflop wrote:
      | > It's 1/2 + 1 isn't it?
      | 
      | > That would mean in 3 servers you need 2.5 aka 3 machines
      | to commit a change. Then 4/5, 5/7, 6/9, 7/11.
      | 
      | No, the requirement isn't 1/2 + 1. Any _strict_ majority
      | of the cluster is enough to elect a leader. So you need
      | 2/3, or 3/4, or 3/5, and so on.
      | 
      | > Not to mention the time during which 1/n machines are
      | down due to an upgrade gets longer and longer the more
      | machines you have, increasing the chances of double fault.
      | 
      | Generally, this is not the case. If individual machine
      | failures are random and equally probable, and if each
      | machine is down on average less than 50% of the time, then
      | adding more machines makes things better, not worse. (This
      | is a basic property of the binomial distribution.)
      | 
      | Of course, if you have a single point of failure somewhere
      | -- e.g. a network switch -- this assumption can be
      | violated, but that's true regardless of how many machines
      | you have.
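      | 
      | A quick sketch of the binomial point (assuming independent
      | failures with each node down 1% of the time; the numbers
      | are only illustrative):
      | 
      |     package main
      | 
      |     import (
      |         "fmt"
      |         "math"
      |     )
      | 
      |     // binomial returns n choose k.
      |     func binomial(n, k int) float64 {
      |         r := 1.0
      |         for i := 1; i <= k; i++ {
      |             r = r * float64(n-k+i) / float64(i)
      |         }
      |         return r
      |     }
      | 
      |     // pNoQuorum is the probability that a strict majority
      |     // of the n nodes is down at once, i.e. no quorum.
      |     func pNoQuorum(n int, p float64) float64 {
      |         total := 0.0
      |         for k := n/2 + 1; k <= n; k++ {
      |             total += binomial(n, k) *
      |                 math.Pow(p, float64(k)) *
      |                 math.Pow(1-p, float64(n-k))
      |         }
      |         return total
      |     }
      | 
      |     func main() {
      |         // Prints roughly 3e-04, 1e-05, 3e-07: more nodes,
      |         // less chance of losing quorum.
      |         for _, n := range []int{3, 5, 7} {
      |             fmt.Printf("n=%d: %.1e\n", n, pNoQuorum(n, 0.01))
      |         }
      |     }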
 
        | hinkley wrote:
        | If the leader is down (the scenario you clipped out in
        | your response) you need a strict majority with an even
        | number of machines.
 
        | hinkley wrote:
        | This is right for the wrong reason. See follow-up down-
        | thread.
 
  | jasonjayr wrote:
  | IIRC, a monotonic counter is involved. The odd one out will
  | realize it's behind the highest sequence number and discard
  | its updates to resync with the majority consensus.
  | 
  | Edit: http://thesecretlivesofdata.com/raft/ if you have some
  | time seems to be a good step by step explanation on how it
  | works in detail.
 
    | adsharma wrote:
    | The visual is indeed cool. I also thought it'd be nice to use
    | a chat-like interface to learn Raft:
    | 
    |     Alice: /set a 1
    |     Alice: /set b 2
    |     Bob:   /status
    | 
    | Etc.
    | 
    | https://github.com/adsharma/zre_raft
    | https://github.com/adsharma/raft
    | 
    | Bug reports welcome
 
| ericlewis wrote:
| Expensify had a version of something like this back in like
| 2013/14 I think.
 
  | moderation wrote:
  | The project is BedrockDB [0] and has been previously discussed
  | [1].
  | 
  | 0. https://bedrockdb.com/
  | 
  | 1. https://news.ycombinator.com/item?id=12739771
 
    | ericlewis wrote:
    | Nice! Though the blockchain part is new to me. Interesting
    | they kept growing this.
 
| Conlectus wrote:
| One thing that jumps out at me after reading a lot of Jepsen
| analyses - does Rqlite assume that partitions form equivalence
| relations? That is, that all nodes belong to one and only one
| partition group? This is not always the case in practice.
 
  | yjftsjthsd-h wrote:
  | So the case of A can talk to B, B can talk to C, but A can't
  | talk to C? (Making sure that I understand how you can be in
  | multiple partitions)
 
| fnord123 wrote:
| FoundationDB and Comdb2 also use sqlite as a storage engine.
| Curious that they decided to implement yet another one.
| 
| https://www.foundationdb.org/
| 
| http://comdb2.org/
 
  | tyingq wrote:
  | Rqlite appears to predate comdb2.
 
    | rapsey wrote:
    | Literally the first sentence.
    | 
    | > Comdb2 is a relational database built in-house at Bloomberg
    | L.P. over the last 14 years or so.
    | 
    | rqlite is not 14 years old.
 
      | tyingq wrote:
      | I was looking at the github repo history. Was it publicly
      | visible sooner than that would imply?
 
        | tyingq wrote:
        | Answering my own question, Comdb2 was made available to
        | the public on 1 January 2016, well after rqlite launched.
 
| peter_d_sherman wrote:
| First of all, great idea, and a brilliant and highly laudable
| effort!
| 
| Favorited!
| 
| One minor caveat ("Here be Dragons") I have (with respect to my
| own future adoption/production use), however:
| 
| https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md
| 
| > _" Does rqlite support transactions?
| 
| It supports a form of transactions. You can wrap a bulk update in
| a transaction such that all the statements in the bulk request
| will succeed, or none of them will. However the behaviour or
| rqlite is undefined if you send explicit BEGIN, COMMIT, or
| ROLLBACK statements. This is not because they won't work -- they
| will -- but if your node (or cluster) fails while a transaction
| is in progress, the system may be left in a hard-to-use state. So
| until rqlite can offer strict guarantees about its behaviour if
| it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK
| is officially unsupported. Unfortunately this does mean that
| rqlite may not be suitable for some applications."_
| 
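| For context, the "form of transactions" above is the bulk-request
| kind: one HTTP request whose statements are applied atomically. A
| rough Go sketch ("bytes" and "net/http" imports assumed; the
| accounts table is hypothetical, and the /db/execute endpoint and
| "transaction" parameter are taken from the rqlite docs, so check
| the current documentation before relying on the exact names):
| 
|     // postBulk sends both statements in a single request; with
|     // the transaction parameter set, either both take effect or
|     // neither does.
|     func postBulk() error {
|         body := []byte(`[
|           "INSERT INTO accounts(name, balance) VALUES('a', 100)",
|           "INSERT INTO accounts(name, balance) VALUES('b', 200)"
|         ]`)
|         resp, err := http.Post(
|             "http://localhost:4001/db/execute?transaction",
|             "application/json", bytes.NewReader(body))
|         if err != nil {
|             return err
|         }
|         return resp.Body.Close()
|     }
| 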
| PDS: Distributed transactions are extremely difficult to get
| exactly right -- so I'm not trying to criticize all of the hard
| work and effort that everyone has put into this (again, it's a
| great idea, and I think it has a terrific future).
| 
| But Distributed Transactions are what differentiate something
| like rqlite from, say, something like CockroachDB (https://www.co
| ckroachlabs.com/docs/stable/architecture/life-...).
| 
| Of course, CockroachDB is a pay-for product with an actual
| company with many years of experience backing it, whereas rqlite,
| as far as I can intuit, at this point in time (someone correct me
| if I am wrong), appears to be a volunteer effort...
| 
| Still, I think that rqlite despite this -- has a glorious and
| wonderful future!
| 
| Again, a brilliant and laudable effort, suitable for many use
| cases presently, and I can't wait to see what the future holds
| for this Open Source project!
| 
| Maybe in the future some code-ninja will step up to the plate and
| add fully guaranteed, safe, distributed transactions!
| 
| Until then, it looks like a great idea coupled with a great
| software engineering effort!
| 
| As I said, "Favorited!".
 
| jchrisa wrote:
| I'm curious how this relates to the Calvin protocol as
| implemented by FaunaDB. They both use Raft, but FaunaDB and
| Calvin have additional details about how transactions are retried
| and aborted. https://fauna.com/blog/consistency-without-clocks-
| faunadb-tr...
 
| ClumsyPilot wrote:
| I think microk8s uses this to form a cluster, and k3s used to
| use it but moved back to etcd.
| 
| Would be good to hear from someone who used it what are the pros
| and cons of such a setup
 
  | fasteo wrote:
  | AFAIK, microk8s uses a similar - but not this - form of
  | distributed sqlite. Specifically, it uses dqlite[1] "a C
  | library that implements an embeddable and replicated SQL
  | database engine with high-availability and automatic failover."
  | 
  | [1] https://github.com/canonical/dqlite
 
    | tyingq wrote:
    | Probably worth mentioning that Canonical initially made
    | dqlite to be the backing store for LXD. It uses the sqlite
    | VFS as the client entry point, so it's a very easy transition
    | for an existing sqlite app: just recompile with the new
    | header.
 
___________________________________________________________________
(page generated 2021-01-22 23:00 UTC)