|
| mech422 wrote:
| So it looks like you can now distribute SQLite at the:
|
| Statement level: https://github.com/rqlite/rqlite
|
| VFS level: https://github.com/canonical/dqlite
|
| Block level: https://github.com/benbjohnson/litestream
|
| Really cool enhancements to an awesome project!
| f430 wrote:
| Could you use this to build a decentralized p2p app? If so, what
| gotchas and limitations are there?
| otoolep wrote:
| No, rqlite is not suitable for that kind of application. All
| writes must go through the leader.
| hermitcrab wrote:
| So if:
|
| Alice, Bob and Charlie have a synced copy of the same database
|
| Charlie goes on a plane and adds loads of records without a
| connection to the other databases
|
| Alice and Bob make no changes
|
| Charlie comes home and syncs
|
| Will Charlie lose all his changes, as his database is different
| to Alice and Bob's?
|
| What happens if Alice, Bob and Charlie all make changes offline
| then resync?
| mrkurt wrote:
| It doesn't work that way. rqlite is effectively a
| leader/follower model that uses raft for leader consensus.
| Writes can only happen online, and only to the leader.
| hermitcrab wrote:
| Ok, thanks!
| teraflop wrote:
| As per the description, all updates must be replicated to a
| quorum of instances. If Charlie is on a plane without the
| ability to contact a quorum, he can't add records in the first
| place.
|
| This is a fundamentally necessary tradeoff to provide strong
| consistency, as described by the CAP theorem.
| unnouinceput wrote:
| In that case, would Charlie have an additional local DB to
| record those records, and when he gets back use another
| protocol/method/system/whatever to add those new records? How
| about if everybody goes and adds records to the same table?
|
| Here is a real-life scenario that I had to deal with in the
| past. Technicians (carpenters) go to a client's home to repair
| furniture in the middle of nowhere, so no internet. They add
| the necessary paperwork (pictures, declarations, a signed and
| scanned contract) to the Clients table. This company employed
| hundreds of such technicians throughout the many counties of
| Germany, each with a laptop running this app, which was the
| backbone for getting paid and doing the work. And it was not
| uncommon for more than one carpenter to go to a client's home
| and do the repairs. Since each carpenter was paid according to
| his own work, each of them would create entries in their local
| Clients table, and when they got back to HQ their work was
| manually uploaded to the central DB; only after that did they
| get paid. I automated that (that was the job: to eliminate the
| thousands of hours the carpenters were wasting doing it
| manually).
|
| So given the above scenario, how is this system going to
| achieve that? Same table, and same client details even in
| table Clients, just different rows for different carpenters
| (foreign key to table Carpenters).
| wtallis wrote:
| > So given the above scenario, how is this system going to
| achieve that?
|
| I don't think it is. You're describing a use case that is
| distributed but explicitly does not want to enforce
| consistency--you want offline workers to all be able to
| keep working, and you're enforcing consistency after the
| fact and outside of the database itself.
| renewiltord wrote:
| This tool does not handle that problem. It is not meant to.
| It's for simultaneously available replicas. And this is the
| rare moment where thinking about replication vs
| synchronization as different is worthwhile.
|
| You usually replicate for failure tolerance and performance
| (this project only aims for the former).
| vorpalhex wrote:
| As other commenters have mentioned, this tool is not
| intended for that kind of use case. You want a tool like
| PouchDB, which handles this kind of setup but has a
| different set of tradeoffs (it's eventually consistent,
| not strongly consistent).
| adsharma wrote:
| Charlie doesn't have to lose the data he saved on the plane.
| Don't know what the rqlite implementation does.
|
| In the second case, Alice-Bob consensus overrides Charlie
| [deleted]
| fipar wrote:
| I have not read the full description of this project yet, but
| it does mention the use of Raft for consensus, so in your
| example, I would expect Charlie to not be able to add any
| records while disconnected, because, if my understanding is
| correct:
|
| - Charlie would either be the leader, but then without getting
| confirmation of writes from enough followers, he would not be
| able to do any writes himself, or
|
| - Charlie would be a follower, and while disconnected would
| obviously get no writes from the leader.
| NDizzle wrote:
| What's messed up is that I was doing this kind of thing with
| Lotus Domino in the late 90s. I'm sure others were doing it
| before me, too.
|
| Sometimes you had conflicts that needed resolution, but those
| weren't that frequent for our use case.
| mshenfield wrote:
| Going from a local db to one over a network has at least one
| risk. The SQLite docs give developers the okay to write "n+1"
| style queries (https://www.sqlite.org/np1queryprob.html). When
| the db is on the same file system as the application this pattern
| is fine. But as soon as you add a network call it becomes a
| potential bottleneck.
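|
| A minimal sketch of the difference, assuming Go's database/sql
| with the mattn/go-sqlite3 driver and hypothetical artists/tracks
| tables; against a local file the loop below is cheap, but over a
| network every iteration is a full round trip:
|
|     package main
|
|     import (
|         "database/sql"
|         "fmt"
|         "log"
|
|         _ "github.com/mattn/go-sqlite3" // assumed driver
|     )
|
|     func main() {
|         db, err := sql.Open("sqlite3", "app.db")
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer db.Close()
|
|         // N+1 style: one query for the parents, then one per row.
|         rows, err := db.Query(`SELECT id FROM artists`)
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer rows.Close()
|         for rows.Next() {
|             var id, n int
|             rows.Scan(&id)
|             db.QueryRow(`SELECT COUNT(*) FROM tracks
|                          WHERE artist_id = ?`, id).Scan(&n)
|             fmt.Println(id, n)
|         }
|
|         // Network-friendly alternative: one statement, one trip.
|         joined, err := db.Query(`SELECT a.id, COUNT(t.id)
|             FROM artists a LEFT JOIN tracks t ON t.artist_id = a.id
|             GROUP BY a.id`)
|         if err != nil {
|             log.Fatal(err)
|         }
|         joined.Close()
|     }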
| alberth wrote:
| @otoolep
|
| SQLite has a great post on "When to Use" (and not use) SQLite.
|
| Would be great if you included these same use cases in the README
| docs and made it clear whether rqlite can address them.
|
| https://www.sqlite.org/whentouse.html
| scottlamb wrote:
| It looks like this layers Raft on top of SQLite. I don't like
| when systems replicate high-level changes like "update users set
| salary = salary + 1000 where ...;" Instead, I prefer they
| replicate low-level changes like "replace key/block X, which
| should have contents C_x, with C_x'".
|
| Why? Imagine you're doing a rolling update. Some of your
| replicas are running the newer version of SQLite and some are
| running the older version. They may not execute the high-level
| query in exactly the same way. For example, in the absence of an
| "order by" clause, the order of select results is unspecified.
| So imagine someone
| makes a mutation that depends on this: "insert ... select ...
| limit". (Maybe a dumb example but it can happen anyway.) Now the
| databases start to diverge, not only in underlying bytes and
| implementation-defined ordering but in actual row data.
|
| I worked on a major distributed system that originally replicated
| high-level changes and switched to replicating low-level changes
| for this reason. We had a system for detecting when replicas
| didn't match, and replication of high-level changes was the
| biggest reason for diffs. (Hardware error was the second biggest
| reason; we added a lot of checksumming because of that.)
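|
| The "insert ... select ... limit" example above is easy to make
| concrete. A minimal sketch, with hypothetical users and
| audit_sample tables and the mattn/go-sqlite3 driver assumed:
|
|     package main
|
|     import (
|         "database/sql"
|         "log"
|
|         _ "github.com/mattn/go-sqlite3" // assumed driver
|     )
|
|     func main() {
|         db, err := sql.Open("sqlite3", "replica.db")
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer db.Close()
|
|         // Replayed as text on every replica. With no ORDER BY,
|         // each replica's SQLite version is free to pick a
|         // different ten rows, so replicas can silently diverge.
|         _, err = db.Exec(`INSERT INTO audit_sample (user_id)
|                           SELECT id FROM users LIMIT 10`)
|         if err != nil {
|             log.Fatal(err)
|         }
|
|         // A page- or row-level scheme would instead ship the
|         // concrete rows (or bytes) the leader actually wrote, so
|         // followers cannot choose differently.
|     }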
| lrossi wrote:
| If you replicate low level changes, you might not be able to do
| a live upgrade/downgrade if the version change affects the on
| disk format.
|
| Another downside is that you might also propagate data
| corruption in case of bugs in the DB software (e.g. memory
| corruption) or hardware defects.
| scottlamb wrote:
| > If you replicate low level changes, you might not be able
| to do a live upgrade/downgrade if the version change affects
| the on disk format.
|
| It certainly requires care to ensure all the replicas have
| software capable of understanding the new format before it's
| actually written, but it can be done. Likewise after writing
| the new format, you want to have a roll-back plan.
|
| In SQLite's case, https://sqlite.org/formatchng.html says:
| "Since 2004, there have been enhancements to SQLite such that
| newer database files are unreadable by older versions of the
| SQLite library. But the most recent versions of the SQLite
| library should be able to read and write any older SQLite
| database file without any problems." I don't believe
| upgrading SQLite automatically starts using any of those
| enhancements; you'd have to do a schema change like "PRAGMA
| journal_mode=WAL;" first.
|
| > Another downside is that you might also propagate data
| corruption in case of bugs in the DB software (e.g. memory
| corruption) or hardware defects.
|
| This happens regardless.
| otoolep wrote:
| rqlite creator here.
|
| I understand what you're saying, but I don't think it's a
| compelling objection. Obviously, differences between versions
| -- even patched versions -- can result in subtle, unintended
| differences in how the code works for a given program. But
| there is no reason to think a system that operates at a lower
| level ("replace key/block X, which should have contents C_x,
| with C_x'") is less vulnerable to this kind of issue than
| one that operates at a higher level, i.e. statement-based
| replication, which rqlite uses. In fact I would argue that a
| system that operates at a higher level of abstraction is _less_
| vulnerable, i.e. less likely to be affected by such subtle
| changes.
| [deleted]
| xd wrote:
| In MySQL/MariaDB this is what's known as non-deterministic
| behaviour, so row-based or mixed replication is used to mitigate
| it.
|
| Statement-based (high-level) replication is very useful for e.g.
| "insert into tbl0 select col0 from tbl1 order by col1", as
| you would only need to send the query, not the individual row
| data.
| tyingq wrote:
| Dqlite replicates at the VFS layer of sqlite, which sounds like
| what you're looking for. https://github.com/canonical/dqlite
| hinkley wrote:
| I haven't gotten a straight answer out of the k3s people
| about why they dumped dqlite, just some comment about bugs.
|
| I could see myself using dqlite in the future so I'd like
| some more user reports from the trenches. Can anyone shed
| some light on this?
| tyingq wrote:
| The initial issue seems to be saying that it's because they
| need to have etcd anyway, so consolidating on that removes
| a dependency. Which fits their simplicity goal. Though the
| issue appears to have been created by a user, not a
| maintainer.
|
| _" Since the HA direction needs etcd anyway.. I'm
| proposing dropping support for sqlite as the default
| embedded non-HA option and switch to embedded etcd as the
| default. This will reduce overall effort of maintainability
| of two entirely different datastores."_
|
| https://github.com/k3s-io/k3s/issues/845
| hinkley wrote:
| I accepted that reason for them, but as I don't benefit
| directly from switching to etcd, I'd rather know about
| what started the conversation than how it was concluded.
| merb wrote:
| dqlite support was flaky and went through their
| translation layer, which probably added complexity.
| ttul wrote:
| Sounds like you need to submit a patch!
| scottlamb wrote:
| lol, this is an unhelpful reflex answer to any criticism of
| open source software. What I'm describing is a redesign.
| There's no point in submitting a patch for that. It'd be much
| more effort than a few lines of code, and it wouldn't be
| accepted. Open source means that anyone can fork. It doesn't
| mean that maintainers will automatically merge patches
| replacing their software with completely different software.
| The only way that will happen is if the maintainers decide
| for themselves it needs to be redesigned, and that starts
| with discussion rather than a patch. It's also a long shot
| compared to just finding different software that already has
| the design I prefer.
|
| If I want replicated SQLite, I'll look at dqlite or
| litestream instead, which sound more compatible with my
| design sensibilities. (Thanks, tyingq and benbjohnson!)
| monadic3 wrote:
| Frankly your bad-faith commentary isn't helping the
| conversation either. I sincerely appreciate your cleaning
| up the tone at the end.
| benbjohnson wrote:
| I just open-sourced a streaming replication tool for SQLite
| called Litestream that does physical replication (raw pages)
| instead of logical replication (SQL commands). Each approach
| has its pros and cons. Physical replication logs tend to be
| larger than logical logs but I agree that you avoid a lot of
| issues if you do physical replication.
|
| https://github.com/benbjohnson/litestream
| hinkley wrote:
| Do you use compression? And if so, how does that affect the
| relative amount of network traffic vs logical replication?
| benbjohnson wrote:
| Yes, Litestream uses LZ4 compression. I originally used
| gzip but the compression speed was pretty slow. B-tree
| pages tend to compress well because they tend to be 50-75%
| full: they need space to insert new records, and pages
| split when they get full.
|
| I'm seeing files shrink down to 14% of their size (1.7MB
| WAL compressed to 264KB). However, your exact compression
| will vary depending on your data.
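|
| For a rough idea of what the compression step can look like in
| Go, a minimal sketch using the third-party
| github.com/pierrec/lz4/v4 package (an assumption, not
| necessarily the library Litestream uses) to compress a
| hypothetical WAL segment file:
|
|     package main
|
|     import (
|         "io"
|         "log"
|         "os"
|
|         "github.com/pierrec/lz4/v4"
|     )
|
|     func main() {
|         in, err := os.Open("segment.wal") // hypothetical input
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer in.Close()
|
|         out, err := os.Create("segment.wal.lz4")
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer out.Close()
|
|         // Stream the segment through an LZ4 frame writer.
|         zw := lz4.NewWriter(out)
|         if _, err := io.Copy(zw, in); err != nil {
|             log.Fatal(err)
|         }
|         if err := zw.Close(); err != nil { // flush the last frame
|             log.Fatal(err)
|         }
|     }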
| hinkley wrote:
| Ah, that makes sense. Most inserts don't split pages, so
| are around n worst case pages, but once in a while you
| get 2n updates where most of them are half full, and so
| compress better.
|
| So how does that compare to logical replication? (Also I
| imagine packet size plays a role, since you have to flush
| the stream quite frequently, right? 1000 bytes isn't much
| more expensive than 431)
| benbjohnson wrote:
| Litestream defaults to flushing out to S3 every 10
| seconds but that's mainly because of PUT costs. Each
| request costs about $0.000005, so it comes to roughly
| $1.30 per month. If you flushed every second then it'd
| cost you about $13/month.
|
| Logical replication would have significantly smaller
| sizes, although the size cost isn't a huge deal on S3.
| Data transfer in to S3 is free and so are DELETE
| requests. The data only stays on S3 for as long as your
| Litestream retention specifies. So if you're retaining
| for a day then you're just keeping one day's worth of WAL
| changes on S3 at any given time.
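|
| The arithmetic behind those figures, assuming S3's published PUT
| price of roughly $0.005 per 1,000 requests ($0.000005 each):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const costPerPut = 0.000005 // USD, assumed S3 PUT price
|         for _, interval := range []float64{10, 1} { // seconds
|             puts := 30 * 24 * 3600 / interval // PUTs per month
|             fmt.Printf("flush every %2.0fs: %8.0f PUTs ~ $%.2f/mo\n",
|                 interval, puts, puts*costPerPut)
|         }
|         // flush every 10s:   259200 PUTs ~ $1.30/mo
|         // flush every  1s:  2592000 PUTs ~ $12.96/mo
|     }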
| mrkurt wrote:
| I've been following this, and am anxious for the direct-to-
| sqlite replication.
|
| One of rqlite's big limitations is that it resyncs the entire
| DB at startup time. Being able to start with a "snapshot" and
| then incrementally replicate changes would be a big help.
| CuriouslyC wrote:
| This is tangential, but depending on your sql needs,
| CouchDB's replication story is amazing, and you can
| replicate to the browser using PouchDB. There is an
| optional SQL layer, but obviously the good replication
| story comes with some trade-offs.
| otoolep wrote:
| rqlite creator here.
|
| I'm not sure I follow why it's a "big limitation"? Is it
| causing you long start-up times? I'm definitely interested
| in improving this, if it's an issue. What are you actually
| seeing?
|
| Also, rqlite does do log truncation (as per the Raft spec), so
| after a certain number of log entries (8192 by default)
| node restarts work _exactly_ like you suggested. The SQLite
| database is restored from a snapshot, and any remaining
| Raft log entries are applied to the database.
| mrkurt wrote:
| Ah, ok that's some nuance I didn't know about!
|
| We're storing a few GB of data in the sqlite DB.
| Rebuilding that when rqlite restarts is a slow and
| intensive process compared to just reusing the file on
| disk.
|
| Our particular use case means we'll end up restarting
| 100+ replica nodes all at once, so the way we're doing
| things makes it more painful than necessary.
| otoolep wrote:
| But how do you know it's intensive? Are you watching disk
| IO? Is there a noticeable delay when the node starts
| before it's ready to receive requests?
|
| Try setting "-raft-snap" to a lower number, maybe 1024,
| and see if it helps. You'll have far fewer log entries
| to apply on startup. However the node will perform a
| snapshot more often, and writes are blocked during
| snapshotting. It's a trade-off.
|
| It might be possible to always restart using some sort of
| snapshot, independent of Raft, but that would add
| significant complexity to rqlite. The fact that the SQLite
| database is built from scratch on startup, from the data
| in the Raft log, means rqlite is much more robust.
| mrkurt wrote:
| Oh, we're reading the sqlite files directly. rqlite is
| really just a mechanism for us to propagate read only
| data to a bunch of clients.
|
| We need that sqlite file to never go away. Even a few
| seconds is bad. And since our replicas are spread all
| over the world, it's not feasible to move 1GB+ data from
| the "servers" fast enough.
|
| Is there a way for us to use that sqlite file without it
| ever going away? We've thought about hardlinking it
| elsewhere and replacing the hardlink when rqlite is up,
| but haven't built any tooling to do that.
| otoolep wrote:
| Hmmmm, that's a different issue. :-)
|
| Today the rqlite code deletes the SQLite database (if
| present) and then rebuilds it from the Raft log. It makes
| things so simple, and ensures the node can always
| recover, regardless of the prior state of the SQLite
| database -- basically the Raft log is the only thing that
| matters, and that is _guaranteed_ to be the same on
| each node.
|
| The fundamental issue here is that Raft can only
| guarantee that the Raft log is in consensus, so rqlite
| can rely on that. It's always possible that one of the
| copies of SQLite under a single node gets into a
| different state than all the other nodes. This is because
| the change to the Raft log, and the corresponding change
| to SQLite, are not atomic. Blowing away the SQLite
| database means a restart will fix this.
|
| If this is important -- and what you ask sounds
| reasonable for the read-only case that rqlite can support
| -- I guess the code could rebuild the SQLite database in
| a temporary place, wait until that's done, and then
| quickly swap any existing SQLite file with the rebuilt
| copy. That would minimize the time the file is not
| present. But the file has to go away at some point.
|
| Alternatively rqlite could open any existing SQLite file
| and DROP all data first. At least that way the _file_
| wouldn't disappear, but the data in the database would
| wink out of existence and then come back. WDYT?
| mrkurt wrote:
| Rebuilding and then moving it in place sounds pretty nice
| to me.
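|
| A minimal sketch of that rebuild-then-swap idea (not rqlite's
| actual code): rebuild into a temporary file in the same
| directory, then atomically rename it over the live path, so
| external readers never observe a missing file. Readers that
| already hold an open file descriptor keep seeing the old bytes
| until they reopen.
|
|     package main
|
|     import (
|         "log"
|         "os"
|     )
|
|     // rebuildInto stands in for replaying the Raft log into a
|     // fresh SQLite file at path (hypothetical helper).
|     func rebuildInto(path string) error {
|         f, err := os.Create(path)
|         if err != nil {
|             return err
|         }
|         return f.Close()
|     }
|
|     func main() {
|         const live = "data/db.sqlite" // hypothetical path
|         tmp := live + ".rebuild"
|
|         if err := rebuildInto(tmp); err != nil {
|             log.Fatal(err)
|         }
|         // rename(2) is atomic within a filesystem: readers see
|         // either the old file or the new one, never neither.
|         if err := os.Rename(tmp, live); err != nil {
|             log.Fatal(err)
|         }
|     }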
| jlongster wrote:
| I built an app (https://actualbudget.com/) that uses a
| local sqlite db and syncs changes, and that's exactly how
| it works. It takes quite a different approach though, using
| CRDTs to represent changes and those are synced around.
| When a fresh client comes into play, it downloads the
| latest sqlite snapshot from a server and then syncs up.
| benbjohnson wrote:
| That can be painful for sure. Litestream will do a snapshot
| on startup if it detects that it can't pick up from where
| it left off in the WAL. That can happen if Litestream is
| shut down and another process performs a checkpoint. But
| generally a restart will just use the existing snapshot &
| continue with the WAL replication.
| jgraettinger1 wrote:
| Here's another approach to the problem [0]:
|
| This package is part of Gazette [1], and uses a gazette
| journal (known as a "recovery log") to power raw bytestream
| replication & persistence.
|
| On top of journals, there's a recovery log "hinting"
| mechanism [2] that is aware of file layouts on disk, and
| keeps metadata around the portions of the journal which must
| be read to recover a particular on-disk state (e.g. what are
| the current live files, and which segments of the log hold
| them?). You can read and even live-tail a recovery log to
| "play back" / maintain the on-disk file state of a database
| that's processing somewhere else.
|
| Then, there's a package providing RocksDB with a Rocks
| environment that's configured to transparently replicate all
| database file writes into a recovery log [3]. Because RocksDB
| is a continuously compacted LSM-tree and we're tracking
| live files, it's regularly deleting files, which allows for
| "dropping" chunks of the recovery log journal that would
| otherwise have to be read or stored in order to recover the
| full database.
|
| For the SQLite implementation, SQLite journals and WALs are
| well-suited to recovery logs & their live-file tracking,
| because they're short-lived ephemeral files. The SQLite page
| DB is another matter, however, because it's a super-long-lived
| and randomly written file. Naively tracking the page DB
| means you must re-play the _entire history_ of page mutations
| which have ever occurred.
|
| This implementation solves this by using a SQLite VFS which
| actually uses RocksDB under the hood for the SQLite page DB,
| and regular files (recorded to the same recovery log) for
| SQLite journals / WALs. In effect, we're leveraging RocksDB's
| regular compaction mechanisms to remove old versions of
| SQLite pages which must be tracked / read & replayed.
|
| [0] https://godoc.org/go.gazette.dev/core/consumer/store-sqlite
|
| [1] https://gazette.readthedocs.io/en/latest/
|
| [2] https://gazette.readthedocs.io/en/latest/consumers-concepts....
|
| [3] https://godoc.org/go.gazette.dev/core/consumer/store-rocksdb
| webmaven wrote:
| _> I just open-sourced a streaming replication tool for
| SQLite called Litestream that does physical replication (raw
| pages) instead of logical replication (SQL commands). Each
| approach has its pros and cons. Physical replication logs
| tend to be larger than logical logs but I agree that you
| avoid a lot of issues if you do physical replication._
|
| Hmm. Not having dug into your solution much, is it safe to
| say that the physical replication logs have something like
| logical checkpoints? If so, would it make sense to only keep
| physical logs on a relatively short rolling window, and
| logical logs (i.e. only the interleaved logical checkpoints)
| longer?
| benbjohnson wrote:
| I suppose you could save both the physical and logical logs
| if you really needed long-term retention. SQLite databases
| (and b-trees in general) tend to compress well so the
| physical logging isn't as bad as it sounds. You could also
| store a binary diff of the physical page, which would shrink
| it even further.
|
| One benefit to using physical logs is that you end up with
| a byte-for-byte copy of the original data so it makes it
| easy to validate that your recovery is correct. You'd need
| to iterate all the records in your database to validate a
| logical log.
|
| However, all that being said, Litestream runs as a separate
| daemon process so it actually doesn't have access to the
| SQL commands from the application.
| szszrk wrote:
| rqlite is mentioned here quite often, multiple times last year. I
| don't think this entry brings anything new.
| foolinaround wrote:
| We use browsers on several devices (both laptops and android)
| and currently rely on google sync. Maybe this could be used
| to sync bookmarks, history, etc. across my devices but still
| keep my data local to me?
| JoachimSchipper wrote:
| This uses Raft, so a quorum of devices would need to be online
| at the same time. That's not what you want for browser sync.
| blackbear_ wrote:
| I know nothing of consensus algorithms and distributed systems so
| bear with me please.
|
| > rqlite uses Raft to achieve consensus across all the instances
| of the SQLite databases, ensuring that every change made to the
| system is made to a quorum of SQLite databases, or none at all.
|
| What I understood from this sentence is that, if we have three
| instances, rqlite will make sure that every change is written to
| at least two. But what if two changes are written to two
| different pairs of instances? Then the three instances will have
| three different versions of the data. For example, change X is
| written to instances A and B, and change Y is written to B and C.
| Now A has X, B has X and Y, and C has Y only. How do you decide
| who is right?
| edoceo wrote:
| Raft consensus: https://raft.github.io/
|
| Surprisingly easy to understand, and a cool visual.
| teraflop wrote:
| In brief: at any point in time, one of the replicas is a
| "leader" which controls the order in which operations are
| committed. The changes occur in a defined sequence, and other
| replicas may lag behind the leader, but cannot be inconsistent
| with it.
|
| Your example can't happen, because if (for instance) A is the
| leader, then C will not apply change Y without contacting the
| leader, which will tell it to apply X first.
|
| If you want more details about how this handles all the edge
| cases -- for instance, what happens if the leader crashes --
| the Raft paper is quite accessible:
| https://raft.github.io/raft.pdf
| hinkley wrote:
| TL;DR: Raft updates are serialized (as in sequential).
| whizzter wrote:
| The semantics of Raft have a "simple" (compared to the
| harder-to-understand Paxos) forward motion of events that is
| supposed to guarantee that you won't get into weird states
| regardless of whether any particular node(s) go down (I think
| it can survive (N/2)-1 dead machines in a cluster of N).
|
| Raft is based on having a leader decide what the next COMMIT is
| going to be, so B could never have X and Y at the same time
| (they could both be queued but other mechanisms could reject
| them).
|
| Also data is not considered committed until more than half the
| cluster has acknowledged it (at which point the leader will
| know it and handle going forward), leader election also works
| in a similar way iirc.
|
| As others mentioned, the visualization on
| https://raft.github.io/ is really good (You can affect it to
| create commits and control downtime of machines)
| hinkley wrote:
| It's 1/2 + 1 isn't it? So if the leader goes down at the
| exact moment of quorum, you can still get quorum again.
|
| That would mean in 3 servers you need 2.5 aka 3 machines to
| commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a
| wash anyway, because as the servers go up the fraction you
| need for quorum goes down, but the odds of falling behind or
| failing outright go up too. Not to mention the time during
| which 1/n machines are down due to an upgrade gets longer and
| longer the more machines you have, increasing the chances of
| double fault.
| simtel20 wrote:
| > It's 1/2 + 1 isn't it?
|
| The parent post is talking about the number that can go
| down while maintaining quorum, and you're talking about the
| number that need to remain up to maintain quorum. So you're
| both correct.
|
| However:
|
| > That would mean in 3 servers you need 2.5 aka 3 machines
| to commit a change.
|
| That seems wrong. You need N//2 +1 where "//" is floor
| division, so in a 3 node cluster, you need 3//2 +1, or 1+1
| or 2 nodes to commit a change.
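|
| A quick check of those quorum sizes using the floor-division
| rule:
|
|     package main
|
|     import "fmt"
|
|     // quorum is the minimum number of votes needed in an n-node
|     // cluster: floor(n/2) + 1.
|     func quorum(n int) int { return n/2 + 1 }
|
|     func main() {
|         for _, n := range []int{3, 4, 5, 7, 9, 11} {
|             fmt.Printf("cluster of %2d: quorum %d, survives %d down\n",
|                 n, quorum(n), n-quorum(n))
|         }
|         // cluster of  3: quorum 2, survives 1 down
|         // cluster of  5: quorum 3, survives 2 down
|         // cluster of 11: quorum 6, survives 5 down
|     }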
| hinkley wrote:
| I think I see the problem.
|
| 'Simple majority' is based on the number of the machines
| that the leader knows about. You can only change the
| membership by issuing a write. Write quorum and
| leadership quorum are two different things, and if I've
| got it right, they can diverge after a partition.
|
| I'm also thinking of double faults, because the point of
| Raft is to get past single fault tolerance.
|
| [edit: shortened]
|
| After a permanent fault (broken hardware) in a cluster of
| 5, the replacement quorum member can't vote for writes
| until it has caught up. It can vote for leaders, but it
| can't nominate itself. Catching up leaves a window for
| additional faults.
|
| It's always 3/5 for writes and elections; the difference
| is that the _ratio_ of original machines that have to
| confirm a write can go to 100% of survivors, instead of
| the 3/4 of reachable machines. Meaning network jitter
| and packet loss slow down writes until it recovers, and
| an additional partition can block writes altogether, even
| with 3/5 surviving the partition.
| teraflop wrote:
| > It's 1/2 + 1 isn't it?
|
| > That would mean in 3 servers you need 2.5 aka 3 machines
| to commit a change. Then 4/5, 5/7, 6/9, 7/11.
|
| No, the requirement isn't 1/2 + 1. Any _strict_ majority of
| the cluster is enough to elect a leader. So you need 2/3,
| or 3/4, or 3/5, and so on.
|
| > Not to mention the time during which 1/n machines are
| down due to an upgrade gets longer and longer the more
| machines you have, increasing the chances of double fault.
|
| Generally, this is not the case. If individual machine
| failures are random and equally probable, and if each
| machine is down on average less than 50% of the time, then
| adding more machines makes things better, not worse. (This
| is a basic property of the binomial distribution.)
|
| Of course, if you have a single point of failure somewhere
| -- e.g. a network switch -- this assumption can be
| violated, but that's true regardless of how many machines
| you have.
| hinkley wrote:
| If the leader is down (the scenario you clipped out in
| your response) you need a strict majority with an even
| number of machines.
| hinkley wrote:
| This is right for the wrong reason. See follow-up down-
| thread.
| jasonjayr wrote:
| IIRC, a monotonic counter is involved. The odd one out will
| realize it's behind the highest sequence number and discard
| its updates to resync with the majority consensus.
|
| Edit: http://thesecretlivesofdata.com/raft/ if you have some
| time seems to be a good step by step explanation on how it
| works in detail.
| adsharma wrote:
| The visual is indeed cool. I also thought it'd be nice to use
| a chat-like interface to learn Raft.
|
| Alice: /set a 1
| Alice: /set b 2
| Bob: /status
|
| Etc
|
| https://github.com/adsharma/zre_raft
| https://github.com/adsharma/raft
|
| Bug reports welcome
| ericlewis wrote:
| Expensify had a version of something like this back in like
| 2013/14 I think.
| moderation wrote:
| The project is BedrockDB [0] and has been previously discussed
| [1].
|
| 0. https://bedrockdb.com/
|
| 1. https://news.ycombinator.com/item?id=12739771
| ericlewis wrote:
| Nice! Though the blockchain part is new to me. Interesting
| they kept growing this.
| Conlectus wrote:
| One thing that jumps out at me after reading a lot of Jepsen
| analyses - does rqlite assume that partitions form equivalence
| relations? That is, that all nodes belong to one and only one
| partition group? This is not always the case in practice.
| yjftsjthsd-h wrote:
| So the case of A can talk to B, B can talk to C, but A can't
| talk to C? (Making sure that I understand how you can be in
| multiple partitions)
| fnord123 wrote:
| FoundationDB and Comdb2 also use sqlite as a storage engine.
| Curious that they decided to implement yet another one.
|
| https://www.foundationdb.org/
|
| http://comdb2.org/
| tyingq wrote:
| Rqlite appears to predate comdb2.
| rapsey wrote:
| Literally the first sentence.
|
| > Comdb2 is a relational database built in-house at Bloomberg
| L.P. over the last 14 years or so.
|
| rqlite is not 14 years old.
| tyingq wrote:
| I was looking at the github repo history. Was it publicly
| visible sooner than that would imply?
| tyingq wrote:
| Answering my own question, Comdb2 was made available to
| the public on 1 January 2016, well after rqlite launched.
| peter_d_sherman wrote:
| First of all, great idea, and a brilliant and highly laudable
| effort!
|
| Favorited!
|
| One minor caveat ("Here be Dragons") I have (with respect to my
| own future adoption/production use), however:
|
| https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md
|
| > _" Does rqlite support transactions?
|
| It supports a form of transactions. You can wrap a bulk update in
| a transaction such that all the statements in the bulk request
| will succeed, or none of them will. However the behaviour of
| rqlite is undefined if you send explicit BEGIN, COMMIT, or
| ROLLBACK statements. This is not because they won't work -- they
| will -- but if your node (or cluster) fails while a transaction
| is in progress, the system may be left in a hard-to-use state. So
| until rqlite can offer strict guarantees about its behaviour if
| it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK
| is officially unsupported. Unfortunately this does mean that
| rqlite may not be suitable for some applications."_
|
| PDS: Distributed transactions are extremely difficult to get
| exactly right -- so I'm not trying to criticize all of the hard
| work and effort that everyone has put into this (again, it's a
| great idea, and I think it has a terrific future).
|
| But Distributed Transactions -- are what differentiate something
| like rqlite from, say, something like CockroachDB
| (https://www.cockroachlabs.com/docs/stable/architecture/life-...).
|
| Of course, CockroachDB is a pay-for product with an actual
| company with many years of experience backing it, whereas rqlite,
| as far as I can intuit, at this point in time (someone correct me
| if I am wrong), appears to be a volunteer effort...
|
| Still, I think that rqlite despite this -- has a glorious and
| wonderful future!
|
| Again, a brilliant and laudable effort, suitable for many use
| cases presently, and I can't wait to see what the future holds
| for this Open Source project!
|
| Maybe in the future some code-ninja will step up to the plate and
| add fully guaranteed, safe, distributed transactions!
|
| Until then, it looks like a great idea coupled with a great
| software engineering effort!
|
| As I said, "Favorited!".
| jchrisa wrote:
| I'm curious how this relates to the Calvin protocol as
| implemented by FaunaDB. They both use Raft, but FaunaDB and
| Calvin have additional details about how transactions are retried
| and aborted.
| https://fauna.com/blog/consistency-without-clocks-faunadb-tr...
| ClumsyPilot wrote:
| I think microk8s uses this to form a cluster, and k3s used to use
| it but moved back to etcd.
|
| Would be good to hear from someone who used it what are the pros
| and cons of such a setup
| fasteo wrote:
| AFAIK, microk8s uses a similar - but not this - form of
| distributed sqlite. Specifically, it uses dqlite[1] "a C
| library that implements an embeddable and replicated SQL
| database engine with high-availability and automatic failover."
|
| [1] https://github.com/canonical/dqlite
| tyingq wrote:
| Probably worth mentioning that Canonical initially made
| dqlite to be the backing store for LXD. It uses the sqlite
| VFS as the client entry point, so it's a very easy transition
| for an existing sqlite app, just recompile with the new
| header.