[HN Gopher] Rqlite: The lightweight, distributed relational data...
___________________________________________________________________
 
Rqlite: The lightweight, distributed relational database built on
SQLite
 
Author : pavanyara
Score  : 175 points
Date   : 2021-01-22 13:48 UTC (9 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| mech422 wrote:
| So it looks like you can now distribute SQLite at the:
| 
| Stmt level: https://github.com/rqlite/rqlite
| 
| VFS Level: https://github.com/canonical/dqlite
| 
| Block Level: https://github.com/benbjohnson/litestream
| 
| Really cool enhancements to an awesome project!
 
| f430 wrote:
| Could you use this to build a decentralized p2p app? If so, what
| gotchas and limitations are there?
 
  | otoolep wrote:
  | No, rqlite is not suitable for that kind of application. All
  | writes must go through the leader.
 
| hermitcrab wrote:
| So if:
| 
| Alice, Bob and Charlie have a synced copy of the same database
| 
| Charlie goes on a plane and adds loads of records without a
| connection to the other databases
| 
| Alice and Bob make no changes
| 
| Charlie comes home and syncs
| 
| Will Charlie lose all his changes, as his database is different
| to Alice and Bob's?
| 
| What happens if Alice, Bob and Charlie all make changes offline
| then resync?
 
  | mrkurt wrote:
  | It doesn't work that way. rqlite is effectively a
  | leader/follower model that uses raft for leader consensus.
  | Writes can only happen online, and only to the leader.
 
    | hermitcrab wrote:
    | Ok, thanks!
 
  | teraflop wrote:
  | As per the description, all updates must be replicated to a
  | quorum of instances. If Charlie is on a plane without the
  | ability to contact a quorum, he can't add records in the first
  | place.
  | 
  | This is a fundamentally necessary tradeoff to provide strong
  | consistency, as described by the CAP theorem.
 
    | unnouinceput wrote:
    | In which case Charlie would need an additional local DB to
    | record those records, and when he gets back he'd use another
    | protocol/method/system/whatever to add those new records? How
    | about if everybody goes and adds records to the same table?
    | 
    | Here is a real-life scenario that I had to deal with in the
    | past. A technician (carpenter) goes to a client's home to
    | repair furniture in the middle of nowhere, so no internet. He
    | adds the necessary paperwork (pictures, declarations, the
    | signed and scanned contract) to the Clients table. This
    | company was employing hundreds of such technicians throughout
    | the many counties of Germany, each with a laptop running this
    | app, which was the backbone for getting paid and doing the
    | work. And it was not uncommon for more than one carpenter to
    | go to a client's home and do the repairs. Since each
    | carpenter was paid according to his own work, each of them
    | would create entries in their local Clients table, and when
    | they got back to HQ their work was manually uploaded to the
    | central DB; only after that did they get paid. I automated
    | that (that was the job: to eliminate the thousands of hours
    | that carpenters were wasting on manual uploads).
    | 
    | So given the above scenario, how is this system going to
    | achieve that? Same table, and same client details even in
    | table Clients, just different rows for different carpenters
    | (foreign key to table Carpenters).
 
      | wtallis wrote:
      | > So given the above scenario, how is this system going to
      | achieve that?
      | 
      | I don't think it is. You're describing a use case that is
      | distributed but explicitly does not want to enforce
      | consistency--you want offline workers to all be able to
      | keep working, and you're enforcing consistency after the
      | fact and outside of the database itself.
 
      | renewiltord wrote:
      | This tool does not handle that problem. It is not meant to.
      | It's for simultaneously available replicas. And this is the
      | rare moment where thinking about replication vs
      | synchronization as different is worthwhile.
      | 
      | You usually replicate for failure tolerance and performance
      | (this project only aims for the former).
 
      | vorpalhex wrote:
      | As other commenters have mentioned, this tool is not
      | intended for that kind of case. You want something like
      | PouchDB, which handles this kind of setup but comes with a
      | different set of tradeoffs (it's eventually consistent, not
      | strongly consistent).
 
  | adsharma wrote:
  | Charlie doesn't have to lose the data he saved on the plane.
  | Don't know what the rqlite implementation does.
  | 
  | In the second case, Alice-Bob consensus overrides Charlie
 
  | [deleted]
 
  | fipar wrote:
  | I have not read the full description of this project yet, but
  | it does mention the use of Raft for consensus, so in your
  | example I would expect Charlie to not be able to add any
  | records while disconnected, because, if my understanding is
  | correct:
  | 
  | - Charlie would either be the leader, but then, without getting
  | confirmation of writes from enough followers, he would not be
  | able to do any writes himself, or
  | 
  | - Charlie would be a follower, and while disconnected would
  | obviously get no writes from the leader.
 
  | NDizzle wrote:
  | What's messed up is that I was doing this kind of thing with
  | Lotus Domino in the late 90s. I'm sure others were doing it
  | before me, too.
  | 
  | Sometimes you had conflicts that needed resolution, but those
  | weren't that frequent for our use case.
 
| mshenfield wrote:
| Going from a local db to one over a network has at least one
| risk. The SQLite docs give developers the okay to write "n+1"
| style queries (https://www.sqlite.org/np1queryprob.html). When
| the db is on the same file system as the application this
| pattern is fine. But as soon as you add a network call it
| becomes a potential bottleneck.
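| 
| A minimal sketch of the pattern in Go (hypothetical users/orders
| schema, database/sql with any registered driver, "database/sql"
| import assumed):
| 
|     // N+1: one query for the ids, then one query per id. Cheap
|     // against a local SQLite file, a bottleneck once every call
|     // is a network round trip.
|     func loadNames(db *sql.DB) ([]string, error) {
|         rows, err := db.Query(`SELECT id FROM users`)
|         if err != nil {
|             return nil, err
|         }
|         defer rows.Close()
|         var names []string
|         for rows.Next() {
|             var id int64
|             if err := rows.Scan(&id); err != nil {
|                 return nil, err
|             }
|             var name string
|             err = db.QueryRow(
|                 `SELECT name FROM orders WHERE user_id = ?`,
|                 id).Scan(&name)
|             if err != nil {
|                 return nil, err
|             }
|             names = append(names, name)
|         }
|         // Network-friendly alternative: one round trip, e.g.
|         // SELECT o.name FROM users u
|         //   JOIN orders o ON o.user_id = u.id
|         return names, rows.Err()
|     }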
 
| alberth wrote:
| @otoolep
| 
| SQLite has a great post on "When to Use" (and not use) SQLite.
| 
| Would be great if you included these same use cases in the ReadMe
| docs and made it clear whether Rqlite can address them.
| 
| https://www.sqlite.org/whentouse.html
 
| scottlamb wrote:
| It looks like this layers Raft on top of SQLite. I don't like it
| when systems replicate high-level changes like "update users set
| salary = salary + 1000 where ...;". Instead, I prefer that they
| replicate low-level changes like "replace key/block X, which
| should have contents C_x, with C_x'".
| 
| Why? Imagine you're doing a rolling update. Some of your replicas
| are running the newer version of SQLite and some are running the
| older version. They may not execute the high-level query in
| exactly the same way. For example, in the absence of an "order
| by" clause, the order of select results is unspecified. So
| imagine someone makes a mutation that depends on this: "insert
| ... select ... limit". (Maybe a dumb example but it can happen
| anyway.) Now the databases start to diverge, not only in
| underlying bytes and implementation-defined ordering but in
| actual row data.
| 
| I worked on a major distributed system that originally replicated
| high-level changes and switched to replicating low-level changes
| for this reason. We had a system for detecting when replicas
| didn't match, and replication of high-level changes was the
| biggest reason for diffs. (Hardware error was the second biggest
| reason; we added a lot of checksumming because of that.)
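| 
| To make the "insert ... select ... limit" example concrete (table
| names are hypothetical; the Go wrapper just holds the SQL):
| 
|     // Without an ORDER BY, which rows LIMIT picks is
|     // implementation-defined, so replicas running different
|     // SQLite versions may not insert the same rows.
|     const nonDeterministic = `
|         INSERT INTO archive (id, body)
|         SELECT id, body FROM messages LIMIT 100;`
| 
|     // Deterministic rewrite: pin the ordering so every replica
|     // selects the same 100 rows.
|     const deterministic = `
|         INSERT INTO archive (id, body)
|         SELECT id, body FROM messages ORDER BY id LIMIT 100;`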
 
  | lrossi wrote:
  | If you replicate low level changes, you might not be able to do
  | a live upgrade/downgrade if the version change affects the on
  | disk format.
  | 
  | Another downside is that you might also propagate data
  | corruption in case of bugs in the DB software (e.g. memory
  | corruption) or hardware defects.
 
    | scottlamb wrote:
    | > If you replicate low level changes, you might not be able
    | to do a live upgrade/downgrade if the version change affects
    | the on disk format.
    | 
    | It certainly requires care to ensure all the replicas have
    | software capable of understanding the new format before it's
    | actually written, but it can be done. Likewise after writing
    | the new format, you want to have a roll-back plan.
    | 
    | In SQLite's case, https://sqlite.org/formatchng.html says:
    | "Since 2004, there have been enhancements to SQLite such that
    | newer database files are unreadable by older versions of the
    | SQLite library. But the most recent versions of the SQLite
    | library should be able to read and write any older SQLite
    | database file without any problems." I don't believe
    | upgrading SQLite automatically starts using any of those
    | enhancements; you'd have to do a schema change like "PRAGMA
    | journal_mode=WAL;" first.
    | 
    | > Another downside is that you might also propagate data
    | corruption in case of bugs in the DB software (e.g. memory
    | corruption) or hardware defects.
    | 
    | This happens regardless.
 
  | otoolep wrote:
  | rqlite creator here.
  | 
  | I understand what you're saying, but I don't think it's a
  | compelling objection. Obviously, differences between versions
  | -- even patched versions -- can result in subtle, unintended
  | differences in how the code works for a given program. But
  | there is no reason to think a system that operates at a lower
  | level ("replace key/block X, which should have contents C_x,
  | with C_x'") is less vulnerable to this kind of issue than one
  | that operates at a higher level, i.e. statement-based
  | replication, which rqlite uses. In fact I would argue that the
  | system that operates at a higher level of abstraction is
  | _less_ vulnerable, i.e. less likely to be affected by such
  | subtle changes.
 
  | [deleted]
 
  | xd wrote:
  | In MySQL/MariaDB this is what's known as non-deterministic
  | behaviour, so row or mixed replication is used to mitigate it.
  | 
  | Statement-based (high-level) replication is very useful for
  | e.g. "insert into tbl0 select col0 from tbl1 order by col1", as
  | you only need to send the query, not the individual row data.
 
  | tyingq wrote:
  | Dqlite replicates at the VFS layer of sqlite, which sounds like
  | what you're looking for. https://github.com/canonical/dqlite
 
    | hinkley wrote:
    | I haven't gotten a straight answer out of the k3s people
    | about why they dumped dqlite, just some comment about bugs.
    | 
    | I could see myself using dqlite in the future, so I'd like
    | some more user reports from the trenches. Can anyone shed
    | some light on this?
 
      | tyingq wrote:
      | The initial issue seems to be saying that it's because they
      | need to have etcd anyway, so consolidating on that removes
      | a dependency. Which fits their simplicity goal. Though the
      | issue appears to have been created by a user, not a
      | maintainer.
      | 
      |  _" Since the HA direction needs etcd anyway.. I'm
      | proposing dropping support for sqlite as the default
      | embedded non-HA option and switch to embedded etcd as the
      | default. This will reduce overall effort of maintainability
      | of two entirely different datastores."_
      | 
      | https://github.com/k3s-io/k3s/issues/845
 
        | hinkley wrote:
        | I accepted that reason for them, but as I don't benefit
        | directly from switching to etcd, I'd rather know about
        | what started the conversation than how it was concluded.
 
        | merb wrote:
        | dqlite support was flaky and went through their
        | translation layer, which probably added complexity.
 
  | ttul wrote:
  | Sounds like you need to submit a patch!
 
    | scottlamb wrote:
    | lol, this is an unhelpful reflex answer to any criticism of
    | open source software. What I'm describing is a redesign.
    | There's no point in submitting a patch for that. It'd be much
    | more effort than a few lines of code, and it wouldn't be
    | accepted. Open source means that anyone can fork. It doesn't
    | mean that maintainers will automatically merge patches
    | replacing their software with completely different software.
    | The only way that will happen is if the maintainers decide
    | for themselves it needs to be redesigned, and that starts
    | with discussion rather than a patch. It's also a long shot
    | compared to just finding different software that already has
    | the design I prefer.
    | 
    | If I want replicated SQLite, I'll look at dqlite or
    | litestream instead, which sound more compatible with my
    | design sensibilities. (Thanks, tyingq and benbjohnson!)
 
      | monadic3 wrote:
      | Frankly your bad-faith commentary isn't helping the
      | conversation either. I sincerely appreciate your cleaning
      | up the tone at the end.
 
  | benbjohnson wrote:
  | I just open-sourced a streaming replication tool for SQLite
  | called Litestream that does physical replication (raw pages)
  | instead of logical replication (SQL commands). Each approach
  | has its pros and cons. Physical replication logs tend to be
  | larger than logical logs but I agree that you avoid a lot of
  | issues if you do physical replication.
  | 
  | https://github.com/benbjohnson/litestream
 
    | hinkley wrote:
    | Do you use compression? And if so, how does that affect the
    | relative amount of network traffic vs logical replication?
 
      | benbjohnson wrote:
      | Yes, Litestream uses LZ4 compression. I originally used
      | gzip but the compression speed was pretty slow. B-tree
      | pages tend to compress well because they're usually only
      | 50-75% full: they need space to insert new records, and
      | pages split when they get full.
      | 
      | I'm seeing files shrink down to 14% of their size (1.7MB
      | WAL compressed to 264KB). However, your exact compression
      | will vary depending on your data.
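      | 
      | The general shape of that step in Go, as a sketch only (it
      | assumes the pierrec/lz4 package and a plain file-to-file
      | copy, not Litestream's actual code):
      | 
      |     import (
      |         "io"
      |         "os"
      | 
      |         "github.com/pierrec/lz4/v4"
      |     )
      | 
      |     // compressWAL streams a WAL segment through an LZ4
      |     // writer; half-empty b-tree pages give the compressor
      |     // plenty to work with.
      |     func compressWAL(src, dst string) error {
      |         in, err := os.Open(src)
      |         if err != nil {
      |             return err
      |         }
      |         defer in.Close()
      |         out, err := os.Create(dst)
      |         if err != nil {
      |             return err
      |         }
      |         defer out.Close()
      |         zw := lz4.NewWriter(out)
      |         if _, err := io.Copy(zw, in); err != nil {
      |             return err
      |         }
      |         return zw.Close()
      |     }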
 
        | hinkley wrote:
        | Ah, that makes sense. Most inserts don't split pages, so
        | are around n worst case pages, but once in a while you
        | get 2n updates where most of them are half full, and so
        | compress better.
        | 
        | So how does that compare to logical replication? (Also I
        | imagine packet size plays a role, since you have to flush
        | the stream quite frequently, right? 1000 bytes isn't much
        | more expensive than 431)
 
        | benbjohnson wrote:
        | Litestream defaults to flushing out to S3 every 10
        | seconds but that's mainly because of PUT costs. Each
        | request costs $0.000005, so it comes to about $1.30 per
        | month. If you flushed every second then it'd cost you
        | $13/month.
        | 
        | Logical replication would have significantly smaller
        | sizes, although the size cost isn't a huge deal on S3.
        | Data transfer in to S3 is free and so are DELETE
        | requests. The data only stays on S3 for as long as your
        | Litestream retention specifies. So if you're retaining
        | for a day then you're just keeping one day's worth of WAL
        | changes on S3 at any given time.
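        | 
        | (Back-of-the-envelope check of those figures, assuming
        | S3's roughly $0.005 per 1,000 PUT requests:)
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     func main() {
        |         // One flush every 10s: 6/min * 60 * 24 * 30 days.
        |         puts := 6 * 60 * 24 * 30 // 259,200 PUTs/month
        |         // Priced at about $0.005 per 1,000 requests.
        |         cost := float64(puts) * 0.005 / 1000
        |         fmt.Printf("about $%.2f/month\n", cost) // ~$1.30
        |     }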
 
    | mrkurt wrote:
    | I've been following this, and am anxious for the direct-to-
    | sqlite replication.
    | 
    | One of rqlite's big limitations is that it resyncs the entire
    | DB at startup time. Being able to start with a "snapshot" and
    | then incrementally replicate changes would be a big help.
 
      | CuriouslyC wrote:
      | This is tangential, but depending on your sql needs,
      | CouchDB's replication story is amazing, and you can
      | replicate to the browser using PouchDB. There is an
      | optional SQL layer, but obviously the good replication
      | story comes with some trade-offs.
 
      | otoolep wrote:
      | rqlite creator here.
      | 
      | I'm not sure I follow why it's a "big limitation"? Is it
      | causing you long start-up times? I'm definitely interested
      | in improving this, if it's an issue. What are you actually
      | seeing?
      | 
      | Also, rqlite does do log truncation (as per the Raft spec),
      | so after a certain number of log entries (8192 by default)
      | node restarts work _exactly_ like you suggested. The SQLite
      | database is restored from a snapshot, and any remaining
      | Raft Log entries are applied to the database.
 
        | mrkurt wrote:
        | Ah, ok that's some nuance I didn't know about!
        | 
        | We're storing a few GB of data in the sqlite DB.
        | Rebuilding that when rqlite restarts is a slow and
        | intensive process compared to just reusing the file
        | already on disk.
        | 
        | Our particular use case means we'll end up restarting
        | 100+ replica nodes all at once, so the way we're doing
        | things makes it more painful than necessary.
 
        | otoolep wrote:
        | But how do you know it's intensive? Are you watching disk
        | IO? Is there a noticeable delay when the node starts
        | before it's ready to receive requests?
        | 
        | Try setting "-raft-snap" to a lower number, maybe 1024,
        | and see if it helps. You'll have far fewer log entries
        | to apply on startup. However, the node will perform a
        | snapshot more often, and writes are blocked during
        | snapshotting. It's a trade-off.
        | 
        | It might be possible to always restart using some sort of
        | snapshot, independent of Raft, but that would add
        | significant complexity to rqlite. The fact that the
        | SQLite database is built from scratch on startup, from
        | the data in the Raft log, means rqlite is much more
        | robust.
 
        | mrkurt wrote:
        | Oh, we're reading the sqlite files directly. rqlite is
        | really just a mechanism for us to propagate read only
        | data to a bunch of clients.
        | 
        | We need that sqlite file to never go away. Even a few
        | seconds is bad. And since our replicas are spread all
        | over the world, it's not feasible to move 1GB+ data from
        | the "servers" fast enough.
        | 
        | Is there a way for us to use that sqlite file without it
        | ever going away? We've thought about hardlinking it
        | elsewhere and replacing the hardlink when rqlite is up,
        | but haven't built any tooling to do that.
 
        | otoolep wrote:
        | Hmmmm, that's a different issue. :-)
        | 
        | Today the rqlite code deletes the SQLite database (if
        | present) and then rebuilds it from the Raft log. It makes
        | things so simple, and ensures the node can always
        | recover, regardless of the prior state of the SQLite
        | database -- basically the Raft log is the only thing that
        | matters and that is _guaranteed_ to be the same under
        | each node.
        | 
        | The fundamental issue here is that Raft can only
        | guarantee that the Raft log is in consensus, so rqlite
        | can rely on that. It's always possible that the copy of
        | SQLite under a single node gets into a different state
        | than all the other nodes. This is because the change to
        | the Raft log, and the corresponding change to SQLite, are
        | not atomic. Blowing away the SQLite database means a
        | restart would fix this.
        | 
        | If this is important -- and what you ask sounds
        | reasonable for the read-only case that rqlite can support
        | -- I guess the code could rebuild the SQLite database in
        | a temporary place, wait until that's done, and then
        | quickly swap any existing SQLite file with the rebuilt
        | copy. That would minimize the time the file is not
        | present. But the file has to go away at some point.
        | 
        | Alternatively rqlite could open any existing SQLite file
        | and DROP all data first. At least that way the _file_
        | wouldn't disappear, but the data in the database would
        | wink out of existence and then come back. WDYT?
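        | 
        | Roughly the shape of that swap in Go (a sketch, not
        | rqlite code; assumes "os" is imported and that the temp
        | file is on the same filesystem as the target, so the
        | rename is atomic):
        | 
        |     // swapInRebuiltDB rebuilds the SQLite file at a
        |     // temporary path and then renames it over dbPath, so
        |     // readers only ever see the old state or the new
        |     // one, never a missing file.
        |     func swapInRebuiltDB(dbPath string,
        |         rebuild func(path string) error) error {
        | 
        |         tmp := dbPath + ".rebuild"
        |         if err := rebuild(tmp); err != nil {
        |             os.Remove(tmp) // best-effort cleanup
        |             return err
        |         }
        |         return os.Rename(tmp, dbPath)
        |     }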
 
        | mrkurt wrote:
        | Rebuilding and then moving it in place sounds pretty nice
        | to me.
 
      | jlongster wrote:
      | I built an app (https://actualbudget.com/) that uses a
      | local sqlite db and syncs changes, and that's exactly how
      | it works. It takes quite a different approach though, using
      | CRDTs to represent changes and those are synced around.
      | When a fresh client comes into play, it downloads the
      | latest sqlite snapshot from a server and then syncs up.
 
      | benbjohnson wrote:
      | That can be painful for sure. Litestream will do a snapshot
      | on startup if it detects that it can't pick up from where
      | it left off in the WAL. That can happen if Litestream is
      | shut down and another process performs a checkpoint. But
      | generally a restart will just use the existing snapshot &
      | continue with the WAL replication.
 
    | jgraettinger1 wrote:
    | Here's another approach to the problem [0]:
    | 
    | This package is part of Gazette [1], and uses a gazette
    | journal (known as a "recovery log") to power raw bytestream
    | replication & persistence.
    | 
    | On top of journals, there's a recovery log "hinting"
    | mechanism [2] that is aware of file layouts on disk, and
    | keeps metadata around the portions of the journal which must
    | be read to recover a particular on-disk state (e.g. what are
    | the current live files, and which segments of the log hold
    | them?). You can read and even live-tail a recovery log to
    | "play back" / maintain the on-disk file state of a database
    | that's processing somewhere else.
    | 
    | Then, there's a package providing RocksDB with an environment
    | that's configured to transparently replicate all database
    | file writes into a recovery log [3]. Because RocksDB is a
    | continuously compacted LSM-tree and we're tracking live
    | files, it's regularly deleting files, which allows chunks of
    | the recovery log to be "dropped" from what must be read or
    | stored in order to recover the full database.
    | 
    | For the SQLite implementation, SQLite journals and WAL's are
    | well-suited to recovery logs & their live file tracking,
    | because they're short-lived ephemeral files. The SQLite page
    | DB is another matter, however, because it's a super-long
    | lived and randomly written file. Naively tracking the page DB
    | means you must re-play the _entire history_ of page mutations
    | which have occurred.
    | 
    | This implementation solves this by using a SQLite VFS which
    | actually uses RocksDB under the hood for the SQLite page DB,
    | and regular files (recorded to the same recovery log) for
    | SQLite journals / WALs. In effect, we're leveraging RocksDB's
    | regular compaction mechanisms to remove old versions of
    | SQLite pages which must be tracked / read & replayed.
    | 
    | [0] https://godoc.org/go.gazette.dev/core/consumer/store-
    | sqlite
    | 
    | [1] https://gazette.readthedocs.io/en/latest/
    | 
    | [2] https://gazette.readthedocs.io/en/latest/consumers-
    | concepts....
    | 
    | [3] https://godoc.org/go.gazette.dev/core/consumer/store-
    | rocksdb
 
    | webmaven wrote:
    | _> I just open-sourced a streaming replication tool for
    | SQLite called Litestream that does physical replication (raw
    | pages) instead of logical replication (SQL commands). Each
    | approach has its pros and cons. Physical replication logs
    | tend to be larger than logical logs but I agree that you
    | avoid a lot of issues if you do physical replication._
    | 
    | Hmm. Not having dug into your solution much, is it safe to
    | say that the physical replication logs have something like
    | logical checkpoints? If so, would it make sense to keep
    | physical logs only for a relatively short rolling window, and
    | logical logs (i.e. only the interleaved logical checkpoints)
    | longer?
 
      | benbjohnson wrote:
      | I suppose you could save both the physical and logical logs
       | if you really needed long-term retention. SQLite databases
      | (and b-trees in general) tend to compress well so the
      | physical logging isn't as bad as it sounds. You could also
      | store a binary diff of the physical page which would shrink
      | it even smaller.
      | 
      | One benefit to using physical logs is that you end up with
      | a byte-for-byte copy of the original data so it makes it
      | easy to validate that your recovery is correct. You'd need
      | to iterate all the records in your database to validate a
      | logical log.
      | 
      | However, all that being said, Litestream runs as a separate
      | daemon process so it actually doesn't have access to the
      | SQL commands from the application.
 
| szszrk wrote:
| rqlite is mentioned here quite often, multiple times last year. I
| don't think this entry brings anything new.
 
| foolinaround wrote:
| We currently use browsers on several devices (both laptops and
| android) and rely on google sync currently. Maybe this could be
| used to sync bookmarks, history etc across my devices but still
| keep my data local to me?
 
  | JoachimSchipper wrote:
  | This uses Raft, so a quorum of devices would need to be online
  | at the same time. That's not what you want for browser sync.
 
| blackbear_ wrote:
| I know nothing of consensus algorithms and distributed systems so
| bear with me please.
| 
| > rqlite uses Raft to achieve consensus across all the instances
| of the SQLite databases, ensuring that every change made to the
| system is made to a quorum of SQLite databases, or none at all.
| 
| What I understood from this sentence is that, if we have three
| instances, rqlite will make sure that every change is written to
| at least two. But what if two changes are written to two
| different pairs of instances? Then the three instances will have
| three different versions of the data. For example, change X is
| written to instances A and B, and change Y is written to B and C.
| Now A has X, B has X and Y, and C has Y only. How do you decide
| who is right?
 
  | edoceo wrote:
  | Raft consensus: https://raft.github.io/
  | 
  | Surprisingly easy to understand, and a cool visual.
 
  | teraflop wrote:
  | In brief: at any point in time, one of the replicas is a
  | "leader" which controls the order in which operations are
  | committed. The changes occur in a defined sequence, and other
  | replicas may lag behind the leader, but cannot be inconsistent
  | with it.
  | 
  | Your example can't happen, because if (for instance) A is the
  | leader, then C will not apply change Y without contacting the
  | leader, which will tell it to apply X first.
  | 
  | If you want more details about how this handles all the edge
  | cases -- for instance, what happens if the leader crashes --
  | the Raft paper is quite accessible:
  | https://raft.github.io/raft.pdf
 
    | hinkley wrote:
    | TL;DR: Raft updates are serialized (as in sequential).
 
  | whizzter wrote:
  | The semantics of Raft have a "simple" (compared to the harder
  | to understand Paxos) forward motion of events that is supposed
  | to guarantee that you won't get into weird states regardless of
  | whether any particular node(s) go down (I think it can survive
  | up to (N-1)/2 dead machines, rounded down, in a cluster of N).
  | 
  | Raft is based on having a leader decide what the next COMMIT is
  | going to be, so B could never have X and Y at the same time
  | (they could both be queued but other mechanisms could reject
  | them).
  | 
  | Also, data is not considered committed until more than half the
  | cluster has acknowledged it (at which point the leader knows it
  | and can move forward). Leader election also works in a similar
  | way, iirc.
  | 
  | As others mentioned, the visualization on
  | https://raft.github.io/ is really good (You can affect it to
  | create commits and control downtime of machines)
 
    | hinkley wrote:
    | It's 1/2 + 1, isn't it? So if the leader goes down at the
    | exact moment of quorum, you can still get quorum again.
    | 
    | That would mean in 3 servers you need 2.5 aka 3 machines to
    | commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a
    | wash anyway, because as the servers go up the fraction you
    | need for quorum goes down, but the odds of falling behind or
    | failing outright go up too. Not to mention the time during
    | which 1/n machines are down due to an upgrade gets longer and
    | longer the more machines you have, increasing the chances of
    | double fault.
 
      | simtel20 wrote:
      | > It's 1/2 + 1 isn't it?
      | 
      | The parent post is talking about the number that can go
      | down while maintaining quorum, and you're talking about the
      | number that need to remain up to maintain quorum. So you're
      | both correct.
      | 
      | However:
      | 
      | > That would mean in 3 servers you need 2.5 aka 3 machines
      | to commit a change.
      | 
      | That seems wrong. You need N//2 +1 where "//" is floor
      | division, so in a 3 node cluster, you need 3//2 +1, or 1+1
      | or 2 nodes to commit a change.
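      | 
      | In Go terms (integer division already floors for positive
      | values):
      | 
      |     // quorum returns the minimum number of nodes that must
      |     // agree in a cluster of n: a strict majority.
      |     func quorum(n int) int { return n/2 + 1 }
      | 
      |     // quorum(3) == 2, quorum(4) == 3, quorum(5) == 3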
 
        | hinkley wrote:
        | I think I see the problem.
        | 
        | 'Simple majority' is based on the number of the machines
        | that the leader knows about. You can only change the
        | membership by issuing a write. Write quorum and
        | leadership quorum are two different things, and if I've
        | got it right, they can diverge after a partition.
        | 
        | I'm also thinking of double faults, because the point of
        | Raft is to get past single fault tolerance.
        | 
        | [edit: shortened]
        | 
        | After a permanent fault (broken hardware) in a cluster of
        | 5, the replacement quorum member can't vote for writes
        | until it has caught up. It can vote for leaders, but it
        | can't nominate itself. Catching up leaves a window for
        | additional faults.
        | 
        | It's always 3/5 for writes and elections; the difference
        | is that the _ratio_ of original machines that have to
        | confirm a write can go to 100% of survivors, instead of
        | the 3/4 of reachable machines. Meaning network jitter and
        | packet loss slow down writes until it recovers, and an
        | additional partition can block writes altogether, even
        | with 3/5 surviving the partition.
 
      | teraflop wrote:
      | > It's 1/2 + 1 isn't it?
      | 
      | > That would mean in 3 servers you need 2.5 aka 3 machines
      | to commit a change. Then 4/5, 5/7, 6/9, 7/11.
      | 
      | No, the requirement isn't 1/2 + 1. Any _strict_ majority
      | of the cluster is enough to elect a leader. So you need
      | 2/3, or 3/4, or 3/5, and so on.
      | 
      | > Not to mention the time during which 1/n machines are
      | down due to an upgrade gets longer and longer the more
      | machines you have, increasing the chances of double fault.
      | 
      | Generally, this is not the case. If individual machine
      | failures are random and equally probable, and if each
      | machine is down on average less than 50% of the time, then
      | adding more machines makes things better, not worse. (This
      | is a basic property of the binomial distribution.)
      | 
      | Of course, if you have a single point of failure somewhere
      | -- e.g. a network switch -- this assumption can be
      | violated, but that's true regardless of how many machines
      | you have.
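      | 
      | A quick sketch of the binomial point (assuming independent
      | failures with each node down 1% of the time; the numbers
      | are only illustrative):
      | 
      |     package main
      | 
      |     import (
      |         "fmt"
      |         "math"
      |     )
      | 
      |     // binomial returns n choose k.
      |     func binomial(n, k int) float64 {
      |         r := 1.0
      |         for i := 1; i <= k; i++ {
      |             r = r * float64(n-k+i) / float64(i)
      |         }
      |         return r
      |     }
      | 
      |     // pNoQuorum is the probability that a strict majority
      |     // of the n nodes is down at once, i.e. no quorum.
      |     func pNoQuorum(n int, p float64) float64 {
      |         total := 0.0
      |         for k := n/2 + 1; k <= n; k++ {
      |             total += binomial(n, k) *
      |                 math.Pow(p, float64(k)) *
      |                 math.Pow(1-p, float64(n-k))
      |         }
      |         return total
      |     }
      | 
      |     func main() {
      |         // Prints roughly 3e-04, 1e-05, 3e-07: more nodes,
      |         // less chance of losing quorum.
      |         for _, n := range []int{3, 5, 7} {
      |             fmt.Printf("n=%d: %.1e\n", n, pNoQuorum(n, 0.01))
      |         }
      |     }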
 
        | hinkley wrote:
        | If the leader is down (the scenario you clipped out in
        | your response) you need a strict majority with an even
        | number of machines.
 
        | hinkley wrote:
        | This is right for the wrong reason. See follow-up down-
        | thread.
 
  | jasonjayr wrote:
  | IIRC, a monotonic counter is involved. The odd one out will
  | realize it's behind the highest sequence number and discard
  | its updates to resync with the majority consensus.
  | 
  | Edit: http://thesecretlivesofdata.com/raft/ if you have some
  | time seems to be a good step by step explanation on how it
  | works in detail.
 
    | adsharma wrote:
    | The visual is indeed cool. I also thought it'd be nice to use
    | a chat-like interface to learn Raft:
    | 
    |     Alice: /set a 1
    |     Alice: /set b 2
    |     Bob:   /status
    | 
    | Etc.
    | 
    | https://github.com/adsharma/zre_raft
    | https://github.com/adsharma/raft
    | 
    | Bug reports welcome
 
| ericlewis wrote:
| Expensify had a version of something like this back in like
| 2013/14 I think.
 
  | moderation wrote:
  | The project is BedrockDB [0] and has been previously discussed
  | [1].
  | 
  | 0. https://bedrockdb.com/
  | 
  | 1. https://news.ycombinator.com/item?id=12739771
 
    | ericlewis wrote:
    | Nice! Though the blockchain part is new to me. Interesting
    | they kept growing this.
 
| Conlectus wrote:
| One thing that jumps out at me after reading a lot of Jepsen
| analyses - does Rqlite assume that partitions form equivalence
| relations? That is, that all nodes belong to one and only one
| partition group? This is not always the case in practice.
 
  | yjftsjthsd-h wrote:
  | So the case of A can talk to B, B can talk to C, but A can't
  | talk to C? (Making sure that I understand how you can be in
  | multiple partitions)
 
| fnord123 wrote:
| FoundationDB and Comdb2 also use sqlite as a storage engine.
| Curious that they decided to implement yet another one.
| 
| https://www.foundationdb.org/
| 
| http://comdb2.org/
 
  | tyingq wrote:
  | Rqlite appears to predate comdb2.
 
    | rapsey wrote:
    | Literally the first sentence.
    | 
    | > Comdb2 is a relational database built in-house at Bloomberg
    | L.P. over the last 14 years or so.
    | 
    | rqlite is not 14 years old.
 
      | tyingq wrote:
      | I was looking at the github repo history. Was it publicly
      | visible sooner than that would imply?
 
        | tyingq wrote:
        | Answering my own question, Comdb2 was made available to
        | the public on 1 January 2016, well after rqlite launched.
 
| peter_d_sherman wrote:
| First of all, great idea, and a brilliant and highly laudable
| effort!
| 
| Favorited!
| 
| One minor caveat ("Here be Dragons") I have (with respect to my
| own future adoption/production use), however:
| 
| https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md
| 
| > _" Does rqlite support transactions?
| 
| It supports a form of transactions. You can wrap a bulk update in
| a transaction such that all the statements in the bulk request
| will succeed, or none of them will. However the behaviour or
| rqlite is undefined if you send explicit BEGIN, COMMIT, or
| ROLLBACK statements. This is not because they won't work -- they
| will -- but if your node (or cluster) fails while a transaction
| is in progress, the system may be left in a hard-to-use state. So
| until rqlite can offer strict guarantees about its behaviour if
| it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK
| is officially unsupported. Unfortunately this does mean that
| rqlite may not be suitable for some applications."_
| 
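| For context, the "form of transactions" above is the bulk-request
| kind: one HTTP request whose statements are applied atomically. A
| rough Go sketch ("bytes" and "net/http" imports assumed; the
| accounts table is hypothetical, and the /db/execute endpoint and
| "transaction" parameter are taken from the rqlite docs, so check
| the current documentation before relying on the exact names):
| 
|     // postBulk sends both statements in a single request; with
|     // the transaction parameter set, either both take effect or
|     // neither does.
|     func postBulk() error {
|         body := []byte(`[
|           "INSERT INTO accounts(name, balance) VALUES('a', 100)",
|           "INSERT INTO accounts(name, balance) VALUES('b', 200)"
|         ]`)
|         resp, err := http.Post(
|             "http://localhost:4001/db/execute?transaction",
|             "application/json", bytes.NewReader(body))
|         if err != nil {
|             return err
|         }
|         return resp.Body.Close()
|     }
| 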
| PDS: Distributed transactions are extremely difficult to get
| exactly right -- so I'm not trying to criticize all of the hard
| work and effort that everyone has put into this (again, it's a
| great idea, and I think it has a terrific future).
| 
| But Distributed Transactions are what differentiate something
| like rqlite from, say, something like CockroachDB (https://www.co
| ckroachlabs.com/docs/stable/architecture/life-...).
| 
| Of course, CockroachDB is a pay-for product with an actual
| company with many years of experience backing it, whereas rqlite,
| as far as I can intuit, at this point in time (someone correct me
| if I am wrong), appears to be a volunteer effort...
| 
| Still, I think that rqlite despite this -- has a glorious and
| wonderful future!
| 
| Again, a brilliant and laudable effort, suitable for many use
| cases presently, and I can't wait to see what the future holds
| for this Open Source project!
| 
| Maybe in the future some code-ninja will step up to the plate and
| add fully guaranteed, safe, distributed transactions!
| 
| Until then, it looks like a great idea coupled with a great
| software engineering effort!
| 
| As I said, "Favorited!".
 
| jchrisa wrote:
| I'm curious how this relates to the Calvin protocol as
| implemented by FaunaDB. They both use Raft, but FaunaDB and
| Calvin have additional details about how transactions are retried
| and aborted. https://fauna.com/blog/consistency-without-clocks-
| faunadb-tr...
 
| ClumsyPilot wrote:
| I think microk8s uses this to form a cluster, and k3s used to
| use it but moved back to etcd.
| 
| Would be good to hear from someone who used it what are the pros
| and cons of such a setup
 
  | fasteo wrote:
  | AFAIK, microk8s uses a similar - but not this - form of
  | distributed sqlite. Specifically, it uses dqlite[1] "a C
  | library that implements an embeddable and replicated SQL
  | database engine with high-availability and automatic failover."
  | 
  | [1] https://github.com/canonical/dqlite
 
    | tyingq wrote:
    | Probably worth mentioning that Canonical initially made
    | dqlite to be the backing store for LXD. It uses the sqlite
    | VFS as the client entry point, so it's a very easy transition
    | for an existing sqlite app: just recompile with the new
    | header.
 
___________________________________________________________________
(page generated 2021-01-22 23:00 UTC)