https://sqlsync.dev/posts/stop-building-databases/

Stop building databases

Carl Sverre
November 2023

Outline

* The humble cache
* Going faster with indexes
* Glass half-full mutations
* Recursive cache invalidation
* Are we building databases?
* Let's talk!
* Prior art

There comes a time in every frontend engineer's life when we realize we need to cache data from an API. It might start off benign - storing a previous page of data for that instant back-button experience, implementing a bit of undo logic, or merging some state from different API requests. But we all know what ends up happening. More feature requests show up, and soon we're busy implementing data caches, manual indexes, optimistic mutations, and recursive cache invalidation.

These features bear a remarkable resemblance to the inner workings of databases. Indeed, in any frontend application of sufficient complexity, engineers will necessarily end up building so many data management features that they are essentially creating a domain-specific database. This added complexity is duplicated in each project we work on, and takes away from spending time on delighting users and solving business problems.

So today, I'd like you to join me as we take a look at common application data patterns, and how they relate to database features. Afterwards, we will take a look at an alternative solution to these patterns - a frontend-optimized database stack which allows us to focus on the application rather than micromanaging data.

Welcome to the world of accidental database programming.

The humble cache

Our journey starts in the most humble of ways: send an API request to the server and store the result in a local variable. We might want to do this for any number of reasons, but one good example is working around declarative re-renders. Many modern web applications use declarative frameworks like React, which internally will re-render the tree many times over the course of a user's interaction with a page. We wouldn't want to issue an API request for each render - so we throw the data into a variable. Here is an example using React Hooks:

```js
import { useState, useEffect } from "react";

const Newsfeed = () => {
  const [{ loading, entries, error }, setEntries] = useState({
    loading: true,
    entries: [],
    error: null,
  });

  useEffect(() => {
    NewsfeedAPI.getAll()
      .then((entries) => setEntries({ entries }))
      .catch((error) => setEntries({ error }));
  }, []);

  // render loading, entries, and error states
  return <div>...</div>;
};
```

In this example^1 we can see a basic API request cache stored using React state hooks. Once the API request completes, we will cache the result (or an error) until the component is removed and re-added to the tree.

Soon we are tempted to add more features. A common next step is to "lift the cache" into a higher layer, or out of the UI tree entirely. One example of this is using Redux, a popular state management library for React. At its core, Redux allows developers to consolidate state and coordinate atomic changes to that state over time. Over the years, however, Redux has evolved into an ecosystem of tools and patterns which can be used to manage API data caching. The goal of using Redux (or similar) in this way is to centralize cache logic, coordinate refresh, and most importantly share cache results between components.
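As an illustration of what this "lifting" can look like, here is a minimal sketch of the newsfeed cache moved into a shared Redux store using Redux Toolkit. This is only a sketch, not code from the post; NewsfeedAPI is the same placeholder API client used in the hook example above.

```js
// A minimal sketch of "lifting the cache" into Redux (via Redux Toolkit).
// Illustrative only - NewsfeedAPI is a placeholder API client, as above.
import { configureStore, createSlice, createAsyncThunk } from "@reduxjs/toolkit";

const NewsfeedAPI = { getAll: async () => [] }; // placeholder API client

// Thunk that fetches the newsfeed and feeds the result into the shared cache.
export const fetchNewsfeed = createAsyncThunk("newsfeed/fetchAll", () =>
  NewsfeedAPI.getAll()
);

const newsfeedSlice = createSlice({
  name: "newsfeed",
  initialState: { loading: false, entries: [], error: null },
  reducers: {},
  extraReducers: (builder) => {
    builder
      .addCase(fetchNewsfeed.pending, (state) => {
        state.loading = true;
      })
      .addCase(fetchNewsfeed.fulfilled, (state, action) => {
        state.loading = false;
        state.entries = action.payload;
      })
      .addCase(fetchNewsfeed.rejected, (state, action) => {
        state.loading = false;
        state.error = action.error;
      });
  },
});

// Any component can now read the shared cache via useSelector and trigger a
// refresh by dispatching fetchNewsfeed() - no per-component fetch logic needed.
export const store = configureStore({
  reducer: { newsfeed: newsfeedSlice.reducer },
});
```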
As our humble caching layer grows in complexity it starts to take on a new identity: a centralized storage system that coordinates with the rendering engine and user actions to efficiently wrangle data. We might say it's starting to look a bit like a database...

But that's crazy. Let's talk about indexes instead.

Going faster with indexes

When I started building web apps almost two decades ago, I was immediately drawn to the elegance of properly structured data. By organizing data in a certain way, the application could do less work and deliver a much better user experience. At the time, my interactions with data in the backend were abstracted through databases - databases which allowed me to think about indexes as magical fairy dust sprinkled on queries to make them go fast. It wasn't until I joined SingleStore (formerly MemSQL) that I discovered how similar frontend data optimization was to the internals of database storage.^2

One optimization we can leverage in the frontend is to store data cached from the server in an object keyed by ID. This often arises due to the structure of the underlying API. For example, in an application using a REST API, we often read some data in batch and then enrich specific objects as needed. This requires constantly merging API results into the cache, which is made easier if objects are stored by ID. Let's look at an example of this kind of cache:

```js
const CACHE = {
  "f3ac87": {
    id: "f3ac87",
    author: "@carlsverre",
    title: "stop building databases",
    description: "There comes a time in every software engineers life...",
    createdAt: new Date(),
  },
  /* "a281f0": { ... }, "2f9f6c": { ... }, ... */
};

// retrieval by ID
console.log(CACHE["f3ac87"]);

// update by ID
CACHE["f3ac87"].title = "everyone is a database programmer";
```

In the example above, we are indexing entries by ID in a simple JavaScript object. Due to this "data shape," we have optimized for creating, reading, updating, and deleting entries by ID. However, any operation that needs to look at multiple entries, such as a filter, would require checking every entry. Let's improve this data structure to help us quickly look up entries by date:

```js
const truncateTime = (date) => {
  const clone = new Date(date);
  clone.setHours(0, 0, 0, 0);
  return clone;
};

const ENTRIES_BY_DATE = Object.groupBy(
  Object.values(CACHE),
  (e) => truncateTime(e.createdAt),
);

// filter by date == today
console.log(ENTRIES_BY_DATE[truncateTime(new Date())]);
```

In this second example, we create a basic index tracking the year/month/day portion of each entry's createdAt field.^3 Now our frontend has gained the ability to quickly look up entries published on a given date - at the expense of having to maintain consistency between the two data structures: CACHE and ENTRIES_BY_DATE.

Consider what happens as we start managing many of these indexes - each requiring custom logic to build, update, and query. Verifying correctness quickly becomes a burden on testing and code review. Forgetting to properly delete or update an entry in one index can lead to difficult bugs. Soon, we are spending more time building infrastructure to manage this complexity than building new application features. It's only a matter of time before we start to realize that we are, in fact, building a database.

Caching and indexing data is not where our journey ends! Without fail, someone will realize that optimistically mutating these data structures could lead to a much snappier user experience.

Glass half-full mutations

What are optimistic mutations anyway? Basically, the idea is to simulate the effect of a particular operation locally - before the server has had a chance to respond. By doing this, we can build user interfaces that seem to respond instantly, "eliminating" network latency.
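As a rough illustration (a minimal sketch, not tied to any particular library), an optimistic update against the CACHE from the previous section might look something like this. NewsfeedAPI.update, render, and showError are hypothetical placeholders.

```js
// Minimal sketch of an optimistic mutation against the CACHE object above.
// NewsfeedAPI.update, render, and showError are hypothetical placeholders.
async function renameEntry(id, newTitle) {
  const previous = CACHE[id];

  // Optimistically apply the change locally and re-render right away.
  CACHE[id] = { ...previous, title: newTitle };
  render();

  try {
    // Asynchronously tell the server, then let its result win.
    const confirmed = await NewsfeedAPI.update(id, { title: newTitle });
    CACHE[id] = confirmed;
  } catch (error) {
    // The server disagreed (or failed): roll back and surface the problem.
    CACHE[id] = previous;
    showError(error);
  }
  render();
}
```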
Of course, this comes at a cost - in the event that the server decides to do something other than anticipated (or there is some kind of error), the UI may need to roll back this change and prompt the user to fix issues. Assuming that the UI is good at predicting the server outcome (i.e. errors can be handled on the client side and logic is kept tightly in sync), optimistic mutations can be a powerful tool in a frontend engineer's toolbox. It is also a great way to further practice our database engineering skills!

Let's break down how optimistic mutations work, and think about the new challenges they bring to the table:

[Diagram: the phases of an optimistic mutation's execution, described in detail below.]

In the diagram above, we can trace the execution path of an optimistic update:

1. The UI emits a write operation. In this case, the UI is trying to update some piece of data.
2. This update is optimistically applied to the local cache - making the assumption that the server will end up agreeing with this decision. The UI can immediately re-render with the new state.
3. Asynchronously, the update operation is sent to the server.
4. Finally, the server responds with the result of the update, which is merged into the local cache, overwriting the earlier optimistic update. This allows the UI to re-render (if needed), leaving the system in a (hopefully) consistent state.

Ensuring that this process maintains consistency with the server comes with a number of challenges. First, we need to duplicate logic between the client and server so we can optimistically predict the result. Next, each in-flight mutation must be tracked in order to handle asynchronous errors or server disagreements. Finally, in order to provide the best user experience, the optimistic portion of the cache may even need to be made durable to reconcile changes across application restarts.

Solving each of these challenges incurs a huge tax on both developer time and correctness verification. Once again, we find ourselves micromanaging data rather than delighting customers with new, differentiating features. But these issues pale in comparison to what I like to call recursive cache invalidation.

Recursive cache invalidation

In any reasonably data-intensive application, pieces of data will often show up in multiple places in the cache:

```js
const CACHE = {
  projects: {
    1: { id: 1, name: "build a spaceship", progress: 0.5, numTasks: 10 },
    2: { id: 2, name: "to the moon", progress: 0.0, numTasks: 10 },
    // ...
  },
  tasks: {
    1: { id: 1, project: 1, name: "finish hull design", status: "pending" },
    9: { id: 9, project: 1, name: "order bolts", status: "pending" },
    // ...
  },
  users: {
    1: { id: 1, name: "Carl", assignedTasks: [1] },
    2: { id: 2, name: "ChatGPT", assignedTasks: [9] },
  },
};
```

In the example above, we are caching projects, tasks, and users working on an important space mission.^4 Let's figure out what we need to do after we complete a task in order to keep the cache consistent with the server:

[Sequence diagram: recursive invalidation, as outlined below.]

Let's walk through the steps:

1. We inform the server that the task is complete.
2. Refresh the project, as its progress has now changed.
3. 60% complete! Now let's check for new task assignments.
4. Darn, more work to do. Let's see what the new assignment is.

In this example, the UI has to do multiple round trips after an update to correctly invalidate every part of the cache that was potentially changed.
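To make that cascade concrete, here is a rough sketch of the client-side bookkeeping it implies. TaskAPI, ProjectAPI, UserAPI, and currentUserId are illustrative placeholders, not a real client.

```js
// Rough sketch of the refetch cascade described above. TaskAPI, ProjectAPI,
// UserAPI, and currentUserId are illustrative placeholders, not a real client.
async function completeTask(taskId) {
  // 1. Tell the server the task is complete and merge the result.
  CACHE.tasks[taskId] = await TaskAPI.complete(taskId);

  // 2. The parent project's progress is now stale, so refresh it.
  const projectId = CACHE.tasks[taskId].project;
  CACHE.projects[projectId] = await ProjectAPI.get(projectId);

  // 3. Completing a task may have triggered new assignments; refresh the user.
  const user = await UserAPI.get(currentUserId);
  CACHE.users[user.id] = user;

  // 4. Fetch any newly assigned tasks that aren't in the cache yet.
  for (const assignedId of user.assignedTasks) {
    if (!CACHE.tasks[assignedId]) {
      CACHE.tasks[assignedId] = await TaskAPI.get(assignedId);
    }
  }
}
```

Each step exists only to patch up a different corner of the cache that the first write indirectly changed.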
While it's certainly possible to build more complex APIs that reduce this into one round trip,^5 ultimately the result is coupling API or client logic to the underlying data model. Let's consider two reasons why this might be bad:

First, this requires that the UI knows which parts of the cache are relevant to each mutation. This can become very brittle at scale, as data relationships and aggregations may influence many parts of the local cache. Also, as the engineering team grows, these issues may even cross team boundaries, which can start to feel similar to mutable global variables in large software projects.

Second, when coupled with optimistic mutations, reproducing server logic on the client side in order to predict server changes becomes much more difficult. For example, in this case, we may want to optimistically remove task 1 from user 1 and increase the progress value by calculating a new ratio (we can use the total number of tasks). As we teach our applications how to predict these nested changes locally, our clients naturally duplicate more and more of the backend stack.

Are we building databases?

In any frontend application of sufficient complexity, engineers will necessarily end up building so many data management features that they are essentially creating a domain-specific database. This added complexity is duplicated in each project we work on and takes away from spending time on delighting users and solving business problems.

My goal here was to shine a light on these patterns we have grown accustomed to and ask: where is the frontend-optimized database stack we deserve? I'm tired of waiting - so I'm tackling this problem head-on. I call it SQLSync.

SQLSync is a frontend-optimized database stack built on top of SQLite^6 with a synchronization engine powered by ideas from Git and distributed systems. The stack is designed to integrate seamlessly with the most popular frontend frameworks like React, Vue, and Next.js. SQLSync's goal is to handle the hardest data management problems, allowing developers to focus on what makes their application unique.

As an early example of what SQLSync can do, check out this Todo app. The entire data layer is implemented in 60 lines of Rust and a couple of SQL queries scattered across components. In exchange, SQLSync provides a durable cache, the full power of SQLite (indexes, constraints, triggers, query optimization), optimistic mutations, smart cache invalidation, and reactive queries.

Let's take a closer look at how SQLSync approaches each of the data management problems we've discussed today.

SQLSync stores data locally in one or more SQLite databases. At face value, this gives us a durable cache that looks very similar to the databases we use in the backend. But it's a durable cache with superpowers! First off, indexes can be created easily, and are automatically kept in sync with the data. Just like on the backend, these indexes can be automatically used by the database to accelerate queries.

But it's not just indexes! SQLite also gives us the full power of SQL, which can easily express complex queries over our data, as well as triggers, foreign keys, constraints, and extensions such as full-text search.

On top of all that, SQLSync provides optimistic mutations out of the box. It does this by handling mutations in a reducer, similar to how Redux works. The reducer can be written in any language that can be compiled to WebAssembly.
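To give a feel for the shape of this approach, here is a purely conceptual sketch of such a reducer, written in JavaScript for readability. This is not SQLSync's actual API (SQLSync reducers are compiled to WebAssembly, e.g. from Rust); tx.execute and the mutation shapes are hypothetical stand-ins.

```js
// Conceptual sketch only - not SQLSync's actual API. SQLSync reducers are
// compiled to WebAssembly (e.g. from Rust); tx.execute and the mutation
// shapes below are hypothetical stand-ins used to illustrate the idea.
async function reducer(tx, mutation) {
  switch (mutation.type) {
    case "CreateTask":
      await tx.execute(
        "INSERT INTO tasks (id, project, name, status) VALUES (?, ?, ?, 'pending')",
        [mutation.id, mutation.project, mutation.name]
      );
      break;
    case "CompleteTask":
      await tx.execute("UPDATE tasks SET status = 'complete' WHERE id = ?", [
        mutation.id,
      ]);
      break;
    default:
      throw new Error(`unknown mutation: ${mutation.type}`);
  }
}
```

Because the same mutation logic runs on both the client and the server, the optimistic result and the authoritative result come from the same code path.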
Using this reducer, SQLSync can execute mutations optimistically on the client, and then run them in a globally consistent order on the server. Finally, the client performs an operation similar to a Git rebase to synchronize with the server.^7

One advantage of SQLSync's architecture is that it eliminates the need for recursive cache invalidation. Because all of the data mutation logic lives in a reducer that is easily shared between the client and the server, all changes made during a mutation are automatically visible. Even better, because synchronization works like a Git rebase, the server is free to make different changes than the client predicted - with the guarantee that the client will always converge to the same consistent outcome. This is a powerful capability that eliminates the need for developers to spend mental bandwidth on data micromanagement.

Let's talk!

I'm building SQLSync because I want to make client-side applications easier to build without us having to reinvent the wheel each time. If you share my vision of the future, consider starring SQLSync on GitHub, joining our Discord, or emailing me. I'd love to hear what your perfect client-side database looks like and what you think about SQLSync's approach to the problem.

Prior art

Using a database (or database techniques) in the frontend stack is a popular topic these days. I'd like to highlight some specific articles that I found insightful and relevant.

Riffle

In Riffle's essay "Building data-centric apps with a reactive relational database", the authors explore the idea of storing all application state - including UI state - in a single reactive database. Some key ideas that inspired this blog post as well as the development of SQLSync include:

* Reactive queries provide a clean mental model and align well with declarative systems like React
* Solving technical challenges in client-side application development using ideas originating in the database community
* The benefits of modeling state using a relational data model and real indexes

Instant.db

Stepan, the author of Instant.db, wrote two fantastic blog posts on databases in the browser:

* Database in the Browser, a Spec
* A Graph-Based Firebase

I highly recommend reading both posts, as they describe many of the same issues I mention in this post, with more of a focus on the relationship between the frontend and backend stack. Ultimately, Stepan clearly describes the motivation which led to the creation of Instant.db, a graph-based successor to Firebase.

CR-SQLite

Matt Wonlaw's CR-SQLite is an extension to SQLite which uses conflict-free replicated data types (CRDTs) and causally ordered event logs to support merging data consistently. This allows peer-to-peer applications to store and collaborate on data in SQLite without requiring a central coordinator. The project is also a fantastic example of running SQLite in the browser. In addition, Matt is exploring related ideas like incremental computation and making SQL easier to work with via typed-sql. Overall, Matt is an inspiring builder in the space, and I recommend taking a look at what he is working on.

Footnotes

1. This example is simplified for clarity. In a real application, use battle-tested API libraries instead - or keep reading to learn more about frontend-friendly databases!
2. Note: indexes in databases are a lot more complex than simply storing data differently. They involve statistics collection, data versioning, transaction control, locks, and more to ensure correctness and interop with query optimization. So yeah - magical fairy dust.
3. We could use many data structures to do this. This approach is pretty easy to do in JS, but not very flexible. Querying for a range of dates would require multiple accesses, for example. A potentially better data structure would be an array of entries sorted by date, combined with more complex query and update logic.
4. Some people build spaceships. Some people build web apps. Some people pretend to be building spaceships while building web apps.
5. GraphQL is one approach to this problem, although it's not perfect either.
6. SQLite may not only be deployed more than every other database combined, but it may be in the top five most deployed software modules of any description.
7. To learn more about how SQLSync works, check out my talk at WasmCon 2023.