[HN Gopher] Show HN: Cozo - new Graph DB with Datalog, embedded ...
___________________________________________________________________
 
Show HN: Cozo - new Graph DB with Datalog, embedded like SQLite
 
Hi HN, I have been working on this Cozo database for half a year,
and now it is ready for public release.  My initial motivation is
that I want a graph database. Lightweight and easy to use, like
SQLite. Powerful and performant, like Postgres. I found none of the
existing solutions good enough.  Deciding to roll my own, I need to
choose a query language. I am familiar with Cypher but consider it
not much of an improvement over CTE in SQL (Cypher is sometimes
notationally more convenient, but not more expressive). I like
Gremlin but would prefer something more declarative.
Experimentations with Datomic and its clones convinced me that
Datalog is the way to go.  Then I need a data model. I find the
property graph model (Neo4j, etc.) over-constraining, and the
triple store model (Datomic, etc.) suffering from inherent
performance problems. They also lack the most important property of
the relational model: being an algebra. Non-algebraic models are
not very composable: you may store data as property graphs or
triples, but when you do a query, you always get back relations. So
I decided to have relational algebra as the data model.  The end
result, I now present to you. Let me know what you think, good or
bad, and I'll do my best to address your feedback. This is the first
time that I have used Rust in a significant project, and I love the
experience!
 
Author : zh217
Score  : 278 points
Date   : 2022-11-08 12:25 UTC (10 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| canadiantim wrote:
| You mention how Cypher is not much of an improvement over CTE in
| SQL, I was wondering if you could expand on this point a bit if
| possible?
| 
| Some part of me is considering using Apache AGE graph extension
| for postgres, but another part wonders whether it's worth it
| considering CTEs can do a lot of the same things.
| 
| I'll definitely be following the progress for Cozo though, sounds
| great on the face of it. Definitely will have to consider
| potentially using Cozo as well. I wonder if it could make sense
| to use Postgres and Cozo together?
 
  | zh217 wrote:
  | Yes of course.
  | 
  | Perhaps I should start by clarifying that I am talking about
  | the number of queries the Cypher language can express, without
  | any vendor-specific extensions, since my consideration was
  | whether to use it as the query language for my own database.
  | And Cypher is of course much more convenient to _type_ than SQL
  | for expressing graph traversals - it was built for that.
  | 
  | With that understanding, any Cypher pattern can be translated
  | into a series of joins and projections in SQL, and any
  | recursive query in Cypher can be translated into a recursive
  | CTE. Theoretically, SQL with recursive CTE is not Turing
  | complete (unless you also add in window functions in recursive
  | CTE, which I don't think any of the Cypher databases currently
  | provide), whereas Datalog with function symbols is. Practically,
  | you can easily write a shortest path query in pure Datalog
  | without recourse to built-in algorithms (an example is shown in
  | README), and at least in Cozo it executes essentially as a
  | variant of Dijkstra's algorithm. I'm not sure I can do that in
  | Cypher. I don't think it is doable.
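
The translation claim above can be illustrated with a minimal sketch.
It assumes a hypothetical edge(src, dst) table and uses SQLite through
Python's standard sqlite3 module, not Cozo or any Cypher database. The
recursive CTE plays roughly the role of a variable-length path pattern,
and UNION (rather than UNION ALL) drops duplicates so the cycle in the
sample data does not loop forever.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")],
)

# Base case: direct neighbours of the start node.
# Recursive case: join the edge table against what has been reached so far.
REACHABLE_FROM = """
WITH RECURSIVE reach(node) AS (
    SELECT dst FROM edge WHERE src = ?
    UNION
    SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
)
SELECT node FROM reach ORDER BY node
"""


def reachable(start):
    """Return every node reachable from `start` in one or more hops."""
    return [row[0] for row in conn.execute(REACHABLE_FROM, (start,))]


print(reachable("a"))
```

This only covers plain reachability. A weighted shortest-path query is
harder to express this way, which is the gap the comment above points
at.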
 
    | samuell wrote:
    | Does Cypher even support nested and/or recursive queries? I
    | remember asking the Neo4j guys at a meetup about that many
    | years ago, and they didn't even seem to understand the
    | question. Might have changed since then of course.
    | 
    | Otherwise the thing I have noticed with the datalog (as well
    | as prolog) syntax, is you are able to build a vocabulary of
    | re-usable queries, in a much more usable way than any of the
    | solutions I've seen in SQL, or other similar languages.
    | 
    | It thus allows you to raise your level of abstraction by
    | building up your definitions (or "classes" if you will) layer
    | by layer with well-crafted queries, which can then be used for
    | further refined classifying queries.
 
      | zh217 wrote:
      | Re Datalog syntax: yes, the "composability" is the main
      | reason that I decided to adopt it as the query language.
      | This is also the reason why we made storing query results
      | back into the database very easy (no pre-declaration of
      | "tables" necessary) so that intermediate results can be
      | materialized in the database at will and be used by
      | multiple subsequent queries.
 
        | samuell wrote:
        | Indeed, composability is the spot-on keyword here.
 
| [deleted]
 
| samuell wrote:
| How I have waited for this: A simple, accessible library for
| graph-like data with datalog (also in a statically compiled
| language, yay). Have even pondered using SWI-prolog for this kind
| of stuff, but it seems so much nicer to be able to use it
| embedded in more "normal" types of languages.
| 
| Looking forward to playing with this!
| 
| The main thing I will be wondering now is how it will scale to
| really large datasets. Any input on that?
 
  | samuell wrote:
  | For folks looking for documentation or getting-started
  | examples, see:
  | 
  | - The tutorial: https://nbviewer.org/github/cozodb/cozo-
  | docs/blob/main/tutor...
  | 
  | - The language documentation:
  | https://cozodb.github.io/current/manual/
  | 
  | - The pycozo library README for some examples on how to run
  | this from inline python:
  | https://github.com/cozodb/pycozo#readme
 
  | zh217 wrote:
  | Thanks for your interest in this!
  | 
  | It currently uses RocksDB as the storage engine. If your server
  | has enough resources, I believe it can store TBs of data with
  | no problem.
  | 
  | Running queries on datasets this big is a complicated story.
  | Point lookups should be nearly instant, whereas running
  | complicated graph algorithms on the whole dataset is
  | (currently) out of the question, since all the rows a query
  | touches must reside in memory. Also, the algorithmic complexity
  | of some of the graph algorithms is too high for big data and
  | there's nothing we can do about it. We aim to provide a smooth
  | way for big data to be distilled layer by layer, but we are not
  | there yet.
 
    | samuell wrote:
    | Many thanks for the detailed answer!
 
| mark_l_watson wrote:
| Thank you, this looks very useful. I will try the Python embedded
| mode when I have time.
| 
| I especially like the Datalog query examples in your README
| file. I usually use RDF/RDFS and the SPARQL query language, with
| much less use of property graphs using Neo4J. I expect an easy
| ramp up learning your library.
| 
| BTW, I read the discussion of your use of the AGPL license. For
| what it is worth, that license is fine with me. I usually release
| my open source projects using Apache 2, but when required
| libraries use GPL or AGPL, I simply use those licenses.
 
| dmitriid wrote:
| A nitpick for the README: consider converting examples from
| images to code blocks (you can even directly copy-paste them into
| the code blocks and they should retain their formatting)
| 
| Otherwise: yes, please. I love the idea.
 
| mola wrote:
| Graph query over relational data, brilliant. I need this
| yesterday.
 
| OtomotO wrote:
| Awesome work, congrats.
| 
| As someone who has never done anything with Datalog, I didn't see
| an example in the repo, and the docs (docs.rs) could use some more
| content.
| 
| I hope to see a 1.0 at some point and performance that can
| compete with SQLite.
| 
| Would love to have an alternative, especially as I have a few pet
| projects that have graph data (well, in the end the whole
| universe can be modelled as a graph ;))
 
  | zh217 wrote:
  | I'm very happy that you like it!
  | 
  | The "teasers" section in the repo README contains a few
  | examples. Or you could look at the tutorial
  | (https://nbviewer.org/github/cozodb/cozo-
  | docs/blob/main/tutor...), which contains all sorts of examples.
  | 
  | The Rust documentation on docs.rs could certainly be improved,
  | will do that later!
 
    | OtomotO wrote:
    | Ah, yes, mea culpa. Was browsing on the phone and did miss
    | that link indeed.
    | 
    | Is it also okay to store big data that would otherwise go
    | into another storage like e.g. blog-posts?
    | 
    | I mean the content could also be modeled as a leaf-node and
    | not be part of the db itself. (not sure if that would be
    | abusing the kv storage)
 
      | zh217 wrote:
      | In short: yes, but not right now. See this issue:
      | https://github.com/cozodb/cozo/issues/2. Also in this case
      | you are not really using it as an embedded database
      | anymore, which is our original motivation. We currently
      | also provide a "cozoserver", but it is pretty primitive at
      | the moment. "Big data" capabilities, when they arrive in
      | Cozo, will probably go into the server instead of the
      | embedded binaries.
 
        | OtomotO wrote:
        | Hm, why wouldn't that be embedded?
        | 
        | How do you define embedded?
        | 
        | One of my applications is a simple "blog-like" webservice
        | where you can either use a SQLite db or Postgres.
        | 
        | Personally I often prefer SQLite because it doesn't need
        | a thousand configurations and I can just migrate all the
        | content with copying a file.
 
        | zh217 wrote:
        | My use of "embedded" means that the whole database runs
        | in the same process as your application. This is how
        | SQLite works. Your application doesn't "connect" to an
        | SQLite database in the usual sense. Your application
        | simply contains SQLite as part of itself. Contrast this
        | with Postgres, where you first need to start a Postgres
        | server and then have your application talk to it.
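
A minimal sketch of what "embedded" means in practice, using SQLite via
Python's standard sqlite3 module as the stand-in (this is not Cozo's
API, and the file name and `post` table are made up for illustration).
No server is started; the engine runs inside the calling process and
the data lives in one ordinary file.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "blog.db")

# Opening the file is the whole "connection": no server, no network.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO post (body) VALUES (?)", ("hello",))
conn.commit()
conn.close()

# The data persists in that single file and can be reopened later
# (or copied elsewhere) by any process that embeds the same library.
reopened = sqlite3.connect(path)
bodies = [row[0] for row in reopened.execute("SELECT body FROM post")]
reopened.close()
print(bodies)
```

With a client/server database such as Postgres, the equivalent first
step would be connecting to a separately running server process.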
 
        | OtomotO wrote:
        | Exactly.
        | 
        | I was just curious because of your comment:
        | 
        | > Also in this case you are not really using it as an
        | embedded database anymore, which is our original
        | motivation
        | 
        | By your (and my) definition, I am indeed using it as
        | an embedded database. It's running inside the process and
        | storing (and persisting) blog-posts.
 
    | Serow225 wrote:
    | I'm excited to get some more Rust docs!
    | 
    | Even just a pointer to serde ::from_value(value).unwrap(), and
    | ::deserialize(value), would be helpful to get people pointed in
    | the right direction.
    | 
    | Looks like a super cool project, congrats!
 
| ithrow wrote:
| _you may store data as property graphs or triples, but when you
| do a query, you always get back relations_
| 
| Can you elaborate on this? In Datomic you can get back
| hierarchical data.
 
| ekidd wrote:
| This is a really impressive piece of work! Congratulations!
| 
| I note that it appears to be a library, but it's licensed under
| the Affero GPL. I believe this means that if I link your library
| into a program, and if I then allow users to interact with that
| combined program in any way over a network, then I have to make
| it possible for users to download the source code to my entire
| program. Is that your goal here? Were you thinking of some kind
| of commercial licensing model for people writing server-side apps
| that use your library?
| 
| (I'm curious because I've been deciding whether or not to roll my
| own toy Datalog for a permissively-licensed open source Rust
| project.)
 
  | zh217 wrote:
  | No, my understanding is that if you don't make any changes to
  | the Cozo code, you don't need to release anything to the
  | public. If you do, and you cannot release your non-Cozo code,
  | then you must dynamically link to the library (and release your
  | changes to the Cozo code). The Python, NodeJS and Java/Clojure
  | libraries all use dynamic linking.
  | 
  | There is no plan for any commercial license - this is a
  | personal project at the moment. My hope is for this project to
  | grow into a true FOSS database with wide contributions and no
  | company controlling it. If a community forms and after I
  | understand the consequences a little bit more, the license may
  | change if the community decides that it is better for the long-
  | term good of the project. For the moment though, it is staying
  | AGPL.
 
    | Cu3PO42 wrote:
    | Let me preface by saying that this seems like a great piece
    | of software and it is absolutely within your right to license
    | it as whatever you would like, no matter what any of the
    | commenters here think.
    | 
    | However, I don't believe your understanding of AGPL is
    | accurate.
    | 
    | > No, my understanding is that if you don't make any changes
    | to the Cozo code, you don't need to release anything to the
    | public. If you do, and you cannot release your non-Cozo code,
    | then you must dynamically link to the library (and release
    | your changes to the Cozo code). The Python, NodeJS and
    | Java/Clojure libraries all use dynamic linking.
    | 
    | This sounds like you're thinking of the LGPL, not AGPL. LGPL
    | is less strict than GPL because the exception you describe
    | above applies. AGPL, on the other hand, is more strict.
    | Essentially, if you use any AGPL code to provide a
    | service to users then you must also make the source code
    | available, even if the software itself is never delivered to
    | users.
    | 
    | The intention here is that you can't get around GPL by hiding
    | any use of the GPL code behind a server, so it makes perfect
    | sense to use it for a database. But I don't think it does
    | what you want.
    | 
    | Whichever way you decide to go, be it AGPL, LGPL or something
    | else, I encourage you to make a choice before accepting any
    | outside contributions. As soon as you have code from other
    | authors without a CLA you will need to obtain their
    | permission to change the license (with some exceptions).
    | 
    | (Disclaimer: I'm not a lawyer, just interested in licenses.)
 
      | zh217 wrote:
      | It seems that I really did misunderstand the differences.
      | It is now under LGPL. The repo still requires CLA for
      | contribution for the moment until I am really sure.
 
      | zh217 wrote:
      | Thank you for your perspective.
      | 
      | Maybe I was confused about the case of using an executable
      | vs linking against a library. Let me double-check with a
      | few friends who understand copyright laws better than me.
      | If everything checks out, the next release will be under
      | LGPL.
      | 
      | About CLA: at the previous suggestion of a friend, the repo
      | currently requires a CLA (even though nobody outside has
      | contributed yet). This will be lifted once the situation
      | becomes clearer.
 
        | [deleted]
 
    | georgewfraser wrote:
    | Licensing under AGPL will make it hard for any startup to use
    | Cozo. Lawyers always ask about AGPL in venture financing
    | diligence and it is considered a red flag. You can argue that
    | they are wrong, the linking exception and so on, but you're
    | basically shouting into the wind.
 
    | ekidd wrote:
    | > If a community forms and after I understand the
    | consequences a little bit more, the license may change if the
    | community decides that it is better for the long-term good of
    | the project. For the moment though, it is staying AGPL.
    | 
    | Yes, I do want to be clear: I encourage you to use whatever
    | license you like. You wrote the code! I was just curious,
    | because it would also affect the license of any hypothetical
    | software I wrote that used the library.
    | 
    | Here's a _super oversimplified_ version of the main license
    | types (I am not a lawyer):
    | 
    | - Permissive: "Do whatever you want but don't sue me."
    | 
    | - LGPL: "If you give this library to other people, you must
    | 'share and share alike' the source and your changes to this
    | library."
    | 
    | - GPL: "If you use this code in your program, you must 'share
    | and share alike' your entire program, but only if you give
    | people copies of the program."
    | 
    | - AGPL: "If you use this code in your program, you must
    | 'share and share alike' your entire program with anyone who
    | can interact with it over a network."
    | 
    | The AGPL makes a ton of sense for an advanced database
    | _server,_ because otherwise AWS may make their own version
    | and run it on their servers as a paid service, without
    | contributing back.
    | 
    | But like I said, I'm simplifying way too much. Take a look at
    | the FSF's license descriptions and/or talk to a lawyer. This
    | shouldn't be stressful. Figure out what license supports the
    | kind of users and community you want, pick it, and don't look
    | back. :-)
    | 
    | (I may end up writing a super-simple non-persistent Datalog
    | at some point for an open source project. My needs are _much_
    | simpler than the things you support, anyways--I only ever
    | need to run one particular query.)
 
      | zh217 wrote:
      | I realized my mistake, as I said in the other comments. The
      | main repo is now under LGPL. I'll see what I'll do with the
      | bindings. Writing code is so much better than dealing with
      | licenses!
 
        | ekidd wrote:
        | Oh, cool!
        | 
        | And yeah, licenses can be challenging and frustrating,
        | especially the first time you release a major project.
        | 
        | I am really super excited by the idea of embedded Datalog
        | in Rust. I sometimes run into situations where I need
        | something that fits in that awkward gap between SQL and
        | Prolog. I want more expressiveness, better composability,
        | and better graph support than SQL. But I also want
        | finite-sized results that I can materialize in bounded
        | time.
        | 
        | There has been some very neat work with incrementally-
        | updated Datalog in the Rust community. For example, I
        | think Datafrog is really neat: https://github.com/frankmc
        | sherry/blog/blob/master/posts/2018... But it's great to
        | see more cool projects in this space, so thank you.
 
    | kylebarron wrote:
    | If I'm not mistaken that sounds more like LGPL than the AGPL?
 
      | zh217 wrote:
      | Maybe, and maybe I need to consult a lawyer someday to get
      | the facts straight. To tell you the truth my head hurts
      | when I attempt to understand what these licenses say.
      | Regardless, I intend this project to be true FOSS, the
      | "finer detail" of which FOSS license it uses may change.
 
        | mijoharas wrote:
        | My understanding is the same as kylebarron's[0] since you
        | lack linking protections (which you would get under
        | LGPL), so any work that includes cozo would be a "derived
        | work" under the (A)GPL. Interestingly there doesn't seem
        | to be an affero LGPL license[1], which could be what you
        | might want here.
        | 
        | Otherwise, simplest solution provided you want a copyleft
        | license would be to use the LGPL I think.
        | 
        | NOTE: not a lawyer.
        | 
        | [0] https://softwareengineering.stackexchange.com/questio
        | ns/1078...
        | 
        | [1] https://redmonk.com/dberkholz/2012/09/07/opening-the-
        | infrast... (old link, but I couldn't find anything since
        | then describing this kind of license?)
 
        | wizzwizz4 wrote:
        | We kinda do have it; it's just mostly useless, given the
        | linking clause. (Not entirely useless, though, as that
        | article sets out.)
        | 
        | GPL and AGPL have the same layout, so you can just take
        | the LGPL, and replace all references to 'GPL' and 'GNU
        | General Public License' with 'AGPL' and 'GNU Affero
        | General Public License'. Of course, you couldn't call
        | that license 'GNU ALGPL' or 'GNU LAGPL'; you'd have to
        | come up with your own name. (Disclaimer: I'm not a
        | lawyer, and I haven't checked this as thoroughly as I
        | would if I were going to use this for my own software.)
        | 
        | Maybe it's worth bothering Bradley M. Kuhn
        | (http://ebb.org/bkuhn/) again and seeing what the current
        | status of a Lesser AGPL is?
 
        | _frkl wrote:
        | That's a fair enough stance. I'd recommend not taking any
        | outside contributions until you are sure about the
        | license, since it'll make it much harder to change the
        | license if you do. Or maybe require all outside
        | contributions to be licensed very permissively, like
        | using the BSD license. Or you could use a CLA, but that's
        | not something I'd recommend. Either way, licensing is
        | hard :(. I can empathise with the head hurting.... Oh,
        | also, check out https://tldrlegal.com/ .
 
        | kapilvt wrote:
        | it's also odd then re the python bindings being MIT, as
        | the AGPL will convey throughout any aggregation or
        | library usage, as would GPL. the primary delta for GPL vs
        | AGPL is the intent of the latter for network offered
        | services, which in the context of an embedded library/db
        | is odd. rightly or wrongly many orgs will refuse to allow
        | usage of gpl/agpl software due to the licensing concerns
        | around the effects on the rest of their ip. duckdb
        | (embedded analytics sql) uses mit, etc. so in terms of
        | creating a "true foss" project ie a community of users
        | and contributors, it's definitely worth considering a
        | licensing change imho, but of course dealer's choice.
 
        | zh217 wrote:
        | OP here. Nothing about the license is final yet since
        | there are no outside contributors. I just changed the
        | main repo to LGPL, not because what I believed in
        | changed, but because it seems that I really misunderstood
        | the licenses.
 
    | dangoor wrote:
    | I am not a lawyer, but I work in an open source programs
    | office and am currently working specifically on open source
    | license compliance.
    | 
    | Beyond what the sibling comments have said about LGPL
    | sounding more like what you're going for, I'll just note that
    | if you'd like broad adoption of this while still ensuring
    | that changes to your code remain open, you might also want to
    | consider the Mozilla Public License.
    | 
    | From what I understand of MPL and LGPL, MPL is better for
    | instances where dynamic linking isn't possible. The MPL
    | basically says that any changes _to the files you created_
    | must be available under the MPL, preserving their public
    | availability.
    | 
    | That said, most organizations are fine with the LGPL, but it
    | just gets gnarly if there are instances where you really want
    | to statically link something but you still fully want to
    | support the original library's openness.
 
    | pie_flavor wrote:
    | AGPL is a variant of the GPL, not the LGPL. Meaning that
    | dynamic linking still constitutes (according to them) a
    | derivative work, meaning that even programs that dynamically
    | link against it must themselves be AGPL in their entirety.
    | Dynamic linking is also meaningfully complicated to do in
    | Rust, and this licensure of the crates.io crate will be a
    | footgun for anyone not using cargo-deny.
    | 
    | I think this is a very cool project, but its use of *GPL
    | essentially ensures I'm not going to use it for anything. If
    | you're planning on reducing it to LGPL, I'm not sure what the
    | GPL is getting you over going with the Rust standard license
    | set of MIT + Apache 2.0.
 
| jitl wrote:
| This is amazing!
| 
| Have you looked at differential-datalog? It's rust-based,
| maintained by VMWare, and has a very rich, well-typed Datalog
| language. differential-datalog is in-memory only right now, but
| could be ideal to integrate your graph as a datastore or disk
| spill cache.
| 
| https://github.com/vmware/differential-datalog
 
| abc3354 wrote:
| This looks nice!
| 
| Datascript seems to be another Datalog engine (in memory only)
| 
| https://github.com/tonsky/datascript
 
  | fsiefken wrote:
  | there are a few more, including ones supporting on disk
  | databases
  | https://en.wikipedia.org/wiki/Datalog#Systems_implementing_D...
 
| billylindeman wrote:
| This is amazing. I can't wait to play with it
 
| typon wrote:
| I have been meaning to do this exact project for 5 years at
| least. Congrats on making it happen - looking forward to using it
 
| stevesimmons wrote:
| This does look very nice!
| 
| Especially (from my point of view) having the Python interface.
| 
| What's the max practical graph size you anticipate?
 
  | zh217 wrote:
  | For the moment: you can have as much data as you want on disk
  | as long as the RocksDB storage engine can handle it, which I
  | believe is quite large. For any single query though, you want
  | all the data you touch to fit in memory. The good news is that
  | Rust is very efficient in using memory. This will be improved
  | in future versions.
  | 
  | For the built-in graph algorithms, you are also limited by the
  | algorithmic complexity, which for some of them is quite high
  | (notably betweenness centrality). There is nothing the database
  | can do to help in this case, though we may add some approximate
  | algorithms with lower complexities later.
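
To give a feel for why betweenness centrality is expensive, here is a
minimal sketch of Brandes' algorithm for unweighted graphs, a standard
exact method. It is an illustration only and not necessarily what Cozo
implements; the adjacency-dict input format is an assumption. It runs
one full breadth-first search per node, so the cost grows roughly with
(number of nodes) x (number of edges).

```python
from collections import deque


def betweenness(graph):
    """Exact betweenness for an unweighted graph given as {node: [neighbours]}.

    Scores count ordered (source, target) pairs, so for an undirected
    graph stored with both edge directions each pair is counted twice.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:                      # one BFS per source node
        order = []                       # nodes in non-decreasing distance from s
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        dist = dict.fromkeys(graph, -1)  # -1 means "not visited yet"
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:          # first time w is seen
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# A three-node path a - b - c, stored with both edge directions.
line = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(line))
```

Only the middle node `b` sits on a shortest path between two other
nodes, so it is the only one with a non-zero score.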
 
| pgt wrote:
| Good job! How to transact? The examples only show queries.
 
  | zh217 wrote:
  | Transactions are described in the manual: https://cozodb.github
  | .io/current/manual/stored.html#chaining....
  | 
  | Sorry about the docs being all over the place at the moment! My
  | only excuse is that Cozo is very young. The documentation (and
  | the implementation) still needs a lot of work!
 
| dwenzek wrote:
| Really nice!
| 
| I like the design choices of Datalog for the query language and
| Relations for the data model. This contrasts with the typical
| choices made for graph databases where the word graph seems to
| make _links_ a mandatory query and representation tool.
 
| philzook wrote:
| Very cool! I love the sqlite install everywhere model.
| 
| Could you compare use case with Souffle? https://souffle-
| lang.github.io/
| 
| I'd suggest putting the link to the docs more prominently on the
| github page
| 
| Is the "traditional" datalog `path(x,z) :- edge(x,y), path(y,z).`
| syntax not pleasant to the modern eye? I've grown to rather like
| it. Or is there something that syntax can't do?
| 
| I've been building a Datalog shim layer in python to bridge
| across a couple different datalog systems
| https://github.com/philzook58/snakelog (including a datalog built
| on top of the python sqlite bindings), so I should look into
| including yours
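
For readers new to the classical syntax, the recursive rule quoted
above can be read as a fixpoint computation. The sketch below is a
naive bottom-up evaluation in plain Python, not how Cozo, Souffle, or
snakelog evaluate rules. It also assumes a base rule
`path(x,y) :- edge(x,y).`, which the comment does not spell out.

```python
def transitive_closure(edges):
    """Naively evaluate:
         path(x,y) :- edge(x,y).            (assumed base rule)
         path(x,z) :- edge(x,y), path(y,z).
    `edges` is a set of (src, dst) tuples.
    """
    path = set(edges)  # base rule: every edge is a path
    while True:
        # Recursive rule: join edge(x,y) with path(y,z) on the shared y.
        derived = {
            (x, z)
            for (x, y) in edges
            for (y2, z) in path
            if y == y2
        }
        if derived <= path:  # nothing new derived: fixpoint reached
            return path
        path |= derived


edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
```

Real engines avoid re-deriving known facts on every round (semi-naive
evaluation), but the result is the same set of tuples.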
 
  | zh217 wrote:
  | I find nothing wrong with the classical syntax, but there is a
  | very practical, even stupid reason why the syntax is the way it
  | is now. As you can see from the tutorial
  | (https://nbviewer.org/github/cozodb/cozo-
  | docs/blob/main/tutor...), you can run Cozo in Jupyter notebooks
  | and mix it with Python code. This is the main way that I myself
  | interact with Cozo. Since I don't fancy writing an
  | unmaintainable mess of Jupyter frontend code that may become
  | obsolete in a few years, CozoScript had better look enough like
  | Python so as not to completely baffle the Jupyter syntax
  | highlighter. That's why the syntax for comments is `#`, not
  | `//`. That's also why the syntax for stored relation is
  | `*stored`, not `&stored` or `%stored`.
  | 
  | This was a hack from the beginning, but over time I grew to like
  | the syntax quite a bit. And hopefully, by being superficially
  | similar to Python or JS, it causes less confusion for new
  | users :)
 
    | philzook wrote:
    | Ah, that's very interesting. Thank you. `s.add(path(x,z) <=
    | edge(x,y) & path(y,z))` is what I chose as python syntax, but
    | it is clunkier.
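
One way an expression like `path(x,z) <= edge(x,y) & path(y,z)` can be
made to work in Python is operator overloading. The sketch below is a
guess at the general technique, not snakelog's actual implementation;
the `Atom`, `Body`, `Rule` classes and the `relation` helper are
hypothetical names. It relies on `&` binding more tightly than `<=` in
Python, so the body is assembled before the rule.

```python
class Body:
    """A conjunction of atoms (the right-hand side of a rule)."""

    def __init__(self, atoms):
        self.atoms = list(atoms)

    def __and__(self, other):
        extra = other.atoms if isinstance(other, Body) else [other]
        return Body(self.atoms + extra)


class Rule:
    """head :- body."""

    def __init__(self, head, body):
        self.head, self.body = head, body

    def __repr__(self):
        return f"{self.head!r} :- {', '.join(map(repr, self.body.atoms))}."


class Atom:
    """A relation applied to arguments, e.g. edge(x,y)."""

    def __init__(self, name, *args):
        self.name, self.args = name, args

    def __and__(self, other):   # atom & atom -> Body
        return Body([self]) & other

    def __le__(self, body):     # head <= body -> Rule
        if isinstance(body, Atom):
            body = Body([body])
        return Rule(self, body)

    def __repr__(self):
        return f"{self.name}({','.join(self.args)})"


def relation(name):
    """Return a constructor for atoms of the named relation."""
    return lambda *args: Atom(name, *args)


edge, path = relation("edge"), relation("path")
x, y, z = "x", "y", "z"   # variables are plain strings in this sketch

rule = path(x, z) <= edge(x, y) & path(y, z)
print(rule)
```

Printing the rule renders it back in the classical `head :- body.`
form, which is one way such a shim could target several Datalog
backends.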
 
    | samuell wrote:
    | Interesting! I'm thinking ... perhaps a small syntax
    | comparison for prolog/classical datalog vs cozo, would help
    | people used to the classical syntax quickly get started.
 
| packetlost wrote:
| This is very similar to the goals of a project I've been working
| on, though I've been focusing on the raw storage format
| (literally a drop-in replacement for RocksDB, so this could be
| interesting). I think datalog databases are _far_ underrated.
 
___________________________________________________________________
(page generated 2022-11-08 23:00 UTC)