[HN Gopher] Building a high performance JSON parser
___________________________________________________________________
 
Building a high performance JSON parser
 
Author : davecheney
Score  : 349 points
Date   : 2023-11-05 13:10 UTC (9 hours ago)
 
web link (dave.cheney.net)
w3m dump (dave.cheney.net)
 
| denysvitali wrote:
| Also interesting: https://youtu.be/a7VBbbcmxyQ
 
| eatonphil wrote:
| The walkthrough is very nice, how to do this if you're going to
| do it.
| 
| If you're going for pure performance in a production environment
| you might take a look at Daniel Lemire's work:
| https://github.com/simdjson/simdjson. Or the MinIO port of it to
| Go: https://github.com/minio/simdjson-go.
 
  | vjerancrnjak wrote:
  | If your JSON always looks the same you can also do better than
  | general JSON parsers.
 
    | lylejantzi3rd wrote:
    | Andreas Fredriksson demonstrates exactly that in this video:
    | https://vimeo.com/644068002
 
      | ykonstant wrote:
      | Very enjoyable video!
 
      | kaladin_1 wrote:
      | I really enjoyed this video even though he lost me with the
      | SIMD code.
 
    | diarrhea wrote:
    | I wonder: can fast, special-case JSON parsers be dynamically
    | autogenerated from JSON Schemas?
    | 
    | Perhaps some macro-ridden Rust monstrosity that spits out
    | specialised parsers at compile time, dynamically...
 
      | minhazm wrote:
      | For json schema specifically there are some tools like go-
      | jsonschema[1] but I've never used them personally. But you
      | can use something like ffjson[2] in go to generate a static
      | serialize/deserialize function based on a struct
      | definition.
      | 
      | [1] https://github.com/omissis/go-jsonschema [2]
      | https://github.com/pquerna/ffjson
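      | 
      | For illustration, tools in that family work from an ordinary
      | tagged struct; a hypothetical input (not actual tool output)
      | might be:
      | 
      |     // user.go -- the codegen tool then emits static
      |     // MarshalJSON/UnmarshalJSON methods for this type.
      |     package model
      | 
      |     type User struct {
      |         Name string   `json:"name"`
      |         Age  int      `json:"age,omitempty"`
      |         Tags []string `json:"tags"`
      |     }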
 
        | atombender wrote:
        | Hey, go-jsonschema is my project. (Someone else just took
        | over maintaining it, though.) It still relies on the
        | standard Go parser; all it does is generate structs with
        | the right types and tags.
 
      | galangalalgol wrote:
      | Doesn't the serde crate's json support do precisely this?
      | It generates structs that have optional in all the right
      | places and with all the right types anyway. Seems like the
      | LLVM optimiser can probably do something useful with that
      | even if the serde feature isn't using a priori knowledge out
      | of the schema.
 
      | dleeftink wrote:
      | Somewhat tangentially related, Fabian Iwand posted this
      | regex prefix tree visualiser/generator last week [0], which
      | may offer some inspiration for prototyping auto generated
      | schemas.
 
        | atombender wrote:
        | You forgot to include the link?
 
      | PartiallyTyped wrote:
      | Pydantic does that to some extent, I think.
 
      | marginalia_nu wrote:
      | A fundamental problem with JSON parsing is that it has
      | variable-length fields that don't encode their length, so in
      | a streaming scenario you basically need to keep resizing your
      | buffer until the data fits. If the data is on disk and not
      | streaming you may get away with reading ahead to find the
      | end of the field first, but that's also not particularly
      | fast.
      | 
      | Schemas can't fix that.
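      | 
      | A rough Go sketch of that resizing loop (names are made up,
      | escapes ignored, and it assumes the opening quote was already
      | consumed; imports "bytes" and "io" assumed):
      | 
      |     // readStringField appends from r until the closing quote
      |     // shows up, growing (and re-copying) buf as needed. It
      |     // re-scans from the start for brevity; a real parser
      |     // would track its position.
      |     func readStringField(r io.Reader) ([]byte, error) {
      |         var buf []byte
      |         tmp := make([]byte, 64)
      |         for {
      |             n, err := r.Read(tmp)
      |             buf = append(buf, tmp[:n]...) // may reallocate
      |             if i := bytes.IndexByte(buf, '"'); i >= 0 {
      |                 return buf[:i], nil
      |             }
      |             if err != nil {
      |                 return nil, err // ended before the field did
      |             }
      |         }
      |     }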
 
      | mhh__ wrote:
      | It's relatively common in D applications to use the compile-
      | time capabilities to generate a parser at compile time.
 
    | loeg wrote:
    | You might also move to something other than JSON if parsing
    | it is a significant part of your workload.
 
      | haswell wrote:
      | Most of the times I've had to deal with JSON performance
      | issues, it involved a 3rd party API and JSON was the only
      | option.
      | 
      | If you're building something net-new and know you'll have
      | these problems out the gate, something other than JSON
      | might be feasible, but the moment some other system not in
      | the closed loop needs to work with the data, you're back to
      | JSON and any associated perf issues.
 
  | fooster wrote:
  | Last time I compared the performance of various json parsers
  | the simd one turned out to be disappointingly slow.
 
  | Thaxll wrote:
  | The fastest json lib in Go is the one done by the company
  | behind Tiktok.
 
    | ken47 wrote:
    | Fastest at what?
 
      | cannonpalms wrote:
      | > For all sizes of json and all scenarios of usage, Sonic
      | performs best.
      | 
      | The repository has benchmarks
 
        | mananaysiempre wrote:
        | I'm not seeing simdjson in them though? I must be missing
        | something because the Go port of it is explicitly
        | mentioned in the motivation[1] (not the real thing,
        | though).
        | 
        | [1] https://github.com/bytedance/sonic/blob/main/docs/INT
        | RODUCTI...
 
    | rockinghigh wrote:
    | https://github.com/bytedance/sonic
 
    | pizzafeelsright wrote:
    | Excellent threat vector.
 
  | lionkor wrote:
  | simdjson has not been the fastest for a long long time
 
    | jzwinck wrote:
    | What is faster? According to
    | https://github.com/kostya/benchmarks#json nothing is.
 
| rexfuzzle wrote:
| Great to see a shout out to Phil Pearl! Also worth looking at
| https://github.com/bytedance/sonic
 
| Galanwe wrote:
| "Json" and "Go" seem antithetical in the same sentence as "high
| performance" to me. As long as we are talking about _absolute
| performance_.
| 
| When talking about _relative performance_, as in _compared to
| what can be done with a somewhat similar stack_, I feel like it
| should deserve a mention of what the stack is, otherwise there is
| no reference point.
 
  | bsdnoob wrote:
  | Did you even open the article? The following is literally in
  | the first paragraph:
  | 
  | > This package offers the same high level json.Decoder API but
  | higher throughput and reduced allocations
 
    | jcelerier wrote:
    | How does that contradict what the parent poster says? I think
    | it's very weird to call something "high performance" when it
    | looks like it's maybe 15-20% of the performance of a simdjson
    | in c++. This is not "going from normal performance to high
    | performance", this going from "very subpar" to "subpar"
 
      | willsmith72 wrote:
      | Ok but how many teams are building web APIs in C++?
 
        | TheCleric wrote:
        | I worked with a guy who did this. It was fast, but boy
        | howdy was it not simple.
 
      | lolinder wrote:
      | Par is different for different stacks. It's reasonable for
      | someone to treat their standard library's JSON parser as
      | "par", given that that's the parser that most of their
      | peers will be using, even if there are faster options that
      | are commonly used in other stacks.
 
      | Thaxll wrote:
      | Because even in C++ people don't use simdjson; most projects
      | use RapidJSON, which Go is on par with.
 
  | SmoothBrain12 wrote:
  | Exactly, it's not too hard to implement in C. The one I made
  | never copied data; instead it saved a pointer/length to the
  | data. The user only had to memory-map the file (or equivalent)
  | and pass that data into the parser. The only memory allocation
  | was for the JSON nodes.
  | 
  | This way they only paid the parsing tax (decoding doubles,
  | etc..) if the user used that data.
  | 
  | You hit the nail on the head
 
    | compressedgas wrote:
    | Like how https://github.com/zserge/jsmn works. I thought it
    | would be neat to have such a parser for
    | https://github.com/vshymanskyy/muon
 
    | mr_mitm wrote:
    | Did you publish the code somewhere? I'd be interested in
    | reading it.
 
    | eska wrote:
    | I also really like this paradigm. It's just that in old
    | crusty null-terminated C style this is really awkward because
    | the input data must be copied or modified. But it's not an
    | issue when using slices (length and pointer). Unfortunately
    | most of the C standard library and many operating system APIs
    | expect that.
    | 
    | I've seen this referred to as a pull parser in a Rust
    | library? (https://github.com/raphlinus/pulldown-cmark)
 
    | Aurornis wrote:
    | The linked article is a GopherCon talk.
    | 
    | The first line of the article explains the context of the
    | talk:
    | 
    | > This talk is a case study of designing an efficient Go
    | package.
    | 
    | The target audience and context are clearly Go developers.
    | Some of these comments are focusing too much on the headline
    | without addressing the actual article.
 
    | cryo wrote:
    | I've made a JSON parser which works like this too. No dynamic
    | memory allocations, similar to the JSMN parser but stricter
    | to the specification.
    | 
    | Always nice to be in control over memory :)
    | 
    | https://sr.ht/~cryo/cj
 
    | hgs3 wrote:
    | Yup and if your implementation uses a hashmap for object key
    | -> value lookup, then I recommend allocating the hashmap
    | _after_ parsing the object not during to avoid continually
    | resizing the hashmap. You can implement this by using an
    | intrusive linked list to track your key /value JSON nodes
    | until the time comes to allocate the hashmap. Basically when
    | parsing an object 1. use a counter 'N' to track the number of
    | keys, 2. link the JSON nodes representing key/value pairs
    | into an intrusive linked list, 3. after parsing the object
    | use 'N' to allocate a perfectly sized hashmap in one go. You
    | can then iterate over the linked list of JSON key/value pair
    | nodes adding them to the hashmap. You can use this same trick
    | when parsing JSON arrays to avoid continually resizing a
    | backing array. Alternatively, never allocate a backing array
    | and instead use the linked list to implement an iterator.
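    | 
    | A rough Go-flavoured sketch of the count-first idea (the node
    | type is hypothetical):
    | 
    |     // kvNode is one parsed "key": value pair, linked
    |     // intrusively via next.
    |     type kvNode struct {
    |         key   string
    |         value any
    |         next  *kvNode
    |     }
    | 
    |     // buildObject sizes the map once, after all pairs are
    |     // known, so it never has to grow/rehash mid-parse.
    |     func buildObject(head *kvNode, n int) map[string]any {
    |         m := make(map[string]any, n)
    |         for p := head; p != nil; p = p.next {
    |             m[p.key] = p.value
    |         }
    |         return m
    |     }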
 
    | duped wrote:
    | > The user only had to Memory Map the file (or equivalent)
    | 
    | Having done this myself, it's a massive cheat code because
    | your bottleneck is almost always i/o and memory mapped i/o is
    | orders of magnitude faster than sequential calls to read().
    | 
    | But that said it's not always appropriate. You can have
    | gigabytes of JSON to parse, and the JSON might be available
    | over the network, and your service might be running on a
    | small node with limited memory. Memory mapping here adds
    | quite a lot of latency and cost to the system. A very fast
    | streaming JSON decoder is the move here.
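    | 
    | For reference, the mapping step in Go is roughly this on a
    | Unix-like OS (error handling elided, imports "os" and
    | "syscall" assumed):
    | 
    |     f, _ := os.Open("big.json")
    |     fi, _ := f.Stat()
    |     data, _ := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
    |         syscall.PROT_READ, syscall.MAP_SHARED)
    |     defer syscall.Munmap(data)
    |     // data is the whole file as a []byte, paged in lazily,
    |     // and can be handed to a zero-copy parser directly.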
 
      | vlovich123 wrote:
      | > memory mapped i/o is orders of magnitude faster than
      | sequential calls to read()
      | 
      | That's not something I've generally seen. Any source for
      | this claim?
      | 
      | > You can have gigabytes of JSON to parse, and the JSON
      | might be available over the network, and your service might
      | be running on a small node with limited memory. Memory
      | mapping here adds quite a lot of latency and cost to the
      | system
      | 
      | Why does mmap add latency? I would think that mmap adds
      | more latency for small documents because the cost of doing
      | the mmap is high (cross CPU TLB shoot down to modify the
      | page table) and there's no chance to amortize. Relatedly,
      | there's minimal to no relation between SAX vs DOM style
      | parsing and mmap - you can use either with mmap. If you're
      | not aware, you do have some knobs with mmap to hint to the
      | OS how it's going to be used although it's very unwieldy to
      | configure it to work well.
 
        | duped wrote:
        | Experience? Last time I made that optimization it was
        | 100x faster, ballpark. I don't feel like benchmarking it
        | right now, try yourself.
        | 
        | The latency comes from the fact you need to have the
        | whole file. The use case I'm talking about is a JSON
        | document you need to pull off the network because it
        | doesn't exist on disk, might not fit there, and might not
        | fit in memory.
 
        | vlovich123 wrote:
        | > Experience? Last time I made that optimization it was
        | 100x faster, ballpark. I don't feel like benchmarking it
        | right now, try yourself.
        | 
        | I have. Many times. There's definitely not a 100x
        | difference given that normal file I/O can easily saturate
        | NVMe throughput. I'm sure it's possible to build a repro
        | showing a 100x difference, but you have to be doing
        | something intentionally to cause that (e.g. using a very
        | small read buffer so that you're doing enough syscalls
        | that it shows up in a profile).
        | 
        | > The latency comes from the fact you need to have the
        | whole file
        | 
        | That's a whole other matter. But again, if you're pulling
        | it off the network, you usually can't mmap it anyway
        | unless you're using a remote-mounted filesystem (which
        | will add more overhead than mmap vs buffered I/O).
 
        | duped wrote:
        | I think you misunderstood my point, which was to
        | highlight exactly when mmap won't work....
 
        | ben-schaaf wrote:
        | In my experience mmap is at best 50% faster compared to
        | good pread usage on Linux and MacOS.
 
  | Aurornis wrote:
  | > "Json" and "Go" seem antithetical in the same sentence as
  | "high performance" to me
  | 
  | Absolute highest performance is rarely the highest priority in
  | designing a system.
  | 
  | Of course we could design a hyper-optimized, application
  | specific payload format and code the deserializer in assembly
  | and the performance would be great, but it wouldn't be useful
  | outside of very specific circumstances.
  | 
  | In most real world projects, performance of Go and JSON is fine
  | and allows for rapid development, easy implementation, and
  | flexibility if anything changes.
  | 
  | I don't think it's valid to criticize someone for optimizing
  | within their use case.
  | 
  | > I feel like it should deserve a mention of what the stack is,
  | otherwise there is no reference point.
  | 
  | The article clearly mentions that this is a GopherCon talk in
  | the header. It was posted on Dave Cheney's website; Cheney is a
  | well-known figure in the Go community.
  | 
  | It's clearly in the context of Go web services, so I don't
  | understand your criticisms. The context is clear from the
  | article.
  | 
  | The _very first line_ of the article explains the context:
  | 
  | > This talk is a case study of designing an efficient Go
  | package.
 
  | riku_iki wrote:
  | > "Json" and "Go" seem antithetical in the same sentence as
  | "high performance" to me. As long as we are talking about
  | _absolute performance_.
  | 
  | Even just "Json" is problematic here as wire protocol for
  | absolute performance no matter what will be programming
  | language.
 
    | 3cats-in-a-coat wrote:
    | I think people say that as they give disproportional weight
    | to the fact it's text-based, while ignoring how astoundingly
    | simple and linear it is to write and read.
    | 
    | The only way to nudge the needle is to start exchanging
    | direct memory dumps, which is what ProtoBuff and the like do.
    | But this is clearly only for very specific use.
 
      | riku_iki wrote:
      | > while ignoring how astoundingly simple and linear it is
      | to write and read.
      | 
      | The code may be simple, but there are lots of performance
      | penalties: resolving field keys, and constructing complicated
      | data structures through memory allocations, which is
      | expensive.
      | 
      | > to start exchanging direct memory dumps, which is what
      | ProtoBuff and the like do
      | 
      | Protobuf actually does parsing; it's just a binary format.
      | What you're describing is more like FlatBuffers.
      | 
      | > But this is clearly only for very specific use.
      | 
      | yes, specific use is high performance computations )
 
  | crazygringo wrote:
  | JSON is often the format of an externally provided data source,
  | and you don't have a choice.
  | 
  | And whatever language you're writing in, you usually want to do
  | what you can to maximize performance. If your JSON input is 500
  | bytes it probably doesn't matter, but if you're intaking a 5 MB
  | JSON file then you can definitely be sure the performance does.
  | 
  | What more do you need to know about "the stack" in this case?
  | It's whenever you need to ingest large amounts of JSON in Go.
  | Not sure what could be clearer.
 
  | 3cats-in-a-coat wrote:
  | JSON is probably the fastest serialization format to produce
  | and parse that is also safe for public use, compared to
  | binary formats which often have fragile, highly specific and
  | vulnerable encoding as they're directly plopped into memory and
  | used as-is (i.e. they're not parsed at all, it's just two
  | computers exchanging memory dumps).
  | 
  | Compare it with XML for example, which is a nightmare of
  | complexity if you actually follow the spec and not just make
  | something XML-like.
  | 
  | We have some formats which try to walk the boundary between
  | safe/universal and fast like ASN.1 but those are obscure at
  | best.
 
    | alpaca128 wrote:
    | I prefer msgpack if the data contains a lot of numeric
    | values. Representing numbers as strings like in JSON can blow
    | up the size and msgpack is usually just as simple to use.
 
  | tptacek wrote:
  | This is an article about optimizing a JSON parser in Go.
  | 
  | As always, try to remember that people usually aren't writing
  | (or posting their talks) specifically for an HN audience.
  | Cheney clearly has an audience of Go programmers; that's the
  | space he operates in. He's not going to title his post, which
  | he didn't make for HN, just to avoid a language war on the
  | threads here.
  | 
  | It's our responsibility to avoid the unproductive language war
  | threads, not this author's.
 
  | hoosieree wrote:
  | Even sticking within the confines of json there's low hanging
  | fruit, e.g. if your data is typically like:
  | [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
  | 
  | You can amortize the overhead of the keys by turning this from
  | an array of structs (AoS) into a struct of arrays (SoA):
  | {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
  | 
  | Then you only have to read "k1" and "k2" once, instead of _once
  | per record_. Presumably there will be the odd record that
  | contains something like { "k3": 0} but you can use mini batches
  | of SoA and tune their size according to your desired
  | latency/throughput tradeoff.
  | 
  | Or if your data is 99.999% of the time just pairs of k1 and k2,
  | turn them into tuples:
  | 
  |       {"k1k2": [true,[2,3,4],false,[], ...]}
  | 
  | And then 0.001% of the time you send a lone k3 message:
  | {"k3": 2}
  | 
  | Even if your endpoints can't change their schema, you can still
  | trade latency for throughput by doing the SoA conversion,
  | transmitting, then converting back to AoS at the receiver.
  | Maybe worthwhile if you have to forward the message many times
  | but only decode it once.
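  | 
  | A rough Go sketch of the AoS -> SoA conversion on the sending
  | side (the record type is made up):
  | 
  |     type rec struct {
  |         K1 bool  `json:"k1"`
  |         K2 []int `json:"k2"`
  |     }
  | 
  |     type batch struct {
  |         K1 []bool  `json:"k1"`
  |         K2 [][]int `json:"k2"`
  |     }
  | 
  |     func toSoA(rs []rec) batch {
  |         var b batch
  |         for _, r := range rs {
  |             b.K1 = append(b.K1, r.K1)
  |             b.K2 = append(b.K2, r.K2)
  |         }
  |         return b
  |     }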
 
  | kunley wrote:
  | "Antiethical".
  | 
  | Is it true that mindlessly bashing Go became some kind of a new
  | religion in some circles?
 
| kevingadd wrote:
| I'm surprised there's no way to say 'I really mean it, inline
| this function' for the stuff that didn't inline because it was
| too big.
| 
| The baseline whitespace count/search operation seems like it
| would be MUCH faster if you vectorized it with SIMD, but I can
| understand that being out of scope for the author.
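| 
| For context, the scalar baseline being discussed is roughly this
| kind of per-byte loop (lookup-table variant); a SIMD version would
| classify 16-64 bytes per iteration instead:
| 
|     // ws[b] is true for the four JSON whitespace bytes.
|     var ws = [256]bool{
|         ' ': true, '\t': true, '\n': true, '\r': true,
|     }
| 
|     func countWhitespace(buf []byte) int {
|         n := 0
|         for _, b := range buf {
|             if ws[b] {
|                 n++
|             }
|         }
|         return n
|     }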
 
  | mgaunard wrote:
  | Of course you can force-inline.
 
    | cbarrick wrote:
    | Obviously you can manually inline functions. That's what
    | happened in the article.
    | 
    | The comment is about having a directive or annotation to make
    | the compiler inline the function for you, which Go does not
    | have. IMO, the pre-inline code was cleaner to me. It's a
    | shame that the compiler could not optimize it.
    | 
    | There was once a proposal for this, but it's really against
    | Go's design as a language.
    | 
    | https://github.com/golang/go/issues/21536
 
      | mgaunard wrote:
      | You can in any systems programming language.
      | 
      | Go is mostly a toy language for cloud people.
 
        | cbarrick wrote:
        | > toy language
        | 
        | You may be surprised to hear that Go is used in a ton of
        | large scale critical systems.
 
        | mgaunard wrote:
        | I don't consider cloud technology a critical system.
 
| peterohler wrote:
| You might want to take a look at https://github.com/ohler55/ojg.
| It takes a different approach with a single pass parser. There
| are some performance benchmarks included on the README.md landing
| page.
 
| arun-mani-j wrote:
| I remember reading a SO question which asked for a C library to
| parse JSON. A comment was like - C developers won't use a library
| for JSON, they will write one themselves.
| 
| I don't know how "true" that comment is but I thought I should
| try to write a parser myself to get a feel :D
| 
| So I wrote one, in Python - https://arunmani.in/articles/silly-
| json-parser/
| 
| It was a delightful experience though, writing and testing to
| break your own code with a variety of inputs. :)
 
  | xoac wrote:
  | Good for you but what does this have to do with the article?
 
  | janmo wrote:
  | I wrote a small JSON parser in C myself which I called jsoncut.
  | It just cuts out a certain part of a json file. I deal with
  | large JSON files, but want only to extract and parse certain
  | parts of it. All libraries I tried parse everything, use a lot
  | of RAM and are slow.
  | 
  | Link here, if interested to have a look:
  | https://github.com/rgex/jsoncut
 
    | vlovich123 wrote:
    | The words you're looking for are SAX-like JSON parser or
    | streaming JSON parser. I don't know if there are any command
    | line tools like the one you wrote that use it, though, to
    | provide a jq-like interface.
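    | 
    | In Go, the stock building block for this style is
    | json.Decoder's Token API, which streams tokens and lets you
    | stop early; a rough sketch (findKey is a made-up helper,
    | imports "encoding/json" and "io" assumed):
    | 
    |     // findKey stops reading as soon as the wanted string shows
    |     // up. Note: Token returns strings for both keys and string
    |     // values, so a real version would also track key position.
    |     func findKey(r io.Reader, want string) (bool, error) {
    |         dec := json.NewDecoder(r)
    |         for {
    |             tok, err := dec.Token()
    |             if err == io.EOF {
    |                 return false, nil
    |             }
    |             if err != nil {
    |                 return false, err
    |             }
    |             if s, ok := tok.(string); ok && s == want {
    |                 return true, nil // rest of stream never parsed
    |             }
    |         }
    |     }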
 
      | janmo wrote:
      | I tried JQ and other command line tools, all were extremely
      | slow and seemed to always parse the entire file.
      | 
      | My parser just reads the file byte by byte until it finds
      | the target, then outputs the content. When that's done it
      | stops reading the file, meaning that it can be extremely
      | fast when the targeted information is at the beginning of
      | the JSON file.
 
        | vlovich123 wrote:
        | You're still describing a SAX parser (i.e. streaming). jq
        | doesn't use a SAX parser because it's a multi-pass
        | document editor at its core, hence why I said "jq-like"
        | in terms of supporting a similar syntax for single-pass
        | queries. If you used RapidJSON's SAX parser in the body
        | of your custom code (returning false once you found what
        | you're looking for), I'm pretty sure it would
        | significantly outperform your custom hand-rolled code. Of
        | course, your custom code is very small with no external
        | dependencies and presumably fast enough, so tradeoffs.
 
  | masklinn wrote:
  | > I remember reading a SO question which asks for a C library
  | to parse JSON. A comment was like - C developers won't use a
  | library for JSON, they will write one themselves.
  | 
  | > I don't know how "true" that comment is
  | 
  | Either way it's a good way to get a pair of quadratic loops in
  | your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-
  | loading-times...
 
  | jjice wrote:
| I guess there are only so many ways to write a JSON parser,
| because one I wrote on a train in Python looks very similar!
  | 
  | I thought it would be nice and simple but it really was still
  | simpler than I expected. It's a fantastic spec if you need to
  | throw one together yourself, without massive performance
  | considerations.
 
| visarga wrote:
| Nowadays I am more interested in a "forgiving" JSON/YAML parser
| that would recover from LLM errors. Is there such a thing?
 
  | kevingadd wrote:
  | If the LLM did such a bad job that the syntax is wrong, do you
  | really trust the data inside?
  | 
  | Forgiving parsers/lexers are common in language compilers for
  | languages like Rust or C# or TypeScript; you may want to
  | investigate TypeScript in particular since it's applicable to
  | JSON syntax. Maybe you could repurpose their parser.
 
  | RichieAHB wrote:
  | I feel like trying to infer valid JSON from invalid JSON is a
  | recipe for garbage. You'd probably be better off doing a second
  | pass with the "JSON" through the LLM but, as the sibling
  | commenter said, at this point even the good JSON may be garbage
  | ...
 
  | _dain_ wrote:
  | halloween was last week
 
  | explaininjs wrote:
  | Perhaps not quite what you're asking for, but along the same
  | lines there's this "Incomplete JSON" parser, which takes a
  | string of JSON as it's coming out of an LLM and parses it into
  | as much data as it can get. Useful for building streaming UI's,
  | for instance it is used on https://rexipie.com quite
  | extensively.
  | 
  | https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
  | 
  | Some example test cases:
  | 
  |       { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
  |       { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
  |       { input: "[{},", output: [{}] },
  |       { input: "[{},1", output: [{}, 1] },
  |       { input: '[{},"', output: [{}, ""] },
  |       { input: '[{},"abc', output: [{}, "abc"] },
  | 
  | Work could be done to optimize it, for instance add streaming
  | support. But the cycles consumed either way are minimal for
  | LLM-output-length-constrained JSON.
  | 
  | Fun fact: as best I can tell, GPT-4 is entirely unable to
  | synthesize code to accomplish this task. Perhaps that will
  | change as this implementation is made public, I do not know.
 
  | gurrasson wrote:
  | The jsonrepair tool https://github.com/josdejong/jsonrepair
  | might interest you. It's tailored to fix JSON strings.
  | 
  | I've been looking into something similar for handling partial
  | JSONs, where you only have the first n chars of a JSON. This is
  | common with LLMs that stream their output to reduce latency. If
  | one knows the JSON schema ahead of time, one can start
  | processing these first fields before the remaining data has
  | fully loaded. If you have to wait for the whole thing to load
  | there is little point in streaming.
  | 
  | Was looking for a library that could do this parsing.
 
    | explaininjs wrote:
    | See my sibling comment :)
 
| mannyv wrote:
| These are always interesting to read because you get to see
| runtime quirks. I'm surprised there was so much function call
| overhead, for example. And it's interesting that you can bypass
| range checking.
| 
| The most important thing, though, is the process: measure then
| optimize.
 
| isuckatcoding wrote:
| This is fantastically useful.
| 
| Funny enough I stumbled upon your article just yesterday through
| google search.
 
| mgaunard wrote:
| "It's unrealistic to expect to have the entire input in memory"
| -- wrong for most applications
 
  | isuckatcoding wrote:
  | Yes but for applications where you need to do ETL style
  | transformations on large datasets, streaming is an immensely
  | useful strategy.
  | 
  | Sure you could argue go isn't the right tool for the job but I
  | don't see why it can't be done with the right optimizations
  | like this effort.
 
    | dh2022 wrote:
    | If performance is important why would you keep large datasets
    | in JSON format?
 
      | querulous wrote:
      | sometimes it's not your data
 
      | isuckatcoding wrote:
      | Usually because the downstream service or store needs it
 
      | Maxion wrote:
      | Because you work at or for some bureaucratic MegaCorp, that
      | does weird things with no real logic behind it other than
      | clueless Dilbert managers making decisions based on
      | LinkedIn blogs. Alternatively desperate IT consultants
      | trying to get something to work with too low of a budget
      | and/or no access to do things the right way.
      | 
      | Be glad you have JSON to parse, and not EDI, some custom
      | delimited data format (with no or outdated documentation) - or
      | _shudders_ you work in the airline industry with SABRE.
 
  | capableweb wrote:
  | https://yourdatafitsinram.net/
 
  | mannyv wrote:
  | If you're building a library you either need to explicitly call
  | out your limits or do streaming.
  | 
  | I've pumped gigs of JSON data, so a streaming parser is
  | appreciated. Plus streaming shows the author is better at
  | engineering and is aware of the various use cases.
  | 
  | Memory is not cheap or free except in theory.
 
    | jjeaff wrote:
    | I guess it's all relative. Memory is significantly cheaper if
    | you get it anywhere but on loan from a cloud provider.
 
      | mannyv wrote:
      | RAM is always expensive no matter where you get it from.
      | 
      | Would you rather do two hours of work or force thousands of
      | people to buy more RAM because your library is a memory
      | hog?
      | 
      | And on embedded systems RAM is at a premium. More RAM = more
      | cost.
 
  | e12e wrote:
  | If you can live with "fits on disk" mmap() is a viable option?
  | Unless you truly need streaming (early handling of early data,
  | like a stream of transactions/operations from a single JSON
  | file?)
 
    | mannyv wrote:
    | In general, JSON comes over the network, so MMAP won't really
    | work unless you save to a file. But then you'll run out of
    | disk space.
    | 
    | I mean, you have a 1k, 2k, 4k buffer. Why use more, because
    | it's too much work?
 
  | ahoka wrote:
  | Most applications read JSONs from networks, where you have a
  | stream. Buffering and fiddling with the whole request in memory
  | increases latency by a lot, even if your JSON is smallish.
 
    | mgaunard wrote:
    | On a carefully built WebSocket server you would ensure your
    | WebSocket messages all fit within a single MTU.
 
    | Rapzid wrote:
    | Most( _most_ ) JSON payloads are probably much smaller than
    | many buffer sizes so just end up all in memory anyway.
 
| jensneuse wrote:
| I've taken a very similar approach and built a GraphQL tokenizer
| and parser (amongst many other things) that also makes zero memory
| allocations and is quite fast. In case you'd like to check out the
| code: https://github.com/wundergraph/graphql-go-tools
 
  | markl42 wrote:
  | How big of an issue is this for GQL servers where all queries
  | are known ahead of time (allowlist) - i.e. you can
  | cache/memoize the AST parsing, so this is only a perf issue
  | for a few minutes after the container starts up?
  | 
  | Or does this bite us in other ways too?
 
    | jensneuse wrote:
    | I've been building GraphQL API gateways / routers for 5+ years
    | now. It would be nice if trusted documents or persisted
    | operations were the default, but the reality is that a lot of
    | people want to open up their GraphQL to the public. For that
    | reason we've built a fast parser, validator, normalizer and
    | many other things to support these use cases.
 
| nwpierce wrote:
| Writing a json parser is definitely an educational experience. I
| wrote one this summer for my own purposes that is decently fast:
| https://github.com/nwpierce/jsb
 
| lamontcg wrote:
| Wish I wasn't 4 or 5 uncompleted projects deep right now and had
| the time to rewrite a monkey parser using all these tricks.
 
| jchw wrote:
| Looks pretty good! Even though I've written far too many JSON
| parsers already in my career, it's really nice to have a
| reference for how to think about making a reasonable, fast JSON
| parser, going through each step individually.
| 
| That said, I will say one thing: you don't _really_ need to have
| an explicit tokenizer for JSON. You can get rid of the concept of
| tokens and integrate parsing and tokenization _entirely_. This is
| what I usually do since it makes everything simpler. This is a
| lot harder to do with something like the rest of ECMAScript, since
| in something like ECMAScript you wind up needing look-ahead
| (sometimes arbitrarily large look-ahead... consider arrow
| functions: it's mostly a subset of the grammar of a
| parenthesized expression. Comma is an operator, and for default
| values, equal is an operator. It isn't until the => does or does
| not appear that you know for sure!)
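| 
| A minimal runnable sketch of what "no separate tokenizer" means:
| parse functions dispatch straight on the next byte, with no token
| stream in between. It only handles arrays and true/false/null,
| just to show the shape:
| 
|     package main
| 
|     import (
|         "errors"
|         "fmt"
|         "strings"
|     )
| 
|     type parser struct {
|         s   string
|         pos int
|     }
| 
|     func isWS(c byte) bool {
|         return c == ' ' || c == '\t' || c == '\n' || c == '\r'
|     }
| 
|     func (p *parser) skipWS() {
|         for p.pos < len(p.s) && isWS(p.s[p.pos]) {
|             p.pos++
|         }
|     }
| 
|     func (p *parser) lit(word string, v any) (any, error) {
|         if strings.HasPrefix(p.s[p.pos:], word) {
|             p.pos += len(word)
|             return v, nil
|         }
|         return nil, errors.New("bad literal")
|     }
| 
|     // parseValue looks at one byte and goes straight to the right
|     // parse function -- there is no intermediate token type.
|     func (p *parser) parseValue() (any, error) {
|         p.skipWS()
|         if p.pos >= len(p.s) {
|             return nil, errors.New("unexpected end")
|         }
|         switch p.s[p.pos] {
|         case '[':
|             return p.parseArray()
|         case 't':
|             return p.lit("true", true)
|         case 'f':
|             return p.lit("false", false)
|         case 'n':
|             return p.lit("null", nil)
|         default:
|             return nil, errors.New("not in this sketch")
|         }
|     }
| 
|     func (p *parser) parseArray() (any, error) {
|         p.pos++ // consume '['
|         var out []any
|         for {
|             p.skipWS()
|             if p.pos < len(p.s) && p.s[p.pos] == ']' {
|                 p.pos++
|                 return out, nil
|             }
|             v, err := p.parseValue()
|             if err != nil {
|                 return nil, err
|             }
|             out = append(out, v)
|             p.skipWS()
|             if p.pos < len(p.s) && p.s[p.pos] == ',' {
|                 p.pos++
|             }
|         }
|     }
| 
|     func main() {
|         p := &parser{s: "[true, [null, false]]"}
|         fmt.Println(p.parseValue())
|     }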
 
  | coldtea wrote:
  | What line of work are you in that you've "written far too many
  | JSON parsers already" in your career?!!!
 
    | craigching wrote:
    | Probably anywhere that requires parsing large JSON documents.
    | Off the shelf JSON parsers are notoriously slow on large JSON
    | documents.
 
      | ahoka wrote:
      | Not necessarily, for example Newtonsoft is fine with
      | multiple hundreds of megabytes if you use it correctly. But
      | of course depends on how large we are talking about.
 
    | jchw wrote:
    | Reasons differ. C++ is a really hard place to be. It's gotten
    | better, but if you can't tolerate exceptions, need code that
    | is as-obviously-memory-safe-as-possible, or need to parse
    | incrementally (think SAX style), off-the-shelf options like
    | jsoncpp may not fit the bill.
    | 
    | Handling large documents is indeed another big one. It _sort-
    | of_ fits in the same category as being able to parse
    | incrementally. That said, Go has a JSON scanner you can sort
    | of use for incremental parsing, but in practice I've found
    | it to be a lot slower, so for large documents it's a problem.
    | 
    | I've done a couple in hobby projects too. One time I did a
    | partial one in Win32-style C89 because I wanted one that
    | didn't depend on libc.
 
    | lgas wrote:
    | Someone misunderstood the JSONParserFactory somewhere along
    | the line.
 
    | marcosdumay wrote:
    | I've seen "somebody doesn't agree with the standard and we
    | must support it" way too many times, and I've written JSON
    | parsers because of this. (And, of course, it's easy to get
    | some difference with the JSON standard.)
    | 
    | I've had problems with handling streams like the OP on
    | basically every programming language and data-encoding
    | language pair that I've tried. It looks like nobody ever
    | thinks about it (I do use chunking any time I can, but
    | sometimes you can't).
    | 
    | There are probably lots and lots of reasons to write your own
    | parser.
 
      | jbiggley wrote:
      | This reminds me of my favourite quote about standards.
      | 
      | >The wonderful thing about standards is that there are so
      | many of them to choose from.
      | 
      | And, keeping with the theme, this quote may be from Grace
      | Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.
 
| evmar wrote:
| In n2[1] I needed a fast tokenizer and had the same "garbage
| factory" problem, which is basically that there's a set of
| constant tokens (like json.Delim in this post) and then strings
| which cause allocations.
| 
| I came up with what I think is a kind of neat solution, which is
| that the tokenizer is generic over some T and takes a function
| from byteslice to T and uses T in place of the strings. This way,
| when the caller has some more efficient representation available
| (like one that allocates less) it can provide one, but I can
| still unit test the tokenizer with the identity function for
| convenience.
| 
| In a sense this is like fusing the tokenizer with the parser at
| build time, but the generic allows layering the tokenizer such
| that it doesn't know about the parser's representation.
| 
| [1] https://github.com/evmar/n2
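| 
| n2 itself is Rust, but in Go terms the shape is roughly this
| (names made up):
| 
|     // Tokenizer is generic over the string representation T. The
|     // caller supplies intern, which turns a byte slice into
|     // whatever T it wants: an interned handle, an arena index, or
|     // just a string.
|     type Tokenizer[T any] struct {
|         intern func([]byte) T
|     }
| 
|     type Token[T any] struct {
|         Kind byte // e.g. '{', '[', ':', 's' for string, ...
|         Text T    // only meaningful for string-ish tokens
|     }
| 
|     // For unit tests, an identity-style intern keeps plain strings.
|     func newStringTokenizer() *Tokenizer[string] {
|         return &Tokenizer[string]{
|             intern: func(b []byte) string { return string(b) },
|         }
|     }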
 
| suzzer99 wrote:
| Can someone explain to me why JSON can't have comments or
| trailing commas? I really hope the performance gains are worth
| it, because I've lost 100s of man-hours to those things, and had
| to resort to stuff like this in package.json:
| "IMPORTANT: do not run the scripts below this line, they are for
| CICD only": true,
 
  | coldtea wrote:
  | It can't have comments because it didn't originally have
  | comments, so now it's too late. And it originally didn't have
  | comments because Douglas Crockford thought they could be abused
  | for parsing instructions.
  | 
  | As for not having trailing commas, it's probably a less
  | intentional bad design choice.
  | 
  | That said, if you want trailing commas and comments, and control
  | parsers that will be used for your JSON, then use JSONC (JSON
  | with comments). VSCode for example does that for its JSON
  | configuration.
 
    | explaininjs wrote:
    | JSONC also supports trailing commas. It is, in effect, "JSON
    | with no downsides".
    | 
    | TOML/Yaml always drive me batty with all their obscure
    | special syntax. Whereas it's almost impossible to look at a
    | formatted blob of JSON and not have a very solid
    | understanding of what it represents.
    | 
    | The one thing I _might_ add is multiline strings with backticks,
    | but even that is probably more trouble than it's worth, as
    | you immediately start going down the path of "well let's also
    | have syntax to strip the indentation from those strings,
    | maybe we should add new syntax to support raw strings, ..."
 
    | tubthumper8 wrote:
    | Does JSONC have a specification or formal definition? People
    | have suggested[1] using JSON5[2] instead for that reason
    | 
    | [1] https://github.com/microsoft/vscode/issues/100688
    | 
    | [2] https://spec.json5.org/
 
      | mananaysiempre wrote:
      | Unfortunately, JSON5 says keys can be ES5
      | IdentifierName[1]s, which means you must carry around
      | Unicode tables. This makes it a non-option for small
      | devices, for example. (I mean, not really, you technically
      | _could_ fit the necessary data and code in low single-digit
      | kilobytes, but it feels stupid that you have to. Or you
      | could just not do that but then it's no longer JSON5 and
      | what was the point of having a spec again?)
      | 
      | [1] https://es5.github.io/x7.html#x7.6
 
    | mananaysiempre wrote:
    | Amusingly, it originally _did_ have comments. Removing
    | comments was the one change Crockford ever made to the
    | spec[1].
    | 
    | [1] https://web.archive.org/web/20150105080225/https://plus.g
    | oog... (thank you Internet Archive for making Google's social
    | network somewhat accessible and less than useless)
 
  | shepherdjerred wrote:
  | I don't know the historic reason why it wasn't included in the
  | original spec, but at this point it doesn't matter. JSON is
  | entrenched and not going to change.
  | 
  | If you want comments, you can always use jsonc.
 
  | semiquaver wrote:
  | It's not that it "can't", more like it "doesn't". Douglas
  | Crockford prioritized simplicity when specifying JSON. Its BNF
  | grammar famously fits on one side of a business card.
  | 
  | Other flavors of JSON that include support for comments and
  | trailing commas exist, but they are reasonably called by
  | different names. One of these is YAML (mostly a superset of
  | JSON). To some extent the difficulties with YAML (like unquoted
  | 'no' being a synonym for false) have vindicated Crockford's
  | priorities.
 
| forrestthewoods wrote:
| > Any (useful) JSON decoder code cannot go faster than this.
| 
| That line feels like a troll. Cunningham's Law in action.
| 
| You can definitely go faster than 2 Gb/sec. In a word, SIMD.
 
  | shoo wrote:
  | we could re-frame by distinguishing problem statements from
  | implementations
  | 
  | Problem A: read a stream of bytes, parse it as JSON
  | 
  | Problem B: read a stream of bytes, count how many bytes match a
  | JSON whitespace character
  | 
  | Problem B should require fewer resources* to solve than problem
  | A. So in that sense problem B is a relaxation of problem A, and
  | a highly efficient implementation of problem B should be able
  | to process bytes much more efficiently than an "optimal"
  | implementation of problem A.
  | 
  | So in this sense, we can probably all agree with the author
  | that counting whitespace bytes is an easier problem than the
  | full parsing problem.
  | 
  | We're agreed that the author's implementation (half a page of
  | go code that fits on a talk slide) to solve problem B isn't the
  | most efficient way to solve problem B.
  | 
  | I remember reading somewhere the advice that to set a really
  | solid target for benchmarking, you should avoid measuring the
  | performance of implementations and instead try to estimate a
  | theoretical upper bound on performance, based on say a
  | simplified model of how the hardware works and a simplification
  | of the problem -- that hopefully still captures the essence of
  | what the bottleneck is. Then you can compare any implementation
  | to that (unreachable) theoretical upper bound, to get more of
  | an idea of how much performance is still left on the table.
  | 
  | * for reasonably boring choices of target platform, e.g. amd64
  | + ram, not some hypothetical hardware platform with
  | surprisingly fast dedicated support for JSON parsing and bad
  | support for anything else.
 
| ncruces wrote:
| It's possible to improve over the standard library with better
| API design, but it's not really possible to do a fully streaming
| parser that doesn't half fill structures before finding an error
| and bailing out in the middle, which is another explicit design
| constraint for the standard library.
 
| hintymad wrote:
| How is this compared to Daniel Lemire's simdjson?
| https://github.com/simdjson/simdjson
 
| 1vuio0pswjnm7 wrote:
| "But there is a better trick that we can use that is more space
| efficient than this table, and is sometimes called a computed
| goto."
| 
| From 1989:
| 
| https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
| 
| "Indirection in the Goto field is a more powerful version of the
| computed Goto which appears in some languages. It allows a
| program to quickly perform a multi-way control branch based on an
| item of data."
 
| wood_spirit wrote:
| My own lessons from writing fast JSON parsers are mostly
| language-specific, but here are some generalisations:
| 
| Avoid heap allocations in tokenising. Have a tokeniser that is a
| function that returns a stack-allocated struct or an int64 token
| that is a packed field describing the start, length and type
| offsets etc of the token.
| 
| Avoid heap allocations in parsing: support a getString(key
| String) type interface for clients that want to chop up a buffer.
| 
| For deserialising to objects where you know the fields at compile
| time, generate a switch on key length before comparing string
| values.
| 
| My experience in data pipelines that process lots of json is that
| choice of json library can be a 3-10x performance difference and
| that all the main parsers want to allocate objects.
| 
| If the classes you are serialising or deserialising are known at
| compile time then Jackson Java does a good job but you can get a
| 2x boost with careful coding and profiling.
| 
| Whereas if you are parsing arbitrary JSON, all the mainstream
| parsers want to do lots of allocations that a more intrusive
| parser you write yourself can avoid; avoiding them can be a
| massive performance win if you are processing thousands or
| millions of objects per second.
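| 
| For the packed-token point, one possible Go encoding (field widths
| are arbitrary):
| 
|     // A token packed into one uint64: 8 bits of kind, 28 bits of
|     // start offset, 28 bits of length. Nothing escapes to the heap.
|     type token uint64
| 
|     func pack(kind byte, start, length int) token {
|         return token(uint64(kind)<<56 |
|             (uint64(start)&0xFFFFFFF)<<28 |
|             uint64(length)&0xFFFFFFF)
|     }
| 
|     func (t token) kind() byte  { return byte(t >> 56) }
|     func (t token) start() int  { return int((t >> 28) & 0xFFFFFFF) }
|     func (t token) length() int { return int(t & 0xFFFFFFF) }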
 
| wslh wrote:
| I remember this JSON benchmark page from RapidJSON [1].
| 
| [1] https://rapidjson.org/md_doc_performance.html
 
___________________________________________________________________
(page generated 2023-11-05 23:00 UTC)