|
| denysvitali wrote:
| Also interesting: https://youtu.be/a7VBbbcmxyQ
| eatonphil wrote:
| The walkthrough is very nice, how to do this if you're going to
| do it.
|
| If you're going for pure performance in a production environment
| you might take a look at Daniel Lemire's work:
| https://github.com/simdjson/simdjson. Or the MinIO port of it to
| Go: https://github.com/minio/simdjson-go.
| vjerancrnjak wrote:
| If your JSON always looks the same you can also do better than
| general JSON parsers.
| lylejantzi3rd wrote:
| Andreas Fredriksson demonstrates exactly that in this video:
| https://vimeo.com/644068002
| ykonstant wrote:
| Very enjoyable video!
| kaladin_1 wrote:
| I really enjoyed this video even though he lost me with the
| SIMD code.
| diarrhea wrote:
| I wonder: can fast, special-case JSON parsers be dynamically
| autogenerated from JSON Schemas?
|
| Perhaps some macro-ridden Rust monstrosity that spits out
| specialised parsers at compile time, dynamically...
| minhazm wrote:
| For json schema specifically there are some tools like go-
| jsonschema[1] but I've never used them personally. But you
| can use something like ffjson[2] in go to generate a static
| serialize/deserialize function based on a struct
| definition.
|
| [1] https://github.com/omissis/go-jsonschema [2]
| https://github.com/pquerna/ffjson
| atombender wrote:
| Hey, go-jsonschema is my project. (Someone else just took
| over maintaining it, though.) It still relies on the
|       standard Go parser; all it does is generate structs with
|       the right types and tags.
| galangalalgol wrote:
| Doesn't the serde crate's json support do precisely this?
|     It generates structs that have Option in all the right
|     places and with all the right types anyway. Seems like the
|     LLVM optimiser can probably do something useful with that
|     even if serde isn't using a priori knowledge from the
|     schema.
| dleeftink wrote:
| Somewhat tangentially related, Fabian Iwand posted this
| regex prefix tree visualiser/generator last week [0], which
| may offer some inspiration for prototyping auto generated
| schemas.
| atombender wrote:
| You forgot to include the link?
| PartiallyTyped wrote:
|     Pydantic does that to some extent, I think.
| marginalia_nu wrote:
|     A fundamental problem with JSON parsing is that it has
|     variable-length fields that don't encode their length; in a
|     streaming scenario you basically need to keep resizing your
|     buffer until the data fits. If the data is on disk rather
|     than streamed, you may get away with reading ahead to find
|     the end of the field first, but that's also not particularly
|     fast.
|
| Schemas can't fix that.
| mhh__ wrote:
|     It's relatively common in D applications to use the compile-
|     time capabilities to generate a parser at compile time.
| loeg wrote:
| You might also move to something other than JSON if parsing
| it is a significant part of your workload.
| haswell wrote:
|       Most of the time when I've had to deal with JSON performance
|       issues, it involved a 3rd party API and JSON was the only
| option.
|
| If you're building something net-new and know you'll have
| these problems out the gate, something other than JSON
| might be feasible, but the moment some other system not in
| the closed loop needs to work with the data, you're back to
| JSON and any associated perf issues.
| fooster wrote:
| Last time I compared the performance of various json parsers
| the simd one turned out to be disappointingly slow.
| Thaxll wrote:
| The fastest json lib in Go is the one done by the company
| behind Tiktok.
| ken47 wrote:
| Fastest at what?
| cannonpalms wrote:
| > For all sizes of json and all scenarios of usage, Sonic
| performs best.
|
| The repository has benchmarks
| mananaysiempre wrote:
| I'm not seeing simdjson in them though? I must be missing
| something because the Go port of it is explicitly
| mentioned in the motivation[1] (not the real thing,
| though).
|
|           [1] https://github.com/bytedance/sonic/blob/main/docs/INTRODUCTI...
| rockinghigh wrote:
| https://github.com/bytedance/sonic
| pizzafeelsright wrote:
|         Excellent threat vector.
| lionkor wrote:
| simdjson has not been the fastest for a long long time
| jzwinck wrote:
| What is faster? According to
| https://github.com/kostya/benchmarks#json nothing is.
| rexfuzzle wrote:
| Great to see a shout out to Phil Pearl! Also worth looking at
| https://github.com/bytedance/sonic
| Galanwe wrote:
| "Json" and "Go" seem antithetical in the same sentence as "high
| performance" to me. As long as we are talking about _absolute
| performance_.
|
| When talking about _relative performance_, as in _compared to
| what can be done with a somewhat similar stack_, I feel like it
| deserves a mention of what the stack is; otherwise there is no
| reference point.
| bsdnoob wrote:
|   Did you even open the article? The following is literally in the
|   first paragraph:
|
| > This package offers the same high level json.Decoder API but
| higher throughput and reduced allocations
| jcelerier wrote:
| How does that contradict what the parent poster says? I think
| it's very weird to call something "high performance" when it
| looks like it's maybe 15-20% of the performance of a simdjson
|     in C++. This is not "going from normal performance to high
|     performance", this is going from "very subpar" to "subpar".
| willsmith72 wrote:
| Ok but how many teams are building web APIs in C++?
| TheCleric wrote:
| I worked with a guy who did this. It was fast, but boy
| howdy was it not simple.
| lolinder wrote:
| Par is different for different stacks. It's reasonable for
| someone to treat their standard library's JSON parser as
| "par", given that that's the parser that most of their
| peers will be using, even if there are faster options that
| are commonly used in other stacks.
| Thaxll wrote:
|       Because even in C++ people don't use simdjson; most projects
|       use RapidJSON, which Go is on par with.
| SmoothBrain12 wrote:
|   Exactly, it's not too hard to implement in C. The one I made
|   never copied data; instead it saved a pointer/length to the
|   data. The user only had to memory-map the file (or equivalent)
|   and pass that data into the parser. The only memory allocation
|   was for the JSON nodes.
|
|   This way the user only paid the parsing tax (decoding doubles,
|   etc.) for the data they actually used.
|
|   You hit the nail on the head.
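|
|   A rough Go sketch of the same idea, using offsets into the input
|   buffer instead of raw pointers (the names here are illustrative,
|   not taken from the parser described above):
|
|       // node records where a value lives in the input buffer
|       // rather than holding a copy of it.
|       type node struct {
|           kind       byte // '{', '[', '"', 't', 'f', 'n' or '0'
|           start, end int  // byte offsets into the original buffer
|       }
|
|       // raw returns the bytes of a node without copying; callers
|       // only pay the decoding tax if they actually use the value.
|       func raw(buf []byte, n node) []byte {
|           return buf[n.start:n.end]
|       }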
| compressedgas wrote:
| Like how https://github.com/zserge/jsmn works. I thought it
|     would be neat to have such a parser for
| https://github.com/vshymanskyy/muon
| mr_mitm wrote:
| Did you publish the code somewhere? I'd be interested in
| reading it.
| eska wrote:
|     I also really like this paradigm. It's just that in old
|     crusty null-terminated C style this is really awkward,
|     because the input data must be copied or modified. It's not
|     an issue when using slices (pointer and length), but
|     unfortunately most of the C standard library and many
|     operating system APIs expect null-terminated strings.
|
| I've seen this referred to as a pull parser in a Rust
| library? (https://github.com/raphlinus/pulldown-cmark)
| Aurornis wrote:
| The linked article is a GopherCon talk.
|
| The first line of the article explains the context of the
| talk:
|
| > This talk is a case study of designing an efficient Go
| package.
|
| The target audience and context are clearly Go developers.
| Some of these comments are focusing too much on the headline
| without addressing the actual article.
| cryo wrote:
|     I've made a JSON parser which works like this too: no dynamic
|     memory allocations, similar to the JSMN parser but stricter
|     about following the specification.
|
| Always nice to be in control over memory :)
|
| https://sr.ht/~cryo/cj
| hgs3 wrote:
|     Yup, and if your implementation uses a hashmap for object key
|     -> value lookup, then I recommend allocating the hashmap
|     _after_ parsing the object, not during, to avoid continually
|     resizing the hashmap. You can implement this by using an
|     intrusive linked list to track your key/value JSON nodes
|     until the time comes to allocate the hashmap. Basically, when
| parsing an object 1. use a counter 'N' to track the number of
| keys, 2. link the JSON nodes representing key/value pairs
| into an intrusive linked list, 3. after parsing the object
| use 'N' to allocate a perfectly sized hashmap in one go. You
| can then iterate over the linked list of JSON key/value pair
| nodes adding them to the hashmap. You can use this same trick
| when parsing JSON arrays to avoid continually resizing a
| backing array. Alternatively, never allocate a backing array
| and instead use the linked list to implement an iterator.
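|
|     A minimal Go sketch of that trick (the type and function names
|     are made up for illustration):
|
|         // member is one parsed key/value pair, linked intrusively
|         // so no map or backing array exists while the object is
|         // still being parsed.
|         type member struct {
|             key  string
|             val  any
|             next *member
|         }
|
|         // buildObject runs after the closing '}' has been seen and
|         // allocates the map exactly once, perfectly sized.
|         func buildObject(head *member, n int) map[string]any {
|             m := make(map[string]any, n) // no incremental rehashing
|             for e := head; e != nil; e = e.next {
|                 m[e.key] = e.val
|             }
|             return m
|         }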
| duped wrote:
| > The user only had to Memory Map the file (or equivalent)
|
| Having done this myself, it's a massive cheat code because
| your bottleneck is almost always i/o and memory mapped i/o is
| orders of magnitude faster than sequential calls to read().
|
| But that said it's not always appropriate. You can have
| gigabytes of JSON to parse, and the JSON might be available
| over the network, and your service might be running on a
| small node with limited memory. Memory mapping here adds
| quite a lot of latency and cost to the system. A very fast
| streaming JSON decoder is the move here.
| vlovich123 wrote:
| > memory mapped i/o is orders of magnitude faster than
| sequential calls to read()
|
| That's not something I've generally seen. Any source for
| this claim?
|
| > You can have gigabytes of JSON to parse, and the JSON
| might be available over the network, and your service might
| be running on a small node with limited memory. Memory
| mapping here adds quite a lot of latency and cost to the
| system
|
| Why does mmap add latency? I would think that mmap adds
| more latency for small documents because the cost of doing
|     the mmap is high (cross-CPU TLB shootdown to modify the
| page table) and there's no chance to amortize. Relatedly,
| there's minimal to no relation between SAX vs DOM style
| parsing and mmap - you can use either with mmap. If you're
| not aware, you do have some knobs with mmap to hint to the
| OS how it's going to be used although it's very unwieldy to
| configure it to work well.
| duped wrote:
| Experience? Last time I made that optimization it was
| 100x faster, ballpark. I don't feel like benchmarking it
| right now, try yourself.
|
| The latency comes from the fact you need to have the
| whole file. The use case I'm talking about is a JSON
| document you need to pull off the network because it
| doesn't exist on disk, might not fit there, and might not
| fit in memory.
| vlovich123 wrote:
| > Experience? Last time I made that optimization it was
| 100x faster, ballpark. I don't feel like benchmarking it
| right now, try yourself.
|
| I have. Many times. There's definitely not a 100x
| difference given that normal file I/O can easily saturate
| NVMe throughput. I'm sure it's possible to build a repro
| showing a 100x difference, but you have to be doing
| something intentionally to cause that (e.g. using a very
| small read buffer so that you're doing enough syscalls
| that it shows up in a profile).
|
| > The latency comes from the fact you need to have the
| whole file
|
| That's a whole other matter. But again, if you're pulling
| it off the network, you usually can't mmap it anyway
| unless you're using a remote-mounted filesystem (which
| will add more overhead than mmap vs buffered I/O).
| duped wrote:
| I think you misunderstood my point, which was to
| highlight exactly when mmap won't work....
| ben-schaaf wrote:
| In my experience mmap is at best 50% faster compared to
| good pread usage on Linux and MacOS.
| Aurornis wrote:
| > "Json" and "Go" seem antithetical in the same sentence as
| "high performance" to me
|
| Absolute highest performance is rarely the highest priority in
| designing a system.
|
| Of course we could design a hyper-optimized, application
| specific payload format and code the deserializer in assembly
| and the performance would be great, but it wouldn't be useful
| outside of very specific circumstances.
|
|   In most real-world projects, the performance of Go and JSON is
|   fine and allows for rapid development, easy implementation, and
|   flexibility if anything changes.
|
| I don't think it's valid to criticize someone for optimizing
| within their use case.
|
| > I feel like it should deserve a mention of what the stack is,
| otherwise there is no reference point.
|
|   The article clearly mentions in the header that this is a
|   GopherCon talk. It was posted on the website of Dave Cheney, a
|   well-known figure in the Go community.
|
| It's clearly in the context of Go web services, so I don't
| understand your criticisms. The context is clear from the
| article.
|
| The _very first line_ of the article explains the context:
|
| > This talk is a case study of designing an efficient Go
| package.
| riku_iki wrote:
| > "Json" and "Go" seem antithetical in the same sentence as
| "high performance" to me. As long as we are talking about
| _absolute performance_.
|
|   Even just "Json" is problematic here as a wire protocol for
|   absolute performance, no matter what the programming language
|   is.
| 3cats-in-a-coat wrote:
|     I think people say that because they give disproportionate
|     weight to the fact it's text-based, while ignoring how
|     astoundingly simple and linear it is to write and read.
|
|     The only way to nudge the needle is to start exchanging
|     direct memory dumps, which is what Protobuf and the like do.
|     But this is clearly only for very specific use.
| riku_iki wrote:
| > while ignoring how astoundingly simple and linear it is
| to write and read.
|
|       The code may be simple, but you have lots of performance
|       penalties: resolving field keys, and constructing
|       complicated data structures through memory allocations,
|       which is expensive.
|
|       > to start exchanging direct memory dumps, which is what
|       Protobuf and the like do
|
|       Protobuf actually does parsing; it is just a binary
|       format. What you're describing is more like FlatBuffers.
|
|       > But this is clearly only for very specific use.
|
|       yes, the specific use is high performance computations )
| crazygringo wrote:
| JSON is often the format of an externally provided data source,
| and you don't have a choice.
|
| And whatever language you're writing in, you usually want to do
| what you can to maximize performance. If your JSON input is 500
|   bytes it probably doesn't matter, but if you're ingesting a 5 MB
|   JSON file then you can definitely be sure the performance does.
|
| What more do you need to know about "the stack" in this case?
| It's whenever you need to ingest large amounts of JSON in Go.
| Not sure what could be clearer.
| 3cats-in-a-coat wrote:
|   JSON is probably the fastest serialization format to produce
|   and parse that is also safe for public use, compared to binary
|   formats, which often have fragile, highly specific and
|   vulnerable encodings because they're plopped directly into
|   memory and used as-is (i.e. they're not parsed at all; it's
|   just two computers exchanging memory dumps).
|
| Compare it with XML for example, which is a nightmare of
| complexity if you actually follow the spec and not just make
| something XML-like.
|
| We have some formats which try to walk the boundary between
| safe/universal and fast like ASN.1 but those are obscure at
| best.
| alpaca128 wrote:
| I prefer msgpack if the data contains a lot of numeric
| values. Representing numbers as strings like in JSON can blow
| up the size and msgpack is usually just as simple to use.
| tptacek wrote:
| This is an article about optimizing a JSON parser in Go.
|
| As always, try to remember that people usually aren't writing
| (or posting their talks) specifically for an HN audience.
| Cheney clearly has an audience of Go programmers; that's the
| space he operates in. He's not going to title his post, which
| he didn't make for HN, just to avoid a language war on the
| threads here.
|
| It's our responsibility to avoid the unproductive language war
| threads, not this author's.
| hoosieree wrote:
| Even sticking within the confines of json there's low hanging
| fruit, e.g. if your data is typically like:
| [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
|
| You can amortize the overhead of the keys by turning this from
| an array of structs (AoS) into a struct of arrays (SoA):
| {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
|
| Then you only have to read "k1" and "k2" once, instead of _once
| per record_. Presumably there will be the odd record that
| contains something like { "k3": 0} but you can use mini batches
| of SoA and tune their size according to your desired
| latency/throughput tradeoff.
|
| Or if your data is 99.999% of the time just pairs of k1 and k2,
| turn them into tuples: {"k1k2":
| [true,[2,3,4],false,[], ...]}
|
| And then 0.001% of the time you send a lone k3 message:
| {"k3": 2}
|
| Even if your endpoints can't change their schema, you can still
| trade latency for throughput by doing the SoA conversion,
| transmitting, then converting back to AoS at the receiver.
| Maybe worthwhile if you have to forward the message many times
| but only decode it once.
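|
|   For illustration, a small Go sketch of that AoS -> SoA rewrite
|   (the struct names are invented here):
|
|       // Record is the array-of-structs shape: keys repeat per item.
|       type Record struct {
|           K1 bool  `json:"k1"`
|           K2 []int `json:"k2"`
|       }
|
|       // Columns is the struct-of-arrays shape: each key appears once.
|       type Columns struct {
|           K1 []bool  `json:"k1"`
|           K2 [][]int `json:"k2"`
|       }
|
|       // toColumns amortizes the key overhead across the batch.
|       func toColumns(recs []Record) Columns {
|           c := Columns{
|               K1: make([]bool, 0, len(recs)),
|               K2: make([][]int, 0, len(recs)),
|           }
|           for _, r := range recs {
|               c.K1 = append(c.K1, r.K1)
|               c.K2 = append(c.K2, r.K2)
|           }
|           return c
|       }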
| kunley wrote:
|   "Antithetical".
|
|   Is it true that mindlessly bashing Go has become some kind of
|   new religion in some circles?
| kevingadd wrote:
| I'm surprised there's no way to say 'I really mean it, inline
| this function' for the stuff that didn't inline because it was
| too big.
|
| The baseline whitespace count/search operation seems like it
| would be MUCH faster if you vectorized it with SIMD, but I can
| understand that being out of scope for the author.
| mgaunard wrote:
| Of course you can force-inline.
| cbarrick wrote:
| Obviously you can manually inline functions. That's what
| happened in the article.
|
| The comment is about having a directive or annotation to make
| the compiler inline the function for you, which Go does not
|     have. IMO, the pre-inline code was cleaner. It's a shame that
|     the compiler could not optimize it.
|
| There was once a proposal for this, but it's really against
| Go's design as a language.
|
| https://github.com/golang/go/issues/21536
| mgaunard wrote:
| You can in any systems programming language.
|
| Go is mostly a toy language for cloud people.
| cbarrick wrote:
| > toy language
|
| You may be surprised to hear that Go is used in a ton of
| large scale critical systems.
| mgaunard wrote:
| I don't consider cloud technology a critical system.
| peterohler wrote:
| You might want to take a look at https://github.com/ohler55/ojg.
| It takes a different approach with a single pass parser. There
| are some performance benchmarks included on the README.md landing
| page.
| arun-mani-j wrote:
| I remember reading a SO question which asks for a C library to
| parse JSON. A comment was like - C developers won't use a library
| for JSON, they will write one themselves.
|
| I don't know how "true" that comment is but I thought I should
| try to write a parser myself to get a feel :D
|
| So I wrote one, in Python: https://arunmani.in/articles/silly-json-parser/
|
| It was a delightful experience though, writing and testing to
| break your own code with a variety of inputs. :)
| xoac wrote:
| Good for you but what does this have to do with the article?
| janmo wrote:
| I wrote a small JSON parser in C myself which I called jsoncut.
| It just cuts out a certain part of a json file. I deal with
|   large JSON files, but only want to extract and parse certain
|   parts of them. All libraries I tried parse everything, use a lot
| of RAM and are slow.
|
| Link here, if interested to have a look:
| https://github.com/rgex/jsoncut
| vlovich123 wrote:
| The words you're looking for are SAX-like JSON parser or
|     streaming JSON parser. I don't know if there are any command-
|     line tools like the one you wrote that use that approach to
|     provide a jq-like interface, though.
| janmo wrote:
| I tried JQ and other command line tools, all were extremely
| slow and seemed to always parse the entire file.
|
| My parser just reads the file byte by byte until it finds
| the target, then outputs the content. When that's done it
| stops reading the file, meaning that it can be extremely
| fast when the targeted information is at the beginning of
| the JSON file.
| vlovich123 wrote:
| You're still describing a SAX parser (i.e. streaming). jq
| doesn't use a SAX parser because it's a multi-pass
| document editor at its core, hence why I said "jq-like"
| in terms of supporting a similar syntax for single-pass
| queries. If you used RapidJSON's SAX parser in the body
| of your custom code (returning false once you found what
| you're looking for), I'm pretty sure it would
| significantly outperform your custom hand-rolled code. Of
| course, your custom code is very small with no external
| dependencies and presumably fast enough, so tradeoffs.
| masklinn wrote:
| > I remember reading a SO question which asks for a C library
| to parse JSON. A comment was like - C developers won't use a
| library for JSON, they will write one themselves.
|
| > I don't know how "true" that comment is
|
| Either way it's a good way to get a pair of quadratic loops in
|   your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
| jjice wrote:
| I guess there are only so many ways to write a JSON parser,
| because one I wrote on a train in Python looks very similar!
|
| I thought it would be nice and simple, but it was even simpler
| than I expected. It's a fantastic spec if you need to
| throw one together yourself, without massive performance
| considerations.
| visarga wrote:
| Nowadays I am more interested in a "forgiving" JSON/YAML parser
| that would recover from LLM errors. Is there such a thing?
| kevingadd wrote:
| If the LLM did such a bad job that the syntax is wrong, do you
| really trust the data inside?
|
|   Forgiving parsers/lexers are common in compilers for languages
|   like Rust, C#, or TypeScript; you may want to investigate
|   TypeScript in particular since it's applicable to JSON syntax.
|   Maybe you could repurpose their parser.
| RichieAHB wrote:
| I feel like trying to infer valid JSON from invalid JSON is a
| recipe for garbage. You'd probably be better off doing a second
| pass with the "JSON" through the LLM but, as the sibling
| commenter said, at this point even the good JSON may be garbage
| ...
| _dain_ wrote:
| halloween was last week
| explaininjs wrote:
| Perhaps not quite what you're asking for, but along the same
| lines there's this "Incomplete JSON" parser, which takes a
| string of JSON as it's coming out of an LLM and parses it into
| as much data as it can get. Useful for building streaming UI's,
| for instance it is used on https://rexipie.com quite
| extensively.
|
| https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
|
|   Some example test cases:
|
|       { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
|       { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
|       { input: "[{},", output: [{}] },
|       { input: "[{},1", output: [{}, 1] },
|       { input: '[{},"', output: [{}, ""] },
|       { input: '[{},"abc', output: [{}, "abc"] },
|
|   Work could be done to optimize it, for instance adding streaming
|   support. But the cycles consumed either way are minimal for
|   LLM-output-length-constrained JSON.
|
| Fun fact: as best I can tell, GPT-4 is entirely unable to
| synthesize code to accomplish this task. Perhaps that will
| change as this implementation is made public, I do not know.
| gurrasson wrote:
| The jsonrepair tool https://github.com/josdejong/jsonrepair
| might interest you. It's tailored to fix JSON strings.
|
| I've been looking into something similar for handling partial
| JSONs, where you only have the first n chars of a JSON. This is
|   common with LLMs whose outputs are streamed to reduce latency.
|   If one knows the JSON schema ahead of time, one can start
| processing these first fields before the remaining data has
| fully loaded. If you have to wait for the whole thing to load
| there is little point in streaming.
|
| Was looking for a library that could do this parsing.
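|
|   For what it's worth, Go's standard json.Decoder already hands
|   back tokens as soon as they arrive on the stream, which gets
|   partway there; a rough sketch (llmStream here is just a stand-in
|   for a slow network read):
|
|       dec := json.NewDecoder(llmStream) // any io.Reader
|       for {
|           // Token returns each token as soon as it is buffered, so
|           // early fields can be acted on before the document is
|           // complete; a truncated tail surfaces as an error.
|           tok, err := dec.Token()
|           if err != nil {
|               break
|           }
|           fmt.Printf("token: %#v\n", tok)
|       }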
| explaininjs wrote:
| See my sibling comment :)
| mannyv wrote:
| These are always interesting to read because you get to see
| runtime quirks. I'm surprised there was so much function call
| overhead, for example. And it's interesting that you can bypass
| range checking.
|
| The most important thing, though, is the process: measure then
| optimize.
| isuckatcoding wrote:
| This is fantastically useful.
|
| Funny enough I stumbled upon your article just yesterday through
| google search.
| mgaunard wrote:
| "It's unrealistic to expect to have the entire input in memory"
| -- wrong for most applications
| isuckatcoding wrote:
| Yes but for applications where you need to do ETL style
| transformations on large datasets, streaming is an immensely
| useful strategy.
|
| Sure you could argue go isn't the right tool for the job but I
| don't see why it can't be done with the right optimizations
| like this effort.
| dh2022 wrote:
| If performance is important why would you keep large datasets
| in JSON format?
| querulous wrote:
| sometimes it's not your data
| isuckatcoding wrote:
| Usually because the downstream service or store needs it
| Maxion wrote:
| Because you work at or for some bureaucratic MegaCorp, that
| does weird things with no real logic behind it other than
| clueless Dilbert managers making decisions based on
| LinkedIn blogs. Alternatively desperate IT consultants
| trying to get something to work with too low of a budget
| and/or no access to do things the right way.
|
|       Be glad you have JSON to parse, and not EDI, some custom
|       delimited data format (with no or outdated documentation) -
|       or, _shudders_, you work in the airline industry with SABRE.
| capableweb wrote:
| https://yourdatafitsinram.net/
| mannyv wrote:
| If you're building a library you either need to explicitly call
| out your limits or do streaming.
|
|   I've pumped gigs of JSON data, so a streaming parser is
| appreciated. Plus streaming shows the author is better at
| engineering and is aware of the various use cases.
|
| Memory is not cheap or free except in theory.
| jjeaff wrote:
| I guess it's all relative. Memory is significantly cheaper if
| you get it anywhere but on loan from a cloud provider.
| mannyv wrote:
| RAM is always expensive no matter where you get it from.
|
| Would you rather do two hours of work or force thousands of
| people to buy more RAM because your library is a memory
| hog?
|
|       And on embedded systems RAM is at a premium. More RAM = more
|       cost.
| e12e wrote:
| If you can live with "fits on disk" mmap() is a viable option?
| Unless you truly need streaming (early handling of early data,
| like a stream of transactions/operations from a single JSON
| file?)
| mannyv wrote:
|     In general, JSON comes over the network, so mmap won't really
| work unless you save to a file. But then you'll run out of
| disk space.
|
| I mean, you have a 1k, 2k, 4k buffer. Why use more, because
| it's too much work?
| ahoka wrote:
| Most applications read JSONs from networks, where you have a
| stream. Buffering and fiddling with the whole request in memory
| increases latency by a lot, even if your JSON is smallish.
| mgaunard wrote:
| On a carefully built WebSocket server you would ensure your
| WebSocket messages all fit within a single MTU.
| Rapzid wrote:
|   Most (_most_) JSON payloads are probably much smaller than
|   typical buffer sizes, so they just end up all in memory anyway.
| jensneuse wrote:
| I've taken a very similar approach and built a GraphQL tokenizer
| and parser (amongst many other things) that also makes zero memory
| allocations and is quite fast. In case you'd like to check out the
| code: https://github.com/wundergraph/graphql-go-tools
| markl42 wrote:
| How big of an issue is this for GQL servers where all queries
| are known ahead of time (allowlist) - i.e. you can
|   cache/memoize the AST parsing, and this is only a perf issue
|   for a few minutes after the container starts up.
|
| Or does this bite us in other ways too?
| jensneuse wrote:
|     I've been building GraphQL API gateways / routers for 5+
|     years now. It would be nice if trusted documents or persisted
|     operations were the default, but the reality is that a lot of
|     people want to open up their GraphQL to the public. For that
|     reason we've built a fast parser, validator, normalizer and
|     many other things to support these use cases.
| nwpierce wrote:
| Writing a json parser is definitely an educational experience. I
| wrote one this summer for my own purposes that is decently fast:
| https://github.com/nwpierce/jsb
| lamontcg wrote:
| Wish I wasn't 4 or 5 uncompleted projects deep right now and had
| the time to rewrite a monkey parser using all these tricks.
| jchw wrote:
| Looks pretty good! Even though I've written far too many JSON
| parsers already in my career, it's really nice to have a
| reference for how to think about making a reasonable, fast JSON
| parser, going through each step individually.
|
| That said, I will say one thing: you don't _really_ need to have
| an explicit tokenizer for JSON. You can get rid of the concept of
| tokens and integrate parsing and tokenization _entirely_. This is
| what I usually do since it makes everything simpler. This is a
| lot harder to do with the rest of ECMAScript, since there you
| wind up needing look-ahead (sometimes arbitrarily large
| look-ahead... consider arrow functions: it's mostly a subset of
| the grammar of a parenthesized expression. Comma is an operator,
| and for default values, equals is an operator. It isn't until the
| => does or does not appear that you know for sure!)
| coldtea wrote:
| What line of work are you in that you've "written far too many
| JSON parsers already" in your career?!!!
| craigching wrote:
| Probably anywhere that requires parsing large JSON documents.
| Off the shelf JSON parsers are notoriously slow on large JSON
| documents.
| ahoka wrote:
|       Not necessarily; for example, Newtonsoft is fine with
|       multiple hundreds of megabytes if you use it correctly. But
|       of course it depends on how large we are talking about.
| jchw wrote:
|     Reasons differ. C++ is a really hard place to be. It's gotten
|     better, but if you can't tolerate exceptions, need code that
|     is as obviously memory-safe as possible, or need to parse
|     incrementally (think SAX style), off-the-shelf options like
|     jsoncpp may not fit the bill.
|
|     Handling large documents is indeed another big one. It _sort-
|     of_ fits in the same category as being able to parse
|     incrementally. That said, Go has a JSON scanner you can sort
|     of use for incremental parsing, but in practice I've found
|     it to be a lot slower, so for large documents it's a problem.
|
| I've done a couple in hobby projects too. One time I did a
| partial one in Win32-style C89 because I wanted one that
| didn't depend on libc.
| lgas wrote:
| Someone misunderstood the JSONParserFactory somewhere along
| the line.
| marcosdumay wrote:
| I've seen "somebody doesn't agree with the standard and we
| must support it" way too many times, and I've written JSON
| parsers because of this. (And, of course, it's easy to get
| some difference with the JSON standard.)
|
|   I've had problems with handling streams like the OP on
|   basically every programming language and data-encoding
|   language pair that I've tried. It looks like nobody ever
|   thinks about it (I do use chunking any time I can, but
|   sometimes you can't).
|
| There are probably lots and lots of reasons to write your own
| parser.
| jbiggley wrote:
| This reminds me of my favourite quote about standards.
|
| >The wonderful thing about standards is that there are so
| many of them to choose from.
|
| And, keeping with the theme, this quote may be from Grace
| Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.
| evmar wrote:
| In n2[1] I needed a fast tokenizer and had the same "garbage
| factory" problem, which is basically that there's a set of
| constant tokens (like json.Delim in this post) and then strings
| which cause allocations.
|
| I came up with what I think is a kind of neat solution, which is
| that the tokenizer is generic over some T and takes a function
| from byte slice to T and uses T in place of the strings. This way,
| when the caller has some more efficient representation available
| (like one that allocates less) it can provide one, but I can
| still unit test the tokenizer with the identity function for
| convenience.
|
| In a sense this is like fusing the tokenizer with the parser at
| build time, but the generic allows layering the tokenizer such
| that it doesn't know about the parser's representation.
|
| [1] https://github.com/evmar/n2
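|
| Roughly the same idea rendered in Go generics, for readers who
| don't want to dig through the Rust (the names below are invented
| for illustration):
|
|     // Tokenizer is generic over how string-ish tokens are held.
|     // intern converts a raw byte slice into the caller's preferred
|     // representation (an index into a string table, say), so the
|     // tokenizer itself never allocates strings.
|     type Tokenizer[T any] struct {
|         input  []byte
|         pos    int
|         intern func([]byte) T
|     }
|
|     // Ident scans a run of non-space bytes and hands it to intern.
|     func (t *Tokenizer[T]) Ident() T {
|         start := t.pos
|         for t.pos < len(t.input) && t.input[t.pos] != ' ' {
|             t.pos++
|         }
|         return t.intern(t.input[start:t.pos])
|     }
|
| In tests it can be instantiated with the identity function, e.g.
| Tokenizer[[]byte]{intern: func(b []byte) []byte { return b }}.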
| suzzer99 wrote:
| Can someone explain to me why JSON can't have comments or
| trailing commas? I really hope the performance gains are worth
| it, because I've lost 100s of man-hours to those things, and had
| to resort to stuff like this in package.json:
| "IMPORTANT: do not run the scripts below this line, they are for
| CICD only": true,
| coldtea wrote:
|   It can't have comments because it didn't originally have
|   comments, so now it's too late. And it originally didn't have
|   comments because Douglas Crockford thought they could be
|   abused for parsing instructions.
|
|   As for not having trailing commas, that's probably a less
|   intentional bad design choice.
|
|   That said, if you want trailing commas and comments, and
|   control the parsers that will be used for your JSON, then use
|   JSONC (JSON with comments). VSCode for example does that for
|   its JSON configuration.
| explaininjs wrote:
| JSONC also supports trailing commas. It is, in effect, "JSON
| with no downsides".
|
| TOML/Yaml always drive me batty with all their obscure
| special syntax. Whereas it's almost impossible to look at a
| formatted blob of JSON and not have a very solid
| understanding of what it represents.
|
|     The one thing I _might_ add is multiline strings with backticks,
| but even that is probably more trouble than it's worth, as
| you immediately start going down the path of "well let's also
| have syntax to strip the indentation from those strings,
| maybe we should add new syntax to support raw strings, ..."
| tubthumper8 wrote:
| Does JSONC have a specification or formal definition? People
| have suggested[1] using JSON5[2] instead for that reason
|
| [1] https://github.com/microsoft/vscode/issues/100688
|
| [2] https://spec.json5.org/
| mananaysiempre wrote:
| Unfortunately, JSON5 says keys can be ES5
| IdentifierName[1]s, which means you must carry around
| Unicode tables. This makes it a non-option for small
| devices, for example. (I mean, not really, you technically
| _could_ fit the necessary data and code in low single-digit
| kilobytes, but it feels stupid that you have to. Or you
| could just not do that but then it's no longer JSON5 and
| what was the point of having a spec again?)
|
| [1] https://es5.github.io/x7.html#x7.6
| mananaysiempre wrote:
| Amusingly, it originally _did_ have comments. Removing
| comments was the one change Crockford ever made to the
| spec[1].
|
|     [1] https://web.archive.org/web/20150105080225/https://plus.goog...
|     (thank you Internet Archive for making Google's social network
|     somewhat accessible and less than useless)
| shepherdjerred wrote:
| I don't know the historic reason why it wasn't included in the
| original spec, but at this point it doesn't matter. JSON is
| entrenched and not going to change.
|
| If you want comments, you can always use jsonc.
| semiquaver wrote:
| It's not that it "can't", more like it "doesn't". Douglas
| Crockford prioritized simplicity when specifying JSON. Its BNF
| grammar famously fits on one side of a business card.
|
| Other flavors of JSON that include support for comments and
| trailing commas exist, but they are reasonably called by
| different names. One of these is YAML (mostly a superset of
| JSON). To some extent the difficulties with YAML (like unquoted
| 'no' being a synonym for false) have vindicated Crockford's
| priorities.
| forrestthewoods wrote:
| > Any (useful) JSON decoder code cannot go faster that this.
|
| That line feels like a troll. Cunningham's Law in action.
|
| You can definitely go faster than 2 Gb/sec. In a word, SIMD.
| shoo wrote:
| we could re-frame by distinguishing problem statements from
| implementations
|
| Problem A: read a stream of bytes, parse it as JSON
|
| Problem B: read a stream of bytes, count how many bytes match a
| JSON whitespace character
|
| Problem B should require fewer resources* to solve than problem
| A. So in that sense problem B is a relaxation of problem A, and
| a highly efficient implementation of problem B should be able
| to process bytes much more efficiently than an "optimal"
| implementation of problem A.
|
| So in this sense, we can probably all agree with the author
| that counting whitespace bytes is an easier problem than the
| full parsing problem.
|
| We're agreed that the author's implementation (half a page of
| go code that fits on a talk slide) to solve problem B isn't the
| most efficient way to solve problem B.
|
| I remember reading somewhere the advice that to set a really
| solid target for benchmarking, you should avoid measuring the
| performance of implementations and instead try to estimate a
| theoretical upper bound on performance, based on say a
| simplified model of how the hardware works and a simplification
| of the problem -- that hopefully still captures the essence of
| what the bottleneck is. Then you can compare any implementation
| to that (unreachable) theoretical upper bound, to get more of
| an idea of how much performance is still left on the table.
|
| * for reasonably boring choices of target platform, e.g. amd64
| + ram, not some hypothetical hardware platform with
| surprisingly fast dedicated support for JSON parsing and bad
| support for anything else.
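|
|   For concreteness, the "problem B" baseline is essentially just
|   this kind of loop (a rough Go sketch, not the author's code;
|   assumes the io package is imported):
|
|       // countWhitespace counts the JSON whitespace bytes in a
|       // stream. It does no parsing at all, so its throughput is an
|       // upper bound for any parser that must at least look at
|       // every byte on the same hardware and I/O path.
|       func countWhitespace(r io.Reader) (int64, error) {
|           var n int64
|           buf := make([]byte, 64*1024)
|           for {
|               m, err := r.Read(buf)
|               for _, b := range buf[:m] {
|                   switch b {
|                   case ' ', '\t', '\n', '\r':
|                       n++
|                   }
|               }
|               if err == io.EOF {
|                   return n, nil
|               }
|               if err != nil {
|                   return n, err
|               }
|           }
|       }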
| ncruces wrote:
| It's possible to improve over the standard library with better
| API design, but it's not really possible to do a fully streaming
| parser that doesn't half fill structures before finding an error
| and bailing out in the middle, which is another explicit design
| constraint for the standard library.
| hintymad wrote:
| How is this compared to Daniel Lemire's simdjson?
| https://github.com/simdjson/simdjson
| 1vuio0pswjnm7 wrote:
| "But there is a better trick that we can use that is more space
| efficient than this table, and is sometimes called a computed
| goto."
|
| From 1989:
|
| https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
|
| "Indirection in the Goto field is a more powerful version of the
| computed Goto which appears in some languages. It allows a
| program to quickly perform a multi-way control branch based on an
| item of data."
| wood_spirit wrote:
| My own lessons from writing fast json parsers have a lot of
| language-specific things in them, but here are some generalisations:
|
| Avoid heap allocations in tokenising. Have a tokeniser that is a
| function that returns a stack-allocated struct or an int64 token
| that is a packed field describing the start, length and type
| offsets etc of the token.
|
| Avoid heap allocations in parsing: support a getString(key
| String) type interface for clients that want to chop up a buffer.
|
| For deserialising to objects where you know the fields at compile
| time, generally generate a switch on key length before comparing
| string values.
|
| My experience in data pipelines that process lots of json is that
| the choice of json library can make a 3-10x performance difference
| and that all the main parsers want to allocate objects.
|
| If the classes you are serialising or deserialising are known at
| compile time then Jackson Java does a good job but you can get a
| 2x boost with careful coding and profiling.
|
| Whereas if you are parsing arbitrary json, all the mainstream
| parsers want to do lots of allocations that a more intrusive
| parser you write yourself can avoid, and you can make massive
| performance wins if you are processing thousands or millions of
| objects per second.
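|
| A rough Go sketch of the packed-token idea from the first point
| (the bit layout here is arbitrary, purely for illustration):
|
|     // Token packs kind, start offset and length into one int64 so
|     // the tokenizer can return it by value with no heap allocation.
|     type Token int64
|
|     const (
|         TokString = iota
|         TokNumber
|         TokPunct
|     )
|
|     // pack stores an 8-bit kind, 32-bit start and 24-bit length.
|     func pack(kind, start, length int) Token {
|         return Token(int64(kind)<<56 | int64(start)<<24 | int64(length))
|     }
|
|     func (t Token) Kind() int   { return int(t >> 56) }
|     func (t Token) Start() int  { return int(t >> 24 & 0xFFFFFFFF) }
|     func (t Token) Length() int { return int(t & 0xFFFFFF) }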
| wslh wrote:
| I remember this JSON benchmark page from RapidJSON [1].
|
| [1] https://rapidjson.org/md_doc_performance.html