|
| denysvitali wrote:
| Also interesting: https://youtu.be/a7VBbbcmxyQ
| eatonphil wrote:
| The walkthrough is very nice, how to do this if you're going to
| do it.
|
| If you're going for pure performance in a production environment
| you might take a look at Daniel Lemire's work:
| https://github.com/simdjson/simdjson. Or the MinIO port of it to
| Go: https://github.com/minio/simdjson-go.
| vjerancrnjak wrote:
| If your JSON always looks the same you can also do better than
| general JSON parsers.
| lylejantzi3rd wrote:
| Andreas Fredriksson demonstrates exactly that in this video:
| https://vimeo.com/644068002
| ykonstant wrote:
| Very enjoyable video!
| kaladin_1 wrote:
| I really enjoyed this video even though he lost me with the
| SIMD code.
| diarrhea wrote:
| I wonder: can fast, special-case JSON parsers be dynamically
| autogenerated from JSON Schemas?
|
| Perhaps some macro-ridden Rust monstrosity that spits out
| specialised parsers at compile time, dynamically...
| minhazm wrote:
| For json schema specifically there are some tools like go-
| jsonschema[1] but I've never used them personally. But you
| can use something like ffjson[2] in go to generate a static
| serialize/deserialize function based on a struct
| definition.
|
| [1] https://github.com/omissis/go-jsonschema [2]
| https://github.com/pquerna/ffjson
| atombender wrote:
| Hey, go-jsonschema is my project. (Someone else just took
| over maintaining it, though.) It still relies on the
|       standard Go parser; all it does is generate structs with
|       the right types and tags.
| galangalalgol wrote:
| Doesn't the serde crate's json support do precisely this?
|     It generates structs that have Option in all the right
|     places and with all the right types anyway. Seems like the
|     LLVM optimiser can probably do something useful with that
|     even if serde isn't using a priori knowledge from the
|     schema.
| dleeftink wrote:
| Somewhat tangentially related, Fabian Iwand posted this
| regex prefix tree visualiser/generator last week [0], which
| may offer some inspiration for prototyping auto generated
| schemas.
| atombender wrote:
| You forgot to include the link?
| PartiallyTyped wrote:
|     Pydantic does that to some extent, I think.
| marginalia_nu wrote:
|     A fundamental problem with JSON parsing is that it has
|     variable-length fields that don't encode their length; in a
|     streaming scenario you basically need to keep resizing your
|     buffer until the data fits. If the data is on disk rather
|     than streamed, you may get away with reading ahead to find
|     the end of the field first, but that's also not particularly
|     fast.
|
| Schemas can't fix that.
| mhh__ wrote:
|     It's relatively common in D applications to use the compile-
|     time capabilities to generate a parser at compile time.
| loeg wrote:
| You might also move to something other than JSON if parsing
| it is a significant part of your workload.
| haswell wrote:
|       Most of the time when I've had to deal with JSON performance
|       issues, it involved a 3rd party API and JSON was the only
| option.
|
| If you're building something net-new and know you'll have
| these problems out the gate, something other than JSON
| might be feasible, but the moment some other system not in
| the closed loop needs to work with the data, you're back to
| JSON and any associated perf issues.
| fooster wrote:
| Last time I compared the performance of various json parsers
| the simd one turned out to be disappointingly slow.
| Thaxll wrote:
| The fastest json lib in Go is the one done by the company
| behind Tiktok.
| ken47 wrote:
| Fastest at what?
| cannonpalms wrote:
| > For all sizes of json and all scenarios of usage, Sonic
| performs best.
|
| The repository has benchmarks
| mananaysiempre wrote:
| I'm not seeing simdjson in them though? I must be missing
| something because the Go port of it is explicitly
| mentioned in the motivation[1] (not the real thing,
| though).
|
|           [1] https://github.com/bytedance/sonic/blob/main/docs/INTRODUCTI...
| rockinghigh wrote:
| https://github.com/bytedance/sonic
| pizzafeelsright wrote:
|         Excellent threat vector.
| lionkor wrote:
| simdjson has not been the fastest for a long long time
| jzwinck wrote:
| What is faster? According to
| https://github.com/kostya/benchmarks#json nothing is.
| rexfuzzle wrote:
| Great to see a shout out to Phil Pearl! Also worth looking at
| https://github.com/bytedance/sonic
| Galanwe wrote:
| "Json" and "Go" seem antithetical in the same sentence as "high
| performance" to me. As long as we are talking about _absolute
| performance_.
|
| When talking about _relative performance_, as in _compared to
| what can be done with a somewhat similar stack_, I feel like it
| deserves a mention of what the stack is; otherwise there is no
| reference point.
| bsdnoob wrote:
|   Did you even open the article? The following is literally in the
|   first paragraph:
|
| > This package offers the same high level json.Decoder API but
| higher throughput and reduced allocations
| jcelerier wrote:
| How does that contradict what the parent poster says? I think
| it's very weird to call something "high performance" when it
| looks like it's maybe 15-20% of the performance of a simdjson
|     in C++. This is not "going from normal performance to high
|     performance", this is going from "very subpar" to "subpar".
| willsmith72 wrote:
| Ok but how many teams are building web APIs in C++?
| TheCleric wrote:
| I worked with a guy who did this. It was fast, but boy
| howdy was it not simple.
| lolinder wrote:
| Par is different for different stacks. It's reasonable for
| someone to treat their standard library's JSON parser as
| "par", given that that's the parser that most of their
| peers will be using, even if there are faster options that
| are commonly used in other stacks.
| Thaxll wrote:
|       Because even in C++ people don't use simdjson; most projects
|       use RapidJSON, which Go is on par with.
| SmoothBrain12 wrote:
|   Exactly, it's not too hard to implement in C. The one I made
|   never copied data; instead it saved a pointer/length to the
|   data. The user only had to memory-map the file (or equivalent)
|   and pass that data into the parser. The only memory allocation
|   was for the JSON nodes.
|
|   This way the user only paid the parsing tax (decoding doubles,
|   etc.) for the data they actually used.
|
|   You hit the nail on the head.
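|
|   A rough Go sketch of the same idea, using offsets into the input
|   buffer instead of raw pointers (the names here are illustrative,
|   not taken from the parser described above):
|
|       // node records where a value lives in the input buffer
|       // rather than holding a copy of it.
|       type node struct {
|           kind       byte // '{', '[', '"', 't', 'f', 'n' or '0'
|           start, end int  // byte offsets into the original buffer
|       }
|
|       // raw returns the bytes of a node without copying; callers
|       // only pay the decoding tax if they actually use the value.
|       func raw(buf []byte, n node) []byte {
|           return buf[n.start:n.end]
|       }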
| compressedgas wrote:
| Like how https://github.com/zserge/jsmn works. I thought it
|     would be neat to have such a parser for
| https://github.com/vshymanskyy/muon
| mr_mitm wrote:
| Did you publish the code somewhere? I'd be interested in
| reading it.
| eska wrote:
|     I also really like this paradigm. It's just that in old
|     crusty null-terminated C style this is really awkward,
|     because the input data must be copied or modified. It's not
|     an issue when using slices (pointer and length), but
|     unfortunately most of the C standard library and many
|     operating system APIs expect null-terminated strings.
|
| I've seen this referred to as a pull parser in a Rust
| library? (https://github.com/raphlinus/pulldown-cmark)
| Aurornis wrote:
| The linked article is a GopherCon talk.
|
| The first line of the article explains the context of the
| talk:
|
| > This talk is a case study of designing an efficient Go
| package.
|
| The target audience and context are clearly Go developers.
| Some of these comments are focusing too much on the headline
| without addressing the actual article.
| cryo wrote:
|     I've made a JSON parser which works like this too: no dynamic
|     memory allocations, similar to the JSMN parser but stricter
|     about following the specification.
|
| Always nice to be in control over memory :)
|
| https://sr.ht/~cryo/cj
| hgs3 wrote:
|     Yup, and if your implementation uses a hashmap for object key
|     -> value lookup, then I recommend allocating the hashmap
|     _after_ parsing the object, not during, to avoid continually
|     resizing the hashmap. You can implement this by using an
|     intrusive linked list to track your key/value JSON nodes
|     until the time comes to allocate the hashmap. Basically, when
| parsing an object 1. use a counter 'N' to track the number of
| keys, 2. link the JSON nodes representing key/value pairs
| into an intrusive linked list, 3. after parsing the object
| use 'N' to allocate a perfectly sized hashmap in one go. You
| can then iterate over the linked list of JSON key/value pair
| nodes adding them to the hashmap. You can use this same trick
| when parsing JSON arrays to avoid continually resizing a
| backing array. Alternatively, never allocate a backing array
| and instead use the linked list to implement an iterator.
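|
|     A minimal Go sketch of that trick (the type and function names
|     are made up for illustration):
|
|         // member is one parsed key/value pair, linked intrusively
|         // so no map or backing array exists while the object is
|         // still being parsed.
|         type member struct {
|             key  string
|             val  any
|             next *member
|         }
|
|         // buildObject runs after the closing '}' has been seen and
|         // allocates the map exactly once, perfectly sized.
|         func buildObject(head *member, n int) map[string]any {
|             m := make(map[string]any, n) // no incremental rehashing
|             for e := head; e != nil; e = e.next {
|                 m[e.key] = e.val
|             }
|             return m
|         }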
| duped wrote:
| > The user only had to Memory Map the file (or equivalent)
|
| Having done this myself, it's a massive cheat code because
| your bottleneck is almost always i/o and memory mapped i/o is
| orders of magnitude faster than sequential calls to read().
|
| But that said it's not always appropriate. You can have
| gigabytes of JSON to parse, and the JSON might be available
| over the network, and your service might be running on a
| small node with limited memory. Memory mapping here adds
| quite a lot of latency and cost to the system. A very fast
| streaming JSON decoder is the move here.
| vlovich123 wrote:
| > memory mapped i/o is orders of magnitude faster than
| sequential calls to read()
|
| That's not something I've generally seen. Any source for
| this claim?
|
| > You can have gigabytes of JSON to parse, and the JSON
| might be available over the network, and your service might
| be running on a small node with limited memory. Memory
| mapping here adds quite a lot of latency and cost to the
| system
|
| Why does mmap add latency? I would think that mmap adds
| more latency for small documents because the cost of doing
|     the mmap is high (cross-CPU TLB shootdown to modify the
| page table) and there's no chance to amortize. Relatedly,
| there's minimal to no relation between SAX vs DOM style
| parsing and mmap - you can use either with mmap. If you're
| not aware, you do have some knobs with mmap to hint to the
| OS how it's going to be used although it's very unwieldy to
| configure it to work well.
| duped wrote:
| Experience? Last time I made that optimization it was
| 100x faster, ballpark. I don't feel like benchmarking it
| right now, try yourself.
|
| The latency comes from the fact you need to have the
| whole file. The use case I'm talking about is a JSON
| document you need to pull off the network because it
| doesn't exist on disk, might not fit there, and might not
| fit in memory.
| vlovich123 wrote:
| > Experience? Last time I made that optimization it was
| 100x faster, ballpark. I don't feel like benchmarking it
| right now, try yourself.
|
| I have. Many times. There's definitely not a 100x
| difference given that normal file I/O can easily saturate
| NVMe throughput. I'm sure it's possible to build a repro
| showing a 100x difference, but you have to be doing
| something intentionally to cause that (e.g. using a very
| small read buffer so that you're doing enough syscalls
| that it shows up in a profile).
|
| > The latency comes from the fact you need to have the
| whole file
|
| That's a whole other matter. But again, if you're pulling
| it off the network, you usually can't mmap it anyway
| unless you're using a remote-mounted filesystem (which
| will add more overhead than mmap vs buffered I/O).
| duped wrote:
| I think you misunderstood my point, which was to
| highlight exactly when mmap won't work....
| ben-schaaf wrote:
| In my experience mmap is at best 50% faster compared to
| good pread usage on Linux and MacOS.
| Aurornis wrote:
| > "Json" and "Go" seem antithetical in the same sentence as
| "high performance" to me
|
| Absolute highest performance is rarely the highest priority in
| designing a system.
|
| Of course we could design a hyper-optimized, application
| specific payload format and code the deserializer in assembly
| and the performance would be great, but it wouldn't be useful
| outside of very specific circumstances.
|
|   In most real-world projects, the performance of Go and JSON is
|   fine and allows for rapid development, easy implementation, and
|   flexibility if anything changes.
|
| I don't think it's valid to criticize someone for optimizing
| within their use case.
|
| > I feel like it should deserve a mention of what the stack is,
| otherwise there is no reference point.
|
|   The article clearly mentions in the header that this is a
|   GopherCon talk. It was posted on the website of Dave Cheney, a
|   well-known figure in the Go community.
|
| It's clearly in the context of Go web services, so I don't
| understand your criticisms. The context is clear from the
| article.
|
| The _very first line_ of the article explains the context:
|
| > This talk is a case study of designing an efficient Go
| package.
| riku_iki wrote:
| > "Json" and "Go" seem antithetical in the same sentence as
| "high performance" to me. As long as we are talking about
| _absolute performance_.
|
|   Even just "Json" is problematic here as a wire protocol for
|   absolute performance, no matter what the programming language
|   is.
| 3cats-in-a-coat wrote:
|     I think people say that because they give disproportionate
|     weight to the fact it's text-based, while ignoring how
|     astoundingly simple and linear it is to write and read.
|
|     The only way to nudge the needle is to start exchanging
|     direct memory dumps, which is what Protobuf and the like do.
|     But this is clearly only for very specific use.
| riku_iki wrote:
| > while ignoring how astoundingly simple and linear it is
| to write and read.
|
|       The code may be simple, but you have lots of performance
|       penalties: resolving field keys, and constructing
|       complicated data structures through memory allocations,
|       which is expensive.
|
|       > to start exchanging direct memory dumps, which is what
|       Protobuf and the like do
|
|       Protobuf actually does parsing; it is just a binary
|       format. What you're describing is more like FlatBuffers.
|
|       > But this is clearly only for very specific use.
|
|       yes, the specific use is high performance computations )
| crazygringo wrote:
| JSON is often the format of an externally provided data source,
| and you don't have a choice.
|
| And whatever language you're writing in, you usually want to do
| what you can to maximize performance. If your JSON input is 500
|   bytes it probably doesn't matter, but if you're ingesting a 5 MB
|   JSON file then you can definitely be sure the performance does.
|
| What more do you need to know about "the stack" in this case?
| It's whenever you need to ingest large amounts of JSON in Go.
| Not sure what could be clearer.
| 3cats-in-a-coat wrote:
|   JSON is probably the fastest serialization format to produce
|   and parse that is also safe for public use, compared to binary
|   formats, which often have fragile, highly specific and
|   vulnerable encodings because they're plopped directly into
|   memory and used as-is (i.e. they're not parsed at all; it's
|   just two computers exchanging memory dumps).
|
| Compare it with XML for example, which is a nightmare of
| complexity if you actually follow the spec and not just make
| something XML-like.
|
| We have some formats which try to walk the boundary between
| safe/universal and fast like ASN.1 but those are obscure at
| best.
| alpaca128 wrote:
| I prefer msgpack if the data contains a lot of numeric
| values. Representing numbers as strings like in JSON can blow
| up the size and msgpack is usually just as simple to use.
| tptacek wrote:
| This is an article about optimizing a JSON parser in Go.
|
| As always, try to remember that people usually aren't writing
| (or posting their talks) specifically for an HN audience.
| Cheney clearly has an audience of Go programmers; that's the
| space he operates in. He's not going to title his post, which
| he didn't make for HN, just to avoid a language war on the
| threads here.
|
| It's our responsibility to avoid the unproductive language war
| threads, not this author's.
| hoosieree wrote:
| Even sticking within the confines of json there's low hanging
| fruit, e.g. if your data is typically like:
| [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
|
| You can amortize the overhead of the keys by turning this from
| an array of structs (AoS) into a struct of arrays (SoA):
| {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
|
| Then you only have to read "k1" and "k2" once, instead of _once
| per record_. Presumably there will be the odd record that
| contains something like { "k3": 0} but you can use mini batches
| of SoA and tune their size according to your desired
| latency/throughput tradeoff.
|
| Or if your data is 99.999% of the time just pairs of k1 and k2,
| turn them into tuples: {"k1k2":
| [true,[2,3,4],false,[], ...]}
|
| And then 0.001% of the time you send a lone k3 message:
| {"k3": 2}
|
| Even if your endpoints can't change their schema, you can still
| trade latency for throughput by doing the SoA conversion,
| transmitting, then converting back to AoS at the receiver.
| Maybe worthwhile if you have to forward the message many times
| but only decode it once.
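|
|   For illustration, a small Go sketch of that AoS -> SoA rewrite
|   (the struct names are invented here):
|
|       // Record is the array-of-structs shape: keys repeat per item.
|       type Record struct {
|           K1 bool  `json:"k1"`
|           K2 []int `json:"k2"`
|       }
|
|       // Columns is the struct-of-arrays shape: each key appears once.
|       type Columns struct {
|           K1 []bool  `json:"k1"`
|           K2 [][]int `json:"k2"`
|       }
|
|       // toColumns amortizes the key overhead across the batch.
|       func toColumns(recs []Record) Columns {
|           c := Columns{
|               K1: make([]bool, 0, len(recs)),
|               K2: make([][]int, 0, len(recs)),
|           }
|           for _, r := range recs {
|               c.K1 = append(c.K1, r.K1)
|               c.K2 = append(c.K2, r.K2)
|           }
|           return c
|       }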
| kunley wrote:
|   "Antithetical".
|
|   Is it true that mindlessly bashing Go has become some kind of
|   new religion in some circles?
| kevingadd wrote:
| I'm surprised there's no way to say 'I really mean it, inline
| this function' for the stuff that didn't inline because it was
| too big.
|
| The baseline whitespace count/search operation seems like it
| would be MUCH faster if you vectorized it with SIMD, but I can
| understand that being out of scope for the author.
| mgaunard wrote:
| Of course you can force-inline.
| cbarrick wrote:
| Obviously you can manually inline functions. That's what
| happened in the article.
|
| The comment is about having a directive or annotation to make
| the compiler inline the function for you, which Go does not
|     have. IMO, the pre-inline code was cleaner. It's a shame that
|     the compiler could not optimize it.
|
| There was once a proposal for this, but it's really against
| Go's design as a language.
|
| https://github.com/golang/go/issues/21536
| mgaunard wrote:
| You can in any systems programming language.
|
| Go is mostly a toy language for cloud people.
| cbarrick wrote:
| > toy language
|
| You may be surprised to hear that Go is used in a ton of
| large scale critical systems.
| mgaunard wrote:
| I don't consider cloud technology a critical system.
| peterohler wrote:
| You might want to take a look at https://github.com/ohler55/ojg.
| It takes a different approach with a single pass parser. There
| are some performance benchmarks included on the README.md landing
| page.
| arun-mani-j wrote:
| I remember reading a SO question which asks for a C library to
| parse JSON. A comment was like - C developers won't use a library
| for JSON, they will write one themselves.
|
| I don't know how "true" that comment is but I thought I should
| try to write a parser myself to get a feel :D
|
| So I wrote one, in Python: https://arunmani.in/articles/silly-json-parser/
|
| It was a delightful experience though, writing and testing to
| break your own code with a variety of inputs. :)
| xoac wrote:
| Good for you but what does this have to do with the article?
| janmo wrote:
| I wrote a small JSON parser in C myself which I called jsoncut.
| It just cuts out a certain part of a json file. I deal with
|   large JSON files, but only want to extract and parse certain
|   parts of them. All libraries I tried parse everything, use a lot
| of RAM and are slow.
|
| Link here, if interested to have a look:
| https://github.com/rgex/jsoncut
| vlovich123 wrote:
| The words you're looking for are SAX-like JSON parser or
|     streaming JSON parser. I don't know if there are any command-
|     line tools like the one you wrote that use that approach to
|     provide a jq-like interface, though.
| janmo wrote:
| I tried JQ and other command line tools, all were extremely
| slow and seemed to always parse the entire file.
|
| My parser just reads the file byte by byte until it finds
| the target, then outputs the content. When that's done it
| stops reading the file, meaning that it can be extremely
| fast when the targeted information is at the beginning of
| the JSON file.
| vlovich123 wrote:
| You're still describing a SAX parser (i.e. streaming). jq
| doesn't use a SAX parser because it's a multi-pass
| document editor at its core, hence why I said "jq-like"
| in terms of supporting a similar syntax for single-pass
| queries. If you used RapidJSON's SAX parser in the body
| of your custom code (returning false once you found what
| you're looking for), I'm pretty sure it would
| significantly outperform your custom hand-rolled code. Of
| course, your custom code is very small with no external
| dependencies and presumably fast enough, so tradeoffs.
| masklinn wrote:
| > I remember reading a SO question which asks for a C library
| to parse JSON. A comment was like - C developers won't use a
| library for JSON, they will write one themselves.
|
| > I don't know how "true" that comment is
|
| Either way it's a good way to get a pair of quadratic loops in
|   your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
| jjice wrote:
| I guess there are only so many ways to write a JSON parser,
| because one I wrote on a train in Python looks very similar!
|
| I thought it would be nice and simple, but it was even simpler
| than I expected. It's a fantastic spec if you need to
| throw one together yourself, without massive performance
| considerations.
| visarga wrote:
| Nowadays I am more interested in a "forgiving" JSON/YAML parser
| that would recover from LLM errors. Is there such a thing?
| kevingadd wrote:
| If the LLM did such a bad job that the syntax is wrong, do you
| really trust the data inside?
|
|   Forgiving parsers/lexers are common in compilers for languages
|   like Rust, C#, or TypeScript; you may want to investigate
|   TypeScript in particular since it's applicable to JSON syntax.
|   Maybe you could repurpose their parser.
| RichieAHB wrote:
| I feel like trying to infer valid JSON from invalid JSON is a
| recipe for garbage. You'd probably be better off doing a second
| pass with the "JSON" through the LLM but, as the sibling
| commenter said, at this point even the good JSON may be garbage
| ...
| _dain_ wrote:
| halloween was last week
| explaininjs wrote:
| Perhaps not quite what you're asking for, but along the same
| lines there's this "Incomplete JSON" parser, which takes a
| string of JSON as it's coming out of an LLM and parses it into
| as much data as it can get. Useful for building streaming UI's,
| for instance it is used on https://rexipie.com quite
| extensively.
|
| https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
|
|   Some example test cases:
|
|       { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
|       { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
|       { input: "[{},", output: [{}] },
|       { input: "[{},1", output: [{}, 1] },
|       { input: '[{},"', output: [{}, ""] },
|       { input: '[{},"abc', output: [{}, "abc"] },
|
|   Work could be done to optimize it, for instance adding streaming
|   support. But the cycles consumed either way are minimal for
|   LLM-output-length-constrained JSON.
|
| Fun fact: as best I can tell, GPT-4 is entirely unable to
| synthesize code to accomplish this task. Perhaps that will
| change as this implementation is made public, I do not know.
| gurrasson wrote:
| The jsonrepair tool https://github.com/josdejong/jsonrepair
| might interest you. It's tailored to fix JSON strings.
|
| I've been looking into something similar for handling partial
| JSONs, where you only have the first n chars of a JSON. This is
|   common with LLMs whose outputs are streamed to reduce latency.
|   If one knows the JSON schema ahead of time, one can start
| processing these first fields before the remaining data has
| fully loaded. If you have to wait for the whole thing to load
| there is little point in streaming.
|
| Was looking for a library that could do this parsing.
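|
|   For what it's worth, Go's standard json.Decoder already hands
|   back tokens as soon as they arrive on the stream, which gets
|   partway there; a rough sketch (llmStream here is just a stand-in
|   for a slow network read):
|
|       dec := json.NewDecoder(llmStream) // any io.Reader
|       for {
|           // Token returns each token as soon as it is buffered, so
|           // early fields can be acted on before the document is
|           // complete; a truncated tail surfaces as an error.
|           tok, err := dec.Token()
|           if err != nil {
|               break
|           }
|           fmt.Printf("token: %#v\n", tok)
|       }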
| explaininjs wrote:
| See my sibling comment :)
| mannyv wrote:
| These are always interesting to read because you get to see
| runtime quirks. I'm surprised there was so much function call
| overhead, for example. And it's interesting that you can bypass
| range checking.
|
| The most important thing, though, is the process: measure then
| optimize.
| isuckatcoding wrote:
| This is fantastically useful.
|
| Funny enough I stumbled upon your article just yesterday through
| google search.
| mgaunard wrote:
| "It's unrealistic to expect to have the entire input in memory"
| -- wrong for most applications
| isuckatcoding wrote:
| Yes but for applications where you need to do ETL style
| transformations on large datasets, streaming is an immensely
| useful strategy.
|
| Sure you could argue go isn't the right tool for the job but I
| don't see why it can't be done with the right optimizations
| like this effort.
| dh2022 wrote:
| If performance is important why would you keep large datasets
| in JSON format?
| querulous wrote:
| sometimes it's not your data
| isuckatcoding wrote:
| Usually because the downstream service or store needs it
| Maxion wrote:
| Because you work at or for some bureaucratic MegaCorp, that
| does weird things with no real logic behind it other than
| clueless Dilbert managers making decisions based on
| LinkedIn blogs. Alternatively desperate IT consultants
| trying to get something to work with too low of a budget
| and/or no access to do things the right way.
|
|       Be glad you have JSON to parse, and not EDI, some custom
|       delimited data format (with no or outdated documentation) -
|       or, _shudders_, you work in the airline industry with SABRE.
| capableweb wrote:
| https://yourdatafitsinram.net/
| mannyv wrote:
| If you're building a library you either need to explicitly call
| out your limits or do streaming.
|
|   I've pumped gigs of JSON data, so a streaming parser is
| appreciated. Plus streaming shows the author is better at
| engineering and is aware of the various use cases.
|
| Memory is not cheap or free except in theory.
| jjeaff wrote:
| I guess it's all relative. Memory is significantly cheaper if
| you get it anywhere but on loan from a cloud provider.
| mannyv wrote:
| RAM is always expensive no matter where you get it from.
|
| Would you rather do two hours of work or force thousands of
| people to buy more RAM because your library is a memory
| hog?
|
|       And on embedded systems RAM is at a premium. More RAM = more
|       cost.
| e12e wrote:
| If you can live with "fits on disk" mmap() is a viable option?
| Unless you truly need streaming (early handling of early data,
| like a stream of transactions/operations from a single JSON
| file?)
| mannyv wrote:
|     In general, JSON comes over the network, so mmap won't really
| work unless you save to a file. But then you'll run out of
| disk space.
|
| I mean, you have a 1k, 2k, 4k buffer. Why use more, because
| it's too much work?
| ahoka wrote:
| Most applications read JSONs from networks, where you have a
| stream. Buffering and fiddling with the whole request in memory
| increases latency by a lot, even if your JSON is smallish.
| mgaunard wrote:
| On a carefully built WebSocket server you would ensure your
| WebSocket messages all fit within a single MTU.
| Rapzid wrote:
|   Most (_most_) JSON payloads are probably much smaller than
|   typical buffer sizes, so they just end up all in memory anyway.
| jensneuse wrote:
| I've taken a very similar approach and built a GraphQL tokenizer
| and parser (amongst many other things) that also makes zero memory
| allocations and is quite fast. In case you'd like to check out the
| code: https://github.com/wundergraph/graphql-go-tools
| markl42 wrote:
| How big of an issue is this for GQL servers where all queries
| are known ahead of time (allowlist) - i.e. you can
|   cache/memoize the AST parsing, and this is only a perf issue
|   for a few minutes after the container starts up.
|
| Or does this bite us in other ways too?
| jensneuse wrote:
|     I've been building GraphQL API gateways / routers for 5+
|     years now. It would be nice if trusted documents or persisted
|     operations were the default, but the reality is that a lot of
|     people want to open up their GraphQL to the public. For that
|     reason we've built a fast parser, validator, normalizer and
|     many other things to support these use cases.
| nwpierce wrote:
| Writing a json parser is definitely an educational experience. I
| wrote one this summer for my own purposes that is decently fast:
| https://github.com/nwpierce/jsb
| lamontcg wrote:
| Wish I wasn't 4 or 5 uncompleted projects deep right now and had
| the time to rewrite a monkey parser using all these tricks.
| jchw wrote:
| Looks pretty good! Even though I've written far too many JSON
| parsers already in my career, it's really nice to have a
| reference for how to think about making a reasonable, fast JSON
| parser, going through each step individually.
|
| That said, I will say one thing: you don't _really_ need to have
| an explicit tokenizer for JSON. You can get rid of the concept of
| tokens and integrate parsing and tokenization _entirely_. This is
| what I usually do since it makes everything simpler. This is a
| lot harder to do with the rest of ECMAScript, since there you
| wind up needing look-ahead (sometimes arbitrarily large
| look-ahead... consider arrow functions: it's mostly a subset of
| the grammar of a parenthesized expression. Comma is an operator,
| and for default values, equals is an operator. It isn't until the
| => does or does not appear that you know for sure!)
| coldtea wrote:
| What line of work are you in that you've "written far too many
| JSON parsers already" in your career?!!!
| craigching wrote:
| Probably anywhere that requires parsing large JSON documents.
| Off the shelf JSON parsers are notoriously slow on large JSON
| documents.
| ahoka wrote:
|       Not necessarily; for example, Newtonsoft is fine with
|       multiple hundreds of megabytes if you use it correctly. But
|       of course it depends on how large we are talking about.
| jchw wrote:
|     Reasons differ. C++ is a really hard place to be. It's gotten
|     better, but if you can't tolerate exceptions, need code that
|     is as obviously memory-safe as possible, or need to parse
|     incrementally (think SAX style), off-the-shelf options like
|     jsoncpp may not fit the bill.
|
|     Handling large documents is indeed another big one. It _sort-
|     of_ fits in the same category as being able to parse
|     incrementally. That said, Go has a JSON scanner you can sort
|     of use for incremental parsing, but in practice I've found
|     it to be a lot slower, so for large documents it's a problem.
|
| I've done a couple in hobby projects too. One time I did a
| partial one in Win32-style C89 because I wanted one that
| didn't depend on libc.
| lgas wrote:
| Someone misunderstood the JSONParserFactory somewhere along
| the line.
| marcosdumay wrote:
| I've seen "somebody doesn't agree with the standard and we
| must support it" way too many times, and I've written JSON
| parsers because of this. (And, of course, it's easy to get
| some difference with the JSON standard.)
|
|   I've had problems with handling streams like the OP on
|   basically every programming language and data-encoding
|   language pair that I've tried. It looks like nobody ever
|   thinks about it (I do use chunking any time I can, but
|   sometimes you can't).
|
| There are probably lots and lots of reasons to write your own
| parser.
| jbiggley wrote:
| This reminds me of my favourite quote about standards.
|
| >The wonderful thing about standards is that there are so
| many of them to choose from.
|
| And, keeping with the theme, this quote may be from Grace
| Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.
| evmar wrote:
| In n2[1] I needed a fast tokenizer and had the same "garbage
| factory" problem, which is basically that there's a set of
| constant tokens (like json.Delim in this post) and then strings
| which cause allocations.
|
| I came up with what I think is a kind of neat solution, which is
| that the tokenizer is generic over some T and takes a function
| from byte slice to T and uses T in place of the strings. This way,
| when the caller has some more efficient representation available
| (like one that allocates less) it can provide one, but I can
| still unit test the tokenizer with the identity function for
| convenience.
|
| In a sense this is like fusing the tokenizer with the parser at
| build time, but the generic allows layering the tokenizer such
| that it doesn't know about the parser's representation.
|
| [1] https://github.com/evmar/n2
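|
| Roughly the same idea rendered in Go generics, for readers who
| don't want to dig through the Rust (the names below are invented
| for illustration):
|
|     // Tokenizer is generic over how string-ish tokens are held.
|     // intern converts a raw byte slice into the caller's preferred
|     // representation (an index into a string table, say), so the
|     // tokenizer itself never allocates strings.
|     type Tokenizer[T any] struct {
|         input  []byte
|         pos    int
|         intern func([]byte) T
|     }
|
|     // Ident scans a run of non-space bytes and hands it to intern.
|     func (t *Tokenizer[T]) Ident() T {
|         start := t.pos
|         for t.pos < len(t.input) && t.input[t.pos] != ' ' {
|             t.pos++
|         }
|         return t.intern(t.input[start:t.pos])
|     }
|
| In tests it can be instantiated with the identity function, e.g.
| Tokenizer[[]byte]{intern: func(b []byte) []byte { return b }}.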
| suzzer99 wrote:
| Can someone explain to me why JSON can't have comments or
| trailing commas? I really hope the performance gains are worth
| it, because I've lost 100s of man-hours to those things, and had
| to resort to stuff like this in package.json:
| "IMPORTANT: do not run the scripts below this line, they are for
| CICD only": true,
| coldtea wrote:
|   It can't have comments because it didn't originally have
|   comments, so now it's too late. And it originally didn't have
|   comments because Douglas Crockford thought they could be
|   abused for parsing instructions.
|
|   As for not having trailing commas, that's probably a less
|   intentional bad design choice.
|
|   That said, if you want trailing commas and comments, and
|   control the parsers that will be used for your JSON, then use
|   JSONC (JSON with comments). VSCode for example does that for
|   its JSON configuration.
| explaininjs wrote:
| JSONC also supports trailing commas. It is, in effect, "JSON
| with no downsides".
|
| TOML/Yaml always drive me batty with all their obscure
| special syntax. Whereas it's almost impossible to look at a
| formatted blob of JSON and not have a very solid
| understanding of what it represents.
|
|     The one thing I _might_ add is multiline strings with backticks,
| but even that is probably more trouble than it's worth, as
| you immediately start going down the path of "well let's also
| have syntax to strip the indentation from those strings,
| maybe we should add new syntax to support raw strings, ..."
| tubthumper8 wrote:
| Does JSONC have a specification or formal definition? People
| have suggested[1] using JSON5[2] instead for that reason
|
| [1] https://github.com/microsoft/vscode/issues/100688
|
| [2] https://spec.json5.org/
| mananaysiempre wrote:
| Unfortunately, JSON5 says keys can be ES5
| IdentifierName[1]s, which means you must carry around
| Unicode tables. This makes it a non-option for small
| devices, for example. (I mean, not really, you technically
| _could_ fit the necessary data and code in low single-digit
| kilobytes, but it feels stupid that you have to. Or you
| could just not do that but then it's no longer JSON5 and
| what was the point of having a spec again?)
|
| [1] https://es5.github.io/x7.html#x7.6
| mananaysiempre wrote:
| Amusingly, it originally _did_ have comments. Removing
| comments was the one change Crockford ever made to the
| spec[1].
|
|     [1] https://web.archive.org/web/20150105080225/https://plus.goog...
|     (thank you Internet Archive for making Google's social network
|     somewhat accessible and less than useless)
| shepherdjerred wrote:
| I don't know the historic reason why it wasn't included in the
| original spec, but at this point it doesn't matter. JSON is
| entrenched and not going to change.
|
| If you want comments, you can always use jsonc.
| semiquaver wrote:
| It's not that it "can't", more like it "doesn't". Douglas
| Crockford prioritized simplicity when specifying JSON. Its BNF
| grammar famously fits on one side of a business card.
|
| Other flavors of JSON that include support for comments and
| trailing commas exist, but they are reasonably called by
| different names. One of these is YAML (mostly a superset of
| JSON). To some extent the difficulties with YAML (like unquoted
| 'no' being a synonym for false) have vindicated Crockford's
| priorities.
| forrestthewoods wrote:
| > Any (useful) JSON decoder code cannot go faster that this.
|
| That line feels like a troll. Cunningham's Law in action.
|
| You can definitely go faster than 2 Gb/sec. In a word, SIMD.
| shoo wrote:
| we could re-frame by distinguishing problem statements from
| implementations
|
| Problem A: read a stream of bytes, parse it as JSON
|
| Problem B: read a stream of bytes, count how many bytes match a
| JSON whitespace character
|
| Problem B should require fewer resources* to solve than problem
| A. So in that sense problem B is a relaxation of problem A, and
| a highly efficient implementation of problem B should be able
| to process bytes much more efficiently than an "optimal"
| implementation of problem A.
|
| So in this sense, we can probably all agree with the author
| that counting whitespace bytes is an easier problem than the
| full parsing problem.
|
| We're agreed that the author's implementation (half a page of
| go code that fits on a talk slide) to solve problem B isn't the
| most efficient way to solve problem B.
|
| I remember reading somewhere the advice that to set a really
| solid target for benchmarking, you should avoid measuring the
| performance of implementations and instead try to estimate a
| theoretical upper bound on performance, based on say a
| simplified model of how the hardware works and a simplification
| of the problem -- that hopefully still captures the essence of
| what the bottleneck is. Then you can compare any implementation
| to that (unreachable) theoretical upper bound, to get more of
| an idea of how much performance is still left on the table.
|
| * for reasonably boring choices of target platform, e.g. amd64
| + ram, not some hypothetical hardware platform with
| surprisingly fast dedicated support for JSON parsing and bad
| support for anything else.
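|
|   For concreteness, the "problem B" baseline is essentially just
|   this kind of loop (a rough Go sketch, not the author's code;
|   assumes the io package is imported):
|
|       // countWhitespace counts the JSON whitespace bytes in a
|       // stream. It does no parsing at all, so its throughput is an
|       // upper bound for any parser that must at least look at
|       // every byte on the same hardware and I/O path.
|       func countWhitespace(r io.Reader) (int64, error) {
|           var n int64
|           buf := make([]byte, 64*1024)
|           for {
|               m, err := r.Read(buf)
|               for _, b := range buf[:m] {
|                   switch b {
|                   case ' ', '\t', '\n', '\r':
|                       n++
|                   }
|               }
|               if err == io.EOF {
|                   return n, nil
|               }
|               if err != nil {
|                   return n, err
|               }
|           }
|       }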
| ncruces wrote:
| It's possible to improve over the standard library with better
| API design, but it's not really possible to do a fully streaming
| parser that doesn't half fill structures before finding an error
| and bailing out in the middle, which is another explicit design
| constraint for the standard library.
| hintymad wrote:
| How is this compared to Daniel Lemire's simdjson?
| https://github.com/simdjson/simdjson
| 1vuio0pswjnm7 wrote:
| "But there is a better trick that we can use that is more space
| efficient than this table, and is sometimes called a computed
| goto."
|
| From 1989:
|
| https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
|
| "Indirection in the Goto field is a more powerful version of the
| computed Goto which appears in some languages. It allows a
| program to quickly perform a multi-way control branch based on an
| item of data."
| wood_spirit wrote:
| My own lessons from writing fast json parsers have a lot of
| language-specific things in them, but here are some generalisations:
|
| Avoid heap allocations in tokenising. Have a tokeniser that is a
| function that returns a stack-allocated struct or an int64 token
| that is a packed field describing the start, length and type
| offsets etc of the token.
|
| Avoid heap allocations in parsing: support a getString(key
| String) type interface for clients that want to chop up a buffer.
|
| For deserialising to objects where you know the fields at compile
| time, generally generate a switch on key length before comparing
| string values.
|
| My experience in data pipelines that process lots of json is that
| the choice of json library can make a 3-10x performance difference
| and that all the main parsers want to allocate objects.
|
| If the classes you are serialising or deserialising are known at
| compile time then Jackson Java does a good job but you can get a
| 2x boost with careful coding and profiling.
|
| Whereas if you are parsing arbitrary json, all the mainstream
| parsers want to do lots of allocations that a more intrusive
| parser you write yourself can avoid, and you can make massive
| performance wins if you are processing thousands or millions of
| objects per second.
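|
| A rough Go sketch of the packed-token idea from the first point
| (the bit layout here is arbitrary, purely for illustration):
|
|     // Token packs kind, start offset and length into one int64 so
|     // the tokenizer can return it by value with no heap allocation.
|     type Token int64
|
|     const (
|         TokString = iota
|         TokNumber
|         TokPunct
|     )
|
|     // pack stores an 8-bit kind, 32-bit start and 24-bit length.
|     func pack(kind, start, length int) Token {
|         return Token(int64(kind)<<56 | int64(start)<<24 | int64(length))
|     }
|
|     func (t Token) Kind() int   { return int(t >> 56) }
|     func (t Token) Start() int  { return int(t >> 24 & 0xFFFFFFFF) }
|     func (t Token) Length() int { return int(t & 0xFFFFFF) }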
| wslh wrote:
| I remember this JSON benchmark page from RapidJSON [1].
|
| [1] https://rapidjson.org/md_doc_performance.html