On-demand JSON: A better way to parse documents?

  • So they're creating a DOM-like API in front of a SAX-style parser and getting faster results (barring FPGA and GPU research). It's released as part of simdjson.

    I wonder if that kind of front end was done in the age of SAX parsers?

    Such a well-written paper.

  • Why not just use msgpack? The advantage of JSON is that support is already built into everything and you don't have to think about it.

    If you start having to actually make an effort to fuss with it, then why not consider other formats?

    This does have nice backwards compatibility with existing JSON stuff though, and sticking to standards is cool. But msgpack is also pretty nice.

  • Alternatively, JSONL/NDJSON. The largest parts of JSON documents are usually arrays, not dictionaries. So you can, e.g., write:

      {<a header about foos and bars>}
      {<foo 1>}
      ...
      {<foo N>}
      {<bar 1>}
      ...
      {<bar N>}
    
    It is compatible with streaming, database JSON columns, and code editors; a minimal parsing sketch follows below.
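
    A minimal sketch of consuming that layout line by line, assuming simdjson with exceptions enabled and a hypothetical records.ndjson file whose records each carry a "name" field (a real implementation might prefer simdjson's document-stream support over re-padding every line):

      #include <fstream>
      #include <iostream>
      #include <string>
      #include <string_view>
      #include "simdjson.h"

      int main() {
          simdjson::ondemand::parser parser;
          std::ifstream in("records.ndjson");
          std::string line;
          while (std::getline(in, line)) {
              if (line.empty()) continue;
              // Each line is a complete JSON document; copy it into a buffer
              // that has the padding simdjson requires.
              simdjson::padded_string json(line);
              simdjson::ondemand::document doc = parser.iterate(json);
              // Pull out only the field we care about; everything else is skipped.
              std::string_view name = doc["name"];
              std::cout << name << "\n";
          }
      }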

  • I don't really understand what's new here compared to what SIMDJSON supported already.

    Anyway, it's the best JSON parser I've found (in any language); I implemented fastgron (https://github.com/adamritter/fastgron) on top of it because of the On-Demand API's performance.

    One problem with the library was that it needed extra padding at the end of the JSON buffer, so it didn't support streaming or memory mapping (see the sketch below).
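
    For reference, a sketch of what that padding requirement looks like in use, assuming simdjson with exceptions enabled and a hypothetical twitter.json file: the buffer handed to the On-Demand parser needs SIMDJSON_PADDING spare bytes after the document, which padded_string provides but a plain read-only mmap of a file cannot.

      #include <cstdint>
      #include <iostream>
      #include "simdjson.h"

      int main() {
          simdjson::ondemand::parser parser;
          // padded_string::load reads the file into a buffer that has the
          // required SIMDJSON_PADDING extra bytes after the document.
          simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
          simdjson::ondemand::document doc = parser.iterate(json);
          uint64_t count = doc["search_metadata"]["count"];
          std::cout << count << "\n";
      }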

  • Is this different from what everyone was doing with XML back in the day?

  • This is a real “why didn’t I think of that” moment for sure. So many systems I’ve written have profiled with most of the CPU time and allocations in the JSON parser, when all the code needs is a few fields. But rewriting it all in SAX style is just not worth the trouble.
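
    To make that concrete, here is a rough sketch of the on-demand style, assuming simdjson with exceptions enabled and a hypothetical orders.json holding an array of objects with "id" and "amount" fields: only the fields that are touched get decoded, and no DOM tree is ever allocated.

      #include <iostream>
      #include <string_view>
      #include "simdjson.h"

      int main() {
          simdjson::ondemand::parser parser;
          simdjson::padded_string json = simdjson::padded_string::load("orders.json");
          simdjson::ondemand::document doc = parser.iterate(json);
          double total = 0;
          // Walk the array lazily: values we never ask for are skipped, not parsed.
          for (simdjson::ondemand::object order : doc.get_array()) {
              std::string_view id = order["id"];
              double amount = order["amount"];
              total += amount;
              std::cout << id << ": " << amount << "\n";
          }
          std::cout << "total = " << total << "\n";
      }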

  • Sounds similar to a technique we're using to dynamically aggregate and transform JSON. We call this package "astjson" as we're doing operations like "walking" through the JSON or "merging" fields at the AST level. We wrote about the topic and how it helped us to improve the performance of our API gateway written in Go, which makes heavy use of JSON aggregations: https://wundergraph.com/blog/astjson_high_performance_json_t...

  • On the face of it, this sounds kind of like the XML::Twig Perl module.

  • Related submission from yesterday:

    https://news.ycombinator.com/item?id=39319746 - JSON Parsing: Intel Sapphire Rapids versus AMD Zen 4 - 40 points and 10 comments

  • I solved this problem with a custom indexed format: https://github.com/7mind/sick

  • > The JSON specification has six structural characters (‘[’, ‘{’, ‘]’, ‘}’, ‘:’, ‘,’) to delimit the location and structure of objects and arrays.

    Wouldn’t a quote “ also be a structural character? It doesn’t actually represent data; it just delimits the beginning and end of a string.

    I get why I’m probably wrong: a string isn’t a structure of chars, because that isn’t a type in JSON. The above six are the pieces of the two collection types in JSON (objects and arrays).

  • Relevant: LEJP, the libwebsockets JSON parser.

    You specify what you're interested in and then the parser calls your callback whenever it reads the part of a large JSON stream that has your key; a toy sketch of the pattern follows below.

    https://libwebsockets.org/lws-api-doc-main/html/md_READMEs_R...
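
    A toy illustration of that callback pattern (this is not LEJP's actual API, just the shape of the idea): scan for one key and hand the raw value text to a callback, without building a tree. A real streaming parser tracks nesting and string escapes; this heuristic does not.

      #include <functional>
      #include <iostream>
      #include <string>
      #include <string_view>

      // Toy scanner: find `"key":` and pass the raw value text to the callback.
      // Ignores nesting and escape sequences; parsers like LEJP handle those.
      void on_key(std::string_view json, std::string_view key,
                  const std::function<void(std::string_view)> &callback) {
          std::string needle = "\"" + std::string(key) + "\":";
          size_t pos = json.find(needle);
          if (pos == std::string_view::npos) return;
          size_t start = pos + needle.size();
          size_t end = json.find_first_of(",}", start);  // crude end-of-value guess
          callback(json.substr(start, end - start));
      }

      int main() {
          std::string_view doc = R"({"id": 42, "name": "ada", "tags": []})";
          on_key(doc, "name", [](std::string_view value) {
              std::cout << "name =" << value << "\n";  // prints: name = "ada"
          });
      }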

  • Pretty cool!

    This reminds me of oboe.js: https://github.com/jimhigson/oboe.js

  • > The JSON syntax is nearly a strict subset of the popular programming language JavaScript.

    What JSON isn’t valid JS?

  • Sorry, I would never use this. Before I consume any JSON from any source or for any purpose, I validate it. Lazy loading serves no purpose if you need validation.

    Hint: you need validation.