Loading Pydantic models from JSON without running out of memory

  • Pydantic author here. We plan to improve pydantic so that JSON is parsed iteratively, which will make it possible to read a file as we parse it. Details in https://github.com/pydantic/pydantic/issues/10032.

    Our JSON parser, jiter (https://github.com/pydantic/jiter), already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.

    This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.

  • I've only recently run into this. Does anyone have any insight into why it takes 2 GB to handle a 100 MB file?

    This is highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for XML parsing.
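
A rough way to see where the overhead comes from, using only the standard library. This is a sketch with made-up sample records, not a reproduction of the 100 MB case, and it measures plain `json.loads` only; Pydantic model instances would add their own cost on top. It compares the size of the JSON text with the memory held by the fully parsed object tree, which is the DOM-style approach.

```python
import json
import tracemalloc

# Hypothetical sample data: 10,000 small records serialized to one JSON string.
records = [{"id": i, "name": f"user-{i}", "tags": ["a", "b"]} for i in range(10_000)]
raw = json.dumps(records)
text_bytes = len(raw.encode("utf-8"))

# Track only the allocations made while parsing.
tracemalloc.start()
parsed = json.loads(raw)  # builds the whole object tree in memory at once
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"JSON text:      {text_bytes:>12,} bytes")
print(f"parsed objects: {current:>12,} bytes (peak {peak:,})")
print(f"overhead ratio: {current / text_bytes:.1f}x")
```

Every dict, list, str and int in the tree is a separate Python object with its own header, so the parsed form is several times larger than the text. A streaming (SAX-style) parser avoids holding the whole tree at once.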

  • My problem isn't running out of memory; it's loading a complex model whose fields are BaseModels and unions of BaseModels nested multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I almost need a parser that searches the space of possible loads. Does anyone know of software that does that?

  • I think I’d start by evaluating whether this is a good application for JSON. I tend to only want to use it for small files; binary formats are usually much better in this scenario. Alternatively, if you had to go with JSON, you could consider using JSONL.
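
A minimal sketch of the JSONL idea, assuming one JSON object per line. The `iter_jsonl` helper and the sample data are hypothetical; an `io.StringIO` stands in for a real file opened with `open(path, encoding="utf-8")`. Only one record is parsed and held in memory at a time.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Stand-in for a real file; the blank line is skipped by iter_jsonl.
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "name": "b"}\n'
    "\n"
    '{"id": 3, "name": "c"}\n'
)

# Records are consumed lazily, so memory use does not grow with file size.
total = sum(record["id"] for record in iter_jsonl(sample))
print(total)
```

Each yielded dict could then be passed to a model constructor or validator one at a time, rather than validating the whole file in one call.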

  • I gave up on Python dataclasses and JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

    Automatic, statically typed deserialization is worth the trouble, in my opinion.

  • So are there downsides to just always setting slots=True on all of my Python data types?

  • I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.

  • Maybe using mmap would also save some memory, though I'm not quite sure whether this can be implemented in Python.
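
Python does ship an `mmap` module in the standard library. Here is a minimal sketch, assuming a JSON Lines file so that records can be parsed one at a time; the temporary file and its contents are made up for the example. Note that `json.loads` accepts `str`, `bytes` or `bytearray` but not the mmap object itself, so each line is still copied out as `bytes` before parsing, and the parsed objects live on the normal heap. Whether this saves memory in practice would need measuring.

```python
import json
import mmap
import os
import tempfile

# Write a small hypothetical JSON Lines file to map.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for i in range(3):
        tmp.write(json.dumps({"id": i}).encode("utf-8") + b"\n")
    path = tmp.name

ids = []
with open(path, "rb") as f:
    # Length 0 maps the whole file; the OS pages data in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # readline() returns b"" at end of file, which stops the iteration.
        for line in iter(mm.readline, b""):
            ids.append(json.loads(line)["id"])

os.remove(path)  # clean up the temporary file
print(ids)
```

For a single large JSON document rather than JSONL, the mapped bytes would still have to be handed to a parser that can work on a buffer incrementally, which the standard `json` module does not do.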

  • How does the speed of the dataclass version compare?

  • Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/