Show HN: OCR Benchmark Focusing on Automation

  • Great list! I’ll definitely run your benchmark against Doctly.ai (our PDF-to-Markdown service), especially as we publish our workflow service, to see how we stack up.

    One thing I’ve noticed in many benchmarks, though, is the potential for bias. I’m actually working on a post about this issue, so it’s top of mind for me. For example, in the omni benchmark, the ground truth expected a specific order for the header information (logo, phone number, customer details, and so on). While all of that data sits near the top of the document, the exact ordering felt subjective: should the model prioritize horizontal or vertical scanning? Since the ground truth was created by the company running the benchmark, their own model naturally scored highest for reproducing that same order.

    However, this approach penalized other LLMs for not adhering to the "correct" order, even though the order itself was arguably arbitrary. This kind of bias can skew results and make it harder to evaluate models fairly. I’d love to see benchmarks that account for subjectivity or allow for multiple valid interpretations of document structure (I’ve put a toy illustration of the scoring effect at the end of this comment).

    Did you run into this when looking at the benchmarks?

    On a side note, Doctly.ai leverages multiple LLMs to evaluate documents and runs a tournament with a judge for each page to get the best result (this is only on the Precision Ultra selection).
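
    To make the ordering point concrete, here’s a toy comparison (my own sketch in Python, not the omni benchmark’s actual scoring code; the documents and metrics are made up). A sequence-based similarity score drops sharply for a transcription that contains exactly the same header fields in a different reading order, while a set-based comparison treats the two as identical:

      from difflib import SequenceMatcher

      # Same header fields, two defensible reading orders.
      ground_truth = "ACME Logo\n(555) 123-4567\nCustomer: Jane Doe"
      prediction = "Customer: Jane Doe\nACME Logo\n(555) 123-4567"

      # Order-sensitive scoring: a pure reordering is punished heavily.
      order_sensitive = SequenceMatcher(None, ground_truth, prediction).ratio()

      # Order-insensitive scoring: compare the pages as sets of lines instead.
      gt_lines = set(ground_truth.splitlines())
      pred_lines = set(prediction.splitlines())
      order_insensitive = len(gt_lines & pred_lines) / len(gt_lines | pred_lines)

      print(round(order_sensitive, 2))  # ~0.56, well below 1.0
      print(order_insensitive)          # 1.0: same content, different reading order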

  • There was a discussion on this benchmark (https://getomni.ai/ocr-benchmark) a couple of weeks ago: https://news.ycombinator.com/item?id=43118514

  • Many of the benchmarks I have seen in this space suffer from the Texas Sharpshooter fallacy, where you shoot first and then paint a target around the hole.

    If you create a benchmark and your product outperforms everything else, it could mean many things. Overfitting being one of them.

  • Love to see another benchmark! We published the OmniAI OCR benchmark the other week. Thanks for adding us to the list.

    One question on the "Automation" score in the results: is it a function of extraction accuracy, or of how accurate the LLM's "confidence score" is? I noticed the "accuracy" column was very tightly grouped (between 79% and 84%) but the automation score was way more variable (I've sketched the distinction I mean at the end of this comment).

    And on a side note: is there an open source benchmark for Mistral's latest OCR model? I know they claimed it was 95% accurate, but it looks like that was based on an internal evaluation.
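
    For what it's worth, here is the distinction I'm asking about as a toy sketch (the field names, the 0.9 threshold, and the scoring are my own assumptions, not your benchmark's actual definition): two models with identical extraction accuracy can end up with very different automation scores if automation is gated on self-reported confidence.

      def accuracy(results):
          # Fraction of fields extracted correctly, ignoring confidence.
          return sum(r["predicted"] == r["expected"] for r in results) / len(results)

      def automation_rate(results, threshold=0.9):
          # Fraction of fields a pipeline would accept without human review,
          # based purely on the model's self-reported confidence.
          return sum(r["confidence"] >= threshold for r in results) / len(results)

      # Two hypothetical models: identical accuracy, very different calibration.
      model_a = [{"predicted": "x", "expected": "x", "confidence": 0.95},
                 {"predicted": "y", "expected": "z", "confidence": 0.40}]
      model_b = [{"predicted": "x", "expected": "x", "confidence": 0.55},
                 {"predicted": "y", "expected": "z", "confidence": 0.50}]

      print(accuracy(model_a), automation_rate(model_a))  # 0.5 0.5
      print(accuracy(model_b), automation_rate(model_b))  # 0.5 0.0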

  • How do these compare to traditional commercial and open source OCR tools? What about things like the Apple Vision APIs?

  • In your pricing example I see $6.27 for a 10-page document, which works out to roughly $0.63 per page. That is extremely expensive.