It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:
https://i.imgur.com/jcwW5AG.jpeg
For the blocks in the center, it outputs:
> Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.
This is perfect! But then the next one:
> Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.
This is really bad because:
1/ a portion of the text of the previous block is repeated
2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), as well as a portion of the rightmost block ("Chastelet")
3/ but worst of all, a whole word is invented, "mortilhomme", which appears nowhere in the original. (That word doesn't exist in French, so in this case it's easy to spot; the real risk is invented words that do exist and "feel right" in context.)
(Correct text for the second block should be:
> Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)
This is incredibly exciting. I've been pondering/experimenting with a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.
Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.
It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM-generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.
Lots of potential here.
We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini
A high-level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing, with a tendency to hallucinate text, mangle table structure, and drop content.
I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.
We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.
However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!
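To make that concrete, here is a minimal sketch of the classify -> split -> extract shape with a confidence gate in front of human review. All the helpers are placeholder stubs, not anything vendor-specific:

```
from dataclasses import dataclass, field

@dataclass
class Extraction:
    fields: dict = field(default_factory=dict)
    confidence: float = 0.0

def classify(pages):             # placeholder: route to a document type
    return "invoice"

def split(pages, doc_type):      # placeholder: break multi-doc scans apart
    return [pages]

def extract(section, doc_type):  # placeholder: the OCR/VLM call goes here
    return Extraction(fields={"total": "123.45"}, confidence=0.82)

def merge(results):              # overall confidence = weakest section
    merged = Extraction(confidence=min(r.confidence for r in results))
    for r in results:
        merged.fields.update(r.fields)
    return merged

def process_document(pages, review_threshold=0.9):
    doc_type = classify(pages)
    result = merge([extract(s, doc_type) for s in split(pages, doc_type)])
    if result.confidence < review_threshold:
        pass  # queue for human review instead of auto-accepting
    return result
```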
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].
Also, the Colab link in the article is broken; I found a functional one [2] in the docs.
[1] https://github.com/opendatalab/MinerU [2] https://colab.research.google.com/github/mistralai/cookbook/...
Mistral OCR made multiple mistakes in extracting this [1] document. It is a two-page-long PDF in Arabic from the Saudi Central Bank. The following errors were observed:
- Referenced Vision 2030 as Vision 2.0.
- Failed to extract the table; instead, it hallucinated and extracted the text in a different format.
- Failed to extract the number and date of the circular.
I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].
1. https://rulebook.sama.gov.sa/sites/default/files/en_net_file... 2. https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-go...
Dang. Super fast and significantly more accurate than Google, Claude and others.
Pricing: $1 per 1,000 pages, or per 2,000 pages if "batched". I'm not sure what batching means in this case: multiple PDFs? Why not split them to halve the cost?
Anyway, this looks great at PDF-to-Markdown conversion.
This is cool! That said, for anyone looking to use this in RAG, the downside to specialized models instead of general VLMs is that you can't easily tune it to your specific use case. For example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2-3x the cost of Gemini Flash - hopefully the increased performance is significant.
Regardless, excited to see more and more competition in the space.
Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips
6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.
I wasn't there to solve that specific problem but it was connected to what we were doing, so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data).
I have to imagine this is a problem shared by so many companies.
Related, does anyone know of an app that can read gauges from an image and log the number to influx? I have a solar power meter in my crawlspace, it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:
I noticed on the Arabic example they lost a space after the first letter on the third to last line, can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)
Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.
Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv
Nit: Please change the URL from
https://mistral.ai/fr/news/mistral-ocr
to
https://mistral.ai/news/mistral-ocr
The article is the same, but the site navigation is in English instead of French.
Unless it's a silent statement, of course. =)
I uploaded a picture of my Chinese mouthwash [0] and it made a ton of mistakes and hallucinated a lot. Very disappointing. For example, it says the usage instruction is to use 80 ml each time, even though the actual usage instruction on the bottle says to use 5-20 mL each time, three times a day, and gargle for 1 minute.
[0] https://i.imgur.com/JiX9joY.jpeg
[1] https://chat.mistral.ai/chat/8df2c9b9-ee72-414b-81c3-843ce74...
"World's best OCR model" - that is quite a statement. Are there any well-known benchmarks for OCR software?
I gave it a bunch of my wife's 18th-century English scans to transcribe; it mostly couldn't do them, and it's been doing this for 15 minutes now. Not sure why, but I find it quite amusing: https://share.zight.com/L1u2jZYl
A similar but different product that was discussed on HN is OlmOCR from AI2, which is open source:
I would like to see how it performs with massively warped and skewed scanned text images: basically a scanned image where the text lines are wavy as opposed to straight and horizontal, where the letters are elongated, and where the line widths differ depending on the position on the scanned image. I once had to deal with such a task that somebody gave me; OCR software, Acrobat, and other tools could not decode the mess, so I had to recreate the 30 pages myself, manually. Not a fun thing to do, but that is a real use case.
The hard ones are things like contracts, leases, and financial documents, which 1) don't have a common format, 2) are filled with numbers, proper nouns, and addresses that it's really important not to mess up, and 3) cannot be inferred from context.
A typical OCR pipeline would be to pass the doc through a character-level OCR system and then correct errors with a statistical model like an LLM. An LLM can help correct "crodit card" to "credit card", but it cannot correct names or numbers. It's really bad if it replaces a 7 with a 2.
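One cheap guardrail for exactly that failure mode (just a sketch, not a claim about how any particular pipeline does it): compare the digit runs before and after the LLM correction pass and route any mismatch to review.

```
import re
from collections import Counter

def digit_runs(text: str) -> list[str]:
    # Every run of digits: account numbers, amounts, dates, ...
    return re.findall(r"\d+", text)

def correction_is_safe(raw_ocr: str, corrected: str) -> bool:
    # The LLM pass may fix "crodit" -> "credit", but it must never touch
    # numbers. If the multiset of digit runs changes, flag for review.
    return Counter(digit_runs(raw_ocr)) == Counter(digit_runs(corrected))

print(correction_is_safe("crodit card ending 4217", "credit card ending 4217"))  # True
print(correction_is_safe("crodit card ending 4217", "credit card ending 4212"))  # False
```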
Forgive my absolute ignorance, I should probably run this through a chat bot before posting ... So I'm updating my post with answers now!
Q: Do LLMs specialise in "document level" recognition based on headings, paragraphs, columns, tables, etc.? I.e., ignore words and characters for now and attempt to recognise a known document format.
A: Not most LLMs, but those with multimodal/vision capability could (e.g. DeepSeek Vision, ChatGPT-4). There are specialized models for this kind of work, like Tesseract and LayoutLM.
Q: How did OCR work "back in the day" before we had these LLMs? Are any of these methods useful now?
A: They used pattern recognition and feature extraction, rules and templates. Newer ML-based OCR used SVMs to isolate individual characters and HMMs to predict the next character or word. Today's multimodal models process images and words, can handle context better than the older methods, and can recognise whole words or phrases instead of having to read each character perfectly. This is why they can produce better results, but with hallucinations.
Q: Can LLMs rate their own confidence in each section, maybe outputting text with annotations that say "only 10% certain of this word", and pass the surrounding block through more filters, different LLMs, different methods to try to improve that confidence?
A: Short answer, "no". But you can try to estimate with post processing.
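For example, one crude post-processing estimate is to run the same region through two models (or the same model twice at nonzero temperature) and treat any word-level disagreement as low confidence; a rough sketch:

```
from difflib import SequenceMatcher

def disagreement_spans(pass_a: str, pass_b: str):
    # Word ranges where two OCR passes disagree - a crude stand-in for
    # per-word confidence.
    wa, wb = pass_a.split(), pass_b.split()
    sm = SequenceMatcher(None, wa, wb)
    return [(" ".join(wa[i1:i2]), " ".join(wb[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

a = "ne le 16 aout 1607, 3 mai 1693"
b = "ne le 16 aout 1007, 3 mai 1693"
print(disagreement_spans(a, b))  # [('1607,', '1007,')] -> treat as low confidence
```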
Or am I super naive, and all of those methods are already used by the big commercial OCR services like Textract etc?
Intriguing announcement; however, the examples on the mistral.ai page seem rather "easy".
What about rare glyphs in different languages using handwriting from previous centuries?
I've been dealing with OCR issues and evaluating different approaches for past 5+ years at a national library that I work at.
The usual consensus is that the widely used open-source Tesseract is subpar compared to commercial models.
That might be so without fine-tuning. However, one can perform supplemental training and build custom Tesseract models that outperform the base ones.
Case study: Kant's letters from the 18th century.
About 6 months ago, I tested OpenAI's approach to OCR on some old 18th-century letters that needed digitizing.
The results were rather good (90+% accuracy), with the usual hallucination here and there.
What was funny was that OpenAI was using base Tesseract to generate the segmenting and initial OCR.
The actual OCRed content before the last inference step was rather horrid, because the Tesseract model that OpenAI was using was not appropriate for the particular image.
When I took OpenAI off the first step and moved to my own Tesseract models, I gained significantly in "raw" OCR accuracy at the character level.
Then I performed normal LLM inference at the last step.
What was a bit shocking: My actual gains for the task (humanly readable text for general use) were not particularly significant.
That is, LLMs are fantastic at "untangling" a complete mess of tokens into something humanly readable.
For example:
P!3goattie -> prerogative (that is, given that the surrounding text is similarly garbled)
The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
I was just watching a science-related video containing math equations. I wondered how soon I will be able to ask the video player "What am I looking at here, describe the equations" and have it OCR the frames, analyze them, and explain them to me.
It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts, although I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert who longs to hear the sound of their own voice. A lot of human communication is non-verbal.
Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.
At least that is what I imagine the tech would evolve into in 5+ years.
Perusing the web site, it's depressing how far behind Mistral is on the basics of "how can I make this a compelling hook for customers" for the page.
The notebook link? An ACL'd doc
The examples don't even include a small text-to-markdown sample.
The before/after slider is cute, but useless - SxS is a much better way to compare.
Trying it in "Le Chat" requires a login.
It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)
If anybody tried it and has shareable examples - can you post a link? Also, anybody tried it with handwriting yet?
I'd mentioned this on HN last month, but I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.
ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").
Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
One of my hobby projects while in university was to do OCR on book scans. Doing character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to me to see such an order-of-magnitude improvement here.
Does it do handwritten notes and annotations? What about meta information like highlighting? I am also curious whether LLMs will get better because of more access to information, if it can be effectively extracted from PDFs.
I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.
Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that trying to use Mistral OCR on my messy cursive handwriting to be much less accurate than GPT-4o, in the ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.
Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard
Dupe of a post from an hour earlier: https://news.ycombinator.com/item?id=43282489
High accuracy is the goal! But the multimodal approach introduces some complexities that can impact real-world performance. We break it down in our review: https://undatas.io/blog/posts/in-depth-review-of-mistral-ocr... As for use cases, it really depends on how well it handles edge cases…
We developers seem to really dislike PDFs, to the degree that we'll build LLMs and have them translate them into Markdown.
Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.
Does this support Japanese? They list a table of language comparisons against other approaches, but I can't tell if it is exhaustive.
I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.
Wow this basically "solves" DRM for books as well as opening up the door for digitizing old texts more accurately.
Someone working there has good taste to include a Nizar Qabbani poem.
Bit unrelated, but is there anything that can help with really low-resolution text? My neighbor was the victim of a hit-and-run the other day, for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate.
I ran Mistral AI OCR against JigsawStack OCR and beat their model in every category. Full breakdown here: https://jigsawstack.com/blog/mistral-ocr-vs-jigsawstack-vocr
Pretty cool, would love to use this with Paperless, but I just can't bring myself to send a photo of all my documents to a third party, especially legal and sensitive documents, which is what I use Paperless for.
Because of that I'm stuck with crappy vision models on Ollama (thanks to AMD's crappy ROCm support for vLLM).
While it is nice to have more options, it still definitely isn't at a human level yet for hard to read text. Still haven't seen anything that can deal with something like this very well: https://i.imgur.com/n2sBFdJ.jpeg
If I remember right, Gemini actually was the closest as far as accuracy of the parts where it "behaved", but it'd start to go off the rails and reword things at the end of larger paragraphs. Maybe if the image was broken up into smaller chunks. In comparison, Mistral for the most part (besides on one particular line for some reason) sticks to the same number of words, but gets a lot wrong on the specifics.
Still terrible at handwriting.
I signed up for the API and cobbled something together from their tutorial (https://docs.mistral.ai/capabilities/document/) -- why can't they give the full script instead of little bits?
Tried uploading a TIFF, they rejected it. Tried uploading a JPG, they rejected it (even though they supposedly support images?). Tried resaving as PDF. It took that, but the output was just bad. Then tried ChatGPT on the original .tiff (not using the API), and it got it perfectly. Honestly I could barely make out the handwriting with my own eyes, but now that I see ChatGPT's version I think it's right.
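For anyone else stitching it together, this is roughly the flow from the docs page above with the mistralai Python client (upload, signed URL, OCR). The exact field names are my reading of the docs, so double-check them against the current version:

```
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Upload the PDF, get a signed URL, then run OCR on it.
uploaded = client.files.upload(
    file={"file_name": "scan.pdf", "content": open("scan.pdf", "rb")},
    purpose="ocr",
)
signed = client.files.get_signed_url(file_id=uploaded.id)

resp = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed.url},
)
print("\n\n".join(page.markdown for page in resp.pages))
```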
It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.
The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the loop feedback for domain experts.
Nice demos but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.
The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.
And page by page isn't trivial as header rows are repeated or missing etc.
So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?
I had a need to scan serial numbers from Apple's product boxes out of pictures taken by a random person on their phone.
All OCR tools that I have tried have failed. Granted, I would get much better results if I used OpenCV to detect the label, rotate/correct it, normalize contrast, etc.
But... I tried the then-new vision model from OpenAI and it did the trick so well that it wasn't feasible to consider anything else at that point.
I checked all the S/Ns afterwards for correctness via a third-party API - and all of them were right. Sure, sometimes I had to check versions with 0/o and i/l/1 substitutions, but I believe these kinds of mistakes are non-issues.
Congrats to the Mistral team for launching! A general-purpose OCR model is useful, of course. However, more purpose-built solutions are a must to convert business documents reliably. AI models pre-trained on specific document types perform better and are more accurate. Coming soon from the ABBYY team, we're shipping a new OCR API designed to be consistent, reliable, and hallucination-free. Check it out if you're looking for best-in-class DX: https://digital.abbyy.com/code-extract-automate-your-new-mus...
I tried with both PDFs and PNGs in Le Chat and the results were the worst I've ever seen when compared to any other model (Claude, ChatGPT, Gemini).
So bad that I think I need to enable the OCR function somehow, but couldn't find it.
> It takes images and PDFs as input
If you are working with PDF, I would suggest a hybrid process.
It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable acrofields approach. In many domains, you have a fixed set of forms you need to process and this can be leveraged to build a custom tool for extracting the data.
Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
The moment you need to use this kind of technology you are in a completely different regime of what the business will (should) tolerate.
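A minimal sketch of that acrofield route (using pypdf; this only works when the PDF genuinely carries fillable form fields):

```
from pypdf import PdfReader

reader = PdfReader("filled_form.pdf")

# Text-field values come straight out of the AcroForm dictionary - no OCR.
print(reader.get_form_text_fields())   # e.g. {'account_number': '12345', ...}

# get_fields() also covers checkboxes, radio buttons, etc.
for name, f in (reader.get_fields() or {}).items():
    print(name, "=", f.get("/V"))
```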
Co-founder of doctly.ai here (OCR tool)
I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.
I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:
```  ```
I'll keep testing, but so far, very disappointing :(
This document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use, and nothing could really give us the right data.
Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It will continue to re-run the page until the judge scores above a certain threshold.
I would have loved to add this into the judge list, but might have to skip it.
I wonder how it compares to USPS workers at deciphering illegible handwriting.
I feel this is created for RAG. I tried a document [0] that I had previously tested with OCR; it got all the table values correct, but the page's footer was missing.
Headers and footers are a real pain in RAG applications, as they are not required, yet most OCR or PDF parsers will return them, and there is extra work to do to remove them.
[0] https://github.com/orasik/parsevision/blob/main/example/Mult...
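One crude but effective workaround is to drop any line that repeats across most pages before chunking; a sketch:

```
from collections import Counter

def strip_repeating_lines(pages: list[str], min_ratio: float = 0.6) -> list[str]:
    # Lines that appear on a large fraction of pages are probably
    # headers/footers; drop them before chunking for RAG.
    counts = Counter(line.strip()
                     for page in pages
                     for line in set(page.splitlines())
                     if line.strip())
    threshold = max(2, int(min_ratio * len(pages)))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    return ["\n".join(l for l in page.splitlines() if l.strip() not in boilerplate)
            for page in pages]
```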
> Mistral OCR has shown impressive performance, but OCR remains a challenging problem, especially with the risk of hallucinations and missing text in LLM-based approaches. For those interested in exploring its capabilities further, the official site provides more details: [Mistral OCR](https://www.mistralocr.org). It would be great to see more benchmarks comparing different OCR solutions in real-world scenarios.
I was curious about Mistral so I made a few visualizations.
A high level diagram w/ links to files: https://eraser.io/git-diagrammer?diagramId=uttKbhgCgmbmLp8OF...
Specific flow of an OCR request: https://eraser.io/git-diagrammer?diagramId=CX46d1Jy5Gsg3QDzP...
(Disclaimer - uses a tool I've been working on)
Curious whether people have found more details on the architecture of this "mistral-ocr-latest". I have two questions:
1. I was initially thinking this was a VLM parsing model, until I saw it can extract images. So now I assume it is a pipeline of an image-extraction step and a VLM, whose results are combined to give the final output.
2. In that case, benchmarking the pipeline result vs. an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.
It outperforms the competition significantly AND can extract embedded images from the text. I really like LLMs for OCR more and more. Gemini was already pretty good at it
I think it's interesting they left out Gemini 2.0 Pro in the benchmarks, which I find to be markedly better than Flash if you don't mind the spend.
Given the fact that multi-modal LLMs are getting so good at OCR these days, isn't it a shame that we can't do local OCR with high accuracy in the near term?
Looks good but in the first hover/slider demo one can see how it could lead to confusion when handling side by side content.
Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown though it is below the latter section, as if it was in/part of that section.
Of course I am nitpicking here but just the first thing I noticed.
I understand that it is more juicy to get information from graphs, figures, and so on, as every domain uses those, but I really hope to eventually see these models able to work out music notation. I have tried the best-known apps and all of them fail to capture important details such as guitar performance symbols for bends or legato.
Does it work for video subtitles? And in Chinese? I’m looking to transcribe subtitles of live music recordings from ANHOP and KHOP.
A great question for people wanting to use OCR in business is... Which digits in monetary amounts can you tolerate being incorrect?
> "Fastest in its category"
Not one mention of the company they have partnered with, Cerebras AI, which is the reason they have fast inference [0].
Literally no-one here is talking about them and they are about to IPO.
Le chat doesn’t seem to know about this change despite the blog post stating it. Can anyone explain how to use it in Le Chat?
Is this model open source?
This might be a contrarian take: the improvement over gpt-4o and gemini-1.5 flash, both of which are general-purpose multi-modal models, seems underwhelming.
I'm sensing another bitter lesson coming, where domain optimized AI will hold a short term advantage but will be outdated quickly as the frontier model advances.
I built a CLI script for feeding PDFs into this API - notes on that and my explorations of Mistral OCR here: https://simonwillison.net/2025/Mar/7/mistral-ocr/
Is this able to convert PDF flowcharts into YAML or JSON representations of them? I have been experimenting with Claude 3.5; it has been very good at reading/understanding/converting flow charts into representations.
So I am wondering if this is more capable. I will definitely try it, but maybe someone can chime in.
I see a lot of comments on hallucination risk and the accumulation of non-traceable rotten data. If you are curious to try a better non-LLM-based OCR, try LLMWhisperer: https://pg.llmwhisperer.unstract.com/
I feel like I can't create an agent with their OCR model yet? Is that something planned, or is it API-only?
Just tested with a multilingual (bidi) English/Hebrew document.
The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).
Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).
Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded, now?
But what's the need exactly for OCR when you have multimodal LLMs that can read the same info and directly answer any questions about it?
For a VLM, my understanding is that OCR corresponds to a sub-field of questions, of the type "read exactly what's written in this document".
I'm surprised they didn't benchmark it against Pixtral.
They test it against a bunch of different Multimodal LLMs, so why not their own?
I don't really see the purpose of the OCR form factor, when you have multimodal LLMs. Unless it's significantly cheaper.
Could anyone suggest a tool which would take a bunch of PDFs (already OCR-d with Finereader), and replace the OCR overlay on all of them, maintaining the positions? I would like to have more accurate search over my document archive.
Curious to see how this performs against more real-world usage of someone taking a photo of text (where the text then becomes slightly blurred) and running OCR on it.
I can't exactly tell if the "Mistral 7B" image is an example of this exact scenario.
Is this free in LeChat? I uploaded a handwritten text and it stopped after the 4th word.
It'd be great if this could be tested against genealogical documents written in cursive like oh most of the documents on microfilm stored by the LDS on familysearch, or eastern european archival projects etc.
Tried with a few historical handwritten German documents, accuracy was abysmal.
Benchmarks look good. I tried this with a PDF that already has accurate text embedded, just with newlines making pdftotext fail, and it was accurate for the text it found, but missed entire pages.
Spent time working on the OCR problem many years ago for a mobile app. We found at the time that preprocessing was critical to the outcome (quality of image, angle, colour/greyscale).
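For reference, the kind of preprocessing that mattered looks roughly like this (OpenCV; the deskew step is an illustrative heuristic, not production-grade):

```
import cv2
import numpy as np

def preprocess_for_ocr(path: str) -> np.ndarray:
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu binarisation copes with uneven lighting better than a fixed threshold.
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Estimate skew from the minimum-area rectangle around the ink pixels.
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:            # rough heuristic to map OpenCV's angle range
        angle -= 90
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, m, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```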
Is there an OCR with this kind of accuracy that can run on a mobile device? Looking for an OCR that can detect text with high accuracy in real time, so the option of using cloud OCR is not viable.
I've found that the stunning OCR results so far were because the models were trained on the example file category. Is that the case here? Or can this recognize various documents?
So, the only thing that stopped AI from learning from all our science and taking over the world was the difficulty of converting PDFs of academic papers to more computer readable formats.
Not anymore.
They say: "releasing the API mistral-ocr-latest at 1000 pages / $"
I had to reread that a few times. I assume this means 1000pg/$1 but I'm still not sure about it.
I don't need AGI just give me superhuman OCR so we can turn all existing pdfs into text* and cheaply host it.
Feels like we are almost there.
What's the general time for something like this to hit openrouter? I really hate having accounts everywhere when I'm trying to test new things.
LLM based OCR is a disaster, great potential for hallucinations and no estimate of confidence. Results might seem promising but you’ll always be wondering.
I have an actually hard OCR exercise for an AI model: I take this image of Chinese text on one of the memorial stones on the Washington Monument https://www.nps.gov/articles/american-mission-ningpo-china-2... and ask the model to do OCR. Not a single model I've seen can OCR this correctly. Mistral is especially bad here: it gets stuck in an endless loop of nonsensical hallucinated text. Insofar as Mistral OCR is designed for "preserving historical and cultural heritage", it can't do that very well yet.
A good model can recognize that the text is written top to bottom and then right to left and perform OCR in that direction. Apple's Live Text can do that, though it makes plenty of mistakes otherwise. Mistral is far from that.
For general use this will be good.
But I bet that simple ML will lead to better OCRs when you are doing anything specialized, such as, medical documents, invoices etc.
This is $1 per 1000 pages.
For comparison, Azure Document Intelligence is $1.5/1000 pages for general OCR and $30/1000 pages for “custom extraction”.
This looks like a massive win if you were the NHS and had to scan and process old case notes.
The same is true if you were a firm of solicitors/lawyers.
Has anyone tried it for handwriting?
So far Gemini is the only model I can get decent output from for a particularly hard handwriting task.
What's the simple explanation for why these VLM OCRs hallucinate but previous version of OCRs don't?
I'm using Gemini to solve textual CAPTCHAs with some good results (better than untrained OCR).
I will give this a shot.
Is this burying the lede? OCR is a solved problem, but structuring document data from scans isn't.
How can I use these new OCR tools to make PDF files searchable by embedding the text layer?
It's not fair to call it a "Mistrial" just because it hallucinates a little bit.
How does one use it to identify bounding rectangles of images/diagrams in the PDF?
I'm happy to see this development after being underwhelmed with Chatgpt OCR!
Wonder how it does with table data in pdfs / page-long tabular data?
As far as open source OCRs go, Tesseract is still the best, right?
It's funny how Gemini consistently beats Google's dedicated document API.
It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, and hopefully from Mistral themselves.
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out exactly the data points that their model struggled with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and concluding that it's probably bad data and removing it.)
[1] https://medium.com/towards-data-science/digit-significance-i...
Ohhh. Gonna test it out with some 100+ year old scribbles :)
1. There’s no simple page / sandbox to upload images and try it. Fine, I’ll code it up.
2. “Explore the Mistral AI APIs” (https://docs.mistral.ai) links to all apis except OCR.
3. The docs on the api params refer to document chunking and image chunking but no details on how their chunking works?
So much unnecessary friction smh.
They really went for it with the hieroglyphs opening.
Are there any open source projects with the same goal?
As builders in this space, we decided to put it to the test on complex nested tables, pie charts, etc. to see if the same VLM hallucination issues persist, and to what degree. While results were promising, we found several critical failure modes across two document domains.
Check out our blog post here: https://www.runpulse.com/blog/beyond-the-hype-real-world-tes...
Document processing is where b2b SAAS is at.
The Next.js error is still not caught correctly.
Alas, I can't run it locally. So it still doesn't solve the problem of OCR for my PDF archive containing my private data...
Oh - on premise solution - awesome!
Release the weights or buy an ad
Really cool, thanks Mistral!
What about tables in PDFs?
Saving you a click: no, it cannot be self hosted (unless you have a few million dollars laying around)
Congrats to Mistral for yet again releasing another closed source thing that costs more than running an open source equivalent:
Can someone give me a tl;dr on how to start using this? Is this available if one signs up for a regular Mistral account?
It's shocking how much our industry fails to see past its own nose.
Not a single example on that page is a Purchase Order, Invoice etc. Not a single example shown is relevant to industry at scale.
Such a shame that PDF doesn’t just, like, include the semantic structure of the document by default. It is brilliant that we standardized on an archival document format that doesn’t include direct access to the document text or structure as a core intrinsic default feature.
I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for 30 years.
Making transformers the same cost as CNNs (which are used in character-level OCR, as opposed to image-patch-level) is a good thing. The problem with CNN-based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and therefore overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful, and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least for incumbents offering traditional character-level OCR.
The timing is weird because I just launched https://dochq.io - AI document extraction where you can define what you need to get out of your documents in plain English. I legitimately thought this was going to be such a niche product, but there has been a very rapid rise in AI-based OCR lately; an article/tweet about using Gemini for OCR even went viral a couple of weeks ago, I think. Fun times.
I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .
Across 375 samples with an LLM as judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.
You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .
The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.
Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.