This. I stumbled over the same problem and did not find the preprocessing too hard.
I achieved pretty good results with a few simple steps before using tesseract:
- Sauvola adaptive thresholding (today there are many better algorithms, but Sauvola is still pretty good)
- Creating histogram-based patches to analyse which parts are text and which are images (similar to JBIG2)
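The first step above can be sketched in a few lines of NumPy. This is a toy implementation of Sauvola's formula (T = m * (1 + k * (s/R - 1)) with local mean m and std s); the window size, k and R values are common defaults, not anything specific to my old project, and for real images a library routine such as skimage.filters.threshold_sauvola will be much faster:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Binarize a grayscale image with Sauvola's adaptive threshold.

    Per pixel: T = m * (1 + k * (s / R - 1)), where m and s are the
    local mean and standard deviation over a window x window patch.
    `window` should be odd so the output lines up with the input.
    """
    img = np.asarray(gray, dtype=np.float64)
    pad = window // 2
    # Reflect-pad so border pixels get a full-sized neighbourhood.
    padded = np.pad(img, pad, mode="reflect")
    patches = sliding_window_view(padded, (window, window))
    mean = patches.mean(axis=(-2, -1))
    std = patches.std(axis=(-2, -1))
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return img > threshold  # True = background (paper), False = ink
```

The nice property compared to a global threshold is that uneven lighting or scanner shadows only shift the local mean, so text stays separable across the whole page.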
I once even found a paper describing an algorithm for detecting text-line slopes on geographical maps that was simple, fast and pure genius for curved text lines, and I implemented a pixel mapper on top of it to straighten those curved lines. Unfortunately the whole project got lost somewhere on the NAS. Maybe I still have it somewhere, but Java was not the best language to implement this :-)
However, even though I found simple solutions for some of my use cases, I think the whole OCR topic is hard to generalize. Algorithms that work for specific use cases in specific countries don't work for others, and it is a lot of hard work to capture all the fonts, typography, edge cases and performance problems in one piece of software.
From my (several years out of date) experience, commercial OCR software like ABBYY FineReader tends to be a lot better at dealing with layout than the FOSS stuff. They have a GUI layer that lets you draw areas to define columns, etc.
These days it looks like ABBYY has pivoted towards cloud services and SDKs though, with the standalone software (now called FineReader PDF) de-emphasized. I am not sure whether the new versions and services still offer column separation.
Sounds like you have the seed for a startup!
I'm not sure why, but I share the frustration: better tooling is needed.
maybe this is better? https://github.com/clovaai/donut
I'm not sure
Layout analysis is the key. Quite a bit of work has been going on recently in this area.
Some papers of relevance:
The first one (PubLayNet) is for publications. From the abstract: "...the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated".

The second (DocLayNet) is for general documents. It contains 80K manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement.
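Once a layout model trained on a dataset like these gives you labelled bounding boxes, the remaining step is ordering the text regions before OCR. Here is a deliberately naive sketch: the class labels follow PubLayNet's scheme (text/title/list/table/figure), but the region dicts, the `col_width` column-bucketing heuristic, and the function name are all made up for illustration; a real pipeline would use a proper column-detection step and then pass each cropped region to the OCR engine individually:

```python
def reading_order(regions, col_width=300):
    """Sort detected layout regions into a rough reading order.

    regions: list of {"label": str, "bbox": (x0, y0, x1, y1)} dicts.
    Non-text regions (tables, figures) are filtered out; the rest are
    bucketed into columns by left edge, then sorted top-to-bottom.
    """
    text_like = [r for r in regions if r["label"] in {"text", "title", "list"}]
    # Crude column clustering: integer-divide x0 by an assumed column width.
    return sorted(text_like,
                  key=lambda r: (r["bbox"][0] // col_width, r["bbox"][1]))

regions = [
    {"label": "text",   "bbox": (320, 100, 600, 300)},  # right column
    {"label": "figure", "bbox": (20, 400, 600, 700)},   # dropped: not text
    {"label": "text",   "bbox": (20, 350, 300, 600)},   # left column, lower
    {"label": "title",  "bbox": (20, 50, 300, 90)},     # left column, top
]
```

With this input, the left column (title, then body text) comes out before the right column, which is exactly the column separation the ABBYY GUI lets you draw by hand.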