Show HN: HTML visualization of a PDF file's internal structure

  • Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page, along with the bounding boxes of all the elements.

    The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to.

    Today, given what LLMs can do to extract data from PDFs, I would 100% go the route of using AI to extract the data they wanted. Back then that did not exist yet.

  • That's pretty cool! I would have used it a lot at my previous job if it existed back then. In my ideal world it should work somewhat like https://lapo.it/asn1js/ -- you drop a file and it does all the stuff locally.

  • I've used iText RUPS (free) for a while for debugging PDFs (as I have the "privilege" to work on code that extracts data from PDFs...). It looks like your introspection stuff might be a bit stronger, which would be great. I'll give it a whirl.

  • I remember there was a similar project on GitHub that lets you visualize any type of binary data against a given schema. There was a TCP/IP example, IIRC.
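
    A rough sketch of what "binary data described by a schema" could look like, not the project mentioned above. The schema layout, the `describe` helper, and the three TCP fields chosen are assumptions made for illustration; only Python's standard `struct` module is used.

```python
import struct

# Hypothetical schema: (field name, struct format code) for the first
# 8 bytes of a TCP header. "!" below selects network (big-endian) order.
TCP_SCHEMA = [("src_port", "H"), ("dst_port", "H"), ("seq", "I")]

def describe(data: bytes, schema) -> list[tuple[int, int, str, int]]:
    """Return (offset, size, name, value) rows, one per schema field."""
    rows = []
    offset = 0
    for name, code in schema:
        fmt = "!" + code
        size = struct.calcsize(fmt)
        (value,) = struct.unpack_from(fmt, data, offset)
        rows.append((offset, size, name, value))
        offset += size
    return rows

# Hand-built sample bytes, not a captured packet.
SAMPLE_HEADER = struct.pack("!HHI", 443, 51000, 1)

for offset, size, name, value in describe(SAMPLE_HEADER, TCP_SCHEMA):
    print(f"{offset:4d} +{size}  {name:<9} {value}")
```

    Each row carries its byte range, which is the part a visualizer would need in order to highlight the matching bytes.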

  • Damn, this is also convenient for forensics and finding watermarks.

  • Looks nice.

    Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.
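
    One way to check which structural keywords a byte-level view leaves out is to list their offsets independently. This is a minimal sketch and not part of pdfsyntax: `keyword_offsets` and the sample bytes are made up for illustration, and the scan is deliberately naive.

```python
import re

# Structural keywords a byte-level view might skip. "startxref" is listed
# before "xref" so it is not reported as a bare "xref".
KEYWORDS = re.compile(rb"startxref|endobj|trailer|xref")

def keyword_offsets(data: bytes) -> list[tuple[int, str]]:
    """Return (byte offset, keyword) pairs in file order.

    Naive scan: the file is not parsed, so a keyword that happens to
    appear inside a stream or a string is reported too.
    """
    return [(m.start(), m.group().decode("ascii")) for m in KEYWORDS.finditer(data)]

# Tiny hand-written fragment for demonstration, not a valid PDF.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 1\ntrailer\n<< >>\nstartxref\n36\n%%EOF\n"
)

for offset, keyword in keyword_offsets(SAMPLE):
    print(f"{offset:6d}  {keyword}")
```

    Comparing these offsets against the byte ranges the HTML view covers would show which parts of the file are missing from it.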

  • This would be really nice as a browser library. You could just drag and drop a file and see its insides. But impressive nonetheless.

  • Well done. This is a very useful security previewing tool. PDFs are a menace.

  • Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.

    EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...

  • On a similar note, why hasn't PDF been replaced? There are XPS, DjVu and XHTML (EPUB), but they all seem to target a different use case (a packaged HTML file).

    What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change the font size, etc. without text overflowing, and still be able to print in a consistent manner.

  • I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?

  • related: https://news.ycombinator.com/item?id=41377960

  • This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!

  • Kudos for making this self-hosted. So very much appreciated!

  • It does not depend on any PDF parsing library, correct? That's a cool way to learn the file format and to be able to work around weird PDF files. But what was the motivation for not using a library to do the PDF parsing work? Is it that none is available? Nice work!

  • Wow, I've been doing some PDF parsing at work and this is going to come in SO handy.

  • This looks amazingly useful!

    Thank You For Making And Sharing!

  • If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts.

    [0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/