One of the now-underdiscussed features of embeddings is that you can use just about any existing statistical modeling technique on them out of the box, and as a bonus skip the usual NLP preprocessing nuances and pitfalls (e.g. stemming) entirely.
This post is a good example of why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
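For instance, a minimal sketch of that point: the embedding matrix goes straight into an off-the-shelf classifier. Stand-in data below; in practice X would be your document embeddings and y the labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in data: replace with your (n_docs, dim) embedding matrix and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 384))
    y = rng.integers(0, 2, size=1000)

    # Embeddings are just dense feature vectors, so any off-the-shelf model
    # applies directly: no tokenization, stemming, or stop-word handling needed.
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5).mean())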
Hi! Author here, I wasn't expecting this to be at the top of HN, AMA
Interesting read with lots of good detail, thank you. A comment: if you are balancing the classes when you do one vs all binary training, and then use the max probability for inference, your probabilities might not be calibrated well, which could be a problem. Do you correct the probabilities before taking the argmax?
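For what it's worth, if the balancing was done by undersampling the "rest" class, one simple correction before the argmax is a Bayes-rule prior adjustment. A rough sketch with made-up numbers (not the author's code):

    import numpy as np

    def correct_undersampled_proba(p_s, beta):
        # beta: fraction of "rest"-class examples kept when balancing.
        # p_s: score from the one-vs-rest model trained on balanced data.
        # Bayes-rule correction back toward the original class prior.
        return beta * p_s / (beta * p_s - p_s + 1.0)

    # Stand-in values: probs_balanced would be the (n_samples, n_classes)
    # one-vs-rest scores, betas the per-class undersampling rates.
    probs_balanced = np.array([[0.7, 0.6, 0.9]])
    betas = np.array([0.05, 0.30, 0.01])
    corrected = np.column_stack([
        correct_undersampled_proba(probs_balanced[:, k], betas[k])
        for k in range(probs_balanced.shape[1])
    ])
    print(corrected, corrected.argmax(axis=1))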
Back in 2006 there were multiple 1 TB collections of textbooks available as torrents. I imagine the size and number have only grown since then.
I have 20-40TB (pre-dedup) of PDFs - 8TB is a lot but not even close to the total number of PDFs available.
Interesting and fun article! I've been experimenting with various LLM/GenAI solutions to extract tabular data from PDFs, with underwhelming results. It seems like they are good at extracting strings of text and summarizing (e.g. what was the total price? when was this printed?), but extracting reliably into a CSV has a decent margin of error.
Very cool! At Airtrain we’ve also found embeddings can be very valuable for building classification models. If you’re looking to play around with a large amount of text and embeddings we actually recently deduped and embedded all of fineweb-edu (also mentioned in the article) and put the resulting dataset on Hugging Face: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fort...
This is a really cool idea, thanks for sharing. I don't have that much free time these days, but I was thinking of trying a similar-but-different project not too long ago.
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
Nice work! You've taken multiple approaches similar to what I sometimes do at the national library; I've used all kinds of embeddings -> classifiers / LDA.
Curious about your prompt: https://github.com/snat-s/m/blob/main/classify_metadata/prom...
Wouldn't this be basically prompting to classify by the type of URL?
Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?
My first thought on seeing the PCA embeddings scatterplot was "I wonder what pdfs are at the centre of those two clusters?" The most typical pdfs on the internet.
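If anyone wants to try it, a rough sketch of one way to pull those out: cluster the embeddings and grab the document nearest each centroid (stand-in data; the 2-cluster choice is just my guess to match the plot):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin_min

    # Stand-in data: in practice `embeddings` is the (n_docs, dim) matrix and
    # `urls` the parallel list of PDF urls.
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(500, 64))
    urls = [f"doc_{i}.pdf" for i in range(500)]

    # Cluster in the original embedding space, then take the document closest
    # to each centroid as the "most typical" PDF of that cluster.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    for c, i in enumerate(closest):
        print(f"cluster {c}: {urls[i]}")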
I would have expected the finetuned model to perform much better. Would be curious to see the performance with other models
First you need a good PDF library :/
> How would you classify all the pdfs in the internet?
Definitely as 'hot dog' or 'not a hot dog'.
Whoever said that's the "internet" fails to grasp how big the internet really is.
Would be interesting to see if they tried LDA (latent Dirichlet allocation) topics.
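Something like this would make a cheap baseline to compare the embedding clusters against. A sketch on a stand-in corpus; the real input would be the extracted PDF text:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Stand-in corpus: in practice `texts` would be the extracted PDF text.
    texts = [
        "invoice total price tax payment due",
        "neural network training loss gradient",
        "court filing plaintiff defendant motion",
    ] * 50

    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

    # Print the top terms per topic.
    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {' '.join(top)}")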
I would have expected categories for product brochures and product manuals.
I've been playing with https://www.aryn.ai/ for Partitioning. Curious if anyone has tried these tools for better data extraction from PDFs. Any other suggestions?
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.)
Typo: “meats the eye”
This seems like cool work but with a ton of "marketing hype speak" that immediately gets watered down by the first paragraph.
Note the ordering of statements:
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "the internet's PDFs".
Interesting read; I did not know about Common Crawl. I feel like RTBF (the right to be forgotten) is kind of a lost battle these days, with more and more crawlers for AI and whatnot. Once something is on the internet there is no way back, for better or for worse. This tangent aside, 8TB is really not a lot of data; it's just eight consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online", maybe the largest public one. Not sure how representative it is of "the whole internet".
> I don’t have 8TB laying around, but we can be a bit more clever.... In particular I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more from a website than what meats the eye.
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story from "classifying all the pdfs".
Did some similar work with similar visualizations ~2009, on ~5.7M research articles (PDFs, private corpus) from the scientific publishers Elsevier and Springer:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I am the first author.