Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

  • As an experiment I searched Google for "harry potter and the sorcerer's stone text":

    - the first result is a pdf of the full book

    - the second result is a txt of the full book

    - the third result is a pdf of the complete harry potter collection

    - the fourth result is a txt of the full book (hosted on GitHub, funnily enough)

    Further down there are similar copies from the Internet Archive and dozens of other sites. All within the first 2-3 pages of results.

    I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple of lines from Harry Potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

  • It's important to note the way it was measured:

    > the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time

    As I understand it, this means that if you prompt it with actual context from a specific subset covering 42% of the book, it completes it with the next 50 tokens of the book at least 50% of the time.

    So 50 tokens is not very much; it's basically a sentence or two. Such a small amount would probably fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those excerpts together, or use some other method, to build substantial portions of the book. And if it only gets each one right 50% of the time, that seems very hard to do with high fidelity.
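    A quick back-of-envelope calculation (my own, not from the paper) illustrates why chaining is hard: if each 50-token excerpt is reproduced correctly only half the time, the probability of stitching k consecutive excerpts together verbatim decays geometrically.

```python
def chain_success_prob(p_excerpt: float, k: int) -> float:
    """Probability that k independent excerpt completions are all correct."""
    return p_excerpt ** k

# A novel chapter of ~5,000 words is very roughly 6,500 tokens,
# i.e. about 130 chained 50-token excerpts (my assumption, not the paper's).
print(chain_success_prob(0.5, 130))  # effectively zero
```

    Even granting generous per-excerpt accuracy, reconstructing any substantial contiguous span this way is astronomically unlikely without repeatedly re-seeding the model with the real text.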

    Having said all that, what is really interesting is how different the latest Llama 70B is from previous versions. It suggests that Meta maybe got a bit desperate and started over-training on certain materials, which greatly increased its direct recall behaviour.

    Well, so can a nontrivial number of people. It's Harry Potter we're talking about; it's up there with The Bible in popularity.

    I'm gonna bet that Llama 3.1 can recall a significant portion of Pride and Prejudice too.

    With examples of this magnitude, it's normal and entirely expected that this can happen, as it does with people[0]. The only thing this is really telling us is that the model doesn't understand its position in society well enough to know to shut up; that obliging the request is going to land it, or its owners, in trouble.

    In some way, it's actually perverse.

    EDIT: it's even worse than that. What the research seems to be measuring is that the models recognize sentence-sized pieces of the book as likely continuations of an earlier sentence-sized piece. Not whether it'll reproduce that text when used straightforwardly - just whether there's an indication it recognizes the token patterns as likely.

    By that standard, I bet there are over a billion people right now who could do that for 42% of the first Harry Potter book. By that standard, I too memorized the Bible end-to-end, as have most people alive today, whether or not they're Christian; works this popular bleed into common language usage patterns.

    --

    [0] - Even more so when you relax your criteria to accept the occasional misspelling or paraphrase; then each of us likely knows someone who could piece together a chunk of an HP book from memory.

  • I think it's important to recognize here that fanfiction.net has 850 thousand distinct pieces of Harry Potter fanfiction on it, fifty thousand of which are more than 40k words in length. Many of them (there's no easy way to measure) directly reproduce parts of the original books.

    archiveofourown.org has 500 thousand, some, though probably not the majority, of which are duplicates from fanfiction.net. 37 thousand of these are over 40 thousand words.

    I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.

  • From a quick web search I can find book review sites that allow users to enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] excerpts of Harry Potter and the Sorcerer's Stone, ranging from part of a sentence to several paragraphs.

    Could it be plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and still produced results similar to those of the linked study?

    [1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

    [2] ~30 portions x 68 pages

  • On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.

    It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere a bunch of times. You could probably stitch together the full book quote by quote.

  • Quotation is fair use in all sensible copyright systems. An LLM will mostly be able to quote anything, and should be. A quotation is not a derivative work. LLMs are not stealing copyrighted work. They just show that Harry Potter is in English and a mostly logical story. If someone is stabbed, they will die in most stories; that's not copyrightable. If you have an engine that knows everything, it will be able to quote everything.

  • I can recall about 12% of the first Harry Potter book so it's interesting to see Llama is only 4x smarter than me. I will catch up.

  • That's a clickbait title.

    What they are actually saying: Given one correct quoted sentence, the model has 42% chance of predicting the next sentence correctly.

    So, assuming you start with the first sentence and tell it to keep going, it has odds of 0.42^n of staying on track, where n is the n-th sentence.

    It seems to me that if they didn't keep correcting it over and over again with real quotes, it wouldn't even get to the end of the first page without descending into wild fanfiction territory, with errors accumulating and growing as the text progressed.

    EDIT: As the article states, for an entire 50-token excerpt to be correct, the probability of each output token has to be fairly high. So perhaps it would be more accurate to view it as 0.985^n, where n is the n-th token. Still the same result long term: unless every token is correct, it will stray further and further from the source.
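    A quick sanity check on that per-token framing (my own arithmetic, assuming each token is independently correct with probability p): a 50-token excerpt succeeds about half the time exactly when p^50 = 0.5, and the expected position of the first wrong token then follows from the geometric distribution.

```python
# Per-token accuracy implied by "a 50-token excerpt is correct ~50% of the time":
p = 0.5 ** (1 / 50)
print(round(p, 4))         # 0.9862, close to the 0.985 figure above

# Expected position of the first wrong token (geometric distribution):
print(round(1 / (1 - p)))  # about 73 tokens, i.e. a sentence or two
```

    So under this independence assumption, an uncorrected generation is expected to go off-script after roughly a sentence or two.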

  • I really wish we could get rid of copyright. It's going to hold us back long term.

  • Imagine the literary possibilities when it can write 100%! Rowling's original work was an amusing, if rather derivative, children's book. But Llama's version of the Philosopher's Stone will be something else entirely. Just think of the rather heavy-handed Cerberus reference in the original work. Instead of a rote reference to Greek mythology used as a simple trope, it will be filled with a subtext that only an LLM can produce.

    Right now they're working on recreating the famous sequence with the troll in the dungeon. It might cost them another few billion in training, but the end results will speak for themselves.

  • If LLMs are good at summarizing/compressing, what does this say about the underlying text? Why are some passages more easily recalled? Sure, some sections have probably been quoted more times than others, so there's bias in training data, which might explain why the Llama 1 and 3.1 images have similar peaks. Would this happen to LLMs even with no training bias?

    Edit: it seems the first part is about a memory of being bullied by Dudley. The second is where he's been selected for the Quidditch team. Possibly they are just boring passages compared to the surrounding ones, so probably just training bias.

  • I'm surprised no one in the comments has mentioned overfitting. Perhaps this is too obvious, but I think of it as a very clear bug if a model asserts something to be true because it has heard it once. I realize that training a model is not easy, but this is something that should have been caught before release. Either QA is asleep on the job or they intentionally released a model with serious flaws in its design/training. I also understand the intense pressure to release early and often, but this type of thing is more than a minor warning sign.

  • Do LLMs have any perception that Harry Potter is fiction or is it possible that they will give some magical advice based on fiction works that they have been trained with?

    edit: never mind, I’ll just ask ChatGPT

  • It's not fair use just because you guys want it be fair use.

    While limited quoting can (and usually is) considered fair use, quoting significant portions of a book (much less 42% of it) has never been fair use, in the U.S., Europe, or any other nation.

    Yes, information wants to be free, yada yada. That means facts. Whether creative works are free is up to their creators.

  • Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.

    If the book was obtained legitimately, letting an LLM read it is not an issue.

  • Hmm, couldn't this be used as a benchmark for quantization algorithms?
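    One hedged sketch of how such a benchmark could look: treat the verbatim-recall rate over a fixed set of known-memorized excerpts as the metric, and compare it between the full-precision and quantized models. The `generate` callable and the excerpt format below are my assumptions for illustration, not an existing API.

```python
def recall_rate(generate, excerpts):
    """Fraction of (prompt, reference) token-sequence pairs whose
    continuation is reproduced verbatim under greedy decoding."""
    hits = 0
    for prompt_tokens, reference_tokens in excerpts:
        continuation = generate(prompt_tokens, max_new_tokens=len(reference_tokens))
        hits += continuation == reference_tokens
    return hits / len(excerpts)

# Usage idea: a large gap between recall_rate(fp16_generate, excerpts)
# and recall_rate(int4_generate, excerpts) would suggest the quantizer
# is destroying the fine-grained weights that encode memorization.
```

    The appeal is that memorized passages are an unusually sharp probe: they depend on exact logit orderings, so they might degrade before perplexity does.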

  • Could it be that other people posted content from the Harry Potter books online and the model developers scraped that content? Would the model developers be at fault in that scenario?

  • It’s well-known that John von Neumann had this ability too:

    Herman Goldstine wrote "One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes."

    Maybe it’s just an unavoidable side effect of extreme intelligence?

  • Harry Potter is likely excerpted a million times all over the web (in legitimate fair-use contexts). Wouldn't it make more sense to try other titles that are still under copyright, appear in the research datasets, but have little mention across the web and other typical source corpora?

  • I wonder what percentage we could expect from a true general AI. 100%?

    It would be nice to know that at least our literature might survive the technological singularity.

  • I mean it makes sense. Same thing as George RR Martin complaining that it can spit out chunks of his books (finish your books already!!)

    As I have pointed out many times before - for GRRM's books and for HP books, the Internet is FILLED to the brim with quotes from these books, there are uploads of the entire books, there are several (not just one) fan wikis for each of these fandoms. There is a lot of content in general on the Internet that quotes these books, they are pop culture sensations.

    So of course they're weighted heavily when training an LLM by just feeding it the Internet. If a model could ever recount them 100% correctly, in the correct order, that would be overfitting. Otherwise it's just plain and simple high occurrence in the training data.

  • LLMs are to a certain degree compressed databases of their training data. But 42% is a surprisingly large number.

  • It will generate a correct next token 42% of the time when prompted with a 50-token quote.

    Not 42% of the book.

    It's a pretty big distinction.

  • Meta Llama, Author of Harry Potter

  • My children can recall 100% of some of their favorite books, as I'm sure some of us here can too. Some of us can recall 100% of a poem or a song's lyrics.

  • Given the method and how the English language works, isn't that the expected outcome for any text that isn't highly technical?

    Guess the next word: Not all heroes wear _____

  • What is that bar (= token span) on the right that's common to the first three models?

  • https://archive.is/OSQt6

    If you've seen as many magnet links as I have, with your subconscious primed by the foreknowledge that Meta used torrents to download/leech (and possibly upload/seed) the dataset(s) used to train their LLMs, you might scroll down to the first picture in this article from the source paper and find the chart uncannily reminiscent of a common visual representation of torrent block download status.

    Can't unsee it. For comparison (note the circled part):

    https://superuser.com/questions/366212/what-do-all-these-dow...

    Previously, related:

    Extracting memorized pieces of books from open-weight language models - https://news.ycombinator.com/item?id=44108926 - May 2025

  • As I've said several times, the corpus is key: LLMs thus far "read" most anything, but should instead be trained on well-curated corpora. "Garbage In, Garbage Out" (GIGO), as the saying goes.

    While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

    Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.