Researchers at the University of California, Berkeley, have conducted an in-depth analysis of OpenAI’s ChatGPT and its underlying GPT-4 large language model and have found that these models have been trained using text sourced from copyrighted books.
In their research paper titled ‘Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,’ academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman present their findings and analysis.
The researchers explained, “We find that OpenAI models have memorised a wide collection of copyrighted materials, and that the degree of memorisation is tied to the frequency with which passages of those books appear on the web.”
Sci-fi and fantasy works dominate
The investigation showed that OpenAI's GPT-4 has memorised popular titles like the Harry Potter children's books, Orwell's Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker's Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.
The researchers also note that sci-fi and fantasy books dominate the list, and emphasise that memorisation has measurable downstream effects: the models are notably more accurate at answering questions such as "What year was this passage published?" for books they have memorised.
David Bamman, one of the co-authors and an associate professor in the School of Information at UC Berkeley, took to Twitter to summarise the paper.
The researchers clarify that they are not asserting that the complete text of the mentioned books exists within ChatGPT or its underlying models, as large language models (LLMs) do not store text verbatim.
“The data behind ChatGPT and GPT-4 is fundamentally unknowable outside of OpenAI,” the authors explained in their paper. “At no point do we access, or attempt to access, the true training data behind these models, or any underlying components of the systems. Our work carries out probabilistic inference to measure the familiarity of these models with a set of books, but the question of whether they truly exist within the training data of these models is not answerable.”
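This kind of probe works without any access to the training data: a passage is shown to the model with a proper name masked out, and the model's ability to supply the exact name is treated as evidence of familiarity with the book. The sketch below is a minimal illustration of such a name-cloze test, not the authors' actual pipeline; `toy_predict` is a hypothetical stand-in for a real LLM query.

```python
# Illustrative name-cloze membership probe: mask a character name in a
# passage and check whether a model can reproduce it. High accuracy on a
# book's passages suggests the model has seen that text during training.

def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace one occurrence of a character name with a mask token."""
    return passage.replace(name, mask, 1)

def cloze_accuracy(examples, predict):
    """Fraction of masked passages for which `predict` returns the exact name.

    `examples` is a list of (passage, name) pairs; `predict` is any
    callable mapping a masked passage to a single-name guess (here a
    toy stand-in for an actual LLM call).
    """
    hits = sum(predict(make_cloze(p, n)) == n for p, n in examples)
    return hits / len(examples)

# Toy demonstration with a fake "model" that recognises only one passage.
examples = [
    ("It was a bright cold day in April, and Winston stepped out.", "Winston"),
    ("The ships hung in the sky much as bricks do not, said Arthur.", "Arthur"),
]

def toy_predict(masked: str) -> str:
    # A real probe would query an LLM; this stand-in knows one passage.
    return "Winston" if "bright cold day" in masked else "?"

print(cloze_accuracy(examples, toy_predict))  # 0.5
```

As the authors stress, a high cloze score is probabilistic evidence of familiarity, not proof that the text sits verbatim in the training set.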
The inevitability of lawsuits
Speaking with The Register, legal expert Tyler Ochoa of Santa Clara University in California anticipates future lawsuits targeting major players in the field of large language models, such as OpenAI and Google.
First, he raises the question of whether copying substantial amounts of text or images for training purposes qualifies as fair use; he believes it likely does. Second, he addresses the issue of 'memorisation', where the generated output closely resembles the input, which could constitute copyright infringement.
“Lawsuits against AI text-generating models are inevitable,” said Ochoa.
Lastly, he considers whether the output of an AI text generator, if not a direct copy of an existing text, is itself protected by copyright. Under US law, Ochoa says, the answer is no, because copyright requires human creativity, which AI-generated works lack. However, some countries take a different view and do protect AI-generated works.