Researchers at the University of California, Berkeley, have conducted an in-depth analysis of OpenAI’s ChatGPT and its underlying GPT-4 large language model and have found that these models have been trained using text sourced from copyrighted books.
In their research paper titled ‘Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,’ academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman present their findings and analysis.
The researchers explained, “We find that OpenAI models have memorised a wide collection of copyrighted materials, and that the degree of memorisation is tied to the frequency with which passages of those books appear on the web.”
Sci-fi and fantasy works dominate
The investigation showed that OpenAI's GPT-4 has memorised popular titles like the Harry Potter children's books, Orwell's Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker's Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.
The researchers also note that sci-fi and fantasy books dominate the list, and emphasise that memorisation has measurable downstream effects: the models are notably more accurate at answering questions such as "What year was this passage published?" for books they have memorised.
David Bamman, one of the co-authors and an associate professor in the School of Information at UC Berkeley, took to Twitter to summarise the paper.
The researchers clarify that they are not asserting that the complete text of the mentioned books exists within ChatGPT or its underlying models, as large language models (LLMs) do not store text verbatim.
“The data behind ChatGPT and GPT-4 is fundamentally unknowable outside of OpenAI,” the authors explained in their paper. “At no point do we access, or attempt to access, the true training data behind these models, or any underlying components of the systems. Our work carries out probabilistic inference to measure the familiarity of these models with a set of books, but the question of whether they truly exist within the training data of these models is not answerable.”
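This kind of probe works without any access to the training data: a passage is shown to the model with a proper name masked out, and the model's ability to supply the exact name is treated as evidence of familiarity with the book. The sketch below is a minimal illustration of such a name-cloze test, not the authors' actual pipeline; `toy_predict` is a hypothetical stand-in for a real LLM query.

```python
# Illustrative name-cloze membership probe: mask a character name in a
# passage and check whether a model can reproduce it. High accuracy on a
# book's passages suggests the model has seen that text during training.

def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace one occurrence of a character name with a mask token."""
    return passage.replace(name, mask, 1)

def cloze_accuracy(examples, predict):
    """Fraction of masked passages for which `predict` returns the exact name.

    `examples` is a list of (passage, name) pairs; `predict` is any
    callable mapping a masked passage to a single-name guess (here a
    toy stand-in for an actual LLM call).
    """
    hits = sum(predict(make_cloze(p, n)) == n for p, n in examples)
    return hits / len(examples)

# Toy demonstration with a fake "model" that recognises only one passage.
examples = [
    ("It was a bright cold day in April, and Winston stepped out.", "Winston"),
    ("The ships hung in the sky much as bricks do not, said Arthur.", "Arthur"),
]

def toy_predict(masked: str) -> str:
    # A real probe would query an LLM; this stand-in knows one passage.
    return "Winston" if "bright cold day" in masked else "?"

print(cloze_accuracy(examples, toy_predict))  # 0.5
```

As the authors stress, a high cloze score is probabilistic evidence of familiarity, not proof that the text sits verbatim in the training set.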
The inevitability of lawsuits
Speaking with The Register, legal expert Tyler Ochoa of Santa Clara University in California anticipates future lawsuits targeting major players in the field of large language models, such as OpenAI and Google.
First, he raises the question of whether copying substantial amounts of text or images for training purposes qualifies as fair use; he believes it likely does. Second, he addresses the issue of 'memorisation', where the generated output closely resembles the input, which could constitute copyright infringement.
“Lawsuits against AI text-generating models are inevitable,” said Ochoa.
Lastly, he considers whether the output of an AI text generator, if not a direct copy of an existing text, is itself protected by copyright. Under US law, Ochoa says, the answer is no, because copyright requires human creativity, which AI-generated works lack. However, some countries take a different view and do protect AI-generated works.