The Guardian - UK
Ella Creamer

‘Impossible’ to develop AI tools without using copyrighted materials, says OpenAI

OpenAI says it complies with the ‘requirements of all applicable laws, including copyright laws’. Photograph: salarko/Alamy

Artificial intelligence company OpenAI has said that it would be “impossible” to develop leading AI tools without using copyrighted materials.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens,” the company said.

The claim was part of written evidence submitted to the House of Lords communications and digital select committee as part of its inquiry into large language models such as OpenAI’s ChatGPT, which generates human-like text responses to prompts.

In September, a group of authors including Jodi Picoult and George RR Martin filed a lawsuit against OpenAI alleging copyright infringement. The lawsuit cites various ChatGPT commands and outputs for each author, including one for Martin that alleges the model generated an “infringing” and “detailed” outline for a prequel to A Game of Thrones that used characters from Martin’s existing books.

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” said the company.

Stuart Russell, author of Human Compatible: Artificial Intelligence and the Problem of Control, said that whether or not OpenAI’s statement is true “seems irrelevant” to the question of whether they should pay copyright owners. “It would be impossible for me to make money by selling the crown jewels without stealing them first, but that doesn’t entitle me to steal them.”

In the written evidence, OpenAI insisted that it complies with the “requirements of all applicable laws, including copyright laws”.

In June 2022, the Intellectual Property Office proposed a copyright exemption that would allow developers of AI tools free use of copyrighted books and music for training AI models. However, by February 2023, the government had changed course and dropped its plan in response to a backlash from the creative industries.

If OpenAI were to use public domain materials exclusively as training data, it “would not result in having the most relevant language models”, said Maria Liakata, a professor in natural language processing at Queen Mary, University of London. “Having a large language model trained on Shakespeare would not be particularly useful. But I think the way they’ve gone about it could have been different. I think they should have been more transparent about what kind of data they’re using.”

OpenAI has not revealed the sources of its training data. However, in a 2020 paper introducing GPT-3, the company said that 15% of the model’s training set came from two “internet-based books corpora”, one of which is called Books2 and is estimated to contain nearly 300,000 titles. A June lawsuit stated that the only websites to offer that much material are “shadow libraries” such as Library Genesis (LibGen) and Z-Library, through which books can be secured in bulk via torrent systems.

“Training AI models requires a vast amount of high quality data,” said Yulan He, a professor in the department of informatics at King’s College London. “Since the majority of the data comes from the internet, it is likely many copyrighted materials are included.

“But I think we should still make efforts to use public domain data for AI model training, and perhaps explore ways to generate synthetic data not containing copyrighted materials for training. If we do need to rely on copyrighted materials, then we should obtain licences and permissions in using such materials,” she added.
