1 . Vast numbers of copyrighted books appear to have been memorized by ChatGPT and its successor GPT-4, posing questions about the legality of how these large language models (LLMs) are created.
Both artificial intelligences were developed by the private firm OpenAI and trained on huge amounts of data, but which texts make up this training data is unknown. To find out more, David Bamman at the University of California, Berkeley, and his colleagues looked at whether the AIs were able to fill in missing details from a selection of almost 600 fiction books, drawn from sources such as nominees for the Pulitzer prize and The New York Times’s bestseller lists over the same time period.
The team picked 100 passages from each book that contained a single, named character. The researchers then blanked out the name and asked the AI to fill it in. This task was designed to reveal whether the AIs could return exactly the right answer. “It really requires knowledge of the underlying material in order to be able to get the name right,” says Bamman.
Both AIs completed the task with high accuracy: as much as 98 percent for passages from Lewis Carroll’s 1865 book Alice’s Adventures in Wonderland, which is out of copyright, and 76 percent for J.K. Rowling’s Harry Potter and the Philosopher’s Stone, which is not. The researchers say this suggests the AIs were trained on significant proportions of both books.
These AIs don’t produce an exact duplicate of a text in the same way as a photocopier, which is a clearer example of copyright infringement. “ChatGPT can recite parts of a book because it has seen it thousands of times,” says Andres Guadamuz at the University of Sussex, UK. “The model consists of statistical frequency of words. It’s not reproduction in the copyright sense.”
“The use of copyright works without permission in training data sets for large language or image models has already emerged as one of the most pressing legal challenges to this novel industry,” says Lilian Edwards at Newcastle University, UK.
Bamman says that, ultimately, the legal system in each country will have to determine whether LLMs are infringing copyrights. “I think that’s an open question that a lot of court cases are going to decide for us in the coming months,” he says.
Regulation is also likely to play a key role: the European Union’s Artificial Intelligence Act, which has been two years in the making, will include a requirement that companies making generative AI tools disclose any copyrighted material used to train their models. That was a late change, added to the draft law in April, according to Reuters.
1. Bamman and his colleagues designed the task to _________.
A. compare the accuracy rate of ChatGPT and GPT-4
B. test the range of knowledge of ChatGPT and GPT-4
C. show how ChatGPT and GPT-4 memorize many books
D. check what ChatGPT and GPT-4’s training data consist of

A. AIs were trained more on copyrighted works than those out of copyright.
B. Guadamuz thinks what AIs have done is a kind of copyright infringement.
C. AI companies need to uncover copyrighted materials used as training data.
D. The permission for the use of copyright works becomes a legal challenge.

A. The training process of AIs.
B. The legal uncertainty of AIs.
C. The future regulation of AIs.
D. The training materials of AIs.
1. Introduce your role model;
2. Explain why you chose this person.
Note: 1. Write about 100 words;
2. The title is given.
My role model
____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________