AI Language Models Like ChatGPT May Paraphrase Without Citing The Source

With ChatGPT a huge fad, students may be excited that their next paper will be easy. If it is truly AI, the thinking goes, it will write something original from the unique prompt you give it. Yet it does not work that way. Language models that generate text in response to user prompts are trained on much of the same material, which means they can plagiarize that content in multiple ways.

In their analysis, the researchers focused on three forms of plagiarism: verbatim, or directly copying and pasting content; paraphrase, or rewording and restructuring content without citing the original source; and idea, or using the main idea from a text without proper attribution. They constructed a pipeline for automated plagiarism detection and tested it against OpenAI’s GPT-2 because the language model’s training data is available online, allowing the researchers to compare generated texts to the 8 million documents used to pre-train GPT-2.

The scientists used 210,000 generated texts to test for plagiarism in pre-trained language models and fine-tuned language models, or models trained further to focus on specific topic areas. In this case, the team fine-tuned three language models to focus on scientific documents, scholarly articles related to COVID-19, and patent claims. They used an open-source search engine to retrieve the top 10 training documents most similar to each generated text and modified an existing text alignment algorithm to better detect instances of verbatim, paraphrase and idea plagiarism.
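The retrieve-then-align idea can be illustrated with a minimal Python sketch. It is not the researchers' pipeline: a bag-of-words cosine similarity stands in for the open-source search engine, the standard library's difflib stands in for the modified text alignment algorithm, and only verbatim overlap is checked (paraphrase and idea plagiarism would need more than word matching). The function names and the min_words threshold are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(generated, corpus, k=10):
    """Indices of the k training documents most similar to the generated text."""
    query = Counter(tokenize(generated))
    scores = [(cosine(query, Counter(tokenize(doc))), i) for i, doc in enumerate(corpus)]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [i for _, i in scores[:k]]


def longest_shared_run(generated, document):
    """Longest run of consecutive words that appears verbatim in both texts."""
    a, b = tokenize(generated), tokenize(document)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def flag_verbatim(generated, corpus, k=10, min_words=5):
    """Flag retrieved documents sharing a verbatim run of at least min_words words.

    min_words is an assumed threshold for this sketch, not a value from the study.
    """
    flagged = []
    for i in top_k_similar(generated, corpus, k):
        run = longest_shared_run(generated, corpus[i])
        if len(run) >= min_words:
            flagged.append((i, " ".join(run)))
    return flagged
```

Calling flag_verbatim(generated_text, training_documents) returns a list of (document index, shared passage) pairs for the retrieved documents that overlap word for word with the generated text.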

The team found that the language models committed all three types of plagiarism, and that the larger the training dataset and the more parameters the model had, the more often plagiarism occurred. They also noted that fine-tuning reduced verbatim plagiarism but increased instances of paraphrase and idea plagiarism. In addition, they identified instances of the language model exposing individuals’ private information through all three forms of plagiarism.

Though the results of the study only apply to GPT-2, the automatic plagiarism detection process that the researchers established can be applied to newer language models like ChatGPT to determine if and how often these models plagiarize training content. Testing for plagiarism, however, depends on the developers making the training data publicly accessible, said the researchers.

“As a stochastic parrot, we taught language models to mimic human writings without teaching them how not to plagiarize properly,” said lead author Jooyoung Lee, doctoral student in the College of Information Sciences and Technology at Penn State. “Now, it’s time to teach them to write more properly, and we have a long way to go.”
