The internet brought a new threat to journalism. Online news sources and blogs pulled information from larger outlets, or simply made it up. Fake news, clickbait, and listicles all siphoned attention from real journalism. But an even greater threat was on the horizon: AI.
After a newspaper pours money and labor, sometimes even blood and tears, into its work, an AI company might come along, gobble up swaths of that data, and give it away for free or charge its own prices for access. Worse, they strip away profit and attribution, and can even inject misinformation attributed to an institution with a hard-earned reputation.
That’s what The New York Times says happened to them. They claim that OpenAI, and Microsoft, which has backed the company’s for-profit arm, have stolen their copyrighted data without permission or licensing. Perhaps worse, the resulting models can allegedly spit out large sections of full articles, as well as falsehoods, so-called “hallucinations,” and attribute them to the NY Times.
Back when OpenAI disclosed where it procured its data, the NY Times was the top publisher of proprietary writing used to make ChatGPT. Millions of articles, dating back over a hundred years. Billions of words, sentences, and data points, all taken for free.
OpenAI is valued at over $80 billion, with Microsoft investing $13 billion in the for-profit arm of the company, of which it owns 49%. The NY Times is demanding that all data taken from them without permission be removed from the models, effectively forcing OpenAI to re-create its ChatGPT models. They also want billions of dollars in damages for the millions of articles taken without permission.
If they win, it could empower creators to fight back against the AI that steals their work, shaping the relationship between AI and humanity forever. Losing could mean trouble for creators everywhere.
Mountains of Data, With No Consent
OpenAI has struck licensing deals with the German media company Axel Springer (owner of Politico and Business Insider) and with the Associated Press. They were in talks with The New York Times, but, according to Ars Technica’s reporting, those talks weren’t going smoothly. OpenAI denies that, saying the lawsuit left them “surprised and disappointed,” and pointing to “productive conversations” they’ve had with The New York Times.
OpenAI took millions of records from the NY Times without permission, making the paper the largest unwilling proprietary contributor to pre-3.5 ChatGPT. How much money should they owe the NY Times? OpenAI is valued at $80 billion, in part, allegedly, because of the data it took without permission.
If an alleged burglar came into your house, stole all your possessions, sold them, and promised you a percentage of the proceeds, would you be happy? How pleasant could such talks be?
This marks the first time a major newspaper has sued an AI company for allegedly infringing its copyrights by using its data for training. The Times claims OpenAI used “almost a century’s worth of copyrighted content” in order to “generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” You can reportedly get Copilot, trained on the same data, to output chunks of NY Times articles.
The data and reporting the NY Times produces, along with the individual voices and styles of its contributors, have seemingly been copied, repackaged, and sold without consent.
“By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”
– From the NY Times’ lawsuit
Before GPT-3.5, when OpenAI made its datasets public, it was clear that they included millions of records published by the NY Times. It was the third most referenced source, behind only Wikipedia (mostly Creative Commons) and U.S. patents (public domain).
To look at these accusations another way: if you bought a NY Times subscription, copied every article, re-published it all, removed any way for the NY Times to make money from it, and profited from it yourself, you’d allegedly be in a similar position to OpenAI’s. In our hypothetical, though, the law is clearer: you’d be a thief.
The NY Times hopes OpenAI will be seen in the same light.
Fiction and non-fiction authors have also sued AI companies, including Sarah Silverman, George R.R. Martin, Julian Sancton, John Grisham, and more. All claim their works were taken without permission, their voices, knowledge, and hard work allegedly stolen and repackaged.
The Costs of Journalism Include More Than Money
Journalism is more than simply putting words on a page. It’s a craft honed over a lifetime, a blend of the scientific method and the art of writing. Each author has their own voice, their own style, and their own story to tell through their reporting. But the path those stories travel to reach them is incredibly complex. Fact-finding is witnessing. It’s going to the site of events to collect notes on what happened, whether at the scene of a tragedy or on location in a war zone. It can involve secret meetings with sources, encrypted communication, and searching for ways to corroborate or disprove a person’s story. It’s tough, often dangerous, tireless work. Being the first to break a story is vital, but ensuring it’s truthful is even more important.
2023 was one of the deadliest years for journalists in recent memory. Over the past few months, dozens of journalists have been killed covering the Israel-Hamas war. More than 400 journalists have been imprisoned for doing their jobs, and reports put the number killed at around 100.
These people are risking their lives to bring the truth to light, to keep the world informed. Then AI companies allegedly scrape their information, their facts, their very voice, without permission, to repackage and sell it. No risk, full reward. If a story is written in blood and then copied by a soulless machine, a machine that may even inject “hallucinations,” misinformation, into an article and attribute it to that publication or author, then a serious violation of decency has been committed. To take someone’s life’s work and tarnish it like that is horrible. Thanks to AI, it’s even automated.
Hallucinations, Misinformation, and Quotes
While a large part of the lawsuit concerns the collection of the NY Times’ data without consent, it also covers direct quotes and misinformation. Chat AI can produce hallucinations, or false, made-up output, to fill in the blanks or make a sentence or paragraph better fit a prompt. It’s glorified autocorrect at its core, and that means it doesn’t see the difference between a hallucination and a fact. Only we do, because we can cross-reference claims, learn, grow, and apply critical thinking to the facts to understand something. Chatbots can’t do that. They are capable only of placing words in their most likely order.
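To make “placing words in their most likely order” concrete, here’s a minimal sketch of the underlying idea: a toy bigram model in Python. Everything in it is hypothetical and vastly simpler than GPT, but it shows the key point. The model tracks only which word tends to follow which; it has counts, not knowledge, and no way to tell a true continuation from a false one.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for training data.
corpus = "the times reported the story the times published the story".split()

# Count which word follows which. These counts are the model's entire "knowledge."
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word(word):
    """Return the statistically most likely next word, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

# Generate text by repeatedly choosing the most likely continuation.
word, output = "the", ["the"]
for _ in range(5):
    word = next_word(word)
    if word is None:
        break
    output.append(word)

print(" ".join(output))  # prints: the times reported the times reported
```

The output looks fluent, but nothing in the mechanism checks whether it’s true. Scale that same idea up by billions of parameters and you get text that reads like the NY Times, with none of the fact-checking behind it.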
If prompted, ChatGPT will apparently spit out paragraphs from articles that sit behind paywalls. The Times claims that, in the case of reviews from Wirecutter, which it owns, this means stripping out the affiliate links the NY Times uses to make money from its free stories and reviews.
On top of that, it’ll allegedly sometimes add false information while attributing it to the NY Times. If you publish a false statement about what someone said or did, that’s called libel. What happens when an AI does it?
“Publicly, Defendants insist that their conduct is protected as ‘fair use’ because their unlicensed use of copyrighted content to train GenAI models serves a new ‘transformative’ purpose. But there is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”
– From the NY Times’ lawsuit
Stolen Voice
Teaching is considered “fair use.” If you’re a teacher and you project a NY Times article on your screen to teach kids, that’s fair use; you won’t get in trouble for it. But that doctrine was never written to cover AI training. AI doesn’t learn. It does not use critical thinking to extrapolate additional information from limited data. It can only output patterns it has found in its data: words already put together, sentences already created. Everything it is, is whatever has been poured into it.
The NY Times points out that OpenAI and Microsoft’s chat AI “generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” That “expressive style” bit is important. It’s uniquely human, and not something AI can create anew. It’s the personal stories of people recounting their own lives, the atrocities they witnessed in war, or how a new tax could help education. Humanity itself, allegedly repackaged and sold. Even if AI companies claim the data has been changed enough (which, if the model is outputting entire sections of articles, it hasn’t), what is it changing that original data with? Is it using the personal life of one author, one person’s voice, to twist the words of another? What copyrighted text is being used to obscure one source’s plagiarism? Can they prove their transformations didn’t come from someone else’s work, taken without permission?
The Death of an Industry That Can’t Be Replaced
If no journalists were risking their lives to cover the Israel-Hamas war, what do you think AI would write about it? How would the machines cover the conflict? Perhaps they’d use information from Wikipedia, but how would Wikipedia verify it? Twitter (now X) has seemingly been full of disinformation, spreading antisemitism and Islamophobia that spills over into violence. Perhaps that’s a source it would use. It could base its output on someone claiming “Hitler was right.” It could use image generation to replace photographs, showing you what violence might look like halfway around the globe.
It could never give you the facts.
“If The Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill. Less journalism will be produced, and the cost to society will be enormous.”
– From the NY Times’ lawsuit
Because AI is built on what others make. If journalism dies, AI loses everything it needs to produce output about the news. Without the creativity and hard work of humans, AI and the companies profiting from its output have nothing of value.
This lawsuit seeks to erase the instances of GPT, and the data models behind them, that used NY Times material without permission. Those can be rebuilt the way they should have been built in the first place: from carefully curated sources the companies have the rights to copy. It wouldn’t be the death of ChatGPT, just new data powering it, without the articles the NY Times claims OpenAI stole for its product. The Times wants billions of dollars in damages for the millions of records taken from them, and, if OpenAI is found liable, a permanent injunction to prevent it from stealing from them again.
Your Labor Should Not Be Free for the Taking
If you want someone’s labor, you typically have to pay for it. Not doing so is theft. Labor has value, and people’s output has value. People cannot be forced to give up their labor. Not to a machine, not to a corporation, not to an AI. This lawsuit seeks to help create that precedent.
“Move fast and break things” is one of the most common mantras in tech. But in this case, the broken things are the livelihoods of people whose work was taken, against their will, to hasten the demise of their own industry. Allowing AI companies to use the output of someone’s labor to make that person obsolete is not a system that will ever benefit humanity. It only benefits the greedy, who take the work of others for free to make their own profits.
Maybe, as a human, not an AI or a corporation, I am biased. Maybe I do carry the bias of humanity: the desire to see myself and others have a chance to dream, to succeed, to learn, to grow, and to love. You won’t get that kind of bias from a machine that steals my job. Or yours, for that matter. Because if your work can be taken without your permission and used to make you jobless, purposeless, perhaps even homeless, you don’t own your labor. And there’s another word for that.
Sources: (See? I list mine, unlike OpenAI)
- Bobby Allyn, NPR
- AP News via The Guardian
- Josh Hendrickson, PCMag
- “List of journalists killed in the 2023 Israel-Hamas war,” Wikipedia
- Michel Martin, NPR
- Ryan McNeal, Android Authority
- Keith Romer, Erika Beras, Kenny Malone, Willa Rubin, Sam Yellowhorse Kesler, NPR
- Matt O’Brien and Frank Bajak, AP News
- Malak Saleh, Engadget
- Jonathan Stempel, Reuters
- John Timmer, Ars Technica
- Kyle Wiggers, TechCrunch
- Maxwell Zeff, Gizmodo