Scraping, Suing and Settlements: The Copyright battle between AI and Media
Behind the widespread euphoria over AI’s near-magical capabilities lies a bitter conflict between AI firms and media organisations over the unauthorised use of copyrighted content for AI training. Media companies and authors argue that generative AI models, such as those developed by OpenAI, Meta, and Perplexity AI, have been trained on their articles, books, and other intellectual property without consent, and that AI firms profit from this work without compensation or recognition.
In December 2023, The New York Times became the first major US media company to sue OpenAI, alleging that millions of its articles were used to train AI models without permission. The complaint accused OpenAI of profiting off the NYT’s significant investment in journalism. Prior to the NYT lawsuit, comedian and author Sarah Silverman, along with authors John Grisham and Jonathan Franzen, sued AI firms for the unauthorised use of their works to train AI.
The latest wave of lawsuits against AI firms over unauthorized use of content includes:
ANI vs OpenAI: Indian news agency ANI has filed a lawsuit against OpenAI for using its news content without permission to train AI models. ANI says this use infringes on its copyright and diminishes the value of its reporting.
News Corp vs Perplexity AI: The Wall Street Journal and The New York Post are suing Perplexity AI. The lawsuit claims Perplexity uses their content and trademarks without authorization.
Canadian media outlets vs OpenAI: Five Canadian news organisations – including The Globe and Mail – have collectively sued OpenAI, accusing it of using their copyrighted news content to train AI models without consent. “Journalism is in the public interest. OpenAI using other companies’ journalism for their own commercial gain is not. It’s illegal,” the lawsuit said.
The unfair claim to fair use
AI firms argue that their content usage qualifies as fair use, claiming AI-generated outputs differ significantly from original works. However, media outlets like The New York Times counter that such use undermines the online journalism business model.
Experts broadly agree that AI companies cannot simply lay claim to media content under the ‘fair use’ principle.
Elaine Fletcher, Business Manager - UK and Europe at Hoonartek, argues that using copyrighted content without permission is legally and ethically contentious, regardless of any societal benefits.
She reckons that publishers have valid claims over their content, including the specific expression of publicly reported facts. “While AI companies may argue fair use, this defence is not guaranteed and often depends on the purpose and amount of material used. The tension arises between protecting intellectual property rights and promoting technological advancement.”
Concurring with Fletcher, Vishal Rupani, Co-founder, Sprect.com, asserts that publishers have a legitimate claim over their work, especially when it is being used commercially by AI companies. He cites Canada and Australia, which have compelled Google to compensate news publishers for content featured in services such as Google News. This, he adds, reflects the growing realisation that content creators should be rewarded for their work, even in a digital world.
Rupani points out that the debate over whether AI companies can use copyrighted content without permission, even for societal benefits, is a tricky one.
According to him, while AI models like ChatGPT, Gemini, and Claude can democratize knowledge and boost productivity, they still need to respect intellectual property rights. The crux of the matter, he says, lies in whether scraping content from creators, whether it’s an in-depth news report or a viral cat meme, is fair game.
Rupani likens AI royalties to the music industry’s straightforward pay-per-stream model. “With AI, it’s different. AI learns unpredictably from content, making it hard to gauge how much a news article influenced an AI’s ‘aha!’ moment. Transparency is the real issue – AI firms control the content vault, making value attribution tricky. For example, Google News showcases content from publishers like TOI and The Hindu, but their earnings are unclear. Until better tracking methods emerge, revenue-sharing may just be another AI experiment,” he explains.
Surbhi Allagh, Co-founder of Itch, points out that AI threatens publication integrity and revenue. “Let’s be honest: while AI companies claim to benefit society, they’re also significantly benefiting themselves. Publications may not have a copyright on facts, but the issue lies in their unique expression – how they craft words, phrases, and structure entire pieces. This creative aspect is undeniably protected by copyright.”
In cases like The New York Times, where content is paywalled, using it without permission feels like a clear breach, according to her. “It’s not just about the facts, but the way those facts are packaged, and bypassing these protections undermines the publication’s efforts,” says Allagh.
Wanted: A middle ground
Perplexity AI defends its use of news content in the lawsuit filed by the Wall Street Journal and The New York Post, arguing that its actions fall within the limits of fair use. The company emphasizes that it offers transformative tools that enable users to access information in new ways.
“There are around three dozen lawsuits by media companies against generative AI tools. The common theme betrayed by those complaints collectively is that they wish this technology didn’t exist. They prefer to live in a world where publicly reported facts are owned by corporations, and no one can do anything with those publicly reported facts without paying a toll,” says Perplexity AI in a strongly worded article.
Generative AI tools are here to stay, but the ongoing tension between AI firms and media companies threatens the health of the ecosystem. Hence, AI companies and publishers need to find a middle ground to foster collaboration rather than confrontation, particularly through revenue-sharing models like the one Perplexity AI has proposed: in July this year, the AI start-up launched “a first-of-its-kind revenue-sharing program with leading publishers like TIME, Fortune, and Der Spiegel”.
Some recent instances of AI firms partnering with media companies include:
Time and OpenAI: Time entered into a multi-year agreement with OpenAI. This deal allows OpenAI to use Time’s 101-year archive to train its models and generate AI responses. In return, Time will have access to OpenAI’s tools for its own innovation efforts.
Condé Nast and OpenAI: Condé Nast partnered with OpenAI to allow the use of their content for AI products. This deal faced internal criticism, with some staff expressing concerns about aiding the development of tools that might undermine journalism. “No one wants to help train the tools spreading misinformation and degrading the skills many of us spent decades honing,” a writer said.
News Corp and OpenAI: News Corp struck a deal with OpenAI to share content from outlets such as The Wall Street Journal and New York Post. This arrangement aims to ensure proper compensation and attribution while enabling OpenAI to train its models.
Elaine Fletcher feels that a middle ground can be found through licensing agreements and revenue-sharing models. “AI companies can respect intellectual property rights while accessing valuable data by compensating publishers for using their content. Collaborative approaches like Perplexity’s proposal encourage partnership over litigation, benefiting both parties and promoting innovation.”
Vishal Rupani feels that collaboration through revenue-sharing models, like Perplexity’s proposal, is feasible but tricky – the biggest hurdle is transparency.
Perplexity has effectively leveraged the loophole that no publication technically owns rights to facts, opines Surbhi Allagh. However, she adds, publications can no longer afford to ignore the reality that AI is here to stay. According to her, the way forward is finding a balance that serves the interests of both sides.
“Licensing agreements and revenue-sharing models, such as ad revenue splits, could provide mutually beneficial solutions. That said, the sustainability of such models from the perspective of AI companies is worth questioning. Can these companies sustain their growth while compensating publishers fairly? The future likely lies in creative, adaptable frameworks that work for both parties,” says Allagh.
Regulators’ role?
This dilemma raises ethical, creative, and legal challenges, making a clear solution elusive. As lawsuits between publishers and AI companies intensify, should regulators step in to define fair use in AI training? How might these legal battles influence innovation, content creation, and digital publishing in the future?
Surbhi Allagh says regulators are still struggling to set clear boundaries, making the landscape increasingly confusing.
“We’ve seen cases like the Monkey Selfie dispute, and now copyright claims seem to be getting more absurd. There’s a strong need to establish clear guardrails and continuously evolve them so that content creation can thrive without the constant fear of potential legal consequences. Without a framework, both AI companies and creators are left running around like headless chickens, making it difficult to innovate and grow,” says Allagh.
Regulatory bodies are struggling to navigate AI, a field that’s evolving faster than they can keep up with, says Vishal Rupani. “They might be experts in finance or healthcare, but AI is an entirely new frontier that requires a deeper understanding before effective regulations can be crafted.”
Currently, he adds, AI companies are like kids in a candy store, grabbing what they can, while publishers remain sidelined.
“Without fully understanding how AI learns and uses content, regulators risk making ill-fitting decisions. While these lawsuits may initially slow progress, they could ultimately lead to clearer rules. AI firms will need to refine their practices, and publishers will learn to protect their work better. This could pave the way for a fairer, more structured content-sharing system,” Rupani concludes.
Elaine Fletcher suggests that regulatory bodies should provide clear guidelines on how copyrighted materials produced by government authorities, universities, research institutes and industry enterprises may be used in AI training.
According to her, regulators should update fair use provisions to reflect modern technologies. “Without clarity, ongoing lawsuits may hinder innovation and create legal hurdles. Clear regulations can balance the interests of all parties, fostering a healthy environment for innovation and digital publishing,” Fletcher concludes.
The conflict between AI firms and media companies underscores a critical need for balanced solutions that respect intellectual property while fostering innovation. Scraping, suing, and settlements are just the symptoms of an ecosystem in flux. Collaboration through licensing agreements and revenue-sharing models offers a promising path forward, but transparency in AI operations is essential for these models to succeed.