Question 1 of 2 for ChatGPT-4o: Please read the article "Cloze Encounters: The Impact of Pirated Data Access on LLM Performance" and tell me what it says in easy to understand language.
ChatGPT-4o’s answer:
Summary of "Cloze Encounters: The Impact of Pirated Data Access on LLM Performance"
What’s the Study About?
The paper examines how using pirated books in AI training affects the performance of Large Language Models (LLMs) like GPT, Claude, Gemini, and Llama. The researchers focus on the Books3 dataset, a collection of over 195,000 pirated books that AI developers may have used to train models.
They compare AI performance on books included in Books3 against books that are not, using a "name cloze" test: the AI is shown a short passage from a book with a character name masked out and must fill in the missing name.
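A minimal sketch of how a name-cloze item might be constructed and scored. This is illustrative only; the prompt wording and exact-match scoring here are assumptions, not the paper's precise protocol.

```python
import re

def make_name_cloze(passage: str, name: str) -> str:
    """Mask one occurrence of a character name with a [MASK] token."""
    return re.sub(rf"\b{re.escape(name)}\b", "[MASK]", passage, count=1)

def score_guess(guess: str, name: str) -> bool:
    """Exact-match scoring: the model must reproduce the masked name."""
    return guess.strip().lower() == name.lower()

prompt = make_name_cloze("Atticus rose and walked slowly to the window.", "Atticus")
print(prompt)  # [MASK] rose and walked slowly to the window.
```

A model that has memorized the book can recover the masked name even when the passage appears nowhere else online, which is what makes the test a training-data probe.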
Key Findings:
AI Performs Better on Books in the Books3 Dataset
Models are significantly better at recalling names and details from books in Books3 than books not in the dataset.
GPT-4 and GPT-3.5 showed the biggest improvement (21-23% increase in accuracy).
Other models, like Llama-70B and Claude, showed smaller improvements (~7-9%).
The smallest Llama model (Llama-8B) showed no improvement, suggesting model size matters.
Less Popular Books Benefit the Most
AI models showed the biggest performance jump on less popular books that had fewer summaries or references available elsewhere.
For very popular books, which have many alternative sources (like Wikipedia summaries), the improvement was smaller or even negative.
Books3 Likely Played a Big Role in AI Training
Even though AI companies don’t disclose their training data, the study strongly suggests that models trained on Books3 gained a significant advantage.
The fact that performance dropped sharply for books published after 2020 (not in Books3) supports this conclusion.
Legal and Economic Implications
The study highlights a major copyright issue—LLMs benefit significantly from pirated books.
This could affect lawsuits against AI companies (e.g., by The New York Times, Authors Guild, etc.).
Publishers and authors could demand licensing fees for their content based on this performance advantage.
Why This Matters
For AI Companies: Shows the importance of data sources and potential risks of relying on pirated content.
For Authors & Publishers: Strengthens the case for compensation if their books improve AI performance.
For Copyright Law: Adds evidence that training AI models on copyrighted material may not qualify as fair use.
Question 2 of 2 for ChatGPT-4o: List the most surprising, controversial and valuable statements made in the paper and provide overall recommendations for publishers, AI makers, those entangled in litigation and regulators.
ChatGPT-4o’s answer:
Most Surprising, Controversial, and Valuable Statements in the Paper
Surprising Statements
AI Performance Boost from Pirated Data is Measurable and Significant
LLMs showed up to a 23% increase in name-cloze accuracy on books included in Books3 compared to similar books outside the dataset.
This means pirated data directly enhances AI capabilities.
More Popular Books Benefit Less from Direct Access
The AI models gained the biggest advantage from obscure books rather than bestsellers.
This suggests that unique, less accessible content plays a more crucial role in AI training than commonly available texts.
AI Models Likely Trained on Books3 Despite Companies Not Disclosing It
The study found a sharp performance drop for books published after 2020 (which were not included in Books3).
This strongly implies Books3 (a pirated dataset) was indeed used for training, even though AI makers haven’t admitted it.
Controversial Statements
OpenAI’s GPT-4 and GPT-3.5 Benefited More from Books3 Than Other Models
The study suggests OpenAI’s models rely more heavily on Books3 than Claude, Gemini, or Llama.
If true, this could directly impact ongoing lawsuits where OpenAI claims it does not use pirated data.
Piracy is Functionally Beneficial for AI Model Development
The paper indirectly suggests that LLMs trained on pirated content outperform those that are not.
This raises serious legal and ethical concerns—if piracy helps AI perform better, does it incentivize illegal data usage?
AI Licensing Market is Underpriced Compared to Pirated Content’s Value
AI companies like Google and OpenAI are paying tens of millions for licensed data, but this study shows that pirated content provides huge performance gains.
This could increase demands for higher licensing fees from content owners.
Most Valuable Statements
AI Model Size Matters When It Comes to Reliance on Pirated Content
Smaller models (like Llama-8B) did not benefit from Books3, while larger models saw substantial improvements.
This suggests that scaling laws make copyrighted data even more valuable for training.
The Fair Use Defense May Be Weakened
AI companies often argue that no single data source significantly affects AI performance.
This paper proves otherwise—Books3 inclusion measurably improves model accuracy.
This could harm OpenAI, Google, and Meta’s fair use arguments in court.
Empirical Framework for Detecting Copyrighted Data in Training
The study introduces a "name cloze" method that can help detect whether AI models have been trained on specific copyrighted books.
This is a valuable tool for publishers, authors, and regulators to investigate copyright violations.
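In spirit, the detection framework reduces to comparing name-cloze accuracy between books inside and outside the suspect dataset. A toy sketch of that comparison (the per-passage results below are invented for illustration):

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of name-cloze items answered correctly."""
    return sum(results) / len(results)

# Hypothetical per-passage outcomes: True = model recovered the masked name.
in_books3 = [True, True, False, True, True, False, True, True]         # 75% correct
not_in_books3 = [False, True, False, False, True, False, False, True]  # 37.5% correct

gap = accuracy(in_books3) - accuracy(not_in_books3)
print(f"accuracy gap: {gap:.3f}")  # accuracy gap: 0.375
```

A large, consistent positive gap (after controlling for book popularity, as the paper does) is the signal that the in-dataset books were likely seen during training.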
Recommendations for Key Stakeholders
📖 For Publishers & Authors
✅ Strengthen Licensing Negotiations
Use this study to demand higher licensing fees for AI training, particularly for less popular or niche content.
✅ Leverage Empirical Evidence in Lawsuits
The study provides hard data showing that AI models perform better with access to pirated books.
This undermines AI companies' fair use defense and strengthens copyright claims.
✅ Develop AI-Exclusive Content Deals
Follow the path of Reddit ($60M deal with Google) and Wiley ($23M AI content licensing deal).
Publishers should demand payment rather than allowing unlicensed scraping.
🤖 For AI Makers (OpenAI, Google, Meta, Anthropic)
✅ Improve Transparency on Training Data
This study suggests that AI companies are not fully disclosing their data sources.
To avoid lawsuits and regulatory scrutiny, AI companies should clearly disclose training datasets.
✅ Reevaluate Dependence on Pirated Data
The study suggests pirated content significantly boosts performance.
AI makers must develop strategies to acquire legal content while maintaining high performance.
✅ Strengthen Data Attribution & Licensing
Implement better tracking of data origins to prove compliance with copyright laws.
Explore direct licensing agreements with book publishers to reduce legal risks.
⚖️ For Those Entangled in Litigation (NY Times, Authors Guild, Getty Images, etc.)
✅ Use This Study to Undermine Fair Use Defenses
AI companies often argue that individual datasets don’t significantly impact performance.
This paper proves otherwise—Books3 dramatically improves model output.
✅ Push for Model Training Transparency in Court
Courts should demand disclosure of AI training datasets.
Use the name cloze method to investigate whether a model was trained on pirated content.
✅ Demand Compensation Based on Performance Gains
AI companies may owe higher damages if their use of copyrighted material directly led to better models.
🏛️ For Regulators & Policymakers
✅ Mandate Transparency in AI Training Data
Require AI developers to publish details of datasets used in training.
Establish audit mechanisms to check for copyright violations.
✅ Strengthen Copyright Protections for AI Training
Expand copyright laws to explicitly cover AI training data.
Require opt-in rather than opt-out models for content licensing.
✅ Create AI Data Licensing Standards
Develop fair pricing models for copyrighted content in AI training.
This prevents AI companies from relying on piracy as a cheaper alternative.
Final Thoughts
This paper is a game-changer in the debate over AI and copyright. It provides some of the first empirical evidence that pirated content measurably improves AI models, which could have massive legal, financial, and policy implications.
Would you like me to draft a formal position paper based on these insights? 🚀