OECD Report: Some AI models reproduce entire paragraphs from copyrighted books or generate art that closely imitates famous artists’ styles.
The number of companies disclosing their data sources has dropped from 20% to just 7% in under a year, possibly due to fear of lawsuits.
Question 1 of 2 for ChatGPT-4o: Please read the report “Intellectual property issues in artificial intelligence trained on scraped data” and tell me what it says in easy-to-understand language. List all surprising, controversial, and valuable statements made in the report.
ChatGPT-4o’s answer:
Summary of the Report: "Intellectual Property Issues in Artificial Intelligence Trained on Scraped Data" (OECD, 2025)
The report from the OECD explores how scraping data from the internet for training AI models creates significant intellectual property (IP) issues. It outlines the current legal landscape, highlights controversial and surprising challenges, and suggests possible policy solutions.
📌 Key Points in Simple Terms:
Data Scraping and AI:
AI systems, especially generative AI, need large datasets to learn. Developers often scrape data from websites, social media, and online databases without permission.
Scraping is automated data collection, often done without notifying or compensating the owners of that content.
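A minimal sketch of what "polite" scraping looks like in practice: before fetching a page, a crawler can consult the site's robots.txt file, the long-standing mechanism sites use to signal what automated collection they permit. The user-agent names and rules below are illustrative assumptions, not taken from the report.

```python
# Illustrative sketch: checking a site's robots.txt before scraping.
# The agent names and rules are made-up examples.
from urllib import robotparser

def allowed_to_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if this robots.txt (already downloaded) permits
    the given user agent to fetch the given path."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

# A hypothetical site that blocks an AI-training crawler entirely
# but lets other bots in, except for a private area.
ROBOTS = """\
User-agent: ai-trainer
Disallow: /

User-agent: *
Disallow: /private/
"""
```

Whether a scraper honors robots.txt is exactly the kind of voluntary compliance the report questions: the file is a request, not a technical or (in most jurisdictions) a legal barrier.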
IP Rights Impacted:
Copyright: Many scraped materials, like books or articles, are protected under copyright laws.
Database Rights: In regions like the EU, even collections of data can have legal protection.
Trademarks: Logos or brand names scraped from sites could cause confusion if used improperly.
Moral Rights: Artists may object if AI imitates their style without permission.
Publicity Rights: Using a person's voice or likeness without consent can violate their rights.
Global Legal Confusion:
Laws on scraping and IP vary greatly between countries. For example, the EU allows scraping for research unless the owner opts out, while Japan permits scraping for commercial use under certain conditions.
Lawsuits are rapidly increasing, especially in the U.S., where authors and artists are suing over the use of their work in AI training without permission.
Transparency Problems:
AI developers often hide where they get their training data from. A Stanford study found that only 7% of major AI developers disclose their data sources.
Some popular datasets, like “Books3,” contain large amounts of pirated content.
Big Tech's Role:
Companies like LinkedIn scrape data themselves while trying to block others from scraping their platforms.
Data aggregators, such as Common Crawl and LAION, collect scraped data and distribute it freely, often without checking if it’s legal to share.
🚨 Surprising or Controversial Statements:
Massive Use of Pirated Content: Popular AI training datasets are filled with copyrighted material, including pirated books and articles.
Dual Standards by Tech Giants: Platforms that scrape data for their own AI models aggressively prevent others from scraping their sites.
AI Outputs Mimic Copyrighted Works: Some AI models reproduce entire paragraphs from copyrighted books or generate art that closely imitates famous artists’ styles.
Falling Transparency: The number of companies disclosing their data sources has dropped from 20% to just 7% in under a year, possibly due to fear of lawsuits.
💡 Valuable Insights and Policy Suggestions:
Voluntary Code of Conduct: Create industry standards for scraping that include transparency rules, contract terms, and monitoring mechanisms.
Technical Tools for Rights Holders: Develop better digital tools for creators to control how their data is scraped or used.
Standard Contracts: Encourage contracts with clear terms for data scraping that protect creators and clarify legal responsibilities.
Raising Awareness: Educate creators and developers about their rights and obligations regarding data scraping and AI training.
🌍 International Responses:
EU: Requires general-purpose AI models to comply with copyright laws and disclose data sources.
Japan: Allows scraping but with limitations based on usage and contractual terms.
Canada: Promotes transparency through a voluntary code of conduct for generative AI.
U.S.: Courts are currently deciding whether AI training on scraped content qualifies as fair use.
This report highlights the urgent need for clearer laws and agreements around AI training data. It stresses that without action, legal disputes will continue to rise, and creators will lose control over their work.
Question 2 of 2 for ChatGPT-4o: How can publishers and regulators use this information when wanting to protect Intellectual Property? And what are the lessons that AI makers can learn from this report?
ChatGPT-4o’s answer:
📚 How Publishers and Regulators Can Use This Report to Protect Intellectual Property (IP):
1. Advocate for Transparency and Data Provenance Standards:
Require Disclosure: Push for laws or industry standards that force AI developers to disclose their training datasets, similar to the EU AI Act’s rules.
Audit and Certify Datasets: Establish independent audit mechanisms to verify whether training data complies with copyright and licensing laws.
2. Promote Stronger Licensing Practices:
Standardized Licensing Terms: Develop industry-wide templates for licensing agreements that cover AI data use, resale rights, and compensation models.
Compulsory Licensing Models: Advocate for models where certain AI uses of content require automatic payments (similar to how music royalties work).
3. Develop and Support Technical Solutions:
Digital Rights Management (DRM) Tools: Create technologies that embed usage restrictions into digital content (e.g., watermarks, licensing trackers).
Web Crawling Safeguards: Encourage platforms to use stronger anti-scraping tools, such as bot detection and CAPTCHA protocols.
4. Influence Policy and Legal Reforms:
Support Legislative Updates: Push for laws that clarify how fair use, text and data mining (TDM), and database rights apply to AI training.
Jurisdictional Cooperation: Promote international agreements to address cross-border scraping and IP violations.
5. Educate Content Creators:
Provide Guidance: Help creators understand their rights, such as how to opt out of TDM under EU laws or use copyright registration systems effectively.
Collective Rights Management: Support the formation of organizations that can collectively bargain with AI developers on behalf of creators.
🤖 Lessons for AI Makers from the Report:
1. Transparency Builds Trust:
Clearly disclose what datasets you use to train your models. Lack of transparency has already damaged public trust and led to lawsuits.
2. Respect Copyright and Licensing Rights:
Secure licenses for content used in training and be proactive about compensating rights holders.
Use data from sources with clear permissions (e.g., Creative Commons, public domain works).
3. Address Style and Likeness Rights Proactively:
Obtain explicit consent when training models on styles or likenesses of living creators or public figures.
Develop opt-out mechanisms for artists and authors who do not wish their works to train AI models.
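One way such an opt-out can be honored in practice is to check per-page signals before ingesting content. The `noai` and `noimageai` directives below are an emerging informal convention adopted by some platforms and crawlers, not a formal standard, so treat this as a hypothetical sketch rather than an established API.

```python
# Hedged sketch: honoring a per-page AI-training opt-out signal.
# "noai"/"noimageai" are informal directives, not a formal standard.

def opted_out(headers: dict[str, str], html_meta_robots: str = "") -> bool:
    """True if the page signals it should not be used for AI training,
    via an X-Robots-Tag header or a robots meta tag's content value."""
    signals = headers.get("X-Robots-Tag", "") + "," + html_meta_robots
    directives = {d.strip().lower() for d in signals.split(",")}
    return bool({"noai", "noimageai"} & directives)
```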
4. Prepare for Cross-Border Compliance:
Ensure compliance with international laws like the EU AI Act, which requires transparency even if your company is based outside the EU.
Be aware of differences between fair use (U.S.), opt-out-based TDM exceptions (EU), and Japan's broad TDM exception, which applies without an opt-out but is subject to conditions.
5. Use Technical Solutions to Manage Rights:
Integrate tools that track data provenance and limit the use of protected content.
Develop filters to exclude IP-protected works from training datasets.
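A provenance filter of the kind described above can be as simple as an allow-list of permissive licenses combined with a block-list of opted-out domains, applied per record before training. The field names and list contents below are illustrative assumptions, not drawn from the report.

```python
# Illustrative provenance filter for a training corpus. The record
# schema ("url", "license") and the example lists are assumptions.
from urllib.parse import urlsplit

ALLOWED_LICENSES = {"cc0", "cc-by", "public-domain"}
BLOCKED_DOMAINS = {"publisher-optout.example"}  # rights holders who opted out

def keep_record(record: dict) -> bool:
    """Keep a record only if its license is permissive and its source
    domain has not opted out."""
    license_ok = record.get("license", "").lower() in ALLOWED_LICENSES
    domain = urlsplit(record.get("url", "")).netloc.lower()
    return license_ok and domain not in BLOCKED_DOMAINS

def filter_dataset(records: list[dict]) -> list[dict]:
    return [r for r in records if keep_record(r)]
```

In practice such filters depend on trustworthy license metadata, which is exactly the transparency gap the report highlights.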
6. Support Industry-Wide Solutions:
Participate in developing voluntary codes of conduct, as suggested in the report.
Collaborate with publishers to create licensing frameworks rather than waiting for lawsuits.
💡 Strategic Takeaways:
For publishers and regulators, this report is a roadmap for policy advocacy, technical protection measures, and stakeholder education. For AI developers, it highlights that compliance with IP laws is not only a legal necessity but also a path to long-term industry credibility.