Question for AI services: Please read the articles "Perplexity’s grand theft AI" and "Reddit escalates its fight against AI bots" and explain what the right thing to do is in this situation and what AI makers shouldn't be doing. Also, do you think the robots.txt provision is sufficient? Or do content and rights owners need better protection against AI makers who do not keep to the rules?
ChatGPT-4o’s answer:
Perplexity’s Practices:
Data Scraping and Aggregation: Perplexity scrapes high-quality sources to create summarized reports without adequately citing or providing value back to the original sources. This practice is detrimental to the original content creators as it diverts traffic and ad revenue away from them.
Ignoring Robots.txt: Despite claims of adhering to ethical practices, Perplexity uses third-party scrapers that ignore the robots.txt files, which are meant to prevent unauthorized data scraping.
Plagiarism: Perplexity has been caught dodging paywalls and directly copying content and art from sources like Forbes and Wired without proper attribution.
Reddit’s Response to Data Scraping:
Blocking Automated Bots: Reddit is enforcing its policy to block automated bots from accessing its public data without a licensing agreement, signaling a move to protect its content from unauthorized use.
Robots.txt Limitations: While Reddit is updating its robots.txt file, it acknowledges that this measure alone is insufficient to deter bad actors who ignore such directives.
What AI Makers Shouldn’t Be Doing:
Ignoring Ethical Guidelines: AI companies should not ignore robots.txt files or other ethical guidelines put in place by content creators to prevent unauthorized scraping.
Plagiarizing Content: They should avoid practices that involve copying content verbatim or creating summaries without proper attribution and compensation to the original creators.
Deceptive Practices: Engaging in deceptive practices, such as pretending to be academic researchers to gain access to restricted data, is unethical and undermines trust.
What the Right Thing to Do Is:
Respecting Robots.txt: AI makers should ensure their data collection methods respect the robots.txt provisions and other directives established by content owners.
Fair Use and Compensation: They should establish fair use policies that include proper attribution and compensation to content creators whose work they utilize.
Transparent Practices: Companies should operate transparently, clearly communicating how they collect and use data and ensuring compliance with ethical and legal standards.
Is Robots.txt Sufficient?:
Current Limitations: The robots.txt provision is a basic tool meant to signal web crawlers about allowed and disallowed areas for scraping. However, it is not legally enforceable and often ignored by bad actors.
Need for Better Protection: Content and rights owners need better protection mechanisms beyond robots.txt. This could include legal frameworks that impose penalties for violations, technological solutions to better detect and block unauthorized scraping, and industry-wide standards that are enforceable.
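To make concrete how little robots.txt actually enforces, here is a minimal sketch, using Python's standard urllib.robotparser, of the voluntary check a well-behaved crawler is expected to run before fetching a page; the bot name and URLs are hypothetical placeholders, not anything named in the articles.

    # Minimal sketch: the voluntary robots.txt check a well-behaved crawler performs.
    # The bot name and URLs below are hypothetical examples.
    from urllib.robotparser import RobotFileParser

    # A publisher's robots.txt might contain directives such as:
    #   User-agent: ExampleAIBot
    #   Disallow: /
    # which merely ask that crawler to stay away.

    BOT_USER_AGENT = "ExampleAIBot"
    TARGET_PAGE = "https://example.com/articles/some-story"

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt

    if parser.can_fetch(BOT_USER_AGENT, TARGET_PAGE):
        print("robots.txt allows this fetch")
    else:
        print("robots.txt disallows this fetch; a compliant crawler skips the page")

Nothing in this check is binding: a scraper that never runs it, or that reports a different user agent, hits no technical barrier, which is exactly the gap both articles describe.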
Conclusion: To protect content creators and maintain ethical standards, AI makers should adhere to established guidelines, seek proper permissions, and ensure transparency in their data collection practices. While robots.txt is a useful tool, it is not sufficient on its own, and additional measures are necessary to safeguard the interests of content and rights owners.
Claude’s answer:
Based on the articles, there are several key ethical issues at play regarding AI companies scraping content from websites:
Violating robots.txt: Many AI companies are ignoring or circumventing the robots.txt file, which is meant to govern how web crawlers can access a site. This violates long-standing web norms and etiquette.
Copyright infringement: Some companies, like Perplexity, are reproducing copyrighted content, including images, without permission.
Bypassing paywalls: There are instances of AI companies accessing content behind paywalls without authorization.
Lack of attribution: AI-generated content often fails to properly cite or attribute the original sources.
Disrupting existing business models: By providing direct answers rather than links, these AI systems are potentially starving content creators of traffic and ad revenue.
The right thing to do in this situation would be for AI companies to:
Respect robots.txt and other crawler policies
Obtain proper licenses for content they wish to use
Provide clear attribution for sources
Develop revenue-sharing models with content creators
Be transparent about their data collection and usage practices
AI makers should not:
Circumvent technical or legal protections on content
Use content without permission or proper licensing
Present copyrighted material as their own
Ignore ethical considerations in pursuit of training data
Regarding whether robots.txt is sufficient protection, the articles suggest it is not. The Reddit legal officer notes that robots.txt is not legally enforceable, and many AI companies are simply ignoring it. Content and rights owners likely do need stronger legal and technical protections against unethical data scraping and usage.
Some potential solutions could include:
Developing new technical standards beyond robots.txt
Strengthening copyright laws to explicitly address AI training data
Creating industry-wide ethical guidelines for AI development
Implementing more robust access controls on websites (one approach is sketched after this answer)
Establishing clear legal precedents through litigation against violators
Ultimately, a combination of technical measures, legal frameworks, and ethical standards will likely be necessary to adequately protect content creators in the age of AI.
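To make the "more robust access controls" suggestion above concrete, one widely used technique is forward-confirmed reverse DNS: verify that a request claiming to be a known crawler really originates from that crawler's network. The sketch below uses only Python's standard socket module; the IP address and hostname suffix are hypothetical values chosen for illustration.

    # Sketch: forward-confirmed reverse DNS check for a claimed crawler.
    # The IP address and hostname suffix below are hypothetical.
    import socket

    def crawler_ip_matches_claim(client_ip: str, expected_suffix: str) -> bool:
        """True if client_ip reverse-resolves to a host under expected_suffix
        and that host resolves back to the same IP (forward confirmation)."""
        try:
            hostname, _, _ = socket.gethostbyaddr(client_ip)      # reverse DNS
            if not hostname.endswith(expected_suffix):
                return False
            forward_ips = socket.gethostbyname_ex(hostname)[2]    # forward DNS
            return client_ip in forward_ips
        except OSError:
            return False

    # A request claiming to be "TrustedSearchBot" should come from an IP that
    # resolves under the operator's published domain (values are made up here).
    if not crawler_ip_matches_claim("203.0.113.7", ".trustedsearch.example"):
        print("claimed crawler identity not verified; treat as an unknown bot")

Checks like this only help against bots that identify themselves; they complement, rather than replace, the legal and licensing measures listed above.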
Google Gemini’s answer:
AI makers shouldn't be ignoring robots.txt, terms of service, and copyright law. They shouldn't see publicly available data as open for taking without permission.
The right thing to do would be for AI companies to respect the rules set by content creators. Reddit, for example, is making exceptions for a handful of noncommercial entities and for the companies it has licensing agreements with.
Robots.txt alone is not sufficient; it is more about sending a message, and the industry needs something stronger. Lee suggests the industry needs something other than robots.txt but acknowledges the difficulty of creating new standards.
Content and rights owners need better protection. The article mentions the uncomfortable truth that most AI companies don't care about the current rules. There needs to be a way to enforce scraping rules.
MS Copilot’s answer:
The situation described in the articles “Perplexity’s grand theft AI” and “Reddit escalates its fight against AI bots” raises important ethical and legal considerations for AI development and deployment. Here’s what AI makers should and shouldn’t do in this context:
What AI Makers Should Do:
Respect Copyright and Data Ownership: AI companies should ensure they have the necessary permissions to use data, especially when it comes from sources like Reddit that have clear policies on data scraping.
Follow robots.txt Directives: While not legally enforceable, robots.txt files are a standard for web crawling and should be respected as they reflect the website owner’s data usage policies.
Transparency in Data Usage: AI makers should be transparent about their data sources and usage practices, providing clear information to users and rights owners.
Ethical Data Acquisition: Companies should acquire data ethically, avoiding practices that could be considered “data theft” or infringement of intellectual property rights.
What AI Makers Shouldn’t Do:
Ignore robots.txt and Terms of Service: AI makers should not ignore the directives set out in robots.txt files or the terms of service of websites from which they scrape data.
Use Data Without Consent: They should not use proprietary data without the consent of the content and rights owners.
Engage in Unethical Practices: AI companies should avoid engaging in unethical practices such as plagiarism, unauthorized data scraping, and other forms of intellectual property infringement.
Regarding the sufficiency of the robots.txt provision, while it is a widely recognized standard, it is not legally binding and may not provide sufficient protection for content and rights owners. Therefore, there is a need for stronger measures and better enforcement mechanisms to protect against unauthorized use of data by AI makers. This could include:
Legal Frameworks: Developing legal frameworks that clearly define the rights of data owners and the responsibilities of AI companies.
Licensing Agreements: Encouraging the use of formal licensing agreements for data usage, similar to those Reddit has with certain companies.
Technical Measures: Implementing technical measures to prevent unauthorized scraping, such as rate-limiting and blocking unknown bots (a simplified sketch follows this list).
Industry Standards: Establishing industry standards for ethical AI development, including responsible data acquisition and usage.
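As a rough illustration of the "Technical Measures" item above, the sketch below combines a user-agent allowlist with a per-IP sliding-window rate limit, using only the Python standard library. The allowlisted prefixes, window length, and request budget are illustrative assumptions rather than anything prescribed in the articles.

    # Sketch: refuse unknown bots and rate-limit aggressive clients.
    # Allowlist, window length, and request budget are illustrative values.
    import time
    from collections import defaultdict, deque

    ALLOWED_AGENT_PREFIXES = ("Mozilla/", "Googlebot", "bingbot")
    WINDOW_SECONDS = 60            # length of the sliding window
    MAX_REQUESTS_PER_WINDOW = 30   # per-IP budget inside the window

    _recent_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def should_serve(client_ip: str, user_agent: str) -> bool:
        """Return True to serve the request, False to refuse it (e.g. HTTP 429/403)."""
        # 1. Refuse clients that do not identify as an allowed agent.
        if not user_agent.startswith(ALLOWED_AGENT_PREFIXES):
            return False
        # 2. Enforce a sliding-window rate limit per client IP.
        now = time.monotonic()
        hits = _recent_hits[client_ip]
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()                      # drop hits older than the window
        if len(hits) >= MAX_REQUESTS_PER_WINDOW:
            return False                        # over budget for this window
        hits.append(now)
        return True

    print(should_serve("198.51.100.9", "UnknownScraper/2.1"))  # False: agent not allowed

Determined scrapers can rotate IPs and spoof browser user agents, so measures like this raise the cost of unauthorized scraping rather than eliminate it, which is why the legal and licensing points above still matter.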
In conclusion, AI makers should act responsibly and ethically, respecting the rights of content and data owners. The robots.txt provision is a good starting point, but additional protections and clearer regulations may be necessary to ensure that AI development is conducted in a manner that respects intellectual property and privacy rights.