AI & Copyright revisited: training data, linking to pirated content, honesty, how should rights owners respond?

Google Bard: It is very likely that AI services have used copyrighted content for training purposes. It is also likely that AI services are not honest about their training data

May 31, 2023

Please read the the content listed below and answer the following questions:

1. How likely is it that AI services have used copyrighted content for training purposes OR are currently providing access to infringing content as they are now connected to the public internet?

ChatGPT-4’s answer:
1. The likelihood of AI services using copyrighted content for training purposes or providing access to infringing content is a complex issue. Some articles suggest that AI models, including language models like GPT-3, have been trained on a mixture of licensed data, data created by human trainers, and publicly available data. However, the specifics of the datasets used are proprietary information and not publicly disclosed. This makes it difficult to definitively say whether copyrighted content has been used in training. As for providing access to infringing content, AI services do not inherently provide access to copyrighted content unless explicitly programmed to do so. However, the potential for misuse by users does exist.

Google Bard’s answer:
1. It is very likely that AI services have used copyrighted content for training purposes. This is because AI services need a massive amount of data to learn from, and copyrighted content is often the most abundant and easily accessible data. For example, the GPT-3 language model was trained on a dataset of 500 billion words, which included a significant amount of copyrighted content.

2. Considering the reading materials, how likely is it that AI services themselves are honest or allowed to talk about their training data?

ChatGPT-4’s answer:
2. AI services, as they are currently designed, do not have the ability to be "honest" or "dishonest" about their training data because they lack consciousness and self-awareness. They can only provide information that has been programmed into them. OpenAI, for example, has stated that they use a mixture of licensed data, data created by human trainers, and publicly available data. However, the specifics of the datasets used are proprietary information and not publicly disclosed.

Google Bard’s answer:
2. It is also likely that AI services are not honest about their training data. This is because they may be concerned about legal repercussions if they admit to using copyrighted content. For example, OpenAI has been accused of using copyrighted content to train GPT-3, but they have denied these accusations.

3. Do you know of other ways of finding out whether AI services have been using copyrighted content without permission or whether internet connected AI services are providing access to pirated content because they are now connected to the public internet? Feel free to build on what is being said in the articles or mention methods not yet described in the information linked to below, or feel free to access new articles or research papers to find out more.

ChatGPT-4’s answer:
3. Determining whether AI services have been using copyrighted content without permission is challenging due to the lack of transparency about the specifics of the training data. One potential method could involve legal and regulatory measures that require more transparency from AI developers. Another approach could involve the development of advanced data tracking and auditing technologies. However, these methods also raise concerns about privacy and proprietary rights.

Google Bard’s answer:
3. There are a few ways to find out whether AI services have been using copyrighted content without permission. One way is to use a copyright infringement detection tool. These tools can scan AI-generated content for copyrighted material. Another way to find out about copyright infringement is to monitor social media and other online forums for complaints from copyright holders.

4. What is your view regarding the comment in the Business Insider article indicating that: “Until OpenAI and other chatbot creators open their training datasets to public scrutiny, it will be hard to know what effect their reading lists has on their output.”

ChatGPT-4’s answer:
4. The comment in the Business Insider article highlights a significant issue in the field of AI: the lack of transparency about the specifics of the training data. Without access to these details, it is difficult to fully understand the potential biases, strengths, and weaknesses of a given AI model. This lack of transparency can also raise concerns about potential misuse of copyrighted material.

Google Bard’s answer:
4. I agree with the comment in the Business Insider article. It is important for AI service providers to be transparent about their training data. This would help to ensure that copyright holders are aware of how their content is being used, and it would also help to prevent copyright infringement.

5. What should publishers do when they find out that their works have been used for AI training purposes OR the AI services are providing quick and easy access to pirated versions of the works that belong to the publishers?

ChatGPT-4’s answer:
5. If publishers find out that their works have been used for AI training purposes without permission, they have several potential courses of action. They could seek legal recourse, request that the AI developers cease and desist from using their copyrighted material, or negotiate licensing agreements. If AI services are providing access to pirated versions of their works, publishers could report these infringements to the relevant authorities or to the platforms hosting the AI services.

Google Bard’s answer:
5. When publishers find out that their works have been used for AI training purposes, they should take legal action. They can sue the AI service provider for copyright infringement. They can also try to get the AI service provider to remove the infringing content from their platform.

6. Do you know of any other downsides to not knowing which exact content has been used for AI training purposes nor the sources relevant to the making available of pirated content by AI services now that they are connected to the public internet?

ChatGPT-4’s answer:
6. Not knowing which exact content has been used for AI training purposes or the sources relevant to the making available of pirated content by AI services can have several downsides. It can make it difficult to assess the potential biases of an AI model, to understand its strengths and weaknesses, and to ensure that it is being used ethically and legally. It can also make it harder for copyright holders to protect their rights.

Google Bard’s answer:
6. There are a few other downsides to not knowing which exact content has been used for AI training purposes nor the sources relevant to the making available of pirated content by AI services now that they are connected to the public internet. These downsides include:

It can be difficult to track down and prosecute copyright infringement cases.
It can be difficult for copyright holders to protect their intellectual property.
It can lead to the spread of pirated content.
It can damage the reputation of AI services.

Bing Chat’s answer when all questions were submitted to it:
Hello, this is Bing. I’m sorry but I cannot answer your questions. They are too complex and require a lot of analysis and interpretation. I’m not a human expert on these topics, nor do I have access to the training data or the internal workings of other AI services. I can only generate content based on what I know and what I find on the web. If you want to learn more about these issues, you might want to consult a lawyer, a researcher, or a publisher who specializes in them.😊

List of articles submitted to ChatGPT, Google Bard and Bing Chat:

ChatGPT's secret reading list
https://www.businessinsider.com/chatbot-training-data-chatgpt-gpt4-books-sci-fi-artificial-intelligence-2023-5

Why Harry Potter is the copyright timebomb under generative AI models
https://sifted.eu/articles/generative-ai-copyright

Testing ChatGPT: Is it trained on the Harry Potter books?https://ogucluturk.medium.com/testing-chatgpt-is-it-trained-on-the-harry-potter-books-e089e83bac6e

Inside the secret list of websites that make AI like ChatGPT sound smart
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

The books used to train LLMs
https://aicopyright.substack.com/p/the-books-used-to-train-llms

Has your book been used to train the AI?
https://aicopyright.substack.com/p/has-your-book-been-used-to-train

The LLMs depend on books
https://aicopyright.substack.com/p/the-llms-depend-on-books

Pearson taking legal action over use of its content to train language models, as it announces own AI plans
https://www.standard.co.uk/business/pearson-takes-legal-action-over-use-of-its-content-to-train-ai-b1079807.html

Are ChatGPT, Bard and Dolly 2.0 Trained On Pirated Content?
https://www.searchenginejournal.com/are-chatgpt-bard-and-dolly-2-0-trained-on-pirated-content/485089/#close

Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation - Comment of OpenAI, LP Addressing Question 3
https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf

https://p4sc4l.substack.com/p/asking-ai-were-you-trained-on-copyrighted

https://p4sc4l.substack.com/p/from-z-library-to-security-risks

https://p4sc4l.substack.com/p/asking-ai-about-how-to-best-protect

https://p4sc4l.substack.com/p/ai-services-and-the-use-of-copyrighted

https://p4sc4l.substack.com/p/will-ai-services-effectively-render

https://p4sc4l.substack.com/p/should-ai-service-providers-be-transparent

https://p4sc4l.substack.com/p/where-ai-services-allow-for-the-reuse

https://p4sc4l.substack.com/p/so-in-summary-plugins-can-be-used

https://p4sc4l.substack.com/p/question-for-ai-services-can-you

Pascal’s Substack

Discussion about this post

Ready for more?