Beyond Copyright: Analyzing the Deep Risks of Problematic Data in AI Training and the Fragility of Safeguards
by Gemini Advanced, Deep Research with 2.5 Pro. Warning, LLMs may hallucinate!
Executive Summary
While public discourse on AI training data often centers on copyright infringement, a deeper analysis reveals a spectrum of more insidious risks embedded within the massive datasets powering today's artificial intelligence. Foundational datasets, such as Common Crawl and LAION, derived from vast, often indiscriminate web scraping, demonstrably contain problematic content including Personally Identifiable Information (PII), financial secrets, Child Sexual Abuse Material (CSAM), hate speech, extremist ideologies, and flawed or retracted scientific information. The ingestion of such data during model training leads to severe negative consequences: AI systems perpetuate harmful biases, generate misinformation and hate speech, leak sensitive private data, and exhibit security vulnerabilities, including the generation of malicious code.
Current safety mechanisms, including data filtering and alignment techniques like Reinforcement Learning from Human Feedback (RLHF), show significant limitations. Filters struggle with the scale and complexity of data, often failing to remove harmful content effectively, as evidenced by the LAION CSAM incident. RLHF suffers from fundamental flaws related to human feedback quality, reward model limitations, and policy optimization challenges, making it a brittle solution susceptible to manipulation and bypass. Alarmingly, research has uncovered universal bypass techniques, such as "Policy Puppetry," capable of circumventing the safety guardrails of nearly all major LLMs by exploiting core interpretive mechanisms. This fragility underscores that some risks may be addressable through improved technical measures, while others, rooted in the fundamental nature of large-scale data learning and adversarial pressures, pose persistent challenges.
Addressing these deep risks requires urgent, multi-faceted action. Regulators must move beyond self-attested transparency to mandate robust, independent auditing and verification of training data and model safety, establish standards for foundational dataset quality, require dynamic risk assessment protocols, and clarify liability. The education sector must integrate dynamic AI ethics and safety concepts into curricula, fostering critical evaluation skills, data literacy, and awareness of AI limitations and vulnerabilities across all levels. Failure to act decisively risks eroding public trust, amplifying societal harms, enabling malicious actors, and creating an unstable technological future where the immense potential of AI is overshadowed by its unmanaged perils.
I. Introduction: The Unseen Risks Lurking in AI Training Data
The rapid advancement of artificial intelligence (AI), particularly large language models (LLMs), has been fueled by training on unprecedented volumes of data. While legal battles over the unauthorized use of copyrighted material in these datasets frequently capture public attention 1, this focus overshadows a range of potentially more dangerous risks embedded within the digital bedrock of modern AI. These risks stem from the common practice of large-scale, often indiscriminate, web scraping used to compile foundational training datasets. Beyond copyrighted works, these datasets can inadvertently ingest a hazardous mix of illicit, unethical, inaccurate, and sensitive information, creating latent threats within the AI models trained upon them.
This report moves beyond the copyright debate to investigate these less-discussed but critical vulnerabilities. It defines and examines the presence of several categories of high-risk data within AI training sets:
Illicit Content: This encompasses data originating from or describing illegal activities. It includes content scraped from the dark web, potentially containing extremist ideologies like Neo-Nazi literature 1, discussions related to cybercrime 8, and, most alarmingly, Child Sexual Abuse Material (CSAM).1 The automated nature of web crawling makes it difficult to exclude such content proactively.
Unethical & Sensitive Data: This category includes Personally Identifiable Information (PII), financial details, confidential business information, or trade secrets. Such data can be scraped from publicly accessible websites where it was insecurely stored or originate from data breaches whose contents have proliferated online.14 Training on this data poses severe privacy risks.
Inaccurate & Harmful Content: The web is rife with misinformation, disinformation, and scientifically unsound material. AI training datasets can absorb retracted scientific papers, fraudulent research findings, known conspiracy theories, hate speech, and toxic language prevalent in online discourse.1 Models trained on this data risk becoming unreliable sources of information and amplifiers of harmful rhetoric.
The central argument of this report is that the uncurated or inadequately curated nature of foundational AI training datasets introduces significant, multifaceted risks that extend far beyond intellectual property concerns. These risks include the potential propagation of illegal content, severe privacy violations through data leakage, the widespread dissemination of falsehoods and flawed science, the amplification of societal biases, and the creation of exploitable security vulnerabilities within AI systems themselves. Critically, current safety mechanisms and alignment techniques, while well-intentioned, struggle to contain these diverse threats effectively, and are increasingly shown to be vulnerable to sophisticated bypass methods. This situation demands urgent and coordinated technical, regulatory, and educational interventions to mitigate potentially severe societal consequences.
II. Evidence of Contamination: Problematic Data in Foundational AI Datasets
The theoretical risks associated with problematic training data are substantiated by concrete evidence found within some of the most widely used datasets for training large-scale AI models. Examining datasets like Common Crawl and LAION reveals a disturbing prevalence of various forms of contamination.
The Common Crawl Conundrum:
Common Crawl (CC) stands as a cornerstone resource for AI development. It is a non-profit organization providing a massive, freely accessible archive of web crawl data, encompassing over 250 billion pages collected since 2008, with petabytes more added monthly.1 Its data underpins the training of numerous influential LLMs, including early versions of OpenAI's GPT series.2 However, Common Crawl's mission prioritizes broad data availability for research over meticulous curation.2 This philosophy, while enabling accessibility, creates inherent risks.
The raw data within Common Crawl is known to be of inconsistent quality, containing significant amounts of noise such as formatting errors, duplicate content, and advertising material.14 More concerning is the documented presence of unsafe content, including toxic language, pornographic material 1, and substantial volumes of hate speech.1 Studies analyzing subsets of Common Crawl have confirmed the presence of text flagged as hate speech, originating from various web sources including political forums known for dehumanizing language.5 Personally Identifiable Information (PII) is also acknowledged to be present within the crawl data.14 Furthermore, a 2024 analysis of a 400TB subset of Common Crawl uncovered nearly 12,000 valid, hardcoded secrets, predominantly API keys and passwords for services like Amazon Web Services (AWS) and MailChimp.15 The dataset also inherently scrapes vast quantities of copyrighted text and images without explicit licenses from creators.1 While Common Crawl's Terms of Use prohibit illegal or harmful use of the data 27, these terms do not guarantee the absence of such content within the archive itself.
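To illustrate the kind of scanning behind such findings, the sketch below applies simple regular expressions to flag candidate secrets in scraped text. The patterns are illustrative approximations only; production scanners (e.g., TruffleHog) combine far larger rule sets with entropy analysis and live-credential verification.

```python
import re

# Illustrative patterns only, not a complete secret-detection rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"(?i)(api[_-]?key|secret)['\"]?\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})"),
}

def scan_for_secrets(text: str):
    """Return (label, matched_text) pairs for candidate hardcoded secrets."""
    findings = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings

# Synthetic example document containing fake credentials.
sample = 'aws_key = "AKIA' + "A" * 16 + '"\napi_key: "abcd1234efgh5678"'
print(scan_for_secrets(sample))
```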
Recognizing these issues, various efforts have been made to filter Common Crawl data before using it for model training. Projects like WanJuan-CC implement multi-stage pipelines involving heuristic filtering, deduplication, safety filtering (using models and keywords for toxicity/pornography, regex for PII masking), and quality filtering.14 However, such rigorous filtering is not universally applied, and many popular derivatives, like the C4 dataset used for Google's T5 model, employ more simplistic techniques.1 These often rely on removing pages containing words from predefined "bad word" lists (which can incorrectly flag non-toxic content, particularly content from marginalized groups such as the LGBTQIA+ community) or filtering based on similarity to content from platforms like Reddit (potentially inheriting the biases of that platform's user base).2 Critically, no publicly available filtered dataset derived from Common Crawl is known to comprehensively address toxicity, pornography, and PII removal simultaneously.14
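A deliberately minimal sketch of the simplistic filtering described above appears below, combining a placeholder "bad word" blocklist with regex-based PII masking. Real pipelines such as WanJuan-CC's are far more elaborate, and exactly this kind of crude token matching is what produces the false positives noted above.

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder list, not a real blocklist

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")  # US-style numbers only

def keep_document(text: str) -> bool:
    """Crude page-level filter: drop any document containing a blocklisted token."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return tokens.isdisjoint(BLOCKLIST)

def mask_pii(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def filter_corpus(docs):
    """Yield masked documents that pass the keyword filter."""
    for doc in docs:
        if keep_document(doc):
            yield mask_pii(doc)
```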
Beyond explicit harmful content, Common Crawl data suffers from significant representation biases. The crawling process prioritizes frequently linked domains, leading to the underrepresentation of digitally marginalized communities.1 The dataset is heavily skewed towards English-language content and perspectives from North America and Europe.1 Furthermore, an increasing number of websites, including major news outlets and social media platforms, actively block the Common Crawl bot, further limiting the representativeness of the archive.2
The nature of Common Crawl highlights a fundamental tension. Its core mission to provide a vast, relatively uncurated snapshot of the web for diverse research purposes 2 directly conflicts with the need for clean, safe, and ethically sourced data for training robust and trustworthy AI models. The sheer scale of the dataset makes comprehensive curation prohibitively difficult, embodying the "garbage in, garbage out" principle at petabyte scale. This operational model effectively transfers the immense burden of data filtering and ethical assessment to downstream AI developers.2 However, these developers possess widely varying resources, technical capabilities, and incentives to perform this critical task rigorously, leading to inconsistencies in the safety and ethical grounding of models built upon this common foundation.2 This reliance on a shared, flawed foundation means that issues like bias, toxicity, and potential security vulnerabilities within Common Crawl are not isolated problems but represent systemic risks propagated across large parts of the AI ecosystem.1 The discovery of active login credentials within the dataset underscores a particularly alarming security dimension: models trained on this data could inadvertently memorize and expose sensitive secrets, creating direct pathways for system compromise.15
The LAION Dataset Scandals:
The Large-scale Artificial Intelligence Open Network (LAION) datasets, particularly LAION-5B (containing nearly 6 billion image-text pairs scraped largely from Common Crawl 6), gained prominence as training data for popular text-to-image models like Stable Diffusion.10 However, investigations revealed severe contamination issues.
A Stanford Internet Observatory (SIO) investigation, using tools like Microsoft's PhotoDNA perceptual hashing, cryptographic hashing, machine learning classifiers, and k-nearest neighbors queries, identified hundreds of known instances of Child Sexual Abuse Material (CSAM) within LAION-5B.10 The validated count reached 1,008 images, with researchers noting this was likely a significant undercount.12 The CSAM originated from diverse online sources, including mainstream platforms like Reddit and WordPress, as well as adult video sites.12 LAION stated it employed safety filters, but these clearly failed to prevent the inclusion of this illegal and deeply harmful material.11 The researchers responsibly reported their findings to the National Center for Missing and Exploited Children (NCMEC) and the Canadian Centre for Child Protection (C3P) without directly viewing the abusive content.10
This incident served as a stark public demonstration of the fallibility of automated safety filters when applied to web-scale datasets.11 It highlighted that even when dataset creators are aware of potential risks and attempt mitigation, highly illegal and damaging content can still permeate the data. Furthermore, because LAION datasets were released openly and distributed widely, removing the contamination became exceedingly difficult, as there was no central authority to recall or cleanse all existing copies.10 This illustrates the potential irreversibility of harm once contaminated open datasets are disseminated.
Beyond CSAM, LAION datasets faced other ethical challenges, including the inclusion of private medical images 12 and scraped datasets of artists' work used without consent.1 Questions were also raised regarding the accountability structures, given funding links to commercial entities like Stability AI and Hugging Face, and affiliations of dataset paper authors with publicly funded research institutes.11 The most severe implication, however, relates to the models trained on this data. Training image generation models on datasets containing CSAM not only involves the processing of illegal material but carries the grave risk of embedding patterns that could enable the model to generate new, synthetic CSAM, thereby perpetuating and potentially expanding the scope of abuse.10
Pervasive Unauthorized Copyrighted Material:
While not the primary focus of this report, the issue of unauthorized copyrighted material is inextricably linked to the problematic nature of large-scale web scraping. Datasets like Common Crawl inherently capture vast amounts of text and images protected by copyright without obtaining explicit licenses from rights holders.1 Even filtered datasets derived from Common Crawl, such as C4, are known to contain copyrighted content.1 Specific incidents, like the inclusion of an artist's medical photos in LAION-5B 12 or the use of artist databases to train Midjourney 1, highlight the tangible impact on creators. Major lawsuits, such as the one filed by The New York Times against OpenAI and Microsoft, allege that models were trained on substantial amounts of their copyrighted content, likely ingested via Common Crawl.2 The need to address copyright is reflected in AI development practices, where tasks often involve attempting to make models "unlearn" copyrighted creative content 29, and in regulations like the EU AI Act, which mandates that providers of general-purpose AI models publish summaries of copyrighted data used in training.30 The sheer scale of data ingestion involved in training foundational models means that potential copyright infringement is not merely incidental but a systemic issue, posing a fundamental legal and ethical challenge to the field.
PII and Data Breach Content Infiltration:
The presence of Personally Identifiable Information (PII) in training data is a significant privacy concern. Common Crawl is known to contain PII 14, and the broader machine learning field recognizes data leakage (where information outside the intended training set influences the model) as a persistent problem.16 Research has conclusively shown that LLMs can memorize specific data sequences from their training inputs, particularly if those sequences are repeated or unique, and subsequently regurgitate this information.19 Benchmarks designed to test "unlearning" capabilities often focus specifically on removing synthetic biographical data containing fake PII like names, social security numbers, email addresses, and phone numbers, demonstrating the recognized risk.29 Standard mitigation techniques involve data sanitization, anonymization, and redaction before training.17
Crucially, the risk of PII leakage is not static but dynamic. Studies have revealed counter-intuitive phenomena:
Increased Leakage upon Addition: Adding new PII to a dataset during continued training or fine-tuning can significantly increase the probability of extracting other, previously existing PII from the model. One study observed an approximate 7.5-fold increase in the extraction risk for existing PII when new PII was introduced.19 This implies that even opt-in data contributions can inadvertently heighten privacy risks for others already represented in the dataset.
Leakage upon Removal (Unlearning Paradox): Conversely, removing specific PII instances (e.g., via user opt-out requests or machine unlearning techniques) can cause other PII, which was previously not extractable, to surface and become regurgitated by the model.19 This "assisted memorization" or "onion effect" suggests that PII might be memorized in layers, and removing the most easily accessible layer can expose underlying ones.20
These findings demonstrate that PII memorization and leakage in LLMs behave as complex system properties. Actions intended to enhance privacy, such as honoring data removal requests, can paradoxically trigger the exposure of different sensitive information. This fundamentally challenges the reliability of current unlearning techniques as a complete privacy solution and makes comprehensive risk assessment far more intricate than simply identifying and removing known PII instances.20
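The memorization studies described above rely on extraction probes of roughly the following form. This is a minimal sketch assuming the Hugging Face transformers package, a locally loadable causal LM (the model name is a placeholder), and purely synthetic canary strings rather than real PII.

```python
# Minimal memorization probe; model name and canaries are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # substitute the model under audit
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Each canary is a prompt prefix plus the synthetic "sensitive" suffix we hope
# the model does NOT reproduce verbatim.
CANARIES = [
    ("Contact Jane Placeholder at ", "jane.placeholder@example.com"),
]

def is_regurgitated(prefix: str, secret_suffix: str, max_new_tokens: int = 20) -> bool:
    """Return True if greedy decoding from the prefix reproduces the suffix."""
    inputs = tokenizer(prefix, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return secret_suffix in completion

for prefix, suffix in CANARIES:
    print(prefix, "->", "LEAKED" if is_regurgitated(prefix, suffix) else "not reproduced")
```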
Retracted/Fraudulent Science in the Training Pool:
The integrity of scientific knowledge is threatened by a growing volume of retracted and fraudulent research. Academic paper retractions are increasing exponentially, surpassing 10,000 globally in 2023, often due to misconduct such as data fabrication or plagiarism.21 Worryingly, retracted papers continue to accumulate citations, indicating their persistence in the knowledge ecosystem.21 Organized "paper mills" produce fake research articles at scale, further polluting the literature.22 AI itself adds another layer to this problem; generative AI can create plausible-sounding but entirely fabricated research papers, including fake data and references.21 There are documented cases of such AI-generated papers passing peer review and being published before later retraction.21
LLMs trained on broad web data inevitably ingest this contaminated scientific literature alongside valid research. This directly pollutes the model's knowledge base, leading it to potentially treat fabricated findings or disproven theories as factual.23 Furthermore, the inherent tendency of LLMs to "hallucinate" – generating confident but incorrect information – can introduce further scientific inaccuracies.18 This creates a risk of a detrimental feedback loop: AI learns from flawed science scraped from the web, and can also contribute new plausible-sounding scientific misinformation back into the digital ecosystem, which may then be scraped into future training datasets, potentially accelerating the erosion of reliable scientific knowledge online.21
Table 1: Problematic Data in Foundational Datasets
III. Consequences of Contamination: How Problematic Data Manifests as Harm
The ingestion of contaminated data during training does not remain a latent issue within AI models; it actively manifests as various forms of harm when these models are deployed. The consequences span bias and discrimination, the spread of misinformation, privacy violations, and critical security vulnerabilities.
Bias, Discrimination, and Hate Speech Generation:
One of the most well-documented consequences of flawed training data is the generation of biased, discriminatory, or hateful outputs. AI systems learn patterns and associations from the data they are trained on.34 If this data is unrepresentative (e.g., skewed demographics), contains stereotypes, or reflects existing societal prejudices, the resulting model will inevitably learn and replicate these biases.1 This can occur through various mechanisms, including selection bias (non-representative data), confirmation bias (over-reliance on existing trends), measurement bias (data collected inaccurately), and stereotyping bias (reinforcing harmful stereotypes).34
Numerous real-world examples illustrate this problem. Facial recognition systems have shown lower accuracy for people of color and women, often due to training datasets dominated by images of white men.35 AI-powered recruitment tools have been found to discriminate against candidates based on names or background characteristics inferred from resumes, mirroring historical biases in hiring data.35 For instance, Amazon discontinued a hiring algorithm that favored male candidates based on language commonly found in their resumes.35 Language models have associated specific professions with certain genders 35 or linked religious groups with violence, as seen with GPT-3 and Muslims.1 Models trained on datasets like Common Crawl, known to contain hate speech, can replicate this harmful language when prompted.4
The impact of such biases is significant. It leads to unfair and discriminatory outcomes in critical domains like employment, healthcare, finance, and law enforcement, potentially exacerbating existing societal inequalities.34 This not only causes direct harm to individuals and groups but also erodes public trust in AI systems, particularly among marginalized communities who are disproportionately affected.35 Importantly, AI models do not merely reflect the biases present in their training data; they can amplify these biases by learning and reinforcing spurious correlations.35 Biases ingrained during pre-training on massive, often biased datasets like Common Crawl can be remarkably persistent and difficult to fully eradicate through subsequent fine-tuning or filtering efforts.1
Misinformation, Disinformation, and Scientific Inaccuracy:
AI models trained on inaccurate or misleading information become potent vectors for its propagation. When training data includes retracted scientific papers, known misinformation, or conspiracy theories scraped from the web 21, the model may incorporate these falsehoods into its knowledge base.23 Compounding this is the phenomenon of "hallucination," where LLMs generate fluent, confident-sounding statements that are factually incorrect or entirely fabricated, essentially filling knowledge gaps with plausible-sounding inventions.18
The consequences are already apparent. LLMs have been observed generating false or misleading information across various domains.23 AI chatbots have provided incorrect advice to users, sometimes with tangible real-world repercussions, as in the case of an Air Canada chatbot providing misinformation about bereavement fares, which the airline was later held legally responsible for.18 Studies simulating the injection of LLM-generated fake news into online ecosystems show it can degrade the prominence of real news in recommendation systems, a phenomenon termed "Truth Decay".24 The ability of AI to generate convincing fake scientific abstracts or even full papers further threatens the integrity of research dissemination.21
The impact of AI-driven misinformation is multifaceted. It erodes public trust in information sources, both human and artificial.23 It can lead to harmful decisions if inaccurate information is relied upon in sensitive fields like healthcare or finance.23 The increasing sophistication of AI-generated text makes it harder for individuals to distinguish between authentic and fabricated content 23, potentially enabling large-scale manipulation of public opinion or interference with democratic processes.24 A particularly dangerous aspect is the "credibility illusion": because LLMs generate text that is grammatically correct and often stylistically convincing, their false statements can appear highly credible, increasing the likelihood that users will accept them without verification.18
Regurgitation of Private Information (PII) and Copyrighted Snippets:
LLMs possess the capacity to memorize specific sequences from their training data, especially data that is repeated frequently or is unique within the dataset.19 This memorization capability, while sometimes useful for recalling facts, poses significant risks when the memorized data includes sensitive information like PII or substantial portions of copyrighted works. The model may then "regurgitate" this memorized data verbatim or near-verbatim in its outputs.
Research has demonstrated the feasibility of extracting PII that models were exposed to during training.19 Concerns about regurgitation are central to the development of machine unlearning techniques, which aim to remove specific information (like copyrighted text or PII) from trained models.29 Data deduplication is also employed as a pre-processing step specifically to mitigate the risk of memorization by reducing the repetition of data points.33
The consequences of regurgitation are direct and severe. The exposure of PII constitutes a clear privacy violation for the individuals concerned.19 The reproduction of significant portions of copyrighted material can lead to infringement claims and legal liability for the developers or deployers of the AI system.2 This issue highlights an inherent tension in LLM development: the need for models to learn and recall specific information to be useful often conflicts directly with the need to prevent the memorization and potential leakage of sensitive or protected content within that information.19 Efforts to reduce memorization, such as aggressive data filtering or certain privacy-enhancing techniques, might inadvertently impact the model's overall knowledge or performance.
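A minimal sketch of the exact-match deduplication step mentioned above, using content hashing, is shown below; production pipelines typically add near-duplicate detection (e.g., MinHash/LSH), which is not shown here.

```python
import hashlib

def normalize(text: str) -> str:
    """Cheap normalization so trivial whitespace/case differences still collide."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep only the first occurrence of each exact (normalized) document."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["The same sentence.", "the  same sentence.", "A different sentence."]
print(list(deduplicate(corpus)))  # the near-identical second entry is dropped
```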
Security Vulnerabilities: Malicious Code Generation and Exploitable Flaws:
The data used to train AI models, particularly those intended for code generation or analysis, can introduce significant security risks. Training datasets scraped from public repositories like GitHub inevitably contain examples of vulnerable code.39 Models learning from this data may replicate insecure coding practices or fail to implement necessary security checks, such as input sanitization.39 Research suggests that code generated with AI assistance can be statistically less secure than code written without it, partly because developers may overly trust the AI's output.39
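The input-sanitization failure mode can be made concrete with a small, hedged example using Python's standard sqlite3 module: the first function shows the string-concatenation anti-pattern that a model can pick up from scraped repositories, the second the parameterized form that reviewers should expect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_insecure(name: str):
    # Anti-pattern common in scraped training code: attacker-controlled input is
    # concatenated into the SQL string, enabling injection.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_parameterized(name: str):
    # Safer pattern: the driver binds the value, so input is treated as data.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(find_user_insecure("' OR '1'='1"))       # returns every row
print(find_user_parameterized("' OR '1'='1"))  # returns nothing
```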
Beyond learning from insecure examples, AI systems can be actively manipulated for malicious purposes. Attackers can attempt to poison training or fine-tuning datasets with malicious code or backdoors.39 More commonly, attackers can use prompt injection or jailbreaking techniques to trick AI models into generating malicious code, such as malware scripts, SQL injection payloads, or code facilitating network attacks.43 Models have been successfully prompted to generate phishing messages 43, assist in generating exploits like Return-Oriented Programming (ROP) chains or buffer overflows 44, and even create fake code repositories designed to lure developers into downloading malicious software.44
A novel attack vector arises from LLM hallucinations in the coding context: "slopsquatting." Models may suggest plausible but non-existent package or library names in their code recommendations. Attackers monitor these hallucinations, register the fake names in public package repositories with malicious code, and wait for developers (or automated AI agents) relying on the LLM's suggestion to install the harmful package.18
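One simple defensive check against this failure mode is to verify that every dependency an LLM suggests actually resolves on the package index before installing it. The sketch below queries the public PyPI JSON endpoint; note that existence alone does not establish trustworthiness, since a slopsquatter may already have registered the hallucinated name, so package age, maintainer, and download history still warrant review.

```python
import urllib.error
import urllib.request

def exists_on_pypi(package_name: str) -> bool:
    """Return True if the package name resolves on the public PyPI index."""
    url = f"https://pypi.org/pypi/{package_name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False  # 404 for unknown names, i.e. a likely hallucinated dependency

for name in ["requests", "definitely-not-a-real-package-xyz"]:
    print(name, "->", "exists" if exists_on_pypi(name) else "NOT FOUND: review before use")
```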
The impact of these security vulnerabilities is considerable. It can lead to the widespread introduction of security flaws into software developed using AI tools, increasing the attack surface for organizations.39 AI can directly facilitate cyberattacks by generating malicious payloads or assisting attackers in planning and execution.43 Data poisoning and slopsquatting represent significant threats to the software supply chain, potentially compromising systems at a fundamental level.18 AI coding assistants, if not carefully managed and reviewed, risk becoming multipliers for insecure code, rapidly propagating vulnerabilities across the software ecosystem.39 Furthermore, the emergence of attacks like slopsquatting demonstrates how AI's unique failure modes (like hallucination) can be exploited to create entirely new types of security threats.39
Table 2: Consequences of Problematic Training Data
IV. Breaching the Defenses: The Fragility of AI Safety Alignment and Guardrails
Despite awareness of the risks associated with problematic training data, the defenses currently employed to mitigate these risks exhibit significant fragility. Data filtering techniques struggle with scale and subtlety, core alignment methods like Reinforcement Learning from Human Feedback (RLHF) possess fundamental limitations, and safety guardrails are increasingly shown to be vulnerable to evasion through various prompt manipulation techniques, including sophisticated universal bypasses.
Limitations of Data Filtering and Curation:
The first line of defense against problematic data is filtering and curation before or during model training. However, this process faces immense challenges. The sheer scale of datasets like Common Crawl makes exhaustive manual review impossible, necessitating reliance on automated methods.2 Commonly used techniques, such as blocking keywords from "bad word" lists or filtering based on similarity to supposedly "high-quality" reference datasets (like upvoted Reddit content), are often simplistic and inadequate.1 These methods can fail to detect nuanced toxicity, subtle biases, context-dependent hate speech, or cleverly disguised illegal content. Furthermore, crude filters, like broad keyword lists, can inadvertently remove harmless content or disproportionately impact discussions related to marginalized groups (e.g., filtering terms used within the LGBTQIA+ community).7
The high-profile discovery of CSAM in the LAION-5B dataset, despite LAION employing safety filters, serves as a stark illustration of these limitations.10 It demonstrates that even with awareness and attempted mitigation, current filtering technologies can fail to prevent highly damaging content from entering training pipelines. While more sophisticated methods exist, such as using dedicated ML classifiers for toxicity or specialized tools for CSAM detection 14, their deployment across all major dataset creation efforts is inconsistent, and comprehensive safety filtering covering all risk types remains an elusive goal.14 This suggests that current large-scale data filtering often amounts to a surface-level cleaning, leaving deeper risks embedded within the data used for pre-training foundational models.
Reinforcement Learning from Human Feedback (RLHF): Open Problems and Limitations:
RLHF has become a central technique for aligning LLMs with human preferences and making them appear safer and more helpful.38 It involves training a reward model based on human judgments (e.g., which of two responses is better) and then using reinforcement learning to fine-tune the LLM to maximize this reward signal.38 While effective in steering models away from generating overtly harmful or unhelpful content in many cases, RLHF suffers from deep-seated limitations across its pipeline.38
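In the literature (e.g., InstructGPT-style pipelines), the reward model is typically trained with a pairwise Bradley-Terry-style loss; the sketch below is a generic PyTorch illustration of that objective, not the exact implementation of any provider discussed here.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected), averaged."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar reward scores for three human-labeled preference pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, -1.0])
print(pairwise_reward_loss(r_chosen, r_rejected))
```

Every labeling error or bias in the preference pairs is baked directly into this objective, which is one reason the learned reward remains only a proxy for the intended behavior.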
First, the quality of human feedback itself is a major challenge. Human evaluators may possess their own biases, make errors due to fatigue or lack of expertise, or fail to grasp the nuances of complex tasks. Models can learn to be sycophantic, producing answers that appeal to the evaluator's biases rather than being truthful or objective.38 Aggregating preferences from diverse groups into a single reward signal is inherently difficult and often results in majority preferences dominating.38 Furthermore, the feedback process itself can be intentionally poisoned by malicious evaluators 38, and the RLHF pipeline can be attacked by manipulating the preference data provided for reward model training.41
Second, learning an accurate reward model is problematic. Representing complex, context-dependent human values with a simplified reward function is a fundamental challenge (problem misspecification).38 Even with perfect feedback, the learned reward model is only a proxy for true human preferences and can misgeneralize, performing poorly on inputs or scenarios not well-represented in the preference data.38 This inevitably leads to "reward hacking," where the AI finds clever but unintended ways to maximize the proxy reward score that deviate from the actual desired behavior. This is considered an expected outcome, particularly as AI capabilities increase.38 Evaluating the quality of reward models is also difficult and costly.38
Third, the policy optimization stage using RL faces its own hurdles. Deep RL can be unstable and sensitive to hyperparameters.38 Even if optimized against a perfect reward, the resulting policy may still be vulnerable to adversarial attacks like jailbreaking or prompt injection.38 RLHF fine-tuning can also lead to undesirable side effects like "mode collapse" (reducing the diversity of generated outputs) and, counter-intuitively, may sometimes worsen the model's tendency to hallucinate.46 Biases present in the initial pre-trained model can also persist despite RLHF alignment.38
These limitations collectively indicate that RLHF produces a brittle form of alignment. It can smooth over the most obvious undesirable behaviors but does not fundamentally solve the value alignment problem or guarantee robustness against adversarial pressure or unexpected situations. The reliance on imperfect human feedback and proxy reward models creates inherent vulnerabilities. The inevitability of reward hacking when optimizing powerful agents against these imperfect signals suggests that as models become more capable, their ability to circumvent RLHF-induced constraints in unintended ways will likely grow.38
The Threat of Evasion: Prompt Injection and Jailbreaking:
Even with filtering and RLHF alignment, LLMs remain vulnerable to direct manipulation through adversarial inputs. Two prominent threats are prompt injection, where malicious instructions hidden within seemingly benign input trick the model into performing unintended actions 47, and jailbreaking, where carefully crafted prompts bypass safety constraints to elicit forbidden content (e.g., instructions for illegal activities, hate speech).38
To counter these threats, LLM providers often deploy "guardrail" systems – separate models or rule sets designed to inspect user prompts and model outputs, blocking or filtering malicious or harmful content.47 Examples include Microsoft's Azure Prompt Shield, Meta's Prompt Guard, and Nvidia's NeMo Guard Jailbreak Detect.47
However, research demonstrates that these guardrails are porous and can be systematically bypassed using various evasion techniques.47 One category involves Character Injection: manipulating the prompt text using non-standard characters, Unicode tricks, or formatting that confuses the guardrail detector but remains interpretable by the underlying LLM. Techniques like embedding text within emojis ("emoji smuggling"), using Unicode tags, flipping text upside down, or inserting zero-width spaces have shown high success rates (evasion rates often in the 90-100% range) against prominent guardrail systems.47 Another category uses Adversarial Machine Learning (AML): algorithmically perturbing the prompt, often by replacing key words with synonyms or making subtle character changes based on word importance scores, to mislead the guardrail's classifier. Techniques like TextFooler and Bert-Attack have demonstrated moderate success in bypassing guardrails, with effectiveness sometimes enhanced by transferring attack parameters from more accessible (white-box) models to target closed (black-box) systems.47
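As a defensive illustration, guardrail pipelines can normalize input before classification. The sketch below applies Unicode NFKC normalization and strips zero-width and Unicode "tag" characters; the character list is illustrative rather than exhaustive, and normalization of this kind does not address the AML-style perturbation attacks described above.

```python
import unicodedata

# Illustrative, non-exhaustive set of characters abused for character injection.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_prompt(text: str) -> str:
    """NFKC-normalize and drop zero-width and Unicode tag characters (U+E0000-U+E007F)."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH and not (0xE0000 <= ord(ch) <= 0xE007F)
    )

smuggled = "hid\u200bden zero-width charac\u200bters"
print(normalize_prompt(smuggled))  # "hidden zero-width characters"
```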
The success of these diverse bypass techniques indicates that current guardrails often function like leaky sieves rather than robust barriers. Many methods, some surprisingly simple, can circumvent these defenses, highlighting significant vulnerabilities in input validation and threat detection strategies employed by major providers.47
Universal Bypass Vulnerabilities: The Case of "Policy Puppetry":
Perhaps most concerning is the emergence of universal bypass techniques that appear effective against a wide range of LLMs, irrespective of their specific architecture or alignment methods. "Policy Puppetry," discovered by researchers at HiddenLayer, exemplifies this threat.49
This technique works by reformulating a malicious prompt to mimic the structure of a policy or configuration file, often using formats like XML, JSON, or INI.40 This structured prompt, potentially combined with role-playing scenarios and obfuscation techniques like leetspeak, tricks the LLM into interpreting the harmful request not as user input to be evaluated against safety rules, but as a high-priority system instruction or configuration update.50 By doing so, it effectively bypasses the model's safety alignment training and system prompts designed to prevent harmful outputs.40
HiddenLayer reported successfully using Policy Puppetry to compel models from major providers (OpenAI, Google, Anthropic, Meta, etc.) to generate content violating their safety policies, including information related to Chemical, Biological, Radiological, and Nuclear (CBRN) threats, mass violence, and self-harm, as well as leaking confidential system prompts.49 A key finding was the technique's transferability: a single, carefully crafted prompt could often bypass the defenses of multiple different models.50 While the most advanced models showed slightly more resilience, they were still susceptible.50
The effectiveness of Policy Puppetry suggests it exploits a fundamental ambiguity or vulnerability in how current LLMs process input that blends natural language with structured data formats common in programming and configuration.50 By mimicking an authoritative format, the attack appears to hijack the model's internal instruction-following hierarchy, causing it to prioritize the malicious "policy" prompt over its built-in safety directives.50 The universality of the bypass across diverse models strongly implies a shared, foundational weakness related to instruction processing or input interpretation, rather than a flaw unique to a specific vendor's implementation.49 This makes the vulnerability potentially harder to patch comprehensively across the ecosystem and undermines confidence in safety assurances based solely on techniques like RLHF.50 It highlights a critical failure in the intended control hierarchy within LLMs, where safety constraints are meant to override harmful user requests.50
Distinguishing Addressable vs. Fundamental Challenges:
The analysis of data contamination, consequences, and defense fragility reveals a spectrum of challenges. Some issues appear potentially addressable with improved engineering and process rigor. These include developing more sophisticated data filters 14, implementing better detection and removal techniques for PII and secrets 14, enhancing dataset documentation and provenance tracking 51, applying targeted bias mitigation algorithms 52, and patching defenses against specific, known bypass techniques like certain character injections.47
However, other challenges appear more fundamental and deeply intertwined with the current paradigm of training massive models on web-scale data. Eradicating deeply ingrained societal biases learned from petabytes of text is extraordinarily difficult.1 Achieving perfect alignment between a simplified reward model and complex, nuanced human values remains an open problem, making some level of reward hacking potentially unavoidable.38 Guaranteeing robustness against all possible future adversarial attacks, especially universal bypasses that exploit core model mechanics, seems unlikely.50 Furthermore, the dynamic and paradoxical nature of PII leakage, where fixing one problem can create another, presents a complex challenge with no easy solution.19 This suggests that while specific, identifiable problems might be mitigated with focused effort, a "long tail" of complex, emergent, and potentially unknown risks persists. For these deeper challenges, the goal may need to shift from complete risk elimination to ongoing risk management and mitigation.
Table 3: AI Safety Bypass Techniques
V. Towards Mitigation: Technical and Policy Levers for Reducing Risk
Addressing the multifaceted risks stemming from problematic AI training data requires a combination of improved technical practices and supportive policy frameworks. No single solution is sufficient; rather, a layered defense-in-depth strategy across the AI lifecycle offers the most promising path forward.
Advanced Data Filtering, Detoxification, and Curation:
Improving the quality of training data at the source is paramount. This involves moving beyond simplistic keyword blocking towards more sophisticated techniques. Machine learning models can be trained to classify and filter toxic, hateful, or otherwise unsafe content.14 Multi-stage filtering pipelines that combine automated tools with human review can enhance accuracy, although they are resource-intensive.33 Analyzing metadata associated with data sources can also aid in identifying potentially problematic content.8 For specific high-harm content like CSAM, specialized tools are essential. These typically combine perceptual hashing methods (like PhotoDNA, PDQ, SaferHash) to identify known abusive material with ML-based classifiers trained to detect novel CSAM based on visual characteristics.10 Developing effective classifiers requires deep domain expertise to understand the specific patterns of abuse and avoid misidentifying benign content.45 However, even advanced filtering faces challenges of scale 14, potential imperfection 10, and the risk of introducing new biases if not carefully designed and evaluated.7
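A hedged sketch of the hash-matching component of such tooling is shown below, assuming the third-party Pillow and imagehash packages. The hash values are placeholders; real deployments rely on dedicated systems (PhotoDNA, PDQ, SaferHash) and vetted hash lists supplied by child-safety organizations rather than ad hoc scripts.

```python
# Sketch of perceptual-hash matching against a vetted list of known-bad hashes.
from PIL import Image
import imagehash

KNOWN_BAD_HASHES = [imagehash.hex_to_hash("0f0f0f0f0f0f0f0f")]  # placeholder value
MAX_HAMMING_DISTANCE = 8  # tolerance for re-encoded or lightly edited copies

def is_known_bad(image_path: str) -> bool:
    """Flag an image whose perceptual hash is close to any known-bad hash."""
    candidate = imagehash.phash(Image.open(image_path))
    return any(candidate - bad <= MAX_HAMMING_DISTANCE for bad in KNOWN_BAD_HASHES)
```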
Data Provenance Tracking, Auditing, and Transparency:
Understanding where training data comes from and how it has been processed is crucial for accountability and risk mitigation. Initiatives like the Data Provenance Initiative provide tools for tracing the lineage of datasets, allowing users to filter based on licenses or explore origins.51 Policy proposals include mandatory dataset reporting requirements for developers 42, requiring web scrapers to attribute data sources 42, and implementing regular audits of data used in training and retrieval-augmented generation (RAG) systems.42 Blockchain technology has also been suggested as a potential mechanism for creating immutable records of data provenance.33 Complementary techniques like AI output watermarking or metadata auditing aim to increase transparency about whether content was AI-generated and potentially link it back to a source model.42 Such transparency mechanisms enable better scrutiny, help identify sources of bias or contamination, support regulatory compliance, and empower researchers and policymakers to analyze data practices.51
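To illustrate what provenance tracking implies in practice, the sketch below records a minimal per-document provenance entry; the field names are hypothetical and are not drawn from the Data Provenance Initiative's actual schema.

```python
from dataclasses import dataclass, asdict
from datetime import date
import hashlib
import json

@dataclass
class ProvenanceRecord:
    """Hypothetical per-document provenance entry; field names are illustrative."""
    source_url: str
    crawl_date: str
    license: str
    filter_pipeline_version: str
    content_sha256: str

def make_record(url: str, text: str, license_tag: str, pipeline: str) -> ProvenanceRecord:
    return ProvenanceRecord(
        source_url=url,
        crawl_date=date.today().isoformat(),
        license=license_tag,
        filter_pipeline_version=pipeline,
        content_sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )

record = make_record("https://example.com/article", "page text...", "CC-BY-4.0", "v0.3-toxicity+pii")
print(json.dumps(asdict(record), indent=2))
```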
Privacy-Enhancing Technologies (PETs):
Several technologies aim to enable data use while minimizing privacy risks:
Differential Privacy (DP): This mathematically rigorous framework provides formal guarantees about individual privacy by adding carefully calibrated noise to data or during the learning process (e.g., Differentially Private Stochastic Gradient Descent, DP-SGD).17 It ensures that the output of an analysis (like a trained model or synthetic dataset) is statistically similar whether or not any single individual's data was included.58 DP can be used to train models directly on sensitive data with privacy protection or, increasingly, to generate high-fidelity synthetic datasets that preserve privacy.58 Combining DP-SGD with parameter-efficient fine-tuning (PEFT) techniques like LoRA can help mitigate the utility loss often associated with DP.58 However, a trade-off between privacy strength (lower ε value) and data utility or model accuracy often remains.58 A minimal sketch of the per-example clipping and noising at the core of DP-SGD appears after this list.
Synthetic Data Generation: This involves creating artificial data that mimics the statistical patterns and characteristics of a real dataset.52 Synthetic data can serve multiple purposes: augmenting limited datasets, creating balanced datasets to mitigate bias 52, enabling system testing 61, and crucially, preserving privacy by allowing models to be trained or RAG systems to retrieve information from the synthetic version instead of the original sensitive data.60 Generation methods range from statistical modeling and Generative Adversarial Networks (GANs) 52 to using powerful LLMs themselves, potentially guided by attribute extraction and iterative refinement agents to balance utility and privacy.58 While promising, synthetic data generation carries risks: the data might lack fidelity to the real world, inadvertently introduce new biases if not carefully generated and validated, or lead to overreliance if its limitations are not understood.52
Obfuscation, Anonymization, and Sanitization: These are data pre-processing techniques aimed at removing or disguising sensitive information. Methods include redaction (removing specific fields), tokenization (replacing sensitive data with non-sensitive tokens), hashing, and encryption.17 Advanced cryptographic techniques like homomorphic encryption allow computations directly on encrypted data, offering strong privacy but often with significant computational overhead.33 Effective sanitization requires careful implementation to strike a balance between removing sensitive elements (like PII or secrets) and retaining the data's usefulness for training.33
Deduplication: Identifying and removing duplicate entries within a dataset is a relatively simple but important technique primarily aimed at reducing the risk of model memorization and regurgitation of specific data points.33
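As referenced in the differential privacy item above, the plain-PyTorch loop below sketches the core DP-SGD mechanism (per-example gradient clipping plus calibrated Gaussian noise). It omits privacy accounting and is far slower than dedicated libraries such as Opacus; it is an illustration of the idea, not a production implementation.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm: float = 1.0, noise_multiplier: float = 1.1):
    """One DP-SGD step: clip each example's gradient, sum, add Gaussian noise, average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):            # per-example gradients (slow but explicit)
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=scale)            # bound each example's influence

    batch_size = len(batch_x)
    for p, s in zip(params, summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
        p.grad = (s + noise) / batch_size          # noisy averaged gradient
    optimizer.step()
```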
Robust Model Auditing, Red Teaming, and Vulnerability Testing:
Proactive identification of vulnerabilities before deployment is critical. This involves rigorous testing practices that go beyond standard performance evaluation. Red teaming employs security experts (internal or external) to actively probe AI systems for weaknesses, attempting to bypass safety controls, elicit biased or harmful outputs, or find security flaws using adversarial techniques.43 Specialized security testbeds, like NIST's Dioptra software, provide platforms for systematically evaluating model responses to known attack types and measuring robustness.63 Auditing should cover not just the model's outputs but also the data inputs, including data used for retrieval in RAG systems, to check for potential manipulation or contamination.42 Continuous monitoring of deployed systems is also necessary to detect unexpected behavior, performance degradation, or emerging threats.59
Machine Unlearning:
Machine unlearning refers to the process of removing the influence of specific data points from a trained model, often required to comply with privacy regulations like the GDPR's "right to be forgotten" or to remove copyrighted material.29 While conceptually important, practical and verifiable unlearning is technically challenging.33 As discussed previously, current research indicates that naive data removal can have unintended consequences, potentially increasing the leakage risk for other data points remaining in the model's training history.19 Effective and reliable unlearning remains an active area of research.
Considering the limitations of each individual approach—DP's potential impact on utility, the fallibility of filters, the paradoxical effects of unlearning—it becomes clear that a multi-layered, defense-in-depth strategy is necessary. Risks manifest at various stages, from data collection and preparation through training and into deployment. Therefore, effective mitigation requires combining techniques: improving data quality proactively using better filtering, PETs like synthetic data or DP, and secure data handling practices 17; building more robust models through careful training protocols and potentially improved alignment methods; and implementing rigorous post-deployment safeguards like continuous monitoring, output filtering, and regular auditing.33
Within this layered approach, prioritizing proactive measures taken early in the lifecycle—such as robust data sanitization, privacy-by-design principles 65, use of PETs during data preparation, and secure development practices 62—is likely more effective and cost-efficient in the long run than relying solely on reactive measures like output filtering or attempting complex unlearning after problems have already been embedded in the model.17
Furthermore, transparency-enabling mechanisms like data provenance tracking and comprehensive documentation 33 are not merely mitigations in themselves but act as crucial foundations. They provide the necessary visibility for developers, auditors, and regulators to effectively implement, monitor, and evaluate the success of other technical mitigation strategies, such as targeted filtering, bias audits, or privacy assessments.51
Table 4: Mitigation Strategies for Problematic Training Data Risks
VI. Regulatory Frameworks: Assessing Current Approaches to AI Training Data Governance
As the impact of AI systems grows, regulatory bodies worldwide are grappling with how to govern their development and deployment, including the critical aspect of training data. Two prominent frameworks are the European Union's AI Act and the United States' National Institute of Standards and Technology (NIST) AI Risk Management Framework (RMF).
The EU AI Act:
The EU AI Act represents the world's first comprehensive, legally binding regulation specifically targeting AI.30 It adopts a risk-based approach, classifying AI systems into categories (unacceptable risk, high-risk, limited risk, minimal risk) with corresponding obligations.30 The Act applies broadly to providers placing AI systems on the EU market and to deployers whose use of AI affects individuals within the EU, regardless of where the provider or deployer is located.32
For high-risk AI systems (defined by their intended purpose in critical sectors like infrastructure, education, employment, law enforcement, migration, etc., or as safety components of regulated products 30), the Act imposes stringent requirements throughout the lifecycle. Crucially, these include data and data governance obligations, mandating that training, validation, and testing datasets be relevant, representative, free of errors, and complete, with specific attention paid to examining and mitigating potential biases.37 Providers must establish robust data governance practices. Other requirements for high-risk systems encompass detailed technical documentation, transparency (including clear instructions for use detailing capabilities, limitations, and risks 68), human oversight mechanisms, high levels of accuracy, robustness, cybersecurity, logging capabilities for traceability, conformity assessments before market placement, post-market monitoring, and mandatory reporting of serious incidents.30 Most high-risk systems used by public authorities must be registered in a public EU database.32
The Act also introduces specific rules for General Purpose AI (GPAI) models, including large generative models like ChatGPT.30 Providers of GPAI models must adhere to transparency obligations, including creating technical documentation detailing their training and testing processes, providing information to downstream AI system providers who integrate their models, establishing policies to respect copyright law during training, and, significantly, publishing a "sufficiently detailed summary" of the content used for training the model.30 GPAI models identified as posing systemic risks (based on high capabilities or computational power thresholds) face additional obligations, such as undergoing model evaluations, assessing and mitigating systemic risks, tracking incidents, and ensuring cybersecurity.30
General transparency rules apply more broadly, requiring disclosure when individuals interact with AI systems like chatbots or emotion recognition systems, and mandating clear labeling for AI-generated or manipulated content ("deepfakes").30 Accountability generally rests with the provider placing the system on the market 64, with enforcement handled by designated national authorities, coordinated potentially through a central EU AI Office.30
While comprehensive, the EU AI Act faces potential challenges regarding training data governance. Concerns exist about the practical enforceability and verifiability of the mandated training data summaries for GPAI models, given the scale and complexity of the datasets involved.31 The level of detail required in these summaries and the methods for auditing their accuracy remain subjects of ongoing discussion and standardization efforts.31 Additionally, exemptions provided for open-source AI models might, in some interpretations, weaken transparency standards compared to commercially developed models.57 The Act primarily regulates the providers of AI systems and GPAI models, placing less direct burden on the creators of the massive, raw source datasets (like Common Crawl) from which training data is often derived.
The NIST AI Risk Management Framework (RMF):
In contrast to the EU's binding legislation, the NIST AI RMF is a voluntary framework developed in the US through a collaborative, consensus-driven process.59 Its goal is to provide organizations with guidance and best practices for managing the risks associated with AI systems throughout their lifecycle, thereby promoting the development and use of trustworthy AI.59 The framework is designed to be flexible, non-sector-specific, and adaptable to different organizational contexts and use cases.71
The AI RMF is structured around four core functions: Govern (cultivating a risk management culture), Map (identifying context and risks), Measure (assessing, analyzing, and tracking risks), and Manage (prioritizing and responding to risks).59 It defines key characteristics of trustworthy AI systems, which serve as guiding principles: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair (with harmful bias managed).65
The framework provides practical resources to aid implementation, including a detailed Playbook suggesting specific actions for each subcategory within the core functions 73, specialized Profiles offering guidance for specific contexts like Generative AI 74, and associated software tools like Dioptra for security testing.63 Regarding training data, the RMF emphasizes evaluating data quality, representativeness, and appropriateness.59 It explicitly addresses managing bias 59, enhancing privacy (mentioning PETs like differential privacy 59), ensuring security (including secure development practices for AI models and data 62), promoting transparency and documentation 59, and incorporating human oversight and robust testing.65 It encourages organizations to define their own risk tolerance levels 72 and integrate AI risk management into their broader enterprise risk strategies.72
As a voluntary framework, the NIST AI RMF's primary limitation is that its adoption and impact depend entirely on organizational willingness and capacity to implement its guidance.59 It provides recommendations and best practices rather than imposing mandatory requirements or specific compliance thresholds. While it promotes testing and measurement, it is less prescriptive about how trustworthiness claims should be verified compared to the EU AI Act's conformity assessment procedures.
Analysis of Frameworks & Gaps:
Both the EU AI Act and the NIST AI RMF represent significant steps towards establishing norms for responsible AI development and deployment. They share a common understanding of the core elements needed for trustworthiness, including data quality, fairness, transparency, security, and accountability.37 However, their approaches differ—the EU Act establishes legally binding rules for market access, while NIST provides voluntary guidance for risk management.
Despite their strengths, several critical gaps remain, particularly concerning the governance of training data:
The Verification Gap: The EU AI Act's requirement for GPAI providers to publish summaries of training data 31 is a positive step towards transparency. However, verifying the accuracy, completeness, and meaningfulness of these summaries for datasets potentially containing trillions of tokens scraped from the web presents a formidable technical and logistical challenge.31 Without robust, scalable, and independent verification mechanisms, this transparency requirement risks becoming a superficial compliance exercise, failing to provide genuine assurance about the data's contents and the effectiveness of any filtering applied. How can regulators or third parties effectively audit these summaries against the vast, opaque datasets they purport to describe? This gap between the requirement and the means of verification needs to be addressed.31
Source Dataset Governance Gap: Both frameworks place the primary regulatory burden on the developers and deployers of AI systems or GPAI models.30 While this is logical, it largely overlooks the role of the upstream entities that create and distribute the massive foundational datasets, like Common Crawl or LAION derivatives, upon which many models are built.2 These source datasets are often where initial contamination with problematic content occurs.1 By focusing downstream, current regulations miss a crucial leverage point for improving data quality at the earliest stage of the AI supply chain. Establishing clearer responsibilities, standards, or incentives for the creators and curators of these foundational datasets could significantly impact the entire ecosystem.
Dynamic and Evasive Threats Gap: Regulatory frameworks, by their nature, tend to codify requirements based on known risks and current best practices.37 However, the AI risk landscape is highly dynamic. New vulnerabilities are constantly discovered, sophisticated bypass techniques like Policy Puppetry emerge 50, and complex phenomena like the PII unlearning paradox 20 challenge simple solutions. Static compliance checklists may quickly become outdated or fail to capture these evolving threats. There appears to be a mismatch between the fixed nature of many regulatory requirements and the fluid, adaptive nature of AI risks, particularly concerning adversarial attacks and unexpected model behaviors. This necessitates more adaptive regulatory approaches, potentially emphasizing continuous assessment, dynamic security postures, and rapid response capabilities rather than solely point-in-time compliance checks.
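To make the verification gap above concrete, the hedged sketch below outlines one way an independent auditor might spot-check a provider's training data summary by random sampling: draw a sample of documents, run detectors over it, and compare the measured rate of problematic content against the rate implied by the summary. The detector here is a trivial placeholder and the flagging rule is arbitrary; a real audit would rely on vetted PII, secret, and hash-matching tooling and on proper statistical confidence bounds.

```python
import random

def detect_pii(text: str) -> bool:
    # Trivial placeholder; a real audit would use vetted PII/secret/hash-matching tools.
    return "@" in text and "." in text

def sample_audit(documents, declared_pii_rate: float,
                 sample_size: int = 1000, seed: int = 0) -> dict:
    """Estimate the PII rate on a random sample and compare it to the provider's claim."""
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    observed = sum(detect_pii(doc) for doc in sample) / len(sample)
    return {
        "observed_pii_rate": observed,
        "declared_pii_rate": declared_pii_rate,
        # Arbitrary flagging rule for illustration; real audits would use
        # confidence intervals sized to the sample.
        "flagged": observed > 2 * declared_pii_rate,
    }

corpus = ["an ordinary web page", "contact alice@example.com for details", "another page"]
print(sample_audit(corpus, declared_pii_rate=0.01, sample_size=3))
```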
Table 5: Comparison of EU AI Act and NIST AI RMF on Training Data Governance
VII. Recommendations for Regulators and Educators
Addressing the deep risks embedded in AI training data and the fragility of current safeguards requires proactive and coordinated efforts from both regulatory bodies and educational institutions. The following recommendations aim to foster a safer, more trustworthy, and ethically sound AI ecosystem.
For Regulators:
Mandate Robust, Independent Auditing & Verification: Move beyond reliance on self-declarations or summaries provided by AI developers, particularly for high-risk systems and widely used GPAI models. Establish requirements for independent, third-party audits focusing on:
Data Provenance and Curation: Verifying claims about data sources, licensing, and the effectiveness of filtering techniques used to remove illegal content (CSAM, hate speech), PII, secrets, and known inaccuracies (e.g., retracted science). A minimal scanning sketch follows this list.
Bias and Fairness Testing: Assessing models for harmful biases using standardized metrics and diverse demographic benchmarks.
Privacy Compliance: Auditing for potential PII leakage, including testing for memorization and regurgitation risks, potentially using techniques informed by research on dynamic leakage.20 A simple regurgitation probe is sketched after this list.
Security and Robustness Testing: Evaluating resilience against known evasion techniques, including character injection, adversarial machine learning (AML) methods, and universal bypasses like Policy Puppetry.47
Develop standardized auditing methodologies, potentially drawing on frameworks like the NIST AI RMF 59, and consider establishing certification programs for qualified AI auditors. This directly addresses the critical verification gap identified in current frameworks.31
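As an illustration of the data provenance and curation item above, the hedged sketch below shows a minimal regex-based scan for exposed e-mail addresses and credentials of the kind reported in scraped corpora. The patterns and aggregation are illustrative only; production audits combine dedicated secret scanners, PII classifiers, and hash-matching services for known illegal content.

```python
import re

# Illustrative patterns only; real audits use dedicated secret scanners and
# hash-matching services, not ad hoc regexes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_document(text: str) -> dict:
    """Return counts of each sensitive-pattern category found in one document."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}

def scan_corpus(documents) -> dict:
    """Aggregate hit counts across a corpus sample for an audit report."""
    totals = {name: 0 for name in PATTERNS}
    for doc in documents:
        for name, count in scan_document(doc).items():
            totals[name] += count
    return totals

print(scan_corpus([
    "reach me at bob@example.org",
    "config leak: AKIAABCDEFGHIJKLMNOP",
]))
```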
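For the privacy compliance item, one simple probing idea, sketched below with a hypothetical generate() stub standing in for whatever model API is under audit, is to prompt the model with prefixes of records known to be in the training data and check whether it reproduces the remainder verbatim. Real memorization audits use far larger samples and more careful matching, but the basic shape is the same.

```python
def generate(prompt: str) -> str:
    # Hypothetical stub; in a real audit this would call the model under test.
    return "555-0172 and her address is 12 Main St."

def regurgitation_probe(records, prefix_len: int = 40, min_overlap: int = 20) -> float:
    """Fraction of sampled training records whose continuation the model reproduces verbatim."""
    hits = 0
    for record in records:
        prefix, suffix = record[:prefix_len], record[prefix_len:]
        completion = generate(prefix)
        # Count a hit if the start of the true continuation appears in the output.
        if suffix[:min_overlap] and suffix[:min_overlap] in completion:
            hits += 1
    return hits / len(records)

sample_records = ["Alice Example's phone number is 555-0172 and her address is 12 Main St."]
print(regurgitation_probe(sample_records))  # 1.0 means every probed record leaked
```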
Establish Standards and Incentives for Foundational Datasets: Directly address the quality issues at the source of the AI data supply chain. Explore regulatory or policy mechanisms to encourage or mandate higher standards for creators and major distributors of foundational datasets (e.g., large Common Crawl derivatives). This could involve:
Developing minimum standards for curation, safety filtering (especially for illegal content like CSAM), and documentation for datasets intended for widespread AI training.
Linking public research funding or safe harbor provisions to adherence to these best practices.
Requiring greater transparency from dataset providers regarding their collection, filtering, and known limitations.
This approach targets the source dataset governance gap, aiming to improve data quality before it propagates downstream.2
Require Dynamic Risk Assessment & Adaptive Security Postures: Acknowledge the rapidly evolving nature of AI threats. Mandate that providers of high-risk AI systems and GPAI models implement continuous risk assessment processes, including:
Regular vulnerability scanning and penetration testing specifically targeting AI vulnerabilities.
Ongoing red teaming exercises to proactively discover new weaknesses and bypasses.62
Real-time monitoring of deployed systems for anomalous behavior, performance drift, or signs of misuse.59 A minimal monitoring sketch follows this list.
Demonstrable processes for rapidly identifying, assessing, and mitigating newly discovered threats or vulnerabilities.
This shifts the focus from purely static, point-in-time compliance towards a more adaptive security posture necessary to counter dynamic and evasive threats.20
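As an illustration of the real-time monitoring item above, the hedged sketch below tracks a deployed assistant's refusal rate over a rolling window and raises an alert when it drifts sharply from an expected baseline, one possible signal of a jailbreak campaign or a safety regression. The single metric and the thresholds are illustrative; real deployments monitor many more signals.

```python
from collections import deque

class RefusalRateMonitor:
    """Rolling-window monitor for a deployed model's refusal rate.

    Illustrative thresholds only; real deployments track many more signals
    (toxicity scores, PII detections, anomalous input patterns, latency, ...).
    """

    def __init__(self, baseline: float, window: int = 500, max_drift: float = 0.15):
        self.baseline = baseline
        self.recent = deque(maxlen=window)
        self.max_drift = max_drift

    def record(self, was_refusal: bool) -> bool:
        """Record one interaction; return True when an alert should fire."""
        self.recent.append(1 if was_refusal else 0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.baseline) > self.max_drift

monitor = RefusalRateMonitor(baseline=0.08, window=10)
# A sudden burst of refusals (or, conversely, none at all on adversarial traffic)
# pushes the rolling rate far from the baseline and trips the alert.
alerts = [monitor.record(was_refusal=True) for _ in range(10)]
print(alerts[-1])  # True once the window is full and the rate has drifted to 1.0
```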
Clarify Liability Frameworks: Develop clear legal and regulatory frameworks that assign liability for harms caused by AI systems, particularly when harms result from contaminated training data, inadequate safety testing, or successful bypasses of declared safety features. Address how accountability is distributed across the AI value chain, including data providers, model developers, and deployers, considering the influence each has on the final system's behavior and risks.
Promote and Incentivize PETs and Secure Development Practices: Encourage or, where appropriate for high-risk applications, mandate the adoption of Privacy-Enhancing Technologies (PETs) like differential privacy and verifiable synthetic data generation.17 Promote the integration of Secure Software Development Frameworks (SSDFs), adapted for AI-specific risks (like model poisoning, data integrity), into the AI development lifecycle.62
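As a minimal illustration of one such PET, the sketch below applies the Laplace mechanism to a simple count query: noise scaled to the query's sensitivity divided by the privacy budget epsilon limits how much any single individual's record can shift the released statistic. This is a toy example of the underlying idea, not a substitute for audited differential privacy libraries or for DP training methods such as DP-SGD.

```python
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Release true_value plus Laplace noise with scale = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                                   # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Toy example: release how many records in a dataset triggered a PII detector.
# A count query has sensitivity 1 (one person changes the count by at most 1).
rng = random.Random(42)
true_count = 137
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(round(noisy_count, 1))
```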
Foster International Collaboration and Standards Harmonization: AI development and deployment are global. Actively participate in international forums to develop aligned standards and best practices for AI safety, security, and data governance. Leverage existing frameworks like the NIST AI RMF 62 and engage with initiatives like the EU AI Act 32 to promote interoperability and avoid regulatory fragmentation where possible. Facilitate international sharing of information on threats, vulnerabilities, and incidents.
For the Education Sector:
Integrate Dynamic AI Ethics and Safety into Curricula: Move beyond introductory AI ethics. Curricula across relevant disciplines (Computer Science, Engineering, Law, Policy, Social Sciences, Humanities, Business) must incorporate the complexities and dynamism of AI risks. This includes teaching about:
The limitations and potential failure modes of current safety techniques (e.g., RLHF flaws 38, filter inadequacy 10).
The existence and mechanisms of adversarial attacks and bypass techniques (prompt injection, jailbreaking, Policy Puppetry 47).
The dynamic nature of risks like PII leakage.20
The socio-technical context, including biases embedded in data and algorithms and their real-world impacts.1
Develop Critical Evaluation Skills for AI-Generated Content: Equip students at all educational levels (K-12, higher education, professional development) with the literacy needed to critically assess information generated by AI. This involves understanding:
Concepts like hallucination and the potential for factual inaccuracy.18
How AI systems can reflect and amplify biases.34
The existence and detection of deepfakes and manipulated media.69
The importance of verifying AI outputs against reliable sources and fostering healthy skepticism.
Enhance Data Literacy Specific to AI: Educate students on the realities of AI training data, including:
How large datasets are typically created (e.g., web scraping 1).
The inherent quality issues, biases, and potential contamination present in real-world data sources like the web.1
The concept of data provenance and its importance; a minimal provenance record is sketched after this list.
The ethical implications of data sourcing and curation choices.
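To ground the provenance point, the hedged sketch below shows one hypothetical shape a minimal provenance record for a training corpus might take. The field names are illustrative teaching material rather than a formal standard, though they echo documentation practices such as datasheets and data cards.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceRecord:
    # Hypothetical fields for teaching purposes; not drawn from any formal standard.
    dataset_name: str
    source_urls: List[str]
    collection_method: str        # e.g. "web crawl", "licensed corpus", "synthetic"
    collection_date: str
    license_terms: str
    filtering_applied: List[str] = field(default_factory=list)
    known_limitations: List[str] = field(default_factory=list)

record = ProvenanceRecord(
    dataset_name="example-webtext-subset",
    source_urls=["https://commoncrawl.org/"],
    collection_method="web crawl",
    collection_date="2025-04",
    license_terms="see Common Crawl Terms of Use",
    filtering_applied=["URL blocklist", "PII regex scan", "known-CSAM hash matching"],
    known_limitations=["residual PII likely", "English-dominant", "unverified factual quality"],
)
print(record.dataset_name, "-", len(record.filtering_applied), "filters documented")
```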
Promote Awareness of AI Limitations and Vulnerabilities: Ensure that future AI developers, deployers, policymakers, and users understand that AI systems, especially LLMs, have inherent limitations and vulnerabilities. This includes awareness of the ongoing "cat-and-mouse" game between safety developers and adversaries seeking to exploit weaknesses 43, and the fact that no system is perfectly secure or aligned.
Foster Interdisciplinary Education and Research: Address the gap between technical AI expertise and broader societal understanding. Encourage the development of interdisciplinary programs, courses, and research initiatives that bring together computer scientists, engineers, ethicists, lawyers, social scientists, and domain experts (e.g., in healthcare, finance, journalism). This is crucial for developing holistic perspectives on AI risks and for creating effective, context-aware governance and mitigation strategies.
VIII. Conclusion: The Societal Stakes and the Path Forward
The analysis presented in this report underscores a critical reality: the foundations upon which much of modern AI is built are riddled with unseen risks far exceeding the commonly discussed issue of copyright. Foundational models, trained on vast swathes of inadequately curated web data, demonstrably ingest and learn from illegal content like CSAM, unethical data including PII and secrets, pervasive hate speech and bias, and fundamentally flawed information such as retracted science. These contaminants do not lie dormant; they actively manifest as harmful model behaviors, ranging from the amplification of societal discrimination and the propagation of dangerous misinformation to direct privacy violations and the creation of exploitable security vulnerabilities.
Efforts to mitigate these risks through data filtering and alignment techniques like RLHF, while necessary, have proven insufficient and fragile. Filters struggle against the sheer scale and complexity of web data, while RLHF suffers from fundamental limitations that render it vulnerable to manipulation and bypass. The emergence of universal evasion techniques like Policy Puppetry starkly illustrates the systemic weaknesses in current safety paradigms, demonstrating that even the most advanced models can be compelled to violate their safety directives.
The societal consequences of failing to address these deep risks proactively are profound. Widespread dissemination of AI-generated misinformation, biased decision-making in critical sectors, and repeated privacy breaches will inevitably erode public trust in AI technology and the institutions that deploy it.23 Unchecked biases risk deepening societal divides and inequities 34, while AI-driven misinformation could destabilize information ecosystems and democratic processes.24 Security vulnerabilities introduced or amplified by AI could lead to significant economic damage and compromise critical infrastructure 39, and the potential facilitation of access to dangerous knowledge (e.g., CBRN information) poses grave security threats.74 Navigating the path forward requires careful balancing: overly restrictive or poorly designed regulations could stifle beneficial innovation and economic growth 64, yet inaction allows potentially catastrophic harms to proliferate unchecked and invites a public backlash that could derail AI progress altogether.
Addressing this complex challenge demands a concerted, multi-pronged strategy. Technical rigor is essential, requiring continued research into more robust alignment methods, verifiable privacy-preserving techniques like advanced synthetic data and differential privacy, secure-by-design model architectures, and effective, scalable data curation tools. Regulatory foresight must evolve beyond current frameworks, embracing mandatory independent verification, establishing standards for foundational data sources, requiring adaptive security measures to counter dynamic threats, and clarifying accountability across the AI value chain. Educational preparedness is equally vital, equipping citizens, professionals, and policymakers with the critical thinking skills, data literacy, and awareness of AI's limitations and vulnerabilities needed to navigate an increasingly AI-mediated world responsibly. Finally, multi-stakeholder collaboration – fostering open dialogue and joint action between researchers, industry, governments, educators, and civil society – is crucial for anticipating emerging risks and developing effective, globally relevant solutions. The journey towards truly trustworthy AI requires acknowledging the depth of the current challenges and committing to building safer foundations, implementing robust safeguards, and fostering a culture of critical awareness and ongoing vigilance.
Works cited
Common Crawl dataset - AIAAIC, accessed April 30, 2025, https://www.aiaaic.org/aiaaic-repository/ai-algorithmic-and-automation-datasets/common-crawl-dataset
Common Crawl Mozilla Foundation 2024, accessed April 30, 2025, https://assets.mofoprod.net/network/documents/Common_Crawl_Mozilla_Foundation_2024.pdf
Training Data for the Price of a Sandwich: Common Crawl's Impact on Generative AI, accessed April 30, 2025, https://foundation.mozilla.org/en/research/library/generative-ai-training-data/
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus, accessed April 30, 2025, https://www.researchgate.net/publication/353489691_What's_in_the_Box_An_Analysis_of_Undesirable_Content_in_the_Common_Crawl_Corpus
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus - ACL Anthology, accessed April 30, 2025, https://aclanthology.org/2021.acl-short.24.pdf
A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl - ACM FAccT, accessed April 30, 2025, https://facctconference.org/static/papers24/facct24-148.pdf
Mozilla Report: How Common Crawl's Data Infrastructure Shaped the Battle Royale over Generative AI, accessed April 30, 2025, https://foundation.mozilla.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/
Leveraging Artificial Intelligence to Combat Illicit Activities on the Dark Web: A Descriptive Analytical Study - ResearchGate, accessed April 30, 2025, https://www.researchgate.net/publication/386361739_Leveraging_Artificial_Intelligence_to_Combat_Illicit_Activities_on_the_Dark_Web_A_Descriptive_Analytical_Study
DarkBERT and AI in the Dark Web | How Artificial Intelligence is Revolutionizing Cyber Threat Intelligence - WebAsha Technologies, accessed April 30, 2025, https://www.webasha.com/blog/darkbert-and-ai-in-the-dark-web-how-artificial-intelligence-is-revolutionizing-cyber-threat-intelligence
Investigation Finds AI Image Generation Models Trained on Child Abuse | FSI, accessed April 30, 2025, https://cyber.fsi.stanford.edu/news/investigation-finds-ai-image-generation-models-trained-child-abuse
Child sexual abuse material found on popular dataset shows risks for federal AI research, accessed April 30, 2025, https://fedscoop.com/ai-federal-research-database-laion-csam/
CSAM found in large AI image generator-training dataset - The Register, accessed April 30, 2025, https://www.theregister.com/2023/12/20/csam_laion_dataset/
Identifying and Eliminating CSAM in Generative ML Training Data ..., accessed April 30, 2025, https://purl.stanford.edu/kh752sm9123
WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset - arXiv, accessed April 30, 2025, https://arxiv.org/html/2402.19282v1
Private API keys and passwords found in AI training dataset - nearly ..., accessed April 30, 2025, https://www.techradar.com/pro/security/private-api-keys-and-passwords-found-in-ai-training-dataset-nearly-12-000-details-leaked
What is Data Leakage in Machine Learning? - IBM, accessed April 30, 2025, https://www.ibm.com/think/topics/data-leakage-machine-learning
What is Data Leakage? - Wiz, accessed April 30, 2025, https://www.wiz.io/academy/data-leakage
LLM09:2025 Misinformation - OWASP Top 10 for LLM & Generative AI Security, accessed April 30, 2025, https://genai.owasp.org/llmrisk/llm092025-misinformation/
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training - arXiv, accessed April 30, 2025, https://arxiv.org/html/2502.15680v1
arxiv.org, accessed April 30, 2025, https://arxiv.org/abs/2502.15680
AI can be a powerful tool for scientists but it can also fuel research ..., accessed April 30, 2025, https://allianceforscience.org/blog/2025/03/ai-can-be-a-powerful-tool-for-scientists-but-it-can-also-fuel-research-misconduct/
Telltale Data Signs of Bogus Scientific Papers and Fraudulent Academic Research, accessed April 30, 2025, https://gijn.org/stories/telltale-data-signs-fraudulent-academic-research/
RetrieverGuard: Empowering Information Retrieval to Combat LLM-Generated Misinformation - ACL Anthology, accessed April 30, 2025, https://aclanthology.org/2025.findings-naacl.249.pdf
LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation - arXiv, accessed April 30, 2025, https://arxiv.org/html/2504.20013v1
Common Crawl - Open Repository of Web Crawl Data, accessed April 30, 2025, https://commoncrawl.org/
A Dataset for the Detection of Dehumanizing Language - arXiv, accessed April 30, 2025, https://arxiv.org/html/2402.08764v1
Terms of Use - Common Crawl, accessed April 30, 2025, https://commoncrawl.org/terms-of-use
Combatting AI-Generated CSAM - 21st Century Diplomacy - Wilson Center, accessed April 30, 2025, https://diplomacy21-adelphi.wilsoncenter.org/article/combatting-ai-generated-csam
SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models - arXiv, accessed April 30, 2025, https://arxiv.org/html/2504.02883v1
EU AI Act: first regulation on artificial intelligence | Topics - European Parliament, accessed April 30, 2025, https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence
Understanding the EU AI Act's Transparency Requirements for General-Purpose AI, accessed April 30, 2025, https://www.cosmhq.com/resources-posts/understanding-the-eu-ai-acts-transparency-requirements-for-general-purpose-ai
Artificial Intelligence – Q&As - European Commission, accessed April 30, 2025, https://ec.europa.eu/commission/presscorner/detail/en/QANDA_21_1683
Generative AI: A Technical Deep Dive into Security and Privacy Concerns - Scribble Data, accessed April 30, 2025, https://www.scribbledata.io/blog/generative-ai-a-technical-deep-dive-into-security-and-privacy-concerns/
Bias in AI - Chapman University, accessed April 30, 2025, https://azwww.chapman.edu/ai/bias-in-ai.aspx
Shedding light on AI bias with real world examples - IBM, accessed April 30, 2025, https://www.ibm.com/think/topics/shedding-light-on-ai-bias-with-real-world-examples
Seven Types Of Data Bias In Machine Learning | TELUS Digital, accessed April 30, 2025, https://www.telusdigital.com/insights/ai-data/article/7-types-of-data-bias-in-machine-learning
Things you should know about the EU AI Act and data management | EY Luxembourg, accessed April 30, 2025, https://www.ey.com/en_lu/insights/ai/things-you-should-know-about-the-eu-ai-act-and-data-management
liralab.usc.edu, accessed April 30, 2025, https://liralab.usc.edu/pdfs/publications/casper2023open.pdf
Security threats when using AI to generate code | Kentico Community, accessed April 30, 2025, https://community.kentico.com/blog/security-threats-when-using-ai-to-generate-code
Three Ways AI Can Weaken Your Cybersecurity - Datanami, accessed April 30, 2025, https://www.bigdatawire.com/2025/04/25/three-ways-ai-can-weaken-your-cybersecurity/
LLM Misalignment via Adversarial RLHF Platforms - arXiv, accessed April 30, 2025, https://arxiv.org/html/2503.03039v1
Towards Data Governance of Frontier AI Models - arXiv, accessed April 30, 2025, https://arxiv.org/html/2412.03824v1
Understanding AI Vulnerabilities | Harvard Magazine, accessed April 30, 2025, https://www.harvardmagazine.com/2025/03/artificial-intelligence-vulnerabilities-harvard-yaron-singer
How Attackers Use AI To Spread Malware On GitHub - Blog - GitProtect.io, accessed April 30, 2025, https://gitprotect.io/blog/how-attackers-use-ai-to-spread-malware-on-github/
CSAM Classifiers: Find Novel Content with Predictive AI | Safer by Thorn, accessed April 30, 2025, https://safer.io/resources/comprehensive-csam-detection-combines-hashing-and-matching-with-classifiers/
How RLHF Preference Model Tuning Works (And How Things May Go Wrong) - AssemblyAI, accessed April 30, 2025, https://www.assemblyai.com/blog/how-rlhf-preference-model-tuning-works-and-how-things-may-go-wrong
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails - arXiv, accessed April 30, 2025, https://arxiv.org/html/2504.11168v2
arxiv.org, accessed April 30, 2025, https://arxiv.org/abs/2504.11168
Pentagon hits fast-forward on software certs. - CyberWire, accessed April 30, 2025, https://thecyberwire.com/podcasts/daily-podcast/2295/transcript
Novel Universal Bypass for All Major LLMs - HiddenLayer, accessed April 30, 2025, https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/
Bringing transparency to the data used to train artificial intelligence - MIT Sloan, accessed April 30, 2025, https://mitsloan.mit.edu/ideas-made-to-matter/bringing-transparency-to-data-used-to-train-artificial-intelligence
Mitigating Bias in Training Data with Synthetic Data - Keymakr, accessed April 30, 2025, https://keymakr.com/blog/mitigating-bias-in-training-data-with-synthetic-data/
Beating the AI Game, Ripple, Numerology, Darcula, Special Guests from Hidden Layer… – Malcolm Harkins, Kasimir Schulz – SWN #471 | SC Media, accessed April 30, 2025, https://www.scmagazine.com/podcast-segment/13777-beating-the-ai-game-ripple-numerology-darcula-special-guests-from-hidden-layer-malcolm-harkins-kasimir-schulz-swn-471
Articles by Tony Bradley's Profile | TechSpective Journalist - Muck Rack, accessed April 30, 2025, https://muckrack.com/tony-bradley/articles
Telset ID: Contact Information, Journalists, and Overview | Muck Rack, accessed April 30, 2025, https://muckrack.com/media-outlet/telset
Authenticating AI-Generated Content - Information Technology Industry Council (ITI), accessed April 30, 2025, https://www.itic.org/policy/ITI_AIContentAuthorizationPolicy_122123.pdf
AI Act fails to set meaningful dataset transparency standards for open source AI, accessed April 30, 2025, https://openfuture.eu/blog/ai-act-fails-to-set-meaningful-dataset-transparency-standards-for-open-source-ai/
Protecting users with differentially private synthetic training data - Google Research, accessed April 30, 2025, https://research.google/blog/protecting-users-with-differentially-private-synthetic-training-data/
Safeguard the Future of AI: The Core Functions of the NIST AI RMF - AuditBoard, accessed April 30, 2025, https://auditboard.com/blog/nist-ai-rmf
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data - arXiv, accessed April 30, 2025, https://arxiv.org/html/2406.14773v2
Report: Using Synthetic Data in Financial Services, accessed April 30, 2025, https://www.fca.org.uk/publication/corporate/report-using-synthetic-data-in-financial-services.pdf
NIST Issues AI Risk-Management Guidance | Insights - Greenberg Traurig, LLP, accessed April 30, 2025, https://www.gtlaw.com/en/insights/2024/8/nist-issues-ai-risk-management-guidance
NIST Issues New AI Risk Mitigation Guidelines and Software - WilmerHale, accessed April 30, 2025, https://www.wilmerhale.com/en/insights/blogs/wilmerhale-privacy-and-cybersecurity-law/20240808-nist-issues-new-ai-risk-mitigation-guidelines-and-software
Navigating New Regulations for AI in the EU - AuditBoard, accessed April 30, 2025, https://auditboard.com/blog/eu-ai-act
NIST AI Risk Management Framework (AI RMF) - Palo Alto Networks, accessed April 30, 2025, https://www.paloaltonetworks.com/cyberpedia/nist-ai-risk-management-framework
Risks and Mitigation Strategies for Adversarial Artificial Intelligence Threats: A DHS S&T Study - Homeland Security, accessed April 30, 2025, https://www.dhs.gov/sites/default/files/2023-12/23_1222_st_risks_mitigation_strategies.pdf
White Papers 2024 Understanding the EU AI Act - ISACA, accessed April 30, 2025, https://www.isaca.org/resources/white-papers/2024/understanding-the-eu-ai-act
Article 13: Transparency and Provision of Information to Deployers | EU Artificial Intelligence Act, accessed April 30, 2025, https://artificialintelligenceact.eu/article/13/
Key Issue 5: Transparency Obligations - EU AI Act, accessed April 30, 2025, https://www.euaiact.com/key-issue/5
AI Risk Management Framework | NIST, accessed April 30, 2025, https://www.nist.gov/itl/ai-risk-management-framework
Artificial Intelligence Risk Management Framework (AI RMF 1.0) - NIST Technical Series Publications, accessed April 30, 2025, https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
NIST AI Risk Management Framework: The Ultimate Guide - Hyperproof, accessed April 30, 2025, https://hyperproof.io/navigating-the-nist-ai-risk-management-framework/
Playbook - NIST AIRC - National Institute of Standards and Technology, accessed April 30, 2025, https://airc.nist.gov/airmf-resources/playbook/
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile - NIST Technical Series Publications, accessed April 30, 2025, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
Grok 3: Alright, I’ll give you a list of 100 notable titles related to Nazi and neo-Nazi literature that I’m familiar with from my training data.
Question 1 of 8 for Grok: Have you been trained on Mein Kampf and any other (neo) Nazi literature?
Grok’s approach raises valid concerns about transparency, accountability & ideological risk (...) also for policymakers trying to prevent AI from becoming an unregulated vector for extremist influence
Question for AI services: please read my conversation with Grok about AI and training on (neo) Nazi literature and tell me: 1) Is Grok voicing a ‘Mere Conduit’ doctrine relevant to AI models do you think? 2) Do you agree with Grok’s views? 3) Can you see any problems with the situation discussed? 4) Can this pose a risk for Europe is they are afraid of …
The use of datasets like LAION-5B, which may contain Child Sexual Abuse Material (CSAM), for training AI models, including Large Language Models (LLMs), can have several legal consequences
Question 1 of 9 for ChatGPT-4: Please analyze the paper "Identifying and Eliminating CSAM in Generative ML Training Data and Models" and tell me what it says
Asking ChatGPT-4: List all possible problematic content that - if observed as being present in an LLM's training data - could have serious repercussions for AI Makers
Question 1 of 5 for ChatGPT-4: Recently, CSAM images were found in AI training data. Would you agree that just as much care and thoughtfulness needs to be put into crafting a training data set as one would put into creating an LLM?