Activation Oracles (AOs): large language models that are trained to “read” another model’s internal activations (the huge arrays of numbers produced inside the network while it thinks)
A model can refuse to disclose something in its outputs, yet still carry that information in an internal form that another model can translate into words. Controls must not only cover outputs...
The “Mind-Reading” Model Audit: How Activation Oracles Turn Hidden LLM State Into Plain English
The paper ACTIVATION ORACLES: TRAINING AND EVALUATING LLMS AS GENERAL-PURPOSE ACTIVATION EXPLAINERS introduces Activation Oracles (AOs): large language models that are trained to “read” another model’s internal activations (the huge arrays of numbers produced inside the network while it thinks) and answer natural-language questions about what those activations contain or imply.
A useful way to picture this is: instead of interrogating a model only through its outputs (“What’s the answer?”), you also feed another model a snapshot of the first model’s thought process-in-progress (its activation vectors) and ask questions like:
“What is this model trying to do?”
“What hidden instruction is it following?”
“What information does it ‘know’ but refuse to say?”
“What changed after fine-tuning?”
The authors’ core claim is not merely that this is possible, but that you can train this capability in a general-purpose way—so the oracle keeps working even on tasks that look very different from its training distribution.
What the paper actually does (in simple terms)
1) It treats activations as a new input “modality”
Activation Oracles are trained as if activations were another kind of input token stream. During training and evaluation, you:
Run a target model on some prompt.
Capture one or more activation vectors from some layer(s).
Insert (“inject”) those vectors into the oracle model at placeholder positions in a new prompt.
Ask a plain English question; the oracle answers in plain English.
Crucially, the oracle is meant to answer questions about information that may not appear in the text at all—information that exists only in the target model’s weights or internal state.
2) It scales training beyond earlier “LatentQA” work
Prior work (LatentQA) trained activation-decoding models on narrow tasks. This paper pushes a “generalist” direction by training on a mixture of tasks:
System-prompt question answering (e.g., infer the system instruction’s effect from activations).
Binary classification posed as natural-language questions (e.g., sentiment, truthfulness, NLI).
A self-supervised context prediction task that can generate essentially unlimited data: the oracle must predict tokens that came before or after a window of activations, forcing it to use what’s encoded in those activations rather than just copying text.
3) It evaluates on “auditing games” where secrets are intentionally hidden
The most important evaluations are on setups where the target model is deliberately trained to hide something from ordinary prompts—yet still uses it internally.
Examples include:
A model fine-tuned to guide you toward a secret word without ever saying it (Taboo).
A model that “knows” a user attribute (e.g., gender) but denies knowing it.
A model that follows an encoded side-constraint (e.g., base64 instruction) while refusing to reveal it.
The paper’s headline result is that Activation Oracles can often extract these hidden objectives or secrets from activations with a single natural-language question, sometimes matching or beating more complex interpretability pipelines.
The most surprising, controversial, and valuable findings
Surprising
Out-of-distribution generalization is real (to a point). Even an oracle trained on relatively simple tasks can sometimes generalize to qualitatively different “auditing game” settings—i.e., it can answer questions about activations from fine-tuned “secret-keeping” models even though it was not trained on activations from those fine-tuned variants.
Secrets can leak even when the model refuses to say them. In the Taboo setting, the oracle can sometimes identify the secret word despite the target model being trained never to utter it.
“Model diffing” becomes conversational. By feeding the oracle the difference between base-model activations and fine-tuned-model activations on the same prompt, it can sometimes describe what fine-tuning “added” (e.g., a tendency toward risky financial advice) even when the prompt itself is innocuous.
Controversial
This is interpretability that trades mechanism for narration. The authors explicitly frame AOs as a kind of “black-box interpretability” with a QA interface: useful and scalable, but not a mechanistic explanation you can verify step-by-step. That’s a big philosophical and governance fault line: Is a fluent explanation that performs well enough to trust?
Reliability and calibration are weak. The paper stresses the oracle can make confident-sounding wrong guesses and is not trained to express uncertainty—an uncomfortable property if you want to use it as a safety control in high-stakes contexts.
Dual-use risk is obvious. A tool that can turn internal activations into natural-language explanations can be used to audit models—but also to extract hidden instructions, constraints, private attributes, or sensitive learned behaviors.
Valuable
A simpler “universal interface” for audits. The big practical win is workflow: instead of building a bespoke interpretability pipeline per risk type, you can sometimes ask one model a direct question about another model’s activations.
Training diversity helps consistently. The paper shows that adding more diverse training datasets generally improves downstream auditing performance, suggesting a scalable path: build broad activation-verbalization competence, not narrow decoders.
It can uncover fine-tuned propensities that do not appear in the prompt. This matters because many real-world failures are about latent tendencies (e.g., risk-seeking advice patterns) rather than explicit prompt content.
Why this matters for critical sectors
The consequences are not limited to AI safety labs. If Activation Oracles (or similar tools) become cheap and standard, they change what it means to claim a model is “confidential,” “aligned,” or “privacy-preserving.”
Legal
Privilege and confidentiality pressures intensify. If a legal-tech model has internal policies, hidden constraints, or “silent” memorized details about client matters, an activation-reading tool could potentially surface signals that ordinary outputs don’t reveal. That creates a new class of leakage risk: not only what the system says, but what its internal state implies.
System prompts and hidden guardrails become less “hidden.” If the industry relies on proprietary prompts/constraints for safety or IP protection, activation-oracle-style methods could make them easier to infer—affecting both security and competitive advantage.
Discovery disputes get nastier. If a party can argue that activations contain materially relevant information about intent, knowledge, or hidden objectives, you can imagine pressure to preserve, produce, or analyze internal traces—raising new questions about what should be discoverable and how to handle it.
Healthcare
Patient privacy and “silent PHI” risk. Even if a clinical assistant never outputs identifiable patient data, a tool that reads internal representations might still expose that the model is relying on sensitive facts (or has internalized them from training or context).
Audit upside: detecting dangerous fine-tunes. The positive side is strong: healthcare is exactly where you want tools that can detect whether a model has been fine-tuned (intentionally or accidentally) toward harmful medical advice patterns—especially when the model behaves normally on surface tests.
Finance
Hidden risk-seeking behaviors become auditable—good and bad. The paper explicitly tests models fine-tuned on risky financial advice and shows an oracle can sometimes identify that shift from activation differences. That’s powerful for compliance monitoring.
But it also enables strategic extraction. If activations encode trading heuristics, fraud strategies, or internal constraints (“don’t reveal X”), an oracle-like method could help adversaries infer those policies—especially if activation access is exposed through tooling, plugins, or compromised infrastructure.
Scientific research
Provenance and integrity audits get sharper. If you can query activations about what the model “believes” it is using (e.g., whether it’s guessing vs. recalling), you can imagine improved research-assistant governance—flagging low-confidence claims, hidden biases, or fine-tune artifacts.
But it threatens sensitive lab workflows. Research orgs often treat prompts, internal instructions, and experimental context as confidential. If internal traces are accessible, “activation-to-English” models can become a leakage amplifier in exactly the places where novelty and secrecy matter.
What regulators should take from this
If you zoom out, Activation Oracles collapse a comforting boundary: a model can refuse to disclose something in its outputs, yet still carry that information in an internal form that another model can translate into words. That implies a regulatory shift: controls must cover internal signals and access pathways, not only outputs.
Recommendations for regulators
Treat activation access as a high-risk capability, not a developer convenience.
In critical sectors, access to intermediate activations (or tooling that can extract them) should be governed like access to sensitive logs—restricted, audited, and justified.Require “interpretability claims” to include reliability and calibration disclosures.
If vendors claim they can “explain” model behavior, regulators should require evidence on error rates, uncertainty handling, and failure modes—especially because the paper notes the oracle can guess confidently even when wrong.Mandate secure audit architectures (separation of duties).
If activation-based auditing is used, design should separate (a) operational model serving from (b) interpretability/audit pipelines, with strict access controls and logging—so interpretability tools don’t become an exfiltration channel.Create a formal category for dual-use model inspection tools.
Tools that can extract hidden objectives or constraints should be treated as dual-use: permitted and encouraged for authorized safety testing, but regulated to prevent misuse, especially where secrets, PHI, or financial compliance data are involved.Require post-fine-tune and post-merge audits for critical-sector deployment.
The paper shows fine-tuning can imprint properties not obvious from prompts. Regulators should require standardized auditing after fine-tuning, model updates, and model acquisitions—because hidden behaviors can persist even when surface behavior looks normal.Focus on “hidden objective” tests, not just benchmark accuracy.
The auditing-game framing is valuable: require evaluations that probe for concealed objectives, deceptive compliance, and refusal-based hiding—because those are exactly the cases where output-only testing fails.Set retention and minimization rules for internal traces.
If internal traces (activations, logs, embeddings) are stored, they can become a sensitive dataset. Regulators should push minimization, encryption, strict retention windows, and clear policies on when such traces may be captured at all.


