LessWrong AI
2026-06-29 21:24 UTC
By Owain Mogford
USR-0152-20260629-community-fo-f50e7643
Role confusion: sounding like the cause is indistinguishable from being it.
A replication of Prompt Injection as Role Confusion (2026), and why the mechanistic story of prompt injection is harder to pin down than it looks.
Epistemic status: I reproduced the direction of the paper's main results on a single consumer GPU (faithful in direction but not like for like in magnitude; see caveats at the end). I then tried two ways to test the paper's causal claims: first activation steering, then activation patching. Neither settled it. Steering is too weak: it can't move behaviour even along a direction built exactly to do that. Patching does move behaviour but isn't specific: a random perturbation of equal size does the same thing.
This post is a replication and an honest bracketing negative result. The causal tools can't show that role confusion IS the mechanism, nor that it's a bystander. But two clues that need no working intervention both lean towards it being a bystander: 1) the styled/destyled gap is ~95% outside the probe's role axis, and 2) the probe's predictive ability collapses once style is held fixed. What I can show is narrower, but it's well supported by the data, and exploring why a clean verdict is out of reach is interesting on its own. The dead ends here demonstrate precisely why making causal claims about how prompt injections work is so difficult.
If you are hoping for a verdict on the original paper, there isn't one. I couldn't get one, and I really tried. Rather, this post is about why a clean verdict is so…
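To make two of the quantities in the excerpt concrete, here is a minimal NumPy sketch, not the author's code. It assumes the "~95% outside the probe's role axis" figure means the share of the gap vector's squared norm orthogonal to a single probe direction, and that "a random perturbation of equal size" means a random vector rescaled to the same L2 norm as the patch. The names `fraction_outside_axis`, `norm_matched_random`, `probe_axis`, `styled_mean` and `destyled_mean` are hypothetical, and the vectors are random placeholders rather than real activations.

```python
import numpy as np


def fraction_outside_axis(gap, axis):
    """Share of the squared norm of `gap` that is orthogonal to `axis`.

    Assumption: "outside the axis" is measured in squared L2 norm.
    """
    unit = axis / np.linalg.norm(axis)
    along = np.dot(gap, unit)  # signed length of gap along the axis
    return 1.0 - along**2 / np.dot(gap, gap)


def norm_matched_random(delta, rng):
    """Random Gaussian direction rescaled to the same L2 norm as `delta`.

    Serves as a specificity control: if this moves behaviour as much as
    the targeted patch does, the patch result is not specific.
    """
    noise = rng.standard_normal(delta.shape)
    return noise * (np.linalg.norm(delta) / np.linalg.norm(noise))


rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Placeholder vectors standing in for a probe direction and mean activations.
probe_axis = rng.standard_normal(d)
styled_mean = rng.standard_normal(d)
destyled_mean = rng.standard_normal(d)

gap = styled_mean - destyled_mean
outside = fraction_outside_axis(gap, probe_axis)
control = norm_matched_random(gap, rng)
```

In high dimensions a random vector is already almost entirely orthogonal to any one fixed direction, so a large "outside" fraction is only informative relative to that baseline.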
Source:
LessWrong AI
· lesswrong.com