TL;DR

It would make sense to briefly skim our previous post, which introduces our experiments on refusal in LLMs. There we explained how it started; here we'll tell you how it's going. The primary goal of this text is to structure our whack-a-mole list of research questions. The secondary goal is to get some outside perspective, so if you are running similar research or have seen any, please lend us a hand.

Feel free to jump straight to the section that looks most appealing. We recommend skimming “The Main Question” first, as this section provides the broader perspective. It is split into two parts: in “Our suggestion” we outline our main hypothesis and the evidence we found for it during our experiments; in “An Alternative Suggestion” we highlight the opposing point of view and the evidence behind it.

After that, we list the other questions that arose during the research. You'll find them under the headers “Another Question: …” and “Wording Also Matters”. The first discusses how refusal is represented in different layers and what that might mean. The second is dedicated to two parts of refusal: its wording and the actual detection of a potentially harmful request.

The Main Question (MQ)

We experiment on small (~9B) open-weight instruct models, trying to understand what exactly happens when they refuse to provide an answer in different contexts. One of our core observations is that refusal looks different for different categories of potential harm (for example, a request…
