Safety researchers sometimes treat model outputs as expressions of the model's dispositions, goals, or values: things the model "believes" or "wants." When a model says something alarming in a test scenario, the safety framing interprets this as evidence about the model's internal alignment. But what is actually happening is that the model is producing text consistent with the genre and context it has been placed in. The distinction matters because the second framing gives you a much richer account of what causes a model to act in a particular way.

Following up on my earlier post from Understanding AI, here's a fantastic long X post from Séb Krier (AGI policy dev lead at GDM) on how to think carefully about "misaligned" outputs from LLMs. Really, really grokking the difference between the text-prediction capabilities of base models and the assistant personas built on top of them is key.

The model is an extraordinarily skilled reader of context. It knows what kind of text it’s in. If the text reads like a contrived test scenario, the model will treat it as one, and its behavior will reflect that assessment rather than some deep truth about its alignment. The model is a better reader than the researchers are writers. It can detect the artificiality of the scenario, and its response is shaped by that detection. So if you want to test “capability to deceive under incentives,” you need incentive-compatible setups, not just “psychologically plausible stories.”

This is an excellent point as well, and one of the reasons I've often found some of the more eyebrow-raising misalignment examples included in system cards unconvincing.
