Safety researchers sometimes treat model outputs as expressions of the model's dispositions, goals, or values: things the model "believes" or "wants." When a model says something alarming in a test scenario, the safety framing interprets this as evidence about the model's internal alignment. But what is actually happening is that the model is producing text consistent with the genre and context it has been placed in. The distinction matters because the genre-and-context framing gives a richer account of what causes a model to act in a particular way.
This is an excellent point as well, and one of the reasons I've often found some of the more eyebrow-raising misalignment examples included in system cards to be unconvincing.