We’re launching an “AI psychiatry” team as part of interpretability efforts at Anthropic! We’ll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors.
That being said, there is clearly a large and unexplored world of weird model behaviors: an anon on Twitter describes a ‘Cat Mode’ they discovered within Bing, and the Claude 4 System Card itself discusses an odd “‘spiritual bliss’ attractor state” (Section 5.5.2, page 62).