Anthropic is launching an AI Psychiatry team
Models contain unknown multitudes
Trey Causey
We’re launching an “AI psychiatry” team as part of interpretability efforts at Anthropic! We’ll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors.
That being said, there is clearly a large and unexplored world of weird model behaviors: an anon on Twitter describes a ‘Cat Mode’ that they discovered within Bing, and the Claude 4 System Card itself discusses an odd “‘spiritual bliss’ attractor state” (Section 5.5.2, page 62).