{"version":1,"type":"rich","provider_name":"Libsyn","provider_url":"https:\/\/www.libsyn.com","height":90,"width":600,"title":"42 - Owain Evans on LLM Psychology","description":"Earlier this year, the paper &quot;Emergent Misalignment&quot; made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models. Patreon: https:\/\/www.patreon.com\/axrpodcast Ko-fi: https:\/\/ko-fi.com\/axrpodcast Transcript:  https:\/\/axrp.net\/episode\/2025\/06\/06\/episode-42-owain-evans-llm-psychology.html &amp;nbsp; Topics we discuss, and timestamps: 0:00:37 Why introspection? 0:06:24 Experiments in &quot;Looking Inward&quot; 0:15:11 Why fine-tune for introspection? 0:22:32 Does &quot;Looking Inward&quot; test introspection, or something else? 0:34:14 Interpreting the results of &quot;Looking Inward&quot; 0:44:56 Limitations to introspection? 0:49:54 &quot;Tell me about yourself&quot;, and its relation to other papers 1:05:45 Backdoor results 1:12:01 Emergent Misalignment 1:22:13 Why so hammy, and so infrequently evil? 1:36:31 Why emergent misalignment? 1:46:45 Emergent misalignment and other types of misalignment 1:53:57 Is emergent misalignment good news? 2:00:01 Follow-up work to &quot;Emergent Misalignment&quot; 2:03:10 Reception of &quot;Emergent Misalignment&quot; vs other papers 2:07:43 Evil numbers 2:12:20 Following Owain's research &amp;nbsp; Links for Owain: Truthful AI: https:\/\/www.truthfulai.org Owain's website: https:\/\/owainevans.github.io\/ Owain's twitter\/X account: https:\/\/twitter.com\/OwainEvans_UK &amp;nbsp; Research we discuss: Looking Inward: Language Models Can Learn About Themselves by Introspection: https:\/\/arxiv.org\/abs\/2410.13787 Tell me about yourself: LLMs are aware of their learned behaviors: https:\/\/arxiv.org\/abs\/2501.11120 Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https:\/\/arxiv.org\/abs\/2406.14546 Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https:\/\/arxiv.org\/abs\/2502.17424 X\/Twitter thread of GPT-4.1 emergent misalignment results: https:\/\/x.com\/OwainEvans_UK\/status\/1912701650051190852 Taken out of context: On measuring situational awareness in LLMs: https:\/\/arxiv.org\/abs\/2309.00667 &amp;nbsp; Episode art by Hamish Doodles: hamishdoodles.com ","author_name":"AXRP - the AI X-risk Research Podcast","author_url":"https:\/\/axrp.net","html":"<iframe title=\"Libsyn Player\" style=\"border: none\" src=\"\/\/html5-player.libsyn.com\/embed\/episode\/id\/36897095\/height\/90\/theme\/custom\/thumbnail\/yes\/direction\/forward\/render-playlist\/no\/custom-color\/88AA3C\/\" height=\"90\" width=\"600\" scrolling=\"no\"  allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen><\/iframe>","thumbnail_url":"https:\/\/assets.libsyn.com\/secure\/content\/189315250"}