<?xml version="1.0" encoding="utf-8"?>
<oembed>
  <version>1.0</version>
  <type>rich</type>
  <provider_name>Libsyn</provider_name>
  <provider_url>https://www.libsyn.com</provider_url>
  <height>90</height>
  <width>600</width>
  <title>42 - Owain Evans on LLM Psychology</title>
  <description>Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html

Topics we discuss, and timestamps:
0:00:37 Why introspection?
0:06:24 Experiments in "Looking Inward"
0:15:11 Why fine-tune for introspection?
0:22:32 Does "Looking Inward" test introspection, or something else?
0:34:14 Interpreting the results of "Looking Inward"
0:44:56 Limitations to introspection?
0:49:54 "Tell me about yourself", and its relation to other papers
1:05:45 Backdoor results
1:12:01 Emergent Misalignment
1:22:13 Why so hammy, and so infrequently evil?
1:36:31 Why emergent misalignment?
1:46:45 Emergent misalignment and other types of misalignment
1:53:57 Is emergent misalignment good news?
2:00:01 Follow-up work to "Emergent Misalignment"
2:03:10 Reception of "Emergent Misalignment" vs other papers
2:07:43 Evil numbers
2:12:20 Following Owain's research

Links for Owain:
Truthful AI: https://www.truthfulai.org
Owain's website: https://owainevans.github.io/
Owain's twitter/X account: https://twitter.com/OwainEvans_UK

Research we discuss:
Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787
Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546
Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424
X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852
Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667

Episode art by Hamish Doodles: hamishdoodles.com</description>
  <author_name>AXRP - the AI X-risk Research Podcast</author_name>
  <author_url>https://axrp.net</author_url>
  <html>&lt;iframe title="Libsyn Player" style="border: none" src="//html5-player.libsyn.com/embed/episode/id/36897095/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/88AA3C/" height="90" width="600" scrolling="no" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen&gt;&lt;/iframe&gt;</html>
  <thumbnail_url>https://assets.libsyn.com/secure/content/189315250</thumbnail_url>
</oembed>
