<?xml version="1.0" encoding="utf-8"?>
<oembed>
  <version>1</version>
  <type>rich</type>
  <provider_name>Libsyn</provider_name>
  <provider_url>https://www.libsyn.com</provider_url>
  <height>90</height>
  <width>600</width>
  <title>34 - AI Evaluations with Beth Barnes</title>
  <description>How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript:  https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html &amp;amp;nbsp; Topics we discuss, and timestamps: 0:00:37 - What is METR? 0:02:44 - What is an &amp;quot;eval&amp;quot;? 0:14:42 - How good are evals? 0:37:25 - Are models showing their full capabilities? 0:53:25 - Evaluating alignment 1:01:38 - Existential safety methodology 1:12:13 - Threat models and capability buffers 1:38:25 - METR's policy work 1:48:19 - METR's relationships with labs 2:04:12 - Related research 2:10:02 - Roles at METR, and following METR's work &amp;amp;nbsp; Links for METR: METR: https://metr.org METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/ METR - Hiring: https://metr.org/hiring Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/ &amp;amp;nbsp; Other links: Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story) https://metr.org/blog/2023-03-18-update-on-recent-evals/ Password-locked models: a stress case for capabilities evaluation:  https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566 Untrusted smart models and trusted dumb models:  https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models AI companies aren't really using external evaluators:  https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators Nobody Knows How to Safety-Test AI (Time):  https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/ ChatGPT can talk, but OpenAI employees sure can’t:  https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release Leaked OpenAI documents reveal aggressive tactics toward former employees:  https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees Beth on her non-disparagement agreement with OpenAI:  https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428 &amp;amp;nbsp; Episode art by Hamish Doodles:&amp;amp;nbsp;hamishdoodles.com </description>
  <author_name>AXRP - the AI X-risk Research Podcast</author_name>
  <author_url>https://axrp.net</author_url>
  <html>&lt;iframe title="Libsyn Player" style="border: none" src="//html5-player.libsyn.com/embed/episode/id/32317457/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/88AA3C/" height="90" width="600" scrolling="no"  allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen&gt;&lt;/iframe&gt;</html>
  <thumbnail_url>https://assets.libsyn.com/secure/content/175223227</thumbnail_url>
</oembed>
