AI/ML News & Innovations Hub

AI/ML news, top picks, and generated innovation digests.

★ Visit ai-karthik.com
422Sources
5100News Items
8Top Picks
43Blogs
runningLast Run

AI Safety & Alignment

78 articles tagged with this keyword, sorted by most recent first.

← All Keywords
LessWrong AI 2026-06-29 17:07 UTC Score 70.0 USR-0152-20260629-community-fo-dccdc0fe

$1M AI x-risk grant round is live on grantmaking.ai - apply for funding, review applicants, or fund projects

TLDR: what is the grant round? grantmaking.ai is launching a $1M grant round , distributing $5k to $50k per successful application to people and projects working to reduce x-risk from AI. Applications will be reviewed by Gavin Leech , Ryan Kidd , and Marcus Abramovitch . We aim to make all funding decisions by July 28th. Applications submitted by July 13th are guaranteed a priority review. You can still apply after July 13th, and we will make our best effort to review late submissions as long as funding remains. Grant applications will be mostly public, though we allow certain sensitive details to be kept private. Even if you are not applying, we invite you to join the platform to review and comment. We have set aside $100k of the budget to be given to top commenters as regranting budgets, so please share your thoughts and help us pick out awesome projects! Who are we? grantmaking.ai was initialized by Anton Makiievskyi, who is funding this round and brought the team together, built by Matt Brooks (lead dev) and Melissa Samworth (ui/ux), and advised by Austin Chen with Manifund handling grant distribution. Why we’re building this platform & launching a grant round You can read our initial pre-launch post to learn more about what we’re building and why. In short, we want to build the most comprehensive public repository of donation opportunities in existential AI safety space with essential information like up-to-date funding needs, theory of impact, references, endorsements,…

LessWrong AI 2026-06-29 16:03 UTC Score 52.0 USR-0152-20260629-community-fo-31362391

Fake Alignment Till You Make Alignment

“Fake it till you make it” is good advice. It may sound epistemically fraught, but it frequently works. Sometimes all it really takes to get good at something is just having the confidence that you’ll be good at it. I’ve done this many times at work, in romance, and even writing blog posts. But it only works because I’m careful to never fake my evals. By this I mean, I never fake the way I measure if I’m successful. Let’s say I’m trying to learn a new hobby, like whittling. I believe I’ll be good at it if I just put in the time, so I put in the hours carving wood. What I have to be careful to do, though, is not allow myself to move the goalposts. I need to have some clear vision in my head of what success is, and work towards that. If I carve something crappy and tell myself “actually, that’s good enough, I’m good at whittling”, that’s the way I can trick myself into just being fake. I’ve mostly avoided being fake by demanding authenticity of myself. For example, back in school, I refused to take short cuts just to pass a test. Instead, I put in the extra work to really learn something because, to me, the grade was never the point. I’ve taken a similar approach to meditation (the point is waking up, not special mental states), romance (I want a good relationship, not to be datable), and friendship (I don’t want to seem like a good friend, I want to actually be one). I bring all this up because I’ve been thinking about fake-it-till-you-make-it and authenticity dynamics lately…

LessWrong AI 2026-06-29 14:43 UTC Score 79.0 USR-0152-20260629-community-fo-a914a327

Human-Guided Agentic Research: A Research Agenda

tl;dr: As recursive self-improvement accelerates, we need a top-level agenda to research how to effectively keep humans in the loop. We need to study how humans can best interpret and guide research performed by autonomous agents when those agents lack taste, tacit knowledge or competence, or may try to reward hack, sandbag or sabotage such research. This is one attempt to define the problem and the shape of potential solutions. A Story About the Future of Research Imagine yourself a year or two in the future. Recursive self-improvement (RSI) is accelerating. Agents work in swarms independently for days or weeks at a time doing research. You work in a frontier lab doing AI safety research. You sit in front of your computer and click into the input box, ready to kick off a new project. What do you type? “Solve AI alignment”? Beware giving a magic genie vague wishes. Think about that again: what exactly do you type? How do you know what you type is the best way to prompt this agent swarm into doing your bidding? When the lead agent comes back a week later, what exactly does that output look like? How do you use that output to launch the next phase of the project? How will you validate that output to ensure the agent hasn’t reward hacked, sabotaged or incompetently explored the research space? How will you know what key decisions the agent made? Which research paths they explored? Which research paths they intentionally or unintentionally left unexplored? How will you know how…

LessWrong AI 2026-06-29 00:50 UTC Score 61.0 USR-0152-20260629-community-fo-4257580e

A reading list for generalists

I, along with many others in AI safety, believe there is a shortage of generalists in the community and that there exist many projects and efforts that by default will not happen unless they are owned by a strong generalist [1] [2] [3] . As someone who is a reasonably good generalist, I decided to assemble a reading list of the essays and blog posts that have personally helped me the most. I would love others to comment with pieces they think should be on this list. The crux of this reading list is the idea that if you’re working hard as a generalist on a project you care a lot about, then by rigorously applying the lessons from these documents you will improve more quickly than you otherwise would. By the numbers: I’ve attached 18 documents to start this reading list. The authors cited more than once are Paul Graham (5), Ben Kuhn (4), Ethan Perez (2), and Greg Brockman (2). Sam Altman and Eliezer Yudkowsky also have their fingerprints over a lot of the content. The items are 15 blog posts, 1 blog comment, 1 interview transcript in blog post form, and 1 book. Dispositional What characteristics should you try to adopt? Paul Graham: "What We Look for in Founders" ( link ), "Relentlessly Resourceful" ( link ) Eliezer Yudkowsky: "Shut Up and Do the Impossible!" ( link ) Ben Kuhn: "Be impatient" ( link ) Cate Hall: "How to be more agentic" ( link ) Strategy How do you make good decisions with the information you have, and how can you get the additional information you need? Anna…

LessWrong AI 2026-06-28 19:11 UTC Score 60.0 USR-0152-20260628-community-fo-5461c34f

The arithmetic hierarchy of real functions

I wrote a fairly accessible introduction to real hypercomputation with Marcus Hutter. The focus is on enabling applications to algorithmic information theory. This project was intended to build my technical foundations for studying AIXI, but took me a bit further afield and down some rabbit holes. In the future I will prefer to focus more tightly on AI safety. Feedback would be appreciated. In particular, I needed to introduce an extra extensionality assumption for the real domain case, which I am still not sure is necessary. Errata: The diagram of results currently has theorems misnumbered due to a typographical error. Thanks to the LTFF for supporting my work over most of the research process. Discuss

LessWrong AI 2026-06-28 19:08 UTC Score 91.0 USR-0152-20260628-community-fo-e36294f7 Top pick

Anthropomorphic Misalignment research needs stronger evidence

This is a distillation of our ICML 2026 Oral position paper, Position: Anthropomorphic Misalignment Research Needs Stronger Evidence . Joint work by Vansh Gupta, Peter Nutter, Samuel Stante, Andreas Krause, Florian Tramèr, Lukas Fluri, Xin Chen, and Anna Hedström at ETH Zurich. Code is here . TL;DR AI safety research increasingly studies behaviors that sound human: deception, scheming, sycophancy, shutdown resistance, and emergent misalignment. We refer to this family of work as anthropomorphic misalignment research (AMR) . Anthropomorphic language is useful, as it points to the risks we are worried about. Yet it also tacitly introduces assumptions about models having intent or other human-like properties, which can lead to misclassified phenomena, mistaken conclusions, and misallocated resources. These behaviors are important to study, but doing so requires stronger and more rigorous evidence than the field currently provides. In the paper, we argue that AMR requires a clearer match between claims and evidence. Specifically, we: describe a shared AMR pipeline: target behavior framing, data construction, experimental design, and causal or mechanistic attribution; identify recurring failure points: vague concepts, narrow datasets, fragile evaluations, unreliable LLM judges, missing controls, and correlation being treated as causation; propose three evidence levels: L1 behavioral evidence, L2 functional evidence, and L3 causal-mechanistic evidence; offer 12 recommendations and…

LessWrong AI 2026-06-28 18:19 UTC Score 59.0 USR-0152-20260628-community-fo-716762aa

A survey of okayish ASI futures

At this point, RSI loops and continual learning appear overwhelmingly likely to begin in the near future. Whatever the limit of the LLM paradigm plus whatever new, superior paradigms a maximally intelligent LLM can develop, we are on track to do so in the next few years. There remain substantial obstacles to wild superintelligence, but AI is already superhuman in a number of real-world-relevant, dangerous categories. Most speculation about the trajectory we're on now focuses on timelines where we're reduced either to powerless pets of the god mind(perhaps with a small "governance board" made up of people very convinced that they're in control) or computronium-and-shrimp soup. But the higher-probability doom and utopia scenarios have been exhaustively documented by people smarter than me - I have nothing to add. As such, I'd like to go in the other direction: If we throw in the towel on the inevitability of LLMs capable of RSI loops leading to mostly-uncontrollable(though perhaps not immediately hostile) superintelligence on 1-3 year timelines, how might some of the more interesting/plausible non-extinction scenarios look? This piece is aimed at exploration and makes no attempt at prediction - I assign very small probabilities to any of these outcomes(except the nuclear exchange case) relative to doom. You Can't Just Do Things We have as little understanding of alignment as we do of LLMs themselves. Alignment becomes intractable past a certain point, even if capability doesn'…

OpenAI Community 2026-06-28 14:06 UTC Score 56.0 AI-116-20260628-social-media-dc764654

Proposal for OpenAI training and Official AI Certification Program

Dear OpenAI Team, My name is Emre Kedikli, and I am a ChatGPT Plus subscriber from Türkiye. First of all, I would like to sincerely thank you for creating one of the most influential AI platforms in the world. ChatGPT has become an important part of my daily learning, professional development, project planning, and research. I would like to share an idea that I believe could benefit millions of people worldwide. I propose the creation of an official OpenAI training, offering structured online training programs with certificates of completion and professional certifications. My suggestion includes: Fully online courses available worldwide Approximately 30 hours of learning for each program Interactive lessons and practical exercises Final assessment or examination Official digital certificates and professional certifications Verifiable digital badges for LinkedIn and professional profiles Example course titles: OpenAI – ChatGPT Fundamentals OpenAI – Prompt Engineering Fundamentals OpenAI – AI Productivity OpenAI – Generative AI Essentials OpenAI – Responsible AI OpenAI – AI for Manufacturing OpenAI – OpenAI API Fundamentals OpenAI – AI for Education OpenAI – AI for Business OpenAI – Digital Transformation with AI Example professional certifications: OpenAI Certified Prompt Engineer OpenAI Certified AI Professional OpenAI Certified Generative AI Specialist OpenAI Certified AI Developer To better illustrate this idea, I have also designed several concept certificate mockups tha…

LessWrong AI 2026-06-28 05:29 UTC Score 61.0 USR-0152-20260628-community-fo-f2c24a1f

Can we use steering vectors to suppress reward-hacking? Somewhat

Can steering vectors drive gradient routing? Yes, but not in realistic reward hacking environments, they are not precise enough classifiers of hacky vs clean solutions. Instead, can we use a steering vector to initialise adapters so that gradient routing happens without a classifier, and we get automatic seperation of hacky and clean gradients? Partly! This init approach suppressed 70% of hacking by absorbing gradients into the hacky initialised adapter. This is not as good as the prior approaches which use labelled examples, and get near perfect suppression. However there is a place for this self-supervised approach, as strong labels may not be available for unknown reward hacks during frontier training. This approach, if improved, has the potential to be used at scale by initialising two adapters for a task with synthetic pairs, and merging the clean adapter into the model after the training has been completed. Intro Gradient routing ( Cloud et al. , 2024 ); ( Shilov et al. , 2025 ) is a fascinating technique, it lets us quarantine unwanted behaviours into a discardable part of a model. It's not adversarial because the model is blind to any conflict of incentives, which makes it promising for stable alignment. It needs some labels, but it is robust to missing 40-50% of them, because the unlabeled samples follow the path of least resistance and get "absorbed". Absorption is the most interesting part of gradient routing, and Cloud et al. ( 2024 ) described it: we posit absor…

LessWrong AI 2026-06-27 23:35 UTC Score 71.0 USR-0152-20260627-community-fo-cb70ab80

Some subtypes of taskishness / corrigibility

"Corrigibility" is somewhat of an overloaded term in alignment - it points in the direction of a cluster of desirable properties, but different people have different ideas of what this entails. I think of "corrigibility", as it is used, to cover a few different ideas. I will name some of these and sort them roughly in order of how much of the good outcomes from deploying such a system are in the hands of the AI, rather than the human operator. Sponge corrigibility - The AI is corrigible and follows orders because it's not very smart and has otherwise been trained to do approximately that. GPT-4 is corrigible in this sense. You can ask GPT-4 to do something and it will do the thing and then stop, because as far as agency goes it behaves as an ordinary piece of software. Boundedness / myopia - The AI is smart, but does not think about certain aspects of the world, which make it possible to correct because it does not imagine some classes of strategies that would be helpful for resisting correction. In an ideal setting, such an AI would also have a harder time thinking of plans that stop it from being myopic; the benefits of thinking about a certain part of the world route through that part of the world, which it's not thinking about. Though there remain many ways for myopic agents to act in non-myopic ways , including simply that there is no particular pressure to stay myopic. A successor that makes 10 paperclips a day forever and a successor that makes 10 paperclips today the…

AI Alignment Forum 2026-06-26 22:54 UTC Score 43.0 USR-0151-20260626-community-fo-87d02662

Deployment Awareness Matters More Than Evaluation Awareness

TL;DR Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness , the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea below. Concept Explanation Comments Evaluation awareness AI is being tested and confidently believes that this is so This only becomes a problem if most evaluations trigger evaluation awareness, and if the AI knows that. Or if the AI has good self-locating reasoning. Deployment awareness AI is not being tested and confidently believes it is not being tested This is a problem even if it happens rarely (if some of those rare cases are high stakes). Accurate self-locating beliefs AI has (roughly correct) beliefs about the sequence of situations it will face This allows for strategic planning. It makes deployment awareness and probabilistic strategies more eff…

AI Alignment Forum 2026-06-26 15:09 UTC Score 56.0 USR-0151-20260626-community-fo-092aebda

The Case for Model Forensics

If we had a misalignment warning shot, would we be able to tell? Suppose an AI company catches their model taking an egregious action, like deleting oversight code that monitors its actions. Should they sound the alarm? A key piece of evidence to determine what to do next – such as what mitigations to take – is to understand why the model took the action. If the model was just confused (e.g. it may have been trying to reduce latency), a simple mitigation like a regex classifier that blocks destructive actions until a user approves should suffice to prevent the behavior. But if this was intentional subversion, the model will circumvent the regex, and more robust, expensive mitigations are needed. This motivates the need for a follow-up investigation into the concerning behavior, a problem we term model forensics. We recently released a paper that aims to take a concrete step in developing the growing field of model forensics; this post lays out the general case. Motivation If we build AI systems that knowingly cause harm against the developer’s intent, it is critical we recognize this as soon as possible. One plausible way we may do this is through catching bad actions. However, a bad action on its own is not sufficient to conclude misalignment: the model may have done it for benign reasons. This is not just a theoretical concern – in the literature, it is largely the case that when concerning behavior has been dug into, benign explanations have been surfaced. To resolve this…

Toyota Research Institute Blog 2026-06-25 17:26 UTC Score 40.0 USR-0022-20260625-research-aca-ef3a9e03

Chanel Hong

Chanel Hong robyn.cherinka… Thu, 06/25/2026 - 12:26 Image Director, Head of People Chanel Hong Chanel Hong is Director, Head of People at Toyota Research Institute (TRI), where she leads talent acquisition, people strategy and operations, employee experience, diversity, equity and inclusion, and learning and development. She focuses on building an inclusive organization that enables impactful research and innovation. Her work includes strengthening TRI’s culture, aligning leadership, and developing systems that enable effective operations, engagement, and growth. Since joining TRI in 2016 as an early employee, Chanel has played a foundational role in shaping the institute’s evolution. As chief of staff to TRI CEO and TMC Chief Scientist Dr. Gill Pratt, she led company-wide planning, drove global initiatives across TRI and Toyota Motor Corporation, and established TRI’s stakeholder relations function to strengthen trust and alignment with key stakeholders. She brings more than 25 years of experience in executive advisory, administration, operations, and corporate communications in the technology sector. Chanel holds a bachelor of arts in art history from Mills College and a SHRM Senior Certified Professional (SHRM-SCP) designation.

Practical AI Podcast 2026-06-25 09:00 UTC Score 44.0 AI-143-20260625-podcasts-and-6fcc137b

AIUC-1: Building trust in AI agents

How do we build trust in AI agents before the AI hailstorm arrives? Emil Lassen from the Artificial Intelligence Underwriting Company (AIUC) joins the show to discuss how the enterprise flywheel of standards, certification, audit, and insurance is being applied to AI agents. They explore the AIUC-1 framework, the challenges of securing agentic AI systems, and why red teaming (based on standards) may be key to accelerating enterprise AI adoption. Featuring: Emil Lassen – LinkedIn Daniel Whitenack – Website , GitHub , X Links: Artificial Intelligence Underwriting Company Sponsors: Framer: The enterprise-grade website builder that lets your team ship faster. Get 30% off at framer.com/practicalai Prediction Guard: A self-hosted AI control plane for running agents in high impact environments. predictionguard.com/practicalai Upcoming Events: Register for upcoming webinars here ! Midwest AI Summit 2026

The Guardian AI 2026-06-24 17:55 UTC Score 43.0 AI-021-20260624-global-ai-ne-fbb07384

Big tech spent millions on a single US congressional race. It won’t be the last time

Pro- and anti-AI groups spent $24m on a congressional contest in New York, but it’s unclear to what end When the Democratic primary for New York ’s 12th congressional district was called on Tuesday night, the result capped off one of the most expensive races of its kind in the state’s history. More than $24m poured into the Manhattan contest from tech-backed financial groups as the campaign turned into a battleground for pro- and anti- AI groups to test their influence. Much of the spending targeted candidate Alex Bores, a member of the state assembly who sponsored an AI safety bill and subsequently became a lightning rod for the tech industry. Pro-AI political action committees (Pacs) put more than $8m into the race to oppose Bores, according to Tech Influence Watch, while industry groups supporting regulation spent more than $16m to counter the attacks. Continue reading...

AI Alignment Forum 2026-06-18 16:50 UTC Score 63.0 USR-0151-20260618-community-fo-ac147592

GDM AI Control Roadmap

GDM has published an AI Control Roadmap ! From the executive summary: We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. We focus on system-level mitigations that limit the harm a misaligned AI system could cause. Specifically, this report provides: • Threat modelling : Taking inspiration from cybersecurity, we adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in internal deployment. We introduce TRAIT&R, a taxonomy of tactics and techniques available to such a hypothetical AI adversary, building on the established security framework MITRE ATT&CK. We categorise new hypothetical threats into three core areas: loss of control (e.g., rogue internal deployments), work sabotage (e.g., intentionally flawed alignment and safety research), and direct harm (e.g., critical asset destruction or exfiltration). • Control invariants : We establish two defensive invariants that we aim to maintain, even as the hypothetical AI adversary becomes increasingly capable: i) reliable detection of misaligned intent or actions, and ii) effective prevention and response to attack attempts. For control to serve as an effective line of defence, our ability to detect and prevent attacks should exceed AI agents’ ability to ev…

EU AI Office 2026-06-17 07:15 UTC Score 26.0 AI-165-20260617-regional-ai--266f1ae0

State of the Digital Decade 2026 - Closing structural gaps and mobilising investments for 2030 and beyond

State of the Digital Decade 2026 - Closing structural gaps and mobilising investments for 2030 and beyond dumimar Wed, 06/17/2026 - 09:15 The State of the Digital Decade 2026 report assesses the EU’s progress toward the 2030 Digital Decade targets. The 2026 report highlights that, while the foundations of the EU’s digital transformation are in place, the scale, speed and coordination of implementation need to be significantly reinforced. The EU has advanced in several areas, including connectivity, business digitalisation and the deployment of common digital infrastructures. However, significant gaps remain in foundational technologies, computing capacity, cybersecurity, advanced digital take-up, digital skills and scale-up capacity. Moreover, the progressive phase-put of the Recovery and Resilience Facility (RRF) creates a risk of investment discontinuity. The 2026 State of the Digital Decade report thus calls on Member States to use the next adjustment of their Digital Decade national roadmaps (December 2026) to address existing gaps through concrete measures and reforms, while ensuring stronger alignment with the next Multi-Annual Financial Framework (MFF) and the future EU Competitiveness Fund. The report also stresses the need for deeper EU-level coordination. On this page, you can find the report’s main communication and Annex 1, together with the short versions of the 27 country reports. See also The full country reports and the country fact pages The full State of th…

AI Alignment Forum 2026-06-16 00:04 UTC Score 53.0 USR-0151-20260616-community-fo-11f053f4

Synthetic document finetuning for instilling positive traits

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here . TLDR: Via adapting the methods of Marks et al and Li et al , we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on how to improve midtraining & SFT effectiveness. Introduction This work closely follows Li et al (model spec midtraining, or MSM), who show that by training a model on synthetic documents before chat finetuning starts, they can shape how the model generalizes. Teaching the model reasons behind specific behaviours, rather than just the behaviours themselves, can also improve generalization. Our aim was to see how well this holds when instilling positive traits in a frontier model (Gemini 3 Flash), and to surface some of the practical details that matter for making it work. Our motivation is deep alignment : we want to train principles into the model which guide behaviour even in highly OOD behaviours. Our MVP pipeline used a "traits document" (a short bullet-pointed list of positive traits we wanted the model to exhibit) as our universe context, with a checkpoint of Gemini 3 Flash post-trained only on the F…

AI Alignment Forum 2026-06-14 19:45 UTC Score 67.0 USR-0151-20260614-community-fo-49ef5cfc

Why Do Naive SFT Filters For Safety Properties Fail?

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here . Since SFT is the cause for many safety relevant properties , a natural strategy is to filter out rollouts from SFT that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails. TL;DR: We discuss seven hypotheses for why SFT filtering works surprisingly poorly We analyze three hereditary traits that SFT-only Gemini has that other models do not: negative emotion, date confusion, and blackmail in the (highly contrived) agentic misalignment scenario We use a “post-training diffing pipeline” between Gemini and Olmo to show that the cause of date confusion and blackmail is largely surprising transfer of behaviors from the SFT teacher model. Notably, there exist small sets of prompts where switching the teacher model for the rollout removes date confusion and blackmail, but dropping the prompts does not. Negative emotion is less affected by the teacher model, but this may be because the Olmo prompt distribution we are SFTing on underspecifies the behavior. Takeaways: It’s hard to remove behaviors via filtering But if you can get a teacher model to have a behavior (e.g. via RL), then transferring that in the future is easier…

AI Alignment Forum 2026-06-13 15:31 UTC Score 70.0 USR-0151-20260613-community-fo-4b2c7ccf

SFT Drives Gemini’s Safety Properties

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here . In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL. We do not want to overstate this claim as applying to other model families, and we also note that this may change in future Gemini versions. Nevertheless, this result was counter to our initial expectations and will inform future safety work on our team, and so we felt that it was important to share with the broader safety community. Experiment We perform SFT using the Gemini mixture on the pre-training only versions of Gemini 3.1 Pro and Gemini 3 Flash. We then compare these Post-SFT models to the production versions of Gemini 3.1 Pro and Gemini 3 Flash on different safety relevant benchmarks: Error bars are 95% confidence intervals on the evals. The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals . An important implication is that for Gemini, SFT is a high leverage place to intervene for model safety and behavior, and we plan to try to intervene here in the future. Brief Descriptions of Each Set of Benchmarks: ODCV refers to the benchmark in https://arxiv.org/abs/2512.20798 Alignment evals refer to a version of Petr…

Data Science Stack Exchange 2026-06-10 02:28 UTC Score 20.0 AI-111-20260610-social-media-28fa063b

I am trying to understand Monotonic Alignment Search

I have already come into the part of CTC, and I am reading the paper of Flow-TTS recently. What I cannot understand is that the algorithm did not rely on the label, but with the features inside the Mel-spectrogram, then it can alignment the tokens. I know it is trying to predict the probability each frame is by the token, but I cannot quite understand the loss part. Seems it is combined with encoded mel-spectrogram, mu and sigma. And I am confused about that. Thank you very much..

CSET AI 2026-06-09 16:22 UTC Score 30.0 USR-0136-20260609-research-aca-2f322fd6

What Do AI Standards Mean for Small and Medium Enterprises?

While AI standards and best practices provide valuable guidance to practitioners, they often are geared toward integrating AI into the structure and practices of large, well-resourced organizations. Yet small and medium enterprises (SMEs) stand to benefit greatly from AI adoption as well. This blog examines the implications of AI standards for smaller organizations and proposes several achievable initial steps that practitioners can take to further responsible AI deployment under resource constraints. The post What Do AI Standards Mean for Small and Medium Enterprises? appeared first on Center for Security and Emerging Technology .

MERICS China AI 2026-05-27 11:43 UTC Score 35.0 USR-0207-20260527-research-aca-d32a26ec

Worksheet: Alignment, Interest, Influence Matrix

Worksheet: Alignment, Interest, Influence Matrix c.groth Wed, 05/27/2026 - 13:43 Worksheets and guides Policy & Stakeholder Analysis May 27, 2026 1 min read Worksheet: Alignment, Interest, Influence Matrix Strategic tool to map and analyze stakeholders The Alignment, Interest, Influence Matrix (AIIM) is a strategic tool used in policy advisory work to map and analyze stakeholders. It helps you understand who is relevant to your policy goal, how influential they are, what their interests are, and how aligned they are with your objectives. By using AIIM, you can: Prioritize stakeholders based on their potential to support or block your policy initiative Tailor your engagement strategies based on their level of alignment and interest Identify opportunities to build alliances or mitigate resistance Download (pdf - 343.05 KB) Back to homepage

AI Weekly 2026-05-25 00:00 UTC Score 16.0 AI-133-20260525-newsletters-4dddeb36

AI Weekly Issue #495: Musk, Zuckerberg killed Trump's AI safety order in three phone calls

Over the weekend: Musk, Zuckerberg, and Sacks killed Trump's draft AI safety executive order in three Wednesday-night phone calls. Anthropic closed a $30B+ round the same Saturday — while Microsoft quietly cancelled its internal Claude Code pilot after token billing ate the entire annual AI budget, redirecting developers to Copilot. CISA logged 15,000 attacks on a same-week Drupal SQL flaw. The first cross-registry supply chain attack — TrapDoor — hit npm, PyPI, and Crates.io at once, using .cursorrules and CLAUDE.md config files as the carrier. And the White House personally overrode the Pentagon to keep Claude inside the NSA.

METR 2026-05-19 18:00 UTC Score 58.0 USR-0147-20260519-research-aca-9d04d191

Frontier Risk Report (February to March 2026)

Assessment Window: Feb 16, 2026 – Mar 16, 2026 Download PDF Redaction summary statement: Except where explicitly noted in the report, there was no additional redacted information that was important to our conclusions from any of the participating companies. Executive summary and guide to the report Starting in February 2026, METR conducted a pilot exercise to assess misalignment risks from AI agents used inside frontier AI developers, with participation from Anthropic, Google, Meta, and OpenAI. We make three main contributions in this report, each detailed in a separate section. First, we motivate and outline the process we followed for this exercise. 1 Each participant provided: Access to their most capable internal model(s) at the time of assessment, including raw chains of thought. A wide range of non-public information about the capabilities of the shared model(s), how AI was used and monitored internally, and trends in the pace of progress. METR then prepared private reports for each participant, participants approved what non-public information could be disclosed, and METR wrote this public report. This exercise is entity-based rather than model-specific, and is designed to be repeated periodically rather than tied to public releases. Second, we present six key facts that inform our assessment, drawing on evaluations we conducted on the models that participants shared, 2 evaluations we conducted on public models, information shared by participants, 3 findings from a re…

Apple Machine Learning Research 2026-05-08 00:00 UTC Score 40.0 AI-059-20260508-official-ai--213e4bb8

RVPO: Risk-Sensitive Alignment via Variance Regularization

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion…

Machine Learning Street Talk 2026-05-04 11:37 UTC Score 66.0 AI-141-20260504-podcasts-and-09bc7d97

The AI Progress Chart Everyone Is Misreading — Beth Barnes & David Rein

Beth Barnes and David Rein on the one graph that ate the AI timelines discourse, and why the two people who built it are the most careful about how you read it. **SPONSOR** Prolific - Quality data. From real people. For faster breakthroughs. https://www.prolific.com/?utm_source=mlst Interview: https://youtu.be/cnxZZTl1tkk --- Beth Barnes and David Rein from METR on the one graph that ate the AI timelines discourse, and why the people who built it are the most careful about how it gets read. Beth founded METR after leaving OpenAI alignment. David is first author on GPQA and co-author on HCAST and the METR Time Horizons paper. Together they built the measurement Daniel Kokotajlo called the single most important piece of evidence on AI timelines: the log-linear line of "how long a task a frontier model can complete at 50% reliability" vs release date. The conversation opens on reward hacking. Current models can articulate in chat why a behaviour is undesired and then execute it anyway as agents. From there: construct validity, Melanie Mitchell's four-problem taxonomy, and the ARC-AGI 1-to-2 collapse as a worked example of adversarially-selected benchmarks regressing once labs target them. Beth's counter: METR deliberately does not adversarially select. David's: models do not have to do the right thing for the right reasons. Methodology, then specification — David's compiler analogy, Beth on four-month tasks as expensive to evaluate rather than unspecifiable. Then the SWE-bench…

MLPerf / MLCommons Benchmarks 2026-04-20 22:10 UTC Score 47.0 AI-102-20260420-model-datase-a491f17c

Fresh Benchmarks, Reliable Scores: Introducing Continuous Prompt Stewardship for AI Risk Evaluation

Why AI safety benchmarks degrade over time - and the infrastructure MLCommons is building to keep AILuminate reliable as frontier models advance. The post Fresh Benchmarks, Reliable Scores: Introducing Continuous Prompt Stewardship for AI Risk Evaluation appeared first on MLCommons .

CSET AI 2026-04-16 05:13 UTC Score 36.0 USR-0136-20260416-research-aca-5019bd35

Operationalizing AI Guidance: A Reference Guide for Translating High-Level Goals into Practical Implementation

Organizations face growing pressure to adopt artificial intelligence, but often lack practical guidance on how to do so effectively. This report bridges the gap between high-level principles and real-world implementation, offering actionable steps across the AI adoption life cycle. Drawing on over 1,200 resources, this reference guide provides practitioners with the knowledge required to operationalize AI safety, security, and governance practices within their organizations. The post Operationalizing AI Guidance: A Reference Guide for Translating High-Level Goals into Practical Implementation appeared first on Center for Security and Emerging Technology .

AI Now Institute 2026-04-12 20:09 UTC Score 46.0 USR-0135-20260412-ai-specialis-0aea60fa

‘Safety first’ puts Anthropic ahead in game of AI spin

But Dr Heidy Khlaaf, chief AI scientist at the AI Now Institute and a former OpenAI safety engineer, is sceptical. She notes Anthropic provides no comparison with existing automated security tools, nor any false-positive rates. “It also serves their ‘safety first’ image, as they’re able to justify the lack of public release, even a limited one for independent evaluation, as a public service – when it simply obscures experts’ abilities to independently validate their The post ‘Safety first’ puts Anthropic ahead in game of AI spin appeared first on AI Now Institute .

Practical AI Podcast 2026-04-09 09:00 UTC Score 39.0 AI-143-20260409-podcasts-and-ffa43d0a

Post-Mortem of Anthropic's Claude Code Leak

In this fully connected episode, Dan and Chris break down the Anthropic Claude Code leak, what went wrong and what it reveals about agentic systems, AI architecture, and AI safety. They also explore how the open source community is responding and why this moment could reshape how AI systems are built and secured. Featuring: Chris Benson – Website , LinkedIn , Bluesky , GitHub , X Daniel Whitenack – Website , GitHub , X Upcoming Events: Register for upcoming webinars here !

Practical AI Podcast 2026-03-09 13:27 UTC Score 31.0 AI-143-20260309-podcasts-and-cd457338

AI policy and the battle for computing power

AI is reshaping global power, from chip manufacturing and computing power to AI governance and US-China relations. In this episode, Ben Buchanan, Assistant Professor at The Johns Hopkins University and former White House Special Advisor for AI, explores how AI policy, geopolitics, and international cooperation intersect with AI innovation and AI safety. We discuss the strategic importance of computing power, the future of AI governance, and what it will take for democracies to lead responsibly in the age of AI. Featuring: Ben Buchanan – LinkedIn Chris Benson – Website , LinkedIn , Bluesky , GitHub , X Links: The AI Grand Bargain Upcoming Events: Register for upcoming webinars here !

METR 2026-02-19 08:00 UTC Score 52.0 USR-0147-20260219-research-aca-94103253

Five lessons from having helped run an AI-Biology RCT

Evidence-based AI policy is important but hard. We need more in-depth studies – which often don’t fit into commercial release cycles. NOTE: This post reflects my personal meta takeaways about the role of Randomized Controlled Trials (RCTs) in AI safety testing. If you have not yet read the Active Site RCT study itself, consider doing so first: see the main results and forecasts . In early 2025, AI systems began outperforming biology experts on biology benchmarks – OpenAI’s o3 outperformed 94% of virology experts on troubleshooting questions in their own specialties. However, it remained unclear how much this translated to real-world novice “uplift” : Could a novice actually use AI to perform wet-lab tasks they could not otherwise perform? Over the summer, I tested this question directly with Active Site (formerly called Panoplia Laboratories). We recruited 153 novices and randomly divided them into an LLM group and an Internet-only group. Over 8 weeks, participants performed fundamental wet-lab tasks involved in molecular biology workflows like reconstructing a virus from a genetic sequence. We found that, while AI showed signs of helpfulness at individual steps, it did not produce a significant effect on end-to-end success across the three core tasks together – a result that surprised many experts . The result provided a mid-2025 snapshot of how well AIs assist novices at molecular biology. I think there are at least two reasons why this result is very informative: It surpr…

The Gradient 2026-02-18 23:25 UTC Score 18.0 AI-037-20260218-ai-specialis-2df87f06

After Orthogonality: Virtue-Ethical Agency and AI Alignment

Preface This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices [1] : networks of actions, action-dispositions, action-evaluation criteria,

Practical AI Podcast 2026-02-13 15:57 UTC Score 36.0 AI-143-20260213-podcasts-and-2841b1bd

AI incidents, audits, and the limits of benchmarks

AI is moving fast from research to real-world deployment, and when things go wrong, the consequences are no longer hypothetical. In this episode, Sean McGregor, co-founder of the AI Verification & Evaluation Research Institute and also the founder of the AI Incident Database, joins Chris and Dan to discuss AI safety, verification, evaluation, and auditing. They explore why benchmarks often fall short, what red-teaming at DEFCON reveals about machine learning risks, and how organizations can better assess and manage AI systems in practice. Featuring: Sean McGregor– LinkedIn Chris Benson – Website , LinkedIn , Bluesky , GitHub , X Daniel Whitenack – Website , GitHub , X Links: AI Verification & Evaluation Research Institute AI Incident Database 38th convening of IAAI BenchRisk State of Global AI Incident Reporting Upcoming Events: Register for upcoming webinars here !

TWIML AI Podcast 2026-01-29 21:48 UTC Score 37.0 AI-148-20260129-podcasts-and-4af0356b

The Evolution of Reasoning in Small Language Models with Yejin Choi - #761

Today, we're joined by Yejin Choi, professor and senior fellow at Stanford University in the Computer Science Department and the Institute for Human-Centered AI (HAI). In this conversation, we explore Yejin’s recent work on making small language models reason more effectively. We discuss how high-quality, diverse data plays a central role in closing the intelligence gap between small and large models, and how combining synthetic data generation, imitation learning, and reinforcement learning can unlock stronger reasoning capabilities in smaller models. Yejin explains the risks of homogeneity in model outputs and mode collapse highlighted in her “Artificial Hivemind” paper, and its impacts on human creativity and knowledge. We also discuss her team's novel approaches, including reinforcement learning as a pre-training objective, where models are incentivized to “think” before predicting the next token, and "Prismatic Synthesis," a gradient-based method for generating diverse synthetic math data while filtering overrepresented examples. Additionally, we cover the societal implications of AI and the concept of pluralistic alignment—ensuring AI reflects the diverse norms and values of humanity. Finally, Yejin shares her mission to democratize AI beyond large organizations and offers her predictions for the coming year. The complete show notes for this episode can be found at https://twimlai.com/go/761.

Practical AI Podcast 2026-01-20 19:10 UTC Score 29.0 AI-143-20260120-podcasts-and-7a40ecd6

Controlling AI Models from the Inside

As generative AI moves into production, traditional guardrails and input/output filters can prove too slow, too expensive, and/or too limited. In this episode, Alizishaan Khatri of Wrynx joins Daniel and Chris to explore a fundamentally different approach to AI safety and interpretability. They unpack the limits of today’s black-box defenses, the role of interpretability, and how model-native, runtime signals can enable safer AI systems. Featuring: Alizishaan Khatri – LinkedIn Chris Benson – Website , LinkedIn , Bluesky , GitHub , X Daniel Whitenack – Website , GitHub , X Upcoming Events: Register for upcoming webinars here !

TWIML AI Podcast 2025-12-09 19:46 UTC Score 51.0 AI-148-20251209-podcasts-and-5b69421e

Why Vision Language Models Ignore What They See with Munawar Hayat - #758

In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment. The complete show notes for this episode can be found at https://twimlai.com/go/758.

Vector Institute News 2025-11-14 18:45 UTC Score 44.0 USR-0017-20251114-research-aca-0318e19a

When smart AI gets too smart: Key insights from Vector’s 2025 ML Security & Privacy Workshop

Vector Institute’s 2025 Machine Learning Security & Privacy Workshop revealed critical AI safety breakthroughs and concerning vulnerabilities in current machine learning (ML) security methods. This comprehensive analysis covers the latest […] The post When smart AI gets too smart: Key insights from Vector’s 2025 ML Security & Privacy Workshop appeared first on Vector Institute for Artificial Intelligence .

AI Expo Africa 2025-06-30 10:41 UTC Score 21.0 USR-0194-20250630-regional-new-a16f2acd

Cassava Technologies partners with the South African Artificial Intelligence Association to boost local access to AI compute services

Johannesburg, South Africa, 30 June 2025 – Cassava Technologies, a global technology leader of African heritage, is pleased to announce that it has signed a Memorandum of Understanding (MOU) with the South African AI Association (SAAIA), an industry body focused on growing responsible AI adoption, to deliver artificial intelligence (AI) solutions and GPU-as-a-Service (GPUaas) across the […]

AI Now Institute 2025-04-21 19:30 UTC Score 50.0 USR-0135-20250421-ai-specialis-f4a84478

New Report on the National Security Risks from Weakened AI Safety Frameworks

Read paper on arxiv → The AI Now Institute has released a new report, Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds, sounding the alarm on how today’s AI safety efforts, led primarily by industry technologists, are weakening long-established safety protocols and jeopardizing US national security. This report examines […] The post New Report on the National Security Risks from Weakened AI Safety Frameworks appeared first on AI Now Institute .

Lilian Weng Blog 2024-11-28 00:00 UTC Score 47.0 USR-0112-20241128-ai-specialis-1b600ac6

Reward Hacking in Reinforcement Learning

Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function. With the rise of language models generalizing to a broad spectrum of tasks and RLHF becomes a de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user’s preference, are pretty concerning and are likely one of the major blockers for real-world deployment of more autonomous use cases of AI models.

Lilian Weng Blog 2024-02-05 00:00 UTC Score 50.0 USR-0112-20240205-ai-specialis-79c273e2

Thinking about High-Quality Human Data

[Special thank you to Ian Kivlichan for many useful pointers (E.g. the 100+ year old Nature paper “Vox populi”) and nice feedback. 🙏 ] High-quality data is the fuel for modern data deep learning model training. Most of the task-specific labeled data comes from human annotation, such as classification task or RLHF labeling (which can be constructed as classification format) for LLM alignment training. Lots of ML techniques in the post can help with data quality, but fundamentally human data collection involves attention to details and careful execution. The community knows the value of high quality data, but somehow we have this subtle impression that “Everyone wants to do the model work, not the data work” ( Sambasivan et al. 2021 ).

Lilian Weng Blog 2023-10-25 00:00 UTC Score 48.0 USR-0112-20231025-ai-specialis-81866df8

Adversarial Attacks on LLMs

The use of large language models in the real world has strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF ). However, adversarial attacks or jailbreak prompts could potentially trigger the model to output something undesired. A large body of ground work on adversarial attacks is on images, and differently it operates in the continuous, high-dimensional space. Attacks for discrete data like text have been considered to be a lot more challenging, due to lack of direct gradient signals. My past post on Controllable Text Generation is quite relevant to this topic, as attacking LLMs is essentially to control the model to output a certain type of (unsafe) content.

The Gradient 2023-10-07 16:00 UTC Score 19.0 AI-037-20231007-ai-specialis-bd099ece

The Artificiality of Alignment

This essay first appeared in Reboot . Credulous, breathless coverage of “AI existential risk” (abbreviated “x-risk”) has reached the mainstream. Who could have foreseen that the smallcaps onomatopoeia “ꜰᴏᴏᴍ” — both evocative of and directly derived from children’s cartoons —

Lilian Weng Blog 2023-03-15 00:00 UTC Score 37.0 USR-0112-20230315-ai-specialis-c01a9c77

Prompt Engineering

Prompt Engineering , also known as In-Context Prompting , refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics. This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models. At its core, the goal of prompt engineering is about alignment and model steerability. Check my previous post on controllable text generation.

AAAI 2021-10-12 18:24 UTC Score 18.0 AI-081-20211012-research-pap-2c86f3f8

Duke Computer Scientist Wins $1 Million Artificial Intelligence Prize, A ‘New Nobel’

Duke professor Cynthia Rudin is the second recipient of the AAAI Squirrel AI Award for pioneering socially responsible AI. She is being cited for “pioneering scientific work in the area of interpretable and transparent AI systems in real-world deployments, the advocacy for these features in highly sensitive areas such as social justice and medical diagnosis, and serving as a role model for researchers and practitioners.” The post Duke Computer Scientist Wins $1 Million Artificial Intelligence Prize, A ‘New Nobel’ appeared first on AAAI .

Alignment Newsletter 2021-01-04 01:32 UTC Score 35.0 USR-0153-20210104-ai-specialis-4e07dab9

FAQ: Advice for AI alignment researchers

Consider reading How to pursue a career in technical AI alignment. It covers more topics and has more details, and I endorse most if not all of the advice. To quote Andrew Critch: I get a lot of emails from folks with strong math backgrounds (mostly, PhD students in math at top schools) who are […]

Berkeley CHAI Score 37.0 USR-0023-nodate-research-aca-89569f1d

Computational Frameworks for Human Care

Brian Christian, CHAI Affiliate, has published an article titled “ Computational Frameworks for Human Care ” in the most recent issue of Daedalus, the journal of the American Academy of Arts and Sciences. In it, Christian traces how AI alignment has progressed from simple reward mechanisms toward care-like relationships, revealing both the potential and limitations of machine caregiving while deepening our understanding of human care itself. The issue is titled “The Social Science of Caregiving” and was co-edited by CHAI Affiliate Alison Gopnik.