Putting up Bumpers

This is a linkpost for Putting up Bumpers on the Anthropic Alignment Blog.

April 23, 2025

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.

To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.

On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this (a rough code sketch of the loop follows the list):

  1. Pretraining: We start with a pretrained base model.
  2. Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.
  3. Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We’d draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.
  4. Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly identify the approximate cause.
  5. Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
  6. Repeat: We repeat this process until we’ve developed a complete system that appears, as best we can tell, to be well aligned. Or, if we find ourselves repeatedly failing in consistent ways, we change plans and try to articulate as best we can why alignment doesn’t seem tractable.
  7. Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep substantial monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to step 4.
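
To make the control flow concrete, here is a deliberately simplified sketch of this loop as code. Every name in it (Model, finetune, alignment_audits, and so on) is a hypothetical placeholder standing in for a large research effort rather than a real API, and the rewind step is simplified to always return to the same starting checkpoint; only the overall shape of the loop is meant to be informative.

```python
# Minimal control-flow sketch of steps 1-7 above. All functions are
# illustrative placeholders, not real training or auditing code.

from dataclasses import dataclass, field


@dataclass
class Model:
    """Stand-in for a checkpoint plus the lessons learned while training it."""
    name: str
    lessons: list[str] = field(default_factory=list)


def finetune(base: Model) -> Model:
    """Step 2: HHH finetuning (preference data, Constitutional AI, oversight)."""
    return Model(name=base.name + "+finetuned", lessons=list(base.lessons))


def alignment_audits(model: Model) -> list[str]:
    """Step 3: many largely-independent audits run in parallel; returns warning signs."""
    return []  # placeholder: a real audit suite would populate this


def diagnose(warnings: list[str]) -> str:
    """Step 4: quickly form an approximate hypothesis about the cause."""
    return "suspected cause: " + "; ".join(warnings)


def align_with_bumpers(base: Model, max_attempts: int = 5) -> Model | None:
    """Steps 2-6: finetune -> audit -> diagnose -> rewind, until audits pass."""
    for _ in range(max_attempts):
        candidate = finetune(base)
        warnings = alignment_audits(candidate)
        if not warnings:
            return candidate  # step 6: no warning signs; proceed to deployment
        # Step 5: rewind (here, simply back to the same base), keeping what we learned.
        base.lessons.append(diagnose(warnings))
    return None  # repeated, consistent failure: reassess whether alignment is tractable


def deploy_with_monitoring(model: Model) -> None:
    """Step 7: post-deployment monitoring as secondary bumpers (placeholder)."""
    ...


if __name__ == "__main__":
    aligned = align_with_bumpers(Model(name="pretrained-base"))  # step 1: pretrained base
    if aligned is not None:
        deploy_with_monitoring(aligned)
```

In practice, step 5 would rewind only as far as needed and the audit suite would combine many independent methods, but the structure (finetune, audit, diagnose, rewind, repeat) stays the same.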

The hypothesis behind the Bumpers approach is that, on short timelines, (i) this process is fairly likely to converge, after a reasonably small number of iterations, on an end state in which we are no longer observing warning signs for misalignment, and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.

We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.



For the rest of the essay, continue to: Putting up Bumpers on the Anthropic Alignment Blog.
