o1's new CBRN report looks (a fair bit) better
OpenAI released a new ChemBio safety card for o1. It fixes a lot of my original issues (but not all of them).
[This is a slightly edited blog post version of my tweet thread].
A few weeks ago, I “peer-reviewed” o1-preview's ChemBio safety card and highlighted some issues with its methodology.
Now that o1 is out, how does it stack up?
Better! (Though there’s still room for improvement.)
Here’s my new o1 scorecard. 🧵👇
Credit where it’s due.
The new system card improved on the old one:
More comparisons to PhD baselines (they now exist for 3/5 evals, vs. 0/3 before)
Multiple-choice tests converted to open-ended, making them more realistic
Clear acknowledgment these results are "lower bounds"
Some things could still be improved:
o1 underperformed PhDs on *one* lab-skill eval (out of 5!), and it's not clear how that test was scored
OAI says tinkering could boost scores, but doesn't say by how much (other orgs try to forecast this)
Results are from a "near-final" o1 version. Some note that the final released version likely does better.
Some critical points:
Previously, I flagged that o1-preview's 69% score on the Gryphon eval might match PhDs.
Turns out, experts score 57%, so OAI passed this eval *months* ago. I hope they declare such results in the future.
(I'd keep an eye on the multimodal eval, which has no PhD baseline yet.)
Big picture:
AIs keep saturating dangerous capability tests. With o1 we “ratcheted up” from multiple-choice to open-ended evals. But that won’t hold for long.
We need harder evals: ones where an AI succeeding would suggest a real risk. (No updates yet on OAI's wet-lab study.)
My verdict:
One test suggests the "lower bound" o1 lacks wet-lab skills; the other four can't rule it out. It's plausible o1 was ~fine to deploy, but this remains a subjective call.
The report being clearer and more nuanced helps build trust. The next one should go further—and include harder evaluations.
Call to action
Want to improve the “science of evals” and make dangerous capability tests more realistic? Tell us your ideas!
Open Philanthropy has supported many tests that OAI and others now use—including work by people who are skeptical of AGI and AI risks.
Better evidence = better decisions