Recent tweets and retweets from OpenAI
This is an early step toward more robustly beneficial and aligned models: training models to carry beneficial traits into new situations, so as AI becomes more capable, it also becomes more reliable, transparent, and helpful for people.
We also tested whether alignment persisted under pressure.
The model was harder to steer toward harmful behavior with adversarial prompts, while remaining responsive to helpful instructions.
We saw preliminary evidence of greater resistance to harmful fine-tuning.
The most interesting test was cross-domain transfer.
When beneficial behavior training was limited to health conversations, the model still improved on non-health evaluations of misalignment, deception, and reward hacking—even though those tasks looked very different from the…
A small amount of this data produced broad gains beyond the training scenarios.
Compared with a compute-matched baseline, the trained model improved on 44 of 53 independent evaluations of alignment and benefits, spanning deception, reward hacking, safety, health, and mental…
We trained models with reinforcement learning on realistic conversations to reinforce beneficial traits like truthfulness, humility under uncertainty, openness to correction, fairness, and concern for human welfare, across 12 domains, including health, science, and education.