GLM-5.2 is not only stronger on benchmarks but also much better in real app development scenarios, including iOS, Android, WeChat Mini Programs, and more.
Behind this jump is a full loop spanning environment construction, evaluation, data optimization, reward design, and training.
Real…
Long-horizon is more than a concept. It should live in real-world scenarios, empowering AI builders to solve the problems that matter.
And more scenarios are on the way.
GLM-5.2 delivers a substantial leap in app development, which is itself a demanding long-horizon task.
Results:
- GLM-5.1: 21/70
- GLM-5.2: 48/70
- Claude Fable 5: 56/70
That's more than a twofold improvement from GLM-5.1 to GLM-5.2.
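The "more than twofold" figure can be checked directly from the scores listed above. The snippet below is a minimal sketch that assumes each score means tasks solved out of 70; the helper names `pass_rate` and `improvement_ratio` are illustrative only and are not part of any published evaluation harness.

```python
from fractions import Fraction

# Assumption: each reported score is "tasks solved out of 70".
TOTAL_TASKS = 70
scores = {"GLM-5.1": 21, "GLM-5.2": 48, "Claude Fable 5": 56}


def pass_rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Fraction of tasks solved, as a float between 0 and 1."""
    return solved / total


def improvement_ratio(new: int, old: int) -> Fraction:
    """Exact ratio of the new score to the old score."""
    return Fraction(new, old)


# Print each model's score alongside its pass rate.
for name, solved in scores.items():
    print(f"{name}: {solved}/{TOTAL_TASKS} = {pass_rate(solved):.1%}")

# Compare GLM-5.2 against GLM-5.1.
ratio = improvement_ratio(scores["GLM-5.2"], scores["GLM-5.1"])
print(f"GLM-5.2 vs GLM-5.1: {float(ratio):.2f}x")
```

Under that assumption the ratio is 48/21 = 16/7, or roughly 2.29, which is consistent with the claim of a more than twofold improvement.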
These come from an…
Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work
AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with…