We wrapped a live session on M3 yesterday with the @togethercompute team & our researchers @zpysky1125 and @HaohaiSun
A few highlights 🧵
1. MSA (MiniMax Sparse Attention) is the star ⭐️. Unlike CSA/HCA, which compress the KV cache, MSA keeps the real, uncompressed KV and…
Discuss this model
Add corrections, implementation notes, pricing changes, or usage caveats for other readers.