A Chinese language model with just three billion parameters sometimes matches models a hundred times larger on math and coding tasks. The researchers behind it have developed a hypothesis about how AI capabilities are structured.

Weibo's parent company Sina has released a small language model that competes with today's top models on hard math and coding tasks. According to a technical report, VibeThinker-3B performs on par with DeepSeek V3.2 and Kimi K2.5 on competitive benchmarks like AIME26. Both of those models have 200 to 333 times more parameters.

Sina positions the model as an experiment in figuring out how much compute a model actually needs to compete at the top. Its predecessor, VibeThinker-1.5B, launched in November 2025. The new version pushes further, asking whether a small model can hit genuine top-tier performance, not just be "good for its size."

Six bar charts compare VibeThinker-3B against Qwen3.6 Plus, Gemini 3 Pro, GLM-5, Kimi K2.5, and Claude Opus 4.5 across the benchmarks AIME'25, AIME'26, LiveCodeBench v6, IMO-AnswerBench, HMMT'25, and IFBench. Hatched bar extensions show gains from CLR test-time scaling.
Across six math and coding benchmarks, the 3B model (orange) falls within the performance range of five current top models including Gemini 3 Pro, GLM-5, and Claude Opus 4.5. | Image: Sina Weibo

Logic scales down, factual knowledge doesn't

The results tell two different stories. On structured tasks with clearly verifiable solutions, like math olympiads or programming challenges, VibeThinker-3B matches models like GLM-5 or Gemini 3 Pro. On LiveCodeBench, it beats every other model under 20 billion parameters.

Factual knowledge is the weak spot. On the knowledge-heavy GPQA-Diamond benchmark, the model trails its much larger competitors by a wide margin.

Horizontal bar chart showing IMO-AnswerBench scores for open-source reasoning models with their parameter counts. VibeThinker-3B scores 76.4 with three billion parameters and 80.6 with CLR, putting it in the range of DeepSeek V3.2 (671B), GLM-5 (744B), and Kimi K2.5 (1T).
VibeThinker-3B nearly matches DeepSeek V3.2, GLM-5, and Kimi K2.5 on IMO-AnswerBench despite being hundreds of times smaller. | Image: Sina Weibo
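
For a sense of scale, the size gap can be worked out from the parameter counts listed in the chart description above. This is a quick back-of-the-envelope sketch: "1T" is read as 1,000 billion, and the function and variable names are made up for illustration.

```python
# Parameter counts in billions, as listed in the chart description.
PARAMS_B = {
    "VibeThinker-3B": 3,
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
    "Kimi K2.5": 1000,  # assumption: "1T" read as 1,000 billion
}


def size_ratio(model: str, baseline: str = "VibeThinker-3B") -> float:
    """Return how many times larger `model` is than `baseline`."""
    return PARAMS_B[model] / PARAMS_B[baseline]


# Rounded size ratios for every model except the baseline itself.
ratios = {
    name: round(size_ratio(name))
    for name in PARAMS_B
    if name != "VibeThinker-3B"
}
print(ratios)  # {'DeepSeek V3.2': 224, 'GLM-5': 248, 'Kimi K2.5': 333}
```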

To rule out data contamination, the team had the model compete in LeetCode contests held between late April and late May 2026, after training wrapped up. VibeThinker-3B solved 123 out of 128 problems on the first try. That puts it ahead of GPT-5.2, Qwen3-Max, Kimi K2.5, and Claude Opus 4.6. It trails only GPT-5.3-Codex, Gemini 3.1 Pro, and Gemini 3 Flash, but not by much.
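
The idea behind that check can be sketched in a few lines: only score problems published after the training cutoff, then compute the first-try solve rate. The cutoff date and contest records below are hypothetical placeholders, since the article gives only the rough time window; the 123 and 128 figures come from the text.

```python
from datetime import date

# Hypothetical cutoff: the article only says training ended before late April 2026.
TRAINING_CUTOFF = date(2026, 4, 20)

# Hypothetical contest records (name, date held); not real contest data.
contests = [
    ("contest-a", date(2026, 3, 29)),
    ("contest-b", date(2026, 4, 26)),
    ("contest-c", date(2026, 5, 24)),
]


def held_after_cutoff(records, cutoff):
    """Keep only contests the model cannot have seen during training."""
    return [name for name, held in records if held > cutoff]


clean = held_after_cutoff(contests, TRAINING_CUTOFF)
print(clean)  # ['contest-b', 'contest-c']

# First-try solve rate reported in the article: 123 of 128 problems.
solved, total = 123, 128
print(f"{solved / total:.1%}")  # 96.1%
```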

Post-training does the heavy lifting

VibeThinker-3B builds on Alibaba's Qwen2.5-Coder-3B. Sina's contribution is the post-training, everything that happens after generic pre-training on large data sets. According to the report, that's what brings a 3B model close to the top performers.

Post-training happens in stages. First, the model learns a broad range of tasks through supervised fine-tuning, covering math, coding, and general dialogue. Then it gets tailored for hard, multi-step reasoning problems.

Reinforcement learning follows, applied sequentially to math, programming, and STEM. Self-distillation then consolidates the skills from each phase into a single model. A final step improves how well the model follows instructions.
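
The order of the stages described above can be written down as plain data. The sketch below does no training; it only threads a checkpoint label through the stages in sequence. The stage names and the `Stage` and `run` helpers are invented for illustration, and the strictly linear order is a simplification, since the report's diagram shows self-distillation as a feedback loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str    # illustrative label, not taken from the report
    method: str  # "sft", "rl", or "distill"
    focus: str


PIPELINE = [
    Stage("sft-broad", "sft", "math, coding, general dialogue"),
    Stage("sft-reasoning", "sft", "hard multi-step reasoning"),
    Stage("rl-math", "rl", "math"),
    Stage("rl-code", "rl", "programming"),
    Stage("rl-stem", "rl", "STEM"),
    Stage("self-distill", "distill", "consolidate skills into one model"),
    Stage("rl-instruct", "rl", "instruction following"),
]


def run(pipeline, checkpoint="Qwen2.5-Coder-3B"):
    """Return the chain of checkpoint labels produced stage by stage."""
    history = [checkpoint]
    for stage in pipeline:
        checkpoint = f"{checkpoint}+{stage.name}"
        history.append(checkpoint)
    return history


for label in run(PIPELINE):
    print(label)
```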

Flowchart of the VibeThinker-3B training pipeline, from the base model through two-stage supervised fine-tuning and multi-stage reasoning RL for math, code, and STEM to a final instruct RL phase, with offline self-distillation as a feedback loop.
Post-training drives the performance leap: two-stage supervised fine-tuning, multi-stage reasoning RL for math, code, and STEM, and a final instruction phase for prompt adherence. | Image: Sina Weibo

During fine-tuning, the team deliberately builds up a wide variety of solution paths. Reinforcement learning then strengthens the ones that work. The argument is that performance comes from training methods, data quality, and reliable validation signals rather than from more parameters.
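
A toy example shows the principle of a verifiable reward strengthening the solution paths that work. It is not Sina's training algorithm: the two strategies, the sampled answers, and the simple reward-weighted update are all assumptions chosen to keep the sketch small.

```python
def verifiable_reward(answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def reinforce_step(weights, sampled, rewards, lr=0.2):
    """Shift weight toward strategies whose samples beat the average reward."""
    baseline = sum(rewards) / len(rewards)
    updated = dict(weights)
    for strategy, reward in zip(sampled, rewards):
        updated[strategy] = max(1e-6, updated[strategy] + lr * (reward - baseline))
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


# Hypothetical rollout: two solution strategies, four sampled answers to one problem.
weights = {"guess": 0.5, "check-and-correct": 0.5}
sampled = ["guess", "check-and-correct", "guess", "check-and-correct"]
answers = ["41", "42", "40", "42"]
reference = "42"

rewards = [verifiable_reward(a, reference) for a in answers]
updated = reinforce_step(weights, sampled, rewards)
print(rewards)  # [0.0, 1.0, 0.0, 1.0]
print(updated)  # "check-and-correct" now carries more weight than "guess"
```

The reward needs no human judgment, only a reference answer to compare against, which is why this kind of training suits math and coding tasks in particular.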

What this means for how AI capabilities work

Based on these results, the authors propose what they call the "Parametric Compression-Coverage Hypothesis." It holds that different AI capabilities have different structures and therefore need different numbers of parameters.

Logical reasoning, like solving a math problem step by step, relies on a few recurring patterns: searching, checking conditions, correcting errors, and combining intermediate results. That kind of skill can be packed into a compact core. World knowledge works differently. Answering open-ended questions across many topics requires broad coverage, meaning lots of parameters storing lots of facts.

This reframes what small models are for, the researchers say. They're not just cheap, lightweight versions built for budget inference but an independent research path running parallel to traditional scaling logic. Where tasks are verifiable and have clear solution structures, parameter count isn't the bottleneck anymore.

VibeThinker-3B is openly available on Hugging Face and GitHub.

Small models catching up to much larger systems on narrow tasks is becoming a pattern. In April, Alibaba's Qwen3.6-27B outperformed its predecessor, which was 15 times larger, across all coding benchmarks. Falcon H1R 7B from Abu Dhabi hit the performance level of models two to seven times its size, according to its makers. Earlier studies on logical gaps in small models suggested they generally hit a wall on multi-step reasoning. The VibeThinker results on verifiable tasks challenge exactly that assumption.
