GLM-5.2-NVFP4 is now ready to serve in vLLM. NVIDIA just dropped the official NVFP4 checkpoint of Z.ai's GLM-5.2, the 744B-parameter MoE model built for long-horizon coding and agentic tasks, and it's already deployable with a single vllm serve command. The headline promise: smaller memory footprint than FP8, same accuracy.
The model underneath
GLM-5.2 is an open-weights model from Z.ai (formerly Zhipu AI), tuned heavily for software engineering, multi-step reasoning, and tool-augmented agent work. It builds on the Mixture-of-Experts (MoE) foundation introduced with GLM-5 and GLM-5.1, extending the context window to a usable 1 million tokens while preserving strong coding performance.
The MoE design has approximately 753B total parameters, of which roughly 40B are active per token. That second number is what drives compute cost: each token passes through only about 40B parameters, roughly 5% of the total, although all 753B must still be held in memory. The headline change over GLM-5 / 5.1 is that Multi-Token Prediction (MTP) is extended from 3 to 5 draft tokens, lifting end-to-end throughput on reasoning, coding, and agentic workloads. MTP serves as a speculative decoding technique: the model drafts several future tokens at once and then verifies them, so every accepted draft token is output produced without a separate decoding step.
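To get a feel for why going from 3 to 5 draft tokens can help, here is a rough back-of-the-envelope sketch. It assumes each draft token is accepted independently with the same probability and that drafting stops at the first rejection, which is a common simplification for speculative decoding and not a description of GLM-5.2's measured behavior. The function name and the acceptance rates are hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption: each draft token is accepted independently with probability
    accept_prob, and everything after the first rejection is discarded.
    The verification pass always yields one token of its own, so the
    expectation is 1 + p + p^2 + ... + p^num_draft.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only for illustration.
    for p in (0.6, 0.8, 0.9):
        three = expected_tokens_per_step(p, 3)
        five = expected_tokens_per_step(p, 5)
        print(f"p={p}: 3 drafts -> {three:.2f} tokens, 5 drafts -> {five:.2f} tokens")
```

Under this model the two extra draft tokens only pay off when acceptance is high: at low acceptance rates the later drafts are rarely reached, so the gain shrinks toward zero.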
GLM-5.2 also introduces IndexShare, which shares a single attention indexer across each group of four sparse attention layers, reducing per-token FLOPs by 2.9x at a 1M context length. This is what makes a 1M-token context window practical rather than just a marketing number.
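The sharing idea can be illustrated with a toy sketch. This is not Z.ai's implementation: it assumes that "reusing the indexer" means the token selection is computed once at the first layer of each group of four and reused by the other three, and it stands in for the real indexer with random relevance scores. The names `select_topk` and `run_layers` are hypothetical.

```python
import numpy as np

GROUP_SIZE = 4  # from the text: one indexer per four sparse attention layers


def select_topk(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring context tokens (unordered)."""
    return np.argpartition(scores, -k)[-k:]


def run_layers(num_layers: int, context_len: int, k: int, seed: int = 0):
    """Count indexer invocations when the selection is shared within a group."""
    rng = np.random.default_rng(seed)
    indexer_calls = 0
    selected = None
    per_layer_selection = []
    for layer in range(num_layers):
        if layer % GROUP_SIZE == 0:
            # Placeholder for the indexer: random scores over the context.
            scores = rng.standard_normal(context_len)
            selected = select_topk(scores, k)
            indexer_calls += 1
        # Layers inside a group attend to the same selected tokens.
        per_layer_selection.append(selected)
    return indexer_calls, per_layer_selection


if __name__ == "__main__":
    calls, _ = run_layers(num_layers=8, context_len=1000, k=16)
    print(f"indexer ran {calls} times for 8 layers")
```

In this toy version the indexer runs once per four layers instead of once per layer. The 2.9x figure quoted above is a whole-model FLOP reduction, so it is not expected to equal the 4x drop in indexer calls, since the attention and MoE compute are unchanged.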
What NVFP4 actually is
NVFP4 is not your typical INT4 quantization. It is a 4-bit floating-point format introduced with NVIDIA's Blackwell architecture, designed to preserve model accuracy at ultra-low precision through a two-level scaling strategy: an FP8 (E4M3) scale for each block of values, plus a single FP32 scale per tensor. Each block holds 16 values, compared with 32 in the older MXFP4 format, so the scales adapt more locally to the data's dynamic range and quantization error drops.
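A simplified simulation makes the block-scaling idea concrete. The sketch below rounds values to the eight magnitudes a 4-bit E2M1 float can represent, using one scale per block. It is an approximation under stated assumptions: scales are kept in float32 instead of FP8, the per-tensor second-level scale is omitted, and the function names are hypothetical. It is not NVIDIA's kernel or the code used to produce the checkpoint.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quantize_fp4(x: np.ndarray, block_size: int) -> np.ndarray:
    """Quantize a 1-D array to FP4 with one scale per block, then dequantize.

    Simplification: the block scale stays in float32 and there is no
    per-tensor scale, unlike the real two-level NVFP4 scheme.
    """
    assert x.ndim == 1 and x.size % block_size == 0
    blocks = x.astype(np.float32).reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the top of the FP4 grid.
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round every magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    dequantized = np.sign(scaled) * FP4_GRID[idx] * scale
    return dequantized.reshape(x.shape)


def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective storage cost per weight, counting the per-block scale."""
    return element_bits + scale_bits / block_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(4096).astype(np.float32)
    for block in (16, 32):
        err = np.mean((weights - fake_quantize_fp4(weights, block)) ** 2)
        print(f"block={block}: mean squared error {err:.6f}")
    print("bits/weight with 8-bit scale per 16:", bits_per_weight(4, 8, 16))
```

Smaller blocks cost slightly more storage: an 8-bit scale per 16 values works out to 4.5 bits per weight versus 4.25 bits with one scale per 32. That is still a little over half of FP8's 8 bits per weight, which is where the memory saving comes from for the layers that are actually quantized.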