Yeah. That’s where things usually go sideways.
glm 5 is here, and GLM-5 is very clearly aiming at that long-horizon, agent-style grind. Not just “autocomplete, but nicer.”
People are already buzzing about it in dev circles. I’m going to keep this grounded in stuff we can actually point to and run.
Primary source: the official GLM-5 repo README on GitHub
https://github.com/zai-org/GLM-5
What actually shipped with GLM-5
If you need the crisp, snippet-friendly version, it’s basically this:
- Scale jump. GLM-5 goes from 355B parameters in GLM-4.5 to 744B total parameters (40B active).
- More pretraining data. Pretraining tokens move from 23T up to 28.5T.
- Attention efficiency. GLM-5 integrates DeepSeek Sparse Attention to cut deployment cost while keeping long-context capacity.
- Long-horizon focus: it’s described as “purpose-built” for complex systems engineering and long-horizon agentic tasks.
- Benchmark headline: on Vending Bench 2, GLM-5 is reported as #1 among open-source models, finishing a 1-year simulated vending-machine business with a $4,432 final balance. Their write-up says this approaches Claude Opus 4.5.
All of that is straight from the README:
https://github.com/zai-org/GLM-5
So yeah, glm 5 is here mostly translates to “agentic engineering is now a first-class target.” Not just “bigger number, bigger ego.”
GLM-5 scale, the “744B” bit, and why anyone should care
That wording, “744B parameters”, is pulling a lot of weight.
In practice, an “active” count usually hints at sparse activation, often Mixture-of-Experts-ish behavior. Meaning you don’t necessarily pay the full 744B compute cost per token at inference time, because only part of the network fires each step. The README snippet we have doesn’t go deep on architecture, so I’ll keep it conservative and plain:
It’s huge on paper, and it’s built so inference can be more practical than a dense 744B model.
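To make the total-versus-active distinction concrete, here's the back-of-envelope arithmetic. Every number below is hypothetical, picked only to show how the math works; the README snippet we have doesn't disclose GLM-5's actual expert layout.

```python
# Back-of-envelope illustration of total vs. active parameters in a
# Mixture-of-Experts model. All numbers here are HYPOTHETICAL -- the
# GLM-5 README snippet doesn't spell out the architecture.

def moe_param_counts(shared_b, num_experts, expert_b, top_k):
    """Return (total, active) parameter counts in billions.

    shared_b:    params always used (attention, embeddings, shared FFN)
    num_experts: experts per MoE layer group
    expert_b:    params per expert
    top_k:       experts actually routed to per token
    """
    total = shared_b + num_experts * expert_b
    active = shared_b + top_k * expert_b
    return total, active

# Hypothetical config: 24B shared, 120 experts of 6B each, top-2 routing.
total, active = moe_param_counts(shared_b=24, num_experts=120, expert_b=6, top_k=2)
print(f"total: {total}B, active per token: {active}B")
# -> total: 744B, active per token: 36B
```

The point isn't the specific numbers; it's that per-token compute scales with the active count, while memory still has to hold the total.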
Still… if you’re thinking self-hosting, don’t kid yourself. This is multi-GPU territory. You’ll be living in the world of tensor parallelism, memory utilization, and quantized or FP8 checkpoints when available.
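The napkin math on hardware takes a few lines. The 1 byte/param FP8 figure and the 20% overhead factor for KV cache, activations, and CUDA context are my assumptions, not anything from the README:

```python
# Rough VRAM back-of-envelope for serving a large model across GPUs.
# Assumptions (mine, not the README's): FP8 weights = 1 byte/param,
# ~20% overhead for KV cache, activations, and CUDA context.
import math

def min_gpus(params_b, gpu_gb, bytes_per_param=1.0, overhead=1.2):
    """Floor estimate of GPUs needed just to fit the weights."""
    weights_gb = params_b * bytes_per_param  # 1B params @ 1 byte ~= 1 GB
    needed_gb = weights_gb * overhead
    return math.ceil(needed_gb / gpu_gb)

print(min_gpus(744, gpu_gb=141))  # 141 GB cards (H200-class) -> 7
print(min_gpus(744, gpu_gb=80))   # 80 GB cards (A100/H100-class) -> 12
```

Treat this as a floor, not a plan: real deployments want headroom for long contexts and batching, which is part of why the repo's own example uses tensor parallelism across 8 GPUs.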
glm 5 is here for long-horizon agents, not just chatty Q&A
Z.ai frames GLM-5 as a move “from vibe coding to agentic engineering.” And honestly? If you’ve watched real systems fail, it lands.
Because the hard parts aren’t clever one-liners. They’re stuff like:
- Breaking tasks down without producing an infinite TODO list nobody finishes
- Tracking state across time. Files changed, tests run, artifacts produced
- Tool use that doesn’t melt down. Shell, git, linters, package managers
- Recovering when things fail, since they will fail, and usually at 2 a.m.
GLM-5 is positioned around a loop: build over time, don’t just answer prompts.
If I were writing the post with visuals, I’d drop in something like this:
- Diagram idea. “Agent loop for GLM-5: plan → tool call → observe → update state → iterate”
Alt text: GLM-5 long-horizon agent loop showing planning, tool calls, observation, memory/state update, and iteration for systems engineering tasks.
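That loop is easy to sketch in code. This is a generic, model-agnostic skeleton of the plan → tool call → observe → update-state cycle; the planner and tool registry here are toys, not anything GLM-5-specific:

```python
# Minimal sketch of the plan -> tool call -> observe -> update-state loop.
# The planner and tools are stand-ins; nothing here is GLM-5-specific.

def run_agent(plan_fn, tools, max_steps=10):
    state = {"history": [], "done": False}
    for _ in range(max_steps):
        action = plan_fn(state)                           # plan
        if action["tool"] == "finish":
            state["done"] = True
            break
        result = tools[action["tool"]](**action["args"])  # tool call
        state["history"].append((action, result))         # observe + update state
    return state                                          # iterate until done/budget

# Toy example: an "agent" that runs one tool, then decides it's finished.
def toy_planner(state):
    if not state["history"]:
        return {"tool": "echo", "args": {"text": "running tests"}}
    return {"tool": "finish", "args": {}}

state = run_agent(toy_planner, {"echo": lambda text: text.upper()})
print(state["done"], state["history"][0][1])  # True RUNNING TESTS
```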
Benchmarks: Vending Bench 2, CC-Bench-V2, and how much you should believe
Benchmarks are useful. They’re also slippery. Both things can be true.
From the repo README:
Vending Bench 2. GLM-5 ranks #1 among open-source models and ends with $4,432 after running a simulated vending machine business for one year. This is described as measuring “long-term operational capability.”
Source: https://github.com/zai-org/GLM-5
CC-Bench-V2. GLM-5 “significantly outperforms GLM-4.7” across “frontend, backend, and long-horizon tasks,” and they say it narrows the gap to Claude Opus 4.5.
Source: https://github.com/zai-org/GLM-5
My take: Vending Bench 2 is interesting because it forces persistence, bookkeeping, and planning under constraints. That looks a lot more like real automation work than another short-form math sprint. But I’d still read it as directional, not gospel.
And no, glm 5 is here doesn’t magically make every other model irrelevant. It’s more like a signpost: open models are getting better at the stuff people actually try to automate.
Running GLM-5 locally with vLLM
The repo says vLLM, SGLang, and xLLM support local deployment, and it includes example commands. I’ll stick with the vLLM route here because it’s a common way to stand up an OpenAI-style endpoint.
1) Install vLLM (their example uses nightly)
From the GLM-5 README:
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
Source: https://github.com/zai-org/GLM-5
2) Serve GLM-5 (example from the repo)
vllm serve zai-org/GLM-5-FP8 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.85 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5-fp8
Source: https://github.com/zai-org/GLM-5
3) Call it like OpenAI Chat Completions
vLLM’s OpenAI-compatible server docs:
https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
Example Python (adapted from vLLM docs):
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
resp = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "user", "content": "Write a Makefile target to run pytest with coverage."}
    ],
)
print(resp.choices[0].message.content)
If you already have tooling pointed at OpenAI-compatible endpoints (Open WebUI, custom clients, internal gateways), this tends to be the least-painful route.
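Since the serve command enables auto tool choice, tool calling is the other half of the story. Here's the shape of an OpenAI-style tool schema (passed as `tools=` to `chat.completions.create`) plus a local dispatcher for the structured call the model returns; `run_tests` is a made-up illustrative tool, not part of GLM-5 or vLLM:

```python
# Sketch of an OpenAI-style tool schema and a local dispatcher for the
# structured tool call the model sends back. run_tests is a made-up
# illustrative tool, not part of GLM-5 or vLLM.
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's pytest suite for a given path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]
# Pass tools=TOOLS in client.chat.completions.create(...); when the model
# answers with a tool call, you get a name plus JSON arguments back.

def dispatch(tool_call, registry):
    """Decode the model's tool call and run the matching local function."""
    args = json.loads(tool_call["arguments"])
    return registry[tool_call["name"]](**args)

registry = {"run_tests": lambda path: f"ran pytest in {path}: 12 passed"}
print(dispatch({"name": "run_tests", "arguments": '{"path": "tests/unit"}'},
               registry))  # ran pytest in tests/unit: 12 passed
```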
Attention and RL infra: DSA and “slime,” per the README
Two bits from the README are worth flagging, because they’re about the boring stuff that decides whether you can deploy any of this.
- DeepSeek Sparse Attention (DSA): GLM-5 integrates DSA to “largely” reduce deployment cost while preserving long-context capacity.
- slime (async RL infra): they describe “slime” as a novel asynchronous RL infrastructure to improve RL training throughput and efficiency.
Source: https://github.com/zai-org/GLM-5
I like seeing those called out. Attention cost can be the tax that wrecks long-context deployments. RL pipelines can be the tax that wrecks iteration speed. Nothing glamorous about either one, but both matter.
Best practices for using GLM-5 in coding and agent workflows without turning your repo into soup
If you’re going to treat GLM-5 like an “agent brain,” a few habits keep things from getting… weird.
- Make it show receipts. Commands run, files touched, tests executed
- Keep a task journal. A simple STATUS.md the agent updates each iteration works surprisingly well
- Pin tool outputs. Save lint, test, and build logs so it references real errors, not vibes
- Put guardrails on it. Restrict shell commands and file access because even good models do dumb stuff when rushed
- Evaluate on your own repo. If GLM-5 shines on “long-horizon,” test it with multi-step PRs, not one-off helper functions
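The task-journal habit above is only a few lines of code. This sketch appends a timestamped record of each iteration to STATUS.md so the next step has receipts to read; the field names are my own convention, not anything GLM-5 prescribes:

```python
# Tiny sketch of the "task journal" habit: append a timestamped record of
# each agent iteration to STATUS.md. Field names are my own convention.
from datetime import datetime, timezone
from pathlib import Path

def log_iteration(journal: Path, commands, files_touched, test_summary):
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = [
        f"## {stamp}",
        "Commands: " + "; ".join(commands),
        "Files: " + ", ".join(files_touched),
        f"Tests: {test_summary}",
        "",
    ]
    with journal.open("a") as f:  # append, creating the file if needed
        f.write("\n".join(entry) + "\n")

log_iteration(Path("STATUS.md"),
              ["pytest -q"], ["src/app.py"], "42 passed, 1 skipped")
print(Path("STATUS.md").read_text().splitlines()[1])  # Commands: pytest -q
```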
And if you’re skeptical, good. Shipping agents into production should feel a little uncomfortable. Just a little.
Where glm 5 is here fits in the bigger picture
Community reaction is all over the place. Some people are hyped that non‑US models are closing the gap. Others are blunt that GLM-5 “is not there yet” versus the very top closed models. You can see that vibe in the Reddit thread:
https://www.reddit.com/r/singularity/comments/1r22g1l/glm5_is_here/
I’m somewhere in the middle.
GLM-5 looks serious on paper with the scale, data, and long-horizon benchmark claims. The deployment story looks practical too, with vLLM/SGLang paths, parsers, FP8. But whether it’s “frontier-level agent in my environment” is still something you only learn by running it and watching what breaks.
That’s engineering.
Conclusion
glm 5 is here, and the interesting part isn’t the parameter count. It’s the very explicit push toward long-horizon agentic tasks and systems engineering, with concrete claims like 744B/40B active, 28.5T tokens, DSA, plus the Vending Bench 2 result ($4,432, #1 open-source per their README).
If you try GLM-5, I’d genuinely love to hear what breaks first: tool calling, planning, or the classic dependency hell.
And if you’re thinking about deploying models as internal services, you might also like:
- Internal: https://www.basantasapkota026.com.np/2026/01/ai-model-hype-are-new-versions-really.html
- Internal: https://www.basantasapkota026.com.np/2025/11/11-powerful-apis-for-your-next-project.html
Drop a comment with your hardware setup and whether you used vLLM or SGLang. Those details matter way more than leaderboard screenshots.