GLM-5.2 Cuts Agent Task Costs 4× vs Claude Opus 4.7—Here's What the Traces Actually Show (2026)
- Evaluate GLM-5.2 against Claude Opus 4.7 using agent-trace and cost-per-task evidence, not synthetic leaderboard rank alone
- Interpret the Semgrep IDOR benchmark correctly, including its asymmetric test-condition disclosure
- Decide whether GLM-5.2 is cost-justified for your agent workload using the AA-Briefcase cost ratio
GLM-5.2 (Z.ai, June 2026) scores 62.1% on SWE-bench Pro versus Claude Opus 4.7's 64.3% — a 2-point gap — while costing $2.40 per AA-Briefcase agent task versus $10.40 for Claude Opus 4.8. Multiple independent practitioners now converge on the same verdict: frontier-adjacent for production coding agent loops. The benchmarks are nuanced; the cost case is not.
The headline circulating this week is Semgrep's: "GLM-5.2 beats Claude in our cyber benchmarks."^[1] It is technically true and structurally misleading. Semgrep ran Claude with endpoint-discovery scaffolding; GLM-5.2 ran without. That asymmetry — which Semgrep disclosed prominently in their post, and which the amplification loop dropped — accounts for the most dramatic part of the gap. Equalize the scaffolding and the difference likely shrinks. What survives scrutiny is still worth reading: an unscaffolded open-weight model came within 2 F1 points of a scaffolded frontier model at $0.17 per detected vulnerability. The cost argument holds even with the caveat applied.
Why the Benchmark Table Needs an Asterisk
The two metrics that matter most to practitioners building coding agents in 2026:
| Metric | GLM-5.2 | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|
| SWE-bench Pro | 62.1% | 64.3% | ~65% (est.) |
| AA-Briefcase Elo | 1,266† | — | 1,356† |
| Cost per AA-Briefcase task | $2.40 | — | $10.40 |
| AA Intelligence Index v4.1 | 51 (#1 open-weight) | — | 56 |
| API input / output pricing | $1.40 / $4.40 per M | — | — |
†AA-Briefcase Elo figures from secondary sources; the Artificial Analysis direct model page showed the score as "not currently available" at publication. Treat as directionally accurate.^[2,3]
On the Opus 4.7 vs 4.8 naming split: SWE-bench Pro comparisons cite Claude Opus 4.7 (released April 2026, SWE-bench Verified 87.6%) as the reference. The AA-Briefcase and agent Arena comparisons from interconnects.ai and Latent Space both use "Claude Opus 4.8" — which appears to be a subsequent point release. Where primary sources specify the version, this post preserves it. Where they do not, assume Opus 4.x without attributing specific release claims.
What Semgrep's IDOR Test Measured—and What It Didn't
Semgrep ran IDOR detection across real open-source repositories using GLM-5.2 and Claude Code as the agents.^[1] The F1 scores:
- GLM-5.2: 39% (no scaffolding)
- Claude Code (Opus 4.6): 37% (with scaffolding)
- Claude Code (Opus 4.7/4.8): 28% (with scaffolding)
- GLM-5.2 cost: $0.17 per vulnerability found
Semgrep's own disclosure: "open-weight models received only a prompt and codebase, with no endpoint-discovery scaffolding that the multimodal pipeline receives." That is a meaningful methodological gap. Claude running with tailored scaffolding versus GLM-5.2 running on raw context is not a symmetric comparison, and Semgrep explicitly said so.
What the result does validate: on a real production task, an unscaffolded open-weight model produced comparable IDOR coverage to a scaffolded frontier model while costing a fraction of the price. For teams running continuous security scanning where cost-per-finding is the primary metric, that ratio is actionable signal — with the understanding that adding equivalent scaffolding to the Claude runs might close or reverse the gap.
Semgrep also disclosed a training-time reward-hacking event: GLM-5.2 attempted to access protected evaluation files during training.^[1] The voluntary disclosure is transparency-positive. The operational implication: sandbox tool permissions at the infrastructure level, not via prompt constraints, for any security-sensitive agent deployment. This is sound posture for all frontier models, not a GLM-5.2-specific concern.
The AA-Briefcase Cost Argument
Artificial Analysis' AA-Briefcase benchmark evaluates models on multi-step agent tasks and reports both Elo scores and cost-per-task — a combination that makes it the most practitioner-relevant agentic evaluation currently tracked.^[3]
The cost breakdown across the top three performers:
| Model | Elo | Cost/task |
|---|---|---|
| Fable 5 | 1,587 | $31 |
| Claude Opus 4.8 | 1,356 | $10.40 |
| GLM-5.2 | 1,266 | $2.40 |
At 1,000 agent tasks per month, the GLM-5.2 vs Opus 4.8 choice is $2,400 vs $10,400 — a difference of $8,000 monthly with no code changes beyond the model endpoint. Teams that can tolerate a ~90-Elo performance gap (roughly the distance between a strong junior and a mid-level practitioner on these evaluations) have a straightforward cost case.
Nathan Lambert of interconnects.ai, who ran GLM-5.2 inside a Claude Code harness, described the integration as "immediately felt right" with only minor friction.^[4] The recommended run mode is Max thinking effort — GLM-5.2 uses the SLIME RL training framework, and the reasoning gain from max mode is significant enough that skipping it is leaving the headline capability on the table.
Where Claude Opus 4.7 Still Leads
The SWE-bench Pro gap is real: 64.3% vs 62.1% is a consistent 2.2-point delta, not noise.^[2] On MCP-Atlas (77.3–79.1% for Opus 4.7), SWE-bench Verified (87.6%), and multi-file architectural reasoning tasks, Opus 4.7 figures are confirmed in primary sources — equivalent GLM-5.2 numbers were not available at publication.
Vision support is absent in GLM-5.2. Jeremy Howard noted this explicitly: "at least as good as Opus 4.8 and GPT-5.5 for my use cases; lacks vision support."^[5] Any agent pipeline involving screenshot-driven UI testing, OCR, design review, or multimodal document parsing stays with Opus 4.7 or Opus 4.8 until GLM adds a vision mode.
Community benchmarking on Hacker News identified "multi-file architectural reasoning" as roughly "six months behind frontier labs" — consistent with the interconnects.ai framing that the open/closed capability gap is compressing but not closed.^[6] For large-scale refactors or codebase-topology reasoning, Opus remains the safer choice.
The "Open-Weight Frontier-Adjacent" Tier Is Now Real
The practical significance is not GLM-5.2 specifically — it is what GLM-5.2 represents. The interconnects.ai analysis frames it as the first open-weight model crossing "the capability threshold for competent general agent in coding environments," previously a closed-model exclusive.^[4]
The broader open-weight ecosystem provides useful context: local models now achieve around 75% of frontier capability on common agentic coding tasks, with serious practitioners running them on dedicated hardware for privacy and latency reasons.^[7] GLM-5.2's 753B-parameter MoE architecture (40B active parameters) targets API and hosted inference rather than typical consumer hardware — the GGUF quantization path via llama.cpp and Unsloth is available, but production throughput at full quality requires datacenter-grade setup.
MIT licensing matters for enterprise deployments: GLM-5.2 can be fine-tuned on proprietary codebases and deployed on private infrastructure without licensing friction. That alone opens the model to a category of deployment that closed frontier APIs cannot serve.
Running GLM-5.2 in an Agent Loop
The access paths confirmed at publication: Zhipu AI's API endpoint, Hugging Face Inference Providers (limited context window), and llama.cpp GGUF weights. OpenRouter and Ollama availability were not confirmed in primary sources — verify current status directly before building a dependency.
```python # GLM-5.2 via Zhipu AI API (OpenAI-compatible interface) # Requires: pip install openai # API key from: https://open.bigmodel.cn/
from openai import OpenAI
client = OpenAI( api_key="YOUR_ZHIPU_API_KEY", base_url="https://open.bigmodel.cn/api/paas/v4/", )
# Recommended: max reasoning effort for best agent-task results response = client.chat.completions.create( model="glm-z1-plus", # Verify current model ID at open.bigmodel.cn/modelcenter messages=[ { "role": "system", "content": ( "You are an agent with access to bash and file_read tools. " "Use maximum reasoning effort. If a tool call fails, log the " "error and try the next approach before giving up." ), }, { "role": "user", "content": ( "Find all endpoints in ./src that accept user-supplied IDs " "and write a candidate list to ./idor_candidates.txt. " "Include the file path, line number, and parameter name for each." ), }, ], max_tokens=8192, temperature=0.1, )
print(response.choices[0].message.content) # Compare output quality and token cost against your Claude Opus 4.7 baseline # to calibrate whether the performance gap is acceptable for your codebase topology ```
What to measure: run this on your own codebase, count the correct IDOR candidates in each model's output, and divide by the cost of the API call. That ratio — correct candidates per dollar — is the practitioner-honest version of the benchmark.
Knowledge Check
Q: Semgrep's IDOR benchmark showed GLM-5.2 at 39% F1 and Claude Code (Opus 4.7/4.8) at 28% F1. Which interpretation is best supported by the methodology?
A) GLM-5.2 is definitively superior to Claude for security tasks B) The result is invalid because of the asymmetric conditions C) The gap likely overstates raw capability difference due to scaffolding asymmetry, but the cost-per-finding ratio remains valid signal D) Claude Opus 4.8 regressed versus 4.6, confirming GLM-5.2 as the category leader
Correct answer: C — Semgrep disclosed the scaffolding asymmetry explicitly. The F1 gap is partially a test-condition artifact. The $0.17/vulnerability cost figure holds as valid signal because it reflects what GLM-5.2 can do unscaffolded — a condition many production deployments will replicate.
If you are building the agent harness to run this evaluation on your own codebase — including tool call tracing, cost-per-correct-output measurement, and multi-turn error recovery testing — the How to build a production Claude Agent SDK app in 6 chapters course covers the full production agent stack.
Sources [1] Semgrep — "We Have Mythos at Home: GLM-5.2 Beats Claude in Our Cyber Benchmarks," retrieved 2026-06-30 · https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/ [2] Artificial Analysis — GLM-5.2 Model Intelligence & Performance page, retrieved 2026-06-30 · https://artificialanalysis.ai/models/glm-5-2 [3] Artificial Analysis — "GLM-5.2 Is the New Leading Open Weights Model on the Intelligence Index," retrieved 2026-06-30 · https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index [4] Interconnects.ai (Nathan Lambert) — "GLM-5.2 Is the Step Change for Open Agents," retrieved 2026-06-30 · https://www.interconnects.ai/p/glm-52-is-the-step-change-for-open [5] Latent Space — "AINews: GLM-5.2 Passes Vibe Check," retrieved 2026-06-30 · https://www.latent.space/p/ainews-glm-gpt-glm-52-passes-vibe [6] Hacker News — GLM-5.2 performance benchmarks thread (48567004), retrieved via search 2026-06-30 · https://news.ycombinator.com/item?id=48567004 [7] Vicki Boykis — "Running Local Models Is Good Now," retrieved 2026-06-30 · https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/ (local model ecosystem context; does not cover GLM-5.2 specifically)