Speed measured on 2 x H200 / 4 x RTX 6000 Pro / 4 x A100
Hello All,
First of all, many thanks to the creator of this quant - it makes it possible to run this very capable model on server-grade hardware (rather than farm-grade).
We are running this model and wondering whether we have configured it properly or are missing something. Our goal is to connect it to Claude Code, and this actually works pretty decently, with several people testing it.
What we do not understand is the performance metrics cited, and especially how you benchmarked the quoted 100 t/s on 2 H200 cards. In our evaluations we see the following:
The above is with a 36k prompt and 200 decode tokens, which is obviously a big prompt, yet a typical one for Claude Code (or other CLI tool) usage. We are not sure how this affects the process, but lowering the prompt size has minimal benefit, while lowering the expected output tokens actually makes the result worse.
This is the command we use to start it (with `--tensor-parallel-size 2` on the H200 setup):

```bash
vllm serve bullpoint/GLM-4.6-AWQ \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.6-awq \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --port 8000 \
  >> vllm-logs/vllm-stdout.log 2>> vllm-logs/vllm-stderr.log &
```
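For reference, this is roughly how we time a single request against that server: a minimal sketch using vLLM's OpenAI-compatible endpoint. The port and served model name match the command above; the prompt content is a placeholder, and each streamed chunk is treated as roughly one generated token.

```python
# Minimal single-request timing sketch against the vLLM server started above.
# Assumes the `openai` client package; the prompt below is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "..."  # in practice, a ~36k-token Claude Code style context

start = time.time()
first_chunk_at = None
chunks = 0

stream = client.chat.completions.create(
    model="glm-4.6-awq",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if first_chunk_at is None:
        first_chunk_at = time.time()  # approximates time-to-first-token (prefill)
    chunks += 1  # roughly one chunk per generated token
end = time.time()

print(f"time to first token: {first_chunk_at - start:.2f}s")
print(f"decode speed: ~{chunks / (end - first_chunk_at):.1f} chunks/s (≈ tokens/s)")
```

Streaming keeps prefill (time to first token) separate from the decode rate, which is the number we compare against the cited t/s figures.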
Our first goal is to understand whether we are using the model properly, within its limits, so that we can adjust our testing strategy, understand how you achieve 50-60 or 100 t/s, and plan hardware accordingly.
Here are my GLM-4.6-AWQ Benchmark Results
Note: The ~5× drop from 4k→36k context is expected. Each decode step must attend over the full KV cache, making long-context decode memory-bandwidth limited.
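To make that concrete, here is a back-of-envelope sketch of how much KV cache every decode step has to stream at these context lengths. The layer count, KV-head count and head dimension below are assumed placeholders for illustration, not values read from the GLM-4.6 config.

```python
# Rough KV-cache traffic per decode step: each generated token re-reads the
# whole cache. All model dimensions below are assumptions for illustration.
num_layers = 92        # assumed transformer layers
num_kv_heads = 8       # assumed KV heads (GQA)
head_dim = 128         # assumed head dimension
kv_dtype_bytes = 2     # fp16/bf16 KV cache

def kv_bytes_per_token() -> int:
    # K and V, for every layer
    return 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

def kv_read_per_step_gb(context_len: int) -> float:
    """GB of KV cache the attention kernels stream per generated token."""
    return context_len * kv_bytes_per_token() / 1e9

for ctx in (4_096, 36_000):
    print(f"{ctx:>6}-token context: ~{kv_read_per_step_gb(ctx):.1f} GB of KV read per decode step")
```

Under these assumptions the per-step KV read grows roughly 9× from 4k to 36k context, on top of the constant weight traffic, consistent with long-context decode becoming memory-bandwidth limited.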
System Configuration
| Component | Specification |
|---|---|
| GPU | 4× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB each, 384GB total) |
| GPU OC | Core +250MHz, Memory +6000MHz |
| Interconnect | PCIe (no NVLink) |
| CPU | AMD Threadripper PRO 9985WX (64C/128T) |
| RAM | 768GB DDR5-6400 |
| OS | Ubuntu 24.04 |
| CUDA | 12.9 |
| vLLM | 0.11.2-dev (compiled for sm_100, sm_120a) |
| Attention Backend | FlashInfer |
Benchmark Results
High Concurrency (Marketing-style benchmark)
| Parameter | Value |
|---|---|
| Input Length | 256 tokens |
| Output Length | 256 tokens |
| Max Sequences | 64 |
| Num Prompts | 500 |
Throughput: 3.15 requests/s
Total tokens/s: 1,613.45
Output tokens/s: 806.73
Long Context (Real-world coding assistant workload)
| Parameter | Value |
|---|---|
| Input Length | 36,000 tokens |
| Output Length | 200 tokens |
| Max Sequences | 1 |
| Num Prompts | 20 |
Throughput: 0.07 requests/s
Total tokens/s: 2,457.22
Output tokens/s: 13.58
Medium Context (Single request)
| Parameter | Value |
|---|---|
| Input Length | 4,096 tokens |
| Output Length | 1,000 tokens |
| Max Sequences | 1 |
| Num Prompts | 20 |
Throughput: 0.07 requests/s
Total tokens/s: 372.88
Output tokens/s: 73.17
Key Takeaways
| Scenario | Output t/s | Notes |
|---|---|---|
| Short context, high concurrency | 806.73 | Batching amortizes overhead |
| Medium context (4k), single request | 73.17 | Realistic single-user perf |
| Long context (36k), single request | 13.58 | Memory-bandwidth bound |
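For anyone who wants to approximate the long-context row from Python, here is a minimal offline sketch. It is not the exact harness used for the tables above: the synthetic prompt, the repeated token id, and the engine arguments are assumptions chosen to mirror the parameters listed (36k in, 200 out, one sequence at a time).

```python
# Offline long-context throughput sketch, roughly mirroring the 36k/200 row.
# Synthetic prompts of a fixed token length; EOS is ignored so every request
# generates exactly `output_len` tokens.
import time
from vllm import LLM, SamplingParams

input_len = 36_000
output_len = 200
num_prompts = 20

llm = LLM(
    model="bullpoint/GLM-4.6-AWQ",
    tensor_parallel_size=4,
    max_model_len=131_072,
    max_num_seqs=1,               # one sequence at a time, as in the table
    gpu_memory_utilization=0.9,
    trust_remote_code=True,
)

# Placeholder prompt: the same token id repeated input_len times.
tok = llm.get_tokenizer()
prompt_ids = [tok.vocab_size // 2] * input_len
prompts = [{"prompt_token_ids": prompt_ids}] * num_prompts

params = SamplingParams(max_tokens=output_len, ignore_eos=True, temperature=0.0)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
total_tokens = out_tokens + num_prompts * input_len
print(f"{num_prompts / elapsed:.2f} req/s, "
      f"{total_tokens / elapsed:.1f} total tok/s, "
      f"{out_tokens / elapsed:.1f} output tok/s")
```

Changing `input_len`, `output_len`, `max_num_seqs` and `num_prompts` to the values in the other two tables should approximate those rows as well.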
