Speed measured on 2 x H200 / 4 x RTX 6000 Pro / 4 x A100

#3
by HristoTodorov - opened

Hello All,

First of all, great thanks to the creator of this quant - it makes it possible to run this very capable model on server-grade hardware (rather than farm-grade).

We are running this model and wondering whether we have the proper configuration or are missing something. Our goal is to connect it to Claude Code, and this actually works pretty decently, with several people testing it.

What we do not understand is the cited performance metrics, and especially how you benchmarked the 100 t/s on 2 H200 cards. In our evaluations we see the following:

(Screenshot of our benchmark output, 2025-11-27)

The above is with a 36k prompt and 200 decode tokens - obviously a big prompt, yet a typical one for Claude Code (or other CLI tool) usage. Not sure how this affects the process, but lowering the prompt length has minimal benefit, while lowering the expected output tokens makes the result even worse.

This is the command we use to start it:

# --tensor-parallel-size is 4 here, or 2 for the H200 setup
vllm serve bullpoint/GLM-4.6-AWQ \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.6-awq \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --port 8000 \
  >> vllm-logs/vllm-stdout.log 2>> vllm-logs/vllm-stderr.log &
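
For reference, a quick single-request sanity check against this endpoint can be done with a plain curl call to the OpenAI-compatible API (a sketch only; the prompt is illustrative, and the model name matches --served-model-name above). Dividing completion_tokens from the returned usage block by the wall-clock time gives an end-to-end tokens/s figure, which includes prefill time and therefore drops with long prompts:

# Rough end-to-end timing of one request (wall time includes prefill)
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-4.6-awq",
        "messages": [{"role": "user", "content": "Explain AWQ quantization in two paragraphs."}],
        "max_tokens": 200
      }' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["usage"])'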

Our first goal is to understand whether we are using the model properly within its limits, so we can adjust our testing strategy and understand how you achieve 50-60 or 100 t/s, and then think better about hardware.

Here are my GLM-4.6-AWQ Benchmark Results

Note: The ~5× drop from 4k→36k context is expected. Each decode step must attend over the full KV cache, making long-context decode memory-bandwidth limited.
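
A rough way to see why (a back-of-envelope bound, not a measured number, ignoring MoE routing and kernel overlap): each decode step has to stream the active weights plus the sequence's entire KV cache through GPU memory, so single-request decode speed is roughly bounded by

$$
\text{tokens/s} \;\lesssim\; \frac{B_{\text{eff}}}{W_{\text{active}} + 2 \, n_{\text{layers}} \, n_{\text{kv heads}} \, d_{\text{head}} \, b_{\text{kv}} \, L}
$$

where B_eff is the achievable aggregate memory bandwidth, W_active the bytes of (quantized) weights read per token, b_kv the bytes per KV element, and L the current context length. The KV term grows linearly with L, so decode slows as the context grows, which matches the 4k → 36k drop.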

System Configuration

GPU:                4× NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB each, 384 GB total)
GPU OC:             Core +250 MHz, Memory +6000 MHz
Interconnect:       PCIe (no NVLink)
CPU:                AMD Threadripper PRO 9985WX (64C/128T)
RAM:                768 GB DDR5-6400
OS:                 Ubuntu 24.04
CUDA:               12.9
vLLM:               0.11.2-dev (compiled for sm_100, sm_120a)
Attention backend:  FlashInfer

Benchmark Results

High Concurrency (Marketing-style benchmark)

Input length:     256 tokens
Output length:    256 tokens
Max sequences:    64
Num prompts:      500

Throughput:       3.15 requests/s
Total tokens/s:   1,613.45
Output tokens/s:  806.73
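
These figures look like the output of vLLM's bundled throughput benchmark; assuming that, an invocation along the following lines maps onto the parameters above (a sketch only - flag names can differ between vLLM versions, so check vllm bench throughput --help, or the older benchmarks/benchmark_throughput.py script):

# Sketch: throughput run with the high-concurrency parameters listed above.
# Adjust flag names to your vLLM version.
vllm bench throughput \
  --model bullpoint/GLM-4.6-AWQ \
  --tensor-parallel-size 4 \
  --input-len 256 \
  --output-len 256 \
  --num-prompts 500 \
  --max-num-seqs 64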

Long Context (Real-world coding assistant workload)

Input length:     36,000 tokens
Output length:    200 tokens
Max sequences:    1
Num prompts:      20

Throughput:       0.07 requests/s
Total tokens/s:   2,457.22
Output tokens/s:  13.58

Medium Context (Single request)

Input length:     4,096 tokens
Output length:    1,000 tokens
Max sequences:    1
Num prompts:      20

Throughput:       0.07 requests/s
Total tokens/s:   372.88
Output tokens/s:  73.17

Key Takeaways

Scenario                               Output t/s   Notes
Short context, high concurrency        806.73       Batching amortizes overhead
Medium context (4k), single request    73.17        Realistic single-user perf
Long context (36k), single request     13.58        Memory-bandwidth bound
bullpoint changed discussion status to closed
