Speed measured on 2 x H200 / 4 x RTX 6000 Pro / 4 x A100
Hello All,
First of all, many thanks to the creator of this quant - it makes it possible to run this very capable model on server-grade hardware (rather than farm-grade).
We are running this model and wondering whether we have configured it properly or are missing something. Our goal is to connect it to Claude Code, and this actually works pretty decently, with several people testing it.
What we do not understand is the performance metrics cited, and especially how you benchmarked the quoted 100 t/s on 2 H200 cards. In our evaluations we see the following:
The above is with a 36k prompt and 200 decode tokens, which is obviously a big prompt, yet a typical one for Claude Code (or other CLI tool) usage. We are not sure how this affects the process, but lowering the prompt size has minimal benefit, while lowering the expected output tokens actually makes the result worse.
This is the command we use to start it (with `--tensor-parallel-size 2` on the H200 setup):

```bash
vllm serve bullpoint/GLM-4.6-AWQ \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.6-awq \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --port 8000 \
  >> vllm-logs/vllm-stdout.log 2>> vllm-logs/vllm-stderr.log &
```
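For reference, this is roughly how we time a single request against that server: a minimal sketch using vLLM's OpenAI-compatible endpoint. The port and served model name match the command above; the prompt content is a placeholder, and each streamed chunk is treated as roughly one generated token.

```python
# Minimal single-request timing sketch against the vLLM server started above.
# Assumes the `openai` client package; the prompt below is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "..."  # in practice, a ~36k-token Claude Code style context

start = time.time()
first_chunk_at = None
chunks = 0

stream = client.chat.completions.create(
    model="glm-4.6-awq",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if first_chunk_at is None:
        first_chunk_at = time.time()  # approximates time-to-first-token (prefill)
    chunks += 1  # roughly one chunk per generated token
end = time.time()

print(f"time to first token: {first_chunk_at - start:.2f}s")
print(f"decode speed: ~{chunks / (end - first_chunk_at):.1f} chunks/s (≈ tokens/s)")
```

Streaming keeps prefill (time to first token) separate from the decode rate, which is the number we compare against the cited t/s figures.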
Our first goal is to understand whether we are using the model properly, within its limits, so that we can adjust our testing strategy, understand how you achieve 50-60 or 100 t/s, and plan hardware accordingly.
Here are my GLM-4.6-AWQ Benchmark Results
Note: The ~5× drop from 4k→36k context is expected. Each decode step must attend over the full KV cache, making long-context decode memory-bandwidth limited.
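To make that concrete, here is a back-of-envelope sketch of how much KV cache every decode step has to stream at these context lengths. The layer count, KV-head count and head dimension below are assumed placeholders for illustration, not values read from the GLM-4.6 config.

```python
# Rough KV-cache traffic per decode step: each generated token re-reads the
# whole cache. All model dimensions below are assumptions for illustration.
num_layers = 92        # assumed transformer layers
num_kv_heads = 8       # assumed KV heads (GQA)
head_dim = 128         # assumed head dimension
kv_dtype_bytes = 2     # fp16/bf16 KV cache

def kv_bytes_per_token() -> int:
    # K and V, for every layer
    return 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

def kv_read_per_step_gb(context_len: int) -> float:
    """GB of KV cache the attention kernels stream per generated token."""
    return context_len * kv_bytes_per_token() / 1e9

for ctx in (4_096, 36_000):
    print(f"{ctx:>6}-token context: ~{kv_read_per_step_gb(ctx):.1f} GB of KV read per decode step")
```

Under these assumptions the per-step KV read grows roughly 9× from 4k to 36k context, on top of the constant weight traffic, consistent with long-context decode becoming memory-bandwidth limited.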
System Configuration
| Component | Specification |
|---|---|
| GPU | 4× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB each, 384GB total) |
| GPU OC | Core +250MHz, Memory +6000MHz |
| Interconnect | PCIe (no NVLink) |
| CPU | AMD Threadripper PRO 9985WX (64C/128T) |
| RAM | 768GB DDR5-6400 |
| OS | Ubuntu 24.04 |
| CUDA | 12.9 |
| vLLM | 0.11.2-dev (compiled for sm_100, sm_120a) |
| Attention Backend | FlashInfer |
Benchmark Results
High Concurrency (Marketing-style benchmark)
| Parameter | Value |
|---|---|
| Input Length | 256 tokens |
| Output Length | 256 tokens |
| Max Sequences | 64 |
| Num Prompts | 500 |
Throughput: 3.15 requests/s
Total tokens/s: 1,613.45
Output tokens/s: 806.73
Long Context (Real-world coding assistant workload)
| Parameter | Value |
|---|---|
| Input Length | 36,000 tokens |
| Output Length | 200 tokens |
| Max Sequences | 1 |
| Num Prompts | 20 |
Throughput: 0.07 requests/s
Total tokens/s: 2,457.22
Output tokens/s: 13.58
Medium Context (Single request)
| Parameter | Value |
|---|---|
| Input Length | 4,096 tokens |
| Output Length | 1,000 tokens |
| Max Sequences | 1 |
| Num Prompts | 20 |
Throughput: 0.07 requests/s
Total tokens/s: 372.88
Output tokens/s: 73.17
Key Takeaways
| Scenario | Output t/s | Notes |
|---|---|---|
| Short context, high concurrency | 806.73 | Batching amortizes overhead |
| Medium context (4k), single request | 73.17 | Realistic single-user perf |
| Long context (36k), single request | 13.58 | Memory-bandwidth bound |
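For anyone who wants to approximate the long-context row from Python, here is a minimal offline sketch. It is not the exact harness used for the tables above: the synthetic prompt, the repeated token id, and the engine arguments are assumptions chosen to mirror the parameters listed (36k in, 200 out, one sequence at a time).

```python
# Offline long-context throughput sketch, roughly mirroring the 36k/200 row.
# Synthetic prompts of a fixed token length; EOS is ignored so every request
# generates exactly `output_len` tokens.
import time
from vllm import LLM, SamplingParams

input_len = 36_000
output_len = 200
num_prompts = 20

llm = LLM(
    model="bullpoint/GLM-4.6-AWQ",
    tensor_parallel_size=4,
    max_model_len=131_072,
    max_num_seqs=1,               # one sequence at a time, as in the table
    gpu_memory_utilization=0.9,
    trust_remote_code=True,
)

# Placeholder prompt: the same token id repeated input_len times.
tok = llm.get_tokenizer()
prompt_ids = [tok.vocab_size // 2] * input_len
prompts = [{"prompt_token_ids": prompt_ids}] * num_prompts

params = SamplingParams(max_tokens=output_len, ignore_eos=True, temperature=0.0)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
total_tokens = out_tokens + num_prompts * input_len
print(f"{num_prompts / elapsed:.2f} req/s, "
      f"{total_tokens / elapsed:.1f} total tok/s, "
      f"{out_tokens / elapsed:.1f} output tok/s")
```

Changing `input_len`, `output_len`, `max_num_seqs` and `num_prompts` to the values in the other two tables should approximate those rows as well.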
