Can't run inference with the model
#3 opened by MattisR
Hello, when I send a request to vLLM serving this model, the generation runs indefinitely and never returns:
(APIServer pid=17046) INFO: 23.90.234.2:50378 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=17046) INFO 09-25 09:26:39 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=17046) WARNING 09-25 09:26:39 [chat_utils.py:422] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
(APIServer pid=17046) WARNING 09-25 09:26:39 [chat_utils.py:427] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
(APIServer pid=17046) INFO 09-25 09:26:54 [loggers.py:123] Engine 000: Avg prompt throughput: 302.1 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
# No more prompt throughput, but the request keeps running forever...
(APIServer pid=17046) INFO 09-25 09:27:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.9%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.6%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%
...
When I set max_tokens in my request, I get back empty content instead:
(APIServer pid=17046) ERROR 09-25 09:35:47 [serving_chat.py:251] ValueError: Invalid assistant message: role='assistant' content='' tool_calls=None prefix=False
vllm==0.10.2
tokenizers==0.22.1
transformers==4.56.2
mistral_common==1.8.5
Can you please share commands for reproduction?
python -m vllm.entrypoints.openai.api_server \
--model RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--host 0.0.0.0 \
--port 8000
This is on an NVIDIA L40S, using the two Python scripts given in the README.md.
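The README scripts themselves are not reproduced here; as a rough equivalent, the behavior can be triggered with a plain OpenAI-compatible chat completion request against the server started above (model name taken from the --model flag, prompt and max_tokens are placeholders):

# Sketch of a minimal chat completion request; max_tokens reproduces the empty-content case
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'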
Hi @MattisR, thanks for clarifying.
It seems that there is a small issue in the stable release of vLLM. You can try installing from source for the time being. Here are the instructions:
https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source. You can follow "Set up using Python-only build (without compilation)".
The fix should be available in the next release.
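For reference, the Python-only build from source described on that page roughly amounts to the following (treat this as a sketch and check the linked docs for the current steps):

# Python-only build: reuses precompiled kernels, only Python sources are installed editable
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .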
MattisR changed discussion status to closed