Can't run inference with the model
#3 opened by MattisR
Hello, when I send a request to vLLM serving this model, the generation runs indefinitely and never returns:
(APIServer pid=17046) INFO: 23.90.234.2:50378 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=17046) INFO 09-25 09:26:39 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=17046) WARNING 09-25 09:26:39 [chat_utils.py:422] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
(APIServer pid=17046) WARNING 09-25 09:26:39 [chat_utils.py:427] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
(APIServer pid=17046) INFO 09-25 09:26:54 [loggers.py:123] Engine 000: Avg prompt throughput: 302.1 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
# No more prompt throughput, but the request keeps running forever...
(APIServer pid=17046) INFO 09-25 09:27:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.9%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.6%, Prefix cache hit rate: 0.0%
(APIServer pid=17046) INFO 09-25 09:27:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%
...
When I set max_tokens in my request, I get back empty content instead:
(APIServer pid=17046) ERROR 09-25 09:35:47 [serving_chat.py:251] ValueError: Invalid assistant message: role='assistant' content='' tool_calls=None prefix=False
vllm==0.10.2
tokenizers==0.22.1
transformers==4.56.2
mistral_common==1.8.5
Can you please share commands for reproduction?
python -m vllm.entrypoints.openai.api_server \
--model RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--host 0.0.0.0 \
--port 8000
This is on an NVIDIA L40S, using the two Python scripts given in the README.md.
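The README scripts themselves are not reproduced here; as a rough equivalent, the behavior can be triggered with a plain OpenAI-compatible chat completion request against the server started above (model name taken from the --model flag, prompt and max_tokens are placeholders):

# Sketch of a minimal chat completion request; max_tokens reproduces the empty-content case
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'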
Hi @MattisR, thanks for clarifying.
It seems that there is a small issue in the stable release of vLLM. You can try installing from source for the time being. Here are the instructions:
https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source. You can follow "Set up using Python-only build (without compilation)".
The fix should be available in the next release.
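For reference, the Python-only build from source described on that page roughly amounts to the following (treat this as a sketch and check the linked docs for the current steps):

# Python-only build: reuses precompiled kernels, only Python sources are installed editable
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .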
MattisR changed discussion status to closed