This is a follow-up to my earlier post on getting vLLM + Qwen2.5-14B working with tool calling on RTX 4000 Ada. That setup ran vLLM v0.9.2. Upgrading to v0.17.0 broke things in ten distinct ways. This post records each one in the order it was encountered, with the fix.
Hardware and target
GPU: NVIDIA RTX 4000 SFF Ada Generation — 20GB VRAM, sm_89 architecture
OS: Ubuntu 24.04
Upgrade path: vllm/vllm-openai:v0.9.2 → vllm/vllm-openai:v0.17.0
Models explored: Qwen3.5-9B, Qwen3-8B-AWQ, Qwen2.5-14B-Instruct-AWQ
Final model: Qwen/Qwen2.5-14B-Instruct-AWQ
Goal: Tool calling for agentic coding (qwen-code / OpenCode)
Issue 1: --model flag is deprecated
Symptom: Container starts with a deprecation warning then exits or ignores the model arg.
What changed: In v0.17.0 the CLI restructured. vllm serve is now a proper subcommand
and the model name is a positional argument, not a flag.
Old:
command: >
--model Qwen/Qwen2.5-14B-Instruct-AWQ
--quantization awq_marlin
New:
entrypoint: ["vllm", "serve", "Qwen/Qwen3-8B-AWQ"]
command: >
--quantization awq_marlin
--gpu-memory-utilization 0.95
Note: this also leads directly to Issue 3 — keep reading.
Issue 2: CUDA runtime not found
Symptom:
RuntimeError: Failed to infer device type
The container starts but vLLM can’t see the GPU at all.
Root cause: The runtime: nvidia key in docker-compose conflicts with the deploy: block
when both are present in some Compose versions. The result is that neither works — the container
gets no GPU access.
Fix: Remove runtime: nvidia entirely and rely on the deploy.resources.reservations
block. Add NVIDIA_VISIBLE_DEVICES=all explicitly as an environment variable to ensure the
NVIDIA Container Toolkit exposes the GPU inside the container.
services:
lmcache-vllm:
image: vllm/vllm-openai:v0.17.0
# Do NOT add: runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
- NVIDIA_VISIBLE_DEVICES=all
- HF_TOKEN=${HF_TOKEN}
Issue 3: Docker entrypoint conflict
Symptom: The command fails with an unrecognized argument error, or vllm is called twice in
the resulting process string.
Root cause: The vllm/vllm-openai image already sets vllm as its ENTRYPOINT. If your
docker-compose command: starts with serve, Docker passes it as an argument to vllm — which
works. But if you also prefix with vllm in the command, you get vllm vllm serve ....
The clean approach that handles the model-as-positional-arg change from Issue 1 is to override the entrypoint fully:
entrypoint: ["vllm", "serve", "Qwen/Qwen3-8B-AWQ"]
command: >
--gpu-memory-utilization 0.95
--max-model-len 23232
--quantization awq_marlin
--enable-auto-tool-choice
--tool-call-parser hermes
This gives you a clean process line: vllm serve Qwen/Qwen3-8B-AWQ --gpu-memory-utilization ...
Issue 4: --enable-auto-tool-choice requires --tool-call-parser
Symptom: vLLM exits at startup:
ValueError: --enable-auto-tool-choice requires --tool-call-parser to be set
Fix: Add --tool-call-parser. In v0.17.0, the available parsers are:
deepseek_v3, deepseek_v31, deepseek_v32, ernie45, functiongemma, gigachat3, glm45, glm47,
granite, granite-20b-fc, hermes, hunyuan_a13b, internlm, jamba, kimi_k2, llama3_json,
llama4_json, llama4_pythonic, longcat, minimax, minimax_m2, mistral, olmo3, openai,
phi4_mini_json, pythonic, qwen3_coder, qwen3_xml, seed_oss, step3, step3p5, xlam
For Qwen3 family: --tool-call-parser hermes is the stable choice. qwen3_xml and
qwen3_coder are also available if your model uses those output formats.
Issue 5: --enable-prompt-tokens-details does not exist
Symptom: vLLM exits at startup:
error: unrecognized arguments: --enable-prompt-tokens-details
Fix: Remove the flag. It does not exist in v0.17.0. If you need prompt token details in responses, check the current vLLM docs for the equivalent (or confirm it is now default behavior).
Issue 6: Transformers too old for Qwen3.5
Symptom: vLLM starts loading the model then exits:
KeyError: 'qwen3_5'
or a similar error about an unknown model architecture.
Root cause: vllm/vllm-openai:v0.17.0 ships a pinned version of transformers that
predates Qwen3.5 support. Qwen/Qwen3.5-9B is not recognized.
Options:
- Add a startup pip upgrade:
pip install --upgrade transformersin an entrypoint wrapper. Fragile — future vLLM releases may break with unpinned transformers. - Use a model the bundled transformers already knows.
Qwen/Qwen3-8B-AWQworks without any transformers upgrade.
Option 2 is preferable for a stable production setup.
Issue 7: --quantization awq_marlin fails on non-quantized model
Symptom:
ValueError: The model Qwen/Qwen3.5-9B is not quantized. awq_marlin requires a pre-quantized model.
Root cause: awq_marlin is a quantization backend, not a runtime quantization method.
It requires the model weights to already be AWQ-quantized (the weights have AWQ metadata embedded).
Qwen/Qwen3.5-9B is a full-precision model — no quantization metadata.
Fix: Either remove --quantization awq_marlin (and accept full-precision inference), or use
a pre-quantized model. For a 20GB GPU, full precision 9B is also going to OOM — see Issue 8.
Issue 8: CUDA OOM with full-precision 9B model
Symptom:
torch.cuda.OutOfMemoryError: CUDA out of memory
Root cause: Qwen3.5-9B in bfloat16 is approximately 18GB of weights alone. With KV cache
overhead and the vLLM runtime, it does not fit in 20GB VRAM.
Fix: Use a 4-bit quantized model. The AWQ-quantized variants of 8B/9B class models fit comfortably with room for KV cache.
Issue 9: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is for NPU, not NVIDIA
Symptom: Model loads but produces garbage output, or vLLM raises a backend compatibility error.
Root cause: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is compiled and optimized for
Axera NPU hardware. The weight format and quantization scheme are not compatible with
vLLM’s GPTQ/Marlin kernels for NVIDIA GPUs. The Hugging Face model card mentions NVIDIA support
only in a general sense — the actual weights are NPU-targeted.
Fix: Use officially published Qwen AWQ models from the Qwen org on Hugging Face.
Issue 10: Model selection for 20GB VRAM
With full-precision models OOM-ing and NPU-targeted quants ruled out, the realistic options for RTX 4000 Ada with v0.17.0 are:
| Model | VRAM estimate | Notes |
|---|---|---|
Qwen/Qwen3-8B-AWQ | ~8GB weights + KV cache | Official, well-tested with vLLM |
Qwen/Qwen3-14B-AWQ | ~10GB weights + reduced KV cache | Fits, but reduce max-model-len |
cyankiwi/Qwen3.5-9B-AWQ-4bit | ~6GB weights + KV cache | Community quant, newest model |
Qwen/Qwen3-8B-AWQ is the most stable choice for v0.17.0 — official org, known to work with
the bundled transformers version, and awq_marlin activates cleanly on sm_89.
Final working docker-compose
services:
lmcache-vllm:
image: vllm/vllm-openai:v0.17.0
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "8000:8000"
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
environment:
- HF_TOKEN=${HF_TOKEN}
- NVIDIA_VISIBLE_DEVICES=all
ipc: host
entrypoint: ["vllm", "serve", "Qwen/Qwen2.5-14B-Instruct-AWQ"]
command: >
--gpu-memory-utilization 0.9
--max-model-len 23232
--quantization awq_marlin
--enable-auto-tool-choice
--tool-call-parser hermes
Bonus: qwen-code in Docker
If you want to run the qwen-code CLI without installing Node locally:
docker run -it --rm node:22-slim bash -c \
"npm install -g @qwen-code/qwen-code@latest && qwen-code"
Point it at http://host-ip:8000 and it picks up the local vLLM endpoint.
Summary of changes from v0.9.2 to v0.17.0
| Area | v0.9.2 | v0.17.0 |
|---|---|---|
| Model argument | --model <name> | Positional arg: vllm serve <name> |
| GPU in compose | runtime: nvidia or deploy | deploy only + NVIDIA_VISIBLE_DEVICES=all |
| Tool call parser | Optional | Required when --enable-auto-tool-choice is set |
--enable-prompt-tokens-details | Supported | Removed |
| Qwen3.5 support | N/A | Requires transformers upgrade or use Qwen3 instead |
The upgrade has genuine improvements — the expanded tool-call parser list and cleaner CLI structure are worth it. The migration is straightforward once you know which flags changed.
Bonus experiment: Gemma 4 27B on the same setup
After getting Qwen3-14B-AWQ stable in production, I tested Google’s Gemma 4 27B on the same RTX 4000 Ada (20GB VRAM) to see how it compared for the artoo use case.
Does it fit?
Gemma 4 27B in full precision is around 54GB — well beyond 20GB VRAM. The only way to run it
is with a 4-bit quantized variant. Community AWQ builds exist on Hugging Face, and they bring
the weight footprint down to approximately 14–15GB, which fits on 20GB with a reduced
--max-model-len.
entrypoint: ["vllm", "serve", "bartowski/google_gemma-4-27b-it-GGUF"]
command: >
--gpu-memory-utilization 0.92
--max-model-len 16384
--quantization awq_marlin
--enable-auto-tool-choice
--tool-call-parser functiongemma
Note --tool-call-parser functiongemma — Gemma 4’s tool call format is different from the
Hermes/Qwen family. Using hermes here produces malformed JSON tool calls.
Context window comparison
| Model | Max context (full) | Practical on 20GB |
|---|---|---|
| Qwen3-14B-AWQ | 40K tokens | ~32K with KV cache headroom |
| Gemma 4 27B (AWQ 4-bit) | 128K tokens | ~16K before OOM |
This is the core tradeoff. Gemma 4 advertises 128K context but the KV cache is enormous at that window size. On 20GB you have to cut it back so far that the theoretical advantage disappears. Qwen3-14B-AWQ at 32K is more usable in practice.
Speed
Qwen3-14B-AWQ generates at roughly 40–50 tok/s on RTX 4000 Ada. Gemma 4 27B AWQ generates at roughly 18–22 tok/s on the same hardware.
For artoo’s interactive chat use case — where users expect a response in under 3 seconds — the Gemma throughput is noticeably sluggish on longer prompts.
Tool calling quality
Gemma 4 27B produces correct tool call JSON for simple single-tool requests. Where it struggled was with the artoo pipeline’s pre-processed data: when the context includes several thousand tokens of timereg and git log data, Gemma occasionally invoked tools redundantly or ignored pre-fetched data and asked for it again via tool calls. Qwen3-14B-AWQ handled the enriched context reliably without re-requesting data the Go layer had already appended.
Why we stayed with Qwen3-14B-AWQ
Three reasons:
Context headroom. 32K practical context vs ~16K for Gemma 4 27B on this VRAM budget. Artoo regularly sends 10–15K token prompts after enrichment. Gemma hits the limit on complex queries.
Tool-calling discipline. Qwen3 with
hermesparser does not double-invoke tools on pre-fetched data. This matters because artoo’s design deliberately avoids redundant LLM round-trips (see architecture doc) — a model that re-fetches what Go already provided breaks that contract.Throughput. 40–50 tok/s vs 18–22 tok/s is the difference between a snappy response and one that feels slow. On a single-GPU setup with no batching, the smaller model wins.
Gemma 4 27B would be the better choice on a 48GB GPU (H100 80GB or two A100s), where the full context window is accessible and throughput is no longer the constraint. On the RTX 4000 Ada 20GB budget, Qwen3-14B-AWQ is the pragmatic pick.