Upgrading vLLM to v0.17.0 with Qwen on RTX 4000 Ada: Every Breaking Change You Will Hit

This is a follow-up to my earlier post on getting vLLM + Qwen2.5-14B working with tool calling on RTX 4000 Ada. That setup ran vLLM v0.9.2. Upgrading to v0.17.0 broke things in ten distinct ways. This post records each one in the order it was encountered, with the fix.

Hardware and target

GPU: NVIDIA RTX 4000 SFF Ada Generation — 20GB VRAM, sm_89 architecture OS: Ubuntu 24.04 Upgrade path: vllm/vllm-openai:v0.9.2 → vllm/vllm-openai:v0.17.0 Models explored: Qwen3.5-9B, Qwen3-8B-AWQ, Qwen2.5-14B-Instruct-AWQ Final model: Qwen/Qwen2.5-14B-Instruct-AWQ Goal: Tool calling for agentic coding (qwen-code / OpenCode)

Issue 1: `--model` flag is deprecated

Symptom: Container starts with a deprecation warning then exits or ignores the model arg.

What changed: In v0.17.0 the CLI restructured. vllm serve is now a proper subcommand and the model name is a positional argument, not a flag.

Old:

command: >
  --model Qwen/Qwen2.5-14B-Instruct-AWQ
  --quantization awq_marlin

New:

entrypoint: ["vllm", "serve", "Qwen/Qwen3-8B-AWQ"]
command: >
  --quantization awq_marlin
  --gpu-memory-utilization 0.95

Note: this also leads directly to Issue 3 — keep reading.

Issue 2: CUDA runtime not found

Symptom:

RuntimeError: Failed to infer device type

The container starts but vLLM can’t see the GPU at all.

Root cause: The runtime: nvidia key in docker-compose conflicts with the deploy: block when both are present in some Compose versions. The result is that neither works — the container gets no GPU access.

Fix: Remove runtime: nvidia entirely and rely on the deploy.resources.reservations block. Add NVIDIA_VISIBLE_DEVICES=all explicitly as an environment variable to ensure the NVIDIA Container Toolkit exposes the GPU inside the container.

services:
  lmcache-vllm:
    image: vllm/vllm-openai:v0.17.0
    # Do NOT add: runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}

Issue 3: Docker entrypoint conflict

Symptom: The command fails with an unrecognized argument error, or vllm is called twice in the resulting process string.

Root cause: The vllm/vllm-openai image already sets vllm as its ENTRYPOINT. If your docker-compose command: starts with serve, Docker passes it as an argument to vllm — which works. But if you also prefix with vllm in the command, you get vllm vllm serve ....

The clean approach that handles the model-as-positional-arg change from Issue 1 is to override the entrypoint fully:

entrypoint: ["vllm", "serve", "Qwen/Qwen3-8B-AWQ"]
command: >
  --gpu-memory-utilization 0.95
  --max-model-len 23232
  --quantization awq_marlin
  --enable-auto-tool-choice
  --tool-call-parser hermes

This gives you a clean process line: vllm serve Qwen/Qwen3-8B-AWQ --gpu-memory-utilization ...

Issue 4: `--enable-auto-tool-choice` requires `--tool-call-parser`

Symptom: vLLM exits at startup:

ValueError: --enable-auto-tool-choice requires --tool-call-parser to be set

Fix: Add --tool-call-parser. In v0.17.0, the available parsers are:

deepseek_v3, deepseek_v31, deepseek_v32, ernie45, functiongemma, gigachat3, glm45, glm47,
granite, granite-20b-fc, hermes, hunyuan_a13b, internlm, jamba, kimi_k2, llama3_json,
llama4_json, llama4_pythonic, longcat, minimax, minimax_m2, mistral, olmo3, openai,
phi4_mini_json, pythonic, qwen3_coder, qwen3_xml, seed_oss, step3, step3p5, xlam

For Qwen3 family: --tool-call-parser hermes is the stable choice. qwen3_xml and qwen3_coder are also available if your model uses those output formats.

Issue 5: `--enable-prompt-tokens-details` does not exist

Symptom: vLLM exits at startup:

error: unrecognized arguments: --enable-prompt-tokens-details

Fix: Remove the flag. It does not exist in v0.17.0. If you need prompt token details in responses, check the current vLLM docs for the equivalent (or confirm it is now default behavior).

Issue 6: Transformers too old for Qwen3.5

Symptom: vLLM starts loading the model then exits:

KeyError: 'qwen3_5'

or a similar error about an unknown model architecture.

Root cause: vllm/vllm-openai:v0.17.0 ships a pinned version of transformers that predates Qwen3.5 support. Qwen/Qwen3.5-9B is not recognized.

Options:

Add a startup pip upgrade: pip install --upgrade transformers in an entrypoint wrapper. Fragile — future vLLM releases may break with unpinned transformers.
Use a model the bundled transformers already knows. Qwen/Qwen3-8B-AWQ works without any transformers upgrade.

Option 2 is preferable for a stable production setup.

Issue 7: `--quantization awq_marlin` fails on non-quantized model

Symptom:

ValueError: The model Qwen/Qwen3.5-9B is not quantized. awq_marlin requires a pre-quantized model.

Root cause: awq_marlin is a quantization backend, not a runtime quantization method. It requires the model weights to already be AWQ-quantized (the weights have AWQ metadata embedded). Qwen/Qwen3.5-9B is a full-precision model — no quantization metadata.

Fix: Either remove --quantization awq_marlin (and accept full-precision inference), or use a pre-quantized model. For a 20GB GPU, full precision 9B is also going to OOM — see Issue 8.

Issue 8: CUDA OOM with full-precision 9B model

Symptom:

torch.cuda.OutOfMemoryError: CUDA out of memory

Root cause: Qwen3.5-9B in bfloat16 is approximately 18GB of weights alone. With KV cache overhead and the vLLM runtime, it does not fit in 20GB VRAM.

Fix: Use a 4-bit quantized model. The AWQ-quantized variants of 8B/9B class models fit comfortably with room for KV cache.

Issue 9: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is for NPU, not NVIDIA

Symptom: Model loads but produces garbage output, or vLLM raises a backend compatibility error.

Root cause: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is compiled and optimized for Axera NPU hardware. The weight format and quantization scheme are not compatible with vLLM’s GPTQ/Marlin kernels for NVIDIA GPUs. The Hugging Face model card mentions NVIDIA support only in a general sense — the actual weights are NPU-targeted.

Fix: Use officially published Qwen AWQ models from the Qwen org on Hugging Face.

Issue 10: Model selection for 20GB VRAM

With full-precision models OOM-ing and NPU-targeted quants ruled out, the realistic options for RTX 4000 Ada with v0.17.0 are:

Model	VRAM estimate	Notes
`Qwen/Qwen3-8B-AWQ`	~8GB weights + KV cache	Official, well-tested with vLLM
`Qwen/Qwen3-14B-AWQ`	~10GB weights + reduced KV cache	Fits, but reduce max-model-len
`cyankiwi/Qwen3.5-9B-AWQ-4bit`	~6GB weights + KV cache	Community quant, newest model

Qwen/Qwen3-8B-AWQ is the most stable choice for v0.17.0 — official org, known to work with the bundled transformers version, and awq_marlin activates cleanly on sm_89.

Final working docker-compose

services:
  lmcache-vllm:
    image: vllm/vllm-openai:v0.17.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    ipc: host
    entrypoint: ["vllm", "serve", "Qwen/Qwen2.5-14B-Instruct-AWQ"]
    command: >
      --gpu-memory-utilization 0.9
      --max-model-len 23232
      --quantization awq_marlin
      --enable-auto-tool-choice
      --tool-call-parser hermes

Bonus: qwen-code in Docker

If you want to run the qwen-code CLI without installing Node locally:

docker run -it --rm node:22-slim bash -c \
  "npm install -g @qwen-code/qwen-code@latest && qwen-code"

Point it at http://host-ip:8000 and it picks up the local vLLM endpoint.

Summary of changes from v0.9.2 to v0.17.0

Area	v0.9.2	v0.17.0
Model argument	`--model <name>`	Positional arg: `vllm serve <name>`
GPU in compose	`runtime: nvidia` or `deploy`	`deploy` only + `NVIDIA_VISIBLE_DEVICES=all`
Tool call parser	Optional	Required when `--enable-auto-tool-choice` is set
`--enable-prompt-tokens-details`	Supported	Removed
Qwen3.5 support	N/A	Requires transformers upgrade or use Qwen3 instead

The upgrade has genuine improvements — the expanded tool-call parser list and cleaner CLI structure are worth it. The migration is straightforward once you know which flags changed.

Bonus experiment: Gemma 4 27B on the same setup

After getting Qwen3-14B-AWQ stable in production, I tested Google’s Gemma 4 27B on the same RTX 4000 Ada (20GB VRAM) to see how it compared for the artoo use case.

Does it fit?

Gemma 4 27B in full precision is around 54GB — well beyond 20GB VRAM. The only way to run it is with a 4-bit quantized variant. Community AWQ builds exist on Hugging Face, and they bring the weight footprint down to approximately 14–15GB, which fits on 20GB with a reduced --max-model-len.

entrypoint: ["vllm", "serve", "bartowski/google_gemma-4-27b-it-GGUF"]
command: >
  --gpu-memory-utilization 0.92
  --max-model-len 16384
  --quantization awq_marlin
  --enable-auto-tool-choice
  --tool-call-parser functiongemma

Note --tool-call-parser functiongemma — Gemma 4’s tool call format is different from the Hermes/Qwen family. Using hermes here produces malformed JSON tool calls.

Context window comparison

Model	Max context (full)	Practical on 20GB
Qwen3-14B-AWQ	40K tokens	~32K with KV cache headroom
Gemma 4 27B (AWQ 4-bit)	128K tokens	~16K before OOM

This is the core tradeoff. Gemma 4 advertises 128K context but the KV cache is enormous at that window size. On 20GB you have to cut it back so far that the theoretical advantage disappears. Qwen3-14B-AWQ at 32K is more usable in practice.

Speed

Qwen3-14B-AWQ generates at roughly 40–50 tok/s on RTX 4000 Ada. Gemma 4 27B AWQ generates at roughly 18–22 tok/s on the same hardware.

For artoo’s interactive chat use case — where users expect a response in under 3 seconds — the Gemma throughput is noticeably sluggish on longer prompts.

Tool calling quality

Gemma 4 27B produces correct tool call JSON for simple single-tool requests. Where it struggled was with the artoo pipeline’s pre-processed data: when the context includes several thousand tokens of timereg and git log data, Gemma occasionally invoked tools redundantly or ignored pre-fetched data and asked for it again via tool calls. Qwen3-14B-AWQ handled the enriched context reliably without re-requesting data the Go layer had already appended.

Why we stayed with Qwen3-14B-AWQ

Three reasons:

Context headroom. 32K practical context vs ~16K for Gemma 4 27B on this VRAM budget. Artoo regularly sends 10–15K token prompts after enrichment. Gemma hits the limit on complex queries.
Tool-calling discipline. Qwen3 with hermes parser does not double-invoke tools on pre-fetched data. This matters because artoo’s design deliberately avoids redundant LLM round-trips (see architecture doc) — a model that re-fetches what Go already provided breaks that contract.
Throughput. 40–50 tok/s vs 18–22 tok/s is the difference between a snappy response and one that feels slow. On a single-GPU setup with no batching, the smaller model wins.

Gemma 4 27B would be the better choice on a 48GB GPU (H100 80GB or two A100s), where the full context window is accessible and throughput is no longer the constraint. On the RTX 4000 Ada 20GB budget, Qwen3-14B-AWQ is the pragmatic pick.

Hardware and target#

Issue 1: --model flag is deprecated#

Issue 2: CUDA runtime not found#

Issue 3: Docker entrypoint conflict#

Issue 4: --enable-auto-tool-choice requires --tool-call-parser#

Issue 5: --enable-prompt-tokens-details does not exist#

Issue 6: Transformers too old for Qwen3.5#

Issue 7: --quantization awq_marlin fails on non-quantized model#

Issue 8: CUDA OOM with full-precision 9B model#

Issue 9: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is for NPU, not NVIDIA#

Issue 10: Model selection for 20GB VRAM#

Final working docker-compose#

Bonus: qwen-code in Docker#

Summary of changes from v0.9.2 to v0.17.0#

Bonus experiment: Gemma 4 27B on the same setup#

Does it fit?#

Context window comparison#

Speed#

Tool calling quality#

Why we stayed with Qwen3-14B-AWQ#

Hardware and target

Issue 1: `--model` flag is deprecated

Issue 2: CUDA runtime not found

Issue 3: Docker entrypoint conflict

Issue 4: `--enable-auto-tool-choice` requires `--tool-call-parser`

Issue 5: `--enable-prompt-tokens-details` does not exist

Issue 6: Transformers too old for Qwen3.5

Issue 7: `--quantization awq_marlin` fails on non-quantized model

Issue 8: CUDA OOM with full-precision 9B model

Issue 9: AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4 is for NPU, not NVIDIA

Issue 10: Model selection for 20GB VRAM

Final working docker-compose

Bonus: qwen-code in Docker

Summary of changes from v0.9.2 to v0.17.0

Bonus experiment: Gemma 4 27B on the same setup

Does it fit?

Context window comparison

Speed

Tool calling quality

Why we stayed with Qwen3-14B-AWQ