Upgrading vLLM to v0.17.0 with Qwen on RTX 4000 Ada: Every Breaking Change You Will Hit

A hands-on record of every error encountered while upgrading from vLLM v0.9.2 to v0.17.0 with Qwen models on an RTX 4000 Ada (20GB VRAM): deprecated CLI flags, CUDA runtime changes, entrypoint conflicts, tool-call parser requirements, unsupported transformers, and how to pick a quantized model that fits in 20GB.

March 10, 2026 · 8 min · Ashish Jaiswal

vLLM + Qwen2.5-14B on Hetzner RTX 4000 Ada: Making Tool Calling Work

A complete walkthrough of getting vLLM + Qwen2.5-14B-AWQ running on an RTX 4000 Ada with working tool calls for OpenCode: CUDA driver setup, throughput debugging, AWQ Marlin kernels for sm_89, writing a custom Qwen tool parser plugin, and debugging model refusals caused by a reasoning:true misconfiguration.

February 27, 2026 · 10 min · Ashish Jaiswal