Context: A Lot Changed Since Our Last Post

A few weeks ago I wrote about how we set up a local LLM at Obmondo — the hardware decision (Hetzner GEX44 with an RTX 4000 SFF Ada, 20 GB VRAM), the serving software (vLLM), and the model we landed on (Qwen2.5-Coder-14B-Instruct). That post ended with “I’ll follow up once we have actual numbers.”

Between writing that and getting the server provisioned, two things happened that made me revisit the model choice:

  1. DeepCoder-14B-Preview dropped from Agentica Project and Together AI — a 14B reasoning model that reportedly matches OpenAI’s o1 on coding benchmarks.
  2. Qwen3-Coder arrived from Alibaba — a whole new generation of Qwen coding models, including a 30B MoE variant that might actually fit on our hardware.

If you’re running (or planning to run) a local coding assistant, these aren’t footnotes. They change what’s possible at the price point most teams can actually afford.


The Contenders

Before going into the comparison, here’s a quick profile of each model:

Qwen2.5-Coder-14B-Instruct (Baseline)

What we originally committed to. From Alibaba’s Qwen team, 14B dense model fine-tuned on a broad code corpus — Python, Ruby, Go, shell, and more. The -Instruct variant is optimised for conversation: Q&A, explanation, generation from a description. Fast, well-supported, active community builds on HuggingFace. No reasoning overhead.

DeepCoder-14B-Preview

Released by Agentica Project in collaboration with Together AI. This is not a general coding model fine-tuned on code — it’s a reasoning model built on top of DeepSeek-R1-Distilled-Qwen-14B, further trained with reinforcement learning on 24,000 coding problems. The RL training used a modified version of the verl distributed RL framework, improving training efficiency by 2×. Everything — model weights, training data, code, training logs — is fully open source.

The key word is reasoning. Before generating an answer, DeepCoder produces a chain-of-thought trace: it works through the problem step by step internally before committing to output. More on why that matters for latency.

Qwen3-Coder-30B-A3B-Instruct

The current top of Alibaba’s coding model lineup for local deployment. The “30B-A3B” naming tells you what you need to know architecturally: 30 billion total parameters, ~3.3 billion active per token. This is a Mixture-of-Experts (MoE) model. The router selects a sparse subset of experts for each token, so inference cost scales with active parameters, not total ones.

There’s also Qwen3-Coder-Next, based on an 80B-A3B-Base architecture. Impressive benchmarks, but the total parameter count means it won’t fit on a single 20 GB GPU even heavily quantized.

For the hardware we’re running, Qwen3-Coder-30B-A3B is the right comparison point.


Benchmark Numbers

The usual caveats apply: benchmarks are constructed environments, not real codebases. HumanEval tests short, self-contained functions. SWE-Bench Verified tests the harder problem of navigating a real repo and making a targeted fix. LiveCodeBench uses fresh competitive programming problems to reduce contamination risk.

ModelHumanEvalLiveCodeBenchSWE-Bench VerifiedActive Params
Qwen2.5-Coder-14B-Instruct~72%~44%14B (dense)
DeepCoder-14B-Preview60.6%14B (dense)
Qwen3-Coder-30B-A3B-Instruct69.6%~3.3B
OpenAI o1~59%
OpenAI o3-mini (Low)~60%

A few things worth unpacking here:

DeepCoder on LiveCodeBench (60.6%) is remarkable for a 14B model. It matches o3-mini at Low compute and beats o1. The model achieves this because LiveCodeBench problems are exactly the kind of multi-step algorithmic reasoning where chain-of-thought RL training shines.

Qwen2.5-Coder-14B on LiveCodeBench (~44%) looks weak by comparison, but LiveCodeBench skews toward competitive programming. For “write me a function to parse this JSON” or “what’s wrong with this Ruby method”, the gap narrows considerably.

Qwen3-Coder-30B-A3B on SWE-Bench Verified (69.6%) is the most practically relevant number for real software development. SWE-Bench Verified requires the model to read a GitHub issue, navigate an existing repo, and produce a patch that makes failing tests pass. That’s a much closer proxy to what developers actually need.


The Hardware Reality: What Fits on 20 GB VRAM

Our server is a Hetzner GEX44 — RTX 4000 SFF Ada, 20 GB GDDR6 ECC VRAM. Here’s what each model needs:

ModelFormatVRAM RequiredFits?
Qwen2.5-Coder-14B Q4_K_MGGUF / AWQ~9–10 GBYes, comfortably
DeepCoder-14B-Preview Q4_K_MGGUF / AWQ~9–10 GBYes, comfortably
Qwen3-Coder-30B-A3B Q4_K_MGGUF~16 GBYes, ~4 GB headroom
Qwen3-Coder-30B-A3B Q5_K_MGGUF~20 GBTight but possible
Qwen3-Coder-Next (80B-A3B) Q4GGUF40 GB+No

The MoE architecture of Qwen3-Coder-30B-A3B is what makes it fit. Even though the model has 30 billion total weights, the quantized GGUF file is sized based on total parameters — so Q4_K_M comes in around 16 GB. The 3.3B active parameter figure matters for throughput and latency, not memory footprint.

Practical implication: On our specific hardware, the realistic choices are:

  • Qwen2.5-Coder-14B at Q4 (current — safe headroom for concurrent requests)
  • DeepCoder-14B at Q4 (same VRAM footprint as above, drop-in swap)
  • Qwen3-Coder-30B-A3B at Q4_K_M (fits with ~4 GB for KV cache)

The Latency Trade-off: Reasoning Has a Cost

This is the thing the benchmark tables don’t tell you.

DeepCoder is a reasoning model. When it receives a prompt, it doesn’t go straight to the answer — it works through the problem internally, producing a <think>...</think> trace before the actual response. For hard algorithmic problems (the ones that put it ahead of o1 on LiveCodeBench), this is where the quality comes from. For those problems, the model might think for 1,000–2,000 tokens before writing a single line of code.

In practice, reported latency for DeepCoder on complex tasks is high — in some documented cases, total generation time (thinking + output) reaches several minutes per request.

For our use case at Obmondo — a team of ~10 developers asking coding questions throughout the day — that latency profile is a problem:

  • Quick Q&A (“How do I check if a key exists in a Ruby hash?”) — reasoning overhead is pure cost, no benefit. You want an answer in 3 seconds, not 45.
  • Complex debugging (“This Kubernetes probe keeps failing, here’s the full config and the controller logs”) — reasoning might help, but the developer is still waiting.
  • Alert summarization (“Summarize this Prometheus alert for the on-call person”) — needs to be fast, by definition.

Qwen2.5-Coder and Qwen3-Coder are instruct models. They answer directly. On an RTX 4000 Ada, Qwen2.5-Coder-14B generates 20–40 tokens/second at Q4 — a 200-token response in 5–10 seconds. Qwen3-Coder-30B-A3B at Q4 will be slower due to more total weights to load during generation, but still substantially faster than a reasoning model.


When DeepCoder Makes Sense

Despite the latency issue, DeepCoder-14B is genuinely impressive for specific workloads. It’s worth running if:

  • Your tasks are hard and bounded — competitive programming-style problems, complex algorithm implementation, mathematical code. These are exactly the workloads where it outperforms o1.
  • Latency is acceptable — batch processing, async pipelines, or situations where the developer submits a problem and comes back to it rather than watching a spinner.
  • You want a reasoning trace — the <think> output isn’t just overhead, it’s an explanation of approach. Useful for learning and code review.
  • You’re running it alongside a faster model — nothing stops you from routing simple questions to Qwen2.5-Coder (fast) and complex ones to DeepCoder (thorough). With vLLM’s multi-model support or a lightweight routing layer, this is a legitimate setup.

The open-source commitment from Agentica is also notable. Training code, data, logs, all public. For a team that wants to fine-tune or experiment, that’s valuable.


When Qwen3-Coder-30B-A3B Makes Sense

Qwen3-Coder-30B-A3B is the most interesting option for teams who want to upgrade their local setup without changing hardware.

The SWE-Bench Verified score (69.6%) is the right benchmark to look at for agentic coding — not just answering questions, but making changes to real repos. If you’re building tooling on top of the LLM (auto-PR generation, automated refactoring, feature scaffolding from a description), Qwen3-Coder-30B-A3B is in a different league from Qwen2.5-Coder-14B.

The MoE architecture also means inference cost per token is lower than a dense 14B model at high batch sizes, because the compute-heavy portion only involves ~3.3B active parameters. At low concurrency (1-2 users), the difference is less visible. At higher concurrency, MoE models can show better throughput than their total parameter count suggests.

The catch for our specific hardware: Q4_K_M gives you ~4 GB of headroom on a 20 GB card. The KV cache for a 2,000-token context window at 4 bits across 32 layers eats into that quickly. You can tune --gpu-memory-utilization and --max-model-len in vLLM to manage this, but it requires more careful configuration than Qwen2.5-Coder-14B, which sits comfortably with plenty of overhead.


What We’re Actually Doing

After working through this, our plan:

Primary model stays Qwen2.5-Coder-14B-Instruct. The reasons we chose it originally still hold. It’s fast, it fits, it handles our stack (Python, Ruby, Go, shell) well, and for interactive coding assistance — which is 80% of our use — being responsive matters more than being maximally capable.

We’re testing DeepCoder-14B-Preview for one specific pipeline. The alert summarization use case I described in the last post benefits from deliberate reasoning: “look at this set of alerting signals, what is the likely root cause?” For a model running asynchronously in a webhook pipeline (not blocking a developer waiting at a terminal), the latency is acceptable and the quality improvement for complex alert clusters is worth investigating.

Qwen3-Coder-30B-A3B is on the roadmap. If we end up using the LLM for more agentic tasks — automated PR drafting, scaffolding features from a description — we’ll revisit. The SWE-Bench score suggests it would be meaningfully better for those tasks. The hardware fits. The community builds (Unsloth GGUF, AWQ) exist and are tested. The main reason not to switch now is that our current use cases don’t need it yet.


Practical Decision Framework

If you’re choosing a local coding model for a small team today:

Quick interactive Q&A for 5–15 developers: Qwen2.5-Coder-14B at Q4 or Q8 depending on your VRAM. It’s fast, reliable, and already has battle-tested tooling around it.

Complex, bounded tasks where latency is acceptable: DeepCoder-14B-Preview. If you’re building a pipeline rather than an interactive tool, the reasoning quality at this parameter count is remarkable.

Agentic coding — making repo changes, reviewing PRs, scaffolding features: Qwen3-Coder-30B-A3B if your hardware fits. You won’t get close to this capability at the same VRAM budget with a dense model.

Limited to 10–12 GB VRAM: Qwen2.5-Coder-7B or DeepCoder-14B at aggressive quantization (Q3). Qwen3-Coder-30B-A3B won’t fit.


A Note on “Qwen 3.5”

A few people have asked about Qwen 3.5. At time of writing, the Qwen team has a Qwen3.5 repository on GitHub and materials suggest it’s in development, but no Coder-specific variant has been released in the form of deployable weights. Qwen3-Coder is the current coding-focused generation. I’ll update this post or write a follow-up when Qwen3.5-Coder lands — if the jump from 2.5 to 3 is any signal, it’ll be worth paying attention to.


Running a similar setup or evaluating different models for your team? I’d like to compare notes — find me on LinkedIn or GitHub. If something here is out of date (this space moves fast), open a correction.