A guide on how to configure opencode to work with llama.cpp for local AI development
# Using Qwen3-Coder with opencode and llama.cpp
In this post, we’ll explore how to use Qwen3-Coder with opencode and llama.cpp for local development.
## Introduction to Qwen3-Coder
Qwen3-Coder is a powerful code generation model developed by Alibaba. It’s designed specifically for coding tasks and can help developers write better code faster.
For the best experience with this model, check out the official Unsloth repository on Hugging Face, which provides optimized GGUF files for llama.cpp.
You can also find more information about Qwen3-Coder and its capabilities on the Unsloth Qwen3-Coder page.
## Setting up opencode
First, you’ll need to install opencode:
```bash
npm install -g opencode-ai
```

Once installed, you can use it to interact with Qwen models directly from your terminal.
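If you want to double-check the install before going further, a quick sanity check might look like this (assuming the `opencode` binary lands on your PATH and supports the usual `--version` flag; check `opencode --help` for your version):

```bash
# Confirm the CLI is installed and on your PATH
opencode --version

# Launch the interactive session in the current project directory
opencode
```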
## Running Qwen3-Coder with Docker
You can run Qwen3-Coder using Docker for an easier setup experience:
```bash
docker run \
  --rm \
  --shm-size 16g \
  --device=nvidia.com/gpu=all \
  -v ~/models:/root/.cache/llama.cpp \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  --hf-repo unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  --hf-file Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --temp 0.7 \
  --min-p 0.0 \
  --top-p 0.80 \
  --top-k 20 \
  --repeat-penalty 1.05 \
  --jinja \
  --host 0.0.0.0 \
  --port 8080 \
  -c 32684 \
  -ngl 99
```

This command starts a Docker container that runs the llama.cpp server with the Qwen3-Coder model. It maps your local models directory into the container, sets up GPU access for faster inference, and exposes port 8080 for API access.
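Before pointing opencode at it, it's worth confirming the server actually responds. A minimal smoke test against llama.cpp's built-in HTTP endpoints could look like this (the prompt is just an illustration; swap `localhost` for whatever host you bound the server to):

```bash
# Check that the server is up and the model has finished loading
curl http://localhost:8080/health

# Send a small request through the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "temperature": 0.7
  }'
```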
I’m using the Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf model because it fits comfortably within the 24GB VRAM of my NVIDIA RTX 4090. This particular model is both very fast and highly accurate for code generation tasks.
## Configuring opencode for Local Inference
To configure opencode to work with your local llama.cpp server, add this to your opencode configuration:
{ "provider": { "llamacpp": { "npm": "@ai-sdk/openai-compatible", "models": { "unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf": { "name": "Qwen3-Coder (unsloth-Q4)" } }, "options": { "baseURL": "http://100.112.153.1:8080/v1", "apiKey": "sk-local" } } }}This configuration makes opencode communicate with your local llama.cpp server using the OpenAI API interface, allowing you to leverage the full power of Qwen3-Coder locally.
## Combining opencode and llama.cpp
You can leverage both tools for different use cases:
- Use opencode for quick interactions and integrations
- Use llama.cpp for local inference when privacy or offline access is required
This combination gives you the flexibility to work in different environments while still having access to powerful code generation capabilities.
## Conclusion
By combining Qwen3-Coder with opencode and llama.cpp, developers get a robust code generation toolchain that works both online and offline, with flexibility and control over their development workflow.
PS: The reason you get cut-off tool messages and things look a bit wonky (see the cover image) is that Qwen3-Coder uses a new tool-call format, and llama.cpp doesn't yet have a parser for it. It still works great, though.
PPS: Article was generated with help from opencode + Qwen3-Coder + Llama.cpp :)