Running OpenClaw with Ollama: Local Models Guide
How to run OpenClaw with Ollama for free, private, local LLM inference. Covers hardware requirements, model picks by GPU/RAM tier, Nanbeige4.1-3B for low-end machines, full configuration, and fallback strategies.
I’ve been running OpenClaw with API providers for a while now, and the bills add up. Even cheap models like GLM-5 and MiniMax M2.5 still cost something, and every message goes through someone else’s server. So I set up Ollama on the same box and pointed OpenClaw at it. Zero cost per token, total privacy, and the latency is actually decent if you pick the right model for your hardware.
This guide covers how to get OpenClaw talking to Ollama, which models work well for different hardware tiers, and what to do when your machine can’t handle the bigger ones. Short answer on that last part: Nanbeige4.1-3B is shockingly good for a 3B model.
What You'll Need
- OpenClaw installed and running (setup guide here)
- A machine with at least 8GB RAM (16GB+ recommended for good models)
- Ollama installed (one command on Linux/macOS)
- No API keys required — everything runs locally
Why Run Local Models with OpenClaw
Three reasons to bother with this instead of just using an API:
- $0 per token — Ollama is free. No API bills, no rate limits, no usage caps
- Privacy — conversations never leave your machine. No third-party logging, no data retention policies to read
- No internet dependency — works on an airgapped network, during outages, on a plane
The tradeoff is real: local models are slower and less capable than Claude Opus or GPT-5. But for everyday OpenClaw stuff like answering messages and running scheduled jobs, a decent local model handles it fine.
Hardware Requirements
This is the part most guides get wrong. They tell you the model size and forget about the actual experience. Here’s what I’ve found running different tiers.
VRAM Is What Matters
Models run in GPU VRAM when available. If the model doesn’t fit in VRAM, it spills into system RAM and gets much slower. CPU-only inference works but expect 5-10x slower responses.
| Hardware Tier | VRAM / RAM | Best Model Size | Response Speed |
|---|---|---|---|
| Low-end (no GPU, 8GB RAM) | 8GB system | 1B-3B models | Slow (5-15 tok/s) |
| Mid-range (no GPU, 16-32GB RAM) | 16-32GB system | 7B-8B models | Moderate (10-25 tok/s) |
| GPU entry (RTX 3060 12GB / M1 16GB) | 12-16GB | 7B-14B models | Good (30-60 tok/s) |
| GPU mid (RTX 4070 Ti 16GB / M2 Pro 32GB) | 16-32GB | 14B-32B models | Good (40-80 tok/s) |
| GPU high (RTX 4090 24GB / M4 Max 128GB) | 24-128GB | 32B-70B models | Fast (50-100+ tok/s) |
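To see which tier you're in before pulling anything, you can check your memory directly. A quick sketch for Linux (it reads /proc/meminfo, so it won't work as-is on macOS, and nvidia-smi only exists on NVIDIA systems):

```shell
# Total system RAM from /proc/meminfo (Linux); the value there is in kB.
awk '/^MemTotal/ {printf "System RAM: %.1f GB\n", $2/1048576}' /proc/meminfo

# VRAM, if an NVIDIA GPU is present; otherwise note its absence.
nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null \
  || echo "no NVIDIA GPU detected"
```

On Apple Silicon, skip the GPU check: total system RAM is effectively your VRAM budget.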
Apple Silicon Note
M1/M2/M3/M4 Macs share unified memory between CPU and GPU, so Ollama can use all system RAM as VRAM. A Mac Mini M4 with 32GB RAM can comfortably run 14B-32B models. That’s why Mac Minis are popular for OpenClaw setups.
Recommended Builds
Hetzner CAX31 or similar
- 8 vCPU (ARM), 32GB RAM, no GPU
- Runs 7B-8B models at usable speeds
- Best model: qwen3:8b or gpt-oss:20b
- Cost: ~€15/mo on Hetzner
Good enough for a personal assistant handling messages and simple tasks. Don’t expect fast responses on complex coding problems.
Mac Mini M4 with 32GB RAM
- Unified memory acts as VRAM
- Runs 14B-32B models comfortably
- Best model: qwen2.5-coder:32b or qwen3:32b
- Cost: one-time ~$800
The sweet spot for most people. 32B models handle coding, writing, and tool calls well. This is what I’d buy if starting fresh.
RTX 4090 24GB or dual GPU setup
- 24GB VRAM for a single 70B quantized model
- Best model: llama3.3:70b or deepseek-r1:70b
- Cost: one-time ~$2000+ for the GPU
The 70B models come close to API-quality responses. Worth it if you’re replacing a $100+/mo API habit.
Any machine with 8GB RAM
- CPU-only inference, slow but works
- Best model: nanbeige4.1-3b (see section below)
- Fallback: qwen3:1.7b or llama3.2:3b
Surprisingly usable for simple conversations and basic tasks. Not great for coding.
Installing Ollama
One command:
curl -fsSL https://ollama.ai/install.sh | sh
On macOS, download from ollama.ai or use Homebrew:
brew install ollama
Check it’s running:
ollama --version
ollama list
If ollama list works, you’re set.
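If you want this check scripted (for a setup script or a cron job), a minimal health probe against the default API port works too. This is a sketch: the check_ollama helper is my own name, not part of Ollama, and it assumes curl is installed:

```shell
# Probe the Ollama API on its default port; exit non-zero if unreachable.
check_ollama() {
  local host="${1:-localhost:11434}"
  if curl -sf --max-time 3 "http://$host/api/tags" >/dev/null; then
    echo "ollama is up on $host"
  else
    echo "ollama not reachable on $host (is 'ollama serve' running?)"
    return 1
  fi
}

check_ollama || true
```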
Picking the Right Model
Not every model works well with OpenClaw. You need tool calling support (so the agent can use its tools) and ideally good instruction following. Here are the models I’ve tested, ranked by hardware tier.
Best Models by Tier
| Model | Size | VRAM Needed | Tool Calling | Best For |
|---|---|---|---|---|
| qwen2.5-coder:32b | 32B | ~20GB | Yes | Coding, complex tasks |
| qwen3:32b | 32B | ~20GB | Yes | General + reasoning |
| gpt-oss:20b | 20B | ~14GB | Yes | General, tool use |
| qwen3:8b | 8B | ~6GB | Yes | General purpose |
| llama3.3:70b | 70B | ~42GB | Yes | Best local quality |
| qwen2.5-coder:14b | 14B | ~10GB | Yes | Coding on mid-range |
| deepseek-r1:14b | 14B | ~10GB | Yes | Reasoning tasks |
| mistral-small3.2:24b | 24B | ~16GB | Yes | Vision + tools |
| qwen3:4b | 4B | ~3GB | Yes | Budget general |
| nanbeige4.1-3b* | 3B | ~2.5GB | Yes | Budget, see below |
* Community upload on Ollama — see Nanbeige section below.
My Picks
- 32GB Mac Mini or 16GB+ GPU: qwen2.5-coder:32b as primary, qwen3:8b as fallback
- 16GB RAM, no GPU: qwen3:8b as primary, qwen3:4b as fallback
- 8GB RAM: nanbeige4.1-3b or qwen3:4b
Pull your chosen model:
ollama pull qwen2.5-coder:32b
# or for lower-end hardware:
ollama pull qwen3:8b
# or for minimal hardware:
ollama pull tomng/nanbeige4.1
Nanbeige4.1-3B: The Budget Surprise
This model caught me off guard. Nanbeige4.1-3B is a 3B parameter model from Nanbeige Lab (a team at Kanzhun/BOSS Zhipin) that punches way above its weight class. Look at these numbers:
| Benchmark | Nanbeige4.1-3B | Qwen3-4B | Qwen3-8B | Qwen3-32B |
|---|---|---|---|---|
| LiveCodeBench-V6 | 76.9 | 57.4 | 49.4 | 55.7 |
| AIME 2026 I (math) | 87.4 | 81.5 | 70.4 | 75.8 |
| GPQA (science) | 83.8 | 65.8 | 62.0 | 68.4 |
| Arena-Hard-v2 (alignment) | 73.2 | 34.9 | 26.3 | 56.0 |
| BFCL-V4 (tool use) | 56.5 | 44.9 | 42.2 | 47.9 |
A 3B model beating Qwen3-32B on coding benchmarks. It scores 56.5 on BFCL-V4 (tool use), which matters because OpenClaw relies on tool calling. It can also handle over 500 rounds of tool invocations for deep-search tasks. I don’t know of another sub-4B model that can do that.
When to Use It
- Your machine has 8GB RAM and no GPU
- You want a backup model that uses minimal resources
- You’re running on a Raspberry Pi 5 or similar ARM board
- You need an emergency fallback when your main model is too slow
When NOT to Use It
- You have hardware for bigger models — 8B+ will still give better results on complex tasks
- You need long creative writing or nuanced conversation
- You’re doing heavy multi-file code refactoring
Installing Nanbeige4.1-3B
The model is available through community uploads on Ollama. The most popular version with tool support:
ollama pull tomng/nanbeige4.1
You can also import the GGUF from HuggingFace if you want a specific quantization:
# Download the Q4_K_M quantization (smallest useful size, ~2.3GB)
# From: huggingface.co/Edge-Quant/Nanbeige4.1-3B-Q4_K_M-GGUF
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nanbeige4.1-3B-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
{{ .Content }}
{{- else if eq .Role "assistant" }}
{{ .Content }}
{{- end }}
{{- end }}"""
EOF
ollama create nanbeige4.1-3b -f Modelfile
Configuring OpenClaw for Ollama
Step 1: Enable Ollama
Set the environment variable that tells OpenClaw to look for Ollama:
export OLLAMA_API_KEY="ollama-local"
Or add it permanently to your OpenClaw environment:
# Add to ~/.openclaw/.env
echo 'OLLAMA_API_KEY=ollama-local' >> ~/.openclaw/.env
The value doesn’t matter (Ollama doesn’t check it), but OpenClaw needs it set to enable the Ollama provider.
Step 2: Verify Discovery
OpenClaw auto-discovers tool-capable Ollama models. Check what it found:
openclaw models list --local
You should see your pulled models listed. If a model doesn’t appear, it might not report tool support. You can still use it by adding explicit config (see Step 3).
Step 3: Set Your Model
For a straightforward setup with one model:
openclaw models set ollama/qwen2.5-coder:32b
Or edit your config file (~/.openclaw/openclaw.json):
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5-coder:32b"
}
}
}
}
Step 4: Set Up Fallbacks
Fallbacks matter more with local models because a single model might be too slow or run out of context. Configure a chain:
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5-coder:32b",
"fallbacks": [
"ollama/qwen3:8b",
"ollama/tomng/nanbeige4.1"
]
}
}
}
}
If the 32B model chokes on a long context, OpenClaw falls through to the 8B, then to Nanbeige.
Full Config Example (Mid-Range Hardware)
Here’s a complete config for a 32GB Mac Mini:
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5-coder:32b",
"fallbacks": ["ollama/qwen3:8b"]
},
"imageModel": {
"primary": "ollama/mistral-small3.2:24b"
}
}
}
}
Full Config Example (Low-End Hardware)
For an 8GB RAM machine or Raspberry Pi:
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/tomng/nanbeige4.1",
"fallbacks": ["ollama/qwen3:1.7b"]
}
}
}
}
Explicit Provider Config (Remote Ollama)
If Ollama runs on a different machine (say a GPU server on your LAN), you need explicit config instead of auto-discovery:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://192.168.1.50:11434",
"apiKey": "ollama-local",
"api": "ollama",
"models": [
{
"id": "qwen2.5-coder:32b",
"name": "Qwen 2.5 Coder 32B",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 32768,
"maxTokens": 8192
}
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5-coder:32b"
}
}
}
}
Security Reminder
If you’re exposing Ollama over the network, make sure it’s on a trusted network (Tailscale, LAN behind firewall). Ollama has no built-in authentication. See the OpenClaw security guide for hardening your setup.
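One concrete mitigation: Ollama's listen address is controlled by the OLLAMA_HOST environment variable. Keep it on loopback (its default) unless you deliberately expose it, and reach a remote instance through a tunnel rather than binding to all interfaces. A sketch:

```shell
# Loopback-only is Ollama's default; setting it explicitly documents the intent.
export OLLAMA_HOST="127.0.0.1:11434"
echo "Ollama will listen on $OLLAMA_HOST"

# To reach a remote instance, prefer a tunnel over binding to 0.0.0.0, e.g.:
#   ssh -L 11434:localhost:11434 user@gpu-box
# (or put both machines on a Tailscale network)
```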
Hybrid Setup: Local + API Fallback
The setup I actually recommend is a hybrid. Local model as primary, cheap API as fallback for when the local model struggles. You get privacy and zero cost for 90% of messages, and API quality is there when you need it.
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5-coder:32b",
"fallbacks": [
"ollama/qwen3:8b",
"zai/glm-5"
]
}
}
}
}
With this config, OpenClaw tries the local 32B first. If context is too long or the model errors, it falls to the local 8B. If that also fails, it hits the GLM-5 API (which is cheap). You can set up GLM-5 or MiniMax M2.5 as your API fallback following the best open source models guide.
Switching Models On the Fly
You don’t have to restart OpenClaw to change models. Use the /model command in chat:
/model # Show available models
/model list # Full list with providers
/model ollama/qwen3:8b # Switch to a different model
/model status # Check current model + auth
Good for testing: pull a new model with ollama pull, and it shows up in /model list automatically (with auto-discovery enabled).
Performance Tuning
Context Window
Ollama reports the context window from the model metadata. You can override it in explicit config:
{
"models": {
"providers": {
"ollama": {
"models": [{
"id": "qwen2.5-coder:32b",
"contextWindow": 65536,
"maxTokens": 16384
}]
}
}
}
}
Larger context windows use more VRAM. If you’re hitting memory limits, reduce contextWindow.
Running Multiple Models
Ollama can keep multiple models loaded if you have the VRAM. OLLAMA_MAX_LOADED_MODELS controls how many models stay resident, and OLLAMA_NUM_PARALLEL controls concurrent requests per model:
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=2 ollama serve
With enough VRAM, you can run a coding model and a general model simultaneously, switching between them with /model in OpenClaw.
GPU Layers
If you have a GPU but not enough VRAM for the full model, Ollama automatically offloads some layers to CPU. You can control the split yourself with the num_gpu parameter, set per request or baked into a Modelfile:
PARAMETER num_gpu 35
More layers on GPU = faster but more VRAM. Experiment to find your sweet spot.
Troubleshooting
Common Problems
Model doesn’t show up in openclaw models list
OpenClaw auto-discovery only shows models with tool support. Either pull a tool-capable model or define it explicitly in models.providers.ollama. Check with:
ollama list
curl http://localhost:11434/api/tags
Responses are extremely slow
Your model is probably running on CPU. Check if the model fits in your available VRAM/RAM. Solutions:
- Switch to a smaller model (qwen3:8b instead of qwen3:32b)
- On Mac: close other apps to free unified memory
- On Linux with GPU: check nvidia-smi for VRAM usage
“Connection refused” errors
Ollama isn’t running. Start it:
ollama serve
Or on macOS, open the Ollama app. Check the API is accessible:
curl http://localhost:11434/api/tags
Tool calls failing or being ignored
The model might not support tool calling well. Switch to a model known to handle tools: qwen2.5-coder, qwen3, gpt-oss, or llama3.3. Avoid older models like llama2 or codellama for OpenClaw.
Out of memory crashes
The model is too large. Either:
- Pull a smaller quantization: ollama pull qwen3:8b-q4_0
- Switch to a smaller model
- Add more RAM/swap (not ideal but works)
Nanbeige4.1-3B not found
It’s a community upload, not in the official library. Pull with the namespace:
ollama pull tomng/nanbeige4.1
Security Considerations
Running models locally is inherently more private than API calls, but don’t skip the basics:
- Ollama has no auth: anyone on your network who can reach port 11434 can use your models. Bind to localhost or use a firewall.
- Model downloads: Ollama pulls models from ollama.com. If you’re security-conscious, verify model checksums or use airgapped installs.
- OpenClaw security: local models don’t change the OpenClaw threat model. Still follow the security hardening guide for gateway binding, sandbox mode, and skill vetting.
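The checksum point above is easy to act on for manually downloaded GGUFs. A sketch with sha256sum; the hash below is just the SHA-256 of an empty file, standing in for the real checksum you'd copy from the model's release page:

```shell
# Verify a downloaded GGUF against a published checksum before importing it.
# Placeholder: model.gguf is empty and the hash is the SHA-256 of empty input.
# Substitute the real file and the checksum from the model's release page.
touch model.gguf
echo "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  model.gguf" \
  | sha256sum -c -
```

sha256sum -c prints "model.gguf: OK" on a match and exits non-zero on a mismatch, so it slots cleanly into setup scripts.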
Related Guides
- OpenClaw Setup Guide — full installation on Hetzner VPS or Mac Mini
- Best Open Source Models for OpenClaw — API-based models (GLM-5, MiniMax M2.5) for when local isn’t enough
- DuckDuckGo Search for OpenClaw — free web search that pairs well with local models
- OpenClaw Security Guide — hardening your instance, especially relevant when running on a LAN
- Best OpenClaw Dashboards — dashboards that show model usage and cost (useful for tracking $0 local usage vs API fallback spend)
- OpenClaw Alternatives — other platforms with their own local model support
Frequently Asked Questions
Can I run OpenClaw entirely offline with Ollama?
Yes, once the models are pulled and OpenClaw is installed. No internet required for inference. You won’t have web search or channel integrations (Telegram, WhatsApp), but CLI and local sessions work fine.
Which model gives the best quality per dollar?
There are no dollars involved — Ollama is free. The question is quality per hardware. On a 32GB Mac Mini, qwen2.5-coder:32b gives the best results I’ve seen. On 8GB RAM, nanbeige4.1-3b is the best tradeoff between speed and capability.
How does Nanbeige4.1-3B compare to API models like GLM-5?
GLM-5 is still better for complex tasks, especially multi-step coding. Nanbeige4.1-3B is competitive on benchmarks, but real-world OpenClaw usage means long contexts and multi-turn conversations where bigger models have an edge. Use Nanbeige for quick tasks and messages, API models for the heavy stuff.
Can I use Ollama and API providers at the same time?
Yes, and I recommend it. Set an Ollama model as primary and a cheap API model as fallback. OpenClaw handles the switching automatically when a model fails or can’t handle the context.
Does tool calling work with all Ollama models?
No. OpenClaw’s auto-discovery only shows models that report tool support. Older models and some smaller ones don’t support it. The models listed in this guide all support tool calling. You can check with ollama show <model> and look for tools in the capabilities.
What about running Ollama on a Raspberry Pi?
A Raspberry Pi 5 with 8GB RAM can run 1B-3B models. Nanbeige4.1-3B or qwen3:1.7b are your best options. Responses will be slow (2-5 tokens/second) but functional for simple tasks. Don’t expect it to handle coding questions.
Is vLLM better than Ollama for OpenClaw?
vLLM gives better throughput on multi-GPU setups and production workloads. Ollama is simpler to set up and better for single-user scenarios. For a personal OpenClaw instance, Ollama is the right choice. If you’re running multiple agents or need concurrent inference, look at vLLM.
How much RAM do I actually need for X model?
Rough formula: model parameters × 0.6GB for Q4 quantization. So a 7B model needs ~4.2GB, a 14B needs ~8.4GB, a 32B needs ~19.2GB. Add 2-4GB headroom for the system and Ollama overhead. These are minimums — more RAM means larger context windows.
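That rule of thumb is easy to script. A throwaway helper (estimate_ram is my own name, and the 0.6 GB/B factor is just this article's Q4 approximation, not an official sizing tool):

```shell
# Estimate RAM for a Q4-quantized model: params (billions) x 0.6 GB, plus headroom.
estimate_ram() {
  awk -v b="$1" 'BEGIN { printf "%sB model: ~%.1f GB + 2-4 GB headroom\n", b, b * 0.6 }'
}

estimate_ram 7    # → 7B model: ~4.2 GB + 2-4 GB headroom
estimate_ram 32   # → 32B model: ~19.2 GB + 2-4 GB headroom
```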
Local models with Ollama won’t replace Claude Opus for complex work. But they cover 80-90% of what I use OpenClaw for: answering messages, running cron jobs, lookups, automations. The hybrid setup (local primary + API fallback) means privacy and zero cost most of the time, with API quality when you actually need it.
Start with whatever model fits your hardware. If you’ve got a Mac Mini, go straight to qwen2.5-coder:32b. If you’re on a budget VPS with 8GB RAM, pull Nanbeige4.1-3B and be surprised.
For the rest of the OpenClaw stack: setup guide, API model recommendations, free web search, security hardening, dashboards, and alternative platforms.