Run Your Own AI: Ollama + Open WebUI for Private Local LLMs
Set up a ChatGPT alternative on your homelab with Ollama and Open WebUI. Complete guide covering installation, GPU optimization, model selection, and RAG for document chat.
Table of Contents
- Why Run Local LLMs?
- What You’ll Need
- Hardware Requirements
- Installing Ollama
- Linux / macOS / WSL2
- Verify Installation
- Pull Your First Model
- Adding Open WebUI: Your ChatGPT Interface
- The All-in-One Docker Method (Recommended)
- If You Already Have Ollama Installed
- First-Time Setup
- Model Selection: What to Run
- For General Chat
- For Coding
- For Fast Responses
- For Maximum Quality
- For Massive Context
- RAG: Chat With Your Documents
- Quick Document Chat
- Knowledge Collections
- RAG Configuration Tips
- Performance Optimization
- GPU Acceleration
- Keep Models in Memory
- Context Window Sizing
- Quantization Explained
- Security Considerations
- Network Security
- Production Deployment
- Homelab Integration Examples
- API Integration
- Python Example
- Use Cases
- Troubleshooting
- Model Not Loading
- Slow Responses
- Docker GPU Issues
- Commands Cheat Sheet
- What’s Next?
Running AI models locally used to require a PhD in machine learning. Then Ollama came along and made it as simple as ollama run llama3.1. Add Open WebUI into the mix, and you’ve got yourself a fully private ChatGPT alternative running on your own hardware.
No API keys. No rate limits. No data leaving your network. Just your own AI assistant that you control completely.
Why Run Local LLMs?
Before we dive into the setup, let’s talk about why you’d want to run LLMs locally instead of using ChatGPT or Claude:
- Privacy: Your data never leaves your network. Perfect for sensitive documents, code, or conversations.
- No API Costs: Run unlimited queries without watching your credit card balance.
- Offline Access: Works without internet—useful for air-gapped networks or remote locations.
- Customization: Fine-tune models, adjust parameters, and create custom assistants.
- Learning: Understand how AI actually works by getting hands-on.
The tradeoff? You need hardware. But we’ll cover that too.
What You’ll Need
Hardware Requirements
The beauty of Ollama is its flexibility—it can run on everything from a Raspberry Pi to a multi-GPU workstation. Here’s what you need to know:
Minimum (CPU-only):
- 8GB RAM (16GB recommended)
- 10GB+ free disk space
- Works but slow—5-10x slower than GPU
Recommended (GPU-accelerated):
- NVIDIA GPU with 8GB+ VRAM (CUDA 12+)
- OR Apple Silicon Mac (Metal acceleration built-in)
- 16GB+ system RAM
- NVMe SSD for model storage
VRAM Guide by Model Size:
| Model | VRAM Needed | Good For |
|---|---|---|
| Llama 3.1 8B | ~5-8 GB | General chat, simple tasks |
| Mistral 7B | ~4-7 GB | Fast, efficient responses |
| Qwen 2.5 14B | ~9-12 GB | Better reasoning |
| Mixtral 8x7B | ~12-16 GB | Complex tasks, MoE efficiency |
| Llama 3.3 70B | ~43 GB | Near-GPT-4 quality |
Pro tip: Start small with 7-8B models. They run great on consumer GPUs like an RTX 3060 or even a GTX 1660.
Installing Ollama
Ollama’s installation is refreshingly simple:
Linux / macOS / WSL2
curl -fsSL https://ollama.com/install.sh | sh
That’s it. The script detects your OS, installs Ollama, and sets up the service.
Verify Installation
ollama --version
# ollama version is 0.5.x or newer
Pull Your First Model
# Download Llama 3.1 8B (about 4.9GB)
ollama pull llama3.1
# Start chatting
ollama run llama3.1
You’re now running a capable AI model locally. Try asking it something:
>>> Write a haiku about homelab servers
Blinking lights in dark,
Whisper fans cool silicon,
My data stays home.
Type /bye to exit the chat.
Adding Open WebUI: Your ChatGPT Interface
Command-line chat is fine, but Open WebUI gives you the familiar ChatGPT experience—conversation history, model switching, document uploads, and more.
The All-in-One Docker Method (Recommended)
This single command gives you both Ollama and Open WebUI:
docker run -d -p 3000:8080 \
--gpus=all \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
For CPU-only systems:
docker run -d -p 3000:8080 \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
If You Already Have Ollama Installed
Just add the UI:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
First-Time Setup
- Open http://localhost:3000 in your browser
- Create an admin account (first user becomes admin)
- Click the model selector (top-left) to pull new models
- Start chatting!
Model Selection: What to Run
Ollama’s model library is huge. Here are my recommendations for different use cases:
For General Chat
Llama 3.1 8B is the sweet spot—fast enough for quick responses, capable enough for most tasks.
ollama pull llama3.1
For Coding
DeepSeek-Coder-V2 and Qwen 2.5 Coder 32B excel at code generation and debugging, holding their own against much larger proprietary models on coding benchmarks.
ollama pull deepseek-coder-v2
ollama pull qwen2.5-coder:32b
For Fast Responses
Mistral 7B is incredibly snappy. Great for quick questions and brainstorming.
ollama pull mistral
For Maximum Quality
Llama 3.3 70B approaches GPT-4 territory, but needs ~43GB VRAM (dual RTX 3090s or an A6000).
ollama pull llama3.3:70b
For Massive Context
Llama 4 Scout advertises a staggering 10-million-token context window—perfect for analyzing entire codebases or long documents, though the context you can actually use in Ollama is limited by your VRAM.
ollama pull llama4:scout
RAG: Chat With Your Documents
One of Open WebUI’s killer features is Retrieval Augmented Generation (RAG). Upload documents and ask questions about them.
Quick Document Chat
- Click the paperclip icon in chat
- Upload any PDF, text, or markdown file
- Ask questions about the content
The AI will cite specific parts of your document in responses.
Knowledge Collections
For permanent document libraries:
- Go to Workspace → Knowledge
- Create a new collection (e.g., “Homelab Docs”)
- Upload related documents
- Link the collection to specific models
- Toggle it on when chatting
RAG Configuration Tips
In Admin Panel → Settings → Documents:
- Embedding Model: nomic-embed-text (default) works great
- Chunk Size: 1500 tokens (increase for longer documents)
- Top K: 4-6 results per query (higher = more context, slower)
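To build intuition for the chunk-size setting: RAG pipelines split each document into overlapping chunks before embedding them, and the overlap keeps sentences from being cut in half at chunk boundaries. A minimal sketch, counting characters as a rough stand-in for tokens (real pipelines chunk by tokenizer counts):

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into overlapping chunks. Characters are used here
    as a stand-in for tokens; the idea is the same either way."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks

doc = "x" * 4000
print(len(chunk_text(doc, chunk_size=1500, overlap=200)))  # 3
```

Larger chunks mean fewer embeddings and more context per retrieved hit, but also more irrelevant text dragged into each prompt.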
Performance Optimization
GPU Acceleration
Verify GPU detection:
# Should show your GPU
nvidia-smi
# Check Ollama is using it: run a prompt, then inspect the
# PROCESSOR column of `ollama ps` (e.g. "100% GPU")
ollama run llama3.1 "hi"
ollama ps
Keep Models in Memory
By default, Ollama unloads models after 5 minutes of inactivity. Speed up repeated queries:
# Keep models loaded for 1 hour
export OLLAMA_KEEP_ALIVE=1h
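The same keep-alive can also be set per request through the API via the keep_alive field, which accepts durations like "5m" or "1h", or -1 to pin a model in memory indefinitely. A small sketch (the helper function name is my own):

```python
import json

def generate_payload(model, prompt, keep_alive="1h"):
    """Build a request body for Ollama's /api/generate endpoint.
    keep_alive overrides the server default for this model load."""
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}

payload = generate_payload("llama3.1", "ping", keep_alive="1h")
print(json.dumps(payload))
# POST this to http://localhost:11434/api/generate, e.g.:
# requests.post("http://localhost:11434/api/generate", json=payload)
```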
Context Window Sizing
Larger context = more VRAM. As a rule of thumb:
- Under 24GB VRAM: 4K-8K context (usually fine)
- 24-48GB VRAM: up to 32K context
- 48GB+ VRAM: 128K+ context (needed for long-context models like Llama 4 Scout)
Ollama defaults to a modest context window; to raise it, set the parameter inside a session:
ollama run llama3.1
>>> /set parameter num_ctx 8192
Quantization Explained
Ollama uses 4-bit quantization (Q4_K_M) by default. This reduces model size by ~70% with minimal quality loss:
| Quantization | Size | Quality | Speed |
|---|---|---|---|
| Q4_K_M | ~4.9 GB (8B model) | Good | Fastest |
| Q5_K_M | ~5.5 GB | Better | Fast |
| Q8_0 | ~8 GB | Best | Slower |
| FP16 | ~16 GB | Full | Slowest |
For most homelab use cases, Q4_K_M is the sweet spot.
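The size column follows from simple arithmetic: a model's footprint is roughly parameter count times bits per weight. A back-of-the-envelope sketch (the effective bit widths are approximations, since K-quants mix precisions and GGUF files carry some metadata overhead):

```python
def approx_model_size_gb(params_billion, bits_per_weight):
    """Rough footprint estimate: parameters x bits per weight.
    Treat the result as a ballpark, not an exact file size."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# An 8B model at ~4.5 effective bits (Q4_K_M) lands near 4.5 GB,
# in line with the ~4.9 GB download in the table above.
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{approx_model_size_gb(8, bits):.1f} GB")
```

The same arithmetic explains the VRAM table earlier: a 70B model at ~4.5 bits needs around 40 GB before you even add the KV cache for context.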
Security Considerations
Network Security
Never expose port 11434 directly to the internet. Ollama has no built-in authentication.
If you need remote access:
# Option 1: SSH tunnel
ssh -L 11434:localhost:11434 user@server
# Option 2: Reverse proxy with auth (nginx example)
# Add basic auth in front of /api/*
Production Deployment
For production use:
- Run behind a reverse proxy (Traefik, Nginx, Caddy)
- Add authentication (OAuth, basic auth)
- Use HTTPS/TLS certificates
- Consider container isolation (Docker networks)
- Keep Ollama and Open WebUI updated
Homelab Integration Examples
API Integration
Ollama exposes a REST API on port 11434:
# Simple completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is self-hosting awesome?",
"stream": false
}'
# Chat endpoint
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Explain Docker in one paragraph"}
]
}'
Python Example
import requests
response = requests.post('http://localhost:11434/api/chat', json={
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": False
})
print(response.json()['message']['content'])
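With streaming enabled (the API default), Ollama sends one JSON object per line instead of a single response, so tokens can be printed as they arrive. A sketch of parsing that newline-delimited stream (the stream_tokens helper is my own, demonstrated here against canned lines rather than a live server):

```python
import json

def stream_tokens(lines):
    """Yield content tokens from Ollama's NDJSON /api/chat stream.
    Each line is one JSON object; the final one has "done": true."""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        yield chunk["message"]["content"]
        if chunk.get("done"):
            return

# In practice the lines come from a streaming HTTP response:
#   resp = requests.post("http://localhost:11434/api/chat",
#                        json={"model": "llama3.1", "messages": [...]},
#                        stream=True)
#   for tok in stream_tokens(resp.iter_lines()):
#       print(tok, end="", flush=True)
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": true}',
]
print("".join(stream_tokens(sample)))  # Hello!
```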
Use Cases
- Automation scripts: Generate dynamic content, summarize logs
- Home Assistant integration: Voice assistant with local processing
- Code review: Analyze pull requests without sending code to the cloud
- Document indexing: RAG over your homelab documentation
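To make the log-summarization use case concrete: pack the tail of a log file into a prompt and send it to the generate endpoint. A sketch under my own naming and prompt wording (nothing here is an official helper):

```python
def summarize_logs_prompt(log_lines, max_lines=200):
    """Build a summarization prompt from the most recent log lines.
    Capping the line count keeps the prompt inside the context window."""
    tail = log_lines[-max_lines:]
    return ("Summarize the following service logs; call out errors "
            "and anything unusual:\n\n" + "\n".join(tail))

logs = ["INFO boot ok", "ERROR disk /dev/sda timeout", "INFO retry ok"]
prompt = summarize_logs_prompt(logs)
# POST {"model": "llama3.1", "prompt": prompt, "stream": False}
# to http://localhost:11434/api/generate, as in the API section above.
print(prompt.splitlines()[0])
```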
Troubleshooting
Model Not Loading
# Check logs
journalctl -u ollama -f
# Verify VRAM
nvidia-smi
# Try smaller model first
ollama pull llama3.1:8b
Slow Responses
- Verify GPU is being used (/show info in chat, or check the PROCESSOR column of ollama ps)
- Check VRAM usage during inference
- Try a smaller quantization
- Ensure model fits entirely in VRAM (avoid CPU offload)
Docker GPU Issues
# Test NVIDIA container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
If that fails, install the NVIDIA Container Toolkit:
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Commands Cheat Sheet
# List installed models
ollama list
# Pull a new model
ollama pull llama3.1
# Run interactively
ollama run llama3.1
# Show running models and GPU/CPU placement
ollama ps
# Show model info
ollama show llama3.1
# Delete a model
ollama rm mistral
# Push custom model (if you create one)
ollama push mymodel
# Serve API on custom port
OLLAMA_HOST=0.0.0.0:8080 ollama serve
What’s Next?
Once you’ve got Ollama and Open WebUI running:
- Experiment with models - Try different sizes and see what works for your hardware
- Set up RAG - Upload your homelab documentation and chat with it
- Create custom models - Use Modelfiles to create specialized assistants
- Integrate with your stack - Use the API for automation
- Share with family - Open WebUI supports multiple users
The self-hosted AI revolution is here, and it runs on your hardware. No API keys required.
Have questions or tips about running local LLMs? Drop them in the comments below!