Run Your Own AI: Ollama + Open WebUI for Private Local LLMs
Set up a ChatGPT alternative on your homelab with Ollama and Open WebUI. Complete guide covering installation, GPU optimization, model selection, and RAG for document chat.
Table of Contents
- Why Run Local LLMs?
- What You’ll Need
- Hardware Requirements
- Installing Ollama
- Linux / macOS / WSL2
- Verify Installation
- Pull Your First Model
- Adding Open WebUI: Your ChatGPT Interface
- The All-in-One Docker Method (Recommended)
- If You Already Have Ollama Installed
- First-Time Setup
- Model Selection: What to Run
- For General Chat
- For Coding
- For Fast Responses
- For Maximum Quality
- For Massive Context
- RAG: Chat With Your Documents
- Quick Document Chat
- Knowledge Collections
- RAG Configuration Tips
- Performance Optimization
- GPU Acceleration
- Keep Models in Memory
- Context Window Sizing
- Quantization Explained
- Security Considerations
- Network Security
- Production Deployment
- Homelab Integration Examples
- API Integration
- Python Example
- Use Cases
- Troubleshooting
- Model Not Loading
- Slow Responses
- Docker GPU Issues
- Commands Cheat Sheet
- What’s Next?
Running AI models locally used to require a PhD in machine learning. Then Ollama came along and made it as simple as ollama run llama3.1. Add Open WebUI into the mix, and you’ve got yourself a fully private ChatGPT alternative running on your own hardware.
No API keys. No rate limits. No data leaving your network. Just your own AI assistant that you control completely.
Why Run Local LLMs?
Before we dive into the setup, let’s talk about why you’d want to run LLMs locally instead of using ChatGPT or Claude:
- Privacy: Your data never leaves your network. Perfect for sensitive documents, code, or conversations.
- No API Costs: Run unlimited queries without watching your credit card balance.
- Offline Access: Works without internet—useful for air-gapped networks or remote locations.
- Customization: Fine-tune models, adjust parameters, and create custom assistants.
- Learning: Understand how AI actually works by getting hands-on.
The tradeoff? You need hardware. But we’ll cover that too.
What You’ll Need
Hardware Requirements
The beauty of Ollama is its flexibility—it can run on everything from a Raspberry Pi to a multi-GPU workstation. Here’s what you need to know:
Minimum (CPU-only):
- 8GB RAM (16GB recommended)
- 10GB+ free disk space
- Works but slow—5-10x slower than GPU
Recommended (GPU-accelerated):
- NVIDIA GPU with 8GB+ VRAM (CUDA 12+)
- OR Apple Silicon Mac (Metal acceleration built-in)
- 16GB+ system RAM
- NVMe SSD for model storage
VRAM Guide by Model Size:
| Model | VRAM Needed | Good For |
|---|---|---|
| Llama 3.1 8B | ~5-8 GB | General chat, simple tasks |
| Mistral 7B | ~4-7 GB | Fast, efficient responses |
| Qwen 2.5 14B | ~9-12 GB | Better reasoning |
| Mixtral 8x7B | ~12-16 GB | Complex tasks, MoE efficiency |
| Llama 3.3 70B | ~43 GB | Near-GPT-4 quality |
Pro tip: Start small with 7-8B models. They run great on consumer GPUs like an RTX 3060 or even a GTX 1660.
Installing Ollama
Ollama’s installation is refreshingly simple:
Linux / macOS / WSL2
curl -fsSL https://ollama.com/install.sh | sh
That’s it. The script detects your OS, installs Ollama, and sets up the service.
Verify Installation
ollama --version
# ollama version is 0.5.x or newer
Pull Your First Model
# Download Llama 3.1 8B (about 4.9GB)
ollama pull llama3.1
# Start chatting
ollama run llama3.1
You’re now running a capable AI model locally. Try asking it something:
>>> Write a haiku about homelab servers
Blinking lights in dark,
Whisper fans cool silicon,
My data stays home.
Type /bye to exit the chat.
Adding Open WebUI: Your ChatGPT Interface
Command-line chat is fine, but Open WebUI gives you the familiar ChatGPT experience—conversation history, model switching, document uploads, and more.
The All-in-One Docker Method (Recommended)
This single command gives you both Ollama and Open WebUI:
docker run -d -p 3000:8080 \
--gpus=all \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
For CPU-only systems:
docker run -d -p 3000:8080 \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
If You Already Have Ollama Installed
Just add the UI:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
First-Time Setup
- Open http://localhost:3000 in your browser
- Create an admin account (first user becomes admin)
- Click the model selector (top-left) to pull new models
- Start chatting!
Model Selection: What to Run
Ollama’s model library is huge. Here are my recommendations for different use cases:
For General Chat
Llama 3.1 8B is the sweet spot—fast enough for quick responses, capable enough for most tasks.
ollama pull llama3.1
For Coding
DeepSeek-Coder-V2 and Qwen 2.5 Coder 32B excel at code generation and debugging, holding their own against much larger proprietary models on coding benchmarks.
ollama pull deepseek-coder-v2
ollama pull qwen2.5-coder:32b
For Fast Responses
Mistral 7B is incredibly snappy. Great for quick questions and brainstorming.
ollama pull mistral
For Maximum Quality
Llama 3.3 70B approaches GPT-4 territory, but needs ~43GB VRAM (dual RTX 3090s or an A6000).
ollama pull llama3.3:70b
For Massive Context
Llama 4 Scout advertises a staggering 10-million-token context window—perfect for analyzing entire codebases or long documents, though the context you can actually use in Ollama is limited by your VRAM.
ollama pull llama4:scout
RAG: Chat With Your Documents
One of Open WebUI’s killer features is Retrieval Augmented Generation (RAG). Upload documents and ask questions about them.
Quick Document Chat
- Click the paperclip icon in chat
- Upload any PDF, text, or markdown file
- Ask questions about the content
The AI will cite specific parts of your document in responses.
Knowledge Collections
For permanent document libraries:
- Go to Workspace → Knowledge
- Create a new collection (e.g., “Homelab Docs”)
- Upload related documents
- Link the collection to specific models
- Toggle it on when chatting
RAG Configuration Tips
In Admin Panel → Settings → Documents:
- Embedding Model: nomic-embed-text (default) works great
- Chunk Size: 1500 tokens (increase for longer documents)
- Top K: 4-6 results per query (higher = more context, slower)
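To build intuition for the chunk-size setting: RAG pipelines split each document into overlapping chunks before embedding them, and the overlap keeps sentences from being cut in half at chunk boundaries. A minimal sketch, counting characters as a rough stand-in for tokens (real pipelines chunk by tokenizer counts):

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into overlapping chunks. Characters are used here
    as a stand-in for tokens; the idea is the same either way."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks

doc = "x" * 4000
print(len(chunk_text(doc, chunk_size=1500, overlap=200)))  # 3
```

Larger chunks mean fewer embeddings and more context per retrieved hit, but also more irrelevant text dragged into each prompt.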
Performance Optimization
GPU Acceleration
Verify GPU detection:
# Should show your GPU
nvidia-smi
# Check Ollama is using it: run a prompt, then inspect the
# PROCESSOR column of `ollama ps` (e.g. "100% GPU")
ollama run llama3.1 "hi"
ollama ps
Keep Models in Memory
By default, Ollama unloads models after 5 minutes of inactivity. Speed up repeated queries:
# Keep models loaded for 1 hour
export OLLAMA_KEEP_ALIVE=1h
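The same keep-alive can also be set per request through the API via the keep_alive field, which accepts durations like "5m" or "1h", or -1 to pin a model in memory indefinitely. A small sketch (the helper function name is my own):

```python
import json

def generate_payload(model, prompt, keep_alive="1h"):
    """Build a request body for Ollama's /api/generate endpoint.
    keep_alive overrides the server default for this model load."""
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}

payload = generate_payload("llama3.1", "ping", keep_alive="1h")
print(json.dumps(payload))
# POST this to http://localhost:11434/api/generate, e.g.:
# requests.post("http://localhost:11434/api/generate", json=payload)
```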
Context Window Sizing
Larger context = more VRAM. As a rule of thumb:
- Under 24GB VRAM: 4K-8K context (usually fine)
- 24-48GB VRAM: up to 32K context
- 48GB+ VRAM: 128K+ context (needed for long-context models like Llama 4 Scout)
Ollama defaults to a modest context window; to raise it, set the parameter inside a session:
ollama run llama3.1
>>> /set parameter num_ctx 8192
Quantization Explained
Ollama uses 4-bit quantization (Q4_K_M) by default. This reduces model size by ~70% with minimal quality loss:
| Quantization | Size | Quality | Speed |
|---|---|---|---|
| Q4_K_M | ~4.9 GB (8B model) | Good | Fastest |
| Q5_K_M | ~5.5 GB | Better | Fast |
| Q8_0 | ~8 GB | Best | Slower |
| FP16 | ~16 GB | Full | Slowest |
For most homelab use cases, Q4_K_M is the sweet spot.
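The size column follows from simple arithmetic: a model's footprint is roughly parameter count times bits per weight. A back-of-the-envelope sketch (the effective bit widths are approximations, since K-quants mix precisions and GGUF files carry some metadata overhead):

```python
def approx_model_size_gb(params_billion, bits_per_weight):
    """Rough footprint estimate: parameters x bits per weight.
    Treat the result as a ballpark, not an exact file size."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# An 8B model at ~4.5 effective bits (Q4_K_M) lands near 4.5 GB,
# in line with the ~4.9 GB download in the table above.
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{approx_model_size_gb(8, bits):.1f} GB")
```

The same arithmetic explains the VRAM table earlier: a 70B model at ~4.5 bits needs around 40 GB before you even add the KV cache for context.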
Security Considerations
Network Security
Never expose port 11434 directly to the internet. Ollama has no built-in authentication.
If you need remote access:
# Option 1: SSH tunnel
ssh -L 11434:localhost:11434 user@server
# Option 2: Reverse proxy with auth (nginx example)
# Add basic auth in front of /api/*
Production Deployment
For production use:
- Run behind a reverse proxy (Traefik, Nginx, Caddy)
- Add authentication (OAuth, basic auth)
- Use HTTPS/TLS certificates
- Consider container isolation (Docker networks)
- Keep Ollama and Open WebUI updated
Homelab Integration Examples
API Integration
Ollama exposes a REST API on port 11434:
# Simple completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is self-hosting awesome?",
"stream": false
}'
# Chat endpoint
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Explain Docker in one paragraph"}
]
}'
Python Example
import requests
response = requests.post('http://localhost:11434/api/chat', json={
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": False
})
print(response.json()['message']['content'])
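With streaming enabled (the API default), Ollama sends one JSON object per line instead of a single response, so tokens can be printed as they arrive. A sketch of parsing that newline-delimited stream (the stream_tokens helper is my own, demonstrated here against canned lines rather than a live server):

```python
import json

def stream_tokens(lines):
    """Yield content tokens from Ollama's NDJSON /api/chat stream.
    Each line is one JSON object; the final one has "done": true."""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        yield chunk["message"]["content"]
        if chunk.get("done"):
            return

# In practice the lines come from a streaming HTTP response:
#   resp = requests.post("http://localhost:11434/api/chat",
#                        json={"model": "llama3.1", "messages": [...]},
#                        stream=True)
#   for tok in stream_tokens(resp.iter_lines()):
#       print(tok, end="", flush=True)
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": true}',
]
print("".join(stream_tokens(sample)))  # Hello!
```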
Use Cases
- Automation scripts: Generate dynamic content, summarize logs
- Home Assistant integration: Voice assistant with local processing
- Code review: Analyze pull requests without sending code to the cloud
- Document indexing: RAG over your homelab documentation
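To make the log-summarization use case concrete: pack the tail of a log file into a prompt and send it to the generate endpoint. A sketch under my own naming and prompt wording (nothing here is an official helper):

```python
def summarize_logs_prompt(log_lines, max_lines=200):
    """Build a summarization prompt from the most recent log lines.
    Capping the line count keeps the prompt inside the context window."""
    tail = log_lines[-max_lines:]
    return ("Summarize the following service logs; call out errors "
            "and anything unusual:\n\n" + "\n".join(tail))

logs = ["INFO boot ok", "ERROR disk /dev/sda timeout", "INFO retry ok"]
prompt = summarize_logs_prompt(logs)
# POST {"model": "llama3.1", "prompt": prompt, "stream": False}
# to http://localhost:11434/api/generate, as in the API section above.
print(prompt.splitlines()[0])
```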
Troubleshooting
Model Not Loading
# Check logs
journalctl -u ollama -f
# Verify VRAM
nvidia-smi
# Try smaller model first
ollama pull llama3.1:8b
Slow Responses
- Verify GPU is being used (/show info in chat, or check the PROCESSOR column of ollama ps)
- Check VRAM usage during inference
- Try a smaller quantization
- Ensure model fits entirely in VRAM (avoid CPU offload)
Docker GPU Issues
# Test NVIDIA container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
If that fails, install the NVIDIA Container Toolkit:
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Commands Cheat Sheet
# List installed models
ollama list
# Pull a new model
ollama pull llama3.1
# Run interactively
ollama run llama3.1
# Show running models and GPU/CPU placement
ollama ps
# Show model info
ollama show llama3.1
# Delete a model
ollama rm mistral
# Push custom model (if you create one)
ollama push mymodel
# Serve API on custom port
OLLAMA_HOST=0.0.0.0:8080 ollama serve
What’s Next?
Once you’ve got Ollama and Open WebUI running:
- Experiment with models - Try different sizes and see what works for your hardware
- Set up RAG - Upload your homelab documentation and chat with it
- Create custom models - Use Modelfiles to create specialized assistants
- Integrate with your stack - Use the API for automation
- Share with family - Open WebUI supports multiple users
The self-hosted AI revolution is here, and it runs on your hardware. No API keys required.
Have questions or tips about running local LLMs? Drop them in the comments below!