Run Your Own AI: Ollama + Open WebUI for Private Local LLMs

Set up a ChatGPT alternative on your homelab with Ollama and Open WebUI. Complete guide covering installation, GPU optimization, model selection, and RAG for document chat.

7 min read
ollama, open-webui, llm, self-hosted, ai, homelab, docker
Running AI models locally used to require a PhD in machine learning. Then Ollama came along and made it as simple as ollama run llama3.1. Add Open WebUI into the mix, and you’ve got yourself a fully private ChatGPT alternative running on your own hardware.

No API keys. No rate limits. No data leaving your network. Just your own AI assistant that you control completely.

Why Run Local LLMs?

Before we dive into the setup, let’s talk about why you’d want to run LLMs locally instead of using ChatGPT or Claude:

  • Privacy: Your data never leaves your network. Perfect for sensitive documents, code, or conversations.
  • No API Costs: Run unlimited queries without watching your credit card balance.
  • Offline Access: Works without internet—useful for air-gapped networks or remote locations.
  • Customization: Fine-tune models, adjust parameters, and create custom assistants.
  • Learning: Understand how AI actually works by getting hands-on.

The tradeoff? You need hardware. But we’ll cover that too.

What You’ll Need

Hardware Requirements

The beauty of Ollama is its flexibility—it can run on everything from a Raspberry Pi to a multi-GPU workstation. Here’s what you need to know:

Minimum (CPU-only):

  • 8GB RAM (16GB recommended)
  • 10GB+ free disk space
  • Works but slow—5-10x slower than GPU

Recommended (GPU-accelerated):

  • NVIDIA GPU with 8GB+ VRAM (CUDA 12+)
  • OR Apple Silicon Mac (Metal acceleration built-in)
  • 16GB+ system RAM
  • NVMe SSD for model storage

VRAM Guide by Model Size:

| Model | VRAM Needed | Good For |
|---|---|---|
| Llama 3.1 8B | ~5-8 GB | General chat, simple tasks |
| Mistral 7B | ~4-7 GB | Fast, efficient responses |
| Qwen 2.5 14B | ~10-12 GB | Better reasoning |
| Mixtral 8x7B | ~12-16 GB | Complex tasks, MoE efficiency |
| Llama 3.3 70B | ~43 GB | Near-GPT-4 quality |

Pro tip: Start small with 7-8B models. They run great on consumer GPUs like an RTX 3060 or even a GTX 1660.
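The table above follows from a simple rule of thumb: VRAM ≈ parameters × bits-per-weight ÷ 8, plus a couple of gigabytes for the KV cache and runtime buffers. Here's a back-of-the-envelope sketch; the ~4.5 bits/weight figure for Q4_K_M and the fixed overhead are ballpark assumptions, not exact numbers:

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,  # ~Q4_K_M average (assumption)
                     overhead_gb: float = 1.5) -> float:  # KV cache + buffers (ballpark)
    """Back-of-the-envelope VRAM estimate for a quantized model."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for size in (7, 8, 13, 70):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

Run it against the table above and the numbers land in the same ranges, which is all a rule of thumb needs to do. Remember that a longer context window inflates the KV cache well past the fixed overhead used here.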

Installing Ollama

Ollama’s installation is refreshingly simple:

Linux / macOS / WSL2

curl -fsSL https://ollama.ai/install.sh | sh

That’s it. The script detects your OS, installs Ollama, and sets up the service.

Verify Installation

ollama --version
# ollama version is 0.5.x or newer

Pull Your First Model

# Download Llama 3.1 8B (about 4.9GB)
ollama pull llama3.1

# Start chatting
ollama run llama3.1

You’re now running a capable AI model locally. Try asking it something:

>>> Write a haiku about homelab servers
Blinking lights in dark,
Whisper fans cool silicon,
My data stays home.

Type /bye to exit the chat.

Adding Open WebUI: Your ChatGPT Interface

Command-line chat is fine, but Open WebUI gives you the familiar ChatGPT experience—conversation history, model switching, document uploads, and more.

This single command gives you both Ollama and Open WebUI:

docker run -d -p 3000:8080 \
  --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

For CPU-only systems:

docker run -d -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

If You Already Have Ollama Installed

Just add the UI:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

First-Time Setup

  1. Open http://localhost:3000 in your browser
  2. Create an admin account (first user becomes admin)
  3. Click the model selector (top-left) to pull new models
  4. Start chatting!
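Before blaming the UI for any hiccups, it's worth confirming the Ollama API itself is answering. A GET to the API root returns the plain string "Ollama is running", so a minimal sanity check (the URL assumes the default port from this guide) looks like:

```python
import urllib.request

def ollama_up(url: str = "http://localhost:11434/") -> bool:
    """Return True if the Ollama API root answers with its health string."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.read().decode().strip() == "Ollama is running"
    except OSError:
        # Connection refused, timeout, DNS failure, etc.
        return False

print(ollama_up())
```

If this prints False while the container is running, check that port 11434 is actually published or that Open WebUI is pointed at the right host.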

Model Selection: What to Run

Ollama’s model library is huge. Here are my recommendations for different use cases:

For General Chat

Llama 3.1 8B is the sweet spot—fast enough for quick responses, capable enough for most tasks.

ollama pull llama3.1

For Coding

DeepSeek-Coder-V2 or Qwen 2.5 Coder 32B excel at code generation and debugging; DeepSeek-Coder-V2 in particular rivals GPT-4 Turbo on coding benchmarks.

ollama pull deepseek-coder-v2
ollama pull qwen2.5-coder:32b

For Fast Responses

Mistral 7B is incredibly snappy. Great for quick questions and brainstorming.

ollama pull mistral

For Maximum Quality

Llama 3.3 70B approaches GPT-4 territory, but needs ~43GB VRAM (dual RTX 3090s or an A6000).

ollama pull llama3.3:70b

For Massive Context

Llama 4 Scout offers a staggering 10-million-token context window, perfect for analyzing entire codebases or long documents.

ollama pull llama4:scout

RAG: Chat With Your Documents

One of Open WebUI’s killer features is Retrieval Augmented Generation (RAG). Upload documents and ask questions about them.

Quick Document Chat

  1. Click the paperclip icon in chat
  2. Upload any PDF, text, or markdown file
  3. Ask questions about the content

The AI will cite specific parts of your document in responses.

Knowledge Collections

For permanent document libraries:

  1. Go to Workspace → Knowledge
  2. Create a new collection (e.g., “Homelab Docs”)
  3. Upload related documents
  4. Link the collection to specific models
  5. Toggle it on when chatting

RAG Configuration Tips

In Admin Panel → Settings → Documents:

  • Embedding Model: nomic-embed-text (pulled via Ollama) works great
  • Chunk Size: 1500 tokens (increase for longer documents)
  • Top K: 4-6 results per query (higher = more context, slower)
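To see what the chunk-size setting actually controls, here's a minimal sketch of the overlapping-chunk idea RAG pipelines use. Open WebUI's real splitter is token-based and more sophisticated; this word-based version is purely illustrative:

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so a fact that straddles a
    boundary still appears intact in at least one chunk.
    Words stand in for tokens here."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 3000
print(len(chunk_text(doc)))  # 3000 words -> 3 overlapping chunks
```

The overlap is why raising chunk size isn't free: bigger chunks mean fewer, coarser retrieval hits, while more overlap means more duplicated text in the index.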

Performance Optimization

GPU Acceleration

Verify GPU detection:

# Should show your GPU
nvidia-smi

# Check what's loaded and where (the PROCESSOR column shows GPU vs CPU)
ollama ps

Keep Models in Memory

By default, Ollama unloads models after 5 minutes of inactivity. To speed up repeated queries, set this in the environment of the process running ollama serve (for example, in its systemd unit), not just your interactive shell:

# Keep models loaded for 1 hour
export OLLAMA_KEEP_ALIVE=1h

Context Window Sizing

Larger context = more VRAM. As a rough guide:

  • Under 24GB VRAM: 4K context (usually fine)
  • 24-48GB VRAM: 32K context
  • 48GB+ VRAM: 256K context (for Llama 4 Scout)

To force a specific context size, set it from inside a session (ollama run has no --context flag):

ollama run llama3.1
>>> /set parameter num_ctx 8192
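The context size can also be set per request through the REST API: num_ctx goes in the options object, and keep_alive (from the previous section) is accepted at the top level of the same request. A sketch of the payload, using the same model as the rest of this post:

```python
import json

# Per-request overrides: POST this to http://localhost:11434/api/generate
payload = {
    "model": "llama3.1",
    "prompt": "Summarize this long document ...",
    "stream": False,
    "options": {"num_ctx": 8192},  # context window for this request only
    "keep_alive": "1h",            # keep the model loaded afterwards
}
print(json.dumps(payload, indent=2))
```

Per-request overrides are handy when most of your traffic is short chat but one automation job occasionally needs a big window.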

Quantization Explained

Ollama uses 4-bit quantization (Q4_K_M) by default. This reduces model size by ~70% with minimal quality loss:

| Quantization | Size (8B model) | Quality | Speed |
|---|---|---|---|
| Q4_K_M | ~4.9 GB | Good | Fastest |
| Q5_K_M | ~5.5 GB | Better | Fast |
| Q8_0 | ~8 GB | Best | Slower |
| FP16 | ~16 GB | Full | Slowest |

For most homelab use cases, Q4_K_M is the sweet spot.

Security Considerations

Network Security

Never expose port 11434 directly to the internet. Ollama has no built-in authentication.

If you need remote access:

# Option 1: SSH tunnel
ssh -L 11434:localhost:11434 user@server

# Option 2: Reverse proxy with auth (nginx example)
# Add basic auth in front of /api/*

Production Deployment

For production use:

  1. Run behind a reverse proxy (Traefik, Nginx, Caddy)
  2. Add authentication (OAuth, basic auth)
  3. Use HTTPS/TLS certificates
  4. Consider container isolation (Docker networks)
  5. Keep Ollama and Open WebUI updated

Homelab Integration Examples

API Integration

Ollama exposes a REST API on port 11434:

# Simple completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is self-hosting awesome?",
  "stream": false
}'

# Chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Explain Docker in one paragraph"}
  ]
}'

Python Example

import requests

response = requests.post('http://localhost:11434/api/chat', json={
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False
})

print(response.json()['message']['content'])
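With "stream": True (the API's default), Ollama sends one JSON object per line instead of a single response; each partial lives in message.content and the final object carries "done": true. A small helper to reassemble the text, shown here with canned lines (in practice you'd feed it response.iter_lines() from a streaming requests.post):

```python
import json

def collect_stream(ndjson_lines) -> str:
    """Reassemble a streamed /api/chat response from its NDJSON lines."""
    parts = []
    for line in ndjson_lines:
        if not line:
            continue  # requests yields keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Canned example of the wire format:
sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(collect_stream(sample))  # Hello!
```

Streaming is worth the extra parsing for interactive tools, since first tokens arrive in well under a second even when the full answer takes ten.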

Use Cases

  • Automation scripts: Generate dynamic content, summarize logs
  • Home Assistant integration: Voice assistant with local processing
  • Code review: Analyze pull requests without sending code to the cloud
  • Document indexing: RAG over your homelab documentation
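As a concrete sketch of the "summarize logs" use case above, here's a hypothetical helper that tails a log and builds an /api/chat payload. The function name, prompt wording, and line limit are my own; only the endpoint shape comes from the API examples earlier:

```python
def log_summary_payload(log_lines: list[str],
                        model: str = "llama3.1",
                        max_lines: int = 50) -> dict:
    """Build an /api/chat payload asking the model to summarize recent log lines."""
    tail = "\n".join(log_lines[-max_lines:])  # only the most recent entries
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Summarize this log excerpt; call out errors and warnings:\n\n" + tail,
        }],
        "stream": False,
    }

lines = [f"INFO request {i} ok" for i in range(100)] + ["ERROR disk full"]
payload = log_summary_payload(lines)
print("ERROR disk full" in payload["messages"][0]["content"])  # True
```

Capping the tail matters: dumping a whole log file into the prompt will blow past the default 4K context long before it helps the summary.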

Troubleshooting

Model Not Loading

# Check logs
journalctl -u ollama -f

# Verify VRAM
nvidia-smi

# Try smaller model first
ollama pull llama3.1:8b

Slow Responses

  • Verify GPU is being used (ollama ps shows the GPU/CPU split)
  • Check VRAM usage during inference
  • Try a smaller quantization
  • Ensure model fits entirely in VRAM (avoid CPU offload)
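To put a number on "slow", use the timing fields Ollama includes in the final /api/generate response: eval_count (tokens generated) and eval_duration (nanoseconds). A quick helper, fed here with a canned response rather than a live call:

```python
def tokens_per_second(resp_json: dict) -> float:
    """Compute generation speed from Ollama's eval_count / eval_duration fields."""
    return resp_json["eval_count"] / (resp_json["eval_duration"] / 1e9)

# Canned example: 120 tokens generated in 3 seconds
print(tokens_per_second({"eval_count": 120, "eval_duration": 3_000_000_000}))  # 40.0
```

As a rough benchmark, single digits of tokens/sec on a model that should fit in VRAM usually means part of it spilled to CPU.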

Docker GPU Issues

# Test NVIDIA container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

If that fails, install the NVIDIA Container Toolkit:

# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Commands Cheat Sheet

# List installed models
ollama list

# Pull a new model
ollama pull llama3.1

# Run interactively
ollama run llama3.1

# Set options inside a session (there is no --context flag)
ollama run llama3.1   # then: /set parameter num_ctx 8192

# Show model info
ollama show llama3.1

# Delete a model
ollama rm mistral

# Push custom model (if you create one)
ollama push mymodel

# Serve API on custom port
OLLAMA_HOST=0.0.0.0:8080 ollama serve

What’s Next?

Once you’ve got Ollama and Open WebUI running:

  1. Experiment with models - Try different sizes and see what works for your hardware
  2. Set up RAG - Upload your homelab documentation and chat with it
  3. Create custom models - Use Modelfiles to create specialized assistants
  4. Integrate with your stack - Use the API for automation
  5. Share with family - Open WebUI supports multiple users

The self-hosted AI revolution is here, and it runs on your hardware. No API keys required.


Have questions or tips about running local LLMs? Drop them in the comments below!

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
