Build Your Own Monitoring Stack with Grafana and Prometheus
Set up a production-ready monitoring solution with Grafana dashboards and Prometheus metrics collection. Perfect for homelabs, self-hosted setups, and learning cloud-native observability.
Table of Contents
- Why Prometheus and Grafana?
- Architecture at a Glance
- Quick Start: Docker Compose Setup
- The docker-compose.yml
- The prometheus.yml Configuration
- Deploy and Verify
- Understanding Your Metrics: PromQL Basics
- Counter vs Gauge
- The Queries You’ll Actually Use
- Dashboard Design: USE and RED Methods
- USE Method (For Infrastructure)
- RED Method (For Services)
- Building Your First Dashboard
- Alerting: Catch Problems Before They Become Outages
- Alerting Rules
- Alertmanager Configuration
- Beyond the Basics: Common Exporters
- cAdvisor for Container Metrics
- Blackbox Exporter for Service Health
- Production Considerations
- Retention
- Long-term Storage
- High Availability
- Security
- Homelab: The Sweet Spot
- One Compose to Rule Them
- Next Steps
So you’re running services at home—maybe a Proxmox cluster, some Docker containers, a NAS, perhaps Home Assistant or Pi-hole—and you want to know what’s happening under the hood. You’ve seen those beautiful Grafana dashboards on r/homelab and thought, “I want that.”
Good news: building a monitoring stack with Grafana and Prometheus is easier than it looks, and the skills transfer directly to production environments. Let’s build one together.
Why Prometheus and Grafana?
Prometheus is your data collector. It scrapes metrics from targets (servers, containers, applications), stores them in a time-series database, and evaluates alerting rules. Think of it as a relentless accountant that visits every service every 15 seconds asking, “How are you doing?”
Grafana is your visualization layer. It connects to Prometheus (and 30+ other data sources) and transforms raw metrics into dashboards that actually make sense. Charts, graphs, heatmaps, tables—whatever tells your story best.
Together, they form the monitoring backbone for countless organizations, from solo homelabbers to global enterprises. Here’s why:
| Prometheus Handles | Grafana Handles |
|---|---|
| Metric collection | Visualization |
| Time-series storage | Dashboard composition |
| Alerting rules | Multi-source correlation |
| Service discovery | Template variables |
The synergy is real: Prometheus natively integrates with Grafana. You write the same PromQL queries for both alerting rules and dashboard panels—learn once, use everywhere.
Architecture at a Glance
A typical setup looks like this:
[Targets: Servers, Containers, Apps]
|
v
[Prometheus] ──scrapes──> metrics
|
├──> stores time-series data
└──> evaluates alerting rules
|
[Grafana] ──queries──> Prometheus
|
[Alertmanager] ──sends──> Slack/Email/PagerDuty
Prometheus pulls metrics from configured targets (pull model), stores them locally, and fires alerts when conditions are met. Grafana queries Prometheus to render dashboards. Alertmanager handles the actual notifications.
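To see the pull model concretely, here's a toy sketch (stdlib Python, not the official client library): a fake app serves Prometheus-style exposition text on /metrics, and a one-line "scrape" pulls it over HTTP. The metric name app_requests_total is made up for illustration.

```python
# Toy illustration of the Prometheus pull model: the "app" exposes metrics
# as plain text on /metrics, and the "scraper" pulls them over HTTP.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = (
    "# HELP app_requests_total Total requests handled.\n"
    "# TYPE app_requests_total counter\n"
    'app_requests_total{method="GET"} 42\n'
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "Prometheus" side: pull the metrics, exactly as a scrape would.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
server.shutdown()
print(scraped)
```

Real exporters work the same way: a plain-text endpoint that anyone, including you with curl, can read.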
Quick Start: Docker Compose Setup
Let’s get a working stack running. I’ll assume Docker and Docker Compose are installed.
The docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
The prometheus.yml Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      # node_exporter runs with network_mode: host, so scrape it at your
      # Docker host's IP. 'localhost' inside the Prometheus container
      # points at the container itself, not the host.
      - targets: ['192.168.1.10:9100']
Security note: The default Grafana credentials are admin/changeme. Change them immediately after first login. For production, use environment variables or secrets management—never commit passwords to version control.
Deploy and Verify
# Create the files above, then:
docker compose up -d
# Check Prometheus is scraping:
curl http://localhost:9090/api/v1/targets
# Open Grafana:
open http://localhost:3000
Log in to Grafana, add Prometheus as a data source (http://prometheus:9090), and you’re ready to build dashboards.
Pre-built dashboards: Grafana maintains a library of community dashboards at grafana.com/grafana/dashboards. For Node Exporter, try importing dashboard ID 1860 (Node Exporter Full)—it gives you comprehensive host metrics instantly.
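The data-source step can also be automated through the ./provisioning mount already in the compose file, using Grafana's provisioning format. A minimal sketch (the filename is arbitrary; the directory must be provisioning/datasources/):

```yaml
# provisioning/datasources/prometheus.yml — auto-register Prometheus
# as the default data source when Grafana starts.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

With this in place, a fresh `docker compose up` gives you a Grafana that is already wired to Prometheus.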
Understanding Your Metrics: PromQL Basics
Prometheus stores metrics as time series with labels:
node_memory_MemAvailable_bytes{instance="192.168.1.10:9100", job="node_exporter"} 8589934592 1709000000000
The metric name, labels, value, and timestamp. To make sense of it, you need PromQL.
Counter vs Gauge
Counters only increase (or reset). Use rate() to get per-second values:
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# Total increase over last hour
increase(http_requests_total[1h])
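What rate() computes is worth internalizing: the increase of the counter across the window (compensating for resets) divided by the window length. A simplified Python sketch of the idea; real Prometheus additionally extrapolates to the window boundaries:

```python
# Rough sketch of rate() over a window of (timestamp, counter_value)
# samples. Handles counter resets; omits Prometheus's boundary extrapolation.
def simple_rate(samples):
    """samples: list of (unix_ts, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset (e.g. process restart):
            increase += value     # count everything since the reset
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 60s of samples scraped every 15s, with a reset at t=30
samples = [(0, 100), (15, 160), (30, 10), (45, 70), (60, 130)]
print(simple_rate(samples))  # (60 + 10 + 60 + 60) / 60 ≈ 3.17 req/s
```

This is also why graphing a raw counter is rarely useful: the interesting signal is the slope, which is exactly what rate() extracts.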
Gauges can go up or down. Query them directly:
# Current available memory
node_memory_MemAvailable_bytes
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
The Queries You’ll Actually Use
CPU usage by mode:
sum by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
Memory utilization:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk usage:
((node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes) * 100
Network traffic:
# Received
rate(node_network_receive_bytes_total[5m])
# Transmitted
rate(node_network_transmit_bytes_total[5m])
p99 latency (for histogram metrics):
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
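histogram_quantile() estimates the quantile by locating the cumulative bucket where the target rank falls and interpolating linearly inside it. A simplified Python sketch (assumes well-formed buckets with ascending "le" upper bounds):

```python
# Sketch of histogram_quantile(): buckets are cumulative counts with
# "le" (less-than-or-equal) upper bounds; the quantile is linearly
# interpolated inside the bucket where the target rank lands.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation within this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le=0.1s: 50 requests, le=0.5s: 90, le=1.0s: 100 (cumulative)
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.99, buckets))  # p99 lands in the 0.5-1.0s bucket
```

The interpolation explains why quantile accuracy depends on bucket layout: within a bucket, Prometheus only knows the count, not the distribution.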
The five-minute window: Most queries use [5m] as the range. This balances responsiveness and noise reduction. For high-frequency metrics, you might use 1m or 30s. For slow-moving data like disk space, 5m is fine.
Dashboard Design: USE and RED Methods
Good dashboards tell a story. Two frameworks help structure them effectively:
USE Method (For Infrastructure)
For resources like CPU, memory, disks:
- Utilization: Percent time busy (CPU %)
- Saturation: Amount of work queued (load average, I/O wait)
- Errors: Error count per second
RED Method (For Services)
For applications, APIs, microservices:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
Dashboard hierarchy: Start with a high-level overview dashboard (USE for infrastructure, RED for services). Then create drill-down dashboards that explore specific areas in detail. Link them together using Grafana’s dashboard links feature.
Building Your First Dashboard
- Create a dashboard → Add new panel
- Choose visualization → Time series for most metrics, Stat for single values
- Write your PromQL → Start simple, refine
- Add meaningful titles → “CPU Utilization” not “Panel 1”
- Set thresholds → Green (ok), Yellow (warning), Red (critical)
- Use variables → $instance, $job for reusable dashboards
Example variables:
# Instance dropdown
name: instance
type: query
query: label_values(up, instance)
# Job dropdown
name: job
type: query
query: label_values(up, job)
Now your dashboard works for any instance/job—just change the dropdown.
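With those variables defined, panel queries reference them directly; Grafana substitutes the current dropdown selections before sending the query to Prometheus. For example:

```promql
# Memory utilization for whichever host is selected in the dropdown
(1 - (node_memory_MemAvailable_bytes{instance="$instance", job="$job"}
    / node_memory_MemTotal_bytes{instance="$instance", job="$job"})) * 100
```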
Alerting: Catch Problems Before They Become Outages
Prometheus handles the rules. Alertmanager handles the notifications.
Alerting Rules
Create alerting_rules.yml next to your compose file, mount it into the Prometheus container, and reference it in your Prometheus config:

rule_files:
  - '/etc/prometheus/alerting_rules.yml'
Basic rules for a homelab:
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has less than 10% free space"

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"
The for: clause: This prevents flapping alerts. An instance must be down for 5 minutes before the alert fires. Adjust based on your tolerance for noise.
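Before reloading Prometheus, you can unit-test rules with promtool, which ships with Prometheus. A sketch of a test file for the InstanceDown rule above (the filename is arbitrary; run it with `promtool test rules alert_tests.yml`):

```yaml
# alert_tests.yml — verify InstanceDown fires after 5 minutes of up == 0
rule_files:
  - alerting_rules.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '0x10'   # value 0, repeated: down for the whole window
    alert_rule_test:
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9100
              job: node_exporter
            exp_annotations:
              summary: "Instance localhost:9100 down"
              description: "localhost:9100 has been down for more than 5 minutes."
```

Catching a malformed expression here beats discovering it during an outage.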
Alertmanager Configuration
Add Alertmanager to your compose file:

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped

Then tell Prometheus where to send alerts by adding this to prometheus.yml (without it, alerts fire but never leave Prometheus):

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
alertmanager.yml for Slack:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'  # your incoming-webhook URL

route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | title }}: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Grouping matters: Without grouping, you might receive 50 individual alerts for “disk space low” across your fleet. Grouping by alertname and instance collapses them into one notification with context.
Beyond the Basics: Common Exporters
The Node Exporter gives you host metrics. For everything else, there’s probably an exporter:
| Exporter | What It Monitors | Default Port |
|---|---|---|
| node_exporter | Host (CPU, memory, disk, network) | 9100 |
| cAdvisor | Docker/container metrics | 8080 |
| blackbox_exporter | HTTP/TCP/ICMP probes | 9115 |
| mysqld_exporter | MySQL databases | 9104 |
| postgres_exporter | PostgreSQL | 9187 |
| redis_exporter | Redis instances | 9121 |
cAdvisor for Container Metrics
Add this to your Docker Compose:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
And to Prometheus:
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Now you can monitor container CPU, memory, and network usage.
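cAdvisor's metric names differ from Node Exporter's. Two queries to start with (the name!="" filter drops cgroup aggregates that aren't real containers):

```promql
# CPU: per-second usage by container name
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Memory: current working set by container name
container_memory_working_set_bytes{name!=""}
```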
Blackbox Exporter for Service Health
Want to know if your websites are up? Blackbox exporter probes endpoints:
  blackbox_exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox_exporter
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    restart: unless-stopped
blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      method: GET
      preferred_ip_protocol: ip4
Prometheus scrape config:
scrape_configs:
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115
The relabeling magic: Blackbox uses a special pattern where Prometheus queries the exporter with the target as a parameter. The relabel configs rewire __address__ to make this work. It’s confusing at first but extremely powerful.
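Once probes are flowing, alerting on a failing endpoint is one rule in the same alerting_rules.yml format used earlier. A sketch:

```yaml
groups:
  - name: blackbox_alerts
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} failing probes"
          description: "{{ $labels.instance }} has failed HTTP probes for 3 minutes."
```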
Production Considerations
Running in production? Think about these:
Retention
Prometheus stores data locally. Set reasonable retention:
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
Estimate storage: ~1-2 bytes per sample (compressed). At 10,000 samples per second, 30 days of data needs roughly 25-50 GB.
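As a sanity check, the arithmetic (sample rate and bytes-per-sample figures are illustrative):

```python
# Back-of-envelope Prometheus storage estimate.
samples_per_second = 10_000    # e.g. ~150 targets x ~1000 series each / 15s interval
retention_days = 30
bytes_per_sample = 1.5         # typical compressed cost, within the 1-2 byte range

total_samples = samples_per_second * retention_days * 86_400
total_gb = total_samples * bytes_per_sample / 1e9
print(f"{total_gb:.0f} GB")  # ~39 GB
```

In practice, check prometheus_tsdb_head_series and your scrape interval to plug in real numbers for your own setup.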
Long-term Storage
For retention beyond 30 days, you need remote storage:
| System | Use Case |
|---|---|
| Thanos | Multi-cluster, virtually unlimited retention |
| VictoriaMetrics | Cost-efficient single-node alternative |
| Cortex/Mimir | Large-scale distributed setups |
High Availability
For critical environments:
- Run two Prometheus instances with identical configs
- Cluster Alertmanagers for deduplication
- Use a load balancer for Grafana
Alertmanager clustering:
alertmanager --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094
Security
Basic practices:
- Firewall Prometheus — Don’t expose 9090 publicly
- Use a reverse proxy — Nginx/Traefik for Grafana with TLS
- Secure Grafana — Disable anonymous access, use strong passwords
- Don’t commit secrets — Use environment variables
Homelab: The Sweet Spot
For most homelabs, a single Prometheus instance with 30-day retention, Grafana, Node Exporter, cAdvisor, and Alertmanager covers 95% of use cases. Add exporters as needed for specific services (Pi-hole, Home Assistant, Proxmox).
Total resource footprint: ~500MB RAM for Prometheus, ~100MB for Grafana. Your old NUC or Raspberry Pi 4 can handle it easily.
One Compose to Rule Them
Here’s a complete homelab setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Next Steps
You’ve got the foundation. Where to go from here:
- Import community dashboards — Start with dashboard ID 1860 (Node Exporter Full)
- Add custom alerts — Tailor to your environment
- Set up notifications — Slack, Discord, email, Pushover
- Monitor your applications — Most have Prometheus exporters or client libraries
- Experiment with PromQL — The query language is surprisingly powerful
The beauty of Prometheus + Grafana is that it scales from a single Raspberry Pi to global infrastructure. The skills you learn here transfer directly to production environments. Start small, iterate, and build the dashboards that tell your story.
Monitor the monitor: Prometheus exposes its own metrics. Create a dashboard tracking prometheus_tsdb_head_series (active series), prometheus_scrape_duration_seconds (scrape timing), and prometheus_rule_evaluation_duration_seconds (alert evaluation). You’ll spot cardinality explosions and performance issues early.
