Build Your Own Monitoring Stack with Grafana and Prometheus
Set up a production-ready monitoring solution with Grafana dashboards and Prometheus metrics collection. Perfect for homelabs, self-hosted setups, and learning cloud-native observability.
Table of Contents
- Why Prometheus and Grafana?
- Architecture at a Glance
- Quick Start: Docker Compose Setup
- The docker-compose.yml
- The prometheus.yml Configuration
- Deploy and Verify
- Understanding Your Metrics: PromQL Basics
- Counter vs Gauge
- The Queries You’ll Actually Use
- Dashboard Design: USE and RED Methods
- USE Method (For Infrastructure)
- RED Method (For Services)
- Building Your First Dashboard
- Alerting: Catch Problems Before They Become Outages
- Alerting Rules
- Alertmanager Configuration
- Beyond the Basics: Common Exporters
- cAdvisor for Container Metrics
- Blackbox Exporter for Service Health
- Production Considerations
- Retention
- Long-term Storage
- High Availability
- Security
- Homelab: The Sweet Spot
- One Compose to Rule Them
- Next Steps
So you’re running services at home—maybe a Proxmox cluster, some Docker containers, a NAS, perhaps Home Assistant or Pi-hole—and you want to know what’s happening under the hood. You’ve seen those beautiful Grafana dashboards on r/homelab and thought, “I want that.”
Good news: building a monitoring stack with Grafana and Prometheus is easier than it looks, and the skills transfer directly to production environments. Let’s build one together.
Why Prometheus and Grafana?
Prometheus is your data collector. It scrapes metrics from targets (servers, containers, applications), stores them in a time-series database, and evaluates alerting rules. Think of it as a relentless accountant that visits every service every 15 seconds asking, “How are you doing?”
Grafana is your visualization layer. It connects to Prometheus (and 30+ other data sources) and transforms raw metrics into dashboards that actually make sense. Charts, graphs, heatmaps, tables—whatever tells your story best.
Together, they form the monitoring backbone for countless organizations, from solo homelabbers to global enterprises. Here’s why:
| Prometheus Handles | Grafana Handles |
|---|---|
| Metric collection | Visualization |
| Time-series storage | Dashboard composition |
| Alerting rules | Multi-source correlation |
| Service discovery | Template variables |
The synergy is real: Prometheus natively integrates with Grafana. You write the same PromQL queries for both alerting rules and dashboard panels—learn once, use everywhere.
Architecture at a Glance
A typical setup looks like this:
[Targets: Servers, Containers, Apps]
|
v
[Prometheus] ──scrapes──> metrics
|
├──> stores time-series data
└──> evaluates alerting rules
|
[Grafana] ──queries──> Prometheus
|
[Alertmanager] ──sends──> Slack/Email/PagerDuty
Prometheus pulls metrics from configured targets (pull model), stores them locally, and fires alerts when conditions are met. Grafana queries Prometheus to render dashboards. Alertmanager handles the actual notifications.
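To see the pull model concretely, here's a toy sketch (stdlib Python, not the official client library): a fake app serves Prometheus-style exposition text on /metrics, and a one-line "scrape" pulls it over HTTP. The metric name app_requests_total is made up for illustration.

```python
# Toy illustration of the Prometheus pull model: the "app" exposes metrics
# as plain text on /metrics, and the "scraper" pulls them over HTTP.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = (
    "# HELP app_requests_total Total requests handled.\n"
    "# TYPE app_requests_total counter\n"
    'app_requests_total{method="GET"} 42\n'
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "Prometheus" side: pull the metrics, exactly as a scrape would.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
server.shutdown()
print(scraped)
```

Real exporters work the same way: a plain-text endpoint that anyone, including you with curl, can read.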
Quick Start: Docker Compose Setup
Let’s get a working stack running. I’ll assume Docker and Docker Compose are installed.
The docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
The prometheus.yml Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      # node_exporter runs with network_mode: host, so scrape it at your
      # Docker host's IP. 'localhost' inside the Prometheus container
      # points at the container itself, not the host.
      - targets: ['192.168.1.10:9100']
Security note: The default Grafana credentials are admin/changeme. Change them immediately after first login. For production, use environment variables or secrets management—never commit passwords to version control.
Deploy and Verify
# Create the files above, then:
docker compose up -d
# Check Prometheus is scraping:
curl http://localhost:9090/api/v1/targets
# Open Grafana:
open http://localhost:3000
Log in to Grafana, add Prometheus as a data source (http://prometheus:9090), and you’re ready to build dashboards.
Pre-built dashboards: Grafana maintains a library of community dashboards at grafana.com/grafana/dashboards. For Node Exporter, try importing dashboard ID 1860 (Node Exporter Full)—it gives you comprehensive host metrics instantly.
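The data-source step can also be automated through the ./provisioning mount already in the compose file, using Grafana's provisioning format. A minimal sketch (the filename is arbitrary; the directory must be provisioning/datasources/):

```yaml
# provisioning/datasources/prometheus.yml — auto-register Prometheus
# as the default data source when Grafana starts.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

With this in place, a fresh `docker compose up` gives you a Grafana that is already wired to Prometheus.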
Understanding Your Metrics: PromQL Basics
Prometheus stores metrics as time series with labels:
node_memory_MemAvailable_bytes{instance="192.168.1.10:9100", job="node_exporter"} 8589934592 1709000000000
The metric name, labels, value, and timestamp. To make sense of it, you need PromQL.
Counter vs Gauge
Counters only increase (or reset). Use rate() to get per-second values:
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# Total increase over last hour
increase(http_requests_total[1h])
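What rate() computes is worth internalizing: the increase of the counter across the window (compensating for resets) divided by the window length. A simplified Python sketch of the idea; real Prometheus additionally extrapolates to the window boundaries:

```python
# Rough sketch of rate() over a window of (timestamp, counter_value)
# samples. Handles counter resets; omits Prometheus's boundary extrapolation.
def simple_rate(samples):
    """samples: list of (unix_ts, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset (e.g. process restart):
            increase += value     # count everything since the reset
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 60s of samples scraped every 15s, with a reset at t=30
samples = [(0, 100), (15, 160), (30, 10), (45, 70), (60, 130)]
print(simple_rate(samples))  # (60 + 10 + 60 + 60) / 60 ≈ 3.17 req/s
```

This is also why graphing a raw counter is rarely useful: the interesting signal is the slope, which is exactly what rate() extracts.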
Gauges can go up or down. Query them directly:
# Current available memory
node_memory_MemAvailable_bytes
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
The Queries You’ll Actually Use
CPU usage by mode:
sum by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
Memory utilization:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk usage:
((node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes) * 100
Network traffic:
# Received
rate(node_network_receive_bytes_total[5m])
# Transmitted
rate(node_network_transmit_bytes_total[5m])
p99 latency (for histogram metrics):
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
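histogram_quantile() estimates the quantile by locating the cumulative bucket where the target rank falls and interpolating linearly inside it. A simplified Python sketch (assumes well-formed buckets with ascending "le" upper bounds):

```python
# Sketch of histogram_quantile(): buckets are cumulative counts with
# "le" (less-than-or-equal) upper bounds; the quantile is linearly
# interpolated inside the bucket where the target rank lands.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation within this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le=0.1s: 50 requests, le=0.5s: 90, le=1.0s: 100 (cumulative)
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.99, buckets))  # p99 lands in the 0.5-1.0s bucket
```

The interpolation explains why quantile accuracy depends on bucket layout: within a bucket, Prometheus only knows the count, not the distribution.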
The five-minute window: Most queries use [5m] as the range. This balances responsiveness and noise reduction. For high-frequency metrics, you might use 1m or 30s. For slow-moving data like disk space, 5m is fine.
Dashboard Design: USE and RED Methods
Good dashboards tell a story. Two frameworks help structure them effectively:
USE Method (For Infrastructure)
For resources like CPU, memory, disks:
- Utilization: Percent time busy (CPU %)
- Saturation: Amount of work queued (load average, I/O wait)
- Errors: Error count per second
RED Method (For Services)
For applications, APIs, microservices:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
Dashboard hierarchy: Start with a high-level overview dashboard (USE for infrastructure, RED for services). Then create drill-down dashboards that explore specific areas in detail. Link them together using Grafana’s dashboard links feature.
Building Your First Dashboard
- Create a dashboard → Add new panel
- Choose visualization → Time series for most metrics, Stat for single values
- Write your PromQL → Start simple, refine
- Add meaningful titles → “CPU Utilization” not “Panel 1”
- Set thresholds → Green (ok), Yellow (warning), Red (critical)
- Use variables → $instance, $job for reusable dashboards
Example variables:
# Instance dropdown
name: instance
type: query
query: label_values(up, instance)
# Job dropdown
name: job
type: query
query: label_values(up, job)
Now your dashboard works for any instance/job—just change the dropdown.
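With those variables defined, panel queries reference them directly; Grafana substitutes the current dropdown selections before sending the query to Prometheus. For example:

```promql
# Memory utilization for whichever host is selected in the dropdown
(1 - (node_memory_MemAvailable_bytes{instance="$instance", job="$job"}
    / node_memory_MemTotal_bytes{instance="$instance", job="$job"})) * 100
```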
Alerting: Catch Problems Before They Become Outages
Prometheus handles the rules. Alertmanager handles the notifications.
Alerting Rules
Create alerting_rules.yml next to your compose file, mount it into the Prometheus container, and reference it in your Prometheus config:

rule_files:
  - '/etc/prometheus/alerting_rules.yml'
Basic rules for a homelab:
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has less than 10% free space"

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"
The for: clause: This prevents flapping alerts. An instance must be down for 5 minutes before the alert fires. Adjust based on your tolerance for noise.
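Before reloading Prometheus, you can unit-test rules with promtool, which ships with Prometheus. A sketch of a test file for the InstanceDown rule above (the filename is arbitrary; run it with `promtool test rules alert_tests.yml`):

```yaml
# alert_tests.yml — verify InstanceDown fires after 5 minutes of up == 0
rule_files:
  - alerting_rules.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '0x10'   # value 0, repeated: down for the whole window
    alert_rule_test:
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9100
              job: node_exporter
            exp_annotations:
              summary: "Instance localhost:9100 down"
              description: "localhost:9100 has been down for more than 5 minutes."
```

Catching a malformed expression here beats discovering it during an outage.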
Alertmanager Configuration
Add Alertmanager to your compose file:

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped

Then tell Prometheus where to send alerts by adding this to prometheus.yml (without it, alerts fire but never leave Prometheus):

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
alertmanager.yml for Slack:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'  # your incoming-webhook URL

route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | title }}: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Grouping matters: Without grouping, you might receive 50 individual alerts for “disk space low” across your fleet. Grouping by alertname and instance collapses them into one notification with context.
Beyond the Basics: Common Exporters
The Node Exporter gives you host metrics. For everything else, there’s probably an exporter:
| Exporter | What It Monitors | Default Port |
|---|---|---|
| node_exporter | Host (CPU, memory, disk, network) | 9100 |
| cAdvisor | Docker/container metrics | 8080 |
| blackbox_exporter | HTTP/TCP/ICMP probes | 9115 |
| mysqld_exporter | MySQL databases | 9104 |
| postgres_exporter | PostgreSQL | 9187 |
| redis_exporter | Redis instances | 9121 |
cAdvisor for Container Metrics
Add this to your Docker Compose:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
And to Prometheus:
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Now you can monitor container CPU, memory, and network usage.
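cAdvisor's metric names differ from Node Exporter's. Two queries to start with (the name!="" filter drops cgroup aggregates that aren't real containers):

```promql
# CPU: per-second usage by container name
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Memory: current working set by container name
container_memory_working_set_bytes{name!=""}
```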
Blackbox Exporter for Service Health
Want to know if your websites are up? Blackbox exporter probes endpoints:
  blackbox_exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox_exporter
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    restart: unless-stopped
blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      method: GET
      preferred_ip_protocol: ip4
Prometheus scrape config:
scrape_configs:
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115
The relabeling magic: Blackbox uses a special pattern where Prometheus queries the exporter with the target as a parameter. The relabel configs rewire __address__ to make this work. It’s confusing at first but extremely powerful.
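Once probes are flowing, alerting on a failing endpoint is one rule in the same alerting_rules.yml format used earlier. A sketch:

```yaml
groups:
  - name: blackbox_alerts
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} failing probes"
          description: "{{ $labels.instance }} has failed HTTP probes for 3 minutes."
```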
Production Considerations
Running in production? Think about these:
Retention
Prometheus stores data locally. Set reasonable retention:
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
Estimate storage: ~1-2 bytes per sample (compressed). At 10,000 samples per second, 30 days of data needs roughly 25-50 GB.
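As a sanity check, the arithmetic (sample rate and bytes-per-sample figures are illustrative):

```python
# Back-of-envelope Prometheus storage estimate.
samples_per_second = 10_000    # e.g. ~150 targets x ~1000 series each / 15s interval
retention_days = 30
bytes_per_sample = 1.5         # typical compressed cost, within the 1-2 byte range

total_samples = samples_per_second * retention_days * 86_400
total_gb = total_samples * bytes_per_sample / 1e9
print(f"{total_gb:.0f} GB")  # ~39 GB
```

In practice, check prometheus_tsdb_head_series and your scrape interval to plug in real numbers for your own setup.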
Long-term Storage
For retention beyond 30 days, you need remote storage:
| System | Use Case |
|---|---|
| Thanos | Multi-cluster, virtually unlimited retention |
| VictoriaMetrics | Cost-efficient single-node alternative |
| Cortex/Mimir | Large-scale distributed setups |
High Availability
For critical environments:
- Run two Prometheus instances with identical configs
- Cluster Alertmanagers for deduplication
- Use a load balancer for Grafana
Alertmanager clustering:
alertmanager --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094
Security
Basic practices:
- Firewall Prometheus — Don’t expose 9090 publicly
- Use a reverse proxy — Nginx/Traefik for Grafana with TLS
- Secure Grafana — Disable anonymous access, use strong passwords
- Don’t commit secrets — Use environment variables
Homelab: The Sweet Spot
For most homelabs, a single Prometheus instance with 30-day retention, Grafana, Node Exporter, cAdvisor, and Alertmanager covers 95% of use cases. Add exporters as needed for specific services (Pi-hole, Home Assistant, Proxmox).
Total resource footprint: ~500MB RAM for Prometheus, ~100MB for Grafana. Your old NUC or Raspberry Pi 4 can handle it easily.
One Compose to Rule Them
Here’s a complete homelab setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Next Steps
You’ve got the foundation. Where to go from here:
- Import community dashboards — Start with dashboard ID 1860 (Node Exporter Full)
- Add custom alerts — Tailor to your environment
- Set up notifications — Slack, Discord, email, Pushover
- Monitor your applications — Most have Prometheus exporters or client libraries
- Experiment with PromQL — The query language is surprisingly powerful
The beauty of Prometheus + Grafana is that it scales from a single Raspberry Pi to global infrastructure. The skills you learn here transfer directly to production environments. Start small, iterate, and build the dashboards that tell your story.
Monitor the monitor: Prometheus exposes its own metrics. Create a dashboard tracking prometheus_tsdb_head_series (active series), prometheus_scrape_duration_seconds (scrape timing), and prometheus_rule_evaluation_duration_seconds (alert evaluation). You’ll spot cardinality explosions and performance issues early.
