Build Your Own Monitoring Stack with Grafana and Prometheus

Set up a production-ready monitoring solution with Grafana dashboards and Prometheus metrics collection. Perfect for homelabs, self-hosted setups, and learning cloud-native observability.

9 min read · Tags: self-hosted, monitoring, grafana, prometheus, homelab

So you’re running services at home—maybe a Proxmox cluster, some Docker containers, a NAS, perhaps Home Assistant or Pi-hole—and you want to know what’s happening under the hood. You’ve seen those beautiful Grafana dashboards on r/homelab and thought, “I want that.”

Good news: building a monitoring stack with Grafana and Prometheus is easier than it looks, and the skills transfer directly to production environments. Let’s build one together.

Why Prometheus and Grafana?

Prometheus is your data collector. It scrapes metrics from targets (servers, containers, applications), stores them in a time-series database, and evaluates alerting rules. Think of it as a relentless accountant that visits every service every 15 seconds asking, “How are you doing?”

Grafana is your visualization layer. It connects to Prometheus (and 30+ other data sources) and transforms raw metrics into dashboards that actually make sense. Charts, graphs, heatmaps, tables—whatever tells your story best.

Together, they form the monitoring backbone for countless organizations, from solo homelabbers to global enterprises. Here’s why:

| Prometheus Handles | Grafana Handles |
|---|---|
| Metric collection | Visualization |
| Time-series storage | Dashboard composition |
| Alerting rules | Multi-source correlation |
| Service discovery | Template variables |
Note

The synergy is real: Prometheus natively integrates with Grafana. You write the same PromQL queries for both alerting rules and dashboard panels—learn once, use everywhere.

Architecture at a Glance

A typical setup looks like this:

[Targets: Servers, Containers, Apps]
              ^
              | scrapes (pull)
              |
        [Prometheus]
              ├──> stores time-series data
              └──> evaluates alerting rules ──> [Alertmanager] ──sends──> Slack/Email/PagerDuty

        [Grafana] ──queries──> [Prometheus]

Prometheus pulls metrics from configured targets (pull model), stores them locally, and fires alerts when conditions are met. Grafana queries Prometheus to render dashboards. Alertmanager handles the actual notifications.
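The pull model works because every target exposes its metrics as plain text over HTTP, usually at /metrics; a scrape is just an HTTP GET. A fragment of Node Exporter's output looks roughly like this (values here are illustrative):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 2345.67
```

If you can print lines in this format from an HTTP handler, Prometheus can monitor it — that's the whole contract.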

Quick Start: Docker Compose Setup

Let’s get a working stack running. I’ll assume Docker and Docker Compose are installed.

The docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

The prometheus.yml Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      # node_exporter runs with network_mode: host, so the Prometheus
      # container must reach it via the Docker host's address — its own
      # localhost won't work. Replace with your host's LAN IP.
      - targets: ['192.168.1.10:9100']
Warning

Security note: The default Grafana credentials are admin/changeme. Change them immediately after first login. For production, use environment variables or secrets management—never commit passwords to version control.

Deploy and Verify

# Create the files above, then:
docker compose up -d

# Check Prometheus is scraping:
curl http://localhost:9090/api/v1/targets

# Open Grafana (macOS; use xdg-open on Linux, or just visit the URL):
open http://localhost:3000

Log in to Grafana, add Prometheus as a data source (http://prometheus:9090), and you’re ready to build dashboards.
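Since the compose file already mounts ./provisioning, you can skip the manual data-source step entirely. A minimal provisioning file — path and field names follow Grafana's datasource provisioning format — might look like:

```yaml
# ./provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

On startup, Grafana reads this file and creates the data source automatically — handy when you rebuild the stack and don't want to click through setup again.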

Pro Tip

Pre-built dashboards: Grafana maintains a library of community dashboards at grafana.com/grafana/dashboards. For Node Exporter, try importing dashboard ID 1860 (Node Exporter Full)—it gives you comprehensive host metrics instantly.

Understanding Your Metrics: PromQL Basics

Prometheus stores metrics as time series with labels:

node_memory_MemAvailable_bytes{instance="192.168.1.10:9100", job="node_exporter"} 8589934592 1709000000000

Each line packs in a metric name, a set of labels, a value, and a timestamp. To make sense of it, you need PromQL.

Counter vs Gauge

Counters only increase (or reset). Use rate() to get per-second values:

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Total increase over last hour
increase(http_requests_total[1h])

Gauges can go up or down. Query them directly:

# Current available memory
node_memory_MemAvailable_bytes

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

The Queries You’ll Actually Use

CPU usage by mode:

sum by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

Memory utilization:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk usage:

((node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes) * 100

Network traffic:

# Received
rate(node_network_receive_bytes_total[5m])

# Transmitted
rate(node_network_transmit_bytes_total[5m])

p99 latency (for histogram metrics):

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Note

The five-minute window: Most queries use [5m] as the range. This balances responsiveness and noise reduction. For high-frequency metrics, you might use 1m or 30s. For slow-moving data like disk space, 5m is fine.

Dashboard Design: USE and RED Methods

Good dashboards tell a story. Two frameworks help structure them effectively:

USE Method (For Infrastructure)

For resources like CPU, memory, disks:

  • Utilization: Percent time busy (CPU %)
  • Saturation: Amount of work queued (load average, I/O wait)
  • Errors: Error count per second

RED Method (For Services)

For applications, APIs, microservices:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
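The RED signals map directly onto three PromQL queries. This sketch assumes your service exports an http_requests_total counter with a status label and an http_request_duration_seconds histogram — common names in Prometheus client libraries, but check your own instrumentation:

```text
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: 5xx responses per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: p95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```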
Pro Tip

Dashboard hierarchy: Start with a high-level overview dashboard (USE for infrastructure, RED for services). Then create drill-down dashboards that explore specific areas in detail. Link them together using Grafana’s dashboard links feature.

Building Your First Dashboard

  1. Create a dashboard → Add new panel
  2. Choose visualization → Time series for most metrics, Stat for single values
  3. Write your PromQL → Start simple, refine
  4. Add meaningful titles → “CPU Utilization” not “Panel 1”
  5. Set thresholds → Green (ok), Yellow (warning), Red (critical)
  6. Use variables → $instance, $job for reusable dashboards

Example variables:

# Instance dropdown
name: instance
type: query
query: label_values(up, instance)

# Job dropdown
name: job
type: query
query: label_values(up, job)

Now your dashboard works for any instance/job—just change the dropdown.
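In panel queries, reference the variables with the regex matcher (=~) so that multi-select and the "All" option keep working:

```text
# Panel query filtered by the dashboard dropdowns
node_memory_MemAvailable_bytes{instance=~"$instance", job=~"$job"}
```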

Alerting: Catch Problems Before They Become Outages

Prometheus handles the rules. Alertmanager handles the notifications.

Alerting Rules

Create alerting_rules.yml and add it to your Prometheus config:

rule_files:
  - '/etc/prometheus/alerting_rules.yml'

Basic rules for a homelab:

groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has less than 10% free space"

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"
Note

The for: clause: This prevents flapping alerts. An instance must be down for 5 minutes before the alert fires. Adjust based on your tolerance for noise.
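Before reloading Prometheus, it's worth validating the rule file with promtool, which ships in the Prometheus image. And since the compose file passes --web.enable-lifecycle, you can hot-reload the config without a restart (this assumes alerting_rules.yml is mounted into the container at the path referenced in rule_files):

```text
# Validate rule syntax inside the running container
docker exec prometheus promtool check rules /etc/prometheus/alerting_rules.yml

# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```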

Alertmanager Configuration

Add Alertmanager to your compose file:

alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"
  restart: unless-stopped

alertmanager.yml for Slack:

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | title }}: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Pro Tip

Grouping matters: Without grouping, you might receive 50 individual alerts for “disk space low” across your fleet. Grouping by alertname and instance collapses them into one notification with context.
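Alertmanager ships with amtool, which can sanity-check the config before you (re)start and show how the routing tree will handle a given alert:

```text
# Validate alertmanager.yml
amtool check-config alertmanager.yml

# See which receiver an alert with these labels would route to
amtool config routes test --config.file=alertmanager.yml severity=critical
```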

Beyond the Basics: Common Exporters

The Node Exporter gives you host metrics. For everything else, there’s probably an exporter:

| Exporter | What It Monitors | Default Port |
|---|---|---|
| node_exporter | Host (CPU, memory, disk, network) | 9100 |
| cAdvisor | Docker/container metrics | 8080 |
| blackbox_exporter | HTTP/TCP/ICMP probes | 9115 |
| mysqld_exporter | MySQL databases | 9104 |
| postgres_exporter | PostgreSQL | 9187 |
| redis_exporter | Redis instances | 9121 |

cAdvisor for Container Metrics

Add this to your Docker Compose:

cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
  ports:
    - "8080:8080"
  restart: unless-stopped

And to Prometheus:

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Now you can monitor container CPU, memory, and network usage.
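With cAdvisor scraped, per-container resource queries follow the same patterns as the node queries above. The name label is empty for aggregate cgroup entries, hence the filter:

```text
# CPU seconds consumed per second, by container
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Working-set memory, by container
sum by (name) (container_memory_working_set_bytes{name!=""})
```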

Blackbox Exporter for Service Health

Want to know if your websites are up? Blackbox exporter probes endpoints:

blackbox_exporter:
  image: prom/blackbox-exporter:latest
  container_name: blackbox_exporter
  ports:
    - "9115:9115"
  volumes:
    - ./blackbox.yml:/etc/blackbox_exporter/config.yml
  restart: unless-stopped

blackbox.yml:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      method: GET
      preferred_ip_protocol: ip4

Prometheus scrape config:

scrape_configs:
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115
Note

The relabeling magic: Blackbox uses a special pattern where Prometheus queries the exporter with the target as a parameter. The relabel configs rewire __address__ to make this work. It’s confusing at first but extremely powerful.
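Once probes are flowing, a handful of blackbox metrics cover most uptime dashboards and alerts (probe_ssl_earliest_cert_expiry only appears for HTTPS targets):

```text
# 1 if the probe succeeded, 0 if not — alert on == 0
probe_success

# End-to-end probe latency in seconds
probe_duration_seconds

# Fire when the TLS certificate expires in under 14 days
probe_ssl_earliest_cert_expiry - time() < 86400 * 14
```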

Production Considerations

Running in production? Think about these:

Retention

Prometheus stores data locally. Set reasonable retention:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB

Estimate storage: ~1-2 bytes per sample (compressed). Ten thousand samples per second retained for 30 days needs roughly 25-50 GB.
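The arithmetic is simple enough to sanity-check yourself (the 2 bytes/sample figure is a rough planning number; real compression depends on label churn):

```shell
# Rough local-storage estimate: samples/sec x seconds retained x bytes/sample
SAMPLES_PER_SEC=10000
RETENTION_DAYS=30
BYTES_PER_SAMPLE=2

TOTAL_BYTES=$((SAMPLES_PER_SEC * RETENTION_DAYS * 86400 * BYTES_PER_SAMPLE))
echo "~$((TOTAL_BYTES / 1073741824)) GiB needed"   # about 48 GiB at these rates
```

Plug in your own scrape rate — the prometheus_tsdb_head_samples_appended_total counter tells you what it actually is.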

Long-term Storage

For retention beyond 30 days, you need remote storage:

| System | Use Case |
|---|---|
| Thanos | Multi-cluster, virtually unlimited retention |
| VictoriaMetrics | Cost-efficient single-node alternative |
| Cortex/Mimir | Large-scale distributed setups |

High Availability

For critical environments:

  • Run two Prometheus instances with identical configs
  • Cluster Alertmanagers for deduplication
  • Use a load balancer for Grafana

Alertmanager clustering:

alertmanager --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

Security

Basic practices:

  1. Firewall Prometheus — Don’t expose 9090 publicly
  2. Use a reverse proxy — Nginx/Traefik for Grafana with TLS
  3. Secure Grafana — Disable anonymous access, use strong passwords
  4. Don’t commit secrets — Use environment variables

Homelab: The Sweet Spot

For most homelabs, a single Prometheus instance with 30-day retention, Grafana, Node Exporter, cAdvisor, and Alertmanager covers 95% of use cases. Add exporters as needed for specific services (Pi-hole, Home Assistant, Proxmox).

Total resource footprint: ~500MB RAM for Prometheus, ~100MB for Grafana. Your old NUC or Raspberry Pi 4 can handle it easily.

One Compose to Rule Them

Here’s a complete homelab setup:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    volumes:
      - '/:/host:ro,rslave'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Next Steps

You’ve got the foundation. Where to go from here:

  1. Import community dashboards — Start with dashboard ID 1860 (Node Exporter Full)
  2. Add custom alerts — Tailor to your environment
  3. Set up notifications — Slack, Discord, email, Pushover
  4. Monitor your applications — Most have Prometheus exporters or client libraries
  5. Experiment with PromQL — The query language is surprisingly powerful

The beauty of Prometheus + Grafana is that it scales from a single Raspberry Pi to global infrastructure. The skills you learn here transfer directly to production environments. Start small, iterate, and build the dashboards that tell your story.

Pro Tip

Monitor the monitor: Prometheus exposes its own metrics. Create a dashboard tracking prometheus_tsdb_head_series (active series), prometheus_scrape_duration_seconds (scrape timing), and prometheus_rule_evaluation_duration_seconds (alert evaluation). You’ll spot cardinality explosions and performance issues early.
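To hunt cardinality offenders directly, this meta-query ranks metric names by active series count (it touches every series, so run it sparingly on large servers):

```text
# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))
```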

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs.