Building a Home Lab Monitoring Stack with Prometheus and Grafana

Mahmoud Adel | Dec 10, 2024 min read

Every system administrator needs visibility into their infrastructure. In this guide, we’ll build a complete monitoring stack using Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notifications—all containerized with Docker Compose.

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Monitoring Stack                      │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │
│  │   Grafana   │──│  Prometheus │──│  Alertmanager   │  │
│  │  :3000      │  │  :9090      │  │  :9093          │  │
│  └─────────────┘  └──────┬──────┘  └─────────────────┘  │
│                          │                               │
│  ┌───────────────────────┼───────────────────────────┐  │
│  │                 Exporters                          │  │
│  ├─────────────┬─────────────┬─────────────┬─────────┤  │
│  │ Node        │ cAdvisor    │ Blackbox    │ Custom  │  │
│  │ Exporter    │ (Docker)    │ Exporter    │ Apps    │  │
│  │ :9100       │ :8080       │ :9115       │         │  │
│  └─────────────┴─────────────┴─────────────┴─────────┘  │
└─────────────────────────────────────────────────────────┘

Project Structure

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── rules.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── dashboards/
│       │   ├── dashboard.yml
│       │   └── node-exporter.json
│       └── datasources/
│           └── prometheus.yml
└── .env

Docker Compose Configuration

Create docker-compose.yml:

version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}
  alertmanager_data: {}

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.24.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    networks:
      - monitoring

Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'home-lab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - System metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'home-server'

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Blackbox Exporter - Endpoint monitoring
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://google.com
        - https://github.com
        - http://localhost:3000  # Grafana
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Alert Rules

Create prometheus/alerts/rules.yml:

groups:
  - name: system-alerts
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% on {{ $labels.instance }} for more than 5 minutes."

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% on {{ $labels.instance }}."

      # Low disk space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 15% on {{ $labels.instance }}."

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} has been down for more than 1 minute."

  - name: container-alerts
    rules:
      # Container high CPU
      - alert: ContainerHighCPU
        expr: (sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container high CPU usage"
          description: "Container {{ $labels.name }} CPU usage is above 80%."

      # Container high memory
      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container high memory usage"
          description: "Container {{ $labels.name }} memory usage is above 80%."

      # Container restart
      - alert: ContainerRestarted
        expr: increase(container_last_seen{name!=""}[5m]) > 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container restarted"
          description: "Container {{ $labels.name }} has restarted."

Alertmanager Configuration

Create alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'
  routes:
    - match:
        severity: critical
      receiver: 'telegram'
      continue: true
    - match:
        severity: warning
      receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        message: |
          🚨 *Alert: {{ .Status | toUpper }}*
          
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.description }}
          {{ end }}          

  # Email alternative
  # - name: 'email'
  #   email_configs:
  #     - to: 'admin@example.com'
  #       from: 'alertmanager@example.com'
  #       smarthost: 'smtp.gmail.com:587'
  #       auth_username: 'your-email@gmail.com'
  #       auth_password: 'app-password'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Grafana Provisioning

Datasource

Create grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Dashboard Provisioning

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards

Deploy the Stack

Create .env file:

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=your-secure-password

Start everything:

docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f

Access the Services

ServiceURLCredentials
Grafanahttp://localhost:3000admin / (from .env)
Prometheushttp://localhost:9090None
Alertmanagerhttp://localhost:9093None
Node Exporterhttp://localhost:9100None

Useful PromQL Queries

CPU Usage

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk Usage

(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

Network Traffic

# Inbound
rate(node_network_receive_bytes_total[5m]) * 8

# Outbound
rate(node_network_transmit_bytes_total[5m]) * 8

Container Memory

container_memory_usage_bytes{name!=""} / 1024 / 1024

Adding More Targets

To monitor additional hosts, install Node Exporter on each machine:

# On the target machine
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host

Then add to Prometheus:

- job_name: 'remote-nodes'
  static_configs:
    - targets: 
      - '192.168.1.100:9100'
      - '192.168.1.101:9100'

Conclusion

You now have a production-grade monitoring stack that provides:

  • Real-time metrics from all your systems
  • Beautiful dashboards in Grafana
  • Instant alerts via Telegram/Email
  • Container monitoring with cAdvisor
  • Endpoint health checks with Blackbox Exporter

This setup scales well for home labs and small infrastructures. For larger deployments, consider adding Thanos or Cortex for long-term storage and high availability.

Happy monitoring! 📊