
Prometheus & Grafana

Metrics collection with Prometheus and visualization with Grafana: PromQL, alerting, and dashboard design


Prometheus is an open-source time-series database and monitoring system. Grafana is a visualization platform that connects to Prometheus (and other data sources) to build dashboards and alerts.

Architecture

┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ App + /metrics │  │  Node Exporter │  │    cAdvisor    │
└───────┬────────┘  └───────┬────────┘  └───────┬────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │ scrape
                    ┌───────▼───────┐
                    │  Prometheus   │ (TSDB + PromQL)
                    └───────┬───────┘
                            │ query
                ┌───────────┴───────────┐
                ▼                       ▼
         ┌──────────────┐       ┌──────────────┐
         │   Grafana    │       │ Alertmanager │
         │ (Dashboards) │       │   (Alerts)   │
         └──────────────┘       └──────────────┘
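Each scrape target exposes its metrics over HTTP in the Prometheus text exposition format. A scrape of `/metrics` returns lines like the following (illustrative sample; values and labels are made up):

```text
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/users",status="200"} 1027
# HELP active_connections Number of active connections
# TYPE active_connections gauge
active_connections 3
```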

Prometheus Configuration

scrape_config

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # Self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application metrics
  - job_name: "api-server"
    metrics_path: /metrics
    static_configs:
      - targets: ["api-server:3000"]
        labels:
          environment: "production"

  # Node exporter (system metrics)
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-01:9100"
          - "node-02:9100"
          - "node-03:9100"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
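The relabel rules above key off pod annotations: the `keep` action matches `prometheus.io/scrape`, and the `replace` action rewrites the scrape path from `prometheus.io/path`. A pod that should be scraped would carry metadata like this (illustrative fragment of a Deployment's pod template):

```yaml
# Pod template metadata (e.g. in a Deployment spec)
metadata:
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/path: "/metrics" # rewritten into __metrics_path__
```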

Metric Types

| Type      | Description                                       | Example                                             |
|-----------|---------------------------------------------------|-----------------------------------------------------|
| Counter   | Monotonically increasing value                    | http_requests_total, errors_total                   |
| Gauge     | Value that can go up or down                      | temperature, active_connections, memory_usage_bytes |
| Histogram | Samples in configurable buckets                   | http_request_duration_seconds                       |
| Summary   | Similar to histogram, with client-side quantiles  | request_duration{quantile="0.99"}                   |

Instrumenting an Application (Node.js)

const client = require('prom-client');

// Default metrics (CPU, memory, event loop, GC)
client.collectDefaultMetrics({ prefix: 'app_' });

// Custom counter
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Custom histogram
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Custom gauge
const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, path: req.route?.path });
  activeConnections.inc();

  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, path: req.route?.path, status: res.statusCode });
    activeConnections.dec();
    end();
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

PromQL (Query Language)

Basic Queries

# Instant vector: current value
http_requests_total

# Range vector: values over time
http_requests_total[5m]

# Label filtering
http_requests_total{method="GET", status="200"}
http_requests_total{status=~"5.."}         # Regex match
http_requests_total{path!="/health"}       # Negative match

# Rate: per-second rate over 5 minutes
rate(http_requests_total[5m])

# Increase: total increase over 1 hour
increase(http_requests_total[1h])
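To build intuition for `rate()`, here is a rough sketch of the per-second rate over a window of counter samples, including the counter-reset handling Prometheus applies (a sample lower than its predecessor is treated as a process restart). `counterRate` is a hypothetical helper for illustration, not Prometheus's actual implementation:

```javascript
// Approximate per-second rate from time-ordered counter samples.
// Each sample: { t: unix seconds, v: counter value }.
function counterRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (process restart);
    // the whole new value counts as increase since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const window = samples[samples.length - 1].t - samples[0].t;
  return increase / window;
}

// 100 requests over 50 seconds
console.log(counterRate([{ t: 0, v: 0 }, { t: 50, v: 100 }])); // 2 req/s
```

The real `rate()` additionally extrapolates to the edges of the range window, so its results will differ slightly from this naive version.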

Common Patterns

# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P50 / P90 / P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk usage percentage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})
/ node_filesystem_size_bytes{mountpoint="/"} * 100

# Top 5 pods by memory
topk(5, container_memory_usage_bytes{namespace="production"})
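`histogram_quantile()` works on the cumulative bucket counters and linearly interpolates within the bucket that contains the target quantile. A simplified sketch of that calculation (hypothetical helper; it ignores `rate()` and aggregation, and assumes buckets sorted by `le`):

```javascript
// buckets: cumulative counts keyed by upper bound `le`, ascending,
// ending with the +Inf bucket (total observation count).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Quantiles landing in the +Inf bucket are capped at the
      // last finite upper bound, as Prometheus does.
      if (b.le === Infinity) return prevBound;
      // Linear interpolation between the bucket's bounds.
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevBound + (b.le - prevBound) * fraction;
    }
    prevBound = b.le;
    prevCount = b.count;
  }
  return prevBound;
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.5, buckets));  // 0.1  (P50 at first bucket bound)
console.log(histogramQuantile(0.99, buckets)); // 0.5  (P99 capped at last finite bound)
```

This is also why quantile accuracy depends entirely on bucket layout: choose `buckets` in the instrumentation code around the latencies you actually care about.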

Alerting Rules

# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/xxx/xxx"

route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    - match:
        severity: critical
      receiver: "slack-critical"
    - match:
        severity: warning
      receiver: "slack-warning"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'

  - name: "slack-warning"
    slack_configs:
      - channel: "#alerts-warning"
        title: '⚠️ {{ .CommonAnnotations.summary }}'

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
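The inhibit rule above suppresses warning alerts whenever a critical alert with the same `alertname` and `instance` is firing, so one outage does not page twice. A simplified sketch of that matching logic (hypothetical, not Alertmanager's implementation):

```javascript
// Returns the alerts that survive inhibition: a target alert is dropped
// when some source alert matches sourceMatch and agrees on all `equal` labels.
function applyInhibition(alerts, rule) {
  const matches = (labels, matcher) =>
    Object.entries(matcher).every(([k, v]) => labels[k] === v);
  const sources = alerts.filter((a) => matches(a.labels, rule.sourceMatch));
  return alerts.filter((a) => {
    if (!matches(a.labels, rule.targetMatch)) return true; // not a target
    return !sources.some((s) =>
      rule.equal.every((k) => s.labels[k] === a.labels[k])
    );
  });
}

const rule = {
  sourceMatch: { severity: "critical" },
  targetMatch: { severity: "warning" },
  equal: ["alertname", "instance"],
};
const alerts = [
  { labels: { alertname: "HighMemoryUsage", instance: "node-01", severity: "critical" } },
  { labels: { alertname: "HighMemoryUsage", instance: "node-01", severity: "warning" } },
  { labels: { alertname: "HighMemoryUsage", instance: "node-02", severity: "warning" } },
];
// The node-01 warning is inhibited; the node-02 warning still fires.
console.log(applyInhibition(alerts, rule).length); // 2
```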

Grafana Dashboards

Data Source Configuration

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
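Dashboards can be provisioned the same way as data sources. A minimal file provider (the path is illustrative) tells Grafana to load dashboard JSON definitions from disk at startup:

```yaml
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards
```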

Key Dashboard Panels

| Panel          | PromQL                                                                                           | Type  |
|----------------|--------------------------------------------------------------------------------------------------|-------|
| Request Rate   | sum(rate(http_requests_total[5m]))                                                               | Graph |
| Error Rate (%) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100     | Stat  |
| P99 Latency    | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))            | Graph |
| CPU Usage      | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100                                   | Gauge |
| Memory Usage   | (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 | Gauge |
| Active Pods    | count(kube_pod_status_phase{phase="Running"})                                                    | Stat  |

Docker Compose Setup

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml

  node-exporter:
    image: prom/node-exporter:v1.7.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

volumes:
  prometheus-data:
  grafana-data:

Best Practices

Monitoring Guidelines

  1. USE Method: Monitor Utilization, Saturation, and Errors for every resource
  2. RED Method: Monitor Rate, Errors, and Duration for every service
  3. Golden Signals: Latency, Traffic, Errors, Saturation (Google SRE)
  4. Cardinality: Avoid high-cardinality labels (user IDs, request IDs) in metrics
  5. Retention: Balance storage cost with analysis needs (15-30 days typical)
  6. Alert fatigue: Only alert on actionable conditions; use severity levels
  7. Dashboards: Create separate dashboards for overview, per-service, and infrastructure
  8. Recording rules: Pre-compute expensive queries as recording rules for faster dashboards
