# Prometheus & Grafana

Metrics collection with Prometheus and visualization with Grafana: PromQL, alerting, and dashboard design.
Prometheus is an open-source time-series database and monitoring system. Grafana is a visualization platform that connects to Prometheus (and other data sources) to build dashboards and alerts.
## Architecture

```text
┌────────────────┐  ┌───────────────┐  ┌──────────┐
│ App + /metrics │  │ Node Exporter │  │ cAdvisor │
└───────┬────────┘  └───────┬───────┘  └────┬─────┘
        │                   │               │
        └─────────────┬─────┴───────────────┘
                      │ scrape
              ┌───────▼──────┐
              │  Prometheus  │  (TSDB + PromQL)
              └───────┬──────┘
                      │ query
           ┌──────────┴──────────┐
           ▼                     ▼
   ┌──────────────┐      ┌──────────────┐
   │   Grafana    │      │ Alertmanager │
   │ (Dashboards) │      │   (Alerts)   │
   └──────────────┘      └──────────────┘
```

## Prometheus Configuration
### scrape_configs

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # Self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application metrics
  - job_name: "api-server"
    metrics_path: /metrics
    static_configs:
      - targets: ["api-server:3000"]
        labels:
          environment: "production"

  # Node exporter (system metrics)
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-01:9100"
          - "node-02:9100"
          - "node-03:9100"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

## Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value (resets to zero on restart) | `http_requests_total`, `errors_total` |
| Gauge | Value that can go up or down | `temperature`, `active_connections`, `memory_usage_bytes` |
| Histogram | Observations counted into configurable buckets; quantiles computed server-side with `histogram_quantile()` | `http_request_duration_seconds` |
| Summary | Like a histogram, but quantiles are precomputed on the client and cannot be aggregated across instances | `rpc_duration_seconds{quantile="0.9"}` |
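Each type is scraped from `/metrics` in Prometheus's plain-text exposition format. An illustrative excerpt (sample values are made up; histogram buckets are cumulative):

```text
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api",status="200"} 1027

# TYPE active_connections gauge
active_connections 42

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 32918
http_request_duration_seconds_bucket{le="+Inf"} 34512
http_request_duration_seconds_sum 53423.2
http_request_duration_seconds_count 34512
```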
## Instrumenting an Application (Node.js)

```js
const express = require('express');
const client = require('prom-client');

const app = express();

// Default metrics (CPU, memory, event loop, GC)
client.collectDefaultMetrics({ prefix: 'app_' });

// Custom counter
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Custom histogram
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Custom gauge
const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    // req.route is only populated after a route handler has matched,
    // so read it here (post-routing) and fall back to the raw path
    const path = req.route?.path ?? req.path;
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode });
    activeConnections.dec();
    end({ method: req.method, path });
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

## PromQL (Query Language)
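The workhorse function below is `rate()`, which turns a monotonically increasing counter into a per-second rate while handling counter resets (e.g. a process restart). A simplified JavaScript sketch of the idea, not Prometheus's exact extrapolation algorithm:

```javascript
// Simplified sketch of what PromQL's rate() computes for a counter.
// samples: [{ t: seconds, v: counterValue }, ...] sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (process restart): count the
    // new value as the increase accumulated since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const window = samples[samples.length - 1].t - samples[0].t;
  return increase / window;
}

// 55 requests over 60s, with a counter reset partway through
const samples = [
  { t: 0, v: 100 },
  { t: 30, v: 130 },
  { t: 45, v: 10 }, // reset: counter started over
  { t: 60, v: 25 },
];
console.log(simpleRate(samples)); // per-second rate (≈0.917)
```

This is also why raw counter values are rarely graphed directly: the absolute number is meaningless across restarts, but the rate is stable.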
### Basic Queries

```promql
# Instant vector: current value
http_requests_total

# Range vector: values over time
http_requests_total[5m]

# Label filtering
http_requests_total{method="GET", status="200"}
http_requests_total{status=~"5.."}    # Regex match
http_requests_total{path!="/health"}  # Negative match

# Rate: per-second rate over 5 minutes
rate(http_requests_total[5m])

# Increase: total increase over 1 hour
increase(http_requests_total[1h])
```

### Common Patterns
```promql
# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
  * 100

# P50 / P90 / P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk usage percentage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})
  / node_filesystem_size_bytes{mountpoint="/"} * 100

# Top 5 pods by memory
topk(5, container_memory_usage_bytes{namespace="production"})
```

## Alerting Rules
```yaml
# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
```

## Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/xxx/xxx"

route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    - match:
        severity: critical
      receiver: "slack-critical"
    - match:
        severity: warning
      receiver: "slack-warning"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'
  - name: "slack-warning"
    slack_configs:
      - channel: "#alerts-warning"
        title: '⚠️ {{ .CommonAnnotations.summary }}'
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical

inhibit_rules:
  # Suppress warning alerts when a critical alert is already firing
  # for the same alertname and instance
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
```

## Grafana Dashboards
### Data Source Configuration

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
```

### Key Dashboard Panels
| Panel | PromQL | Type |
|---|---|---|
| Request Rate | `sum(rate(http_requests_total[5m]))` | Graph |
| Error Rate (%) | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100` | Stat |
| P99 Latency | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | Graph |
| CPU Usage | `100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100` | Gauge |
| Memory Usage | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100` | Gauge |
| Active Pods | `count(kube_pod_status_phase{phase="Running"})` | Stat |
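Dashboards can be provisioned from files the same way as data sources. A minimal sketch, assuming dashboard JSON files are mounted under `/etc/grafana/dashboards` (the folder name and paths here are illustrative):

```yaml
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: "default"
    folder: "Services"
    type: file
    options:
      path: /etc/grafana/dashboards
```

With this in place, dashboards live in version control alongside the Prometheus and Alertmanager configs instead of being hand-edited in the UI.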
Docker Compose Setup
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.51.0
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/alerts:/etc/prometheus/alerts
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
grafana:
image: grafana/grafana:10.4.0
ports:
- "3001:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
- grafana-data:/var/lib/grafana
depends_on:
- prometheus
alertmanager:
image: prom/alertmanager:v0.27.0
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
node-exporter:
image: prom/node-exporter:v1.7.0
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
volumes:
prometheus-data:
grafana-data:Best Practices
### Monitoring Guidelines

- **USE Method**: monitor Utilization, Saturation, and Errors for every resource
- **RED Method**: monitor Rate, Errors, and Duration for every service
- **Golden Signals**: Latency, Traffic, Errors, Saturation (Google SRE)
- **Cardinality**: avoid high-cardinality labels (user IDs, request IDs) in metrics; every unique label combination creates a separate time series
- **Retention**: balance storage cost against analysis needs (15-30 days is typical)
- **Alert fatigue**: only alert on actionable conditions; use severity levels for routing
- **Dashboards**: create separate dashboards for overview, per-service, and infrastructure views
- **Recording rules**: pre-compute expensive queries as recording rules for faster dashboards
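The last point can be sketched as a rule file loaded via `rule_files` in `prometheus.yml`; the new metric names below follow the conventional `level:metric:operations` naming pattern:

```yaml
# alerts/recording.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
```

Dashboards and alerts can then query `job:http_requests:rate5m` directly instead of re-evaluating the full expression on every refresh.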