Self-Hosted Performance Platform
Deploy a private performance monitoring and alerting platform with trend dashboards and threshold-based alerts
Build and deploy a private, internal performance monitoring platform. This guide covers the complete architecture — from browser-side data collection through storage, trend visualization (with time-range selection), and configurable threshold alerting.
Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ Browser (Real Users) │
│ ┌──────────────┐ │
│ │ web-vitals │── sendBeacon ──┐ │
│ │ SDK snippet │ │ │
│ └──────────────┘ │ │
└──────────────────────────────────┼──────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Ingestion Layer │
│ ┌──────────────────────────────────────────┐ │
│ │ Collector API (Node.js / Go) │ │
│ │ - validate & enrich events │ │
│ │ - batch write to storage │ │
│ └──────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ ClickHouse │ │ Redis │ │
│ │ (metrics store) │ │ (alert state) │ │
│ └──────────────────┘ └──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Query API (Node.js / Go) │ │
│ │ - trend queries with time range │ │
│ │ - percentile aggregations │ │
│ │ - alert rule evaluation │ │
│ └──────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Dashboard UI (Next.js) │ │
│ │ - trend charts with time picker │ │
│ │ - alert rule configuration │ │
│ │ - team notifications │ │
│ └──────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Why self-host?
- Full data ownership — metrics never leave your network
- No per-seat or per-event pricing
- Custom dimensions (business context, experiments, feature flags)
- Integration with internal alerting systems (Slack, PagerDuty, WeCom, DingTalk)
Choosing an Architecture: ClickHouse vs ELK
This guide provides two complete architectures. Choose based on your team's situation.
| | ClickHouse + Grafana | ELK Stack |
|---|---|---|
| Best for | Pure performance analytics, high-volume metrics | Unified observability (logs + metrics + traces) |
| Aggregation speed | Extremely fast — columnar storage with native quantile() | Slower for percentile aggregations at scale |
| Storage efficiency | ~10:1 compression ratio (columnar) | ~3:1 (inverted index overhead) |
| Dashboard | Grafana (powerful, needs configuration) or custom UI | Kibana (rich out-of-box, drag-and-drop, zero code) |
| Alerting | Custom engine or Grafana Alerting | ElastAlert2 or Kibana Alerting (built-in) |
| Full-text search | Not supported | Native — search through error logs alongside metrics |
| APM integration | Metrics only (pair with Jaeger for traces) | Elastic APM covers browser, server, DB traces in one place |
| Setup complexity | Lower (fewer components) | Higher (more services, more RAM) |
| RAM requirement | ~2 GB minimum | ~4–8 GB minimum (ES is memory-hungry) |
| Team already has ELK? | Need to deploy separately | Reuse existing cluster — just add APM + index |
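To make the "aggregation speed" rows concrete, here is what a percentile aggregation actually computes — a minimal nearest-rank sketch in TypeScript. (ClickHouse's `quantile()` and Elasticsearch's `percentiles` aggregation both use faster approximate estimators at scale, but they report the same statistic.)

```typescript
// Nearest-rank percentile: sort the sample, take the value at the
// ceil(p% * n)-th position. This is the exact version of what the
// stores estimate approximately.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('empty sample');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// p75 of ten LCP samples (ms)
const lcp = [900, 1100, 1200, 1300, 1500, 1700, 2100, 2400, 3000, 4200];
console.log(percentile(lcp, 75)); // 2400 — 75% of page loads were at or below this
```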
Recommendation
- Choose ClickHouse if your primary goal is fast metrics dashboards and you want minimal infrastructure
- Choose ELK if your team already runs ELK, or you want unified logs + metrics + traces + error tracking in one platform with Kibana's drag-and-drop dashboards
- Both architectures share the same Browser SDK (section 1 below)
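The decision above condenses to a couple of booleans — a toy sketch (the input field names are our own labels, not anything from either stack):

```typescript
// Toy decision helper condensing the recommendation above: reuse ELK if
// you already run it or want unified observability; otherwise ClickHouse
// gives faster metrics dashboards with less infrastructure.
interface TeamContext {
  alreadyRunsElk: boolean;
  wantsUnifiedLogsMetricsTraces: boolean;
}

function chooseArchitecture(ctx: TeamContext): 'clickhouse' | 'elk' {
  return ctx.alreadyRunsElk || ctx.wantsUnifiedLogsMetricsTraces
    ? 'elk'
    : 'clickhouse';
}
```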
Architecture A: ClickHouse Stack
1. Data Collection (Browser SDK)
A lightweight SDK that reports Core Web Vitals and custom metrics to your collector.
// sdk/perf-sdk.ts
import { onCLS, onINP, onLCP, onFCP, onTTFB, type Metric } from 'web-vitals';
interface PerfEvent {
metric: string;
value: number;
rating: string;
page: string;
device: string;
connection: string;
timestamp: number;
sessionId: string;
appVersion: string;
}
const SESSION_ID = crypto.randomUUID();
function getDevice(): string {
if (/Mobi|Android/i.test(navigator.userAgent)) return 'mobile';
if (/Tablet|iPad/i.test(navigator.userAgent)) return 'tablet';
return 'desktop';
}
function send(metric: Metric) {
const event: PerfEvent = {
metric: metric.name,
value: metric.value,
rating: metric.rating,
page: location.pathname,
device: getDevice(),
connection: (navigator as any).connection?.effectiveType ?? 'unknown',
timestamp: Date.now(),
sessionId: SESSION_ID,
appVersion: document.querySelector('meta[name="app-version"]')?.getAttribute('content') ?? 'unknown',
};
const blob = new Blob([JSON.stringify(event)], { type: 'application/json' });
navigator.sendBeacon('/api/perf/collect', blob);
}
export function initPerfSDK() {
onLCP(send);
onINP(send);
onCLS(send);
onFCP(send);
onTTFB(send);
}

Usage in Next.js:
// app/layout.tsx
'use client';
import { useEffect } from 'react';
export function PerfProvider({ children }: { children: React.ReactNode }) {
useEffect(() => {
import('../sdk/perf-sdk').then(({ initPerfSDK }) => initPerfSDK()); // path matches sdk/perf-sdk.ts above
}, []);
return <>{children}</>;
}

2. Ingestion Service (Collector API)
Receives events, validates them, and batch-inserts into ClickHouse.
ClickHouse Table Schema
-- Create the metrics table
CREATE TABLE perf_metrics (
metric LowCardinality(String), -- LCP, INP, CLS, FCP, TTFB
value Float64,
rating LowCardinality(String), -- good, needs-improvement, poor
page String,
device LowCardinality(String), -- mobile, tablet, desktop
connection LowCardinality(String), -- 4g, 3g, 2g, slow-2g, unknown
app_version LowCardinality(String),
session_id String,
timestamp DateTime64(3),
-- Partition by day for efficient time-range queries
-- Order by metric + page for fast aggregation
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (metric, page, timestamp);
-- Materialized view: pre-aggregate hourly percentiles
CREATE MATERIALIZED VIEW perf_hourly_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(hour)
ORDER BY (metric, page, device, hour)
AS
SELECT
metric,
page,
device,
toStartOfHour(timestamp) AS hour,
quantileState(0.50)(value) AS p50_state,
quantileState(0.75)(value) AS p75_state,
quantileState(0.95)(value) AS p95_state,
countState() AS count_state
FROM perf_metrics
GROUP BY metric, page, device, hour;

Collector API (Node.js)
// server/collector.ts
import { createClient } from '@clickhouse/client';
import { Router, json } from 'express';
const clickhouse = createClient({
url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123',
database: 'perf',
});
const VALID_METRICS = new Set(['LCP', 'INP', 'CLS', 'FCP', 'TTFB']);
const BATCH_SIZE = 500;
const FLUSH_INTERVAL_MS = 5000;
let buffer: PerfEvent[] = [];
interface PerfEvent {
metric: string;
value: number;
rating: string;
page: string;
device: string;
connection: string;
timestamp: number;
sessionId: string;
appVersion: string;
}
async function flush() {
if (buffer.length === 0) return;
const batch = buffer.splice(0, buffer.length);
await clickhouse.insert({
table: 'perf_metrics',
values: batch.map(e => ({
metric: e.metric,
value: e.value,
rating: e.rating,
page: e.page,
device: e.device,
connection: e.connection,
app_version: e.appVersion,
session_id: e.sessionId,
// ClickHouse's default parser expects "YYYY-MM-DD hh:mm:ss.sss", not ISO "…T…Z"
timestamp: new Date(e.timestamp).toISOString().replace('T', ' ').replace('Z', ''),
})),
format: 'JSONEachRow',
});
console.log(`Flushed ${batch.length} events to ClickHouse`);
}
// Flush periodically — catch errors so a failed insert can't crash the process
setInterval(() => flush().catch(console.error), FLUSH_INTERVAL_MS);
export const collectorRouter = Router();
collectorRouter.post('/collect', json(), (req, res) => {
const event: PerfEvent = req.body;
// Validate
if (!VALID_METRICS.has(event.metric)) {
return res.status(400).json({ error: 'Invalid metric' });
}
if (typeof event.value !== 'number' || event.value < 0) {
return res.status(400).json({ error: 'Invalid value' });
}
buffer.push(event);
// Flush if buffer is full
if (buffer.length >= BATCH_SIZE) {
flush().catch(console.error);
}
res.status(204).end();
});

3. Query API (Trend + Alerting Data)
Provide endpoints for the dashboard to query trends and for the alerting engine to evaluate thresholds.
// server/query.ts
import { createClient } from '@clickhouse/client';
import { Router } from 'express';
const clickhouse = createClient({
url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123',
database: 'perf',
});
export const queryRouter = Router();
// ── Trend API ───────────────────────────────────────────────────
// GET /api/perf/trend?metric=LCP&page=/&start=2026-01-01&end=2026-02-01&granularity=hour
queryRouter.get('/trend', async (req, res) => {
const { metric, page, device, start, end, granularity = 'hour' } = req.query;
const granularityFn = granularity === 'day' ? 'toStartOfDay' :
granularity === 'hour' ? 'toStartOfHour' :
'toStartOfFifteenMinutes';
const conditions: string[] = ['1 = 1'];
const params: Record<string, string> = {};
if (metric) { conditions.push('metric = {metric:String}'); params.metric = metric as string; }
if (page) { conditions.push('page = {page:String}'); params.page = page as string; }
if (device) { conditions.push('device = {device:String}'); params.device = device as string; }
if (start) { conditions.push('timestamp >= {start:String}'); params.start = start as string; }
if (end) { conditions.push('timestamp <= {end:String}'); params.end = end as string; }
const query = `
SELECT
${granularityFn}(timestamp) AS time,
quantile(0.50)(value) AS p50,
quantile(0.75)(value) AS p75,
quantile(0.95)(value) AS p95,
count() AS samples,
countIf(rating = 'good') / count() * 100 AS good_pct,
countIf(rating = 'poor') / count() * 100 AS poor_pct
FROM perf_metrics
WHERE ${conditions.join(' AND ')}
GROUP BY time
ORDER BY time
`;
const result = await clickhouse.query({ query, query_params: params, format: 'JSONEachRow' });
const data = await result.json();
res.json(data);
});
// ── Overview API ────────────────────────────────────────────────
// GET /api/perf/overview?start=2026-01-01&end=2026-02-01
queryRouter.get('/overview', async (req, res) => {
const { start, end } = req.query;
const query = `
SELECT
metric,
page,
device,
quantile(0.50)(value) AS p50,
quantile(0.75)(value) AS p75,
quantile(0.95)(value) AS p95,
count() AS samples,
countIf(rating = 'good') / count() * 100 AS good_pct,
countIf(rating = 'poor') / count() * 100 AS poor_pct
FROM perf_metrics
WHERE timestamp BETWEEN {start:String} AND {end:String}
GROUP BY metric, page, device
ORDER BY metric, page, device
`;
const result = await clickhouse.query({
query,
query_params: { start: start as string, end: end as string },
format: 'JSONEachRow',
});
const data = await result.json();
res.json(data);
});
// ── Current Percentile (for alerting) ───────────────────────────
// GET /api/perf/current?metric=LCP&percentile=75&windowMinutes=60
queryRouter.get('/current', async (req, res) => {
const { metric, percentile = '75', windowMinutes = '60', page } = req.query;
// Optional page filter so page-scoped alert rules can be evaluated
const query = `
SELECT
quantile({p:Float64} / 100)(value) AS value,
count() AS samples
FROM perf_metrics
WHERE metric = {metric:String}
AND timestamp >= now() - INTERVAL {window:UInt32} MINUTE
${page ? 'AND page = {page:String}' : ''}
`;
const result = await clickhouse.query({
query,
query_params: {
metric: metric as string,
p: percentile as string,
window: windowMinutes as string,
...(page ? { page: page as string } : {}),
},
format: 'JSONEachRow',
});
const [row] = await result.json<{ value: number; samples: number }>();
res.json(row);
});

4. Alerting Engine
A scheduled evaluator that checks configured thresholds and sends notifications.
Alert Rule Configuration
// server/alert-rules.ts
export interface AlertRule {
id: string;
name: string;
metric: string; // LCP, INP, CLS, FCP, TTFB
percentile: number; // 50, 75, 95
threshold: number; // metric value threshold
windowMinutes: number; // time window to evaluate
cooldownMinutes: number; // minimum gap between repeated alerts
severity: 'warning' | 'critical';
channels: AlertChannel[]; // where to send notifications
enabled: boolean;
page?: string; // optional: scope to a specific page
}
export interface AlertChannel {
type: 'slack' | 'webhook' | 'email' | 'wecom' | 'dingtalk';
target: string; // webhook URL, email address, etc.
}
// Default rules — can be overridden via the dashboard UI
export const defaultRules: AlertRule[] = [
{
id: 'lcp-warning',
name: 'LCP p75 > 2.5s',
metric: 'LCP',
percentile: 75,
threshold: 2500,
windowMinutes: 60,
cooldownMinutes: 120,
severity: 'warning',
channels: [{ type: 'slack', target: process.env.SLACK_WEBHOOK_URL! }],
enabled: true,
},
{
id: 'lcp-critical',
name: 'LCP p75 > 4s',
metric: 'LCP',
percentile: 75,
threshold: 4000,
windowMinutes: 15,
cooldownMinutes: 30,
severity: 'critical',
channels: [{ type: 'slack', target: process.env.SLACK_WEBHOOK_URL! }],
enabled: true,
},
{
id: 'inp-warning',
name: 'INP p75 > 200ms',
metric: 'INP',
percentile: 75,
threshold: 200,
windowMinutes: 60,
cooldownMinutes: 120,
severity: 'warning',
channels: [{ type: 'slack', target: process.env.SLACK_WEBHOOK_URL! }],
enabled: true,
},
{
id: 'cls-warning',
name: 'CLS p75 > 0.1',
metric: 'CLS',
percentile: 75,
threshold: 0.1,
windowMinutes: 60,
cooldownMinutes: 120,
severity: 'warning',
channels: [{ type: 'slack', target: process.env.SLACK_WEBHOOK_URL! }],
enabled: true,
},
];

Alert Evaluator
// server/alert-evaluator.ts
import Redis from 'ioredis';
import { AlertRule, AlertChannel } from './alert-rules';
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
interface MetricsClient {
getPercentile(metric: string, percentile: number, windowMinutes: number, page?: string): Promise<{ value: number; samples: number }>;
}
export class AlertEvaluator {
constructor(
private metricsClient: MetricsClient,
private rules: AlertRule[]
) {}
// Run on a schedule (e.g. every 1 minute via setInterval or cron)
async evaluate() {
for (const rule of this.rules) {
if (!rule.enabled) continue;
try {
const { value, samples } = await this.metricsClient.getPercentile(
rule.metric, rule.percentile, rule.windowMinutes, rule.page
);
// Need enough samples to be meaningful
if (samples < 30) continue;
if (value > rule.threshold) {
await this.fire(rule, value, samples);
} else {
// Clear alert state when recovered
await redis.del(`alert:fired:${rule.id}`);
}
} catch (err) {
console.error(`Alert evaluation failed for rule ${rule.id}:`, err);
}
}
}
private async fire(rule: AlertRule, value: number, samples: number) {
const cooldownKey = `alert:fired:${rule.id}`;
const lastFired = await redis.get(cooldownKey);
if (lastFired) return; // Still in cooldown
// Set cooldown
await redis.setex(cooldownKey, rule.cooldownMinutes * 60, Date.now().toString());
// Store alert history
await redis.lpush('alert:history', JSON.stringify({
ruleId: rule.id,
ruleName: rule.name,
metric: rule.metric,
value,
threshold: rule.threshold,
samples,
severity: rule.severity,
timestamp: new Date().toISOString(),
}));
await redis.ltrim('alert:history', 0, 999); // Keep last 1000
// Send notifications
for (const channel of rule.channels) {
await this.notify(channel, rule, value, samples);
}
}
private async notify(channel: AlertChannel, rule: AlertRule, value: number, samples: number) {
const formatValue = (metric: string, v: number) =>
metric === 'CLS' ? v.toFixed(3) : `${v.toFixed(0)} ms`;
const message = {
title: `[${rule.severity.toUpperCase()}] ${rule.name}`,
body: [
`**${rule.metric}** p${rule.percentile} = **${formatValue(rule.metric, value)}** (threshold: ${formatValue(rule.metric, rule.threshold)})`,
`Window: ${rule.windowMinutes} min | Samples: ${samples}`,
rule.page ? `Page: ${rule.page}` : 'All pages',
`Time: ${new Date().toISOString()}`,
].join('\n'),
};
switch (channel.type) {
case 'slack':
await fetch(channel.target, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
text: `${message.title}\n${message.body}`,
}),
});
break;
case 'wecom':
await fetch(channel.target, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
msgtype: 'markdown',
markdown: { content: `${message.title}\n${message.body}` },
}),
});
break;
case 'dingtalk':
await fetch(channel.target, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
msgtype: 'markdown',
markdown: { title: message.title, text: message.body },
}),
});
break;
case 'webhook':
await fetch(channel.target, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ ...message, rule, value, samples }),
});
break;
}
}
}

Run the Evaluator
// server/index.ts
import { AlertEvaluator } from './alert-evaluator';
import { defaultRules } from './alert-rules';
const evaluator = new AlertEvaluator(metricsClient, defaultRules);
// Evaluate every 60 seconds
setInterval(() => {
evaluator.evaluate().catch(console.error);
}, 60_000);

5. Dashboard UI
A Next.js dashboard with time-range trend charts and alert rule management.
Trend Chart Component
// components/TrendChart.tsx
'use client';
import { useState, useEffect } from 'react';
import {
LineChart, Line, XAxis, YAxis, Tooltip, CartesianGrid,
ResponsiveContainer, ReferenceLine,
} from 'recharts';
interface TrendPoint {
time: string;
p50: number;
p75: number;
p95: number;
good_pct: number;
}
interface TrendChartProps {
metric: string;
page?: string;
threshold?: number;
}
const TIME_RANGES = [
{ label: '1H', value: '1h', granularity: '15min' },
{ label: '6H', value: '6h', granularity: '15min' },
{ label: '24H', value: '24h', granularity: 'hour' },
{ label: '7D', value: '7d', granularity: 'hour' },
{ label: '30D', value: '30d', granularity: 'day' },
];
function getStartDate(range: string): string {
const now = Date.now();
const ms: Record<string, number> = {
'1h': 3600000,
'6h': 21600000,
'24h': 86400000,
'7d': 604800000,
'30d': 2592000000,
};
return new Date(now - (ms[range] ?? 86400000)).toISOString();
}
export function TrendChart({ metric, page, threshold }: TrendChartProps) {
const [range, setRange] = useState('24h');
const [data, setData] = useState<TrendPoint[]>([]);
const [loading, setLoading] = useState(true);
const granularity = TIME_RANGES.find(r => r.value === range)?.granularity ?? 'hour';
useEffect(() => {
setLoading(true);
const params = new URLSearchParams({
metric,
start: getStartDate(range),
end: new Date().toISOString(),
granularity,
});
if (page) params.set('page', page);
fetch(`/api/perf/trend?${params}`)
.then(r => r.json())
.then(setData)
.finally(() => setLoading(false));
}, [metric, page, range, granularity]);
const formatValue = (v: number) =>
metric === 'CLS' ? v.toFixed(3) : `${v.toFixed(0)} ms`;
return (
<div>
<div style={{ display: 'flex', justifyContent: 'space-between', marginBottom: 16 }}>
<h3>{metric} Trend</h3>
<div style={{ display: 'flex', gap: 4 }}>
{TIME_RANGES.map(r => (
<button
key={r.value}
onClick={() => setRange(r.value)}
style={{
padding: '4px 12px',
borderRadius: 4,
border: '1px solid #ddd',
background: range === r.value ? '#0070f3' : '#fff',
color: range === r.value ? '#fff' : '#333',
cursor: 'pointer',
}}
>
{r.label}
</button>
))}
</div>
</div>
{loading ? (
<div style={{ height: 300, display: 'flex', alignItems: 'center', justifyContent: 'center' }}>
Loading...
</div>
) : (
<ResponsiveContainer width="100%" height={300}>
<LineChart data={data}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis
dataKey="time"
tickFormatter={(t) => new Date(t).toLocaleTimeString([], { hour: '2-digit', minute: '2-digit' })}
/>
<YAxis tickFormatter={formatValue} />
<Tooltip
labelFormatter={(t) => new Date(t as string).toLocaleString()}
formatter={(v: number) => formatValue(v)}
/>
<Line type="monotone" dataKey="p50" stroke="#10b981" name="p50" dot={false} />
<Line type="monotone" dataKey="p75" stroke="#f59e0b" name="p75" strokeWidth={2} dot={false} />
<Line type="monotone" dataKey="p95" stroke="#ef4444" name="p95" dot={false} />
{threshold && (
<ReferenceLine y={threshold} stroke="#ef4444" strokeDasharray="5 5" label="Threshold" />
)}
</LineChart>
</ResponsiveContainer>
)}
</div>
);
}

Alert Rule Management UI
// components/AlertRuleEditor.tsx
'use client';
import { useState } from 'react';
interface AlertRule {
id: string;
name: string;
metric: string;
percentile: number;
threshold: number;
windowMinutes: number;
cooldownMinutes: number;
severity: 'warning' | 'critical';
channels: { type: string; target: string }[];
enabled: boolean;
}
export function AlertRuleEditor({ rules: initial }: { rules: AlertRule[] }) {
const [rules, setRules] = useState(initial);
const [editing, setEditing] = useState<string | null>(null);
async function saveRule(rule: AlertRule) {
await fetch('/api/perf/alerts/rules', {
method: 'PUT',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(rule),
});
setEditing(null);
}
async function toggleRule(id: string, enabled: boolean) {
setRules(prev => prev.map(r => r.id === id ? { ...r, enabled } : r));
await fetch(`/api/perf/alerts/rules/${id}/toggle`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ enabled }),
});
}
const formatThreshold = (metric: string, value: number) =>
metric === 'CLS' ? value.toFixed(2) : `${value} ms`;
return (
<div>
<h3>Alert Rules</h3>
<table style={{ width: '100%', borderCollapse: 'collapse' }}>
<thead>
<tr>
<th>Status</th>
<th>Name</th>
<th>Metric</th>
<th>Percentile</th>
<th>Threshold</th>
<th>Window</th>
<th>Severity</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
{rules.map(rule => (
<tr key={rule.id}>
<td>
<input
type="checkbox"
checked={rule.enabled}
onChange={(e) => toggleRule(rule.id, e.target.checked)}
/>
</td>
<td>{rule.name}</td>
<td>{rule.metric}</td>
<td>p{rule.percentile}</td>
<td>{formatThreshold(rule.metric, rule.threshold)}</td>
<td>{rule.windowMinutes} min</td>
<td>
<span style={{
padding: '2px 8px',
borderRadius: 4,
background: rule.severity === 'critical' ? '#fecaca' : '#fef3c7',
color: rule.severity === 'critical' ? '#dc2626' : '#d97706',
}}>
{rule.severity}
</span>
</td>
<td>
<button onClick={() => setEditing(rule.id)}>Edit</button>
</td>
</tr>
))}
</tbody>
</table>
</div>
);
}

Dashboard Page
// app/dashboard/page.tsx
import { TrendChart } from '@/components/TrendChart';
import { AlertRuleEditor } from '@/components/AlertRuleEditor';
export default async function DashboardPage() {
const rules = await fetch(`${process.env.API_URL}/api/perf/alerts/rules`).then(r => r.json());
return (
<div style={{ maxWidth: 1200, margin: '0 auto', padding: 24 }}>
<h1>Performance Dashboard</h1>
<section>
<h2>Core Web Vitals Trends</h2>
<div style={{ display: 'grid', gap: 24 }}>
<TrendChart metric="LCP" threshold={2500} />
<TrendChart metric="INP" threshold={200} />
<TrendChart metric="CLS" threshold={0.1} />
</div>
</section>
<section style={{ marginTop: 48 }}>
<h2>Alerting</h2>
<AlertRuleEditor rules={rules} />
</section>
</div>
);
}

6. Docker Compose Deployment
One-command deployment for the entire stack.
# docker-compose.yml
services:
clickhouse:
image: clickhouse/clickhouse-server:24
ports:
- "8123:8123" # HTTP interface
- "9000:9000" # Native interface
volumes:
- clickhouse-data:/var/lib/clickhouse
- ./init-db.sql:/docker-entrypoint-initdb.d/init.sql
environment:
CLICKHOUSE_DB: perf
CLICKHOUSE_USER: perf
CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD}
ulimits:
nofile:
soft: 262144
hard: 262144
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis-data:/data
command: redis-server --appendonly yes
collector:
build:
context: .
dockerfile: Dockerfile.server
ports:
- "3001:3001"
environment:
CLICKHOUSE_URL: http://clickhouse:8123
REDIS_URL: redis://redis:6379
PORT: 3001
depends_on:
- clickhouse
- redis
dashboard:
build:
context: .
dockerfile: Dockerfile.dashboard
ports:
- "3000:3000"
environment:
API_URL: http://collector:3001
NEXT_PUBLIC_API_URL: http://localhost:3001
depends_on:
- collector
alerter:
build:
context: .
dockerfile: Dockerfile.server
command: ["node", "dist/alert-runner.js"]
environment:
CLICKHOUSE_URL: http://clickhouse:8123
REDIS_URL: redis://redis:6379
SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
depends_on:
- clickhouse
- redis
volumes:
clickhouse-data:
redis-data:

Start the Platform
# Create .env with your secrets
cat > .env << 'EOL'
CLICKHOUSE_PASSWORD=your-secure-password
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
EOL
# Start all services
docker compose up -d
# Verify
docker compose ps
# clickhouse running 0.0.0.0:8123->8123/tcp
# redis running 0.0.0.0:6379->6379/tcp
# collector running 0.0.0.0:3001->3001/tcp
# dashboard running 0.0.0.0:3000->3000/tcp
# alerter    running

Nginx Reverse Proxy (Production)
# /etc/nginx/conf.d/perf-platform.conf
upstream dashboard {
server 127.0.0.1:3000;
}
upstream collector {
server 127.0.0.1:3001;
}
server {
listen 443 ssl;
server_name perf.internal.example.com;
ssl_certificate /etc/ssl/certs/perf.crt;
ssl_certificate_key /etc/ssl/private/perf.key;
# Dashboard
location / {
proxy_pass http://dashboard;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
# Collector API — accessed by browser SDK
location /api/perf/collect {
proxy_pass http://collector/collect;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Rate limiting
limit_req zone=collector burst=100 nodelay;
}
# Query API — accessed by dashboard
location /api/perf/ {
proxy_pass http://collector/;
proxy_set_header Host $host;
# Internal only
allow 10.0.0.0/8;
allow 172.16.0.0/12;
deny all;
}
}
limit_req_zone $binary_remote_addr zone=collector:10m rate=50r/s;

7. Data Retention and Maintenance
-- Preferred: let ClickHouse expire rows older than 90 days automatically
ALTER TABLE perf_metrics MODIFY TTL timestamp + INTERVAL 90 DAY;
-- Force a merge (which also applies TTL cleanup) during a quiet window
OPTIMIZE TABLE perf_metrics FINAL;
-- Alternatively, drop whole daily partitions from a scheduled job.
-- Note: DROP PARTITION takes a partition ID, not a WHERE clause.
ALTER TABLE perf_metrics DROP PARTITION '20250101';

# Crontab entry — drop the partition from 90 days ago, daily at 3 AM
# (% must be escaped as \% inside crontab entries)
0 3 * * * clickhouse-client --query "ALTER TABLE perf.perf_metrics DROP PARTITION '$(date -d '90 days ago' +\%Y\%m\%d)'"

Architecture B: ELK Stack
Use Elasticsearch for storage, Kibana for zero-code dashboards, and ElastAlert2 for threshold alerting. If your team already has an ELK cluster, you only need to add the APM Server and configure new indices.
ELK Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ Browser │
│ ┌───────────────────┐ │
│ │ Elastic APM RUM │── beacon ──┐ │
│ │ Agent (or custom │ │ │
│ │ web-vitals SDK) │ │ │
│ └───────────────────┘ │ │
└───────────────────────────────────┼─────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ APM Server │ │ Logstash (optional) │ │
│ │ (ingest RUM + │ │ (ingest custom events) │ │
│ │ server traces) │ │ │ │
│ └────────┬────────┘ └────────────┬─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Elasticsearch │ │
│ │ - apm-* indices (RUM + APM data) │ │
│ │ - perf-metrics-* indices (custom events) │ │
│ │ - ILM for automatic retention │ │
│ └────────────────────┬────────────────────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ Kibana │ │ ElastAlert2 │ │
│ │ - Lens charts │ │ - threshold rules │ │
│ │ - time picker │ │ - Slack / WeCom / │ │
│ │ - APM UI │ │ DingTalk / email │ │
│ └─────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────┘

1. Elastic APM RUM Agent
The official Elastic RUM agent automatically captures page loads, Core Web Vitals, user interactions, and JS errors.
// lib/elastic-rum.ts
import { init as initApm } from '@elastic/apm-rum';
const apm = initApm({
serviceName: 'my-frontend',
serverUrl: process.env.NEXT_PUBLIC_APM_SERVER_URL!, // e.g. https://apm.internal.example.com
serviceVersion: process.env.NEXT_PUBLIC_APP_VERSION,
environment: process.env.NODE_ENV,
// Core Web Vitals are captured automatically
// Additional configuration:
distributedTracingOrigins: ['https://api.example.com'], // Correlate frontend → backend traces
transactionSampleRate: 1.0, // 100% of page loads (adjust for high-traffic sites)
});
export default apm;

// app/layout.tsx
'use client';
import { useEffect } from 'react';
export function ElasticRUMProvider({ children }: { children: React.ReactNode }) {
useEffect(() => {
import('../lib/elastic-rum');
}, []);
return <>{children}</>;
}

What the RUM agent captures automatically:
- Page load transactions — navigation timing, resource loading
- Core Web Vitals — LCP, CLS, INP, FCP, TTFB (as transaction marks)
- User interactions — click, route change (as transactions)
- JS errors — uncaught exceptions and promise rejections
- HTTP requests — XHR and Fetch with timing and correlation IDs
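The `transactionSampleRate` above draws an independent random sample per page load. For high-traffic sites you may prefer deterministic per-session sampling, so a session is either fully captured or fully dropped. A sketch — the hashing scheme is our own convention, not an agent feature:

```typescript
// Deterministic session sampling: hash the session id into [0, 1) with
// FNV-1a and keep the session if the hash falls under the target rate.
// Every event from a given session gets the same verdict.
function hashToUnit(id: string): number {
  let h = 2166136261; // FNV-1a 32-bit offset basis
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 16777619); // FNV prime
  }
  return (h >>> 0) / 2 ** 32;
}

function shouldSampleSession(sessionId: string, rate: number): boolean {
  return hashToUnit(sessionId) < rate;
}
```

At init time, pass the verdict through as `transactionSampleRate: shouldSampleSession(id, 0.1) ? 1.0 : 0.0`, or use it to gate your own `sendBeacon` calls.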
2. Alternative: Custom SDK → Logstash
If you prefer the lightweight web-vitals SDK from Architecture A, route events through Logstash into Elasticsearch.
// Use the same perf-sdk.ts from section 1, but point sendBeacon to Logstash HTTP input
navigator.sendBeacon('https://perf.internal.example.com/api/perf/collect', blob);

# logstash/pipeline/perf.conf
input {
http {
port => 5044
codec => json
}
}
filter {
# Parse and enrich
date {
match => ["timestamp", "UNIX_MS"]
target => "@timestamp"
}
mutate {
rename => { "metric" => "[perf][metric]" }
rename => { "value" => "[perf][value]" }
rename => { "rating" => "[perf][rating]" }
rename => { "page" => "[url][path]" }
rename => { "device" => "[user_agent][device]" }
}
# Add good/poor buckets for visualization — Logstash field references
# use [bracket][syntax] for nested fields
if [perf][metric] == "LCP" {
if [perf][value] <= 2500 { mutate { add_field => { "[perf][bucket]" => "good" } } }
else if [perf][value] <= 4000 { mutate { add_field => { "[perf][bucket]" => "needs-improvement" } } }
else { mutate { add_field => { "[perf][bucket]" => "poor" } } }
}
}
output {
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "perf-metrics-%{+YYYY.MM.dd}"
user => "elastic"
password => "${ES_PASSWORD}"
}
}

3. Elasticsearch Index Template
Define mappings and ILM for automatic retention.
// PUT _index_template/perf-metrics
{
"index_patterns": ["perf-metrics-*"],
"template": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"index.lifecycle.name": "perf-metrics-ilm",
"index.lifecycle.rollover_alias": "perf-metrics"
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"perf.metric": { "type": "keyword" },
"perf.value": { "type": "float" },
"perf.rating": { "type": "keyword" },
"perf.bucket": { "type": "keyword" },
"url.path": { "type": "keyword" },
"user_agent.device": { "type": "keyword" },
"connection": { "type": "keyword" },
"app_version": { "type": "keyword" },
"session_id": { "type": "keyword" }
}
}
}
}

// PUT _ilm/policy/perf-metrics-ilm
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "7d",
"max_primary_shard_size": "10gb"
}
}
},
"warm": {
"min_age": "30d",
"actions": {
"forcemerge": { "max_num_segments": 1 },
"shrink": { "number_of_shards": 1 }
}
},
"delete": {
"min_age": "90d",
"actions": { "delete": {} }
}
}
}
}

4. Kibana Dashboards (Zero Code)
Kibana provides time-range trend charts, device breakdowns, and percentile visualizations without writing any code.
Creating a Web Vitals Dashboard in Kibana:
1. Go to Kibana → Analytics → Dashboard → Create
2. LCP Trend (Line chart)
   - Visualization type: Lens → Line
   - Index: perf-metrics-*
   - X-axis: @timestamp (date histogram, auto interval)
   - Y-axis: Percentile of perf.value → p50, p75, p95
   - Filter: perf.metric: LCP
   - Add a reference line at y=2500 (threshold)
3. Good / Needs Improvement / Poor (Donut chart)
   - Visualization type: Lens → Donut
   - Slice by: perf.bucket
   - Filter: perf.metric: LCP
4. Metrics by Page (Table)
   - Visualization type: Lens → Table
   - Rows: url.path
   - Metrics: Percentile 75 of perf.value, Count
   - Filter by metric using a dashboard-level filter control
5. Device Breakdown (Bar chart)
   - X-axis: user_agent.device
   - Y-axis: Percentile 75 of perf.value
   - Split series by: perf.metric
The Kibana time picker (top right) works automatically — users can switch between last 1 hour, 24 hours, 7 days, 30 days, or any custom range. All panels update together.
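A custom UI can reproduce the same behavior by translating the picker selection into an Elasticsearch date-math range filter on `@timestamp` — a sketch:

```typescript
// Translate a time-picker selection into an Elasticsearch range filter.
// ES date math ("now-24h", "now-7d", …) does the arithmetic server-side,
// so the client never computes absolute timestamps.
type PickerRange = '1h' | '24h' | '7d' | '30d';

function esRangeFilter(range: PickerRange): Record<string, unknown> {
  return {
    range: {
      '@timestamp': { gte: `now-${range}`, lte: 'now' },
    },
  };
}

console.log(JSON.stringify(esRangeFilter('24h')));
// {"range":{"@timestamp":{"gte":"now-24h","lte":"now"}}}
```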
If using Elastic APM RUM Agent, Kibana's built-in APM UI (Kibana → Observability → APM) provides:
- Service overview with throughput and latency
- Transaction details with waterfall view
- Core Web Vitals dashboard (pre-built)
- Error tracking with stack traces
- Service map showing frontend → backend dependencies
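Enabling that pre-built experience requires only the RUM agent in the frontend bundle. A minimal init sketch (serviceName "frontend" is a placeholder; serverUrl must point at your apm-server, port 8200 in the Docker Compose setup below):

```javascript
// Install first: npm install @elastic/apm-rum
import { init as initApm } from '@elastic/apm-rum'

const apm = initApm({
  serviceName: 'frontend',             // placeholder: shows up in the APM service list
  serverUrl: 'http://localhost:8200',  // the apm-server endpoint
  serviceVersion: '1.0.0',
  environment: 'production',
})
```

The agent collects Core Web Vitals, page-load transactions, and errors automatically once initialized.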
5. ElastAlert2 for Threshold Alerting
ElastAlert2 is an open-source alerting framework that queries Elasticsearch on a schedule and fires alerts based on rules.
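Conceptually, each metric_aggregation rule performs the following check on every run_every tick. This is a simplified sketch of the semantics (ElastAlert2 actually delegates the percentile math to an Elasticsearch percentiles aggregation); the function and parameter names are illustrative, not ElastAlert2 internals:

```python
import statistics
from datetime import datetime, timedelta
from typing import Optional

def evaluate_rule(
    values: list[float],            # perf.value docs inside the timeframe window
    max_threshold: float,           # e.g. 2500 for the LCP warning rule
    min_doc_count: int,             # skip low-traffic windows
    last_alert: Optional[datetime], # when this rule last fired
    realert: timedelta,             # cooldown between repeated alerts
    now: datetime,
) -> bool:
    """Return True if an alert should fire on this evaluation tick."""
    if len(values) < min_doc_count:
        return False  # not statistically meaningful
    if last_alert is not None and now - last_alert < realert:
        return False  # still in cooldown
    p75 = statistics.quantiles(values, n=100)[74]  # 75th percentile
    return p75 > max_threshold
```

The three guards map directly onto the min_doc_count, realert, and max_threshold keys in the rule files below.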
# elastalert/config.yaml
rules_folder: /opt/elastalert/rules
run_every:
  minutes: 1
buffer_time:
  minutes: 60
es_host: elasticsearch
es_port: 9200
es_username: elastic
es_password: ${ES_PASSWORD}
writeback_index: elastalert_status

# elastalert/rules/lcp-warning.yaml
name: "LCP p75 exceeds 2.5s"
type: metric_aggregation
index: perf-metrics-*
# Query filter
filter:
  - term:
      perf.metric: LCP
metric_agg_key: perf.value
metric_agg_type: percentiles
metric_agg_percents: [75]
max_threshold: 2500
min_doc_count: 30
# Time window
timeframe:
  minutes: 60
# Cooldown: don't re-alert for 2 hours
realert:
  hours: 2
# Slack notification
alert:
  - slack
slack_webhook_url: "${SLACK_WEBHOOK_URL}"
slack_channel_override: "#perf-alerts"
slack_username_override: "PerfBot"
slack_msg_color: "warning"
alert_subject: "LCP p75 exceeded 2.5s threshold"
alert_text: |
  *LCP p75 exceeded threshold*
  Current p75: {0[75.0]} ms
  Threshold: 2500 ms
  Window: last 60 minutes
  Time: {match[@timestamp]}
alert_text_type: alert_text_only

# elastalert/rules/lcp-critical.yaml
name: "LCP p75 exceeds 4s (critical)"
type: metric_aggregation
index: perf-metrics-*
filter:
  - term:
      perf.metric: LCP
metric_agg_key: perf.value
metric_agg_type: percentiles
metric_agg_percents: [75]
max_threshold: 4000
min_doc_count: 30
timeframe:
  minutes: 15
realert:
  minutes: 30
alert:
  - slack
slack_webhook_url: "${SLACK_WEBHOOK_URL}"
slack_channel_override: "#perf-alerts"
slack_msg_color: "danger"
alert_subject: "CRITICAL: LCP p75 exceeded 4s"

# elastalert/rules/inp-warning.yaml
name: "INP p75 exceeds 200ms"
type: metric_aggregation
index: perf-metrics-*
filter:
  - term:
      perf.metric: INP
metric_agg_key: perf.value
metric_agg_type: percentiles
metric_agg_percents: [75]
max_threshold: 200
min_doc_count: 30
timeframe:
  minutes: 60
realert:
  hours: 2
alert:
  - slack
slack_webhook_url: "${SLACK_WEBHOOK_URL}"
slack_channel_override: "#perf-alerts"
slack_msg_color: "warning"
alert_subject: "INP p75 exceeded 200ms threshold"

For WeCom / DingTalk, use the post alert type:
# elastalert/rules/lcp-wecom.yaml
name: "LCP alert (WeCom)"
# ... same metric_aggregation config as above ...
alert:
  - post
http_post_url: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"
http_post_headers:
  Content-Type: "application/json"
http_post_payload:
  msgtype: "markdown"
  markdown:
    content: |
      **[WARNING] LCP p75 exceeded threshold**
      > Current: {0[75.0]} ms | Threshold: 2500 ms
      > Window: 60 min | Time: {match[@timestamp]}

6. ELK Docker Compose
# docker-compose-elk.yml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.17.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=${ES_PASSWORD}
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data
    ulimits:
      memlock:
        soft: -1
        hard: -1

  kibana:
    image: docker.elastic.co/kibana/kibana:8.17.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.17.0
    environment:
      - output.elasticsearch.hosts=["http://elasticsearch:9200"]
      - output.elasticsearch.username=elastic
      - output.elasticsearch.password=${ES_PASSWORD}
      - apm-server.rum.enabled=true
      - apm-server.rum.allow_origins=["*"]
      - apm-server.rum.allow_headers=["Content-Type"]
    ports:
      - "8200:8200"
    depends_on:
      - elasticsearch

  # Optional: only needed if using a custom SDK instead of Elastic APM RUM
  logstash:
    image: docker.elastic.co/logstash/logstash:8.17.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    environment:
      - ES_PASSWORD=${ES_PASSWORD}
    ports:
      - "5044:5044"
    depends_on:
      - elasticsearch

  elastalert:
    image: jertel/elastalert2:latest
    volumes:
      - ./elastalert/config.yaml:/opt/elastalert/config.yaml
      - ./elastalert/rules:/opt/elastalert/rules
    environment:
      - ES_PASSWORD=${ES_PASSWORD}
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    depends_on:
      - elasticsearch

volumes:
  es-data:

# Start the ELK stack
cat > .env << 'EOL'
ES_PASSWORD=your-secure-password
KIBANA_PASSWORD=your-kibana-password
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
EOL
docker compose -f docker-compose-elk.yml up -d
# Verify
# Elasticsearch: http://localhost:9200
# Kibana: http://localhost:5601
# APM Server:   http://localhost:8200

7. ELK Data Retention
ILM (Index Lifecycle Management) handles retention automatically — configured in the index template above. No cron jobs needed.
Index Lifecycle:
hot (0–7 days) → rollover at 7 days or 10 GB per primary shard
warm (30 days after rollover) → force merge, shrink to one shard
delete (90 days after rollover) → index deleted automatically

To adjust retention, update the ILM policy in Kibana: Kibana → Stack Management → Index Lifecycle Policies → perf-metrics-ilm
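The same change can also be made over the REST API by replacing the policy document. A hedged example shortening retention to 30 days (assumes Elasticsearch on localhost:9200 with the elastic superuser; adjust host and auth to your deployment):

```shell
# Replace perf-metrics-ilm with a 30-day delete phase.
curl -u "elastic:${ES_PASSWORD}" -X PUT "http://localhost:9200/_ilm/policy/perf-metrics-ilm" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": { "max_age": "7d", "max_primary_shard_size": "10gb" }
          }
        },
        "delete": {
          "min_age": "30d",
          "actions": { "delete": {} }
        }
      }
    }
  }'
```

Existing indices pick up the new policy on their next ILM check; no reindexing is required.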
Best Practices
Self-Hosted Platform Guidelines
- Start simple — both ClickHouse and a single-node ELK can handle millions of events per day
- Batch writes — buffer events and flush periodically; never write one row at a time
- Pre-aggregate — ClickHouse materialized views or ES transforms speed up dashboard queries
- Set data retention — ClickHouse partition drops or ES ILM; 90 days is a good default
- Rate-limit the collector — protect against SDK bugs or traffic spikes flooding your storage
- Cooldown on alerts — avoid alert fatigue by enforcing minimum gaps between repeated notifications
- Require minimum samples — skip alerting when data volume is too low to be statistically meaningful
- Secure the platform — put the dashboard and query API behind your internal network or VPN
- Choose based on your team — reuse existing ELK if available; otherwise ClickHouse is leaner
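The batch-write guideline above can be sketched as a small collector-side buffer. This is a minimal illustration, not production code; the flush_fn callback is a hypothetical stand-in for your ClickHouse or Elasticsearch bulk insert:

```python
import threading
import time

class EventBuffer:
    """Buffer events in memory and flush in batches, never one row at a time."""

    def __init__(self, flush_fn, max_batch=500, flush_interval=5.0):
        self.flush_fn = flush_fn          # e.g. a ClickHouse / ES _bulk write
        self.max_batch = max_batch        # flush when this many events queue up
        self.flush_interval = flush_interval  # ... or when this many seconds pass
        self._events = []
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def add(self, event: dict) -> None:
        with self._lock:
            self._events.append(event)
            due = (len(self._events) >= self.max_batch
                   or time.monotonic() - self._last_flush >= self.flush_interval)
        if due:
            self.flush()

    def flush(self) -> None:
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = time.monotonic()
        if batch:
            self.flush_fn(batch)
```

A real collector would also flush on shutdown and cap memory under backpressure, but the size-or-interval trigger is the core of the pattern.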