Prometheus + Grafana is the open-source monitoring standard. Prometheus collects and queries time-series data; Grafana visualizes it on attractive dashboards. This guide walks through building a monitoring stack from zero and watching 20 servers on a single screen.
Architecture
Prometheus uses a pull model: at a configurable interval (15 seconds by default) it scrapes each target service's /metrics endpoint over HTTP and writes the samples to its local on-disk time-series database. Lightweight agents like Node Exporter expose CPU/RAM/disk metrics on each host. Grafana uses Prometheus as a data source and renders dashboard panels from its queries.
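A scrape is just an HTTP GET: the target's /metrics endpoint returns plain text in the Prometheus exposition format. A short illustrative excerpt of what Node Exporter serves (the values here are made up):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 178112.53
node_cpu_seconds_total{cpu="0",mode="user"} 2946.18
```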
Installation with Docker Compose
# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports: ['9090:9090']
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: supersecret
      GF_SERVER_ROOT_URL: https://monitor.example.com
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports: ['9100:9100']
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets:
          - node-exporter:9100
          - web-1.example.com:9100
          - web-2.example.com:9100
          - db-1.example.com:9100
  - job_name: nginx
    static_configs:
      - targets: ['web-1.example.com:9113'] # nginx-prometheus-exporter
  - job_name: postgres
    static_configs:
      - targets: ['db-1.example.com:9187'] # postgres_exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - alerts.yml
Exposing Application Metrics
// Node.js + prom-client (assumes an Express app)
const express = require('express');
const client = require('prom-client');

const app = express();
const register = client.register;

// Default Node.js metrics (event loop lag, GC, memory, ...)
client.collectDefaultMetrics({ register });

// Custom metrics
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
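To make the histogram above less abstract, here is a dependency-free sketch (not prom-client's actual code) of what a histogram stores: cumulative bucket counters plus a running _sum and _count.

```javascript
// Sketch of a Prometheus-style histogram's internal state.
function makeHistogram(bounds) {
  const state = { buckets: bounds.map(le => ({ le, count: 0 })), sum: 0, count: 0 };
  return {
    observe(v) {
      state.sum += v;    // becomes http_request_duration_seconds_sum
      state.count += 1;  // becomes http_request_duration_seconds_count
      // Buckets are cumulative: an observation increments every bucket it fits in.
      for (const b of state.buckets) if (v <= b.le) b.count += 1;
    },
    state,
  };
}

const h = makeHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.02, 0.03, 0.2, 0.7].forEach(v => h.observe(v));
console.log(h.state.buckets.map(b => `le=${b.le}: ${b.count}`).join(', '));
// le=0.01: 0, le=0.05: 2, le=0.1: 2, le=0.5: 3, le=1: 4, le=5: 4
```

The cumulative layout is why PromQL sums `_bucket` series by `le` and why `histogram_quantile` can work across many instances at once.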
PromQL — Basic Queries
# CPU usage percentage (last 5 min)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk fullness
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100)
# HTTP P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
# Requests per second
sum(rate(http_request_duration_seconds_count[1m])) by (route)
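The P95 query relies on histogram_quantile, which finds the bucket containing the requested rank and linearly interpolates inside it. A dependency-free sketch of the idea (simplified from Prometheus's actual implementation):

```javascript
// buckets: sorted array of { le, count }, counts cumulative, last le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the "q-th" observation we are looking for
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // no upper bound to interpolate toward
      // Linear interpolation within the bucket that contains the rank.
      return prevLe + (le - prevLe) * (rank - prevCount) / (count - prevCount);
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN;
}

// 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1,   count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is also why bucket boundaries matter: the estimate can only be as precise as the bucket the quantile lands in.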
Grafana Dashboards
Instead of hand-rolling a dashboard, import one that already exists from grafana.com/grafana/dashboards: Node Exporter Full (1860), NGINX (12708), PostgreSQL (9628). In Grafana, go to Dashboards → Import, paste the ID, and pick your Prometheus data source.
Alertmanager + Alert Rules
# alerts.yml
groups:
  - name: host
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }} CPU > 85% (10m)'
      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} disk under 10%'
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} not responding'
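Rules can be unit-tested with promtool before deploying (`promtool test rules <file>`). A minimal sketch, assuming the rules file is named alerts.yml as above and the test file is called rules-test.yml:

```yaml
# rules-test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="web-1.example.com:9100"}'
        values: '0 0 0 0 0 0'   # down for six minutes
    alert_rule_test:
      - eval_time: 5m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: web-1.example.com:9100
            exp_annotations:
              summary: 'web-1.example.com:9100 not responding'
```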
# alertmanager.yml
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: { severity: critical }
      receiver: pagerduty

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      - service_key: YOUR_KEY
Retention and Storage
- Default retention is 15 days; extend it with --storage.tsdb.retention.time=30d
- For long-term storage, pair Prometheus with Thanos or VictoriaMetrics
- Disk estimate: roughly 1-2 bytes per sample × ingested samples/second × retention in seconds
- Example: 100 targets × 1,000 series each at a 15s scrape interval ≈ 6,700 samples/s, or about 1.2 GB/day at 2 bytes per sample (≈ 35 GB for 30 days)
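The estimate is a few lines of arithmetic; all numbers below are the assumed ones from the example (your series count and compression ratio will differ):

```javascript
// Back-of-envelope Prometheus disk sizing.
const targets = 100;            // hosts being scraped
const seriesPerTarget = 1000;   // time series each target exposes (assumed)
const scrapeIntervalSec = 15;   // from scrape_interval
const bytesPerSample = 2;       // compressed TSDB samples average ~1-2 bytes

const samplesPerSec = (targets * seriesPerTarget) / scrapeIntervalSec; // ~6,667
const bytesPerDay = samplesPerSec * bytesPerSample * 86400;
console.log((bytesPerDay / 1e9).toFixed(2), 'GB/day'); // 1.15 GB/day
```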
Security
- Prometheus and Grafana should not be public — put them behind a reverse proxy with basic auth
- Disable anonymous access in Grafana, use a strong admin password
- Behind Cloudflare Access or a VPN
- TLS is mandatory
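One way to implement the reverse-proxy advice is an nginx server block; this is a sketch with placeholder hostnames, certificate paths, and credentials:

```nginx
server {
    listen 443 ssl;
    server_name monitor.example.com;
    ssl_certificate     /etc/ssl/monitor.pem;   # placeholder paths
    ssl_certificate_key /etc/ssl/monitor.key;

    # Grafana at the root (has its own login)
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
    }

    # Prometheus behind basic auth (create with: htpasswd -c /etc/nginx/.htpasswd admin)
    location /prometheus/ {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;   # keep the /prometheus/ prefix
    }
}
```

Note that serving Prometheus under a sub-path like /prometheus/ also requires starting it with --web.external-url=https://monitor.example.com/prometheus/ so its UI links resolve correctly.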
Conclusion
Prometheus + Grafana installs in 30 minutes and scales indefinitely — the same stack runs on 1 server or 1,000. The exporter ecosystem is rich: MySQL, Redis, Nginx, MongoDB, RabbitMQ — practically every popular service has a ready-made exporter.
Reach out to KEYDAL for Prometheus, Grafana and Alertmanager setup plus custom dashboard design. Contact us