Prometheus + Grafana is the de facto open-source monitoring standard. Prometheus collects, stores, and queries time-series data; Grafana visualizes it on dashboards. This guide walks through building a monitoring stack from scratch and watching 20 servers on a single screen.

Architecture

Prometheus uses a pull model: at a configured interval (15 seconds in this guide; the default is 1 minute) it scrapes each target's /metrics endpoint over HTTP and writes the samples to its on-disk time-series database. Small agents such as Node Exporter expose CPU, RAM, and disk metrics on each host. Grafana queries Prometheus as a data source and renders panels from the results.
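
To make the pull model concrete, the sketch below renders a few samples in the Prometheus text exposition format, which is what a scrape of /metrics returns. The helper names (renderExposition, escapeLabelValue) and the sample values are invented for illustration; in practice an exporter or client library produces this output for you.

```javascript
// Hypothetical helpers (not part of any library): render one metric family
// in the Prometheus text exposition format.
function escapeLabelValue(value) {
  // Label values escape backslash, double quote, and newline.
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

function renderExposition({ name, help, type, samples }) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} ${type}`];
  for (const { labels = {}, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Example values are invented.
const body = renderExposition({
  name: 'node_cpu_seconds_total',
  help: 'Seconds the CPUs spent in each mode.',
  type: 'counter',
  samples: [
    { labels: { cpu: '0', mode: 'idle' }, value: 12345.67 },
    { labels: { cpu: '0', mode: 'user' }, value: 890.12 },
  ],
});
console.log(body);
```

Each scrape stores one sample per line of this output, keyed by metric name and label set.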

Installation with Docker Compose

# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml  # loaded via rule_files in prometheus.yml
      - prometheus-data:/prometheus
    ports: ['9090:9090']
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: supersecret  # placeholder: replace before deploying
      GF_SERVER_ROOT_URL: https://monitor.example.com
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports: ['9100:9100']
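    restart: unless-stopped

  # Target of alerting.alertmanagers in prometheus.yml
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ['9093:9093']
    restart: unless-stopped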

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets:
          - node-exporter:9100
          - web-1.example.com:9100
          - web-2.example.com:9100
          - db-1.example.com:9100

  - job_name: nginx
    static_configs:
      - targets: ['web-1.example.com:9113']  # nginx-prometheus-exporter

  - job_name: postgres
    static_configs:
      - targets: ['db-1.example.com:9187']   # postgres_exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - alerts.yml

Exposing Application Metrics

// Node.js + Express + prom-client
const express = require('express');
const client = require('prom-client');

const app = express();
const register = client.register;

// Default Node.js metrics
client.collectDefaultMetrics({ register });

// Custom metrics
const httpDuration = new client.Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration',
    labelNames: ['method', 'route', 'status'],
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

app.use((req, res, next) => {
    const end = httpDuration.startTimer();
    res.on('finish', () => {
        end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
    });
    next();
});

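// Expose everything in the registry for Prometheus to scrape
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', register.contentType);
    res.send(await register.metrics());
});

app.listen(8080); // port 8080 is an assumption; use your app's port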
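
It can help to see what a histogram like the one above stores. Each bucket boundary gets a cumulative counter (exposed with an le label), alongside a running sum and count. The plain-JavaScript sketch below mimics that bookkeeping; makeHistogram is a hypothetical stand-in for illustration, not the prom-client API.

```javascript
// Illustration only: mimics cumulative histogram buckets as they appear in
// the exposition format. makeHistogram is hypothetical, not prom-client.
function makeHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  const counts = new Array(bounds.length + 1).fill(0); // last slot is +Inf
  let sum = 0;
  let count = 0;

  return {
    observe(value) {
      // Cumulative: a value increments every bucket whose bound is >= value.
      bounds.forEach((bound, i) => {
        if (value <= bound) counts[i] += 1;
      });
      counts[bounds.length] += 1; // +Inf counts every observation
      sum += value;
      count += 1;
    },
    snapshot() {
      const buckets = bounds.map((bound, i) => ({ le: String(bound), count: counts[i] }));
      buckets.push({ le: '+Inf', count: counts[bounds.length] });
      return { buckets, sum, count };
    },
  };
}

// Same bucket boundaries as the http_request_duration_seconds example.
const histogram = makeHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.03, 0.2, 7].forEach((seconds) => histogram.observe(seconds));
console.log(histogram.snapshot().buckets);
```

The 7-second request lands only in the +Inf bucket. If a meaningful share of requests is slower than the largest finite boundary, quantile estimates cannot exceed that boundary, so it may be worth adding larger buckets.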

PromQL — Basic Queries

# CPU usage percentage (last 5 min)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# RAM usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100)

# HTTP P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (fraction of requests answered with a 5xx status)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Requests per second
sum(rate(http_request_duration_seconds_count[1m])) by (route)
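
The P95 query above does not read exact latencies; it estimates them from bucket counts. The sketch below shows the linear interpolation that histogram_quantile() applies to classic histograms, under simplifying assumptions: 0 < q ≤ 1, non-negative bucket bounds, and at least one finite bucket before +Inf. The function name and the sample counts are invented for illustration.

```javascript
// Sketch of the linear interpolation behind histogram_quantile().
// buckets: cumulative counts sorted by upper bound (le), ending with +Inf.
// Assumes 0 < q <= 1 and non-negative bounds; other edge cases are ignored.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;

  // First bucket whose cumulative count reaches the rank.
  const idx = buckets.findIndex((b) => b.count >= rank);

  if (idx === buckets.length - 1) {
    // Rank falls into the +Inf bucket: report the highest finite bound.
    return buckets[buckets.length - 2].le;
  }
  const lower = idx === 0 ? 0 : buckets[idx - 1].le;
  const below = idx === 0 ? 0 : buckets[idx - 1].count;
  const upper = buckets[idx].le;
  const inBucket = buckets[idx].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Invented counts: 100 requests, 90 of them at or under 0.5 s.
const sampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, sampleBuckets));
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.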

Grafana Dashboards

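Instead of building a dashboard by hand, import one that already exists. First add Prometheus as a data source in Grafana (with the compose file above, the URL is http://prometheus:9090). Then pick a dashboard from grafana.com/grafana/dashboards: Node Exporter Full (1860), NGINX (12708), PostgreSQL (9628). In Grafana, go to Dashboards → Import (the exact menu path varies slightly between versions), paste the ID, and select the Prometheus data source.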

Alertmanager + Alert Rules

# alerts.yml
groups:
  - name: host
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }} CPU > 85% (10m)'

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} root filesystem has less than 10% free'

      - alert: HostDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} not responding'
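
The for: field in the rules above delays firing until the condition has held continuously for that long; before then the alert is only pending. A minimal sketch of that behaviour, assuming one rule evaluation per evaluation_interval; the function name and inputs are invented for illustration.

```javascript
// Sketch of how a rule's "for" clause gates the pending -> firing transition.
// results: one boolean per evaluation (true = the expression matched).
function alertStates(results, evalIntervalSec, forSec) {
  let activeAt = null; // time at which the condition first became true
  return results.map((isTrue, i) => {
    const now = i * evalIntervalSec;
    if (!isTrue) {
      activeAt = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeAt === null) activeAt = now;
    return now - activeAt >= forSec ? 'firing' : 'pending';
  });
}

// HostDown-style rule: for: 2m (120 s), evaluated every 15 s.
const states = alertStates(new Array(9).fill(true), 15, 120);
console.log(states);
```

A single recovered evaluation resets the timer, which is why a longer for: value suppresses alerts on short spikes at the cost of slower notification.
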

# alertmanager.yml
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # matchers replaces the deprecated match field in current Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: pagerduty

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so that YAML turns \n into a real newline
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: pagerduty
    pagerduty_configs:
      - service_key: YOUR_KEY
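
The routing tree above sends critical alerts to PagerDuty and everything else to Slack. The sketch below models that decision in a deliberately simplified way: equality matchers only, a single level of child routes, and no continue flag. pickReceiver and the object shape are invented for illustration and are not Alertmanager's implementation.

```javascript
// Simplified model of Alertmanager routing: equality matchers only,
// one level of child routes, no "continue". Illustration, not the real code.
function pickReceiver(route, labels) {
  for (const child of route.routes ?? []) {
    const matches = Object.entries(child.match ?? {})
      .every(([name, value]) => labels[name] === value);
    // The first matching child route wins.
    if (matches) return child.receiver ?? route.receiver;
  }
  return route.receiver; // no child matched: use the top-level receiver
}

const routingTree = {
  receiver: 'slack',
  routes: [{ match: { severity: 'critical' }, receiver: 'pagerduty' }],
};

console.log(pickReceiver(routingTree, { alertname: 'HostDown', severity: 'critical' }));
console.log(pickReceiver(routingTree, { alertname: 'HighCPU', severity: 'warning' }));
```

With this tree, a critical alert is delivered to PagerDuty only, and anything that matches no child route falls back to the top-level Slack receiver.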

Retention and Storage

  • Default retention is 15 days. Extend with --storage.tsdb.retention.time=30d
  • For long-term storage use Thanos or VictoriaMetrics
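  • Disk estimate: retention in seconds × ingested samples per second × 1-2 bytes per sample
  • Example: 100 targets × 1,000 series each, scraped every 15 s ≈ 6,667 samples/s; at 2 bytes per sample that is about 1.15 GB/day, or roughly 35 GB for 30 days of retention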
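
The estimate depends on how often each series is sampled, so it helps to include the scrape interval explicitly. A minimal sketch of the calculation; the function and parameter names are invented, and 2 bytes per sample is an assumption at the upper end of the usual 1-2 byte range.

```javascript
// Rough disk estimate. bytesPerSample is an assumption (commonly 1-2 bytes
// after compression); real usage depends on churn and series shape.
function estimateDiskBytes({ targets, seriesPerTarget, scrapeIntervalSec, bytesPerSample, retentionDays }) {
  const retentionSec = retentionDays * 86400;
  // samples/second = (targets * seriesPerTarget) / scrapeIntervalSec
  return (targets * seriesPerTarget * bytesPerSample * retentionSec) / scrapeIntervalSec;
}

const perDay = estimateDiskBytes({
  targets: 100,
  seriesPerTarget: 1000,
  scrapeIntervalSec: 15,
  bytesPerSample: 2,
  retentionDays: 1,
});
console.log((perDay / 1e9).toFixed(2) + ' GB/day');
```

Halving the scrape interval doubles the estimate, so the interval is often the cheapest knob to turn when disk is tight.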

Security

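  • Do not expose Prometheus or Grafana to the public internet. Put them behind a reverse proxy with authentication, or behind Cloudflare Access or a VPN
  • Disable anonymous access in Grafana and use a strong admin password
  • Serve everything over TLS
  • --web.enable-lifecycle lets anyone who can reach Prometheus reload or shut it down over HTTP, which is one more reason to keep port 9090 private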

Conclusion

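Prometheus + Grafana can be up and running in about 30 minutes, and the same building blocks work for one server or for hundreds. At larger scale a single Prometheus instance eventually runs out of headroom, which is where the long-term storage options mentioned above come in. The exporter ecosystem is rich: MySQL, Redis, Nginx, MongoDB, RabbitMQ, and practically every other popular service has a ready-made exporter.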
