Prometheus + Grafana is the open-source monitoring standard. Prometheus collects and queries time-series data; Grafana visualizes it on attractive dashboards. This guide walks through building a monitoring stack from zero and watching 20 servers on a single screen.
Architecture
Prometheus uses a pull model: at a configurable interval (15 seconds by default) it scrapes each target service's /metrics endpoint over HTTP and writes the samples to its local on-disk time-series database. Lightweight agents like Node Exporter expose CPU/RAM/disk metrics on each host. Grafana uses Prometheus as a data source and renders dashboard panels from its queries.
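A scrape is just an HTTP GET: the target's /metrics endpoint returns plain text in the Prometheus exposition format. A short illustrative excerpt of what Node Exporter serves (the values here are made up):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 178112.53
node_cpu_seconds_total{cpu="0",mode="user"} 2946.18
```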
Installation with Docker Compose
# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports: ['9090:9090']
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: supersecret
      GF_SERVER_ROOT_URL: https://monitor.example.com
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports: ['9100:9100']
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets:
          - node-exporter:9100
          - web-1.example.com:9100
          - web-2.example.com:9100
          - db-1.example.com:9100
  - job_name: nginx
    static_configs:
      - targets: ['web-1.example.com:9113'] # nginx-prometheus-exporter
  - job_name: postgres
    static_configs:
      - targets: ['db-1.example.com:9187'] # postgres_exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - alerts.yml
Exposing Application Metrics
// Node.js + prom-client (assumes an Express app)
const express = require('express');
const client = require('prom-client');

const app = express();
const register = client.register;

// Default Node.js metrics (event loop lag, GC, memory, ...)
client.collectDefaultMetrics({ register });

// Custom metrics
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
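To make the histogram above less abstract, here is a dependency-free sketch (not prom-client's actual code) of what a histogram stores: cumulative bucket counters plus a running _sum and _count.

```javascript
// Sketch of a Prometheus-style histogram's internal state.
function makeHistogram(bounds) {
  const state = { buckets: bounds.map(le => ({ le, count: 0 })), sum: 0, count: 0 };
  return {
    observe(v) {
      state.sum += v;    // becomes http_request_duration_seconds_sum
      state.count += 1;  // becomes http_request_duration_seconds_count
      // Buckets are cumulative: an observation increments every bucket it fits in.
      for (const b of state.buckets) if (v <= b.le) b.count += 1;
    },
    state,
  };
}

const h = makeHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.02, 0.03, 0.2, 0.7].forEach(v => h.observe(v));
console.log(h.state.buckets.map(b => `le=${b.le}: ${b.count}`).join(', '));
// le=0.01: 0, le=0.05: 2, le=0.1: 2, le=0.5: 3, le=1: 4, le=5: 4
```

The cumulative layout is why PromQL sums `_bucket` series by `le` and why `histogram_quantile` can work across many instances at once.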
PromQL — Basic Queries
# CPU usage percentage (last 5 min)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk fullness
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100)
# HTTP P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
# Requests per second
sum(rate(http_request_duration_seconds_count[1m])) by (route)
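The P95 query relies on histogram_quantile, which finds the bucket containing the requested rank and linearly interpolates inside it. A dependency-free sketch of the idea (simplified from Prometheus's actual implementation):

```javascript
// buckets: sorted array of { le, count }, counts cumulative, last le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the "q-th" observation we are looking for
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // no upper bound to interpolate toward
      // Linear interpolation within the bucket that contains the rank.
      return prevLe + (le - prevLe) * (rank - prevCount) / (count - prevCount);
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN;
}

// 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1,   count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is also why bucket boundaries matter: the estimate can only be as precise as the bucket the quantile lands in.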
Grafana Dashboards
Instead of hand-rolling a dashboard, import one that already exists from grafana.com/grafana/dashboards: Node Exporter Full (1860), NGINX (12708), PostgreSQL (9628). In Grafana, go to Dashboards → Import, paste the ID, and pick your Prometheus data source.
Alertmanager + Alert Rules
# alerts.yml
groups:
  - name: host
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }} CPU > 85% (10m)'
      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} disk under 10%'
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: '{{ $labels.instance }} not responding'
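Rules can be unit-tested with promtool before deploying (`promtool test rules <file>`). A minimal sketch, assuming the rules file is named alerts.yml as above and the test file is called rules-test.yml:

```yaml
# rules-test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="web-1.example.com:9100"}'
        values: '0 0 0 0 0 0'   # down for six minutes
    alert_rule_test:
      - eval_time: 5m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: web-1.example.com:9100
            exp_annotations:
              summary: 'web-1.example.com:9100 not responding'
```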
# alertmanager.yml
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: { severity: critical }
      receiver: pagerduty

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      - service_key: YOUR_KEY
Retention and Storage
- Default retention is 15 days; extend it with --storage.tsdb.retention.time=30d
- For long-term storage, pair Prometheus with Thanos or VictoriaMetrics
- Disk estimate: roughly 1-2 bytes per sample × ingested samples/second × retention in seconds
- Example: 100 targets × 1,000 series each at a 15s scrape interval ≈ 6,700 samples/s, or about 1.2 GB/day at 2 bytes per sample (≈ 35 GB for 30 days)
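The estimate is a few lines of arithmetic; all numbers below are the assumed ones from the example (your series count and compression ratio will differ):

```javascript
// Back-of-envelope Prometheus disk sizing.
const targets = 100;            // hosts being scraped
const seriesPerTarget = 1000;   // time series each target exposes (assumed)
const scrapeIntervalSec = 15;   // from scrape_interval
const bytesPerSample = 2;       // compressed TSDB samples average ~1-2 bytes

const samplesPerSec = (targets * seriesPerTarget) / scrapeIntervalSec; // ~6,667
const bytesPerDay = samplesPerSec * bytesPerSample * 86400;
console.log((bytesPerDay / 1e9).toFixed(2), 'GB/day'); // 1.15 GB/day
```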
Security
- Prometheus and Grafana should not be public — put them behind a reverse proxy with basic auth
- Disable anonymous access in Grafana, use a strong admin password
- Behind Cloudflare Access or a VPN
- TLS is mandatory
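One way to implement the reverse-proxy advice is an nginx server block; this is a sketch with placeholder hostnames, certificate paths, and credentials:

```nginx
server {
    listen 443 ssl;
    server_name monitor.example.com;
    ssl_certificate     /etc/ssl/monitor.pem;   # placeholder paths
    ssl_certificate_key /etc/ssl/monitor.key;

    # Grafana at the root (has its own login)
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
    }

    # Prometheus behind basic auth (create with: htpasswd -c /etc/nginx/.htpasswd admin)
    location /prometheus/ {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;   # keep the /prometheus/ prefix
    }
}
```

Note that serving Prometheus under a sub-path like /prometheus/ also requires starting it with --web.external-url=https://monitor.example.com/prometheus/ so its UI links resolve correctly.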
Conclusion
Prometheus + Grafana installs in 30 minutes and scales indefinitely — the same stack runs on 1 server or 1,000. The exporter ecosystem is rich: MySQL, Redis, Nginx, MongoDB, RabbitMQ — practically every popular service has a ready-made exporter.
Reach out to KEYDAL for Prometheus, Grafana and Alertmanager setup plus custom dashboard design. Contact us