The Proactive Monitoring Playbook

Reactive monitoring tells you something is broken after users complain. Proactive monitoring tells you something is about to break before anyone notices. Building the second kind requires intentional design.

There are two fundamentally different monitoring postures. Reactive monitoring — alerts that fire when something has already broken — is the starting point for most organizations. Proactive monitoring — instrumentation that surfaces degradation before it becomes an incident — is the goal, and it requires different tooling, different alert design, and a different operational culture.

The difference between them isn’t just technical. A reactive monitoring setup treats operations as an incident-response discipline: something breaks, someone fixes it. A proactive monitoring setup treats operations as a steady-state management discipline: identify trends, address capacity constraints, resolve growing problems before they become outages.

Here’s how to build the second kind.

The Metrics That Predict Failures

Most alert configurations focus on current state: disk is over 90%, CPU is over 85%, service is returning 5xx errors. These alerts fire when things are already bad. They don’t give you time to act before impact.

The metrics that predict failures are trend-based:

Disk usage rate of change, not just current usage. A 600GB disk at 75% capacity, growing at 10GB/day, will be full in 15 days. An alert at 90% current usage fires when you have days left, not weeks. A trend-based alert fires when the growth rate projects to hit the critical threshold within 21 days, which gives you time to investigate, expand capacity, or clean up unnecessary data without urgency.
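
The projection itself is simple arithmetic. A minimal sketch, where the 600GB disk size is an assumption chosen so the numbers match the example above (450GB used, 150GB free):

```python
def days_until_full(total_gb, used_gb, growth_gb_per_day):
    """Project how many days until the disk is full at the current growth rate."""
    if growth_gb_per_day <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (total_gb - used_gb) / growth_gb_per_day

def should_alert(total_gb, used_gb, growth_gb_per_day, horizon_days=21):
    """Fire when the projection crosses the horizon, regardless of current percent used."""
    return days_until_full(total_gb, used_gb, growth_gb_per_day) <= horizon_days

# A 600GB disk at 75% (450GB used), growing 10GB/day: full in 15 days.
print(days_until_full(600, 450, 10))  # 15.0
print(should_alert(600, 450, 10))     # True
```

In practice the growth rate would come from a linear fit over recent usage samples rather than a single measurement, but the alert condition is the same.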

Memory leak indicators. Steady upward growth in the RSS (resident set size) of a process over time is the signature of a memory leak. Alert when a process’s memory usage has grown by more than N% over a rolling 24-hour window without a corresponding restart. This catches leaks early, before they cause OOM kills.
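
A minimal sketch of such a detector, assuming RSS samples are collected periodically; the window length and growth threshold below are illustrative, not recommendations:

```python
from collections import deque

class MemoryGrowthDetector:
    """Track a process's RSS samples and flag sustained growth over a rolling window."""

    def __init__(self, window_seconds=86400, growth_threshold_pct=20.0):
        self.window = window_seconds
        self.threshold = growth_threshold_pct
        self.samples = deque()  # (timestamp, rss_bytes)

    def add_sample(self, timestamp, rss_bytes):
        self.samples.append((timestamp, rss_bytes))
        # Drop samples that have aged out of the rolling window.
        while self.samples and timestamp - self.samples[0][0] > self.window:
            self.samples.popleft()

    def record_restart(self):
        """A restart resets the baseline; growth across a restart is expected, not a leak."""
        self.samples.clear()

    def growth_pct(self):
        if len(self.samples) < 2:
            return 0.0
        oldest, newest = self.samples[0][1], self.samples[-1][1]
        return (newest - oldest) / oldest * 100.0

    def should_alert(self):
        return self.growth_pct() > self.threshold
```

A real deployment would feed this from whatever already scrapes process metrics (e.g. a Prometheus exporter) rather than sampling by hand.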

Request latency percentile drift. If P95 response time is trending upward over days — not spiking, but drifting — something is degrading. Could be database query performance as the table grows, could be cache hit rate declining, could be upstream service slowing down. Alert when the 7-day trailing P95 exceeds the 30-day baseline by more than 20%.
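
The drift rule above can be sketched as a comparison of two percentile computations; the nearest-rank percentile method and per-request latency samples are assumptions here:

```python
import math

def p95(samples_ms):
    """Nearest-rank P95 over a non-empty list of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def p95_drift_alert(trailing_7d_ms, baseline_30d_ms, max_drift_pct=20.0):
    """Fire when the 7-day trailing P95 exceeds the 30-day baseline P95 by more than 20%."""
    baseline = p95(baseline_30d_ms)
    drift_pct = (p95(trailing_7d_ms) - baseline) / baseline * 100.0
    return drift_pct > max_drift_pct
```

Most metrics platforms can express this directly as a query over two time ranges; the point is the shape of the condition, not the implementation.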

Error rate drift. Similarly, a slowly increasing error rate is a leading indicator of a growing problem. A sudden spike is a reactive signal; a gradual increase from 0.1% to 0.5% error rate over a week is a proactive one.

Connection pool headroom. Database connection pools that are consistently near capacity — even if they haven’t hit the limit — indicate a configuration or scaling decision that needs attention before a connection pool exhaustion incident occurs.
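
One way to make “consistently near capacity” concrete is to look at the share of samples that leave less than a fixed headroom free. A sketch; the 10% headroom and 90% sustained-fraction thresholds are illustrative assumptions:

```python
def pool_headroom_alert(in_use_samples, pool_size,
                        min_headroom_pct=10.0, sustained_fraction=0.9):
    """Flag a pool that sits near capacity: alert when at least `sustained_fraction`
    of the samples leave less than `min_headroom_pct` of the pool free."""
    threshold = pool_size * (1 - min_headroom_pct / 100)
    near_capacity = sum(1 for in_use in in_use_samples if in_use >= threshold)
    return near_capacity / len(in_use_samples) >= sustained_fraction
```

The sustained-fraction condition is what makes this proactive: a brief burst to 95/100 connections is normal, but sitting there all day is a scaling decision waiting to be made.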

Alert Design Principles

Alerts are only valuable if they’re actionable and trustworthy. The two failure modes:

Too many alerts — operators tune out. If alerts fire constantly for things that don’t require action, the alert that does require action gets buried. Alert fatigue is the root cause of many serious incidents.

Too few alerts — problems are discovered by users rather than by monitoring. This is the reactive posture.

The design principle for each alert: if this alert fires at 3 AM, is there an action I should take tonight? If yes, the alert belongs in the on-call rotation. If no, it belongs in a daily review digest or should be suppressed during off-hours.

This distinction forces clarity about alert severity. The P1/P2/P3 model:

P1 — requires immediate response, any time of day. Examples: service is down, data loss is occurring, security breach is in progress.

P2 — requires response within business hours or within 4 hours off-hours. Examples: sustained performance degradation, high error rate, capacity approaching critical threshold.

P3 — informational, addressed in next business day review. Examples: certificate expiring in 45 days, disk usage growing toward configured threshold, unusual traffic pattern.

An alert configured as P1 that should be P3 is an alert that will train operators to ignore P1 alerts. Be ruthless about severity classification.

The Alert Noise Audit

Most monitoring environments accumulate alert noise over time. An alert is added when a problem occurs; it’s never revisited when the problem no longer exists or when the threshold is wrong. The result is an alert configuration that’s a historical record of past incidents rather than a current operational tool.

Run a quarterly alert audit:

  1. For each alert: how many times did it fire in the past 90 days?
  2. How many of those firings resulted in a human taking an action?
  3. For the firings that resulted in action: was the action the right one, or was it dismissing the alert?

Alerts that fired frequently but resulted in dismissal rather than action are false positives. They need either a threshold adjustment or suppression. Alerts that fired infrequently but always resulted in action are high-quality alerts. Alerts that never fired either have thresholds set too high (they’ll fire too late) or represent conditions that haven’t occurred (which may be fine).

The target: every alert that pages on-call should result in an action more than 80% of the time. Below that ratio, the alert needs tuning.
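
The audit steps above can be sketched as a small script over firing history, assuming you can export, for each firing, whether it led to a real action:

```python
def audit_alerts(firings):
    """Classify alerts from 90 days of firing history.

    `firings` maps alert name -> list of booleans, one per firing,
    True when the firing led to a human taking a real action.
    """
    report = {}
    for name, outcomes in firings.items():
        if not outcomes:
            report[name] = "never fired: threshold too high, or condition never occurred"
            continue
        action_ratio = sum(outcomes) / len(outcomes)
        if action_ratio > 0.8:
            report[name] = "high quality"
        else:
            report[name] = "needs tuning: mostly dismissed"
    return report
```

Getting the action-or-dismissal outcome per firing is the hard part; incident tooling such as PagerDuty records acknowledgment and resolution events that can approximate it.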

Synthetic Monitoring: Testing From the Outside

Internal metrics tell you how the system looks from inside. Synthetic monitoring tells you how it looks from where your users are.

Synthetic monitoring sends real requests to your application from external locations on a schedule — every 1-5 minutes is typical — and alerts if the request fails, takes too long, or returns unexpected content. The specific check might be:

  • HTTP GET to the application homepage, expecting 200 status code
  • Login flow: POST credentials, verify redirect to authenticated page
  • API endpoint: GET /api/healthz, verify JSON response contains "status": "ok"
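
A minimal synthetic check along these lines can be written with only the standard library; hosted services run the equivalent from many locations on a schedule, which a single script cannot replicate:

```python
import time
import urllib.error
import urllib.request

def synthetic_check(url, expected_status=200, expected_body=None, timeout=10):
    """One synthetic probe: fetch the URL, verify status and body content.

    Returns (ok, detail) so the caller can decide whether to alert.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            elapsed = time.monotonic() - start
            if resp.status != expected_status:
                return False, f"unexpected status {resp.status}"
            if expected_body is not None and expected_body not in body:
                return False, "expected content missing from response"
            return True, f"ok in {elapsed:.2f}s"
    except (urllib.error.URLError, TimeoutError) as exc:
        return False, f"request failed: {exc}"
```

A scheduler (cron, or the monitoring platform itself) would run this every few minutes and alert on consecutive failures rather than a single one, to avoid paging on transient network blips.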

The value of synthetic monitoring: it catches failures that internal monitoring misses. A misconfigured firewall rule that blocks external traffic while the server itself is healthy. A CDN edge node returning stale content. A DNS record that has been deleted, or a domain registration that has lapsed. A TLS certificate chain that validates for clients inside the network but fails for external clients, for example because an intermediate certificate is missing.

Pingdom, Checkly, Datadog Synthetics, and UptimeRobot all provide synthetic monitoring. For production workloads, monitoring from multiple geographic locations matters — a failure in one region that doesn’t affect others is a different incident than a global failure.

Log-Based Alerting for Application Events

Infrastructure metrics tell you about server health. Application logs tell you about application behavior. Log-based alerting bridges these.

Specific patterns worth alerting on:

  • Authentication failures spiking above baseline — could be a brute force attempt or a client configuration issue
  • Specific exception types appearing for the first time — a new code path failing in production
  • “Out of memory” or “connection refused” log lines — infrastructure problems visible in application logs
  • Database slow query logs appearing — performance degradation in the database layer

Log-based alerting requires structured logging (JSON output with consistent field names) and a log aggregation platform (Loki, Elasticsearch, CloudWatch Logs, Datadog Logs) that supports query-based alerts. Unstructured log lines with variable formats make reliable log alerting difficult.
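
With structured JSON logs, a spike check reduces to parsing and counting. A sketch, assuming each log line is a JSON object with an `event` field; the field name and thresholds are illustrative:

```python
import json

def auth_failure_spike(log_lines, baseline_per_window=5, spike_factor=3.0):
    """Count auth-failure events in a batch of structured log lines and flag
    when the count exceeds `spike_factor` times the expected baseline."""
    failures = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines are exactly why this is hard without JSON logs
        if event.get("event") == "auth_failure":
            failures += 1
    return failures > baseline_per_window * spike_factor
```

In a log platform this becomes a saved query with an alert threshold; the baseline would normally be derived from historical data rather than hardcoded.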

The investment in structured logging pays dividends beyond alerting: faster debugging, better search, and the ability to quantify error rates and event frequencies over time.

Every Alert Needs a Runbook

Every alert should have a runbook — a documented response procedure that tells the on-call engineer what to check and what to do. An alert without a runbook forces the on-call engineer to diagnose from scratch at 3 AM, which takes longer and produces worse decisions than following a documented procedure.

The runbook for each alert should be short and specific:

  1. What does this alert mean?
  2. What’s the immediate check to confirm the problem (a specific command, a specific dashboard)?
  3. What are the likely causes? (Ordered by frequency)
  4. What’s the remediation for each likely cause?
  5. Who to escalate to if the runbook doesn’t resolve it?

Link the runbook from the alert itself — in PagerDuty, Opsgenie, or wherever alerts are routed, the alert body should include the runbook URL. An on-call engineer should never have to remember where the runbook lives.

Our managed hosting service includes proactive monitoring infrastructure as a core component, not an add-on. The monitoring stack is also closely tied to data engineering decisions: the metrics your monitoring produces are data, and the quality of that data determines the quality of your operational visibility.