SYSTEM
MONITORING

// Know your system. Watch everything.

YOU CAN'T MANAGE WHAT YOU DON'T MEASURE.

Whether it's a sluggish server, a memory leak, or a network bottleneck—monitoring gives you visibility. Before users complain, you should know there's a problem.

KNOW YOUR NORMAL.

Establish baselines. Track trends. Set alerts. The difference between a sysadmin who sleeps at night and one who doesn't is their monitoring stack.

OBSERVABILITY IS FREEDOM.

Logs, metrics, traces—understand all three. When something breaks, you'll know exactly where to look. No more guessing. No more panic.

START MONITORING →

// The Path to Observability

12 lessons. Complete visibility.

LESSON 01

Introduction to System Monitoring

Why monitoring matters and key concepts

Beginner
LESSON 02

top and htop - Real-Time Monitoring

Interactive process and resource viewers

Beginner
LESSON 03

Memory and CPU Deep Dive

Understanding RAM usage and CPU metrics

Intermediate
LESSON 04

Disk I/O and Storage Monitoring

Track disk usage, IOPS, and storage health

Intermediate
LESSON 05

Log Analysis and Management

journalctl, rsyslog, and log aggregation

Intermediate
LESSON 06

Introduction to Prometheus

Modern monitoring with time-series metrics

Intermediate
LESSON 07

Node Exporter and System Metrics

Export system metrics to Prometheus

Intermediate
LESSON 08

Grafana - Visualizing Metrics

Create dashboards and alerts

Intermediate
LESSON 09

Alerting with Prometheus AlertManager

Get notified when things go wrong

Advanced
LESSON 10

Building a Complete Monitoring Stack

Integrate Prometheus, Grafana, and exporters

Advanced
LESSON 11

Distributed Tracing

Track requests across microservices with Jaeger

Advanced
LESSON 12

Incident Response

On-call procedures, runbooks, and post-mortems

Advanced

// Why Monitoring Matters

Effective monitoring helps you catch problems before users do, diagnose failures faster, and plan capacity from real data instead of guesswork.

The monitoring triangle: metrics, logs, and traces.

What gets measured gets managed. What gets monitored gets fixed.

// Tools & References

📊 Prometheus

Metrics collection and alerting

prometheus.io

📈 Grafana

Visualization and dashboards

grafana.com

📝 ELK Stack

Log management

elastic.co

🖥️ htop

Interactive process viewer

htop.dev

📡 Netdata

Real-time monitoring

netdata.cloud

🔔 AlertManager

Alert handling

prometheus.io

// Introduction to System Monitoring

What is Monitoring?

Monitoring is collecting, analyzing, and acting on data from your systems. It has three pillars:

The Three Pillars of Observability

  • Metrics: Quantitative measurements over time
    Example: CPU usage 75%, Memory 4.2GB used, 1,234 requests/sec
  • Logs: Timestamped events describing what happened
    Example: "2024-01-15 10:23:45 ERROR Connection failed from 192.168.1.50"
  • Traces: Request paths through distributed systems
    Example: User request → API Gateway → Auth Service → Database → Response

Key Monitoring Metrics

  • CPU: Usage percentage, load average, context switches
  • Memory: Used, free, cached, swap usage
  • Disk: Space used, I/O wait, throughput
  • Network: Bandwidth, packets/sec, errors
  • Processes: Count, state distribution

Basic Monitoring Tools

$ uptime              # System load average
$ w                   # Who is logged in and what they're doing
">$free -h              # Memory usage in human-readable format
$ df -h               # Disk usage

Quiz

Which of these is an example of a metric?

What is the data type that stores metric values over time?

What model does Prometheus use to collect metrics?

What model uses clients sending metrics to a central server?

What triggers when a metric crosses a threshold?

Answers

  1. cpu usage 75%
  2. time series
  3. pull
  4. push
  5. alert

// top and htop - Real-Time Monitoring

Understanding top

The top command is the classic real-time process monitor:

$ top                    # Start top
$ top -u username        # Show only user's processes
$ top -p 1234          # Monitor specific PID
$ top -d 2             # 2 second delay

Top Output Explained

top - 14:30:25 up 45 days,  3:22,  2 users,  load average: 0.52, 0.58, 0.59
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.2%us,  2.1%sy,  0.0%ni, 92.1%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   8192404k total,  5678232k used,  2514172k free,   412344k buffers
Swap:  2097148k total,        0k used,  2097148k free,  3245676k cached

Understanding Load Average

Load average shows average number of processes in runnable state:

  • 0.52, 0.58, 0.59 = 1min, 5min, 15min averages
  • On 4-core system: 4.0 = fully utilized
  • Over 4.0: Processes waiting (performance issue)
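A minimal sketch of this rule of thumb in Python (the 0.7 "healthy" cutoff is an illustrative assumption for the example, not a standard):

```python
# Illustrative helper: compare load average to core count, per the rule above.
def load_status(load_avg: float, cores: int) -> str:
    if load_avg < cores * 0.7:
        return "healthy"
    if load_avg <= cores:
        return "busy"
    return "overloaded"

# On the 4-core example above:
print(load_status(0.52, 4))   # healthy
print(load_status(6.5, 4))    # overloaded: processes are waiting
```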

htop - Better top

htop is a more user-friendly version with color and mouse support:

$ htop                   # Start htop
$ htop -u username       # Filter by user
$ htop -d 10            # Delay between updates, in tenths of seconds (1s)

htop Shortcuts

  • F1: Help
  • Space: Highlight process
  • U: Show user processes
  • T: Tree view
  • P M T: Sort by CPU/Memory/Time
  • /: Search
  • k: Kill process

Quiz

What does load average measure?

What is an enhanced version of top?

What are the three time periods for load average?

What CPU stat shows time waiting for I/O?

What command shows system uptime and load?

Answers

  1. number of processes in runnable state
  2. htop
  3. 1, 5, 15
  4. wa
  5. uptime

// Memory and CPU Deep Dive

Understanding Linux Memory

Linux uses memory intelligently—it caches disk data for performance:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       3.2Gi       1.2Gi       123Mi       3.3Gi       4.3Gi
Swap:         2.0Gi          0B       2.0Gi

Understanding the Columns

  • total: Total RAM installed
  • used: Actually in use (not available)
  • free: Completely unused
  • buff/cache: Cached disk data (can be reclaimed)
  • available: Memory available for new processes
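The "available" column is the one to watch. A tiny helper makes the math explicit (values mirror the sample free output above; a rough illustration only):

```python
# Percent of RAM not available to new processes, from free(1)-style numbers.
def memory_pressure(total_gib: float, available_gib: float) -> float:
    return round((total_gib - available_gib) / total_gib * 100, 1)

# Using the sample output: 7.8Gi total, 4.3Gi available
print(memory_pressure(7.8, 4.3))   # 44.9 (percent)
```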

Memory Analysis

$ cat /proc/meminfo      # Detailed memory info
$ vmstat 1              # Virtual memory statistics
$ ps aux --sort=-%mem    # Top memory consumers
$ pmap -x 1234          # Memory map of process

Understanding CPU States

  • us: User space - programs you run
  • sy: System - kernel operations
  • ni: Nice - low-priority processes
  • id: Idle - doing nothing
  • wa: I/O Wait - waiting for disk
  • hi: Hardware interrupts
  • si: Software interrupts
  • st: Stolen (virtualization)

CPU Analysis

$ mpstat -P ALL 1       # Per-CPU stats
$ sar -u 1 5           # Historical CPU stats
$ ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10  # Top CPU users

Quiz

In "free" output, which value shows memory available for new processes?

What memory is used by the kernel for caches?

What command shows virtual memory statistics?

What controls kernel tendency to swap?

Answers

  1. available
  2. buff/cache
  3. vmstat
  4. swappiness

// Disk I/O and Storage Monitoring

Understanding Disk I/O

Slow disk I/O can cripple your server. Monitor for:

  • I/O Wait (wa): CPU waiting for disk
  • IOPS: Operations per second
  • Throughput: MB/s transferred
  • Latency: Time per operation

Basic Disk Tools

$ df -h                   # Disk space usage
$ df -i                   # Inode usage
$ du -sh /var/log        # Directory size
$ du -ah --max-depth=1 | sort -hr | head -10  # Largest files/dirs

iostat - I/O Statistics

$ iostat -xz 1          # I/O stats per device
Device   r/s   w/s  rkB/s  wkB/s  await  svctm  %util
sda     2.50  0.50  20.00   4.00  10.50   5.50  15.00

iostat Columns

  • r/s, w/s: Read/write requests per second
  • rkB/s, wkB/s: Read/write KB per second
  • await: Average wait time per request, including queueing (ms)
  • svctm: Average service time (ms; deprecated in newer iostat)
  • %util: Device utilization %
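These columns can cross-check each other. For example, average request size falls out of the throughput and request-rate columns (illustrative helper, using the sample numbers above):

```python
# Average I/O request size in kB, derived from iostat-style columns.
def avg_request_kb(r_s: float, w_s: float, rkb_s: float, wkb_s: float) -> float:
    return (rkb_s + wkb_s) / (r_s + w_s)

# Sample row: r/s=2.50, w/s=0.50, rkB/s=20.00, wkB/s=4.00
print(avg_request_kb(2.50, 0.50, 20.00, 4.00))   # 8.0 kB per request
```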

iotop - Per-Process I/O

$ sudo iotop                # Interactive I/O monitor
$ sudo iotop -o              # Only show active I/O
$ sudo iotop -b -n 3        # Batch mode, 3 iterations

Smart Monitoring

$ sudo smartctl -a /dev/sda   # SMART data
$ sudo smartctl -H /dev/sda   # Health check

Quiz

What does "await" represent in iostat output?

What command shows CPU and disk I/O stats?

What shows per-process I/O usage?

What shows disk space usage?

Answers

  1. average wait time for i/o requests
  2. iostat
  3. iotop
  4. df

// Log Analysis and Management

Linux Log Locations

  • /var/log/syslog: System messages (Debian/Ubuntu)
  • /var/log/messages: System messages (RHEL/CentOS)
  • /var/log/auth.log: Authentication logs
  • /var/log/kern.log: Kernel messages
  • /var/log/nginx/: Nginx access/error logs
  • /var/log/apache2/: Apache logs

Essential Log Commands

$ tail -f /var/log/syslog       # Follow log in real-time
$ tail -n 100 /var/log/syslog    # Last 100 lines
$ grep "error" /var/log/syslog     # Search for errors
$ less +F /var/log/syslog        # Read and follow

journalctl - Systemd Logs

$ journalctl                   # All logs
$ journalctl -u nginx.service    # Specific unit
$ journalctl -f                # Follow
$ journalctl --since "1 hour ago"  # Time range
$ journalctl -p err             # Error priority
$ journalctl -b                # Current boot

Logrotate - Log Management

$ cat /etc/logrotate.conf     # Main config
$ ls /etc/logrotate.d/        # Service-specific configs

Example /etc/logrotate.d/myapp:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 root adm
}

Centralized Logging

For multiple servers, consider:

  • ELK Stack: Elasticsearch, Logstash, Kibana
  • Graylog: Alternative to ELK
  • Loki: Prometheus-style logging

Quiz

Which command shows logs from a specific systemd service?

Where are system logs typically stored?

What shows the last lines of a log file?

What searches for patterns in text?

Answers

  1. journalctl -u
  2. /var/log
  3. tail
  4. grep

// Introduction to Prometheus

What is Prometheus?

Prometheus is an open-source monitoring system with:

  • Time-series database: Stores metrics with timestamps
  • Pull model: Scrapes metrics from targets
  • PromQL: Powerful query language
  • Alerting: Built-in alert manager

Installing Prometheus

$ wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
$ tar xzf prometheus-*.tar.gz
$ cd prometheus-2.45.0.linux-amd64

Prometheus Configuration

$ cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF

Running Prometheus

$ ./prometheus --config.file=prometheus.yml
# Access UI at http://localhost:9090

Key Metrics Types

  • Counter: Always increasing (requests_total, bytes_sent)
  • Gauge: Can go up and down (cpu_usage, memory_free)
  • Histogram: Bucketed observations (request_duration_seconds)
  • Summary: Similar to histogram (quantiles)
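A toy model of the counter/gauge distinction helps it stick (this illustrates the semantics only; real code would use the prometheus_client library, whose API differs):

```python
class Counter:
    """Monotonically increasing; resets only when the process restarts."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """A point-in-time value that can move in either direction."""
    def __init__(self):
        self.value = 0.0

    def set(self, value: float):
        self.value = value

requests_total = Counter()
requests_total.inc()          # one more request served
cpu_usage = Gauge()
cpu_usage.set(75.0)
cpu_usage.set(42.0)           # gauges can go back down; counters cannot
```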

Basic PromQL Queries

# CPU usage of all nodes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Requests per second
rate(http_requests_total[5m])

# Top 5 CPU consumers
topk(5, sum by (process) (rate(process_cpu_seconds_total[5m])))
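rate() is easy to approximate by hand, which helps when sanity-checking queries. A simplified sketch, ignoring counter resets and PromQL's extrapolation at window edges:

```python
# Per-second rate over a window of (timestamp_seconds, counter_value) samples,
# like PromQL rate() but without reset handling or extrapolation.
def simple_rate(samples):
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that went from 100 to 400 over 60 seconds:
print(simple_rate([(0, 100), (30, 250), (60, 400)]))   # 5.0 per second
```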

Quiz

What is the default Prometheus scrape interval?

What is Prometheus's query language called?

What function calculates per-second rate?

What adds metadata to metrics?

Answers

  1. 15s
  2. promql
  3. rate
  4. label

// Node Exporter and System Metrics

What is Node Exporter?

Node Exporter exposes system metrics for Prometheus—CPU, memory, disk, network, and more.

Installing Node Exporter

$ wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
$ tar xzf node_exporter-*.tar.gz
$ cd node_exporter-1.6.1.linux-amd64
$ ./node_exporter
# Metrics available at http://localhost:9100/metrics

Key Node Exporter Metrics

# CPU
node_cpu_seconds_total{mode="idle|user|system|iowait|..."}

# Memory
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes

# Disk
node_filesystem_avail_bytes
node_filesystem_size_bytes
node_disk_io_time_seconds_total

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total

Useful PromQL Queries

# CPU by mode
sum by (mode) (rate(node_cpu_seconds_total[5m])) * 100

# Memory in GB
node_memory_MemTotal_bytes / 1024^3
node_memory_MemAvailable_bytes / 1024^3

# Disk usage %
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
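The /metrics page itself is plain text in the exposition format: `name{labels} value`, one metric per line. A simplified parser sketch (real parsing handles escaping and other edge cases this ignores):

```python
import re

# Parse one line of Prometheus text exposition format: name{labels} value.
def parse_metric_line(line: str):
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)', line)
    name, labels, value = m.group(1), m.group(2) or "", float(m.group(3))
    return name, dict(re.findall(r'(\w+)="([^"]*)"', labels)), value

name, labels, value = parse_metric_line(
    'node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6'
)
print(name, labels, value)
```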

Run as Service

$ sudo useradd -rs /bin/false node_exporter
$ sudo cp node_exporter /usr/local/bin/
$ sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
$ sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable node_exporter
$ sudo systemctl start node_exporter

Additional Exporters

  • cadvisor: Docker container metrics
  • nginx_exporter: Nginx stats
  • postgres_exporter: PostgreSQL metrics
  • blackbox_exporter: HTTP/TCP/ICMP checks

Quiz

What port does Node Exporter run on by default?

What exporter collects system metrics?

What monitors Docker containers?

What probes endpoints for HTTP/TCP/ICMP?

Answers

  1. 9100
  2. node exporter
  3. cadvisor
  4. blackbox

// Grafana - Visualizing Metrics

What is Grafana?

Grafana is the visualization layer—connect to Prometheus, InfluxDB, and dozens of other data sources to create beautiful dashboards.

Installing Grafana

$ sudo apt-get install -y apt-transport-https software-properties-common
$ wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
$ echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install grafana
$ sudo systemctl enable grafana-server
$ sudo systemctl start grafana-server
# Access at http://localhost:3000 (admin/admin)

Adding Prometheus as Data Source

  1. Open Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources
  3. Click "Add data source"
  4. Select "Prometheus"
  5. Set URL to http://localhost:9090
  6. Click "Save & Test"

Creating a Dashboard

  1. Click "Dashboards" → "New Dashboard"
  2. Click "Add visualization"
  3. Select Prometheus as data source
  4. Enter a PromQL query
  5. Configure visualization (graph, stat, gauge)
  6. Add panel titles, labels, thresholds

Useful Query Examples

# CPU Usage (Graph)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (Gauge)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk Space (Stat)
node_filesystem_size_bytes - node_filesystem_avail_bytes

# Network In/Out (Time series)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

Dashboard JSON

You can import pre-built dashboards:

  • Search grafana.com/dashboards for "Node Exporter"
  • Import ID 1860 for the popular Node Exporter dashboard
  • Import ID 9588 for Prometheus 2.0 stats

Quiz

What is the default Grafana admin password?

What port does Grafana run on by default?

What displays visualizations in Grafana?

What is an individual visualization called?

Answers

  1. admin
  2. 3000
  3. dashboard
  4. panel

// Alerting with Prometheus AlertManager

Why Alerting?

Monitoring without alerting is just looking at pretty graphs. You need to know when things break—before users complain.

AlertManager Architecture

  • Prometheus: Evaluates rules, sends alerts to AlertManager
  • AlertManager: Receives, deduplicates, groups, routes alerts
  • Receivers: Email, Slack, PagerDuty, webhooks

Alerting Rules

$ cat alerts.yml
groups:
- name: alerts
  rules:
  - alert: HighCPU
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes"

  - alert: HighMemory
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"

  - alert: DiskSpaceLow
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"

AlertManager Configuration

$ cat alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'admin@example.com'
    send_resolved: true

- name: 'slack'
  slack_configs:
  - channel: '#alerts'
    api_url: 'https://hooks.slack.com/services/XXX'
    send_resolved: true

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

Alert States

  • Pending: Rule triggered but not for "for" duration yet
  • Firing: Alert active and being sent
  • Resolved: Condition no longer true
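The pending-to-firing transition can be sketched as a small function (times in seconds; an illustration of the "for" clause, not Prometheus internals):

```python
# State of an alert whose condition has been true since breach_start.
def alert_state(breach_start, now, for_duration):
    if breach_start is None:
        return "inactive"
    if now - breach_start >= for_duration:
        return "firing"
    return "pending"

# A 5-minute (300s) "for" clause:
print(alert_state(0, 120, 300))   # pending: true for only 2 minutes so far
print(alert_state(0, 300, 300))   # firing: held for the full duration
```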

Best Practices

  • Use "for" duration to avoid flapping alerts
  • Group related alerts to reduce noise
  • Have runbooks for each alert
  • Start with fewer alerts, add more as you learn

Quiz

What does the "for" clause in an alert rule do?

What handles alert routing and notifications?

What matches alerts to notification routes?

What temporarily mutes alerts?

Answers

  1. time condition must be true before alerting
  2. alertmanager
  3. label
  4. silence

// Building a Complete Monitoring Stack

Putting It All Together

A complete monitoring stack includes:

  • Collection: Prometheus + Exporters
  • Visualization: Grafana
  • Alerting: AlertManager
  • Log Aggregation: Loki or ELK

Docker Compose Setup

$ cat docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

volumes:
  prometheus-data:
  grafana-data:

Production Considerations

  • High Availability: Run redundant Prometheus instances
  • Retention: Configure storage retention (default 15d)
  • Remote Write: Send data to remote storage (Thanos, Cortex)
  • Backup: Grafana dashboards, Prometheus rules
  • Authentication: Secure Grafana with auth proxy

Suggested Metrics to Monitor

# Infrastructure
- CPU usage > 80%
- Memory usage > 90%
- Disk usage > 85%
- Load average > CPU count
- Network errors

# Application
- Request latency (p50, p95, p99)
- Error rate
- Request throughput
- Queue depth

# Business
- Active users
- Transaction volume
- Revenue (if applicable)

Next Steps

  • Explore more exporters: Redis, MongoDB, Kafka
  • Learn PromQL: Recording rules, functions
  • Add tracing: Jaeger or Tempo
  • Implement SLOs: Error budgets, SLIs
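For the SLO step, the core arithmetic is the error budget: how much downtime a given availability target allows. A quick helper:

```python
# Allowed downtime (minutes) for an availability SLO over a window of days.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    return round((1 - slo) * days * 24 * 60, 1)

print(error_budget_minutes(0.999))   # 43.2 minutes per 30 days
print(error_budget_minutes(0.99))    # 432.0 minutes per 30 days
```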

Quiz

What Docker volumes are needed to expose system metrics to node-exporter?

What defines multi-container applications?

What pre-calculates complex queries?

What automatically finds targets to monitor?

Answers

  1. /proc and /sys
  2. docker-compose
  3. recording rules
  4. service discovery

// Distributed Tracing

What is Distributed Tracing?

Distributed tracing tracks requests as they flow through multiple services in a microservices architecture. While metrics tell you what is wrong and logs tell you why, traces tell you where the problem is.

Key Concepts

  • Trace: End-to-end request journey through all services
  • Span: A single operation within a trace (e.g., database query, API call)
  • Span Context: Unique identifiers passed between services (trace ID, span ID)
  • Baggage: Metadata carried across service boundaries
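Span context travels between services as a W3C traceparent header, and building one is simple string formatting (version 00; the IDs below are the same example values used later in this lesson):

```python
# Build a W3C traceparent header: version-trace_id-span_id-flags.
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

header = make_traceparent(
    "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", sampled=True
)
print(header)   # 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```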

Installing Jaeger

$ docker run -d --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    -p 14268:14268 \
    -p 14250:14250 \
    jaegertracing/all-in-one:1.45
# UI: 16686, agent (UDP): 6831, collector: 14268/14250
Tip: The all-in-one image is perfect for local development. For production, use separate components.

Instrumentation with OpenTelemetry

OpenTelemetry is the modern standard for instrumentation:

Python Example

$ pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
    JaegerExporter(agent_host_name="localhost", agent_port=6831)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("customer.id", "cust_789")
    
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass
    
    with tracer.start_as_current_span("update_inventory"):
        # Inventory update logic
        pass

Propagating Context Between Services

Trace context must be propagated via HTTP headers:

# Request headers include trace context
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

# trace_id: 0af7651916cd43dd8448eb211c80319c
# span_id:  b7ad6b7169203331
# sampled:  01 (yes)

Viewing Traces in Jaeger UI

  1. Access Jaeger at http://localhost:16686
  2. Select your service from the dropdown
  3. Search for traces by operation name or tags
  4. Analyze the trace waterfall view

Sample Rate Configuration

In production, sample only a percentage of requests:

# Always sample (development)
sampler:
  type: const
  param: 1

# Probabilistic sampling (production - 1%)
sampler:
  type: probabilistic
  param: 0.01

# Rate limiting (max 10 traces per second)
sampler:
  type: ratelimiting
  param: 10
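Probabilistic sampling is often made deterministic by hashing the trace ID, so every service makes the same keep/drop decision for a given trace. An illustrative sketch (not Jaeger's actual algorithm):

```python
# Head-based sampling decision derived from the trace ID itself:
# map the first 8 hex digits into [0, 1) and compare to the rate.
def probabilistic_sample(trace_id: str, rate: float) -> bool:
    return (int(trace_id[:8], 16) / 0xFFFFFFFF) < rate

tid = "0af7651916cd43dd8448eb211c80319c"
print(probabilistic_sample(tid, 0.05))   # True  (this ID hashes to ~0.043)
print(probabilistic_sample(tid, 0.01))   # False (below the 5% but above 1%)
```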

Quiz

What is a single operation within a trace called?

What is a popular open-source distributed tracing system?

What HTTP header carries trace context between services?

What is the modern standard for observability instrumentation?

Answers

  1. span
  2. jaeger
  3. traceparent
  4. opentelemetry

// Incident Response

What is Incident Response?

Incident response is the organized approach to addressing and managing the aftermath of a security breach or system outage. The goal is to handle the situation in a way that limits damage and reduces recovery time and costs.

The Incident Response Lifecycle

  • Detection: Alert fires, monitoring detects anomaly, user reports issue
  • Triage: Assess severity, determine impact, assign owner
  • Response: Follow runbook, implement fix, communicate status
  • Resolution: Confirm fix works, close incident, notify stakeholders
  • Post-Mortem: Analyze root cause, identify improvements, document lessons

Severity Levels

  • SEV1 (Critical): Complete outage, data loss, or security breach. Response: all hands on deck. SLA: 15 min response, 1 hour resolution target.
  • SEV2 (High): Major functionality degraded. Response: primary on-call engaged. SLA: 30 min response, 4 hour resolution target.
  • SEV3 (Medium): Minor impact, workarounds available. Response: next business day. SLA: 2 hour response, 24 hour resolution target.
  • SEV4 (Low): Cosmetic issues, feature requests. Response: backlog for next sprint. SLA: best effort.

Runbooks - Your Playbook

Every alert should have a runbook. A good runbook includes:

  • Symptoms: What you'll see when this alert fires
  • Impact: Who/what is affected
  • Investigation steps: Commands to run, dashboards to check
  • Resolution steps: How to fix it
  • Escalation: When to call for help

Sample Runbook: High CPU Alert

ALERT: HighCPUUsage > 80% for 5 minutes

1. IDENTIFY:
   $ top -bn1 | head -20
   $ ps aux --sort=-%cpu | head -10

2. ASSESS IMPACT:
   - Check if it's a known batch job
   - Check if user-facing services are degraded
   - Review error rates in Grafana

3. MITIGATE (if needed):
   $ sudo kill -9 <PID>          # Last resort
   $ sudo renice +10 <PID>       # Lower priority

4. SCALE (if applicable):
   $ kubectl scale deployment app --replicas=5

5. POST-INCIDENT:
   - Document what process caused it
   - Create ticket for permanent fix
   - Update alert threshold if needed

On-Call Best Practices

Tip: Rotate on-call regularly. No one should be on-call for more than a week at a time. Burnout is real.
  • Primary/Secondary: Always have a backup on-call person
  • Follow-the-sun: Distribute on-call across time zones
  • Compensation: Pay on-call hours or give time off
  • War rooms: Use Slack/Zoom for major incidents
  • Page sparingly: Only wake people for SEV1/SEV2

Alert Fatigue - The Silent Killer

Too many alerts leads to ignored alerts. Combat this with:

  • Actionable alerts only: If you can't act on it, don't alert
  • Grouping: One alert per incident, not per symptom
  • Smart thresholds: Avoid flapping alerts with proper "for" durations
  • Regular review: Monthly alert quality review
  • Auto-remediation: Auto-restart services, clear caches
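Grouping is simple to picture: collapse many alert instances into one notification per alertname. A sketch of the idea (AlertManager's real grouping is configurable and richer):

```python
from collections import defaultdict

# One notification per alertname, listing every affected instance.
def group_alerts(alerts):
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["alertname"]].append(alert["instance"])
    return dict(groups)

alerts = [
    {"alertname": "HighCPU", "instance": "web1"},
    {"alertname": "HighCPU", "instance": "web2"},
    {"alertname": "DiskSpaceLow", "instance": "db1"},
]
print(group_alerts(alerts))   # two notifications instead of three pages
```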

Post-Mortem Template

INCIDENT POST-MORTEM: Database Outage - 2024-01-15

SUMMARY:
- Duration: 45 minutes
- Severity: SEV1
- Impact: 12,000 users unable to login
- Root Cause: Connection pool exhaustion

TIMELINE:
14:23 - Alert fired: DB connection pool 95% full
14:25 - On-call engineer acknowledged
14:28 - Identified runaway query from new deployment
14:30 - Killed long-running queries, restarted pool
14:35 - Service recovered, error rates normal

ROOT CAUSE:
New feature deployed without proper query optimization.
Missing index on user_sessions.last_activity column.

LESSONS LEARNED:
1. Need query review for all DB changes
2. Add connection pool alert at 80%
3. Deploy new features with feature flags

ACTION ITEMS:
- [ ] @dba Add index on user_sessions.last_activity (Due: 1/17)
- [ ] @dev Add query performance tests to CI (Due: 1/20)
- [ ] @sre Update runbook with query kill commands (Due: 1/16)
Tip: Post-mortems are blameless. Focus on systemic issues, not individual mistakes. The goal is learning, not punishment.

Incident Communication

Keep stakeholders informed:

  • Internal status page: Atlassian Statuspage, Cachet
  • Slack updates: #incidents channel with automated status
  • Customer comms: Clear, honest, no blame-shifting
  • All-clear: Confirm fix, monitor for 1 hour before closing

Quiz

What is a documented procedure for handling specific alerts?

What severity level indicates a complete outage?

What is the blameless analysis after an incident?

What occurs when too many alerts cause them to be ignored?

Answers

  1. runbook
  2. sev1
  3. post-mortem
  4. alert fatigue