SYSTEM
MONITORING

// Know your system. Watch everything.

YOU CAN'T MANAGE WHAT YOU DON'T MEASURE.

Whether it's a sluggish server, a memory leak, or a network bottleneck—monitoring gives you visibility. Before users complain, you should know there's a problem.

KNOW YOUR NORMAL.

Establish baselines. Track trends. Set alerts. The difference between a sysadmin who sleeps at night and one who doesn't is their monitoring stack.

OBSERVABILITY IS FREEDOM.

Logs, metrics, traces—understand all three. When something breaks, you'll know exactly where to look. No more guessing. No more panic.

START MONITORING →

// The Path to Observability

12 lessons. Complete visibility.

LESSON 01

Introduction to System Monitoring

Why monitoring matters and key concepts

Beginner
LESSON 02

top and htop - Real-Time Monitoring

Interactive process and resource viewers

Beginner
LESSON 03

Memory and CPU Deep Dive

Understanding RAM usage and CPU metrics

Intermediate
LESSON 04

Disk I/O and Storage Monitoring

Track disk usage, IOPS, and storage health

Intermediate
LESSON 05

Log Analysis and Management

journalctl, rsyslog, and log aggregation

Intermediate
LESSON 06

Introduction to Prometheus

Modern monitoring with time-series metrics

Intermediate
LESSON 07

Node Exporter and System Metrics

Export system metrics to Prometheus

Intermediate
LESSON 08

Grafana - Visualizing Metrics

Create dashboards and alerts

Intermediate
LESSON 09

Alerting with Prometheus AlertManager

Get notified when things go wrong

Advanced
LESSON 10

Building a Complete Monitoring Stack

Integrate Prometheus, Grafana, and exporters

Advanced
LESSON 11

Distributed Tracing

Track requests across microservices with Jaeger

Advanced
LESSON 12

Incident Response

On-call procedures, runbooks, and post-mortems

Advanced

// Why Monitoring Matters

Effective monitoring helps you catch problems before users do, diagnose failures faster, and plan capacity from real data instead of guesswork.

The monitoring triangle: metrics, logs, and traces.

What gets measured gets managed. What gets monitored gets fixed.

// Tools & References

📊 Prometheus

Metrics collection and alerting

prometheus.io

📈 Grafana

Visualization and dashboards

grafana.com

📝 ELK Stack

Log management

elastic.co

🖥️ htop

Interactive process viewer

htop.dev

📡 Netdata

Real-time monitoring

netdata.cloud

🔔 AlertManager

Alert handling

prometheus.io

// Introduction to System Monitoring

What is Monitoring?

Monitoring is collecting, analyzing, and acting on data from your systems. It has three pillars:

The Three Pillars of Observability

  • Metrics: Quantitative measurements over time
    Example: CPU usage 75%, Memory 4.2GB used, 1,234 requests/sec
  • Logs: Timestamped events describing what happened
    Example: "2024-01-15 10:23:45 ERROR Connection failed from 192.168.1.50"
  • Traces: Request paths through distributed systems
    Example: User request → API Gateway → Auth Service → Database → Response

Key Monitoring Metrics

  • CPU: Usage percentage, load average, context switches
  • Memory: Used, free, cached, swap usage
  • Disk: Space used, I/O wait, throughput
  • Network: Bandwidth, packets/sec, errors
  • Processes: Count, state distribution

Basic Monitoring Tools

$ uptime              # System load average
$ w                   # Who is logged in and what they're doing
">$free -h              # Memory usage in human-readable format
$ df -h               # Disk usage

Quiz

Which of these is an example of a metric?

What is the data type that stores metric values over time?

What model does Prometheus use to collect metrics?

What model uses clients sending metrics to a central server?

What triggers when a metric crosses a threshold?

Answers

  1. cpu usage 75%
  2. time series
  3. pull
  4. push
  5. alert

// top and htop - Real-Time Monitoring

Understanding top

The top command is the classic real-time process monitor:

$ top                    # Start top
$ top -u username        # Show only user's processes
$ top -p 1234          # Monitor specific PID
$ top -d 2             # 2 second delay

Top Output Explained

top - 14:30:25 up 45 days,  3:22,  2 users,  load average: 0.52, 0.58, 0.59
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.2%us,  2.1%sy,  0.0%ni, 92.1%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   8192404k total,  5678232k used,  2514172k free,   412344k buffers
Swap:  2097148k total,        0k used,  2097148k free,  3245676k cached

Understanding Load Average

Load average shows average number of processes in runnable state:

  • 0.52, 0.58, 0.59 = 1min, 5min, 15min averages
  • On 4-core system: 4.0 = fully utilized
  • Over 4.0: Processes waiting (performance issue)
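A minimal sketch of this rule of thumb in Python (the 0.7 "healthy" cutoff is an illustrative assumption for the example, not a standard):

```python
# Illustrative helper: compare load average to core count, per the rule above.
def load_status(load_avg: float, cores: int) -> str:
    if load_avg < cores * 0.7:
        return "healthy"
    if load_avg <= cores:
        return "busy"
    return "overloaded"

# On the 4-core example above:
print(load_status(0.52, 4))   # healthy
print(load_status(6.5, 4))    # overloaded: processes are waiting
```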

htop - Better top

htop is a more user-friendly version with color and mouse support:

$ htop                   # Start htop
$ htop -u username       # Filter by user
$ htop -d 10            # Delay between updates, in tenths of seconds (1s)

htop Shortcuts

  • F1: Help
  • Space: Highlight process
  • U: Show user processes
  • T: Tree view
  • P M T: Sort by CPU/Memory/Time
  • /: Search
  • k: Kill process

Quiz

What does load average measure?

What is an enhanced version of top?

What are the three time periods for load average?

What CPU stat shows time waiting for I/O?

What command shows system uptime and load?

Answers

  1. number of processes in runnable state
  2. htop
  3. 1, 5, 15
  4. wa
  5. uptime

// Memory and CPU Deep Dive

Understanding Linux Memory

Linux uses memory intelligently—it caches disk data for performance:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       3.2Gi       1.2Gi       123Mi       3.3Gi       4.3Gi
Swap:         2.0Gi          0B       2.0Gi

Understanding the Columns

  • total: Total RAM installed
  • used: Actually in use (not available)
  • free: Completely unused
  • buff/cache: Cached disk data (can be reclaimed)
  • available: Memory available for new processes
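The "available" column is the one to watch. A tiny helper makes the math explicit (values mirror the sample free output above; a rough illustration only):

```python
# Percent of RAM not available to new processes, from free(1)-style numbers.
def memory_pressure(total_gib: float, available_gib: float) -> float:
    return round((total_gib - available_gib) / total_gib * 100, 1)

# Using the sample output: 7.8Gi total, 4.3Gi available
print(memory_pressure(7.8, 4.3))   # 44.9 (percent)
```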

Memory Analysis

$ cat /proc/meminfo      # Detailed memory info
$ vmstat 1              # Virtual memory statistics
$ ps aux --sort=-%mem    # Top memory consumers
$ pmap -x 1234          # Memory map of process

Understanding CPU States

  • us: User space - programs you run
  • sy: System - kernel operations
  • ni: Nice - low-priority processes
  • id: Idle - doing nothing
  • wa: I/O Wait - waiting for disk
  • hi: Hardware interrupts
  • si: Software interrupts
  • st: Stolen (virtualization)

CPU Analysis

$ mpstat -P ALL 1       # Per-CPU stats
$ sar -u 1 5           # Historical CPU stats
$ ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10  # Top CPU users

Quiz

In "free" output, which value shows memory available for new processes?

What memory is used by the kernel for caches?

What command shows virtual memory statistics?

What controls kernel tendency to swap?

Answers

  1. available
  2. buff/cache
  3. vmstat
  4. swappiness

// Disk I/O and Storage Monitoring

Understanding Disk I/O

Slow disk I/O can cripple your server. Monitor for:

  • I/O Wait (wa): CPU waiting for disk
  • IOPS: Operations per second
  • Throughput: MB/s transferred
  • Latency: Time per operation

Basic Disk Tools

$ df -h                   # Disk space usage
$ df -i                   # Inode usage
$ du -sh /var/log        # Directory size
$ du -ah --max-depth=1 | sort -hr | head -10  # Largest files/dirs

iostat - I/O Statistics

$ iostat -xz 1          # I/O stats per device
Device   r/s   w/s  rkB/s  wkB/s  await  svctm  %util
sda     2.50  0.50  20.00   4.00  10.50   5.50  15.00

iostat Columns

  • r/s, w/s: Read/write requests per second
  • rkB/s, wkB/s: Read/write KB per second
  • await: Average wait time per request, including queueing (ms)
  • svctm: Average service time (ms; deprecated in newer iostat)
  • %util: Device utilization %
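These columns can cross-check each other. For example, average request size falls out of the throughput and request-rate columns (illustrative helper, using the sample numbers above):

```python
# Average I/O request size in kB, derived from iostat-style columns.
def avg_request_kb(r_s: float, w_s: float, rkb_s: float, wkb_s: float) -> float:
    return (rkb_s + wkb_s) / (r_s + w_s)

# Sample row: r/s=2.50, w/s=0.50, rkB/s=20.00, wkB/s=4.00
print(avg_request_kb(2.50, 0.50, 20.00, 4.00))   # 8.0 kB per request
```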

iotop - Per-Process I/O

$ sudo iotop                # Interactive I/O monitor
$ sudo iotop -o              # Only show active I/O
$ sudo iotop -b -n 3        # Batch mode, 3 iterations

Smart Monitoring

$ sudo smartctl -a /dev/sda   # SMART data
$ sudo smartctl -H /dev/sda   # Health check

Quiz

What does "await" represent in iostat output?

What command shows CPU and disk I/O stats?

What shows per-process I/O usage?

What shows disk space usage?

Answers

  1. average wait time for i/o requests
  2. iostat
  3. iotop
  4. df

// Log Analysis and Management

Linux Log Locations

  • /var/log/syslog: System messages (Debian/Ubuntu)
  • /var/log/messages: System messages (RHEL/CentOS)
  • /var/log/auth.log: Authentication logs
  • /var/log/kern.log: Kernel messages
  • /var/log/nginx/: Nginx access/error logs
  • /var/log/apache2/: Apache logs

Essential Log Commands

$ tail -f /var/log/syslog       # Follow log in real-time
$ tail -n 100 /var/log/syslog    # Last 100 lines
$ grep "error" /var/log/syslog     # Search for errors
$ less +F /var/log/syslog        # Read and follow

journalctl - Systemd Logs

$ journalctl                   # All logs
$ journalctl -u nginx.service    # Specific unit
$ journalctl -f                # Follow
$ journalctl --since "1 hour ago"  # Time range
$ journalctl -p err             # Error priority
$ journalctl -b                # Current boot

Logrotate - Log Management

$ cat /etc/logrotate.conf     # Main config
$ ls /etc/logrotate.d/        # Service-specific configs

Example /etc/logrotate.d/myapp:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 root adm
}

Centralized Logging

For multiple servers, consider:

  • ELK Stack: Elasticsearch, Logstash, Kibana
  • Graylog: Alternative to ELK
  • Loki: Prometheus-style logging

Quiz

Which command shows logs from a specific systemd service?

Where are system logs typically stored?

What shows the last lines of a log file?

What searches for patterns in text?

Answers

  1. journalctl -u
  2. /var/log
  3. tail
  4. grep

// Introduction to Prometheus

What is Prometheus?

Prometheus is an open-source monitoring system with:

  • Time-series database: Stores metrics with timestamps
  • Pull model: Scrapes metrics from targets
  • PromQL: Powerful query language
  • Alerting: Built-in alert manager

Installing Prometheus

$ wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
$ tar xzf prometheus-*.tar.gz
$ cd prometheus-2.45.0.linux-amd64

Prometheus Configuration

$ cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF

Running Prometheus

$ ./prometheus --config.file=prometheus.yml
# Access UI at http://localhost:9090

Key Metrics Types

  • Counter: Always increasing (requests_total, bytes_sent)
  • Gauge: Can go up and down (cpu_usage, memory_free)
  • Histogram: Bucketed observations (request_duration_seconds)
  • Summary: Similar to histogram (quantiles)
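A toy model of the counter/gauge distinction helps it stick (this illustrates the semantics only; real code would use the prometheus_client library, whose API differs):

```python
class Counter:
    """Monotonically increasing; resets only when the process restarts."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """A point-in-time value that can move in either direction."""
    def __init__(self):
        self.value = 0.0

    def set(self, value: float):
        self.value = value

requests_total = Counter()
requests_total.inc()          # one more request served
cpu_usage = Gauge()
cpu_usage.set(75.0)
cpu_usage.set(42.0)           # gauges can go back down; counters cannot
```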

Basic PromQL Queries

# CPU usage of all nodes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Requests per second
rate(http_requests_total[5m])

# Top 5 CPU consumers
topk(5, sum by (process) (rate(process_cpu_seconds_total[5m])))
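rate() is easy to approximate by hand, which helps when sanity-checking queries. A simplified sketch, ignoring counter resets and PromQL's extrapolation at window edges:

```python
# Per-second rate over a window of (timestamp_seconds, counter_value) samples,
# like PromQL rate() but without reset handling or extrapolation.
def simple_rate(samples):
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that went from 100 to 400 over 60 seconds:
print(simple_rate([(0, 100), (30, 250), (60, 400)]))   # 5.0 per second
```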

Quiz

What is the default Prometheus scrape interval?

What is Prometheus's query language called?

What function calculates per-second rate?

What adds metadata to metrics?

Answers

  1. 15s
  2. promql
  3. rate
  4. label

// Node Exporter and System Metrics

What is Node Exporter?

Node Exporter exposes system metrics for Prometheus—CPU, memory, disk, network, and more.

Installing Node Exporter

$ wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
$ tar xzf node_exporter-*.tar.gz
$ cd node_exporter-1.6.1.linux-amd64
$ ./node_exporter
# Metrics available at http://localhost:9100/metrics

Key Node Exporter Metrics

# CPU
node_cpu_seconds_total{mode="idle|user|system|iowait|..."}

# Memory
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes

# Disk
node_filesystem_avail_bytes
node_filesystem_size_bytes
node_disk_io_time_seconds_total

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total

Useful PromQL Queries

# CPU by mode
sum by (mode) (rate(node_cpu_seconds_total[5m])) * 100

# Memory in GB
node_memory_MemTotal_bytes / 1024^3
node_memory_MemAvailable_bytes / 1024^3

# Disk usage %
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
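The /metrics page itself is plain text in the exposition format: `name{labels} value`, one metric per line. A simplified parser sketch (real parsing handles escaping and other edge cases this ignores):

```python
import re

# Parse one line of Prometheus text exposition format: name{labels} value.
def parse_metric_line(line: str):
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)', line)
    name, labels, value = m.group(1), m.group(2) or "", float(m.group(3))
    return name, dict(re.findall(r'(\w+)="([^"]*)"', labels)), value

name, labels, value = parse_metric_line(
    'node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6'
)
print(name, labels, value)
```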

Run as Service

$ sudo useradd -rs /bin/false node_exporter
$ sudo cp node_exporter /usr/local/bin/
$ sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
$ sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable node_exporter
$ sudo systemctl start node_exporter

Additional Exporters

  • cadvisor: Docker container metrics
  • nginx_exporter: Nginx stats
  • postgres_exporter: PostgreSQL metrics
  • blackbox_exporter: HTTP/TCP/ICMP checks

Quiz

What port does Node Exporter run on by default?

What exporter collects system metrics?

What monitors Docker containers?

What probes endpoints for HTTP/TCP/ICMP?

Answers

  1. 9100
  2. node exporter
  3. cadvisor
  4. blackbox

// Grafana - Visualizing Metrics

What is Grafana?

Grafana is the visualization layer—connect to Prometheus, InfluxDB, and dozens of other data sources to create beautiful dashboards.

Installing Grafana

$ sudo apt-get install -y apt-transport-https software-properties-common
$ wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
$ echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install grafana
$ sudo systemctl enable grafana-server
$ sudo systemctl start grafana-server
# Access at http://localhost:3000 (admin/admin)

Adding Prometheus as Data Source

  1. Open Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources
  3. Click "Add data source"
  4. Select "Prometheus"
  5. Set URL to http://localhost:9090
  6. Click "Save & Test"

Creating a Dashboard

  1. Click "Dashboards" → "New Dashboard"
  2. Click "Add visualization"
  3. Select Prometheus as data source
  4. Enter a PromQL query
  5. Configure visualization (graph, stat, gauge)
  6. Add panel titles, labels, thresholds

Useful Query Examples

# CPU Usage (Graph)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (Gauge)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk Space (Stat)
node_filesystem_size_bytes - node_filesystem_avail_bytes

# Network In/Out (Time series)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

Dashboard JSON

You can import pre-built dashboards:

  • Search grafana.com/dashboards for "Node Exporter"
  • Import ID 1860 for the popular Node Exporter dashboard
  • Import ID 9588 for Prometheus 2.0 stats

Quiz

What is the default Grafana admin password?

What port does Grafana run on by default?

What displays visualizations in Grafana?

What is an individual visualization called?

Answers

  1. admin
  2. 3000
  3. dashboard
  4. panel

// Alerting with Prometheus AlertManager

Why Alerting?

Monitoring without alerting is just looking at pretty graphs. You need to know when things break—before users complain.

AlertManager Architecture

  • Prometheus: Evaluates rules, sends alerts to AlertManager
  • AlertManager: Receives, deduplicates, groups, routes alerts
  • Receivers: Email, Slack, PagerDuty, webhooks

Alerting Rules

$ cat alerts.yml
groups:
- name: alerts
  rules:
  - alert: HighCPU
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes"

  - alert: HighMemory
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"

  - alert: DiskSpaceLow
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"

AlertManager Configuration

$ cat alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'admin@example.com'
    send_resolved: true

- name: 'slack'
  slack_configs:
  - channel: '#alerts'
    api_url: 'https://hooks.slack.com/services/XXX'
    send_resolved: true

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

Alert States

  • Pending: Rule triggered but not for "for" duration yet
  • Firing: Alert active and being sent
  • Resolved: Condition no longer true
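The pending-to-firing transition can be sketched as a small function (times in seconds; an illustration of the "for" clause, not Prometheus internals):

```python
# State of an alert whose condition has been true since breach_start.
def alert_state(breach_start, now, for_duration):
    if breach_start is None:
        return "inactive"
    if now - breach_start >= for_duration:
        return "firing"
    return "pending"

# A 5-minute (300s) "for" clause:
print(alert_state(0, 120, 300))   # pending: true for only 2 minutes so far
print(alert_state(0, 300, 300))   # firing: held for the full duration
```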

Best Practices

  • Use "for" duration to avoid flapping alerts
  • Group related alerts to reduce noise
  • Have runbooks for each alert
  • Start with fewer alerts, add more as you learn

Quiz

What does the "for" clause in an alert rule do?

What handles alert routing and notifications?

What matches alerts to notification routes?

What temporarily mutes alerts?

Answers

  1. time condition must be true before alerting
  2. alertmanager
  3. label
  4. silence

// Building a Complete Monitoring Stack

Putting It All Together

A complete monitoring stack includes:

  • Collection: Prometheus + Exporters
  • Visualization: Grafana
  • Alerting: AlertManager
  • Log Aggregation: Loki or ELK

Docker Compose Setup

$ cat docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

volumes:
  prometheus-data:
  grafana-data:

Production Considerations

  • High Availability: Run redundant Prometheus instances
  • Retention: Configure storage retention (default 15d)
  • Remote Write: Send data to remote storage (Thanos, Cortex)
  • Backup: Grafana dashboards, Prometheus rules
  • Authentication: Secure Grafana with auth proxy

Suggested Metrics to Monitor

# Infrastructure
- CPU usage > 80%
- Memory usage > 90%
- Disk usage > 85%
- Load average > CPU count
- Network errors

# Application
- Request latency (p50, p95, p99)
- Error rate
- Request throughput
- Queue depth

# Business
- Active users
- Transaction volume
- Revenue (if applicable)

Next Steps

  • Explore more exporters: Redis, MongoDB, Kafka
  • Learn PromQL: Recording rules, functions
  • Add tracing: Jaeger or Tempo
  • Implement SLOs: Error budgets, SLIs
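For the SLO step, the core arithmetic is the error budget: how much downtime a given availability target allows. A quick helper:

```python
# Allowed downtime (minutes) for an availability SLO over a window of days.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    return round((1 - slo) * days * 24 * 60, 1)

print(error_budget_minutes(0.999))   # 43.2 minutes per 30 days
print(error_budget_minutes(0.99))    # 432.0 minutes per 30 days
```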

Quiz

What Docker volumes are needed to expose system metrics to node-exporter?

What defines multi-container applications?

What pre-calculates complex queries?

What automatically finds targets to monitor?

Answers

  1. /proc and /sys
  2. docker-compose
  3. recording rules
  4. service discovery

// Distributed Tracing

What is Distributed Tracing?

Distributed tracing tracks requests as they flow through multiple services in a microservices architecture. While metrics tell you what is wrong and logs tell you why, traces tell you where the problem is.

Key Concepts

  • Trace: End-to-end request journey through all services
  • Span: A single operation within a trace (e.g., database query, API call)
  • Span Context: Unique identifiers passed between services (trace ID, span ID)
  • Baggage: Metadata carried across service boundaries
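Span context travels between services as a W3C traceparent header, and building one is simple string formatting (version 00; the IDs below are the same example values used later in this lesson):

```python
# Build a W3C traceparent header: version-trace_id-span_id-flags.
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

header = make_traceparent(
    "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", sampled=True
)
print(header)   # 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```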

Installing Jaeger

$ docker run -d --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    -p 14268:14268 \
    -p 14250:14250 \
    jaegertracing/all-in-one:1.45
# UI: 16686, agent (UDP): 6831, collector: 14268/14250
Tip: The all-in-one image is perfect for local development. For production, use separate components.

Instrumentation with OpenTelemetry

OpenTelemetry is the modern standard for instrumentation:

Python Example

$ pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
    JaegerExporter(agent_host_name="localhost", agent_port=6831)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("customer.id", "cust_789")
    
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass
    
    with tracer.start_as_current_span("update_inventory"):
        # Inventory update logic
        pass

Propagating Context Between Services

Trace context must be propagated via HTTP headers:

# Request headers include trace context
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

# trace_id: 0af7651916cd43dd8448eb211c80319c
# span_id:  b7ad6b7169203331
# sampled:  01 (yes)

Viewing Traces in Jaeger UI

  1. Access Jaeger at http://localhost:16686
  2. Select your service from the dropdown
  3. Search for traces by operation name or tags
  4. Analyze the trace waterfall view

Sample Rate Configuration

In production, sample only a percentage of requests:

# Always sample (development)
sampler:
  type: const
  param: 1

# Probabilistic sampling (production - 1%)
sampler:
  type: probabilistic
  param: 0.01

# Rate limiting (max 10 traces per second)
sampler:
  type: ratelimiting
  param: 10
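Probabilistic sampling is often made deterministic by hashing the trace ID, so every service makes the same keep/drop decision for a given trace. An illustrative sketch (not Jaeger's actual algorithm):

```python
# Head-based sampling decision derived from the trace ID itself:
# map the first 8 hex digits into [0, 1) and compare to the rate.
def probabilistic_sample(trace_id: str, rate: float) -> bool:
    return (int(trace_id[:8], 16) / 0xFFFFFFFF) < rate

tid = "0af7651916cd43dd8448eb211c80319c"
print(probabilistic_sample(tid, 0.05))   # True  (this ID hashes to ~0.043)
print(probabilistic_sample(tid, 0.01))   # False (below the 5% but above 1%)
```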

Quiz

What is a single operation within a trace called?

What is a popular open-source distributed tracing system?

What HTTP header carries trace context between services?

What is the modern standard for observability instrumentation?

Answers

  1. span
  2. jaeger
  3. traceparent
  4. opentelemetry

// Incident Response

What is Incident Response?

Incident response is the organized approach to addressing and managing the aftermath of a security breach or system outage. The goal is to handle the situation in a way that limits damage and reduces recovery time and costs.

The Incident Response Lifecycle

  • Detection: Alert fires, monitoring detects anomaly, user reports issue
  • Triage: Assess severity, determine impact, assign owner
  • Response: Follow runbook, implement fix, communicate status
  • Resolution: Confirm fix works, close incident, notify stakeholders
  • Post-Mortem: Analyze root cause, identify improvements, document lessons

Severity Levels

  • SEV1 (Critical): Complete outage, data loss, or security breach. Response: all hands on deck. SLA: 15 min response, 1 hour resolution target.
  • SEV2 (High): Major functionality degraded. Response: primary on-call engaged. SLA: 30 min response, 4 hour resolution target.
  • SEV3 (Medium): Minor impact, workarounds available. Response: next business day. SLA: 2 hour response, 24 hour resolution target.
  • SEV4 (Low): Cosmetic issues, feature requests. Response: backlog for next sprint. SLA: best effort.

Runbooks - Your Playbook

Every alert should have a runbook. A good runbook includes:

  • Symptoms: What you'll see when this alert fires
  • Impact: Who/what is affected
  • Investigation steps: Commands to run, dashboards to check
  • Resolution steps: How to fix it
  • Escalation: When to call for help

Sample Runbook: High CPU Alert

ALERT: HighCPUUsage > 80% for 5 minutes

1. IDENTIFY:
   $ top -bn1 | head -20
   $ ps aux --sort=-%cpu | head -10

2. ASSESS IMPACT:
   - Check if it's a known batch job
   - Check if user-facing services are degraded
   - Review error rates in Grafana

3. MITIGATE (if needed):
   $ sudo kill -9 <PID>          # Last resort
   $ sudo renice +10 <PID>       # Lower priority

4. SCALE (if applicable):
   $ kubectl scale deployment app --replicas=5

5. POST-INCIDENT:
   - Document what process caused it
   - Create ticket for permanent fix
   - Update alert threshold if needed

On-Call Best Practices

Tip: Rotate on-call regularly. No one should be on-call for more than a week at a time. Burnout is real.
  • Primary/Secondary: Always have a backup on-call person
  • Follow-the-sun: Distribute on-call across time zones
  • Compensation: Pay on-call hours or give time off
  • War rooms: Use Slack/Zoom for major incidents
  • Page sparingly: Only wake people for SEV1/SEV2

Alert Fatigue - The Silent Killer

Too many alerts leads to ignored alerts. Combat this with:

  • Actionable alerts only: If you can't act on it, don't alert
  • Grouping: One alert per incident, not per symptom
  • Smart thresholds: Avoid flapping alerts with proper "for" durations
  • Regular review: Monthly alert quality review
  • Auto-remediation: Auto-restart services, clear caches
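Grouping is simple to picture: collapse many alert instances into one notification per alertname. A sketch of the idea (AlertManager's real grouping is configurable and richer):

```python
from collections import defaultdict

# One notification per alertname, listing every affected instance.
def group_alerts(alerts):
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["alertname"]].append(alert["instance"])
    return dict(groups)

alerts = [
    {"alertname": "HighCPU", "instance": "web1"},
    {"alertname": "HighCPU", "instance": "web2"},
    {"alertname": "DiskSpaceLow", "instance": "db1"},
]
print(group_alerts(alerts))   # two notifications instead of three pages
```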

Post-Mortem Template

INCIDENT POST-MORTEM: Database Outage - 2024-01-15

SUMMARY:
- Duration: 45 minutes
- Severity: SEV1
- Impact: 12,000 users unable to login
- Root Cause: Connection pool exhaustion

TIMELINE:
14:23 - Alert fired: DB connection pool 95% full
14:25 - On-call engineer acknowledged
14:28 - Identified runaway query from new deployment
14:30 - Killed long-running queries, restarted pool
14:35 - Service recovered, error rates normal

ROOT CAUSE:
New feature deployed without proper query optimization.
Missing index on user_sessions.last_activity column.

LESSONS LEARNED:
1. Need query review for all DB changes
2. Add connection pool alert at 80%
3. Deploy new features with feature flags

ACTION ITEMS:
- [ ] @dba Add index on user_sessions.last_activity (Due: 1/17)
- [ ] @dev Add query performance tests to CI (Due: 1/20)
- [ ] @sre Update runbook with query kill commands (Due: 1/16)
Tip: Post-mortems are blameless. Focus on systemic issues, not individual mistakes. The goal is learning, not punishment.

Incident Communication

Keep stakeholders informed:

  • Internal status page: Atlassian Statuspage, Cachet
  • Slack updates: #incidents channel with automated status
  • Customer comms: Clear, honest, no blame-shifting
  • All-clear: Confirm fix, monitor for 1 hour before closing

Quiz

What is a documented procedure for handling specific alerts?

What severity level indicates a complete outage?

What is the blameless analysis after an incident?

What occurs when too many alerts cause them to be ignored?

Answers

  1. runbook
  2. sev1
  3. post-mortem
  4. alert fatigue