// Know your system. Watch everything.
YOU CAN'T MANAGE WHAT YOU DON'T MEASURE.
Whether it's a sluggish server, a memory leak, or a network bottleneck—monitoring gives you visibility. Before users complain, you should know there's a problem.
KNOW YOUR NORMAL.
Establish baselines. Track trends. Set alerts. The difference between a sysadmin who sleeps at night and one who doesn't is their monitoring stack.
OBSERVABILITY IS FREEDOM.
Logs, metrics, traces—understand all three. When something breaks, you'll know exactly where to look. No more guessing. No more panic.
12 lessons. Complete visibility.
1. Why monitoring matters and key concepts (Beginner)
2. Interactive process and resource viewers (Beginner)
3. Understanding RAM usage and CPU metrics (Intermediate)
4. Track disk usage, IOPS, and storage health (Intermediate)
5. journalctl, rsyslog, and log aggregation (Intermediate)
6. Modern monitoring with time-series metrics (Intermediate)
7. Export system metrics to Prometheus (Intermediate)
8. Create dashboards and alerts (Intermediate)
9. Get notified when things go wrong (Advanced)
10. Integrate Prometheus, Grafana, and exporters (Advanced)
11. Track requests across microservices with Jaeger (Advanced)
12. On-call procedures, runbooks, and post-mortems (Advanced)
Effective monitoring helps you spot problems before users do.
The monitoring triangle: logs, metrics, and traces.
What gets measured gets managed. What gets monitored gets fixed.
Monitoring is the practice of collecting, analyzing, and acting on data from your systems. It rests on three pillars: logs, metrics, and traces.
$ uptime    # System load average
$ w         # Who is logged in and what they're doing
$ free -h   # Memory usage in human-readable format
$ df -h     # Disk usage
Which of these is an example of a metric?
What is the data type that stores metric values over time?
What model does Prometheus use to collect metrics?
What model uses clients sending metrics to a central server?
What triggers when a metric crosses a threshold?
The top command is the classic real-time process monitor:
$ top               # Start top
$ top -u username   # Show only user's processes
$ top -p 1234       # Monitor specific PID
$ top -d 2          # 2 second delay between updates
top - 14:30:25 up 45 days,  3:22,  2 users,  load average: 0.52, 0.58, 0.59
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.2%us,  2.1%sy,  0.0%ni, 92.1%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   8192404k total,  5678232k used,  2514172k free,   412344k buffers
Swap:  2097148k total,        0k used,  2097148k free,  3245676k cached
Load average shows the average number of processes in a runnable (or uninterruptibly waiting) state over the last 1, 5, and 15 minutes:
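A hedged sketch of how you might put those numbers in context: compare the 1-minute load against the number of CPU cores, since sustained load above the core count means processes are queuing.

```shell
# Read the 1-minute load average and compare it to the core count.
# The warning wording is illustrative, not from any standard tool.
load=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
awk -v l="$load" -v c="$cores" 'BEGIN {
    printf "load=%.2f cores=%d\n", l, c
    if (l > c) print "WARNING: runnable processes are queuing"
    else       print "OK: load within capacity"
}'
```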
htop is a more user-friendly version with color and mouse support:
$ htop               # Start htop
$ htop -u username   # Filter by user
$ htop -d 10         # Refresh delay in tenths of a second (here: 1s)
What does load average measure?
What is an enhanced version of top?
What are the three time periods for load average?
What CPU stat shows time waiting for I/O?
What command shows system uptime and load?
Linux uses memory intelligently—it caches disk data for performance:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       3.2Gi       1.2Gi       123Mi       3.3Gi       4.3Gi
Swap:          2.0Gi          0B       2.0Gi
$ cat /proc/meminfo        # Detailed memory info
$ vmstat 1                 # Virtual memory statistics
$ ps aux --sort=-%mem      # Top memory consumers
$ pmap -x 1234             # Memory map of process
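A short sketch computing the used-memory percentage directly from /proc/meminfo, using MemAvailable rather than MemFree (free memory alone undercounts what new processes can actually get, because caches are reclaimable):

```shell
# Used memory % = (MemTotal - MemAvailable) / MemTotal * 100
awk '/^MemTotal:/     {t=$2}
     /^MemAvailable:/ {a=$2}
     END { printf "memory used: %.1f%%\n", (t - a) / t * 100 }' /proc/meminfo
```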
$ mpstat -P ALL 1   # Per-CPU stats
$ sar -u 1 5        # Historical CPU stats
$ ps -eo pcpu,pid,user,args | sort -k 1 -r | head -10   # Top CPU users
In "free" output, which value shows memory available for new processes?
What memory is used by the kernel for caches?
What command shows virtual memory statistics?
What controls kernel tendency to swap?
Slow disk I/O can cripple your server. Monitor for full filesystems, inode exhaustion, saturated I/O, and failing drives:
$ df -h              # Disk space usage
$ df -i              # Inode usage
$ du -sh /var/log    # Directory size
$ du -ah --max-depth=1 | sort -hr | head -10   # Largest files/dirs
$ iostat -xz 1   # Extended I/O stats per device
Device   r/s   w/s   rkB/s  wkB/s  await  svctm  %util
sda      2.50  0.50  20.00   4.00  10.50   5.50  15.00
$ sudo iotop           # Interactive I/O monitor
$ sudo iotop -o        # Only show active I/O
$ sudo iotop -b -n 3   # Batch mode, 3 iterations
$ sudo smartctl -a /dev/sda   # Full SMART data
$ sudo smartctl -H /dev/sda   # Health check
What does "await" represent in iostat output?
What command shows CPU and disk I/O stats?
What shows per-process I/O usage?
What shows disk space usage?
$ tail -f /var/log/syslog        # Follow log in real time
$ tail -n 100 /var/log/syslog    # Last 100 lines
$ grep "error" /var/log/syslog   # Search for errors
$ less +F /var/log/syslog        # Read and follow
$ journalctl                        # All logs
$ journalctl -u nginx.service       # Specific unit
$ journalctl -f                     # Follow
$ journalctl --since "1 hour ago"   # Time range
$ journalctl -p err                 # Error priority and worse
$ journalctl -b                     # Current boot
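These filters compose well into simple health checks. A hedged sketch, where the 50-errors-per-hour threshold is an arbitrary example value, not a recommendation:

```shell
# Count error-or-worse journal messages from the last hour and flag
# an unusually high count. -q suppresses journalctl's own notices.
errors=$(journalctl -p err --since "1 hour ago" -q --no-pager | wc -l)
if [ "$errors" -gt 50 ]; then
    echo "ALERT: $errors error-level messages in the last hour"
else
    echo "OK: $errors error-level messages in the last hour"
fi
```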
$ cat /etc/logrotate.conf   # Main config
$ ls /etc/logrotate.d/      # Service-specific configs
Example /etc/logrotate.d/myapp:
/var/log/myapp/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
create 0640 root adm
}
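A sketch for verifying a config like the one above before the nightly cron run picks it up (assumes logrotate is installed and the file is saved at that path):

```shell
# -d is a dry run (debug mode): it reports what logrotate would do
# without actually rotating anything.
sudo logrotate -d /etc/logrotate.d/myapp

# Force one rotation to confirm permissions and compression settings.
sudo logrotate -f /etc/logrotate.d/myapp
```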
For multiple servers, consider centralized log aggregation: forward logs with rsyslog to a single collection host and search them in one place.
Which command shows logs from a specific systemd service?
Where are system logs typically stored?
What shows the last lines of a log file?
What searches for patterns in text?
Prometheus is an open-source monitoring system with a time-series database, a pull-based collection model, the PromQL query language, and built-in alerting:
$ wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
$ tar xzf prometheus-*.tar.gz
$ cd prometheus-2.45.0.linux-amd64
$ cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF
$ ./prometheus --config.file=prometheus.yml
# Access UI at http://localhost:9090
# CPU usage of all nodes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Requests per second
rate(http_requests_total[5m])
# Top 5 CPU consumers
topk(5, sum by (process) (rate(process_cpu_seconds_total[5m])))
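These queries can also be run outside the web UI through Prometheus's HTTP API. A sketch, assuming a server is listening on localhost:9090; the response is JSON with a top-level "status" field:

```shell
# Instant query via the HTTP API; --data-urlencode handles the
# special characters in PromQL expressions.
curl -s 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=rate(http_requests_total[5m])' |
     grep -o '"status":"[a-z]*"'
```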
What is the default Prometheus scrape interval?
What is Prometheus's query language called?
What function calculates per-second rate?
What adds metadata to metrics?
Node Exporter exposes system metrics for Prometheus—CPU, memory, disk, network, and more.
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
$ tar xzf node_exporter-*.tar.gz
$ cd node_exporter-1.6.1.linux-amd64
$ ./node_exporter
# Metrics available at http://localhost:9100/metrics
# CPU
node_cpu_seconds_total{mode="idle|user|system|iowait|..."}
# Memory
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
# Disk
node_filesystem_avail_bytes
node_filesystem_size_bytes
node_disk_io_time_seconds_total
# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
# CPU by mode
sum by (mode) (rate(node_cpu_seconds_total[5m])) * 100
# Memory in GB
node_memory_MemTotal_bytes / 1024^3
node_memory_MemAvailable_bytes / 1024^3
# Disk usage %
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
$ sudo useradd -rs /bin/false node_exporter
$ sudo cp node_exporter /usr/local/bin/
$ sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
$ sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable node_exporter
$ sudo systemctl start node_exporter
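After enabling the service, a quick sanity check (sketch) confirms metrics are actually being served on port 9100:

```shell
# Count node_cpu samples the exporter exposes; a non-zero count
# means scraping will work.
curl -s http://localhost:9100/metrics | grep -c '^node_cpu_seconds_total'
```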
What port does Node Exporter run on by default?
What exporter collects system metrics?
What monitors Docker containers?
What probes endpoints for HTTP/TCP/ICMP?
Grafana is the visualization layer—connect to Prometheus, InfluxDB, and dozens of other data sources to create beautiful dashboards.
$ sudo apt-get install -y apt-transport-https software-properties-common
$ wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
$ echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install grafana
$ sudo systemctl enable grafana-server
$ sudo systemctl start grafana-server
# Access at http://localhost:3000 (admin/admin)
# CPU Usage (Graph)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage (Gauge)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk Space (Stat)
node_filesystem_size_bytes - node_filesystem_avail_bytes
# Network In/Out (Time series)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
You can also import pre-built community dashboards from the Grafana dashboard library by ID instead of building every panel yourself.
What is the default Grafana admin password?
What port does Grafana run on by default?
What displays visualizations in Grafana?
What is an individual visualization called?
Monitoring without alerting is just looking at pretty graphs. You need to know when things break—before users complain.
$ cat alerts.yml
groups:
  - name: alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
$ cat alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXX'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
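Both files can be validated before reloading, using promtool (ships with Prometheus) and amtool (ships with Alertmanager). A sketch:

```shell
# Fail fast on syntax or schema errors in either config.
promtool check rules alerts.yml &&
amtool check-config alertmanager.yml &&
echo "configs OK"

# Hot-reload Prometheus without a restart (requires it to have been
# started with --web.enable-lifecycle).
curl -X POST http://localhost:9090/-/reload
```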
What does the "for" clause in an alert rule do?
What handles alert routing and notifications?
What matches alerts to notification routes?
What temporarily mutes alerts?
A complete monitoring stack includes Prometheus, Node Exporter, Grafana, and Alertmanager:
$ cat docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
volumes:
  prometheus-data:
  grafana-data:
# Infrastructure
- CPU usage > 80%
- Memory usage > 90%
- Disk usage > 85%
- Load average > CPU count
- Network errors
# Application
- Request latency (p50, p95, p99)
- Error rate
- Request throughput
- Queue depth
# Business
- Active users
- Transaction volume
- Revenue (if applicable)
What Docker volumes are needed to expose system metrics to node-exporter?
What defines multi-container applications?
What pre-calculates complex queries?
What automatically finds targets to monitor?
Distributed tracing tracks requests as they flow through multiple services in a microservices architecture. While metrics tell you what is wrong and logs tell you why, traces tell you where the problem is.
OpenTelemetry is the modern standard for instrumentation:
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
    JaegerExporter(agent_host_name="localhost", agent_port=6831)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("customer.id", "cust_789")

    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass

    with tracer.start_as_current_span("update_inventory"):
        # Inventory update logic
        pass
Trace context must be propagated via HTTP headers:
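A minimal sketch of what that looks like on the wire, using the W3C Trace Context traceparent header (format: version-traceid-spanid-flags); the URL is a made-up example service and the IDs are illustrative values:

```shell
# A downstream call that carries the caller's trace context, so the
# receiving service can attach its spans to the same trace.
curl -s http://localhost:8080/orders \
     -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
```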
In production, sample only a percentage of requests:
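One way to do that, as a sketch: the OpenTelemetry SDKs honor standard sampler environment variables, so a service can be launched with a trace-ID-ratio sampler that keeps roughly 10% of traces.

```shell
# Sample ~10% of traces, decided deterministically by trace ID so all
# services in one request agree on the sampling decision.
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
# ...then start the instrumented service with these variables set.
```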
What is a single operation within a trace called?
What is a popular open-source distributed tracing system?
What HTTP header carries trace context between services?
What is the modern standard for observability instrumentation?
Incident response is the organized approach to addressing and managing the aftermath of a security breach or system outage. The goal is to handle the situation in a way that limits damage and reduces recovery time and costs.
Every alert should have a runbook. A good runbook includes what the alert means, diagnostic steps, remediation commands, and escalation contacts.
Too many alerts lead to ignored alerts. Combat this alert fatigue by tuning thresholds, grouping related alerts, and routing by severity.
Keep stakeholders informed with regular status updates throughout an incident.
What is a documented procedure for handling specific alerts?
What severity level indicates a complete outage?
What is the blameless analysis after an incident?
What occurs when too many alerts cause them to be ignored?