Monitoring

Your server is lying to you. It says everything is fine when it's actually on fire. Why? Because you're not watching. You're flying blind.

Hope is not a strategy. If you're not monitoring your systems, you're not running a service. You're running a guessing game.

The Bare Minimum

Before you do anything else, set up basic system monitoring. Today. Right now:

# CPU, memory, disk - the holy trinity
top -bn1 | head -15   # one batch-mode snapshot; plain `top` is interactive
free -h
df -h

# What's running
ps aux | head -20

# Network connections
ss -tunap

# Disk I/O (iostat comes from the sysstat package)
iostat -x 1 5

These commands are your eyes. Run them regularly. Know what's normal so you recognize the abnormal.
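
Running these by hand only works if you remember to do it. One way to make "know what's normal" concrete is a small check that flags filesystems above a usage threshold. This is a minimal sketch, assuming POSIX `df -P` output and mount points without spaces; the helper name `disks_over` and the 90% default are made up for illustration.

```shell
#!/bin/sh
# disks_over THRESHOLD: read `df -P` style output on stdin and print
# "mountpoint percent" for every filesystem at or above THRESHOLD percent.
# (Hypothetical helper name; the default threshold is an arbitrary choice.)
disks_over() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5                  # fifth column is capacity, e.g. "42%"
            sub(/%/, "", use)         # strip the percent sign
            if (use + 0 >= limit) print $6, use "%"
        }'
}

# Check the live system; prints nothing when every filesystem is below 90%.
df -P | disks_over 90
```

Drop something like this into cron and a silent run means the disks are fine. It is a stopgap, not a replacement for real monitoring.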

Enter Prometheus

Prometheus is the monitoring system for people who actually give a damn. It scrapes metrics over HTTP, stores them in its own built-in time-series database, and lets you query them with PromQL. No middleware. No external database required. Just pull-based monitoring that works.

# Install Prometheus (release tarballs are hosted on GitHub)
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Run it
./prometheus --config.file=prometheus.yml

Prometheus is open source under the Apache 2.0 license and hosted by the Cloud Native Computing Foundation. It's not owned by a corporation trying to lock you in.
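
Out of the box, the bundled config only has Prometheus scrape itself. To see host metrics it needs a target that exposes them. Here is a rough sketch of a config with a second scrape job, assuming node_exporter is already running on its default port 9100. The file name `prometheus-node.yml` is an arbitrary choice so the shipped `prometheus.yml` stays untouched.

```shell
#!/bin/sh
# Write a minimal Prometheus config to a separate file.
# CONFIG_FILE and its default name are assumptions for this sketch.
CONFIG_FILE="${CONFIG_FILE:-prometheus-node.yml}"

cat > "$CONFIG_FILE" <<'EOF'
global:
  scrape_interval: 15s        # how often Prometheus pulls metrics

scrape_configs:
  # Prometheus scraping its own metrics endpoint
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Assumes node_exporter is already running on its default port, 9100
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
EOF

echo "wrote $CONFIG_FILE"
```

From the Prometheus directory, `./promtool check config prometheus-node.yml` should validate the file, and `./prometheus --config.file=prometheus-node.yml` starts the server with it. Once it is up, the query `up` in the expression browser at http://localhost:9090 shows which targets are reachable.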

Grafana: Make It Pretty

Raw numbers don't tell stories. Grafana does. Connect it to Prometheus and build dashboards that show you exactly what's happening.

# Install Grafana
wget https://dl.grafana.com/oss/release/grafana-10.0.0.linux-amd64.tar.gz
tar -zxvf grafana-10.0.0.linux-amd64.tar.gz
cd grafana-10.0.0

# Start it
./bin/grafana-server

Point your browser to http://localhost:3000. Default login is admin/admin. Change that immediately.
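
You can add Prometheus as a data source by clicking through the UI, or let Grafana pick it up from a provisioning file at startup. The sketch below takes the second route. It assumes the tarball layout above, where Grafana reads `conf/provisioning` under its install directory, and Prometheus on its default port 9090. `GRAFANA_HOME` is a hypothetical variable for this example.

```shell
#!/bin/sh
# Run from the extracted Grafana directory, or point GRAFANA_HOME at it.
# GRAFANA_HOME is an assumption of this sketch, not a Grafana setting.
GRAFANA_HOME="${GRAFANA_HOME:-.}"
DS_DIR="$GRAFANA_HOME/conf/provisioning/datasources"

mkdir -p "$DS_DIR"
cat > "$DS_DIR/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests
    url: http://localhost:9090    # assumes Prometheus on its default port
    isDefault: true
EOF

echo "wrote $DS_DIR/prometheus.yaml"
```

Provisioning files are read when Grafana starts, so restart it after writing the file. The upside of this approach is that the data source lives in a file you can keep in version control.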

The Metrics That Matter

Don't monitor everything. Monitor what kills you:

CPU usage - If it stays pinned near 100% for minutes at a time, something is wrong
Memory - OOM killer is not your friend
Disk - Full disks break everything
Network - Saturation means dropped packets
Latency - Slow responses drive users away
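
Each of those maps to a query you can start from. The helper below is a sketch, not a canonical set: `metric_query` is a made-up name, the `node_*` metrics assume node_exporter is being scraped, and `http_request_duration_seconds_bucket` is a placeholder for whatever latency histogram your application exposes.

```shell
#!/bin/sh
# metric_query NAME: print a starting-point PromQL expression for one of the
# five signals above. Hypothetical helper; tune the expressions to your setup.
metric_query() {
    case "$1" in
        # Percent of CPU time not spent idle, per instance
        cpu)     echo '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' ;;
        # Percent of memory in use
        memory)  echo '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100' ;;
        # Percent of each filesystem in use
        disk)    echo '(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100' ;;
        # Received packets dropped per second
        network) echo 'rate(node_network_receive_drop_total[5m])' ;;
        # 95th percentile request latency (placeholder metric name)
        latency) echo 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))' ;;
        *)       echo "unknown metric: $1" >&2; return 1 ;;
    esac
}

# Print all five so they can be pasted into the Prometheus expression browser.
for m in cpu memory disk network latency; do
    printf '%-8s %s\n' "$m" "$(metric_query "$m")"
done
```

To run one against a live server, something like `curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=$(metric_query memory)"` should return JSON from the Prometheus HTTP API, assuming it listens on the default port.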

Alerting: Wake Up

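Monitoring without alerting is just expensive logging. Configure alerts that actually matter. This rule fires when average CPU usage on an instance stays above 80% for five minutes (it needs node_exporter metrics, so make sure Prometheus is scraping it):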

groups:
- name: alerts
  rules:
  - alert: HighCPU
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU on {{ $labels.instance }}"

Only alert on things that need action. Alert fatigue is real. When everything is an alert, nothing is.
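
A rule does nothing until Prometheus loads it. The sketch below writes a second rule in the same format as HighCPU and validates it with `promtool`, which ships in the Prometheus tarball. The file name `alert_rules.yml`, the 90% threshold, and the 10 minute window are illustrative assumptions, and the expression assumes node_exporter metrics.

```shell
#!/bin/sh
# Hypothetical file name for this sketch; adjust to your layout.
RULES_FILE="${RULES_FILE:-alert_rules.yml}"

# Warn when a filesystem has been more than 90% full for 10 minutes.
# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$RULES_FILE" <<'EOF'
groups:
- name: disk
  rules:
  - alert: DiskAlmostFull
    expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Disk almost full on {{ $labels.instance }} ({{ $labels.mountpoint }})"
EOF

# Validate the file if promtool is in the current directory.
if [ -x ./promtool ]; then
    ./promtool check rules "$RULES_FILE"
else
    echo "promtool not found in current directory; skipping validation"
fi
```

To load it, list the file under a top-level `rule_files:` key in your Prometheus config and restart. Keep in mind that Prometheus only evaluates rules and marks alerts as firing. Getting a page or an email out of it takes Alertmanager, which is a separate component.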

The Point

You built the system. You deployed it. Now watch it. The moment you stop monitoring is the moment something breaks and you don't find out until a user complains.

Use open tools. Prometheus, Grafana, node_exporter - all free, all open source. Build your monitoring stack on software that respects your freedom.

Learn More

Monitoring Guide - Interactive lessons
Prometheus Blog - Metrics collection
Grafana Blog - Visualize data
Logging Blog - Log management