Your server is lying to you. It says everything is fine when it's actually on fire. Why? Because you're not watching. You're flying blind. That's not a strategy - that's hope.
The Bare Minimum
Before you do anything else, set up basic system monitoring. Today. Right now:
# CPU, memory, disk - the holy trinity
top -bn1 | head -20
free -h
df -h
# What's running
ps aux | head -20
# Network connections
ss -tunap
# Disk I/O
iostat -x 1 5
These commands are your eyes. Run them regularly. Know what's normal so you recognize the abnormal.
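You can't know what's normal if you never write it down. Here is a minimal sketch of a baseline logger, assuming a Linux host with /proc mounted. The snapshot function name and the BASELINE_LOG path are made up for illustration; rename them to suit your setup.

```shell
#!/bin/sh
# Append one timestamped line of key system numbers to a log file.
# BASELINE_LOG is a hypothetical path; point it wherever suits you.
BASELINE_LOG="${BASELINE_LOG:-./baseline.log}"

snapshot() {
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # 1-minute load average: first field of /proc/loadavg
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    # Available memory in MiB, from the kernel's own estimate
    mem=$(awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo)
    # Percent used on the root filesystem (5th column of POSIX df output)
    disk=$(df -P / | awk 'NR == 2 {sub(/%/, "", $5); print $5}')
    printf '%s load1=%s mem_avail_mib=%s root_used_pct=%s\n' \
        "$ts" "$load" "$mem" "$disk"
}

snapshot >> "$BASELINE_LOG"
```

Run it from cron every few minutes and after a week you have a picture of normal to compare against when things look off.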
Enter Prometheus
Prometheus is the monitoring system for people who actually give a damn. It scrapes metrics over HTTP, stores them in its own time-series database, and lets you query them with PromQL. No middleware. No external database. Just pull-based monitoring that works.
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
# Run it
./prometheus --config.file=prometheus.yml
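Prometheus only sees what you tell it to scrape. The sketch below writes a minimal prometheus.yml with two jobs: Prometheus itself and a local node_exporter. It assumes node_exporter is already running on its default port, 9100, and that you run this from inside the extracted Prometheus directory. Note that it overwrites the sample config shipped in the tarball.

```shell
# Write a minimal config: Prometheus scrapes itself and a local node_exporter.
# Assumption: node_exporter is listening on localhost:9100 (its default port).
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
EOF

# promtool ships in the same tarball; validate the file if it is available.
if [ -x ./promtool ]; then
    ./promtool check config prometheus.yml
fi
```

Once Prometheus is up, the node job is what supplies host metrics such as node_cpu_seconds_total.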
Prometheus is open source. It's not owned by a corporation trying to lock you in. It just works.
Grafana: Make It Pretty
Raw numbers don't tell stories. Grafana does. Connect it to Prometheus and build dashboards that show you exactly what's happening.
# Install Grafana
wget https://dl.grafana.com/oss/release/grafana-10.0.0.linux-amd64.tar.gz
tar -zxvf grafana-10.0.0.linux-amd64.tar.gz
cd grafana-10.0.0
# Start it
./bin/grafana-server
Point your browser to http://localhost:3000. Default login is admin/admin. Change that immediately.
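You can click through the UI to add Prometheus as a data source, or you can put it in a file. Here is a hedged sketch of the file route, assuming you run it from inside the grafana-10.0.0 directory, that Grafana is using its default conf/provisioning path, and that Prometheus is listening on localhost:9090. The file name prometheus.yaml is arbitrary.

```shell
# Provision a Prometheus data source so Grafana picks it up at startup.
# Assumptions: default provisioning path, Prometheus on localhost:9090.
mkdir -p conf/provisioning/datasources
cat > conf/provisioning/datasources/prometheus.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
```

Grafana reads provisioning files when it starts, so restart the server if it was already running.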
The Metrics That Matter
Don't monitor everything. Monitor what kills you:
- CPU usage - If it hits 100% for too long, something is wrong
- Memory - OOM killer is not your friend
- Disk - Full disks break everything
- Network - Saturation means dropped packets
- Latency - Slow responses drive users away
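Take the disk item as an example. Before any dashboards exist, a few lines of shell can tell you which filesystems are close to full. This is a rough sketch, not a replacement for real monitoring: over_threshold is a made-up helper name, the 90 is an arbitrary limit, and mount points containing spaces will not parse cleanly.

```shell
# Print every mount point whose usage is at or above a threshold.
# Reads `df -P` output on stdin, so it also works on saved or canned output.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # "95%" -> "95"
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}

# Flag anything at 90% or more on this machine.
df -P | over_threshold 90
```

No output means nothing crossed the line.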
Alerting: Wake Up
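Monitoring without alerting is just expensive logging. Configure alerts that actually matter. This one fires when a host averages over 80% CPU for five minutes, and it needs node_exporter running to supply node_cpu_seconds_total: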
groups:
  - name: alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
Only alert on things that need action. Alert fatigue is real. When everything is an alert, nothing is.
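The HighCPU expression is just "100 minus the idle share over a window". To make that arithmetic concrete, here is the same calculation done by hand from /proc/stat over one second. It is a sketch assuming a Linux host; busy_pct and sample are illustrative names, not part of any tool.

```shell
# busy_pct IDLE1 TOTAL1 IDLE2 TOTAL2
# Given idle and total jiffies at two points in time, print busy CPU percent.
busy_pct() {
    awk -v i1="$1" -v t1="$2" -v i2="$3" -v t2="$4" 'BEGIN {
        dt = t2 - t1
        if (dt <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 - (i2 - i1) / dt * 100
    }'
}

# Read the aggregate "cpu" line: print idle jiffies and the sum of
# user through steal (fields 2 to 9; guest time is already inside user).
sample() {
    awk '/^cpu / {
        total = 0
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        print $5, total
    }' /proc/stat
}

set -- $(sample); i1=$1; t1=$2
sleep 1
set -- $(sample); i2=$1; t2=$2
busy_pct "$i1" "$t1" "$i2" "$t2"
```

Prometheus does the same thing with a five-minute window and then waits another five minutes before firing. To check a rule file's syntax before loading it, the promtool binary in the Prometheus tarball has a check rules subcommand.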
The Point
You built the system. You deployed it. Now watch it. The moment you stop monitoring is the moment something breaks and you don't find out until a user complains.
Use open tools. Prometheus, Grafana, node_exporter - all free, all open source. Build your monitoring stack on software that respects your freedom.
Because the best time to set up monitoring was before something broke. The second best time is now.