Files
nuc/docs/monitoring.md
Alejandro Gutiérrez 8b503a549c Add operational documentation
CloudBeaver database manager guide, Ecija intranet deployment,
Gitea-Coolify auto-deploy and integration docs, monitoring setup
with presentation, remote access guide, security architecture,
and Turbostarter deployment procedure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 15:17:18 +01:00

157 lines
6.2 KiB
Markdown

# NUC Monitoring & Recovery Setup
**Date:** 2026-02-02 22:20
**Context:** Complete monitoring stack deployment with auto-recovery and remote access
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ MONITORING STACK │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ scrape ┌─────────────┐ │
│ │ OpenWrt │◄─────────────│ Prometheus │ │
│ │ :9100 │ │ :9091 │ │
│ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ┌─────────────┐ scrape │ ┌─────────────┐ │
│ │ NUC Node │◄───────────────────┤ │ Alertmanager│ │
│ │ Exporter │ │ │ :9093 │ │
│ │ :9100 │ ┌─────┴────┐ └──────┬──────┘ │
│ └─────────────┘ │ Grafana │ │ │
│ │ :3333 │ ▼ │
│ └──────────┘ ┌───────────┐ │
│ │ntfy Bridge│ │
│ │ :9095 │ │
│ └─────┬─────┘ │
│ │ │
│ ▼ │
│ ntfy.sh/nuc-watchdog │
└─────────────────────────────────────────────────────────────────────┘
```
## Access URLs
| Service | Local URL | Remote URL | Credentials |
|---------|-----------|------------|-------------|
| **Grafana** | http://192.168.1.3:3333 | https://alezmad-nuc.tail58f5ad.ts.net | admin / nucmonitoring |
| **Prometheus** | http://192.168.1.3:9091 | - | - |
| **Alertmanager** | http://192.168.1.3:9093 | - | - |
| **OpenWrt Metrics** | http://192.168.1.1:9100 | - | - |
## Prometheus Targets
| Job | Target | Scrape Interval |
|-----|--------|-----------------|
| `prometheus` | localhost:9090 | 15s |
| `nuc-node` | 192.168.1.3:9100 | 15s |
| `openwrt` | 192.168.1.1:9100 | 30s |
## Alert Rules
### NUC Alerts (`/opt/monitoring/alert_rules.yml`)
| Alert | Condition | Severity |
|-------|-----------|----------|
| NUCDown | `up{job="nuc-node"} == 0` for 1m | critical |
| HighCPULoad | CPU > 80% for 5m | warning |
| HighMemoryUsage | Memory > 85% for 5m | warning |
| DiskSpaceLow | Disk > 85% for 5m | warning |
### OpenWrt Alerts
| Alert | Condition | Severity |
|-------|-----------|----------|
| OpenWrtDown | `up{job="openwrt"} == 0` for 1m | critical |
| OpenWrtHighLoad | Load > 2 for 5m | warning |
## Auto-Recovery Layers
| Layer | Component | Action | Trigger |
|-------|-----------|--------|---------|
| 1 | OpenWrt Monitor | WoL packet | HTTP+Ping fail (3x) |
| 2 | Hardware Watchdog | Auto-reboot | System freeze |
| 3 | Kernel Panic | Auto-reboot (10s) | Kernel panic |
| 4 | WoL | Wake from power off | Manual or script |
## Configuration Files
### NUC (`/opt/monitoring/`)
- `prometheus.yml` - Prometheus configuration
- `alert_rules.yml` - Alert rules
- `alertmanager.yml` - Alertmanager → ntfy bridge
- `alertmanager-ntfy-bridge.py` - Webhook translator
### OpenWrt (`/opt/`)
- `monitor-nuc.sh` - Health check daemon (0.5Hz)
### Systemd Services
- `prometheus-node-exporter.service` - NUC metrics
- `alertmanager-ntfy-bridge.service` - Alert translator
## Notifications
**ntfy Topic:** `nuc-watchdog`
Subscribe via:
- App: ntfy (iOS/Android)
- Web: https://ntfy.sh/nuc-watchdog
**Alert Sources:**
1. OpenWrt watchdog (direct to ntfy.sh)
2. Alertmanager → bridge → ntfy.sh
## Remote Access
### Tailscale (NUC)
- Funnel URL: https://alezmad-nuc.tail58f5ad.ts.net
- Exposes: Grafana (:3333)
### WireGuard (OpenWrt)
- Endpoint: 5.224.196.245:51820
- VPN Subnet: 10.10.10.0/24
- Config: `~/wireguard/home-vpn.conf`
## Grafana Setup
**Data Source:** Prometheus (http://prometheus:9090)
**Imported Dashboards:**
- Node Exporter Full (ID: 1860)
## Maintenance Commands
```bash
# Check Prometheus targets
curl -s http://192.168.1.3:9091/api/v1/targets | jq '.data.activeTargets[].health'
# Check active alerts
curl -s http://192.168.1.3:9093/api/v2/alerts | jq '.[].labels.alertname'
# Restart monitoring stack
ssh nuc "docker restart prometheus-r0wg4gwoow44kkkc8skc4kwg alertmanager-r0wg4gwoow44kkkc8skc4kwg grafana-r0wg4gwoow44kkkc8skc4kwg"
# Check OpenWrt monitor
ssh root@192.168.1.1 "ps | grep monitor"
# Test alert
curl -X POST http://192.168.1.3:9093/api/v2/alerts -H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test"}}]'
# Manual ntfy test
curl -d "Test message" https://ntfy.sh/nuc-watchdog
```
## Container UUIDs (Coolify)
| Service | UUID |
|---------|------|
| Monitoring Stack | r0wg4gwoow44kkkc8skc4kwg |
## Related
- OpenWrt NUC Monitor: `/opt/monitor-nuc.sh`
- Kernel panic config: `/etc/sysctl.d/99-auto-reboot.conf`
- WireGuard config: `~/wireguard/home-vpn.conf`