Add operational documentation
CloudBeaver database manager guide, Ecija intranet deployment, Gitea-Coolify auto-deploy and integration docs, monitoring setup with presentation, remote access guide, security architecture, and Turbostarter deployment procedure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
156
docs/monitoring.md
Normal file
@@ -0,0 +1,156 @@
# NUC Monitoring & Recovery Setup

**Date:** 2026-02-02 22:20
**Context:** Complete monitoring stack deployment with auto-recovery and remote access

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                          MONITORING STACK                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    scrape    ┌─────────────┐                       │
│  │   OpenWrt   │◄─────────────│  Prometheus │                       │
│  │    :9100    │              │    :9091    │                       │
│  └─────────────┘              └──────┬──────┘                       │
│                                      │                              │
│  ┌─────────────┐    scrape           │         ┌─────────────┐      │
│  │  NUC Node   │◄────────────────────┤         │ Alertmanager│      │
│  │  Exporter   │                     │         │    :9093    │      │
│  │    :9100    │               ┌─────┴────┐    └──────┬──────┘      │
│  └─────────────┘               │ Grafana  │           │             │
│                                │  :3333   │           ▼             │
│                                └──────────┘    ┌───────────┐        │
│                                                 │ntfy Bridge│       │
│                                                 │   :9095   │       │
│                                                 └─────┬─────┘       │
│                                                       │             │
│                                                       ▼             │
│                                             ntfy.sh/nuc-watchdog    │
└─────────────────────────────────────────────────────────────────────┘
```

## Access URLs

| Service | Local URL | Remote URL | Credentials |
|---------|-----------|------------|-------------|
| **Grafana** | http://192.168.1.3:3333 | https://alezmad-nuc.tail58f5ad.ts.net | admin / nucmonitoring |
| **Prometheus** | http://192.168.1.3:9091 | - | - |
| **Alertmanager** | http://192.168.1.3:9093 | - | - |
| **OpenWrt Metrics** | http://192.168.1.1:9100 | - | - |

## Prometheus Targets

| Job | Target | Scrape Interval |
|-----|--------|-----------------|
| `prometheus` | localhost:9090 | 15s |
| `nuc-node` | 192.168.1.3:9100 | 15s |
| `openwrt` | 192.168.1.1:9100 | 30s |

## Alert Rules

### NUC Alerts (`/opt/monitoring/alert_rules.yml`)

| Alert | Condition | Severity |
|-------|-----------|----------|
| NUCDown | `up{job="nuc-node"} == 0` for 1m | critical |
| HighCPULoad | CPU > 80% for 5m | warning |
| HighMemoryUsage | Memory > 85% for 5m | warning |
| DiskSpaceLow | Disk > 85% for 5m | warning |

### OpenWrt Alerts

| Alert | Condition | Severity |
|-------|-----------|----------|
| OpenWrtDown | `up{job="openwrt"} == 0` for 1m | critical |
| OpenWrtHighLoad | Load > 2 for 5m | warning |
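The "for 1m" / "for 5m" qualifiers in the tables above mean an alert fires only after its condition has held continuously for that long; until then it is pending, and a single healthy evaluation resets the timer. The toy model below illustrates that behaviour. It is a sketch, not Prometheus code: the class name `ForClauseAlert` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForClauseAlert:
    """Toy model of an alert rule with a `for:` duration (hypothetical helper)."""
    threshold: float                      # e.g. 80.0 for HighCPULoad
    hold_seconds: float                   # e.g. 300 for "for 5m"
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one evaluation."""
        if value <= self.threshold:
            self.active_since = None      # condition broke: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now       # condition just became true
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"
```

With `ForClauseAlert(threshold=80.0, hold_seconds=300)`, a CPU reading of 90% stays `pending` until five minutes have passed without a reading at or below 80%.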

## Auto-Recovery Layers

| Layer | Component | Action | Trigger |
|-------|-----------|--------|---------|
| 1 | OpenWrt Monitor | WoL packet | HTTP+Ping fail (3x) |
| 2 | Hardware Watchdog | Auto-reboot | System freeze |
| 3 | Kernel Panic | Auto-reboot (10s) | Kernel panic |
| 4 | WoL | Wake from power off | Manual or script |
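Layer 1 combines two ideas: count consecutive failed health checks, and send a Wake-on-LAN magic packet once the limit (3) is reached. The real logic lives in `monitor-nuc.sh` on the router and is not reproduced here; the Python sketch below only illustrates the technique. `FailureCounter`, `build_magic_packet` and `send_wol` are hypothetical names, and UDP port 9 with a broadcast address is a common convention, assumed rather than taken from the script.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is an assumed default)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


class FailureCounter:
    """Trigger only after `limit` consecutive failed health checks."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check; return True when recovery should be triggered."""
        if healthy:
            self.failures = 0          # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.limit:
            self.failures = 0          # start counting again after triggering
            return True
        return False
```

A monitor loop would call `record(healthy)` after each check and call `send_wol(...)` with the NUC's MAC address whenever it returns `True`. How HTTP and ping results are combined into `healthy` is left open here, since the table only states that both must fail.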

## Configuration Files

### NUC (`/opt/monitoring/`)
- `prometheus.yml` - Prometheus configuration
- `alert_rules.yml` - Alert rules
- `alertmanager.yml` - Alertmanager routing to the ntfy bridge
- `alertmanager-ntfy-bridge.py` - Webhook translator

### OpenWrt (`/opt/`)
- `monitor-nuc.sh` - Health check daemon (0.5 Hz, one check every 2 s)

### Systemd Services
- `prometheus-node-exporter.service` - NUC metrics
- `alertmanager-ntfy-bridge.service` - Alert translator

## Notifications

**ntfy Topic:** `nuc-watchdog`

Subscribe via:
- App: ntfy (iOS/Android)
- Web: https://ntfy.sh/nuc-watchdog

**Alert Sources:**
1. OpenWrt watchdog (direct to ntfy.sh)
2. Alertmanager → bridge → ntfy.sh
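The second path needs a translator because Alertmanager posts a JSON webhook while ntfy expects a plain message body with optional `Title` and `Priority` headers. The sketch below shows one way such a translation could look. It is not the contents of `alertmanager-ntfy-bridge.py`: the function names and the severity-to-priority mapping are assumptions, and the HTTP listener on :9095 is omitted.

```python
import json
import urllib.request

NTFY_URL = "https://ntfy.sh/nuc-watchdog"  # topic from this document

# Assumed mapping from the `severity` label to ntfy priority names.
PRIORITY = {"critical": "urgent", "warning": "high"}


def to_ntfy(alert: dict) -> tuple[str, dict]:
    """Translate one Alertmanager alert into an ntfy body and headers."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    name = labels.get("alertname", "unknown")
    status = alert.get("status", "firing")
    body = annotations.get("summary") or annotations.get("description") or name
    headers = {
        "Title": f"[{status.upper()}] {name}",
        "Priority": PRIORITY.get(labels.get("severity", ""), "default"),
    }
    return body, headers


def forward(payload: bytes) -> int:
    """Forward every alert in a webhook payload to ntfy; return the count."""
    alerts = json.loads(payload).get("alerts", [])
    for alert in alerts:
        body, headers = to_ntfy(alert)
        req = urllib.request.Request(
            NTFY_URL, data=body.encode(), headers=headers, method="POST"
        )
        with urllib.request.urlopen(req, timeout=10):
            pass  # response body is not needed
    return len(alerts)
```

A webhook handler would pass the raw request body to `forward()`. Keeping `to_ntfy()` free of network calls makes the mapping easy to check in isolation.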

## Remote Access

### Tailscale (NUC)
- Funnel URL: https://alezmad-nuc.tail58f5ad.ts.net
- Exposes: Grafana (:3333)

### WireGuard (OpenWrt)
- Endpoint: 5.224.196.245:51820
- VPN Subnet: 10.10.10.0/24
- Config: `~/wireguard/home-vpn.conf`

## Grafana Setup

**Data Source:** Prometheus (http://prometheus:9090)

**Imported Dashboards:**
- Node Exporter Full (ID: 1860)

## Maintenance Commands

```bash
# Check Prometheus targets
curl -s http://192.168.1.3:9091/api/v1/targets | jq '.data.activeTargets[].health'

# Check active alerts
curl -s http://192.168.1.3:9093/api/v2/alerts | jq '.[].labels.alertname'

# Restart monitoring stack
ssh nuc "docker restart prometheus-r0wg4gwoow44kkkc8skc4kwg alertmanager-r0wg4gwoow44kkkc8skc4kwg grafana-r0wg4gwoow44kkkc8skc4kwg"

# Check OpenWrt monitor (the [m] keeps grep from matching its own process)
ssh root@192.168.1.1 "ps | grep '[m]onitor'"

# Test alert
curl -X POST http://192.168.1.3:9093/api/v2/alerts -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test"}}]'

# Manual ntfy test
curl -d "Test message" https://ntfy.sh/nuc-watchdog
```
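The first command above prints only `up`/`down` values without saying which target each belongs to. As a possible complement, the sketch below reads the same `/api/v1/targets` response and lists the job, instance and last error of every target that is not up. The function names are hypothetical; the URL is the host port from the Access URLs table.

```python
import json
import urllib.request

PROMETHEUS = "http://192.168.1.3:9091"  # host port from the Access URLs table


def unhealthy_targets(targets_response: dict) -> list[str]:
    """Return 'job instance: lastError' for every active target that is not up."""
    problems = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                f"{labels.get('job', '?')} {labels.get('instance', '?')}: "
                f"{target.get('lastError') or 'no error reported'}"
            )
    return problems


def fetch_targets(base_url: str = PROMETHEUS) -> dict:
    """Fetch the targets listing from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Calling `unhealthy_targets(fetch_targets())` from the LAN returns an empty list when all three scrape jobs are healthy.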

## Container UUIDs (Coolify)

| Service | UUID |
|---------|------|
| Monitoring Stack | r0wg4gwoow44kkkc8skc4kwg |

## Related

- OpenWrt NUC Monitor: `/opt/monitor-nuc.sh`
- Kernel panic config: `/etc/sysctl.d/99-auto-reboot.conf`
- WireGuard config: `~/wireguard/home-vpn.conf`