At-a-glance cluster health dashboard and automated headless health-sweep alerts.
| Routes | /health |
| Feature id | health |
| Default access | public |
| Requires auth | No (set CHM_FEATURE_HEALTH_ACCESS=authenticated to gate) |
| System tables | system.metrics, system.asynchronous_metrics, system.replicas, system.merges, system.errors, system.disks, system.replication_queue, system.processes, system.query_log, system.parts |
| ClickHouse grants | SELECT on the system tables above |
What it does
The Health page aggregates checks across multiple system tables into a single status grid. Each check has a severity level (ok / warning / critical) and a human-readable summary. Operators use this page as a starting point for incident triage.
Checks cover:
- Replication lag: replicas falling behind, readonly tables
- Merge backlog: active merge count and slow merges
- Error rate: recent errors from system.errors
- Disk usage: free space across all disks
- Query load: running query count and memory pressure
- Part health: excessive part counts
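The severity model can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not chmonitor's actual code: the CheckResult shape, the check names, and the worstSeverity helper are all hypothetical and exist only to show how ok / warning / critical results might be compared and rolled up.

```typescript
// Hypothetical shapes for illustration; chmonitor's real types may differ.
type Severity = "ok" | "warning" | "critical";

interface CheckResult {
  check: string;      // assumed identifier, e.g. "replication-lag"
  severity: Severity; // one of the three levels described above
  summary: string;    // human-readable one-liner
}

// Rank severities so they can be compared numerically.
const SEVERITY_RANK: Record<Severity, number> = { ok: 0, warning: 1, critical: 2 };

// Return the most severe level found in a list of results ("ok" if empty).
function worstSeverity(results: CheckResult[]): Severity {
  return results.reduce<Severity>(
    (worst, r) => (SEVERITY_RANK[r.severity] > SEVERITY_RANK[worst] ? r.severity : worst),
    "ok",
  );
}

// Made-up sample data, not real output.
const sample: CheckResult[] = [
  { check: "replication-lag", severity: "ok", summary: "all replicas in sync" },
  { check: "disk-usage", severity: "critical", summary: "low free space on default disk" },
  { check: "merge-backlog", severity: "warning", summary: "several slow merges" },
];

const overall: Severity = worstSeverity(sample);
```

A numeric rank table keeps the comparison in one place, so adding another level later would only require one new entry.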
In addition to the UI, chmonitor exposes a headless health-sweep endpoint (GET /api/cron/health-sweep) that runs the same checks and dispatches webhook alerts. This is designed to be called on a schedule (e.g., Cloudflare Cron every 5 minutes) without a browser.
Pages
| Page | Route | What it shows | System tables |
|---|---|---|---|
| Health | /health | Status grid with per-check severity and details | system.metrics, system.asynchronous_metrics, system.replicas, system.merges, system.errors, system.disks, system.replication_queue, system.processes, system.query_log, system.parts |
Permissions & access
Disable:
CHM_FEATURE_HEALTH_ENABLED=false
Require authentication:
CHM_FEATURE_HEALTH_ACCESS=authenticated
Config file:
[features.health]
enabled = true
access = "authenticated"
Configuration
Health-sweep cron alerting
The health-sweep endpoint runs checks over all configured hosts and sends a webhook notification when a check meets or exceeds the minimum severity.
| Variable | Default | Description |
|---|---|---|
| CRON_SECRET | (unset = open) | Guards GET /api/cron/health-sweep. Pass as Authorization: Bearer <secret>. Setting this is strongly recommended: without it, anyone who can reach your deployment can call the endpoint. |
| HEALTH_ALERT_ENABLED | false | Set to true to enable webhook dispatch. |
| HEALTH_ALERT_WEBHOOK_URL | (required if enabled) | Slack or Discord incoming webhook URL. Payload is a JSON object with a text field. |
| HEALTH_ALERT_MIN_SEVERITY | warning | Minimum severity that triggers a notification. Values: warning or critical. |
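To make the "meets or exceeds the minimum severity" rule concrete, here is a small TypeScript sketch. The function names, the message format, and the host and check arguments are assumptions made for the example. Only the threshold values and the single text field in the payload come from the table above.

```typescript
type Severity = "ok" | "warning" | "critical";
type MinSeverity = "warning" | "critical"; // allowed HEALTH_ALERT_MIN_SEVERITY values

const RANK: Record<Severity, number> = { ok: 0, warning: 1, critical: 2 };

// True when a check's severity meets or exceeds the configured minimum.
function meetsThreshold(severity: Severity, min: MinSeverity): boolean {
  return RANK[severity] >= RANK[min];
}

// Build a webhook body: a JSON object with a single text field.
// The wording of the message is invented for this sketch.
function buildWebhookBody(
  host: string,
  check: string,
  severity: Severity,
  summary: string,
): string {
  return JSON.stringify({
    text: `[${severity.toUpperCase()}] ${host} ${check}: ${summary}`,
  });
}

// Example: with the default minimum (warning), a warning result qualifies.
const shouldSend = meetsThreshold("warning", "warning");
const body = buildWebhookBody("ch-1", "disk-usage", "critical", "5% free");
```

With HEALTH_ALERT_MIN_SEVERITY=critical, the same warning result would be filtered out and only critical checks would produce a notification.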
Example for Cloudflare Workers (using wrangler secret put):
wrangler secret put CRON_SECRET
wrangler secret put HEALTH_ALERT_WEBHOOK_URL
Example environment block:
HEALTH_ALERT_ENABLED=true
HEALTH_ALERT_MIN_SEVERITY=warning
HEALTH_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/...
CRON_SECRET=<random-secret>
To call the endpoint manually:
curl -H "Authorization: Bearer $CRON_SECRET" \
https://your-chmonitor.example.com/api/cron/health-sweep
The endpoint returns a JSON array of check results. It always returns HTTP 200; alert dispatch happens server-side.
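The same call can be made from a script. The sketch below assumes a runtime with global fetch and URL (for example Node 18+ or a Worker). The helper names are hypothetical, and because the shape of each check result is not specified here, the array elements are typed as unknown.

```typescript
// Build the URL and headers for a manual sweep call.
// baseUrl and secret are caller-supplied; the path is the documented endpoint.
function buildSweepRequest(
  baseUrl: string,
  secret?: string,
): { url: string; headers: Record<string, string> } {
  const url = new URL("/api/cron/health-sweep", baseUrl).toString();
  const headers: Record<string, string> = {};
  // Only send the header when a secret is configured.
  if (secret) headers["Authorization"] = `Bearer ${secret}`;
  return { url, headers };
}

// Call the endpoint and return the parsed JSON array of check results.
async function runSweep(baseUrl: string, secret?: string): Promise<unknown[]> {
  const { url, headers } = buildSweepRequest(baseUrl, secret);
  const res = await fetch(url, { headers });
  // Defensive check for failures in front of the app (proxy, DNS, and so on).
  if (!res.ok) throw new Error(`health-sweep returned HTTP ${res.status}`);
  const parsed: unknown = await res.json();
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array of check results");
  return parsed;
}

// Example request construction (no network call is made here).
const req = buildSweepRequest("https://your-chmonitor.example.com", "s3cret");
```

Separating request construction from the network call makes the header logic easy to check without reaching a live deployment.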
Scheduling (Cloudflare Cron)
In wrangler.toml:
[triggers]
crons = ["*/5 * * * *"]
The cron handler calls the health-sweep logic directly — no HTTP hop needed when running inside the same Worker.
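For orientation, a scheduled handler in a Worker might look roughly like the sketch below. It is not chmonitor's source: runHealthSweep is a hypothetical stand-in for the shared sweep logic, Env lists only one of the documented variables, and the two interface declarations are local stand-ins for the runtime types that normally come from @cloudflare/workers-types.

```typescript
// Minimal local stand-ins for the Workers runtime types, so the sketch is
// self-contained.
interface ScheduledEventLike {
  cron: string;          // the cron expression that fired
  scheduledTime: number; // scheduled time in milliseconds since the epoch
}
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}
interface Env {
  HEALTH_ALERT_ENABLED?: string;
}

// Hypothetical placeholder: real logic would run the checks and dispatch
// alerts. This stub only reports whether alerting is switched on.
async function runHealthSweep(env: Env): Promise<string> {
  return env.HEALTH_ALERT_ENABLED === "true" ? "alerts on" : "alerts off";
}

const worker = {
  // Invoked by the runtime for each cron trigger defined in wrangler.toml.
  async scheduled(
    _event: ScheduledEventLike,
    env: Env,
    ctx: ExecutionContextLike,
  ): Promise<void> {
    // waitUntil keeps the invocation alive until the sweep promise settles.
    ctx.waitUntil(runHealthSweep(env));
  },
};

// In a real Worker module this object would be the default export:
// export default worker;
```

Calling the sweep function directly from the handler is what avoids the extra HTTP request to /api/cron/health-sweep.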
Notes & limitations
- system.error_log is checked separately by the Errors page (under Operations). The Health page uses system.errors (in-memory error counts), which resets on server restart.
- If a system table is missing (e.g., no system.replicas on a standalone node), that check is skipped with an “unavailable” state rather than a false positive.
- The webhook payload is a plain Slack/Discord-compatible JSON object ({"text": "..."}). Custom payload shapes are not supported in v1.
- The health-sweep endpoint queries all configured CLICKHOUSE_HOST entries. High host counts increase sweep latency.