Health · chmonitor docs

At-a-glance cluster health dashboard and automated headless health-sweep alerts.


Routes	`/health`
Feature id	`health`
Default access	`public`
Requires auth	No (set `CHM_FEATURE_HEALTH_ACCESS=authenticated` to gate)
System tables	`system.metrics`, `system.asynchronous_metrics`, `system.replicas`, `system.merges`, `system.errors`, `system.disks`, `system.replication_queue`, `system.processes`, `system.query_log`, `system.parts`
ClickHouse grants	`SELECT` on the system tables above

What it does

The Health page aggregates checks across multiple system tables into a single status grid. Each check has a severity level (ok / warning / critical) and a human-readable summary. Operators use this page as a starting point for incident triage.

Checks cover:

Replication lag — replicas falling behind, readonly tables
Merge backlog — active merge count and slow merges
Error rate — recent errors from system.errors
Disk usage — free space across all disks
Query load — running query count and memory pressure
Part health — excessive part counts

In addition to the UI, chmonitor exposes a headless health-sweep endpoint (GET /api/cron/health-sweep) that runs the same checks and dispatches webhook alerts. This is designed to be called on a schedule (e.g., Cloudflare Cron every 5 minutes) without a browser.

Pages

Page	Route	What it shows	System tables
Health	`/health`	Status grid with per-check severity and details	`system.metrics`, `system.asynchronous_metrics`, `system.replicas`, `system.merges`, `system.errors`, `system.disks`, `system.replication_queue`, `system.processes`, `system.query_log`, `system.parts`

Permissions & access

Disable:

CHM_FEATURE_HEALTH_ENABLED=false

Require authentication:

CHM_FEATURE_HEALTH_ACCESS=authenticated

Config file:

[features.health]
enabled = true
access = "authenticated"

Configuration

Health-sweep cron alerting

The health-sweep endpoint runs checks over all configured hosts and sends a webhook notification when a check meets or exceeds the minimum severity.

Variable	Default	Description
`CRON_SECRET`	(unset = open)	Guards `GET /api/cron/health-sweep`. Pass as `Authorization: Bearer <secret>`. Setting this is strongly recommended — without it the endpoint is publicly accessible to anyone who can reach your deployment.
`HEALTH_ALERT_ENABLED`	`false`	Set to `true` to enable webhook dispatch.
`HEALTH_ALERT_WEBHOOK_URL`	(required if enabled)	Slack or Discord incoming webhook URL. Payload is a JSON object with a `text` field.
`HEALTH_ALERT_MIN_SEVERITY`	`warning`	Minimum severity that triggers a notification. Values: `warning` or `critical`.

Example for Cloudflare Workers (using wrangler secret put):

wrangler secret put CRON_SECRET
wrangler secret put HEALTH_ALERT_WEBHOOK_URL

Example environment block:

HEALTH_ALERT_ENABLED=true
HEALTH_ALERT_MIN_SEVERITY=warning
HEALTH_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/...
CRON_SECRET=<random-secret>

To call the endpoint manually:

curl -H "Authorization: Bearer $CRON_SECRET" \
  https://your-chmonitor.example.com/api/cron/health-sweep

The endpoint returns a JSON array of check results. It always returns HTTP 200; alert dispatch happens server-side.

Scheduling (Cloudflare Cron)

In wrangler.toml:

[triggers]
crons = ["*/5 * * * *"]

The cron handler calls the health-sweep logic directly — no HTTP hop needed when running inside the same Worker.

Notes & limitations

system.error_log is checked separately by the Errors page (under Operations). The Health page uses system.errors (in-memory error counts), which resets on server restart.
If a system table is missing (e.g., no system.replicas on a standalone node), that check is skipped with an “unavailable” state rather than a false positive.
The webhook payload is a plain Slack/Discord-compatible JSON object ({"text": "..."}). Custom payload shapes are not supported in v1.
The health-sweep endpoint queries all configured CLICKHOUSE_HOST entries. High host counts increase sweep latency.