chmonitor

At-a-glance cluster health dashboard and automated headless health-sweep alerts.

Routes: /health
Feature id: health
Default access: public
Requires auth: No (set CHM_FEATURE_HEALTH_ACCESS=authenticated to gate)
System tables: system.metrics, system.asynchronous_metrics, system.replicas, system.merges, system.errors, system.disks, system.replication_queue, system.processes, system.query_log, system.parts
ClickHouse grants: SELECT on the system tables above

What it does

The Health page aggregates checks across multiple system tables into a single status grid. Each check has a severity level (ok / warning / critical) and a human-readable summary. Operators use this page as a starting point for incident triage.

Checks cover:

  • Replication lag — replicas falling behind, readonly tables
  • Merge backlog — active merge count and slow merges
  • Error rate — recent errors from system.errors
  • Disk usage — free space across all disks
  • Query load — running query count and memory pressure
  • Part health — excessive part counts
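
The severity model above can be pictured as a small data structure. The following is a minimal TypeScript sketch, not chmonitor's actual types: `CheckResult`, `RANK`, and `worstSeverity` are hypothetical names, and rolling the grid up into a single overall status is an assumption made for illustration.

```typescript
// Hypothetical types; chmonitor's real result shape may differ.
type Severity = "ok" | "warning" | "critical";

interface CheckResult {
  check: string; // e.g. "replication-lag"
  severity: Severity;
  summary: string; // human-readable one-liner
}

// Numeric rank so severities can be compared.
const RANK: Record<Severity, number> = { ok: 0, warning: 1, critical: 2 };

// Roll a grid of checks up into one overall status: the worst severity wins.
function worstSeverity(results: CheckResult[]): Severity {
  return results.reduce<Severity>(
    (worst, r) => (RANK[r.severity] > RANK[worst] ? r.severity : worst),
    "ok",
  );
}

// Sample data, invented for the example.
const grid: CheckResult[] = [
  { check: "replication-lag", severity: "ok", summary: "all replicas in sync" },
  { check: "disk-usage", severity: "warning", summary: "disk 'default' is 85% full" },
  { check: "merge-backlog", severity: "ok", summary: "3 active merges" },
];

console.log(worstSeverity(grid)); // "warning"
```

Ranking severities numerically keeps comparisons in one place, which is also what a minimum-severity alert threshold needs.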

In addition to the UI, chmonitor exposes a headless health-sweep endpoint (GET /api/cron/health-sweep) that runs the same checks and dispatches webhook alerts. This is designed to be called on a schedule (e.g., Cloudflare Cron every 5 minutes) without a browser.

Pages

| Page | Route | What it shows | System tables |
|------|-------|---------------|---------------|
| Health | /health | Status grid with per-check severity and details | system.metrics, system.asynchronous_metrics, system.replicas, system.merges, system.errors, system.disks, system.replication_queue, system.processes, system.query_log, system.parts |

Permissions & access

Disable:

CHM_FEATURE_HEALTH_ENABLED=false

Require authentication:

CHM_FEATURE_HEALTH_ACCESS=authenticated

Config file:

[features.health]
enabled = true
access = "authenticated"

Configuration

Health-sweep cron alerting

The health-sweep endpoint runs checks over all configured hosts and sends a webhook notification when a check meets or exceeds the minimum severity.

| Variable | Default | Description |
|----------|---------|-------------|
| CRON_SECRET | (unset = open) | Guards GET /api/cron/health-sweep. Pass as Authorization: Bearer <secret>. Setting this is strongly recommended: without it, the endpoint is accessible to anyone who can reach your deployment. |
| HEALTH_ALERT_ENABLED | false | Set to true to enable webhook dispatch. |
| HEALTH_ALERT_WEBHOOK_URL | (required if enabled) | Slack or Discord incoming webhook URL. Payload is a JSON object with a text field. |
| HEALTH_ALERT_MIN_SEVERITY | warning | Minimum severity that triggers a notification. Values: warning or critical. |
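
To make the interaction of these variables concrete, here is a hedged TypeScript sketch of how the alert decision might work. Only the variable names, their defaults, and the `{"text": ...}` payload shape come from this page; `parseAlertConfig`, `buildAlertPayload`, and the message layout are hypothetical.

```typescript
// Illustrative only; not chmonitor's actual implementation.
type Severity = "ok" | "warning" | "critical";

interface CheckResult { check: string; severity: Severity; summary: string }

interface AlertConfig {
  enabled: boolean;
  webhookUrl?: string;
  minSeverity: "warning" | "critical";
}

const RANK: Record<Severity, number> = { ok: 0, warning: 1, critical: 2 };

// Read the HEALTH_ALERT_* variables, applying the documented defaults.
function parseAlertConfig(env: Record<string, string | undefined>): AlertConfig {
  return {
    enabled: env["HEALTH_ALERT_ENABLED"] === "true",
    webhookUrl: env["HEALTH_ALERT_WEBHOOK_URL"],
    // Anything other than "critical" falls back to the default, "warning".
    minSeverity: env["HEALTH_ALERT_MIN_SEVERITY"] === "critical" ? "critical" : "warning",
  };
}

// Return the webhook body, or null when nothing should be sent.
function buildAlertPayload(results: CheckResult[], cfg: AlertConfig): { text: string } | null {
  if (!cfg.enabled || !cfg.webhookUrl) return null;
  // "Meets or exceeds": keep checks whose rank is at least the threshold.
  const hits = results.filter((r) => RANK[r.severity] >= RANK[cfg.minSeverity]);
  if (hits.length === 0) return null;
  // Assumed message layout: one line per offending check.
  return { text: hits.map((r) => `[${r.severity}] ${r.check}: ${r.summary}`).join("\n") };
}
```

With `HEALTH_ALERT_MIN_SEVERITY=critical`, a sweep that finds only warnings produces no payload under this sketch, so no webhook call would be made.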

Example for Cloudflare Workers (using wrangler secret put):

wrangler secret put CRON_SECRET
wrangler secret put HEALTH_ALERT_WEBHOOK_URL

Example environment block:

HEALTH_ALERT_ENABLED=true
HEALTH_ALERT_MIN_SEVERITY=warning
HEALTH_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/...
CRON_SECRET=<random-secret>

To call the endpoint manually:

curl -H "Authorization: Bearer $CRON_SECRET" \
  https://your-chmonitor.example.com/api/cron/health-sweep

The endpoint returns a JSON array of check results. It always returns HTTP 200, so read the response body rather than the status code to see check outcomes; alert dispatch happens server-side.
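
The CRON_SECRET behaviour described above (open when unset, bearer token when set) can be sketched as a small guard. This is a hypothetical TypeScript illustration: `isSweepAuthorized` is not a chmonitor function, and what the server does with a rejected request is not specified on this page.

```typescript
// Hypothetical guard mirroring the documented CRON_SECRET semantics.
function isSweepAuthorized(
  authHeader: string | null, // value of the Authorization header, if any
  cronSecret: string | undefined, // value of CRON_SECRET, if set
): boolean {
  // Unset (or empty) secret: the endpoint is open to every caller.
  if (!cronSecret) return true;
  // Otherwise the header must be exactly "Bearer <secret>".
  return authHeader === `Bearer ${cronSecret}`;
}

console.log(isSweepAuthorized(null, undefined)); // true (endpoint is open)
console.log(isSweepAuthorized("Bearer s3cret", "s3cret")); // true
console.log(isSweepAuthorized("Bearer wrong", "s3cret")); // false
```

A production guard would likely prefer a constant-time comparison over `===`; the plain comparison here keeps the sketch short.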

Scheduling (Cloudflare Cron)

In wrangler.toml:

[triggers]
crons = ["*/5 * * * *"]

The cron handler calls the health-sweep logic directly — no HTTP hop needed when running inside the same Worker.
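
As an illustration of "calling the logic directly", the sketch below shows the general shape of a Cloudflare Workers `scheduled` handler in TypeScript. The `scheduled` entry point and `ctx.waitUntil` are part of the Workers runtime; `runHealthSweep`, `Env`, and `ExecutionContextLike` are hypothetical stand-ins, and chmonitor's actual handler may be organised differently.

```typescript
// Minimal local stand-in for the runtime's execution context type,
// so the sketch compiles without the Workers type package.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical environment bindings.
interface Env {
  CRON_SECRET?: string;
  HEALTH_ALERT_ENABLED?: string;
}

// Hypothetical shared sweep function: the same logic the HTTP route would call.
async function runHealthSweep(_env: Env): Promise<void> {
  // ...run the checks against each configured host and dispatch alerts...
}

const worker = {
  // Invoked by the runtime for each cron expression listed under [triggers].
  scheduled(_controller: unknown, env: Env, ctx: ExecutionContextLike): void {
    // No HTTP request to ourselves: call the sweep function in-process.
    // waitUntil keeps the invocation alive until the sweep settles.
    ctx.waitUntil(runHealthSweep(env));
  },
};

// In a real Worker module this object would be the default export:
// export default worker;
```

Because the cron path never goes through the HTTP route in this sketch, it does not need to present CRON_SECRET; the secret only matters for external callers.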

Notes & limitations

  • The Health page reads system.errors, whose in-memory error counters reset on server restart. system.error_log is checked separately by the Errors page (under Operations).
  • If a system table is missing (e.g., no system.replicas on a standalone node), that check is skipped with an “unavailable” state rather than a false positive.
  • The webhook payload is a plain Slack/Discord-compatible JSON object ({"text": "..."}). Custom payload shapes are not supported in v1.
  • The health-sweep endpoint queries all configured CLICKHOUSE_HOST entries. High host counts increase sweep latency.
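
The "unavailable" behaviour for missing tables can be sketched as a pre-flight partition of checks. This TypeScript example is an assumption about one way to do it, not a description of chmonitor's code: `CheckSpec` and `partitionChecks` are hypothetical, and the set of present tables is assumed to come from a query such as `SELECT name FROM system.tables WHERE database = 'system'`. An implementation could equally catch the query error at run time.

```typescript
type CheckState = "ok" | "warning" | "critical" | "unavailable";

// Hypothetical description of a check and the system tables it reads.
interface CheckSpec {
  name: string;
  tables: string[];
}

interface SkippedCheck {
  name: string;
  state: CheckState;
  missing: string[];
}

// Split checks into those that can run and those to report as "unavailable".
function partitionChecks(
  specs: CheckSpec[],
  present: Set<string>,
): { runnable: CheckSpec[]; unavailable: SkippedCheck[] } {
  const runnable: CheckSpec[] = [];
  const unavailable: SkippedCheck[] = [];
  for (const spec of specs) {
    const missing = spec.tables.filter((t) => !present.has(t));
    if (missing.length === 0) {
      runnable.push(spec);
    } else {
      // Report the gap explicitly instead of raising a warning or critical.
      unavailable.push({ name: spec.name, state: "unavailable", missing });
    }
  }
  return { runnable, unavailable };
}

// Sample input, invented for the example.
const specs: CheckSpec[] = [
  { name: "replication-lag", tables: ["system.replicas", "system.replication_queue"] },
  { name: "disk-usage", tables: ["system.disks"] },
];
const plan = partitionChecks(specs, new Set(["system.disks", "system.replication_queue"]));
console.log(plan.unavailable.map((c) => c.name)); // ["replication-lag"]
```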