chmonitor — ClickHouse Monitoring Dashboard

Full reference for the agent's 26+ tools, 18 expert skills, plan-and-verify mode, and worked example questions.

The agent exposes a small set of powerful primitives (26 tools, plus 3 env-gated control tools). Anything without a dedicated tool is done by writing SQL with the query tool, guided by a skill recipe. You never call tools directly — the agent picks and chains them automatically.

Tools

Category	What it covers	Tools
Schema & exploration	Databases, tables, columns, ad-hoc SQL	`query`, `list_databases`, `list_tables`, `get_table_schema`, `explore_table_schema`
Query analysis	Running, slow, failed queries; normalized slow patterns; EXPLAIN; pre-flight cost estimate	`get_running_queries`, `get_slow_queries`, `list_slow_query_patterns`, `get_failed_queries`, `explain_query`, `estimate_query_cost`
Health, storage, replication, merges	Metrics, disks, parts, replication, merges	`get_metrics`, `get_disk_usage`, `get_table_parts`, `get_replication_status`, `get_merge_status`
Capacity planning	Disk-full forecast, TTL/retention advice (recommend-only)	`forecast_disk_capacity`, `suggest_ttl_adjustment`
Aggregation advisor	Design MV/projection DDL from frequent aggregation queries, with a size estimate (recommend-only)	`recommend_materialized_view`
Plan & verify	A visible, adaptive step-by-step plan	`update_plan`
Knowledge & interaction	Load expert guides; ask the user	`load_skill`, `ask_user`
Charts & visualization	Run SQL and return an interactive chart	`query_and_visualize`
Insights	Explain a statistical anomaly baseline and z-score	`explain_anomaly_score`
Advisor	Ranked skip-index/projection/partition/PREWHERE recommendations for a slow query (recommend-only)	`get_optimization_recommendations`
Dashboards	Suggest a dashboard layout from registry charts for a natural-language request (recommend-only)	`suggest_dashboard`
Control actions (env-gated)	Kill query/mutation, optimize table	`kill_query`, `kill_mutation`, `optimize_table`

Everything else (expensive-query rankings, query patterns, anomalies, table-design advice, settings, logs, replication queue, ZooKeeper, users…) is done with query plus the relevant skill recipe — so the agent keeps full reach with a much smaller, more reliable tool surface.

Control tools are off by default

Control tools (kill_query, kill_mutation, optimize_table) are disabled unless AGENT_ENABLE_CONTROL_TOOLS=true. The agent always confirms before using them.

Tool reference

Tool	Description	Notes
`query`	Execute a read-only SQL query on ClickHouse.	SELECT / WITH / DESCRIBE / EXPLAIN only. Write statements are rejected.
`list_databases`	List all databases with engine and comment metadata.
`list_tables`	List tables in a database with row counts and size.
`get_table_schema`	Get column definitions for a specific table.
`explore_table_schema`	Three-mode exploration: no args → databases; database only → tables; database + table → full schema with indexes and keys.
`get_running_queries`	Currently running queries ordered by elapsed time.	Reads `system.processes`.
`get_slow_queries`	Slowest completed queries from the query log.	Reads `system.query_log`.
`list_slow_query_patterns`	Normalized slow query patterns — `system.query_log` grouped by `normalized_query_hash`, with calls, total/avg/p50/p95/p99/max duration, CPU time, peak memory, read/write bytes, errors, and cache-hit ratio per pattern.	Read-only. Use for "which query shape is expensive overall" — unlike `get_slow_queries`, which ranks individual executions. First step of the `query-optimization` diagnose loop.
`get_failed_queries`	Recent failed queries with error details.	Reads `system.query_log`.
`explain_query`	EXPLAIN PLAN / PIPELINE / PLAN with indexes for a query.
`estimate_query_cost`	Pre-flight cost estimate (rows scanned, bytes read, peak memory, wall time, confidence) from EXPLAIN alone.	Read-only and recommend-only — runs EXPLAIN only, never executes the analyzed query.
`get_metrics`	Server health: version, uptime, active connections, memory.	Reads `system.metrics`.
`get_disk_usage`	Per-disk free and total space.	Reads `system.disks`.
`get_table_parts`	Part-level info for a table: rows, size, compression ratio.	Reads `system.parts`.
`forecast_disk_capacity`	Forecast when disks will run out of free space from recent write-growth trend, plus top contributing tables.	Reads `system.part_log` (NewPart events) + `system.disks`. Reports a clear message instead of a forecast when `part_log` isn't enabled.
`suggest_ttl_adjustment`	Recommend a TTL/retention change for a table to keep projected disk utilization ≤80%, never below a stated retention floor.	Returns a suggested `ALTER TABLE ... MODIFY TTL ...` string + risk note — recommend-only, never executed. Reports a clear message instead of a suggestion when `part_log` isn't enabled.
`recommend_materialized_view`	Mine frequent GROUP BY/aggregate query shapes and design a Summing/AggregatingMergeTree MV or projection to pre-aggregate them.	Returns DDL text + size estimate + impact + risk (added write-path/storage cost) — recommend-only, never executed. Reads `system.query_log`, `system.parts`, `system.tables`.
`get_replication_status`	Per-table replication delay, queue size, and replica counts.	Reads `system.replicas`.
`get_merge_status`	Currently running merges with progress and elapsed time.	Reads `system.merges`.
`update_plan`	Create or update a step-by-step workflow plan for the current investigation.	Rendered as a live checklist in the UI.
`load_skill`	Load an expert guide (skill) with SQL recipes for a specific domain.
`find_reference_query`	Search the dashboard's built-in library of 100+ vetted, version-aware monitoring queries and return the closest matches (name, description, SQL).	Read-only, deterministic keyword-overlap lookup over the built-in `QueryConfig` catalog — executes nothing. Meant to be used before hand-writing `system.*` SQL.
`ask_user`	Ask the user a question to gather information before proceeding.
`query_and_visualize`	Run a SQL query and return an interactive chart config.	Chart type is auto-detected from result columns.
`explain_anomaly_score`	Explain a per-host/per-metric statistical anomaly baseline (mean, stddev, median, MAD, sample count) and, given a value, its z-score and anomaly verdict.	Read-only. Reports "no baseline yet" during cold start (falls back to a static threshold).
`get_optimization_recommendations`	Analyze a slow query (by `queryId` from system.query_log, or raw `sql`) and return ranked skip-index, projection, partition-key, and PREWHERE recommendations with DDL/rewrite text, rationale, risk, effort, and an estimated granules/bytes saved.	Read-only and recommend-only — reads EXPLAIN + `system.tables`/`system.columns`/`system.data_skipping_indexes`/`system.parts`; never executes or applies any DDL or rewrite. Every impact figure is explicitly labeled an estimate.
`suggest_dashboard`	Map a natural-language request to a dashboard layout built only from charts in the chart registry, auto-placed on the plan-57 12-column grid.	Recommend-only — never persists anything. The chat UI shows an "Apply to dashboard" action that loads the layout into the dashboard builder's unsaved working grid; saving still requires the existing save action. Rejects any chart name not present in both the client and API chart registries.
`kill_query`	Kill a running query by `query_id`.	DESTRUCTIVE. Env-gated: requires `AGENT_ENABLE_CONTROL_TOOLS=true`. Agent always confirms first.
`kill_mutation`	Cancel a running mutation on a table.	DESTRUCTIVE. Env-gated: requires `AGENT_ENABLE_CONTROL_TOOLS=true`. Agent always confirms first.
`optimize_table`	Trigger an OPTIMIZE on a table to force merges.	DESTRUCTIVE. Env-gated: requires `AGENT_ENABLE_CONTROL_TOOLS=true`. Agent always confirms first.

Skills

Skills are expert guides — with copy-pasteable SQL recipes against system.* — that the agent loads on demand. Because the toolset is lean, skills are how the agent stays powerful. Eighteen skills are included:

Skill	Covers
`system-tables-reference`	Exact columns of key system tables; recipes; tools vs raw SQL
`data-analysis`	Aggregation & time-series recipes (largest scan, expensive queries, patterns, period comparison)
`anomaly-detection`	Recent-vs-baseline comparisons (error spikes, p95 regressions, part explosions)
`query-tuning-advisor`	Diagnose a slow query and propose concrete rewrites & better joins
`query-optimization`	PREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes
`schema-design-advisor`	ORDER BY/partition keys, codecs, skip indexes, column type right-sizing
`storage-optimization`	Compression codecs, TTL, tiered storage, part management
`version-upgrade-advisor`	Whether/how to upgrade ClickHouse and what is gained
`hardware-tuning`	Size settings to the box's cores/RAM/disk
`concept-explainer`	Teach core ClickHouse concepts
`replication-guide`	ReplicatedMergeTree, failover, lag diagnosis, Keeper
`cluster-operations`	Distributed tables, resharding, node management, topology
`migration-patterns`	ALTER patterns, zero-downtime schema changes
`security-hardening`	RBAC, row policies, quotas, audit logging
`clickhouse-best-practices`	Schema design, query tuning, operational guidelines
`troubleshooting`	OOM, slow merges, stuck mutations, error-code diagnosis
`incident-response`	Structured triage recipes (disk full, errors, replication lag, health sweep)
`plan-and-verify`	Decompose with `update_plan` and verify each result before concluding

List available skills with GET /api/v1/agent/skills.

Plan and verify

For multi-step tasks the agent authors a live checklist with update_plan, keeps exactly one step in progress, and adapts the plan as results come in. Crucially, it verifies each result before stating it — re-querying or cross-checking a second system table for a finding, running explain_query on both versions before claiming a rewrite is faster, and separating what is verified from what is a hypothesis. The plan-and-verify and incident-response skills encode the recipes. Simple one-step questions skip the plan entirely.

Example questions

Ask in plain English:

"Which queries are running right now and how long have they been executing?" — lists query id, user, elapsed time, and memory, sorted by duration.
"What were the 10 slowest queries in the last 24 hours?" — fetches from the query log and offers to EXPLAIN any of them.
"What's the largest data scan ever performed on this cluster?" — loads data-analysis and runs the system.query_log recipe.
"Anything abnormal in the last hour versus baseline?" — loads anomaly-detection and compares recent activity to the preceding window.
"Show me query volume per hour over the last day as a chart." — runs the aggregation and renders a line chart inline.
"Suggest a better ORDER BY and which columns should be LowCardinality for analytics.events." — loads schema-design-advisor, inspects the schema and parts, and recommends changes.

Capabilities

Tools

Tool reference

Skills

Plan and verify

Example questions

AI Agent

Configuration

Conversation history

On this page

Capabilities

Tools

Tool reference

Skills

Plan and verify

Example questions

Related

AI Agent

Configuration

Conversation history

On this page