chmonitor
AI Agent

Capabilities

Full reference for the agent's 26+ tools, 18 expert skills, plan-and-verify mode, and worked example questions.

The agent exposes a small set of powerful primitives (26 tools, plus 3 env-gated control tools). Anything without a dedicated tool is done by writing SQL with the query tool, guided by a skill recipe. You never call tools directly — the agent picks and chains them automatically.

Tools

CategoryWhat it coversTools
Schema & explorationDatabases, tables, columns, ad-hoc SQLquery, list_databases, list_tables, get_table_schema, explore_table_schema
Query analysisRunning, slow, failed queries; normalized slow patterns; EXPLAIN; pre-flight cost estimateget_running_queries, get_slow_queries, list_slow_query_patterns, get_failed_queries, explain_query, estimate_query_cost
Health, storage, replication, mergesMetrics, disks, parts, replication, mergesget_metrics, get_disk_usage, get_table_parts, get_replication_status, get_merge_status
Capacity planningDisk-full forecast, TTL/retention advice (recommend-only)forecast_disk_capacity, suggest_ttl_adjustment
Aggregation advisorDesign MV/projection DDL from frequent aggregation queries, with a size estimate (recommend-only)recommend_materialized_view
Plan & verifyA visible, adaptive step-by-step planupdate_plan
Knowledge & interactionLoad expert guides; ask the userload_skill, ask_user
Charts & visualizationRun SQL and return an interactive chartquery_and_visualize
InsightsExplain a statistical anomaly baseline and z-scoreexplain_anomaly_score
AdvisorRanked skip-index/projection/partition/PREWHERE recommendations for a slow query (recommend-only)get_optimization_recommendations
DashboardsSuggest a dashboard layout from registry charts for a natural-language request (recommend-only)suggest_dashboard
Control actions (env-gated)Kill query/mutation, optimize tablekill_query, kill_mutation, optimize_table

Everything else (expensive-query rankings, query patterns, anomalies, table-design advice, settings, logs, replication queue, ZooKeeper, users…) is done with query plus the relevant skill recipe — so the agent keeps full reach with a much smaller, more reliable tool surface.

Control tools are off by default

Control tools (kill_query, kill_mutation, optimize_table) are disabled unless AGENT_ENABLE_CONTROL_TOOLS=true. The agent always confirms before using them.

Tool reference

ToolDescriptionNotes
queryExecute a read-only SQL query on ClickHouse.SELECT / WITH / DESCRIBE / EXPLAIN only. Write statements are rejected.
list_databasesList all databases with engine and comment metadata.
list_tablesList tables in a database with row counts and size.
get_table_schemaGet column definitions for a specific table.
explore_table_schemaThree-mode exploration: no args → databases; database only → tables; database + table → full schema with indexes and keys.
get_running_queriesCurrently running queries ordered by elapsed time.Reads system.processes.
get_slow_queriesSlowest completed queries from the query log.Reads system.query_log.
list_slow_query_patternsNormalized slow query patterns — system.query_log grouped by normalized_query_hash, with calls, total/avg/p50/p95/p99/max duration, CPU time, peak memory, read/write bytes, errors, and cache-hit ratio per pattern.Read-only. Use for "which query shape is expensive overall" — unlike get_slow_queries, which ranks individual executions. First step of the query-optimization diagnose loop.
get_failed_queriesRecent failed queries with error details.Reads system.query_log.
explain_queryEXPLAIN PLAN / PIPELINE / PLAN with indexes for a query.
estimate_query_costPre-flight cost estimate (rows scanned, bytes read, peak memory, wall time, confidence) from EXPLAIN alone.Read-only and recommend-only — runs EXPLAIN only, never executes the analyzed query.
get_metricsServer health: version, uptime, active connections, memory.Reads system.metrics.
get_disk_usagePer-disk free and total space.Reads system.disks.
get_table_partsPart-level info for a table: rows, size, compression ratio.Reads system.parts.
forecast_disk_capacityForecast when disks will run out of free space from recent write-growth trend, plus top contributing tables.Reads system.part_log (NewPart events) + system.disks. Reports a clear message instead of a forecast when part_log isn't enabled.
suggest_ttl_adjustmentRecommend a TTL/retention change for a table to keep projected disk utilization ≤80%, never below a stated retention floor.Returns a suggested ALTER TABLE ... MODIFY TTL ... string + risk note — recommend-only, never executed. Reports a clear message instead of a suggestion when part_log isn't enabled.
recommend_materialized_viewMine frequent GROUP BY/aggregate query shapes and design a Summing/AggregatingMergeTree MV or projection to pre-aggregate them.Returns DDL text + size estimate + impact + risk (added write-path/storage cost) — recommend-only, never executed. Reads system.query_log, system.parts, system.tables.
get_replication_statusPer-table replication delay, queue size, and replica counts.Reads system.replicas.
get_merge_statusCurrently running merges with progress and elapsed time.Reads system.merges.
update_planCreate or update a step-by-step workflow plan for the current investigation.Rendered as a live checklist in the UI.
load_skillLoad an expert guide (skill) with SQL recipes for a specific domain.
find_reference_querySearch the dashboard's built-in library of 100+ vetted, version-aware monitoring queries and return the closest matches (name, description, SQL).Read-only, deterministic keyword-overlap lookup over the built-in QueryConfig catalog — executes nothing. Meant to be used before hand-writing system.* SQL.
ask_userAsk the user a question to gather information before proceeding.
query_and_visualizeRun a SQL query and return an interactive chart config.Chart type is auto-detected from result columns.
explain_anomaly_scoreExplain a per-host/per-metric statistical anomaly baseline (mean, stddev, median, MAD, sample count) and, given a value, its z-score and anomaly verdict.Read-only. Reports "no baseline yet" during cold start (falls back to a static threshold).
get_optimization_recommendationsAnalyze a slow query (by queryId from system.query_log, or raw sql) and return ranked skip-index, projection, partition-key, and PREWHERE recommendations with DDL/rewrite text, rationale, risk, effort, and an estimated granules/bytes saved.Read-only and recommend-only — reads EXPLAIN + system.tables/system.columns/system.data_skipping_indexes/system.parts; never executes or applies any DDL or rewrite. Every impact figure is explicitly labeled an estimate.
suggest_dashboardMap a natural-language request to a dashboard layout built only from charts in the chart registry, auto-placed on the plan-57 12-column grid.Recommend-only — never persists anything. The chat UI shows an "Apply to dashboard" action that loads the layout into the dashboard builder's unsaved working grid; saving still requires the existing save action. Rejects any chart name not present in both the client and API chart registries.
kill_queryKill a running query by query_id.DESTRUCTIVE. Env-gated: requires AGENT_ENABLE_CONTROL_TOOLS=true. Agent always confirms first.
kill_mutationCancel a running mutation on a table.DESTRUCTIVE. Env-gated: requires AGENT_ENABLE_CONTROL_TOOLS=true. Agent always confirms first.
optimize_tableTrigger an OPTIMIZE on a table to force merges.DESTRUCTIVE. Env-gated: requires AGENT_ENABLE_CONTROL_TOOLS=true. Agent always confirms first.

Skills

Skills are expert guides — with copy-pasteable SQL recipes against system.* — that the agent loads on demand. Because the toolset is lean, skills are how the agent stays powerful. Eighteen skills are included:

SkillCovers
system-tables-referenceExact columns of key system tables; recipes; tools vs raw SQL
data-analysisAggregation & time-series recipes (largest scan, expensive queries, patterns, period comparison)
anomaly-detectionRecent-vs-baseline comparisons (error spikes, p95 regressions, part explosions)
query-tuning-advisorDiagnose a slow query and propose concrete rewrites & better joins
query-optimizationPREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes
schema-design-advisorORDER BY/partition keys, codecs, skip indexes, column type right-sizing
storage-optimizationCompression codecs, TTL, tiered storage, part management
version-upgrade-advisorWhether/how to upgrade ClickHouse and what is gained
hardware-tuningSize settings to the box's cores/RAM/disk
concept-explainerTeach core ClickHouse concepts
replication-guideReplicatedMergeTree, failover, lag diagnosis, Keeper
cluster-operationsDistributed tables, resharding, node management, topology
migration-patternsALTER patterns, zero-downtime schema changes
security-hardeningRBAC, row policies, quotas, audit logging
clickhouse-best-practicesSchema design, query tuning, operational guidelines
troubleshootingOOM, slow merges, stuck mutations, error-code diagnosis
incident-responseStructured triage recipes (disk full, errors, replication lag, health sweep)
plan-and-verifyDecompose with update_plan and verify each result before concluding

List available skills with GET /api/v1/agent/skills.

Plan and verify

For multi-step tasks the agent authors a live checklist with update_plan, keeps exactly one step in progress, and adapts the plan as results come in. Crucially, it verifies each result before stating it — re-querying or cross-checking a second system table for a finding, running explain_query on both versions before claiming a rewrite is faster, and separating what is verified from what is a hypothesis. The plan-and-verify and incident-response skills encode the recipes. Simple one-step questions skip the plan entirely.

Example questions

Ask in plain English:

  • "Which queries are running right now and how long have they been executing?" — lists query id, user, elapsed time, and memory, sorted by duration.
  • "What were the 10 slowest queries in the last 24 hours?" — fetches from the query log and offers to EXPLAIN any of them.
  • "What's the largest data scan ever performed on this cluster?" — loads data-analysis and runs the system.query_log recipe.
  • "Anything abnormal in the last hour versus baseline?" — loads anomaly-detection and compares recent activity to the preceding window.
  • "Show me query volume per hour over the last day as a chart." — runs the aggregation and renders a line chart inline.
  • "Suggest a better ORDER BY and which columns should be LowCardinality for analytics.events." — loads schema-design-advisor, inspects the schema and parts, and recommends changes.

On this page