Agent capabilities · chmonitor docs

The agent has about 44 tools organized into categories. You never call tools directly: the agent picks and chains them automatically. This page summarizes what is available and what you can ask.

Tool categories

Category	What it covers	Example tools
Schema & exploration	Databases, tables, columns, data source discovery	`list_tables`, `get_table_schema`, `discover_data_sources`
Query analysis	Running, slow, failed, expensive queries; EXPLAIN; optimization	`get_running_queries`, `get_slow_queries`, `explain_query`, `analyze_query_optimization`
System health	Server metrics, CPU/memory/disk, errors, crash log, anomalies	`get_metrics`, `get_system_resources`, `detect_anomalies`, `generate_health_report`
Findings persistence	Record and retrieve structured monitoring findings	`record_finding`, `list_recent_findings`
Storage, merges & mutations	Part sizes, active merges, stuck mutations, merge throughput	`get_table_parts`, `get_merge_status`, `get_mutations`
Replication & cluster	Lag, queue, ZooKeeper/Keeper, topology, DDL queue	`get_replication_status`, `get_clusters`, `get_zookeeper_info`
Security & audit	Sessions, login attempts, users and roles	`get_active_sessions`, `get_login_attempts`, `get_users_and_roles`
Schema migration	ALTER impact assessment, column usage, table design	`analyze_schema_change`, `get_column_usage`, `recommend_table_design`
Comparison & forecasting	Cross-host, time-period, capacity forecast	`compare_hosts`, `compare_time_periods`, `forecast_capacity`
Charts & visualization	Run SQL and return an interactive chart	`query_and_visualize`
Settings & logs	Changed server settings, text log, stack traces	`get_settings`, `get_text_log`, `get_stack_traces`
Control actions	Kill query/mutation, optimize table (disabled by default)	`kill_query`, `kill_mutation`, `optimize_table`
Interaction & knowledge	Skills, workflows, context snapshot, clarifying questions	`load_skill`, `start_workflow`, `get_context`

Control tools (`kill_query`, `kill_mutation`, `optimize_table`) are disabled unless `AGENT_ENABLE_CONTROL_TOOLS=true`. The agent always confirms before using them.
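
The opt-in behavior described above can be pictured with a small sketch. This is illustrative only: the function names `control_tools_enabled` and `visible_tools` are hypothetical, and the sketch assumes the flag is enabled only by the exact value `true`, as written on this page. How chmonitor actually parses the variable is not documented here.

```python
import os

# Tool names come from the table above; the gating logic is an assumption.
CONTROL_TOOLS = {"kill_query", "kill_mutation", "optimize_table"}


def control_tools_enabled(env=None):
    """Return True only when AGENT_ENABLE_CONTROL_TOOLS is exactly 'true'."""
    env = os.environ if env is None else env
    return env.get("AGENT_ENABLE_CONTROL_TOOLS", "") == "true"


def visible_tools(all_tools, env=None):
    """Hide control tools unless the opt-in flag is set (hypothetical helper)."""
    if control_tools_enabled(env):
        return list(all_tools)
    return [tool for tool in all_tools if tool not in CONTROL_TOOLS]
```

With the variable unset, `visible_tools(["get_metrics", "kill_query"], env={})` keeps only `get_metrics`. The confirmation step the agent performs before each control action is separate from this flag and is not modeled here.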

Skills

Skills are bundled expert guides the agent loads on demand when a question needs domain depth. Nine skills are included:

Skill	Covers
`clickhouse-best-practices`	Schema design, query tuning, operational guidelines
`query-optimization`	PREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes
`system-tables-reference`	Exact columns of key system tables; when to use tools vs raw SQL
`troubleshooting`	OOM, slow merges, stuck mutations, error-code diagnosis
`replication-guide`	ReplicatedMergeTree, failover, lag diagnosis, Keeper
`cluster-operations`	Distributed tables, resharding, node management, topology
`storage-optimization`	Compression codecs, TTL, tiered storage, part management
`migration-patterns`	ALTER patterns, zero-downtime schema changes
`security-hardening`	RBAC, row policies, quotas, audit logging

To list the available skills, call `GET /api/v1/agent/skills`.
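
A minimal sketch of calling the skills endpoint from Python, using only the standard library. The base URL is a placeholder for wherever your chmonitor instance runs, and the sketch assumes the endpoint returns JSON and needs no authentication; neither is confirmed on this page, and the shape of the response is not specified here.

```python
import json
import urllib.request


def skills_url(base_url):
    """Join a base URL with the skills path documented above."""
    return base_url.rstrip("/") + "/api/v1/agent/skills"


def fetch_skills(base_url, timeout=10.0):
    """GET the skills list and parse the body as JSON (assumed format)."""
    request = urllib.request.Request(
        skills_url(base_url),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_skills("http://localhost:3000")` would request `http://localhost:3000/api/v1/agent/skills`; the host and port are assumptions and should be replaced with your deployment's address.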

Dynamic workflow templates

For multi-step tasks the agent picks a template, instantiates it into a live checklist, and adapts it as results come in.

Template	Purpose
`incident-investigation`	Triage a reported problem to a root cause
`health-check`	Full cluster health sweep
`query-optimization`	Analyze and speed up a slow query
`capacity-planning`	Forecast storage and resource needs
`replication-triage`	Diagnose replication lag or failover
`migration-safety`	Assess a schema change before applying it

Simple one-step questions skip the planning harness entirely.

Findings persistence

The agent can persist structured findings to an app-owned `monitoring_findings` table (MergeTree, 30-day TTL). Writes are best-effort: on read-only clusters they silently do nothing.

Retrieve findings via the API: `GET /api/v1/findings`.
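
As a rough sketch, the findings request can be built separately from sending it, which makes it easy to inspect before anything goes over the network. The helper names and the base URL are hypothetical, and the sketch assumes a JSON response with no authentication. Any query parameters the endpoint may accept are not documented on this page, so none are used.

```python
import json
import urllib.request


def build_findings_request(base_url):
    """Build (but do not send) a GET request for the findings endpoint."""
    url = base_url.rstrip("/") + "/api/v1/findings"
    return urllib.request.Request(
        url,
        headers={"Accept": "application/json"},
        method="GET",
    )


def fetch_findings(base_url, timeout=10.0):
    """Send the request and parse the body as JSON (assumed format)."""
    request = build_findings_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Because writes are best-effort, an empty result does not by itself show that nothing was found; on a read-only cluster the findings may simply never have been stored.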

Example questions

Ask in plain English:

“Which queries are running right now and how long have they been executing?” The agent lists query ID, user, elapsed time, and memory usage, sorted by duration.
“What were the 10 slowest queries in the last 24 hours?” The agent fetches them from the query log and offers to run EXPLAIN on any of them.
“Why is the cluster slow right now?” The agent loads the `troubleshooting` skill, scans for anomalies, and correlates merges and errors into a root-cause summary.
“Show me query volume per hour over the last day as a chart.” The agent runs the aggregation and renders a line chart inline.
“Is it safe to drop the event_payload column from analytics.events?” The agent checks column usage and classifies the ALTER risk before making a recommendation.