chmonitor

The agent has ~44 tools organized into categories. You never call tools directly — the agent picks and chains them automatically. This page summarizes what is available and what you can ask.

Tool categories

| Category | What it covers | Example tools |
| --- | --- | --- |
| Schema & exploration | Databases, tables, columns, data source discovery | list_tables, get_table_schema, discover_data_sources |
| Query analysis | Running, slow, failed, expensive queries; EXPLAIN; optimization | get_running_queries, get_slow_queries, explain_query, analyze_query_optimization |
| System health | Server metrics, CPU/memory/disk, errors, crash log, anomalies | get_metrics, get_system_resources, detect_anomalies, generate_health_report |
| Findings persistence | Record and retrieve structured monitoring findings | record_finding, list_recent_findings |
| Storage, merges & mutations | Part sizes, active merges, stuck mutations, merge throughput | get_table_parts, get_merge_status, get_mutations |
| Replication & cluster | Lag, queue, ZooKeeper/Keeper, topology, DDL queue | get_replication_status, get_clusters, get_zookeeper_info |
| Security & audit | Sessions, login attempts, users and roles | get_active_sessions, get_login_attempts, get_users_and_roles |
| Schema migration | ALTER impact assessment, column usage, table design | analyze_schema_change, get_column_usage, recommend_table_design |
| Comparison & forecasting | Cross-host, time-period, capacity forecast | compare_hosts, compare_time_periods, forecast_capacity |
| Charts & visualization | Run SQL and return an interactive chart | query_and_visualize |
| Settings & logs | Changed server settings, text log, stack traces | get_settings, get_text_log, get_stack_traces |
| Control actions | Kill query/mutation, optimize table (disabled by default) | kill_query, kill_mutation, optimize_table |
| Interaction & knowledge | Skills, workflows, context snapshot, clarifying questions | load_skill, start_workflow, get_context |

Control tools (kill_query, kill_mutation, optimize_table) are disabled unless AGENT_ENABLE_CONTROL_TOOLS=true. The agent always confirms before using them.
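
As an illustration of the gating rule above, the following Python sketch filters the three control tools out of a tool list unless the environment variable is set. It is not the agent's actual implementation: the function names are hypothetical, and treating the value case-insensitively is an assumption.

```python
import os

# The three control tools named in this page.
CONTROL_TOOLS = {"kill_query", "kill_mutation", "optimize_table"}


def control_tools_enabled(env=None):
    """Return True only when AGENT_ENABLE_CONTROL_TOOLS is set to 'true'.

    Assumption: the comparison ignores case and surrounding whitespace;
    the real agent may parse the value differently.
    """
    env = os.environ if env is None else env
    return env.get("AGENT_ENABLE_CONTROL_TOOLS", "").strip().lower() == "true"


def available_tools(all_tools, env=None):
    """Drop control tools from the list unless they are explicitly enabled."""
    if control_tools_enabled(env):
        return list(all_tools)
    return [tool for tool in all_tools if tool not in CONTROL_TOOLS]
```

With an empty environment, `available_tools(["list_tables", "kill_query"], env={})` keeps only `list_tables`; the confirmation step the agent performs before running a control tool is separate and not modeled here.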

Skills

Skills are bundled expert guides the agent loads on demand when a question needs domain depth. Nine skills are included:

| Skill | Covers |
| --- | --- |
| clickhouse-best-practices | Schema design, query tuning, operational guidelines |
| query-optimization | PREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes |
| system-tables-reference | Exact columns of key system tables; when to use tools vs raw SQL |
| troubleshooting | OOM, slow merges, stuck mutations, error-code diagnosis |
| replication-guide | ReplicatedMergeTree, failover, lag diagnosis, Keeper |
| cluster-operations | Distributed tables, resharding, node management, topology |
| storage-optimization | Compression codecs, TTL, tiered storage, part management |
| migration-patterns | ALTER patterns, zero-downtime schema changes |
| security-hardening | RBAC, row policies, quotas, audit logging |

List available skills: GET /api/v1/agent/skills.
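
A minimal Python sketch for calling this endpoint with only the standard library is shown here. The base URL is a placeholder, and both the absence of authentication and a JSON response body are assumptions; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with wherever the app is served.
BASE_URL = "http://localhost:8000"


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """Send a GET request and decode the body as JSON.

    Assumption: the endpoint returns JSON and needs no auth headers.
    """
    request = urllib.request.Request(
        endpoint_url(path, base_url),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json("/api/v1/agent/skills")` would then return whatever structure the server sends back; the response shape is not documented on this page, so the sketch does not assume one.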

Dynamic workflow templates

For multi-step tasks the agent picks a template, instantiates it into a live checklist, and adapts it as results come in.

| Template | Purpose |
| --- | --- |
| incident-investigation | Triage a reported problem to a root cause |
| health-check | Full cluster health sweep |
| query-optimization | Analyze and speed up a slow query |
| capacity-planning | Forecast storage and resource needs |
| replication-triage | Diagnose replication lag or failover |
| migration-safety | Assess a schema change before applying it |

Simple one-step questions skip the planning harness entirely.
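
The idea of turning a template into a live checklist that adapts as results arrive can be pictured with the small Python model below. It is purely illustrative: the class names, methods, and step descriptions are hypothetical and do not reflect the agent's internal data structures or the actual contents of any template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str
    done: bool = False


@dataclass
class Checklist:
    template: str
    steps: List[Step] = field(default_factory=list)

    def complete(self, index: int) -> None:
        """Mark one step as finished."""
        self.steps[index].done = True

    def add_after(self, index: int, description: str) -> None:
        """Adapt the plan by inserting a new step after an existing one."""
        self.steps.insert(index + 1, Step(description))

    def pending(self) -> List[str]:
        """Return descriptions of the steps still to run, in order."""
        return [step.description for step in self.steps if not step.done]


def instantiate(template: str, step_descriptions: List[str]) -> Checklist:
    """Create a fresh checklist from a template name and its steps."""
    return Checklist(template, [Step(d) for d in step_descriptions])


# Hypothetical steps; the real replication-triage template may differ.
plan = instantiate(
    "replication-triage",
    ["check replication lag", "inspect replication queue"],
)
plan.complete(0)
plan.add_after(0, "check Keeper connectivity")  # added in response to a result
```

After these calls, `plan.pending()` lists the inserted Keeper check followed by the original queue inspection, which mirrors how a plan can grow mid-run.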

Findings persistence

The agent can persist structured findings to an app-owned monitoring_findings table (MergeTree, 30-day TTL). Writes are best-effort and silently no-op on read-only clusters.
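
The "best-effort, silent no-op" behavior can be sketched in Python as follows. This is an assumption-laden illustration, not the app's code: `ReadOnlyClusterError`, `record_finding_best_effort`, and the `insert_row` callable are hypothetical names standing in for whatever the real write path uses.

```python
class ReadOnlyClusterError(Exception):
    """Hypothetical error raised when the cluster rejects writes."""


def record_finding_best_effort(insert_row, finding):
    """Try to persist a finding; swallow read-only failures.

    insert_row is any callable that writes one row (hypothetical here).
    Returns True if the write went through, False if it was skipped.
    """
    try:
        insert_row(finding)
        return True
    except ReadOnlyClusterError:
        # Silent no-op: the caller's main task continues unaffected.
        return False
```

The design point is that a failed write never interrupts the investigation itself; the caller can ignore the return value entirely.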

Retrieve findings via API: GET /api/v1/findings.

Example questions

Ask in plain English:

  • “Which queries are running right now and how long have they been executing?” — lists query id, user, elapsed time, and memory, sorted by duration.

  • “What were the 10 slowest queries in the last 24 hours?” — fetches from query log and offers to EXPLAIN any of them.

  • “Why is the cluster slow right now?” — loads the troubleshooting skill, scans for anomalies, correlates merges and errors into a root-cause summary.

  • “Show me query volume per hour over the last day as a chart.” — runs the aggregation and renders a line chart inline.

  • “Is it safe to drop the event_payload column from analytics.events?” — checks column usage and classifies the ALTER risk before recommending.