The agent has ~44 tools organized into categories. You never call tools directly — the agent picks and chains them automatically. This page summarizes what is available and what you can ask.
Tool categories
| Category | What it covers | Example tools |
|---|---|---|
| Schema & exploration | Databases, tables, columns, data source discovery | list_tables, get_table_schema, discover_data_sources |
| Query analysis | Running, slow, failed, expensive queries; EXPLAIN; optimization | get_running_queries, get_slow_queries, explain_query, analyze_query_optimization |
| System health | Server metrics, CPU/memory/disk, errors, crash log, anomalies | get_metrics, get_system_resources, detect_anomalies, generate_health_report |
| Findings persistence | Record and retrieve structured monitoring findings | record_finding, list_recent_findings |
| Storage, merges & mutations | Part sizes, active merges, stuck mutations, merge throughput | get_table_parts, get_merge_status, get_mutations |
| Replication & cluster | Lag, queue, ZooKeeper/Keeper, topology, DDL queue | get_replication_status, get_clusters, get_zookeeper_info |
| Security & audit | Sessions, login attempts, users and roles | get_active_sessions, get_login_attempts, get_users_and_roles |
| Schema migration | ALTER impact assessment, column usage, table design | analyze_schema_change, get_column_usage, recommend_table_design |
| Comparison & forecasting | Cross-host, time-period, capacity forecast | compare_hosts, compare_time_periods, forecast_capacity |
| Charts & visualization | Run SQL and return an interactive chart | query_and_visualize |
| Settings & logs | Changed server settings, text log, stack traces | get_settings, get_text_log, get_stack_traces |
| Control actions | Kill query/mutation, optimize table (disabled by default) | kill_query, kill_mutation, optimize_table |
| Interaction & knowledge | Skills, workflows, context snapshot, clarifying questions | load_skill, start_workflow, get_context |
Control tools (kill_query, kill_mutation, optimize_table) are disabled unless AGENT_ENABLE_CONTROL_TOOLS=true. The agent always confirms before using them.
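As a rough illustration of the gate described above, the sketch below filters a tool list based on the environment variable. The variable name and the three control tool names come from this page; the helper names `control_tools_enabled` and `enabled_tools`, and the case-insensitive match on `true`, are assumptions rather than the agent's actual implementation.

```python
import os

# Control tools named on this page.
CONTROL_TOOLS = {"kill_query", "kill_mutation", "optimize_table"}


def control_tools_enabled(env=None):
    """Return True only when AGENT_ENABLE_CONTROL_TOOLS is set to 'true'.

    Assumption: the comparison ignores case and surrounding whitespace;
    the real agent may parse the value differently.
    """
    env = os.environ if env is None else env
    return env.get("AGENT_ENABLE_CONTROL_TOOLS", "").strip().lower() == "true"


def enabled_tools(all_tools, env=None):
    """Drop control tools unless the flag is on (hypothetical helper)."""
    if control_tools_enabled(env):
        return list(all_tools)
    return [tool for tool in all_tools if tool not in CONTROL_TOOLS]
```

With the variable unset, `enabled_tools(["list_tables", "kill_query"], env={})` keeps only `list_tables`. This sketch covers only the on/off switch, not the confirmation step the agent performs before acting.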
Skills
Skills are bundled expert guides the agent loads on demand when a question needs domain depth. Nine skills are included:
| Skill | Covers |
|---|---|
| clickhouse-best-practices | Schema design, query tuning, operational guidelines |
| query-optimization | PREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes |
| system-tables-reference | Exact columns of key system tables; when to use tools vs raw SQL |
| troubleshooting | OOM, slow merges, stuck mutations, error-code diagnosis |
| replication-guide | ReplicatedMergeTree, failover, lag diagnosis, Keeper |
| cluster-operations | Distributed tables, resharding, node management, topology |
| storage-optimization | Compression codecs, TTL, tiered storage, part management |
| migration-patterns | ALTER patterns, zero-downtime schema changes |
| security-hardening | RBAC, row policies, quotas, audit logging |
List available skills: GET /api/v1/agent/skills.
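A minimal sketch of calling this endpoint from Python with only the standard library. The base URL, the absence of authentication, and the JSON response format are all assumptions; this page does not specify them, so adjust for your deployment.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Assumption: the service listens here; replace with your deployment's address.
BASE_URL = "http://localhost:8000"


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL with an API path such as /api/v1/agent/skills."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(path, base_url=BASE_URL, timeout=10):
    """Send a GET request and decode the response body as JSON.

    Assumption: the endpoint returns JSON and needs no auth headers.
    """
    url = endpoint_url(path, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `get_json("/api/v1/agent/skills")` against a running instance would return whatever the endpoint sends; the response shape is not documented here, so inspect it before relying on specific fields.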
Dynamic workflow templates
For multi-step tasks the agent picks a template, instantiates it into a live checklist, and adapts it as results come in.
| Template | Purpose |
|---|---|
| incident-investigation | Triage a reported problem to a root cause |
| health-check | Full cluster health sweep |
| query-optimization | Analyze and speed up a slow query |
| capacity-planning | Forecast storage and resource needs |
| replication-triage | Diagnose replication lag or failover |
| migration-safety | Assess a schema change before applying it |
Simple one-step questions skip the planning harness entirely.
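To make the template-to-checklist idea concrete, here is a small conceptual model. It is not the agent's code: the `Step` and `Checklist` classes, the `instantiate` function, and every step title are hypothetical, since this page does not list what each template contains.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    title: str
    done: bool = False


@dataclass
class Checklist:
    template: str
    steps: list = field(default_factory=list)

    def _index(self, title):
        """Return the position of the step with this title, or raise KeyError."""
        for i, step in enumerate(self.steps):
            if step.title == title:
                return i
        raise KeyError(title)

    def complete(self, title):
        """Mark a step as done."""
        self.steps[self._index(title)].done = True

    def add_after(self, title, new_title):
        """Adapt the plan by inserting a follow-up step after an existing one."""
        self.steps.insert(self._index(title) + 1, Step(new_title))

    def pending(self):
        """Titles of the steps that are still open, in order."""
        return [s.title for s in self.steps if not s.done]


# Hypothetical step titles, for illustration only.
TEMPLATES = {
    "replication-triage": [
        "check replication lag",
        "inspect replication queue",
        "summarize cause",
    ],
}


def instantiate(name):
    """Turn a template into a fresh checklist with every step pending."""
    return Checklist(name, [Step(title) for title in TEMPLATES[name]])
```

In this model, "adapting as results come in" amounts to calling `complete` as steps finish and `add_after` when a result suggests an extra check.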
Findings persistence
The agent can persist structured findings to an app-owned monitoring_findings table (MergeTree, 30-day TTL). Writes are best-effort and silently no-op on read-only clusters.
Retrieve findings via API: GET /api/v1/findings.
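The best-effort write behavior described above can be pictured with the sketch below. Everything in it is illustrative: `ReadOnlyClusterError`, `record_finding_best_effort`, and the `write` callable are hypothetical stand-ins, because this page does not describe how the agent writes to the table.

```python
class ReadOnlyClusterError(Exception):
    """Hypothetical error raised when the cluster rejects writes."""


def record_finding_best_effort(write, finding):
    """Try to persist one finding; return True if written, False if skipped.

    `write` is any callable that inserts a finding. It stands in for the
    agent's real writer, which is not documented on this page.
    """
    try:
        write(finding)
    except ReadOnlyClusterError:
        # Best-effort: swallow the failure so the agent's answer is unaffected.
        return False
    return True
```

The point of the pattern is that a failed write never interrupts the investigation. The cost is that a missing finding gives no error signal, so an empty result from the findings API on a read-only cluster is expected.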
Example questions
Ask in plain English:
- “Which queries are running right now and how long have they been executing?” Lists query id, user, elapsed time, and memory, sorted by duration.
- “What were the 10 slowest queries in the last 24 hours?” Fetches them from the query log and offers to EXPLAIN any of them.
- “Why is the cluster slow right now?” Loads the troubleshooting skill, scans for anomalies, and correlates merges and errors into a root-cause summary.
- “Show me query volume per hour over the last day as a chart.” Runs the aggregation and renders a line chart inline.
- “Is it safe to drop the event_payload column from analytics.events?” Checks column usage and classifies the ALTER risk before making a recommendation.