A Cross-Engine Database Performance Investigation Workflow Using MCP


The most expensive part of a database performance incident is usually not the fix. It is the time spent figuring out what is wrong.

An engineer who knows SQL Server might spend ten minutes identifying a blocking chain that is serialising a critical workload. The same engineer, dropped onto a PostgreSQL incident, might spend an hour ruling out the wrong causes before finding the equivalent issue in pg_stat_activity.

The engine changes. The problem pattern does not. And that gap — between knowing the pattern and knowing the engine — is where investigation time goes.

A cross-engine MCP workflow closes that gap.

The core investigation model

Regardless of the database engine, performance incidents break down into the same sequence of questions:

1. Signal — What changed, and when did it change?

Performance rarely degrades for no reason. Something happened: a deployment, a data volume threshold crossed, a new query hitting production for the first time, a backup job running at the wrong hour. The first diagnostic move is always to establish a timeline. When did response times change? What was different at that moment compared to an hour earlier?

2. Load — Which queries or workloads are dominant right now?

Once you have a timeline, you need to understand what the database is actually doing. Which queries are running? Which are consuming the most CPU, I/O, or elapsed time? Which are waiting? A database that appears busy may be burning resources on one or two queries — or it may be a genuine volume problem across hundreds. You cannot know until you look at the workload distribution.

3. Constraint — What resource or structural limit is the bottleneck?

The dominant workload is putting pressure on something. CPU, memory, disk I/O, network, lock contention, or internal serialisation points. Each constraint has a different remediation path. A CPU-bound query is usually a plan or index problem. A lock-bound query is a concurrency or transaction design problem. An I/O-bound system may be an index, a missing buffer pool, or a storage throughput ceiling. Identifying the constraint correctly determines what you do next.

4. Action — What is the smallest change that reduces user-visible latency?

Not every database problem needs a large fix. The goal during an active incident is to find the smallest change that improves the situation for users. That might be killing a runaway query. Adding a missing index. Adjusting a query hint. Increasing a timeout. The MCP workflow surfaces enough context to make that call quickly, rather than triggering larger work during an outage.

What this looks like in practice

Here is the same incident pattern played out across two different engines, using the same investigation model.

Slow reporting queries on SQL Server

Signal: a monitoring alert fires at 14:23. Latency on the reporting endpoint has risen from 200ms to 4.5 seconds. No deployment happened. The application team reports it started degrading around 14:15.

Load: active session inspection shows a batch reporting query running for 18 minutes. It is the only query with significant elapsed time. It accounts for 91% of current I/O wait.
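The active-session inspection behind this finding can be sketched as a single DMV query. This is a minimal example; the column selection and the 200-character statement truncation are illustrative:

```sql
-- Active requests with their statement text, ordered by elapsed time.
SELECT
    r.session_id,
    r.status,
    r.wait_type,
    r.total_elapsed_time / 1000 AS elapsed_seconds,
    r.cpu_time,
    r.logical_reads,
    SUBSTRING(t.text, 1, 200) AS statement_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID        -- exclude this diagnostic session
ORDER BY r.total_elapsed_time DESC;
```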

Constraint: the execution plan shows a full clustered index scan on a 200 million row table. The query predicate filters on report_date, but no index exists on that column. Previous executions this morning used a different plan — the plan cache was invalidated at 14:12 after a statistics update.
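Once the offending session is identified, its live plan can be pulled straight from the plan cache. The session id 73 here is a hypothetical value carried over from the active-session query:

```sql
-- Retrieve the execution plan for a specific long-running request.
SELECT qp.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = 73;  -- hypothetical session id from the previous step
```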

Action: create a nonclustered index on report_date INCLUDE (account_id, amount, status). The query picks up the new index within seconds. Latency returns to baseline.
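The remediation could look like the following, assuming a hypothetical dbo.reports table matching the incident description. The ONLINE option avoids blocking writes during the build on editions that support it:

```sql
-- Hypothetical table and index names matching the incident description.
CREATE NONCLUSTERED INDEX IX_reports_report_date
    ON dbo.reports (report_date)
    INCLUDE (account_id, amount, status)
    WITH (ONLINE = ON);  -- online build: Enterprise Edition / Azure SQL only
```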

Lock contention on PostgreSQL

Signal: application error rate rises at 09:47. Errors are timeouts on write operations to the orders table. The onset coincides with a batch data import that kicks off at 09:45 each morning.

Load: pg_stat_activity shows 34 sessions blocked, all waiting on a lock. One session, a bulk insert from the import process, has been running for 6 minutes and holds a lock on the orders table.
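A sketch of the blocking-chain query behind this finding, using pg_blocking_pids() (available from PostgreSQL 9.6 onwards):

```sql
-- Blocked sessions and the sessions blocking them.
SELECT
    blocked.pid                   AS blocked_pid,
    blocked.query                 AS blocked_query,
    blocking.pid                  AS blocking_pid,
    now() - blocking.query_start  AS blocking_runtime,
    blocking.query                AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
WHERE blocked.wait_event_type = 'Lock';
```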

Constraint: the import process takes an explicit table-level lock (LOCK TABLE orders IN ACCESS EXCLUSIVE MODE) before running its bulk COPY, serialising all writes for the duration of the import. Every application write is queuing behind it.

Action: immediate remediation is to terminate the blocking import session to restore application writes. Longer term, reschedule the import to off-peak hours, or restructure it to use batched row-level inserts that do not hold a table-level lock.
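The immediate remediation is a single function call. The pid 41235 is a hypothetical value taken from the blocking-chain query:

```sql
-- pg_cancel_backend() is the gentler first attempt; pg_terminate_backend()
-- closes the connection outright.
SELECT pg_cancel_backend(41235);     -- hypothetical blocking pid
SELECT pg_terminate_backend(41235);  -- if the cancel does not take effect
```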

Same four steps. Different engine syntax, different system views, same investigation logic.

Where the MCP server fits

Manually querying sys.dm_exec_requests, pg_stat_activity, information_schema.processlist, or v$session for every incident is tedious and error-prone. Engineers need to know the right views, the right join conditions, and how to interpret the output — across every engine they cover.

A database MCP server handles that layer. It exposes diagnostic tools — active sessions, top queries by resource, wait statistics, execution plans, blocking chains — through a consistent interface. The AI assistant calls those tools, interprets the output, and surfaces findings in plain language.

The engineer doesn’t need to know which DMV to query on SQL Server versus which pg_stat_* view to use on PostgreSQL. They ask “what is causing the slowdown?” and the MCP workflow returns the answer with evidence.

Engine-specific considerations

The four-step model applies everywhere, but each engine has areas worth knowing:

SQL Server and Azure SQL Database: wait statistics via sys.dm_os_wait_stats and sys.dm_exec_requests are the fastest path to understanding constraint type. Blocking chains are common and clearly visible. Plan stability (parameter sniffing, plan cache invalidation) is a frequent source of sudden regressions.
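A minimal top-waits query against sys.dm_os_wait_stats. The list of excluded benign wait types is illustrative; production filters are usually much longer:

```sql
-- Top accumulated waits since the last service restart.
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'XE_TIMER_EVENT', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```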

PostgreSQL: pg_stat_activity and pg_stat_statements cover most investigation needs. Autovacuum interference is a common PostgreSQL-specific issue — tables with high update/delete rates accumulate dead tuples that bloat tables and slow queries over time. Lock waits in PostgreSQL are explicit and visible in pg_locks.
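A quick dead-tuple check against pg_stat_user_tables, a sketch for spotting tables where autovacuum is falling behind:

```sql
-- Tables with the most dead tuples and their last vacuum times.
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```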

MySQL and MariaDB: the Performance Schema (performance_schema.events_statements_summary_by_digest) is the primary tool for query-level analysis. InnoDB-specific waits (row locks, deadlocks) are visible in information_schema.INNODB_TRX, with lock waits in information_schema.INNODB_LOCK_WAITS on MySQL 5.7 and earlier and in performance_schema.data_lock_waits on MySQL 8.0. Slow query log analysis is often a starting point.
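A sketch of the digest query for top statements by total execution time. Timer columns in the Performance Schema are reported in picoseconds, hence the division:

```sql
-- Requires performance_schema to be enabled (the default on MySQL 5.7+).
SELECT
    DIGEST_TEXT,
    COUNT_STAR             AS executions,
    SUM_TIMER_WAIT / 1e12  AS total_seconds,
    AVG_TIMER_WAIT / 1e12  AS avg_seconds,
    SUM_ROWS_EXAMINED
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```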

Oracle: v$sql and v$session provide active query and session state. v$session_wait shows current wait events. Active Session History (v$active_session_history) gives historical wait data, but querying it requires the Diagnostics Pack licence. Statspack provides workload summary data without the AWR licence requirement.
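A minimal active-session query against v$session, which requires no extra licensing; the join to v$sql retrieves the statement text:

```sql
-- Active user sessions with their current wait events.
SELECT
    s.sid,
    s.serial#,
    s.username,
    s.event,            -- current or most recent wait event
    s.seconds_in_wait,
    q.sql_text
FROM v$session s
LEFT JOIN v$sql q
    ON q.sql_id = s.sql_id
   AND q.child_number = s.sql_child_number
WHERE s.status = 'ACTIVE'
  AND s.type = 'USER';
```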

Building a cross-engine investigation practice

The practical outcome of running the same four-step model consistently is that every incident produces the same kind of artefact: a signal, a workload finding, a constraint identification, and an action. Post-incident reviews become more useful. Runbooks become engine-agnostic. Junior engineers improve faster because the model is teachable.

Teams that formalise this approach — either through MCP tooling or through structured runbooks — typically see faster mean-time-to-resolution and fewer escalations to senior specialists.

For a live walkthrough of the MCP investigation workflow, request a miniDBA demo.