IT Incident Report Template

INC-2026-0042: Database connection pool exhaustion


Severity: P1 — Critical
Status: Investigating
Reported by: Thomas Laurent
Assigned to: Sophie Vermeersch

📋 Incident description

At 08:42 on 15 January 2026, the payment gateway began returning 503 errors to approximately 12% of checkout requests. Monitoring alerts triggered within 3 minutes. The root cause was traced to connection pool exhaustion on the primary PostgreSQL cluster, caused by a long-running analytics query that acquired 47 connections without releasing them.

⏱️ Timeline

| Time  | Event                                              | Owner      |
|-------|----------------------------------------------------|------------|
| 08:42 | 503 errors spike on payment gateway                | Monitoring |
| 08:45 | PagerDuty alert acknowledged by S. Vermeersch      | SV         |
| 09:03 | Analytics query identified and terminated          | SV         |
| 09:11 | Connection pool recovered, error rate back to baseline | Monitoring |

🔬 Root cause analysis

The analytics team deployed a new quarterly revenue report at 08:30. The query joined five tables without appropriate indexing on the transaction_items table, resulting in a full table scan across 14.2 million rows. Each scan held a connection for the duration of execution (~4 minutes), and the scheduler launched 12 parallel instances. With the default pool limit set to 50, the 47 analytics connections left only 3 available for production traffic.

✅ Resolution

Immediate fix (09:03)

Terminated the long-running analytics query using pg_terminate_backend(). Connection pool recovered within 8 minutes. All checkout errors resolved by 09:11.
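For reference, a long-running query of this kind can be located and terminated from psql using the system view pg_stat_activity; the 2-minute threshold and the pid below are illustrative, not taken from the incident:

```sql
-- Identify queries that have been active for more than 2 minutes
SELECT pid, usename, now() - query_start AS runtime, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
ORDER BY runtime DESC;

-- Terminate the offending backend (replace 12345 with the pid found above)
SELECT pg_terminate_backend(12345);
```

pg_terminate_backend() closes the session and releases its connection back to the pool, which is why recovery followed within minutes.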

Permanent fix (deployed 16 January)

Added a composite index on transaction_items (order_id, created_at). Query execution time dropped from ~4 minutes to 1.3 seconds. Configured PgBouncer to enforce a per-user connection limit of 10 for the analytics role, preventing any single workload from exhausting the pool.
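The composite index described above can be built without blocking writes by using CONCURRENTLY; the index name here is illustrative:

```sql
-- Composite index matching the permanent fix; CONCURRENTLY avoids
-- holding a lock on transaction_items against writes while it builds
CREATE INDEX CONCURRENTLY idx_transaction_items_order_created
    ON transaction_items (order_id, created_at);
```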

🛡️ Preventive actions

| Action                                                      | Owner | Due        | Status      |
|-------------------------------------------------------------|-------|------------|-------------|
| Add per-role connection limits in PgBouncer                 | SV    | 17/01/2026 | Done        |
| Require EXPLAIN ANALYZE review for queries touching >1M rows | TL    | 31/01/2026 | In progress |
| Add connection pool utilisation alert at 80% threshold      | SV    | 24/01/2026 | Done        |
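One way to enforce a cap like the first action is PgBouncer's max_user_connections setting. Note that this particular knob applies to every user globally, so the exact mechanism in the report (scoped to the analytics role) may differ depending on PgBouncer version:

```ini
; pgbouncer.ini -- cap any single user at 10 server connections,
; so no one workload can exhaust the shared pool
[pgbouncer]
max_user_connections = 10
```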

💡 Lessons learnt

  • What worked: PagerDuty alert triggered within 3 minutes; on-call engineer identified the root cause in 18 minutes.
  • What didn't: No connection pool monitoring existed — the team relied on downstream error rates to detect the problem.
  • What we'd change: Analytics workloads should run on a read replica with isolated connection pools, not on the primary cluster.
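A pool-utilisation check along these lines could back the 80% alert added as a preventive action; it measures server-side connections against max_connections, with the actual thresholding left to the monitoring system:

```sql
-- Percentage of max_connections currently in use on the cluster
SELECT count(*)                                            AS connections_in_use,
       current_setting('max_connections')::int             AS max_connections,
       round(100.0 * count(*)
             / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```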

Standardise how your IT team documents service disruptions. This template captures everything from initial detection to root cause analysis and preventive actions — so the next time the same issue hits, the fix is already documented and searchable. Stop losing institutional knowledge to email threads and tribal memory.

Try now in Elium

What is an IT incident report?

An IT incident report is a structured document that captures the full lifecycle of an unplanned service disruption — from initial detection through root cause analysis to final resolution. It serves as both an operational record and a knowledge asset, giving IT teams a single, referenceable account of what happened, why it happened, and what was done to fix it.

Within ITIL-aligned service management, incident reports bridge reactive firefighting and proactive improvement. When teams document incidents consistently, they build a searchable body of evidence that reveals patterns — recurring failures, vulnerable systems, gaps in monitoring. Without this, the same problems resurface because institutional memory lives only in the heads of whoever was on call.

Who should use this template?

This IT incident report template is for teams responsible for IT service continuity:

  • IT Service Desk Managers — standardise how your team documents incidents so nothing falls through the cracks between shifts
  • CIOs and IT Directors — gain visibility into incident trends, MTTR, and recurring failure points
  • Knowledge Managers — ensure post-incident knowledge is captured in a structured, searchable format rather than buried in email threads
  • DevOps and SRE teams — feed incident data into blameless post-mortems and improvement cycles

What’s included in this template?

The template has two parts: structured metadata fields and narrative sections.

Metadata fields capture the essentials:

  • Incident ID and severity level (P1–P4)
  • Affected system or service
  • Timestamps: detection, acknowledgement, resolution
  • Reporter and resolver
  • Status (open, investigating, resolved, closed)

Narrative sections tell the full story:

  • Incident description — what happened, from the user’s perspective
  • Timeline — chronological record of events and actions
  • Root cause analysis — the underlying reason, not just symptoms
  • Resolution — steps taken to restore service
  • Preventive actions — changes to prevent recurrence
  • Lessons learnt — what worked, what didn’t, what to change

How to create and customise this template in Elium

  1. Open the Template Builder — Go to your profile menu and select the Template Builder tab, or click “+ Create” and choose “Create a new template”.
  2. Set the scope — Choose an icon, enable the template, and decide whether it applies platform-wide or to specific spaces (e.g. your IT Support space only).
  3. Add structured fields — Click “Field” to add the metadata your team needs: text fields for Incident ID, a tag field for severity level, date fields for detection and resolution timestamps, and user fields for reporter and resolver. Mark critical fields as mandatory.
  4. Build the body structure — Use the “+” button to add content blocks for each narrative section: incident description, timeline, root cause analysis, resolution steps, and preventive actions.
  5. Preview and save — Review the template layout, then save. Team members can now select it when creating new articles, and you can apply it to existing content in bulk.

How AI helps you create and use this template

Capture faster. Elium’s AI assistant drafts incident reports from raw inputs — ticket notes, monitoring alerts, or war room chat logs. Instead of starting from a blank page after a stressful resolution, your team gets a structured first draft to review and refine.

Retrieve smarter. When a new incident occurs, ask Elium’s AI: “What was the root cause of the last database connection pool issue?” or “Show me all P1 incidents affecting the payment gateway this year.” Specific, sourced answers — not generic guesses.

Why teams use Elium for incident management

The real value of incident reports appears over time. A single report is useful; hundreds of consistently structured reports become a strategic asset. Elium makes this practical by combining structured fields with AI-powered search, so teams surface patterns that would otherwise stay hidden.

VINCI Energies — 97,000 employees across 61 countries — faced exactly this challenge. IT support procedures were scattered across Word documents, SharePoint sites, and email threads. By centralising knowledge in Elium, they made the right information findable at the right time, cutting resolution delays and eliminating duplicate effort across support tiers.

For incident management, this means faster resolution of recurring issues (the fix is already documented), better onboarding for new staff (they learn from real incidents), and clearer reporting (structured metadata enables trend analysis without manual spreadsheets).

Frequently asked questions

What is an IT incident report and why does it matter?

An IT incident report is a formal record of an unplanned service disruption — documenting what happened, the root cause, and how it was resolved. Without consistent documentation, organisations lose the lessons from each incident. Teams diagnose the same failures repeatedly, resolution times stay high, and critical knowledge leaves when staff move on.

What should an incident report include?

A complete incident report includes structured metadata (incident ID, severity, affected system, timestamps, resolver) and narrative sections covering the incident description, chronological timeline, root cause analysis, resolution steps, and preventive actions. The best reports also capture lessons learnt — what went well, what didn’t, and what the team would change next time.

What are the benefits of consistent incident reporting?

Consistent incident reporting reduces mean time to resolution by making previous fixes searchable and reusable. It reveals patterns across incidents — recurring failures, vulnerable systems, and gaps in monitoring — that would otherwise stay hidden. Over time, your incident knowledge base becomes a training resource for new staff and a data source for management reporting on service reliability trends.

How do you write an incident report?

Start with the facts: when the incident was detected, what was affected, and how severe the impact was. Document the timeline of actions taken, then identify the root cause — not just the symptoms. Finish with specific preventive actions and assign owners. Keep the language factual; the goal is to improve systems, not find fault.

What is the difference between an incident and a problem?

An incident is a single unplanned disruption — something that needs restoring now. A problem is the underlying cause behind one or more incidents. Incident management restores service quickly; problem management investigates why incidents happen and eliminates root causes. An incident report documents the event; a problem record tracks the deeper investigation.

Related reading: Read more on our blog