At 08:42 on 15 January 2026, the payment gateway began returning 503 errors to approximately 12% of checkout requests. Monitoring alerts triggered within 3 minutes. The root cause was traced to connection pool exhaustion on the primary PostgreSQL cluster, caused by a long-running analytics query that acquired 47 connections without releasing them.
| Time | Event | Owner |
|---|---|---|
| 08:42 | 503 errors spike on payment gateway | Monitoring |
| 08:45 | PagerDuty alert acknowledged by S. Vermeersch | SV |
| 09:03 | Analytics query identified and terminated | SV |
| 09:11 | Connection pool recovered, error rate back to baseline | Monitoring |
The analytics team deployed a new quarterly revenue report at 08:30. The query joined five tables without appropriate indexing on the transaction_items table, resulting in a full table scan across 14.2 million rows. Each scan held a connection for the duration of execution (~4 minutes), and the scheduler launched 12 parallel instances. With the default pool limit set to 50, the 47 analytics connections left only 3 available for production traffic.
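A situation like this is typically diagnosed by querying `pg_stat_activity`, PostgreSQL's built-in view of active sessions. A sketch of such a diagnostic query (column names are standard; the grouping by role is an assumption about how one would spot the analytics workload):

```sql
-- Count connections per role and find each role's longest-running
-- query, to spot a workload monopolising the pool.
SELECT usename,
       count(*)                 AS connections,
       max(now() - query_start) AS longest_running
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY usename
ORDER BY connections DESC;
```

In the scenario above, this would have shown the analytics role holding 47 of the 50 available connections.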
**Immediate fix (09:03)**
Terminated the long-running analytics query using pg_terminate_backend(). Connection pool recovered within 8 minutes. All checkout errors resolved by 09:11.
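The report names `pg_terminate_backend()` as the mechanism. A hedged sketch of how the offending sessions might have been terminated in bulk (the role name and the 5-minute threshold are illustrative, not taken from the incident):

```sql
-- Terminate all active sessions from the analytics role whose
-- current query has been running longer than 5 minutes.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'analytics'
  AND state = 'active'
  AND now() - query_start > interval '5 minutes';
```

Unlike `pg_cancel_backend()`, which only cancels the current query, `pg_terminate_backend()` closes the session entirely and releases its connection back to the pool.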
**Permanent fix (deployed 16 January)**
Added a composite index on transaction_items (order_id, created_at). Query execution time dropped from ~4 minutes to 1.3 seconds. Configured PgBouncer to enforce a per-user connection limit of 10 for the analytics role, preventing any single workload from exhausting the pool.
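The two permanent fixes might look like the following sketches. The index DDL follows directly from the report; the index name is a hypothetical convention, and `CONCURRENTLY` is an assumption about how one would avoid blocking writes on a 14-million-row production table:

```sql
-- Composite index on the columns named in the report.
-- CONCURRENTLY builds without taking a write-blocking lock.
CREATE INDEX CONCURRENTLY idx_transaction_items_order_created
    ON transaction_items (order_id, created_at);
```

For the PgBouncer side, a per-user override in `pgbouncer.ini` is one plausible shape (check the exact option name against the PgBouncer version in use):

```ini
; pgbouncer.ini -- cap the analytics role at 10 server connections
[users]
analytics = max_user_connections=10
```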
| Action | Owner | Due | Status |
|---|---|---|---|
| Add per-role connection limits in PgBouncer | SV | 17/01/2026 | Done |
| Require EXPLAIN ANALYZE review for queries touching >1M rows | TL | 31/01/2026 | In progress |
| Add connection pool utilisation alert at 80% threshold | SV | 24/01/2026 | Done |
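The 80% utilisation alert in the table above could be backed by a simple polling query; this is a sketch that polls the primary directly (an alternative would be PgBouncer's `SHOW POOLS`), and the 0.80 threshold comparison would live in the monitoring tool:

```sql
-- Fraction of max_connections currently in use; alert when > 0.80.
SELECT count(*)::float
       / current_setting('max_connections')::int AS pool_utilisation
FROM pg_stat_activity;
```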
Standardise how your IT team documents service disruptions. This template captures everything from initial detection to root cause analysis and preventive actions — so the next time the same issue hits, the fix is already documented and searchable. Stop losing institutional knowledge to email threads and tribal memory.
An IT incident report is a structured document that captures the full lifecycle of an unplanned service disruption — from initial detection through root cause analysis to final resolution. It serves as both an operational record and a knowledge asset, giving IT teams a single, referenceable account of what happened, why it happened, and what was done to fix it.
Within ITIL-aligned service management, incident reports bridge reactive firefighting and proactive improvement. When teams document incidents consistently, they build a searchable body of evidence that reveals patterns — recurring failures, vulnerable systems, gaps in monitoring. Without this, the same problems resurface because institutional memory lives only in the heads of whoever was on call.
This IT incident report template is for teams responsible for IT service continuity:
The template has two parts: structured metadata fields and narrative sections.
Metadata fields capture the essentials:
Narrative sections tell the full story:
Capture faster. Elium’s AI assistant drafts incident reports from raw inputs — ticket notes, monitoring alerts, or war room chat logs. Instead of starting from a blank page after a stressful resolution, your team gets a structured first draft to review and refine.
Retrieve smarter. When a new incident occurs, ask Elium’s AI: “What was the root cause of the last database connection pool issue?” or “Show me all P1 incidents affecting the payment gateway this year.” Specific, sourced answers — not generic guesses.
The real value of incident reports appears over time. A single report is useful; hundreds of consistently structured reports become a strategic asset. Elium makes this practical by combining structured fields with AI-powered search, so teams surface patterns that would otherwise stay hidden.
VINCI Energies — 97,000 employees across 61 countries — faced exactly this challenge. IT support procedures were scattered across Word documents, SharePoint sites, and email threads. By centralising knowledge in Elium, they made the right information findable at the right time, cutting resolution delays and eliminating duplicate effort across support tiers.
For incident management, this means faster resolution of recurring issues (the fix is already documented), better onboarding for new staff (they learn from real incidents), and clearer reporting (structured metadata enables trend analysis without manual spreadsheets).