
Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Structured template for documenting root cause, impact, and follow-up actions after a QuantMatrix incident.

QuantMatrix Post-Incident Review Template

This template is for reviewing incidents after they have been stabilized.

Its purpose is to:

  • document what actually happened
  • separate facts from assumptions
  • understand impact clearly
  • identify root causes and contributing factors
  • assign corrective actions
  • improve the system, runbooks, alerts, and operator workflows

This review should be completed for:

  • all Severity 1 incidents
  • all Severity 2 incidents
  • any incident that exposed a serious weakness in safety, execution, reconciliation, or analytics
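The criteria above can be expressed as a simple check, for example in tooling that opens a review automatically. This is a minimal sketch, not part of QuantMatrix: the `Incident` type and the `exposed_serious_weakness` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    severity: int                    # 1, 2, or 3, as in the Incident Metadata section
    exposed_serious_weakness: bool   # hypothetical flag, set by whoever triages the incident


def review_required(incident: Incident) -> bool:
    """Return True when the criteria above call for a post-incident review."""
    return incident.severity in (1, 2) or incident.exposed_serious_weakness
```

A Severity 3 incident only requires a review when the weakness flag is set, which keeps the judgment call explicit rather than implied.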

1. Incident Metadata

  • Incident Title:
  • Incident ID:
  • Severity: 1 / 2 / 3
  • Date:
  • Start Time:
  • End Time:
  • Duration:
  • Environment: Local / Paper / Live
  • Trading Mode:
  • Review Owner:
  • Operator Involved:
  • Engineer Involved:
  • Strategy Owner Involved:
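The Duration field can be derived from Start Time and End Time rather than entered by hand. The sketch below assumes both values are recorded as ISO 8601 strings that include the date; the function name `incident_duration` is hypothetical.

```python
from datetime import datetime, timedelta


def incident_duration(start: str, end: str) -> timedelta:
    """Compute Duration from ISO 8601 Start Time and End Time strings."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        # A negative duration usually means the two fields were swapped or mistyped.
        raise ValueError("End Time is earlier than Start Time")
    return ended - started
```

Including the date in both fields avoids ambiguity for incidents that run past midnight.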

2. Executive Summary

Short Summary

  • What happened:
  • Why it mattered:
  • How it was stabilized:

Final Outcome

  • Resolved: Yes / No / Partially
  • Trading halted: Yes / No
  • Positions liquidated: Yes / No
  • Follow-up required: Yes / No

3. Incident Timeline

Record events in order.

For each event capture:

  • Timestamp
  • Event
  • Source
  • Action taken
  • Outcome

Example rows:

  • 09:32:14 - Duplicate order suspected - Operator UI - Halted new entries - Confirmed broker had two live orders
  • 09:34:01 - Broker positions checked - Broker API - Reconciliation started - Exposure confirmed
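If the timeline is kept in structured form, the five fields map naturally onto a small record type. The following is an illustrative sketch only: `TimelineEvent` and `in_order` are hypothetical names, and sorting by time of day assumes the incident stayed within a single day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: time
    event: str
    source: str
    action_taken: str
    outcome: str


def in_order(events):
    """Return events sorted by timestamp (assumes a single-day incident)."""
    return sorted(events, key=lambda e: e.timestamp)


# The two example rows above, deliberately entered out of order.
timeline = in_order([
    TimelineEvent(time(9, 34, 1), "Broker positions checked", "Broker API",
                  "Reconciliation started", "Exposure confirmed"),
    TimelineEvent(time(9, 32, 14), "Duplicate order suspected", "Operator UI",
                  "Halted new entries", "Confirmed broker had two live orders"),
])
```

Sorting on entry means events gathered from different sources (UI, broker, logs) can be added as they are found and still read in order.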

4. Impact Assessment

Trading Impact

  • Real capital affected: Yes / No
  • Symbols affected:
  • Orders affected:
  • Positions affected:
  • Realized P&L impact:
  • Unrealized P&L impact:
  • Missed opportunities:

Operational Impact

  • Session interrupted: Yes / No
  • System halted: Yes / No
  • Manual intervention required: Yes / No
  • Operator confidence impacted: Yes / No

Customer / Stakeholder Impact

  • Who was impacted:
  • What they observed:
  • Whether external communication was needed:

5. Detection

How Was The Incident First Detected?

  • UI alert
  • Broker alert
  • Operator observation
  • Log review
  • Reconciliation mismatch
  • Analytics anomaly
  • Other:

Detection Quality

  • Was the detection fast enough?
  • Was the alert clear enough?
  • Could the issue have been detected earlier?

6. What Was Known vs Unknown

Facts Confirmed During Incident

-

Unknowns During Incident

-

Wrong Assumptions Made

-

7. Root Cause Analysis

Immediate Cause

  • What directly triggered the incident?

Contributing Factors

  • What else made the incident possible or worse?

Deeper System Cause

  • What underlying design, implementation, process, or operational weakness allowed it?

Root Cause Category

Mark all that apply:

  • [ ] Broker/API behavior
  • [ ] Order management logic
  • [ ] Risk control gap
  • [ ] Reconciliation gap
  • [ ] Market data quality
  • [ ] Strategy logic
  • [ ] UI/operator confusion
  • [ ] Missing alert
  • [ ] Missing test coverage
  • [ ] Deployment/configuration issue
  • [ ] Manual process weakness

8. Response Review

What Actions Were Taken?

-

What Worked Well?

-

What Did Not Work Well?

-

Was The Severity Assigned Correctly?

  • Yes / No
  • If not, what should it have been?

Was The Halt / Liquidation Decision Appropriate?

  • Yes / No / Not applicable
  • Notes:

9. Broker and State Verification Review

  • Was broker state checked quickly enough?
  • Were open orders verified?
  • Were open positions verified?
  • Was broker state treated as authoritative when appropriate?
  • Did reconciliation succeed?
  • Were there any remaining state mismatches after stabilization?
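To answer the last question concretely, the broker's positions can be compared against the system's local view. This is a rough sketch under stated assumptions: both sides are available as symbol-to-quantity mappings, a missing symbol means a flat position, and `position_mismatches` is a hypothetical helper rather than an existing QuantMatrix function.

```python
def position_mismatches(
    broker: dict[str, int], local: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Map each symbol whose quantities differ to (broker_qty, local_qty)."""
    mismatches = {}
    # Union of keys so positions known to only one side are also reported.
    for symbol in sorted(broker.keys() | local.keys()):
        broker_qty = broker.get(symbol, 0)   # missing symbol is treated as flat
        local_qty = local.get(symbol, 0)
        if broker_qty != local_qty:
            mismatches[symbol] = (broker_qty, local_qty)
    return mismatches
```

An empty result indicates no remaining position mismatches. The broker quantity is listed first in each pair, in line with treating broker state as authoritative when appropriate.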

10. Risk and Safety Review

  • Did risk controls behave correctly?
  • Did any risk control fail or get bypassed?
  • Was the emergency halt available?
  • Did the operator know what to do?
  • Was there any near-miss that could have become worse?

11. Analytics and Logging Review

  • Were logs sufficient to reconstruct the incident?
  • Were analytics records preserved correctly?
  • Were order, execution, and trade identifiers available?
  • Did any telemetry go missing?
  • Were recommendations or analytics outputs affected?

12. What Should Change

Code / System Changes

  • [ ]
  • [ ]
  • [ ]

Risk Rule Changes

  • [ ]
  • [ ]
  • [ ]

Monitoring / Alerting Changes

  • [ ]
  • [ ]
  • [ ]

Runbook / Process Changes

  • [ ]
  • [ ]
  • [ ]

UI / Operator Experience Changes

  • [ ]
  • [ ]
  • [ ]

Test Coverage Changes

  • [ ]
  • [ ]
  • [ ]

13. Corrective Actions

For each action item capture:

  • Action
  • Owner
  • Priority
  • Due Date
  • Status
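Tracking action items as records makes it easy to list the ones that have slipped. The sketch below is one possible shape, not a prescribed format: the `ActionItem` type, the `overdue` helper, and the status and priority values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    priority: str          # hypothetical scale, e.g. "High" / "Medium" / "Low"
    due_date: date
    status: str = "Open"   # hypothetical values: "Open", "In progress", "Done"


def overdue(items, today: date):
    """Return items that are not done and whose due date has passed."""
    return [item for item in items if item.status != "Done" and item.due_date < today]
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check reproducible when it is run against an old review.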

14. Prevention Plan

What Will Prevent Recurrence?

  • Automated test additions:
  • Alert additions:
  • Guardrail additions:
  • UI improvements:
  • Runbook updates:
  • Deployment checks:

Residual Risk

  • What risk remains even after fixes?

15. Lessons Learned

Technical Lessons

-

Operational Lessons

-

Strategy / Trading Lessons

-

16. Final Review Decision

  • Incident fully understood: Yes / No / Partially
  • Corrective actions sufficient: Yes / No
  • Safe to resume normal operations: Yes / No / With restrictions
  • Extra monitoring required next session: Yes / No
  • Related releases should be paused: Yes / No

17. Signoff

  • Review Owner:
  • Operator:
  • Engineer:
  • Strategy Owner:
  • Completion Date:

18. Suggested Attachments

Attach when available:

  • screenshots
  • broker order export
  • position snapshot
  • relevant logs
  • reconciliation output
  • incident chat transcript
  • analytics output related to affected trades