Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Structured template for documenting root cause, impact, and follow-up actions after a QuantMatrix incident.
QuantMatrix Post-Incident Review Template
This template is for reviewing incidents after they have been stabilized.
Its purpose is to:
- document what actually happened
- separate facts from assumptions
- understand impact clearly
- identify root causes and contributing factors
- assign corrective actions
- improve the system, runbooks, alerts, and operator workflows
This review should be completed for:
- all Severity 1 incidents
- all Severity 2 incidents
- any incident that exposed a serious weakness in safety, execution, reconciliation, or analytics
1. Incident Metadata
- Incident Title:
- Incident ID:
- Severity: 1 / 2 / 3
- Date:
- Start Time:
- End Time:
- Duration:
- Environment: Local / Paper / Live
- Trading Mode:
- Review Owner:
- Operator Involved:
- Engineer Involved:
- Strategy Owner Involved:
2. Executive Summary
Short Summary
- What happened:
- Why it mattered:
- How it was stabilized:
Final Outcome
- Resolved: Yes / No / Partially
- Trading halted: Yes / No
- Positions liquidated: Yes / No
- Follow-up required: Yes / No
3. Incident Timeline
Record events in order.
For each event capture:
- Timestamp
- Event
- Source
- Action taken
- Outcome
Example rows:
- 09:32:14 - Duplicate order suspected - Operator UI - Halted new entries - Confirmed broker had two live orders
- 09:34:01 - Broker positions checked - Broker API - Reconciliation started - Exposure confirmed
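
If the timeline is assembled from several sources (UI notes, broker exports, logs), a small record type can help keep rows consistent and in order. The following is a minimal sketch, not part of QuantMatrix: `TimelineEvent` and `ordered_rows` are hypothetical names, the fields mirror the capture list above, and it assumes the incident falls within a single calendar day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class TimelineEvent:
    """One timeline row; fields mirror the capture list in this section."""
    timestamp: time
    event: str
    source: str
    action_taken: str
    outcome: str

    def as_row(self) -> str:
        # Same " - " separated layout as the example rows in the template.
        return " - ".join([
            self.timestamp.strftime("%H:%M:%S"),
            self.event,
            self.source,
            self.action_taken,
            self.outcome,
        ])


def ordered_rows(events: list[TimelineEvent]) -> list[str]:
    """Return formatted rows sorted by timestamp (single-day assumption)."""
    return [e.as_row() for e in sorted(events, key=lambda e: e.timestamp)]
```

Sorting at the point of output means events can be recorded in whatever order they are discovered during the review.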
4. Impact Assessment
Trading Impact
- Real capital affected: Yes / No
- Symbols affected:
- Orders affected:
- Positions affected:
- Realized P&L impact:
- Unrealized P&L impact:
- Missed opportunities:
Operational Impact
- Session interrupted: Yes / No
- System halted: Yes / No
- Manual intervention required: Yes / No
- Operator confidence impacted: Yes / No
Customer / Stakeholder Impact
- Who was impacted:
- What they observed:
- Whether external communication was needed:
5. Detection
How Was The Incident First Detected?
- UI alert
- Broker alert
- Operator observation
- Log review
- Reconciliation mismatch
- Analytics anomaly
- Other:
Detection Quality
- Was the detection fast enough?
- Was the alert clear enough?
- Could the issue have been detected earlier?
6. What Was Known vs Unknown
Facts Confirmed During Incident
-
Unknowns During Incident
-
Wrong Assumptions Made
-
7. Root Cause Analysis
Immediate Cause
- What directly triggered the incident?
Contributing Factors
- What else made the incident possible or worse?
Deeper System Cause
- What underlying design, implementation, process, or operational weakness allowed it?
Root Cause Category
Mark all that apply:
- [ ] Broker/API behavior
- [ ] Order management logic
- [ ] Risk control gap
- [ ] Reconciliation gap
- [ ] Market data quality
- [ ] Strategy logic
- [ ] UI/operator confusion
- [ ] Missing alert
- [ ] Missing test coverage
- [ ] Deployment/configuration issue
- [ ] Manual process weakness
8. Response Review
What Actions Were Taken?
-
What Worked Well?
-
What Did Not Work Well?
-
Was The Severity Assigned Correctly?
- Yes / No
- If not, what should it have been?
Was The Halt / Liquidation Decision Appropriate?
- Yes / No / Not applicable
- Notes:
9. Broker and State Verification Review
- Was broker state checked quickly enough?
- Were open orders verified?
- Were open positions verified?
- Was broker state treated as authoritative when appropriate?
- Did reconciliation succeed?
- Were there any remaining state mismatches after stabilization?
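
The questions about reconciliation and remaining mismatches can be answered mechanically by diffing internal positions against a broker snapshot. The sketch below illustrates the idea only: `position_mismatches`, the `{symbol: quantity}` dictionary shape, and the `tolerance` parameter are assumptions, not the QuantMatrix reconciliation API, and the symbols in the usage example are placeholders.

```python
def position_mismatches(
    internal: dict[str, float],
    broker: dict[str, float],
    tolerance: float = 1e-9,
) -> dict[str, tuple[float, float]]:
    """Return {symbol: (internal_qty, broker_qty)} for symbols that disagree.

    A symbol missing from one side is treated as a zero position there.
    """
    mismatches: dict[str, tuple[float, float]] = {}
    for symbol in sorted(set(internal) | set(broker)):
        ours = internal.get(symbol, 0.0)
        theirs = broker.get(symbol, 0.0)
        if abs(ours - theirs) > tolerance:
            mismatches[symbol] = (ours, theirs)
    return mismatches


# Placeholder data: "BBB" differs in size, "CCC" exists only at the broker.
internal_state = {"AAA": 100.0, "BBB": 50.0}
broker_state = {"AAA": 100.0, "BBB": 100.0, "CCC": -10.0}
remaining = position_mismatches(internal_state, broker_state)
```

An empty result after stabilization supports answering "no" to the final question; any entry is worth attaching to the review as reconciliation output.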
10. Risk and Safety Review
- Did risk controls behave correctly?
- Did any risk control fail or get bypassed?
- Was the emergency halt available?
- Did the operator know what to do?
- Was there any near-miss that could have become worse?
11. Analytics and Logging Review
- Were logs sufficient to reconstruct the incident?
- Were analytics records preserved correctly?
- Were order, execution, and trade identifiers available?
- Did any telemetry go missing?
- Were recommendations or analytics outputs affected?
12. What Should Change
Code / System Changes
- [ ]
- [ ]
- [ ]
Risk Rule Changes
- [ ]
- [ ]
- [ ]
Monitoring / Alerting Changes
- [ ]
- [ ]
- [ ]
Runbook / Process Changes
- [ ]
- [ ]
- [ ]
UI / Operator Experience Changes
- [ ]
- [ ]
- [ ]
Test Coverage Changes
- [ ]
- [ ]
- [ ]
13. Corrective Actions
For each action item capture:
- Action
- Owner
- Priority
- Due Date
- Status
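
Teams that track action items outside this document may find it useful to model them with the same five fields so overdue items can be flagged automatically. This is a minimal sketch under stated assumptions: `CorrectiveAction` and `overdue_actions` are hypothetical names, and the status values ("Open", "Done") and priority labels ("P1", "P2") are illustrative, not defined by this template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    """One action item; fields mirror the capture list in this section."""
    action: str
    owner: str
    priority: str          # assumed labels such as "P1", "P2"
    due_date: date
    status: str = "Open"   # assumed values: "Open", "In progress", "Done"


def overdue_actions(
    actions: list[CorrectiveAction], today: date
) -> list[CorrectiveAction]:
    """Return actions that are not done and whose due date has passed."""
    return [a for a in actions if a.status != "Done" and a.due_date < today]
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when it is rerun at a later review.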
14. Prevention Plan
What Will Prevent Recurrence?
- Automated test additions:
- Alert additions:
- Guardrail additions:
- UI improvements:
- Runbook updates:
- Deployment checks:
Residual Risk
- What risk remains even after fixes?
15. Lessons Learned
Technical Lessons
-
Operational Lessons
-
Strategy / Trading Lessons
-
16. Final Review Decision
- Incident fully understood: Yes / No / Partially
- Corrective actions sufficient: Yes / No
- Safe to resume normal operations: Yes / No / With restrictions
- Extra monitoring required next session: Yes / No
- Related releases should be paused: Yes / No
17. Signoff
- Review Owner:
- Operator:
- Engineer:
- Strategy Owner:
- Completion Date:
18. Suggested Attachments
Attach when available:
- screenshots
- broker order export
- position snapshot
- relevant logs
- reconciliation output
- incident chat transcript
- analytics output related to affected trades