Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Structured template for documenting root cause, impact, and follow-up actions after a QuantMatrix incident.

QuantMatrix Post-Incident Review Template

This template is for reviewing incidents after they have been stabilized.

Its purpose is to:
- document what actually happened
- separate facts from assumptions
- understand impact clearly
- identify root causes and contributing factors
- assign corrective actions
- improve the system, runbooks, alerts, and operator workflows

This review should be completed for:
- all Severity 1 incidents
- all Severity 2 incidents
- any incident that exposed a serious weakness in safety, execution, reconciliation, or analytics

1. Incident Metadata

Incident Title:
Incident ID:
Severity: 1 / 2 / 3
Date:
Start Time:
End Time:
Duration:
Environment: Local / Paper / Live
Trading Mode:
Review Owner:
Operator Involved:
Engineer Involved:
Strategy Owner Involved:
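
The Duration field can be derived from Start Time and End Time rather than typed by hand. The following is a minimal sketch, assuming both times are recorded as `YYYY-MM-DD HH:MM:SS` strings; the function name `incident_duration`, the timestamp format, and the sample values are illustrative assumptions, not part of any QuantMatrix tooling.

```python
from datetime import datetime, timedelta


def incident_duration(start: str, end: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> timedelta:
    """Return end - start, rejecting an End Time that precedes the Start Time."""
    started = datetime.strptime(start, fmt)
    ended = datetime.strptime(end, fmt)
    if ended < started:
        raise ValueError("End Time is earlier than Start Time")
    return ended - started


# Illustrative values only (hypothetical incident times).
duration = incident_duration("2026-05-03 09:32:14", "2026-05-03 09:34:01")
print(duration)  # 0:01:47
```

Rejecting a negative duration catches swapped or mistyped times before they reach the review.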

2. Executive Summary

Short Summary

What happened:
Why it mattered:
How it was stabilized:

Final Outcome

Resolved: Yes / No / Partially
Trading halted: Yes / No
Positions liquidated: Yes / No
Follow-up required: Yes / No

3. Incident Timeline

Record events in order.

For each event capture:
- Timestamp
- Event
- Source
- Action taken
- Outcome

Example rows:
- 09:32:14 - Duplicate order suspected - Operator UI - Halted new entries - Confirmed broker had two live orders
- 09:34:01 - Broker positions checked - Broker API - Reconciliation started - Exposure confirmed
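
If the timeline is collected from several sources (UI, broker, logs), the entries may arrive out of order. The following is a minimal sketch of a timeline record holding the five fields above, with a helper that sorts events by timestamp. The class name `TimelineEvent` and the helper `ordered` are hypothetical; the sample data is taken from the example rows.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: time
    event: str
    source: str
    action_taken: str
    outcome: str


def ordered(events):
    """Return the events sorted by timestamp, earliest first."""
    return sorted(events, key=lambda e: e.timestamp)


# Entered out of order on purpose to show the sort.
events = [
    TimelineEvent(time(9, 34, 1), "Broker positions checked", "Broker API",
                  "Reconciliation started", "Exposure confirmed"),
    TimelineEvent(time(9, 32, 14), "Duplicate order suspected", "Operator UI",
                  "Halted new entries", "Confirmed broker had two live orders"),
]

for e in ordered(events):
    print(f"{e.timestamp.isoformat()} - {e.event} - {e.source} - {e.action_taken} - {e.outcome}")
```

This sketch assumes the incident stays within one calendar day; an incident that crosses midnight would need full `datetime` values instead of `time`.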

4. Impact Assessment

Trading Impact

Real capital affected: Yes / No
Symbols affected:
Orders affected:
Positions affected:
Realized P&L impact:
Unrealized P&L impact:
Missed opportunities:

Operational Impact

Session interrupted: Yes / No
System halted: Yes / No
Manual intervention required: Yes / No
Operator confidence impacted: Yes / No

Customer / Stakeholder Impact

Who was impacted:
What they observed:
Whether external communication was needed:

5. Detection

How Was The Incident First Detected?

UI alert
Broker alert
Operator observation
Log review
Reconciliation mismatch
Analytics anomaly
Other:

Detection Quality

Was the detection fast enough?
Was the alert clear enough?
Could the issue have been detected earlier?

6. What Was Known vs Unknown

Facts Confirmed During Incident

-

Unknowns During Incident

-

Wrong Assumptions Made

-

7. Root Cause Analysis

Immediate Cause

What directly triggered the incident?

Contributing Factors

What else made the incident possible or worse?

Deeper System Cause

What underlying design, implementation, process, or operational weakness allowed it?

Root Cause Category

Mark all that apply:
- [ ] Broker/API behavior
- [ ] Order management logic
- [ ] Risk control gap
- [ ] Reconciliation gap
- [ ] Market data quality
- [ ] Strategy logic
- [ ] UI/operator confusion
- [ ] Missing alert
- [ ] Missing test coverage
- [ ] Deployment/configuration issue
- [ ] Manual process weakness

8. Response Review

What Actions Were Taken?

-

What Worked Well?

-

What Did Not Work Well?

-

Was The Severity Assigned Correctly?

Yes / No
If not, what should it have been?

Was The Halt / Liquidation Decision Appropriate?

Yes / No / Not applicable
Notes:

9. Broker and State Verification Review

Was broker state checked quickly enough?
Were open orders verified?
Were open positions verified?
Was broker state treated as authoritative when appropriate?
Did reconciliation succeed?
Were there any remaining state mismatches after stabilization?
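
To answer the last question concretely, remaining mismatches can be listed by comparing the system's recorded positions against the broker's. The following is a minimal sketch, assuming both sides can be reduced to a symbol-to-quantity mapping; the function name `position_mismatches`, the symbols, and the quantities are hypothetical and do not reflect any actual QuantMatrix or broker API.

```python
def position_mismatches(local: dict[str, int], broker: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each symbol whose quantities differ to (local_qty, broker_qty).

    A symbol missing on one side is treated as quantity 0 on that side.
    """
    symbols = set(local) | set(broker)
    return {
        s: (local.get(s, 0), broker.get(s, 0))
        for s in sorted(symbols)
        if local.get(s, 0) != broker.get(s, 0)
    }


# Hypothetical snapshots taken after stabilization.
local_positions = {"AAPL": 100, "MSFT": 50}
broker_positions = {"AAPL": 200, "MSFT": 50, "TSLA": 10}

mismatches = position_mismatches(local_positions, broker_positions)
print(mismatches)  # {'AAPL': (100, 200), 'TSLA': (0, 10)}
```

An empty result suggests the two views agree on position quantities; it says nothing about open orders, which would need a separate check.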

10. Risk and Safety Review

Did risk controls behave correctly?
Did any risk control fail or get bypassed?
Was the emergency halt available?
Did the operator know what to do?
Was there any near-miss that could have become worse?

11. Analytics and Logging Review

Were logs sufficient to reconstruct the incident?
Were analytics records preserved correctly?
Were order, execution, and trade identifiers available?
Did any telemetry go missing?
Were recommendations or analytics outputs affected?

12. What Should Change

Code / System Changes

[ ]
[ ]
[ ]

Risk Rule Changes

[ ]
[ ]
[ ]

Monitoring / Alerting Changes

[ ]
[ ]
[ ]

Runbook / Process Changes

[ ]
[ ]
[ ]

UI / Operator Experience Changes

[ ]
[ ]
[ ]

Test Coverage Changes

[ ]
[ ]
[ ]

13. Corrective Actions

For each action item capture:
- Action
- Owner
- Priority
- Due Date
- Status
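
Action items are easier to follow up on when they are kept as structured records. The following is a minimal sketch of a record with the five fields above, plus a helper that lists items that are past due and not yet done. The class name `ActionItem`, the helper `overdue`, the priority and status labels, and the sample items are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    priority: str   # assumed labels such as "P1", "P2"
    due_date: date
    status: str     # assumed labels such as "Open", "In progress", "Done"


def overdue(items, today):
    """Return items that are not Done and whose due date is before today."""
    return [i for i in items if i.status != "Done" and i.due_date < today]


# Hypothetical action items for illustration.
items = [
    ActionItem("Add duplicate-order guard", "Engineer", "P1", date(2026, 5, 10), "Open"),
    ActionItem("Update halt runbook", "Operator", "P2", date(2026, 5, 20), "Done"),
    ActionItem("Add reconciliation alert", "Engineer", "P2", date(2026, 6, 1), "Open"),
]

late = overdue(items, today=date(2026, 5, 15))
print([i.action for i in late])  # ['Add duplicate-order guard']
```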

14. Prevention Plan

What Will Prevent Recurrence?

Automated test additions:
Alert additions:
Guardrail additions:
UI improvements:
Runbook updates:
Deployment checks:

Residual Risk

What risk remains even after fixes?

15. Lessons Learned

Technical Lessons

-

Operational Lessons

-

Strategy / Trading Lessons

-

16. Final Review Decision

Incident fully understood: Yes / No / Partially
Corrective actions sufficient: Yes / No
Safe to resume normal operations: Yes / No / With restrictions
Extra monitoring required next session: Yes / No
Related releases should be paused: Yes / No

17. Signoff

Review Owner:
Operator:
Engineer:
Strategy Owner:
Completion Date:

18. Suggested Attachments

Attach when available:
- screenshots
- broker order export
- position snapshot
- relevant logs
- reconciliation output
- incident chat transcript
- analytics output related to affected trades