Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Response guide for diagnosing, containing, and recovering from operational incidents in QuantMatrix.
QuantMatrix Incident Response Playbook
This playbook is for handling operational incidents in QuantMatrix with clarity and consistency.
Its purpose is to:
- reduce panic during live issues
- make response actions repeatable
- protect capital first
- protect system integrity second
- preserve evidence for investigation and improvement
This document should be used with the Operations Runbook and Live Trading Go-Live Checklist.
1. Incident Response Principles
When an incident happens:
1. protect live capital
2. prevent the situation from worsening
3. establish the true broker state
4. reconcile internal state
5. preserve logs and evidence
6. restore service only when it is safe
7. document lessons learned
Core rules:
- when internal state and broker state disagree, treat broker-confirmed live exposure as authoritative
- when execution behavior is ambiguous, stop automated trading
- do not restart blindly during active position uncertainty
- do not optimize for uptime over safety
2. Severity Levels
Severity 1: Critical
Use when there is real or potential immediate capital risk.
Examples:
- duplicate live orders
- unexpected live open position
- live position mismatch
- execution broker outage while positions are open
- emergency halt fails
- risk manager unavailable during live trading
Expected action:
- halt automated trading immediately
- consider emergency liquidate and halt
- escalate to operator and engineer immediately
Severity 2: High
Use when trading safety is degraded but immediate capital loss is not yet confirmed.
Examples:
- repeated rejected orders
- market data outage during entry flow
- reconciliation mismatch without open exposure
- stuck strategy controlling active workflow
- analytics pipeline failure during live session
Expected action:
- pause new entries
- investigate quickly
- resume only after validation
Severity 3: Medium
Use when the system is impaired but safe fallback exists.
Examples:
- delayed UI updates
- scanner timing drift
- stale dashboard metrics
- non-critical background job failure
Expected action:
- keep under observation
- fix before next session if not session-critical
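The three levels above can be read as a small triage function. The following is a minimal sketch, not QuantMatrix code: the `Severity` enum, the `classify` function, and its two parameters are hypothetical names chosen for illustration. It assumes unknown exposure is passed as `None` and, following the playbook's rule on uncertainty, maps it to Severity 1.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    CRITICAL = 1  # real or potential immediate capital risk
    HIGH = 2      # trading safety degraded, loss not yet confirmed
    MEDIUM = 3    # impaired, but a safe fallback exists


def classify(capital_at_risk: Optional[bool], safety_degraded: bool) -> Severity:
    """Map two triage answers to a severity level (illustrative only).

    capital_at_risk is None when exposure has not been confirmed yet;
    unknown exposure is handled as Severity 1 until proven otherwise.
    """
    if capital_at_risk is None or capital_at_risk:
        return Severity.CRITICAL
    if safety_degraded:
        return Severity.HIGH
    return Severity.MEDIUM
```

In practice the operator assigns severity by judgment; a helper like this would only make the default explicit.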
3. Incident Roles
Incident Operator
- takes immediate protective actions
- monitors positions, orders, risk, and health
- triggers halt or liquidation if needed
Incident Engineer
- diagnoses the system issue
- validates state from broker and persistence layers
- guides recovery and reconciliation
Strategy Owner
- advises on strategy-specific risk
- approves any strategy suspension or rollback
For early live operation, one person may hold more than one role, but the responsibilities of each role still apply.
4. Universal Incident Response Workflow
For every incident:
- Identify the incident
- Assign severity
- Confirm whether live capital is exposed
- Stop unsafe automated behavior
- Check broker account, orders, and positions
- Preserve logs and timestamps
- Reconcile system state
- Decide:
  - continue with restricted mode
  - halt trading
  - liquidate and halt
- Record the incident
- Perform post-incident review
5. First 60 Seconds Checklist
When a serious incident is detected:
- [ ] Confirm whether the system is in live, paper, or demo mode
- [ ] Check whether any real positions are open
- [ ] Check whether any live orders are outstanding
- [ ] Stop new entries if execution behavior is uncertain
- [ ] Open broker account and order view if available
- [ ] Note the exact time the incident started
- [ ] Preserve screenshots or exported state if useful
If there is any uncertainty about real exposure, treat it as Severity 1 until proven otherwise.
6. Broker State Verification Procedure
This is the most important verification path in live incidents.
Always confirm:
- account buying power
- open positions
- open orders
- most recent fills
- rejected orders
Verification sequence:
1. Pull account summary from execution broker
2. Pull open positions from execution broker
3. Pull open orders from execution broker
4. Pull recent executions from execution broker
5. Compare against internal Orders, Positions, and Account views
If broker state cannot be confirmed, do not resume automated trading.
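The comparison in step 5 can be sketched for positions as follows. This is an illustration under stated assumptions: `PositionDiff`, `diff_positions`, and `safe_to_resume` are hypothetical names, and positions are assumed to be available as plain `symbol -> signed quantity` mappings already pulled from the broker and from internal state. A broker snapshot of `None` stands for "broker state could not be confirmed".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PositionDiff:
    symbol: str
    broker_qty: int    # authoritative for real exposure
    internal_qty: int  # what the system believes


def diff_positions(broker: dict[str, int], internal: dict[str, int]) -> list[PositionDiff]:
    """List every symbol whose quantity differs between broker and system."""
    diffs = []
    for symbol in sorted(set(broker) | set(internal)):
        broker_qty = broker.get(symbol, 0)
        internal_qty = internal.get(symbol, 0)
        if broker_qty != internal_qty:
            diffs.append(PositionDiff(symbol, broker_qty, internal_qty))
    return diffs


def safe_to_resume(broker: dict[str, int] | None, internal: dict[str, int]) -> bool:
    """Unconfirmed broker state (None) always blocks resumption."""
    if broker is None:
        return False
    return not diff_positions(broker, internal)
```

The same pattern would apply to open orders and recent executions, keyed by order id instead of symbol.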
7. Scenario Playbooks
A. Duplicate Order Suspected
Indicators:
- two similar orders for same symbol and side
- unexpected doubled position
- OMS state inconsistent with expected one-buy rule
Immediate actions:
1. Stop new entries
2. Check broker order list
3. Confirm whether duplicate is real or only a UI/state artifact
4. If real and exposure increased unexpectedly, consider applying the emergency liquidation policy
5. Freeze affected strategy
Recovery steps:
- identify whether idempotency failed at OMS, broker retry, or recovery flow
- reconcile internal order records with broker orders
- record the ids of the orders involved in the incident report
Resume criteria:
- duplicate cause understood
- no ambiguous open exposure remains
- idempotency path verified
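As a starting point for checking the broker order list, the sketch below groups orders that look alike. It is a hedged illustration: `BrokerOrder` and `find_duplicate_candidates` are hypothetical names, and treating "same symbol, side, and quantity" as the similarity test is an assumption, not a documented QuantMatrix rule. A match is only a candidate; it still has to be confirmed as real or as a UI/state artifact.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class BrokerOrder(NamedTuple):
    order_id: str
    symbol: str
    side: str  # "buy" or "sell"
    qty: int


def find_duplicate_candidates(
    orders: Iterable[BrokerOrder],
) -> Dict[Tuple[str, str, int], List[str]]:
    """Group order ids by (symbol, side, qty) and keep groups with 2+ orders."""
    groups: Dict[Tuple[str, str, int], List[str]] = defaultdict(list)
    for order in orders:
        groups[(order.symbol, order.side, order.qty)].append(order.order_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

The returned order ids are the ones to record against the incident.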
B. Execution Broker Disconnect While Positions Are Open
Indicators:
- execution broker status failed
- inability to place, cancel, or fetch orders
- positions remain open
Immediate actions:
1. Stop all new entries
2. Preserve current internal position view
3. Attempt broker reconnect
4. Use external broker interface if available to verify positions manually
5. Prepare manual contingency if disconnect persists
Recovery steps:
- restore broker access
- pull fresh positions and orders
- reconcile with internal state
- verify no duplicate recovery orders will be sent
Resume criteria:
- broker stable
- positions confirmed
- reconciliation complete
C. Market Data Disconnect During Active Trading
Indicators:
- market data stream stops
- stale prices on UI
- scanner stops updating
Immediate actions:
1. Stop new entries
2. Decide whether exits may continue safely based on broker data availability and policy
3. Monitor open positions carefully
Recovery steps:
- restore data feed
- verify timestamps are current
- verify scanner timing resumes correctly
Resume criteria:
- live data confirmed fresh
- no stale decision logic remains active
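"Live data confirmed fresh" can be made concrete with a timestamp check. The sketch below is illustrative only: `is_stale`, `entries_allowed`, and the 5-second `MAX_TICK_AGE` threshold are assumptions, and the real freshness limit depends on the feed and the strategy.

```python
from datetime import datetime, timedelta
from typing import Mapping

MAX_TICK_AGE = timedelta(seconds=5)  # hypothetical threshold, tune per feed


def is_stale(last_tick_at: datetime, now: datetime,
             max_age: timedelta = MAX_TICK_AGE) -> bool:
    """A symbol is stale when its last tick is older than max_age."""
    return now - last_tick_at > max_age


def entries_allowed(last_tick_by_symbol: Mapping[str, datetime], now: datetime) -> bool:
    """Block new entries when there is no data at all or any symbol is stale."""
    if not last_tick_by_symbol:
        return False
    return not any(is_stale(ts, now) for ts in last_tick_by_symbol.values())
```

Both timestamps should be timezone-aware and come from the same clock source, otherwise the subtraction is meaningless or raises `TypeError`.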
D. Risk Manager Unavailable
Indicators:
- risk service unhealthy
- risk checks not responding
- orders may proceed without validation
Immediate actions:
1. Halt all new automated entries immediately
2. Do not allow automated trading to restart until the risk path is restored
3. Review whether any orders slipped through during degradation
Recovery steps:
- restart or repair risk service
- if needed, replay recent order intents for audit only, without resubmitting them
- confirm enforcement restored
Resume criteria:
- risk enforcement tested successfully after recovery
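The indicator "orders may proceed without validation" describes a gate that fails open. A fail-closed gate is one way to avoid that; the sketch below assumes a hypothetical `risk_check` callable and `entry_allowed` wrapper, neither of which is taken from the QuantMatrix codebase.

```python
from typing import Callable


def entry_allowed(order: dict, risk_check: Callable[[dict], bool]) -> bool:
    """Fail closed: only an explicit True from the risk check allows an entry.

    Any exception (timeout, connection error) or non-True answer blocks
    the order instead of letting it through unvalidated.
    """
    try:
        return risk_check(order) is True
    except Exception:
        return False
```

The design choice is that an unavailable risk service looks exactly like a rejection to the order path, so no entry can slip through during degradation.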
E. Position Mismatch
Indicators:
- UI position quantity differs from broker
- expected flat state but broker shows open shares
- average entry price differs materially
Immediate actions:
1. Stop new entries
2. Treat broker position as authoritative for exposure
3. Decide whether manual liquidation is required
Recovery steps:
- run reconciliation
- repair internal positions state
- inspect related orders and executions
Resume criteria:
- position state aligned across broker and system
F. Partial Fill Not Reflected Correctly
Indicators:
- broker shows partial quantity but UI shows full intended quantity
- exit logic sizing appears wrong
Immediate actions:
1. Freeze affected strategy
2. Confirm actual filled quantity
3. Prevent incorrect exit sizing
Recovery steps:
- patch position state to actual quantity
- verify subsequent exit logic references actual filled size
Resume criteria:
- position math correct
- strategy is safe to continue, or the position has been closed manually
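"Position math correct" means quantity and average price are derived from what the broker actually filled, not from the intended order size. A minimal sketch, assuming fills are available as quantity/price pairs; `Fill`, `position_from_fills`, and `exit_quantity` are hypothetical names.

```python
from decimal import Decimal
from typing import NamedTuple, Sequence, Tuple


class Fill(NamedTuple):
    qty: int
    price: Decimal


def position_from_fills(fills: Sequence[Fill]) -> Tuple[int, Decimal]:
    """Return (filled quantity, quantity-weighted average price)."""
    qty = sum(f.qty for f in fills)
    if qty == 0:
        return 0, Decimal("0")
    notional = sum(f.price * f.qty for f in fills)
    return qty, notional / qty


def exit_quantity(filled_qty: int, requested_exit_qty: int) -> int:
    """Never size an exit above what was actually filled."""
    return min(filled_qty, requested_exit_qty)
```

For example, an order for 100 shares that filled 40 at 10.0 and 20 at 10.3 yields a position of 60 shares at an average of 10.1, and an exit sized for 100 is capped at 60.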
G. Unexpected Live Order Rejections
Indicators:
- repeated broker rejects
- no fills despite valid signals
Immediate actions:
1. Check rejection reason pattern
2. Pause affected strategy or all new entries if systemic
3. Verify account permissions, buying power, order format, and session state
Recovery steps:
- fix root cause
- test with safe validation path
Resume criteria:
- rejection path explained
- no blind resubmission loop remains
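Checking the rejection reason pattern amounts to counting reasons and seeing how widely each one spreads. The sketch below is an assumption-laden illustration: `Reject`, `reason_counts`, and `is_systemic` are hypothetical names, and "the same reason seen in at least two strategies" is one possible definition of systemic, not the playbook's.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set


class Reject(NamedTuple):
    strategy_id: str
    reason: str  # broker-supplied rejection reason text or code


def reason_counts(rejects: Iterable[Reject]) -> Counter:
    """Count how often each rejection reason occurs."""
    return Counter(r.reason for r in rejects)


def is_systemic(rejects: Iterable[Reject], min_strategies: int = 2) -> bool:
    """True when one reason affects at least min_strategies strategies."""
    strategies_by_reason: Dict[str, Set[str]] = {}
    for r in rejects:
        strategies_by_reason.setdefault(r.reason, set()).add(r.strategy_id)
    return any(len(s) >= min_strategies for s in strategies_by_reason.values())
```

A systemic result points toward pausing all new entries; otherwise pausing only the affected strategy may be enough.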
H. Emergency Halt Triggered
Indicators:
- operator clicked Emergency Liquidate & Halt
- risk manager forced system halt
Immediate actions:
1. Confirm halt state in UI
2. Confirm new entries are blocked
3. Confirm cancels and liquidation behavior
4. Track all affected orders and positions
Recovery steps:
- verify positions are flat or intentionally managed
- verify system remains halted until manual release
- perform full reconciliation
Resume criteria:
- halt reason understood
- post-halt reconciliation complete
- explicit operator approval documented
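"Remains halted until manual release" behaves like a latch: once tripped, it stays set until someone deliberately clears it. The sketch below shows that idea with a hypothetical `HaltLatch` class; it is not the QuantMatrix halt mechanism, and the release conditions simply mirror the resume criteria above.

```python
from typing import Optional


class HaltLatch:
    """Stays halted after trip() until an operator releases it explicitly."""

    def __init__(self) -> None:
        self.halted = False
        self.reason: Optional[str] = None
        self.released_by: Optional[str] = None

    def trip(self, reason: str) -> None:
        """Enter the halted state and record why."""
        self.halted = True
        self.reason = reason
        self.released_by = None

    def release(self, operator: str, reconciled: bool) -> None:
        """Clear the halt only with reconciliation done and a named operator."""
        if not reconciled:
            raise RuntimeError("post-halt reconciliation is not complete")
        if not operator:
            raise RuntimeError("explicit operator approval is required")
        self.halted = False
        self.released_by = operator

    def entries_allowed(self) -> bool:
        return not self.halted
```

A restart should not reset such a latch, so in a real system its state would need to be persisted; that part is omitted here.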
8. Evidence Preservation
For any Severity 1 or 2 incident, preserve:
- timestamp of first detection
- environment and trading mode
- affected symbols
- affected order ids
- affected strategy ids
- screenshots of UI if useful
- broker responses if available
- relevant logs
- reconciliation outputs
Do not overwrite or silently discard state that may explain the incident.
9. Communication Template
When notifying teammates, include:
- incident severity
- trading mode: live, paper, local
- current exposure: positions and symbols
- current system state: halted, paused, degraded
- immediate action already taken
- what is still unknown
- next update time
Example:
- Severity: 1
- Mode: Live
- Exposure: Open long position in AAPL, 500 shares
- State: New entries halted, broker reconnect in progress
- Action taken: Strategy paused, broker state verification started
- Unknown: Whether recent cancel request reached broker
- Next update: 5 minutes
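If updates are sent from a script or chat integration, the template can be captured as a small record so no field is forgotten. This is a sketch only; `IncidentUpdate` and its field names are hypothetical and simply follow the fields listed above.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    severity: int
    mode: str
    exposure: str
    state: str
    action_taken: str
    unknown: str
    next_update: str

    def render(self) -> str:
        """Render the update as one labelled line per template field."""
        return "\n".join([
            f"Severity: {self.severity}",
            f"Mode: {self.mode}",
            f"Exposure: {self.exposure}",
            f"State: {self.state}",
            f"Action taken: {self.action_taken}",
            f"Unknown: {self.unknown}",
            f"Next update: {self.next_update}",
        ])
```

Because every field is required by the constructor, an incomplete update fails loudly instead of being sent with gaps.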
10. Recovery Decision Matrix
After stabilizing, choose one:
Continue in Restricted Mode
Use when:
- root cause is contained
- no ambiguous exposure remains
- only non-critical components are degraded
Restrictions may include:
- no new entries
- manual exits only
- one strategy disabled
Halt Trading for Session
Use when:
- system trust is reduced
- issue is not fully understood
- trading edge is impaired
- operator confidence is low
Liquidate and Halt
Use when:
- current exposure is unsafe
- state is materially inconsistent
- execution reliability is compromised
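One way to read the matrix is as an ordered decision, most protective option first. The sketch below rests on interpretation: the matrix does not say whether the conditions under each option combine with "any" or "all", so this version assumes any single liquidation trigger is enough, that restricted mode needs all three of its conditions, and that halting for the session is the default otherwise. `Recovery` and `choose_recovery` are hypothetical names.

```python
from enum import Enum


class Recovery(Enum):
    RESTRICTED = "continue in restricted mode"
    HALT = "halt trading for session"
    LIQUIDATE_AND_HALT = "liquidate and halt"


def choose_recovery(
    *,
    exposure_unsafe: bool,
    state_inconsistent: bool,
    execution_compromised: bool,
    root_cause_contained: bool,
    ambiguous_exposure: bool,
    critical_component_degraded: bool,
) -> Recovery:
    """Pick a recovery option, checking the most protective one first."""
    # Assumption: any one of these conditions justifies liquidation.
    if exposure_unsafe or state_inconsistent or execution_compromised:
        return Recovery.LIQUIDATE_AND_HALT
    # Assumption: restricted mode requires all of its conditions to hold.
    if root_cause_contained and not ambiguous_exposure and not critical_component_degraded:
        return Recovery.RESTRICTED
    # Anything else (reduced trust, unclear cause) falls back to halting.
    return Recovery.HALT
```

The final call remains with the operator; a function like this only documents the default ordering.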
11. Post-Incident Review
Every Severity 1 and Severity 2 incident should produce a review.
Review questions:
- what happened
- when did it begin
- what was the true impact
- what exposure existed
- what was the immediate cause
- what deeper system weakness allowed it
- what action worked
- what action was missing
- what should change in code, config, alerts, or runbook
Required outputs:
- incident summary
- timeline
- impacted modules
- remediation items
- owner and due date for each remediation
12. Incident Prevention Feedback Loop
Every serious incident should feed back into:
- implementation spec updates
- module checklist updates
- operations runbook updates
- test coverage additions
- analytics and alerting improvements
If an incident happened once and could plausibly happen again, it should usually produce:
- a new automated test
- a new alert or guardrail
- a clearer operator instruction
13. Quick Reference Cards
If Execution Is Unclear
- stop new entries
- verify broker positions and orders
- reconcile before resuming
If State Is Inconsistent
- trust broker for real exposure
- freeze affected strategy
- repair internal state carefully
If Risk Is Down
- halt automated entries immediately
- do not resume until enforcement is confirmed
If You Are Unsure
- choose safety over continuity
- halt first, diagnose second
14. Recommended Companion Documents
After this playbook, the next useful documents are:
- Production Readiness Review Template
- Release Change Checklist
- Daily Trade Review Template
- Post-Incident Review Template