Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Response guide for diagnosing, containing, and recovering from operational incidents in QuantMatrix.

QuantMatrix Incident Response Playbook

This playbook is for handling operational incidents in QuantMatrix with clarity and consistency.

Its purpose is to:

  • reduce panic during live issues
  • make response actions repeatable
  • protect capital first
  • protect system integrity second
  • preserve evidence for investigation and improvement

This document should be used with the Operations Runbook and Live Trading Go-Live Checklist.

1. Incident Response Principles

When an incident happens:

  1. protect live capital
  2. prevent the situation from worsening
  3. establish the true broker state
  4. reconcile internal state
  5. preserve logs and evidence
  6. restore service only when it is safe
  7. document lessons learned

Core rules:

  • when internal state and broker state disagree, treat broker-confirmed live exposure as authoritative
  • when execution behavior is ambiguous, stop automated trading
  • do not restart blindly during active position uncertainty
  • do not optimize for uptime over safety

2. Severity Levels

Severity 1: Critical

Use when there is real or potential immediate capital risk.

Examples:

  • duplicate live orders
  • unexpected live open position
  • live position mismatch
  • execution broker outage while positions are open
  • emergency halt fails
  • risk manager unavailable during live trading

Expected action:

  • halt automated trading immediately
  • consider emergency liquidate and halt
  • escalate to operator and engineer immediately

Severity 2: High

Use when trading safety is degraded but immediate capital loss is not yet confirmed.

Examples:

  • repeated rejected orders
  • market data outage during entry flow
  • reconciliation mismatch without open exposure
  • stuck strategy controlling active workflow
  • analytics pipeline failure during live session

Expected action:

  • pause new entries
  • investigate quickly
  • resume only after validation

Severity 3: Medium

Use when the system is impaired but safe fallback exists.

Examples:

  • delayed UI updates
  • scanner timing drift
  • stale dashboard metrics
  • non-critical background job failure

Expected action:

  • keep under observation
  • fix before next session if not session-critical
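
The three levels can be read as a small decision function. The sketch below is illustrative only: the flag names are hypothetical inputs an operator or monitor would supply, not QuantMatrix fields, and it folds in the rule from the First 60 Seconds Checklist that uncertain exposure is treated as Severity 1.

```python
from enum import IntEnum


class Severity(IntEnum):
    CRITICAL = 1  # real or potential immediate capital risk
    HIGH = 2      # trading safety degraded, capital loss not yet confirmed
    MEDIUM = 3    # system impaired, but a safe fallback exists


def assign_severity(capital_at_risk: bool,
                    exposure_uncertain: bool,
                    trading_safety_degraded: bool) -> Severity:
    """Return the most severe matching level (all flags are hypothetical)."""
    # Uncertainty about real exposure is escalated to Severity 1
    # until proven otherwise.
    if capital_at_risk or exposure_uncertain:
        return Severity.CRITICAL
    if trading_safety_degraded:
        return Severity.HIGH
    return Severity.MEDIUM
```

Checking the most severe condition first means an incident that matches several levels is never under-classified.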

3. Incident Roles

Incident Operator

  • takes immediate protective actions
  • monitors positions, orders, risk, and health
  • triggers halt or liquidation if needed

Incident Engineer

  • diagnoses the system issue
  • validates state from broker and persistence layers
  • guides recovery and reconciliation

Strategy Owner

  • advises on strategy-specific risk
  • approves any strategy suspension or rollback

For early live operation, one person may hold more than one role, but the responsibilities of each role should still be carried out.

4. Universal Incident Response Workflow

For every incident:

  1. Identify the incident
  2. Assign severity
  3. Confirm whether live capital is exposed
  4. Stop unsafe automated behavior
  5. Check broker account, orders, and positions
  6. Preserve logs and timestamps
  7. Reconcile system state
  8. Decide:
     • continue with restricted mode
     • halt trading
     • liquidate and halt
  9. Record the incident
  10. Perform post-incident review

5. First 60 Seconds Checklist

When a serious incident is detected:

  • [ ] Confirm whether the system is in live, paper, or demo mode
  • [ ] Check whether any real positions are open
  • [ ] Check whether any live orders are outstanding
  • [ ] Stop new entries if execution behavior is uncertain
  • [ ] Open broker account and order view if available
  • [ ] Note the exact time the incident started
  • [ ] Preserve screenshots or exported state if useful

If there is any uncertainty about real exposure, treat it as Severity 1 until proven otherwise.

6. Broker State Verification Procedure

This is the most important verification path in live incidents.

Always confirm:

  • account buying power
  • open positions
  • open orders
  • most recent fills
  • rejected orders

Verification sequence:

  1. Pull account summary from execution broker
  2. Pull open positions from execution broker
  3. Pull open orders from execution broker
  4. Pull recent executions from execution broker
  5. Compare against internal Orders, Positions, and Account views

If broker state cannot be confirmed, do not resume automated trading.
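
The comparison step of the verification sequence might look like the following minimal sketch. `StateSnapshot` and `compare_state` are hypothetical names, and the snapshot is assumed to have already been pulled from the broker; no real broker API is shown. An unconfirmed broker state (`None`) is reported as a finding, so the caller cannot mistake it for a clean result.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateSnapshot:
    positions: dict[str, int]        # symbol -> signed share quantity
    open_order_ids: frozenset[str]   # ids of orders still working


def compare_state(broker: Optional[StateSnapshot],
                  internal: StateSnapshot) -> list[str]:
    """Return human-readable discrepancies; an empty list means aligned."""
    if broker is None:
        return ["broker state unconfirmed: do not resume automated trading"]
    findings = []
    # A symbol missing on one side counts as a quantity of zero.
    for symbol in sorted(set(broker.positions) | set(internal.positions)):
        b_qty = broker.positions.get(symbol, 0)
        i_qty = internal.positions.get(symbol, 0)
        if b_qty != i_qty:
            findings.append(
                f"position mismatch {symbol}: broker={b_qty} internal={i_qty}")
    for oid in sorted(broker.open_order_ids - internal.open_order_ids):
        findings.append(f"order {oid} open at broker but unknown internally")
    for oid in sorted(internal.open_order_ids - broker.open_order_ids):
        findings.append(f"order {oid} open internally but not at broker")
    return findings
```

Any non-empty result should block resumption until reconciliation explains each finding, with the broker side treated as authoritative for exposure.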

7. Scenario Playbooks

A. Duplicate Order Suspected

Indicators:

  • two similar orders for same symbol and side
  • unexpected doubled position
  • OMS state inconsistent with expected one-buy rule

Immediate actions:

  1. Stop new entries
  2. Check broker order list
  3. Confirm whether duplicate is real or only a UI/state artifact
  4. If real and exposure increased unexpectedly, consider emergency liquidation policy
  5. Freeze affected strategy

Recovery steps:

  • identify whether idempotency failed at OMS, broker retry, or recovery flow
  • reconcile internal order records with broker orders
  • mark incident with order ids involved

Resume criteria:

  • duplicate cause understood
  • no ambiguous open exposure remains
  • idempotency path verified
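
When checking the broker order list for this scenario, a quick filter can surface candidate duplicates. The sketch below is one possible heuristic, not the QuantMatrix idempotency mechanism: the `BrokerOrder` fields and the 5-second window are assumptions, and a match only flags a pair for human confirmation.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerOrder:
    order_id: str
    symbol: str
    side: str            # "buy" or "sell"
    quantity: int
    submitted_at: float  # epoch seconds


def find_suspected_duplicates(orders: list[BrokerOrder],
                              window_seconds: float = 5.0) -> list[tuple[str, str]]:
    """Pair up orders with the same symbol, side, and quantity that were
    submitted within window_seconds of each other."""
    groups: dict[tuple[str, str, int], list[BrokerOrder]] = defaultdict(list)
    for order in orders:
        groups[(order.symbol, order.side, order.quantity)].append(order)
    suspects = []
    for group in groups.values():
        group.sort(key=lambda o: o.submitted_at)
        # Compare each order with the next one in time order.
        for earlier, later in zip(group, group[1:]):
            if later.submitted_at - earlier.submitted_at <= window_seconds:
                suspects.append((earlier.order_id, later.order_id))
    return suspects
```

The returned order id pairs are the ones to record on the incident.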

B. Execution Broker Disconnect While Positions Are Open

Indicators:

  • execution broker status failed
  • inability to place, cancel, or fetch orders
  • positions remain open

Immediate actions:

  1. Stop all new entries
  2. Preserve current internal position view
  3. Attempt broker reconnect
  4. Use external broker interface if available to verify positions manually
  5. Prepare manual contingency if disconnect persists

Recovery steps:

  • restore broker access
  • pull fresh positions and orders
  • reconcile with internal state
  • verify no duplicate recovery orders will be sent

Resume criteria:

  • broker stable
  • positions confirmed
  • reconciliation complete

C. Market Data Disconnect During Active Trading

Indicators:

  • market data stream stops
  • stale prices on UI
  • scanner stops updating

Immediate actions:

  1. Stop new entries
  2. Decide whether exits may continue safely based on broker data availability and policy
  3. Monitor open positions carefully

Recovery steps:

  • restore data feed
  • verify timestamps are current
  • verify scanner timing resumes correctly

Resume criteria:

  • live data confirmed fresh
  • no stale decision logic remains active
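
Confirming that live data is fresh comes down to comparing the last tick timestamp with the current time. A minimal sketch, assuming epoch-second timestamps and a hypothetical 2-second tolerance (the real threshold is a policy decision):

```python
def feed_is_fresh(last_tick_epoch: float,
                  now_epoch: float,
                  max_age_seconds: float = 2.0) -> bool:
    """True only if the last tick is recent and not dated in the future."""
    age = now_epoch - last_tick_epoch
    # A negative age means a clock or feed problem, so it is not trusted.
    return 0.0 <= age <= max_age_seconds
```

Entry logic can gate on this check so that stale prices never drive a new order.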

D. Risk Manager Unavailable

Indicators:

  • risk service unhealthy
  • risk checks not responding
  • orders may proceed without validation

Immediate actions:

  1. Halt all new automated entries immediately
  2. Do not allow automated trading to restart until the risk path is restored
  3. Review whether any orders slipped through during degradation

Recovery steps:

  • restart or repair risk service
  • replay recent order intents if needed for audit only
  • confirm enforcement restored

Resume criteria:

  • risk enforcement tested successfully after recovery
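
The concern that orders may proceed without validation is usually addressed by making the risk check fail closed. The sketch below illustrates that idea with a hypothetical `risk_check` callable; it is not the QuantMatrix risk interface.

```python
from typing import Callable, Mapping


def entry_permitted(risk_check: Callable[[Mapping[str, object]], bool],
                    order_intent: Mapping[str, object]) -> bool:
    """Allow an entry only on an explicit approval from the risk check."""
    try:
        approved = risk_check(order_intent)
    except Exception:
        # An unavailable or failing risk service blocks the order.
        return False
    # Anything other than an explicit True (None, error payloads) is a denial.
    return approved is True
```

With this shape, a timeout or crash in the risk path blocks entries by default instead of letting them through.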

E. Position Mismatch

Indicators:

  • UI position quantity differs from broker
  • expected flat state but broker shows open shares
  • average entry price differs materially

Immediate actions:

  1. Stop new entries
  2. Treat broker position as authoritative for exposure
  3. Decide whether manual liquidation is required

Recovery steps:

  • run reconciliation
  • repair internal positions state
  • inspect related orders and executions

Resume criteria:

  • position state aligned across broker and system

F. Partial Fill Not Reflected Correctly

Indicators:

  • broker shows partial quantity but UI shows full intended quantity
  • exit logic sizing appears wrong

Immediate actions:

  1. Freeze affected strategy
  2. Confirm actual filled quantity
  3. Prevent incorrect exit sizing

Recovery steps:

  • patch position state to actual quantity
  • verify subsequent exit logic references actual filled size

Resume criteria:

  • position math correct
  • strategy safe to continue or closed manually
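
The key point in this scenario is that exit sizing must come from broker-confirmed fills, not from the intended order quantity. A minimal sketch with hypothetical names:

```python
from __future__ import annotations


def confirmed_position_size(intended_qty: int,
                            fill_quantities: list[int]) -> int:
    """Sum broker-confirmed fills for one entry order.

    Raises ValueError when the fills cannot be right for this order, which
    should be handled as a position mismatch rather than guessed at.
    """
    filled = sum(fill_quantities)
    if any(q <= 0 for q in fill_quantities) or filled > intended_qty:
        raise ValueError(
            "fills inconsistent with intended quantity; reconcile first")
    return filled
```

For example, an entry intended for 500 shares with confirmed fills of 200 and 100 gives an exit size of 300, not 500.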

G. Unexpected Live Order Rejections

Indicators:

  • repeated broker rejects
  • no fills despite valid signals

Immediate actions:

  1. Check rejection reason pattern
  2. Pause affected strategy or all new entries if systemic
  3. Verify account permissions, buying power, order format, and session state

Recovery steps:

  • fix root cause
  • test with safe validation path

Resume criteria:

  • rejection path explained
  • no blind resubmission loop remains
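
One way to rule out a blind resubmission loop is a simple circuit breaker that stops submissions after several consecutive rejects and stays open until an operator resets it. This is a sketch under assumptions: the class name and the threshold of 3 are illustrative, not QuantMatrix settings.

```python
class RejectionBreaker:
    """Blocks further submissions after too many consecutive broker rejects."""

    def __init__(self, max_consecutive_rejects: int = 3) -> None:
        self.max_consecutive_rejects = max_consecutive_rejects
        self._consecutive = 0
        self.tripped = False

    def record(self, accepted: bool) -> None:
        """Record the broker's response to one submitted order."""
        if accepted:
            self._consecutive = 0
            return
        self._consecutive += 1
        if self._consecutive >= self.max_consecutive_rejects:
            self.tripped = True  # stays tripped until manual_reset()

    def may_submit(self) -> bool:
        return not self.tripped

    def manual_reset(self) -> None:
        """Call only after the rejection cause has been explained."""
        self._consecutive = 0
        self.tripped = False
```

Requiring a manual reset keeps the resume decision with the operator, in line with the resume criteria above.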

H. Emergency Halt Triggered

Indicators:

  • operator clicked Emergency Liquidate & Halt
  • risk manager forced system halt

Immediate actions:

  1. Confirm halt state in UI
  2. Confirm new entries are blocked
  3. Confirm cancels and liquidation behavior
  4. Track all affected orders and positions

Recovery steps:

  • verify positions are flat or intentionally managed
  • verify system remains halted until manual release
  • perform full reconciliation

Resume criteria:

  • halt reason understood
  • post-halt reconciliation complete
  • explicit operator approval documented

8. Evidence Preservation

For any Severity 1 or 2 incident, preserve:

  • timestamp of first detection
  • environment and trading mode
  • affected symbols
  • affected order ids
  • affected strategy ids
  • screenshots of UI if useful
  • broker responses if available
  • relevant logs
  • reconciliation outputs

Do not overwrite or silently discard state that may explain the incident.
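
A small structured record makes the evidence fields harder to forget. The sketch below is an assumption about shape, not an existing QuantMatrix schema: `IncidentEvidence` and `save_evidence` are hypothetical names. Opening the file in exclusive-create mode (`"x"`) means an existing evidence file is never overwritten.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class IncidentEvidence:
    detected_at: str    # ISO 8601 timestamp of first detection
    environment: str
    trading_mode: str   # live, paper, or demo
    severity: int
    symbols: list[str] = field(default_factory=list)
    order_ids: list[str] = field(default_factory=list)
    strategy_ids: list[str] = field(default_factory=list)
    notes: str = ""


def save_evidence(record: IncidentEvidence, path: Path) -> None:
    """Write the record as JSON, refusing to replace an existing file."""
    # Mode "x" raises FileExistsError instead of overwriting.
    with open(path, "x", encoding="utf-8") as fh:
        json.dump(asdict(record), fh, indent=2)
```

Screenshots, broker responses, logs, and reconciliation outputs would be stored alongside this file and referenced from `notes`.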

9. Communication Template

When notifying teammates, include:

  • incident severity
  • trading mode: live, paper, local
  • current exposure: positions and symbols
  • current system state: halted, paused, degraded
  • immediate action already taken
  • what is still unknown
  • next update time

Example:

  • Severity: 1
  • Mode: Live
  • Exposure: Open long position in AAPL, 500 shares
  • State: New entries halted, broker reconnect in progress
  • Action taken: Strategy paused, broker state verification started
  • Unknown: Whether recent cancel request reached broker
  • Next update: 5 minutes
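
If updates are sent through chat or a script, a small formatter keeps every message in the same field order as the template. The function name and signature below are illustrative assumptions:

```python
def format_incident_update(severity: int, mode: str, exposure: str,
                           state: str, action_taken: str,
                           unknown: str, next_update: str) -> str:
    """Render an incident update with the template's fields, one per line."""
    fields = [
        ("Severity", severity),
        ("Mode", mode),
        ("Exposure", exposure),
        ("State", state),
        ("Action taken", action_taken),
        ("Unknown", unknown),
        ("Next update", next_update),
    ]
    return "\n".join(f"{label}: {value}" for label, value in fields)
```

Making every argument required means an update cannot go out with a field silently left off.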

10. Recovery Decision Matrix

After stabilizing, choose one:

Continue in Restricted Mode

Use when:

  • root cause is contained
  • no ambiguous exposure remains
  • only non-critical components are degraded

Restrictions may include:

  • no new entries
  • manual exits only
  • one strategy disabled

Halt Trading for Session

Use when:

  • system trust is reduced
  • issue is not fully understood
  • trading edge is impaired
  • operator confidence is low

Liquidate and Halt

Use when:

  • current exposure is unsafe
  • state is materially inconsistent
  • execution reliability is compromised
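
The matrix can be sketched as a function that checks the most protective option first. The evaluation order and the flag names are assumptions made for illustration; the playbook itself does not prescribe an order, and the final decision stays with the operator.

```python
from enum import Enum


class RecoveryDecision(Enum):
    RESTRICTED_MODE = "continue in restricted mode"
    HALT_FOR_SESSION = "halt trading for session"
    LIQUIDATE_AND_HALT = "liquidate and halt"


def choose_recovery(*, exposure_unsafe: bool,
                    state_materially_inconsistent: bool,
                    execution_compromised: bool,
                    root_cause_contained: bool,
                    ambiguous_exposure: bool,
                    critical_component_degraded: bool) -> RecoveryDecision:
    """Map incident conditions to one of the three recovery options."""
    # Any condition that makes current exposure untrustworthy wins outright.
    if exposure_unsafe or state_materially_inconsistent or execution_compromised:
        return RecoveryDecision.LIQUIDATE_AND_HALT
    # Restricted mode requires every one of its conditions to hold.
    if (root_cause_contained and not ambiguous_exposure
            and not critical_component_degraded):
        return RecoveryDecision.RESTRICTED_MODE
    # Otherwise trust is reduced or the issue is not understood: halt.
    return RecoveryDecision.HALT_FOR_SESSION
```

Halting is the default branch, so restricted mode is only chosen when all of its conditions are positively confirmed.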

11. Post-Incident Review

Every Severity 1 and Severity 2 incident should produce a review.

Review questions:

  • what happened
  • when did it begin
  • what was the true impact
  • what exposure existed
  • what was the immediate cause
  • what deeper system weakness allowed it
  • what action worked
  • what action was missing
  • what should change in code, config, alerts, or runbook

Required outputs:

  • incident summary
  • timeline
  • impacted modules
  • remediation items
  • owner and due date for each remediation

12. Incident Prevention Feedback Loop

Every serious incident should feed back into:

  • implementation spec updates
  • module checklist updates
  • operations runbook updates
  • test coverage additions
  • analytics and alerting improvements

If an incident happened once and is plausible again, it should usually produce:

  • a new automated test
  • a new alert or guardrail
  • a clearer operator instruction

13. Quick Reference Cards

If Execution Is Unclear

  • stop new entries
  • verify broker positions and orders
  • reconcile before resuming

If State Is Inconsistent

  • trust broker for real exposure
  • freeze affected strategy
  • repair internal state carefully

If Risk Is Down

  • halt automated entries immediately
  • do not resume until enforcement is confirmed

If You Are Unsure

  • choose safety over continuity
  • halt first, diagnose second

After this playbook, the next useful documents are:

  • Production Readiness Review Template
  • Release Change Checklist
  • Daily Trade Review Template
  • Post-Incident Review Template