Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Day-to-day operating guide for running QuantMatrix safely across startup, monitoring, incidents, and shutdown.
QuantMatrix Operations Runbook
This runbook is the operating manual for QuantMatrix.
Its purpose is to help an operator or engineer:
- start the platform safely
- verify that it is healthy before trading begins
- monitor it during the session
- respond to incidents calmly and consistently
- shut it down safely
- reconcile state after issues
This runbook is intentionally practical. Use it alongside the architecture, implementation, analytics, and checklist documents.
1. Runbook Scope
This runbook covers:
- local development operation
- paper trading operation
- live trading operation
- startup and shutdown procedures
- health checks
- broker connectivity issues
- risk violations
- order and position reconciliation
- emergency halt and liquidation
- post-session review
This runbook does not replace:
- broker compliance obligations
- credential management policy
- deployment infrastructure guides
2. Environments
QuantMatrix should operate in clearly separated environments:
Local Development
- Purpose: UI development, backend integration, dry-run workflows
- Data sources: demo feed or sandbox data
- Execution: dry-run only
- Risk: no real capital
Paper Trading
- Purpose: production-like testing using broker paper accounts
- Data sources: real or paper-compatible market data
- Execution: broker paper environment only
- Risk: no real capital, but operational behavior must match live
Live Trading
- Purpose: real trading with real capital
- Data sources: approved live market data provider
- Execution: live broker account
- Risk: real financial and operational exposure
Backtesting
- Purpose: historical strategy validation
- Data sources: historical data
- Execution: simulated only
- Risk: no live broker interaction should be possible
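
The environment separation above can be enforced in code rather than by convention alone. The following is a minimal sketch, not QuantMatrix's actual implementation: `TradingMode`, `ALLOWED_EXECUTION`, `execution_target`, and the `live_enabled` flag are hypothetical names introduced for illustration.

```python
from enum import Enum


class TradingMode(Enum):
    LOCAL = "local"
    PAPER = "paper"
    LIVE = "live"
    BACKTEST = "backtest"


# Hypothetical mapping: each environment may use exactly one execution path.
ALLOWED_EXECUTION = {
    TradingMode.LOCAL: "dry_run",
    TradingMode.PAPER: "broker_paper",
    TradingMode.LIVE: "broker_live",
    TradingMode.BACKTEST: "simulated",
}


def execution_target(mode: TradingMode, live_enabled: bool = False) -> str:
    """Return the execution path for a mode, refusing accidental live trading."""
    if mode is TradingMode.LIVE and not live_enabled:
        # Live trading must be an explicit decision, never a default.
        raise RuntimeError("live mode requested without explicit live_enabled flag")
    return ALLOWED_EXECUTION[mode]
```

Because the mapping has no entry that sends backtest or local mode to a broker, a misconfigured mode in this sketch cannot select a live execution path.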
3. Roles and Responsibilities
Operator
- starts and monitors the system
- verifies pre-market readiness
- watches health, orders, positions, and risk state
- triggers emergency halt if required
Engineer
- investigates incidents
- fixes system defects
- performs reconciliation and recovery support
- maintains logs, alerts, and runbook accuracy
Strategy Owner
- owns strategy configuration
- reviews analytics and trade outcomes
- approves strategy parameter changes
4. Normal Operating Lifecycle
Every trading day should follow this sequence:
1. Pre-start checks
2. System startup
3. Health verification
4. Pre-market validation
5. Market-session monitoring
6. Incident handling if needed
7. End-of-session shutdown
8. Post-session reconciliation
9. Trade review and analytics review
5. Pre-Start Checklist
Before starting QuantMatrix:
Environment Check
- [ ] Confirm correct environment: local, paper, live, or backtest
- [ ] Confirm correct broker credentials loaded
- [ ] Confirm correct market data provider configured
- [ ] Confirm correct execution broker configured
- [ ] Confirm trading mode displayed correctly in UI
Safety Check
- [ ] Confirm live trading is intentionally enabled, not accidental
- [ ] Confirm max daily loss value is correct
- [ ] Confirm max position size is correct
- [ ] Confirm max trades per day is correct
- [ ] Confirm blocked symbols list is loaded
System Check
- [ ] Redis reachable
- [ ] PostgreSQL reachable
- [ ] API service starts cleanly
- [ ] Background jobs enabled as expected
- [ ] No stuck processes from prior session
Data Check
- [ ] Market data provider latency acceptable
- [ ] Clock synchronization acceptable
- [ ] Momentum Radar timing aligned to the minute boundary
- [ ] Historical snapshot retention rules active
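
Parts of the System Check can be automated before an operator signs off. The sketch below uses only the Python standard library; `tcp_reachable` and `failed_checks` are hypothetical helpers, and a successful TCP connection only shows that a port is open, not that Redis or PostgreSQL is actually healthy.

```python
import socket
from typing import Callable, Dict, List


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every named check and return the names that failed or raised."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # A check that crashes counts as a failed check.
        if not ok:
            failures.append(name)
    return failures
```

A possible use, assuming both services run locally on their default ports (6379 for Redis, 5432 for PostgreSQL): `failed_checks({"redis": lambda: tcp_reachable("localhost", 6379), "postgresql": lambda: tcp_reachable("localhost", 5432)})`. An empty result means every listed check passed.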
6. Startup Procedure
Start the platform in this order:
1. Configuration and secrets loader
2. PostgreSQL connection and migrations
3. Redis connection
4. Core API service
5. Market data ingestion service
6. Momentum Radar service
7. Opportunity Scanner
8. Strategy Allocator
9. Risk Manager
10. Order Execution Service
11. UI
Startup Verification
After startup, confirm:
- [ ] API health endpoint reports healthy
- [ ] UI loads correctly
- [ ] Account summary loads
- [ ] System Health panel shows all required services
- [ ] Data broker is connected
- [ ] Execution broker is connected
- [ ] Risk manager is active
- [ ] Scanner is active
- [ ] No unexpected open orders loaded from previous session
If any critical dependency fails, do not proceed to active trading.
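
The startup order and the stop-on-failure rule can be sketched as a small loop. This is an illustration only: `STARTUP_ORDER` mirrors the sequence in this section, the component keys and the `start_in_order` function are hypothetical, and for simplicity the sketch treats every component as critical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Component keys are illustrative names for the services listed in this section.
STARTUP_ORDER = [
    "config_and_secrets",
    "postgresql",
    "redis",
    "core_api",
    "market_data_ingestion",
    "momentum_radar",
    "opportunity_scanner",
    "strategy_allocator",
    "risk_manager",
    "order_execution",
    "ui",
]


def start_in_order(
    starters: Dict[str, Callable[[], bool]],
) -> Tuple[List[str], Optional[str]]:
    """Start components in order; stop at the first one that fails.

    Returns (started_components, failed_component_or_None).
    """
    started: List[str] = []
    for name in STARTUP_ORDER:
        if not starters[name]():
            return started, name  # Do not start anything downstream.
        started.append(name)
    return started, None
```

Returning the list of components that did start makes it easier to stop them cleanly before another attempt.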
7. Health Verification
The following components must expose health:
- API/backend gateway
- Redis
- PostgreSQL
- Market data provider connection
- Execution broker connection
- Opportunity Scanner
- Strategy Allocator
- Risk Manager
- Order Execution Service
- Trade analytics pipeline
Health Status Expectations
Each component should show:
- status: healthy / degraded / failed
- latency
- last successful activity timestamp
- error detail if degraded or failed
Critical Health Conditions
Do not allow trading if:
- execution broker is disconnected
- risk manager is unavailable
- account summary cannot be loaded
- position reconciliation fails
- clock synchronization is materially wrong
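
The critical health conditions can be expressed as a single gate that returns the reasons trading must stay off. This is a sketch under stated assumptions: `HealthSnapshot` and `trading_blockers` are hypothetical names, and the one-second clock skew threshold is a placeholder because this runbook does not define what "materially wrong" means.

```python
from dataclasses import dataclass
from typing import List

# Placeholder threshold; the real tolerance must come from policy.
MAX_CLOCK_SKEW_SECONDS = 1.0


@dataclass
class HealthSnapshot:
    execution_broker_connected: bool
    risk_manager_available: bool
    account_summary_loaded: bool
    reconciliation_passed: bool
    clock_skew_seconds: float


def trading_blockers(
    h: HealthSnapshot, max_skew: float = MAX_CLOCK_SKEW_SECONDS
) -> List[str]:
    """Return every reason trading must not be allowed; empty means no blocker."""
    blockers = []
    if not h.execution_broker_connected:
        blockers.append("execution broker disconnected")
    if not h.risk_manager_available:
        blockers.append("risk manager unavailable")
    if not h.account_summary_loaded:
        blockers.append("account summary not loaded")
    if not h.reconciliation_passed:
        blockers.append("position reconciliation failed")
    if abs(h.clock_skew_seconds) > max_skew:
        blockers.append("clock skew exceeds tolerance")
    return blockers
```

Returning all blockers, rather than stopping at the first, gives the operator the full picture in one pass.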
8. Pre-Market Validation
Before market open:
- [ ] Confirm account buying power is correct
- [ ] Confirm equity and cash are correct
- [ ] Confirm no unexpected positions are open
- [ ] Confirm no unexpected orders are resting
- [ ] Confirm blocked list is correct
- [ ] Confirm scanner settings are correct
- [ ] Confirm strategy parameters are correct
- [ ] Confirm alerts are enabled
- [ ] Confirm emergency halt control is available
If the system supports scheduled scanning before market open, confirm it is using the intended time window.
9. In-Session Monitoring
During market hours, monitor:
Command Center
- account summary cards
- system health widget
- Momentum Radar
- Active Strategy Watchlist
- Active Positions
Orders
- new orders
- rejected orders
- partial fills
- cancels
- unexpected duplicates
Risk
- daily loss progression
- trade count
- position sizing compliance
- blocked or halted state
Analytics Signals
- unusual slippage
- repeated failed entries
- systematic late exits
- unusual concentration in one symbol or strategy
10. Routine Operator Actions
If a symbol should be excluded
- use the Block action
- choose session-only or permanent block
- verify it no longer enters the scanner/watchlist flow
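
The difference between a session-only and a permanent block can be sketched as two sets with different lifetimes. `BlockList` and its methods are hypothetical and kept in memory for brevity; a real block list would need to persist permanent entries.

```python
from typing import Iterable, List, Set


class BlockList:
    """Hypothetical block list with session-only and permanent entries."""

    def __init__(self, permanent: Iterable[str] = ()) -> None:
        self.permanent: Set[str] = {s.upper() for s in permanent}
        self.session: Set[str] = set()

    def block(self, symbol: str, permanent: bool = False) -> None:
        target = self.permanent if permanent else self.session
        target.add(symbol.upper())

    def is_blocked(self, symbol: str) -> bool:
        s = symbol.upper()
        return s in self.permanent or s in self.session

    def filter_candidates(self, symbols: Iterable[str]) -> List[str]:
        """Drop blocked symbols before they reach the scanner/watchlist flow."""
        return [s for s in symbols if not self.is_blocked(s)]

    def end_session(self) -> None:
        """Clear session-only blocks; permanent blocks survive."""
        self.session.clear()
```

Applying `filter_candidates` at the scanner output is one way to check that a blocked symbol no longer enters the watchlist flow.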
If a strategy is stuck
- inspect polling health
- verify whether position exists
- use Restart only if there is no unsafe state transition
- if position exists, check whether strategy ownership is preserved before restarting logic
If a watchlist candidate is no longer wanted
- stop the strategy or close the candidate
- verify lifecycle timestamp updates correctly
11. Risk Events
Max Daily Loss Reached
Expected behavior:
- trading is halted
- new entries are blocked
- risk violation is logged
- UI clearly shows halted state
Operator actions:
1. Confirm halt occurred
2. Confirm no new buy orders are being accepted
3. Decide whether open positions should continue under exit logic or be manually liquidated
4. Record incident for review
Max Position Size Violation
Expected behavior:
- order blocked before broker submission
- violation reason persisted
Operator actions:
1. Confirm no oversized order was submitted
2. Review sizing logic
3. Verify strategy config
Max Trades Per Day Reached
Expected behavior:
- additional entries blocked
- positions already open continue under exit logic
Operator actions:
1. Confirm trade cap triggered as expected
2. Confirm no new entries are being created
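
The three risk rules above share one shape: an entry order is checked before broker submission and is either allowed or blocked with a reason. The sketch below assumes hypothetical `RiskLimits`, `RiskState`, and `check_entry` names, and it expresses the daily loss limit as a positive amount.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskLimits:
    max_daily_loss: float      # Positive amount, e.g. 500.0 means halt at -500 P&L.
    max_position_size: int     # Maximum shares per position.
    max_trades_per_day: int


@dataclass
class RiskState:
    pnl_today: float
    trades_today: int


def check_entry(limits: RiskLimits, state: RiskState, order_qty: int) -> Optional[str]:
    """Return the violated rule name for a new entry, or None if it may proceed."""
    if state.pnl_today <= -limits.max_daily_loss:
        return "max_daily_loss"
    if order_qty > limits.max_position_size:
        return "max_position_size"
    if state.trades_today >= limits.max_trades_per_day:
        return "max_trades_per_day"
    return None
```

The check applies to entries only. Exits for positions that are already open are deliberately not routed through it, which matches the expectation that open positions continue under exit logic.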
12. Broker Connectivity Incidents
Market Data Broker Disconnect
Expected behavior:
- reconnect attempts begin automatically
- status becomes degraded
- scanner and radar behavior degrade safely
Operator actions:
1. Confirm disconnect in health panel
2. Confirm reconnect attempts are happening
3. Pause new automated entries if data quality is uncertain
4. If disconnect persists, halt trading
Execution Broker Disconnect
Expected behavior:
- no new orders are submitted
- status becomes failed and is escalated as critical
- risk and UI reflect execution outage
Operator actions:
1. Immediately stop trusting automated entry flow
2. Verify current positions from broker directly if possible
3. Halt new strategy entries
4. Reconcile all open orders and positions once connection returns
Broker API Errors or Rate Limits
Expected behavior:
- retries with backoff
- idempotent order handling
- alert on repeated failure
Operator actions:
1. Check whether retries are succeeding
2. Confirm duplicate orders are not created
3. Reduce load or halt trading if instability continues
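
Retries with backoff only stay safe if every attempt carries the same idempotency key, so that a retry after a lost acknowledgment cannot create a second order. The sketch below is illustrative: `submit_with_retry` and the `submit(order, key)` callable are hypothetical, and it assumes the broker deduplicates on a client-supplied key, which must be confirmed for the broker in use.

```python
import time
import uuid
from typing import Any, Callable, Dict


def submit_with_retry(
    submit: Callable[[Dict[str, Any], str], Any],
    order: Dict[str, Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Submit an order, retrying transient failures with exponential backoff."""
    key = str(uuid.uuid4())  # Generated once and reused on every attempt.
    for attempt in range(max_attempts):
        try:
            return submit(order, key)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Retries exhausted: surface the failure so it can alert.
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

Only `ConnectionError` is retried here. Rejections for risk or broker-rule reasons should not be retried blindly, and the injectable `sleep` parameter exists so the function can be tested without waiting.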
13. Order Incidents
Order Rejected
Operator actions:
1. Check rejection reason
2. Determine whether rejection is due to risk, broker rules, invalid size, or market state
3. Confirm strategy does not loop and resubmit blindly
4. Log for post-session analysis
Partial Fill
Operator actions:
1. Verify OMS state reflects partial quantity
2. Verify position reflects actual filled shares
3. Confirm exit logic uses actual position size, not intended size
Duplicate Order Suspicion
Operator actions:
1. Check idempotency key
2. Check broker order history
3. Compare internal order state with broker state
4. Halt affected strategy if state is ambiguous
14. Position Reconciliation
Position reconciliation should happen:
- on startup
- after broker reconnect
- after order incident
- at end of session
Reconciliation Procedure
1. Pull current positions from execution broker
2. Compare against internal positions table/state
3. Compare expected quantity, average price, and side
4. Compare resting orders
5. Repair discrepancies using reconciliation workflow
6. Record reconciliation event
If Mismatch Exists
- do not assume internal state is correct
- treat broker-confirmed state as authoritative for current live exposure
- repair internal state carefully and log the adjustment
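
The comparison step can be sketched as a diff between broker and internal positions. This is a simplified illustration: `diff_positions` is a hypothetical helper, it compares signed quantities only (positive for long, negative for short), and it leaves out average price and resting orders, which a full reconciliation also needs to compare.

```python
from typing import Dict, Tuple


def diff_positions(
    broker: Dict[str, int], internal: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {symbol: (broker_qty, internal_qty)} for every mismatch.

    A symbol missing on one side is treated as quantity 0 on that side.
    """
    mismatches: Dict[str, Tuple[int, int]] = {}
    for symbol in sorted(set(broker) | set(internal)):
        broker_qty = broker.get(symbol, 0)
        internal_qty = internal.get(symbol, 0)
        if broker_qty != internal_qty:
            mismatches[symbol] = (broker_qty, internal_qty)
    return mismatches
```

The broker quantity comes first in each pair because broker-confirmed state is the authoritative side. An empty result is the condition for treating the quantity check as passed.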
15. Emergency Liquidate and Halt
Use this when:
- system state is inconsistent
- execution behavior is unsafe
- market data is unreliable
- repeated broker failures are occurring
- risk controls appear compromised
Expected System Behavior
- cancel open entry orders if possible
- submit liquidate-all for open positions
- halt further automated trading
- persist emergency event
- show halted state in UI
Operator Procedure
1. Trigger Emergency Liquidate & Halt
2. Confirm command accepted
3. Monitor broker for cancels and liquidations
4. Confirm positions go to zero
5. Confirm system remains halted
6. Do not resume trading until reconciliation is complete
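
The expected system behavior can be sketched as one function that follows the listed order: cancel, liquidate, halt, persist. Everything here is hypothetical, including the `Broker` protocol, its method names, and the use of a plain dictionary and list to stand in for system state and the event store.

```python
from typing import Dict, List, Protocol


class Broker(Protocol):
    """Hypothetical broker interface used only for this sketch."""

    def cancel_open_orders(self) -> None: ...
    def open_positions(self) -> Dict[str, int]: ...
    def liquidate(self, symbol: str, quantity: int) -> None: ...


def emergency_liquidate_and_halt(
    broker: Broker, state: Dict[str, bool], events: List[str]
) -> None:
    """Cancel open orders, liquidate all positions, halt, and record the event."""
    broker.cancel_open_orders()
    # Snapshot positions first so the loop is not affected by fills during it.
    for symbol, quantity in broker.open_positions().items():
        broker.liquidate(symbol, quantity)
    state["halted"] = True  # Stays set until an operator clears it.
    events.append("emergency_liquidate_and_halt")
```

The sketch does not verify that positions actually reach zero. That confirmation remains an operator step, followed by reconciliation before any resume.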
16. Shutdown Procedure
At end of session:
1. Stop new candidate intake
2. Allow open workflows to settle or manually close according to policy
3. Confirm order state stable
4. Confirm positions state stable
5. Flush final analytics events
6. Persist session summaries
7. Stop background services in safe order
Recommended stop order:
1. Scanner
2. Strategy Allocator
3. Entry logic
4. Exit logic after position resolution
5. Market data ingestion
6. Order execution service
7. API/UI if needed
17. Post-Session Reconciliation
At end of session confirm:
- [ ] orders match broker
- [ ] executions match broker
- [ ] positions are flat if expected
- [ ] realized P&L matches broker statement or account summary
- [ ] blocked list updates persisted as expected
- [ ] analytics pipeline closed trades correctly
- [ ] recommendation generation completed if scheduled
18. Post-Session Review
Review at minimum:
- winners and losers
- rejected orders
- partial fills
- slippage outliers
- strategies with poor behavior
- risk blocks
- symbols frequently stopped out
- time-of-day weakness patterns
Store:
- incident notes
- operator notes
- manual overrides
- candidate improvements for next session
19. Recovery After Crash or Restart
If the platform crashes or is restarted during a session:
1. Bring core services back carefully
2. Reconnect to Redis and PostgreSQL
3. Pull broker account, orders, and positions
4. Rebuild live state from broker plus persistent records
5. Replay Redis events or durable events if supported
6. Mark recovered session clearly in logs
7. Do not resume automated entries until reconciliation passes
Recovery Acceptance Criteria
- account summary correct
- open positions correct
- open orders correct
- watchlist state repaired or safely cleared
- risk state restored
- analytics event continuity preserved
20. Alerts and Escalation
Alerts should exist for:
- execution broker disconnect
- market data broker disconnect
- failed order placement
- repeated order rejection
- max daily loss breach
- emergency halt triggered
- reconciliation mismatch
- analytics pipeline failure
Escalation Priorities
Critical
- live position mismatch
- execution broker outage with open positions
- emergency halt failure
- duplicate order with real exposure
High
- repeated rejected orders
- data outage during active entry logic
- missing risk enforcement
Medium
- analytics lag
- delayed dashboard updates
- scanner timing drift
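
The escalation priorities can be kept as data so that alert routing and this runbook stay aligned. In the sketch below the event keys are illustrative identifiers for the items listed above, and `Priority`, `ESCALATION`, and `highest_priority` are hypothetical names.

```python
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Priority(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Keys are illustrative identifiers for the conditions in this section.
ESCALATION: Dict[str, Priority] = {
    "live_position_mismatch": Priority.CRITICAL,
    "execution_outage_with_open_positions": Priority.CRITICAL,
    "emergency_halt_failure": Priority.CRITICAL,
    "duplicate_order_with_real_exposure": Priority.CRITICAL,
    "repeated_rejected_orders": Priority.HIGH,
    "data_outage_during_active_entry": Priority.HIGH,
    "missing_risk_enforcement": Priority.HIGH,
    "analytics_lag": Priority.MEDIUM,
    "delayed_dashboard_updates": Priority.MEDIUM,
    "scanner_timing_drift": Priority.MEDIUM,
}


def highest_priority(events: Iterable[str]) -> Optional[Priority]:
    """Return the highest priority among known events, or None if none are known."""
    known = [ESCALATION[e] for e in events if e in ESCALATION]
    return max(known) if known else None
```

Unknown event names are ignored in this sketch. A production router would more likely flag them, since an unclassified alert is itself a gap worth reviewing.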
21. Logging Requirements
At minimum log:
- startup and shutdown events
- configuration mode
- broker connection changes
- scanner runs
- strategy assignments
- signals
- risk decisions
- order commands
- broker acknowledgments
- fills
- reconciliation events
- emergency actions
- analytics completion events
Logs should be structured and timestamped.
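
One way to get structured, timestamped logs is a JSON formatter on top of Python's standard `logging` module. This is a sketch, not the project's logging setup: `JsonFormatter` and the `fields` attribute passed through `extra` are conventions assumed for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object with a UTC ISO 8601 timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            # "fields" is attached by callers via logging's `extra` argument.
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)
```

With a handler configured through `handler.setFormatter(JsonFormatter())`, a call such as `logger.info("risk_decision", extra={"fields": {"rule": "max_daily_loss", "allowed": False}})` emits a single machine-parseable line per event.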
22. Runbook Maintenance
This runbook must be updated when:
- a new broker is added
- a new risk rule is added
- recovery flow changes
- order lifecycle changes
- analytics workflow changes
- UI emergency controls change
Review cadence:
- after every major release
- after every serious incident
- before enabling live trading changes
23. Quick Reference Checklists
Safe To Start Trading
- [ ] all health checks green
- [ ] account summary verified
- [ ] no unexpected positions or orders
- [ ] risk manager active
- [ ] execution broker connected
- [ ] data broker connected
- [ ] emergency halt available
Must Halt Trading Immediately
- [ ] execution state is ambiguous
- [ ] risk manager unavailable
- [ ] duplicate live order suspected
- [ ] data quality severely degraded
- [ ] broker positions do not match internal positions
Safe To Resume After Incident
- [ ] incident cause understood
- [ ] broker state reconciled
- [ ] internal state repaired
- [ ] risk controls verified
- [ ] operator decision recorded
24. Recommended Next Companion Documents
After this runbook, the next useful supporting documents are:
- Incident Response Playbook
- Production Readiness Review
- Live Trading Go-Live Checklist
- Trade Review Template