Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Day-to-day operating guide for running QuantMatrix safely across startup, monitoring, incidents, and shutdown.

QuantMatrix Operations Runbook

This runbook is the operating manual for QuantMatrix.

Its purpose is to help an operator or engineer:
- start the platform safely
- verify that it is healthy before trading begins
- monitor it during the session
- respond to incidents calmly and consistently
- shut it down safely
- reconcile state after issues

This is intentionally practical. It should be used alongside the architecture, implementation, analytics, and checklist documents.

1. Runbook Scope

This runbook covers:
- local development operation
- paper trading operation
- live trading operation
- startup and shutdown procedures
- health checks
- broker connectivity issues
- risk violations
- order and position reconciliation
- emergency halt and liquidation
- post-session review

This runbook does not replace:
- broker compliance obligations
- credential management policy
- deployment infrastructure guides

2. Environments

QuantMatrix should operate in clearly separated environments:

Local Development

Purpose: UI development, backend integration, dry-run workflows
Data sources: demo feed or sandbox data
Execution: dry-run only
Risk: no real capital

Paper Trading

Purpose: production-like testing using broker paper accounts
Data sources: real or paper-compatible market data
Execution: broker paper environment only
Risk: no real capital, but operational behavior must match live

Live Trading

Purpose: real trading with real capital
Data sources: approved live market data provider
Execution: live broker account
Risk: real financial and operational exposure

Backtesting

Purpose: historical strategy validation
Data sources: historical data
Execution: simulated only
Risk: no live broker interaction should be possible
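The separation above can be enforced in code rather than left to convention. The following is a minimal sketch, assuming a hypothetical `TradingMode` enum, a `RuntimeConfig` object, and execution-target labels such as `"broker-live"`; none of these names come from the QuantMatrix codebase. Live execution requires an explicit opt-in flag, and backtest mode accepts only a simulated target.

```python
from dataclasses import dataclass
from enum import Enum


class TradingMode(Enum):
    """Hypothetical names for the four environments described above."""
    LOCAL = "local"
    PAPER = "paper"
    LIVE = "live"
    BACKTEST = "backtest"


@dataclass(frozen=True)
class RuntimeConfig:
    mode: TradingMode
    execution_target: str               # assumed labels: "dry-run", "broker-paper", "broker-live", "simulated"
    live_trading_enabled: bool = False  # explicit opt-in, never on by default


# Which execution target each mode may use (an assumed mapping for this sketch).
ALLOWED_TARGETS = {
    TradingMode.LOCAL: {"dry-run"},
    TradingMode.PAPER: {"broker-paper"},
    TradingMode.LIVE: {"broker-live"},
    TradingMode.BACKTEST: {"simulated"},
}


def validate_config(cfg: RuntimeConfig) -> None:
    """Raise ValueError if the configuration mixes environments."""
    if cfg.execution_target not in ALLOWED_TARGETS[cfg.mode]:
        raise ValueError(
            f"{cfg.mode.value} mode cannot use execution target {cfg.execution_target!r}"
        )
    if cfg.mode is TradingMode.LIVE and not cfg.live_trading_enabled:
        raise ValueError("live mode requires live_trading_enabled=True")
    if cfg.mode is not TradingMode.LIVE and cfg.live_trading_enabled:
        raise ValueError("live_trading_enabled is only valid in live mode")
```

Running a check like this before any service connects means a misconfigured environment fails at startup instead of during the session.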

3. Roles and Responsibilities

Operator

starts and monitors the system
verifies pre-market readiness
watches health, orders, positions, and risk state
triggers emergency halt if required

Engineer

investigates incidents
fixes system defects
performs reconciliation and recovery support
maintains logs, alerts, and runbook accuracy

Strategy Owner

owns strategy configuration
reviews analytics and trade outcomes
approves strategy parameter changes

4. Normal Operating Lifecycle

Every trading day should follow this sequence:

1. Pre-start checks
2. System startup
3. Health verification
4. Pre-market validation
5. Market-session monitoring
6. Incident handling if needed
7. End-of-session shutdown
8. Post-session reconciliation
9. Trade review and analytics review

5. Pre-Start Checklist

Before starting QuantMatrix:

Environment Check

[ ] Confirm correct environment: local, paper, live, or backtest
[ ] Confirm correct broker credentials loaded
[ ] Confirm correct market data provider configured
[ ] Confirm correct execution broker configured
[ ] Confirm trading mode displayed correctly in UI

Safety Check

[ ] Confirm live trading is intentionally enabled, not accidental
[ ] Confirm max daily loss value is correct
[ ] Confirm max position size is correct
[ ] Confirm max trades per day is correct
[ ] Confirm blocked symbols list is loaded
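Part of the safety check can be automated by comparing the loaded risk limits against values the operator has approved for the session. This is a sketch under stated assumptions: the `RiskLimits` structure, its field names, and the idea of a separate "approved" copy are illustrative, not QuantMatrix's actual configuration schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RiskLimits:
    max_daily_loss: float        # account currency, expressed as a positive number
    max_position_size: int       # shares per position
    max_trades_per_day: int
    blocked_symbols: frozenset = field(default_factory=frozenset)


def safety_check(loaded: RiskLimits, approved: RiskLimits) -> list[str]:
    """Return human-readable discrepancies; an empty list means the check passed."""
    problems = []
    for name in ("max_daily_loss", "max_position_size", "max_trades_per_day"):
        value, expected = getattr(loaded, name), getattr(approved, name)
        if value <= 0:
            problems.append(f"{name} must be positive, got {value}")
        elif value != expected:
            problems.append(f"{name} is {value}, approved value is {expected}")
    # Every approved block must be present; extra blocks are not treated as errors.
    missing = approved.blocked_symbols - loaded.blocked_symbols
    if missing:
        problems.append(f"blocked symbols not loaded: {sorted(missing)}")
    return problems
```

A non-empty result should stop the pre-start sequence rather than produce a warning, since each item corresponds to an unchecked box above.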

System Check

[ ] Redis reachable
[ ] PostgreSQL reachable
[ ] API service starts cleanly
[ ] Background jobs enabled as expected
[ ] No stuck processes from prior session

Data Check

[ ] Market data provider latency acceptable
[ ] Clock synchronization acceptable
[ ] Momentum Radar timing configured to minute boundary
[ ] Historical snapshot retention rules active
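Minute-boundary alignment and clock-skew checks are small time calculations. The sketch below uses only the standard library. The reference time would in practice come from a trusted source such as NTP or the broker's server time, and the 250 ms tolerance is an assumed example value, not a documented QuantMatrix setting.

```python
from datetime import datetime, timedelta, timezone


def utc_now() -> datetime:
    """Timezone-aware current time; avoid naive datetimes for scheduling."""
    return datetime.now(timezone.utc)


def next_minute_boundary(now: datetime) -> datetime:
    """Return the first whole-minute timestamp strictly after `now`."""
    floored = now.replace(second=0, microsecond=0)
    return floored + timedelta(minutes=1)


def seconds_until_next_minute(now: datetime) -> float:
    """How long a Momentum Radar style job should sleep to land on the boundary."""
    return (next_minute_boundary(now) - now).total_seconds()


def clock_skew_ok(local: datetime, reference: datetime,
                  tolerance: timedelta = timedelta(milliseconds=250)) -> bool:
    """True if the local clock is within `tolerance` of a trusted reference time."""
    return abs(local - reference) <= tolerance
```

Recomputing the sleep from the current time on every cycle, instead of sleeping a fixed 60 seconds, keeps the schedule from drifting off the boundary.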

6. Startup Procedure

Start the platform in this order:

1. Configuration and secrets loader
2. PostgreSQL connection and migrations
3. Redis connection
4. Core API service
5. Market data ingestion service
6. Momentum Radar service
7. Opportunity Scanner
8. Strategy Allocator
9. Risk Manager
10. Order Execution Service
11. UI
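The startup order can be encoded as an ordered list that stops at the first failure, so no service is brought up on top of a broken dependency. This is a sketch, assuming each service exposes a hypothetical zero-argument `start` callable that returns True on success; the service identifiers simply mirror the list above.

```python
from typing import Callable

STARTUP_ORDER = [
    "config_and_secrets",
    "postgresql",
    "redis",
    "core_api",
    "market_data_ingestion",
    "momentum_radar",
    "opportunity_scanner",
    "strategy_allocator",
    "risk_manager",
    "order_execution",
    "ui",
]


def start_all(starters: dict[str, Callable[[], bool]]) -> list[str]:
    """Start services in STARTUP_ORDER and return the names that started.

    Raises RuntimeError at the first service that is missing or fails to
    start, so nothing downstream of a broken dependency is started.
    """
    started: list[str] = []
    for name in STARTUP_ORDER:
        start = starters.get(name)
        if start is None or not start():
            raise RuntimeError(f"startup halted at {name!r}; started so far: {started}")
        started.append(name)
    return started
```

The same list read in reverse gives a reasonable default teardown order, although the shutdown procedure in this runbook specifies its own sequence.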

Startup Verification

After startup, confirm:
- [ ] API health endpoint reports healthy
- [ ] UI loads correctly
- [ ] Account summary loads
- [ ] System Health panel shows all required services
- [ ] Data broker is connected
- [ ] Execution broker is connected
- [ ] Risk manager is active
- [ ] Scanner is active
- [ ] No unexpected open orders were loaded from the previous session

If any critical dependency fails, do not proceed to active trading.

7. Health Verification

The following components must expose a health status:
- API/backend gateway
- Redis
- PostgreSQL
- Market data provider connection
- Execution broker connection
- Opportunity Scanner
- Strategy Allocator
- Risk Manager
- Order Execution Service
- Trade analytics pipeline

Health Status Expectations

Each component should show:
- status: healthy / degraded / failed
- latency
- last successful activity timestamp
- error detail if degraded or failed
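One way to derive the three statuses from the fields listed above is to compare latency and the age of the last successful activity against per-component thresholds. This is a sketch only: the `ComponentHealth` type, the 500 ms latency limit, the 30-second staleness window, and the rule that twice the window means "failed" are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ComponentHealth:
    name: str
    latency_ms: Optional[float]       # None if the last probe got no response
    last_success: Optional[datetime]  # last successful activity timestamp
    error: Optional[str] = None       # error detail, if any

    def status(self, now: datetime,
               max_latency_ms: float = 500.0,
               max_staleness: timedelta = timedelta(seconds=30)) -> str:
        """Classify as 'healthy', 'degraded', or 'failed' (thresholds are assumed)."""
        if self.last_success is None or self.latency_ms is None:
            return "failed"
        age = now - self.last_success
        if age > 2 * max_staleness:
            return "failed"
        if self.error or self.latency_ms > max_latency_ms or age > max_staleness:
            return "degraded"
        return "healthy"
```

Thresholds would normally differ per component; an execution broker, for example, usually warrants a tighter staleness window than the analytics pipeline.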

Critical Health Conditions

Do not allow trading if:
- the execution broker is disconnected
- the risk manager is unavailable
- the account summary cannot be loaded
- position reconciliation fails
- clock drift is outside the acceptable tolerance
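These conditions act as hard gates: trading is allowed only when none of them holds. The sketch below returns the blocking reasons rather than a bare boolean, so the UI and logs can show why trading is disallowed. The `PreTradeState` fields are hypothetical names for the conditions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreTradeState:
    execution_broker_connected: bool
    risk_manager_available: bool
    account_summary_loaded: bool
    positions_reconciled: bool
    clock_within_tolerance: bool


def trading_blockers(state: PreTradeState) -> list[str]:
    """Return the reasons trading must not start; empty means no critical blocker."""
    checks = {
        "execution broker is disconnected": state.execution_broker_connected,
        "risk manager is unavailable": state.risk_manager_available,
        "account summary cannot be loaded": state.account_summary_loaded,
        "position reconciliation failed": state.positions_reconciled,
        "clock synchronization is out of tolerance": state.clock_within_tolerance,
    }
    # dicts preserve insertion order, so reasons come back in a stable order
    return [reason for reason, ok in checks.items() if not ok]
```

An empty list is necessary but not sufficient: the pre-market validation below still applies.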

8. Pre-Market Validation

Before market open:
- [ ] Confirm account buying power is correct
- [ ] Confirm equity and cash are correct
- [ ] Confirm no unexpected positions are open
- [ ] Confirm no unexpected orders are resting
- [ ] Confirm blocked list is correct
- [ ] Confirm scanner settings are correct
- [ ] Confirm strategy parameters are correct
- [ ] Confirm alerts are enabled
- [ ] Confirm emergency halt control is available

If the system supports scheduled scanning before market open, confirm it is using the intended time window.

9. In-Session Monitoring

During market hours, monitor:

Command Center

account summary cards
system health widget
Momentum Radar
Active Strategy Watchlist
Active Positions

Orders

new orders
rejected orders
partial fills
cancels
unexpected duplicates

Risk

daily loss progression
trade count
position sizing compliance
blocked or halted state

Analytics Signals

unusual slippage
repeated failed entries
systematic late exits
unusual concentration in one symbol or strategy

10. Routine Operator Actions

If a symbol should be excluded

use the Block action
choose session-only or permanent block
verify it no longer enters the scanner/watchlist flow
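Session-only and permanent blocks differ only in lifetime, so one structure can hold both and the scanner can filter candidates through it. A minimal sketch, assuming a hypothetical `Blocklist` class whose session blocks are cleared at session end; persisting permanent blocks (for example to PostgreSQL) is left out.

```python
from typing import Iterable, Optional


class Blocklist:
    """Tracks symbols excluded from the scanner/watchlist flow."""

    def __init__(self, permanent: Optional[Iterable[str]] = None) -> None:
        self._permanent = {s.upper() for s in (permanent or ())}
        self._session: set[str] = set()

    def block(self, symbol: str, permanent: bool = False) -> None:
        target = self._permanent if permanent else self._session
        target.add(symbol.upper())

    def is_blocked(self, symbol: str) -> bool:
        s = symbol.upper()
        return s in self._permanent or s in self._session

    def filter_candidates(self, symbols: list[str]) -> list[str]:
        """Drop blocked symbols before they reach the scanner or watchlist."""
        return [s for s in symbols if not self.is_blocked(s)]

    def end_session(self) -> None:
        """Session-only blocks expire; permanent blocks remain."""
        self._session.clear()
```

Filtering at the point where candidates enter the flow, rather than inside each strategy, makes the "verify it no longer enters" step a single check.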

If a strategy is stuck

inspect polling health
verify whether a position exists
use Restart only if restarting will not cause an unsafe state transition
if a position exists, confirm that the strategy's ownership of it will be preserved before restarting the logic

If a watchlist candidate is no longer wanted

stop the strategy or close the candidate
verify lifecycle timestamp updates correctly

11. Risk Events

Max Daily Loss Reached

Expected behavior:
- trading is halted
- new entries are blocked
- risk violation is logged
- UI clearly shows halted state

Operator actions:
1. Confirm halt occurred
2. Confirm no new buy orders are being accepted
3. Decide whether open positions should continue under exit logic or be manually liquidated
4. Record incident for review

Max Position Size Violation

Expected behavior:
- order blocked before broker submission
- violation reason persisted

Operator actions:
1. Confirm no oversized order was submitted
2. Review sizing logic
3. Verify strategy config

Max Trades Per Day Reached

Expected behavior:
- additional entries blocked
- positions already open continue under exit logic

Operator actions:
1. Confirm trade cap triggered as expected
2. Confirm no new entries are being created
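The three rules share one property: they block new entries but let exits through, so open positions can still be managed. The sketch below is a pre-submission check built on that property. `RiskState`, `OrderIntent`, and the reason strings are hypothetical; the daily-loss test counts realized P&L only and the size test looks at the order quantity only, both simplifying assumptions. A real risk manager would also persist each violation as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskState:
    realized_pnl_today: float    # negative when losing (realized only, an assumption)
    trades_today: int
    halted: bool = False


@dataclass(frozen=True)
class OrderIntent:
    symbol: str
    quantity: int
    is_entry: bool               # False for exits that reduce an existing position


def check_order(order: OrderIntent, state: RiskState, max_daily_loss: float,
                max_position_size: int, max_trades_per_day: int) -> Optional[str]:
    """Return a violation reason, or None if the order may go to the broker."""
    if -state.realized_pnl_today >= max_daily_loss:
        state.halted = True      # once tripped, the halt stays set for the session
    if not order.is_entry:
        return None              # exits are never blocked by these rules
    if state.halted:
        return "max daily loss reached: trading halted"
    if order.quantity > max_position_size:
        return f"quantity {order.quantity} exceeds max position size {max_position_size}"
    if state.trades_today >= max_trades_per_day:
        return "max trades per day reached"
    return None
```

Because the check runs before broker submission, a blocked order never reaches the broker, which is what the operator confirms in the actions above.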

12. Broker Connectivity Incidents

Market Data Broker Disconnect

Expected behavior:
- reconnect attempts begin automatically
- status becomes degraded
- scanner and radar behavior degrade safely

Operator actions:
1. Confirm disconnect in health panel
2. Confirm reconnect attempts are happening
3. Pause new automated entries if data quality is uncertain
4. If disconnect persists, halt trading

Execution Broker Disconnect

Expected behavior:
- no new orders are submitted
- status becomes failed and is treated as critical
- the risk manager and UI reflect the execution outage

Operator actions:
1. Immediately stop relying on the automated entry flow
2. Verify current positions directly with the broker if possible
3. Halt new strategy entries
4. Reconcile all open orders and positions once the connection returns

Broker API Errors or Rate Limits

Expected behavior:
- retries with backoff
- idempotent order handling
- alert on repeated failure

Operator actions:
1. Check whether retries are succeeding
2. Confirm duplicate orders are not created
3. Reduce load or halt trading if instability continues
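Retries are only safe if a resubmitted order cannot open a second position. A common pattern is to generate one client order ID per order intent and reuse it on every retry, so duplicates can be detected by ID. The sketch below combines that with exponential backoff and full jitter. `submit` is a hypothetical callable standing in for the broker client, `TransientBrokerError` is a hypothetical error type, and whether a particular broker de-duplicates on a client order ID must be confirmed in that broker's API documentation.

```python
import random
import time
import uuid
from typing import Callable


class TransientBrokerError(Exception):
    """Hypothetical error type for retryable failures such as rate limits."""


def submit_with_retry(submit: Callable[[str], str], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Submit one order intent, retrying transient failures with the SAME client order ID."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    client_order_id = str(uuid.uuid4())  # generated once, reused on every retry
    attempt = 0
    while True:
        try:
            return submit(client_order_id)
        except TransientBrokerError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # caller should raise an alert on repeated failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # full-jitter backoff
```

Generating a fresh ID inside the loop would defeat the purpose: each retry would look like a new order.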

13. Order Incidents

Order Rejected

Operator actions:
1. Check the rejection reason
2. Determine whether the rejection is due to risk, broker rules, invalid size, or market state
3. Confirm the strategy does not loop and blindly resubmit the order
4. Log for post-session analysis

Partial Fill

Operator actions:
1. Verify the order management system (OMS) state reflects the partially filled quantity
2. Verify position reflects actual filled shares
3. Confirm exit logic uses actual position size, not intended size
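The key rule for partial fills is that exit logic sizes from what was filled, not what was requested. The following sketch shows an order record that accumulates fills; the `Order` class, its fields, and the status strings are assumptions, not QuantMatrix's OMS schema.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    requested_qty: int
    fills: list[tuple[int, float]] = field(default_factory=list)  # (quantity, price)

    @property
    def filled_qty(self) -> int:
        return sum(q for q, _ in self.fills)

    @property
    def avg_fill_price(self) -> float:
        qty = self.filled_qty
        return sum(q * p for q, p in self.fills) / qty if qty else 0.0

    @property
    def status(self) -> str:
        if self.filled_qty == 0:
            return "open"
        return "filled" if self.filled_qty >= self.requested_qty else "partially_filled"

    def apply_fill(self, qty: int, price: float) -> None:
        """Record a fill; reject non-positive or over-filling quantities."""
        if qty <= 0 or self.filled_qty + qty > self.requested_qty:
            raise ValueError("fill quantity is invalid or exceeds the requested quantity")
        self.fills.append((qty, price))


def exit_quantity(order: Order) -> int:
    """Size the exit from actual filled shares, never from requested_qty."""
    return order.filled_qty
```

Deriving filled quantity from the fill list, instead of storing it separately, keeps the two from disagreeing after a partial fill.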

Duplicate Order Suspicion

Operator actions:
1. Check idempotency key
2. Check broker order history
3. Compare internal order state with broker state
4. Halt affected strategy if state is ambiguous

14. Position Reconciliation

Position reconciliation should happen:
- on startup
- after broker reconnect
- after order incident
- at end of session

Reconciliation Procedure

1. Pull current positions from execution broker
2. Compare against internal positions table/state
3. Compare expected quantity, average price, and side
4. Compare resting orders
5. Repair discrepancies using reconciliation workflow
6. Record reconciliation event

If Mismatch Exists

do not assume internal state is correct
treat broker-confirmed state as authoritative for current live exposure
repair internal state carefully and log the adjustment
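The comparison and repair steps can be expressed as a diff keyed by symbol, with broker state taken as authoritative and every adjustment logged. This is a sketch under stated assumptions: `Position` is a hypothetical type, quantity is signed (negative for short) so it also encodes side, and the 0.01 price tolerance is illustrative. Resting orders could be compared the same way, keyed by order ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    symbol: str
    quantity: int      # signed: positive long, negative short, so side is encoded
    avg_price: float


def reconcile(broker: dict[str, Position], internal: dict[str, Position],
              price_tolerance: float = 0.01) -> tuple[dict[str, Position], list[str]]:
    """Return (repaired internal positions, adjustment log). Broker state wins."""
    log: list[str] = []
    repaired: dict[str, Position] = {}
    for symbol in sorted(broker.keys() | internal.keys()):
        b, i = broker.get(symbol), internal.get(symbol)
        if b is None:
            # i cannot be None here: the symbol came from one of the two dicts
            log.append(f"{symbol}: internal shows {i.quantity}, broker shows none; removed")
            continue
        if i is None:
            log.append(f"{symbol}: broker shows {b.quantity}, missing internally; added")
        elif i.quantity != b.quantity or abs(i.avg_price - b.avg_price) > price_tolerance:
            log.append(f"{symbol}: internal {i.quantity}@{i.avg_price} "
                       f"-> broker {b.quantity}@{b.avg_price}")
        repaired[symbol] = b
    return repaired, log
```

The returned log is what the "Record reconciliation event" step would persist, and an operator should review it before trading resumes.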

15. Emergency Liquidate and Halt

Use this when:
- system state is inconsistent
- execution behavior is unsafe
- market data is unreliable
- repeated broker failures are occurring
- risk controls appear compromised

Expected System Behavior

cancel open entry orders if possible
submit liquidate-all for open positions
halt further automated trading
persist emergency event
show halted state in UI
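The sketch below runs these behaviors against a hypothetical `BrokerClient` protocol; its method names are illustrative and do not belong to any real broker SDK. As a design choice for this example, the halt flag is set first so no strategy starts a new entry while cancels and liquidations are in flight, and every step is attempted even if an earlier one fails, with each outcome recorded for persistence and display.

```python
from datetime import datetime, timezone
from typing import Protocol


class BrokerClient(Protocol):
    """Hypothetical broker interface; a real broker SDK will use different names."""

    def cancel_open_entry_orders(self) -> None: ...

    def liquidate_all_positions(self) -> None: ...


class TradingSwitch:
    """Minimal halt flag that strategies would check before any entry."""

    def __init__(self) -> None:
        self.halted = False


def emergency_liquidate_and_halt(broker: BrokerClient, switch: TradingSwitch) -> list[dict]:
    """Halt, cancel entry orders, liquidate; return event records to persist and show in the UI."""
    events: list[dict] = []

    def record(step: str, ok: bool, detail: str = "") -> None:
        events.append({"ts": datetime.now(timezone.utc).isoformat(),
                       "step": step, "ok": ok, "detail": detail})

    switch.halted = True  # set first so no strategy starts a new entry mid-liquidation
    record("halt", True)
    steps = (("cancel_open_entry_orders", broker.cancel_open_entry_orders),
             ("liquidate_all", broker.liquidate_all_positions))
    for step, action in steps:
        try:
            action()
            record(step, True)
        except Exception as exc:  # keep going: a failed cancel must not skip liquidation
            record(step, False, repr(exc))
    return events
```

Records with `ok` set to False mark the steps the operator must follow up on directly with the broker.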

Operator Procedure

1. Trigger Emergency Liquidate & Halt
2. Confirm command accepted
3. Monitor broker for cancels and liquidations
4. Confirm positions go to zero
5. Confirm system remains halted
6. Do not resume trading until reconciliation is complete

16. Shutdown Procedure

At end of session:
1. Stop new candidate intake
2. Allow open workflows to settle, or close them manually according to policy
3. Confirm order state is stable
4. Confirm position state is stable
5. Flush final analytics events
6. Persist session summaries
7. Stop background services in a safe order

Recommended stop order:
1. Scanner
2. Strategy Allocator
3. Entry logic
4. Exit logic, after positions are resolved
5. Market data ingestion
6. Order execution service
7. API/UI if needed

17. Post-Session Reconciliation

At end of session, confirm:
- [ ] orders match broker
- [ ] executions match broker
- [ ] positions are flat if expected
- [ ] realized P&L matches broker statement or account summary
- [ ] blocked list updates persisted as expected
- [ ] analytics pipeline closed trades correctly
- [ ] recommendation generation completed if scheduled

18. Post-Session Review

Review at minimum:
- winners and losers
- rejected orders
- partial fills
- slippage outliers
- strategies with poor behavior
- risk blocks
- symbols frequently stopped out
- time-of-day weakness patterns

Store:
- incident notes
- operator notes
- manual overrides
- candidate improvements for next session

19. Recovery After Crash or Restart

If the platform crashes or is restarted during a session:

1. Bring core services back up carefully
2. Reconnect to Redis and PostgreSQL
3. Pull broker account, orders, and positions
4. Rebuild live state from broker plus persistent records
5. Replay Redis events or durable events if supported
6. Mark recovered session clearly in logs
7. Do not resume automated entries until reconciliation passes

Recovery Acceptance Criteria

account summary correct
open positions correct
open orders correct
watchlist state repaired or safely cleared
risk state restored
analytics event continuity preserved

20. Alerts and Escalation

Alerts should exist for:
- execution broker disconnect
- market data broker disconnect
- failed order placement
- repeated order rejection
- max daily loss breach
- emergency halt triggered
- reconciliation mismatch
- analytics pipeline failure

Escalation Priorities

Critical

live position mismatch
execution broker outage with open positions
emergency halt failure
duplicate order with real exposure

High

repeated rejected orders
data outage during active entry logic
missing risk enforcement

Medium

analytics lag
delayed dashboard updates
scanner timing drift
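Keeping the priority table next to the alert definitions lets every alert carry a severity when it fires. Two entries above depend on context (an execution outage is critical "with open positions"; a data outage is high "during active entry logic"), so the sketch takes that context as input. The alert keys are hypothetical identifiers; the lower levels used outside those contexts, and treating unknown alerts as critical, are assumptions chosen for this example, not documented rules.

```python
from enum import IntEnum


class Priority(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


STATIC_PRIORITY = {
    "position_mismatch": Priority.CRITICAL,
    "emergency_halt_failure": Priority.CRITICAL,
    "duplicate_order_live": Priority.CRITICAL,
    "repeated_order_rejection": Priority.HIGH,
    "missing_risk_enforcement": Priority.HIGH,
    "analytics_lag": Priority.MEDIUM,
    "dashboard_delay": Priority.MEDIUM,
    "scanner_timing_drift": Priority.MEDIUM,
}


def classify(alert: str, has_open_positions: bool, entries_active: bool) -> Priority:
    """Map an alert to an escalation priority, using context where the runbook does."""
    if alert == "execution_broker_outage":
        return Priority.CRITICAL if has_open_positions else Priority.HIGH  # HIGH is assumed
    if alert == "market_data_outage":
        return Priority.HIGH if entries_active else Priority.MEDIUM        # MEDIUM is assumed
    return STATIC_PRIORITY.get(alert, Priority.CRITICAL)  # unknown alerts escalate conservatively
```

Using an ordered `IntEnum` lets paging rules be written as comparisons, for example paging immediately when the priority is at least `HIGH`.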

21. Logging Requirements

At minimum, log:
- startup and shutdown events
- configuration mode
- broker connection changes
- scanner runs
- strategy assignments
- signals
- risk decisions
- order commands
- broker acknowledgments
- fills
- reconciliation events
- emergency actions
- analytics completion events

Logs should be structured and timestamped.
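One way to get structured, timestamped logs with only the Python standard library is a `logging.Formatter` that writes one JSON object per line. In this sketch, the `event` and `fields` keys are naming choices for the example, passed through the standard `extra` argument; they are not an existing QuantMatrix convention.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with a UTC ISO-8601 timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": getattr(record, "event", None),
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # event-specific key/value pairs
        return json.dumps(payload, default=str)


def make_logger(name: str = "quantmatrix") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers if called more than once
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger().info("order submitted", extra={"event": "order_command", "fields": {"symbol": "AAA", "qty": 100}})` then produces one machine-parseable line per event, which makes the events listed above easy to filter and correlate after an incident.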

22. Runbook Maintenance

This runbook must be updated when:
- a new broker is added
- a new risk rule is added
- recovery flow changes
- order lifecycle changes
- analytics workflow changes
- UI emergency controls change

Review cadence:
- after every major release
- after every serious incident
- before enabling live trading changes

23. Quick Reference Checklists

Safe To Start Trading

[ ] all health checks green
[ ] account summary verified
[ ] no unexpected positions or orders
[ ] risk manager active
[ ] execution broker connected
[ ] data broker connected
[ ] emergency halt available

Must Halt Trading Immediately

[ ] execution state is ambiguous
[ ] risk manager unavailable
[ ] duplicate live order suspected
[ ] data quality severely degraded
[ ] broker positions do not match internal positions

Safe To Resume After Incident

[ ] incident cause understood
[ ] broker state reconciled
[ ] internal state repaired
[ ] risk controls verified
[ ] operator decision recorded

24. Recommended Next Companion Documents

After this runbook, the next useful supporting documents are:
- Incident Response Playbook
- Production Readiness Review
- Live Trading Go-Live Checklist
- Trade Review Template