Last Updated: 2026-05-03
Owner: Ops-Dev
Summary: Day-to-day operating guide for running QuantMatrix safely across startup, monitoring, incidents, and shutdown.

QuantMatrix Operations Runbook

This runbook is the operating manual for QuantMatrix.

Its purpose is to help an operator or engineer:

  • start the platform safely
  • verify that it is healthy before trading begins
  • monitor it during the session
  • respond to incidents calmly and consistently
  • shut it down safely
  • reconcile state after issues

This runbook is intentionally practical. Use it alongside the architecture, implementation, analytics, and checklist documents.

1. Runbook Scope

This runbook covers:

  • local development operation
  • paper trading operation
  • live trading operation
  • startup and shutdown procedures
  • health checks
  • broker connectivity issues
  • risk violations
  • order and position reconciliation
  • emergency halt and liquidation
  • post-session review

This runbook does not replace:

  • broker compliance obligations
  • credential management policy
  • deployment infrastructure guides

2. Environments

QuantMatrix should operate in clearly separated environments:

Local Development

  • Purpose: UI development, backend integration, dry-run workflows
  • Data sources: demo feed or sandbox data
  • Execution: dry-run only
  • Risk: no real capital

Paper Trading

  • Purpose: production-like testing using broker paper accounts
  • Data sources: real or paper-compatible market data
  • Execution: broker paper environment only
  • Risk: no real capital, but operational behavior must match live

Live Trading

  • Purpose: real trading with real capital
  • Data sources: approved live market data provider
  • Execution: live broker account
  • Risk: real financial and operational exposure

Backtesting

  • Purpose: historical strategy validation
  • Data sources: historical data
  • Execution: simulated only
  • Risk: no live broker interaction should be possible
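
One way to make this separation enforceable rather than a convention is a small guard that ties each environment to the only execution mode it may use. The sketch below is illustrative: `Environment`, `ExecutionMode`, `ALLOWED_EXECUTION`, and `assert_execution_allowed` are hypothetical names, not part of QuantMatrix, and the mapping simply restates the descriptions above.

```python
from enum import Enum


class Environment(Enum):
    LOCAL = "local"
    PAPER = "paper"
    LIVE = "live"
    BACKTEST = "backtest"


class ExecutionMode(Enum):
    DRY_RUN = "dry_run"
    BROKER_PAPER = "broker_paper"
    BROKER_LIVE = "broker_live"
    SIMULATED = "simulated"


# Hypothetical mapping derived from the environment descriptions above.
ALLOWED_EXECUTION = {
    Environment.LOCAL: ExecutionMode.DRY_RUN,
    Environment.PAPER: ExecutionMode.BROKER_PAPER,
    Environment.LIVE: ExecutionMode.BROKER_LIVE,
    Environment.BACKTEST: ExecutionMode.SIMULATED,
}


def assert_execution_allowed(env: Environment, mode: ExecutionMode) -> None:
    """Raise if the requested execution mode does not match the environment."""
    expected = ALLOWED_EXECUTION[env]
    if mode is not expected:
        raise RuntimeError(
            f"{env.value} environment requires {expected.value}, got {mode.value}"
        )
```

Calling a check like this once at startup, before any broker client is constructed, would make a backtest process fail fast if it were ever configured with a live broker.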

3. Roles and Responsibilities

Operator

  • starts and monitors the system
  • verifies pre-market readiness
  • watches health, orders, positions, and risk state
  • triggers emergency halt if required

Engineer

  • investigates incidents
  • fixes system defects
  • performs reconciliation and recovery support
  • maintains logs, alerts, and runbook accuracy

Strategy Owner

  • owns strategy configuration
  • reviews analytics and trade outcomes
  • approves strategy parameter changes

4. Normal Operating Lifecycle

Every trading day should follow this sequence:

  1. Pre-start checks
  2. System startup
  3. Health verification
  4. Pre-market validation
  5. Market-session monitoring
  6. Incident handling if needed
  7. End-of-session shutdown
  8. Post-session reconciliation
  9. Trade review and analytics review

5. Pre-Start Checklist

Before starting QuantMatrix:

Environment Check

  • [ ] Confirm correct environment: local, paper, live, or backtest
  • [ ] Confirm correct broker credentials loaded
  • [ ] Confirm correct market data provider configured
  • [ ] Confirm correct execution broker configured
  • [ ] Confirm trading mode displayed correctly in UI

Safety Check

  • [ ] Confirm live trading is intentionally enabled, not accidental
  • [ ] Confirm max daily loss value is correct
  • [ ] Confirm max position size is correct
  • [ ] Confirm max trades per day is correct
  • [ ] Confirm blocked symbols list is loaded

System Check

  • [ ] Redis reachable
  • [ ] PostgreSQL reachable
  • [ ] API service starts cleanly
  • [ ] Background jobs enabled as expected
  • [ ] No stuck processes from prior session

Data Check

  • [ ] Market data provider latency acceptable
  • [ ] Clock synchronization acceptable
  • [ ] Momentum Radar timing configured to minute boundary
  • [ ] Historical snapshot retention rules active
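
For the minute-boundary item, the arithmetic a scheduler needs is small. The sketch below assumes the Momentum Radar runs once per minute on the wall clock; the function names are hypothetical and the real scheduling mechanism may differ.

```python
from datetime import datetime, timedelta, timezone


def next_minute_boundary(now: datetime) -> datetime:
    """Return the first whole-minute instant strictly after `now`."""
    return now.replace(second=0, microsecond=0) + timedelta(minutes=1)


def seconds_until_next_minute(now: datetime) -> float:
    """Seconds to sleep so the next run starts on a minute boundary."""
    return (next_minute_boundary(now) - now).total_seconds()


# Example: 13:29:45.5 UTC is 14.5 seconds before the 13:30:00 boundary.
example = datetime(2026, 5, 4, 13, 29, 45, 500000, tzinfo=timezone.utc)
wait = seconds_until_next_minute(example)
```

A run that starts exactly on a boundary is scheduled for the following minute, so this never returns zero. The result is only as good as the host clock, which is why the clock synchronization check sits in the same list.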

6. Startup Procedure

Start the platform in this order:

  1. Configuration and secrets loader
  2. PostgreSQL connection and migrations
  3. Redis connection
  4. Core API service
  5. Market data ingestion service
  6. Momentum Radar service
  7. Opportunity Scanner
  8. Strategy Allocator
  9. Risk Manager
  10. Order Execution Service
  11. UI

Startup Verification

After startup, confirm:

  • [ ] API health endpoint reports healthy
  • [ ] UI loads correctly
  • [ ] Account summary loads
  • [ ] System Health panel shows all required services
  • [ ] Data broker is connected
  • [ ] Execution broker is connected
  • [ ] Risk manager is active
  • [ ] Scanner is active
  • [ ] No unexpected open orders loaded from previous session

If any critical dependency fails, do not proceed to active trading.
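
The ordered startup with a hard stop on failure can be sketched as a simple loop. This is a minimal illustration, not the platform's launcher: `start_in_order` and the idea that each service exposes a start function returning `True` on success are assumptions.

```python
from typing import Callable, List, Tuple

# Each step pairs a service name with a start function that returns True
# on success. The start functions here are placeholders (assumptions).
StartStep = Tuple[str, Callable[[], bool]]


def start_in_order(steps: List[StartStep]) -> List[str]:
    """Start each step in sequence and stop at the first failure.

    Returns the names of the steps that started. Raises RuntimeError naming
    the failed step so that later services, including order execution,
    are never started on top of a broken dependency.
    """
    started: List[str] = []
    for name, start in steps:
        if not start():
            raise RuntimeError(f"startup aborted: {name} failed after {started}")
        started.append(name)
    return started
```

Because the Order Execution Service is late in the list, any failure in an earlier step leaves the platform unable to submit orders, which matches the rule above.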

7. Health Verification

The following components must expose health:

  • API/backend gateway
  • Redis
  • PostgreSQL
  • Market data provider connection
  • Execution broker connection
  • Opportunity Scanner
  • Strategy Allocator
  • Risk Manager
  • Order Execution Service
  • Trade analytics pipeline

Health Status Expectations

Each component should show:

  • status: healthy / degraded / failed
  • latency
  • last successful activity timestamp
  • error detail if degraded or failed

Critical Health Conditions

Do not allow trading if:

  • execution broker is disconnected
  • risk manager is unavailable
  • account summary cannot be loaded
  • position reconciliation fails
  • clock synchronization is materially wrong
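
A health record and trading gate along these lines might look like the sketch below. It covers only the first two conditions (execution broker and risk manager); the component keys, the `ComponentHealth` fields, and the choice to treat anything other than healthy as a blocker are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class ComponentHealth:
    status: Status
    latency_ms: float
    last_success: datetime
    error: Optional[str] = None


# Assumed component keys; the real names depend on the health API.
TRADING_CRITICAL = ("execution_broker", "risk_manager")


def trading_blockers(health: Dict[str, ComponentHealth]) -> List[str]:
    """Return reasons trading must not be allowed; an empty list means none found.

    Conservative assumption: a missing report or any non-healthy status on a
    trading-critical component blocks trading.
    """
    reasons = []
    for name in TRADING_CRITICAL:
        component = health.get(name)
        if component is None:
            reasons.append(f"{name}: no health report")
        elif component.status is not Status.HEALTHY:
            reasons.append(f"{name}: {component.status.value} ({component.error})")
    return reasons
```

Returning reasons instead of a boolean lets the UI show the operator why trading is blocked. The remaining conditions (account summary, reconciliation, clock drift) would need their own checks.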

8. Pre-Market Validation

Before market open:

  • [ ] Confirm account buying power is correct
  • [ ] Confirm equity and cash are correct
  • [ ] Confirm no unexpected positions are open
  • [ ] Confirm no unexpected orders are resting
  • [ ] Confirm blocked list is correct
  • [ ] Confirm scanner settings are correct
  • [ ] Confirm strategy parameters are correct
  • [ ] Confirm alerts are enabled
  • [ ] Confirm emergency halt control is available

If the system supports scheduled scanning before market open, confirm it is using the intended time window.

9. In-Session Monitoring

During market hours, monitor:

Command Center

  • account summary cards
  • system health widget
  • Momentum Radar
  • Active Strategy Watchlist
  • Active Positions

Orders

  • new orders
  • rejected orders
  • partial fills
  • cancels
  • unexpected duplicates

Risk

  • daily loss progression
  • trade count
  • position sizing compliance
  • blocked or halted state

Analytics Signals

  • unusual slippage
  • repeated failed entries
  • systematic late exits
  • unusual concentration in one symbol or strategy

10. Routine Operator Actions

If a symbol should be excluded

  • use the Block action
  • choose session-only or permanent block
  • verify it no longer enters the scanner/watchlist flow

If a strategy is stuck

  • inspect polling health
  • verify whether position exists
  • use Restart only if there is no unsafe state transition
  • if position exists, check whether strategy ownership is preserved before restarting logic

If a watchlist candidate is no longer wanted

  • stop the strategy or close the candidate
  • verify lifecycle timestamp updates correctly

11. Risk Events

Max Daily Loss Reached

Expected behavior:

  • trading is halted
  • new entries are blocked
  • risk violation is logged
  • UI clearly shows halted state

Operator actions:

  1. Confirm halt occurred
  2. Confirm no new buy orders are being accepted
  3. Decide whether open positions should continue under exit logic or be manually liquidated
  4. Record incident for review

Max Position Size Violation

Expected behavior:

  • order blocked before broker submission
  • violation reason persisted

Operator actions:

  1. Confirm no oversized order was submitted
  2. Review sizing logic
  3. Verify strategy config

Max Trades Per Day Reached

Expected behavior:

  • additional entries blocked
  • positions already open continue under exit logic

Operator actions:

  1. Confirm trade cap triggered as expected
  2. Confirm no new entries are being created
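
The three risk events in this section correspond to checks that run before an entry order reaches the broker. The sketch below shows one possible shape for that check. `RiskLimits`, `RiskState`, `check_entry`, and the reason strings are hypothetical, and it assumes daily loss is tracked as a single signed P&L figure and position size is measured in shares.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskLimits:
    max_daily_loss: float      # positive number, account currency
    max_position_size: int     # shares (assumed unit)
    max_trades_per_day: int


@dataclass
class RiskState:
    pnl_today: float           # negative when the day is losing
    trades_today: int


def check_entry(limits: RiskLimits, state: RiskState, quantity: int) -> Optional[str]:
    """Return a violation reason for a proposed entry, or None if it may proceed."""
    if state.pnl_today <= -limits.max_daily_loss:
        return "max_daily_loss_reached"
    if state.trades_today >= limits.max_trades_per_day:
        return "max_trades_per_day_reached"
    if quantity > limits.max_position_size:
        return "max_position_size_violation"
    return None
```

Only entries pass through this gate. Exit orders for positions that are already open are deliberately not checked here, so they continue under exit logic as described above.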

12. Broker Connectivity Incidents

Market Data Broker Disconnect

Expected behavior:

  • reconnect attempts begin automatically
  • status becomes degraded
  • scanner and radar behavior degrade safely

Operator actions:

  1. Confirm disconnect in health panel
  2. Confirm reconnect attempts are happening
  3. Pause new automated entries if data quality is uncertain
  4. If disconnect persists, halt trading

Execution Broker Disconnect

Expected behavior:

  • no new orders are submitted
  • status becomes critical
  • risk and UI reflect execution outage

Operator actions:

  1. Immediately stop trusting automated entry flow
  2. Verify current positions from broker directly if possible
  3. Halt new strategy entries
  4. Reconcile all open orders and positions once connection returns

Broker API Errors or Rate Limits

Expected behavior:

  • retries with backoff
  • idempotent order handling
  • alert on repeated failure

Operator actions:

  1. Check whether retries are succeeding
  2. Confirm duplicate orders are not created
  3. Reduce load or halt trading if instability continues
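
Retries with backoff are only safe when every attempt carries the same idempotency key, so a repeated submission can be recognized and discarded. The sketch below illustrates that pairing. `TransientBrokerError`, `submit_with_retry`, and the `submit(key)` callable are hypothetical, and whether a given broker honors client-supplied keys must be confirmed against its API.

```python
import time
import uuid
from typing import Callable, Optional


class TransientBrokerError(Exception):
    """Placeholder for a retryable broker failure such as a rate limit."""


def submit_with_retry(
    submit: Callable[[str], str],
    idempotency_key: Optional[str] = None,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call submit(key) until it succeeds or attempts run out.

    The same idempotency key is sent on every attempt so that a broker or
    order service that honors it can discard repeats.
    """
    key = idempotency_key or uuid.uuid4().hex
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(key)
        except TransientBrokerError:
            if attempt == max_attempts:
                raise  # caller should alert on repeated failure
            # Exponential backoff: 0.5s, 1s, 2s with the default base_delay.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so tests can run without waiting. Only errors the caller classifies as transient are retried; anything else propagates immediately, which avoids blindly resubmitting a rejected order.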

13. Order Incidents

Order Rejected

Operator actions:

  1. Check rejection reason
  2. Determine whether rejection is due to risk, broker rules, invalid size, or market state
  3. Confirm strategy does not loop and resubmit blindly
  4. Log for post-session analysis

Partial Fill

Operator actions:

  1. Verify OMS state reflects partial quantity
  2. Verify position reflects actual filled shares
  3. Confirm exit logic uses actual position size, not intended size

Duplicate Order Suspicion

Operator actions:

  1. Check idempotency key
  2. Check broker order history
  3. Compare internal order state with broker state
  4. Halt affected strategy if state is ambiguous

14. Position Reconciliation

Position reconciliation should happen:

  • on startup
  • after broker reconnect
  • after order incident
  • at end of session

Reconciliation Procedure

  1. Pull current positions from execution broker
  2. Compare against internal positions table/state
  3. Compare expected quantity, average price, and side
  4. Compare resting orders
  5. Repair discrepancies using reconciliation workflow
  6. Record reconciliation event

If Mismatch Exists

  • do not assume internal state is correct
  • treat broker-confirmed state as authoritative for current live exposure
  • repair internal state carefully and log the adjustment
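
Steps 2 and 3 of the reconciliation procedure amount to a keyed comparison between two position maps. The sketch below is one minimal way to express it. `Position`, `find_mismatches`, the signed-quantity encoding of side, and the price tolerance are assumptions, and the symbols in the usage note are sample data only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Position:
    quantity: int        # signed: positive is long, negative is short (assumed)
    avg_price: float


def find_mismatches(
    broker: Dict[str, Position],
    internal: Dict[str, Position],
    price_tolerance: float = 0.01,
) -> List[str]:
    """Describe every symbol where internal state disagrees with the broker."""
    mismatches = []
    for symbol in sorted(set(broker) | set(internal)):
        b, i = broker.get(symbol), internal.get(symbol)
        if b is None:
            mismatches.append(f"{symbol}: internal only ({i.quantity})")
        elif i is None:
            mismatches.append(f"{symbol}: broker only ({b.quantity})")
        elif b.quantity != i.quantity:
            mismatches.append(
                f"{symbol}: quantity broker={b.quantity} internal={i.quantity}"
            )
        elif abs(b.avg_price - i.avg_price) > price_tolerance:
            mismatches.append(
                f"{symbol}: avg price broker={b.avg_price} internal={i.avg_price}"
            )
    return mismatches
```

For example, with broker positions `{"AAPL": Position(100, 190.0)}` and internal positions `{"AAPL": Position(80, 190.0)}`, the function reports a quantity mismatch for AAPL. The function only reports differences. Repairing internal state toward the broker-confirmed values, and logging that adjustment, stays a separate and deliberate step.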

15. Emergency Liquidate and Halt

Use this when:

  • system state is inconsistent
  • execution behavior is unsafe
  • market data is unreliable
  • repeated broker failures are occurring
  • risk controls appear compromised

Expected System Behavior

  • cancel open entry orders if possible
  • submit liquidate-all for open positions
  • halt further automated trading
  • persist emergency event
  • show halted state in UI

Operator Procedure

  1. Trigger Emergency Liquidate & Halt
  2. Confirm command accepted
  3. Monitor broker for cancels and liquidations
  4. Confirm positions go to zero
  5. Confirm system remains halted
  6. Do not resume trading until reconciliation is complete
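
The expected system behavior can be sketched as a short sequence. This is an illustration under stated assumptions, not the platform's implementation: the `Broker` protocol, its two methods, and `TradingState` are hypothetical. The sketch also sets the halt flag first, a design choice made here so that no new entry can race with the liquidation.

```python
from typing import List, Protocol


class Broker(Protocol):
    """Hypothetical broker interface; method names are assumptions."""

    def cancel_open_entry_orders(self) -> int: ...
    def liquidate_all(self) -> int: ...


class TradingState:
    def __init__(self) -> None:
        self.halted = False
        self.events: List[str] = []  # stands in for a persisted event log


def emergency_liquidate_and_halt(broker: Broker, state: TradingState) -> None:
    """Halt automated trading, cancel entries if possible, then flatten."""
    state.halted = True
    state.events.append("emergency_halt_triggered")
    try:
        cancelled = broker.cancel_open_entry_orders()
        state.events.append(f"cancelled_entry_orders={cancelled}")
    except Exception as exc:  # cancelling is best effort ("if possible")
        state.events.append(f"cancel_failed={exc}")
    submitted = broker.liquidate_all()
    state.events.append(f"liquidation_orders_submitted={submitted}")
```

A failed cancel does not stop the liquidation, but a failed liquidation is allowed to raise so it surfaces as an emergency halt failure. The halted flag is never cleared here, because resuming is a separate operator decision made after reconciliation.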

16. Shutdown Procedure

At end of session:

  1. Stop new candidate intake
  2. Allow open workflows to settle or manually close according to policy
  3. Confirm order state stable
  4. Confirm positions state stable
  5. Flush final analytics events
  6. Persist session summaries
  7. Stop background services in safe order

Recommended stop order:

  1. Scanner
  2. Strategy Allocator
  3. Entry logic
  4. Exit logic after position resolution
  5. Market data ingestion
  6. Order execution service
  7. API/UI if needed

17. Post-Session Reconciliation

At end of session, confirm:

  • [ ] orders match broker
  • [ ] executions match broker
  • [ ] positions are flat if expected
  • [ ] realized P&L matches broker statement or account summary
  • [ ] blocked list updates persisted as expected
  • [ ] analytics pipeline closed trades correctly
  • [ ] recommendation generation completed if scheduled

18. Post-Session Review

Review at minimum:

  • winners and losers
  • rejected orders
  • partial fills
  • slippage outliers
  • strategies with poor behavior
  • risk blocks
  • symbols frequently stopped out
  • time-of-day weakness patterns

Store:

  • incident notes
  • operator notes
  • manual overrides
  • candidate improvements for next session

19. Recovery After Crash or Restart

If the platform crashes or is restarted during a session:

  1. Bring core services back carefully
  2. Reconnect to Redis and PostgreSQL
  3. Pull broker account, orders, and positions
  4. Rebuild live state from broker plus persistent records
  5. Replay Redis events or durable events if supported
  6. Mark recovered session clearly in logs
  7. Do not resume automated entries until reconciliation passes

Recovery Acceptance Criteria

  • account summary correct
  • open positions correct
  • open orders correct
  • watchlist state repaired or safely cleared
  • risk state restored
  • analytics event continuity preserved

20. Alerts and Escalation

Alerts should exist for:

  • execution broker disconnect
  • market data broker disconnect
  • failed order placement
  • repeated order rejection
  • max daily loss breach
  • emergency halt triggered
  • reconciliation mismatch
  • analytics pipeline failure

Escalation Priorities

Critical

  • live position mismatch
  • execution broker outage with open positions
  • emergency halt failure
  • duplicate order with real exposure

High

  • repeated rejected orders
  • data outage during active entry logic
  • missing risk enforcement

Medium

  • analytics lag
  • delayed dashboard updates
  • scanner timing drift
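
If alert routing is driven from code or configuration, the priorities above can be captured as a lookup table. The sketch below is illustrative: the alert keys are made-up slugs for the listed items, and defaulting unknown alerts to High is an assumption, chosen so that an unclassified alert is not quietly treated as low priority.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3


# Alert keys are illustrative slugs for the items listed above.
ESCALATION = {
    "live_position_mismatch": Priority.CRITICAL,
    "execution_broker_outage_with_open_positions": Priority.CRITICAL,
    "emergency_halt_failure": Priority.CRITICAL,
    "duplicate_order_with_real_exposure": Priority.CRITICAL,
    "repeated_rejected_orders": Priority.HIGH,
    "data_outage_during_active_entry_logic": Priority.HIGH,
    "missing_risk_enforcement": Priority.HIGH,
    "analytics_lag": Priority.MEDIUM,
    "delayed_dashboard_updates": Priority.MEDIUM,
    "scanner_timing_drift": Priority.MEDIUM,
}


def priority_for(alert_key: str) -> Priority:
    """Look up an alert's priority; unknown alerts default to HIGH (an assumption)."""
    return ESCALATION.get(alert_key, Priority.HIGH)
```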

21. Logging Requirements

At minimum log:

  • startup and shutdown events
  • configuration mode
  • broker connection changes
  • scanner runs
  • strategy assignments
  • signals
  • risk decisions
  • order commands
  • broker acknowledgments
  • fills
  • reconciliation events
  • emergency actions
  • analytics completion events

Logs should be structured and timestamped.
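
One common way to get structured, timestamped logs in Python is a JSON formatter on the standard `logging` module. The sketch below assumes JSON lines and UTC timestamps are acceptable. The `JsonFormatter` class, the `build_logger` helper, the `quantmatrix.ops` logger name, and the `fields` convention for extra data are all illustrative choices, not an existing QuantMatrix API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC ISO 8601 timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "quantmatrix.ops") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (added only once)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A risk decision could then be logged as `build_logger().info("risk_decision", extra={"fields": {"symbol": "XYZ", "decision": "blocked"}})`, which keeps the event name stable and puts the variable details in queryable fields.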

22. Runbook Maintenance

This runbook must be updated when:

  • a new broker is added
  • a new risk rule is added
  • recovery flow changes
  • order lifecycle changes
  • analytics workflow changes
  • UI emergency controls change

Review cadence:

  • after every major release
  • after every serious incident
  • before enabling live trading changes

23. Quick Reference Checklists

Safe To Start Trading

  • [ ] all health checks green
  • [ ] account summary verified
  • [ ] no unexpected positions or orders
  • [ ] risk manager active
  • [ ] execution broker connected
  • [ ] data broker connected
  • [ ] emergency halt available

Must Halt Trading Immediately

  • [ ] execution state is ambiguous
  • [ ] risk manager unavailable
  • [ ] duplicate live order suspected
  • [ ] data quality severely degraded
  • [ ] broker positions do not match internal positions

Safe To Resume After Incident

  • [ ] incident cause understood
  • [ ] broker state reconciled
  • [ ] internal state repaired
  • [ ] risk controls verified
  • [ ] operator decision recorded

After this runbook, the next useful supporting documents are:

  • Incident Response Playbook
  • Production Readiness Review
  • Live Trading Go-Live Checklist
  • Trade Review Template