Explainable Dispatch Optimization Engine Blueprint
A dispatch optimizer has a strange trust boundary. The solver can search a plan space faster than a human dispatcher, but it cannot become the operational record by itself. The record has to know who accepted work, what was frozen, which technician was rejected, and whether a replay says the new plan beat the old habit.
That boundary shaped this architecture more than the choice of OR-Tools. The system is one Scala 3.6.4 Akka HTTP application with PostgreSQL 16, Redis 7, and an OR-Tools 9.11 solver adapter. It is not a mesh of tiny services. The core dispatch loop stays in one process so tenant scope, plan locks, explanation storage, replay, approvals, and UI routes share the same operational contract.
System topology
The HTTP layer composes authentication, imports, plan solving, operational data, replay, health, metrics, and server-rendered UI routes. The state layer uses relational tenant tables for operators, technicians, skills, jobs, plans, assignments, overrides, approvals, replay runs, outbox events, and objective settings. JSONB snapshots hold the data that must remain reviewable after a plan changes.
The core dispatch loop
A solve starts by loading tenant input into a constraint model. The builder rejects missing capability, frozen assignment conflicts, capacity overflow when overtime is not allowed, and work outside the planning window. Rejections are not discarded. They become explanation material for dispatchers and supervisors.
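The rejection step can be sketched as a pure function that returns reasons instead of dropping a pairing silently. This is a minimal illustration, not the application's builder: `Rejection`, `Job`, `Technician`, `Window`, and `check` are hypothetical names, and the capacity rule is simplified to remaining minutes per technician.

```scala
import java.time.Instant

// Hypothetical types for illustration; none of these names come from the real codebase.
enum Rejection:
  case MissingCapability(jobId: String, techId: String, skill: String)
  case FrozenConflict(jobId: String, techId: String, frozenTechId: String)
  case CapacityOverflow(jobId: String, techId: String, overflowMinutes: Int)
  case OutsideWindow(jobId: String)

final case class Job(id: String, skill: String, start: Instant, durationMinutes: Int, frozenTo: Option[String])
final case class Technician(id: String, skills: Set[String], remainingMinutes: Int)
final case class Window(from: Instant, until: Instant)

// Returns every reason the pairing is infeasible; an empty list means the pairing may enter the model.
def check(job: Job, tech: Technician, window: Window, overtimeAllowed: Boolean): List[Rejection] =
  val end = job.start.plusSeconds(job.durationMinutes * 60L)
  List(
    Option.when(!tech.skills.contains(job.skill))(
      Rejection.MissingCapability(job.id, tech.id, job.skill)),
    // A job frozen to another technician must not be moved by the solver.
    job.frozenTo.filter(_ != tech.id).map(f => Rejection.FrozenConflict(job.id, tech.id, f)),
    Option.when(!overtimeAllowed && job.durationMinutes > tech.remainingMinutes)(
      Rejection.CapacityOverflow(job.id, tech.id, job.durationMinutes - tech.remainingMinutes)),
    Option.when(job.start.isBefore(window.from) || end.isAfter(window.until))(
      Rejection.OutsideWindow(job.id))
  ).flatten
```

Because each reason is a value rather than a log line, the same list can be stored with the plan and rendered to a dispatcher later.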
OR-Tools sits behind SolverAdapter. The adapter records whether OR-Tools ran, whether fallback happened, which status returned, and how long the solve took. If the native runtime fails or a timeout lands, deterministic fallback can still produce a partial plan with explicit unscheduled reasons.
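A rough sketch of that adapter shape follows. The names `PrimarySolver`, `Plan`, `SolveTrace`, `greedyFallback`, and the status strings are assumptions made for illustration; the real adapter's interface is not shown in this post. The sketch also assumes the primary solver enforces its own time limit and returns `None` when it finds nothing in time, and it makes no OR-Tools calls at all.

```scala
import scala.util.control.NonFatal

// Hypothetical shapes; job and technician identifiers are plain strings here.
final case class Plan(assignments: Map[String, String], unscheduled: Map[String, String])
final case class SolveTrace(orToolsRan: Boolean, fallbackUsed: Boolean, status: String, elapsedMillis: Long)

trait PrimarySolver:
  // Assumed contract: honors the time limit itself and returns None on timeout.
  def solve(jobs: List[String], timeLimitMillis: Long): Option[Plan]

// Deterministic fallback: sort jobs, fill technicians round-robin up to a fixed
// capacity, and give every leftover job an explicit unscheduled reason.
def greedyFallback(jobs: List[String], techs: List[String], perTech: Int): Plan =
  val (placed, rest) = jobs.sorted.splitAt(techs.size * perTech)
  val assignments = placed.zipWithIndex.map((job, i) => job -> techs(i % techs.size)).toMap
  Plan(assignments, rest.map(_ -> "no remaining capacity in fallback").toMap)

final class SolverAdapter(primary: PrimarySolver, techs: List[String], perTech: Int):
  def solve(jobs: List[String], timeLimitMillis: Long): (Plan, SolveTrace) =
    val started = System.nanoTime()
    def elapsed: Long = (System.nanoTime() - started) / 1000000L
    val attempt: Either[String, Plan] =
      try primary.solve(jobs, timeLimitMillis).toRight("TIMEOUT")
      catch
        // LinkageError covers a native library that fails to load.
        case NonFatal(_) | _: LinkageError => Left("PRIMARY_FAILED")
    attempt match
      case Right(plan) =>
        (plan, SolveTrace(orToolsRan = true, fallbackUsed = false, status = "OK", elapsedMillis = elapsed))
      case Left(status) =>
        val plan = greedyFallback(jobs, techs, perTech)
        // In this sketch a timeout counts as "ran"; a thrown failure does not.
        (plan, SolveTrace(orToolsRan = status == "TIMEOUT", fallbackUsed = true, status = status, elapsedMillis = elapsed))
```

The trace is returned next to the plan on every path, so a caller cannot obtain a plan without also receiving the evidence of how it was produced.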
Data model
The schema is tenant-scoped by default. That matters because dispatch data is not only operational data. It is permissioned evidence: which dispatcher accepted a plan, which supervisor approved overtime, which auditor can inspect the change history, and which replay run compared the same input snapshot against the baseline.
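One way to picture tenant scoping at the code level is a store with no unscoped read path. The sketch below uses an in-memory list purely for illustration; `TenantId`, `PlanId`, `PlanRecord`, and `PlanStore` are hypothetical names, and the real records live in PostgreSQL tables.

```scala
// Hypothetical identifiers and store, for illustration only.
final case class TenantId(value: String)
final case class PlanId(value: String)
final case class PlanRecord(tenant: TenantId, id: PlanId, acceptedBy: Option[String], inputSnapshotJson: String)

final class PlanStore(records: List[PlanRecord]):
  // Every lookup takes the tenant explicitly, so a plan id alone is never enough.
  def find(tenant: TenantId, id: PlanId): Option[PlanRecord] =
    records.find(r => r.tenant == tenant && r.id == id)

  // Accepted plans are the evidence trail: who accepted which plan, per tenant.
  def acceptedPlans(tenant: TenantId): List[PlanRecord] =
    records.filter(r => r.tenant == tenant && r.acceptedBy.isDefined)
```

Wrapping the identifiers in their own types also keeps a tenant id from being passed where a plan id is expected.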
Infrastructure decisions
- Scala 3 with Akka HTTP, not a heavier web framework. The API surface is small, the domain model benefits from typed JVM code, and route composition stays close to the dispatch core.
- PostgreSQL with JSONB snapshots, not document-only storage. Tenant ownership and foreign keys belong in relations, while plan inputs, solver traces, replay metrics, and approval payloads need immutable snapshots.
- Redis-first planning locks with database fallback, not Redis as a hard runtime dependency. Planning windows need cross-process protection, but local standalone mode should still work if Redis is absent.
- OR-Tools behind SolverAdapter, not direct solver calls inside domain services. The adapter keeps deterministic fallback, timeout metadata, solver traces, and testable replay behavior visible.
- Feature-flagged Notification Hub and Workflow Engine, not mandatory integration services. Dispatch planning should run even when outbound notifications or approval workflow calls are disabled.
- Replay against a greedy baseline, not trust by inspection. A dispatcher can compare SLA hit rate, travel, overtime, churn, unscheduled jobs, and solve time before trusting a new plan.
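The replay comparison in the last decision can be sketched as a per-metric delta between the greedy baseline and the candidate plan. `ReplayMetrics`, `MetricDelta`, and `compare` are hypothetical names, and the units (minutes, milliseconds, counts) are assumptions chosen to keep the example concrete.

```scala
// Hypothetical metric record; fields mirror the metrics named above.
final case class ReplayMetrics(
  slaHitRate: Double,
  travelMinutes: Int,
  overtimeMinutes: Int,
  churnedAssignments: Int,
  unscheduledJobs: Int,
  solveMillis: Long)

final case class MetricDelta(name: String, baseline: Double, candidate: Double, higherIsBetter: Boolean):
  def improved: Boolean =
    if higherIsBetter then candidate > baseline else candidate < baseline
  def regressed: Boolean =
    if higherIsBetter then candidate < baseline else candidate > baseline

// Only SLA hit rate is better when higher; every other metric is a cost.
def compare(b: ReplayMetrics, c: ReplayMetrics): List[MetricDelta] =
  List(
    MetricDelta("slaHitRate", b.slaHitRate, c.slaHitRate, higherIsBetter = true),
    MetricDelta("travelMinutes", b.travelMinutes.toDouble, c.travelMinutes.toDouble, higherIsBetter = false),
    MetricDelta("overtimeMinutes", b.overtimeMinutes.toDouble, c.overtimeMinutes.toDouble, higherIsBetter = false),
    MetricDelta("churnedAssignments", b.churnedAssignments.toDouble, c.churnedAssignments.toDouble, higherIsBetter = false),
    MetricDelta("unscheduledJobs", b.unscheduledJobs.toDouble, c.unscheduledJobs.toDouble, higherIsBetter = false),
    MetricDelta("solveMillis", b.solveMillis.toDouble, c.solveMillis.toDouble, higherIsBetter = false)
  )
```

The sketch deliberately does not collapse the deltas into one score. A plan that wins on SLA but churns many accepted assignments is a trade-off a dispatcher should see, not a number to be averaged away.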
Operating constraints
The configured solver boundary accepts up to 500 jobs, with a normal timeout of 90 seconds and an emergency replan timeout of 30 seconds. Replay windows are capped at 31 days. Import validation caps each CSV source file at a configured limit of 50,000 rows and 5,242,880 bytes (5 MiB). Those numbers are capacity boundaries and benchmark targets, not claims about live customer volume.
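These boundaries are easiest to reason about when they are checked in one place. The sketch below encodes the numbers from this section as defaults; `Limits` and the three functions are hypothetical names, and treating the 31-day cap as a difference between two calendar dates is an assumption about how the window is measured.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Defaults are the limits stated above; the type itself is hypothetical.
final case class Limits(
  maxJobs: Int = 500,
  normalTimeoutSeconds: Int = 90,
  emergencyTimeoutSeconds: Int = 30,
  maxReplayDays: Long = 31,
  maxCsvRows: Int = 50000,
  maxCsvBytes: Long = 5242880L)

// Rejects oversized solves up front and picks the timeout for the request type.
def solveTimeoutSeconds(limits: Limits, jobCount: Int, emergency: Boolean): Either[String, Int] =
  if jobCount > limits.maxJobs then Left(s"job count $jobCount exceeds limit ${limits.maxJobs}")
  else Right(if emergency then limits.emergencyTimeoutSeconds else limits.normalTimeoutSeconds)

def checkImport(limits: Limits, rows: Int, bytes: Long): Either[String, Unit] =
  if rows > limits.maxCsvRows then Left(s"row count $rows exceeds limit ${limits.maxCsvRows}")
  else if bytes > limits.maxCsvBytes then Left(s"file size $bytes exceeds limit ${limits.maxCsvBytes} bytes")
  else Right(())

// Assumption: the window length is the number of days between the two dates.
def checkReplayWindow(limits: Limits, from: LocalDate, to: LocalDate): Either[String, Long] =
  val days = ChronoUnit.DAYS.between(from, to)
  if days < 0 then Left("replay window ends before it starts")
  else if days > limits.maxReplayDays then Left(s"replay window of $days days exceeds limit ${limits.maxReplayDays}")
  else Right(days)
```

Returning `Either` keeps the rejection message as data, in the same spirit as the solver's explicit unscheduled reasons.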
Database access is kept small as well: Hikari is capped at 10 connections with 1 idle connection. Redis lock TTLs are clamped between 60 and 86,400 seconds. The outbox publisher pulls pending rows in pages of 100. The point is not maximum throughput. The point is bounded behavior that can be inspected when a dispatch board is under pressure.
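Two of those bounds are small enough to show directly. This is a sketch with hypothetical helper names (`clampTtlSeconds`, `publishInPages`); the in-memory iterator stands in for pending outbox rows and no Redis or database call is made.

```scala
// Clamp a requested lock TTL into the allowed range of 60 to 86,400 seconds.
def clampTtlSeconds(requested: Long): Long =
  math.max(60L, math.min(86400L, requested))

// Publish pending items in fixed-size pages and return how many were handed off.
// `publish` receives one page at a time, never the whole backlog.
def publishInPages[A](pending: Iterator[A], publish: List[A] => Unit, pageSize: Int = 100): Int =
  pending.grouped(pageSize).map { page =>
    publish(page.toList)
    page.size
  }.sum
```

Clamping rather than rejecting means a misconfigured TTL degrades to the nearest safe value instead of blocking a planning window.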
Observability
The application exposes health, database readiness, application readiness, and metrics surfaces. Metrics text includes solver duration and status, unscheduled jobs, SLA-risk jobs, outbox publish duration, workflow approval status, and replay duration. Logs include tenant, request, plan, and replay identifiers so a dispatcher's complaint can be followed from HTTP request to solver trace.
Readiness does not pretend optional integrations are required. If the Notification Hub or Workflow Engine is disabled, readiness reports that state instead of making the dispatch core fail. Solver trace metadata lives with the plan so the system can explain not just the assignment, but the path that produced it.
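The readiness rule can be sketched with a three-valued component state, so that "disabled" is reported as its own fact rather than folded into "down". `ComponentState`, `Readiness`, `optional`, and `readiness` are hypothetical names. How an enabled but unhealthy integration should affect readiness is not stated here, so the sketch assumes it counts as not ready.

```scala
enum ComponentState:
  case Up, Down, Disabled

final case class Readiness(ready: Boolean, components: Map[String, ComponentState])

// The health probe is passed by name so it is never evaluated for a disabled integration.
def optional(enabled: Boolean, healthy: => Boolean): ComponentState =
  if !enabled then ComponentState.Disabled
  else if healthy then ComponentState.Up
  else ComponentState.Down

// Assumption: any Down component blocks readiness; Disabled never does.
def readiness(databaseUp: Boolean, notificationHub: ComponentState, workflowEngine: ComponentState): Readiness =
  val components = Map(
    "database" -> (if databaseUp then ComponentState.Up else ComponentState.Down),
    "notificationHub" -> notificationHub,
    "workflowEngine" -> workflowEngine)
  Readiness(ready = !components.values.exists(_ == ComponentState.Down), components = components)
```

A standalone deployment with both integrations switched off then reports ready, with the disabled state visible in the payload for anyone reading the probe output.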
Why the boundary matters
The most dangerous version of this system would be a beautiful optimization demo that silently moves accepted work. This architecture treats the solver as one bounded component inside a tenant-scoped operating record. It can recommend, score, reject, and replay. It cannot erase the fact that a human already accepted the board.