Architectural Brief: Returns & Claims Orchestration Engine
Tracking is not resolution. A parcel status can say returned or exception, but an operations team still needs an owner, a deadline, evidence, customer action, carrier export, and audit history.
This system turns delivery exceptions into claims cases without making the Delivery Gateway mandatory. It can ingest Redis tracking:events when that ecosystem edge is enabled, but the manual claim path stays complete when every connector is off.
System Topology
Infrastructure Decisions
- Compute: Spring Boot 3.4 on Java 21. Chose it over a lighter FastAPI or Node service because the product is a transactional operations console: validation, schedulers, database migrations, tenant security, metrics, and typed configuration are first-class concerns in the codebase.
- Data Layer: PostgreSQL 16 with Flyway migrations. Chose it over an ORM-only schema because tenant-scoped uniqueness, foreign keys, idempotency constraints, JSONB snapshots, and indexes carry business correctness here. `V10__create_phase3_evidence_resolution_exports.sql` and `V11__create_phase4_integration_tables.sql` are not incidental schema files. They define the operating contract.
- Event Coordination: Redis 7 Streams with consumer group `returns-claims-engine`. Chose it over Kafka because the upstream Delivery Gateway already emits `tracking:events` through Redis, and this service needs a local, low-friction consumer rather than another broker.
- Presentation: Thymeleaf with HTMX and Tailwind. Chose it over a separate React client because this is an internal operations console: queue filters, detail pages, forms, and export actions. The server owns the workflow state, so server-rendered screens reduce client-side sync problems.
- Storage: Local evidence and export storage with a resolver seam. Chose it over direct S3 in the MVP because the app must run locally and standalone. The abstraction preserves a future object-store boundary without making local operation depend on cloud credentials.
- Integrations: Delivery Gateway, Notification Hub, Workflow Engine, and Dispute Workbench are feature-flagged. Chose optional edges over hard service dependencies because manual shipment and claim workflows must keep working when the portfolio graph is not running.
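
The storage resolver seam described in the list above could look roughly like the following. This is a minimal sketch, not the project's actual code: the names `EvidenceStorage` and `LocalEvidenceStorage`, the `tenantId/fileName` key layout, and the path-escape check are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical seam: callers depend on this interface, not on a filesystem or object-store client. */
interface EvidenceStorage {
    String store(String tenantId, String fileName, byte[] content);
    byte[] load(String storageKey);
}

/** Local implementation (assumed layout: root/tenantId/fileName). */
final class LocalEvidenceStorage implements EvidenceStorage {
    private final Path root;

    LocalEvidenceStorage(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    @Override
    public String store(String tenantId, String fileName, byte[] content) {
        String key = tenantId + "/" + fileName;
        Path target = resolve(key);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key; // the key, not the absolute path, is what gets persisted
    }

    @Override
    public byte[] load(String storageKey) {
        try {
            return Files.readAllBytes(resolve(storageKey));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves a key under the root and rejects keys that would escape it. */
    private Path resolve(String key) {
        Path resolved = root.resolve(key).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("Key escapes storage root: " + key);
        }
        return resolved;
    }
}
```

Because callers only hold an opaque storage key, a later object-store implementation of the same interface could replace the local one without touching claim or export code.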
Constraints That Shaped the Design
- Input: A claim can start from manual entry or from a Delivery Gateway Redis event. The ingestion path handles both because `ShipmentEventIngestionService.upsertFromEvent` writes the shipment, event, audit record, and auto-case in one transaction.
- Output: The system produces owned cases, evidence checklists, resolution actions, carrier CSV or JSON packets, notification outbox rows, workflow triggers, and dispute exceptions.
- Scale Handled: The acceptance suite proves 50 tracking events per second locally, a 10,000-case queue under 1 second, and a claim detail page under 500 ms excluding downloads. Those numbers come from PRD §10b and dedicated Gradle tests.
- Idempotency: Shipment events use an advisory transaction lock plus a `(tenant_id, dedup_key)` uniqueness constraint. Duplicate Redis messages can be read twice and still create one event row and one claim case.
- Backpressure: The Redis consumer reads 50 events per poll. It caches tenant authorization per batch to avoid 50 repeated database lookups for one tenant, then acknowledges Redis only after the database transaction returns.
- Failure Boundaries: Notification and workflow dispatchers use outbox tables. If a downstream service fails, the claim case remains durable and retries are scheduled from the database.
- Deployment Boundary: `docker-compose.prod.yml` ships app, PostgreSQL, and Redis, but no Traefik host label exists. The registry marks the project as shipped but not deployed, so there is no live URL in public content.
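
The idempotency and ack-after-commit constraints above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the service's implementation: `InMemoryEventTable` stands in for the PostgreSQL table with its `(tenant_id, dedup_key)` unique constraint, the `acked` list stands in for Redis acknowledgements, and every class and field name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** One tracking event as read from the stream; field names are illustrative. */
record TrackingEvent(String messageId, String tenantId, String dedupKey) {}

/** In-memory stand-in for the events table and its (tenant_id, dedup_key) unique constraint. */
final class InMemoryEventTable {
    private final Set<String> committedKeys = new HashSet<>();

    /** Applies the whole batch or nothing, mimicking a single database transaction. */
    int commitBatch(List<TrackingEvent> batch) {
        Set<String> staged = new HashSet<>();
        int inserted = 0;
        for (TrackingEvent event : batch) {
            if (event.dedupKey() == null || event.dedupKey().isBlank()) {
                // A bad event aborts the batch before anything is committed.
                throw new IllegalArgumentException("missing dedup key: " + event.messageId());
            }
            String key = event.tenantId() + "|" + event.dedupKey();
            if (!committedKeys.contains(key) && staged.add(key)) {
                inserted++; // duplicates are skipped, not errors
            }
        }
        committedKeys.addAll(staged); // "commit" happens only if no event failed
        return inserted;
    }

    int rowCount() {
        return committedKeys.size();
    }
}

/** Acknowledges stream messages only after the batch transaction has returned. */
final class AckAfterCommitConsumer {
    private final InMemoryEventTable table;
    private final List<String> acked = new ArrayList<>();

    AckAfterCommitConsumer(InMemoryEventTable table) {
        this.table = table;
    }

    /** Returns true when the batch committed and its messages were acknowledged. */
    boolean handle(List<TrackingEvent> batch) {
        try {
            table.commitBatch(batch);
        } catch (RuntimeException e) {
            return false; // nothing acked: the messages stay pending for redelivery
        }
        for (TrackingEvent event : batch) {
            acked.add(event.messageId());
        }
        return true;
    }

    List<String> acked() {
        return List.copyOf(acked);
    }
}
```

In this model a message delivered twice is acknowledged twice but produces one row, and a batch containing a bad event leaves both the table and the acknowledgement list untouched.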
Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| Acknowledge Redis messages after database commit | Ack inside the transaction loop | Validation found that a mixed good and bad batch could ack a message while rolling back its database rows. Deferring ack keeps Redis recovery honest. |
| Batch-local tenant authorization cache | Query tenant and integration settings for every event | The 50 events/sec proof failed under full regression when the same tenant was looked up 50 times. A per-poll cache keeps the tenant gate and removes repeated hot-path reads. |
| Root-claim materialized detail query | Controller calling five services for one detail page | The original page could function while violating the latency contract. The root CTE joins evidence, actions, exports, audit, and shipment events through one tenant-scoped read model. |
| Carrier policy registry | Carrier rules in controllers or templates | DHL, DPD, GLS, and manual carrier behavior differs by evidence, deadlines, exports, and actions. Keeping policy behind CarrierClaimPolicyRegistry stops carrier rules leaking into the UI. |
| Export packets instead of direct carrier submission | Carrier portal API submission in the MVP | Reliable carrier sandbox credentials are not present. CSV and JSON packets give operators carrier-ready output without pretending the system can file into carrier portals. |
| Feature-flagged ecosystem connectors | Required live Notification Hub, Workflow Engine, and Dispute Workbench services | Integration-disabled mode is a success criterion. The engine has to run as a standalone claims console first. |
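
The batch-local tenant authorization cache from the decision log can be sketched as follows. The `BatchTenantGate` and `InboundEvent` names are hypothetical, and the `Predicate<String>` is an assumed stand-in for whatever query checks tenant and integration settings in the real service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative event shape; only the tenant id matters for the gate. */
record InboundEvent(String tenantId, String payload) {}

/** Filters one poll's events by tenant authorization, looking each tenant up at most once per poll. */
final class BatchTenantGate {
    // Assumed stand-in for the database lookup of tenant and integration settings.
    private final Predicate<String> tenantLookup;

    BatchTenantGate(Predicate<String> tenantLookup) {
        this.tenantLookup = tenantLookup;
    }

    List<InboundEvent> authorized(List<InboundEvent> batch) {
        // The cache is a local variable, so it cannot outlive the poll or serve stale decisions later.
        Map<String, Boolean> cache = new HashMap<>();
        List<InboundEvent> allowed = new ArrayList<>();
        for (InboundEvent event : batch) {
            boolean ok = cache.computeIfAbsent(event.tenantId(), tenantLookup::test);
            if (ok) {
                allowed.add(event);
            }
        }
        return allowed;
    }
}
```

Scoping the cache to a single call keeps the tenant gate in place for every event while reducing 50 same-tenant lookups to one; a setting changed between polls is picked up on the next poll.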
Scaling Limits
The first pressure point is not the case model. It is the batch worker shape. DeliveryEventConsumerJob reads 50 messages per poll and runs a single transaction guarded by a database lock. That is right for the local 50 events/sec target and keeps failure semantics simple. At a much larger carrier-event volume, I would split by tenant or shard the consumer group so one tenant's bad event cannot hold up a whole mixed batch.
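
One way to picture the split-by-tenant option is to group each polled batch by tenant and run every group as its own unit of work. This is only a sketch of that idea, not an existing part of the codebase: `StreamMessage`, `TenantPartitioner`, and the `Consumer` standing in for a per-tenant transaction are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative stream message: a Redis-style id plus the owning tenant. */
record StreamMessage(String id, String tenantId) {}

final class TenantPartitioner {
    private TenantPartitioner() {}

    /** Groups a polled batch by tenant, preserving arrival order within each tenant. */
    static Map<String, List<StreamMessage>> byTenant(List<StreamMessage> batch) {
        Map<String, List<StreamMessage>> groups = new LinkedHashMap<>();
        for (StreamMessage message : batch) {
            groups.computeIfAbsent(message.tenantId(), k -> new ArrayList<>()).add(message);
        }
        return groups;
    }

    /**
     * Runs each tenant group through its own transaction (assumed callback)
     * and returns the ids that are safe to acknowledge.
     */
    static List<String> processPerTenant(List<StreamMessage> batch,
                                         Consumer<List<StreamMessage>> transaction) {
        List<String> ackable = new ArrayList<>();
        for (List<StreamMessage> group : byTenant(batch).values()) {
            try {
                transaction.accept(group);
                for (StreamMessage message : group) {
                    ackable.add(message.id());
                }
            } catch (RuntimeException e) {
                // Only this tenant's messages stay pending; other tenants still commit.
            }
        }
        return ackable;
    }
}
```

The trade-off is more transactions per poll in exchange for failure isolation, which is why it is framed here as a change for higher volume rather than for the current target.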
Claim detail history is the second limit. The current read model caps first-paint collections at 75 rows and returns total counts for disclosure. That keeps the operator page fast without hiding history, but a claim with years of evidence would need a paginated timeline endpoint rather than forcing every detail visit to carry archival data.
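
The capped first-paint shape described above can be sketched as a small value type that carries both the truncated rows and the full count. The names `CappedPage` and `DetailHistory` are hypothetical, and the in-memory list is an assumption for illustration; a real read model would more plausibly apply the cap and the count in SQL.

```java
import java.util.List;

/** A capped collection plus the total count, so the UI can disclose what is not shown. */
record CappedPage<T>(List<T> items, int totalCount) {
    /** True when older rows exist beyond the first-paint cap. */
    boolean truncated() {
        return items.size() < totalCount;
    }
}

final class DetailHistory {
    static final int FIRST_PAINT_CAP = 75; // cap stated in the brief

    private DetailHistory() {}

    /** Keeps at most the first 75 rows (assumed newest first) and reports the full count. */
    static <T> CappedPage<T> cap(List<T> newestFirst) {
        int end = Math.min(FIRST_PAINT_CAP, newestFirst.size());
        return new CappedPage<>(List.copyOf(newestFirst.subList(0, end)), newestFirst.size());
    }
}
```

With this shape a template can render a line such as "showing 75 of 200" from `items().size()` and `totalCount()`, and a paginated timeline endpoint could later serve the remainder.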
The business contract does not change at higher scale. Events still become owned cases. Carrier policy still decides evidence and exports. Optional services still sit behind outbox and trigger tables. The scaling work would change worker partitioning and history pagination, not the claims ownership model.