Architectural Brief: Returns & Claims Orchestration Engine

Tracking is not resolution. A parcel status can say returned or exception, but an operations team still needs an owner, a deadline, evidence, customer action, carrier export, and audit history.

This system turns delivery exceptions into claims cases without making the Delivery Gateway mandatory. It can ingest the Redis tracking:events stream when that ecosystem edge is enabled, but the manual claim path stays complete when every connector is off.

System Topology

[Architecture diagram]

Infrastructure Decisions

Compute: Spring Boot 3.4 on Java 21. Chose it over a lighter FastAPI or Node service because the product is a transactional operations console: validation, schedulers, database migrations, tenant security, metrics, and typed configuration are first-class concerns in the codebase.
Data Layer: PostgreSQL 16 with Flyway migrations. Chose it over an ORM-only schema because tenant-scoped uniqueness, foreign keys, idempotency constraints, JSONB snapshots, and indexes carry business correctness here. V10__create_phase3_evidence_resolution_exports.sql and V11__create_phase4_integration_tables.sql are not incidental schema files. They define the operating contract.
Event Coordination: Redis 7 Streams with consumer group returns-claims-engine. Chose it over Kafka because the upstream Delivery Gateway already emits tracking:events through Redis, and this service needs a local, low-friction consumer rather than another broker.
Presentation: Thymeleaf with HTMX and Tailwind. Chose it over a separate React client because this is an internal operations console: queue filters, detail pages, forms, and export actions. The server owns the workflow state, so server-rendered screens reduce client-side sync problems.
Storage: Local evidence and export storage with a resolver seam. Chose it over direct S3 in the MVP because the app must run locally and standalone. The abstraction preserves a future object-store boundary without making local operation depend on cloud credentials.
Integrations: Delivery Gateway, Notification Hub, Workflow Engine, and Dispute Workbench are feature-flagged. Chose optional edges over hard service dependencies because manual shipment and claim workflows must keep working when the portfolio graph is not running.
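The resolver seam described under Storage can be pictured with a minimal sketch. The names EvidenceStorage and LocalEvidenceStorage are hypothetical and are not taken from the codebase; the point is only that callers address blobs by key, so a future object-store implementation could sit behind the same interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical seam: callers store and load blobs by key and never see where the bytes live.
interface EvidenceStorage {
    String store(String key, byte[] content);
    byte[] load(String key);
}

// Local implementation for standalone operation; no cloud credentials required.
final class LocalEvidenceStorage implements EvidenceStorage {
    private final Path root;

    LocalEvidenceStorage(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    @Override
    public String store(String key, byte[] content) {
        Path target = resolve(key);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    @Override
    public byte[] load(String key) {
        try {
            return Files.readAllBytes(resolve(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reject keys that would escape the storage root (for example "../secrets").
    private Path resolve(String key) {
        Path target = root.resolve(key).normalize();
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Key escapes storage root: " + key);
        }
        return target;
    }
}
```

An object-store variant would implement the same two methods, which is why the local default does not lock the MVP out of that boundary later.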

Constraints That Shaped the Design

Input: A claim can start from manual entry or from a Delivery Gateway Redis event. The ingestion path handles both because ShipmentEventIngestionService.upsertFromEvent writes the shipment, event, audit record, and auto-case in one transaction.
Output: The system produces owned cases, evidence checklists, resolution actions, carrier CSV or JSON packets, notification outbox rows, workflow triggers, and dispute exceptions.
Scale Handled: The acceptance suite demonstrates 50 tracking events per second locally, a 10,000-case queue in under 1 second, and a claim detail page in under 500 ms excluding downloads. The targets are defined in PRD §10b and verified by dedicated Gradle tests.
Idempotency: Shipment events use an advisory transaction lock plus a (tenant_id, dedup_key) uniqueness constraint. The same Redis message can be delivered twice and still produce exactly one event row and one claim case.
Backpressure: The Redis consumer reads 50 events per poll. It caches tenant authorization per batch to avoid up to 50 repeated database lookups for one tenant, then acknowledges Redis only after the database transaction commits.
Failure Boundaries: Notification and workflow dispatchers use outbox tables. If a downstream service fails, the claim case remains durable and retries are scheduled from the database.
Deployment Boundary: docker-compose.prod.yml ships app, PostgreSQL, and Redis, but no Traefik host label exists. The registry marks the project as shipped but not deployed, so there is no live URL in public content.
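The idempotency constraint above can be illustrated with a small in-memory sketch. It is not the service's code: the record and class names are hypothetical, and a concurrent set stands in for the (tenant_id, dedup_key) uniqueness constraint that PostgreSQL enforces in the real system. The advisory lock is not modelled.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical event shape; only the fields the dedup check needs.
record TrackingEvent(String tenantId, String dedupKey, String status) {}

// Composite key mirroring the (tenant_id, dedup_key) uniqueness constraint.
record DedupKey(String tenantId, String dedupKey) {}

// In-memory stand-in for the database side of ingestion.
final class IdempotentIngestor {
    private final Set<DedupKey> seenKeys = ConcurrentHashMap.newKeySet();
    private final AtomicInteger casesCreated = new AtomicInteger();

    // Returns true only the first time a (tenant, dedupKey) pair is seen.
    boolean ingest(TrackingEvent event) {
        DedupKey key = new DedupKey(event.tenantId(), event.dedupKey());
        // add() returns false for a duplicate, much as a second INSERT
        // would be rejected by the unique constraint.
        if (!seenKeys.add(key)) {
            return false;
        }
        casesCreated.incrementAndGet();
        return true;
    }

    int casesCreated() {
        return casesCreated.get();
    }
}
```

Because the key includes the tenant, two tenants can reuse the same dedup key without colliding, while a redelivered message for one tenant is a no-op.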

Decision Log

Decision	Alternative Rejected	Why
Acknowledge Redis messages after database commit	Ack inside the transaction loop	Validation found that a mixed good and bad batch could ack a message while rolling back its database rows. Deferring ack keeps Redis recovery honest.
Batch-local tenant authorization cache	Query tenant and integration settings for every event	The 50 events/sec proof failed under full regression when the same tenant was looked up 50 times. A per-poll cache keeps the tenant gate and removes repeated hot-path reads.
Root-claim materialized detail query	Controller calling five services for one detail page	The original page could function while violating the latency contract. The root CTE joins evidence, actions, exports, audit, and shipment events through one tenant-scoped read model.
Carrier policy registry	Carrier rules in controllers or templates	DHL, DPD, GLS, and manual carrier behavior differs by evidence, deadlines, exports, and actions. Keeping policy behind `CarrierClaimPolicyRegistry` stops carrier rules leaking into the UI.
Export packets instead of direct carrier submission	Carrier portal API submission in the MVP	Reliable carrier sandbox credentials are not present. CSV and JSON packets give operators carrier-ready output without pretending the system can file into carrier portals.
Feature-flagged ecosystem connectors	Required live Notification Hub, Workflow Engine, and Dispute Workbench services	Integration-disabled mode is a success criterion. The engine has to run as a standalone claims console first.
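The first two decisions in the log, deferred acknowledgement and the batch-local tenant cache, can be sketched together. This is a simplified model under stated assumptions: StreamMessage and BatchProcessor are hypothetical names, the Predicate stands in for the tenant and integration settings query, and the Consumer stands in for the database transaction, throwing on rollback. Acknowledging skipped messages from unauthorized tenants is also an assumption; the brief does not say how they are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical message shape: a stream entry id plus the owning tenant.
record StreamMessage(String id, String tenantId) {}

final class BatchProcessor {
    private final Predicate<String> tenantLookup;        // stand-in for the settings query
    private final Consumer<List<StreamMessage>> commit;  // stand-in for the transaction

    BatchProcessor(Predicate<String> tenantLookup, Consumer<List<StreamMessage>> commit) {
        this.tenantLookup = tenantLookup;
        this.commit = commit;
    }

    // Returns the message ids that are safe to acknowledge for this poll.
    List<String> process(List<StreamMessage> batch) {
        // The cache lives for one poll only, so authorization is re-checked on the next batch.
        Map<String, Boolean> tenantCache = new HashMap<>();
        List<StreamMessage> accepted = new ArrayList<>();
        for (StreamMessage message : batch) {
            boolean allowed = tenantCache.computeIfAbsent(message.tenantId(), tenantLookup::test);
            if (allowed) {
                accepted.add(message);
            }
        }
        try {
            commit.accept(accepted);
        } catch (RuntimeException rollback) {
            // Nothing was persisted, so nothing is acknowledged and the batch can be re-read.
            return List.of();
        }
        // Only after a successful commit are ids handed back for acknowledgement.
        return batch.stream().map(StreamMessage::id).toList();
    }
}
```

With 50 messages from one tenant, the lookup runs once instead of 50 times, and a failed commit yields an empty acknowledgement list rather than a partial one.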

Scaling Limits

The first pressure point is not the case model. It is the batch worker shape. DeliveryEventConsumerJob reads 50 messages per poll and runs a single transaction guarded by a database lock. That is right for the local 50 events/sec target and keeps failure semantics simple. At a much larger carrier-event volume, I would split by tenant or shard the consumer group so one tenant's bad event cannot stall a whole mixed batch.
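One way to picture the tenant split is the sketch below. It is not the current worker: TenantEvent and TenantPartitioner are hypothetical names, and the Consumer stands in for a per-tenant transaction that throws on failure. A sharded consumer group would achieve similar isolation at the Redis level instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical event shape: just enough to route by tenant.
record TenantEvent(String tenantId, String eventId) {}

final class TenantPartitioner {
    private TenantPartitioner() {}

    // Groups a mixed batch by tenant, preserving arrival order within each tenant.
    static Map<String, List<TenantEvent>> partition(List<TenantEvent> batch) {
        Map<String, List<TenantEvent>> byTenant = new LinkedHashMap<>();
        for (TenantEvent event : batch) {
            byTenant.computeIfAbsent(event.tenantId(), k -> new ArrayList<>()).add(event);
        }
        return byTenant;
    }

    // Runs each tenant's slice as its own unit of work and returns the tenants that failed.
    static List<String> processIsolated(List<TenantEvent> batch,
                                        Consumer<List<TenantEvent>> unitOfWork) {
        List<String> failedTenants = new ArrayList<>();
        partition(batch).forEach((tenant, events) -> {
            try {
                unitOfWork.accept(events);
            } catch (RuntimeException ex) {
                // Only this tenant's slice is affected; the others still proceed.
                failedTenants.add(tenant);
            }
        });
        return failedTenants;
    }
}
```

The trade-off is more transactions per poll in exchange for failure isolation, which only pays off once mixed-tenant batches become common.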

Claim detail history is the second limit. The current read model caps first-paint collections at 75 rows and returns total counts for disclosure. That keeps the operator page fast without hiding history, but a claim with years of evidence would need a paginated timeline endpoint rather than forcing every detail visit to carry archival data.
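The capped first-paint collection with a total count can be sketched as a small value type. CappedSlice is a hypothetical name, and in the real read model the cap would presumably be applied in SQL rather than in memory; the sketch only shows the shape of the contract: at most 75 rows plus the full count.

```java
import java.util.List;

// Hypothetical read-model slice: the rows shown on first paint plus the full count.
record CappedSlice<T>(List<T> items, int totalCount) {
    // First-paint cap stated in the brief.
    static final int FIRST_PAINT_CAP = 75;

    // Keeps the first rows up to the cap and remembers how many rows exist in total.
    static <T> CappedSlice<T> of(List<T> all) {
        int end = Math.min(all.size(), FIRST_PAINT_CAP);
        return new CappedSlice<>(List.copyOf(all.subList(0, end)), all.size());
    }

    // Number of rows not shown, for a disclosure such as "125 more".
    int hiddenCount() {
        return totalCount - items.size();
    }
}
```

A paginated timeline endpoint would replace the single slice with an offset or cursor, but the total count would still drive the disclosure.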

The business contract does not change at higher scale. Events still become owned cases. Carrier policy still decides evidence and exports. Optional services still sit behind outbox and trigger tables. The scaling work would change worker partitioning and history pagination, not the claims ownership model.
