Architectural Brief: Delivery Tracking Gateway
Three carrier portals become a correctness problem once clients need live shipment state instead of occasional lookups. The gateway normalizes DHL, DPD, and GLS into one event contract, stores every status change before broadcasting it, and lets WebSocket clients subscribe without knowing which carrier produced the update.
The binding constraint is carrier disorder. Duplicate scans, old scans, 404 tracking responses, HTML error pages, and offsetless timestamps all arrive at the adapter boundary. The architecture treats those as normal inputs, not exceptions to the model.
System Topology
Infrastructure Decisions
- Compute: Node.js 22 with Fastify. Chose this over a heavier backend framework because the surface is a small API, a WebSocket route, and background loops. The production image runs compiled JavaScript with `tsc-alias` instead of a TypeScript runtime loader because containers should start with plain Node.
- Data Layer: PostgreSQL 16 with Drizzle ORM. Chose this over a document store because shipments, tracking events, and carrier configs have stable relationships. `tracking_events.shipment_id` uses restrictive delete behavior so a shipment removal cannot erase the event history (see the schema sketch after this list).
- Event Transport: Redis Streams. Chose this over direct in-process WebSocket sends because fanout needs a consumer group, pending-message recovery, and explicit acknowledgement. The stream is delivery infrastructure; PostgreSQL remains the durable record.
- Carrier Integration: Separate DHL, DPD, and GLS adapters. Chose this over one shared parser because the carriers differ in authentication, URL shape, response shape, error behavior, and timestamp rules.
- Authentication: Static API keys. Chose this over user accounts and JWT because the system is a backend gateway for trusted client systems, not a user-facing account platform.
- Deployment: Docker Compose with PostgreSQL, Redis, and the app behind Traefik labels. Chose this over managed cloud primitives because the project target is a portable portfolio backend that can run locally or on a small VPS.
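A minimal Drizzle schema sketch of the two tables behind these decisions. The restrictive delete on `tracking_events.shipment_id`, the `dedup_key`, `current_status`, and `deleted_at` columns come from this brief; every other column name, including the `last_event_at` projection column used in later sketches, is an illustrative assumption rather than the project's actual schema.

```typescript
import { pgTable, text, timestamp, uuid } from "drizzle-orm/pg-core";

// Shipments are soft-deleted via deleted_at; status columns are plain text
// backed by TypeScript unions rather than PostgreSQL enums.
export const shipments = pgTable("shipments", {
  id: uuid("id").primaryKey().defaultRandom(),
  trackingNumber: text("tracking_number").notNull(),
  carrier: text("carrier").notNull(),
  currentStatus: text("current_status").notNull(),
  lastEventAt: timestamp("last_event_at"), // assumed projection column: newest carrier timestamp applied
  deletedAt: timestamp("deleted_at"),      // set instead of deleting the row
});

// onDelete: "restrict" means a shipment row cannot be removed while tracking
// events still reference it, so the event history cannot be erased in passing.
export const trackingEvents = pgTable("tracking_events", {
  id: uuid("id").primaryKey().defaultRandom(),
  shipmentId: uuid("shipment_id")
    .notNull()
    .references(() => shipments.id, { onDelete: "restrict" }),
  dedupKey: text("dedup_key").notNull().unique(),
  status: text("status").notNull(),
  occurredAt: timestamp("occurred_at").notNull(), // carrier timestamp, normalized to UTC by the adapter
});
```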
Constraints That Shaped the Design
- Input: Clients register shipments through a registration endpoint; the polling engine then fetches updates from DHL, DPD, and GLS on carrier-specific intervals.
- Output: REST reads return current shipment state and full timelines; WebSocket clients receive stream payloads for subscribed tracking numbers.
- Scale Handled: The PRD target is 100 active shipments, 3 carriers, 500 events per minute, and WebSocket delivery under 200ms after a status change enters the event pipeline.
- Polling Boundary: `POLL_BATCH_SIZE` defaults to 10. The engine batches active shipments per carrier and uses exponential backoff from each carrier config.
- Connection Boundary: `WS_MAX_CONNECTIONS` defaults to 1000. The route rejects new handshakes when the active connection set reaches that limit (both defaults appear in the config sketch after this list).
- Retention Boundary: Delivered or returned shipments archive after 30 days by setting `deleted_at`. The Redis stream trimmer keeps `tracking:events` at 10,000 entries.
- Deployment Boundary: The registry marks the project as shipped but not deployed. `docker-compose.prod.yml` still contains a Traefik label for `tracking.kingsleyonoh.com`, but the build journal says live deployment was removed from scope.
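A small sketch of how those boundary defaults might be read from the environment. `POLL_BATCH_SIZE` and `WS_MAX_CONNECTIONS` are the variable names documented above; the `config` object shape and the retention constants collected alongside them are assumptions for illustration.

```typescript
// Parse an integer environment variable, falling back to the documented default.
const toInt = (value: string | undefined, fallback: number): number => {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
};

export const config = {
  pollBatchSize: toInt(process.env.POLL_BATCH_SIZE, 10),        // shipments per carrier batch
  wsMaxConnections: toInt(process.env.WS_MAX_CONNECTIONS, 1000), // reject handshakes above this
  archiveAfterDays: 30,    // delivered/returned shipments get deleted_at after this window
  streamMaxLength: 10_000, // trim target for the tracking:events stream
};
```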
Operational Contracts
The event processor owns the strongest invariant in the system. It checks the incoming dedup_key, inserts the tracking event, updates the shipment projection only when the carrier timestamp is newer, and only then publishes to Redis. If the database insert fails, Redis never receives the event. If Redis publish fails after the insert, the event remains in PostgreSQL and can be replayed later.
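A sketch of that ordering, assuming a Drizzle client in `./db`, the schema from the earlier sketch, and an ioredis client in `./redis`. The `processEvent` name, the conflict-based dedup check, and the `lastEventAt` comparison column are illustrative, not the project's actual code.

```typescript
import { and, eq, isNull, lt, or } from "drizzle-orm";
import { db } from "./db";       // assumed Drizzle client
import { redis } from "./redis"; // assumed ioredis client
import { shipments, trackingEvents } from "./schema";

interface NormalizedEvent {
  shipmentId: string;
  dedupKey: string;
  status: string;
  occurredAt: Date; // carrier timestamp, already converted to UTC by the adapter
}

export async function processEvent(event: NormalizedEvent): Promise<void> {
  // 1. Insert the event; the unique dedup_key collapses replayed carrier data
  //    into the row that already exists.
  const inserted = await db
    .insert(trackingEvents)
    .values(event)
    .onConflictDoNothing({ target: trackingEvents.dedupKey })
    .returning();
  if (inserted.length === 0) return; // duplicate scan: nothing new to broadcast

  // 2. Update the projection only when the carrier timestamp is newer than
  //    what the shipment row already reflects.
  await db
    .update(shipments)
    .set({ currentStatus: event.status, lastEventAt: event.occurredAt })
    .where(
      and(
        eq(shipments.id, event.shipmentId),
        or(isNull(shipments.lastEventAt), lt(shipments.lastEventAt, event.occurredAt)),
      ),
    );

  // 3. Publish only after the insert succeeded. If XADD fails here, the event
  //    is already durable in PostgreSQL and can be replayed later.
  await redis.xadd("tracking:events", "*", "payload", JSON.stringify(event));
}
```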
The WebSocket broadcaster has a different contract. It reads from tracking:events as consumer group ws-broadcaster, parses a single JSON payload field, fans out to subscribed connection ids, and acknowledges the stream entry. Malformed entries are logged and acknowledged because retrying invalid JSON does not repair it. Stuck pending entries are recovered with XAUTOCLAIM after 60 seconds.
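A sketch of the read/acknowledge loop under those rules, assuming ioredis. The consumer name, the `broadcast` helper, and the placement of the XAUTOCLAIM call at the top of each cycle are placeholders rather than the project's actual broadcaster.

```typescript
import Redis from "ioredis";

declare function broadcast(event: unknown): void; // assumed fan-out to subscribed connection ids

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const STREAM = "tracking:events";
const GROUP = "ws-broadcaster";
const CONSUMER = `ws-${process.pid}`;

type StreamEntry = [id: string, fields: string[]];

async function handleEntries(entries: StreamEntry[]): Promise<void> {
  for (const [id, fields] of entries) {
    try {
      // Each entry carries a single JSON field named "payload".
      const payload = fields[fields.indexOf("payload") + 1];
      broadcast(JSON.parse(payload));
    } catch (err) {
      // Retrying invalid JSON cannot repair it: log, then fall through to ack.
      console.error("dropping malformed stream entry", id, err);
    }
    await redis.xack(STREAM, GROUP, id);
  }
}

export async function runBroadcaster(): Promise<void> {
  // Create the consumer group if it does not exist yet (ignore BUSYGROUP).
  await redis.xgroup("CREATE", STREAM, GROUP, "$", "MKSTREAM").catch(() => {});

  for (;;) {
    // Take over entries another consumer read but left pending for over 60s.
    const [, reclaimed] = (await redis.xautoclaim(
      STREAM, GROUP, CONSUMER, 60_000, "0",
    )) as unknown as [string, StreamEntry[]];
    await handleEntries(reclaimed);

    // Block up to 5s waiting for new entries addressed to this group.
    const batches = (await redis.xreadgroup(
      "GROUP", GROUP, CONSUMER,
      "COUNT", 10, "BLOCK", 5000,
      "STREAMS", STREAM, ">",
    )) as unknown as [string, StreamEntry[]][] | null;
    if (!batches) continue;

    for (const [, entries] of batches) {
      await handleEntries(entries);
    }
  }
}
```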
The adapters enforce the boundary between carrier-specific noise and core logic. DHL needs an API key header and returns a normalized unknown event on 404. DPD can return HTML on failure, so the adapter detects text/html before JSON parsing. GLS sends local CET date and time fields, so the adapter parses them into UTC before the event reaches the processor.
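Two small sketches of those adapter rules, one per quirk. The URL handling, the GLS field names, and the "yyyy-MM-dd HH:mm:ss" format are assumptions, and the timezone conversion leans on luxon rather than whatever the project actually uses.

```typescript
import { DateTime } from "luxon";

// DPD: the portal can answer failures with an HTML error page, so check the
// content type before attempting to parse the body as JSON.
export async function fetchDpdStatus(url: string): Promise<unknown> {
  const response = await fetch(url);
  const contentType = response.headers.get("content-type") ?? "";
  if (contentType.includes("text/html")) {
    throw new Error(`DPD returned an HTML error page (status ${response.status})`);
  }
  return response.json();
}

// GLS: scans arrive as local CET/CEST date and time fields; convert them to a
// UTC Date before the event reaches the processor.
export function glsTimestampToUtc(localDate: string, localTime: string): Date {
  return DateTime.fromFormat(`${localDate} ${localTime}`, "yyyy-MM-dd HH:mm:ss", {
    zone: "Europe/Berlin", // CET/CEST, including DST transitions
  })
    .toUTC()
    .toJSDate();
}
```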
Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| Event log before shipment projection | Directly update `shipments.current_status` | Duplicate and out-of-order carrier updates need a preserved timeline, not only the latest state. |
| Deterministic `dedup_key` | Random event ids only | Replayed carrier data must collapse into the same event instead of creating duplicate rows. |
| Redis Streams consumer group | In-process publish to sockets | The WebSocket side needs acknowledgement, stuck-message reclaiming, and cross-process fanout later. |
| Separate carrier adapters | One generic carrier client | DHL, DPD, and GLS differ enough that shared parsing would hide carrier-specific failure rules. |
| Text status fields plus TypeScript unions | PostgreSQL enums | The project rules prefer text columns unless a migration explicitly needs DB enums, and carrier mappings can evolve in code. |
| Soft-delete shipments | Delete shipment rows | Polling must stop, but event history must remain readable and foreign keys must not cascade away audit data. |
| `::timestamp` cursor casts | Passing JavaScript `Date` directly to Drizzle filters | A documented gotcha showed timestamp pagination repeating rows without explicit PostgreSQL timestamp literals (sketched below). |
| Runtime migration runner | Shipping Drizzle Kit in production | A small compiled migration runner keeps the runtime image slimmer while still applying SQL before server start. |
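A sketch of the cursor cast from that last decision, assuming the Drizzle client and schema names from the earlier sketches; `eventsAfter` and the string-typed cursor are illustrative.

```typescript
import { asc, sql } from "drizzle-orm";
import { db } from "./db";                 // assumed Drizzle client
import { trackingEvents } from "./schema"; // assumed schema module

// `cursor` is the occurred_at value of the last row the client already has,
// passed as a string rather than a JavaScript Date.
export async function eventsAfter(shipmentId: string, cursor: string, limit = 50) {
  return db
    .select()
    .from(trackingEvents)
    .where(
      // The explicit ::timestamp cast keeps the comparison a PostgreSQL
      // timestamp literal; the documented gotcha was pagination repeating
      // rows when a Date object was passed straight into the filter.
      sql`${trackingEvents.shipmentId} = ${shipmentId} AND ${trackingEvents.occurredAt} > ${cursor}::timestamp`,
    )
    .orderBy(asc(trackingEvents.occurredAt))
    .limit(limit);
}
```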
Scaling Limits
The current architecture is right for a small gateway: one app process, one PostgreSQL database, one Redis stream, and independent carrier polling loops. The first pressure point is carrier throughput. The code polls each shipment in a batch sequentially inside pollBatch, so a large shipment set or slow carrier would stretch the cycle before database or Redis becomes the limit.
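A sketch of that sequential shape; `pollCarrier`, the `Shipment` type, and the chunked variant are placeholders, not the project's actual polling engine.

```typescript
interface Shipment {
  trackingNumber: string;
  carrier: "dhl" | "dpd" | "gls";
}

declare function pollCarrier(shipment: Shipment): Promise<void>; // assumed per-shipment fetch + event processing

export async function pollBatch(batch: Shipment[]): Promise<void> {
  for (const shipment of batch) {
    // Each shipment waits for the previous request, so one slow carrier
    // response stretches the entire polling cycle for that batch.
    await pollCarrier(shipment);
  }
}

// A likely first scaling step: bounded concurrency per carrier, for example
// Promise.allSettled over fixed-size chunks instead of the strict sequence.
export async function pollBatchChunked(batch: Shipment[], chunkSize = 5): Promise<void> {
  for (let i = 0; i < batch.length; i += chunkSize) {
    await Promise.allSettled(batch.slice(i, i + chunkSize).map(pollCarrier));
  }
}
```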
The second pressure point is WebSocket locality. The connection registry is in memory, while subscriptions live in Redis. That is fine for one app process. A multi-instance deployment would need process-aware routing or a socket gateway layer so a broadcaster on one container does not try to send to connections held by another.
The third pressure point is operational metrics. The stats endpoint reports carrier error rates as null because poll failure counters are not persisted yet. For production operations, that limitation should become a real table or time-series metric before anyone depends on per-carrier reliability dashboards.