Why I Moved Redis Acknowledgement Outside the Database Transaction

One good parcel event, one bad parcel event, one batch. That was enough to lose the good one.

The consumer read both from Redis Streams. The first event was valid, so the app wrote a shipment event and a claim case, then acknowledged the Redis message. The second event had no tenant id, threw an exception, and rolled back the PostgreSQL transaction. Redis had already been told the first message was done.

The database said nothing happened. Redis said the message was gone.

That is the kind of bug that looks like an operations mystery later. A carrier says it sent the event. The stream no longer shows it as pending. The claim queue has no case. Everyone starts looking at logs, but the data has already contradicted itself.

The part I got wrong

I treated Redis acknowledgement as part of processing. That was too early.

The first version of DeliveryEventConsumerJob.pollOnce() processed each stream record inside a TransactionTemplate. It wrote through ShipmentEventIngestionService.upsertFromEvent(...), then acknowledged the message in the same loop. It felt clean because both actions sat inside the same method and both happened after the event handler returned.

But Redis is not inside the PostgreSQL transaction. XACK does not care whether the database later commits. Once Redis removes the message from the pending list, the recovery path changes. If the transaction rolls back after that ack, PostgreSQL loses the row and Redis loses the retry handle.

That is not a duplicate problem. It is a vanished work problem.

The codebase already had idempotency in the right place for duplicates. ShipmentEventIngestionService builds a tenant-scoped dedup key from the event source and takes a PostgreSQL advisory transaction lock before inserting into shipment_events. The table also has a (tenant_id, dedup_key) uniqueness constraint. A replay can create at most one shipment event row and one claim case.

I was wrong to optimize for avoiding replay. Replay was safe. Premature ack was not.

The shape of the fix

The fix was small, but the boundary it moved mattered.

public int pollOnce() {
    if (!Boolean.TRUE.equals(properties.enabled())) {
        return 0;
    }
    List<String> processedMessageIds = transactionTemplate.execute(status -> {
        if (!lockManager.tryAcquireTransactionLock(LOCK_NAME)) {
            return List.of();
        }
        streamClient.ensureConsumerGroup();
        String consumer = consumerName();
        List<DeliveryGatewayTrackingEvent> events = streamClient.readPendingEvents(
                consumer, BATCH_SIZE, Duration.ZERO);
        if (events.isEmpty()) {
            events = streamClient.readNewEvents(consumer, BATCH_SIZE, Duration.ofMillis(250));
        }
        List<String> messageIds = new ArrayList<>();
        Map<UUID, TenantContext> tenantsById = new HashMap<>();
        for (DeliveryGatewayTrackingEvent event : events) {
            process(event, tenantsById);
            messageIds.add(event.messageId());
        }
        return messageIds;
    });
    if (processedMessageIds == null || processedMessageIds.isEmpty()) {
        return 0;
    }
    streamClient.acknowledgeAll(processedMessageIds);
    return processedMessageIds.size();
}

The transaction now returns the message ids only after the database work succeeds. Only then does the job call acknowledgeAll.

That creates two possible failure states, and both are tolerable.

If the database fails, no ack happens. Redis still has the pending messages. The next poll retries them.

If the database commits and the ack fails, Redis may replay messages whose database rows already exist. That is fine. The dedup key, advisory lock, and unique index collapse the replay back to the existing shipment event. The system may do extra work, but it will not create a second claim.

This is the tradeoff I should have chosen from the start: duplicate work over lost work.

Why the batch cache came later

The ack fix uncovered a different problem. The 50 events/sec acceptance test passed in isolation, then failed during full regression. The batch took 1,805 ms against a 1,000 ms target.

At first glance, that looked like Redis overhead. It wasn't. The consumer was reading 50 messages and doing one batched ack, but process(...) still resolved the same tenant and integration setting once per event. One tenant, 50 events, 50 database lookups before the hot path even reached shipment ingestion.

The fix was not a global cache. A global tenant cache would create stale security behavior and make feature flags harder to trust. The right cache lived inside one poll batch:

Map<UUID, TenantContext> tenantsById = new HashMap<>();
for (DeliveryGatewayTrackingEvent event : events) {
    process(event, tenantsById);
    messageIds.add(event.messageId());
}

process(...) now uses computeIfAbsent to authorize each tenant once per poll. Every poll still checks whether the tenant has delivery-gateway enabled. Nothing survives across polls. The latency win comes from removing repeated reads, not weakening the gate.

What surprised me was how easy this bug was to miss. The system could pass duplicate-message tests, tenant-scope tests, and manual exception flow tests while still failing a real throughput target. It needed the acceptance test with 50 actual Redis messages and PostgreSQL writes to reveal the hidden cost.

The database boundary still does the hard work

The consumer job is only the outer shell. The correctness boundary lives in ShipmentEventIngestionService.upsertFromEvent(...).

That service takes the event, normalizes carrier, tracking number, status, timestamps, snapshots, and metadata. It builds an idempotency key under the tenant id. It locks that key with pg_advisory_xact_lock. It upserts the shipment, inserts the event with on conflict do nothing, updates the visible shipment status only when the event timestamp is not older, writes audit records, then asks ClaimAutoCaseService whether the status deserves a case.

Only three statuses create default cases: failed_attempt, returned, and exception. A delivered scan does not open a claim. An unknown scan does not open a claim. The tracking timeline records context, but operations work begins only when the event status represents an operational exception.

That separation matters because not every carrier signal should become work. A tracking system records facts. A claims system creates ownership.

The result

The regression test that mattered is blunt: publish a valid event and then an invalid event in the same Redis batch. Run pollOnce(). Assert that no shipment event and no claim case exist, and that both Redis messages remain pending.

That test now passes.

The throughput proof also passes: 50 delivered events process in under 1 second locally, and because they are delivered events, they create zero claim cases. The duplicate stream test publishes two messages with the same source event and gets one shipment event, one claim case, and zero pending Redis messages after successful processing.

The lesson here is not "ack after commit" as a slogan. The real rule is narrower: if one system owns retry visibility and another system owns business durability, the retry signal must not be cleared until the business fact is committed.

The part I got wrong

The shape of the fix

Why the batch cache came later

The database boundary still does the hard work

The result

Put this system in context.

Contents