"Show me everything that happened to discrepancy #4721 between Tuesday and Thursday." In a system with mutable records, the answer takes an hour of forensics across email threads, shared folders, and database logs. In a system with immutable events, it's a single query: SELECT * FROM ledger_events WHERE discrepancy_id = $1 ORDER BY sequence_num.
The difference comes down to provability. The mutable system can produce a timeline, but it can't prove the timeline wasn't edited between Thursday and today. The immutable system can, because the records physically cannot change once written.
I built the Financial Compliance Ledger around this principle: the audit trail is not a feature attached to the data. The audit trail IS the data.
The Provenance Problem
Most compliance systems are CRUD applications with audit logging bolted on. The discrepancy record lives in a discrepancies table. When someone changes the status from open to acknowledged, the system updates the row and maybe writes a log entry somewhere else.
This design has a structural flaw. The log and the data it's supposed to protect occupy the same trust level. Both are regular database rows. Both can be updated. A DBA with production access, a bug in the application layer, or a migration script that touches the wrong table can alter the audit trail without leaving a trace. The audit log is a second-class citizen shadowing the real data.
The compliance question is specific: show me every state change, in order, with who did it and why, and prove it wasn't altered. You can't answer that question with updated_at and updated_by columns. Those columns are regular writable fields. They can be overwritten like anything else.
The fix requires a different data model, not better logging.
What Immutability Actually Requires
Financial audit trails have three hard constraints that CRUD systems can't satisfy structurally.
First, every action must produce a permanent record. Not a log entry that's written if the application remembers to call the logger. A database row that's created in the same transaction as the state change. If the state change commits, the record commits. If either fails, both roll back. There's no window where the projection says "acknowledged" but the event doesn't exist.
Second, ordering must be deterministic. Timestamps aren't enough. Two escalation rules evaluating against the same discrepancy in the same millisecond need an unambiguous order. I used PostgreSQL's BIGSERIAL type for the sequence_num column on ledger_events. The backing sequence is monotonically increasing and never reissues a value, so no two events share a sequence number. (Rolled-back transactions can leave gaps in the numbering, but a gap doesn't make the order ambiguous.)
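A minimal sketch of what that table can look like. Only sequence_num, discrepancy_id, actor, actor_type, and the JSONB payload are described in the text; the remaining column names and types are assumptions:

```sql
-- Sketch of the append-only events table. Columns beyond those named
-- in the text (event_type, created_at) are assumptions.
CREATE TABLE ledger_events (
    sequence_num   BIGSERIAL PRIMARY KEY,      -- unambiguous global ordering
    discrepancy_id BIGINT      NOT NULL,
    event_type     TEXT        NOT NULL,       -- e.g. 'discrepancy.acknowledged'
    actor          TEXT        NOT NULL,       -- who performed the action
    actor_type     TEXT        NOT NULL,       -- 'system', 'user', or escalation engine
    payload        JSONB       NOT NULL DEFAULT '{}',
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
```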
Third, the immutability can't be enforced only at the application layer. Application-level rules get bypassed by migration scripts, admin queries, and bugs. The Go codebase never issues an UPDATE or DELETE against the ledger_events table. There's no UpdateEvent method. There's no DeleteEvent method. They were never written.
The protection is structural: you can't accidentally call a function that doesn't exist.
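The missing-method design covers the application, but migration scripts and admin sessions connect directly to the database. One way to close that gap at the database layer is to revoke mutation privileges and back them with a trigger. This is a hedged sketch, not something the text says the project does; app_role and the trigger names are hypothetical:

```sql
-- Hypothetical database-layer guard: even a direct connection as the
-- application role cannot rewrite history.
REVOKE UPDATE, DELETE, TRUNCATE ON ledger_events FROM app_role;

-- Belt-and-suspenders: reject mutations regardless of role.
CREATE OR REPLACE FUNCTION reject_ledger_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'ledger_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ledger_events_append_only
    BEFORE UPDATE OR DELETE ON ledger_events
    FOR EACH ROW EXECUTE FUNCTION reject_ledger_mutation();
```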
The Dual-Table Architecture
The design splits the data into two tables with different mutation rules.
discrepancies is the mutable projection. It stores the current state: status, severity, amounts, timestamps. This table gets updated on every workflow action. It exists for queryability: filtering by status, sorting by creation date, paginating with cursor-based navigation. The indexes live here because this is the table that handles read-heavy operations.
ledger_events is the immutable source of truth. Every state change, every note, every escalation, every resolution becomes a row with a unique sequence_num, the actor who performed it, the actor_type (system, user, or escalation engine), and a JSONB payload carrying the details.
The discrepancy row is a convenience. If you deleted every row in the discrepancies table, you could reconstruct the complete state of every discrepancy from ledger_events alone. The events are the data. The projection is a cache.
The state machine enforces which transitions are legal. Here's the actual code from domain/discrepancy.go:
```go
var validTransitions = map[string]map[string]bool{
	StatusOpen: {
		StatusAcknowledged: true,
		StatusAutoClosed:   true,
	},
	StatusAcknowledged: {
		StatusInvestigating: true,
	},
	StatusInvestigating: {
		StatusResolved:  true,
		StatusEscalated: true,
	},
	StatusEscalated: {
		StatusResolved: true,
	},
	// resolved and auto_closed are terminal states. No transitions out.
}

func ValidTransition(from, to string) bool {
	allowed, ok := validTransitions[from]
	if !ok {
		return false
	}
	return allowed[to]
}
```
Six states. Two terminal (resolved, auto_closed). Every transition is explicit. There's no default case that silently allows unknown transitions. If a status pair isn't in the map, the transition is rejected.
Each workflow action (acknowledge, investigate, resolve, add note) runs inside a single database transaction: read the discrepancy with a row lock, validate the transition via ValidTransition, update the mutable projection, append the immutable event, commit. Steps 3 and 4 are atomic. If the event INSERT fails, the status update rolls back. If the status update fails, no event is created.
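The five steps map onto SQL roughly like this. A sketch for the acknowledge action; the real statements live behind the Go repository layer, and column names beyond those described in the text are assumptions:

```sql
BEGIN;

-- 1. Read the discrepancy with a row lock.
SELECT status FROM discrepancies WHERE id = $1 FOR UPDATE;

-- 2. The application runs ValidTransition(status, 'acknowledged') here;
--    on failure it rolls back and nothing below executes.

-- 3. Update the mutable projection.
UPDATE discrepancies SET status = 'acknowledged' WHERE id = $1;

-- 4. Append the immutable event in the same transaction.
INSERT INTO ledger_events (discrepancy_id, event_type, actor, actor_type, payload)
VALUES ($1, 'discrepancy.acknowledged', $2, 'user', $3);

-- 5. Both writes become visible together, or not at all.
COMMIT;
```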
The Concurrency Bug That Proved the Design
The first version relied on timestamp ordering for events. It worked in tests because tests run sequentially. In production-like conditions with the escalation engine running, two rules evaluated concurrently against the same discrepancy.
Both checked the status: open. Both saw a valid transition to auto_closed. Both wrote events.
The result: two discrepancy.auto_closed events for the same discrepancy. The event history showed two closures. An auditor looking at that trail would see a discrepancy closed twice and have no way to determine which closure was authoritative.
Moving the transition check inside the database transaction with a row-level lock on the discrepancy fixed it. The first transaction acquires the lock, validates the transition (open to auto_closed: valid), updates the projection, appends the event, and commits. The second transaction waits for the lock, reads the updated status (already auto_closed), runs ValidTransition("auto_closed", "auto_closed"), gets false, and rolls back.
The sequence_num made the fix verifiable. After the change, I wrote a test that fires two concurrent goroutines at the same discrepancy and asserts that exactly one auto_closed event exists. The monotonic sequence guarantees a single total ordering: no two events ever share a sequence number.
Optimistic Locking as an Opt-In
The expected_sequence parameter on workflow actions is optional. When a client provides it, the system checks that the discrepancy's latest event sequence number matches what the client expects. If another action happened between the client's read and write, the numbers diverge, and the system returns 409 Conflict.
I made it optional deliberately. A simple integration that acknowledges discrepancies one at a time doesn't need optimistic locking. The ValidTransition check catches illegal transitions regardless. But a dashboard where multiple analysts work on the same discrepancy simultaneously needs the 409 to prevent one analyst's context from silently overwriting another's.
Making optimistic locking mandatory would have been "safer" in the abstract. It also would have forced every client to track sequence numbers, even for simple, low-contention workflows. The state machine already prevents truly invalid transitions: you can't resolve an open discrepancy, you can't acknowledge one that's already resolved. The optimistic lock adds a second layer for the subset of clients that operate under concurrent access.
I was wrong about where the concurrency pressure would come from. I expected the escalation engine to be the primary source of conflicts because it processes discrepancies in bulk. It turned out that concurrent human analysts were the more common case. The engine runs on a 15-minute cycle and processes each tenant's rules sequentially. Two people clicking "investigate" on the same discrepancy within seconds of each other is harder to serialize than a scheduled batch job.
What the Trail Actually Proves
The system uses 7 event types spanning the full lifecycle: discrepancy.received, discrepancy.acknowledged, discrepancy.investigation_started, discrepancy.note_added, discrepancy.escalated, discrepancy.resolved, and discrepancy.auto_closed. Each event carries the actor (who performed it), the actor_type (system, user, or escalation engine), and a JSONB payload with specifics like resolution type (match_found, false_positive, manual_adjustment, write_off) or escalation reason.
When someone asks "what happened to discrepancy #4721?", the answer is a query returning an ordered list of immutable events. The sequence_num provides ordering. The actor field provides attribution. The payload provides context. And the application that wrote the result set has no code path that could have modified it afterward: the modification functions were never written.
The 346 tests include 36 dedicated state machine test cases covering every valid transition and every invalid attempt. The domain layer is the most heavily tested module because it encodes the compliance rules that everything else depends on.
If your compliance audit trail can be edited by the same system that produces it, it's a log, not a ledger. The distinction matters the first time someone asks for proof.