The first three services replicated without incident. Super ID had 2 tables. Floorplan had 4. Each subscription created its background worker, copied the initial data, and settled into steady-state streaming. I had 3 active subscriptions consuming 3 of the 6 available worker slots on the Central DB. Plenty of room.
Then I added the Rightmove scraper. 58 tables.
The subscription sat in "initializing" state for two hours. No errors in the application logs. No alerts. The pg_stat_subscription view showed a PID, which meant the worker was alive. But pg_subscription_rel told the real story: every table was stuck at i (initializing), none had transitioned to r (ready). The data wasn't flowing.
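The per-table state lives in pg_subscription_rel, so the distribution can be checked with one query (state codes per the PostgreSQL catalog documentation):

```sql
-- i = initialize, d = copying data, s = synchronized, r = ready
SELECT srsubstate, count(*) AS tables
FROM pg_subscription_rel
GROUP BY srsubstate;
```

A healthy subscription converges to a single row with srsubstate = 'r'; a stuck one shows everything piled up in 'i'.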
What "Initializing" Actually Means
PostgreSQL logical replication has two phases. Phase one is the initial table copy: the subscriber connects to the publisher and copies every row from every published table. Phase two is the streaming phase: ongoing WAL changes flow through the replication slot.
The initial copy is where the problem lived. For a 58-table subscription, PostgreSQL doesn't copy tables sequentially by default. It spawns temporary table sync workers, one per table, to parallelize the initial load. Each of those temporary workers needs a slot in max_worker_processes.
The Central DB was a Supabase Small instance with max_worker_processes = 6. Three slots were permanently occupied by the existing subscriptions (Super ID, Floorplan, Image Condition). One slot went to the new Rightmove subscription's apply worker. That left 2 slots for temporary table sync workers. Two workers trying to copy 58 tables. The queue backed up, PostgreSQL's internal scheduler couldn't allocate new workers, and the whole process stalled.
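The ceilings involved can be inspected directly. Beyond max_worker_processes, PostgreSQL also caps replication parallelism with two dedicated parameters whose defaults (shown in the comments, for recent versions) are low:

```sql
-- Worker ceilings on the subscriber (Central DB)
SHOW max_worker_processes;               -- hard cap on all background workers
SHOW max_logical_replication_workers;    -- apply + table sync workers (default 4)
SHOW max_sync_workers_per_subscription;  -- parallel initial-copy workers per subscription (default 2)
```

Whichever of these limits is hit first is the one that throttles the initial copy.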
The Supabase logs confirmed it: out of background worker slots.
Why I Didn't Just Increase the Worker Count
The obvious fix was to bump max_worker_processes from 6 to 20. I did that eventually. But increasing the parameter requires a database restart, and restarting the Central DB would kill all three active subscriptions. Each one would need to reconnect, re-verify its replication slot, and potentially re-sync any changes that occurred during the downtime. For a production system that was already feeding data to downstream consumers, a restart during the initial copy of a 58-table subscription was risky.
The less obvious fix was to stop fighting for resources and eliminate the competition entirely.
The Traffic Control Protocol
I disabled all three existing subscriptions:
ALTER SUBSCRIPTION sub_super_id_service DISABLE;
ALTER SUBSCRIPTION sub_floorplan_service DISABLE;
ALTER SUBSCRIPTION sub_image_condition_service DISABLE;
Disabling a subscription doesn't drop it. The replication slot on the publisher stays in place, holding the WAL position where the worker stopped. Disabling just tells the background worker to stop consuming changes and release its worker slot. Three disabled subscriptions freed three slots instantly.
Now the Rightmove subscription had 5 of 6 worker slots available for its initial copy: 1 for the apply worker, 4 for temporary table sync workers. Still not enough for 58 parallel copies, but PostgreSQL queues the remaining tables and processes them as workers become available. The copy completed in about 40 minutes.
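While the copy ran, the live workers were visible in pg_stat_activity; counting them confirms how many slots the sync phase is actually using (the LIKE pattern matches both the launcher and worker backend_type values on recent versions):

```sql
-- Live replication workers on the subscriber
SELECT pid, backend_type, wait_event_type
FROM pg_stat_activity
WHERE backend_type LIKE 'logical replication%';
```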
After the initial copy finished, I re-enabled the other three:
ALTER SUBSCRIPTION sub_super_id_service ENABLE;
ALTER SUBSCRIPTION sub_floorplan_service ENABLE;
ALTER SUBSCRIPTION sub_image_condition_service ENABLE;
Each subscription reconnected to its publisher, picked up from where it left off via the replication slot, and resumed streaming. No data loss. No re-sync needed. The replication slot on each publisher had been holding the WAL position during the downtime.
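On each publisher, that held position is visible in pg_replication_slots. A sketch of the check, with pg_wal_lsn_diff showing how much WAL the slot is retaining; restart_lsn should start advancing again once the subscriber reconnects:

```sql
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
```

The retained_bytes figure is also the cost of the downtime: WAL accumulates on the publisher for as long as a slot's consumer is disabled.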
The Type Casting Problem Nobody Warned About
The worker slot issue was loud. It produced log entries and visible symptoms. The type casting problem was silent.
The unified_property_master materialized view joins data from all four services using super_id as the anchor. But the four services store super_id in different types. The Super ID service uses uuid. Rightmove uses uuid. Floorplan stores it as text. Image Condition stores it as varchar(1028).
The SQL for the materialized view has explicit casts:
LEFT JOIN public.overall_image_analysis i
    ON i.super_id::uuid = s.super_id
LEFT JOIN (
    SELECT DISTINCT ON (super_id)
        super_id,
        event_type AS latest_status,
        created_at AS last_updated
    FROM floorplan.floorplan_events
    ORDER BY super_id, created_at DESC
) f ON f.super_id::uuid = s.super_id;
Without the ::uuid casts, the LEFT JOINs silently produce zero matches. No error. No warning. The view creates successfully, every row has NULL for condition scores and floorplan status, and unless you know what the data should look like, it looks like the data just hasn't been processed yet. The NULL values are indistinguishable from legitimately missing data.
I caught this only because I ran a sanity check query comparing the row count in super_id.generated_super_ids against the row count in unified_property_master with non-null condition scores. The numbers didn't add up. Tracing it back to the join conditions took another hour.
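The sanity check was roughly this shape (column names here are illustrative; condition_score stands in for whatever score column the view actually exposes):

```sql
SELECT
  (SELECT count(*) FROM super_id.generated_super_ids)  AS source_rows,
  (SELECT count(*) FROM unified_property_master
    WHERE condition_score IS NOT NULL)                 AS rows_with_scores;
-- if the join casts are wrong, rows_with_scores collapses toward zero
```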
The Schema Drift Time Bomb
Logical replication streams row-level changes. It does not replicate DDL. If a developer runs ALTER TABLE scrape_events ADD COLUMN retry_count integer on the Rightmove source database but forgets to add the same column to the Central DB, the subscription crashes as soon as the next change to that table streams through: the apply worker finds a replicated column it cannot map on the subscriber, errors out, and stops. Every change behind it then queues up in the slot until someone intervenes.
The fix is a process, not a technical control. We documented a "Central DB First" rule: every schema change must be applied to the Central DB before the Source DB. The one exception is CREATE TABLE, which must happen on source first so the table can be added to the publication before the subscription can subscribe to it.
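The CREATE TABLE exception has one extra step that is easy to forget: after the table exists on the source, it must be added to the publication (unless the publication is FOR ALL TABLES), and then the subscriber must refresh the subscription before the new table starts syncing. The names below are illustrative:

```sql
-- On the source DB, after CREATE TABLE:
ALTER PUBLICATION rightmove_pub ADD TABLE public.new_table;

-- On the Central DB (the table must already exist there):
ALTER SUBSCRIPTION sub_rightmove_scraper REFRESH PUBLICATION;
```

Without the REFRESH PUBLICATION, the subscriber never learns the new table exists, and it silently syncs nothing.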
This rule is enforced by documentation and fear, not by automation. There is no CI check that validates schema consistency across 5 databases. If someone forgets, replication breaks silently until someone checks pg_stat_subscription and sees a NULL PID.
What I'd Do Differently
Two things. First, I would have started with max_worker_processes = 20 from day one instead of accepting the default and discovering the limit under pressure. Idle worker slots cost almost nothing, so setting the parameter high preemptively eliminates the entire class of slot-exhaustion problems.
Second, I would build a schema drift detector. A simple cron job that compares column definitions between each source publication and the Central DB subscription tables. If a mismatch appears, fire an alert before the next DDL change breaks replication. The tooling is straightforward: information_schema.columns on both sides, a diff, an email. The reason it doesn't exist yet is that the system was built in a week and the development team is small enough that verbal communication covers the gap. That won't scale.
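A minimal version of that detector is one query run against each database, then a textual diff of the result sets (the schema filter is an assumption; widen it to cover non-public schemas like floorplan):

```sql
-- Run on each source DB and on the Central DB, then diff the outputs
SELECT table_name, column_name, data_type,
       coalesce(character_maximum_length::text, '') AS max_len
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```

Ordering by ordinal_position keeps the output stable across runs, so any diff line is a genuine drift, not a reordering artifact.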