
Subscription Lifecycle Engine

8 min read · Kingsley Onoh · View on GitHub

Turning Stripe Payment Failures Into Recovered Revenue

The Situation

A customer's credit card declines on the third month of a subscription. Stripe records the failed charge, fires a webhook event, and moves on. Without automation, that failed payment sits in a dashboard nobody checks daily. Two more billing cycles fail. The customer, who would have updated their card details after a single notification, is gone. The SaaS operator discovers the gap three months later during a revenue review, staring at a churn number that did not need to be that high.

This is involuntary churn: subscribers who leave not because they wanted to, but because a payment failed and nobody followed up. For operators running on Stripe, the payment processing is handled. What is not handled is the lifecycle around it: tracking which subscriptions are in trial, which just went past due, how many retries have been attempted, whether the customer was notified, and what the revenue impact looks like across the entire subscriber base.

The Subscription Lifecycle Engine sits between Stripe and the rest of the business. It catches every payment event, runs it through an 8-state subscription state machine, triggers automated dunning with configurable retry escalation, computes daily revenue metrics, and pushes state changes to four downstream services. One event stream in. Orchestrated business logic out.

The Cost of Doing Nothing

For a SaaS operator with 500 active subscribers averaging $50 per month, a typical 5% card failure rate means 25 failed charges every billing cycle. Without automated retry and notification, industry benchmarks suggest roughly 40-60% of those customers churn involuntarily. At the midpoint, that is roughly 12 lost subscribers per month, or about $7,200 in annual revenue that was recoverable with a timely retry and a well-timed email.
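The arithmetic above, as a quick sketch. Module and function names are illustrative; the lost-subscriber count is rounded down to whole subscribers, matching the article's figure of 12:

```elixir
defmodule ChurnMath do
  # Failed charges per cycle, subscribers lost (rounded down to whole
  # subscribers), and the annualized revenue impact of one month's cohort.
  def lost_annual_revenue(subscribers, price, fail_rate, churn_rate) do
    failed = subscribers * fail_rate
    lost = trunc(failed * churn_rate)
    lost * price * 12
  end
end

# 500 subscribers, $50/mo, 5% failure rate, 50% midpoint churn:
# ChurnMath.lost_annual_revenue(500, 50, 0.05, 0.5) #=> 7200
```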

The labor cost compounds on top. Manually monitoring the Stripe dashboard for failed payments, emailing affected customers, tracking retry status in a spreadsheet, and deciding when to give up on a delinquent subscription takes 15 to 30 minutes per case. At 25 cases per month, that is 6 to 12 hours of operations work that scales linearly with subscriber count. For a team without a dedicated billing operations person, this work either gets done inconsistently (some customers get an email, some don't) or it doesn't get done at all.

The invisible cost: no centralized view of MRR, churn rate, or dunning recovery across tenants. Revenue metrics live in Stripe's dashboard, siloed from the rest of the business intelligence stack.

What I Built

An event-driven service in Elixir that processes Stripe webhook events through a structured pipeline: receive, deduplicate, route, process, and emit. The engine is multi-tenant (API key authentication, tenant-scoped queries), and every ecosystem integration is feature-flagged so the service runs fully standalone.
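A minimal sketch of that receive / deduplicate / route / process / emit shape, assuming invented module names and a stubbed processor — none of this is the engine's actual API:

```elixir
defmodule DunningProcessor do
  # Stub processor standing in for the real event handlers.
  def process(event), do: {:ok, {:dunning_started, event.id}}
end

defmodule EventPipeline do
  # The pipeline as a with-chain: each stage either passes the event
  # forward or short-circuits with a skip reason.
  def handle(event, seen \\ MapSet.new()) do
    with {:ok, event} <- deduplicate(event, seen),
         {:ok, processor} <- route(event),
         {:ok, result} <- processor.process(event) do
      emit(result)
    else
      {:skip, reason} -> {:skipped, reason}
    end
  end

  defp deduplicate(%{id: id} = event, seen) do
    if MapSet.member?(seen, id), do: {:skip, :duplicate}, else: {:ok, event}
  end

  # Route only the event types the engine cares about.
  defp route(%{type: "invoice.payment_failed"}), do: {:ok, DunningProcessor}
  defp route(_event), do: {:skip, :unhandled}

  defp emit(result), do: {:ok, result}
end
```

The dedup set here is in-memory for the sketch; a real multi-tenant service would check processed event IDs against the database.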

The dunning engine is the revenue recovery layer. When a subscription enters past_due, the engine creates a retry sequence: attempt payment at day 1, day 3, day 5, and day 7. Each attempt checks Stripe first (did the customer already pay?), retries the charge, and escalates the notification channel if the payment fails again. Attempts 1 and 2 send email, attempt 3 escalates to Telegram, and the final attempt sends on both channels. If all four retries fail, the engine cancels the Stripe subscription immediately and marks the subscriber as churned.
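The escalation schedule reduces to a pure lookup. The module, interval list, and channel atoms below are illustrative, following the 1/3/5/7-day defaults and the channel progression described above:

```elixir
defmodule Dunning do
  # Default retry offsets in days from the first failure.
  @intervals_days [1, 3, 5, 7]
  @max_attempts length(@intervals_days)

  # Day offset for a given attempt number (1-based).
  def retry_day(attempt), do: Enum.at(@intervals_days, attempt - 1)

  # Channel escalation: email first, then Telegram, then both.
  def channels(attempt) when attempt in [1, 2], do: [:email]
  def channels(3), do: [:telegram]
  def channels(@max_attempts), do: [:email, :telegram]

  # After the final attempt, the subscription is canceled and churned.
  def exhausted?(attempt), do: attempt >= @max_attempts
end
```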

The metrics layer computes daily MRR, churn rate, and average revenue per user across all tenants. These snapshots feed downstream dashboards and keep the business picture current without anyone logging into Stripe.
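A sketch of the snapshot math, assuming a simple in-memory list of subscriptions with `status` and `monthly_price` fields — the field names and the churn-rate denominator here are my assumptions, not the engine's:

```elixir
defmodule Metrics do
  # One day's snapshot: MRR sums active subscriptions, ARPU divides MRR by
  # the active count, churn rate is churned / (churned + still active).
  def snapshot(subscriptions, churned_this_period) do
    active = Enum.filter(subscriptions, &(&1.status == :active))
    mrr = active |> Enum.map(& &1.monthly_price) |> Enum.sum()
    count = length(active)
    base = count + churned_this_period

    %{
      mrr: mrr,
      arpu: if(count > 0, do: mrr / count, else: 0.0),
      churn_rate: if(base > 0, do: churned_this_period / base, else: 0.0)
    }
  end
end
```

In the real service these would be tenant-scoped Ecto queries rather than list traversals, but the shape of the computation is the same.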

Building the state machine was the hardest part. Stripe's lifecycle model and the engine's internal model overlap but don't align perfectly. I had to design a system that accepted Stripe's data without blindly accepting every state change. The result was a transition validator that strips invalid status changes from the webhook payload while keeping everything else (period dates, plan changes, metadata) intact. Data completeness without state corruption.
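The validator idea can be sketched like this. The transition map below is a hypothetical subset for illustration, not the engine's real 15-transition table:

```elixir
defmodule TransitionValidator do
  # Hypothetical subset of allowed transitions, keyed by current status.
  @allowed %{
    "trialing" => ["active", "canceled"],
    "active" => ["past_due", "paused", "canceled"],
    "past_due" => ["active", "unpaid", "canceled"],
    "paused" => ["active", "canceled"]
  }

  # Accept the whole Stripe payload, but strip a status change the state
  # machine rejects — period dates, plan, and metadata survive intact.
  def sanitize(current_status, %{"status" => new_status} = payload) do
    if new_status in Map.get(@allowed, current_status, []) do
      payload
    else
      Map.delete(payload, "status")
    end
  end

  def sanitize(_current_status, payload), do: payload
end
```

This is also how the paused-subscription edge case mentioned later would play out: a past_due event against a paused subscription keeps its billing data but loses the invalid status flip.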

(Diagrams: System Flow · Data Model · Architecture Layers)

The Decision Log

Decision: Elixir on OTP over TypeScript/Fastify
Alternative rejected: TypeScript with BullMQ (used in other portfolio services)
Why: Webhook processing is a concurrency problem. OTP supervision trees restart crashed event processors automatically. The BEAM scheduler handles 48 concurrent workers without thread pool configuration. A Node event loop crash is fatal; an OTP process crash is a restart.

Decision: PostgreSQL-backed Oban over a Redis job queue
Alternative rejected: Exq and other Sidekiq-style Redis queues
Why: One database handles application state, job scheduling, cron triggers, and retry tracking. Removing Redis from the stack means one fewer service to monitor, back up, and keep alive on a single VPS.

Decision: Configurable dunning intervals over a fixed retry schedule
Alternative rejected: Hardcoded 24/48/72-hour retries
Why: Different businesses have different tolerance for payment follow-up. The retry intervals (default: 1, 3, 5, 7 days) and max attempts (default: 4) are configurable via environment variables. A B2C operator might retry aggressively at 12-hour intervals; an enterprise B2B operator might wait a week between attempts.

Decision: Feature-flagged ecosystem integrations over mandatory connections
Alternative rejected: Hardwired HTTP calls to Notification Hub, Workflow Engine, Recon Engine, and Client Portal
Why: The engine must run standalone on first deployment, before any ecosystem service is onboarded. Each integration defaults to disabled; when disabled, the outbound call is replaced with a log message. No HTTP timeout, no connection error, no dependency on external service availability for core functionality.

Decision: Fire-and-forget notification dispatch over synchronous delivery
Alternative rejected: Blocking notification calls with inline retry
Why: A notification failure must never prevent a subscription state transition from completing. If the Notification Hub is unreachable, the state change persists to the database, the dunning sequence continues, and the notification is silently skipped. The Hub has its own delivery retry infrastructure.
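One way the configurable intervals from the decision log could be read from environment variables, with the stated defaults. The variable names `DUNNING_RETRY_DAYS` and `DUNNING_MAX_ATTEMPTS` are assumptions:

```elixir
defmodule DunningConfig do
  # Comma-separated day offsets, defaulting to the documented 1,3,5,7.
  def retry_intervals_days do
    "DUNNING_RETRY_DAYS"
    |> System.get_env("1,3,5,7")
    |> String.split(",")
    |> Enum.map(&String.to_integer/1)
  end

  # Maximum dunning attempts before the subscription is canceled.
  def max_attempts do
    "DUNNING_MAX_ATTEMPTS" |> System.get_env("4") |> String.to_integer()
  end
end
```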

Ecosystem Integration

Stripe events don't arrive directly. They flow through the Webhook Ingestion Engine I built for exactly this pattern: receive webhooks from external providers, verify signatures, persist the raw payload, and fan out to registered destinations. The Subscription Lifecycle Engine is a destination. It receives pre-validated events via authenticated HTTP POST and never needs to handle Stripe signature verification itself.
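What the destination side of that handoff might look like, reduced to its authentication check — a sketch with invented names, on the assumption that the real service does this in a Phoenix plug and trusts the ingestion engine's signature verification:

```elixir
defmodule WebhookEndpoint do
  # Accept a pre-validated event if the tenant API key checks out;
  # there is no Stripe signature verification here by design.
  def handle(%{headers: headers, body: event}, valid_keys) do
    with {:ok, key} <- Map.fetch(headers, "x-api-key"),
         true <- MapSet.member?(valid_keys, key) do
      {:accepted, event["type"]}
    else
      _ -> {:error, :unauthorized}
    end
  end
end
```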

When the dunning engine detects a failed payment, it pushes escalating alerts through a notification hub that handles channel routing (email via Resend, Telegram via Bot API) and delivery tracking. Payment routing decisions pass through a workflow engine that executes DAG-based logic. Paid invoice batches sync to a reconciliation engine for settlement matching every six hours. Daily MRR and churn snapshots push to a client portal for business intelligence.

Full breakdowns of each connected system: the Webhook Ingestion Engine that delivers Stripe events, the Notification Hub that routes dunning alerts, the Workflow Automation Engine that handles payment routing DAGs, and the Transaction Reconciliation Engine that matches invoice settlements.

Results

Before the engine: failed payments require manual monitoring of Stripe's dashboard, ad-hoc customer outreach via email, no centralized retry tracking, and revenue metrics computed by exporting CSVs. Dunning is inconsistent or absent. Churn is discovered after the fact.

After: a 7-day automated recovery window with four escalating retries across two notification channels. Daily MRR, churn rate, and ARPU snapshots computed per tenant and pushed to downstream dashboards. 22 API endpoints covering subscriptions, customers, invoices, dunning attempts, plans, metrics, health, and tenant management. 655 passing tests covering the full lifecycle from trial creation through involuntary churn, including Stripe edge cases like paused subscriptions receiving past-due events.

The system handles the full subscription lifecycle (trial, active, past due, paused, unpaid, canceled) through 15 validated state transitions. Eight background jobs run on configurable cron schedules, including hourly dunning escalation, daily metrics computation, daily metrics push, daily stale event cleanup, 6-hourly invoice reconciliation sync, and daily trial expiration checks.
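Assuming the cron schedules ride on Oban's cron plugin (which the decision log suggests), the job table could look like the config fragment below. All worker module names and exact run times are hypothetical; only the crontab tuple shape is Oban's standard one:

```elixir
import Config

config :lifecycle_engine, Oban,
  repo: LifecycleEngine.Repo,
  queues: [events: 48],
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       {"0 * * * *", DunningEscalationWorker},    # hourly dunning escalation
       {"0 1 * * *", MetricsComputationWorker},   # daily metrics computation
       {"0 2 * * *", MetricsPushWorker},          # daily metrics push
       {"0 3 * * *", StaleEventCleanupWorker},    # daily stale event cleanup
       {"0 */6 * * *", ReconciliationSyncWorker}, # 6-hourly invoice sync
       {"0 4 * * *", TrialExpirationWorker}       # daily trial expiration
     ]}
  ]
```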

At 10x subscriber volume, the first constraint is the PostgreSQL connection pool (currently 10 connections shared across 48 concurrent workers). Increasing the pool size and adding a read replica for metrics queries would handle the next order of magnitude. The core event processing pipeline, the state machine, and the dunning retry logic would remain unchanged.
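In standard Ecto terms, the pool remediation described above is a one-line config change; the app and repo names here are assumptions:

```elixir
import Config

# Make the Repo pool size env-tunable; the default mirrors the current 10.
config :lifecycle_engine, LifecycleEngine.Repo,
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")
```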

#elixir #phoenix #stripe #subscription-management #dunning #saas

