
Subscription Lifecycle Engine

6 min read · Kingsley Onoh

Architectural Brief: Subscription Lifecycle Engine

Stripe handles payments. It doesn't track where each subscriber sits in their lifecycle, chase failed charges with escalating urgency, compute monthly recurring revenue, or coordinate state changes across downstream services. The binding constraint for this system: every piece of subscription state originates from a Stripe webhook event, and losing or misprocessing a single event means silent data drift between Stripe and the business. Built on Elixir and OTP because the BEAM runtime treats concurrent event processing and fault recovery as language-level primitives, not libraries bolted on after the fact.

System Topology

Infrastructure Decisions

  • Language and Runtime: Elixir 1.17 on OTP 27. Chose over Node.js/TypeScript (used elsewhere in the portfolio) because OTP supervision trees restart crashed processes automatically, and the BEAM scheduler handles 48 concurrent background workers without thread pool tuning. A webhook processor that crashes mid-event gets restarted by its supervisor. In Node, that crash takes down the event loop.
  • Framework: Phoenix 1.8 in API-only mode. No LiveView, no assets pipeline, no channel layer. Phoenix provides the request pipeline, JSON encoding, and release tooling without the weight of a full-stack framework. Bandit replaces Cowboy as the HTTP server: pure Elixir, no NIF dependencies.
  • Database: PostgreSQL 16 with 8 application tables. Chose over SQLite because the service needs JSONB columns for Stripe event payloads, CHECK constraints for state machine enforcement, partial indexes for hot-path queries (unprocessed events, unsynced invoices), and compound unique constraints scoped per tenant. UUIDv4 primary keys across all tables.
  • Job Queue: Oban 2.18, backed by PostgreSQL. Chose over Redis-backed alternatives (Exq, Sidekiq-style queues) because Oban eliminates an entire infrastructure dependency. The job queue, cron scheduler, and retry mechanism all live in the same PostgreSQL instance as the application data. Five named queues with separate concurrency limits: webhooks (20), default (10), ecosystem (10), dunning (5), metrics (3).
  • HTTP Client: Req 0.5 for all outbound ecosystem calls. Chose over HTTPoison and Tesla because Req ships with connection pooling, retry, and JSON encoding as defaults. Hackney remains as a transitive dependency required by stripity_stripe for Stripe API communication.
  • Stripe Integration: stripity_stripe 3.2 wrapping the Stripe REST API. Covers invoice retry, subscription cancellation, and entity fetch. The alternative (raw HTTP calls via Req) would mean reimplementing Stripe's pagination, error normalization, and rate limit handling from scratch.
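The five-queue layout above maps directly onto Oban's configuration. A minimal sketch — the OTP app name, repo, and worker module are hypothetical stand-ins, not the production names:

```elixir
# config/config.exs — illustrative sketch; :lifecycle_engine and the
# module names are assumptions, not the actual application.
import Config

config :lifecycle_engine, Oban,
  repo: LifecycleEngine.Repo,
  # Five named queues with the concurrency limits from the brief.
  queues: [webhooks: 20, default: 10, ecosystem: 10, dunning: 5, metrics: 3],
  plugins: [
    # Daily revenue-metrics rollup at 02:00 UTC.
    {Oban.Plugins.Cron,
     crontab: [{"0 2 * * *", LifecycleEngine.Workers.DailyMetrics}]}
  ]
```

Because Oban queues are just keyword entries, the per-queue limits (20 concurrent webhook workers, 5 dunning workers, and so on) live in one place alongside the cron schedule — no separate scheduler process to deploy.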

Constraints That Shaped the Design

  • Input: Stripe webhook events delivered by the portfolio's Webhook Ingestion Engine via authenticated POST to the webhook handler endpoint. The engine doesn't poll Stripe or initiate event discovery. Events arrive when Stripe decides to send them.
  • Output: Subscription state changes persisted to PostgreSQL. Revenue metrics (MRR, churn, ARPU) computed daily at 02:00 UTC. State change notifications dispatched to four ecosystem services, all feature-flagged.
  • Scale Handled: 48 concurrent Oban workers sharing a 10-connection PostgreSQL pool. The webhook queue alone processes 20 events concurrently. At sustained throughput beyond approximately 100 concurrent webhook events with simultaneous API reads, the connection pool becomes the bottleneck. Horizontal scaling requires enabling BEAM distribution (currently set to RELEASE_DISTRIBUTION=none) and adopting Oban Pro's SmartEngine for multi-node job coordination.
  • Hard Constraints: Webhook rate limit at 500 events per minute per tenant, enforced via ETS-backed sliding window counters. Read endpoints capped at 100 per minute, writes at 20 per minute, registration at 5 per minute per IP. Stripe's own rate limit is respected with a Retry-After header handler and a 2-second backoff on timeout.
  • Idempotency: Every webhook event is stored with a composite idempotency key formatted as tenant_id:stripe_event_id. A database-level unique constraint prevents duplicate processing. A three-state check (new, processing, duplicate) handles the race condition where two Oban workers pick up the same event simultaneously.
  • Tenant Isolation: API key authentication via SHA-256 hash lookup with a 5-minute ETS cache. Every database query includes WHERE tenant_id = ^tenant.id. All Stripe identifier uniqueness constraints (subscription ID, invoice ID, customer ID) are compound with tenant_id.
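The ETS-backed limiter from the hard-constraints list fits in a few lines. This sketch is a fixed-window simplification (counters keyed by tenant and window index, so a new window starts from zero); the module name and table layout are assumptions, not the production code:

```elixir
defmodule RateLimiter do
  @moduledoc "Per-tenant windowed counters in ETS. Illustrative sketch only."

  @limit 500          # webhook events per minute per tenant
  @window_ms 60_000

  # Create the shared counter table once at application start.
  def init do
    :ets.new(:rate_limits, [:named_table, :public, :set, write_concurrency: true])
  end

  # Returns true while the tenant is under the limit for the current window.
  def allow?(tenant_id, now_ms \\ System.system_time(:millisecond)) do
    key = {tenant_id, div(now_ms, @window_ms)}
    # Atomically increments the counter, inserting {key, 0} if absent.
    :ets.update_counter(:rate_limits, key, {2, 1}, {key, 0}) <= @limit
  end
end
```

In a real deployment something like the StaleCleanupJob pattern would also sweep expired window keys out of the table; otherwise rows for past windows accumulate.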

Decision Log

  • ETS-backed sliding window rate limiter, over Redis or database counters. Rejected: ExRated, Hammer, or a plug-based token bucket backed by Redis. ETS counters reset per window with zero network round-trips, while Redis-backed limiters add a dependency the rest of the stack deliberately avoids. At 500 events per minute per tenant, the rate limiter sits on the hottest code path in the system; microsecond ETS reads versus millisecond Redis reads is a meaningful difference at that frequency.
  • Fire-and-forget notification dispatch (always returns :ok). Rejected: synchronous notification with inline retry. A failed notification must never block a subscription state transition. If the Notification Hub is unreachable, the state change still completes and persists; notification failures are logged but not propagated upstream. The Hub has its own retry infrastructure.
  • Status stripping on invalid state machine transitions. Rejected: rejecting the entire webhook event. Stripe bundles status changes with period date updates, metadata, and plan modifications in a single event payload, so rejecting the event on an invalid transition loses all non-status data. Stripping only the invalid status and processing the rest preserves data completeness while enforcing state machine invariants.
  • Behaviours + Application.get_env for dependency injection. Rejected: compile-time module attributes or protocol dispatch. Runtime module resolution lets test configuration swap real HTTP clients for Mox mocks without conditional compilation. Combined with feature flags, each ecosystem integration has four testable states: disabled+real, disabled+mock, enabled+real, enabled+mock.
  • Integer division for MRR normalization on yearly plans. Rejected: decimal arithmetic with sub-cent precision. A $299.99/year plan computes to 2,499 cents/month via div(29999, 12); maximum precision loss is 11 cents per yearly subscription. Integer-only arithmetic keeps MRR calculations simple for a metrics dashboard. This is not a billing system.
  • 90-day event payload retention with metadata preservation. Rejected: indefinite JSONB storage or full row deletion. Stripe webhook payloads are large JSONB blobs. After 90 days, the StaleCleanupJob replaces the payload with an empty map, but the event metadata (type, timestamps, idempotency key, processing status) persists indefinitely. Storage growth is bounded and the audit trail survives.
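The integer-division normalization from the decision log fits in two function clauses. A sketch, assuming amounts arrive in cents alongside Stripe's interval strings — the module and function names are hypothetical:

```elixir
defmodule MRR do
  @moduledoc "Normalizes plan amounts to monthly cents. Illustrative sketch."

  # Monthly amounts pass through unchanged.
  def monthly_cents(amount_cents, "month"), do: amount_cents
  # Yearly amounts divide by 12; the remainder (at most 11 cents) is discarded.
  def monthly_cents(amount_cents, "year"), do: div(amount_cents, 12)
end

# A $299.99/year plan normalizes to 2_499 cents of MRR.
MRR.monthly_cents(29_999, "year")
```

The worst case is an amount 11 cents above a multiple of 12, and that error never compounds: each subscription is normalized independently before the daily rollup sums them.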
#elixir #phoenix #oban #postgresql #stripe #event-driven
