When Your Webhook Handler Goes Down at 2 AM
The Situation
Every engineering team that integrates with Stripe, GitHub, Shopify, or any service that emits webhooks ends up building the same thing: a handler that receives the POST, validates the signature, and processes the payload. Each team builds it inside the consuming service. Each team writes its own retry logic. Each team discovers, eventually, that their handler was silently dropping events during the outage nobody noticed until a customer called.
For a mid-size SaaS running five to eight webhook integrations, this means five to eight separate receivers. Each one has slightly different error handling. None of them share an audit trail. And there is no way to answer the question that matters most during an incident: "Did we receive that payment confirmation from Stripe yesterday, or did it disappear?"
Operations gets no visibility. When a delivery fails at 2 AM, the failure is silent. The first signal is a customer calling about a missing order, a sync that didn't run, or a notification that never arrived.
The Cost of Doing Nothing
Each custom webhook receiver takes 30 to 50 developer hours to build properly: signature verification, retry logic, error logging, idempotency handling. Multiply that across five integrations and the initial build cost alone is 150 to 250 hours of engineering time.
Maintenance compounds it. Format changes from upstream providers break handlers silently. Failed deliveries require manual investigation. Missed events need reconciliation with the source. For a team paying European market rates for backend developers, the annual cost of maintaining these receivers and debugging their failures runs roughly €10,000 to €15,000 in developer time, before counting the revenue risk of a missed payment webhook or a dropped inventory update.
The failure mode is always the same: the handler goes down, events accumulate on the sending side, the sender retries a few times, then gives up. By the time anyone notices, the events are gone. No record of what was received. No way to replay it.
What I Built
A centralized webhook gateway that sits in front of all inbound webhooks. External services POST their events to a single endpoint. The engine validates signatures, persists every payload to PostgreSQL before processing, and fans out delivery to configured downstream destinations using a Redis-backed job queue with 10 concurrent workers.
The core design decision is persist-before-process. Every webhook payload hits the database before a single delivery job is enqueued. If Redis goes down, delivery slows, but no data is lost. The database is the source of truth. Delivery is a side effect.
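The ordering can be sketched with two injected interfaces; the names (`EventStore`, `DeliveryQueue`, `ingest`) are illustrative, not the real schema or API:

```typescript
// Minimal sketch of persist-before-process. Interface and field names
// are assumptions for illustration, not the engine's actual types.
interface EventStore {
  insert(sourceId: string, payload: string): Promise<number>; // returns event id
}
interface DeliveryQueue {
  enqueue(eventId: number): Promise<void>;
}

// The database write happens first; a queue failure degrades to
// "delivery delayed", never "event lost".
async function ingest(
  store: EventStore,
  queue: DeliveryQueue,
  sourceId: string,
  payload: string,
): Promise<{ eventId: number; enqueued: boolean }> {
  const eventId = await store.insert(sourceId, payload); // source of truth
  try {
    await queue.enqueue(eventId); // side effect; may fail if Redis is down
    return { eventId, enqueued: true };
  } catch {
    // Event is safe in PostgreSQL; a sweeper can re-enqueue it later.
    return { eventId, enqueued: false };
  }
}
```

Because the two steps are ordered rather than atomic, the worst case is a persisted event with no job, which a periodic sweep can recover; the reverse failure mode (a job referencing an event that was never stored) cannot happen.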
Failed deliveries don't disappear. After five retry attempts with exponential backoff (1, 2, 4, 8, then 16 seconds), the delivery enters a dead-letter queue. Operations can inspect what failed, see the HTTP status code and the destination's response body, and retry with a single API call. For bulk recovery after an outage, the replay engine re-processes any range of historical events through the full delivery pipeline in batches of 100.
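The 1-2-4-8-16 schedule is what BullMQ's exponential backoff produces for a 1-second base delay; a small sketch of that schedule and the dead-letter cutoff (constant names here are illustrative):

```typescript
// Retry schedule sketch: BullMQ's { type: "exponential", delay: 1000 }
// doubles the delay on each attempt. Constant names are illustrative.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 1000;

function backoffDelayMs(attempt: number): number {
  // attempt is 1-based: attempt 1 waits 1s, attempt 5 waits 16s.
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

function retrySchedule(): number[] {
  return Array.from({ length: MAX_ATTEMPTS }, (_, i) => backoffDelayMs(i + 1));
}
// retrySchedule() → [1000, 2000, 4000, 8000, 16000]

function shouldDeadLetter(attempt: number): boolean {
  // After the final attempt fails, the delivery moves to the DLQ
  // instead of being retried again.
  return attempt >= MAX_ATTEMPTS;
}
```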
The design is source-agnostic. The engine doesn't know what Stripe is. Signature algorithm, header name, and signing secret are stored per-source in the database. Adding Stripe, GitHub, or a custom internal service is one API call to create a source row. No code changes, no deployments, no new adapter classes.
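A hedged sketch of what per-source verification looks like with Node's `crypto` module; the `SourceConfig` fields are illustrative stand-ins for the source row's columns:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Per-source config as it might be stored in the database.
// Field names are assumptions for illustration.
interface SourceConfig {
  signatureHeader: string;      // e.g. "X-Hub-Signature-256" for GitHub; the caller reads this header
  algorithm: "sha256" | "sha1"; // HMAC algorithm for this source
  secret: string;               // decrypted signing secret
  prefix?: string;              // e.g. "sha256=" for GitHub-style signatures
}

function verifySignature(cfg: SourceConfig, rawBody: string, headerValue: string): boolean {
  // Compute the expected signature from the raw (unparsed) request body.
  const expected =
    (cfg.prefix ?? "") +
    createHmac(cfg.algorithm, cfg.secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(headerValue);
  // Constant-time comparison; timingSafeEqual throws on length mismatch,
  // so guard the lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because algorithm, header name, and prefix all come from the config row, the same function serves Stripe-style, GitHub-style, and custom HMAC sources without a provider-specific code path.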
The part that took the most iteration was delivery security. The engine forwards payloads to URLs that tenants register. A destination URL that passed validation at creation time can resolve to a private IP address weeks later if the tenant's DNS records change. The fix required two layers of SSRF protection: static URL validation at registration plus DNS resolution and IP range checking on every delivery.
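The delivery-time layer can be sketched as a DNS re-resolution plus a reserved-range check; this is a simplified IPv4-only version (a production check also needs the IPv6 equivalents), and the function names are illustrative:

```typescript
import { lookup } from "node:dns/promises";

// Layer 2 of the SSRF defence: reject private/reserved IPv4 ranges.
// Unparseable input fails closed. IPv4 only for brevity.
function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((n) => !Number.isInteger(n) || n < 0 || n > 255)) {
    return true; // fail closed on anything that isn't a clean dotted quad
  }
  const [a, b] = parts;
  return (
    a === 0 ||                            // 0.0.0.0/8 "this network"
    a === 10 ||                           // 10.0.0.0/8
    a === 127 ||                          // loopback
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 169 && b === 254)              // link-local, incl. cloud metadata endpoints
  );
}

// Re-resolve the destination host on EVERY delivery, not just at
// registration, so a later DNS change to a private IP is caught.
async function assertSafeDestination(url: string): Promise<void> {
  const host = new URL(url).hostname; // layer 1 already validated scheme/shape
  const { address, family } = await lookup(host);
  if (family !== 4 || isPrivateIPv4(address)) {
    throw new Error(`destination ${host} resolves to blocked address ${address}`);
  }
}
```

One residual gap worth noting: a check-then-connect sequence can still be raced by a fast-flux DNS record, so the strictest fix is to pin the connection to the vetted IP rather than letting the HTTP client resolve the host again.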
System Flow
Data Model
Architecture Layers
The Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| Persist every webhook to PostgreSQL before enqueuing delivery | Fire-and-forget through Redis | Redis failure between enqueue and delivery means permanent data loss. The database write adds roughly 10ms but guarantees nothing is silently dropped. |
| Source-agnostic configuration (signature algo, header, secret per source row) | Provider-specific handler classes for Stripe, GitHub, Shopify | Adding a new provider should be a database row, not a code deployment. Provider-specific classes accumulate format-specific logic that breaks when upstream APIs change. |
| BullMQ over Kafka for the delivery queue | Apache Kafka, NATS | The throughput target is 500 events per minute. Kafka's cluster management and partition rebalancing add operational cost that only pays off above 10,000 events/min. |
| Per-destination retry configuration | Global retry settings across all destinations | A fast internal microservice needs 3-second timeouts with 2 retries. A slow third-party API needs 10-second timeouts with 5 retries. One global setting punishes one or the other. |
| AES-256-GCM encryption for signing secrets at rest | Plaintext storage in the database | A database breach shouldn't hand the attacker the ability to forge webhook signatures for every configured source. |
| Redis noeviction policy with 256MB cap | LRU eviction | Silently evicting queued delivery jobs defeats the at-least-once guarantee. Loud failure (429 backpressure) is the correct response to memory pressure. |
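The secrets-at-rest decision above can be sketched with Node's built-in AES-256-GCM support; key management (where the 32-byte key lives) is out of scope here, and the storage layout (IV + auth tag + ciphertext in one base64 column) is an illustrative assumption:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Sketch of signing-secret encryption at rest with AES-256-GCM.
// The concatenated iv|tag|ciphertext layout is an illustrative choice.
function encryptSecret(key: Buffer, plaintext: string): string {
  const iv = randomBytes(12); // 96-bit nonce, the standard GCM size
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Store iv (12B) + auth tag (16B) + ciphertext together in one column.
  return Buffer.concat([iv, cipher.getAuthTag(), ct]).toString("base64");
}

function decryptSecret(key: Buffer, stored: string): string {
  const buf = Buffer.from(stored, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ct = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // GCM authenticates: tampered ciphertext throws
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

GCM's authentication tag is what makes this mode the right fit: a tampered or truncated database value fails loudly at decryption instead of yielding a silently corrupted signing secret.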
Ecosystem Integration
The webhook engine doesn't just deliver to arbitrary URLs. It feeds two other systems in the portfolio. Event notifications, from payment confirmations to status changes, flow through the delivery pipeline into a notification hub that handles channel routing (email, SMS, push) without the webhook engine knowing or caring which channel a user prefers. Workflow triggers flow into an automation engine that branches, transforms, and chains actions based on the event payload. The webhook engine's only job is reliable delivery. What happens downstream is the downstream system's responsibility.
Full breakdown of the notification architecture: www.kingsleyonoh.com/projects/event-driven-notification-hub
Full breakdown of the workflow engine: www.kingsleyonoh.com/projects/workflow-automation-engine
Results
Before the gateway: each new webhook integration cost 30 to 50 developer hours to build and carried its own retry logic, error handling, and silent failure modes. Five integrations meant five audit trails that didn't talk to each other and roughly €10,000 to €15,000 a year in maintenance.
After: a single ingestion endpoint that persists every payload, verifies signatures against per-source configuration, and delivers to multiple destinations with exponential backoff and dead-lettering. Every event is queryable and replayable for 90 days. The management API lets operations add sources and destinations and inspect delivery history without touching code.
The system handles 500 events per minute sustained, with burst capacity up to the 1,000-per-minute rate limit. What took 30 to 50 hours of custom handler code per integration is now a single POST to the management API. A second POST adds a delivery destination.
509 tests validate the full pipeline, including signature verification for Stripe, GitHub, and custom HMAC formats, against real PostgreSQL and real Redis. The test suite took longer to build than the delivery worker. That's usually a sign the delivery worker is doing the right amount of work.