
Webhook Ingestion Engine

6 min read · Kingsley Onoh · View on GitHub

Architectural Brief: Webhook Ingestion Engine

Every service that emits webhooks has its own signature format, its own retry expectations, and its own payload structure. The gateway that sits between them and your internal services can't know any of this in advance. It needs to accept anything, validate it against per-source configuration, persist it before processing, and deliver to multiple downstream endpoints with guaranteed at-least-once semantics. The binding constraint: adding a new webhook source must be a database insert, not a code change.

System Topology

Infrastructure Decisions

  • Runtime: Node.js 22 LTS with TypeScript 5.x in strict mode. Chose over Go because the delivery pipeline is I/O-bound (HTTP calls, database writes, Redis operations), not CPU-bound. TypeScript's async/await model handles thousands of concurrent deliveries without the goroutine scheduling overhead that Go brings for zero benefit at this throughput target.

  • Framework: Fastify 5.x. Chose over Express because Fastify's plugin system enforces route encapsulation (each module registers in its own scope), and its content type parser buffers raw bytes for HMAC verification while still parsing JSON for handlers in a single pass. Express would have required separate body-parser configuration to preserve original bytes alongside parsed JSON, and its middleware model doesn't scope state to plugin boundaries.

  • Data Layer: PostgreSQL 16 with Drizzle ORM (0.45.x). Chose Drizzle over Prisma because Prisma requires a separate query engine binary and generates heavier migration scaffolding than this project needs; Drizzle is a thin layer that emits plain SQL with minimal runtime overhead. Chose PostgreSQL over SQLite because the delivery pipeline writes from 10 concurrent worker threads, and SQLite's single-writer lock would serialize all delivery recording.

  • Job Queue: BullMQ 5.x on Redis 7 (256MB, noeviction policy). Chose over Kafka and NATS because the throughput target is 500 events/min. A dedicated message broker adds operational complexity (cluster management, topic partitioning, consumer group rebalancing) that only pays off above 10,000 events/min. Chose BullMQ over a custom retry loop because in-memory timers don't survive process restarts, and BullMQ gives delayed jobs for exponential backoff with built-in concurrency control.

  • Cache: Redis 7, noeviction policy, 256MB ceiling. Bounded memory with backpressure at the cap rather than silent eviction. Source configuration is cached with a 60-second TTL to avoid hitting PostgreSQL on every webhook received.

  • Hosting: Docker container on Hetzner VPS behind Traefik with auto-TLS via Let's Encrypt. Chose Hetzner over AWS because a single VPS at this scale costs roughly 80% less than equivalent ECS Fargate, and the operational model is one docker compose command instead of a CloudFormation stack.
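The raw-bytes requirement behind the Fastify choice is the crux of signature handling: HMACs must be computed over the exact bytes received, never re-serialized JSON. A minimal sketch of a source-agnostic verifier using Node's `crypto` module follows; the `SourceConfig` shape and function names are illustrative, not the project's actual code, and Stripe's `t=,v1=` timestamp parsing is omitted here.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Illustrative shape of a per-source config row (the real schema lives in Drizzle).
type SourceConfig = {
  algorithm: 'hmac-sha256' | 'hmac-sha1' | 'none';
  signatureHeader: string; // e.g. 'x-hub-signature-256'
  secret: string;
};

// HMAC over the *raw* request bytes, never re-serialized JSON.
function sign(algorithm: 'hmac-sha256' | 'hmac-sha1', secret: string, rawBody: Buffer): string {
  const digest = algorithm === 'hmac-sha256' ? 'sha256' : 'sha1';
  return createHmac(digest, secret).update(rawBody).digest('hex');
}

// Constant-time comparison against the header value; strips a GitHub-style
// "sha256=" prefix generically, with no provider-specific branching.
function verifySignature(config: SourceConfig, rawBody: Buffer, headerValue: string): boolean {
  if (config.algorithm === 'none') return true;
  const received = headerValue.slice(headerValue.lastIndexOf('=') + 1);
  const expected = sign(config.algorithm, config.secret, rawBody);
  const a = Buffer.from(received, 'hex');
  const b = Buffer.from(expected, 'hex');
  // Length check first: timingSafeEqual throws on unequal-length buffers.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because the algorithm, header name, and secret all come from the config row, a new provider is just a new row, which is exactly the "database insert, not a code change" constraint.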

Constraints That Shaped the Design

  • Input: HTTP POST from any external service. Payload format unknown until runtime. Signature algorithm (hmac-sha256, hmac-sha1, or none), header name, and signing secret are stored per-source in PostgreSQL. Supports Stripe's t=<unix>,v1=<sig> timestamp format and GitHub's sha256= prefix convention with no source-specific code.

  • Output: HTTP POST to one or more downstream destinations per source. Each destination row carries its own timeout_ms (default 10,000), max_retries (default 5), and backoff_base_ms (default 1,000). Response bodies are truncated to 4,096 bytes and stored with each delivery record for debugging.

  • Scale Handled: ~500 events/min sustained, 1,000/min burst at the ingestion rate limit. 10 concurrent delivery workers process the queue. At ~2,500 events/min the single-node worker pool would saturate. Scaling past that requires deploying additional worker instances reading from the same Redis queue, which BullMQ supports without code changes.

  • Hard Constraints: Backpressure activates at 10,000 queued jobs (waiting + active), rejecting ingestion with HTTP 429 to protect Redis memory. Webhook signatures older than 5 minutes are rejected to prevent replay attacks. Maximum payload is 1MB. JSON nesting depth is capped at 20 levels to prevent stack overflow from recursive parsing.

  • Data Retention: Events are archived after 90 days to a separate events_archive table with a leaner index set. The primary events table stays fast for active delivery and replay. Completed BullMQ jobs are pruned at 1,000 and failed jobs at 5,000 to bound Redis memory growth.
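The per-destination retry knobs above can be sketched as a pure scheduling function. The column names mirror the destination row quoted earlier (`timeout_ms`, `max_retries`, `backoff_base_ms`); the exact doubling formula is an assumption about how the backoff base is applied, not something the brief spells out.

```typescript
// Illustrative destination row with the defaults quoted in the constraints.
type Destination = {
  timeout_ms: number;      // default 10_000
  max_retries: number;     // default 5
  backoff_base_ms: number; // default 1_000
};

// Delay before retry attempt N (1-based): base * 2^(N-1), i.e. exponential backoff.
// Returns null once the destination's retry budget is spent, so the caller
// can mark the delivery as permanently failed.
function nextRetryDelayMs(dest: Destination, attempt: number): number | null {
  if (attempt < 1 || attempt > dest.max_retries) return null;
  return dest.backoff_base_ms * 2 ** (attempt - 1);
}
```

With the defaults, attempts are spaced at 1s, 2s, 4s, 8s, and 16s; a slow third-party destination can stretch that schedule simply by raising its own `backoff_base_ms`, without touching the fast internal targets.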

Decision Log

| Decision | Alternative Rejected | Why |
| --- | --- | --- |
| Persist-before-process (PostgreSQL INSERT before enqueuing BullMQ job) | Fire-and-forget (enqueue first, persist async) | If Redis fails between enqueue and delivery, the event is lost forever. With persist-first, Redis failure delays delivery but never loses data. The database is the source of truth. Ingestion still returns in under 50ms because the write is a single INSERT. |
| Source-agnostic DB configuration (algo, header, secret per row) | Provider-specific adapter classes (StripeAdapter, GitHubAdapter) | Adapters accumulate provider-specific branches that rot when APIs change formats. The DB-config approach means adding any new webhook source is an INSERT, not a pull request. The engine has zero knowledge of what "Stripe" or "GitHub" means. |
| Per-destination retry configuration (timeout, retries, backoff per destination row) | Global retry settings for all destinations | A fast internal microservice and a slow third-party API have different failure profiles. Global settings force a lowest common denominator that either retries too aggressively for fast targets or gives up too early on slow ones. |
| Two-layer SSRF protection (URL validation at creation + DNS resolution at delivery) | Creation-time URL validation only | DNS records change after destination creation. A domain resolving to a public IP today can point to 169.254.169.254 (cloud metadata) next week. Delivery-time DNS resolution checks all A/AAAA records against private IP ranges before every HTTP call. |
| AES-256-GCM encryption for signing secrets at rest | Plaintext storage in PostgreSQL | A database breach that leaks signing secrets allows forging webhook signatures for any source. GCM encryption with a separate key (from environment) means the secrets column is useless without the key. A random 12-byte IV per encryption ensures identical secrets produce different ciphertext. |
| BullMQ removeOnComplete: 1000 / removeOnFail: 5000 | Retain all job history in Redis | At 500 events/min with multiple destinations, completed jobs accumulate at over 1,000 per minute. Without pruning, Redis memory grows roughly 2GB per day. The 1,000/5,000 retention gives enough debugging history without unbounded growth. |
| Redis noeviction memory policy (256MB cap) | allkeys-lru eviction | LRU eviction would silently discard queued delivery jobs when Redis fills up. noeviction causes write failures, which trigger 429 responses to webhook senders. Loud failure is better than silent data loss for a system guaranteeing at-least-once delivery. |
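The secrets-at-rest decision maps directly onto Node's built-in crypto primitives. A sketch under stated assumptions: the stored format (`iv:tag:ciphertext`, hex-encoded) and function names are invented for illustration; the decision itself only fixes AES-256-GCM, an environment-supplied key, and a random 12-byte IV per encryption.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Encrypt a signing secret for storage. The 32-byte key comes from the
// environment, never the database, so a leaked secrets column is useless alone.
function encryptSecret(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh IV per call: identical secrets -> different ciphertext
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // GCM auth tag, available after final()
  return [iv, tag, ciphertext].map((b) => b.toString('hex')).join(':');
}

// Decrypt at delivery time. GCM is authenticated: any tampering with the
// stored value makes final() throw instead of returning garbage.
function decryptSecret(stored: string, key: Buffer): string {
  const [iv, tag, ciphertext] = stored.split(':').map((h) => Buffer.from(h, 'hex'));
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

The authenticated mode matters here: with plain CBC, a flipped ciphertext bit would silently decrypt to a wrong secret and every signature check for that source would fail mysteriously, whereas GCM fails loudly at decryption.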
#typescript #fastify #postgresql #bullmq #redis #webhooks

