The Redpanda container was configured for 256MB of RAM. On a local dev machine with 32GB, that's nothing. On the production VPS with 1GB total, it was a quarter of the entire budget before the notification service had processed a single event.
I had designed the system from day one around Kafka event consumption. The PRD specified it. The consumer module was built and tested. KafkaJS handled message parsing, schema validation, and automatic reconnection. The architecture was clean: events arrive on a Kafka topic, the consumer matches rules, the pipeline processes notifications. Decoupled. Async. Textbook.
Then I tried to deploy it.
The Memory Problem
The VPS runs multiple services behind Traefik: PostgreSQL, the notification hub, and other portfolio projects sharing the same machine. Traefik takes memory. PostgreSQL takes memory. The notification hub's container was limited to 128MB in the production docker-compose file, which was comfortable for the Fastify process alone.
Redpanda, even in its minimal single-core configuration (`--smp 1 --memory 256M --overprovisioned`), needs 150-200MB resident just to maintain the broker state. That's not under load. That's idle.
The arithmetic didn't work. Adding Redpanda would push total memory past the VPS limit. I could have upgraded the VPS, but spending more on infrastructure to run a message broker that processes events arriving via HTTP felt like the wrong tradeoff. The events come from other services hitting POST /api/events. They're already HTTP. Publishing them to Kafka just to consume them from Kafka on the same machine adds latency, memory overhead, and a failure point, all for a round-trip that happens within a single process.
The Honest Question
Was Kafka the right tool for this deployment, or had I designed for an architecture I couldn't afford to run?
The Kafka consumer provides real value in two scenarios: when events arrive from external systems that produce to Kafka natively, and when the notification hub scales horizontally across multiple instances that need a shared event stream. Neither scenario existed. Events arrived via HTTP. The hub ran on one container. Kafka was solving a problem I didn't have yet.
But I didn't want to rip Kafka out entirely. The consumer was tested. The topic subscription pattern worked. If the deployment ever moved to a larger machine, or if an external system needed to produce events directly to a Kafka topic, the path was already built. Deleting the consumer code would mean rebuilding it later.
The Flag
The solution was a single environment variable: USE_KAFKA.
In src/config.ts, the flag is a string enum that transforms to a boolean:
```typescript
USE_KAFKA: z.enum(['true', 'false'])
  .default('true')
  .transform((val) => val === 'true'),
```
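The semantics of that transform can be sketched without Zod; this is a minimal dependency-free stand-in (the `parseUseKafka` helper is hypothetical, not part of the real config module) that mirrors the same contract: the env var is a string, the config value is a boolean, and only the literal `'true'` or `'false'` is accepted.

```typescript
// Stand-in for the Zod schema above: validate the raw env string,
// apply the default, and transform to a boolean.
function parseUseKafka(raw: string | undefined, defaultValue = 'true'): boolean {
  const val = raw ?? defaultValue; // .default('true') when the var is unset
  if (val !== 'true' && val !== 'false') {
    // z.enum would reject anything outside the two literals
    throw new Error(`USE_KAFKA must be "true" or "false", got "${val}"`);
  }
  return val === 'true'; // the .transform step
}
```

The string-enum-then-transform shape is deliberate: it rejects typos like `USE_KAFKA=yes` at startup instead of silently coercing them to `false`.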
When USE_KAFKA is true, the server starts a KafkaJS consumer that subscribes to the configured topic pattern, validates incoming messages against a Zod schema, looks up the tenant, matches rules, and feeds events into the notification pipeline. This is the path the system was designed for.
When USE_KAFKA is false, the POST /api/events route handler processes events inline. Instead of publishing to Kafka and waiting for the consumer to pick them up, it calls matchRules() and processNotification() directly within the same HTTP request. The event goes in, the pipeline runs, and the response comes back with a count of rules matched.
The critical design choice: both paths converge on the same pipeline function. processNotification() in src/processor/pipeline.ts doesn't know or care whether the event arrived via Kafka or via direct HTTP. It receives a validated event, a matched rule, a resolved recipient, and a config object. It runs the same eight-step cascade either way: resolve delivery address, check opt-out, check deduplication, check quiet hours, check digest mode, render template, insert notification record, dispatch to channel.
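That convergence can be sketched as two entry points feeding one function. The `Event` and `Rule` shapes and the pipeline body below are hypothetical simplifications; only the structure (two ingestion paths, one pipeline, no source awareness inside it) reflects the real code.

```typescript
// Hypothetical shapes standing in for the real validated types.
type Event = { eventId: string; eventType: string };
type Rule = { recipientValue: string };

// The pipeline receives the same arguments either way and has no idea
// which ingestion path delivered the event.
async function pipeline(event: Event, rule: Rule, recipient: string): Promise<string> {
  // The eight-step cascade would run here.
  return `processed ${event.eventId} for ${recipient}`;
}

// Direct mode: the HTTP route handler awaits the pipeline inline.
const viaHttp = (e: Event, r: Rule) => pipeline(e, r, r.recipientValue);
// Kafka mode: the consumer's message callback makes the identical call.
const viaKafka = (e: Event, r: Rule) => pipeline(e, r, r.recipientValue);
```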
Two Codepaths in One Route
The dual-mode logic lives in src/api/events.routes.ts. The route handler checks the flag:
```typescript
if (useKafka && kafkaBrokers && kafkaTopics) {
  await publishEvent(kafkaBrokers, kafkaTopics, event_id, {
    tenant_id: tenantId,
    event_type,
    event_id,
    payload,
    timestamp: new Date().toISOString(),
  });
} else {
  const rules = await matchRules(db, tenantId, event_type);
  for (const rule of rules) {
    const recipient = resolveRecipient(
      rule.recipientType,
      rule.recipientValue,
      payload,
    );
    if (recipient) {
      await processNotification(db, event, rule, recipient, config);
    }
  }
}
```
In Kafka mode, the route publishes and returns immediately. The consumer handles processing asynchronously. In direct mode, the route does everything synchronously. The HTTP response waits for the full pipeline, including template rendering and channel dispatch, to complete before returning.
This tradeoff matters. Direct mode is simpler to debug (the response tells you exactly how many rules fired and whether processing succeeded) but blocks the HTTP response until all notifications are dispatched. If Resend takes 500ms per email and an event triggers 5 rules, the caller waits 2.5 seconds. At higher volume, that latency would stack up.
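The stacking is worth making explicit: because each dispatch is awaited sequentially inside the request, the caller's wait is the sum of per-dispatch latencies, not the maximum. A back-of-envelope model (the 500ms figure is the text's assumed Resend latency, not a measured number):

```typescript
// Direct-mode latency model: sequential awaits sum, they don't overlap.
const perEmailMs = 500;  // assumed latency per Resend call
const rulesFired = 5;    // rules matched by one event
const callerWaitMs = perEmailMs * rulesFired; // total wait before the HTTP response
```

A `Promise.all` over the matched rules would cut the wait to roughly the slowest single dispatch, at the cost of losing the simple one-failure-fails-the-response semantics; at current volume the sequential version is easier to reason about.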
For now, event volume is low enough that direct processing completes in milliseconds. The day it doesn't, the fix is one environment variable change and a Redpanda container added to the compose file.
What I Got Wrong
I designed the Kafka consumer first and the direct processing path second. If I were building this again, I'd invert that order.
The direct path is simpler, easier to test, and sufficient for the scale the system actually operates at. Building the Kafka consumer first meant I spent time on KafkaJS connection management, topic subscription patterns, schema validation for messages, and reconnection logic before the system had processed a single real notification. All of that code works and is tested (6 tests on the Kafka event schema alone), but none of it runs in production.
Building the simple path first would have been smarter: deploy the direct HTTP pipeline, prove the notification routing works, and then add the Kafka consumer when the system needs async processing. Instead, I built the complex path first and the simple path as a fallback.
The code quality didn't suffer. Both paths share the same pipeline, and the pipeline is where the real complexity lives (eight processing steps, five possible skip/hold states, two digest routing paths). But I spent roughly a day on Kafka infrastructure that currently sits behind a false flag.
What Survives the Flag
The pipeline itself is flag-agnostic. The eight-step cascade in processNotification() runs identically regardless of how the event arrived:
- Resolve delivery address. If the recipient is a raw email (contains `@`) or phone number (starts with `+`), use it directly. Otherwise, look up the user's preferences in `user_preferences`.
- Check opt-out. Users can opt out per channel, per event type, or from everything via the `optOut` JSONB.
- Check deduplication. Same `event_id` + recipient + channel within the configured window (60 minutes default) gets skipped.
- Check quiet hours. If the user has quiet hours set and the current time falls within their window (timezone-aware), hold the notification or queue it for digest.
- Check digest mode. If the user has opted into digest batching, queue the notification with a `scheduledFor` timestamp computed from their schedule preference.
- Render the Handlebars template using the event payload as context.
- Insert the notification record into PostgreSQL with status `pending`.
- Dispatch to the appropriate channel and update status to `sent` or `failed`.
Each step can bail out with a specific skip reason: `no_delivery_address`, `opt_out`, `deduplicated`, `held` (quiet hours), or `queued_digest`. Every bail-out is logged and recorded in the `notifications` table. No event disappears silently.
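The cascade's shape is a chain of early returns. The skeleton below is illustrative only: the helper, the boolean parameters, and the result type are hypothetical stand-ins for the real pipeline internals, kept to the first three steps with the rest as comments.

```typescript
// Hypothetical result type mirroring the skip reasons named in the text.
type SkipReason = 'no_delivery_address' | 'opt_out' | 'deduplicated' | 'held' | 'queued_digest';
type PipelineResult = { status: 'sent' | 'failed' } | { skipped: SkipReason };

// Step 1: raw email or phone passes through; a preferences lookup is elided.
function resolveAddress(recipient: string): string | null {
  return recipient.includes('@') || recipient.startsWith('+') ? recipient : null;
}

function runCascade(recipient: string, optedOut: boolean, duplicate: boolean): PipelineResult {
  const address = resolveAddress(recipient);          // step 1
  if (!address) return { skipped: 'no_delivery_address' };
  if (optedOut) return { skipped: 'opt_out' };        // step 2
  if (duplicate) return { skipped: 'deduplicated' };  // step 3
  // Steps 4-5 (quiet hours, digest) would return 'held' or 'queued_digest'.
  // Steps 6-8: render template, insert record as 'pending', dispatch,
  // then update status to 'sent' or 'failed'.
  return { status: 'sent' };
}
```

Because every branch returns a value rather than silently dropping the event, each outcome has somewhere to be logged and recorded.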
Kafka is invisible to the pipeline. The Kafka consumer doesn't know about quiet hours or digest batching. The channel dispatcher doesn't know about deduplication. Each layer has one job and knows nothing about the layers around it. That separation is what makes the USE_KAFKA flag possible. The ingestion layer can be swapped without touching anything downstream.
The Principle
Design for the architecture you want. Deploy for the constraints you have.
The Kafka consumer is not dead code. It's a capability that's ready when the deployment supports it. The direct processing path is not a hack. It's the correct choice for a system running on 128MB. Both codepaths exist because the constraint was known before deployment, not discovered after a production incident.
The system runs on a 1GB VPS, serves real tenants, and processes events in single-digit milliseconds. Redpanda is one environment variable and one container away. Until the event volume justifies the memory cost, it stays off.