
Workflow Automation Engine

7 min read · Kingsley Onoh

The Backbone Nobody Sees: A Self-Hosted Workflow Orchestrator

The Situation

Every backend project eventually arrives at the same requirement: take data from one system, process it, and push it to another. A payment processor fires a webhook. The backend validates the charge, updates the order database, calls the shipping API, and sends a confirmation email. Four steps, five failure points, and typically zero visibility into which step failed or why.

Most teams build this logic inline. The webhook handler calls the database, then calls the shipping API, then calls the email service, all in a single request cycle. When the shipping API times out on step three, the payment is recorded but the customer never gets notified. The developer debugs by reading server logs, reconstructing the execution sequence manually, and hoping the timestamps line up.

This pattern repeats across every project that integrates external services. Each one reinvents retry logic, error handling, state tracking, and audit logging from scratch.

The Cost of Doing Nothing

Integration plumbing consumes a disproportionate share of backend development time. For a team building service integrations, the business logic (validate payment, calculate shipping, format notification) is typically a fraction of the code. The rest is infrastructure: connection handling, retry with backoff, timeout management, error classification, state persistence, and logging.

A backend developer spending three of every ten working days on infrastructure code that could be standardized is capacity lost to repetition. At European developer rates, a three-person backend team burns roughly €55-75K/year maintaining integration plumbing that works the same way in every project. None of it is a competitive advantage. All of it is table stakes that every team rebuilds independently.

The second cost is invisible: when an integration fails silently, the first person to notice is usually the customer. Without execution tracking, the team discovers failures reactively, sometimes days after the data stopped flowing.

What I Built

A self-hosted workflow engine that executes multi-step integrations defined as directed acyclic graphs. Users define workflows via a REST API: each workflow is a set of steps with declared dependencies, trigger configuration, and per-step retry policies. The engine parses the DAG, validates it for cycles and depth violations, and executes steps in dependency order with full state persistence.
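The cycle check that runs before a definition is accepted can be sketched with Kahn's algorithm. This is a minimal illustration under assumed input shapes (a step-to-dependencies mapping), not the engine's actual code:

```python
from collections import deque

def validate_dag(steps: dict[str, list[str]]) -> list[str]:
    """Validate a step graph (step id -> list of dependency ids) and
    return a topological execution order. Raises ValueError on cycles
    or references to unknown steps."""
    # Count unresolved dependencies per step.
    pending = {step: len(deps) for step, deps in steps.items()}
    # Reverse edges: which steps become runnable when this one finishes.
    unblocks: dict[str, list[str]] = {step: [] for step in steps}
    for step, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                raise ValueError(f"step {step!r} depends on unknown step {dep!r}")
            unblocks[dep].append(step)

    ready = deque(s for s, n in pending.items() if n == 0)
    order: list[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in unblocks[step]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    # If any step never reached zero pending dependencies, the graph cycles.
    if len(order) != len(steps):
        raise ValueError("cycle detected in workflow definition")
    return order
```

The same traversal doubles as the execution order: steps whose dependencies have all completed are the ones safe to dispatch.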

Five step types cover the majority of integration patterns. HTTP steps call external APIs with templated headers and bodies. Transform steps reshape data between steps using sandboxed expressions. Condition steps evaluate an expression and skip the non-taken branch at runtime. Delay steps pause execution for a configurable interval. Sub-workflow steps nest one workflow inside another, up to three levels deep.
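A definition posted to the REST API might look like the following. Every field name here is illustrative, not the engine's actual schema, but the five step types and the dependency wiring match the description above:

```python
# Hypothetical payload for a workflow-creation request.
# Field names and template syntax are made up for illustration.
workflow = {
    "name": "order-fulfilment",
    "trigger": {"type": "webhook", "hmac": True},
    "steps": [
        {"id": "validate", "type": "http",
         "config": {"method": "POST",
                    "url": "https://api.example.com/validate",
                    "body": "{{ trigger.payload }}"}},
        {"id": "reshape", "type": "transform", "depends_on": ["validate"],
         "config": {"expression": "payload.items | pick('sku', 'qty')"}},
        {"id": "in_stock", "type": "condition", "depends_on": ["reshape"],
         "config": {"expression": "steps.reshape.output.qty > 0"}},
        {"id": "cool_off", "type": "delay", "depends_on": ["in_stock"],
         "config": {"seconds": 30}},
        {"id": "notify", "type": "sub_workflow", "depends_on": ["cool_off"],
         "config": {"workflow": "send-confirmation"}},
    ],
}
```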

Every state transition writes to PostgreSQL before the next step begins. If the worker process crashes between steps, every execution in flight is recoverable from the last persisted state. The first version crashed midway through a 12-step workflow with nested conditions, and having to reconstruct the execution state by hand made the persist-before-execute design non-negotiable.
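The persist-before-execute loop can be sketched like this, with a plain dict standing in for the PostgreSQL executions table and execute_step for the real step runner (both hypothetical names):

```python
def run_workflow(execution_id, steps, db, execute_step):
    """Execute steps in order, persisting state before each one.
    `db` is a dict standing in for the executions table; a crashed
    run can be resumed by calling this again with the same id."""
    state = db.setdefault(execution_id, {"completed": [], "status": "running"})
    for step in steps:
        if step in state["completed"]:
            continue  # finished before a crash; skip on recovery
        # Persist the transition BEFORE doing the work, so a crash
        # mid-step re-runs at most this one step.
        state["current"] = step
        db[execution_id] = state  # stand-in for an UPDATE ... COMMIT
        execute_step(step)
        state["completed"].append(step)
        db[execution_id] = state
    state["status"] = "succeeded"
    db[execution_id] = state
```

The trade-off is one extra write per step; the payoff is that recovery never has to guess which steps already ran.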

System Flow

Data Model

Architecture Layers

The Decision Log

Decision: PostgreSQL with JSONB over MongoDB
Rejected: MongoDB for flexible workflow schemas
Why: Workflow definitions are flexible, but the relationships between tenants, workflows, and executions are rigid. PostgreSQL gives both: JSONB for step configs and trigger payloads, relational integrity for everything else.

Decision: arq (Redis) over Celery
Rejected: Celery with RabbitMQ as message broker
Why: Redis was already required for rate limiting and caching. arq uses the same instance as its broker. One fewer service to deploy, monitor, and maintain.

Decision: Row-Level Security over application filtering
Rejected: WHERE tenant_id = X in every query
Why: Application-layer filtering works until someone forgets one query. RLS enforces tenant isolation at the database level. A missing WHERE clause still returns zero rows for the wrong tenant.

Decision: Persist-before-execute over fire-and-forget
Rejected: Queue-only state with eventual DB sync
Why: Worker crashes lose everything in a fire-and-forget model. Persisting to PostgreSQL before each transition means every execution is recoverable from a known state.

Decision: HMAC-SHA256 webhook validation over static tokens
Rejected: Bearer token per webhook endpoint
Why: HMAC signs the payload body. A captured signature is only valid for that specific request. Static tokens work for any request once leaked.
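The HMAC decision reduces to a few lines of Python's standard library. The function name and header handling below are illustrative, not the engine's actual interface:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the sender's signature using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match length via timing.
    return hmac.compare_digest(expected, signature_header)
```

Because the signature covers the exact body bytes, replaying a captured signature against any other payload fails the check.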

Ecosystem Integration

Webhook payloads and scheduled events that arrive at this engine often need to trigger notifications downstream. The HTTP step type can call any API, which means it can POST events to the notification hub I built for exactly this pattern. The hub handles channel selection (email, Telegram, in-app, SMS), template rendering, and delivery tracking. The workflow engine fires the event and moves on. It doesn't know or care how the recipient gets notified.
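In workflow terms, that hand-off is just another HTTP step. The endpoint, event names, and template syntax below are made up for illustration; the hub's real API may differ:

```python
# Hypothetical HTTP step that posts an event to the notification hub.
# The workflow engine only knows it made a POST; channel selection,
# templating, and delivery tracking happen on the hub's side.
notify_step = {
    "id": "notify_customer",
    "type": "http",
    "config": {
        "method": "POST",
        "url": "https://hub.internal/events",
        "headers": {"Content-Type": "application/json"},
        "body": {
            "event": "order.shipped",
            "recipient": "{{ steps.validate.output.customer_id }}",
            "data": "{{ steps.reshape.output }}",
        },
    },
}
```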

Full breakdown of the notification architecture: www.kingsleyonoh.com/projects/event-driven-notification-hub

Results

Before this engine existed, each integration project built its own execution pipeline. Retry logic, state tracking, failure logging, tenant isolation: all reimplemented from scratch. A typical integration endpoint needed 300-500 lines of plumbing code around 50-100 lines of business logic.

With the engine, a new integration is a workflow definition: a JSON document describing steps, dependencies, and conditions. Five step types (HTTP, transform, condition, delay, sub-workflow) cover the patterns that appeared in every previous project. Webhook triggers respond in under 100ms (202 Accepted, execution runs asynchronously). HMAC validation blocks unsigned requests. Cron scheduling fires workflows on a 5-field expression with sub-minute accuracy.

The engine runs on a single Hetzner VPS behind Traefik, inside four Docker containers capped at 512MB memory each. 544 tests validate the execution model, including tenant isolation, DAG cycle detection, conditional branching, and sub-workflow depth limits. Execution logs are retained for 30 days, webhook deliveries for 7.

The cache invalidation gap is the known debt. The in-memory LRU cache (256 entries, 300-second TTL) works for a single app instance. At two or more instances, one can update a workflow while the other serves stale definitions for up to five minutes. Redis pub/sub invalidation is the fix, and it's not built yet. The engine scales vertically today. Horizontal scaling requires solving that cache coherence problem first.
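A cache with those characteristics behaves roughly like the sketch below (not the engine's code; the injectable clock just makes the staleness window testable). The failure mode follows directly: nothing here tells a second instance that a definition changed, so a stale entry lives until its TTL expires.

```python
import time
from collections import OrderedDict

class TTLCache:
    """In-memory LRU with a per-entry TTL, sized as described above
    (256 entries, 300-second TTL). Single-instance only: there is no
    cross-process invalidation."""
    def __init__(self, maxsize=256, ttl=300.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self._data: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:
            del self._data[key]  # expired: caller reloads from PostgreSQL
            return None
        self._data.move_to_end(key)  # refresh LRU position
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```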

#python #fastapi #postgresql #redis #dag-orchestration

The full system record for Workflow Automation Engine
