
Workflow Automation Engine

7 min read · Kingsley Onoh

The Backbone Nobody Sees: A Self-Hosted Workflow Orchestrator

The Situation

Every backend project eventually arrives at the same requirement: take data from one system, process it, and push it to another. A payment processor fires a webhook. The backend validates the charge, updates the order database, calls the shipping API, and sends a confirmation email. Four steps, five failure points, and typically zero visibility into which step failed or why.

Most teams build this logic inline. The webhook handler calls the database, then calls the shipping API, then calls the email service, all in a single request cycle. When the shipping API times out on step three, the payment is recorded but the customer never gets notified. The developer debugs by reading server logs, reconstructing the execution sequence manually, and hoping the timestamps line up.

This pattern repeats across every project that integrates external services. Each one reinvents retry logic, error handling, state tracking, and audit logging from scratch.

The Cost of Doing Nothing

Integration plumbing consumes a disproportionate share of backend development time. For a team building service integrations, the business logic (validate payment, calculate shipping, format notification) is typically a fraction of the code. The rest is infrastructure: connection handling, retry with backoff, timeout management, error classification, state persistence, and logging.

A backend developer spending three of every ten working days on infrastructure code that could be standardized is capacity lost to repetition. At European developer rates, a three-person backend team burns roughly €55-75K/year maintaining integration plumbing that works the same way in every project. None of it is a competitive advantage. All of it is table stakes that every team rebuilds independently.

The second cost is invisible: when an integration fails silently, the first person to notice is usually the customer. Without execution tracking, the team discovers failures reactively, sometimes days after the data stopped flowing.

What I Built

A self-hosted workflow engine that executes multi-step integrations defined as directed acyclic graphs. Users define workflows via a REST API: each workflow is a set of steps with declared dependencies, trigger configuration, and per-step retry policies. The engine parses the DAG, validates it for cycles and depth violations, and executes steps in dependency order with full state persistence.
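The cycle check that runs before a definition is accepted can be sketched with Kahn's algorithm. This is a minimal illustration under assumed input shapes (a step-to-dependencies mapping), not the engine's actual code:

```python
from collections import deque

def validate_dag(steps: dict[str, list[str]]) -> list[str]:
    """Validate a step graph (step id -> list of dependency ids) and
    return a topological execution order. Raises ValueError on cycles
    or references to unknown steps."""
    # Count unresolved dependencies per step.
    pending = {step: len(deps) for step, deps in steps.items()}
    # Reverse edges: which steps become runnable when this one finishes.
    unblocks: dict[str, list[str]] = {step: [] for step in steps}
    for step, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                raise ValueError(f"step {step!r} depends on unknown step {dep!r}")
            unblocks[dep].append(step)

    ready = deque(s for s, n in pending.items() if n == 0)
    order: list[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in unblocks[step]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    # If any step never reached zero pending dependencies, the graph cycles.
    if len(order) != len(steps):
        raise ValueError("cycle detected in workflow definition")
    return order
```

The same traversal doubles as the execution order: steps whose dependencies have all completed are the ones safe to dispatch.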

Five step types cover the majority of integration patterns. HTTP steps call external APIs with templated headers and bodies. Transform steps reshape data between steps using sandboxed expressions. Condition steps evaluate an expression and skip the non-taken branch at runtime. Delay steps pause execution for a configurable interval. Sub-workflow steps nest one workflow inside another, up to three levels deep.
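A definition posted to the REST API might look like the following. Every field name here is illustrative, not the engine's actual schema, but the five step types and the dependency wiring match the description above:

```python
# Hypothetical payload for a workflow-creation request.
# Field names and template syntax are made up for illustration.
workflow = {
    "name": "order-fulfilment",
    "trigger": {"type": "webhook", "hmac": True},
    "steps": [
        {"id": "validate", "type": "http",
         "config": {"method": "POST",
                    "url": "https://api.example.com/validate",
                    "body": "{{ trigger.payload }}"}},
        {"id": "reshape", "type": "transform", "depends_on": ["validate"],
         "config": {"expression": "payload.items | pick('sku', 'qty')"}},
        {"id": "in_stock", "type": "condition", "depends_on": ["reshape"],
         "config": {"expression": "steps.reshape.output.qty > 0"}},
        {"id": "cool_off", "type": "delay", "depends_on": ["in_stock"],
         "config": {"seconds": 30}},
        {"id": "notify", "type": "sub_workflow", "depends_on": ["cool_off"],
         "config": {"workflow": "send-confirmation"}},
    ],
}
```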

Every state transition writes to PostgreSQL before the next step begins. If the worker process crashes between steps, every execution in flight is recoverable from the last persisted state. The first version crashed midway through a 12-step workflow with nested conditions, and having to reconstruct the execution state by hand made the persist-before-execute design non-negotiable.
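The persist-before-execute loop can be sketched like this, with a plain dict standing in for the PostgreSQL executions table and execute_step for the real step runner (both hypothetical names):

```python
def run_workflow(execution_id, steps, db, execute_step):
    """Execute steps in order, persisting state before each one.
    `db` is a dict standing in for the executions table; a crashed
    run can be resumed by calling this again with the same id."""
    state = db.setdefault(execution_id, {"completed": [], "status": "running"})
    for step in steps:
        if step in state["completed"]:
            continue  # finished before a crash; skip on recovery
        # Persist the transition BEFORE doing the work, so a crash
        # mid-step re-runs at most this one step.
        state["current"] = step
        db[execution_id] = state  # stand-in for an UPDATE ... COMMIT
        execute_step(step)
        state["completed"].append(step)
        db[execution_id] = state
    state["status"] = "succeeded"
    db[execution_id] = state
```

The trade-off is one extra write per step; the payoff is that recovery never has to guess which steps already ran.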

System Flow

Data Model

Architecture Layers

The Decision Log

Decision: PostgreSQL with JSONB over MongoDB
Rejected: MongoDB for flexible workflow schemas
Why: Workflow definitions are flexible, but the relationships between tenants, workflows, and executions are rigid. PostgreSQL gives both: JSONB for step configs and trigger payloads, relational integrity for everything else.

Decision: arq (Redis) over Celery
Rejected: Celery with RabbitMQ as message broker
Why: Redis was already required for rate limiting and caching. arq uses the same instance as its broker. One fewer service to deploy, monitor, and maintain.

Decision: Row-Level Security over application filtering
Rejected: WHERE tenant_id = X in every query
Why: Application-layer filtering works until someone forgets one query. RLS enforces tenant isolation at the database level. A missing WHERE clause still returns zero rows for the wrong tenant.

Decision: Persist-before-execute over fire-and-forget
Rejected: Queue-only state with eventual DB sync
Why: Worker crashes lose everything in a fire-and-forget model. Persisting to PostgreSQL before each transition means every execution is recoverable from a known state.

Decision: HMAC-SHA256 webhook validation over static tokens
Rejected: Bearer token per webhook endpoint
Why: HMAC signs the payload body. A captured signature is only valid for that specific request. Static tokens work for any request once leaked.
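The HMAC decision reduces to a few lines of Python's standard library. The function name and header handling below are illustrative, not the engine's actual interface:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the sender's signature using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match length via timing.
    return hmac.compare_digest(expected, signature_header)
```

Because the signature covers the exact body bytes, replaying a captured signature against any other payload fails the check.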

Ecosystem Integration

Webhook payloads and scheduled events that arrive at this engine often need to trigger notifications downstream. The HTTP step type can call any API, which means it can POST events to the notification hub I built for exactly this pattern. The hub handles channel selection (email, Telegram, in-app, SMS), template rendering, and delivery tracking. The workflow engine fires the event and moves on. It doesn't know or care how the recipient gets notified.
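In workflow terms, that hand-off is just another HTTP step. The endpoint, event names, and template syntax below are made up for illustration; the hub's real API may differ:

```python
# Hypothetical HTTP step that posts an event to the notification hub.
# The workflow engine only knows it made a POST; channel selection,
# templating, and delivery tracking happen on the hub's side.
notify_step = {
    "id": "notify_customer",
    "type": "http",
    "config": {
        "method": "POST",
        "url": "https://hub.internal/events",
        "headers": {"Content-Type": "application/json"},
        "body": {
            "event": "order.shipped",
            "recipient": "{{ steps.validate.output.customer_id }}",
            "data": "{{ steps.reshape.output }}",
        },
    },
}
```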

Full breakdown of the notification architecture: www.kingsleyonoh.com/projects/event-driven-notification-hub

Results

Before this engine existed, each integration project built its own execution pipeline. Retry logic, state tracking, failure logging, tenant isolation: all reimplemented from scratch. A typical integration endpoint needed 300-500 lines of plumbing code around 50-100 lines of business logic.

With the engine, a new integration is a workflow definition: a JSON document describing steps, dependencies, and conditions. Five step types (HTTP, transform, condition, delay, sub-workflow) cover the patterns that appeared in every previous project. Webhook triggers respond in under 100ms (202 Accepted, execution runs asynchronously). HMAC validation blocks unsigned requests. Cron scheduling fires workflows on a 5-field expression with sub-minute accuracy.

The engine runs on a single Hetzner VPS behind Traefik, inside four Docker containers capped at 512MB memory each. 544 tests validate the execution model, including tenant isolation, DAG cycle detection, conditional branching, and sub-workflow depth limits. Execution logs are retained for 30 days, webhook deliveries for 7.

The cache invalidation gap is the known debt. The in-memory LRU cache (256 entries, 300-second TTL) works for a single app instance. At two or more instances, one can update a workflow while the other serves stale definitions for up to five minutes. Redis pub/sub invalidation is the fix, and it's not built yet. The engine scales vertically today. Horizontal scaling requires solving that cache coherence problem first.
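A cache with those characteristics behaves roughly like the sketch below (not the engine's code; the injectable clock just makes the staleness window testable). The failure mode follows directly: nothing here tells a second instance that a definition changed, so a stale entry lives until its TTL expires.

```python
import time
from collections import OrderedDict

class TTLCache:
    """In-memory LRU with a per-entry TTL, sized as described above
    (256 entries, 300-second TTL). Single-instance only: there is no
    cross-process invalidation."""
    def __init__(self, maxsize=256, ttl=300.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self._data: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:
            del self._data[key]  # expired: caller reloads from PostgreSQL
            return None
        self._data.move_to_end(key)  # refresh LRU position
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```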

#python #fastapi #postgresql #redis #dag-orchestration

The full system record for Workflow Automation Engine
