Architectural Brief: Client Management Portal
Client work breaks when status, access, internal notes, client updates, documents, and follow-up all live in different tools. This architecture keeps those facts in one tenant-scoped API, then lets the web app, CLI, jobs, and integrations read from the same operating record.
System Topology
Infrastructure Decisions
- Compute: Fastify API monolith on Node 22. Chose one API process over split services because client status, portal access, tasks, updates, documents, and operator state all share tenant boundaries. Splitting those early would add distributed auth and data consistency work without a matching traffic need.
- Frontend boundary: Next.js 16 frontend deployed separately from the API. Chose this over keeping all screens inside Fastify templates because the portal and admin surface need richer client-side behavior, while the API still owns auth, persistence, and integration rules.
- Data Layer: PostgreSQL with Drizzle. Chose this over a document store because tenants, clients, projects, tasks, updates, report drafts, invoices, capacity entries, and audit events are related records with foreign keys and tenant indexes. The model benefits from constraints, not loose collections.
- Session and rate state: Redis. Chose this over stateless JWT-only auth and in-memory rate counters because admin sessions need revocation and the API may restart or run behind a proxy. Public intake and tenant registration rate limits also need shared counters.
- File storage: Supabase Storage for attachments, with metadata stored in PostgreSQL. Chose this over Postgres blobs or app-disk storage because uploaded client files should not bloat relational rows or depend on one container filesystem.
- Notifications: Feature-flagged Notification Hub events. Chose this over an in-service email sender because the portal should own client state, not delivery templates, retries, and channel routing. If the hub is off, the client workflow still runs.
- Document boundary: Klevar Docs webhook and cache. Chose this over rendering documents in the portal because invoices, signed PDFs, and document events belong to a document service. The portal caches only enough status and link data to show clients the right evidence.
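
The shared rate counters in the Redis decision above can be illustrated with a fixed-window limiter. This is a minimal sketch, not the portal's implementation: `CounterStore`, `InMemoryCounterStore`, and `allowRequest` are hypothetical names, the in-memory store stands in for Redis (where the increment would map to `INCR` and the window lifetime to `EXPIRE`), and the calls are synchronous only to keep the example short.

```typescript
// Hypothetical interface: a Redis-backed version would use INCR plus EXPIRE.
interface CounterStore {
  incr(key: string, ttlSeconds: number, nowMs: number): number;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryCounterStore implements CounterStore {
  private entries = new Map<string, { count: number; expiresAtMs: number }>();

  incr(key: string, ttlSeconds: number, nowMs: number): number {
    const existing = this.entries.get(key);
    if (!existing || existing.expiresAtMs <= nowMs) {
      // First hit in this window (or the old entry expired): start at 1.
      this.entries.set(key, { count: 1, expiresAtMs: nowMs + ttlSeconds * 1000 });
      return 1;
    }
    existing.count += 1;
    return existing.count;
  }
}

// Fixed-window check: every caller in the same window shares one counter key.
function allowRequest(
  store: CounterStore,
  route: string,
  clientId: string,
  limit: number,
  windowSeconds: number,
  nowMs: number,
): boolean {
  const windowIndex = Math.floor(nowMs / (windowSeconds * 1000));
  const key = `rate:${route}:${clientId}:${windowIndex}`;
  return store.incr(key, windowSeconds, nowMs) <= limit;
}
```

Because the counter lives in the store rather than in process memory, an API restart or a second process behind the proxy would see the same count for the same window.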
Constraints That Shaped the Design
- Input: Client intake forms, admin or CLI mutations, portal task submissions, project metric pushes, Docs webhooks, and daily job schedules feed the system.
- Output: Tenant-scoped client records, project timelines, portal updates, task requests, report drafts, capacity summaries, queue items, notification events, and document or invoice status.
- Scale Handled: The PRD targets under 50 clients, 50 concurrent users, 20 peak requests per second, API p95 under 200 ms, and server-rendered page loads under 500 ms. The design is sized for an internal operating system, not a public CRM product.
- Connection Ceiling: The database client has a max pool of 10 connections. That is acceptable for the current traffic target. Higher concurrency would need query consolidation, worker separation, or pool changes before adding more API replicas.
- Upload Boundaries: Multipart uploads cap at 10 MB per file, 5 files per parent, and 20 fields. The API keeps access control close to the request and avoids turning uploads into unbounded memory pressure.
- Operational Jobs: The stale-lead job runs daily at 03:00 UTC, the stale-attachment job at 04:00 UTC, and the recurring-checkpoint job at 04:15 UTC. These jobs run inside the API process today. A multi-replica deployment would need scheduler leadership or a separate worker.
- Auth Boundaries: The system has 3 access paths: admin session, high-entropy API key, and portal token. Each path resolves a tenant or client before business data is loaded.
- Proven Bottleneck: API key lookup moved from a bcrypt comparison loop over stored keys to an indexed SHA-256 lookup for high-entropy keys. The documented change cut admin route latency from roughly 950 ms to roughly 108 ms while keeping passwords on bcrypt.
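
The API key change in the last bullet can be sketched as follows. This is an illustration under assumptions, not the repo's code: `ApiKeyRecord`, `issueApiKey`, and `resolveTenantByApiKey` are hypothetical names, and a `Map` keyed by hash stands in for an indexed database column. The point is that a deterministic hash allows one direct lookup, where a salted bcrypt hash forces a comparison against every stored key.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical record shape; the real schema is not shown in this brief.
interface ApiKeyRecord {
  tenantId: string;
  keyHash: string; // hex-encoded SHA-256 of the raw key
}

// Stand-in for an indexed column: lookup by hash is a single probe.
const apiKeysByHash = new Map<string, ApiKeyRecord>();

function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey, "utf8").digest("hex");
}

// Generates a high-entropy key (32 random bytes) and stores only its hash.
function issueApiKey(tenantId: string): string {
  const rawKey = randomBytes(32).toString("base64url");
  const keyHash = hashApiKey(rawKey);
  apiKeysByHash.set(keyHash, { tenantId, keyHash });
  return rawKey; // shown to the operator once, never persisted
}

// Resolves the tenant before any business data is loaded.
function resolveTenantByApiKey(rawKey: string): string | null {
  const record = apiKeysByHash.get(hashApiKey(rawKey));
  return record ? record.tenantId : null;
}
```

This trade only makes sense for randomly generated keys. Human-chosen passwords are guessable, so they stay on a slow hash such as bcrypt, as the bullet states.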
Operational Contracts
The API owns the write contract. Next.js renders the portal and admin experience, but state changes go through Fastify routes. The CLI uses the same API boundary instead of writing to the database. That matters because the CLI is the fastest operator surface, and it still has to obey tenant scope, notification policy, visibility rules, and audit state.
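One way to picture the shared boundary is a single guard that every surface passes through before a record is returned. The sketch below is hypothetical: the `Actor` union mirrors the three access paths named earlier (admin session, API key, portal token), while the `visibility` values and the `canRead` function are assumptions made for illustration.

```typescript
// The three access paths from the brief, each already resolved to a tenant.
type Actor =
  | { kind: "admin"; tenantId: string }
  | { kind: "apiKey"; tenantId: string }
  | { kind: "portal"; tenantId: string; clientId: string };

// Hypothetical record shape; the visibility values are assumed for this sketch.
interface UpdateRecord {
  tenantId: string;
  clientId: string;
  visibility: "internal" | "client";
}

function canRead(actor: Actor, record: UpdateRecord): boolean {
  // Tenant scope applies to every path, including the CLI's API key.
  if (actor.tenantId !== record.tenantId) return false;
  if (actor.kind === "portal") {
    // Portal tokens see only their own client's client-visible records.
    return record.clientId === actor.clientId && record.visibility === "client";
  }
  return true;
}
```

Because the web app, CLI, and jobs would all call the same function, a rule change lands in one place instead of once per surface.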
PostgreSQL is the source of truth. Redis stores operational state: sessions and rate limits. Supabase stores files. Notification Hub and Klevar Docs are soft dependencies that can be disabled without stopping core client delivery. The system keeps its first rule simple: never let a client-visible action bypass the same tenant and visibility checks used by the portal.
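The soft-dependency behavior can be sketched as a best-effort emit that runs after the state change. This is a minimal illustration under assumptions: `HubEvent`, `emitIfEnabled`, `postClientUpdate`, and the event type string are hypothetical, an array stands in for the PostgreSQL write, and real code would be asynchronous.

```typescript
// Hypothetical event shape; the hub's real contract is not shown in this brief.
interface HubEvent {
  tenantId: string;
  type: string;
}

type EmitOutcome = "sent" | "skipped" | "failed";

// Emits only when the flag is on, and never lets a hub failure escape.
function emitIfEnabled(
  hubEnabled: boolean,
  send: (event: HubEvent) => void,
  event: HubEvent,
): EmitOutcome {
  if (!hubEnabled) return "skipped";
  try {
    send(event);
    return "sent";
  } catch {
    return "failed";
  }
}

// The state change is recorded first; notification is best-effort afterwards.
function postClientUpdate(
  savedUpdates: string[],
  body: string,
  tenantId: string,
  hubEnabled: boolean,
  send: (event: HubEvent) => void,
): EmitOutcome {
  savedUpdates.push(body); // stand-in for the PostgreSQL write
  return emitIfEnabled(hubEnabled, send, { tenantId, type: "client.update.posted" });
}
```

With the flag off or the hub down, the update is still saved. The caller only learns that delivery was skipped or failed.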
Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| Fastify plugin API | Split microservices per client domain | The repo has 25 route files and about 172 route registrations, but the domains share tenant, auth, and relational state. A monolith keeps those boundaries enforceable. |
| Drizzle schema as contract | ORM model hidden behind generated clients | The schema shows the tenant indexes, foreign keys, enums, and JSONB fields directly. That made audit and test failures easier to trace. |
| SHA-256 API key lookup | Bcrypt loop over every stored key | API keys are high entropy. Bcrypt still protects passwords, but looping it over stored keys slowed every API-key admin request. Indexed SHA-256 matches the threat model and removed a measured bottleneck. |
| CLI-first operator surface | Dashboard-first admin build | The operator needed fast commands for queue, stale work, handoffs, asks, reports, and capacity. API and CLI routes proved the workflow before more UI was added. |
| JSONB project milestones with task linkage | Separate milestone table from day one | Milestones need flexible setup and quick CLI edits. The cost of skipping a separate table is extra work on normalization, backfill, linked task keys, and notification guards. |
| Feature-flagged Notification Hub | Email delivery inside the portal | The portal emits events and keeps moving if delivery fails. Email templates, retries, and channels stay in the hub. |
| Supabase Storage signed links | Binary files in Postgres | Attachments need storage limits and access control, not large relational rows. Metadata stays queryable while blobs stay outside the database. |
| In-process cron jobs | Dedicated queue and worker service | Current daily cleanup and checkpoint work is small. The first scale break is scheduler duplication, not route throughput. |
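
The scheduler duplication risk in the last row can be made concrete. The sketch below shows one possible mitigation that the brief does not commit to: each replica tries to claim a per-job, per-day key, and only the winner runs the job. `ClaimStore`, `InMemoryClaimStore`, and `runDailyJobOnce` are hypothetical names. A shared store such as Redis could provide the same set-if-absent semantics through `SET` with the `NX` option.

```typescript
// Hypothetical interface: returns true only for the first caller of a given key.
interface ClaimStore {
  claimOnce(key: string): boolean;
}

// In-memory stand-in; real replicas would need a store they all share.
class InMemoryClaimStore implements ClaimStore {
  private claimed = new Set<string>();

  claimOnce(key: string): boolean {
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
}

// Runs the job at most once per UTC day, no matter how many replicas fire it.
function runDailyJobOnce(
  store: ClaimStore,
  jobName: string,
  now: Date,
  job: () => void,
): boolean {
  const utcDay = now.toISOString().slice(0, 10); // e.g. "2025-01-01"
  if (!store.claimOnce(`job:${jobName}:${utcDay}`)) return false;
  job();
  return true;
}
```

A claim key is lighter than leader election, but it does not retry a job that fails after the claim. The separate worker named in the brief remains the fuller answer.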
Scaling Limit
The next architecture change is not another UI. It is worker separation. If the portal grows past internal operating scale, cron jobs should move out of the API process, metrics item writes should be batched, and high-volume notification work should move through a durable queue. The relational model can keep carrying client state, but the write paths around it would need stronger backpressure.
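As a rough picture of batched writes with backpressure, the sketch below uses a bounded queue: producers are refused once the buffer is full, and a worker drains it in fixed-size batches. `BoundedBatchQueue` and its limits are hypothetical. In the design described above, the durable queue and the worker process would take this role.

```typescript
// Bounded buffer: refuses new items when full, hands out fixed-size batches.
class BoundedBatchQueue<T> {
  private pending: T[] = [];
  private readonly maxPending: number;
  private readonly batchSize: number;

  constructor(maxPending: number, batchSize: number) {
    this.maxPending = maxPending;
    this.batchSize = batchSize;
  }

  // Returns false when the queue is full so the caller can slow down or reject.
  offer(item: T): boolean {
    if (this.pending.length >= this.maxPending) return false;
    this.pending.push(item);
    return true;
  }

  // Removes and returns up to batchSize items for a single bulk write.
  takeBatch(): T[] {
    return this.pending.splice(0, this.batchSize);
  }

  get size(): number {
    return this.pending.length;
  }
}
```

A worker loop would call `takeBatch()` on a timer and issue one multi-row insert per batch. A `false` from `offer` is the backpressure signal to the route that accepted the metric push.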