

5 min read · Kingsley Onoh

Architectural Brief: Sensor Telemetry Engine

A solar farm with 200 panels generates a reading every second per panel per metric. Voltage, temperature, power. That's 600 messages per second from one site. Add five sites and you're at 3,000 messages per second before a single line of business logic runs. The constraint that shaped every decision in this system: sustain 5,000 readings per second through a single binary, with anomaly detection running inline, without dropping messages.

System Topology

Infrastructure Decisions

  • Language: Rust (stable, 2024 edition). Chose over Go and Python because at 5,000 messages/second, garbage collection pauses in Go and Python's GIL introduce latency spikes that compound with batch flush timing. Rust's zero-cost async via Tokio keeps the NATS consumer and HTTP server on the same runtime without contention. The binary runs in a 512MB container.

  • Message Broker: NATS 2.x. Chose over Kafka because Kafka's partition model and ZooKeeper dependency are overkill for a single-node deployment. NATS gives sub-millisecond pub/sub, automatic reconnection, and fits in a 256MB container. JetStream is available if persistence is needed later, but the current design inserts before acknowledging, so message loss means the batch failed, not that NATS dropped it.

  • Time-Series Database: TimescaleDB 2.x on PostgreSQL 16. Chose over InfluxDB because InfluxDB means a proprietary query language (Flux/InfluxQL) and losing JOIN capability for alert rule lookups. TimescaleDB provides continuous aggregates (automatic 1-minute and 1-hour rollups), retention policies, and standard SQL. The rest of the stack already speaks PostgreSQL.

  • HTTP Framework: Axum 0.8. Chose over Actix-web because Axum's extractor model makes authentication composable. The AuthenticatedTenant extractor resolves API keys on every protected handler without a global middleware layer. Tower's service model gives built-in CORS and body size limiting.

  • API Key Auth with DashMap + argon2. Chose over JWT because machine-to-machine IoT traffic doesn't need token refresh or expiry management. The TenantResolver caches resolved keys in a DashMap with 5-minute per-entry TTL. argon2 verification only runs on cache miss, so the hot path is an O(1) hash map lookup, not a password hash computation.

  • Batch INSERT with UNNEST arrays. Chose over per-row inserts because one round-trip to TimescaleDB per 100 readings instead of 100 round-trips. The BatchBuffer swaps the internal Vec before releasing the Mutex, so the INSERT runs without holding the lock.
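The cached-key hot path described above can be sketched with standard-library types. This is a minimal, illustrative sketch, not the engine's code: a plain `RwLock<HashMap>` stands in for DashMap, a stub stands in for argon2 verification, and the names (`TenantCache`, `verify_key_slow`, the `key-abc` tenant) are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Stand-in for the expensive verification step; the real resolver would run
/// argon2 here. The `calls` counter just makes the slow path observable.
fn verify_key_slow(api_key: &str, calls: &mut u32) -> Option<u64> {
    *calls += 1;
    // Pretend "key-abc" belongs to tenant 42.
    if api_key == "key-abc" { Some(42) } else { None }
}

struct TenantCache {
    // (tenant_id, time of insertion) per API key.
    entries: RwLock<HashMap<String, (u64, Instant)>>,
    ttl: Duration,
}

impl TenantCache {
    fn new(ttl: Duration) -> Self {
        TenantCache { entries: RwLock::new(HashMap::new()), ttl }
    }

    fn resolve(&self, api_key: &str, slow_calls: &mut u32) -> Option<u64> {
        // Hot path: an O(1) map lookup, no password-hash computation.
        if let Some((tenant, inserted)) = self.entries.read().unwrap().get(api_key) {
            if inserted.elapsed() < self.ttl {
                return Some(*tenant);
            }
        }
        // Cache miss or expired entry: run the expensive verification once,
        // then cache the result with a fresh timestamp.
        let tenant = verify_key_slow(api_key, slow_calls)?;
        self.entries.write().unwrap().insert(api_key.to_string(), (tenant, Instant::now()));
        Some(tenant)
    }
}

fn main() {
    let cache = TenantCache::new(Duration::from_secs(300)); // 5-minute TTL
    let mut slow_calls = 0u32;
    assert_eq!(cache.resolve("key-abc", &mut slow_calls), Some(42));
    assert_eq!(cache.resolve("key-abc", &mut slow_calls), Some(42));
    assert_eq!(slow_calls, 1); // second resolve hit the cache
    println!("slow path ran {slow_calls} time(s)");
}
```

The shape mirrors the design decision: verification cost is paid once per key per TTL window, and every subsequent request in that window is a map read.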

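The buffer-swap trick behind the batch INSERT can be shown in miniature. A hedged sketch with illustrative names (`BatchBuffer`, `Reading`) and a toy capacity of 3 rather than 100; the UNNEST statement itself is only indicated in a comment:

```rust
use std::mem;
use std::sync::Mutex;

#[derive(Debug)]
struct Reading { device: u32, value: f64 }

struct BatchBuffer {
    inner: Mutex<Vec<Reading>>,
    capacity: usize,
}

impl BatchBuffer {
    fn new(capacity: usize) -> Self {
        BatchBuffer { inner: Mutex::new(Vec::with_capacity(capacity)), capacity }
    }

    /// Push a reading; when the batch is full, hand the drained Vec back to
    /// the caller. The Vec is swapped out while the lock is held, so the slow
    /// database round-trip that follows never holds the Mutex.
    fn push(&self, r: Reading) -> Option<Vec<Reading>> {
        let mut buf = self.inner.lock().unwrap();
        buf.push(r);
        if buf.len() >= self.capacity {
            Some(mem::take(&mut *buf)) // lock released when `buf` drops
        } else {
            None
        }
    }
}

fn main() {
    let buffer = BatchBuffer::new(3);
    assert!(buffer.push(Reading { device: 1, value: 230.0 }).is_none());
    assert!(buffer.push(Reading { device: 1, value: 231.5 }).is_none());
    let batch = buffer.push(Reading { device: 2, value: 54.2 }).expect("batch full");
    assert_eq!(batch.len(), 3);
    // In the real engine the drained batch would feed a single statement along
    // the lines of INSERT ... SELECT * FROM unnest($1, $2, ...), one round-trip
    // per batch instead of one per row.
    println!("flushing {} readings in one round-trip", batch.len());
}
```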
Constraints That Shaped the Design

  • Input: JSON messages published to NATS subjects matching sensors.>. Each message carries a device UUID, metric name, value, unit, optional RFC 3339 timestamp, and a tenant API key. Server receive time fills in when the timestamp is missing.

  • Output: REST API with 18 endpoints serving device management, time-series queries with automatic resolution selection, alert lifecycle management, and Prometheus metrics. Alert events are published to NATS on alerts.{tenant_id}.{device_id}.{metric} and optionally forwarded to the Notification Hub and Workflow Engine.

  • Scale Handled: 5,000 readings/second sustained. At 50,000 readings/second, the single-writer batch buffer would need partitioning by tenant or device type, and the Mutex-based buffer would need a lock-free ring buffer. The 20-connection PostgreSQL pool would need bumping, and NATS would benefit from JetStream for backpressure.

  • Hard Constraints: No foreign key constraints on the readings hypertable. TimescaleDB performs better without FK enforcement at 5,000 inserts/second. Continuous aggregates refresh on a 1-minute and 1-hour schedule. Deviation detection requires at least 5 aggregate samples in the window before it fires, preventing false positives during cold start.
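The minimum-sample guard in the last constraint can be illustrated with a small pure function. This is a sketch under stated assumptions: the engine's actual deviation statistic and threshold are not specified here, so a simple mean/standard-deviation test stands in, with the 5-sample cold-start rule from the text:

```rust
/// Refuse to fire until at least 5 aggregate samples exist, preventing
/// false positives during cold start. The z-score-style test is illustrative.
const MIN_SAMPLES: usize = 5;

fn deviates(history: &[f64], latest: f64, sigmas: f64) -> bool {
    if history.len() < MIN_SAMPLES {
        return false; // cold start: not enough aggregate samples yet
    }
    let n = history.len() as f64;
    let mean = history.iter().sum::<f64>() / n;
    let var = history.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    (latest - mean).abs() > sigmas * std
}

fn main() {
    let warmup = [230.0, 231.0, 229.5, 230.5]; // only 4 samples
    assert!(!deviates(&warmup, 400.0, 3.0));   // suppressed during cold start
    let steady = [230.0, 231.0, 229.5, 230.5, 230.2];
    assert!(deviates(&steady, 400.0, 3.0));    // clear outlier fires
    assert!(!deviates(&steady, 230.4, 3.0));   // normal reading does not
    println!("ok");
}
```

Whatever statistic the detector actually uses, the guard has the same shape: a length check that short-circuits before any math runs.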

Decision Log

Decisions not already covered under Infrastructure Decisions:

| Decision | Alternative Rejected | Why |
| --- | --- | --- |
| No FK on hypertable | Foreign keys on readings | TimescaleDB hypertable INSERT throughput drops significantly with FK enforcement. Device existence is verified at ingestion time via auto-registration instead. |
| Two-phase tenant resolution | Full table scan with argon2 | First try prefix-based lookup (1 row), then fall back to scanning up to 50 legacy rows. Prevents unbounded argon2 iteration on every request. |
| Insert-before-acknowledge for NATS | Acknowledge-then-insert | If the batch INSERT fails after acknowledgment, the readings are lost. Inserting first means a failed batch is retried by NATS redelivery. |
| 15-minute default cooldown window | Per-reading alert suppression | A sensor stuck at a high value would generate 120 alerts/minute without cooldown. Database-backed cooldown survives process restarts, unlike in-memory TTL caches. |
| Auto-resolution for time-range queries | Client-selected resolution | Raw readings for ranges under 1 hour, minute aggregates for 1-24 hours, hour aggregates beyond. Prevents clients from accidentally scanning months of raw data. |
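The auto-resolution rule in the last row is effectively a pure function of the requested range. A sketch under that reading, with the behavior at the exact 1-hour and 24-hour boundaries chosen arbitrarily since the text doesn't pin it down:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Resolution { Raw, Minute, Hour }

/// Raw readings under 1 hour, minute rollups for 1-24 hours, hour rollups
/// beyond, as described in the decision table.
fn select_resolution(range: Duration) -> Resolution {
    const HOUR: u64 = 3600;
    match range.as_secs() {
        s if s < HOUR => Resolution::Raw,
        s if s <= 24 * HOUR => Resolution::Minute,
        _ => Resolution::Hour,
    }
}

fn main() {
    assert_eq!(select_resolution(Duration::from_secs(30 * 60)), Resolution::Raw);
    assert_eq!(select_resolution(Duration::from_secs(6 * 3600)), Resolution::Minute);
    assert_eq!(select_resolution(Duration::from_secs(7 * 24 * 3600)), Resolution::Hour);
    println!("ok");
}
```

Keeping this decision server-side means a dashboard asking for a 90-day chart is routed to the hour-level continuous aggregate rather than scanning raw rows.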
#rust #timescaledb #nats #axum #iot #anomaly-detection
