
Enhancing Observability

Name: Chapter 8: Enhancing Observability - The Book - DEML Platform
Author: Joe Alongi


Chapter 8: Enhancing Observability

As the operational complexity of my platform increases, the volume of telemetry generated by my services threatens to overwhelm traditional RESTful ingestion pipelines. If my primary Django web server must block on a database write every time a client logs an error or a healthcheck completes, latency compounds under load and failures can cascade through the system. To architect for resilience and scale, I must decouple telemetry ingestion from my critical transactional path. To achieve this event-driven architecture, built on Event Projections and hardened for production reliability, I added:

Transactional Outbox: Django endpoints write events to an OutboxEvent model inside Postgres transactions. A dedicated outbox_relay management command (run as cron or daemon) reliably publishes them to Redpanda.
Client events flow through a Firebase Cloud Functions gateway (the ingestEvent HTTPS callable, which takes a version and an idempotency_key).
The Django telemetry_worker now performs idempotent projections (using stable keys and dedup tracking in Firestore) with support for a dead-letter queue (frontend-events-dlq). It builds materialized read models in Firestore (in the database named deml).

The ingestEvent function attempts to publish each event to Redpanda (the frontend-events topic) or falls back to Firestore. Combined with the idempotency_key, this provides at-least-once delivery with deduplication.
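The publish-or-fallback behavior can be sketched in a few lines. This is a minimal illustration in Python, not the actual ingestEvent function (which runs in Firebase Cloud Functions): `ingest_event`, `publish_primary`, and `write_fallback` are hypothetical names, with the two callables standing in for a Redpanda producer and a Firestore write.

```python
import json
import uuid


def ingest_event(event, publish_primary, write_fallback):
    """Try the primary broker first; on failure, store the event in the fallback.

    `publish_primary` and `write_fallback` are hypothetical callables standing
    in for a Redpanda producer and a Firestore write.
    """
    envelope = {
        "version": event.get("version", 1),
        # A stable key lets consumers drop duplicate deliveries later.
        "idempotency_key": event.get("idempotency_key") or str(uuid.uuid4()),
        "payload": event["payload"],
    }
    body = json.dumps(envelope).encode("utf-8")
    try:
        publish_primary("frontend-events", body)
        return "published", envelope
    except Exception:
        # A client retry may deliver the same event twice, which is why
        # delivery is at-least-once and consumers must deduplicate.
        write_fallback(envelope["idempotency_key"], envelope)
        return "fallback", envelope


# Demo: simulate a broker outage and confirm the event lands in the fallback.
fallback_store = {}
published = []


def broken_broker(topic, body):
    raise ConnectionError("broker unavailable")


event = {"idempotency_key": "evt-1", "version": 1, "payload": {"type": "click"}}
status_down, _ = ingest_event(event, broken_broker, fallback_store.__setitem__)
status_up, _ = ingest_event(
    event, lambda topic, body: published.append(topic), fallback_store.__setitem__
)
```

Because the envelope always carries an idempotency_key, the same event can safely arrive through both paths and still be applied only once downstream.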

flowchart LR
    A[Angular Frontend] -->|Client Events| FCF[Firebase Cloud Functions<br/>ingestEvent]
    FCF -->|Produce or Fallback| C[(Redpanda + Firestore deml)]
    C -->|Consume + Enrich| D[Django Telemetry Worker]
    D -->|Write Materialized State| FS[("Firestore<br/>users/{uid}/data/stats")]

    subgraph Observability
    F[OTel Collector] -->|Traces| G[(ClickHouse)]
    end

Event Projections Pattern (with Reliability Enhancements):

Commands: Angular → Firebase Functions → Redpanda (or Firestore). The Django side writes events to the Outbox atomically, and outbox_relay publishes them reliably.
Projections: telemetry_worker (idempotent, with a DLQ) enriches events and writes read models to Firestore deml (e.g., active endpoints sourced from Postgres).
Queries: Angular subscribes directly to Firestore projections via onSnapshot.
Events are versioned; projections support replay and snapshots for recovery.
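The idempotent projection step can be illustrated with a small in-memory sketch. It assumes a hypothetical event shape (`idempotency_key`, `uid`, `type`) and uses plain Python containers in place of Firestore and the frontend-events-dlq topic, so it shows the pattern rather than the worker's actual code.

```python
from collections import defaultdict


class StatsProjection:
    """In-memory stand-in for the read model, dedup tracking, and DLQ."""

    def __init__(self):
        # uid -> event type -> count (the materialized read model)
        self.stats = defaultdict(lambda: defaultdict(int))
        self.seen_keys = set()    # idempotency keys already applied
        self.dead_letters = []    # events that could not be projected

    def apply(self, event):
        try:
            key = event["idempotency_key"]
            uid = event["uid"]
            event_type = event["type"]
        except (KeyError, TypeError) as exc:
            # Malformed events are parked instead of blocking the stream.
            self.dead_letters.append({"event": event, "error": repr(exc)})
            return "dead-lettered"
        if key in self.seen_keys:
            return "duplicate"    # redelivery: state is left unchanged
        self.stats[uid][event_type] += 1
        self.seen_keys.add(key)
        return "applied"


projection = StatsProjection()
events = [
    {"idempotency_key": "k1", "uid": "u1", "type": "page_view"},
    {"idempotency_key": "k1", "uid": "u1", "type": "page_view"},  # redelivered
    {"uid": "u1", "type": "page_view"},                           # missing key
]
results = [projection.apply(e) for e in events]
```

Replaying the same event log into a fresh `StatsProjection` yields the same counts, which is the property that makes replay-based recovery safe.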

The data flow for client events begins at the perimeter in the Firebase Function, which handles auth context natively. For other telemetry and integrations, Django/Ninja endpoints still act as producers to Redpanda (via the Outbox). Together, the function, the worker, and the outbox relay drive the Event Projections loop. Its health is continuously verified by a synthetic probe in the telemetry worker and surfaced as the "Event Projections" component on the public platform-status page. The relay ensures that no events are lost across restarts, and projections are idempotent.
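The outbox-and-relay mechanism described above can be sketched with the standard library. This is an assumption-heavy illustration: it uses sqlite3 in place of Postgres and the Django ORM, and the `healthcheck` and `outbox_event` tables and the `record_healthcheck` and `relay_once` functions are hypothetical names, not the platform's actual schema or management command.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE healthcheck (id INTEGER PRIMARY KEY, endpoint TEXT, ok INTEGER);
    CREATE TABLE outbox_event (
        id INTEGER PRIMARY KEY,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def record_healthcheck(endpoint, ok):
    """Write the domain row and its outbox event in a single transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO healthcheck (endpoint, ok) VALUES (?, ?)",
            (endpoint, int(ok)),
        )
        conn.execute(
            "INSERT INTO outbox_event (topic, payload) VALUES (?, ?)",
            ("app-events", json.dumps({"endpoint": endpoint, "ok": ok})),
        )


def relay_once(publish):
    """Publish pending events in order; mark each one only after it is sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox_event WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload.encode("utf-8"))  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE outbox_event SET published = 1 WHERE id = ?", (event_id,)
            )
    return len(rows)


sent = []
record_healthcheck("/api/status", True)
record_healthcheck("/api/users", False)
first_pass = relay_once(lambda topic, body: sent.append((topic, body)))
second_pass = relay_once(lambda topic, body: sent.append((topic, body)))
```

If the relay crashes after publishing but before marking a row, that event is published again on the next pass. The relay is therefore at-least-once, which is why the projections downstream need to be idempotent.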

Rather than interacting directly with PostgreSQL for every event, the system uses Redpanda + Firestore for high-throughput, non-blocking asynchronous execution. The backend (Django + Functions) acts as a lightweight proxy layer. It accepts the incoming payload, fires the event, and returns quickly.

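# Example Django path (still used for certain telemetry and integrations).
# The primary client event path now routes through Firebase Cloud Functions (ingestEvent),
# which publishes to "frontend-events" (or falls back to Firestore).

import asyncio
import json

from aiokafka import AIOKafkaProducer
from ninja import Router

router = Router()
_producer = None  # created lazily so it binds to the server's running event loop
_producer_lock = asyncio.Lock()

async def get_producer():
    # Start one shared producer on first use instead of once per request.
    global _producer
    async with _producer_lock:
        if _producer is None:
            producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
            await producer.start()
            _producer = producer
    return _producer

@router.post("/telemetry/endpoints")
async def post_telemetry(request, payload: dict):
    producer = await get_producer()
    # send_and_wait returns once the broker has acknowledged the message.
    await producer.send_and_wait("app-events", json.dumps(payload).encode("utf-8"))
    return {"status": "accepted"}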

Downstream, an isolated background worker subscribes to the app-events topic. This worker consumes the raw messages and uses the Polars library to batch-process and transform the data before persisting the aggregated metrics. To gain visibility into code-level failures, I integrate Sentry for full-stack error tracking, capturing stack traces across both the TypeScript and Python environments. I augment this with Semgrep, which runs automated static-analysis scans in my CI/CD pipelines to catch vulnerabilities in my ingestion code.

OpenTelemetry and ClickHouse Integration

While Redpanda expertly handles my custom application events, standardizing my broader distributed tracing and infrastructure metrics requires an industry-standard protocol. Therefore, I have deeply integrated OpenTelemetry (OTel) across my entire stack, working in tandem with ClickHouse as my primary analytical datastore.

My application services and underlying infrastructure emit OTLP telemetry over gRPC and HTTP. An independent OpenTelemetry Collector receives this traffic, then processes, filters, and batches the high-volume traces before exporting them directly into ClickHouse. As a columnar database engineered for Online Analytical Processing (OLAP) workloads, ClickHouse excels at rapid data aggregation and time-series queries. This architectural decision lets my observability infrastructure scale independently of the application, so that complex, multi-service distributed traces can be queried quickly without placing additional load on my primary PostgreSQL transactional database.