Name: Concept of Operations (CONOPS) - The Book - DEML Platform
Author: Joe Alongi

Concept of Operations (CONOPS)

This section is the single operational narrative for the DEML platform: who uses it, how it runs in production, which technologies execute each responsibility, and what operators do when things degrade. It reflects the 2026 Event Projections architecture (Firebase command gateway, Redpanda broker, Django workers, Firestore read models, Google Cloud compute, GCP security controls). Detailed checklists live in Appendix C, Appendix D, and docs/conops.md.

1. Purpose & Scope

The DEML platform is a multi-tenant observability and machine-learning SaaS. Operators, security engineers, and integrators use it to ingest telemetry, publish status pages, forecast SLAs, evaluate threat anomalies, and share STIX 2.1 indicators. This CONOPS covers:

Normal steady-state operations across all production services
User-facing workflows (anonymous visitors, account owners, API integrators)
Internal data paths (commands, projections, queries, batch ML)
Deployment boundaries (Cloud Run, Firebase, GCP, Hugging Face)
Maintenance cadence, monitoring, and degraded-mode behavior

Out of scope: local developer onboarding (see Chapter 1 and Appendix E), and deep algorithmic derivations (see Whitepaper).

2. Mission & Operational Objectives

Objective	How the platform achieves it
Reliable telemetry ingestion	Non-blocking command path via `ingestEvent` → Redpanda; Django Transactional Outbox for API-origin events; idempotent `telemetry_worker` projections
Low-latency dashboards	Materialized read models in Firestore (`deml` DB); Angular `onSnapshot` on `users/{uid}/data/stats`
Account isolation	Postgres tenancy by `UserProfile.account_id`; Firestore rules scoped to `request.auth.uid`; symmetrical worker loops per account + `platform` sentinel
Predictive intelligence	Daily `ml_worker` retraining on anonymized aggregate data; per-account inference without cross-tenant raw leakage
Transparent public status	`platform-status` dogfoods the stack under real load; customer pages gated by `is_published` ABAC
Audit-ready security	Firebase Auth + MFA on writes; GCP KMS envelope encryption; immutable GCS audit logs; continuous Semgrep/Trivy/Renovate

3. System Overview

The platform separates commands (writes), projections (derived state), and queries (reads):

flowchart TB
    subgraph Surfaces
        M[Astro Marketing Site]
        A[Angular App deml.app]
        API[Integration API Keys]
    end

    subgraph Commands
        FCF[Firebase Cloud Functions ingestEvent]
        DJ[Django REST + OutboxEvent]
    end

    subgraph Bus
        RP[Redpanda frontend-events / DLQ]
        OR[outbox_relay 5s cadence]
    end

    subgraph Projections
        TW[telemetry_worker Polars + ORM]
        FS[(Firestore deml)]
    end

    subgraph Truth
        PG[(PostgreSQL)]
        CH[(ClickHouse OLAP)]
    end

    M -->|Auth handoff| A
    A -->|Callable + JWT REST| FCF
    A -->|REST| DJ
    API -->|Bearer API key| DJ
    FCF -->|Try publish| RP
    FCF -->|Fallback| FS
    DJ -->|Atomic write| PG
    OR -->|Publish| RP
    DJ -.->|Outbox rows| PG
    RP --> TW
    TW --> FS
    TW --> PG
    A -.->|onSnapshot| FS
    DJ -->|OTLP| CH

Authoritative stores: PostgreSQL holds transactional truth (users, status pages, incidents, API keys, outbox). Firestore holds projected real-time stats optimized for client subscriptions. ClickHouse holds OLAP traces and CES analytics. Redpanda is the durable command bus—not a system of record.

4. Operational Environment

Layer	Provider	Responsibility
Compute & data plane	Cloud Run	Django API, Angular SSR, Postgres, Redpanda, ClickHouse, Dragonfly, all background workers, scanner, OTEL collector, Tor proxy
Client command gateway	Firebase Cloud Functions	`ingestEvent` callable with native Auth context
Identity	Firebase Authentication	Email/OAuth/MFA; JWT verified by Django middleware
Real-time read models	Firestore (named DB `deml`)	Projected stats; security rules enforce per-user isolation
Marketing hosting	Firebase Hosting	Astro `marketing/dist` at `dataengineeringformachinelearning.com`
Cryptography & audit	Google Cloud (Terraform)	KMS envelope keys, immutable audit log bucket, service accounts
Secrets	Infisical (recommended)	Runtime secret injection; SOC 2 / CMMC alignment
Model artifacts	Hugging Face Hub	Namespaced `.pt` state dict uploads
Content	Sanity.io	Incident narratives decoupled from Django

Cross-site URL trio (env-driven everywhere): FRONTEND_URL (https://deml.app), BACKEND_URL (https://backend.deml.app), MARKETING_URL (https://dataengineeringformachinelearning.com).
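A minimal sketch of how a service might resolve the URL trio from its environment. The variable names and production defaults come from the line above; the `SiteUrls` class and `load_site_urls` helper are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse

# Defaults mirror the production values quoted above.
DEFAULTS = {
    "FRONTEND_URL": "https://deml.app",
    "BACKEND_URL": "https://backend.deml.app",
    "MARKETING_URL": "https://dataengineeringformachinelearning.com",
}


@dataclass(frozen=True)
class SiteUrls:
    frontend: str
    backend: str
    marketing: str


def load_site_urls(env=None) -> SiteUrls:
    """Resolve the three cross-site URLs from an environment mapping."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in DEFAULTS.items():
        value = env.get(name, default).rstrip("/")
        parsed = urlparse(value)
        # Reject values that are not absolute http(s) URLs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{name} must be an absolute http(s) URL, got {value!r}")
        resolved[name] = value
    return SiteUrls(
        frontend=resolved["FRONTEND_URL"],
        backend=resolved["BACKEND_URL"],
        marketing=resolved["MARKETING_URL"],
    )
```

Failing fast on a malformed value keeps a bad deploy variable from surfacing later as a broken auth handoff or a CORS error.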

5. Operational Modes

Mode	Description	Operator actions
Normal	All Cloud Run services healthy; Redpanda reachable from Functions; projections flowing to Firestore	Monitor CES gauges, Sentry, GCP metrics; check the "Event Projections" component on platform-status
Degraded (Redpanda unreachable from Functions)	`ingestEvent` writes fallback rows to the Firestore `events` collection; `telemetry_worker` keeps consuming from the broker as long as the internal path works	Confirm `REDPANDA_BROKERS` uses the public endpoint for Functions, or accept the Firestore fallback; check `frontend-events-dlq` depth
Degraded (worker stalled)	Firestore projections go stale; Postgres outbox rows may accumulate	Restart `deml-telemetry-worker` and `deml-relay`; inspect the DLQ topic; replay messages with their stable idempotency keys
Maintenance	Migrations, dependency upgrades, model retraining	Cloud Run rolling deploy on `main` merge; Firebase workflow deploys Functions/rules independently
Incident / public comms	Outage or degradation visible to users	Publish via Sanity; `platform-status` remains world-readable; unpublished customer pages stay private

6. User Roles & Operational Workflows

The platform uses a User + Sites model—one Firebase login, many StatusPage records, no org hierarchies (Chapter 28).

Actor	Primary workflows
Anonymous visitor	Browse published status pages and `platform-status`; `/explore` directory; no PII beyond CDN logs
Account owner (`Operator`)	Firebase login → Django profile provisioned; create status pages (MFA required); configure integrations; run Event Projections verification
Viewer	Read-only Settings and dashboards; API returns `403` on mutations
Security Admin	Platform bootstrap account; same write surface as Operator for owned resources
API integrator	`Authorization: Bearer <API_KEY>` on `/api/v1/ingest` and `/api/v1/predict`; scoped to `account_id`
Platform operator (you)	GCP dashboard, Firebase console, GCP KMS/logs, GitHub Actions, Infisical, internal vulnerability Kanban

Typical owner session: Marketing site → auth handoff → Angular dashboard → client events fire ingestEvent → stats appear via Firestore subscription → REST calls for configuration and ML endpoints.

Typical integration session: External pipeline POSTs batched telemetry to /api/v1/ingest → Django writes business state + OutboxEvent atomically → outbox_relay publishes → worker projects enriched aggregates.
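The integration session can be sketched from the caller's side as follows. The endpoint path, the Bearer header, and the backend host come from this CONOPS; the `{"events": [...]}` body shape, the field names inside each event, and the `build_ingest_request` helper are assumptions for illustration, not the documented request schema.

```python
import json
import urllib.request

BACKEND_URL = "https://backend.deml.app"  # BACKEND_URL from the URL trio in section 4


def build_ingest_request(api_key, events, backend_url=BACKEND_URL):
    """Build (but do not send) a batched POST to /api/v1/ingest."""
    if not api_key:
        raise ValueError("api_key is required")
    # ASSUMPTION: this envelope is illustrative, not the documented schema.
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url=f"{backend_url}/api/v1/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical event fields; a real pipeline would batch many rows per call.
req = build_ingest_request("example-key", [{"metric": "latency_ms", "value": 42}])
```

Sending is then a single `urllib.request.urlopen(req, timeout=10)` call. Keeping construction separate from sending makes the request easy to inspect in tests without touching the network.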

7. Command, Control & Data Flows

Client command path (primary):

1. Angular calls Firebase callable ingestEvent with version: "1.0" and generated idempotency_key.
2. Function validates context.auth; partitions Kafka messages by uid.
3. On broker success: message lands on frontend-events; function returns accepted immediately.
4. On broker failure: fallback document written to Firestore events (clients cannot read this collection per rules).
5. telemetry_worker consumes, deduplicates via stable keys, enriches from Postgres, writes users/{uid}/data/stats.
6. Angular FirestoreService.getRealtimeStats() streams updates via onSnapshot.
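The gateway half of this path (auth check, partitioning by uid, broker-first publish with a fallback write) can be sketched in plain Python. The production function runs on Firebase Cloud Functions; `InMemoryBroker`, `BrokerUnavailable`, the hash-based partitioner, and the shape of the return value are stand-ins assumed for illustration.

```python
import hashlib
import uuid


class BrokerUnavailable(Exception):
    """Raised by the stand-in broker when a publish attempt fails."""


class InMemoryBroker:
    """Stand-in for Redpanda: one topic with a fixed number of partitions."""

    def __init__(self, partitions=3, available=True):
        self.partitions = [[] for _ in range(partitions)]
        self.available = available

    def publish(self, key, message):
        if not self.available:
            raise BrokerUnavailable("broker unreachable")
        # Stable hash so every event for one uid lands on the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.partitions)
        self.partitions[index].append(message)
        return index


def ingest_event(auth_uid, payload, broker, fallback_store):
    """Accept a client command, preferring the broker over the fallback store."""
    if auth_uid is None:
        raise PermissionError("unauthenticated")
    message = {
        "version": "1.0",
        "uid": auth_uid,
        "idempotency_key": payload.get("idempotency_key") or str(uuid.uuid4()),
        "data": payload.get("data", {}),
    }
    try:
        broker.publish(auth_uid, message)
        return {"status": "accepted", "path": "broker"}
    except BrokerUnavailable:
        # Fallback: keep the command durable even though the broker is down.
        fallback_store.append(message)
        return {"status": "accepted", "path": "fallback"}
```

Partitioning by uid preserves per-user ordering, and the idempotency key travels with the message so that a later retry or replay can be recognized downstream.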

Django command path (integrations & legacy):

1. Authenticated REST handler mutates Postgres inside a transaction.
2. OutboxEvent row inserted in the same transaction.
3. outbox_relay (every 5s) publishes to Redpanda; same worker pipeline applies.
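The Transactional Outbox pattern behind these steps can be shown with the standard-library `sqlite3` module standing in for Postgres and the Django ORM. Table names, column names, and the `create_status_page` / `relay_once` helpers are hypothetical; only the pattern itself (business row and outbox row in one transaction, then a separate relay) is taken from the text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE status_page (id INTEGER PRIMARY KEY, account_id TEXT, title TEXT);
    CREATE TABLE outbox_event (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def create_status_page(conn, account_id, title):
    """Write the business row and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "INSERT INTO status_page (account_id, title) VALUES (?, ?)",
            (account_id, title),
        )
        event = {
            "type": "status_page.created",
            "account_id": account_id,
            "page_id": cur.lastrowid,
        }
        conn.execute(
            "INSERT INTO outbox_event (payload) VALUES (?)", (json.dumps(event),)
        )


def relay_once(conn, publish):
    """One relay tick: publish pending rows in insertion order, then mark them."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox_event WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE outbox_event SET published = 1 WHERE id = ?", (row_id,)
            )
        sent += 1
    return sent
```

A crash between `publish` and the `UPDATE` would cause the same event to be published again on the next tick. That at-least-once behavior is why the worker side deduplicates on stable keys.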

Query path: Clients never poll Postgres for live stats; they subscribe to Firestore projections. Historical analytics and CES use ClickHouse via backend APIs.

8. Deployment Topology & Service Matrix

Production runs 14 Cloud Run services (see Chapter 22). Core operational paths:

Service	Operational role
`deml-frontend`	Angular app, widgets, public status UI
`deml-backend`	Django REST, auth middleware, billing, outbox writers
`deml-postgres`	System of record (supports Neon serverless PostgreSQL)
`deml-queue`	Redpanda (`deml-queue.internal:9092` for inter-service traffic)
`deml-telemetry-worker`	Projection engine + pingers + analytics rollups
`deml-relay`	Reliable outbox publisher
`deml-workers`	Consolidated ML training, threat intel, and cron task consumers
`deml-clickhouse`	OLAP analytics and historical telemetry storage
`deml-dragonfly`	Rate limiting and hot caches
`deml-scanner` + `deml-cpe-guesser`	Vulnerability ledger enrichment
`deml-tor-proxy`	OSINT dark-web routing

Firebase deploy path (separate from Cloud Run): .github/workflows/firebase-backend-deploy.yml ships Cloud Functions + Firestore rules; firebase-hosting-*.yml ships marketing. Never point Cloud Run services at public broker URLs for internal traffic; use *.internal (Appendix C).
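The internal-broker rule lends itself to a startup or CI check. The sketch below assumes `REDPANDA_BROKERS` holds a comma-separated list of `host:port` pairs, which is the usual Kafka bootstrap convention but is not confirmed by this document; `internal_broker_violations` is a hypothetical helper.

```python
def internal_broker_violations(brokers):
    """Return broker addresses whose host is not under the .internal suffix."""
    violations = []
    for address in brokers.split(","):
        address = address.strip()
        if not address:
            continue
        # Drop the port, if any, and inspect only the host part.
        host = address.rsplit(":", 1)[0]
        if not host.endswith(".internal"):
            violations.append(address)
    return violations
```

A Cloud Run service could refuse to start when this list is non-empty, while the Firebase Functions deployment would skip the check because it legitimately uses the public endpoint.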

9. Security Operations

Perimeter: Firebase App Check + reCAPTCHA; TLS 1.3 everywhere; strict CSP on marketing (firebase.json).
Authentication: JWT verification in FirebaseAuthenticationMiddleware; MFA enforced on writes via amr claim.
Authorization: RBAC (Viewer / Operator / Security Admin) + ABAC (is_published, ownership, platform-status immutability).
Data protection: AES-256-GCM field encryption; DEK rotation every 30 days; GCP KMS envelope (Chapter 10).
Supply chain: Pre-commit + GitHub Actions (Semgrep, Trivy, Gitleaks, Renovate); internal Kanban for vulns (Chapter 21).
Compliance posture: Architected for SOC 2 Type II, CMMC 2.0 Level 2, NIST SP 800-171 Rev. 3 (Chapter 23).
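How the RBAC, ABAC, and MFA rules above combine can be sketched as two pure functions. The role names, `is_published`, and the `amr` claim come from this section; the `Principal` and `StatusPage` classes, the `"mfa"` value checked inside `amr`, and the reading of platform-status immutability as "no writes at all" are assumptions for illustration.

```python
from dataclasses import dataclass

ROLES_WITH_WRITE = {"Operator", "Security Admin"}


@dataclass
class Principal:
    uid: str
    role: str        # "Viewer", "Operator", or "Security Admin"
    amr: tuple = ()  # authentication methods from the verified token


@dataclass
class StatusPage:
    slug: str
    owner_uid: str
    is_published: bool


def can_read(principal, page):
    """Published pages are world-readable; unpublished pages only for the owner."""
    if page.is_published:
        return True
    return principal is not None and principal.uid == page.owner_uid


def can_write(principal, page):
    """Writes need a write-capable role, an MFA-backed session, and ownership."""
    if principal is None or principal.role not in ROLES_WITH_WRITE:
        return False
    if "mfa" not in principal.amr:  # ASSUMPTION: MFA is signalled by this value
        return False
    if page.slug == "platform-status":
        return False  # ASSUMPTION: immutability modeled as "no writes at all"
    return principal.uid == page.owner_uid
```

Keeping the decision in side-effect-free functions makes every row of the access matrix easy to unit test before it is wired into middleware.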

10. Threat-Driven Design and Defendable Architecture Principles

Important: Foundational Frameworks and Key References

DEML's security architecture is guided by two Lockheed Martin white papers from the Intelligence Driven Defense® program:

A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) — prioritizes adversary objectives over compliance-only checklists; introduces the IDDIL/ATC workflow, STRIDE-LM categorization, and the functional control hierarchy applied in this section.

Defendable Architectures (Fitch & Muckin, 2019) — defines build-time requirements for Visibility, Manageability, and Survivability that map to Event Projections telemetry, automated worker cadence, and Outbox/DLQ degraded modes.

Together, the pair links what adversaries are doing (threat analysis) with how systems must be engineered (defendable characteristics). That pairing suits a multi-tenant detection platform where ingest paths, model endpoints, and tenant boundaries are active attack surfaces. Full bibliographic citations: Appendix L.

Modern data and ML platforms are not passive repositories—they are detection and response surfaces. Adversaries target telemetry pipelines, model endpoints, and tenant boundaries because those paths carry high-value signals and privileged access. A compliance-first checklist or a vulnerability-first patch queue alone cannot keep pace with that reality. Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) argues that defenders must prioritize threats over compliance artifacts or isolated CVEs: identify what adversaries are trying to achieve, then engineer controls that interrupt those objectives. The companion framework Defendable Architectures (Fitch & Muckin, 2019) translates that mindset into build-time requirements—systems must be explicitly designed for Visibility, Manageability, and Survivability so operators can execute Intelligence Driven Defense at scale. DEML adopts both frameworks as operational doctrine, not slide-deck vocabulary: every production path in this CONOPS is shaped to make adversary behavior observable, operator response fast, and degraded operation survivable.

Visibility

Visibility means the platform exposes enough trustworthy signal—across commands, projections, queries, and batch ML—to detect misuse, misconfiguration, and attack progression without guessing. DEML achieves this through layered telemetry rather than a single dashboard.

The Event Projections loop is the primary visibility spine: client commands (ingestEvent, Django Outbox → outbox_relay) land on Redpanda; telemetry_worker enriches from Postgres and materializes Firestore read models while emitting OpenTelemetry traces to ClickHouse. Operators do not infer pipeline health from user complaints—the Event Projections synthetic probe on platform-status continuously validates end-to-end flow. Network traffic enrichment (Chapter 20) adds ASN, GeoIP, UA parsing, and behavioral context at the edge. Threat feeds ingested hourly by security_worker (Chapter 13) fuse external IoCs with internal telemetry before the ThreatModel scores access risk. The CES dashboard (Chapter 25) distills Threat Level, SLA Level, and Stableness into a single operational gauge. Sentry, GCP Logging, and immutable GCS audit logs complete the picture for release regressions and compliance evidence. Visibility is incomplete if it is tenant-blind: symmetrical worker loops and strict account_id / Firestore rule scoping ensure every signal is attributable.

Manageability

Manageability means operators can change posture, deploy fixes, rotate secrets, and tune models without architectural surgery—controls are centralized, automated, and repeatable across tenants including Tenant0 dogfood.

Automation is the manageability engine. outbox_relay (5s cadence) and telemetry_worker run continuously; ml_worker and security_worker consume Kafka tasks on schedule—retraining SLA/threat models daily and refreshing AbuseIPDB / OTX feeds hourly (Chapter 24). Pre-commit hooks, Semgrep, Trivy, Renovate, and the internal vulnerability Kanban (Chapter 21) turn supply-chain findings into tracked remediation without manual triage drift. RBAC + ABAC (Chapter 28) and GCP KMS envelope rotation (Chapter 10) are managed through documented APIs and workers—not ad hoc SQL. CI/CD splits Cloud Run and Firebase deploy paths so Functions, rules, and backend services ship independently (§14). Integration health endpoints (/api/v1/integrations/{platform}) and the service matrix in §8 give operators a single map of what to restart, scale, or roll back. Manageability fails when tenants are exceptions; DEML's symmetrical pipelines guarantee that a control applied to one account applies to all.

Survivability

Survivability means the platform continues its mission under stress—broker outages, worker stalls, crypto failures, or active attack—without silent data loss or unbounded blast radius.

DEML engineers survivability into the command path itself. When Redpanda is unreachable from Firebase Functions, ingestEvent falls back to Firestore events while internal services continue consuming via the private broker (§5). The Transactional Outbox ensures API-origin events are never published without a durable Postgres record. telemetry_worker idempotency keys and the frontend-events-dlq topic prevent poison messages from stalling the entire projection fleet—operators replay with stable keys after fixing enrichment logic (§13). Multi-tenant isolation (Postgres account_id, Firestore security rules, Hugging Face namespaced model artifacts) contains compromise: one tenant's incident does not become another's data leak. Sanity-backed status communications (Chapter 14) survive primary backend outages. Daily ml_worker retraining loops keep threat and SLA models current even as attack patterns shift. Survivability is not "always up"; it is graceful degradation with recoverable state and explicit operator runbooks in docs/conops.md.
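The interaction between idempotency keys and the dead-letter queue can be illustrated with a small in-memory projection loop. The message fields and the `project_batch` function are hypothetical; the real `telemetry_worker` uses Polars and the ORM, which this sketch does not attempt to reproduce.

```python
def project_batch(messages, stats, seen_keys, dlq):
    """Apply each message once; park failures in the DLQ instead of stopping."""
    for message in messages:
        key = message.get("idempotency_key")
        if key is None:
            dlq.append(message)  # cannot deduplicate without a key
            continue
        if key in seen_keys:
            continue  # duplicate delivery or replay: already projected
        try:
            uid = message["uid"]
            count = int(message["data"]["count"])
        except (KeyError, TypeError, ValueError):
            dlq.append(message)  # poison message: park it, keep the batch moving
            continue
        stats[uid] = stats.get(uid, 0) + count
        seen_keys.add(key)
    return stats
```

Because a key is recorded only after a successful projection, replaying a whole batch after a fix applies just the messages that previously failed and leaves the others untouched.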

Virtuous Knowledge Cycle. Threat-driven design is not a one-time architecture review—it is a closed loop. Design phases prioritize adversary objectives and map them to Visibility / Manageability / Survivability controls. Build phases encode those controls in Event Projections, workers, encryption, and access matrices. Run phases generate telemetry, CES scores, DLQ depth, and threat-intel matches that validate—or falsify—design assumptions. Defend phases feed incident outcomes, new IoCs, and model false-positive rates back into the next design iteration. Each lap tightens detection fidelity, reduces operator toil, and hardens degraded-mode behavior. The platform dogfoods this cycle on Tenant0 (platform-status) before any control reaches customer tenants.

Applying the IDDIL/ATC Threat Analysis Methodology

Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) provides a repeatable threat-analysis workflow that complements the Visibility / Manageability / Survivability principles above. The methodology splits work into two phases: IDDIL (discovery) and ATC (implementation). A mnemonic anchors the sequence: "There are no idle threats – they attack." The point is that threats are not hypothetical backlog items: they are adversary objectives that will be exercised against your pipeline unless you discover them, prioritize them, and implement controls that interrupt them. For data-engineering and ML detection platforms, that means treating every ingest path, model endpoint, and tenant boundary as an active attack surface, not a future hardening ticket.

Use IDDIL/ATC whenever you onboard a new integration, stand up a customer detection pipeline, or reassess an existing worker after an incident. The steps below are written so a reader can run the same playbook on their own stack; each includes a DEML example (how Tenant0 dogfoods the step) and a pipeline-builder example (how a typical customer threat-models a detection workflow on top of the platform).

Discovery Phase (IDDIL)

Identify the Assets. Catalog business assets (data and functionality required for mission success) separately from security assets (what adversaries covet). Business assets for DEML include tenant-scoped telemetry, trained ThreatModel weights, and Firestore projection read models that power live dashboards. Security assets include integration API keys (encrypted at rest), Postgres OutboxEvent rows, and Hugging Face model artifacts namespaced by tenant hash. DEML example: During CONOPS reviews, operators maintain an asset register tied to §8—each Cloud Run service, broker topic, and Firestore collection is tagged with owner, retention, and classification. Pipeline-builder example: A customer ingesting batch features via /api/v1/ingest should list (1) their source datasets, (2) derived aggregates consumed by downstream ML jobs, and (3) attacker targets such as spoofed ingest payloads or exfiltration of enriched threat scores from /api/v1/predict.

Define the Attack Surface. Map every component that touches, transports, or exposes the assets identified above. Produce a data-flow diagram (DFD) or equivalent showing trust boundaries. DEML example: The CONOPS command path in §7 is the canonical attack-surface diagram—Angular → Firebase ingestEvent → Redpanda frontend-events → telemetry_worker → Firestore users/{uid}/data/stats, plus the parallel Django REST → Outbox → outbox_relay path for integrations. Trust boundaries sit at Firebase Auth, Postgres transaction commits, and Firestore security rules. Pipeline-builder example: Draw boundaries between the customer's ETL cluster, DEML's /api/v1/ingest endpoint, and their internal model-serving tier. Mark where credentials cross networks and where unauthenticated read paths exist.
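A data-flow diagram does not need a drawing tool to be useful. A minimal sketch, assuming a flat mapping of components to trust zones, is enough to list the flows that cross a boundary. The flows follow the command paths in §7; the zone names and the `boundary_crossings` helper are illustrative assumptions.

```python
# Each node is assigned to a trust zone; zone names are illustrative assumptions.
ZONES = {
    "angular_app": "client",
    "ingestEvent": "firebase",
    "firestore_stats": "firebase",
    "redpanda": "private_network",
    "telemetry_worker": "private_network",
    "django_rest": "private_network",
    "postgres_outbox": "private_network",
    "outbox_relay": "private_network",
    "integration_client": "external",
}

# Directed data flows taken from the command and query paths in section 7.
FLOWS = [
    ("angular_app", "ingestEvent"),
    ("ingestEvent", "redpanda"),
    ("redpanda", "telemetry_worker"),
    ("telemetry_worker", "firestore_stats"),
    ("firestore_stats", "angular_app"),
    ("integration_client", "django_rest"),
    ("django_rest", "postgres_outbox"),
    ("postgres_outbox", "outbox_relay"),
    ("outbox_relay", "redpanda"),
]


def boundary_crossings(flows, zones):
    """Return the flows whose endpoints sit in different trust zones."""
    return [(src, dst) for src, dst in flows if zones[src] != zones[dst]]
```

Each returned pair is a place where credentials or data cross a trust boundary, so each one deserves an explicit authentication and validation decision in the attack-vector step.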

Decompose the System. Break the attack surface into layers: protocols, APIs, libraries, workers, and security functions (inventory, collect, detect, protect, manage, respond). Note existing controls and their effectiveness ratings. DEML example: Decomposition follows the Event Projections stack—ingestEvent callable (collect), NetworkTelemetryMiddleware + edge enrichment (detect), AES-256-GCM + KMS envelope (protect), security_worker hourly IoC refresh (manage), and DLQ replay runbooks (respond). Each layer links to a chapter: enrichment in Chapter 20, intel fusion in Chapter 13. Pipeline-builder example: Decompose a Spark → DEML ingest job into (a) credential storage, (b) batch serialization format, (c) retry/idempotency behavior, and (d) the customer's own anomaly-scoring model—identifying which layer owns validation vs. detection.

Identify Attack Vectors. Document paths an adversary could traverse to reach target assets, including multiple techniques per pathway. Categorize threats using STRIDE-LM and incorporate current threat intelligence. DEML example: Enumerated vectors include JWT forgery against Django REST (Spoofing), cross-tenant IDOR via predictable IDs (Information Disclosure, mitigated by UUID PKs), broker poisoning on frontend-events (Tampering), model inversion against /api/v1/predict (Information Disclosure), and credential stuffing against Firebase Auth (Spoofing). Attack trees for the ingest path note that a compromised integration key allows arbitrary event injection until ABAC and rate limits (deml-dragonfly) throttle the source; a foothold in one tenant's projection worker must not become Lateral Movement into another tenant's Firestore read models. Pipeline-builder example: A customer's detection pipeline faces vectors such as training-data poisoning (Tampering), label-flip attacks on feedback loops (Tampering), and replay of captured ingest payloads (Repudiation)—each mapped to a specific hop in their DFD and tagged with a STRIDE-LM category.

STRIDE-LM Threat Categorization

Microsoft's original STRIDE model (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) remains one of the most practical ways to label threats during design reviews. Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) extends STRIDE with Lateral Movement (LM)—the adversary technique of pivoting from an initial foothold to adjacent systems, accounts, or data domains. For multi-tenant event platforms, LM is not a footnote: a single compromised ingest key, worker credential, or mis-scoped projection path can turn a localized incident into cross-tenant data exposure unless containment is engineered at every trust boundary. STRIDE-LM gives operators and pipeline builders a shared vocabulary to classify vectors discovered in IDDIL, prioritize controls in ATC, and trace threat-intelligence matches (Chapter 13) back to concrete design decisions.

STRIDE-LM category	Definition	DEML controls & design decisions
S — Spoofing	Pretending to be a user, service, tenant, or event source.	Firebase Auth JWT verification in `FirebaseAuthenticationMiddleware`; WebAuthn hardware-key MFA on writes; Firebase App Check + reCAPTCHA Enterprise; integration API keys bound to tenant scope; `ingestEvent` idempotency keys reject duplicate command replay.
T — Tampering	Modifying data in transit, at rest, or in the event pipeline.	Transactional Outbox (`OutboxEvent` written atomically with domain state); `telemetry_worker` idempotency keys; AES-256-GCM field encryption with GCP KMS envelope rotation; versioned event schemas; `platform-status` immutability via ABAC.
R — Repudiation	Denying that an action occurred or obscuring attribution.	Immutable Google Cloud Logging SIEM trail; GCS audit log retention; Postgres `OutboxEvent` and `ThreatReport` records with tenant `account_id`; OpenTelemetry traces in ClickHouse correlating ingest → enrichment → projection hops.
I — Information Disclosure	Exposing data or metadata to unauthorized parties.	UUID primary keys (anti-IDOR); RBAC + ABAC (Chapter 28); Firestore security rules scoped to `users/{uid}`; Hugging Face model artifacts namespaced by hashed tenant slug; encrypted integration tokens at rest (Chapter 10).
D — Denial of Service	Degrading or blocking availability of services or projections.	Dragonfly sliding-window rate limits; `frontend-events-dlq` isolates poison messages from the projection fleet; distroless containers reduce exploit surface; Sanity CDN–backed status communications survive backend outages (Chapter 14); synthetic Event Projections probe alerts on pipeline stall.
E — Elevation of Privilege	Gaining capabilities beyond authorized role or tenant scope.	Three-tier RBAC (`Viewer` / `Operator` / `Security Admin`); ABAC ownership and `is_published` gates; unprivileged Cloud Run service accounts; Infisical runtime secret injection (no keys on disk); Semgrep/Trivy supply-chain gates in CI.
LM — Lateral Movement	Pivoting from one compromised asset to others within or across trust boundaries.	Primary containment layer: strict multi-tenant isolation—Postgres `account_id` on every transactional row, symmetrical worker loops that never hardcode Tenant0 exceptions, Firestore rule scoping per `uid`, private Redpanda networking between Cloud Run services, no cross-tenant foreign keys in worker payloads (Tenant0 UUID normalization replaces legacy `"platform"` literals). Compromise in one tenant's ingest path cannot traverse to another tenant's projections, model weights, or integration keys without a separate, auditable authorization failure.

High-throughput event platforms amplify both the value and the risk of security telemetry: every command, projection, and ML inference generates evidence adversaries want to steal or poison, and every worker hop is a potential pivot point. STRIDE-LM is especially useful here because it forces teams to ask two questions on every new feature: what category of harm does this enable? and where could an attacker move next if this control fails? Tagging Redpanda topics, worker credentials, and Firestore collections with STRIDE-LM labels during design reviews prevents "detect-only" blind spots—teams discover early when they have strong Spoofing and Tampering controls but weak Lateral Movement containment, which is the failure mode most dangerous in SaaS pipelines. For operators, the same taxonomy turns hourly IoC refreshes and CES Threat Level spikes into actionable triage: an OTX match on a scraping ASN maps cleanly to Denial of Service and Spoofing; a DLQ depth anomaly maps to Tampering or survivability debt; a cross-tenant access attempt in audit logs maps directly to Information Disclosure and Lateral Movement and triggers the highest-severity runbook.
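The tagging exercise described above can be kept in code so that gaps surface mechanically. The sketch below is an assumption about how a team might record it: the `StrideLM` enum mirrors the seven categories in the table, while `coverage_gaps` and the shape of its input are hypothetical.

```python
from enum import Enum


class StrideLM(Enum):
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"
    LATERAL_MOVEMENT = "LM"


def coverage_gaps(component_controls):
    """Map each component to the STRIDE-LM categories with no control listed."""
    gaps = {}
    for component, controls in component_controls.items():
        covered = {category for category, names in controls.items() if names}
        missing = [category for category in StrideLM if category not in covered]
        if missing:
            gaps[component] = missing
    return gaps


# Illustrative review entry for one broker topic.
review = {
    "frontend-events topic": {
        StrideLM.SPOOFING: ["Firebase Auth JWT verification"],
        StrideLM.TAMPERING: ["worker idempotency keys"],
    },
}
```

Running `coverage_gaps(review)` on this entry reports Lateral Movement among the missing categories, which is exactly the "strong Spoofing and Tampering, weak containment" pattern the paragraph warns about.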

List Threat Actors and Objectives. Name adversary classes, their motivation, skill, resources, and goals against your assets. Feed current intel (feeds, ISAC reports, internal incidents) into this step. DEML example: Actor classes include automated scrapers (availability abuse on public platform-status), credential-stuffing botnets (account takeover), insider operators with Operator RBAC (data exfiltration via export APIs), and APT-style actors targeting ML model weights on Hugging Face. Objectives are tied to kill-chain stages—reconnaissance on /api/v1/integrations/{platform} health endpoints, delivery via forged ingest events, action on objectives via cross-tenant projection reads. Pipeline-builder example: A fraud-analytics team lists actors (insider analysts, compromised service accounts, supply-chain partners with ingest access) and states objectives (skew detection thresholds, hide fraudulent transactions in feature noise).

Implementation Phase (ATC)

Analysis & Assessment. For each discovered vector, determine root cause, successful-compromise impact, and worst-case scenarios. Employ threat models, attack trees, or Cyber Kill Chain mapping as artifacts; revisit discovery assumptions when new intel arrives. DEML example: When DLQ depth spikes on frontend-events-dlq, analysts trace enrichment failures to malformed payloads, assess impact (stalled projections → stale CES gauges), and model worst case (silent loss of threat-intel correlation if worker OOM persists). The ThreatModel binary classifier is assessed against false-negative cost (malicious IP admitted) vs. false-positive cost (legitimate integration throttled). Pipeline-builder example: A customer assesses whether a poisoned ingest batch could shift their PyTorch MLP decision boundary enough to miss fraud clusters, and documents the blast radius if /api/v1/predict returns attacker-controlled scores to an automated blocklist.
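The false-negative versus false-positive trade-off can be made concrete by choosing a blocking threshold that minimizes total cost on labeled history. This is a generic cost-weighted threshold search, not the method DEML uses to evaluate ThreatModel; the function names, scores, and cost weights are assumptions.

```python
def expected_cost(threshold, scored, fn_cost, fp_cost):
    """Total cost of blocking every request whose score is >= threshold."""
    cost = 0.0
    for score, is_malicious in scored:
        blocked = score >= threshold
        if is_malicious and not blocked:
            cost += fn_cost  # malicious request admitted
        elif blocked and not is_malicious:
            cost += fp_cost  # legitimate request throttled
    return cost


def best_threshold(scored, fn_cost, fp_cost):
    """Pick the candidate with the lowest total cost (ties go to the lower one)."""
    # float("inf") represents "block nothing".
    candidates = sorted({score for score, _ in scored} | {float("inf")})
    return min(
        candidates,
        key=lambda t: (expected_cost(t, scored, fn_cost, fp_cost), t),
    )


# Hypothetical labeled history: (model score, was it actually malicious?)
history = [(0.1, False), (0.4, False), (0.6, True), (0.7, False), (0.9, True)]
```

With this history, weighting a missed attack ten times a false block gives a threshold of 0.6, while reversing the weights pushes it to 0.9. The same data supports very different operating points depending on which error the business fears more.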

Triage. Prioritize findings by business/mission impact and threat intelligence—not by CVE count alone. Impact outweighs raw probability at this stage; active intel feeds the probability variable later in risk management. Express results in both business and technical terms. DEML example: Triage ranks (1) cross-tenant data leakage via mis-scoped Firestore rules as catastrophic, (2) integration key compromise with ingest write access as high, (3) single-tenant DLQ replay backlog as medium operational debt. Semgrep and Trivy findings enter the internal vulnerability Kanban (Chapter 21) only after threat-context triage—not every CVE is an immediate patch. Pipeline-builder example: A pipeline owner triages training-data poisoning above TLS misconfiguration if their model directly gates financial holds; they document the business impact ("false approvals") alongside the technical fix ("schema validation + outlier quarantine before ingest").
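Impact-first ordering is simple to encode. In the sketch below the impact labels follow the DEML example above, while the `Finding` class, the numeric likelihood values, and the tie-breaking rule are assumptions for illustration.

```python
from dataclasses import dataclass

IMPACT_RANK = {"catastrophic": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    title: str
    impact: str        # business/mission impact label
    likelihood: float  # 0..1 estimate, informed by current intel


def triage(findings):
    """Order by impact first; likelihood only breaks ties within an impact level."""
    return sorted(findings, key=lambda f: (IMPACT_RANK[f.impact], -f.likelihood))


# Likelihood values here are invented for the example.
findings = [
    Finding("Single-tenant DLQ replay backlog", "medium", 0.9),
    Finding("Integration key compromise with ingest write access", "high", 0.4),
    Finding("Cross-tenant leakage via mis-scoped Firestore rules", "catastrophic", 0.1),
]
```

Sorting these puts the cross-tenant leak first even though it is the least likely of the three, which matches the rule that impact outweighs raw probability at this stage.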

Controls. Select, implement, and validate controls that remove, counter, or mitigate prioritized threats. Controls exhibit functions—inventory, collect, detect, protect, manage, respond—and must trace back to specific attack vectors, not generic compliance checklists. Measure effectiveness and identify coverage gaps. DEML example: Controls mapped to ingest injection include Firebase App Check + MFA on writes (protect), UUID PKs + ABAC (protect), transactional Outbox + idempotency keys (detect/manage), ThreatModel inference at the edge (detect/respond), and DLQ replay with stable keys (respond). CES (Chapter 25) scores how well these controls perform in production on Tenant0 before customer rollout. Pipeline-builder example: A customer implements schema contracts and row-level checksums on batches before POSTing to /api/v1/ingest, enables DEML rate limits, stores API keys in a vault with rotation, and adds a human review queue when ThreatModel scores exceed a tenant-defined threshold.

Platform Practice Mapping

The table below shows where DEML's current production practices align with IDDIL/ATC. Use it as a checklist when threat-modeling your own pipeline—the left column is the methodology step; the right column is where to look in this codebase or CONOPS.

IDDIL/ATC step	DEML practice (reference)
I — Identify assets	Tenant-scoped Postgres models, Firestore `users/{uid}/data/*`, encrypted integration tokens (Chapter 10), HF namespaced model artifacts
D — Define attack surface	CONOPS §7 command/query paths; `/api/v1/ingest`, `/api/v1/predict`, Firebase `ingestEvent`
D — Decompose system	Service matrix §8; Event Projections loop (Outbox → relay → worker → Firestore)
I — Identify attack vectors	STRIDE-LM categorization (§10); UUID PK anti-IDOR, broker/DLQ failure modes §13, network enrichment (Chapter 20)
L — List threat actors	`security_worker` IoC feeds (AbuseIPDB, OTX), HIBP/Tor OSINT (Chapter 13), behavioral biometrics
A — Analysis & assessment	`ThreatModel` PyTorch classifier, Cyber Kill Chain–aligned CES metrics (Chapter 25), synthetic Event Projections probe
T — Triage	Vulnerability Kanban (Chapter 21), impact-weighted incident response, DLQ depth alerting
C — Controls	RBAC/ABAC (Chapter 28), KMS rotation, App Check, rate limits, Outbox idempotency, Firestore rule scoping

Actionable workflow for pipeline builders. Run IDDIL before your first production ingest: (1) list assets and draw a DFD with trust boundaries, (2) decompose your ETL → DEML → model-serving stack, (3) enumerate vectors and actors against that diagram, (4) analyze impact and triage by business consequence, (5) implement controls that map to specific vectors—not a generic security bundle—and (6) loop back when security_worker intel, DLQ telemetry, or model drift falsifies your assumptions. Threat-driven design is continuous; the mnemonic exists because unaddressed threats do not remain idle—they become the next incident in your detection pipeline.

These principles are operational scaffolding, not abstract theory. Chapter 7 and Chapter 23 apply them to compute hardening and enterprise compliance evidence; STRIDE-LM provides the threat taxonomy; Chapter 13 details the threat-intelligence fusion pipeline; Chapter 20 covers edge enrichment; and Chapter 25 formalizes how countermeasure effectiveness is measured and displayed.

11. Observability & Health Monitoring

Signal	Source	Operator use
Real-time user stats	Firestore projections	"Event Projections" component on platform-status (automated synthetic probe)
CES dashboard	ClickHouse + backend aggregates	Threat / SLA / Stableness gauges (Chapter 25)
Traces	OpenTelemetry → Collector → ClickHouse	Latency regressions, worker stalls
Errors	Sentry (frontend + backend)	Release regressions
Synthetic uptime	`telemetry_worker` pingers (30s)	Status page accuracy
Infrastructure	GCP metrics, GCP Logging	Capacity, audit trail

12. Maintenance & Automation Cadence

All schedules are canonical in Appendix D. Summary:

Every 5s: outbox_relay publishes pending events.
Continuous: telemetry_worker, ml_worker Kafka consumers.
Hourly: Threat intel fetch (security_worker).
Daily: ML retraining, db_cleanup (30-day raw retention), Stripe sync_subscriptions, DEK rotation checks.
Weekly / Monthly / Quarterly: Renovate, Semgrep, deep audits via GitHub Actions.
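A rough sketch of how these intervals could drive a due-check, assuming each job records its last run time. The intervals come from the summary above and the 30-day DEK rotation period from §9; the job keys, `due_jobs`, and `dek_needs_rotation` are hypothetical names, and Appendix D remains the canonical schedule.

```python
from datetime import datetime, timedelta

# Intervals summarized from the cadence list above.
CADENCE = {
    "outbox_relay": timedelta(seconds=5),
    "threat_intel_fetch": timedelta(hours=1),
    "ml_retraining": timedelta(days=1),
    "db_cleanup": timedelta(days=1),
}
DEK_MAX_AGE = timedelta(days=30)  # rotation period from section 9


def due_jobs(last_run: dict, now: datetime) -> list:
    """Return jobs whose interval has elapsed or that never ran, in CADENCE order."""
    return [
        name
        for name, interval in CADENCE.items()
        if name not in last_run or now - last_run[name] >= interval
    ]


def dek_needs_rotation(created_at: datetime, now: datetime) -> bool:
    """True once a data-encryption key is at least 30 days old."""
    return now - created_at >= DEK_MAX_AGE
```

Passing `now` in explicitly, instead of reading the clock inside the functions, keeps the schedule logic deterministic under test.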

13. Contingency & Degraded Operations

Failure	System behavior	Recovery
Redpanda unavailable (Functions)	Firestore fallback writes; worker may still consume via internal broker	Restore broker; drain DLQ; verify projections catch up idempotently
`outbox_relay` stopped	Events accumulate in Postgres outbox	Restart relay; backlog publishes in order
Firestore rules mis-deployed	Client reads/writes rejected	Re-run `firebase-backend-deploy.yml`
Worker OOM on Polars batch	Messages route to `frontend-events-dlq`	Fix payload/enrichment; replay with stable keys
Postgres outage	REST mutations fail; cached projections may go stale	Cloud SQL restore from volume snapshot; run migrations
KMS unreachable	Cannot decrypt integration tokens	Restore GCP credentials; verify `telemetry-app-sa` IAM

14. CI/CD & Release Operations

Feature branch → pre-commit (Ruff, ESLint, Axe) → PR.
Merge to main → Cloud Build webhook builds affected services (watch paths per service).
Same merge → Firebase workflows deploy Functions/rules/hosting when paths match.
scripts/sync_content.py propagates BOOK/README to frontend and marketing assets.
purge-cloudflare-cache.yml invalidates CDN after deploy.
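Path-based build selection, as in the merge steps above, reduces to matching changed files against per-target patterns. The directory layout and patterns below are hypothetical; the real watch paths live in the build and workflow configuration, which this section does not reproduce.

```python
from fnmatch import fnmatchcase

# HYPOTHETICAL watch paths, for illustration only.
WATCH_PATHS = {
    "deml-backend": ["backend/*"],
    "deml-frontend": ["frontend/*"],
    "firebase-functions": ["functions/*", "firestore.rules"],
}


def affected_targets(changed_files, watch_paths=WATCH_PATHS):
    """Return deploy targets with at least one changed file matching a pattern."""
    return sorted(
        target
        for target, patterns in watch_paths.items()
        if any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in patterns
        )
    )
```

Note that `fnmatchcase` lets `*` match across `/`, so `backend/*` covers nested files such as `backend/api/views.py`.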

Semantic versioning and release notes: scripts/git_flow.py (Chapter 16).

15. Documentation Map

Document	Audience	Content
This CONOPS	Operators, architects	End-to-end operational narrative
`docs/conops.md`	On-call engineers	Checklists, modes, quick reference
WHITEPAPER.md §2	Executives, reviewers	Concise CONOPS + hypothesis
README.md	Integrators	API gateway, architecture diagram
Appendix C	DevOps	Per-service Cloud Run variables
Appendix D	SRE	Schedules and retention
AGENTS.md	AI agents / contributors	Coding principles aligned to CONOPS