Concept of Operations (CONOPS)
This section is the single operational narrative for the DEML platform: who uses it, how it runs in production, which technologies execute each responsibility, and what operators do when things degrade. It reflects the 2026 Event Projections architecture (Firebase command gateway, Redpanda broker, Django workers, Firestore read models, Google Cloud compute, GCP security controls). Detailed checklists live in Appendix C, Appendix D, and docs/conops.md.
1. Purpose & Scope
The DEML platform is a multi-tenant observability and machine-learning SaaS. Operators, security engineers, and integrators use it to ingest telemetry, publish status pages, forecast SLAs, evaluate threat anomalies, and share STIX 2.1 indicators. This CONOPS covers:
- Normal steady-state operations across all production services
- User-facing workflows (anonymous visitors, account owners, API integrators)
- Internal data paths (commands, projections, queries, batch ML)
- Deployment boundaries (Cloud Run, Firebase, GCP, Hugging Face)
- Maintenance cadence, monitoring, and degraded-mode behavior
Out of scope: local developer onboarding (see Chapter 1 and Appendix E), and deep algorithmic derivations (see Whitepaper).
2. Mission & Operational Objectives
| Objective | How the platform achieves it |
|---|---|
| Reliable telemetry ingestion | Non-blocking command path via ingestEvent → Redpanda; Django Transactional Outbox for API-origin events; idempotent telemetry_worker projections |
| Low-latency dashboards | Materialized read models in Firestore (deml DB); Angular onSnapshot on users/{uid}/data/stats |
| Account isolation | Postgres tenancy by UserProfile.account_id; Firestore rules scoped to request.auth.uid; symmetrical worker loops per account + platform sentinel |
| Predictive intelligence | Daily ml_worker retraining on anonymized aggregate data; per-account inference without cross-tenant raw leakage |
| Transparent public status | platform-status dogfoods the stack under real load; customer pages gated by is_published ABAC |
| Audit-ready security | Firebase Auth + MFA on writes; GCP KMS envelope encryption; immutable GCS audit logs; continuous Semgrep/Trivy/Renovate |
3. System Overview
The platform separates commands (writes), projections (derived state), and queries (reads):
```mermaid
flowchart TB
    subgraph Surfaces
        M[Astro Marketing Site]
        A[Angular App deml.app]
        API[Integration API Keys]
    end
    subgraph Commands
        FCF[Firebase Cloud Functions ingestEvent]
        DJ[Django REST + OutboxEvent]
    end
    subgraph Bus
        RP[Redpanda frontend-events / DLQ]
        OR[outbox_relay 5s cadence]
    end
    subgraph Projections
        TW[telemetry_worker Polars + ORM]
        FS[(Firestore deml)]
    end
    subgraph Truth
        PG[(PostgreSQL)]
        CH[(ClickHouse OLAP)]
    end
    M -->|Auth handoff| A
    A -->|Callable + JWT REST| FCF
    A -->|REST| DJ
    API -->|Bearer API key| DJ
    FCF -->|Try publish| RP
    FCF -->|Fallback| FS
    DJ -->|Atomic write| PG
    OR -->|Publish| RP
    DJ -.->|Outbox rows| PG
    RP --> TW
    TW --> FS
    TW --> PG
    A -.->|onSnapshot| FS
    DJ -->|OTLP| CH
```
Authoritative stores: PostgreSQL holds transactional truth (users, status pages, incidents, API keys, outbox). Firestore holds projected real-time stats optimized for client subscriptions. ClickHouse holds OLAP traces and CES analytics. Redpanda is the durable command bus—not a system of record.
4. Operational Environment
| Layer | Provider | Responsibility |
|---|---|---|
| Compute & data plane | Cloud Run | Django API, Angular SSR, Postgres, Redpanda, ClickHouse, Dragonfly, all background workers, scanner, OTEL collector, Tor proxy |
| Client command gateway | Firebase Cloud Functions | ingestEvent callable with native Auth context |
| Identity | Firebase Authentication | Email/OAuth/MFA; JWT verified by Django middleware |
| Real-time read models | Firestore (named DB deml) | Projected stats; security rules enforce per-user isolation |
| Marketing hosting | Firebase Hosting | Astro marketing/dist at dataengineeringformachinelearning.com |
| Cryptography & audit | Google Cloud (Terraform) | KMS envelope keys, immutable audit log bucket, service accounts |
| Secrets | Infisical (recommended) | Runtime secret injection; SOC 2 / CMMC alignment |
| Model artifacts | Hugging Face Hub | Namespaced .pt state dict uploads |
| Content | Sanity.io | Incident narratives decoupled from Django |
Cross-site URL trio (env-driven everywhere): FRONTEND_URL (https://deml.app), BACKEND_URL (https://backend.deml.app), MARKETING_URL (https://dataengineeringformachinelearning.com).
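As a minimal sketch of what "env-driven" can look like in a Python service, the loader below reads the three variables named above. The `SiteUrls` class, the `load_site_urls` helper, and the choice to fall back to the production values when a variable is unset are illustrative assumptions, not the platform's actual settings code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteUrls:
    frontend: str
    backend: str
    marketing: str


def load_site_urls(env=None) -> SiteUrls:
    """Read the URL trio from a mapping (defaults to os.environ).

    Falling back to the production URLs is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return SiteUrls(
        frontend=env.get("FRONTEND_URL", "https://deml.app").rstrip("/"),
        backend=env.get("BACKEND_URL", "https://backend.deml.app").rstrip("/"),
        marketing=env.get(
            "MARKETING_URL", "https://dataengineeringformachinelearning.com"
        ).rstrip("/"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_site_urls({"BACKEND_URL": "http://localhost:8000/"})`.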
5. Operational Modes
| Mode | Description | Operator actions |
|---|---|---|
| Normal | All Cloud Run services healthy; Redpanda reachable from Functions; projections flowing to Firestore | Monitor CES gauges, Sentry, GCP metrics; check the "Event Projections" component on platform-status |
| Degraded: Redpanda unreachable from Functions | ingestEvent writes fallback rows to the Firestore events collection; telemetry_worker still consumes from the broker while the internal path works | Confirm REDPANDA_BROKERS uses the public endpoint for Functions or accept Firestore fallback; check frontend-events-dlq depth |
| Degraded: worker stalled | Firestore projections stale; Postgres/outbox may accumulate | Restart deml-telemetry-worker and deml-relay; inspect DLQ topic; replay idempotent keys |
| Maintenance | Migrations, dependency upgrades, model retraining | Cloud Run rolling deploy on main merge; Firebase workflow deploys Functions/rules independently |
| Incident / public comms | Outage or degradation visible to users | Publish via Sanity; platform-status remains world-readable; unpublished customer pages stay private |
6. User Roles & Operational Workflows
The platform uses a User + Sites model—one Firebase login, many StatusPage records, no org hierarchies (Chapter 28).
| Actor | Primary workflows |
|---|---|
| Anonymous visitor | Browse published status pages and platform-status; /explore directory; no PII beyond CDN logs |
| Account owner (Operator) | Firebase login → Django profile provisioned; create status pages (MFA required); configure integrations; run Event Projections verification |
| Viewer | Read-only Settings and dashboards; API returns 403 on mutations |
| Security Admin | Platform bootstrap account; same write surface as Operator for owned resources |
| API integrator | Authorization: Bearer <API_KEY> on /api/v1/ingest and /api/v1/predict; scoped to account_id |
| Platform operator (you) | GCP dashboard, Firebase console, GCP KMS/logs, GitHub Actions, Infisical, internal vulnerability Kanban |
Typical owner session: Marketing site → auth handoff → Angular dashboard → client events fire ingestEvent → stats appear via Firestore subscription → REST calls for configuration and ML endpoints.
Typical integration session: External pipeline POSTs batched telemetry to /api/v1/ingest → Django writes business state + OutboxEvent atomically → outbox_relay publishes → worker projects enriched aggregates.
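A hedged sketch of the integrator side of that session, using only the Python standard library. The endpoint path and the Bearer header come from the roles table above; the `{"events": [...]}` body shape and the `build_ingest_request` helper are hypothetical, so check the API reference for the real payload schema.

```python
import json
import urllib.request


def build_ingest_request(
    api_key: str,
    events: list,
    backend_url: str = "https://backend.deml.app",
) -> urllib.request.Request:
    """Build (but do not send) a batched POST to /api/v1/ingest."""
    # The body shape is an assumption made for illustration only.
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url=f"{backend_url}/api/v1/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending is then a single `urllib.request.urlopen(request)` call; building the request separately makes the headers and body easy to inspect before any network traffic occurs.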
7. Command, Control & Data Flows
Client command path (primary):
- Angular calls Firebase callable ingestEvent with version: "1.0" and generated idempotency_key.
- Function validates context.auth; partitions Kafka messages by uid.
- On broker success: message lands on frontend-events; function returns accepted immediately.
- On broker failure: fallback document written to Firestore events (clients cannot read this collection per rules).
- telemetry_worker consumes, deduplicates via stable keys, enriches from Postgres, writes users/{uid}/data/stats.
- Angular FirestoreService.getRealtimeStats() streams updates via onSnapshot.
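The client command path can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Cloud Function or worker source: `publish` and `fallback_store` stand in for the Redpanda producer and the Firestore events collection, the partition count of 6 and the "fallback" return value are invented, and `StatsProjection` reduces enrichment to a per-uid counter.

```python
import zlib


class BrokerUnavailable(Exception):
    """Raised by the (hypothetical) publish callable when the broker is down."""


def partition_for(uid: str, num_partitions: int) -> int:
    # Stable hash so every event for one uid lands on the same partition.
    return zlib.crc32(uid.encode("utf-8")) % num_partitions


def ingest_event(uid, idempotency_key, payload, publish, fallback_store,
                 num_partitions=6):
    """Gateway side: try the broker first, otherwise keep the event durably."""
    message = {
        "version": "1.0",
        "idempotency_key": idempotency_key,
        "uid": uid,
        "payload": payload,
    }
    try:
        publish(partition_for(uid, num_partitions), message)
    except BrokerUnavailable:
        fallback_store.append(message)
        return "fallback"
    return "accepted"


class StatsProjection:
    """Worker side: apply each idempotency key at most once."""

    def __init__(self):
        self.seen = set()
        self.stats = {}  # uid -> number of distinct events applied

    def apply(self, message) -> bool:
        key = message["idempotency_key"]
        if key in self.seen:
            return False  # duplicate delivery, nothing changes
        self.seen.add(key)
        uid = message["uid"]
        self.stats[uid] = self.stats.get(uid, 0) + 1
        return True
```

Because the projection ignores keys it has already seen, redelivery from the broker or a later replay of fallback documents leaves the stats unchanged.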
Django command path (integrations & legacy):
- Authenticated REST handler mutates Postgres inside a transaction.
- OutboxEvent row inserted in the same transaction.
- outbox_relay (every 5s) publishes to Redpanda; same worker pipeline applies.
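The Transactional Outbox pattern behind the Django command path can be illustrated with the standard-library `sqlite3` module standing in for Postgres. The table layout, the `status_page.created` event type, the target topic, and the helper names are assumptions for this sketch; the production code uses the Django ORM and the real OutboxEvent model.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE status_page (
            id INTEGER PRIMARY KEY, account_id TEXT, name TEXT
        );
        CREATE TABLE outbox_event (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def create_status_page(conn, account_id: str, name: str) -> int:
    # `with conn` commits both inserts together or rolls both back on error.
    with conn:
        cur = conn.execute(
            "INSERT INTO status_page (account_id, name) VALUES (?, ?)",
            (account_id, name),
        )
        page_id = cur.lastrowid
        event = {"type": "status_page.created", "account_id": account_id,
                 "page_id": page_id}
        conn.execute(
            "INSERT INTO outbox_event (topic, payload) VALUES (?, ?)",
            ("frontend-events", json.dumps(event)),
        )
    return page_id


def relay_once(conn, publish) -> int:
    """One relay tick: publish pending rows in insertion order, then mark them."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox_event "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        # If publish raises, the row stays pending and is retried next tick.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox_event SET published = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

A crash between `publish` and the `UPDATE` republishes the row on the next tick, so delivery is at-least-once; that is why the downstream worker deduplicates on stable keys.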
Query path: Clients never poll Postgres for live stats; they subscribe to Firestore projections. Historical analytics and CES use ClickHouse via backend APIs.
8. Deployment Topology & Service Matrix
Production runs 14 Cloud Run services (see Chapter 22). Core operational paths:
| Service | Operational role |
|---|---|
| deml-frontend | Angular app, widgets, public status UI |
| deml-backend | Django REST, auth middleware, billing, outbox writers |
| deml-postgres | System of record (supports Neon serverless PostgreSQL) |
| deml-queue | Redpanda (deml-queue.internal:9092 for inter-service traffic) |
| deml-telemetry-worker | Projection engine + pingers + analytics rollups |
| deml-relay | Reliable outbox publisher |
| deml-workers | Consolidated ML training, threat intel, and cron task consumers |
| deml-clickhouse | OLAP analytics and historical telemetry storage |
| deml-dragonfly | Rate limiting and hot caches |
| deml-scanner + deml-cpe-guesser | Vulnerability ledger enrichment |
| deml-tor-proxy | OSINT dark-web routing |
Firebase deploy path (separate from Cloud Run): .github/workflows/firebase-backend-deploy.yml ships Cloud Functions + Firestore rules; firebase-hosting-*.yml ships marketing. Never point Cloud Run services at public broker URLs for internal traffic; use *.internal hostnames instead (Appendix C).
9. Security Operations
- Perimeter: Firebase App Check + reCAPTCHA; TLS 1.3 everywhere; strict CSP on marketing (firebase.json).
- Authentication: JWT verification in FirebaseAuthenticationMiddleware; MFA enforced on writes via amr claim.
- Authorization: RBAC (Viewer / Operator / Security Admin) + ABAC (is_published, ownership, platform-status immutability).
- Data protection: AES-256-GCM field encryption; DEK rotation every 30 days; GCP KMS envelope (Chapter 10).
- Supply chain: Pre-commit + GitHub Actions (Semgrep, Trivy, Gitleaks, Renovate); internal Kanban for vulns (Chapter 21).
- Compliance posture: Architected for SOC 2 Type II, CMMC 2.0 Level 2, NIST SP 800-171 Rev. 3 (Chapter 23).
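To make the RBAC + ABAC combination concrete, here is a minimal decision sketch. The `Principal` and `StatusPage` shapes and the two helper functions are hypothetical, and treating platform-status as not writable through the normal path is this sketch's reading of "immutability"; the authoritative rules are described in Chapter 28.

```python
from dataclasses import dataclass
from typing import Optional

# Roles allowed to mutate resources they own (RBAC).
WRITE_ROLES = {"Operator", "Security Admin"}


@dataclass(frozen=True)
class Principal:
    account_id: str
    role: str           # "Viewer", "Operator", or "Security Admin"
    mfa_verified: bool  # derived from the token's MFA evidence


@dataclass(frozen=True)
class StatusPage:
    slug: str
    account_id: str
    is_published: bool


def can_read(principal: Optional[Principal], page: StatusPage) -> bool:
    # ABAC: published pages are world-readable; drafts are owner-only.
    if page.is_published:
        return True
    return principal is not None and principal.account_id == page.account_id


def can_write(principal: Optional[Principal], page: StatusPage) -> bool:
    # Assumption: platform-status is never writable through this path.
    if principal is None or page.slug == "platform-status":
        return False
    return (
        principal.role in WRITE_ROLES                  # RBAC
        and principal.mfa_verified                     # MFA on writes
        and principal.account_id == page.account_id    # ABAC ownership
    )
```

A caller would map a `False` from `can_write` to an HTTP 403, which matches the Viewer behavior described in the roles table.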
10. Threat-Driven Design and Defendable Architecture Principles
[!IMPORTANT] Foundational Frameworks — Key References
DEML's security architecture is guided by two Lockheed Martin white papers from the Intelligence Driven Defense® program:
- A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) — prioritizes adversary objectives over compliance-only checklists; introduces the IDDIL/ATC workflow, STRIDE-LM categorization, and the functional control hierarchy applied in this section.
- Defendable Architectures (Fitch & Muckin, 2019) — defines build-time requirements for Visibility, Manageability, and Survivability that map to Event Projections telemetry, automated worker cadence, and Outbox/DLQ degraded modes.
Together, the pair links what adversaries are doing (threat analysis) with how systems must be engineered (defensible characteristics)—the right fit for a multi-tenant detection platform where ingest paths, model endpoints, and tenant boundaries are active attack surfaces. Full bibliographic citations: Appendix L.
Modern data and ML platforms are not passive repositories—they are detection and response surfaces. Adversaries target telemetry pipelines, model endpoints, and tenant boundaries because those paths carry high-value signals and privileged access. A compliance-first checklist or a vulnerability-first patch queue alone cannot keep pace with that reality. Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) argues that defenders must prioritize threats over compliance artifacts or isolated CVEs: identify what adversaries are trying to achieve, then engineer controls that interrupt those objectives. The companion framework Defendable Architectures (Fitch & Muckin, 2019) translates that mindset into build-time requirements—systems must be explicitly designed for Visibility, Manageability, and Survivability so operators can execute Intelligence Driven Defense at scale. DEML adopts both frameworks as operational doctrine, not slide-deck vocabulary: every production path in this CONOPS is shaped to make adversary behavior observable, operator response fast, and degraded operation survivable.
Visibility
Visibility means the platform exposes enough trustworthy signal—across commands, projections, queries, and batch ML—to detect misuse, misconfiguration, and attack progression without guessing. DEML achieves this through layered telemetry rather than a single dashboard.
The Event Projections loop is the primary visibility spine: client commands (ingestEvent, Django Outbox → outbox_relay) land on Redpanda; telemetry_worker enriches from Postgres and materializes Firestore read models while emitting OpenTelemetry traces to ClickHouse. Operators do not infer pipeline health from user complaints—the Event Projections synthetic probe on platform-status continuously validates end-to-end flow. Network traffic enrichment (Chapter 20) adds ASN, GeoIP, UA parsing, and behavioral context at the edge. Threat feeds ingested hourly by security_worker (Chapter 13) fuse external IoCs with internal telemetry before the ThreatModel scores access risk. The CES dashboard (Chapter 25) distills Threat Level, SLA Level, and Stableness into a single operational gauge. Sentry, GCP Logging, and immutable GCS audit logs complete the picture for release regressions and compliance evidence. Visibility is incomplete if it is tenant-blind: symmetrical worker loops and strict account_id / Firestore rule scoping ensure every signal is attributable.
Manageability
Manageability means operators can change posture, deploy fixes, rotate secrets, and tune models without architectural surgery—controls are centralized, automated, and repeatable across tenants including Tenant0 dogfood.
Automation is the manageability engine. outbox_relay (5s cadence) and telemetry_worker run continuously; ml_worker and security_worker consume Kafka tasks on schedule—retraining SLA/threat models daily and refreshing AbuseIPDB / OTX feeds hourly (Chapter 24). Pre-commit hooks, Semgrep, Trivy, Renovate, and the internal vulnerability Kanban (Chapter 21) turn supply-chain findings into tracked remediation without manual triage drift. RBAC + ABAC (Chapter 28) and GCP KMS envelope rotation (Chapter 10) are managed through documented APIs and workers—not ad hoc SQL. CI/CD splits Cloud Run and Firebase deploy paths so Functions, rules, and backend services ship independently (§14). Integration health endpoints (/api/v1/integrations/{platform}) and the service matrix in §8 give operators a single map of what to restart, scale, or roll back. Manageability fails when tenants are exceptions; DEML's symmetrical pipelines guarantee that a control applied to one account applies to all.
Survivability
Survivability means the platform continues its mission under stress—broker outages, worker stalls, crypto failures, or active attack—without silent data loss or unbounded blast radius.
DEML engineers survivability into the command path itself. When Redpanda is unreachable from Firebase Functions, ingestEvent falls back to Firestore events while internal services continue consuming via the private broker (§5). The Transactional Outbox ensures API-origin events are never published without a durable Postgres record. telemetry_worker idempotency keys and the frontend-events-dlq topic prevent poison messages from stalling the entire projection fleet—operators replay with stable keys after fixing enrichment logic (§13). Multi-tenant isolation (Postgres account_id, Firestore security rules, Hugging Face namespaced model artifacts) contains compromise: one tenant's incident does not become another's data leak. Sanity-backed status communications (Chapter 14) survive primary backend outages. Daily ml_worker retraining loops keep threat and SLA models current even as attack patterns shift. Survivability is not "always up"; it is graceful degradation with recoverable state and explicit operator runbooks in docs/conops.md.
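The dead-letter behavior described above can be sketched as follows. This is a simplified in-memory model, assuming a per-message `enrich` callable and plain lists and dicts in place of the frontend-events-dlq topic and the Firestore read model; the function names are illustrative.

```python
def project_batch(messages, enrich, projection, dlq) -> int:
    """Apply each message; a failure parks that message and moves on."""
    applied = 0
    for message in messages:
        try:
            enriched = enrich(message)
        except Exception as exc:
            # Park the poison message instead of stalling the whole batch.
            dlq.append({"message": message, "error": repr(exc)})
            continue
        # Keyed writes make replays safe: the same key overwrites the same slot.
        projection[enriched["idempotency_key"]] = enriched
        applied += 1
    return applied


def replay_dlq(dlq, enrich, projection) -> int:
    """After a fix ships, re-run parked messages with their original keys."""
    parked = [entry["message"] for entry in dlq]
    dlq.clear()
    return project_batch(parked, enrich, projection, dlq)
```

Messages that still fail during replay land back in the DLQ, so a replay never loses data; it only shrinks the backlog by however many messages the fix actually repaired.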
Virtuous Knowledge Cycle. Threat-driven design is not a one-time architecture review—it is a closed loop. Design phases prioritize adversary objectives and map them to Visibility / Manageability / Survivability controls. Build phases encode those controls in Event Projections, workers, encryption, and access matrices. Run phases generate telemetry, CES scores, DLQ depth, and threat-intel matches that validate—or falsify—design assumptions. Defend phases feed incident outcomes, new IoCs, and model false-positive rates back into the next design iteration. Each lap tightens detection fidelity, reduces operator toil, and hardens degraded-mode behavior. The platform dogfoods this cycle on Tenant0 (platform-status) before any control reaches customer tenants.
Applying the IDDIL/ATC Threat Analysis Methodology
Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) provides a repeatable threat-analysis workflow that complements the Visibility / Manageability / Survivability principles above. The methodology splits work into two phases: IDDIL (discovery) and ATC (implementation). A mnemonic anchors the sequence: "There are no idle threats — they attack." Idle threats are not hypothetical backlog items—they are adversary objectives that will be exercised against your pipeline unless you discover them, prioritize them, and implement controls that interrupt them. For data-engineering and ML detection platforms, that means treating every ingest path, model endpoint, and tenant boundary as an active attack surface, not a future hardening ticket.
Use IDDIL/ATC whenever you onboard a new integration, stand up a customer detection pipeline, or reassess an existing worker after an incident. The steps below are written so a reader can run the same playbook on their own stack; each includes a DEML example (how Tenant0 dogfoods the step) and a pipeline-builder example (how a typical customer threat-models a detection workflow on top of the platform).
Discovery Phase (IDDIL)
Identify the Assets. Catalog business assets (data and functionality required for mission success) separately from security assets (what adversaries covet). Business assets for DEML include tenant-scoped telemetry, trained ThreatModel weights, and Firestore projection read models that power live dashboards. Security assets include integration API keys (encrypted at rest), Postgres OutboxEvent rows, and Hugging Face model artifacts namespaced by tenant hash. DEML example: During CONOPS reviews, operators maintain an asset register tied to §8—each Cloud Run service, broker topic, and Firestore collection is tagged with owner, retention, and classification. Pipeline-builder example: A customer ingesting batch features via /api/v1/ingest should list (1) their source datasets, (2) derived aggregates consumed by downstream ML jobs, and (3) attacker targets such as spoofed ingest payloads or exfiltration of enriched threat scores from /api/v1/predict.
Define the Attack Surface. Map every component that touches, transports, or exposes the assets identified above. Produce a data-flow diagram (DFD) or equivalent showing trust boundaries. DEML example: The CONOPS command path in §7 is the canonical attack-surface diagram—Angular → Firebase ingestEvent → Redpanda frontend-events → telemetry_worker → Firestore users/{uid}/data/stats, plus the parallel Django REST → Outbox → outbox_relay path for integrations. Trust boundaries sit at Firebase Auth, Postgres transaction commits, and Firestore security rules. Pipeline-builder example: Draw boundaries between the customer's ETL cluster, DEML's /api/v1/ingest endpoint, and their internal model-serving tier. Mark where credentials cross networks and where unauthenticated read paths exist.
Decompose the System. Break the attack surface into layers: protocols, APIs, libraries, workers, and security functions (inventory, collect, detect, protect, manage, respond). Note existing controls and their effectiveness ratings. DEML example: Decomposition follows the Event Projections stack—ingestEvent callable (collect), NetworkTelemetryMiddleware + edge enrichment (detect), AES-256-GCM + KMS envelope (protect), security_worker hourly IoC refresh (manage), and DLQ replay runbooks (respond). Each layer links to a chapter: enrichment in Chapter 20, intel fusion in Chapter 13. Pipeline-builder example: Decompose a Spark → DEML ingest job into (a) credential storage, (b) batch serialization format, (c) retry/idempotency behavior, and (d) the customer's own anomaly-scoring model—identifying which layer owns validation vs. detection.
Identify Attack Vectors. Document paths an adversary could traverse to reach target assets, including multiple techniques per pathway. Categorize threats using STRIDE-LM and incorporate current threat intelligence. DEML example: Enumerated vectors include JWT forgery against Django REST (Spoofing), cross-tenant IDOR via predictable IDs (Information Disclosure, mitigated by UUID PKs), broker poisoning on frontend-events (Tampering), model inversion against /api/v1/predict (Information Disclosure), and credential stuffing against Firebase Auth (Spoofing). Attack trees for the ingest path note that a compromised integration key allows arbitrary event injection until ABAC and rate limits (deml-dragonfly) throttle the source; a foothold in one tenant's projection worker must not become Lateral Movement into another tenant's Firestore read models. Pipeline-builder example: A customer's detection pipeline faces vectors such as training-data poisoning (Tampering), label-flip attacks on feedback loops (Tampering), and replay of captured ingest payloads (Repudiation)—each mapped to a specific hop in their DFD and tagged with a STRIDE-LM category.
STRIDE-LM Threat Categorization
Microsoft's original STRIDE model (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) remains one of the most practical ways to label threats during design reviews. Lockheed Martin's A Threat-Driven Approach to Cyber Security (Muckin & Fitch, 2019) extends STRIDE with Lateral Movement (LM)—the adversary technique of pivoting from an initial foothold to adjacent systems, accounts, or data domains. For multi-tenant event platforms, LM is not a footnote: a single compromised ingest key, worker credential, or mis-scoped projection path can turn a localized incident into cross-tenant data exposure unless containment is engineered at every trust boundary. STRIDE-LM gives operators and pipeline builders a shared vocabulary to classify vectors discovered in IDDIL, prioritize controls in ATC, and trace threat-intelligence matches (Chapter 13) back to concrete design decisions.
| STRIDE-LM category | Definition | DEML controls & design decisions |
|---|---|---|
| S — Spoofing | Pretending to be a user, service, tenant, or event source. | Firebase Auth JWT verification in FirebaseAuthenticationMiddleware; WebAuthn hardware-key MFA on writes; Firebase App Check + reCAPTCHA Enterprise; integration API keys bound to tenant scope; ingestEvent idempotency keys reject duplicate command replay. |
| T — Tampering | Modifying data in transit, at rest, or in the event pipeline. | Transactional Outbox (OutboxEvent written atomically with domain state); telemetry_worker idempotency keys; AES-256-GCM field encryption with GCP KMS envelope rotation; versioned event schemas; platform-status immutability via ABAC. |
| R — Repudiation | Denying that an action occurred or obscuring attribution. | Immutable Google Cloud Logging SIEM trail; GCS audit log retention; Postgres OutboxEvent and ThreatReport records with tenant account_id; OpenTelemetry traces in ClickHouse correlating ingest → enrichment → projection hops. |
| I — Information Disclosure | Exposing data or metadata to unauthorized parties. | UUID primary keys (anti-IDOR); RBAC + ABAC (Chapter 28); Firestore security rules scoped to users/{uid}; Hugging Face model artifacts namespaced by hashed tenant slug; encrypted integration tokens at rest (Chapter 10). |
| D — Denial of Service | Degrading or blocking availability of services or projections. | Dragonfly sliding-window rate limits; frontend-events-dlq isolates poison messages from the projection fleet; distroless containers reduce exploit surface; Sanity CDN–backed status communications survive backend outages (Chapter 14); synthetic Event Projections probe alerts on pipeline stall. |
| E — Elevation of Privilege | Gaining capabilities beyond authorized role or tenant scope. | Three-tier RBAC (Viewer / Operator / Security Admin); ABAC ownership and is_published gates; unprivileged Cloud Run service accounts; Infisical runtime secret injection (no keys on disk); Semgrep/Trivy supply-chain gates in CI. |
| LM — Lateral Movement | Pivoting from one compromised asset to others within or across trust boundaries. | Primary containment layer: strict multi-tenant isolation—Postgres account_id on every transactional row, symmetrical worker loops that never hardcode Tenant0 exceptions, Firestore rule scoping per uid, private Redpanda networking between Cloud Run services, no cross-tenant foreign keys in worker payloads (Tenant0 UUID normalization replaces legacy "platform" literals). Compromise in one tenant's ingest path cannot traverse to another tenant's projections, model weights, or integration keys without a separate, auditable authorization failure. |
High-throughput event platforms amplify both the value and the risk of security telemetry: every command, projection, and ML inference generates evidence adversaries want to steal or poison, and every worker hop is a potential pivot point. STRIDE-LM is especially useful here because it forces teams to ask two questions on every new feature: what category of harm does this enable? and where could an attacker move next if this control fails? Tagging Redpanda topics, worker credentials, and Firestore collections with STRIDE-LM labels during design reviews prevents "detect-only" blind spots—teams discover early when they have strong Spoofing and Tampering controls but weak Lateral Movement containment, which is the failure mode most dangerous in SaaS pipelines. For operators, the same taxonomy turns hourly IoC refreshes and CES Threat Level spikes into actionable triage: an OTX match on a scraping ASN maps cleanly to Denial of Service and Spoofing; a DLQ depth anomaly maps to Tampering or survivability debt; a cross-tenant access attempt in audit logs maps directly to Information Disclosure and Lateral Movement and triggers the highest-severity runbook.
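The signal-to-category mapping in the paragraph above lends itself to a small lookup table. In this sketch the signal names, the `triage` helper, and the "standard" and "unclassified" severity labels are hypothetical; only the three mappings and the rule that a Lateral Movement tag escalates to the highest-severity runbook are taken from the text.

```python
from enum import Enum


class StrideLM(Enum):
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"
    LATERAL_MOVEMENT = "LM"


# Hypothetical signal identifiers mapped to the categories named in the text.
SIGNAL_CATEGORIES = {
    "otx_match_scraping_asn": {StrideLM.DENIAL_OF_SERVICE, StrideLM.SPOOFING},
    "dlq_depth_anomaly": {StrideLM.TAMPERING},
    "cross_tenant_access_attempt": {
        StrideLM.INFORMATION_DISCLOSURE,
        StrideLM.LATERAL_MOVEMENT,
    },
}


def triage(signal: str):
    """Return (categories, severity) for an operational signal."""
    categories = SIGNAL_CATEGORIES.get(signal, set())
    if not categories:
        return categories, "unclassified"
    if StrideLM.LATERAL_MOVEMENT in categories:
        return categories, "highest"
    return categories, "standard"
```

Keeping the mapping as data rather than branching logic means a design review can extend it by adding rows, without touching the triage function.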
List Threat Actors and Objectives. Name adversary classes, their motivation, skill, resources, and goals against your assets. Feed current intel (feeds, ISAC reports, internal incidents) into this step. DEML example: Actor classes include automated scrapers (availability abuse on public platform-status), credential-stuffing botnets (account takeover), insider operators with Operator RBAC (data exfiltration via export APIs), and APT-style actors targeting ML model weights on Hugging Face. Objectives are tied to kill-chain stages—reconnaissance on /api/v1/integrations/{platform} health endpoints, delivery via forged ingest events, action on objectives via cross-tenant projection reads. Pipeline-builder example: A fraud-analytics team lists actors (insider analysts, compromised service accounts, supply-chain partners with ingest access) and states objectives (skew detection thresholds, hide fraudulent transactions in feature noise).
Implementation Phase (ATC)
Analysis & Assessment. For each discovered vector, determine root cause, successful-compromise impact, and worst-case scenarios. Employ threat models, attack trees, or Cyber Kill Chain mapping as artifacts; revisit discovery assumptions when new intel arrives. DEML example: When DLQ depth spikes on frontend-events-dlq, analysts trace enrichment failures to malformed payloads, assess impact (stalled projections → stale CES gauges), and model worst case (silent loss of threat-intel correlation if worker OOM persists). The ThreatModel binary classifier is assessed against false-negative cost (malicious IP admitted) vs. false-positive cost (legitimate integration throttled). Pipeline-builder example: A customer assesses whether a poisoned ingest batch could shift their PyTorch MLP decision boundary enough to miss fraud clusters, and documents the blast radius if /api/v1/predict returns attacker-controlled scores to an automated blocklist.
Triage. Prioritize findings by business/mission impact and threat intelligence—not by CVE count alone. Impact outweighs raw probability at this stage; active intel feeds the probability variable later in risk management. Express results in both business and technical terms. DEML example: Triage ranks (1) cross-tenant data leakage via mis-scoped Firestore rules as catastrophic, (2) integration key compromise with ingest write access as high, (3) single-tenant DLQ replay backlog as medium operational debt. Semgrep and Trivy findings enter the internal vulnerability Kanban (Chapter 21) only after threat-context triage—not every CVE is an immediate patch. Pipeline-builder example: A pipeline owner triages training-data poisoning above TLS misconfiguration if their model directly gates financial holds; they document the business impact ("false approvals") alongside the technical fix ("schema validation + outlier quarantine before ingest").
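One way to express impact-first triage in code is to sort findings by an impact tier and deliberately ignore CVE counts. The `Finding` type, the tier names beyond those in the DEML example, and the alphabetical tie-break are assumptions for illustration.

```python
from dataclasses import dataclass

# Lower rank sorts first; tiers follow the wording of the DEML example.
IMPACT_RANK = {"catastrophic": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    title: str
    impact: str         # one of the IMPACT_RANK keys
    cve_count: int = 0  # recorded for context, deliberately not used for ordering


def triage_order(findings):
    """Order findings by mission impact; ties break alphabetically by title."""
    return sorted(findings, key=lambda f: (IMPACT_RANK[f.impact], f.title))


findings = [
    Finding("Dependency bundle with many advisories", "low", cve_count=14),
    Finding("Single-tenant DLQ replay backlog", "medium"),
    Finding("Integration key compromise with ingest write access", "high"),
    Finding("Cross-tenant leakage via mis-scoped Firestore rules", "catastrophic"),
]
ordered = triage_order(findings)
```

In this ordering the finding with 14 advisories still sorts last, which is the point of triaging by consequence rather than by raw count.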
Controls. Select, implement, and validate controls that remove, counter, or mitigate prioritized threats. Controls exhibit functions—inventory, collect, detect, protect, manage, respond—and must trace back to specific attack vectors, not generic compliance checklists. Measure effectiveness and identify coverage gaps. DEML example: Controls mapped to ingest injection include Firebase App Check + MFA on writes (protect), UUID PKs + ABAC (protect), transactional Outbox + idempotency keys (detect/manage), ThreatModel inference at the edge (detect/respond), and DLQ replay with stable keys (respond). CES (Chapter 25) scores how well these controls perform in production on Tenant0 before customer rollout. Pipeline-builder example: A customer implements schema contracts and row-level checksums on batches before POSTing to /api/v1/ingest, enables DEML rate limits, stores API keys in a vault with rotation, and adds a human review queue when ThreatModel scores exceed a tenant-defined threshold.
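Rate limiting is one of the controls named in this section, and the STRIDE-LM table describes it as a Dragonfly sliding-window limit. The class below is an in-process sketch of the sliding-window idea only; it does not talk to Dragonfly, and the class name, the `allow` signature, and the explicit `now` parameter are assumptions chosen to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = {}  # key -> deque of accepted timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have slid out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

Keying the limiter by integration API key or account identifier throttles a compromised key without affecting other tenants; rejected calls are not recorded, so a blocked client cannot extend its own lockout.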
Platform Practice Mapping
The table below shows where DEML's current production practices align with IDDIL/ATC. Use it as a checklist when threat-modeling your own pipeline—the left column is the methodology step; the right column is where to look in this codebase or CONOPS.
| IDDIL/ATC step | DEML practice (reference) |
|---|---|
| I — Identify assets | Tenant-scoped Postgres models, Firestore users/{uid}/data/*, encrypted integration tokens (Chapter 10), HF namespaced model artifacts |
| D — Define attack surface | CONOPS §7 command/query paths; /api/v1/ingest, /api/v1/predict, Firebase ingestEvent |
| D — Decompose system | Service matrix §8; Event Projections loop (Outbox → relay → worker → Firestore) |
| I — Identify attack vectors | STRIDE-LM categorization (§10); UUID PK anti-IDOR, broker/DLQ failure modes §13, network enrichment (Chapter 20) |
| L — List threat actors | security_worker IoC feeds (AbuseIPDB, OTX), HIBP/Tor OSINT (Chapter 13), behavioral biometrics |
| A — Analysis & assessment | ThreatModel PyTorch classifier, Cyber Kill Chain–aligned CES metrics (Chapter 25), synthetic Event Projections probe |
| T — Triage | Vulnerability Kanban (Chapter 21), impact-weighted incident response, DLQ depth alerting |
| C — Controls | RBAC/ABAC (Chapter 28), KMS rotation, App Check, rate limits, Outbox idempotency, Firestore rule scoping |
Actionable workflow for pipeline builders. Run IDDIL before your first production ingest: (1) list assets and draw a DFD with trust boundaries, (2) decompose your ETL → DEML → model-serving stack, (3) enumerate vectors and actors against that diagram, (4) analyze impact and triage by business consequence, (5) implement controls that map to specific vectors—not a generic security bundle—and (6) loop back when security_worker intel, DLQ telemetry, or model drift falsifies your assumptions. Threat-driven design is continuous; the mnemonic exists because unaddressed threats do not remain idle—they become the next incident in your detection pipeline.
These principles are operational scaffolding, not abstract theory. Chapter 7 and Chapter 23 apply them to compute hardening and enterprise compliance evidence; STRIDE-LM provides the threat taxonomy; Chapter 13 details the threat-intelligence fusion pipeline; Chapter 20 covers edge enrichment; and Chapter 25 formalizes how countermeasure effectiveness is measured and displayed.
11. Observability & Health Monitoring
| Signal | Source | Operator use |
|---|---|---|
| Real-time user stats | Firestore projections | "Event Projections" component on platform-status (automated synthetic probe) |
| CES dashboard | ClickHouse + backend aggregates | Threat / SLA / Stableness gauges (Chapter 25) |
| Traces | OpenTelemetry → Collector → ClickHouse | Latency regressions, worker stalls |
| Errors | Sentry (frontend + backend) | Release regressions |
| Synthetic uptime | telemetry_worker pingers (30s) | Status page accuracy |
| Infrastructure | GCP metrics, GCP Logging | Capacity, audit trail |
12. Maintenance & Automation Cadence
All schedules are canonical in Appendix D. Summary:
- Every 5s: outbox_relay publishes pending events.
- Continuous: telemetry_worker, ml_worker Kafka consumers.
- Hourly: Threat intel fetch (security_worker).
- Daily: ML retraining, db_cleanup (30-day raw retention), Stripe sync_subscriptions, DEK rotation checks.
- Weekly / Monthly / Quarterly: Renovate, Semgrep, deep audits via GitHub Actions.
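Two of the daily checks reduce to simple date arithmetic: the 30-day raw retention cutoff and the 30-day DEK rotation period stated in Section 9. The helper names below are hypothetical, and the sketch assumes timezone-aware UTC timestamps; the canonical schedules remain in Appendix D.

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=30)
DEK_ROTATION_PERIOD = timedelta(days=30)


def raw_retention_cutoff(now: datetime) -> datetime:
    """Rows older than this timestamp are eligible for cleanup."""
    return now - RAW_RETENTION


def dek_rotation_due(last_rotated: datetime, now: datetime) -> bool:
    """True once a full rotation period has elapsed since the last rotation."""
    return now - last_rotated >= DEK_ROTATION_PERIOD


now = datetime(2026, 3, 31, tzinfo=timezone.utc)
cutoff = raw_retention_cutoff(now)
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the helpers, keeps the daily job reproducible when it is re-run for a past date.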
13. Contingency & Degraded Operations
| Failure | System behavior | Recovery |
|---|---|---|
| Redpanda unavailable (Functions) | Firestore fallback writes; worker may still consume via internal broker | Restore broker; drain DLQ; verify projections catch up idempotently |
| outbox_relay stopped | Events accumulate in Postgres outbox | Restart relay; backlog publishes in order |
| Firestore rules mis-deployed | Client reads/writes rejected | Re-run firebase-backend-deploy.yml |
| Worker OOM on Polars batch | Messages route to frontend-events-dlq | Fix payload/enrichment; replay with stable keys |
| Postgres outage | REST mutations fail; cached projections may go stale | Cloud SQL restore from volume snapshot; run migrations |
| KMS unreachable | Cannot decrypt integration tokens | Restore GCP credentials; verify telemetry-app-sa IAM |
14. CI/CD & Release Operations
- Feature branch → pre-commit (Ruff, ESLint, Axe) → PR.
- Merge to main → Cloud Build webhook builds affected services (watch paths per service).
- Same merge → Firebase workflows deploy Functions/rules/hosting when paths match.
- scripts/sync_content.py propagates BOOK/README to frontend and marketing assets.
- purge-cloudflare-cache.yml invalidates CDN after deploy.
Semantic versioning and release notes: scripts/git_flow.py (Chapter 16).
15. Documentation Map
| Document | Audience | Content |
|---|---|---|
| This CONOPS | Operators, architects | End-to-end operational narrative |
| docs/conops.md | On-call engineers | Checklists, modes, quick reference |
| WHITEPAPER.md §2 | Executives, reviewers | Concise CONOPS + hypothesis |
| README.md | Integrators | API gateway, architecture diagram |
| Appendix C | DevOps | Per-service Cloud Run variables |
| Appendix D | SRE | Schedules and retention |
| AGENTS.md | AI agents / contributors | Coding principles aligned to CONOPS |