From 266ee30d4bad89bc52986bcab1703f503e7f8935 Mon Sep 17 00:00:00 2001 From: Allissa Auld <96758568+Allie-Cat-0@users.noreply.github.com> Date: Fri, 17 Apr 2026 20:31:37 +0200 Subject: [PATCH] docs: align roadmap with tender-fit requirements Plan aircraft-risk modelling, CCSDS RDM support, tender-grade replay validation, and ESA software assurance artefacts in the implementation and master plans. --- docs/Implementation_Plan.md | 303 + docs/Master_Plan.md | 15516 ++++++++++++++++++++++++++++++++++ 2 files changed, 15819 insertions(+) create mode 100644 docs/Implementation_Plan.md create mode 100644 docs/Master_Plan.md diff --git a/docs/Implementation_Plan.md b/docs/Implementation_Plan.md new file mode 100644 index 0000000..64eebe4 --- /dev/null +++ b/docs/Implementation_Plan.md @@ -0,0 +1,303 @@ +# SpaceCom Implementation Plan + +## Purpose + +This document turns the master plan into an implementation-oriented coding plan. It is focused on phases of engineering work and sprint-style delivery, not on full architectural detail. 
+ +## Delivery Model + +The implementation plan assumes: + +- self-hosted GitLab as the delivery platform +- Docker Compose as the baseline deployment model +- contracts as the authoritative entitlement source +- Phase 0 blockers cleared before major coding commitments +- safety-critical, auth, and regulatory-sensitive changes always require human review + +## Phase 0: Pre-Build Decisions + +Objective: +- clear the decisions that would otherwise force rework + +Sprint 0 outcomes: + +- Space-Track AUP architecture decision recorded +- Cesium commercial licence executed +- GitLab CI/CD authority confirmed +- contract-driven entitlement model confirmed +- Redis trust split approved +- initial ADR set aligned with these decisions + +Exit criteria: + +- no unresolved blocker remains that would change ingest, licensing, or deployment direction + +## Phase 1: Foundation Sprints + +Objective: +- establish the platform baseline and developer workflow + +### Sprint 1.1 + +- backend project structure and FastAPI scaffolding +- frontend project structure and Next.js scaffolding +- Docker Compose topology +- TimescaleDB/PostGIS baseline schema +- GitLab CI skeleton +- docs tree, ADR format, `AGENTS.md`, changelog baseline + +### Sprint 1.2 + +- auth baseline: JWT, cookies, MFA foundation +- RBAC and role checks +- liveness and readiness endpoints +- structured logging +- security scan hooks and pre-commit +- first-login acceptance gates for ToS/AUP/Privacy Notice + +Exit criteria: + +- full stack starts cleanly +- CI runs on every merge request +- auth and role boundaries exist +- baseline docs and controls are in place + +## Phase 2: Data and Ingest Sprints + +Objective: +- make the platform ingest and manage real operational data safely + +### Sprint 2.1 + +- object catalog CRUD +- TLE ingest +- ingest worker wiring +- source validation and checksum checks +- basic data freshness indicators + +### Sprint 2.2 + +- DISCOS import +- space weather ingest +- retention policy 
scaffolding +- ingest metrics and failure alerts +- first degraded-mode backend signals + +Exit criteria: + +- core upstream data can be ingested, validated, stored, and monitored + +## Phase 3: Propagation and Physics Core + +Objective: +- deliver the first real analysis engine + +### Sprint 3.1 + +- frame/time utilities +- SGP4 propagation +- frame transformation validation +- initial CZML generation + +### Sprint 3.2 + +- numerical decay predictor +- atmospheric inputs +- upper/lower atmosphere model interfaces for terminal descent +- model versioning hooks +- prediction input validation + +Exit criteria: + +- the platform can produce technically valid propagated and decay outputs with traceable inputs + +## Phase 4: Simulation and Hazard Engine + +Objective: +- build the uncertainty and event-analysis layer + +### Sprint 4.1 + +- Monte Carlo job orchestration +- Celery chord flow +- worker resource controls +- result aggregation guards + +### Sprint 4.2 + +- breakup logic and sub/trans-sonic fragment model baseline +- fragment shape, tumble, and descent-assumption reference data +- corridor generation +- object/event linking +- first event orchestration path + +Exit criteria: + +- multi-run hazard outputs are generated safely, bounded, and auditable + +## Phase 5: Operational Frontend + +Objective: +- make the product usable for analysts and operators + +### Sprint 5.1 + +- globe and layer controls +- timeline strip +- object/event navigation +- degraded-state presentation + +### Sprint 5.2 + +- alert centre +- acknowledgement flow +- decision prompts +- keyboard-first interaction paths + +Exit criteria: + +- operators can see, interpret, and act on platform outputs in the UI + +## Phase 6: Event Workflow and Reporting + +Objective: +- support full event handling, not just display + +### Sprint 6.1 + +- event detail page +- event state transitions +- report job submission and tracking +- WebSocket event flow + +### Sprint 6.2 + +- report rendering pipeline +- 
export storage flow +- report audit trail +- offline/reconnect handling + +Exit criteria: + +- an end-to-end event can be analysed, reviewed, and reported through the platform + +## Phase 7: Reliability, Safety, and Compliance Hardening + +Objective: +- turn a capable system into an operationally defensible one + +### Sprint 7.1 + +- observability dashboards +- SLO instrumentation +- alerting rules +- incident-response runbook alignment + +### Sprint 7.2 + +- privacy workflows +- retention and pseudonymisation tasks +- safety-case artefacts linkage +- audit-log separation and integrity controls + +Exit criteria: + +- the system is observable, auditable, and aligned with documented operational obligations + +## Phase 8: Aviation Product Expansion + +Objective: +- strengthen the ANSP-facing differentiation + +### Sprint 8.1 + +- FIR intersection and aviation-context outputs +- aircraft exposure and vulnerability scoring +- uncertainty display refinements +- operator-facing airspace impact summaries +- conservative-baseline comparison outputs + +### Sprint 8.2 + +- shadow mode behaviour +- multi-ANSP coordination foundations +- NOTAM-adjacent workflow support +- live air-traffic density and route inputs for air-risk products + +Exit criteria: + +- the aviation-side product layer is credible for shadow deployment and stakeholder review + +## Phase 9: Space Operator Expansion + +Objective: +- extend the shared core into upstream customer workflows + +### Sprint 9.1 + +- owned-object data model and access rules +- API key lifecycle +- portal foundations + +### Sprint 9.2 + +- controlled re-entry planner +- CCSDS OEM/CDM/RDM export +- partner-facing integration workflows + +Exit criteria: + +- the shared physics core now supports both downstream aviation and upstream space-operator use + +## Phase 10: Deployment Readiness and Evidence + +Objective: +- prepare for shadow deployment and external scrutiny + +### Sprint 10.1 + +- validation and backcasting workflow +- named 
historical replay corpus and conservative-baseline comparisons
+- tender-grade KPI and air-risk validation pack
+- shadow reporting outputs
+- deployment gating refinements
+
+### Sprint 10.2
+
+- final compliance artefacts
+- ESA software assurance artefacts: SRS, SDD, user manual, test report
+- release checklist completion
+- operational readiness review
+- first shadow deployment package
+
+Exit criteria:
+
+- the platform is ready for controlled external deployment and evidence-based review
+
+## Cross-Cutting Workstreams
+
+The following run continuously across phases rather than in only one sprint:
+
+- security architecture and dependency hygiene
+- documentation and ADR maintenance
+- test coverage and validation artefacts
+- safety-case traceability
+- legal/commercial gating
+- performance and capacity reviews
+- procurement traceability and bid artefact maintenance
+
+## Recommended Execution Pattern
+
+For actual delivery, each sprint should end with:
+
+- a demoable increment
+- updated docs
+- updated ADRs where needed
+- passing CI
+- explicit review of safety/compliance impacts
+
+## Relationship to Other Docs
+
+- [Overview.md](Overview.md) gives the shortest summary
+- [Roadmap.md](Roadmap.md) gives the big-picture staged view
+- [Master_Plan.md](Master_Plan.md) remains the authoritative detailed plan
diff --git a/docs/Master_Plan.md b/docs/Master_Plan.md
new file mode 100644
index 0000000..84c3677
--- /dev/null
+++ b/docs/Master_Plan.md
@@ -0,0 +1,15516 @@
+# SpaceCom Master Development Plan
+
+## 1. Vision
+
+SpaceCom is a dual-domain re-entry debris hazard analysis platform that bridges the space and aviation domains. It is built by space engineers and operates as two interconnected products sharing a common physics core.
+ +**Space Domain (upstream):** A technical analysis platform for space operators, orbital analysts, and space agencies — providing decay prediction with full uncertainty quantification, conjunction screening, controlled re-entry corridor planning, and a programmatic API layer for integration with existing space operations systems. + +**Aviation Domain (downstream):** An operational decision support tool for ANSPs, airspace managers, and incident commanders — translating space domain predictions into actionable aviation safety outputs: hazard corridors, FIR intersection analysis, NOTAM drafting assistance, multi-ANSP coordination, and plain-language uncertainty communication. + +SpaceCom's strategic position is the interface layer between two domains that currently do not speak the same language. The aviation safety gap is the commercial differentiator and the most underserved operational need in the market. The space domain physics depth — numerical decay prediction, atmospheric density modelling, conjunction probability, and controlled re-entry planning — is the technical credibility that distinguishes SpaceCom from aviation software vendors with bolt-on orbital mechanics. + +**Positioning statement for procurement:** *"SpaceCom is the missing operational layer between space domain awareness and aviation domain action — built by space engineers, designed for the people who have to make decisions when something is coming down."* + +**AI-assisted development policy (F11):** SpaceCom uses AI coding assistants (currently Claude Code) in the development workflow. `AGENTS.md` at the repository root defines the boundaries and conventions for this use. 
Key constraints: +- AI assistants may generate, refactor, and review code, and draft documentation +- AI assistants **may not** make autonomous decisions about safety-critical algorithm changes, authentication logic, or regulatory compliance text — all such changes require human review and an approved PR with explicit reviewer sign-off +- AI-generated code is subject to identical review and testing standards as human-authored code — there is no reduced scrutiny for AI-generated contributions +- AI assistants **must not** be given production credentials, access to live Space-Track API keys, or personal data +- For ESA procurement purposes: all code in the repository, regardless of how it was authored, is the responsibility of the named human engineers. AI assistance is a development tool, not a co-author with liability + +This policy is stated explicitly because ESA and other public-sector procurement frameworks increasingly ask whether and how AI tools are used in safety-relevant software development. + +--- + +## 2. 
What We Keep from the Existing Codebase + +The prototype established several good foundational choices: + +- **Docker Compose orchestration** — frontend, backend, and database run as isolated containers with a single `docker compose up` +- **FastAPI backend** — lightweight, async-ready Python API server; already serves CZML orbital data +- **TimescaleDB + PostGIS** — time-series hypertables for orbit data and geographic types for footprints; the `orbits` hypertable and `reentry_predictions` polygon column are well-suited to the domain +- **CesiumJS globe** — proven 3D geospatial viewer with CZML support, already rendering orbital tracks with OSM tiles +- **CZML as the orbital data interchange format** — native to Cesium, supports time-dynamic position, styling, and labels +- **Schema tables: `objects`, `orbits`, `conjunctions`, `reentry_predictions`** — solid starting point for the data model (see §9 for required expansions) +- **Worker service slot** — the architecture already anticipates background data ingestion + +--- + +## 3. 
Architecture + +### 3.1 Layered Design + +``` +┌─────────────────────────────────────────────────────┐ +│ Frontend (Web) │ +│ Next.js + TypeScript + CesiumJS + Deck.gl │ +│ httpOnly cookies · CSP · security headers │ +├─────────────────────────────────────────────────────┤ +│ TLS Termination (Caddy/Nginx) │ +│ HTTPS + WSS only; HSTS preload │ +├─────────────────────────────────────────────────────┤ +│ API Gateway │ +│ FastAPI · RBAC middleware · rate limiting │ +│ JWT (RS256) · MFA enforcement · audit logging │ +├─────────────────────────────────────────────────────┤ +│ Core Services │ +│ Hazard Engine · Event Orchestrator · CZML Builder │ +│ Frame Transform Service · Space Weather Cache │ +│ HMAC integrity signing · Alert integrity guard │ +├─────────────────────────────────────────────────────┤ +│ Computational Workers (isolated network) │ +│ Celery tasks: propagation, decay, Monte Carlo │ +│ Per-job CPU time limits · resource caps │ +├─────────────────────────────────────────────────────┤ +│ Report Renderer (network-isolated container) │ +│ Playwright headless · no external network access │ +├─────────────────────────────────────────────────────┤ +│ Data Layer (backend_net only) │ +│ TimescaleDB+PostGIS · Redis (AUTH+TLS) │ +│ MinIO (private buckets · pre-signed URLs) │ +└─────────────────────────────────────────────────────┘ +``` + +### 3.2 Service Breakdown + +| Service | Runtime | Responsibility | Tier 2 Spec | Tier 3 Spec | +|---------|---------|----------------|-------------|-------------| +| `frontend` | Next.js on Node 22 / Nginx static | Globe UI, dashboards, event timeline, simulation controls | 2 vCPU / 4 GB | 2× (load balanced) | +| `backend` | FastAPI on Python 3.12 | REST + WebSocket API, authentication, RBAC, request validation, CZML generation, HMAC signing | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB (blue-green) | +| `worker-sim` | Python 3.12 + Celery `--queue=simulation --concurrency=16 --pool=prefork` | MC decay prediction (chord sub-tasks), breakup, 
conjunction, controlled re-entry. Isolated from frontend network. | 2× 16 vCPU / 32 GB | 4× 16 vCPU / 32 GB | +| `worker-ingest` | Python 3.12 + Celery `--queue=ingest --concurrency=2` | TLE polling, space weather, DISCOS, IERS EOP. Never competes with simulation queue. | 2 vCPU / 4 GB | 2× 2 vCPU / 4 GB (celery-redbeat HA) | +| `renderer` | Python 3.12 + Playwright | PDF report generation only. No external network access. Receives sanitised data from backend via internal API call only. | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB | +| `db` | TimescaleDB (PostgreSQL 17 + PostGIS) | Persistent storage. RLS policies enforced. Append-only triggers on audit tables. | 8 vCPU / 64 GB / 1 TB NVMe | Primary + standby: 8 vCPU / 128 GB each; Patroni failover | +| `redis` | Redis 7 | Broker + cache + celery-redbeat schedule. AUTH required. TLS in production. ACL users per service. | 2 vCPU / 8 GB | Redis Sentinel: 3× 2 vCPU / 8 GB | +| `minio` | MinIO (S3-compatible) | Object storage. All buckets private. Pre-signed URLs only. | 4 vCPU / 8 GB / 4 TB | Distributed: 4× 4 vCPU / 16 GB / 2 TB NVMe | +| `etcd` | etcd 3 | Patroni DCS (distributed configuration store) for DB leader election | — | 3× 1 vCPU / 2 GB | +| `pgbouncer` | PgBouncer 1.22 | Connection pooler between all application services and TimescaleDB. Transaction-mode pooling. Prevents connection count exceeding `max_connections` at Tier 3. Single failover target point for Patroni switchover. 
| 1 vCPU / 1 GB | 1 vCPU / 1 GB (updated by Patroni on failover) | +| `prometheus` | Prometheus 2.x | Metrics scraping from all services; recording rules; AlertManager rules | 2 vCPU / 4 GB | 2 vCPU / 8 GB | +| `grafana` | Grafana OSS | Four dashboards (§26.7); Loki + Tempo + Prometheus datasources | 1 vCPU / 2 GB | 1 vCPU / 2 GB | +| `loki` | Grafana Loki 2.9 | Log aggregation; queried by Grafana; Promtail ships container logs | 2 vCPU / 4 GB | 2 vCPU / 8 GB | +| `promtail` | Grafana Promtail 2.9 | Scrapes Docker json-file logs; labels by service; ships to Loki | 0.5 vCPU / 512 MB | 0.5 vCPU / 512 MB | +| `tempo` | Grafana Tempo | Distributed trace backend (Phase 2); OTLP ingest; queried by Grafana | — | 2 vCPU / 4 GB | + +#### Horizontal Scaling Trigger Thresholds (F9 — §58) + +Tier upgrades are not automatic — SpaceCom is VPS-based and requires deliberate provisioning. The following thresholds trigger a **scaling review meeting** (not an automated action). The responsible engineer creates a tracked issue within 5 business days. 
+ +| Metric | Threshold | Sustained for | Tier transition indicated | +|--------|-----------|--------------|--------------------------| +| Backend CPU utilisation | > 70% | 30 min | Tier 1 → Tier 2 (add second backend instance) | +| `spacecom_ws_connected_clients` | > 400 sustained | 1 hour | Tier 1 → Tier 2 (WS ceiling at 500; add second backend) | +| Celery simulation queue depth | > 50 | 15 min (no active event) | Add simulation worker instance | +| MC p95 latency | > 180s (75% of 240s SLO) | 3 consecutive runs | Add simulation worker instance | +| DB CPU utilisation | > 60% | 1 hour | Tier 2 → Tier 3 (read replica + Patroni) | +| DB disk used | > 70% of provisioned | — | Expand disk before hitting 85% | +| Redis memory used | > 60% of `maxmemory` | — | Increase `maxmemory` or add Redis instance | + +Scaling decisions are recorded in `docs/runbooks/capacity-limits.md` with: metric value at decision time, decision made, provisioning timeline, and owner. This file is the authoritative capacity log for ESA and ANSP audits. + +#### Redis ACL Definition + +SpaceCom uses two Redis trust domains: +- `redis_app` for sessions, rate limits, WebSocket delivery state, commercial-enforcement deferrals, and other application state where stronger consistency and tighter access separation are required +- `redis_worker` for Celery broker/result traffic and ephemeral cache data, where limited inconsistency during failover is acceptable + +This split is deliberate. It prevents worker-side compromise from reaching session state and avoids applying the distributed-systems split-brain risk acceptance for ephemeral workloads to user-session or entitlement-adjacent state. 
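
In application wiring, the split shows up as two distinct connection targets rather than one shared Redis URL. A minimal sketch of that wiring — the helper name and hostnames are illustrative, not existing code; the password environment variables are the §30.3 names:

```python
import os

def redis_dsn(host: str, user: str, password_env: str, db: int = 0) -> str:
    """Build a TLS Redis DSN; the secret stays in the environment, never in config files."""
    return f"rediss://{user}:{os.environ[password_env]}@{host}:6379/{db}"

# Application state (sessions, rate limits, WebSocket delivery) -> redis_app:
#   redis_dsn("redis_app", "spacecom_backend", "REDIS_BACKEND_PASSWORD")
# Celery broker/result traffic -> redis_worker:
#   redis_dsn("redis_worker", "spacecom_worker", "REDIS_WORKER_PASSWORD")
```

Keeping the two DSNs as separately configured values (rather than deriving one from the other) makes it harder for a refactor to silently collapse both trust domains onto one instance.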
+ +Each Redis service gets its own ACL users with the minimum required key namespace: + +```conf +# redis_app/acl.conf - bind-mounted into the application Redis container +# Backend: application-state access only (session tokens, rate-limit counters, WebSocket tracking) +user spacecom_backend on >${REDIS_BACKEND_PASSWORD} ~* &* +@all + +# Disable unauthenticated default user +user default off +``` + +```conf +# redis_worker/acl.conf - bind-mounted into the worker Redis container +# Simulation worker: Celery broker/result namespaces only +user spacecom_worker on >${REDIS_WORKER_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous + +# Ingest worker: same scope as simulation worker +user spacecom_ingest on >${REDIS_INGEST_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous + +# Disable unauthenticated default user +user default off +``` + +Mount in `docker-compose.yml`: +```yaml +redis_app: + volumes: + - ./redis_app/acl.conf:/etc/redis/acl.conf:ro + command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ... + +redis_worker: + volumes: + - ./redis_worker/acl.conf:/etc/redis/acl.conf:ro + command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ... +``` + +Separate passwords (`REDIS_BACKEND_PASSWORD`, `REDIS_WORKER_PASSWORD`, `REDIS_INGEST_PASSWORD`) are defined in §30.3. Each rotates independently on the 90-day schedule. Redis Sentinel split-brain risk acceptance in §67 applies to `redis_worker` only; `redis_app` is treated as higher-integrity application state and is not covered by that acceptance. + +### 3.3 Docker Compose Services and Network Segmentation + +Services are assigned to isolated Docker networks. A compromised container on one network cannot directly reach services on another. 
+ +```yaml +networks: + frontend_net: # frontend → backend only + backend_net: # backend → db, redis, minio, pgbouncer + worker_net: # worker → pgbouncer, redis, minio (no backend access; pgbouncer pools DB connections) + renderer_net: # backend → renderer only; renderer has no external egress + db_net: # db, pgbouncer — never exposed to frontend_net + +services: + frontend: networks: [frontend_net] + backend: networks: [frontend_net, backend_net, renderer_net] # +renderer_net: backend calls renderer API + worker-sim: networks: [worker_net] + worker-ingest: networks: [worker_net] + renderer: networks: [renderer_net] # backend-initiated calls only; no outbound to backend_net + db: networks: [backend_net, worker_net, db_net] + pgbouncer: networks: [backend_net, worker_net, db_net] # pooling for both backend AND workers + redis: networks: [backend_net, worker_net] + minio: networks: [backend_net, worker_net] +``` + +**Network topology rules:** +- Workers connect to DB via `pgbouncer:5432`, not `db:5432` directly — enforced by workers' `DATABASE_URL` env var pointing to PgBouncer. +- The backend is on `renderer_net` so it can call `renderer:8001`; the renderer cannot initiate connections to `backend_net`. +- `db_net` contains only TimescaleDB, PgBouncer, and etcd. No application service connects directly to this network except PgBouncer. 
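
The topology rules above can be spot-checked mechanically, in the spirit of the CI port check described later in this section. A minimal sketch — the rule table and function name are hypothetical, and the input is the parsed `services:` mapping from `docker-compose.yml` (e.g. via `yaml.safe_load` on `docker compose config` output so overrides are included):

```python
# Networks each service must never join, per the segmentation rules above.
FORBIDDEN_NETWORKS = {
    "db": {"frontend_net"},        # DB is never reachable from the frontend tier
    "renderer": {"backend_net"},   # renderer only receives calls, never initiates
}

def network_violations(services: dict) -> list:
    """Return human-readable violations of the network segmentation rules."""
    problems = []
    for name, svc in services.items():
        nets = set(svc.get("networks", []))
        for bad in sorted(FORBIDDEN_NETWORKS.get(name, ())):
            if bad in nets:
                problems.append(f"{name} must not join {bad}")
        # Workers must point DATABASE_URL at PgBouncer, never at db directly.
        url = svc.get("environment", {}).get("DATABASE_URL", "")
        if name.startswith("worker") and "@db:" in url:
            problems.append(f"{name} must connect via pgbouncer, not db")
    return problems
```

Failing CI on a non-empty result mirrors the `check_ports.py` approach: the rules live in one reviewed place instead of in reviewers' heads.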
+ +**Container resource limits** — without explicit limits a runaway simulation worker OOM-kills the database (Linux OOM killer targets the largest RSS consumer): + +```yaml +services: + backend: + deploy: + resources: + limits: { cpus: '4.0', memory: 8G } + reservations: { memory: 512M } + + worker-sim: + deploy: + resources: + limits: { cpus: '16.0', memory: 32G } + reservations: { memory: 2G } + stop_grace_period: 300s # allows long MC jobs to finish before SIGKILL + command: > + celery -A app.worker worker + --queue=simulation + --concurrency=16 + --pool=prefork + --without-gossip + --without-mingle + --max-tasks-per-child=100 + pids_limit: 64 # prefork: 16 children + Beat + parent + overhead + + worker-ingest: + deploy: + resources: + limits: { cpus: '2.0', memory: 4G } + stop_grace_period: 60s + pids_limit: 16 + + renderer: + deploy: + resources: + limits: { cpus: '4.0', memory: 8G } + pids_limit: 100 # Chromium spawns ~5 processes per render × concurrent renders + tmpfs: + - /tmp/renders:size=512m,mode=1777 # PDF scratch; never written to persistent layer + environment: + RENDER_OUTPUT_DIR: /tmp/renders + + db: + deploy: + resources: + limits: { memory: 64G } # explicit cap; prevents OOM killer targeting db + + redis: + deploy: + resources: + limits: { cpus: '2.0', memory: 8G } + + minio: + deploy: + resources: + limits: { cpus: '4.0', memory: 8G } +``` + +Note: `deploy.resources` is honoured by `docker compose` (v2) without Swarm mode from Compose spec 3.x. Verify with `docker compose version` ≥ 2.0. + +All containers run as non-root users, with read-only root filesystems and dropped capabilities (see §7.10), except for the renderer container's documented `SYS_ADMIN` exception in §7.11. That exception is accepted only for the renderer, must never be copied to other services, and requires stricter network isolation and annual review. 
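
The `stop_grace_period` values above only help if task code can stop cleanly: Celery's prefork pool performs a warm shutdown on SIGTERM (stop accepting tasks, finish running ones), and Docker escalates to SIGKILL once the grace period expires. For jobs that can outlast the grace period, a cooperative drain flag lets the task checkpoint instead of dying mid-run. A minimal stdlib sketch — the class name and the `checkpoint` hook are illustrative, not existing code:

```python
import signal

class DrainFlag:
    """Set by SIGTERM; long-running loops poll it between work units."""

    def __init__(self) -> None:
        self.draining = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Record the request only; actual cleanup happens in the work loop.
        self.draining = True

# Sketch of use inside a long Monte Carlo sub-task:
#   flag = DrainFlag()
#   for i, batch in enumerate(batches):
#       if flag.draining:
#           checkpoint(i)   # hypothetical: persist progress, then exit cleanly
#           break
#       run(batch)
```

The flag must be checked at a granularity finer than the grace period — a 300 s grace window is useless if a single uninterruptible batch runs for ten minutes.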
+ +#### Host Bind Mounts + +All directories that operators need to access directly on the VPS — logs, generated exports, config, and backups — are bind-mounted from the host filesystem. This means no `docker compose exec` is required for routine operations: log tailing, reading generated files, editing config, or recovering a backup. + +```yaml +services: + backend: + volumes: + - ./logs/backend:/app/logs # structured JSON logs; tail directly on host + - ./exports:/app/exports # org export ZIPs, report PDFs + - ./config/backend.toml:/app/config/settings.toml:ro # edit on host; container reads + + worker-sim: + volumes: + - ./logs/worker-sim:/app/logs + - ./exports:/app/exports # shared export directory with backend + + worker-ingest: + volumes: + - ./logs/worker-ingest:/app/logs + + frontend: + volumes: + - ./logs/frontend:/app/logs + + db: + volumes: + - /data/postgres:/var/lib/postgresql/data # DB data on host disk; survives container recreation + - ./backups/db:/backups # pg_basebackup output directly accessible on host + + minio: + volumes: + - /data/minio:/data # object storage on host disk +``` + +**Host-side directory layout** (under `/opt/spacecom/`): +``` +/opt/spacecom/ + logs/ + backend/ ← tail -f logs/backend/app.log + worker-sim/ + worker-ingest/ + frontend/ + exports/ ← ls exports/ to see generated reports and org export ZIPs + config/ + backend.toml ← edit directly; restart backend container to apply + backups/ + db/ ← pg_basebackup archives; rsync to offsite from here +data/ + postgres/ ← TimescaleDB data files (outside /opt to avoid accidental compose down -v) + minio/ ← MinIO object data +``` + +**Key rules:** +- `/data/postgres` and `/data/minio` live **outside** the project directory so `docker compose down -v` cannot accidentally wipe them (Compose only removes named volumes, not bind-mounted host paths, but keeping them separate is an additional safeguard) +- Log directories are created by `make init-dirs` before first `docker compose up`; 
containers write to them as a non-root user (UID 1000); host operator reads as the same UID or via `sudo` +- Config files are mounted `:ro` (read-only) inside the container — a misconfigured backend cannot overwrite its own config +- `make logs SERVICE=backend` is a convenience alias for `tail -f /opt/spacecom/logs/backend/app.log` + +#### Port Exposure Map + +| Port | Service | Exposed to | Notes | +|------|---------|------------|-------| +| 80 | Caddy | Public internet | HTTP → HTTPS redirect only | +| 443 | Caddy | Public internet | TLS termination; proxies to backend/frontend | +| 8000 | Backend API | Internal (`frontend_net`) | Never directly internet-facing | +| 3000 | Frontend (Next.js) | Internal (`frontend_net`) | Caddy proxies; HMR port 3001 dev-only | +| 5432 | TimescaleDB | Internal (`db_net`) | **Never exposed to `frontend_net` or host** | +| 6379 | Redis | Internal (`backend_net`, `worker_net`) | AUTH required; no public exposure | +| 9000 | MinIO API | Internal (`backend_net`, `worker_net`) | Pre-signed URL access only from outside | +| 9001 | MinIO Console | Internal (`db_net`) | Never exposed publicly; admin use only | +| 5555 | Flower (Celery monitor) | Internal only | VPN/bastion access only in production | +| 2379/2380 | etcd (Patroni DCS) | Internal (`db_net`) | Never exposed outside db_net | + +**CI check:** `scripts/check_ports.py` — parses `docker-compose.yml` and all `docker-compose.*.yml` overrides; fails if any port from the "never-exposed" category appears in a `ports:` mapping. Runs in every CI pipeline. + +#### Infrastructure-Level Egress Filtering + +Docker's built-in `iptables` rules prevent *inter-network* lateral movement but **do not** restrict egress to the public internet from within a network. An egress filtering layer is mandatory at Tier 2 and Tier 3. 
+ +**Allowed outbound destinations (whitelist):** + +| Service | Allowed destination | Protocol | Purpose | +|---------|---------------------|----------|---------| +| `ingest_worker` | `www.space-track.org` | HTTPS/443 | TLE / conjunction data | +| `ingest_worker` | `services.swpc.noaa.gov` | HTTPS/443 | Space weather | +| `ingest_worker` | `discosweb.esac.esa.int` | HTTPS/443 | DISCOS object catalogue | +| `ingest_worker` | `celestrak.org` | HTTPS/443 | TLE cross-validation | +| `ingest_worker` | `iers.org` | HTTPS/443 | EOP download | +| `backend` | SMTP relay (org-internal) | SMTP/587 | Alert email | +| All containers | Internal Docker networks | Any | Normal operation | +| All containers | **All other destinations** | **Any** | **BLOCKED** | + +**Implementation:** UFW or `nftables` rules on host (Tier 2); network policy + Calico/Cilium (Tier 3 Kubernetes migration); explicit allow-list in `docs/runbooks/egress-filtering.md`. Violations logged at WARN; repeated violations at CRITICAL. + +--- + +## 4. Coordinate Frames and Time Systems + +**This section is non-negotiable infrastructure.** Silent frame mismatches invalidate all downstream computation. All developers must understand and implement the conventions below before writing any propagation or display code. 
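
As a concrete illustration of why an explicit rotation step must sit between SGP4 output and Earth-fixed coordinates, here is a deliberately simplified TEME→PEF (pseudo Earth-fixed) rotation using the IAU 1982 GMST polynomial. It ignores polar motion and EOP corrections, so it is illustrative only — production code must use `astropy` with IERS EOP as this section specifies, and the function names here are not part of the codebase:

```python
import math

def gmst_rad(jd_ut1: float) -> float:
    """Greenwich Mean Sidereal Time (IAU 1982 model), radians in [0, 2*pi)."""
    t = (jd_ut1 - 2451545.0) / 36525.0  # Julian centuries of UT1 since J2000
    gmst_sec = (67310.54841
                + (876600.0 * 3600.0 + 8640184.812866) * t
                + 0.093104 * t * t
                - 6.2e-6 * t ** 3)
    return (gmst_sec % 86400.0) * (2.0 * math.pi / 86400.0)

def teme_to_pef(r_teme, jd_ut1):
    """Rotate a TEME position (metres) about z by GMST into pseudo Earth-fixed.

    Ignores polar motion, so PEF differs from ITRF at the metre level —
    never use this shortcut for stored or displayed products.
    """
    theta = gmst_rad(jd_ut1)
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = r_teme
    return (c * x + s * y, -s * x + c * y, z)
```

The real pipeline replaces this single rotation with the full TEME→GCRF→ITRF chain using precession-nutation models and IERS EOP data; the point of the sketch is that skipping the rotation entirely (piping raw TEME into an Earth-fixed consumer) produces positions that are wrong by the full Earth rotation angle.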
+ +### 4.1 Reference Frame Pipeline + +``` +TLE input + │ + ▼ sgp4 library propagation +TEME (True Equator Mean Equinox) ← SGP4 native output; do NOT store as final product + │ + ▼ IAU 2006 precession-nutation (or Vallado TEME→J2000 simplification) +GCRF / J2000 (Geocentric Celestial Reference Frame) + │ │ + │ ▼ CZML INERTIAL frame ← CesiumJS expects GCRF/ICRF, not TEME + │ + ▼ IAU Earth Orientation Parameters (EOP): IERS Bulletin A/B +ITRF (International Terrestrial Reference Frame) ← Earth-fixed; use for database storage + │ + ▼ WGS84 geodetic transformation +Latitude / Longitude / Altitude ← For display, hazard zones, airspace intersections +``` + +**Implementation:** Use `astropy` (`astropy.coordinates`, `astropy.time`) for all frame conversions. It handles IERS EOP download and interpolation automatically. For performance-critical batch conversions, pre-load EOP tables and vectorise. + +### 4.2 CesiumJS Frame Convention + +- CZML `position` with `referenceFrame: "INERTIAL"` expects **ICRF/J2000 Cartesian** coordinates in **metres** +- SGP4 outputs are in **TEME** and must be rotated to J2000 before being written into CZML +- CZML `position` with `referenceFrame: "FIXED"` expects **ITRF Cartesian** in metres +- Never pipe raw TEME coordinates into CesiumJS + +### 4.3 Time System Conventions + +| System | Where Used | Notes | +|--------|-----------|-------| +| **UTC** | System-wide reference. All API timestamps, database timestamps, CZML epochs | Convert immediately at ingestion boundary | +| **UT1** | Earth rotation angle for ITRF↔GCRF conversion | UT1-UTC offset from IERS EOP | +| **TT (Terrestrial Time)** | `astropy` internal; precession-nutation models | ~69 s ahead of UTC | +| **TLE epoch** | Encoded in TLE line 1 as year + day-of-year fraction | Parse to UTC immediately | +| **GPS time** | May appear in precision ephemeris products | GPS = UTC + 18 s as of 2024 | + +**Rule:** Store all timestamps as `TIMESTAMPTZ` in UTC. 
Convert to local time only at presentation boundaries. + +### 4.4 Coordinate Reference System Contract (F1 — §62) + +The CRS used at every system boundary is documented in `docs/COORDINATE_SYSTEMS.md`. This is the authoritative single-page reference for any engineer writing frame conversion code. + +| Boundary | CRS | Format | Notes | +|----------|-----|--------|-------| +| SGP4 output | TEME (True Equator Mean Equinox) | Cartesian metres | Must not leave `physics/` without conversion | +| Physics → CZML builder | GCRF/J2000 | Cartesian metres | Explicit `teme_to_gcrf()` call | +| CZML `position` (INERTIAL) | GCRF/J2000 | Cartesian metres | `referenceFrame: "INERTIAL"` | +| CZML `position` (FIXED) | ITRF | Cartesian metres | `referenceFrame: "FIXED"` | +| Database storage (`orbits`) | GCRF/J2000 | Cartesian metres | Consistent with CZML inertial | +| Corridor polygon (DB) | WGS-84 geographic | `GEOGRAPHY(POLYGON)` SRID 4326 | Geodetic lat/lon from ITRF→WGS-84 | +| FIR boundary (DB) | WGS-84 geographic | `GEOMETRY(POLYGON, 4326)` | Planar approx. for regional FIRs | +| API response | WGS-84 geographic | GeoJSON (EPSG:4326) | Degrees; always lon,lat order (GeoJSON spec) | +| Globe display (CesiumJS) | ICRF (= GCRF for practical purposes) | Cartesian metres via CZML | CesiumJS handles geodetic display | +| Altitude display | WGS-84 ellipsoidal | km or ft (user preference) | See §4.4a for datum labelling | + +**Antimeridian and pole handling (F5 — §62):** + +- **Antimeridian:** Corridor polygons stored as `GEOGRAPHY` handle antimeridian crossing correctly — PostGIS GEOGRAPHY uses spherical arithmetic and does not wrap coordinates. CesiumJS CZML polygon positions must be expressed as a continuous polyline; for antimeridian-crossing corridors, the CZML serialiser must not clamp coordinates to ±180° — pass the raw ITRF→geodetic output. CesiumJS handles coordinate wrapping internally when `referenceFrame: "FIXED"` is used for corridor polygons. 
+- **Polar orbits:** For objects with inclination > 80°, the ground track corridor may approach or cross the poles. `ST_AsGeoJSON` on a GEOGRAPHY polygon that passes within ~1° of a pole can produce degenerate output (longitude undefined at the pole itself). Mitigation: before storing, check `ST_DWithin(corridor, ST_GeogFromText('SRID=4326;POINT(0 90)'), 111000)` (within 1° of north pole) or south pole equivalent — if true, log a `POLAR_CORRIDOR_WARNING` and clip the polygon to 89.5° max latitude. This is a rare case (ISS incl. 51.6°; most rocket bodies are below 75° incl.) but must not crash the pipeline. + +**`docs/COORDINATE_SYSTEMS.md`** is a Phase 1 deliverable. Tests in `tests/test_frame_utils.py` serve as executable verification of the contract. + +### 4.5 Implementation Checklist + +- [ ] `frame_utils.py`: `teme_to_gcrf()`, `gcrf_to_itrf()`, `itrf_to_geodetic()` +- [ ] Unit tests against Vallado 2013 reference cases +- [ ] EOP data auto-refresh: weekly Celery task pulling IERS Bulletin A; verify SHA-256 checksum of downloaded file before applying +- [ ] CZML builder uses `gcrf_to_czml_inertial()` — explicit function, never implicit conversion +- [ ] `docs/COORDINATE_SYSTEMS.md` committed with CRS boundary table + +--- + +## 5. User Personas + +All UX decisions are traceable to one of the four personas defined here. Navigation, default views, information hierarchy, and alert behaviour must serve user tasks — not the system's internal module structure. + +### Persona A — Operational Airspace Manager + +**Role:** ANSP or aviation authority staff. Responsible for airspace safety decisions in real-time or near-real-time. + +**Primary question:** "Is any airspace under my responsibility affected in the next 6–12 hours, and what do I need to do about it?" + +**Key needs:** Immediate situational awareness, clear go/no-go spatial display for their region, alert acknowledgement workflow, one-click advisory export, minimal cognitive load. 
+ +**Tolerance for complexity:** Very low. + +--- + +### Persona B — Safety Analyst + +**Role:** Space agency, authority research arm, or consultancy. Conducts detailed re-entry risk assessments for regulatory submissions or post-event reports. + +**Primary question:** "What is the full uncertainty envelope, what assumptions drove the prediction, and how does this compare to previous similar events?" + +**Key needs:** Full simulation parameter access, run comparison, numerical uncertainty detail, full data provenance, configurable report generation, historical replay. + +**Tolerance for complexity:** High. + +--- + +### Persona C — Incident Commander + +**Role:** Senior official coordinating response during an active re-entry event. Uses the platform as a shared situational awareness tool in a briefing room. + +**Primary question:** "Where exactly is it coming down, when, and what is the worst-case affected area right now?" + +**Key needs:** Clean large-format display, auto-narrowing corridor updates, countdown timer, plain-language status summary, shareable live-view URL. + +**Tolerance for complexity:** Low. + +--- + +### Persona D — Systems Administrator / Data Manager + +**Role:** Technical operator managing system health, data ingest, model configuration, and user accounts. + +**Primary question:** "Is everything ingesting correctly, are data sources healthy, and are workers keeping up?" + +**Key needs:** System health dashboard, ingest job status, worker queue metrics, model version management, user and role management. + +**Tolerance for complexity:** High technical tolerance. + +--- + +### Persona E — Space Operator + +**Role:** Satellite or launch vehicle operator responsible for one or more objects in the SpaceCom catalog. May be a commercial operator, a national space agency operating assets, or a launch service provider managing spent upper stages. 
+ +**Primary question:** "What is the current decay prediction for my objects, when do I need to act, and if I have manoeuvre capability, what deorbit window minimises ground risk?" + +**Key needs:** Object-scoped view showing only their registered objects; decay prediction with full Monte Carlo detail; controlled re-entry corridor planner (for objects with remaining propellant); conjunction alert for their own objects; API key management for programmatic integration with their own operations centre; exportable predictions for regulatory submission under national space law. + +**Tolerance for complexity:** High — these are trained orbital engineers, not ATC professionals. + +**Regulatory context:** Many space operators have legal obligations under national space law (e.g., Australia Space (Launches and Returns) Act 2018, FAA AST licensing) to demonstrate responsible end-of-life management. SpaceCom outputs serve as supporting evidence for those submissions. The platform must produce artefacts suitable for regulatory audit. + +--- + +### Persona F — Orbital Analyst + +**Role:** Technical analyst at a space agency, research institution, safety consultancy, or the SSA/STM office of a national authority. Conducts orbital analysis, validates predictions, and produces technical assessments — potentially across the full catalog, not just owned objects. + +**Primary question:** "What does the full orbital picture look like for this object class, how do SpaceCom predictions compare to other tools, and what are the statistical properties of the prediction ensemble?" + +**Key needs:** Full catalog read access; conjunction screening across arbitrary object pairs; simulation parameter tuning and comparison; bulk export (CSV, JSON, CCSDS formats); access to raw propagation outputs (state vectors, covariance matrices); historical validation runs; API access for batch processing. 
+ +**Tolerance for complexity:** Very high — this persona builds the technical evidence base that other personas act on. + +--- + +## 6. UX Design Specification + +This section translates engineering capability into concrete interface designs. All designs are persona-linked and phase-scheduled. + +### 6.1 Information Architecture — Task-Based Navigation + +Navigation is organised around user tasks, not backend modules. Module names never appear in the UI. + +The platform has two navigation domains — **Aviation** (default for Persona A/B/C) and **Space** (for Persona E/F). Both are accessible from the top navigation. The root route (`/`) defaults to the domain matched to the user's role on login. + +**Aviation Domain Navigation:** +``` +/ → Operational Overview (Persona A, C primary) +/watch/{norad_id} → Object Watch Page (Persona A, B) +/events → Active Events + Timeline (Persona A, C) +/events/{id} → Event Detail (Persona A, B, C) +/airspace → Airspace Impact View (Persona A) +/analysis → Analyst Workspace (Persona B primary) +/catalog → Object Catalog (Persona B) +/reports → Report Management (Persona A, B) +/admin → System Administration (Persona D) +``` + +**Space Domain Navigation:** +``` +/space → Space Operator Overview (Persona E, F primary) +/space/objects → My Objects Dashboard (Persona E — owned objects only) +/space/objects/{norad_id} → Object Technical Detail (Persona E, F) +/space/reentry/plan → Controlled Re-entry Planner (Persona E) +/space/conjunction → Conjunction Screening (Persona F) +/space/analysis → Orbital Analyst Workspace (Persona F) +/space/export → Bulk Export (Persona F) +/space/api → API Keys + Documentation (Persona E, F) +``` + +The 3D globe is a shared component embedded within pages, not a standalone page. Different pages focus and configure the globe differently. + +--- + +### 6.2 Operational Overview Page (`/`) + +Landing page for Persona A and C. Loads immediately without configuration. 
+ +**Layout:** + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ [● LIVE] SpaceCom [Space Weather: ELEVATED ▲] [Alerts: 2] │ +├──────────────────────────────┬──────────────────────────────────┤ +│ │ ACTIVE EVENTS │ +│ 3D GLOBE │ ● CZ-5B R/B 44878 │ +│ (active events + │ Window: 08h – 20h from now │ +│ affected FIRs only) │ Most likely ~14h from now │ +│ │ YMMM FIR — HIGH │ +│ │ [View] [Corridor] │ +│ │ ───────────────────────────── │ +│ │ ○ SL-16 R/B 28900 │ +│ │ Window: 54h – 90h from now │ +│ │ Most likely ~72h from now │ +│ │ Ocean — LOW │ +│ │ │ +│ │ 72-HOUR TIMELINE │ +│ │ [Gantt strip] │ +│ │ │ +│ │ SPACE WEATHER │ +│ │ Activity: ELEVATED │ +│ │ Extend window: add ≥2h buffer │ +├──────────────────────────────┴──────────────────────────────────┤ +│ [● Live] ──────────●────────────────────────────── +72h │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Globe default state:** Active decay objects and their corridors only. All other objects hidden. Affected FIR boundaries highlighted. No orbital tracks unless the user expands an event card. + +**Temporal uncertainty display — Persona A/C:** Event cards and the Operational Overview show window ranges in plain language (`Window: 08h – 20h from now / Most likely ~14h from now`), never `± N` notation. The `±` form implies symmetric uncertainty, which re-entry distributions are not. The Analyst Workspace (Persona B) additionally shows raw p05/p50/p95 UTC times. + +--- + +### 6.3 Time Navigation System + +Three modes — always visible, always unambiguous. Mixing modes without explicit user intent is prohibited. + +| Mode | Indicator | Description | +|------|-----------|-------------| +| **LIVE** | Green pulsing pill: `● LIVE` | Current real-world state. Globe and predictions update from live feeds. | +| **REPLAY** | Amber pill: `⏪ REPLAY 2024-01-14 03:22 UTC` | Replaying a historical event. All data fixed. No live updates. 
| +| **SIMULATION** | Purple pill: `⚗ SIMULATION — [object name]` | Custom scenario. Data is synthetic. Must never be confused with live. | + +The mode indicator is persistent in the top nav bar. Switching modes requires explicit action through a mode-switch dialogue — it cannot happen implicitly. + +**Mode-switch dialogue specification:** + +When the user initiates a mode switch (e.g., LIVE → SIMULATION), the following modal must appear. The dialogue must explicitly state the current mode, the target mode, and all operational consequences: + +``` +SWITCH TO SIMULATION MODE? +────────────────────────────────────────────────────────────── +You are currently viewing LIVE data. +Switching to SIMULATION will display synthetic scenario data. + + ⚠ Alerts and notifications are suppressed in SIMULATION. + ⚠ Simulation data must never be used for operational decisions. + ⚠ Other users will not see your simulation. + +[Cancel] [Switch to Simulation ▶] +────────────────────────────────────────────────────────────── +``` + +Rules: +- Cancel on left, destructive action on right (consistent with aviation HMI conventions) +- The dialogue must always show both the current mode and target mode — never just "are you sure?" +- Equivalent dialogues apply for all mode transitions (LIVE ↔ REPLAY, LIVE ↔ SIMULATION, etc.) + +**Simulation mode block during active alerts:** If the organisation has `disable_simulation_during_active_events` enabled (admin setting, default: off), the SIMULATION mode switch is blocked whenever there are unacknowledged CRITICAL or HIGH alerts. A modal replaces the switch dialogue: + +``` +CANNOT ENTER SIMULATION +────────────────────────────────────────────────────────────── +2 active CRITICAL alerts require acknowledgement. +Acknowledge all active alerts before running simulations. 
+ +[View active alerts] [Cancel] +────────────────────────────────────────────────────────────── +``` + +Document `disable_simulation_during_active_events` prominently in the admin UI: *"Enable only if your organisation has a dedicated SpaceCom monitoring role separate from simulation users."* + +**Timeline control — two zoom levels:** + +- **Event scale (default):** 72 hours, 6-hour intervals. Re-entry windows shown as coloured bars. +- **Orbital scale:** 4-hour window, 15-minute intervals. For orbital passes and conjunction events. + +**LIVE mode scrub:** User can drag the playhead into the future to preview a predicted corridor. A "Return to Live" button appears whenever the playhead is not at current time. + +**Future-preview temporal wash:** When the timeline playhead is not at current time (user is previewing a future state), the entire right-panel event list and alert badges are overlaid with a temporal wash (semi-transparent grey overlay) and a persistent label: + +``` +┌──────────────────────────────────────────────────────────────┐ +│ ⏩ PREVIEWING +4h 00m — not current state [Return to Live] │ +└──────────────────────────────────────────────────────────────┘ +``` + +The wash and label prevent a controller from acting on predicted-future data as though it were current. The globe corridor may show the projected state; the event list must be visually distinct. Alert badges are greyed and annotated "(projected)" in preview mode. Alert sounds and notifications are suppressed while previewing. + +--- + +### 6.4 Uncertainty Visualisation — Three Phased Modes + +Three representations are planned across phases. All are user-selectable via the `UncertaintyModeSelector` once implemented. Each page context has a recommended default. 
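The phased roll-out can be gated server-side so the selector never offers an unshipped mode. A minimal sketch in Python, assuming a hypothetical `platform_phase` configuration value that records the highest phase currently shipped:

```python
from enum import IntEnum

class CorridorMode(IntEnum):
    """Uncertainty display modes, valued by the phase that ships them."""
    PERCENTILE_CORRIDORS = 1  # Mode A - Phase 1
    PROBABILITY_HEATMAP = 2   # Mode B - Phase 2
    MC_PARTICLES = 3          # Mode C - Phase 3

def selectable_modes(platform_phase: int) -> list[CorridorMode]:
    """Modes the selector may enable; later-phase modes stay greyed out."""
    return [mode for mode in CorridorMode if mode.value <= platform_phase]
```

The frontend can then render unshipped modes as disabled options rather than hiding them, which preserves the stable three-option layout of the selector.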
+ +**Mode selector** (appears in the layer controls panel whenever corridor data is loaded): +``` +Corridor Display +● Percentile Corridors ← Phase 1 +○ Probability Heatmap ← Phase 2 +○ Monte Carlo Particles ← Phase 3 +``` + +Modes B and C appear greyed in the selector until their phase ships. + +--- + +#### Mode A — Percentile Corridors (Phase 1, default for Persona A/C) + +**What it shows:** Three nested polygon swaths on the globe — 5th, 50th, and 95th percentile ground track corridors from Monte Carlo output. + +**Visual encoding:** +- 95th percentile: wide, 15% opacity amber fill, dashed border — hazard extent +- 50th percentile: medium, 35% opacity amber fill, solid border — nominal corridor +- 5th percentile: narrow, 60% opacity amber fill, bold border — high-probability core + +**Colour by risk level:** Ocean-only → blue family; partial land → amber; significant land → red-orange. + +**Over time:** As the re-entry window narrows, the outer swath contracts automatically in LIVE mode. The user watches the corridor "tighten" in real-time. + +--- + +#### Mode B — Probability Heatmap (Phase 2, default for Persona B) + +**What it shows:** Continuous colour-ramp Deck.gl heatmap. Each cell's colour encodes probability density of ground impact across the full Monte Carlo sample set. + +**Visual encoding:** Perceptually uniform, colour-blind-safe sequential palette (viridis or custom blue-white-orange). Scale normalised to the maximum probability cell; legend with percentile labels always shown. + +**Interaction:** Hover a cell → tooltip shows "~N% probability of impact within this 50×50 km cell." The heatmap is recomputed client-side if the user adjusts the re-entry window bounds via the timeline. + +--- + +#### Mode C — Monte Carlo Particle Visualisation (Phase 3, Persona B advanced / Persona C briefing) + +**What it shows:** 50–200 animated MC sample trajectory lines converging from re-entry interface altitude (~80 km) to impact. 
Particle colour encodes F10.7 assumption (cool = low solar activity = later re-entry, warm = high). Impact points persist as dots. + +**Interaction:** Play/pause animation; scrub to any point in the trajectory; click a particle to see its parameter set (F10.7, Ap, B*). + +**Performance:** Use CesiumJS `Primitive` API with per-instance colour attributes — not `Entity` API. Trajectory geometry pre-baked server-side and streamed as binary format (`/viz/mc-trajectories/{prediction_id}`). Never compute trajectories in the browser. + +**Not the default for Persona A** — the animation can be alarming without quantitative context. + +**Weighted opacity:** Particles render with opacity proportional to their sample weight, not uniform opacity. This visually down-weights outlier trajectories so that low-probability high-consequence paths do not visually dominate. + +**Mandatory first-use overlay:** When Mode C is first enabled (per user, tracked in user preferences), a one-time overlay appears before the animation starts: + +``` +MONTE CARLO PARTICLE VIEW +────────────────────────────────────────────────────────────── +Each animated line shows one possible re-entry scenario sampled +from the prediction distribution. Colour encodes the solar +activity assumption used for that sample. + +These are not equally likely outcomes — particle opacity +reflects sample weight. For operational planning, the +Percentile Corridors view (Mode A) gives a more reliable +summary. + +[Understood — show animation] +────────────────────────────────────────────────────────────── +``` + +The overlay is dismissed permanently per user on first acknowledgement and never shown again. It cannot be bypassed — the animation does not play until the user explicitly acknowledges. + +--- + +### 6.5 Globe Information Hierarchy and Layer Management + +**Default view state:** Active decay objects and their corridors, FIR boundaries for affected regions. "Show everything" is never the default. 
+ +**Layer management panel:** + +``` +LAYERS +──────────────────────────────────────── +Objects + ☑ Active decay objects (TIP issued) + ☑ Decaying objects (perigee < 250 km) + ☐ All tracked payloads + ☐ Rocket bodies + ☐ Debris catalog + +Orbital Tracks + ☐ Ground tracks (selected object only) + ☐ All objects — [!] performance warning + +Predictions & Corridors + ☑ Re-entry corridors (active events) + ☐ Re-entry corridors (all predicted) + ☐ Fragment impact points + ☐ Conjunction geometry + +Airspace (Phase 2) + ☐ FIR / UIR boundaries + ☐ Controlled airspace + ☐ Affected sectors (hazard intersection) + +Reference + ☐ Population density grid + ☐ Critical infrastructure +──────────────────────────────────────── +Corridor Display: [Percentile ▾] +``` + +Layer state persists to `localStorage` per session. Shared URLs encode active layer state in query parameters. + +**Object clustering:** At zoom > 5,000 km, objects cluster. Badge shows count and highest urgency level. Clusters expand at < 2,000 km. + +**Altitude-aware clustering rule (F8 — §62):** Objects at different altitudes with the same ground-track sub-point are not co-located — they have different re-entry windows and different hazard profiles. Two objects that share a 2D screen position but differ by > 100 km in altitude must **not** be merged into a single cluster. Implementation rule: CesiumJS `EntityCluster` clustering is disabled for any object with `reentry_predictions` showing a window < 30 days (i.e., any decay-relevant object in the watch/alert state). Objects in the normal catalog (`window > 30 days`) may continue to use screen-space clustering. This prevents the pathological case where a TIP-active object at 200 km is merged into a cluster with a nominal object at 500 km that shares its ground track, making the TIP object invisible in the cluster badge. 
+
+**Urgency / Priority Visual Encoding** (colour-blind-safe — shape distinguishes as well as colour):
+
+| State | Symbol | Colour | Meaning |
+|-------|--------|--------|---------|
+| TIP issued, window < 6h | ◆ filled diamond | Red `#D32F2F` | Imminent re-entry |
+| TIP issued, window 6–24h | ◇ outlined diamond | Orange `#E65100` | Active threat |
+| Predicted decay, window < 7d | ▲ triangle | Amber `#F9A825` | Elevated watch |
+| Decaying, window > 7d | ● circle | Yellow-grey | Monitor |
+| Conjunction Pc > 1:1000 | ✕ cross | Purple `#6A1B9A` | Conjunction risk |
+| Normal tracked | · dot | Grey `#546E7A` | Catalog |
+
+Never use red/green as the sole distinguishing pair.
+
+---
+
+### 6.6 Alert System UX
+
+**Alert taxonomy:**
+
+| Level | Trigger | Visual Treatment | Requires Acknowledgement? |
+|-------|---------|-----------------|--------------------------|
+| **CRITICAL** | TIP issued, window < 6h, hazard intersects active FIR | Full-width banner (red), audio tone (ops room mode) | Yes — named user; timestamp + note recorded |
+| **HIGH** | Window < 24h, conjunction Pc > 1:1000 | Persistent badge (orange) | Yes — dismissal recorded |
+| **MEDIUM** | New TIP issued (any), window < 7d, new CDM | Toast (amber), 8s auto-dismiss | No — logged |
+| **LOW** | New TLE ingested, space weather index change | Notification centre only | No |
+
+**Alert fatigue mitigation:**
+- Mute rules: per-user, per-session LOW suppression
+- Geographic filtering: alerts scoped to user's configured FIR list
+- Deduplication: window shrinks that don't cross a threshold do not re-trigger
+- Rate limit: same trigger condition cannot produce more than 1 CRITICAL alert per object per 4-hour window without a manual operator reset
+- Alert generation triggered only by backend logic on verified data — never by direct API call from a client
+
+**Ops room workload buffer (`OPS_ROOM_SUPPRESS_MINUTES`):** An optional per-organisation setting (default: 0 — disabled). 
When set to N > 0, CRITICAL alert full-screen banners are queued for up to N minutes before display. The top-nav badge increments immediately so peripheral attention is captured; only the full-screen interrupt is deferred. This matches the FAA AC 25.1322 flightcrew alerting prioritisation philosophy: acknowledge at a glance, act when workload permits. Must be documented in the admin UI with a mandatory warning: *"Only enable if your operations room has a dedicated SpaceCom monitoring role. If a single controller manages all alerts, suppression introduces delay that may be safety-significant."*
+
+**Audio alert specification:**
+- Trigger: CRITICAL alert only (no audio for HIGH or lower)
+- Sound: two-tone ascending chime pattern (not a siren — ops rooms have sirens from other systems)
+- Behaviour: plays once on alert display; does not loop; stops on alert acknowledgement (not just banner dismiss)
+- Volume: configurable per-device (default 50% system volume); mutable by operator per-session
+- Ops room mode: organisation-level setting that enables audio (default: off; requires explicit activation)
+
+**Alert storm detection:** If the system generates ≥ 5 CRITICAL alerts within 1 hour across all objects, generate a meta-alert to Persona D. The meta-alert presents a disambiguation prompt rather than a bare count:
+
+```
+[META-ALERT — ALERT VOLUME ANOMALY]
+──────────────────────────────────────────────────────────────
+5 CRITICAL alerts generated within 1 hour.
+
+This may indicate:
+  (a) Multiple genuine re-entry events — verify via Space-Track
+      independently before taking operational action.
+  (b) System integrity issue — check ingest pipeline and data
+      source health for signs of false data injection. 
+
+[Open /admin health dashboard →]   [View all CRITICAL alerts →]
+──────────────────────────────────────────────────────────────
+```
+
+**Acknowledgement workflow:**
+
+CRITICAL acknowledgement requires two steps to prevent accidental confirmation:
+
+**Step 1** — Alert banner with summary and Open Map link:
+```
+[CRITICAL ALERT]
+───────────────────────────────────────────────────────
+CZ-5B R/B (44878) — TIP Issued
+Re-entry window: 2026-03-16 14:00 – 22:00 UTC (8h)
+Affected FIRs: YMMM, YBBB
+Risk level: HIGH | [Open map →]
+[Review and Acknowledge →]
+───────────────────────────────────────────────────────
+```
+
+**Step 2** — Confirmation modal (appears on clicking "Review and Acknowledge"):
+```
+ACKNOWLEDGE CRITICAL ALERT
+───────────────────────────────────────────────────────
+CZ-5B R/B (44878) — Re-entry window 14:00–22:00 UTC 16 Mar
+
+Action taken (required — minimum 10 characters):
+[_____________________________________________]
+
+[Cancel]   [Confirm — J. Smith, 09:14 UTC]
+───────────────────────────────────────────────────────
+```
+
+The Confirm button is disabled until the `Action taken` field contains ≥ 10 characters. This prevents reflexive one-click acknowledgement during an incident and ensures a minimal action record is always created.
+
+Acknowledgements stored in `alert_events` (append-only). Records cannot be modified or deleted.
+
+---
+
+### 6.7 Timeline / Gantt View
+
+Full timeline accessible from `/events` and as a compact strip on the Operational Overview.
+
+```
+                     NOW      +6h      +12h     +24h     +48h     +72h
+Object               │        │        │        │        │        │
+────────────────────┼────────┼────────┼────────┼────────┼────────┼────
+CZ-5B R/B 44878     │   [■■■■■[══════ window ═══════]■■■]        │
+  YMMM FIR — HIGH   │        │        │        │        │        │
+────────────────────┼────────┼────────┼────────┼────────┼────────┼────
+SL-16 R/B 28900     │        │        │  [■[══════════════════════════→
+  NZZC FIR — MED    │        │        │        │        │        │
+```
+
+`■` = nominal re-entry point; `══` = uncertainty window; colour = risk level. 
+ +Click event bar → Event Detail page; hover → tooltip with window bounds and affected FIRs. Zoom range: 6h to 7d. + +--- + +### 6.8 Event Detail Page (`/events/{id}`) + +``` +┌──────────────────────────────────────────────────────────────┐ +│ ← Events │ CZ-5B R/B NORAD 44878 │ [■ CRITICAL] │ +│ │ Re-entry window: 14:00–22:00 UTC 16 Mar 2026 │ +├──────────────────────────────┬───────────────────────────────┤ +│ │ OBJECT │ +│ 3D GLOBE │ Mass: 21,600 kg (● DISCOS) │ +│ (focused on corridor) │ B*: 0.000215 /ER │ +│ Mode: [Percentile ▾] │ Data confidence: ● DISCOS │ +│ [Layers] │ │ +│ │ PREDICTION │ +│ │ Model: cowell_nrlmsise00 v2 │ +│ │ F10.7 assumed: 148 sfu │ +│ │ MC samples: 500 │ +│ │ HMAC: ✓ verified │ +│ │ │ +│ │ WINDOW │ +│ │ 5th pct: 13:12 UTC │ +│ │ 50th pct: 17:43 UTC │ +│ │ 95th pct: 22:08 UTC │ +│ │ │ +│ │ TIP MESSAGES │ +│ │ MSG #3 — 09:00 UTC today │ +│ │ [All TIP history →] │ +├──────────────────────────────┴───────────────────────────────┤ +│ AFFECTED AIRSPACE (Phase 2) │ +│ YMMM FIR ████ HIGH entry 14:20–19:10 UTC │ +├──────────────────────────────────────────────────────────────┤ +│ [Run Simulation] [Generate Report] [Share Link] │ +└──────────────────────────────────────────────────────────────┘ +``` + +**HMAC verification status** is displayed prominently. If `✗ verification failed` appears, a banner reads: "This prediction record may have been tampered with. Do not use for operational decisions. Contact your system administrator." + +**Data confidence** annotates every physical property: `● DISCOS` (green), `● estimated` (amber), `● unknown` (grey). When source is `unknown` or `estimated`, a warning callout appears above the prediction panel. + +**Corridor Evolution widget (Phase 2):** A compact 2D strip on the Event Detail page showing how the p50 corridor footprint is evolving over time — three overlapping semi-transparent polygon outlines at T+0h, T+2h, T+4h from the current prediction. Updated automatically in LIVE mode. 
Gives Persona A Level 3 situation awareness (projection) at a glance without requiring simulation tools. Labelled: *"Corridor evolution — how prediction is narrowing"*. If the corridor is widening (unusual), an amber warning appears: *"Uncertainty is increasing — check space weather."*
+
+**Duty Manager View (Phase 2):** A `[Duty Manager View]` toggle button on the Event Detail header. When active, collapses all technical detail and presents a large-text, decluttered view containing only:
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│  CZ-5B R/B   NORAD 44878                      [■ CRITICAL]   │
+│                                                              │
+│  RE-ENTRY WINDOW                                             │
+│  Start:        14:00 UTC 16 Mar 2026                         │
+│  End:          22:00 UTC 16 Mar 2026                         │
+│  Most likely:  17:43 UTC                                     │
+│                                                              │
+│  AFFECTED FIRs                                               │
+│  YMMM (Airservices Australia) — HIGH RISK                    │
+│  YBBB (Airservices Australia) — MEDIUM RISK                  │
+│                                                              │
+│  [Draft NOTAM]   [Log Action]   [Share Link]                 │
+└──────────────────────────────────────────────────────────────┘
+```
+
+Toggle back to full view via `[Technical Detail]`. State is not persisted between sessions — always starts in full view.
+
+**Response Options accordion (Phase 2):** An expandable panel at the bottom of the Event Detail page, visible to the `operator` role and above. Contextualised to the current risk level and FIR intersection. These are considerations only — all decisions rest with the ANSP:
+
+```
+RESPONSE OPTIONS [▼ expand]
+──────────────────────────────────────────────────────────────
+Based on current prediction (risk: HIGH, window: 8h):
+
+The following actions are for your consideration.
+All operational decisions rest with the ANSP. 
+
+  ☐ Issue SIGMET or advisory to aircraft in YMMM FIR
+  ☐ Notify adjacent ANSPs (YMMM neighbours, e.g. YBBB, WAAF)
+  ☐ Draft NOTAM for authorised issuance [Open →]
+  ☐ Coordinate with FMP on traffic flow impact
+  ☐ Establish watching brief schedule (every 30 min)
+
+[Log coordination note]
+──────────────────────────────────────────────────────────────
+```
+
+Checkbox states and coordination notes are appended to `alert_events` (append-only). The Response Options items are dynamically generated by the backend based on risk level and affected FIR count — not hardcoded in the frontend.
+
+---
+
+### 6.9 Simulation Job Management UX
+
+Persistent collapsible bottom-drawer panel visible on any page. Jobs continue running when the user navigates away.
+
+```
+SIMULATION JOBS [▲ collapse]
+────────────────────────────────────────────────────────────────
+● Running    Decay prediction — 44878        312/500 ████░ 62%
+             F10.7: 148, Ap: 12, B*±10%      ~45s rem
+             [Cancel]
+
+✓ Complete   Decay prediction — 44878        High F10.7 scenario
+             Completed 09:02 UTC             [View results] [Compare]
+
+✗ Failed     Breakup simulation — 28900
+             Error: DISCOS data missing      [Retry] [Details]
+────────────────────────────────────────────────────────────────
+```
+
+**Simulation comparison:** Two completed runs for the same object can be overlaid on the globe with distinct colours and a split-panel parameter comparison.
+
+---
+
+### 6.10 Space Weather Widget
+
+```
+SPACE WEATHER                                      [09:14 UTC]
+────────────────────────────────────────────────────────────
+Solar Activity      ●●●○○ ELEVATED
+  F10.7 observed: 148 sfu   (81d avg: 132)
+
+Geomagnetic         ●●●●○ ACTIVE
+  Kp: 5.3 / Ap daily: 27
+
+Re-entry Impact     ▲ Active conditions — extend precaution window
+  Add ≥2h buffer beyond 95th percentile. 
+
+Forecast (24h)      Activity expected to decline — Kp 3–4
+────────────────────────────────────────────────────────────
+Source: NOAA SWPC   Updated: 09:00 UTC        [Full history →]
+```
+
+**Operational status summary** is generated by the backend based on F10.7 deviation from the 81-day average. The "Re-entry Impact" line delivers an operationally actionable statement — not a percentage — with a concrete recommended precaution buffer computed by the backend and delivered as a structured field:
+
+| Condition | Re-entry Impact line | Recommended buffer |
+|-----------|----------------------|--------------------|
+| F10.7 < 90 and Kp < 2 | Low activity — predictions at nominal accuracy | +0h |
+| F10.7 90–140, Kp 2–4 | Moderate activity — standard uncertainty applies | +1h |
+| F10.7 140–200, Kp 4–6 | Active conditions — extend precaution window. Add ≥2h buffer beyond 95th percentile. | +2h |
+| F10.7 > 200 or Kp > 6 | High activity — predictions less reliable. Add ≥4h buffer beyond 95th percentile. | +4h |
+
+Where F10.7 and Kp fall into different rows, the more conservative (higher-buffer) row governs.
+
+The buffer recommendation is surfaced on the Event Detail page as an explicit callout when conditions are Elevated or above: *"Space weather active: consider extending your airspace precaution window to [95th pct time + buffer]."*
+
+---
+
+### 6.11 2D Plan View (Phase 2)
+
+Globe/map toggle (`[🌐 Globe] [🗺 Plan]`) synchronises selected object, active corridor, and time position. State is preserved on switch.
+
+**2D view features:** Mercator or azimuthal equidistant projection; ICAO chart symbology for airspace; ground-track corridor as horizontal projection only; altitude/time cross-section panel below showing corridor vertical extent at each FIR crossing. 
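The precaution-buffer table in §6.10 maps directly onto a small backend function. A minimal sketch, assuming the convention that the most severe matching row governs when the two indices disagree:

```python
def precaution_buffer_hours(f107_sfu: float, kp: float) -> int:
    """Recommended buffer (hours) beyond the 95th-percentile re-entry time,
    from the Re-entry Impact table in section 6.10. The most severe row is
    checked first so the conservative condition wins if the indices disagree."""
    if f107_sfu > 200 or kp > 6:
        return 4  # high activity: predictions less reliable
    if f107_sfu >= 140 or kp >= 4:
        return 2  # active conditions: extend precaution window
    if f107_sfu >= 90 or kp >= 2:
        return 1  # moderate activity: standard uncertainty
    return 0      # low activity: nominal accuracy
```

For the widget values shown in §6.10 (F10.7 = 148 sfu, Kp = 5.3) this returns 2, matching the +2h row.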
+ +--- + +### 6.12 Reporting Workflow + +**Report configuration dialogue:** + +``` +NEW REPORT — CZ-5B R/B (44878) +────────────────────────────────────────────────────────────── +Simulation: [Run #3 — 09:14 UTC ▾] + +Report Type: + ○ Operational Briefing (1–2 pages, plain language) + ○ Technical Assessment (full uncertainty, model provenance) + ○ Regulatory Submission (formal format, appendices) + +Include Sections: + ☑ Object properties and data confidence + ☑ Re-entry window and uncertainty percentiles + ☑ Ground track corridor map + ☑ Affected airspace and FIR crossing times + ☑ Space weather conditions at prediction time + ☑ Model version and simulation parameters + ☐ Full MC sample distribution + ☐ TIP message history + +Prepared by: J. Smith Authority: CASA +────────────────────────────────────────────────────────────── +[Preview] [Generate PDF] [Cancel] +``` + +**Report identity:** Every report has a unique ID, the simulation ID it was derived from, a generation timestamp, and the analyst's name. Reports are stored in MinIO and listed in `/reports`. + +**Date format in all reports and exports (F7):** Slash-delimited dates (`03/04/2026`) are ambiguous between DD/MM and MM/DD and are banned from all SpaceCom outputs. All dates in PDF reports, CSV exports, and NOTAM drafts use **`DD MMM YYYY`** format (e.g. `04 MAR 2026`) — unambiguous across all locales and consistent with ICAO and aviation convention. All times alongside dates use `HH:MMZ` (e.g. `04 MAR 2026 14:00Z`). This applies to: PDF prediction reports, CSV bulk exports, NOTAM draft `(B)`/`(C)` fields (which use ICAO `YYMMDDHHMM` format internally but are displayed as `DD MMM YYYY HH:MMZ` in the preview). + +**Report rendering:** Server-side Playwright in the isolated `renderer` container. The map image is a headless Chromium screenshot of the globe at the relevant configuration. All user-supplied text is HTML-escaped before interpolation. 
The renderer has no external network access — it receives only sanitised, structured data from the backend API. + +--- + +### 6.13 NOTAM Drafting Workflow (Phase 2) + +SpaceCom cannot issue NOTAMs. Only designated NOTAM offices authorised by the relevant AIS authority can issue them. SpaceCom's role is to produce a draft in ICAO Annex 15 format ready for review and formal submission by an authorised originator. + +**Trigger:** From the Event Detail page, Persona A clicks `[Draft NOTAM]`. This is only available when a hazard corridor intersects one or more FIRs. + +**Draft NOTAM output (ICAO Annex 15 / OPADD format):** + +Field format follows ICAO Annex 15 Appendix 6 and EUROCONTROL OPADD. Timestamps use `YYMMDDHHmm` format (not ISO 8601 — ICAO Annex 15 §5.1.2). `(B)` = `p10 − 30 min`; `(C)` = `p90 + 30 min` (see mapping table below). + +``` +NOTAM DRAFT — FOR REVIEW AND AUTHORISED ISSUANCE ONLY +══════════════════════════════════════════════════════ +Generated by SpaceCom v2.1 | Prediction ID: pred-44878-20260316-003 +Data source: USSPACECOM TIP #3 + SpaceCom decay prediction +⚠ This is a DRAFT only. Must be reviewed and issued by authorised NOTAM office. + +Q) YMMM/QWELW/IV/NBO/AE/000/999/2200S13300E999 +A) YMMM +B) 2603161330 +C) 2603162230 +E) UNCONTROLLED SPACE OBJECT RE-ENTRY. OBJECT: CZ-5B ROCKET BODY + NORAD ID 44878. PREDICTED RE-ENTRY WINDOW 1400-2200 UTC 16 MAR + 2026. NOMINAL RE-ENTRY POINT APRX 22S 133E. 95TH PERCENTILE + CORRIDOR 18S 115E TO 28S 155E. DEBRIS SURVIVAL PSB. AIRSPACE + WITHIN CORRIDOR MAY BE AFFECTED ALL LEVELS DURING WINDOW. + REF SPACECOM PRED-44878-20260316-003. 
F) SFC
G) UNL
```

**NOTAM field mapping (ICAO Annex 15 Appendix 6):**

| NOTAM field | SpaceCom data source | Format rule |
|---|---|---|
| `(Q)` Q-line | FIR ICAO designator + NOTAM code `QWELW` (re-entry warning) | Generated from `airspace.icao_designator`; subject code `WE` (airspace warning), condition `LW` (laser/space) |
| `(A)` FIR | `airspace.icao_designator` for each intersecting FIR | One NOTAM per FIR; multi-FIR events generate multiple drafts |
| `(B)` Valid from | `prediction.p10_reentry_time − 30 minutes` | `YYMMDDHHmm` (UTC); example: `2603161330` |
| `(C)` Valid to | `prediction.p90_reentry_time + 30 minutes` | `YYMMDDHHmm` (UTC) |
| `(D)` Schedule | Omitted (continuous) | Do not include `(D)` field for continuous validity |
| `(E)` Description | Templated from sanitised object name, NORAD ID, p50 time, corridor bounds | `sanitise_icao()` applied; ICAO Doc 8400 abbreviations used (`PSB` not "possible", `APRX` not "approximately") |
| `(F)/(G)` Limits | `SFC` / `UNL` | Hardcoded for re-entry events; do not compute from corridor altitude |

**`(B)`/`(C)` field: re-entry window to NOTAM validity — time-critical cancellation:** The `(C)` validity time does not mean the hazard persists until then — it is the worst-case boundary. When re-entry is confirmed, the NOTAM cancellation draft must be initiated immediately. The Event Detail page surfaces a prominent `[Draft NOTAM Cancellation — RE-ENTRY CONFIRMED]` button at the moment the event status changes to `confirmed`, with a UI note: "Cancellation draft should be submitted to the NOTAM office without delay."

**Unit test:** Generate a draft for a prediction with `p10=2026-03-16T14:00Z`, `p90=2026-03-16T22:00Z`; assert the `(B)` field is `2603161330` and the `(C)` field is `2603162230`. Assert the Q-line matches regex `Q\) [A-Z]{4}/QWELW/IV/NBO/AE/\d{3}/\d{3}/\d{4}[NS]\d{5}[EW]\d{3}`.
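The `(B)`/`(C)` derivation and the unit test above map directly onto a small helper; a sketch (the helper names `to_icao_ts` and `notam_validity` are illustrative, not the actual module API):

```python
from datetime import datetime, timedelta, timezone


def to_icao_ts(dt: datetime) -> str:
    """Render a UTC datetime in ICAO Annex 15 YYMMDDHHmm format."""
    return dt.astimezone(timezone.utc).strftime("%y%m%d%H%M")


def notam_validity(p10: datetime, p90: datetime) -> tuple:
    """(B) = p10 - 30 min, (C) = p90 + 30 min, per the mapping table."""
    return (to_icao_ts(p10 - timedelta(minutes=30)),
            to_icao_ts(p90 + timedelta(minutes=30)))
```

For the test vector above, `notam_validity` returns `("2603161330", "2603162230")`.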
+ +**NOTAM cancellation draft:** When an event is closed (re-entry confirmed, object decayed), the Event Detail page offers `[Draft NOTAM Cancellation]` — generates a CANX NOTAM draft referencing the original. + +**Regulatory note displayed in the UI:** A persistent banner on the NOTAM draft page reads: *"This draft is generated for review purposes only. It must be reviewed for accuracy, formatted to local AIS standards, and issued by an authorised NOTAM originator. SpaceCom does not issue NOTAMs."* + +**NOTAM language and i18n exclusion (F6):** ICAO Annex 15 specifies that NOTAMs use ICAO standard phraseology in English (or the language of the state for domestic NOTAMs). NOTAM template strings are **never internationalised**: +- All NOTAM template strings are hardcoded ICAO English phraseology in `backend/app/modules/notam/templates.py` +- Each template string is annotated `# ICAO-FIXED: do not translate` +- The NOTAM draft is excluded from the `next-intl` message extraction tooling +- The NOTAM preview panel renders in a fixed-width monospace font to match traditional NOTAM format +- `lang="en"` attribute is set on the NOTAM text container regardless of the operator's UI locale + +The draft is stored in the `notam_drafts` table (see §9.2) for audit purposes. + +--- + +### 6.14 Shadow Mode (Phase 2) + +Shadow mode allows ANSPs to run SpaceCom in parallel with existing procedures during a trial period, without acting operationally on its outputs. This is the primary mechanism for building regulatory acceptance evidence. + +**Activation:** `admin` role only, per-organisation setting in `/admin`. + +**Visual treatment when shadow mode is active:** + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ⚗ SHADOW MODE — Predictions are not for operational use │ +│ All outputs are recorded for validation. No alerts are │ +│ delivered externally. Contact your administrator to disable. 
│ +└─────────────────────────────────────────────────────────────────┘ +``` + +- A persistent amber banner spans the top of every page +- The mode indicator pill shows `⚗ SHADOW` in amber +- All alert levels are demoted to INFORMATIONAL — no banners, no audio tones, no email delivery +- Prediction records have `shadow_mode = TRUE` in the database (see §9) +- Shadow predictions are excluded from all operational views but accessible in `/analysis` + +**Validation reporting:** After each real re-entry event, Persona B can generate a Shadow Validation Report comparing SpaceCom shadow predictions against the actual observed re-entry time/location. These reports form the evidence base for regulatory adoption. + +**Shadow Mode Exit Criteria (regulatory hand-off specification — Finding 6):** + +Shadow mode is a formal regulatory activity, not a product trial. Exit to operational use requires: + +| Criterion | Requirement | +|---|---| +| Minimum shadow period | 90 days, or covering ≥ 3 re-entry events above the CRITICAL alert threshold, whichever is longer | +| Prediction accuracy | `corridor_contains_observed ≥ 90%` across shadow period events (from `prediction_outcomes`) | +| False positive rate | `fir_false_positive_rate ≤ 20%` — no more than 1 in 5 corridor-intersecting FIR alerts is a false alarm | +| False negative rate | `fir_false_negative = 0` during the shadow period — no re-entry event missed entirely | +| Exit document | `shadow-mode-exit-report-{org_id}-{date}.pdf` generated from `prediction_outcomes`; contains automated statistics + ANSP Safety Department sign-off field | +| Regulatory hand-off | Written confirmation from the ANSP's Accountable Manager or Head of ATM Safety that their internal Safety Case / Tool Acceptance process is complete | +| System state | `shadow_mode_cleared = TRUE` is set by SpaceCom `admin` only after receipt of the written ANSP confirmation | + +The exit report template lives at `docs/templates/shadow-mode-exit-report.md`. 
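The quantitative rows of the exit criteria table reduce to mechanical checks over `prediction_outcomes`; a minimal sketch, assuming per-event boolean fields (the field and function names here are illustrative, not the actual schema):

```python
def shadow_exit_ready(outcomes, shadow_days, critical_events):
    """Evaluate the quantitative shadow-mode exit criteria.

    Each row in `outcomes` is assumed to carry booleans
    `corridor_contains_observed`, `fir_alerted`, `fir_affected`.
    """
    # "Whichever is longer": both the 90-day floor and the
    # 3-CRITICAL-event floor must be satisfied.
    if shadow_days < 90 or critical_events < 3:
        return False
    if not outcomes:
        return False
    n = len(outcomes)
    contained = sum(o["corridor_contains_observed"] for o in outcomes)
    alerted = sum(o["fir_alerted"] for o in outcomes)
    false_pos = sum(o["fir_alerted"] and not o["fir_affected"] for o in outcomes)
    false_neg = sum(o["fir_affected"] and not o["fir_alerted"] for o in outcomes)
    return (contained / n >= 0.90                               # accuracy
            and (alerted == 0 or false_pos / alerted <= 0.20)   # FP rate
            and false_neg == 0)                                 # no misses
```

This covers only the automated statistics; the sign-off and written-confirmation rows of the table remain human steps.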
Persona B generates the statistics from the admin analysis panel; the ANSP prints, signs, and returns the PDF. No software system can substitute for the ANSP's internal Safety Department sign-off. + +**Commercial trial-to-operational conversion (Finding 5):** + +A successful shadow exit automatically generates a commercial offer. The admin panel transitions the organisation's `subscription_status` from `'shadow_trial'` to `'offered'` and Persona D receives a task notification. The offer package includes: +- Commercial offer document (generated from `docs/templates/commercial-offer-ansp.md`): tier, pricing, SLA schedule, DPA status +- MSA execution path: ANSPs that accept the offer sign the MSA; no separate negotiation required for the standard ANSP Operational tier +- Onboarding checklist: `docs/onboarding/ansp-onboarding-checklist.md` + +If an ANSP does not convert within 30 days of receiving the offer, `subscription_status` moves to `'offered_lapsed'` and Persona D is notified. The admin panel shows conversion pipeline status for all ANSP organisations. Maximum concurrent ANSP shadow deployments in Phase 2: **2** (resource constraint — each requires a dedicated SpaceCom integration lead for the 90-day shadow period). + +--- + +### 6.15 Space Operator Portal UX (Phase 2) + +The Space Operator Portal (`/space`) is the second front door. It serves Persona E and F with a technically dense interface — different visual language from the aviation-facing portal. 
+ +**Space Operator Overview (`/space`):** + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ SpaceCom · Space Portal [API] [Export] [Persona E: ORBCO] │ +├─────────────────────┬───────────────────────────────────────────┤ +│ │ MY OBJECTS (3) │ +│ 3D GLOBE │ ┌────────────────────────────────────┐ │ +│ (owned objects │ │ CZ-5B R/B 44878 │ │ +│ only, with │ │ Perigee: 178 km ↓ Decaying fast │ │ +│ full orbital │ │ Re-entry: 16 Mar ± 8h │ │ +│ tracks and │ │ [Predict] [Plan deorbit] [Export] │ │ +│ decay vectors) │ ├────────────────────────────────────┤ │ +│ │ │ SL-16 R/B 28900 │ │ +│ │ │ Perigee: 312 km ~ Stable │ │ +│ │ │ [Predict] [Export] │ │ +│ │ └────────────────────────────────────┘ │ +│ │ CONJUNCTION ALERTS (MY OBJECTS) │ +│ │ No active conjunctions > Pc 1:10000 │ +├─────────────────────┴───────────────────────────────────────────┤ +│ API USAGE Requests today: 143 / 1000 [Manage keys →] │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Controlled Re-entry Planner (`/space/reentry/plan`):** + +Available for objects with remaining manoeuvre capability (flagged in `owned_objects.has_propulsion`). 
+ +``` +CONTROLLED RE-ENTRY PLANNER — CZ-5B R/B (44878) +───────────────────────────────────────────────────────────────── +Delta-V budget: [▓▓▓░░░░░] 12.4 m/s remaining + +Target re-entry window: [2026-03-20 ▾] to [2026-03-22 ▾] +Avoid FIRs: [☑ YMMM] [☑ YSSY] [☑ Populated land] +Preferred landing: ● Ocean ○ Specific zone + +CANDIDATE WINDOWS +────────────────────────────────────────────────────────────────── + #1 2026-03-21 03:14 UTC ΔV: 8.2 m/s Risk: ● LOW + Landing: South Pacific FIR: NZZO (ocean) + [Select] [View corridor] + + #2 2026-03-21 09:47 UTC ΔV: 11.1 m/s Risk: ● LOW + Landing: Indian Ocean FIR: FJDG (ocean) + [Select] [View corridor] + + #3 2026-03-21 15:30 UTC ΔV: 9.8 m/s Risk: ▲ MEDIUM + Landing: 22S 133E FIR: YMMM (land) + [Select] [View corridor] +────────────────────────────────────────────────────────────────── +[Export manoeuvre plan (CCSDS)] [Generate operator report] +``` + +The planner outputs are suitable for submission to national space regulators as evidence of responsible end-of-life management under the ESA Zero Debris Charter and national space law requirements. + +**Zero Debris Charter compliance output format (Finding 2):** + +The planner produces a `controlled-reentry-compliance-report-{norad_id}-{date}.pdf` containing: +- Ranked deorbit window analysis (delta-V budget, window start/end, corridor risk score per window) +- FIR avoidance corridors for each candidate window +- Probability of casualty on the ground (Pc_ground) computed using NASA Debris Assessment Software methodology (1-in-10,000 IADC casualty threshold; documented in model card) +- Comparison table: each candidate window vs. the 1:10,000 Pc_ground threshold; compliant windows flagged green +- Zero Debris Charter alignment statement (auto-generated from object disposition) + +Machine-readable companion: `application/vnd.spacecom.reentry-compliance+json` — returned alongside the PDF download URL as `compliance_report_url` in the planning job result. 
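The per-window threshold comparison in the compliance report can be sketched as follows; only the 1-in-10,000 IADC threshold is taken from the list above, while the window dict shape and function name are illustrative assumptions:

```python
CASUALTY_THRESHOLD = 1e-4  # IADC 1-in-10,000 probability-of-casualty limit


def flag_compliant_windows(windows):
    """Annotate candidate deorbit windows with compliance status.

    A window is flagged compliant (green in the report) when its
    computed ground casualty probability is below the IADC threshold.
    Windows are returned cheapest-first by delta-V.
    """
    return [
        {**w, "compliant": w["pc_ground"] < CASUALTY_THRESHOLD}
        for w in sorted(windows, key=lambda w: w["delta_v_ms"])
    ]
```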
Format documented in `docs/api-guide/compliance-export.md`.

The Pc_ground calculation uses the fragment survivability model (§15.3 material class lookup) and the ESA DRAMA casualty area methodology. When `objects.material_class IS NULL`, the model falls back to the conservative all-fragments-survive assumption, which yields a higher Pc_ground — a deliberate incentive for operators to provide accurate physical data.

The ECCN classification review (already in the §21 Phase 2 DoD) must be resolved before this output is shared with non-US entities.

---

### 6.16 Accessibility Requirements

- **WCAG 2.1 Level AA compliance** — required for government and aviation authority procurement
- Colour-blind-safe palette throughout; urgency uses shape + colour, never colour alone
- High-contrast mode available in user settings (WCAG AAA scheme)
- Dark mode as a first-class theme (not an afterthought)
- All interactive elements keyboard-accessible; tab order logical
- Alerts announced via `aria-live="assertive"` (CRITICAL) and `aria-live="polite"` (MEDIUM/LOW)
- Globe canvas has an `aria-label` describing the current view context
- Minimum touch target size 44×44 px
- Tested at 1080p (ops room), 1440p (analyst workstation), 1024×768 (tablet minimum)
- Automated axe-core audit via `@axe-core/playwright` run on the 5 core pages on every PR; 0 critical, 0 serious violations required to merge; known acceptable third-party violations (e.g., CesiumJS canvas contrast) recorded in `tests/e2e/axe-exclusions.json` with a justification comment — not silently suppressed.
Implementation: + ```typescript + // tests/e2e/accessibility.spec.ts + import AxeBuilder from '@axe-core/playwright'; + for (const [name, path] of [ + ['operational-overview', '/'], ['event-detail', '/events/seed-event'], + ['notam-draft', '/notam/draft/seed-draft'], ['space-portal', '/space/objects'], + ['settings', '/settings'], + ]) { + test(`${name} — WCAG 2.1 AA`, async ({ page }) => { + await page.goto(path); + const results = await new AxeBuilder({ page }) + .withTags(['wcag2a', 'wcag2aa']) + .exclude(loadAxeExclusions()) // loads axe-exclusions.json + .analyze(); + expect(results.violations).toEqual([]); + }); + } + ``` + +--- + +### 6.17 Multi-ANSP Coordination Panel (Phase 2) + +When an event's predicted corridor intersects FIRs belonging to more than one registered organisation, an additional panel appears on the Event Detail page. This panel provides shared situational awareness across ANSPs without replacing voice coordination. + +``` +MULTI-ANSP COORDINATION +────────────────────────────────────────────────────────────── +FIRs affected by this event: + YMMM Airservices Australia — ✓ Acknowledged 09:14 UTC J. 
Smith + NZZC Airways NZ — ○ Not yet acknowledged + +Last activity: + 09:22 UTC YMMM — "Watching brief established, coordinating with FMP" +────────────────────────────────────────────────────────────── +[Log coordination note] +``` + +Rules: +- Each ANSP sees the acknowledgement status and latest coordination note from all other ANSPs on the event; they do not see each other's internal alert state +- Coordination notes are free text, appended to `alert_events` (append-only, auditable), with organisation name, user name, and UTC timestamp +- The panel is read-only for organisations that have not yet acknowledged; they can acknowledge and then log notes +- Visibility is scoped: organisations only see the panel for events that intersect their registered FIRs — they do not see coordination panels for unrelated events from other orgs + +This does not replace voice or direct coordination — it creates a shared digital record that both ANSPs can reference. The panel carries a permanent banner: *"This coordination panel is for shared situational awareness only. It does not replace formal ATS coordination procedures or voice coordination."* + +**Authority and precedence (Finding 5):** The panel has no command authority. If two ANSPs log conflicting assessments, neither supersedes the other in SpaceCom — the system records both. The authoritative coordination outcome is always the result of direct ATS coordination outside the system. SpaceCom coordination notes are supporting evidence, not operational decisions. + +**WebSocket latency for coordination updates:** Coordination note updates must be visible to all parties within 2 seconds of posting (p99). This is specified as a performance SLA for the coordination panel WebSocket channel (distinct from the 5-second SLA for alert events). Latency > 2 seconds means an ANSP may have acted on a stale picture during a fast-moving event. 
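The append-only property of coordination notes can be sketched with a minimal in-memory model (illustrative only; the real store is the append-only `alert_events` record described above, and names here are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CoordinationNote:
    organisation: str
    user: str
    text: str
    logged_at: datetime


class CoordinationLog:
    """Append-only log: notes can be added and read, never edited or deleted."""

    def __init__(self):
        self._notes = []

    def append(self, organisation, user, text):
        note = CoordinationNote(organisation, user, text,
                                datetime.now(timezone.utc))
        self._notes.append(note)
        return note

    def all(self):
        return tuple(self._notes)  # immutable view for the panel
```

The frozen dataclass and tuple view mirror the audit requirement: every note keeps its organisation, user, and UTC timestamp, and nothing exposed to callers is mutable.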
+ +**Data retention for coordination records (ICAO Annex 11 §2.26):** Coordination notes are safety records. Minimum retention: 5 years in append-only storage. The `coordination_notes` table (stored append-only in `alert_events.coordination_notes JSONB[]` or as a separate table) is included in the safety record retention category (§27.4) and excluded from standard data drop policies. + +--- + +### 6.18 First-Time User Onboarding State (Phase 1) + +When a new organisation has no configured FIRs and no active events, the globe is empty. An empty globe is indistinguishable from "the system isn't working" for first-time users. An onboarding state prevents this misinterpretation. + +**Trigger:** Organisation has `fir_list IS NULL OR fir_list = '{}'` at login. + +**Display:** Three setup cards replace the Active Events panel: + +``` +WELCOME TO SPACECOM +────────────────────────────────────────────────────────────── +To see relevant events and receive alerts, complete setup: + + 1. Configure your FIR watch list + Determines which re-entry events you see and which + alerts you receive. [Configure →] + + 2. Set alert delivery preferences + Email, WebSocket, or webhook for CRITICAL alerts. + [Configure →] + + 3. Optional: Enable Shadow Mode for a trial period + Run SpaceCom in parallel with existing procedures — + outputs are not for operational use until disabled. + [Configure →] + +────────────────────────────────────────────────────────────── +``` + +Cards disappear permanently once step 1 (FIR list) is complete. Steps 2 and 3 remain accessible from `/admin` at any time. The setup cards are not a modal — they appear inline and the user can still access all navigation. + +--- + +### 6.19 Degraded Mode UI Guidance (Phase 1) + +The `StalenessWarningBanner` (triggered by `/readyz` returning 207) must include an operational guidance line keyed to the specific type of data degradation, not just a generic "data may be stale" message. 
Persona A's question in degraded mode is not "is the data stale?" — it is "can I use this for an operational decision right now?" + +| Degradation type | Banner operational guidance | +|-----------------|----------------------------| +| Space weather data stale > 3h | *"Uncertainty estimates may be wider than shown. Treat all corridors as potentially broader than the 95th percentile boundary."* | +| TLE data stale > 24h | *"Object position data is more than 24 hours old. Do not use for precision airspace decisions without independent position verification."* | +| Active prediction older than 6h without refresh | *"This prediction reflects conditions from [timestamp]. A fresh prediction run is recommended before operational use. [Trigger refresh →]"* | +| IERS EOP data stale > 7 days | *"Coordinate frame transformations may have minor errors. Technical assessments only — do not use for precision airspace boundary work."* | + +Banner behaviour: +- The banner type is set by the backend via the `/readyz` response body (`degradation_type` enum) +- Each degradation type has its own banner message — not a generic "degraded" label +- The banner persists until the degradation is resolved; it cannot be dismissed by the user +- When multiple degradations are active, show the highest-impact degradation first, with a `(+N more)` expand link + +--- + +### 6.20 Secondary Display Mode (Phase 2) + +An ops room secondary monitor display mode — strips all navigation chrome and presents only the operational picture on a full-screen secondary display alongside existing ATC tools. + +**Activation:** `[Secondary Display]` link in the user menu, or URL parameter `?display=secondary`. Opens in a new window or full-screen. + +**Layout:** Full-screen globe on the left (~70% width), vertical event list on the right (~30% width). No top navigation, no admin links, no simulation controls. No sidebar panels. The LIVE/SHADOW/SIMULATION mode indicator remains visible (always). 
CRITICAL alert banners still appear.

**Design principle:** This is a CSS-level change — hide navigation and chrome elements, maximise the operational data density. No new data is added; no existing data is removed.

---

## 7. Security Architecture

**This section is as non-negotiable as §4.** Security must be built in from Week 1, not audited at Phase 3. The primary security risk in an aviation safety system is not data exfiltration — it is data corruption that produces plausible but wrong outputs that are acted upon operationally. A false all-clear for a genuine re-entry threat is the highest-consequence attack against this system's mission.

### 7.1 Threat Model (STRIDE)

Key trust boundaries and their principal threats:

| Boundary | Spoofing | Tampering | Repudiation | Info Disclosure | DoS | Elevation |
|----------|----------|-----------|-------------|-----------------|-----|-----------|
| Browser → API | JWT forgery | Request injection | Unlogged mutations | Token leak via XSS | Auth endpoint flood | RBAC bypass |
| API → DB | Credential leak | SQL injection | No audit trail | Column over-fetch | N+1 queries | RLS bypass |
| Ingest → External feeds | DNS/BGP hijack → wrong TLE | Man-in-the-middle alters F10.7 | — | Credential interception | Feed DoS | — |
| Celery worker → DB | Compromised worker | Corrupt sim output written to DB | Unlogged task | Param leak in logs | Runaway MC task | Worker → backend pivot |
| Playwright renderer → backend | — | User content → XSS → SSRF | — | Local file read | Hang/timeout | RCE via browser exploit |
| Redis | — | Cache poisoning | — | Token interception | Queue flood | — |

Mitigations for each threat are specified in the sections below.

---

### 7.2 Role-Based Access Control (RBAC)

Seven roles map onto the platform personas (A–F plus read-only external stakeholders). Every API endpoint enforces the minimum required role via a FastAPI dependency.
+ +| Role | Assigned To | Permissions | +|------|------------|------------| +| `viewer` | Read-only external stakeholders | View objects, predictions, corridors; read-only globe (aviation domain) | +| `analyst` | Persona B | viewer + submit simulations, generate reports, access historical data, shadow validation reports | +| `operator` | Persona A, C | analyst + acknowledge alerts, issue advisories, draft NOTAMs, access operational tools | +| `org_admin` | Organisation administrator | operator + invite/remove users within their own org; assign roles up to `operator` within own org; view own org's audit log; manage own org's API keys; update own org's billing contact; cannot access other orgs' data; cannot assign `admin` or `org_admin` without system admin approval | +| `admin` | Persona D (system-wide) | Full access: user management across all orgs, ingest configuration, model version deployment, shadow mode toggle, subscription management | +| `space_operator` | Persona E | Object-scoped access (owned objects only via `owned_objects` table); decay predictions and controlled re-entry planning for own objects; conjunction alerts for own objects; API key management; CCSDS export; no access to other organisations' simulation data | +| `orbital_analyst` | Persona F | Full catalog read; conjunction screening across any object pair; simulation submission; bulk export (CSV, JSON, CCSDS); raw state vector and covariance access; API key management; no alert acknowledgement | + +**Object ownership scoping for `space_operator`:** The `owned_objects` table maps operators to their registered NORAD IDs. 
All queries from a `space_operator` user are automatically scoped to their owned object list — enforced by a PostgreSQL RLS policy on the `owned_objects` join, not only at the application layer: + +```sql +-- space_operator users see only their owned objects in catalog queries +CREATE POLICY objects_owner_scope ON objects + USING ( + current_setting('app.current_role') != 'space_operator' + OR id IN ( + SELECT object_id FROM owned_objects + WHERE organisation_id = current_setting('app.current_org_id')::INTEGER + ) + ); +``` + +**Multi-tenancy:** If multiple organisations use the system, every table that contains organisation-specific data (`simulations`, `reports`, `alert_events`, `hazard_zones`) must include an `organisation_id` column. PostgreSQL Row-Level Security (RLS) policies enforce the boundary at the database layer — not only at the application layer: + +```sql +ALTER TABLE simulations ENABLE ROW LEVEL SECURITY; +CREATE POLICY simulations_org_isolation ON simulations + USING (organisation_id = current_setting('app.current_org_id')::INTEGER); +``` + +The application sets `app.current_org_id` at the start of every database session from the authenticated user's JWT claims. + +**Comprehensive RLS policy coverage (F1):** The `simulations` example above is the template. Every table that carries `organisation_id` must have RLS enabled and an isolation policy applied. 
The full set: + +| Table | RLS policy | Notes | +|-------|-----------|-------| +| `simulations` | `organisation_id = current_org_id` | | +| `reentry_predictions` | `organisation_id = current_org_id` | shadow policy layered separately | +| `alert_events` | `organisation_id = current_org_id` | append-only; no UPDATE/DELETE anyway | +| `hazard_zones` | `organisation_id = current_org_id` | | +| `reports` | `organisation_id = current_org_id` | | +| `api_keys` | `organisation_id = current_org_id` | admins bypass to revoke any key | +| `usage_events` | `organisation_id = current_org_id` | billing metering records | +| `objects` | `organisation_id IS NULL OR organisation_id = current_org_id` | NULL = catalog-wide; org-specific = owned objects only | + +**RLS bypass for system-level tasks:** Celery workers and internal admin processes run under a dedicated database role (`spacecom_worker`) that bypasses RLS (`BYPASSRLS`). This role is never used by the API request path. Integration test (BLOCKING): establish two orgs with data; issue a query as Org A's session; assert zero Org B rows returned. This test runs in CI against a real database (not mocked). + +**Shadow mode segregation — database-layer enforcement (Finding 9):** + +Shadow predictions must be excluded from operational API responses at the RLS layer, not only via application `WHERE` clauses. A backend query bug or misconfigured join must not expose shadow records to `viewer`/`operator` sessions — that would be a regulatory incident. + +```sql +ALTER TABLE reentry_predictions ENABLE ROW LEVEL SECURITY; + +-- Non-admin sessions never see shadow records unless the session flag is set +CREATE POLICY shadow_segregation ON reentry_predictions + USING ( + shadow_mode = FALSE + OR current_setting('spacecom.include_shadow', TRUE) = 'true' + ); +``` + +The `spacecom.include_shadow` session variable is set to `'true'` only by the backend's shadow-admin code path, which requires `admin` role and explicit shadow-mode context. 
Regular backend sessions never set this variable. Integration test: query `reentry_predictions` as `viewer` role with no `WHERE shadow_mode` clause; verify zero shadow rows returned.

**Four-eyes principle for admin role elevation (Finding 6):**

A single compromised admin account must not be able to silently elevate a backdoor account. Elevation to `admin` requires a second admin to approve within 30 minutes.

```sql
CREATE TABLE pending_role_changes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    target_user_id INTEGER NOT NULL REFERENCES users(id),
    requested_role TEXT NOT NULL,
    requested_by INTEGER NOT NULL REFERENCES users(id),
    approval_token_hash TEXT NOT NULL, -- SHA-256 of emailed token
    expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '30 minutes',
    approved_by INTEGER REFERENCES users(id),
    approved_at TIMESTAMPTZ,
    rejected_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    CHECK (approved_by IS DISTINCT FROM requested_by) -- no self-approval
);
```

Workflow:
1. `PATCH /admin/users/{id}/role` with `role=admin` creates a `pending_role_changes` row and triggers an email to all other active admins containing a single-use approval token
2. `POST /admin/role-changes/{change_id}/approve?token=` — any other admin can approve; completing the role change is atomic
3. Rows past `expires_at` are auto-rejected by a nightly job and logged as `ROLE_CHANGE_EXPIRED`
4. All outcomes (`ROLE_CHANGE_APPROVED`, `ROLE_CHANGE_REJECTED`, `ROLE_CHANGE_EXPIRED`) are logged to `security_logs` as HIGH severity
5.
The requesting admin cannot approve their own pending change (enforced by `approved_by != requested_by` constraint) + +**RBAC enforcement pattern (FastAPI):** + +```python +def require_role(*roles: str): + def dependency(current_user: User = Depends(get_current_user)): + if current_user.role not in roles: + log_auth_failure(current_user, roles) + raise HTTPException(status_code=403, detail="Insufficient permissions") + return current_user + return dependency + +# Applied per router group — not per individual endpoint where it is easy to miss +router = APIRouter(dependencies=[Depends(require_role("operator", "admin"))]) +``` + +--- + +### 7.3 Authentication + +#### JWT Implementation + +- **Algorithm:** `RS256` (asymmetric). Never `HS256` with a shared secret. Never `none`. +- **Key storage:** RSA private signing key stored in Docker secrets / secrets manager (see §7.5). Never in an environment variable or `.env` file. +- **Token storage in browser:** `httpOnly`, `Secure`, `SameSite=Strict` cookies only. Never `localStorage` (vulnerable to XSS). Never query parameters (appear in server logs). +- **Access token lifetime:** 15 minutes. +- **Refresh token lifetime:** 24 hours for `operator`/`analyst`; 8 hours for `admin`. +- **Refresh token rotation with family reuse detection (Finding 5):** Invalidate the old token on every refresh. Tokens belong to a `family_id` (UUID assigned at first issuance). If a token from a superseded generation within a family is presented — i.e. it was already rotated and a newer token in the same family exists — the entire family is immediately revoked, logged as `REFRESH_TOKEN_REUSE` (HIGH severity), and an email alert is sent to the user ("Suspicious login detected — all sessions revoked"). This detects refresh token theft: the legitimate user retries after the attacker consumed the token first, causing the reuse to surface. 
The `refresh_tokens` table includes `family_id UUID NOT NULL` and `superseded_at TIMESTAMPTZ` (set when a new token replaces this one in rotation). +- **Refresh token storage:** `refresh_tokens` table in the database (see §9.2). This enables server-side revocation — Redis-only storage loses revocations on restart. + +#### Multi-Factor Authentication (MFA) + +TOTP-based MFA (RFC 6238) is required for all roles from Phase 1. Implementation: + +- On first login after account creation, user is presented with TOTP QR code (via `pyotp`) and required to verify before completing registration +- Recovery codes (8 × 10-character alphanumeric) generated at setup; stored as bcrypt hashes in `users.mfa_recovery_codes` +- MFA bypass via recovery code is logged as a security event (MEDIUM alert to admins) +- MFA is enforced at the JWT issuance step — tokens are not issued until MFA is verified +- Failed MFA attempts after 5 consecutive failures trigger a 30-minute account lockout and a MEDIUM alert + +#### SSO / Identity Provider Abstraction + +"Integrate with SkyNav SSO later" cannot remain a deferred decision. The auth layer must be designed as a pluggable provider from the start: + +```python +class AuthProvider(Protocol): + async def authenticate(self, credentials: Credentials) -> User: ... + async def issue_tokens(self, user: User) -> TokenPair: ... + async def revoke(self, refresh_token: str) -> None: ... + +class LocalJWTProvider(AuthProvider): ... # Phase 1: local JWT + TOTP +class OIDCProvider(AuthProvider): ... # Phase 3: OIDC/SAML SSO +``` + +All endpoint logic depends on `AuthProvider` — switching from local JWT to OIDC requires no endpoint changes. + +--- + +### 7.4 API Security + +#### Rate Limiting + +Implemented with `slowapi` (Redis token bucket). 
Limits are per-user for authenticated endpoints, per-IP for auth endpoints:

| Endpoint | Limit | Window |
|----------|-------|--------|
| `POST /token` (login) | 10 per IP | 1 minute; exponential backoff after 5 failures |
| `POST /token/refresh` | 30 per user | 1 hour |
| `POST /decay/predict` | 10 per user | 1 hour |
| `POST /conjunctions/screen` | 5 per user | 1 hour |
| `POST /reports` | 20 per user | 1 day |
| `WS /ws/events` connection attempts | 10 per user | 1 minute |
| General authenticated read endpoints | 300 per user | 1 minute |
| General unauthenticated (if any) | 60 per IP | 1 minute |

Rate limit headers returned on every response: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.

#### Simulation Parameter Validation

All physical parameters must be validated against their physically meaningful ranges before a simulation job is accepted. Type validation alone is insufficient — NRLMSISE-00 will silently produce garbage for out-of-range inputs without raising an error:

```python
class DecayPredictParams(BaseModel):
    f107: float = Field(..., ge=65.0, le=300.0,
                        description="F10.7 solar flux (sfu). Physically valid: 65–300.")
    ap: float = Field(..., ge=0.0, le=400.0,
                      description="Geomagnetic Ap index. Valid: 0–400.")
    mc_samples: int = Field(..., ge=10,
                            description="Monte Carlo sample count. Values above 1000 are capped server-side.")
    bstar_uncertainty_pct: float = Field(..., ge=0.0, le=50.0)

    @validator('mc_samples')
    def cap_mc_samples(cls, v):
        return min(v, 1000)  # Server-side cap regardless of submitted value
```

#### Server-Side Request Forgery (SSRF) Mitigation

The Ingest module fetches from five external sources.
These URLs must be:
- **Hardcoded constants** in `ingest/sources.py` — never loaded from user input, API parameters, or database values
- **Fetched via an HTTP client configured with an allowlist** of expected IP ranges per source; connections to private IP ranges (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `169.254.0.0/16`, `::1`, `fc00::/7`) are blocked at the HTTP client layer

```python
ALLOWED_HOSTS = {
    "www.space-track.org": ["18.0.0.0/8"],  # approximate; update with actual ranges
    "celestrak.org": [...],
    "swpc.noaa.gov": [...],
    "discosweb.esoc.esa.int": [...],
    "maia.usno.navy.mil": [...],
}
```

#### XSS and CZML Injection

Object names and descriptions sourced from Space-Track are interpolated into CZML documents and ultimately rendered in CesiumJS. All object name/description fields are HTML-escaped before CZML output.

**Phase 2:**
- [ ] CZML injection test: a malicious object name containing `<script>` produces a sanitised NOTAM draft and does not raise (Week 17–18, with NOTAM drafting feature)
- [ ] Shadow mode RLS integration test: query `reentry_predictions` as `viewer` role with no WHERE clause; assert zero shadow rows returned
- [ ] Refresh token family reuse detection integration test: simulate attacker consuming a rotated token; assert entire family revoked + `REFRESH_TOKEN_REUSE` logged
- [ ] RLS policies reviewed and integration-tested for multi-tenancy boundary

**Phase 3:**
- [ ] External penetration test by a qualified third party — scope must include: API auth bypass, privilege escalation, SSRF via ingest, XSS → Playwright escalation, WebSocket auth bypass, data integrity attacks on predictions, Redis/MinIO lateral movement
- [ ] All Critical and High penetration test findings remediated before production go-live
- [ ] SOC 2 Type I readiness review (if required by customer contracts)
- [ ] **Acceptance Test Procedure (ATP) defined and run (Finding 10):** `docs/bid/acceptance-test-procedure.md` exists with test script structured as: test ID, requirement reference, preconditions, steps, expected result, pass/fail criteria.
ATP is runnable by a non-SpaceCom operator (evaluator) using documented environment setup. ATP covers: physics accuracy (§17 validation), NOTAM format (Q-line regex test), alert delivery latency (synthetic TIP → measure delivery time), HMAC integrity (tampered record → 503), multi-tenancy boundary (Org A cannot access Org B data). ATP seed data committed at `docs/bid/atp-seed-data/`. ATP successfully run by an independent evaluator on the staging environment before any institutional procurement submission. +- [ ] **Competitive differentiation review completed:** `docs/competitive-analysis.md` updated; any competitor capability that closed a differentiation gap has been assessed and a product response documented +- [ ] Security runbook: incident response procedure for each CRITICAL threat scenario + +--- + +### 7.16 Aviation Safety Integrity — Operational Scenarios + +**Scenario 1 — False all-clear attack:** + +An attacker who modifies `reentry_predictions` records to suppress a genuine hazard corridor could cause an airspace manager to conclude a FIR is safe when it is not. + +Mitigations layered in depth: +1. HMAC signing on every prediction record (§7.9) — modification is immediately detected +2. Immutability DB trigger (§7.9) — modifications fail at the database layer +3. TIP message cross-check: a prediction showing no hazard for an object with an active TIP message triggers a CRITICAL integrity alert regardless of the prediction's content +4. The UI displays HMAC status on every prediction — `✗ verification failed` is immediately visible to the operator + +**Scenario 2 — Alert storm attack:** + +An attacker flooding the alert system with false CRITICALs induces alert fatigue; operators disable alerts; a genuine event is missed. + +Mitigations: +1. Alert generation runs only from backend business logic on verified, HMAC-checked data — not from direct API calls +2. Rate limiting on CRITICAL alert generation per object per window (§6.6) +3. 
Alert storm detection: > 5 CRITICALs in 1 hour triggers a meta-alert to admins +4. Geographic filtering means alert volume per operator is naturally bounded to their region + +--- + +## 8. Functional Modules + +Each module is a Python package under `backend/modules/` with its own router, schemas, service layer, and (where applicable) Celery tasks. Modules communicate via internal function calls and the shared database — not HTTP between modules. + +### Phase 1 Modules + +| Module | Package | Purpose | +|--------|---------|---------| +| **Catalog** | `modules.catalog` | CRUD for space objects: NORAD ID, TLE sets, physical properties (from ESA DISCOS), B* drag term, radar cross-section. Source of truth for all tracked objects. | +| **Catalog Propagator** | `modules.propagator.catalog` | SGP4/SDP4 for general catalog tracking. Outputs GCRF state vectors and geodetic coordinates. Feeds the globe display. **Not used for decay prediction.** | +| **Decay Predictor** | `modules.propagator.decay` | Numerical integrator (RK7(8) adaptive step) with NRLMSISE-00 atmospheric density model, J2–J6 geopotential, and solar radiation pressure. Used for all re-entry window estimation. Monte Carlo uncertainty (vary F10.7 ±20%, Ap, B* ±10%). All outputs HMAC-signed on creation. Shadow mode flag propagated to all output records. | +| **Reentry** | `modules.reentry` | Phase 1 scope: re-entry window prediction (time ± uncertainty) and ground track corridor (percentile swaths). Phase 2 expands to full breakup/survivability. | +| **Space Weather** | `modules.spaceweather` | Ingests NOAA SWPC: F10.7, Ap/Kp, Dst, solar wind. Cross-validates against ESA Space Weather Service. Generates `operational_status` string. Drives Decay Predictor density models. | +| **Visualisation** | `modules.viz` | Generates CZML documents from ephemeris (J2000 Cartesian — explicit TEME→J2000 conversion), hazard zones, and debris corridors. Pre-bakes MC trajectory binary blobs for Mode C. 
All object name/description fields HTML-escaped before CZML output. | +| **Ingest** | `modules.ingest` | Background workers: Space-Track.org TLE polling, CelesTrak TLE polling, TIP message ingestion, ESA DISCOS physical property import, NOAA SWPC space weather polling, IERS EOP refresh. All external URLs are hardcoded constants; SSRF mitigation enforced at HTTP client layer. | +| **Public API** | `modules.api` | Versioned REST API (`/api/v1/`) as a first-class product for programmatic access by Persona E/F. Includes API key management (generation, rotation, revocation, usage tracking), CCSDS-format export endpoints, bulk ephemeris endpoints, and rate limiting per API key. API keys are separate credentials from the web session JWT and managed independently. | + +### Phase 2 Modules + +| Module | Package | Purpose | +|--------|---------|---------| +| **Atmospheric Breakup** | `modules.breakup` | ORSAT-like atmospheric re-entry breakup: aerothermal loading → structural failure → fragment generation → ballistic descent → ground impact with kinetic energy and casualty area. Produces fragment descriptors and uncertainty bounds for the sub-/trans-sonic descent layer. | +| **Conjunction** | `modules.conjunction` | All-vs-all conjunction screening: apogee/perigee filter → TCA refinement → collision probability (Alfano/Foster). Feeds `conjunctions` table. | +| **Upper Atmosphere** | `modules.weather.upper` | NRLMSISE-00 / JB2008 density model driven by space weather inputs. 80–600 km profiles for Decay Predictor and Atmospheric Breakup. | +| **Lower Atmosphere** | `modules.weather.lower` | GFS/ECMWF tropospheric wind and density profiles for 0–80 km terminal descent, including wind-sensitive dispersion inputs for fragment clouds after main breakup. | +| **Hazard** | `modules.hazard` | Fuses Decay Predictor + Atmospheric Breakup + atmosphere modules into hazard zones with uncertainty bounds. All output records HMAC-signed and immutable. 
Shadow mode flag preserved on all hazard zone records. | +| **Airspace** | `modules.airspace` | FIR/UIR boundaries, controlled airspace, routes. PostGIS hazard-airspace intersection. | +| **Air Risk** | `modules.air_risk` | Combines hazard outputs with air traffic density / ADS-B state, aircraft class assumptions, and vulnerability bands to generate time-sliced exposure scores and operator-facing air-risk products. Supports conservative-baseline comparison against blunt closure areas. | +| **On-Orbit Fragmentation** | `modules.fragmentation` | NASA Standard Breakup Model for on-orbit collision/explosion fragmentation. Separate from atmospheric breakup — different physics. | +| **Space Operator Portal** | `modules.space_portal` | The second front door. Owned object management (`owned_objects` table); object-scoped prediction views; CCSDS export; API key portal; controlled re-entry planner interface. Enforces `space_operator` RBAC object-ownership scoping. | +| **Controlled Re-entry Planner** | `modules.reentry.controlled` | For objects with remaining manoeuvre capability: given a delta-V budget and avoidance constraints (FIR exclusions, land avoidance, population density weighting), generates ranked candidate deorbit windows with corridor risk scores. Outputs suitable for national space law regulatory submissions and ESA Zero Debris Charter evidence. | +| **NOTAM Drafting** | `modules.notam` | Generates ICAO Annex 15 format NOTAM drafts from hazard corridor outputs. Produces cancellation drafts on event close. Stores all drafts in `notam_drafts` table. Displays mandatory regulatory disclaimer. Never submits NOTAMs — draft production only. | + +### Phase 3 Modules + +| Module | Package | Purpose | +|--------|---------|---------| +| **Reroute** | `modules.reroute` | Strategic pre-flight route intersection analysis only. Given a filed route, identifies which segments intersect the hazard corridor and outputs the geographic avoidance boundary. 
Does not generate specific alternate routes — avoidance boundary only, to keep SpaceCom in a purely informational role. | +| **Feedback** | `modules.feedback` | Prediction vs. observed outcome comparison. Atmospheric density scaling recalibration from historical re-entries. Maneuver detection (TLE-to-TLE ΔV estimation). Shadow validation reporting for ANSP regulatory adoption evidence. | +| **Alerts** | `modules.alerts` | WebSocket push + email notifications. Enforces alert rate limits and deduplication server-side. Stores all events in append-only `alert_events`. Shadow mode: all alerts suppressed to INFORMATIONAL; no external delivery. | +| **Launch Safety** | `modules.launch_safety` | Screen proposed launch trajectories against the live catalog for conjunction risk during ascent and parking orbit phases. Natural extension of the conjunction module. Serves launch operators as a third customer segment. | + +--- + +## 9. Data Model Evolution + +### 9.1 Retain and Expand from Existing Schema + +#### `objects` table + +```sql +ALTER TABLE objects ADD COLUMN IF NOT EXISTS + bstar DOUBLE PRECISION, -- SGP4 drag parameter (1/Earth-radii) + cd_a_over_m DOUBLE PRECISION, -- C_D * A / m (m²/kg); physical model + rcs_m2 DOUBLE PRECISION, -- Radar cross-section from Space-Track + rcs_size_class TEXT, -- SMALL | MEDIUM | LARGE + mass_kg DOUBLE PRECISION, + cross_section_m2 DOUBLE PRECISION, + material TEXT, + shape TEXT, + data_confidence TEXT DEFAULT 'unknown', -- 'discos' | 'estimated' | 'unknown' + object_type TEXT, -- PAYLOAD | ROCKET BODY | DEBRIS | UNKNOWN + launch_date DATE, + launch_site TEXT, + decay_date DATE, + organisation_id INTEGER REFERENCES organisations(id), -- multi-tenancy + -- Physics model parameters (Finding 3, 5, 7) + attitude_known BOOLEAN DEFAULT FALSE, -- FALSE = tumbling; affects A uncertainty sampling + material_class TEXT, -- 'aluminium'|'stainless_steel'|'titanium'|'carbon_composite'|'unknown' + cd_override DOUBLE PRECISION, -- operator-provided 
C_D override (space_operator only) + bstar_override DOUBLE PRECISION, -- operator-provided B* override (space_operator only) + cr_coefficient DOUBLE PRECISION DEFAULT 1.3 -- radiation pressure coefficient; 1.3 = standard non-cooperative +``` + +#### `orbits` table — full state vectors + +```sql +ALTER TABLE orbits ADD COLUMN IF NOT EXISTS + reference_frame TEXT DEFAULT 'GCRF', + pos_x_km DOUBLE PRECISION, + pos_y_km DOUBLE PRECISION, + pos_z_km DOUBLE PRECISION, + vel_x_kms DOUBLE PRECISION, + vel_y_kms DOUBLE PRECISION, + vel_z_kms DOUBLE PRECISION, + lat_deg DOUBLE PRECISION, + lon_deg DOUBLE PRECISION, + alt_km DOUBLE PRECISION, + speed_kms DOUBLE PRECISION, + -- RTN position covariance (upper triangle of 3×3) + cov_rr DOUBLE PRECISION, + cov_rt DOUBLE PRECISION, + cov_rn DOUBLE PRECISION, + cov_tt DOUBLE PRECISION, + cov_tn DOUBLE PRECISION, + cov_nn DOUBLE PRECISION, + propagator TEXT DEFAULT 'sgp4', + tle_epoch TIMESTAMPTZ +``` + +#### `conjunctions` table + +```sql +ALTER TABLE conjunctions ADD COLUMN IF NOT EXISTS + collision_probability DOUBLE PRECISION, + probability_method TEXT, + combined_radial_sigma_m DOUBLE PRECISION, + combined_transverse_sigma_m DOUBLE PRECISION, + combined_normal_sigma_m DOUBLE PRECISION +``` + +#### `reentry_predictions` table + +```sql +ALTER TABLE reentry_predictions ADD COLUMN IF NOT EXISTS + confidence_level DOUBLE PRECISION, + model_version TEXT, + propagator TEXT, + f107_assumed DOUBLE PRECISION, + ap_assumed DOUBLE PRECISION, + monte_carlo_n INTEGER, + ground_track_corridor GEOGRAPHY(POLYGON), -- GEOGRAPHY: global corridors may cross antimeridian + reentry_window_open TIMESTAMPTZ, + reentry_window_close TIMESTAMPTZ, + nominal_reentry_point GEOGRAPHY(POINT), -- GEOGRAPHY: global point + nominal_reentry_alt_km DOUBLE PRECISION DEFAULT 80.0, + p01_reentry_time TIMESTAMPTZ, -- 1st percentile — extreme early case; displayed as tail risk annotation (F10) + p05_reentry_time TIMESTAMPTZ, + p50_reentry_time TIMESTAMPTZ, + 
    p95_reentry_time TIMESTAMPTZ,
    p99_reentry_time TIMESTAMPTZ,            -- 99th percentile — extreme late case; displayed as tail risk annotation (F10)
    sigma_along_track_km DOUBLE PRECISION,
    sigma_cross_track_km DOUBLE PRECISION,
    organisation_id INTEGER REFERENCES organisations(id),
    record_hmac TEXT NOT NULL,               -- HMAC-SHA256 of canonical field set
    integrity_failed BOOLEAN DEFAULT FALSE,
    superseded_by INTEGER REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- write-once; RESTRICT prevents deleting a prediction that supersedes another (F10 — §67)
    ood_flag BOOLEAN DEFAULT FALSE,          -- TRUE if any input parameter falls outside the model's validated operating envelope
    ood_reason TEXT,                         -- comma-separated list of which parameters triggered OOD (e.g. "high_am_ratio,low_data_confidence")
    prediction_valid_until TIMESTAMPTZ,      -- computed at creation: p50_reentry_time - 4h; UI warns if NOW() > this and prediction is not superseded
    -- model_version (declared above) holds the semantic version of the decay predictor used; it must match the current deployed version or trigger a re-run prompt
    -- Multi-source conflict detection (Finding 10)
    prediction_conflict BOOLEAN DEFAULT FALSE, -- TRUE if SpaceCom window does not overlap TIP or ESA window
    conflict_sources TEXT[],                 -- e.g. ['space_track_tip', 'esa_esac']
    conflict_union_p10 TIMESTAMPTZ,          -- union of all non-overlapping windows: earliest bound
    conflict_union_p90 TIMESTAMPTZ           -- union of all non-overlapping windows: latest bound
```

`superseded_by` is write-once after creation: it can be set once by an `analyst` or above, but never changed once set. A DB constraint enforces this (trigger that raises if `superseded_by` is being changed from a non-NULL value). The UI displays a `⚠ Superseded — see [newer run]` banner on any prediction where `superseded_by IS NOT NULL`. This preserves the immutability guarantee (old records are never deleted) while giving analysts a mechanism to communicate "this is not the current operational view."
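The write-once rule can be sketched as a database trigger. The following self-contained illustration uses SQLite syntax (with a simplified two-column table) so it can run anywhere; the production version would be an equivalent PL/pgSQL `BEFORE UPDATE` trigger on the real `reentry_predictions` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reentry_predictions (
    id INTEGER PRIMARY KEY,
    superseded_by INTEGER REFERENCES reentry_predictions(id)
);

-- Write-once: allow NULL -> value, reject any change to a non-NULL superseded_by.
CREATE TRIGGER superseded_by_write_once
BEFORE UPDATE OF superseded_by ON reentry_predictions
WHEN OLD.superseded_by IS NOT NULL
  AND NEW.superseded_by IS NOT OLD.superseded_by
BEGIN
    SELECT RAISE(ABORT, 'superseded_by is write-once');
END;
""")

conn.execute("INSERT INTO reentry_predictions (id) VALUES (1), (2), (3)")
# First supersession: allowed (NULL -> 2)
conn.execute("UPDATE reentry_predictions SET superseded_by = 2 WHERE id = 1")
# Attempt to repoint the prediction at a different successor: rejected
try:
    conn.execute("UPDATE reentry_predictions SET superseded_by = 3 WHERE id = 1")
    raise AssertionError("trigger did not fire")
except sqlite3.IntegrityError as e:
    print(e)  # superseded_by is write-once
```

The `IS NOT` comparison also rejects clearing the value back to NULL, matching the "never changed once set" requirement.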
+ +The same `superseded_by` pattern applies to the `simulations` table (self-referential FK). + +**Immutability trigger** (see §7.9) applied to this table in the initial migration. + +### 9.2 New Tables + +```sql +-- Organisations (for multi-tenancy) +CREATE TABLE organisations ( + id SERIAL PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + -- Commercial tier (Finding 3, 5) + subscription_tier TEXT NOT NULL DEFAULT 'shadow_trial' + CHECK (subscription_tier IN ('shadow_trial','ansp_operational','space_operator','institutional','internal')), + subscription_status TEXT NOT NULL DEFAULT 'active' + CHECK (subscription_status IN ('active','offered','offered_lapsed','churned','suspended')), + subscription_started_at TIMESTAMPTZ, + subscription_expires_at TIMESTAMPTZ, + -- Shadow trial gate (F3 - §68): expiry normally auto-deactivates shadow mode, but enforcement is deferred while an active TIP / CRITICAL operational event exists + shadow_trial_expires_at TIMESTAMPTZ, -- NULL = no trial expiry (paid or internal); set on sandbox agreement signing + -- Resource quotas (F8 — §68): 0 = unlimited (paid tiers); >0 = monthly cap + monthly_mc_run_quota INTEGER NOT NULL DEFAULT 100 -- 100 for free/shadow_trial; 0 = unlimited for paid; deferred during active TIP/CRITICAL event + CHECK (monthly_mc_run_quota >= 0), + -- Feature flags (F11 — §68): Enterprise-only features gated here + feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE, -- Enterprise only + -- On-premise licence (F6 — §68) + licence_key TEXT, -- JWT signed by SpaceCom; checked at startup for on-premise deployments + licence_expires_at TIMESTAMPTZ, -- derived from licence_key; stored for query efficiency + -- Data residency (Finding 8) + hosting_jurisdiction TEXT NOT NULL DEFAULT 'eu' + CHECK (hosting_jurisdiction IN ('eu','uk','au','us','on_premise')), + data_residency_confirmed BOOLEAN DEFAULT FALSE -- DPA clause confirmed 
for this org +); + +-- Users +CREATE TABLE users ( + id SERIAL PRIMARY KEY, + organisation_id INTEGER REFERENCES organisations(id) NOT NULL, + email TEXT NOT NULL UNIQUE, + password_hash TEXT NOT NULL, -- bcrypt, cost factor >= 12 + role TEXT NOT NULL DEFAULT 'viewer' + CHECK (role IN ('viewer','analyst','operator','org_admin','admin','space_operator','orbital_analyst')), + mfa_secret TEXT, -- TOTP secret (encrypted at rest) + mfa_recovery_codes TEXT[], -- bcrypt hashes of recovery codes + mfa_enabled BOOLEAN DEFAULT FALSE, + failed_mfa_attempts INTEGER DEFAULT 0, + locked_until TIMESTAMPTZ, + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + last_login_at TIMESTAMPTZ, + tos_accepted_at TIMESTAMPTZ, -- NULL = ToS not yet accepted; access blocked until set + tos_version TEXT, -- semver of ToS accepted (e.g. "1.2.0") + tos_accepted_ip INET, -- IP address at time of acceptance (GDPR consent evidence) + data_source_acknowledgement BOOLEAN DEFAULT FALSE, -- must be TRUE before API key access + altitude_unit_preference TEXT NOT NULL DEFAULT 'ft' + CHECK (altitude_unit_preference IN ('m', 'ft', 'km')) + -- 'ft' default for ansp_operator; 'km' default for space_operator (set at account creation based on role) +); + +-- Refresh tokens (server-side revocation) +CREATE TABLE refresh_tokens ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + user_id INTEGER REFERENCES users(id) ON DELETE CASCADE, + token_hash TEXT NOT NULL UNIQUE, -- SHA-256 of the raw token + family_id UUID NOT NULL, -- All tokens from the same initial issuance share a family_id + issued_at TIMESTAMPTZ DEFAULT NOW(), + expires_at TIMESTAMPTZ NOT NULL, + revoked_at TIMESTAMPTZ, -- NULL = valid + superseded_at TIMESTAMPTZ, -- Set when this token is rotated out (newer token in family exists) + replaced_by UUID REFERENCES refresh_tokens(id), -- for rotation chain audit + source_ip INET, + user_agent TEXT +); +CREATE INDEX ON refresh_tokens (user_id, revoked_at); +CREATE 
INDEX ON refresh_tokens (family_id); -- for family revocation on reuse detection + +-- Security event log (append-only) +CREATE TABLE security_logs ( + id BIGSERIAL PRIMARY KEY, + logged_at TIMESTAMPTZ DEFAULT NOW(), + level TEXT NOT NULL, + event_type TEXT NOT NULL, + user_id INTEGER, + organisation_id INTEGER, + source_ip INET, + user_agent TEXT, + resource TEXT, + detail JSONB, + record_hash TEXT -- SHA-256(logged_at || event_type || detail) for tamper detection +); +CREATE TRIGGER security_logs_immutable + BEFORE UPDATE OR DELETE ON security_logs + FOR EACH ROW EXECUTE FUNCTION prevent_modification(); + +-- TLE history (hypertable) +-- No surrogate PK: TimescaleDB requires any UNIQUE/PK constraint to include the partition column. +-- Natural unique key is (object_id, ingested_at). Reference TLE records by this composite key. +CREATE TABLE tle_sets ( + object_id INTEGER REFERENCES objects(id), + epoch TIMESTAMPTZ NOT NULL, + line1 TEXT NOT NULL, + line2 TEXT NOT NULL, + source TEXT NOT NULL, + ingested_at TIMESTAMPTZ DEFAULT NOW(), + inclination_deg DOUBLE PRECISION, + raan_deg DOUBLE PRECISION, + eccentricity DOUBLE PRECISION, + arg_perigee_deg DOUBLE PRECISION, + mean_anomaly_deg DOUBLE PRECISION, + mean_motion_rev_per_day DOUBLE PRECISION, + bstar DOUBLE PRECISION, + apogee_km DOUBLE PRECISION, + perigee_km DOUBLE PRECISION, + cross_validated BOOLEAN DEFAULT FALSE, -- TRUE if confirmed by second source + cross_validation_delta_sma_km DOUBLE PRECISION, -- SMA difference between sources + UNIQUE (object_id, ingested_at) -- natural key; safe for TimescaleDB (includes partition col) +); +SELECT create_hypertable('tle_sets', 'ingested_at'); + +-- Space weather (hypertable) +CREATE TABLE space_weather ( + time TIMESTAMPTZ NOT NULL, + f107_obs DOUBLE PRECISION, -- observed F10.7 (current day) + f107_prior_day DOUBLE PRECISION, -- prior-day F10.7 (NRLMSISE-00 f107 input) + f107_81day_avg DOUBLE PRECISION, -- 81-day centred average (NRLMSISE-00 f107A input) + ap_daily 
INTEGER, -- daily Ap index (linear; NOT Kp) + ap_3h_history DOUBLE PRECISION[19], -- 3-hourly Ap values for prior 57h (NRLMSISE-00 full mode) + kp_3hourly DOUBLE PRECISION[], -- 3-hourly Kp (for storm detection; Kp > 5 triggers storm flag) + dst_index INTEGER, + uncertainty_multiplier DOUBLE PRECISION, + operational_status TEXT, + source TEXT DEFAULT 'noaa_swpc', + secondary_source TEXT, -- ESA SWS cross-validation value + cross_validation_delta_f107 DOUBLE PRECISION -- difference between sources +); +SELECT create_hypertable('space_weather', 'time'); + +-- TIP messages +CREATE TABLE tip_messages ( + id BIGSERIAL PRIMARY KEY, + object_id INTEGER REFERENCES objects(id), + norad_id INTEGER NOT NULL, + message_time TIMESTAMPTZ NOT NULL, + message_number INTEGER, + reentry_window_open TIMESTAMPTZ, + reentry_window_close TIMESTAMPTZ, + predicted_region TEXT, + source TEXT DEFAULT 'usspacecom', + raw_message TEXT +); + +-- Alert events (append-only) +CREATE TABLE alert_events ( + id BIGSERIAL PRIMARY KEY, + created_at TIMESTAMPTZ DEFAULT NOW(), + level TEXT NOT NULL + CHECK (level IN ('INFO','WARNING','CRITICAL')), + trigger_type TEXT NOT NULL, + object_id INTEGER REFERENCES objects(id), + organisation_id INTEGER REFERENCES organisations(id), + message TEXT NOT NULL, + acknowledged_at TIMESTAMPTZ, + acknowledged_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; log entry preserved + acknowledgement_note TEXT, + delivered_websocket BOOLEAN DEFAULT FALSE, + delivered_email BOOLEAN DEFAULT FALSE, + fir_intersection_km2 DOUBLE PRECISION, -- area of FIR polygon intersected by the triggering corridor (km²); NULL for non-spatial alerts + intersection_percentile TEXT + CHECK (intersection_percentile IN ('p50','p95')), -- which corridor percentile triggered the alert + prediction_id BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67) + record_hmac TEXT NOT NULL 
DEFAULT '' -- HMAC-SHA256 of safety-critical fields; signed at insert; verified nightly (F9) +); +CREATE TRIGGER alert_events_immutable + BEFORE UPDATE OR DELETE ON alert_events + FOR EACH ROW EXECUTE FUNCTION prevent_modification(); + +-- Simulations +CREATE TABLE simulations ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + module TEXT NOT NULL, + object_id INTEGER REFERENCES objects(id), + organisation_id INTEGER REFERENCES organisations(id), + params_json JSONB NOT NULL, + started_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + completed_at TIMESTAMPTZ, + status TEXT NOT NULL DEFAULT 'pending' + CHECK (status IN ('pending','running','complete','failed','cancelled')), + result_uri TEXT, + model_version TEXT, + celery_task_id TEXT, + error_detail TEXT, + created_by INTEGER REFERENCES users(id) +); + +-- Reports +CREATE TABLE reports ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + simulation_id UUID REFERENCES simulations(id), + object_id INTEGER REFERENCES objects(id), + organisation_id INTEGER REFERENCES organisations(id), + report_type TEXT NOT NULL, + created_at TIMESTAMPTZ DEFAULT NOW(), + created_by INTEGER REFERENCES users(id), + storage_uri TEXT NOT NULL, + params_json JSONB, + report_number TEXT +); + +-- Prediction outcomes (algorithmic accountability — links predictions to observed re-entry events) +CREATE TABLE prediction_outcomes ( + id SERIAL PRIMARY KEY, + prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67) + norad_id INTEGER NOT NULL, + observed_reentry_time TIMESTAMPTZ, -- actual re-entry time from post-event analysis (The Aerospace Corporation, US18SCS, etc.) 
+ observed_reentry_source TEXT, -- 'aerospace_corp' | 'us18scs' | 'esa_esoc' | 'manual' + p50_error_minutes DOUBLE PRECISION, -- predicted p50 minus observed (+ = predicted late, - = predicted early) + corridor_contains_observed BOOLEAN, -- TRUE if observed impact point fell within p95 corridor + fir_false_positive BOOLEAN, -- TRUE if a CRITICAL alert fired but no observable debris reached the affected FIR + fir_false_negative BOOLEAN, -- TRUE if observable debris reached a FIR but no CRITICAL alert was generated + ood_flag_at_prediction BOOLEAN, -- snapshot of ood_flag from the prediction record at prediction time + notes TEXT, + recorded_at TIMESTAMPTZ DEFAULT NOW(), + recorded_by INTEGER REFERENCES users(id) -- analyst who logged the outcome +); + +-- Hazard zones +CREATE TABLE hazard_zones ( + id BIGSERIAL PRIMARY KEY, + simulation_id UUID REFERENCES simulations(id), + organisation_id INTEGER REFERENCES organisations(id), + valid_from TIMESTAMPTZ NOT NULL, + valid_to TIMESTAMPTZ NOT NULL, + geometry GEOGRAPHY(POLYGON, 4326) NOT NULL, + altitude_min_km DOUBLE PRECISION, + altitude_max_km DOUBLE PRECISION, + risk_level TEXT, + confidence DOUBLE PRECISION, + sigma_along_track_km DOUBLE PRECISION, + sigma_cross_track_km DOUBLE PRECISION, + record_hmac TEXT NOT NULL +); +CREATE INDEX ON hazard_zones USING GIST (geometry); +CREATE INDEX ON hazard_zones (valid_from, valid_to); +CREATE TRIGGER hazard_zones_immutable + BEFORE UPDATE OR DELETE ON hazard_zones + FOR EACH ROW EXECUTE FUNCTION prevent_modification(); + +-- Airspace boundaries +CREATE TABLE airspace ( + id BIGSERIAL PRIMARY KEY, + designator TEXT NOT NULL, + name TEXT, + type TEXT NOT NULL, + geometry GEOMETRY(POLYGON, 4326) NOT NULL, -- GEOMETRY (not GEOGRAPHY): FIR boundaries never cross antimeridian; ~3× faster for ST_Intersects + lower_fl INTEGER, + upper_fl INTEGER, + icao_region TEXT +); +CREATE INDEX ON airspace USING GIST (geometry); + +-- Debris fragments +CREATE TABLE fragments ( + id BIGSERIAL 
PRIMARY KEY, + simulation_id UUID REFERENCES simulations(id), + mass_kg DOUBLE PRECISION, + characteristic_length_m DOUBLE PRECISION, + cross_section_m2 DOUBLE PRECISION, + material TEXT, + ballistic_coefficient_kgm2 DOUBLE PRECISION, + pre_entry_survived BOOLEAN, + impact_point GEOGRAPHY(POINT, 4326), + impact_velocity_kms DOUBLE PRECISION, + impact_angle_deg DOUBLE PRECISION, + kinetic_energy_j DOUBLE PRECISION, + casualty_area_m2 DOUBLE PRECISION, + dispersion_semi_major_km DOUBLE PRECISION, + dispersion_semi_minor_km DOUBLE PRECISION, + dispersion_orientation_deg DOUBLE PRECISION +); +CREATE INDEX ON fragments USING GIST (impact_point); + +-- Owned objects (space operator registration) +CREATE TABLE owned_objects ( + id SERIAL PRIMARY KEY, + organisation_id INTEGER REFERENCES organisations(id) NOT NULL, + object_id INTEGER REFERENCES objects(id) NOT NULL, + norad_id INTEGER NOT NULL, + registered_at TIMESTAMPTZ DEFAULT NOW(), + registration_reference TEXT, -- National space law registration number + has_propulsion BOOLEAN DEFAULT FALSE, -- Enables controlled re-entry planner + UNIQUE (organisation_id, object_id) +); +CREATE INDEX ON owned_objects (organisation_id); + +-- API keys (for Persona E/F programmatic access) +CREATE TABLE api_keys ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + organisation_id INTEGER REFERENCES organisations(id) NOT NULL, + user_id INTEGER REFERENCES users(id), -- NULL for org-level service account keys (F5) + is_service_account BOOLEAN NOT NULL DEFAULT FALSE, -- TRUE = org-level key, no human user + service_account_name TEXT, -- required when is_service_account = TRUE; e.g. "ANSP Integration Service" + key_hash TEXT NOT NULL UNIQUE, -- SHA-256 of raw key; raw key shown once at creation + name TEXT NOT NULL, -- Human label, e.g. 
"Ops Centre Integration" + role TEXT NOT NULL, -- space_operator | orbital_analyst + created_at TIMESTAMPTZ DEFAULT NOW(), + last_used_at TIMESTAMPTZ, + expires_at TIMESTAMPTZ, + revoked_at TIMESTAMPTZ, + revoked_by INTEGER REFERENCES users(id), -- org_admin or admin who revoked (F5) + requests_today INTEGER DEFAULT 0, + daily_limit INTEGER DEFAULT 1000, + -- API key scope and rate limit overrides (Finding 11) + allowed_endpoints TEXT[], -- NULL = all endpoints for role; e.g. ['GET /space/objects'] + rate_limit_override JSONB, -- e.g. {"decay_predict": {"limit": 5, "window": "1h"}} + CONSTRAINT service_account_name_required CHECK ( + (is_service_account = FALSE) OR (service_account_name IS NOT NULL) + ), + CONSTRAINT user_or_service CHECK ( + (user_id IS NOT NULL AND is_service_account = FALSE) + OR (user_id IS NULL AND is_service_account = TRUE) + ) +); +CREATE INDEX ON api_keys (organisation_id, revoked_at); +CREATE INDEX ON api_keys (organisation_id, is_service_account); -- org admin key listing + +-- Async job tracking — all Celery-backed POST endpoints return a job reference (Finding 3) +CREATE TABLE jobs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + organisation_id INTEGER NOT NULL REFERENCES organisations(id), + user_id INTEGER NOT NULL REFERENCES users(id), + job_type TEXT NOT NULL + CHECK (job_type IN ('decay_predict','report','reentry_plan','propagate')), + status TEXT NOT NULL DEFAULT 'queued' + CHECK (status IN ('queued','running','complete','failed','cancelled')), + celery_task_id TEXT, -- Celery AsyncResult ID for internal tracking + params_hash TEXT, -- SHA-256 of input params; used for idempotency check + result_url TEXT, -- populated when status='complete'; e.g. 
'/decay/predictions/123' + error_code TEXT, -- populated when status='failed' + error_message TEXT, + estimated_duration_seconds INTEGER, -- populated at creation from historical p50 for job_type + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + started_at TIMESTAMPTZ, + completed_at TIMESTAMPTZ +); +CREATE INDEX ON jobs (organisation_id, status, created_at DESC); +CREATE INDEX ON jobs (celery_task_id); + +-- Idempotency key store — prevents duplicate mutations from network retries (Finding 5) +CREATE TABLE idempotency_keys ( + key TEXT NOT NULL, -- client-provided UUID + user_id INTEGER NOT NULL REFERENCES users(id), + endpoint TEXT NOT NULL, -- e.g. 'POST /decay/predict' + response_status INTEGER NOT NULL, + response_body JSONB NOT NULL, + created_at TIMESTAMPTZ DEFAULT NOW(), + expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '24 hours', + PRIMARY KEY (key, user_id, endpoint) +); +CREATE INDEX ON idempotency_keys (expires_at); -- for TTL cleanup job + +-- Usage metering (F3) — billable events; append-only +CREATE TABLE usage_events ( + id BIGSERIAL PRIMARY KEY, + organisation_id INTEGER NOT NULL REFERENCES organisations(id), + user_id INTEGER REFERENCES users(id), -- NULL for API key / system-triggered events + api_key_id UUID REFERENCES api_keys(id), -- set when triggered via API key + event_type TEXT NOT NULL + CHECK (event_type IN ( + 'decay_prediction_run', + 'conjunction_screen_run', + 'report_export', + 'api_request', + 'mc_quota_exhausted', -- quota hit; signals upsell opportunity + 'reentry_plan_run' + )), + quantity INTEGER NOT NULL DEFAULT 1, -- e.g. number of API requests batched + billing_period TEXT NOT NULL, -- 'YYYY-MM' — month this event counts toward + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + detail JSONB -- event-specific metadata (object_id, mc_n, etc.) 
+); +CREATE INDEX ON usage_events (organisation_id, billing_period, event_type); +CREATE INDEX ON usage_events (organisation_id, created_at DESC); +-- Append-only enforcement +CREATE TRIGGER usage_events_immutable + BEFORE UPDATE OR DELETE ON usage_events + FOR EACH ROW EXECUTE FUNCTION prevent_modification(); + +-- Billing contacts (F10) +CREATE TABLE billing_contacts ( + id SERIAL PRIMARY KEY, + organisation_id INTEGER NOT NULL REFERENCES organisations(id) UNIQUE, + billing_email TEXT NOT NULL, + billing_name TEXT NOT NULL, + billing_address TEXT, + vat_number TEXT, -- EU VAT registration; required for B2B invoicing + purchase_order_number TEXT, -- PO reference required by some ANSP procurement depts + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_by INTEGER REFERENCES users(id) -- must be org_admin or admin +); + +-- Subscription periods (F10) — immutable record of what was billed when +CREATE TABLE subscription_periods ( + id SERIAL PRIMARY KEY, + organisation_id INTEGER NOT NULL REFERENCES organisations(id), + tier TEXT NOT NULL, + period_start TIMESTAMPTZ NOT NULL, + period_end TIMESTAMPTZ, -- NULL = current (open) period + monthly_fee_eur NUMERIC(10, 2), -- agreed contract price; NULL for internal/trial + currency TEXT NOT NULL DEFAULT 'EUR', + invoice_ref TEXT, -- external billing system invoice ID (e.g. 
Stripe invoice_id) + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); +CREATE INDEX ON subscription_periods (organisation_id, period_start DESC); + +-- NOTAM drafts (audit trail; never submitted by SpaceCom) +CREATE TABLE notam_drafts ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + prediction_id BIGINT REFERENCES reentry_predictions(id), + organisation_id INTEGER REFERENCES organisations(id), + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + created_by INTEGER REFERENCES users(id), + draft_type TEXT NOT NULL + CHECK (draft_type IN ('new','cancellation')), + fir_designators TEXT[] NOT NULL, + valid_from TIMESTAMPTZ, + valid_to TIMESTAMPTZ, + draft_text TEXT NOT NULL, -- Full ICAO-format draft text + reviewed_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; draft preserved + reviewed_at TIMESTAMPTZ, + review_note TEXT, + safety_record BOOLEAN DEFAULT TRUE, -- always retained; excluded from data drop policy + generated_during_degraded BOOLEAN DEFAULT FALSE -- TRUE if ingest was degraded at generation time + -- No issuance fields — SpaceCom never issues NOTAMs +); + +-- Degraded mode audit log (Finding 7 — operational ANSP disclosure requirement) +-- Records every transition into and out of degraded mode for incident investigation +CREATE TABLE degraded_mode_events ( + id BIGSERIAL PRIMARY KEY, + started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + ended_at TIMESTAMPTZ, -- NULL = currently degraded + affected_sources TEXT[] NOT NULL, -- e.g. 
['space_track', 'noaa_swpc'] + severity TEXT NOT NULL + CHECK (severity IN ('WARNING','CRITICAL')), + trigger_reason TEXT NOT NULL, -- human-readable: 'Space-Track ingest gap > 4h' + resolved_by TEXT, -- 'auto-recovery' | user_id | 'manual' + safety_record BOOLEAN DEFAULT TRUE -- always retained under safety record policy +); +-- Append-only: no UPDATE or DELETE permitted +CREATE TRIGGER degraded_mode_events_immutable + BEFORE UPDATE OR DELETE ON degraded_mode_events + FOR EACH ROW EXECUTE FUNCTION prevent_modification(); + +-- Shadow validation records (compare shadow predictions to actual events) +CREATE TABLE shadow_validations ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + prediction_id BIGINT REFERENCES reentry_predictions(id), + organisation_id INTEGER REFERENCES organisations(id), + created_at TIMESTAMPTZ DEFAULT NOW(), + created_by INTEGER REFERENCES users(id), + actual_reentry_time TIMESTAMPTZ, + actual_reentry_location GEOGRAPHY(POINT, 4326), + actual_source TEXT, -- 'aerospace_corp_db' | 'tip_message' | 'manual' + p50_error_minutes DOUBLE PRECISION, -- actual - predicted p50 in minutes + in_p95_corridor BOOLEAN, -- did actual point fall within 95th pct corridor? + notes TEXT +); + +-- Legal opinions (jurisdiction-level gate for shadow mode and operational deployment) +CREATE TABLE legal_opinions ( + id SERIAL PRIMARY KEY, + jurisdiction TEXT NOT NULL UNIQUE, -- e.g. 
'AU', 'EU', 'UK', 'US'
+    status TEXT NOT NULL DEFAULT 'pending'
+        CHECK (status IN ('pending','in_progress','complete','not_required')),
+    opinion_date DATE,
+    counsel_firm TEXT,
+    shadow_mode_cleared BOOLEAN DEFAULT FALSE,   -- opinion confirms shadow deployment is permissible
+    operational_cleared BOOLEAN DEFAULT FALSE,   -- opinion confirms operational deployment is permissible
+    liability_cap_agreed BOOLEAN DEFAULT FALSE,
+    notes TEXT,
+    document_minio_key TEXT,   -- reference to stored opinion document in MinIO
+    created_at TIMESTAMPTZ DEFAULT NOW(),
+    updated_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- Shared immutability function, referenced by the append-only triggers above.
+-- In the applied migration this function must be created BEFORE any CREATE TRIGGER
+-- statement that references it; it is shown here for reference only.
+CREATE OR REPLACE FUNCTION prevent_modification()
+RETURNS TRIGGER AS $$
+BEGIN
+    RAISE EXCEPTION 'Table % is append-only or immutable after creation', TG_TABLE_NAME;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Shared updated_at function (used by mutable tables)
+CREATE OR REPLACE FUNCTION set_updated_at()
+RETURNS TRIGGER LANGUAGE plpgsql AS $$
+BEGIN
+    NEW.updated_at = NOW();
+    RETURN NEW;
+END;
+$$;
+
+-- updated_at triggers for all mutable tables
+CREATE TRIGGER organisations_updated_at
+    BEFORE UPDATE ON organisations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
+CREATE TRIGGER users_updated_at
+    BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at();
+CREATE TRIGGER simulations_updated_at
+    BEFORE UPDATE ON simulations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
+CREATE TRIGGER jobs_updated_at
+    BEFORE UPDATE ON jobs FOR EACH ROW EXECUTE FUNCTION set_updated_at();
+CREATE TRIGGER notam_drafts_updated_at
+    BEFORE UPDATE ON notam_drafts FOR EACH ROW EXECUTE FUNCTION set_updated_at();
+```
+
+**Shadow mode flag on predictions and hazard zones:** Add `shadow_mode BOOLEAN DEFAULT FALSE` to both `reentry_predictions` and `hazard_zones`.
Shadow records are excluded from all operational API responses (`WHERE shadow_mode = FALSE` applied to all operational endpoints) but accessible via `/analysis` and the Feedback/shadow validation workflow. + +--- + +### 9.3 Index Strategy + +All indexes must be created `CONCURRENTLY` on live hypertables to avoid table locks (see §9.4). The following indexes are required beyond TimescaleDB's automatic chunk indexes: + +```sql +-- orbits hypertable: object + time range queries (CZML generation) +CREATE INDEX CONCURRENTLY IF NOT EXISTS orbits_object_epoch_idx + ON orbits (object_id, epoch DESC); + +-- reentry_predictions: latest prediction per object (Event Detail, operational overview) +CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_object_created_idx + ON reentry_predictions (object_id, created_at DESC) + WHERE integrity_failed = FALSE AND shadow_mode = FALSE; + +-- alert_events: unacknowledged alerts per org (badge count — called on every page load) +-- Partial index on acknowledged_at IS NULL: only live unacked rows indexed; shrinks as alerts are acknowledged +CREATE INDEX CONCURRENTLY IF NOT EXISTS alert_events_unacked_idx + ON alert_events (organisation_id, level, created_at DESC) + WHERE acknowledged_at IS NULL; + +-- jobs: Celery worker polls for queued jobs; partial index keeps this tiny and fast +CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_queued_idx + ON jobs (organisation_id, created_at) + WHERE status = 'queued'; + +-- refresh_tokens: token validation only cares about live (non-revoked) tokens +CREATE INDEX CONCURRENTLY IF NOT EXISTS refresh_tokens_live_idx + ON refresh_tokens (token_hash) + WHERE revoked_at IS NULL; + +-- idempotency_keys: TTL cleanup job needs only expired rows +CREATE INDEX CONCURRENTLY IF NOT EXISTS idempotency_keys_expired_idx + ON idempotency_keys (expires_at) + WHERE expires_at IS NOT NULL; + +-- PostGIS spatial: all columns used in ST_Intersects / ST_Contains / ST_Distance +CREATE INDEX CONCURRENTLY IF NOT EXISTS 
reentry_pred_corridor_gist + ON reentry_predictions USING GIST (ground_track_corridor); +-- airspace.geometry GIST index already present (see §9.2) +CREATE INDEX CONCURRENTLY IF NOT EXISTS hazard_zones_polygon_gist + ON hazard_zones USING GIST (polygon); +CREATE INDEX CONCURRENTLY IF NOT EXISTS fragments_impact_gist + ON fragments USING GIST (impact_point); + +-- tle_sets hypertable: latest TLE per object (cross-validation, propagation) +CREATE INDEX CONCURRENTLY IF NOT EXISTS tle_sets_object_ingested_idx + ON tle_sets (object_id, ingested_at DESC); + +-- security_logs: recent events per user (audit queries) +CREATE INDEX CONCURRENTLY IF NOT EXISTS security_logs_user_time_idx + ON security_logs (user_id, created_at DESC); +``` + +**Spatial type convention:** +- `GEOGRAPHY` — used for global features that may cross the antimeridian (corridor polygons, nominal re-entry points, fragment impact points). Geodetic calculations; correct for global spans. +- `GEOMETRY(POLYGON, 4326)` — used for regional features always within ±180° longitude (FIR/UIR airspace boundaries). Planar approximation; ~3× faster for `ST_Intersects` than `GEOGRAPHY`; accurate enough for airspace boundary intersection within a single hemisphere. + +**SRID enforcement (F2 — §62):** Declaring the SRID in the column type (`GEOMETRY(POLYGON, 4326)`) prevents implicit SRID mismatch errors, but does not prevent application code from inserting a geometry constructed with SRID 0. 
Add explicit CHECK constraints on all spatial columns:
+
+```sql
+-- Ensure corridor polygon SRID is correct
+ALTER TABLE reentry_predictions
+    ADD CONSTRAINT chk_corridor_srid
+    CHECK (ST_SRID(ground_track_corridor::geometry) = 4326);
+
+ALTER TABLE hazard_zones
+    ADD CONSTRAINT chk_hazard_zone_srid
+    CHECK (ST_SRID(geometry) = 4326);
+
+ALTER TABLE airspace
+    ADD CONSTRAINT chk_airspace_srid
+    CHECK (ST_SRID(geometry) = 4326);
+```
+
+The CI migration gate (`alembic check`) will flag any migration that adds a spatial column without a matching SRID CHECK constraint.
+
+**ST_Buffer distance units (F9 — §62):** `ST_Buffer` on a `GEOMETRY(POLYGON, 4326)` column uses degree-units, not metres. At 60°N, 1° of longitude ≈ 55 km; at the equator, 1° ≈ 111 km — an uncertainty buffer expressed in degrees gives wildly different areas at different latitudes. Always buffer in a projected CRS, then transform back:
+
+```sql
+-- CORRECT: buffer 50 km around corridor point at any latitude
+SELECT ST_Transform(
+    ST_Buffer(
+        ST_Transform(ST_SetSRID(ST_MakePoint(lon, lat), 4326), 3857),  -- project to Web Mercator (metres)
+        50000  -- 50 km in metres
+    ),
+    4326  -- back to WGS84
+) AS buffered_geom;
+
+-- WRONG: buffer in degrees — DO NOT USE
+-- SELECT ST_Buffer(geom, 0.5) FROM ...  ← 0.5° is ~28 km at 60°N, ~55 km at the equator
+```
+
+Note that Web Mercator metres are themselves stretched by roughly 1/cos(latitude), so a metre-valued buffer in EPSG:3857 under-covers the ground at high latitudes. For global spans, or wherever Mercator distortion is unacceptable, use `ST_Buffer` on a `GEOGRAPHY` column instead — it accepts metres natively:
+```sql
+SELECT ST_Buffer(corridor::geography, 50000)  -- 50 km buffer, geodetically correct
+FROM reentry_predictions WHERE ...
+```
+
+**FIR intersection query optimisation:** Apply a bounding-box pre-filter before the full polygon intersection test to eliminate most rows cheaply.
`airspace.geometry` is `GEOMETRY` while `hazard_zones.geometry` and corridor parameters are `GEOGRAPHY` — **always cast GEOGRAPHY → GEOMETRY explicitly** before passing to `ST_Intersects` with an airspace column; PostgreSQL cannot use the GiST index and falls back to a seq scan if the types are mixed implicitly:
+
+```sql
+-- Corridor (GEOGRAPHY) intersecting FIR boundaries (GEOMETRY): explicit cast required
+SELECT a.designator, a.name
+FROM airspace a
+WHERE a.geometry && ST_Envelope($1::geography::geometry)   -- fast bbox pre-filter (uses GIST)
+  AND ST_Intersects(a.geometry, $1::geography::geometry);  -- exact test (GEOMETRY, not GEOGRAPHY)
+-- $1 = corridor polygon passed as GEOGRAPHY from application layer
+```
+
+Add a CI linter rule (or custom ruff plugin) that rejects `ST_Intersects(airspace.geometry, <expr>)` unless `<expr>` is explicitly cast to `::geometry`. This prevents the mixed-type silent seq-scan regression from being introduced during maintenance.
+
+Cache the FIR intersection result per `prediction_id` in Redis (TTL: until the prediction is superseded) — the intersection for a given prediction never changes.
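The per-prediction caching above can be sketched as a cache-aside helper. This is illustrative, not from the plan: the key scheme and function names are assumptions, and a plain mapping stands in for the Redis client (a thin adapter over redis-py's `get`/`set` would satisfy the same interface):

```python
import json
from typing import Callable, MutableMapping


def fir_intersections(
    prediction_id: int,
    cache: MutableMapping[str, str],
    query_firs: Callable[[int], list],
) -> list:
    """Cache-aside lookup: the FIR set for a given prediction never changes,
    so a cache hit skips the spatial ST_Intersects query entirely."""
    key = f"fir_intersections:{prediction_id}"  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    firs = query_firs(prediction_id)  # runs the airspace intersection query
    cache[key] = json.dumps(firs)     # with Redis: SET, evicted when the prediction is superseded
    return firs
```

Because the result is immutable for a given `prediction_id`, no cache-invalidation logic is needed beyond deleting the key when a new prediction supersedes the old one.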
+ +--- + +### 9.4 TimescaleDB Configuration and Continuous Aggregates + +**Hypertable chunk intervals** — set explicitly at creation; default 7-day chunks are too large for the `orbits` CZML query pattern (most queries cover ≤ 72h): + +```sql +-- orbits: 1-day chunks (72h CZML window spans 3 chunks; good chunk exclusion) +SELECT create_hypertable('orbits', 'epoch', + chunk_time_interval => INTERVAL '1 day', + if_not_exists => TRUE); + +-- tle_sets: 1-month chunks (~1,800 rows/day at 600 objects × 3 TLE updates; queried by object_id not time range) +-- Small chunks (7 days) produce poor compression ratios (~12,600 rows/chunk); 1 month improves ratio ~4× +SELECT create_hypertable('tle_sets', 'ingested_at', + chunk_time_interval => INTERVAL '1 month', + if_not_exists => TRUE); + +-- space_weather: 30-day chunks (~3000 rows/month at 15-min cadence) +SELECT create_hypertable('space_weather', 'time', + chunk_time_interval => INTERVAL '30 days', + if_not_exists => TRUE); +``` + +**Continuous aggregates** — pre-compute recurring expensive queries instead of scanning raw hypertable rows on every request: + +```sql +-- 81-day rolling F10.7 average (queried on every Space Weather Widget render) +CREATE MATERIALIZED VIEW space_weather_daily + WITH (timescaledb.continuous) AS + SELECT time_bucket('1 day', time) AS day, + AVG(f107_obs) AS f107_daily_avg, + MAX(kp_3hourly[1]) AS kp_max_daily + FROM space_weather + GROUP BY day +WITH NO DATA; + +SELECT add_continuous_aggregate_policy('space_weather_daily', + start_offset => INTERVAL '90 days', + end_offset => INTERVAL '1 hour', + schedule_interval => INTERVAL '1 hour'); +``` + +Backend queries for the 81-day F10.7 average read from `space_weather_daily` (the continuous aggregate), not from the raw `space_weather` hypertable. + +**Compression policy intervals** — compression must not target recently-written chunks. TimescaleDB decompresses a chunk before any write to it; compressing hot chunks adds 50–200ms latency per write batch. 
Set `compress_after` well beyond the active write window: + +| Hypertable | Chunk interval | `compress_after` | Write cadence | Reasoning | +|---|---|---|---|---| +| `orbits` | 1 day | 7 days | 1 min (continuous) | Data is queryable but not written after ~24h; 7-day buffer prevents write-decompress thrash | +| `adsb_states` | 4 hours | 14 days | 60s (Celery Beat) | Rolling 24h retention; compress only after data is past retention interest | +| `space_weather` | 30 days | 60 days | 15 min | Very low write rate; compress after one full 30-day chunk is closed | +| `tle_sets` | 1 month | 2 months | Every 4h ingest | ~1,800 rows/day; 1-month chunks give good compression ratio; 2-month buffer ensures active month is never compressed | + +```sql +-- Apply compression policies (run after hypertable creation) +SELECT add_compression_policy('orbits', INTERVAL '7 days'); +SELECT add_compression_policy('adsb_states', INTERVAL '14 days'); +SELECT add_compression_policy('space_weather', INTERVAL '60 days'); +SELECT add_compression_policy('tle_sets', INTERVAL '2 months'); +``` + +**Autovacuum tuning** — append-only tables still accumulate dead tuples from aborted transactions and MVCC overhead. 
Default 20% threshold is too conservative for high-write safety tables: + +```sql +ALTER TABLE alert_events SET ( + autovacuum_vacuum_scale_factor = 0.01, -- vacuum at 1% dead tuples (default: 20%) + autovacuum_analyze_scale_factor = 0.005 +); +ALTER TABLE security_logs SET ( + autovacuum_vacuum_scale_factor = 0.01, + autovacuum_analyze_scale_factor = 0.005 +); +ALTER TABLE reentry_predictions SET ( + autovacuum_vacuum_cost_delay = 2, -- allow aggressive vacuum on query-critical table + autovacuum_analyze_scale_factor = 0.01 +); +``` + +PostgreSQL-level settings via `patroni.yml`: +```yaml +postgresql: + parameters: + idle_in_transaction_session_timeout: 30000 # 30s -- prevents analytics sessions blocking autovacuum + max_connections: 50 # pgBouncer handles client multiplexing; DB needs only 50 + log_min_duration_statement: 500 # F7 §58: log queries > 500ms; shipped to Loki via Promtail + shared_preload_libraries: timescaledb,pg_stat_statements # F7 §58: enable slow query tracking + pg_stat_statements.track: all # track all statements including nested + # Analyst role statement timeout (F11 §58): prevents runaway analytics queries starving ops connections + # Applied at role level, not globally, to avoid impacting operational paths +``` + +**Query plan governance (F7 — §58):** Slow queries (> 500ms) appear in PostgreSQL logs and are shipped to Loki. A weekly Grafana report queries `pg_stat_statements` via the `postgres-exporter` and surfaces the top-10 queries by `total_exec_time`. Any query appearing in the top-10 for two consecutive weeks requires a PR with an `EXPLAIN ANALYSE` output and either an index addition or a documented acceptance rationale. The `EXPLAIN ANALYSE` output is recorded in the migration file header comment for index additions. CI migration timeout (§9.4) applies: migrations running > 30s against the test dataset require review before merge. 
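The two-consecutive-weeks escalation rule above can be sketched as a small comparison over weekly snapshots. The function names and the `(queryid, total_exec_time)` tuple shape are assumptions about how the Grafana report's `pg_stat_statements` data would be consumed, not part of the plan:

```python
from typing import Iterable


def top_n_by_total_time(rows: Iterable[tuple], n: int = 10) -> list:
    """rows: (queryid, total_exec_time_ms) pairs, as read via postgres-exporter
    from pg_stat_statements. Returns queryids ranked by cumulative time."""
    return [qid for qid, _ in sorted(rows, key=lambda r: r[1], reverse=True)[:n]]


def needs_explain_analyse(last_week: list, this_week: list) -> set:
    """Queries appearing in the top-10 for two consecutive weeks require a PR
    with EXPLAIN ANALYSE output and either an index or an acceptance rationale."""
    return set(last_week) & set(this_week)
```

The intersection is over `queryid` rather than query text, so parameterised variants of the same statement are counted as one query.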
+ +**Analyst role query timeout (F11 — §58):** Persona B/F analyst queries route to the read replica (§3.2) but must still be bounded to prevent a runaway query exhausting replica connections and triggering replication lag. Apply a `statement_timeout` at the database role level so it applies regardless of connection source: + +```sql +-- Applied once at schema setup; persists across reconnections +ALTER ROLE spacecom_analyst SET statement_timeout = '30s'; +ALTER ROLE spacecom_readonly SET statement_timeout = '30s'; + +-- Operational roles have no statement timeout — but idle-in-transaction timeout applies globally +-- (idle_in_transaction_session_timeout = 30s in patroni.yml) +``` + +The `spacecom_analyst` role is the PgBouncer user for the read replica pool. All analyst-originated queries automatically inherit the 30s limit. If a query exceeds 30s it receives `ERROR: canceling statement due to statement timeout`; the frontend displays a user-facing message: "This query exceeded the 30-second limit. Refine your filters or contact your administrator." Logged at WARNING to Loki. + +**PgBouncer transaction mode + asyncpg prepared statement cache** — asyncpg caches prepared statements per server-side connection. In PgBouncer transaction mode, the connection returned after each transaction may differ from the one the statement was prepared on, causing `ERROR: prepared statement "..." does not exist` under load. Disable the cache in the SQLAlchemy async engine config: + +```python +engine = create_async_engine( + DATABASE_URL, + connect_args={"prepared_statement_cache_size": 0}, +) +``` + +This is non-negotiable when using PgBouncer transaction mode. Do not revert this setting in the belief that it is a performance regression — it prevents a hard production failure mode. See ADR 0008. 
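One way to surface the role-level timeout as the user-facing message above. This is a minimal sketch; the function name and the string-matching approach are assumptions (a production handler would inspect the SQLSTATE in the asyncpg/SQLAlchemy error chain, where statement timeout cancellation reports `57014`, `query_canceled`):

```python
from typing import Optional

# User-facing text quoted from the plan's analyst-timeout requirement
USER_TIMEOUT_MESSAGE = (
    "This query exceeded the 30-second limit. "
    "Refine your filters or contact your administrator."
)


def map_statement_timeout(exc: Exception) -> Optional[str]:
    """Return the user-facing message if the error is a statement timeout,
    else None so other error handlers can deal with it."""
    if "statement timeout" in str(exc).lower():
        return USER_TIMEOUT_MESSAGE
    return None
```

The WARNING-level log entry to Loki would be emitted alongside the mapped response in the same handler.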
+ +**Migration safety on live hypertables** (additions to the Alembic policy in §26.9): +- Always use `CREATE INDEX CONCURRENTLY` for new indexes — no table lock; safe during live ingest +- Never add a column with a non-null default to a populated hypertable in one migration: (1) add nullable, (2) backfill in batches, (3) add NOT NULL constraint separately +- Test every migration against production-sized data; record execution time in the migration file header comment +- Set a CI migration timeout: if a migration runs > 30s against the test dataset, it must be reviewed before merge + +--- + +## 10. Technology Stack + +| Layer | Technology | Rationale | +|-------|-----------|-----------| +| Frontend framework | **Next.js 15 + TypeScript** | Type safety, SSR for dashboards, static export option | +| 3D Globe | **CesiumJS** (retained) | Native CZML support; proven in prototype | +| 2D overlays | **Deck.gl** | WebGL heatmaps (Mode B), arc layers, hex grids | +| Server state | **TanStack Query** | Caching, background refetch, stale-while-revalidate. API responses never stored in Zustand. 
| +| UI state | **Zustand** | Pure UI state only: timeline mode, selected object, layer visibility, alert acknowledgements | +| URL state | **nuqs** | Shareable deep links; selected object/event/time reflected in URL | +| Backend framework | **FastAPI** (retained) | Async, OpenAPI auto-docs, Pydantic validation | +| Task queue | **Celery + Redis** | Battle-tested for scientific compute; Flower monitoring | +| Catalog propagation | **`sgp4`** | SGP4/SDP4; catalog tracking only, not decay prediction | +| Numerical integrator | **`scipy.integrate.DOP853`** or custom **RK7(8)** | Adaptive step-size for Cowell decay prediction | +| Atmospheric density | **`nrlmsise00`** Python wrapper | NRLMSISE-00; driven by F10.7 and Ap | +| Frame transformations | **`astropy`** | IAU 2006 precession/nutation, IERS EOP, TEME→GCRF→ITRF | +| Astrodynamics utilities | **`poliastro`** (optional) | Conjunction geometry helpers | +| Auth | **`python-jose`** (RS256 JWT) + **`pyotp`** (TOTP MFA) | Asymmetric JWT; TOTP RFC 6238 | +| Rate limiting | **`slowapi`** | Redis token bucket; per-user and per-IP limits | +| HTML sanitisation | **`bleach`** | User-supplied content before Playwright rendering | +| Password hashing | **`passlib[bcrypt]`** | bcrypt cost factor ≥ 12 | +| Database | **TimescaleDB + PostGIS** (retained) | Time-series + geospatial; RLS for multi-tenancy | +| Cache / broker | **Redis 7** | Broker + pub/sub: `maxmemory-policy noeviction` (Celery queues must never be evicted). Separate Redis DB index for application cache: `allkeys-lru`. AUTH + TLS in production. | +| Connection pooler | **PgBouncer 1.22** | Transaction-mode pooling between all app services and TimescaleDB. Prevents connection exhaustion at Tier 3; single failover target for Patroni switchover. `max_client_conn=200`, `default_pool_size=20`. Pool sizing derivation (F2 — §58): PostgreSQL `max_connections=50`; reserve 5 for superuser/admin; 45 available server connections. 
`default_pool_size=20` per pool (one pool per DB user); leaves headroom for Alembic migrations and ad-hoc DBA access. `max_client_conn=200` = (2 backend workers × 40 async connections) + (4 sim workers × 16 threads) + (2 ingest workers × 4) = 152 peak; 200 provides burst headroom. Validate with `SHOW pools;` in `psql -h pgbouncer` — `cl_waiting > 0` sustained means pool is undersized. | +| Object storage | **MinIO** | Private buckets; pre-signed URLs only | +| Containerisation | **Docker Compose** (retained); **Caddy** as TLS-terminating reverse proxy | Single-command dev; HTTPS auto-provisioning | +| Testing — backend | **pytest + hypothesis** | Property-based tests for numerical and security invariants | +| Testing — frontend | **Vitest + Playwright** | Unit tests + E2E including security header checks | +| SAST — Python | **Bandit** | Static analysis; CI blocks on High severity | +| SAST — TypeScript | **ESLint security plugin** | Static analysis; CI blocks on High severity | +| Container scanning | **Trivy** | CI blocks on Critical/High CVEs | +| DAST | **OWASP ZAP** | Phase 2 pipeline against staging | +| Dependency management | **pip-tools** + **npm ci** | Pinned hashes; `--require-hashes` | +| Report rendering | **Playwright headless** (isolated `renderer` container) | Server-side globe screenshot; no client-side canvas | +| Secrets management | **Docker secrets** (Phase 1 production) → **HashiCorp Vault** (Phase 3) | | +| Task scheduler HA | **`celery-redbeat`** | Redis-backed Beat scheduler; distributed locking; multiple instances safe | +| DB HA / failover | **Patroni** + **etcd** | Automatic TimescaleDB primary/standby failover; ≤ 30s RTO | +| Redis HA | **Redis Sentinel** (3 nodes) | Master failover ≤ 10s; transparent to application via `redis-py` Sentinel client | +| Monitoring | **Prometheus + Grafana** | Business-level metrics from Phase 1; four dashboards (§26.7); AlertManager with runbook links | +| Log aggregation | **Grafana Loki + Promtail** | 
Phase 2; Promtail scrapes Docker log files; Loki stores and queries; co-deployed with Grafana; no index servers required | +| Distributed tracing | **OpenTelemetry → Grafana Tempo** | Phase 2; FastAPI + SQLAlchemy + Celery auto-instrumented; OTLP exporter; trace_id = request_id for log correlation; ADR 0017 | +| Structured logging | **structlog** | JSON structured logs with required fields; sanitising processor strips secrets; `request_id` propagated through HTTP → Celery chain | +| On-call alerting | **PagerDuty or OpsGenie** | Routes Prometheus AlertManager alerts; L1/L2/L3 escalation tiers (§26.8) | +| CI/CD pipeline | **GitLab CI** | Native to the self-hosted GitLab monorepo; stage-based builds for Python/Node; protected environments and approval rules for deploys | +| Container registry | **GitLab Container Registry** | Co-located with source; `sha-` is the canonical immutable tag; `latest` tag is forbidden in production deployments; image vulnerability attestations via `cosign` | +| Pre-commit | **`pre-commit` framework** | Hooks: `detect-secrets`, `ruff` (lint + format), `mypy` (type gate), `hadolint` (Dockerfile), `prettier` (JS/HTML), `sqlfluff` (migrations); spec in `.pre-commit-config.yaml`; same hooks re-run in CI | +| Local task runner | **`make`** | Standard targets: `make dev` (full-stack hot-reload), `make test` (pytest + vitest), `make migrate` (alembic upgrade head), `make seed` (fixture load), `make lint` (all pre-commit hooks), `make clean` (prune volumes) | + +--- + +## 11. 
Data Source Inventory
+
+| Source | Data | Access | Priority |
+|--------|------|--------|----------|
+| **Space-Track.org** | TLE catalog, CDMs, object catalog, RCS data, TIP messages | REST API (account required); credentials in secrets manager | P1 |
+| **CelesTrak** | TLE subsets (active sats, decaying objects) | Public REST API / CSV | P1 |
+| **USSPACECOM TIP Messages** | Tracking and Impact Prediction for decaying objects | Via Space-Track.org | P1 |
+| **NOAA SWPC** | F10.7, Ap/Kp, Dst, solar wind; 3-day forecasts | Public REST API and FTP | P1 |
+| **ESA Space Weather Service** | F10.7, Kp cross-validation source | Public REST API | P1 |
+| **ESA DISCOS** | Physical object properties: mass, dimensions, shape, materials | REST API (account required) | P1 |
+| **IERS Bulletin A/B** | UT1-UTC offsets, polar motion | Public FTP (usno.navy.mil); SHA-256 verified on download | P1 |
+| **GFS / ECMWF** | Atmospheric winds and density, 0–80 km | NOMADS (NOAA) public FTP | P2 |
+| **ILRS / CDDIS** | Laser ranging POD products for validation | Public FTP | P2 (validation) |
+| **FIR/UIR boundaries** | FIR and UIR boundary polygons for airspace intersection | EUROCONTROL AIRAC dataset (subscription) for ECAC states; FAA Digital-Terminal Procedures for US; OpenAIP as fallback for non-AIRAC regions. GeoJSON format loaded into `airspace` table. Updated every 28 days on AIRAC cycle. | P1 |
+
+**Deprecated reference:** "18th SDS" → use **Space-Track.org** consistently.
+
+**ESA DISCOS redistribution rights (Finding 9):** ESA DISCOS is subject to an ESAC user agreement. Data may not be redistributed or used in commercial products without explicit ESA permission. SpaceCom is a commercial platform.
Required actions before Phase 2 shadow deployment: +- Obtain written clarification from ESA/ESAC on whether DISCOS-derived physical properties (mass, dimensions) may be: (a) used internally to drive SpaceCom's own predictions; (b) exposed in API responses to ANSP customers; (c) included in generated PDF reports +- If redistribution is not permitted, DISCOS data is used only as internal model input — API responses and reports show `source: estimated` rather than exposing raw DISCOS values; the `data_confidence` UI flag continues to show `● DISCOS` for internal tracking but is not labelled as DISCOS in customer-facing outputs +- Include the DISCOS redistribution clarification in the Phase 2 legal gate checklist alongside the Space-Track AUP opinion + +**Airspace data scope and SUA disclosure (Finding 4):** Phase 2 FIR/UIR scope covers ECAC states (EUROCONTROL AIRAC) and US FIRs (FAA). The following airspace types are explicitly **out of scope for Phase 2** and disclosed to users: +- Special Use Airspace (SUA): danger areas, restricted areas, prohibited areas (ICAO Annex 11) +- Terminal Manoeuvring Areas (TMAs) and Control Zones (CTRs) +- Oceanic FIRs (ICAO Annex 2 special procedures; OACCs handle coordination) + +A persistent disclosure note on the Airspace Impact Panel reads: *"SpaceCom FIR intersection analysis covers FIR/UIR boundaries only. It does not account for special use airspace, terminal areas, or oceanic procedures. Controllers must apply their local procedures for these airspace types."* Phase 3 consideration: SUA polygon overlay from national AIP sources. Document in `docs/adr/0014-airspace-scope.md`. + +All source URLs are hardcoded constants in `ingest/sources.py`. The outbound HTTP client blocks connections to private IP ranges. No source URL is configurable via API or database at runtime. + +**Space-Track AUP — conditional architecture (Finding 9):** The AUP clarification is a **Phase 1 architectural decision gate**, not a Phase 2 deliverable. 
The current design assumes shared ingest (a single SpaceCom Space-Track credential fetches TLEs for all organisations). If the AUP prohibits redistribution of derived predictions to customers who have not themselves agreed to the AUP, the ingest architecture must change: + +- **Path A — redistribution permitted:** Current shared-ingest design is valid. Each customer organisation's access is governed by SpaceCom's AUP click-wrap and the MSA. No architectural change. +- **Path B — redistribution not permitted:** Per-organisation Space-Track credentials required. Each ANSP/operator must hold their own Space-Track account. SpaceCom acts as a processing layer using each org's own credentials. Architecture change: `space_track_credentials` table (per-org, encrypted); per-org ingest worker configuration; significant additional complexity. + +The decision must be documented in `docs/adr/0016-space-track-aup-architecture.md` with the chosen path and evidence (written AUP clarification). This ADR is a prerequisite for Phase 1 ingest architecture finalisation — marked as a blocking decision in the Phase 1 DoD. + +**Space weather raw format specifications:** + +| Source | Endpoint constant | Format | Key fields consumed | +|--------|------------------|--------|-------------------| +| NOAA SWPC F10.7 | `NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"` | JSON array | `time_tag`, `flux` (solar flux units) | +| NOAA SWPC Kp/Ap | `NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"` | JSON array | `time_tag`, `kp_index`, `ap` | +| NOAA SWPC 3-day forecast | `NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"` | JSON | `Kp` array | +| ESA SWS Kp | `ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"` | REST JSON | `kp_index` (cross-validation) | + +An integration test asserts that each response contains the expected top-level keys. 
If a key is absent, the test fails and the schema change is caught before it reaches production ingest.
+
+**TLE validation at ingestion gate:** Before any TLE record is written to the database, `ingest/cross_validator.py` must verify:
+1. Both lines are exactly 69 characters (standard TLE format)
+2. Modulo-10 checksum passes on line 1 and line 2
+3. Epoch field parses to a valid UTC datetime
+4. `BSTAR` drag term is within physically plausible bounds (−0.5 to +0.5)
+
+Failed validation is logged to `security_logs` type `INGEST_VALIDATION_FAILURE` with the raw TLE and failure reason. The record is not written to the database.
+
+**TLE ingest idempotency — ON CONFLICT behaviour:** The `tle_sets` table has `UNIQUE (object_id, ingested_at)`. If the ingest worker runs twice for the same object within the same second (e.g., orphan recovery task + normal schedule overlap, or a worker restart mid-task), the second insert must not raise an exception or silently discard the row without tracking. Required semantics:
+
+```python
+# ingest/writer.py
+import structlog
+from sqlalchemy.dialects.postgresql import insert as pg_insert
+from sqlalchemy.ext.asyncio import AsyncSession
+
+async def write_tle_set(session: AsyncSession, tle: TLERecord) -> bool:
+    """Insert TLE record. Returns True if inserted, False if duplicate."""
+    stmt = pg_insert(TLESet).values(
+        object_id=tle.object_id,
+        ingested_at=tle.ingested_at,
+        tle_line1=tle.line1,
+        tle_line2=tle.line2,
+        epoch=tle.epoch,
+        source=tle.source,
+    ).on_conflict_do_nothing(
+        index_elements=["object_id", "ingested_at"]
+    ).returning(TLESet.object_id)
+
+    result = await session.execute(stmt)
+    inserted = result.rowcount > 0
+    if not inserted:
+        spacecom_ingest_tle_conflict_total.inc()  # metric; non-zero signals scheduling race
+        structlog.get_logger().debug("tle_insert_skipped_duplicate",
+                                     object_id=tle.object_id, ingested_at=tle.ingested_at)
+    return inserted
+```
+
+Prometheus counter `spacecom_ingest_tle_conflict_total` — a sustained non-zero rate warrants investigation of the Beat schedule overlap. A brief spike during worker restart is acceptable. 
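The four ingestion-gate checks above can be sketched as a standalone validator. This is a minimal sketch only: the function names (`validate_tle`, `tle_checksum_ok`, `parse_bstar`) are illustrative, the epoch check is reduced to a day-of-year range test, and the real gate in `ingest/cross_validator.py` records failures to `security_logs` rather than raising.

```python
# Illustrative sketch of the ingestion-gate checks; not the final
# ingest/cross_validator.py API, which logs failures instead of raising.

def tle_checksum_ok(line: str) -> bool:
    """Modulo-10 checksum: digits count at face value, '-' counts as 1."""
    total = sum(int(c) if c.isdigit() else (1 if c == "-" else 0) for c in line[:68])
    return total % 10 == int(line[68])

def parse_bstar(line1: str) -> float:
    """BSTAR field (columns 54-61): signed 5-digit mantissa with an implied
    leading decimal point, followed by a signed single-digit exponent."""
    raw = line1[53:61]
    sign = -1.0 if raw[0] == "-" else 1.0
    return sign * (int(raw[1:6]) / 1e5) * 10.0 ** int(raw[6:8])

def validate_tle(line1: str, line2: str) -> None:
    if len(line1) != 69 or len(line2) != 69:
        raise ValueError("TLE lines must be exactly 69 characters")
    if not (tle_checksum_ok(line1) and tle_checksum_ok(line2)):
        raise ValueError("modulo-10 checksum failure")
    day_of_year = float(line1[20:32])  # epoch field is YYDDD.DDDDDDDD
    if not 0.0 < day_of_year < 367.0:
        raise ValueError("epoch day-of-year out of range")
    if not -0.5 <= parse_bstar(line1) <= 0.5:
        raise ValueError("BSTAR outside plausible bounds")
```

A rejected record would then be written to `security_logs` with type `INGEST_VALIDATION_FAILURE` as specified above.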
+ +**Ingest idempotency requirement for all periodic tasks (F8 — §67):** TLE ingest uses `ON CONFLICT DO NOTHING` (above). All other periodic ingest tasks must use equivalent upsert semantics to survive celery-redbeat double-fire on restart: + +```sql +-- Space weather ingest: upsert on (fetched_at) unique constraint +INSERT INTO space_weather (fetched_at, kp, f107, ...) +VALUES (:fetched_at, :kp, :f107, ...) +ON CONFLICT (fetched_at) DO NOTHING; + +-- DISCOS object metadata: upsert on (norad_id) — update if data changed +INSERT INTO objects (norad_id, name, launch_date, ...) +VALUES (:norad_id, :name, :launch_date, ...) +ON CONFLICT (norad_id) DO UPDATE SET + name = EXCLUDED.name, + launch_date = EXCLUDED.launch_date, + updated_at = NOW() +WHERE objects.updated_at < EXCLUDED.updated_at; -- only update if newer + +-- IERS EOP: upsert on (date) unique constraint +INSERT INTO iers_eop (date, ut1_utc, x_pole, y_pole, ...) +VALUES (:date, :ut1_utc, :x_pole, :y_pole, ...) +ON CONFLICT (date) DO NOTHING; +``` + +Add unique constraints if not present: `UNIQUE (fetched_at)` on `space_weather`; `UNIQUE (date)` on `iers_eop`. These prevent double-write corruption at the DB level regardless of application retry logic. + +**IERS EOP cold-start requirement:** On a fresh deployment with no cached EOP data, astropy's `IERS_Auto` falls back to the bundled IERS-B table (which lags the current date by weeks to months), silently degrading `UT1-UTC` precision from ~1 ms (IERS-A) to ~10–50 ms (IERS-B). For epochs beyond the IERS-B table end date, astropy raises `IERSRangeError`, crashing all frame transforms. 
+
+The EOP ingest task must run as part of `make seed` before any propagation task starts:
+```bash
+# Makefile
+seed: migrate
+	docker compose exec backend python -m ingest.eop --bootstrap  # downloads + caches current IERS-A
+	docker compose exec backend python -m ingest.fir --bootstrap  # loads FIR boundaries
+	docker compose exec -T db psql -U spacecom -d spacecom < fixtures/dev_seed.sql  # SQL fixtures run via psql (service/user names per compose config)
+```
+
+The EOP ingest task in Celery Beat is ordered before the TLE ingest task: EOP runs at 00:00 UTC, TLE ingest at 00:10 UTC (ensuring fresh EOP before the first propagation of the day).
+
+**IERS EOP verification — dual-mirror comparison:** The IERS does not publish SHA-256 hashes alongside its EOP files. Comparing hash-against-prior-download detects corruption but not substitution. The correct approach is downloading from both the USNO mirror and the Paris Observatory mirror and verifying agreement:
+
+```python
+# ingest/eop.py
+IERS_MIRRORS = [
+    "https://maia.usno.navy.mil/ser7/finals2000A.all",
+    "https://hpiers.obspm.fr/iers/series/opa/eopc04",  # IERS-C04 series
+]
+
+async def fetch_and_verify_eop() -> bytes:
+    contents = []
+    for url in IERS_MIRRORS:
+        resp = await http_client.get(url, timeout=30)
+        resp.raise_for_status()
+        contents.append(resp.content)
+
+    # Verify UT1-UTC values agree within 0.1 ms across mirrors (format-normalised comparison)
+    if not _eop_values_agree(contents[0], contents[1], tolerance_ms=0.1):
+        structlog.get_logger().error("eop_mirror_disagreement")
+        spacecom_eop_mirror_agreement.set(0)
+        raise EOPVerificationError("IERS EOP mirrors disagree — rejecting both")
+
+    spacecom_eop_mirror_agreement.set(1)
+    return contents[0]  # USNO is primary; Paris Observatory is the verification witness
+```
+
+Prometheus gauge `spacecom_eop_mirror_agreement` (1 = mirrors agree, 0 = disagreement detected). Alert on `spacecom_eop_mirror_agreement == 0`.
+
+---
+
+## 12. 
Backend Directory Structure + +``` +backend/ + app/ + main.py # FastAPI app factory, middleware, router mounting + config.py # Settings via pydantic-settings (env vars); no secrets in code + auth/ + provider.py # AuthProvider protocol + LocalJWTProvider implementation + jwt.py # RS256 token issue, verify, refresh; key loaded from secrets + mfa.py # TOTP (pyotp); recovery code generation and verification + deps.py # get_current_user, require_role() dependency factory + middleware.py # Auth middleware; rate limit enforcement + frame_utils.py # TEME→GCRF→ITRF→WGS84 + IERS EOP refresh + hash verification + time_utils.py # Time system conversions + integrity.py # HMAC sign/verify for predictions and hazard zones + logging_config.py # Sanitising log formatter; security event logger + modules/ + catalog/ + router.py # /api/v1/objects; requires viewer role minimum + schemas.py + service.py + models.py + propagator/ + catalog.py # SGP4 catalog propagation + decay.py # RK7(8) + NRLMSISE-00 + Monte Carlo; HMAC-signs output + tasks.py # Celery tasks with time_limit, soft_time_limit + router.py # /api/v1/propagate, /api/v1/decay; requires analyst role + reentry/ + router.py # /api/v1/reentry; requires viewer role + service.py + corridor.py # Percentile corridor polygon generation + spaceweather/ + router.py # /api/v1/spaceweather; requires viewer role + service.py # Cross-validates NOAA SWPC vs ESA SWS; generates status string + tasks.py # Celery Beat: NOAA SWPC polling every 3h + noaa_swpc.py # NOAA SWPC client; URL hardcoded constant + esa_sws.py # ESA SWS cross-validation client + viz/ + router.py # /api/v1/czml; requires viewer role + czml_builder.py # CZML output; all strings HTML-escaped; J2000 INERTIAL frame + mc_geometry.py # MC trajectory binary blob pre-baking + ingest/ + sources.py # Hardcoded external URLs and IP allowlists (SSRF mitigation) + tasks.py # Celery Beat-scheduled tasks + spacetrack.py # Space-Track client; credentials from secrets manager only + 
celestrak.py # CelesTrak client + discos.py # ESA DISCOS client + iers.py # IERS EOP fetcher + SHA-256 verification + cross_validator.py # TLE and space weather cross-source comparison + alerts/ + router.py # /api/v1/alerts; requires operator role for acknowledge + service.py # Alert trigger evaluation; rate limit enforcement; deduplication + notifier.py # WebSocket push + email; storm detection + integrity_guard.py # TIP vs prediction cross-check; HMAC failure escalation + reports/ + router.py # /api/v1/reports; requires analyst role + builder.py # Section assembly; all user fields sanitised via bleach + renderer_client.py # Internal HTTPS call to renderer service with sanitised payload + security/ + audit.py # Security event logger; writes to security_logs + sanitiser.py # Log formatter that strips credential patterns + breakup/ + atmospheric.py + on_orbit.py + tasks.py + router.py + conjunction/ + screener.py + probability.py + tasks.py + router.py + weather/ + upper.py + lower.py + hazard/ + router.py + fusion.py # HMAC-signs all hazard_zones output; propagates shadow_mode flag + tasks.py + airspace/ + router.py + loader.py + intersection.py + notam/ + router.py # /api/v1/notam; requires operator role + drafter.py # ICAO Annex 15 format generation + disclaimer.py # Mandatory regulatory disclaimer text + space_portal/ + router.py # /api/v1/space; space_operator and orbital_analyst roles + owned_objects.py # Owned object CRUD; RLS enforcement + controlled_reentry.py # Deorbit window optimisation + ccsds_export.py # CCSDS OEM/CDM format export + api_keys.py # API key lifecycle management + launch_safety/ # Phase 3 + screener.py + router.py + reroute/ # Phase 3; strategic pre-flight avoidance boundary only + feedback/ # Phase 3; includes shadow_validation.py + migrations/ # Alembic; includes immutability triggers in initial migration + tests/ + conftest.py # db_session fixture (SAVEPOINT/ROLLBACK); testcontainers setup for Celery tests + physics/ + 
test_frame_utils.py + test_propagator/ + test_decay/ + test_nrlmsise.py + test_hypothesis.py # Hypothesis property-based tests (§42.3) + test_mc_corridor.py # MC seeded RNG corridor validation (§42.4) + test_breakup/ + test_integrity.py # HMAC sign/verify; tamper detection + test_auth.py # JWT; MFA; rate limiting; RBAC enforcement + test_rbac.py # Every endpoint tested for correct role enforcement + test_websocket.py # WS sequence replay; token expiry warning; close codes 4001/4002 + test_ingest/ + test_contracts.py # Space-Track + NOAA key presence AND value-range assertions + test_spaceweather/ + test_jobs/ + test_celery_failure.py # Timeout → 'failed'; orphan recovery Beat task + smoke/ # Post-deploy; all idempotent; run in ≤ 2 min; require smoke_user seed + test_api_health.py # GET /readyz → 200/207; GET /healthz → 200 + test_auth_smoke.py # Login → JWT; refresh → new token + test_catalog_smoke.py # GET /catalog → 200; 'data' key present + test_ws_smoke.py # WS connect → heartbeat within 5s + test_db_smoke.py # SELECT 1 via backend health endpoint + quarantine/ # Flaky tests awaiting fix; excluded from blocking CI (see §33.10 policy) + requirements.in # pip-tools source + requirements.txt # pip-compile output with hashes + Dockerfile # FROM pinned digest; non-root user; read-only FS +``` + +### 12.1 Repository `docs/` Directory Structure + +All documentation files live under `docs/` in the monorepo root. Files referenced elsewhere in this plan must exist at these paths. 
+ +``` +docs/ + README.md # Documentation index — what's here and where to look + MASTER_PLAN.md # This document + AGENTS.md # Guidance for AI coding agents working in this repo (see §33.9) + CHANGELOG.md # Keep a Changelog format; human-maintained; one entry per release + + adr/ # Architecture Decision Records (MADR format) + README.md # ADR index with status column + 0001-rs256-asymmetric-jwt.md + 0002-dual-frontend-architecture.md + 0003-monte-carlo-chord-pattern.md + 0004-geography-vs-geometry-spatial-types.md + 0005-lazy-raise-sqlalchemy.md + 0006-timescaledb-chunk-intervals.md + 0007-cesiumjs-commercial-licence.md + 0008-pgbouncer-transaction-mode.md + 0009-ccsds-oem-gcrf-reference-frame.md + 0010-alert-threshold-rationale.md + # ... continued; one ADR per consequential decision in §20 + + runbooks/ + README.md # Runbook index with owner and last-reviewed date + TEMPLATE.md # Standard runbook template (see §33.4) + db-failover.md + celery-recovery.md + hmac-failure.md + ingest-failure.md + gdpr-breach-notification.md + safety-occurrence-notification.md + secrets-rotation-jwt.md + secrets-rotation-spacetrack.md + secrets-rotation-hmac.md + blue-green-deploy.md + restore-from-backup.md + + model-card-decay-predictor.md # Living document; updated per model version (§32.1) + ood-bounds.md # OOD detection thresholds (§32.3) + recalibration-procedure.md # Recalibration governance (§32.4) + alert-threshold-history.md # Alert threshold change log (§24.8) + + query-baselines/ # EXPLAIN ANALYZE output; one file per critical query + czml_catalog_100obj.txt + fir_intersection_baseline.txt + # ... 
one file per query baseline recorded in Phase 1 + + validation/ # Validation procedure and reference data (§17) + README.md # How to run each validation suite + reference-data/ + vallado-sgp4-cases.json # Vallado (2013) SGP4 reference state vectors + iers-frame-test-cases.json # IERS precession-nutation reference cases + aerospace-corp-reentries.json # Historical re-entry outcomes for backcast validation + backcast-validation-v1.0.0.pdf # Phase 1 validation report (≥3 events) + backcast-validation-v2.0.0.pdf # Phase 2 validation report (≥10 events) + + api-guide/ # Persona E/F API developer documentation (§33.10) + README.md # API guide index + authentication.md + rate-limiting.md + webhooks.md + code-examples/ + python-quickstart.py + typescript-quickstart.ts + error-reference.md + + user-guides/ # Operational persona documentation (§33.7) + aviation-portal-guide.md # Persona A/B/C + space-portal-guide.md # Persona E/F + admin-guide.md # Persona D + + test-plan.md # Test suite index with scope and blocking classification (§33.11) + + public-reports/ # Quarterly transparency reports (§32.6) + # quarterly-accuracy-YYYY-QN.pdf + + legal/ # Legal opinion documents (MinIO primary; this dir for dev reference) + # legal-opinion-template.md +``` + +--- + +## 13. 
Frontend Directory Structure and Architecture + +``` +frontend/ + src/ + app/ + page.tsx # Operational Overview + watch/[norad_id]/page.tsx # Object Watch Page + events/ + page.tsx # Active Events + full Timeline/Gantt + [id]/page.tsx # Event Detail + airspace/page.tsx # Airspace Impact View + analysis/page.tsx # Analyst Workspace + catalog/page.tsx # Object Catalog + reports/ + page.tsx + [id]/page.tsx + admin/page.tsx # System Administration (admin role only) + space/ + page.tsx # Space Operator Overview + objects/ + page.tsx # My Objects Dashboard (space_operator: owned only) + [norad_id]/page.tsx # Object Technical Detail + reentry/ + plan/page.tsx # Controlled Re-entry Planner + conjunction/page.tsx # Conjunction Screening (orbital_analyst) + analysis/page.tsx # Orbital Analyst Workspace + export/page.tsx # Bulk Export + api/page.tsx # API Keys + Documentation + layout.tsx # Root layout: nav, ModeIndicator, AlertBadge, + # JobsPanel; applies security headers via middleware + + middleware.ts # Next.js middleware: enforce HTTPS, set CSP + # and security headers on every response, + # redirect unauthenticated users to /login + + components/ + globe/ + CesiumViewer.tsx + LayerPanel.tsx + ViewToggle.tsx + ClusterLayer.tsx + CorridorLayer.tsx + corridor/ + PercentileCorridors.tsx # Mode A + ProbabilityHeatmap.tsx # Mode B (Phase 2) + ParticleTrajectories.tsx # Mode C (Phase 3) + UncertaintyModeSelector.tsx + plan/ + PlanView.tsx # Phase 2 + AltitudeCrossSection.tsx # Phase 2 + timeline/ + TimelineStrip.tsx + TimelineGantt.tsx + TimelineControls.tsx + ModeIndicator.tsx + panels/ + ObjectInfoPanel.tsx + PredictionPanel.tsx # Includes HMAC status indicator + AirspaceImpactPanel.tsx # Phase 2 + ConjunctionPanel.tsx # Phase 2 + alerts/ + AlertBanner.tsx + AlertBadge.tsx + NotificationCentre.tsx + AcknowledgeDialog.tsx + jobs/ + JobsPanel.tsx + JobProgressBar.tsx + SimulationComparison.tsx + spaceweather/ + SpaceWeatherWidget.tsx + reports/ + ReportConfigDialog.tsx + 
ReportPreview.tsx + space/ + SpaceOverview.tsx + OwnedObjectCard.tsx + ControlledReentryPlanner.tsx + DeorbitWindowList.tsx + ApiKeyManager.tsx + CcsdsExportPanel.tsx + ShadowBanner.tsx # Amber banner displayed when shadow mode active + notam/ + NotamDraftViewer.tsx + NotamCancellationDialog.tsx + NotamRegulatoryDisclaimer.tsx + shadow/ + ShadowModeIndicator.tsx + ShadowValidationReport.tsx + dashboard/ + EventSummaryCard.tsx + SystemHealthCard.tsx + shared/ + DataConfidenceBadge.tsx + IntegrityStatusBadge.tsx # ✓ HMAC verified / ✗ HMAC failed + UncertaintyBound.tsx + CountdownTimer.tsx + + hooks/ + useObjects.ts + usePrediction.ts # Polls HMAC status; shows warning if failed + useEphemeris.ts + useSpaceWeather.ts + useAlerts.ts + useSimulation.ts + useCZML.ts + useWebSocket.ts # Cookie-based auth; per-user connection limit + + stores/ # Zustand — UI state only; no API responses + timelineStore.ts # Mode, playhead position, playback speed + selectionStore.ts # Selected object/event/zone IDs + layerStore.ts # Layer visibility, corridor display mode + jobsStore.ts # Active job IDs (content fetched via TanStack Query) + alertStore.ts # Unread count, mute rules + uiStore.ts # Panel state, theme (dark/light/high-contrast) + + lib/ + api.ts # Typed fetch wrapper; credentials: 'include' + # for httpOnly cookie auth; never reads tokens + czml.ts + ws.ts # wss:// enforced; cookie auth at upgrade + corridorGeometry.ts + mcBinaryDecoder.ts + reportUtils.ts + + types/ + objects.ts + predictions.ts # Includes hmac_status, integrity_failed fields + alerts.ts + spaceweather.ts + simulation.ts + czml.ts + + public/ + branding/ + middleware.ts # Root Next.js middleware for security headers + next.config.ts # Content-Security-Policy defined here for SSR + tsconfig.json + package.json + package-lock.json # Committed; npm ci used in Docker builds +``` + +### 13.0 Accessibility Standard Commitment + +**Minimum standard: WCAG 2.1 Level AA** (ISO/IEC 40500:2012), which is incorporated by 
reference into **EN 301 549 v3.2.1** — the mandatory accessibility standard for ICT procured by EU public sector bodies including ESA. Failure to meet EN 301 549 is a bid disqualifier for any EU public sector tender. + +All frontend work must meet these criteria before a PR is merged: +- WCAG 2.1 AA automated check passes (`axe-core` — see §42) +- Keyboard-only operation possible for all primary operator workflows +- Screen reader (NVDA + Firefox; VoiceOver + Safari) tested for primary workflow on each release +- Colour contrast ≥ 4.5:1 for all informational text; ≥ 3:1 for UI components and graphical elements +- No functionality conveyed by colour alone + +**Deliverable:** Accessibility Conformance Report (ACR / VPAT 2.4) produced before Phase 2 ESA bid submission. Maintained thereafter for each major release. + +**UTC-only rule for operational interface (F1):** ICAO Annex 2 and Annex 15 mandate UTC for all aeronautical operational communications. The following is a hard rule — no exceptions without explicit documentation and legal/safety sign-off: +- All times displayed in Persona A/C operational views (alert panels, event detail, NOTAM draft, shift handover) are **UTC only**, formatted as `HH:MMZ` or `DD MMM YYYY HH:MMZ` +- No timezone conversion widget or local-time toggle in the operational interface +- Local time display is permitted only in non-operational views (account settings, admin billing pages) and must be clearly labelled with the timezone name +- The `Z` suffix or `UTC` label is persistently visible — never hidden in a tooltip or hover state +- All API timestamps returned as ISO 8601 UTC (`2026-03-22T14:00:00Z`) — never local time strings + +--- + +### 13.1 State Management Separation + +**TanStack Query:** All API-derived data — object lists, predictions, ephemeris, space weather, alerts, simulation results. Handles caching, background refetch, and stale-while-revalidate. 
+ +**Zustand:** Pure UI state with no server dependency — selected IDs, layer visibility, timeline mode and position, panel open/closed state, theme, alert mute rules. + +**URL state (nuqs):** Shareable, bookmarkable — selected NORAD ID, active event ID, time position in replay mode, active layer set. Browser back/forward works correctly. Requires `NuqsAdapter` wrapping the App Router root layout to hydrate correctly on SSR. + +**Never in state:** Raw API response bodies. No `useEffect` that writes API responses into Zustand. + +**Authentication in the client:** The `api.ts` fetch wrapper uses `credentials: 'include'` to send the `httpOnly` auth cookie automatically. The client never reads, stores, or handles the JWT token directly — it is invisible to JavaScript. CSRF is mitigated by `SameSite=Strict` on the cookie. + +**Next.js App Router component boundary (ADR 0018):** The project uses **App Router**. The globe and all operational views are client components; static pages (onboarding, settings, admin) are React Server Components where practical. 
+
+| Route group | RSC/Client | Rationale |
+|---|---|---|
+| `app/(globe)/` — operational views | `"use client"` root layout | CesiumJS, WebSocket, Zustand hooks require browser APIs |
+| `app/(static)/` — onboarding, settings | Server Components by default | No browser APIs needed; faster initial load |
+| `app/(auth)/` — login, MFA | Server Components + Client islands | Form validation islands only |
+
+Rules enforced in `AGENTS.md`:
+- Never add `"use client"` to a leaf component without a comment explaining which browser API requires it
+- `app/(globe)/layout.tsx` is the single `"use client"` boundary for all operational views — child components inherit it without re-declaring
+- `nuqs` requires `<NuqsAdapter>` at the root of `app/(globe)/layout.tsx`
+
+**TanStack Query key factory** (`src/lib/queryKeys.ts`) — stable hierarchical keys prevent cache invalidation bugs:
+
+```typescript
+export const queryKeys = {
+  objects: {
+    all: () => ['objects'] as const,
+    list: (f: ObjectFilters) => ['objects', 'list', f] as const,
+    detail: (id: number) => ['objects', 'detail', id] as const,
+    tleHistory: (id: number) => ['objects', id, 'tle-history'] as const,
+  },
+  predictions: {
+    byObject: (id: number) => ['predictions', id] as const,
+  },
+  alerts: {
+    all: () => ['alerts'] as const,
+    unacked: (orgId: number) => ['alerts', 'unacked', orgId] as const,
+  },
+  jobs: {
+    detail: (jobId: string) => ['jobs', jobId] as const,
+  },
+} as const;
+// On WS alert.new: queryClient.invalidateQueries({ queryKey: queryKeys.alerts.all() })
+// On acknowledge mutation: optimistic setQueryData, then invalidate on settle
+```
+
+**React error boundary hierarchy** — a CesiumJS crash must never remove the alert panel from the DOM:
+
+```tsx
+// app/(globe)/layout.tsx — outer wrapper and fallback names illustrative
+<AppErrorBoundary fallback={<FullPageError />}>
+  <GlobeErrorBoundary fallback={<GlobeUnavailable />}>
+    <GlobeCanvas />        {/* WebGL context loss isolated here */}
+  </GlobeErrorBoundary>
+  <AlertPanel />           {/* Survives globe crash */}
+  <EventList />
+  <PredictionPanel />
+</AppErrorBoundary>
+```
+
+`GlobeUnavailable` displays: *"Globe unavailable — WebGL context lost. 
Re-entry event data below remains operational."* Alert and event panels remain visible and functional. Add `GlobeErrorBoundary` to `AGENTS.md` safety-critical component list. + +**Loading and empty state specification** — for safety-critical panels, loading and empty must be visually distinct from each other and from error: + +| State | Visual treatment | Required text | +|---|---|---| +| Loading | Skeleton matching panel layout | — | +| Empty | Explicit affirmative message | `AlertPanel`: "No unacknowledged alerts"; `EventList`: "No active re-entry events" | +| Error | Inline error with retry button | Never blank | + +Rule: safety-critical panels (`AlertPanel`, `EventList`, `PredictionPanel`) must **never render blank**. `DataConfidenceBadge` must always show a value — display `"Unknown"` explicitly, never render nothing. + +**WebSocket reconnection policy** (`src/lib/ws.ts`): + +```typescript +const RECONNECT = { + initialDelayMs: 1_000, + maxDelayMs: 30_000, + multiplier: 2, + jitter: 0.2, // ±20% — spreads reconnections after mass outage/deploy +}; +// TOKEN_EXPIRY_WARNING handler: trigger silent POST /auth/token/refresh; +// on success send AUTH_REFRESH; on failure show re-login modal (60s grace before disconnect) +// Reconnect sends ?since_seq= for missed event replay +``` + +**Operational mode guard** (`src/hooks/useModeGuard.ts`) — enforces LIVE/SIMULATION/REPLAY write restrictions: + +```typescript +export function useModeGuard(allowedModes: OperationalMode[]) { + const { mode } = useTimelineStore(); + return { isAllowed: allowedModes.includes(mode), currentMode: mode }; +} +// Usage: const { isAllowed } = useModeGuard(['LIVE']); +// All write-action components (acknowledge alert, submit NOTAM draft, trigger prediction) +// must call useModeGuard(['LIVE']) and disable + annotate button in other modes. 
+``` + +**Deck.gl + CesiumJS integration** — use `DeckLayer` from `@deck.gl/cesium` (rendered inside CesiumJS as a primitive; correct z-order and shared input handling). Never use a separate Deck.gl canvas: + +```typescript +import { DeckLayer } from '@deck.gl/cesium'; +import { HeatmapLayer } from '@deck.gl/aggregation-layers'; + +const deckLayer = new DeckLayer({ + layers: [new HeatmapLayer({ id: 'mc-heatmap', data: mcTrajectories, + getPosition: d => [d.lon, d.lat], getWeight: d => d.weight, + radiusPixels: 30, intensity: 1, threshold: 0.03 })], +}); +viewer.scene.primitives.add(deckLayer); +// Remove when switching away from Mode B: viewer.scene.primitives.remove(deckLayer) +``` + +**CesiumJS client-side memory constraints:** + +| Constraint | Value | Enforcement | +|---|---|---| +| Max CZML entity count in globe | 500 | Prune lowest-perigee objects beyond 500; `useCZML` monitors count | +| Orbit path duration | 72h forward / 24h back | Longer paths accumulate geometry | +| Heatmap cell resolution (Mode B) | 0.5° × 0.5° | Higher resolution requires more GPU memory | +| Stale entity pruning | Remove entities not updated in 48h | Prevents ghost entities in long sessions | +| Globe entity count Prometheus metric | `spacecom_globe_entity_count` (gauge) | WARNING alert at 450; prune trigger at 500 | + +**Bundle size budget and dynamic imports:** + +| Bundle | Strategy | Budget (gzipped) | +|---|---|---| +| Login / onboarding / settings | Static; no CesiumJS/Deck.gl | < 200 KB | +| Globe route initial load | CesiumJS lazy-loaded; spinner shown | < 500 KB before CesiumJS | +| Globe fully loaded | CesiumJS + Deck.gl + app | < 8 MB | + +```typescript +// src/components/globe/GlobeCanvas.tsx +import dynamic from 'next/dynamic'; +const CesiumViewer = dynamic( + () => import('./CesiumViewerInner'), + { ssr: false, loading: () => } +); +``` + +`bundlewatch` (or `@next/bundle-analyzer`) in CI; warning (non-blocking) if initial route bundle exceeds budget. 
Baseline stored in `.bundle-size-baseline`.
+
+---
+
+### 13.2 Accessible Parallel Table View (F4)
+
+The CesiumJS WebGL globe is inherently inaccessible: no keyboard navigation, no screen reader support, no motor-impairment accommodation. All interactions available via the globe must also be available via a **parallel data table view**.
+
+**Component:** `src/components/globe/ObjectTableView.tsx`
+
+- Accessible via keyboard shortcut `Alt+T` from any operational view, and via a persistent visible "Table view" button in the globe toolbar
+- Displays all objects currently rendered on the globe: NORAD ID, name, orbit type, conjunction status badge, predicted re-entry window, alert level
+- Sortable by any column (`aria-sort` updated on header click/keypress); filterable by alert level
+- Row selection focuses the object's Event Detail panel (same as map click)
+- All alert acknowledgement actions reachable from the table view — no functionality requires the globe
+- Implemented as `<table>` with `<thead>`, `<tbody>`, `<th scope="col">`, `<tr>` — no ARIA table role substitutes where native HTML suffices
+- Pagination or virtual scroll for large object sets; `aria-rowcount` and `aria-rowindex` set correctly for virtualised rows
+
+The table view is the **primary interaction surface** for users who cannot use the map. It must be functionally complete, not a read-only summary.
+
+---
+
+### 13.3 Keyboard Navigation Specification (F6)
+
+All primary operator workflows must be completable by keyboard alone. Required implementation:
+
+**Skip links** (rendered as the first focusable element in the page, visible on focus):
+```html
+<a class="skip-link" href="#primary-nav">Skip to navigation</a>
+<a class="skip-link" href="#alert-panel">Skip to alert panel</a>
+<a class="skip-link" href="#main-content">Skip to main content</a>
+```
+
+**Focus ring:** Minimum 3px solid outline, ≥ 3:1 contrast against adjacent colours (WCAG 2.2 SC 2.4.13 Focus Appearance; exceeds the WCAG 2.1 AA baseline). Never `outline: none` without a custom focus indicator. Defined in design tokens: `--focus-ring: 3px solid #4A9FFF`.
+
+**Tab order:** Follows DOM order (no `tabindex > 0`). Logical flow: nav → alert panel → map toolbar → main content. Modal dialogs trap focus within the dialog while open; focus returns to the trigger element on close.
+
+**Application keyboard shortcuts (all documented in UI via `?` help overlay):**
+
+| Shortcut | Action |
+|----------|--------|
+| `Alt+A` | Focus most-recent active CRITICAL alert |
+| `Alt+T` | Toggle table / globe view |
+| `Alt+H` | Open shift handover view |
+| `Alt+N` | Open NOTAM draft for active event |
+| `?` | Open keyboard shortcut reference overlay |
+| `Escape` | Close modal / dismiss non-CRITICAL overlay |
+| `Arrow keys` | Navigate within alert list, table rows, accordion items |
+
+All shortcuts declared via `aria-keyshortcuts` on their trigger elements. No shortcut conflicts with browser or screen reader reserved keys.
+
+---
+
+### 13.4 Colour and Contrast Specification (F7)
+
+All colour pairs must meet WCAG 2.1 AA contrast requirements. Documented in `frontend/src/tokens/colours.ts` as design tokens; no hardcoded colour values in component files. 
+ +**Operational severity palette (dark theme — `background: #1A1A2E`):** + +| Severity | Background | Text | Contrast ratio | Status | +|----------|-----------|------|---------------|--------| +| CRITICAL | `#7B4000` | `#FFFFFF` | 7.2:1 | ✓ AA | +| HIGH | `#7A3B00` | `#FFD580` | 5.1:1 | ✓ AA | +| MEDIUM | `#1A3A5C` | `#90CAF9` | 4.6:1 | ✓ AA | +| LOW | `#1E3A2F` | `#81C784` | 4.5:1 | ✓ AA (minimum) | +| Focus ring | `#1A1A2E` | `#4A9FFF` | 4.8:1 | ✓ AA | + +All pairs verified with the APCA algorithm for large display text (corridor labels on the globe). If a colour fails at the target background, the background is adjusted — the text colour is kept consistent for operator recognition. + +**Number formatting (F4):** Probability values, altitudes, and distances must be formatted correctly across locales: +- **Operational interface (Persona A/C):** Always use ICAO-standard decimal point (`.`) regardless of browser locale — deviating from locale convention is intentional and matches ICAO Doc 8400 standards; this is documented as an explicit design decision +- **Admin / reporting / Space Operator views:** Use `Intl.NumberFormat(locale)` for locale-aware formatting (comma decimal separator in DE/FR/ES locales) +- Helper: `formatOperationalNumber(n: number): string` — always `.` decimal, 3 significant figures for probabilities; `formatDisplayNumber(n: number, locale: string): string` — locale-aware +- Never use raw `Number.toString()` or `n.toFixed()` in JSX — both ignore locale + +**Non-colour severity indicators (F5):** Colour must never be the sole differentiator. 
Each severity level also carries:
+
+| Severity | Icon/shape | Text label | Border width |
+|----------|-----------|-----------|-------------|
+| CRITICAL | ⬟ (pentagon) | "CRITICAL" always visible | 3px solid |
+| HIGH | ▲ (triangle) | "HIGH" always visible | 2px solid |
+| MEDIUM | ● (circle) | "MEDIUM" always visible | 1px solid |
+| LOW | ○ (circle outline) | "LOW" always visible | 1px dashed |
+
+The 1 Hz CRITICAL colour cycle (§28.3 habituation countermeasure) must also include a redundant non-colour animation: 1 Hz border-width pulse (2px → 4px → 2px). Users with `prefers-reduced-motion: reduce` see a static thick border instead (see §28.3 reduced-motion rules).
+
+---
+
+### 13.5 Internationalisation Architecture (F5, F8, F11)
+
+**Language scope — Phase 1:** English only. No other locale is served. This is not a gap — it is an explicit decision that allows Phase 1 to ship without a localisation workflow. The architecture is designed so that adding a new locale requires only adding a `messages/{locale}.json` file and testing; no component code changes.
+
+**String externalisation strategy:**
+- Library: `next-intl` (native Next.js App Router support, RSC-compatible, type-safe message keys)
+- Source of truth: `messages/en.json` — all user-facing strings, namespaced by feature area
+- Message ID convention: `{feature}.{component}.{element}` e.g. `alerts.critical.title`, `handover.accept.button`
+- No bare string literals in JSX (enforced by `eslint-plugin-i18n-json` or equivalent)
+- **ICAO-fixed strings are excluded from i18n scope** and must never appear in `messages/en.json` — they are hardcoded constants. Examples: `NOTAM`, `UTC`, `SIGMET`, category codes (`NOTAM_ISSUED`), ICAO phraseology in NOTAM templates. 
These are annotated `// ICAO-FIXED: do not translate` in source.
+
+```
+messages/
+  en.json    # Source of truth — Phase 1 complete
+  fr.json    # Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy)
+  de.json    # Phase 3 scaffold
+```
+
+**CSS logical properties (F8):** All new components use CSS logical properties instead of directional utilities, making RTL support a configuration change rather than a code rewrite:
+
+| Avoid | Use instead |
+|-------|------------|
+| `margin-left`, `ml-*` | `margin-inline-start`, `ms-*` |
+| `margin-right`, `mr-*` | `margin-inline-end`, `me-*` |
+| `padding-left`, `pl-*` | `padding-inline-start`, `ps-*` |
+| `padding-right`, `pr-*` | `padding-inline-end`, `pe-*` |
+| `left: 0` | `inset-inline-start: 0` |
+| `text-align: left` | `text-align: start` |
+
+The `<html>` element carries `dir="ltr"` (hardcoded for Phase 1). When an RTL locale is added, this becomes `dir={locale.dir}` — no component changes required. RTL testing with an Arabic locale is a Phase 3 gate before any Middle East deployment.
+
+**Altitude and distance unit display (F9):** The aviation and space domains use different unit conventions. All altitudes and distances are stored and transmitted in **metres** (SI base unit) in the database and API.
The display layer converts based on `users.altitude_unit_preference`: + +| Role default | Unit | Display example | +|---|---|---| +| `ansp_operator` | `ft` | `39,370 ft (FL394)` | +| `space_operator` | `km` | `12.0 km` | +| `analyst` | `km` | `12.0 km` | + +Rules: +- Unit label always shown alongside the value — no bare numbers +- `aria-label` provides full unit name: `aria-label="39,370 feet (Flight Level 394)"` +- User can override their default in account settings via `PATCH /api/v1/users/me` +- API always returns metres; unit conversion is client-side only +- FL (Flight Level) shown in parentheses for `ft` display when altitude > 0 ft MSL and context is airspace + +**Altitude datum labelling (F11 — §62):** The SGP4 propagator and NRLMSISE-00 output altitudes above the WGS-84 ellipsoid. Aviation altimetry uses altitude above Mean Sea Level (MSL). The geoid height (difference between ellipsoid and MSL) varies globally from approximately −106 m to +85 m (EGM2008). For operational altitudes (below ~25 km / 82,000 ft during re-entry terminal phase), this difference is significant. + +**Required labelling rule:** All altitude displays must specify the datum. The datum is a non-configurable system constant per altitude context: + +| Altitude context | Datum | Display example | Notes | +|-----------------|-------|-----------------|-------| +| Orbital altitude (> 80 km) | WGS-84 ellipsoid | `185 km (ellipsoidal)` | SGP4 output; geoid difference negligible at orbital altitudes | +| Re-entry corridor boundary | WGS-84 ellipsoid | `80 km (ellipsoidal)` | Model boundary altitude | +| Fragment impact altitude | WGS-84 ellipsoid | `0 km (ellipsoidal)` → display as ground level | Converted at display time | +| Airspace sector boundary (FL) | QNH barometric | `FL390` / `39,000 ft (QNH)` | Aviation standard; NOT ellipsoidal | +| Terrain clearance / NOTAM lower bound | MSL (approx. 
ellipsoidal for > 1,000 ft) | `5,000 ft MSL` | Use `MSL` label explicitly |
+
+**Implementation:** `formatAltitude(metres, context)` helper accepts a `context` parameter (`'orbital' | 'airspace' | 'notam'`) and appends the appropriate datum label. The datum label is rendered in a smaller secondary font weight alongside the altitude value — not in `aria-label` alone.
+
+**API response datum field:** The prediction API response must include `altitude_datum: "WGS84_ELLIPSOIDAL"` alongside any altitude value. Consumers must not assume a datum that is not stated.
+
+**Future locale addition checklist** (documented in `docs/ADDING_A_LOCALE.md`):
+1. Add `messages/{locale}.json` translated by a native-speaker aviation professional
+2. Verify all ICAO-fixed strings are excluded from translation
+3. Set `dir` for the locale (ltr/rtl)
+4. Run automated RTL layout tests if `dir=rtl`
+5. Confirm operational time display still shows UTC (not locale timezone)
+6. Legal review of any jurisdiction-specific compliance text
+
+---
+
+### 13.6 Contribution Workflow (F3)
+
+`CONTRIBUTING.md` at the repository root is a required document. It defines how contributors (internal engineers, auditors, future ESA-directed reviewers) engage with the codebase.
+
+**Branch naming convention:**
+| Branch type | Pattern | Example |
+|---|---|---|
+| Feature | `feature/{ticket-id}-short-description` | `feature/SC-142-decay-unit-pref` |
+| Bug fix | `fix/{ticket-id}-short-description` | `fix/SC-200-hmac-null-check` |
+| Chore / dependency | `chore/{description}` | `chore/bump-fastapi-0.115` |
+| Release | `release/{semver}` | `release/1.2.0` |
+| Hotfix | `hotfix/{semver}` | `hotfix/1.1.1` |
+
+No direct commits to `main`. All changes via merge request. `main` is branch-protected: 1 required approval, all status checks must pass, no force-push.
+
+**Commit message format:** [Conventional Commits](https://www.conventionalcommits.org/) — `type(scope): description`.
Types: `feat`, `fix`, `chore`, `docs`, `refactor`, `test`, `ci`. Example: `feat(decay): add p01/p99 tail risk columns`.
+
+**MR template** (`.gitlab/merge_request_templates/Default.md` — the delivery platform is self-hosted GitLab):
+```markdown
+## Summary
+
+
+## Linked ticket
+
+
+## Checklist
+- [ ] `make test` passes locally
+- [ ] OpenAPI spec regenerated (`make generate-openapi`) if API changed
+- [ ] CHANGELOG.md updated under `[Unreleased]`
+- [ ] axe-core accessibility check passes if UI changed
+- [ ] Contract test passes if API response shape changed
+- [ ] ADR created if an architectural decision was made
+```
+
+**Review SLA:** Merge requests must receive a first review within **1 business day** of opening. Stale MRs (no activity > 3 business days) are labelled `stale` automatically.
+
+---
+
+### 13.7 Architecture Decision Records (F4)
+
+ADRs (Nygard format) are the lightweight record for code-level and architectural decisions. They live in `docs/adr/` and are numbered sequentially.
+
+**When to write an ADR:** Any decision that is:
+- Hard to reverse (e.g., choosing a library, a DB schema approach, an algorithm)
+- Likely to confuse a future contributor who finds the code without context
+- Required by a public-sector procurement framework (ESA specifically requests evidence of a structured decision process)
+- Referenced in a specialist review appendix (§45–§54 all reference ADR numbers)
+
+**Format** (`docs/adr/NNNN-title.md`):
+```markdown
+# ADR NNNN: Title
+
+**Status:** Proposed | Accepted | Deprecated | Superseded by ADR MMMM
+**Date:** YYYY-MM-DD
+
+## Context
+What problem are we solving? What constraints apply?
+
+## Decision
+What did we decide?
+
+## Consequences
+What becomes easier? What becomes harder? What is now out of scope?
+``` + +**Known ADRs referenced in this plan:** + +| ADR | Topic | +|-----|-------| +| 0001 | FastAPI over Django REST Framework | +| 0002 | TimescaleDB + PostGIS for orbital time-series | +| 0003 | CesiumJS + Deck.gl for 3D globe rendering | +| 0004 | next-intl for string externalisation | +| 0005 | Append-only alert_events with HMAC signing | +| 0016 | NRLMSISE-00 vs JB2008 atmospheric density model | + +All ADR numbers referenced in this document must have a corresponding `docs/adr/NNNN-*.md` file before Phase 2 ESA submission. New ADRs start at the next available number. + +--- + +### 13.8 Developer Environment Setup (F6) + +`docs/DEVELOPMENT.md` is a required onboarding document. A new engineer must be able to run a fully functional local environment within **30 minutes** of reading it. The document covers: + +1. **Prerequisites:** Python 3.11 (pinned in `.python-version`), Node.js 20 LTS, Docker Desktop, `make` +2. **Environment bootstrap:** + ```bash + cp .env.example .env # review and fill required values + make init-dirs # creates logs/, exports/, config/, backups/ on host + make dev-up # docker compose up -d postgres redis minio + make migrate # alembic upgrade head + make seed # load development fixture data (10 tracked objects, sample TIPs) + make dev # starts: uvicorn + Next.js dev server + Celery worker + ``` +3. **Running tests:** + ```bash + make test # full test suite (backend + frontend) + make test-backend # backend only (pytest) + make test-frontend # frontend only (jest + playwright) + make test-e2e # Playwright end-to-end (requires make dev running) + ``` +4. **Useful local URLs:** + - API: `http://localhost:8000` / Swagger UI: `http://localhost:8000/docs` + - Frontend: `http://localhost:3000` + - MinIO console: `http://localhost:9001` (credentials in `.env.example`) +5. **Common issues:** documented in a `## Troubleshooting` section covering: Docker port conflicts, TimescaleDB first-run migration failure, CesiumJS ion token missing. 
+
+`.env.example` is committed and kept up-to-date with all required variables (no values — keys only). `.env` is in `.gitignore` and must never be committed.
+
+---
+
+### 13.9 Docs-as-Code Pipeline (F10)
+
+All project documentation (this plan, runbooks, ADRs, OpenAPI spec, data provenance records) is version-controlled in the repository and validated by CI.
+
+**Documentation site:** MkDocs Material. Source in `docs/`. Published to GitLab Pages on merge to `main`. Configuration in `mkdocs.yml`.
+
+**CI documentation checks (run on every merge request):**
+- `mkdocs build --strict` — fails on broken links, missing pages, invalid nav
+- `markdown-link-check docs/` — external link validation (warns, does not fail, to avoid flaky CI on transient outages)
+- `openapi-diff` — spec drift check (see §14 F1)
+- `vale --config=.vale.ini docs/` — prose style linter (SpaceCom style guide: no passive voice in runbooks, consistent terminology table for `re-entry` vs `reentry`)
+
+**ESA submission artefact:** The MkDocs build output (static HTML) is archived as a CI artefact on each release tag. This provides a reproducible, point-in-time documentation snapshot for the ESA bid submission. The submission artefact is `docs-site-{version}.zip`, attached to the GitLab release assets.
+
+**Docs owner:** Each section of the documentation has an `owner:` frontmatter field. The owner is responsible for keeping the section current after their feature area changes. Missing or stale ownership is flagged by a quarterly `docs-review` issue auto-created by a GitLab scheduled pipeline.
+
+---
+
+## 14. API Design
+
+Base path: `/api/v1`. All endpoints require authentication (minimum `viewer` role) unless noted. Role requirements listed per group.
+
+### System (no auth required)
+- `GET /health` — liveness probe; returns `200 {"status": "ok", "version": "<semver>"}` if the process is running. Used by Docker/Kubernetes liveness probe and load balancer health check.
Does **not** check downstream dependencies — a healthy response means only that the API process is alive. +- `GET /readyz` — readiness probe; returns `200 {"status": "ready", "checks": {...}}` when all dependencies are reachable. Returns `503` if any required dependency is unhealthy. Checks performed: PostgreSQL (query `SELECT 1`), Redis (PING), Celery worker queue depth < 1000. Used by DR automation to confirm the new primary is accepting traffic before updating DNS (§26.3). Also included in OpenAPI spec under `tags: ["System"]`. + +```json +// GET /readyz — healthy response example +{ + "status": "ready", + "checks": { + "postgres": "ok", + "redis": "ok", + "celery_queue_depth": 42 + }, + "version": "1.2.3" +} +// GET /readyz — unhealthy response (503) +{ + "status": "not_ready", + "checks": { + "postgres": "ok", + "redis": "error: connection refused", + "celery_queue_depth": 42 + } +} +``` + +### Auth +- `POST /auth/token` — login; returns `httpOnly` cookie (access) + `httpOnly` cookie (refresh); rate-limited 10/min/IP +- `POST /auth/token/refresh` — rotate refresh token; rate-limited +- `POST /auth/mfa/verify` — complete MFA; issues full-access token +- `POST /auth/logout` — revoke refresh token; clear cookies + +### Catalog (`viewer` minimum) +- `GET /objects` — list/search (paginated; filter by type, perigee, decay status, data_confidence) +- `GET /objects/{norad_id}` — detail with TLE, physical properties, data confidence annotation +- `POST /objects` — manual entry (`operator` role) +- `GET /objects/{norad_id}/tle-history` — full TLE history including cross-validation status + +### Propagation (`analyst` role) +- `POST /propagate` — submit catalog propagation job +- `GET /propagate/{task_id}` — poll status +- `GET /objects/{norad_id}/ephemeris?start=&end=&step=` — time range and step validation (Finding 7): + + | Parameter | Constraint | Error code | + |---|---|---| + | `start` | ≥ TLE epoch − 7 days; ≤ now + 90 days | `EPHEMERIS_START_OUT_OF_RANGE` | + | 
`end` | `start < end ≤ start + 30 days` | `EPHEMERIS_END_OUT_OF_RANGE` | + | `step` | ≥ 10 seconds and ≤ 86,400 seconds | `EPHEMERIS_STEP_OUT_OF_RANGE` | + | Computed points | `(end − start) / step ≤ 100,000` | `EPHEMERIS_TOO_MANY_POINTS` | + +### Decay Prediction (`analyst` role) +- `POST /decay/predict` — submit decay job; returns `202 Accepted` (Finding 3). **MC concurrency gate:** per-organisation Redis semaphore limits to 1 concurrent MC run (Phase 1); 2 for `analyst`+ (Phase 2); `429 + Retry-After` on limit; `admin` bypasses. + + **Async job lifecycle (Finding 3):** + ``` + POST /decay/predict + Idempotency-Key: ← optional; prevents duplicate on retry + → 202 Accepted + { + "jobId": "uuid", + "status": "queued", + "statusUrl": "/jobs/uuid", + "estimatedDurationSeconds": 45 + } + + GET /jobs/{job_id} + → 200 OK + { + "jobId": "uuid", + "status": "running" | "complete" | "failed" | "cancelled", + "resultUrl": "/decay/predictions/12345", // present when complete + "error": null | {"code": "...", "message": "..."}, + "createdAt": "...", + "completedAt": "...", + "durationSeconds": 42 + } + ``` + WebSocket `PREDICTION_COMPLETE` / `PREDICTION_FAILED` events are the primary completion signal. `GET /jobs/{id}` is the polling fallback (recommended interval: 5 seconds; do not poll faster). All Celery-backed POST endpoints (`/reports`, `/space/reentry/plan`, `/propagate`) follow the same lifecycle pattern. 
+ +- `GET /jobs/{job_id}` — poll job status (all job types); `404` if job does not belong to the requesting user's organisation +- `GET /decay/predictions?norad_id=&status=` — list (cursor-paginated) + +### Re-entry (`viewer` role) +- `GET /reentry/predictions` — list with HMAC status; filterable by FIR, time window, confidence, integrity_failed +- `GET /reentry/predictions/{id}` — full detail; HMAC verified before serving; `integrity_failed` records return 503 +- `GET /reentry/tip-messages?norad_id=` — TIP messages + +### Space Weather (`viewer` role) +- `GET /spaceweather/current` — F10.7, Kp, Ap, Dst + `operational_status` + `uncertainty_multiplier` + cross-validation delta +- `GET /spaceweather/history?start=&end=` — history +- `GET /spaceweather/forecast` — 3-day NOAA SWPC forecast + +### Conjunctions (`viewer` role) +- `GET /conjunctions` — active events filterable by Pc threshold +- `GET /conjunctions/{id}` — detail with covariance and probability +- `POST /conjunctions/screen` — submit screening (`analyst` role) + +### Visualisation (`viewer` role) +- `GET /czml/objects` — full CZML catalog (J2000 INERTIAL; all strings HTML-escaped); **max payload policy: 5 MB**. If estimated payload exceeds 5 MB, the endpoint returns `HTTP 413` with `{"error": "catalog_too_large", "use_delta": true}`. +- `GET /czml/objects?since=` — **delta CZML**: returns only objects whose position or metadata has changed since the given timestamp. Clients must use this after the initial full load. Response includes `X-CZML-Full-Required: true` header if the server cannot produce a valid delta (e.g. client timestamp > 30 minutes old) — client must re-fetch the full catalog. Delta responses are always ≤ 500 KB for the 100-object catalog. 
+- `GET /czml/hazard/{zone_id}` — HMAC verified before serving +- `GET /czml/event/{event_id}` — full event CZML +- `GET /viz/mc-trajectories/{prediction_id}` — binary MC blob for Mode C + +### Hazard (`viewer` role) +- `GET /hazard/zones` — active zones; HMAC status included in response +- `GET /hazard/zones/{id}` — detail; HMAC verified before serving; `integrity_failed` records return 503 + +### Alerts (`viewer` read; `operator` acknowledge) +- `GET /alerts` — alert history +- `POST /alerts/{id}/acknowledge` — records user ID + timestamp + note in `alert_events` +- `GET /alerts/unread-count` — unread critical/high count for badge + +### Reports (`analyst` role) +- `GET /reports` — list (organisation-scoped via RLS) +- `POST /reports` — initiate generation (async) +- `GET /reports/{id}` — metadata + pre-signed 15-minute download URL +- `GET /reports/{id}/preview` — HTML preview + +### Org Admin (`org_admin` role — scoped to own organisation) (F7, F9, F11) +- `GET /org/users` — list users in own org +- `POST /org/users/invite` — invite a new user (sends email; creates user with `viewer` role pending activation) +- `PATCH /org/users/{id}/role` — assign role up to `operator` within own org; cannot assign `org_admin` or `admin` +- `DELETE /org/users/{id}` — deactivate user (revokes sessions and API keys; triggers pseudonymisation for GDPR) +- `GET /org/api-keys` — list all API keys in own org (including service account keys) +- `DELETE /org/api-keys/{id}` — revoke any key in own org +- `GET /org/audit-log` — paginated org-scoped audit log from `security_logs` and `alert_events` filtered by `organisation_id`; supports `?from=&to=&event_type=&user_id=` (F9) +- `GET /org/usage` — usage summary for current and previous billing period (predictions run, quota hits, API calls); sourced from `usage_events` table +- `PATCH /org/billing` — update `billing_contacts` row (email, PO number, VAT number) +- `POST /org/export` — trigger asynchronous org data export (F11); returns 
job ID; export includes all predictions, alert events, handover logs, and NOTAM drafts for the org; delivered as signed ZIP within 3 business days; used for GDPR portability and offboarding + +### Admin (`admin` role only) +- `GET /admin/ingest-status` — last run time and status per source +- `GET /admin/worker-status` — Celery queue depth and health +- `GET /admin/security-events` — recent security_logs entries +- `POST /admin/users` — create user +- `PATCH /admin/users/{id}/role` — change role (logged as HIGH security event) +- `GET /admin/organisations` — list all organisations with tier, status, usage summary +- `POST /admin/organisations` — provision new organisation (onboarding gate — see §29.8) +- `PATCH /admin/organisations/{id}` — update tier, status, subscription dates + +### Space Portal (`space_operator` or `orbital_analyst` role) +- `GET /space/objects` — list owned objects (`space_operator`: scoped; `orbital_analyst`: full catalog) +- `GET /space/objects/{norad_id}` — full technical detail with state vectors, covariance, TLE history +- `GET /space/objects/{norad_id}/ephemeris` — raw GCRF state vectors; CCSDS OEM format available via `Accept: application/ccsds-oem` +- `POST /space/reentry/plan` — submit controlled re-entry planning job; requires `owned_objects.has_propulsion = TRUE` +- `GET /space/reentry/plan/{task_id}` — poll; returns ranked deorbit windows with risk scores and FIR avoidance status +- `POST /space/conjunction/screen` — submit screening (`orbital_analyst` only) +- `GET /space/export/bulk` — bulk ephemeris/prediction export (JSON, CSV, CCSDS) + +### NOTAM Drafting (`operator` role) +- `POST /notam/draft` — generate draft NOTAM from prediction ID; returns ICAO-format draft text + mandatory disclaimer +- `GET /notam/drafts` — list drafts for organisation +- `GET /notam/drafts/{id}` — draft detail +- `POST /notam/drafts/{id}/cancel-draft` — generate cancellation draft for a previous new-NOTAM draft + +### API Key Management 
(`space_operator` or `orbital_analyst`)
+- `POST /api-keys` — create new API key; raw key returned once and never stored
+- `GET /api-keys` — list active keys (hashed IDs only, never raw keys)
+- `DELETE /api-keys/{id}` — revoke key immediately
+- `GET /api-keys/usage` — per-key request counts and last-used timestamp
+
+### WebSocket (`viewer` minimum; cookie auth at upgrade)
+- `WS /ws/events` — real-time stream; 5 concurrent connections per user enforced. **Per-instance subscriber ceiling: 500 connections.** New connections beyond this limit receive `HTTP 503` at the WebSocket upgrade. A `ws_connected_clients` Prometheus gauge tracks current count per backend instance; alert fires at 400 (WARNING) to trigger horizontal scaling before the ceiling is reached. At Tier 2 (2 backend instances), the effective ceiling is 1,000 simultaneous WebSocket clients — documented as a known capacity limit in `docs/runbooks/capacity-limits.md`.
+
+**WebSocket event payload schema:**
+
+All events share an envelope (shown here for an `alert.new` event):
+```json
+{
+  "type": "alert.new",
+  "seq": 1042,
+  "ts": "2026-03-17T14:23:01.123Z",
+  "data": { ...
} +} +``` + +| `type` | Trigger | `data` fields | +|--------|---------|---------------| +| `alert.new` | New alert generated | `alert_id`, `level`, `norad_id`, `object_name`, `fir_ids[]` | +| `alert.acknowledged` | Alert acknowledged by any user in org | `alert_id`, `acknowledged_by`, `note_preview` | +| `alert.superseded` | Alert superseded by a new one | `old_alert_id`, `new_alert_id` | +| `prediction.updated` | New re-entry prediction for a tracked object | `prediction_id`, `norad_id`, `p50_utc`, `supersedes_id` | +| `ingest.status` | Ingest job completed or failed | `source`, `status` (`ok`/`failed`), `record_count`, `next_run_at` | +| `spaceweather.change` | Operational status band changes | `old_status`, `new_status`, `kp`, `f107` | +| `tip.new` | New TIP message ingested | `norad_id`, `object_name`, `tip_epoch`, `predicted_reentry_utc` | + +**Reconnection and missed-event recovery:** Each event carries a monotonically increasing `seq` number per organisation. On reconnect, the client sends `?since_seq=` in the WebSocket upgrade URL. The server replays up to 200 missed events from an in-memory ring buffer (last 5 minutes). If the client has been disconnected > 5 minutes, it receives a `{"type": "resync_required"}` event and must re-fetch state via REST. + +**Per-org sequence number implementation (F5 — §67):** The `seq` counter for each org must be assigned using a PostgreSQL `SEQUENCE` object, not `MAX(seq)+1` in a trigger. 
`MAX(seq)+1` under concurrent inserts for the same org produces duplicate sequence numbers:
+
+```sql
+-- Migration: a single global sequence shared by all orgs
+-- (simpler than creating one sequence per org on org creation)
+CREATE SEQUENCE IF NOT EXISTS alert_seq_global
+    START 1 INCREMENT 1 NO CYCLE;
+
+-- In the alert_events INSERT trigger or application code:
+-- NEW.seq := nextval('alert_seq_global');
+-- This is globally unique and monotonically increasing; per-org ordering
+-- is derived by filtering on org_id + ordering by seq.
+```
+
+**Preferred approach:** A single global `alert_seq_global` sequence assigned at INSERT time. Per-org ordering is maintained because `seq` is globally monotonic — any two events for the same org will have the correct relative ordering by `seq`. The WebSocket ring buffer lookup uses `WHERE org_id = $1 AND seq > $2 ORDER BY seq` which remains correct with a global sequence.
+
+**Do not use:** `MAX(seq)+1` computed in a trigger or in application code: two concurrent inserts for the same org can read the same `MAX(seq)` and commit duplicate values. No org-scoped locking is needed with `DEFAULT nextval('alert_seq_global')`; sequences are lock-free and gap-tolerant, so concurrent inserts both across and within orgs each receive unique, monotonically increasing values.
+
+**Application-level receipt acknowledgement (F2 — §63):** `delivered_websocket = TRUE` in `alert_events` is set at send-time, not client-receipt time. For safety-critical `CRITICAL` and `HIGH` alerts, the client must send an explicit receipt acknowledgement within 10 seconds:
+
+```typescript
+// Client → Server: after rendering a CRITICAL/HIGH alert.new event
+{ "type": "alert.received", "alert_id": "<uuid>", "seq": <seq> }
+```
+
+Server response:
+```json
+{ "type": "alert.receipt_confirmed", "alert_id": "<uuid>", "seq": <seq> }
+```
+
+If no `alert.received` arrives within 10 seconds of delivery, the server marks `alert_events.ws_receipt_confirmed = FALSE` and triggers the email fallback for that alert (same logic as offline delivery). This distinguishes "sent to socket" from "rendered on screen."
+
+```sql
+ALTER TABLE alert_events
+  ADD COLUMN ws_receipt_confirmed BOOLEAN,
+  ADD COLUMN ws_receipt_at TIMESTAMPTZ;
+-- NULL = not yet sent; TRUE = client confirmed receipt; FALSE = sent but no receipt within 10s
+```
+
+**Fan-out architecture across multiple backend instances (F3 — §63):** With ≥2 backend instances (Tier 2), a WebSocket connection from org A may be on instance-1 while a new alert fires on instance-2. Without a cross-instance broadcast mechanism, org A's operator misses the alert.
+
+**Required: Redis Pub/Sub fan-out:**
+
+```python
+# backend/app/alerts/fanout.py
+import json
+
+import redis.asyncio as aioredis
+
+ALERT_CHANNEL_PREFIX = "spacecom:alert:"
+
+async def publish_alert(redis: aioredis.Redis, org_id: str, event: dict):
+    """Publish alert event to Redis channel; all backend instances receive and forward to connected clients."""
+    channel = f"{ALERT_CHANNEL_PREFIX}{org_id}"
+    await redis.publish(channel, json.dumps(event))
+
+async def subscribe_org_alerts(redis: aioredis.Redis, org_id: str):
+    """Each backend instance subscribes to its connected orgs' channels on startup."""
+    pubsub = redis.pubsub()
+    await pubsub.subscribe(f"{ALERT_CHANNEL_PREFIX}{org_id}")
+    return pubsub
+```
+
+Each backend instance maintains a local registry of `{org_id: [websocket_connections]}`. On receiving a Redis Pub/Sub message, the instance forwards to all local connections for that org. This decouples alert generation (any instance) from delivery (per-instance local connections).
+
+**ADR:** `docs/adr/0020-websocket-fanout-redis-pubsub.md` — documents this pattern and the decision against sticky sessions (which would break blue-green deploys).
+
+**Dead-connection ANSP fallback notification (F6 — §63):** When the ping-pong mechanism detects a dead connection, the current behaviour is to close the socket. There is no notification to the ANSP that their live monitoring connection has silently dropped.
+
+**Required behaviour:**
+1.
On ping-pong timeout: close socket; record `ws_disconnected_at` in Redis session key for that connection +2. If no reconnect within `WS_DEAD_CONNECTION_GRACE_SECONDS` (default: 120s): send email to the org's ANSP contact (`organisations.primary_contact_email`) with subject: *"SpaceCom live connection dropped — please check your browser"* +3. If an active TIP event exists for the org's FIRs when the disconnection is detected: grace period is reduced to 30s and the email subject is: *"URGENT: SpaceCom connection dropped during active re-entry event"* +4. On reconnect (before grace period expires): cancel the pending fallback email + +```python +# backend/app/alerts/ws_health.py +WS_DEAD_CONNECTION_GRACE_SECONDS = 120 +WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP = 30 + +async def on_connection_closed(org_id: str, user_id: str, redis: aioredis.Redis): + active_tip = await redis.get(f"spacecom:active_tip:{org_id}") + grace = WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP if active_tip else WS_DEAD_CONNECTION_GRACE_SECONDS + # Schedule fallback notification via Celery + notify_ws_dead.apply_async( + args=[org_id, user_id], + countdown=grace, + task_id=f"ws-dead-{org_id}-{user_id}" # revocable if reconnect arrives + ) + +async def on_reconnect(org_id: str, user_id: str): + # Cancel pending dead-connection notification + celery_app.control.revoke(f"ws-dead-{org_id}-{user_id}") +``` + +**Per-org email alert rate limit (F7 — §65 FinOps):** + +Email alerts are triggered both by the alert delivery pipeline (when WebSocket delivery is unconfirmed) and by degraded-mode notifications. Without a rate limit, a flapping prediction window or ingest instability can generate hundreds of alert emails per hour to the same ANSP contact, exhausting the SMTP relay quota and creating alert fatigue. + +**Rate limit policy:** Maximum **50 alert emails per org per hour**. When the limit is reached, subsequent alerts within the window are queued and delivered as a **digest email** at the end of the hour. 
+
+```python
+# backend/app/alerts/email_delivery.py
+import json
+from datetime import datetime
+
+import redis.asyncio as aioredis
+from celery import shared_task
+
+EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50
+
+async def send_alert_email(org_id: str, alert: dict, redis: aioredis.Redis):
+    """Send alert email subject to per-org rate limit; fall back to digest queue."""
+    rate_key = f"spacecom:email_rate:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
+    count = await redis.incr(rate_key)
+    if count == 1:
+        await redis.expire(rate_key, 3600)  # safety TTL; the key is already bucketed per clock hour
+
+    if count <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR:
+        # Send immediately
+        await _dispatch_email(org_id, alert)
+    else:
+        # Add to digest queue; Celery task drains it at hour boundary
+        digest_key = f"spacecom:email_digest:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
+        await redis.rpush(digest_key, json.dumps(alert))
+        await redis.expire(digest_key, 7200)  # safety expire
+
+@shared_task
+def send_hourly_digest_emails():
+    """Drain digest queues and send consolidated digest emails. Runs at HH:59."""
+    # Find all digest keys matching current hour; send one digest per org
+    ...
+```
+
+**Contract expiry alerts (F7 — §68):**
+
+Without proactive expiry alerts, contracts expire silently.
Add a Celery Beat task (`tasks/commercial/contract_expiry_alerts.py`) that runs daily at 07:00 UTC and checks `contracts.valid_until`:
+
+```python
+from datetime import date, timedelta
+
+from celery import shared_task
+from sqlalchemy import text
+
+# db session and send_email helper are provided by the shared task context
+
+@shared_task
+def check_contract_expiry():
+    """Alert commercial team of contracts expiring within 90/30/7 days."""
+    thresholds = [
+        (90, "90-day renewal notice"),
+        (30, "30-day renewal notice — action required"),
+        (7, "URGENT: 7-day contract expiry warning"),
+    ]
+    for days, subject_prefix in thresholds:
+        target_date = date.today() + timedelta(days=days)
+        expiring = db.execute(text("""
+            SELECT c.id, o.name, c.monthly_value_cents, c.currency,
+                   c.valid_until, o.primary_contact_email
+            FROM contracts c
+            JOIN organisations o ON o.id = c.org_id
+            WHERE DATE(c.valid_until) = :target_date
+              AND c.contract_type NOT IN ('sandbox', 'internal')
+              AND c.auto_renew = FALSE
+        """), {"target_date": target_date}).fetchall()
+        for contract in expiring:
+            send_email(
+                to="commercial@spacecom.io",
+                subject=f"[SpaceCom] {subject_prefix}: {contract.name}",
+                body=f"Contract for {contract.name} expires on {contract.valid_until.date()}. "
+                     f"Monthly value: {contract.monthly_value_cents/100:.2f} {contract.currency}."
+            )
+```
+
+Add to celery-redbeat at `crontab(hour=7, minute=0)`. Also send a courtesy expiry notice to the org admin contact at the 30-day threshold so they can initiate their internal procurement process.
+
+**Celery schedule:** Add `send_hourly_digest_emails` to celery-redbeat at `crontab(minute=59)`.
+
+**Cost rationale:** SMTP relay services (SES, Mailgun) charge per email. At the 50/hour cap and 10 orgs, the maximum is 500 emails/hour = 12,000/day. At $0.10/1,000 (SES) = $1.20/day ≈ **$37/month** at sustained maximum. Without rate limiting during a flapping event, a single incident could generate thousands of emails in minutes.
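The cap above is what bounds the worst-case spend: the counter logic reduces to a pure function that can be exercised without Redis. A minimal sketch, with an in-memory dict standing in for the Redis `INCR`/`EXPIRE` pair (the function name is illustrative, not part of the codebase):

```python
EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50

def classify_alert_email(counters: dict, org_id: str, hour_bucket: str) -> str:
    """Increment the per-org counter for this clock-hour bucket and pick a path.

    Returns "send" while the org is under the hourly cap, "digest" once the cap
    is exceeded (the alert then waits for the consolidated digest email).
    A plain dict stands in for the Redis INCR/EXPIRE pair used in production.
    """
    key = (org_id, hour_bucket)
    counters[key] = counters.get(key, 0) + 1
    return "send" if counters[key] <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR else "digest"

counters: dict = {}
decisions = [classify_alert_email(counters, "org-a", "2026031714") for _ in range(60)]
assert decisions.count("send") == 50    # first 50 alerts in the hour go out immediately
assert decisions.count("digest") == 10  # the remainder queue for the hourly digest
```

Because the bucket key includes the clock hour, the counter resets naturally at each hour boundary, matching the hour-stamped Redis key scheme above.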
+
+**Per-client back-pressure and send queue circuit breaker (F7 — §63):** A slow client whose network buffers are full will cause `await websocket.send_json(event)` to block in the FastAPI handler. Without a per-client queue depth check, a single slow client can block the fan-out loop for all clients.
+
+```python
+# backend/app/alerts/ws_manager.py
+import asyncio
+
+from fastapi import WebSocket
+
+WS_SEND_QUEUE_MAX = 50  # events; beyond this, circuit-breaker triggers
+
+class ConnectionManager:
+    def __init__(self):
+        self._connections: dict[str, list[WebSocket]] = {}
+        self._send_queues: dict[WebSocket, asyncio.Queue] = {}
+
+    async def broadcast_to_org(self, org_id: str, event: dict):
+        for ws in self._connections.get(org_id, []):
+            queue = self._send_queues[ws]
+            if queue.qsize() >= WS_SEND_QUEUE_MAX:
+                # Circuit breaker: drop this connection; client will reconnect and replay
+                spacecom_ws_send_queue_overflow_total.labels(org_id=org_id).inc()
+                await ws.close(code=4003, reason="Send queue overflow — reconnect to resume")
+            else:
+                await queue.put(event)
+
+    async def _send_worker(self, ws: WebSocket):
+        """Dedicated coroutine per connection — decouples send from broadcast loop."""
+        queue = self._send_queues[ws]
+        while True:
+            event = await queue.get()
+            try:
+                await ws.send_json(event)
+            except Exception:
+                break  # connection closed; worker exits
+```
+
+Prometheus counter: `spacecom_ws_send_queue_overflow_total{org_id}` — any non-zero value warrants investigation.
+
+**Missed-alert display for offline clients (F8 — §63):** When a client reconnects after receiving `resync_required`, it calls the REST API to re-fetch current state. The notification centre must explicitly surface alerts that arrived during the offline period:
+
+`GET /api/v1/alerts?since=<last_seen_ts>&include_offline=true` — returns all unacknowledged alerts since `last_seen_ts`, annotated with `"received_while_offline": true`.
The notification centre renders these with a distinct visual treatment: amber border + *"Received while you were offline"* label. The client stores `last_seen_ts` in `localStorage` (updated on each WebSocket message); this survives page reload but not localStorage clear.

**WebSocket connection metadata — per-org operational visibility (F10 — §63):**

New Prometheus metrics:

```python
from prometheus_client import Gauge

ws_org_connected = Gauge(
    'spacecom_ws_org_connected',
    'Whether at least one WebSocket connection is active for this org',
    ['org_id', 'org_name']
)
ws_org_connections = Gauge(
    'spacecom_ws_org_connection_count',
    'Number of active WebSocket connections for this org',
    ['org_id']
)
```

Updated when connections open/close. Alert rule:

```yaml
- alert: ANSPNoLiveConnectionDuringTIPEvent
  expr: |
    spacecom_active_tip_events > 0
    and on(org_id) spacecom_ws_org_connected == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ANSP {{ $labels.org_name }} has no live WebSocket connection during active TIP event"
    runbook_url: "https://spacecom.internal/docs/runbooks/ansp-connection-lost.md"
```

On-call dashboard panel 9 (below the fold): *"ANSP Connection Status"* — table of org names, connection count, last-connected timestamp, TIP-event indicator. Rows with `connected = 0` and active TIP highlighted in amber.

**Protocol version negotiation (Finding 8):** Client connects with `?protocol_version=1`. The server's first message is always:
```json
{"type": "CONNECTED", "protocolVersion": 1, "serverVersion": "2.1.3", "seq": 0}
```
When a breaking event schema change ships, both versions are supported in parallel for 6 months. Clients on a deprecated version receive:
```json
{"type": "PROTOCOL_DEPRECATION_WARNING", "currentVersion": 1, "sunsetDate": "2026-12-01",
 "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"}
```
After sunset, old-version connections are closed with code `4002` ("Protocol version deprecated").
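The negotiation rules above can be condensed into a pure handshake helper for unit testing. Everything here is a sketch: `handshake_messages`, the version table, and the hypothetical v2 entry are illustrative names, and the sunset date is the example value from the deprecation message.

```python
from datetime import date

SUPPORTED_PROTOCOL_VERSIONS = {1, 2}          # hypothetical: v2 has shipped, v1 deprecated
DEPRECATED_VERSIONS = {1: date(2026, 12, 1)}  # version -> sunset date (example value)
SERVER_VERSION = "2.1.3"

def handshake_messages(requested: int, today: date) -> list[dict]:
    """Return the ordered server messages for a new connection.
    Raises for unknown or sunset versions — caller closes the socket with 4002."""
    sunset = DEPRECATED_VERSIONS.get(requested)
    if requested not in SUPPORTED_PROTOCOL_VERSIONS or (sunset and today >= sunset):
        raise ValueError("4002: Protocol version deprecated")
    msgs = [{"type": "CONNECTED", "protocolVersion": requested,
             "serverVersion": SERVER_VERSION, "seq": 0}]
    if sunset:  # still supported, but warn with the sunset date
        msgs.append({"type": "PROTOCOL_DEPRECATION_WARNING",
                     "currentVersion": requested,
                     "sunsetDate": sunset.isoformat(),
                     "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"})
    return msgs
```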
Protocol version history is maintained in `docs/api-guide/websocket-protocol.md`.

**Token refresh during long-lived sessions (Finding 4):** Access tokens expire in 15 minutes. The server sends a `TOKEN_EXPIRY_WARNING` event 2 minutes before expiry:
```json
{"type": "TOKEN_EXPIRY_WARNING", "expiresInSeconds": 120, "seq": N}
```
The client calls `POST /auth/token/refresh` (standard REST — does not interrupt the WebSocket), then sends on the existing connection:
```json
{"type": "AUTH_REFRESH", "token": "<new_access_token>"}
```
Server responds: `{"type": "AUTH_REFRESHED", "seq": N}`. If the client does not refresh before expiry, the server closes with code `4001` ("Token expired — reconnect with a new token"). Clients distinguish `4001` (auth expiry, refresh and reconnect) from `4002` (protocol deprecated, upgrade required) from network errors (reconnect with backoff).

**Mode awareness:** In SIMULATION or REPLAY mode, the client's WebSocket connection remains open but `alert.new` and `tip.new` events are suppressed for the duration of the mode session. Simulation-generated events are delivered on a separate `WS /ws/simulation/{session_id}` channel.

### Alert Webhooks (`admin` role — registration; delivery to registered HTTPS endpoints)

For ANSPs with programmatic dispatch systems that cannot consume a browser WebSocket.

- `POST /webhooks` — register a webhook endpoint; `{"url": "https://ansp.example.com/hook", "events": ["alert.new", "tip.new"], "secret": "<shared_secret>"}`
- `GET /webhooks` — list registered webhooks for the organisation
- `DELETE /webhooks/{id}` — deregister
- `POST /webhooks/{id}/test` — send a synthetic `alert.new` event to verify delivery

**Delivery semantics:** At-least-once. SpaceCom POSTs the event envelope to the registered URL. Signature: `X-SpaceCom-Signature: sha256=<hex_digest>` header on every delivery, computed over the request body with the webhook's shared secret. Retry policy: 3 retries with exponential backoff (1s, 5s, 30s).
After 3 failures, the webhook is marked `degraded` and the org admin is notified by email. After 10 consecutive failures, the webhook is auto-disabled. + +`alert_webhooks` table: +```sql +CREATE TABLE alert_webhooks ( + id SERIAL PRIMARY KEY, + organisation_id INTEGER NOT NULL REFERENCES organisations(id), + url TEXT NOT NULL, + secret_hash TEXT NOT NULL, -- bcrypt hash of the shared secret; never stored in plaintext + event_types TEXT[] NOT NULL, + status TEXT NOT NULL DEFAULT 'active', -- active | degraded | disabled + failure_count INTEGER DEFAULT 0, + last_delivery_at TIMESTAMPTZ, + last_failure_at TIMESTAMPTZ, + created_at TIMESTAMPTZ DEFAULT NOW() +); +``` + +### Structured Event Export (`viewer` minimum) + +First step toward SWIM / machine-readable ANSP system integration (Phase 3 target). + +- `GET /events/{id}/export?format=geojson` — returns the event's re-entry corridor and impact zone as a GeoJSON `FeatureCollection` with ICAO FIR IDs and prediction metadata in `properties` +- `GET /events/{id}/export?format=czml` — CZML event package (same as `GET /czml/event/{event_id}`) +- `GET /events/{id}/export?format=ccsds-oem` — raw OEM for the object's trajectory at time of prediction + +The GeoJSON export is the preferred integration surface for ANSP systems that are not SWIM-capable. The `properties` object includes: `norad_id`, `object_name`, `p05_utc`, `p50_utc`, `p95_utc`, `affected_fir_ids[]`, `risk_level`, `prediction_id`, `prediction_hmac` (for downstream integrity verification), `generated_at`. + +### API Conventions (Finding 9) + +**Field naming:** All API request and response bodies use `camelCase`. Database column names and Python internal models use `snake_case`. The conversion is handled automatically by a shared base model: + +```python +from pydantic import BaseModel, ConfigDict +from pydantic.alias_generators import to_camel + +class APIModel(BaseModel): + """Base class for all API response/request models. 
Serialises to camelCase JSON.""" + model_config = ConfigDict( + alias_generator=to_camel, + populate_by_name=True, # allows snake_case in tests and internal code + ) + +class PredictionResponse(APIModel): + prediction_id: int # → "predictionId" in JSON + p50_reentry_time: datetime # → "p50ReentryTime" + ood_flag: bool # → "oodFlag" +``` + +All Pydantic response models inherit from `APIModel`. All request bodies also inherit from `APIModel` (with `populate_by_name=True`, clients may send either case). Document in `docs/api-guide/conventions.md`. + +### Error Response Schema (Finding 2) + +All error responses use the `SpaceComError` envelope — including FastAPI's default Pydantic validation errors (which are overridden): + +```python +class SpaceComError(BaseModel): + error: str # machine-readable code from the error registry + message: str # human-readable; safe to display in UI + detail: dict | None = None + requestId: str # from X-Request-ID header; enables log correlation + +@app.exception_handler(RequestValidationError) +async def validation_error_handler(request, exc): + return JSONResponse(status_code=422, content=SpaceComError( + error="VALIDATION_ERROR", + message="Request validation failed", + detail={"fields": exc.errors()}, + requestId=request.headers.get("X-Request-ID", ""), + ).model_dump(by_alias=True)) +``` + +**Canonical error code registry** — all codes, HTTP status, and recovery actions documented in `docs/api-guide/error-reference.md`. CI check: any `HTTPException` raised in application code must use a code from the registry. 
Sample entries: + +| Code | HTTP status | Meaning | Recovery | +|---|---|---|---| +| `VALIDATION_ERROR` | 422 | Request body or query param invalid | Fix the indicated fields | +| `INVALID_CURSOR` | 400 | Pagination cursor malformed or expired | Restart from page 1 | +| `RATE_LIMITED` | 429 | Rate limit exceeded | Wait `retryAfterSeconds` | +| `EPHEMERIS_TOO_MANY_POINTS` | 400 | Computed points exceed 100,000 | Reduce range or increase step | +| `IDEMPOTENCY_IN_PROGRESS` | 409 | Duplicate request still processing | Wait and retry `statusUrl` | +| `HMAC_VERIFICATION_FAILED` | 503 | Prediction integrity check failed | Contact administrator | +| `API_KEY_INVALID` | 401 | API key revoked, expired, or invalid | Re-issue key | +| `PREDICTION_CONFLICT` | 200 (not error) | Multi-source window disagreement | See `conflictSources` field | + +### Rate Limit Error Response (Finding 6) + +`429 Too Many Requests` responses include `Retry-After` (RFC 7231 §7.1.3) and a structured body: + +``` +HTTP/1.1 429 Too Many Requests +Retry-After: 47 +X-RateLimit-Limit: 10 +X-RateLimit-Remaining: 0 +X-RateLimit-Reset: 1742134847 + +{ + "error": "RATE_LIMITED", + "message": "Rate limit exceeded for POST /decay/predict: 10 requests per hour", + "retryAfterSeconds": 47, + "limit": 10, + "window": "1h", + "requestId": "..." +} +``` + +`retryAfterSeconds` = `X-RateLimit-Reset − now()`. Clients implementing backoff must honour `Retry-After` and must not retry before it elapses. 
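As a sketch of how the 429 headers and `retryAfterSeconds` line up (hypothetical helper name; the real middleware would read the remaining budget from the Redis bucket):

```python
def rate_limit_headers(limit: int, remaining: int, reset_unix: int, now_unix: int) -> dict[str, str]:
    """Build the 429 response headers.
    Retry-After = X-RateLimit-Reset − now, floored at 0 for already-elapsed windows."""
    retry_after = max(0, reset_unix - now_unix)
    return {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_unix),
    }
```

The same `retry_after` value is echoed as `retryAfterSeconds` in the JSON body so clients that cannot read headers (some browser fetch wrappers) still get the backoff hint.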
### Idempotency Keys (Finding 5)

Mutation endpoints that have real-world consequences support idempotency keys:

```
POST /decay/predict
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
```

Server behaviour:
- **First receipt:** process normally; store `(key, user_id, endpoint, response_body)` in `idempotency_keys` table with 24-hour TTL
- **Duplicate within 24h:** return stored response with `HTTP 200` + header `Idempotency-Replay: true`; do not re-execute
- **Still processing:** return `409 Conflict` → `{"error": "IDEMPOTENCY_IN_PROGRESS", "statusUrl": "/jobs/{uuid}"}`
- **After 24h:** key expired; treat as new request

Applies to: `POST /decay/predict`, `POST /reports`, `POST /notam/draft`, `POST /alerts/{id}/acknowledge`, `POST /admin/users`. Documented in `docs/api-guide/idempotency.md`.

### API Key Authentication Model (Finding 11)

API key requests use key-only auth — no JWT required:
```
Authorization: Bearer apikey_<raw_key>
```

The prefix `apikey_` distinguishes API keys from JWT Bearer tokens at the middleware layer. The raw key is hashed with SHA-256 before storage; the raw key is shown exactly once at creation.

Rules:
- API key rate limits are **independent** from JWT session rate limits — separate Redis buckets per key
- Webhook deliveries are **not** counted against any rate limit bucket (server-initiated, not client-initiated)
- `allowed_endpoints` scope: `null` = all endpoints for the key's role; a non-null array restricts to listed paths. `403` returned for requests to unlisted endpoints with `{"error": "ENDPOINT_NOT_IN_KEY_SCOPE"}`
- Revoked/expired/invalid key: always `401` → `{"error": "API_KEY_INVALID", "message": "API key is revoked or expired"}` — indistinguishable from never-valid (prevents enumeration)

Document in `docs/api-guide/api-keys.md`.
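The middleware-layer split between API keys and JWTs can be sketched as a pure classifier. `classify_bearer_token` is an illustrative name, and error handling is reduced to the single opaque `API_KEY_INVALID` case described above:

```python
import hashlib

def classify_bearer_token(authorization: str) -> tuple[str, str]:
    """Split an Authorization header into ('apikey', sha256_digest) or ('jwt', token).
    API keys are stored and looked up by SHA-256 digest; JWTs go to the token verifier."""
    scheme, _, credential = authorization.partition(" ")
    if scheme != "Bearer" or not credential:
        # malformed header: same opaque 401 as a revoked key (prevents enumeration)
        raise ValueError("API_KEY_INVALID")
    if credential.startswith("apikey_"):
        return "apikey", hashlib.sha256(credential.encode()).hexdigest()
    return "jwt", credential
```

Because only the digest is stored, a database leak does not expose usable keys; the digest lookup also means key validation costs one indexed query, independent of key count.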
### System Endpoints (Finding 10)

`GET /readyz` is included in the OpenAPI spec as a documented endpoint (tagged `System`), so integrators and SWIM consumers can discover and monitor it:

```python
@app.get(
    "/readyz",
    tags=["System"],
    summary="Readiness and degraded-state check",
    response_model=ReadinessResponse,
    responses={
        200: {"description": "System operational"},
        207: {"description": "System degraded — one or more data sources stale"},
        503: {"description": "System unavailable — database or Redis unreachable"},
    }
)
```

`GET /healthz` (liveness probe) remains undocumented in OpenAPI — infrastructure-only. `/readyz` is the recommended integration health check endpoint for ANSP monitoring systems and the Phase 3 SWIM integration.

**Clock skew detection and server time endpoint (F6 — §67):**

CZML `availability` timestamps and prediction windows are generated using server UTC. If the server clock drifts (NTP sync failure after container restart, hypervisor clock skew, or VM migration), CZML ground track windows will be offset from real time. A client whose clock differs from the server clock by > 5 seconds will show predictions in the wrong temporal position.

**Infrastructure requirement:** All SpaceCom hosts must run `chronyd` or `systemd-timesyncd` with NTP synchronisation to a reliable source. Add to the deployment runbook (`docs/runbooks/host-setup.md`):
```bash
# Ubuntu/Debian
timedatectl set-ntp true
timedatectl status   # confirm NTPSynchronized: yes
```
Add Grafana alert: `node_timex_sync_status != 1` → WARNING: *"NTP sync lost on <host>"*.

**Client-side clock skew display:** Add `GET /api/v1/time` endpoint (unauthenticated, rate-limited to 1 req/s per IP):
```python
import time
from datetime import datetime, timezone

@router.get("/api/v1/time")
async def server_time():
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is naive and deprecated
    return {"utc": now.isoformat().replace("+00:00", "Z"), "unix": time.time()}
```
The frontend calls this on page load and computes `skew_seconds = server_unix - Date.now()/1000`.
If `abs(skew_seconds) > 5`: display a persistent WARNING banner: *"Your browser clock differs from the server by {N}s — prediction windows may appear offset. Please synchronise your system clock."*

### Pagination Standard

All list endpoints use **cursor-based pagination** (not offset-based). Offset pagination degrades because `OFFSET N` forces the DB to scan and discard N rows; at 7-year retention depth this becomes a full table scan.

**Canonical response envelope — applied to every list endpoint (Finding 1):**
```json
{
  "data": [...],
  "pagination": {
    "nextCursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xNlQxNDozMDowMFoiLCJpZCI6NDQ4Nzh9",
    "hasMore": true,
    "limit": 50,
    "totalCount": null
  }
}
```

Rules:
- `data` (not `items`) is the canonical array key across all list endpoints
- envelope keys follow the camelCase body convention (Finding 9); the cursor payload itself uses snake_case column names because it is decoded server-side only
- `nextCursor` is `base64url(json({"created_at": "<ts>", "id": <id>}))` — opaque to clients, decoded server-side
- `totalCount` is always `null` — count queries on large tables force full scans; document this explicitly in `docs/api-guide/pagination.md`
- `limit` defaults to 50; maximum 200; specified per endpoint group in OpenAPI `description`
- Empty result: `{"data": [], "pagination": {"nextCursor": null, "hasMore": false, "limit": 50, "totalCount": null}}` — never `404`
- Invalid/expired cursor: `400 Bad Request` → `{"error": "INVALID_CURSOR", "message": "Cursor is malformed or refers to a deleted record", "requestId": "..."}`

**Standard query parameters:**
- `limit` — page size (default: 50, maximum: 200)
- `cursor` — opaque cursor token from a previous response (absent = first page)

Cursor decodes server-side to `WHERE (created_at, id) < (cursor_ts, cursor_id) ORDER BY created_at DESC, id DESC`. Tokens are valid for 24 hours.
**Implementation:**
```python
from typing import Generic, TypeVar

T = TypeVar("T")

class PaginationMeta(APIModel):
    next_cursor: str | None   # → "nextCursor" via the shared APIModel alias generator
    has_more: bool
    limit: int
    total_count: None = None  # always None; never compute count

class PaginatedResponse(APIModel, Generic[T]):
    data: list[T]
    pagination: PaginationMeta

def paginate_query(q, cursor: str | None, limit: int) -> PaginatedResponse:
    """Shared utility used by all list endpoints — enforces envelope consistency."""
    ...
```

**Enforcement:** An OpenAPI CI check confirms every endpoint tagged `list` has `limit` and `cursor` query parameters and returns the `PaginatedResponse` schema. Violations fail CI.

**Affected endpoints** (all paginated): `/objects`, `/decay/predictions`, `/reentry/predictions`, `/alerts`, `/conjunctions`, `/reports`, `/notam/drafts`, `/space/objects`, `/api-keys/usage`, `/admin/security-events`.

---

### API Latency Budget — CZML Catalog Endpoint

The CZML catalog endpoint (`GET /czml/objects`) is the most latency-sensitive read path and the primary SLO driver (p95 < 2s).
Latency budget allocation: + +| Component | Budget | Notes | +|---|---|---| +| DNS + TLS handshake (new connection) | 50 ms | Not applicable on keep-alive; amortised to ~0 for repeat requests | +| Caddy proxy overhead | 5 ms | Header processing only | +| FastAPI routing + middleware (auth, RBAC, rate limit) | 30 ms | Each middleware ~5–10 ms; keep middleware count ≤ 5 on this path | +| PgBouncer connection acquisition | 10 ms | Pool saturation adds latency; monitor `pgbouncer_pool_waiting` metric | +| DB query execution (PostGIS geometry) | 800 ms | Includes GiST index scan + geometry serialisation | +| CZML serialisation (Pydantic → JSON) | 200 ms | Validated by benchmark; exceeding this indicates schema complexity regression | +| HTTP response transmission (5 MB @ 1 Gbps internal) | 40 ms | Internal network; negligible | +| **Total budget (new connection)** | **~1,135 ms** | **~865 ms headroom to 2s p95 SLO** | + +Any new middleware added to the CZML endpoint path must be profiled and must not exceed its allocated budget. Exceeding the DB or serialisation budget requires a performance investigation before merge. + +--- + +### API Versioning Policy + +Base path: `/api/v1`. 
All versioned endpoints follow Semantic Versioning applied to the API contract: + +- **Non-breaking changes** (additive: new optional fields, new endpoints, new query params): deployed without version bump; announced in `CHANGELOG.md` +- **Breaking changes** (removed fields, changed types, changed auth requirements, removed endpoints): require a new major version (`/api/v2`); old version supported in parallel for a minimum of **6 months** before sunset +- **Deprecation signalling:** Deprecated endpoints return `Deprecation: true` and `Sunset: ` response headers (RFC 8594) +- **Version negotiation:** Clients may send `Accept: application/vnd.spacecom.v1+json` to pin to a specific version; default is always the latest stable version +- **Breaking change notice:** Minimum 3 months written notice (email to registered API key holders + `CHANGELOG.md` entry) before any breaking change is deployed + +**Changelog discipline (F5):** `CHANGELOG.md` follows the [Keep a Changelog](https://keepachangelog.com/) format with [Conventional Commits](https://www.conventionalcommits.org/) as the commit-level input. Every PR must add an entry under `[Unreleased]` if it has a user-visible effect. On release, `[Unreleased]` is renamed to `[{semver}] - {date}`. 
+```markdown +## [Unreleased] +### Added +- `p01_reentry_time` and `p99_reentry_time` fields on decay prediction response (SC-188) +### Changed +- `altitude_unit_preference` default for ANSP operators changed from `m` to `ft` (SC-201) +### Fixed +- HMAC integrity check now correctly handles NULL `action_taken` field (SC-195) +### Deprecated +- `GET /objects/{id}/trajectory` — use `GET /objects/{id}/ephemeris` (sunset 2027-06-01) +``` +- `make changelog-check` (CI step) fails if `[Unreleased]` section is empty and the diff contains non-chore/docs commits +- Release changelogs are the source for API key holder email notifications and GitHub release notes + +**OpenAPI spec as source of truth (F1):** FastAPI generates the OpenAPI 3.1 spec automatically from route decorators, Pydantic schemas, and docstrings. The spec is the authoritative contract — not a separately maintained document. CI enforces this: +- `GET /api/v1/openapi.json` is served by the running API; CI downloads it and diffs against the committed `openapi.yaml` +- Any uncommitted drift fails the build with `openapi-diff --fail-on-incompatible` +- The committed `openapi.yaml` is regenerated by running `make generate-openapi` (calls `python -m app.generate_spec`) — this is a required step in the PR checklist for any API change +- The spec is the input to all downstream tooling: Swagger UI (`/docs`), Redoc (`/redoc`), contract tests, and the client SDK generator + +**API date/time contract (F10):** All date/time fields in API responses must use **ISO 8601 with UTC offset** — never Unix timestamps, never local time strings: +- Format: `"2026-03-22T14:00:00Z"` (UTC, `Z` suffix) +- OpenAPI annotation: `format: date-time` on every `_at`-suffixed and `_time`-suffixed field +- Contract test (BLOCKING): every field matching `/_at$|_time$/` in every response schema asserts it matches `^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$` +- Pydantic models use `datetime` with `model_config = {"json_encoders": {datetime: 
lambda v: v.isoformat().replace("+00:00", "Z")}}` + +**Frontend ↔ API contract testing (F4):** The TypeScript types used by the Next.js frontend must be validated against the OpenAPI spec on every CI run — preventing the common drift where the Pydantic response model changes but the frontend `interface` is not updated until a runtime error surfaces. + +Implementation: `openapi-typescript` generates TypeScript types from `openapi.yaml` into `frontend/src/types/api.generated.ts`. The frontend imports only from this generated file — no hand-written API response interfaces. A CI check (`make check-api-types`) regenerates the types and fails if the git diff is non-empty: + +```bash +# CI step: check-api-types +openapi-typescript openapi.yaml -o frontend/src/types/api.generated.ts +git diff --exit-code frontend/src/types/api.generated.ts \ + || (echo "API types out of sync — run: make generate-api-types" && exit 1) +``` + +This is a one-way contract: the spec is authoritative; the TypeScript types are derived. Any API change that affects the frontend must regenerate types before the PR can merge. This replaces the need for a separate consumer-driven contract test framework (Pact) at Phase 1 scale. + +**OpenAPI response examples (F7):** Every endpoint schema in the OpenAPI spec must include at least one `examples:` block demonstrating a realistic success response. This is enforced by a CI lint step (`spectral lint openapi.yaml --ruleset .spectral.yaml`) with a custom rule `require-response-example`. Missing examples fail the build. The examples serve three purposes: Swagger UI and Redoc interactive documentation, contract test fixture baseline, and ESA auditor review readability. 
+ +```yaml +# Example: openapi.yaml fragment for GET /objects/{norad_id} +responses: + '200': + content: + application/json: + schema: + $ref: '#/components/schemas/ObjectDetail' + examples: + debris_object: + summary: Tracked debris fragment in decay + value: + norad_id: 48274 + name: "CZ-3B DEB" + object_type: "DEBRIS" + perigee_km: 187.4 + apogee_km: 312.1 + data_confidence: "nominal" + propagation_quality: "degraded" + propagation_warning: "tle_age_7_14_days" +``` + +**Client SDK strategy (F8):** Phase 1 — no dedicated SDK. ANSP integrators are provided: +1. The committed `openapi.yaml` for import into Postman, Insomnia, or any OpenAPI-compatible tooling +2. A `docs/integration/` directory with language-specific quickstart guides (Python, JavaScript/TypeScript) showing auth, object fetch, and WebSocket subscription patterns +3. Python integration examples using `httpx` (async) and `requests` (sync) — not a packaged SDK + +Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate one using `openapi-generator-cli` targeting Python and TypeScript. Generated clients are published under the `@spacecom/` npm scope and `spacecom-client` PyPI package. The generator configuration is committed to `tools/sdk-generator/` so regeneration is reproducible from the spec. + +--- + +## 15. 
Propagation Architecture — Technical Detail + +### 15.1 Catalog Propagator (SGP4) + +```python +from sgp4.api import Satrec, jday +from app.frame_utils import teme_to_gcrf, gcrf_to_itrf, itrf_to_geodetic + +def propagate_catalog(tle_line1: str, tle_line2: str, times_utc: list[datetime]) -> list[OrbitalState]: + sat = Satrec.twoline2rv(tle_line1, tle_line2) + results = [] + for t in times_utc: + jd, fr = jday(t.year, t.month, t.day, t.hour, t.minute, t.second + t.microsecond/1e6) + e, r_teme, v_teme = sat.sgp4(jd, fr) + if e != 0: + raise PropagationError(f"SGP4 error code {e}") + r_gcrf, v_gcrf = teme_to_gcrf(r_teme, v_teme, t) + lat, lon, alt = itrf_to_geodetic(gcrf_to_itrf(r_gcrf, t)) + results.append(OrbitalState( + time=t, reference_frame='GCRF', + pos_x_km=r_gcrf[0], pos_y_km=r_gcrf[1], pos_z_km=r_gcrf[2], + vel_x_kms=v_gcrf[0], vel_y_kms=v_gcrf[1], vel_z_kms=v_gcrf[2], + lat_deg=lat, lon_deg=lon, alt_km=alt, propagator='sgp4' + )) + return results +``` + +**Scope limitation:** SGP4 accurate to ~1 km for perigee > 300 km and epoch age < 7 days. Do not use for decay prediction. + +**SGP4 validity gates — enforced at query time (Finding 1):** + +| Condition | Action | UI signal | +|---|---|---| +| `tle_epoch_age ≤ 7 days` | Normal propagation | `propagation_quality: 'nominal'` | +| `7 days < tle_epoch_age ≤ 14 days` | Propagate with warning | `propagation_quality: 'degraded'`; amber `DataConfidenceBadge`; API includes `propagation_warning: 'tle_age_7_14_days'` | +| `tle_epoch_age > 14 days` | Return estimate with explicit caveat | `propagation_quality: 'unreliable'`; object position not rendered on globe without user acknowledgement; API returns `propagation_warning: 'tle_age_exceeds_14_days'` | +| `perigee_altitude < 200 km` | Do not use SGP4 | Route all propagation requests to the numerical decay predictor; SGP4 is invalid in this density regime | + +The epoch age check runs at the start of `propagate_catalog()`. 
The perigee altitude gate is enforced during TLE ingest — objects crossing below 200 km perigee are automatically flagged for decay prediction and removed from SGP4 catalog propagation tasks. + +**Sub-150 km propagation confidence guard (F2):** For the numerical decay predictor, objects with current perigee < 150 km are in a regime where atmospheric density model uncertainty dominates and SGP4/numerical model errors grow rapidly. Predictions in this regime are flagged: +```python +if perigee_km < 150: + prediction.propagation_confidence = 'LOW_CONFIDENCE_PROPAGATION' + prediction.propagation_confidence_reason = ( + f'Perigee {perigee_km:.0f} km below 150 km; ' + 'atmospheric density uncertainty dominant; re-entry imminent' + ) +``` +`LOW_CONFIDENCE_PROPAGATION` is surfaced in the UI as a red badge: "⚠ Re-entry imminent — prediction confidence low; consult Space-Track TIP directly." Unit test (BLOCKING): construct a TLE with perigee = 120 km; call the decay predictor; assert `propagation_confidence == 'LOW_CONFIDENCE_PROPAGATION'`. + +### 15.2 Decay Predictor (Numerical) + +**Physics:** J2–J6 geopotential, NRLMSISE-00 drag, solar radiation pressure (cannonball model), WGS84 oblate Earth. + +#### NRLMSISE-00 Input Vector (Finding 2) + +NRLMSISE-00 requires a fully specified input vector. Using a single F10.7 value for both the 81-day average and the prior-day slot, or using Kp instead of Ap, introduces systematic density errors that are worst during geomagnetic storms — exactly when prediction uncertainty matters most. 
```python
# Required NRLMSISE-00 inputs — both stored in space_weather table
nrlmsise_input = NRLMSISEInput(
    f107A = f107_81day_avg,    # 81-day centred average F10.7 (NOT current)
    f107  = f107_prior_day,    # prior-day F10.7 value (NOT current day)
    ap    = ap_daily,          # daily Ap index (linear) — NOT Kp (logarithmic)
    ap_a  = ap_3h_history_57h, # 19 three-hourly Ap values covering the prior 57 h,
                               # from which the model's 7-element ap_array is derived;
                               # enables full NRLMSISE accuracy (flags.switches[9] = -1)
)
```

The `space_weather` table already stores `f107_81day_avg` and `ap_daily`. Add `f107_prior_day DOUBLE PRECISION` and `ap_3h_history DOUBLE PRECISION[19]` columns (the 3-hourly Ap history array for the 57 hours preceding each observation). The ingest worker populates both from the NOAA SWPC Space Weather JSON endpoint.

**Atmospheric density model selection rationale (F3):** NRLMSISE-00 is used for Phase 1. JB2008 (Bowman et al. 2008) is the current USSF operational standard and is demonstrably more accurate during high solar activity periods (F10.7 > 150) and geomagnetic storms (Kp > 5).
NRLMSISE-00 is chosen for Phase 1 because: +- Python bindings are mature (`nrlmsise00` PyPI package); JB2008 has no equivalent mature Python binding +- For the typical F10.7 range (70–150 sfu) at solar minimum/moderate activity, the accuracy difference is < 10% +- Phase 2 milestone: evaluate JB2008 against NRLMSISE-00 on historical re-entry backcasts; if MAE improvement > 15%, migrate; decision documented in `docs/adr/0016-atmospheric-density-model.md` + +**NRLMSISE-00 input validity bounds (F3):** Inputs outside these ranges produce unphysical density estimates; the prediction is rejected rather than silently accepted: +```python +NRLMSISE_INPUT_BOUNDS = { + "f107": (65.0, 300.0), # physical solar flux range; < 65 indicates data gap + "f107A": (65.0, 300.0), + "ap": (0.0, 400.0), # Ap index physical range + "altitude_km": (85.0, 1000.0), # validated density range +} +``` +If any bound is violated, raise `AtmosphericModelInputError` with field and value — never silently clamp. + +**Altitude scope:** NRLMSISE-00 is used from 150 km to 800 km. Above 800 km, the model is applied but the prediction carries `ood_flag = TRUE` with `ood_reason = 'above_nrlmsise_validated_range_800km'` (Finding 11). + +**Geomagnetic storm sensitivity (Finding 11):** During the MC sampling, when the current 3-hour Kp index exceeds 5, sample F10.7 and Ap from storm-period values (current observed, not 81-day average). The prediction is annotated: +- `space_weather_warning: 'geomagnetic_storm'` field on the `reentry_predictions` record +- UI amber callout: "Active geomagnetic storm — thermospheric density is elevated; re-entry timing uncertainty is significantly increased" +- The storm flag persists for the lifetime of the prediction; it is not cleared when the storm ends (the prediction was made during disturbed conditions) + +#### Ballistic Coefficient Uncertainty Model (Finding 3) + +The ballistic coefficient `β = m / (C_D × A)` is the dominant uncertainty in drag-driven decay. 
Its three components are sampled independently in the Monte Carlo: + +| Parameter | Distribution | Rationale | +|---|---|---| +| `C_D` | `Uniform(2.0, 2.4)` | Standard assumption for non-cooperative objects in free molecular flow; no direct measurement available | +| `A` (stable attitude, `attitude_known = TRUE`) | `Normal(A_discos, 0.05 × A_discos)` | 5% shape uncertainty for known-attitude objects | +| `A` (tumbling, `attitude_known = FALSE`) | `Normal(A_discos_mean, 0.25 × A_discos_mean)` | 25% uncertainty; tumbling objects present a time-varying cross-section | +| `m` | `Normal(m_discos, 0.10 × m_discos)` | 10% mass uncertainty; DISCOS masses are not independently verified | + +OOD rules: +- `attitude_known = FALSE AND mass_kg IS NULL` → `ood_flag = TRUE`, `ood_reason = 'tumbling_no_mass'` — outside validated regime +- `cd_a_over_m IS NULL AND mass_kg IS NULL AND cross_section_m2 IS NULL` → `ood_flag = TRUE`, `ood_reason = 'no_physical_properties'` + +Objects with known physical properties can have operator-provided overrides stored in `objects.cd_override DOUBLE PRECISION` and `objects.bstar_override DOUBLE PRECISION`. When overrides are present, the MC samples around the override value rather than the DISCOS-derived value. + +#### Solar Radiation Pressure (Finding 7) + +SRP is included using the cannonball model: +``` +a_srp = −P_sr × C_r × (A/m) × r̂_sun +``` +where `P_sr = 4.56 × 10⁻⁶ N/m²` at 1 AU (scaled by `(1 AU / r_sun)²`), `C_r` is the radiation pressure coefficient stored in `objects.cr_coefficient DOUBLE PRECISION DEFAULT 1.3`. + +SRP is significant (> 5% of drag contribution) for objects with area-to-mass ratio > 0.01 m²/kg at altitudes > 500 km. OOD flag: `area_to_mass > 0.01 AND perigee > 500 km AND cr_coefficient IS NULL` → `ood_reason = 'srp_significant_cr_unknown'`. 
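Plugging in the stated constants: an area-to-mass ratio of 0.01 m²/kg with the default `C_r = 1.3` gives an SRP acceleration of about 5.9 × 10⁻⁸ m/s² at 1 AU. A minimal sketch (helper names are illustrative, not existing code):

```python
P_SR_1AU = 4.56e-6  # N/m², solar radiation pressure at 1 AU

def srp_accel_magnitude(cr: float, area_m2: float, mass_kg: float, r_sun_au: float = 1.0) -> float:
    """Cannonball-model SRP acceleration magnitude in m/s²
    (the vector is directed along −r̂_sun, per the formula above)."""
    return P_SR_1AU * cr * (area_m2 / mass_kg) / r_sun_au**2

def srp_significant(area_m2: float, mass_kg: float, perigee_km: float) -> bool:
    """Significance gate from the text: A/m > 0.01 m²/kg and perigee > 500 km."""
    return (area_m2 / mass_kg) > 0.01 and perigee_km > 500
```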
#### Integrator Configuration (Finding 9)

```python
from scipy.integrate import solve_ivp

integrator_config = dict(
    method   = "DOP853",  # RK7(8) embedded pair — adaptive step
    rtol     = 1e-9,      # relative tolerance (parts-per-billion)
    atol     = 1e-9,      # absolute tolerance (km); 1e-9 km ≈ 1 μm position error
    max_step = 60.0,      # seconds; constrained to capture density variation at perigee
    t_span   = (t0, t0 + 120 * 86400),  # 120-day maximum integration window
    events   = [
        altitude_80km_event,   # terminal: breakup trigger
        altitude_200km_event,  # non-terminal: log perigee passage
    ],
    dense_output = False,
)
```

Stopping criterion: integration terminates when `altitude ≤ 80 km` (breakup trigger fires) or when the 120-day span elapses without reaching 80 km (result: `propagation_timeout`; stored as `status = 'timeout'` in `simulations`). The 120-day cap is a safety stop — any object not re-entering within 120 days from a sub-450 km perigee TLE is anomalous and should be flagged for human review.

The `max_step = 60s` constraint near perigee prevents the integrator from stepping over atmospheric density variations. For altitudes above 300 km, the max step is relaxed to 300s (5 min) via a step-size hook that checks current altitude.

**TLE age uncertainty inflation (F7):** TLE age is a formal uncertainty source, not just a staleness indicator. For decaying objects, position uncertainty grows with TLE age due to unmodelled atmospheric drag variations.
A linear inflation model is applied to the ballistic coefficient covariance before MC sampling: +```python +# Applied in decay_predictor.py before MC sampling +tle_age_days = (prediction_epoch - tle_epoch).total_seconds() / 86400 +if tle_age_days > 0 and perigee_km < 450: + uncertainty_multiplier = 1.0 + 0.15 * tle_age_days + sigma_cd *= uncertainty_multiplier + sigma_area *= uncertainty_multiplier +``` +The 0.15/day coefficient is derived from Vallado (2013) §9.6 propagation error growth for LEO objects in ballistic flight. `tle_age_at_prediction_time` and `uncertainty_multiplier` are stored in `simulations.params_json` and included in the prediction API response for provenance. + +**Monte Carlo convergence criterion (F4):** N = 500 for production is not arbitrary — it satisfies the following convergence criterion tested on the reference object (`mc-ensemble-params.json`): + +| N | p95 corridor area (km²) | Change from N/2 | +|---|---|---| +| 100 | baseline | — | +| 250 | — | ~12% | +| 500 | — | ~4% | +| 1000 | — | ~1.8% | +| 2000 | — | ~0.9% | + +Convergence criterion: corridor area change < 2% between doublings. N = 500 satisfies this for the reference object. N = 1000 is used for objects with `ood_flag = TRUE` or `space_weather_warning = 'geomagnetic_storm'` (higher uncertainty → higher N needed for stable tail estimates). Server cap remains 1000. 
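The doubling criterion can be made explicit as a small helper. `select_mc_n` is an illustrative sketch, and the area values in the test below are hypothetical stand-ins for the reference-object runs:

```python
def select_mc_n(corridor_area_by_n: dict[int, float], threshold: float = 0.02) -> int:
    """Return the smallest N for which the next doubling changes the p95
    corridor area by less than `threshold` (fractional change N -> 2N).
    Falls back to the largest tested N if no tested N converged."""
    for n in sorted(corridor_area_by_n):
        if 2 * n not in corridor_area_by_n:
            continue  # no doubling available to compare against
        a_n, a_2n = corridor_area_by_n[n], corridor_area_by_n[2 * n]
        if abs(a_2n - a_n) / a_n < threshold:
            return n
    return max(corridor_area_by_n)
```

With areas matching the table's percentage changes, the 500 → 1000 doubling moves the corridor area by ~1.8% (< 2%), so N = 500 is selected, which is how "N = 500 satisfies this" should be read.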
+ +**Monte Carlo:** +``` +N = 500 (standard); N = 1000 (OOD flag or storm warning); server cap 1000 +Per-sample variation: C_D ~ U(2.0, 2.4); A ~ N(A_discos, σ_A × uncertainty_multiplier); + m ~ N(m_discos, σ_m); F10.7 and Ap from storm-aware sampling +Output: p01/p05/p25/p50/p75/p95/p99 re-entry times; ground track corridor polygon; per-sample binary blob for Mode C +All output records HMAC-signed before database write +``` + +### 15.3 Atmospheric Breakup Model + +Simplified ORSAT approach: aerothermal heating → failure altitude → fragment generation → RK4 ballistic descent → impact (velocity, angle, KE, casualty area). Distinct from NASA SBM on-orbit fragmentation. + +**Breakup altitude trigger (Finding 5):** Structural breakup begins when the numerical integrator crosses `altitude = 78 km` (midpoint of the 75–80 km range supported by NASA Debris Assessment Software and ESA DRAMA for aluminium-structured objects; documented in model card under "Breakup Altitude Rationale"). + +**Fragment generation:** Below 78 km, the fragment cloud is generated using the NASA Standard Breakup Model (NASA-TM-2018-220054) parameter set for the object's mass class: +- Mass class A: < 100 kg +- Mass class B: 100–1000 kg +- Mass class C: > 1000 kg (rocket bodies, large platforms) + +**Survivability by material (Finding 5):** Fragment demise altitude is determined by material class using the ESA DRAMA demise altitude lookup: + +| `material_class` | Typical demise altitude | Notes | +|---|---|---| +| `aluminium` | 60–70 km | Most fragments demise; some survive | +| `stainless_steel` | 45–55 km | Higher survival probability | +| `titanium` | 40–50 km | High survival; used in tanks and fasteners | +| `carbon_composite` | 55–65 km | Largely demises but reinforced structures may survive | +| `unknown` | Conservative: 0 km (surface impact) | All fragments assumed to survive — drives `ood_flag = TRUE` | + +`material_class TEXT` added to `objects` table. 
When `material_class IS NULL`, the `ood_flag` is set and the conservative all-survive assumption is used. The NOTAM `(E)` field debris survival statement changes from a static disclaimer to a model-driven statement: `DEBRIS SURVIVAL PROBABLE` (when calculated survivability > 50%) or `DEBRIS SURVIVAL POSSIBLE` (10–50%) or `COMPLETE DEMISE EXPECTED` (< 10%). + +**Casualty area:** Computed from fragment mass and velocity using the ESA DRAMA methodology. Stored per-fragment in `fragment_impacts` table. The aggregate casualty area polygon drives the "ground risk" display in the Event Detail page (Phase 3 feature). + +**Survival probability output (F5):** The aggregate object-level survival probability is stored in `reentry_predictions`: +```sql +ALTER TABLE reentry_predictions + ADD COLUMN survival_probability DOUBLE PRECISION, -- fraction of object mass expected to survive to surface (0.0–1.0) + ADD COLUMN survival_model_version TEXT, -- e.g. 'phase1_analytical_v1', 'drama_3.2' + ADD COLUMN survival_model_note TEXT; -- human-readable caveat, e.g. 'Phase 1: simplified analytical; no fragmentation modelling' +``` +Phase 1 method: simplified analytical — ballistic coefficient of the intact object projected to surface; if `material_class = 'unknown'`, `survival_probability = 1.0` (conservative all-survive). Phase 2: integrate ESA DRAMA output files where available from the space operator's licence submission. The NOTAM `(E)` field statement is driven by `survival_probability` (already specified above). + +### 15.4 Corridor Generation Algorithm (Finding 4) + +The re-entry corridor polygon is generated by `reentry/corridor.py`. The algorithm must be specified explicitly — the choice between convex hull, alpha-shape, and ellipse fit produces materially different FIR intersection results. 
+ +**Algorithm:** + +```python +def generate_corridor_polygon( + mc_trajectories: list[list[GroundPoint]], + percentile: float = 0.95, + alpha: float = 0.1, # degrees; ~11 km at equator + buffer_km: float = 50.0, # lateral dispersion buffer below 80 km + max_vertices: int = 1000, +) -> Polygon: + """ + Generate a re-entry hazard corridor polygon from Monte Carlo trajectories. + + Algorithm: + 1. For each MC trajectory, collect ground positions at 10-min intervals + from the 80 km altitude crossing to the final impact point. + 2. Retain the central `percentile` fraction of trajectories by re-entry time + (discard the earliest p_low and latest p_high tails). + 3. Compute the alpha-shape (concave hull) of the combined point set + using alpha = 0.1°. Alpha-shape is preferred over convex hull for + elongated re-entry corridors (convex hull overestimates width by 2–5x). + 4. Buffer the polygon by `buffer_km` to account for lateral fragment + dispersion below 80 km. + 5. Simplify to <= `max_vertices` vertices (Douglas-Peucker, tolerance 0.01°). + 6. Store the raw MC endpoint cloud as JSONB in `reentry_predictions.mc_endpoint_cloud` + for audit and Mode C replay. + + Returns: + Polygon in EPSG:4326 (WGS84), suitable for PostGIS GEOGRAPHY storage. + """ +``` + +The alpha-shape library (`alphashape`) is added to `requirements.in`. The 50 km buffer accounts for the fact that fragments detach from the main object trajectory below 80 km and disperse laterally. This value is documented in the model card with a reference to ESA DRAMA lateral dispersion statistics. + +**Adaptive ground-track sampling for CZML corridor fidelity (F4 — §62):** + +Step 1 of the corridor algorithm above samples at 10-minute intervals. For the high-deceleration terminal phase (below ~150 km), 10 minutes corresponds to hundreds of kilometres of ground track — the polygon will miss the actual terminal geometry. 
Adaptive sampling is required: + +```python +def adaptive_ground_points(trajectory: list[StateVector]) -> list[GroundPoint]: + """ + Return ground points at altitude-dependent intervals: + > 300 km: every 5 min (slow deceleration; sparse sampling adequate) + 150–300 km: every 2 min + 80–150 km: every 30 s (rapid deceleration; must resolve terminal corridor) + < 80 km: every 10 s (fragment phase; maximum spatial resolution) + """ + points = [] + for sv in trajectory: + alt_km = sv.altitude_km + step_s = 300 if alt_km > 300 else ( + 120 if alt_km > 150 else ( + 30 if alt_km > 80 else 10)) + # only emit a point if sufficient time has elapsed since the last point + if not points or (sv.t - points[-1].t) >= step_s: + points.append(to_ground_point(sv)) + return points +``` + +This is a breaking change to the corridor algorithm: the reference polygon in `docs/validation/reference-data/mc-corridor-reference.geojson` must be regenerated after this change is implemented. The ADR for this change must document the old vs. new polygon area difference for the reference object. + +**PostGIS vs CZML corridor consistency test (F6 — §62):** + +The PostGIS `ground_track_corridor` polygon (used for FIR intersection and alert generation) and the CZML polygon positions (displayed on the globe) are independently derived. A serialisation bug in the CZML builder could render the corridor in the wrong location while the database record remains correct — operators would see one corridor, alerts would be generated based on another. + +**Required integration test** in `tests/integration/test_corridor_consistency.py`: + +```python +@pytest.mark.safety_critical +def test_czml_corridor_matches_postgis_polygon(db_session): + """ + The bounding box of the CZML polygon positions must agree with the + PostGIS corridor polygon bounding box to within 10 km in each direction. 
    """
    prediction = db_session.query(ReentryPrediction).filter(
        ReentryPrediction.ground_track_corridor.isnot(None)
    ).first()

    # Generate CZML from the prediction
    czml_doc = generate_czml_for_prediction(prediction)
    czml_polygon = extract_polygon_positions(czml_doc)  # list of (lat, lon)

    # Get PostGIS bounding box
    postgis_bbox = db_session.execute(
        text("SELECT ST_Envelope(ground_track_corridor::geometry) FROM reentry_predictions WHERE id = :id"),
        {"id": prediction.id}
    ).scalar()
    postgis_coords = extract_bbox_corners(postgis_bbox)  # namedtuple: .min_lat/.max_lat/.min_lon/.max_lon

    czml_bbox = bounding_box_of(czml_polygon)
    assert abs(czml_bbox.min_lat - postgis_coords.min_lat) < 0.1  # 0.1° ≈ 11 km latitude tolerance
    assert abs(czml_bbox.max_lat - postgis_coords.max_lat) < 0.1
    # Antimeridian-aware longitude comparison
    assert lon_diff_deg(czml_bbox.min_lon, postgis_coords.min_lon) < 0.1
    assert lon_diff_deg(czml_bbox.max_lon, postgis_coords.max_lon) < 0.1
```

This test is marked `safety_critical` because a discrepancy > 10 km between displayed and stored corridor is a direct contribution to HZ-004.

**Unit test:** Generate a corridor from a known synthetic MC dataset (100 trajectories, straight ground track); verify the resulting polygon contains all input points; verify the polygon area is less than the convex hull area (confirming the alpha-shape is tighter); verify the polygon has ≤ 1000 vertices.

**MC test data generation strategy (Finding 10):** Generating hundreds of MC trajectories at test time is slow and non-deterministic. Committing raw trajectory arrays is a large binary blob.
Use seeded RNG: + +```python +# tests/physics/conftest.py +@pytest.fixture(scope="session") +def synthetic_mc_ensemble(): + """500 synthetic trajectories from seeded RNG — deterministic, no external downloads.""" + rng = np.random.default_rng(seed=42) # seed must never change without updating reference polygon + return generate_mc_ensemble( + rng, n=500, + object_params={ # Reference object: committed, never change without ADR + "mass_kg": 1000.0, "cd": 2.2, "area_m2": 1.0, "perigee_km": 185.0, + }, + ) +``` + +Commit to `docs/validation/reference-data/`: +- `mc-corridor-reference.geojson` — pre-computed corridor polygon (run `python tools/generate_mc_reference.py` once; review and commit) +- `mc-ensemble-params.json` — RNG seed, object parameters, generation timestamp + +Test asserts: (a) generated corridor polygon matches committed reference within 5% area difference; (b) corridor contains ≥ 95% of input trajectories. If the corridor algorithm changes, the reference polygon must be explicitly regenerated and the change reviewed — the seed itself never changes. + +### 15.5 Conjunction Probability (Pc) Computation Method (Finding 8) + +The Pc method is specified in `conjunction/pc_compute.py` and must be documented in the API response. + +**Phase 1–2 method: Alfano/Foster 2D Gaussian** + +```python +def compute_pc_alfano( + r1: np.ndarray, # primary position (km, GCRF) + v1: np.ndarray, # primary velocity (km/s) + cov1: np.ndarray, # 6×6 covariance (km², km²/s²) + r2: np.ndarray, # secondary position + v2: np.ndarray, + cov2: np.ndarray, + hbr: float, # combined hard-body radius (m) +) -> float: + """ + Compute probability of collision using Alfano (2005) 2D Gaussian method. + + Projects combined covariance onto the encounter plane, integrates the + bivariate normal distribution over the combined hard-body area. + Standard method in the space surveillance community. 
+ + Reference: Alfano (2005), "A Numerical Implementation of Spherical Object + Collision Probability", Journal of the Astronautical Sciences. + """ +``` + +**API response field:** Every conjunction record includes `pc_method: "alfano_2d_gaussian"` so consumers can correctly interpret the result. + +**Covariance source:** TLE format carries no covariance. SpaceCom estimates covariance via TLE differencing (Vallado & Cefola method): multiple TLEs for the same object within a 24-hour window are used to estimate position uncertainty. This is documented in the API as `covariance_source: "tle_differencing"` and flagged as `covariance_quality: 'low'` when fewer than 3 TLEs are available within 24 hours. + +**`pc_discrepancy_flag` implementation:** The log-scale comparison is confirmed as: +```python +pc_discrepancy_flag = abs(math.log10(pc_spacecom) - math.log10(pc_spacetrack)) > 1.0 +``` +Not a linear comparison. A discrepancy is an order-of-magnitude difference in probability — this threshold is correct. + +**Validity domain (F1):** The Alfano 2D Gaussian method is valid under the following conditions. 
Outside these conditions, the Pc estimate is flagged with `pc_validity: 'degraded'` in the API response:
- Short-encounter assumption: valid when the encounter duration is short compared to the orbital period (satisfied for LEO conjunction geometries)
- Linear relative motion: degrades when `miss_distance_km < 0.1` (non-linear trajectory effects become significant); flag: `pc_validity_warning: 'sub_100m_close_approach'`
- Gaussian covariance: degrades when the position uncertainty ellipsoid aspect ratio (σ_max/σ_min) > 100; flag: `pc_validity_warning: 'highly_anisotropic_covariance'`
- Minimum Pc floor: values below 1×10⁻¹⁵ are reported as `< 1e-15` and not computed precisely (numerical precision limit)

**Reference implementation test (F1):** `tests/physics/test_pc_compute.py` — BLOCKING:
```python
from collections import namedtuple

# Reference cases from Vallado & Alfano (2009), Table 1
PcCase = namedtuple("PcCase", [
    "miss_dist_m", "sigma_r1_m", "sigma_t1_m", "sigma_n1_m",
    "sigma_r2_m", "sigma_t2_m", "sigma_n2_m", "hbr_m", "expected_pc",
])

VALLADO_ALFANO_CASES = [
    PcCase(100.0, 50.0, 200.0, 50.0, 50.0, 200.0, 50.0, 10.0, 3.45e-3),
    PcCase(500.0, 100.0, 500.0, 100.0, 100.0, 500.0, 100.0, 5.0, 2.1e-5),
]

@pytest.mark.parametrize("case", VALLADO_ALFANO_CASES)
def test_pc_against_vallado_alfano(case):
    pc = compute_pc_alfano(*build_conjunction_geometry(case))
    assert abs(pc - case.expected_pc) / case.expected_pc < 0.05  # within 5%
```

**Phase 3 consideration:** Monte Carlo Pc for conjunctions where `pc_spacecom > 1e-3` (high-probability cases where the Gaussian assumption may break down due to non-linear trajectory evolution). Document in `docs/adr/0015-pc-computation-method.md`.
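The validity-domain rules reduce to a small classifier. The sketch below is illustrative — `classify_pc_validity` and `format_pc` are hypothetical helpers showing how `pc_validity` and `pc_validity_warning` might be populated, not the actual implementation:

```python
PC_FLOOR = 1e-15  # numerical precision limit

def classify_pc_validity(miss_distance_km: float,
                         sigma_max_m: float, sigma_min_m: float) -> dict:
    """Apply the Alfano validity-domain checks: sub-100 m close approaches and
    highly anisotropic covariance degrade the Gaussian Pc estimate."""
    warnings = []
    if miss_distance_km < 0.1:
        warnings.append("sub_100m_close_approach")
    if sigma_min_m > 0 and sigma_max_m / sigma_min_m > 100:
        warnings.append("highly_anisotropic_covariance")
    return {
        "pc_validity": "degraded" if warnings else "nominal",
        "pc_validity_warning": warnings,
    }

def format_pc(pc: float) -> str:
    """Report values below the precision floor as '< 1e-15', never a precise number."""
    return "< 1e-15" if pc < PC_FLOOR else f"{pc:.3e}"
```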
+ +### 15.6 Model Version Governance (F6) + +All components of the prediction pipeline are versioned together as a single `model_version` string using semantic versioning (`MAJOR.MINOR.PATCH`): + +| Change type | Version bump | Examples | +|-------------|-------------|---------| +| Pc methodology or propagator algorithm change | MAJOR | Switch from Alfano 2D to Monte Carlo Pc; replace DOP853 integrator | +| Atmospheric model or input processing change | MINOR | NRLMSISE-00 → JB2008; change TLE age inflation coefficient | +| Bug fix in existing model | PATCH | Fix F10.7 index lookup off-by-one; correct frame transformation | + +Rules: +- Old model versions are **never deleted** — tagged in git (`model/v1.2.3`) and retained in `backend/app/modules/physics/versions/` +- `reentry_predictions.model_version` is set at creation and immutable thereafter +- A model version bump requires: updated unit tests, updated `docs/validation/reference-data/`, entry in `CHANGELOG.md`, ADR if MAJOR + +**Reproducibility endpoint (F6):** +``` +POST /api/v1/decay/predict/reproduce +Body: { "prediction_id": "uuid" } +``` +Re-runs the prediction using the exact model version and parameters from `simulations.params_json` recorded at the time of the original prediction. Returns a new prediction record with `reproduced_from_prediction_id` set. This endpoint is used for regulatory audit ("what model produced this output?") and post-incident review. Available to `analyst` role and above. + +### 15.7 Prediction Input Validation (F9) + +A `validate_prediction_inputs()` function in `backend/app/modules/physics/validation.py` gates all decay prediction submissions. Inputs that fail validation are rejected with structured errors — never silently clamped to a valid range. 
+ +```python +def validate_prediction_inputs(params: PredictionParams) -> list[ValidationError]: + errors = [] + tle_age_days = (utcnow() - params.tle_epoch).days + if tle_age_days > 30: + errors.append(ValidationError("INVALID_TLE_EPOCH", + f"TLE epoch is {tle_age_days} days old; maximum 30 days")) + if not (65.0 <= params.f107 <= 300.0): + errors.append(ValidationError("F107_OUT_OF_RANGE", + f"F10.7 = {params.f107}; valid range [65, 300]")) + if not (0.0 <= params.ap <= 400.0): + errors.append(ValidationError("AP_OUT_OF_RANGE", + f"Ap = {params.ap}; valid range [0, 400]")) + if params.perigee_km > 1200.0: + errors.append(ValidationError("PERIGEE_TOO_HIGH", + f"Perigee {params.perigee_km} km > 1200 km; not a re-entry candidate")) + if params.mass_kg is not None and params.mass_kg <= 0: + errors.append(ValidationError("INVALID_MASS", + f"Mass {params.mass_kg} kg must be > 0")) + return errors +``` + +If `errors` is non-empty, the endpoint returns `422 Unprocessable Entity` with the full error list. Unit tests (BLOCKING) cover each validation path including boundary values. + +### 15.8 Data Provenance Specification (F11) + +**Phase 1 model classification:** No trained ML model components. All prediction parameters are derived from: +- Physical constants (gravitational parameter, WGS84 Earth model) +- Published atmospheric model coefficients (NRLMSISE-00) +- Published orbital mechanics algorithms (SGP4, Alfano 2005 Pc) +- Empirical constants from peer-reviewed literature (NASA Standard Breakup Model, ESA DRAMA demise altitudes, Vallado ballistic coefficient uncertainty) + +This is documented explicitly in `docs/ml/data-provenance.md` as: *"SpaceCom Phase 1 uses no trained machine learning components. All model parameters are derived from physical constants and published peer-reviewed sources cited below."* + +**EU AI Act Art. 10 compliance (Phase 1):** Because Phase 1 has no training data, the data governance obligations of Art. 
10 apply to input data rather than training data. Input data provenance is tracked in `simulations.params_json` (TLE source, space weather source, timestamp, version). + +**Future ML component protocol:** Any future learned component (e.g., drag coefficient ML model, debris type classifier) must be accompanied by: +- Training dataset: source, date range, preprocessing steps, known biases +- Validation split: method, size, metrics +- Performance on historical re-entry backcasts (§15.9 backcasting pipeline) +- Documented in `docs/ml/data-provenance.md` under the component name +- `docs/ml/model-card-{component}.md` following the Google Model Card format + +### 15.9 Backcasting Validation Pipeline (F8) + +When a re-entry is confirmed (object decays — `objects.status = 'decayed'`), the backcasting pipeline runs automatically: + +```python +# Triggered by Celery task on object status change to 'decayed' +@celery.task +def run_reentry_backcast(object_id: int, confirmed_reentry_time: datetime): + """Compare all predictions made in 72h before re-entry to actual outcome.""" + predictions = db.query(ReentryPrediction).filter( + ReentryPrediction.object_id == object_id, + ReentryPrediction.created_at >= confirmed_reentry_time - timedelta(hours=72), + ).all() + for pred in predictions: + error_hours = (pred.p50_reentry_time - confirmed_reentry_time).total_seconds() / 3600 + db.add(ReentryBackcast( + prediction_id=pred.id, + object_id=object_id, + confirmed_reentry_time=confirmed_reentry_time, + p50_error_hours=error_hours, + lead_time_hours=(confirmed_reentry_time - pred.created_at).total_seconds() / 3600, + model_version=pred.model_version, + )) +``` + +```sql +CREATE TABLE reentry_backcasts ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id), + object_id INTEGER NOT NULL REFERENCES objects(id), + confirmed_reentry_time TIMESTAMPTZ NOT NULL, + p50_error_hours DOUBLE PRECISION NOT NULL, -- signed: positive = 
predicted late + lead_time_hours DOUBLE PRECISION NOT NULL, + model_version TEXT NOT NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); +CREATE INDEX ON reentry_backcasts (model_version, created_at DESC); +``` + +**Drift detection:** Rolling 30-prediction MAE by model version, computed nightly. If MAE > 2× historical baseline for the current model version, raise `MEDIUM` alert to Persona D flagging for model review. Surfaced in the admin analytics panel as a "Model Performance" widget. + +--- + +## 16. Cross-Cutting Concerns + +### 16.1 Subscription Tiers and Feature Flags (F2, F6) + +SpaceCom gates commercial entitlements by `contracts`, which is the single authoritative commercial source of truth. `organisations.subscription_tier` is a presentation and segmentation shorthand only, and must never be used as the authority for feature access, quota limits, or shadow/production eligibility. Active contract state is materialised into derived organisation flags and quotas by a synchronisation job so runtime checks remain cheap and explicit. 

| Tier | Intended customer | MC concurrent runs | Decay predictions/month | Conjunction screening | API access | Multi-ANSP coordination |
|------|------------------|-------------------|------------------------|-----------------------|------------|------------------------|
| `shadow_trial` | Evaluators / test orgs | 1 | 20 | Read-only (catalog) | No | No |
| `ansp_operational` | ANSP Phase 1 | 1 | 200 | Yes (Phase 2) | Yes | Yes |
| `space_operator` | Space operator orgs | 2 | 500 | Own objects only | Yes | No |
| `institutional` | Space agencies, research | 4 | Unlimited | Yes | Yes | Yes |
| `internal` | SpaceCom internal | Unlimited | Unlimited | Yes | Yes | Yes |

**Feature flag enforcement pattern:**
```python
def require_tier(*tiers: str):
    # The tier read here is the contract-derived value materialised onto the
    # organisation by the synchronisation job — contracts remain the
    # authoritative commercial source of truth.
    def dependency(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
        org = db.get(Organisation, current_user.organisation_id)
        if org.subscription_tier not in tiers:
            raise HTTPException(status_code=403, detail={
                "code": "TIER_INSUFFICIENT",
                "current_tier": org.subscription_tier,
                "required_tiers": list(tiers),
            })
        return org
    return dependency

# Applied at router level alongside require_role:
router = APIRouter(dependencies=[
    Depends(require_role("analyst", "operator", "org_admin", "admin")),
    Depends(require_tier("ansp_operational", "institutional", "internal")),
])
```

**Quota enforcement pattern (MC concurrent runs):**
```python
TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def get_mc_concurrency_limit(org: Organisation) -> int:
    return TIER_MC_CONCURRENCY.get(org.subscription_tier, 1)
```

**Quota exhaustion is a billable signal:** Every `429 TIER_QUOTA_EXCEEDED` response writes a `usage_events` row with `event_type = 'mc_quota_exhausted'` (see §9.2 usage_events table). This powers the org admin's usage dashboard and the upsell trigger in the admin panel.
+ +**Tier changes take effect immediately** — no session restart required. The `require_tier` dependency reads from the database on each request; there is no tier caching that could allow a downgraded tier to continue accessing premium features. + +### Uncertainty and Confidence + +Every prediction includes: +- `confidence_level` (0.0–1.0) — derived from MC spread +- `uncertainty_bounds` — explicit p05/p50/p95 times, corridor ellipse axes +- `model_version` — semantic version +- `monte_carlo_n` — ≥ 100 preliminary, ≥ 500 operational +- `f107_assumed`, `ap_assumed` — critical for reproducibility +- `record_hmac` — tamper-evident signature, verified before serving + +**TLE covariance:** TLE format contains no covariance. Use TLE differencing (multiple TLEs within 24h) or empirical Vallado & Cefola covariance. Document clearly in API responses. + +**Multi-source prediction conflict resolution (Finding 10):** + +Space-Track TIP messages and SpaceCom's internal decay predictor may produce non-overlapping re-entry windows for the same object simultaneously. ESA ESAC may publish a third window. The aviation regulatory principle of most-conservative applies — the hazard presented to ANSPs must encompass the full credible uncertainty range. 
+ +Resolution rules (applied at the `reentry_predictions` layer): + +| Situation | Rule | +|---|---| +| SpaceCom p10–p90 and TIP window overlap | Display SpaceCom corridor as primary; TIP window shown as secondary reference band on Event Detail page | +| SpaceCom p10–p90 and TIP window do not overlap | Set `prediction_conflict = TRUE` on the prediction; HIGH severity data quality warning displayed; hazard corridor presented to ANSPs uses the **union** of SpaceCom p10–p90 and TIP window | +| ESA ESAC window available | Overlay as third reference band; include in `PREDICTION_CONFLICT` assessment if non-overlapping | +| All sources agree (all windows overlap) | No flag; SpaceCom corridor is primary | + +Schema addition to `reentry_predictions`: +```sql +ALTER TABLE reentry_predictions + ADD COLUMN prediction_conflict BOOLEAN DEFAULT FALSE, + ADD COLUMN conflict_sources TEXT[], -- e.g. ['spacecom', 'space_track_tip'] + ADD COLUMN conflict_union_p10 TIMESTAMPTZ, + ADD COLUMN conflict_union_p90 TIMESTAMPTZ; +``` + +The Event Detail page shows a `⚠ PREDICTION CONFLICT` banner (HIGH severity style) when `prediction_conflict = TRUE`, listing the conflicting sources and their windows. The hazard corridor polygon uses `conflict_union_p10`/`conflict_union_p90` when the flag is set. Document in `docs/model-card-decay-predictor.md` under "Conflict Resolution with Authoritative Sources." 
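The overlap/union rules above can be sketched as a pure function. `resolve_windows` is an illustrative helper covering only the two-source (SpaceCom vs. Space-Track TIP) case; the ESA ESAC overlay would extend the same logic:

```python
from datetime import datetime

def resolve_windows(spacecom_p10: datetime, spacecom_p90: datetime,
                    tip_start: datetime, tip_end: datetime) -> dict:
    """Most-conservative resolution: if the SpaceCom p10-p90 window and the TIP
    window overlap, SpaceCom is primary; otherwise flag a conflict and present
    the union of both windows to ANSPs."""
    overlap = spacecom_p10 <= tip_end and tip_start <= spacecom_p90
    if overlap:
        return {"prediction_conflict": False,
                "window": (spacecom_p10, spacecom_p90)}
    return {"prediction_conflict": True,
            "conflict_sources": ["spacecom", "space_track_tip"],
            "window": (min(spacecom_p10, tip_start),
                       max(spacecom_p90, tip_end))}
```

The union endpoints in the conflicting case are what would populate `conflict_union_p10`/`conflict_union_p90`.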
+ +### Auditability +- Every simulation in `simulations` with full `params_json` and result URI +- Reports stored with `simulation_id` reference +- `alert_events` and `security_logs` are append-only with DB-level triggers +- All API mutations logged with user ID, timestamp, and payload hash +- TIP messages stored verbatim for audit + +### Error Handling +- Structured error responses: `{ "error": "code", "message": "...", "detail": {...} }` +- Celery failures captured in `simulations.status = 'failed'`; surfaced in jobs panel +- Frame transformation failures fail loudly — never silently continue with TEME +- HMAC failures return 503 and trigger CRITICAL security event — never silently serve a tampered record +- TanStack Query error states render inline messages with retry; not page-level errors + +### Performance Patterns + +**SQLAlchemy async — `lazy="raise"` on all relationships:** +Async SQLAlchemy prohibits lazy-loaded relationship access outside an async context. Setting `lazy="raise"` converts silent N+1 errors into loud `InvalidRequestError` at development time rather than silent blocking DB calls in production: +```python +class ReentryPrediction(Base): + object: Mapped["SpaceObject"] = relationship(lazy="raise") + tip_messages: Mapped[list["TipMessage"]] = relationship(lazy="raise") + # Forces all callers to use joinedload/selectinload explicitly +``` +Required eager-loading patterns for the three highest-traffic endpoints: +- Event Detail: `selectinload(ReentryPrediction.object)`, `selectinload(ReentryPrediction.tip_messages)` +- Active alerts: `selectinload(AlertEvent.prediction)` +- CZML catalog: raw SQL with a single `JOIN` rather than ORM (bulk fetch; ORM overhead unacceptable at 864k rows) + +**CZML caching — two-tier strategy:** +CZML data for the current 72h window changes only when a new TLE is ingested or a propagation job completes. 
Cache the full serialised CZML blob:
```python
CZML_CACHE_KEY = "cache:czml:catalog:{catalog_hash}:{window_start}:{window_end}"
# TTL: 15 minutes in LIVE mode (refreshed after new TLE ingest event)
# TTL: permanent in REPLAY mode (historical data never changes)
```
Per-object CZML fragments cached separately under `cache:czml:obj:{norad_id}:{...}`. When a TLE is re-ingested for one object, invalidate only that object's fragment and recompute the full catalog CZML from the cached fragments.

**CZML cache invalidation triggers (F5 — §58):**

| Event | Invalidation scope | Mechanism |
|-------|--------------------|-----------|
| New TLE ingested for object X | `cache:czml:obj:{norad_id_x}:*` only | Ingest task enumerates matching keys with `SCAN` and deletes them after TLE commit (Redis `DEL` takes explicit keys, not glob patterns) |
| Propagation job completes for object X | `cache:czml:obj:{norad_id_x}:*` + full catalog key | Propagation Celery task issues invalidation on success |
| New prediction created for object X | `cache:czml:obj:{norad_id_x}:*` | Prediction task issues invalidation on completion |
| Manual cache flush (admin API) | `cache:czml:*` | `DELETE /api/v1/admin/cache/czml` — requires `admin` role |
| Cold start / DR failover | Warm-up Celery task | `warm_czml_cache` Beat task runs at startup (see below) |

**Stale-while-revalidate strategy:** The CZML cache key includes a `stale_ok` variant. When the primary key is expired but the stale key (`cache:czml:catalog:stale:{hash}`) exists, serve the stale response immediately and enqueue a background recompute. Maximum stale age: 5 minutes. This prevents a cache stampede during TLE batch ingest (up to 600 simultaneous invalidations).

**Cache warm-up on cold start (F5 — §58):**
```python
@app.task
def warm_czml_cache():
    """Run at container startup and after DR failover.
Estimated: 30–60s for 600 objects.""" + objects = db.query(Object).filter(Object.active == True).all() + for obj in objects: + generate_czml_fragment.delay(obj.norad_id) + # Full catalog key assembled by CZML endpoint after all fragments present +``` +Cold-start warm-up time (600 objects, 16 simulation workers): estimated 30–60 seconds. Included in DR RTO calculation (§26.3) as "cache warm-up: ~1 min" line item. + +**Redis key namespaces and eviction policy:** + +| Namespace | Contents | Eviction policy | Notes | +|-----------|----------|-----------------|-------| +| `celery:*` | Celery broker queues | `noeviction` — must never be evicted | Use separate Redis instance or DB 0 with `noeviction` | +| `redbeat:*` | celery-redbeat schedules | `noeviction` | Loss causes silent scheduled job disappearance | +| `cache:*` | Application cache (CZML, space weather, HMAC results) | `allkeys-lru` | Cache misses acceptable; broker loss is not | +| `ws:session:*` | WebSocket session state | `volatile-lru` (with TTL set) | Expires on session end | + +Run Celery broker and application cache as separate Redis database indexes (`SELECT 0` vs `SELECT 1`) so eviction policies can differ. The Sentinel configuration monitors both. + +Cache TTLs: +- `cache:czml:catalog` → 15 minutes +- `cache:spaceweather:current` → 5 minutes +- `cache:prediction:{id}:fir_intersection` → until superseded (keyed to prediction ID) +- `cache:prediction:{id}:hmac_verified` → 60 minutes + +**Bulk export — Celery offload for Persona F:** +`GET /space/export/bulk` must not materialise the full result set in the backend container — for the full catalog this risks OOM. 
Implement as a Celery task that writes to MinIO and returns a pre-signed download URL, consistent with the existing report generation pattern: +```python +@app.post("/space/export/bulk") +async def trigger_bulk_export(params: BulkExportParams, ...): + task = generate_bulk_export.delay(params.dict(), user_id=current_user.id) + return {"task_id": task.id, "status": "queued"} + +@app.get("/space/export/bulk/{task_id}") +async def get_bulk_export(task_id: str, ...): + # Returns {"status": "complete", "download_url": presigned_url} when done +``` +If a streaming response is preferred over task-based, use SQLAlchemy `yield_per=1000` cursor streaming — never materialise the full result set. + +**Analytics query routing to read replica:** +Persona B and F analytics queries (simulation comparison, historical validation, bulk export) are I/O intensive and must not compete with operational read paths on the primary TimescaleDB instance during active TIP events. Route to the Patroni standby: +```python +def get_db(write: bool = False, analytics: bool = False) -> AsyncSession: + if write: + return AsyncSession(primary_engine) + if analytics: + return AsyncSession(replica_engine) # Patroni standby + return AsyncSession(primary_engine) # operational reads: primary (avoids replica lag) +``` +Monitor replication lag: if replica lag > 30s, log a warning and redirect analytics queries to primary. + +**Query plan baseline:** +Add to Phase 1 setup: run `EXPLAIN (ANALYZE, BUFFERS)` on the primary CZML query with 100 objects and record the output in `docs/query-baselines/`. Re-run at Phase 3 load test and compare — if planning time or execution time has increased > 2×, investigate index bloat or chunk count growth before the load test proceeds. + +--- + +## 17. 
Validation Strategy + +### 17.0 Test Standards and Strategy (F1–F3, F5, F7, F8, F10, F11) + +#### Test Taxonomy (F2) + +Three levels — every developer must know which level a new test belongs to before writing it: + +| Level | Definition | I/O boundary | Tool | Location | +|-------|-----------|-------------|------|----------| +| **Unit** | Single function or class; all dependencies mocked or stubbed | No I/O | pytest | `tests/unit/` | +| **Integration** | Multiple components; real PostgreSQL + Redis; no external network | Real DB, no internet | pytest + testcontainers | `tests/integration/` | +| **E2E** | Full stack including browser; Celery worker running; real DB | Full stack | Playwright | `e2e/` | + +Rules: +- Physics algorithm tests (SGP4, MC, Pc) are **unit** tests — pure functions, no DB +- HMAC signing, RLS isolation, and rate-limit tests are **integration** tests — require a real DB transaction +- Alert delivery, WebSocket flow, and NOTAM draft UI are **E2E** tests +- A test that mocks the database is a unit test regardless of what it is testing — name it accordingly + +#### Coverage Standard (F1) + +| Scope | Tool | Minimum threshold | CI gate | +|-------|------|------------------|---------| +| Backend line coverage | `pytest-cov` | 80% | Fail below threshold | +| Backend branch coverage | `pytest-cov --branch` | 70% | Fail below threshold | +| Frontend line coverage | Jest `--coverage` | 75% | Fail below threshold | +| Safety-critical paths | `pytest -m safety_critical` | 100% (all pass, none skipped) | Always blocking | + +```ini +# pyproject.toml +[tool.pytest.ini_options] +addopts = "--cov=app --cov-branch --cov-fail-under=80 --cov-report=term-missing" + +[tool.coverage.run] +omit = ["*/migrations/*", "*/tests/*", "*/__pycache__/*"] +``` + +Coverage is measured on the **integration test run** (not unit-only) so that database-layer code paths are included. 
Coverage reports are uploaded to CI artefacts on every run; a coverage trend chart is required in the Phase 2 ESA submission. + +#### Test Data Management (F3) + +**Fixtures, not factories for shared reference data:** Physics reference cases (TLE sets, re-entry events, conjunction scenarios) are committed JSON files in `docs/validation/reference-data/`. Tests load them as pytest fixtures — never fetch from the internet at test time. + +**Isolated fixtures for integration tests:** Each integration test that writes to the database runs inside a transaction that is rolled back at teardown. No shared mutable state between tests: +```python +@pytest.fixture +def db_session(engine): + with engine.connect() as conn: + with conn.begin() as txn: + yield conn + txn.rollback() # all writes from this test disappear +``` + +**Time-dependent tests:** Any test that checks TLE age, token expiry, or billing period uses `freezegun` to freeze time to a known epoch. Tests must never rely on `datetime.utcnow()` producing a particular value: +```python +from freezegun import freeze_time + +@freeze_time("2026-01-15T12:00:00Z") +def test_tle_age_degraded_warning(): + # TLE epoch is 2026-01-08 → age = 7 days → expects 'degraded' + ... +``` + +**Sensitive test data:** Real NORAD IDs, real Space-Track credentials, and real ANSP organisation names must never appear in committed test fixtures. Use fictional NORAD IDs (90001–90099 are reserved for test objects by convention) and generated organisation names (`test-org-{uuid4()[:8]}`). + +#### Safety-Critical Test Markers (F8) + +All tests that verify safety-critical behaviour carry `@pytest.mark.safety_critical`. 
These run on every commit (not just pre-merge) and must all pass before any deployment: + +```python +# conftest.py +import pytest + +def pytest_configure(config): + config.addinivalue_line( + "markers", "safety_critical: test verifies a safety-critical invariant; always runs; zero tolerance for failure or skip" + ) +``` + +```python +# Usage +@pytest.mark.safety_critical +def test_cross_tenant_isolation(): + ... + +@pytest.mark.safety_critical +def test_hmac_integrity_failure_quarantines_record(): + ... + +@pytest.mark.safety_critical +def test_sub_150km_low_confidence_flag(): + ... +``` + +The full list of `safety_critical`-marked tests is maintained in `docs/TEST_PLAN.md` (see F11). CI runs `pytest -m safety_critical` as a separate fast job (target: < 2 minutes) before the full suite. + +#### Physics Test Determinism (F10) + +Monte Carlo tests are non-deterministic by default. All MC-based tests seed the random number generator explicitly: + +```python +import numpy as np + +@pytest.fixture(autouse=True) +def seed_rng(): + """Seed numpy RNG for all physics tests. Produces identical output across runs.""" + np.random.seed(42) + yield + # no teardown needed — each test gets a fresh seed via autouse + +@pytest.mark.safety_critical +def test_mc_convergence_criterion(): + result = run_mc_decay(tle=TEST_TLE, n=500, seed=42) + assert result.corridor_area_change_pct < 2.0 +``` + +The seed value `42` is fixed in `tests/conftest.py` and must not be changed without updating the baseline expected values. A PR that changes the seed without updating expected values fails the review checklist. + +#### Mutation Testing (F5) + +`mutmut` is run weekly (not on every commit — too slow) against the `backend/app/modules/physics/` and `backend/app/modules/alerts/` directories. These are the highest-consequence paths. 
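As a sketch of the assertion gap mutation testing is meant to expose (a hypothetical threshold check, not the production Pc code or its signature): a mutant that flips `>=` to `>` survives any suite that never asserts the boundary case.

```python
# Hypothetical alert-threshold check (illustrative only, not production code).
def exceeds_pc_threshold(pc: float, threshold: float = 1e-4) -> bool:
    return pc >= threshold

# The kind of mutant mutmut generates: ">=" flipped to ">".
def exceeds_pc_threshold_mutant(pc: float, threshold: float = 1e-4) -> bool:
    return pc > threshold

# A suite that only probes points far from the boundary passes both versions,
# so the mutant survives and drags the mutation score down:
assert exceeds_pc_threshold(1e-3) and not exceeds_pc_threshold(1e-6)
assert exceeds_pc_threshold_mutant(1e-3) and not exceeds_pc_threshold_mutant(1e-6)

# A boundary-case assertion kills the mutant: original and mutant now disagree.
assert exceeds_pc_threshold(1e-4) is True
assert exceeds_pc_threshold_mutant(1e-4) is False
```

The weekly `mutmut` run below performs this survivor accounting mechanically across the physics and alerts modules.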
+
+```bash
+mutmut run --paths-to-mutate=backend/app/modules/physics/,backend/app/modules/alerts/
+mutmut results
+```
+
+**Threshold:** Mutation score ≥ 70% for physics and alerts modules. Results published to CI artefacts. A score drop of > 5 percentage points between weekly runs creates a `mutation-regression` GitLab issue automatically.
+
+#### Test Environment Parity (F7)
+
+The CI test environment must use identical Docker images to production. Enforced by:
+- `docker-compose.ci.yml` extends `docker-compose.yml` — same image tags, no overrides to DB version or Redis version
+- TimescaleDB version in CI is pinned to the same tag as production (`timescale/timescaledb-ha:pg16-latest` is not acceptable — must be `timescale/timescaledb-ha:pg16.3-ts2.14.2`)
+- `make test` in CI fails if `TIMESCALEDB_VERSION` env var does not match the value in `docker-compose.yml`
+- MinIO is used in CI, not mocked — `make test` brings up the full service stack including MinIO before running integration tests
+
+#### ESA Test Plan Document (F11)
+
+`docs/TEST_PLAN.md` is a required Phase 2 deliverable. Structure:
+
+```markdown
+# SpaceCom Test Plan
+
+## 1. Test levels and tools
+## 2. Coverage targets and current status
+## 3. Safety-critical test traceability matrix
+   | Requirement | Test ID | Test name | Result |
+   |-------------|---------|-----------|--------|
+   | Sub-150km propagation guard | SC-TEST-001 | test_sub_150km_low_confidence_flag | PASS |
+   | Cross-tenant data isolation | SC-TEST-002 | test_cross_tenant_isolation | PASS |
+   ...
+## 4. Known test limitations
+## 5. Test environment specification
+## 6. Performance test results (latest k6 run)
+```
+
+The traceability matrix links each safety-critical requirement (drawn from §15, §7.2, §26) to its `@pytest.mark.safety_critical` test. This is the primary evidence document for ESA software assurance review.
+
+---
+
+**Important:** Comparing SGP4 against Space-Track TLEs is circular. 
All validation uses independent reference datasets. + +**Reference data location:** `docs/validation/reference-data/` — committed to the repository and loaded automatically by the test suite. No external downloads required at test time. + +**How to run all validation suites:** +```bash +make test # runs pytest including all validation suites +pytest tests/test_frame_utils.py -v # frame transforms only +pytest tests/test_decay/ -v # decay predictor + backcast comparison +pytest tests/test_propagator/ -v # SGP4 propagator +``` + +**How to add a new validation case:** Add the reference data to the appropriate JSON file in `docs/validation/reference-data/`, add a test case in the relevant test module, and document the source in the file's header comment. + +--- + +### 17.1 Frame Transformation Validation + +| Test | Reference | Pass criterion | Run command | +|------|-----------|---------------|-------------| +| TEME→GCRF transform | Vallado (2013), Table 3-5 | Position error < 1 m; velocity error < 0.001 m/s | `pytest tests/test_frame_utils.py::test_teme_gcrf_vallado` | +| GCRF→ITRF transform | Vallado (2013), Table 3-4 | Position error < 1 m | `pytest tests/test_frame_utils.py::test_gcrf_itrf_vallado` | +| ITRF→WGS84 geodetic | IAU SOFA test vectors | Lat/lon error < 1 μrad; altitude error < 1 mm | `pytest tests/test_frame_utils.py::test_itrf_geodetic` | +| Round-trip WGS84→ITRF→GCRF→ITRF→WGS84 | Self-consistency | Round-trip error < floating-point machine precision (~1e-12) | `pytest tests/test_frame_utils.py::test_roundtrip` | +| IERS EOP application | IERS Bulletin A reference values | UT1-UTC error < 1 μs; pole offset error < 0.1 mas | `pytest tests/test_frame_utils.py::test_iers_eop` | + +**Committed test vectors (Finding 6):** The following reference data files must be committed to the repository before any frame transformation or propagation code is merged. 
Tests are parameterised fixtures that load from these files; they fail (not skip) if a file is absent: 
+
+| File | Content | Source |
+|---|---|---|
+| `docs/validation/reference-data/frame_transform_gcrf_to_itrf.json` | ≥ 3 cases from Vallado (2013) §3.7: input UTC epoch + GCRF position → expected ITRF position, accurate to < 1 m | Vallado (2013) *Fundamentals of Astrodynamics* Table 3-4 |
+| `docs/validation/reference-data/sgp4_propagation_cases.json` | ISS (NORAD 25544) and one historical re-entry object: state vector at epoch and after 1h and 24h propagation | STK or GMAT reference propagation |
+| `docs/validation/reference-data/iers_eop_case.json` | One epoch with published IERS Bulletin B UT1-UTC and polar motion values; expected GCRF→ITRF transform result | IERS Bulletin B (iers.org) |
+
+```python
+# tests/physics/test_frame_transforms.py
+import json, pytest
+import numpy as np
+from pathlib import Path
+
+# gcrf_to_itrf and parse_utc come from the frame-utils module under test
+from app.physics.frame_utils import gcrf_to_itrf, parse_utc
+
+CASES_FILE = Path("docs/validation/reference-data/frame_transform_gcrf_to_itrf.json")
+
+def test_reference_data_exists():
+    """Fail hard if committed test vectors are missing — do not skip."""
+    assert CASES_FILE.exists(), f"Required reference data missing: {CASES_FILE}"
+
+@pytest.mark.parametrize("case", json.loads(CASES_FILE.read_text()))
+def test_gcrf_to_itrf(case):
+    result = gcrf_to_itrf(case["gcrf_km"], parse_utc(case["epoch_utc"]))
+    assert np.linalg.norm(result - case["expected_itrf_km"]) < 0.001  # 1 m tolerance (positions in km)
+```
+
+Reference data files: `docs/validation/reference-data/frame_transform_gcrf_to_itrf.json`, `sgp4_propagation_cases.json`, and `iers_eop_case.json`, as listed in the table above.
+
+**Operational significance of failure:** A frame transform error propagates directly into corridor polygon coordinates. A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km. A failing frame test is a blocking CI failure. 
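The round-trip criterion above can be exercised end-to-end with a self-contained sketch. This uses the standard closed-form WGS84 geodetic→ECEF conversion and the usual fixed-point iteration for the inverse (pure Python for illustration; the production path goes through `astropy` and the full GCRF chain):

```python
import math

# Standard WGS84 ellipsoid constants
A = 6378137.0              # semi-major axis [m]
F = 1 / 298.257223563      # flattening
E2 = F * (2 - F)           # first eccentricity squared

def geodetic_to_ecef(lat, lon, h):
    """Closed-form WGS84 geodetic (rad, rad, m) -> ECEF (m)."""
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)   # prime-vertical radius
    return ((n + h) * math.cos(lat) * math.cos(lon),
            (n + h) * math.cos(lat) * math.sin(lon),
            (n * (1 - E2) + h) * math.sin(lat))

def ecef_to_geodetic(x, y, z, iterations=10):
    """Iterative ECEF -> WGS84 geodetic; converges far below 1 mm in a few steps."""
    lon = math.atan2(y, x)
    p = math.hypot(x, y)
    lat = math.atan2(z, p * (1 - E2))                # spherical-Earth first guess
    for _ in range(iterations):
        n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
        h = p / math.cos(lat) - n
        lat = math.atan2(z, p * (1 - E2 * n / (n + h)))
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
    return lat, lon, p / math.cos(lat) - n

# Round trip at a re-entry-relevant altitude (values chosen for illustration)
lat0, lon0, h0 = math.radians(48.5), math.radians(11.2), 120_000.0
lat1, lon1, h1 = ecef_to_geodetic(*geodetic_to_ecef(lat0, lon0, h0))
assert abs(lat1 - lat0) < 1e-12 and abs(lon1 - lon0) < 1e-12   # meets the < 1 μrad criterion
assert abs(h1 - h0) < 1e-6                                     # closure well under 1 mm
```

The committed pytest round-trip test applies the same closure check through the full WGS84→ITRF→GCRF→ITRF→WGS84 chain rather than this two-step shortcut.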
+ +--- + +### 17.2 SGP4 Propagator Validation + +| Test | Reference | Pass criterion | +|------|-----------|---------------| +| State vector at epoch | Vallado (2013) test set, 10 objects spanning LEO/MEO/GEO/HEO | Position error < 1 km at epoch; < 10 km after 7-day propagation | +| Epoch parsing | NORAD 2-line epoch format → UTC | Round-trip to 1 ms precision | +| TLE line 1/2 checksum | Modulo-10 algorithm | Pass/fail; corrupted checksum rejected before propagation | + +**Operational significance of failure:** SGP4 position error at epoch > 1 km produces a corridor centred in the wrong place. Blocking CI failure. + +--- + +### 17.3 Decay Predictor Validation + +| Test | Reference | Pass criterion | +|------|-----------|---------------| +| NRLMSISE-00 density output | Picone et al. (2002) Table 1 reference atmosphere | Density within 1% of reference at 5 altitude/solar activity combinations | +| Historical backcast: p50 error | The Aerospace Corporation observed re-entry database (≥3 events Phase 1; ≥10 events Phase 2) | Median p50 error < 4h for rocket bodies with known physical properties | +| Historical backcast: corridor containment | Same database | p95 corridor contains observed impact in ≥90% of validation events | +| Historical replay: airspace disruption | Long March 5B Spanish airspace closure reconstruction with replay inputs and operator review | Affected FIR/time-window outputs judged operationally plausible and traceable in replay report | +| Air-risk ranking consistency | Documented crossing-scenario corpus (≥10 unique spacecraft/aircraft crossing cases by Phase 2) | Highest-ranked exposure slices remain stable under seed and traffic-density perturbations or the differences are explained in the validation note | +| Conservative-baseline comparison | Same replay corpus vs. 
full-FIR or fixed-radius precautionary closure baseline | Refined outputs reduce affected area or duration in a majority of replay cases without undercutting the agreed p95 protective envelope | +| Cross-tool comparison | GMAT (NASA open source) — 3 defined test cases | Re-entry time agreement within 1h for objects with identical inputs | +| Monte Carlo statistical consistency | Self-consistency: 500-sample run vs. 1000-sample run on same inputs | p05/p50/p95 agree within 2% (reducing with more samples) | + +Reference data files: `docs/validation/reference-data/aerospace-corp-reentries.json` for decay-only validation and `docs/validation/reference-data/reentry-airspace/` for airspace-risk replay cases (Long March 5B, Columbia-derived cloud case, and documented crossing scenarios). GMAT comparison is a manual procedure documented in `docs/validation/README.md` (GMAT is not run in CI — too slow; comparison run once per major model version). + +**Operational significance of failure:** Decay predictor p50 error > 4h means corridors are offset in time; operators could see a hazard window that doesn't match the actual re-entry. Major model version gate. + +--- + +### 17.4 Breakup Model Validation + +| Test | Reference | Pass criterion | +|------|-----------|---------------| +| Fragment count distribution | ESA DRAMA published results for similar-mass objects | Fragment count within 30% of DRAMA reference for a 500 kg object at 70 km | +| Energy conservation at breakup altitude | Internal check | Total kinetic + potential energy conserved within 1% through fragmentation step | +| Casualty area geometry | Hand-calculated reference case | Casualty area polygon area within 10% of analytic calculation | + +**Operational significance of failure:** Breakup model failure does not block Phase 1. It is an advisory failure in Phase 2. Blocking before Phase 3 regulatory submission. + +--- + +### 17.5 Security Validation + +| Test | Reference | Pass criterion | Blocking? 
| 
+|------|-----------|---------------|-----------|
+| RBAC enforcement | `test_rbac.py` — every endpoint, every role | 403 for insufficient role; 401 for unauthenticated; 0 mismatches | Yes |
+| HMAC tamper detection | `test_integrity.py` — direct DB row modification | API returns 503 + CRITICAL `security_logs` entry | Yes |
+| Rate limiting | `test_auth.py` — per-endpoint threshold | 429 after threshold; 200 after reset window | Yes |
+| CSP headers | Playwright E2E | `Content-Security-Policy` header present on all pages | Yes |
+| Container non-root | CI `docker inspect` check | No container running as root UID | Yes |
+| Trivy CVE scan | Trivy against all built images | 0 Critical/High CVEs | Yes |
+
+---
+
+### 17.6 Verification Independence (F6 — §61)
+
+EUROCAE ED-153 / DO-278A §6.4 requires that SAL-2 software components undergo independent verification — meaning the person who verifies (reviews/tests) a SAL-2 requirement, design, or code artefact must not be the same person who produced it.
+
+**Policy:** `docs/safety/VERIFICATION_INDEPENDENCE.md`
+
+**Scope:** All SAL-2 components identified in §24.13:
+- `physics/` (decay prediction engine)
+- `alerts/` (alert generation pipeline)
+- HMAC integrity verification functions
+- CZML corridor generation and frame transform
+
+**Implementation in GitLab:**
+
+```yaml
+# .gitlab/CODEOWNERS
+# SAL-2 components require an independent reviewer (not the MR author)
+/backend/app/physics/ @safety-reviewer
+/backend/app/alerts/ @safety-reviewer
+/backend/app/integrity/ @safety-reviewer
+/backend/app/czml/ @safety-reviewer
+```
+
+The `@safety-reviewer` group must have ≥1 member who is not the MR author. GitLab protected-branch and merge-request approval settings for `main` must include:
+- "Require approval from code owners" enabled on the protected branch (covers the paths above)
+- "Remove all approvals when commits are added" (new commits require re-review)
+- SAL-2 MRs require ≥2 approvals (one of which must be from `@safety-reviewer`)
+
+**Verification traceability:** The MR review record (GitLab MR number + reviewer + approval timestamp) serves as evidence for verification independence in the safety case (§24.12 E1.1). This record is referenced in the MoC document (§24.14 MOC-002).
+
+**Who qualifies as an independent reviewer for SAL-2:** Any engineer who:
+1. Did not write the code being reviewed
+2. Has sufficient domain knowledge to evaluate correctness (orbital mechanics familiarity for `physics/`; alerting logic familiarity for `alerts/`)
+3. Is designated in the `@safety-reviewer` GitLab group
+
+Before ANSP shadow activation, the safety case custodian confirms that all SAL-2 components committed in the release have a documented independent reviewer.
+
+---
+
+## 18. Additional Physics Considerations
+
+| Topic | Why It Matters | Phase |
+|-------|---------------|-------|
+| **Solar radiation pressure (SRP)** | Dominates drag above ~800 km for high A/m objects | Phase 1 (decay predictor) |
+| **J2–J6 geopotential** | J2 alone: ~7°/day RAAN error | Phase 1 (decay predictor) |
+| **Attitude and tumbling** | Drag coefficient 2–3× different; capture via B* Monte Carlo | Phase 2 |
+| **Lift during re-entry** | Non-spherical fragments: 10s km cross-track shift | Phase 2 (breakup) |
+| **Maneuver detection** | Active satellites maneuver; TLE-to-TLE ΔV estimation | Phase 2 |
+| **Ionospheric drag** | Captured via NRLMSISE-00 ion density profile | Phase 1 (via model) |
+| **Re-entry heating uncertainty** | Emissivity/melt temperatures poorly known for debris | Phase 2 |
+
+---
+
+## 19. 
Development Phases — Detailed + +### Phase 1: Analytical Prototype (Weeks 1–10) + +**Goal:** Real object tracking, decay prediction with uncertainty quantification, functional Persona A/B interface. Security infrastructure fully in place before any other feature ships. + +| Week | Backend Deliverable | Frontend Deliverable | Security / SRE Deliverable | +|------|--------------------|--------------------|--------------------------| +| 1-2 | FastAPI scaffolding, Alembic migrations, Docker Compose with Tier 2 service topology. `frame_utils.py`, `time_utils.py`. IERS EOP refresh + SHA-256 verify. Append-only DB triggers. HMAC signing infrastructure. Liveness + readiness probes on all services. `GET /healthz`, `GET /readyz` with DB + Redis checks. Dead letter queue for Celery. `task_acks_late`, `task_reject_on_worker_lost` configured. Celery queue routing (`ingest` vs `simulation`). `celery-redbeat` configured. **Legal/compliance**: `users` table `tos_accepted_at/tos_version/tos_accepted_ip/data_source_acknowledgement` fields. First-login ToS/AUP/Privacy Notice acceptance flow (blocks access until all accepted). SBOM generated via `syft`; CesiumJS commercial licence verified. Privacy Notice drafted and published. | Next.js scaffolding. Root layout: nav, ModeIndicator, AlertBadge, JobsPanel stub. Dark mode + high-contrast theme. CSP and security headers via Next.js middleware. ToS/AUP acceptance gate on first login (blocks dashboard until accepted). | **RBAC schema + `require_role()`. JWT RS256 + httpOnly cookies. MFA (TOTP). Redis AUTH + ACLs. MinIO private buckets. Docker network segmentation. Container hardening. `git-secrets`. Bandit + ESLint security in CI. Trivy. Dependency pinning. Dependabot. `security_logs` + sanitising formatter. 
Docker Compose `depends_on: condition: service_healthy` wired.** **Documentation**: `docs/` directory tree created; `AGENTS.md` committed; initial ADRs for JWT, dual frontend, Monte Carlo chord, frame library; `docs/runbooks/TEMPLATE.md` + index; `CHANGELOG.md` first entry; `docs/validation/reference-data/` with Vallado and IERS cases; `docs/alert-threshold-history.md` initial entry. **DevOps/Platform**: self-hosted GitLab CI pipeline (lint, test-backend, test-frontend, security-scan, build-and-push jobs); multi-stage Dockerfiles for all services; `.pre-commit-config.yaml` with all six hooks; `.env.example` committed with all variables documented; `Makefile` with `dev`, `test`, `migrate`, `seed`, `lint`, `clean` targets; Docker layer + pip + npm build cache configured; `sha-` image tagging in the GitLab container registry in place. Prometheus metrics: `spacecom_active_tip_events`, `spacecom_tle_age_hours`, `spacecom_hmac_verification_failures_total` instrumented. | +| 3–4 | Catalog module: object CRUD, TLE import. TLE cross-validation. ESA DISCOS import. Ingest Celery Beat (celery-redbeat). Hardcoded URLs, SSRF-mitigated HTTP client. WAL archiving configured. Daily backup Celery task. TimescaleDB compression policy on `orbits`. Retention policy scaffolded. | Object Catalog page. DataConfidenceBadge. Object Watch page stub. | Rate limiting (`slowapi`). Simulation parameter range validation. Prometheus: `spacecom_ingest_success_total`, `spacecom_ingest_failure_total` per source. AlertManager rule: consecutive ingest failures → warning. | +| 5–6 | Space Weather: NOAA SWPC + ESA SWS cross-validation. `operational_status` string. TIP message ingestion. Prometheus: `spacecom_prediction_age_seconds` per NORAD ID. Readiness probe: TLE staleness + space weather age checks. | SpaceWeatherWidget. Alert taxonomy: CRITICAL banner, NotificationCentre, AcknowledgeDialog. Degraded mode banner (reads `readyz` 207 response). | `alert_events` append-only verified. 
Alert rate-limit and deduplication. Alert storm detection. AlertManager rule: `spacecom_active_tip_events > 0 AND prediction_age > 3600` → critical. | +| 7–8 | Catalog Propagator (SGP4): TEME→GCRF, CZML (J2000). Ephemeris caching. Frame transform validation. All CZML strings HTML-escaped. MC chord architecture: `run_mc_decay_prediction` → `group(run_single_trajectory)` → `aggregate_mc_results`. Chord result backend (Redis) sized. | Globe: real object positions, LayerPanel, clustering, urgency symbols. TimelineStrip. Live mode scrub. | WebSocket auth: cookie-based; connection limit. WS ping/pong. Prometheus: `spacecom_simulation_duration_seconds` histogram. | +| 9–10 | Decay Predictor: RK7(8) + NRLMSISE-00 + Monte Carlo chord. HMAC-signed output. Immutability triggers. Corridor polygon generation. Re-entry API. Validate against ≥3 historical re-entries. Monthly restore test Celery task implemented. | Mode A (Percentile Corridors). Event Detail: PredictionPanel with p05/p50/p95, HMAC status badge. TimelineGantt. Operational Overview. UncertaintyModeSelector (B/C greyed). | HMAC tamper detection E2E test. All-clear TIP cross-check guard. First backup restore test executed and passing. `spacecom_simulation_duration_seconds` p95 verified < 240s on Tier 2 hardware. | + +### Phase 2: Operational Analysis (Weeks 11–22) + +| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory | +|------|--------------------|--------------------|----------------------| +| 11–12 | Atmospheric Breakup: aerothermal, fragments, ballistic descent, casualty area. | Fragment impact points on globe. Fragment detail panel. | OWASP ZAP DAST against staging. | +| 13–14 | Conjunction: all-vs-all screening, Alfano probability. | Conjunction events on globe. ConjunctionPanel. | STRIDE threat model reviewed for Phase 2 surface. | +| 15–16 | Upper/Lower Atmosphere. Hazard module: fused zones, HMAC-signed, immutable, `shadow_mode` flag. | Mode B (Probability Heatmap): Deck.gl. 
UncertaintyModeSelector unlocks Mode B. | RLS multi-tenancy integration tests. Shadow records excluded from operational API (integration test). | +| 17–18 | Airspace: FIR/UIR load, PostGIS intersection. Airspace impact table. NOTAM Drafting: ICAO format, `notam_drafts` table, mandatory disclaimer. Shadow mode admin toggle. | AirspaceImpactPanel. NOTAM draft flow: NotamDraftViewer, disclaimer banner, review/cancel. 2D Plan View. ViewToggle. `/airspace` page. ShadowBanner + ShadowModeIndicator. | Regulatory disclaimer verified present on all NOTAM drafts. axe-core accessibility audit. | +| 19–20 | Report builder: bleach sanitisation, Playwright renderer (isolated, no-network, timeouts, seccomp). MinIO storage. Shadow validation schema + `shadow_validations` table. | ReportConfigDialog, ReportPreview, `/reports` page. IntegrityStatusBadge. SimulationComparison. ShadowValidationReport scaffold. | Renderer: `network_mode: none` enforced; sanitisation tests passing; 30s timeout verified. | +| 21–22 | Space Operator Portal: `owned_objects`, controlled re-entry planner (deorbit window optimiser), CCSDS export, `api_keys` table + lifecycle. `modules.api` with per-key rate limiting. **Legal gate**: legal opinion commissioned and received for primary deployment jurisdiction; `legal_opinions` table populated; shadow mode admin toggle wired to `shadow_mode_cleared` flag. Space-Track AUP redistribution clarification obtained (written confirmation from 18th Space Control Squadron or counsel opinion on permissible use). ECCN classification review commissioned for Controlled Re-entry Planner. GDPR compliance review: data inventory completed, lawful bases documented, DPA template drafted, erasure procedure (`handle_erasure_request`) implemented. | `/space` portal: SpaceOverview, ControlledReentryPlanner, DeorbitWindowList, ApiKeyManager, CcsdsExportPanel. Shadow mode admin toggle displays legal clearance status. 
| Object ownership RLS policy tested: `space_operator` cannot access non-owned objects. API key rate limiting verified. API Terms accepted at key creation and recorded. Jurisdiction screening at registration (OFAC/EU/UK sanctions list check). | + +### Phase 3: Operational Deployment (Weeks 23–32) + +| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory / SRE | +|------|--------------------|--------------------|----------------------------| +| 23–24 | Alerts module: thresholds, email delivery, geographic filtering, `alert_events`. Shadow mode: alerts suppressed. ADS-B feed integration: **OpenSky Network REST API** (`https://opensky-network.org/api/states/all`); polled every 60s via Celery Beat; flight state vectors stored in `adsb_states` (non-hypertable; rolling 24h window); route intersection advisory module reads `adsb_states` to identify flights in re-entry corridors. Air Risk module initialisation: aircraft exposure scoring, time-slice aggregation, and vulnerability banding by aircraft class. **Tier 3 HA infrastructure**: TimescaleDB streaming replication + Patroni + etcd. Redis Sentinel (3 nodes). 4× simulation workers (64 total cores). Blue-green deployment pipeline wired. | Full alert lifecycle UI: geographic filtering, mute rules, acknowledgement audit. Route overlay on globe. AirRiskPanel by FIR/time slice. Route intersection advisory (avoidance boundary only). | **Legal/regulatory**: MSA template finalised by counsel; Regulatory Sandbox Agreement template finalised. First ANSP shadow deployment executed under signed Regulatory Sandbox Agreement and confirmed legal clearance. GDPR breach notification procedure tested (tabletop exercise). Professional indemnity, cyber liability, and product liability insurance confirmed in place. **SRE**: Patroni failover tested (primary killed; standby promotes; backend reconnects; verify zero lost predictions). Redis Sentinel failover tested. SLO baseline measurements taken on Tier 3 hardware. 
| +| 25–26 | Feedback: prediction vs. outcome. Density scaling recalibration. Maneuver detection. Shadow validation report generation. Historical replay corpus: Long March 5B, Columbia-derived cloud case, and documented crossing-scenario set. Conservative-baseline comparison reporting for airspace closures. Launch safety module. Deployment freeze gate (CI/CD: block deploy if CRITICAL/HIGH alert active). ANSP communication plan implemented (degradation push + email). Incident response runbooks written (DB failover, Celery recovery, HMAC failure, ingest failure). | Prediction accuracy dashboard. Historical comparison. ShadowValidationReport. Air-risk replay comparison views. `/space` Persona F workspace. Launch safety portal. | Vault / cloud secrets manager. Secrets rotation. Begin first ANSP shadow mode deployment. **SRE**: PagerDuty/OpsGenie integrated with Prometheus AlertManager. SEV-1/2/3/4 routing configured. First on-call rotation established. | +| 27–28 | Mode C binary MC endpoint. Load testing (100 users, <2s CZML p95; MC p95 < 240s). **Prometheus + Grafana**: three dashboards (Operational Overview, System Health, SLO Burn Rate). Full AlertManager rules. ECSS compliance artefacts: SMP, VVP, PAP, DMP. MinIO lifecycle rules: MC blobs > 90 days → cold tier. | Mode C (Monte Carlo Particles). UncertaintyModeSelector unlocks Mode C. Final Playwright E2E suite. Grafana Operational Overview embedded in `/admin`. | **External penetration test** (auth bypass, RBAC escalation, SSRF, XSS→Playwright, WS auth bypass, data integrity, object ownership bypass, API key abuse). All Critical/High remediated. Load test: SLO p95 targets verified under 100-user concurrent load. | +| 29–32 | Regulatory acceptance package: safety case framework, ICAO data quality mapping, shadow validation evidence, SMS integration guide. TRL 6 demonstration. Data archival pipeline (Parquet export to MinIO cold before chunk drop). Storage growth verified against projections. 
**ESA bid legal**: background IP schedule documented; Consortium Agreement with academic partner signed (IP ownership, publication rights, revenue share); SBOM submitted as part of ESA artefact package. ECCN classification determination received; export screening process in place for all new customer registrations. ToS version updated to reflect any regulatory feedback from first ANSP deployments; re-acceptance triggered. | Regulatory submission report type. TRL demonstration artefacts. | SOC 2 Type I readiness review. Production runbook + incident response per threat scenario. ECSS compliance review. Monthly restore test passing in CI. Error budget dashboard showing < 10% burn rate. | + +--- + +## 20. Key Decisions and Tradeoffs + +| Decision | Chosen | Alternative Considered | Rationale | +|----------|--------|----------------------|-----------| +| Propagator split | SGP4 catalog + numerical decay | SGP4 for everything | SGP4 diverges by days–weeks for re-entry time prediction | +| Numerical integrator | RK7(8) adaptive + NRLMSISE-00 | poliastro Cowell | Direct force model control | +| Frame library | `astropy` | Manual SOFA Fortran | Handles IERS EOP; well-tested IAU 2006 | +| Atmospheric density | NRLMSISE-00 (P1), JB2008 option (P2) | Simple exponential | Community standard; captures solar cycle | +| Breakup model | Simplified ORSAT-like | Full DRAMA/SESAM | DRAMA requires licensing; simplified recovers ~80% utility | +| Uncertainty visualisation | Three modes, phased (A→B→C), user-selectable | Single fixed mode | Serves different personas; operational users need corridors, analysts need heatmaps | +| JWT algorithm | RS256 (asymmetric) | HS256 (shared secret) | Compromise of one service does not expose signing key to all services | +| Token storage | httpOnly Secure SameSite=Strict cookie | localStorage | XSS cannot read httpOnly cookies; localStorage is trivially exfiltrated | +| Token revocation | DB `refresh_tokens` table | Redis-only | Revocations survive 
restarts; enables rotation-chain audit | +| MFA | TOTP (RFC 6238) required for all roles | Optional MFA | Aviation authority context; government procurement baseline | +| Secrets management | Docker secrets (P1 prod) → Vault (P3) | Env vars only | Env vars appear in process listings and crash dumps; no audit trail | +| Alert integrity | Backend-only generation on verified data | Client-triggered alerts | Prevents false alert injection via API | +| Prediction integrity | HMAC-signed, immutable after creation | Mutable with audit log | Tamper-evident at database level; modification is impossible, not just logged | +| Multi-tenancy | RLS at database layer + `organisation_id` | Application-layer only | DB-level enforcement cannot be bypassed by application bugs | +| Renderer isolation | Separate `renderer` container, no external network | Playwright in backend container | Limits blast radius of XSS→SSRF escalation | +| Server state | TanStack Query | Zustand for everything | Automatic cache, background refetch; Zustand is not a data cache | +| Navigation model | Task-based (events, airspace, analysis) | Module-based | Users think in tasks, not modules | +| Report rendering | Playwright headless server-side | Client-side canvas | Reliable at print resolution; consistent; not affected by client GPU | +| Monorepo | Monorepo | Separate repos | Small team, shared types, simpler CI | +| ORM | SQLAlchemy 2.0 | Raw SQL | Mature async support; Alembic migrations | +| Domain architecture | Dual front door (aviation + space portal), shared physics core | Single aviation-only product | Space operator revenue stream; ESA bid credibility; space credibility supports aviation trust | +| Space operator object scoping | PostgreSQL RLS on `owned_objects` join | Application-layer filtering only | DB-level enforcement; prevents application bugs from leaking cross-operator data | +| NOTAM output | Draft only + mandatory disclaimer; never submitted | System-assisted NOTAM submission | 
SpaceCom is not a NOTAM originator; keeps platform in purely informational role; reduces regulatory approval burden | +| Reroute module scope | Strategic pre-flight avoidance boundary only | Specific alternate route generation | Specific routes require ATC integration and aircraft performance data SpaceCom does not have; avoidance boundary keeps SpaceCom legally defensible | +| Shadow mode | Org-level flag; all alerts suppressed; records segregated | Per-prediction flag | Enables ANSP trial deployments; accumulates validation evidence for regulatory acceptance; segregation prevents operational confusion | +| Controlled re-entry planner output | CCSDS-format manoeuvre plan + risk-scored deorbit windows | Aviation-format only | Space operators submit to national regulators and ops centres in CCSDS; Zero Debris Charter evidence format | +| API access | Separate API keys (not session JWT); per-key rate limiting | Session cookie only | Space operators integrate SpaceCom into operations centres programmatically; API keys are revocable machine credentials | +| MC parallelism model | Celery `group` + `chord` (fan-out sub-tasks across worker pool) | `multiprocessing.Pool` within single task | Chord distributes across all worker containers; Pool limited to one container's cores; chord scales horizontally | +| Worker topology | Two separate Celery pools: `ingest` and `simulation` | Single shared queue | Runaway simulation jobs cannot starve TLE ingestion; critical for reliability during active TIP events | +| Celery Beat HA | `celery-redbeat` (Redis-backed, distributed locking) | Standard Celery Beat (single process) | Beat SPOF means scheduled ingest silently stops; redbeat enables multiple instances with leader election | +| DB HA | TimescaleDB streaming replication + Patroni auto-failover | Single-instance DB | RPO = 0 for critical tables; 15-minute RTO requires automatic failover, not manual | +| Redis HA | Redis Sentinel (3 nodes) | Single Redis | Master failure without 
Sentinel means all Celery queues and WebSocket pub/sub stop | +| Deployment gate | CI/CD checks for active CRITICAL/HIGH alerts before deploying | Manual judgement | Prevents deployments during active TIP events; protects operational continuity | +| MC blade sizing | 16 vCPU per simulation worker container | Smaller containers | MC chord sub-tasks fill all available cores; below 16 cores p95 SLO of 240s is not met | +| Temporal uncertainty display | Plain window range ("08h–20h from now / most likely ~14h") for Persona A/C; p05/p50/p95 UTC for Persona B | `± Nh` notation everywhere | `±` implies symmetric uncertainty which re-entry distributions are not; window range is operationally actionable | +| Space weather impact communication | Operational buffer recommendation ("+2h beyond 95th pct") rather than % deviation | Percentage string | Percentage is meaningless without a known baseline; buffer hours are immediately usable by an ops duty manager | +| TLS termination | Caddy with automatic ACME (internet-facing) / internal CA (air-gapped) | nginx + manual certs | Caddy handles cert lifecycle automatically; decision tree in §34 | +| Pagination | Cursor-based `(created_at, id)` | Offset-based | Offset degrades to full-table scan at 7-year retention depth; cursor is O(1) regardless of dataset size | +| CZML delta protocol | `?since=` parameter; max 5 MB full payload; `X-CZML-Full-Required` header on stale client | Full catalog always | 100-object catalog at 1-min cadence is ~10–50 MB/hr per connected client without delta; delta reduces this to <500 KB/hr | +| MC concurrency gate | Per-org Redis semaphore; 1 concurrent MC run (Phase 1); `429 + Retry-After` on limit | Unbounded fan-out | 5 concurrent MC requests = 2,500 sub-tasks queued; p95 SLO collapses without backpressure | +| TimescaleDB `compress_after` | 7 days for `orbits` (not 1 day) | Compress as soon as possible | Compressing hot chunks forces decompress on every write; 1-day compress_after causes 50–200ms 
write latency thrash | +| Renderer memory limit | `mem_limit: 4g` Docker cap on renderer container | No memory limit | Chromium print rendering at A4/300DPI consumes 2–4 GB; 4 uncapped renderer instances can OOM a 32 GB node | +| Static asset caching | Cloudflare CDN (internet-facing); nginx sidecar (on-premise) | No CDN | CesiumJS bundle ~5–10 MB; 100 concurrent first-load = 500 MB–1 GB burst without caching | +| WAF/DDoS protection | Upstream provider (Cloudflare/AWS Shield) for internet-facing; network perimeter for air-gapped | Application-layer rate limiting only | Application-layer is insufficient for volumetric attacks; must be at ingress | +| Multi-region deployment | Single region per customer jurisdiction; separate instances, not shared cluster | Active-active multi-region | Data sovereignty; simpler compliance certification; Phase 1–3 customer base doesn't justify multi-region cost | +| MinIO erasure coding | EC:2 (4-node) | EC:4 or RAID | EC:2 tolerates 1 write failure / 2 read failures; balanced between protection and storage efficiency at 4 nodes | +| DB connection routing | PgBouncer as single stable connection target | Direct Patroni primary connection | Patroni failover transparent to application; stable DNS target through primary changes | +| Egress filtering | Host-level UFW/nftables allow-list (Tier 2); Calico/Cilium network policy (Tier 3) | Trust Docker network isolation | Docker isolation is inter-network only; outbound internet egress unrestricted without host-level filtering | +| Mode-switch dialogue | Explicit current-mode + target-mode + consequences listed; Cancel left, destructive action right | Generic "Are you sure?" 
| Aviation HMI conventions; listed consequences prevent silent simulation-during-live error | +| Future-preview temporal wash | Semi-transparent overlay + persistent label on event list when timeline scrubber is not at current time | No visual distinction | Prevents controller from acting on predicted-future data as though it is current operational state | +| Simulation block during active alerts | Optional org-level `disable_simulation_during_active_events` flag | Always allow simulation entry | Prevents an analyst accidentally entering simulation while CRITICAL alerts require attention in the same ops room | +| Prediction superseding | Write-once `superseded_by` FK on `reentry_predictions` / `simulations` | Mutable or delete | Preserves immutability guarantee; gives analysts a way to mark outdated predictions without removing the audit record | +| CRITICAL acknowledgement gate | 10-character minimum free-text field; two-step confirmation modal | Single click | Prevents reflexive acknowledgement; creates meaningful action record for every acknowledged CRITICAL event | +| Multi-ANSP coordination panel | Shared acknowledgement status and coordination notes across ANSP orgs on the same event | Out-of-band only | Creates shared digital situational awareness record without replacing voice coordination; reduces risk of conflicting parallel NOTAMs | +| Legal opinion timing | Phase 2 gate (before shadow deployment); not Phase 3 | Phase 3 task | Common law duty of care may attach regardless of UI disclaimers; liability limitation must be in executed agreements before any ANSP relies on the system | +| Commercial contract instruments | Three instruments: MSA + AUP click-wrap + API Terms | Single platform ToS | Each instrument addresses a different access pathway; API access by Persona E/F must have separate terms recorded against the key | +| Shadow mode legal gate | `legal_opinions.shadow_mode_cleared` must be TRUE before shadow mode can be activated for an org | Admin can 
enable freely | Shadow deployment is a formal regulatory activity; without a completed legal opinion it exposes SpaceCom to uncapped liability in the deployment jurisdiction | +| GDPR erasure vs. retention | Pseudonymise user references in append-only tables on erasure request; never delete safety records | Hard delete on request | UN Liability Convention requires 7-year retention; GDPR right to erasure is satisfied by removing the link to the individual, not the record itself | +| Space-Track data redistribution | Obtain written clarification from 18th SCS before exposing TLE/CDM data via the SpaceCom API | Assume permissible | Space-Track AUP prohibits redistribution to unregistered parties; violation could result in loss of Space-Track access, disabling the platform's primary data source | +| OSS licence compliance | CesiumJS commercial licence required for closed-source deployment; SBOM generated from Phase 1 | Assume all dependencies are permissively licensed | CesiumJS AGPLv3 requires source disclosure for network-served applications; undiscovered licence violations create IP risk in ESA bid | +| Insurance | Professional indemnity + cyber liability + product liability required before operational deployment | No insurance requirement | Aviation safety context; potential claims from incorrect predictions that inform airspace decisions could exceed SpaceCom's balance sheet without coverage | +| Connection pooling | PgBouncer transaction-mode pooler between all app services and TimescaleDB | Direct connections from app | Tier 3 connection count (2× backend + 4× workers + 2× ingest) exceeds `max_connections=100` without a pooler; Patroni failover updates only pgBouncer | +| Redis eviction policy | `noeviction` for Celery/redbeat (separate DB index); `allkeys-lru` for application cache | Single Redis with one policy | Broker message eviction causes silent job loss; cache eviction is acceptable | +| Bulk export implementation | Celery task → MinIO → presigned URL 
(async offload pattern) | Streaming response from API handler | Full catalog export can be gigabytes; materialising in API handler risks OOM on the backend container | +| Analytics query routing | Patroni standby replica for Persona B/F analytics; primary for operational reads | All reads to primary | Analytics queries during a TIP event would compete with operational reads on the primary; standby already provisioned at Tier 3 | +| SQLAlchemy lazy loading | `lazy="raise"` on all relationships | Default lazy loading | Async SQLAlchemy silently blocks the event loop on lazy-loaded relationships; `raise` converts silent N+1s into loud development-time errors | +| CZML cache strategy | Per-object fragment cache + full catalog assembly; TTL keyed to last propagation job | No cache; query DB on each request | CZML catalog fetch at 100 objects = 864k rows; uncached this misses the 2s p95 SLO under concurrent load | +| Hypertable chunk interval (`orbits`) | 1-day chunks (not default 7-day) | Default 7-day | A 72h CZML query spans 3 × 1-day chunks versus 11 × 7-day chunks; chunk exclusion is far less effective with the default | +| Continuous aggregate for F10.7 81-day avg | TimescaleDB continuous aggregate `space_weather_daily` | Compute from raw rows per request | At 100 concurrent users, 100 identical scans of 11,664 raw rows; continuous aggregate reduces this to a single-row lookup | +| CI/CD orchestration | Self-hosted GitLab CI | Jenkins / GitHub Actions | Delivery platform is self-hosted GitLab; CI, container registry, and merge-request gates live in one system; no external CI dependency for on-premise or air-gapped deployments | +| Container image tags | `sha-` as canonical immutable tag; semantic version alias for releases | `latest` tag only | `latest` is mutable and non-reproducible; `sha-` gives exact traceability from deployed image back to source commit | +| Multi-stage Docker builds | Builder stage (full toolchain) + runtime stage (distroless/slim) | Single-stage with all tools | Eliminates build toolchain, compiler, and dev dependencies from production image; 
typically reduces image size by 60–80% | +| Local dev hot-reload | Backend: FastAPI `--reload` via bind-mounted `./backend` volume; Frontend: Next.js Fast Refresh (HMR) | Rebuild container on change | Full container rebuild per code change adds 30–90s per iteration; volume mount + process reload is < 1s | +| `.env.example` contract | `.env.example` with all required variables, descriptions, and stage flags committed to repo; actual `.env` in `.gitignore` | Ad-hoc variable discovery from runtime errors | Engineers must be able to run `cp .env.example .env` and have a working local stack within 15 minutes of cloning | +| Staging environment strategy | `main` branch continuously deployed to staging via GitLab CI; production deploy requires manual approval gate after staging smoke tests pass | Manual staging deploys | Reduces time-to-detect integration regressions; staging serves as TRL artefact evidence environment | +| Secrets rotation | Per-secret rotation runbook: Space-Track credentials, JWT signing keys, ANSP tokens; old + new key both valid during 5-minute transition window; `security_logs` entry required; rotated via Vault dynamic secrets in Phase 3 | Manual rotation with downtime | Aviation context: key rotation must not cause service interruption; zero-downtime rotation is a reliability requirement, not a convenience | +| Build cache strategy | Docker layer cache: `--cache-from` targeting the GitLab container registry; pip wheel cache: GitLab CI `cache:` keyed on `requirements.txt` hash; npm cache: GitLab CI `cache:` keyed on `package-lock.json` hash | No cache; full rebuild each push | Without cache, a full rebuild takes 8–12 minutes; with cache, incremental pushes take 2–3 minutes — critical for CI as a useful merge gate | +| Image retention policy | Tagged release images kept indefinitely; untagged/orphaned images purged weekly via GitLab container registry cleanup policy; staging images retained 30 days; dev branch images retained 7 days | No policy; manual cleanup | Unmanaged registry 
storage grows unboundedly; stale images also represent unaudited CVE surface | +| Pre-commit hook completeness | Six hooks: `detect-secrets`, `ruff`, `mypy`, `hadolint`, `prettier`, `sqlfluff` | `git-secrets` only | `git-secrets` scans only for known secret patterns; `detect-secrets` uses entropy analysis; `hadolint` prevents insecure Dockerfile patterns; `sqlfluff` catches migration anti-patterns before code review | +| `alembic check` in CI | CI job runs `alembic check` to detect SQLAlchemy model/migration divergence; fails if models have unapplied changes | Only run migrations, no divergence check | SQLAlchemy models can diverge from migrations silently; `alembic check` catches the gap before it reaches production | +| FIR boundary data source | EUROCONTROL AIRAC (ECAC states) + FAA Digital-Terminal Procedures (US) + OpenAIP (fallback); 28-day update cadence | Manually curated GeoJSON, updated ad hoc | FIR boundaries change on AIRAC cycles; stale boundaries produce wrong airspace intersection results during live TIP events | +| ADS-B data source | OpenSky Network REST API (Phase 3 MVP); commercial upgrade path to Flightradar24 or FAA SWIM ADS-B if required | Direct receiver hardware | OpenSky is free, global, and sufficient for route overlay and intersection advisory; commercial upgrade only if coverage gaps identified in ANSP trials | +| CCSDS OEM reference frame | GCRF (Geocentric Celestial Reference Frame); time system UTC; `OBJECT_ID` = NORAD catalog number; missing international designator populated as `UNKNOWN` | ITRF or TEME | GCRF is the standard output of SpaceCom's frame transform pipeline; downstream mission control tools expect GCRF for propagation inputs | +| CCSDS CDM field population | SpaceCom populates: HEADER, RELATIVE_METADATA, OBJECT1/2 identifiers, state vectors, covariance (if available); fields not held by SpaceCom emitted as `N/A` per CCSDS 508.0-B-1 §4.3 | Omit empty fields | `N/A` is the CCSDS-specified sentinel for unknown values; 
silent omission causes downstream parser failures | +| CDM ingestion display | Space-Track CDM Pc displayed alongside SpaceCom-computed Pc with explicit provenance labels; > 10× discrepancy triggers `DATA_CONFIDENCE` warning on conjunction panel | Show only one value | Space operators need both values; discrepancy without explanation erodes trust in both | +| WebSocket event schema | Typed event envelope with `type` discriminator, monotonic `seq`, and `ts`; reconnect with `?since_seq=` replay of up to 200 events / 5-minute ring buffer; `resync_required` on stale reconnect | Schema-free JSON stream | Untyped streams require every consumer to reverse-engineer the schema; schema enables typed client generation | +| Alert webhook delivery | At-least-once POST to registered HTTPS endpoint; HMAC-SHA256 signature; 3 retries with exponential backoff; `degraded` status after 3 failures; auto-disable after 10 consecutive failures | WebSocket / email only | ANSPs with existing dispatch infrastructure (AFTN, internal webhook receivers) cannot integrate via browser WebSocket; webhooks are the programmatic last-mile | +| API versioning | `/api/v1` base; breaking changes require `/api/v2` parallel deployment; 6-month support overlap; `Deprecation` / `Sunset` headers (RFC 8594); 3-month written notice to API key holders | No versioning policy; breaking changes deployed ad hoc | Space operators building operations centre integrations need stable contracts; silent breaking changes disable their integrations | +| SWIM integration path | Phase 2: GeoJSON structured export; Phase 3: FIXM review + EUROCONTROL SWIM-TI AMQP publish endpoint | Not applicable | European ANSP procurement increasingly requires SWIM compatibility; GeoJSON export is low-cost first step; full SWIM-TI is Phase 3 | +| Space-Track API contract test | Integration test asserts expected JSON keys present in Space-Track response; ingest health alert fires after 4 consecutive hours with 0 successful Space-Track records 
| No contract test; breakage discovered at runtime | Space-Track API has had historical breaking changes; silent format change means ingest returns no data while health metrics appear normal | +| TLE checksum validation | Modulo-10 checksum on both lines verified before DB write; BSTAR range check; failed records logged to `security_logs` type `INGEST_VALIDATION_FAILURE` | Accept TLE at face value | Corrupted TLEs (network errors, encoding issues) would propagate incorrect state vectors without validation | +| Model card | `docs/model-card-decay-predictor.md` maintained alongside the model; covers validated orbital regime envelope, known failure modes, systematic biases, and performance by object type | Accuracy statement only in §24.3 | Regulators and ANSPs require a documented operational envelope, not just a headline accuracy figure; ESA TRL artefact requirement | +| Historical backcast selection | Validation report explicitly documents selection criteria, identifies underrepresented object categories, and states accuracy conditional on object type | Single unconditional accuracy figure | Observable re-entry population is biased toward large well-tracked objects; publishing an unconditional accuracy figure misrepresents model generalisation | +| Out-of-distribution detection | `ood_flag = TRUE` and `ood_reason` set at prediction time if any input falls outside validated bounds; UI shows mandatory warning callout | Serve all predictions identically | NRLMSISE-00 calibration domain does not include tumbling objects, very high area-to-mass ratio, or objects with no physical property data | +| Prediction staleness warning | `prediction_valid_until` = `p50_reentry_time - 4h`; UI warns independently of system-level TLE staleness if `NOW() > prediction_valid_until` and not superseded | No time-based staleness on predictions | An hours-old prediction for an imminent re-entry has implicitly grown uncertainty; operators need a signal independent of the system health 
banner | +| Alert threshold governance | Thresholds documented with rationale; change approval requires engineering lead sign-off + shadow-mode validation period; change log maintained in `docs/alert-threshold-history.md` | Thresholds set in code with no governance | CRITICAL trigger (window < 6h, FIR intersection) has airspace closure consequences; undocumented threshold changes cannot be reviewed by regulators or ANSPs | +| FIR intersection auditability | `alert_events.fir_intersection_km2` and `intersection_percentile` recorded at alert generation; UI shows "p95 corridor intersects ~N km² of FIR XXXX" | Alert log shows only "intersects FIR XXXX" | Intersection without area and percentile context is not auditable; regulators and ANSPs need to know *how much* intersection triggered the alert | +| Recalibration governance | Recalibration requires hold-out validation dataset, minimum accuracy improvement threshold, sign-off authority, rollback procedure, and notification to ANSP shadow partners | Recalibration run and deployed without gates | Unchecked recalibration can silently degrade accuracy for object types not in the calibration set | +| Model version governance | Changes classified as patch/minor/major; major changes require active prediction re-runs with supersession + ANSP notification; rollback path documented | No governance; model updated silently | A major model version change producing materially different corridors without re-running active predictions creates undocumented divergence between what ANSPs are seeing and current best predictions | +| Adverse outcome monitoring | `prediction_outcomes` table records observed re-entry outcomes against predictions; quarterly accuracy report generated from feedback pipeline; false positive/negative rates in Grafana | No post-deployment accuracy tracking | Without outcome monitoring SpaceCom cannot demonstrate performance within acceptable bounds to regulators; shadow validation reports are episodic, not 
continuous | +| Geographic coverage annotation | FIR intersection results carry `data_coverage_quality` flag per FIR; OpenAIP-sourced boundaries flagged as lower confidence | All FIR intersections treated equally | AIRAC coverage varies by region; operators in non-ECAC regions receive lower-quality intersection assessments without knowing it | +| Public transparency report | Quarterly aggregate accuracy/reliability report published (no personal data); covers prediction count, backcast accuracy, error rates, known limitations | No public reporting | Civil aviation safety tools operate in a regulated transparency environment; ESA bid credibility and regulatory acceptance require demonstrable performance | +| `docs/` directory structure | Canonical tree defined in §12.1; all documentation files live at known paths committed to the repo | Ad-hoc file creation by individual engineers | Documentation that exists only in prose references gets created inconsistently or not at all | +| Architecture Decision Records | MADR-format ADRs in `docs/adr/`; one per consequential decision in §20; linked from relevant code via inline comment | §20 table in master plan only | Engineers working in the repo cannot find decision rationale without reading a 5000-line plan document | +| OpenAPI documentation standard | Every public endpoint has `summary`, `description`, `tags`, and at least one `responses` example; enforced by CI check | Auto-generated stubs only | Auto-generation produces syntactically correct docs that are useless to API integrators (Persona E/F) | +| Runbook format | Standard template in `docs/runbooks/TEMPLATE.md`; required sections: Trigger, Severity, Preconditions, Steps, Verification, Rollback, Notify; runbook index maintained | Free-form runbooks written ad-hoc | Runbooks written under pressure without a template consistently omit the rollback and notification steps | +| Docstring standard | Google-style docstrings required on all public functions in `propagator/`, 
`reentry/`, `breakup/`, `conjunction/`, `integrity.py`; parameters include physical units | No docstring requirement | Physics functions without units and limitations documented cannot be reviewed or audited by third-party evaluators for ESA TRL | +| Validation procedure | §17 specifies reference data location, run commands, pass/fail tolerances per suite; `docs/validation/README.md` describes how to add new cases | Checklist of what to validate without procedure | A third party cannot reproduce the validation without knowing where the reference data is and what tolerance constitutes a pass | +| User documentation | Phase 2 delivers aviation portal guide + API quickstart; Phase 3 delivers space portal guide + in-app contextual help; stored in `docs/user-guides/` | No user documentation | ANSP SMS acceptance requires user documentation; aviation operators cannot learn an unfamiliar safety tool from the UI alone | +| `CHANGELOG.md` format | Keep a Changelog conventions; human-maintained; one entry per release with `Added/Changed/Deprecated/Removed/Fixed/Security` sections | No format specified | Changelogs written by different engineers without a format are unusable by operators and regulators | +| `AGENTS.md` | Project-root file defining behaviour guidance for AI coding agents; specifies codebase conventions, test requirements, and safety-critical file restrictions; committed to repo | Untracked file, undefined purpose | An undocumented AGENTS.md is either ignored or followed inconsistently, undermining its purpose | +| Test documentation | Module docstrings on physics/security test files state the invariant, reference source, and operational significance of failure; `docs/test-plan.md` lists all suites with scope and blocking classification | No test documentation requirement | ECSS-Q-ST-80C requires a test specification as a separate deliverable from the test code | + +--- + +## 21. 
Definition of Done per Phase + +### Phase 1 Complete When: +**Physics and data:** +- [ ] 100+ real objects tracked with current TLE data +- [ ] Frame transformation unit tests pass against IERS/Vallado reference cases (round-trip error < 1 m) +- [ ] SGP4 CZML uses J2000 INERTIAL frame (not TEME) +- [ ] Space weather polled from NOAA SWPC; cross-validated against ESA SWS; operational status widget visible +- [ ] TIP messages ingested and displayed for decaying objects +- [ ] TLE cross-validation flags discrepancies > threshold for human review +- [ ] IERS EOP hash verification passing +- [ ] Decay predictor: ≥3 historical re-entry backcast windows overlap actual events +- [ ] Mode A (Percentile Corridors): p05/p50/p95 swaths render with correct visual encoding +- [ ] TimelineGantt displays all active events; click-to-navigate functional +- [ ] LIVE/REPLAY/SIMULATION mode indicator correct on all pages + +**Security (all required before Phase 1 is considered complete):** +- [ ] RBAC enforced: automated `test_rbac.py` verifies every endpoint returns 403 for insufficient role, 401 for unauthenticated +- [ ] JWT RS256 with httpOnly cookies; `localStorage` token storage absent from codebase (grep check in CI) +- [ ] MFA (TOTP) enforced for all roles; recovery codes functional +- [ ] Rate limiting: 429 responses verified by integration tests for all configured limits +- [ ] Simulation parameter range validation: out-of-range values return 400 with clear message +- [ ] Prediction HMAC: tamper test (direct DB row modification) triggers 503 + CRITICAL security_log entry +- [ ] `alert_events` append-only trigger: UPDATE/DELETE raise exception (verified by test) +- [ ] `reentry_predictions` immutability trigger: same (verified by test) +- [ ] Redis AUTH enabled; default user disabled; ACL per service verified +- [ ] MinIO: all buckets verified private; direct object URL returns 403; pre-signed URL required +- [ ] Docker: all containers verified non-root (`docker inspect` check 
in CI) +- [ ] Docker: network segmentation verified — frontend container cannot reach database port +- [ ] Bandit: 0 High severity findings in CI +- [ ] ESLint security: 0 High findings in CI +- [ ] Trivy: 0 Critical/High CVEs in all container images +- [ ] CSP headers present on all pages; verified by Playwright E2E test +- [ ] axe-core: 0 critical, 0 serious violations on all pages (CI check) +- [ ] WCAG 2.1 AA colour contrast: automated check passes + +**UX:** +- [ ] Globe: object clustering active at global zoom; urgency symbols correct (colour-blind-safe) +- [ ] DataConfidenceBadge visible on all object detail and prediction panels +- [ ] UncertaintyModeSelector visible; Mode B/C greyed with "Phase 2/3" label +- [ ] JobsPanel shows live sample progress for running decay jobs +- [ ] Shared deep links work: `/events/{id}` loads correct event; globe focuses on corridor +- [ ] All pages keyboard-navigable; modal focus trap verified +- [ ] Report generation: Operational Briefing type functional; PDF includes globe corridor map + +**Human Factors (Phase 1 items — all required before Phase 1 is considered complete):** +- [ ] Event cards display window range notation (`Window: Xh–Yh from now / Most likely ~Zh from now`); no `±` notation appears in operational-facing UI (grep check) +- [ ] Mode-switch dialogue: switching to SIMULATION shows current mode, target mode, and "alerts suppressed" consequence; Cancel left, Switch right; Playwright E2E test verifies dialogue content +- [ ] Future-preview temporal wash: dragging timeline scrubber past current time applies overlay and `PREVIEWING +Xh` label to event panel; alert badges show "(projected)"; verified by Playwright test +- [ ] CRITICAL acknowledgement: two-step flow (banner → confirmation modal); Confirm button disabled until `Action taken` field ≥ 10 characters; verified by Playwright test +- [ ] Audio alert: non-looping two-tone chime plays once on CRITICAL alert; stops on acknowledgement; does not play in 
SIMULATION or REPLAY mode; verified by integration test with audio mock +- [ ] Alert storm meta-alert: > 5 CRITICAL alerts within 1 hour generates Persona D meta-alert with disambiguation prompt (verified by test with synthetic alerts) +- [ ] Onboarding state: new organisation with no FIRs configured sees three-card setup prompt on first login (Playwright test) +- [ ] Degraded mode banner: `/readyz` 207 response triggers correct per-degradation-type operational guidance text in UI (integration test for each degradation type: space weather stale, TLE stale) +- [ ] `superseded_by` constraint: setting `superseded_by` on a prediction a second time raises DB exception (integration test); UI shows `⚠ Superseded` banner on any prediction where `superseded_by IS NOT NULL` + +**Legal / Compliance (Phase 1 items — all required before Phase 1 is considered complete):** +- [ ] **Space-Track AUP architectural decision gate (Finding 9):** Written AUP clarification obtained from 18th Space Control Squadron or legal counsel opinion. `docs/adr/0016-space-track-aup-architecture.md` committed with Path A (shared ingest) or Path B (per-org credentials) decision recorded and evidenced. Ingest architecture finalised accordingly. This is a blocking Phase 1 decision — ingest code must not be written until the path is decided. 
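Once the AUP path is decided and ingest work starts, the modulo-10 TLE checksum gate called for in §20 is small enough to sketch. This is an illustrative sketch, not the project's actual ingest module; `tle_checksum`, `validate_tle_line`, and `ISS_LINE1` are hypothetical names:

```python
# Sketch of the modulo-10 TLE line checksum verified before any DB write.
# Per the TLE format: digits in columns 1-68 count as their value,
# '-' counts as 1, all other characters (letters, '.', '+', spaces) count as 0;
# column 69 holds the checksum digit.

def tle_checksum(line: str) -> int:
    """Return the modulo-10 checksum over the first 68 columns."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


def validate_tle_line(line: str) -> bool:
    """A TLE line is exactly 69 characters; column 69 is the checksum."""
    return (
        len(line) == 69
        and line[68].isdigit()
        and tle_checksum(line) == int(line[68])
    )


# Well-known public ISS (ZARYA) TLE line 1; its checksum digit is 7.
ISS_LINE1 = (
    "1 25544U 98067A   08264.51782528 -.00002182"
    "  00000-0 -11606-4 0  2927"
)
```

Any flipped digit or truncation shifts the column-69 sum and fails validation; in the plan above, such a record is logged to `security_logs` as `INGEST_VALIDATION_FAILURE` rather than written to the catalog.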
+- [ ] ToS / AUP / Privacy Notice acceptance gate: first login blocks dashboard access until all three documents are accepted; `users.tos_accepted_at`, `users.tos_version`, `users.tos_accepted_ip` populated on acceptance (integration test: an authenticated user attempting to skip acceptance receives 403) +- [ ] ToS version change triggers re-acceptance: bump `tos_version` in config; verify existing users are blocked on next login until they re-accept (integration test) +- [ ] **CesiumJS commercial licence executed** and stored at `legal/LICENCES/cesium-commercial.pdf`; `legal_clearances.cesium_commercial_executed = TRUE` — **blocking gate for any external demo** (§29.11 F1) +- [ ] SBOM generated at build time via `syft` (SPDX-JSON, container image) + `pip-licenses` + `license-checker-rseidelsohn` (dependency manifests); stored in `docs/compliance/sbom/` as versioned artefacts; all dependency licences reviewed against `legal/OSS_LICENCE_REGISTER.md`; CI `pip-licenses --fail-on` gate includes GPL/AGPL/SSPL; no unapproved licence in transitive closure (§29.11 F2, F10) +- [ ] `legal/LGPL_COMPLIANCE.md` created documenting poliastro LGPL dynamic linking compliance and PostGIS GPLv2 linking exception (§29.11 F4, F9) +- [ ] `legal/LICENCES/timescaledb-licence-assessment.md` and `legal/LICENCES/redis-sspl-assessment.md` created with licence assessment sign-off (§29.11 F5, F6) +- [ ] `legal_opinions` table present in schema; admin UI shows legal clearance status per org; shadow mode toggle displays warning if `shadow_mode_cleared = FALSE` +- [ ] GDPR breach notification procedure documented in the incident response runbook; tabletop exercise completed with the engineering team + +**Infrastructure / DevOps (all required before Phase 1 is considered complete):** +- [ ] Docker Compose starts full stack with single command (`make dev`) +- [ ] `make test` executes pytest + vitest in one command; all tests pass on a clean clone +- [ ] `make migrate` runs all Alembic migrations against a fresh DB 
without error +- [ ] `make seed` loads fixture data; globe shows test objects on first load +- [ ] `.env.example` present with all required variables documented; a new engineer can reach a working local stack in ≤ 15 minutes +- [ ] Multi-stage Dockerfiles in place for backend, worker, renderer, and frontend: builder stage uses full toolchain; runtime stage is distroless/slim; `docker inspect` confirms no build tools (gcc, pip, npm) present in runtime image +- [ ] All containers run as non-root UID (baked in Dockerfile `USER` directive — not set at runtime); verified by `docker inspect` check in CI +- [ ] Self-hosted GitLab CI pipeline exists with jobs: `lint` (pre-commit all hooks), `test-backend` (pytest), `test-frontend` (vitest + Playwright), `security-scan` (Bandit + Trivy + ESLint security), `build-and-push` (multi-stage build -> GitLab container registry with `sha-` tag) +- [ ] `.pre-commit-config.yaml` committed with all six hooks; CI re-runs all hooks and fails if any fail +- [ ] `alembic check` step in CI fails if SQLAlchemy models have unapplied changes +- [ ] Build cache: Docker layer cache, pip wheel cache, npm cache all configured in GitLab CI; incremental push CI time < 4 minutes +- [ ] pytest suite: frame utils, integrity, auth, RBAC, propagator, decay, space weather, ingest, API integration +- [ ] Playwright E2E: mode switch, alert acknowledge, CZML render, job progress, report generation, CSP headers +- [ ] Port exposure CI check: `scripts/check_ports.py` passes with no never-exposed port in a `ports:` mapping +- [ ] Caddy TLS active on local dev stack with self-signed cert or ACME staging cert; HSTS header present (`Strict-Transport-Security: max-age=63072000`); TLS 1.1 and below not offered (verified by `nmap --script ssl-enum-ciphers`) +- [ ] `docs/runbooks/egress-filtering.md` exists documenting the allowed outbound destination whitelist; implementation method (UFW/nftables) noted + +**Performance / Database (Phase 1 items — all required before 
Phase 1 is considered complete):** +- [ ] pgBouncer in Docker Compose; all app services connect via pgBouncer (not directly to TimescaleDB); verified by `netstat` or connection-source query showing only pgBouncer IPs in `pg_stat_activity` +- [ ] All required indexes present: `orbits_object_epoch_idx`, `reentry_pred_object_created_idx`, `alert_events_unacked_idx`, `reentry_pred_corridor_gist`, `hazard_zones_polygon_gist`, `fragments_impact_gist`, `tle_sets_object_ingested_idx` — verified by `\d+` or `pg_indexes` query +- [ ] `orbits` hypertable chunk interval set to 1 day; `space_weather` to 30 days; `tle_sets` to 7 days — verified by `timescaledb_information.chunks` +- [ ] `space_weather_daily` continuous aggregate created and policy active; Space Weather Widget backend query reads from the aggregate (verified by `EXPLAIN` showing `space_weather_daily` in plan, not raw `space_weather`) +- [ ] Autovacuum settings applied to `alert_events`, `security_logs`, `reentry_predictions` — verified via `pg_class` `reloptions` +- [ ] `lazy="raise"` set on all SQLAlchemy relationships; test suite passes with no `MissingGreenlet` or `InvalidRequestError` exceptions (test suite itself verifies this by accessing relationships without explicit loading — should raise) +- [ ] Redis eviction split: `maxmemory-policy` applies per Redis instance, not per logical DB, so the Celery broker instance (`SELECT 0`) runs with `noeviction` and the separate application cache instance (`SELECT 1`) with `allkeys-lru` — verified by `CONFIG GET maxmemory-policy` against each instance +- [ ] CZML catalog endpoint: `EXPLAIN (ANALYZE, BUFFERS)` output recorded in `docs/query-baselines/czml_catalog_100obj.txt`; p95 response time < 2s verified by load test with 10 concurrent users +- [ ] CZML delta endpoint (`?since=`) functional: integration test verifies delta response contains only changed objects; `X-CZML-Full-Required: true` returned when client timestamp > 30 min old +- [ ] Compression policies applied with correct `compress_after` intervals (see §9.4 table): `orbits` = 7 days, `adsb_states` = 14 days, 
`space_weather` = 60 days, `tle_sets` = 14 days — verified by `timescaledb_information.jobs` +- [ ] Cursor-based pagination: integration test on `/reentry/predictions` with 200+ rows confirms `next_cursor` present and second page returns non-overlapping rows; `limit=201` returns 400 +- [ ] MC concurrency gate: integration test submits two concurrent `POST /decay/predict` requests from the same organisation; second request returns `HTTP 429` with `Retry-After` header while first is running; first completes normally +- [ ] Renderer Docker memory limit set to 4 GB in `docker-compose.yml`; `docker inspect` confirms `HostConfig.Memory = 4294967296` +- [ ] Bulk export endpoint: integration test with 10,000-row dataset confirms response is a task ID + status URL, not an inline response body +- [ ] `tests/load/` directory exists with at least a k6 or Locust scenario for the CZML catalog endpoint; `docs/test-plan.md` load test section specifies scenario, ramp shape, and SLO assertion + +**Technical Writing / Documentation (Phase 1 items — all required before Phase 1 is considered complete):** +- [ ] `docs/` directory tree created and committed matching the structure in §12.1; all referenced documentation paths exist (even if files are stubs with "TODO" content) +- [ ] `AGENTS.md` committed to repo root; contains codebase conventions, test requirements, and safety-critical file restrictions (see §33.9) +- [ ] `docs/adr/` contains minimum 5 ADRs for the most consequential Phase 1 decisions: JWT algorithm choice, dual frontend architecture, Monte Carlo chord pattern, frame library choice, TimescaleDB chunk intervals +- [ ] `docs/runbooks/TEMPLATE.md` committed; `docs/runbooks/README.md` index lists all required runbooks with owner field; at least `db-failover.md`, `ingest-failure.md`, and `hmac-failure.md` are complete (not stubs) +- [ ] `docs/validation/README.md` documents how to run each validation suite and where reference data files live; `docs/validation/reference-data/` 
contains Vallado SGP4 cases and IERS frame test cases +- [ ] `CHANGELOG.md` exists at repo root in Keep a Changelog format; first entry records Phase 1 initial release +- [ ] `docs/alert-threshold-history.md` exists with initial entry recording threshold values, rationale, and author sign-off (required by §24.8) +- [ ] OpenAPI docs: CI check confirms no public endpoint has an empty `description` field; spot-check 5 endpoints in code review to verify `summary` and at least one `responses` example + +**Ethics / Algorithmic Accountability (Phase 1 items — all required before Phase 1 is considered complete):** +- [ ] `ood_flag` and `ood_reason` populated at prediction time: integration test with an object whose `data_confidence = 'unknown'` and no DISCOS physical properties confirms `ood_flag = TRUE` and `ood_reason` contains `'low_data_confidence'`; prediction is served but UI shows mandatory warning callout above the prediction panel +- [ ] `prediction_valid_until` field present: verify it equals `p50_reentry_time - 4h` for a test prediction; UI shows staleness warning when `NOW() > prediction_valid_until` and prediction is not superseded (Playwright test simulates time travel) +- [ ] `alert_events.fir_intersection_km2` and `intersection_percentile` recorded: synthetic CRITICAL alert with known corridor area confirms both fields populated; UI renders "p95 corridor intersects ~N km² of FIR XXXX" (Playwright test) +- [ ] Alert threshold values documented: `docs/alert-threshold-history.md` exists with initial entry recording threshold values, rationale, and author sign-off +- [ ] `prediction_outcomes` table exists in schema; `POST /api/v1/predictions/{id}/outcome` endpoint (requires `analyst` role) accepts observed re-entry time and source (integration test: unauthenticated attempt returns 401) + +**Interoperability (Phase 1 items — all required before Phase 1 is considered complete):** +- [ ] TLE checksum validation: integration test sends a TLE with deliberately 
corrupted checksum; verify it is rejected and logged to `security_logs` type `INGEST_VALIDATION_FAILURE`; valid TLE with same content but correct checksum is accepted +- [ ] Space weather format contract test: CI integration test against mocked NOAA SWPC response asserts (a) expected top-level JSON keys present (`time_tag`, `flux` / `kp_index`); (b) F10.7 values in physical range 50–350 sfu; (c) Kp values in range 0–90 (NOAA integer format); test is `@pytest.mark.contract` and runs against mocks in standard CI, against live API in nightly sandbox job +- [ ] Space-Track contract test: integration test against mocked Space-Track response asserts (a) expected JSON keys present for TLE and CDM queries; (b) B* values trigger warning when outside [-0.5, 0.5]; (c) epoch field parseable as ISO-8601; `spacecom_ingest_success_total{source="spacetrack"}` Prometheus metric > 0 after a live ingest cycle (nightly sandbox only) +- [ ] FIR boundary data loaded: `airspace` table populated with FIR/UIR polygons for at least the test ANSP region; source documented in `ingest/sources.py`; AIRAC update date recorded in `airspace_metadata` table +- [ ] WebSocket event schema: `WS /ws/events` delivers typed event envelopes; integration test sends a synthetic `alert.new` event and verifies the client receives `{"type": "alert.new", "seq": <int>, "data": {...}}`; reconnect with `?since_seq=<seq>` replays missed events +- [ ] API versioning headers: all API endpoints return `Content-Type: application/vnd.spacecom.v1+json`; deprecated endpoints (if any) return `Deprecation: true` and `Sunset: <date>` headers (verified by Playwright E2E check) + +**SRE / Reliability (all required before Phase 1 is considered complete):** +- [ ] Health probes: `/healthz` returns 200 on all services; `/readyz` returns 200 (healthy) or 207 (degraded) as appropriate; Docker Compose `depends_on: condition: service_healthy` wired for all service dependencies +- [ ] Celery queue routing: integration test confirms `ingest.*` tasks 
appear only on `ingest` queue and `propagator.*` tasks appear only on `simulation` queue; no cross-queue contamination possible +- [ ] `celery-redbeat` schedule persistence: Beat process restart test verifies scheduled jobs survive without duplicate scheduling; Redis key `redbeat:*` present after restart +- [ ] Crash-safety: kill a `worker-sim` container mid-task; verify task is requeued (not lost) on worker restart; `task_acks_late = True` and `task_reject_on_worker_lost = True` confirmed by log inspection +- [ ] Dead letter queue: a task that exhausts all retries appears in the DLQ; DLQ depth metric visible in Prometheus +- [ ] WAL archiving: `pg_basebackup` and WAL segments appearing in MinIO `db-wal-archive` bucket within 10 minutes of first write (verified by bucket list) +- [ ] Daily backup Celery task: `backup_database` task appears in Celery Beat schedule; execution logged in `celery-beat.log`; resulting archive object visible in MinIO `db-backups` bucket +- [ ] TimescaleDB compression policy: `orbits` compression policy applied; `timescaledb_information.jobs` shows policy active; manual `CALL run_job()` compresses at least one chunk +- [ ] Prometheus metrics: `spacecom_active_tip_events`, `spacecom_tle_age_hours`, `spacecom_hmac_verification_failures_total`, `spacecom_celery_queue_depth` all visible in Prometheus UI with correct labels +- [ ] MC chord distribution: `run_mc_decay_prediction` fans out 500 sub-tasks; Celery Flower shows sub-tasks distributed across both `worker-sim` instances (not all on one worker) +- [ ] MC p95 latency SLO: 500-sample MC run completes in < 240s on Tier 1 dev hardware (8 vCPU/32 GB) under load test; documented baseline recorded for Tier 2 comparison + +### Phase 2 Complete When: +- [ ] Atmospheric breakup: fragments, casualty areas, fragment globe display +- [ ] Mode B (Probability Heatmap): Deck.gl layer renders; hover tooltip shows probability +- [ ] Conjunction screening: known close approaches identified; Pc computed for 
≥1 test case +- [ ] 2D Plan View: FIR boundaries, horizontal corridor projection, altitude cross-section +- [ ] Airspace intersection table: affected FIRs with entry/exit times on Event Detail +- [ ] Hazard zones: HMAC-signed and immutability trigger verified +- [ ] PDF reports: Technical Assessment and Regulatory Submission types functional +- [ ] Renderer container: `network_mode: none` enforced; sanitisation tests passing; 30s timeout verified +- [ ] OWASP ZAP DAST: 0 High/Critical findings against staging environment +- [ ] RLS multi-tenancy: Org A user cannot access Org B records (integration test) +- [ ] SimulationComparison: two runs overlaid on globe with distinct colours + +**Phase 2 SRE / Reliability:** +- [ ] Monthly restore test: `restore_test` Celery task executes on schedule; restores latest backup to isolated `db-restore-test` container; row count reconciliation passes; result logged to `security_logs` (type `RESTORE_TEST`) +- [ ] TimescaleDB retention policy: 90-day drop policy active on `orbits` and `space_weather`; manual chunk drop test in staging confirms chunks older than 90 days are removed without affecting newer data +- [ ] Archival pipeline: Parquet export Celery task runs before chunk drop; resulting `.parquet` files visible in MinIO `db-archive` bucket; spot-check query against archived Parquet returns expected rows +- [ ] Degraded mode UI: stop space weather ingest; confirm `/readyz` returns 207; confirm `StalenessWarningBanner` appears in aviation portal within one polling cycle (≤ 60s); restart ingest; confirm banner clears +- [ ] Error budget dashboard: Grafana `SRE Error Budgets` dashboard shows Phase 2 SLO burn rates for prediction latency and data freshness; alert fires in Prometheus when burn rate exceeds 2× for > 1 hour + +**Phase 2 Human Factors:** +- [ ] Corridor Evolution widget: Event Detail page shows p50 corridor footprint at T+0h/+2h/+4h; auto-updates in LIVE mode; an amber warning appears if the corridor is widening +- [ ] 
Duty Manager View: toggle on Event Detail collapses to large-text window/FIR/action-buttons only; toggles back to technical detail +- [ ] Response Options accordion: contextualised action checklist visible to `operator`+ role; checkbox states and coordination notes persisted to `alert_events` +- [ ] Multi-ANSP Coordination Panel: visible on events where ≥2 registered organisations share affected FIRs; acknowledgement status and coordination notes from each ANSP visible; integration test confirms Org A cannot see Org B coordination notes on unrelated events +- [ ] Simulation block: `disable_simulation_during_active_events` org setting functional; mode switch blocked with correct modal when unacknowledged CRITICAL alerts exist (integration test) +- [ ] Space weather buffer recommendation: Event Detail shows `[95th pct time + buffer]` callout when conditions are Elevated or above; buffer computed by backend from F10.7/Kp thresholds (integration test verifies all four threshold bands) +- [ ] Secondary Display Mode: `?display=secondary` URL opens chrome-free full-screen operational view; navigation, admin links, and simulation controls not present; CRITICAL banners still appear (Playwright test) +- [ ] Mode C first-use overlay: MC particle animation blocked until user acknowledges one-time explanation overlay; preference stored in user record; never shown again after first acknowledgement + +**Phase 2 Performance / Database:** +- [ ] FIR intersection query: `EXPLAIN (ANALYZE)` confirms bounding-box pre-filter (`&&`) eliminates > 90% of `airspace` rows before exact `ST_Intersects`; p95 intersection query time < 200ms with full airspace table loaded +- [ ] Analytics query routing: Persona B/F workspace queries confirmed routing to replica engine via `pg_stat_activity` source host check; replication lag monitored in Grafana (alert if > 30s) +- [ ] Query plan regression: re-run `EXPLAIN (ANALYZE, BUFFERS)` on CZML catalog query; compare to Phase 1 baseline in 
`docs/query-baselines/`; planning time and execution time increase < 2× (if exceeded, investigate before Phase 3 load test) +- [ ] Hypertable migration: at least one migration involving `orbits` executed using `CREATE INDEX CONCURRENTLY`; CI migration timeout gate in place (> 30s fails CI) +- [ ] Query plan regression CI job active: `tests/load/check_query_baselines.py` runs after each migration in staging; fails if any baseline query execution time increases > 2× vs recorded baseline; PR comment generated with comparison table +- [ ] `ws_connected_clients` Prometheus gauge reporting per backend instance; Grafana alert configured at 400 (WARNING) — verified by injecting 5 synthetic WebSocket connections and confirming gauge increments +- [ ] Space weather backfill cap: integration test simulates 24-hour ingest gap; verify ingest task logs `WARN` and backfills only last 6 hours; no duplicate timestamps written; `space_weather_daily` aggregate remains consistent +- [ ] CDN / static asset caching: `bundle-size` CI step active; PR comment shows bundle size delta; CI fails if main JS bundle grows > 10% vs. 
previous build; Caddy cache headers for `/_next/static/*` set `Cache-Control: public, max-age=31536000, immutable` + +**Phase 2 Legal / Compliance:** +- [ ] **Regulatory classification ADR committed:** `docs/adr/0012-regulatory-classification.md` documents the chosen position (Position A — ATM/ANS Support Tool, non-safety-critical) with rationale; legal counsel has reviewed the position against EASA IR 2017/373; position is referenced in all ANSP service contracts +- [ ] Legal opinion received for primary deployment jurisdiction; `legal_opinions` table updated with `shadow_mode_cleared = TRUE`; shadow mode admin toggle no longer shows legal warning for that jurisdiction +- [ ] Space-Track AUP redistribution clarification obtained (written); legal position documented; AUP click-wrap wording updated to reflect agreed terms +- [ ] **ESA DISCOS redistribution rights clarified (written):** Written confirmation from ESA/ESAC on permissible use of DISCOS-derived properties in commercial API responses and generated reports; if redistribution is not permitted, API response and report templates updated to show `source: estimated` rather than raw DISCOS values +- [ ] **GDPR DPA signed with each shadow ANSP partner before shadow mode begins:** DPA template reviewed by counsel; executed DPA on file for each organisation before `shadow_mode_cleared` is set to `TRUE`; data processing not permitted for any ANSP organisation without a signed DPA +- [ ] GDPR data inventory documented; pseudonymisation procedure `handle_erasure_request()` implemented and tested: user deleted → name/email replaced with `[user deleted - ID:{hash}]` in `alert_events`/`security_logs`; core safety records preserved +- [ ] Jurisdiction screening at user registration: sanctioned-country check fires before account creation; blocked attempt logged to `security_logs` type `REGISTRATION_BLOCKED_SANCTIONS` +- [ ] MSA template reviewed by aviation law counsel; Regulatory Sandbox Agreement template finalised; 
first shadow mode deployment covered by a signed Regulatory Sandbox Agreement on file +- [ ] Controlled Re-entry Planner carries in-platform export control notice; `data_source_acknowledgement = TRUE` enforced before API key issuance (integration test: attempt to create API key without acknowledgement returns 403) +- [ ] Professional indemnity, cyber liability, and product liability insurance confirmed in place before first shadow deployment; certificates stored in MinIO `legal-docs` bucket +- [ ] **Shadow mode exit criteria documented and tooled:** `docs/templates/shadow-mode-exit-report.md` exists; Persona B can generate exit statistics from admin panel; exit to operational use for any ANSP requires written Safety Department confirmation on file before `shadow_mode_cleared` is set + +**Phase 2 Technical Writing / Documentation:** +- [ ] `docs/user-guides/aviation-portal-guide.md` complete and reviewed by at least one Persona A representative before first ANSP shadow deployment; covers: dashboard overview, alert acknowledgement workflow, NOTAM draft workflow, degraded mode response +- [ ] `docs/api-guide/` complete: `authentication.md`, `rate-limiting.md`, `webhooks.md`, `error-reference.md`, Python and TypeScript quickstart examples; reviewed by a Persona E/F tester +- [ ] All public functions in `propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `integrity.py`, and `breakup/atmospheric.py` have Google-style docstrings with parameter units; `mypy` pre-commit hook enforces no untyped function signatures +- [ ] `docs/test-plan.md` complete: lists all test suites, physical invariant tested, reference source, pass/fail tolerance, and blocking classification; reviewed by physics lead +- [ ] `docs/adr/` contains ≥ 10 ADRs covering all consequential Phase 2 decisions added during the phase +- [ ] All runbooks referenced in the §21 DoD are complete (not stubs): `gdpr-breach-notification.md`, `safety-occurrence-notification.md`, 
`secrets-rotation-jwt.md`, `blue-green-deploy.md`, `restore-from-backup.md` + +**Phase 2 Ethics / Algorithmic Accountability:** +- [ ] Model card published: `docs/model-card-decay-predictor.md` complete with validated orbital regime envelope, object type performance breakdown, known failure modes, and systematic biases; reviewed by the physics lead before Phase 2 ANSP shadow deployments +- [ ] Backcast validation report: ≥10 historical re-entry events validated; report documents selection criteria, identifies underrepresented object categories (small debris, tumbling objects), and states accuracy conditional on object type — not as a single unconditional figure; stored in MinIO `docs` bucket +- [ ] Out-of-distribution bounds defined: `docs/ood-bounds.md` specifies the threshold values for `ood_flag` triggers (area-to-mass ratio, minimum data confidence, minimum TLE count); CI test confirms all thresholds are checked in `propagator/decay.py` +- [ ] Alert threshold governance: any threshold change requires a PR reviewed by engineering lead + product owner; `docs/alert-threshold-history.md` entry created; change must complete a minimum 2-week shadow-mode validation period before deploying to any operational ANSP connection +- [ ] FIR coverage quality flag: `airspace` table has `data_source` and `coverage_quality` columns; intersection results for OpenAIP-sourced FIRs include a `coverage_quality: 'low'` flag in the API response; UI shows a coverage quality callout for non-AIRAC FIRs +- [ ] Recalibration governance documented: `docs/recalibration-procedure.md` exists specifying hold-out validation dataset, minimum accuracy improvement threshold (> 5% improvement on hold-out, no regression on any object type category), sign-off authority (physics lead + engineering lead), ANSP notification procedure + +**Phase 2 Interoperability:** +- [ ] CCSDS OEM response: `GET /space/objects/{norad_id}/ephemeris` with `Accept: application/ccsds-oem` returns a valid CCSDS 502.0-B-3 OEM 
file; integration test validates all mandatory keyword fields (`OBJECT_ID`, `CENTER_NAME`, `REF_FRAME=GCRF`, `TIME_SYSTEM=UTC`, `START_TIME`, `STOP_TIME`) are present; test parses with a reference CCSDS OEM parser +- [ ] CCSDS CDM export: bulk export includes CDM-format conjunction records; mandatory CDM fields populated; `N/A` used per CCSDS 508.0-B-1 §4.3 for unknown values; integration test validates with reference CDM parser +- [ ] CDM ingestion display: Space-Track CDM Pc and SpaceCom-computed Pc both visible on conjunction panel with distinct provenance labels; `DATA_CONFIDENCE` warning fires when values differ by > 10× (integration test with synthetic divergent CDM) +- [ ] Alert webhook: `POST /webhooks` registers endpoint; synthetic `alert.new` event POSTed to registered URL within 5s of trigger; `X-SpaceCom-Signature` header present and verifiable with shared secret; retry fires on 500 response from webhook receiver (integration test with mock server) +- [ ] GeoJSON structured export: `GET /events/{id}/export?format=geojson` returns valid GeoJSON `FeatureCollection`; `properties` includes `norad_id`, `p50_utc`, `affected_fir_ids`, `risk_level`, `prediction_hmac`; validates against GeoJSON schema (RFC 7946) +- [ ] ADS-B feed: OpenSky Network integration active; live flight positions overlay on globe in aviation portal; route intersection advisory receives ADS-B flight tracks as input + +**Phase 2 DevOps / Platform Engineering:** +- [ ] Staging environment spec documented: resources, data (synthetic only — no production data in staging), secrets set (separate from production), continuous deployment from `main` branch +- [ ] GitLab staging deploy job: merge to `main` triggers automatic staging deploy; production deploy requires manual approval in GitLab after staging smoke tests pass +- [ ] OWASP ZAP DAST run against staging in CI pipeline; results reviewed; 0 High/Critical required to unblock production deploy approval +- [ ] Secrets rotation runbooks 
written for all critical secrets: Space-Track credentials, JWT RS256 signing keypair, MinIO access keys, Redis `AUTH` password; each runbook includes: who initiates, affected services, zero-downtime rotation procedure, verification step, `security_logs` entry required +- [ ] JWT RS256 keypair rotation tested without downtime: old public key retained during 5-minute transition window; tokens signed with old key remain valid until expiry; verified by integration test +- [ ] Image retention: container-registry lifecycle policy in place; untagged images purged weekly; staging images retained 30 days; dev images retained 7 days; policy verified in registry settings +- [ ] CI observability: GitLab pipeline duration tracked; image size delta posted as merge request comment (fail if > 20% increase); test failure rate visible in CI dashboard +- [ ] `alembic check` CI gate: no migration adds a `NOT NULL` column without a default in the same step; CI job validates hypertable migrations use `CONCURRENTLY` (grep check on all new migration files) + +### Phase 2 Additional Regulatory / Dual Domain Items: +- [ ] Shadow mode: admin can enable/disable per organisation; ShadowBanner displayed on all pages when active; shadow records have `shadow_mode = TRUE`; shadow records excluded from all operational API responses (integration test) +- [ ] NOTAM drafting: draft generated in ICAO Annex 15 format from any event with FIR intersection; mandatory regulatory disclaimer present (automated test verifies its presence in every draft); stored in `notam_drafts` +- [ ] Space Operator Portal: `space_operator` user can view only owned objects (non-owned objects return 404, not 403, to prevent object enumeration); ControlledReentryPlanner functional for `has_propulsion = TRUE` objects +- [ ] CCSDS export: ephemeris export in OEM format passes CCSDS 502.0-B-3 structural validation +- [ ] API keys: create, use, and revoke flow functional; per-key rate limiting returns 429 at daily limit; raw key 
displayed only at creation (never retrievable after) +- [ ] TIP message provenance displayed in UI: source label reads "USSPACECOM TIP (not certified aeronautical information)" — not just "TIP Message #N" +- [ ] Data confidence warnings: objects with `data_confidence = 'unknown'` display a warning callout on all prediction panels explaining the impact on prediction quality + +### Phase 3 Complete When: +- [ ] Mode C (Monte Carlo Particles): animated trajectories render; click-particle shows params +- [ ] Real-time alerts delivered within 30 seconds of trigger condition +- [ ] Geographic alert filtering: alerts scoped to user's FIR list +- [ ] Route intersection analysis functional against sample flight plans +- [ ] Feedback: density scaling recalibration demonstrated from ≥2 historical re-entries +- [ ] Load test: 100 concurrent users; CZML load < 2s at p95 +- [ ] **External penetration test completed; all Critical/High findings remediated** +- [ ] Full axe-core audit + manual screen reader test (NVDA + VoiceOver) passes +- [ ] Secrets manager (Vault or equivalent) replacing Docker secrets for all production credentials +- [ ] All credentials on rotation schedule; rotation verified without downtime +- [ ] Prometheus + Grafana operational; certificate expiry alert configured +- [ ] Production deployment runbook documented; incident response procedure per threat scenario +- [ ] Security audit log shipping to external SIEM verified +- [ ] Shadow validation report generated for ≥1 historical re-entry event demonstrating prediction accuracy +- [ ] ECSS compliance artefacts produced: Software Management Plan, V&V Plan, Product Assurance Plan, Data Management Plan (required for ESA contract bids) +- [ ] TRL 6 demonstration: system demonstrated in operationally relevant environment with real TLE data, real space weather, and ≥1 ANSP shadow deployment +- [ ] Regulatory acceptance package complete: safety case framework, ICAO Annex 15 data quality mapping, SMS integration 
guide +- [ ] Legal opinion obtained on operational liability per target deployment jurisdictions (Australia, EU, UK minimum) +- [ ] First ANSP shadow mode deployment active with ≥4 weeks of shadow prediction records + +**Phase 3 Infrastructure / HA:** +- [ ] Patroni configuration validated: `scripts/check_patroni_config.py` passes confirming `maximum_lag_on_failover`, `synchronous_mode: true`, `synchronous_mode_strict: true`, `wal_level: replica`, `recovery_target_timeline: latest` all present in `patroni.yml` +- [ ] Patroni failover drill: manually kill the primary DB container; verify standby promoted within 30s; backend API continues serving requests (latency spike acceptable; no 5xx errors after 35s); PgBouncer reconnects automatically to new primary +- [ ] MinIO EC:2 verified: 4-node MinIO starts cleanly; integration test writes a 100 MB object; shut down one MinIO node; read succeeds; write succeeds; shut down second node; write fails with expected error; read still succeeds (EC:2 read quorum = 2 of 4) +- [ ] WAF/DDoS protection confirmed in place at ingress (Cloudflare/AWS Shield or equivalent network-level appliance for on-premise); security architecture review sign-off +- [ ] DNS architecture documented: `docs/runbooks/dns-architecture.md` covers split-horizon zones, PgBouncer VIP, Redis Sentinel VIP, and service discovery records for Tier 3 deployment +- [ ] Backup restore test checklist completed successfully (see §34.5): all 6 checklist items passed within the 30-day window before Phase 3 sign-off +- [ ] TLS certificate lifecycle runbook complete: `docs/runbooks/tls-cert-lifecycle.md` documents ACME auto-renewal path and internal CA path for air-gapped deployments; cert expiry Prometheus alerts firing at 60/30/7-day thresholds + +**Phase 3 Performance:** +- [ ] Formal load test passed: `tests/load/` scenario with k6 or Locust; 100 concurrent users; CZML catalog load < 2s p95; MC job submit < 500ms; alert WebSocket delivery < 30s; test report committed 
to `docs/validation/load-test-report-phase3.md` +- [ ] MC concurrency gate tested at scale: 10 simultaneous MC submissions across 5 organisations; each org receives `429` for its second request; no deadlock or Redis key leak observed; Celery worker queue depth remains bounded +- [ ] WebSocket subscriber ceiling verified: load test opens 450 connections to a single backend instance; 451st connection receives `HTTP 503`; `ws_connected_clients` gauge reads 450; scaling trigger fires at 400 (alert visible in Grafana) +- [ ] CZML delta adoption: Playwright E2E test confirms the frontend sends `?since=` parameter on all CZML polls after initial load; no full-catalog request occurs after page load in LIVE mode +- [ ] Bundle size CI gate active and green: final production build JS bundle documented; `bundle-size` CI step has passed for ≥2 consecutive deploys without manual override + +--- + +## 22. Open Physics Questions for Engineering Review + +1. **JB2008 vs NRLMSISE-00** — Recommend: NRLMSISE-00 for Phase 1 with a pluggable density model interface that accepts JB2008 in Phase 2 without API or schema changes. + +2. **Covariance source for conjunction probability** — Recommend: SP ephemeris covariance from Space-Track for active payloads; empirical covariance with explicit UI warning for debris. + +3. **Re-entry termination altitude** — Recommend: 80 km for Phase 1; parametric interface for Phase 2 breakup module (default 80 km, allow up to 120 km). + +4. **F10.7 forecast horizon** — For objects re-entering 5–14 days out, NOAA 3-day forecasts have degraded skill. Recommend: 81-day smoothed average as baseline with ±20% MC variation; document clearly in the SpaceWeatherWidget and every prediction panel. + +--- + +## 23. Dual Domain Architecture + +### 23.1 The Interface Problem + +Two technically adjacent domains — space operations and civil aviation — manage debris re-entry hazards using incompatible tools, data formats, and operational vocabularies. 
The gap between them is the market. + +``` +SPACE DOMAIN THE GAP AVIATION DOMAIN +──────────────── ────────── ──────────────── +TLE / SGP4 NOTAM +CDMs / TIP messages No standard interface FIR restrictions +CCSDS orbit products No common tool ATC procedures +Kp / F10.7 indices No shared language En-route charts +Probability of casualty ← SpaceCom bridges this → Plain English hazard brief +``` + +### 23.2 Shared Physics Core + +One physics engine serves both front doors. Neither domain gets a different model — they get different views of the same computation. + +``` + ┌─────────────────────────────────┐ + │ PHYSICS CORE │ + │ Catalog Propagator (SGP4) │ + │ Decay Predictor (RK7(8)+NRLMS) │ + │ Monte Carlo ensemble │ + │ Conjunction Screener │ + │ Atmospheric Breakup (ORSAT) │ + │ Frame transforms (TEME→WGS84) │ + └────────────┬────────────────────┘ + │ + ┌─────────────────┴─────────────────┐ + │ │ + ┌──────────▼───────────┐ ┌────────────▼──────────┐ + │ SPACE DOMAIN UI │ │ AVIATION DOMAIN UI │ + │ /space portal │ │ / (operational view) │ + │ Persona E, F │ │ Persona A, B, C │ + │ │ │ │ + │ State vectors │ │ Hazard corridors │ + │ Covariance matrices │ │ FIR intersection │ + │ CCSDS formats │ │ NOTAM drafts │ + │ Deorbit windows │ │ Plain-language status│ + │ API keys │ │ Alert acknowledgement│ + │ Conjunction data │ │ Gantt timeline │ + └──────────────────────┘ └───────────────────────┘ +``` + +### 23.3 Domain-Specific Output Formats + +| Output | Space Domain | Aviation Domain | +|--------|-------------|----------------| +| Trajectory | CCSDS OEM (state vectors) | CZML (J2000 INERTIAL for CesiumJS) | +| Re-entry prediction | p05/p50/p95 times + covariance | Percentile corridor polygons on globe | +| Hazard | Probability of casualty (Pc) value | Risk level (LOW/MEDIUM/HIGH/CRITICAL) | +| Uncertainty | Monte Carlo ensemble statistics | Corridor width visual encoding | +| Conjunction | CDM-format Pc value | Not surfaced to Persona A | +| Space weather | F10.7 / Ap / Kp 
raw indices | "Elevated activity — wider uncertainty" | +| Deorbit plan | CCSDS manoeuvre plan | Corridor risk map on globe | + +### 23.4 Competitive Position + +| Competitor | Their Strength | SpaceCom Advantage | +|-----------|---------------|-------------------| +| **ESA ESOC Re-entry Prediction Service** | Authoritative technical product; longest-running service | Aviation-facing operational UX; ANSP decision support; NOTAM drafting; multi-ANSP coordination | +| **OKAPI:Orbits + DLR + TU Braunschweig** | Academic orbital mechanics depth; space operator integrations | Purpose-built ANSP interface; controlled re-entry planner; shadow mode for regulatory adoption | +| **Aviation weather vendors (e.g., StormGeo)** | Deep ANSP relationships; established procurement pathways | Space domain physics credibility; TLE/CDM ingestion; conjunction screening | +| **General STM platforms** | Broad catalog management | Operational decision support depth; aviation integration layer | + +SpaceCom's moat is the combination of space physics credibility AND aviation operational usability. Neither side alone is sufficient to win regulated aviation authority contracts. 
+ +**Differentiation capabilities — must be maintained regardless of competitor moves (Finding 4):** + +These are the capabilities that competitors cannot quickly replicate and that directly determine whether ANSPs and institutional buyers choose SpaceCom over alternatives: + +| Capability | Why it matters | Maintenance requirement | +|---|---|---| +| ANSP operational workflow integration | NOTAM drafting, multi-ANSP coordination, and shadow mode are purpose-built for ANSP operations — not retrofitted | Must be validated with ≥ 2 ANSP safety teams before Phase 2 shadow deployment | +| Regulatory adoption path | Shadow mode + exit criteria + ANSP Safety Department sign-off creates a documented adoption trail that institutional procurements require | Shadow mode exit report template must remain current; exit statistics generated automatically | +| Physics + aviation in one product | Neither a pure orbital analytics tool nor a pure aviation tool can cover both sides without the other's domain expertise | Dual-domain architecture (§23) must be maintained; any feature removal from either domain triggers an ADR | +| ESA/DISCOS data integration | Institutional credibility with ESA and national space agencies depends on using authoritative ESA data sources | DISCOS redistribution rights must be resolved before Phase 2; integration maintained as P1 data source | + +A `docs/competitive-analysis.md` document (maintained by the product owner, reviewed quarterly) tracks competitor feature releases and assesses impact on these claims. Any competitor capability that closes a differentiation gap triggers a product review within 30 days. + +### 23.5 SWIM Integration Path + +European ANSPs increasingly exchange operational data via SWIM (System Wide Information Management), defined by ICAO Doc 10039 and implemented in Europe via EUROCONTROL SWIM-TI (AMQP/MQTT transport, FIXM/AIXM 5.1 schemas). 
Full SWIM compliance is a Phase 3+ target; the path is: + +| Phase | Deliverable | Standard | +|-------|-------------|----------| +| Phase 2 | GeoJSON structured event export (`/events/{id}/export?format=geojson`) with ICAO FIR IDs and prediction metadata | GeoJSON + ISO 19115 metadata | +| Phase 3 | Review FIXM Core 4.x schema for re-entry hazard representation; define SpaceCom extension namespace | FIXM Core 4.2 | +| Phase 3 | SWIM-TI AMQP endpoint (publish-only) for `alert.new` and `tip.new` events to EUROCONTROL Network Manager B2B service | EUROCONTROL SWIM-TI Yellow Profile | + +Phase 2 GeoJSON export is the immediate deliverable. Phase 3 SWIM-TI integration is scoped but requires a EUROCONTROL B2B service account and FIXM schema extension review — neither is blocking for Phase 1 or 2. + +--- + +## 24. Regulatory Compliance Framework + +### 24.1 The Regulatory Gap SpaceCom Operates In + +There is currently **no binding international regulatory framework** governing re-entry debris hazard notifications to civil aviation. SpaceCom operates at the boundary between two regulatory regimes that have not yet formally agreed on how to bridge them. + +This creates risk (no approved pathway to slot into) but also opportunity (SpaceCom can help define the standard and accumulate first-mover evidence). + +### 24.2 Liability and Operational Status + +**Legal opinion is a Phase 2 gate, not a Phase 3 task.** Shadow mode deployments with ANSPs must not occur without a completed legal opinion for the deployment jurisdiction. "Advisory only" UI labelling is not contractual protection — liability limitation must be in executed agreements. In common law jurisdictions (Australia, UK, US), a voluntary undertaking of responsibility to a known class of relying professionals can create a duty of care regardless of disclaimers (*Hedley Byrne & Co v Heller* and equivalents). 
Shadow mode activation in the admin panel is gated by `legal_opinions.shadow_mode_cleared = TRUE` for the organisation's jurisdiction. + +**Legal opinion scope** (per deployment jurisdiction — Australia, EU, UK, US minimum): +- Whether "decision support information" labelling limits liability for incorrect predictions that inform airspace decisions +- Whether the platform creates duty-of-care obligations regardless of labelling +- Whether Space-Track data redistribution via the SpaceCom API requires a separate licensing agreement with 18th Space Control Squadron +- Whether CDM data (national security-adjacent) is subject to export controls in target jurisdictions +- Whether the Controlled Re-entry Planner falls under ECCN 9E515 (spacecraft operations technical data) for non-US users + +**Operational status classification** for SpaceCom outputs — not a UI label, a formal determination made in consultation with the ANSP's legal and SMS teams: +- *Aeronautical information* (ICAO Annex 15) — highest standard; triggers data quality obligations +- *Decision support information* — intermediate; requires formal ANSP SMS acceptance +- *Situational awareness information* — lowest; advisory only; no procedural authority + +**Commercial contract requirements — three instruments required before any access:** + +1. **Master Services Agreement (MSA)** — executed before any ANSP or space operator accesses the system. Must be reviewed by aviation law counsel. 
Minimum required terms: + - Limitation of liability: capped at 12 months of fees paid, or a fixed cap for government/sovereign customers (to be determined by counsel) + - Exclusion of consequential and indirect loss + - Explicit statement that SpaceCom outputs are decision support information, not certified aeronautical information and not a substitute for ANSP operational procedures + - ANSP's acknowledgement that they retain full authority and responsibility for all operational decisions + - SLOs from §26.1 incorporated by reference + - Governing law and jurisdiction clause + - Data Processing Agreement (DPA) addendum for GDPR-scope deployments (see §29) + - Right to suspend service without liability for maintenance, degraded mode, data quality concerns, or active security incidents + +2. **Acceptable Use Policy (AUP)** — click-wrap accepted in-platform at first login, recorded in `users.tos_accepted_at`, `users.tos_version`, and `users.tos_accepted_ip`. Must re-accept when version changes (system blocks access until accepted). Includes: + - Acknowledgement that orbital data originates from Space-Track, subject to Space-Track terms + - Prohibition on redistributing SpaceCom-derived data to third parties without written consent + - Acknowledgement that the platform is decision support only, not certified aeronautical information + - Export control acknowledgement (user is responsible for compliance in their jurisdiction) + +3. **API Terms** — embedded in the API key issuance flow for Persona E/F programmatic access. Accepted at key creation; recorded against the `api_keys` record. Includes the Space-Track redistribution acknowledgement and the export control notice. + +**Space-Track data redistribution gate (F3):** Space-Track.org Terms of Service prohibit redistribution of TLE data to non-registered entities. 
The SpaceCom API must not serve TLE-derived fields (raw TLE strings, `tle_epoch`, `tle_line1/2`) to organisations that have not confirmed Space-Track registration. Implementation:

```sql
-- Add to organisations table
ALTER TABLE organisations ADD COLUMN space_track_registered BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN space_track_registered_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN space_track_username TEXT; -- for audit
```

API middleware check (applied to any response containing TLE-derived fields):
```python
from fastapi import HTTPException

def check_space_track_gate(org: Organisation):  # Organisation: the ORM model
    if not org.space_track_registered:
        raise HTTPException(
            status_code=403,
            detail="TLE-derived data requires Space-Track registration. "
                   "Register at space-track.org and confirm in your organisation settings."
        )
```

All TLE-derived disclosures are logged in `data_disclosure_log`:
```sql
CREATE TABLE data_disclosure_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID NOT NULL REFERENCES organisations(id),
    source TEXT NOT NULL, -- 'space_track', 'esa_sst', etc.
    endpoint TEXT NOT NULL,
    disclosed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    record_count INTEGER
);
CREATE INDEX ON data_disclosure_log (org_id, source, disclosed_at DESC);
```

**Contracts table and MRR tracking (F1, F4, F9 — §68):**

The `contracts` table enforces that feature access is gated on commercial state, provides MRR data for the commercial team, and records discount approval for audit:

```sql
CREATE TABLE contracts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID NOT NULL REFERENCES organisations(id),
    contract_type TEXT NOT NULL
        CHECK (contract_type IN ('sandbox','professional','enterprise','on_premise','internal')),
    -- Financial terms
    monthly_value_cents INTEGER NOT NULL DEFAULT 0, -- 0 for sandbox/internal
    currency CHAR(3) NOT NULL DEFAULT 'EUR',
    discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
        CHECK (discount_pct >= 0 AND discount_pct <= 100),
    -- Discount approval guard (F4): discounts >20% require second approver
    discount_approved_by INTEGER REFERENCES users(id), -- NULL if discount_pct <= 20
    discount_approval_note TEXT,
    -- Term
    valid_from TIMESTAMPTZ NOT NULL,
    valid_until TIMESTAMPTZ NOT NULL,
    auto_renew BOOLEAN NOT NULL DEFAULT FALSE,
    -- Feature access — what this contract enables
    enables_operational_mode BOOLEAN NOT NULL DEFAULT FALSE,
    enables_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,
    enables_api_access BOOLEAN NOT NULL DEFAULT FALSE,
    -- Audit
    created_by INTEGER REFERENCES users(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    signed_msa_at TIMESTAMPTZ, -- NULL until MSA countersigned
    msa_document_ref TEXT, -- path in MinIO legal bucket
    -- Professional Services (F10)
    ps_value_cents INTEGER NOT NULL DEFAULT 0, -- one-time PS revenue on this contract
    ps_description TEXT
);
CREATE INDEX ON contracts (org_id, valid_until DESC);
-- Active contract lookup. NOW() is not IMMUTABLE, so it cannot appear in a
-- partial index predicate; filter on valid_until at query time instead.
CREATE INDEX ON contracts (valid_until);

-- Constraint: discounts >20%
must have a named approver +ALTER TABLE contracts ADD CONSTRAINT discount_approval_required + CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL); +``` + +**Feature access enforcement (F1):** Feature flags in `organisations` must be set from the active contract, not by admin toggle alone. A Celery task (`tasks/commercial/sync_feature_flags.py`) runs nightly and on contract creation/update to sync `organisations.feature_multi_ansp_coordination` from the active contract's `enables_multi_ansp_coordination`. An admin toggle that disagrees with the active contract is overwritten by the nightly sync. + +**MRR dashboard (F9):** Add a Grafana panel (internal dashboard, not customer-facing) showing current MRR: +```sql +-- Recording rule or direct query: +SELECT SUM(monthly_value_cents) / 100.0 AS mrr_eur +FROM contracts +WHERE valid_from <= NOW() AND valid_until >= NOW() + AND contract_type NOT IN ('sandbox', 'internal'); +``` +Expose as `spacecom_mrr_eur` Prometheus gauge updated by the nightly `sync_feature_flags` task. Grafana panel: *"Current MRR (€)"* — single stat panel, comparison to previous month. + +**Export control screening (F4):** ITAR 22 CFR §120.15 and EAR 15 CFR §736 prohibit providing certain SSA capabilities to nationals of embargoed countries and denied parties. Required at organisation onboarding: + +```sql +ALTER TABLE organisations ADD COLUMN country_of_incorporation CHAR(2); -- ISO 3166-1 alpha-2 +ALTER TABLE organisations ADD COLUMN export_control_screened_at TIMESTAMPTZ; +ALTER TABLE organisations ADD COLUMN export_control_cleared BOOLEAN NOT NULL DEFAULT FALSE; +ALTER TABLE organisations ADD COLUMN itar_cleared BOOLEAN NOT NULL DEFAULT FALSE; -- US-person or licensed +``` + +Onboarding flow: +1. Collect `country_of_incorporation` at registration +2. Flag embargoed countries (CU, IR, KP, RU, SY) for manual review — account held in `PENDING_EXPORT_REVIEW` state +3. 
Screen organisation name against BIS Entity List (automated lookup; manual review on partial match) +4. EU-SST-derived data gated behind `itar_cleared = TRUE` (EU-SST has its own access restrictions for non-EU entities) +5. All screening decisions logged with reviewer ID and date + +Documented in `legal/EXPORT_CONTROL_POLICY.md`. Legal counsel review required before any deployment that could serve US-origin technical data (TLE from 18th Space Control Squadron) to non-US persons. + +**Regulatory Sandbox Agreement** — a lightweight 2-page letter of understanding required before any ANSP shadow mode activation. Specifies: +- Trial period start and end dates +- ANSP's confirmation that SpaceCom outputs are for internal validation only (not operational) +- SpaceCom's commitment to produce a shadow validation report at trial end +- Data protection terms for the trial period +- How incidents during the trial are handled by both parties +- Mutual agreement that the trial does not create any ongoing commercial obligation + +**Regulatory sandbox liability clarification (F11 — §61):** The sandbox agreement is not a liability shield by itself. During shadow mode, SpaceCom is a tool under evaluation — liability exposure depends on how the ANSP uses outputs and what the sandbox agreement says about consequences of errors. Required provisions: +- **No operational reliance clause:** ANSP certifies in writing that no operational decisions will be made on the basis of SpaceCom outputs during the trial. Any breach of this clause by the ANSP shifts liability to the ANSP. +- **Incident notification:** If a SpaceCom output error is identified during the trial, SpaceCom notifies the ANSP within 2 hours (matching the safety occurrence runbook at §26.8). The sandbox agreement specifies whether this constitutes a notifiable occurrence under the ANSP's SMS. +- **Indemnification cap:** SpaceCom's aggregate liability during the sandbox period is capped at AUD/EUR 50,000 (or local equivalent). 
Catastrophic loss claims are excluded (consistent with MSA terms). +- **Insurance requirement:** SpaceCom must carry professional indemnity insurance with minimum cover AUD/EUR 1 million before activating any sandbox with an ANSP. Certificate of currency provided to the ANSP before activation. +- **Regulatory notification duty:** If the ANSP's safety regulator requires notification of third-party tool trials (e.g., EASA, CASA, CAA), that obligation rests with the ANSP. SpaceCom provides a one-page system description document to support the ANSP's notification. +- **Sandbox ≠ approval pathway:** A successful sandbox trial is evidence for a future regulatory submission — it is not itself an approval. Neither party should represent the sandbox as a form of regulatory acceptance. + +`legal/SANDBOX_AGREEMENT_TEMPLATE.md` captures the standard text. Legal counsel review required before any amendment. + +The shadow mode admin toggle must display a warning if no Regulatory Sandbox Agreement is on record (`legal_opinions.shadow_mode_cleared = FALSE` for the org's jurisdiction): +``` +⚠ No legal clearance on record for this organisation's jurisdiction. + Shadow mode should not be activated without a completed legal opinion + and a signed Regulatory Sandbox Agreement. + [View legal status →] +``` + +### 24.3 ICAO Data Quality Mapping (Annex 15) + +SpaceCom outputs that may enter aeronautical information channels must be characterised against ICAO's five data quality attributes: + +| Attribute | SpaceCom Characterisation | Required Action | +|-----------|--------------------------|----------------| +| **Accuracy** | Decay predictor accuracy characterised from ≥10 historical re-entry backcasts vs. The Aerospace Corporation database. Published as a formal accuracy statement in `GET /api/v1/reentry/predictions/{id}` response. | Phase 3: produce accuracy characterisation document | +| **Resolution** | Corridor boundaries expressed as geographic polygons with stated precision. 
Position uncertainty stated as formal resolution value in prediction response. | Included in prediction API response from Phase 1 | +| **Integrity** | HMAC-SHA256 on all prediction and hazard zone records. Integrity assurance level: *Essential* (1×10⁻⁵). Documented in system description. | Implemented Phase 1 (§7.9) | +| **Traceability** | Full parameter provenance in `simulations.params_json` and prediction records. Accessible to regulatory auditors via dedicated API. | Phase 1 | +| **Timeliness** | Maximum latency from TIP message ingestion to updated prediction available: 30 minutes. Maximum latency from NOAA SWPC space weather update to prediction recalculation: 4 hours. Published as formal SLA. | Phase 3 SLA document | + +**F5 — Completeness attribute and ICAO Annex 15 §3.2 data quality classification (§61):** + +ICAO Annex 15 §3.2 defines a sixth implicit attribute — **Completeness** — meaning all data fields required by the receiving system are present and within range. SpaceCom must: +- Define a formal completeness schema for each prediction response (required fields, allowed nulls, value ranges) +- Return `data_quality.completeness_pct` in the prediction response (fields present / fields required × 100) +- Reject predictions with completeness < 90% from the alert pipeline (alert not generated; operator notified of incomplete prediction) + +**ICAO data category and classification** required in the prediction response (Annex 15 Table A3-1): + +| Field | Value | +|-------|-------| +| `data_category` | `AERONAUTICAL_ADVISORY` (until formal AIP entry process established) | +| `originator` | `SPACECOM` + system version string | +| `effective_from` | ISO 8601 UTC timestamp | +| `integrity_assurance` | `ESSENTIAL` (1×10⁻⁵ probability of undetected error) | +| `accuracy_class` | `CLASS_2` (advisory, not certified — until accuracy characterisation completes Phase 3 validation) | + +Formal accuracy characterisation (`docs/validation/ACCURACY_CHARACTERISATION.md`) is 
a Phase 3 gate before the API can be presented to any ANSP as meeting Annex 15 data quality standards. + +### 24.4 Safety Management System Integration + +Any ANSP formally adopting SpaceCom must include it in their SMS (ICAO Annex 19). SpaceCom provides the following artefacts to support ANSP SMS assessment: + +**Hazard register (SpaceCom's contribution to the ANSP's SMS — F3, §61 structured format):** + +Maintained as `docs/safety/HAZARD_LOG.md`. Each hazard uses the structured schema below. Hazard IDs are permanent — retired hazards are marked CLOSED, not deleted. + +| ID | Description | Cause | Effect | Mitigations | Severity | Likelihood | Risk Level | Status | +|----|-------------|-------|--------|-------------|----------|------------|------------|--------| +| HZ-001 | SpaceCom unavailable during active re-entry event | Infrastructure failure; deployment error; DDoS | ANSP cannot access current re-entry prediction during event window | Patroni HA failover (§26.3); 15-min RTO SLO; automated ANSP push notification + email; documented fallback procedure | Hazardous | Low (SLO 99.9%) | Medium | OPEN | +| HZ-002 | False all-clear prediction (false negative — corridor misses actual impact zone) | TLE age; atmospheric model error; MC sampling variance; adversarial data manipulation | ANSP issues all-clear; aircraft enters debris corridor | HMAC integrity check; dual-source TLE validation; TIP cross-check guard; shadow validation evidence; accuracy characterisation (Phase 3); `@pytest.mark.safety_critical` tests | Catastrophic | Very Low | High | OPEN | +| HZ-003 | False hazard prediction (false positive — corridor over-stated) | Atmospheric model conservatism; TLE propagation error | Unnecessary airspace restriction; operational disruption; credibility loss | Cross-source TLE validation; HMAC; p95 corridor with stated uncertainty; accuracy characterisation | Major | Low | Medium | OPEN | +| HZ-004 | Corridor displayed in wrong reference frame | ECI/ECEF/geographic 
frame conversion error; CZML frame parameter misconfiguration | Corridor shown at wrong lat/lon; operator makes decisions on incorrect geographic basis | Frame transform unit tests against IERS references (§17); CZML frame convention enforced via CI | Hazardous | Very Low | Medium | OPEN | +| HZ-005 | Outdated prediction served (stale data) | Ingest pipeline failure; TLE source outage; cache not invalidating | Operator sees prediction that no longer reflects current orbital state | Data staleness indicators in UI; automated stale alert to operators; ingest health monitoring; CZML cache invalidation triggers (§35) | Major | Low | Medium | OPEN | +| HZ-006 | Prediction integrity failure (HMAC mismatch) | Database modification; backup restore error; storage corruption | Prediction record cannot be verified; may have been tampered with | Prediction quarantined automatically; CRITICAL security alert; prediction withheld from API | Catastrophic | Very Low | High | OPEN | +| HZ-007 | Unauthorised access to prediction data | Compromised credentials; RLS bypass; API misconfiguration | Competitor or adversary obtains early re-entry corridor data; potential ITAR exposure | PostgreSQL RLS; JWT validation; rate limiting; `security_logs` audit trail; penetration testing | Major | Low | Medium | OPEN | + +**Hazard log governance:** +- Review: quarterly, and after each SEV-1 incident, model version update, or material system change +- New hazards identified during safety occurrence reporting are added within 5 business days +- Risk level = Severity × Likelihood using EUROCAE ED-153 risk classification matrix +- OPEN hazards with `High` risk level are Phase 2 gate blockers — must reach `MITIGATED` before ANSP shadow activation + +**System safety classification:** Safety-related (not safety-critical under DO-278A). Relevant components targeting SAL-2 assurance level (see §24.13). Development assurance standard: EUROCAE ED-78A equivalent for relevant components. 
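
The Severity × Likelihood governance rule above can be sketched as a small lookup. The populated cells mirror the hazard rows in this section; everything else (function names, dict schema) is illustrative, and the full matrix must come from EUROCAE ED-153 itself:

```python
# Illustrative ED-153-style risk classification lookup for hazard-log entries.
# Populated cells mirror the hazard table above; do not treat them as the
# authoritative ED-153 matrix.
RISK_MATRIX: dict[tuple[str, str], str] = {
    ("Catastrophic", "Very Low"): "High",    # HZ-002, HZ-006
    ("Hazardous", "Low"): "Medium",          # HZ-001
    ("Hazardous", "Very Low"): "Medium",     # HZ-004
    ("Major", "Low"): "Medium",              # HZ-003, HZ-005, HZ-007
    # ... remaining Severity x Likelihood cells from the ED-153 matrix
}

def risk_level(severity: str, likelihood: str) -> str:
    """Risk level for a hazard-log row; raises if the cell is not yet populated."""
    try:
        return RISK_MATRIX[(severity, likelihood)]
    except KeyError:
        raise ValueError(f"risk matrix cell not populated: {severity}/{likelihood}")

def phase2_gate_blockers(hazards: list[dict]) -> list[str]:
    """OPEN hazards at High risk level block ANSP shadow activation (Phase 2 gate)."""
    return [
        h["id"] for h in hazards
        if h["status"] == "OPEN"
        and risk_level(h["severity"], h["likelihood"]) == "High"
    ]
```

Encoding the matrix as data rather than nested conditionals keeps the quarterly hazard review diffable: a re-rated cell is a one-line change.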
+ +**Change management:** SpaceCom must notify all ANSP users before model version updates that affect prediction outputs. Version changes tracked in `simulations.model_version` and surfaced in the UI. + +### 24.5 NOTAM System Interface + +SpaceCom's position in the NOTAM workflow: + +``` +SpaceCom generates → NOTAM draft (ICAO format) → Reviewed by Persona A → Submitted by authorised NOTAM originator → Issued NOTAM +``` + +SpaceCom never submits NOTAMs. The draft is a decision support artefact. The mandatory disclaimer on every draft is a non-removable regulatory requirement, not a UI preference. + +**NOTAM timing requirements by jurisdiction:** +- Routine NOTAMs: 24–48 hours minimum lead time +- Short-notice (re-entry window < 24 hours): ASAP; NOTAM issued with minimum lead time +- SpaceCom alert thresholds align with these: CRITICAL alert at < 6h, HIGH at < 24h + +### 24.6 Space Law Considerations + +**UN Liability Convention (1972):** All SpaceCom prediction records, simulation runs, and alert acknowledgements may be legally discoverable in an international liability claim. The immutable audit trail (§7.9) is partially an evidence preservation mechanism. Retention of `reentry_predictions`, `alert_events`, `notam_drafts`, and `shadow_validations` for ≥7 years minimum. + +**National space laws with re-entry obligations:** +- **Australia:** Space (Launches and Returns) Act 2018. CASA and the Australian Space Agency have coordination protocols. SpaceCom's controlled re-entry planner outputs are suitable as evidence for operator obligations under this Act. +- **EU/ESA:** EU Space Programme Regulation; ESA Zero Debris Charter. SpaceCom supports Zero Debris by characterising re-entry risk and supporting responsible end-of-life planning. +- **US:** FAA AST re-entry licensing generates data that SpaceCom should ingest when available. 51 USC Chapter 509 obligations may affect US space operator customers. 
**Space Traffic Management evolution:** The US Office of Space Commerce is developing civil STM frameworks that may eventually replace Space-Track as the primary civil space data source. SpaceCom's ingest architecture must be adaptable (hardcoded URL constants in `ingest/sources.py` make this a 1-file change when the source changes).

### 24.7 ICAO Framework Alignment

**Existing:** ICAO Doc 10100 (Manual on Space Weather Information, 2019) describes the service provided by the three ICAO-designated global space weather centres: the United States (NOAA SWPC), the European PECASUS consortium, and the ACFJ consortium (Australia, Canada, France, Japan). SpaceCom's space weather widget must reference these designated centres by name and ICAO recognition status.

**Emerging re-entry guidance:** ICAO is in early stages of developing re-entry hazard notification guidance (no published document as of 2025). SpaceCom should:
- Monitor ICAO Air Navigation Commission and Meteorology Panel working group outputs
- Design hazard corridor outputs in a format that parallels SIGMET structure (the closest existing ICAO framework: WHO/WHAT/WHERE/WHEN/INTENSITY/FORECAST) — this positions SpaceCom well for whatever standard emerges
- Consider engaging ICAO working groups as a stakeholder; SpaceCom could become a reference implementation

**SIGMET parallel structure for re-entry corridor outputs:**
```
REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)
WHO: CZ-5B ROCKET BODY / NORAD 44878
WHAT: UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE
WHERE: CORRIDOR 18S115E TO 28S155E / FL000 TO UNL
WHEN: FROM 2026031614 TO 2026031622 UTC / WINDOW ±4H (P95)
RISK: HIGH / LAND AREA IN CORRIDOR: 12%
FORECAST: CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H
SOURCE: SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3
```

### 24.8 Alert Threshold Governance

Alert threshold values are consequential algorithmic decisions. A CRITICAL threshold that is too sensitive causes unnecessary airspace disruption; one that is too conservative creates false-negative risk.
Both outcomes have legal, operational, and reputational consequences. + +**Current threshold values and rationale:** + +| Threshold | Value | Rationale | +|-----------|-------|-----------| +| CRITICAL window | < 6h | Aligns with ICAO minimum NOTAM lead time for short-notice restrictions; 6h allows ANSP to issue NOTAM with ≥2h lead time | +| HIGH window | < 24h | Operational planning horizon for pre-tactical airspace management | +| FIR intersection trigger | p95 corridor intersects any non-zero area of the FIR | Conservative: any non-zero intersection at p95 level generates an alert; minimum area threshold is an org-configurable setting (default: 0) | +| Alert rate limit | 1 CRITICAL per object per 4h window | Prevents alert flooding from repeated window-shrink events without substantive new information | +| Alert storm threshold | > 5 CRITICAL in 1h | Empirically chosen; above this rate the response-time expectation for individual alerts cannot be met | + +These values are recorded in `docs/alert-threshold-history.md` with initial entry date and author sign-off. + +**Threshold change procedure:** +1. Engineer proposes change in a PR with rationale documented in `docs/alert-threshold-history.md` +2. PR requires review by engineering lead **and** product owner before merge +3. Change is deployed to staging; minimum 2-week shadow-mode observation period against real TLE/TIP data +4. Shadow observation review: false positive rate and false negative rate compared against pre-change baseline +5. If baseline comparison passes: change deployed to production; all ANSP shadow deployment partners notified in writing with new threshold values +6. If any ANSP objects: change is held until concerns are resolved + +**Threshold values are not configurable at runtime by operators.** They are code constants reviewed through the above process. Org-configurable alert settings (geographic FIR filter, mute rules, `OPS_ROOM_SUPPRESS_MINUTES`) are UX preferences, not threshold changes. 
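
A minimal sketch of how these governance-controlled constants might live in code (the module path, constant names, and `classify_window` helper are assumptions, not an agreed implementation):

```python
# alerts/thresholds.py (illustrative path) — governance-controlled constants
# mirroring the table in this section. Changes go through the threshold change
# procedure and docs/alert-threshold-history.md, never runtime configuration.
from datetime import timedelta
from typing import Final

CRITICAL_WINDOW: Final[timedelta] = timedelta(hours=6)      # CRITICAL: < 6h to re-entry
HIGH_WINDOW: Final[timedelta] = timedelta(hours=24)         # HIGH: < 24h pre-tactical horizon
CRITICAL_RATE_LIMIT: Final[timedelta] = timedelta(hours=4)  # 1 CRITICAL per object per 4h
ALERT_STORM_CRITICAL_COUNT: Final[int] = 5                  # > 5 CRITICAL in 1h = alert storm
ALERT_STORM_WINDOW: Final[timedelta] = timedelta(hours=1)

def classify_window(time_to_reentry: timedelta) -> str:
    """Map predicted time-to-reentry to an alert severity per the table above."""
    if time_to_reentry < CRITICAL_WINDOW:
        return "CRITICAL"
    if time_to_reentry < HIGH_WINDOW:
        return "HIGH"
    return "NONE"
```

Keeping the values as module-level constants (rather than database settings) means every threshold change necessarily arrives as a reviewable PR, which is exactly what the change procedure requires.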
+ +### 24.9 Degraded Mode and Availability + +SpaceCom must specify degraded mode behaviour for ANSP adoption: + +| Condition | System Behaviour | ANSP Action | +|-----------|-----------------|-------------| +| Ingest pipeline failure (TLE data > 6h stale) | MEDIUM alert to all operators; staleness indicator on all objects; predictions greyed | Consult Space-Track directly; activate fallback procedure | +| Space weather data > 4h stale | WARNING banner on SpaceWeatherWidget; uncertainty multiplier set to HIGH conservatively | Note wider uncertainty on any operational decisions | +| System unavailable | Push notification to all registered users; email to ANSP contacts | Activate fallback procedure documented in SpaceCom SMS integration guide | +| HMAC verification failure on a prediction | Prediction withheld; CRITICAL security alert; prediction marked `integrity_failed` | Do not use the withheld prediction; contact SpaceCom immediately | + +**Degraded mode notification:** When SpaceCom is down or data is stale beyond defined thresholds, all connected ANSPs receive push notification (WebSocket if connected; email fallback) so they can activate their fallback procedures. SpaceCom must never go silent when operationally relevant events are active. + +--- + +### 24.10 EU AI Act Obligations + +**Classification:** SpaceCom's conjunction probability model (§19) and any ML-based alert prioritisation constitute an AI system under EU AI Act Art. 3(1). AI systems used in transport infrastructure safety fall under Annex III, point 4 (AI systems intended to be used for dispatching, monitoring, and maintenance of transport infrastructure including aviation). This classification implies **high-risk AI system** obligations. + +**High-risk AI system obligations (EU AI Act Chapter III Section 2):** + +| Obligation | Article | SpaceCom implementation | +|-----------|---------|------------------------| +| Risk management system | Art. 
9 | Integrate with existing SMS (§24.4); maintain AI-specific risk register in `legal/EU_AI_ACT_ASSESSMENT.md` | +| Data governance | Art. 10 | TLE training data provenance documented; `simulations.params_json` stores full input provenance; bias assessment required for orbital prediction models | +| Technical documentation | Art. 11 + Annex IV | `legal/EU_AI_ACT_ASSESSMENT.md` — system description, capabilities, limitations, human oversight measures, accuracy characterisation | +| Record-keeping / automatic logging | Art. 12 | `reentry_predictions` and `alert_events` tables provide automatic event logging; immutable (`APPEND`-only with HMAC) | +| Transparency to users | Art. 13 | Conjunction probability values labelled with model version (`simulations.model_version`), TLE age, EOP currency; uncertainty bounds displayed | +| Human oversight | Art. 14 | All decisions remain with duty controller (§24.2 AUP; §28.6 Decision Prompts disclaimer); no autonomous action taken by SpaceCom | +| Accuracy, robustness, cybersecurity | Art. 15 | Accuracy characterisation (§24.3 ICAO Data Quality); adversarial robustness covered by §7 and §36 security review | +| Conformity assessment | Art. 43 | Self-assessment pathway available for transport safety AI without third-party involvement at first deployment; document in `legal/EU_AI_ACT_ASSESSMENT.md` | +| EU database registration | Art. 51 | High-risk AI systems must be registered in the EU AI Act database before placing on market; legal milestone in deployment roadmap | + +**Human oversight statement (required in UI — Art. 14):** The conjunction probability display (§19.4) must include the following non-configurable statement in the model information panel: + +> *"This probability estimate is generated by an AI model and is subject to uncertainty arising from TLE age, atmospheric model limitations, and manoeuvre uncertainty. All operational decisions remain with the duty controller. 
This system does not replace ANSP procedures."*
+
+**Gap analysis and roadmap:** `legal/EU_AI_ACT_ASSESSMENT.md` must document: current compliance state → gaps → remediation actions → target dates. Phase 2 gate: conformity assessment documentation complete. Phase 3 gate: EU database registration completed before commercial EU deployment.
+
+---
+
+### 24.11 Regulatory Correspondence Register
+
+For an ANSP-facing product, regulators (CAA, EASA, national ANSPs, ESA, ICAO) will issue queries, audits, formal requests, and correspondence. Missed regulatory deadlines can constitute a licence breach or grounds for suspension of operations.
+
+**Correspondence log:** `legal/REGULATORY_CORRESPONDENCE_LOG.md` — structured register with the following fields per entry:
+
+| Field | Description |
+|-------|-------------|
+| Date received | ISO 8601 |
+| Authority | Regulatory body name and country |
+| Reference number | Authority's reference (if given) |
+| Subject | Brief description |
+| Deadline | Formal response deadline (ISO 8601) |
+| Owner | Named individual responsible for response |
+| Status | PENDING / RESPONDED / CLOSED / ESCALATED |
+| Response date | Date formal response sent |
+| Notes | Internal context, legal counsel involvement |
+
+**SLAs:**
+- All regulatory correspondence acknowledged (receipt confirmed to sender) within **2 business days**
+- Substantive response or extension request within **14 calendar days** (or as required by the correspondence)
+- All correspondence older than 14 days without a RESPONDED or CLOSED status triggers an escalation to the CEO
+
+**Proactive regulatory engagement:** The correspondence register is reviewed at each quarterly steering meeting. Any authority that has issued ≥3 queries in a 12-month period warrants a proactive engagement call to identify and address systemic concerns before they become formal regulatory actions.
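The 14-day escalation rule is mechanical enough to automate against the register. A sketch, assuming entries are parsed into dicts whose keys mirror the table above (the function name and sample reference numbers are illustrative):

```python
from datetime import date


def ceo_escalations(entries: list[dict], today: date) -> list[str]:
    """Return reference numbers of entries older than 14 days that are
    neither RESPONDED nor CLOSED (the CEO escalation rule in the SLAs)."""
    return [
        e["reference"]
        for e in entries
        if (today - e["received"]).days > 14
        and e["status"] not in {"RESPONDED", "CLOSED"}
    ]
```

Run daily in CI or a scheduled job, a non-empty result would page the named owners and the CEO rather than waiting for the quarterly review.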
+ +--- + +### 24.12 Safety Case Framework (F1 — §61) + +A safety case is a structured argument that a system is acceptably safe for a specified use in a defined context. SpaceCom must produce and maintain a safety case before any operational ANSP deployment. The safety case is a living document, updated at each material system change. + +**Safety case structure** (Goal Structuring Notation — GSN, consistent with EUROCAE ED-153 / IEC 61508 safety case guidance): + +``` +G1: SpaceCom is acceptably safe to use as a decision support tool + for re-entry hazard awareness in civil airspace operations + + C1: Context — SpaceCom operates as decision support (not autonomous authority); + all operational decisions remain with the ANSP duty controller + + S1: Argument strategy — safety achieved by hazard identification, + risk reduction, and operational constraints + + G1.1: All identified hazards are mitigated to acceptable risk levels + Sn1: Hazard Log (docs/safety/HAZARD_LOG.md) + E1.1.1: HZ-001 through HZ-007 mitigation evidence (§24.4) + E1.1.2: Shadow validation report (≥30 day trial) + + G1.2: System integrity is maintained through all operational modes + Sn2: HMAC integrity on all safety-critical records (§7.9) + E1.2.1: `@pytest.mark.safety_critical` test suite — 100% pass + E1.2.2: Integrity failure quarantine demonstrated (§56 E2E test) + + G1.3: Operators are trained and capable of correct system use + Sn3: Operator Training Programme (§28.9) + E1.3.1: Training completion records (operator_training_records table) + E1.3.2: Reference scenario completion evidence + + G1.4: Degraded mode provides adequate notification for fallback + Sn4: Degraded mode specification (§24.9) + E1.4.1: ANSP communication plan activated in game day exercise (§26.8) + + G1.5: Regulatory obligations are met for the deployment jurisdiction + Sn5: Means of Compliance document (§24.14) + E1.5.1: Legal opinions for deployment jurisdictions (§24.2) + E1.5.2: ANSP SMS integration guide (§24.15) 
+``` + +**Safety case document:** `docs/safety/SAFETY_CASE.md`. Version-controlled; each tagged release includes a safety case snapshot. Safety case review is required before: +- ANSP shadow mode activation +- Model version updates that affect prediction outputs +- New deployment jurisdiction +- Any change to alert thresholds (§24.8) + +**Safety case custodian:** Named individual (Phase 2: CEO or CTO until a dedicated safety manager is appointed). Changes to the safety case require the custodian's sign-off. + +--- + +### 24.13 Software Assurance Level (SAL) Assignment (F2 — §61) + +EUROCAE ED-153 / DO-278A defines Software Assurance Levels for ground-based aviation software systems. The appropriate SAL determines the rigour of development, verification, and documentation activities required. + +**SpaceCom SAL assignment:** + +| Component | Failure Condition | Severity Class | SAL | Rationale | +|-----------|------------------|----------------|-----|-----------| +| Re-entry prediction engine (`physics/`) | False all-clear (HZ-002) | Hazardous | SAL-2 | Undetected false negative could contribute to an airspace safety event; highest-consequence component | +| Alert generation pipeline (`alerts/`) | Failed alert delivery; wrong threshold applied | Hazardous | SAL-2 | Failure to generate a CRITICAL alert during an active event is equivalent consequence to HZ-002 | +| HMAC integrity verification | Integrity failure undetected | Hazardous | SAL-2 | Loss of integrity checking removes the primary guard against data manipulation | +| CZML corridor rendering | Wrong geographic position displayed (HZ-004) | Hazardous | SAL-2 | Geographic display error directly misleads operator | +| API authentication and authorisation | Unauthorised data access (HZ-007) | Major | SAL-3 | Privacy and data governance impact; not directly causal of airspace event | +| Ingest pipeline (`worker/`) | Stale data not detected (HZ-005) | Major | SAL-3 | Staleness monitoring is a mitigation for HZ-005; 
failure of staleness monitoring increases HZ-005 likelihood | +| Frontend (non-safety-critical paths) | Cosmetic / non-operational UI failure | Minor | SAL-4 | Not in the safety-critical path | + +**SAL-2 implications** (minimum activities required): +- Independent verification of requirements, design, and code for SAL-2 components (see §24.16 Verification Independence) +- Formal test coverage: 100% statement coverage for SAL-2 modules (enforced via `@pytest.mark.safety_critical`) +- Configuration management of all SAL-2 source files and their test artefacts (see §30.8) +- SAL-2 components documented in the safety case with traceability from requirement → design → code → test + +**SAL assignment document:** `docs/safety/SAL_ASSIGNMENT.md` — reviewed at each architecture change and before any ANSP deployment. + +--- + +### 24.14 Means of Compliance (MoC) Document (F8 — §61) + +A Means of Compliance document maps each regulatory or standard requirement to the specific implementation evidence that demonstrates compliance. Required before any formal regulatory submission (ESA bid, EASA consultation response, ANSP safety acceptance). 
+ +**Document:** `docs/safety/MEANS_OF_COMPLIANCE.md` + +**Structure:** + +| Requirement ID | Source | Requirement Text (summary) | Means of Compliance | Evidence Location | Status | +|---------------|--------|---------------------------|--------------------|--------------------|--------| +| MOC-001 | EUROCAE ED-153 §5.3 | Software requirements defined and verifiable | Requirements documented in relevant §sections of MASTER_PLAN; acceptance criteria in TEST_PLAN | `docs/TEST_PLAN.md`; relevant §sections | PARTIAL | +| MOC-002 | EUROCAE ED-153 §6.4 | Independent verification of SAL-2 software | Verification independence policy (§24.16); separate reviewer for safety-critical PRs | `docs/safety/VERIFICATION_INDEPENDENCE.md` | PLANNED | +| MOC-003 | ICAO Annex 15 §3.2 | Data quality attributes characterised | ICAO data quality table (§24.3); accuracy characterisation document | `docs/validation/ACCURACY_CHARACTERISATION.md` | PARTIAL (Phase 3) | +| MOC-004 | ICAO Annex 19 | ANSP SMS integration supported | SMS integration guide; hazard register; training programme | `docs/safety/ANSP_SMS_GUIDE.md`; `docs/safety/HAZARD_LOG.md` | PLANNED | +| MOC-005 | EU AI Act Art. 9 | Risk management system documented | AI Act assessment; hazard log; safety case | `legal/EU_AI_ACT_ASSESSMENT.md`; `docs/safety/HAZARD_LOG.md` | IN PROGRESS | +| MOC-006 | DO-278A §10 | Configuration management of safety artefacts | CM policy (§30.8); Git tagging of releases; signed commits | `docs/safety/CM_POLICY.md` | PLANNED | +| MOC-007 | ED-153 §7.2 | Safety occurrence reporting procedure | Runbook in §26.8; `SAFETY_OCCURRENCE` log type | `docs/runbooks/`; `security_logs` table | IMPLEMENTED | + +The MoC document is a Phase 2 deliverable. `PARTIAL` items become Phase 3 gates. `PLANNED` items require assigned owners and completion dates before ANSP shadow activation. 
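The activation gate in the last paragraph can be machine-checked rather than hand-audited. A sketch, with row fields mirroring the MoC table columns (the helper name and field keys are illustrative assumptions):

```python
def shadow_activation_blockers(moc_rows: list[dict]) -> list[str]:
    """PLANNED items must carry an owner and a completion date before ANSP
    shadow activation; return the IDs of rows that block activation.
    PARTIAL items are Phase 3 gates and pass this particular check."""
    return [
        row["id"]
        for row in moc_rows
        if row["status"] == "PLANNED"
        and not (row.get("owner") and row.get("due"))
    ]
```

Wired into CI against a parsed copy of `docs/safety/MEANS_OF_COMPLIANCE.md`, this would fail the pipeline that flips shadow mode on while any PLANNED row lacks an owner or date.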
+ +--- + +### 24.15 ANSP-Side Obligations Document (F10 — §61) + +SpaceCom cannot unilaterally satisfy all regulatory requirements — the receiving ANSP has obligations that SpaceCom must document and communicate. Failing to do so is a gap in the safety argument. + +**Document:** `docs/safety/ANSP_SMS_GUIDE.md` — provided to every ANSP before shadow mode activation. + +**ANSP obligations by category:** + +| Category | ANSP Obligation | SpaceCom Provides | +|----------|----------------|-------------------| +| SMS integration | Include SpaceCom in ANSP SMS under ICAO Annex 19 | Hazard register contribution (§24.4); SAL assignment; safety case | +| Change notification | Notify SpaceCom of any ANSP procedure changes that affect how SpaceCom outputs are used | Change notification contact in MSA | +| Operator training | Ensure all SpaceCom users complete the operator training programme (§28.9) | Training modules; completion API; training records | +| Fallback procedure | Maintain and exercise a fallback procedure for SpaceCom unavailability | Fallback procedure template in onboarding documentation | +| Occurrence reporting | Report any safety occurrence involving SpaceCom outputs to SpaceCom within 24 hours | Safety occurrence form; contact details; §26.8 runbook | +| Regulatory notification | Notify applicable safety regulator of SpaceCom use if required by national SMS regulations | System description one-pager for regulator submission | +| Shadow validation | Participate in ≥30-day shadow validation trial; provide evaluation feedback | Shadow validation report template; shadow validation dashboard | +| AUP acceptance | Ensure all users accept the AUP (§24.2) | Automated AUP flow; compliance report for ANSP admin | + +**Liability assignment note (links to §24.2 and §24.12 F11):** The ANSP SMS guide explicitly states that the ANSP retains full operational authority and accountability for all air traffic decisions, regardless of SpaceCom outputs. 
SpaceCom is a decision support tool. This statement must appear in the ANSP SMS guide, the AUP, and the safety case context node C1 (§24.12). + +### 25.1 Target Tender Profile + +SpaceCom targets ESA tenders in the following programme areas: +- **Space Safety Programme** — re-entry risk, SSA services, space debris +- **GSTP (General Support Technology Programme)** — technology development with commercial potential +- **ARTES (Advanced Research in Telecommunications Systems)** — if the commercial operator portal reaches satellite operators +- **Space-Air Traffic Integration** studies — the category matching ESA's OKAPI:Orbits award + +### 25.2 Differentiation from ESA ESOC Re-entry Prediction Service + +ESA's re-entry prediction service (`reentry.esoc.esa.int`) is a technical product for space operators and agencies. SpaceCom is **not a competitor** to this service — it is a complementary operational layer that could consume ESOC outputs: + +| Dimension | ESA ESOC Service | SpaceCom | +|-----------|-----------------|---------| +| Primary user | Space agencies, debris researchers | ANSPs, airspace managers, space operators | +| Output format | Technical prediction reports | Operational decision support + NOTAM drafts | +| Aviation integration | None | Core feature | +| ANSP decision workflow | Not designed for this | Primary design target | +| Space operator portal | Not provided | Phase 2 deliverable | +| Shadow mode / regulatory adoption | Not provided | Built-in | + +**In an ESA bid:** Position SpaceCom as the *user-facing operational layer* that sits on top of the space surveillance and prediction infrastructure that ESA already operates. ESA invests in the physics; SpaceCom invests in the interface that makes the physics actionable for aviation authorities and space operators. 
+ +### 25.3 TRL Roadmap (ESA Definitions) + +| Phase | End TRL | Evidence | +|-------|---------|---------| +| Phase 1 complete | **TRL 4** | Validated decay predictor (≥3 historical backcasts); SGP4 globe with real TLE data; Mode A corridors; HMAC integrity; full security infrastructure | +| Phase 2 complete | **TRL 5** | Atmospheric breakup; Mode B heatmap; NOTAM drafting; space operator portal; CCSDS export; shadow mode; ≥1 ANSP shadow deployment running | +| Phase 3 complete | **TRL 6** | System demonstrated in operationally relevant environment; ≥1 ANSP shadow deployment with ≥4 weeks validation data; external penetration test passed; ECSS compliance artefacts complete | +| Post-Phase 3 | **TRL 7** | System prototype demonstrated in operational environment (live ANSP deployment, not shadow) | + +### 25.4 ECSS Standards Compliance + +ESA contracts require compliance with the European Cooperation for Space Standardization (ECSS). Required compliance mapping: + +| Standard | Title | SpaceCom Compliance | +|----------|-------|-------------------| +| **ECSS-Q-ST-80C** | Software Product Assurance | Software Management Plan, V&V Plan, Product Assurance Plan — produced Phase 3 | +| **ECSS-E-ST-10-04C** | Space environment | NRLMSISE-00 and JB2008 compliance with ECSS atmospheric model requirements | +| **ECSS-E-ST-10-12C** | Methods for re-entry and debris footprint calculation | Decay predictor and atmospheric breakup model methodology documented and traceable | +| **ECSS-U-AS-010C** | Space sustainability | Zero Debris Charter alignment statement; controlled re-entry planner outputs | + +**Compliance matrix document** (produced Phase 3): Maps every ECSS requirement to the relevant SpaceCom component, test, or document. Required for ESA tender submission. 
+ +### 25.5 ESA Zero Debris Charter Alignment + +SpaceCom directly supports the Zero Debris Charter objectives: + +| Charter Objective | SpaceCom Support | +|-------------------|----------------| +| Responsible end-of-life disposal | Controlled re-entry planner generates CCSDS-format manoeuvre plans minimising ground risk | +| Transparency of re-entry risk | Public hazard corridor data; NOTAM drafting; multi-ANSP coordination | +| Reduction of casualty risk | Atmospheric breakup model; casualty area computation; population density weighting in deorbit optimiser | +| Data sharing | API layer for space operator integration; CCSDS export; open prediction endpoints | + +Include Zero Debris Charter alignment statement in all ESA bid submissions. + +### 25.6 Required ESA Procurement Artefacts + +All ESA contracts require these management documents. SpaceCom must produce them by Phase 3: + +| Document | ECSS Reference | Content | +|----------|---------------|---------| +| **Software Management Plan (SMP)** | ECSS-Q-ST-80C §5 | Development methodology, configuration management, change control, documentation standards | +| **Verification and Validation Plan (VVP)** | ECSS-Q-ST-80C §6 | Test strategy, traceability from requirements to test cases, acceptance criteria | +| **Product Assurance Plan (PAP)** | ECSS-Q-ST-80C §4 | Safety, reliability, quality standards and how they are met | +| **Data Management Plan (DMP)** | ECSS-Q-ST-80C §8 | How data produced under contract is handled, shared, archived, and made reproducible | +| **Software Requirements Specification (SRS)** | Tailored ECSS-E-ST-40C | Software requirements baseline, interfaces, external dependencies, and bounded assumptions including air-risk and RDM exchange boundaries | +| **Software Design Description (SDD)** | Tailored ECSS-E-ST-40C | Module architecture, algorithm choices, interface contracts, and validation assumptions | +| **User Manual / Ops Guide** | Tailored ECSS-E-ST-40C | Installation, 
configuration, operator workflows, limitations, and degraded-mode handling | +| **Test Plan + Test Report** | Tailored ECSS-Q-ST-80C | Planned validation campaign, executed results, deviations, and acceptance evidence for procurement submission | +| **Accessibility Conformance Report (ACR/VPAT 2.4)** | EN 301 549 v3.2.1 | WCAG 2.1 AA conformance declaration; mandatory for EU public sector ICT procurement; maps each success criterion to Supports / Partially Supports / Does Not Support with remarks | + +Scaffold documents for all procurement-facing artefacts should be created at Phase 1 start and maintained throughout development — not produced from scratch at Phase 3. + +For contracts with explicit software prototype review gates (e.g. PDR, TRR, CDR, QR, FR), the SRS, SDD, User Manual, Test Plan, and Test Report are updated incrementally at each milestone rather than back-filled only at final review. + +### 25.7 Consortium Strategy + +ESA study contracts typically favour consortia that combine: +- **Technical depth** (university or research institute) +- **Industrial relevance** (commercial applicability) +- **End-user representation** (the entity that will use the output) + +SpaceCom's ideal consortium for an ESA bid: +- **SpaceCom** (lead) — system integration, aviation domain interface, commercial deployment +- **Academic partner** (orbital mechanics / atmospheric density modelling credibility — equivalent to TU Braunschweig in the OKAPI:Orbits consortium) +- **ANSP or aviation authority** (end-user representation — demonstrates the aviation gap is real and the solution is wanted) + +Without a credentialled academic or research partner for the physics components, ESA evaluators may question the technical depth. Identify and approach potential academic partners before submitting to any ESA tender. 
+
+### 25.8 Intellectual Property Framework for ESA Bids
+
+ESA contracts operate under the ESA General Conditions of Contract, which distinguish between **background IP** (pre-existing IP brought into the contract) and **foreground IP** (IP created during the contract). The default terms grant ESA a non-exclusive, royalty-free licence to use foreground IP, while the contractor retains ownership. These terms are negotiable and must be agreed before contract signature.
+
+**Required IP actions before bid submission:**
+
+1. **Background IP schedule:** Document all SpaceCom components that constitute background IP — physics engine, data model, UX design, proprietary algorithms. This schedule protects SpaceCom's ability to continue commercial deployment after the ESA contract ends without ESA claiming rights to the core product.
+
+2. **Foreground IP boundary:** Define clearly what will be created during the ESA contract (e.g., specific ECSS compliance artefacts, validation datasets, TRL demonstration reports) versus what SpaceCom brings in as background IP. Narrow the foreground IP scope to ESA-specific deliverables only.
+
+3. **Software Bill of Materials (SBOM):** Required for ECSS compliance and as part of the ESA bid artefact package. Generated via `syft` or `cyclonedx-bom`. Must identify all third-party licences. Copyleft components (AGPLv3 and similar) cannot appear in the SBOM of a closed-source ESA deliverable. Note that the CesiumJS library itself is Apache-2.0-licensed; the commercial licence requirement attaches to Cesium ion services and assets rather than to an AGPL obligation.
+
+4. **Consortium Agreement:** Must be signed by all consortium members before bid submission. Must specify:
+   - IP ownership for each consortium member's contributions
+   - Publication rights for academic partners (must not conflict with any commercial confidentiality obligations)
+   - Revenue share for any commercial use arising from the contract
+   - Liability allocation between consortium members
+   - Exit terms if a member withdraws
+
+5.
**Export control pre-clearance:** Confirm with counsel that the planned ESA deliverable does not require an export licence for transfer to ESA (a Paris-based intergovernmental organisation). Generally covered under EAR licence exception GOV, but verify for any controlled technology components.
+
+---
+
+## 26. SRE and Reliability Framework
+
+### 26.1 Service Level Objectives
+
+SpaceCom is most critical during active re-entry events — peak load coincides with highest operational stakes. Standard availability metrics are insufficient. SLOs must be defined against *event-correlated* conditions, not just averages.
+
+| Service Level Indicator | SLO | Measurement Window | Notes |
+|------------------------|-----|--------------------|-------|
+| Prediction API availability | 99.9% | Rolling 30 days | 43.2 min error budget per 30-day window |
+| Prediction API availability (active TIP event) | 99.95% | Duration of TIP window | Stricter; degradation during events is SEV-1 |
+| Decay prediction latency p50 | < 90s | Per MC job | 500-sample chord run |
+| Decay prediction latency p95 | < 240s | Per MC job | Drives worker sizing (§27) |
+| CZML ephemeris load p95 | < 2s | Per request | 100-object catalog |
+| TIP message ingest latency | < 30 min from publication | Per TIP message | Drives CRITICAL alert timing |
+| Space weather update latency | < 15 min from NOAA SWPC | Per update cycle | Drives uncertainty multiplier refresh |
+| Alert WebSocket delivery latency | < 10s from trigger | Per alert | Measured trigger→client receipt |
+| Corridor update after new TIP | < 60 min | Per TIP message | Full MC rerun triggered |
+
+**Error budget policy:** When the 30-day rolling error budget is exhausted, no further deployments or planned maintenance are permitted until the next measurement window opens. Tracked in Grafana SLO dashboard (§26.8).
+
+**SLOs must be written into the model user agreement** (§24.2) and agreed with each ANSP customer before operational deployment.
ANSPs need defined thresholds to determine when to activate their fallback procedures. + +**Customer-facing SLA (Finding 7) — contractual commitments in the MSA:** + +Internal SLOs are aspirational targets; the SLA is a binding contractual commitment with defined measurement, exclusions, and credits. The MSA template includes the following SLA schedule: + +| Metric | SLA commitment | Measurement | Exclusions | +|---|---|---|---| +| Monthly availability | 99.5% | External uptime monitor; excludes scheduled maintenance (max 4h/month; 48h advance notice) | Force majeure; upstream data source outages (Space-Track, NOAA SWPC) lasting > 4h | +| Critical alert delivery | Within 5 minutes of trigger (p95) | `alert_events.created_at` → `delivered_websocket/email = TRUE` timestamp | Customer network connectivity issues | +| Prediction freshness | p50 updated within 4h of new TLE availability | `tle_sets.ingested_at` → `reentry_predictions.created_at` | Space-Track API outage > 4h | +| Support response — CRITICAL incident | Initial response within 1 hour | From customer report or automated alert, whichever earlier | Outside contracted support hours (on-call for CRITICAL) | +| Support response — P1 resolution | Within 8 hours | From initial response | — | +| Service credits | 1 day credit per 0.1% availability below SLA | Applied to next invoice | — | + +Any SRE threshold change that could cause an SLA breach (e.g., raising the ingest failure alert threshold beyond 4 hours) must be reviewed by the product owner before deployment. Tracked in `docs/sla/sla-schedule-v{N}.md` (versioned; MSA references the current version by number). 
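The budget and credit arithmetic is worth pinning down: at 99.9% a strict rolling 30-day window allows 43.2 minutes of downtime (the commonly quoted 43.8 minutes corresponds to an average calendar month, not a 30-day window). A sketch of both calculations; function names are illustrative, the credit rule follows the schedule's "1 day per 0.1 percentage point":

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a rolling window at the given SLO.
    99.9% over 30 days -> 30 * 24 * 60 * 0.001 = 43.2 minutes."""
    return window_days * 24 * 60 * (1 - slo)


def service_credit_days(measured: float, sla: float = 0.995) -> int:
    """One day of credit per full 0.1 percentage point of measured
    availability below the contractual SLA (applied to the next invoice)."""
    if measured >= sla:
        return 0
    shortfall_pp = (sla - measured) * 100        # percentage points below SLA
    return int((shortfall_pp + 1e-9) / 0.1)      # floor; epsilon guards float noise
```

For example, a month measured at 99.3% against the 99.5% SLA is 0.2 pp short and accrues 2 credit days.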
+ +--- + +### 26.2 Recovery Objectives + +| Objective | Target | Scope | Derivation | +|-----------|--------|-------|-----------| +| RTO (active TIP event) | ≤ 15 minutes | Prediction API restoration | CRITICAL alert rate-limit window is 4 hours per object; 15-minute outage is tolerable within this window without skipping a CRITICAL cycle; beyond 15 minutes the ANSP must activate fallback procedures | +| RTO (no active event) | ≤ 60 minutes | Full system restoration | 1-hour window aligns with MSA SLA commitment; exceeding this triggers the P1 communication plan | +| RPO (safety-critical tables) | Zero | `reentry_predictions`, `alert_events`, `security_logs`, `notam_drafts` — synchronous replication required | UN Liability Convention evidentiary requirements; loss of a single alert acknowledgement record could be material in a liability investigation | +| RPO (operational data) | ≤ 5 minutes | `orbits`, `tle_sets`, `simulations` — async replication acceptable | 5-minute data age is within the staleness tolerance for TLE-based predictions; loss of in-flight simulations is recoverable by re-submission | + +**MSA sign-off requirement:** RTO and RPO targets must be explicitly stated in and agreed upon in the Master Services Agreement with each ANSP customer before any production deployment. Customers must acknowledge that the fallback procedure (Space-Track direct + ESOC public re-entry page) is their responsibility during the RTO window. RTO/RPO targets are not unilaterally changeable by SpaceCom — any tightening requires customer notification ≥30 days in advance; any relaxation requires customer consent. 
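PostgreSQL has no table-level synchronous-commit setting, so the sync/async split in the table above is typically enforced per transaction: writes touching zero-RPO tables run with `SET LOCAL synchronous_commit TO on`, everything else with the cheaper `local`. A sketch of the decision logic only (the surrounding session/transaction plumbing is assumed, and table names come from the table above):

```python
# Zero-RPO tables from the recovery objectives table; membership decides
# the commit level for the enclosing transaction.
SAFETY_CRITICAL_TABLES = frozenset(
    {"reentry_predictions", "alert_events", "security_logs", "notam_drafts"}
)


def synchronous_commit_level(tables_touched: set[str]) -> str:
    """'on' waits for the synchronous standby to confirm the WAL flush
    (RPO = 0); 'local' waits only for the primary's own flush, which is
    acceptable for tables with RPO <= 5 min under async replication."""
    if SAFETY_CRITICAL_TABLES & tables_touched:
        return "on"
    return "local"


def transaction_preamble(tables_touched: set[str]) -> str:
    # SET LOCAL scopes the override to the current transaction only
    return f"SET LOCAL synchronous_commit TO {synchronous_commit_level(tables_touched)};"
```

Both `on` and `local` are real `synchronous_commit` levels; a transaction mixing safety-critical and operational tables correctly pays the stricter price.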
+
+---
+
+### 26.3 High Availability Architecture
+
+#### TimescaleDB — Streaming Replication + Patroni
+
+```yaml
+# Primary + hot standby; Patroni manages leader election and failover
+db_primary:
+  image: timescale/timescaledb-ha:pg17
+  environment:
+    PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
+    PATRONI_REPLICATION_USERNAME: replicator
+  networks: [db_net]
+
+db_standby:
+  image: timescale/timescaledb-ha:pg17
+  environment:
+    PATRONI_REPLICA: "true"
+  networks: [db_net]
+
+etcd:
+  image: bitnami/etcd:3  # Patroni DCS
+  networks: [db_net]
+```
+
+- Synchronous replication for `reentry_predictions`, `alert_events`, `security_logs`, `notam_drafts` (RPO = 0): `synchronous_standby_names = 'FIRST 1 (db_standby)'`, with a per-transaction `SET LOCAL synchronous_commit TO on` for writes to these tables (PostgreSQL has no table-level synchronous commit setting)
+- Asynchronous replication for `orbits`, `tle_sets` (RPO ≤ 5 min): default async
+- Patroni auto-failover: standby promoted within ~30s of primary failure, well within the 15-minute RTO
+
+**Required Patroni configuration parameters** (must be present in `patroni.yml`; CI validation via `scripts/check_patroni_config.py`):
+
+```yaml
+bootstrap:
+  dcs:
+    maximum_lag_on_failover: 1048576  # 1 MB; standby > 1 MB behind primary is excluded from failover election
+    synchronous_mode: true            # Enable synchronous replication mode
+    synchronous_mode_strict: true     # Primary refuses writes if no synchronous standby confirmed; prevents split-brain
+
+postgresql:
+  parameters:
+    wal_level: replica                # Required for streaming replication; 'minimal' breaks replication
+    recovery_target_timeline: latest  # Follow timeline switches after failover; required for correct standby behaviour
+```
+
+**Rationale:**
+- `maximum_lag_on_failover`: without this, a severely lagged standby could be promoted as primary and serve stale data for safety-critical tables.
+- `synchronous_mode_strict: true`: trades availability for consistency — primary halts rather than allowing an unconfirmed write to proceed without a standby.
Acceptable given 15-minute RTO SLO.
+- `wal_level: replica`: `minimal` disables the WAL detail needed for streaming replication; must be explicitly set.
+- `recovery_target_timeline: latest`: without this, a promoted standby after failover may not follow future timeline switches, causing divergence.
+
+#### Redis — Sentinel (3 Nodes)
+
+```yaml
+redis-master:
+  image: redis:7-alpine
+  command: redis-server /etc/redis/redis.conf
+redis-replica:
+  image: redis:7-alpine
+  command: redis-server /etc/redis/redis.conf --replicaof redis-master 6379
+redis-sentinel-1:
+  image: redis:7-alpine
+  command: redis-sentinel /etc/redis/sentinel.conf
+redis-sentinel-2:
+  image: redis:7-alpine
+  command: redis-sentinel /etc/redis/sentinel.conf
+redis-sentinel-3:
+  image: redis:7-alpine
+  command: redis-sentinel /etc/redis/sentinel.conf
+```
+
+Three Sentinel instances form a quorum; the replica is the promotion target. If the master fails, Sentinel promotes the replica within ~10s. The backend and workers use `redis-py`'s `Sentinel` client which transparently follows the master after failover.
+
+**Redis Sentinel split-brain risk assessment (F3 — §67):** In a network partition where Sentinel nodes disagree on master reachability, Sentinel could promote the replica while the old master still accepts writes from clients on its side of the partition. The `min-replicas-to-write 1` directive on the master (a `redis.conf` server setting, not a Sentinel option) mitigates this: the old master stops accepting writes when it loses contact with its replicas, forcing clients to the new master.
+
+SpaceCom's Redis data is largely ephemeral — Celery broker messages, WebSocket session state, application cache. A split-brain that loses a small number of Celery tasks or cache entries is survivable. The one persistent concern is the per-org email rate limit counter (`spacecom:email_rate:{org_id}:{hour}`, §65 F7): a split-brain could result in two independent counters, both allowing up to 50 emails, for a brief period before the split resolves. This is accepted: the 50/hr limit is a cost control, not a safety guarantee. Email volume during a short Sentinel split-brain is not a safety risk.
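The counter semantics behind that acceptance can be sketched with an in-memory stand-in for Redis `INCR` (the key format comes from the text above; the function name is illustrative):

```python
def allow_email(counters: dict, org_id: str, hour: str, limit: int = 50) -> bool:
    """Mirror of INCR-then-compare on spacecom:email_rate:{org_id}:{hour}.
    During a split-brain, two masters mean two independent copies of
    `counters`, hence the accepted worst case of up to 2x the limit."""
    key = f"spacecom:email_rate:{org_id}:{hour}"
    counters[key] = counters.get(key, 0) + 1   # INCR is atomic on a single Redis node
    return counters[key] <= limit
```

Because the key embeds the hour, counters self-expire logically even if a split leaves stale entries behind; a real implementation would also set a TTL on the key.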
+
+**Risk acceptance and configuration:** Set the Sentinel and master configuration values:
+```
+# sentinel.conf
+sentinel down-after-milliseconds spacecom-redis 5000
+sentinel failover-timeout spacecom-redis 60000
+sentinel parallel-syncs spacecom-redis 1
+
+# redis.conf on the master (server directives, not Sentinel ones)
+min-replicas-to-write 1
+min-replicas-max-lag 10
+```
+ADR: `docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md`
+
+#### Cross-Region Disaster Recovery — Warm Standby (F7)
+
+Single-region deployment cannot meet the RTO ≤ 60 minutes target against a full cloud region failure. A warm standby in a second region provides the required recovery path.
+
+**Strategy:** Warm standby (not hot active-active) — reduces cost and complexity while meeting RTO.
+
+| Component | Primary region | DR region | Failover mechanism |
+|-----------|--------------|-----------|-------------------|
+| TimescaleDB | Primary + hot standby | Read replica (streaming replication from primary) | Promote replica; update DNS; `make db-failover-dr` runbook |
+| Application tier | Running | Stopped; container images pre-pulled from GHCR | Deploy from images on failover; < 10 minutes |
+| MinIO (object storage) | Active | Active (bucket replication enabled) | Already in sync; no failover needed |
+| Redis | Active | Cold (config ready) | Restart on failover; session loss acceptable (operators re-authenticate) |
+| DNS | Primary A record | Secondary A record in Route 53 (or equiv.) | Health-check-based routing; TTL 60s; auto-failover on primary health check failure |
+
+**Failover time estimate:** DB promotion 2–5 minutes + DNS propagation 1 minute + app deploy 10 minutes = **< 15 minutes** (within the RTO for an active TIP event).
+
+**Runbook:** `docs/runbooks/region-failover.md` — tested annually as game day scenario 6. Post-failover checklist: verify HMAC validation on restored primary; verify WAL integrity; notify ANSPs of region switch; schedule return to primary region within 48 hours.
+
+---
+
+### 26.4 Celery Reliability
+
+#### Task Acknowledgement and Crash Safety
+
+```python
+# celeryconfig.py
+task_acks_late = True              # Task not acknowledged until complete; if worker dies mid-task, task is requeued
+task_reject_on_worker_lost = True  # Orphaned tasks requeued, not dropped
+task_serializer = 'json'
+result_expires = 86400             # Results expire after 24h; database is the durable store
+worker_prefetch_multiplier = 1     # F6 §58: long MC tasks (up to 240s) — prefetch=1 prevents worker A
+                                   # holding 4 tasks while workers B/C/D are idle; fair distribution
+```
+
+#### Dead Letter Queue
+
+Failed tasks (exception, timeout, or permanent error) must be captured, not silently dropped:
+
+```python
+# In Celery task base class
+import json
+from celery import Task
+
+class SpaceComTask(Task):
+    def on_failure(self, exc, task_id, args, kwargs, einfo):
+        # Update simulations table to status='failed'
+        update_simulation_status(task_id, 'failed', error_detail=str(exc))
+        # Route to dead letter queue for inspection (dead_letter_queue is a Redis client)
+        dead_letter_queue.rpush('dlq:failed_tasks', json.dumps({
+            'task_id': task_id, 'task_name': self.name,
+            'error': str(exc), 'failed_at': utcnow().isoformat()
+        }))
+```
+
+#### Queue Routing (Ingest vs Simulation Isolation)
+
+```python
+CELERY_TASK_ROUTES = {
+    'modules.ingest.*': {'queue': 'ingest'},
+    'modules.propagator.*': {'queue': 'simulation'},
+    'modules.breakup.*': {'queue': 'simulation'},
+    'modules.conjunction.*': {'queue': 'simulation'},
+    'modules.reentry.controlled.*': {'queue': 'simulation'},
+}
+```
+
+Two separate worker processes — never competing on the same queue:
+```bash
+# Ingest worker: always running, low concurrency (-Q/--queues selects queues in Celery 5)
+celery -A app worker -Q ingest --concurrency=2 --hostname=ingest@%h
+
+# Simulation worker: high concurrency for MC sub-tasks (see §27.2)
+celery -A app worker -Q simulation --concurrency=16 --pool=prefork --hostname=sim@%h
+```
+
+**Per-organisation priority isolation (F8):** All organisations share the `simulation` queue, but job priority is set at
submission time based on subscription tier and event criticality. This prevents a `shadow_trial` org's bulk simulation from starving a `CRITICAL` alert computation for an `ansp_operational` org.
+
+```python
+TIER_TASK_PRIORITY = {
+    "internal": 9,
+    "institutional": 8,
+    "ansp_operational": 7,
+    "space_operator": 5,
+    "shadow_trial": 3,
+}
+CRITICAL_EVENT_PRIORITY_BOOST = 2  # added when active TIP event exists for the org's objects
+
+def get_task_priority(org_tier: str, has_active_tip: bool) -> int:
+    base = TIER_TASK_PRIORITY.get(org_tier, 3)
+    return min(9, base + (CRITICAL_EVENT_PRIORITY_BOOST if has_active_tip else 0))  # valid range is 0-9
+
+# At submission: inverted for the Redis transport, where LOWER numbers are consumed first
+task.apply_async(priority=9 - get_task_priority(org.subscription_tier, active_tip))
+```
+
+Celery implements priorities (0–9) on Redis as separate per-priority lists consumed in ascending order, so, unlike AMQP, a lower number means higher priority; hence the `9 -` inversion at submission. The broker runs with `maxmemory-policy noeviction`, so queued tasks are never evicted under memory pressure. Ingest tasks always route to the separate `ingest` queue and are unaffected by simulation priority.
+
+#### Celery Beat — High Availability with `celery-redbeat`
+
+Standard Celery Beat is a single-process SPOF. `celery-redbeat` stores the schedule in Redis with distributed locking — multiple Beat instances can run; only one holds the lock at a time:
+
+```python
+CELERY_BEAT_SCHEDULER = 'redbeat.RedBeatScheduler'
+REDBEAT_REDIS_URL = settings.redis_url
+REDBEAT_LOCK_TIMEOUT = 60       # 60s; crashed leader blocks scheduling for at most 60s
+REDBEAT_MAX_SLEEP_INTERVAL = 5  # standby instances check for lock every 5s after TTL expiry
+```
+
+The default `REDBEAT_LOCK_TIMEOUT = max_interval × 5` (typically 25 minutes) is too long during active TIP events — a crashed Beat leader would prevent TIP polling for up to 25 minutes. At 60 seconds, a failover causes at most a 60-second scheduling gap. The standby Beat instance acquires the lock within 5 seconds of TTL expiry (`REDBEAT_MAX_SLEEP_INTERVAL = 5`).
+
+During an active TIP window (`spacecom_active_tip_events > 0`), the AlertManager rule for TIP ingest failure uses a 10-minute threshold rather than the baseline 4-hour threshold — ensuring a Beat failover gap does not silently miss critical TIP updates.
+
+---
+
+### 26.5 Health Checks
+
+Every service exposes two endpoints. Docker Compose `depends_on: condition: service_healthy` uses these — the backend does not start until the database is healthy.
+
+**Liveness probe** (`GET /healthz`) — process is alive; returns 200 unconditionally if the process can respond. Does not check dependencies.
+
+**Readiness probe** (`GET /readyz`) — process is ready to serve traffic:
+
+```python
+from fastapi.responses import JSONResponse
+from sqlalchemy import text
+
+@app.get("/readyz")
+async def readiness(db: AsyncSession = Depends(get_db)):
+    checks = {}
+
+    # Database connectivity
+    try:
+        await db.execute(text("SELECT 1"))
+        checks["database"] = "ok"
+    except Exception as e:
+        checks["database"] = f"error: {e}"
+
+    # Redis connectivity
+    try:
+        await redis_client.ping()
+        checks["redis"] = "ok"
+    except Exception:
+        checks["redis"] = "error"
+
+    # Data freshness
+    tle_age = await get_oldest_active_tle_age_hours()
+    sw_age = await get_space_weather_age_hours()
+    eop_age = await get_eop_age_days()
+    airac_age = await get_airspace_airac_age_days()
+    checks["tle_age_hours"] = tle_age
+    checks["space_weather_age_hours"] = sw_age
+    checks["eop_age_days"] = eop_age
+    checks["airac_age_days"] = airac_age
+
+    degraded = []
+    if checks["database"] != "ok" or checks["redis"] != "ok":
+        return JSONResponse(status_code=503, content={"status": "unavailable", "checks": checks})
+    if tle_age > 6:
+        degraded.append("tle_stale")
+    if sw_age > 4:
+        degraded.append("space_weather_stale")
+    if eop_age > 7:
+        degraded.append("eop_stale")  # IERS-A older than 7 days; frame transform accuracy degraded
+    if airac_age > 28:
+        degraded.append("airspace_stale")  # AIRAC cycle missed
+
+    status_code = 207 if degraded else 200
+    return JSONResponse(status_code=status_code,
content={
+            "status": "degraded" if degraded else "ok",
+            "degraded": degraded, "checks": checks
+        })
+```
+
+The `207` response (HTTP Multi-Status, repurposed here as a degraded-but-serving signal) triggers the staleness banner in the UI (§24.8) without taking the service offline. The load balancer treats 207 as healthy (traffic continues); the operational banner warns users.
+
+**Renderer service health check** — the `renderer` container runs Playwright/Chromium. If Chromium hangs (a known Playwright failure mode), the container process stays alive and appears healthy while all report generation jobs silently time out. The renderer `GET /healthz` must verify Chromium can respond, not just that the Python process is alive:
+
+```python
+# renderer/app/health.py
+import asyncio
+from playwright.async_api import async_playwright
+from fastapi.responses import JSONResponse
+
+async def health_check():
+    """Health probe: verify Chromium can launch and load a blank page within 5s."""
+    try:
+        async with async_playwright() as p:
+            browser = await asyncio.wait_for(p.chromium.launch(), timeout=5.0)
+            page = await browser.new_page()
+            await asyncio.wait_for(page.goto("about:blank"), timeout=3.0)
+            await browser.close()
+            return {"status": "ok", "chromium": "responsive"}
+    except Exception:  # timeout or launch failure; either way Chromium is unusable
+        renderer_chromium_restarts.inc()  # counts failed probes; Docker restarts after 3 in a row
+        return JSONResponse({"status": "chromium_unresponsive"}, status_code=503)
+```
+
+Docker Compose healthcheck for renderer:
+```yaml
+renderer:
+  healthcheck:
+    test: ["CMD", "curl", "-f", "http://localhost:8001/healthz"]
+    interval: 30s
+    timeout: 10s
+    retries: 3
+    start_period: 15s
+```
+
+If the healthcheck fails 3 times consecutively, Docker restarts the renderer container. The `renderer_chromium_restarts_total` counter increments on each failed Chromium probe (three of which precede a restart) and triggers the `RendererChromiumUnresponsive` alert.
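An automated consumer of `/readyz` (an API polling client, or the Phase 3 SWIM bridge) should branch on the status-code contract rather than on body text. A minimal interpretation helper, assuming the JSON shape returned by the endpoint above (the function name is illustrative):

```python
def interpret_readyz(status_code: int, body: dict) -> tuple[bool, list[str]]:
    """Map a /readyz response to (route_traffic, degradation_reasons).

    200: healthy; 207: degraded but still serving (keep routing, surface the
    reasons to the operator); 503 or anything unexpected: stop routing traffic.
    """
    if status_code == 200:
        return True, []
    if status_code == 207:
        return True, list(body.get("degraded", []))
    return False, ["unavailable"]
```

A client that sees `"tle_stale"` in the reasons should annotate its downstream products rather than fail outright, mirroring the UI banner behaviour.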
+ +**Degraded state in `GET /readyz` for API clients and SWIM (Finding 7):** The `degraded` array in the response is the machine-readable signal for any automated integration (Phase 3 SWIM, API polling clients). API clients must not scrape the UI to determine system state — the health endpoint is the authoritative source. Response fields: + +| Field | Type | Meaning | +|---|---|---| +| `status` | `"ok"` \| `"degraded"` \| `"unavailable"` | Overall system state | +| `degraded` | `string[]` | Active degradation reasons: `"tle_stale"`, `"space_weather_stale"`, `"ingest_source_failure"`, `"prediction_service_overloaded"` | +| `degraded_since` | `ISO8601 \| null` | Timestamp of when current degraded state began (from `degraded_mode_events`) | +| `checks` | `object` | Per-subsystem check results | + +Every transition into or out of degraded state is written to `degraded_mode_events` (see §9.2). NOTAM drafts generated while `status = "degraded"` have `generated_during_degraded = TRUE` and the draft `(E)` field includes: `NOTE: GENERATED DURING DEGRADED DATA STATE - VERIFY INDEPENDENTLY BEFORE ISSUANCE`. + +**Docker Compose health check definitions:** +```yaml +backend: + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"] + interval: 10s + timeout: 5s + retries: 3 + start_period: 30s + +db: + healthcheck: + # pg_isready alone passes before the spacecom database and TimescaleDB extension are loaded. + # This check verifies that the application database is accessible and TimescaleDB is active + # before any dependent service (pgbouncer, backend) is marked healthy. 
+      test: ["CMD-SHELL", "psql -U spacecom_app -d spacecom -c 'SELECT 1 FROM timescaledb_information.hypertables LIMIT 1'"]
+      interval: 5s
+      timeout: 3s
+      retries: 10
+      start_period: 30s  # TimescaleDB extension load and initial setup can take up to 20s
+
+pgbouncer:
+  depends_on:
+    db:
+      condition: service_healthy
+  healthcheck:
+    test: ["CMD-SHELL", "psql -h localhost -p 6432 -U spacecom_app -d spacecom -c 'SELECT 1'"]  # PgBouncer listens on 6432, not the backing Postgres port
+    interval: 5s
+    timeout: 3s
+    retries: 5
+```
+
+---
+
+### 26.6 Backup and Restore
+
+#### Continuous WAL Archiving (RPO = 0 for critical tables)
+
+```bash
+# postgresql.conf
+wal_level = replica
+archive_mode = on
+archive_command = 'mc cp %p minio/wal-archive/$(hostname)/%f'  # MinIO via mc client
+archive_timeout = 60  # Force WAL segment every 60s even if no writes
+```
+
+#### Daily Base Backup
+
+`pg_basebackup` is a PostgreSQL client tool that is not present in the Python runtime worker image. The backup must run in a dedicated sidecar container that has PostgreSQL client tools installed, invoked by the Celery Beat task via `docker compose run`:
+
+```yaml
+# docker-compose.yml — backup sidecar (no persistent service; run on demand)
+services:
+  db-backup:
+    image: timescale/timescaledb-ha:pg17  # same image as db; includes pg_basebackup
+    entrypoint: []
+    command: >
+      sh -c "pg_basebackup -h db -U postgres -D /backup
+             --format=tar --compress=9 --wal-method=stream &&
+             mc cp /backup/*.tar.gz minio/db-backups/base-$(date +%F)/"
+    networks: [db_net]
+    volumes:
+      - backup_scratch:/backup
+    profiles: [backup]  # not started by default; invoked explicitly
+    environment:
+      PGPASSWORD: ${POSTGRES_PASSWORD}
+      MC_HOST_minio: http://${MINIO_ACCESS_KEY}:${MINIO_SECRET_KEY}@minio:9000
+
+volumes:
+  backup_scratch:
+    driver: local
+    driver_opts:
+      type: tmpfs
+      device: tmpfs
+      o: size=20g  # large enough for compressed base backup
+```
+
+The Celery Beat task triggers the sidecar via the Docker socket (backend container must have `/var/run/docker.sock` mounted in
development — **not in production**). In production (Tier 2+), use a dedicated cron job on the host:
+
+```bash
+# /etc/cron.d/spacecom-backup — runs outside Docker, uses Docker CLI
+0 2 * * * root docker compose -f /opt/spacecom/docker-compose.yml \
+    --profile backup run --rm db-backup >> /var/log/spacecom-backup.log 2>&1
+```
+
+The Celery Beat task in production polls MinIO for today's backup objects to verify completion, and fires an alert if they are absent by 03:00 UTC:
+
+```python
+# Celery Beat: daily at 03:00 UTC (verification, not execution)
+@celery.task
+def verify_daily_backup():
+    """Verify today's base backup exists in MinIO; alert if absent."""
+    prefix = f"base-{utcnow().date()}/"  # sidecar uploads under db-backups/base-YYYY-MM-DD/
+    objects = list(minio_client.list_objects("db-backups", prefix=prefix))
+    if objects:
+        structlog.get_logger().info("backup_verified", prefix=prefix)
+    else:
+        structlog.get_logger().error("backup_missing", prefix=prefix)
+        alert_admin(f"Daily base backup missing: {prefix}")
+        raise RuntimeError(f"backup missing: {prefix}")  # marks task as FAILED in Celery result backend
+```
+
+#### Monthly Restore Test
+
+```python
+# Celery Beat: first Sunday of each month at 03:00 UTC
+@celery.task
+def monthly_restore_test():
+    """Restore latest backup to ephemeral container; run test suite; alert on failure."""
+    # 1. Spin up a test TimescaleDB container from latest base backup + WAL
+    # 2. Run db/test_restore.py: verify row counts, hypertable integrity, HMAC spot-checks
+    # 3. Tear down container
+    # 4. Log result to security_logs; alert admin if test fails
+```
+
+If the monthly restore test fails, the failure is treated as SEV-2. The incident is not resolved until a successful restore is verified.
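The `db/test_restore.py` checks referenced in step 2 can be data-driven queries against the ephemeral restore target. A sketch using table names from this plan; the minimum-row thresholds are illustrative, and the DB-API `cursor` is injected so the logic stays testable:

```python
RESTORE_CHECKS = [
    # (label, SQL, minimum expected value)
    ("hypertables_present",
     "SELECT count(*) FROM timescaledb_information.hypertables", 1),
    ("reentry_predictions_rows",
     "SELECT count(*) FROM reentry_predictions", 1),
    ("alert_events_rows",
     "SELECT count(*) FROM alert_events", 0),  # a quiet system may legitimately have none
]

def run_restore_checks(cursor) -> list[str]:
    """Run post-restore verification queries; return the labels of failed checks."""
    failures = []
    for label, sql, minimum in RESTORE_CHECKS:
        cursor.execute(sql)
        (value,) = cursor.fetchone()
        if value < minimum:
            failures.append(label)
    return failures
```

An empty return means the restore passed; any label in the list escalates the SEV-2 described above.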
+
+**WAL retention:** 30 days of WAL segments retained in MinIO; base backups retained for 90 days; `reentry_predictions`, `alert_events`, `notam_drafts`, `security_logs` additionally archived to cold storage for 7 years (MinIO lifecycle policy, separate bucket with Object Lock COMPLIANCE mode — prevents deletion even by bucket owner).
+
+**Application log retention policy (F10 — §57):**
+
+| Log tier | Storage | Retention | Rationale |
+|----------|---------|-----------|-----------|
+| Container stdout (json-file) | Docker log driver on host | 7 days (`max-size=100m, max-file=5`) | Short-lived; Promtail ships to Loki in Tier 2+ |
+| Loki (structured application logs) | Grafana Loki | **90 days** | Covers 30-day incident investigation SLA with headroom |
+| Safety-relevant log lines (`level=CRITICAL`, `security_logs` events, alert-related log lines) | MinIO append-only bucket | **7 years** (same as database safety records) | Regulatory parity with `alert_events` 7-year hold; NIS2 Art. 23 evidence requirement |
+| SIEM-forwarded events | External SIEM (customer-specified) | Per customer contract | ANSP customers may have their own retention obligations |
+
+Loki retention is set in `monitoring/loki-config.yml`:
+```yaml
+limits_config:
+  retention_period: 2160h  # 90 days
+compactor:
+  retention_enabled: true
+```
+
+Safety-relevant log shipping: a Promtail pipeline `labels` stage attaches `safety_critical="true"` when `level=CRITICAL` or `logger` contains `alert` or `security`. A separate Loki ruler rule ships these to MinIO via a Loki-to-S3 connector (Phase 2). Phase 1 interim: a Celery Beat task exports CRITICAL log lines from Loki to MinIO daily.
+
+**Restore time target:** Full restore to latest WAL segment in < 30 minutes (tested monthly). This satisfies the RTO ≤ 60 minutes (no active event) with 30 minutes headroom for DNS propagation and smoke tests. Documented step-by-step in `docs/runbooks/db-restore.md` (Phase 2 deliverable).
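The Phase 1 interim exporter can lean on Loki's HTTP `query_range` API. A sketch of the query construction; the LogQL selector, bucket layout, and function name are assumptions, not settled config:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

SAFETY_LOGQL = '{app="spacecom"} |= "CRITICAL"'  # illustrative selector

def loki_export_url(base_url: str, day: datetime) -> str:
    """Build the Loki query_range URL covering one UTC day of safety-relevant lines."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    params = {
        "query": SAFETY_LOGQL,
        "start": int(start.timestamp() * 1e9),  # Loki expects nanosecond timestamps
        "end": int(end.timestamp() * 1e9),
        "limit": 5000,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"

# Beat task body (sketch): fetch the day's lines and archive to the append-only bucket
#   resp = requests.get(loki_export_url("http://loki:3100", utcnow())).json()
#   minio_client.put_object("safety-logs", f"critical/{day:%Y-%m-%d}.json", ...)
```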
+ +#### Retention Schedule + +```sql +-- Online retention (TimescaleDB compression + drop policies) +SELECT add_compression_policy('orbits', INTERVAL '7 days'); +SELECT add_retention_policy('orbits', INTERVAL '90 days'); -- Archive before drop; see below +SELECT add_retention_policy('space_weather', INTERVAL '2 years'); +SELECT add_retention_policy('tle_sets', INTERVAL '1 year'); + +-- Archival pipeline: Celery task runs before each chunk drop +-- Exports chunk to Parquet in MinIO cold storage before TimescaleDB drops it +-- Legal hold: reentry_predictions, alert_events, notam_drafts, shadow_validations → 7 years +-- No retention policy on these tables; MinIO lifecycle rule retains for 7 years +``` + +--- + +### 26.7 Prometheus Metrics + +Metrics must be instrumented from Phase 1 — not added at Phase 3 as an afterthought. Business-level metrics are more important than infrastructure metrics for this domain. + +**Metric naming convention (F1 — §57):** + +All custom metrics must follow `{namespace}_{subsystem}_{name}_{unit}` with these rules: + +| Rule | Example compliant | Example non-compliant | +|------|------------------|-----------------------| +| Namespace is always `spacecom_` | `spacecom_ingest_success_total` | `ingest_success` | +| Unit suffix required (Prometheus base units) | `spacecom_simulation_duration_seconds` | `spacecom_simulation_duration` | +| Counters end in `_total` | `spacecom_hmac_verification_failures_total` | `spacecom_hmac_failures` | +| Gauges end in `_seconds`, `_bytes`, `_ratio`, or domain unit | `spacecom_celery_queue_depth` | `spacecom_queue` | +| Histograms end in `_seconds` or `_bytes` | `spacecom_alert_delivery_latency_seconds` | `spacecom_alert_latency` | +| Labels use `snake_case` | `queue_name`, `source` | `queueName`, `Source` | +| **High-cardinality fields are NEVER labels** | — | `norad_id`, `organisation_id`, `user_id`, `request_id` as Prometheus labels | +| Per-object drill-down uses recording rules | 
`spacecom:tle_age_hours:max` recording rule | `spacecom_tle_age_hours{norad_id="25544"}` alerted directly |
+
+High-cardinality identifiers belong in log fields (structlog) or Prometheus exemplars — not in metric labels. A metric with an unbounded label creates one time series per unique value and will OOM Prometheus at scale.
+
+**Business-level metrics (custom — most critical):**
+
+```python
+# Phase 1 — instrument from day 1
+from prometheus_client import Counter, Gauge, Histogram
+
+active_tip_events = Gauge('spacecom_active_tip_events', 'Objects with active TIP messages')
+prediction_age = Gauge('spacecom_prediction_age_seconds', 'Age of latest prediction per object',
+                       ['norad_id'])  # per-object label: Grafana drill-down only; alert via recording rule
+tle_age = Gauge('spacecom_tle_age_hours', 'TLE data age per object', ['norad_id'])
+tip_age = Gauge('spacecom_tip_age_hours', 'Age of the newest ingested TIP message')  # feeds TipIngestStale
+ingest_success = Counter('spacecom_ingest_success_total', 'Successful ingest runs', ['source'])
+ingest_failure = Counter('spacecom_ingest_failure_total', 'Failed ingest runs', ['source'])
+hmac_failures = Counter('spacecom_hmac_verification_failures_total', 'HMAC check failures')
+simulation_duration = Histogram('spacecom_simulation_duration_seconds', 'MC run duration', ['module'],
+                                buckets=[30, 60, 90, 120, 180, 240, 300, 600])
+alert_delivery_lat = Histogram('spacecom_alert_delivery_latency_seconds', 'Alert trigger → WS receipt',
+                               buckets=[1, 2, 5, 10, 15, 20, 30, 60])
+ws_connected = Gauge('spacecom_ws_connected_clients', 'Active WebSocket connections', ['instance'])
+celery_queue_depth = Gauge('spacecom_celery_queue_depth', 'Tasks waiting in queue', ['queue'])
+dlq_depth = Gauge('spacecom_dlq_depth', 'Tasks in dead letter queue')
+
+# Renderer metrics are exposed by the separate renderer service, which keeps its own
+# `renderer_` namespace (deliberate exception to the `spacecom_` rule for this sidecar)
+renderer_active_jobs = Gauge('renderer_active_jobs', 'Reports being generated')
+renderer_job_dur = Histogram('renderer_job_duration_seconds', 'Report generation time',
+                             buckets=[2, 5, 10, 15, 20, 25, 30])
+renderer_chromium_restarts = Counter('renderer_chromium_restarts_total', 'Chromium process restarts')
+```
+
+**SLI recording rules** —
pre-aggregate before alerting; avoids per-object flooding (Finding 1, 7): + +```yaml +# monitoring/recording-rules.yml +groups: + - name: spacecom_sli + rules: + # SLI: API availability (non-5xx fraction) — feeds availability SLO + - record: spacecom:api_availability:ratio_rate5m + expr: > + sum(rate(http_requests_total{status!~"5.."}[5m])) + / sum(rate(http_requests_total[5m])) + + # SLI: max TLE age across all objects (single series; alertable without flooding) + - record: spacecom:tle_age_hours:max + expr: max(spacecom_tle_age_hours) + + # SLI: count of objects with stale TLEs (for dashboard) + - record: spacecom:tle_stale_objects:count + expr: count(spacecom_tle_age_hours > 6) or vector(0) + + # SLI: max prediction age across active TIP objects + - record: spacecom:prediction_age_seconds:max + expr: max(spacecom_prediction_age_seconds) + + # SLI: alert delivery latency p99 + - record: spacecom:alert_delivery_latency:p99_rate5m + expr: histogram_quantile(0.99, rate(spacecom_alert_delivery_latency_seconds_bucket[5m])) + + # Error budget burn rate — multi-window (F2 — §57) + - record: spacecom:error_budget_burn:rate1h + expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[1h]) + + - record: spacecom:error_budget_burn:rate6h + expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[6h]) + + # Fast-burn window (5 min) — catches sudden outages + - record: spacecom:error_budget_burn:rate5m + expr: 1 - spacecom:api_availability:ratio_rate5m +``` + +**Alerting rules (Prometheus AlertManager):** + +```yaml +# monitoring/alertmanager/spacecom-rules.yml +groups: + - name: spacecom_critical + rules: + - alert: HmacVerificationFailure + expr: increase(spacecom_hmac_verification_failures_total[5m]) > 0 + labels: + severity: critical + annotations: + summary: "HMAC verification failure detected — prediction integrity compromised" + runbook_url: "https://spacecom.internal/docs/runbooks/hmac-integrity-failure.md" + + - alert: TipIngestStale + expr: 
spacecom_tip_age_hours > 0.5  # dedicated TIP-age gauge; 0.5 h = 30 min
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "TIP data > 30 min old — active re-entry warning may be stale"
+          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"
+
+      - alert: ActiveTipNoPrediction
+        expr: spacecom_active_tip_events > 0 and spacecom:prediction_age_seconds:max > 3600
+        labels:
+          severity: critical
+        annotations:
+          summary: "Active TIP event but newest prediction is {{ $value | humanizeDuration }} old"
+          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"
+
+      # Fast burn: 1h + 5min windows (catches sudden outages quickly) — F2 §57
+      - alert: ErrorBudgetFastBurn
+        expr: >
+          spacecom:error_budget_burn:rate1h > (14.4 * 0.001)
+          and
+          spacecom:error_budget_burn:rate5m > (14.4 * 0.001)
+        for: 2m
+        labels:
+          severity: critical
+          burn_window: fast
+        annotations:
+          summary: "Error budget burning fast — 1h burn rate {{ $value | humanizePercentage }}"
+          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
+          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"
+
+      # Slow burn: 6h + 30min windows (catches gradual degradation before budget exhausts) — F2 §57
+      - alert: ErrorBudgetSlowBurn
+        expr: >
+          spacecom:error_budget_burn:rate6h > (6 * 0.001)
+          and
+          spacecom:error_budget_burn:rate1h > (6 * 0.001)
+        for: 15m
+        labels:
+          severity: warning
+          burn_window: slow
+        annotations:
+          summary: "Error budget burning slowly — 6h burn rate {{ $value | humanizePercentage }}"
+          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
+          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"
+
+  - name: spacecom_warning
+    rules:
+      - alert: TleStale
+        # Alert on recording rule aggregate — single alert, not 600 per-NORAD alerts
+        expr: spacecom:tle_stale_objects:count > 0
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "{{ $value }} objects have TLE age > 6h"
+          runbook_url:
"https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"
+
+      - alert: IngestConsecutiveFailures
+        # increase() yields the failure count over the window, directly comparable to the threshold of 3
+        expr: increase(spacecom_ingest_failure_total[15m]) >= 3
+        labels:
+          severity: warning
+        annotations:
+          summary: "Ingest source {{ $labels.source }} failed ≥ 3 times in 15 min"
+          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"
+
+      - alert: CelerySimulationQueueDeep
+        expr: spacecom_celery_queue_depth{queue="simulation"} > 20
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Simulation queue depth {{ $value }} — workers may be overwhelmed"
+          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
+
+      - alert: DLQGrowing
+        # delta(), not increase(): dlq_depth is a gauge, and increase() is defined for counters
+        expr: delta(spacecom_dlq_depth[10m]) > 0
+        labels:
+          severity: warning
+        annotations:
+          summary: "Dead letter queue growing — tasks exhausting retries"
+          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
+
+      - alert: WebSocketCeilingApproaching
+        expr: spacecom_ws_connected_clients > 400
+        labels:
+          severity: warning
+        annotations:
+          summary: "WS connections {{ $value }}/500 — scale backend before ceiling hit"
+          runbook_url: "https://spacecom.internal/docs/runbooks/capacity-limits.md"
+
+      # Queue depth growth rate alert — fires before threshold is breached (F8 — §57)
+      - alert: CelerySimulationQueueGrowing
+        # deriv(), not rate(): queue depth is a gauge; rate() assumes counter-reset semantics
+        expr: deriv(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Simulation queue growing at {{ $value | humanize }} tasks/sec — workers not keeping up"
+          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
+
+      - alert: RendererChromiumUnresponsive
+        expr: increase(renderer_chromium_restarts_total[5m]) > 0
+        labels:
+          severity: warning
+        annotations:
+          summary: "Renderer Chromium restarted — report generation may be delayed"
+          runbook_url:
"https://spacecom.internal/docs/runbooks/renderer-recovery.md" +``` + +**Alert authoring rule (F11 — §57):** Every AlertManager alert rule MUST include `annotations.runbook_url` pointing to an existing file in `docs/runbooks/`. CI lint step (`make lint-alerts`) validates this using `promtool check rules` plus a custom Python script that asserts every rule has a non-empty `runbook_url` annotation that resolves to an existing markdown file. A PR that adds an alert without a runbook fails CI. + +**Alert coverage audit (F5 — §57):** The following table maps every SLO and safety invariant to its alert rule. Gaps must be closed before Phase 2. + +| SLO / Safety invariant | Alert rule | Severity | Gap? | +|------------------------|-----------|----------|------| +| API availability 99.9% | `ErrorBudgetFastBurn`, `ErrorBudgetSlowBurn` | CRITICAL / WARNING | Covered | +| TLE age < 6h | `TleStale` | WARNING | Covered | +| TIP ingest freshness < 30 min | `TipIngestStale` | CRITICAL | Covered | +| Active TIP + prediction age > 1h | `ActiveTipNoPrediction` | CRITICAL | Covered | +| HMAC verification integrity | `HmacVerificationFailure` | CRITICAL | Covered | +| Ingest consecutive failures | `IngestConsecutiveFailures` | WARNING | Covered | +| Celery queue depth threshold | `CelerySimulationQueueDeep` | WARNING | Covered | +| Celery queue depth growth rate | `CelerySimulationQueueGrowing` | WARNING | Covered | +| DLQ depth > 0 | `DLQGrowing` | WARNING | Covered | +| WS connection ceiling approach | `WebSocketCeilingApproaching` | WARNING | Covered | +| Renderer Chromium crash | `RendererChromiumUnresponsive` | WARNING | Covered | +| EOP mirror disagreement | `EopMirrorDisagreement` | CRITICAL | **Gap — add Phase 1** | +| DB replication lag > 30s | `DbReplicationLagHigh` | WARNING | **Gap — add Phase 2** | +| Backup job failure | `BackupJobFailed` | CRITICAL | **Gap — add Phase 1** | +| Security event anomaly | In `security-rules.yml` | CRITICAL | Covered | +| Alert HMAC 
integrity (nightly) | In `security-rules.yml` | CRITICAL | Covered |
+
+**Prometheus scrape configuration** (`monitoring/prometheus.yml`):
+
+```yaml
+scrape_configs:
+  - job_name: backend
+    static_configs:
+      - targets: ['backend:8000']
+    metrics_path: /metrics  # enabled by prometheus-fastapi-instrumentator
+
+  - job_name: renderer
+    static_configs:
+      - targets: ['renderer:8001']
+    metrics_path: /metrics
+
+  - job_name: celery
+    static_configs:
+      - targets: ['celery-exporter:9808']  # celery-exporter sidecar
+
+  - job_name: postgres
+    static_configs:
+      - targets: ['postgres-exporter:9187']  # postgres_exporter; PgBouncer stats need their own exporter
+
+  - job_name: redis
+    static_configs:
+      - targets: ['redis-exporter:9121']  # redis_exporter
+```
+
+Add to `docker-compose.yml` (Phase 2 service topology): `postgres-exporter`, `redis-exporter`, `celery-exporter` sidecar, `loki`, `promtail`, `tempo` (all on `monitor_net`). Add to `requirements.in`: `prometheus-fastapi-instrumentator`, `structlog`, `opentelemetry-sdk`, `opentelemetry-instrumentation-fastapi`, `opentelemetry-instrumentation-sqlalchemy`, `opentelemetry-instrumentation-celery`.
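The custom half of the `make lint-alerts` CI step (every alert carries a `runbook_url` annotation resolving to a real file) reduces to a short walk over the parsed rule files. A sketch, with the filesystem lookup injected so it stays testable; the function name and error strings are illustrative:

```python
from pathlib import PurePosixPath

RUNBOOK_PREFIX = "https://spacecom.internal/docs/runbooks/"

def lint_alert_rules(rule_doc: dict, runbook_exists) -> list[str]:
    """Return one error string per alert rule violating the runbook_url policy.

    rule_doc is a parsed Prometheus rule file (e.g. yaml.safe_load output);
    runbook_exists maps a runbook filename to True/False (a filesystem check in CI).
    """
    errors = []
    for group in rule_doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are exempt
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.startswith(RUNBOOK_PREFIX):
                errors.append(f"{rule['alert']}: missing or malformed runbook_url")
                continue
            name = PurePosixPath(url).name
            if not runbook_exists(name):
                errors.append(f"{rule['alert']}: runbook {name} does not exist")
    return errors
```

Any non-empty return fails the CI job, which is exactly the "PR that adds an alert without a runbook fails CI" behaviour mandated in F11.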
+
+**Distributed tracing — OpenTelemetry (Phase 2, ADR 0017):**
+
+```python
+# backend/app/main.py — instrument at startup
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
+from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
+from opentelemetry.instrumentation.celery import CeleryInstrumentor
+
+provider = TracerProvider()
+provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")))
+trace.set_tracer_provider(provider)
+
+FastAPIInstrumentor.instrument_app(app)
+SQLAlchemyInstrumentor().instrument(engine=engine)
+CeleryInstrumentor().instrument()
+```
+
+The `trace_id` from each span equals the `request_id` bound in `structlog.contextvars` (set by `RequestIDMiddleware`). This gives a single correlation key across Grafana Loki log search and Grafana Tempo trace view — one click from a log entry to its trace, and from a trace span to its log lines. Phase 1 fallback: set `OTEL_SDK_DISABLED=true`; the instrumentation becomes a no-op (no collector needed).
+
+**Celery trace propagation (F4 — §57):** `CeleryInstrumentor` automatically propagates W3C `traceparent` headers through the Celery task message body. The trace started at `POST /api/v1/decay/predict` continues unbroken through the queue wait and into the worker execution. To verify propagation is working:
+
+```python
+# tests/integration/test_tracing.py
+def test_celery_trace_propagation():
+    """Trace started in HTTP handler must appear in Celery worker span."""
+    response = client.post("/api/v1/decay/predict", ...)
+ task_id = response.json()["job_id"] + # Poll until complete, then assert trace_id matches request_id + span = get_span_by_task_id(task_id) + assert span.context.trace_id == uuid.UUID(response.headers["X-Request-ID"]).int +``` + +Additionally, `request_id` must be passed explicitly in Celery task kwargs as a belt-and-suspenders fallback for Phase 1 when OTel is disabled (`OTEL_SDK_DISABLED=true`). The worker binds it via `structlog.contextvars.bind_contextvars(request_id=kwargs["request_id"])`. This ensures log correlation works in Phase 1 without a running Tempo instance. + +**Chord sub-task and callback trace propagation (F11 — §67):** `CeleryInstrumentor` propagates `traceparent` through individual task messages. For the MC chord pattern (`group` → `chord` → callback), trace context propagation must flow: FastAPI handler → `run_mc_decay_prediction` → 500× `run_single_trajectory` sub-tasks → `aggregate_mc_results` callback. Each hop in the chord must carry the same `trace_id` to enable end-to-end p95 latency attribution. + +`CeleryInstrumentor` handles single task propagation automatically. For chord callbacks, verify that the parent `trace_id` appears in the `aggregate_mc_results` span — if the span is orphaned (different `trace_id`), set the trace context explicitly in the chord header: + +```python +from opentelemetry import propagate, context + +def run_mc_decay_prediction(object_id: int, params: dict) -> str: + carrier = {} + propagate.inject(carrier) # inject current trace context + params['_trace_context'] = carrier # pass through chord params + ... + +def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str: + ctx = propagate.extract(params.get('_trace_context', {})) + token = context.attach(ctx) # re-attach parent trace context in callback + try: + ... 
# callback body
    finally:
        context.detach(token)
```

This ensures the Tempo waterfall for an MC prediction shows one continuous trace from HTTP request through all 500 sub-tasks to DB write, enabling per-prediction p95 breakdown.

**Celery queue depth Beat task** (updates `celery_queue_depth` and `dlq_depth` every 30s):

```python
@app.task
def update_queue_depth_metrics():
    for queue_name in ['ingest', 'simulation', 'default']:
        # The Celery Redis broker stores each queue as a list keyed by the queue name
        depth = redis_client.llen(queue_name)
        celery_queue_depth.labels(queue=queue_name).set(depth)
    dlq_depth.set(redis_client.llen('dlq:failed_tasks'))
```

**Four Grafana dashboards** (updated from three):
1. **Operational Overview** — primary on-call dashboard (F7 — §57): an on-call engineer must be able to answer "is the system healthy?" within 15 seconds of opening this dashboard. Panel order and layout are therefore mandated:

   | Row | Panel | Metric | Alert threshold shown |
   |-----|-------|--------|-----------------------|
   | 1 (top) | Active TIP events (stat) | `spacecom_active_tip_events` | Red if > 0 |
   | 1 | System status (state timeline) | All alert rule states | Any CRITICAL = red bar |
   | 2 | Ingest freshness per source (gauge) | `spacecom_tle_age_hours` per source | Yellow > 2h, Red > 6h |
   | 2 | Prediction age — active objects (gauge) | `spacecom:prediction_age_seconds:max` | Red > 3600s |
   | 3 | Error budget burn rate (time series) | `spacecom:error_budget_burn:rate1h` | Reference line at 14.4× |
   | 3 | Alert delivery latency p99 (stat) | `spacecom:alert_delivery_latency:p99_rate5m` | Red > 30s |
   | 4 | Celery queue depth (time series) | `spacecom_celery_queue_depth` per queue | Reference line at 20 |
   | 4 | DLQ depth (stat) | `spacecom_dlq_depth` | Red if > 0 |

   Rows 1–2 must be visible without scrolling on a 1080p monitor. The dashboard UID is pinned in the AlertManager `dashboard_url` annotations.

2. 
**System Health**: DB replication lag, Redis memory, container CPU/RAM, error rates by endpoint, renderer job duration
3. **SLO Burn Rate**: error budget consumption rate from recording rules, fast/slow burn rates, availability by SLO, latency percentiles vs. targets, WS delivery latency p99
4. **Tracing** (Phase 2, Grafana Tempo): per-request traces for decay prediction and CZML catalog; p95 span breakdown by service

---

### 26.8 Incident Response

#### On-Call Rotation and Escalation

| Tier | Responder | Response SLA | Escalation trigger |
|---|---|---|---|
| **L1 On-call** | Rotating engineer (weekly rotation) | 5 min (SEV-1) / 15 min (SEV-2) | Auto-escalate to L2 if no acknowledgement after SLA |
| **L2 Escalation** | Tech lead / senior engineer | 10 min (SEV-1) | Auto-escalate to L3 after 10 min |
| **L3 Incident commander** | Engineering or product lead | SEV-1 only | Manual phone call; no auto-escalation |

AlertManager routing:
```yaml
# monitoring/alertmanager/routing.yml
route:
  receiver: slack-ops-channel
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-l1
      continue: true  # keep matching so the route below also posts to Slack
    - matchers: [severity=~"critical|warning"]
      receiver: slack-ops-channel
```

On-call guide: `docs/runbooks/on-call-guide.md` — required Phase 2 deliverable. Must cover: rotation schedule, handover checklist, escalation contact list, how to acknowledge PagerDuty alerts, Grafana dashboard URLs, and the "active TIP event protocol" (escalate all SEV-2+ to SEV-1 automatically when `spacecom_active_tip_events > 0`). 
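The "active TIP event protocol" reduces to a small pure function that the alert router can apply before paging. A minimal sketch — the function name and signature are illustrative, assuming severities are the SEV-1..SEV-4 strings defined in this section and that the caller has already read the `spacecom_active_tip_events` gauge:

```python
def effective_severity(declared: str, active_tip_events: int) -> str:
    """Apply the active TIP event protocol: a SEV-2 incident is escalated
    to SEV-1 while any TIP event is active (SEV-1 is already maximal)."""
    known = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}
    if declared not in known:
        raise ValueError(f"unknown severity: {declared!r}")
    if active_tip_events > 0 and declared == "SEV-2":
        return "SEV-1"
    return declared
```

Keeping the rule in one function makes the escalation behaviour unit-testable rather than buried in routing configuration.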
+ +**On-call rotation spec (F5):** +- 7-day rotation; minimum 2 engineers in the pool before going on-call +- L1 → L2 escalation if incident not contained within **30 minutes** of L1 acknowledgement +- L2 → L3 escalation triggers: ANSP data affected; confirmed security breach; total outage > 15 minutes; regulatory notification obligation triggered (NIS2 24h, GDPR 72h) +- **On-call handoff:** At rotation boundary, outgoing on-call documents system state in `docs/runbooks/on-call-handoff-log.md`: active incidents, degraded services, pending maintenance, known risks. Incoming on-call acknowledges in the same log. Mirrors the operator `/handover` concept (§28.5a) applied to engineering shifts. + +**ANSP communication commitments per severity (F6):** + +| Severity | ANSP notification timing | Channel | Update cadence | +|----------|------------------------|---------|---------------| +| SEV-1 (active TIP event) | Within 5 minutes of detection | Push + email | Every 15 minutes until resolved | +| SEV-1 (no active event) | Within 15 minutes | Email | Every 30 minutes until resolved | +| SEV-2 | Within 30 minutes if prediction data affected | Email | On resolution | +| SEV-3/4 | Status page update only | Status page | On resolution | + +Resolution notification always includes: what was affected, duration, root cause summary (1 sentence), and confirmation that prediction integrity was verified post-incident. 
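The F6 commitment table can be encoded directly so that the notification worker and the tests share one source of truth. A hedged sketch — type and function names are illustrative, not part of the codebase:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AnspCommitment:
    deadline_minutes: Optional[int]        # None = status page only, no hard deadline
    channel: str
    update_cadence_minutes: Optional[int]  # None = update on resolution only

def ansp_commitment(severity: str, active_tip_event: bool,
                    prediction_data_affected: bool = True) -> AnspCommitment:
    """Map a severity level and TIP state to the F6 ANSP communication commitment."""
    if severity == "SEV-1":
        if active_tip_event:
            return AnspCommitment(5, "push+email", 15)
        return AnspCommitment(15, "email", 30)
    if severity == "SEV-2" and prediction_data_affected:
        return AnspCommitment(30, "email", None)
    return AnspCommitment(None, "status-page", None)
```

Any change to the table above should be mirrored here (or better, generated from it) so the worker cannot drift from the documented commitment.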
#### Severity Levels

| Level | Definition | Response Time | Examples |
|-------|-----------|--------------|---------|
| **SEV-1** | System unavailable or prediction integrity compromised during active TIP event | 5 minutes | DB down with TIP window open; HMAC failure on active prediction |
| **SEV-2** | Core functionality broken; no active TIP event | 15 minutes | Workers down; ingest stopped > 2h; Redis down |
| **SEV-3** | Degraded functionality; operational but impaired | 60 minutes | TLE stale > 6h; space weather stale; slow CZML > 5s p95 |
| **SEV-4** | Minor; no operational impact | Next business day | UI cosmetic; log noise; non-critical test failure |

#### Runbook Standard Structure (F9)

Every runbook in `docs/runbooks/` must follow this template. Runbooks with inconsistent structure are a leading cause of missed steps and extended resolution times when they are followed under incident pressure.

```markdown
# Runbook: {Title}

**Owner:** {team or role}
**Last tested:** {YYYY-MM-DD} (game day or real incident)
**Severity scope:** SEV-1 | SEV-2 | SEV-3 (as applicable)

## Triggers


## Immediate actions (first 5 minutes)

1.
2.

## Diagnosis


## Resolution steps

1.
2.

## Verification


## Escalation


## Post-incident

```

All runbooks are reviewed and updated after each game day or real incident in which they were used. The `Last tested` field must not be older than 12 months — a CI check (`make runbook-audit`) warns if any runbook has not been updated within that window. 
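The `make runbook-audit` check can be a short script that parses the `**Last tested:**` field mandated by the template. A sketch, assuming the script lives at `scripts/runbook_audit.py` (illustrative path) and approximating "12 months" as 365 days:

```python
# scripts/runbook_audit.py — sketch of the `make runbook-audit` check
import re
from datetime import date, timedelta
from pathlib import Path

STALE_AFTER = timedelta(days=365)  # "12 months" approximated as 365 days
LAST_TESTED = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")

def stale_runbooks(runbook_dir: Path, today: date) -> list[str]:
    """Return one warning line per runbook whose 'Last tested' field is
    missing or older than the staleness window."""
    warnings = []
    for path in sorted(runbook_dir.glob("*.md")):
        match = LAST_TESTED.search(path.read_text(encoding="utf-8"))
        if match is None:
            warnings.append(f"{path.name}: missing 'Last tested' field")
        elif today - date.fromisoformat(match.group(1)) > STALE_AFTER:
            warnings.append(f"{path.name}: last tested {match.group(1)}")
    return warnings
```

Since the check is warn-only, the CI wrapper should print the warnings and still exit 0.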
+ +#### Required Runbooks (Phase 2 deliverable) + +Each runbook is a step-by-step operational procedure, not a general guide: + +| Runbook | Key Steps | +|---------|----------| +| **DB failover** | Confirm primary down → Patroni status → manual failover if Patroni stuck → verify standby promoting → update connection strings → verify HMAC validation working on new primary | +| **Celery worker recovery** | Check queue depth → inspect dead letter queue → restart worker containers → verify simulation jobs resuming → check ingest worker catching up | +| **HMAC integrity failure** | Identify affected prediction ID → quarantine record (`integrity_failed = TRUE`) → notify affected ANSP users → investigate modification source → escalate to security incident if tampering confirmed | +| **TIP ingest failure** | Check Space-Track API status → verify credentials not expired → check outbound network → manual TIP fetch if automated ingest blocked → notify operators of manual TIP status | +| **Ingest pipeline staleness** | Check Celery Beat health (redbeat lock status) → check worker queue → inspect ingest failure counter in Prometheus → trigger manual ingest job → notify operators of staleness | +| **GDPR personal data breach** | Contain breach (revoke credentials, isolate affected service) → assess scope (which data, how many data subjects, which jurisdictions) → notify legal counsel within 4 hours → if EU/UK data subjects affected: notify supervisory authority within 72 hours of discovery; notify affected data subjects "without undue delay" if high risk → log in `security_logs` with type `DATA_BREACH` → document remediation | +| **Safety occurrence notification** | If a SpaceCom integrity failure (HMAC fail, data source outage, incorrect prediction) is identified during a period when an ANSP was actively managing a re-entry event: notify affected ANSP within 2 hours → create `security_logs` record with type `SAFETY_OCCURRENCE` → notify legal counsel before any external 
communications → preserve all prediction records, alert_events, and ingest logs from the relevant period (do not rotate or archive). Full procedure: `docs/runbooks/safety-occurrence.md` — see §26.8a below. | +| **Prediction service outage during active re-entry event (F3)** | Detect via `spacecom_active_tip_events > 0` + prediction API health check fail → immediate ANSP push notification + email within 5 minutes ("SpaceCom prediction service is unavailable. Activate your fallback procedure: consult Space-Track TIP messages directly and ESOC re-entry page.") → designate incident commander → communication cadence every 15 minutes until resolved → service restoration checklist: restore prediction API → verify HMAC integrity on latest predictions → notify ANSPs of restoration with prediction freshness timestamp → trigger PIR. Full procedure: `docs/runbooks/prediction-service-outage-during-active-event.md` | + +#### §26.8a Safety Occurrence Reporting Procedure (F4 — §61) + +A safety occurrence is any event or condition in which a SpaceCom error may have contributed to, or could have contributed to, a reduction in aviation safety. This is distinct from an operational incident (which is defined by system availability/performance). Safety occurrences require a different response chain that includes regulatory and legal notification. 
+ +**Trigger conditions:** +- HMAC integrity failure on any prediction that was served to an ANSP operator during an active TIP event +- A confirmed incorrect prediction (false positive or false negative) where the ANSP was managing airspace based on SpaceCom outputs +- Data staleness in excess of the operational threshold (TLE > 6h old) during an active re-entry event window without degradation notification having been sent +- Any SpaceCom system failure during which an ANSP continued operational use without receiving a degradation notification + +**Response procedure** (`docs/runbooks/safety-occurrence.md`): + +| Step | Action | Owner | Timing | +|------|--------|-------|--------| +| 1 | Detect and classify: confirm the occurrence meets trigger criteria; assign SAFETY_OCCURRENCE vs. standard incident | On-call engineer | Within 30 min of detection | +| 2 | Preserve evidence: set `do_not_archive = TRUE` on all affected prediction records, alert_events, and ingest logs; export to MinIO safety archive | On-call engineer | Within 1 hour | +| 3 | Internal escalation: notify incident commander + legal counsel; do NOT communicate externally until legal counsel is engaged | Incident commander | Within 1 hour | +| 4 | ANSP notification: contact affected ANSP primary contact and safety manager using the safety occurrence notification template (not the standard incident template); include what happened, what data was affected, what the ANSP should do in response | Incident commander + legal counsel review | Within 2 hours | +| 5 | Log: create `security_logs` record with `type = 'SAFETY_OCCURRENCE'`; include ANSP ID, affected prediction IDs, notification timestamp, and legal counsel name | On-call engineer | Same session | +| 6 | ANSP SMS obligation: inform the ANSP in writing that they may have an obligation to report this occurrence to their safety regulator under their SMS; SpaceCom cannot make this determination for the ANSP | Legal counsel | Within 24 hours | +| 7 | 
PIR: conduct a safety-occurrence-specific post-incident review (same structure as §26.8 PIR but with additional sections: regulatory notification status, hazard log update required?) | Engineering lead | Within 5 business days | +| 8 | Hazard log update: if the occurrence reveals a new hazard or changes the likelihood/severity of an existing hazard, update `docs/safety/HAZARD_LOG.md` and trigger a safety case review | Safety case custodian | Within 10 business days | + +**Safety occurrence log table:** +```sql +-- Add to security_logs or create a dedicated table +CREATE TABLE safety_occurrences ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + occurred_at TIMESTAMPTZ NOT NULL, + detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + org_ids UUID[] NOT NULL, -- affected ANSPs + trigger_type TEXT NOT NULL, -- 'HMAC_FAILURE', 'INCORRECT_PREDICTION', 'STALE_DATA', 'SILENT_FAILURE' + affected_predictions UUID[] NOT NULL DEFAULT '{}', + evidence_archived BOOLEAN NOT NULL DEFAULT FALSE, + ansp_notified_at TIMESTAMPTZ, + legal_notified_at TIMESTAMPTZ, + hazard_log_updated BOOLEAN NOT NULL DEFAULT FALSE, + pir_completed_at TIMESTAMPTZ, + notes TEXT +); +``` + +**What is NOT a safety occurrence (to avoid over-classification):** +- Standard availability incidents with degradation notification sent promptly +- Cosmetic UI errors not in the alert/prediction path +- Prediction updates that change values within stated uncertainty bounds + +#### ANSP Communication Plan + +When SpaceCom is degraded during an active TIP event, operators must be notified immediately through a defined channel: +- **WebSocket push** (if connected): automatic via the degraded-mode notification (§24.8) +- **Email fallback**: automated email to all `operator` role users with active sessions within the last 24h, identifying the degradation type and estimated resolution +- **Documented fallback**: every SpaceCom user onboarding includes the fallback procedure: "In the absence of SpaceCom, consult Space-Track 
TIP messages directly at space-track.org and coordinate with your national space surveillance authority per existing procedures" + +**Incident communication templates (F10):** Pre-drafted templates in `docs/runbooks/incident-comms-templates.md` — reviewed by legal counsel before first use. On-call engineers must use these templates verbatim; deviations require incident commander approval. Templates cover: +1. **Initial notification** (< 5 minutes): impact, what we know, what we are doing, next update time +2. **15-minute update**: progress, updated ETA if known, revised fallback guidance if needed +3. **Resolution notification**: confirmed restoration, prediction integrity verified, brief root cause (one sentence), PIR date +4. **Post-incident summary** (within 5 business days): full timeline, root cause, remediations implemented +What never appears in templates: speculation about cause before root cause confirmed; estimated recovery time until known with confidence; any admission of negligence or legal liability. + +#### Post-Incident Review Process (F8) + +Mandatory for all SEV-1 and SEV-2 incidents. PIR due within **5 business days** of resolution. + +**PIR document structure** (`docs/post-incident-reviews/YYYY-MM-DD-{slug}.md`): +1. **Incident summary** — what happened, when, duration, severity +2. **Timeline** — minute-by-minute from first alert to resolution +3. **Root cause** — using 5-whys methodology; stop when a process or system gap is identified +4. **Contributing factors** — what made the impact worse or detection slower +5. **Impact** — users/ANSPs affected; data at risk; SLO breach duration +6. **Remediation actions** — each with owner, GitHub issue link, and deadline; tracked with `incident-remediation` label +7. **What went well** — to reinforce effective practices + +PIR presented at the next engineering all-hands. Remediation actions are P2 priority — no new feature work by the responsible engineer until overdue remediations are closed. 
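The "PIR due within 5 business days" deadline is simple business-day arithmetic, worth pinning down once so scheduling and the remediation tracker agree. A minimal sketch (function name illustrative; public holidays deliberately ignored here):

```python
from datetime import date, timedelta

def pir_due_date(resolved_on: date, business_days: int = 5) -> date:
    """Return the PIR due date: N business days after incident resolution.
    Weekends are skipped; public holidays are ignored in this sketch."""
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Mon=0 .. Fri=4 are business days
            remaining -= 1
    return due
```

An incident resolved on a Friday is therefore due the following Friday, not the following Wednesday.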
+ +#### Chaos Engineering / Game Day Programme (F4) + +Quarterly game day; scenarios rotated so each is tested at least annually. Document in `docs/runbooks/game-day-scenarios.md`. + +**Minimum scenario set:** + +| # | Scenario | Expected behaviour | Pass criterion | +|---|---------|-------------------|---------------| +| 1 | PostgreSQL primary killed | Patroni promotes standby; API recovers within RTO | API returns 200 within 15 minutes; no data loss | +| 2 | Celery worker crash during active MC simulation | Job moves to DLQ; orphan recovery task re-queues; operator sees `FAILED` state | Job visible in DLQ within 2 minutes; re-queue succeeds | +| 3 | Space-Track ingest unavailable 6 hours | Staleness degraded mode activates; operators notified; predictions greyed | Staleness alert fires within 15 minutes of ingest stop | +| 4 | Redis failure | Sessions expire gracefully; WebSocket reconnects; no silent data loss | Users see "session expired" prompt; no 500 errors | +| 5 | Full prediction service restart during active CRITICAL alert | Alert state preserved in DB; re-subscribing WebSocket clients receive current state | No alert acknowledgement lost; reconnection < 30 seconds | +| 6 | Full region failover (annually) | DNS fails over to DR region; prediction API resumes | Recovery within RTO; HMAC verification passes on new primary | + +Each scenario: defined inject → observe → record actual behaviour → pass/fail vs. criterion → remediation window 2 weeks. Any scenario fail is treated as a SEV-2 incident with a PIR. + +#### Operational vs. 
Security Incident Runbooks (F11)

Operational and security incidents have different response teams, communication obligations, and legal constraints:

| Dimension | Operational incident | Security incident |
|-----------|---------------------|-----------------|
| Primary responder | On-call engineer | On-call engineer + DPO within 4h |
| Communication | Status page + ANSP email | **No public status page until legal counsel approves** |
| Regulatory obligation | SLA breach notification (MSA) | NIS2 24h early warning; GDPR 72h (if personal data) |
| Evidence preservation | Normal log retention | Immediate log freeze; do not rotate or archive |

Separate runbooks:
- `docs/runbooks/operational-incident-response.md` — standard on-call playbook
- `docs/runbooks/security-incident-response.md` — invokes DPO, legal counsel, NIS2/GDPR timelines; references §29.6 notification obligations

---

### 26.9 Deployment Strategy

#### Zero-Downtime Deployment (Blue-Green)

The TLS-terminating Caddy instance routes between blue (current) and green (new) backend instances:

```
Client → Caddy → [Blue backend] (current)
               → [Green backend] (new — deployed but not yet receiving traffic)
```

**Docker Compose implementation for Tier 2 (single-host):**

Docker Compose service names are fixed, so blue and green run as two separate Compose project instances. The deploy script at `scripts/blue-green-deploy.sh` manages the cutover:

```bash
#!/usr/bin/env bash
# scripts/blue-green-deploy.sh
set -euo pipefail

NEW_IMAGE="${1:?Usage: blue-green-deploy.sh <image-tag>}"
COMPOSE_FILE="docker-compose.yml"
BLUE_PROJECT="spacecom-blue"
GREEN_PROJECT="spacecom-green"

# 1. Determine which colour is currently active
ACTIVE=$(cat /opt/spacecom/.active-colour 2>/dev/null || echo "blue")
if [[ "$ACTIVE" == "blue" ]]; then NEXT="green"; else NEXT="blue"; fi

# 2. 
Start next-colour project with new image
NEXT_PROJECT="$( [[ $NEXT == green ]] && echo $GREEN_PROJECT || echo $BLUE_PROJECT )"
ACTIVE_PROJECT="$( [[ $ACTIVE == blue ]] && echo $BLUE_PROJECT || echo $GREEN_PROJECT )"
SPACECOM_BACKEND_IMAGE="$NEW_IMAGE" \
  docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" up -d backend

# 3. Wait for next-colour healthcheck (retry for up to 60s — the container needs time to boot)
for attempt in $(seq 1 12); do
  docker compose -p "$NEXT_PROJECT" exec backend curl -sf http://localhost:8000/healthz && break
  [[ "$attempt" == 12 ]] && { echo "Health check failed — aborting"; exit 1; }
  sleep 5
done

# 4. Run smoke tests against next-colour directly
SMOKE_TARGET="http://localhost:$( [[ $NEXT == green ]] && echo 8001 || echo 8000 )" \
  python scripts/smoke-test.py || { echo "Smoke tests failed — aborting"; exit 1; }

# 5. Shift Caddy upstream to next colour (atomic file swap + reload)
write_upstream() {
  cat > /opt/spacecom/caddy-upstream.caddy.tmp <<EOF
reverse_proxy backend-$1:8000 {
    health_uri /healthz
    health_interval 5s
}
EOF
  mv -f /opt/spacecom/caddy-upstream.caddy.tmp /opt/spacecom/caddy-upstream.caddy  # atomic swap
  docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
}
write_upstream "$NEXT"

echo "$NEXT" > /opt/spacecom/.active-colour
echo "✓ Traffic shifted to $NEXT. Monitoring for 5 minutes..."
sleep 300

# 6. Verify error rate via Prometheus (optional gate)
ERROR_RATE=$(curl -s "http://localhost:9090/api/v1/query?query=spacecom:api_availability:ratio_rate5m" \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE < 0.99" | bc -l) )); then
  echo "Error rate $ERROR_RATE < 0.99 — rolling back"
  # Swap back to the previously active colour
  write_upstream "$ACTIVE"
  echo "$ACTIVE" > /opt/spacecom/.active-colour
  exit 1
fi

# 7. Decommission old colour
docker compose -p "$ACTIVE_PROJECT" stop backend
docker compose -p "$ACTIVE_PROJECT" rm -f backend
echo "✓ Blue-green deploy complete. Active: $NEXT"
```

**Caddy upstream configuration** — the Caddyfile `import`s a snippet file that the deploy script rewrites atomically and applies with `caddy reload`:

```
# /etc/caddy/Caddyfile (upstream excerpt)
import /opt/spacecom/caddy-upstream.caddy

# /opt/spacecom/caddy-upstream.caddy — rewritten by the deploy script on each cutover:
reverse_proxy backend-green:8000 {
    health_uri /healthz
    health_interval 5s
}
```

**WebSocket long-lived connection timeout configuration (F11 — §63):** HTTP reverse proxies have default idle timeouts that silently terminate long-lived WebSocket connections. Caddy's idle timeout is governed by the global `timeouts` server option (default: 5 minutes). Many cloud load balancers default to 60 seconds. A WebSocket with no traffic for this period is silently closed by the proxy — the FastAPI server and client may not detect this for minutes, creating a "ghost connection" that is alive at the socket level but dead at the application level.

**Required Caddyfile additions for WebSocket paths:**

```
# /etc/caddy/Caddyfile
{
    servers {
        timeouts {
            idle 0  # disable idle timeout globally — WS connections can be silent for extended periods
        }
    }
}

spacecom.io {
    # WebSocket endpoints: no idle timeout, no read timeout
    @websockets {
        path /ws/*
        header Connection *Upgrade*
        header Upgrade websocket
    }
    handle @websockets {
        reverse_proxy backend:8000 {
            transport http {
                read_timeout 0   # no read timeout — WS connection can be idle
                write_timeout 0  # no write timeout — WS send can be slow on poor networks
            }
            flush_interval -1  # immediate flush; do not buffer WS frames
        }
    }

    # Non-WebSocket paths: retain normal timeouts
    handle {
        reverse_proxy backend:8000 {
            transport http {
                read_timeout 30s
                write_timeout 30s
            }
        }
    }
}
```

**Ping-pong interval must be less than proxy idle timeout:** The FastAPI WebSocket handler sends a ping every `WS_PING_INTERVAL_SECONDS` (default: 30s). With `idle 0` in Caddy, this prevents proxy-side termination. If running behind a cloud load balancer with a fixed idle timeout, the ping interval must be set to `(load_balancer_idle_timeout - 10s)` — documented in `docs/runbooks/websocket-proxy-config.md`.

**Rollback:** `scripts/blue-green-rollback.sh` — resets `/opt/spacecom/caddy-upstream.caddy` to the previous colour and reloads Caddy. Rollback completes in < 5 seconds (no container restart required).

Deployment sequence:
1. Deploy green backend alongside blue (both running)
2. Run smoke tests against green directly (`X-Deploy-Target: green` header)
3. Shift 10% of traffic to green (canary); monitor error rate for 5 minutes
4. If clean: shift 100% to green; keep blue running for 10 minutes
5. If error spike: shift 0% back to blue instantly (< 5s rollback via `blue-green-rollback.sh`)
6. Decommission blue after 10 minutes of clean green operation

#### Alembic Migration Safety Policy

Every database migration must be backwards-compatible with the previous application version. Required sequence for any schema change:

1. **Migration only**: deploy migration; verify old app still functions with new schema (additive changes only — new nullable columns, new tables, new indexes)
2. **Application deploy**: deploy new application version that uses the new schema
3. **Cleanup migration** (if needed): remove old columns/constraints after old app version is fully retired

Never: rename a column, change a column type, or drop a column in a single migration that deploys simultaneously with the application change.

**Hypertable-specific migration rules:**
- Always use `CREATE INDEX CONCURRENTLY` for new indexes on hypertables — does not block writes; safe during live ingest. Standard `CREATE INDEX` (without `CONCURRENTLY`) blocks all writes for the duration.
- Never add a column with a non-null default to a populated hypertable in a single migration. Required sequence: (1) add nullable column, (2) backfill in batches with `UPDATE ... 
WHERE id BETWEEN x AND y`, (3) add NOT NULL constraint in a separate deployment.
- Test every migration against a production-sized data copy before applying to production. Record the measured execution time in the migration file header comment: `# Execution time on 10M-row orbits table: 45s`.
- Set a CI migration timeout gate: if a migration runs > 30 seconds against the test dataset, it must be reviewed by a senior engineer before merge.

#### TIP Event Deployment Freeze

No deployments permitted when a CRITICAL or HIGH alert is active for any tracked object. Enforced by a CI/CD gate:

```python
# scripts/check_deployment_gate.py — invoked by the pre-deploy CI job
import os

import requests

API_URL = os.environ["SPACECOM_API_URL"]
DEPLOY_CHECK_SECRET = os.environ["DEPLOY_CHECK_SECRET"]

class DeploymentBlocked(Exception):
    """Raised to fail the deploy job while alerts are active."""

def check_deployment_gate():
    response = requests.get(
        f"{API_URL}/api/v1/alerts?level=CRITICAL,HIGH&active=true",
        headers={"X-Deploy-Check": DEPLOY_CHECK_SECRET},
    )
    response.raise_for_status()
    active = response.json()["total"]
    if active > 0:
        raise DeploymentBlocked(
            f"{active} active CRITICAL/HIGH alerts. Deployment blocked until events resolve."
        )
```

The deploy check secret is a read-only service credential — it cannot acknowledge alerts or modify data. 
#### CI/CD Pipeline Specification

**GitLab CI pipeline jobs (`.gitlab-ci.yml`):**

| Job | Trigger | Steps | Failure behaviour |
|-----|---------|-------|------------------|
| `lint` | All pushes + PRs | `pre-commit run --all-files` (detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff) | Blocks merge |
| `test-backend` | All pushes + PRs | `pytest --cov --cov-fail-under=80`; `alembic check` (model/migration divergence) | Blocks merge |
| `test-frontend` | All pushes + PRs | `vitest run`; `playwright test` | Blocks merge |
| `security-scan` | All pushes + PRs | `bandit -r backend/`; `pip-audit -r backend/requirements.txt`; `npm audit --audit-level=high` (frontend); `eslint --plugin security`; `trivy image` on built images (`.trivyignore` applied); `pip-licenses` + `license-checker-rseidelsohn` gate; `.secrets.baseline` currency check | Blocks merge on High/Critical |
| `build-and-push` | Merge to `main` or `release/*` | Multi-stage `docker build`; `docker push ghcr.io/spacecom/<service>:sha-<commit>` via OIDC; `cosign sign` all images; `syft` SPDX-JSON SBOM generated and attached as `cosign attest`; `pip-licenses --format=json` + `license-checker-rseidelsohn --json` manifests merged into SBOM and uploaded as workflow artifact (365-day retention); `docs/compliance/sbom/` updated with versioned SBOM artefact | Blocks deploy |
| `deploy-staging` | After `build-and-push` on `main` | Docker Compose update on staging host; smoke tests | Blocks production deploy gate |
| `deploy-production` | Manual approval after `deploy-staging` passes | `check_deployment_gate()` (no active CRITICAL/HIGH alerts); blue-green deploy | Manual |

**Image tagging convention:**
- `sha-<commit>` — immutable canonical tag; always pushed
- `v<major>.<minor>.<patch>` — release alias pushed on tagged commits
- `latest` — never pushed; forbidden in production Compose files (CI grep check enforces this)

**Build cache strategy:**
```yaml
# .github/workflows/ci.yml (build-and-push job excerpt)
- uses: 
docker/setup-buildx-action@v3 +- uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} # OIDC — no stored secret +- uses: docker/build-push-action@v5 + with: + context: ./backend + push: true + tags: ghcr.io/spacecom/backend:sha-${{ github.sha }} + cache-from: type=registry,ref=ghcr.io/spacecom/backend:buildcache + cache-to: type=registry,ref=ghcr.io/spacecom/backend:buildcache,mode=max +``` + +pip and npm caches use `actions/cache` keyed on lock file hash: +```yaml +- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2 + with: + path: ~/.cache/pip + key: pip-${{ hashFiles('backend/requirements.txt') }} +- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2 + with: + path: frontend/.next/cache + key: npm-${{ hashFiles('frontend/package-lock.json') }} +``` + +**`cosign` image signing and SBOM attestation** (added after each `docker push`): + +```yaml +# .github/workflows/ci.yml — build-and-push job (after docker push steps) +- uses: sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20 # v3.5.0 + +- name: Sign all service images with cosign (keyless, OIDC) + env: + COSIGN_EXPERIMENTAL: "true" + run: | + for svc in backend worker-sim worker-ingest renderer frontend; do + cosign sign --yes \ + ghcr.io/spacecom/${svc}:sha-${{ github.sha }} + done + +- name: Generate SBOM and attach as cosign attestation + env: + COSIGN_EXPERIMENTAL: "true" + run: | + for svc in backend worker-sim worker-ingest renderer frontend; do + syft ghcr.io/spacecom/${svc}:sha-${{ github.sha }} \ + -o spdx-json=sbom-${svc}.spdx.json + # Validate non-empty + jq -e '.packages | length > 0' sbom-${svc}.spdx.json + cosign attest --yes \ + --predicate sbom-${svc}.spdx.json \ + --type spdxjson \ + ghcr.io/spacecom/${svc}:sha-${{ github.sha }} + done + +- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4 + with: + name: sbom-${{ github.sha }} + path: 
"*.spdx.json"
    retention-days: 365 # ESA bid artefacts; ECSS minimum 1 year

- name: Verify signature before deploy (deploy jobs only)
  if: github.event_name == 'workflow_dispatch'
  run: |
    cosign verify ghcr.io/spacecom/backend:sha-${{ github.sha }} \
      --certificate-identity-regexp="https://github.com/spacecom/spacecom/.*" \
      --certificate-oidc-issuer="https://token.actions.githubusercontent.com"
```

**All GitHub Actions pinned by commit SHA** (mutable `@vN` tags allow tag-repointing attacks that exfiltrate all workflow secrets):

```yaml
# Correct form — all third-party actions in .github/workflows/*.yml:
- uses: docker/setup-buildx-action@4fd812986e6c8c2a69e18311145f9371337f27d # v3.4.0
- uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567 # v3.3.0
- uses: docker/build-push-action@1a162644f9a7e87d8f4b053101d1d9a712edc18c # v6.3.0
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
```

CI lint check enforces no mutable tags remain (note: a bare `grep ... && exit 1` chain would fail the job when grep finds *nothing*, so the condition must be explicit):
```bash
if grep -rE 'uses: [^@]+@v[0-9]' .github/workflows/; then
  echo "ERROR: Actions must be pinned by commit SHA, not tag"
  exit 1
fi
```

Use `pinact` or Renovate's `github-actions` manager to automate SHA updates. 
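Where a structured check is preferred over the grep one-liner, the same rule can be enforced in a few lines of Python. A sketch — the script path is illustrative, and it assumes a pin is exactly a 40-character lowercase hex SHA:

```python
# scripts/check_action_pins.py — sketch; names are illustrative
import re
from pathlib import Path

USES_LINE = re.compile(r"uses:\s*([\w./-]+)@([A-Za-z0-9._-]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")

def unpinned_actions(workflow_dir: Path) -> list[str]:
    """List every `uses:` reference not pinned to a full 40-char commit SHA.
    Local actions (./path, no @ref) never match the pattern and are ignored."""
    violations = []
    for workflow in sorted(workflow_dir.glob("*.yml")):
        for lineno, line in enumerate(workflow.read_text().splitlines(), start=1):
            match = USES_LINE.search(line)
            if match and not FULL_SHA.match(match.group(2)):
                violations.append(f"{workflow.name}:{lineno}: {match.group(1)}@{match.group(2)}")
    return violations
```

Unlike the grep form, this also catches branch pins such as `@main`, not just `@vN` tags.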
+ +#### Local Development Environment + +**First-time setup (target: working stack in ≤ 15 minutes from clean clone):** + +```bash +git clone https://github.com/spacecom/spacecom && cd spacecom +cp .env.example .env # fill in Space-Track credentials only; all others have safe defaults +pip install pre-commit && pre-commit install +make dev # starts full stack with hot-reload +make seed # loads test objects, FIRs, and synthetic TIP events +# → Open http://localhost:3000; globe shows 10 test objects +``` + +**`make` targets:** + +| Target | What it does | +|--------|-------------| +| `make dev` | `docker compose up` with `./backend` and `./frontend/src` bind-mounted for hot-reload | +| `make test` | `pytest` (backend) + `vitest run` (frontend) + `playwright test` (E2E) | +| `make migrate` | `alembic upgrade head` inside the running backend container | +| `make seed` | Loads `fixtures/dev_seed.sql` + synthetic TIP events via seed script | +| `make lint` | Runs all pre-commit hooks against all files | +| `make clean` | `docker compose down -v` — removes all containers and volumes (destructive, prompts) | +| `make shell-db` | Opens a `psql` shell inside the TimescaleDB container | +| `make shell-backend` | Opens a bash shell inside the running backend container | + +**Hot-reload configuration (docker-compose.override.yml — dev only, not committed to CI):** +```yaml +services: + backend: + volumes: + - ./backend:/app # bind mount — FastAPI --reload picks up changes instantly + command: ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0"] + frontend: + volumes: + - ./frontend/src:/app/src # Next.js / Vite HMR +``` + +**`.env.example` structure (excerpt):** +```bash +# === Required: obtain before first run === +SPACETRACK_USERNAME=your_email@example.com +SPACETRACK_PASSWORD=your_password + +# === Required: generate locally === +JWT_PRIVATE_KEY_PATH=./certs/jwt_private.pem # openssl genrsa -out certs/jwt_private.pem 2048 +JWT_PUBLIC_KEY_PATH=./certs/jwt_public.pem + 
# === Safe defaults for local dev (change for production) ===
POSTGRES_PASSWORD=spacecom_dev
REDIS_PASSWORD=spacecom_dev
MINIO_ACCESS_KEY=spacecom_dev
MINIO_SECRET_KEY=spacecom_dev_secret
HMAC_SECRET=dev_hmac_secret_change_in_prod

# === Stage flags ===
ENVIRONMENT=development # development | staging | production
SHADOW_MODE_DEFAULT=false
DISABLE_SIMULATION_DURING_ACTIVE_EVENTS=false
```

All production-only variables are clearly marked. The README's "Getting Started" section mirrors the first-time setup steps above.

#### Staging Environment

**Purpose:** Continuous integration target for the `main` branch. Serves as the TRL artefact evidence environment — all shadow validation records and OWASP ZAP reports reference the staging deployment.

| Property | Staging | Production |
|----------|---------|------------|
| Infrastructure | Tier 2 (single-host Docker Compose) | Tier 3 (multi-host HA) |
| Data | Synthetic only — no production data | Real TLE/TIP/space weather |
| Secrets | Separate credential set; non-production Space-Track account | Production credential set in Vault |
| Deploy trigger | Automatic on merge to `main` | Manual approval in GitLab CI/CD |
| OWASP ZAP | Runs against every staging deploy | Run on demand before Phase 3 milestones |
| Retention | Environment resets weekly (fresh `make seed` run) | Persistent |

#### Secrets Rotation Procedure

Zero-downtime rotation is required. Service interruption during rotation is a reliability failure.

**JWT RS256 Signing Keypair:**
1. Generate new keypair: `openssl genrsa -out jwt_private_new.pem 2048 && openssl rsa -in jwt_private_new.pem -pubout -out jwt_public_new.pem`
2. Load new public key into `JWT_PUBLIC_KEY_NEW` env var on all backend instances (old key still active)
3. Backend now validates tokens signed with either old or new key
4. Update `JWT_PRIVATE_KEY_PATH` to point at the new private key; new tokens are signed with it
5. 
Wait for all old tokens to expire (max 1h for access tokens; 30 days for refresh tokens)
6. Promote the new public key to `JWT_PUBLIC_KEY` and remove `JWT_PUBLIC_KEY_NEW`; the old public key is no longer needed
7. Log `security_logs` entry type `KEY_ROTATION` with rotation timestamp and initiator

**Space-Track Credentials:**
1. Create new Space-Track account or update password via Space-Track web portal
2. Update `SPACETRACK_USERNAME` / `SPACETRACK_PASSWORD` in secrets manager (Docker secrets / Vault)
3. Trigger one manual ingest cycle; verify 200 response from Space-Track API
4. Deactivate old credentials in Space-Track portal
5. Log `security_logs` entry type `CREDENTIAL_ROTATION`

**MinIO Access Keys:**
1. Create new access key pair via MinIO console (`mc admin user add`)
2. Update `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` in secrets manager
3. Restart backend and worker services (rolling restart — blue-green ensures zero downtime)
4. Verify pre-signed URL generation succeeds
5. Delete old access key from MinIO console

**HMAC Secret (prediction signing key):**
- **Do not rotate casually.** All existing HMAC-signed predictions will fail verification after rotation.
- Pre-rotation: re-sign all existing predictions with new key (batch migration script required)
- Post-rotation: update `HMAC_SECRET` in secrets manager; verify batch re-sign by spot-checking 10 predictions
- Rotation must be approved by engineering lead; `security_logs` entry type `HMAC_KEY_ROTATION` required

---

### 26.10 Post-Deployment Safety Monitoring Programme (F9 — §61)

Pre-deployment testing and shadow validation demonstrate that a system was safe at a point in time. Post-deployment monitoring demonstrates that it remains safe in operational conditions. DO-278A §12 and EUROCAE ED-153 both require evidence of ongoing safety monitoring after deployment.

**Programme components:**

#### 26.10.1 Prediction Accuracy Monitoring

After each actual re-entry event where SpaceCom generated predictions:
1. 
Record the actual re-entry time and location (from The Aerospace Corporation / ESA re-entry campaign results) +2. Compare against SpaceCom's p50 corridor centre and p95 bounds +3. Record in `shadow_validations` table: `actual_reentry_time`, `actual_impact_region`, `p50_error_km`, `p95_captured` (boolean) +4. Compute running accuracy statistics: % of events where actual impact was within p95 corridor; median error in km +5. Publish accuracy statistics to `GET /api/v1/admin/accuracy-report` (accessible to ANSP admins) + +**Alert trigger:** If rolling 12-month p95 capture rate drops below 80% (target: 95%), engineering review is mandatory before the next ANSP shadow activation or model update deployment. + +#### 26.10.2 Safety KPI Dashboard + +Prometheus recording rules and Grafana dashboard (`monitoring/dashboards/safety-kpis.json`): + +| KPI | Metric | Target | Alert threshold | +|-----|--------|--------|----------------| +| HMAC verification failures | `spacecom_hmac_verification_failures_total` | 0 / month | Any failure → SEV-1 | +| Safety occurrences | `safety_occurrences` table count | 0 / year | ≥1 → safety case review | +| Alert false positive rate | Manual: PIR review | < 5% | Engineering review if exceeded | +| Operator training currency | `operator_training_records` expiry | 100% current | < 95% → ANSP admin notification | +| p95 corridor capture rate | `shadow_validations` rolling 12-month | ≥ 95% | < 80% → model review | +| Prediction freshness (TLE age at prediction time) | `spacecom_tle_age_hours` histogram p95 | < 6h | > 24h → MEDIUM alert | + +#### 26.10.3 Quarterly Safety Review + +Mandatory quarterly safety review meeting. Output: `docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md`. + +Agenda: +1. Safety KPI review (all metrics above) +2. Safety occurrences since last review (zero is an acceptable answer — record it) +3. Hazard log review: has any hazard likelihood or severity changed since last quarter? +4. 
MoC status update: progress on PLANNED items +5. Model changes in period: were any SAL-2 components modified? If so, safety case impact assessment +6. ANSP feedback: any concerns raised by ANSP customers regarding safety or accuracy? +7. Actions: owner, deadline, priority + +**Attendance required:** Safety case custodian + engineering lead. One ANSP contact may be invited as an observer (good practice for regulatory demonstration). + +#### 26.10.4 Model Version Safety Monitoring + +When a new model version is deployed (changes to `physics/` or `alerts/` SAL-2 components): +1. Shadow run new model in parallel for ≥14 days before replacing production model +2. Compare new vs. old: prediction differences > 50 km for p50, or > 100 km for p95, require engineering review before promotion +3. After promotion: monitor `shadow_validations` for the next 3 re-entry events; regression alert if p95 capture rate declines +4. Record in `simulations.model_version`; all predictions annotated with the model version they used + +--- + +## 27. Capacity Planning + +### 27.0 Performance Test Specification (F6) + +Performance tests live in `tests/load/` and are run with **k6**. They are not part of the standard `make test` suite — they require a running environment with realistic data. 
They run: +- Manually before any Phase gate release +- Automatically on the staging environment nightly (scheduled k6 Cloud or self-hosted k6) +- Results committed to `docs/validation/load-test-results/` after each Phase gate + +#### Scenarios + +```javascript +// tests/load/scenarios.js +export const options = { + scenarios: { + czml_catalog: { + executor: 'ramping-vus', + startVUs: 0, stages: [ + { duration: '30s', target: 50 }, + { duration: '2m', target: 100 }, + { duration: '30s', target: 0 }, + ], + }, + websocket_subscribers: { + executor: 'constant-vus', vus: 200, duration: '3m', + }, + decay_submit: { + executor: 'constant-arrival-rate', rate: 5, timeUnit: '1m', + preAllocatedVUs: 10, duration: '5m', + }, + }, +}; +``` + +#### SLO Assertions (k6 thresholds — test fails if breached) + +| Scenario | Metric | Threshold | +|----------|--------|-----------| +| CZML catalog (`GET /objects` + CZML) | p95 response time | < 2 000 ms | +| API auth (`POST /auth/token`) | p99 response time | < 500 ms | +| Decay prediction submit | p95 response time | < 500 ms (202 accept only) | +| WebSocket connection | 200 concurrent connections stable for 3 min | 0 connection drops | +| WebSocket alert delivery | Time from DB insert to browser receipt | < 30 000 ms p95 | +| `/readyz` probe | p99 response time | < 100 ms | + +#### Baseline Environment + +Performance tests are only comparable if run against a consistent hardware baseline: + +```markdown +# docs/validation/load-test-baseline.md +- Host: 8 vCPU / 32 GB RAM (Tier 2 single-host) +- TimescaleDB: 100 tracked objects, 90 days of orbit history +- Celery workers: simulation ×16 concurrency, ingest ×2 +- Redis: empty (no warm cache) at test start +``` + +Results from a different hardware spec must be labelled separately and not compared to the baseline. A performance regression is defined as any threshold breach on the **same** baseline hardware. 
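
The regression rule above can be sketched as a small check over two k6 summary files from the same baseline. The summary layout and metric keys below are assumptions modelled on k6's `--summary-export` output; adapt the keys to the real schema.

```python
# Sketch: flag a > 20% p95 increase between two k6 summary exports from the
# same baseline hardware. Metric names and the "p(95)" key are assumptions
# matching k6's --summary-export JSON layout.
import json

P95_KEY = "p(95)"

def p95_metrics(summary: dict) -> dict:
    """Extract p95 values for every metric that reports one."""
    return {
        name: metric["values"][P95_KEY]
        for name, metric in summary.get("metrics", {}).items()
        if P95_KEY in metric.get("values", {})
    }

def find_regressions(previous: dict, current: dict, tolerance: float = 0.20) -> list[str]:
    """Return metric names whose p95 grew by more than `tolerance` (default 20%)."""
    prev, curr = p95_metrics(previous), p95_metrics(current)
    return sorted(
        name
        for name, value in curr.items()
        if name in prev and prev[name] > 0 and (value - prev[name]) / prev[name] > tolerance
    )

def load_summary(path: str) -> dict:
    """Load one docs/validation/load-test-results/*.json file."""
    with open(path) as fh:
        return json.load(fh)
```

A CI step would run this on the two most recent results for the same environment and fail the job (or open the regression issue) when the returned list is non-empty.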

#### Storing and Trending Results

k6 outputs a JSON summary; a CI step uploads it to `docs/validation/load-test-results/YYYY-MM-DD-{env}.json`. A lightweight Python script (`scripts/load-test-trend.py`) plots p95 latency over time for the past 10 runs and embeds the chart in `docs/TEST_PLAN.md`. A > 20% increase in any p95 metric between consecutive runs on the same hardware creates a `performance-regression` GitLab issue automatically.

### 27.1 Workload Characterisation

| Workload | CPU Profile | Memory | Dominant Constraint |
|----------|------------|--------|-------------------|
| **MC decay prediction** (500 samples) | CPU-bound, parallelisable | 200–500 MB per process | CPU cores on simulation workers |
| SGP4 catalog propagation (100 objects) | Trivial | < 100 MB | None — analytical model |
| CZML generation | I/O-bound (DB read) | < 500 MB | DB query latency |
| Atmospheric breakup | CPU-bound, light | ~200 MB | Negligible vs. MC |
| Conjunction screening (100 objects) | CPU-bound, seconds | ~500 MB | Acceptable on any worker |
| Controlled re-entry planner | CPU-bound, similar to MC | 500 MB | Same pool as MC |
| Playwright renderer | Memory-bound (Chromium) | 1–2 GB per instance | Isolated container |
| TimescaleDB queries | I/O-bound | 64 GB (buffer cache) | NVMe IOPS for spatial queries |

**Cost-tracking metrics (F3, F4, F11):**

Add the following Prometheus counters to enable per-org cost attribution and external API budget visibility. These feed the unit economics model (§27.7) and the Enterprise tier chargeback reports. 
+ +```python +# backend/app/metrics.py (add to existing prometheus_client registry) +from prometheus_client import Counter + +# F3 — External API call budget tracking +ingest_api_calls_total = Counter( + "spacecom_ingest_api_calls_total", + "Total external API calls made by the ingest worker", + labelnames=["source"] # "space_track", "celestrak", "noaa_swpc", "esa_discos", "iers" +) +# Usage: ingest_api_calls_total.labels(source="space_track").inc() +# Alert: if space_track calls > 100/day → investigate polling loop bug (Space-Track AUP limit: 200/day) + +# F4 — Per-org simulation CPU attribution +simulation_cpu_seconds_total = Counter( + "spacecom_simulation_cpu_seconds_total", + "Total CPU-seconds consumed by MC simulations, by org and object", + labelnames=["org_id", "norad_id"] +) +# Usage: simulation_cpu_seconds_total.labels(org_id=str(org_id), norad_id=str(norad_id)).inc(elapsed) +# This is the primary input to infrastructure_cost_per_mc_run in §27.7 +``` + +**F5 — Inbound API request counter (§68):** + +```python +# backend/app/metrics.py (add to existing prometheus_client registry) +api_requests_total = Counter( + "spacecom_api_requests_total", + "Total inbound API requests, by org, endpoint, and API version", + labelnames=["org_id", "endpoint", "version", "status_code"] +) +# Usage (FastAPI middleware): +# api_requests_total.labels( +# org_id=str(request.state.org_id), +# endpoint=request.url.path, +# version=request.headers.get("X-API-Version", "v1"), +# status_code=str(response.status_code) +# ).inc() +``` + +This counter is the foundation for future API tier enforcement (e.g., 1,000 requests/month for Professional; unlimited for Enterprise) and for supporting usage-based billing for Persona E/F API consumers. Add to the FastAPI middleware stack alongside `prometheus_fastapi_instrumentator`. + +**F11 — Per-org cost attribution for Enterprise tier:** + +Enterprise contracts may include usage-based clauses (e.g., MC simulation credits). 
The `simulation_cpu_seconds_total` metric provides the raw data; a monthly Celery task (`tasks/billing/generate_usage_report.py`) aggregates it per org: + +```python +@shared_task +def generate_monthly_usage_report(org_id: str, year: int, month: int): + """Aggregate simulation CPU-seconds and ingest API calls per org for billing review.""" + # Query Prometheus/VictoriaMetrics for the org's metrics over the billing period + # Output: docs/business/usage_reports/{org_id}/{year}-{month:02d}.json + # Fields: total_mc_runs, total_cpu_seconds, estimated_cost_usd (at $0.40/run internal rate) +``` + +Per-org usage reports are stored in `docs/business/usage_reports/` and referenced in Enterprise QBRs. The cost rate (`$0.40/run` at Tier 3 scale) is updated quarterly in `docs/business/UNIT_ECONOMICS.md`. + +**Usage surfaced to commercial team and org admins (F2 — §68):** + +Usage data must reach two audiences: the commercial team (for renewal and expansion conversations) and the org admin (to understand value received). + +*Commercial team:* Monthly Celery Beat task (`tasks/commercial/send_commercial_summary.py`) emails `commercial@spacecom.io` on the 1st of each month with: +- Per-org: MC simulation count, PDF reports generated, WebSocket connection hours, alert events (by severity) +- Trend vs. previous 3 months (growth signal for expansion conversations) +- Contracts expiring within 90 days (renewal pipeline) + +*Org admin:* Monthly usage summary email to each org's admin contact showing their own usage. Template: *"In [month], your team ran [N] decay predictions, generated [M] PDF reports, and received [K] CRITICAL alerts. Your monthly quota: [Q] simulations (used: [N])."* This email reinforces value perception ahead of renewal conversations. + +Both emails use the `generate_monthly_usage_report` output. Add `send_usage_summary_emails` to celery-redbeat at `crontab(day_of_month=1, hour=6)`. 
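
A minimal sketch of the corresponding beat entries, assuming the task module paths shown (the commercial-summary hour is also an assumption); celery-redbeat consumes the same `beat_schedule` mapping:

```python
# celeryconfig.py (sketch) — monthly usage emails for the commercial team and
# for org admins. Task paths mirror the modules named in this section; the
# 05:00 slot for the commercial digest is an assumption.
from celery.schedules import crontab

beat_schedule = {
    "send-commercial-summary": {
        # Commercial team digest, 1st of each month
        "task": "tasks.commercial.send_commercial_summary",
        "schedule": crontab(day_of_month=1, hour=5, minute=0),
    },
    "send-usage-summary-emails": {
        # Per-org admin usage summaries, 1st of each month at 06:00
        "task": "tasks.billing.send_usage_summary_emails",
        "schedule": crontab(day_of_month=1, hour=6, minute=0),
    },
}
```

Both tasks read the `generate_monthly_usage_report` output for the previous billing period, so they should run after that report task has completed.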
+ +### 27.2 Monte Carlo Parallelism Architecture + +The MC decay predictor must use **Celery `group` + `chord`** to distribute sample computation across the full worker pool. `multiprocessing.Pool` within a single task is limited to one container's cores. + +```python +from celery import group, chord + +@celery.task +def run_mc_decay_prediction(object_id: int, params: dict) -> str: + """Fan out 500 samples as individual sub-tasks; aggregate with chord callback.""" + sample_tasks = group( + run_single_trajectory.s(object_id, params, seed=i) + for i in range(params['mc_samples']) + ) + result = chord(sample_tasks)(aggregate_mc_results.s(object_id, params)) + return result.id + +@celery.task +def run_single_trajectory(object_id: int, params: dict, seed: int) -> dict: + """Single RK7(8) + NRLMSISE-00 trajectory integration. CPU time: 2–20s.""" + rng = np.random.default_rng(seed) + f107 = params['f107'] * rng.normal(1.0, 0.20) # ±20% variation + bstar = params['bstar'] * rng.normal(1.0, 0.10) + return integrate_trajectory(object_id, f107, bstar, params) + +@celery.task +def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str: + """Compute percentiles, build corridor polygon, HMAC-sign, write to DB.""" + prediction = compute_percentiles_and_corridor(results) + prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret) + write_prediction_to_db(prediction) + return str(prediction['id']) +``` + +**Worker concurrency for chord sub-tasks:** +- Each sub-task is short (2–20s) and CPU-bound +- Worker `--pool=prefork --concurrency=16`: 16 OS processes per container +- 2 simulation worker containers: 32 concurrent sub-tasks +- 500 samples / 32 = ~16 batches × ~10s average = **~160s per MC run** (p50) +- p95 target of 240s met with headroom + +**Chord result backend:** Sub-task results stored in Redis temporarily (< 1 MB each × 500 = 500 MB peak per run). 
Results expire after 1 hour (`result_expires = 3600` in `celeryconfig.py` — §27.8). The aggregate callback reads all results, computes the final prediction, and writes to TimescaleDB — Redis is not the durable store. + +**Chord callback result count validation (F1 — §67):** Redis `noeviction` prevents eviction, but if Redis is misconfigured or hits `maxmemory` and rejects writes, sub-task results may be missing when the chord callback fires. The callback must validate that it received the expected number of results before writing to TimescaleDB: + +```python +@celery.task +def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str: + """Compute percentiles, build corridor polygon, HMAC-sign, write to DB.""" + expected = params['mc_samples'] + if len(results) != expected: + # Partial result — do not write a silently truncated prediction + raise ValueError( + f"MC chord received {len(results)}/{expected} results for object {object_id}. " + "Redis result backend may be under memory pressure. Aborting." + ) + prediction = compute_percentiles_and_corridor(results) + prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret) + write_prediction_to_db(prediction) + return str(prediction['id']) +``` + +The `ValueError` causes the chord callback to fail and be routed to the DLQ (Dead Letter Queue). The originating API call receives a task failure, and the client receives `HTTP 500` with `Retry-After`. A `spacecom_mc_chord_partial_result_total` counter fires, triggering a CRITICAL alert: *"MC chord received partial results — Redis memory budget exceeded."* + +### 27.3 Deployment Tiers + +#### Tier 1 — Development and Demonstration + +Single machine, Docker Compose, all services co-located. No HA. Suitable for development, internal demos, and ESA TRL 4 demonstrations. 
+ +| Spec | Minimum | Recommended | +|------|---------|-------------| +| CPU | 8 cores | 16 cores | +| RAM | 16 GB | 32 GB | +| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD | +| Cloud equivalent | `t3.2xlarge` ~$240/mo | `m6i.4xlarge` ~$540/mo | + +MC prediction p95: ~400–800s (exceeds SLO — acceptable for demo; noted in demo briefings). + +--- + +#### Tier 2 — Phase 1–2 Production + +Separate containers per service. Meets SLOs under moderate load (≤ 5 concurrent simulation users). Single-node per service — no HA. Suitable for shadow mode deployments and early ANSP pilots. + +| Service | vCPU | RAM | Storage | Cloud (AWS) | Monthly | +|---------|------|-----|---------|-------------|---------| +| Backend API | 4 | 8 GB | — | `c6i.xlarge` | ~$140 | +| Simulation Workers ×2 | **16 each** | 32 GB each | — | `c6i.4xlarge` ×2 | ~$560 each | +| Ingest Worker | 2 | 4 GB | — | `t3.medium` | ~$30 | +| Renderer | 4 | 8 GB | — | `c6i.xlarge` | ~$140 | +| TimescaleDB | 8 | **64 GB** | 1 TB NVMe | `r6i.2xlarge` | ~$420 | +| Redis | 2 | 8 GB | — | `cache.r6g.large` | ~$120 | +| MinIO / S3 | 4 | 8 GB | 4 TB | `i3.xlarge` + EBS | ~$200 | +| **Total** | | | | | **~$2,200/mo** | + +**On-premise equivalent (Tier 2):** Two servers — compute host (2× AMD EPYC 7313P, 32 total cores, 192 GB RAM) + storage host (8 vCPU, 256 GB RAM, 2 TB NVMe + 8 TB HDD). Capital cost: **~$25,000–35,000**. + +--- + +#### Tier 3 — Phase 3 HA Production + +Full redundancy. Meets 99.9% availability SLO including during active TIP events. Required before any formal operational ANSP deployment. 

| Service | Count | vCPU each | RAM each | Notes |
|---------|-------|-----------|----------|-------|
| Backend API | 2 | 4 | 8 GB | Load balanced; blue-green deployable |
| Simulation Workers | **4** | **16** | 32 GB | 64 total cores; chord sub-tasks fill all |
| Ingest Worker | 2 | 2 | 4 GB | celery-redbeat leader election |
| Renderer | 2 | 4 | 8 GB | Network-isolated; Chromium memory budget |
| TimescaleDB Primary | 1 | 8 | **128 GB** | Patroni-managed; synchronous replication |
| TimescaleDB Standby | 1 | 8 | **128 GB** | Hot standby; auto-failover ≤ 30s |
| Redis Sentinel ×3 | 3 | 2 | 8 GB | Quorum; master failover ≤ 10s |
| MinIO (distributed) | 4 | 4 | 16 GB | Erasure coding EC:2; 2× 2 TB NVMe each |
| **Cloud total (AWS)** | | | | ~**$6,000–7,000/mo** |

With 64 simulation worker cores: 500-sample MC in **~80s p50, ~120s p95** — well within SLO.

**MinIO Erasure Coding (Tier 3):** 4-node distributed MinIO uses **EC:2** (2 parity shards). This provides:
- **Read quorum:** any 2 of 4 nodes (tolerates 2 simultaneous node failures for reads)
- **Write quorum:** requires 3 of 4 nodes (tolerates 1 simultaneous node failure for writes)
- **Effective storage:** 50%: with 2× 2 TB NVMe per node, the 4 nodes hold 16 TB raw → 8 TB usable, consistent with the Tier 3 table
- Configured via `MINIO_ERASURE_SET_DRIVE_COUNT=4` and server startup with all 4 node endpoints

**Multi-region stance:** SpaceCom is **single-region** through all three phases. Reasoning:
- Phase 1–3 customer base is small (ESA evaluation, early ANSP pilots); cross-region replication cost and operational complexity are not justified.
- Government and defence customers may have data sovereignty requirements — a single, clearly defined deployment region (customer-specified) is simpler to certify than an active-active multi-region setup. 
+- When a second jurisdiction customer is onboarded, deploy a **separate, independent instance** in their required jurisdiction rather than extending a single global cluster. Each instance has its own data, its own compliance scope, and its own operational team contact. +- This decision is documented as ADR-0010 (see §34 decision log). + +**On-premise equivalent (Tier 3):** Three servers — 2× compute (2× EPYC 7343, 32 cores, 256 GB RAM each) + 1× storage (128 GB RAM, 4× 2 TB NVMe RAID-10, 16 TB HDD). Capital cost: **~$60,000–80,000**. + +**Celery worker idle cost and scale-to-zero decision (F6):** + +Simulation workers are the largest cloud line item ($560/mo each at Tier 2 on `c6i.4xlarge`). Their actual compute utilisation depends on MC run frequency: + +| Usage pattern | Active compute/day | Idle fraction | Monthly cost at Tier 2 ×2 workers | +|--------------|-------------------|--------------|----------------------------------| +| Light (5 MC runs/day × 80s p50) | ~7 min/day | ~99.5% | $1,120 | +| Moderate (20 MC runs/day × 80s) | ~27 min/day | ~98.1% | $1,120 | +| Heavy (100 MC runs/day × 80s) | ~133 min/day | ~90.7% | $1,120 | + +**Scale-to-zero analysis:** + +| Approach | Pros | Cons | Decision | +|---------|------|------|---------| +| Always-on (Tier 1–2) | Zero cold-start; SLO met immediately | High idle cost when lightly used | **Use at Tier 1–2** — cost is ~$1,120/mo regardless; latency SLO requires workers ready | +| Scale-to-1 minimum (Tier 3) | Reduced idle cost vs. 
4×; one worker handles ingest keepalive tasks | Cold-start for burst: 3 new workers × 30–60s spin-up; MC SLO may breach during burst | **Use at Tier 3** — scale-to-1 minimum; HPA/KEDA scales 1→4 on `celery_queue_length > 10` |
| Scale-to-zero | Maximum idle savings | 60–120s cold-start violates 10-min MC SLO when all workers are down | **Do not use** — cold-start from zero exceeds acceptable latency for on-demand simulation |

**Implementation at Tier 3 (Kubernetes):** Use a KEDA `ScaledObject` with a Redis list trigger on the Celery queue:
```yaml
triggers:
  - type: redis
    metadata:
      listName: celery # Celery default queue
      listLength: "10" # scale up when >10 tasks queued
      activationListLength: "1" # activation threshold for 0->1; the scale-to-1 floor is enforced via minReplicaCount: 1
```
Minimum replica count: **1**. Maximum: **4**. Scale-down stabilisation window: 5 minutes (prevents oscillation during multi-run bursts).

**Ingest worker:** Always-on, single instance (2 vCPU, $30/mo at Tier 2). celery-redbeat tasks run on 1-minute and hourly schedules; scale-to-zero is not appropriate. At Tier 3, 2 instances for redundancy; no autoscaling needed.

---

### 27.4 Storage Growth Projections

| Data | Retention | Raw Growth/Year | Compressed/Year | Cloud Cost/Year (est.) 
| Notes | +|------|-----------|----------------|----------------|----------------------|-------| +| `orbits` (100 objects, 1/min) | 90 days online | ~15 GB | ~2 GB | ~$20 (EBS gp3, rolling) | TimescaleDB compression ~7:1 | +| `tle_sets` | 1 year | ~55 MB | ~30 MB | Negligible | — | +| `space_weather` | 2 years | ~5 MB | ~2 MB | Negligible | — | +| MC simulation blobs (MinIO) | 2 years | 500 GB–2 TB | Not compressed | **$140–$560/yr** (S3-IA after 90d) | **Dominant cost** — S3-IA at $0.0125/GB/mo | +| PDF reports (MinIO) | 7 years | 10–90 GB | 5–45 GB | $5–$45/yr (S3 Glacier) | $0.004/GB/mo Glacier tier | +| WAL archive (backup) | 30 days rolling | ~25 GB/month | — | **~$100/yr** (300 GB peak × $0.023/GB/mo × 12) | S3 Standard; rolls over; cost is steady-state | +| `security_logs` | 2 years online; 7-year archive | ~500 MB/year | — | Negligible | Legal hold | +| `reentry_predictions` | 7 years | ~100 MB/year | — | Negligible | Legal hold | +| Safety records (`alert_events`, `notam_drafts`, `prediction_outcomes`, `degraded_mode_events`, coordination notes) | **5-year minimum** append-only archive | ~200 MB/year | — | Negligible | ICAO Annex 11 §2.26; safety investigation requirement | + +**Storage cost summary (Phase 2 steady-state):** MC blobs dominate at sustained use. At 50 runs/day × 120 MB/run = 2.2 TB/year, 2-year retention on S3-IA ≈ **$660/year** in object storage alone. This should be captured in the unit economics model (§27.7). Storage cost is the primary variable cost that scales with usage depth (number of MC runs), not with number of users. + +**Backup cost projection (F9):** WAL archive at 30-day rolling window: ~300 GB peak occupancy on S3 Standard ≈ **$83/year** (Tier 2). At Tier 3 with synchronous replication, the base-backup is ~2× TimescaleDB data size. At 1 TB compressed DB size: one weekly base-backup (retained 4 weeks) = ~4 TB S3 occupancy → **~$1,100/year** at Tier 3. 
Include backup S3 bucket costs in the infrastructure budget from Phase 3 onwards. Budget line: `infra/backup-s3` ≈ $100–200/month at steady Tier 3 scale.

**Safety record retention policy (Finding 11):** Safety-relevant event records have a distinct retention category separate from general operational data. A `safety_record BOOLEAN DEFAULT FALSE` flag on `alert_events` and `notam_drafts` marks records that must survive the standard retention drop. Records with `safety_record = TRUE` are excluded from TimescaleDB drop policies and transferred to MinIO cold tier (append-only) after 90 days online, retained for 5 years minimum. The TimescaleDB retention job checks `WHERE safety_record = FALSE` before dropping chunks. `safety_record` is set to `TRUE` at insert time for any event with `alert_level IN ('HIGH', 'CRITICAL')` and for all NOTAM drafts.

**MC blob storage dominates at scale.** At sustained use (50 MC runs/day × 120 MB/run): 2.2 TB/year. The Tier 3 distributed MinIO (8 TB usable with EC:2 erasure coding on 4 nodes × 2× 2 TB NVMe) covers approximately 3–4 years of growth before expansion.

**Cold tier tiering decision (two object classes with different requirements):**

| Object class | Cold tier target | Reason |
|---|---|---|
| MC simulation blobs (`mc_blobs/` prefix) | **MinIO ILM warm tier or S3 Infrequent Access** | Blobs may need to be replayed for Mode C visualisation of historical events (e.g., regulatory dispute review, incident investigation). Glacier 12h restore latency is operationally unacceptable for this use case. |
| Compliance-only documents (`reports/`, `notam_drafts/`) | **S3 Glacier / Glacier Deep Archive acceptable** | These are legal records requiring 7-year retention; retrieval is for audit or legal discovery only; 12h restore latency is acceptable. |

MinIO ILM rules are configured per `docs/runbooks/minio-lifecycle.md`. Lifecycle transitions: MC blobs after 90 days → ILM warm (lower-cost MinIO tier or S3-IA); compliance docs after 1 year → Glacier. 
+ +**MinIO multipart upload retry and incomplete upload expiry (F7 — §67):** + +MC simulation blobs (~120 MB each) are uploaded as multipart uploads. During a MinIO node failure in EC:2 distributed mode, write quorum (3/4 nodes) may be temporarily unavailable. An in-flight multipart upload will fail with `MinioException` / `S3Error`. Without a retry policy, the MC prediction is written to TimescaleDB but the blob is lost — the historical replay functionality silently fails. + +```python +# worker/tasks/blob_upload.py +from minio.error import S3Error + +@shared_task( + autoretry_for=(S3Error, ConnectionError), + max_retries=3, + retry_backoff=30, # 30s, 60s, 120s — allow node recovery + retry_jitter=True, +) +def upload_mc_blob(prediction_id: str, blob_data: bytes): + """Upload MC simulation blob to MinIO with retry on quorum failure.""" + object_key = f"mc_blobs/{prediction_id}.msgpack" + minio_client.put_object( + bucket_name="spacecom-simulations", + object_name=object_key, + data=io.BytesIO(blob_data), + length=len(blob_data), + content_type="application/msgpack", + ) +``` + +**Incomplete multipart upload cleanup:** Configure MinIO lifecycle rule to abort incomplete multipart uploads after 24 hours. Add to `docs/runbooks/minio-lifecycle.md`: +```bash +mc ilm rule add --expire-delete-marker --noncurrent-expire-days 1 \ + spacecom/spacecom-simulations --abort-incomplete-multipart-upload-days 1 +``` +This prevents orphaned multipart upload parts accumulating on disk during node failures or application crashes mid-upload. 
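
A cheap post-upload existence check complements the retry policy above, catching a lost blob at write time rather than at replay time. This is a sketch: `stat_object` is the MinIO SDK call, the bucket and key layout follow this section, and the exception handling is deliberately broad for illustration.

```python
# Sketch: verify an MC blob landed in MinIO after upload_mc_blob succeeds.
# A False result should trigger the same CRITICAL alert path as a partial
# chord result, since historical replay depends on the blob.
def verify_mc_blob(minio_client, prediction_id: str) -> bool:
    """Return True if the blob exists and is non-empty."""
    try:
        stat = minio_client.stat_object(
            "spacecom-simulations", f"mc_blobs/{prediction_id}.msgpack"
        )
        return stat.size > 0
    except Exception:  # S3Error in practice; kept broad for the sketch
        return False
```

Calling this at the end of `upload_mc_blob` (after the final retry) turns a silent blob loss into an immediate, alertable failure.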
+ +### 27.5 Network and External Bandwidth + +| Traffic | Direction | Volume | Notes | +|---------|-----------|--------|-------| +| Space-Track TLE polling | Outbound | ~1 MB per run, every 4h | ~6 MB/day | +| NOAA SWPC space weather | Outbound | ~50 KB per fetch, hourly | ~1 MB/day | +| ESA DISCOS | Outbound | ~10 MB/day (initial bulk); ~100 KB/day incremental | — | +| CZML to clients | Outbound | ~5–15 MB per user page load (full); <500 KB/hr delta | Scales linearly with users; delta protocol essential | +| WebSocket to clients | Outbound | ~1 KB/event × events/day | Low bandwidth, persistent connection | +| PDF reports (download) | Outbound | ~2–5 MB per report | Low frequency; MinIO presigned URL avoids backend proxy | +| MinIO internal traffic | Internal | Dominated by MC blob writes | Keep on internal Docker network | + +**CZML egress cost estimate and compression policy (F5):** + +At Phase 2 (10 concurrent users), daily CZML egress: +- Initial full loads: 10 users × 3 page loads/day × 15 MB = 450 MB/day +- Delta updates (delta protocol, §6): 10 users × 8h active × 500 KB/hr = 40 MB/day +- **Total: ~490 MB/day ≈ 15 GB/month** + +At $0.085/GB AWS CloudFront egress: **~$1.28/month** (Phase 2) → **~$6.40/month** (50 users Phase 3). + +CZML egress is **not a significant cost driver** at this scale, but is significant for latency and user experience. Compression policy: + +| Encoding | CZML size reduction | Implementation | +|---------|-------------------|----------------| +| gzip (Accept-Encoding) | 60–75% | Caddy `encode gzip` — already included in §26.9 Caddy config | +| Brotli | 70–80% | Caddy `encode zstd br gzip` — use br for browser clients | +| CZML delta protocol (`?since=`) | 95%+ for incremental updates | Already specified in §6 | + +**Minimum requirement:** Caddy `encode` block must include `br` before `gzip` in the content negotiation order. A 15 MB CZML payload compresses to ~3–5 MB with brotli. 
Verify with `curl -H "Accept-Encoding: br" -I ` — response must show `Content-Encoding: br`. + +Network is not a constraint for this workload at the scales described. Standard 1 Gbps datacenter networking is sufficient. For on-premise government deployments, standard enterprise LAN is adequate. + +--- + +### 27.6 DNS Architecture and Service Discovery + +#### Tier 1–2 (Docker Compose) + +Docker Compose provides built-in DNS resolution by service name within each network. Services reference each other by container name (e.g., `db`, `redis`, `minio`). No additional DNS infrastructure required. + +**PgBouncer as single DB connection target:** At Tier 2, the backend and workers connect to `pgbouncer:5432`, not directly to `db:5432`. PgBouncer multiplexes connections and acts as a stable endpoint: +- In a Patroni failover, `pgbouncer` is reconfigured to point to the new primary; application code never changes connection strings. +- PgBouncer configuration: `docs/runbooks/pgbouncer-config.md` + +**Celery task retry during Patroni failover (F2 — §67):** During the ≤ 30s Patroni leader election window, all writes to PgBouncer fail with `FATAL: no connection available` or `OperationalError: server closed the connection unexpectedly`. Celery tasks that execute a DB write during this window will raise `sqlalchemy.exc.OperationalError`. Without a retry policy, these tasks fail permanently and are routed to the DLQ. + +All Celery tasks that write to the database must declare: +```python +@shared_task( + autoretry_for=(OperationalError,), + max_retries=3, + retry_backoff=5, # 5s, 10s, 20s + retry_backoff_max=30, # cap at 30s (within failover window) + retry_jitter=True, +) +def my_db_writing_task(...): + ... +``` + +This covers: `aggregate_mc_results`, `write_alert_event`, `write_prediction_outcome`, all ingest tasks. Tasks that only read from DB should also retry on `OperationalError` since PgBouncer may pause reads during leader election. 
Add integration test: simulate `OperationalError` on first two attempts → task succeeds on third attempt. + +#### Tier 3 (HA / Kubernetes migration path) + +At Tier 3, introduce **split-horizon DNS**: + +| Zone | Scope | Purpose | +|------|-------|---------| +| `spacecom.internal` | Internal services | Service discovery: `backend.spacecom.internal`, `db.spacecom.internal` (→ PgBouncer VIP) | +| `spacecom.io` (or customer domain) | Public internet | Caddy termination endpoint; ACME certificate domain | + +**Service discovery implementation:** +- **Cloud (AWS/GCP/Azure):** Use cloud-native internal DNS (Route 53 private hosted zones / Cloud DNS) + load balancer for each service tier +- **On-premise:** CoreDNS deployed as a DaemonSet (Kubernetes) or as a Docker container on the management network; service records updated via Patroni callback scripts on failover + +**Key DNS records (Tier 3):** + +| Record | Type | Value | +|--------|------|-------| +| `db.spacecom.internal` | A | PgBouncer VIP (stable through Patroni failover) | +| `redis.spacecom.internal` | A | Redis Sentinel VIP | +| `minio.spacecom.internal` | A | MinIO load balancer (all 4 nodes) | +| `backend.spacecom.internal` | A | Backend API load balancer (2 instances) | + +--- + +### 27.7 Unit Economics Model + +**Reference document:** `docs/business/UNIT_ECONOMICS.md` — maintained alongside this plan; update whenever pricing or infrastructure costs change. + +Unit economics express the cost to serve one organisation per month and the revenue generated, enabling margin analysis per tier. 
+ +**Cost-to-serve model (Phase 2, cloud-hosted, per org):** + +| Cost driver | Basis | Monthly cost per org | +|------------|-------|---------------------| +| Simulation workers (shared pool) | 2 workers shared across all orgs; allocate by MC run share | $1,120 ÷ org count | +| TimescaleDB (shared instance) | ~$420/mo; fixed regardless of org count up to Phase 2 capacity | $420 ÷ org count | +| Redis (shared) | ~$120/mo | $120 ÷ org count | +| MinIO / S3 storage | Variable; ~$660/yr at heavy MC use → $55/mo | $5–55/mo | +| Backend API (shared) | ~$140/mo | $140 ÷ org count | +| Ingest worker (shared) | ~$30/mo | Allocated to platform overhead | +| Email relay | ~$0.001/email × volume | $0–5/mo | +| CZML egress | ~$0.085/GB | $1–7/mo | +| **Total variable (1 org, Tier 2)** | | **~$1,860/mo platform + $60–70 per-org variable** | + +**Revenue per tier (target pricing — cross-reference §55 commercial model):** + +| Tier | Monthly ARR / org | Gross margin target | +|------|-----------------|-------------------| +| Free / Evaluation | $0 | Negative — cost of ESA relationship | +| Professional (shadow) | $3,000–6,000/mo | 50–70% at ≥3 orgs on platform | +| Enterprise (operational) | $15,000–40,000/mo | 65–75% at Tier 3 scale | + +**Break-even analysis:** At Tier 2 platform cost (~$2,200/mo), break-even at Professional tier requires ≥1 paying org at $3,000/mo. Each additional Professional org at shared infrastructure has near-zero incremental infrastructure cost until capacity boundaries (MC concurrency limit, DB connection pooler limit). + +**Key unit economics metric:** `infrastructure_cost_per_mc_run`. At Tier 2 (2 workers, $1,120/mo) and 500 runs/month: **$2.24/run**. At Tier 3 (4 workers KEDA scale-to-1, ~$800/mo amortised at medium utilisation) and 2,000 runs/month: **$0.40/run**. This metric should be tracked alongside `spacecom_simulation_cpu_seconds_total` (§27.1). 
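As a worked check of the per-run figures above (worker costs and run volumes taken from this section):

```python
def cost_per_mc_run(monthly_worker_cost_usd: float, runs_per_month: int) -> float:
    """infrastructure_cost_per_mc_run = worker pool cost / completed MC runs."""
    return round(monthly_worker_cost_usd / runs_per_month, 2)

tier2 = cost_per_mc_run(1120, 500)   # Tier 2: 2 workers, 500 runs/month
tier3 = cost_per_mc_run(800, 2000)   # Tier 3: amortised KEDA pool, 2,000 runs/month
```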
**Professional Services as a revenue line (F10 — §68):**

Professional Services (PS) revenue is a distinct revenue stream from recurring SaaS fees. For safety-critical aviation systems, PS typically represents **30–50% of first-year contract value** and includes:

| PS engagement type | Typical value | Description |
|-------------------|-------------|-------------|
| Implementation support | $15,000–40,000 | Deployment, configuration, integration with ANSP SMS |
| Regulatory documentation | $10,000–25,000 | SpaceCom system description for ANSP regulatory submissions; assists with EASA/CASA/CAA shadow mode notifications |
| Training (initial) | $5,000–15,000 | On-site or remote training for duty controllers, analysts, and IT administrators |
| Safety Management System integration | $8,000–20,000 | Integrating SpaceCom alert triggers into the ANSP's existing SMS occurrence reporting workflow |
| Annual training refresh | $2,000–5,000/yr | Recurring annual training for new staff and procedure updates |

PS revenue is tracked in the `contracts.ps_value_cents` column (§68 F1). Include PS as a budget line in `docs/business/UNIT_ECONOMICS.md`:
- **Year 1 total contract value** = MRR × 12 + PS value
- PS is recognised as one-time revenue at delivery (milestone-based); SaaS fees are recognised monthly
- PS delivery requires dedicated engineering and commercial capacity — budget 1–2 days of senior engineer time per $5,000 of PS value

**Shadow trial MC quota (F8 — §68):** Free/shadow trial orgs are limited to 100 MC simulation runs per month (`organisations.monthly_mc_run_quota = 100`).
Enforcement at `POST /api/v1/decay/predict`:
```python
from fastapi import HTTPException

# get_monthly_mc_run_count and first_of_next_month are application helpers.
if org.subscription_tier in ('shadow_trial',) and org.monthly_mc_run_quota > 0:
    runs_this_month = get_monthly_mc_run_count(org_id)
    if runs_this_month >= org.monthly_mc_run_quota:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "monthly_quota_exceeded",
                "quota": org.monthly_mc_run_quota,
                "used": runs_this_month,
                "resets_at": first_of_next_month().isoformat(),
                "upgrade_url": "/settings/billing"
            }
        )
```

Commercial controls must not interrupt active operations. If the organisation is in an active TIP / CRITICAL operational state, quota exhaustion is logged and surfaced to commercial/admin dashboards but enforcement is deferred until the event closes.

---

### 27.8 Redis Memory Budget

**Reference document:** `docs/infra/REDIS_SIZING.md` — sizing rationale and eviction policy decisions.

Redis serves several distinct purposes with different memory characteristics. Using a single Redis instance (with separate DB indexes for broker vs.
cache) requires explicit memory budgeting:

| Purpose | DB index | Key pattern | Estimated peak memory | Eviction policy |
|---------|----------|------------|----------------------|----------------|
| Celery broker + result backend | DB 0 | `celery-task-meta-*`, `_kombu.*` | 500 MB (500 MC sub-tasks × ~1 MB results) | `noeviction` |
| celery-redbeat schedule | DB 1 | `redbeat:*` | < 1 MB | `noeviction` |
| WebSocket session tracking | DB 2 | `spacecom:ws:*`, `spacecom:active_tip:*` | < 10 MB | `noeviction` |
| Application cache (CZML, NOTAM) | DB 3 | `spacecom:cache:*` | 50–200 MB | `allkeys-lru` |
| Redis Pub/Sub fan-out (alerts) | — | `spacecom:alert:*` channels | Transient; ~1 KB/message | N/A (pub/sub, no persistence) |
| **Total budget** | | | **~700–750 MB peak** | |

**Sizing decision:** Use `cache.r6g.large` (8 GB RAM) with `maxmemory 2gb` — provides 2.5× headroom above peak estimate for burst conditions (multiple simultaneous MC runs × result backend). Note that Redis applies `maxmemory-policy` per instance, not per DB index, so the per-DB eviction policies in the table above express intent rather than enforceable configuration: set `maxmemory-policy noeviction` globally, and the application cache (DB 3) must handle cache misses gracefully (it does — CZML regeneration on miss is defined in §6).

**Redis memory alert:** Add Grafana alert `redis_memory_used_bytes > 1.5GB` → WARNING; `> 1.8GB` → CRITICAL. At CRITICAL, check for result backend accumulation (expired Celery results not cleaned up) before scaling.

**Redis result cleanup:** Celery `result_expires` must be set to `3600` (1 hour). Verify in `backend/celeryconfig.py`:
```python
result_expires = 3600  # Clean up MC sub-task results after 1 hour
```

---

## 28. Human Factors Framework

SpaceCom is a safety-critical decision support system used by time-pressured operators in aviation operations rooms. Human factors are not a UX concern — they are a safety assurance concern. This section documents the HF design requirements, standards basis, and validation approach.
+ +**Standards basis:** ICAO Doc 9683 (Human Factors in Air Traffic Management), FAA AC 25.1329 (Flight Guidance Systems — alert prioritisation philosophy), EUROCONTROL HRS-HSP-005, ISA-18.2 (alarm management, adapted for ATC context), Endsley (1995) Situation Awareness model. + +--- + +### 28.1 Situation Awareness Design Requirements + +SpaceCom must support all three levels of Endsley's SA model for Persona A (ANSP duty manager): + +| SA Level | Requirement | Implementation | Time target | +|----------|-------------|----------------|-------------| +| **Level 1 — Perception** | Correct hazard information visible at a glance | Globe with urgency symbols; active events panel; risk level badges | **≤ 5 seconds** from alert appearance — icon, colour, and position alone must convey object + risk level without reading text | +| **Level 2 — Comprehension** | Operator understands what the hazard means for their sector | Plain-language event cards; window range notation; FIR intersection list; data confidence indicators | **≤ 15 seconds** to identify earliest FIR intersection window and whether it falls within the operator's sector | +| **Level 3 — Projection** | Operator can anticipate future state without simulation tools | Corridor Evolution widget (T+0/+2/+4h); Gantt timeline; space weather buffer callout | **≤ 30 seconds** to determine whether the corridor is expanding or contracting using the Corridor Evolution widget | + +These time targets are **pass/fail criteria** for the Phase 2 ANSP usability test (§28.7). + +**Globe visual information hierarchy (F7 — §60):** The globe displays objects, corridors, hazard zones, FIR boundaries, and ADS-B routes simultaneously. Under operational stress, operators must not be required to search for the critical element — it must be pre-attentively distinct. 
The following hierarchy is mandatory and enforced by the rendering layer:

| Priority | Element | Visual treatment | Pre-attentive channel |
|----------|---------|------------------|-----------------------|
| 1 — Immediate | Active CRITICAL object | Flashing red octagon (2 Hz, reduced-motion: static + thick border) + label always visible | Motion + colour + shape |
| 2 — Urgent | Active HIGH object | Amber triangle, label visible at zoom ≥ 4 | Colour + shape |
| 3 — Monitor | Active MEDIUM object | Yellow circle, label on hover | Colour + shape |
| 4 — Context | Re-entry corridors (p05–p95) | Semi-transparent red fill, no label until hover | Colour + opacity |
| 5 — Awareness | FIR boundary overlay | Thin white lines, low opacity (30%) | Position |
| 6 — Background | ADS-B routes | Thin grey lines, visible only at zoom ≥ 5 | Position |
| 7 — Ambient | All other tracked objects | Small white dots, no label until hover | Position |

Rule: no element at priority N may be more visually prominent than an element at priority N-1. The rendering layer enforces draw order and applies opacity/size reduction to lower-priority elements when a priority-1 element is present. This is a **non-negotiable safety requirement** — a CesiumJS performance optimisation that re-orders draw calls or flattens layers must not override this hierarchy. A design in which an operator cannot reach SA Level 1 in ≤ 5 seconds on a CRITICAL alert has failed and requires a redesign cycle before shadow deployment; the numeric targets in §28.1 are what allow the usability test to produce a meaningful pass/fail result.

Level 3 SA support is specifically identified as a gap in pure corridor-display systems and is addressed by the Corridor Evolution widget (§6.8).

---

### 28.2 Mode Error Prevention

Mode confusion is the most common cause of automation-related incidents in aviation. SpaceCom has three operational modes (LIVE / REPLAY / SIMULATION) that must be unambiguously distinct at all times.
+ +**Mode error prevention mechanisms:** +1. Persistent mode indicator pill in top nav — never hidden, never small +2. Mode-switch dialogue with explicit current-mode, target-mode, and consequence statements (§6.3) +3. Future-preview temporal wash when the timeline scrubber is not at current time (§6.3) +4. Optional `disable_simulation_during_active_events` org setting to block simulation entry during live incidents (§6.3) +5. Audio alerts suppressed in SIMULATION and REPLAY modes +6. All simulation-generated records have `simulation_id IS NOT NULL` — they cannot appear in operational views + +--- + +### 28.3 Alarm Management + +Alarm management requirements follow the principle: every alarm should demand action, every required action should have an alarm, and no alarm should be generated that does not demand action. + +**Alarm rationalisation:** +- CRITICAL: demands immediate action — full-screen banner + audio +- HIGH: demands timely action — persistent badge + acknowledgement required +- MEDIUM: informs — toast, auto-dismiss, logged +- LOW: awareness only — notification centre + +**Alarm management philosophy and KPIs (F1 — §60):** SpaceCom adopts the EEMUA 191 / ISA-18.2 alarm management framework adapted for space/aviation operations. 
The following KPIs are measured quarterly by Persona D and included in the ESA compliance artefact package: + +| EEMUA 191 KPI | Target | Definition | +|---------------|--------|-----------| +| Alarm rate (steady-state) | < 1 alarm per 10 minutes per operator | Alarms requiring attention across all levels; excludes LOW awareness-only | +| Nuisance alarm rate | < 1% of all alarms | Alarms acknowledged as `MONITORING` within 30s without any other action — indicates no actionable information | +| Stale alarms | 0 CRITICAL unacknowledged > 10 min | Unacknowledged CRITICAL alerts older than 10 minutes; triggers supervisor notification (F8) | +| Alarm flood threshold | < 10 CRITICAL alarms within 10 minutes | Beyond this rate, an alert storm meta-alert fires and the batch-flood suppression protocol activates | +| Chattering alarms | 0 | Any alarm that fires and clears more than 3 times in 30 minutes without operator action | + +**Alarm quality requirements:** +- Nuisance alarm rate target: < 1 LOW alarm per 10 minutes per user in steady-state operations (logged and reviewed quarterly by Persona D) +- Alert deduplication: consecutive window-shrink events do not re-trigger CRITICAL if the threshold was not crossed +- 4-hour per-object CRITICAL rate limit prevents alarm flooding from a single event +- Alert storm meta-alert disambiguates between genuine multi-object events and system integrity issues (§6.6) + +**Batch TIP flood handling (F2 — §60):** Space-Track releases TIP messages in batches — a single NOAA solar storm event can produce 50+ new TIP entries within a 10-minute window. Without mitigation, this generates 50 simultaneous CRITICAL alerts, constituting an alarm flood that exceeds EEMUA 191 KPIs and cognitively overwhelms the operator. + +Protocol when ingest detects ≥ 5 new TIP messages within a 5-minute window: +1. **Batch gate activates:** Individual CRITICAL banners suppressed for objects 2–N of the batch. 
Object 1 (highest-priority by predicted Pc or earliest window) receives the standard CRITICAL banner. +2. **Batch summary alert fires:** A single HIGH-level "Batch TIP event: N objects with new TIP data" summary appears in the notification centre. The summary is actionable — it links to a pre-filtered catalog view showing all newly-TIP-flagged objects sorted by predicted re-entry window. +3. **Batch event logged:** A `batch_tip_event` record is created in `alert_events` with `trigger_type = 'BATCH_TIP'`, `affected_objects = [NORAD ID list]`, and `batch_size = N`. This is distinct from individual object alert records. +4. **Per-object alerts queue:** Individual CRITICAL alerts for objects 2–N are queued and delivered at a maximum rate of 1 per minute, only if the operator has not opened the batch summary view within 5 minutes of the batch gate activating. This prevents indefinite suppression while preventing flood. + +The threshold (≥ 5 TIP in 5 minutes) and maximum queue delivery rate (1/min) are configurable per-org via org-admin settings, subject to minimum values (≥ 3 and ≤ 2/min respectively) to prevent safety-defeating misconfiguration. + +**Audio alarm specification (F11 — §60):** +- Two-tone ascending chime: 261 Hz (C4) followed by 392 Hz (G4), each 250ms, 20ms fade-in/out (not siren — ops rooms have sirens from other systems already) +- Conforms to EUROCAE ED-26 / RTCA DO-256 advisory alert audio guidelines (advisory category — attention-getting without startle) +- Plays once on first presentation; **does not loop automatically** +- **Re-alert on missed acknowledgement:** If a CRITICAL alert remains unacknowledged for 3 minutes, the chime replays once. Replays at most once — the second chime is the final audio prompt. 
Further escalation is via supervisor notification (F8), not repeated audio (which would cause habituation)
- Stops on acknowledgement — not on banner dismiss; banner dismiss without acknowledgement is not permitted for CRITICAL severity
- Per-device volume control via OS; per-session software mute (persists for session only; resets on next login to prevent operators permanently muting safety alerts)
- Enabled by org-level "ops room mode" setting (default: off); must be explicitly enabled by org admin — not auto-enabled to prevent unexpected audio in environments where audio is not appropriate
- Volume floor in ops room mode: minimum 40% of device maximum; operators cannot mute below this floor when ops room mode is active (configurable per-org, minimum 30%)

**Startle-response mitigation** — sudden full-screen CRITICAL banners cause ~5 seconds of degraded cognitive performance in research studies. The following rules prevent cold-start startle:

1. **Progressive escalation mandatory:** A CRITICAL alert may only be presented full-screen if the same object has already been in HIGH state for ≥ 1 minute during the current session. If the alert arrives cold (no prior HIGH state), the system must hold the alert in HIGH presentation for 30 seconds before upgrading to CRITICAL full-screen. Exception: `impact_time_minutes < 30` bypasses the 30s hold.
2. **Audio precedes visual by 500ms:** The two-tone chime fires 500ms before the full-screen banner renders. This primes the operator's attentional system and blunts the startle peak.
3. **Banner is overlay, not replacement:** The CRITICAL full-screen banner is a dimmed overlay (backdrop `rgba(0,0,0,0.72)`) rendered above the corridor map — the map, aircraft positions, and FIR boundaries remain visible beneath it. The banner must never replace the map render, as spatial context is required for the decision the operator is being asked to make.
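The progressive-escalation and imminent-impact rules above can be sketched as a single decision function (names and return values are illustrative, not the real implementation):

```python
def critical_presentation(prior_high_seconds: float, impact_time_minutes: float) -> str:
    """Decide how a newly CRITICAL alert is first presented to the operator."""
    if impact_time_minutes < 30:
        # Imminent impact bypasses the progressive delay (cross-hat override).
        return "FULLSCREEN_IMMEDIATE"
    if prior_high_seconds >= 60:
        # Operator already primed by >= 1 minute in HIGH state this session.
        return "FULLSCREEN"
    # Cold arrival: hold in HIGH presentation for 30 s before upgrading.
    return "HOLD_HIGH_30S"
```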
+ +**Cross-hat alert override matrix:** The Human Factors, Safety, and Regulatory hats jointly approve the following override rule set: +- `impact_time_minutes < 30` or equivalent imminent-impact state: bypass progressive delay; immediate full-screen CRITICAL permitted +- data-integrity compromise (`HMAC_INVALID`, corrupted prediction provenance, or equivalent): immediate full-screen CRITICAL permitted +- degraded-data or connectivity-only events without direct hazard change: progressive escalation remains mandatory +- all immediate-bypass cases require explicit rationale in the alert type definition and traceability into the safety case and hazard log + +**CRITICAL alert accessibility requirements (F2):** When the CRITICAL alert banner renders: +- `focus()` is called on the alert dialog element programmatically +- `role="alertdialog"` and `aria-modal="true"` on the banner container +- `aria-labelledby` points to the alert title; `aria-describedby` points to the conjunction summary text +- `aria-hidden="true"` set on the map container while the alertdialog is active; removed on dismiss +- `aria-live="assertive"` region announces alert title immediately on render (separate from the dialog, for screen readers that do not expose `alertdialog` role automatically) +- Visible text status indicator "⚠ Audio alert active" accompanies the audio tone for deaf or hard-of-hearing operators (audio-only notification is not sufficient as a sole channel) +- All alert action buttons reachable by `Tab` from within the dialog; `Escape` closes only if the alert has a non-CRITICAL severity; CRITICAL requires explicit category selection before dismiss + +**Alarm rationalisation procedure** — alarm systems degrade over time through threshold drift and alert-to-alert desensitisation. 
The following procedure is mandatory: + +- Persona D (Operations Analyst) reviews alert event logs quarterly +- Any alarm type that fired ≥ 5 times in a 90-day period and was acknowledged as `MONITORING` ≥ 90% of the time is a **nuisance alarm candidate** — threshold review required before next quarter +- Any alarm threshold change must be recorded in `alarm_threshold_audit` (object, old threshold, new threshold, reviewer, rationale, date); immutable append-only +- ANSP customers may request threshold adjustments for their own organisation via the org-admin settings; changes take effect after a mandatory 7-day confirmation period and are logged in `alarm_threshold_audit` +- Alert categories that have never triggered a `NOTAM_ISSUED` or `ESCALATING` acknowledgement in 12 months are escalated to Persona D for review of whether the alert should be demoted one severity level + +**Habituation countermeasures** — repeated identical stimuli produce reduced response (habituation). The following design rules counteract alarm habituation: + +- CRITICAL audio uses two alternating tones (261 Hz and 392 Hz, ~0.25s each); the alternation pattern is varied pseudo-randomly within the specification range so the exact sound is never identical across sessions +- CRITICAL banner background colour cycles through two dark-amber shades (`#7B4000` / `#6B3400`) at 1 Hz — subtle variation without strobing, enough to maintain arousal without inducing distraction +- Per-object CRITICAL rate limit (4-hour window) prevents habituation to a single persistent event +- `alert_events` habituation report: any operator who has acknowledged ≥ 20 alerts of the same type in a 30-day window without a single `ESCALATING` or `NOTAM_ISSUED` response is flagged for supervisor review — this indicates potential habituation or threshold misconfiguration + +**Reduced-motion support (F10):** WCAG 2.3.3 (Animation from Interactions — Level AAA) and WCAG 2.3.1 (Three Flashes or Below Threshold — Level A) apply. 
The 1 Hz CRITICAL banner colour cycle and any animated corridor rendering must respect the OS-level `prefers-reduced-motion: reduce` media query: + +```css +/* Default: animated */ +.critical-banner { animation: amber-cycle 1s step-end infinite; } + +/* Reduced motion: static high-contrast state */ +@media (prefers-reduced-motion: reduce) { + .critical-banner { + animation: none; + background-color: #7B4000; + border: 4px solid #FFD580; /* thick static border as redundant indicator */ + } +} +``` + +**Fatigue and cognitive load monitoring (F8 — §60):** Operators on long shifts exhibit reduced alertness. The following server-side rules trigger supervisor notifications without requiring operator interaction: + +| Condition | Trigger | Supervisor notification | +|-----------|---------|------------------------| +| Unacknowledged CRITICAL alert | > 10 minutes without acknowledgement | Push + email to org supervisor role: "CRITICAL alert unacknowledged for 10 minutes — [object, time]" | +| Stale HIGH alert | > 30 minutes without acknowledgement | Push to org supervisor: "HIGH alert unacknowledged for 30 minutes" | +| Long session without interaction | Logged-in operator: no UI interaction for 45 min during active event | Push to operator + supervisor: "Possible inactivity during active event — please verify" | +| Shift duration exceeded | Session age > `org.shift_duration_hours` (default 8h) | Non-blocking reminder to operator: "Your shift duration setting is 8 hours — consider handover" | + +Supervisor notifications are sent to users with `org_admin` or `supervisor` role. If no supervisor role is configured for the org, the notification escalates to SpaceCom internal ops via the existing PagerDuty route with `severity: warning`. All supervisor notifications are logged to `security_logs` with `event_type = SUPERVISOR_NOTIFICATION`. 
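The first two rows of the table above reduce to a simple server-side check (the dataclass fields and message strings are illustrative assumptions, not the real schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    severity: str        # 'CRITICAL' or 'HIGH'
    raised_at: datetime
    acknowledged: bool

def stale_alert_notifications(alerts: list, now: datetime) -> list:
    """Return supervisor notification messages for stale unacknowledged alerts."""
    thresholds = {"CRITICAL": timedelta(minutes=10), "HIGH": timedelta(minutes=30)}
    return [
        f"{a.severity} alert unacknowledged for {thresholds[a.severity].seconds // 60} minutes"
        for a in alerts
        if not a.acknowledged
        and a.severity in thresholds
        and now - a.raised_at > thresholds[a.severity]
    ]
```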
For CesiumJS corridor animations: check `window.matchMedia('(prefers-reduced-motion: reduce)').matches` on mount; if true, disable trajectory particle animation (Mode C) and set corridor opacity to a static value instead of pulsing. The preference is re-checked on change via `addEventListener('change', ...)` without requiring a page reload.

---

### 28.4 Probabilistic Communication to Non-Specialist Operators

Re-entry timing predictions are inherently probabilistic. Aviation operations personnel (Persona A/C) are trained in operational procedures, not orbital mechanics. The following design rules ensure probabilistic information is communicated without creating false precision or misinterpretation:

1. **No `±` notation for Persona A/C** — use explicit window ranges (`08h–20h from now`) with a "most likely" label; all absolute times rendered as `HH:MMZ` (e.g., `14:00Z`) or `DD MMM YYYY HH:MMZ` (e.g., `22 MAR 2026 14:00Z`) per ICAO Doc 8400 UTC-suffix convention; the `Z` suffix is not a tooltip — it is always rendered inline
2. **Space weather impact as operational buffer, not percentage** — `Add ≥2h beyond 95th percentile`, not `+18% wider uncertainty`
3. **Mode C particles require a mandatory first-use overlay** explaining that particles are not equiprobable; weighted opacity down-weights outliers (§6.4)
4. **"What does this mean?" expandable panel** on Event Detail for Persona C (incident commanders) explaining the window in operational terms
5. **Data confidence badges** contextualise all physical property estimates — `unknown` source triggers a warning callout above the prediction panel
6. **Tail risk annotation (F10):** The p5–p95 window is the primary display, but a 10% probability of re-entry outside that range is operationally significant. Below the primary window, display: *"Extreme case (2% probability outside this range, 1% in each tail): `p01_reentry_time`Z – `p99_reentry_time`Z"* — labelled clearly as a tail annotation, not the primary window.
This annotation is shown only when `p99_reentry_time - p01_reentry_time > 1.5 × (p95_reentry_time - p05_reentry_time)` (i.e., the tails are materially wider than the primary window). Also included as a footnote in NOTAM drafts when this condition is met. + +--- + +### 28.5 Error Recovery and Irreversible Actions + +| Action | Recovery mechanism | +|--------|--------------------| +| Analyst runs prediction with wrong parameters | `superseded_by` FK on `reentry_predictions` — marks old run as superseded; UI shows warning banner; original record preserved | +| Controller accidentally acknowledges CRITICAL alert | Two-step confirmation; structured category selection (see below) + optional free text; append-only audit log preserves full record | +| Analyst shares link to superseded prediction | `⚠ Superseded — see [newer run]` banner appears on the superseded prediction page for any viewer | +| Operator enters SIMULATION during live incident | `disable_simulation_during_active_events` org setting blocks mode switch while unacknowledged CRITICAL/HIGH alerts exist | + +**Structured acknowledgement categories** — replaces 10-character text minimum. Research consistently shows forced-text minimums under time pressure produce reflexive compliance (`1234567890`, `aaaaaaaaaa`) rather than genuine engagement, creating audit noise rather than evidence: + +```typescript +export const ACKNOWLEDGEMENT_CATEGORIES = [ + { value: 'NOTAM_ISSUED', label: 'NOTAM issued or requested' }, + { value: 'COORDINATING', label: 'Coordinating with adjacent FIR' }, + { value: 'MONITORING', label: 'Monitoring — no action required yet' }, + { value: 'ESCALATING', label: 'Escalating to incident command' }, + { value: 'OUTSIDE_MY_SECTOR', label: 'Outside my sector — passing to responsible unit' }, + { value: 'OTHER', label: 'Other (free text required below)' }, +] as const; +// Category selection is mandatory. Free text is optional except when value = 'OTHER'. 
+// alert_events.action_taken stores the category code; action_notes stores optional text. +``` + +**Acknowledgement form accessibility requirements (F3):** +- Each category option rendered as `` with an explicit `