SpaceCom/docs/Master_Plan.md
Allissa Auld 266ee30d4b docs: align roadmap with tender-fit requirements
Plan aircraft-risk modelling, CCSDS RDM support, tender-grade replay validation, and ESA software assurance artefacts in the implementation and master plans.
2026-04-17 20:31:37 +02:00


SpaceCom Master Development Plan

1. Vision

SpaceCom is a dual-domain re-entry debris hazard analysis platform that bridges the space and aviation domains. It is built by space engineers and operates as two interconnected products sharing a common physics core.

Space Domain (upstream): A technical analysis platform for space operators, orbital analysts, and space agencies — providing decay prediction with full uncertainty quantification, conjunction screening, controlled re-entry corridor planning, and a programmatic API layer for integration with existing space operations systems.

Aviation Domain (downstream): An operational decision support tool for ANSPs, airspace managers, and incident commanders — translating space domain predictions into actionable aviation safety outputs: hazard corridors, FIR intersection analysis, NOTAM drafting assistance, multi-ANSP coordination, and plain-language uncertainty communication.

SpaceCom's strategic position is the interface layer between two domains that currently do not speak the same language. The aviation safety gap is the commercial differentiator and the most underserved operational need in the market. The space domain physics depth — numerical decay prediction, atmospheric density modelling, conjunction probability, and controlled re-entry planning — is the technical credibility that distinguishes SpaceCom from aviation software vendors with bolt-on orbital mechanics.

Positioning statement for procurement: "SpaceCom is the missing operational layer between space domain awareness and aviation domain action — built by space engineers, designed for the people who have to make decisions when something is coming down."

AI-assisted development policy (F11): SpaceCom uses AI coding assistants (currently Claude Code) in the development workflow. AGENTS.md at the repository root defines the boundaries and conventions for this use. Key constraints:

  • AI assistants may generate, refactor, and review code, and draft documentation
  • AI assistants may not make autonomous decisions about safety-critical algorithm changes, authentication logic, or regulatory compliance text — all such changes require human review and an approved PR with explicit reviewer sign-off
  • AI-generated code is subject to identical review and testing standards as human-authored code — there is no reduced scrutiny for AI-generated contributions
  • AI assistants must not be given production credentials, access to live Space-Track API keys, or personal data
  • For ESA procurement purposes: all code in the repository, regardless of how it was authored, is the responsibility of the named human engineers. AI assistance is a development tool, not a co-author with liability

This policy is stated explicitly because ESA and other public-sector procurement frameworks increasingly ask whether and how AI tools are used in safety-relevant software development.


2. What We Keep from the Existing Codebase

The prototype established several good foundational choices:

  • Docker Compose orchestration — frontend, backend, and database run as isolated containers with a single docker compose up
  • FastAPI backend — lightweight, async-ready Python API server; already serves CZML orbital data
  • TimescaleDB + PostGIS — time-series hypertables for orbit data and geographic types for footprints; the orbits hypertable and reentry_predictions polygon column are well-suited to the domain
  • CesiumJS globe — proven 3D geospatial viewer with CZML support, already rendering orbital tracks with OSM tiles
  • CZML as the orbital data interchange format — native to Cesium, supports time-dynamic position, styling, and labels
  • Schema tables: objects, orbits, conjunctions, reentry_predictions — solid starting point for the data model (see §9 for required expansions)
  • Worker service slot — the architecture already anticipates background data ingestion

3. Architecture

3.1 Layered Design

┌─────────────────────────────────────────────────────┐
│                   Frontend (Web)                     │
│   Next.js + TypeScript + CesiumJS + Deck.gl          │
│   httpOnly cookies · CSP · security headers          │
├─────────────────────────────────────────────────────┤
│              TLS Termination (Caddy/Nginx)           │
│              HTTPS + WSS only; HSTS preload          │
├─────────────────────────────────────────────────────┤
│                   API Gateway                        │
│   FastAPI · RBAC middleware · rate limiting          │
│   JWT (RS256) · MFA enforcement · audit logging     │
├─────────────────────────────────────────────────────┤
│                 Core Services                        │
│   Hazard Engine · Event Orchestrator · CZML Builder  │
│   Frame Transform Service · Space Weather Cache      │
│   HMAC integrity signing · Alert integrity guard     │
├─────────────────────────────────────────────────────┤
│         Computational Workers (isolated network)     │
│   Celery tasks: propagation, decay, Monte Carlo      │
│   Per-job CPU time limits · resource caps            │
├─────────────────────────────────────────────────────┤
│    Report Renderer (network-isolated container)      │
│   Playwright headless · no external network access   │
├─────────────────────────────────────────────────────┤
│            Data Layer (backend_net only)             │
│   TimescaleDB+PostGIS · Redis (AUTH+TLS)             │
│   MinIO (private buckets · pre-signed URLs)          │
└─────────────────────────────────────────────────────┘

3.2 Service Breakdown

| Service | Runtime | Responsibility | Tier 2 Spec | Tier 3 Spec |
|---|---|---|---|---|
| frontend | Next.js on Node 22 / Nginx static | Globe UI, dashboards, event timeline, simulation controls | 2 vCPU / 4 GB | 2× (load balanced) |
| backend | FastAPI on Python 3.12 | REST + WebSocket API, authentication, RBAC, request validation, CZML generation, HMAC signing | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB (blue-green) |
| worker-sim | Python 3.12 + Celery --queue=simulation --concurrency=16 --pool=prefork | MC decay prediction (chord sub-tasks), breakup, conjunction, controlled re-entry. Isolated from frontend network. | 2× 16 vCPU / 32 GB | 4× 16 vCPU / 32 GB |
| worker-ingest | Python 3.12 + Celery --queue=ingest --concurrency=2 | TLE polling, space weather, DISCOS, IERS EOP. Never competes with simulation queue. | 2 vCPU / 4 GB | 2× 2 vCPU / 4 GB (celery-redbeat HA) |
| renderer | Python 3.12 + Playwright | PDF report generation only. No external network access. Receives sanitised data from backend via internal API call only. | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB |
| db | TimescaleDB (PostgreSQL 17 + PostGIS) | Persistent storage. RLS policies enforced. Append-only triggers on audit tables. | 8 vCPU / 64 GB / 1 TB NVMe | Primary + standby: 8 vCPU / 128 GB each; Patroni failover |
| redis | Redis 7 | Broker + cache + celery-redbeat schedule. AUTH required. TLS in production. ACL users per service. | 2 vCPU / 8 GB | Redis Sentinel: 3× 2 vCPU / 8 GB |
| minio | MinIO (S3-compatible) | Object storage. All buckets private. Pre-signed URLs only. | 4 vCPU / 8 GB / 4 TB | Distributed: 4× 4 vCPU / 16 GB / 2 TB NVMe |
| etcd | etcd 3 | Patroni DCS (distributed configuration store) for DB leader election | — | 3× 1 vCPU / 2 GB |
| pgbouncer | PgBouncer 1.22 | Connection pooler between all application services and TimescaleDB. Transaction-mode pooling. Prevents connection count exceeding max_connections at Tier 3. Single failover target point for Patroni switchover. | 1 vCPU / 1 GB | 1 vCPU / 1 GB (updated by Patroni on failover) |
| prometheus | Prometheus 2.x | Metrics scraping from all services; recording rules; AlertManager rules | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| grafana | Grafana OSS | Four dashboards (§26.7); Loki + Tempo + Prometheus datasources | 1 vCPU / 2 GB | 1 vCPU / 2 GB |
| loki | Grafana Loki 2.9 | Log aggregation; queried by Grafana; Promtail ships container logs | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| promtail | Grafana Promtail 2.9 | Scrapes Docker json-file logs; labels by service; ships to Loki | 0.5 vCPU / 512 MB | 0.5 vCPU / 512 MB |
| tempo | Grafana Tempo | Distributed trace backend (Phase 2); OTLP ingest; queried by Grafana | 2 vCPU / 4 GB | — |

Horizontal Scaling Trigger Thresholds (F9 — §58)

Tier upgrades are not automatic — SpaceCom is VPS-based and requires deliberate provisioning. The following thresholds trigger a scaling review meeting (not an automated action). The responsible engineer creates a tracked issue within 5 business days.

| Metric | Threshold | Sustained for | Tier transition indicated |
|---|---|---|---|
| Backend CPU utilisation | > 70% | 30 min | Tier 1 → Tier 2 (add second backend instance) |
| spacecom_ws_connected_clients | > 400 | 1 hour | Tier 1 → Tier 2 (WS ceiling at 500; add second backend) |
| Celery simulation queue depth | > 50 | 15 min (no active event) | Add simulation worker instance |
| MC p95 latency | > 180 s (75% of 240 s SLO) | 3 consecutive runs | Add simulation worker instance |
| DB CPU utilisation | > 60% | 1 hour | Tier 2 → Tier 3 (read replica + Patroni) |
| DB disk used | > 70% of provisioned | — | Expand disk before hitting 85% |
| Redis memory used | > 60% of maxmemory | — | Increase maxmemory or add Redis instance |

Scaling decisions are recorded in docs/runbooks/capacity-limits.md with: metric value at decision time, decision made, provisioning timeline, and owner. This file is the authoritative capacity log for ESA and ANSP audits.
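The "sustained for" semantics above can be pinned down in code. A minimal sketch, assuming metrics arrive at a fixed sampling interval (e.g. one Prometheus scrape per minute); the function name and sampling model are illustrative, not part of the plan:

```python
def breach_sustained(samples: list[float], threshold: float, samples_needed: int) -> bool:
    """True when the trailing `samples_needed` samples all exceed `threshold`.

    With a 1-minute scrape interval, "backend CPU > 70% sustained for 30 min"
    becomes breach_sustained(cpu_samples, 70.0, 30). A True result opens a
    tracked scaling-review issue; it never triggers automated provisioning.
    """
    if len(samples) < samples_needed:
        return False  # not enough history to call the breach sustained
    return all(v > threshold for v in samples[-samples_needed:])
```

A single dip below the threshold resets the window, which is the intended behaviour: transient spikes during an active event should not open capacity issues.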

Redis ACL Definition

SpaceCom uses two Redis trust domains:

  • redis_app for sessions, rate limits, WebSocket delivery state, commercial-enforcement deferrals, and other application state where stronger consistency and tighter access separation are required
  • redis_worker for Celery broker/result traffic and ephemeral cache data, where limited inconsistency during failover is acceptable

This split is deliberate. It prevents worker-side compromise from reaching session state and avoids applying the distributed-systems split-brain risk acceptance for ephemeral workloads to user-session or entitlement-adjacent state.

Each Redis service gets its own ACL users with the minimum required key namespace:

# redis_app/acl.conf - bind-mounted into the application Redis container
# Backend: sole application client of redis_app; full keyspace here covers session tokens, rate-limit counters, and WebSocket tracking
user spacecom_backend on >${REDIS_BACKEND_PASSWORD} ~* &* +@all

# Disable unauthenticated default user
user default off

# redis_worker/acl.conf - bind-mounted into the worker Redis container
# Simulation worker: Celery broker/result namespaces only
user spacecom_worker on >${REDIS_WORKER_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Ingest worker: same scope as simulation worker
user spacecom_ingest on >${REDIS_INGEST_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Disable unauthenticated default user
user default off

Mount in docker-compose.yml:

redis_app:
  volumes:
    - ./redis_app/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...

redis_worker:
  volumes:
    - ./redis_worker/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...

Separate passwords (REDIS_BACKEND_PASSWORD, REDIS_WORKER_PASSWORD, REDIS_INGEST_PASSWORD) are defined in §30.3. Each rotates independently on the 90-day schedule. Redis Sentinel split-brain risk acceptance in §67 applies to redis_worker only; redis_app is treated as higher-integrity application state and is not covered by that acceptance.

3.3 Docker Compose Services and Network Segmentation

Services are assigned to isolated Docker networks. A compromised container on one network cannot directly reach services on another.

networks:
  frontend_net:   # frontend → backend only
  backend_net:    # backend → db, redis, minio, pgbouncer
  worker_net:     # worker → pgbouncer, redis, minio (no backend access; pgbouncer pools DB connections)
  renderer_net:   # backend → renderer only; renderer has no external egress
  db_net:         # db, pgbouncer, etcd — never exposed to frontend_net

services:
  frontend:    networks: [frontend_net]
  backend:     networks: [frontend_net, backend_net, renderer_net]  # +renderer_net: backend calls renderer API
  worker-sim:  networks: [worker_net]
  worker-ingest: networks: [worker_net]
  renderer:    networks: [renderer_net]   # backend-initiated calls only; no outbound to backend_net
  db:          networks: [backend_net, db_net]   # workers reach the DB via pgbouncer only
  pgbouncer:   networks: [backend_net, worker_net, db_net]  # pooling for both backend AND workers
  etcd:        networks: [db_net]
  redis:       networks: [backend_net, worker_net]
  minio:       networks: [backend_net, worker_net]

Network topology rules:

  • Workers connect to DB via pgbouncer:5432, not db:5432 directly — enforced by workers' DATABASE_URL env var pointing to PgBouncer.
  • The backend is on renderer_net so it can call renderer:8001; the renderer cannot initiate connections to backend_net.
  • db_net contains only TimescaleDB, PgBouncer, and etcd. No application service connects directly to this network except PgBouncer.

Container resource limits — without explicit limits a runaway simulation worker OOM-kills the database (Linux OOM killer targets the largest RSS consumer):

services:
  backend:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
        reservations: { memory: 512M }

  worker-sim:
    deploy:
      resources:
        limits: { cpus: '16.0', memory: 32G }
        reservations: { memory: 2G }
    stop_grace_period: 300s   # allows long MC jobs to finish before SIGKILL
    command: >
      celery -A app.worker worker
        --queue=simulation
        --concurrency=16
        --pool=prefork
        --without-gossip
        --without-mingle
        --max-tasks-per-child=100
    pids_limit: 64            # prefork: 16 children + Beat + parent + overhead

  worker-ingest:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 4G }
    stop_grace_period: 60s
    pids_limit: 16

  renderer:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
    pids_limit: 100           # Chromium spawns ~5 processes per render × concurrent renders
    tmpfs:
      - /tmp/renders:size=512m,mode=1777   # PDF scratch; never written to persistent layer
    environment:
      RENDER_OUTPUT_DIR: /tmp/renders

  db:
    deploy:
      resources:
        limits: { memory: 64G }   # explicit cap; prevents OOM killer targeting db

  redis:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 8G }

  minio:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }

Note: deploy.resources limits are honoured by Docker Compose v2 without Swarm mode (the Compose Specification adopted them from the 3.x file format). Verify with docker compose version ≥ 2.0.

All containers run as non-root users, with read-only root filesystems and dropped capabilities (see §7.10), except for the renderer container's documented SYS_ADMIN exception in §7.11. That exception is accepted only for the renderer, must never be copied to other services, and requires stricter network isolation and annual review.

Host Bind Mounts

All directories that operators need to access directly on the VPS — logs, generated exports, config, and backups — are bind-mounted from the host filesystem. This means no docker compose exec is required for routine operations: log tailing, reading generated files, editing config, or recovering a backup.

services:
  backend:
    volumes:
      - ./logs/backend:/app/logs          # structured JSON logs; tail directly on host
      - ./exports:/app/exports            # org export ZIPs, report PDFs
      - ./config/backend.toml:/app/config/settings.toml:ro  # edit on host; container reads

  worker-sim:
    volumes:
      - ./logs/worker-sim:/app/logs
      - ./exports:/app/exports            # shared export directory with backend

  worker-ingest:
    volumes:
      - ./logs/worker-ingest:/app/logs

  frontend:
    volumes:
      - ./logs/frontend:/app/logs

  db:
    volumes:
      - /data/postgres:/var/lib/postgresql/data   # DB data on host disk; survives container recreation
      - ./backups/db:/backups                      # pg_basebackup output directly accessible on host

  minio:
    volumes:
      - /data/minio:/data                          # object storage on host disk

Host-side directory layout (under /opt/spacecom/):

/opt/spacecom/
  logs/
    backend/          ← tail -f logs/backend/app.log
    worker-sim/
    worker-ingest/
    frontend/
  exports/            ← ls exports/ to see generated reports and org export ZIPs
  config/
    backend.toml      ← edit directly; restart backend container to apply
  backups/
    db/               ← pg_basebackup archives; rsync to offsite from here
/data/
  postgres/           ← TimescaleDB data files (outside /opt to avoid accidental compose down -v)
  minio/              ← MinIO object data

Key rules:

  • /data/postgres and /data/minio live outside the project directory so docker compose down -v cannot accidentally wipe them (Compose only removes named volumes, not bind-mounted host paths, but keeping them separate is an additional safeguard)
  • Log directories are created by make init-dirs before first docker compose up; containers write to them as a non-root user (UID 1000); host operator reads as the same UID or via sudo
  • Config files are mounted :ro (read-only) inside the container — a misconfigured backend cannot overwrite its own config
  • make logs SERVICE=backend is a convenience alias for tail -f /opt/spacecom/logs/backend/app.log

Port Exposure Map

| Port | Service | Exposed to | Notes |
|---|---|---|---|
| 80 | Caddy | Public internet | HTTP → HTTPS redirect only |
| 443 | Caddy | Public internet | TLS termination; proxies to backend/frontend |
| 8000 | Backend API | Internal (frontend_net) | Never directly internet-facing |
| 3000 | Frontend (Next.js) | Internal (frontend_net) | Caddy proxies; HMR port 3001 dev-only |
| 5432 | TimescaleDB | Internal (db_net) | Never exposed to frontend_net or host |
| 6379 | Redis | Internal (backend_net, worker_net) | AUTH required; no public exposure |
| 9000 | MinIO API | Internal (backend_net, worker_net) | Pre-signed URL access only from outside |
| 9001 | MinIO Console | Internal (db_net) | Never exposed publicly; admin use only |
| 5555 | Flower (Celery monitor) | Internal only | VPN/bastion access only in production |
| 2379/2380 | etcd (Patroni DCS) | Internal (db_net) | Never exposed outside db_net |

CI check: scripts/check_ports.py — parses docker-compose.yml and all docker-compose.*.yml overrides; fails if any port from the "never-exposed" category appears in a ports: mapping. Runs in every CI pipeline.
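A sketch of the check's core logic, using a deliberately naive line scanner rather than a YAML parser so it stays dependency-free in CI; the forbidden-port set and function names here are illustrative, and the port table above remains authoritative:

```python
import re

# Ports from the "never-exposed" category above (illustrative copy)
FORBIDDEN_HOST_PORTS = {8000, 3000, 5432, 6379, 9000, 9001, 5555, 2379, 2380}

def published_host_ports(compose_text: str) -> set[int]:
    """Collect host ports published via `ports:` mappings like '- "5432:5432"'."""
    ports: set[int] = set()
    in_ports, indent = False, 0
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped == "ports:":
            in_ports, indent = True, len(line) - len(line.lstrip())
            continue
        if in_ports:
            if stripped.startswith("-") and len(line) - len(line.lstrip()) > indent:
                m = re.match(r'-\s*"?(\d+):\d+', stripped)
                if m:
                    ports.add(int(m.group(1)))
            elif stripped:  # dedented back out of the ports list
                in_ports = False
    return ports

def violations(compose_text: str) -> list[int]:
    """Non-empty result fails the CI job."""
    return sorted(published_host_ports(compose_text) & FORBIDDEN_HOST_PORTS)
```

The real script must also walk every docker-compose.*.yml override, since an override file can add a ports: mapping the base file never had.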

Infrastructure-Level Egress Filtering

Docker's built-in iptables rules prevent inter-network lateral movement but do not restrict egress to the public internet from within a network. An egress filtering layer is mandatory at Tier 2 and Tier 3.

Allowed outbound destinations (whitelist):

| Service | Allowed destination | Protocol | Purpose |
|---|---|---|---|
| ingest_worker | www.space-track.org | HTTPS/443 | TLE / conjunction data |
| ingest_worker | services.swpc.noaa.gov | HTTPS/443 | Space weather |
| ingest_worker | discosweb.esac.esa.int | HTTPS/443 | DISCOS object catalogue |
| ingest_worker | celestrak.org | HTTPS/443 | TLE cross-validation |
| ingest_worker | iers.org | HTTPS/443 | EOP download |
| backend | SMTP relay (org-internal) | SMTP/587 | Alert email |
| All containers | Internal Docker networks | Any | Normal operation |
| All containers | All other destinations | Any | BLOCKED |

Implementation: UFW or nftables rules on host (Tier 2); network policy + Calico/Cilium (Tier 3 Kubernetes migration); explicit allow-list in docs/runbooks/egress-filtering.md. Violations logged at WARN; repeated violations at CRITICAL.
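Host-level filtering is the enforcement layer; as defence in depth, ingest code can also refuse to construct requests to hosts outside the whitelist. A hedged sketch: `assert_egress_allowed` and the dict layout are illustrative, and the authoritative allow-list lives in docs/runbooks/egress-filtering.md:

```python
from urllib.parse import urlparse

# Mirror of the whitelist table above (illustrative copy; the runbook is authoritative)
ALLOWED_EGRESS: dict[str, set[str]] = {
    "ingest_worker": {
        "www.space-track.org",
        "services.swpc.noaa.gov",
        "discosweb.esac.esa.int",
        "celestrak.org",
        "iers.org",
    },
}

def assert_egress_allowed(service: str, url: str) -> None:
    """Raise before any outbound request to a non-whitelisted destination."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_EGRESS.get(service, set()):
        # The host firewall would block this anyway; failing here gives a clean log line
        raise PermissionError(f"egress blocked: {service} -> {host or '<no host>'}")
```

Calling this immediately before each HTTP request means a mis-merged ingest URL fails loudly in tests rather than silently at the firewall.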


4. Coordinate Frames and Time Systems

This section is non-negotiable infrastructure. Silent frame mismatches invalidate all downstream computation. All developers must understand and implement the conventions below before writing any propagation or display code.

4.1 Reference Frame Pipeline

TLE input
   │
   ▼ sgp4 library propagation
TEME (True Equator Mean Equinox)     ← SGP4 native output; do NOT store as final product
   │
   ▼ IAU 2006 precession-nutation (or Vallado TEME→J2000 simplification)
GCRF / J2000 (Geocentric Celestial Reference Frame)
   │                │
   │                ▼ CZML INERTIAL frame ← CesiumJS expects GCRF/ICRF, not TEME
   │
   ▼ IAU Earth Orientation Parameters (EOP): IERS Bulletin A/B
ITRF (International Terrestrial Reference Frame)   ← Earth-fixed; use for database storage
   │
   ▼ WGS84 geodetic transformation
Latitude / Longitude / Altitude     ← For display, hazard zones, airspace intersections

Implementation: Use astropy (astropy.coordinates, astropy.time) for all frame conversions. It handles IERS EOP download and interpolation automatically. For performance-critical batch conversions, pre-load EOP tables and vectorise.

4.2 CesiumJS Frame Convention

  • CZML position with referenceFrame: "INERTIAL" expects ICRF/J2000 Cartesian coordinates in metres
  • SGP4 outputs are in TEME and must be rotated to J2000 before being written into CZML
  • CZML position with referenceFrame: "FIXED" expects ITRF Cartesian in metres
  • Never pipe raw TEME coordinates into CesiumJS

4.3 Time System Conventions

| System | Where Used | Notes |
|---|---|---|
| UTC | System-wide reference. All API timestamps, database timestamps, CZML epochs | Convert immediately at ingestion boundary |
| UT1 | Earth rotation angle for ITRF↔GCRF conversion | UT1-UTC offset from IERS EOP |
| TT (Terrestrial Time) | astropy internal; precession-nutation models | ~69 s ahead of UTC |
| TLE epoch | Encoded in TLE line 1 as year + day-of-year fraction | Parse to UTC immediately |
| GPS time | May appear in precision ephemeris products | GPS = UTC + 18 s as of 2024 |

Rule: Store all timestamps as TIMESTAMPTZ in UTC. Convert to local time only at presentation boundaries.
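The TLE-epoch convention above ("parse to UTC immediately") is small enough to pin down in code. A sketch using only the standard library; the function name and the usual 57-year pivot for two-digit years are stated assumptions:

```python
from datetime import datetime, timedelta, timezone

def tle_epoch_to_utc(epoch_field: str) -> datetime:
    """Convert a TLE line-1 epoch field ('YYDDD.DDDDDDDD') to a UTC datetime.

    Two-digit years pivot at 57 (the common TLE convention: 57-99 -> 19xx,
    00-56 -> 20xx). Day-of-year is 1-based, so day 1.0 is Jan 1 00:00 UTC.
    """
    yy = int(epoch_field[:2])
    year = 2000 + yy if yy < 57 else 1900 + yy
    day_of_year = float(epoch_field[2:])
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1.0)
```

Converting at the ingestion boundary, as the rule requires, means no TLE-native epoch ever reaches the database or API layer.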

4.4 Coordinate Reference System Contract (F1 — §62)

The CRS used at every system boundary is documented in docs/COORDINATE_SYSTEMS.md. This is the authoritative single-page reference for any engineer writing frame conversion code.

| Boundary | CRS | Format | Notes |
|---|---|---|---|
| SGP4 output | TEME (True Equator Mean Equinox) | Cartesian metres | Must not leave physics/ without conversion |
| Physics → CZML builder | GCRF/J2000 | Cartesian metres | Explicit teme_to_gcrf() call |
| CZML position (INERTIAL) | GCRF/J2000 | Cartesian metres | referenceFrame: "INERTIAL" |
| CZML position (FIXED) | ITRF | Cartesian metres | referenceFrame: "FIXED" |
| Database storage (orbits) | GCRF/J2000 | Cartesian metres | Consistent with CZML inertial |
| Corridor polygon (DB) | WGS-84 geographic | GEOGRAPHY(POLYGON) SRID 4326 | Geodetic lat/lon from ITRF→WGS-84 |
| FIR boundary (DB) | WGS-84 geographic | GEOMETRY(POLYGON, 4326) | Planar approx. for regional FIRs |
| API response | WGS-84 geographic | GeoJSON (EPSG:4326) | Degrees; always lon,lat order (GeoJSON spec) |
| Globe display (CesiumJS) | ICRF (= GCRF for practical purposes) | Cartesian metres via CZML | CesiumJS handles geodetic display |
| Altitude display | WGS-84 ellipsoidal | km or ft (user preference) | See §4.4a for datum labelling |

Antimeridian and pole handling (F5 — §62):

  • Antimeridian: Corridor polygons stored as GEOGRAPHY handle antimeridian crossing correctly — PostGIS GEOGRAPHY uses spherical arithmetic and does not wrap coordinates. CesiumJS CZML polygon positions must be expressed as a continuous polyline; for antimeridian-crossing corridors, the CZML serialiser must not clamp coordinates to ±180° — pass the raw ITRF→geodetic output. CesiumJS handles coordinate wrapping internally when referenceFrame: "FIXED" is used for corridor polygons.
  • Polar orbits: For objects with inclination > 80°, the ground track corridor may approach or cross the poles. ST_AsGeoJSON on a GEOGRAPHY polygon that passes within ~1° of a pole can produce degenerate output (longitude undefined at the pole itself). Mitigation: before storing, check ST_DWithin(corridor, ST_GeogFromText('SRID=4326;POINT(0 90)'), 111000) (within 1° of north pole) or south pole equivalent — if true, log a POLAR_CORRIDOR_WARNING and clip the polygon to 89.5° max latitude. This is a rare case (ISS incl. 51.6°; most rocket bodies are below 75° incl.) but must not crash the pipeline.
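The pole mitigation above runs in SQL via ST_DWithin; an equivalent application-side guard for the serialiser path can be sketched as follows (the function name, threshold constant, and vertex representation are illustrative):

```python
POLAR_CLIP_LAT_DEG = 89.5  # matches the 89.5° max-latitude clip described above

def clip_polar_corridor(vertices: list[tuple[float, float]], log_warning=print):
    """Clamp (lon, lat) corridor vertices that approach a pole.

    Longitude is undefined at the pole itself, so vertices beyond +/-89.5°
    latitude are clamped and a POLAR_CORRIDOR_WARNING is emitted; the
    pipeline degrades gracefully instead of crashing on degenerate geometry.
    """
    clipped, polar = [], False
    for lon, lat in vertices:
        if abs(lat) > POLAR_CLIP_LAT_DEG:
            polar = True
            lat = max(-POLAR_CLIP_LAT_DEG, min(POLAR_CLIP_LAT_DEG, lat))
        clipped.append((lon, lat))
    if polar:
        log_warning("POLAR_CORRIDOR_WARNING: corridor clipped to ±89.5° latitude")
    return clipped
```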

docs/COORDINATE_SYSTEMS.md is a Phase 1 deliverable. Tests in tests/test_frame_utils.py serve as executable verification of the contract.

4.5 Implementation Checklist

  • frame_utils.py: teme_to_gcrf(), gcrf_to_itrf(), itrf_to_geodetic()
  • Unit tests against Vallado 2013 reference cases
  • EOP data auto-refresh: weekly Celery task pulling IERS Bulletin A; verify SHA-256 checksum of downloaded file before applying
  • CZML builder uses gcrf_to_czml_inertial() — explicit function, never implicit conversion
  • docs/COORDINATE_SYSTEMS.md committed with CRS boundary table
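One way to make the CRS contract executable inside frame_utils.py is to tag every state vector with its frame and refuse implicit conversions at serialisation boundaries. A sketch under stated assumptions: Frame, StateVector, and to_czml_inertial() are illustrative names, not items from the checklist above:

```python
from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    TEME = "TEME"   # SGP4 native output; must never reach CZML or the DB
    GCRF = "GCRF"   # J2000 inertial; CZML INERTIAL and orbit storage
    ITRF = "ITRF"   # Earth-fixed; CZML FIXED and geodetic conversion

@dataclass(frozen=True)
class StateVector:
    frame: Frame
    x: float  # metres
    y: float
    z: float

class FrameError(Exception):
    """Raised when a position in the wrong frame reaches a serialisation boundary."""

def to_czml_inertial(state: StateVector) -> list[float]:
    """CZML referenceFrame 'INERTIAL' positions must be GCRF/J2000."""
    if state.frame is not Frame.GCRF:
        raise FrameError(f"CZML INERTIAL requires GCRF, got {state.frame.value}")
    return [state.x, state.y, state.z]
```

With this shape, "never pipe raw TEME into CesiumJS" becomes a unit-testable property rather than a convention that depends on reviewer vigilance.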

5. User Personas

All UX decisions are traceable to one of the six personas (A–F) defined here. Navigation, default views, information hierarchy, and alert behaviour must serve user tasks — not the system's internal module structure.

Persona A — Operational Airspace Manager

Role: ANSP or aviation authority staff. Responsible for airspace safety decisions in real-time or near-real-time.

Primary question: "Is any airspace under my responsibility affected in the next 6–12 hours, and what do I need to do about it?"

Key needs: Immediate situational awareness, clear go/no-go spatial display for their region, alert acknowledgement workflow, one-click advisory export, minimal cognitive load.

Tolerance for complexity: Very low.


Persona B — Safety Analyst

Role: Space agency, authority research arm, or consultancy. Conducts detailed re-entry risk assessments for regulatory submissions or post-event reports.

Primary question: "What is the full uncertainty envelope, what assumptions drove the prediction, and how does this compare to previous similar events?"

Key needs: Full simulation parameter access, run comparison, numerical uncertainty detail, full data provenance, configurable report generation, historical replay.

Tolerance for complexity: High.


Persona C — Incident Commander

Role: Senior official coordinating response during an active re-entry event. Uses the platform as a shared situational awareness tool in a briefing room.

Primary question: "Where exactly is it coming down, when, and what is the worst-case affected area right now?"

Key needs: Clean large-format display, auto-narrowing corridor updates, countdown timer, plain-language status summary, shareable live-view URL.

Tolerance for complexity: Low.


Persona D — Systems Administrator / Data Manager

Role: Technical operator managing system health, data ingest, model configuration, and user accounts.

Primary question: "Is everything ingesting correctly, are data sources healthy, and are workers keeping up?"

Key needs: System health dashboard, ingest job status, worker queue metrics, model version management, user and role management.

Tolerance for complexity: High technical tolerance.


Persona E — Space Operator

Role: Satellite or launch vehicle operator responsible for one or more objects in the SpaceCom catalog. May be a commercial operator, a national space agency operating assets, or a launch service provider managing spent upper stages.

Primary question: "What is the current decay prediction for my objects, when do I need to act, and if I have manoeuvre capability, what deorbit window minimises ground risk?"

Key needs: Object-scoped view showing only their registered objects; decay prediction with full Monte Carlo detail; controlled re-entry corridor planner (for objects with remaining propellant); conjunction alert for their own objects; API key management for programmatic integration with their own operations centre; exportable predictions for regulatory submission under national space law.

Tolerance for complexity: High — these are trained orbital engineers, not ATC professionals.

Regulatory context: Many space operators have legal obligations under national space law (e.g., Australia Space (Launches and Returns) Act 2018, FAA AST licensing) to demonstrate responsible end-of-life management. SpaceCom outputs serve as supporting evidence for those submissions. The platform must produce artefacts suitable for regulatory audit.


Persona F — Orbital Analyst

Role: Technical analyst at a space agency, research institution, safety consultancy, or the SSA/STM office of a national authority. Conducts orbital analysis, validates predictions, and produces technical assessments — potentially across the full catalog, not just owned objects.

Primary question: "What does the full orbital picture look like for this object class, how do SpaceCom predictions compare to other tools, and what are the statistical properties of the prediction ensemble?"

Key needs: Full catalog read access; conjunction screening across arbitrary object pairs; simulation parameter tuning and comparison; bulk export (CSV, JSON, CCSDS formats); access to raw propagation outputs (state vectors, covariance matrices); historical validation runs; API access for batch processing.

Tolerance for complexity: Very high — this persona builds the technical evidence base that other personas act on.


6. UX Design Specification

This section translates engineering capability into concrete interface designs. All designs are persona-linked and phase-scheduled.

6.1 Information Architecture — Task-Based Navigation

Navigation is organised around user tasks, not backend modules. Module names never appear in the UI.

The platform has two navigation domains — Aviation (default for Persona A/B/C) and Space (for Persona E/F). Both are accessible from the top navigation. The root route (/) defaults to the domain matched to the user's role on login.

Aviation Domain Navigation:

/                   → Operational Overview       (Persona A, C primary)
/watch/{norad_id}   → Object Watch Page          (Persona A, B)
/events             → Active Events + Timeline   (Persona A, C)
/events/{id}        → Event Detail               (Persona A, B, C)
/airspace           → Airspace Impact View       (Persona A)
/analysis           → Analyst Workspace          (Persona B primary)
/catalog            → Object Catalog             (Persona B)
/reports            → Report Management          (Persona A, B)
/admin              → System Administration      (Persona D)

Space Domain Navigation:

/space                        → Space Operator Overview      (Persona E, F primary)
/space/objects                → My Objects Dashboard         (Persona E — owned objects only)
/space/objects/{norad_id}     → Object Technical Detail      (Persona E, F)
/space/reentry/plan           → Controlled Re-entry Planner  (Persona E)
/space/conjunction            → Conjunction Screening        (Persona F)
/space/analysis               → Orbital Analyst Workspace    (Persona F)
/space/export                 → Bulk Export                  (Persona F)
/space/api                    → API Keys + Documentation     (Persona E, F)

The 3D globe is a shared component embedded within pages, not a standalone page. Different pages focus and configure the globe differently.


6.2 Operational Overview Page (/)

Landing page for Persona A and C. Loads immediately without configuration.

Layout:

┌─────────────────────────────────────────────────────────────────┐
│  [● LIVE]  SpaceCom    [Space Weather: ELEVATED ▲]  [Alerts: 2] │
├──────────────────────────────┬──────────────────────────────────┤
│                              │  ACTIVE EVENTS                   │
│    3D GLOBE                  │  ● CZ-5B R/B  44878             │
│    (active events +          │    Window: 08h – 20h from now   │
│     affected FIRs only)      │    Most likely ~14h from now     │
│                              │    YMMM FIR — HIGH               │
│                              │    [View] [Corridor]             │
│                              │  ─────────────────────────────   │
│                              │  ○ SL-16 R/B  28900             │
│                              │    Window: 54h–90h from now     │
│                              │    Most likely ~72h from now     │
│                              │    Ocean — LOW                   │
│                              │                                  │
│                              │  72-HOUR TIMELINE                │
│                              │  [Gantt strip]                   │
│                              │                                  │
│                              │  SPACE WEATHER                   │
│                              │  Activity: ELEVATED              │
│                              │  Extend window: add ≥2h buffer   │
├──────────────────────────────┴──────────────────────────────────┤
│  [● Live]  ──────────●──────────────────────────────  +72h      │
└─────────────────────────────────────────────────────────────────┘

Globe default state: Active decay objects and their corridors only. All other objects hidden. Affected FIR boundaries highlighted. No orbital tracks unless the user expands an event card.

Temporal uncertainty display — Persona A/C: Event cards and the Operational Overview show window ranges in plain language (Window: 08h–20h from now / Most likely ~14h from now), never ± N notation. The ± form implies symmetric uncertainty, which re-entry distributions are not. The Analyst Workspace (Persona B) additionally shows raw p05/p50/p95 UTC times.
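The plain-language rendering described above can be sketched as follows. The function name and the whole-hour rounding are illustrative assumptions — the spec only fixes the wording:

```python
from datetime import datetime, timezone

def plain_language_window(p05: datetime, p50: datetime, p95: datetime,
                          now: datetime) -> dict:
    """Render an asymmetric re-entry window as hours-from-now strings,
    never as a symmetric "plus/minus N" figure."""
    def hours_from_now(t: datetime) -> int:
        # Whole-hour rounding is an assumption, not part of the spec.
        return round((t - now).total_seconds() / 3600)

    return {
        "window": f"Window: {hours_from_now(p05):02d}h–{hours_from_now(p95):02d}h from now",
        "most_likely": f"Most likely ~{hours_from_now(p50)}h from now",
    }
```

With a p05 8 hours out, p50 14 hours out, and p95 20 hours out, this reproduces the card text in the Operational Overview mock-up.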


6.3 Time Navigation System

Three modes — always visible, always unambiguous. Mixing modes without explicit user intent is prohibited.

| Mode | Indicator | Description |
| --- | --- | --- |
| LIVE | Green pulsing pill: ● LIVE | Current real-world state. Globe and predictions update from live feeds. |
| REPLAY | Amber pill: ⏪ REPLAY 2024-01-14 03:22 UTC | Replaying a historical event. All data fixed. No live updates. |
| SIMULATION | Purple pill: ⚗ SIMULATION — [object name] | Custom scenario. Data is synthetic. Must never be confused with live. |

The mode indicator is persistent in the top nav bar. Switching modes requires explicit action through a mode-switch dialogue — it cannot happen implicitly.

Mode-switch dialogue specification:

When the user initiates a mode switch (e.g., LIVE → SIMULATION), the following modal must appear. The dialogue must explicitly state the current mode, the target mode, and all operational consequences:

SWITCH TO SIMULATION MODE?
──────────────────────────────────────────────────────────────
You are currently viewing LIVE data.
Switching to SIMULATION will display synthetic scenario data.

  ⚠ Alerts and notifications are suppressed in SIMULATION.
  ⚠ Simulation data must never be used for operational decisions.
  ⚠ Other users will not see your simulation.

[Cancel]                          [Switch to Simulation ▶]
──────────────────────────────────────────────────────────────

Rules:

  • Cancel on left, destructive action on right (consistent with aviation HMI conventions)
  • The dialogue must always show both the current mode and target mode — never just "are you sure?"
  • Equivalent dialogues apply for all mode transitions (LIVE ↔ REPLAY, LIVE ↔ SIMULATION, etc.)

Simulation mode block during active alerts: If the organisation has disable_simulation_during_active_events enabled (admin setting, default: off), the SIMULATION mode switch is blocked whenever there are unacknowledged CRITICAL or HIGH alerts. A modal replaces the switch dialogue:

CANNOT ENTER SIMULATION
──────────────────────────────────────────────────────────────
2 active CRITICAL alerts require acknowledgement.
Acknowledge all active alerts before running simulations.

[View active alerts]                              [Cancel]
──────────────────────────────────────────────────────────────

Document disable_simulation_during_active_events prominently in the admin UI: "Enable only if your organisation has a dedicated SpaceCom monitoring role separate from simulation users."

Timeline control — two zoom levels:

  • Event scale (default): 72 hours, 6-hour intervals. Re-entry windows shown as coloured bars.
  • Orbital scale: 4-hour window, 15-minute intervals. For orbital passes and conjunction events.

LIVE mode scrub: User can drag the playhead into the future to preview a predicted corridor. A "Return to Live" button appears whenever the playhead is not at current time.

Future-preview temporal wash: When the timeline playhead is not at current time (user is previewing a future state), the entire right-panel event list and alert badges are overlaid with a temporal wash (semi-transparent grey overlay) and a persistent label:

┌──────────────────────────────────────────────────────────────┐
│  ⏩ PREVIEWING  +4h 00m — not current state  [Return to Live] │
└──────────────────────────────────────────────────────────────┘

The wash and label prevent a controller from acting on predicted-future data as though it were current. The globe corridor may show the projected state; the event list must be visually distinct. Alert badges are greyed and annotated "(projected)" in preview mode. Alert sounds and notifications are suppressed while previewing.


6.4 Uncertainty Visualisation — Three Phased Modes

Three representations are planned across phases. All are user-selectable via the UncertaintyModeSelector once implemented. Each page context has a recommended default.

Mode selector (appears in the layer controls panel whenever corridor data is loaded):

Corridor Display
● Percentile Corridors    ← Phase 1
○ Probability Heatmap     ← Phase 2
○ Monte Carlo Particles   ← Phase 3

Modes B and C appear greyed in the selector until their phase ships.


Mode A — Percentile Corridors (Phase 1, default for Persona A/C)

What it shows: Three nested polygon swaths on the globe — 5th, 50th, and 95th percentile ground track corridors from Monte Carlo output.

Visual encoding:

  • 95th percentile: wide, 15% opacity amber fill, dashed border — hazard extent
  • 50th percentile: medium, 35% opacity amber fill, solid border — nominal corridor
  • 5th percentile: narrow, 60% opacity amber fill, bold border — high-probability core

Colour by risk level: Ocean-only → blue family; partial land → amber; significant land → red-orange.

Over time: As the re-entry window narrows, the outer swath contracts automatically in LIVE mode. The user watches the corridor "tighten" in real-time.


Mode B — Probability Heatmap (Phase 2, default for Persona B)

What it shows: Continuous colour-ramp Deck.gl heatmap. Each cell's colour encodes probability density of ground impact across the full Monte Carlo sample set.

Visual encoding: Perceptually uniform, colour-blind-safe sequential palette (viridis or custom blue-white-orange). Scale normalised to the maximum probability cell; legend with percentile labels always shown.

Interaction: Hover a cell → tooltip shows "~N% probability of impact within this 50×50 km cell." The heatmap is recomputed client-side if the user adjusts the re-entry window bounds via the timeline.
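The per-cell probability behind the tooltip can be derived directly from the Monte Carlo sample set. A simplified lat/lon binning sketch (names are illustrative; a real implementation would bin in a projected 50×50 km grid, since longitude degree width varies with latitude):

```python
from collections import Counter

CELL_KM = 50.0
KM_PER_DEG = 111.0  # rough km per degree of latitude; fine for a sketch

def cell_probability(samples: list) -> dict:
    """Bin MC impact points (lat, lon) into ~50 km cells and return
    per-cell impact probability = sample count / total samples."""
    deg = CELL_KM / KM_PER_DEG
    counts = Counter((int(lat // deg), int(lon // deg)) for lat, lon in samples)
    total = len(samples)
    return {cell: n / total for cell, n in counts.items()}
```

Per-cell probabilities sum to 1 over the full sample set, so recomputing after the user narrows the window bounds just means rebinning the filtered samples.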


Mode C — Monte Carlo Particle Visualisation (Phase 3, Persona B advanced / Persona C briefing)

What it shows: 50–200 animated MC sample trajectory lines converging from re-entry interface altitude (~80 km) to impact. Particle colour encodes F10.7 assumption (cool = low solar activity = later re-entry, warm = high). Impact points persist as dots.

Interaction: Play/pause animation; scrub to any point in the trajectory; click a particle to see its parameter set (F10.7, Ap, B*).

Performance: Use CesiumJS Primitive API with per-instance colour attributes — not Entity API. Trajectory geometry pre-baked server-side and streamed as binary format (/viz/mc-trajectories/{prediction_id}). Never compute trajectories in the browser.

Not the default for Persona A — the animation can be alarming without quantitative context.

Weighted opacity: Particles render with opacity proportional to their sample weight, not uniform opacity. This visually down-weights outlier trajectories so that low-probability high-consequence paths do not visually dominate.
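A minimal sketch of the weight-proportional opacity mapping; the min/max alpha clamp values are illustrative assumptions, chosen so outliers stay faintly visible rather than vanishing:

```python
def particle_opacities(weights: list,
                       min_alpha: float = 0.05,
                       max_alpha: float = 0.9) -> list:
    """Opacity proportional to sample weight, normalised to the heaviest
    sample and clamped to [min_alpha, max_alpha]."""
    w_max = max(weights)
    return [max(min_alpha, min(max_alpha, max_alpha * w / w_max))
            for w in weights]
```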

Mandatory first-use overlay: When Mode C is first enabled (per user, tracked in user preferences), a one-time overlay appears before the animation starts:

MONTE CARLO PARTICLE VIEW
──────────────────────────────────────────────────────────────
Each animated line shows one possible re-entry scenario sampled
from the prediction distribution. Colour encodes the solar
activity assumption used for that sample.

These are not equally likely outcomes — particle opacity
reflects sample weight. For operational planning, the
Percentile Corridors view (Mode A) gives a more reliable
summary.

[Understood — show animation]
──────────────────────────────────────────────────────────────

The overlay is dismissed permanently per user on first acknowledgement. It cannot be bypassed — the animation does not play until the user explicitly acknowledges.


6.5 Globe Information Hierarchy and Layer Management

Default view state: Active decay objects and their corridors, FIR boundaries for affected regions. "Show everything" is never the default.

Layer management panel:

LAYERS
────────────────────────────────────────
Objects
  ☑ Active decay objects (TIP issued)
  ☑ Decaying objects (perigee < 250 km)
  ☐ All tracked payloads
  ☐ Rocket bodies
  ☐ Debris catalog

Orbital Tracks
  ☐ Ground tracks (selected object only)
  ☐ All objects — [!] performance warning

Predictions & Corridors
  ☑ Re-entry corridors (active events)
  ☐ Re-entry corridors (all predicted)
  ☐ Fragment impact points
  ☐ Conjunction geometry

Airspace (Phase 2)
  ☐ FIR / UIR boundaries
  ☐ Controlled airspace
  ☐ Affected sectors (hazard intersection)

Reference
  ☐ Population density grid
  ☐ Critical infrastructure
────────────────────────────────────────
Corridor Display:  [Percentile ▾]

Layer state persists to localStorage per session. Shared URLs encode active layer state in query parameters.

Object clustering: At camera altitudes above 5,000 km, objects cluster. The cluster badge shows the object count and the highest urgency level present. Clusters expand below 2,000 km.

Altitude-aware clustering rule (F8 — §62): Objects at different altitudes with the same ground-track sub-point are not co-located — they have different re-entry windows and different hazard profiles. Two objects that share a 2D screen position but differ by > 100 km in altitude must not be merged into a single cluster. Implementation rule: CesiumJS EntityCluster clustering is disabled for any object with reentry_predictions showing a window < 30 days (i.e., any decay-relevant object in the watch/alert state). Objects in the normal catalog (window > 30 days) may continue to use screen-space clustering. This prevents the pathological case where a TIP-active object at 200 km is merged into a cluster with a nominal object at 500 km that shares its ground track, making the TIP object invisible in the cluster badge.
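The eligibility rule above reduces to a small predicate. A sketch, with illustrative names — the real check would read the object's reentry_predictions record:

```python
from datetime import datetime, timedelta, timezone

def clustering_allowed(reentry_window_end, now: datetime,
                       threshold_days: int = 30) -> bool:
    """Screen-space clustering is permitted only for objects with no
    decay relevance: prediction absent, or re-entry window more than
    `threshold_days` away. Decay-relevant objects render individually."""
    if reentry_window_end is None:
        return True  # normal catalog object, no re-entry prediction
    return (reentry_window_end - now) > timedelta(days=threshold_days)
```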

Urgency / Priority Visual Encoding (colour-blind-safe — shape distinguishes as well as colour):

| State | Symbol | Colour | Meaning |
| --- | --- | --- | --- |
| TIP issued, window < 6h | ◆ filled diamond | Red #D32F2F | Imminent re-entry |
| TIP issued, window 6–24h | ◇ outlined diamond | Orange #E65100 | Active threat |
| Predicted decay, window < 7d | ▲ triangle | Amber #F9A825 | Elevated watch |
| Decaying, window > 7d | ● circle | Yellow-grey | Monitor |
| Conjunction Pc > 1:1000 | ✕ cross | Purple #6A1B9A | Conjunction risk |
| Normal tracked | · dot | Grey #546E7A | Catalog |

Never use red/green as the sole distinguishing pair.


6.6 Alert System UX

Alert taxonomy:

| Level | Trigger | Visual Treatment | Requires Acknowledgement? |
| --- | --- | --- | --- |
| CRITICAL | TIP issued, window < 6h, hazard intersects active FIR | Full-width banner (red), audio tone (ops room mode) | Yes — named user; timestamp + note recorded |
| HIGH | Window < 24h, conjunction Pc > 1:1000 | Persistent badge (orange) | Yes — dismissal recorded |
| MEDIUM | New TIP issued (any), window < 7d, new CDM | Toast (amber), 8s auto-dismiss | No — logged |
| LOW | New TLE ingested, space weather index change | Notification centre only | No |

Alert fatigue mitigation:

  • Mute rules: per-user, per-session LOW suppression
  • Geographic filtering: alerts scoped to user's configured FIR list
  • Deduplication: window shrinks that don't cross a threshold do not re-trigger
  • Rate limit: same trigger condition cannot produce more than 1 CRITICAL alert per object per 4-hour window without a manual operator reset
  • Alert generation triggered only by backend logic on verified data — never by direct API call from a client
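The per-object CRITICAL rate limit can be implemented as a small backend guard. A sketch — the state shape (an in-memory dict and reset set) and names are assumptions; production state would live in the database:

```python
from datetime import datetime, timedelta, timezone

RATE_LIMIT_WINDOW = timedelta(hours=4)

def may_emit_critical(last_critical: dict, norad_id: int,
                      now: datetime, manual_resets: set) -> bool:
    """Allow at most one CRITICAL alert per object per 4-hour window,
    unless an operator has issued a manual reset for that object."""
    if norad_id in manual_resets:
        manual_resets.discard(norad_id)   # a reset is single-use
        last_critical[norad_id] = now
        return True
    prev = last_critical.get(norad_id)
    if prev is not None and now - prev < RATE_LIMIT_WINDOW:
        return False
    last_critical[norad_id] = now
    return True
```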

Ops room workload buffer (OPS_ROOM_SUPPRESS_MINUTES): An optional per-organisation setting (default: 0 — disabled). When set to N > 0, CRITICAL alert full-screen banners are queued for up to N minutes before display. The top-nav badge increments immediately so peripheral attention is captured; only the full-screen interrupt is deferred. This matches FAA AC 25.1329 alert prioritisation philosophy: acknowledge at a glance, act when workload permits. Must be documented in the admin UI with a mandatory warning: "Only enable if your operations room has a dedicated SpaceCom monitoring role. If a single controller manages all alerts, suppression introduces delay that may be safety-significant."

Audio alert specification:

  • Trigger: CRITICAL alert only (no audio for HIGH or lower)
  • Sound: two-tone ascending chime pattern (not a siren — ops rooms have sirens from other systems)
  • Behaviour: plays once on alert display; does not loop; stops on alert acknowledgement (not just banner dismiss)
  • Volume: configurable per-device (default 50% system volume); mutable by operator per-session
  • Ops room mode: organisation-level setting that enables audio (default: off; requires explicit activation)

Alert storm detection: If the system generates > 5 CRITICAL alerts within 1 hour across all objects, generate a meta-alert to Persona D. The meta-alert presents a disambiguation prompt rather than a bare count:

[META-ALERT — ALERT VOLUME ANOMALY]
──────────────────────────────────────────────────────────────
5 CRITICAL alerts generated within 1 hour.

This may indicate:
  (a) Multiple genuine re-entry events — verify via Space-Track
      independently before taking operational action.
  (b) System integrity issue — check ingest pipeline and data
      source health for signs of false data injection.

[Open /admin health dashboard →]   [View all CRITICAL alerts →]
──────────────────────────────────────────────────────────────

Acknowledgement workflow:

CRITICAL acknowledgement requires two steps to prevent accidental confirmation:

Step 1 — Alert banner with summary and Open Map link:

[CRITICAL ALERT]
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — TIP Issued
Re-entry window: 2026-03-16 14:00 – 22:00 UTC  (8h)
Affected FIRs: YMMM, YSSY
Risk level: HIGH  |  [Open map →]
[Review and Acknowledge →]
───────────────────────────────────────────────────────

Step 2 — Confirmation modal (appears on clicking "Review and Acknowledge"):

ACKNOWLEDGE CRITICAL ALERT
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — Re-entry window 14:00–22:00 UTC 16 Mar

Action taken (required — minimum 10 characters):
[_____________________________________________]

[Cancel]           [Confirm — J. Smith, 09:14 UTC]
───────────────────────────────────────────────────────

The Confirm button is disabled until the Action taken field contains ≥ 10 characters. This prevents reflexive one-click acknowledgement during an incident and ensures a minimal action record is always created.

Acknowledgements stored in alert_events (append-only). Records cannot be modified or deleted.


6.7 Timeline / Gantt View

Full timeline accessible from /events and as a compact strip on the Operational Overview.

                    NOW     +6h      +12h     +24h     +48h     +72h
Object              │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
CZ-5B R/B  44878   │   [■■■■■[══════ window ═══════]■■■]        │
  YMMM FIR — HIGH  │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
SL-16 R/B  28900   │        │        │ [■[══════════════════════════→
  NZZC FIR — MED   │        │        │        │        │        │

● = nominal re-entry point; ══ = uncertainty window; colour = risk level.

Click event bar → Event Detail page; hover → tooltip with window bounds and affected FIRs. Zoom range: 6h to 7d.


6.8 Event Detail Page (/events/{id})

┌──────────────────────────────────────────────────────────────┐
│  ← Events  │  CZ-5B R/B  NORAD 44878  │  [■ CRITICAL]       │
│             │  Re-entry window: 14:00–22:00 UTC 16 Mar 2026 │
├──────────────────────────────┬───────────────────────────────┤
│                              │  OBJECT                       │
│    3D GLOBE                  │  Mass: 21,600 kg (● DISCOS)   │
│    (focused on corridor)     │  B*: 0.000215 /ER             │
│    Mode: [Percentile ▾]      │  Data confidence: ● DISCOS    │
│    [Layers]                  │                               │
│                              │  PREDICTION                   │
│                              │  Model: cowell_nrlmsise00 v2  │
│                              │  F10.7 assumed: 148 sfu       │
│                              │  MC samples: 500              │
│                              │  HMAC: ✓ verified             │
│                              │                               │
│                              │  WINDOW                       │
│                              │  5th pct:  13:12 UTC          │
│                              │  50th pct: 17:43 UTC          │
│                              │  95th pct: 22:08 UTC          │
│                              │                               │
│                              │  TIP MESSAGES                 │
│                              │  MSG #3 — 09:00 UTC today     │
│                              │  [All TIP history →]          │
├──────────────────────────────┴───────────────────────────────┤
│  AFFECTED AIRSPACE (Phase 2)                                 │
│  YMMM FIR  ████ HIGH    entry 14:20–19:10 UTC              │
├──────────────────────────────────────────────────────────────┤
│  [Run Simulation]  [Generate Report]  [Share Link]           │
└──────────────────────────────────────────────────────────────┘

HMAC verification status is displayed prominently. If ✗ verification failed appears, a banner reads: "This prediction record may have been tampered with. Do not use for operational decisions. Contact your system administrator."

Data confidence annotates every physical property: ● DISCOS (green), ● estimated (amber), ● unknown (grey). When source is unknown or estimated, a warning callout appears above the prediction panel.

Corridor Evolution widget (Phase 2): A compact 2D strip on the Event Detail page showing how the p50 corridor footprint is evolving over time — three overlapping semi-transparent polygon outlines at T+0h, T+2h, T+4h from the current prediction. Updated automatically in LIVE mode. Gives Persona A Level 3 situation awareness (projection) at a glance without requiring simulation tools. Labelled: "Corridor evolution — how prediction is narrowing". If the corridor is widening (unusual), an amber warning appears: "Uncertainty is increasing — check space weather."

Duty Manager View (Phase 2): A [Duty Manager View] toggle button on the Event Detail header. When active, collapses all technical detail and presents a large-text, decluttered view containing only:

┌──────────────────────────────────────────────────────────────┐
│  CZ-5B R/B  NORAD 44878                    [■ CRITICAL]      │
│                                                              │
│  RE-ENTRY WINDOW                                             │
│  Start:   14:00 UTC  16 Mar 2026                             │
│  End:     22:00 UTC  16 Mar 2026                             │
│  Most likely:  17:43 UTC                                     │
│                                                              │
│  AFFECTED FIRs                                               │
│  YMMM (Airservices Australia) — HIGH RISK                    │
│  YSSY (Airservices Australia) — MEDIUM RISK                  │
│                                                              │
│  [Draft NOTAM]   [Log Action]   [Share Link]                 │
└──────────────────────────────────────────────────────────────┘

Toggle back to full view via [Technical Detail]. State is not persisted between sessions — always starts in full view.

Response Options accordion (Phase 2): An expandable panel at the bottom of the Event Detail page, visible to operator and above roles. Contextualised to the current risk level and FIR intersection. These are considerations only — all decisions rest with the ANSP:

RESPONSE OPTIONS  [▼ expand]
──────────────────────────────────────────────────────────────
Based on current prediction (risk: HIGH, window: 8h):

The following actions are for your consideration.
All operational decisions rest with the ANSP.

  ☐  Issue SIGMET or advisory to aircraft in YMMM FIR
  ☐  Notify adjacent ANSPs (YMMM borders: WAAF, OPKR)
  ☐  Draft NOTAM for authorised issuance   [Open →]
  ☐  Coordinate with FMP on traffic flow impact
  ☐  Establish watching brief schedule (every 30 min)

[Log coordination note]
──────────────────────────────────────────────────────────────

Checkbox states and coordination notes are appended to alert_events (append-only). The Response Options items are dynamically generated by the backend based on risk level and affected FIR count — not hardcoded in the frontend.
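Server-side generation of the checklist could look like the following sketch. The item wording mirrors the mock-up above; the function signature and parameters are illustrative, and the real catalogue of items would live in backend configuration:

```python
def response_options(risk: str, fir_count: int, adjacent_firs: list) -> list:
    """Assemble the Response Options checklist from the current risk
    level and FIR intersection. Items are considerations only."""
    options = []
    if risk in ("HIGH", "CRITICAL"):
        options.append("Issue SIGMET or advisory to aircraft in affected FIR(s)")
    if adjacent_firs:
        options.append(f"Notify adjacent ANSPs ({', '.join(adjacent_firs)})")
    if fir_count > 0:
        options.append("Draft NOTAM for authorised issuance")
        options.append("Coordinate with FMP on traffic flow impact")
    options.append("Establish watching brief schedule (every 30 min)")
    return options
```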


6.9 Simulation Job Management UX

Persistent collapsible bottom-drawer panel visible on any page. Jobs continue running when the user navigates away.

SIMULATION JOBS                                     [▲ collapse]
────────────────────────────────────────────────────────────────
● Running  Decay prediction — 44878    312/500  ████░  62%
           F10.7: 148, Ap: 12, B*±10%            ~45s rem
           [Cancel]

✓ Complete  Decay prediction — 44878    High F10.7 scenario
           Completed 09:02 UTC          [View results]  [Compare]

✗ Failed    Breakup simulation — 28900
           Error: DISCOS data missing   [Retry]  [Details]
────────────────────────────────────────────────────────────────

Simulation comparison: Two completed runs for the same object can be overlaid on the globe with distinct colours and a split-panel parameter comparison.


6.10 Space Weather Widget

SPACE WEATHER                                    [09:14 UTC]
────────────────────────────────────────────────────────────
Solar Activity       ●●●○○  ELEVATED
                     F10.7 observed: 148 sfu  (81d avg: 132)

Geomagnetic          ●●●●○  ACTIVE
                     Kp: 5.3  /  Ap daily: 27

Re-entry Impact      ▲ Active conditions — extend precaution window
                     Add ≥2h buffer beyond 95th percentile.

Forecast (24h)       Activity expected to decline — Kp 3–4
────────────────────────────────────────────────────────────
Source: NOAA SWPC    Updated: 09:00 UTC    [Full history →]

Operational status summary is generated by the backend based on F10.7 deviation from the 81-day average. The "Re-entry Impact" line delivers an operationally actionable statement — not a percentage — with a concrete recommended precaution buffer computed by the backend and delivered as a structured field:

| Condition | Re-entry Impact line | Recommended buffer |
| --- | --- | --- |
| F10.7 < 90 or Kp < 2 | Low activity — predictions at nominal accuracy | +0h |
| F10.7 90–140, Kp 2–4 | Moderate activity — standard uncertainty applies | +1h |
| F10.7 140–200, Kp 4–6 | Active conditions — extend precaution window. Add ≥2h buffer beyond 95th percentile. | +2h |
| F10.7 > 200 or Kp > 6 | High activity — predictions less reliable. Add ≥4h buffer beyond 95th percentile. | +4h |

The buffer recommendation is surfaced on the Event Detail page as an explicit callout when conditions are Elevated or above: "Space weather active: consider extending your airspace precaution window to [95th pct time + buffer]."
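The buffer lookup from the condition table reduces to a few threshold checks. A sketch — the boundary handling (≥ vs >) at band edges is an assumption the table does not pin down:

```python
def precaution_buffer_hours(f107: float, kp: float) -> int:
    """Recommended precaution buffer (hours) beyond the 95th-percentile
    re-entry time, following the space-weather condition table."""
    if f107 > 200 or kp > 6:
        return 4
    if f107 > 140 or kp > 4:
        return 2
    if f107 >= 90 or kp >= 2:
        return 1
    return 0
```

At the widget's example values (F10.7 observed 148 sfu, Kp 5.3) this yields the +2h buffer shown in the Re-entry Impact line.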


6.11 2D Plan View (Phase 2)

Globe/map toggle ([🌐 Globe] [🗺 Plan]) synchronises selected object, active corridor, and time position. State is preserved on switch.

2D view features: Mercator or azimuthal equidistant projection; ICAO chart symbology for airspace; ground-track corridor as horizontal projection only; altitude/time cross-section panel below showing corridor vertical extent at each FIR crossing.


6.12 Reporting Workflow

Report configuration dialogue:

NEW REPORT — CZ-5B R/B (44878)
──────────────────────────────────────────────────────────────
Simulation:  [Run #3 — 09:14 UTC ▾]

Report Type:
  ○ Operational Briefing     (1–2 pages, plain language)
  ○ Technical Assessment     (full uncertainty, model provenance)
  ○ Regulatory Submission    (formal format, appendices)

Include Sections:
  ☑ Object properties and data confidence
  ☑ Re-entry window and uncertainty percentiles
  ☑ Ground track corridor map
  ☑ Affected airspace and FIR crossing times
  ☑ Space weather conditions at prediction time
  ☑ Model version and simulation parameters
  ☐ Full MC sample distribution
  ☐ TIP message history

Prepared by: J. Smith          Authority: CASA
──────────────────────────────────────────────────────────────
[Preview]  [Generate PDF]  [Cancel]

Report identity: Every report has a unique ID, the simulation ID it was derived from, a generation timestamp, and the analyst's name. Reports are stored in MinIO and listed in /reports.

Date format in all reports and exports (F7): Slash-delimited dates (03/04/2026) are ambiguous between DD/MM and MM/DD and are banned from all SpaceCom outputs. All dates in PDF reports, CSV exports, and NOTAM drafts use DD MMM YYYY format (e.g. 04 MAR 2026) — unambiguous across all locales and consistent with ICAO and aviation convention. All times alongside dates use HH:MMZ (e.g. 04 MAR 2026 14:00Z). This applies to: PDF prediction reports, CSV bulk exports, NOTAM draft (B)/(C) fields (which use ICAO YYMMDDHHMM format internally but are displayed as DD MMM YYYY HH:MMZ in the preview).
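A formatting helper enforcing the DD MMM YYYY HH:MMZ rule might look like this (function name is illustrative; a real implementation would sit in a shared formatting module used by the report renderer and export pipeline):

```python
from datetime import datetime, timezone

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def display_datetime(t: datetime) -> str:
    """Format a timestamp as 'DD MMM YYYY HH:MMZ' in UTC — the only
    date/time form permitted in reports, exports and NOTAM previews."""
    t = t.astimezone(timezone.utc)
    return f"{t.day:02d} {MONTHS[t.month - 1]} {t.year} {t.hour:02d}:{t.minute:02d}Z"
```

Uppercase English month abbreviations are hardcoded rather than locale-derived, so the output cannot vary with the server locale.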

Report rendering: Server-side Playwright in the isolated renderer container. The map image is a headless Chromium screenshot of the globe at the relevant configuration. All user-supplied text is HTML-escaped before interpolation. The renderer has no external network access — it receives only sanitised, structured data from the backend API.


6.13 NOTAM Drafting Workflow (Phase 2)

SpaceCom cannot issue NOTAMs. Only designated NOTAM offices authorised by the relevant AIS authority can issue them. SpaceCom's role is to produce a draft in ICAO Annex 15 format ready for review and formal submission by an authorised originator.

Trigger: From the Event Detail page, Persona A clicks [Draft NOTAM]. This is only available when a hazard corridor intersects one or more FIRs.

Draft NOTAM output (ICAO Annex 15 / OPADD format):

Field format follows ICAO Annex 15 Appendix 6 and EUROCONTROL OPADD. Timestamps use YYMMDDHHmm format (not ISO 8601 — ICAO Annex 15 §5.1.2). (B) = p10 − 30 min; (C) = p90 + 30 min (see mapping table below).

NOTAM DRAFT — FOR REVIEW AND AUTHORISED ISSUANCE ONLY
══════════════════════════════════════════════════════
Generated by SpaceCom v2.1 | Prediction ID: pred-44878-20260316-003
Data source: USSPACECOM TIP #3 + SpaceCom decay prediction
⚠ This is a DRAFT only. Must be reviewed and issued by authorised NOTAM office.

Q) YMMM/QWELW/IV/NBO/AE/000/999/2200S13300E999
A) YMMM
B) 2603161330
C) 2603162230
E) UNCONTROLLED SPACE OBJECT RE-ENTRY. OBJECT: CZ-5B ROCKET BODY
   NORAD ID 44878. PREDICTED RE-ENTRY WINDOW 1400-2200 UTC 16 MAR
   2026. NOMINAL RE-ENTRY POINT APRX 22S 133E. 95TH PERCENTILE
   CORRIDOR 18S 115E TO 28S 155E. DEBRIS SURVIVAL PSB. AIRSPACE
   WITHIN CORRIDOR MAY BE AFFECTED ALL LEVELS DURING WINDOW.
   REF SPACECOM PRED-44878-20260316-003.
F) SFC
G) UNL

NOTAM field mapping (ICAO Annex 15 Appendix 6):

| NOTAM field | SpaceCom data source | Format rule |
| --- | --- | --- |
| (Q) Q-line | FIR ICAO designator + NOTAM code QWELW (re-entry warning) | Generated from airspace.icao_designator; subject code WE (airspace warning), condition LW (laser/space) |
| (A) FIR | airspace.icao_designator for each intersecting FIR | One NOTAM per FIR; multi-FIR events generate multiple drafts |
| (B) Valid from | prediction.p10_reentry_time − 30 minutes | YYMMDDHHmm (UTC); example: 2603161330 |
| (C) Valid to | prediction.p90_reentry_time + 30 minutes | YYMMDDHHmm (UTC) |
| (D) Schedule | Omitted (continuous) | Do not include (D) field for continuous validity |
| (E) Description | Templated from sanitised object name, NORAD ID, p50 time, corridor bounds | sanitise_icao() applied; ICAO Doc 8400 abbreviations used (PSB not "possible", APRX not "approximately") |
| (F)/(G) Limits | SFC / UNL | Hardcoded for re-entry events; do not compute from corridor altitude |

(B)/(C) field: re-entry window to NOTAM validity — time-critical cancellation: The (C) validity time does not mean the hazard persists until then — it is the worst-case boundary. When re-entry is confirmed, the NOTAM cancellation draft must be initiated immediately. The Event Detail page surfaces a prominent [Draft NOTAM Cancellation — RE-ENTRY CONFIRMED] button at the moment the event status changes to confirmed, with a UI note: "Cancellation draft should be submitted to the NOTAM office without delay."

Unit test: Generate a draft for a prediction with p10=2026-03-16T14:00Z, p90=2026-03-16T22:00Z; assert (B) field is 2603161330 and (C) field is 2603162230. Assert Q-line matches regex \(Q\) [A-Z]{4}/QWELW/IV/NBO/AE/\d{3}/\d{3}/\d{4}[NS]\d{5}[EW]\d{3}.
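A sketch of the (B)/(C) computation that satisfies the unit test above (function name is an assumption):

```python
from datetime import datetime, timedelta, timezone

def notam_validity_fields(p10: datetime, p90: datetime) -> tuple:
    """(B) = p10 minus 30 minutes, (C) = p90 plus 30 minutes,
    both rendered as ICAO YYMMDDHHmm in UTC."""
    fmt = "%y%m%d%H%M"
    b = (p10 - timedelta(minutes=30)).astimezone(timezone.utc).strftime(fmt)
    c = (p90 + timedelta(minutes=30)).astimezone(timezone.utc).strftime(fmt)
    return b, c
```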

NOTAM cancellation draft: When an event is closed (re-entry confirmed, object decayed), the Event Detail page offers [Draft NOTAM Cancellation] — generates a CANX NOTAM draft referencing the original.

Regulatory note displayed in the UI: A persistent banner on the NOTAM draft page reads: "This draft is generated for review purposes only. It must be reviewed for accuracy, formatted to local AIS standards, and issued by an authorised NOTAM originator. SpaceCom does not issue NOTAMs."

NOTAM language and i18n exclusion (F6): ICAO Annex 15 specifies that NOTAMs use ICAO standard phraseology in English (or the language of the state for domestic NOTAMs). NOTAM template strings are never internationalised:

  • All NOTAM template strings are hardcoded ICAO English phraseology in backend/app/modules/notam/templates.py
  • Each template string is annotated # ICAO-FIXED: do not translate
  • The NOTAM draft is excluded from the next-intl message extraction tooling
  • The NOTAM preview panel renders in a fixed-width monospace font to match traditional NOTAM format
  • lang="en" attribute is set on the NOTAM text container regardless of the operator's UI locale

The draft is stored in the notam_drafts table (see §9.2) for audit purposes.


6.14 Shadow Mode (Phase 2)

Shadow mode allows ANSPs to run SpaceCom in parallel with existing procedures during a trial period, without acting operationally on its outputs. This is the primary mechanism for building regulatory acceptance evidence.

Activation: admin role only, per-organisation setting in /admin.

Visual treatment when shadow mode is active:

┌─────────────────────────────────────────────────────────────────┐
│  ⚗ SHADOW MODE — Predictions are not for operational use        │
│  All outputs are recorded for validation. No alerts are         │
│  delivered externally. Contact your administrator to disable.   │
└─────────────────────────────────────────────────────────────────┘
  • A persistent amber banner spans the top of every page
  • The mode indicator pill shows ⚗ SHADOW in amber
  • All alert levels are demoted to INFORMATIONAL — no banners, no audio tones, no email delivery
  • Prediction records have shadow_mode = TRUE in the database (see §9)
  • Shadow predictions are excluded from all operational views but accessible in /analysis

Validation reporting: After each real re-entry event, Persona B can generate a Shadow Validation Report comparing SpaceCom shadow predictions against the actual observed re-entry time/location. These reports form the evidence base for regulatory adoption.

Shadow Mode Exit Criteria (regulatory hand-off specification — Finding 6):

Shadow mode is a formal regulatory activity, not a product trial. Exit to operational use requires:

| Criterion | Requirement |
|---|---|
| Minimum shadow period | 90 days, or covering ≥ 3 re-entry events above the CRITICAL alert threshold, whichever is longer |
| Prediction accuracy | corridor_contains_observed ≥ 90% across shadow period events (from prediction_outcomes) |
| False positive rate | fir_false_positive_rate ≤ 20% — no more than 1 in 5 corridor-intersecting FIR alerts is a false alarm |
| False negative rate | fir_false_negative = 0 during the shadow period — no re-entry event missed entirely |
| Exit document | shadow-mode-exit-report-{org_id}-{date}.pdf generated from prediction_outcomes; contains automated statistics + ANSP Safety Department sign-off field |
| Regulatory hand-off | Written confirmation from the ANSP's Accountable Manager or Head of ATM Safety that their internal Safety Case / Tool Acceptance process is complete |
| System state | shadow_mode_cleared = TRUE is set by SpaceCom admin only after receipt of the written ANSP confirmation |

The exit report template lives at docs/templates/shadow-mode-exit-report.md. Persona B generates the statistics from the admin analysis panel; the ANSP prints, signs, and returns the PDF. No software system can substitute for the ANSP's internal Safety Department sign-off.
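The quantitative exit criteria can be checked mechanically. A minimal sketch, assuming a simplified per-event record shape (the real statistics come from the `prediction_outcomes` table, and the false-positive rate there is computed per FIR alert rather than per event):

```python
# Sketch of the shadow-mode exit check. `Outcome` is an illustrative
# per-event record, not the real prediction_outcomes schema.
from dataclasses import dataclass


@dataclass
class Outcome:
    corridor_contains_observed: bool       # observed re-entry inside corridor
    fir_alert_was_false_positive: bool     # corridor-intersecting FIR alert was a false alarm


def exit_criteria_met(outcomes: list[Outcome], shadow_days: int,
                      missed_events: int) -> bool:
    """Apply the quantitative exit criteria from the table above."""
    if shadow_days < 90 or len(outcomes) < 3:
        return False                       # minimum period / event count
    if missed_events > 0:
        return False                       # false negative rate must be zero
    contained = sum(o.corridor_contains_observed for o in outcomes)
    false_pos = sum(o.fir_alert_was_false_positive for o in outcomes)
    return (contained / len(outcomes) >= 0.90      # accuracy ≥ 90%
            and false_pos / len(outcomes) <= 0.20)  # false positives ≤ 20%
```

The exit report generator would run this over the shadow period's records; the sign-off fields remain manual.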

Commercial trial-to-operational conversion (Finding 5):

A successful shadow exit automatically generates a commercial offer. The admin panel transitions the organisation's subscription_status from 'shadow_trial' to 'offered' and Persona D receives a task notification. The offer package includes:

  • Commercial offer document (generated from docs/templates/commercial-offer-ansp.md): tier, pricing, SLA schedule, DPA status
  • MSA execution path: ANSPs that accept the offer sign the MSA; no separate negotiation required for the standard ANSP Operational tier
  • Onboarding checklist: docs/onboarding/ansp-onboarding-checklist.md

If an ANSP does not convert within 30 days of receiving the offer, subscription_status moves to 'offered_lapsed' and Persona D is notified. The admin panel shows conversion pipeline status for all ANSP organisations. Maximum concurrent ANSP shadow deployments in Phase 2: 2 (resource constraint — each requires a dedicated SpaceCom integration lead for the 90-day shadow period).


6.15 Space Operator Portal UX (Phase 2)

The Space Operator Portal (/space) is the second front door. It serves Persona E and F with a technically dense interface — different visual language from the aviation-facing portal.

Space Operator Overview (/space):

┌─────────────────────────────────────────────────────────────────┐
│  SpaceCom · Space Portal    [API] [Export] [Persona E: ORBCO]   │
├─────────────────────┬───────────────────────────────────────────┤
│                     │  MY OBJECTS (3)                           │
│  3D GLOBE           │  ┌────────────────────────────────────┐   │
│  (owned objects     │  │ CZ-5B R/B  44878                   │   │
│   only, with        │  │ Perigee: 178 km  ↓ Decaying fast   │   │
│   full orbital      │  │ Re-entry: 16 Mar ± 8h              │   │
│   tracks and        │  │ [Predict] [Plan deorbit] [Export]  │   │
│   decay vectors)    │  ├────────────────────────────────────┤   │
│                     │  │ SL-16 R/B  28900                   │   │
│                     │  │ Perigee: 312 km  ~ Stable          │   │
│                     │  │ [Predict] [Export]                 │   │
│                     │  └────────────────────────────────────┘   │
│                     │  CONJUNCTION ALERTS (MY OBJECTS)          │
│                     │  No active conjunctions > Pc 1:10000      │
├─────────────────────┴───────────────────────────────────────────┤
│  API USAGE   Requests today: 143 / 1000   [Manage keys →]       │
└─────────────────────────────────────────────────────────────────┘

Controlled Re-entry Planner (/space/reentry/plan):

Available for objects with remaining manoeuvre capability (flagged in owned_objects.has_propulsion).

CONTROLLED RE-ENTRY PLANNER — CZ-5B R/B (44878)
─────────────────────────────────────────────────────────────────
Delta-V budget: [▓▓▓░░░░░] 12.4 m/s remaining

Target re-entry window:  [2026-03-20 ▾]  to  [2026-03-22 ▾]
Avoid FIRs:              [☑ YMMM]  [☑ YSSY]  [☑ Populated land]
Preferred landing:       ● Ocean   ○ Specific zone

CANDIDATE WINDOWS
──────────────────────────────────────────────────────────────────
  #1  2026-03-21 03:14 UTC    ΔV: 8.2 m/s    Risk: ● LOW
      Landing: South Pacific  FIR: NZZO (ocean)
      [Select] [View corridor]

  #2  2026-03-21 09:47 UTC    ΔV: 11.1 m/s   Risk: ● LOW
      Landing: Indian Ocean   FIR: FJDG (ocean)
      [Select] [View corridor]

  #3  2026-03-21 15:30 UTC    ΔV: 9.8 m/s    Risk: ▲ MEDIUM
      Landing: 22S 133E       FIR: YMMM (land)
      [Select] [View corridor]
──────────────────────────────────────────────────────────────────
[Export manoeuvre plan (CCSDS)]  [Generate operator report]

The planner outputs are suitable for submission to national space regulators as evidence of responsible end-of-life management under the ESA Zero Debris Charter and national space law requirements.

Zero Debris Charter compliance output format (Finding 2):

The planner produces a controlled-reentry-compliance-report-{norad_id}-{date}.pdf containing:

  • Ranked deorbit window analysis (delta-V budget, window start/end, corridor risk score per window)
  • FIR avoidance corridors for each candidate window
  • Probability of casualty on the ground (Pc_ground) computed using NASA Debris Assessment Software methodology (1-in-10,000 IADC casualty threshold; documented in model card)
  • Comparison table: each candidate window vs. the 1:10,000 Pc_ground threshold; compliant windows flagged green
  • Zero Debris Charter alignment statement (auto-generated from object disposition)

Machine-readable companion: application/vnd.spacecom.reentry-compliance+json — returned alongside the PDF download URL as compliance_report_url in the planning job result. Format documented in docs/api-guide/compliance-export.md.

The Pc_ground calculation uses the fragment survivability model (§15.3 material class lookup) and the ESA DRAMA casualty area methodology. objects.material_class IS NULL → conservative all-survive assumption → higher Pc_ground — creates an incentive for operators to provide accurate physical data.
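The comparison-table logic is a straightforward threshold check once Pc_ground is computed. A minimal sketch, assuming an illustrative window record; the real Pc_ground values come from the DRAMA-based casualty model, not from this helper:

```python
# Sketch of the compliant-window flagging against the 1-in-10,000 IADC
# casualty threshold. CandidateWindow is an illustrative shape.
from dataclasses import dataclass

IADC_PC_THRESHOLD = 1.0 / 10_000


@dataclass
class CandidateWindow:
    start_utc: str
    delta_v_ms: float
    pc_ground: float   # output of the casualty-area model, not computed here


def flag_compliant(windows: list[CandidateWindow]) -> list[dict]:
    """Build comparison-table rows; compliant windows are flagged green in the PDF."""
    return [
        {
            "start_utc": w.start_utc,
            "delta_v_ms": w.delta_v_ms,
            "pc_ground": w.pc_ground,
            "compliant": w.pc_ground < IADC_PC_THRESHOLD,
        }
        for w in windows
    ]
```

Note that a NULL `material_class` inflates `pc_ground` (all-survive assumption), so such windows fail this check more often, which is the intended incentive.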

ECCN classification review (already in §21 Phase 2 DoD) must resolve before this output is shared with non-US entities.


6.16 Accessibility Requirements

  • WCAG 2.1 Level AA compliance — required for government and aviation authority procurement
  • Colour-blind-safe palette throughout; urgency uses shape + colour, never colour alone
  • High-contrast mode available in user settings (WCAG AAA scheme)
  • Dark mode as a first-class theme (not an afterthought)
  • All interactive elements keyboard-accessible; tab order logical
  • Alerts announced via aria-live="assertive" (CRITICAL) and aria-live="polite" (MEDIUM/LOW)
  • Globe canvas has aria-label describing current view context
  • Minimum touch target size 44×44 px
  • Tested at 1080p (ops room), 1440p (analyst workstation), 1024×768 (tablet minimum)
  • Automated axe-core audit via @axe-core/playwright run on the 5 core pages on every PR; 0 critical, 0 serious violations required to merge; known acceptable third-party violations (e.g., CesiumJS canvas contrast) recorded in tests/e2e/axe-exclusions.json with a justification comment — not silently suppressed. Implementation:
    // tests/e2e/accessibility.spec.ts
    import { test, expect } from '@playwright/test';
    import AxeBuilder from '@axe-core/playwright';
    import { loadAxeExclusions } from './axe-exclusions'; // helper that parses axe-exclusions.json

    const corePages: Array<[string, string]> = [
      ['operational-overview', '/'], ['event-detail', '/events/seed-event'],
      ['notam-draft', '/notam/draft/seed-draft'], ['space-portal', '/space/objects'],
      ['settings', '/settings'],
    ];

    for (const [name, path] of corePages) {
      test(`${name} — WCAG 2.1 AA`, async ({ page }) => {
        await page.goto(path);
        const results = await new AxeBuilder({ page })
          .withTags(['wcag2a', 'wcag2aa'])
          .exclude(loadAxeExclusions())   // documented exclusions only — never silent suppression
          .analyze();
        expect(results.violations).toEqual([]);
      });
    }

6.17 Multi-ANSP Coordination Panel (Phase 2)

When an event's predicted corridor intersects FIRs belonging to more than one registered organisation, an additional panel appears on the Event Detail page. This panel provides shared situational awareness across ANSPs without replacing voice coordination.

MULTI-ANSP COORDINATION
──────────────────────────────────────────────────────────────
FIRs affected by this event:
  YMMM  Airservices Australia  — ✓ Acknowledged 09:14 UTC  J. Smith
  NZZC  Airways NZ             — ○ Not yet acknowledged

Last activity:
  09:22 UTC  YMMM — "Watching brief established, coordinating with FMP"
──────────────────────────────────────────────────────────────
[Log coordination note]

Rules:

  • Each ANSP sees the acknowledgement status and latest coordination note from all other ANSPs on the event; they do not see each other's internal alert state
  • Coordination notes are free text, appended to alert_events (append-only, auditable), with organisation name, user name, and UTC timestamp
  • The panel is read-only for organisations that have not yet acknowledged; they can acknowledge and then log notes
  • Visibility is scoped: organisations only see the panel for events that intersect their registered FIRs — they do not see coordination panels for unrelated events from other orgs

This does not replace voice or direct coordination — it creates a shared digital record that both ANSPs can reference. The panel carries a permanent banner: "This coordination panel is for shared situational awareness only. It does not replace formal ATS coordination procedures or voice coordination."

Authority and precedence (Finding 5): The panel has no command authority. If two ANSPs log conflicting assessments, neither supersedes the other in SpaceCom — the system records both. The authoritative coordination outcome is always the result of direct ATS coordination outside the system. SpaceCom coordination notes are supporting evidence, not operational decisions.

WebSocket latency for coordination updates: Coordination note updates must be visible to all parties within 2 seconds of posting (p99). This is specified as a performance SLA for the coordination panel WebSocket channel (distinct from the 5-second SLA for alert events). Latency > 2 seconds means an ANSP may have acted on a stale picture during a fast-moving event.

Data retention for coordination records (ICAO Annex 11 §2.26): Coordination notes are safety records. Minimum retention: 5 years in append-only storage. The coordination_notes table (stored append-only in alert_events.coordination_notes JSONB[] or as a separate table) is included in the safety record retention category (§27.4) and excluded from standard data drop policies.
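The append-only note record can be sketched as a pure constructor; persistence is whatever shape §9 settles on (JSONB[] or a separate table), and this helper is illustrative:

```python
# Sketch of building an append-only coordination note record with the three
# required attribution fields. The dict shape is an assumption.
from datetime import datetime, timezone


def build_coordination_note(org_name: str, user_name: str, text: str) -> dict:
    """Notes are append-only: callers may only add records, never mutate them."""
    return {
        "organisation": org_name,
        "user": user_name,
        "text": text,
        "logged_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
    }
```

The record carries no alert-state fields, consistent with the rule that ANSPs never see each other's internal alert state.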


6.18 First-Time User Onboarding State (Phase 1)

When a new organisation has no configured FIRs and no active events, the globe is empty. An empty globe is indistinguishable from "the system isn't working" for first-time users. An onboarding state prevents this misinterpretation.

Trigger: Organisation has fir_list IS NULL OR fir_list = '{}' at login.

Display: Three setup cards replace the Active Events panel:

WELCOME TO SPACECOM
──────────────────────────────────────────────────────────────
To see relevant events and receive alerts, complete setup:

  1. Configure your FIR watch list
     Determines which re-entry events you see and which
     alerts you receive.                        [Configure →]

  2. Set alert delivery preferences
     Email, WebSocket, or webhook for CRITICAL alerts.
                                                [Configure →]

  3. Optional: Enable Shadow Mode for a trial period
     Run SpaceCom in parallel with existing procedures —
     outputs are not for operational use until disabled.
                                                [Configure →]

──────────────────────────────────────────────────────────────

Cards disappear permanently once step 1 (FIR list) is complete. Steps 2 and 3 remain accessible from /admin at any time. The setup cards are not a modal — they appear inline and the user can still access all navigation.


6.19 Degraded Mode UI Guidance (Phase 1)

The StalenessWarningBanner (triggered by /readyz returning 207) must include an operational guidance line keyed to the specific type of data degradation, not just a generic "data may be stale" message. Persona A's question in degraded mode is not "is the data stale?" — it is "can I use this for an operational decision right now?"

| Degradation type | Banner operational guidance |
|---|---|
| Space weather data stale > 3h | "Uncertainty estimates may be wider than shown. Treat all corridors as potentially broader than the 95th percentile boundary." |
| TLE data stale > 24h | "Object position data is more than 24 hours old. Do not use for precision airspace decisions without independent position verification." |
| Active prediction older than 6h without refresh | "This prediction reflects conditions from [timestamp]. A fresh prediction run is recommended before operational use. [Trigger refresh →]" |
| IERS EOP data stale > 7 days | "Coordinate frame transformations may have minor errors. Technical assessments only — do not use for precision airspace boundary work." |

Banner behaviour:

  • The banner type is set by the backend via the /readyz response body (degradation_type enum)
  • Each degradation type has its own banner message — not a generic "degraded" label
  • The banner persists until the degradation is resolved; it cannot be dismissed by the user
  • When multiple degradations are active, show the highest-impact degradation first, with a (+N more) expand link
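The "highest-impact first" rule implies a fixed severity ordering over the `degradation_type` enum. A minimal sketch; the enum values and their relative ranking below are assumptions derived from the table, not specified values:

```python
# Sketch of ordering active degradations for the banner. Enum value names
# and the ranking are illustrative assumptions.
IMPACT_ORDER = [
    "tle_stale_24h",           # position data unusable for precision decisions
    "prediction_stale_6h",
    "space_weather_stale_3h",
    "iers_eop_stale_7d",
]


def banner_order(active: set[str]) -> list[str]:
    """Return active degradations highest-impact first; index 0 is the
    banner shown, the rest sit behind the (+N more) expand link."""
    return [d for d in IMPACT_ORDER if d in active]
```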

6.20 Secondary Display Mode (Phase 2)

An ops room secondary monitor display mode — strips all navigation chrome and presents only the operational picture on a full-screen secondary display alongside existing ATC tools.

Activation: [Secondary Display] link in the user menu, or URL parameter ?display=secondary. Opens in a new window or full-screen.

Layout: Full-screen globe on the left (~70% width), vertical event list on the right (~30% width). No top navigation, no admin links, no simulation controls. No sidebar panels. The LIVE/SHADOW/SIMULATION mode indicator remains visible (always). CRITICAL alert banners still appear.

Design principle: This is a CSS-level change — hide navigation and chrome elements, maximise the operational data density. No new data is added; no existing data is removed.


7. Security Architecture

This section is as non-negotiable as §4. Security must be built in from Week 1, not audited at Phase 3. The primary security risk in an aviation safety system is not data exfiltration — it is data corruption that produces plausible but wrong outputs that are acted upon operationally. A false all-clear for a genuine re-entry threat is the highest-consequence attack against this system's mission.

7.1 Threat Model (STRIDE)

Key trust boundaries and their principal threats:

| Boundary | Spoofing | Tampering | Repudiation | Info Disclosure | DoS | Elevation |
|---|---|---|---|---|---|---|
| Browser → API | JWT forgery | Request injection | Unlogged mutations | Token leak via XSS | Auth endpoint flood | RBAC bypass |
| API → DB | Credential leak | SQL injection | No audit trail | Column over-fetch | N+1 queries | RLS bypass |
| Ingest → External feeds | DNS/BGP hijack → wrong TLE | Man-in-middle alters F10.7 | — | Credential interception | Feed DoS | — |
| Celery worker → DB | Compromised worker | Corrupt sim output written to DB | Unlogged task | Param leak in logs | Runaway MC task | Worker → backend pivot |
| Playwright renderer → backend | — | User content → XSS → SSRF | — | Local file read | Hang/timeout | RCE via browser exploit |
| Redis | — | Cache poisoning | — | Token interception | Queue flood | — |

Mitigations for each threat are specified in the sections below.


7.2 Role-Based Access Control (RBAC)

Seven roles map to the personas. Every API endpoint enforces the minimum required role via a FastAPI dependency.

| Role | Assigned To | Permissions |
|---|---|---|
| viewer | Read-only external stakeholders | View objects, predictions, corridors; read-only globe (aviation domain) |
| analyst | Persona B | viewer + submit simulations, generate reports, access historical data, shadow validation reports |
| operator | Persona A, C | analyst + acknowledge alerts, issue advisories, draft NOTAMs, access operational tools |
| org_admin | Organisation administrator | operator + invite/remove users within their own org; assign roles up to operator within own org; view own org's audit log; manage own org's API keys; update own org's billing contact; cannot access other orgs' data; cannot assign admin or org_admin without system admin approval |
| admin | Persona D (system-wide) | Full access: user management across all orgs, ingest configuration, model version deployment, shadow mode toggle, subscription management |
| space_operator | Persona E | Object-scoped access (owned objects only via owned_objects table); decay predictions and controlled re-entry planning for own objects; conjunction alerts for own objects; API key management; CCSDS export; no access to other organisations' simulation data |
| orbital_analyst | Persona F | Full catalog read; conjunction screening across any object pair; simulation submission; bulk export (CSV, JSON, CCSDS); raw state vector and covariance access; API key management; no alert acknowledgement |

Object ownership scoping for space_operator: The owned_objects table maps operators to their registered NORAD IDs. All queries from a space_operator user are automatically scoped to their owned object list — enforced by a PostgreSQL RLS policy on the owned_objects join, not only at the application layer:

-- space_operator users see only their owned objects in catalog queries
CREATE POLICY objects_owner_scope ON objects
  USING (
    current_setting('app.current_role') != 'space_operator'
    OR id IN (
      SELECT object_id FROM owned_objects
      WHERE organisation_id = current_setting('app.current_org_id')::INTEGER
    )
  );

Multi-tenancy: If multiple organisations use the system, every table that contains organisation-specific data (simulations, reports, alert_events, hazard_zones) must include an organisation_id column. PostgreSQL Row-Level Security (RLS) policies enforce the boundary at the database layer — not only at the application layer:

ALTER TABLE simulations ENABLE ROW LEVEL SECURITY;
CREATE POLICY simulations_org_isolation ON simulations
  USING (organisation_id = current_setting('app.current_org_id')::INTEGER);

The application sets app.current_org_id at the start of every database session from the authenticated user's JWT claims.
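A minimal sketch of how the request path might stamp those session variables, assuming a driver-agnostic parameterised statement (the helper name and exact wiring into the session factory are illustrative):

```python
# Sketch: build the SET-config statement run at the start of every request's
# DB session so RLS policies can read the org id and role. Names illustrative.
def org_context_sql(org_id: int, role: str) -> tuple[str, dict]:
    """SQL + params executed once per session, from verified JWT claims.

    set_config(..., true) makes the setting transaction-local, so a pooled
    connection cannot leak one org's context into the next request.
    """
    sql = (
        "SELECT set_config('app.current_org_id', %(org)s, true), "
        "set_config('app.current_role', %(role)s, true)"
    )
    return sql, {"org": str(org_id), "role": role}
```

The transaction-local flag matters under PgBouncer-style pooling: the setting dies with the transaction rather than lingering on the physical connection.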

Comprehensive RLS policy coverage (F1): The simulations example above is the template. Every table that carries organisation_id must have RLS enabled and an isolation policy applied. The full set:

| Table | RLS policy | Notes |
|---|---|---|
| simulations | organisation_id = current_org_id | — |
| reentry_predictions | organisation_id = current_org_id | shadow policy layered separately |
| alert_events | organisation_id = current_org_id | append-only; no UPDATE/DELETE anyway |
| hazard_zones | organisation_id = current_org_id | — |
| reports | organisation_id = current_org_id | — |
| api_keys | organisation_id = current_org_id | admins bypass to revoke any key |
| usage_events | organisation_id = current_org_id | billing metering records |
| objects | organisation_id IS NULL OR organisation_id = current_org_id | NULL = catalog-wide; org-specific = owned objects only |

RLS bypass for system-level tasks: Celery workers and internal admin processes run under a dedicated database role (spacecom_worker) that bypasses RLS (BYPASSRLS). This role is never used by the API request path. Integration test (BLOCKING): establish two orgs with data; issue a query as Org A's session; assert zero Org B rows returned. This test runs in CI against a real database (not mocked).

Shadow mode segregation — database-layer enforcement (Finding 9):

Shadow predictions must be excluded from operational API responses at the RLS layer, not only via application WHERE clauses. A backend query bug or misconfigured join must not expose shadow records to viewer/operator sessions — that would be a regulatory incident.

ALTER TABLE reentry_predictions ENABLE ROW LEVEL SECURITY;

-- Non-admin sessions never see shadow records unless the session flag is set
CREATE POLICY shadow_segregation ON reentry_predictions
  USING (
    shadow_mode = FALSE
    OR current_setting('spacecom.include_shadow', TRUE) = 'true'
  );

The spacecom.include_shadow session variable is set to 'true' only by the backend's shadow-admin code path, which requires admin role and explicit shadow-mode context. Regular backend sessions never set this variable. Integration test: query reentry_predictions as viewer role with no WHERE shadow_mode clause; verify zero shadow rows returned.

Four-eyes principle for admin role elevation (Finding 6):

A single compromised admin account must not be able to silently elevate a backdoor account. Elevation to admin requires a second admin to approve within 30 minutes.

CREATE TABLE pending_role_changes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  target_user_id INTEGER NOT NULL REFERENCES users(id),
  requested_role TEXT NOT NULL,
  requested_by INTEGER NOT NULL REFERENCES users(id),
  approval_token_hash TEXT NOT NULL,  -- SHA-256 of emailed token
  expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '30 minutes',
  approved_by INTEGER REFERENCES users(id),
  approved_at TIMESTAMPTZ,
  rejected_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Workflow:

  1. PATCH /admin/users/{id}/role with role=admin creates a pending_role_changes row and triggers an email to all other active admins containing a single-use approval token
  2. POST /admin/role-changes/{change_id}/approve?token=<token> — any other admin can approve; completing the role change is atomic
  3. Rows past expires_at are auto-rejected by a nightly job and logged as ROLE_CHANGE_EXPIRED
  4. All outcomes (ROLE_CHANGE_APPROVED, ROLE_CHANGE_REJECTED, ROLE_CHANGE_EXPIRED) are logged to security_logs as HIGH severity
  5. The requesting admin cannot approve their own pending change (enforced by approved_by != requested_by constraint)

RBAC enforcement pattern (FastAPI):

def require_role(*roles: str):
    def dependency(current_user: User = Depends(get_current_user)):
        if current_user.role not in roles:
            log_auth_failure(current_user, roles)
            raise HTTPException(status_code=403, detail="Insufficient permissions")
        return current_user
    return dependency

# Applied per router group — not per individual endpoint where it is easy to miss
router = APIRouter(dependencies=[Depends(require_role("operator", "admin"))])

7.3 Authentication

JWT Implementation

  • Algorithm: RS256 (asymmetric). Never HS256 with a shared secret. Never none.
  • Key storage: RSA private signing key stored in Docker secrets / secrets manager (see §7.5). Never in an environment variable or .env file.
  • Token storage in browser: httpOnly, Secure, SameSite=Strict cookies only. Never localStorage (vulnerable to XSS). Never query parameters (appear in server logs).
  • Access token lifetime: 15 minutes.
  • Refresh token lifetime: 24 hours for operator/analyst; 8 hours for admin.
  • Refresh token rotation with family reuse detection (Finding 5): Invalidate the old token on every refresh. Tokens belong to a family_id (UUID assigned at first issuance). If a token from a superseded generation within a family is presented — i.e. it was already rotated and a newer token in the same family exists — the entire family is immediately revoked, logged as REFRESH_TOKEN_REUSE (HIGH severity), and an email alert is sent to the user ("Suspicious login detected — all sessions revoked"). This detects refresh token theft: the legitimate user retries after the attacker consumed the token first, causing the reuse to surface. The refresh_tokens table includes family_id UUID NOT NULL and superseded_at TIMESTAMPTZ (set when a new token replaces this one in rotation).
  • Refresh token storage: refresh_tokens table in the database (see §9.2). This enables server-side revocation — Redis-only storage loses revocations on restart.
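The family reuse decision reduces to a three-way branch on the presented token's row. A minimal sketch, assuming an illustrative record shape; revocation, logging, and the user email are represented only by the returned action name:

```python
# Sketch of refresh-token family reuse detection. token_row mirrors a
# refresh_tokens record (shape is an assumption); side effects omitted.
def handle_refresh(token_row: dict) -> str:
    """Return the action the auth layer takes for a presented refresh token."""
    if token_row["revoked"]:
        return "reject"
    if token_row["superseded_at"] is not None:
        # A newer token exists in this family_id: the presented token was
        # already rotated. Revoke the whole family, log REFRESH_TOKEN_REUSE
        # (HIGH severity), email the user.
        return "revoke_family"
    return "rotate"   # normal path: issue successor, set superseded_at
```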

Multi-Factor Authentication (MFA)

TOTP-based MFA (RFC 6238) is required for all roles from Phase 1. Implementation:

  • On first login after account creation, user is presented with TOTP QR code (via pyotp) and required to verify before completing registration
  • Recovery codes (8 × 10-character alphanumeric) generated at setup; stored as bcrypt hashes in users.mfa_recovery_codes
  • MFA bypass via recovery code is logged as a security event (MEDIUM alert to admins)
  • MFA is enforced at the JWT issuance step — tokens are not issued until MFA is verified
  • Failed MFA attempts after 5 consecutive failures trigger a 30-minute account lockout and a MEDIUM alert

SSO / Identity Provider Abstraction

"Integrate with SkyNav SSO later" cannot remain a deferred decision. The auth layer must be designed as a pluggable provider from the start:

class AuthProvider(Protocol):
    async def authenticate(self, credentials: Credentials) -> User: ...
    async def issue_tokens(self, user: User) -> TokenPair: ...
    async def revoke(self, refresh_token: str) -> None: ...

class LocalJWTProvider(AuthProvider): ...   # Phase 1: local JWT + TOTP
class OIDCProvider(AuthProvider): ...       # Phase 3: OIDC/SAML SSO

All endpoint logic depends on AuthProvider — switching from local JWT to OIDC requires no endpoint changes.


7.4 API Security

Rate Limiting

Implemented with slowapi (Redis token bucket). Limits are per-user for authenticated endpoints, per-IP for auth endpoints:

| Endpoint | Limit | Window |
|---|---|---|
| POST /token (login) | 10 per IP | 1 minute; exponential backoff after 5 failures |
| POST /token/refresh | 30 per user | 1 hour |
| POST /decay/predict | 10 per user | 1 hour |
| POST /conjunctions/screen | 5 per user | 1 hour |
| POST /reports | 20 per user | 1 day |
| WS /ws/events connection attempts | 10 per user | 1 minute |
| General authenticated read endpoints | 300 per user | 1 minute |
| General unauthenticated (if any) | 60 per IP | 1 minute |

Rate limit headers returned on every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
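The semantics slowapi enforces (via its Redis backend) are those of a token bucket. The following in-memory, single-process sketch is for illustration only, not the production limiter:

```python
# Minimal in-memory token bucket illustrating the per-user limit semantics.
# Production uses slowapi's Redis-backed limiter, not this class.
import time


class TokenBucket:
    def __init__(self, limit: int, window_s: float):
        self.capacity = limit
        self.tokens = float(limit)
        self.rate = limit / window_s          # refill rate, tokens per second
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill continuously over the window."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket of `TokenBucket(10, 60)` models the login endpoint's 10-per-minute limit: bursts up to 10 are admitted, then requests drain in at one per 6 seconds.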

Simulation Parameter Validation

All physical parameters must be validated against their physically meaningful ranges before a simulation job is accepted. Type validation alone is insufficient — NRLMSISE-00 will silently produce garbage for out-of-range inputs without raising an error:

class DecayPredictParams(BaseModel):
    f107: float = Field(..., ge=65.0, le=300.0,
        description="F10.7 solar flux (sfu). Physically valid: 65–300.")
    ap: float = Field(..., ge=0.0, le=400.0,
        description="Geomagnetic Ap index. Valid: 0–400.")
    mc_samples: int = Field(..., ge=10, le=1000,
        description="Monte Carlo sample count. Server cap: 1000 regardless of input.")
    bstar_uncertainty_pct: float = Field(..., ge=0.0, le=50.0)

    @validator('mc_samples')
    def cap_mc_samples(cls, v):
        return min(v, 1000)  # Server-side cap regardless of submitted value

Server-Side Request Forgery (SSRF) Mitigation

The Ingest module fetches from five external sources. These URLs must be:

  • Hardcoded constants in ingest/sources.py — never loaded from user input, API parameters, or database values
  • Fetched via an HTTP client configured with an allowlist of expected IP ranges per source; connections to private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, ::1, fc00::/7) are blocked at the HTTP client layer
ALLOWED_HOSTS = {
    "www.space-track.org": ["18.0.0.0/8"],   # approximate; update with actual ranges
    "celestrak.org": [...],
    "swpc.noaa.gov": [...],
    "discosweb.esoc.esa.int": [...],
    "maia.usno.navy.mil": [...],
}
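The private-range block can be implemented with the stdlib `ipaddress` module, checked against each resolved address before the HTTP client connects. A minimal sketch over exactly the ranges listed above:

```python
# Sketch of the SSRF private-range guard applied at the HTTP client layer.
import ipaddress

_BLOCKED = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "::1/128", "fc00::/7",
)]


def is_blocked(resolved_ip: str) -> bool:
    """True if a resolved ingest address must be refused before connecting."""
    addr = ipaddress.ip_address(resolved_ip)
    return any(addr in net for net in _BLOCKED)
```

The check must run on the resolved address (post-DNS), not the hostname, so a DNS rebind to a private address is still caught.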

CZML and CZML Injection

Object names and descriptions sourced from Space-Track are interpolated into CZML documents and ultimately rendered in CesiumJS. A malicious object name containing <script> or CesiumJS-specific injection must be sanitised:

  • HTML-encode all string fields from external sources before inserting into CZML
  • CesiumJS evaluates CZML description fields as HTML in info boxes — treat as untrusted HTML; use DOMPurify on the client before passing to CesiumJS description properties
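On the server side, the HTML-encoding step for external strings is simply stdlib escaping before CZML interpolation (the client still runs DOMPurify on description HTML as a second layer); the helper name below is illustrative:

```python
# Sketch of server-side HTML-encoding for externally sourced strings
# (object names, descriptions) before they are inserted into CZML documents.
import html


def czml_safe(value: str) -> str:
    """HTML-encode an untrusted string, including quotes, for CZML fields."""
    return html.escape(value, quote=True)
```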

NOTAM Draft Content Sanitisation (Finding 10)

NOTAM drafts are templated from prediction data, object names, and operator-supplied fields. Object names originate from Space-Track and from manual POST /objects input. ICAO plain-text format is vulnerable to special-character injection and, if the draft is ever rendered to PDF by the Playwright renderer, to XSS.

import logging
import re

logger = logging.getLogger(__name__)

_ICAO_SAFE = re.compile(r"[^A-Z0-9\-_ /]")

def sanitise_icao(value: str, field_name: str = "field") -> str:
    """
    Strip characters outside ICAO plain-text safe set before NOTAM template interpolation.

    Args:
        value: Raw string from user input or external source.
        field_name: Field identifier for logging if value is modified.

    Returns:
        Sanitised string safe for ICAO plain-text insertion.
    """
    upper = value.upper()
    sanitised = _ICAO_SAFE.sub("", upper)
    if sanitised != upper:
        logger.info("sanitise_icao: modified %s field", field_name)
    return sanitised or "[REDACTED]"

Rules:

  • sanitise_icao() is called on every user-sourced field before interpolation into NOTAM_TEMPLATE.format(...)
  • TLE remarks fields are stripped entirely from NOTAM output (not an ICAO-relevant field)
  • NOTAM template uses str.format() with named arguments, not f-strings with raw variables
  • sanitise_icao is listed in AGENTS.md as a security-critical function — any change requires a dedicated security review

7.5 Secrets Management

"All secrets via environment variables" is a development-only posture.

Development: .env file. Never committed. .gitignore must include .env, .env.*.

Production: Docker secrets (Compose secrets: stanza) for Phase 1 production deployment; HashiCorp Vault or cloud-provider secrets manager (AWS Secrets Manager, GCP Secret Manager) for Phase 3.

Secrets rotation schedule:

| Secret | Rotation Frequency | Method |
|---|---|---|
| JWT RS256 private key | 90 days | Key ID in JWT header; both old and new keys valid during 24h rotation window |
| Space-Track.org credentials | 90 days | Space-Track account supports credential rotation; coordinated with ops team |
| Database password | 90 days | Dual-credential rotation (see procedure below); zero-downtime |
| Redis ACL passwords (backend, worker, ingest) | 90 days | Update ACL password via redis-cli ACL SETUSER; restart dependent services with new env var; old password invalid immediately |
| MinIO access key | 90 days | MinIO admin API |
| Cesium ion access token | NOT A SECRET | Public browser credential — shipped in NEXT_PUBLIC_CESIUM_ION_TOKEN. Read via Ion.defaultAccessToken = process.env.NEXT_PUBLIC_CESIUM_ION_TOKEN. Do not proxy through the backend. Do not store in Docker secrets or Vault. Rotate only if the token is explicitly revoked on cesium.com. |

Database password rotation procedure — a hard PgBouncer restart drops idle connections cleanly but kills active transactions. Use the drain-then-swap sequence instead:

  1. Update Postgres role (new password valid immediately; old password still in PgBouncer config): ALTER ROLE spacecom_app PASSWORD 'new_secret';
  2. Drain PgBouncer — issue PAUSE pgbouncer;. New connections queue; existing transactions complete. Timeout: 30s (if not drained, proceed and accept brief 503s).
  3. Update PgBouncer config with new password, then RESUME pgbouncer;. Application connections resume using new password.
  4. Verify ingest and API within 5 minutes — GET /admin/ingest-status and GET /readyz must return 200.
  5. Confirm old-credential revocation after a 15-minute grace period — the ALTER ROLE in step 1 already replaced the password (re-running it is a no-op), and any sessions authenticated with the old credential completed during the drain.
  6. Rotate Patroni replication credentials separatelypatronictl reload with updated postgresql.parameters.hba_file; does not affect application connections.

Full runbook: docs/runbooks/db-password-rotation.md.

Anti-patterns — enforced by git-secrets pre-commit hook and CI scan:

  • No secrets in requirements.txt, docker-compose.yml, Dockerfile, source files, or logs
  • Secret patterns (AWS keys, private key headers, connection strings) trigger CI failure
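A minimal sketch of the kind of pattern matching the CI scan performs for the three pattern families named above. The regexes are illustrative examples, not the full git-secrets rule set:

```python
import re

# Illustrative subset of the scanned pattern families: AWS access key IDs,
# PEM private key headers, and connection strings with embedded credentials.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                  # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"postgres(ql)?://[^:\s]+:[^@\s]+@"),  # DSN with password
]

def find_secrets(text: str) -> list[str]:
    """Return every substring that matches a known secret pattern."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]
```

CI fails the build when `find_secrets` returns a non-empty list for any staged file.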

7.6 Transport Security

External-facing:

  • HTTPS only. HTTP → HTTPS 301 redirect.
  • Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
  • TLS 1.2 minimum; TLS 1.3 preferred. Disable TLS 1.0, 1.1, SSLv3.
  • Cipher suite: Mozilla "Intermediate" configuration or better.
  • WebSocket connections: wss:// only. The ws.ts client enforces this.

Internal service communication:

  • Backend → DB: PostgreSQL TLS with client certificate verification
  • Backend → Redis: Redis 7 TLS mode (tls-port, tls-cert-file, tls-key-file, tls-ca-cert-file)
  • Backend → MinIO: HTTPS (MinIO production mode requires TLS)
  • Backend → Renderer: HTTPS on internal Docker network; renderer does not accept connections from any other service

Certificate management:

  • Production: Let's Encrypt via Caddy (auto-renewal, OCSP stapling)
  • Certificate expiry monitored: alert 30 days before expiry via cert-manager or custom Celery task

7.7 Content Security Policy and Security Headers

SpaceCom uses two distinct CSP tiers because CesiumJS requires 'unsafe-eval' (GLSL shader compilation) — a directive that would be unacceptable on non-globe routes.

Tier 1 — Non-globe routes (login, settings, admin, API responses):

Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob:;
  connect-src 'self' wss://[domain];
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), camera=(), microphone=()

Tier 2 — Globe routes (app/(globe)/ — all routes under the (globe) layout group only):

Content-Security-Policy:
  default-src 'self';
  script-src 'self' 'unsafe-eval' https://cesium.com;
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob: https://*.cesium.com https://*.openstreetmap.org;
  connect-src 'self' wss://[domain] https://cesium.com https://api.cesium.com;
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Implementation in next.config.ts:

// next.config.ts — CSP_STANDARD and CSP_GLOBE hold the two policy strings above
const headers = async () => [
  {
    source: '/((?!dashboard|monitor).*)',  // non-globe routes
    headers: [{ key: 'Content-Security-Policy', value: CSP_STANDARD }],
  },
  {
    source: '/(dashboard|monitor)(.*)',    // globe routes — unsafe-eval allowed
    headers: [{ key: 'Content-Security-Policy', value: CSP_GLOBE }],
  },
];

'unsafe-eval' is required by CesiumJS for runtime GLSL shader compilation. Scope it only to globe routes. This is a known, documented exception — it must never appear in the standard-tier CSP.

'unsafe-inline' for style-src is also required by CesiumJS and appears in both tiers. It must not be used for script-src in the standard tier.

Renderer page CSP (the headless Playwright context, which must be the most restrictive):

Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self';
  img-src 'self' data: blob:;
  connect-src 'none';
  frame-ancestors 'none';

7.8 WebSocket Security

WS /ws/events authentication:

  • JWT token must be verified at connection establishment (HTTP Upgrade request)
  • Browser WebSocket APIs cannot send custom headers — use the httpOnly auth cookie (set by the login flow) which is automatically sent with the Upgrade request; verify it in the WebSocket handshake handler
  • Do not accept tokens via query parameters (?token=...) — they appear in server access logs

Connection management:

  • Per-user concurrent connection limit: 5. Enforced in the upgrade handler by checking a Redis counter.
  • Server-side ping every 30 seconds; close connections that do not respond within 60 seconds
  • All incoming WebSocket messages (if bidirectional) validated against a JSON schema before processing
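The per-user cap can be enforced with an optimistic increment-then-check against the Redis counter. A sketch, using an in-memory stand-in for Redis (INCR/DECR are atomic in real Redis, which is what makes the claim-then-rollback safe under concurrency; the helper names are assumptions):

```python
MAX_CONNECTIONS_PER_USER = 5

class InMemoryCounter:
    """Stand-in for the Redis counter used in the upgrade handler."""

    def __init__(self):
        self.values = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def decr(self, key):
        self.values[key] = self.values.get(key, 0) - 1
        return self.values[key]

def try_register_connection(counter, user_id: int) -> bool:
    """Optimistically claim a slot at upgrade time; roll back when over the cap."""
    key = f"ws:conn:{user_id}"
    if counter.incr(key) > MAX_CONNECTIONS_PER_USER:
        counter.decr(key)  # roll back so the counter stays accurate
        return False
    return True
```

On connection close the handler decrements the same key, freeing the slot.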

7.9 Data Integrity

This is the most important security property of the system. Predictions that drive aviation safety decisions must be trustworthy and tamper-evident.

HMAC Signing of Predictions

Every row written to reentry_predictions and hazard_zones is signed at creation time with an application-secret HMAC:

import hmac, hashlib, json

def sign_prediction(prediction: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

HMAC signing race fix (F4 — §67): If reentry_predictions.id is a DB-assigned BIGSERIAL, the application must INSERT first (to get the id), then compute the HMAC using that id, then UPDATE the row — a two-phase write. Between the INSERT and the UPDATE there is a brief window where a valid prediction row exists with an empty record_hmac, which the nightly HMAC verification job (§10.2) would flag as a violation.

Fix: Use UUID as the primary key (DEFAULT gen_random_uuid()) and assign the UUID in the application before the INSERT. The application pre-generates the UUID, computes the HMAC against the full prediction dict including that UUID, then inserts the complete row in a single write:

import uuid

def write_prediction_to_db(prediction: dict):
    prediction_id = str(uuid.uuid4())
    prediction['id'] = prediction_id
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    # Single INSERT — no two-phase write; no race window
    db.execute(text("""
        INSERT INTO reentry_predictions (id, object_id, ..., record_hmac)
        VALUES (:id, :object_id, ..., :record_hmac)
    """), prediction)

Migration: ALTER TABLE reentry_predictions ALTER COLUMN id TYPE UUID USING gen_random_uuid(); ALTER TABLE reentry_predictions ALTER COLUMN id SET DEFAULT gen_random_uuid();. Note that USING gen_random_uuid() assigns fresh UUIDs to existing rows, so the FK references (alert_events.prediction_id, prediction_outcomes.prediction_id) must be re-pointed in the same transaction via an old-id → new-UUID mapping table. Include in the next schema migration (alembic revision --autogenerate for the type change, plus a hand-written data migration for the FK remapping).

The HMAC is stored in a record_hmac column. Before serving any prediction to a client, the backend verifies the HMAC. A failed verification:

  • Is logged as a security event (CRITICAL alert to admins)
  • Results in the prediction being marked integrity_failed = TRUE
  • The prediction is not served; the API returns a 503 with a message directing the user to contact the system administrator
  • The Event Detail page displays ✗ HMAC verification failed and a warning banner
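Verification before serving is the mirror of signing. A sketch, repeating sign_prediction from above for self-containment (the verify_prediction helper name is an assumption):

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def sign_prediction(prediction: dict, secret: bytes) -> str:
    """Canonical-payload HMAC over the safety-critical fields (as above)."""
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

def verify_prediction(prediction: dict, secret: bytes) -> bool:
    """Constant-time comparison of the stored HMAC against a recomputed one."""
    expected = sign_prediction(prediction, secret)
    return hmac.compare_digest(expected, prediction["record_hmac"])
```

hmac.compare_digest avoids timing side channels that a plain `==` comparison would leak.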

Prediction Immutability

Once written, prediction records must not be modified:

CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reentry_predictions_immutable
  BEFORE UPDATE OR DELETE ON reentry_predictions
  FOR EACH ROW EXECUTE FUNCTION prevent_prediction_modification();

Apply the same trigger to hazard_zones.

HMAC Key Rotation Procedure (Finding 1)

The immutability trigger blocks all UPDATEs on reentry_predictions, including legitimate HMAC re-signing during key rotation. The rotation path must be explicit and auditable:

Schema additions to reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN rotated_at TIMESTAMPTZ,
  ADD COLUMN rotated_by INTEGER REFERENCES users(id);

Parameterised immutability trigger — allows UPDATE only on record_hmac when the session flag is set by the privileged hmac_admin role:

CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
  -- Allow HMAC-only rotation when the flag is set by the hmac_admin role.
  -- Every column except the HMAC and the rotation bookkeeping fields must
  -- be unchanged — checked by comparing the row images with those fields removed.
  IF TG_OP = 'UPDATE'
     AND current_setting('spacecom.hmac_rotation', TRUE) = 'true'
     AND NEW.record_hmac IS DISTINCT FROM OLD.record_hmac
     AND (to_jsonb(NEW) - 'record_hmac' - 'rotated_at' - 'rotated_by')
       = (to_jsonb(OLD) - 'record_hmac' - 'rotated_at' - 'rotated_by')
  THEN
    RETURN NEW;
  END IF;
  RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

hmac_admin database role: A dedicated hmac_admin Postgres role is the only role permitted to SET LOCAL spacecom.hmac_rotation = true. The backend application role does not have this privilege. The rotation script connects as hmac_admin, sets the flag per-transaction, re-signs each row, and commits. Every changed row is logged to security_logs as event type HMAC_ROTATION.

Dual sign-off: The rotation script must be run with two operators present. The runbook records the initiating operator's user ID in the rotated_by column, and the second operator independently verifies that a random sample of re-signed HMACs match the new key before the script is considered complete.

The HMAC rotation runbook lives at docs/runbooks/hmac-key-rotation.md and cross-references the zero-downtime JWT keypair rotation runbook for the dual-key validity window.

Append-Only alert_events

CREATE OR REPLACE FUNCTION prevent_alert_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'alert_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER alert_events_immutable
  BEFORE UPDATE OR DELETE ON alert_events
  FOR EACH ROW EXECUTE FUNCTION prevent_alert_modification();

Cross-Source Validation

Do not silently trust a single data source:

  • TLE cross-validation: When the same NORAD ID is received from both Space-Track and CelesTrak within a 6-hour window, compare the key orbital elements. If they differ by more than a defined threshold (e.g., semi-major axis > 1 km, inclination > 0.01°), flag for human review rather than silently using one.
  • All-clear double check: A prediction record showing no hazard for an object that has an active TIP message triggers an integrity alert. A single-source all-clear cannot override a TIP message.
  • Space weather cross-validation: Ingest F10.7 from both NOAA SWPC and ESA Space Weather Service. If they disagree by > 20%, alert and use the more conservative (higher) value until the discrepancy resolves.
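The three checks above reduce to small comparison helpers. A sketch using the thresholds from the bullets; the function names and dict keys are illustrative:

```python
# Thresholds from the cross-validation rules above
SMA_THRESHOLD_KM = 1.0
INC_THRESHOLD_DEG = 0.01
F107_DISCREPANCY = 0.20

def tle_sources_agree(a: dict, b: dict) -> bool:
    """Compare key orbital elements from two catalogue sources; False -> human review."""
    return (
        abs(a["semi_major_axis_km"] - b["semi_major_axis_km"]) <= SMA_THRESHOLD_KM
        and abs(a["inclination_deg"] - b["inclination_deg"]) <= INC_THRESHOLD_DEG
    )

def choose_f107(noaa: float, esa: float) -> tuple[float, bool]:
    """Return (value to use, alert needed). On >20% disagreement, alert and
    use the more conservative (higher) value until the discrepancy resolves."""
    discrepant = abs(noaa - esa) / max(noaa, esa) > F107_DISCREPANCY
    return (max(noaa, esa) if discrepant else noaa), discrepant
```

When `choose_f107` flags a discrepancy, the all-clear rule still applies: a single-source all-clear never overrides an active TIP message.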

IERS EOP Integrity

The weekly IERS Bulletin A download must be verified before application:

IERS_BULLETIN_A_SHA256 = {
    # Updated by the ops workflow after each weekly IERS publication —
    # the file changes weekly, so a single hash cannot be pinned long-term;
    # each registered value is checked against the IERS publication notice
    "finals2000A.all": "expected_hash_here",
}
# If the hash check fails, the existing EOP table is retained; a MEDIUM alert is generated
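The hash gate itself is a one-line comparison plus an apply-or-retain decision. A sketch under stated assumptions (the function names and the decision tuple shape are illustrative):

```python
import hashlib

def verify_eop_download(content: bytes, expected_sha256: str) -> bool:
    """True when the downloaded finals2000A.all matches the registered hash."""
    return hashlib.sha256(content).hexdigest() == expected_sha256

def eop_apply_decision(content: bytes, expected_sha256: str):
    """Return ("apply", None) on match; otherwise retain the existing EOP
    table and raise a MEDIUM alert, per the rule above."""
    if verify_eop_download(content, expected_sha256):
        return ("apply", None)
    return ("retain_existing", "MEDIUM")
```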

alert_events HMAC integrity (F9): alert_events records are safety-critical audit evidence (UN Liability Convention, ICAO). They carry the same HMAC protection as reentry_predictions:

def sign_alert_event(event: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": event["id"],
        "object_id": event["object_id"],
        "organisation_id": event["organisation_id"],
        "level": event["level"],
        "trigger_type": event["trigger_type"],
        "created_at": event["created_at"].isoformat(),
        "acknowledged_by": event["acknowledged_by"],
        "action_taken": event.get("action_taken"),
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

Nightly integrity check (Celery Beat, 02:00 UTC):

from datetime import datetime, timedelta, timezone

@celery.task
def verify_alert_event_hmac():
    """Re-verify the HMAC on all alert_events created in the past 24 hours."""
    since = datetime.now(timezone.utc) - timedelta(hours=24)
    rows = db.execute(
        text("SELECT id FROM alert_events WHERE created_at >= :since"),
        {"since": since},
    ).fetchall()
    for row in rows:
        event = db.get(AlertEvent, row.id)
        expected = sign_alert_event(event.__dict__, HMAC_SECRET)
        if not hmac.compare_digest(expected, event.record_hmac):
            log_security_event("ALERT_EVENT_HMAC_FAILURE", {"event_id": row.id})
            alert_admin_critical(f"alert_events HMAC integrity failure: id={row.id}")

Database timezone enforcement (F2): PostgreSQL TIMESTAMPTZ stores internally in UTC, but ORM connections can silently apply server or session timezone offsets. All timestamps must remain UTC end-to-end:

# database.py — connection pool creation
from sqlalchemy import event, text

@event.listens_for(engine.sync_engine, "connect")
def set_timezone(dbapi_conn, connection_record):
    cursor = dbapi_conn.cursor()
    cursor.execute("SET TIME ZONE 'UTC'")
    cursor.close()

Integration test (tests/test_db_timezone.py — BLOCKING):

def test_timestamps_round_trip_as_utc(db_session):
    """Ensure ORM never silently converts UTC timestamps to local time."""
    known_utc = datetime(2026, 3, 22, 14, 0, 0, tzinfo=timezone.utc)
    obj = ReentryPrediction(p50_reentry_time=known_utc, ...)
    db_session.add(obj)
    db_session.flush()
    db_session.refresh(obj)
    assert obj.p50_reentry_time == known_utc
    assert obj.p50_reentry_time.tzinfo == timezone.utc

Any non-UTC representation of a timestamp is a display-layer concern only — never stored or transmitted as local time.


7.10 Infrastructure Security

Container Hardening

Applied to all service Dockerfiles and Compose definitions:

# Applied to all services
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp:size=256m,mode=1777
user: "1000:1000"   # non-root; created in Dockerfile as: RUN useradd -r -u 1000 appuser
cap_drop:
  - ALL
cap_add: []         # No capabilities added; NET_BIND_SERVICE not needed if ports > 1024

Renderer container — most restrictive:

renderer:
  security_opt:
    - no-new-privileges:true
    - seccomp:renderer-seccomp.json   # Custom seccomp profile for Chromium
  networks:
    - renderer_net      # internal-only network; reaches nothing but the backend API
  read_only: true
  tmpfs:
    - /tmp:size=512m    # Playwright needs /tmp
    - /home/appuser:size=256m  # Chromium profile directory
  cap_drop:
    - ALL
  cap_add:
    - SYS_ADMIN         # Required by Chromium sandbox; document this explicitly

SYS_ADMIN for Chromium is a known requirement. Mitigate by ensuring the renderer container has no network access to anything other than the backend internal API, and by setting a strict seccomp profile.

Redis Authentication and ACLs

# redis.conf (production)
requirepass ""          # Disabled; use ACL only
aclfile /etc/redis/users.acl

# users.acl
user backend on >[backend_password] ~* &* +@all -@dangerous
user worker on >[worker_password] ~celery:* &celery:* +RPUSH +LPOP +LLEN +SUBSCRIBE +PUBLISH +XADD +XREAD
user default off        # Disable default user

MinIO Bucket Policies

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::*"
  }]
}

All buckets are private. Report downloads use 5-minute pre-signed URLs (reduced from 15 minutes — user downloads immediately). Pre-signed URL generation is logged to security_logs (event type PRESIGNED_URL_GENERATED) with user_id, object_key, expires_at, and client_ip — this creates an audit trail of who obtained access to which object.

MC blob access — server-side proxy (Finding 2): Simulation trajectory blobs (MC samples) must not be served as direct pre-signed MinIO URLs to the browser. Instead, the visualiser calls GET /viz/mc-trajectories/{simulation_id} which the backend fetches from MinIO server-side and streams to the authenticated client. This keeps MinIO URLs entirely off the client and prevents URL sharing or exfiltration. The backend enforces the requesting user's organisation matches the simulation's organisation_id before proxying.


7.11 Playwright Renderer Security

The renderer is the highest attack-surface component. It runs a real browser on the server.

Isolation: The renderer service runs in its own container on renderer_net. It accepts HTTPS connections only from the backend's internal IP. It makes no outbound connections beyond backend:8000 (enforced by network segmentation + Playwright request interception — see below).

Data flow: The renderer receives only a report_id (integer) from the backend job queue. It constructs the report URL internally as http://backend:8000/reports/{report_id}/preview — user-supplied values are never interpolated into the URL. The report_id is validated as a positive integer before use. The renderer has no access to the database, Redis, or MinIO directly.

Playwright request interception (Finding 4) — allowlist, not blocklist:

from playwright.async_api import Page, Route

async def setup_request_interception(page: Page) -> None:
    """Block any Playwright navigation to hosts other than the backend."""
    async def handle_route(route: Route) -> None:
        url = route.request.url
        if not url.startswith("http://backend:8000/"):
            await route.abort("blockedbyclient")
        else:
            await route.continue_()
    await page.route("**/*", handle_route)

This is a defence-in-depth layer: even if a bug causes the renderer to receive a crafted URL, the interception handler prevents navigation to any external or internal host outside backend:8000.

Input sanitisation before reaching the renderer:

import bleach

ALLOWED_TAGS = []  # No HTML allowed in user-supplied report fields
ALLOWED_ATTRS = {}

def sanitise_report_field(value: str) -> str:
    """Strip all HTML from user-supplied strings before renderer interpolation."""
    return bleach.clean(value, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)

Report template: The renderer loads a report template from the local filesystem (bundled in the container image). It does not fetch templates from URLs or the database. User-supplied content is inserted via a strict templating engine (Jinja2 with autoescape=True).

Timeouts: Report generation has a hard 30-second timeout. Playwright's page.goto() timeout set to 10 seconds. If the timeout is exceeded, the job fails with a clear error — the renderer does not hang indefinitely.

No dangerouslySetInnerHTML: The report React template must never use dangerouslySetInnerHTML. All text insertion via {value} (React's built-in escaping).


7.12 Compute Resource Governance

| Limit | Value | Enforcement |
| --- | --- | --- |
| mc_samples maximum | 1000 | Pydantic validator at API layer; also re-validated inside the Celery task body (Finding 3) |
| Concurrent simulations per user | 3 | Checked against simulations table before job acceptance; returns 429 if exceeded |
| Pending jobs per user | 10 | Checked at submission time |
| Decay prediction CPU time limit | 300 s | Celery time_limit=300, soft_time_limit=270 |
| Breakup simulation CPU time limit | 600 s | Celery time_limit=600, soft_time_limit=570 |
| Ephemeris response points maximum | 100,000 | Enforced by calculating (end - start) / step; returns 400 if exceeded with a message to reduce range or increase step |
| CZML document size | 50 MB | Streaming response with max size enforced; client must paginate for larger ranges |
| WebSocket connections per user | 5 | Redis counter checked at upgrade time |
| Simulation workers | Separate Celery worker pool from ingest workers | Prevents runaway simulations from starving TLE/space-weather ingestion |
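The ephemeris point-count guard can be computed before any propagation work is scheduled. A sketch; the exception type that the API layer maps to a 400 is an assumption:

```python
from datetime import datetime, timedelta

MAX_EPHEMERIS_POINTS = 100_000

def ephemeris_point_count(start: datetime, end: datetime, step: timedelta) -> int:
    """Apply the (end - start) / step cap from the governance table.

    Raises ValueError (mapped to HTTP 400 by the API layer) when the cap
    is exceeded or the arguments are degenerate.
    """
    if end <= start or step <= timedelta(0):
        raise ValueError("end must be after start and step must be positive")
    points = int((end - start) / step)  # fencepost: +1 if both endpoints are sampled
    if points > MAX_EPHEMERIS_POINTS:
        raise ValueError(
            f"{points} points exceeds the {MAX_EPHEMERIS_POINTS} cap; "
            "reduce the time range or increase the step"
        )
    return points
```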

Celery task-layer validation (Finding 3): Celery tasks are callable directly via Redis write (e.g., by a compromised worker), bypassing the API layer entirely. Every task function must validate its own arguments independently of the API endpoint:

from functools import wraps

from pydantic import ValidationError

def validate_task_args(validator_class):
    """Decorator: re-validate task kwargs using the same Pydantic model as the API endpoint."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                validator_class(**kwargs)
            except ValidationError as exc:
                raise ValueError(f"Task arg validation failed: {exc}") from exc
            return func(*args, **kwargs)
        return wrapper
    return decorator

@app.task(bind=True)
@validate_task_args(DecayPredictParams)
def run_mc_decay_prediction(self, *, norad_id: int, f107: float, ap: float, mc_samples: int, ...):
    ...

ValueError raised inside a Celery task is treated as a non-retryable failure — the task goes to the dead-letter queue and does not silently drop. This applies to all simulation and prediction tasks. Document in AGENTS.md: "Task functions are a security boundary. Validate all task arguments inside the task body."

Orphaned job recovery (Celery Beat task): A Celery worker killed mid-execution (OOM, pod eviction, container restart) leaves its job in status = 'running' indefinitely unless a cleanup task intervenes. Add a Celery Beat periodic task that runs every 5 minutes:

from datetime import datetime, timezone

from sqlalchemy import func, text

@app.task
def recover_orphaned_jobs():
    """Mark jobs stuck in 'running' beyond 2× their estimated duration as failed."""
    orphans = (
        db.query(Job)
        .filter(
            Job.status == "running",
            Job.started_at < func.now() - (
                func.coalesce(Job.estimated_duration_seconds, 600) * 2
            ) * text("interval '1 second'"),
        )
        .all()
    )
    for job in orphans:
        job.status = "failed"
        job.error_code = "PRESUMED_DEAD"
        job.error_message = "Worker did not complete within 2× estimated duration"
        job.completed_at = datetime.now(timezone.utc)
    db.commit()

Integration test (tests/test_jobs/test_celery_failure.py): set a job to status='running' with started_at = NOW() - 1200s and estimated_duration_seconds = 300; run the Beat task; assert status = 'failed' and error_code = 'PRESUMED_DEAD'.
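The orphan condition that the Beat task and the integration test share can be isolated as a pure predicate, which makes the 2× rule unit-testable without a database. A sketch; the helper name is an assumption:

```python
from datetime import datetime, timedelta

ORPHAN_FACTOR = 2
DEFAULT_ESTIMATE_S = 600  # fallback when estimated_duration_seconds is NULL

def is_orphaned(started_at: datetime, estimated_duration_seconds, now: datetime) -> bool:
    """True when a 'running' job has exceeded 2x its estimated duration."""
    estimate = estimated_duration_seconds or DEFAULT_ESTIMATE_S
    return (now - started_at) > timedelta(seconds=ORPHAN_FACTOR * estimate)
```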


7.13 Supply Chain and Dependency Security

Python dependency pinning:

All dependencies pinned with exact versions and hashes using pip-tools:

# requirements.in → pip-compile → requirements.txt with hashes
fastapi==0.111.0 --hash=sha256:...

Install with pip install --require-hashes -r requirements.txt in all Docker builds.

Node.js: package-lock.json committed and npm ci used in Docker builds (not npm install).

Base images: All FROM statements use pinned digest tags:

FROM python:3.12.3-slim@sha256:abc123...

Never FROM python:3.12-slim (floating tag).

PyPI index trust policy — dependency confusion protection:

All Python packages must be fetched from a controlled index, not directly from public PyPI without restrictions. Configure pip.conf mounted into all build containers:

# pip.conf (mounted at /etc/pip.conf in builder stage)
[global]
index-url = https://pypi.internal.spacecom.io/simple/
# Proxy mode: passes through to PyPI but logs and scans before serving
# extra-index-url is intentionally absent — no fallback to raw public PyPI

For Phase 1 (no internal proxy available): register all spacecom-* package names on public PyPI as empty stubs to prevent dependency confusion squatting. Document in docs/adr/0019-pypi-index-trust.md.

Automated scanning (CI pipeline):

| Tool | Target | Trigger | Notes |
| --- | --- | --- | --- |
| pip-audit | Python dependencies | Every PR; blocks on High/Critical | Queries the PyPA Advisory Database; lower false-positive rate than OWASP DC for Python |
| npm audit | Node.js dependencies | Every PR; blocks on High/Critical | --audit-level=high; run after npm ci |
| Trivy | Container images | Every PR; blocks on Critical/High | .trivyignore applied (see below); JSON output archived |
| Bandit | Python source code | Every PR; blocks on High severity | |
| ESLint security plugin | TypeScript source | Every PR | |
| pip-licenses | Python transitive deps | Every PR; blocks on GPL/AGPL | GPL/AGPL deny-list via --fail-on; CesiumJS (npm) is exempted in the npm step |
| license-checker-rseidelsohn | npm transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted (documented commercial licence); other AGPL packages require approval |
| Renovate Bot | Docker image digests + all deps | Weekly PRs; digest PRs auto-merged if CI passes | Replaces Dependabot for Docker digest pins; Dependabot retained for GitHub Security Advisory integration |
| git-secrets + detect-secrets | All commits | Pre-commit; blocks commit on secret patterns | detect-secrets is canonical (entropy + regex); git-secrets retained for pattern matching |
| cosign verify | Container images at deploy | Every staging/production deploy | Verifies Sigstore keyless signature before pulling |

OWASP Dependency-Check is removed from the Python scanning stack — it has high false-positive rates due to CPE name mapping issues for Python packages and is superseded by pip-audit. It may be retained for future Java/Kotlin components.

Trivy configuration — .trivyignore:

# .trivyignore
# Each entry requires: CVE ID, expiry date (90-day max), and documented justification.
# Process: PR required with senior engineer approval. Expired entries fail CI.
# Format: CVE-YYYY-NNNNN  expires:YYYY-MM-DD  reason:<one-line justification>
#
# Example (do not add without process):
# CVE-2024-12345  expires:2024-12-31  reason:builder-stage only; not present in runtime image

CI check rejects entries past their expiry date:

python scripts/check_trivyignore_expiry.py .trivyignore || \
  (echo "ERROR: .trivyignore contains expired entry — review or remove" && exit 1)
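A possible shape for scripts/check_trivyignore_expiry.py, matching the entry format documented in the file header. The plan does not specify the script itself, so treat this as a sketch:

```python
import re
from datetime import date

# Entry format: CVE-YYYY-NNNNN  expires:YYYY-MM-DD  reason:<one-line justification>
ENTRY = re.compile(r"^(CVE-\d{4}-\d+)\s+expires:(\d{4}-\d{2}-\d{2})\s+reason:\S")

def expired_entries(lines, today=None):
    """Return CVE IDs whose expiry date has passed; a non-empty result fails CI."""
    today = today or date.today()
    expired = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no suppression
        m = ENTRY.match(line)
        if m and date.fromisoformat(m.group(2)) < today:
            expired.append(m.group(1))
    return expired
```

The CI wrapper exits non-zero when the returned list is non-empty, producing the error shown above.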

License scanning CI steps:

# security-scan job
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Fail on GPL/AGPL (CesiumJS has commercial licence; excluded by name in npm step)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3)"

- name: npm licence gate
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL"

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4
  with:
    name: licences-${{ github.sha }}
    path: "*.json"
    retention-days: 365

Base image digest updates — Renovate configuration:

Dependabot does not update @sha256: digest pins in Dockerfiles. Renovate's docker-digest manager handles this:

// renovate.json
{
  "extends": ["config:base"],
  "packageRules": [
    {
      "matchDatasources": ["docker"],
      "matchUpdateTypes": ["digest"],
      "automerge": true,
      "automergeType": "pr",
      "schedule": ["every weekend"],
      "commitMessageSuffix": "(base image digest update)"
    },
    {
      "matchDatasources": ["pypi"],
      "automerge": false
    }
  ],
  "github-actions": {
    "enabled": true,
    "pinDigests": true
  }
}

Digest-only updates auto-merge on passing CI. Version bumps (e.g., python:3.12python:3.13) require manual PR review. Renovate is added alongside Dependabot; Dependabot retains GitHub Security Advisory integration for Python/Node CVE PRs.


7.14 Audit and Security Logging

Security event categories (stored in security_logs table and shipped to SIEM):

| Event | Level | Retention |
| --- | --- | --- |
| Successful login | INFO | 90 days |
| Failed login (IP + user) | WARNING | 180 days |
| MFA failure | WARNING | 180 days |
| Account lockout | HIGH | 180 days |
| Token refresh | INFO | 30 days |
| Authorisation failure (403) | WARNING | 180 days |
| Admin action (user create/delete/role change) | HIGH | 1 year |
| Prediction HMAC failure | CRITICAL | 2 years |
| Alert storm detection | CRITICAL | 2 years |
| IERS EOP hash mismatch | HIGH | 1 year |
| Report generated | INFO | 1 year |
| Ingest source error | WARNING | 90 days |

Security event human-alerting matrix (Finding 7): A Grafana dashboard no one is watching provides no protection during an active attack. The following events must trigger an immediate out-of-band alert to a human (PagerDuty, email, or Slack) — not only log to the database:

| Event type | Severity | Alert channel | Response SLA |
| --- | --- | --- | --- |
| HMAC_VERIFICATION_FAILURE | CRITICAL | PagerDuty + admin email | Immediate |
| REFRESH_TOKEN_REUSE | HIGH | Email to affected user + admin email | < 5 min |
| ROLE_CHANGE_APPROVED / ROLE_CHANGE_EXPIRED | HIGH | Admin email summary | < 15 min |
| REGISTRATION_BLOCKED_SANCTIONS | HIGH | Admin email | < 15 min |
| RBAC_VIOLATION ≥ 10 events in 5 min (same user_id) | HIGH | PagerDuty | Immediate |
| INGEST_VALIDATION_FAILURE ≥ 5 events in 1 hour (same source) | MEDIUM | Admin email | < 1 hour |
| Space-Track ingest gap > 4 hours | CRITICAL | PagerDuty (cross-ref §31) | Immediate |
| Any level = CRITICAL security event | CRITICAL | PagerDuty + SIEM | Immediate |

Implemented as AlertManager rules (Prometheus security_event_total counter with event_type label) and/or direct webhook dispatch from the security_logs insert trigger. Rules defined in monitoring/alertmanager/security-rules.yml.

Space-Track credential rotation — ingest gap specification (Finding 8): Space-Track supports only one active credential set; rotation is a hard cut with no parallel-credential window. The rotation runbook at docs/runbooks/space-track-credential-rotation.md must include: (a) record last successful ingest time before starting; (b) update Docker secret and restart ingest_worker; (c) verify ingest succeeds within 10 minutes of restart (GET /admin/ingest-status shows last_success_at for Space-Track source); (d) if ingest does not resume within 10 minutes, roll back to previous credentials and raise a CRITICAL alert. The existing 4-hour ingest failure CRITICAL alert (§31) is the backstop — this runbook step reduces mean time to detect to 10 minutes.

Structured log format — all services emit JSON via structlog. Every log record must include these fields:

# backend/app/logging_config.py
REQUIRED_LOG_FIELDS = {
    "timestamp":       "ISO-8601 UTC",
    "level":           "DEBUG|INFO|WARNING|ERROR|CRITICAL",
    "service":         "backend|worker|ingest|renderer",
    "logger":          "module.path",
    "message":         "human-readable summary",
    "request_id":      "UUID | null — set for HTTP requests; propagated into Celery tasks",
    "job_id":          "UUID | null — Celery job_id when inside a task",
    "user_id":         "integer | null",
    "organisation_id": "integer | null",
    "duration_ms":     "integer | null — HTTP response time",
    "status_code":     "integer | null — HTTP responses only",
}

The sanitising formatter wraps the structlog JSON processor (strips JWT substrings, Space-Track passwords, database DSNs before the record is written). Docker log driver: json-file with max-size=100m, max-file=5 for Tier 1; forwarded to Loki via Promtail in Tier 2+.

Log sanitisation: The structlog sanitising processor runs as the final processor in the chain before emission, stripping known sensitive patterns (JWT token substrings, Space-Track password patterns, database DSN with credentials).
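A sketch of such a processor; the two regexes are illustrative (real JWTs and DSNs have more variants than shown), and the processor signature follows the standard structlog processor convention:

```python
import re

SENSITIVE_PATTERNS = [
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped token
    re.compile(r"postgres(?:ql)?://[^:\s]+:[^@\s]+@"),                  # DSN with password
]

def sanitise_event(logger, method_name, event_dict):
    """Final structlog processor: redact sensitive substrings in string values."""
    for key, value in list(event_dict.items()):
        if isinstance(value, str):
            for pattern in SENSITIVE_PATTERNS:
                value = pattern.sub("[REDACTED]", value)
            event_dict[key] = value
    return event_dict
```

Registered as the last processor before the JSON renderer, so every emitted record passes through it.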

Log integrity: Logs are shipped in real-time to an external destination (Loki in Tier 2; S3/MinIO append-only bucket or SIEM for long-term safety record retention). Logs stored only on the container filesystem are considered volatile and untrusted for security purposes.

Request ID correlation middleware — every HTTP request generates a request_id that propagates through logs, Celery tasks, and Prometheus exemplars so an on-call engineer can jump from a metric spike to the causative log line with one click:

# backend/app/middleware.py
import uuid
import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Honour an upstream-supplied ID; otherwise mint one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        structlog.contextvars.bind_contextvars(request_id=request_id)
        try:
            response = await call_next(request)
            response.headers["X-Request-ID"] = request_id
            return response
        finally:
            # Clear even if the handler raises, so the ID cannot leak into
            # log lines emitted for a subsequent request
            structlog.contextvars.clear_contextvars()

When submitting a Celery task, include request_id in task kwargs and bind it in the task preamble:

structlog.contextvars.bind_contextvars(request_id=kwargs.get("request_id"), job_id=str(self.request.id))

This links every log line from the HTTP layer through to the Celery task execution. The request_id equals the OpenTelemetry trace_id when OTel is enabled (Phase 2), giving a single correlation key across logs and traces.
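The propagation pattern can be sketched with stdlib contextvars standing in for structlog.contextvars; the function names are illustrative, and submit_task stands in for task.delay:

```python
import contextvars

# Correlation variable — structlog.contextvars wraps the same mechanism
request_id_var = contextvars.ContextVar("request_id", default=None)

def submit_task(task_fn, **kwargs):
    """HTTP layer: inject the current request_id into the task kwargs."""
    kwargs.setdefault("request_id", request_id_var.get())
    return task_fn(**kwargs)  # stand-in for task.delay(**kwargs)

def example_task(*, request_id=None, norad_id=None):
    """Task preamble: re-bind the propagated request_id in the worker context."""
    request_id_var.set(request_id)
    return {"request_id": request_id_var.get(), "norad_id": norad_id}
```

Every log line the task emits after the preamble then carries the originating HTTP request's correlation key.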

security_logs table:

CREATE TABLE security_logs (
  id BIGSERIAL PRIMARY KEY,
  logged_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL,
  event_type TEXT NOT NULL,
  user_id INTEGER,
  organisation_id INTEGER,
  source_ip INET,
  user_agent TEXT,
  resource TEXT,
  detail JSONB,
  -- Prevent tampering
  record_hash TEXT    -- SHA-256 of (logged_at || level || event_type || detail)
);
-- Append-only trigger (same pattern as alert_events)
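The record_hash computation might look like this; the "||" separator and sorted-key JSON canonicalisation are assumptions about the canonical form, not the production definition:

```python
import hashlib
import json

def security_log_record_hash(logged_at: str, level: str, event_type: str, detail: dict) -> str:
    """SHA-256 over the tamper-evident fields of a security_logs row.

    Assumes an ISO-8601 timestamp string and sorted-key JSON for `detail`
    so the hash is reproducible at verification time.
    """
    canonical = "||".join([logged_at, level, event_type, json.dumps(detail, sort_keys=True)])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A nightly job recomputing this over each row and comparing to the stored value detects any tampering that slipped past the append-only trigger.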

7.15 Security SDLC — Embedded, Not Bolted On

Security activities are integrated into every sprint from Week 1, not deferred to a Phase 3 audit.

Week 1 (mandatory before any other code):

  • RBAC schema implemented; require_role dependency applied to all router groups
  • JWT RS256 + httpOnly cookies implemented; HS256 never used
  • MFA (TOTP) implemented and required for all roles
  • CSP and security headers applied to frontend and backend
  • Docker network segmentation and container hardening applied to all services
  • Redis AUTH and ACL configured
  • MinIO: all buckets private; pre-signed URLs only
  • Dependency pinning (pip-compile) and Dependabot configured
  • git-secrets pre-commit hook installed in repo
  • Bandit and ESLint security plugin in CI; blocks merge on High severity
  • Trivy container scanning in CI; blocks merge on Critical/High
  • security_logs table and log sanitisation formatter implemented
  • Append-only DB triggers on alert_events

Phase 1 (ongoing):

  • HMAC signing implemented for reentry_predictions before decay predictor ships (Week 9)
  • Immutability triggers on reentry_predictions and hazard_zones
  • Cross-source TLE and space weather validation implemented with ingest module (Week 3-6)
  • IERS EOP hash verification implemented (Week 1)
  • Rate limiting (slowapi) configured for all endpoint groups (Week 2)
  • Simulation parameter range validation (Week 9, with decay predictor)

Phase 2:

  • OWASP ZAP DAST scan run against staging environment in the Phase 2 CI pipeline
  • Threat model document (STRIDE) reviewed and updated for Phase 2 attack surface
  • Playwright renderer: isolated container, sanitised input, timeouts, seccomp profile, Playwright request interception allowlist (Week 19-20, when reports ship)
  • NOTAM draft content sanitisation: sanitise_icao() function in reentry/notam.py applied to all user-sourced fields before NOTAM template interpolation; unit test: object name containing "><script>alert(1)</script>" produces a sanitised NOTAM draft and does not raise (Week 17-18, with NOTAM drafting feature)
  • Shadow mode RLS integration test: query reentry_predictions as viewer role with no WHERE clause; assert zero shadow rows returned
  • Refresh token family reuse detection integration test: simulate attacker consuming a rotated token; assert entire family revoked + REFRESH_TOKEN_REUSE logged
  • RLS policies reviewed and integration-tested for multi-tenancy boundary

Phase 3:

  • External penetration test by a qualified third party — scope must include: API auth bypass, privilege escalation, SSRF via ingest, XSS → Playwright escalation, WebSocket auth bypass, data integrity attacks on predictions, Redis/MinIO lateral movement
  • All Critical and High penetration test findings remediated before production go-live
  • SOC 2 Type I readiness review (if required by customer contracts)
  • Acceptance Test Procedure (ATP) defined and run (Finding 10): docs/bid/acceptance-test-procedure.md exists with test script structured as: test ID, requirement reference, preconditions, steps, expected result, pass/fail criteria. ATP is runnable by a non-SpaceCom operator (evaluator) using documented environment setup. ATP covers: physics accuracy (§17 validation), NOTAM format (Q-line regex test), alert delivery latency (synthetic TIP → measure delivery time), HMAC integrity (tampered record → 503), multi-tenancy boundary (Org A cannot access Org B data). ATP seed data committed at docs/bid/atp-seed-data/. ATP successfully run by an independent evaluator on the staging environment before any institutional procurement submission.
  • Competitive differentiation review completed: docs/competitive-analysis.md updated; any competitor capability that closed a differentiation gap has been assessed and a product response documented
  • Security runbook: incident response procedure for each CRITICAL threat scenario

7.16 Aviation Safety Integrity — Operational Scenarios

Scenario 1 — False all-clear attack:

An attacker who modifies reentry_predictions records to suppress a genuine hazard corridor could cause an airspace manager to conclude a FIR is safe when it is not.

Mitigations layered in depth:

  1. HMAC signing on every prediction record (§7.9) — modification is immediately detected
  2. Immutability DB trigger (§7.9) — modifications fail at the database layer
  3. TIP message cross-check: a prediction showing no hazard for an object with an active TIP message triggers a CRITICAL integrity alert regardless of the prediction's content
  4. The UI displays HMAC status on every prediction — ✗ verification failed is immediately visible to the operator
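Mitigation 3 reduces to a guard clause; the names here are illustrative:

```python
def tip_cross_check(prediction_shows_hazard: bool, has_active_tip: bool):
    """Integrity cross-check: a prediction showing no hazard for an object
    with an active TIP message is itself a CRITICAL integrity signal,
    regardless of the prediction's own content."""
    if has_active_tip and not prediction_shows_hazard:
        return "CRITICAL_INTEGRITY_ALERT"
    return None  # consistent: no independent evidence contradicts the prediction
```

Because the TIP feed is an independent source, this check fires even if an attacker forges a valid HMAC on a suppressed prediction.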

Scenario 2 — Alert storm attack:

An attacker flooding the alert system with false CRITICALs induces alert fatigue; operators disable alerts; a genuine event is missed.

Mitigations:

  1. Alert generation runs only from backend business logic on verified, HMAC-checked data — not from direct API calls
  2. Rate limiting on CRITICAL alert generation per object per window (§6.6)
  3. Alert storm detection: > 5 CRITICALs in 1 hour triggers a meta-alert to admins
  4. Geographic filtering means alert volume per operator is naturally bounded to their region
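Mitigation 3 can be sketched as a sliding-window count; the threshold and window mirror the values above, and the function name is illustrative:

```python
from datetime import datetime, timedelta

def alert_storm_detected(critical_times, now, threshold=5, window=timedelta(hours=1)):
    """More than `threshold` CRITICAL alerts inside `window` triggers a
    meta-alert to admins instead of flooding operators."""
    recent = [t for t in critical_times if timedelta(0) <= now - t <= window]
    return len(recent) > threshold
```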

8. Functional Modules

Each module is a Python package under backend/modules/ with its own router, schemas, service layer, and (where applicable) Celery tasks. Modules communicate via internal function calls and the shared database — not HTTP between modules.

Phase 1 Modules

| Module | Package | Purpose |
|---|---|---|
| Catalog | modules.catalog | CRUD for space objects: NORAD ID, TLE sets, physical properties (from ESA DISCOS), B* drag term, radar cross-section. Source of truth for all tracked objects. |
| Catalog Propagator | modules.propagator.catalog | SGP4/SDP4 for general catalog tracking. Outputs GCRF state vectors and geodetic coordinates. Feeds the globe display. Not used for decay prediction. |
| Decay Predictor | modules.propagator.decay | Numerical integrator (RK7(8) adaptive step) with NRLMSISE-00 atmospheric density model, J2-J6 geopotential, and solar radiation pressure. Used for all re-entry window estimation. Monte Carlo uncertainty (vary F10.7 ±20%, Ap, B* ±10%). All outputs HMAC-signed on creation. Shadow mode flag propagated to all output records. |
| Reentry | modules.reentry | Phase 1 scope: re-entry window prediction (time ± uncertainty) and ground track corridor (percentile swaths). Phase 2 expands to full breakup/survivability. |
| Space Weather | modules.spaceweather | Ingests NOAA SWPC: F10.7, Ap/Kp, Dst, solar wind. Cross-validates against ESA Space Weather Service. Generates operational_status string. Drives Decay Predictor density models. |
| Visualisation | modules.viz | Generates CZML documents from ephemeris (J2000 Cartesian — explicit TEME→J2000 conversion), hazard zones, and debris corridors. Pre-bakes MC trajectory binary blobs for Mode C. All object name/description fields HTML-escaped before CZML output. |
| Ingest | modules.ingest | Background workers: Space-Track.org TLE polling, CelesTrak TLE polling, TIP message ingestion, ESA DISCOS physical property import, NOAA SWPC space weather polling, IERS EOP refresh. All external URLs are hardcoded constants; SSRF mitigation enforced at HTTP client layer. |
| Public API | modules.api | Versioned REST API (/api/v1/) as a first-class product for programmatic access by Persona E/F. Includes API key management (generation, rotation, revocation, usage tracking), CCSDS-format export endpoints, bulk ephemeris endpoints, and rate limiting per API key. API keys are separate credentials from the web session JWT and managed independently. |

Phase 2 Modules

| Module | Package | Purpose |
|---|---|---|
| Atmospheric Breakup | modules.breakup | ORSAT-like atmospheric re-entry breakup: aerothermal loading → structural failure → fragment generation → ballistic descent → ground impact with kinetic energy and casualty area. Produces fragment descriptors and uncertainty bounds for the sub-/trans-sonic descent layer. |
| Conjunction | modules.conjunction | All-vs-all conjunction screening: apogee/perigee filter → TCA refinement → collision probability (Alfano/Foster). Feeds conjunctions table. |
| Upper Atmosphere | modules.weather.upper | NRLMSISE-00 / JB2008 density model driven by space weather inputs. 80-600 km profiles for Decay Predictor and Atmospheric Breakup. |
| Lower Atmosphere | modules.weather.lower | GFS/ECMWF tropospheric wind and density profiles for 0-80 km terminal descent, including wind-sensitive dispersion inputs for fragment clouds after main breakup. |
| Hazard | modules.hazard | Fuses Decay Predictor + Atmospheric Breakup + atmosphere modules into hazard zones with uncertainty bounds. All output records HMAC-signed and immutable. Shadow mode flag preserved on all hazard zone records. |
| Airspace | modules.airspace | FIR/UIR boundaries, controlled airspace, routes. PostGIS hazard-airspace intersection. |
| Air Risk | modules.air_risk | Combines hazard outputs with air traffic density / ADS-B state, aircraft class assumptions, and vulnerability bands to generate time-sliced exposure scores and operator-facing air-risk products. Supports conservative-baseline comparison against blunt closure areas. |
| On-Orbit Fragmentation | modules.fragmentation | NASA Standard Breakup Model for on-orbit collision/explosion fragmentation. Separate from atmospheric breakup — different physics. |
| Space Operator Portal | modules.space_portal | The second front door. Owned object management (owned_objects table); object-scoped prediction views; CCSDS export; API key portal; controlled re-entry planner interface. Enforces space_operator RBAC object-ownership scoping. |
| Controlled Re-entry Planner | modules.reentry.controlled | For objects with remaining manoeuvre capability: given a delta-V budget and avoidance constraints (FIR exclusions, land avoidance, population density weighting), generates ranked candidate deorbit windows with corridor risk scores. Outputs suitable for national space law regulatory submissions and ESA Zero Debris Charter evidence. |
| NOTAM Drafting | modules.notam | Generates ICAO Annex 15 format NOTAM drafts from hazard corridor outputs. Produces cancellation drafts on event close. Stores all drafts in notam_drafts table. Displays mandatory regulatory disclaimer. Never submits NOTAMs — draft production only. |

Phase 3 Modules

| Module | Package | Purpose |
|---|---|---|
| Reroute | modules.reroute | Strategic pre-flight route intersection analysis only. Given a filed route, identifies which segments intersect the hazard corridor and outputs the geographic avoidance boundary. Does not generate specific alternate routes — avoidance boundary only, to keep SpaceCom in a purely informational role. |
| Feedback | modules.feedback | Prediction vs. observed outcome comparison. Atmospheric density scaling recalibration from historical re-entries. Maneuver detection (TLE-to-TLE ΔV estimation). Shadow validation reporting for ANSP regulatory adoption evidence. |
| Alerts | modules.alerts | WebSocket push + email notifications. Enforces alert rate limits and deduplication server-side. Stores all events in append-only alert_events. Shadow mode: all alerts suppressed to INFORMATIONAL; no external delivery. |
| Launch Safety | modules.launch_safety | Screen proposed launch trajectories against the live catalog for conjunction risk during ascent and parking orbit phases. Natural extension of the conjunction module. Serves launch operators as a third customer segment. |

9. Data Model Evolution

9.1 Retain and Expand from Existing Schema

objects table

ALTER TABLE objects ADD COLUMN IF NOT EXISTS
  bstar DOUBLE PRECISION,              -- SGP4 drag parameter (1/Earth-radii)
  cd_a_over_m DOUBLE PRECISION,        -- C_D * A / m (m²/kg); physical model
  rcs_m2 DOUBLE PRECISION,             -- Radar cross-section from Space-Track
  rcs_size_class TEXT,                 -- SMALL | MEDIUM | LARGE
  mass_kg DOUBLE PRECISION,
  cross_section_m2 DOUBLE PRECISION,
  material TEXT,
  shape TEXT,
  data_confidence TEXT DEFAULT 'unknown',  -- 'discos' | 'estimated' | 'unknown'
  object_type TEXT,                    -- PAYLOAD | ROCKET BODY | DEBRIS | UNKNOWN
  launch_date DATE,
  launch_site TEXT,
  decay_date DATE,
  organisation_id INTEGER REFERENCES organisations(id),  -- multi-tenancy
  -- Physics model parameters (Finding 3, 5, 7)
  attitude_known BOOLEAN DEFAULT FALSE,    -- FALSE = tumbling; affects A uncertainty sampling
  material_class TEXT,                     -- 'aluminium'|'stainless_steel'|'titanium'|'carbon_composite'|'unknown'
  cd_override DOUBLE PRECISION,            -- operator-provided C_D override (space_operator only)
  bstar_override DOUBLE PRECISION,         -- operator-provided B* override (space_operator only)
  cr_coefficient DOUBLE PRECISION DEFAULT 1.3  -- radiation pressure coefficient; 1.3 = standard non-cooperative

orbits table — full state vectors

ALTER TABLE orbits ADD COLUMN IF NOT EXISTS
  reference_frame TEXT DEFAULT 'GCRF',
  pos_x_km DOUBLE PRECISION,
  pos_y_km DOUBLE PRECISION,
  pos_z_km DOUBLE PRECISION,
  vel_x_kms DOUBLE PRECISION,
  vel_y_kms DOUBLE PRECISION,
  vel_z_kms DOUBLE PRECISION,
  lat_deg DOUBLE PRECISION,
  lon_deg DOUBLE PRECISION,
  alt_km DOUBLE PRECISION,
  speed_kms DOUBLE PRECISION,
  -- RTN position covariance (upper triangle of 3×3)
  cov_rr DOUBLE PRECISION,
  cov_rt DOUBLE PRECISION,
  cov_rn DOUBLE PRECISION,
  cov_tt DOUBLE PRECISION,
  cov_tn DOUBLE PRECISION,
  cov_nn DOUBLE PRECISION,
  propagator TEXT DEFAULT 'sgp4',
  tle_epoch TIMESTAMPTZ

conjunctions table

ALTER TABLE conjunctions ADD COLUMN IF NOT EXISTS
  collision_probability DOUBLE PRECISION,
  probability_method TEXT,
  combined_radial_sigma_m DOUBLE PRECISION,
  combined_transverse_sigma_m DOUBLE PRECISION,
  combined_normal_sigma_m DOUBLE PRECISION

reentry_predictions table

ALTER TABLE reentry_predictions ADD COLUMN IF NOT EXISTS
  confidence_level DOUBLE PRECISION,
  propagator TEXT,
  f107_assumed DOUBLE PRECISION,
  ap_assumed DOUBLE PRECISION,
  monte_carlo_n INTEGER,
  ground_track_corridor GEOGRAPHY(POLYGON),  -- GEOGRAPHY: global corridors may cross antimeridian
  reentry_window_open TIMESTAMPTZ,
  reentry_window_close TIMESTAMPTZ,
  nominal_reentry_point GEOGRAPHY(POINT),    -- GEOGRAPHY: global point
  nominal_reentry_alt_km DOUBLE PRECISION DEFAULT 80.0,
  p01_reentry_time TIMESTAMPTZ,  -- 1st percentile — extreme early case; displayed as tail risk annotation (F10)
  p05_reentry_time TIMESTAMPTZ,
  p50_reentry_time TIMESTAMPTZ,
  p95_reentry_time TIMESTAMPTZ,
  p99_reentry_time TIMESTAMPTZ,  -- 99th percentile — extreme late case; displayed as tail risk annotation (F10)
  sigma_along_track_km DOUBLE PRECISION,
  sigma_cross_track_km DOUBLE PRECISION,
  organisation_id INTEGER REFERENCES organisations(id),
  record_hmac TEXT NOT NULL,           -- HMAC-SHA256 of canonical field set
  integrity_failed BOOLEAN DEFAULT FALSE,
  superseded_by INTEGER REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- write-once; RESTRICT prevents deleting a prediction that supersedes another (F10 — §67)
  ood_flag BOOLEAN DEFAULT FALSE,              -- TRUE if any input parameter falls outside the model's validated operating envelope
  ood_reason TEXT,                             -- comma-separated list of which parameters triggered OOD (e.g. "high_am_ratio,low_data_confidence")
  prediction_valid_until TIMESTAMPTZ,          -- computed at creation: p50_reentry_time - 4h; UI warns if NOW() > this and prediction is not superseded
  model_version TEXT NOT NULL,                 -- semantic version of decay predictor used; must match current deployed version or trigger re-run prompt
  -- Multi-source conflict detection (Finding 10)
  prediction_conflict BOOLEAN DEFAULT FALSE,   -- TRUE if SpaceCom window does not overlap TIP or ESA window
  conflict_sources TEXT[],                     -- e.g. ['space_track_tip', 'esa_esac']
  conflict_union_p10 TIMESTAMPTZ,              -- union of all non-overlapping windows: earliest bound
  conflict_union_p90 TIMESTAMPTZ               -- union of all non-overlapping windows: latest bound
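The conflict test behind prediction_conflict can be sketched as a window-overlap check; representing each window as an (open, close) tuple is an assumption of this sketch:

```python
from datetime import datetime

def window_conflict(spacecom_window, external_window):
    """Finding 10: two (open, close) windows conflict when they do not overlap.
    The union bounds are what conflict_union_p10/p90 would store."""
    a_open, a_close = spacecom_window
    b_open, b_close = external_window
    overlaps = a_open <= b_close and b_open <= a_close
    union = (min(a_open, b_open), max(a_close, b_close))
    return (not overlaps), union
```

On conflict, the UI presents the union bounds rather than either source's narrower window, which is the conservative choice for operators.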

superseded_by is write-once after creation: it can be set once by an analyst or above, but never changed once set. A DB constraint enforces this (trigger that raises if superseded_by is being changed from a non-NULL value). The UI displays a ⚠ Superseded — see [newer run] banner on any prediction where superseded_by IS NOT NULL. This preserves the immutability guarantee (old records are never deleted) while giving analysts a mechanism to communicate "this is not the current operational view."

The same superseded_by pattern applies to the simulations table (self-referential FK).

Immutability trigger (see §7.9) applied to this table in the initial migration.

9.2 New Tables

-- Organisations (for multi-tenancy)
CREATE TABLE organisations (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Commercial tier (Finding 3, 5)
  subscription_tier TEXT NOT NULL DEFAULT 'shadow_trial'
    CHECK (subscription_tier IN ('shadow_trial','ansp_operational','space_operator','institutional','internal')),
  subscription_status TEXT NOT NULL DEFAULT 'active'
    CHECK (subscription_status IN ('active','offered','offered_lapsed','churned','suspended')),
  subscription_started_at TIMESTAMPTZ,
  subscription_expires_at TIMESTAMPTZ,
  -- Shadow trial gate (F3 - §68): expiry normally auto-deactivates shadow mode, but enforcement is deferred while an active TIP / CRITICAL operational event exists
  shadow_trial_expires_at TIMESTAMPTZ,          -- NULL = no trial expiry (paid or internal); set on sandbox agreement signing
  -- Resource quotas (F8 — §68): 0 = unlimited (paid tiers); >0 = monthly cap
  monthly_mc_run_quota INTEGER NOT NULL DEFAULT 100  -- 100 for free/shadow_trial; 0 = unlimited for paid; deferred during active TIP/CRITICAL event
    CHECK (monthly_mc_run_quota >= 0),
  -- Feature flags (F11 — §68): Enterprise-only features gated here
  feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,  -- Enterprise only
  -- On-premise licence (F6 — §68)
  licence_key TEXT,                             -- JWT signed by SpaceCom; checked at startup for on-premise deployments
  licence_expires_at TIMESTAMPTZ,               -- derived from licence_key; stored for query efficiency
  -- Data residency (Finding 8)
  hosting_jurisdiction TEXT NOT NULL DEFAULT 'eu'
    CHECK (hosting_jurisdiction IN ('eu','uk','au','us','on_premise')),
  data_residency_confirmed BOOLEAN DEFAULT FALSE  -- DPA clause confirmed for this org
);
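The shadow trial gate described in the shadow_trial_expires_at comment can be sketched as follows; function and parameter names are illustrative:

```python
from datetime import datetime

def shadow_trial_access_allowed(shadow_trial_expires_at, now, active_critical_event=False):
    """F3 gate: expiry normally deactivates shadow mode, but enforcement is
    deferred while an active TIP / CRITICAL operational event exists."""
    if shadow_trial_expires_at is None:
        return True   # NULL = no trial expiry (paid or internal tier)
    if now <= shadow_trial_expires_at:
        return True   # trial still running
    return active_critical_event  # expired: allowed only while a live event is active
```

The deferral clause ensures a trial never cuts an operator off in the middle of a genuine re-entry event.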

-- Users
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,           -- bcrypt, cost factor >= 12
  role TEXT NOT NULL DEFAULT 'viewer'
    CHECK (role IN ('viewer','analyst','operator','org_admin','admin','space_operator','orbital_analyst')),
  mfa_secret TEXT,                       -- TOTP secret (encrypted at rest)
  mfa_recovery_codes TEXT[],             -- bcrypt hashes of recovery codes
  mfa_enabled BOOLEAN DEFAULT FALSE,
  failed_mfa_attempts INTEGER DEFAULT 0,
  locked_until TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_login_at TIMESTAMPTZ,
  tos_accepted_at TIMESTAMPTZ,          -- NULL = ToS not yet accepted; access blocked until set
  tos_version TEXT,                     -- semver of ToS accepted (e.g. "1.2.0")
  tos_accepted_ip INET,                 -- IP address at time of acceptance (GDPR consent evidence)
  data_source_acknowledgement BOOLEAN DEFAULT FALSE, -- must be TRUE before API key access
  altitude_unit_preference TEXT NOT NULL DEFAULT 'ft'
    CHECK (altitude_unit_preference IN ('m', 'ft', 'km'))
    -- 'ft' default for ansp_operator; 'km' default for space_operator (set at account creation based on role)
);

-- Refresh tokens (server-side revocation)
CREATE TABLE refresh_tokens (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
  token_hash TEXT NOT NULL UNIQUE,        -- SHA-256 of the raw token
  family_id UUID NOT NULL,               -- All tokens from the same initial issuance share a family_id
  issued_at TIMESTAMPTZ DEFAULT NOW(),
  expires_at TIMESTAMPTZ NOT NULL,
  revoked_at TIMESTAMPTZ,                -- NULL = valid
  superseded_at TIMESTAMPTZ,             -- Set when this token is rotated out (newer token in family exists)
  replaced_by UUID REFERENCES refresh_tokens(id),  -- for rotation chain audit
  source_ip INET,
  user_agent TEXT
);
CREATE INDEX ON refresh_tokens (user_id, revoked_at);
CREATE INDEX ON refresh_tokens (family_id);  -- for family revocation on reuse detection
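Family-based reuse detection, sketched with an in-memory store; the production version operates on the refresh_tokens table above, and all names here are illustrative:

```python
import uuid

class TokenStore:
    """Sketch of rotation with family revocation: presenting a token that was
    already superseded means it was stolen, so the whole family is revoked."""

    def __init__(self):
        self.tokens = {}  # token_hash -> {"family_id", "superseded", "revoked"}

    def issue(self, family_id=None):
        family_id = family_id or str(uuid.uuid4())
        token_hash = str(uuid.uuid4())  # stand-in for SHA-256 of the raw token
        self.tokens[token_hash] = {"family_id": family_id,
                                   "superseded": False, "revoked": False}
        return token_hash, family_id

    def rotate(self, presented):
        rec = self.tokens[presented]
        if rec["revoked"]:
            raise PermissionError("token revoked")
        if rec["superseded"]:
            # Reuse detected: revoke every token in the family
            # (production also logs a REFRESH_TOKEN_REUSE security event)
            for t in self.tokens.values():
                if t["family_id"] == rec["family_id"]:
                    t["revoked"] = True
            raise PermissionError("REFRESH_TOKEN_REUSE")
        rec["superseded"] = True
        return self.issue(rec["family_id"])
```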

-- Security event log (append-only)
CREATE TABLE security_logs (
  id BIGSERIAL PRIMARY KEY,
  logged_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL,
  event_type TEXT NOT NULL,
  user_id INTEGER,
  organisation_id INTEGER,
  source_ip INET,
  user_agent TEXT,
  resource TEXT,
  detail JSONB,
  record_hash TEXT  -- SHA-256(logged_at || level || event_type || detail) for tamper detection
);
CREATE TRIGGER security_logs_immutable
  BEFORE UPDATE OR DELETE ON security_logs
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- TLE history (hypertable)
-- No surrogate PK: TimescaleDB requires any UNIQUE/PK constraint to include the partition column.
-- Natural unique key is (object_id, ingested_at). Reference TLE records by this composite key.
CREATE TABLE tle_sets (
  object_id INTEGER REFERENCES objects(id),
  epoch TIMESTAMPTZ NOT NULL,
  line1 TEXT NOT NULL,
  line2 TEXT NOT NULL,
  source TEXT NOT NULL,
  ingested_at TIMESTAMPTZ DEFAULT NOW(),
  inclination_deg DOUBLE PRECISION,
  raan_deg DOUBLE PRECISION,
  eccentricity DOUBLE PRECISION,
  arg_perigee_deg DOUBLE PRECISION,
  mean_anomaly_deg DOUBLE PRECISION,
  mean_motion_rev_per_day DOUBLE PRECISION,
  bstar DOUBLE PRECISION,
  apogee_km DOUBLE PRECISION,
  perigee_km DOUBLE PRECISION,
  cross_validated BOOLEAN DEFAULT FALSE,  -- TRUE if confirmed by second source
  cross_validation_delta_sma_km DOUBLE PRECISION,  -- SMA difference between sources
  UNIQUE (object_id, ingested_at)         -- natural key; safe for TimescaleDB (includes partition col)
);
SELECT create_hypertable('tle_sets', 'ingested_at');
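The cross_validation_delta_sma_km value can be derived from the two sources' mean motions via Kepler's third law. This sketch ignores the Brouwer/Kozai mean-element subtlety of TLE mean motion, which is acceptable for a coarse cross-source consistency check:

```python
import math

MU_EARTH_KM3_S2 = 398600.4418  # Earth's gravitational parameter (km^3/s^2)

def sma_from_mean_motion(mean_motion_rev_per_day):
    """Semi-major axis (km) from mean motion: a = (mu / n^2)^(1/3)."""
    n_rad_s = mean_motion_rev_per_day * 2.0 * math.pi / 86400.0
    return (MU_EARTH_KM3_S2 / n_rad_s ** 2) ** (1.0 / 3.0)

def cross_validation_delta_sma_km(mm_primary, mm_secondary):
    """SMA disagreement between two TLE sources for the same epoch."""
    return abs(sma_from_mean_motion(mm_primary) - sma_from_mean_motion(mm_secondary))
```

A delta above a few kilometres for near-simultaneous epochs would flag the TLE pair for review rather than setting cross_validated.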

-- Space weather (hypertable)
CREATE TABLE space_weather (
  time TIMESTAMPTZ NOT NULL,
  f107_obs DOUBLE PRECISION,             -- observed F10.7 (current day)
  f107_prior_day DOUBLE PRECISION,       -- prior-day F10.7 (NRLMSISE-00 f107 input)
  f107_81day_avg DOUBLE PRECISION,       -- 81-day centred average (NRLMSISE-00 f107A input)
  ap_daily INTEGER,                      -- daily Ap index (linear; NOT Kp)
  ap_3h_history DOUBLE PRECISION[19],    -- 3-hourly Ap values for prior 57h (NRLMSISE-00 full mode)
  kp_3hourly DOUBLE PRECISION[],         -- 3-hourly Kp (for storm detection; Kp > 5 triggers storm flag)
  dst_index INTEGER,
  uncertainty_multiplier DOUBLE PRECISION,
  operational_status TEXT,
  source TEXT DEFAULT 'noaa_swpc',
  secondary_source TEXT,                 -- ESA SWS cross-validation value
  cross_validation_delta_f107 DOUBLE PRECISION  -- difference between sources
);
SELECT create_hypertable('space_weather', 'time');

-- TIP messages
CREATE TABLE tip_messages (
  id BIGSERIAL PRIMARY KEY,
  object_id INTEGER REFERENCES objects(id),
  norad_id INTEGER NOT NULL,
  message_time TIMESTAMPTZ NOT NULL,
  message_number INTEGER,
  reentry_window_open TIMESTAMPTZ,
  reentry_window_close TIMESTAMPTZ,
  predicted_region TEXT,
  source TEXT DEFAULT 'usspacecom',
  raw_message TEXT
);

-- Alert events (append-only)
CREATE TABLE alert_events (
  id BIGSERIAL PRIMARY KEY,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL
    CHECK (level IN ('INFO','WARNING','CRITICAL')),
  trigger_type TEXT NOT NULL,
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  message TEXT NOT NULL,
  acknowledged_at TIMESTAMPTZ,
  acknowledged_by INTEGER REFERENCES users(id) ON DELETE SET NULL,  -- SET NULL on GDPR erasure; log entry preserved
  acknowledgement_note TEXT,
  delivered_websocket BOOLEAN DEFAULT FALSE,
  delivered_email BOOLEAN DEFAULT FALSE,
  fir_intersection_km2 DOUBLE PRECISION,       -- area of FIR polygon intersected by the triggering corridor (km²); NULL for non-spatial alerts
  intersection_percentile TEXT
    CHECK (intersection_percentile IN ('p50','p95')),  -- which corridor percentile triggered the alert
  prediction_id BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT,  -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
  record_hmac TEXT NOT NULL DEFAULT ''  -- HMAC-SHA256 of safety-critical fields; signed at insert; verified nightly (F9)
);
CREATE TRIGGER alert_events_immutable
  BEFORE UPDATE OR DELETE ON alert_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Simulations
CREATE TABLE simulations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  module TEXT NOT NULL,
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  params_json JSONB NOT NULL,
  started_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending','running','complete','failed','cancelled')),
  result_uri TEXT,
  model_version TEXT,
  celery_task_id TEXT,
  error_detail TEXT,
  created_by INTEGER REFERENCES users(id)
);

-- Reports
CREATE TABLE reports (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  simulation_id UUID REFERENCES simulations(id),
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  report_type TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  storage_uri TEXT NOT NULL,
  params_json JSONB,
  report_number TEXT
);

-- Prediction outcomes (algorithmic accountability — links predictions to observed re-entry events)
CREATE TABLE prediction_outcomes (
  id SERIAL PRIMARY KEY,
  prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id) ON DELETE RESTRICT,  -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
  norad_id INTEGER NOT NULL,
  observed_reentry_time TIMESTAMPTZ,           -- actual re-entry time from post-event analysis (The Aerospace Corporation, US18SCS, etc.)
  observed_reentry_source TEXT,                -- 'aerospace_corp' | 'us18scs' | 'esa_esoc' | 'manual'
  p50_error_minutes DOUBLE PRECISION,          -- predicted p50 minus observed (+ = predicted late, - = predicted early)
  corridor_contains_observed BOOLEAN,          -- TRUE if observed impact point fell within p95 corridor
  fir_false_positive BOOLEAN,                  -- TRUE if a CRITICAL alert fired but no observable debris reached the affected FIR
  fir_false_negative BOOLEAN,                  -- TRUE if observable debris reached a FIR but no CRITICAL alert was generated
  ood_flag_at_prediction BOOLEAN,              -- snapshot of ood_flag from the prediction record at prediction time
  notes TEXT,
  recorded_at TIMESTAMPTZ DEFAULT NOW(),
  recorded_by INTEGER REFERENCES users(id)     -- analyst who logged the outcome
);

-- Hazard zones
CREATE TABLE hazard_zones (
  id BIGSERIAL PRIMARY KEY,
  simulation_id UUID REFERENCES simulations(id),
  organisation_id INTEGER REFERENCES organisations(id),
  valid_from TIMESTAMPTZ NOT NULL,
  valid_to TIMESTAMPTZ NOT NULL,
  geometry GEOGRAPHY(POLYGON, 4326) NOT NULL,
  altitude_min_km DOUBLE PRECISION,
  altitude_max_km DOUBLE PRECISION,
  risk_level TEXT,
  confidence DOUBLE PRECISION,
  sigma_along_track_km DOUBLE PRECISION,
  sigma_cross_track_km DOUBLE PRECISION,
  record_hmac TEXT NOT NULL
);
CREATE INDEX ON hazard_zones USING GIST (geometry);
CREATE INDEX ON hazard_zones (valid_from, valid_to);
CREATE TRIGGER hazard_zones_immutable
  BEFORE UPDATE OR DELETE ON hazard_zones
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();
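A sketch of the record_hmac signing and verification for hazard zone rows; the field list and sorted-key JSON serialisation are assumptions, not the production canonical form:

```python
import hashlib
import hmac
import json

# Illustrative subset of the safety-critical fields
_SIGNED_FIELDS = ("simulation_id", "valid_from", "valid_to", "geometry", "risk_level")

def sign_hazard_record(secret: bytes, record: dict) -> str:
    """HMAC-SHA256 over a canonical serialisation of the signed fields."""
    canonical = json.dumps({k: record.get(k) for k in _SIGNED_FIELDS}, sort_keys=True)
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_hazard_record(secret: bytes, record: dict, record_hmac: str) -> bool:
    """Constant-time comparison against the stored record_hmac."""
    return hmac.compare_digest(sign_hazard_record(secret, record), record_hmac)
```

Unlike the plain SHA-256 record_hash on security_logs, the keyed HMAC means an attacker with database write access still cannot forge a valid signature.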

-- Airspace boundaries
CREATE TABLE airspace (
  id BIGSERIAL PRIMARY KEY,
  designator TEXT NOT NULL,
  name TEXT,
  type TEXT NOT NULL,
  geometry GEOMETRY(POLYGON, 4326) NOT NULL,  -- GEOMETRY (not GEOGRAPHY): FIR polygons are stored split at the antimeridian on import; ~3× faster for ST_Intersects
  lower_fl INTEGER,
  upper_fl INTEGER,
  icao_region TEXT
);
CREATE INDEX ON airspace USING GIST (geometry);

-- Debris fragments
CREATE TABLE fragments (
  id BIGSERIAL PRIMARY KEY,
  simulation_id UUID REFERENCES simulations(id),
  mass_kg DOUBLE PRECISION,
  characteristic_length_m DOUBLE PRECISION,
  cross_section_m2 DOUBLE PRECISION,
  material TEXT,
  ballistic_coefficient_kgm2 DOUBLE PRECISION,
  pre_entry_survived BOOLEAN,
  impact_point GEOGRAPHY(POINT, 4326),
  impact_velocity_kms DOUBLE PRECISION,
  impact_angle_deg DOUBLE PRECISION,
  kinetic_energy_j DOUBLE PRECISION,
  casualty_area_m2 DOUBLE PRECISION,
  dispersion_semi_major_km DOUBLE PRECISION,
  dispersion_semi_minor_km DOUBLE PRECISION,
  dispersion_orientation_deg DOUBLE PRECISION
);
CREATE INDEX ON fragments USING GIST (impact_point);
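The kinetic_energy_j and casualty_area_m2 columns follow standard forms; the 0.36 m² projected human area in the casualty-area formula is the conventional assumption:

```python
import math

def kinetic_energy_j(mass_kg, impact_velocity_kms):
    """KE = 1/2 m v^2, with impact velocity converted from km/s to m/s."""
    v_ms = impact_velocity_kms * 1000.0
    return 0.5 * mass_kg * v_ms ** 2

def casualty_area_m2(cross_section_m2, human_area_m2=0.36):
    """Standard casualty-area form: A_c = (sqrt(A_human) + sqrt(A_fragment))^2,
    with the conventional 0.36 m^2 projected human cross-section."""
    return (math.sqrt(human_area_m2) + math.sqrt(cross_section_m2)) ** 2
```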

-- Owned objects (space operator registration)
CREATE TABLE owned_objects (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  object_id INTEGER REFERENCES objects(id) NOT NULL,
  norad_id INTEGER NOT NULL,
  registered_at TIMESTAMPTZ DEFAULT NOW(),
  registration_reference TEXT,           -- National space law registration number
  has_propulsion BOOLEAN DEFAULT FALSE,  -- Enables controlled re-entry planner
  UNIQUE (organisation_id, object_id)
);
CREATE INDEX ON owned_objects (organisation_id);

-- API keys (for Persona E/F programmatic access)
CREATE TABLE api_keys (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  user_id INTEGER REFERENCES users(id),  -- NULL for org-level service account keys (F5)
  is_service_account BOOLEAN NOT NULL DEFAULT FALSE,  -- TRUE = org-level key, no human user
  service_account_name TEXT,             -- required when is_service_account = TRUE; e.g. "ANSP Integration Service"
  key_hash TEXT NOT NULL UNIQUE,         -- SHA-256 of raw key; raw key shown once at creation
  name TEXT NOT NULL,                    -- Human label, e.g. "Ops Centre Integration"
  role TEXT NOT NULL,                    -- space_operator | orbital_analyst
  created_at TIMESTAMPTZ DEFAULT NOW(),
  last_used_at TIMESTAMPTZ,
  expires_at TIMESTAMPTZ,
  revoked_at TIMESTAMPTZ,
  revoked_by INTEGER REFERENCES users(id),  -- org_admin or admin who revoked (F5)
  requests_today INTEGER DEFAULT 0,
  daily_limit INTEGER DEFAULT 1000,
  -- API key scope and rate limit overrides (Finding 11)
  allowed_endpoints TEXT[],              -- NULL = all endpoints for role; e.g. ['GET /space/objects']
  rate_limit_override JSONB,             -- e.g. {"decay_predict": {"limit": 5, "window": "1h"}}
  CONSTRAINT service_account_name_required CHECK (
    (is_service_account = FALSE) OR (service_account_name IS NOT NULL)
  ),
  CONSTRAINT user_or_service CHECK (
    (user_id IS NOT NULL AND is_service_account = FALSE)
    OR (user_id IS NULL AND is_service_account = TRUE)
  )
);
CREATE INDEX ON api_keys (organisation_id, revoked_at);
CREATE INDEX ON api_keys (organisation_id, is_service_account);  -- org admin key listing
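The key_hash contract above (store only the SHA-256 digest; show the raw key exactly once at creation) reduces to a small helper pair. A minimal sketch, assuming Python on the backend; the function names and the sck key prefix are illustrative, not part of the schema:

```python
import hashlib
import secrets

def generate_api_key(prefix: str = "sck") -> tuple[str, str]:
    """Return (raw_key, key_hash). Only key_hash is persisted in api_keys;
    the raw key is displayed to the user exactly once at creation."""
    raw = f"{prefix}_{secrets.token_urlsafe(32)}"
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)
```

Authenticating by recomputing the digest and looking up key_hash means the UNIQUE index on key_hash doubles as the authentication path.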

-- Async job tracking — all Celery-backed POST endpoints return a job reference (Finding 3)
CREATE TABLE jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  user_id INTEGER NOT NULL REFERENCES users(id),
  job_type TEXT NOT NULL
    CHECK (job_type IN ('decay_predict','report','reentry_plan','propagate')),
  status TEXT NOT NULL DEFAULT 'queued'
    CHECK (status IN ('queued','running','complete','failed','cancelled')),
  celery_task_id TEXT,                  -- Celery AsyncResult ID for internal tracking
  params_hash TEXT,                     -- SHA-256 of input params; used for idempotency check
  result_url TEXT,                      -- populated when status='complete'; e.g. '/decay/predictions/123'
  error_code TEXT,                      -- populated when status='failed'
  error_message TEXT,
  estimated_duration_seconds INTEGER,   -- populated at creation from historical p50 for job_type
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ
);
CREATE INDEX ON jobs (organisation_id, status, created_at DESC);
CREATE INDEX ON jobs (celery_task_id);

-- Idempotency key store — prevents duplicate mutations from network retries (Finding 5)
CREATE TABLE idempotency_keys (
  key TEXT NOT NULL,                    -- client-provided UUID
  user_id INTEGER NOT NULL REFERENCES users(id),
  endpoint TEXT NOT NULL,              -- e.g. 'POST /decay/predict'
  response_status INTEGER NOT NULL,
  response_body JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '24 hours',
  PRIMARY KEY (key, user_id, endpoint)
);
CREATE INDEX ON idempotency_keys (expires_at);  -- for TTL cleanup job
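The composite primary key (key, user_id, endpoint) supports a lookup-or-execute flow on every mutating request: replay the stored response if the triple has been seen, otherwise run the handler once and record what it returned. A sketch of that flow, with the table stubbed as an in-memory dict and the handler signature assumed:

```python
from typing import Callable

# Stand-in for the idempotency_keys table: composite key -> stored response
_store: dict[tuple[str, int, str], tuple[int, dict]] = {}

def with_idempotency(key: str, user_id: int, endpoint: str,
                     handler: Callable[[], tuple[int, dict]]) -> tuple[int, dict]:
    """Replay the recorded (status, body) for a repeated (key, user, endpoint)
    triple; otherwise execute the handler once and record its response."""
    pk = (key, user_id, endpoint)
    if pk in _store:
        return _store[pk]          # network retry: no second mutation
    status, body = handler()
    _store[pk] = (status, body)    # real code: INSERT with the 24h expires_at
    return status, body
```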

-- Usage metering (F3) — billable events; append-only
CREATE TABLE usage_events (
  id BIGSERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  user_id INTEGER REFERENCES users(id),          -- NULL for API key / system-triggered events
  api_key_id UUID REFERENCES api_keys(id),        -- set when triggered via API key
  event_type TEXT NOT NULL
    CHECK (event_type IN (
      'decay_prediction_run',
      'conjunction_screen_run',
      'report_export',
      'api_request',
      'mc_quota_exhausted',          -- quota hit; signals upsell opportunity
      'reentry_plan_run'
    )),
  quantity INTEGER NOT NULL DEFAULT 1,            -- e.g. number of API requests batched
  billing_period TEXT NOT NULL,                   -- 'YYYY-MM' — month this event counts toward
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  detail JSONB                                    -- event-specific metadata (object_id, mc_n, etc.)
);
CREATE INDEX ON usage_events (organisation_id, billing_period, event_type);
CREATE INDEX ON usage_events (organisation_id, created_at DESC);
-- Append-only enforcement
CREATE TRIGGER usage_events_immutable
  BEFORE UPDATE OR DELETE ON usage_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Billing contacts (F10)
CREATE TABLE billing_contacts (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id) UNIQUE,
  billing_email TEXT NOT NULL,
  billing_name TEXT NOT NULL,
  billing_address TEXT,
  vat_number TEXT,                               -- EU VAT registration; required for B2B invoicing
  purchase_order_number TEXT,                    -- PO reference required by some ANSP procurement depts
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_by INTEGER REFERENCES users(id)        -- must be org_admin or admin
);

-- Subscription periods (F10) — immutable record of what was billed when
CREATE TABLE subscription_periods (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  tier TEXT NOT NULL,
  period_start TIMESTAMPTZ NOT NULL,
  period_end TIMESTAMPTZ,                        -- NULL = current (open) period
  monthly_fee_eur NUMERIC(10, 2),                -- agreed contract price; NULL for internal/trial
  currency TEXT NOT NULL DEFAULT 'EUR',
  invoice_ref TEXT,                              -- external billing system invoice ID (e.g. Stripe invoice_id)
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON subscription_periods (organisation_id, period_start DESC);

-- NOTAM drafts (audit trail; never submitted by SpaceCom)
CREATE TABLE notam_drafts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  prediction_id BIGINT REFERENCES reentry_predictions(id),
  organisation_id INTEGER REFERENCES organisations(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  draft_type TEXT NOT NULL
    CHECK (draft_type IN ('new','cancellation')),
  fir_designators TEXT[] NOT NULL,
  valid_from TIMESTAMPTZ,
  valid_to TIMESTAMPTZ,
  draft_text TEXT NOT NULL,              -- Full ICAO-format draft text
  reviewed_by INTEGER REFERENCES users(id) ON DELETE SET NULL,  -- SET NULL on GDPR erasure; draft preserved
  reviewed_at TIMESTAMPTZ,
  review_note TEXT,
  safety_record BOOLEAN DEFAULT TRUE,    -- always retained; excluded from data drop policy
  generated_during_degraded BOOLEAN DEFAULT FALSE  -- TRUE if ingest was degraded at generation time
  -- No issuance fields — SpaceCom never issues NOTAMs
);

-- Degraded mode audit log (Finding 7 — operational ANSP disclosure requirement)
-- Records every transition into and out of degraded mode for incident investigation
CREATE TABLE degraded_mode_events (
  id BIGSERIAL PRIMARY KEY,
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  ended_at TIMESTAMPTZ,                     -- NULL = currently degraded
  affected_sources TEXT[] NOT NULL,         -- e.g. ['space_track', 'noaa_swpc']
  severity TEXT NOT NULL
    CHECK (severity IN ('WARNING','CRITICAL')),
  trigger_reason TEXT NOT NULL,             -- human-readable: 'Space-Track ingest gap > 4h'
  resolved_by TEXT,                         -- 'auto-recovery' | user_id | 'manual'
  safety_record BOOLEAN DEFAULT TRUE        -- always retained under safety record policy
);
-- Append-only: no UPDATE or DELETE permitted
CREATE TRIGGER degraded_mode_events_immutable
  BEFORE UPDATE OR DELETE ON degraded_mode_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Shadow validation records (compare shadow predictions to actual events)
CREATE TABLE shadow_validations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  prediction_id BIGINT REFERENCES reentry_predictions(id),
  organisation_id INTEGER REFERENCES organisations(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  actual_reentry_time TIMESTAMPTZ,
  actual_reentry_location GEOGRAPHY(POINT, 4326),
  actual_source TEXT,                    -- 'aerospace_corp_db' | 'tip_message' | 'manual'
  p50_error_minutes DOUBLE PRECISION,    -- actual - predicted p50 in minutes
  in_p95_corridor BOOLEAN,               -- did actual point fall within 95th pct corridor?
  notes TEXT
);

-- Legal opinions (jurisdiction-level gate for shadow mode and operational deployment)
CREATE TABLE legal_opinions (
  id SERIAL PRIMARY KEY,
  jurisdiction TEXT NOT NULL UNIQUE,      -- e.g. 'AU', 'EU', 'UK', 'US'
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending','in_progress','complete','not_required')),
  opinion_date DATE,
  counsel_firm TEXT,
  shadow_mode_cleared BOOLEAN DEFAULT FALSE,  -- opinion confirms shadow deployment is permissible
  operational_cleared BOOLEAN DEFAULT FALSE,  -- opinion confirms operational deployment is permissible
  liability_cap_agreed BOOLEAN DEFAULT FALSE,
  notes TEXT,
  document_minio_key TEXT,                -- reference to stored opinion document in MinIO
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Shared immutability function (used by multiple triggers)
-- NOTE: in the actual migration this function must be created BEFORE the
-- append-only triggers above (usage_events, degraded_mode_events) that reference it
CREATE OR REPLACE FUNCTION prevent_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'Table % is append-only or immutable after creation', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

-- Shared updated_at function (used by mutable tables)
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$;

-- updated_at triggers for all mutable tables
CREATE TRIGGER organisations_updated_at
  BEFORE UPDATE ON organisations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER users_updated_at
  BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER simulations_updated_at
  BEFORE UPDATE ON simulations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER jobs_updated_at
  BEFORE UPDATE ON jobs FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER notam_drafts_updated_at
  BEFORE UPDATE ON notam_drafts FOR EACH ROW EXECUTE FUNCTION set_updated_at();

Shadow mode flag on predictions and hazard zones: Add shadow_mode BOOLEAN DEFAULT FALSE to both reentry_predictions and hazard_zones. Shadow records are excluded from all operational API responses (WHERE shadow_mode = FALSE applied to all operational endpoints) but accessible via /analysis and the Feedback/shadow validation workflow.


9.3 Index Strategy

All indexes must be created CONCURRENTLY on live hypertables to avoid table locks (see §9.4). The following indexes are required beyond TimescaleDB's automatic chunk indexes:

-- orbits hypertable: object + time range queries (CZML generation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS orbits_object_epoch_idx
  ON orbits (object_id, epoch DESC);

-- reentry_predictions: latest prediction per object (Event Detail, operational overview)
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_object_created_idx
  ON reentry_predictions (object_id, created_at DESC)
  WHERE integrity_failed = FALSE AND shadow_mode = FALSE;

-- alert_events: unacknowledged alerts per org (badge count — called on every page load)
-- Partial index on acknowledged_at IS NULL: only live unacked rows indexed; shrinks as alerts are acknowledged
CREATE INDEX CONCURRENTLY IF NOT EXISTS alert_events_unacked_idx
  ON alert_events (organisation_id, level, created_at DESC)
  WHERE acknowledged_at IS NULL;

-- jobs: Celery worker polls for queued jobs; partial index keeps this tiny and fast
CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_queued_idx
  ON jobs (organisation_id, created_at)
  WHERE status = 'queued';

-- refresh_tokens: token validation only cares about live (non-revoked) tokens
CREATE INDEX CONCURRENTLY IF NOT EXISTS refresh_tokens_live_idx
  ON refresh_tokens (token_hash)
  WHERE revoked_at IS NULL;

-- idempotency_keys: TTL cleanup job scans by expiry time
-- (expires_at is NOT NULL, so a partial "IS NOT NULL" predicate would index every row anyway)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idempotency_keys_expired_idx
  ON idempotency_keys (expires_at);

-- PostGIS spatial: all columns used in ST_Intersects / ST_Contains / ST_Distance
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_corridor_gist
  ON reentry_predictions USING GIST (ground_track_corridor);
-- airspace.geometry GIST index already present (see §9.2)
-- fragments.impact_point GIST index already present (created at table definition, §9.2)
CREATE INDEX CONCURRENTLY IF NOT EXISTS hazard_zones_polygon_gist
  ON hazard_zones USING GIST (polygon);

-- tle_sets hypertable: latest TLE per object (cross-validation, propagation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS tle_sets_object_ingested_idx
  ON tle_sets (object_id, ingested_at DESC);

-- security_logs: recent events per user (audit queries)
CREATE INDEX CONCURRENTLY IF NOT EXISTS security_logs_user_time_idx
  ON security_logs (user_id, created_at DESC);

Spatial type convention:

  • GEOGRAPHY — used for global features that may cross the antimeridian (corridor polygons, nominal re-entry points, fragment impact points). Geodetic calculations; correct for global spans.
  • GEOMETRY(POLYGON, 4326) — used for regional features always within ±180° longitude (FIR/UIR airspace boundaries). Planar approximation; ~3× faster for ST_Intersects than GEOGRAPHY; accurate enough for airspace boundary intersection within a single hemisphere.

SRID enforcement (F2 — §62): Declaring the SRID in the column type (GEOMETRY(POLYGON, 4326)) prevents implicit SRID mismatch errors, but does not prevent application code from inserting a geometry constructed with SRID 0. Add explicit CHECK constraints on all spatial columns:

-- Ensure corridor polygon SRID is correct
ALTER TABLE reentry_predictions
  ADD CONSTRAINT chk_corridor_srid
  CHECK (ST_SRID(ground_track_corridor::geometry) = 4326);

ALTER TABLE hazard_zones
  ADD CONSTRAINT chk_hazard_zone_srid
  CHECK (ST_SRID(geometry) = 4326);

ALTER TABLE airspace
  ADD CONSTRAINT chk_airspace_srid
  CHECK (ST_SRID(geometry) = 4326);

The CI migration gate (alembic check) will flag any migration that adds a spatial column without a matching SRID CHECK constraint.
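One possible shape for that gate, sketched as a regex heuristic over the migration SQL. The function name and patterns are illustrative; a real implementation would parse per-column rather than per-file:

```python
import re

# Heuristics, not a SQL parser: flag any migration that introduces a spatial
# column type but contains no ST_SRID(...) CHECK anywhere in the same file.
SPATIAL_COL = re.compile(r"\b(GEOMETRY|GEOGRAPHY)\s*\(", re.IGNORECASE)
SRID_CHECK = re.compile(r"\bST_SRID\s*\(", re.IGNORECASE)

def migration_violates_srid_gate(sql: str) -> bool:
    """True if the migration adds a spatial column without a matching
    SRID CHECK constraint; CI fails the pipeline on True."""
    return bool(SPATIAL_COL.search(sql)) and not SRID_CHECK.search(sql)
```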

ST_Buffer distance units (F9 — §62): ST_Buffer on a GEOMETRY(POLYGON, 4326) column uses degree-units, not metres. At 60°N, 1° of longitude ≈ 55 km; at the equator, 1° ≈ 111 km — an uncertainty buffer expressed in degrees gives wildly different areas at different latitudes. Projecting to Web Mercator (EPSG:3857) does not solve this: Mercator "metres" are true only at the equator and shrink by cos(latitude), so a 50,000-unit buffer at 60°N covers only ~25 km of ground. Buffer on GEOGRAPHY instead — it accepts metres natively and is geodetically correct at any latitude:

-- CORRECT: buffer 50 km around a corridor point at any latitude
SELECT ST_Buffer(
  ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography,
  50000  -- 50 km in metres; geodetic, latitude-independent
)::geometry AS buffered_geom;

-- WRONG: buffer in degrees — DO NOT USE
-- SELECT ST_Buffer(geom, 0.5) FROM ...  ← 0.5° is ~55 km at the equator but only ~28 km at 60°N

-- ALSO WRONG: ST_Buffer after ST_Transform(..., 3857) — Mercator distance distortion
-- makes the effective buffer radius latitude-dependent; if a planar CRS is genuinely
-- required, use a local projection (e.g. the appropriate UTM zone), never 3857

Corridor polygons are already stored as GEOGRAPHY (§9.2), so the metric buffer applies directly:

SELECT ST_Buffer(corridor::geography, 50000)  -- 50 km buffer, geodetically correct
FROM reentry_predictions WHERE ...

FIR intersection query optimisation: Apply a bounding-box pre-filter before the full polygon intersection test to eliminate most rows cheaply. airspace.geometry is GEOMETRY while hazard_zones.geometry and corridor parameters are GEOGRAPHY — always cast GEOGRAPHY → GEOMETRY explicitly before passing to ST_Intersects with an airspace column; PostgreSQL cannot use the GiST index and falls back to a seq scan if the types are mixed implicitly:

-- Corridor (GEOGRAPHY) intersecting FIR boundaries (GEOMETRY): explicit cast required
SELECT a.designator, a.name
FROM airspace a
WHERE a.geometry && ST_Envelope($1::geography::geometry)   -- fast bbox pre-filter (uses GIST)
  AND ST_Intersects(a.geometry, $1::geography::geometry);  -- exact test (GEOMETRY, not GEOGRAPHY)
-- $1 = corridor polygon passed as GEOGRAPHY from application layer

Add a CI linter rule (or custom ruff plugin) that rejects ST_Intersects(airspace.geometry, <expr>) unless <expr> is explicitly cast to ::geometry. This prevents the mixed-type silent seq-scan regression from being introduced during maintenance.

Cache the FIR intersection result per prediction_id in Redis (TTL: until the prediction is superseded) — the intersection for a given prediction never changes.
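The cache-aside pattern this describes, sketched with the Redis client stubbed as a dict; the fir:<prediction_id> key scheme is an assumption, and invalidation happens by deleting the key when the prediction is superseded:

```python
import json
from typing import Callable

def fir_intersections(prediction_id: int, cache: dict,
                      compute: Callable[[int], list[str]]) -> list[str]:
    """Cache-aside lookup: the FIR set for a given prediction never changes,
    so any cache hit is authoritative. `cache` stands in for Redis GET/SET;
    `compute` runs the §9.3 spatial query on a miss."""
    key = f"fir:{prediction_id}"   # assumed key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    firs = compute(prediction_id)
    cache[key] = json.dumps(firs)  # real code: SET with supersession-driven delete
    return firs
```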


9.4 TimescaleDB Configuration and Continuous Aggregates

Hypertable chunk intervals — set explicitly at creation; default 7-day chunks are too large for the orbits CZML query pattern (most queries cover ≤ 72h):

-- orbits: 1-day chunks (72h CZML window spans 3 chunks; good chunk exclusion)
SELECT create_hypertable('orbits', 'epoch',
  chunk_time_interval => INTERVAL '1 day',
  if_not_exists => TRUE);

-- tle_sets: 1-month chunks (~1,800 rows/day at 600 objects × 3 TLE updates; queried by object_id not time range)
-- Small chunks (7 days) produce poor compression ratios (~12,600 rows/chunk); 1 month improves ratio ~4×
SELECT create_hypertable('tle_sets', 'ingested_at',
  chunk_time_interval => INTERVAL '1 month',
  if_not_exists => TRUE);

-- space_weather: 30-day chunks (~3000 rows/month at 15-min cadence)
SELECT create_hypertable('space_weather', 'time',
  chunk_time_interval => INTERVAL '30 days',
  if_not_exists => TRUE);

Continuous aggregates — pre-compute recurring expensive queries instead of scanning raw hypertable rows on every request:

-- 81-day rolling F10.7 average (queried on every Space Weather Widget render)
CREATE MATERIALIZED VIEW space_weather_daily
  WITH (timescaledb.continuous) AS
  SELECT time_bucket('1 day', time) AS day,
         AVG(f107_obs)              AS f107_daily_avg,
         MAX(kp_3hourly[1])         AS kp_max_daily
  FROM space_weather
  GROUP BY day
WITH NO DATA;

SELECT add_continuous_aggregate_policy('space_weather_daily',
  start_offset      => INTERVAL '90 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

Backend queries for the 81-day F10.7 average read from space_weather_daily (the continuous aggregate), not from the raw space_weather hypertable.

Compression policy intervals — compression must not target recently-written chunks. TimescaleDB decompresses a chunk before any write to it; compressing hot chunks adds 50–200 ms latency per write batch. Set compress_after well beyond the active write window:

| Hypertable | Chunk interval | compress_after | Write cadence | Reasoning |
|---|---|---|---|---|
| orbits | 1 day | 7 days | 1 min (continuous) | Data is queryable but not written after ~24h; 7-day buffer prevents write-decompress thrash |
| adsb_states | 4 hours | 14 days | 60s (Celery Beat) | Rolling 24h retention; compress only after data is past retention interest |
| space_weather | 30 days | 60 days | 15 min | Very low write rate; compress after one full 30-day chunk is closed |
| tle_sets | 1 month | 2 months | Every 4h ingest | ~1,800 rows/day; 1-month chunks give good compression ratio; 2-month buffer ensures active month is never compressed |

-- Apply compression policies (run after hypertable creation)
SELECT add_compression_policy('orbits',       INTERVAL '7 days');
SELECT add_compression_policy('adsb_states',  INTERVAL '14 days');
SELECT add_compression_policy('space_weather', INTERVAL '60 days');
SELECT add_compression_policy('tle_sets',     INTERVAL '2 months');

Autovacuum tuning — append-only tables still accumulate dead tuples from aborted transactions and MVCC overhead. Default 20% threshold is too conservative for high-write safety tables:

ALTER TABLE alert_events SET (
  autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum at 1% dead tuples (default: 20%)
  autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE security_logs SET (
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE reentry_predictions SET (
  autovacuum_vacuum_cost_delay    = 2,      -- allow aggressive vacuum on query-critical table
  autovacuum_analyze_scale_factor = 0.01
);

PostgreSQL-level settings via patroni.yml:

postgresql:
  parameters:
    idle_in_transaction_session_timeout: 30000  # 30s -- prevents analytics sessions blocking autovacuum
    max_connections: 50                          # pgBouncer handles client multiplexing; DB needs only 50
    log_min_duration_statement: 500             # F7 §58: log queries > 500ms; shipped to Loki via Promtail
    shared_preload_libraries: timescaledb,pg_stat_statements  # F7 §58: enable slow query tracking
    pg_stat_statements.track: all              # track all statements including nested
    # Analyst role statement timeout (F11 §58): prevents runaway analytics queries starving ops connections
    # Applied at role level, not globally, to avoid impacting operational paths

Query plan governance (F7 — §58): Slow queries (> 500ms) appear in PostgreSQL logs and are shipped to Loki. A weekly Grafana report queries pg_stat_statements via the postgres-exporter and surfaces the top-10 queries by total_exec_time. Any query appearing in the top-10 for two consecutive weeks requires a PR with an EXPLAIN ANALYSE output and either an index addition or a documented acceptance rationale. The EXPLAIN ANALYSE output is recorded in the migration file header comment for index additions. CI migration timeout (§9.4) applies: migrations running > 30s against the test dataset require review before merge.

Analyst role query timeout (F11 — §58): Persona B/F analyst queries route to the read replica (§3.2) but must still be bounded to prevent a runaway query exhausting replica connections and triggering replication lag. Apply a statement_timeout at the database role level so it applies regardless of connection source:

-- Applied once at schema setup; persists across reconnections
ALTER ROLE spacecom_analyst SET statement_timeout = '30s';
ALTER ROLE spacecom_readonly SET statement_timeout = '30s';

-- Operational roles have no statement timeout — but idle-in-transaction timeout applies globally
-- (idle_in_transaction_session_timeout = 30s in patroni.yml)

The spacecom_analyst role is the PgBouncer user for the read replica pool. All analyst-originated queries automatically inherit the 30s limit. If a query exceeds 30s it receives ERROR: canceling statement due to statement timeout; the frontend displays a user-facing message: "This query exceeded the 30-second limit. Refine your filters or contact your administrator." Logged at WARNING to Loki.

PgBouncer transaction mode + asyncpg prepared statement cache — asyncpg caches prepared statements per server-side connection. In PgBouncer transaction mode, the connection returned after each transaction may differ from the one the statement was prepared on, causing ERROR: prepared statement "..." does not exist under load. Disable the cache in the SQLAlchemy async engine config:

engine = create_async_engine(
    DATABASE_URL,
    connect_args={"prepared_statement_cache_size": 0},
)

This is non-negotiable when using PgBouncer transaction mode. Do not revert this setting in the belief that it is a performance regression — it prevents a hard production failure mode. See ADR 0008.

Migration safety on live hypertables (additions to the Alembic policy in §26.9):

  • Always use CREATE INDEX CONCURRENTLY for new indexes — no table lock; safe during live ingest
  • Never add a column with a non-null default to a populated hypertable in one migration: (1) add nullable, (2) backfill in batches, (3) add NOT NULL constraint separately
  • Test every migration against production-sized data; record execution time in the migration file header comment
  • Set a CI migration timeout: if a migration runs > 30s against the test dataset, it must be reviewed before merge
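The batched backfill in the second bullet can be driven by a simple range generator so each UPDATE touches a bounded id window and each transaction stays short; the batch size here is an assumption to tune against production data:

```python
from collections.abc import Iterator

def backfill_batches(min_id: int, max_id: int,
                     batch_size: int = 10_000) -> Iterator[tuple[int, int]]:
    """Yield inclusive (lo, hi) id ranges for batched
    UPDATE ... WHERE id BETWEEN lo AND hi statements, keeping each
    transaction small so autovacuum and replication keep up."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1
```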

10. Technology Stack

| Layer | Technology | Rationale |
|---|---|---|
| Frontend framework | Next.js 15 + TypeScript | Type safety, SSR for dashboards, static export option |
| 3D globe | CesiumJS (retained) | Native CZML support; proven in prototype |
| 2D overlays | Deck.gl | WebGL heatmaps (Mode B), arc layers, hex grids |
| Server state | TanStack Query | Caching, background refetch, stale-while-revalidate. API responses never stored in Zustand. |
| UI state | Zustand | Pure UI state only: timeline mode, selected object, layer visibility, alert acknowledgements |
| URL state | nuqs | Shareable deep links; selected object/event/time reflected in URL |
| Backend framework | FastAPI (retained) | Async, OpenAPI auto-docs, Pydantic validation |
| Task queue | Celery + Redis | Battle-tested for scientific compute; Flower monitoring |
| Catalog propagation | sgp4 | SGP4/SDP4; catalog tracking only, not decay prediction |
| Numerical integrator | scipy.integrate.DOP853 or custom RK7(8) | Adaptive step-size for Cowell decay prediction |
| Atmospheric density | nrlmsise00 Python wrapper | NRLMSISE-00; driven by F10.7 and Ap |
| Frame transformations | astropy | IAU 2006 precession/nutation, IERS EOP, TEME→GCRF→ITRF |
| Astrodynamics utilities | poliastro (optional) | Conjunction geometry helpers |
| Auth | python-jose (RS256 JWT) + pyotp (TOTP MFA) | Asymmetric JWT; TOTP RFC 6238 |
| Rate limiting | slowapi | Redis token bucket; per-user and per-IP limits |
| HTML sanitisation | bleach | User-supplied content before Playwright rendering |
| Password hashing | passlib[bcrypt] | bcrypt cost factor ≥ 12 |
| Database | TimescaleDB + PostGIS (retained) | Time-series + geospatial; RLS for multi-tenancy |
| Cache / broker | Redis 7 | Broker + pub/sub: maxmemory-policy noeviction (Celery queues must never be evicted). Separate Redis DB index for application cache: allkeys-lru. AUTH + TLS in production. |
| Connection pooler | PgBouncer 1.22 | Transaction-mode pooling between all app services and TimescaleDB. Prevents connection exhaustion at Tier 3; single failover target for Patroni switchover. max_client_conn=200, default_pool_size=20. Pool sizing derivation (F2 — §58): PostgreSQL max_connections=50; reserve 5 for superuser/admin; 45 available server connections. default_pool_size=20 per pool (one pool per DB user); leaves headroom for Alembic migrations and ad-hoc DBA access. max_client_conn=200: (2 backend workers × 40 async connections) + (4 sim workers × 16 threads) + (2 ingest workers × 4) = 152 peak; 200 provides burst headroom. Validate with SHOW POOLS; in psql connected to PgBouncer: sustained cl_waiting > 0 means the pool is undersized. |
| Object storage | MinIO | Private buckets; pre-signed URLs only |
| Containerisation | Docker Compose (retained); Caddy as TLS-terminating reverse proxy | Single-command dev; HTTPS auto-provisioning |
| Testing — backend | pytest + hypothesis | Property-based tests for numerical and security invariants |
| Testing — frontend | Vitest + Playwright | Unit tests + E2E including security header checks |
| SAST — Python | Bandit | Static analysis; CI blocks on High severity |
| SAST — TypeScript | ESLint security plugin | Static analysis; CI blocks on High severity |
| Container scanning | Trivy | CI blocks on Critical/High CVEs |
| DAST | OWASP ZAP | Phase 2 pipeline against staging |
| Dependency management | pip-tools + npm ci | Pinned hashes; --require-hashes |
| Report rendering | Playwright headless (isolated renderer container) | Server-side globe screenshot; no client-side canvas |
| Secrets management | Docker secrets (Phase 1 production) → HashiCorp Vault (Phase 3) | |
| Task scheduler HA | celery-redbeat | Redis-backed Beat scheduler; distributed locking; multiple instances safe |
| DB HA / failover | Patroni + etcd | Automatic TimescaleDB primary/standby failover; ≤ 30s RTO |
| Redis HA | Redis Sentinel (3 nodes) | Master failover ≤ 10s; transparent to application via redis-py Sentinel client |
| Monitoring | Prometheus + Grafana | Business-level metrics from Phase 1; four dashboards (§26.7); AlertManager with runbook links |
| Log aggregation | Grafana Loki + Promtail | Phase 2; Promtail scrapes Docker log files; Loki stores and queries; co-deployed with Grafana; no index servers required |
| Distributed tracing | OpenTelemetry → Grafana Tempo | Phase 2; FastAPI + SQLAlchemy + Celery auto-instrumented; OTLP exporter; trace_id = request_id for log correlation; ADR 0017 |
| Structured logging | structlog | JSON structured logs with required fields; sanitising processor strips secrets; request_id propagated through HTTP → Celery chain |
| On-call alerting | PagerDuty or OpsGenie | Routes Prometheus AlertManager alerts; L1/L2/L3 escalation tiers (§26.8) |
| CI/CD pipeline | GitLab CI | Native to the self-hosted GitLab monorepo; stage-based builds for Python/Node; protected environments and approval rules for deploys |
| Container registry | GitLab Container Registry | Co-located with source; sha-<commit> is the canonical immutable tag; latest tag is forbidden in production deployments; image vulnerability attestations via cosign |
| Pre-commit | pre-commit framework | Hooks: detect-secrets, ruff (lint + format), mypy (type gate), hadolint (Dockerfile), prettier (JS/HTML), sqlfluff (migrations); spec in .pre-commit-config.yaml; same hooks re-run in CI |
| Local task runner | make | Standard targets: make dev (full-stack hot-reload), make test (pytest + vitest), make migrate (alembic upgrade head), make seed (fixture load), make lint (all pre-commit hooks), make clean (prune volumes) |

11. Data Source Inventory

| Source | Data | Access | Priority |
|---|---|---|---|
| Space-Track.org | TLE catalog, CDMs, object catalog, RCS data, TIP messages | REST API (account required); credentials in secrets manager | P1 |
| CelesTrak | TLE subsets (active sats, decaying objects) | Public REST API / CSV | P1 |
| USSPACECOM TIP Messages | Tracking and Impact Prediction for decaying objects | Via Space-Track.org | P1 |
| NOAA SWPC | F10.7, Ap/Kp, Dst, solar wind; 3-day forecasts | Public REST API and FTP | P1 |
| ESA Space Weather Service | F10.7, Kp cross-validation source | Public REST API | P1 |
| ESA DISCOS | Physical object properties: mass, dimensions, shape, materials | REST API (account required) | P1 |
| IERS Bulletin A/B | UT1-UTC offsets, polar motion | Public FTP (usno.navy.mil); SHA-256 verified on download | P1 |
| GFS / ECMWF | Tropospheric winds and density 0–80 km | NOMADS (NOAA) public FTP | P2 |
| ILRS / CDDIS | Laser ranging POD products for validation | Public FTP | P2 (validation) |
| FIR/UIR boundaries | FIR and UIR boundary polygons for airspace intersection | EUROCONTROL AIRAC dataset (subscription) for ECAC states; FAA Digital-Terminal Procedures for US; OpenAIP as fallback for non-AIRAC regions. GeoJSON format loaded into airspace table. Updated every 28 days on AIRAC cycle. | P1 |

Deprecated reference: "18th SDS" → use Space-Track.org consistently.

ESA DISCOS redistribution rights (Finding 9): ESA DISCOS is subject to an ESAC user agreement. Data may not be redistributed or used in commercial products without explicit ESA permission. SpaceCom is a commercial platform. Required actions before Phase 2 shadow deployment:

  • Obtain written clarification from ESA/ESAC on whether DISCOS-derived physical properties (mass, dimensions) may be: (a) used internally to drive SpaceCom's own predictions; (b) exposed in API responses to ANSP customers; (c) included in generated PDF reports
  • If redistribution is not permitted, DISCOS data is used only as internal model input — API responses and reports show source: estimated rather than exposing raw DISCOS values; the data_confidence UI flag continues to show ● DISCOS for internal tracking but is not labelled as DISCOS in customer-facing outputs
  • Include the DISCOS redistribution clarification in the Phase 2 legal gate checklist alongside the Space-Track AUP opinion

Airspace data scope and SUA disclosure (Finding 4): Phase 2 FIR/UIR scope covers ECAC states (EUROCONTROL AIRAC) and US FIRs (FAA). The following airspace types are explicitly out of scope for Phase 2 and disclosed to users:

  • Special Use Airspace (SUA): danger areas, restricted areas, prohibited areas (ICAO Annex 11)
  • Terminal Manoeuvring Areas (TMAs) and Control Zones (CTRs)
  • Oceanic FIRs (ICAO Annex 2 special procedures; OACCs handle coordination)

A persistent disclosure note on the Airspace Impact Panel reads: "SpaceCom FIR intersection analysis covers FIR/UIR boundaries only. It does not account for special use airspace, terminal areas, or oceanic procedures. Controllers must apply their local procedures for these airspace types." Phase 3 consideration: SUA polygon overlay from national AIP sources. Document in docs/adr/0014-airspace-scope.md.

All source URLs are hardcoded constants in ingest/sources.py. The outbound HTTP client blocks connections to private IP ranges. No source URL is configurable via API or database at runtime.

Space-Track AUP — conditional architecture (Finding 9): The AUP clarification is a Phase 1 architectural decision gate, not a Phase 2 deliverable. The current design assumes shared ingest (a single SpaceCom Space-Track credential fetches TLEs for all organisations). If the AUP prohibits redistribution of derived predictions to customers who have not themselves agreed to the AUP, the ingest architecture must change:

  • Path A — redistribution permitted: Current shared-ingest design is valid. Each customer organisation's access is governed by SpaceCom's AUP click-wrap and the MSA. No architectural change.
  • Path B — redistribution not permitted: Per-organisation Space-Track credentials required. Each ANSP/operator must hold their own Space-Track account. SpaceCom acts as a processing layer using each org's own credentials. Architecture change: space_track_credentials table (per-org, encrypted); per-org ingest worker configuration; significant additional complexity.

The decision must be documented in docs/adr/0016-space-track-aup-architecture.md with the chosen path and evidence (written AUP clarification). This ADR is a prerequisite for Phase 1 ingest architecture finalisation — marked as a blocking decision in the Phase 1 DoD.

Space weather raw format specifications:

| Source | Endpoint constant | Format | Key fields consumed |
| --- | --- | --- | --- |
| NOAA SWPC F10.7 | NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json" | JSON array | time_tag, flux (solar flux units) |
| NOAA SWPC Kp/Ap | NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json" | JSON array | time_tag, kp_index, ap |
| NOAA SWPC 3-day forecast | NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json" | JSON | Kp array |
| ESA SWS Kp | ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions" | REST JSON | kp_index (cross-validation) |

An integration test asserts that each response contains the expected top-level keys. If a key is absent, the test fails and the schema change is caught before it reaches production ingest.
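The key-presence assertion reduces to a small helper that the integration test can call on each fetched payload (helper name and source keys are illustrative, matching the table above):

```python
# Sketch of the contract check: fail fast if the upstream JSON schema drifts.
REQUIRED_KEYS: dict[str, set[str]] = {
    "noaa_f107": {"time_tag", "flux"},
    "noaa_kp": {"time_tag", "kp_index", "ap"},
}

def assert_contract(source: str, payload: list[dict]) -> None:
    """Raise AssertionError if the first record lacks any expected key."""
    if not payload:
        raise AssertionError(f"{source}: empty response")
    missing = REQUIRED_KEYS[source] - payload[0].keys()
    if missing:
        raise AssertionError(f"{source}: missing keys {sorted(missing)}")
```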

TLE validation at ingestion gate: Before any TLE record is written to the database, ingest/cross_validator.py must verify:

  1. Both lines are exactly 69 characters (standard TLE format)
  2. Modulo-10 checksum passes on line 1 and line 2
  3. Epoch field parses to a valid UTC datetime
  4. BSTAR drag term is within physically plausible bounds (−0.5 to +0.5)

Failed validation is logged to security_logs type INGEST_VALIDATION_FAILURE with the raw TLE and failure reason. The record is not written to the database.
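The modulo-10 checksum in step 2 follows the standard TLE rule: digits contribute their value, each minus sign contributes 1, and all other characters contribute 0 over the first 68 columns. A minimal sketch:

```python
def tle_checksum(line: str) -> int:
    """Modulo-10 checksum over columns 1-68 of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1  # minus signs count as 1 per the TLE convention
    return total % 10

def checksum_valid(line: str) -> bool:
    """Column 69 must equal the computed checksum (steps 1-2 of the gate)."""
    return (len(line) == 69
            and line[68].isdigit()
            and int(line[68]) == tle_checksum(line))
```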

TLE ingest idempotency — ON CONFLICT behaviour: The tle_sets table has UNIQUE (object_id, ingested_at). If the ingest worker runs twice for the same object within the same second (e.g., orphan recovery task + normal schedule overlap, or a worker restart mid-task), the second insert must not raise an exception or silently discard the row without tracking. Required semantics:

# ingest/writer.py
import structlog
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.ext.asyncio import AsyncSession
async def write_tle_set(session: AsyncSession, tle: TLERecord) -> bool:
    """Insert TLE record. Returns True if inserted, False if duplicate."""
    stmt = pg_insert(TLESet).values(
        object_id=tle.object_id,
        ingested_at=tle.ingested_at,
        tle_line1=tle.line1,
        tle_line2=tle.line2,
        epoch=tle.epoch,
        source=tle.source,
    ).on_conflict_do_nothing(
        index_elements=["object_id", "ingested_at"]
    ).returning(TLESet.object_id)

    result = await session.execute(stmt)
    inserted = result.scalar_one_or_none() is not None  # RETURNING yields a row only on actual insert
    if not inserted:
        spacecom_ingest_tle_conflict_total.inc()   # metric; non-zero signals scheduling race
        structlog.get_logger().debug("tle_insert_skipped_duplicate",
                                     object_id=tle.object_id, ingested_at=tle.ingested_at)
    return inserted

Prometheus counter spacecom_ingest_tle_conflict_total — a sustained non-zero rate warrants investigation of the Beat schedule overlap. A brief spike during worker restart is acceptable.

Ingest idempotency requirement for all periodic tasks (F8 — §67): TLE ingest uses ON CONFLICT DO NOTHING (above). All other periodic ingest tasks must use equivalent upsert semantics to survive celery-redbeat double-fire on restart:

-- Space weather ingest: upsert on (fetched_at) unique constraint
INSERT INTO space_weather (fetched_at, kp, f107, ...)
VALUES (:fetched_at, :kp, :f107, ...)
ON CONFLICT (fetched_at) DO NOTHING;

-- DISCOS object metadata: upsert on (norad_id) — update if data changed
INSERT INTO objects (norad_id, name, launch_date, updated_at, ...)
VALUES (:norad_id, :name, :launch_date, :updated_at, ...)
ON CONFLICT (norad_id) DO UPDATE SET
    name = EXCLUDED.name,
    launch_date = EXCLUDED.launch_date,
    updated_at = EXCLUDED.updated_at
WHERE objects.updated_at < EXCLUDED.updated_at;  -- only update if newer

-- IERS EOP: upsert on (date) unique constraint
INSERT INTO iers_eop (date, ut1_utc, x_pole, y_pole, ...)
VALUES (:date, :ut1_utc, :x_pole, :y_pole, ...)
ON CONFLICT (date) DO NOTHING;

Add unique constraints if not present: UNIQUE (fetched_at) on space_weather; UNIQUE (date) on iers_eop. These prevent double-write corruption at the DB level regardless of application retry logic.

IERS EOP cold-start requirement: On a fresh deployment with no cached EOP data, astropy's IERS_Auto falls back to the bundled IERS-B table (which lags the current date by weeks to months), silently degrading UT1-UTC precision from ~1 ms (IERS-A) to ~10–50 ms (IERS-B). For epochs beyond the IERS-B table end date, astropy raises IERSRangeError, crashing all frame transforms.

The EOP ingest task must run as part of make seed before any propagation task starts:

# Makefile
seed: migrate
	docker compose exec backend python -m ingest.eop --bootstrap   # downloads + caches current IERS-A
	docker compose exec backend python -m ingest.fir --bootstrap    # loads FIR boundaries
	docker compose exec backend sh -c 'psql "$$DATABASE_URL" -f fixtures/dev_seed.sql'  # SQL fixture loaded via psql (DATABASE_URL assumed in backend env)

The EOP ingest task in Celery Beat is ordered before the TLE ingest task: EOP runs at 00:00 UTC, TLE ingest at 00:10 UTC (ensuring fresh EOP before the first propagation of the day).
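The ordering above translates to a Beat schedule along these lines (task names are assumptions; only the crontab offsets are specified by this section):

```python
# celery_app.py — Beat schedule sketch: EOP at 00:00 UTC, TLE ingest at 00:10 UTC
from celery.schedules import crontab

beat_schedule = {
    "eop-daily": {
        "task": "ingest.tasks.refresh_eop",
        "schedule": crontab(hour=0, minute=0),   # fresh EOP first
    },
    "tle-daily": {
        "task": "ingest.tasks.ingest_tles",
        "schedule": crontab(hour=0, minute=10),  # first propagation of the day sees new EOP
    },
}
```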

IERS EOP verification — dual-mirror comparison: The IERS does not publish SHA-256 hashes alongside its EOP files. Comparing each download's hash against the previous download detects corruption but not substitution. Instead, the file is downloaded from both the USNO mirror and the Paris Observatory mirror, and the two are verified to agree:

# ingest/eop.py
IERS_MIRRORS = [
    "https://maia.usno.navy.mil/ser7/finals2000A.all",
    "https://hpiers.obspm.fr/iers/series/opa/eopc04",   # IERS-C04 series
]

async def fetch_and_verify_eop() -> bytes:
    contents = []
    for url in IERS_MIRRORS:
        resp = await http_client.get(url, timeout=30)
        resp.raise_for_status()
        contents.append(resp.content)

    # Verify UT1-UTC values agree within 0.1 ms across mirrors (format-normalised comparison)
    if not _eop_values_agree(contents[0], contents[1], tolerance_ms=0.1):
        structlog.get_logger().error("eop_mirror_disagreement")
        spacecom_eop_mirror_agreement.set(0)
        raise EOPVerificationError("IERS EOP mirrors disagree — rejecting both")

    spacecom_eop_mirror_agreement.set(1)
    return contents[0]   # USNO is primary; Paris Observatory is the verification witness

Prometheus gauge spacecom_eop_mirror_agreement (1 = mirrors agree, 0 = disagreement detected). Alert on spacecom_eop_mirror_agreement == 0.
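The _eop_values_agree helper is referenced but not shown. A minimal sketch, assuming both files have already been parsed (format-normalised) into {MJD: UT1-UTC seconds} mappings:

```python
def eop_values_agree(a: dict[int, float], b: dict[int, float],
                     tolerance_ms: float = 0.1) -> bool:
    """Compare UT1-UTC (seconds) on the MJDs both series cover."""
    common = a.keys() & b.keys()
    if not common:
        return False  # no overlapping dates is itself a verification failure
    return all(abs(a[mjd] - b[mjd]) * 1000.0 <= tolerance_ms for mjd in common)
```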


12. Backend Directory Structure

backend/
  app/
    main.py              # FastAPI app factory, middleware, router mounting
    config.py            # Settings via pydantic-settings (env vars); no secrets in code
    auth/
      provider.py        # AuthProvider protocol + LocalJWTProvider implementation
      jwt.py             # RS256 token issue, verify, refresh; key loaded from secrets
      mfa.py             # TOTP (pyotp); recovery code generation and verification
      deps.py            # get_current_user, require_role() dependency factory
      middleware.py      # Auth middleware; rate limit enforcement
    frame_utils.py       # TEME→GCRF→ITRF→WGS84 + IERS EOP refresh + hash verification
    time_utils.py        # Time system conversions
    integrity.py         # HMAC sign/verify for predictions and hazard zones
    logging_config.py    # Sanitising log formatter; security event logger
  modules/
    catalog/
      router.py          # /api/v1/objects; requires viewer role minimum
      schemas.py
      service.py
      models.py
    propagator/
      catalog.py         # SGP4 catalog propagation
      decay.py           # RK7(8) + NRLMSISE-00 + Monte Carlo; HMAC-signs output
      tasks.py           # Celery tasks with time_limit, soft_time_limit
      router.py          # /api/v1/propagate, /api/v1/decay; requires analyst role
    reentry/
      router.py          # /api/v1/reentry; requires viewer role
      service.py
      corridor.py        # Percentile corridor polygon generation
    spaceweather/
      router.py          # /api/v1/spaceweather; requires viewer role
      service.py         # Cross-validates NOAA SWPC vs ESA SWS; generates status string
      tasks.py           # Celery Beat: NOAA SWPC polling every 3h
      noaa_swpc.py       # NOAA SWPC client; URL hardcoded constant
      esa_sws.py         # ESA SWS cross-validation client
    viz/
      router.py          # /api/v1/czml; requires viewer role
      czml_builder.py    # CZML output; all strings HTML-escaped; J2000 INERTIAL frame
      mc_geometry.py     # MC trajectory binary blob pre-baking
    ingest/
      sources.py         # Hardcoded external URLs and IP allowlists (SSRF mitigation)
      tasks.py           # Celery Beat-scheduled tasks
      spacetrack.py      # Space-Track client; credentials from secrets manager only
      celestrak.py       # CelesTrak client
      discos.py          # ESA DISCOS client
      iers.py            # IERS EOP fetcher + SHA-256 verification
      cross_validator.py # TLE and space weather cross-source comparison
    alerts/
      router.py          # /api/v1/alerts; requires operator role for acknowledge
      service.py         # Alert trigger evaluation; rate limit enforcement; deduplication
      notifier.py        # WebSocket push + email; storm detection
      integrity_guard.py # TIP vs prediction cross-check; HMAC failure escalation
    reports/
      router.py          # /api/v1/reports; requires analyst role
      builder.py         # Section assembly; all user fields sanitised via bleach
      renderer_client.py # Internal HTTPS call to renderer service with sanitised payload
    security/
      audit.py           # Security event logger; writes to security_logs
      sanitiser.py       # Log formatter that strips credential patterns
    breakup/
      atmospheric.py
      on_orbit.py
      tasks.py
      router.py
    conjunction/
      screener.py
      probability.py
      tasks.py
      router.py
    weather/
      upper.py
      lower.py
    hazard/
      router.py
      fusion.py          # HMAC-signs all hazard_zones output; propagates shadow_mode flag
      tasks.py
    airspace/
      router.py
      loader.py
      intersection.py
    notam/
      router.py          # /api/v1/notam; requires operator role
      drafter.py         # ICAO Annex 15 format generation
      disclaimer.py      # Mandatory regulatory disclaimer text
    space_portal/
      router.py          # /api/v1/space; space_operator and orbital_analyst roles
      owned_objects.py   # Owned object CRUD; RLS enforcement
      controlled_reentry.py  # Deorbit window optimisation
      ccsds_export.py    # CCSDS OEM/CDM format export
      api_keys.py        # API key lifecycle management
    launch_safety/       # Phase 3
      screener.py
      router.py
    reroute/             # Phase 3; strategic pre-flight avoidance boundary only
    feedback/            # Phase 3; includes shadow_validation.py
  migrations/            # Alembic; includes immutability triggers in initial migration
  tests/
    conftest.py          # db_session fixture (SAVEPOINT/ROLLBACK); testcontainers setup for Celery tests
    physics/
      test_frame_utils.py
      test_propagator/
      test_decay/
      test_nrlmsise.py
      test_hypothesis.py   # Hypothesis property-based tests (§42.3)
      test_mc_corridor.py  # MC seeded RNG corridor validation (§42.4)
      test_breakup/
    test_integrity.py    # HMAC sign/verify; tamper detection
    test_auth.py         # JWT; MFA; rate limiting; RBAC enforcement
    test_rbac.py         # Every endpoint tested for correct role enforcement
    test_websocket.py    # WS sequence replay; token expiry warning; close codes 4001/4002
    test_ingest/
      test_contracts.py  # Space-Track + NOAA key presence AND value-range assertions
    test_spaceweather/
    test_jobs/
      test_celery_failure.py  # Timeout → 'failed'; orphan recovery Beat task
    smoke/               # Post-deploy; all idempotent; run in ≤ 2 min; require smoke_user seed
      test_api_health.py    # GET /readyz → 200/207; GET /healthz → 200
      test_auth_smoke.py    # Login → JWT; refresh → new token
      test_catalog_smoke.py # GET /catalog → 200; 'data' key present
      test_ws_smoke.py      # WS connect → heartbeat within 5s
      test_db_smoke.py      # SELECT 1 via backend health endpoint
    quarantine/          # Flaky tests awaiting fix; excluded from blocking CI (see §33.10 policy)
  requirements.in        # pip-tools source
  requirements.txt       # pip-compile output with hashes
  Dockerfile             # FROM pinned digest; non-root user; read-only FS

12.1 Repository docs/ Directory Structure

All documentation files live under docs/ in the monorepo root. Files referenced elsewhere in this plan must exist at these paths.

docs/
  README.md                          # Documentation index — what's here and where to look
  MASTER_PLAN.md                     # This document
  AGENTS.md                          # Guidance for AI coding agents working in this repo (see §33.9)
  CHANGELOG.md                       # Keep a Changelog format; human-maintained; one entry per release

  adr/                               # Architecture Decision Records (MADR format)
    README.md                        # ADR index with status column
    0001-rs256-asymmetric-jwt.md
    0002-dual-frontend-architecture.md
    0003-monte-carlo-chord-pattern.md
    0004-geography-vs-geometry-spatial-types.md
    0005-lazy-raise-sqlalchemy.md
    0006-timescaledb-chunk-intervals.md
    0007-cesiumjs-commercial-licence.md
    0008-pgbouncer-transaction-mode.md
    0009-ccsds-oem-gcrf-reference-frame.md
    0010-alert-threshold-rationale.md
    # ... continued; one ADR per consequential decision in §20

  runbooks/
    README.md                        # Runbook index with owner and last-reviewed date
    TEMPLATE.md                      # Standard runbook template (see §33.4)
    db-failover.md
    celery-recovery.md
    hmac-failure.md
    ingest-failure.md
    gdpr-breach-notification.md
    safety-occurrence-notification.md
    secrets-rotation-jwt.md
    secrets-rotation-spacetrack.md
    secrets-rotation-hmac.md
    blue-green-deploy.md
    restore-from-backup.md

  model-card-decay-predictor.md      # Living document; updated per model version (§32.1)
  ood-bounds.md                      # OOD detection thresholds (§32.3)
  recalibration-procedure.md         # Recalibration governance (§32.4)
  alert-threshold-history.md         # Alert threshold change log (§24.8)

  query-baselines/                   # EXPLAIN ANALYZE output; one file per critical query
    czml_catalog_100obj.txt
    fir_intersection_baseline.txt
    # ... one file per query baseline recorded in Phase 1

  validation/                        # Validation procedure and reference data (§17)
    README.md                        # How to run each validation suite
    reference-data/
      vallado-sgp4-cases.json        # Vallado (2013) SGP4 reference state vectors
      iers-frame-test-cases.json     # IERS precession-nutation reference cases
      aerospace-corp-reentries.json  # Historical re-entry outcomes for backcast validation
    backcast-validation-v1.0.0.pdf   # Phase 1 validation report (≥3 events)
    backcast-validation-v2.0.0.pdf   # Phase 2 validation report (≥10 events)

  api-guide/                         # Persona E/F API developer documentation (§33.10)
    README.md                        # API guide index
    authentication.md
    rate-limiting.md
    webhooks.md
    code-examples/
      python-quickstart.py
      typescript-quickstart.ts
    error-reference.md

  user-guides/                       # Operational persona documentation (§33.7)
    aviation-portal-guide.md         # Persona A/B/C
    space-portal-guide.md            # Persona E/F
    admin-guide.md                   # Persona D

  test-plan.md                       # Test suite index with scope and blocking classification (§33.11)

  public-reports/                    # Quarterly transparency reports (§32.6)
    # quarterly-accuracy-YYYY-QN.pdf

  legal/                             # Legal opinion documents (MinIO primary; this dir for dev reference)
    # legal-opinion-template.md

13. Frontend Directory Structure and Architecture

frontend/
  src/
    app/
      page.tsx                         # Operational Overview
      watch/[norad_id]/page.tsx        # Object Watch Page
      events/
        page.tsx                       # Active Events + full Timeline/Gantt
        [id]/page.tsx                  # Event Detail
      airspace/page.tsx                # Airspace Impact View
      analysis/page.tsx                # Analyst Workspace
      catalog/page.tsx                 # Object Catalog
      reports/
        page.tsx
        [id]/page.tsx
      admin/page.tsx                   # System Administration (admin role only)
      space/
        page.tsx                       # Space Operator Overview
        objects/
          page.tsx                     # My Objects Dashboard (space_operator: owned only)
          [norad_id]/page.tsx          # Object Technical Detail
        reentry/
          plan/page.tsx                # Controlled Re-entry Planner
        conjunction/page.tsx           # Conjunction Screening (orbital_analyst)
        analysis/page.tsx              # Orbital Analyst Workspace
        export/page.tsx                # Bulk Export
        api/page.tsx                   # API Keys + Documentation
      layout.tsx                       # Root layout: nav, ModeIndicator, AlertBadge,
                                       # JobsPanel; applies security headers via middleware

    middleware.ts                      # Next.js middleware: enforce HTTPS, set CSP
                                       # and security headers on every response,
                                       # redirect unauthenticated users to /login

    components/
      globe/
        CesiumViewer.tsx
        LayerPanel.tsx
        ViewToggle.tsx
        ClusterLayer.tsx
        CorridorLayer.tsx
        corridor/
          PercentileCorridors.tsx      # Mode A
          ProbabilityHeatmap.tsx       # Mode B (Phase 2)
          ParticleTrajectories.tsx     # Mode C (Phase 3)
        UncertaintyModeSelector.tsx
      plan/
        PlanView.tsx                   # Phase 2
        AltitudeCrossSection.tsx       # Phase 2
      timeline/
        TimelineStrip.tsx
        TimelineGantt.tsx
        TimelineControls.tsx
        ModeIndicator.tsx
      panels/
        ObjectInfoPanel.tsx
        PredictionPanel.tsx            # Includes HMAC status indicator
        AirspaceImpactPanel.tsx        # Phase 2
        ConjunctionPanel.tsx           # Phase 2
      alerts/
        AlertBanner.tsx
        AlertBadge.tsx
        NotificationCentre.tsx
        AcknowledgeDialog.tsx
      jobs/
        JobsPanel.tsx
        JobProgressBar.tsx
        SimulationComparison.tsx
      spaceweather/
        SpaceWeatherWidget.tsx
      reports/
        ReportConfigDialog.tsx
        ReportPreview.tsx
      space/
        SpaceOverview.tsx
        OwnedObjectCard.tsx
        ControlledReentryPlanner.tsx
        DeorbitWindowList.tsx
        ApiKeyManager.tsx
        CcsdsExportPanel.tsx
        ShadowBanner.tsx             # Amber banner displayed when shadow mode active
      notam/
        NotamDraftViewer.tsx
        NotamCancellationDialog.tsx
        NotamRegulatoryDisclaimer.tsx
      shadow/
        ShadowModeIndicator.tsx
        ShadowValidationReport.tsx
      dashboard/
        EventSummaryCard.tsx
        SystemHealthCard.tsx
      shared/
        DataConfidenceBadge.tsx
        IntegrityStatusBadge.tsx       # ✓ HMAC verified / ✗ HMAC failed
        UncertaintyBound.tsx
        CountdownTimer.tsx

    hooks/
      useObjects.ts
      usePrediction.ts                 # Polls HMAC status; shows warning if failed
      useEphemeris.ts
      useSpaceWeather.ts
      useAlerts.ts
      useSimulation.ts
      useCZML.ts
      useWebSocket.ts                  # Cookie-based auth; per-user connection limit

    stores/                            # Zustand — UI state only; no API responses
      timelineStore.ts                 # Mode, playhead position, playback speed
      selectionStore.ts                # Selected object/event/zone IDs
      layerStore.ts                    # Layer visibility, corridor display mode
      jobsStore.ts                     # Active job IDs (content fetched via TanStack Query)
      alertStore.ts                    # Unread count, mute rules
      uiStore.ts                       # Panel state, theme (dark/light/high-contrast)

    lib/
      api.ts                           # Typed fetch wrapper; credentials: 'include'
                                       # for httpOnly cookie auth; never reads tokens
      czml.ts
      ws.ts                            # wss:// enforced; cookie auth at upgrade
      corridorGeometry.ts
      mcBinaryDecoder.ts
      reportUtils.ts

    types/
      objects.ts
      predictions.ts                   # Includes hmac_status, integrity_failed fields
      alerts.ts
      spaceweather.ts
      simulation.ts
      czml.ts

  public/
    branding/
  middleware.ts                        # Root Next.js middleware for security headers
  next.config.ts                       # Content-Security-Policy defined here for SSR
  tsconfig.json
  package.json
  package-lock.json                    # Committed; npm ci used in Docker builds

13.0 Accessibility Standard Commitment

Minimum standard: WCAG 2.1 Level AA, which is incorporated by reference into EN 301 549 v3.2.1 — the mandatory accessibility standard for ICT procured by EU public sector bodies including ESA. (ISO/IEC 40500:2012 corresponds to the earlier WCAG 2.0.) Failure to meet EN 301 549 is a bid disqualifier for any EU public sector tender.

All frontend work must meet these criteria before a PR is merged:

  • WCAG 2.1 AA automated check passes (axe-core — see §42)
  • Keyboard-only operation possible for all primary operator workflows
  • Screen reader (NVDA + Firefox; VoiceOver + Safari) tested for primary workflow on each release
  • Colour contrast ≥ 4.5:1 for all informational text; ≥ 3:1 for UI components and graphical elements
  • No functionality conveyed by colour alone

Deliverable: Accessibility Conformance Report (ACR / VPAT 2.4) produced before Phase 2 ESA bid submission. Maintained thereafter for each major release.

UTC-only rule for operational interface (F1): ICAO Annex 2 and Annex 15 mandate UTC for all aeronautical operational communications. The following is a hard rule — no exceptions without explicit documentation and legal/safety sign-off:

  • All times displayed in Persona A/C operational views (alert panels, event detail, NOTAM draft, shift handover) are UTC only, formatted as HH:MMZ or DD MMM YYYY HH:MMZ
  • No timezone conversion widget or local-time toggle in the operational interface
  • Local time display is permitted only in non-operational views (account settings, admin billing pages) and must be clearly labelled with the timezone name
  • The Z suffix or UTC label is persistently visible — never hidden in a tooltip or hover state
  • All API timestamps returned as ISO 8601 UTC (2026-03-22T14:00:00Z) — never local time strings
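The two display formats above can be produced by a pair of small formatters (a sketch; function and constant names are assumptions):

```typescript
// UTC-only formatters for the operational interface: HH:MMZ and DD MMM YYYY HH:MMZ.
const MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"];

export function formatUtcShort(d: Date): string {
  const hh = String(d.getUTCHours()).padStart(2, "0");
  const mm = String(d.getUTCMinutes()).padStart(2, "0");
  return `${hh}:${mm}Z`;   // Z suffix always rendered, never tooltip-only
}

export function formatUtcLong(d: Date): string {
  const dd = String(d.getUTCDate()).padStart(2, "0");
  return `${dd} ${MONTHS[d.getUTCMonth()]} ${d.getUTCFullYear()} ${formatUtcShort(d)}`;
}
```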

13.1 State Management Separation

TanStack Query: All API-derived data — object lists, predictions, ephemeris, space weather, alerts, simulation results. Handles caching, background refetch, and stale-while-revalidate.

Zustand: Pure UI state with no server dependency — selected IDs, layer visibility, timeline mode and position, panel open/closed state, theme, alert mute rules.

URL state (nuqs): Shareable, bookmarkable — selected NORAD ID, active event ID, time position in replay mode, active layer set. Browser back/forward works correctly. Requires NuqsAdapter wrapping the App Router root layout to hydrate correctly on SSR.

Never in state: Raw API response bodies. No useEffect that writes API responses into Zustand.

Authentication in the client: The api.ts fetch wrapper uses credentials: 'include' to send the httpOnly auth cookie automatically. The client never reads, stores, or handles the JWT token directly — it is invisible to JavaScript. CSRF is mitigated by SameSite=Strict on the cookie.
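A minimal sketch of the api.ts wrapper (base path and error shape are assumptions; the essential part is credentials: 'include' and the absence of any token handling):

```typescript
// Typed fetch wrapper: the httpOnly cookie rides along automatically;
// JavaScript never reads, stores, or forwards the JWT.
export async function apiFetch<T>(path: string, init: RequestInit = {}): Promise<T> {
  const resp = await fetch(`/api/v1${path}`, {
    ...init,
    credentials: "include",
  });
  if (!resp.ok) throw new Error(`API ${resp.status} on ${path}`);
  return resp.json() as Promise<T>;
}
```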

Next.js App Router component boundary (ADR 0018): The project uses App Router. The globe and all operational views are client components; static pages (onboarding, settings, admin) are React Server Components where practical.

| Route group | RSC/Client | Rationale |
| --- | --- | --- |
| app/(globe)/ — operational views | "use client" root layout | CesiumJS, WebSocket, Zustand hooks require browser APIs |
| app/(static)/ — onboarding, settings | Server Components by default | No browser APIs needed; faster initial load |
| app/(auth)/ — login, MFA | Server Components + Client islands | Form validation islands only |

Rules enforced in AGENTS.md:

  • Never add "use client" to a leaf component without a comment explaining which browser API requires it
  • app/(globe)/layout.tsx is the single "use client" boundary for all operational views — child components inherit it without re-declaring
  • nuqs requires <NuqsAdapter> at the root of app/(globe)/layout.tsx

TanStack Query key factory (src/lib/queryKeys.ts) — stable hierarchical keys prevent cache invalidation bugs:

export const queryKeys = {
  objects: {
    all: ()           => ['objects'] as const,
    list: (f: ObjectFilters) => ['objects', 'list', f] as const,
    detail: (id: number)    => ['objects', 'detail', id] as const,
    tleHistory: (id: number) => ['objects', id, 'tle-history'] as const,
  },
  predictions: {
    byObject: (id: number) => ['predictions', id] as const,
  },
  alerts: {
    all:    ()           => ['alerts'] as const,
    unacked: (orgId: number) => ['alerts', 'unacked', orgId] as const,
  },
  jobs: {
    detail: (jobId: string) => ['jobs', jobId] as const,
  },
} as const;
// On WS alert.new: queryClient.invalidateQueries({ queryKey: queryKeys.alerts.all() })
// On acknowledge mutation: optimistic setQueryData, then invalidate on settle

React error boundary hierarchy — a CesiumJS crash must never remove the alert panel from the DOM:

// app/(globe)/layout.tsx
<AppErrorBoundary fallback={<AppCrashPage />}>
  <GlobeErrorBoundary fallback={<GlobeUnavailable />}>
    <GlobeCanvas />               {/* WebGL context loss isolated here */}
  </GlobeErrorBoundary>
  <PanelErrorBoundary name="alerts">
    <AlertPanel />                {/* Survives globe crash */}
  </PanelErrorBoundary>
  <PanelErrorBoundary name="events">
    <EventList />
  </PanelErrorBoundary>
</AppErrorBoundary>

GlobeUnavailable displays: "Globe unavailable — WebGL context lost. Re-entry event data below remains operational." Alert and event panels remain visible and functional. Add GlobeErrorBoundary to AGENTS.md safety-critical component list.

Loading and empty state specification — for safety-critical panels, loading and empty must be visually distinct from each other and from error:

| State | Visual treatment | Required text |
| --- | --- | --- |
| Loading | Skeleton matching panel layout | — |
| Empty | Explicit affirmative message | AlertPanel: "No unacknowledged alerts"; EventList: "No active re-entry events" |
| Error | Inline error with retry button | Never blank |

Rule: safety-critical panels (AlertPanel, EventList, PredictionPanel) must never render blank. DataConfidenceBadge must always show a value — display "Unknown" explicitly, never render nothing.
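The DataConfidenceBadge rule reduces to a tiny total function (type and function names are assumptions):

```typescript
// "Never render nothing": an absent confidence value becomes an explicit label.
type DataConfidence = "DISCOS" | "estimated" | "operator-declared" | null | undefined;

export function confidenceLabel(c: DataConfidence): string {
  return c ?? "Unknown"; // explicit "Unknown" instead of an empty badge
}
```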

WebSocket reconnection policy (src/lib/ws.ts):

const RECONNECT = {
  initialDelayMs: 1_000,
  maxDelayMs:     30_000,
  multiplier:     2,
  jitter:         0.2,   // ±20% — spreads reconnections after mass outage/deploy
};
// TOKEN_EXPIRY_WARNING handler: trigger silent POST /auth/token/refresh;
//   on success send AUTH_REFRESH; on failure show re-login modal (60s grace before disconnect)
// Reconnect sends ?since_seq=<last_seq> for missed event replay
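The policy constants map to a delay function along these lines (a self-contained sketch that duplicates the constants; attempt is zero-based and rand is injectable for testing):

```typescript
// Exponential backoff with cap and ±jitter, per the RECONNECT policy above.
const POLICY = { initialDelayMs: 1_000, maxDelayMs: 30_000, multiplier: 2, jitter: 0.2 };

export function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  const base = Math.min(
    POLICY.initialDelayMs * POLICY.multiplier ** attempt,
    POLICY.maxDelayMs,
  );
  const factor = 1 + (rand() * 2 - 1) * POLICY.jitter; // rand() = 0.5 → no jitter
  return Math.round(base * factor);
}
```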

Operational mode guard (src/hooks/useModeGuard.ts) — enforces LIVE/SIMULATION/REPLAY write restrictions:

export function useModeGuard(allowedModes: OperationalMode[]) {
  const { mode } = useTimelineStore();
  return { isAllowed: allowedModes.includes(mode), currentMode: mode };
}
// Usage: const { isAllowed } = useModeGuard(['LIVE']);
// All write-action components (acknowledge alert, submit NOTAM draft, trigger prediction)
// must call useModeGuard(['LIVE']) and disable + annotate button in other modes.

Deck.gl + CesiumJS integration — use DeckLayer from @deck.gl/cesium (rendered inside CesiumJS as a primitive; correct z-order and shared input handling). Never use a separate Deck.gl canvas:

import { DeckLayer } from '@deck.gl/cesium';
import { HeatmapLayer } from '@deck.gl/aggregation-layers';

const deckLayer = new DeckLayer({
  layers: [new HeatmapLayer({ id: 'mc-heatmap', data: mcTrajectories,
    getPosition: d => [d.lon, d.lat], getWeight: d => d.weight,
    radiusPixels: 30, intensity: 1, threshold: 0.03 })],
});
viewer.scene.primitives.add(deckLayer);
// Remove when switching away from Mode B: viewer.scene.primitives.remove(deckLayer)

CesiumJS client-side memory constraints:

| Constraint | Value | Enforcement |
| --- | --- | --- |
| Max CZML entity count in globe | 500 | Prune lowest-perigee objects beyond 500; useCZML monitors count |
| Orbit path duration | 72h forward / 24h back | Longer paths accumulate geometry |
| Heatmap cell resolution (Mode B) | 0.5° × 0.5° | Higher resolution requires more GPU memory |
| Stale entity pruning | Remove entities not updated in 48h | Prevents ghost entities in long sessions |
| Globe entity count | Prometheus gauge spacecom_globe_entity_count | WARNING alert at 450; prune trigger at 500 |

Bundle size budget and dynamic imports:

| Bundle | Strategy | Budget (gzipped) |
| --- | --- | --- |
| Login / onboarding / settings | Static; no CesiumJS/Deck.gl | < 200 KB |
| Globe route initial load | CesiumJS lazy-loaded; spinner shown | < 500 KB before CesiumJS |
| Globe fully loaded | CesiumJS + Deck.gl + app | < 8 MB |

// src/components/globe/GlobeCanvas.tsx
import dynamic from 'next/dynamic';
const CesiumViewer = dynamic(
  () => import('./CesiumViewerInner'),
  { ssr: false, loading: () => <GlobeLoadingState /> }
);

bundlewatch (or @next/bundle-analyzer) in CI; warning (non-blocking) if initial route bundle exceeds budget. Baseline stored in .bundle-size-baseline.


13.2 Accessible Parallel Table View (F4)

The CesiumJS WebGL globe is inherently inaccessible: no keyboard navigation, no screen reader support, no motor-impairment accommodation. All interactions available via the globe must also be available via a parallel data table view.

Component: src/components/globe/ObjectTableView.tsx

  • Accessible via keyboard shortcut Alt+T from any operational view, and via a persistent visible "Table view" button in the globe toolbar
  • Displays all objects currently rendered on the globe: NORAD ID, name, orbit type, conjunction status badge, predicted re-entry window, alert level
  • Sortable by any column (aria-sort updated on header click/keypress); filterable by alert level
  • Row selection focuses the object's Event Detail panel (same as map click)
  • All alert acknowledgement actions reachable from the table view — no functionality requires the globe
  • Implemented as <table> with <thead>, <tbody>, <th scope="col">, <th scope="row"> — no ARIA table role substitutes where native HTML suffices
  • Pagination or virtual scroll for large object sets; aria-rowcount and aria-rowindex set correctly for virtualised rows

The table view is the primary interaction surface for users who cannot use the map. It must be functionally complete, not a read-only summary.


13.3 Keyboard Navigation Specification (F6)

All primary operator workflows must be completable by keyboard alone. Required implementation:

Skip links (rendered as the first focusable element in the page, visible on focus):

<a href="#alert-panel" class="skip-link">Skip to alert panel</a>
<a href="#main-content" class="skip-link">Skip to main content</a>
<a href="#object-table" class="skip-link">Skip to object table</a>

Focus ring: Minimum 3px solid outline, ≥ 3:1 contrast against adjacent colours (meets the AA baseline of WCAG 2.4.7 Focus Visible and follows the WCAG 2.2 Focus Appearance guidance, SC 2.4.13). Never outline: none without a custom focus indicator. Defined in design tokens: --focus-ring: 3px solid #4A9FFF.

Tab order: Follows DOM order (no tabindex > 0). Logical flow: nav → alert panel → map toolbar → main content. Modal dialogs trap focus within the dialog while open; focus returns to the trigger element on close.
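The wrap-around behaviour of the focus trap reduces to a pure index computation, sketched here in TypeScript (the helper name is an assumption, not the codebase API):

```typescript
// Given the number of focusable elements inside an open dialog, the index of
// the currently focused element, and whether Shift was held, return the index
// that should receive focus next so Tab cycles within the dialog.
function nextFocusIndex(count: number, current: number, shiftKey: boolean): number {
  if (count === 0) return -1;               // nothing focusable in the dialog
  const delta = shiftKey ? -1 : 1;          // Shift+Tab moves backwards
  return (current + delta + count) % count; // wrap at both ends
}
```

A keydown handler for Tab inside the dialog calls this, focuses the element at the returned index, and prevents the default browser behaviour.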

Application keyboard shortcuts (all documented in UI via ? help overlay):

| Shortcut | Action |
| --- | --- |
| Alt+A | Focus most-recent active CRITICAL alert |
| Alt+T | Toggle table / globe view |
| Alt+H | Open shift handover view |
| Alt+N | Open NOTAM draft for active event |
| ? | Open keyboard shortcut reference overlay |
| Escape | Close modal / dismiss non-CRITICAL overlay |
| Arrow keys | Navigate within alert list, table rows, accordion items |

All shortcuts declared via aria-keyshortcuts on their trigger elements. No shortcut conflicts with browser or screen reader reserved keys.


13.4 Colour and Contrast Specification (F7)

All colour pairs must meet WCAG 2.1 AA contrast requirements. Documented in frontend/src/tokens/colours.ts as design tokens; no hardcoded colour values in component files.

Operational severity palette (dark theme — background: #1A1A2E):

| Severity | Background | Text | Contrast ratio | Status |
| --- | --- | --- | --- | --- |
| CRITICAL | #7B4000 | #FFFFFF | 7.2:1 | ✓ AA |
| HIGH | #7A3B00 | #FFD580 | 5.1:1 | ✓ AA |
| MEDIUM | #1A3A5C | #90CAF9 | 4.6:1 | ✓ AA |
| LOW | #1E3A2F | #81C784 | 4.5:1 | ✓ AA (minimum) |
| Focus ring | #1A1A2E | #4A9FFF | 4.8:1 | ✓ AA |

All pairs verified with the APCA algorithm for large display text (corridor labels on the globe). If a colour fails at the target background, the background is adjusted — the text colour is kept consistent for operator recognition.

Number formatting (F4): Probability values, altitudes, and distances must be formatted correctly across locales:

  • Operational interface (Persona A/C): Always use ICAO-standard decimal point (.) regardless of browser locale — deviating from locale convention is intentional and matches ICAO Doc 8400 standards; this is documented as an explicit design decision
  • Admin / reporting / Space Operator views: Use Intl.NumberFormat(locale) for locale-aware formatting (comma decimal separator in DE/FR/ES locales)
  • Helper: formatOperationalNumber(n: number): string — always . decimal, 3 significant figures for probabilities; formatDisplayNumber(n: number, locale: string): string — locale-aware
  • Never use raw Number.toString() or n.toFixed() in JSX — both ignore locale
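A minimal sketch of the two helpers named above (the rounding and API details here are assumptions):

```typescript
// Operational display (Persona A/C): always ICAO decimal point, 3 significant
// figures. The toPrecision + Number round-trip is locale-independent, which is
// exactly the intent for operational values.
function formatOperationalNumber(n: number): string {
  return Number(n.toPrecision(3)).toString();
}

// Admin / reporting display: locale-aware via Intl.NumberFormat (comma decimal
// separator in DE/FR/ES locales).
function formatDisplayNumber(n: number, locale: string): string {
  return new Intl.NumberFormat(locale).format(n);
}
```

formatOperationalNumber(0.001234) yields "0.00123" in every locale; formatDisplayNumber(1234.5, "de-DE") yields "1.234,5".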

Non-colour severity indicators (F5): Colour must never be the sole differentiator. Each severity level also carries:

| Severity | Icon/shape | Text label | Border width |
| --- | --- | --- | --- |
| CRITICAL | ⬟ (octagon) | "CRITICAL" always visible | 3px solid |
| HIGH | ▲ (triangle) | "HIGH" always visible | 2px solid |
| MEDIUM | ● (circle) | "MEDIUM" always visible | 1px solid |
| LOW | ○ (circle outline) | "LOW" always visible | 1px dashed |

The 1 Hz CRITICAL colour cycle (§28.3 habituation countermeasure) must also include a redundant non-colour animation: 1 Hz border-width pulse (2px → 4px → 2px). Users with prefers-reduced-motion: reduce see a static thick border instead (see §28.3 reduced-motion rules).
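The pulse and its reduced-motion fallback can be sketched in CSS (the class name is assumed):

```css
/* Redundant non-colour cue: 1 Hz border-width pulse on CRITICAL alerts */
@keyframes critical-border-pulse {
  0%, 100% { border-width: 2px; }
  50%      { border-width: 4px; }
}

.alert-critical {
  border: 2px solid currentColor;
  animation: critical-border-pulse 1s infinite;
}

/* Static thick border replaces the animation (§28.3 reduced-motion rules) */
@media (prefers-reduced-motion: reduce) {
  .alert-critical {
    animation: none;
    border-width: 4px;
  }
}
```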


13.5 Internationalisation Architecture (F5, F8, F11)

Language scope — Phase 1: English only. No other locale is served. This is not a gap — it is an explicit decision that allows Phase 1 to ship without a localisation workflow. The architecture is designed so that adding a new locale requires only adding a messages/{locale}.json file and testing; no component code changes.

String externalisation strategy:

  • Library: next-intl (native Next.js App Router support, RSC-compatible, type-safe message keys)
  • Source of truth: messages/en.json — all user-facing strings, namespaced by feature area
  • Message ID convention: {feature}.{component}.{element} e.g. alerts.critical.title, handover.accept.button
  • No bare string literals in JSX (enforced by eslint-plugin-i18n-json or equivalent)
  • ICAO-fixed strings are excluded from i18n scope and must never appear in messages/en.json — they are hardcoded constants. Examples: NOTAM, UTC, SIGMET, category codes (NOTAM_ISSUED), ICAO phraseology in NOTAM templates. These are annotated // ICAO-FIXED: do not translate in source
messages/
  en.json          # Source of truth — Phase 1 complete
  fr.json          # Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy)
  de.json          # Phase 3 scaffold

CSS logical properties (F8): All new components use CSS logical properties instead of directional utilities, making RTL support a configuration change rather than a code rewrite:

| Avoid | Use instead |
| --- | --- |
| margin-left, ml-* | margin-inline-start, ms-* |
| margin-right, mr-* | margin-inline-end, me-* |
| padding-left, pl-* | padding-inline-start, ps-* |
| padding-right, pr-* | padding-inline-end, pe-* |
| left: 0 | inset-inline-start: 0 |
| text-align: left | text-align: start |

The <html> element carries dir="ltr" (hardcoded for Phase 1). When an RTL locale is added, this becomes dir={locale.dir} — no component changes required. RTL testing with an Arabic locale is a Phase 3 gate before any Middle East deployment.

Altitude and distance unit display (F9): Aviation and space domain use different unit conventions. All altitudes and distances are stored and transmitted in metres (SI base unit) in the database and API. The display layer converts based on users.altitude_unit_preference:

| Role default | Unit | Display example |
| --- | --- | --- |
| ansp_operator | ft | 39,370 ft (FL394) |
| space_operator | km | 12.0 km |
| analyst | km | 12.0 km |

Rules:

  • Unit label always shown alongside the value — no bare numbers
  • aria-label provides full unit name: aria-label="39,370 feet (Flight Level 394)"
  • User can override their default in account settings via PATCH /api/v1/users/me
  • API always returns metres; unit conversion is client-side only
  • FL (Flight Level) shown in parentheses for ft display when altitude > 0 ft MSL and context is airspace
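A sketch of the client-side metres-to-feet conversion with the FL suffix (the helper name is an assumption; 1 ft = 0.3048 m exactly, and FL is altitude in hundreds of feet):

```typescript
// Convert API metres to an aviation feet display with Flight Level suffix.
function formatAltitudeFeet(metres: number): string {
  const feet = Math.round(metres / 0.3048);
  const flightLevel = Math.round(feet / 100);  // FL = hundreds of feet
  const display = new Intl.NumberFormat('en-US').format(feet);
  return `${display} ft (FL${flightLevel})`;
}
```

formatAltitudeFeet(12000) produces "39,370 ft (FL394)", matching the role-default table above.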

Altitude datum labelling (F11 — §62): The SGP4 propagator and NRLMSISE-00 output altitudes above the WGS-84 ellipsoid. Aviation altimetry uses altitude above Mean Sea Level (MSL). The geoid height (difference between ellipsoid and MSL) varies globally from approximately −106 m to +85 m (EGM2008). For operational altitudes (below ~25 km / 82,000 ft during re-entry terminal phase), this difference is significant.

Required labelling rule: All altitude displays must specify the datum. The datum is a non-configurable system constant per altitude context:

| Altitude context | Datum | Display example | Notes |
| --- | --- | --- | --- |
| Orbital altitude (> 80 km) | WGS-84 ellipsoid | 185 km (ellipsoidal) | SGP4 output; geoid difference negligible at orbital altitudes |
| Re-entry corridor boundary | WGS-84 ellipsoid | 80 km (ellipsoidal) | Model boundary altitude |
| Fragment impact altitude | WGS-84 ellipsoid | 0 km (ellipsoidal) → display as ground level | Converted at display time |
| Airspace sector boundary (FL) | Barometric, standard pressure setting (1013.25 hPa) | FL390 / 39,000 ft (STD) | Aviation standard; NOT ellipsoidal |
| Terrain clearance / NOTAM lower bound | MSL (approx. ellipsoidal for > 1,000 ft) | 5,000 ft MSL | Use MSL label explicitly |

Implementation: formatAltitude(metres, context) helper accepts a context parameter ('orbital' | 'airspace' | 'notam') and appends the appropriate datum label. The datum label is rendered in a smaller secondary font weight alongside the altitude value — not in aria-label alone.
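A sketch of the context-to-datum mapping that formatAltitude(metres, context) encodes (labels per the table above; unit selection per F9 and the secondary-font rendering are elided, and the helper body is an assumption):

```typescript
type AltitudeContext = 'orbital' | 'airspace' | 'notam';

// Datum label per altitude context — a non-configurable system constant.
const DATUM_LABEL: Record<AltitudeContext, string> = {
  orbital: 'ellipsoidal',  // WGS-84 ellipsoid (SGP4 / NRLMSISE-00 output)
  airspace: 'FL',          // barometric flight levels
  notam: 'MSL',            // mean sea level for NOTAM lower bounds
};

function formatAltitude(metres: number, context: AltitudeContext): string {
  const km = (metres / 1000).toFixed(0);  // unit rules from F9 elided here
  return `${km} km (${DATUM_LABEL[context]})`;
}
```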

API response datum field: The prediction API response must include altitude_datum: "WGS84_ELLIPSOIDAL" alongside any altitude value. Consumers must not assume a datum that is not stated.

Future locale addition checklist (documented in docs/ADDING_A_LOCALE.md):

  1. Add messages/{locale}.json translated by a native-speaker aviation professional
  2. Verify all ICAO-fixed strings are excluded from translation
  3. Set dir for the locale (ltr/rtl)
  4. Run automated RTL layout tests if dir=rtl
  5. Confirm operational time display still shows UTC (not locale timezone)
  6. Legal review of any jurisdiction-specific compliance text

13.6 Contribution Workflow (F3)

CONTRIBUTING.md at the repository root is a required document. It defines how contributors (internal engineers, auditors, future ESA-directed reviewers) engage with the codebase.

Branch naming convention:

| Branch type | Pattern | Example |
| --- | --- | --- |
| Feature | feature/{ticket-id}-short-description | feature/SC-142-decay-unit-pref |
| Bug fix | fix/{ticket-id}-short-description | fix/SC-200-hmac-null-check |
| Chore / dependency | chore/{description} | chore/bump-fastapi-0.115 |
| Release | release/{semver} | release/1.2.0 |
| Hotfix | hotfix/{semver} | hotfix/1.1.1 |

No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push.

Commit message format: Conventional Commits — type(scope): description. Types: feat, fix, chore, docs, refactor, test, ci. Example: feat(decay): add p01/p99 tail risk columns.

PR template (.github/pull_request_template.md):

## Summary
<!-- What does this PR do? -->

## Linked ticket
<!-- e.g. SC-142 -->

## Checklist
- [ ] `make test` passes locally
- [ ] OpenAPI spec regenerated (`make generate-openapi`) if API changed
- [ ] CHANGELOG.md updated under `[Unreleased]`
- [ ] axe-core accessibility check passes if UI changed
- [ ] Contract test passes if API response shape changed
- [ ] ADR created if an architectural decision was made

Review SLA: Pull requests must receive a first review within 1 business day of opening. Stale PRs (no activity > 3 business days) are labelled stale automatically.


13.7 Architecture Decision Records (F4)

ADRs (Nygard format) are the lightweight record for code-level and architectural decisions. They live in docs/adr/ and are numbered sequentially.

When to write an ADR: Any decision that is:

  • Hard to reverse (e.g., choosing a library, a DB schema approach, an algorithm)
  • Likely to confuse a future contributor who finds the code without context
  • Required by a public-sector procurement framework (ESA specifically requests evidence of a structured decision process)
  • Referenced in a specialist review appendix (§45–§54 all reference ADR numbers)

Format (docs/adr/NNNN-title.md):

# ADR NNNN: Title

**Status:** Proposed | Accepted | Deprecated | Superseded by ADR MMMM
**Date:** YYYY-MM-DD

## Context
What problem are we solving? What constraints apply?

## Decision
What did we decide?

## Consequences
What becomes easier? What becomes harder? What is now out of scope?

Known ADRs referenced in this plan:

| ADR | Topic |
| --- | --- |
| 0001 | FastAPI over Django REST Framework |
| 0002 | TimescaleDB + PostGIS for orbital time-series |
| 0003 | CesiumJS + Deck.gl for 3D globe rendering |
| 0004 | next-intl for string externalisation |
| 0005 | Append-only alert_events with HMAC signing |
| 0016 | NRLMSISE-00 vs JB2008 atmospheric density model |

All ADR numbers referenced in this document must have a corresponding docs/adr/NNNN-*.md file before Phase 2 ESA submission. New ADRs start at the next available number.


13.8 Developer Environment Setup (F6)

docs/DEVELOPMENT.md is a required onboarding document. A new engineer must be able to run a fully functional local environment within 30 minutes of reading it. The document covers:

  1. Prerequisites: Python 3.11 (pinned in .python-version), Node.js 20 LTS, Docker Desktop, make
  2. Environment bootstrap:
    cp .env.example .env          # review and fill required values
    make init-dirs                # creates logs/, exports/, config/, backups/ on host
    make dev-up                   # docker compose up -d postgres redis minio
    make migrate                  # alembic upgrade head
    make seed                     # load development fixture data (10 tracked objects, sample TIPs)
    make dev                      # starts: uvicorn + Next.js dev server + Celery worker
    
  3. Running tests:
    make test                     # full test suite (backend + frontend)
    make test-backend             # backend only (pytest)
    make test-frontend            # frontend only (jest + playwright)
    make test-e2e                 # Playwright end-to-end (requires make dev running)
    
  4. Useful local URLs:
    • API: http://localhost:8000 / Swagger UI: http://localhost:8000/docs
    • Frontend: http://localhost:3000
    • MinIO console: http://localhost:9001 (credentials in .env.example)
  5. Common issues: documented in a ## Troubleshooting section covering: Docker port conflicts, TimescaleDB first-run migration failure, CesiumJS ion token missing.

.env.example is committed and kept up-to-date with all required variables (no value — keys only). .env is in .gitignore and must never be committed.


13.9 Docs-as-Code Pipeline (F10)

All project documentation (this plan, runbooks, ADRs, OpenAPI spec, data provenance records) is version-controlled in the repository and validated by CI.

Documentation site: MkDocs Material. Source in docs/. Published to GitHub Pages on merge to main. Configuration in mkdocs.yml.

CI documentation checks (run on every PR):

  • mkdocs build --strict — fails on broken links, missing pages, invalid nav
  • markdown-link-check docs/ — external link validation (warns, does not fail, to avoid flaky CI on transient outages)
  • openapi-diff — spec drift check (see §14 F1)
  • vale --config=.vale.ini docs/ — prose style linter (SpaceCom style guide: no passive voice in runbooks, consistent terminology table for re-entry vs reentry)

ESA submission artefact: The MkDocs build output (static HTML) is archived as a CI artefact on each release tag. This provides a reproducible, point-in-time documentation snapshot for the ESA bid submission. The submission artefact is docs-site-{version}.zip stored in the GitHub release assets.

Docs owner: Each section of the documentation has an owner: frontmatter field. The owner is responsible for keeping the section current after their feature area changes. Missing or stale ownership is flagged by a quarterly docs-review GitHub issue auto-created by a cron workflow.
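One possible GitHub Actions job shape for the PR-time documentation checks listed above (step details are assumptions; the tool names are from this section):

```yaml
# .github/workflows/docs.yml — PR-time documentation checks (sketch)
docs-checks:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install mkdocs-material
    - run: mkdocs build --strict          # broken links, missing pages, invalid nav
    - run: vale --config=.vale.ini docs/  # prose style linter
    # external link validation warns only — transient outages must not fail CI
    - run: find docs -name '*.md' -print0 | xargs -0 -n1 npx markdown-link-check || true
```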


14. API Design

Base path: /api/v1. All endpoints require authentication (minimum viewer role) unless noted. Role requirements listed per group.

System (no auth required)

  • GET /health — liveness probe; returns 200 {"status": "ok", "version": "<semver>"} if the process is running. Used by Docker/Kubernetes liveness probe and load balancer health check. Does not check downstream dependencies — a healthy response means only that the API process is alive.
  • GET /readyz — readiness probe; returns 200 {"status": "ready", "checks": {...}} when all dependencies are reachable. Returns 503 if any required dependency is unhealthy. Checks performed: PostgreSQL (query SELECT 1), Redis (PING), Celery worker queue depth < 1000. Used by DR automation to confirm the new primary is accepting traffic before updating DNS (§26.3). Also included in OpenAPI spec under tags: ["System"].
// GET /readyz — healthy response example
{
  "status": "ready",
  "checks": {
    "postgres": "ok",
    "redis": "ok",
    "celery_queue_depth": 42
  },
  "version": "1.2.3"
}
// GET /readyz — unhealthy response (503)
{
  "status": "not_ready",
  "checks": {
    "postgres": "ok",
    "redis": "error: connection refused",
    "celery_queue_depth": 42
  }
}

Auth

  • POST /auth/token — login; returns httpOnly cookie (access) + httpOnly cookie (refresh); rate-limited 10/min/IP
  • POST /auth/token/refresh — rotate refresh token; rate-limited
  • POST /auth/mfa/verify — complete MFA; issues full-access token
  • POST /auth/logout — revoke refresh token; clear cookies

Catalog (viewer minimum)

  • GET /objects — list/search (paginated; filter by type, perigee, decay status, data_confidence)
  • GET /objects/{norad_id} — detail with TLE, physical properties, data confidence annotation
  • POST /objects — manual entry (operator role)
  • GET /objects/{norad_id}/tle-history — full TLE history including cross-validation status

Propagation (analyst role)

  • POST /propagate — submit catalog propagation job
  • GET /propagate/{task_id} — poll status
  • GET /objects/{norad_id}/ephemeris?start=&end=&step= — time range and step validation (Finding 7):

    | Parameter | Constraint | Error code |
    | --- | --- | --- |
    | start | ≥ TLE epoch − 7 days; ≤ now + 90 days | EPHEMERIS_START_OUT_OF_RANGE |
    | end | start < end ≤ start + 30 days | EPHEMERIS_END_OUT_OF_RANGE |
    | step | ≥ 10 seconds and ≤ 86,400 seconds | EPHEMERIS_STEP_OUT_OF_RANGE |
    | Computed points | (end − start) / step ≤ 100,000 | EPHEMERIS_TOO_MANY_POINTS |
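The constraints above can be expressed as a single validation pass; a sketch (the function name and signature are assumptions, not the codebase API):

```python
from datetime import datetime, timedelta

MAX_EPHEMERIS_POINTS = 100_000

def validate_ephemeris_request(start: datetime, end: datetime, step_s: int,
                               tle_epoch: datetime, now: datetime):
    """Return the error code for the first violated constraint, or None if valid."""
    if start < tle_epoch - timedelta(days=7) or start > now + timedelta(days=90):
        return "EPHEMERIS_START_OUT_OF_RANGE"
    if not (start < end <= start + timedelta(days=30)):
        return "EPHEMERIS_END_OUT_OF_RANGE"
    if not (10 <= step_s <= 86_400):
        return "EPHEMERIS_STEP_OUT_OF_RANGE"
    if (end - start).total_seconds() / step_s > MAX_EPHEMERIS_POINTS:
        return "EPHEMERIS_TOO_MANY_POINTS"
    return None
```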

Decay Prediction (analyst role)

  • POST /decay/predict — submit decay job; returns 202 Accepted (Finding 3). MC concurrency gate: per-organisation Redis semaphore limits to 1 concurrent MC run (Phase 1); 2 for analyst+ (Phase 2); 429 + Retry-After on limit; admin bypasses.

    Async job lifecycle (Finding 3):

    POST /decay/predict
    Idempotency-Key: <client-uuid>          ← optional; prevents duplicate on retry
    → 202 Accepted
    {
      "jobId": "uuid",
      "status": "queued",
      "statusUrl": "/jobs/uuid",
      "estimatedDurationSeconds": 45
    }
    
    GET /jobs/{job_id}
    → 200 OK
    {
      "jobId": "uuid",
      "status": "running" | "complete" | "failed" | "cancelled",
      "resultUrl": "/decay/predictions/12345",   // present when complete
      "error": null | {"code": "...", "message": "..."},
      "createdAt": "...",
      "completedAt": "...",
      "durationSeconds": 42
    }
    

    WebSocket PREDICTION_COMPLETE / PREDICTION_FAILED events are the primary completion signal. GET /jobs/{id} is the polling fallback (recommended interval: 5 seconds; do not poll faster). All Celery-backed POST endpoints (/reports, /space/reentry/plan, /propagate) follow the same lifecycle pattern.
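The client side of the polling fallback might look like this (the fetch wrapper and interval parameter are assumptions; the 5-second floor is from the text above):

```typescript
type Job = {
  jobId: string;
  status: 'queued' | 'running' | 'complete' | 'failed' | 'cancelled';
  resultUrl?: string;
  error?: { code: string; message: string } | null;
};

// Poll GET /jobs/{id} until a terminal state; used only when the WebSocket
// PREDICTION_COMPLETE / PREDICTION_FAILED event has not arrived.
async function pollJob(
  jobId: string,
  fetchJson: (url: string) => Promise<Job>,
  intervalMs = 5000,  // do not poll faster than 5 s
): Promise<Job> {
  for (;;) {
    const job = await fetchJson(`/api/v1/jobs/${jobId}`);
    if (job.status === 'complete') return job;
    if (job.status === 'failed' || job.status === 'cancelled') {
      throw new Error(job.error?.code ?? job.status);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```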

  • GET /jobs/{job_id} — poll job status (all job types); 404 if job does not belong to the requesting user's organisation

  • GET /decay/predictions?norad_id=&status= — list (cursor-paginated)

Re-entry (viewer role)

  • GET /reentry/predictions — list with HMAC status; filterable by FIR, time window, confidence, integrity_failed
  • GET /reentry/predictions/{id} — full detail; HMAC verified before serving; integrity_failed records return 503
  • GET /reentry/tip-messages?norad_id= — TIP messages

Space Weather (viewer role)

  • GET /spaceweather/current — F10.7, Kp, Ap, Dst + operational_status + uncertainty_multiplier + cross-validation delta
  • GET /spaceweather/history?start=&end= — history
  • GET /spaceweather/forecast — 3-day NOAA SWPC forecast

Conjunctions (viewer role)

  • GET /conjunctions — active events filterable by Pc threshold
  • GET /conjunctions/{id} — detail with covariance and probability
  • POST /conjunctions/screen — submit screening (analyst role)

Visualisation (viewer role)

  • GET /czml/objects — full CZML catalog (J2000 INERTIAL; all strings HTML-escaped); max payload policy: 5 MB. If estimated payload exceeds 5 MB, the endpoint returns HTTP 413 with {"error": "catalog_too_large", "use_delta": true}.
  • GET /czml/objects?since=<iso8601> — delta CZML: returns only objects whose position or metadata has changed since the given timestamp. Clients must use this after the initial full load. Response includes X-CZML-Full-Required: true header if the server cannot produce a valid delta (e.g. client timestamp > 30 minutes old) — client must re-fetch the full catalog. Delta responses are always ≤ 500 KB for the 100-object catalog.
  • GET /czml/hazard/{zone_id} — HMAC verified before serving
  • GET /czml/event/{event_id} — full event CZML
  • GET /viz/mc-trajectories/{prediction_id} — binary MC blob for Mode C
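The full-vs-delta decision on the client reduces to inspecting the last response; a sketch (function and parameter names are assumptions):

```typescript
type CzmlFetchPlan = 'full' | 'delta';

// Decide the next CZML fetch after a response, per the endpoint rules above:
// 413 + use_delta → switch to delta; X-CZML-Full-Required: true → refetch full;
// otherwise stay in steady-state delta polling.
function nextCzmlFetch(
  status: number,
  headers: Record<string, string>,
  body: { use_delta?: boolean } | null,
): CzmlFetchPlan {
  if (status === 413 && body?.use_delta) return 'delta';
  if (headers['x-czml-full-required'] === 'true') return 'full';
  return 'delta';
}
```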

Hazard (viewer role)

  • GET /hazard/zones — active zones; HMAC status included in response
  • GET /hazard/zones/{id} — detail; HMAC verified before serving; integrity_failed records return 503

Alerts (viewer read; operator acknowledge)

  • GET /alerts — alert history
  • POST /alerts/{id}/acknowledge — records user ID + timestamp + note in alert_events
  • GET /alerts/unread-count — unread critical/high count for badge

Reports (analyst role)

  • GET /reports — list (organisation-scoped via RLS)
  • POST /reports — initiate generation (async)
  • GET /reports/{id} — metadata + pre-signed 15-minute download URL
  • GET /reports/{id}/preview — HTML preview

Org Admin (org_admin role — scoped to own organisation) (F7, F9, F11)

  • GET /org/users — list users in own org
  • POST /org/users/invite — invite a new user (sends email; creates user with viewer role pending activation)
  • PATCH /org/users/{id}/role — assign role up to operator within own org; cannot assign org_admin or admin
  • DELETE /org/users/{id} — deactivate user (revokes sessions and API keys; triggers pseudonymisation for GDPR)
  • GET /org/api-keys — list all API keys in own org (including service account keys)
  • DELETE /org/api-keys/{id} — revoke any key in own org
  • GET /org/audit-log — paginated org-scoped audit log from security_logs and alert_events filtered by organisation_id; supports ?from=&to=&event_type=&user_id= (F9)
  • GET /org/usage — usage summary for current and previous billing period (predictions run, quota hits, API calls); sourced from usage_events table
  • PATCH /org/billing — update billing_contacts row (email, PO number, VAT number)
  • POST /org/export — trigger asynchronous org data export (F11); returns job ID; export includes all predictions, alert events, handover logs, and NOTAM drafts for the org; delivered as signed ZIP within 3 business days; used for GDPR portability and offboarding

Admin (admin role only)

  • GET /admin/ingest-status — last run time and status per source
  • GET /admin/worker-status — Celery queue depth and health
  • GET /admin/security-events — recent security_logs entries
  • POST /admin/users — create user
  • PATCH /admin/users/{id}/role — change role (logged as HIGH security event)
  • GET /admin/organisations — list all organisations with tier, status, usage summary
  • POST /admin/organisations — provision new organisation (onboarding gate — see §29.8)
  • PATCH /admin/organisations/{id} — update tier, status, subscription dates

Space Portal (space_operator or orbital_analyst role)

  • GET /space/objects — list owned objects (space_operator: scoped; orbital_analyst: full catalog)
  • GET /space/objects/{norad_id} — full technical detail with state vectors, covariance, TLE history
  • GET /space/objects/{norad_id}/ephemeris — raw GCRF state vectors; CCSDS OEM format available via Accept: application/ccsds-oem
  • POST /space/reentry/plan — submit controlled re-entry planning job; requires owned_objects.has_propulsion = TRUE
  • GET /space/reentry/plan/{task_id} — poll; returns ranked deorbit windows with risk scores and FIR avoidance status
  • POST /space/conjunction/screen — submit screening (orbital_analyst only)
  • GET /space/export/bulk — bulk ephemeris/prediction export (JSON, CSV, CCSDS)

NOTAM Drafting (operator role)

  • POST /notam/draft — generate draft NOTAM from prediction ID; returns ICAO-format draft text + mandatory disclaimer
  • GET /notam/drafts — list drafts for organisation
  • GET /notam/drafts/{id} — draft detail
  • POST /notam/drafts/{id}/cancel-draft — generate cancellation draft for a previous new-NOTAM draft

API Key Management (space_operator or orbital_analyst)

  • POST /api-keys — create new API key; raw key returned once and never stored
  • GET /api-keys — list active keys (hashed IDs only, never raw keys)
  • DELETE /api-keys/{id} — revoke key immediately
  • GET /api-keys/usage — per-key request counts and last-used timestamp
  • WS /ws/events — real-time stream; 5 concurrent connections per user enforced. Per-instance subscriber ceiling: 500 connections. New connections beyond this limit receive HTTP 503 at the WebSocket upgrade. A ws_connected_clients Prometheus gauge tracks current count per backend instance; alert fires at 400 (WARNING) to trigger horizontal scaling before the ceiling is reached. At Tier 2 (2 backend instances), the effective ceiling is 1,000 simultaneous WebSocket clients — documented as a known capacity limit in docs/runbooks/capacity-limits.md.

WebSocket event payload schema:

All events share an envelope:

{
  "type": "<event_type>",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "data": { ... }
}

| type | Trigger | data fields |
| --- | --- | --- |
| alert.new | New alert generated | alert_id, level, norad_id, object_name, fir_ids[] |
| alert.acknowledged | Alert acknowledged by any user in org | alert_id, acknowledged_by, note_preview |
| alert.superseded | Alert superseded by a new one | old_alert_id, new_alert_id |
| prediction.updated | New re-entry prediction for a tracked object | prediction_id, norad_id, p50_utc, supersedes_id |
| ingest.status | Ingest job completed or failed | source, status (ok/failed), record_count, next_run_at |
| spaceweather.change | Operational status band changes | old_status, new_status, kp, f107 |
| tip.new | New TIP message ingested | norad_id, object_name, tip_epoch, predicted_reentry_utc |

Reconnection and missed-event recovery: Each event carries a monotonically increasing seq number per organisation. On reconnect, the client sends ?since_seq=<last_seq> in the WebSocket upgrade URL. The server replays up to 200 missed events from an in-memory ring buffer (last 5 minutes). If the client has been disconnected > 5 minutes, it receives a {"type": "resync_required"} event and must re-fetch state via REST.
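The client-side reconnect logic might be sketched as (helper names are assumptions, not the codebase API):

```typescript
// Build the reconnect URL with the last seen sequence number, per the
// reconnection rules above. A null cursor means a fresh connection.
function buildEventsUrl(base: string, lastSeq: number | null): string {
  return lastSeq === null
    ? `${base}/ws/events`
    : `${base}/ws/events?since_seq=${lastSeq}`;
}

// Advance the seq cursor on each event. On resync_required the client must
// drop its cursor and re-fetch state via REST before resubscribing.
function handleEvent(ev: { type: string; seq?: number }, lastSeq: number | null): number | null {
  if (ev.type === 'resync_required') return null;
  return ev.seq ?? lastSeq;
}
```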

Per-org sequence number implementation (F5 — §67): The seq counter for each org must be assigned using a PostgreSQL SEQUENCE object, not MAX(seq)+1 in a trigger. MAX(seq)+1 under concurrent inserts for the same org produces duplicate sequence numbers:

-- Migration: create one sequence per org on org creation
-- (or use a single global sequence with per-org prefix — simpler)
CREATE SEQUENCE IF NOT EXISTS alert_seq_global
    START 1 INCREMENT 1 NO CYCLE;

-- In the alert_events INSERT trigger or application code:
-- NEW.seq := nextval('alert_seq_global');
-- This is globally unique and monotonically increasing; per-org ordering
-- is derived by filtering on org_id + ordering by seq.

Preferred approach: A single global alert_seq_global sequence assigned at INSERT time. Per-org ordering is maintained because seq is globally monotonic — any two events for the same org will have the correct relative ordering by seq. The WebSocket ring buffer lookup uses WHERE org_id = $1 AND seq > $2 ORDER BY seq which remains correct with a global sequence.

Acceptable shorthand: DEFAULT nextval('alert_seq_global') on the column — no org-scoped locking is needed. Sequences are lock-free and gap-tolerant, so concurrent inserts across orgs and within the same org all receive unique, strictly increasing values; clients must not assume seq values are contiguous. The pattern that must not be used is the MAX(seq)+1 trigger described above.

Application-level receipt acknowledgement (F2 — §63): delivered_websocket = TRUE in alert_events is set at send-time, not client-receipt time. For safety-critical CRITICAL and HIGH alerts, the client must send an explicit receipt acknowledgement within 10 seconds:

// Client → Server: after rendering a CRITICAL/HIGH alert.new event
{ "type": "alert.received", "alert_id": "<uuid>", "seq": <n> }

Server response:

{ "type": "alert.receipt_confirmed", "alert_id": "<uuid>", "seq": <n+1> }

If no alert.received arrives within 10 seconds of delivery, the server marks alert_events.ws_receipt_confirmed = FALSE and triggers the email fallback for that alert (same logic as offline delivery). This distinguishes "sent to socket" from "rendered on screen."

ALTER TABLE alert_events
  ADD COLUMN ws_receipt_confirmed BOOLEAN,
  ADD COLUMN ws_receipt_at TIMESTAMPTZ;
-- NULL = not yet sent; TRUE = client confirmed receipt; FALSE = sent but no receipt within 10s

Fan-out architecture across multiple backend instances (F3 — §63): With ≥2 backend instances (Tier 2), a WebSocket connection from org A may be on instance-1 while a new alert fires on instance-2. Without a cross-instance broadcast mechanism, org A's operator misses the alert.

Required: Redis Pub/Sub fan-out:

# backend/app/alerts/fanout.py
import json

import redis.asyncio as aioredis

ALERT_CHANNEL_PREFIX = "spacecom:alert:"

async def publish_alert(redis: aioredis.Redis, org_id: str, event: dict):
    """Publish alert event to Redis channel; all backend instances receive and forward to connected clients."""
    channel = f"{ALERT_CHANNEL_PREFIX}{org_id}"
    await redis.publish(channel, json.dumps(event))

async def subscribe_org_alerts(redis: aioredis.Redis, org_id: str):
    """Each backend instance subscribes to its connected orgs' channels on startup."""
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"{ALERT_CHANNEL_PREFIX}{org_id}")
    return pubsub

Each backend instance maintains a local registry of {org_id: [websocket_connections]}. On receiving a Redis Pub/Sub message, the instance forwards to all local connections for that org. This decouples alert generation (any instance) from delivery (per-instance local connections).
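The per-instance forwarding step described above might be sketched as (the registry shape and send method are assumptions):

```python
async def forward_pubsub_messages(pubsub, local_registry: dict, org_id: str):
    """Forward Redis Pub/Sub alert messages to this instance's local sockets for the org."""
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations and other control frames
        payload = message["data"]
        for ws in local_registry.get(org_id, []):
            await ws.send_text(payload)
```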

ADR: docs/adr/0020-websocket-fanout-redis-pubsub.md — documents this pattern and the decision against sticky sessions (which would break blue-green deploys).

Dead-connection ANSP fallback notification (F6 — §63): When the ping-pong mechanism detects a dead connection, the current behaviour is to close the socket. There is no notification to the ANSP that their live monitoring connection has silently dropped.

Required behaviour:

  1. On ping-pong timeout: close socket; record ws_disconnected_at in Redis session key for that connection
  2. If no reconnect within WS_DEAD_CONNECTION_GRACE_SECONDS (default: 120s): send email to the org's ANSP contact (organisations.primary_contact_email) with subject: "SpaceCom live connection dropped — please check your browser"
  3. If an active TIP event exists for the org's FIRs when the disconnection is detected: grace period is reduced to 30s and the email subject is: "URGENT: SpaceCom connection dropped during active re-entry event"
  4. On reconnect (before grace period expires): cancel the pending fallback email
# backend/app/alerts/ws_health.py
import redis.asyncio as aioredis

WS_DEAD_CONNECTION_GRACE_SECONDS = 120
WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP = 30

async def on_connection_closed(org_id: str, user_id: str, redis: aioredis.Redis):
    active_tip = await redis.get(f"spacecom:active_tip:{org_id}")
    grace = WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP if active_tip else WS_DEAD_CONNECTION_GRACE_SECONDS
    # Schedule fallback notification via Celery
    notify_ws_dead.apply_async(
        args=[org_id, user_id],
        countdown=grace,
        task_id=f"ws-dead-{org_id}-{user_id}"  # revocable if reconnect arrives
    )

async def on_reconnect(org_id: str, user_id: str):
    # Cancel pending dead-connection notification
    celery_app.control.revoke(f"ws-dead-{org_id}-{user_id}")

Per-org email alert rate limit (F7 — §65 FinOps):

Email alerts are triggered both by the alert delivery pipeline (when WebSocket delivery is unconfirmed) and by degraded-mode notifications. Without a rate limit, a flapping prediction window or ingest instability can generate hundreds of alert emails per hour to the same ANSP contact, exhausting the SMTP relay quota and creating alert fatigue.

Rate limit policy: Maximum 50 alert emails per org per hour. When the limit is reached, subsequent alerts within the window are queued and delivered as a digest email at the end of the hour.

# backend/app/alerts/email_delivery.py
import json
from datetime import datetime

import redis.asyncio as aioredis

EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50

async def send_alert_email(org_id: str, alert: dict, redis: aioredis.Redis):
    """Send an alert email, subject to the per-org rate limit; overflow goes to the hourly digest queue."""
    rate_key = f"spacecom:email_rate:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
    count = await redis.incr(rate_key)
    if count == 1:
        await redis.expire(rate_key, 3600)  # expire at end of hour window

    if count <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR:
        # Send immediately
        await _dispatch_email(org_id, alert)
    else:
        # Add to digest queue; Celery task drains it at hour boundary
        digest_key = f"spacecom:email_digest:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
        await redis.rpush(digest_key, json.dumps(alert))
        await redis.expire(digest_key, 7200)  # safety expire

@shared_task
def send_hourly_digest_emails():
    """Drain digest queues and send consolidated digest emails. Runs at HH:59."""
    # Find all digest keys matching current hour; send one digest per org
    ...
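The elided drain step can be sketched as pure grouping logic, with Redis access stubbed out; the key layout follows the rate-limit code above, while the digest body format and the function name are illustrative assumptions:

```python
import json

def build_digest_bodies(digest_entries: dict[str, list[str]]) -> dict[str, str]:
    """Group queued alert JSON blobs into one digest body per org.

    digest_entries maps a digest key ('spacecom:email_digest:<org_id>:<YYYYMMDDHH>',
    per the rate-limit code above) to the JSON-encoded alerts RPUSHed that hour.
    """
    bodies: dict[str, str] = {}
    for key, blobs in digest_entries.items():
        org_id = key.split(":")[2]  # third segment of the digest key
        alerts = [json.loads(b) for b in blobs]
        lines = [
            f"- [{a.get('severity', 'info')}] {a.get('title', 'alert')}"
            for a in alerts
        ]
        bodies[org_id] = (
            f"{len(alerts)} alert(s) were rate-limited this hour:\n" + "\n".join(lines)
        )
    return bodies
```

In the real task, the keys would be discovered with a Redis SCAN over the current hour's pattern and each list drained with LRANGE before sending one email per org.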

Contract expiry alerts (F7 — §68):

Without proactive expiry alerts, contracts expire silently. Add a Celery Beat task (tasks/commercial/contract_expiry_alerts.py) that runs daily at 07:00 UTC and checks contracts.valid_until:

@shared_task
def check_contract_expiry():
    """Alert commercial team of contracts expiring within 90/30/7 days."""
    thresholds = [
        (90, "90-day renewal notice"),
        (30, "30-day renewal notice — action required"),
        (7,  "URGENT: 7-day contract expiry warning"),
    ]
    for days, subject_prefix in thresholds:
        target_date = date.today() + timedelta(days=days)
        expiring = db.execute(text("""
            SELECT c.id, o.name, c.monthly_value_cents, c.currency,
                   c.valid_until, o.primary_contact_email
            FROM contracts c
            JOIN organisations o ON o.id = c.org_id
            WHERE DATE(c.valid_until) = :target_date
              AND c.contract_type NOT IN ('sandbox', 'internal')
              AND c.auto_renew = FALSE
        """), {"target_date": target_date}).fetchall()
        for contract in expiring:
            send_email(
                to="commercial@spacecom.io",
                subject=f"[SpaceCom] {subject_prefix}: {contract.name}",
                body=f"Contract for {contract.name} expires on {contract.valid_until.date()}. "
                     f"Monthly value: {contract.monthly_value_cents/100:.2f} {contract.currency}."
            )

Add to celery-redbeat at crontab(hour=7, minute=0). Also send a courtesy expiry notice to the org admin contact at the 30-day threshold so they can initiate their internal procurement process.

Celery schedule: Add send_hourly_digest_emails to celery-redbeat at crontab(minute=59).

Cost rationale: SMTP relay services (SES, Mailgun) charge per email. At 50/hour cap and 10 orgs, maximum 500 emails/hour = 12,000/day. At $0.10/1,000 (SES) = $1.20/day ≈ $37/month at sustained maximum. Without rate limiting during a flapping event, a single incident could generate thousands of emails in minutes.

Per-client back-pressure and send queue circuit breaker (F7 — §63): A slow client whose network buffers are full will cause await websocket.send_json(event) to block in the FastAPI handler. Without a per-client queue depth check, a single slow client can block the fan-out loop for all clients.

# backend/app/alerts/ws_manager.py
WS_SEND_QUEUE_MAX = 50  # events; beyond this, circuit-breaker triggers

class ConnectionManager:
    def __init__(self):
        self._connections: dict[str, list[WebSocket]] = {}
        self._send_queues: dict[WebSocket, asyncio.Queue] = {}

    async def broadcast_to_org(self, org_id: str, event: dict):
        for ws in self._connections.get(org_id, []):
            queue = self._send_queues[ws]
            if queue.qsize() >= WS_SEND_QUEUE_MAX:
                # Circuit breaker: drop this connection; client will reconnect and replay
                spacecom_ws_send_queue_overflow_total.labels(org_id=org_id).inc()
                await ws.close(code=4003, reason="Send queue overflow — reconnect to resume")
            else:
                await queue.put(event)

    async def _send_worker(self, ws: WebSocket):
        """Dedicated coroutine per connection — decouples send from broadcast loop."""
        queue = self._send_queues[ws]
        while True:
            event = await queue.get()
            try:
                await ws.send_json(event)
            except Exception:
                break  # connection closed; worker exits

Prometheus counter: spacecom_ws_send_queue_overflow_total{org_id} — any non-zero value warrants investigation.

Missed-alert display for offline clients (F8 — §63): When a client reconnects after receiving resync_required, it calls the REST API to re-fetch current state. The notification centre must explicitly surface alerts that arrived during the offline period:

GET /api/v1/alerts?since=<last_seen_ts>&include_offline=true — returns all unacknowledged alerts since last_seen_ts, annotated with "received_while_offline": true. The notification centre renders these with a distinct visual treatment: amber border + "Received while you were offline" label. The client stores last_seen_ts in localStorage (updated on each WebSocket message); this survives page reload but not localStorage clear.

WebSocket connection metadata — per-org operational visibility (F10 — §63):

New Prometheus metrics:

ws_org_connected = Gauge(
    'spacecom_ws_org_connected',
    'Whether at least one WebSocket connection is active for this org',
    ['org_id', 'org_name']
)
ws_org_connections = Gauge(
    'spacecom_ws_org_connection_count',
    'Number of active WebSocket connections for this org',
    ['org_id']
)

Updated when connections open/close. Alert rule:

- alert: ANSPNoLiveConnectionDuringTIPEvent
  expr: |
    spacecom_active_tip_events > 0
    and on(org_id) spacecom_ws_org_connected == 0
  for: 5m
  severity: warning
  annotations:
    summary: "ANSP {{ $labels.org_name }} has no live WebSocket connection during active TIP event"
    runbook_url: "https://spacecom.internal/docs/runbooks/ansp-connection-lost.md"

On-call dashboard panel 9 (below the fold): "ANSP Connection Status" — table of org names, connection count, last-connected timestamp, TIP-event indicator. Rows with connected = 0 and active TIP highlighted in amber.

Protocol version negotiation (Finding 8): Client connects with ?protocol_version=1. The server's first message is always:

{"type": "CONNECTED", "protocolVersion": 1, "serverVersion": "2.1.3", "seq": 0}

When a breaking event schema change ships, both versions are supported in parallel for 6 months. Clients on a deprecated version receive:

{"type": "PROTOCOL_DEPRECATION_WARNING", "currentVersion": 1, "sunsetDate": "2026-12-01",
 "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"}

After sunset, old-version connections are closed with code 4002 ("Protocol version deprecated"). Protocol version history is maintained in docs/api-guide/websocket-protocol.md.

Token refresh during long-lived sessions (Finding 4): Access tokens expire in 15 minutes. The server sends a TOKEN_EXPIRY_WARNING event 2 minutes before expiry:

{"type": "TOKEN_EXPIRY_WARNING", "expiresInSeconds": 120, "seq": N}

The client calls POST /auth/token/refresh (standard REST — does not interrupt the WebSocket), then sends on the existing connection:

{"type": "AUTH_REFRESH", "token": "<new_access_token>"}

Server responds: {"type": "AUTH_REFRESHED", "seq": N}. If the client does not refresh before expiry, the server closes with code 4001 ("Token expired — reconnect with a new token"). Clients distinguish 4001 (auth expiry, refresh and reconnect) from 4002 (protocol deprecated, upgrade required) from network errors (reconnect with backoff).

Mode awareness: In SIMULATION or REPLAY mode, the client's WebSocket connection remains open but alert.new and tip.new events are suppressed for the duration of the mode session. Simulation-generated events are delivered on a separate WS /ws/simulation/{session_id} channel.

Alert Webhooks (admin role — registration; delivery to registered HTTPS endpoints)

For ANSPs with programmatic dispatch systems that cannot consume a browser WebSocket.

  • POST /webhooks — register a webhook endpoint; {"url": "https://ansp.example.com/hook", "events": ["alert.new", "tip.new"], "secret": "<shared_secret>"}
  • GET /webhooks — list registered webhooks for the organisation
  • DELETE /webhooks/{id} — deregister
  • POST /webhooks/{id}/test — send a synthetic alert.new event to verify delivery

Delivery semantics: At-least-once. SpaceCom POSTs the event envelope to the registered URL. Signature: X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, body)> header on every delivery. Retry policy: 3 retries with exponential backoff (1s, 5s, 30s). After 3 failures, the webhook is marked degraded and the org admin is notified by email. After 10 consecutive failures, the webhook is auto-disabled.
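The signature scheme can be produced and verified with the standard library alone; a sketch, where the helper names are illustrative and the header value format follows the specification above:

```python
import hashlib
import hmac

def sign_webhook_body(secret: str, body: bytes) -> str:
    """Produce the X-SpaceCom-Signature header value for a delivery."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify_webhook_signature(secret: str, body: bytes, header: str) -> bool:
    """Receiver-side check. compare_digest gives a constant-time comparison;
    never compare signature strings with ==."""
    expected = sign_webhook_body(secret, body)
    return hmac.compare_digest(expected, header)
```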

alert_webhooks table:

CREATE TABLE alert_webhooks (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  url TEXT NOT NULL,
  secret_hash TEXT NOT NULL,        -- bcrypt hash of the shared secret; never stored in plaintext
  event_types TEXT[] NOT NULL,
  status TEXT NOT NULL DEFAULT 'active',  -- active | degraded | disabled
  failure_count INTEGER DEFAULT 0,
  last_delivery_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Structured Event Export (viewer minimum)

First step toward SWIM / machine-readable ANSP system integration (Phase 3 target).

  • GET /events/{id}/export?format=geojson — returns the event's re-entry corridor and impact zone as a GeoJSON FeatureCollection with ICAO FIR IDs and prediction metadata in properties
  • GET /events/{id}/export?format=czml — CZML event package (same as GET /czml/event/{event_id})
  • GET /events/{id}/export?format=ccsds-oem — raw OEM for the object's trajectory at time of prediction

The GeoJSON export is the preferred integration surface for ANSP systems that are not SWIM-capable. The properties object includes: norad_id, object_name, p05_utc, p50_utc, p95_utc, affected_fir_ids[], risk_level, prediction_id, prediction_hmac (for downstream integrity verification), generated_at.

API Conventions (Finding 9)

Field naming: All API request and response bodies use camelCase. Database column names and Python internal models use snake_case. The conversion is handled automatically by a shared base model:

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class APIModel(BaseModel):
    """Base class for all API response/request models. Serialises to camelCase JSON."""
    model_config = ConfigDict(
        alias_generator=to_camel,
        populate_by_name=True,   # allows snake_case in tests and internal code
    )

class PredictionResponse(APIModel):
    prediction_id: int           # → "predictionId" in JSON
    p50_reentry_time: datetime   # → "p50ReentryTime"
    ood_flag: bool               # → "oodFlag"

All Pydantic response models inherit from APIModel. All request bodies also inherit from APIModel (with populate_by_name=True, clients may send either case). Document in docs/api-guide/conventions.md.
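For clarity, the mapping `to_camel` performs can be illustrated without Pydantic; this is a minimal stand-in for simple identifiers, and `pydantic.alias_generators.to_camel` remains the authoritative implementation:

```python
def to_camel(snake: str) -> str:
    """snake_case to camelCase: keep the first segment, capitalise the rest."""
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)
```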

Error Response Schema (Finding 2)

All error responses use the SpaceComError envelope — including FastAPI's default Pydantic validation errors (which are overridden):

class SpaceComError(BaseModel):
    error: str        # machine-readable code from the error registry
    message: str      # human-readable; safe to display in UI
    detail: dict | None = None
    requestId: str    # from X-Request-ID header; enables log correlation

@app.exception_handler(RequestValidationError)
async def validation_error_handler(request, exc):
    return JSONResponse(status_code=422, content=SpaceComError(
        error="VALIDATION_ERROR",
        message="Request validation failed",
        detail={"fields": exc.errors()},
        requestId=request.headers.get("X-Request-ID", ""),
    ).model_dump(by_alias=True))

Canonical error code registry — all codes, HTTP status, and recovery actions documented in docs/api-guide/error-reference.md. CI check: any HTTPException raised in application code must use a code from the registry. Sample entries:

| Code | HTTP status | Meaning | Recovery |
|---|---|---|---|
| VALIDATION_ERROR | 422 | Request body or query param invalid | Fix the indicated fields |
| INVALID_CURSOR | 400 | Pagination cursor malformed or expired | Restart from page 1 |
| RATE_LIMITED | 429 | Rate limit exceeded | Wait retryAfterSeconds |
| EPHEMERIS_TOO_MANY_POINTS | 400 | Computed points exceed 100,000 | Reduce range or increase step |
| IDEMPOTENCY_IN_PROGRESS | 409 | Duplicate request still processing | Wait and retry statusUrl |
| HMAC_VERIFICATION_FAILED | 503 | Prediction integrity check failed | Contact administrator |
| API_KEY_INVALID | 401 | API key revoked, expired, or invalid | Re-issue key |
| PREDICTION_CONFLICT | 200 (not error) | Multi-source window disagreement | See conflictSources field |

Rate Limit Error Response (Finding 6)

429 Too Many Requests responses include Retry-After (RFC 7231 §7.1.3) and a structured body:

HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742134847

{
  "error": "RATE_LIMITED",
  "message": "Rate limit exceeded for POST /decay/predict: 10 requests per hour",
  "retryAfterSeconds": 47,
  "limit": 10,
  "window": "1h",
  "requestId": "..."
}

retryAfterSeconds = X-RateLimit-Reset - now(). Clients implementing backoff must honour Retry-After and must not retry before it elapses.

Idempotency Keys (Finding 5)

Mutation endpoints that have real-world consequences support idempotency keys:

POST /decay/predict
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

Server behaviour:

  • First receipt: process normally; store (key, user_id, endpoint, response_body) in idempotency_keys table with 24-hour TTL
  • Duplicate within 24h: return stored response with HTTP 200 + header Idempotency-Replay: true; do not re-execute
  • Still processing: return 409 Conflict with body {"error": "IDEMPOTENCY_IN_PROGRESS", "statusUrl": "/jobs/uuid"}
  • After 24h: key expired; treat as new request

Applies to: POST /decay/predict, POST /reports, POST /notam/draft, POST /alerts/{id}/acknowledge, POST /admin/users. Documented in docs/api-guide/idempotency.md.

API Key Authentication Model (Finding 11)

API key requests use key-only auth — no JWT required:

Authorization: Bearer apikey_<base64url_encoded_key>

The prefix apikey_ distinguishes API keys from JWT Bearer tokens at the middleware layer. The raw key is hashed with SHA-256 before storage; the raw key is shown exactly once at creation.

Rules:

  • API key rate limits are independent from JWT session rate limits — separate Redis buckets per key
  • Webhook deliveries are not counted against any rate limit bucket (server-initiated, not client-initiated)
  • allowed_endpoints scope: null = all endpoints for the key's role; a non-null array restricts to listed paths. 403 returned for requests to unlisted endpoints with {"error": "ENDPOINT_NOT_IN_KEY_SCOPE"}
  • Revoked/expired/invalid key: always 401 with body {"error": "API_KEY_INVALID", "message": "API key is revoked or expired"} — indistinguishable from never-valid (prevents enumeration)

Document in docs/api-guide/api-keys.md.
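Key issuance and lookup under these rules can be sketched as follows; the apikey_ prefix and SHA-256 hashing are as specified, while the helper names and 24-byte entropy choice are assumptions:

```python
import base64
import hashlib
import secrets

def issue_api_key() -> tuple[str, str]:
    """Return (raw_key, sha256_hash). The raw key is shown exactly once at
    creation; only the hash is persisted."""
    token = base64.urlsafe_b64encode(secrets.token_bytes(24)).decode().rstrip("=")
    raw = "apikey_" + token
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def lookup_hash(presented_key: str) -> str:
    """Hash the presented bearer credential for a DB lookup by key hash."""
    return hashlib.sha256(presented_key.encode()).hexdigest()
```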

System Endpoints (Finding 10)

GET /readyz is included in the OpenAPI spec as a documented endpoint (tagged System), so integrators and SWIM consumers can discover and monitor it:

@app.get(
    "/readyz",
    tags=["System"],
    summary="Readiness and degraded-state check",
    response_model=ReadinessResponse,
    responses={
        200: {"description": "System operational"},
        207: {"description": "System degraded — one or more data sources stale"},
        503: {"description": "System unavailable — database or Redis unreachable"},
    }
)

GET /healthz (liveness probe) remains undocumented in OpenAPI — infrastructure-only. /readyz is the recommended integration health check endpoint for ANSP monitoring systems and the Phase 3 SWIM integration.

Clock skew detection and server time endpoint (F6 — §67):

CZML availability timestamps and prediction windows are generated using server UTC. If the server clock drifts (NTP sync failure after container restart, hypervisor clock skew, or VM migration), CZML ground track windows will be offset from real time. A client whose clock differs from the server clock by > 5 seconds will show predictions in the wrong temporal position.

Infrastructure requirement: All SpaceCom hosts must run chronyd or systemd-timesyncd with NTP synchronisation to a reliable source. Add to the deployment runbook (docs/runbooks/host-setup.md):

# Ubuntu/Debian
timedatectl set-ntp true
timedatectl status  # confirm NTPSynchronized: yes

Add Grafana alert: node_timex_sync_status != 1 → WARNING: "NTP sync lost on {{ $labels.instance }}".

Client-side clock skew display: Add GET /api/v1/time endpoint (unauthenticated, rate-limited to 1 req/s per IP):

@router.get("/api/v1/time")
async def server_time():
    return {"utc": datetime.utcnow().isoformat() + "Z", "unix": time.time()}

The frontend calls this on page load and computes skew_seconds = server_unix - Date.now()/1000. If abs(skew_seconds) > 5: display a persistent WARNING banner: "Your browser clock differs from the server by {N}s — prediction windows may appear offset. Please synchronise your system clock."

Pagination Standard

All list endpoints use cursor-based pagination (not offset-based). Offset pagination degrades as OFFSET N forces the DB to scan and discard N rows; at 7-year retention depth this becomes a full table scan.

Canonical response envelope — applied to every list endpoint (Finding 1):

{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xNlQxNDozMDowMFoiLCJpZCI6NDQ4Nzh9",
    "has_more": true,
    "limit": 50,
    "total_count": null
  }
}

Rules:

  • data (not items) is the canonical array key across all list endpoints
  • next_cursor is base64url(json({"created_at": "<iso8601>", "id": <int>})) — opaque to clients, decoded server-side
  • total_count is always null — count queries on large tables force full scans; document this explicitly in docs/api-guide/pagination.md
  • limit defaults to 50; maximum 200; specified per endpoint group in OpenAPI description
  • Empty result: {"data": [], "pagination": {"next_cursor": null, "has_more": false, "limit": 50, "total_count": null}} — never 404
  • Invalid/expired cursor: 400 Bad Request with body {"error": "INVALID_CURSOR", "message": "Cursor is malformed or refers to a deleted record", "request_id": "..."}

Standard query parameters:

  • limit — page size (default: 50, maximum: 200)
  • cursor — opaque cursor token from a previous response (absent = first page)

Cursor decodes server-side to WHERE (created_at, id) < (cursor_ts, cursor_id) ORDER BY created_at DESC, id DESC. Tokens are valid for 24 hours.
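The cursor round-trip can be sketched directly from that definition (helper names are illustrative; the padding handling is an implementation detail assumed here):

```python
import base64
import json

def encode_cursor(created_at_iso: str, row_id: int) -> str:
    """base64url(json({"created_at": ..., "id": ...})), padding stripped."""
    payload = json.dumps(
        {"created_at": created_at_iso, "id": row_id}, separators=(",", ":")
    )
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Server-side decode; feeds the (created_at, id) keyset WHERE clause."""
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    obj = json.loads(base64.urlsafe_b64decode(padded))
    return obj["created_at"], obj["id"]
```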

Implementation:

from typing import Generic, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class PaginationMeta(BaseModel):
    next_cursor: str | None
    has_more: bool
    limit: int
    total_count: None = None  # always None; never compute count

class PaginatedResponse(BaseModel, Generic[T]):
    data: list[T]
    pagination: PaginationMeta

def paginate_query(q, cursor: str | None, limit: int) -> PaginatedResponse:
    """Shared utility used by all list endpoints — enforces envelope consistency."""
    ...

Enforcement: An OpenAPI CI check confirms every endpoint tagged list has limit and cursor query parameters and returns the PaginatedResponse schema. Violations fail CI.

Affected endpoints (all paginated): /objects, /decay/predictions, /reentry/predictions, /alerts, /conjunctions, /reports, /notam/drafts, /space/objects, /api-keys/usage, /admin/security-events.


API Latency Budget — CZML Catalog Endpoint

The CZML catalog endpoint (GET /czml/objects) is the most latency-sensitive read path and the primary SLO driver (p95 < 2s). Latency budget allocation:

| Component | Budget | Notes |
|---|---|---|
| DNS + TLS handshake (new connection) | 50 ms | Not applicable on keep-alive; amortised to ~0 for repeat requests |
| Caddy proxy overhead | 5 ms | Header processing only |
| FastAPI routing + middleware (auth, RBAC, rate limit) | 30 ms | Each middleware ~5–10 ms; keep middleware count ≤ 5 on this path |
| PgBouncer connection acquisition | 10 ms | Pool saturation adds latency; monitor pgbouncer_pool_waiting metric |
| DB query execution (PostGIS geometry) | 800 ms | Includes GiST index scan + geometry serialisation |
| CZML serialisation (Pydantic → JSON) | 200 ms | Validated by benchmark; exceeding this indicates schema complexity regression |
| HTTP response transmission (5 MB @ 1 Gbps internal) | 40 ms | Internal network; negligible |
| Total budget (new connection) | ~1,135 ms | ~865 ms headroom to 2s p95 SLO |

Any new middleware added to the CZML endpoint path must be profiled and must not exceed its allocated budget. Exceeding the DB or serialisation budget requires a performance investigation before merge.


API Versioning Policy

Base path: /api/v1. All versioned endpoints follow Semantic Versioning applied to the API contract:

  • Non-breaking changes (additive: new optional fields, new endpoints, new query params): deployed without version bump; announced in CHANGELOG.md
  • Breaking changes (removed fields, changed types, changed auth requirements, removed endpoints): require a new major version (/api/v2); old version supported in parallel for a minimum of 6 months before sunset
  • Deprecation signalling: Deprecated endpoints return Deprecation: true and Sunset: <date> response headers (RFC 8594)
  • Version negotiation: Clients may send Accept: application/vnd.spacecom.v1+json to pin to a specific version; default is always the latest stable version
  • Breaking change notice: Minimum 3 months written notice (email to registered API key holders + CHANGELOG.md entry) before any breaking change is deployed

Changelog discipline (F5): CHANGELOG.md follows the Keep a Changelog format with Conventional Commits as the commit-level input. Every PR must add an entry under [Unreleased] if it has a user-visible effect. On release, [Unreleased] is renamed to [{semver}] - {date}.

## [Unreleased]
### Added
- `p01_reentry_time` and `p99_reentry_time` fields on decay prediction response (SC-188)
### Changed
- `altitude_unit_preference` default for ANSP operators changed from `m` to `ft` (SC-201)
### Fixed
- HMAC integrity check now correctly handles NULL `action_taken` field (SC-195)
### Deprecated
- `GET /objects/{id}/trajectory` — use `GET /objects/{id}/ephemeris` (sunset 2027-06-01)
  • make changelog-check (CI step) fails if [Unreleased] section is empty and the diff contains non-chore/docs commits
  • Release changelogs are the source for API key holder email notifications and GitHub release notes

OpenAPI spec as source of truth (F1): FastAPI generates the OpenAPI 3.1 spec automatically from route decorators, Pydantic schemas, and docstrings. The spec is the authoritative contract — not a separately maintained document. CI enforces this:

  • GET /api/v1/openapi.json is served by the running API; CI downloads it and diffs against the committed openapi.yaml
  • Any uncommitted drift fails the build with openapi-diff --fail-on-incompatible
  • The committed openapi.yaml is regenerated by running make generate-openapi (calls python -m app.generate_spec) — this is a required step in the PR checklist for any API change
  • The spec is the input to all downstream tooling: Swagger UI (/docs), Redoc (/redoc), contract tests, and the client SDK generator

API date/time contract (F10): All date/time fields in API responses must use ISO 8601 with UTC offset — never Unix timestamps, never local time strings:

  • Format: "2026-03-22T14:00:00Z" (UTC, Z suffix)
  • OpenAPI annotation: format: date-time on every _at-suffixed and _time-suffixed field
  • Contract test (BLOCKING): every field matching /_at$|_time$/ in every response schema asserts it matches ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$
  • Pydantic models use datetime with model_config = {"json_encoders": {datetime: lambda v: v.isoformat().replace("+00:00", "Z")}}
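The serialisation rule and the blocking regex can be exercised together (a sketch; the function names are illustrative):

```python
import re
from datetime import datetime, timezone

API_DATETIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")

def to_api_datetime(dt: datetime) -> str:
    """Serialise an aware UTC datetime to the contract format (Z suffix)."""
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

def is_contract_compliant(value: str) -> bool:
    """The assertion applied to every _at / _time field in response schemas."""
    return API_DATETIME_RE.fullmatch(value) is not None
```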

Frontend ↔ API contract testing (F4): The TypeScript types used by the Next.js frontend must be validated against the OpenAPI spec on every CI run — preventing the common drift where the Pydantic response model changes but the frontend interface is not updated until a runtime error surfaces.

Implementation: openapi-typescript generates TypeScript types from openapi.yaml into frontend/src/types/api.generated.ts. The frontend imports only from this generated file — no hand-written API response interfaces. A CI check (make check-api-types) regenerates the types and fails if the git diff is non-empty:

# CI step: check-api-types
openapi-typescript openapi.yaml -o frontend/src/types/api.generated.ts
git diff --exit-code frontend/src/types/api.generated.ts \
  || (echo "API types out of sync — run: make generate-api-types" && exit 1)

This is a one-way contract: the spec is authoritative; the TypeScript types are derived. Any API change that affects the frontend must regenerate types before the PR can merge. This replaces the need for a separate consumer-driven contract test framework (Pact) at Phase 1 scale.

OpenAPI response examples (F7): Every endpoint schema in the OpenAPI spec must include at least one examples: block demonstrating a realistic success response. This is enforced by a CI lint step (spectral lint openapi.yaml --ruleset .spectral.yaml) with a custom rule require-response-example. Missing examples fail the build. The examples serve three purposes: Swagger UI and Redoc interactive documentation, contract test fixture baseline, and ESA auditor review readability.

# Example: openapi.yaml fragment for GET /objects/{norad_id}
responses:
  '200':
    content:
      application/json:
        schema:
          $ref: '#/components/schemas/ObjectDetail'
        examples:
          debris_object:
            summary: Tracked debris fragment in decay
            value:
              norad_id: 48274
              name: "CZ-3B DEB"
              object_type: "DEBRIS"
              perigee_km: 187.4
              apogee_km: 312.1
              data_confidence: "nominal"
              propagation_quality: "degraded"
              propagation_warning: "tle_age_7_14_days"

Client SDK strategy (F8): Phase 1 — no dedicated SDK. ANSP integrators are provided:

  1. The committed openapi.yaml for import into Postman, Insomnia, or any OpenAPI-compatible tooling
  2. A docs/integration/ directory with language-specific quickstart guides (Python, JavaScript/TypeScript) showing auth, object fetch, and WebSocket subscription patterns
  3. Python integration examples using httpx (async) and requests (sync) — not a packaged SDK

Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate one using openapi-generator-cli targeting Python and TypeScript. Generated clients are published under the @spacecom/ npm scope and spacecom-client PyPI package. The generator configuration is committed to tools/sdk-generator/ so regeneration is reproducible from the spec.


15. Propagation Architecture — Technical Detail

15.1 Catalog Propagator (SGP4)

from sgp4.api import Satrec, jday
from app.frame_utils import teme_to_gcrf, gcrf_to_itrf, itrf_to_geodetic

def propagate_catalog(tle_line1: str, tle_line2: str, times_utc: list[datetime]) -> list[OrbitalState]:
    sat = Satrec.twoline2rv(tle_line1, tle_line2)
    results = []
    for t in times_utc:
        jd, fr = jday(t.year, t.month, t.day, t.hour, t.minute, t.second + t.microsecond/1e6)
        e, r_teme, v_teme = sat.sgp4(jd, fr)
        if e != 0:
            raise PropagationError(f"SGP4 error code {e}")
        r_gcrf, v_gcrf = teme_to_gcrf(r_teme, v_teme, t)
        lat, lon, alt = itrf_to_geodetic(gcrf_to_itrf(r_gcrf, t))
        results.append(OrbitalState(
            time=t, reference_frame='GCRF',
            pos_x_km=r_gcrf[0], pos_y_km=r_gcrf[1], pos_z_km=r_gcrf[2],
            vel_x_kms=v_gcrf[0], vel_y_kms=v_gcrf[1], vel_z_kms=v_gcrf[2],
            lat_deg=lat, lon_deg=lon, alt_km=alt, propagator='sgp4'
        ))
    return results

Scope limitation: SGP4 accurate to ~1 km for perigee > 300 km and epoch age < 7 days. Do not use for decay prediction.

SGP4 validity gates — enforced at query time (Finding 1):

| Condition | Action | UI signal |
|---|---|---|
| tle_epoch_age ≤ 7 days | Normal propagation | propagation_quality: 'nominal' |
| 7 days < tle_epoch_age ≤ 14 days | Propagate with warning | propagation_quality: 'degraded'; amber DataConfidenceBadge; API includes propagation_warning: 'tle_age_7_14_days' |
| tle_epoch_age > 14 days | Return estimate with explicit caveat | propagation_quality: 'unreliable'; object position not rendered on globe without user acknowledgement; API returns propagation_warning: 'tle_age_exceeds_14_days' |
| perigee_altitude < 200 km | Do not use SGP4 | Route all propagation requests to the numerical decay predictor; SGP4 is invalid in this density regime |

The epoch age check runs at the start of propagate_catalog(). The perigee altitude gate is enforced during TLE ingest — objects crossing below 200 km perigee are automatically flagged for decay prediction and removed from SGP4 catalog propagation tasks.
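The gates reduce to a small pure function (a sketch; the return shape is illustrative, the thresholds and field values are the ones specified above):

```python
def propagation_quality(tle_epoch_age_days: float, perigee_km: float) -> dict:
    """Apply the SGP4 validity gates at query time."""
    if perigee_km < 200:
        # SGP4 invalid in this density regime: route to the decay predictor
        return {"route": "numerical_decay_predictor"}
    if tle_epoch_age_days <= 7:
        return {"route": "sgp4", "propagation_quality": "nominal"}
    if tle_epoch_age_days <= 14:
        return {"route": "sgp4", "propagation_quality": "degraded",
                "propagation_warning": "tle_age_7_14_days"}
    return {"route": "sgp4", "propagation_quality": "unreliable",
            "propagation_warning": "tle_age_exceeds_14_days"}
```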

Sub-150 km propagation confidence guard (F2): For the numerical decay predictor, objects with current perigee < 150 km are in a regime where atmospheric density model uncertainty dominates and SGP4/numerical model errors grow rapidly. Predictions in this regime are flagged:

if perigee_km < 150:
    prediction.propagation_confidence = 'LOW_CONFIDENCE_PROPAGATION'
    prediction.propagation_confidence_reason = (
        f'Perigee {perigee_km:.0f} km below 150 km; '
        'atmospheric density uncertainty dominant; re-entry imminent'
    )

LOW_CONFIDENCE_PROPAGATION is surfaced in the UI as a red badge: "⚠ Re-entry imminent — prediction confidence low; consult Space-Track TIP directly." Unit test (BLOCKING): construct a TLE with perigee = 120 km; call the decay predictor; assert propagation_confidence == 'LOW_CONFIDENCE_PROPAGATION'.

15.2 Decay Predictor (Numerical)

Physics: J2–J6 geopotential, NRLMSISE-00 drag, solar radiation pressure (cannonball model), WGS84 oblate Earth.

NRLMSISE-00 Input Vector (Finding 2)

NRLMSISE-00 requires a fully specified input vector. Using a single F10.7 value for both the 81-day average and the prior-day slot, or using Kp instead of Ap, introduces systematic density errors that are worst during geomagnetic storms — exactly when prediction uncertainty matters most.

# Required NRLMSISE-00 inputs — both stored in space_weather table
nrlmsise_input = NRLMSISEInput(
    f107A = f107_81day_avg,       # 81-day centred average F10.7 (NOT current)
    f107  = f107_prior_day,        # prior-day F10.7 value (NOT current day)
    ap    = ap_daily,              # daily Ap index (linear) — NOT Kp (logarithmic)
    ap_a  = ap_3h_history_57h,    # 19-element array of 3-hourly Ap for prior 57h
                                   # enables full NRLMSISE accuracy (flags.switches[9]=1)
)

The space_weather table already stores f107_81day_avg and ap_daily. Add f107_prior_day DOUBLE PRECISION and ap_3h_history DOUBLE PRECISION[19] columns (the 3-hourly Ap history array for the 57 hours preceding each observation). The ingest worker populates both from the NOAA SWPC Space Weather JSON endpoint.

Atmospheric density model selection rationale (F3): NRLMSISE-00 is used for Phase 1. JB2008 (Bowman et al. 2008) is the current USSF operational standard and is demonstrably more accurate during high solar activity periods (F10.7 > 150) and geomagnetic storms (Kp > 5). NRLMSISE-00 is chosen for Phase 1 because:

  • Python bindings are mature (nrlmsise00 PyPI package); JB2008 has no equivalent mature Python binding
  • For the typical F10.7 range (70–150 sfu) at solar minimum/moderate activity, the accuracy difference is < 10%
  • Phase 2 milestone: evaluate JB2008 against NRLMSISE-00 on historical re-entry backcasts; if MAE improvement > 15%, migrate; decision documented in docs/adr/0016-atmospheric-density-model.md

NRLMSISE-00 input validity bounds (F3): Inputs outside these ranges produce unphysical density estimates; the prediction is rejected rather than silently accepted:

NRLMSISE_INPUT_BOUNDS = {
    "f107": (65.0, 300.0),   # physical solar flux range; < 65 indicates data gap
    "f107A": (65.0, 300.0),
    "ap": (0.0, 400.0),      # Ap index physical range
    "altitude_km": (85.0, 1000.0),  # validated density range
}

If any bound is violated, raise AtmosphericModelInputError with field and value — never silently clamp.

Altitude scope: NRLMSISE-00 is used from 150 km to 800 km. Above 800 km, the model is applied but the prediction carries ood_flag = TRUE with ood_reason = 'above_nrlmsise_validated_range_800km' (Finding 11).

Geomagnetic storm sensitivity (Finding 11): During the MC sampling, when the current 3-hour Kp index exceeds 5, sample F10.7 and Ap from storm-period values (current observed, not 81-day average). The prediction is annotated:

  • space_weather_warning: 'geomagnetic_storm' field on the reentry_predictions record
  • UI amber callout: "Active geomagnetic storm — thermospheric density is elevated; re-entry timing uncertainty is significantly increased"
  • The storm flag persists for the lifetime of the prediction; it is not cleared when the storm ends (the prediction was made during disturbed conditions)
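The storm-aware sampling rule above can be sketched as follows. This is illustrative only: the field names and the widened sampling spreads (25% on Ap, 10% on F10.7 during a storm) are assumptions, not values the plan specifies; only the Kp > 5 trigger, the use of current observed values during storms, and the F10.7/Ap validity floors come from this section.

```python
import numpy as np

def sample_space_weather(rng, sw, storm_kp_threshold=5.0):
    """One MC draw of (F10.7, Ap, warning). Field names and spreads are assumed."""
    if sw["kp_3h"] > storm_kp_threshold:
        # Storm: sample around current observed values, not the 81-day average
        f107 = rng.normal(sw["f107_current"], 0.10 * sw["f107_current"])
        ap = rng.normal(sw["ap_current"], 0.25 * sw["ap_current"])
        warning = "geomagnetic_storm"
    else:
        # Quiet conditions: sample around the standard inputs
        f107 = rng.normal(sw["f107_81day_avg"], 0.05 * sw["f107_81day_avg"])
        ap = rng.normal(sw["ap_daily"], 0.10 * sw["ap_daily"])
        warning = None
    # Clamp to the NRLMSISE input validity floors (see bounds table above)
    return max(f107, 65.0), max(ap, 0.0), warning
```

The returned warning string is what would be persisted as space_weather_warning on the prediction record, which is why it is emitted per draw rather than cleared afterwards.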

Ballistic Coefficient Uncertainty Model (Finding 3)

The ballistic coefficient β = m / (C_D × A) is the dominant uncertainty in drag-driven decay. Its three components are sampled independently in the Monte Carlo:

Parameter Distribution Rationale
C_D Uniform(2.0, 2.4) Standard assumption for non-cooperative objects in free molecular flow; no direct measurement available
A (stable attitude, attitude_known = TRUE) Normal(A_discos, 0.05 × A_discos) 5% shape uncertainty for known-attitude objects
A (tumbling, attitude_known = FALSE) Normal(A_discos_mean, 0.25 × A_discos_mean) 25% uncertainty; tumbling objects present a time-varying cross-section
m Normal(m_discos, 0.10 × m_discos) 10% mass uncertainty; DISCOS masses are not independently verified

OOD rules:

  • attitude_known = FALSE AND mass_kg IS NULL → ood_flag = TRUE, ood_reason = 'tumbling_no_mass' — outside validated regime
  • cd_a_over_m IS NULL AND mass_kg IS NULL AND cross_section_m2 IS NULL → ood_flag = TRUE, ood_reason = 'no_physical_properties'

Objects with known physical properties can have operator-provided overrides stored in objects.cd_override DOUBLE PRECISION and objects.bstar_override DOUBLE PRECISION. When overrides are present, the MC samples around the override value rather than the DISCOS-derived value.
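A per-draw sampler for the three β components in the table above might look like the sketch below. The distributions (Uniform C_D, Normal area and mass, 5%/25% area sigma by attitude knowledge) are from this section; the 5% spread applied around an operator override is an assumption, since the plan says only that the MC "samples around the override value" without fixing a width, and the dict-style object access is illustrative.

```python
import numpy as np

def sample_beta_components(rng, obj):
    """One MC draw of (C_D, A, m) per the distribution table above."""
    if obj.get("cd_override") is not None:
        # Override present: sample around the operator-provided value
        # (5% spread is an assumption — the plan does not fix this width)
        cd = rng.normal(obj["cd_override"], 0.05 * obj["cd_override"])
    else:
        cd = rng.uniform(2.0, 2.4)  # standard free-molecular-flow assumption
    # Area: 5% sigma when attitude is known, 25% when tumbling
    sigma_frac = 0.05 if obj.get("attitude_known") else 0.25
    area = rng.normal(obj["area_m2"], sigma_frac * obj["area_m2"])
    # Mass: 10% sigma on the DISCOS value
    mass = rng.normal(obj["mass_kg"], 0.10 * obj["mass_kg"])
    # Clamp to physical positivity (guards the rare extreme draw)
    return cd, max(area, 1e-6), max(mass, 1e-3)
```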

Solar Radiation Pressure (Finding 7)

SRP is included using the cannonball model:

a_srp = P_sr × C_r × (A/m) × r̂_sun

where P_sr = 4.56 × 10⁻⁶ N/m² at 1 AU (scaled by (1 AU / r_sun)²), C_r is the radiation pressure coefficient stored in objects.cr_coefficient DOUBLE PRECISION DEFAULT 1.3.

SRP is significant (> 5% of drag contribution) for objects with area-to-mass ratio > 0.01 m²/kg at altitudes > 500 km. OOD flag: area_to_mass > 0.01 AND perigee > 500 km AND cr_coefficient IS NULL → ood_reason = 'srp_significant_cr_unknown'.
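The cannonball formula above translates directly to code. This sketch follows the formula's sign convention as written (acceleration along r̂_sun, with r taken as the spacecraft-to-Sun vector); shadow/eclipse handling is omitted.

```python
import numpy as np

P_SR_1AU = 4.56e-6        # N/m², solar radiation pressure at 1 AU (from the text)
AU_KM = 149_597_870.7     # astronomical unit in km

def srp_acceleration(r_sun_km, cr, area_m2, mass_kg):
    """Cannonball SRP acceleration (m/s²); r_sun_km is the spacecraft→Sun vector."""
    r = np.asarray(r_sun_km, dtype=float)
    dist = np.linalg.norm(r)
    p_sr = P_SR_1AU * (AU_KM / dist) ** 2      # inverse-square scaling with Sun distance
    return p_sr * cr * (area_m2 / mass_kg) * (r / dist)
```

For the default C_r = 1.3 and an area-to-mass ratio of 0.01 m²/kg (the significance threshold above), the magnitude at 1 AU is ≈ 5.9 × 10⁻⁸ m/s².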

Integrator Configuration (Finding 9)

from scipy.integrate import solve_ivp

integrator_config = dict(
    method   = "DOP853",         # Dormand–Prince 8(5,3) explicit RK — adaptive step
    rtol     = 1e-9,             # relative tolerance (parts-per-billion)
    atol     = 1e-9,             # absolute tolerance (km); ≈ 1 µm position error
    max_step = 60.0,             # seconds; constrained to capture density variation at perigee
    t_span   = (t0, t0 + 120 * 86400),  # 120-day maximum integration window
    events   = [
        altitude_80km_event,     # terminal: breakup trigger
        altitude_200km_event,    # non-terminal: log perigee passage
    ],
    dense_output = False,
)

Stopping criterion: integration terminates when altitude ≤ 80 km (breakup trigger fires) or when the 120-day span elapses without reaching 80 km (result: propagation_timeout; stored as status = 'timeout' in simulations). The 120-day cap is a safety stop — any object not re-entering within 120 days from a sub-450 km perigee TLE is anomalous and should be flagged for human review.

The max_step = 60s constraint near perigee prevents the integrator from stepping over atmospheric density variations. For altitudes above 300 km, the max step is relaxed to 300s (5 min) via a step-size hook that checks current altitude.
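A minimal sketch of the two pieces described above, under stated assumptions: scipy event functions carry their `terminal`/`direction` attributes exactly as shown, but the spherical-Earth altitude here is a simplification of the WGS84 oblate model the propagator actually uses, and the "step-size hook" is realised by segmenting the integration and re-invoking solve_ivp with the appropriate max_step (solve_ivp has no in-flight step-size callback).

```python
import numpy as np

R_EARTH_KM = 6378.137  # WGS84 equatorial radius; spherical approximation here

def altitude_80km_event(t, y):
    """Terminal solve_ivp event: zero-crossing at 80 km altitude (breakup trigger)."""
    return np.linalg.norm(y[:3]) - (R_EARTH_KM + 80.0)
altitude_80km_event.terminal = True   # stop the integration at the crossing
altitude_80km_event.direction = -1    # fire only on descending crossings

def max_step_for(alt_km):
    """Altitude-dependent max_step: 60 s near perigee, 300 s above 300 km."""
    return 300.0 if alt_km > 300.0 else 60.0
```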

TLE age uncertainty inflation (F7): TLE age is a formal uncertainty source, not just a staleness indicator. For decaying objects, position uncertainty grows with TLE age due to unmodelled atmospheric drag variations. A linear inflation model is applied to the ballistic coefficient covariance before MC sampling:

# Applied in decay_predictor.py before MC sampling
tle_age_days = (prediction_epoch - tle_epoch).total_seconds() / 86400
if tle_age_days > 0 and perigee_km < 450:
    uncertainty_multiplier = 1.0 + 0.15 * tle_age_days
    sigma_cd *= uncertainty_multiplier
    sigma_area *= uncertainty_multiplier

The 0.15/day coefficient is derived from Vallado (2013) §9.6 propagation error growth for LEO objects in ballistic flight. tle_age_at_prediction_time and uncertainty_multiplier are stored in simulations.params_json and included in the prediction API response for provenance.

Monte Carlo convergence criterion (F4): N = 500 for production is not arbitrary — it satisfies the following convergence criterion tested on the reference object (mc-ensemble-params.json):

N p95 corridor area (km²) Change from N/2
100 baseline
250 ~12%
500 ~4%
1000 ~1.8%
2000 ~0.9%

Convergence criterion: corridor area change < 2% between doublings. N = 500 satisfies this for the reference object (the 500 → 1000 doubling changes corridor area by only ~1.8%). N = 1000 is used for objects with ood_flag = TRUE or space_weather_warning = 'geomagnetic_storm' (higher uncertainty → higher N needed for stable tail estimates). Server cap remains 1000.
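The criterion and the N-selection rule above reduce to two small helpers; a sketch (function names are illustrative):

```python
def corridor_converged(area_n, area_half_n, tol=0.02):
    """Convergence test from the table above: < 2% corridor-area change per doubling."""
    return abs(area_n - area_half_n) / area_half_n < tol

def choose_mc_n(ood_flag=False, storm_warning=False, cap=1000):
    """N = 500 standard; N = 1000 for OOD-flagged or storm-flagged predictions."""
    n = 1000 if (ood_flag or storm_warning) else 500
    return min(n, cap)  # server cap always applies
```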

Monte Carlo:

N = 500 (standard); N = 1000 (OOD flag or storm warning); server cap 1000
Per-sample variation: C_D ~ U(2.0, 2.4); A ~ N(A_discos, σ_A × uncertainty_multiplier);
  m ~ N(m_discos, σ_m); F10.7 and Ap from storm-aware sampling
Output: p01/p05/p25/p50/p75/p95/p99 re-entry times; ground track corridor polygon; per-sample binary blob for Mode C
All output records HMAC-signed before database write

15.3 Atmospheric Breakup Model

Simplified ORSAT approach: aerothermal heating → failure altitude → fragment generation → RK4 ballistic descent → impact (velocity, angle, KE, casualty area). Distinct from NASA SBM on-orbit fragmentation.

Breakup altitude trigger (Finding 5): Structural breakup begins when the numerical integrator crosses altitude = 78 km (midpoint of the 75–80 km range supported by NASA Debris Assessment Software and ESA DRAMA for aluminium-structured objects; documented in model card under "Breakup Altitude Rationale").

Fragment generation: Below 78 km, the fragment cloud is generated using the NASA Standard Breakup Model (NASA-TM-2018-220054) parameter set for the object's mass class:

  • Mass class A: < 100 kg
  • Mass class B: 100–1000 kg
  • Mass class C: > 1000 kg (rocket bodies, large platforms)

Survivability by material (Finding 5): Fragment demise altitude is determined by material class using the ESA DRAMA demise altitude lookup:

material_class Typical demise altitude Notes
aluminium 60–70 km Most fragments demise; some survive
stainless_steel 45–55 km Higher survival probability
titanium 40–50 km High survival; used in tanks and fasteners
carbon_composite 55–65 km Largely demises but reinforced structures may survive
unknown Conservative: 0 km (surface impact) All fragments assumed to survive — drives ood_flag = TRUE

material_class TEXT added to objects table. When material_class IS NULL, the ood_flag is set and the conservative all-survive assumption is used. The NOTAM (E) field debris survival statement changes from a static disclaimer to a model-driven statement: DEBRIS SURVIVAL PROBABLE (when calculated survivability > 50%) or DEBRIS SURVIVAL POSSIBLE (10–50%) or COMPLETE DEMISE EXPECTED (< 10%).
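The statement mapping above is a pure function of survival_probability; a sketch (boundary handling: exactly 50% falls in the "POSSIBLE" band, per the > 50% / 10–50% thresholds):

```python
def notam_survival_statement(survival_probability):
    """Map survival_probability (0.0–1.0) to the NOTAM (E) field wording above."""
    if survival_probability > 0.5:
        return "DEBRIS SURVIVAL PROBABLE"
    if survival_probability >= 0.1:
        return "DEBRIS SURVIVAL POSSIBLE"
    return "COMPLETE DEMISE EXPECTED"
```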

Casualty area: Computed from fragment mass and velocity using the ESA DRAMA methodology. Stored per-fragment in fragment_impacts table. The aggregate casualty area polygon drives the "ground risk" display in the Event Detail page (Phase 3 feature).

Survival probability output (F5): The aggregate object-level survival probability is stored in reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN survival_probability DOUBLE PRECISION,  -- fraction of object mass expected to survive to surface (0.0–1.0)
  ADD COLUMN survival_model_version TEXT,            -- e.g. 'phase1_analytical_v1', 'drama_3.2'
  ADD COLUMN survival_model_note TEXT;               -- human-readable caveat, e.g. 'Phase 1: simplified analytical; no fragmentation modelling'

Phase 1 method: simplified analytical — ballistic coefficient of the intact object projected to surface; if material_class = 'unknown', survival_probability = 1.0 (conservative all-survive). Phase 2: integrate ESA DRAMA output files where available from the space operator's licence submission. The NOTAM (E) field statement is driven by survival_probability (already specified above).

15.4 Corridor Generation Algorithm (Finding 4)

The re-entry corridor polygon is generated by reentry/corridor.py. The algorithm must be specified explicitly — the choice between convex hull, alpha-shape, and ellipse fit produces materially different FIR intersection results.

Algorithm:

def generate_corridor_polygon(
    mc_trajectories: list[list[GroundPoint]],
    percentile: float = 0.95,
    alpha: float = 0.1,           # degrees; ~11 km at equator
    buffer_km: float = 50.0,      # lateral dispersion buffer below 80 km
    max_vertices: int = 1000,
) -> Polygon:
    """
    Generate a re-entry hazard corridor polygon from Monte Carlo trajectories.

    Algorithm:
      1. For each MC trajectory, collect ground positions at 10-min intervals
         from the 80 km altitude crossing to the final impact point.
      2. Retain the central `percentile` fraction of trajectories by re-entry time
         (discard the earliest p_low and latest p_high tails).
      3. Compute the alpha-shape (concave hull) of the combined point set
         using alpha = 0.1°. Alpha-shape is preferred over convex hull for
         elongated re-entry corridors (convex hull overestimates width by 2–5×).
      4. Buffer the polygon by `buffer_km` to account for lateral fragment
         dispersion below 80 km.
      5. Simplify to <= `max_vertices` vertices (Douglas-Peucker, tolerance 0.01°).
      6. Store the raw MC endpoint cloud as JSONB in `reentry_predictions.mc_endpoint_cloud`
         for audit and Mode C replay.

    Returns:
        Polygon in EPSG:4326 (WGS84), suitable for PostGIS GEOGRAPHY storage.
    """

The alpha-shape library (alphashape) is added to requirements.in. The 50 km buffer accounts for the fact that fragments detach from the main object trajectory below 80 km and disperse laterally. This value is documented in the model card with a reference to ESA DRAMA lateral dispersion statistics.
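Step 2 of the algorithm (retaining the central percentile of trajectories) is the only step not pinned down by a library call; a minimal sketch of that trimming, assuming re-entry times are supplied as a flat array of seconds:

```python
import numpy as np

def central_fraction_mask(reentry_times_s, fraction=0.95):
    """Keep the central `fraction` of MC trajectories by re-entry time,
    discarding the symmetric early/late tails (step 2 of the corridor algorithm)."""
    t = np.asarray(reentry_times_s, dtype=float)
    tail = (1.0 - fraction) / 2.0
    lo, hi = np.quantile(t, [tail, 1.0 - tail])
    return (t >= lo) & (t <= hi)   # boolean mask over trajectories
```

The mask is then applied before the alpha-shape step, so the polygon is built only from the retained trajectories' ground points.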

Adaptive ground-track sampling for CZML corridor fidelity (F4 — §62):

Step 1 of the corridor algorithm above samples at 10-minute intervals. For the high-deceleration terminal phase (below ~150 km), 10 minutes corresponds to hundreds of kilometres of ground track — the polygon will miss the actual terminal geometry. Adaptive sampling is required:

def adaptive_ground_points(trajectory: list[StateVector]) -> list[GroundPoint]:
    """
    Return ground points at altitude-dependent intervals:
      > 300 km: every 5 min  (slow deceleration; sparse sampling adequate)
      150300 km: every 2 min
      80150 km: every 30 s  (rapid deceleration; must resolve terminal corridor)
      < 80 km: every 10 s   (fragment phase; maximum spatial resolution)
    """
    points = []
    for sv in trajectory:
        alt_km = sv.altitude_km
        step_s = 300 if alt_km > 300 else (
                 120 if alt_km > 150 else (
                  30 if alt_km > 80 else 10))
        # only emit a point if sufficient time has elapsed since the last point
        if not points or (sv.t - points[-1].t) >= step_s:
            points.append(to_ground_point(sv))
    return points

This is a breaking change to the corridor algorithm: the reference polygon in docs/validation/reference-data/mc-corridor-reference.geojson must be regenerated after this change is implemented. The ADR for this change must document the old vs. new polygon area difference for the reference object.

PostGIS vs CZML corridor consistency test (F6 — §62):

The PostGIS ground_track_corridor polygon (used for FIR intersection and alert generation) and the CZML polygon positions (displayed on the globe) are independently derived. A serialisation bug in the CZML builder could render the corridor in the wrong location while the database record remains correct — operators would see one corridor, alerts would be generated based on another.

Required integration test in tests/integration/test_corridor_consistency.py:

@pytest.mark.safety_critical
def test_czml_corridor_matches_postgis_polygon(db_session):
    """
    The bounding box of the CZML polygon positions must agree with the
    PostGIS corridor polygon bounding box to within 10 km in each direction.
    """
    prediction = db_session.query(ReentryPrediction).filter(
        ReentryPrediction.ground_track_corridor.isnot(None)
    ).first()

    # Generate CZML from the prediction
    czml_doc = generate_czml_for_prediction(prediction)
    czml_polygon = extract_polygon_positions(czml_doc)  # list of (lat, lon)

    # Get PostGIS bounding box
    postgis_bbox = db_session.execute(
        text("SELECT ST_Envelope(ground_track_corridor::geometry) FROM reentry_predictions WHERE id = :id"),
        {"id": prediction.id}
    ).scalar()
    postgis_coords = extract_bbox_corners(postgis_bbox)  # (min_lat, max_lat, min_lon, max_lon)

    czml_bbox = bounding_box_of(czml_polygon)
    assert abs(czml_bbox.min_lat - postgis_coords.min_lat) < 0.1   # ~10 km latitude tolerance
    assert abs(czml_bbox.max_lat - postgis_coords.max_lat) < 0.1
    # Antimeridian-aware longitude comparison
    assert lon_diff_deg(czml_bbox.min_lon, postgis_coords.min_lon) < 0.1
    assert lon_diff_deg(czml_bbox.max_lon, postgis_coords.max_lon) < 0.1

This test is marked safety_critical because a discrepancy > 10 km between displayed and stored corridor is a direct contribution to HZ-004.

Unit test: Generate a corridor from a known synthetic MC dataset (100 trajectories, straight ground track); verify the resulting polygon contains all input points; verify the polygon area is less than the convex hull area (confirming the alpha-shape is tighter); verify the polygon has ≤ 1000 vertices.

MC test data generation strategy (Finding 10): Generating hundreds of MC trajectories at test time is slow and non-deterministic. Committing raw trajectory arrays is a large binary blob. Use seeded RNG:

# tests/physics/conftest.py
@pytest.fixture(scope="session")
def synthetic_mc_ensemble():
    """500 synthetic trajectories from seeded RNG — deterministic, no external downloads."""
    rng = np.random.default_rng(seed=42)  # seed must never change without updating reference polygon
    return generate_mc_ensemble(
        rng, n=500,
        object_params={  # Reference object: committed, never change without ADR
            "mass_kg": 1000.0, "cd": 2.2, "area_m2": 1.0, "perigee_km": 185.0,
        },
    )

Commit to docs/validation/reference-data/:

  • mc-corridor-reference.geojson — pre-computed corridor polygon (run python tools/generate_mc_reference.py once; review and commit)
  • mc-ensemble-params.json — RNG seed, object parameters, generation timestamp

Test asserts: (a) generated corridor polygon matches committed reference within 5% area difference; (b) corridor contains ≥ 95% of input trajectories. If the corridor algorithm changes, the reference polygon must be explicitly regenerated and the change reviewed — the seed itself never changes.

15.5 Conjunction Probability (Pc) Computation Method (Finding 8)

The Pc method is specified in conjunction/pc_compute.py and must be documented in the API response.

Phase 1–2 method: Alfano/Foster 2D Gaussian

def compute_pc_alfano(
    r1: np.ndarray,   # primary position (km, GCRF)
    v1: np.ndarray,   # primary velocity (km/s)
    cov1: np.ndarray, # 6×6 covariance (km², km²/s²)
    r2: np.ndarray,   # secondary position
    v2: np.ndarray,
    cov2: np.ndarray,
    hbr: float,       # combined hard-body radius (m)
) -> float:
    """
    Compute probability of collision using Alfano (2005) 2D Gaussian method.

    Projects combined covariance onto the encounter plane, integrates the
    bivariate normal distribution over the combined hard-body area.
    Standard method in the space surveillance community.

    Reference: Alfano (2005), "A Numerical Implementation of Spherical Object
    Collision Probability", Journal of the Astronautical Sciences.
    """

API response field: Every conjunction record includes pc_method: "alfano_2d_gaussian" so consumers can correctly interpret the result.

Covariance source: TLE format carries no covariance. SpaceCom estimates covariance via TLE differencing (Vallado & Cefola method): multiple TLEs for the same object within a 24-hour window are used to estimate position uncertainty. This is documented in the API as covariance_source: "tle_differencing" and flagged as covariance_quality: 'low' when fewer than 3 TLEs are available within 24 hours.
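The differencing step reduces to a sample covariance over positions propagated to a common epoch; a sketch under stated assumptions (the 'nominal' quality label for ≥ 3 TLEs is an assumption — the text names only 'low'):

```python
import numpy as np

def covariance_from_tle_differencing(positions_km):
    """Estimate a 3×3 position covariance (km²) from k positions at a common
    epoch, each propagated from a different TLE within the 24 h window."""
    p = np.asarray(positions_km, dtype=float)     # shape (k, 3)
    quality = "low" if len(p) < 3 else "nominal"  # < 3 TLEs → covariance_quality: 'low'
    cov = np.cov(p.T)                             # sample covariance over the k draws
    return cov, quality
```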

pc_discrepancy_flag implementation: The log-scale comparison is confirmed as:

pc_discrepancy_flag = abs(math.log10(pc_spacecom) - math.log10(pc_spacetrack)) > 1.0

Not a linear comparison. A discrepancy is an order-of-magnitude difference in probability — this threshold is correct.
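A direct implementation of the one-liner above needs one guard the snippet omits: math.log10 raises on zero. Flooring both inputs at the 1 × 10⁻¹⁵ reporting floor from the validity domain handles this; treating the floor as the clamp value here is an assumption (the text specifies the floor for reporting, not for this comparison).

```python
import math

PC_FLOOR = 1e-15  # minimum reportable Pc (numerical precision limit, per the validity domain)

def pc_discrepancy(pc_spacecom, pc_spacetrack, threshold_decades=1.0):
    """Order-of-magnitude comparison of two Pc estimates on a log scale.
    Flooring avoids math.log10(0) when either source reports a vanishing Pc."""
    la = math.log10(max(pc_spacecom, PC_FLOOR))
    lb = math.log10(max(pc_spacetrack, PC_FLOOR))
    return abs(la - lb) > threshold_decades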

Validity domain (F1): The Alfano 2D Gaussian method is valid under the following conditions. Outside these conditions, the Pc estimate is flagged with pc_validity: 'degraded' in the API response:

  • Short-encounter assumption: valid when the encounter duration is short compared to the orbital period (satisfied for LEO conjunction geometries)
  • Linear relative motion: degrades when miss_distance_km < 0.1 (non-linear trajectory effects become significant); flag: pc_validity_warning: 'sub_100m_close_approach'
  • Gaussian covariance: degrades when the position uncertainty ellipsoid aspect ratio (σ_max/σ_min) > 100; flag: pc_validity_warning: 'highly_anisotropic_covariance'
  • Minimum Pc floor: values below 1×10⁻¹⁵ are reported as < 1e-15 and not computed precisely (numerical precision limit)
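The two degradation checks above that are expressible as simple predicates can be sketched as follows (function and flag names follow the text; the sigma arguments are assumed to be the extreme axes of the position uncertainty ellipsoid):

```python
def pc_validity_warnings(miss_distance_km, sigma_max, sigma_min):
    """Return the pc_validity_warning flags triggered by this geometry."""
    warnings = []
    if miss_distance_km < 0.1:          # linear relative motion breaks down < 100 m
        warnings.append("sub_100m_close_approach")
    if sigma_min > 0 and sigma_max / sigma_min > 100:  # highly anisotropic covariance
        warnings.append("highly_anisotropic_covariance")
    return warnings
```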

Reference implementation test (F1): tests/physics/test_pc_compute.py — BLOCKING:

# Reference cases from Vallado & Alfano (2009), Table 1
VALLADO_ALFANO_CASES = [
    # (miss_dist_m, sigma_r1_m, sigma_t1_m, sigma_n1_m,
    #  sigma_r2_m, sigma_t2_m, sigma_n2_m, hbr_m, expected_pc)
    (100.0, 50.0, 200.0, 50.0, 50.0, 200.0, 50.0, 10.0, 3.45e-3),
    (500.0, 100.0, 500.0, 100.0, 100.0, 500.0, 100.0, 5.0, 2.1e-5),
]

@pytest.mark.parametrize("case", VALLADO_ALFANO_CASES)
def test_pc_against_vallado_alfano(case):
    *geometry, expected_pc = case    # cases are plain tuples; last element is the expected Pc
    pc = compute_pc_alfano(*build_conjunction_geometry(geometry))
    assert abs(pc - expected_pc) / expected_pc < 0.05  # within 5%

Phase 3 consideration: Monte Carlo Pc for conjunctions where pc_spacecom > 1e-3 (high-probability cases where the Gaussian assumption may break down due to non-linear trajectory evolution). Document in docs/adr/0015-pc-computation-method.md.

15.6 Model Version Governance (F6)

All components of the prediction pipeline are versioned together as a single model_version string using semantic versioning (MAJOR.MINOR.PATCH):

Change type Version bump Examples
Pc methodology or propagator algorithm change MAJOR Switch from Alfano 2D to Monte Carlo Pc; replace DOP853 integrator
Atmospheric model or input processing change MINOR NRLMSISE-00 → JB2008; change TLE age inflation coefficient
Bug fix in existing model PATCH Fix F10.7 index lookup off-by-one; correct frame transformation

Rules:

  • Old model versions are never deleted — tagged in git (model/v1.2.3) and retained in backend/app/modules/physics/versions/
  • reentry_predictions.model_version is set at creation and immutable thereafter
  • A model version bump requires: updated unit tests, updated docs/validation/reference-data/, entry in CHANGELOG.md, ADR if MAJOR

Reproducibility endpoint (F6):

POST /api/v1/decay/predict/reproduce
Body: { "prediction_id": "uuid" }

Re-runs the prediction using the exact model version and parameters from simulations.params_json recorded at the time of the original prediction. Returns a new prediction record with reproduced_from_prediction_id set. This endpoint is used for regulatory audit ("what model produced this output?") and post-incident review. Available to analyst role and above.

15.7 Prediction Input Validation (F9)

A validate_prediction_inputs() function in backend/app/modules/physics/validation.py gates all decay prediction submissions. Inputs that fail validation are rejected with structured errors — never silently clamped to a valid range.

def validate_prediction_inputs(params: PredictionParams) -> list[ValidationError]:
    errors = []
    tle_age_days = (utcnow() - params.tle_epoch).days
    if tle_age_days > 30:
        errors.append(ValidationError("INVALID_TLE_EPOCH",
            f"TLE epoch is {tle_age_days} days old; maximum 30 days"))
    if not (65.0 <= params.f107 <= 300.0):
        errors.append(ValidationError("F107_OUT_OF_RANGE",
            f"F10.7 = {params.f107}; valid range [65, 300]"))
    if not (0.0 <= params.ap <= 400.0):
        errors.append(ValidationError("AP_OUT_OF_RANGE",
            f"Ap = {params.ap}; valid range [0, 400]"))
    if params.perigee_km > 1200.0:
        errors.append(ValidationError("PERIGEE_TOO_HIGH",
            f"Perigee {params.perigee_km} km > 1200 km; not a re-entry candidate"))
    if params.mass_kg is not None and params.mass_kg <= 0:
        errors.append(ValidationError("INVALID_MASS",
            f"Mass {params.mass_kg} kg must be > 0"))
    return errors

If errors is non-empty, the endpoint returns 422 Unprocessable Entity with the full error list. Unit tests (BLOCKING) cover each validation path including boundary values.

15.8 Data Provenance Specification (F11)

Phase 1 model classification: No trained ML model components. All prediction parameters are derived from:

  • Physical constants (gravitational parameter, WGS84 Earth model)
  • Published atmospheric model coefficients (NRLMSISE-00)
  • Published orbital mechanics algorithms (SGP4, Alfano 2005 Pc)
  • Empirical constants from peer-reviewed literature (NASA Standard Breakup Model, ESA DRAMA demise altitudes, Vallado ballistic coefficient uncertainty)

This is documented explicitly in docs/ml/data-provenance.md as: "SpaceCom Phase 1 uses no trained machine learning components. All model parameters are derived from physical constants and published peer-reviewed sources cited below."

EU AI Act Art. 10 compliance (Phase 1): Because Phase 1 has no training data, the data governance obligations of Art. 10 apply to input data rather than training data. Input data provenance is tracked in simulations.params_json (TLE source, space weather source, timestamp, version).

Future ML component protocol: Any future learned component (e.g., drag coefficient ML model, debris type classifier) must be accompanied by:

  • Training dataset: source, date range, preprocessing steps, known biases
  • Validation split: method, size, metrics
  • Performance on historical re-entry backcasts (§15.9 backcasting pipeline)
  • Documented in docs/ml/data-provenance.md under the component name
  • docs/ml/model-card-{component}.md following the Google Model Card format

15.9 Backcasting Validation Pipeline (F8)

When a re-entry is confirmed (object decays — objects.status = 'decayed'), the backcasting pipeline runs automatically:

# Triggered by Celery task on object status change to 'decayed'
@celery.task
def run_reentry_backcast(object_id: int, confirmed_reentry_time: datetime):
    """Compare all predictions made in 72h before re-entry to actual outcome."""
    predictions = db.query(ReentryPrediction).filter(
        ReentryPrediction.object_id == object_id,
        ReentryPrediction.created_at >= confirmed_reentry_time - timedelta(hours=72),
    ).all()
    for pred in predictions:
        error_hours = (pred.p50_reentry_time - confirmed_reentry_time).total_seconds() / 3600
        db.add(ReentryBackcast(
            prediction_id=pred.id,
            object_id=object_id,
            confirmed_reentry_time=confirmed_reentry_time,
            p50_error_hours=error_hours,
            lead_time_hours=(confirmed_reentry_time - pred.created_at).total_seconds() / 3600,
            model_version=pred.model_version,
        ))

CREATE TABLE reentry_backcasts (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id   BIGINT NOT NULL REFERENCES reentry_predictions(id),
    object_id       INTEGER NOT NULL REFERENCES objects(id),
    confirmed_reentry_time TIMESTAMPTZ NOT NULL,
    p50_error_hours DOUBLE PRECISION NOT NULL,  -- signed: positive = predicted late
    lead_time_hours DOUBLE PRECISION NOT NULL,
    model_version   TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON reentry_backcasts (model_version, created_at DESC);

Drift detection: Rolling 30-prediction MAE by model version, computed nightly. If MAE > 2× historical baseline for the current model version, raise MEDIUM alert to Persona D flagging for model review. Surfaced in the admin analytics panel as a "Model Performance" widget.
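The drift rule above is a rolling-window MAE comparison; a sketch (the 2× factor and 30-prediction window are from the text; the function name and early-return for an empty window are assumptions):

```python
def model_drift_alert(signed_errors_hours, baseline_mae_hours, window=30, factor=2.0):
    """Nightly check: alert when the rolling-window MAE for the current model
    version exceeds `factor` × its historical baseline."""
    recent = signed_errors_hours[-window:]   # most recent backcast errors
    if not recent:
        return False                          # no backcasts yet — nothing to compare
    mae = sum(abs(e) for e in recent) / len(recent)
    return mae > factor * baseline_mae_hours
```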


16. Cross-Cutting Concerns

16.1 Subscription Tiers and Feature Flags (F2, F6)

SpaceCom gates commercial entitlements on the contracts table, which is the single authoritative commercial source of truth. organisations.subscription_tier is a presentation and segmentation shorthand only, and must never be used as the authority for feature access, quota limits, or shadow/production eligibility. Active contract state is materialised into derived organisation flags and quotas by a synchronisation job so runtime checks remain cheap and explicit.

Tier Intended customer MC concurrent runs Decay predictions/month Conjunction screening API access Multi-ANSP coordination
shadow_trial Evaluators / test orgs 1 20 Read-only (catalog) No No
ansp_operational ANSP Phase 1 1 200 Yes (Phase 2) Yes Yes
space_operator Space operator orgs 2 500 Own objects only Yes No
institutional Space agencies, research 4 Unlimited Yes Yes Yes
internal SpaceCom internal Unlimited Unlimited Yes Yes Yes

Feature flag enforcement pattern:

def require_tier(*tiers: str):
    def dependency(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
        org = db.get(Organisation, current_user.organisation_id)
        # subscription_tier here is the contract-derived materialised value,
        # kept current by the contracts synchronisation job — never operator-set
        if org.subscription_tier not in tiers:
            raise HTTPException(status_code=403, detail={
                "code": "TIER_INSUFFICIENT",
                "current_tier": org.subscription_tier,
                "required_tiers": list(tiers),
            })
        return org
    return dependency

# Applied at router level alongside require_role:
router = APIRouter(dependencies=[
    Depends(require_role("analyst", "operator", "org_admin", "admin")),
    Depends(require_tier("ansp_operational", "institutional", "internal")),
])

Quota enforcement pattern (MC concurrent runs):

TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def get_mc_concurrency_limit(org: Organisation) -> int:
    return TIER_MC_CONCURRENCY.get(org.subscription_tier, 1)

Quota exhaustion is a billable signal: Every 429 TIER_QUOTA_EXCEEDED response writes a usage_events row with event_type = 'mc_quota_exhausted' (see §9.2 usage_events table). This powers the org admin's usage dashboard and the upsell trigger in the admin panel.

Tier changes take effect immediately — no session restart required. The require_tier dependency reads from the database on each request; there is no tier caching that could allow a downgraded tier to continue accessing premium features.

Uncertainty and Confidence

Every prediction includes:

  • confidence_level (0.0–1.0) — derived from MC spread
  • uncertainty_bounds — explicit p05/p50/p95 times, corridor ellipse axes
  • model_version — semantic version
  • monte_carlo_n — ≥ 100 preliminary, ≥ 500 operational
  • f107_assumed, ap_assumed — critical for reproducibility
  • record_hmac — tamper-evident signature, verified before serving

TLE covariance: TLE format contains no covariance. Use TLE differencing (multiple TLEs within 24h) or empirical Vallado & Cefola covariance. Document clearly in API responses.

Multi-source prediction conflict resolution (Finding 10):

Space-Track TIP messages and SpaceCom's internal decay predictor may produce non-overlapping re-entry windows for the same object simultaneously. ESA ESAC may publish a third window. The aviation regulatory principle of most-conservative applies — the hazard presented to ANSPs must encompass the full credible uncertainty range.

Resolution rules (applied at the reentry_predictions layer):

Situation Rule
SpaceCom p10–p90 and TIP window overlap Display SpaceCom corridor as primary; TIP window shown as secondary reference band on Event Detail page
SpaceCom p10–p90 and TIP window do not overlap Set prediction_conflict = TRUE on the prediction; HIGH severity data quality warning displayed; hazard corridor presented to ANSPs uses the union of SpaceCom p10–p90 and TIP window
ESA ESAC window available Overlay as third reference band; include in PREDICTION_CONFLICT assessment if non-overlapping
All sources agree (all windows overlap) No flag; SpaceCom corridor is primary

Schema addition to reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN prediction_conflict BOOLEAN DEFAULT FALSE,
  ADD COLUMN conflict_sources TEXT[],   -- e.g. ['spacecom', 'space_track_tip']
  ADD COLUMN conflict_union_p10 TIMESTAMPTZ,
  ADD COLUMN conflict_union_p90 TIMESTAMPTZ;

The Event Detail page shows a ⚠ PREDICTION CONFLICT banner (HIGH severity style) when prediction_conflict = TRUE, listing the conflicting sources and their windows. The hazard corridor polygon uses conflict_union_p10/conflict_union_p90 when the flag is set. Document in docs/model-card-decay-predictor.md under "Conflict Resolution with Authoritative Sources."
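The overlap/union rules above can be sketched as a single resolution function. This is illustrative: it covers only the two-source (SpaceCom vs TIP) case, and the returned dict keys mirror the schema columns above.

```python
from datetime import datetime, timedelta

def resolve_prediction_conflict(sc_p10, sc_p90, tip_start, tip_end):
    """Overlapping windows → no flag; disjoint windows → flag the prediction
    and present the most-conservative union window to ANSPs."""
    overlap = sc_p10 <= tip_end and tip_start <= sc_p90
    if overlap:
        return {"prediction_conflict": False}
    return {
        "prediction_conflict": True,
        "conflict_sources": ["spacecom", "space_track_tip"],
        "conflict_union_p10": min(sc_p10, tip_start),   # earliest credible entry
        "conflict_union_p90": max(sc_p90, tip_end),     # latest credible entry
    }
```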

Auditability

  • Every simulation in simulations with full params_json and result URI
  • Reports stored with simulation_id reference
  • alert_events and security_logs are append-only with DB-level triggers
  • All API mutations logged with user ID, timestamp, and payload hash
  • TIP messages stored verbatim for audit

Error Handling

  • Structured error responses: { "error": "code", "message": "...", "detail": {...} }
  • Celery failures captured in simulations.status = 'failed'; surfaced in jobs panel
  • Frame transformation failures fail loudly — never silently continue with TEME
  • HMAC failures return 503 and trigger CRITICAL security event — never silently serve a tampered record
  • TanStack Query error states render inline messages with retry; not page-level errors

Performance Patterns

SQLAlchemy async — lazy="raise" on all relationships: Async SQLAlchemy prohibits lazy-loaded relationship access outside an async context. Setting lazy="raise" converts silent N+1 errors into loud InvalidRequestError at development time rather than silent blocking DB calls in production:

class ReentryPrediction(Base):
    object:       Mapped["SpaceObject"]   = relationship(lazy="raise")
    tip_messages: Mapped[list["TipMessage"]] = relationship(lazy="raise")
    # Forces all callers to use joinedload/selectinload explicitly

Required eager-loading patterns for the three highest-traffic endpoints:

  • Event Detail: selectinload(ReentryPrediction.object), selectinload(ReentryPrediction.tip_messages)
  • Active alerts: selectinload(AlertEvent.prediction)
  • CZML catalog: raw SQL with a single JOIN rather than ORM (bulk fetch; ORM overhead unacceptable at 864k rows)

CZML caching — two-tier strategy: CZML data for the current 72h window changes only when a new TLE is ingested or a propagation job completes. Cache the full serialised CZML blob:

CZML_CACHE_KEY = "cache:czml:catalog:{catalog_hash}:{window_start}:{window_end}"
# TTL: 15 minutes in LIVE mode (refreshed after new TLE ingest event)
# TTL: permanent in REPLAY mode (historical data never changes)

Per-object CZML fragments cached separately under cache:czml:obj:{norad_id}:{...}. When a TLE is re-ingested for one object, invalidate only that object's fragment and recompute the full catalog CZML from the cached fragments.
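
The catalog_hash can be derived from the ingested TLE set so that any TLE change automatically produces a new catalog key. A sketch of the key construction (the hash recipe here is an assumption, not a committed format):

```python
import hashlib

def catalog_hash(tles: dict[int, str]) -> str:
    """Stable hash over (norad_id, tle_epoch) pairs; independent of insertion order."""
    h = hashlib.sha256()
    for norad_id in sorted(tles):
        h.update(f"{norad_id}:{tles[norad_id]}".encode())
    return h.hexdigest()[:16]

def fragment_key(norad_id: int, window_start: str, window_end: str) -> str:
    return f"cache:czml:obj:{norad_id}:{window_start}:{window_end}"

def catalog_key(tles: dict[int, str], window_start: str, window_end: str) -> str:
    return f"cache:czml:catalog:{catalog_hash(tles)}:{window_start}:{window_end}"
```

Because re-ingesting one object's TLE changes that object's epoch entry, the catalog key rotates naturally while untouched per-object fragment keys stay valid for reuse.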

CZML cache invalidation triggers (F5 — §58):

| Event | Invalidation scope | Mechanism |
|-------|--------------------|-----------|
| New TLE ingested for object X | `cache:czml:obj:{norad_id_x}:*` only | Ingest task calls `redis.delete(pattern)` after TLE commit |
| Propagation job completes for object X | `cache:czml:obj:{norad_id_x}:*` + full catalog key | Propagation Celery task issues invalidation on success |
| New prediction created for object X | `cache:czml:obj:{norad_id_x}:*` | Prediction task issues invalidation on completion |
| Manual cache flush (admin API) | `cache:czml:*` | DELETE /api/v1/admin/cache/czml (requires admin role) |
| Cold start / DR failover | Warm-up Celery task `warm_czml_cache` | Beat task runs at startup (see below) |

Stale-while-revalidate strategy: The CZML cache key includes a stale_ok variant. When the primary key is expired but the stale key (cache:czml:catalog:stale:{hash}) exists, serve the stale response immediately and enqueue a background recompute. Maximum stale age: 5 minutes. This prevents a cache stampede during TLE batch ingest (up to 600 simultaneous invalidations).
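
The stale-while-revalidate flow can be sketched with a dict-backed cache stand-in (key names follow the scheme above; the recompute hook is illustrative):

```python
STALE_MAX_AGE = 300  # seconds; maximum age served from the stale key

def get_catalog_czml(cache: dict, key: str, recompute_queue: list, now: float):
    """Serve fresh if present; else serve stale (up to 5 min old) and enqueue a
    background recompute; else signal that a synchronous recompute is needed."""
    if key in cache:
        return cache[key], "fresh"
    stale_key = key.replace("cache:czml:catalog:", "cache:czml:catalog:stale:")
    entry = cache.get(stale_key)
    if entry is not None:
        blob, written_at = entry
        if now - written_at <= STALE_MAX_AGE:
            recompute_queue.append(key)  # background refresh; avoids the stampede
            return blob, "stale"
    return None, "miss"
```

Because the stale branch answers immediately and queues exactly one recompute per expired key, a batch of invalidations degrades to slightly old corridors rather than a burst of simultaneous recomputations.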

Cache warm-up on cold start (F5 — §58):

@app.task
def warm_czml_cache():
    """Run at container startup and after DR failover. Estimated: 3060s for 600 objects."""
    objects = db.query(Object).filter(Object.active == True).all()
    for obj in objects:
        generate_czml_fragment.delay(obj.norad_id)
    # Full catalog key assembled by CZML endpoint after all fragments present

Cold-start warm-up time (600 objects, 16 simulation workers): estimated 30-60 seconds. Included in DR RTO calculation (§26.3) as "cache warm-up: ~1 min" line item.

Redis key namespaces and eviction policy:

| Namespace | Contents | Eviction policy | Notes |
|-----------|----------|-----------------|-------|
| `celery:*` | Celery broker queues | noeviction (must never be evicted) | Use separate Redis instance or DB 0 with noeviction |
| `redbeat:*` | celery-redbeat schedules | noeviction | Loss causes silent scheduled-job disappearance |
| `cache:*` | Application cache (CZML, space weather, HMAC results) | allkeys-lru | Cache misses acceptable; broker loss is not |
| `ws:session:*` | WebSocket session state | volatile-lru (with TTL set) | Expires on session end |

Run Celery broker and application cache as separate Redis database indexes (SELECT 0 vs SELECT 1) so eviction policies can differ. The Sentinel configuration monitors both.

Cache TTLs:

  • cache:czml:catalog → 15 minutes
  • cache:spaceweather:current → 5 minutes
  • cache:prediction:{id}:fir_intersection → until superseded (keyed to prediction ID)
  • cache:prediction:{id}:hmac_verified → 60 minutes

Bulk export — Celery offload for Persona F: The /space/export/bulk endpoint must not materialise the full result set in the backend container — for the full catalog this risks OOM. Implement as a Celery task that writes to MinIO and returns a pre-signed download URL, consistent with the existing report generation pattern:

@app.post("/space/export/bulk")
async def trigger_bulk_export(params: BulkExportParams, ...):
    task = generate_bulk_export.delay(params.dict(), user_id=current_user.id)
    return {"task_id": task.id, "status": "queued"}

@app.get("/space/export/bulk/{task_id}")
async def get_bulk_export(task_id: str, ...):
    # Returns {"status": "complete", "download_url": presigned_url} when done

If a streaming response is preferred over task-based, use SQLAlchemy yield_per=1000 cursor streaming — never materialise the full result set.
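
If the streaming variant is chosen, the cursor-batched shape looks like this (the row serialiser and names are illustrative; with SQLAlchemy, `rows` would come from `session.execute(stmt).yield_per(1000)`):

```python
import csv
import io
from typing import Iterable, Iterator

def stream_export_csv(rows: Iterable[tuple], header: tuple, batch: int = 1000) -> Iterator[str]:
    """Yield CSV text in bounded chunks; never holds the full result set in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for i, row in enumerate(rows, start=1):
        writer.writerow(row)
        if i % batch == 0:
            yield buf.getvalue()   # flush one bounded chunk to the response
            buf.seek(0)
            buf.truncate(0)
    if buf.tell():
        yield buf.getvalue()       # trailing partial chunk
```

In FastAPI this generator would be wrapped in a StreamingResponse, so peak memory is one batch of rows regardless of catalog size.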

Analytics query routing to read replica: Persona B and F analytics queries (simulation comparison, historical validation, bulk export) are I/O intensive and must not compete with operational read paths on the primary TimescaleDB instance during active TIP events. Route to the Patroni standby:

def get_db(write: bool = False, analytics: bool = False) -> AsyncSession:
    if write:
        return AsyncSession(primary_engine)
    if analytics:
        return AsyncSession(replica_engine)  # Patroni standby
    return AsyncSession(primary_engine)      # operational reads: primary (avoids replica lag)

Monitor replication lag: if replica lag > 30s, log a warning and redirect analytics queries to primary.
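
The lag-aware fallback is a small pure decision on top of get_db(); however lag is measured (e.g. from pg_stat_replication on the primary), the routing rule reduces to this sketch (names mirror the snippet above and are illustrative):

```python
REPLICA_LAG_LIMIT_S = 30.0

def choose_engine(write: bool, analytics: bool, replica_lag_s: float) -> str:
    """Pick the engine for a request; fall back to primary when the standby lags."""
    if write:
        return "primary"
    if analytics:
        if replica_lag_s > REPLICA_LAG_LIMIT_S:
            # here: log a warning that analytics traffic is redirected to primary
            return "primary"
        return "replica"
    return "primary"  # operational reads stay on primary to avoid replica lag
```

Keeping the decision pure makes the 30s threshold trivially unit-testable and easy to tune per deployment tier.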

Query plan baseline: Add to Phase 1 setup: run EXPLAIN (ANALYZE, BUFFERS) on the primary CZML query with 100 objects and record the output in docs/query-baselines/. Re-run at Phase 3 load test and compare — if planning time or execution time has increased > 2×, investigate index bloat or chunk count growth before the load test proceeds.


17. Validation Strategy

17.0 Test Standards and Strategy (F1-F3, F5, F7, F8, F10, F11)

Test Taxonomy (F2)

Three levels — every developer must know which level a new test belongs to before writing it:

| Level | Definition | I/O boundary | Tool | Location |
|-------|------------|--------------|------|----------|
| Unit | Single function or class; all dependencies mocked or stubbed | No I/O | pytest | tests/unit/ |
| Integration | Multiple components; real PostgreSQL + Redis; no external network | Real DB, no internet | pytest + testcontainers | tests/integration/ |
| E2E | Full stack including browser; Celery worker running; real DB | Full stack | Playwright | e2e/ |

Rules:

  • Physics algorithm tests (SGP4, MC, Pc) are unit tests — pure functions, no DB
  • HMAC signing, RLS isolation, and rate-limit tests are integration tests — require a real DB transaction
  • Alert delivery, WebSocket flow, and NOTAM draft UI are E2E tests
  • A test that mocks the database is a unit test regardless of what it is testing — name it accordingly

Coverage Standard (F1)

| Scope | Tool | Minimum threshold | CI gate |
|-------|------|-------------------|---------|
| Backend line coverage | pytest-cov | 80% | Fail below threshold |
| Backend branch coverage | pytest-cov --branch | 70% | Fail below threshold |
| Frontend line coverage | Jest --coverage | 75% | Fail below threshold |
| Safety-critical paths | pytest -m safety_critical | 100% (all pass, none skipped) | Always blocking |

# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=app --cov-branch --cov-fail-under=80 --cov-report=term-missing"

[tool.coverage.run]
omit = ["*/migrations/*", "*/tests/*", "*/__pycache__/*"]

Coverage is measured on the integration test run (not unit-only) so that database-layer code paths are included. Coverage reports are uploaded to CI artefacts on every run; a coverage trend chart is required in the Phase 2 ESA submission.

Test Data Management (F3)

Fixtures, not factories for shared reference data: Physics reference cases (TLE sets, re-entry events, conjunction scenarios) are committed JSON files in docs/validation/reference-data/. Tests load them as pytest fixtures — never fetch from the internet at test time.

Isolated fixtures for integration tests: Each integration test that writes to the database runs inside a transaction that is rolled back at teardown. No shared mutable state between tests:

from sqlalchemy.orm import Session

@pytest.fixture
def db_session(engine):
    with engine.connect() as conn:
        txn = conn.begin()
        session = Session(bind=conn)  # session joined to the external transaction
        yield session
        session.close()
        txn.rollback()  # all writes from this test disappear

Time-dependent tests: Any test that checks TLE age, token expiry, or billing period uses freezegun to freeze time to a known epoch. Tests must never rely on datetime.utcnow() producing a particular value:

from freezegun import freeze_time

@freeze_time("2026-01-15T12:00:00Z")
def test_tle_age_degraded_warning():
    # TLE epoch is 2026-01-08 → age = 7 days → expects 'degraded'
    ...
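
The helper exercised by a test like this might look as follows (the function name is illustrative; the 7-day degraded threshold is taken from the comment above):

```python
from datetime import datetime, timedelta, timezone

TLE_DEGRADED_AGE = timedelta(days=7)

def tle_age_status(tle_epoch: datetime, now: datetime) -> str:
    """Classify TLE freshness. `now` is injected explicitly so that freezegun
    (or a plain argument, as here) fully controls the clock."""
    age = now - tle_epoch
    return "degraded" if age >= TLE_DEGRADED_AGE else "nominal"
```

Passing `now` as an argument rather than calling datetime.utcnow() inside the function is what makes the frozen-time test deterministic.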

Sensitive test data: Real NORAD IDs, real Space-Track credentials, and real ANSP organisation names must never appear in committed test fixtures. Use fictional NORAD IDs (90001-90099 are reserved for test objects by convention) and generated organisation names (test-org-{uuid4()[:8]}).

Safety-Critical Test Markers (F8)

All tests that verify safety-critical behaviour carry @pytest.mark.safety_critical. These run on every commit (not just pre-merge) and must all pass before any deployment:

# conftest.py
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "safety_critical: test verifies a safety-critical invariant; always runs; zero tolerance for failure or skip"
    )
# Usage
@pytest.mark.safety_critical
def test_cross_tenant_isolation():
    ...

@pytest.mark.safety_critical
def test_hmac_integrity_failure_quarantines_record():
    ...

@pytest.mark.safety_critical
def test_sub_150km_low_confidence_flag():
    ...

The full list of safety_critical-marked tests is maintained in docs/TEST_PLAN.md (see F11). CI runs pytest -m safety_critical as a separate fast job (target: < 2 minutes) before the full suite.

Physics Test Determinism (F10)

Monte Carlo tests are non-deterministic by default. All MC-based tests seed the random number generator explicitly:

import numpy as np

@pytest.fixture(autouse=True)
def seed_rng():
    """Seed numpy RNG for all physics tests. Produces identical output across runs."""
    np.random.seed(42)
    yield
    # no teardown needed — each test gets a fresh seed via autouse

@pytest.mark.safety_critical
def test_mc_convergence_criterion():
    result = run_mc_decay(tle=TEST_TLE, n=500, seed=42)
    assert result.corridor_area_change_pct < 2.0

The seed value 42 is fixed in tests/conftest.py and must not be changed without updating the baseline expected values. A PR that changes the seed without updating expected values fails the review checklist.

Mutation Testing (F5)

mutmut is run weekly (not on every commit — too slow) against the backend/app/modules/physics/ and backend/app/modules/alerts/ directories. These are the highest-consequence paths.

mutmut run --paths-to-mutate=backend/app/modules/physics/,backend/app/modules/alerts/
mutmut results

Threshold: Mutation score ≥ 70% for physics and alerts modules. Results published to CI artefacts. A score drop of > 5 percentage points between weekly runs creates a mutation-regression GitHub issue automatically.

Test Environment Parity (F7)

The CI test environment must use identical Docker images to production. Enforced by:

  • docker-compose.ci.yml extends docker-compose.yml — same image tags, no overrides to DB version or Redis version
  • TimescaleDB version in CI is pinned to the same tag as production (timescale/timescaledb-ha:pg16-latest is not acceptable — must be timescale/timescaledb-ha:pg16.3-ts2.14.2)
  • make test in CI fails if TIMESCALEDB_VERSION env var does not match the value in docker-compose.yml
  • MinIO is used in CI, not mocked — make test brings up the full service stack including MinIO before running integration tests

ESA Test Plan Document (F11)

docs/TEST_PLAN.md is a required Phase 2 deliverable. Structure:

# SpaceCom Test Plan

## 1. Test levels and tools
## 2. Coverage targets and current status
## 3. Safety-critical test traceability matrix
   | Requirement | Test ID | Test name | Result |
   |-------------|---------|-----------|--------|
   | Sub-150km propagation guard | SC-TEST-001 | test_sub_150km_low_confidence_flag | PASS |
   | Cross-tenant data isolation | SC-TEST-002 | test_cross_tenant_isolation | PASS |
   ...
## 4. Known test limitations
## 5. Test environment specification
## 6. Performance test results (latest k6 run)

The traceability matrix links each safety-critical requirement (drawn from §15, §7.2, §26) to its @pytest.mark.safety_critical test. This is the primary evidence document for ESA software assurance review.


Important: Comparing SGP4 against Space-Track TLEs is circular. All validation uses independent reference datasets.

Reference data location: docs/validation/reference-data/ — committed to the repository and loaded automatically by the test suite. No external downloads required at test time.

How to run all validation suites:

make test                            # runs pytest including all validation suites
pytest tests/test_frame_utils.py -v  # frame transforms only
pytest tests/test_decay/ -v          # decay predictor + backcast comparison
pytest tests/test_propagator/ -v     # SGP4 propagator

How to add a new validation case: Add the reference data to the appropriate JSON file in docs/validation/reference-data/, add a test case in the relevant test module, and document the source in the file's header comment.


17.1 Frame Transformation Validation

| Test | Reference | Pass criterion | Run command |
|------|-----------|----------------|-------------|
| TEME→GCRF transform | Vallado (2013), Table 3-5 | Position error < 1 m; velocity error < 0.001 m/s | pytest tests/test_frame_utils.py::test_teme_gcrf_vallado |
| GCRF→ITRF transform | Vallado (2013), Table 3-4 | Position error < 1 m | pytest tests/test_frame_utils.py::test_gcrf_itrf_vallado |
| ITRF→WGS84 geodetic | IAU SOFA test vectors | Lat/lon error < 1 μrad; altitude error < 1 mm | pytest tests/test_frame_utils.py::test_itrf_geodetic |
| Round-trip WGS84→ITRF→GCRF→ITRF→WGS84 | Self-consistency | Round-trip error < 1e-12 (near machine precision) | pytest tests/test_frame_utils.py::test_roundtrip |
| IERS EOP application | IERS Bulletin A reference values | UT1-UTC error < 1 μs; pole offset error < 0.1 mas | pytest tests/test_frame_utils.py::test_iers_eop |
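
The round-trip criterion is a pure self-consistency check: apply a chain of transforms and their inverses and require near-machine-precision recovery. Illustrated here with a bare z-axis rotation pair standing in for the real frame chain (all names are illustrative):

```python
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def round_trip_error(r_km: np.ndarray, theta: float) -> float:
    """Forward then inverse transform; rotation inverses are transposes."""
    forward = rot_z(theta)
    back = forward.T
    return float(np.linalg.norm(back @ (forward @ r_km) - r_km))
```

The real test_roundtrip replaces the rotation pair with the full WGS84→ITRF→GCRF chain, but the pass criterion is the same: residual indistinguishable from floating-point noise.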

Committed test vectors (Finding 6): The following reference data files must be committed to the repository before any frame transformation or propagation code is merged. Tests are parameterised fixtures that load from these files; they fail (not skip) if a file is absent:

| File | Content | Source |
|------|---------|--------|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | ≥ 3 cases from Vallado (2013) §3.7: input UTC epoch + GCRF position → expected ITRF position, accurate to < 1 m | Vallado (2013), Fundamentals of Astrodynamics, Table 3-4 |
| docs/validation/reference-data/sgp4_propagation_cases.json | ISS (NORAD 25544) and one historical re-entry object: state vector at epoch and after 1h and 24h propagation | STK or GMAT reference propagation |
| docs/validation/reference-data/iers_eop_case.json | One epoch with published IERS Bulletin B UT1-UTC and polar motion values; expected GCRF→ITRF transform result | IERS Bulletin B (iers.org) |

# tests/physics/test_frame_transforms.py
import json
import numpy as np
import pytest
from pathlib import Path

CASES_FILE = Path("docs/validation/reference-data/frame_transform_gcrf_to_itrf.json")

def test_reference_data_exists():
    """Fail hard if committed test vectors are missing — do not skip."""
    assert CASES_FILE.exists(), f"Required reference data missing: {CASES_FILE}"

@pytest.mark.parametrize("case", json.loads(CASES_FILE.read_text()))
def test_gcrf_to_itrf(case):
    result = gcrf_to_itrf(case["gcrf_km"], parse_utc(case["epoch_utc"]))
    assert np.linalg.norm(result - case["expected_itrf_km"]) < 0.001  # 1 m tolerance

Reference data files: docs/validation/reference-data/vallado-sgp4-cases.json and docs/validation/reference-data/iers-frame-test-cases.json.

Operational significance of failure: A frame transform error propagates directly into corridor polygon coordinates. A 1 km error at re-entry altitude produces a ground-track offset of 5-15 km. A failing frame test is a blocking CI failure.


17.2 SGP4 Propagator Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| State vector at epoch | Vallado (2013) test set, 10 objects spanning LEO/MEO/GEO/HEO | Position error < 1 km at epoch; < 10 km after 7-day propagation |
| Epoch parsing | NORAD 2-line epoch format → UTC | Round-trip to 1 ms precision |
| TLE line 1/2 checksum | Modulo-10 algorithm | Pass/fail; corrupted checksum rejected before propagation |

Operational significance of failure: SGP4 position error at epoch > 1 km produces a corridor centred in the wrong place. Blocking CI failure.
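
The modulo-10 checksum named in the table is simple enough to sketch in full: digits contribute their value, a minus sign contributes 1, every other character contributes 0, and the sum modulo 10 must equal the final column of the line:

```python
def tle_checksum(line: str) -> int:
    """Compute the modulo-10 checksum over the first 68 characters of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1  # minus signs count as 1 by convention
    return total % 10

def tle_line_valid(line: str) -> bool:
    """A 69-character TLE line is valid when its checksum matches column 69."""
    return len(line) == 69 and line[68].isdigit() and tle_checksum(line) == int(line[68])
```

Rejecting a corrupted line here, before propagation, is cheap insurance against silently propagating garbage elements.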


17.3 Decay Predictor Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| NRLMSISE-00 density output | Picone et al. (2002) Table 1 reference atmosphere | Density within 1% of reference at 5 altitude/solar activity combinations |
| Historical backcast: p50 error | The Aerospace Corporation observed re-entry database (≥3 events Phase 1; ≥10 events Phase 2) | Median p50 error < 4h for rocket bodies with known physical properties |
| Historical backcast: corridor containment | Same database | p95 corridor contains observed impact in ≥90% of validation events |
| Historical replay: airspace disruption | Long March 5B Spanish airspace closure reconstruction with replay inputs and operator review | Affected FIR/time-window outputs judged operationally plausible and traceable in replay report |
| Air-risk ranking consistency | Documented crossing-scenario corpus (≥10 unique spacecraft/aircraft crossing cases by Phase 2) | Highest-ranked exposure slices remain stable under seed and traffic-density perturbations, or the differences are explained in the validation note |
| Conservative-baseline comparison | Same replay corpus vs. full-FIR or fixed-radius precautionary closure baseline | Refined outputs reduce affected area or duration in a majority of replay cases without undercutting the agreed p95 protective envelope |
| Cross-tool comparison | GMAT (NASA open source), 3 defined test cases | Re-entry time agreement within 1h for objects with identical inputs |
| Monte Carlo statistical consistency | Self-consistency: 500-sample run vs. 1000-sample run on same inputs | p05/p50/p95 agree within 2% (reducing with more samples) |

Reference data files: docs/validation/reference-data/aerospace-corp-reentries.json for decay-only validation and docs/validation/reference-data/reentry-airspace/ for airspace-risk replay cases (Long March 5B, Columbia-derived cloud case, and documented crossing scenarios). GMAT comparison is a manual procedure documented in docs/validation/README.md (GMAT is not run in CI — too slow; comparison run once per major model version).

Operational significance of failure: Decay predictor p50 error > 4h means corridors are offset in time; operators could see a hazard window that doesn't match the actual re-entry. Major model version gate.
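
The statistical-consistency criterion in the table compares percentile triplets between two run sizes. A sketch of the comparison helper (the relative-difference definition, normalised by the run-A window width, is an assumption; the 2% tolerance follows the table):

```python
import numpy as np

def percentile_agreement_pct(samples_a: np.ndarray, samples_b: np.ndarray) -> float:
    """Max disagreement (in %) between p05/p50/p95 of two MC runs, normalised
    by the run-A p05-p95 window width so the metric is scale-free."""
    pa = np.percentile(samples_a, [5, 50, 95])
    pb = np.percentile(samples_b, [5, 50, 95])
    width = pa[2] - pa[0]
    return float(np.max(np.abs(pa - pb)) / width * 100.0)
```

The check would then assert `percentile_agreement_pct(run_500, run_1000) < 2.0`, with both runs seeded as described in §17.0.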


17.4 Breakup Model Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| Fragment count distribution | ESA DRAMA published results for similar-mass objects | Fragment count within 30% of DRAMA reference for a 500 kg object at 70 km |
| Energy conservation at breakup altitude | Internal check | Total kinetic + potential energy conserved within 1% through fragmentation step |
| Casualty area geometry | Hand-calculated reference case | Casualty area polygon area within 10% of analytic calculation |

Operational significance of failure: Breakup model failure does not block Phase 1. It is an advisory failure in Phase 2. Blocking before Phase 3 regulatory submission.


17.5 Security Validation

| Test | Reference | Pass criterion | Blocking? |
|------|-----------|----------------|-----------|
| RBAC enforcement | test_rbac.py (every endpoint, every role) | 403 for insufficient role; 401 for unauthenticated; 0 mismatches | Yes |
| HMAC tamper detection | test_integrity.py (direct DB row modification) | API returns 503 + CRITICAL security_logs entry | Yes |
| Rate limiting | test_auth.py (per-endpoint threshold) | 429 after threshold; 200 after reset window | Yes |
| CSP headers | Playwright E2E | Content-Security-Policy header present on all pages | Yes |
| Container non-root | CI docker inspect check | No container running as root UID | Yes |
| Trivy CVE scan | Trivy against all built images | 0 Critical/High CVEs | Yes |

17.6 Verification Independence (F6 — §61)

EUROCAE ED-153 / DO-278A §6.4 requires that SAL-2 software components undergo independent verification — meaning the person who verifies (reviews/tests) a SAL-2 requirement, design, or code artefact must not be the same person who produced it.

Policy: docs/safety/VERIFICATION_INDEPENDENCE.md

Scope: All SAL-2 components identified in §24.13:

  • physics/ (decay prediction engine)
  • alerts/ (alert generation pipeline)
  • HMAC integrity verification functions
  • CZML corridor generation and frame transform

Implementation in GitHub:

# .github/CODEOWNERS
# SAL-2 components require an independent reviewer (not the PR author)
/backend/app/physics/     @safety-reviewer
/backend/app/alerts/      @safety-reviewer
/backend/app/integrity/   @safety-reviewer
/backend/app/czml/        @safety-reviewer

The @safety-reviewer team must have ≥1 member who is not the PR author. GitHub branch protection for main must include:

  • require_code_owner_reviews: true for the above paths
  • dismiss_stale_reviews: true (new commits require re-review)
  • SAL-2 PRs require ≥2 approvals (one of which must be from @safety-reviewer)

Verification traceability: The PR review record (GitHub PR number + reviewer + approval timestamp) serves as evidence for verification independence in the safety case (§24.12 E1.1). This record is referenced in the MoC document (§24.14 MOC-002).

Who qualifies as an independent reviewer for SAL-2: Any engineer who:

  1. Did not write the code being reviewed
  2. Has sufficient domain knowledge to evaluate correctness (orbital mechanics familiarity for physics/; alerting logic familiarity for alerts/)
  3. Is designated in the @safety-reviewer GitHub team

Before ANSP shadow activation, the safety case custodian confirms that all SAL-2 components committed in the release have a documented independent reviewer.


18. Additional Physics Considerations

| Topic | Why It Matters | Phase |
|-------|----------------|-------|
| Solar radiation pressure (SRP) | Dominates drag above ~800 km for high A/m objects | Phase 1 (decay predictor) |
| J2-J6 geopotential | J2 alone: ~7°/day RAAN error | Phase 1 (decay predictor) |
| Attitude and tumbling | Drag coefficient 2-3× different; captured via B* Monte Carlo | Phase 2 |
| Lift during re-entry | Non-spherical fragments: tens of km cross-track shift | Phase 2 (breakup) |
| Maneuver detection | Active satellites maneuver; TLE-to-TLE ΔV estimation | Phase 2 |
| Ionospheric drag | Captured via NRLMSISE-00 ion density profile | Phase 1 (via model) |
| Re-entry heating uncertainty | Emissivity/melt temperatures poorly known for debris | Phase 2 |

19. Development Phases — Detailed

Phase 1: Analytical Prototype (Weeks 1-10)

Goal: Real object tracking, decay prediction with uncertainty quantification, functional Persona A/B interface. Security infrastructure fully in place before any other feature ships.

Week Backend Deliverable Frontend Deliverable Security / SRE Deliverable
1-2 FastAPI scaffolding, Alembic migrations, Docker Compose with Tier 2 service topology. frame_utils.py, time_utils.py. IERS EOP refresh + SHA-256 verify. Append-only DB triggers. HMAC signing infrastructure. Liveness + readiness probes on all services. GET /healthz, GET /readyz with DB + Redis checks. Dead letter queue for Celery. task_acks_late, task_reject_on_worker_lost configured. Celery queue routing (ingest vs simulation). celery-redbeat configured. Legal/compliance: users table tos_accepted_at/tos_version/tos_accepted_ip/data_source_acknowledgement fields. First-login ToS/AUP/Privacy Notice acceptance flow (blocks access until all accepted). SBOM generated via syft; CesiumJS commercial licence verified. Privacy Notice drafted and published. Next.js scaffolding. Root layout: nav, ModeIndicator, AlertBadge, JobsPanel stub. Dark mode + high-contrast theme. CSP and security headers via Next.js middleware. ToS/AUP acceptance gate on first login (blocks dashboard until accepted). RBAC schema + require_role(). JWT RS256 + httpOnly cookies. MFA (TOTP). Redis AUTH + ACLs. MinIO private buckets. Docker network segmentation. Container hardening. git-secrets. Bandit + ESLint security in CI. Trivy. Dependency pinning. Dependabot. security_logs + sanitising formatter. Docker Compose depends_on: condition: service_healthy wired. Documentation: docs/ directory tree created; AGENTS.md committed; initial ADRs for JWT, dual frontend, Monte Carlo chord, frame library; docs/runbooks/TEMPLATE.md + index; CHANGELOG.md first entry; docs/validation/reference-data/ with Vallado and IERS cases; docs/alert-threshold-history.md initial entry. 
DevOps/Platform: self-hosted GitLab CI pipeline (lint, test-backend, test-frontend, security-scan, build-and-push jobs); multi-stage Dockerfiles for all services; .pre-commit-config.yaml with all six hooks; .env.example committed with all variables documented; Makefile with dev, test, migrate, seed, lint, clean targets; Docker layer + pip + npm build cache configured; sha-<commit> image tagging in the GitLab container registry in place. Prometheus metrics: spacecom_active_tip_events, spacecom_tle_age_hours, spacecom_hmac_verification_failures_total instrumented.
3-4 Catalog module: object CRUD, TLE import. TLE cross-validation. ESA DISCOS import. Ingest Celery Beat (celery-redbeat). Hardcoded URLs, SSRF-mitigated HTTP client. WAL archiving configured. Daily backup Celery task. TimescaleDB compression policy on orbits. Retention policy scaffolded. Object Catalog page. DataConfidenceBadge. Object Watch page stub. Rate limiting (slowapi). Simulation parameter range validation. Prometheus: spacecom_ingest_success_total, spacecom_ingest_failure_total per source. AlertManager rule: consecutive ingest failures → warning.
5-6 Space Weather: NOAA SWPC + ESA SWS cross-validation. operational_status string. TIP message ingestion. Prometheus: spacecom_prediction_age_seconds per NORAD ID. Readiness probe: TLE staleness + space weather age checks. SpaceWeatherWidget. Alert taxonomy: CRITICAL banner, NotificationCentre, AcknowledgeDialog. Degraded mode banner (reads readyz 207 response). alert_events append-only verified. Alert rate-limit and deduplication. Alert storm detection. AlertManager rule: spacecom_active_tip_events > 0 AND prediction_age > 3600 → critical.
7-8 Catalog Propagator (SGP4): TEME→GCRF, CZML (J2000). Ephemeris caching. Frame transform validation. All CZML strings HTML-escaped. MC chord architecture: run_mc_decay_prediction → group(run_single_trajectory) → aggregate_mc_results. Chord result backend (Redis) sized. Globe: real object positions, LayerPanel, clustering, urgency symbols. TimelineStrip. Live mode scrub. WebSocket auth: cookie-based; connection limit. WS ping/pong. Prometheus: spacecom_simulation_duration_seconds histogram.
9-10 Decay Predictor: RK7(8) + NRLMSISE-00 + Monte Carlo chord. HMAC-signed output. Immutability triggers. Corridor polygon generation. Re-entry API. Validate against ≥3 historical re-entries. Monthly restore test Celery task implemented. Mode A (Percentile Corridors). Event Detail: PredictionPanel with p05/p50/p95, HMAC status badge. TimelineGantt. Operational Overview. UncertaintyModeSelector (B/C greyed). HMAC tamper detection E2E test. All-clear TIP cross-check guard. First backup restore test executed and passing. spacecom_simulation_duration_seconds p95 verified < 240s on Tier 2 hardware.

Phase 2: Operational Analysis (Weeks 11-22)

Week Backend Deliverable Frontend Deliverable Security / Regulatory
11-12 Atmospheric Breakup: aerothermal, fragments, ballistic descent, casualty area. Fragment impact points on globe. Fragment detail panel. OWASP ZAP DAST against staging.
13-14 Conjunction: all-vs-all screening, Alfano probability. Conjunction events on globe. ConjunctionPanel. STRIDE threat model reviewed for Phase 2 surface.
15-16 Upper/Lower Atmosphere. Hazard module: fused zones, HMAC-signed, immutable, shadow_mode flag. Mode B (Probability Heatmap): Deck.gl. UncertaintyModeSelector unlocks Mode B. RLS multi-tenancy integration tests. Shadow records excluded from operational API (integration test).
17-18 Airspace: FIR/UIR load, PostGIS intersection. Airspace impact table. NOTAM Drafting: ICAO format, notam_drafts table, mandatory disclaimer. Shadow mode admin toggle. AirspaceImpactPanel. NOTAM draft flow: NotamDraftViewer, disclaimer banner, review/cancel. 2D Plan View. ViewToggle. /airspace page. ShadowBanner + ShadowModeIndicator. Regulatory disclaimer verified present on all NOTAM drafts. axe-core accessibility audit.
19-20 Report builder: bleach sanitisation, Playwright renderer (isolated, no-network, timeouts, seccomp). MinIO storage. Shadow validation schema + shadow_validations table. ReportConfigDialog, ReportPreview, /reports page. IntegrityStatusBadge. SimulationComparison. ShadowValidationReport scaffold. Renderer: network_mode: none enforced; sanitisation tests passing; 30s timeout verified.
21-22 Space Operator Portal: owned_objects, controlled re-entry planner (deorbit window optimiser), CCSDS export, api_keys table + lifecycle. modules.api with per-key rate limiting. Legal gate: legal opinion commissioned and received for primary deployment jurisdiction; legal_opinions table populated; shadow mode admin toggle wired to shadow_mode_cleared flag. Space-Track AUP redistribution clarification obtained (written confirmation from 18th Space Control Squadron or counsel opinion on permissible use). ECCN classification review commissioned for Controlled Re-entry Planner. GDPR compliance review: data inventory completed, lawful bases documented, DPA template drafted, erasure procedure (handle_erasure_request) implemented. /space portal: SpaceOverview, ControlledReentryPlanner, DeorbitWindowList, ApiKeyManager, CcsdsExportPanel. Shadow mode admin toggle displays legal clearance status. Object ownership RLS policy tested: space_operator cannot access non-owned objects. API key rate limiting verified. API Terms accepted at key creation and recorded. Jurisdiction screening at registration (OFAC/EU/UK sanctions list check).

Phase 3: Operational Deployment (Weeks 23-32)

Week Backend Deliverable Frontend Deliverable Security / Regulatory / SRE
23-24 Alerts module: thresholds, email delivery, geographic filtering, alert_events. Shadow mode: alerts suppressed. ADS-B feed integration: OpenSky Network REST API (https://opensky-network.org/api/states/all); polled every 60s via Celery Beat; flight state vectors stored in adsb_states (non-hypertable; rolling 24h window); route intersection advisory module reads adsb_states to identify flights in re-entry corridors. Air Risk module initialisation: aircraft exposure scoring, time-slice aggregation, and vulnerability banding by aircraft class. Tier 3 HA infrastructure: TimescaleDB streaming replication + Patroni + etcd. Redis Sentinel (3 nodes). 4× simulation workers (64 total cores). Blue-green deployment pipeline wired. Full alert lifecycle UI: geographic filtering, mute rules, acknowledgement audit. Route overlay on globe. AirRiskPanel by FIR/time slice. Route intersection advisory (avoidance boundary only). Legal/regulatory: MSA template finalised by counsel; Regulatory Sandbox Agreement template finalised. First ANSP shadow deployment executed under signed Regulatory Sandbox Agreement and confirmed legal clearance. GDPR breach notification procedure tested (tabletop exercise). Professional indemnity, cyber liability, and product liability insurance confirmed in place. SRE: Patroni failover tested (primary killed; standby promotes; backend reconnects; verify zero lost predictions). Redis Sentinel failover tested. SLO baseline measurements taken on Tier 3 hardware.
25-26 Feedback: prediction vs. outcome. Density scaling recalibration. Maneuver detection. Shadow validation report generation. Historical replay corpus: Long March 5B, Columbia-derived cloud case, and documented crossing-scenario set. Conservative-baseline comparison reporting for airspace closures. Launch safety module. Deployment freeze gate (CI/CD: block deploy if CRITICAL/HIGH alert active). ANSP communication plan implemented (degradation push + email). Incident response runbooks written (DB failover, Celery recovery, HMAC failure, ingest failure). Prediction accuracy dashboard. Historical comparison. ShadowValidationReport. Air-risk replay comparison views. /space Persona F workspace. Launch safety portal. Vault / cloud secrets manager. Secrets rotation. Begin first ANSP shadow mode deployment. SRE: PagerDuty/OpsGenie integrated with Prometheus AlertManager. SEV-1/2/3/4 routing configured. First on-call rotation established.
2728 Mode C binary MC endpoint. Load testing (100 users, <2s CZML p95; MC p95 < 240s). Prometheus + Grafana: three dashboards (Operational Overview, System Health, SLO Burn Rate). Full AlertManager rules. ECSS compliance artefacts: SMP, VVP, PAP, DMP. MinIO lifecycle rules: MC blobs > 90 days → cold tier. Mode C (Monte Carlo Particles). UncertaintyModeSelector unlocks Mode C. Final Playwright E2E suite. Grafana Operational Overview embedded in /admin. External penetration test (auth bypass, RBAC escalation, SSRF, XSS→Playwright, WS auth bypass, data integrity, object ownership bypass, API key abuse). All Critical/High remediated. Load test: SLO p95 targets verified under 100-user concurrent load.
2932 Regulatory acceptance package: safety case framework, ICAO data quality mapping, shadow validation evidence, SMS integration guide. TRL 6 demonstration. Data archival pipeline (Parquet export to MinIO cold before chunk drop). Storage growth verified against projections. ESA bid legal: background IP schedule documented; Consortium Agreement with academic partner signed (IP ownership, publication rights, revenue share); SBOM submitted as part of ESA artefact package. ECCN classification determination received; export screening process in place for all new customer registrations. ToS version updated to reflect any regulatory feedback from first ANSP deployments; re-acceptance triggered. Regulatory submission report type. TRL demonstration artefacts. SOC 2 Type I readiness review. Production runbook + incident response per threat scenario. ECSS compliance review. Monthly restore test passing in CI. Error budget dashboard showing < 10% burn rate.
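The ADS-B ingest described above polls OpenSky's `/states/all` endpoint, which returns each flight state as a positional JSON array. A minimal sketch of turning one raw state row into a typed record, assuming the field ordering documented in the public OpenSky REST API (index 0 icao24, 1 callsign, 5 longitude, 6 latitude, 7 baro_altitude, 8 on_ground, 9 velocity); the `AdsbState` type is illustrative and not the SpaceCom `adsb_states` schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdsbState:
    """One flight state vector from OpenSky /states/all (illustrative type)."""
    icao24: str
    callsign: Optional[str]
    longitude: Optional[float]
    latitude: Optional[float]
    baro_altitude_m: Optional[float]
    velocity_ms: Optional[float]
    on_ground: bool

def parse_state(row: list) -> AdsbState:
    # OpenSky returns each state as a positional array; indices assumed
    # from the public API docs: 0 icao24, 1 callsign, 5 lon, 6 lat,
    # 7 baro_altitude, 8 on_ground, 9 velocity.
    callsign = row[1].strip() if row[1] else None
    return AdsbState(
        icao24=row[0],
        callsign=callsign or None,
        longitude=row[5],
        latitude=row[6],
        baro_altitude_m=row[7],
        velocity_ms=row[9],
        on_ground=bool(row[8]),
    )
```

In the polling task, each parsed record would be upserted into `adsb_states` keyed on `icao24`, with rows older than the 24h window pruned on each cycle.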

20. Key Decisions and Tradeoffs

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Propagator split | SGP4 catalog + numerical decay | SGP4 for everything | SGP4 diverges by days–weeks for re-entry time prediction |
| Numerical integrator | RK7(8) adaptive + NRLMSISE-00 | poliastro Cowell | Direct force model control |
| Frame library | astropy | Manual SOFA Fortran | Handles IERS EOP; well-tested IAU 2006 |
| Atmospheric density | NRLMSISE-00 (P1), JB2008 option (P2) | Simple exponential | Community standard; captures solar cycle |
| Breakup model | Simplified ORSAT-like | Full DRAMA/SESAM | DRAMA requires licensing; simplified recovers ~80% utility |
| Uncertainty visualisation | Three modes, phased (A→B→C), user-selectable | Single fixed mode | Serves different personas; operational users need corridors, analysts need heatmaps |
| JWT algorithm | RS256 (asymmetric) | HS256 (shared secret) | Compromise of one service does not expose signing key to all services |
| Token storage | httpOnly Secure SameSite=Strict cookie | localStorage | XSS cannot read httpOnly cookies; localStorage is trivially exfiltrated |
| Token revocation | DB refresh_tokens table | Redis-only | Revocations survive restarts; enables rotation-chain audit |
| MFA | TOTP (RFC 6238) required for all roles | Optional MFA | Aviation authority context; government procurement baseline |
| Secrets management | Docker secrets (P1 prod) → Vault (P3) | Env vars only | Env vars appear in process listings and crash dumps; no audit trail |
| Alert integrity | Backend-only generation on verified data | Client-triggered alerts | Prevents false alert injection via API |
| Prediction integrity | HMAC-signed, immutable after creation | Mutable with audit log | Tamper-evident at database level; modification is impossible, not just logged |
| Multi-tenancy | RLS at database layer + organisation_id | Application-layer only | DB-level enforcement cannot be bypassed by application bugs |
| Renderer isolation | Separate renderer container, no external network | Playwright in backend container | Limits blast radius of XSS→SSRF escalation |
| Server state | TanStack Query | Zustand for everything | Automatic cache, background refetch; Zustand is not a data cache |
| Navigation model | Task-based (events, airspace, analysis) | Module-based | Users think in tasks, not modules |
| Report rendering | Playwright headless server-side | Client-side canvas | Reliable at print resolution; consistent; not affected by client GPU |
| Monorepo | Monorepo | Separate repos | Small team, shared types, simpler CI |
| ORM | SQLAlchemy 2.0 | Raw SQL | Mature async support; Alembic migrations |
| Domain architecture | Dual front door (aviation + space portal), shared physics core | Single aviation-only product | Space operator revenue stream; ESA bid credibility; space credibility supports aviation trust |
| Space operator object scoping | PostgreSQL RLS on owned_objects join | Application-layer filtering only | DB-level enforcement; prevents application bugs from leaking cross-operator data |
| NOTAM output | Draft only + mandatory disclaimer; never submitted | System-assisted NOTAM submission | SpaceCom is not a NOTAM originator; keeps platform in purely informational role; reduces regulatory approval burden |
| Reroute module scope | Strategic pre-flight avoidance boundary only | Specific alternate route generation | Specific routes require ATC integration and aircraft performance data SpaceCom does not have; avoidance boundary keeps SpaceCom legally defensible |
| Shadow mode | Org-level flag; all alerts suppressed; records segregated | Per-prediction flag | Enables ANSP trial deployments; accumulates validation evidence for regulatory acceptance; segregation prevents operational confusion |
| Controlled re-entry planner output | CCSDS-format manoeuvre plan + risk-scored deorbit windows | Aviation-format only | Space operators submit to national regulators and ops centres in CCSDS; Zero Debris Charter evidence format |
| API access | Separate API keys (not session JWT); per-key rate limiting | Session cookie only | Space operators integrate SpaceCom into operations centres programmatically; API keys are revocable machine credentials |
| MC parallelism model | Celery group + chord (fan-out sub-tasks across worker pool) | multiprocessing.Pool within single task | Chord distributes across all worker containers; Pool limited to one container's cores; chord scales horizontally |
| Worker topology | Two separate Celery pools: ingest and simulation | Single shared queue | Runaway simulation jobs cannot starve TLE ingestion; critical for reliability during active TIP events |
| Celery Beat HA | celery-redbeat (Redis-backed, distributed locking) | Standard Celery Beat (single process) | Beat SPOF means scheduled ingest silently stops; redbeat enables multiple instances with leader election |
| DB HA | TimescaleDB streaming replication + Patroni auto-failover | Single-instance DB | RPO = 0 for critical tables; 15-minute RTO requires automatic failover, not manual |
| Redis HA | Redis Sentinel (3 nodes) | Single Redis | Master failure without Sentinel means all Celery queues and WebSocket pub/sub stop |
| Deployment gate | CI/CD checks for active CRITICAL/HIGH alerts before deploying | Manual judgement | Prevents deployments during active TIP events; protects operational continuity |
| MC blade sizing | 16 vCPU per simulation worker container | Smaller containers | MC chord sub-tasks fill all available cores; below 16 cores p95 SLO of 240s is not met |
| Temporal uncertainty display | Plain window range ("08h–20h from now / most likely ~14h") for Persona A/C; p05/p50/p95 UTC for Persona B | ± Nh notation everywhere | ± implies symmetric uncertainty which re-entry distributions are not; window range is operationally actionable |
| Space weather impact communication | Operational buffer recommendation ("+2h beyond 95th pct") rather than % deviation | Percentage string | Percentage is meaningless without a known baseline; buffer hours are immediately usable by an ops duty manager |
| TLS termination | Caddy with automatic ACME (internet-facing) / internal CA (air-gapped) | nginx + manual certs | Caddy handles cert lifecycle automatically; decision tree in §34 |
| Pagination | Cursor-based (created_at, id) | Offset-based | Offset degrades to full-table scan at 7-year retention depth; cursor is O(1) regardless of dataset size |
| CZML delta protocol | ?since=<iso8601> parameter; max 5 MB full payload; X-CZML-Full-Required header on stale client | Full catalog always | 100-object catalog at 1-min cadence is ~10–50 MB/hr per connected client without delta; delta reduces this to <500 KB/hr |
| MC concurrency gate | Per-org Redis semaphore; 1 concurrent MC run (Phase 1); 429 + Retry-After on limit | Unbounded fan-out | 5 concurrent MC requests = 2,500 sub-tasks queued; p95 SLO collapses without backpressure |
| TimescaleDB compress_after | 7 days for orbits (not 1 day) | Compress as soon as possible | Compressing hot chunks forces decompress on every write; 1-day compress_after causes 50–200 ms write latency thrash |
| Renderer memory limit | mem_limit: 4g Docker cap on renderer container | No memory limit | Chromium print rendering at A4/300DPI consumes 2–4 GB; 4 uncapped renderer instances can OOM a 32 GB node |
| Static asset caching | Cloudflare CDN (internet-facing); nginx sidecar (on-premise) | No CDN | CesiumJS bundle ~5–10 MB; 100 concurrent first-load = 500 MB–1 GB burst without caching |
| WAF/DDoS protection | Upstream provider (Cloudflare/AWS Shield) for internet-facing; network perimeter for air-gapped | Application-layer rate limiting only | Application-layer is insufficient for volumetric attacks; must be at ingress |
| Multi-region deployment | Single region per customer jurisdiction; separate instances, not shared cluster | Active-active multi-region | Data sovereignty; simpler compliance certification; Phase 1–3 customer base doesn't justify multi-region cost |
| MinIO erasure coding | EC:2 (4-node) | EC:4 or RAID | EC:2 tolerates 1 write failure / 2 read failures; balanced between protection and storage efficiency at 4 nodes |
| DB connection routing | PgBouncer as single stable connection target | Direct Patroni primary connection | Patroni failover transparent to application; stable DNS target through primary changes |
| Egress filtering | Host-level UFW/nftables allow-list (Tier 2); Calico/Cilium network policy (Tier 3) | Trust Docker network isolation | Docker isolation is inter-network only; outbound internet egress unrestricted without host-level filtering |
| Mode-switch dialogue | Explicit current-mode + target-mode + consequences listed; Cancel left, destructive action right | Generic "Are you sure?" | Aviation HMI conventions; listed consequences prevent silent simulation-during-live error |
| Future-preview temporal wash | Semi-transparent overlay + persistent label on event list when timeline scrubber is not at current time | No visual distinction | Prevents controller from acting on predicted-future data as though it is current operational state |
| Simulation block during active alerts | Optional org-level disable_simulation_during_active_events flag | Always allow simulation entry | Prevents an analyst accidentally entering simulation while CRITICAL alerts require attention in the same ops room |
| Prediction superseding | Write-once superseded_by FK on reentry_predictions / simulations | Mutable or delete | Preserves immutability guarantee; gives analysts a way to mark outdated predictions without removing the audit record |
| CRITICAL acknowledgement gate | 10-character minimum free-text field; two-step confirmation modal | Single click | Prevents reflexive acknowledgement; creates meaningful action record for every acknowledged CRITICAL event |
| Multi-ANSP coordination panel | Shared acknowledgement status and coordination notes across ANSP orgs on the same event | Out-of-band only | Creates shared digital situational awareness record without replacing voice coordination; reduces risk of conflicting parallel NOTAMs |
| Legal opinion timing | Phase 2 gate (before shadow deployment); not Phase 3 | Phase 3 task | Common law duty of care may attach regardless of UI disclaimers; liability limitation must be in executed agreements before any ANSP relies on the system |
| Commercial contract instruments | Three instruments: MSA + AUP click-wrap + API Terms | Single platform ToS | Each instrument addresses a different access pathway; API access by Persona E/F must have separate terms recorded against the key |
| Shadow mode legal gate | legal_opinions.shadow_mode_cleared must be TRUE before shadow mode can be activated for an org | Admin can enable freely | Shadow deployment is a formal regulatory activity; without a completed legal opinion it exposes SpaceCom to uncapped liability in the deployment jurisdiction |
| GDPR erasure vs. retention | Pseudonymise user references in append-only tables on erasure request; never delete safety records | Hard delete on request | UN Liability Convention requires 7-year retention; GDPR right to erasure is satisfied by removing the link to the individual, not the record itself |
| Space-Track data redistribution | Obtain written clarification from 18th SCS before exposing TLE/CDM data via the SpaceCom API | Assume permissible | Space-Track AUP prohibits redistribution to unregistered parties; violation could result in loss of Space-Track access, disabling the platform's primary data source |
| OSS licence compliance | CesiumJS commercial licence required for closed-source deployment; SBOM generated from Phase 1 | Assume all dependencies are permissively licensed | CesiumJS AGPLv3 requires source disclosure for network-served applications; undiscovered licence violations create IP risk in ESA bid |
| Insurance | Professional indemnity + cyber liability + product liability required before operational deployment | No insurance requirement | Aviation safety context; potential claims from incorrect predictions that inform airspace decisions could exceed SpaceCom's balance sheet without coverage |
| Connection pooling | PgBouncer transaction-mode pooler between all app services and TimescaleDB | Direct connections from app | Tier 3 connection count (2× backend + 4× workers + 2× ingest) exceeds max_connections=100 without a pooler; Patroni failover updates only PgBouncer |
| Redis eviction policy | noeviction for Celery/redbeat (separate DB index); allkeys-lru for application cache | Single Redis with one policy | Broker message eviction causes silent job loss; cache eviction is acceptable |
| Bulk export implementation | Celery task → MinIO → presigned URL (async offload pattern) | Streaming response from API handler | Full catalog export can be gigabytes; materialising in API handler risks OOM on the backend container |
| Analytics query routing | Patroni standby replica for Persona B/F analytics; primary for operational reads | All reads to primary | Analytics queries during a TIP event would compete with operational reads on the primary; standby already provisioned at Tier 3 |
| SQLAlchemy lazy loading | lazy="raise" on all relationships | Default lazy loading | Async SQLAlchemy silently blocks the event loop on lazy-loaded relationships; raise converts silent N+1s into loud development-time errors |
| CZML cache strategy | Per-object fragment cache + full catalog assembly; TTL keyed to last propagation job | No cache; query DB on each request | CZML catalog fetch at 100 objects = 864k rows; uncached this misses the 2s p95 SLO under concurrent load |
| Hypertable chunk interval (orbits) | 1-day chunks (not default 7-day) | Default 7-day | 72h CZML query spans 3 × 1-day chunks; spans 11 × 7-day chunks — chunk exclusion is far less effective with the default |
| Continuous aggregate for F10.7 81-day avg | TimescaleDB continuous aggregate space_weather_daily | Compute from raw rows per request | At 100 concurrent users, 100 identical scans of 11,664 raw rows; continuous aggregate reduces this to a single-row lookup |
| CI/CD orchestration | GitHub Actions | Jenkins / GitLab CI | Project is GitHub-native; Actions has OIDC → GHCR; no separate CI server to operate |
| Container image tags | sha-<commit> as canonical immutable tag; semantic version alias for releases | latest tag only | latest is mutable and non-reproducible; sha-<commit> gives exact traceability from deployed image back to source commit |
| Multi-stage Docker builds | Builder stage (full toolchain) + runtime stage (distroless/slim) | Single-stage with all tools | Eliminates build toolchain, compiler, and dev dependencies from production image; typically reduces image size by 60–80% |
| Local dev hot-reload | Backend: FastAPI --reload via bind-mounted ./backend volume; Frontend: Next.js Vite HMR | Rebuild container on change | Full container rebuild per code change adds 30–90 s per iteration; volume mount + process reload is < 1s |
| .env.example contract | .env.example with all required variables, descriptions, and stage flags committed to repo; actual .env in .gitignore | Ad-hoc variable discovery from runtime errors | Engineers must be able to run cp .env.example .env and have a working local stack within 15 minutes of cloning |
| Staging environment strategy | main branch continuously deployed to staging via GitHub Actions; production deploy requires manual approval gate after staging smoke tests pass | Manual staging deploys | Reduces time-to-detect integration regressions; staging serves as TRL artefact evidence environment |
| Secrets rotation | Per-secret rotation runbook: Space-Track credentials, JWT signing keys, ANSP tokens; old + new key both valid during 5-minute transition window; security_logs entry required; rotated via Vault dynamic secrets in Phase 3 | Manual rotation with downtime | Aviation context: key rotation must not cause service interruption; zero-downtime rotation is a reliability requirement, not a convenience |
| Build cache strategy | Docker layer cache: cache-from/cache-to targeting GHCR in GitHub Actions; pip wheel cache: actions/cache keyed on requirements.txt hash; npm cache: actions/cache keyed on package-lock.json hash | No cache; full rebuild each push | Without cache, a full rebuild takes 8–12 minutes; with cache, incremental pushes take 2–3 minutes — critical for CI as a useful merge gate |
| Image retention policy | Tagged release images kept indefinitely; untagged/orphaned images purged weekly via GHCR lifecycle policy; staging images retained 30 days; dev branch images retained 7 days | No policy; manual cleanup | Unmanaged GHCR storage grows unboundedly; stale images also represent unaudited CVE surface |
| Pre-commit hook completeness | Six hooks: detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff | git-secrets only | git-secrets scans only for known secret patterns; detect-secrets uses entropy analysis; hadolint prevents insecure Dockerfile patterns; sqlfluff catches migration anti-patterns before code review |
| alembic check in CI | CI job runs alembic check to detect SQLAlchemy model/migration divergence; fails if models have unapplied changes | Only run migrations, no divergence check | SQLAlchemy models can diverge from migrations silently; alembic check catches the gap before it reaches production |
| FIR boundary data source | EUROCONTROL AIRAC (ECAC states) + FAA Digital-Terminal Procedures (US) + OpenAIP (fallback); 28-day update cadence | Manually curated GeoJSON, updated ad hoc | FIR boundaries change on AIRAC cycles; stale boundaries produce wrong airspace intersection results during live TIP events |
| ADS-B data source | OpenSky Network REST API (Phase 3 MVP); commercial upgrade path to Flightradar24 or FAA SWIM ADS-B if required | Direct receiver hardware | OpenSky is free, global, and sufficient for route overlay and intersection advisory; commercial upgrade only if coverage gaps identified in ANSP trials |
| CCSDS OEM reference frame | GCRF (Geocentric Celestial Reference Frame); time system UTC; OBJECT_ID = NORAD catalog number; missing international designator populated as UNKNOWN | ITRF or TEME | GCRF is the standard output of SpaceCom's frame transform pipeline; downstream mission control tools expect GCRF for propagation inputs |
| CCSDS CDM field population | SpaceCom populates: HEADER, RELATIVE_METADATA, OBJECT1/2 identifiers, state vectors, covariance (if available); fields not held by SpaceCom emitted as N/A per CCSDS 508.0-B-1 §4.3 | Omit empty fields | N/A is the CCSDS-specified sentinel for unknown values; silent omission causes downstream parser failures |
| CDM ingestion display | Space-Track CDM Pc displayed alongside SpaceCom-computed Pc with explicit provenance labels; > 10× discrepancy triggers DATA_CONFIDENCE warning on conjunction panel | Show only one value | Space operators need both values; discrepancy without explanation erodes trust in both |
| WebSocket event schema | Typed event envelope with type discriminator, monotonic seq, and ts; reconnect with ?since_seq= replay of up to 200 events / 5-minute ring buffer; resync_required on stale reconnect | Schema-free JSON stream | Untyped streams require every consumer to reverse-engineer the schema; schema enables typed client generation |
| Alert webhook delivery | At-least-once POST to registered HTTPS endpoint; HMAC-SHA256 signature; 3 retries with exponential backoff; degraded status after 3 failures; auto-disable after 10 consecutive failures | WebSocket / email only | ANSPs with existing dispatch infrastructure (AFTN, internal webhook receivers) cannot integrate via browser WebSocket; webhooks are the programmatic last-mile |
| API versioning | /api/v1 base; breaking changes require /api/v2 parallel deployment; 6-month support overlap; Deprecation / Sunset headers (RFC 8594); 3-month written notice to API key holders | No versioning policy; breaking changes deployed ad hoc | Space operators building operations centre integrations need stable contracts; silent breaking changes disable their integrations |
| SWIM integration path | Phase 2: GeoJSON structured export; Phase 3: FIXM review + EUROCONTROL SWIM-TI AMQP publish endpoint | Not applicable | European ANSP procurement increasingly requires SWIM compatibility; GeoJSON export is low-cost first step; full SWIM-TI is Phase 3 |
| Space-Track API contract test | Integration test asserts expected JSON keys present in Space-Track response; ingest health alert fires after 4 consecutive hours with 0 successful Space-Track records | No contract test; breakage discovered at runtime | Space-Track API has had historical breaking changes; silent format change means ingest returns no data while health metrics appear normal |
| TLE checksum validation | Modulo-10 checksum on both lines verified before DB write; BSTAR range check; failed records logged to security_logs type INGEST_VALIDATION_FAILURE | Accept TLE at face value | Corrupted TLEs (network errors, encoding issues) would propagate incorrect state vectors without validation |
| Model card | docs/model-card-decay-predictor.md maintained alongside the model; covers validated orbital regime envelope, known failure modes, systematic biases, and performance by object type | Accuracy statement only in §24.3 | Regulators and ANSPs require a documented operational envelope, not just a headline accuracy figure; ESA TRL artefact requirement |
| Historical backcast selection | Validation report explicitly documents selection criteria, identifies underrepresented object categories, and states accuracy conditional on object type | Single unconditional accuracy figure | Observable re-entry population is biased toward large well-tracked objects; publishing an unconditional accuracy figure misrepresents model generalisation |
| Out-of-distribution detection | ood_flag = TRUE and ood_reason set at prediction time if any input falls outside validated bounds; UI shows mandatory warning callout | Serve all predictions identically | NRLMSISE-00 calibration domain does not include tumbling objects, very high area-to-mass ratio, or objects with no physical property data |
| Prediction staleness warning | prediction_valid_until = p50_reentry_time - 4h; UI warns independently of system-level TLE staleness if NOW() > prediction_valid_until and not superseded | No time-based staleness on predictions | An hours-old prediction for an imminent re-entry has implicitly grown uncertainty; operators need a signal independent of the system health banner |
| Alert threshold governance | Thresholds documented with rationale; change approval requires engineering lead sign-off + shadow-mode validation period; change log maintained in docs/alert-threshold-history.md | Thresholds set in code with no governance | CRITICAL trigger (window < 6h, FIR intersection) has airspace closure consequences; undocumented threshold changes cannot be reviewed by regulators or ANSPs |
| FIR intersection auditability | alert_events.fir_intersection_km2 and intersection_percentile recorded at alert generation; UI shows "p95 corridor intersects ~N km² of FIR XXXX" | Alert log shows only "intersects FIR XXXX" | Intersection without area and percentile context is not auditable; regulators and ANSPs need to know how much intersection triggered the alert |
| Recalibration governance | Recalibration requires hold-out validation dataset, minimum accuracy improvement threshold, sign-off authority, rollback procedure, and notification to ANSP shadow partners | Recalibration run and deployed without gates | Unchecked recalibration can silently degrade accuracy for object types not in the calibration set |
| Model version governance | Changes classified as patch/minor/major; major changes require active prediction re-runs with supersession + ANSP notification; rollback path documented | No governance; model updated silently | A major model version change producing materially different corridors without re-running active predictions creates undocumented divergence between what ANSPs are seeing and current best predictions |
| Adverse outcome monitoring | prediction_outcomes table records observed re-entry outcomes against predictions; quarterly accuracy report generated from feedback pipeline; false positive/negative rates in Grafana | No post-deployment accuracy tracking | Without outcome monitoring SpaceCom cannot demonstrate performance within acceptable bounds to regulators; shadow validation reports are episodic, not continuous |
| Geographic coverage annotation | FIR intersection results carry data_coverage_quality flag per FIR; OpenAIP-sourced boundaries flagged as lower confidence | All FIR intersections treated equally | AIRAC coverage varies by region; operators in non-ECAC regions receive lower-quality intersection assessments without knowing it |
| Public transparency report | Quarterly aggregate accuracy/reliability report published (no personal data); covers prediction count, backcast accuracy, error rates, known limitations | No public reporting | Civil aviation safety tools operate in a regulated transparency environment; ESA bid credibility and regulatory acceptance require demonstrable performance |
| docs/ directory structure | Canonical tree defined in §12.1; all documentation files live at known paths committed to the repo | Ad-hoc file creation by individual engineers | Documentation that exists only in prose references gets created inconsistently or not at all |
| Architecture Decision Records | MADR-format ADRs in docs/adr/; one per consequential decision in §20; linked from relevant code via inline comment | §20 table in master plan only | Engineers working in the repo cannot find decision rationale without reading a 5000-line plan document |
| OpenAPI documentation standard | Every public endpoint has summary, description, tags, and at least one responses example; enforced by CI check | Auto-generated stubs only | Auto-generation produces syntactically correct docs that are useless to API integrators (Persona E/F) |
| Runbook format | Standard template in docs/runbooks/TEMPLATE.md; required sections: Trigger, Severity, Preconditions, Steps, Verification, Rollback, Notify; runbook index maintained | Free-form runbooks written ad-hoc | Runbooks written under pressure without a template consistently omit the rollback and notification steps |
| Docstring standard | Google-style docstrings required on all public functions in propagator/, reentry/, breakup/, conjunction/, integrity.py; parameters include physical units | No docstring requirement | Physics functions without units and limitations documented cannot be reviewed or audited by third-party evaluators for ESA TRL |
| Validation procedure | §17 specifies reference data location, run commands, pass/fail tolerances per suite; docs/validation/README.md describes how to add new cases | Checklist of what to validate without procedure | A third party cannot reproduce the validation without knowing where the reference data is and what tolerance constitutes a pass |
| User documentation | Phase 2 delivers aviation portal guide + API quickstart; Phase 3 delivers space portal guide + in-app contextual help; stored in docs/user-guides/ | No user documentation | ANSP SMS acceptance requires user documentation; aviation operators cannot learn an unfamiliar safety tool from the UI alone |
| CHANGELOG.md format | Keep a Changelog conventions; human-maintained; one entry per release with Added/Changed/Deprecated/Removed/Fixed/Security sections | No format specified | Changelogs written by different engineers without a format are unusable by operators and regulators |
| AGENTS.md | Project-root file defining behaviour guidance for AI coding agents; specifies codebase conventions, test requirements, and safety-critical file restrictions; committed to repo | Untracked file, undefined purpose | An undocumented AGENTS.md is either ignored or followed inconsistently, undermining its purpose |
| Test documentation | Module docstrings on physics/security test files state the invariant, reference source, and operational significance of failure; docs/test-plan.md lists all suites with scope and blocking classification | No test documentation requirement | ECSS-Q-ST-80C requires a test specification as a separate deliverable from the test code |
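The TLE checksum validation decision above relies on the standard modulo-10 scheme: digits contribute their face value, a minus sign contributes 1, all other characters (letters, spaces, periods, plus signs) contribute 0, and the sum modulo 10 must equal the final column. A minimal, self-contained check (a sketch, not SpaceCom's actual ingest code):

```python
def tle_checksum_ok(line: str) -> bool:
    """Verify the modulo-10 checksum in the last column of a TLE line.

    Digits contribute their value, '-' contributes 1, everything else
    (letters, spaces, '.', '+') contributes 0.
    """
    body, check = line[:-1], line[-1]
    if not check.isdigit():
        return False
    total = sum(int(c) if c.isdigit() else (1 if c == "-" else 0) for c in body)
    return total % 10 == int(check)

# Line 1 of the widely published ISS example TLE (checksum digit: 7)
line1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"
```

In the ingest path this gate would run on both lines before the DB write, with failures logged as INGEST_VALIDATION_FAILURE; the BSTAR range check mentioned in the table is a separate step.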
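For the alert webhook row, signing and verification with HMAC-SHA256 can be sketched as follows. Binding a timestamp into the MAC is a common replay defence; the timestamp scheme, header names, and the 5-minute skew window are assumptions here, not specified by the plan:

```python
import hmac
import hashlib
import time

def sign_webhook(secret: bytes, body: bytes, ts: int) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>' (illustrative scheme)."""
    msg = str(ts).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, ts: int, sig: str,
                   max_skew_s: int = 300) -> bool:
    """Receiver-side check: reject stale deliveries, then compare MACs."""
    if abs(time.time() - ts) > max_skew_s:
        return False  # stale or replayed delivery
    expected = sign_webhook(secret, body, ts)
    return hmac.compare_digest(expected, sig)  # constant-time comparison
```

`hmac.compare_digest` matters here: a naive `==` comparison leaks timing information that an attacker probing the receiver endpoint could exploit.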
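The cursor-based pagination decision can be illustrated with a keyset cursor over `(created_at, id)`. The encoding and filtering below are an in-memory sketch; in production the filter is a SQL `WHERE (created_at, id) < (:c, :i)` clause, and the opaque cursor format is an assumption:

```python
import base64
import json
from datetime import datetime, timezone

def encode_cursor(created_at: datetime, row_id: int) -> str:
    """Pack the keyset position into an opaque URL-safe token."""
    payload = {"c": created_at.isoformat(), "i": row_id}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(cursor: str) -> tuple:
    payload = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(payload["c"]), payload["i"]

def page(rows, cursor=None, limit=2):
    """Rows are (created_at, id) tuples sorted newest-first.

    SQL equivalent: WHERE (created_at, id) < (:c, :i)
                    ORDER BY created_at DESC, id DESC LIMIT :limit
    """
    if cursor:
        c, i = decode_cursor(cursor)
        rows = [r for r in rows if (r[0], r[1]) < (c, i)]
    out = rows[:limit]
    next_cur = encode_cursor(out[-1][0], out[-1][1]) if len(out) == limit else None
    return out, next_cur
```

Because the filter is an index-backed tuple comparison rather than `OFFSET n`, each page costs the same regardless of how deep into the 7-year retention window the client has scrolled.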
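The MC concurrency gate row describes a per-org semaphore with 429 + Retry-After semantics. An in-process stand-in for the Redis-backed version (the class name, default limit, and Retry-After value are illustrative; production must use Redis so the gate holds across backend replicas):

```python
class McConcurrencyGate:
    """Per-organisation concurrency gate.

    In-process sketch of the Redis semaphore described in the decision
    table; values here are illustrative, not SpaceCom's configuration.
    """

    def __init__(self, limit_per_org: int = 1, retry_after_s: int = 120):
        self.limit = limit_per_org
        self.retry_after_s = retry_after_s
        self._running: dict = {}

    def try_acquire(self, org_id: str):
        """Return (True, {}) if the MC run may start, else (False, headers)
        carrying the Retry-After value for a 429 response."""
        if self._running.get(org_id, 0) >= self.limit:
            return False, {"Retry-After": str(self.retry_after_s)}
        self._running[org_id] = self._running.get(org_id, 0) + 1
        return True, {}

    def release(self, org_id: str) -> None:
        """Called when the chord completes or fails."""
        self._running[org_id] = max(0, self._running.get(org_id, 0) - 1)
```

The API handler would call `try_acquire` before dispatching the Celery chord and `release` in the chord's completion callback, so abandoned runs cannot permanently exhaust an organisation's quota.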
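The WebSocket event schema row specifies a typed envelope with a monotonic `seq`, a 200-event replay ring buffer, and a `resync_required` signal on stale reconnect. A minimal sketch of that server-side mechanism (envelope fields follow the table; class and method names are assumptions):

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

@dataclass
class Envelope:
    type: str      # event discriminator, e.g. "alert.created" (illustrative)
    seq: int       # monotonic sequence number
    ts: str        # ISO-8601 emission time
    payload: dict

class EventBuffer:
    """Ring buffer backing ?since_seq= reconnect replay."""

    def __init__(self, maxlen: int = 200):
        self._buf = deque(maxlen=maxlen)  # oldest events fall off the left
        self._seq = count(1)

    def publish(self, type_: str, payload: dict) -> Envelope:
        env = Envelope(type_, next(self._seq),
                       datetime.now(timezone.utc).isoformat(), payload)
        self._buf.append(env)
        return env

    def replay_since(self, since_seq: int):
        """Return the events the client missed, or None to signal
        resync_required (the requested seq has aged out of the ring)."""
        if self._buf and since_seq < self._buf[0].seq - 1:
            return None
        return [e for e in self._buf if e.seq > since_seq]
```

On `None`, the server would send `resync_required` and the client would refetch current state over REST instead of trusting a gapped event stream.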

21. Definition of Done per Phase

Phase 1 Complete When:

Physics and data:

  • 100+ real objects tracked with current TLE data
  • Frame transformation unit tests pass against IERS/Vallado reference cases (round-trip error < 1 m)
  • SGP4 CZML uses J2000 INERTIAL frame (not TEME)
  • Space weather polled from NOAA SWPC; cross-validated against ESA SWS; operational status widget visible
  • TIP messages ingested and displayed for decaying objects
  • TLE cross-validation flags discrepancies > threshold for human review
  • IERS EOP hash verification passing
  • Decay predictor: ≥3 historical re-entry backcast windows overlap actual events
  • Mode A (Percentile Corridors): p05/p50/p95 swaths render with correct visual encoding
  • TimelineGantt displays all active events; click-to-navigate functional
  • LIVE/REPLAY/SIMULATION mode indicator correct on all pages

Security (all required before Phase 1 is considered complete):

  • RBAC enforced: automated test_rbac.py verifies every endpoint returns 403 for insufficient role, 401 for unauthenticated
  • JWT RS256 with httpOnly cookies; localStorage token storage absent from codebase (grep check in CI)
  • MFA (TOTP) enforced for all roles; recovery codes functional
  • Rate limiting: 429 responses verified by integration tests for all configured limits
  • Simulation parameter range validation: out-of-range values return 400 with clear message
  • Prediction HMAC: tamper test (direct DB row modification) triggers 503 + CRITICAL security_log entry
  • alert_events append-only trigger: UPDATE/DELETE raise exception (verified by test)
  • reentry_predictions immutability trigger: same (verified by test)
  • Redis AUTH enabled; default user disabled; ACL per service verified
  • MinIO: all buckets verified private; direct object URL returns 403; pre-signed URL required
  • Docker: all containers verified non-root (docker inspect check in CI)
  • Docker: network segmentation verified — frontend container cannot reach database port
  • Bandit: 0 High severity findings in CI
  • ESLint security: 0 High findings in CI
  • Trivy: 0 Critical/High CVEs in all container images
  • CSP headers present on all pages; verified by Playwright E2E test
  • axe-core: 0 critical, 0 serious violations on all pages (CI check)
  • WCAG 2.1 AA colour contrast: automated check passes

UX:

  • Globe: object clustering active at global zoom; urgency symbols correct (colour-blind-safe)
  • DataConfidenceBadge visible on all object detail and prediction panels
  • UncertaintyModeSelector visible; Mode B/C greyed with "Phase 2/3" label
  • JobsPanel shows live sample progress for running decay jobs
  • Shared deep links work: /events/{id} loads correct event; globe focuses on corridor
  • All pages keyboard-navigable; modal focus trap verified
  • Report generation: Operational Briefing type functional; PDF includes globe corridor map

Human Factors (Phase 1 items — all required before Phase 1 is considered complete):

  • Event cards display window range notation (Window: Xh–Yh from now / Most likely ~Zh from now); no ± notation appears in operational-facing UI (grep check)
  • Mode-switch dialogue: switching to SIMULATION shows current mode, target mode, and "alerts suppressed" consequence; Cancel left, Switch right; Playwright E2E test verifies dialogue content
  • Future-preview temporal wash: dragging timeline scrubber past current time applies overlay and PREVIEWING +Xh label to event panel; alert badges show "(projected)"; verified by Playwright test
  • CRITICAL acknowledgement: two-step flow (banner → confirmation modal); Confirm button disabled until Action taken field ≥ 10 characters; verified by Playwright test
  • Audio alert: non-looping two-tone chime plays once on CRITICAL alert; stops on acknowledgement; does not play in SIMULATION or REPLAY mode; verified by integration test with audio mock
  • Alert storm meta-alert: > 5 CRITICAL alerts within 1 hour generates Persona D meta-alert with disambiguation prompt (verified by test with synthetic alerts)
  • Onboarding state: new organisation with no FIRs configured sees three-card setup prompt on first login (Playwright test)
  • Degraded mode banner: /readyz 207 response triggers correct per-degradation-type operational guidance text in UI (integration test for each degradation type: space weather stale, TLE stale)
  • superseded_by constraint: setting superseded_by on a prediction a second time raises DB exception (integration test); UI shows ⚠ Superseded banner on any prediction where superseded_by IS NOT NULL
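
The alert-storm meta-alert condition above reduces to a trailing-window count. A minimal sketch, assuming alert timestamps are already available in memory (the real check would query `alert_events`):

```python
from datetime import datetime, timedelta

STORM_THRESHOLD = 5            # meta-alert fires on MORE than 5 CRITICAL alerts
STORM_WINDOW = timedelta(hours=1)


def is_alert_storm(critical_alert_times: list[datetime], now: datetime) -> bool:
    """True when more than STORM_THRESHOLD CRITICAL alerts fall inside the trailing window."""
    recent = [t for t in critical_alert_times if now - STORM_WINDOW <= t <= now]
    return len(recent) > STORM_THRESHOLD


base = datetime(2026, 1, 1, 12, 0)
times = [base + timedelta(minutes=8 * i) for i in range(6)]  # 6 alerts in 40 minutes
assert is_alert_storm(times, base + timedelta(minutes=45))       # storm: 6 > 5 in window
assert not is_alert_storm(times[:3], base + timedelta(minutes=45))  # 3 alerts: no storm
```

Note the strict `>` matches the "> 5 CRITICAL alerts within 1 hour" wording: exactly five alerts does not trigger the Persona D meta-alert.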

Legal / Compliance (Phase 1 items — all required before Phase 1 is considered complete):

  • Space-Track AUP architectural decision gate (Finding 9): Written AUP clarification obtained from 18th Space Control Squadron or legal counsel opinion. docs/adr/0016-space-track-aup-architecture.md committed with Path A (shared ingest) or Path B (per-org credentials) decision recorded and evidenced. Ingest architecture finalised accordingly. This is a blocking Phase 1 decision — ingest code must not be written until the path is decided.
  • ToS / AUP / Privacy Notice acceptance gate: first login blocks dashboard access until all three documents are accepted; users.tos_accepted_at, users.tos_version, users.tos_accepted_ip populated on acceptance (integration test: unauthenticated attempt to skip returns 403)
  • ToS version change triggers re-acceptance: bump tos_version in config; verify existing users are blocked on next login until they re-accept (integration test)
  • CesiumJS commercial licence executed and stored at legal/LICENCES/cesium-commercial.pdf; legal_clearances.cesium_commercial_executed = TRUE; blocking gate for any external demo (§29.11 F1)
  • SBOM generated at build time via syft (SPDX-JSON, container image) + pip-licenses + license-checker-rseidelsohn (dependency manifests); stored in docs/compliance/sbom/ as versioned artefacts; all dependency licences reviewed against legal/OSS_LICENCE_REGISTER.md; CI pip-licenses --fail-on gate includes GPL/AGPL/SSPL; no unapproved licence in transitive closure (§29.11 F2, F10)
  • legal/LGPL_COMPLIANCE.md created documenting poliastro LGPL dynamic linking compliance and PostGIS GPLv2 linking exception (§29.11 F4, F9)
  • legal/LICENCES/timescaledb-licence-assessment.md and legal/LICENCES/redis-sspl-assessment.md created with licence assessment sign-off (§29.11 F5, F6)
  • legal_opinions table present in schema; admin UI shows legal clearance status per org; shadow mode toggle displays warning if shadow_mode_cleared = FALSE
  • GDPR breach notification procedure documented in the incident response runbook; tabletop exercise completed with the engineering team

Infrastructure / DevOps (all required before Phase 1 is considered complete):

  • Docker Compose starts full stack with single command (make dev)
  • make test executes pytest + vitest in one command; all tests pass on a clean clone
  • make migrate runs all Alembic migrations against a fresh DB without error
  • make seed loads fixture data; globe shows test objects on first load
  • .env.example present with all required variables documented; a new engineer can reach a working local stack in ≤ 15 minutes
  • Multi-stage Dockerfiles in place for backend, worker, renderer, and frontend: builder stage uses full toolchain; runtime stage is distroless/slim; docker inspect confirms no build tools (gcc, pip, npm) present in runtime image
  • All containers run as non-root UID (baked in Dockerfile USER directive — not set at runtime); verified by docker inspect check in CI
  • Self-hosted GitLab CI pipeline exists with jobs: lint (pre-commit all hooks), test-backend (pytest), test-frontend (vitest + Playwright), security-scan (Bandit + Trivy + ESLint security), build-and-push (multi-stage build -> GitLab container registry with sha-<commit> tag)
  • .pre-commit-config.yaml committed with all six hooks; CI re-runs all hooks and fails if any fail
  • alembic check step in CI fails if SQLAlchemy models have unapplied changes
  • Build cache: Docker layer cache, pip wheel cache, npm cache all configured in GitLab CI; incremental push CI time < 4 minutes
  • pytest suite: frame utils, integrity, auth, RBAC, propagator, decay, space weather, ingest, API integration
  • Playwright E2E: mode switch, alert acknowledge, CZML render, job progress, report generation, CSP headers
  • Port exposure CI check: scripts/check_ports.py passes with no never-exposed port in a ports: mapping
  • Caddy TLS active on local dev stack with self-signed cert or ACME staging cert; HSTS header present (Strict-Transport-Security: max-age=63072000); TLS 1.1 and below not offered (verified by nmap --script ssl-enum-ciphers)
  • docs/runbooks/egress-filtering.md exists documenting the allowed outbound destination whitelist; implementation method (UFW/nftables) noted
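
The port exposure CI check can be sketched like this. The never-exposed port set and the compose structure are illustrative assumptions; the real `scripts/check_ports.py` would parse the YAML file (e.g. with PyYAML) rather than take a pre-parsed dict.

```python
NEVER_EXPOSED = {5432, 6379, 9000}  # hypothetical internal-only set: DB, Redis, MinIO


def find_violations(compose: dict) -> list[str]:
    """Return 'service:port' for every never-exposed container port found in a ports: mapping."""
    violations = []
    for name, svc in compose.get("services", {}).items():
        for mapping in svc.get("ports", []):
            # mapping may be "host:container" or "host:container/proto"
            container_port = int(str(mapping).split(":")[-1].split("/")[0])
            if container_port in NEVER_EXPOSED:
                violations.append(f"{name}:{container_port}")
    return violations


compose = {
    "services": {
        "backend": {"ports": ["8000:8000"]},
        "timescaledb": {"ports": ["5432:5432"]},  # would fail CI
        "worker-sim": {},                          # no host exposure: fine
    }
}
assert find_violations(compose) == ["timescaledb:5432"]
```

The design point is that internal services should be reachable only over the compose network; any `ports:` mapping for them is a regression the pipeline catches mechanically.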

Performance / Database (Phase 1 items — all required before Phase 1 is considered complete):

  • pgBouncer in Docker Compose; all app services connect via pgBouncer (not directly to TimescaleDB); verified by netstat or connection-source query showing only pgBouncer IPs in pg_stat_activity
  • All required indexes present: orbits_object_epoch_idx, reentry_pred_object_created_idx, alert_events_unacked_idx, reentry_pred_corridor_gist, hazard_zones_polygon_gist, fragments_impact_gist, tle_sets_object_ingested_idx — verified by \d+ or pg_indexes query
  • orbits hypertable chunk interval set to 1 day; space_weather to 30 days; tle_sets to 7 days — verified by timescaledb_information.chunks
  • space_weather_daily continuous aggregate created and policy active; Space Weather Widget backend query reads from the aggregate (verified by EXPLAIN showing space_weather_daily in plan, not raw space_weather)
  • Autovacuum settings applied to alert_events, security_logs, reentry_predictions — verified via pg_class reloptions
  • lazy="raise" set on all SQLAlchemy relationships; test suite passes with no MissingGreenlet or InvalidRequestError exceptions (test suite itself verifies this by accessing relationships without explicit loading — should raise)
  • Redis Celery broker DB index (SELECT 0) has maxmemory-policy noeviction; application cache DB index (SELECT 1) has allkeys-lru — verified by CONFIG GET maxmemory-policy on each DB
  • CZML catalog endpoint: EXPLAIN (ANALYZE, BUFFERS) output recorded in docs/query-baselines/czml_catalog_100obj.txt; p95 response time < 2s verified by load test with 10 concurrent users
  • CZML delta endpoint (?since=) functional: integration test verifies delta response contains only changed objects; X-CZML-Full-Required: true returned when client timestamp > 30 min old
  • Compression policies applied with correct compress_after intervals (see §9.4 table): orbits = 7 days, adsb_states = 14 days, space_weather = 60 days, tle_sets = 14 days — verified by timescaledb_information.jobs
  • Cursor-based pagination: integration test on /reentry/predictions with 200+ rows confirms next_cursor present and second page returns non-overlapping rows; limit=201 returns 400
  • MC concurrency gate: integration test submits two concurrent POST /decay/predict requests from the same organisation; second request returns HTTP 429 with Retry-After header while first is running; first completes normally
  • Renderer Docker memory limit set to 4 GB in docker-compose.yml; docker inspect confirms HostConfig.Memory = 4294967296
  • Bulk export endpoint: integration test with 10,000-row dataset confirms response is a task ID + status URL, not an inline response body
  • tests/load/ directory exists with at least a k6 or Locust scenario for the CZML catalog endpoint; docs/test-plan.md load test section specifies scenario, ramp shape, and SLO assertion
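
The cursor-pagination gate above implies an opaque, order-preserving cursor over a (created_at, id) keyset plus a hard limit cap. A minimal sketch, assuming those two sort columns (the real endpoint may use different keys):

```python
import base64
import json

MAX_LIMIT = 200  # limit=201 must return 400


def encode_cursor(created_at: str, row_id: int) -> str:
    """Opaque cursor over the keyset (created_at, id)."""
    return base64.urlsafe_b64encode(json.dumps([created_at, row_id]).encode()).decode()


def decode_cursor(cursor: str) -> tuple[str, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, row_id


def validate_limit(limit: int) -> int:
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError("400: limit must be between 1 and 200")
    return limit


cursor = encode_cursor("2026-04-17T20:31:37Z", 4821)
assert decode_cursor(cursor) == ("2026-04-17T20:31:37Z", 4821)
```

Keyset pagination (WHERE (created_at, id) < (cursor values) ORDER BY created_at DESC, id DESC) is what makes the "second page returns non-overlapping rows" assertion hold even when new predictions are inserted between page fetches, which OFFSET pagination cannot guarantee.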

Technical Writing / Documentation (Phase 1 items — all required before Phase 1 is considered complete):

  • docs/ directory tree created and committed matching the structure in §12.1; all referenced documentation paths exist (even if files are stubs with "TODO" content)
  • AGENTS.md committed to repo root; contains codebase conventions, test requirements, and safety-critical file restrictions (see §33.9)
  • docs/adr/ contains minimum 5 ADRs for the most consequential Phase 1 decisions: JWT algorithm choice, dual frontend architecture, Monte Carlo chord pattern, frame library choice, TimescaleDB chunk intervals
  • docs/runbooks/TEMPLATE.md committed; docs/runbooks/README.md index lists all required runbooks with owner field; at least db-failover.md, ingest-failure.md, and hmac-failure.md are complete (not stubs)
  • docs/validation/README.md documents how to run each validation suite and where reference data files live; docs/validation/reference-data/ contains Vallado SGP4 cases and IERS frame test cases
  • CHANGELOG.md exists at repo root in Keep a Changelog format; first entry records Phase 1 initial release
  • docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off (required by §24.8)
  • OpenAPI docs: CI check confirms no public endpoint has an empty description field; spot-check 5 endpoints in code review to verify summary and at least one responses example

Ethics / Algorithmic Accountability (Phase 1 items — all required before Phase 1 is considered complete):

  • ood_flag and ood_reason populated at prediction time: integration test with an object whose data_confidence = 'unknown' and no DISCOS physical properties confirms ood_flag = TRUE and ood_reason contains 'low_data_confidence'; prediction is served but UI shows mandatory warning callout above the prediction panel
  • prediction_valid_until field present: verify it equals p50_reentry_time - 4h for a test prediction; UI shows staleness warning when NOW() > prediction_valid_until and prediction is not superseded (Playwright test simulates time travel)
  • alert_events.fir_intersection_km2 and intersection_percentile recorded: synthetic CRITICAL alert with known corridor area confirms both fields populated; UI renders "p95 corridor intersects ~N km² of FIR XXXX" (Playwright test)
  • Alert threshold values documented: docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off
  • prediction_outcomes table exists in schema; POST /api/v1/predictions/{id}/outcome endpoint (requires analyst role) accepts observed re-entry time and source (integration test: unauthenticated attempt returns 401)

Interoperability (Phase 1 items — all required before Phase 1 is considered complete):

  • TLE checksum validation: integration test sends a TLE with deliberately corrupted checksum; verify it is rejected and logged to security_logs type INGEST_VALIDATION_FAILURE; valid TLE with same content but correct checksum is accepted
  • Space weather format contract test: CI integration test against mocked NOAA SWPC response asserts (a) expected top-level JSON keys present (time_tag, flux / kp_index); (b) F10.7 values in physical range 50–350 sfu; (c) Kp values in range 0–90 (NOAA integer format); test is @pytest.mark.contract and runs against mocks in standard CI, against live API in nightly sandbox job
  • Space-Track contract test: integration test against mocked Space-Track response asserts (a) expected JSON keys present for TLE and CDM queries; (b) B* values trigger warning when outside [-0.5, 0.5]; (c) epoch field parseable as ISO-8601; spacecom_ingest_success_total{source="spacetrack"} Prometheus metric > 0 after a live ingest cycle (nightly sandbox only)
  • FIR boundary data loaded: airspace table populated with FIR/UIR polygons for at least the test ANSP region; source documented in ingest/sources.py; AIRAC update date recorded in airspace_metadata table
  • WebSocket event schema: WS /ws/events delivers typed event envelopes; integration test sends a synthetic alert.new event and verifies the client receives {"type": "alert.new", "seq": <n>, "data": {...}}; reconnect with ?since_seq=<n> replays missed event
  • API versioning headers: all API endpoints return Content-Type: application/vnd.spacecom.v1+json; deprecated endpoints (if any) return Deprecation: true and Sunset: <date> headers (verified by Playwright E2E check)
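
The TLE checksum rule referenced above is the standard modulo-10 scheme: sum every digit in the first 68 columns, count each minus sign as 1, and the result modulo 10 must equal the 69th character. A self-contained sketch:

```python
def tle_checksum_valid(line: str) -> bool:
    """TLE checksum: digit sum plus 1 per minus sign over columns 1-68,
    modulo 10, must equal the final (69th) character."""
    body, check = line[:68], line[68]
    total = sum(int(c) for c in body if c.isdigit()) + body.count("-")
    return total % 10 == int(check)


# Widely circulated historical ISS TLE line 1 (checksum digit 7)
line1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"
assert tle_checksum_valid(line1)
assert not tle_checksum_valid(line1[:-1] + "8")  # corrupted checksum rejected
```

In the ingest path, a rejection here should also write the security_logs entry of type INGEST_VALIDATION_FAILURE that the integration test checks for.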

SRE / Reliability (all required before Phase 1 is considered complete):

  • Health probes: /healthz returns 200 on all services; /readyz returns 200 (healthy) or 207 (degraded) as appropriate; Docker Compose depends_on: condition: service_healthy wired for all service dependencies
  • Celery queue routing: integration test confirms ingest.* tasks appear only on ingest queue and propagator.* tasks appear only on simulation queue; no cross-queue contamination possible
  • celery-redbeat schedule persistence: Beat process restart test verifies scheduled jobs survive without duplicate scheduling; Redis key redbeat:* present after restart
  • Crash-safety: kill a worker-sim container mid-task; verify task is requeued (not lost) on worker restart; task_acks_late = True and task_reject_on_worker_lost = True confirmed by log inspection
  • Dead letter queue: a task that exhausts all retries appears in the DLQ; DLQ depth metric visible in Prometheus
  • WAL archiving: pg_basebackup and WAL segments appearing in MinIO db-wal-archive bucket within 10 minutes of first write (verified by bucket list)
  • Daily backup Celery task: backup_database task appears in Celery Beat schedule; execution logged in celery-beat.log; resulting archive object visible in MinIO db-backups bucket
  • TimescaleDB compression policy: orbits compression policy applied; timescaledb_information.jobs shows policy active; manual CALL run_job() compresses at least one chunk
  • Prometheus metrics: spacecom_active_tip_events, spacecom_tle_age_hours, spacecom_hmac_verification_failures_total, spacecom_celery_queue_depth all visible in Prometheus UI with correct labels
  • MC chord distribution: run_mc_decay_prediction fans out 500 sub-tasks; Celery Flower shows sub-tasks distributed across both worker-sim instances (not all on one worker)
  • MC p95 latency SLO: 500-sample MC run completes in < 240s on Tier 1 dev hardware (8 vCPU/32 GB) under load test; documented baseline recorded for Tier 2 comparison
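
The 200-healthy / 207-degraded readiness distinction above can be sketched as a per-feed staleness check. The threshold values are assumptions for illustration; the real limits belong in service configuration.

```python
from datetime import datetime, timedelta

# Hypothetical staleness thresholds; real values live in service config
THRESHOLDS = {"space_weather": timedelta(hours=6), "tle": timedelta(hours=24)}


def readyz(last_ingest: dict[str, datetime], now: datetime) -> tuple[int, list[str]]:
    """200 when every feed is fresh; 207 plus the degradation list otherwise.

    The degradation list is what drives the per-degradation-type guidance
    text in the UI banner.
    """
    degraded = [feed for feed, limit in THRESHOLDS.items()
                if now - last_ingest[feed] > limit]
    return (207, degraded) if degraded else (200, [])


now = datetime(2026, 1, 1, 12, 0)
fresh = {"space_weather": now - timedelta(hours=1), "tle": now - timedelta(hours=2)}
assert readyz(fresh, now) == (200, [])

stale = dict(fresh, space_weather=now - timedelta(hours=9))
assert readyz(stale, now) == (207, ["space_weather"])
```

Returning the list of degraded feeds, not just the status code, is what lets the aviation portal show "space weather stale" guidance rather than a generic degraded banner.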

Phase 2 Complete When:

  • Atmospheric breakup: fragments, casualty areas, fragment globe display
  • Mode B (Probability Heatmap): Deck.gl layer renders; hover tooltip shows probability
  • Conjunction screening: known close approaches identified; Pc computed for ≥1 test case
  • 2D Plan View: FIR boundaries, horizontal corridor projection, altitude cross-section
  • Airspace intersection table: affected FIRs with entry/exit times on Event Detail
  • Hazard zones: HMAC-signed and immutability trigger verified
  • PDF reports: Technical Assessment and Regulatory Submission types functional
  • Renderer container: network_mode: none enforced; sanitisation tests passing; 30s timeout verified
  • OWASP ZAP DAST: 0 High/Critical findings against staging environment
  • RLS multi-tenancy: Org A user cannot access Org B records (integration test)
  • SimulationComparison: two runs overlaid on globe with distinct colours

Phase 2 SRE / Reliability:

  • Monthly restore test: restore_test Celery task executes on schedule; restores latest backup to isolated db-restore-test container; row count reconciliation passes; result logged to security_logs (type RESTORE_TEST)
  • TimescaleDB retention policy: 90-day drop policy active on orbits and space_weather; manual chunk drop test in staging confirms chunks older than 90 days are removed without affecting newer data
  • Archival pipeline: Parquet export Celery task runs before chunk drop; resulting .parquet files visible in MinIO db-archive bucket; spot-check query against archived Parquet returns expected rows
  • Degraded mode UI: stop space weather ingest; confirm /readyz returns 207; confirm StalenessWarningBanner appears in aviation portal within one polling cycle (≤ 60s); restart ingest; confirm banner clears
  • Error budget dashboard: Grafana SRE Error Budgets dashboard shows Phase 2 SLO burn rates for prediction latency and data freshness; alert fires in Prometheus when burn rate exceeds 2× for > 1 hour
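
The burn-rate alert above follows the standard SRE definition: observed error fraction divided by the allowed error fraction, where 1.0 means the budget is consumed exactly on schedule. A minimal sketch (the SLO targets here are illustrative):

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Burn rate = observed error fraction / allowed error fraction.

    1.0 consumes the error budget exactly at the rate the SLO permits;
    2.0 exhausts it in half the window.
    """
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Hypothetical SLO: 99% of predictions within the latency target -> 1% budget
assert abs(burn_rate(10, 1000, 0.99) - 1.0) < 1e-9   # on budget
assert burn_rate(25, 1000, 0.99) > 2.0               # alert if sustained > 1 hour
```

The "2× for > 1 hour" condition in the checklist is a duration qualifier on exactly this quantity; in Prometheus it would be expressed as a `for: 1h` clause on the burn-rate expression.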

Phase 2 Human Factors:

  • Corridor Evolution widget: Event Detail page shows p50 corridor footprint at T+0h/+2h/+4h; auto-updates in LIVE mode; an amber warning appears if the corridor is widening
  • Duty Manager View: toggle on Event Detail collapses to large-text window/FIR/action-buttons only; toggles back to technical detail
  • Response Options accordion: contextualised action checklist visible to operator+ role; checkbox states and coordination notes persisted to alert_events
  • Multi-ANSP Coordination Panel: visible on events where ≥2 registered organisations share affected FIRs; acknowledgement status and coordination notes from each ANSP visible; integration test confirms Org A cannot see Org B coordination notes on unrelated events
  • Simulation block: disable_simulation_during_active_events org setting functional; mode switch blocked with correct modal when unacknowledged CRITICAL alerts exist (integration test)
  • Space weather buffer recommendation: Event Detail shows [95th pct time + buffer] callout when conditions are Elevated or above; buffer computed by backend from F10.7/Kp thresholds (integration test verifies all four threshold bands)
  • Secondary Display Mode: ?display=secondary URL opens chrome-free full-screen operational view; navigation, admin links, and simulation controls not present; CRITICAL banners still appear (Playwright test)
  • Mode C first-use overlay: MC particle animation blocked until user acknowledges one-time explanation overlay; preference stored in user record; never shown again after first acknowledgement
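
The space weather buffer item above maps F10.7/Kp readings to one of four bands, each carrying a buffer added to the 95th-percentile time. The band names, boundaries, and buffer sizes below are hypothetical placeholders; only the backend thresholds referenced in the checklist are authoritative.

```python
# Hypothetical bands: (label, buffer hours added to the 95th-percentile time)
BANDS = [
    ("Quiet",    0.0),
    ("Elevated", 0.5),
    ("High",     1.0),
    ("Severe",   2.0),
]


def buffer_hours(f107: float, kp: float) -> tuple[str, float]:
    """Classify conditions by whichever index (solar flux or geomagnetic) is worse."""
    if f107 >= 200 or kp >= 7:
        idx = 3
    elif f107 >= 160 or kp >= 5:
        idx = 2
    elif f107 >= 120 or kp >= 4:
        idx = 1
    else:
        idx = 0
    return BANDS[idx]


assert buffer_hours(90, 2) == ("Quiet", 0.0)       # no callout shown
assert buffer_hours(130, 3) == ("Elevated", 0.5)   # callout threshold
assert buffer_hours(150, 6) == ("High", 1.0)       # Kp drives the band here
assert buffer_hours(210, 8) == ("Severe", 2.0)
```

Taking the worse of the two indices matters because elevated Kp alone (a geomagnetic storm at moderate solar flux) still inflates drag uncertainty; that is the case the third assertion exercises, and the integration test in the checklist must cover all four bands.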

Phase 2 Performance / Database:

  • FIR intersection query: EXPLAIN (ANALYZE) confirms bounding-box pre-filter (&&) eliminates > 90% of airspace rows before exact ST_Intersects; p95 intersection query time < 200ms with full airspace table loaded
  • Analytics query routing: Persona B/F workspace queries confirmed routing to replica engine via pg_stat_activity source host check; replication lag monitored in Grafana (alert if > 30s)
  • Query plan regression: re-run EXPLAIN (ANALYZE, BUFFERS) on CZML catalog query; compare to Phase 1 baseline in docs/query-baselines/; planning time and execution time increase < 2× (if exceeded, investigate before Phase 3 load test)
  • Hypertable migration: at least one migration involving orbits executed using CREATE INDEX CONCURRENTLY; CI migration timeout gate in place (> 30s fails CI)
  • Query plan regression CI job active: tests/load/check_query_baselines.py runs after each migration in staging; fails if any baseline query execution time increases > 2× vs recorded baseline; PR comment generated with comparison table
  • ws_connected_clients Prometheus gauge reporting per backend instance; Grafana alert configured at 400 (WARNING) — verified by injecting 5 synthetic WebSocket connections and confirming gauge increments
  • Space weather backfill cap: integration test simulates 24-hour ingest gap; verify ingest task logs WARN and backfills only last 6 hours; no duplicate timestamps written; space_weather_daily aggregate remains consistent
  • CDN / static asset caching: bundle-size CI step active; PR comment shows bundle size delta; CI fails if main JS bundle grows > 10% vs. previous build; Caddy cache headers for /_next/static/* set Cache-Control: public, max-age=31536000, immutable
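
The backfill-cap behaviour above is a simple window clamp: if the ingest gap exceeds six hours, backfill only the trailing six hours and log a WARN. A minimal sketch (function and field names are illustrative):

```python
from datetime import datetime, timedelta

BACKFILL_CAP = timedelta(hours=6)


def backfill_window(last_sample: datetime, now: datetime) -> tuple[datetime, bool]:
    """Return (backfill start, gap_truncated).

    Gaps longer than the cap are truncated to the trailing 6 hours; the
    caller should emit a WARN log when gap_truncated is True.
    """
    gap = now - last_sample
    if gap > BACKFILL_CAP:
        return now - BACKFILL_CAP, True
    return last_sample, False


now = datetime(2026, 1, 2, 0, 0)
start, truncated = backfill_window(now - timedelta(hours=24), now)
assert truncated and start == now - timedelta(hours=6)

start, truncated = backfill_window(now - timedelta(hours=2), now)
assert not truncated and start == now - timedelta(hours=2)
```

Writing backfilled rows idempotently (e.g. via ON CONFLICT on the timestamp key) is what keeps the "no duplicate timestamps" and continuous-aggregate-consistency assertions true when the task retries.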

Phase 2 Legal / Compliance:

  • Regulatory classification ADR committed: docs/adr/0012-regulatory-classification.md documents the chosen position (Position A — ATM/ANS Support Tool, non-safety-critical) with rationale; legal counsel has reviewed the position against EASA IR 2017/373; position is referenced in all ANSP service contracts
  • Legal opinion received for primary deployment jurisdiction; legal_opinions table updated with shadow_mode_cleared = TRUE; shadow mode admin toggle no longer shows legal warning for that jurisdiction
  • Space-Track AUP redistribution clarification obtained (written); legal position documented; AUP click-wrap wording updated to reflect agreed terms
  • ESA DISCOS redistribution rights clarified (written): Written confirmation from ESA/ESAC on permissible use of DISCOS-derived properties in commercial API responses and generated reports; if redistribution is not permitted, API response and report templates updated to show source: estimated rather than raw DISCOS values
  • GDPR DPA signed with each shadow ANSP partner before shadow mode begins: DPA template reviewed by counsel; executed DPA on file for each organisation before shadow_mode_cleared is set to TRUE; data processing not permitted for any ANSP organisation without a signed DPA
  • GDPR data inventory documented; pseudonymisation procedure handle_erasure_request() implemented and tested: user deleted → name/email replaced with [user deleted - ID:{hash}] in alert_events/security_logs; core safety records preserved
  • Jurisdiction screening at user registration: sanctioned-country check fires before account creation; blocked attempt logged to security_logs type REGISTRATION_BLOCKED_SANCTIONS
  • MSA template reviewed by aviation law counsel; Regulatory Sandbox Agreement template finalised; first shadow mode deployment covered by a signed Regulatory Sandbox Agreement on file
  • Controlled Re-entry Planner carries in-platform export control notice; data_source_acknowledgement = TRUE enforced before API key issuance (integration test: attempt to create API key without acknowledgement returns 403)
  • Professional indemnity, cyber liability, and product liability insurance confirmed in place before first shadow deployment; certificates stored in MinIO legal-docs bucket
  • Shadow mode exit criteria documented and tooled: docs/templates/shadow-mode-exit-report.md exists; Persona B can generate exit statistics from admin panel; exit to operational use for any ANSP requires written Safety Department confirmation on file before shadow_mode_cleared is set

Phase 2 Technical Writing / Documentation:

  • docs/user-guides/aviation-portal-guide.md complete and reviewed by at least one Persona A representative before first ANSP shadow deployment; covers: dashboard overview, alert acknowledgement workflow, NOTAM draft workflow, degraded mode response
  • docs/api-guide/ complete: authentication.md, rate-limiting.md, webhooks.md, error-reference.md, Python and TypeScript quickstart examples; reviewed by a Persona E/F tester
  • All public functions in propagator/decay.py, propagator/catalog.py, reentry/corridor.py, integrity.py, and breakup/atmospheric.py have Google-style docstrings with parameter units; mypy pre-commit hook enforces no untyped function signatures
  • docs/test-plan.md complete: lists all test suites, physical invariant tested, reference source, pass/fail tolerance, and blocking classification; reviewed by physics lead
  • docs/adr/ contains ≥ 10 ADRs covering all consequential Phase 2 decisions added during the phase
  • All runbooks referenced in the §21 DoD are complete (not stubs): gdpr-breach-notification.md, safety-occurrence-notification.md, secrets-rotation-jwt.md, blue-green-deploy.md, restore-from-backup.md

Phase 2 Ethics / Algorithmic Accountability:

  • Model card published: docs/model-card-decay-predictor.md complete with validated orbital regime envelope, object type performance breakdown, known failure modes, and systematic biases; reviewed by the physics lead before Phase 2 ANSP shadow deployments
  • Backcast validation report: ≥10 historical re-entry events validated; report documents selection criteria, identifies underrepresented object categories (small debris, tumbling objects), and states accuracy conditional on object type — not as a single unconditional figure; stored in MinIO docs bucket
  • Out-of-distribution bounds defined: docs/ood-bounds.md specifies the threshold values for ood_flag triggers (area-to-mass ratio, minimum data confidence, minimum TLE count); CI test confirms all thresholds are checked in propagator/decay.py
  • Alert threshold governance: any threshold change requires a PR reviewed by engineering lead + product owner; docs/alert-threshold-history.md entry created; change must complete a minimum 2-week shadow-mode validation period before deploying to any operational ANSP connection
  • FIR coverage quality flag: airspace table has data_source and coverage_quality columns; intersection results for OpenAIP-sourced FIRs include a coverage_quality: 'low' flag in the API response; UI shows a coverage quality callout for non-AIRAC FIRs
  • Recalibration governance documented: docs/recalibration-procedure.md exists specifying hold-out validation dataset, minimum accuracy improvement threshold (> 5% improvement on hold-out, no regression on any object type category), sign-off authority (physics lead + engineering lead), ANSP notification procedure
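
The recalibration acceptance criterion above combines two independent gates: an overall improvement threshold and a no-regression check per object-type category. A minimal sketch, assuming "> 5%" means relative improvement on the hold-out set (the procedure document would pin this down) and hypothetical category names:

```python
def recalibration_accepted(old_acc: dict[str, float], new_acc: dict[str, float],
                           overall_old: float, overall_new: float) -> bool:
    """Accept only if overall hold-out accuracy improves by > 5% (relative)
    AND no object-type category regresses."""
    improved = (overall_new - overall_old) / overall_old > 0.05
    no_regression = all(new_acc[cat] >= old_acc[cat] for cat in old_acc)
    return improved and no_regression


old = {"payload": 0.80, "rocket_body": 0.75, "debris": 0.60}
new = {"payload": 0.85, "rocket_body": 0.80, "debris": 0.66}
assert recalibration_accepted(old, new, overall_old=0.72, overall_new=0.77)

new_bad = dict(new, debris=0.55)  # one category regresses: release blocked
assert not recalibration_accepted(old, new_bad, overall_old=0.72, overall_new=0.77)
```

The per-category gate is the ethically important one: a model can improve its headline accuracy while getting worse on exactly the underrepresented categories (small debris, tumbling objects) the backcast report flags.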

Phase 2 Interoperability:

  • CCSDS OEM response: GET /space/objects/{norad_id}/ephemeris with Accept: application/ccsds-oem returns a valid CCSDS 502.0-B-3 OEM file; integration test validates all mandatory keyword fields (OBJECT_ID, CENTER_NAME, REF_FRAME=GCRF, TIME_SYSTEM=UTC, START_TIME, STOP_TIME) are present; test parses with a reference CCSDS OEM parser
  • CCSDS CDM export: bulk export includes CDM-format conjunction records; mandatory CDM fields populated; N/A used per CCSDS 508.0-B-1 §4.3 for unknown values; integration test validates with reference CDM parser
  • CDM ingestion display: Space-Track CDM Pc and SpaceCom-computed Pc both visible on conjunction panel with distinct provenance labels; DATA_CONFIDENCE warning fires when values differ by > 10× (integration test with synthetic divergent CDM)
  • Alert webhook: POST /webhooks registers endpoint; synthetic alert.new event POSTed to registered URL within 5s of trigger; X-SpaceCom-Signature header present and verifiable with shared secret; retry fires on 500 response from webhook receiver (integration test with mock server)
  • GeoJSON structured export: GET /events/{id}/export?format=geojson returns valid GeoJSON FeatureCollection; properties includes norad_id, p50_utc, affected_fir_ids, risk_level, prediction_hmac; validates against GeoJSON schema (RFC 7946)
  • ADS-B feed: OpenSky Network integration active; live flight positions overlay on globe in aviation portal; route intersection advisory receives ADS-B flight tracks as input
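
The OEM mandatory-keyword assertion above can be illustrated with a toy keyword scan. This is deliberately not a CCSDS parser (the checklist requires validating with a reference parser); it only shows the shape of the mandatory-field check, and the sample metadata values are invented.

```python
MANDATORY = {"OBJECT_ID", "CENTER_NAME", "REF_FRAME", "TIME_SYSTEM",
             "START_TIME", "STOP_TIME"}


def missing_oem_keywords(oem_text: str) -> set[str]:
    """Return mandatory metadata keywords absent from an OEM file body."""
    present = {line.split("=")[0].strip()
               for line in oem_text.splitlines() if "=" in line}
    return MANDATORY - present


sample = """CCSDS_OEM_VERS = 3.0
META_START
OBJECT_NAME = TEST SAT
OBJECT_ID = 1998-067A
CENTER_NAME = EARTH
REF_FRAME = GCRF
TIME_SYSTEM = UTC
START_TIME = 2026-01-01T00:00:00
STOP_TIME = 2026-01-02T00:00:00
META_STOP"""

assert missing_oem_keywords(sample) == set()
assert missing_oem_keywords(sample.replace("REF_FRAME = GCRF\n", "")) == {"REF_FRAME"}
```

A keyword scan like this is useful as a cheap pre-check in the serialiser's own unit tests; the contract-level guarantee still comes from round-tripping through an independent CCSDS implementation.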

Phase 2 DevOps / Platform Engineering:

  • Staging environment spec documented: resources, data (synthetic only — no production data in staging), secrets set (separate from production), continuous deployment from main branch
  • GitLab staging deploy job: merge to main triggers automatic staging deploy; production deploy requires manual approval in GitLab after staging smoke tests pass
  • OWASP ZAP DAST run against staging in CI pipeline; results reviewed; 0 High/Critical required to unblock production deploy approval
  • Secrets rotation runbooks written for all critical secrets: Space-Track credentials, JWT RS256 signing keypair, MinIO access keys, Redis AUTH password; each runbook includes: who initiates, affected services, zero-downtime rotation procedure, verification step, security_logs entry required
  • JWT RS256 keypair rotation tested without downtime: old public key retained during 5-minute transition window; tokens signed with old key remain valid until expiry; verified by integration test
  • Image retention container-registry lifecycle policy in place: untagged images purged weekly; staging images retained 30 days; dev images retained 7 days; policy verified in registry settings
  • CI observability: GitLab pipeline duration tracked; image size delta posted as merge request comment (fail if > 20% increase); test failure rate visible in CI dashboard
  • alembic check CI gate: no migration added a NOT NULL column without a default in the same step; CI job validates hypertable migrations use CONCURRENTLY (grep check on all new migration files)

Phase 2 Additional Regulatory / Dual Domain Items:

  • Shadow mode: admin can enable/disable per organisation; ShadowBanner displayed on all pages when active; shadow records have shadow_mode = TRUE; shadow records excluded from all operational API responses (integration test)
  • NOTAM drafting: draft generated in ICAO Annex 15 format from any event with FIR intersection; mandatory regulatory disclaimer present (automated test verifies its presence in every draft); stored in notam_drafts
  • Space Operator Portal: space_operator user can view only owned objects (non-owned objects return 404, not 403, to prevent object enumeration); ControlledReentryPlanner functional for has_propulsion = TRUE objects
  • CCSDS export: ephemeris export in OEM format passes CCSDS 502.0-B-3 structural validation
  • API keys: create, use, and revoke flow functional; per-key rate limiting returns 429 at daily limit; raw key displayed only at creation (never retrievable after)
  • TIP message provenance displayed in UI: source label reads "USSPACECOM TIP (not certified aeronautical information)" — not just "TIP Message #N"
  • Data confidence warnings: objects with data_confidence = 'unknown' display a warning callout on all prediction panels explaining the impact on prediction quality

Phase 3 Complete When:

  • Mode C (Monte Carlo Particles): animated trajectories render; click-particle shows params
  • Real-time alerts delivered within 30 seconds of trigger condition
  • Geographic alert filtering: alerts scoped to user's FIR list
  • Route intersection analysis functional against sample flight plans
  • Feedback: density scaling recalibration demonstrated from ≥2 historical re-entries
  • Load test: 100 concurrent users; CZML load < 2s at p95
  • External penetration test completed; all Critical/High findings remediated
  • Full axe-core audit + manual screen reader test (NVDA + VoiceOver) passes
  • Secrets manager (Vault or equivalent) replacing Docker secrets for all production credentials
  • All credentials on rotation schedule; rotation verified without downtime
  • Prometheus + Grafana operational; certificate expiry alert configured
  • Production deployment runbook documented; incident response procedure per threat scenario
  • Security audit log shipping to external SIEM verified
  • Shadow validation report generated for ≥1 historical re-entry event demonstrating prediction accuracy
  • ECSS compliance artefacts produced: Software Management Plan, V&V Plan, Product Assurance Plan, Data Management Plan (required for ESA contract bids)
  • TRL 6 demonstration: system demonstrated in operationally relevant environment with real TLE data, real space weather, and ≥1 ANSP shadow deployment
  • Regulatory acceptance package complete: safety case framework, ICAO Annex 15 data quality mapping, SMS integration guide
  • Legal opinion obtained on operational liability per target deployment jurisdictions (Australia, EU, UK minimum)
  • First ANSP shadow mode deployment active with ≥4 weeks of shadow prediction records

Phase 3 Infrastructure / HA:

  • Patroni configuration validated: scripts/check_patroni_config.py passes confirming maximum_lag_on_failover, synchronous_mode: true, synchronous_mode_strict: true, wal_level: replica, recovery_target_timeline: latest all present in patroni.yml
  • Patroni failover drill: manually kill the primary DB container; verify standby promoted within 30s; backend API continues serving requests (latency spike acceptable; no 5xx errors after 35s); PgBouncer reconnects automatically to new primary
  • MinIO EC:2 verified: 4-node MinIO starts cleanly; integration test writes a 100 MB object; shut down one MinIO node; read succeeds; write succeeds; shut down second node; write fails with expected error; read still succeeds (EC:2 read quorum = 2 of 4)
  • WAF/DDoS protection confirmed in place at ingress (Cloudflare/AWS Shield or equivalent network-level appliance for on-premise); security architecture review sign-off
  • DNS architecture documented: docs/runbooks/dns-architecture.md covers split-horizon zones, PgBouncer VIP, Redis Sentinel VIP, and service discovery records for Tier 3 deployment
  • Backup restore test checklist completed successfully (see §34.5): all 6 checklist items passed within the 30-day window before Phase 3 sign-off
  • TLS certificate lifecycle runbook complete: docs/runbooks/tls-cert-lifecycle.md documents ACME auto-renewal path and internal CA path for air-gapped deployments; cert expiry Prometheus alerts firing at 60/30/7-day thresholds

Phase 3 Performance:

  • Formal load test passed: tests/load/ scenario with k6 or Locust; 100 concurrent users; CZML catalog load < 2s p95; MC job submit < 500ms; alert WebSocket delivery < 30s; test report committed to docs/validation/load-test-report-phase3.md
  • MC concurrency gate tested at scale: 10 simultaneous MC submissions across 5 organisations; each org receives 429 for its second request; no deadlock or Redis key leak observed; Celery worker queue depth remains bounded
  • WebSocket subscriber ceiling verified: load test opens 450 connections to a single backend instance; 451st connection receives HTTP 503; ws_connected_clients gauge reads 450; scaling trigger fires at 400 (alert visible in Grafana)
  • CZML delta adoption: Playwright E2E test confirms the frontend sends ?since= parameter on all CZML polls after initial load; no full-catalog request occurs after page load in LIVE mode
  • Bundle size CI gate active and green: final production build JS bundle documented; bundle-size CI step has passed for ≥2 consecutive deploys without manual override
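
The subscriber-ceiling behaviour verified above can be sketched as a simple capacity gate. Only the 450/400 limits come from the plan; the class and method names are illustrative, not the real backend API.

```python
# Minimal sketch of the WebSocket subscriber ceiling: accept up to
# 450 connections (the 451st is refused with 503 upstream) and raise
# a scaling signal once 400 are connected.
class WsCapacityGate:
    MAX_WS_CLIENTS = 450   # hard ceiling per backend instance
    SCALE_ALERT_AT = 400   # Grafana scaling trigger

    def __init__(self) -> None:
        self.connected = 0  # mirrors the ws_connected_clients gauge

    def try_accept(self) -> tuple[bool, bool]:
        """Return (accepted, scale_alert)."""
        if self.connected >= self.MAX_WS_CLIENTS:
            return False, False
        self.connected += 1
        return True, self.connected >= self.SCALE_ALERT_AT

gate = WsCapacityGate()
results = [gate.try_accept() for _ in range(451)]
assert results[450] == (False, False)  # 451st connection refused
assert results[399] == (True, True)    # alert fires at the 400th accept
assert gate.connected == 450
```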

22. Open Physics Questions for Engineering Review

  1. JB2008 vs NRLMSISE-00 — Recommend: NRLMSISE-00 for Phase 1 with a pluggable density model interface that accepts JB2008 in Phase 2 without API or schema changes.

  2. Covariance source for conjunction probability — Recommend: SP ephemeris covariance from Space-Track for active payloads; empirical covariance with explicit UI warning for debris.

  3. Re-entry termination altitude — Recommend: 80 km for Phase 1; parametric interface for Phase 2 breakup module (default 80 km, allow up to 120 km).

  4. F10.7 forecast horizon — For objects re-entering 5–14 days out, NOAA 3-day forecasts have degraded skill. Recommend: 81-day smoothed average as baseline with ±20% MC variation; document clearly in the SpaceWeatherWidget and every prediction panel.


23. Dual Domain Architecture

23.1 The Interface Problem

Two technically adjacent domains — space operations and civil aviation — manage debris re-entry hazards using incompatible tools, data formats, and operational vocabularies. The gap between them is the market.

```
SPACE DOMAIN                          THE GAP                     AVIATION DOMAIN
────────────────                    ──────────                   ────────────────
TLE / SGP4                                                        NOTAM
CDMs / TIP messages          No standard interface               FIR restrictions
CCSDS orbit products         No common tool                      ATC procedures
Kp / F10.7 indices           No shared language                  En-route charts
Probability of casualty      ← SpaceCom bridges this →          Plain English hazard brief
```

23.2 Shared Physics Core

One physics engine serves both front doors. Neither domain gets a different model — they get different views of the same computation.

```
                    ┌─────────────────────────────────┐
                    │         PHYSICS CORE            │
                    │  Catalog Propagator (SGP4)      │
                    │  Decay Predictor (RK7(8)+NRLMS) │
                    │  Monte Carlo ensemble           │
                    │  Conjunction Screener           │
                    │  Atmospheric Breakup (ORSAT)    │
                    │  Frame transforms (TEME→WGS84)  │
                    └────────────┬────────────────────┘
                                 │
               ┌─────────────────┴─────────────────┐
               │                                   │
    ┌──────────▼───────────┐          ┌────────────▼──────────┐
    │   SPACE DOMAIN UI    │          │  AVIATION DOMAIN UI   │
    │  /space portal       │          │  / (operational view) │
    │  Persona E, F        │          │  Persona A, B, C      │
    │                      │          │                       │
    │  State vectors       │          │  Hazard corridors     │
    │  Covariance matrices │          │  FIR intersection     │
    │  CCSDS formats       │          │  NOTAM drafts         │
    │  Deorbit windows     │          │  Plain-language status│
    │  API keys            │          │  Alert acknowledgement│
    │  Conjunction data    │          │  Gantt timeline       │
    └──────────────────────┘          └───────────────────────┘
```

23.3 Domain-Specific Output Formats

| Output | Space Domain | Aviation Domain |
|---|---|---|
| Trajectory | CCSDS OEM (state vectors) | CZML (J2000 INERTIAL for CesiumJS) |
| Re-entry prediction | p05/p50/p95 times + covariance | Percentile corridor polygons on globe |
| Hazard | Probability of casualty (Pc) value | Risk level (LOW/MEDIUM/HIGH/CRITICAL) |
| Uncertainty | Monte Carlo ensemble statistics | Corridor width visual encoding |
| Conjunction | CDM-format Pc value | Not surfaced to Persona A |
| Space weather | F10.7 / Ap / Kp raw indices | "Elevated activity — wider uncertainty" |
| Deorbit plan | CCSDS manoeuvre plan | Corridor risk map on globe |
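
The Pc-to-risk-level translation in the table above can be sketched as follows. The threshold values are illustrative placeholders, not the calibrated product thresholds.

```python
def risk_level(pc: float) -> str:
    """Map a probability-of-casualty value to the aviation-facing
    risk band. Thresholds here are illustrative only; 1e-4 echoes the
    commonly cited casualty-risk benchmark but is not the product value."""
    if pc >= 1e-3:
        return "CRITICAL"
    if pc >= 1e-4:
        return "HIGH"
    if pc >= 1e-5:
        return "MEDIUM"
    return "LOW"

assert risk_level(2e-4) == "HIGH"
assert risk_level(5e-7) == "LOW"
```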

23.4 Competitive Position

| Competitor | Their Strength | SpaceCom Advantage |
|---|---|---|
| ESA ESOC Re-entry Prediction Service | Authoritative technical product; longest-running service | Aviation-facing operational UX; ANSP decision support; NOTAM drafting; multi-ANSP coordination |
| OKAPI:Orbits + DLR + TU Braunschweig | Academic orbital mechanics depth; space operator integrations | Purpose-built ANSP interface; controlled re-entry planner; shadow mode for regulatory adoption |
| Aviation weather vendors (e.g., StormGeo) | Deep ANSP relationships; established procurement pathways | Space domain physics credibility; TLE/CDM ingestion; conjunction screening |
| General STM platforms | Broad catalog management | Operational decision support depth; aviation integration layer |

SpaceCom's moat is the combination of space physics credibility AND aviation operational usability. Neither side alone is sufficient to win regulated aviation authority contracts.

Differentiation capabilities — must be maintained regardless of competitor moves (Finding 4):

These are the capabilities that competitors cannot quickly replicate and that directly determine whether ANSPs and institutional buyers choose SpaceCom over alternatives:

| Capability | Why it matters | Maintenance requirement |
|---|---|---|
| ANSP operational workflow integration | NOTAM drafting, multi-ANSP coordination, and shadow mode are purpose-built for ANSP operations — not retrofitted | Must be validated with ≥ 2 ANSP safety teams before Phase 2 shadow deployment |
| Regulatory adoption path | Shadow mode + exit criteria + ANSP Safety Department sign-off creates a documented adoption trail that institutional procurements require | Shadow mode exit report template must remain current; exit statistics generated automatically |
| Physics + aviation in one product | Neither a pure orbital analytics tool nor a pure aviation tool can cover both sides without the other's domain expertise | Dual-domain architecture (§23) must be maintained; any feature removal from either domain triggers an ADR |
| ESA/DISCOS data integration | Institutional credibility with ESA and national space agencies depends on using authoritative ESA data sources | DISCOS redistribution rights must be resolved before Phase 2; integration maintained as P1 data source |

A docs/competitive-analysis.md document (maintained by the product owner, reviewed quarterly) tracks competitor feature releases and assesses impact on these claims. Any competitor capability that closes a differentiation gap triggers a product review within 30 days.

23.5 SWIM Integration Path

European ANSPs increasingly exchange operational data via SWIM (System Wide Information Management), defined by ICAO Doc 10039 and implemented in Europe via EUROCONTROL SWIM-TI (AMQP/MQTT transport, FIXM/AIXM 5.1 schemas). Full SWIM compliance is a Phase 3+ target; the path is:

| Phase | Deliverable | Standard |
|---|---|---|
| Phase 2 | GeoJSON structured event export (/events/{id}/export?format=geojson) with ICAO FIR IDs and prediction metadata | GeoJSON + ISO 19115 metadata |
| Phase 3 | Review FIXM Core 4.x schema for re-entry hazard representation; define SpaceCom extension namespace | FIXM Core 4.2 |
| Phase 3 | SWIM-TI AMQP endpoint (publish-only) for alert.new and tip.new events to EUROCONTROL Network Manager B2B service | EUROCONTROL SWIM-TI Yellow Profile |

Phase 2 GeoJSON export is the immediate deliverable. Phase 3 SWIM-TI integration is scoped but requires a EUROCONTROL B2B service account and FIXM schema extension review — neither is blocking for Phase 1 or 2.


24. Regulatory Compliance Framework

24.1 The Regulatory Gap SpaceCom Operates In

There is currently no binding international regulatory framework governing re-entry debris hazard notifications to civil aviation. SpaceCom operates at the boundary between two regulatory regimes that have not yet formally agreed on how to bridge them.

This creates risk (no approved pathway to slot into) but also opportunity (SpaceCom can help define the standard and accumulate first-mover evidence).

24.2 Liability and Operational Status

Legal opinion is a Phase 2 gate, not a Phase 3 task. Shadow mode deployments with ANSPs must not occur without a completed legal opinion for the deployment jurisdiction. "Advisory only" UI labelling is not contractual protection — liability limitation must be in executed agreements. In common law jurisdictions (Australia, UK, US), a voluntary undertaking of responsibility to a known class of relying professionals can create a duty of care regardless of disclaimers (Hedley Byrne & Co v Heller and equivalents). Shadow mode activation in the admin panel is gated by legal_opinions.shadow_mode_cleared = TRUE for the organisation's jurisdiction.

Legal opinion scope (per deployment jurisdiction — Australia, EU, UK, US minimum):

  • Whether "decision support information" labelling limits liability for incorrect predictions that inform airspace decisions
  • Whether the platform creates duty-of-care obligations regardless of labelling
  • Whether Space-Track data redistribution via the SpaceCom API requires a separate licensing agreement with 18th Space Control Squadron
  • Whether CDM data (national security-adjacent) is subject to export controls in target jurisdictions
  • Whether the Controlled Re-entry Planner falls under ECCN 9E515 (spacecraft operations technical data) for non-US users

Operational status classification for SpaceCom outputs — not a UI label, a formal determination made in consultation with the ANSP's legal and SMS teams:

  • Aeronautical information (ICAO Annex 15) — highest standard; triggers data quality obligations
  • Decision support information — intermediate; requires formal ANSP SMS acceptance
  • Situational awareness information — lowest; advisory only; no procedural authority

Commercial contract requirements — three instruments required before any access:

  1. Master Services Agreement (MSA) — executed before any ANSP or space operator accesses the system. Must be reviewed by aviation law counsel. Minimum required terms:

    • Limitation of liability: capped at 12 months of fees paid, or a fixed cap for government/sovereign customers (to be determined by counsel)
    • Exclusion of consequential and indirect loss
    • Explicit statement that SpaceCom outputs are decision support information, not certified aeronautical information and not a substitute for ANSP operational procedures
    • ANSP's acknowledgement that they retain full authority and responsibility for all operational decisions
    • SLOs from §26.1 incorporated by reference
    • Governing law and jurisdiction clause
    • Data Processing Agreement (DPA) addendum for GDPR-scope deployments (see §29)
    • Right to suspend service without liability for maintenance, degraded mode, data quality concerns, or active security incidents
  2. Acceptable Use Policy (AUP) — click-wrap accepted in-platform at first login, recorded in users.tos_accepted_at, users.tos_version, and users.tos_accepted_ip. Must re-accept when version changes (system blocks access until accepted). Includes:

    • Acknowledgement that orbital data originates from Space-Track, subject to Space-Track terms
    • Prohibition on redistributing SpaceCom-derived data to third parties without written consent
    • Acknowledgement that the platform is decision support only, not certified aeronautical information
    • Export control acknowledgement (user is responsible for compliance in their jurisdiction)
  3. API Terms — embedded in the API key issuance flow for Persona E/F programmatic access. Accepted at key creation; recorded against the api_keys record. Includes the Space-Track redistribution acknowledgement and the export control notice.

Space-Track data redistribution gate (F3): Space-Track.org Terms of Service prohibit redistribution of TLE data to non-registered entities. The SpaceCom API must not serve TLE-derived fields (raw TLE strings, tle_epoch, tle_line1/2) to organisations that have not confirmed Space-Track registration. Implementation:

```sql
-- Add to organisations table
ALTER TABLE organisations ADD COLUMN space_track_registered BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN space_track_registered_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN space_track_username TEXT; -- for audit
```

API middleware check (applied to any response containing TLE-derived fields):

```python
from fastapi import HTTPException

def check_space_track_gate(org: Organisation):
    if not org.space_track_registered:
        raise HTTPException(
            status_code=403,
            detail="TLE-derived data requires Space-Track registration. "
                   "Register at space-track.org and confirm in your organisation settings."
        )
```

All TLE-derived disclosures are logged in data_disclosure_log:

```sql
CREATE TABLE data_disclosure_log (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id      UUID NOT NULL REFERENCES organisations(id),
    source      TEXT NOT NULL,  -- 'space_track', 'esa_sst', etc.
    endpoint    TEXT NOT NULL,
    disclosed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    record_count INTEGER
);
CREATE INDEX ON data_disclosure_log (org_id, source, disclosed_at DESC);
```
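
A minimal sketch of gating plus disclosure logging together, with an in-memory list standing in for the data_disclosure_log table; function and data shapes are illustrative.

```python
from datetime import datetime, timezone

# In-memory stand-in for data_disclosure_log (illustrative only).
disclosure_log: list[dict] = []

def serve_tle_fields(org: dict, endpoint: str, records: list[dict]) -> list[dict]:
    """Refuse TLE-derived fields for unregistered orgs; otherwise
    log the disclosure and return the records."""
    if not org["space_track_registered"]:
        raise PermissionError("TLE-derived data requires Space-Track registration")
    disclosure_log.append({
        "org_id": org["id"],
        "source": "space_track",
        "endpoint": endpoint,
        "disclosed_at": datetime.now(timezone.utc),
        "record_count": len(records),
    })
    return records

org = {"id": "a1", "space_track_registered": True}
serve_tle_fields(org, "/api/v1/objects", [{"tle_epoch": "2026-03-16"}])
assert disclosure_log[0]["record_count"] == 1
```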

Contracts table and MRR tracking (F1, F4, F9 — §68):

The contracts table enforces that feature access is gated on commercial state, provides MRR data for the commercial team, and records discount approval for audit:

```sql
CREATE TABLE contracts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL REFERENCES organisations(id),  -- UUID to match organisations.id
  contract_type TEXT NOT NULL
    CHECK (contract_type IN ('sandbox','professional','enterprise','on_premise','internal')),
  -- Financial terms
  monthly_value_cents INTEGER NOT NULL DEFAULT 0,  -- 0 for sandbox/internal
  currency CHAR(3) NOT NULL DEFAULT 'EUR',
  discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
    CHECK (discount_pct >= 0 AND discount_pct <= 100),
  -- Discount approval guard (F4): discounts >20% require second approver
  discount_approved_by INTEGER REFERENCES users(id),  -- NULL if discount_pct <= 20
  discount_approval_note TEXT,
  -- Term
  valid_from TIMESTAMPTZ NOT NULL,
  valid_until TIMESTAMPTZ NOT NULL,
  auto_renew BOOLEAN NOT NULL DEFAULT FALSE,
  -- Feature access — what this contract enables
  enables_operational_mode BOOLEAN NOT NULL DEFAULT FALSE,
  enables_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,
  enables_api_access BOOLEAN NOT NULL DEFAULT FALSE,
  -- Audit
  created_by INTEGER REFERENCES users(id),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  signed_msa_at TIMESTAMPTZ,        -- NULL until MSA countersigned
  msa_document_ref TEXT,            -- path in MinIO legal bucket
  -- Professional Services (F10)
  ps_value_cents INTEGER NOT NULL DEFAULT 0,  -- one-time PS revenue on this contract
  ps_description TEXT
);
CREATE INDEX ON contracts (org_id, valid_until DESC);
-- Active-contract lookup. Note: NOW() is not IMMUTABLE, so it cannot
-- appear in a partial index predicate; filter on valid_until at query time.
CREATE INDEX ON contracts (valid_until);

-- Constraint: discounts >20% must have a named approver
ALTER TABLE contracts ADD CONSTRAINT discount_approval_required
  CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL);
```

Feature access enforcement (F1): Feature flags in organisations must be set from the active contract, not by admin toggle alone. A Celery task (tasks/commercial/sync_feature_flags.py) runs nightly and on contract creation/update to sync organisations.feature_multi_ansp_coordination from the active contract's enables_multi_ansp_coordination. An admin toggle that disagrees with the active contract is overwritten by the nightly sync.

MRR dashboard (F9): Add a Grafana panel (internal dashboard, not customer-facing) showing current MRR:

-- Recording rule or direct query:
SELECT SUM(monthly_value_cents) / 100.0 AS mrr_eur
FROM contracts
WHERE valid_from <= NOW() AND valid_until >= NOW()
  AND contract_type NOT IN ('sandbox', 'internal');

Expose as spacecom_mrr_eur Prometheus gauge updated by the nightly sync_feature_flags task. Grafana panel: "Current MRR (€)" — single stat panel, comparison to previous month.
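
A sketch of the MRR computation the nightly task would perform before setting the spacecom_mrr_eur gauge, mirroring the SQL above; row shapes are illustrative.

```python
from datetime import datetime, timezone

def current_mrr_eur(contracts: list[dict], now: datetime) -> float:
    """Sum monthly value of active, revenue-bearing contracts (EUR)."""
    return sum(
        c["monthly_value_cents"]
        for c in contracts
        if c["valid_from"] <= now <= c["valid_until"]
        and c["contract_type"] not in ("sandbox", "internal")
    ) / 100.0

now = datetime(2026, 4, 1, tzinfo=timezone.utc)
contracts = [
    {"contract_type": "professional", "monthly_value_cents": 250_000,
     "valid_from": datetime(2026, 1, 1, tzinfo=timezone.utc),
     "valid_until": datetime(2026, 12, 31, tzinfo=timezone.utc)},
    {"contract_type": "sandbox", "monthly_value_cents": 0,
     "valid_from": datetime(2026, 1, 1, tzinfo=timezone.utc),
     "valid_until": datetime(2026, 12, 31, tzinfo=timezone.utc)},
]
assert current_mrr_eur(contracts, now) == 2500.0
```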

Export control screening (F4): ITAR 22 CFR §120.15 and EAR 15 CFR §736 prohibit providing certain SSA capabilities to nationals of embargoed countries and denied parties. Required at organisation onboarding:

```sql
ALTER TABLE organisations ADD COLUMN country_of_incorporation CHAR(2); -- ISO 3166-1 alpha-2
ALTER TABLE organisations ADD COLUMN export_control_screened_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN export_control_cleared BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN itar_cleared BOOLEAN NOT NULL DEFAULT FALSE; -- US-person or licensed
```

Onboarding flow:

  1. Collect country_of_incorporation at registration
  2. Flag embargoed countries (CU, IR, KP, RU, SY) for manual review — account held in PENDING_EXPORT_REVIEW state
  3. Screen organisation name against BIS Entity List (automated lookup; manual review on partial match)
  4. EU-SST-derived data gated behind itar_cleared = TRUE (EU-SST has its own access restrictions for non-EU entities)
  5. All screening decisions logged with reviewer ID and date

Documented in legal/EXPORT_CONTROL_POLICY.md. Legal counsel review required before any deployment that could serve US-origin technical data (TLE from 18th Space Control Squadron) to non-US persons.
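
The onboarding screening decision can be sketched as follows; the embargoed-country list comes from the flow above, while the entity-list match flag is a placeholder for the real BIS Entity List lookup.

```python
# Country list per the onboarding flow above; matching logic illustrative.
EMBARGOED = {"CU", "IR", "KP", "RU", "SY"}

def screening_state(country: str, entity_list_match: bool) -> str:
    """Hold embargoed-country or entity-list-matched orgs for manual
    review; everything else proceeds (final clearance still logged)."""
    if country.upper() in EMBARGOED or entity_list_match:
        return "PENDING_EXPORT_REVIEW"
    return "CLEARED"

assert screening_state("AU", entity_list_match=False) == "CLEARED"
assert screening_state("IR", entity_list_match=False) == "PENDING_EXPORT_REVIEW"
assert screening_state("DE", entity_list_match=True) == "PENDING_EXPORT_REVIEW"
```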

Regulatory Sandbox Agreement — a lightweight 2-page letter of understanding required before any ANSP shadow mode activation. Specifies:

  • Trial period start and end dates
  • ANSP's confirmation that SpaceCom outputs are for internal validation only (not operational)
  • SpaceCom's commitment to produce a shadow validation report at trial end
  • Data protection terms for the trial period
  • How incidents during the trial are handled by both parties
  • Mutual agreement that the trial does not create any ongoing commercial obligation

Regulatory sandbox liability clarification (F11 — §61): The sandbox agreement is not a liability shield by itself. During shadow mode, SpaceCom is a tool under evaluation — liability exposure depends on how the ANSP uses outputs and what the sandbox agreement says about consequences of errors. Required provisions:

  • No operational reliance clause: ANSP certifies in writing that no operational decisions will be made on the basis of SpaceCom outputs during the trial. Any breach of this clause by the ANSP shifts liability to the ANSP.
  • Incident notification: If a SpaceCom output error is identified during the trial, SpaceCom notifies the ANSP within 2 hours (matching the safety occurrence runbook at §26.8). The sandbox agreement specifies whether this constitutes a notifiable occurrence under the ANSP's SMS.
  • Indemnification cap: SpaceCom's aggregate liability during the sandbox period is capped at AUD/EUR 50,000 (or local equivalent). Catastrophic loss claims are excluded (consistent with MSA terms).
  • Insurance requirement: SpaceCom must carry professional indemnity insurance with minimum cover AUD/EUR 1 million before activating any sandbox with an ANSP. Certificate of currency provided to the ANSP before activation.
  • Regulatory notification duty: If the ANSP's safety regulator requires notification of third-party tool trials (e.g., EASA, CASA, CAA), that obligation rests with the ANSP. SpaceCom provides a one-page system description document to support the ANSP's notification.
  • Sandbox ≠ approval pathway: A successful sandbox trial is evidence for a future regulatory submission — it is not itself an approval. Neither party should represent the sandbox as a form of regulatory acceptance.

legal/SANDBOX_AGREEMENT_TEMPLATE.md captures the standard text. Legal counsel review required before any amendment.

The shadow mode admin toggle must display a warning if no Regulatory Sandbox Agreement is on record (legal_opinions.shadow_mode_cleared = FALSE for the org's jurisdiction):

```
⚠ No legal clearance on record for this organisation's jurisdiction.
  Shadow mode should not be activated without a completed legal opinion
  and a signed Regulatory Sandbox Agreement.
  [View legal status →]
```

24.3 ICAO Data Quality Mapping (Annex 15)

SpaceCom outputs that may enter aeronautical information channels must be characterised against ICAO's five data quality attributes:

| Attribute | SpaceCom Characterisation | Required Action |
|---|---|---|
| Accuracy | Decay predictor accuracy characterised from ≥10 historical re-entry backcasts vs. The Aerospace Corporation database. Published as a formal accuracy statement in GET /api/v1/reentry/predictions/{id} response. | Phase 3: produce accuracy characterisation document |
| Resolution | Corridor boundaries expressed as geographic polygons with stated precision. Position uncertainty stated as formal resolution value in prediction response. | Included in prediction API response from Phase 1 |
| Integrity | HMAC-SHA256 on all prediction and hazard zone records. Integrity assurance level: Essential (1×10⁻⁵). Documented in system description. | Implemented Phase 1 (§7.9) |
| Traceability | Full parameter provenance in simulations.params_json and prediction records. Accessible to regulatory auditors via dedicated API. | Phase 1 |
| Timeliness | Maximum latency from TIP message ingestion to updated prediction available: 30 minutes. Maximum latency from NOAA SWPC space weather update to prediction recalculation: 4 hours. Published as formal SLA. | Phase 3 SLA document |

F5 — Completeness attribute and ICAO Annex 15 §3.2 data quality classification (§61):

ICAO Annex 15 §3.2 defines a sixth implicit attribute — Completeness — meaning all data fields required by the receiving system are present and within range. SpaceCom must:

  • Define a formal completeness schema for each prediction response (required fields, allowed nulls, value ranges)
  • Return data_quality.completeness_pct in the prediction response (fields present / fields required × 100)
  • Reject predictions with completeness < 90% from the alert pipeline (alert not generated; operator notified of incomplete prediction)
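
A sketch of the completeness check described above; the required field names are illustrative, not the formal completeness schema.

```python
# Fields present / fields required x 100, with sub-90% predictions
# blocked from the alert pipeline (illustrative field names).
REQUIRED_FIELDS = ("norad_id", "window_p05", "window_p50", "window_p95",
                   "corridor_geojson")

def completeness_pct(prediction: dict) -> float:
    present = sum(1 for f in REQUIRED_FIELDS
                  if prediction.get(f) is not None)
    return 100.0 * present / len(REQUIRED_FIELDS)

pred = {"norad_id": 44878, "window_p05": "2026-03-16T14:00Z",
        "window_p50": "2026-03-16T18:00Z", "window_p95": "2026-03-16T22:00Z",
        "corridor_geojson": None}
assert completeness_pct(pred) == 80.0   # below 90%: alert suppressed
```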

ICAO data category and classification required in the prediction response (Annex 15 Table A3-1):

| Field | Value |
|---|---|
| data_category | AERONAUTICAL_ADVISORY (until formal AIP entry process established) |
| originator | SPACECOM + system version string |
| effective_from | ISO 8601 UTC timestamp |
| integrity_assurance | ESSENTIAL (1×10⁻⁵ probability of undetected error) |
| accuracy_class | CLASS_2 (advisory, not certified — until accuracy characterisation completes Phase 3 validation) |

Formal accuracy characterisation (docs/validation/ACCURACY_CHARACTERISATION.md) is a Phase 3 gate before the API can be presented to any ANSP as meeting Annex 15 data quality standards.

24.4 Safety Management System Integration

Any ANSP formally adopting SpaceCom must include it in their SMS (ICAO Annex 19). SpaceCom provides the following artefacts to support ANSP SMS assessment:

Hazard register (SpaceCom's contribution to the ANSP's SMS — F3, §61 structured format):

Maintained as docs/safety/HAZARD_LOG.md. Each hazard uses the structured schema below. Hazard IDs are permanent — retired hazards are marked CLOSED, not deleted.

| ID | Description | Cause | Effect | Mitigations | Severity | Likelihood | Risk Level | Status |
|---|---|---|---|---|---|---|---|---|
| HZ-001 | SpaceCom unavailable during active re-entry event | Infrastructure failure; deployment error; DDoS | ANSP cannot access current re-entry prediction during event window | Patroni HA failover (§26.3); 15-min RTO SLO; automated ANSP push notification + email; documented fallback procedure | Hazardous | Low (SLO 99.9%) | Medium | OPEN |
| HZ-002 | False all-clear prediction (false negative — corridor misses actual impact zone) | TLE age; atmospheric model error; MC sampling variance; adversarial data manipulation | ANSP issues all-clear; aircraft enters debris corridor | HMAC integrity check; dual-source TLE validation; TIP cross-check guard; shadow validation evidence; accuracy characterisation (Phase 3); @pytest.mark.safety_critical tests | Catastrophic | Very Low | High | OPEN |
| HZ-003 | False hazard prediction (false positive — corridor over-stated) | Atmospheric model conservatism; TLE propagation error | Unnecessary airspace restriction; operational disruption; credibility loss | Cross-source TLE validation; HMAC; p95 corridor with stated uncertainty; accuracy characterisation | Major | Low | Medium | OPEN |
| HZ-004 | Corridor displayed in wrong reference frame | ECI/ECEF/geographic frame conversion error; CZML frame parameter misconfiguration | Corridor shown at wrong lat/lon; operator makes decisions on incorrect geographic basis | Frame transform unit tests against IERS references (§17); CZML frame convention enforced via CI | Hazardous | Very Low | Medium | OPEN |
| HZ-005 | Outdated prediction served (stale data) | Ingest pipeline failure; TLE source outage; cache not invalidating | Operator sees prediction that no longer reflects current orbital state | Data staleness indicators in UI; automated stale alert to operators; ingest health monitoring; CZML cache invalidation triggers (§35) | Major | Low | Medium | OPEN |
| HZ-006 | Prediction integrity failure (HMAC mismatch) | Database modification; backup restore error; storage corruption | Prediction record cannot be verified; may have been tampered with | Prediction quarantined automatically; CRITICAL security alert; prediction withheld from API | Catastrophic | Very Low | High | OPEN |
| HZ-007 | Unauthorised access to prediction data | Compromised credentials; RLS bypass; API misconfiguration | Competitor or adversary obtains early re-entry corridor data; potential ITAR exposure | PostgreSQL RLS; JWT validation; rate limiting; security_logs audit trail; penetration testing | Major | Low | Medium | OPEN |

Hazard log governance:

  • Review: quarterly, and after each SEV-1 incident, model version update, or material system change
  • New hazards identified during safety occurrence reporting are added within 5 business days
  • Risk level = Severity × Likelihood using EUROCAE ED-153 risk classification matrix
  • OPEN hazards with High risk level are Phase 2 gate blockers — must reach MITIGATED before ANSP shadow activation
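
The Severity × Likelihood combination can be sketched as a lookup table; the band assignments below are taken from the hazard rows above, not from the normative ED-153 matrix.

```python
# Illustrative severity/likelihood lookup consistent with HZ-001..HZ-007
# (placeholder bands, not the normative ED-153 classification table).
RISK_MATRIX = {
    ("Catastrophic", "Very Low"): "High",    # HZ-002, HZ-006
    ("Hazardous", "Low"): "Medium",          # HZ-001
    ("Hazardous", "Very Low"): "Medium",     # HZ-004
    ("Major", "Low"): "Medium",              # HZ-003, HZ-005, HZ-007
}

def hazard_risk_level(severity: str, likelihood: str) -> str:
    return RISK_MATRIX[(severity, likelihood)]

assert hazard_risk_level("Catastrophic", "Very Low") == "High"
assert hazard_risk_level("Major", "Low") == "Medium"
```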

System safety classification: Safety-related (not safety-critical under DO-278A). Relevant components target the SAL-2 assurance level (see §24.13). Development assurance standard: EUROCAE ED-78A equivalent for relevant components.

Change management: SpaceCom must notify all ANSP users before model version updates that affect prediction outputs. Version changes tracked in simulations.model_version and surfaced in the UI.

24.5 NOTAM System Interface

SpaceCom's position in the NOTAM workflow:

SpaceCom generates → NOTAM draft (ICAO format) → Reviewed by Persona A → Submitted by authorised NOTAM originator → Issued NOTAM

SpaceCom never submits NOTAMs. The draft is a decision support artefact. The mandatory disclaimer on every draft is a non-removable regulatory requirement, not a UI preference.

NOTAM timing requirements by jurisdiction:

  • Routine NOTAMs: 24–48 hours minimum lead time
  • Short-notice (re-entry window < 24 hours): ASAP; NOTAM issued with minimum lead time
  • SpaceCom alert thresholds align with these: CRITICAL alert at < 6h, HIGH at < 24h

24.6 Space Law Considerations

UN Liability Convention (1972): All SpaceCom prediction records, simulation runs, and alert acknowledgements may be legally discoverable in an international liability claim. The immutable audit trail (§7.9) therefore also serves as an evidence preservation mechanism. reentry_predictions, alert_events, notam_drafts, and shadow_validations are retained for a minimum of 7 years.

National space laws with re-entry obligations:

  • Australia: Space (Launches and Returns) Act 2018. CASA and the Australian Space Agency have coordination protocols. SpaceCom's controlled re-entry planner outputs are suitable as evidence for operator obligations under this Act.
  • EU/ESA: EU Space Programme Regulation; ESA Zero Debris Charter. SpaceCom supports Zero Debris by characterising re-entry risk and supporting responsible end-of-life planning.
  • US: FAA AST re-entry licensing generates data that SpaceCom should ingest when available. 51 USC Chapter 509 obligations may affect US space operator customers.

Space Traffic Management evolution: US Office of Space Commerce is developing civil STM frameworks that may eventually replace Space-Track as the primary civil space data source. SpaceCom's ingest architecture must be adaptable (hardcoded URL constants in ingest/sources.py make this a 1-file change when the source changes).

24.7 ICAO Framework Alignment

Existing: ICAO Doc 10100 (Manual on Space Weather Information, 2019) designates three ICAO-recognised Space Weather Centres (NOAA SWPC, ESA/ESAC, Japan Meteorological Agency). SpaceCom's space weather widget must reference these designated centres by name and ICAO recognition status.

Emerging re-entry guidance: ICAO is in early stages of developing re-entry hazard notification guidance (no published document as of 2025). SpaceCom should:

  • Monitor ICAO Air Navigation Commission and Meteorology Panel working group outputs
  • Design hazard corridor outputs in a format that parallels SIGMET structure (the closest existing ICAO framework: WHO/WHAT/WHERE/WHEN/INTENSITY/FORECAST) — this positions SpaceCom well for whatever standard emerges
  • Consider engaging ICAO working groups as a stakeholder; SpaceCom could become a reference implementation

SIGMET parallel structure for re-entry corridor outputs:

```
REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)
WHO:      CZ-5B ROCKET BODY / NORAD 44878
WHAT:     UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE
WHERE:    CORRIDOR 18S115E TO 28S155E / FL000 TO UNL
WHEN:     FROM 2026031614 TO 2026031622 UTC / WINDOW ±4H (P95)
RISK:     HIGH / LAND AREA IN CORRIDOR: 12%
FORECAST: CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H
SOURCE:   SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3
```
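
An advisory in this layout is mechanical to generate once the fields exist. A hypothetical formatter (the field names and function are illustrative, not a settled SpaceCom schema):

```python
# Hypothetical formatter for the SIGMET-parallel advisory layout; field names
# are illustrative, not an actual SpaceCom schema.
def format_reentry_advisory(fields: dict[str, str]) -> str:
    order = ["WHO", "WHAT", "WHERE", "WHEN", "RISK", "FORECAST", "SOURCE"]
    header = "REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)"
    lines = [header]
    for key in order:
        # Left-pad each label to a fixed 10-character column, as in the example
        lines.append(f"{key + ':':<10}{fields[key]}")
    return "\n".join(lines)
```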

24.8 Alert Threshold Governance

Alert threshold values are consequential algorithmic decisions. A CRITICAL threshold that is too sensitive causes unnecessary airspace disruption; one that is too conservative creates false-negative risk. Both outcomes have legal, operational, and reputational consequences.

Current threshold values and rationale:

| Threshold | Value | Rationale |
|---|---|---|
| CRITICAL window | < 6h | Aligns with ICAO minimum NOTAM lead time for short-notice restrictions; 6h allows ANSP to issue NOTAM with ≥2h lead time |
| HIGH window | < 24h | Operational planning horizon for pre-tactical airspace management |
| FIR intersection trigger | p95 corridor intersects any non-zero area of the FIR | Conservative: any non-zero intersection at p95 level generates an alert; minimum area threshold is an org-configurable setting (default: 0) |
| Alert rate limit | 1 CRITICAL per object per 4h window | Prevents alert flooding from repeated window-shrink events without substantive new information |
| Alert storm threshold | > 5 CRITICAL in 1h | Empirically chosen; above this rate the response-time expectation for individual alerts cannot be met |

These values are recorded in docs/alert-threshold-history.md with initial entry date and author sign-off.

Threshold change procedure:

  1. Engineer proposes change in a PR with rationale documented in docs/alert-threshold-history.md
  2. PR requires review by engineering lead and product owner before merge
  3. Change is deployed to staging; minimum 2-week shadow-mode observation period against real TLE/TIP data
  4. Shadow observation review: false positive rate and false negative rate compared against pre-change baseline
  5. If baseline comparison passes: change deployed to production; all ANSP shadow deployment partners notified in writing with new threshold values
  6. If any ANSP objects: change is held until concerns are resolved

Threshold values are not configurable at runtime by operators. They are code constants reviewed through the above process. Org-configurable alert settings (geographic FIR filter, mute rules, OPS_ROOM_SUPPRESS_MINUTES) are UX preferences, not threshold changes.
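
The governance stance above (reviewed code constants, never runtime settings) could look like this in practice; the module shape and names are illustrative:

```python
# Hypothetical module mirroring §24.8: thresholds are code constants, changed
# only via the documented PR + shadow-observation process, never at runtime.
from datetime import timedelta
from types import MappingProxyType  # read-only view: no runtime mutation

ALERT_THRESHOLDS = MappingProxyType({
    "CRITICAL_WINDOW": timedelta(hours=6),    # < 6h to predicted re-entry
    "HIGH_WINDOW": timedelta(hours=24),       # < 24h to predicted re-entry
    "FIR_MIN_AREA_KM2_DEFAULT": 0.0,          # any non-zero p95 intersection alerts
    "RATE_LIMIT_WINDOW": timedelta(hours=4),  # max 1 CRITICAL per object per window
    "STORM_THRESHOLD_PER_HOUR": 5,            # > 5 CRITICAL in 1h = alert storm
})

def severity_for_window(time_to_reentry: timedelta) -> str:
    """Map predicted time-to-re-entry to an alert severity tier."""
    if time_to_reentry < ALERT_THRESHOLDS["CRITICAL_WINDOW"]:
        return "CRITICAL"
    if time_to_reentry < ALERT_THRESHOLDS["HIGH_WINDOW"]:
        return "HIGH"
    return "MEDIUM"
```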

24.9 Degraded Mode and Availability

SpaceCom must specify degraded mode behaviour for ANSP adoption:

| Condition | System Behaviour | ANSP Action |
|---|---|---|
| Ingest pipeline failure (TLE data > 6h stale) | MEDIUM alert to all operators; staleness indicator on all objects; predictions greyed out | Consult Space-Track directly; activate fallback procedure |
| Space weather data > 4h stale | WARNING banner on SpaceWeatherWidget; uncertainty multiplier conservatively set to HIGH | Note wider uncertainty on any operational decisions |
| System unavailable | Push notification to all registered users; email to ANSP contacts | Activate fallback procedure documented in SpaceCom SMS integration guide |
| HMAC verification failure on a prediction | Prediction withheld; CRITICAL security alert; prediction marked integrity_failed | Do not use the withheld prediction; contact SpaceCom immediately |

Degraded mode notification: When SpaceCom is down or data is stale beyond defined thresholds, all connected ANSPs receive push notification (WebSocket if connected; email fallback) so they can activate their fallback procedures. SpaceCom must never go silent when operationally relevant events are active.
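
The staleness rows above can be encoded as a small classifier; the thresholds follow the table, while the function and condition names are illustrative:

```python
# Illustrative staleness classifier for the §24.9 degraded-mode table;
# thresholds come from the table, names are hypothetical.
from datetime import datetime, timedelta, timezone

TLE_STALE_AFTER = timedelta(hours=6)
SPACE_WEATHER_STALE_AFTER = timedelta(hours=4)

def degraded_conditions(now: datetime, tle_ts: datetime, swx_ts: datetime) -> list[str]:
    """Return the degraded-mode conditions active at `now`."""
    active = []
    if now - tle_ts > TLE_STALE_AFTER:
        active.append("TLE_STALE")           # MEDIUM alert; predictions greyed out
    if now - swx_ts > SPACE_WEATHER_STALE_AFTER:
        active.append("SPACE_WEATHER_STALE") # WARNING banner; HIGH uncertainty
    return active
```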


24.10 EU AI Act Obligations

Classification: SpaceCom's conjunction probability model (§19) and any ML-based alert prioritisation constitute an AI system under EU AI Act Art. 3(1). AI systems used in transport infrastructure safety fall under Annex III, point 4 (AI systems intended to be used for dispatching, monitoring, and maintenance of transport infrastructure including aviation). This classification implies high-risk AI system obligations.

High-risk AI system obligations (EU AI Act Chapter III Section 2):

| Obligation | Article | SpaceCom implementation |
|---|---|---|
| Risk management system | Art. 9 | Integrate with existing SMS (§24.4); maintain AI-specific risk register in legal/EU_AI_ACT_ASSESSMENT.md |
| Data governance | Art. 10 | TLE training data provenance documented; simulations.params_json stores full input provenance; bias assessment required for orbital prediction models |
| Technical documentation | Art. 11 + Annex IV | legal/EU_AI_ACT_ASSESSMENT.md — system description, capabilities, limitations, human oversight measures, accuracy characterisation |
| Record-keeping / automatic logging | Art. 12 | reentry_predictions and alert_events tables provide automatic event logging; immutable (APPEND-only with HMAC) |
| Transparency to users | Art. 13 | Conjunction probability values labelled with model version (simulations.model_version), TLE age, EOP currency; uncertainty bounds displayed |
| Human oversight | Art. 14 | All decisions remain with duty controller (§24.2 AUP; §28.6 Decision Prompts disclaimer); no autonomous action taken by SpaceCom |
| Accuracy, robustness, cybersecurity | Art. 15 | Accuracy characterisation (§24.3 ICAO Data Quality); adversarial robustness covered by §7 and §36 security review |
| Conformity assessment | Art. 43 | Self-assessment pathway available for transport safety AI without third-party involvement at first deployment; document in legal/EU_AI_ACT_ASSESSMENT.md |
| EU database registration | Art. 51 | High-risk AI systems must be registered in the EU AI Act database before placing on market; legal milestone in deployment roadmap |

Human oversight statement (required in UI — Art. 14): The conjunction probability display (§19.4) must include the following non-configurable statement in the model information panel:

"This probability estimate is generated by an AI model and is subject to uncertainty arising from TLE age, atmospheric model limitations, and manoeuvre uncertainty. All operational decisions remain with the duty controller. This system does not replace ANSP procedures."

Gap analysis and roadmap: legal/EU_AI_ACT_ASSESSMENT.md must document: current compliance state → gaps → remediation actions → target dates. Phase 2 gate: conformity assessment documentation complete. Phase 3 gate: EU database registration completed before commercial EU deployment.


24.11 Regulatory Correspondence Register

For an ANSP-facing product, regulators and institutional stakeholders (CAA, EASA, national ANSPs, ESA, ICAO) will issue queries, audits, formal requests, and correspondence. A missed regulatory deadline can constitute a licence breach or grounds for suspension of operations.

Correspondence log: legal/REGULATORY_CORRESPONDENCE_LOG.md — structured register with the following fields per entry:

| Field | Description |
|---|---|
| Date received | ISO 8601 |
| Authority | Regulatory body name and country |
| Reference number | Authority's reference (if given) |
| Subject | Brief description |
| Deadline | Formal response deadline (ISO 8601) |
| Owner | Named individual responsible for response |
| Status | PENDING / RESPONDED / CLOSED / ESCALATED |
| Response date | Date formal response sent |
| Notes | Internal context, legal counsel involvement |

SLAs:

  • All regulatory correspondence acknowledged (receipt confirmed to sender) within 2 business days
  • Substantive response or extension request within 14 calendar days (or as required by the correspondence)
  • All correspondence older than 14 days without a RESPONDED or CLOSED status triggers an escalation to the CEO
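
The 14-day escalation rule is straightforward to automate against the register. A hypothetical check, assuming entries have already been parsed into dicts keyed on the fields listed above:

```python
# Illustrative escalation sweep over the correspondence register in
# legal/REGULATORY_CORRESPONDENCE_LOG.md; entry keys and the function name
# are hypothetical, the 14-day rule comes from §24.11.
from datetime import date

def needs_escalation(entries: list[dict], today: date) -> list[dict]:
    """Entries older than 14 days that are neither RESPONDED nor CLOSED."""
    return [
        e for e in entries
        if (today - date.fromisoformat(e["date_received"])).days > 14
        and e["status"] not in ("RESPONDED", "CLOSED")
    ]
```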

Proactive regulatory engagement: The correspondence register is reviewed at each quarterly steering meeting. Any authority that has issued ≥3 queries in a 12-month period warrants a proactive engagement call to identify and address systemic concerns before they become formal regulatory actions.


24.12 Safety Case Framework (F1 — §61)

A safety case is a structured argument that a system is acceptably safe for a specified use in a defined context. SpaceCom must produce and maintain a safety case before any operational ANSP deployment. The safety case is a living document, updated at each material system change.

Safety case structure (Goal Structuring Notation — GSN, consistent with EUROCAE ED-153 / IEC 61508 safety case guidance):

G1: SpaceCom is acceptably safe to use as a decision support tool
    for re-entry hazard awareness in civil airspace operations

  C1: Context — SpaceCom operates as decision support (not autonomous authority);
      all operational decisions remain with the ANSP duty controller

  S1: Argument strategy — safety achieved by hazard identification,
      risk reduction, and operational constraints

    G1.1: All identified hazards are mitigated to acceptable risk levels
      Sn1: Hazard Log (docs/safety/HAZARD_LOG.md)
      E1.1.1: HZ-001 through HZ-007 mitigation evidence (§24.4)
      E1.1.2: Shadow validation report (≥30 day trial)

    G1.2: System integrity is maintained through all operational modes
      Sn2: HMAC integrity on all safety-critical records (§7.9)
      E1.2.1: `@pytest.mark.safety_critical` test suite — 100% pass
      E1.2.2: Integrity failure quarantine demonstrated (§56 E2E test)

    G1.3: Operators are trained and capable of correct system use
      Sn3: Operator Training Programme (§28.9)
      E1.3.1: Training completion records (operator_training_records table)
      E1.3.2: Reference scenario completion evidence

    G1.4: Degraded mode provides adequate notification for fallback
      Sn4: Degraded mode specification (§24.9)
      E1.4.1: ANSP communication plan activated in game day exercise (§26.8)

    G1.5: Regulatory obligations are met for the deployment jurisdiction
      Sn5: Means of Compliance document (§24.14)
      E1.5.1: Legal opinions for deployment jurisdictions (§24.2)
      E1.5.2: ANSP SMS integration guide (§24.15)

Safety case document: docs/safety/SAFETY_CASE.md. Version-controlled; each tagged release includes a safety case snapshot. Safety case review is required before:

  • ANSP shadow mode activation
  • Model version updates that affect prediction outputs
  • New deployment jurisdiction
  • Any change to alert thresholds (§24.8)

Safety case custodian: Named individual (Phase 2: CEO or CTO until a dedicated safety manager is appointed). Changes to the safety case require the custodian's sign-off.


24.13 Software Assurance Level (SAL) Assignment (F2 — §61)

EUROCAE ED-153 / DO-278A defines Software Assurance Levels for ground-based aviation software systems. The appropriate SAL determines the rigour of development, verification, and documentation activities required.

SpaceCom SAL assignment:

| Component | Failure Condition | Severity Class | SAL | Rationale |
|---|---|---|---|---|
| Re-entry prediction engine (physics/) | False all-clear (HZ-002) | Hazardous | SAL-2 | Undetected false negative could contribute to an airspace safety event; highest-consequence component |
| Alert generation pipeline (alerts/) | Failed alert delivery; wrong threshold applied | Hazardous | SAL-2 | Failure to generate a CRITICAL alert during an active event is equivalent in consequence to HZ-002 |
| HMAC integrity verification | Integrity failure undetected | Hazardous | SAL-2 | Loss of integrity checking removes the primary guard against data manipulation |
| CZML corridor rendering | Wrong geographic position displayed (HZ-004) | Hazardous | SAL-2 | Geographic display error directly misleads operator |
| API authentication and authorisation | Unauthorised data access (HZ-007) | Major | SAL-3 | Privacy and data governance impact; not directly causal of airspace event |
| Ingest pipeline (worker/) | Stale data not detected (HZ-005) | Major | SAL-3 | Staleness monitoring is a mitigation for HZ-005; failure of staleness monitoring increases HZ-005 likelihood |
| Frontend (non-safety-critical paths) | Cosmetic / non-operational UI failure | Minor | SAL-4 | Not in the safety-critical path |

SAL-2 implications (minimum activities required):

  • Independent verification of requirements, design, and code for SAL-2 components (see §24.16 Verification Independence)
  • Formal test coverage: 100% statement coverage for SAL-2 modules (enforced via @pytest.mark.safety_critical)
  • Configuration management of all SAL-2 source files and their test artefacts (see §30.8)
  • SAL-2 components documented in the safety case with traceability from requirement → design → code → test

SAL assignment document: docs/safety/SAL_ASSIGNMENT.md — reviewed at each architecture change and before any ANSP deployment.


24.14 Means of Compliance (MoC) Document (F8 — §61)

A Means of Compliance document maps each regulatory or standard requirement to the specific implementation evidence that demonstrates compliance. Required before any formal regulatory submission (ESA bid, EASA consultation response, ANSP safety acceptance).

Document: docs/safety/MEANS_OF_COMPLIANCE.md

Structure:

| Requirement ID | Source | Requirement Text (summary) | Means of Compliance | Evidence Location | Status |
|---|---|---|---|---|---|
| MOC-001 | EUROCAE ED-153 §5.3 | Software requirements defined and verifiable | Requirements documented in relevant §sections of MASTER_PLAN; acceptance criteria in TEST_PLAN | docs/TEST_PLAN.md; relevant §sections | PARTIAL |
| MOC-002 | EUROCAE ED-153 §6.4 | Independent verification of SAL-2 software | Verification independence policy (§24.16); separate reviewer for safety-critical PRs | docs/safety/VERIFICATION_INDEPENDENCE.md | PLANNED |
| MOC-003 | ICAO Annex 15 §3.2 | Data quality attributes characterised | ICAO data quality table (§24.3); accuracy characterisation document | docs/validation/ACCURACY_CHARACTERISATION.md | PARTIAL (Phase 3) |
| MOC-004 | ICAO Annex 19 | ANSP SMS integration supported | SMS integration guide; hazard register; training programme | docs/safety/ANSP_SMS_GUIDE.md; docs/safety/HAZARD_LOG.md | PLANNED |
| MOC-005 | EU AI Act Art. 9 | Risk management system documented | AI Act assessment; hazard log; safety case | legal/EU_AI_ACT_ASSESSMENT.md; docs/safety/HAZARD_LOG.md | IN PROGRESS |
| MOC-006 | DO-278A §10 | Configuration management of safety artefacts | CM policy (§30.8); Git tagging of releases; signed commits | docs/safety/CM_POLICY.md | PLANNED |
| MOC-007 | ED-153 §7.2 | Safety occurrence reporting procedure | Runbook in §26.8; SAFETY_OCCURRENCE log type | docs/runbooks/; security_logs table | IMPLEMENTED |

The MoC document is a Phase 2 deliverable. PARTIAL items become Phase 3 gates. PLANNED items require assigned owners and completion dates before ANSP shadow activation.


24.15 ANSP-Side Obligations Document (F10 — §61)

SpaceCom cannot unilaterally satisfy all regulatory requirements — the receiving ANSP has obligations that SpaceCom must document and communicate. Failing to do so is a gap in the safety argument.

Document: docs/safety/ANSP_SMS_GUIDE.md — provided to every ANSP before shadow mode activation.

ANSP obligations by category:

| Category | ANSP Obligation | SpaceCom Provides |
|---|---|---|
| SMS integration | Include SpaceCom in ANSP SMS under ICAO Annex 19 | Hazard register contribution (§24.4); SAL assignment; safety case |
| Change notification | Notify SpaceCom of any ANSP procedure changes that affect how SpaceCom outputs are used | Change notification contact in MSA |
| Operator training | Ensure all SpaceCom users complete the operator training programme (§28.9) | Training modules; completion API; training records |
| Fallback procedure | Maintain and exercise a fallback procedure for SpaceCom unavailability | Fallback procedure template in onboarding documentation |
| Occurrence reporting | Report any safety occurrence involving SpaceCom outputs to SpaceCom within 24 hours | Safety occurrence form; contact details; §26.8 runbook |
| Regulatory notification | Notify applicable safety regulator of SpaceCom use if required by national SMS regulations | System description one-pager for regulator submission |
| Shadow validation | Participate in ≥30-day shadow validation trial; provide evaluation feedback | Shadow validation report template; shadow validation dashboard |
| AUP acceptance | Ensure all users accept the AUP (§24.2) | Automated AUP flow; compliance report for ANSP admin |

Liability assignment note (links to §24.2 and §24.12 F11): The ANSP SMS guide explicitly states that the ANSP retains full operational authority and accountability for all air traffic decisions, regardless of SpaceCom outputs. SpaceCom is a decision support tool. This statement must appear in the ANSP SMS guide, the AUP, and the safety case context node C1 (§24.12).

25.1 Target Tender Profile

SpaceCom targets ESA tenders in the following programme areas:

  • Space Safety Programme — re-entry risk, SSA services, space debris
  • GSTP (General Support Technology Programme) — technology development with commercial potential
  • ARTES (Advanced Research in Telecommunications Systems) — if the commercial operator portal reaches satellite operators
  • Space-Air Traffic Integration studies — the category matching ESA's OKAPI:Orbits award

25.2 Differentiation from ESA ESOC Re-entry Prediction Service

ESA's re-entry prediction service (reentry.esoc.esa.int) is a technical product for space operators and agencies. SpaceCom is not a competitor to this service — it is a complementary operational layer that could consume ESOC outputs:

| Dimension | ESA ESOC Service | SpaceCom |
|---|---|---|
| Primary user | Space agencies, debris researchers | ANSPs, airspace managers, space operators |
| Output format | Technical prediction reports | Operational decision support + NOTAM drafts |
| Aviation integration | None | Core feature |
| ANSP decision workflow | Not designed for this | Primary design target |
| Space operator portal | Not provided | Phase 2 deliverable |
| Shadow mode / regulatory adoption | Not provided | Built-in |

In an ESA bid: Position SpaceCom as the user-facing operational layer that sits on top of the space surveillance and prediction infrastructure that ESA already operates. ESA invests in the physics; SpaceCom invests in the interface that makes the physics actionable for aviation authorities and space operators.

25.3 TRL Roadmap (ESA Definitions)

| Phase | End TRL | Evidence |
|---|---|---|
| Phase 1 complete | TRL 4 | Validated decay predictor (≥3 historical backcasts); SGP4 globe with real TLE data; Mode A corridors; HMAC integrity; full security infrastructure |
| Phase 2 complete | TRL 5 | Atmospheric breakup; Mode B heatmap; NOTAM drafting; space operator portal; CCSDS export; shadow mode; ≥1 ANSP shadow deployment running |
| Phase 3 complete | TRL 6 | System demonstrated in operationally relevant environment; ≥1 ANSP shadow deployment with ≥4 weeks validation data; external penetration test passed; ECSS compliance artefacts complete |
| Post-Phase 3 | TRL 7 | System prototype demonstrated in operational environment (live ANSP deployment, not shadow) |

25.4 ECSS Standards Compliance

ESA contracts require compliance with the European Cooperation for Space Standardization (ECSS). Required compliance mapping:

| Standard | Title | SpaceCom Compliance |
|---|---|---|
| ECSS-Q-ST-80C | Software Product Assurance | Software Management Plan, V&V Plan, Product Assurance Plan — produced Phase 3 |
| ECSS-E-ST-10-04C | Space environment | NRLMSISE-00 and JB2008 compliance with ECSS atmospheric model requirements |
| ECSS-E-ST-10-12C | Methods for re-entry and debris footprint calculation | Decay predictor and atmospheric breakup model methodology documented and traceable |
| ECSS-U-AS-010C | Space sustainability | Zero Debris Charter alignment statement; controlled re-entry planner outputs |

Compliance matrix document (produced Phase 3): Maps every ECSS requirement to the relevant SpaceCom component, test, or document. Required for ESA tender submission.

25.5 ESA Zero Debris Charter Alignment

SpaceCom directly supports the Zero Debris Charter objectives:

| Charter Objective | SpaceCom Support |
|---|---|
| Responsible end-of-life disposal | Controlled re-entry planner generates CCSDS-format manoeuvre plans minimising ground risk |
| Transparency of re-entry risk | Public hazard corridor data; NOTAM drafting; multi-ANSP coordination |
| Reduction of casualty risk | Atmospheric breakup model; casualty area computation; population density weighting in deorbit optimiser |
| Data sharing | API layer for space operator integration; CCSDS export; open prediction endpoints |

Include Zero Debris Charter alignment statement in all ESA bid submissions.

25.6 Required ESA Procurement Artefacts

All ESA contracts require these management documents. SpaceCom must produce them by Phase 3:

| Document | ECSS Reference | Content |
|---|---|---|
| Software Management Plan (SMP) | ECSS-Q-ST-80C §5 | Development methodology, configuration management, change control, documentation standards |
| Verification and Validation Plan (VVP) | ECSS-Q-ST-80C §6 | Test strategy, traceability from requirements to test cases, acceptance criteria |
| Product Assurance Plan (PAP) | ECSS-Q-ST-80C §4 | Safety, reliability, quality standards and how they are met |
| Data Management Plan (DMP) | ECSS-Q-ST-80C §8 | How data produced under contract is handled, shared, archived, and made reproducible |
| Software Requirements Specification (SRS) | Tailored ECSS-E-ST-40C | Software requirements baseline, interfaces, external dependencies, and bounded assumptions including air-risk and RDM exchange boundaries |
| Software Design Description (SDD) | Tailored ECSS-E-ST-40C | Module architecture, algorithm choices, interface contracts, and validation assumptions |
| User Manual / Ops Guide | Tailored ECSS-E-ST-40C | Installation, configuration, operator workflows, limitations, and degraded-mode handling |
| Test Plan + Test Report | Tailored ECSS-Q-ST-80C | Planned validation campaign, executed results, deviations, and acceptance evidence for procurement submission |
| Accessibility Conformance Report (ACR/VPAT 2.4) | EN 301 549 v3.2.1 | WCAG 2.1 AA conformance declaration; mandatory for EU public sector ICT procurement; maps each success criterion to Supports / Partially Supports / Does Not Support with remarks |

Scaffold documents for all procurement-facing artefacts should be created at Phase 1 start and maintained throughout development — not produced from scratch at Phase 3.

For contracts with explicit software prototype review gates (e.g. PDR, TRR, CDR, QR, FR), the SRS, SDD, User Manual, Test Plan, and Test Report are updated incrementally at each milestone rather than back-filled only at final review.

25.7 Consortium Strategy

ESA study contracts typically favour consortia that combine:

  • Technical depth (university or research institute)
  • Industrial relevance (commercial applicability)
  • End-user representation (the entity that will use the output)

SpaceCom's ideal consortium for an ESA bid:

  • SpaceCom (lead) — system integration, aviation domain interface, commercial deployment
  • Academic partner (orbital mechanics / atmospheric density modelling credibility — equivalent to TU Braunschweig in the OKAPI:Orbits consortium)
  • ANSP or aviation authority (end-user representation — demonstrates the aviation gap is real and the solution is wanted)

Without a credentialled academic or research partner for the physics components, ESA evaluators may question the technical depth. Identify and approach potential academic partners before submitting to any ESA tender.

25.8 Intellectual Property Framework for ESA Bids

ESA contracts operate under the ESA General Conditions of Contract, which distinguish between background IP (pre-existing IP brought into the contract) and foreground IP (IP created during the contract). The default terms grant ESA a non-exclusive, royalty-free licence to use foreground IP, while the contractor retains ownership. These terms are negotiable and must be agreed before contract signature.

Required IP actions before bid submission:

  1. Background IP schedule: Document all SpaceCom components that constitute background IP — physics engine, data model, UX design, proprietary algorithms. This schedule protects SpaceCom's ability to continue commercial deployment after the ESA contract ends without ESA claiming rights to the core product.

  2. Foreground IP boundary: Define clearly what will be created during the ESA contract (e.g., specific ECSS compliance artefacts, validation datasets, TRL demonstration reports) versus what SpaceCom brings in as background IP. Narrow the foreground IP scope to ESA-specific deliverables only.

  3. Software Bill of Materials (SBOM): Required for ECSS compliance and as part of the ESA bid artefact package. Generated via syft or cyclonedx-bom. Must identify all third-party licences. AGPLv3-licensed components cannot appear in the SBOM of a closed-source ESA deliverable without a commercial licence; verify every dependency's licence individually rather than by reputation (CesiumJS itself, for example, is Apache-2.0, but visualisation and geospatial dependencies are a common source of copyleft terms).

  4. Consortium Agreement: Must be signed by all consortium members before bid submission. Must specify:

    • IP ownership for each consortium member's contributions
    • Publication rights for academic partners (must not conflict with any commercial confidentiality obligations)
    • Revenue share for any commercial use arising from the contract
    • Liability allocation between consortium members
    • Exit terms if a member withdraws
  5. Export control pre-clearance: Confirm with counsel that the planned ESA deliverable does not require an export licence for transfer to ESA (a Paris-based intergovernmental organisation). Generally covered under EAR licence exception GOV, but verify for any controlled technology components.
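
The SBOM licence screen from step 3 can be sketched as a small check over CycloneDX JSON; the `components`/`licenses` layout below is the standard CycloneDX structure, while the function name is illustrative:

```python
# Illustrative check over a CycloneDX JSON SBOM (as produced by syft or
# cyclonedx-bom) flagging AGPL-licensed components before an ESA submission.
import json

def agpl_components(sbom_json: str) -> list[str]:
    """Return names of components carrying an AGPL SPDX licence id."""
    sbom = json.loads(sbom_json)
    flagged = []
    for comp in sbom.get("components", []):
        for lic in comp.get("licenses", []):
            lic_id = lic.get("license", {}).get("id", "")
            if lic_id.startswith("AGPL"):
                flagged.append(comp.get("name", "<unnamed>"))
    return flagged
```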


26. SRE and Reliability Framework

26.1 Service Level Objectives

SpaceCom is most critical during active re-entry events — peak load coincides with highest operational stakes. Standard availability metrics are insufficient. SLOs must be defined against event-correlated conditions, not just averages.

| Service Level Indicator | SLO | Measurement Window | Notes |
|---|---|---|---|
| Prediction API availability | 99.9% | Rolling 30 days | 43.2 min error budget per 30-day window |
| Prediction API availability (active TIP event) | 99.95% | Duration of TIP window | Stricter; degradation during events is SEV-1 |
| Decay prediction latency | p50 < 90s | Per MC job | 500-sample chord run |
| Decay prediction latency | p95 < 240s | Per MC job | Drives worker sizing (§27) |
| CZML ephemeris load | p95 < 2s | Per request | 100-object catalog |
| TIP message ingest latency | < 30 min from publication | Per TIP message | Drives CRITICAL alert timing |
| Space weather update latency | < 15 min from NOAA SWPC | Per update cycle | Drives uncertainty multiplier refresh |
| Alert WebSocket delivery latency | < 10s from trigger | Per alert | Measured trigger→client receipt |
| Corridor update after new TIP | < 60 min | Per TIP message | Full MC rerun triggered |

Error budget policy: When the 30-day rolling error budget is exhausted, no further deployments or planned maintenance are permitted until the rolling budget has recovered. Tracked in the Grafana SLO dashboard (§26.8).
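
The availability rows above reduce to simple arithmetic; this illustrative helper reproduces the 30-day error budget for the 99.9% target:

```python
# Error-budget arithmetic for the availability SLOs above:
# a 99.9% target over a rolling 30-day window leaves
# (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes of budget.
def error_budget_minutes(slo: float, window_days: float) -> float:
    return (1.0 - slo) * window_days * 24 * 60
```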

SLOs must be written into the model user agreement (§24.2) and agreed with each ANSP customer before operational deployment. ANSPs need defined thresholds to determine when to activate their fallback procedures.

Customer-facing SLA (Finding 7) — contractual commitments in the MSA:

Internal SLOs are aspirational targets; the SLA is a binding contractual commitment with defined measurement, exclusions, and credits. The MSA template includes the following SLA schedule:

| Metric | SLA commitment | Measurement | Exclusions |
|---|---|---|---|
| Monthly availability | 99.5% | External uptime monitor; excludes scheduled maintenance (max 4h/month; 48h advance notice) | Force majeure; upstream data source outages (Space-Track, NOAA SWPC) lasting > 4h |
| Critical alert delivery | Within 5 minutes of trigger (p95) | alert_events.created_at → delivered_websocket/email = TRUE timestamp | Customer network connectivity issues |
| Prediction freshness | p50 updated within 4h of new TLE availability | tle_sets.ingested_at → reentry_predictions.created_at | Space-Track API outage > 4h |
| Support response — CRITICAL incident | Initial response within 1 hour | From customer report or automated alert, whichever earlier | Outside contracted support hours (on-call for CRITICAL) |
| Support response — P1 resolution | Within 8 hours | From initial response | |
| Service credits | 1 day credit per 0.1% availability below SLA | Applied to next invoice | |

Any SRE threshold change that could cause an SLA breach (e.g., raising the ingest failure alert threshold beyond 4 hours) must be reviewed by the product owner before deployment. Tracked in docs/sla/sla-schedule-v{N}.md (versioned; MSA references the current version by number).
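
The service-credit line in the SLA schedule can be made concrete. This sketch assumes "per 0.1%" means whole 0.1-percentage-point increments of the monthly availability shortfall; the function name and rounding choice are illustrative, not contract language:

```python
# Illustrative service-credit computation for the SLA schedule above:
# 1 day of credit per full 0.1 percentage point below the 99.5% commitment.
# The "whole increments" interpretation is an assumption, not MSA wording.
def service_credit_days(measured_availability: float, sla: float = 0.995) -> int:
    shortfall_pp = max(0.0, (sla - measured_availability) * 100)  # percentage points
    return int(shortfall_pp / 0.1 + 1e-9)  # count whole 0.1pp increments
```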


26.2 Recovery Objectives

| Objective | Target | Scope | Derivation |
|---|---|---|---|
| RTO (active TIP event) | ≤ 15 minutes | Prediction API restoration | CRITICAL alert rate-limit window is 4 hours per object; a 15-minute outage is tolerable within this window without skipping a CRITICAL cycle; beyond 15 minutes the ANSP must activate fallback procedures |
| RTO (no active event) | ≤ 60 minutes | Full system restoration | 1-hour window aligns with MSA SLA commitment; exceeding this triggers the P1 communication plan |
| RPO (safety-critical tables) | Zero | reentry_predictions, alert_events, security_logs, notam_drafts — synchronous replication required | UN Liability Convention evidentiary requirements; loss of a single alert acknowledgement record could be material in a liability investigation |
| RPO (operational data) | ≤ 5 minutes | orbits, tle_sets, simulations — async replication acceptable | 5-minute data age is within the staleness tolerance for TLE-based predictions; loss of in-flight simulations is recoverable by re-submission |

MSA sign-off requirement: RTO and RPO targets must be explicitly stated and agreed in the Master Services Agreement with each ANSP customer before any production deployment. Customers must acknowledge that the fallback procedure (Space-Track direct + ESOC public re-entry page) is their responsibility during the RTO window. RTO/RPO targets are not unilaterally changeable by SpaceCom — any tightening requires customer notification ≥30 days in advance; any relaxation requires customer consent.


26.3 High Availability Architecture

TimescaleDB — Streaming Replication + Patroni

```yaml
# Primary + hot standby; Patroni manages leader election and failover
db_primary:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
    PATRONI_REPLICATION_USERNAME: replicator
  networks: [db_net]

db_standby:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_REPLICA: "true"
  networks: [db_net]

etcd:
  image: bitnami/etcd:3   # Patroni DCS
  networks: [db_net]
```

  • Synchronous replication for reentry_predictions, alert_events, security_logs, notam_drafts (RPO = 0): synchronous_standby_names = 'FIRST 1 (db_standby)' with table-level synchronous commit override
  • Asynchronous replication for orbits, tle_sets (RPO ≤ 5 min): default async
  • Patroni auto-failover: standby promoted within ~30s of primary failure, well within the 15-minute RTO

Required Patroni configuration parameters (must be present in patroni.yml; CI validation via scripts/check_patroni_config.py):

```yaml
bootstrap:
  dcs:
    maximum_lag_on_failover: 1048576    # 1 MB; standby > 1 MB behind primary is excluded from failover election
    synchronous_mode: true              # Enable synchronous replication mode
    synchronous_mode_strict: true       # Primary refuses writes if no synchronous standby confirmed; prevents split-brain

postgresql:
  parameters:
    wal_level: replica                  # Required for streaming replication; 'minimal' breaks replication
    recovery_target_timeline: latest    # Follow timeline switches after failover; required for correct standby behaviour
```

Rationale:

  • maximum_lag_on_failover: without this, a severely lagged standby could be promoted as primary and serve stale data for safety-critical tables.
  • synchronous_mode_strict: true: trades availability for consistency — primary halts rather than allowing an unconfirmed write to proceed without a standby. Acceptable given 15-minute RTO SLO.
  • wal_level: replica: minimal disables the WAL detail needed for streaming replication; must be explicitly set.
  • recovery_target_timeline: latest: without this, a promoted standby after failover may not follow future timeline switches, causing divergence.
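
A minimal sketch of what scripts/check_patroni_config.py might assert, assuming patroni.yml has already been parsed into a dict (the real script would load the YAML file first; the function name and structure are illustrative):

```python
# Hypothetical core of scripts/check_patroni_config.py: verify the required
# Patroni parameters above are present with the mandated values. Takes a
# parsed config dict; a real script would yaml.safe_load("patroni.yml") first.
REQUIRED = {
    ("bootstrap", "dcs", "maximum_lag_on_failover"): 1048576,
    ("bootstrap", "dcs", "synchronous_mode"): True,
    ("bootstrap", "dcs", "synchronous_mode_strict"): True,
    ("postgresql", "parameters", "wal_level"): "replica",
    ("postgresql", "parameters", "recovery_target_timeline"): "latest",
}

def config_violations(cfg: dict) -> list[str]:
    """Return a description of every required parameter that is missing or wrong."""
    problems = []
    for path, expected in REQUIRED.items():
        node = cfg
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if node != expected:
            problems.append(f"{'.'.join(path)} != {expected!r}")
    return problems
```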

Redis — Sentinel (3 Nodes)

```yaml
redis-master:
  image: redis:7-alpine
  command: redis-server /etc/redis/redis.conf
redis-sentinel-1:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-2:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-3:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
```

Three Sentinel instances form a quorum. If the master fails, Sentinel promotes a replica within ~10s. The backend and workers use redis-py's Sentinel client which transparently follows the master after failover.

Redis Sentinel split-brain risk assessment (F3 — §67): In a network partition where Sentinel nodes disagree on master reachability, two Sentinels could theoretically promote two different replicas simultaneously. The min-replicas-to-write 1 setting on the Redis master (a server directive, not a Sentinel one) mitigates this: the old master stops accepting writes when it loses contact with its replicas, forcing clients to the new master.

SpaceCom's Redis data is largely ephemeral — Celery broker messages, WebSocket session state, application cache. A split-brain that loses a small number of Celery tasks or cache entries is survivable. The one persistent concern is the per-org email rate limit counter (spacecom:email_rate:{org_id}:{hour}, §65 F7): a split-brain could result in two independent counters, both allowing up to 50 emails, for a brief period before the split resolves. This is accepted: the 50/hr limit is a cost control, not a safety guarantee. Email volume during a short Sentinel split-brain is not a safety risk.

Risk acceptance and configuration: Set sentinel.conf values:

sentinel down-after-milliseconds spacecom-redis 5000
sentinel failover-timeout spacecom-redis 60000
sentinel parallel-syncs spacecom-redis 1
min-replicas-to-write 1
min-replicas-max-lag 10

ADR: docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md

Cross-Region Disaster Recovery — Warm Standby (F7)

Single-region deployment cannot meet the RTO ≤ 60 minutes target against a full cloud region failure. A warm standby in a second region provides the required recovery path.

Strategy: Warm standby (not hot active-active) — reduces cost and complexity while meeting RTO.

| Component | Primary region | DR region | Failover mechanism |
|---|---|---|---|
| TimescaleDB | Primary + hot standby | Read replica (streaming replication from primary) | Promote replica; update DNS; runbook: db-failover-dr |
| Application tier | Running | Stopped; container images pre-pulled from GHCR | Deploy from images on failover; < 10 minutes |
| MinIO (object storage) | Active | Active (bucket replication enabled) | Already in sync; no failover needed |
| Redis | Active | Cold (config ready) | Restart on failover; session loss acceptable (operators re-authenticate) |
| DNS | Primary A record | Secondary A record in Route 53 (or equiv.) | Health-check-based routing; TTL 60s; auto-failover on primary health check failure |

Failover time estimate: DB promotion 2–5 minutes + DNS propagation 1 minute + app deploy 10 minutes ≈ 15 minutes (within the RTO for an active TIP event).

Runbook: docs/runbooks/region-failover.md — tested annually as game day scenario 6. Post-failover checklist: verify HMAC validation on restored primary; verify WAL integrity; notify ANSPs of region switch; schedule return to primary region within 48 hours.


26.4 Celery Reliability

Task Acknowledgement and Crash Safety

# celeryconfig.py
task_acks_late = True            # Task not acknowledged until complete; if worker dies mid-task, task is requeued
task_reject_on_worker_lost = True  # Orphaned tasks requeued, not dropped
task_serializer = 'json'
result_expires = 86400           # Results expire after 24h; database is the durable store
worker_prefetch_multiplier = 1   # F6 §58: long MC tasks (up to 240s) — prefetch=1 prevents worker A
                                 # holding 4 tasks while workers B/C/D are idle; fair distribution

Dead Letter Queue

Failed tasks (exception, timeout, or permanent error) must be captured, not silently dropped:

# In Celery task base class
class SpaceComTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Update simulations table to status='failed'
        update_simulation_status(task_id, 'failed', error_detail=str(exc))
        # Route to dead letter queue for inspection
        dead_letter_queue.rpush('dlq:failed_tasks', json.dumps({
            'task_id': task_id, 'task_name': self.name,
            'error': str(exc), 'failed_at': utcnow().isoformat()
        }))
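For the on-call engineer inspecting that queue, a hypothetical triage helper (name and shape illustrative, not an existing module) that groups DLQ payloads by task name:

```python
import json
from collections import Counter

def summarise_dlq(raw_entries: list[str]) -> dict[str, int]:
    """Group dead-lettered task payloads (JSON strings, as written by
    SpaceComTask.on_failure) by task name; undecodable entries are counted
    separately rather than dropped."""
    counts = Counter()
    for raw in raw_entries:
        try:
            counts[json.loads(raw)["task_name"]] += 1
        except (json.JSONDecodeError, KeyError):
            counts["<unparseable>"] += 1
    return dict(counts)

# Usage: summarise_dlq(redis.lrange("dlq:failed_tasks", 0, -1))
```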

Queue Routing (Ingest vs Simulation Isolation)

CELERY_TASK_ROUTES = {
    'modules.ingest.*':       {'queue': 'ingest'},
    'modules.propagator.*':   {'queue': 'simulation'},
    'modules.breakup.*':      {'queue': 'simulation'},
    'modules.conjunction.*':  {'queue': 'simulation'},
    'modules.reentry.controlled.*': {'queue': 'simulation'},
}

Two separate worker processes — never competing on the same queue:

# Ingest worker: always running, low concurrency
celery worker --queue=ingest --concurrency=2 --hostname=ingest@%h

# Simulation worker: high concurrency for MC sub-tasks (see §27.2)
celery worker --queue=simulation --concurrency=16 --pool=prefork --hostname=sim@%h

Per-organisation priority isolation (F8): All organisations share the simulation queue, but job priority is set at submission time based on subscription tier and event criticality. This prevents a shadow_trial org's bulk simulation from starving a CRITICAL alert computation for an ansp_operational org.

TIER_TASK_PRIORITY = {
    "internal": 9,
    "institutional": 8,
    "ansp_operational": 7,
    "space_operator": 5,
    "shadow_trial": 3,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2  # added when active TIP event exists for the org's objects

def get_task_priority(org_tier: str, has_active_tip: bool) -> int:
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    return min(10, base + (CRITICAL_EVENT_PRIORITY_BOOST if has_active_tip else 0))

# At submission:
task.apply_async(priority=get_task_priority(org.subscription_tier, active_tip))

Redis with maxmemory-policy noeviction supports Celery task priorities (F9); on the Redis broker, priorities are emulated via per-priority sub-queues rather than a native broker feature. Workers process higher-priority tasks first when multiple tasks are queued. Ingest tasks always route to the separate ingest queue and are unaffected by simulation priority.
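Celery only honours apply_async(priority=...) on the Redis broker when the transport options enable the sub-queue emulation. A sketch of the celeryconfig.py addition this assumes (option names should be verified against the Celery/kombu versions in use):

```python
# celeryconfig.py — sketch; Redis-broker priority is emulated with split queues.
broker_transport_options = {
    "priority_steps": list(range(10)),   # one sub-queue per priority level 0-9
    "sep": ":",                          # separator used in sub-queue names
    "queue_order_strategy": "priority",  # drain higher-priority sub-queues first
}
```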

Celery Beat — High Availability with celery-redbeat

Standard Celery Beat is a single-process SPOF. celery-redbeat stores the schedule in Redis with distributed locking — multiple Beat instances can run; only one holds the lock at a time:

CELERY_BEAT_SCHEDULER = 'redbeat.RedBeatScheduler'
REDBEAT_REDIS_URL = settings.redis_url
REDBEAT_LOCK_TIMEOUT = 60        # 60s; crashed leader blocks scheduling for at most 60s
REDBEAT_MAX_SLEEP_INTERVAL = 5   # standby instances check for lock every 5s after TTL expiry

The default REDBEAT_LOCK_TIMEOUT = max_interval × 5 (typically 25 minutes) is too long during active TIP events — a crashed Beat leader would prevent TIP polling for up to 25 minutes. At 60 seconds, a failover causes at most a 60-second scheduling gap. The standby Beat instance acquires the lock within 5 seconds of TTL expiry (REDBEAT_MAX_SLEEP_INTERVAL = 5).

During an active TIP window (spacecom_active_tip_events > 0), the AlertManager rule for TIP ingest failure uses a 10-minute threshold rather than the baseline 4-hour threshold — ensuring a Beat failover gap does not silently miss critical TIP updates.


26.5 Health Checks

Every service exposes two endpoints. Docker Compose depends_on: condition: service_healthy uses these — the backend does not start until the database is healthy.

Liveness probe (GET /healthz) — process is alive; returns 200 whenever the process can respond, without checking dependencies.

Readiness probe (GET /readyz) — process is ready to serve traffic:

from fastapi.responses import JSONResponse
from sqlalchemy import text

@app.get("/readyz")
async def readiness(db: AsyncSession = Depends(get_db)):
    checks = {}

    # Database connectivity
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Redis connectivity
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"

    # Data freshness
    tle_age = await get_oldest_active_tle_age_hours()
    sw_age = await get_space_weather_age_hours()
    eop_age = await get_eop_age_days()
    airac_age = await get_airspace_airac_age_days()
    checks["tle_age_hours"] = tle_age
    checks["space_weather_age_hours"] = sw_age
    checks["eop_age_days"] = eop_age
    checks["airac_age_days"] = airac_age

    degraded = []
    if checks["database"] != "ok" or checks["redis"] != "ok":
        return JSONResponse(status_code=503, content={"status": "unavailable", "checks": checks})
    if tle_age > 6:
        degraded.append("tle_stale")
    if sw_age > 4:
        degraded.append("space_weather_stale")
    if eop_age > 7:
        degraded.append("eop_stale")       # IERS-A older than 7 days; frame transform accuracy degraded
    if airac_age > 28:
        degraded.append("airspace_stale")  # AIRAC cycle missed

    status_code = 207 if degraded else 200
    return JSONResponse(status_code=status_code, content={
        "status": "degraded" if degraded else "ok",
        "degraded": degraded, "checks": checks
    })

The 207 Degraded response triggers the staleness banner in the UI (§24.8) without taking the service offline. The load balancer treats 207 as healthy (traffic continues); the operational banner warns users.

Renderer service health check — the renderer container runs Playwright/Chromium. If Chromium hangs (a known Playwright failure mode), the container process stays alive and appears healthy while all report generation jobs silently time out. The renderer GET /healthz must verify Chromium can respond, not just that the Python process is alive:

# renderer/app/health.py
import asyncio
from playwright.async_api import async_playwright
from fastapi.responses import JSONResponse

async def health_check():
    """Liveness probe: verify Chromium can launch and load a blank page within 5s."""
    try:
        async with async_playwright() as p:
            browser = await asyncio.wait_for(p.chromium.launch(), timeout=5.0)
            page = await browser.new_page()
            await asyncio.wait_for(page.goto("about:blank"), timeout=3.0)
            await browser.close()
        return {"status": "ok", "chromium": "responsive"}
    except Exception:  # TimeoutError from the waits, or any Chromium launch failure
        renderer_chromium_restarts.inc()
        return JSONResponse({"status": "chromium_unresponsive"}, status_code=503)

Docker Compose healthcheck for renderer:

renderer:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8001/healthz"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 15s

If the healthcheck fails 3 times consecutively, Docker restarts the renderer container. The renderer_chromium_restarts_total counter increments on each failed probe, and its increase triggers the RendererChromiumUnresponsive alert.

Degraded state in GET /readyz for API clients and SWIM (Finding 7): The degraded array in the response is the machine-readable signal for any automated integration (Phase 3 SWIM, API polling clients). API clients must not scrape the UI to determine system state — the health endpoint is the authoritative source. Response fields:

| Field | Type | Meaning |
|---|---|---|
| status | "ok" \| "degraded" \| "unavailable" | Overall system state |
| degraded | string[] | Active degradation reasons: "tle_stale", "space_weather_stale", "ingest_source_failure", "prediction_service_overloaded" |
| degraded_since | ISO8601 \| null | Timestamp of when the current degraded state began (from degraded_mode_events) |
| checks | object | Per-subsystem check results |

Every transition into or out of degraded state is written to degraded_mode_events (see §9.2). NOTAM drafts generated while status = "degraded" have generated_during_degraded = TRUE and the draft (E) field includes: NOTE: GENERATED DURING DEGRADED DATA STATE - VERIFY INDEPENDENTLY BEFORE ISSUANCE.
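For integrators polling GET /readyz, a minimal sketch of client-side handling of these fields (the action names are the client's own policy, not part of the API):

```python
def classify_readyz(payload: dict) -> str:
    """Map a /readyz JSON body to a client-side action.
    Returns one of: 'proceed', 'proceed_with_warning', 'halt'."""
    status = payload.get("status", "unavailable")
    if status == "unavailable":
        return "halt"                      # 503: core dependencies down
    if status == "degraded" or payload.get("degraded"):
        return "proceed_with_warning"      # 207: serve, but surface data staleness
    return "proceed"                       # 200: all checks ok
```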

Docker Compose health check definitions:

backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s

db:
  healthcheck:
    # pg_isready alone passes before the spacecom database and TimescaleDB extension are loaded.
    # This check verifies that the application database is accessible and TimescaleDB is active
    # before any dependent service (pgbouncer, backend) is marked healthy.
    test: ["CMD-SHELL",
           "psql -U spacecom_app -d spacecom -c 'SELECT 1 FROM timescaledb_information.hypertables LIMIT 1'"]
    interval: 5s
    timeout: 3s
    retries: 10
    start_period: 30s   # TimescaleDB extension load and initial setup can take up to 20s

pgbouncer:
  depends_on:
    db:
      condition: service_healthy
  healthcheck:
    test: ["CMD-SHELL", "psql -h localhost -p 5432 -U spacecom_app -d spacecom -c 'SELECT 1'"]
    interval: 5s
    timeout: 3s
    retries: 5

26.6 Backup and Restore

Continuous WAL Archiving (RPO = 0 for critical tables)

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'mc cp %p minio/wal-archive/$(hostname)/%f'  # MinIO via mc client
archive_timeout = 60  # Force WAL segment every 60s even if no writes

Daily Base Backup

pg_basebackup is a PostgreSQL client tool that is not present in the Python runtime worker image. The backup must run in a dedicated sidecar container that has PostgreSQL client tools installed, invoked by the Celery Beat task via docker compose run:

# docker-compose.yml — backup sidecar (no persistent service; run on demand)
services:
  db-backup:
    image: timescale/timescaledb:2.14-pg17   # same image as db; has pg_basebackup
    entrypoint: []
    command: >
      sh -c "pg_basebackup -h db -U postgres -D /backup
             --format=tar --compress=9 --wal-method=stream &&
             mc cp /backup/*.tar.gz minio/db-backups/base-$(date +%F)/"
    networks: [db_net]
    volumes:
      - backup_scratch:/backup
    profiles: [backup]    # not started by default; invoked explicitly
    environment:
      PGPASSWORD: ${POSTGRES_PASSWORD}
      MC_HOST_minio: http://${MINIO_ACCESS_KEY}:${MINIO_SECRET_KEY}@minio:9000

volumes:
  backup_scratch:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=20g    # large enough for compressed base backup

The Celery Beat task triggers the sidecar via the Docker socket (backend container must have /var/run/docker.sock mounted in development — not in production). In production (Tier 2+), use a dedicated cron job on the host:

# /etc/cron.d/spacecom-backup — runs outside Docker, uses Docker CLI
0 2 * * * root docker compose -f /opt/spacecom/docker-compose.yml \
  --profile backup run --rm db-backup >> /var/log/spacecom-backup.log 2>&1

The Celery Beat task in production polls MinIO for today's backup object to verify completion, and fires an alert if it is absent by 03:00 UTC:

# Celery Beat: daily at 03:00 UTC (verification, not execution)
@celery.task
def verify_daily_backup():
    """Verify today's base backup exists in MinIO; alert if absent."""
    # Backups land under the prefix base-YYYY-MM-DD/ (see the sidecar's mc cp
    # target), so list by prefix rather than stat-ing a single key.
    prefix = f"base-{utcnow().date()}/"
    if any(minio_client.list_objects("db-backups", prefix=prefix, recursive=True)):
        structlog.get_logger().info("backup_verified", prefix=prefix)
    else:
        structlog.get_logger().error("backup_missing", prefix=prefix)
        alert_admin(f"Daily base backup missing: {prefix}")
        raise RuntimeError(f"backup missing: {prefix}")  # marks task as FAILED in Celery result backend

Monthly Restore Test

# Celery Beat: first Sunday of each month at 03:00 UTC
@celery.task
def monthly_restore_test():
    """Restore latest backup to ephemeral container; run test suite; alert on failure."""
    # 1. Spin up a test TimescaleDB container from latest base backup + WAL
    # 2. Run db/test_restore.py: verify row counts, hypertable integrity, HMAC spot-checks
    # 3. Tear down container
    # 4. Log result to security_logs; alert admin if test fails

If the monthly restore test fails, the failure is treated as SEV-2. The incident is not resolved until a successful restore is verified.
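Step 2's row-count verification can be a plain comparison against counts captured at the restore target timestamp. A hypothetical helper for db/test_restore.py (table list and tolerance are illustrative):

```python
def compare_row_counts(manifest: dict[str, int], restored: dict[str, int],
                       tolerance: float = 0.01) -> list[str]:
    """Compare restored row counts against counts captured at the restore
    target timestamp. Safety-critical tables must match exactly; hypertables
    may drift within `tolerance` (rows written around the WAL cut-off)."""
    EXACT = {"reentry_predictions", "alert_events", "notam_drafts", "security_logs"}
    failures = []
    for table, expected in manifest.items():
        got = restored.get(table, 0)
        if table in EXACT:
            if got != expected:
                failures.append(f"{table}: {got} != {expected} (exact match required)")
        elif expected and abs(got - expected) / expected > tolerance:
            failures.append(f"{table}: {got} vs {expected} (> {tolerance:.0%} drift)")
    return failures
```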

WAL retention: 30 days of WAL segments retained in MinIO; base backups retained for 90 days; reentry_predictions, alert_events, notam_drafts, security_logs additionally archived to cold storage for 7 years (MinIO lifecycle policy, separate bucket with Object Lock COMPLIANCE mode — prevents deletion even by bucket owner).

Application log retention policy (F10 — §57):

| Log tier | Storage | Retention | Rationale |
|---|---|---|---|
| Container stdout (json-file) | Docker log driver on host | 7 days (max-size=100m, max-file=5) | Short-lived; Promtail ships to Loki in Tier 2+ |
| Loki (structured application logs) | Grafana Loki | 90 days | Covers 30-day incident investigation SLA with headroom |
| Safety-relevant log lines (level=CRITICAL, security_logs events, alert-related log lines) | MinIO append-only bucket | 7 years (same as database safety records) | Regulatory parity with alert_events 7-year hold; NIS2 Art. 23 evidence requirement |
| SIEM-forwarded events | External SIEM (customer-specified) | Per customer contract | ANSP customers may have their own retention obligations |

Loki retention is set in monitoring/loki-config.yml:

limits_config:
  retention_period: 2160h   # 90 days
compactor:
  retention_enabled: true

Safety-relevant log shipping: a Promtail pipeline stage tags log lines with the label safety_critical=true when level=CRITICAL or the logger name contains alert or security. A separate Loki ruler rule ships these to MinIO via a Loki-to-S3 connector (Phase 2). Phase 1 interim: a Celery Beat task exports CRITICAL log lines from Loki to MinIO daily.
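One possible shape for that Promtail stage, assuming structlog emits JSON lines with level and logger fields (stage and template-function syntax should be verified against the deployed Promtail version):

```yaml
# monitoring/promtail-config.yml (sketch)
pipeline_stages:
  - json:
      expressions:
        level: level
        logger: logger
  - template:
      source: safety_critical
      template: '{{ if or (eq .level "CRITICAL") (regexMatch "alert|security" .logger) }}true{{ end }}'
  - labels:
      safety_critical:
```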

Restore time target: Full restore to latest WAL segment in < 30 minutes (tested monthly). This satisfies the RTO ≤ 60 minutes (no active event) with 30 minutes headroom for DNS propagation and smoke tests. Documented step-by-step in docs/runbooks/db-restore.md (Phase 2 deliverable).

Retention Schedule

-- Online retention (TimescaleDB compression + drop policies)
SELECT add_compression_policy('orbits', INTERVAL '7 days');
SELECT add_retention_policy('orbits', INTERVAL '90 days');   -- Archive before drop; see below
SELECT add_retention_policy('space_weather', INTERVAL '2 years');
SELECT add_retention_policy('tle_sets', INTERVAL '1 year');

-- Archival pipeline: Celery task runs before each chunk drop
-- Exports chunk to Parquet in MinIO cold storage before TimescaleDB drops it
-- Legal hold: reentry_predictions, alert_events, notam_drafts, shadow_validations → 7 years
-- No retention policy on these tables; MinIO lifecycle rule retains for 7 years

26.7 Prometheus Metrics

Metrics must be instrumented from Phase 1 — not added at Phase 3 as an afterthought. Business-level metrics are more important than infrastructure metrics for this domain.

Metric naming convention (F1 — §57):

All custom metrics must follow {namespace}_{subsystem}_{name}_{unit} with these rules:

| Rule | Compliant example | Non-compliant example |
|---|---|---|
| Namespace is always spacecom_ | spacecom_ingest_success_total | ingest_success |
| Unit suffix required (Prometheus base units) | spacecom_simulation_duration_seconds | spacecom_simulation_duration |
| Counters end in _total | spacecom_hmac_verification_failures_total | spacecom_hmac_failures |
| Gauges end in _seconds, _bytes, _ratio, or a domain unit | spacecom_celery_queue_depth | spacecom_queue |
| Histograms end in _seconds or _bytes | spacecom_alert_delivery_latency_seconds | spacecom_alert_latency |
| Labels use snake_case | queue_name, source | queueName, Source |
| High-cardinality fields are NEVER labels | — | norad_id, organisation_id, user_id, request_id as Prometheus labels |
| Per-object drill-down uses recording rules | spacecom:tle_age_hours:max recording rule | spacecom_tle_age_hours{norad_id="25544"} alerted directly |

High-cardinality identifiers belong in log fields (structlog) or Prometheus exemplars — not in metric labels. A metric with an unbounded label creates one time series per unique value and will OOM Prometheus at scale.
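This rule is mechanically checkable. A hypothetical guard for a unit test that introspects the metrics module (norad_id is deliberately excluded here because the per-object gauges in this section use it as a bounded, drill-down-only label):

```python
FORBIDDEN_LABEL_NAMES = {"organisation_id", "user_id", "request_id"}

def validate_metric_labels(metric_name: str, label_names: list[str]) -> None:
    """Reject metric definitions that use unbounded identifiers as labels;
    each unique value would create a new Prometheus time series."""
    bad = FORBIDDEN_LABEL_NAMES.intersection(label_names)
    if bad:
        raise ValueError(
            f"{metric_name}: high-cardinality label(s) {sorted(bad)} — "
            "use structlog fields or exemplars instead (naming convention F1)"
        )
```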

Business-level metrics (custom — most critical):

# Phase 1 — instrument from day 1
from prometheus_client import Counter, Gauge, Histogram

active_tip_events    = Gauge('spacecom_active_tip_events', 'Objects with active TIP messages')
prediction_age       = Gauge('spacecom_prediction_age_seconds', 'Age of latest prediction per object',
                           ['norad_id'])  # per-object label: Grafana drill-down only; alert via recording rule
tle_age              = Gauge('spacecom_tle_age_hours', 'TLE data age per object',
                           ['norad_id', 'source'])  # source label lets TipIngestStale select source="tip"
ingest_success       = Counter('spacecom_ingest_success_total', 'Successful ingest runs', ['source'])
ingest_failure       = Counter('spacecom_ingest_failure_total', 'Failed ingest runs', ['source'])
hmac_failures        = Counter('spacecom_hmac_verification_failures_total', 'HMAC check failures')
simulation_duration  = Histogram('spacecom_simulation_duration_seconds', 'MC run duration', ['module'],
                           buckets=[30, 60, 90, 120, 180, 240, 300, 600])
alert_delivery_lat   = Histogram('spacecom_alert_delivery_latency_seconds', 'Alert trigger → WS receipt',
                           buckets=[1, 2, 5, 10, 15, 20, 30, 60])
ws_connected         = Gauge('spacecom_ws_connected_clients', 'Active WebSocket connections', ['instance'])
celery_queue_depth   = Gauge('spacecom_celery_queue_depth', 'Tasks waiting in queue', ['queue'])
dlq_depth            = Gauge('spacecom_dlq_depth', 'Tasks in dead letter queue')
renderer_active_jobs = Gauge('renderer_active_jobs', 'Reports being generated')
renderer_job_dur     = Histogram('renderer_job_duration_seconds', 'Report generation time',
                           buckets=[2, 5, 10, 15, 20, 25, 30])
renderer_chromium_restarts = Counter('renderer_chromium_restarts_total', 'Chromium process restarts')

SLI recording rules — pre-aggregate before alerting; avoids per-object flooding (Finding 1, 7):

# monitoring/recording-rules.yml
groups:
  - name: spacecom_sli
    rules:
      # SLI: API availability (non-5xx fraction) — feeds availability SLO
      - record: spacecom:api_availability:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

      # SLI: max TLE age across all objects (single series; alertable without flooding)
      - record: spacecom:tle_age_hours:max
        expr: max(spacecom_tle_age_hours)

      # SLI: count of objects with stale TLEs (for dashboard)
      - record: spacecom:tle_stale_objects:count
        expr: count(spacecom_tle_age_hours > 6) or vector(0)

      # SLI: max prediction age across active TIP objects
      - record: spacecom:prediction_age_seconds:max
        expr: max(spacecom_prediction_age_seconds)

      # SLI: alert delivery latency p99
      - record: spacecom:alert_delivery_latency:p99_rate5m
        expr: histogram_quantile(0.99, rate(spacecom_alert_delivery_latency_seconds_bucket[5m]))

      # Error budget burn rate — multi-window (F2 — §57)
      - record: spacecom:error_budget_burn:rate1h
        expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[1h])

      - record: spacecom:error_budget_burn:rate6h
        expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[6h])

      # Fast-burn window (5 min) — catches sudden outages
      - record: spacecom:error_budget_burn:rate5m
        expr: 1 - spacecom:api_availability:ratio_rate5m

Alerting rules (Prometheus AlertManager):

# monitoring/alertmanager/spacecom-rules.yml
groups:
  - name: spacecom_critical
    rules:
      - alert: HmacVerificationFailure
        expr: increase(spacecom_hmac_verification_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "HMAC verification failure detected — prediction integrity compromised"
          runbook_url: "https://spacecom.internal/docs/runbooks/hmac-integrity-failure.md"

      - alert: TipIngestStale
        expr: spacecom_tle_age_hours{source="tip"} > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TIP data > 30 min old — active re-entry warning may be stale"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      - alert: ActiveTipNoPrediction
        # prediction age on the left so {{ $value }} renders the age, not the TIP count;
        # on() because the recording rule strips instance/job labels the raw gauge carries
        expr: spacecom:prediction_age_seconds:max > 3600 and on() spacecom_active_tip_events > 0
        labels:
          severity: critical
        annotations:
          summary: "Active TIP event but newest prediction is {{ $value | humanizeDuration }} old"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      # Fast burn: 1h + 5min windows (catches sudden outages quickly) — F2 §57
      - alert: ErrorBudgetFastBurn
        expr: >
          spacecom:error_budget_burn:rate1h > (14.4 * 0.001)
          and
          spacecom:error_budget_burn:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          burn_window: fast
        annotations:
          summary: "Error budget burning fast — 1h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

      # Slow burn: 6h + 1h windows (catches gradual degradation before budget exhausts) — F2 §57
      - alert: ErrorBudgetSlowBurn
        expr: >
          spacecom:error_budget_burn:rate6h > (6 * 0.001)
          and
          spacecom:error_budget_burn:rate1h > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          burn_window: slow
        annotations:
          summary: "Error budget burning slowly — 6h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

  - name: spacecom_warning
    rules:
      - alert: TleStale
        # Alert on recording rule aggregate — single alert, not 600 per-NORAD alerts
        expr: spacecom:tle_stale_objects:count > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} objects have TLE age > 6h"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: IngestConsecutiveFailures
        # increase() yields a failure count over the window, giving a direct >= 3
        # threshold; a rate() expression would give an unintuitive per-second value
        expr: increase(spacecom_ingest_failure_total[15m]) >= 3
        labels:
          severity: warning
        annotations:
          summary: "Ingest source {{ $labels.source }} failed ≥ 3 times in 15 min"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: CelerySimulationQueueDeep
        expr: spacecom_celery_queue_depth{queue="simulation"} > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue depth {{ $value }} — workers may be overwhelmed"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: DLQGrowing
        # delta(), not increase(): spacecom_dlq_depth is a gauge, not a counter
        expr: delta(spacecom_dlq_depth[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Dead letter queue growing — tasks exhausting retries"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: WebSocketCeilingApproaching
        expr: spacecom_ws_connected_clients > 400
        labels:
          severity: warning
        annotations:
          summary: "WS connections {{ $value }}/500 — scale backend before ceiling hit"
          runbook_url: "https://spacecom.internal/docs/runbooks/capacity-limits.md"

      # Queue depth growth rate alert — fires before threshold is breached (F8 — §57)
      - alert: CelerySimulationQueueGrowing
        # deriv(), not rate(): queue depth is a gauge; deriv() gives the per-second slope
        expr: deriv(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue growing at {{ $value | humanize }} tasks/sec — workers not keeping up"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: RendererChromiumUnresponsive
        expr: increase(renderer_chromium_restarts_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Renderer Chromium restarted — report generation may be delayed"
          runbook_url: "https://spacecom.internal/docs/runbooks/renderer-recovery.md"

Alert authoring rule (F11 — §57): Every AlertManager alert rule MUST include annotations.runbook_url pointing to an existing file in docs/runbooks/. CI lint step (make lint-alerts) validates this using promtool check rules plus a custom Python script that asserts every rule has a non-empty runbook_url annotation that resolves to an existing markdown file. A PR that adds an alert without a runbook fails CI.

Alert coverage audit (F5 — §57): The following table maps every SLO and safety invariant to its alert rule. Gaps must be closed before Phase 2.

| SLO / Safety invariant | Alert rule | Severity | Gap? |
|---|---|---|---|
| API availability 99.9% | ErrorBudgetFastBurn, ErrorBudgetSlowBurn | CRITICAL / WARNING | Covered |
| TLE age < 6h | TleStale | WARNING | Covered |
| TIP ingest freshness < 30 min | TipIngestStale | CRITICAL | Covered |
| Active TIP + prediction age > 1h | ActiveTipNoPrediction | CRITICAL | Covered |
| HMAC verification integrity | HmacVerificationFailure | CRITICAL | Covered |
| Ingest consecutive failures | IngestConsecutiveFailures | WARNING | Covered |
| Celery queue depth threshold | CelerySimulationQueueDeep | WARNING | Covered |
| Celery queue depth growth rate | CelerySimulationQueueGrowing | WARNING | Covered |
| DLQ depth > 0 | DLQGrowing | WARNING | Covered |
| WS connection ceiling approach | WebSocketCeilingApproaching | WARNING | Covered |
| Renderer Chromium crash | RendererChromiumUnresponsive | WARNING | Covered |
| EOP mirror disagreement | EopMirrorDisagreement | CRITICAL | Gap — add Phase 1 |
| DB replication lag > 30s | DbReplicationLagHigh | WARNING | Gap — add Phase 2 |
| Backup job failure | BackupJobFailed | CRITICAL | Gap — add Phase 1 |
| Security event anomaly | In security-rules.yml | CRITICAL | Covered |
| Alert HMAC integrity (nightly) | In security-rules.yml | CRITICAL | Covered |

Prometheus scrape configuration (monitoring/prometheus.yml):

scrape_configs:
  - job_name: backend
    static_configs:
      - targets: ['backend:8000']
    metrics_path: /metrics   # enabled by prometheus-fastapi-instrumentator

  - job_name: renderer
    static_configs:
      - targets: ['renderer:8001']
    metrics_path: /metrics

  - job_name: celery
    static_configs:
      - targets: ['celery-exporter:9808']   # celery-exporter sidecar

  - job_name: postgres
    static_configs:
      - targets: ['postgres-exporter:9187']  # postgres_exporter; also scrapes PgBouncer stats

  - job_name: redis
    static_configs:
      - targets: ['redis-exporter:9121']     # redis_exporter

Add to docker-compose.yml (Phase 2 service topology): postgres-exporter, redis-exporter, celery-exporter sidecar, loki, promtail, tempo (all on monitor_net). Add to requirements.in: prometheus-fastapi-instrumentator, structlog, opentelemetry-sdk, opentelemetry-instrumentation-fastapi, opentelemetry-instrumentation-sqlalchemy, opentelemetry-instrumentation-celery.

Distributed tracing — OpenTelemetry (Phase 2, ADR 0017):

# backend/app/main.py — instrument at startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.celery import CeleryInstrumentor
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
CeleryInstrumentor().instrument()

The trace_id from each span equals the request_id bound in structlog.contextvars (set by RequestIDMiddleware). This gives a single correlation key across Grafana Loki log search and Grafana Tempo trace view — one click from a log entry to its trace, and from a trace span to its log lines. Phase 1 fallback: set OTEL_SDK_DISABLED=true; no collector is needed and no spans are emitted, so log correlation relies on the explicit request_id propagation described below.

Celery trace propagation (F4 — §57): CeleryInstrumentor automatically propagates W3C traceparent headers through the Celery task message body. The trace started at POST /api/v1/decay/predict continues unbroken through the queue wait and into the worker execution. To verify propagation is working:

# tests/integration/test_tracing.py
import uuid

def test_celery_trace_propagation(client):
    """Trace started in the HTTP handler must appear in the Celery worker span."""
    response = client.post("/api/v1/decay/predict", ...)
    task_id = response.json()["job_id"]
    # Poll until the task completes, then assert the worker span carries the
    # same 128-bit trace_id as the X-Request-ID issued by the middleware.
    span = get_span_by_task_id(task_id)  # test helper: reads the in-memory span exporter
    assert span.context.trace_id == uuid.UUID(response.headers["X-Request-ID"]).int

Additionally, request_id must be passed explicitly in Celery task kwargs as a belt-and-suspenders fallback for Phase 1 when OTel is disabled (OTEL_SDK_DISABLED=true). The worker binds it via structlog.contextvars.bind_contextvars(request_id=kwargs["request_id"]). This ensures log correlation works in Phase 1 without a running Tempo instance.
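A sketch of the belt-and-suspenders pattern described above, using a stdlib ContextVar to stand in for structlog.contextvars (function names are illustrative; the real worker calls structlog.contextvars.bind_contextvars):

```python
from contextvars import ContextVar

# Stand-in for the structlog contextvars binding in this sketch
request_id_var: ContextVar = ContextVar("request_id", default=None)

def enqueue_prediction(object_id: int, request_id: str) -> dict:
    # HTTP handler side: pass request_id explicitly in the task kwargs so the
    # worker can correlate logs even with OTEL_SDK_DISABLED=true.
    return {"object_id": object_id, "request_id": request_id}

def run_prediction_task(**kwargs) -> str:
    # Worker side: bind the propagated request_id before the first log call.
    request_id_var.set(kwargs["request_id"])
    return f"predicting object {kwargs['object_id']} (request {request_id_var.get()})"
```

With this in place, every worker log line carries the originating request_id regardless of whether the OTel SDK is active.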

Chord sub-task and callback trace propagation (F11 — §67): CeleryInstrumentor propagates traceparent through individual task messages. For the MC chord pattern (a group of sub-tasks followed by a callback), trace context propagation must flow: FastAPI handler → run_mc_decay_prediction → 500× run_single_trajectory sub-tasks → aggregate_mc_results callback. Each hop in the chord must carry the same trace_id to enable end-to-end p95 latency attribution.

CeleryInstrumentor handles single task propagation automatically. For chord callbacks, verify that the parent trace_id appears in the aggregate_mc_results span — if the span is orphaned (different trace_id), set the trace context explicitly in the chord header:

from opentelemetry import propagate, context

def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    carrier = {}
    propagate.inject(carrier)  # inject current trace context
    params['_trace_context'] = carrier  # pass through chord params
    ...

def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    ctx = propagate.extract(params.get('_trace_context', {}))
    token = context.attach(ctx)  # re-attach parent trace context in callback
    try:
        ...  # callback body
    finally:
        context.detach(token)

This ensures the Tempo waterfall for an MC prediction shows one continuous trace from HTTP request through all 500 sub-tasks to DB write, enabling per-prediction p95 breakdown.

Celery queue depth Beat task (updates celery_queue_depth and dlq_depth every 30s):

@app.task
def update_queue_depth_metrics():
    # NOTE: list keys must match the broker's actual queue naming — with the
    # default Redis broker the list key is the bare queue name; the 'celery:'
    # prefix here assumes a configured broker key prefix.
    for queue_name in ['ingest', 'simulation', 'default']:
        depth = redis_client.llen(f'celery:{queue_name}')
        celery_queue_depth.labels(queue=queue_name).set(depth)
    dlq_depth.set(redis_client.llen('dlq:failed_tasks'))
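The 30-second cadence is registered via Celery Beat. A minimal beat_schedule entry might look like the following — the task's module path is an assumption:

```python
# Celery app configuration excerpt — schedules the metrics task every 30s
# to match the dashboard refresh cadence.
beat_schedule = {
    "update-queue-depth-metrics": {
        "task": "app.tasks.monitoring.update_queue_depth_metrics",  # assumed path
        "schedule": 30.0,  # seconds
    },
}
```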

Four Grafana dashboards (updated from three):

  1. Operational Overview — primary on-call dashboard (F7 — §57): an on-call engineer must be able to answer "is the system healthy?" within 15 seconds of opening this dashboard. Panel order and layout are therefore mandated:

    | Row | Panel | Metric | Alert threshold shown |
    |-----|-------|--------|-----------------------|
    | 1 (top) | Active TIP events (stat) | spacecom_active_tip_events | Red if > 0 |
    | 1 | System status (state timeline) | All alert rule states | Any CRITICAL = red bar |
    | 2 | Ingest freshness per source (gauge) | spacecom_tle_age_hours per source | Yellow > 2h, Red > 6h |
    | 2 | Prediction age — active objects (gauge) | spacecom:prediction_age_seconds:max | Red > 3600s |
    | 3 | Error budget burn rate (time series) | spacecom:error_budget_burn:rate1h | Reference line at 14.4× |
    | 3 | Alert delivery latency p99 (stat) | spacecom:alert_delivery_latency:p99_rate5m | Red > 30s |
    | 4 | Celery queue depth (time series) | spacecom_celery_queue_depth per queue | Reference line at 20 |
    | 4 | DLQ depth (stat) | spacecom_dlq_depth | Red if > 0 |

    Rows 1 and 2 must be visible without scrolling on a 1080p monitor. The dashboard UID is pinned in the AlertManager dashboard_url annotations.

  2. System Health: DB replication lag, Redis memory, container CPU/RAM, error rates by endpoint, renderer job duration

  3. SLO Burn Rate: error budget consumption rate from recording rules, fast/slow burn rates, availability by SLO, latency percentiles vs. targets, WS delivery latency p99

  4. Tracing (Phase 2, Grafana Tempo): per-request traces for decay prediction and CZML catalog; p95 span breakdown by service


26.8 Incident Response

On-Call Rotation and Escalation

| Tier | Responder | Response SLA | Escalation trigger |
|------|-----------|--------------|--------------------|
| L1 On-call | Rotating engineer (weekly rotation) | 5 min (SEV-1) / 15 min (SEV-2) | Auto-escalate to L2 if no acknowledgement after SLA |
| L2 Escalation | Tech lead / senior engineer | 10 min (SEV-1) | Auto-escalate to L3 after 10 min |
| L3 Incident commander | Engineering or product lead | SEV-1 only | Manual phone call; no auto-escalation |

AlertManager routing:

# monitoring/alertmanager/routing.yml
route:
  receiver: slack-ops-channel
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: {severity: critical}
      receiver: pagerduty-l1
      continue: true   # also send to Slack
    - match: {severity: warning}
      receiver: slack-ops-channel

On-call guide: docs/runbooks/on-call-guide.md — required Phase 2 deliverable. Must cover: rotation schedule, handover checklist, escalation contact list, how to acknowledge PagerDuty alerts, Grafana dashboard URLs, and the "active TIP event protocol" (escalate all SEV-2+ to SEV-1 automatically when spacecom_active_tip_events > 0).

On-call rotation spec (F5):

  • 7-day rotation; minimum 2 engineers in the pool before going on-call
  • L1 → L2 escalation if incident not contained within 30 minutes of L1 acknowledgement
  • L2 → L3 escalation triggers: ANSP data affected; confirmed security breach; total outage > 15 minutes; regulatory notification obligation triggered (NIS2 24h, GDPR 72h)
  • On-call handoff: At rotation boundary, outgoing on-call documents system state in docs/runbooks/on-call-handoff-log.md: active incidents, degraded services, pending maintenance, known risks. Incoming on-call acknowledges in the same log. Mirrors the operator /handover concept (§28.5a) applied to engineering shifts.

ANSP communication commitments per severity (F6):

| Severity | ANSP notification timing | Channel | Update cadence |
|----------|--------------------------|---------|----------------|
| SEV-1 (active TIP event) | Within 5 minutes of detection | Push + email | Every 15 minutes until resolved |
| SEV-1 (no active event) | Within 15 minutes | Email | Every 30 minutes until resolved |
| SEV-2 | Within 30 minutes if prediction data affected | Email | On resolution |
| SEV-3/4 | Status page update only | Status page | On resolution |

Resolution notification always includes: what was affected, duration, root cause summary (1 sentence), and confirmation that prediction integrity was verified post-incident.

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV-1 | System unavailable or prediction integrity compromised during active TIP event | 5 minutes | DB down with TIP window open; HMAC failure on active prediction |
| SEV-2 | Core functionality broken; no active TIP event | 15 minutes | Workers down; ingest stopped > 2h; Redis down |
| SEV-3 | Degraded functionality; operational but impaired | 60 minutes | TLE stale > 6h; space weather stale; slow CZML > 5s p95 |
| SEV-4 | Minor; no operational impact | Next business day | UI cosmetic; log noise; non-critical test failure |

Runbook Standard Structure (F9)

Every runbook in docs/runbooks/ must follow this template. Inconsistent runbooks written under incident pressure are a leading cause of missed steps and extended resolution times.

# Runbook: {Title}

**Owner:** {team or role}
**Last tested:** {YYYY-MM-DD} (game day or real incident)
**Severity scope:** SEV-1 | SEV-2 | SEV-3 (as applicable)

## Triggers
<!-- What conditions cause this runbook to be invoked? Alert name, symptom, or explicit escalation. -->

## Immediate actions (first 5 minutes)
<!-- Numbered steps. Each step must be independently executable. No "investigate" — specific commands only. -->
1.
2.

## Diagnosis
<!-- How to confirm the root cause before taking corrective action. -->

## Resolution steps
<!-- Numbered. Each step: what to do, expected output, what to do if the expected output is NOT seen. -->
1.
2.

## Verification
<!-- How to confirm the incident is resolved. Specific health check commands or metrics to inspect. -->

## Escalation
<!-- If unresolved after N minutes: who to page, what information to have ready. -->

## Post-incident
<!-- Mandatory PIR? Log entry required? Notification required? -->

All runbooks are reviewed and updated after each game day or real incident in which they were used. The Last tested field must not be older than 12 months — a CI check (make runbook-audit) warns if any runbook has not been updated within that window.
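The make runbook-audit check can be implemented as a small script that parses the template's Last tested field; a sketch (the helper name and 365-day threshold interpretation of "12 months" are assumptions):

```python
import re
from datetime import date, timedelta

# Matches the "**Last tested:** YYYY-MM-DD" field from the runbook template
LAST_TESTED_RE = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")

def runbook_is_stale(markdown: str, today: date, max_age_days: int = 365) -> bool:
    """Return True if the Last tested field is missing or older than 12 months."""
    m = LAST_TESTED_RE.search(markdown)
    if m is None:
        return True  # a missing field counts as stale
    tested = date.fromisoformat(m.group(1))
    return today - tested > timedelta(days=max_age_days)

assert runbook_is_stale("**Last tested:** 2024-01-01", date(2026, 1, 1))
assert not runbook_is_stale("**Last tested:** 2025-12-01", date(2026, 1, 1))
```

In CI the script would iterate over docs/runbooks/*.md and emit a warning line per stale file.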

Required Runbooks (Phase 2 deliverable)

Each runbook is a step-by-step operational procedure, not a general guide:

| Runbook | Key Steps |
|---------|-----------|
| DB failover | Confirm primary down → Patroni status → manual failover if Patroni stuck → verify standby promoting → update connection strings → verify HMAC validation working on new primary |
| Celery worker recovery | Check queue depth → inspect dead letter queue → restart worker containers → verify simulation jobs resuming → check ingest worker catching up |
| HMAC integrity failure | Identify affected prediction ID → quarantine record (integrity_failed = TRUE) → notify affected ANSP users → investigate modification source → escalate to security incident if tampering confirmed |
| TIP ingest failure | Check Space-Track API status → verify credentials not expired → check outbound network → manual TIP fetch if automated ingest blocked → notify operators of manual TIP status |
| Ingest pipeline staleness | Check Celery Beat health (redbeat lock status) → check worker queue → inspect ingest failure counter in Prometheus → trigger manual ingest job → notify operators of staleness |
| GDPR personal data breach | Contain breach (revoke credentials, isolate affected service) → assess scope (which data, how many data subjects, which jurisdictions) → notify legal counsel within 4 hours → if EU/UK data subjects affected: notify supervisory authority within 72 hours of discovery; notify affected data subjects "without undue delay" if high risk → log in security_logs with type DATA_BREACH → document remediation |
| Safety occurrence notification | If a SpaceCom integrity failure (HMAC fail, data source outage, incorrect prediction) is identified during a period when an ANSP was actively managing a re-entry event: notify affected ANSP within 2 hours → create security_logs record with type SAFETY_OCCURRENCE → notify legal counsel before any external communications → preserve all prediction records, alert_events, and ingest logs from the relevant period (do not rotate or archive). Full procedure: docs/runbooks/safety-occurrence.md — see §26.8a below. |
| Prediction service outage during active re-entry event (F3) | Detect via spacecom_active_tip_events > 0 + prediction API health check fail → immediate ANSP push notification + email within 5 minutes ("SpaceCom prediction service is unavailable. Activate your fallback procedure: consult Space-Track TIP messages directly and ESOC re-entry page.") → designate incident commander → communication cadence every 15 minutes until resolved → service restoration checklist: restore prediction API → verify HMAC integrity on latest predictions → notify ANSPs of restoration with prediction freshness timestamp → trigger PIR. Full procedure: docs/runbooks/prediction-service-outage-during-active-event.md |

§26.8a Safety Occurrence Reporting Procedure (F4 — §61)

A safety occurrence is any event or condition in which a SpaceCom error may have contributed to, or could have contributed to, a reduction in aviation safety. This is distinct from an operational incident (which is defined by system availability/performance). Safety occurrences require a different response chain that includes regulatory and legal notification.

Trigger conditions:

  • HMAC integrity failure on any prediction that was served to an ANSP operator during an active TIP event
  • A confirmed incorrect prediction (false positive or false negative) where the ANSP was managing airspace based on SpaceCom outputs
  • Data staleness in excess of the operational threshold (TLE > 6h old) during an active re-entry event window without degradation notification having been sent
  • Any SpaceCom system failure during which an ANSP continued operational use without receiving a degradation notification

Response procedure (docs/runbooks/safety-occurrence.md):

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Detect and classify: confirm the occurrence meets trigger criteria; assign SAFETY_OCCURRENCE vs. standard incident | On-call engineer | Within 30 min of detection |
| 2 | Preserve evidence: set do_not_archive = TRUE on all affected prediction records, alert_events, and ingest logs; export to MinIO safety archive | On-call engineer | Within 1 hour |
| 3 | Internal escalation: notify incident commander + legal counsel; do NOT communicate externally until legal counsel is engaged | Incident commander | Within 1 hour |
| 4 | ANSP notification: contact affected ANSP primary contact and safety manager using the safety occurrence notification template (not the standard incident template); include what happened, what data was affected, what the ANSP should do in response | Incident commander + legal counsel review | Within 2 hours |
| 5 | Log: create security_logs record with type = 'SAFETY_OCCURRENCE'; include ANSP ID, affected prediction IDs, notification timestamp, and legal counsel name | On-call engineer | Same session |
| 6 | ANSP SMS obligation: inform the ANSP in writing that they may have an obligation to report this occurrence to their safety regulator under their SMS; SpaceCom cannot make this determination for the ANSP | Legal counsel | Within 24 hours |
| 7 | PIR: conduct a safety-occurrence-specific post-incident review (same structure as §26.8 PIR but with additional sections: regulatory notification status, hazard log update required?) | Engineering lead | Within 5 business days |
| 8 | Hazard log update: if the occurrence reveals a new hazard or changes the likelihood/severity of an existing hazard, update docs/safety/HAZARD_LOG.md and trigger a safety case review | Safety case custodian | Within 10 business days |

Safety occurrence log table:

-- Add to security_logs or create a dedicated table
CREATE TABLE safety_occurrences (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    occurred_at         TIMESTAMPTZ NOT NULL,
    detected_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    org_ids             UUID[] NOT NULL,                          -- affected ANSPs
    trigger_type        TEXT NOT NULL,                            -- 'HMAC_FAILURE', 'INCORRECT_PREDICTION', 'STALE_DATA', 'SILENT_FAILURE'
    affected_predictions UUID[] NOT NULL DEFAULT '{}',
    evidence_archived   BOOLEAN NOT NULL DEFAULT FALSE,
    ansp_notified_at    TIMESTAMPTZ,
    legal_notified_at   TIMESTAMPTZ,
    hazard_log_updated  BOOLEAN NOT NULL DEFAULT FALSE,
    pir_completed_at    TIMESTAMPTZ,
    notes               TEXT
);

What is NOT a safety occurrence (to avoid over-classification):

  • Standard availability incidents with degradation notification sent promptly
  • Cosmetic UI errors not in the alert/prediction path
  • Prediction updates that change values within stated uncertainty bounds
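The evidence-preservation step (step 2 of the response procedure) can be sketched in SQL. The do_not_archive columns on predictions and alert_events are assumed to exist per the preservation requirement; only safety_occurrences is defined above:

```sql
-- Step 2: flag affected records so retention/rotation jobs skip them
UPDATE predictions  SET do_not_archive = TRUE WHERE id = ANY(:affected_ids);
UPDATE alert_events SET do_not_archive = TRUE WHERE prediction_id = ANY(:affected_ids);

-- Step 5: record the occurrence itself
INSERT INTO safety_occurrences (occurred_at, org_ids, trigger_type, affected_predictions)
VALUES (:occurred_at, :org_ids, 'HMAC_FAILURE', :affected_ids);
```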

ANSP Communication Plan

When SpaceCom is degraded during an active TIP event, operators must be notified immediately through a defined channel:

  • WebSocket push (if connected): automatic via the degraded-mode notification (§24.8)
  • Email fallback: automated email to all operator role users with active sessions within the last 24h, identifying the degradation type and estimated resolution
  • Documented fallback: every SpaceCom user onboarding includes the fallback procedure: "In the absence of SpaceCom, consult Space-Track TIP messages directly at space-track.org and coordinate with your national space surveillance authority per existing procedures"

Incident communication templates (F10): Pre-drafted templates in docs/runbooks/incident-comms-templates.md — reviewed by legal counsel before first use. On-call engineers must use these templates verbatim; deviations require incident commander approval. Templates cover:

  1. Initial notification (< 5 minutes): impact, what we know, what we are doing, next update time
  2. 15-minute update: progress, updated ETA if known, revised fallback guidance if needed
  3. Resolution notification: confirmed restoration, prediction integrity verified, brief root cause (one sentence), PIR date
  4. Post-incident summary (within 5 business days): full timeline, root cause, remediations implemented

What never appears in templates: speculation about cause before the root cause is confirmed; estimated recovery time until known with confidence; any admission of negligence or legal liability.

Post-Incident Review Process (F8)

Mandatory for all SEV-1 and SEV-2 incidents. PIR due within 5 business days of resolution.

PIR document structure (docs/post-incident-reviews/YYYY-MM-DD-{slug}.md):

  1. Incident summary — what happened, when, duration, severity
  2. Timeline — minute-by-minute from first alert to resolution
  3. Root cause — using 5-whys methodology; stop when a process or system gap is identified
  4. Contributing factors — what made the impact worse or detection slower
  5. Impact — users/ANSPs affected; data at risk; SLO breach duration
  6. Remediation actions — each with owner, GitHub issue link, and deadline; tracked with incident-remediation label
  7. What went well — to reinforce effective practices

PIR presented at the next engineering all-hands. Remediation actions are P2 priority — no new feature work by the responsible engineer until overdue remediations are closed.

Chaos Engineering / Game Day Programme (F4)

Quarterly game day; scenarios rotated so each is tested at least annually. Document in docs/runbooks/game-day-scenarios.md.

Minimum scenario set:

| # | Scenario | Expected behaviour | Pass criterion |
|---|----------|--------------------|----------------|
| 1 | PostgreSQL primary killed | Patroni promotes standby; API recovers within RTO | API returns 200 within 15 minutes; no data loss |
| 2 | Celery worker crash during active MC simulation | Job moves to DLQ; orphan recovery task re-queues; operator sees FAILED state | Job visible in DLQ within 2 minutes; re-queue succeeds |
| 3 | Space-Track ingest unavailable 6 hours | Staleness degraded mode activates; operators notified; predictions greyed | Staleness alert fires within 15 minutes of ingest stop |
| 4 | Redis failure | Sessions expire gracefully; WebSocket reconnects; no silent data loss | Users see "session expired" prompt; no 500 errors |
| 5 | Full prediction service restart during active CRITICAL alert | Alert state preserved in DB; re-subscribing WebSocket clients receive current state | No alert acknowledgement lost; reconnection < 30 seconds |
| 6 | Full region failover (annually) | DNS fails over to DR region; prediction API resumes | Recovery within RTO; HMAC verification passes on new primary |

Each scenario: defined inject → observe → record actual behaviour → pass/fail vs. criterion → remediation window 2 weeks. Any scenario fail is treated as a SEV-2 incident with a PIR.

Operational vs. Security Incident Runbooks (F11)

Operational and security incidents have different response teams, communication obligations, and legal constraints:

| Dimension | Operational incident | Security incident |
|-----------|----------------------|-------------------|
| Primary responder | On-call engineer | On-call engineer + DPO within 4h |
| Communication | Status page + ANSP email | No public status page until legal counsel approves |
| Regulatory obligation | SLA breach notification (MSA) | NIS2 24h early warning; GDPR 72h (if personal data) |
| Evidence preservation | Normal log retention | Immediate log freeze; do not rotate or archive |

Separate runbooks:

  • docs/runbooks/operational-incident-response.md — standard on-call playbook
  • docs/runbooks/security-incident-response.md — invokes DPO, legal counsel, NIS2/GDPR timelines; references §29.6 notification obligations

26.9 Deployment Strategy

Zero-Downtime Deployment (Blue-Green)

The TLS-terminating Caddy instance routes between blue (current) and green (new) backend instances:

Client → Caddy → [Blue backend] (current)
                → [Green backend] (new — deployed but not yet receiving traffic)

Docker Compose implementation for Tier 2 (single-host):

Docker Compose service names are fixed, so blue and green run as two separate Compose project instances. The deploy script at scripts/blue-green-deploy.sh manages the cutover:

#!/usr/bin/env bash
# scripts/blue-green-deploy.sh
set -euo pipefail

NEW_IMAGE="${1:?Usage: blue-green-deploy.sh <image-tag>}"
COMPOSE_FILE="docker-compose.yml"
BLUE_PROJECT="spacecom-blue"
GREEN_PROJECT="spacecom-green"

# 1. Determine which colour is currently active
ACTIVE=$(cat /opt/spacecom/.active-colour 2>/dev/null || echo "blue")
if [[ "$ACTIVE" == "blue" ]]; then NEXT="green"; else NEXT="blue"; fi
ACTIVE_PROJECT=$( [[ $ACTIVE == green ]] && echo "$GREEN_PROJECT" || echo "$BLUE_PROJECT" )
NEXT_PROJECT=$( [[ $NEXT == green ]] && echo "$GREEN_PROJECT" || echo "$BLUE_PROJECT" )

# 2. Start next-colour project with new image
SPACECOM_BACKEND_IMAGE="$NEW_IMAGE" \
  docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" up -d backend

# 3. Wait for next-colour healthcheck (up to 60s)
for i in $(seq 1 12); do
  if docker compose -p "$NEXT_PROJECT" exec -T backend \
       curl -sf http://localhost:8000/healthz; then
    break
  fi
  if [[ $i -eq 12 ]]; then echo "Health check failed — aborting"; exit 1; fi
  sleep 5
done

# 4. Run smoke tests against next-colour directly (green exposed on 8001, blue on 8000)
SMOKE_TARGET="http://localhost:$( [[ $NEXT == green ]] && echo 8001 || echo 8000 )" \
  python scripts/smoke-test.py || { echo "Smoke tests failed — aborting"; exit 1; }

# 5. Shift Caddy upstream to next colour (atomic file swap + reload)
echo "{ \"upstream\": \"backend-$NEXT:8000\" }" > /opt/spacecom/caddy-upstream.json
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile

echo "$NEXT" > /opt/spacecom/.active-colour
echo "✓ Traffic shifted to $NEXT. Monitoring for 5 minutes..."
sleep 300

# 6. Verify availability via Prometheus (optional gate)
ERROR_RATE=$(curl -s "http://localhost:9090/api/v1/query?query=spacecom:api_availability:ratio_rate5m" \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE < 0.99" | bc -l) )); then
  echo "Availability $ERROR_RATE < 0.99 — rolling back"
  # Swap back to the previously active colour
  echo "{ \"upstream\": \"backend-$ACTIVE:8000\" }" > /opt/spacecom/caddy-upstream.json
  docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
  echo "$ACTIVE" > /opt/spacecom/.active-colour
  exit 1
fi

# 7. Decommission old colour
docker compose -p "$ACTIVE_PROJECT" stop backend \
  && docker compose -p "$ACTIVE_PROJECT" rm -f backend
echo "✓ Blue-green deploy complete. Active: $NEXT"

Caddy upstream configuration — Caddy reads a JSON file that the deploy script rewrites atomically:

# /etc/caddy/Caddyfile
reverse_proxy {
  dynamic file /opt/spacecom/caddy-upstream.json
  lb_policy first
  health_uri /healthz
  health_interval 5s
}

WebSocket long-lived connection timeout configuration (F11 — §63): HTTP reverse proxies have default idle timeouts that silently terminate long-lived WebSocket connections. Caddy's default HTTP server idle timeout is governed by idle_timeout (default: 5 minutes); many cloud load balancers default to 60 seconds. A WebSocket with no traffic for this period is silently closed by the proxy — the FastAPI server and client may not detect this for minutes, creating a "ghost connection" that is alive at the socket level but dead at the application level.

Required Caddyfile additions for WebSocket paths:

# /etc/caddy/Caddyfile
{
  servers {
    timeouts {
      idle_timeout 0  # disable idle timeout globally — WS connections can be silent for extended periods
    }
  }
}

spacecom.io {
  # WebSocket endpoints: no idle timeout, no read timeout
  @websockets {
    path /ws/*
    header Connection *Upgrade*
    header Upgrade websocket
  }
  handle @websockets {
    reverse_proxy backend:8000 {
      transport http {
        read_timeout  0      # no read timeout — WS connection can be idle
        write_timeout 0      # no write timeout — WS send can be slow on poor networks
      }
      flush_interval -1      # immediate flush; do not buffer WS frames
    }
  }

  # Non-WebSocket paths: retain normal timeouts
  handle {
    reverse_proxy backend:8000 {
      transport http {
        read_timeout  30s
        write_timeout 30s
      }
    }
  }
}

Ping-pong interval must be less than proxy idle timeout: The FastAPI WebSocket handler sends a ping every WS_PING_INTERVAL_SECONDS (default: 30s). With idle_timeout 0 in Caddy, this prevents proxy-side termination. If running behind a cloud load balancer with a fixed idle timeout, the ping interval must be set to (load_balancer_idle_timeout - 10s) — documented in docs/runbooks/websocket-proxy-config.md.

Rollback: scripts/blue-green-rollback.sh — resets /opt/spacecom/caddy-upstream.json to the previous colour and reloads Caddy. Rollback completes in < 5 seconds (no container restart required).

Deployment sequence:

  1. Deploy green backend alongside blue (both running)
  2. Run smoke tests against green directly (X-Deploy-Target: green header)
  3. Shift 10% of traffic to green (canary); monitor error rate for 5 minutes
  4. If clean: shift 100% to green; keep blue running for 10 minutes
  5. If error spike: shift 0% back to blue instantly (< 5s rollback via blue-green-rollback.sh)
  6. Decommission blue after 10 minutes of clean green operation

Alembic Migration Safety Policy

Every database migration must be backwards-compatible with the previous application version. Required sequence for any schema change:

  1. Migration only: deploy migration; verify old app still functions with new schema (additive changes only — new nullable columns, new tables, new indexes)
  2. Application deploy: deploy new application version that uses the new schema
  3. Cleanup migration (if needed): remove old columns/constraints after old app version is fully retired

Never: rename a column, change a column type, or drop a column in a single migration that deploys simultaneously with the application change.
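The three-step expand/contract sequence above, sketched as SQL for a hypothetical corridor_version column (the column and table names are illustrative):

```sql
-- Deploy N, step 1 (migration only): additive, nullable — old app keeps working
ALTER TABLE predictions ADD COLUMN corridor_version TEXT;  -- nullable, no default

-- Deploy N+1, step 2 (application deploy): new app writes corridor_version;
-- backfill old rows in batches between deploys
UPDATE predictions SET corridor_version = 'v1' WHERE id BETWEEN 1 AND 100000;

-- Deploy N+2, step 3 (cleanup migration, after deploy N is fully retired)
ALTER TABLE predictions ALTER COLUMN corridor_version SET NOT NULL;
```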

Hypertable-specific migration rules:

  • Avoid blocking index builds: standard CREATE INDEX blocks all reads and writes for the duration. On plain tables, always use CREATE INDEX CONCURRENTLY. Note that TimescaleDB hypertables do not support CONCURRENTLY; there, use CREATE INDEX ... WITH (timescaledb.transaction_per_chunk) so only one chunk is locked at a time — safe during live ingest.
  • Never add a column with a non-null default to a populated hypertable in a single migration. Required sequence: (1) add nullable column, (2) backfill in batches with UPDATE ... WHERE id BETWEEN x AND y, (3) add NOT NULL constraint in a separate deployment.
  • Test every migration against a production-sized data copy before applying to production. Record the measured execution time in the migration file header comment: # Execution time on 10M-row orbits table: 45s.
  • Set a CI migration timeout gate: if a migration runs > 30 seconds against the test dataset, it must be reviewed by a senior engineer before merge.

TIP Event Deployment Freeze

No deployments permitted when a CRITICAL or HIGH alert is active for any tracked object. Enforced by a CI/CD gate:

# Pre-deploy gate script, invoked by the CI deploy job
import requests

class DeploymentBlocked(Exception):
    """Raised to fail the CI job when active alerts block deployment."""

def check_deployment_gate():
    response = requests.get(f"{API_URL}/api/v1/alerts?level=CRITICAL,HIGH&active=true",
                            headers={"X-Deploy-Check": settings.deploy_check_secret})
    response.raise_for_status()
    active = response.json()["total"]
    if active > 0:
        raise DeploymentBlocked(
            f"{active} active CRITICAL/HIGH alerts. Deployment blocked until events resolve."
        )

The deploy check secret is a read-only service credential — it cannot acknowledge alerts or modify data.

CI/CD Pipeline Specification

GitLab CI pipeline jobs (.gitlab-ci.yml):

| Job | Trigger | Steps | Failure behaviour |
|-----|---------|-------|-------------------|
| lint | All pushes + PRs | pre-commit run --all-files (detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff) | Blocks merge |
| test-backend | All pushes + PRs | pytest --cov --cov-fail-under=80; alembic check (model/migration divergence) | Blocks merge |
| test-frontend | All pushes + PRs | vitest run; playwright test | Blocks merge |
| security-scan | All pushes + PRs | bandit -r backend/; pip-audit --require backend/requirements.txt; npm audit --audit-level=high (frontend); eslint --plugin security; trivy image on built images (.trivyignore applied); pip-licenses + license-checker-rseidelsohn gate; .secrets.baseline currency check | Blocks merge on High/Critical |
| build-and-push | Merge to main or release/* | Multi-stage docker build; docker push ghcr.io/spacecom/<service>:sha-<commit> via OIDC; cosign sign all images; syft SPDX-JSON SBOM generated and attached as cosign attest; pip-licenses --format=json + license-checker-rseidelsohn --json manifests merged into SBOM and uploaded as workflow artifact (365-day retention); docs/compliance/sbom/ updated with versioned SBOM artefact | Blocks deploy |
| deploy-staging | After build-and-push on main | Docker Compose update on staging host; smoke tests | Blocks production deploy gate |
| deploy-production | Manual approval after deploy-staging passes | check_deployment_gate() (no active CRITICAL/HIGH alerts); blue-green deploy | Manual |

Image tagging convention:

  • sha-<commit> — immutable canonical tag; always pushed
  • v<major>.<minor>.<patch> — release alias pushed on tagged commits
  • latest — never pushed; forbidden in production Compose files (CI grep check enforces this)
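The "no latest in production Compose files" grep check can be sketched as a small shell function (the script path and function name are illustrative):

```shell
# Hypothetical ci/check-image-tags.sh helper
check_no_latest() {
  # Succeeds only if none of the given Compose files pin an image to :latest;
  # grep prints any offending line with file and line number.
  ! grep -HnE 'image:.*:latest' "$@"
}

# Example: a sha-pinned file passes, a :latest file fails
printf 'image: ghcr.io/spacecom/backend:sha-abc123\n' > /tmp/ok.yml
printf 'image: ghcr.io/spacecom/backend:latest\n' > /tmp/bad.yml
check_no_latest /tmp/ok.yml && echo "ok.yml passes"
check_no_latest /tmp/bad.yml || echo "bad.yml blocked"
```

In CI the function would be called with every docker-compose*.yml in the repository and a non-zero exit fails the job.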

Build cache strategy:

# .github/workflows/ci.yml (build-and-push job excerpt)
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}   # OIDC — no stored secret
- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: ghcr.io/spacecom/backend:sha-${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/spacecom/backend:buildcache
    cache-to: type=registry,ref=ghcr.io/spacecom/backend:buildcache,mode=max

pip and Next.js build caches use actions/cache keyed on the corresponding requirements/lock file hash:

- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f  # v4.0.2
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('backend/requirements.txt') }}
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f  # v4.0.2
  with:
    path: frontend/.next/cache
    key: npm-${{ hashFiles('frontend/package-lock.json') }}

cosign image signing and SBOM attestation (added after each docker push):

# .github/workflows/ci.yml — build-and-push job (after docker push steps)
- uses: sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20  # v3.5.0

- name: Sign all service images with cosign (keyless, OIDC)
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      cosign sign --yes \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- name: Generate SBOM and attach as cosign attestation
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      syft ghcr.io/spacecom/${svc}:sha-${{ github.sha }} \
        -o spdx-json=sbom-${svc}.spdx.json
      # Validate non-empty
      jq -e '.packages | length > 0' sbom-${svc}.spdx.json
      cosign attest --yes \
        --predicate sbom-${svc}.spdx.json \
        --type spdxjson \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4
  with:
    name: sbom-${{ github.sha }}
    path: "*.spdx.json"
    retention-days: 365   # ESA bid artefacts; ECSS minimum 1 year

- name: Verify signature before deploy (deploy jobs only)
  if: github.event_name == 'workflow_dispatch'
  run: |
    cosign verify ghcr.io/spacecom/backend:sha-${{ github.sha }} \
      --certificate-identity-regexp="https://github.com/spacecom/spacecom/.*" \
      --certificate-oidc-issuer="https://token.actions.githubusercontent.com"

All GitHub Actions pinned by commit SHA (mutable @vN tags allow tag-repointing attacks that exfiltrate all workflow secrets):

# Correct form — all third-party actions in .github/workflows/*.yml:
- uses: docker/setup-buildx-action@4fd812986e6c8c2a69e18311145f9371337f27d  # v3.4.0
- uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567    # v3.3.0
- uses: docker/build-push-action@1a162644f9a7e87d8f4b053101d1d9a712edc18c # v6.3.0
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683        # v4.2.2
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f             # v4.0.2
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4

CI lint check enforces no mutable tags remain:

if grep -rE 'uses: [^@]+@v[0-9]' .github/workflows/; then
  echo "ERROR: Actions must be pinned by commit SHA, not tag"; exit 1
fi

Use pinact or Renovate's github-actions manager to automate SHA updates.
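A minimal Renovate configuration for this — a sketch assuming Renovate's built-in presets (`helpers:pinGitHubActionDigests` pins actions to digests); the schedule is illustrative:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": [
    "config:recommended",
    "helpers:pinGitHubActionDigests"
  ],
  "github-actions": {
    "schedule": ["before 6am on monday"]
  }
}
```

Renovate then raises PRs that bump the commit SHA and update the trailing `# vN` comment together, so the lint check above stays green.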

Local Development Environment

First-time setup (target: working stack in ≤ 15 minutes from clean clone):

git clone https://github.com/spacecom/spacecom && cd spacecom
cp .env.example .env          # fill in Space-Track credentials only; all others have safe defaults
pip install pre-commit && pre-commit install
make dev                      # starts full stack with hot-reload
make seed                     # loads test objects, FIRs, and synthetic TIP events
# → Open http://localhost:3000; globe shows 10 test objects

make targets:

| Target | What it does |
| --- | --- |
| `make dev` | `docker compose up` with `./backend` and `./frontend/src` bind-mounted for hot-reload |
| `make test` | `pytest` (backend) + `vitest run` (frontend) + `playwright test` (E2E) |
| `make migrate` | `alembic upgrade head` inside the running backend container |
| `make seed` | Loads `fixtures/dev_seed.sql` + synthetic TIP events via seed script |
| `make lint` | Runs all pre-commit hooks against all files |
| `make clean` | `docker compose down -v` — removes all containers and volumes (destructive, prompts) |
| `make shell-db` | Opens a `psql` shell inside the TimescaleDB container |
| `make shell-backend` | Opens a bash shell inside the running backend container |

Hot-reload configuration (docker-compose.override.yml — dev only, not committed to CI):

services:
  backend:
    volumes:
      - ./backend:/app   # bind mount — FastAPI --reload picks up changes instantly
    command: ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0"]
  frontend:
    volumes:
      - ./frontend/src:/app/src   # Next.js / Vite HMR

.env.example structure (excerpt):

# === Required: obtain before first run ===
SPACETRACK_USERNAME=your_email@example.com
SPACETRACK_PASSWORD=your_password

# === Required: generate locally ===
JWT_PRIVATE_KEY_PATH=./certs/jwt_private.pem   # openssl genrsa -out certs/jwt_private.pem 2048
JWT_PUBLIC_KEY_PATH=./certs/jwt_public.pem

# === Safe defaults for local dev (change for production) ===
POSTGRES_PASSWORD=spacecom_dev
REDIS_PASSWORD=spacecom_dev
MINIO_ACCESS_KEY=spacecom_dev
MINIO_SECRET_KEY=spacecom_dev_secret
HMAC_SECRET=dev_hmac_secret_change_in_prod

# === Stage flags ===
ENVIRONMENT=development    # development | staging | production
SHADOW_MODE_DEFAULT=false
DISABLE_SIMULATION_DURING_ACTIVE_EVENTS=false

All production-only variables are clearly marked. The README's "Getting Started" section mirrors the first-time setup steps above.

Staging Environment

Purpose: Continuous integration target for main branch. Serves as the TRL artefact evidence environment — all shadow validation records and OWASP ZAP reports reference the staging deployment.

| Property | Staging | Production |
| --- | --- | --- |
| Infrastructure | Tier 2 (single-host Docker Compose) | Tier 3 (multi-host HA) |
| Data | Synthetic only — no production data | Real TLE/TIP/space weather |
| Secrets | Separate credential set; non-production Space-Track account | Production credential set in Vault |
| Deploy trigger | Automatic on merge to main | Manual approval in GitHub Actions |
| OWASP ZAP | Runs against every staging deploy | Run on demand before Phase 3 milestones |
| Retention | Environment resets weekly (fresh `make seed` run) | Persistent |

Secrets Rotation Procedure

Zero-downtime rotation is required. Service interruption during rotation is a reliability failure.

JWT RS256 Signing Keypair:

  1. Generate new keypair: openssl genrsa -out jwt_private_new.pem 2048 && openssl rsa -in jwt_private_new.pem -pubout -out jwt_public_new.pem
  2. Load new public key into JWT_PUBLIC_KEY_NEW env var on all backend instances (old key still active)
  3. Backend now validates tokens signed with either old or new key
  4. Update JWT_PRIVATE_KEY to new key; new tokens are signed with new key
  5. Wait for all old tokens to expire (max 1h for access tokens; 30 days for refresh tokens)
  6. Remove JWT_PUBLIC_KEY_NEW; old public key no longer needed
  7. Log security_logs entry type KEY_ROTATION with rotation timestamp and initiator
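Step 3's "either old or new key" behaviour amounts to a key-ring loop — a minimal sketch; `SignatureError` and the verifier callables stand in for the PyJWT RS256 decode path, and the names are illustrative:

```python
class SignatureError(Exception):
    """Raised when a token fails verification against one configured key."""


def verify_with_rotation(token: str, verifiers: list) -> dict:
    """Try each verifier (newest key first); accept the first that succeeds.

    Each verifier wraps one public key — e.g. a closure over
    jwt.decode(token, public_key, algorithms=["RS256"]). During rotation
    the list holds [new_key_verifier, old_key_verifier]; after step 6 it
    shrinks back to a single entry.
    """
    last_error = None
    for verify in verifiers:
        try:
            return verify(token)
        except SignatureError as exc:
            last_error = exc
    raise last_error or SignatureError("no verifiers configured")
```

Because the loop tries keys in order, tokens signed before step 4 keep validating until they expire in step 5, with no service interruption.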

Space-Track Credentials:

  1. Create new Space-Track account or update password via Space-Track web portal
  2. Update SPACETRACK_USERNAME / SPACETRACK_PASSWORD in secrets manager (Docker secrets / Vault)
  3. Trigger one manual ingest cycle; verify 200 response from Space-Track API
  4. Deactivate old credentials in Space-Track portal
  5. Log security_logs entry type CREDENTIAL_ROTATION

MinIO Access Keys:

  1. Create new access key pair via MinIO console (mc admin user add)
  2. Update MINIO_ACCESS_KEY / MINIO_SECRET_KEY in secrets manager
  3. Restart backend and worker services (rolling restart — blue-green ensures zero downtime)
  4. Verify pre-signed URL generation succeeds
  5. Delete old access key from MinIO console

HMAC Secret (prediction signing key):

  • Do not rotate casually. All existing HMAC-signed predictions will fail verification after rotation.
  • Pre-rotation: re-sign all existing predictions with new key (batch migration script required)
  • Post-rotation: update HMAC_SECRET in secrets manager; verify batch re-sign by spot-checking 10 predictions
  • Rotation must be approved by engineering lead; security_logs entry type HMAC_KEY_ROTATION required
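The pre-rotation batch re-sign can be sketched with the stdlib `hmac` module — mirroring the `sign_prediction` call used by `aggregate_mc_results`; the canonical-JSON signing form shown here is an assumption, not the production signer:

```python
import hashlib
import hmac
import json


def sign_prediction(prediction: dict, secret: bytes) -> str:
    """HMAC-SHA256 over the canonical JSON form, excluding the signature field."""
    payload = {k: v for k, v in prediction.items() if k != "record_hmac"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, canonical.encode(), hashlib.sha256).hexdigest()


def resign_batch(predictions: list[dict], old_secret: bytes, new_secret: bytes) -> int:
    """Verify each record against the old key, then re-sign with the new key.

    Records that fail old-key verification are skipped, not overwritten —
    they indicate tampering or a prior partial rotation and must be
    investigated before the rotation proceeds.
    """
    resigned = 0
    for p in predictions:
        expected = sign_prediction(p, old_secret)
        if not hmac.compare_digest(expected, p.get("record_hmac", "")):
            continue  # flag for manual review
        p["record_hmac"] = sign_prediction(p, new_secret)
        resigned += 1
    return resigned
```

The spot-check in the post-rotation step is then just `sign_prediction(record, new_secret) == record["record_hmac"]` on 10 sampled rows.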

26.10 Post-Deployment Safety Monitoring Programme (F9 — §61)

Pre-deployment testing and shadow validation demonstrate that a system was safe at a point in time. Post-deployment monitoring demonstrates that it remains safe in operational conditions. DO-278A §12 and EUROCAE ED-153 both require evidence of ongoing safety monitoring after deployment.

Programme components:

26.10.1 Prediction Accuracy Monitoring

After each actual re-entry event where SpaceCom generated predictions:

  1. Record the actual re-entry time and location (from The Aerospace Corporation / ESA re-entry campaign results)
  2. Compare against SpaceCom's p50 corridor centre and p95 bounds
  3. Record in shadow_validations table: actual_reentry_time, actual_impact_region, p50_error_km, p95_captured (boolean)
  4. Compute running accuracy statistics: % of events where actual impact was within p95 corridor; median error in km
  5. Publish accuracy statistics to GET /api/v1/admin/accuracy-report (accessible to ANSP admins)

Alert trigger: If rolling 12-month p95 capture rate drops below 80% (target: 95%), engineering review is mandatory before the next ANSP shadow activation or model update deployment.
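The running statistics in steps 4–5 and the 80% alert trigger can be sketched as follows; the row fields follow the `shadow_validations` columns recorded in step 3:

```python
def capture_stats(validations: list[dict]) -> dict:
    """Rolling accuracy statistics over shadow_validations rows.

    Each row is assumed to carry p95_captured (bool) and p50_error_km (float).
    """
    if not validations:
        return {"n": 0, "p95_capture_rate": None, "median_p50_error_km": None}
    n = len(validations)
    captured = sum(1 for v in validations if v["p95_captured"])
    errors = sorted(v["p50_error_km"] for v in validations)
    mid = n // 2
    median = errors[mid] if n % 2 else (errors[mid - 1] + errors[mid]) / 2
    rate = captured / n
    return {
        "n": n,
        "p95_capture_rate": rate,
        "median_p50_error_km": median,
        "review_required": rate < 0.80,  # mandatory engineering review threshold
    }
```

The same output feeds `GET /api/v1/admin/accuracy-report` and the safety KPI dashboard.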

26.10.2 Safety KPI Dashboard

Prometheus recording rules and Grafana dashboard (monitoring/dashboards/safety-kpis.json):

| KPI | Metric | Target | Alert threshold |
| --- | --- | --- | --- |
| HMAC verification failures | spacecom_hmac_verification_failures_total | 0 / month | Any failure → SEV-1 |
| Safety occurrences | safety_occurrences table count | 0 / year | ≥1 → safety case review |
| Alert false positive rate | Manual: PIR review | < 5% | Engineering review if exceeded |
| Operator training currency | operator_training_records expiry | 100% current | < 95% → ANSP admin notification |
| p95 corridor capture rate | shadow_validations rolling 12-month | ≥ 95% | < 80% → model review |
| Prediction freshness (TLE age at prediction time) | spacecom_tle_age_hours histogram | p95 < 6h | > 24h → MEDIUM alert |
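The HMAC and TLE-age thresholds above can be expressed as Prometheus alerting rules — an illustrative sketch; the rule file path and label conventions are assumptions:

```yaml
# monitoring/rules/safety-kpis.yml (sketch — align names with the deployed registry)
groups:
  - name: safety-kpis
    rules:
      - alert: HmacVerificationFailure
        expr: increase(spacecom_hmac_verification_failures_total[1h]) > 0
        labels:
          severity: sev1
        annotations:
          summary: "HMAC verification failure — prediction integrity check failed"
      - alert: TleAgeStale
        expr: histogram_quantile(0.95, sum by (le) (rate(spacecom_tle_age_hours_bucket[6h]))) > 24
        labels:
          severity: medium
        annotations:
          summary: "p95 TLE age at prediction time exceeds 24 h"
```

The remaining KPIs (training currency, capture rate, false positive rate) are computed from database tables rather than Prometheus and surface on the Grafana dashboard via SQL data sources.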

26.10.3 Quarterly Safety Review

Mandatory quarterly safety review meeting. Output: docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md.

Agenda:

  1. Safety KPI review (all metrics above)
  2. Safety occurrences since last review (zero is an acceptable answer — record it)
  3. Hazard log review: has any hazard likelihood or severity changed since last quarter?
  4. MoC status update: progress on PLANNED items
  5. Model changes in period: were any SAL-2 components modified? If so, safety case impact assessment
  6. ANSP feedback: any concerns raised by ANSP customers regarding safety or accuracy?
  7. Actions: owner, deadline, priority

Attendance required: Safety case custodian + engineering lead. One ANSP contact may be invited as an observer (good practice for regulatory demonstration).

26.10.4 Model Version Safety Monitoring

When a new model version is deployed (changes to physics/ or alerts/ SAL-2 components):

  1. Shadow run new model in parallel for ≥14 days before replacing production model
  2. Compare new vs. old: prediction differences > 50 km for p50, or > 100 km for p95, require engineering review before promotion
  3. After promotion: monitor shadow_validations for the next 3 re-entry events; regression alert if p95 capture rate declines
  4. Record in simulations.model_version; all predictions annotated with the model version they used
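The 50 km / 100 km promotion gate in step 2 can be sketched as follows; the pairwise-diff record shape is an assumption (one row per object, carrying the great-circle distance between old and new p50 centres and p95 bounds):

```python
def promotion_gate(pairs: list[dict],
                   p50_limit_km: float = 50.0,
                   p95_limit_km: float = 100.0) -> dict:
    """Compare shadow (new) vs production (old) model predictions pairwise.

    Any object whose p50 centres differ by > 50 km, or whose p95 bounds
    differ by > 100 km, blocks automatic promotion pending engineering review.
    """
    blocked = [
        p for p in pairs
        if p["p50_diff_km"] > p50_limit_km or p["p95_diff_km"] > p95_limit_km
    ]
    return {"auto_promote": not blocked, "needs_review": blocked}
```

Run over all objects predicted during the ≥14-day shadow period; an empty `needs_review` list is the precondition for promotion.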

27. Capacity Planning

27.0 Performance Test Specification (F6)

Performance tests live in tests/load/ and are run with k6. They are not part of the standard make test suite — they require a running environment with realistic data. They run:

  • Manually before any Phase gate release
  • Automatically on the staging environment nightly (scheduled k6 Cloud or self-hosted k6)
  • Results committed to docs/validation/load-test-results/ after each Phase gate

Scenarios

// tests/load/scenarios.js
export const options = {
  scenarios: {
    czml_catalog: {
      executor: 'ramping-vus',
      startVUs: 0, stages: [
        { duration: '30s', target: 50 },
        { duration: '2m',  target: 100 },
        { duration: '30s', target: 0 },
      ],
    },
    websocket_subscribers: {
      executor: 'constant-vus', vus: 200, duration: '3m',
    },
    decay_submit: {
      executor: 'constant-arrival-rate', rate: 5, timeUnit: '1m',
      preAllocatedVUs: 10, duration: '5m',
    },
  },
};

SLO Assertions (k6 thresholds — test fails if breached)

| Scenario | Metric | Threshold |
| --- | --- | --- |
| CZML catalog (GET /objects + CZML) | p95 response time | < 2 000 ms |
| API auth (POST /auth/token) | p99 response time | < 500 ms |
| Decay prediction submit | p95 response time | < 500 ms (202 accept only) |
| WebSocket connection | 200 concurrent connections stable for 3 min | 0 connection drops |
| WebSocket alert delivery | Time from DB insert to browser receipt | < 30 000 ms p95 |
| /readyz probe | p99 response time | < 100 ms |

Baseline Environment

Performance tests are only comparable if run against a consistent hardware baseline:

# docs/validation/load-test-baseline.md
- Host: 8 vCPU / 32 GB RAM (Tier 2 single-host)
- TimescaleDB: 100 tracked objects, 90 days of orbit history
- Celery workers: simulation ×16 concurrency, ingest ×2
- Redis: empty (no warm cache) at test start

Results from a different hardware spec must be labelled separately and not compared to the baseline. A performance regression is defined as any threshold breach on the same baseline hardware.

k6 outputs a JSON summary; a CI step uploads it to docs/validation/load-test-results/YYYY-MM-DD-{env}.json. A lightweight Python script (scripts/load-test-trend.py) plots p95 latency over time for the past 10 runs and embeds the chart in docs/TEST_PLAN.md. A > 20% increase in any p95 metric between consecutive runs on the same hardware creates a performance-regression GitHub issue automatically.
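The >20% consecutive-run rule that scripts/load-test-trend.py applies can be sketched as follows; the summary JSON shape here is an assumption — adapt the field names to the actual k6 output:

```python
def find_regressions(runs: list[dict], threshold: float = 0.20) -> list[dict]:
    """Flag any p95 metric that grew by more than `threshold` between
    consecutive runs on the same baseline hardware.

    runs: chronological summaries, each {"date": ..., "p95_ms": {metric: value}}.
    """
    regressions = []
    for prev, curr in zip(runs, runs[1:]):
        for metric, value in curr["p95_ms"].items():
            baseline = prev["p95_ms"].get(metric)
            if baseline and value > baseline * (1 + threshold):
                regressions.append({
                    "metric": metric,
                    "from_ms": baseline,
                    "to_ms": value,
                    "run": curr["date"],
                })
    return regressions
```

A non-empty return drives the automatic performance-regression GitHub issue; runs on different hardware must be filtered out before calling this.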

27.1 Workload Characterisation

| Workload | CPU Profile | Memory | Dominant Constraint |
| --- | --- | --- | --- |
| MC decay prediction (500 samples) | CPU-bound, parallelisable | 200–500 MB per process | CPU cores on simulation workers |
| SGP4 catalog propagation (100 objects) | Trivial | < 100 MB | None — analytical model |
| CZML generation | I/O-bound (DB read) | < 500 MB | DB query latency |
| Atmospheric breakup | CPU-bound, light | ~200 MB | Negligible vs. MC |
| Conjunction screening (100 objects) | CPU-bound, seconds | ~500 MB | Acceptable on any worker |
| Controlled re-entry planner | CPU-bound, similar to MC | 500 MB | Same pool as MC |
| Playwright renderer | Memory-bound (Chromium) | 1–2 GB per instance | Isolated container |
| TimescaleDB queries | I/O-bound | 64 GB (buffer cache) | NVMe IOPS for spatial queries |

Cost-tracking metrics (F3, F4, F11):

Add the following Prometheus counters to enable per-org cost attribution and external API budget visibility. These feed the unit economics model (§27.7) and the Enterprise tier chargeback reports.

# backend/app/metrics.py (add to existing prometheus_client registry)
from prometheus_client import Counter

# F3 — External API call budget tracking
ingest_api_calls_total = Counter(
    "spacecom_ingest_api_calls_total",
    "Total external API calls made by the ingest worker",
    labelnames=["source"]  # "space_track", "celestrak", "noaa_swpc", "esa_discos", "iers"
)
# Usage: ingest_api_calls_total.labels(source="space_track").inc()
# Alert: if space_track calls > 100/day → investigate polling loop bug (Space-Track AUP limit: 200/day)

# F4 — Per-org simulation CPU attribution
simulation_cpu_seconds_total = Counter(
    "spacecom_simulation_cpu_seconds_total",
    "Total CPU-seconds consumed by MC simulations, by org and object",
    labelnames=["org_id", "norad_id"]
)
# Usage: simulation_cpu_seconds_total.labels(org_id=str(org_id), norad_id=str(norad_id)).inc(elapsed)
# This is the primary input to infrastructure_cost_per_mc_run in §27.7

F5 — Inbound API request counter (§68):

# backend/app/metrics.py (add to existing prometheus_client registry)
api_requests_total = Counter(
    "spacecom_api_requests_total",
    "Total inbound API requests, by org, endpoint, and API version",
    labelnames=["org_id", "endpoint", "version", "status_code"]
)
# Usage (FastAPI middleware):
# api_requests_total.labels(
#     org_id=str(request.state.org_id),
#     endpoint=request.url.path,
#     version=request.headers.get("X-API-Version", "v1"),
#     status_code=str(response.status_code)
# ).inc()

This counter is the foundation for future API tier enforcement (e.g., 1,000 requests/month for Professional; unlimited for Enterprise) and for supporting usage-based billing for Persona E/F API consumers. Add to the FastAPI middleware stack alongside prometheus_fastapi_instrumentator.

F11 — Per-org cost attribution for Enterprise tier:

Enterprise contracts may include usage-based clauses (e.g., MC simulation credits). The simulation_cpu_seconds_total metric provides the raw data; a monthly Celery task (tasks/billing/generate_usage_report.py) aggregates it per org:

@shared_task
def generate_monthly_usage_report(org_id: str, year: int, month: int):
    """Aggregate simulation CPU-seconds and ingest API calls per org for billing review."""
    # Query Prometheus/VictoriaMetrics for the org's metrics over the billing period
    # Output: docs/business/usage_reports/{org_id}/{year}-{month:02d}.json
    # Fields: total_mc_runs, total_cpu_seconds, estimated_cost_usd (at $0.40/run internal rate)

Per-org usage reports are stored in docs/business/usage_reports/ and referenced in Enterprise QBRs. The cost rate ($0.40/run at Tier 3 scale) is updated quarterly in docs/business/UNIT_ECONOMICS.md.

Usage surfaced to commercial team and org admins (F2 — §68):

Usage data must reach two audiences: the commercial team (for renewal and expansion conversations) and the org admin (to understand value received).

Commercial team: Monthly Celery Beat task (tasks/commercial/send_commercial_summary.py) emails commercial@spacecom.io on the 1st of each month with:

  • Per-org: MC simulation count, PDF reports generated, WebSocket connection hours, alert events (by severity)
  • Trend vs. previous 3 months (growth signal for expansion conversations)
  • Contracts expiring within 90 days (renewal pipeline)

Org admin: Monthly usage summary email to each org's admin contact showing their own usage. Template: "In [month], your team ran [N] decay predictions, generated [M] PDF reports, and received [K] CRITICAL alerts. Your monthly quota: [Q] simulations (used: [N])." This email reinforces value perception ahead of renewal conversations.

Both emails use the generate_monthly_usage_report output. Add send_usage_summary_emails to celery-redbeat at crontab(day_of_month=1, hour=6).

27.2 Monte Carlo Parallelism Architecture

The MC decay predictor must use Celery group + chord to distribute sample computation across the full worker pool. multiprocessing.Pool within a single task is limited to one container's cores.

from celery import group, chord

@celery.task
def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    """Fan out 500 samples as individual sub-tasks; aggregate with chord callback."""
    sample_tasks = group(
        run_single_trajectory.s(object_id, params, seed=i)
        for i in range(params['mc_samples'])
    )
    result = chord(sample_tasks)(aggregate_mc_results.s(object_id, params))
    return result.id

@celery.task
def run_single_trajectory(object_id: int, params: dict, seed: int) -> dict:
    """Single RK7(8) + NRLMSISE-00 trajectory integration. CPU time: 220s."""
    rng = np.random.default_rng(seed)
    f107 = params['f107'] * rng.normal(1.0, 0.20)  # ±20% variation
    bstar = params['bstar'] * rng.normal(1.0, 0.10)
    return integrate_trajectory(object_id, f107, bstar, params)

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])

Worker concurrency for chord sub-tasks:

  • Each sub-task is short (2–20s) and CPU-bound
  • Worker --pool=prefork --concurrency=16: 16 OS processes per container
  • 2 simulation worker containers: 32 concurrent sub-tasks
  • 500 samples / 32 = ~16 batches × ~10s average = ~160s per MC run (p50)
  • p95 target of 240s met with headroom
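The batch arithmetic above can be captured in a small helper for capacity planning — a minimal sketch that ignores chord dispatch overhead and per-task variance:

```python
import math


def mc_wall_time_s(samples: int, concurrent_slots: int, mean_task_s: float) -> float:
    """Rough p50 wall-time for one chord fan-out: samples are consumed in
    batches of `concurrent_slots` parallel sub-tasks."""
    batches = math.ceil(samples / concurrent_slots)
    return batches * mean_task_s
```

With the Tier 2 pool (2 containers × 16 processes) this reproduces the ~160s figure; doubling the pool halves the batch count accordingly.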

Chord result backend: Sub-task results stored in Redis temporarily (< 1 MB each × 500 = 500 MB peak per run). Results expire after 1 hour (result_expires = 3600 in celeryconfig.py — §27.8). The aggregate callback reads all results, computes the final prediction, and writes to TimescaleDB — Redis is not the durable store.

Chord callback result count validation (F1 — §67): Redis noeviction prevents eviction, but if Redis is misconfigured or hits maxmemory and rejects writes, sub-task results may be missing when the chord callback fires. The callback must validate that it received the expected number of results before writing to TimescaleDB:

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    expected = params['mc_samples']
    if len(results) != expected:
        # Partial result — do not write a silently truncated prediction
        raise ValueError(
            f"MC chord received {len(results)}/{expected} results for object {object_id}. "
            "Redis result backend may be under memory pressure. Aborting."
        )
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])

The ValueError causes the chord callback to fail and be routed to the DLQ (Dead Letter Queue). The originating API call receives a task failure, and the client receives HTTP 500 with Retry-After. A spacecom_mc_chord_partial_result_total counter fires, triggering a CRITICAL alert: "MC chord received partial results — Redis memory budget exceeded."

27.3 Deployment Tiers

Tier 1 — Development and Demonstration

Single machine, Docker Compose, all services co-located. No HA. Suitable for development, internal demos, and ESA TRL 4 demonstrations.

| Spec | Minimum | Recommended |
| --- | --- | --- |
| CPU | 8 cores | 16 cores |
| RAM | 16 GB | 32 GB |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD |
| Cloud equivalent | t3.2xlarge ~$240/mo | m6i.4xlarge ~$540/mo |

MC prediction p95: ~400–800s (exceeds SLO — acceptable for demo; noted in demo briefings).


Tier 2 — Phase 1–2 Production

Separate containers per service. Meets SLOs under moderate load (≤ 5 concurrent simulation users). Single-node per service — no HA. Suitable for shadow mode deployments and early ANSP pilots.

| Service | vCPU | RAM | Storage | Cloud (AWS) | Monthly |
| --- | --- | --- | --- | --- | --- |
| Backend API | 4 | 8 GB | | c6i.xlarge | ~$140 |
| Simulation Workers ×2 | 16 each | 32 GB each | | c6i.4xlarge ×2 | ~$560 each |
| Ingest Worker | 2 | 4 GB | | t3.medium | ~$30 |
| Renderer | 4 | 8 GB | | c6i.xlarge | ~$140 |
| TimescaleDB | 8 | 64 GB | 1 TB NVMe | r6i.2xlarge | ~$420 |
| Redis | 2 | 8 GB | | cache.r6g.large | ~$120 |
| MinIO / S3 | 4 | 8 GB | 4 TB | i3.xlarge + EBS | ~$200 |
| Total | | | | | ~$2,200/mo |

On-premise equivalent (Tier 2): Two servers — compute host (2× AMD EPYC 7313P, 32 total cores, 192 GB RAM) + storage host (8 vCPU, 256 GB RAM, 2 TB NVMe + 8 TB HDD). Capital cost: ~$25,000–35,000.


Tier 3 — Phase 3 HA Production

Full redundancy. Meets 99.9% availability SLO including during active TIP events. Required before any formal operational ANSP deployment.

| Service | Count | vCPU each | RAM each | Notes |
| --- | --- | --- | --- | --- |
| Backend API | 2 | 4 | 8 GB | Load balanced; blue-green deployable |
| Simulation Workers | 4 | 16 | 32 GB | 64 total cores; chord sub-tasks fill all |
| Ingest Worker | 2 | 2 | 4 GB | celery-redbeat leader election |
| Renderer | 2 | 4 | 8 GB | Network-isolated; Chromium memory budget |
| TimescaleDB Primary | 1 | 8 | 128 GB | Patroni-managed; synchronous replication |
| TimescaleDB Standby | 1 | 8 | 128 GB | Hot standby; auto-failover ≤ 30s |
| Redis Sentinel ×3 | 3 | 2 | 8 GB | Quorum; master failover ≤ 10s |
| MinIO (distributed) | 4 | 4 | 16 GB | Erasure coding EC:2; 2× 2 TB NVMe each |
| Cloud total (AWS) | | | | ~$6,000–7,000/mo |

With 64 simulation worker cores: 500-sample MC in ~80s p50, ~120s p95 — well within SLO.

MinIO Erasure Coding (Tier 3): 4-node distributed MinIO uses EC:2 (2 parity shards). This provides:

  • Read quorum: any 2 of 4 nodes (tolerates 2 simultaneous node failures for reads)
  • Write quorum: requires 3 of 4 nodes (tolerates 1 simultaneous node failure for writes)
  • Effective storage: 50% of raw — the Tier 3 spec (4 nodes × 2× 2 TB NVMe = 16 TB raw) yields 8 TB usable
  • Configured via MINIO_ERASURE_SET_DRIVE_COUNT=4 and server startup with all 4 node endpoints
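The quorum and capacity figures follow directly from the shard ratio — a small sketch (EC:2 here means 2 parity shards per 4-drive erasure set):

```python
def ec_usable_tb(raw_tb: float, data_shards: int, parity_shards: int) -> float:
    """Usable MinIO capacity under erasure coding: raw × data/(data+parity).

    A 4-drive erasure set with EC:2 stores 2 data + 2 parity shards per
    object, so usable capacity is 50% of raw.
    """
    return raw_tb * data_shards / (data_shards + parity_shards)
```

This is why the Tier 3 raw capacity must be double the planned usable capacity; moving to EC:1 would raise the ratio to 75% at the cost of tolerating only a single failed drive.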

Multi-region stance: SpaceCom is single-region through all three phases. Reasoning:

  • Phase 1–3 customer base is small (ESA evaluation, early ANSP pilots); cross-region replication cost and operational complexity are not justified.
  • Government and defence customers may have data sovereignty requirements — a single, clearly defined deployment region (customer-specified) is simpler to certify than an active-active multi-region setup.
  • When a second jurisdiction customer is onboarded, deploy a separate, independent instance in their required jurisdiction rather than extending a single global cluster. Each instance has its own data, its own compliance scope, and its own operational team contact.
  • This decision is documented as ADR-0010 (see §34 decision log).

On-premise equivalent (Tier 3): Three servers — 2× compute (2× EPYC 7343, 32 cores, 256 GB RAM each) + 1× storage (128 GB RAM, 4× 2 TB NVMe RAID-10, 16 TB HDD). Capital cost: ~$60,000–80,000.

Celery worker idle cost and scale-to-zero decision (F6):

Simulation workers are the largest cloud line item ($560/mo each at Tier 2 on c6i.4xlarge). Their actual compute utilisation depends on MC run frequency:

| Usage pattern | Active compute/day | Idle fraction | Monthly cost (Tier 2, ×2 workers) |
| --- | --- | --- | --- |
| Light (5 MC runs/day × 80s p50) | ~7 min/day | ~99.5% | $1,120 |
| Moderate (20 MC runs/day × 80s) | ~27 min/day | ~98.1% | $1,120 |
| Heavy (100 MC runs/day × 80s) | ~133 min/day | ~90.7% | $1,120 |

Scale-to-zero analysis:

| Approach | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Always-on (Tier 1–2) | Zero cold-start; SLO met immediately | High idle cost when lightly used | Use at Tier 1–2 — cost is ~$1,120/mo regardless; latency SLO requires workers ready |
| Scale-to-1 minimum (Tier 3) | Reduced idle cost vs. 4×; one worker handles ingest keepalive tasks | Cold-start for burst: 3 new workers × 30–60s spin-up; MC SLO may breach during burst | Use at Tier 3 — scale-to-1 minimum; HPA/KEDA scales 1→4 on celery_queue_length > 10 |
| Scale-to-zero | Maximum idle savings | 60–120s cold-start violates 10-min MC SLO when all workers are down | Do not use — cold-start from zero exceeds acceptable latency for on-demand simulation |

Implementation at Tier 3 (Kubernetes): Use KEDA ScaledObject with celery trigger:

triggers:
  - type: redis
    metadata:
      listName: celery          # Celery default queue
      listLength: "10"          # scale up when >10 tasks queued
      activationListLength: "1" # keep at least 1 replica (scale-to-1 minimum)

Minimum replica count: 1. Maximum: 4. Scale-down stabilisation window: 5 minutes (prevents oscillation during multi-run bursts).

Ingest worker: Always-on, single instance (2 vCPU, $30/mo at Tier 2). celery-redbeat tasks run on 1-minute and hourly schedules; scale-to-zero is not appropriate. At Tier 3, 2 instances for redundancy; no autoscaling needed.


27.4 Storage Growth Projections

| Data | Retention | Raw Growth/Year | Compressed/Year | Cloud Cost/Year (est.) | Notes |
| --- | --- | --- | --- | --- | --- |
| orbits (100 objects, 1/min) | 90 days online | ~15 GB | ~2 GB | ~$20 (EBS gp3, rolling) | TimescaleDB compression ~7:1 |
| tle_sets | 1 year | ~55 MB | ~30 MB | Negligible | |
| space_weather | 2 years | ~5 MB | ~2 MB | Negligible | |
| MC simulation blobs (MinIO) | 2 years | 500 GB–2 TB | Not compressed | $140–$560/yr (S3-IA after 90d) | Dominant cost — S3-IA at $0.0125/GB/mo |
| PDF reports (MinIO) | 7 years | 10–90 GB | 5–45 GB | $5–$45/yr (S3 Glacier) | $0.004/GB/mo Glacier tier |
| WAL archive (backup) | 30 days rolling | ~25 GB/month | | ~$100/yr (300 GB peak × $0.023/GB/mo × 12) | S3 Standard; rolls over; cost is steady-state |
| security_logs | 2 years online; 7-year archive | ~500 MB/year | | Negligible | Legal hold |
| reentry_predictions | 7 years | ~100 MB/year | | Negligible | Legal hold |
| Safety records (alert_events, notam_drafts, prediction_outcomes, degraded_mode_events, coordination notes) | 5-year minimum append-only archive | ~200 MB/year | | Negligible | ICAO Annex 11 §2.26; safety investigation requirement |

Storage cost summary (Phase 2 steady-state): MC blobs dominate at sustained use. At 50 runs/day × 120 MB/run = 2.2 TB/year, 2-year retention on S3-IA ≈ $660/year in object storage alone. This should be captured in the unit economics model (§27.7). Storage cost is the primary variable cost that scales with usage depth (number of MC runs), not with number of users.

Backup cost projection (F9): WAL archive at 30-day rolling window: ~300 GB peak occupancy on S3 Standard ≈ $83/year (Tier 2). At Tier 3 with synchronous replication, the base-backup is ~2× TimescaleDB data size. At 1 TB compressed DB size: one weekly base-backup (retained 4 weeks) = 4 TB S3 occupancy → **$1,100/year** at Tier 3. Include backup S3 bucket costs in infrastructure budget from Phase 3 onwards. Budget line: infra/backup-s3 ≈ $100–200/month at steady Tier 3 scale.

Safety record retention policy (Finding 11): Safety-relevant event records have a distinct retention category separate from general operational data. A safety_record BOOLEAN DEFAULT FALSE flag on alert_events and notam_drafts marks records that must survive the standard retention drop. Records with safety_record = TRUE are excluded from TimescaleDB drop policies and transferred to MinIO cold tier (append-only) after 90 days online, retained for 5 years minimum. The TimescaleDB retention job checks WHERE safety_record = FALSE before dropping chunks. safety_record is set to TRUE at insert time for any event with alert_level IN ('HIGH', 'CRITICAL') and for all NOTAM drafts.

MC blob storage dominates at scale. At sustained use (50 MC runs/day × 120 MB/run): 2.2 TB/year. The Tier 3 distributed MinIO (8 TB usable under EC:2 from 16 TB raw across 4 nodes) covers approximately 3–4 years before expansion.

Cold tier tiering decision (two object classes with different requirements):

| Object class | Cold tier target | Reason |
| --- | --- | --- |
| MC simulation blobs (mc_blobs/ prefix) | MinIO ILM warm tier or S3 Infrequent Access | Blobs may need to be replayed for Mode C visualisation of historical events (e.g., regulatory dispute review, incident investigation). Glacier 12h restore latency is operationally unacceptable for this use case. |
| Compliance-only documents (reports/, notam_drafts/) | S3 Glacier / Glacier Deep Archive acceptable | These are legal records requiring 7-year retention; retrieval is for audit or legal discovery only; 12h restore latency is acceptable. |

MinIO ILM rules configured in docs/runbooks/minio-lifecycle.md. Lifecycle transitions: MC blobs after 90 days → ILM warm (lower-cost MinIO tier or S3-IA); compliance docs after 1 year → Glacier.

MinIO multipart upload retry and incomplete upload expiry (F7 — §67):

MC simulation blobs (~120 MB each) are uploaded as multipart uploads. During a MinIO node failure in EC:2 distributed mode, write quorum (3/4 nodes) may be temporarily unavailable. An in-flight multipart upload will fail with MinioException / S3Error. Without a retry policy, the MC prediction is written to TimescaleDB but the blob is lost — the historical replay functionality silently fails.

# worker/tasks/blob_upload.py
from minio.error import S3Error

@shared_task(
    autoretry_for=(S3Error, ConnectionError),
    max_retries=3,
    retry_backoff=30,    # 30s, 60s, 120s — allow node recovery
    retry_jitter=True,
)
def upload_mc_blob(prediction_id: str, blob_data: bytes):
    """Upload MC simulation blob to MinIO with retry on quorum failure."""
    object_key = f"mc_blobs/{prediction_id}.msgpack"
    minio_client.put_object(
        bucket_name="spacecom-simulations",
        object_name=object_key,
        data=io.BytesIO(blob_data),
        length=len(blob_data),
        content_type="application/msgpack",
    )

Incomplete multipart upload cleanup: Configure MinIO lifecycle rule to abort incomplete multipart uploads after 24 hours. Add to docs/runbooks/minio-lifecycle.md:

mc ilm rule add --expire-delete-marker --noncurrent-expire-days 1 \
  spacecom/spacecom-simulations --abort-incomplete-multipart-upload-days 1

This prevents orphaned multipart upload parts accumulating on disk during node failures or application crashes mid-upload.

27.5 Network and External Bandwidth

| Traffic | Direction | Volume | Notes |
| --- | --- | --- | --- |
| Space-Track TLE polling | Outbound | ~1 MB per run, every 4h | ~6 MB/day |
| NOAA SWPC space weather | Outbound | ~50 KB per fetch, hourly | ~1 MB/day |
| ESA DISCOS | Outbound | ~10 MB/day (initial bulk); ~100 KB/day incremental | |
| CZML to clients | Outbound | ~5–15 MB per user page load (full); <500 KB/hr delta | Scales linearly with users; delta protocol essential |
| WebSocket to clients | Outbound | ~1 KB/event × events/day | Low bandwidth, persistent connection |
| PDF reports (download) | Outbound | ~2–5 MB per report | Low frequency; MinIO presigned URL avoids backend proxy |
| MinIO internal traffic | Internal | Dominated by MC blob writes | Keep on internal Docker network |

CZML egress cost estimate and compression policy (F5):

At Phase 2 (10 concurrent users), daily CZML egress:

  • Initial full loads: 10 users × 3 page loads/day × 15 MB = 450 MB/day
  • Delta updates (delta protocol, §6): 10 users × 8h active × 500 KB/hr = 40 MB/day
  • Total: ~490 MB/day ≈ 15 GB/month

At $0.085/GB AWS CloudFront egress: ~$1.28/month (Phase 2) → ~$6.40/month (50 users Phase 3).
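The estimate above can be reproduced with a small cost model — a sketch whose per-user constants (3 loads/day, 15 MB full load, 8 active hours, 500 KB/hr delta) are the assumptions stated in the bullets, over a 30-day month:

```python
def monthly_egress_usd(users: int,
                       loads_per_day: int = 3,
                       full_mb: float = 15.0,
                       active_hours: float = 8.0,
                       delta_kb_per_hr: float = 500.0,
                       usd_per_gb: float = 0.085) -> float:
    """CZML egress cost: full page loads plus delta-protocol updates."""
    full_mb_day = users * loads_per_day * full_mb
    delta_mb_day = users * active_hours * delta_kb_per_hr / 1000
    gb_month = (full_mb_day + delta_mb_day) * 30 / 1000
    return gb_month * usd_per_gb
```

At 10 users this yields roughly $1.25/month, confirming that egress spend is negligible next to compute; latency, not cost, is the reason to compress.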

CZML egress is not a significant cost driver at this scale, but is significant for latency and user experience. Compression policy:

| Encoding | CZML size reduction | Implementation |
| --- | --- | --- |
| gzip (Accept-Encoding) | 60–75% | Caddy encode gzip — already included in §26.9 Caddy config |
| Brotli | 70–80% | Caddy encode zstd br gzip — use br for browser clients |
| CZML delta protocol (?since=) | 95%+ for incremental updates | Already specified in §6 |

Minimum requirement: Caddy encode block must include br before gzip in the content negotiation order. A 15 MB CZML payload compresses to ~3–5 MB with brotli. Verify with curl -H "Accept-Encoding: br" -I <url> — response must show Content-Encoding: br.

Network is not a constraint for this workload at the scales described. Standard 1 Gbps datacenter networking is sufficient. For on-premise government deployments, standard enterprise LAN is adequate.


27.6 DNS Architecture and Service Discovery

Tier 1–2 (Docker Compose)

Docker Compose provides built-in DNS resolution by service name within each network. Services reference each other by container name (e.g., db, redis, minio). No additional DNS infrastructure required.

PgBouncer as single DB connection target: At Tier 2, the backend and workers connect to pgbouncer:5432, not directly to db:5432. PgBouncer multiplexes connections and acts as a stable endpoint:

  • In a Patroni failover, pgbouncer is reconfigured to point to the new primary; application code never changes connection strings.
  • PgBouncer configuration: docs/runbooks/pgbouncer-config.md

Celery task retry during Patroni failover (F2 — §67): During the ≤ 30s Patroni leader election window, all writes to PgBouncer fail with FATAL: no connection available or OperationalError: server closed the connection unexpectedly. Celery tasks that execute a DB write during this window will raise sqlalchemy.exc.OperationalError. Without a retry policy, these tasks fail permanently and are routed to the DLQ.

All Celery tasks that write to the database must declare:

from celery import shared_task
from sqlalchemy.exc import OperationalError

@shared_task(
    autoretry_for=(OperationalError,),
    max_retries=3,
    retry_backoff=5,        # exponential: 5s, 10s, 20s
    retry_backoff_max=30,   # cap at 30s (within failover window)
    retry_jitter=True,
)
def my_db_writing_task(...):
    ...

This covers: aggregate_mc_results, write_alert_event, write_prediction_outcome, all ingest tasks. Tasks that only read from DB should also retry on OperationalError since PgBouncer may pause reads during leader election. Add integration test: simulate OperationalError on first two attempts → task succeeds on third attempt.
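The integration test above can be sketched without a broker by modelling the retry loop that `autoretry_for` produces. All classes here are stand-ins — the real test would use Celery eager mode and `sqlalchemy.exc.OperationalError`:

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""

class FlakyDB:
    """Stand-in for a connection that recovers after Patroni failover."""
    def __init__(self, failures: int):
        self.failures = failures
        self.attempts = 0

    def write(self) -> str:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"

def run_with_retries(db: FlakyDB, max_retries: int = 3) -> str:
    # Mirrors autoretry_for=(OperationalError,) with max_retries=3:
    # one initial attempt plus up to three retries.
    for attempt in range(max_retries + 1):
        try:
            return db.write()
        except OperationalError:
            if attempt == max_retries:
                raise  # retries exhausted -> task fails, routed to DLQ
    raise AssertionError("unreachable")

# Fails twice, succeeds on the third attempt — the required behaviour:
db = FlakyDB(failures=2)
assert run_with_retries(db) == "ok" and db.attempts == 3
```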

Tier 3 (HA / Kubernetes migration path)

At Tier 3, introduce split-horizon DNS:

| Zone | Scope | Purpose |
|---|---|---|
| spacecom.internal | Internal services | Service discovery: backend.spacecom.internal, db.spacecom.internal (→ PgBouncer VIP) |
| spacecom.io (or customer domain) | Public internet | Caddy termination endpoint; ACME certificate domain |

Service discovery implementation:

  • Cloud (AWS/GCP/Azure): Use cloud-native internal DNS (Route 53 private hosted zones / Cloud DNS) + load balancer for each service tier
  • On-premise: CoreDNS deployed as a DaemonSet (Kubernetes) or as a Docker container on the management network; service records updated via Patroni callback scripts on failover

Key DNS records (Tier 3):

| Record | Type | Value |
|---|---|---|
| db.spacecom.internal | A | PgBouncer VIP (stable through Patroni failover) |
| redis.spacecom.internal | A | Redis Sentinel VIP |
| minio.spacecom.internal | A | MinIO load balancer (all 4 nodes) |
| backend.spacecom.internal | A | Backend API load balancer (2 instances) |

27.7 Unit Economics Model

Reference document: docs/business/UNIT_ECONOMICS.md — maintained alongside this plan; update whenever pricing or infrastructure costs change.

Unit economics express the cost to serve one organisation per month and the revenue generated, enabling margin analysis per tier.

Cost-to-serve model (Phase 2, cloud-hosted, per org):

| Cost driver | Basis | Monthly cost per org |
|---|---|---|
| Simulation workers (shared pool) | 2 workers shared across all orgs; allocate by MC run share | $1,120 ÷ org count |
| TimescaleDB (shared instance) | ~$420/mo; fixed regardless of org count up to Phase 2 capacity | $420 ÷ org count |
| Redis (shared) | ~$120/mo | $120 ÷ org count |
| MinIO / S3 storage | Variable; ~$660/yr at heavy MC use → $55/mo | $5–55/mo |
| Backend API (shared) | ~$140/mo | $140 ÷ org count |
| Ingest worker (shared) | ~$30/mo | Allocated to platform overhead |
| Email relay | ~$0.001/email × volume | $0–5/mo |
| CZML egress | ~$0.085/GB | $1–7/mo |
| Total variable (1 org, Tier 2) | | ~$1,860/mo platform + $60–70 per-org variable |

Revenue per tier (target pricing — cross-reference §55 commercial model):

| Tier | Monthly ARR / org | Gross margin target |
|---|---|---|
| Free / Evaluation | $0 | Negative — cost of ESA relationship |
| Professional (shadow) | $3,000–6,000/mo | 50–70% at ≥3 orgs on platform |
| Enterprise (operational) | $15,000–40,000/mo | 65–75% at Tier 3 scale |

Break-even analysis: At Tier 2 platform cost (~$2,200/mo), break-even at Professional tier requires ≥1 paying org at $3,000/mo. Each additional Professional org at shared infrastructure has near-zero incremental infrastructure cost until capacity boundaries (MC concurrency limit, DB connection pooler limit).

Key unit economics metric: infrastructure_cost_per_mc_run. At Tier 2 (2 workers, $1,120/mo) and 500 runs/month: $2.24/run. At Tier 3 (4 workers KEDA scale-to-1, ~$800/mo amortised at medium utilisation) and 2,000 runs/month: $0.40/run. This metric should be tracked alongside spacecom_simulation_cpu_seconds_total (§27.1).
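The metric computation under the figures above, as a small helper (the function name follows the metric name; it is not an existing module):

```python
def infrastructure_cost_per_mc_run(worker_cost_usd_per_month: float,
                                   mc_runs_per_month: int) -> float:
    """USD of simulation-worker spend amortised over completed MC runs."""
    return round(worker_cost_usd_per_month / mc_runs_per_month, 2)

# Tier 2: 2 workers at $1,120/mo, 500 runs/month -> $2.24/run
assert infrastructure_cost_per_mc_run(1120, 500) == 2.24
# Tier 3: ~$800/mo amortised, 2,000 runs/month -> $0.40/run
assert infrastructure_cost_per_mc_run(800, 2000) == 0.40
```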

Professional Services as a revenue line (F10 — §68):

Professional Services (PS) revenue is a distinct revenue stream from recurring SaaS fees. For safety-critical aviation systems, PS typically represents 30–50% of first-year contract value and includes:

| PS engagement type | Typical value | Description |
|---|---|---|
| Implementation support | $15,000–40,000 | Deployment, configuration, integration with ANSP SMS |
| Regulatory documentation | $10,000–25,000 | SpaceCom system description for ANSP regulatory submissions; assists with EASA/CASA/CAA shadow mode notifications |
| Training (initial) | $5,000–15,000 | On-site or remote training for duty controllers, analysts, and IT administrators |
| Safety Management System integration | $8,000–20,000 | Integrating SpaceCom alert triggers into the ANSP's existing SMS occurrence reporting workflow |
| Annual training refresh | $2,000–5,000/yr | Recurring annual training for new staff and procedure updates |

PS revenue is tracked in the contracts.ps_value_cents column (§68 F1). Include PS as a budget line in docs/business/UNIT_ECONOMICS.md:

  • Year 1 total contract value = MRR × 12 + PS value
  • PS is recognised as one-time revenue at delivery (milestone-based); SaaS fees are recognised monthly
  • PS delivery requires dedicated engineering and commercial capacity — budget 1–2 days of senior engineer time per $5,000 of PS value
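The recognition and budgeting rules above, as arithmetic. Figures in the usage line are illustrative; `days_per_5k` uses the midpoint of the 1–2 day budget:

```python
def year1_total_contract_value(mrr_usd: float, ps_value_usd: float) -> float:
    """Year 1 TCV = recurring SaaS (recognised monthly) + one-time PS at delivery."""
    return mrr_usd * 12 + ps_value_usd

def ps_delivery_budget_days(ps_value_usd: float, days_per_5k: float = 1.5) -> float:
    """Senior-engineer days to budget for PS delivery (1-2 days per $5,000)."""
    return ps_value_usd / 5000 * days_per_5k

# A Professional org at $4,000 MRR with a $30,000 implementation package:
assert year1_total_contract_value(4000, 30000) == 78000
assert ps_delivery_budget_days(30000) == 9.0
```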

Shadow trial MC quota (F8 — §68): Free/shadow trial orgs are limited to 100 MC simulation runs per month (organisations.monthly_mc_run_quota = 100). Enforcement at POST /api/v1/decay/predict:

if org.subscription_tier in ('shadow_trial',) and org.monthly_mc_run_quota > 0:
    runs_this_month = get_monthly_mc_run_count(org_id)
    if runs_this_month >= org.monthly_mc_run_quota:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "monthly_quota_exceeded",
                "quota": org.monthly_mc_run_quota,
                "used": runs_this_month,
                "resets_at": first_of_next_month().isoformat(),
                "upgrade_url": "/settings/billing"
            }
        )

Commercial controls must not interrupt active operations. If the organisation is in an active TIP / CRITICAL operational state, quota exhaustion is logged and surfaced to commercial/admin dashboards but enforcement is deferred until the event closes.
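The deferral rule above can be combined with the quota check as a pure decision function — a sketch; `OrgQuotaState` and all field names are illustrative, not the real ORM model:

```python
from dataclasses import dataclass

@dataclass
class OrgQuotaState:
    tier: str
    quota: int                    # organisations.monthly_mc_run_quota
    runs_this_month: int
    active_critical_event: bool   # active TIP / CRITICAL operational state

def should_enforce_quota(s: OrgQuotaState) -> bool:
    """True only when the 429 in the snippet above should actually be raised."""
    if s.tier != "shadow_trial" or s.quota <= 0:
        return False              # commercial tiers are not metered here
    if s.runs_this_month < s.quota:
        return False              # quota not yet exhausted
    if s.active_critical_event:
        return False              # defer: log + surface, never block live ops
    return True
```

The active-event branch is where "logged and surfaced to commercial/admin dashboards" happens in the real endpoint; enforcement resumes once the event closes.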


27.8 Redis Memory Budget

Reference document: docs/infra/REDIS_SIZING.md — sizing rationale and eviction policy decisions.

Redis serves three distinct purposes with different memory characteristics. Using a single Redis instance (with separate DB indexes for broker vs. cache) requires explicit memory budgeting:

| Purpose | DB index | Key pattern | Estimated peak memory | Eviction policy |
|---|---|---|---|---|
| Celery broker + result backend | DB 0 | celery-task-meta-*, _kombu.* | 500 MB (500 MC sub-tasks × ~1 MB results) | noeviction |
| celery-redbeat schedule | DB 1 | redbeat:* | < 1 MB | noeviction |
| WebSocket session tracking | DB 2 | spacecom:ws:*, spacecom:active_tip:* | < 10 MB | noeviction |
| Application cache (CZML, NOTAM) | DB 3 | spacecom:cache:* | 50–200 MB | allkeys-lru |
| Redis Pub/Sub fan-out (alerts) | — | spacecom:alert:* channels | Transient; ~1 KB/message | N/A (pub/sub, no persistence) |
| Total budget | | | ~700–750 MB peak | |
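The budget total and the headroom claim below can be checked mechanically — a sketch; key names are illustrative:

```python
# Sanity check of the Redis memory budget table (values in MB; the
# application cache is taken at the upper bound of its 50-200 MB range).
BUDGET_MB = {
    "celery_broker_and_results": 500,  # DB 0
    "redbeat_schedule": 1,             # DB 1
    "websocket_sessions": 10,          # DB 2
    "app_cache_peak": 200,             # DB 3
}
peak_mb = sum(BUDGET_MB.values())
assert 700 <= peak_mb <= 750           # matches the stated peak total
assert 2048 / peak_mb >= 2.5           # maxmemory 2gb gives >= 2.5x headroom
```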

Sizing decision: Use cache.r6g.large (8 GB RAM) with maxmemory 2gb — provides 2.5× headroom above peak estimate for burst conditions (multiple simultaneous MC runs × result backend). Set maxmemory-policy noeviction globally; the application cache (DB 3) must handle cache misses gracefully (it does — CZML regeneration on miss is defined in §6).

Redis memory alert: Add Grafana alert redis_memory_used_bytes > 1.5GB → WARNING; > 1.8GB → CRITICAL. At CRITICAL, check for result backend accumulation (expired Celery results not cleaned up) before scaling.

Redis result cleanup: Celery result_expires must be set to 3600 (1 hour). Verify in backend/celeryconfig.py:

result_expires = 3600  # Clean up MC sub-task results after 1 hour

28. Human Factors Framework

SpaceCom is a safety-critical decision support system used by time-pressured operators in aviation operations rooms. Human factors are not a UX concern — they are a safety assurance concern. This section documents the HF design requirements, standards basis, and validation approach.

Standards basis: ICAO Doc 9683 (Human Factors in Air Traffic Management), FAA AC 25.1329 (Flight Guidance Systems — alert prioritisation philosophy), EUROCONTROL HRS-HSP-005, ISA-18.2 (alarm management, adapted for ATC context), Endsley (1995) Situation Awareness model.


28.1 Situation Awareness Design Requirements

SpaceCom must support all three levels of Endsley's SA model for Persona A (ANSP duty manager):

| SA Level | Requirement | Implementation | Time target |
|---|---|---|---|
| Level 1 — Perception | Correct hazard information visible at a glance | Globe with urgency symbols; active events panel; risk level badges | ≤ 5 seconds from alert appearance — icon, colour, and position alone must convey object + risk level without reading text |
| Level 2 — Comprehension | Operator understands what the hazard means for their sector | Plain-language event cards; window range notation; FIR intersection list; data confidence indicators | ≤ 15 seconds to identify earliest FIR intersection window and whether it falls within the operator's sector |
| Level 3 — Projection | Operator can anticipate future state without simulation tools | Corridor Evolution widget (T+0/+2/+4h); Gantt timeline; space weather buffer callout | ≤ 30 seconds to determine whether the corridor is expanding or contracting using the Corridor Evolution widget |

These time targets are pass/fail criteria for the Phase 2 ANSP usability test (§28.7).

Globe visual information hierarchy (F7 — §60): The globe displays objects, corridors, hazard zones, FIR boundaries, and ADS-B routes simultaneously. Under operational stress, operators must not be required to search for the critical element — it must be pre-attentively distinct. The following hierarchy is mandatory and enforced by the rendering layer:

| Priority | Element | Visual treatment | Pre-attentive channel |
|---|---|---|---|
| 1 — Immediate | Active CRITICAL object | Flashing red octagon (2 Hz; reduced-motion: static + thick border) + label always visible | Motion + colour + shape |
| 2 — Urgent | Active HIGH object | Amber triangle, label visible at zoom ≥ 4 | Colour + shape |
| 3 — Monitor | Active MEDIUM object | Yellow circle, label on hover | Colour + shape |
| 4 — Context | Re-entry corridors (p05–p95) | Semi-transparent red fill, no label until hover | Colour + opacity |
| 5 — Awareness | FIR boundary overlay | Thin white lines, low opacity (30%) | Position |
| 6 — Background | ADS-B routes | Thin grey lines, visible only at zoom ≥ 5 | Position |
| 7 — Ambient | All other tracked objects | Small white dots, no label until hover | Position |

Rule: no element at priority N may be more visually prominent than an element at priority N-1. The rendering layer enforces draw order and applies opacity/size reduction to lower-priority elements when a priority-1 element is present. This is a non-negotiable safety requirement — a CesiumJS performance optimisation that re-orders draw calls or flattens layers must not override this hierarchy. An operator who cannot reach SA Level 1 in ≤ 5 seconds on a CRITICAL alert constitutes a design failure requiring a redesign cycle before shadow deployment; the numeric target exists so the usability test produces a pass/fail result rather than an opinion.
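The prominence rule is testable as an invariant. A sketch — "prominence" here is an illustrative scalar that the real rendering layer would derive from size, opacity, and motion:

```python
def hierarchy_ok(elements) -> bool:
    """elements: iterable of (priority, prominence) pairs; priority 1 = highest.

    True iff no element is more prominent than any element of a higher
    priority, i.e. maximum prominence is non-increasing as the priority
    number grows (equal prominence across priorities is permitted).
    """
    max_by_priority = {}
    for prio, prom in elements:
        max_by_priority[prio] = max(prom, max_by_priority.get(prio, 0.0))
    prios = sorted(max_by_priority)
    return all(max_by_priority[a] >= max_by_priority[b]
               for a, b in zip(prios, prios[1:]))
```

A frontend regression test could sample the scene graph after each render pass and assert `hierarchy_ok` whenever a priority-1 element is present.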

Level 3 SA support is specifically identified as a gap in pure corridor-display systems and is addressed by the Corridor Evolution widget (§6.8).


28.2 Mode Error Prevention

Mode confusion is the most common cause of automation-related incidents in aviation. SpaceCom has three operational modes (LIVE / REPLAY / SIMULATION) that must be unambiguously distinct at all times.

Mode error prevention mechanisms:

  1. Persistent mode indicator pill in top nav — never hidden, never small
  2. Mode-switch dialogue with explicit current-mode, target-mode, and consequence statements (§6.3)
  3. Future-preview temporal wash when the timeline scrubber is not at current time (§6.3)
  4. Optional disable_simulation_during_active_events org setting to block simulation entry during live incidents (§6.3)
  5. Audio alerts suppressed in SIMULATION and REPLAY modes
  6. All simulation-generated records have simulation_id IS NOT NULL — they cannot appear in operational views

28.3 Alarm Management

Alarm management requirements follow the principle: every alarm should demand action, every required action should have an alarm, and no alarm should be generated that does not demand action.

Alarm rationalisation:

  • CRITICAL: demands immediate action — full-screen banner + audio
  • HIGH: demands timely action — persistent badge + acknowledgement required
  • MEDIUM: informs — toast, auto-dismiss, logged
  • LOW: awareness only — notification centre

Alarm management philosophy and KPIs (F1 — §60): SpaceCom adopts the EEMUA 191 / ISA-18.2 alarm management framework adapted for space/aviation operations. The following KPIs are measured quarterly by Persona D and included in the ESA compliance artefact package:

| EEMUA 191 KPI | Target | Definition |
|---|---|---|
| Alarm rate (steady-state) | < 1 alarm per 10 minutes per operator | Alarms requiring attention across all levels; excludes LOW awareness-only |
| Nuisance alarm rate | < 1% of all alarms | Alarms acknowledged as MONITORING within 30s without any other action — indicates no actionable information |
| Stale alarms | 0 CRITICAL unacknowledged > 10 min | Unacknowledged CRITICAL alerts older than 10 minutes; triggers supervisor notification (F8) |
| Alarm flood threshold | < 10 CRITICAL alarms within 10 minutes | Beyond this rate, an alert storm meta-alert fires and the batch-flood suppression protocol activates |
| Chattering alarms | 0 | Any alarm that fires and clears more than 3 times in 30 minutes without operator action |

Alarm quality requirements:

  • Nuisance alarm rate target: < 1 LOW alarm per 10 minutes per user in steady-state operations (logged and reviewed quarterly by Persona D)
  • Alert deduplication: consecutive window-shrink events do not re-trigger CRITICAL if the threshold was not crossed
  • 4-hour per-object CRITICAL rate limit prevents alarm flooding from a single event
  • Alert storm meta-alert disambiguates between genuine multi-object events and system integrity issues (§6.6)

Batch TIP flood handling (F2 — §60): Space-Track releases TIP messages in batches — a single NOAA solar storm event can produce 50+ new TIP entries within a 10-minute window. Without mitigation, this generates 50 simultaneous CRITICAL alerts, constituting an alarm flood that exceeds EEMUA 191 KPIs and cognitively overwhelms the operator.

Protocol when ingest detects ≥ 5 new TIP messages within a 5-minute window:

  1. Batch gate activates: Individual CRITICAL banners suppressed for objects 2–N of the batch. Object 1 (highest-priority by predicted Pc or earliest window) receives the standard CRITICAL banner.
  2. Batch summary alert fires: A single HIGH-level "Batch TIP event: N objects with new TIP data" summary appears in the notification centre. The summary is actionable — it links to a pre-filtered catalog view showing all newly-TIP-flagged objects sorted by predicted re-entry window.
  3. Batch event logged: A batch_tip_event record is created in alert_events with trigger_type = 'BATCH_TIP', affected_objects = [NORAD ID list], and batch_size = N. This is distinct from individual object alert records.
  4. Per-object alerts queue: Individual CRITICAL alerts for objects 2–N are queued and delivered at a maximum rate of 1 per minute, only if the operator has not opened the batch summary view within 5 minutes of the batch gate activating. This prevents indefinite suppression while preventing flood.

The threshold (≥ 5 TIP in 5 minutes) and maximum queue delivery rate (1/min) are configurable per-org via org-admin settings, subject to minimum values (≥ 3 and ≤ 2/min respectively) to prevent safety-defeating misconfiguration.
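The gate decision reduces to a small pure function — a sketch with the default thresholds above; the safety floor mirrors the "≥ 3 minimum" configurability rule:

```python
def classify_tip_batch(new_tip_count: int,
                       batch_threshold: int = 5,
                       min_threshold: int = 3) -> str:
    """Decide the alert presentation mode for a burst of new TIP messages
    observed within the detection window."""
    # Per-org configuration cannot drop below the safety floor:
    threshold = max(batch_threshold, min_threshold)
    if new_tip_count >= threshold:
        # Object 1 gets the standard CRITICAL banner; objects 2-N are
        # queued behind a single HIGH batch-summary alert.
        return "batch_gate"
    return "individual_alerts"
```

A 50-object solar-storm batch thus produces one CRITICAL banner plus one HIGH summary instead of 50 simultaneous CRITICAL alerts.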

Audio alarm specification (F11 — §60):

  • Two-tone ascending chime: 261 Hz (C4) followed by 392 Hz (G4), each 250ms, 20ms fade-in/out (not siren — ops rooms have sirens from other systems already)
  • Conforms to EUROCAE ED-26 / RTCA DO-256 advisory alert audio guidelines (advisory category — attention-getting without startle)
  • Plays once on first presentation; does not loop automatically
  • Re-alert on missed acknowledgement: If a CRITICAL alert remains unacknowledged for 3 minutes, the chime replays once. Replays at most once — the second chime is the final audio prompt. Further escalation is via supervisor notification (F8), not repeated audio (which would cause habituation)
  • Stops on acknowledgement — not on banner dismiss; banner dismiss without acknowledgement is not permitted for CRITICAL severity
  • Per-device volume control via OS; per-session software mute (persists for session only; resets on next login to prevent operators permanently muting safety alerts)
  • Enabled by org-level "ops room mode" setting (default: off); must be explicitly enabled by org admin — not auto-enabled to prevent unexpected audio in environments where audio is not appropriate
  • Volume floor in ops room mode: minimum 40% of device maximum; operators cannot mute below this floor when ops room mode is active (configurable per-org, minimum 30%)

Startle-response mitigation — sudden full-screen CRITICAL banners cause ~5 seconds of degraded cognitive performance in research studies. The following rules prevent cold-start startle:

  1. Progressive escalation mandatory: A CRITICAL alert may only be presented full-screen if the same object has already been in HIGH state for ≥ 1 minute during the current session. If the alert arrives cold (no prior HIGH state), the system must hold the alert in HIGH presentation for 30 seconds before upgrading to CRITICAL full-screen. Exception: impact_time_minutes < 30 bypasses the 30s hold.
  2. Audio precedes visual by 500ms: The two-tone chime fires 500ms before the full-screen banner renders. This primes the operator's attentional system and eliminates the startle peak.
  3. Banner is overlay, not replacement: The CRITICAL full-screen banner is a dimmed overlay (backdrop rgba(0,0,0,0.72)) rendered above the corridor map - the map, aircraft positions, and FIR boundaries remain visible beneath it. The banner must never replace the map render, as spatial context is required for the decision the operator is being asked to make.
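Rules (1)–(3), together with the imminent-impact and data-integrity bypasses from the override matrix below, reduce to a presentation decision. A sketch — names and units (seconds/minutes) are illustrative, and the 500ms audio lead is handled by the renderer, not here:

```python
def presentation_for_critical(prior_high_seconds: float,
                              held_in_high_seconds: float,
                              impact_time_minutes: float,
                              data_integrity_compromise: bool = False) -> str:
    """How a newly-CRITICAL alert may be presented right now."""
    if impact_time_minutes < 30 or data_integrity_compromise:
        return "full_screen_critical"   # immediate-bypass cases
    if prior_high_seconds >= 60:
        return "full_screen_critical"   # operator already primed by HIGH state
    if held_in_high_seconds < 30:
        return "high_hold"              # cold start: hold in HIGH for 30s first
    return "full_screen_critical"       # hold elapsed; upgrade permitted
```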

Cross-hat alert override matrix: The Human Factors, Safety, and Regulatory hats jointly approve the following override rule set:

  • impact_time_minutes < 30 or equivalent imminent-impact state: bypass progressive delay; immediate full-screen CRITICAL permitted
  • data-integrity compromise (HMAC_INVALID, corrupted prediction provenance, or equivalent): immediate full-screen CRITICAL permitted
  • degraded-data or connectivity-only events without direct hazard change: progressive escalation remains mandatory
  • all immediate-bypass cases require explicit rationale in the alert type definition and traceability into the safety case and hazard log

CRITICAL alert accessibility requirements (F2): When the CRITICAL alert banner renders:

  • focus() is called on the alert dialog element programmatically
  • role="alertdialog" and aria-modal="true" on the banner container
  • aria-labelledby points to the alert title; aria-describedby points to the conjunction summary text
  • aria-hidden="true" set on the map container while the alertdialog is active; removed on dismiss
  • aria-live="assertive" region announces alert title immediately on render (separate from the dialog, for screen readers that do not expose alertdialog role automatically)
  • Visible text status indicator "⚠ Audio alert active" accompanies the audio tone for deaf or hard-of-hearing operators (audio-only notification is not sufficient as a sole channel)
  • All alert action buttons reachable by Tab from within the dialog; Escape closes only if the alert has a non-CRITICAL severity; CRITICAL requires explicit category selection before dismiss

Alarm rationalisation procedure — alarm systems degrade over time through threshold drift and alert-to-alert desensitisation. The following procedure is mandatory:

  • Persona D (Operations Analyst) reviews alert event logs quarterly
  • Any alarm type that fired ≥ 5 times in a 90-day period and was acknowledged as MONITORING ≥ 90% of the time is a nuisance alarm candidate — threshold review required before next quarter
  • Any alarm threshold change must be recorded in alarm_threshold_audit (object, old threshold, new threshold, reviewer, rationale, date); immutable append-only
  • ANSP customers may request threshold adjustments for their own organisation via the org-admin settings; changes take effect after a mandatory 7-day confirmation period and are logged in alarm_threshold_audit
  • Alert categories that have never triggered a NOTAM_ISSUED or ESCALATING acknowledgement in 12 months are escalated to Persona D for review of whether the alert should be demoted one severity level

Habituation countermeasures — repeated identical stimuli produce reduced response (habituation). The following design rules counteract alarm habituation:

  • CRITICAL audio uses two alternating tones (261 Hz and 392 Hz, ~0.25s each); the alternation pattern is varied pseudo-randomly within the specification range so the exact sound is never identical across sessions
  • CRITICAL banner background colour cycles through two dark-amber shades (#7B4000 / #6B3400) at 1 Hz — subtle variation without strobing, enough to maintain arousal without inducing distraction
  • Per-object CRITICAL rate limit (4-hour window) prevents habituation to a single persistent event
  • alert_events habituation report: any operator who has acknowledged ≥ 20 alerts of the same type in a 30-day window without a single ESCALATING or NOTAM_ISSUED response is flagged for supervisor review — this indicates potential habituation or threshold misconfiguration

Reduced-motion support (F10): WCAG 2.3.3 (Animation from Interactions — Level AAA) and WCAG 2.3.1 (Three Flashes or Below Threshold — Level A) apply. The 1 Hz CRITICAL banner colour cycle and any animated corridor rendering must respect the OS-level prefers-reduced-motion: reduce media query:

/* Default: animated */
.critical-banner { animation: amber-cycle 1s step-end infinite; }

/* Reduced motion: static high-contrast state */
@media (prefers-reduced-motion: reduce) {
  .critical-banner {
    animation: none;
    background-color: #7B4000;
    border: 4px solid #FFD580; /* thick static border as redundant indicator */
  }
}

Fatigue and cognitive load monitoring (F8 — §60): Operators on long shifts exhibit reduced alertness. The following server-side rules trigger supervisor notifications without requiring operator interaction:

| Condition | Trigger | Supervisor notification |
|---|---|---|
| Unacknowledged CRITICAL alert | > 10 minutes without acknowledgement | Push + email to org supervisor role: "CRITICAL alert unacknowledged for 10 minutes — [object, time]" |
| Stale HIGH alert | > 30 minutes without acknowledgement | Push to org supervisor: "HIGH alert unacknowledged for 30 minutes" |
| Long session without interaction | Logged-in operator: no UI interaction for 45 min during active event | Push to operator + supervisor: "Possible inactivity during active event — please verify" |
| Shift duration exceeded | Session age > org.shift_duration_hours (default 8h) | Non-blocking reminder to operator: "Your shift duration setting is 8 hours — consider handover" |

Supervisor notifications are sent to users with org_admin or supervisor role. If no supervisor role is configured for the org, the notification escalates to SpaceCom internal ops via the existing PagerDuty route with severity: warning. All supervisor notifications are logged to security_logs with event_type = SUPERVISOR_NOTIFICATION.

For CesiumJS corridor animations: check window.matchMedia('(prefers-reduced-motion: reduce)').matches on mount; if true, disable trajectory particle animation (Mode C) and set corridor opacity to a static value instead of pulsing. The preference is re-checked on change via addEventListener('change', ...) without requiring a page reload.


28.4 Probabilistic Communication to Non-Specialist Operators

Re-entry timing predictions are inherently probabilistic. Aviation operations personnel (Persona A/C) are trained in operational procedures, not orbital mechanics. The following design rules ensure probabilistic information is communicated without creating false precision or misinterpretation:

  1. No ± notation for Persona A/C — use explicit window ranges (08h–20h from now) with a "most likely" label; all absolute times rendered as HH:MMZ (e.g., 14:00Z) or DD MMM YYYY HH:MMZ (e.g., 22 MAR 2026 14:00Z) per ICAO Doc 8400 UTC-suffix convention; the Z suffix is not a tooltip — it is always rendered inline
  2. Space weather impact as operational buffer, not percentage — "Add ≥ 2h beyond the 95th percentile", not "+18% wider uncertainty"
  3. Mode C particles require a mandatory first-use overlay explaining that particles are not equiprobable; weighted opacity down-weights outliers (§6.4)
  4. "What does this mean?" expandable panel on Event Detail for Persona C (incident commanders) explaining the window in operational terms
  5. Data confidence badges contextualise all physical property estimates — unknown source triggers a warning callout above the prediction panel
  6. Tail risk annotation (F10): The p5–p95 window is the primary display, but a 10% probability of re-entry outside that range is operationally significant. Below the primary window, display: "Extreme case (2% probability outside this range): [p01_reentry_time]Z – [p99_reentry_time]Z" — labelled clearly as a tail annotation, not the primary window. This annotation is shown only when p99_reentry_time - p01_reentry_time > 1.5 × (p95_reentry_time - p05_reentry_time) (i.e., the tails are materially wider than the primary window). Also included as a footnote in NOTAM drafts when this condition is met.
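The F10 display condition is a one-line predicate. A sketch — times are expressed in hours here for readability; the real fields are timestamps:

```python
def show_tail_annotation(p01: float, p05: float, p95: float, p99: float) -> bool:
    """True when the p01-p99 tails are materially wider than the p05-p95
    primary window, per the 1.5x rule above."""
    return (p99 - p01) > 1.5 * (p95 - p05)
```

For example, a primary window of 8h with tails spanning 26h triggers the annotation; tails spanning 12h do not.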

28.5 Error Recovery and Irreversible Actions

| Action | Recovery mechanism |
|---|---|
| Analyst runs prediction with wrong parameters | superseded_by FK on reentry_predictions — marks old run as superseded; UI shows warning banner; original record preserved |
| Controller accidentally acknowledges CRITICAL alert | Two-step confirmation; structured category selection (see below) + optional free text; append-only audit log preserves full record |
| Analyst shares link to superseded prediction | "⚠ Superseded — see [newer run]" banner appears on the superseded prediction page for any viewer |
| Operator enters SIMULATION during live incident | disable_simulation_during_active_events org setting blocks mode switch while unacknowledged CRITICAL/HIGH alerts exist |

Structured acknowledgement categories — replaces 10-character text minimum. Research consistently shows forced-text minimums under time pressure produce reflexive compliance (1234567890, aaaaaaaaaa) rather than genuine engagement, creating audit noise rather than evidence:

export const ACKNOWLEDGEMENT_CATEGORIES = [
  { value: 'NOTAM_ISSUED',       label: 'NOTAM issued or requested' },
  { value: 'COORDINATING',       label: 'Coordinating with adjacent FIR' },
  { value: 'MONITORING',         label: 'Monitoring — no action required yet' },
  { value: 'ESCALATING',         label: 'Escalating to incident command' },
  { value: 'OUTSIDE_MY_SECTOR',  label: 'Outside my sector — passing to responsible unit' },
  { value: 'OTHER',              label: 'Other (free text required below)' },
] as const;
// Category selection is mandatory. Free text is optional except when value = 'OTHER'.
// alert_events.action_taken stores the category code; action_notes stores optional text.

Acknowledgement form accessibility requirements (F3):

  • Each category option rendered as <input type="radio"> with an explicit <label for="..."> — no ARIA substitutes where native HTML suffices
  • The radio group wrapped in <fieldset> with <legend>Select acknowledgement category</legend>
  • The keyboard shortcut Alt+A documented via aria-keyshortcuts="Alt+A" on the alert panel trigger element
  • A visible keyboard shortcut legend displayed within the acknowledgement dialog: "Keyboard: Alt+A to focus · Tab to change category · Enter to submit"
  • Free-text field (OTHER) labelled <label for="action_notes">Describe action taken (required)</label>; aria-required="true" when OTHER is selected
  • On submit, a screen-reader-visible confirmation announced via aria-live="polite": "Acknowledgement recorded: [category label]"

Keyboard-completable acknowledgement flow — CRITICAL acknowledgement must be completable in ≤ 3 keyboard interactions from any application state (operators frequently work with one hand on radio PTT):

Alt+A   → focus most-recent active CRITICAL alert in alert panel
Enter   → open acknowledgement dialogue (category pre-selected: MONITORING)
Enter   → submit (Tab to change category; free-text field skipped unless OTHER selected)

This keyboard path must be documented in the operator quick-reference card and tested in the Phase 2 usability study against the ≤ 3 interaction target.


28.5a Shift Handover

Shift handover is a high-risk transition point: situational awareness held by one operator must be reliably transferred to a second operator under time pressure. Recorded aviation safety events have repeatedly involved information loss at handover. SpaceCom must not become a contributing factor.

Handover screen (Persona A/C): Dedicated /handover view within Secondary Display Mode (§6.20). Accessible from main nav; also triggered automatically when an operator session exceeds org.shift_duration_hours (configurable; default: 8h).

The handover screen shows:

  1. All active CRITICAL and HIGH alerts with current status and acknowledgement history
  2. Any unresolved multi-ANSP coordination threads (§6.9)
  3. Recent window-change events (last 2h) in reverse chronological order
  4. Free-text handover notes field (plain text, ≤ 2,000 characters)
  5. "Accept handover" button — records handover event with both operator IDs and timestamp

Handover record schema:

CREATE TABLE shift_handovers (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organisations(id),
    outgoing_user   UUID NOT NULL REFERENCES users(id),
    incoming_user   UUID NOT NULL REFERENCES users(id),
    handed_over_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    notes           TEXT,                          -- operator free text, ≤ 2000 chars
    active_alerts   JSONB NOT NULL DEFAULT '[]',   -- snapshot of alert IDs + status at handover
    open_coord_threads JSONB NOT NULL DEFAULT '[]', -- snapshot of open coordination thread IDs
    CHECK (incoming_user <> outgoing_user)          -- handover integrity rule enforced in-schema
);

CREATE INDEX ON shift_handovers (org_id, handed_over_at DESC);

Handover integrity rules:

  • incoming_user must be a different users.id from outgoing_user
  • active_alerts and open_coord_threads are system-populated snapshots — the outgoing operator cannot edit them; only notes is free-form
  • Handover record is immutable after creation; retained for 7 years (aviation safety audit basis)
  • If a CRITICAL alert fires within 5 minutes of a handover record being created, the alert email/push notification includes a "⚠ Alert during handover window" flag so the incoming operator and their supervisor are aware

Structured SA transfer prompts (F4 — §60): The handover notes field (free text) is insufficient for reliable SA transfer under time pressure. The handover screen must also include a structured prompt section that the outgoing operator completes — mapping to Endsley's three SA levels:

| SA Level | Structured prompt | Type |
|----------|-------------------|------|
| Level 1 — Perception | "Active objects of concern right now:" | Multi-select from current TIP-flagged objects |
| Level 2 — Comprehension | "My assessment of the most critical object:" | Dropdown: Within sector / Adjacent sector / Low confidence / Not a concern yet + optional text |
| Level 3 — Projection | "Expected development in next 2 hours:" | Dropdown: Window narrowing / Window stable / Window widening / Awaiting new prediction + optional text |
| Decision context | "Actions I have taken or initiated:" | Multi-select from ACKNOWLEDGEMENT_CATEGORIES + free text |
| Handover flags | "Incoming operator should know:" | Checkboxes: Space weather active, Pending coordination thread, Degraded data, Unusual pattern |

The structured prompts are optional (the outgoing operator cannot be forced to complete them under time pressure) but their completion status is recorded. If the outgoing operator submits handover without completing any structured prompts, a non-blocking warning appears: "Structured SA transfer not completed — incoming operator will rely on notes only." Completion rate is reported quarterly as a human factors KPI.
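The quarterly completion-rate KPI can be derived directly from handover records. A minimal sketch, assuming each record exposes its structured prompt answers as a `structured_prompts` dict (the field name is an assumption for illustration):

```python
def structured_prompt_completion_rate(handovers: list[dict]) -> float:
    """HF KPI: fraction of handover records in which the outgoing operator
    completed at least one structured SA prompt (any non-empty answer)."""
    if not handovers:
        return 0.0
    completed = sum(
        1 for h in handovers
        if any(v not in (None, "", [], {})
               for v in h.get("structured_prompts", {}).values())
    )
    return completed / len(handovers)
```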

Session timeout accessibility (F8): WCAG 2.2.1 (Timing Adjustable — Level A) requires users be warned before session expiry and given the opportunity to extend. For operators completing a handover (which may take longer for users with cognitive or motor impairments):

  • At T−2 minutes before session expiry: an aria-live="polite" announcement fires and a non-modal warning dialog appears: "Your session will expire in 2 minutes. [Extend session] [Save and log out]"
  • If the /handover view is active when the warning fires, the session is automatically extended by 30 minutes without user interaction (silently); the warning dialog is suppressed; the extension is logged in security_logs with event_type = SESSION_AUTO_EXTENDED_HANDOVER
  • The silent auto-extension only applies once per session to prevent indefinite extension; after the 30-minute extension the standard warning dialog fires normally
  • Session extension endpoint: POST /api/v1/auth/extend-session — returns a new expiry timestamp; requires valid current session cookie
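A sketch of the once-per-session auto-extension rule; `Session` and `on_expiry_warning` are illustrative names, not the actual session implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

AUTO_EXTENSION = timedelta(minutes=30)

@dataclass
class Session:
    expires_at: datetime
    auto_extended: bool = False  # the silent extension may fire once per session

def on_expiry_warning(session: Session, handover_view_active: bool,
                      now: datetime) -> str:
    """Decide what happens when the T−2-minute expiry warning fires.

    Returns 'auto_extended' (silent 30-minute extension; logged in
    security_logs as SESSION_AUTO_EXTENDED_HANDOVER — logging call
    omitted in this sketch) or 'warn' (show the standard dialog).
    """
    if handover_view_active and not session.auto_extended:
        session.expires_at = now + AUTO_EXTENSION
        session.auto_extended = True
        return "auto_extended"
    return "warn"
```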

### 28.6 Cognitive Load Reduction

Event Detail Duty Manager View: Decluttered large-text view for Persona A showing only window, FIRs, risk level, and three action buttons. Collapses all technical detail. Designed for ops room use at a secondary glance distance. (§6.8)

Decision Prompts accordion (formerly "Response Options"): Contextualised checklist of possible ANSP actions. Not automated — for consideration only. Checkbox states create a lightweight action record without requiring Persona A to open a separate logging system. (§6.8)

The feature is renamed from "Response Options" to "Decision Prompts" throughout UI text, documentation, and API field names. "Options" implies equivalence; "Prompts" correctly signals that the list is an aide-mémoire, not a prescribed workflow.

Legal treatment of Decision Prompts: Every Decision Prompts accordion must display the following non-waivable disclaimer in 11px grey text immediately below the accordion header:

"Decision Prompts are non-prescriptive aide-mémoire items generated from common ANSP practice. They do not constitute operational procedures. All decisions remain with the duty controller in accordance with applicable air traffic regulations and your organisation's established procedures."

This disclaimer is: (a) hard-coded, not configurable; (b) included in the printed/exported Event Detail report; (c) present in the API response for Decision Prompts payloads ("legal_notice" field). Rationale: SpaceCom is decision support, not decision authority. Without an explicit disclaimer, a regulator or court could interpret a checked Decision Prompt item as evidence of a prescribed procedure not followed.

Decision prompt content template (F6 — §60): Each Decision Prompt entry must provide four fields to be actionable under operational stress:

```typescript
interface DecisionPrompt {
  id: string;
  risk_summary: string;       // Plain-language risk in ≤ 20 words. No jargon. No Pc values.
  action_options: string[];   // Specific named actions available to this operator role
  time_available: string;     // "Decision window: X hours before earliest FIR intersection"
  consequence_note?: string;  // Optional: consequence of inaction (shown only if significant)
}

// Example for a re-entry/FIR intersection:
const examplePrompt: DecisionPrompt = {
  id: 'reentry_fir_intersection',
  risk_summary: 'Object expected to re-enter atmosphere over London FIR within 8–14 hours.',
  action_options: [
    'Issue precautionary NOTAM for affected flight levels',
    'Coordinate with adjacent FIR controllers (Paris, Amsterdam)',
    'Notify airline operations centres in affected region',
    'Continue monitoring — no action required yet',
  ],
  time_available: 'Decision window: ~6 hours before earliest FIR intersection (08:00Z)',
  consequence_note: 'If window narrows below 4 hours without NOTAM, affected departures may require last-minute rerouting.',
};
```

Decision Prompts are pre-authored for each alert scenario type in docs/decision-prompts/ and reviewed annually by a subject-matter expert from an ANSP partner. They are not auto-generated by the system. New prompt types require approval from both the SpaceCom safety case owner and at least one ANSP reviewer.

Legal sufficiency note (F5): The in-UI disclaimer is a reinforcing reminder only. Under UCTA 1977 and the EU Unfair Contract Terms Directive, liability limitation requires that the customer was given a reasonable opportunity to discover and understand the term at contract formation. The substantive liability limitation clause (consequential loss excluded; aggregate cap = 12 months fees paid) must appear in the executed Master Services Agreement (§24.2). The UI disclaimer does not substitute for executed contractual terms.

Decision Prompts accessibility (F9): The accordion must implement the WAI-ARIA Accordion design pattern:

  • Accordion header: <button aria-expanded="true|false" aria-controls="panel-{id}"> (a native <button> already has the button role; no role attribute needed). Enter and Space toggle open/close
  • Panel: <div id="panel-{id}" role="region" aria-labelledby="header-{id}">
  • Arrow keys navigate between accordion items when focus is on a header button
  • Each prompt item: <input type="checkbox" id="prompt-{n}"> with <label for="prompt-{n}"> (a native checkbox exposes its checked state to assistive technology automatically; no aria-checked attribute and no ARIA role substitute should be used)
  • On checkbox state change: aria-live="polite" region announces "Action recorded: [prompt text]"
  • aria-keyshortcuts on the accordion container documents any applicable shortcuts

Attention management — operational environments have high ambient interruption rates. SpaceCom must not become an additional source of cognitive fragmentation:

| State | Interaction rate limit | Rationale |
|-------|------------------------|-----------|
| Steady-state (no active CRITICAL/HIGH) | ≤ 1 unsolicited notification per 10 minutes per user | Preserve peripheral attentional channel for ATC primary tasks |
| Active event (≥ 1 unacknowledged CRITICAL) | ≤ 1 update notification per 60 seconds for the same event | Prevent update flooding during the critical decision window |
| Critical flow (user actively in acknowledgement or handover screen) | Zero unsolicited notifications | Do not interrupt the operator while they are completing a safety-critical task |

Critical flow state is entered when: acknowledgement dialog is open, or /handover view is active. It is exited on dialog close or handover acceptance. During critical flow, all queued notifications are held and delivered as a batch summary immediately on exit.
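The three attention states reduce to a small gating policy. The following sketch (an illustrative class, not the production notifier) shows one way to rate-limit per user and per event, and to hold and batch notifications during critical flow:

```python
from datetime import datetime, timedelta
from typing import Optional

STEADY_INTERVAL = timedelta(minutes=10)   # steady-state limit, per user
ACTIVE_INTERVAL = timedelta(seconds=60)   # active-event limit, per event

class NotificationGate:
    """Attention-management gate for a single user (§28.6 sketch)."""

    def __init__(self):
        self.last_sent: dict[str, datetime] = {}  # key: 'user' or an event id
        self.in_critical_flow = False
        self.held: list[str] = []

    def offer(self, message: str, now: datetime,
              event_id: Optional[str] = None) -> bool:
        """Return True if the notification may be delivered now."""
        if self.in_critical_flow:
            self.held.append(message)  # hold for batch delivery on exit
            return False
        key = event_id or "user"
        interval = ACTIVE_INTERVAL if event_id else STEADY_INTERVAL
        last = self.last_sent.get(key)
        if last is not None and now - last < interval:
            return False  # rate-limited
        self.last_sent[key] = now
        return True

    def exit_critical_flow(self) -> list[str]:
        """On dialog close / handover acceptance: release held items as one batch."""
        self.in_critical_flow = False
        batch, self.held = self.held, []
        return batch
```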

Secondary Display Mode: Chrome-free full-screen operational view optimised for secondary monitor in an ops room alongside existing ATC displays. (§6.20)

First-time user onboarding: New organisations with no configured FIRs see a three-card guided setup rather than an empty globe. (§6.18)


### 28.7 HF Validation Approach

HF design cannot be fully validated by automated tests alone. The following validation activities are planned:

| Activity | Phase | Method |
|----------|-------|--------|
| Cognitive walkthrough of CRITICAL alert handling | Phase 1 | Developer walk-through against §28.3 alarm management requirements |
| ANSP user testing — Persona A operational scenario | Phase 2 | Structured usability test: duty manager handles a simulated TIP event; time-to-decision and error rate measured |
| Multi-ANSP coordination scenario | Phase 2 | Two-ANSP test with shared event; assess whether coordination panel reduces perceived workload vs. out-of-band comms only |
| Mode confusion scenario | Phase 2 | Participants switch between LIVE and SIMULATION; measure rate of mode errors without and with the temporal wash |
| Alarm fatigue assessment | Phase 3 | Review of LOW alarm rate over a 30-day shadow deployment; adjust thresholds if nuisance rate > 1 per 10 minutes per user |
| Final HF review by qualified human factors specialist | Phase 3 | Required for TRL 6 demonstration and ECSS-E-ST-10-12C compliance evidence |

Probabilistic comprehension test items — the Phase 2 usability study must include the following scripted comprehension items delivered verbally to participants after they view a TIP event detail screen. Items are designed to distinguish genuine probabilistic comprehension from confidence masking:

| Item | Correct answer | Common wrong answer (detects) |
|------|----------------|-------------------------------|
| "What does the re-entry window of 08h–20h from now mean — does it mean the object will come down in the middle of that period?" | No — most likely landing is in the modal estimate shown, but the object could land anywhere in the window | "Yes, probably in the middle" — detects false precision from window endpoints |
| "If SpaceCom shows Impact Probability 0.03, should you start evacuating the FIR corridor?" | Not automatically — impact probability is one input; operational decision depends on assets at risk, corridor extent, and existing procedures | "Yes, 0.03 is high for space" — detects calibration gap between space and aviation risk thresholds |
| "The window has just widened by 4 hours. Does that mean SpaceCom detected new debris or a new threat?" | No — window widening usually means updated atmospheric data or a revised mass/BC estimate increased uncertainty | "Yes, something new happened" — detects misattribution of uncertainty update to new threat |
| "SpaceCom shows 'Data confidence: TLE age 4 days'. Does that mean the prediction is wrong?" | No — it means the prediction has higher positional uncertainty; the window should be treated as wider in practice | "Yes, ignore it" — detects over-application of data quality warning |

Participants who answer ≥ 2 items incorrectly indicate a comprehension design failure requiring UI revision before shadow deployment. Target: ≥ 80% correct on each item across the test cohort.
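Scoring these items is mechanical. The following sketch (illustrative names, assuming one boolean per item per participant) flags participants under the ≥ 2-wrong rule and computes per-item pass rates against the 80% target:

```python
def comprehension_flags(results: dict[str, list[bool]]) -> list[str]:
    """Participants answering >= 2 items incorrectly indicate a
    comprehension design failure requiring UI revision."""
    return sorted(p for p, answers in results.items()
                  if answers.count(False) >= 2)

def item_pass_rates(results: dict[str, list[bool]]) -> list[float]:
    """Per-item correct rate across the cohort; target >= 0.8 on each item."""
    n = len(results)
    return [sum(col) / n for col in zip(*results.values())]
```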


### 28.8 Degraded-Data Human Factors

Operators must be able to distinguish "SpaceCom is working normally" from "SpaceCom is working but with reduced fidelity" from "SpaceCom is in a failure state" — three states that require fundamentally different responses. Undifferentiated degradation presentation causes two failure modes: operators continuing to act on stale data as if it were fresh (over-trust), or operators stopping using the system entirely during a tolerable degradation (under-trust).

Visual degradation language:

| State | Indicator | Operator action required |
|-------|-----------|--------------------------|
| All data fresh | Green status pill in system tray (§6.6) | None |
| TLE age ≥ 48h for any active CRITICAL/HIGH object | Amber "⚠ TLE stale" badge on affected event card | Widen mental model of corridor uncertainty; consult space domain Persona B/D |
| EOP data stale (> 7 days) | Amber system badge + `eop_stale` exposed in GET /readyz | Frame transform accuracy reduced; no action required unless close-approach timing is critical |
| Space weather stale (> 2h for active event) | Amber badge on Kp readout in Event Detail | Kp-dependent atmospheric drag estimates are less reliable; apply additional margin |
| AIRAC data > 35 days old | Red "⚠ AIRAC expired" badge on any FIR overlay | FIR boundaries may have changed; do not issue NOTAM text based on SpaceCom FIR names without manual verification |
| Backend unreachable | Full-screen "SpaceCom Offline" modal | No predictions available; fall back to organisational offline procedures |

Graded response rules:

  1. A single stale data source never suppresses the main operational view. Operators must be able to see the event and make decisions; stale data badges are contextual, not blocking.
  2. Multiple simultaneous amber badges (≥ 3) trigger a consolidated "Multiple data sources degraded" yellow banner at top of screen — prevents badge blindness when individual badges are numerous.
  3. The GET /readyz endpoint (§26.5) exposes all staleness states as machine-readable flags. ANSPs may configure their own monitoring to receive readyz alerts via webhook.
  4. Degraded-data states are recorded in system_health_events table and included in the quarterly operational report to Persona D.

Operator quick-reference language for degraded states — the operator quick-reference card must include a "SpaceCom status indicators" section using the exact badge text from the UI (copy-match required). Operators must not need to translate between UI text and documentation text.


### 28.9 Operator Training and Competency Specification (F10 — §60)

SpaceCom is a safety-critical decision support system. ANSP customers deploying it in operational environments will be asked by their safety regulators what training operators received. This section defines the minimum training specification. Individual ANSPs may add requirements; they may not remove them.

Minimum initial training programme:

| Module | Delivery | Duration | Completion criteria |
|--------|----------|----------|---------------------|
| M1 — System overview and safety philosophy | Instructor-led or self-paced e-learning | 2 hours | Quiz score ≥ 80% |
| M2 — Operational interface walkthrough | Instructor-led hands-on with staging environment | 3 hours | Complete reference scenario (see below) |
| M3 — Alert acknowledgement workflow | Scenario-based with role-play | 1 hour | Keyboard-completable ack in ≤ 3 interactions |
| M4 — NOTAM drafting and disclaimer | Instructor-led with sample NOTAMs | 1 hour | Produce a compliant NOTAM draft from a scenario |
| M5 — Degraded mode response | Scenario-based | 30 min | Correctly identify each degraded state + action |
| M6 — Shift handover procedure | Pair exercise | 30 min | Complete a structured handover with SA prompts |

Total minimum initial training: 8 hours. Training is completed before any operational use. Simulator/staging environment only — no training on production data.

Reference scenario (M2): A CRITICAL re-entry alert fires for an object with a 6–14 hour window intersecting two FIRs. The trainee must: acknowledge the alert, identify the FIR intersection, assess the corridor evolution, draft a NOTAM, and complete a handover to a colleague — all within 20 minutes. This scenario is standardised in docs/training/reference-scenario-01.md.

Recurrency requirements:

  • Annual refresher: 2 hours, covering any UI changes in the preceding 12 months + repeat of M3 scenario
  • After any incident where SpaceCom was a contributing factor: mandatory debrief + targeted re-training before return to operational use
  • After a major version upgrade (breaking UI changes): M2 + affected modules before using upgraded system operationally

Competency record model:

```sql
CREATE TABLE operator_training_records (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         INTEGER NOT NULL REFERENCES users(id),
    module_id       TEXT NOT NULL,          -- 'M1'..'M6' or custom ANSP module codes
    completed_at    TIMESTAMPTZ NOT NULL,
    score           INTEGER,                -- quiz score where applicable; NULL for practical
    instructor_id   INTEGER REFERENCES users(id),
    training_env    TEXT NOT NULL DEFAULT 'staging',  -- 'staging' | 'simulator'
    notes           TEXT,
    UNIQUE (user_id, module_id, completed_at)
);
```

GET /api/v1/admin/training-status (org_admin only) returns completion status for all users in the organisation. Users without all required modules completed are flagged; their access is not automatically blocked (the ANSP retains operational responsibility) but the flag is visible to org_admin and included in the quarterly compliance report.
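The aggregation behind GET /api/v1/admin/training-status can be sketched as follows, assuming training rows arrive as plain dicts (the function name and row shape are illustrative):

```python
REQUIRED_MODULES = {"M1", "M2", "M3", "M4", "M5", "M6"}

def missing_modules(user_ids: list[str],
                    records: list[dict]) -> dict[str, list[str]]:
    """Map each user in the organisation to the required modules they have
    not yet completed. Users with an empty list are fully trained; others
    are flagged (access is not blocked — the ANSP retains operational
    responsibility), and the flag feeds the quarterly compliance report."""
    done: dict[str, set[str]] = {u: set() for u in user_ids}
    for r in records:
        if r["user_id"] in done:
            done[r["user_id"]].add(r["module_id"])
    return {u: sorted(REQUIRED_MODULES - mods) for u, mods in done.items()}
```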

Training material ownership: docs/training/ directory maintained by SpaceCom. ANSP-specific scenario variants stored in docs/training/ansp-variants/. Annual review cycle tied to the CHANGELOG review process.

Training records data retention and pseudonymisation (F10 — §64): operator_training_records is personal data — it records when a named individual completed specific training activities. For former employees whose accounts are deleted, these records must not be retained indefinitely as identified personal data.

Retention policy:

  • Active users: retain for the duration of active employment (account status = 'active') plus 2 years after account deletion (for certification audit purposes — an ANSP may need to verify training history after an operator leaves)
  • After 2 years post-deletion: pseudonymise user_id → tombstone token; retain completion dates and module IDs for aggregate training statistics

```sql
-- Add to operator_training_records
ALTER TABLE operator_training_records
    ADD COLUMN pseudonymised_at TIMESTAMPTZ,
    ADD COLUMN user_tombstone   TEXT;  -- SHA-256 prefix of deleted user_id; replaces user_id link
```

The monthly pseudonymise_old_freetext Celery task (§29.3) is extended to also pseudonymise training records where the linked users row has been deleted for > 2 years:

```python
db.execute(text("""
    UPDATE operator_training_records otr
    SET user_tombstone = CONCAT('tombstone:', LEFT(ENCODE(DIGEST(otr.user_id::text, 'sha256'), 'hex'), 16)),
        pseudonymised_at = NOW()
    WHERE otr.pseudonymised_at IS NULL
      AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = otr.user_id)
      AND otr.completed_at < NOW() - INTERVAL '2 years'
"""))
```

---

## 29. Data Protection Framework

SpaceCom processes personal data in the course of providing its services. For EU and UK deployments (ESA bid context), GDPR / UK GDPR compliance is mandatory. For Australian ANSP customers, the Privacy Act 1988 (Cth) applies. This section documents the data protection design requirements.

**Standards basis:** GDPR (EU) 2016/679, UK GDPR, Privacy Act 1988 (Cth), EDPB Guidelines on data breach notification, ICO guidance on legitimate interests, CNIL recommendations on consent records.

---

### 29.1 Data Inventory

**Record of Processing Activities (RoPA) — GDPR Art. 30:** This table constitutes the RoPA. It is maintained in `legal/ROPA.md` (authoritative version) and mirrored here. Organisations with ≥ 250 employees or processing high-risk data must maintain a written RoPA; space traffic management constitutes high-risk processing (Art. 35 DPIA trigger — see below). The DPO must review and sign off the RoPA annually.

| Data type | Personal? | Lawful basis (GDPR Art. 6) | Retention | Table / Location |
|-----------|-----------|---------------------------|-----------|-----------------|
| User email, name, organisation | Yes | Contract performance (Art. 6(1)(b)) | Account lifetime + 1 year after deletion | `users` |
| IP address in security logs | Yes (pseudonymous) | Legitimate interests — security (Art. 6(1)(f)) | **90 days full; hash retained for 7 years** | `security_logs` |
| IP address at ToS acceptance | Yes | Legitimate interests — consent evidence (Art. 6(1)(f)) | **90 days full; hash retained for account lifetime + 1 year** | `users.tos_accepted_ip` |
| Alert acknowledgement text | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Multi-ANSP coordination notes | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Shift handover records | Yes (outgoing/incoming user IDs) | Legitimate interests — aviation safety / operational continuity (Art. 6(1)(f)) | 7 years | `shift_handovers` |
| Alarm threshold audit records | Yes (reviewer ID) | Legitimate interests — safety governance (Art. 6(1)(f)) | 7 years | `alarm_threshold_audit` |
| API request logs | Yes (pseudonymous — IP) | Legitimate interests — security / billing (Art. 6(1)(f)) | 90 days | Log files / SIEM |
| MFA secrets (TOTP) | Yes (sensitive account data) | Contract performance (Art. 6(1)(b)) | Account lifetime; immediately deleted on account deletion | `users.mfa_secret` (encrypted at rest) |
| Space-Track data disclosure log | No (records org-level disclosure, not individuals) | Legitimate interests — licence compliance (Art. 6(1)(f)) | 5 years | `data_disclosure_log` |

**IP address data minimisation policy (F3 — §64):** IP addresses are personal data (CJEU *Breyer*, C-582/14). The full IP address is needed for fraud detection and security investigation within the first 90 days; beyond that, only a hashed form is needed for statistical/audit purposes.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs weekly):
```python
@shared_task
def hash_old_ip_addresses():
    """Replace full IP addresses with SHA-256 hashes after the 90-day audit window."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    db.execute(text("""
        UPDATE security_logs
        SET ip_address = CONCAT('sha256:', LEFT(ENCODE(DIGEST(ip_address, 'sha256'), 'hex'), 16))
        WHERE created_at < :cutoff
          AND ip_address NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.execute(text("""
        UPDATE users
        SET tos_accepted_ip = CONCAT('sha256:', LEFT(ENCODE(DIGEST(tos_accepted_ip, 'sha256'), 'hex'), 16))
        WHERE tos_accepted_at < :cutoff  -- age measured from ToS acceptance, not account creation
          AND tos_accepted_ip NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.commit()
```

Necessity assessment for IP storage (required in DPIA §2): Full IP is necessary for: (a) detecting account takeover (geolocation anomaly), (b) rate-limiting bypass investigation, (c) regulatory/legal requests within the statutory window. Hashed form is sufficient for: (d) long-term audit log integrity (proving an event occurred from a non-obvious source), (e) statistical reporting. The 90-day threshold is the operational window for security investigations; beyond this, benefit does not outweigh data subjects' privacy interests.

DPIA requirement and structure (F1 — §64): GDPR Article 35 mandates a DPIA before processing that is likely to result in high risk. SpaceCom's processing falls under Art. 35(3)(c) — systematic monitoring — because it tracks the operational behaviour of aviation professionals (login times, alert acknowledgements, decision patterns, handover text) in a system used to support safety decisions. This is a pre-processing obligation: EU personal data cannot lawfully be processed without completing the DPIA first.

Document: legal/DPIA.md — a Phase 2 gate (must be complete before any EU/UK ANSP shadow activation).

Required DPIA structure (EDPB WP248 rev.01 template):

| Section | Content required |
|---------|------------------|
| 1. Description of processing | Purpose, nature, scope, context of processing; categories of data; data flows; recipients |
| 2. Necessity and proportionality | Why is this data necessary? Could the purpose be achieved with less data? Legal basis per activity (mapped in §29.1 RoPA) |
| 3. Risk identification | Risks to data subjects: unauthorised access to operational patterns; re-identification of pseudonymised safety records; cross-border transfer exposure; disclosure to authorities |
| 4. Risk mitigation measures | Technical: RLS, HMAC, TLS, MFA, pseudonymisation. Organisational: DPA with ANSPs, export control screening, sub-processor contracts |
| 5. Residual risk assessment | Risk level after mitigations: Low / Medium / High. If High residual risk: prior consultation with supervisory authority required (Art. 36) |
| 6. DPO opinion | Designated DPO's written sign-off or objection |
| 7. Review schedule | DPIA reviewed when processing changes materially; at least every 3 years |

The DPIA covers all processing activities in the RoPA. Key risk finding anticipated: the alert acknowledgement audit trail (who acknowledged what, when) creates a de facto performance monitoring record for individual ANSP controllers — this must be addressed in Section 3 with mitigations in Section 4 (pseudonymisation after operational retention window, access restricted to org_admin and admin roles).

Privacy Notice — must be published at the registration URL and linked from the ToS acceptance flow. Must cover: data controller identity, categories of data collected, purposes and lawful bases, retention periods, data subject rights, third-party processors (cloud provider, SIEM), cross-border transfer safeguards.


### 29.2 Data Subject Rights Implementation

| Right | Mechanism | Notes |
|-------|-----------|-------|
| Access (Art. 15) | GET /api/v1/users/me/data-export — returns all personal data held for the authenticated user as a JSON download | Available to all logged-in users |
| Rectification (Art. 16) | PATCH /api/v1/users/me — allows name, email, organisation update | Email change triggers re-verification |
| Erasure (Art. 17) | POST /api/v1/users/me/erasure-request → calls `handle_erasure_request(user_id)` | See §29.3 |
| Restriction (Art. 18) | Admin-level: `users.access_restricted = TRUE` suspends account without deleting data | Used where erasure conflicts with retention requirement |
| Portability (Art. 20) | POST /org/export (org_admin or admin) — asynchronous export of all org personal data in machine-readable JSON; fulfilled within 30 days; also used for offboarding (§29.8). Covers user-generated content (acknowledgements, handover notes); not derived physics predictions. | F11 |
| Objection (Art. 21) | For legitimate interests processing: handled by erasure or restriction pathway | No automated profiling that would trigger Art. 22 |

### 29.3 Erasure vs. Retention Conflict — Pseudonymisation Procedure

The 7-year retention requirement (UN Liability Convention, aviation safety records) conflicts with GDPR Article 17 right to erasure for personal data embedded in alert_events and security_logs. Resolution: pseudonymise, do not delete.

```python
import hashlib

from sqlalchemy import text
from sqlalchemy.orm import Session


def handle_erasure_request(user_id: int, db: Session):
    """
    Satisfy GDPR Art. 17 erasure request while preserving safety-critical records.
    Called when a user account is deleted or an explicit erasure request is received.
    """
    # Stable pseudonym — deterministic hash of user_id, not reversible
    pseudonym = f"[user deleted - ID:{hashlib.sha256(str(user_id).encode()).hexdigest()[:12]}]"

    # Pseudonymise user references in append-only safety tables
    db.execute(
        text("UPDATE alert_events SET acknowledged_by_name = :p WHERE acknowledged_by = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    db.execute(
        text("UPDATE security_logs SET user_email = :p WHERE user_id = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    # Pseudonymise shift handover records — user ID links cleared (columns are
    # nullable for this purpose); pseudonym prepended to notes so the safety
    # record remains attributable to the erased account without identifying it
    db.execute(
        text("""UPDATE shift_handovers
                SET outgoing_user = NULL, incoming_user = NULL,
                    notes = CASE WHEN outgoing_user = :uid OR incoming_user = :uid
                                 THEN CONCAT('[pseudonymised: ', :p, '] ', COALESCE(notes,''))
                                 ELSE notes END
                WHERE outgoing_user = :uid OR incoming_user = :uid"""),
        {"p": pseudonym, "uid": user_id}
    )
    # Delete the user record itself (and cascade to refresh_tokens, api_keys)
    db.execute(text("DELETE FROM users WHERE id = :uid"), {"uid": user_id})
    db.commit()
    # Log the erasure event (note: this log entry is itself pseudonymised from creation)
    log_security_event("USER_ERASURE_COMPLETED", details={"pseudonym": pseudonym})
```

The core safety records (alert_events, security_logs, reentry_predictions) are preserved. The link to the identified individual is severed. This satisfies GDPR recital 26 (pseudonymous data is not personal data when re-identification is not reasonably possible) and Article 17(3)(b) (erasure obligation does not apply where processing is necessary for compliance with a legal obligation).

Free-text field periodic pseudonymisation (F6 — §64): Handover notes (`shift_handovers.notes`) and alert acknowledgement text (`alert_events.action_taken`) are free-text fields where operators may name colleagues, reference individuals' decisions, or include other personal references. The 7-year retention of these fields as-written creates personal data retained far beyond its operational value. After the operational retention window (2 years — the period within which a re-entry event's record could be actively referenced by an ANSP), free-text personal references must be pseudonymised in place.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs monthly):

```python
@shared_task
def pseudonymise_old_freetext():
    """
    Replace identifiable free-text in operational records after 2-year operational window.
    The record itself is retained; only the human-entered text is sanitised.
    """
    cutoff = datetime.utcnow() - timedelta(days=730)  # 2 years
    # Replace acknowledgement text with sanitised marker — preserve the fact of acknowledgement
    db.execute(text("""
        UPDATE alert_events
        SET action_taken = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND action_taken IS NOT NULL
          AND action_taken NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    # Preserve handover structure; pseudonymise notes text
    db.execute(text("""
        UPDATE shift_handovers
        SET notes = '[text pseudonymised after operational retention window]'
        WHERE handed_over_at < :cutoff
          AND notes IS NOT NULL
          AND notes NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    db.commit()
```

The 2-year operational window is chosen because: (a) PIR processes complete within 5 business days; (b) regulatory investigations of re-entry events typically complete within 12–18 months; (c) 2 years provides margin. Beyond 2 years, the text serves no legitimate purpose that outweighs the data subject's interest in not having their decision-making text retained indefinitely.


### 29.4a Data Subject Access Request Procedure (F7 — §64)

The GET /api/v1/users/me/data-export endpoint exists (§29.2). The DSAR procedure — how requests are received, processed, and responded to within the statutory deadline — must also be documented.

DSAR SLA: 30 calendar days from receipt of the verified request (GDPR Art. 12(3)). Extension to 60 days permitted for complex requests with written notice to the data subject within the first 30 days.
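The deadline arithmetic, per the SLA above (a sketch; `dsar_deadline` is an illustrative helper, not an existing function):

```python
from datetime import date, timedelta

def dsar_deadline(received: date, extended: bool = False) -> date:
    """Response due 30 calendar days from receipt of the verified request;
    a complex request may be extended to 60 days, provided the data
    subject is notified in writing within the first 30 days."""
    return received + timedelta(days=60 if extended else 30)
```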

DSAR procedure (docs/runbooks/dsar-procedure.md):

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Receive request (email to privacy@spacecom.io or in-app POST /api/v1/users/me/data-export-request) | DPO/designated contact | Day 0 |
| 2 | Verify identity of requestor (must be the data subject or authorised representative) | DPO | Within 3 business days |
| 3 | Assess scope: what data is held? Which tables? What exemptions apply (safety record retention)? | DPO + engineering | Within 7 days |
| 4 | Generate export: GET /api/v1/users/me/data-export for self-service; admin endpoint for cases where account is deleted/suspended | Engineering | Within 20 days |
| 5 | Deliver export: encrypted ZIP sent to verified email address | DPO | By day 28 |
| 6 | Document: log in legal/DSAR_LOG.md — request date, identity verified, scope, delivery date, any exemptions invoked | DPO | Same day as delivery |
| 7 | If exemption applied (safety records retained): provide written explanation of the exemption and residual rights | DPO | Included in delivery |

GET /api/v1/users/me/data-export response scope — must include all of:

  • users record fields (excluding password hash)
  • alert_events where acknowledged_by = user.id (pre-pseudonymisation only)
  • shift_handovers where outgoing_user = user.id or incoming_user = user.id
  • operator_training_records for the user
  • api_keys metadata (not the key value itself)
  • security_logs where user_id = user.id (pre-IP-hashing only)
  • tos_accepted_at, tos_version from users

Fields excluded from DSAR export (not personal data or subject to legitimate processing exemption):

  • reentry_predictions (not personal data)
  • security_logs entries of type HMAC_KEY_ROTATION, DEPLOY_* (operational audit, not personal)
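The scope and exclusion rules above can be sketched as a small assembly helper. This is illustrative only — the function name and the plain-dict row representation are assumptions, not the codebase's API; real code would query the tables listed in this section:

```python
def build_dsar_export(user: dict, api_keys: list[dict], security_logs: list[dict]) -> dict:
    """Assemble a DSAR export payload per the scope rules above (sketch only)."""
    return {
        # users record, excluding the password hash
        "user": {k: v for k, v in user.items() if k != "password_hash"},
        # API key metadata only -- never the key value itself
        "api_keys": [
            {k: v for k, v in key.items() if k not in ("key_value", "key_hash")}
            for key in api_keys
        ],
        # security_logs rows for this user, excluding purely operational event types
        "security_logs": [
            row for row in security_logs
            if row.get("user_id") == user["id"]
            and not row["event_type"].startswith(("HMAC_KEY_ROTATION", "DEPLOY_"))
        ],
    }
```

The same exclusion logic applies whether the export runs via the self-service endpoint or the admin path for deleted accounts.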

29.4 Data Processing Agreements

A Data Processing Agreement (DPA) is required in every commercial relationship where SpaceCom acts as a data processor for customer personal data (GDPR Art. 28).

SpaceCom acts as data processor for: user data belonging to ANSP and space operator customers (the customers are the data controllers for their employees' data).

SpaceCom acts as data controller for: its own user authentication data, security logs, and analytics.

Required DPA provisions (GDPR Art. 28(3)):

  • Processing only on documented instructions of the controller
  • Confidentiality obligations on authorised processors
  • Technical and organisational security measures (reference §7)
  • Sub-processor approval process (cloud provider, SIEM)
  • Data subject rights assistance obligations
  • Deletion or return of data on contract termination
  • Audit and inspection rights for the controller

The DPA template must be reviewed by counsel before any EU/UK commercial deployment. It is a standard addendum to the MSA.

Sub-processor register (F9 — §64): GDPR Article 28(2) requires that the controller authorises sub-processors, and Article 28(4) requires that the processor imposes equivalent obligations on sub-processors. The DPA template references a sub-processor register; that register must exist as a standalone document.

Document: legal/SUB_PROCESSORS.md — Phase 2 gate (required before first EU/UK commercial deployment).

| Sub-processor | Service | Personal data transferred | Location | Transfer mechanism | DPA in place |
|---|---|---|---|---|---|
| Cloud host (e.g. AWS/Hetzner) | Infrastructure hosting | All categories (hosted on their infrastructure) | EU-central-1 (Frankfurt) | Adequacy / SCCs | AWS DPA / Hetzner DPA |
| GitHub | Source code hosting, CI/CD | Developer usernames; may appear in test fixtures | US | EU SCCs (Module 2) | GitHub DPA |
| Email delivery provider (e.g. Postmark, SES) | Transactional email (alert notifications) | User email address, name, alert content | US | EU SCCs (Module 2) | Provider DPA |
| Grafana Cloud (if used) | Observability / monitoring | IP addresses in logs ingested to Loki | US/EU | SCCs / EU region option | Grafana DPA |
| Sentry (if used) | Error tracking | Stack traces may contain user IDs, request data | US | EU SCCs | Sentry DPA |

Customer notification obligation: ANSPs (as data controllers) must be notified ≥30 days before any new sub-processor is added. The DPA addendum requires this. The sub-processor register is the mechanism for tracking and triggering notifications.


29.5 Cross-Border Data Transfer Safeguards

For EU/UK customers where SpaceCom infrastructure is hosted outside the EU/UK (e.g., AWS us-east-1):

  • Use EU/UK regions where available, or
  • Execute Standard Contractual Clauses (SCCs — 2021 EU SCCs / UK IDTA) with the cloud provider
  • Document the transfer mechanism in the Privacy Notice

For Australian customers: the Privacy Act's Australian Privacy Principle 8 (cross-border disclosure) requires contractual protections equivalent to the APPs when transferring personal data internationally.

Data residency policy (Finding 8):

  • Default hosting: EU jurisdiction (eu-central-1 / Frankfurt or equivalent) — satisfies EU data residency requirements for ECAC ANSP customers; stated in the MSA and DPA
  • On-premise option: Institutional tier supports customer-managed on-premise deployment (§34 specifies the deployment model); customer's own infrastructure, own jurisdiction; SpaceCom provides a deployment package and support contract
  • Multi-tenancy isolation: Each ANSP organisation's operational data (alert_events, notam_drafts, coordination notes) is accessible only to that organisation's users — enforced by RLS (§7.2). Multi-tenancy does not mean data co-mingling
  • Subprocessor disclosure: docs/legal/data-residency-policy.md lists hosting provider, region, and any subprocessors; updated when subprocessors change; referenced in the DPA; customers notified of material subprocessor changes ≥ 30 days in advance
  • organisations.hosting_jurisdiction and organisations.data_residency_confirmed columns (§9.2) track per-organisation residency state; admin UI surfaces this to Persona D
  • Authoritative document: legal/DATA_RESIDENCY.md — lists hosting provider, region, all sub-processors with their data residency and SCCs/IDTA status; reviewed and re-signed annually by DPO; customers notified of material sub-processor changes ≥30 days in advance per DPA obligations

29.6 Security Breach Notification

Regulatory notification obligations by framework:

| Framework | Trigger | Deadline | Authority | Template location |
|---|---|---|---|---|
| GDPR Art. 33 | Personal data breach affecting EU/UK data subjects | 72 hours of discovery | National DPA (e.g. ICO, CNIL, BfDI) | legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md |
| UK GDPR | As above for UK data subjects | 72 hours | ICO | As above |
| NIS2 Art. 23 | Significant incident affecting network/information systems of an essential entity | Early warning: 24 hours of becoming aware; full notification: 72 hours; final report: 1 month | National CSIRT + competent authority (space traffic management is likely an essential sector under NIS2 Annex I) | As above |
| Australian Privacy Act | Eligible data breach (serious harm likely) | ASAP (no fixed period; promptness required) | OAIC | As above |

Incident response timeline:

| Step | Timing | Action |
|---|---|---|
| Detect and contain | Immediately | Revoke affected credentials; isolate affected service; preserve logs |
| Assess scope | Within 2 hours | Determine: categories of data affected, approximate number of data subjects, jurisdictions, NIS2 applicability |
| Notify legal counsel and DPO | Within 4 hours of detection | Counsel advises on notification obligations across all applicable frameworks |
| NIS2 early warning | Within 24 hours of awareness | If significant incident: notify national CSIRT with initial information; no need for complete picture at this stage |
| Notify supervisory authority (EU/UK GDPR) | Within 72 hours of discovery | Via national DPA portal; even if incomplete — update as more known |
| NIS2 full notification | Within 72 hours of awareness | Full incident notification to national CSIRT / competent authority |
| Notify data subjects | Without undue delay | If breach likely to result in high risk to individuals |
| NIS2 final report | Within 1 month of full notification | Detailed description, impact assessment, cross-border impact, measures taken |
| Document | Ongoing | GDPR Art. 33(5) requires documentation of all breaches; NIS2 requires audit trail |

GDPR and NIS2 breach notification is integrated into the §26.8 incident response runbook. The security_logs record type DATA_BREACH triggers the breach notification workflow. On-call engineers must be trained to recognise when NIS2 thresholds (significant impact on service continuity or data integrity) are met and escalate to the DPO within the 24-hour window. Full obligations mapped in legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md.
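The overlapping deadlines in the timeline can be derived mechanically from the detection timestamp. A minimal sketch — the function name is hypothetical, and the NIS2 "one month" final-report window is approximated here as 30 days:

```python
from datetime import datetime, timedelta, timezone

def notification_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Compute regulatory notification deadlines from the detection time (sketch)."""
    full_notification = detected_at + timedelta(hours=72)
    return {
        # NIS2 Art. 23 early warning: 24h from becoming aware
        "nis2_early_warning": detected_at + timedelta(hours=24),
        # GDPR Art. 33: 72h from discovery, to the national DPA
        "gdpr_supervisory_authority": detected_at + timedelta(hours=72),
        # NIS2 full notification: 72h from awareness
        "nis2_full_notification": full_notification,
        # NIS2 final report: 1 month after the full notification (approximated as 30 days)
        "nis2_final_report": full_notification + timedelta(days=30),
    }
```

Surfacing these computed deadlines to the on-call engineer at DATA_BREACH time removes one manual step from the escalation path.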


29.7 Cookie Compliance (ePrivacy Directive)

Even as a B2B SaaS operating within corporate networks, SpaceCom must comply with the ePrivacy Directive (2002/58/EC as amended) for any non-essential cookies set on EU/UK user browsers.

Cookie audit (required at least annually — legal/COOKIE_POLICY.md):

| Cookie name | Category | Purpose | Lifetime | Consent required? |
|---|---|---|---|---|
| session | Strictly necessary | Authenticated session token | Session / 8h inactivity | No |
| csrf_token | Strictly necessary | CSRF protection | Session | No |
| tos_version | Strictly necessary | ToS acceptance tracking | 1 year | No |
| feature_flags | Functional | A/B flags for UI features | 30 days | Yes (functional consent) |
| _analytics | Analytics | Usage telemetry (if implemented) | 13 months | Yes (analytics consent) |

Security requirements for all session cookies (ePrivacy + §36 security):

Set-Cookie: session=...; HttpOnly; Secure; SameSite=Strict; Path=/; Max-Age=28800

Consent implementation:

  • Consent banner displayed on first visit to any EU/UK user before any non-essential cookies are set
  • Three options: Accept all / Functional only / Strictly necessary only
  • Consent preference stored in user_cookie_preferences or localStorage (no cookie used to store consent — self-defeating)
  • Consent is re-requested if cookie categories change materially
  • B2B context note: even if the organisation has a corporate cookie policy, individual users' consent is required under ePrivacy; organisational IT policies do not substitute for individual consent

Cookie policy: legal/COOKIE_POLICY.md — published at registration URL and linked from the consent banner. Reviewed when new cookies are introduced or existing cookies change purpose.
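The consent-gating rule — a non-essential cookie is set only when the stored preference covers its category — can be sketched as a lookup. Cookie and category names follow the audit table above; the helper and the consent-level encoding are illustrative assumptions:

```python
# Ordered consent levels matching the three banner options
CONSENT_LEVELS = {"strictly_necessary": 0, "functional": 1, "accept_all": 2}

# Category per cookie, from the cookie audit table
COOKIE_CATEGORIES = {
    "session": "strictly_necessary",
    "csrf_token": "strictly_necessary",
    "tos_version": "strictly_necessary",
    "feature_flags": "functional",
    "_analytics": "analytics",
}

# Minimum consent level required to set a cookie of each category
CATEGORY_LEVEL = {"strictly_necessary": 0, "functional": 1, "analytics": 2}

def may_set_cookie(cookie_name: str, user_consent: str) -> bool:
    """True when the user's stored consent level covers the cookie's category."""
    required = CATEGORY_LEVEL[COOKIE_CATEGORIES[cookie_name]]
    return required <= CONSENT_LEVELS[user_consent]
```

Strictly necessary cookies pass at every consent level, so the session and CSRF cookies are never blocked by a minimal consent choice.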


29.8 Organisation Onboarding and Offboarding (F4)

Onboarding workflow

New organisation provisioning requires explicit admin action — self-serve registration is not available in Phase 1 (safety-critical context; all organisations are individually vetted).

Onboarding gates (all must be satisfied before subscription_status = 'active'):

  1. Legal: MSA executed (countersigned PDF stored in legal/contracts/{org_id}/msa.pdf)
  2. Export control: export_control_cleared = TRUE on the organisations row (BIS Entity List check; see §24.2)
  3. Space-Track: If the organisation requires Space-Track data: space_track_registered = TRUE; space_track_username recorded; data disclosure log seeded
  4. Billing: billing_contacts row created; VAT number validated for EU customers
  5. Admin user: at least one org_admin user created with MFA enrolled
  6. ToS: primary org_admin user has tos_accepted_at IS NOT NULL

Each gate is a checklist step in docs/runbooks/org-onboarding.md. Completing all gates creates a subscription_periods row with period_start = NOW().
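A minimal sketch of the gate check, assuming the gate states are readable as boolean fields (the field names here are illustrative, not the actual schema):

```python
def unmet_onboarding_gates(org: dict) -> list[str]:
    """Return the names of onboarding gates not yet satisfied (sketch only)."""
    gates = {
        "msa_executed": org.get("msa_executed", False),
        "export_control_cleared": org.get("export_control_cleared", False),
        # Space-Track gate applies only to orgs that need Space-Track data
        "space_track_ok": (not org.get("requires_space_track", False))
                          or org.get("space_track_registered", False),
        "billing_contact": org.get("billing_contact_created", False),
        "org_admin_with_mfa": org.get("org_admin_mfa_enrolled", False),
        "tos_accepted": org.get("tos_accepted", False),
    }
    return [name for name, ok in gates.items() if not ok]

# subscription_status may move to 'active' only when the returned list is empty.
```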

Offboarding workflow

When an organisation's subscription ends (churn, termination, or suspension), the offboarding procedure:

| Step | Action | Who | When |
|---|---|---|---|
| 1 | Set subscription_status = 'churned' / 'suspended' | Admin | Immediately |
| 2 | Revoke all api_keys for the org | Admin (automated) | Immediately |
| 3 | Invalidate all active sessions (refresh_tokens) | Admin (automated) | Immediately |
| 4 | Notify org primary contact: 30-day data export window | Admin | Same day |
| 5 | Generate and deliver org data export archive | Admin | Within 3 business days |
| 6 | After 30-day window: pseudonymise user personal data | Automated job | Day 31 |
| 7 | Retain non-personal safety records (7-year minimum) | DB — no action | Ongoing |
| 8 | Confirm deletion in writing to org billing contact | Admin | After step 6 |

GDPR Art. 17 vs. retention conflict: User personal data (name, email, IP addresses) is pseudonymised per §29.3 after the 30-day window. Safety records (alert_events, reentry_predictions, shift_handovers) are retained for 7 years per UN Liability Convention — the organisation row remains in the database with subscription_status = 'churned' as the foreign key anchor. No safety record is deleted.

Suspension vs. termination: A suspended organisation (subscription_status = 'suspended') retains data and can be reactivated by an admin. A churned organisation enters the 30-day export window immediately. Suspension is used for payment failure; churn for voluntary or contractual termination.


29.9 Audit Log Personal Data Separation (F8 — §64)

security_logs currently serves two distinct purposes with conflicting retention requirements:

  • Integrity audit records (HMAC checks, ingest events, deploy markers): no personal data; 7-year retention under UN Liability Convention
  • Personal data processing records (user logins, IP addresses, acknowledgement events): personal data; subject to data minimisation, IP hashing at 90 days, erasure on request

Mixing these in one table means a single retention policy applies to both — either over-retaining personal data (7 years) or under-retaining operational integrity records. Required separation:

-- New table: operational integrity audit — no personal data, 7-year retention
CREATE TABLE integrity_audit_log (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type   TEXT NOT NULL,  -- 'HMAC_VERIFICATION', 'INGEST_SUCCESS', 'DEPLOY_COMPLETED', etc.
    source       TEXT,           -- service name, job ID
    details      JSONB,          -- operational context; must not contain user IDs or IPs
    severity     TEXT NOT NULL DEFAULT 'INFO'
);

-- Existing security_logs: personal data processing records — IP hashing at 90d, erasure on request
-- Add constraint: security_logs must only hold user-action event types
ALTER TABLE security_logs ADD CONSTRAINT chk_security_logs_type
    CHECK (event_type IN (
        'LOGIN', 'LOGOUT', 'MFA_ENROLLED', 'PASSWORD_RESET', 'API_KEY_CREATED',
        'API_KEY_REVOKED', 'TOS_ACCEPTED', 'DATA_BREACH', 'USER_ERASURE_COMPLETED',
        'SAFETY_OCCURRENCE', 'DEPLOY_ALERT_GATE_OVERRIDE', 'HMAC_KEY_ROTATION',
        'AIRSPACE_UPDATE', 'EXPORT_CONTROL_SCREENED', 'SHADOW_MODE_ACTIVATED'
    ));

Migration: Existing security_logs records of type INGEST_*, HMAC_VERIFICATION_* (pass/fail), DEPLOY_COMPLETED are migrated to integrity_audit_log. The personal-data-containing events remain in security_logs with the updated retention and IP-hashing policy.
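The migration's routing rule can be expressed as a single classification function — a sketch, with the prefixes taken from the migration note above (the function name is hypothetical):

```python
# Event-type prefixes that identify operational integrity records
INTEGRITY_PREFIXES = ("INGEST_", "HMAC_VERIFICATION_", "DEPLOY_COMPLETED")

def target_table(event_type: str) -> str:
    """Route a legacy security_logs event type to its post-migration table."""
    if event_type.startswith(INTEGRITY_PREFIXES):
        return "integrity_audit_log"
    # Everything else is a user-action record and stays in security_logs,
    # including HMAC_KEY_ROTATION, which is in the CHECK constraint list above
    return "security_logs"
```

Running this rule over the existing rows before applying the CHECK constraint guarantees the constraint cannot fail on legacy data.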

Benefit: integrity_audit_log can be retained for 7 years without any privacy obligation. security_logs is subject to the 90-day IP hashing, erasure-on-request, and 2-year text pseudonymisation policies without affecting integrity records.


29.10 Lawful Basis Mapping and ToS Acceptance Clarification (F11 — §64)

The first-login ToS/AUP acceptance flow (§3.1, §13) gates access and records tos_accepted_at. This mechanism does not mean consent (Art. 6(1)(a)) is the universal lawful basis for all processing. The RoPA (§29.1) maps the correct basis per activity; this section clarifies the principle.

Lawful basis is determined by purpose, not by the collection mechanism:

| Processing activity | Correct basis | Why NOT consent |
|---|---|---|
| Delivering alerts and predictions the user subscribed to | Art. 6(1)(b) — contract performance | User contracted for the service; consent would be revocable and would prevent service delivery |
| Security logging of user actions | Art. 6(1)(f) — legitimate interests (fraud/security) | Required regardless of consent; security cannot be conditional on consent |
| Audit trail for UN Liability Convention | Art. 6(1)(c) — legal obligation | Statutory retention requirement; consent is irrelevant |
| Fatigue monitoring triggers (§28.3 — server-side thresholds) | Art. 6(1)(b) or (f) | Part of the contracted service and/or legitimate safety interest; not health data (Art. 9) because no health information is processed — only activity patterns |
| Sending marketing or product update emails (not core service) | Art. 6(1)(a) — consent | Marketing emails require opt-in consent separate from service ToS |

ToS acceptance is consent evidence only for: (a) acknowledgement of terms, (b) Space-Track redistribution acknowledgement, (c) export control acknowledgement. It is not a blanket consent to all processing.

Implementation requirement: The Privacy Notice (§29.1) must state the correct lawful basis for each category of processing, not imply consent for all. Legal counsel review required before publication.


29.11 Open Source / Dependency Licence Compliance (§66)

SpaceCom is a closed-source SaaS product. Certain open-source licence obligations apply regardless of whether source code is distributed, because SpaceCom serves a web application to end users over a network. This section documents licence assessments for all material dependencies.

Reference document: legal/OSS_LICENCE_REGISTER.md — authoritative per-dependency licence record, updated on every major dependency version change.

F1 — CesiumJS AGPLv3 Commercial Licence

CesiumJS is licensed under AGPLv3. The AGPL network-use provision (AGPLv3 §13) requires that any software that incorporates AGPLv3 code and is served over a network must make its complete corresponding source available to users. SpaceCom is closed-source and does not satisfy this requirement under the AGPLv3 terms.

Required action: A commercial licence from Cesium Ion must be executed and stored at legal/LICENCES/cesium-commercial.pdf before any Phase 1 demo or ESA evaluation deployment. The CI licence gate (license-checker-rseidelsohn --excludePackages "cesium") is correct only when a valid commercial licence exists — the exclusion without the licence is a false negative. The commercial licence is referenced in ADR-0007 (docs/adr/0007-cesiumjs-commercial-licence.md).

Phase gate: legal/LICENCES/cesium-commercial.pdf present and legal_clearances.cesium_commercial_executed = TRUE is a Phase 1 go/no-go criterion. Block all external deployments until confirmed.

F3 — Space-Track AUP Redistribution Prohibition

Space-Track Terms of Service prohibit redistribution of TLE and CDM data to unregistered parties. SpaceCom's ingest pipeline fetches TLE/CDM data under a single registered account and serves derived predictions to ANSP users. The redistribution risk surfaces in two ways:

  1. Raw TLE exposure via API: If SpaceCom's API returns raw TLE strings (e.g., in /objects/{id}/tle), and those strings are accessible to unauthenticated users or third-party integrations, this may constitute redistribution. All TLE endpoints must require authentication and must not be proxied to unregistered downstream systems.

  2. Credentials in client-side code or SBOM: SPACE_TRACK_PASSWORD must never appear in frontend/ source, git history, SBOM artefacts, or any publicly accessible location. Validate with detect-secrets (already in pre-commit hook) and git secrets --scan-history.

ADR: docs/adr/0016-space-track-aup-architecture.md — records the chosen path (shared ingest vs. per-org credentials) with AUP clarification evidence.

F4 — Python Dependency Licence Assessment

| Package | Licence | Risk | Mitigation |
|---|---|---|---|
| NumPy | BSD-3 | None | — |
| SciPy | BSD-3 | None | — |
| astropy | BSD-3 | None | — |
| sgp4 | MIT | None | — |
| poliastro | MIT / LGPLv3 (components) | Low | LGPLv3 requires dynamic linking ability; standard pip install satisfies LGPL dynamic linking. SpaceCom does not ship a modified poliastro — no relinking obligation arises. Document in legal/LGPL_COMPLIANCE.md. |
| FastAPI | MIT | None | — |
| SQLAlchemy | MIT | None | — |
| Celery | BSD-3 | None | — |
| Pydantic | MIT | None | — |
| Playwright (Python) | Apache 2.0 | None | Chromium binary downloaded at build time; not redistributed. Captured in SBOM. |

LGPL compliance document: legal/LGPL_COMPLIANCE.md must confirm: (a) poliastro is installed via pip as a separate library, (b) SpaceCom does not statically link or incorporate modified poliastro source, (c) users can substitute a modified poliastro by reinstalling — this is satisfied by standard Python packaging. No further action required beyond this documentation.

F5 — TimescaleDB Licence Assessment

TimescaleDB uses a dual-licence model:

| Feature | Licence | SpaceCom use? |
|---|---|---|
| Hypertables, continuous aggregates, compression, time_bucket() | Apache 2.0 | Yes — all core features used by SpaceCom |
| Multi-node distributed hypertables | Timescale Licence (TSL) | No — single-node at all tiers |
| Data tiering (automated S3 tiering) | TSL | No — SpaceCom uses MinIO ILM / manual S3 lifecycle, not TimescaleDB tiering |

Assessment: SpaceCom uses only Apache 2.0-licensed TimescaleDB features. No Timescale commercial agreement required. Document in legal/LICENCES/timescaledb-licence-assessment.md. Re-assess if multi-node or data tiering features are adopted at Tier 3.

F6 — Redis SSPL Assessment

Redis 7.4+ adopted the Server Side Public Licence (SSPL). SSPL § 13 requires that any entity offering the software as a service must open-source their entire service stack. The relevant question for SpaceCom is whether deploying Redis as an internal component of SpaceCom constitutes "offering Redis as a service."

Assessment: SpaceCom operates Redis internally — users interact with SpaceCom's API and WebSocket interface, not directly with Redis. This is not offering Redis as a service. The SSPL obligation does not apply to internal use of Redis as a component. However, legal counsel should confirm this position before Phase 3 (operational deployment).

Alternative if legal counsel disagrees: Pin to Redis 7.2.x (BSD-3-Clause, last release before SSPL adoption) or migrate to Valkey (BSD-3-Clause fork maintained by Linux Foundation). Either is a drop-in replacement. Document the chosen path in legal/LICENCES/redis-sspl-assessment.md.

Action: Update pip-licenses fail-on list to include "Server Side Public License" as a blocking licence category. Redis itself is not in the Python dependency tree (it is a Docker service), so this is a docker-image licence check. Add to Trivy scan policy.

F7 — Playwright and Chromium Binary Licence

Playwright (Python) is Apache 2.0. The Chromium binary bundled by Playwright uses the Chromium licence (BSD-3-Clause for most code; additional component licences apply for media codecs). Chromium is not redistributed by SpaceCom — Playwright downloads it at container build time via playwright install chromium.

Assessment: Internal use only; no redistribution. SBOM captures the Playwright version; Chromium binary version is captured by syft scanning the container image at the cosign attest step. No further action required.

F8 — Caddy Licence Assessment

Caddy server is Apache 2.0. Community plugins (the modules used in §26.9: encode, reverse_proxy, tls, file_server) are Apache 2.0. No Caddy enterprise plugins are used by SpaceCom. Caddy DNS challenge modules (if used for ACME wildcard certificates) must be verified — the caddy-dns/cloudflare module is MIT.

Audit requirement: On any Caddyfile change that adds a new module, verify its licence before merging. Add to the PR checklist for infrastructure changes.

F9 — PostGIS Licence Assessment

PostGIS is GPLv2+ with a linking exception for use with PostgreSQL. The linking exception reads: "the copyright holders of PostGIS grant you permission to use PostGIS as a PostgreSQL extension without this resulting in the entire combined work becoming subject to the GPL." SpaceCom uses PostGIS as a PostgreSQL extension (loaded via CREATE EXTENSION postgis) — the linking exception applies.

SpaceCom does not distribute PostGIS, does not modify PostGIS source, and does not ship a combined work — PostGIS is a runtime dependency of the database service. No GPLv2 obligation arises. Document in legal/LGPL_COMPLIANCE.md alongside the poliastro LGPL note.

F10 — Licence Change Monitoring CI Check

The existing pip-licenses --fail-on list (§7.13) catches Python GPL/AGPL. Additions required:

# .github/workflows/ci.yml (security-scan job — update existing step)
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Block: GPL v2, GPL v3, AGPL v3, SSPL (if any Python package adopts it)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3);Server Side Public License"

- name: npm licence gate (updated)
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL;SSPL"

Additionally, pin all Python and Node dependencies to exact versions in requirements.txt and package-lock.json. Renovate Bot PRs (§7.13) provide controlled upgrade paths; the licence gate re-runs on each Renovate PR to catch licence changes introduced by version upgrades.

F11 — Contributor Licence Agreement for External Contributors

Before any contractor, partner, or third-party engineer contributes code to SpaceCom:

  1. A CLA or work-for-hire clause must be in their contract confirming that all IP created for SpaceCom is owned by SpaceCom (or the appointing entity, per agreement).
  2. The CLA template is at legal/CLA.md — a simple assignment of copyright for contributions made under contract.
  3. The GitHub repository's CONTRIBUTING.md must state: "External contributions require a signed CLA. Contact legal@spacecom.io before submitting a PR."

Phase gate: Before any Phase 2 ESA validation partnership involves third-party engineering, confirm all engineers have executed the CLA or have work-for-hire clauses in their contracts. Unattributed IP in an ESA bid creates serious procurement risk.


30. DevOps / Platform Engineering

30.1 Pre-commit Hook Specification

All six hooks are required. The same hooks run locally (via pre-commit) and in CI (lint job). A push to GitHub that bypasses local hooks will fail CI.

.pre-commit-config.yaml:

repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
      - id: ruff
        args: ['--fix']
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0
    hooks:
      - id: mypy
        additional_dependencies: ['types-requests', 'sqlalchemy[mypy]']

  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker

  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier
        types_or: [javascript, typescript, html, css, json, yaml]

  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.0
    hooks:
      - id: sqlfluff-lint
        args: ['--dialect', 'postgres']
      - id: sqlfluff-fix
        args: ['--dialect', 'postgres']

All hooks are pinned by rev; update via pre-commit autoupdate in a dedicated dependency update PR. The detect-secrets baseline (.secrets.baseline) is committed to the repo and updated whenever legitimate secrets-like strings are added.

detect-secrets baseline maintenance process — incorrect baseline updates are the most common way this hook is neutralised. The correct procedure must be documented and enforced:

# docs/runbooks/detect-secrets-update.md (required runbook)

# CORRECT: update baseline to add a new allowance while preserving existing ones
# (in detect-secrets v1.x, scan --baseline updates the file in place)
detect-secrets scan --baseline .secrets.baseline
git add .secrets.baseline
git commit -m "chore: update detect-secrets baseline for <reason>"

# WRONG — overwrites ALL existing allowances:
# detect-secrets scan > .secrets.baseline   ← NEVER do this

CI check verifies baseline currency on every PR (stale baseline = hook not enforced). The git diff -I filter ignores the generated_at timestamp that a scan rewrites on every run:

# In lint job, after running pre-commit:
detect-secrets scan --baseline .secrets.baseline
git diff -I '"generated_at"' --exit-code .secrets.baseline || \
  (echo "ERROR: .secrets.baseline is stale — run: detect-secrets scan --baseline .secrets.baseline and commit the result" && exit 1)

detect-secrets is the canonical secrets scanner (entropy + regex). git-secrets (listed in §7.13) is also retained for its AWS credential pattern matching, which complements detect-secrets. Both run as pre-commit hooks; there is no conflict — they check different pattern sets.


30.2 Multi-Stage Dockerfile Pattern

All service Dockerfiles follow the builder/runtime two-stage pattern. No exceptions without documented justification.

Backend (example — same pattern for worker and ingest):

# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /build

# Install build dependencies (not copied to runtime stage)
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev

COPY backend/requirements.txt .
# --require-hashes enforces that every package in requirements.txt carries a hash annotation.
# pip-compile --generate-hashes produces these. Without this flag, hash pinning is specified
# but not verified during build — a dependency confusion attack would be silently installed.
RUN pip install --upgrade pip && \
    pip wheel --no-cache-dir --require-hashes --wheel-dir /wheels -r requirements.txt

# Stage 2: runtime
FROM python:3.12-slim AS runtime
WORKDIR /app

# Create non-root user at build time
RUN groupadd --gid 1001 appuser && \
    useradd --uid 1001 --gid appuser --no-create-home appuser

# Install only compiled wheels — no build tools
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl && \
    rm -rf /wheels

COPY backend/app ./app

USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Frontend:

FROM node:22-slim AS builder
WORKDIR /build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM node:22-slim AS runtime
WORKDIR /app
RUN groupadd --gid 1001 appuser && useradd --uid 1001 --gid appuser --no-create-home appuser
COPY --from=builder /build/.next/standalone ./
COPY --from=builder /build/.next/static ./.next/static
COPY --from=builder /build/public ./public
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]

Version pin rule: All Python service images use python:3.12-slim. All frontend/Node images use node:22-slim. Any FROM line using a different tag fails the hadolint pre-commit hook and CI lint step. Do not drift these — the service table in §3.2 and the Dockerfiles must agree.

CI verification — the build-and-push job includes:

# Verify no build tools in runtime image (which exits non-zero when gcc is absent)
if docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA which gcc; then
  echo "ERROR: gcc present in runtime image" && exit 1
fi
# Verify the image runs as the non-root appuser (uid 1001) by default
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA id -u | grep -qx "1001" || exit 1
# Verify correct Python version
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA python --version | grep -q "Python 3.12" || exit 1

Image digest pinning in production Compose files (F4 — §59): The production docker-compose.yml pins images by digest, not by mutable tag, to guarantee bit-for-bit reproducibility and prevent registry-side tampering:

# docker-compose.yml — production image references
# Update digests via: make update-image-digests (runs after each build-and-push)
services:
  backend:
    image: ghcr.io/spacecom/backend:sha-abc1234@sha256:a1b2c3d4...  # tag + digest
  worker-sim:
    image: ghcr.io/spacecom/worker:sha-abc1234@sha256:e5f6a7b8...

make update-image-digests script (run by CI after build-and-push): queries GHCR for the digest of each newly pushed image and patches docker-compose.yml via sed. The patched file is committed back to the release branch as a separate commit.
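A sketch of the digest-patching step, assuming a simple regex rewrite of image: lines (the real script resolves digests from GHCR after build-and-push; the function name is illustrative):

```python
import re

def pin_digest(compose_text: str, image_repo: str, tag: str, digest: str) -> str:
    """Rewrite an `image:` line for image_repo to the tag + digest form (sketch).

    Matches both plain-tag and already-digest-pinned references, so re-running
    after each build replaces the previous pin.
    """
    pattern = re.compile(
        rf"image:\s*{re.escape(image_repo)}:[^\s@]+(@sha256:[0-9a-f.]+)?"
    )
    return pattern.sub(f"image: {image_repo}:{tag}@{digest}", compose_text)
```

Applying it per service keeps each docker-compose.yml entry pinned to exactly one build.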

GHCR image retention policy (F4 — §59):

| Image type | Tag pattern | Retention |
|---|---|---|
| Release images | sha-<commit> on tagged release | Indefinite |
| Staging images | sha-<commit> on main push | 30 days |
| Dev branch images | sha-<commit> on PR branch | 7 days |
| Build cache manifests | buildcache | Overwritten each build; no accumulation |
| Untagged images (orphaned layers) | — | Purged weekly via GHCR lifecycle policy |

GHCR lifecycle policy is configured via the GitHub repository settings (Packages → Manage versions). The policy is documented in docs/runbooks/image-lifecycle.md and reviewed quarterly alongside the secrets audit.


30.3 Environment Variable Contract

All environment variables are documented in .env.example. Variables are grouped by category and stage:

| Variable | Required | Stage | Description |
|---|---|---|---|
| SPACETRACK_USERNAME | Yes | All | Space-Track.org account email |
| SPACETRACK_PASSWORD | Yes | All | Space-Track.org password |
| JWT_PRIVATE_KEY_PATH | Yes | All | Path to RS256 PEM private key |
| JWT_PUBLIC_KEY_PATH | Yes | All | Path to RS256 PEM public key |
| JWT_PUBLIC_KEY_NEW_PATH | No | Rotation only | Second public key during keypair rotation window |
| POSTGRES_PASSWORD | Yes | All | TimescaleDB password |
| REDIS_BACKEND_PASSWORD | Yes | All | Redis ACL password for spacecom_backend user (full keyspace access) |
| REDIS_WORKER_PASSWORD | Yes | All | Redis ACL password for spacecom_worker user (Celery namespaces only) |
| REDIS_INGEST_PASSWORD | Yes | All | Redis ACL password for spacecom_ingest user (Celery namespaces only) |
| MINIO_ACCESS_KEY | Yes | All | MinIO access key |
| MINIO_SECRET_KEY | Yes | All | MinIO secret key |
| HMAC_SECRET | Yes | All | Prediction signing key (rotate per §26.9 procedure) |
| ENVIRONMENT | Yes | All | development / staging / production |
| DEPLOY_CHECK_SECRET | Yes | Staging/Prod | Read-only CI/CD gate credential |
| SENTRY_DSN | No | Staging/Prod | Error reporting DSN |
| PAGERDUTY_ROUTING_KEY | No | Prod only | AlertManager → PagerDuty routing key |
| VAULT_ADDR | No | Phase 3 | HashiCorp Vault address |
| VAULT_TOKEN | No | Phase 3 | Vault authentication token |
| DISABLE_SIMULATION_DURING_ACTIVE_EVENTS | No | All | Org-level simulation block; default false |
| OPS_ROOM_SUPPRESS_MINUTES | No | All | Alert audio suppression window; default 0 |

CI validates that .env.example is up-to-date by checking that every variable referenced in the codebase (os.getenv(...), settings.*) has an entry in .env.example. Missing entries fail CI.
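A sketch of that completeness check as a pure function (the name `missing_env_entries` is an assumption; the real step also resolves `settings.*` attributes, which this sketch omits):

```python
import re

# Matches os.getenv("NAME") and os.environ.get("NAME") references in source text
GETENV_RE = re.compile(r"""os\.(?:getenv|environ\.get)\(\s*["']([A-Z0-9_]+)["']""")
# Matches documented variables at line start in .env.example (NAME=value)
ENV_LINE_RE = re.compile(r"^([A-Z0-9_]+)=", re.MULTILINE)

def missing_env_entries(source_texts: list[str], env_example: str) -> set[str]:
    """Return variables referenced in code but absent from .env.example."""
    referenced = {name for text in source_texts for name in GETENV_RE.findall(text)}
    documented = set(ENV_LINE_RE.findall(env_example))
    return referenced - documented
```

A non-empty return value fails the CI job with the missing names listed.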

CI secrets register (F3 — §59): GitHub Actions secrets are audited quarterly. The following table is the authoritative register — any secret not in this table must not exist in the repository settings.

| Secret name | Environment | Owner | Rotation schedule | What breaks if leaked |
|---|---|---|---|---|
| GITHUB_TOKEN | All | GitHub-managed (OIDC) | Per-job (automatic) | GHCR push access |
| DEPLOY_CHECK_SECRET | Staging, Production | Engineering lead | 90 days | CI can skip alert gate |
| STAGING_SSH_KEY | Staging | Engineering lead | 180 days | Staging server access |
| PRODUCTION_SSH_KEY | Production | Engineering lead + 1 | 90 days | Production server access |
| SPACETRACK_USERNAME_STAGING | Staging | DevOps | On offboarding | Space-Track ingest |
| SPACETRACK_PASSWORD_STAGING | Staging | DevOps | 90 days | Space-Track ingest |
| SENTRY_DSN | Staging, Production | DevOps | On rotation | Error reporting only |
| PAGERDUTY_ROUTING_KEY | Production | Engineering lead | On rotation | On-call alerting |

Rotation procedure: use gh secret set <NAME> --env <ENV> from a local machine; never paste secrets into PR descriptions or issue comments. Quarterly audit: gh secret list --env production output reviewed by engineering lead; any unrecognised secret triggers a security review.


30.4 Staging Environment Specification

Staging is a Tier 2 deployment (single-host Docker Compose) running continuously on a dedicated server or cloud VM.

Data policy: Staging never holds production data. On weekly reset (make clean && make seed), the database is wiped and synthetic fixtures are loaded. Synthetic fixtures include:

  • 50 tracked objects with pre-computed TLE histories
  • 5 synthetic TIP events across the test FIR set
  • 3 synthetic CRITICAL alert events at various acknowledgement states
  • 2 shadow mode test organisations

Credential policy: Staging uses a separate Space-Track account (if available) or rate-limited credentials. JWT keypairs, HMAC secrets, and MinIO keys are all distinct from production. Staging credentials are stored in GitHub Actions environment secrets, not in the production Vault.

OWASP ZAP integration:

# .github/workflows/ci.yml (post-staging-deploy step)
- name: OWASP ZAP baseline scan
  uses: zaproxy/action-baseline@v0.11.0
  with:
    target: 'https://staging.spacecom.io'
    rules_file_name: '.zap/rules.tsv'
    fail_action: true

ZAP results are uploaded as GitHub Actions artefacts and must be reviewed before production deploy approval is granted in Phase 2+.


30.5 CI Observability

Build duration: Each GitHub Actions job reports duration to a summary table. A Grafana dashboard (CI Health) tracks p50/p95 job durations over time. Alert if any job's p95 duration increases > 2× week-over-week.

Image size delta: The build-and-push job posts a PR comment with the compressed image size delta versus the previous main build:

Backend image: 187 MB → 192 MB (+2.7%) ✅
Worker image: 203 MB → 289 MB (+42.4%) ⚠️ Investigate before merge

If any image grows > 20% in a single PR, CI posts a warning. If any image exceeds the tier limits below, CI fails:

| Image | Max size (compressed) |
|---|---|
| backend | 300 MB |
| worker | 350 MB |
| frontend | 200 MB |
| renderer | 500 MB (Chromium) |
| ingest | 250 MB |
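The growth warning and the hard tier limits can be combined into a single check; this sketch assumes sizes have already been extracted from the registry in MB (function name illustrative):

```python
# Hard per-image limits (compressed MB) and the single-PR growth warning threshold
TIER_LIMITS_MB = {"backend": 300, "worker": 350, "frontend": 200, "renderer": 500, "ingest": 250}
GROWTH_WARN_RATIO = 0.20   # warn above +20% growth in a single PR

def check_image_sizes(
    current_mb: dict[str, float], previous_mb: dict[str, float]
) -> tuple[list[str], list[str]]:
    """Return (warnings, failures). Failures fail CI; warnings become PR comments."""
    warnings, failures = [], []
    for image, size in current_mb.items():
        prev = previous_mb.get(image)
        if prev and (size - prev) / prev > GROWTH_WARN_RATIO:
            warnings.append(f"{image}: {prev:.0f} MB -> {size:.0f} MB (+{(size - prev) / prev:.1%})")
        limit = TIER_LIMITS_MB.get(image)
        if limit is not None and size > limit:
            failures.append(f"{image}: {size:.0f} MB exceeds {limit} MB tier limit")
    return warnings, failures
```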

Test failure rate: GitHub Actions test reports (JUnit XML output from pytest and vitest) are stored as artefacts. A weekly CI health review checks for flaky tests (passing < 90% of the time) and schedules them for investigation.
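The flaky-test criterion (passing < 90% of the time across runs that both passed and failed) reduces to a pure function; names are assumed, and the real sweep parses the stored JUnit XML artefacts first:

```python
def flaky_tests(history: dict[str, list[bool]], threshold: float = 0.90) -> set[str]:
    """Tests with mixed outcomes and a pass rate below threshold.
    `history` maps test id -> pass/fail outcomes across recent CI runs.
    Always-failing tests are excluded: those are broken, not flaky."""
    return {
        test for test, runs in history.items()
        if runs and any(runs) and not all(runs) and sum(runs) / len(runs) < threshold
    }
```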


30.6 DevOps Decision Log

| Decision | Chosen | Rationale |
|---|---|---|
| CI/CD orchestration | GitHub Actions | Project is GitHub-native; OIDC → GHCR eliminates long-lived registry credentials; matrix builds supported |
| Container registry | GHCR | Co-located with source; free for this repo; cosign attestation support |
| Image tagging | sha-<commit> canonical; version alias on release tags; latest forbidden | latest is mutable; the sha tag gives exact source traceability |
| Multi-stage builds | Builder + distroless/slim runtime for all services | 60–80% image size reduction; eliminates compiler/build tools from production attack surface |
| Hot-reload strategy | docker-compose.override.yml with bind-mounted source volumes | < 1 s reload vs. 30–90 s container rebuild; override file not committed to CI |
| Local task runner | make | Universally available, no extra install; self-documenting targets; shell-level DX standard |
| Pre-commit stack | 6 hooks: detect-secrets + ruff + mypy + hadolint + prettier + sqlfluff | Each addresses a distinct failure mode; hooks run in CI to enforce for engineers who skip local install |
| Staging data | Synthetic fixtures only; weekly reset | Production data in staging creates GDPR complexity; synthetic data is sufficient for integration testing |
| Secrets rotation | Zero-downtime per-secret runbook; HMAC rotation requires batch re-sign migration | Aviation context: rotation cannot cause service interruption; HMAC is special-cased due to signed-data dependency |
| HMAC key rotation | Requires batch re-sign of all existing predictions; engineering lead approval required | All existing HMAC signatures become invalid on key change; silent re-sign is safer than mass verification failures |

30.7 GitHub Actions CI Workflow Specification (F1, F5, F8, F10 - §59)

The CI pipeline must enforce a strict job dependency graph. Jobs that do not declare needs: run in parallel by default — this is incorrect for a safety-critical pipeline where a failed test must prevent a build reaching production.

Canonical job dependency graph:

lint ──┬── test-backend ───┬── security-scan ──── build-and-push ──── deploy-staging ──── deploy-production
       ├── test-frontend ──┤                                               ↑ (auto)          ↑ (manual gate)
       └── migration-gate ─┘   (migration-gate runs only when migrations/ changed)

.github/workflows/ci.yml (abbreviated — full spec below):

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pre-commit
          key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
      - run: pip install pre-commit
      - run: pre-commit run --all-files   # F6 §59: enforce hooks in CI

  test-backend:
    needs: [lint]
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4   # F10 §59: pip wheel cache
        with:
          path: ~/.cache/pip
          key: pip-${{ hashFiles('backend/requirements.txt') }}
      - run: pip install -r backend/requirements.txt
      - run: pytest -m safety_critical --tb=short -q   # fast safety gate first
      - run: pytest --cov=backend --cov-fail-under=80

  test-frontend:
    needs: [lint]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - uses: actions/cache@v4   # F10 §59: npm cache
        with:
          path: ~/.npm
          key: npm-${{ hashFiles('frontend/package-lock.json') }}
      - run: npm ci --prefix frontend
      - run: npm run test --prefix frontend

  migration-gate:              # F11 §59: migration reversibility + timing gate
    needs: [lint]
    if: contains(join(github.event.commits.*.modified, ','), 'migrations/')   # object filter + join: substring match over changed paths
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
    steps:
      - uses: actions/checkout@v4
      - run: pip install alembic psycopg2-binary
      - name: Forward migration (timed)
        run: |
          START=$(date +%s)
          alembic upgrade head
          END=$(date +%s)
          ELAPSED=$((END - START))
          echo "Migration took ${ELAPSED}s"
          if [ "$ELAPSED" -gt 30 ]; then
            echo "::error::Migration took ${ELAPSED}s > 30s budget — requires review"
            exit 1
          fi
      - name: Reverse migration (reversibility check)
        run: alembic downgrade -1
      - name: Model/migration sync check
        run: alembic check

  security-scan:
    needs: [test-backend, test-frontend, migration-gate]
    if: ${{ !cancelled() && !failure() }}   # run even when conditional migration-gate is skipped
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install bandit && bandit -r backend/app -ll
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm audit --prefix frontend --audit-level=high
      # The new image does not exist until build-and-push, and a mutable `latest`
      # tag would contradict the tagging policy — so scan the repo filesystem here;
      # the pushed image is covered by the signed SBOM in build-and-push.
      - name: Trivy vulnerability scan (filesystem)
        uses: aquasecurity/trivy-action@0.24.0
        with:
          scan-type: fs
          scan-ref: .
          severity: CRITICAL,HIGH
          exit-code: '1'

  build-and-push:
    needs: [security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions: { contents: read, packages: write, id-token: write }
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}   # OIDC — no long-lived token
      - name: Build and push (with layer cache)   # F10 §59
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache
          cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache,mode=max
      - name: Sign image with cosign (F5 §59)
        uses: sigstore/cosign-installer@v3
      - run: |
          cosign sign --yes \
            ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
      - name: Generate SBOM and attach (F5 §59)
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          upload-artifact: true

  deploy-staging:
    needs: [build-and-push]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://staging.spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — deploy blocked. Override with workflow_dispatch."
            exit 1
          fi
      - name: SSH deploy to staging
        run: |
          ssh deploy@staging.spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production   # GitHub protected environment with required approvers - manual gate
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — production deploy blocked."
            exit 1
          fi
      - name: SSH deploy to production
        run: |
          ssh deploy@spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"

/api/v1/readyz alert gate field (F8 — §59): The existing GET /readyz response is extended with an alert_gate field:

# Returns "clear" | "blocked"
alert_gate = "blocked" if db.query(AlertEvent).filter(
    AlertEvent.level.in_(["CRITICAL", "HIGH"]),
    AlertEvent.acknowledged_at.is_(None),
    AlertEvent.organisation_id != INTERNAL_ORG_ID,  # internal test alerts don't block deploys
).count() > 0 else "clear"

Emergency deploy override: use workflow_dispatch with input override_alert_gate: true — requires two approvals in the GitHub production environment. All overrides are logged to security_logs with event_type = DEPLOY_ALERT_GATE_OVERRIDE.


30.8 Configuration Management of Safety-Critical Artefacts (F7 — §61)

EUROCAE ED-153 / DO-278A §10 requires that safety-critical software and its associated artefacts are placed under configuration management. This extends beyond the code itself to include requirements, test cases, design documents, and safety evidence.

Policy document: docs/safety/CM_POLICY.md

Artefacts under CM:

| Artefact | Location | CM control |
|---|---|---|
| SAL-2 source files (physics/, alerts/, integrity/, czml/) | Git main branch | Signed commits required; CODEOWNERS enforcement; no direct push to main |
| Hazard Log | docs/safety/HAZARD_LOG.md | Git-tracked; changes require safety case custodian sign-off (CODEOWNERS rule) |
| Safety Case | docs/safety/SAFETY_CASE.md | Git-tracked; changes require safety case custodian sign-off |
| SAL Assignment | docs/safety/SAL_ASSIGNMENT.md | Git-tracked; changes require safety case custodian sign-off |
| Means of Compliance | docs/safety/MEANS_OF_COMPLIANCE.md | Git-tracked; changes require safety case custodian sign-off |
| Verification Independence Policy | docs/safety/VERIFICATION_INDEPENDENCE.md | Git-tracked |
| Test plan (safety-critical markers) | docs/TEST_PLAN.md | Git-tracked; safety_critical marker additions/removals reviewed in PR |
| Reference validation data | docs/validation/reference-data/ | Git-tracked; immutable once committed (SHA verified in CI) |
| Accuracy Characterisation | docs/validation/ACCURACY_CHARACTERISATION.md | Git-tracked; Phase 3 deliverable |
| ANSP SMS Guide | docs/safety/ANSP_SMS_GUIDE.md | Git-tracked |
| Release artefacts (SBOM, Trivy report, cosign signature) | GHCR + MinIO safety archive | Tagged per release; 7-year retention |

Release tagging for safety artefacts:

Every production release (vMAJOR.MINOR.PATCH) creates a Git tag that captures:

# scripts/tag-safety-release.sh
VERSION=$1
git tag -a "$VERSION" -m "Release $VERSION — safety artefacts frozen at this tag"
# Attach safety snapshot to the release
gh release create "$VERSION" \
  docs/safety/SAFETY_CASE.md \
  docs/safety/HAZARD_LOG.md \
  docs/safety/SAL_ASSIGNMENT.md \
  docs/safety/MEANS_OF_COMPLIANCE.md \
  --title "SpaceCom $VERSION" \
  --notes "Safety artefacts attached. See CHANGELOG.md for changes."

Signed commits for SAL-2 paths: backend/app/physics/, backend/app/alerts/, backend/app/integrity/, backend/app/czml/ require GPG-signed commits. Branch protection rule: require_signed_commits: true on main. This provides non-repudiation for safety-critical code changes.

CODEOWNERS additions:

# .github/CODEOWNERS
# Safety artefacts — require safety case custodian review
/docs/safety/              @safety-custodian
/docs/validation/          @safety-custodian

Configuration baseline: At each ANSP deployment, a configuration baseline is recorded in legal/ANSP_DEPLOYMENT_REGISTER.md:

  • SpaceCom version deployed (Git tag)
  • Commit SHA
  • SBOM hash
  • Safety case version
  • SAL assignment version
  • Deployment jurisdiction and date

This baseline is the reference for any subsequent regulatory audit or safety occurrence investigation.


31. Interoperability / Systems Integration

31.1 External Data Source Contracts

For each inbound data source, the integration contract must be explicit. Implicit assumptions about format are the most common source of silent ingest failures.

31.1.1 Space-Track.org

Endpoints consumed:

| Data | Endpoint | Format | Baseline interval | Active TIP interval |
|---|---|---|---|---|
| TLE catalog | /basicspacedata/query/class/gp/DECAY_DATE/null-val/orderby/NORAD_CAT_ID asc/format/json | JSON array | Every 6h | Every 6h (unchanged) |
| CDMs | /basicspacedata/query/class/cdm_public/format/json | JSON array | Every 2h | Every 30min |
| TIP messages | /basicspacedata/query/class/tip/format/json | JSON array | Every 30min | Every 5min |
| Object catalog | /basicspacedata/query/class/satcat/format/json | JSON array | Daily | Daily |

Adaptive polling: When spacecom_active_tip_events > 0 (any object with predicted re-entry within 6 hours), the Celery Beat schedule dynamically switches TIP polling to 5-minute intervals and CDM polling to 30-minute intervals. This is implemented via redbeat schedule overrides, not by running additional tasks — the existing Beat entry's run_every is updated in Redis. When all TIP events clear, intervals revert to baseline.
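The interval switch reduces to a pure selection rule; a sketch with assumed constant names (the real implementation writes the chosen run_every values into the redbeat entries in Redis):

```python
from datetime import timedelta

# Baseline vs. active-TIP polling intervals from the table above; TLE is unchanged
BASELINE = {"tip": timedelta(minutes=30), "cdm": timedelta(hours=2), "tle": timedelta(hours=6)}
ACTIVE_TIP = {"tip": timedelta(minutes=5), "cdm": timedelta(minutes=30), "tle": timedelta(hours=6)}

def polling_intervals(active_tip_events: int) -> dict[str, timedelta]:
    """Select Beat run_every values; intervals revert when all TIP events clear."""
    return ACTIVE_TIP if active_tip_events > 0 else BASELINE
```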

Space-Track request budget (600 requests/day):

Space-Track enforces a 600 requests/day limit per account. Budget must be tracked and protected:

# ingest/budget.py
DAILY_REQUEST_BUDGET = 600
BUDGET_ALERT_THRESHOLD = 0.80   # alert at 80% consumed

class SpaceTrackBudget:
    """Redis counter tracking daily Space-Track API requests. Resets at midnight UTC."""

    def __init__(self, redis_client):
        self._redis = redis_client

    @property
    def _key(self) -> str:
        # Recomputed per call so a long-lived instance rolls over at midnight UTC
        return f"spacetrack:budget:{date.today().isoformat()}"

    def consume(self, n: int = 1) -> bool:
        """Deduct n requests. Raises SpaceTrackBudgetExhausted once the daily
        budget is exceeded; logs a warning at 80% consumption."""
        current = self._redis.incrby(self._key, n)
        self._redis.expireat(self._key, self._next_midnight())
        if current > DAILY_REQUEST_BUDGET:
            raise SpaceTrackBudgetExhausted(f"Daily budget exhausted ({current}/{DAILY_REQUEST_BUDGET})")
        if current / DAILY_REQUEST_BUDGET >= BUDGET_ALERT_THRESHOLD:
            structlog.get_logger().warning(
                "spacetrack_budget_warning",
                consumed=current, budget=DAILY_REQUEST_BUDGET,
            )
        return True

    def remaining(self) -> int:
        return max(0, DAILY_REQUEST_BUDGET - int(self._redis.get(self._key) or 0))

Prometheus gauge: spacecom_spacetrack_budget_remaining — alert at < 100 remaining requests.

Exponential backoff and circuit breaker:

# ingest/tasks.py
@app.task(
    bind=True,
    autoretry_for=(SpaceTrackError, httpx.TimeoutException, httpx.ConnectError),
    retry_backoff=True,       # 2s, 4s, 8s, 16s, 32s ...
    retry_backoff_max=3600,   # cap at 1 hour
    retry_jitter=True,        # ±20% jitter per retry
    max_retries=5,            # task → DLQ on 6th failure
    acks_late=True,
)
def ingest_tle_catalog(self):
    if not circuit_breaker.is_closed("spacetrack"):
        raise SpaceTrackCircuitOpen("Circuit open — Space-Track unreachable")
    try:
        budget.consume(1)
        result = spacetrack_client.fetch_tle_catalog()
        circuit_breaker.record_success("spacetrack")
        return result
    except (SpaceTrackError, httpx.TimeoutException) as exc:
        circuit_breaker.record_failure("spacetrack")
        raise self.retry(exc=exc)

Circuit breaker config: open after 3 consecutive failures; half-open after 30 minutes; close after 1 successful probe. Implemented via pybreaker or equivalent. State stored in Redis for cross-worker visibility.
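A minimal in-memory sketch of that breaker policy (production keeps the state in Redis, as noted above; the method names match the call sites in ingest_tle_catalog):

```python
import time

class CircuitBreaker:
    """Sketch of the stated policy: open after 3 consecutive failures,
    half-open after 30 minutes, close again after 1 successful probe."""

    FAILURE_THRESHOLD = 3
    COOL_DOWN_S = 30 * 60

    def __init__(self, clock=time.monotonic):
        self._clock = clock                      # injectable clock for testing
        self._failures = 0
        self._opened_at: float | None = None

    def is_closed(self, _source: str = "spacetrack") -> bool:
        if self._opened_at is None:
            return True
        # Half-open: allow a probe once the cool-down has elapsed; a failed
        # probe re-opens via record_failure (failures are already at threshold)
        return self._clock() - self._opened_at >= self.COOL_DOWN_S

    def record_success(self, _source: str = "spacetrack") -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self, _source: str = "spacetrack") -> None:
        self._failures += 1
        if self._failures >= self.FAILURE_THRESHOLD:
            self._opened_at = self._clock()
```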

Session expiry handling:

Space-Track uses cookie-based sessions that expire after ~2 hours of inactivity. A 6-hour TLE poll interval guarantees session expiry between polls. The spacetrack library must be configured to re-authenticate transparently on 401/403:

# ingest/spacetrack.py
class SpaceTrackClient:
    def __init__(self):
        self._session_valid_until: datetime | None = None
        self._SESSION_TTL = timedelta(hours=1, minutes=45)  # conservative re-auth before expiry

    async def _ensure_authenticated(self):
        if self._session_valid_until is None or datetime.utcnow() >= self._session_valid_until:
            await self._authenticate()
            self._session_valid_until = datetime.utcnow() + self._SESSION_TTL
            spacecom_ingest_session_reauth_total.labels(source="spacetrack").inc()

    async def fetch_tle_catalog(self):
        await self._ensure_authenticated()
        # ... fetch logic

Metric spacecom_ingest_session_reauth_total{source="spacetrack"} distinguishes routine re-auth from genuine authentication failures. An alert fires if reauth_total increments more than once per hour (indicates session instability, not normal expiry).

Contract test (asserts on every CI run against a live Space-Track response):

def test_spacetrack_tle_schema(spacetrack_client):
    response = spacetrack_client.query("gp", limit=1)
    required_keys = {"NORAD_CAT_ID", "TLE_LINE1", "TLE_LINE2", "EPOCH", "BSTAR", "OBJECT_NAME"}
    assert required_keys.issubset(response[0].keys()), f"Missing keys: {required_keys - response[0].keys()}"

Failure alerting: spacecom_ingest_success_total{source="spacetrack"} counter. AlertManager rules:

  • Baseline: if counter does not increment for 4 consecutive hours during expected polling windows → CRITICAL INGEST_SOURCE_FAILURE alert.
  • Active TIP window: if spacecom_ingest_success_total{source="spacetrack", type="tip"} does not increment for > 10 minutes when spacecom_active_tip_events > 0 → immediate L1 page (bypasses standard 4h threshold).

31.1.2 NOAA SWPC Space Weather

All endpoints are hardcoded constants in ingest/sources.py. Format is JSON for all P1 endpoints.

# ingest/sources.py
NOAA_F107_URL      = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"
NOAA_KP_URL        = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"
NOAA_DST_URL       = "https://services.swpc.noaa.gov/json/geomag/dst/index.json"
NOAA_FORECAST_URL  = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"
ESA_SWS_KP_URL     = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"

Nowcast vs. forecast distinction: NRLMSISE-00 decay predictions spanning hours to days require different F10.7/Ap inputs depending on the prediction horizon. These must be stored separately and selected by the decay predictor at query time:

-- space_weather table: forecast_horizon_hours column required
-- Nullable by design: NULL is the sentinel for the 81-day F10.7 average
ALTER TABLE space_weather ADD COLUMN forecast_horizon_hours INTEGER DEFAULT 0;
-- 0 = nowcast (observed); 24/48/72 = NOAA 3-day forecast horizon; NULL = 81-day average
COMMENT ON COLUMN space_weather.forecast_horizon_hours IS
  '0=nowcast; 24/48/72=NOAA 3-day forecast; NULL=81-day F10.7 average for long-horizon use';

Decay predictor input selection rule (documented in model card and decay.py):

| Prediction horizon | F10.7 source | Ap source |
|---|---|---|
| t < 6h | Nowcast (horizon=0) | Nowcast (horizon=0) |
| 6h ≤ t < 72h | NOAA 3-day forecast (horizon=24/48/72) | NOAA 3-day forecast |
| t ≥ 72h | 81-day F10.7 average (horizon=NULL) | Storm-aware climatological Ap |

Beyond 72h: the NOAA forecast expires. The model uses the 81-day F10.7 average (a standard NRLMSISE-00 input) and the long-range uncertainty is reflected in wider Monte Carlo corridor bounds. This is documented in the model card under "Space Weather Input Uncertainty Beyond 72h".
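The selection table reduces to a three-branch rule; a sketch with assumed label strings (the authoritative mapping lives in the model card and decay.py):

```python
def space_weather_inputs(horizon_hours: float) -> dict[str, str]:
    """Select F10.7/Ap input sources by prediction horizon (per the table above)."""
    if horizon_hours < 6:
        return {"f107": "nowcast", "ap": "nowcast", "horizon": "0"}
    if horizon_hours < 72:
        return {"f107": "noaa_3day_forecast", "ap": "noaa_3day_forecast", "horizon": "24/48/72"}
    # Beyond 72h the NOAA forecast has expired: fall back to climatological inputs
    return {"f107": "81_day_average", "ap": "climatological_storm_aware", "horizon": "NULL"}
```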

ESA SWS Kp cross-validation decision rule: ESA SWS Kp is a cross-validation source, not a fallback. A decision rule is required when NOAA and ESA values diverge — without one, the cross-validation is observational only:

# ingest/space_weather.py
NOAA_ESA_KP_DIVERGENCE_THRESHOLD = 2.0  # Kp units; ADR-0018

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    """Select Kp value for NRLMSISE-00 input. Conservative-high on divergence."""
    divergence = abs(noaa_kp - esa_kp)
    if divergence > NOAA_ESA_KP_DIVERGENCE_THRESHOLD:
        structlog.get_logger().warning(
            "kp_source_divergence",
            noaa_kp=noaa_kp, esa_kp=esa_kp, divergence=divergence,
        )
        spacecom_kp_divergence_events_total.inc()
        # Conservative: higher Kp → denser atmosphere → shorter predicted lifetime → earlier alerting
        return max(noaa_kp, esa_kp)
    return noaa_kp   # NOAA is primary source

The threshold (2.0 Kp) and the conservative-high selection policy are documented in docs/adr/0018-kp-source-arbitration.md and reviewed by the physics lead. The spacecom_kp_divergence_events_total counter is monitored; a sustained rate of divergence warrants investigation of source calibration.

Schema contract test (CI):

def test_noaa_kp_schema(noaa_client):
    response = noaa_client.get_kp()
    assert isinstance(response, list) and len(response) > 0
    assert {"time_tag", "kp_index"}.issubset(response[0].keys())

def test_space_weather_forecast_horizon_stored(db_session):
    """Verify nowcast and forecast rows are stored with distinct horizon values."""
    nowcast = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=0).first()
    forecast_72 = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=72).first()
    assert nowcast is not None, "Nowcast row missing"
    assert forecast_72 is not None, "72h forecast row missing"

31.1.3 FIR Boundary Data

Source: EUROCONTROL AIRAC dataset (primary for ECAC states); FAA Digital-Terminal Procedures Publication (US); OpenAIP (fallback for non-AIRAC regions).

Format: GeoJSON FeatureCollection with properties.icao_id (FIR ICAO designator) and properties.name.

Update procedure (runs on each 28-day AIRAC cycle):

  1. Download new AIRAC dataset from EUROCONTROL (subscription required; credentials in secrets manager)
  2. Convert to GeoJSON via ingest/fir_loader.py
  3. Compare new boundaries against current airspace table; log added/removed/changed FIRs to security_logs type AIRSPACE_UPDATE
  4. Stage new boundaries in airspace_staging table; run intersection regression test against 10 known prediction corridors
  5. If regression passes: swap airspace and airspace_staging in a single transaction
  6. Record update in airspace_metadata table: airac_cycle, record_count, updated_at, updated_by

airspace_metadata table:

CREATE TABLE airspace_metadata (
  id SERIAL PRIMARY KEY,
  airac_cycle TEXT NOT NULL,       -- e.g. "2026-03"
  effective_date DATE NOT NULL,
  expiry_date DATE NOT NULL,       -- effective_date + 28 days; used for staleness detection
  record_count INTEGER NOT NULL,
  source TEXT NOT NULL,            -- 'eurocontrol' | 'faa' | 'openaip'
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  updated_by TEXT NOT NULL
);

AIRAC staleness detection: The AIRAC update procedure is manual — there is no automated mechanism to trigger it. Without monitoring, a missed cycle goes undetected for up to 28 days.

Required additions:

  1. Prometheus gauge: spacecom_airspace_airac_age_days = EXTRACT(EPOCH FROM NOW() - MAX(effective_date)) / 86400 from airspace_metadata. Alert rule:
- alert: AIRACAirspaceStale
  expr: spacecom_airspace_airac_age_days > 29
  for: 1h
  labels:
    severity: warning
  annotations:
    runbook_url: "https://spacecom.internal/docs/runbooks/fir-update.md"
    summary: "FIR boundary data is {{ $value }} days old — AIRAC cycle may be missed"
  2. GET /readyz integration: "airspace_stale" is added to the degraded array when airac_age_days > 28 (already incorporated into §26.5 readyz check above).

  3. FIR update runbook (docs/runbooks/fir-update.md) is a Phase 1 deliverable — it must exist before shadow deployment. Add to the Phase 1 DoD runbook checklist alongside secrets-rotation-jwt.md.

31.1.4 TLE Validation Gate

Before any TLE record is written to the database, ingest/cross_validator.py enforces:

def validate_tle(line1: str, line2: str) -> TLEValidationResult:
    errors = []
    if len(line1) != 69:
        errors.append(f"Line 1 length {len(line1)} != 69")
    if len(line2) != 69:
        errors.append(f"Line 2 length {len(line2)} != 69")
    if not _tle_checksum_valid(line1):
        errors.append("Line 1 checksum failed")
    if not _tle_checksum_valid(line2):
        errors.append("Line 2 checksum failed")
    epoch = _parse_epoch(line1[18:32])
    if epoch is None:
        errors.append("Epoch field invalid")
    # BSTAR field (cols 54-61) is exponent-coded: "±MMMMM±E" means ±0.MMMMM x 10^±E,
    # so a plain float() on the raw field would raise ValueError
    raw_bstar = line1[53:61]
    bstar = float(f"{raw_bstar[0]}0.{raw_bstar[1:6]}e{raw_bstar[6:8]}".replace(' ', ''))
    # Perigee from line 2: eccentricity (cols 27-33, implied decimal) and mean motion (cols 53-63, rev/day)
    ecc = float(f"0.{line2[26:33]}")
    mean_motion_rad_s = float(line2[52:63]) * 2 * math.pi / 86400.0
    sma_km = (398600.4418 / mean_motion_rad_s ** 2) ** (1.0 / 3.0)   # GM_earth in km^3/s^2
    perigee_km = sma_km * (1 - ecc) - 6378.137                       # equatorial radius, km
    # Finding 10: BSTAR validation revised
    # Lower bound removed: valid high-density objects (e.g. tungsten sphere) have B* << 0.0001
    # Zero or negative B* is physically meaningless (negative drag) → hard reject
    if bstar <= 0.0:
        errors.append(f"BSTAR {bstar} is zero or negative — physically invalid")
    elif bstar > 0.5:
        # Physically implausible at altitude > 300 km; log warning but do not reject
        log_security_event("TLE_VALIDATION_WARNING", {
            "tle": [line1, line2], "reason": "HIGH_BSTAR", "bstar": bstar
        }, level="WARNING")
    # Hard reject only the impossible combination: very high drag at high altitude
    if bstar > 0.5 and perigee_km > 300:
        errors.append(f"BSTAR {bstar} implausible for perigee {perigee_km:.0f} km — high drag at high altitude")
    if errors:
        log_security_event("INGEST_VALIDATION_FAILURE", {"tle": [line1, line2], "errors": errors})
        return TLEValidationResult(valid=False, errors=errors)
    return TLEValidationResult(valid=True)

31.2 CCSDS Format Specifications

31.2.1 OEM (Orbit Ephemeris Message) — CCSDS 502.0-B-3

Emitted by GET /space/objects/{norad_id}/ephemeris when Accept: application/ccsds-oem.

Header keyword population:

| Keyword | Value | Source |
|---|---|---|
| CCSDS_OEM_VERS | 3.0 | Fixed |
| CREATION_DATE | ISO 8601 UTC timestamp | datetime.utcnow() |
| ORIGINATOR | SPACECOM | Fixed |
| OBJECT_NAME | objects.name | DB |
| OBJECT_ID | COSPAR designator if known; NORAD-<norad_id> otherwise | DB |
| CENTER_NAME | EARTH | Fixed |
| REF_FRAME | GCRF | Fixed — SpaceCom frame transform output |
| TIME_SYSTEM | UTC | Fixed |
| START_TIME | Query start parameter | Request |
| STOP_TIME | Query end parameter | Request |

Unknown fields: Any keyword for which SpaceCom holds no data is emitted as N/A per CCSDS 502.0-B-3 §4.1.

31.2.2 CDM (Conjunction Data Message) — CCSDS 508.0-B-1

Emitted by GET /space/export/bulk?format=ccsds-cdm.

Field population table (abbreviated):

| Field | Populated? | Source |
|---|---|---|
| CREATION_DATE | Yes | datetime.utcnow() |
| ORIGINATOR | Yes | SPACECOM |
| TCA | Yes | SpaceCom conjunction screener |
| MISS_DISTANCE | Yes | SpaceCom conjunction screener |
| COLLISION_PROBABILITY | Yes | SpaceCom Alfano Pc |
| COLLISION_PROBABILITY_METHOD | Yes | ALFANO-2005 |
| OBJ1/2 COVARIANCE_* | Conditional | From Space-Track CDM if available; N/A for debris without covariance |
| OBJ1/2 RECOMMENDED_OD_SPAN | No | N/A — SpaceCom does not hold OD span |
| OBJ1/2 SEDR | No | N/A |

CDM ingestion and Pc reconciliation: When a Space-Track CDM is ingested for an object that SpaceCom has also screened, both Pc values are stored:

  • conjunctions.pc_spacecom — SpaceCom Alfano result
  • conjunctions.pc_spacetrack — from ingested CDM
  • conjunctions.pc_discrepancy_flag — set TRUE when abs(log10(pc_spacecom/pc_spacetrack)) > 1 (order-of-magnitude difference)

The conjunction panel displays both values with their provenance labels. When pc_discrepancy_flag = TRUE, a DATA_CONFIDENCE warning callout is shown explaining possible causes (different epoch, different covariance source, different Pc method).
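The discrepancy flag itself is a one-line rule; a sketch (the handling of non-positive Pc values is an assumption, not specified above):

```python
import math

def pc_discrepancy_flag(pc_spacecom: float, pc_spacetrack: float) -> bool:
    """TRUE when the two Pc values differ by more than an order of magnitude."""
    if pc_spacecom <= 0 or pc_spacetrack <= 0:
        return True   # assumption: a zero/invalid Pc on either side warrants the callout
    return abs(math.log10(pc_spacecom / pc_spacetrack)) > 1
```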


31.2.3 RDM (Re-entry Data Message) — CCSDS 508.1-B-1

Emitted by GET /reentry/predictions/{prediction_id}/export?format=ccsds-rdm.

Planned population rules:

  • SpaceCom populates creation metadata, object identifiers, prediction provenance, prediction epoch, and the primary predicted re-entry time range from the active prediction record.
  • Where the active prediction carries prediction_conflict = TRUE, the export includes both the primary SpaceCom range and the conservative union range used for aviation-facing products, with explicit conflict provenance.
  • Corridor, fragment-cloud, and air-risk annotations are included only when supported by the active model version and marked with the model version identifier used to generate them.
  • Unknown optional fields are emitted as N/A rather than silently omitted, matching the CCSDS handling already used for OEM/CDM unknowns.
  • Raw upstream TIP or third-party reference messages are not overwritten; they remain separate provenance sources and are cross-referenced in the export metadata and audit trail.

31.3 WebSocket Event Reference

Full event type catalogue for WS /ws/events. All events share the envelope:

{
  "type": "alert.new",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "org_id": 7,
  "data": { ... }
}

Event type specifications:

alert.new
  data: {alert_id, level, norad_id, object_name, fir_ids[], predicted_reentry_utc, corridor_wkt}

alert.acknowledged
  data: {alert_id, acknowledged_by_name, note_preview (first 80 chars), acknowledged_at}

alert.superseded
  data: {old_alert_id, new_alert_id, reason}

prediction.updated
  data: {prediction_id, norad_id, p50_utc, p05_utc, p95_utc, supersedes_id (nullable), corridor_wkt}

tip.new
  data: {norad_id, object_name, tip_epoch, predicted_reentry_utc, source_label ("USSPACECOM TIP")}

ingest.status
  data: {source, status ("ok"|"failed"), record_count (nullable), next_run_at, failure_reason (nullable)}

spaceweather.change
  data: {old_status, new_status, kp, f107, recommended_buffer_hours}

resync_required
  data: {reason ("reconnect_too_stale"), last_known_seq}

Reconnection protocol:

  1. Client stores last received seq
  2. On reconnect: upgrade with ?since_seq=<last_seq>
  3. Server delivers all events with seq > last_seq from a 5-minute / 200-event ring buffer
  4. If the gap is too large: server sends {"type": "resync_required"}; client must call REST endpoints to re-fetch current state before resuming WebSocket consumption
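
The server-side replay decision in steps 3–4 can be sketched with an in-memory ring; this is illustrative only (the real implementation also enforces the 5-minute time bound and per-org filtering):

```python
from collections import deque

RING_SIZE = 200  # per the 5-minute / 200-event ring buffer above

class EventRing:
    """Illustrative replay buffer for the ?since_seq reconnect path."""

    def __init__(self):
        self.buffer = deque(maxlen=RING_SIZE)

    def publish(self, event: dict):
        self.buffer.append(event)

    def replay_since(self, last_seq: int) -> list:
        """Return events with seq > last_seq, or a resync sentinel when
        the oldest buffered event is already past the client's gap."""
        if self.buffer and self.buffer[0]["seq"] > last_seq + 1:
            return [{"type": "resync_required",
                     "data": {"reason": "reconnect_too_stale",
                              "last_known_seq": last_seq}}]
        return [e for e in self.buffer if e["seq"] > last_seq]
```

Because `seq` is monotonic, the gap test reduces to comparing the oldest buffered `seq` against `last_seq + 1`: if it is larger, at least one event has already been evicted and the client must resync via REST.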

Simulation/Replay isolation: During SIMULATION or REPLAY mode, the client is connected to WS /ws/simulation/{session_id} instead of WS /ws/events. No LIVE events are delivered while in a simulation session.


31.4 Alert Webhook Specification

Registration:

POST /api/v1/webhooks
Content-Type: application/json
Authorization: Bearer <admin_jwt>

{
  "url": "https://ansp-dispatch.example.com/spacecom/hook",
  "events": ["alert.new", "tip.new"],
  "secret": "webhook_shared_secret_min_32_chars"
}

Response includes webhook_id. The secret is encrypted at rest and is never displayed again after registration; it must remain recoverable server-side, since SpaceCom uses it to compute the HMAC delivery signatures below (a one-way hash such as bcrypt would make signing impossible).

Delivery:

POST https://ansp-dispatch.example.com/spacecom/hook
Content-Type: application/json
X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, raw_body)>
X-SpaceCom-Event: alert.new
X-SpaceCom-Delivery: <uuid>

{ "type": "alert.new", "seq": 1042, ... }

Receiver verification (example):

import hmac, hashlib

def verify_signature(secret: str, body: bytes, header_sig: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)

Retry and status lifecycle:

| State | Condition | Action |
|-------|-----------|--------|
| active | Deliveries succeeding | Normal operation |
| degraded | 3 consecutive delivery failures | Org admin notified by email; deliveries continue |
| disabled | 10 consecutive delivery failures | No further deliveries; manual re-enable via `PATCH /webhooks/{id}` required |
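
The lifecycle reduces to a small state machine keyed on consecutive failures; an illustrative sketch (class name and in-memory state are assumptions — production state lives in the database):

```python
# Thresholds from the lifecycle table: 3 consecutive failures -> degraded,
# 10 consecutive failures -> disabled. Any success resets the counter.
DEGRADED_AFTER = 3
DISABLED_AFTER = 10

class WebhookState:
    def __init__(self):
        self.consecutive_failures = 0
        self.status = "active"

    def record_delivery(self, success: bool):
        if self.status == "disabled":
            return  # no further deliveries until manual re-enable
        if success:
            self.consecutive_failures = 0
            self.status = "active"
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= DISABLED_AFTER:
            self.status = "disabled"   # requires PATCH /webhooks/{id}
        elif self.consecutive_failures >= DEGRADED_AFTER:
            self.status = "degraded"   # org admin emailed; keep delivering
```

A single successful delivery from the degraded state returns the webhook to active; only the disabled state is sticky.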

31.5 Interoperability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| ADS-B source | OpenSky Network REST API | Free, global, sufficient for Phase 3 route overlay; upgrade path to FAA SWIM ADS-B if coverage gaps emerge |
| CCSDS OEM reference frame | GCRF | SpaceCom frame transform pipeline output; downstream tools expect GCRF |
| CCSDS CDM unknown fields | N/A per CCSDS 508.0-B-1 §4.3 | Silent omission causes downstream parser failures; N/A is the standard sentinel |
| CDM Pc reconciliation | Both Space-Track CDM Pc and SpaceCom Pc displayed with provenance; discrepancy flag on order-of-magnitude difference | Transparency over false precision; operators need to see the discrepancy, not have SpaceCom silently override it |
| FIR update mechanism | Staging table swap + regression test on 28-day AIRAC cycle | Direct overwrite during a live TIP event would corrupt ongoing airspace intersection queries |
| WebSocket event schema | Typed envelope with `type` discriminator + monotonic `seq` | Enables typed client generation; `seq` enables reliable missed-event recovery |
| Webhook signature | HMAC-SHA256 with `sha256=` prefix (same convention as GitHub webhooks) | Operators will already know this pattern; reduces integration friction |
| SWIM integration timing | Phase 2: GeoJSON export; Phase 3: FIXM review + AMQP endpoint | Full SWIM-TI requires EUROCONTROL B2B account and FIXM extension work — not Phase 1/2 blocking |
| API versioning | `/api/v1` base; 6-month parallel support on breaking changes; RFC 8594 headers | Space operators need stable contracts; 6-month overlap is industry standard for operational API changes |
| Space weather format | JSON REST endpoints (not legacy ASCII FTP) | ASCII FTP format is brittle; NOAA SWPC JSON API is stable and machine-readable; contract test catches format changes |

32. Ethics / Algorithmic Accountability

SpaceCom makes algorithmic predictions that inform operational airspace decisions. False negatives are catastrophic; false positives cause economic disruption and erode operator trust. This section documents the accountability framework that governs how the prediction model is specified, validated, changed, and monitored.

Applicable frameworks: IEEE 7001-2021 (Transparency of Autonomous Systems), NIST AI RMF (Govern/Map/Measure/Manage), ICAO Safety Management (Annex 19), ECSS-Q-ST-80C (Software Product Assurance).


32.1 Decay Predictor Model Card

The model card is a living document maintained at docs/model-card-decay-predictor.md. It is a required artefact for ESA Phase 2 TRL demonstrations and ANSP SMS acceptance. It must be updated whenever the model version changes.

Required sections:

# Decay Predictor Model Card — SpaceCom v<X.Y.Z>

## Model summary
Numerical decay predictor using RK7(8) adaptive integrator + NRLMSISE-00 atmospheric
density model + J2–J6 geopotential + solar radiation pressure. Monte Carlo uncertainty
via 500-sample ensemble varying F10.7 (±20%), Ap, and B* (±10%).

## Validated orbital regime
- Perigee altitude: 100–600 km
- Inclination: 0–98°
- Object type: rocket bodies and payloads with RCS > 0.1 m²
- B* range: 0.0001–0.3
- Area-to-mass ratio: 0.005–0.04 m²/kg

## Known out-of-distribution inputs (ood_flag triggers)
| Parameter | OOD condition | Expected behaviour |
|-----------|--------------|-------------------|
| Area-to-mass ratio | > 0.04 m²/kg | Underestimates atmospheric drag; re-entry time predicted too late |
| data_confidence | 'unknown' | Physical properties estimated from object type defaults; wide systematic uncertainty |
| TLE count in history | < 5 TLEs in last 30 days | B* estimate unreliable; uncertainty may be significantly underestimated |
| Perigee altitude | < 100 km | Object may already be in final decay corridor; NRLMSISE-00 not calibrated below 100 km |

## Performance characterisation
(Updated from backcast validation report — see MinIO docs/backcast-validation-v<X>.pdf)

| Object category | N backcasts | p50 error (median) | p50 error (95th pct) | Corridor containment |
|----------------|-------------|-------------------|---------------------|---------------------|
| Rocket bodies, RCS > 2 m² | TBD | TBD | TBD | TBD |
| Payloads, RCS 0.5–2 m² | TBD | TBD | TBD | TBD |
| Small debris / unknown RCS | TBD (underrepresented) | TBD | TBD | TBD |

## Known systematic biases
- NRLMSISE-00 underestimates atmospheric density during geomagnetic storms at altitudes 200–350 km.
  Effect: predictions during Kp > 5 events tend to predict re-entry slightly later than observed.
  Mitigation: space weather buffer recommendation adds ≥2h beyond p95 during Elevated/Severe/Extreme conditions.
- Tumbling objects: effective drag area unknown; B* from TLEs reflects tumble-averaged drag.
  Effect: uncertainty may be systematically underestimated for highly elongated objects.
- Calibration data bias: validation events are dominated by large well-tracked objects from major launch
  programmes. Small debris and objects from less-tracked orbital regimes are underrepresented.

## Not intended for
- Objects with perigee < 100 km (already in terminal descent corridor)
- Crewed vehicles (use mission-specific tools)
- Objects undergoing active manoeuvring
- Predictions beyond 21 days (F10.7 forecast skill degrades sharply beyond 3 days)

32.2 Backcast Validation Requirements

Phase 1 minimum: ≥3 historical re-entries selected from The Aerospace Corporation observed re-entry database. Selection criteria documented.

Phase 2 target: ≥10 historical re-entries. The validation report (docs/backcast-validation-v<X>.pdf) must explicitly:

  1. Document selection criteria — which events were chosen and why. Selection must include at least one event from each of: rocket bodies, payloads, and at least one high-area-to-mass object if available.
  2. Flag underrepresented categories — explicitly state which object types have < 3 validation events and what the implication is for accuracy claims in those categories.
  3. State accuracy as conditional — not "p50 accuracy is ±2h" but "for rocket bodies (N=7): median p50 error is 1.8h; for payloads (N=3): median p50 error is 3.1h; for small debris (N=0): no validation data available."
  4. Include negative results — events where the p95 corridor did not contain the observed impact point must be included and analysed.
  5. Compare across model versions — each new validation report must include a comparison table against the previous version's results.

The validation report is generated by modules.feedback and stored in MinIO docs/ bucket with a version tag matching the model version.


32.3 Out-of-Distribution Detection

At prediction creation time, propagator/decay.py evaluates each input object against the OOD bounds defined in docs/ood-bounds.md and sets reentry_predictions.ood_flag and ood_reason accordingly.

OOD checks (initial set — update in docs/ood-bounds.md as model is validated):

def check_ood(obj: ObjectParams) -> tuple[bool, list[str]]:
    reasons = []
    if obj.area_to_mass_ratio is not None and obj.area_to_mass_ratio > 0.04:
        reasons.append("high_am_ratio")
    if obj.data_confidence == "unknown":
        reasons.append("low_data_confidence")
    if obj.tle_count_last_30d is not None and obj.tle_count_last_30d < 5:
        reasons.append("sparse_tle_history")
    if obj.perigee_km is not None and obj.perigee_km < 100:
        reasons.append("sub_100km_perigee")
    if obj.bstar is not None and not (0.0001 <= obj.bstar <= 0.3):
        reasons.append("bstar_out_of_range")
    return len(reasons) > 0, reasons

UI presentation when ood_flag = TRUE:

⚠ OUT-OF-CALIBRATION-RANGE PREDICTION
──────────────────────────────────────────────────────────────
This prediction uses inputs outside the model's validated range:
  • high_am_ratio — effective drag may be underestimated
  • low_data_confidence — physical properties estimated from defaults

Timing uncertainty may be significantly larger than shown.
For operational planning, treat the p95 window as a minimum bound.

[What does this mean? →]
──────────────────────────────────────────────────────────────

The callout is mandatory and non-dismissable. It appears above the prediction panel wherever the prediction is displayed. It does not prevent the prediction from being used — operators retain full autonomy.


32.4 Recalibration Governance

The modules.feedback pipeline computes atmospheric density scaling coefficients from observed re-entry outcomes recorded in prediction_outcomes. Updating these coefficients changes all future predictions.

Recalibration procedure:

  1. Trigger: Automated check in the feedback pipeline flags when the last 10 outcomes show a systematic bias (median p50 error > 1.5× the historical baseline).
  2. Candidate coefficients: New coefficients computed from the full prediction_outcomes history using a hold-out split (80% train / 20% hold-out). Hold-out set is fixed and never used in training.
  3. Validation gate: New coefficients must achieve:
    • ≥ 5% improvement in median p50 error on hold-out set
    • No regression (> 10% worsening) on any validated object type category
    • Corridor containment rate ≥ 95% on hold-out set
  4. Sign-off: Physics lead + engineering lead both must approve via PR review. PR includes the validation comparison table.
  5. Active prediction handling: Before deployment, a batch job re-runs all active predictions (status = active, not superseded) using the new coefficients. Each re-run creates a new prediction record linked via superseded_by. ANSPs with active shadow deployments receive an automated notification: "SpaceCom model recalibrated — active predictions updated. Previous predictions superseded. New model version: X.Y.Z."
  6. Rollback: If a post-deployment accuracy regression is detected, the previous coefficient set is restored via the same procedure (treated as a new recalibration). The rollback is logged to security_logs type MODEL_ROLLBACK.
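
The validation gate in step 3 can be encoded as a single check; an illustrative sketch (signature, units, and field names are assumptions):

```python
def passes_validation_gate(
    baseline_p50_error_h: float,
    candidate_p50_error_h: float,
    category_deltas: dict,   # per-category fractional change in p50 error (+ = worse)
    containment_rate: float, # corridor containment on the hold-out set
) -> bool:
    """Illustrative encoding of the three recalibration gate criteria."""
    improvement = (baseline_p50_error_h - candidate_p50_error_h) / baseline_p50_error_h
    if improvement < 0.05:   # >= 5% improvement in median p50 error required
        return False
    if any(delta > 0.10 for delta in category_deltas.values()):
        return False         # > 10% regression on some validated category
    return containment_rate >= 0.95

ok = passes_validation_gate(
    baseline_p50_error_h=2.0,
    candidate_p50_error_h=1.8,   # 10% improvement on hold-out
    category_deltas={"rocket_body": -0.10, "payload": 0.02},
    containment_rate=0.97,
)
```

All three criteria are conjunctive: a candidate coefficient set that improves the headline error but regresses any single object-type category still fails the gate.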

32.5 Model Version Governance

Version classification:

| Classification | Examples | Active prediction re-run? | ANSP notification required? |
|----------------|----------|---------------------------|------------------------------|
| Patch | Documentation update, logging improvement, no physics change | No | No |
| Minor | Performance improvement, OOD bound adjustment, new object type support | No (optional for analyst review) | Yes — changelog summary |
| Major | Integrator change, density model change, MC parameter change, recalibration | Yes — all active predictions superseded | Yes — written notice to all shadow deployment partners; 2-week notice before deployment |

Version string: Semantic version (MAJOR.MINOR.PATCH) embedded in every prediction record at creation time as model_version. The currently deployed version is exposed via GET /api/v1/system/model-version.

Cross-version prediction display: When a prediction was made with a model version that differs from the current deployed version by a major bump, the UI shows:

⚠ Prediction generated with model v1.2.0 — current model is v2.0.0 (major update).
  This prediction reflects older parameters. Re-run recommended for operational planning.
  [Re-run with current model →]
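
The display rule reduces to a semantic-version major comparison; a minimal sketch (helper name illustrative):

```python
def needs_rerun_warning(prediction_version: str, deployed_version: str) -> bool:
    """Show the cross-version callout only on a major-version difference.

    Versions are MAJOR.MINOR.PATCH strings, per the governance table above.
    """
    pred_major = int(prediction_version.split(".")[0])
    deployed_major = int(deployed_version.split(".")[0])
    return deployed_major > pred_major
```

Minor and patch differences deliberately do not trigger the callout — per the classification table, they never supersede active predictions.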

32.6 Adverse Outcome Monitoring

Continuous monitoring of prediction accuracy post-deployment is a regulatory credibility requirement. It is also the primary input to the recalibration pipeline.

Data flow:

  1. Analyst logs observed re-entry outcome via POST /api/v1/predictions/{id}/outcome after post-event analysis (source: The Aerospace Corporation observed re-entry database, US18SCS reports, or ESA ESOC confirmation)
  2. prediction_outcomes record created with p50_error_minutes, corridor_contains_observed, fir_false_positive, fir_false_negative
  3. Feedback pipeline runs weekly: aggregates outcomes, computes rolling accuracy metrics, flags systematic biases
  4. Grafana Model Accuracy dashboard shows: rolling 90-day median p50 error, corridor containment rate, false positive rate (CRITICAL alerts with no confirmed hazard), false negative rate (confirmed hazard with no CRITICAL alert)
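
The weekly aggregation in step 3 can be sketched over `prediction_outcomes` rows; field names follow the columns above, while the function name and dict interface are illustrative:

```python
from statistics import median

def accuracy_metrics(outcomes: list) -> dict:
    """Aggregate prediction_outcomes rows into rolling dashboard metrics.

    Each outcome dict mirrors the table columns: p50_error_minutes,
    corridor_contains_observed, fir_false_positive, fir_false_negative.
    """
    n = len(outcomes)
    return {
        "n": n,
        "median_p50_error_minutes": median(o["p50_error_minutes"] for o in outcomes),
        "corridor_containment_rate": sum(o["corridor_contains_observed"] for o in outcomes) / n,
        "false_positive_rate": sum(o["fir_false_positive"] for o in outcomes) / n,
        "false_negative_rate": sum(o["fir_false_negative"] for o in outcomes) / n,
    }
```

The same aggregation, restricted to a 90-day window and grouped by model version, would feed the Grafana panels and the quarterly transparency report.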

Quarterly transparency report: Generated automatically from prediction_outcomes. Contains aggregate (non-personal) data:

  • Total predictions served in the quarter
  • Number of outcomes recorded (and percentage — coverage of the total)
  • Median p50 error, 95th percentile error
  • Corridor containment rate
  • False positive rate (CRITICAL alerts with no confirmed hazard) and estimated false negative rate
  • Known model limitations summary (from model card)
  • Model version(s) active during the quarter

Report stored in MinIO public-reports/ bucket and made available on SpaceCom's public documentation site. The report is a Phase 3 deliverable.


32.7 Geographic Coverage Quality

FIR intersection quality varies by boundary data source. Operators in non-ECAC regions receive lower-quality airspace intersection assessments than European counterparts. This disparity must be acknowledged, not hidden.

Coverage quality levels:

| Source | Coverage quality | Regions |
|--------|------------------|---------|
| EUROCONTROL AIRAC | High | All ECAC states (Europe, Turkey, Israel, parts of North Africa) |
| FAA Digital-Terminal Procedures | High | Continental US, Alaska, Hawaii, US territories |
| OpenAIP | Medium | Global fallback; community-maintained; may lag AIRAC |
| Manual / not loaded | Low | Any region where no FIR data has been imported |

The airspace table has a coverage_quality column (high / medium / low). The airspace intersection API response includes coverage_quality per affected FIR. The UI shows a coverage quality callout on the airspace impact table when any affected FIR is medium or low:

⚠ FIR boundary quality: MEDIUM (OpenAIP source)
  Intersection calculations for this region use community-maintained boundary data.
  Verify with official AIRAC charts before operational use.
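
The callout trigger is a worst-quality check across affected FIRs; an illustrative sketch (field names follow the intersection API response described above):

```python
# Lower rank = better quality; callout fires on anything below "high".
QUALITY_RANK = {"high": 0, "medium": 1, "low": 2}

def coverage_callout_needed(affected_firs: list) -> bool:
    """Show the coverage quality callout when any affected FIR is
    below 'high' quality.

    Each dict mirrors the per-FIR intersection response:
    {"fir_id": ..., "coverage_quality": "high" | "medium" | "low"}.
    """
    return any(QUALITY_RANK[f["coverage_quality"]] > 0 for f in affected_firs)
```

A single medium- or low-quality FIR among the affected set is enough to trigger the callout, even when the remaining FIRs have high-quality boundaries.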

32.8 Ethics Accountability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| Model card | Required artefact; maintained alongside model in `docs/` | Regulators and ANSPs need a documented operational envelope; ESA TRL process requires it |
| Backcast accuracy statement | Conditional on object type; selection bias explicitly documented | Single unconditional figure misrepresents model generalisation to non-specialist audiences |
| OOD detection | Evaluated at prediction time; `ood_flag` + UI warning callout; prediction still served | Operators retain autonomy; OOD flag informs rather than blocks; hiding it would create false confidence |
| Recalibration governance | Hold-out validation + dual sign-off + active prediction re-run + ANSP notification | Ungoverned recalibration is an ungoverned change to a safety-critical model |
| Alert threshold governance | Documented rationale; change requires PR review + 2-week shadow validation + ANSP notification | Threshold values are consequential algorithmic decisions; they must be as auditable as code changes |
| Prediction staleness warning | `prediction_valid_until = p50 - 4h`; warning independent of system health banner | A prediction for an imminent re-entry event has growing implicit uncertainty; operators need a signal |
| Adverse outcome monitoring | `prediction_outcomes` table; weekly pipeline; quarterly public report | Without outcome data, performance claims are assertions not evidence; public report builds regulatory trust |
| FIR coverage disparity | `coverage_quality` column on `airspace`; disclosed per-FIR in intersection results | Hiding coverage quality differences from operators would be a form of false precision |
| False positive / negative framing | Both tracked in `prediction_outcomes`; both in quarterly report | Optimising only for one error type can silently worsen the other; both must be visible |
| Public transparency report | Aggregate accuracy data; no personal data; quarterly cadence | Aviation safety infrastructure operates in a regulated transparency environment; SpaceCom must too |

33. Technical Writing / Documentation Engineering

33.1 Documentation Principles

SpaceCom documentation has three distinct audiences with different needs:

| Audience | Primary docs | Format |
|----------|--------------|--------|
| Engineers building the system | ADRs, inline docstrings, test plan, AGENTS.md | Markdown in repo |
| Operators using the system | User guides, API guide, in-app help | Hosted docs site / PDF |
| Regulators and auditors | Model card, validation reports, runbooks, CHANGELOG | Formal documents; version-controlled |

Documentation that serves the wrong audience in the wrong format fails both audiences. The §12.1 docs/ directory tree encodes this separation by subdirectory.


33.2 Architecture Decision Record (ADR) Standard

Format: MADR — Markdown Architectural Decision Records. Lightweight, git-friendly, no tooling dependency.

File naming: docs/adr/NNNN-short-title.md where NNNN is a zero-padded sequence number.

Template:

# NNNN — <Title>

**Status:** Accepted | Superseded by [MMMM](MMMM-title.md) | Deprecated

## Context

<What is the issue or design question this decision addresses? What forces are at play?>

## Decision

<What was decided?>

## Consequences

**Positive:** <What does this decision make easier or better?>
**Negative / trade-offs:** <What does this decision make harder or require accepting?>
**Neutral:** <Other effects worth noting>

## Alternatives considered

| Alternative | Why rejected |
|-------------|-------------|
| ...         | ...         |

Linking from code: When a code section implements a non-obvious decision, add an inline comment: # See docs/adr/0003-monte-carlo-chord-pattern.md. This makes the rationale discoverable from the code, not just from the plan.

Required initial ADR set (Phase 1):

| ADR | Decision |
|-----|----------|
| 0001 | RS256 asymmetric JWT over HS256 |
| 0002 | Dual front-door architecture (aviation + space portals) |
| 0003 | Monte Carlo chord pattern (Celery group + chord) |
| 0004 | GEOGRAPHY vs GEOMETRY spatial column types |
| 0005 | `lazy="raise"` on all SQLAlchemy relationships |
| 0006 | TimescaleDB chunk intervals (orbits: 1 day, space_weather: 30 days) |
| 0007 | CesiumJS commercial licence requirement |
| 0008 | PgBouncer transaction-mode pooling |
| 0009 | CCSDS OEM GCRF reference frame |
| 0010 | Alert threshold rationale (6h CRITICAL, 24h HIGH) |

33.3 OpenAPI Documentation Standard

FastAPI auto-generates OpenAPI 3.1 schema from Python type annotations. Auto-generation is necessary but not sufficient. The following requirements are enforced by CI.

Per-endpoint requirements:

@router.get(
    "/reentry/predictions/{id}",
    summary="Get re-entry prediction by ID",
    description=(
        "Returns a single re-entry prediction with HMAC integrity verification. "
        "If the prediction's HMAC fails verification, returns 503 — do not use the data. "
        "Requires `viewer` role minimum. OOD-flagged predictions include a warning field."
    ),
    tags=["Re-entry"],
    responses={
        200: {"description": "Prediction returned; check `integrity_failed` field"},
        401: {"description": "Not authenticated"},
        403: {"description": "Insufficient role"},
        404: {"description": "Prediction not found or belongs to another organisation"},
        503: {"description": "HMAC integrity check failed — prediction data is untrusted"},
    },
)
async def get_prediction(id: int, ...):

CI enforcement: A pytest fixture iterates the FastAPI app's routes and asserts that description is non-empty for every route with path starting /api/v1/. Fails CI with a list of non-compliant endpoints.
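
The fixture logic reduces to a scan over route metadata; a dependency-free sketch of the check (the real fixture pulls path/description pairs from the FastAPI app's route table):

```python
from __future__ import annotations

def undocumented_api_routes(routes: list) -> list:
    """Return paths under /api/v1/ whose description is empty or missing.

    `routes` is a list of (path, description) pairs; the real CI fixture
    builds these from the FastAPI app's routes and fails the test run
    with the returned list when it is non-empty.
    """
    return [
        path for path, description in routes
        if path.startswith("/api/v1/") and not (description or "").strip()
    ]
```

Non-API routes (health checks, static assets) are deliberately exempt, so the check only gates the documented public contract.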

Rate limiting documentation: Endpoints with rate limits include the limit in the description field: "Rate limited: 10 requests/minute per user. Returns 429 with Retry-After header when exceeded."


33.4 Runbook Standard

Template (docs/runbooks/TEMPLATE.md):

# Runbook: <Title>

**Severity:** SEV-1 | SEV-2 | SEV-3 | SEV-4
**Owner:** <team or role>
**Last reviewed:** YYYY-MM-DD
**Estimated duration:** <X minutes>

## Trigger condition

<What condition causes this runbook to be needed? What alert or observation triggers it?>

## Preconditions

- [ ] You have SSH access to the production host
- [ ] <other preconditions>

## Steps

1. <First step — be specific; include exact commands>
2. <Second step>
   ```bash
   # exact command with expected output noted
   docker compose ps
   ```
3. ...

## Verification

<How do you confirm the runbook was successful? What does healthy state look like?>

## Rollback

<If the steps made things worse, how do you undo them?>

## Notify

- Engineering lead notified (Slack #incidents)
- On-call via PagerDuty if SEV-1/2
- ANSP partners notified if operational disruption (template: docs/runbooks/ansp-notification-template.md)

**Runbook index** (`docs/runbooks/README.md`):

| Runbook | Severity | Owner | Last reviewed |
|---------|----------|-------|--------------|
| `db-failover.md` | SEV-1 | Platform | Phase 3 |
| `celery-recovery.md` | SEV-2 | Platform | Phase 3 |
| `hmac-failure.md` | SEV-1 | Security | Phase 1 |
| `ingest-failure.md` | SEV-2 | Platform | Phase 1 |
| `gdpr-breach-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `safety-occurrence-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `secrets-rotation-jwt.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-spacetrack.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-hmac.md` | SEV-1 | Engineering Lead | Phase 2 |
| `blue-green-deploy.md` | SEV-3 | Platform | Phase 3 |
| `restore-from-backup.md` | SEV-2 | Platform | Phase 2 |

---

### 33.5 Docstring Standard

All public functions in the following modules must have Google-style docstrings:
`propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `breakup/atmospheric.py`, `conjunction/probability.py`, `integrity.py`, `frame_utils.py`, `time_utils.py`.

**Required docstring sections:** `Args` (with physical units for all dimensional quantities), `Returns`, `Raises`, and `Notes` (for numerical limitations or known edge cases).

```python
def integrate_trajectory(
    object_id: int,
    f107: float,
    bstar: float,
    params: dict,
) -> TrajectoryResult:
    """Integrate a single RK7(8) decay trajectory from current epoch to re-entry.

    Uses NRLMSISE-00 atmospheric density model with J2–J6 geopotential and
    solar radiation pressure. Terminates at 80 km altitude (configurable via
    params['termination_altitude_km']).

    Args:
        object_id: NORAD catalog number of the decaying object.
        f107: Solar flux index (10.7 cm) in solar flux units (sfu).
            Valid range: 65–300 sfu. Values outside this range are accepted
            but produce extrapolated NRLMSISE-00 results (see docs/ood-bounds.md).
        bstar: BSTAR drag term from TLE (units: 1/Earth_radius).
            Valid range: 0.0001–0.3 per docs/ood-bounds.md.
        params: Simulation parameters dict. Required keys:
            'mc_samples' (int), 'termination_altitude_km' (float, default 80.0).

    Returns:
        TrajectoryResult with fields: reentry_time (UTC datetime),
        impact_lat_deg (float), impact_lon_deg (float), final_velocity_ms (float).

    Raises:
        IntegrationDivergenceError: If the integrator step size shrinks below
            1e-6 seconds (indicates numerical instability — log and flag as OOD).
        ValueError: If object_id is not in the database.

    Notes:
        NRLMSISE-00 is calibrated for 100–600 km altitude. Below 100 km the
        density is extrapolated and uncertainty grows significantly. The OOD
        flag is set by the caller based on ood-bounds.md thresholds, not here.
    """

Enforcement: mypy pre-commit hook enforces no untyped function signatures. A separate CI check using pydocstyle or ruff with docstring rules enforces non-empty docstrings on public functions in the listed modules.


33.6 CHANGELOG.md Format

Follows Keep a Changelog conventions. Human-maintained — not auto-generated from commit messages.

# Changelog

All notable changes to SpaceCom are documented here.
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)

## [Unreleased]

## [1.0.0] — 2026-MM-DD

### Added
- Re-entry decay predictor (RK7(8) + NRLMSISE-00 + Monte Carlo 500 samples)
- Percentile corridor visualisation (Mode A)
- Space weather widget (NOAA SWPC + ESA SWS cross-validation)
- CRITICAL/HIGH/MEDIUM/LOW alert system with two-step CRITICAL acknowledgement
- Shadow mode with per-org legal clearance gate

### Security
- JWT RS256 with httpOnly cookies; TOTP MFA enforced for all roles
- HMAC-SHA256 integrity on all prediction and hazard zone records
- Append-only `alert_events` and `security_logs` tables

## [0.1.0] — 2026-MM-DD (Phase 1 internal)
...

Who maintains it: The engineer cutting the release writes the entry. Product owner reviews before tagging. Entries are written for operators and regulators — not for engineers.


33.7 User Documentation Plan

| Document | Audience | Phase | Format | Location |
|----------|----------|-------|--------|----------|
| Aviation Portal User Guide | Persona A/B/C | Phase 2 | Markdown → PDF | `docs/user-guides/aviation-portal-guide.md` |
| Space Portal User Guide | Persona E/F | Phase 3 | Markdown → PDF | `docs/user-guides/space-portal-guide.md` |
| Administrator Guide | Persona D | Phase 2 | Markdown | `docs/user-guides/admin-guide.md` |
| API Developer Guide | Persona E/F | Phase 2 | Markdown → hosted | `docs/api-guide/` |
| In-app contextual help | Persona A/C | Phase 3 | React component content | `frontend/src/components/shared/HelpContent.ts` |

Aviation Portal User Guide — required sections:

  1. Dashboard overview (what you see on first login)
  2. Understanding the globe display and urgency symbols
  3. Reading a re-entry event: window range, corridor, risk level
  4. Alert acknowledgement workflow (step-by-step with screenshots)
  5. NOTAM draft workflow and mandatory disclaimer
  6. Degraded mode: what the banners mean and what to do
  7. Sharing views: deep links
  8. Contacting SpaceCom support

Review requirement: The aviation portal guide must be reviewed by at least one Persona A representative (ANSP duty manager or equivalent) before first shadow deployment. Their sign-off is recorded in docs/user-guides/review-log.md.


33.8 API Developer Guide

Located at docs/api-guide/. This is the primary onboarding resource for Persona E (space operators using API keys) and Persona F (orbital analysts with programmatic access).

Minimum content for Phase 2:

authentication.md:

  • How to create an API key (step-by-step with screenshots)
  • How to attach the key to requests (Authorization: Bearer <key> header)
  • API key scopes and which endpoints each scope can access
  • How to revoke a key

rate-limiting.md:

  • Per-endpoint rate limits in a table
  • 429 response format and Retry-After header usage
  • Burst vs. sustained limits

error-reference.md:

400 Bad Request        — Invalid parameters; see `detail` field
401 Unauthorized       — Missing or invalid API key
403 Forbidden          — API key does not have the required scope
404 Not Found          — Resource not found or not owned by your account
422 Unprocessable      — Request body failed schema validation
429 Too Many Requests  — Rate limit exceeded; see Retry-After header
503 Service Unavailable — HMAC integrity check failed; do not use the returned data

code-examples/python-quickstart.py:

import requests

API_BASE = "https://api.spacecom.io/api/v1"
API_KEY = "sk_live_..."   # from your API key dashboard

session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"

# Get list of tracked objects currently decaying
resp = session.get(f"{API_BASE}/objects", params={"decay_status": "decaying"})
resp.raise_for_status()
objects = resp.json()["results"]
print(f"{len(objects)} objects in active decay")

# Get OEM ephemeris for the first object
norad_id = objects[0]["norad_id"]
resp = session.get(
    f"{API_BASE}/space/objects/{norad_id}/ephemeris",
    headers={"Accept": "application/ccsds-oem"},
    params={"start": "2026-03-17T00:00:00Z", "end": "2026-03-18T00:00:00Z"}
)
print(resp.text)   # CCSDS OEM format

33.9 AGENTS.md Specification

AGENTS.md at the project root provides guidance to AI coding agents (such as Claude Code) working in this codebase. It is a first-class documentation artefact — committed to the repo, version-controlled, and referenced in the onboarding guide.

Required sections:

# SpaceCom — Agent Guidance

## Codebase overview
<3-paragraph summary of architecture, key modules, and safety context>

## Safety-critical files — extra care required
The following files have safety-critical implications. Any change must include
a test and a brief rationale comment:
- `backend/app/frame_utils.py` — frame transforms affect corridor coordinates
- `backend/app/integrity.py` — HMAC signing affects prediction integrity guarantees
- `backend/app/modules/propagator/decay.py` — physics model
- `backend/app/modules/alerts/service.py` — alert trigger logic
- `backend/migrations/` — schema changes affect immutability triggers

## Test requirements
- All backend changes must pass `make test` before committing
- Physics function changes require a new test case in the relevant test module
- Security-relevant changes require a `test_rbac.py` or `test_integrity.py` case
- Never mock the database in integration tests — use the test DB container

## Code conventions
- FastAPI endpoints must have `summary`, `description`, and `responses` (see §33.3)
- Public physics/security functions must have Google-style docstrings with units
- All new decisions should have an ADR in `docs/adr/` (see §33.2)
- New runbooks go in `docs/runbooks/` using the template at `docs/runbooks/TEMPLATE.md`

## Playwright / E2E test selector convention
- Every interactive element targeted by a Playwright test **must** have a `data-testid="<component>-<action>"` attribute
  - Examples: `data-testid="alert-acknowledge-btn"`, `data-testid="notam-draft-submit"`, `data-testid="decay-predict-form"`
- Playwright tests must use `page.getByTestId(...)` or accessible role selectors (`page.getByRole(...)`) **only**
- CSS class selectors, XPath, and `page.locator('.')` are forbidden in test files
- A CI lint step (`grep -r 'page\.locator\b\|page\.\$' tests/e2e/`) must return empty

## What not to do
- Do not add `latest` tags to Docker image references
- Do not store secrets in `.env` files committed to git
- Do not make changes to alert thresholds without updating `docs/alert-threshold-history.md`
- Do not change `model_version` in `decay.py` without following the model version governance procedure (§32.5)
- Do not proxy the Cesium ion token server-side — it is a public browser credential by design (`NEXT_PUBLIC_CESIUM_ION_TOKEN`). Do not store it in Vault, Docker secrets, or treat it as sensitive.
- Do not add write operations (POST/PUT/DELETE API calls, Zustand mutations) to components rendered in SIMULATION or REPLAY mode without calling `useModeGuard(['LIVE'])` first and disabling the control in non-LIVE modes.

33.10 Test Documentation Standard

Test pyramid and coverage gates — enforced in CI; make test runs all layers:

| Layer | Scope | Minimum gate | CI enforcement |
|---|---|---|---|
| Unit | backend/app/ excluding migrations/, schemas/ | 80% line coverage | pytest --cov=backend/app --cov-fail-under=80 |
| Integration | Every API endpoint × every applicable role | 100% of routes in test_rbac.py | RBAC matrix fixture enumerates all FastAPI routes via app.routes |
| E2E | 5 critical user journeys (see below) | All journeys pass | Playwright job in CI; blocks merge |
| Physics validation | All suites in docs/test-plan.md marked Blocking | 0 failures | Separate CI job; always runs before merge |
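The Integration layer's enforcement mechanism — enumerating every FastAPI route so the RBAC matrix cannot silently miss an endpoint — can be sketched as follows. The `Route` dataclass here is an illustrative stand-in for `fastapi.routing.APIRoute`; in the real fixture the objects come straight from `app.routes` on the FastAPI instance.

```python
from dataclasses import dataclass

@dataclass
class Route:
    """Stand-in for fastapi.routing.APIRoute (path + HTTP methods)."""
    path: str
    methods: set[str]

def enumerate_routes(routes: list[Route]) -> list[tuple[str, str]]:
    """Expand every route into (method, path) pairs so the RBAC matrix can be
    parametrised exhaustively — no endpoint can be forgotten when a new route
    is added."""
    pairs: list[tuple[str, str]] = []
    for route in routes:
        # HEAD/OPTIONS are auto-generated and not RBAC-relevant
        for method in sorted(route.methods - {"HEAD", "OPTIONS"}):
            pairs.append((method, route.path))
    return pairs

# pytest consumes this via @pytest.mark.parametrize("method,path", enumerate_routes(...))
```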

5 critical user journeys (E2E blocking):

  1. CRITICAL alert → acknowledge → NOTAM draft saved
  2. Analyst submits decay prediction → job completes → corridor visible on globe
  3. Admin creates user → user logs in → MFA enrolment complete
  4. Space operator registers object → views conjunction list
  5. Admin enables shadow mode → shadow prediction absent from viewer response

Module docstring requirement for all physics and security test modules:

"""
test_frame_utils.py — Frame Transformation Validation Suite

Physical invariant tested:
    TEME → GCRF → ITRF → WGS84 coordinate chain must agree with
    Vallado (2013) reference state vectors to within specified tolerances.

Reference source:
    Vallado, D.A. (2013). Fundamentals of Astrodynamics and Applications, 4th ed.
    Table 3-4 (GCRF↔ITRF) and Table 3-5 (TEME→GCRF). Reference vectors in
    docs/validation/reference-data/vallado-sgp4-cases.json.

Operational significance of failure:
    A frame transform error propagates directly into corridor polygon coordinates.
    A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km.
    ALL tests in this module are BLOCKING CI failures.

How to add a new test case:
    1. Add the reference state vector to vallado-sgp4-cases.json
    2. Add a parametrised test case to TestTEMEGCRF or TestGCRFITRF
    3. Document the source in a comment on the test case
"""

docs/test-plan.md structure:

| Suite | Module(s) | Physical invariant / behaviour | Reference | Pass tolerance | Blocking? |
|---|---|---|---|---|---|
| Frame transforms | tests/physics/test_frame_utils.py | TEME→GCRF→ITRF→WGS84 chain accuracy | Vallado (2013) Table 3-4/3-5 | Position < 1 km | Yes |
| SGP4 propagator | tests/physics/test_propagator/ | State vector at epoch; 7-day propagation | Vallado (2013) test set | < 1 km at epoch; < 10 km at +7d | Yes |
| Decay predictor | tests/physics/test_decay/ | p50 re-entry time accuracy; corridor containment | Aerospace Corp database | Median error < 4h; containment ≥ 90% | Phase 2+ |
| NRLMSISE-00 density | tests/physics/test_decay/test_nrlmsise.py | Density agrees with reference atmosphere | Picone et al. (2002) Table 1 | < 1% at 5 reference points | Yes |
| Hypothesis invariants | tests/physics/test_hypothesis.py | SGP4 round-trip; p95 corridor containment; RLS tenant isolation | Internal + Vallado | See §42.3 | Yes |
| HMAC integrity | tests/test_integrity.py | Tampered record detected; correct error response | Internal | 503 + CRITICAL log entry | Yes |
| RBAC enforcement | tests/test_rbac.py | Every endpoint returns correct status for every role | Internal | 0 mismatches | Yes |
| Rate limiting | tests/test_auth.py | 429 at threshold; 200 after reset | Internal | Exact threshold | Yes |
| WebSocket | tests/test_websocket.py | Sequence replay; token expiry warning; close codes 4001/4002 | Internal spec §14 | All assertions pass | Yes |
| Contract tests | tests/test_ingest/test_contracts.py | Space-Track + NOAA key presence AND value ranges | Internal | 0 violations | Yes (in CI against mocks) |
| Celery lifecycle | tests/test_jobs/test_celery_failure.py | Timed-out job → failed; orphan recovery Beat task | Internal | State correct within 5 min | Yes |
| MC corridor | tests/physics/test_mc_corridor.py | Corridor contains ≥ 95% of p95 trajectories; polygon matches committed reference | Internal (seeded RNG seed=42) | Area delta < 5% | Phase 2+ |
| Smoke suite | tests/smoke/ | API/WS health; auth; catalog non-empty; DB connectivity | Internal | All pass in ≤ 2 min | Yes (post-deploy) |
| E2E journeys | tests/e2e/ (Playwright) | 5 critical user journeys; WCAG 2.1 AA axe-core scan | Internal | 0 journey failures; 0 axe violations | Yes |
| Breakup energy conservation | tests/physics/test_breakup/ | Energy conserved through fragmentation | Internal analytic | < 1% error | Phase 2+ |

Test database isolation strategy — prevents test state leakage and enables parallel execution (pytest-xdist):

  • Unit tests and single-connection integration tests: db_session fixture wraps each test in a SAVEPOINT/ROLLBACK TO SAVEPOINT transaction. No committed data persists between tests.
  • Celery integration tests (multi-connection, multi-process): use testcontainers-python (PostgresContainer) to spin up a dedicated DB container per pytest-xdist worker. The container is created at session scope and torn down at session end. Each test worker sets search_path to its own schema (test_worker_<worker_id>) for additional isolation.
  • Never use the development or production DB for tests. The DATABASE_URL in test config must point to localhost:5433 (test container) or the testcontainers dynamic port. CI enforces this via environment variable assertion at test startup.
  • pytest.ini configuration:
    [pytest]
    addopts = -x --strict-markers -p no:warnings
    markers =
        quarantine: flaky tests excluded from blocking CI
        contract: external API contract tests; run against mocks in CI
        smoke: post-deploy smoke tests
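The SAVEPOINT/ROLLBACK isolation pattern from the first bullet can be sketched with the standard library — here sqlite3 stands in for the Postgres test container, and the real `db_session` fixture wraps the SQLAlchemy session the same way:

```python
import sqlite3

def run_in_savepoint(conn: sqlite3.Connection, work) -> None:
    """The pattern the db_session fixture applies around each test: open a
    SAVEPOINT, run the test body, then roll the SAVEPOINT back so no
    committed data persists between tests."""
    conn.execute("SAVEPOINT test_sp")
    try:
        work(conn)
    finally:
        conn.execute("ROLLBACK TO SAVEPOINT test_sp")  # undo the test's writes
        conn.execute("RELEASE SAVEPOINT test_sp")

# Demonstration: an INSERT inside the savepoint leaves no trace afterwards.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; explicit savepoints
conn.execute("CREATE TABLE predictions (id INTEGER)")
run_in_savepoint(conn, lambda c: c.execute("INSERT INTO predictions VALUES (1)"))
leaked = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
```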
    

Flaky test policy:

  1. A test is "flaky" if it fails without a code change ≥ 2 times in any 30-day window (tracked via GitHub Actions JUnit artefact history)
  2. On second flaky failure: the test is decorated with @pytest.mark.quarantine and moved to tests/quarantine/; a GitHub issue is filed automatically by the CI workflow
  3. Quarantined tests are excluded from blocking CI (pytest -m "not quarantine") but continue to run in a non-blocking nightly job so failures are visible
  4. A test in quarantine > 14 days without a fix must be deleted — a never-fixed flaky test provides no safety value and actively erodes trust in CI
  5. The quarantine list is reviewed at each sprint review; any test in quarantine > 30 days blocks the next sprint release gate
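Rule 1's detection logic is a sliding-window count over the JUnit artefact history; a minimal sketch (function name and inputs are illustrative):

```python
from datetime import datetime, timedelta, timezone

def is_flaky(failure_dates: list[datetime], now: datetime,
             window_days: int = 30, threshold: int = 2) -> bool:
    """Flaky-test rule 1: a test is flaky when >= 2 failures (without a code
    change) land inside the trailing 30-day window of its failure history."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for d in failure_dates if d >= cutoff) >= threshold
```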

33.11 Technical Writing Decision Log

| Decision | Chosen | Rationale |
|---|---|---|
| ADR format | MADR (Markdown) | Lightweight; git-native; no tooling; linkable from code comments |
| ADR location | docs/adr/ in monorepo | Engineers find rationale where they work, not in a separate wiki |
| Changelog format | Keep a Changelog (human-maintained) | Commit messages are for engineers; changelogs are for operators and regulators; auto-generation produces wrong audience tone |
| Docstring style | Google-style | Most readable inline; compatible with Sphinx if API reference generation is needed; ruff can enforce it |
| Runbook format | Standard template with Trigger/Steps/Verification/Rollback/Notify | On-call engineers under pressure skip steps that aren't explicitly numbered; Rollback and Notify are consistently omitted without a template |
| User documentation timing | Phase 2 for aviation portal; Phase 3 for space portal | ANSP SMS acceptance requires user documentation before shadow deployment; space portal can follow |
| API guide location | docs/api-guide/ in repo | Co-located with code; version-controlled; engineers update it when they change the API |
| AGENTS.md | Committed to repo root; safety-critical files explicitly listed | An undocumented AGENTS.md is ignored or followed inconsistently; explicit safety-critical file list is the highest-value content |
| Test documentation | Module docstring + docs/test-plan.md | ECSS-Q-ST-80C requires test specification as a separate artefact; module docstrings are the lowest-friction way to maintain it |
| OpenAPI enforcement | CI check on empty description fields | Developers don't write documentation voluntarily; CI enforcement is the only reliable mechanism |

34. Infrastructure Design

This section consolidates infrastructure-level specifications: TLS lifecycle, port map, reverse-proxy configuration, WAF/DDoS posture, object storage configuration, backup validation, egress control, and the HA database parameters. For Patroni parameters see §26.3; for port exposure details see §3.3; for storage tiering see §27.4; for DNS/service discovery see §27.6.


34.1 TLS Certificate Lifecycle

Certificate Issuance Decision Tree

Is the deployment internet-facing?
├── YES → Use Caddy ACME (Let's Encrypt / ZeroSSL)
│         Caddy automatically renews; no manual steps required
│         Domain must be publicly resolvable (A record pointing to Caddy host)
│
└── NO (air-gapped / on-premise with no public DNS)
    ├── Does the customer operate an internal CA?
    │   ├── YES → Request cert from customer CA; configure Caddy with cert_file + key_file
    │   │         Document CA chain in `docs/runbooks/tls-cert-lifecycle.md`
    │   └── NO  → Generate internal CA with `step-ca` (Smallstep)
    │               Run step-ca as a sidecar container on the management network
    │               Issue Caddy cert from internal CA; clients import internal CA root cert

Cert Expiry Alert Thresholds

Prometheus alert rules in monitoring/alerts/tls.yml:

| Alert | Threshold | Severity |
|---|---|---|
| TLSCertExpiringSoon | < 60 days remaining | WARNING |
| TLSCertExpiringImminent | < 30 days remaining | HIGH |
| TLSCertExpiryCritical | < 7 days remaining | CRITICAL (pages on-call) |

For ACME-managed certs: Caddy renews at 30 days remaining by default; the 30-day alert should never fire in steady state. The 7-day CRITICAL alert is the backstop for ACME renewal failures.

Runbook Entry

docs/runbooks/tls-cert-lifecycle.md must cover:

  1. How to verify current cert expiry (echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates)
  2. ACME renewal troubleshooting (Caddy logs: caddy logs --tail 100)
  3. Manual certificate replacement procedure for air-gapped deployments
  4. Internal CA cert distribution to client browsers / API consumers

34.2 Caddy Reverse Proxy Configuration

# /etc/caddy/Caddyfile
# Production Caddyfile stub — customise domain and backend addresses
{
    email admin@your-domain.com          # ACME account email
    # For air-gapped: comment out email, add tls /path/to/cert /path/to/key
}

your-domain.com {
    # TLS — automatic ACME for internet-facing; replace with manual cert for air-gapped
    tls {
        protocols tls1.2 tls1.3         # Disable TLS 1.0 and 1.1
    }

    # Security headers
    header {
        Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
        -Server                          # Strip Server header (do not expose Caddy version)
        -X-Powered-By                    # Strip if present
    }

    # WebSocket proxy (backend WebSocket endpoint)
    handle /ws/* {
        reverse_proxy backend:8000 {
            header_up Host {host}
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # API and SSR routes
    handle /api/* {
        reverse_proxy backend:8000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # Static assets — served with long-lived immutable cache headers (F8 — §58)
    # Next.js content-hashes all filenames under /_next/static/ — safe for max-age=1y
    handle /_next/static/* {
        header Cache-Control "public, max-age=31536000, immutable"
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Cesium workers and static resources (large; benefit most from caching)
    handle /cesium/* {
        header Cache-Control "public, max-age=604800"   # 7 days; not content-hashed
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Frontend (Next.js) — HTML and dynamic routes (no caching)
    handle {
        header Cache-Control "no-store"   # HTML must never be cached; contains stale JS references otherwise
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }
}

Notes:

  • MinIO console (9001) and Flower (5555) are not exposed through Caddy in production. VPN/bastion access only.
  • Static asset Cache-Control: immutable is safe only because Next.js content-hashes all filenames. HTML pages must use no-store to force browsers to re-fetch the latest JS bundle references after a deploy.
  • HTTP (port 80) is implicitly redirected to HTTPS by Caddy when a TLS block is present.
  • max-age=63072000 = 2 years; standard for HSTS preload submission.

34.3 WAF and DDoS Protection

SpaceCom's application-layer rate limiting (§7.7) is a mitigation for abusive authenticated clients, not a defence against volumetric DDoS or web application attacks. A dedicated WAF/DDoS layer is required at Tier 2+ production deployments.

Internet-facing deployments (cloud or hosted):

  • Deploy behind Cloudflare (free tier minimum; Pro tier for WAF rules) or AWS Shield Standard + AWS WAF
  • Cloudflare: enable DDoS protection, OWASP managed ruleset, Bot Fight Mode
  • Configure Caddy to only accept connections from Cloudflare IP ranges (Cloudflare publishes the range; verify with curl https://www.cloudflare.com/ips-v4)

Air-gapped / on-premise government deployments:

  • Customer's upstream network perimeter (firewall/IPS) provides the DDoS and WAF layer
  • Document the perimeter protection requirement in the customer deployment checklist (docs/runbooks/on-premise-deployment.md)
  • SpaceCom is not responsible for perimeter DDoS mitigation in customer-managed deployments; this is a contractual boundary that must be documented in the MSA

On-premise licence key enforcement (F6 — §68):

On-premise deployments run on customer infrastructure. Without a licence key mechanism, a customer could run additional instances, share the deployment, or continue operating after licence expiry.

Licence key design: A JWT signed with SpaceCom's RSA private key (2048-bit minimum). Claims:

{
  "sub": "<org_id>",
  "org_name": "Civil Aviation Authority of Australia",
  "contract_type": "on_premise",
  "valid_from": "2026-01-01T00:00:00Z",
  "valid_until": "2027-01-01T00:00:00Z",
  "features": ["operational_mode", "multi_ansp_coordination"],
  "max_users": 50,
  "iss": "spacecom.io",
  "iat": 1735689600
}

Enforcement: At startup, backend/app/main.py verifies the licence JWT using SpaceCom's public key (bundled in the Docker image). If validation fails or the licence has expired: the backend starts in licence-expired degraded mode — read-only access to historical data; no new predictions or alerts; all write endpoints return HTTP 402 Payment Required with {"error": "licence_expired", "contact": "commercial@spacecom.io"}. An hourly Celery Beat task re-validates the licence. If it expires mid-operation, running simulations complete but no new simulations are accepted after the check fires.

Key rotation: New licence JWT issued via scripts/generate_licence_key.py (requires SpaceCom private key, stored in HashiCorp Vault — never committed to the repository). Customer sets SPACECOM_LICENCE_KEY environment variable; container restart picks it up. SpaceCom's RSA public key is embedded in the Docker image at build time (/etc/spacecom/licence_pubkey.pem).
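A sketch of the startup check described above. In production the claims dict comes from a signature-verifying decode (e.g. PyJWT's `jwt.decode(token, pubkey, algorithms=["RS256"], issuer="spacecom.io")`); the custom valid_from/valid_until window is SpaceCom's own claim set, so it must be checked explicitly after signature verification. The function name is illustrative:

```python
from datetime import datetime, timezone

def licence_state(claims: dict, now: datetime) -> str:
    """Evaluate the custom validity-window claims of an already
    signature-verified licence JWT. Outside the window the backend enters
    licence-expired degraded mode (reads OK; writes return HTTP 402)."""
    valid_from = datetime.fromisoformat(claims["valid_from"].replace("Z", "+00:00"))
    valid_until = datetime.fromisoformat(claims["valid_until"].replace("Z", "+00:00"))
    if now < valid_from or now >= valid_until:
        return "licence_expired"
    return "licensed"
```

The same function serves the hourly Celery Beat re-validation: only the `now` argument changes between startup and periodic checks.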

CI/DAST complement: OWASP ZAP DAST (§21 Phase 2 DoD) tests the application layer; WAF covers infrastructure-layer attack patterns. Both are required — they cover different threat categories.


34.4 MinIO Object Storage Configuration

Erasure Coding (Tier 3)

4-node distributed MinIO uses EC:2 (2 data + 2 parity shards per erasure set):

# MinIO server startup command (each of 4 nodes runs the same command)
minio server \
  http://minio-1:9000/data \
  http://minio-2:9000/data \
  http://minio-3:9000/data \
  http://minio-4:9000/data \
  --console-address ":9001"

EC:2 on 4 nodes means:

  • Each object is split into 4 shards (2 data + 2 parity)
  • Read quorum: 2 shards (tolerates 2 simultaneous node failures for reads)
  • Write quorum: 3 shards (tolerates 1 simultaneous node failure for writes)
  • Usable capacity: 50% of raw total

ILM (Information Lifecycle Management) Policies

Configured via mc ilm add commands in docs/runbooks/minio-lifecycle.md:

| Bucket | Prefix | Transition after | Target |
|---|---|---|---|
| mc-blobs | (all) | 90 days | MinIO warm tier or S3-IA |
| pdf-reports | (all) | 365 days | S3 Glacier |
| notam-drafts | (all) | 365 days | S3 Glacier |
| db-wal-archive | (all) | 31 days | Delete (WAL older than 30 days not needed for point-in-time recovery) |

34.5 Backup Restore Test Verification Checklist

Monthly restore test procedure (executed by the restore_test Celery task; results logged to security_logs type RESTORE_TEST). A human engineer must verify all six items before marking the restore test as passed:

| # | Verification item | How to verify |
|---|---|---|
| 1 | Row count match | SELECT COUNT(*) FROM reentry_predictions on restored DB equals baseline count captured before backup |
| 2 | Latest record present | Most recent reentry_predictions.created_at in restored DB is within 5 minutes of the backup timestamp |
| 3 | HMAC spot-check | Run integrity.verify_prediction(id) on 5 randomly selected prediction IDs; all must return VALID |
| 4 | Append-only trigger functional | Attempt UPDATE reentry_predictions SET risk_level = 'LOW' WHERE id = <test_id>; must raise exception |
| 5 | Hypertable chunks intact | SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'orbits' matches expected chunk count for the backup date range |
| 6 | Foreign key integrity | pg_restore completed with 0 FK constraint violations (check restore log for ERROR: insert or update on table ... violates foreign key constraint) |

Restore test failures are treated as CRITICAL alerts. The restore test target DB (db-restore-test container) must be isolated from the production network (not attached to db_net).
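Verification item 3 (HMAC spot-check) reduces to recomputing the MAC over the canonical record body and comparing in constant time. A sketch — the field layout and canonicalisation are illustrative; the real implementation lives in backend/app/integrity.py:

```python
import hashlib
import hmac
import json

def verify_prediction(record: dict, key: bytes) -> bool:
    """Recompute the HMAC over the canonical record body (all fields except
    the stored MAC, serialised deterministically) and compare in constant
    time. Returns False for any tampered field."""
    body = {k: v for k, v in record.items() if k != "hmac"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["hmac"])
```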


34.6 Infrastructure Design Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Reverse proxy | Caddy | nginx + certbot | Caddy automatic ACME eliminates manual cert management; simpler config; native HTTP/2 and HTTP/3 |
| TLS air-gapped | Internal CA (step-ca) | Self-signed per-service | Internal CA allows cert chain trust; self-signed requires per-client exception management |
| WAF/DDoS | Upstream provider (Cloudflare/AWS Shield) | Application-layer rate limiting only | Volumetric DDoS bypasses application-layer; WAF covers OWASP attack patterns at network ingress |
| MinIO erasure coding | EC:2 on 4 nodes | EC:4 (higher parity) | EC:4 on 4 nodes would require 4-node write quorum; any single failure blocks writes; EC:2 balances protection and availability |
| Multi-region | Single region per jurisdiction | Active-active global cluster | Data sovereignty; compliance certification scope; Phase 13 customer base size doesn't justify multi-region operational complexity |
| DB connection target | PgBouncer VIP | Direct Patroni primary connection string | Application connection strings don't change during Patroni failover; stable operational target |
| Cold tier (MC blobs) | MinIO ILM warm → S3-IA | S3 Glacier | MC blobs may be replayed for Mode C visualisation; 12h Glacier restore latency is operationally unacceptable |
| Cold tier (compliance) | S3 Glacier / Deep Archive | Warm S3 | Compliance docs need 7-year retention but rare retrieval; Glacier cost is 80–90% lower than S3-IA |
| Egress filtering | Host-level UFW/nftables | Rely on Docker network isolation | Docker isolation is inter-network only; outbound internet egress must be filtered at host level |
| HSTS max-age | 63072000 (2 years) | 31536000 (1 year) | 2 years is the HSTS preload list minimum; aligns with standard hardening guides |

35. Performance Engineering

This section consolidates performance specifications, load test definitions, and scalability constraints across the system. For compression policy configuration see §9.4; for latency budget and pagination standard see §14; for WebSocket subscriber ceiling see §14; for renderer memory limits see §3 / §27.


35.1 Load Test Specification

Tool: k6 (preferred) or Locust. Scripts in tests/load/. Scenarios must be deterministic and reproducible on a freshly seeded database.

Scenario: CZML Catalog (Phase 1 baseline, Phase 3 SLO gate)

// tests/load/czml_catalog.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 20 },   // Ramp to 20 users
    { duration: '5m', target: 100 },  // Ramp to 100 users (SLO target)
    { duration: '5m', target: 100 },  // Sustain 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    'http_req_duration{endpoint:czml_full}':  ['p(95)<2000'],   // Phase 3 SLO
    'http_req_duration{endpoint:czml_delta}': ['p(95)<500'],    // Delta must be faster
    'http_req_failed': ['rate<0.01'],                           // < 1% error rate
  },
};

export default function () {
  // First load: full catalog
  const fullRes = http.get('/czml/objects', {
    tags: { endpoint: 'czml_full' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(fullRes, { 'full catalog 200': (r) => r.status === 200 });

  // Subsequent loads: delta
  const since = new Date(Date.now() - 60000).toISOString();
  const deltaRes = http.get(`/czml/objects?since=${since}`, {
    tags: { endpoint: 'czml_delta' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(deltaRes, { 'delta 200': (r) => r.status === 200 });

  sleep(5);  // Think time: user views globe for ~5s before next action
}

Scenario: MC Prediction Submission

// tests/load/mc_predict.js — tests concurrency gate
export const options = {
  vus: 10,           // 10 concurrent MC submissions from 5 orgs (2 per org)
  duration: '3m',
  thresholds: {
    'http_req_duration{endpoint:mc_submit}': ['p(95)<500'],
    // 429s are expected (concurrency gate) — not counted as failures
    'checks': ['rate>0.95'],
  },
};

Scenario: WebSocket Alert Delivery

// tests/load/ws_alerts.js — verifies < 30s delivery under load
// Opens 100 persistent WebSocket connections; triggers 10 synthetic alerts;
// measures time from alert POST to WS delivery on all 100 clients

Load test execution:

  • Phase 1: run czml_catalog scenario on Tier 1 dev hardware; record p95 baseline
  • Phase 2: run after each major migration; confirm no regression vs Phase 1 baseline
  • Phase 3: full suite (all three scenarios) on Tier 2 staging; all thresholds must pass before production deploy approval

Load test reports committed to docs/validation/load-test-report-phase{N}.md.


35.2 CZML Delta Protocol

The full CZML catalog grows proportionally with object count and time-step density. The delta protocol prevents repeat full-catalog downloads after initial page load.

Client responsibility:

  1. On page load: fetch GET /czml/objects (full catalog). Cache X-CZML-Timestamp response header as lastSync.
  2. Every 30s (or on reconnect): fetch GET /czml/objects?since=<lastSync>.
  3. On receipt of X-CZML-Full-Required: true: discard globe state and re-fetch full catalog.
  4. On receipt of HTTP 413: the server cannot serve the full catalog (too large); contact system admin.

Server responsibility:

  • Full response: include X-CZML-Timestamp: <server_time_iso8601> header.
  • Delta response: include only objects with updated_at > since. If since is more than 30 minutes ago, return X-CZML-Full-Required: true with an empty CZML body (client must re-fetch).
  • Maximum full payload: 5 MB. If estimated size exceeds limit, return HTTP 413 with {"error": "catalog_too_large", "use_delta": true}.

Prometheus metric: czml_delta_ratio = delta requests / (delta + full requests). Target: > 0.95 in steady state (95% of CZML requests are delta).
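The client responsibilities above collapse into a small per-poll decision table; a sketch (the action names are illustrative):

```python
def next_sync_action(status: int, headers: dict) -> str:
    """Map one poll response to the client action required by the delta
    protocol: 413 means the server refuses the full catalog; the
    full-required header forces a state reset; otherwise merge the delta."""
    if status == 413:
        return "catalog_too_large"   # stop polling; contact system admin
    if headers.get("X-CZML-Full-Required") == "true":
        return "refetch_full"        # discard globe state; re-fetch full catalog
    return "apply_delta"             # merge changed objects; advance lastSync
                                     # to the X-CZML-Timestamp header value
```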


35.3 Monte Carlo Concurrency Gate

Unbounded MC fan-out collapses SLOs when multiple users submit concurrent jobs. The concurrency gate is implemented as a per-organisation Redis semaphore:

# worker/tasks/decay.py

import redis
from celery import current_app

REDIS = redis.Redis.from_url(settings.REDIS_URL)
MC_SEMAPHORE_TTL = 600  # seconds; covers maximum expected MC duration + margin

def acquire_mc_slot(org_id: int, org_tier: str) -> bool:
    """Returns True if slot acquired, False if at capacity. Limit derived from subscription tier (F6)."""
    from app.modules.billing.tiers import get_mc_concurrency_limit
    limit = get_mc_concurrency_limit(org_tier)
    key = f"mc_running:{org_id}"
    pipe = REDIS.pipeline()
    pipe.incr(key)
    pipe.expire(key, MC_SEMAPHORE_TTL)
    count, _ = pipe.execute()
    if count > limit:
        REDIS.decr(key)
        return False
    return True

def release_mc_slot(org_id: int) -> None:
    key = f"mc_running:{org_id}"
    current = REDIS.get(key)
    if current and int(current) > 0:
        REDIS.decr(key)

API layer:

# backend/api/decay.py

@router.post("/decay/predict")
async def submit_decay(req: DecayRequest, user: User = Depends(current_user)):
    if not acquire_mc_slot(user.organisation_id, user.org_tier):  # organisation's subscription tier, not the user's role
        raise HTTPException(
            status_code=429,
            detail="MC concurrency limit reached for your organisation",
            headers={"Retry-After": "120"},
        )
    task = run_mc_decay_prediction.delay(...)
    return {"task_id": task.id}

The Celery chord callback (on_chord_done) calls release_mc_slot. A TTL of 600s ensures the slot is released even if the worker crashes mid-task.

Quota exhaustion logging (F6): When acquire_mc_slot returns False, before returning 429, the endpoint writes a usage_events row: event_type = 'mc_quota_exhausted'. This makes quota pressure visible to the org admin and to the SpaceCom sales team (via admin panel). The org admin's usage dashboard shows: predictions run this month, quota hits this month, and a prompt to upgrade if hits ≥ 3 in a billing period.


35.4 Query Plan Regression Gate

CI job: performance-regression (runs in staging pipeline after make migrate):

# scripts/check_query_baselines.py
"""
Runs EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each query in
docs/query-baselines/*.sql against the migrated staging DB.
Compares execution time to the baseline JSON stored in the same directory.
Fails with exit code 1 if any query exceeds 2× the recorded baseline.
Emits a GitHub PR comment with a comparison table.
"""

BASELINE_DIR = "docs/query-baselines"
THRESHOLD_MULTIPLIER = 2.0

queries = {
    "czml_catalog_100obj": "SELECT ...",         # from czml_catalog_100obj.sql
    "fir_intersection":    "SELECT ...",         # from fir_intersection.sql
    "prediction_list":     "SELECT ...",         # from prediction_list_cursor.sql
}

Baselines are JSON files containing {"planning_time_ms": N, "execution_time_ms": N, "recorded_at": "..."}. Updated manually after a deliberate schema change with a PR comment explaining the expected regression.
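The comparison step of check_query_baselines.py can be sketched as a pure function over the baseline and measured JSON (field names follow the baseline format above):

```python
THRESHOLD_MULTIPLIER = 2.0

def check_query(baseline: dict, measured: dict,
                multiplier: float = THRESHOLD_MULTIPLIER) -> list[str]:
    """Compare one query's EXPLAIN ANALYZE timings against its committed
    baseline. Returns violation messages; an empty list means the gate
    passes for this query."""
    violations = []
    for field in ("planning_time_ms", "execution_time_ms"):
        if measured[field] > baseline[field] * multiplier:
            violations.append(
                f"{field}: {measured[field]:.1f} ms exceeds {multiplier}x "
                f"baseline {baseline[field]:.1f} ms"
            )
    return violations
```

The CI script aggregates these lists across all baselined queries and exits non-zero if any list is non-empty.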


35.5 Renderer Container Constraints

The renderer service (Playwright + Chromium) is memory-intensive during print-resolution globe captures:

# docker-compose.yml (renderer service)
renderer:
  image: spacecom/renderer:sha-${GIT_SHA}
  mem_limit: 4g
  memswap_limit: 4g       # No swap; if OOM, container restarts cleanly
  networks: [renderer_net]
  environment:
    RENDERER_MAX_PAGES: "4"       # Maximum concurrent render jobs
    RENDERER_TIMEOUT_S: "30"      # Per-render timeout; matches §21 DoD
    RENDERER_MAX_RESOLUTION: "300dpi"

Renderer Prometheus metrics:

  • renderer_memory_usage_bytes — current RSS of Chromium process; alert at 3.5 GB (WARN before OOM)
  • renderer_jobs_active — concurrent in-flight renders; alert if > 3 for > 60s (capacity signal)
  • renderer_timeout_total — count of renders killed by timeout; alert if > 0 in a 5-min window

Maximum report constraints (enforced in worker/tasks/renderer.py):

  • Maximum report pages: 50
  • Maximum globe snapshot resolution: 300 DPI (A4 format)
  • Reports exceeding these limits are rejected at submission with HTTP 400

Renderer memory isolation and on-demand rationale (F8 — §65 FinOps):

The renderer is the second-most memory-intensive service after TimescaleDB. At Tier 2 it is allocated a dedicated c6i.xlarge (~$140/mo) or equivalent. Unlike simulation workers, the renderer is called infrequently — typically a few times per day when a duty manager requests a PDF briefing pack.

On-demand vs. always-on analysis:

| Approach | Benefit | Cost/risk | Decision |
|---|---|---|---|
| Always-on (current) | Zero latency to first render; Chromium warm | $140/mo even if 0 renders/day | Use at Tier 1–2 — cost is predictable; latency matters for interactive report requests |
| On-demand (start on request, stop after idle) | Saves $140/mo on lightly used deployments | 15–30s Chromium cold-start per report; complicates deployment | Consider at Tier 3 with HPA scale-to-zero on renderer_jobs_active if customer SLA permits a 30s wait |
| Shared with simulation worker | Saves dedicated instance | Chromium OOM risk during concurrent MC + render | Do not use — Chromium 2–4 GB footprint during render + MC worker memory = OOM on 32 GB nodes |

Memory isolation is non-negotiable: The renderer container is on an isolated Docker network (renderer_net) with no direct DB access and no simulation worker co-location. This is both a security boundary (§7, §35.5) and a memory isolation boundary. A runaway Chromium process will OOM its own container and restart cleanly without affecting simulation workers or the backend API.

Cost-saving lever (on-premise): For on-premise deployments where the renderer runs on the same physical server as simulation workers, monitor renderer_memory_usage_bytes + spacecom_simulation_worker_memory_bytes via Grafana. Add a combined alert renderer + workers > 80% host RAM to detect co-location pressure before OOM.


35.6 Static Asset CDN Strategy

CesiumJS uncompressed: ~8 MB. With gzip compression: ~2.5 MB. At 100 concurrent first-time users: ~250 MB outbound in a burst.

Internet-facing (Cloudflare):

  • All paths under /_next/static/* and /static/* are served with Cache-Control: public, max-age=31536000, immutable (1 year, immutable — Next.js uses content-hash filenames)
  • Caddy upstream caches are bypassed for these paths (Cloudflare edge is the cache)
  • CesiumJS assets: cache hit ratio target > 0.98 after warm-up

On-premise:

  • Deploy an nginx sidecar container (static-cache) on frontend_net serving the Next.js out/ or .next/static/ directory directly
  • Caddy routes /_next/static/* → static-cache:80 (bypasses Next.js server)
  • Configure in docs/runbooks/on-premise-deployment.md

Bundle size monitoring (CI):

# .github/workflows/ci.yml (bundle-size job)
- name: Check bundle size
  run: |
    npm run build 2>&1 | grep "First Load JS"
    # Fails if main bundle > previous + 10% (threshold stored in .bundle-size-baseline)
    node scripts/check_bundle_size.js

Baseline stored in .bundle-size-baseline at repo root (plain number in bytes). Updated manually with a PR comment when a deliberate size increase is approved.
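The gate logic is simple enough to sketch. The repository script is Node (scripts/check_bundle_size.js), so this Python version is only an illustration of the +10% rule against the stored baseline:

```python
from pathlib import Path

THRESHOLD = 0.10  # a build fails if it exceeds the baseline by more than +10%

def check_bundle_size(current_bytes: int, baseline_file: Path) -> bool:
    """Compare the current main-bundle size against .bundle-size-baseline
    (a plain byte count). Returns True if within the +10% threshold."""
    baseline = int(baseline_file.read_text().strip())
    limit = baseline * (1 + THRESHOLD)
    if current_bytes > limit:
        print(f"FAIL: bundle {current_bytes} B exceeds baseline {baseline} B by >10%")
        return False
    print(f"OK: bundle {current_bytes} B within +10% of baseline {baseline} B")
    return True
```

Because the baseline is a single committed number, a deliberate size increase is approved simply by updating the file in the same PR.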


35.7 Performance Engineering Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Load test tool | k6 | Locust, JMeter | k6 is script-based (TypeScript-friendly), CI-native, outputs Prometheus-compatible metrics; Locust requires a Python process; JMeter is XML-heavy |
| CZML delta | `?since=<iso8601>` server-side filter | Client-side WebSocket push of changed entities | Server-side filter is simpler and works with HTTP caching; push requires server to track per-client state |
| MC semaphore | Redis INCR/DECR with TTL | DB-level lock | Redis is already the Celery broker; DB-level lock adds latency on every MC submit; TTL prevents deadlock on worker crash |
| Pagination | Cursor (created_at, id) | Keyset on single column | Single-column keyset has ties at same created_at (batch ingest); compound key is unique and stable |
| Query regression gate | EXPLAIN (ANALYZE, BUFFERS) JSON baseline | pg_stat_statements | EXPLAIN is deterministic per run on a warm buffer; pg_stat_statements averages across all historic executions and requires prod traffic to populate |
| Renderer memory cap | 4 GB Docker mem_limit | ulimit in container | Docker mem_limit is enforced by the kernel cgroup; ulimit only applies to the shell process, not Chromium subprocesses |
| Bundle size gate | +10% threshold vs. stored baseline | Absolute byte limit | Percentage is proportional to current size; absolute limits become irrelevant as bundles grow or shrink |

36. Security Architecture — Red Team / Adversarial Review

This section records the findings of an adversarial review against the §7 security architecture. Where findings were resolved by updating existing sections (§7.2, §7.3, §7.4, §7.9, §7.10, §7.11, §7.12, §7.14, §9.2), this section provides the finding rationale and cross-reference for traceability.

36.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | HMAC key rotation has no path through the immutability trigger | §7.9 — HMAC Key Rotation Procedure | Critical |
| 2 | Pre-signed MinIO URLs unscoped and unproxied for MC blobs | §7.10 — MinIO Bucket Policies | High |
| 3 | Celery task arguments not validated at the task layer | §7.12 — Compute Resource Governance | High |
| 4 | Playwright renderer SSRF mitigation incomplete | §7.11 — request interception allowlist | High |
| 5 | Refresh token theft: no family reuse detection | §7.3 + §9.2 refresh_tokens schema | High |
| 6 | Admin role elevation with no four-eyes approval | §7.2 + pending_role_changes table | High |
| 7 | Security events logged but no human alert matrix | §7.14 — security alerting matrix | Medium |
| 8 | Space-Track credential rotation has no ingest-gap spec | §7.14 — rotation runbook cross-reference | Medium |
| 9 | Shadow mode segregation application-layer only | §7.2 — shadow_segregation RLS policy | High |
| 10 | NOTAM draft content not sanitised — injection path | §7.4 — sanitise_icao() function | High |
| 11 | Supply chain posture not fully specified | §7.13 — already fully covered; no gap found | N/A |

36.2 Attack Paths Considered

The following attack paths were evaluated in this review:

Insider threat paths:

  • Compromised admin account silently elevating a backdoor account → mitigated by four-eyes approval (Finding 6)
  • Admin with access to the HMAC rotation script replacing legitimate predictions with forged ones → mitigated by dual sign-off + rotated_by audit trail (Finding 1)
  • ANSP operator sharing a pre-signed report URL with an external party → mitigated by 5-minute TTL + audit log (Finding 2)

Compromised worker paths:

  • Compromised ingest_worker (shares worker_net with Redis) writing crafted Celery task args → mitigated by task-layer validation (Finding 3)
  • Compromised worker exfiltrating simulation trajectory URLs → mitigated by server-side MC blob proxy (Finding 2)
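The task-layer mitigation for crafted Celery arguments can be sketched as a guard that runs inside the worker task body, so arguments injected directly into Redis are still checked before any compute happens. The function name, bounds, and argument list below are illustrative, not the actual §7.12 limits:

```python
# Hypothetical bounds for illustration; the governed limits live in §7.12.
MAX_MC_ITERATIONS = 10_000
MAX_HORIZON_DAYS = 120

class TaskArgumentError(ValueError):
    """Raised when a task receives arguments outside the governed range."""

def validate_mc_args(object_id: int, iterations: int, horizon_days: int) -> None:
    """Runs inside the worker, not the producer: a compromised producer that
    pushes crafted args straight into the Redis broker is still rejected."""
    if not isinstance(object_id, int) or object_id <= 0:
        raise TaskArgumentError(f"object_id must be a positive int, got {object_id!r}")
    if not 1 <= iterations <= MAX_MC_ITERATIONS:
        raise TaskArgumentError(f"iterations out of range: {iterations}")
    if not 1 <= horizon_days <= MAX_HORIZON_DAYS:
        raise TaskArgumentError(f"horizon_days out of range: {horizon_days}")
```

In the real codebase this guard would be the first statement of the Celery task function, before any resource allocation.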

Authentication/session paths:

  • Refresh token exfiltration + replay before legitimate client retries → mitigated by family reuse detection + full-family revocation (Finding 5)
  • Compromised admin credential creating backdoor admin → mitigated by four-eyes principle (Finding 6)

Renderer SSRF paths:

  • Bug causing renderer to navigate to a crafted URL → mitigated by Playwright request interception allowlist (Finding 4)
  • Report ID injection → mitigated by integer validation + hardcoded URL construction (Finding 4)
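The interception allowlist reduces to a default-deny predicate over the target URL; in the real renderer this predicate would be installed via Playwright route interception (page.route). The host names here are illustrative assumptions:

```python
from urllib.parse import urlsplit

# Illustrative allowlist: the real list enumerates only the origins a report
# page legitimately loads from (frontend, static assets).
ALLOWED_HOSTS = {"frontend", "static-cache"}

def is_allowed(url: str) -> bool:
    """Default-deny: anything not explicitly allowlisted is blocked, which
    covers cloud metadata endpoints and internal admin APIs without having
    to enumerate them as a denylist."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        return False
    return parts.hostname in ALLOWED_HOSTS
```

A denylist of known-bad targets (169.254.169.254, internal hosts) would be fragile; the allowlist fails closed when a new internal service is added.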

Data integrity paths:

  • Shadow prediction leaking into operational response via query bug → mitigated by RLS shadow_segregation policy (Finding 9)
  • NOTAM draft XSS → Playwright PDF renderer execution → mitigated by sanitise_icao() + Jinja2 autoescape (Finding 10)
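For the NOTAM injection path, a plausible shape for the sanitiser is a whitelist character class, so markup can never survive into the Jinja2 template. This is a sketch only; the authoritative sanitise_icao() is specified in §7.4:

```python
import re

# Conservative whitelist: uppercase letters, digits, space, and the few
# punctuation marks that appear in ICAO NOTAM (E) text. Everything else
# (including < and >) is stripped rather than escaped.
_DISALLOWED = re.compile(r"[^A-Z0-9 /\-\.\(\)]")

def sanitise_icao(text: str) -> str:
    """Uppercase the input and strip any character outside the whitelist."""
    return _DISALLOWED.sub("", text.upper()).strip()
```

Stripping (rather than escaping) is the safer choice here because the output is destined for a fixed-field teleprinter format, not HTML.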

Credential rotation paths:

  • HMAC key compromise: attacker forges predictions → mitigated by rotation procedure with hmac_admin role isolation (Finding 1)
  • Space-Track credential rotation creates an undetected ingest gap → mitigated by 10-minute verification step in runbook (Finding 8)

36.3 Security Architecture ADRs

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0007-hmac-rotation-procedure.md | HMAC key rotation with parameterised immutability trigger | hmac_admin role + SET LOCAL spacecom.hmac_rotation flag; dual sign-off required |
| docs/adr/0008-admin-four-eyes.md | Admin role elevation requires four-eyes approval | pending_role_changes table; 30-minute token; second admin must approve |
| docs/adr/0009-shadow-mode-rls.md | Shadow mode segregated at RLS layer, not application layer | shadow_segregation RLS policy; spacecom.include_shadow session variable; admin-only |
| docs/adr/0010-refresh-token-families.md | Refresh token family reuse detection | family_id column; full family revocation on reuse; user email alert |
| docs/adr/0011-mc-blob-proxy.md | MC trajectory blobs proxied server-side, not pre-signed URL | GET /viz/mc-trajectories/{id} backend proxy; MinIO URLs never exposed to browser |
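The ADR 0010 reuse rule can be illustrated with a minimal in-memory sketch; the production store is the refresh_tokens table in §9.2, and the names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TokenRecord:
    family_id: str
    superseded: bool = False
    revoked: bool = False

class RefreshTokenStore:
    """Sketch of the family reuse rule: presenting a superseded token is
    treated as theft, and every token in that family is revoked, including
    the one the attacker (or legitimate client) currently holds."""
    def __init__(self) -> None:
        self.tokens: dict[str, TokenRecord] = {}

    def rotate(self, old: str, new: str) -> None:
        rec = self.tokens[old]
        rec.superseded = True  # old token stays on record to detect reuse
        self.tokens[new] = TokenRecord(family_id=rec.family_id)

    def redeem(self, token: str) -> bool:
        rec = self.tokens.get(token)
        if rec is None or rec.revoked:
            return False
        if rec.superseded:
            # Reuse detected: revoke the whole family.
            for t in self.tokens.values():
                if t.family_id == rec.family_id:
                    t.revoked = True
            return False
        return True
```

Full-family revocation is what makes the scheme robust: whichever of the two parties (attacker or legitimate user) replays the stale token, both sessions die and the user is alerted.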

36.4 Penetration Test Scope (Phase 3)

The Phase 3 external penetration test (referenced in §7.15) must include the following adversarial scenarios derived from this review:

  1. HMAC rotation bypass — attempt to forge a prediction record by exploiting the immutability trigger with and without the hmac_admin role
  2. Pre-signed URL exfiltration — verify that MC blob URLs are not present in any browser-side response; verify pre-signed report URLs cannot be used after 5 minutes
  3. Celery task injection — attempt to enqueue tasks with out-of-range arguments directly via Redis; verify the task validates and rejects them
  4. Playwright SSRF — attempt to trigger renderer navigation to http://169.254.169.254/ (AWS metadata) or http://backend:8000/internal/admin; verify interception blocks both
  5. Refresh token theft simulation — replay a superseded refresh token; verify full family revocation and email alert
  6. Admin privilege escalation — attempt to elevate a viewer account to admin via a single compromised admin account without the four-eyes approval token; verify the attempt is blocked and logged
  7. Shadow mode leak — query GET /decay/predictions as viewer; inject a shadow prediction directly at the DB layer; verify the API response never returns it
  8. NOTAM injection — submit an object with a name containing <script>alert(1)</script> via POST /objects; generate a NOTAM draft; verify PDF render does not execute script

36.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| HMAC rotation trigger | Parameterised SET LOCAL flag scoped to hmac_admin role | Separate migration to drop and recreate trigger | SET LOCAL is session-scoped; cannot be set by application role; minimises window of bypass |
| Family reuse detection | Full family revocation on superseded token reuse | Single token revocation | Full revocation is the only action that guarantees the attacker's session is destroyed even if the legitimate user doesn't notice |
| MC blob delivery | Server-side proxy endpoint | Pre-signed MinIO URL with short TTL | Pre-signed URLs can be shared or logged in browser history; server-side proxy enforces org scoping on every request |
| Admin four-eyes | Email approval token with 30-minute window | Yubikey hardware confirmation | Email approval is achievable without additional hardware; 30-minute window prevents indefinite pending states |
| Shadow RLS | PostgreSQL RLS policy | Application-layer WHERE shadow_mode = FALSE | RLS is enforced by the database engine regardless of query construction; application-layer filters can be omitted by bugs or direct DB queries |

37. Aviation Regulatory / ATM Compliance Review

This section records findings from an ATM systems engineering review against the ICAO/EUROCONTROL regulatory environment that governs ANSP customers. Findings were incorporated into §6.13 (NOTAM format), §6.14 (shadow exit), §6.17 (multi-ANSP panel), §11 (data sources / airspace scope), §16 (prediction conflict), §21 Phase 2 DoD, §27.4 (safety record retention), and §9.2 (schema additions).

37.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | Regulatory classification (EASA IR 2017/373 position) unresolved | §21 Phase 2 DoD + ADR 0012 | Critical |
| 2 | NOTAM format non-compliant with ICAO Annex 15 field formatting | §6.13 — field mapping table, Q-line, YYMMDDHHmm timestamps | High |
| 3 | Re-entry window → NOTAM (B)/(C) mapping not specified | §6.13 — p10 − 30 min / p90 + 30 min rule + cancellation urgency | High |
| 4 | FIR scope excludes SUA, TMAs, oceanic — undisclosed | §11 — airspace scope disclosure; ADR 0014 | Medium |
| 5 | Multi-ANSP coordination panel has no authority/precedence spec | §6.17 — advisory-only banner, retention, WebSocket SLA | Medium |
| 6 | Shadow mode exit criteria not specified | §6.14 — exit criteria table, exit report template | High |
| 7 | Degraded mode disclosure insufficient for ANSP operational use | §9.2 degraded_mode_events table; §14 GET /readyz schema; NOTAM (E) injection | High |
| 8 | GDPR DPA must be signed before shadow mode begins, not Phase 3 | §21 Phase 2 DoD legal gate | High |
| 9 | ESA DISCOS redistribution rights unaddressed | §11 — redistribution rights requirement; §21 Phase 2 DoD | High |
| 10 | Multi-source prediction conflict resolution not specified | §16 — conflict resolution rules; prediction_conflict schema columns | High |
| 11 | Safety-relevant records have no distinct retention policy | §27.4 — safety_record flag; 5-year safety category | Medium |

37.2 Regulatory Framework References

| Framework | Relevance | Position taken |
|---|---|---|
| EASA IR (EU) 2017/373 | Requirements for ATM/ANS providers; may apply if ANSP integrates SpaceCom into operational workflow | Position A: advisory tool; not ATM/ANS provision — documented in ADR 0012 |
| ICAO Annex 15 (AIS) + Appendix 6 | NOTAM format specification | NOTAM drafts now comply with Annex 15 field formatting (§6.13) |
| ICAO Annex 11 (ATS) §2.26 | ATC record retention recommendation | Safety records retained ≥ 5 years (§27.4) |
| ICAO Doc 8400 | ICAO abbreviations and codes used in NOTAM (E) field | sanitise_icao() uses Doc 8400 abbreviation list |
| EUROCONTROL OPADD | Operating Procedures for AIS Dynamic Data; EUR regional NOTAM practice | Q-line format and series conventions follow OPADD (§6.13) |
| GDPR Article 28 | Data processor obligations when processing ANSP staff personal data | DPA must be signed before any ANSP data processing, including shadow mode |
| UN Liability Convention 1972 | 7-year record retention for space object liability claims | reentry_predictions, alert_events retained 7 years (§27.4) |

37.3 Regulatory ADRs

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0012-regulatory-classification.md | EASA IR 2017/373 position | Position A: ATM/ANS Support Tool; decision support only; not ATM/ANS provision; written ANSP agreements required |
| docs/adr/0013-notam-format.md | ICAO Annex 15 NOTAM field compliance | Field mapping table; YYMMDDHHmm timestamps; Q-line QWELW; (B) = p10 − 30 min; (C) = p90 + 30 min |
| docs/adr/0014-airspace-scope.md | Phase 2 airspace data scope | FIR/UIR only (ECAC + US); SUA/TMA/oceanic explicitly out of scope; disclosed in UI; Phase 3 SUA consideration |
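The ADR 0013 window mapping is mechanical once the percentile datetimes are known; a minimal sketch, assuming UTC-aware inputs:

```python
from datetime import datetime, timedelta, timezone

PAD = timedelta(minutes=30)  # operational padding per ADR 0013

def notam_timestamp(dt: datetime) -> str:
    """Format a UTC datetime as the ICAO Annex 15 ten-digit group YYMMDDHHmm."""
    return dt.astimezone(timezone.utc).strftime("%y%m%d%H%M")

def notam_window(p10: datetime, p90: datetime) -> tuple[str, str]:
    """Map the statistical re-entry window to NOTAM fields:
    (B) = p10 - 30 min, (C) = p90 + 30 min."""
    return notam_timestamp(p10 - PAD), notam_timestamp(p90 + PAD)
```

Using p10/p90 rather than a symmetric band around p50 means the padded window tracks the actual (often asymmetric) uncertainty distribution.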

37.4 Compliance Checklist (Phase 2 Gate)

Before the first ANSP shadow deployment:

  • docs/adr/0012-regulatory-classification.md committed and reviewed by aviation law counsel
  • NOTAM draft generator produces ICAO-compliant output (unit test passes Q-line regex and YYMMDDHHmm field checks)
  • Airspace scope disclosure note present in Airspace Impact Panel (Playwright test verifies text)
  • Multi-ANSP coordination advisory-only banner present in panel (Playwright test verifies text)
  • degraded_mode_events table active; transitions logged; GET /readyz response includes degraded_since
  • NOTAM draft (E) field injects degraded-state warning when generated_during_degraded = TRUE (integration test)
  • DPA signed with each ANSP shadow partner; DPA template reviewed by counsel
  • ESA DISCOS redistribution rights clarified in writing; API/report templates updated if required
  • prediction_conflict flag operational; Event Detail page shows ⚠ PREDICTION CONFLICT when set
  • Safety record retention policy active: safety_record = TRUE records excluded from TimescaleDB drop; degraded_mode_events retained 5 years
  • Shadow mode exit report template (docs/templates/shadow-mode-exit-report.md) exists and Persona B can generate statistics from admin panel

37.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Regulatory classification | Position A — advisory, non-safety-critical ATM/ANS Support Tool | Position B — Functional System under IR 2017/373 | Position B would require ED-78A system safety assessment, ATCO HMI compliance, and EASA change management — disproportionate for a decision-support tool where a human verifies all outputs before acting |
| NOTAM timestamp format | YYMMDDHHmm (ICAO Annex 15 §5.1.2) | ISO 8601 YYYY-MM-DDTHH:mmZ | ICAO Annex 15 is unambiguous; ISO 8601 would require the NOTAM office to reformat before issuance |
| NOTAM window mapping | (B) = p10 − 30 min; (C) = p90 + 30 min | (B) = p50 − 3 h; (C) = p50 + 3 h | p10/p90 are the actual statistical bounds; symmetric windows around p50 ignore the often-asymmetric uncertainty distribution |
| Degraded NOTAM warning | Machine-inserted line in (E) field | UI-only warning on the draft page | The (E) field is what the NOTAM office receives; a UI-only warning is lost when the draft is copied to the NOTAM office's system |
| Multi-source conflict | Union of windows when non-overlapping | SpaceCom window always primary regardless | ICAO most-conservative principle; ANSPs must be protected against the case where SpaceCom is wrong and TIP is right |
| Safety record retention | safety_record flag on row; excluded from drop policy | Separate table for safety records | Flag approach avoids data duplication and works with TimescaleDB chunk-level policies; excluded records stay in the same hypertable partition for query performance |

38. Orbital Mechanics / Astrodynamics Review

This section records findings from an astrodynamics specialist review of the physics specification. Findings were incorporated into §15.1 (SGP4 validity gates), §15.2 (NRLMSISE-00 inputs, MC uncertainty model, SRP, integrator config), §15.3 (breakup altitude trigger, material survivability), §15.4 (new — corridor generation algorithm), §15.5 (new — Pc computation method), §17.1 (committed test vectors), §31.1 (BSTAR validation), and the objects/space_weather schema in §9.

38.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | SGP4 validity limits not enforced at query time | §15.1 — epoch age gates, perigee < 200 km routing | High |
| 2 | NRLMSISE-00 input vector under-specified | §15.2 — f107_prior_day, ap_3h_history, Ap vs Kp | High |
| 3 | Ballistic coefficient uncertainty model not specified | §15.2 — C_D/A/m sampling distributions; objects schema | High |
| 4 | Corridor generation algorithm not specified | §15.4 (new) — alpha-shape, 50 km buffer, ≤ 1000 vertices | High |
| 5 | Breakup altitude trigger not specified | §15.3 — 78 km trigger, NASA SBM, material survivability | High |
| 6 | Frame transformation test vectors not committed | §17.1 — 3 required JSON files; fail-not-skip test pattern | Medium |
| 7 | Solar radiation pressure absent from decay predictor | §15.2 — cannonball SRP model, cr_coefficient column | Medium |
| 8 | Pc computation method not specified | §15.5 (new) — Alfano 2D Gaussian, TLE differencing covariance | Medium |
| 9 | Integrator tolerances and stopping criterion not specified | §15.2 — atol=1e-9, rtol=1e-9, max_step=60s, 120-day cap | High |
| 10 | BSTAR validation range excludes valid high-density objects | §31.1 — removed lower floor; warn-not-reject for B* > 0.5 | Medium |
| 11 | NRLMSISE-00 altitude limit and storm handling not specified | §15.2 — 800 km OOD boundary; Kp > 5 storm flag | Medium |

38.2 Physics Model Decisions

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Catalog propagator | SGP4 (sgp4 library) | SP (Special Perturbations) via GMAT | SGP4 is the standard for TLE-based catalog propagation; SP requires a full state vector with covariance — not available from TLEs |
| Decay integrator | DOP853 (RK7/8 adaptive) | RK4 fixed step | DOP853 has embedded error control; RK4 fixed step requires manual step-size management and may miss density variations near perigee |
| Atmospheric model | NRLMSISE-00 | JB2008 (Jacchia-Bowman 2008) | NRLMSISE-00 is well-validated, open-source, and widely used in community tools; JB2008 is more accurate during storms but requires additional data inputs not yet in scope |
| Corridor shape | Alpha-shape (concave hull) | Convex hull | Convex hull overestimates corridor width by 2–5× for elongated re-entry ground tracks; alpha-shape produces tighter, more operationally useful polygons |
| C_D sampling | Uniform(2.0, 2.4) | Fixed value 2.2 | Uniform sampling covers the credible range without assuming a specific distribution; fixed value understates uncertainty |
| SRP model | Cannonball (scalar) | Panelled model | Cannonball model is standard for non-cooperative objects; panelled model requires detailed attitude and geometry data unavailable for most catalog objects |
| Pc method | Alfano 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and the community standard; Monte Carlo Pc added as a Phase 3 consideration for high-Pc events |
| BSTAR lower bound | No lower bound (reject ≤ 0 only) | 0.0001 lower bound | Dense objects (tungsten, stainless steel tanks) can have B* << 0.0001; the previous lower bound would silently reject valid high-density object TLEs |
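The C_D sampling decision above amounts to one uniform draw per Monte Carlo iteration. A sketch of a single draw follows; the A and m spreads shown are illustrative placeholders, since the actual per-object-class distributions are specified in §15.2:

```python
import random

def sample_ballistic_inputs(a_nominal: float, m_nominal: float,
                            rng: random.Random) -> tuple[float, float, float, float]:
    """One Monte Carlo draw of the drag inputs.
    C_D ~ Uniform(2.0, 2.4) per the decision log; the area and mass spreads
    below are assumed for illustration only."""
    c_d = rng.uniform(2.0, 2.4)
    area = a_nominal * rng.uniform(0.9, 1.1)    # assumed +/-10% spread
    mass = m_nominal * rng.uniform(0.95, 1.05)  # assumed +/-5% spread
    b_coeff = c_d * area / mass  # ballistic parameter C_D*A/m (m^2/kg)
    return c_d, area, mass, b_coeff
```

Passing an explicit random.Random keeps each MC run reproducible from its seed, which matters for the replay validation cases.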

38.3 Model Card Additions Required

The following items must be added to docs/model-card-decay-predictor.md:

  • Breakup altitude rationale: 78 km trigger; reference to NASA Debris Assessment Software range (75–80 km for Al structures)
  • Monte Carlo uncertainty model: C_D, A, m sampling distributions and their justifications
  • SRP significance: conditions under which SRP > 5% of drag (area-to-mass > 0.01 m²/kg, altitude > 500 km)
  • NRLMSISE-00 altitude scope: validated 150–800 km; OOD flag above 800 km
  • Geomagnetic storm sensitivity: Kp > 5 triggers storm-period sampling; prediction uncertainty is elevated
  • Corridor generation algorithm: alpha-shape with α = 0.1°, 50 km buffer; reference to alpha-shape literature
  • Pc computation: Alfano 2D Gaussian; TLE differencing covariance; quality flag when < 3 TLEs available
  • SGP4 validity limits: 7-day degraded, 14-day unreliable, 200 km perigee routing to decay predictor
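The SGP4 validity limits in the last item reduce to a small routing function; a sketch, assuming epoch age and perigee are computed upstream:

```python
def sgp4_validity(epoch_age_days: float, perigee_km: float) -> str:
    """Gate per §15.1 (sketch): low-perigee objects are routed to the
    numerical decay predictor; otherwise the TLE epoch age flags the
    propagation as ok, degraded (> 7 days), or unreliable (> 14 days)."""
    if perigee_km < 200.0:
        return "route_to_decay_predictor"
    if epoch_age_days > 14.0:
        return "unreliable"
    if epoch_age_days > 7.0:
        return "degraded"
    return "ok"
```

Routing the perigee check first matters: an object already below 200 km is outside SGP4's drag-model assumptions regardless of how fresh its TLE is.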

38.4 Validation Test Vector Requirements

| File | Required before | Blocking if absent |
|---|---|---|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | Any frame transform code merged | Yes — test fails hard |
| docs/validation/reference-data/sgp4_propagation_cases.json | SGP4 propagator merged | Yes |
| docs/validation/reference-data/iers_eop_case.json | IERS EOP application merged | Yes |
| docs/validation/reference-data/nrlmsise00_density_cases.json | Decay predictor merged | Yes — referenced in §17.3 |
| docs/validation/reference-data/aerospace-corp-reentries.json | Phase 1 backcast validation | Yes for Phase 2 gate |
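"Blocking if absent" means the fail-not-skip pattern: a missing vector file raises a hard failure rather than a pytest skip, so a broken checkout can never silently pass the validation suite. A minimal loader sketch (function name assumed):

```python
import json
from pathlib import Path

def load_reference_vectors(path: str) -> list:
    """Load committed test vectors; a missing file is a hard test failure.
    Deliberately raises instead of calling pytest.skip so CI goes red."""
    p = Path(path)
    if not p.exists():
        raise AssertionError(
            f"Reference data {path} is missing; commit the test vectors "
            "before merging. This test must fail, not skip."
        )
    return json.loads(p.read_text())
```

Any validation test then starts with this loader, inheriting the fail-hard behaviour without per-test boilerplate.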

39. API Design / Developer Experience Review

This section records findings from a senior API design review. Findings were incorporated into §9.2 (new jobs and idempotency_keys tables; expanded api_keys schema), §14 (canonical pagination envelope, error schema, rate limit 429 body, async job lifecycle, ephemeris validation, WebSocket token refresh, WebSocket protocol versioning, field naming convention, GET /readyz in OpenAPI, API key auth model).

39.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | Pagination envelope not canonical across endpoints | §14 — PaginatedResponse[T], data key, total_count: null | High |
| 2 | Error response shape inconsistent; no error code registry | §14 — SpaceComError base, RequestValidationError override, registry table | High |
| 3 | Async job lifecycle for POST /decay/predict not specified | §14 — 202 response, /jobs/{id} endpoint; §9.2 — jobs table | High |
| 4 | WebSocket token refresh path not specified | §14 — TOKEN_EXPIRY_WARNING, AUTH_REFRESH, close codes 4001/4002 | High |
| 5 | Idempotency keys not specified for mutation endpoints | §14 — idempotency spec; §9.2 — idempotency_keys table | Medium |
| 6 | 429 missing Retry-After header and structured body | §14 — retryAfterSeconds body field, Retry-After header spec | Medium |
| 7 | Ephemeris endpoint lacks time range and step validation | §14 — 4-row validation table with error codes | Medium |
| 8 | WebSocket protocol versioning not specified | §14 — ?protocol_version=N, deprecation warning event, sunset close code | Medium |
| 9 | Field naming convention not decided | §14 — APIModel base class, alias_generator=to_camel | Medium |
| 10 | GET /readyz not in OpenAPI spec | §14 — tags=["System"] decorated endpoint | Low |
| 11 | API key auth model, rate limits, and scope not specified | §14 — apikey_ prefix, independent buckets, allowed_endpoints scope | High |

39.2 Developer Experience Contracts

The following contracts are enforced by CI and must not be broken without an ADR:

| Contract | Enforcement |
|---|---|
| All list endpoints return {"data": [...], "pagination": {...}} | OpenAPI CI check: list-tagged endpoints validated against PaginatedResponse schema |
| All errors return {"error": "...", "message": "...", "requestId": "..."} | AST/grep CI check: HTTPException and JSONResponse must reference registry codes |
| POST endpoints returning async jobs return 202 with statusUrl | OpenAPI CI check: endpoints tagged async validated for 202 response schema |
| 429 responses include Retry-After header | Integration test: rate-limited request asserts Retry-After header present |
| Idempotency-Key header documented for mutation endpoints | OpenAPI CI check: endpoints tagged mutation declare the header parameter |
| GET /readyz is in the OpenAPI spec | Schema validation: readyz path present in generated openapi.json |
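The cursor behind the pagination contract can be sketched as a base64url wrapper over the compound (created_at, id) key; the exact payload layout below is an assumption, the point is that the cursor is opaque to clients:

```python
import base64
import json
from datetime import datetime, timezone

def encode_cursor(created_at: datetime, row_id: int) -> str:
    """Opaque compound cursor over (created_at, id). base64url keeps it
    URL-safe and discourages clients from parsing or constructing it."""
    payload = json.dumps([created_at.isoformat(), row_id])
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> tuple[datetime, int]:
    ts, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(ts), int(row_id)
```

The compound key is what makes the cursor stable under batch ingest: ties on created_at are broken by id, so no row is skipped or repeated across pages.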

39.3 New Endpoints Added

| Endpoint | Role | Purpose |
|---|---|---|
| GET /jobs/{job_id} | viewer (own jobs only) | Poll async job status; returns resultUrl on completion |
| DELETE /jobs/{job_id} | viewer (own jobs only) | Cancel a queued job (no effect if already running) |

39.4 New API Guide Documents Required

| Document | Content |
|---|---|
| docs/api-guide/conventions.md | camelCase rule, APIModel base class, error envelope, request ID tracing |
| docs/api-guide/pagination.md | Cursor encoding, total_count: null rationale, empty result shape |
| docs/api-guide/error-reference.md | Canonical error code registry with HTTP status, description, recovery action |
| docs/api-guide/idempotency.md | Idempotency key protocol, 24h TTL, replay header, in-progress behaviour |
| docs/api-guide/async-jobs.md | Job lifecycle, WebSocket vs polling, recommended poll interval |
| docs/api-guide/websocket-protocol.md | Protocol version history, token refresh flow, close codes, reconnection |
| docs/api-guide/api-keys.md | Key creation, apikey_ prefix, scope, independent rate limits |

39.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Pagination key | data | items, results | data is the most common convention (JSON:API, GitHub API, Stripe); items is ambiguous with Python iterables |
| total_count | Always null | Compute count on every list request | COUNT(*) on a 7-year-retention hypertable can be a full scan; cursor pagination does not need count; document the omission |
| Error base model | SpaceComError with requestId | Per-endpoint error types | Uniform shape allows generic client error handling; requestId enables log correlation without exposing internals |
| Field naming | camelCase via alias_generator | snake_case (Python default) | Frontend and API consumer convention is camelCase; populate_by_name=True keeps internal code readable |
| Async job surface | /jobs/{id} unified endpoint | Per-type endpoints (/decay/{id}, /reports/{id}) | Unified job surface simplifies client polling logic; type-specific result URLs are returned in resultUrl field |
| WebSocket close codes | 4001 auth expiry, 4002 protocol deprecated | Generic 1008 for all auth failures | Application-specific close codes enable clients to take the correct action (refresh token vs. upgrade protocol) without scraping close reason text |
| Idempotency TTL | 24 hours | 1 hour, 7 days | 24 hours covers retry windows caused by network outages, client restarts, and overnight batch jobs; longer risks unbounded table growth |
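The alias generator behind the field-naming decision is a one-line transform; pydantic v2 ships an equivalent as pydantic.alias_generators.to_camel, shown here standalone for clarity:

```python
def to_camel(name: str) -> str:
    """snake_case -> camelCase, the alias_generator convention in §14.
    The first segment stays lowercase; subsequent segments are capitalised."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)
```

With populate_by_name=True on the model config, internal code keeps snake_case attribute access while serialised payloads expose the camelCase aliases.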

40. Commercial Strategy Review

SpaceCom is a standalone commercial product. Institutional procurements (ESA STAR #182213 and similar) are market opportunities pursued with existing capabilities — the product is not built to suit any single bid. This section records findings from a commercial strategy review; incorporations are in the product and architecture sections, not in bid-specific requirements.

40.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | ESA bid requirements not mapped to plan | Scoped as per-bid process only — docs/bid/ created per procurement opportunity, not a structural plan requirement | Critical (clarified) |
| 2 | Zero Debris Charter compliance output format not specified | §6 — Controlled Re-entry Planner compliance report spec, Pc_ground, compliance_report_url | High |
| 3 | No commercial tier structure | §9.2 — subscription_tier, subscription_status on organisations; tier table defined | High |
| 4 | Competitive differentiation not anchored to maintained capabilities | §23.4 — maintained capabilities table; docs/competitive-analysis.md quarterly review | Medium |
| 5 | Shadow trial-to-operational conversion not specified | §6.14 — conversion path, offer package, subscription_status transitions, 2-concurrent-deployment cap | High |
| 6 | Delivery schedule vs. procurement milestones | Light touch: per-procurement milestone reconciliation doc created at bid time; not a structural plan requirement | High (scoped) |
| 7 | No customer-facing SLA | §26.1 — SLA schedule table in MSA; measurement methodology; service credits | High |
| 8 | Data residency requirements not addressed | §29.5 — EU default hosting; on-premise option; hosting_jurisdiction column; subprocessor disclosure | High |
| 9 | Space-Track AUP conditional architecture not specified | §11 — Path A/B conditional architecture; ADR 0016; Phase 1 architectural decision gate | High |
| 10 | No Acceptance Test Procedure specification | §21 Phase 3 DoD — ATP requirement; independent evaluator; docs/bid/acceptance-test-procedure.md | Medium |
| 11 | Go-to-market sequence not validated against resource constraints | §6.14 — 2-concurrent-shadow cap; integration lead assignment; onboarding package spec | Medium |

40.2 Commercial Tier Structure

| Tier | Customer | Feature access | Pricing model |
|---|---|---|---|
| Shadow Trial | ANSP (pre-commercial) | Full aviation portal; shadow mode only; 90-day maximum; 2 concurrent deployments maximum | Free — bilateral agreement or institutional funding |
| ANSP Operational | ANSP (post-shadow) | Full aviation portal; live alerts; NOTAM drafting; multi-ANSP coordination | Annual SaaS subscription per ANSP (seat-unlimited within org) |
| Space Operator | Satellite operators | Space portal; decay prediction; conjunction; CCSDS export; API access | Per-object-per-month or flat subscription with object cap |
| Institutional | ESA, national agencies, research | Full access; data export; API; bulk historical; on-premise deployment option | Bilateral contract or grant-funded; source code escrow option |

Tier is stored in organisations.subscription_tier. Tier-based feature gating added to RBAC: e.g., shadow_trial orgs cannot activate live alert delivery to external systems.
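Tier-based gating can be sketched as a feature lookup consulted by the RBAC layer; the feature names below are illustrative, and the authoritative tier definitions are the table above:

```python
# Illustrative gate table keyed by organisations.subscription_tier.
TIER_FEATURES = {
    "shadow_trial": {"aviation_portal", "shadow_alerts"},
    "ansp_operational": {"aviation_portal", "shadow_alerts",
                         "live_alert_delivery", "notam_drafting"},
}

def feature_enabled(tier: str, feature: str) -> bool:
    """RBAC helper: unknown tiers get no features (fail closed)."""
    return feature in TIER_FEATURES.get(tier, set())
```

Failing closed on unknown tiers means a mis-set subscription_tier value disables features rather than accidentally granting live alert delivery to a trial org.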

40.3 Procurement Readiness Process

For each institutional procurement opportunity pursued:

  1. Create docs/bid/{procurement-id}/traceability.md — maps the procurement's SoR requirements to existing MASTER_PLAN.md section(s); gaps marked NOT MET or PARTIALLY MET
  2. Create docs/bid/{procurement-id}/milestone-reconciliation.md — maps procurement milestones (KO, PDR, CDR, AT) to SpaceCom phase completion dates
  3. Run ATP (docs/bid/acceptance-test-procedure.md) on the staging environment before submission
  4. Create docs/bid/{procurement-id}/kpi-and-validation-plan.md — maps tender KPIs to replay cases, conservative baselines, evidence artefacts, and any partner-supplied validation input still required
  5. Update docs/competitive-analysis.md to confirm differentiation claims are current

This is a per-opportunity process maintained by the product owner — it does not drive changes to the core plan unless a genuine product gap is identified.

40.4 Customer Onboarding Specification

| Artefact | Location | Purpose |
|---|---|---|
| ANSP onboarding checklist | docs/onboarding/ansp-onboarding-checklist.md | Integration lead walkthrough; environment setup; FIR configuration; user training |
| Admin setup guide | docs/onboarding/admin-setup.md | Persona D configuration; shadow mode activation; user provisioning |
| Shadow exit report template | docs/templates/shadow-mode-exit-report.md | Statistics + ANSP Safety Department sign-off |
| Commercial offer template | docs/templates/commercial-offer-ansp.md | Auto-populated from org data; sent at shadow exit |

40.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Plan structure vs. bid | Product-first; bid traceability is a per-opportunity overlay | Restructure plan around ESA SoR | SpaceCom serves multiple market segments; structuring around one procurement creates lock-in and excludes ANSP and space operator commercial pathways |
| Default hosting jurisdiction | EU (eu-central-1) | US-based hosting | ECAC ANSP customers are predominantly EU/UK; EU hosting satisfies data residency without per-customer complexity |
| Shadow deployment cap | 2 concurrent | Unlimited | Each shadow deployment requires a dedicated integration lead for 90 days; 2 concurrent is the realistic Phase 2 capacity without specialist hiring |
| Space-Track AUP gate | Phase 1 architectural decision | Phase 2 clarification | The shared vs. per-org ingest architecture is a fundamental Phase 1 design choice; deferring to Phase 2 would require rearchitecting already-shipped code |
| SLA in MSA | Separate SLA schedule versioned independently | Inline in MSA body | SLA values change more frequently than contract terms; versioned schedule allows SLA updates without full MSA re-execution |

41. Database Engineering Review

41.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | tle_sets BIGSERIAL PK incompatible with TimescaleDB hypertable uniqueness requirement | High | §9.2 tle_sets |
| 2 | TEXT enum columns lacking CHECK constraints (12 columns across 7 tables) | High | §9.2 all affected tables |
| 3 | asyncpg prepared statement cache conflicts with PgBouncer transaction mode | High | §9.4 |
| 4 | prediction_outcomes.prediction_id and alert_events.prediction_id typed INTEGER; references BIGSERIAL column | Medium | §9.2 |
| 5 | idempotency_keys already has composite PRIMARY KEY — confirmed safe; upsert pattern documented | N/A (already correct) | §9.2 |
| 6 | Mixed GEOGRAPHY/GEOMETRY types break GiST index selectivity on cross-table spatial joins | Medium | §9.3 |
| 7 | acknowledged_by and reviewed_by FKs block GDPR erasure with default RESTRICT | Medium | §9.2 |
| 8 | Mutable tables missing updated_at column and trigger | Medium | §9.2 |
| 9 | DB password rotation procedure killed in-flight transactions via hard restart | Medium | §7.5 |
| 10 | tle_sets chunk interval (7 days) too small; poor compression ratio for ingest rate | Low | §9.4 |
| 11 | Missing partial indexes on hot-path filtered queries (jobs, refresh_tokens, idempotency_keys, alert_events) | Low | §9.3 |

41.2 Schema Integrity Rules

Rules enforced after this review:

  1. Hypertable natural keys — No surrogate BIGSERIAL PK on hypertables. Reference tle_sets rows by (object_id, ingested_at). If a surrogate is needed, use UNIQUE (surrogate_id, partition_col) composite.
  2. CHECK constraints mandatory — Every TEXT column with a finite valid value set must have a CHECK (col IN (...)) constraint. Application-layer validation is supplemental, not primary.
  3. asyncpg pool config — prepared_statement_cache_size=0 must be set on all async engine instances. Enforced by a test that creates a test engine and asserts the connect_arg is present.
  4. BIGINT FK parity — Any FK referencing a BIGSERIAL column must be BIGINT. Linted in CI via a custom Alembic migration checker.
  5. Spatial type discipline — Every ST_Intersects / ST_Contains call mixing GEOGRAPHY and GEOMETRY sources must include an explicit ::geometry cast on the GEOGRAPHY operand. Linted via ruff custom rule.
  6. ON DELETE SET NULL on audit FKs — FKs in audit/safety tables (security_logs, alert_events.acknowledged_by, notam_drafts.reviewed_by) use ON DELETE SET NULL. Hard DELETE on users is reserved for GDPR erasure only; see §29.
  7. updated_at trigger — All mutable (non-append-only) tables must have updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() and a BEFORE UPDATE trigger using set_updated_at(). Append-only tables (those with prevent_modification() trigger) are excluded.
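Rule 4's CI check can be sketched as a small stdlib linter. This is a minimal illustration, not the production checker (which hooks into Alembic); `lint_migration_sql` and its regex are hypothetical names introduced here.

```python
import re

# Hypothetical CI lint for Rule 4: flag INTEGER columns that carry a
# REFERENCES clause, since any FK referencing a BIGSERIAL parent must be
# BIGINT. Flagged columns are reviewed manually against the parent schema.
FK_INTEGER = re.compile(
    r"^\s*(\w+)\s+INTEGER\b[^,]*\bREFERENCES\b",
    re.IGNORECASE | re.MULTILINE,
)

def lint_migration_sql(sql: str) -> list[str]:
    """Return names of INTEGER FK columns that should likely be BIGINT."""
    return FK_INTEGER.findall(sql)

ddl = """
CREATE TABLE prediction_outcomes (
    id BIGSERIAL PRIMARY KEY,
    prediction_id INTEGER NOT NULL REFERENCES predictions(id),
    reviewed_by BIGINT REFERENCES users(id) ON DELETE SET NULL
);
"""
assert lint_migration_sql(ddl) == ["prediction_id"]
```

A regex pass over migration SQL is deliberately coarse: it cannot know the referenced column's type, so it flags candidates for review rather than failing silently-correct columns.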

41.3 GDPR Erasure Procedure (users table)

Per Finding 7 — a hard DELETE FROM users WHERE id = $1 is not the correct GDPR erasure mechanism. The correct procedure:

  1. Null out PII columns: UPDATE users SET email = 'erased-' || id || '@erased.invalid', password_hash = 'ERASED', mfa_secret = NULL, mfa_recovery_codes = NULL, tos_accepted_ip = NULL WHERE id = $1
  2. Security logs, alert acknowledgements, and NOTAM review records are preserved with user_id = NULL (ON DELETE SET NULL handles this automatically if a hard DELETE is later required by specific legal instruction)
  3. Log the erasure in security_logs with event_type = 'GDPR_ERASURE' before nulling
  4. The users row itself is retained as a tombstone (email contains the erased marker) — this preserves referential integrity for organisation_id links and prevents FK violations in tables without SET NULL

Full procedure: docs/runbooks/gdpr-erasure.md (Phase 2 gate, per §29).
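The ordering constraint in the procedure (log the erasure before nulling PII) can be sketched as follows. The statement text and the `execute` callback are illustrative only; the production implementation lives in the runbook above.

```python
# Minimal sketch of the §41.3 erasure order. `gdpr_erase` is a hypothetical
# helper; `execute` stands in for a DB session's statement executor.
def gdpr_erase(user_id: int, execute) -> str:
    tombstone_email = f"erased-{user_id}@erased.invalid"
    # Step 3 first: record the erasure while the identity still exists.
    execute(
        "INSERT INTO security_logs (event_type, user_id) "
        "VALUES ('GDPR_ERASURE', $1)",
        user_id,
    )
    # Step 1: null out PII, leaving the row as a tombstone. Steps 2 and 4
    # need no action here: audit FKs rely on ON DELETE SET NULL, and the
    # users row is retained, so no hard DELETE is issued.
    execute(
        "UPDATE users SET email = $1, password_hash = 'ERASED', "
        "mfa_secret = NULL, mfa_recovery_codes = NULL, "
        "tos_accepted_ip = NULL WHERE id = $2",
        tombstone_email, user_id,
    )
    return tombstone_email

executed = []
email = gdpr_erase(42, lambda sql, *args: executed.append(sql))
assert email == "erased-42@erased.invalid"
assert executed[0].startswith("INSERT INTO security_logs")
```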

41.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Hypertable surrogate key | Remove BIGSERIAL; use UNIQUE(object_id, ingested_at) | Add UNIQUE(id, ingested_at) composite | Natural key is semantically stable and meaningful; composite surrogate is confusing and rarely queried by raw id |
| CHECK constraints vs. Postgres ENUM | CHECK (col IN (...)) | CREATE TYPE ENUM | CHECK constraints are simpler to extend in migrations (no ALTER TYPE ADD VALUE); ENUM changes require pg_dump for type renaming |
| GDPR erasure | Tombstone update, not hard DELETE | Hard DELETE with CASCADE | Hard DELETE cascades into safety records (NOTAM drafts, alert logs) that must be retained under EASA/ICAO safety record requirements; tombstone preserves the record while removing identity |
| Spatial type mixing | Explicit ::geometry cast; document in §9.3 | Migrate all columns to GEOGRAPHY | Airspace GEOMETRY gives 3× ST_Intersects speedup for regional FIR queries; global corridors correctly use GEOGRAPHY; cast is cheap and safe |

42. Test Engineering / QA Review

42.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | No formal test pyramid with per-layer coverage gates | High | §33.10 |
| 2 | No database isolation strategy for integration tests | High | §33.10 |
| 3 | Hypothesis property-based tests unspecified | High | §33.10 table, §12 |
| 4 | WebSocket test strategy missing | High | §33.10 table, §12 |
| 5 | Playwright E2E tests lack data-testid selector convention | Medium | §33.9 |
| 6 | No smoke test suite for post-deploy verification | Medium | §12, §33.10 |
| 7 | No flaky test policy | Medium | §33.10 |
| 8 | Contract tests lack value-range assertions | Medium | DoD checklists |
| 9 | Celery task timeout → jobs state transition untested; no orphan cleanup | Medium | §7.12 |
| 10 | MC simulation test data generation strategy not specified | Low | §15.4 |
| 11 | Accessibility testing not integrated into CI with implementation spec | Low | §6.16 |

42.2 Test Suite Inventory

Full test suite after this review:

tests/
  conftest.py              # db_session (SAVEPOINT); testcontainers for Celery tests; pytest.ini markers
  physics/
    test_frame_utils.py    # Vallado reference cases — all BLOCKING
    test_propagator/       # SGP4 state vectors — BLOCKING
    test_decay/            # Decay predictor backcast — Phase 2+
    test_nrlmsise.py       # NRLMSISE-00 density reference — BLOCKING
    test_hypothesis.py     # Hypothesis property-based invariants — BLOCKING
    test_mc_corridor.py    # MC seeded RNG corridor — Phase 2+
    test_breakup/          # Breakup energy conservation — Phase 2+
  test_integrity.py        # HMAC sign/verify/tamper — BLOCKING
  test_auth.py             # JWT; MFA; rate limiting — BLOCKING
  test_rbac.py             # Every endpoint × every role — BLOCKING
  test_websocket.py        # WS lifecycle; sequence replay; close codes — BLOCKING
  test_ingest/
    test_contracts.py      # Space-Track + NOAA key + value range — BLOCKING (mocked)
  test_spaceweather/       # Space weather ingest logic
  test_jobs/
    test_celery_failure.py # Timeout → failed; orphan recovery — BLOCKING
  smoke/                   # Post-deploy; idempotent; ≤ 2 min — BLOCKING post-deploy
  quarantine/              # Flaky tests awaiting fix; non-blocking nightly only
  e2e/                     # Playwright; 5 user journeys + axe WCAG 2.1 AA — BLOCKING
    test_accessibility.ts  # axe-core scan on every primary view; fails PR on any WCAG 2.1 AA violation
    test_alert_websocket.ts  # submit prediction → Celery completes → CRITICAL alert in browser via WS (F9)
  load/                    # k6 performance scenarios — non-blocking (nightly)

Accessibility test specification (F11):

e2e/test_accessibility.ts uses @axe-core/playwright to scan each primary view on every PR:

import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

const VIEWS_TO_SCAN = [
  '/',                          // Operational Overview
  '/events',                    // Active Events
  '/events/[sample-id]',        // Event Detail
  '/handover',                  // Shift Handover
  '/space/objects',             // Space Operator Overview
];

for (const view of VIEWS_TO_SCAN) {
  test(`WCAG 2.1 AA: ${view}`, async ({ page }) => {
    await page.goto(view);
    const results = await new AxeBuilder({ page })
      .withTags(['wcag2a', 'wcag2aa'])
      .analyze();
    // Any wcag2a/wcag2aa violation fails the PR; the full result set is
    // serialised into the a11y-report.html CI artefact.
    expect(results.violations).toEqual([]);
  });
}

CI gate: any axe-core violation at wcag2a or wcag2aa level fails the PR. wcag2aaa violations are reported as warnings only. Results published as CI artefact (a11y-report.html).

WebSocket alert delivery E2E test (F9): e2e/test_alert_websocket.ts is a BLOCKING E2E test that verifies the full end-to-end path from prediction submission to browser alert receipt. This test requires the full stack (Celery workers running, WebSocket server live):

// e2e/test_alert_websocket.ts
import { test, expect } from '@playwright/test';

test('CRITICAL alert appears in browser via WebSocket after prediction job completes', async ({ page }) => {
  // 1. Authenticate as operator
  await page.goto('/login');
  await page.fill('[name=email]', process.env.E2E_OPERATOR_EMAIL!);
  await page.fill('[name=password]', process.env.E2E_OPERATOR_PASSWORD!);
  await page.click('[type=submit]');
  await page.waitForURL('/');

  // 2. Submit a decay prediction via the API that will produce a CRITICAL alert.
  //    page.request shares the browser context's cookies and baseURL, so no
  //    manual Cookie-header assembly is needed.
  const submit = await page.request.post('/api/v1/decay/predict', {
    data: { norad_id: 90001, mc_samples: 50 },  // test object; always produces CRITICAL
  });
  const job = await submit.json();

  // 3. Wait for the CRITICAL alert banner to appear in the browser (max 60s)
  await expect(page.locator('[role="alertdialog"][data-severity="CRITICAL"]'))
    .toBeVisible({ timeout: 60_000 });

  // 4. Assert the alert references our prediction
  const alertText = await page.locator('[role="alertdialog"]').textContent();
  expect(alertText).toContain('90001');
});

The 60-second timeout covers: Celery task queue, MC computation (50 samples), alert threshold evaluation, WebSocket push to all org subscribers, React state update, and DOM render. If this test fails intermittently, the failure is investigated as a potential latency regression — it must not be moved to quarantine/ without a root-cause investigation.

Manual screen reader test (release checklist — not automated):

  • NVDA + Firefox (Windows): primary operator workflow (alert receipt → acknowledgement → NOTAM draft)
  • VoiceOver + Safari (macOS): same workflow
  • Keyboard-only: full workflow without mouse
  • Added to release gate checklist in docs/RELEASE_CHECKLIST.md

42.3 Hypothesis Invariant Specifications

Minimum 5 required Hypothesis properties in tests/physics/test_hypothesis.py:

| Property | Strategy | Assertion | max_examples |
|---|---|---|---|
| SGP4 round-trip position | Random valid TLE orbital elements | Forward propagate T days then back; position error < 1 m | 200 |
| p95 corridor containment | Seeded MC ensemble (seed=42, N=500) | Corridor contains ≥ 95% of input trajectories | 50 |
| NRLMSISE-00 density positive | Random altitude 100–800 km, valid F10.7/Ap | Density always > 0 kg/m³ | 500 |
| RLS tenant isolation | Two different organisation IDs | Session set to org A never returns rows for org B | 100 |
| Pagination non-overlap | Cursor pagination with random page sizes | Pages are non-overlapping and cover full dataset | 100 |
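The last property in the table can be illustrated with a stdlib-only stand-in (the real test uses Hypothesis strategies; the `paginate` helper here is hypothetical and only sketches keyset pagination):

```python
import random

# Stdlib sketch of the "pagination non-overlap" property: for random page
# sizes, cursor pages must be disjoint, ordered, and cover the dataset.
def paginate(rows: list[int], page_size: int) -> list[list[int]]:
    """Keyset-style pagination: each page starts after the last seen id."""
    pages, cursor = [], None
    while True:
        page = [r for r in rows if cursor is None or r > cursor][:page_size]
        if not page:
            return pages
        pages.append(page)
        cursor = page[-1]

rng = random.Random(42)          # seeded, like the MC tests
rows = sorted(rng.sample(range(10_000), 500))
for _ in range(100):             # rough analogue of max_examples=100
    pages = paginate(rows, rng.randint(1, 50))
    flat = [r for p in pages for r in p]
    assert flat == rows          # full coverage, no overlap, stable order
```

Hypothesis adds shrinking and strategy-driven input generation on top of the same invariant; the assertion itself is unchanged.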

42.4 MC Corridor Test Data Specification

Reference data committed to docs/validation/reference-data/:

| File | Contents | Regeneration |
|---|---|---|
| mc-ensemble-params.json | RNG seed=42, object params, generation timestamp | Never change seed; add to file if params change |
| mc-corridor-reference.geojson | Pre-computed p95 corridor polygon | Run python tools/generate_mc_reference.py after algorithm change; review diff before committing |

Test asserts area delta < 5% between computed and reference polygon. If the algorithm changes, the reference polygon must be explicitly regenerated and the change logged in CHANGELOG.md.
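The area-delta assertion can be sketched in plain Python, assuming planar coordinates for illustration (production code compares real geodesic corridor polygons in PostGIS; `shoelace_area` and `area_delta_ok` are hypothetical helpers):

```python
# Sketch of the < 5% area-delta regression check against the committed
# reference polygon. Planar shoelace area is used purely for illustration.
def shoelace_area(poly: list[tuple[float, float]]) -> float:
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1]
                   for i in range(n))) / 2

def area_delta_ok(computed, reference, tolerance=0.05) -> bool:
    ref = shoelace_area(reference)
    return abs(shoelace_area(computed) - ref) / ref <= tolerance

reference = [(0, 0), (10, 0), (10, 4), (0, 4)]      # area 40
computed = [(0, 0), (10, 0), (10, 4.1), (0, 4.1)]   # area 41 → 2.5% delta
assert area_delta_ok(computed, reference)
assert not area_delta_ok([(0, 0), (10, 0), (10, 5), (0, 5)], reference)
```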

42.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| DB isolation | SAVEPOINT for unit/single-connection; testcontainers for Celery | Shared test DB with cleanup | SAVEPOINT is zero-overhead and perfectly isolated; testcontainers gives true process isolation for multi-connection Celery tests without manual teardown |
| Flaky test policy | Quarantine after 2 failures in 30 days; delete if unfixed > 14 days | Retry flaky tests automatically | Auto-retry masks root causes; quarantine with mandatory resolution timeline creates accountability |
| Hypothesis in blocking CI | Yes, max_examples ≥ 200 for physics | Optional/nightly only | Safety-critical physics invariants must be checked on every commit; 200 examples adds < 30s to CI at default shrink settings |
| MC test data | Seeded RNG + committed reference polygon | Committed raw trajectory arrays | Raw arrays are large (~MB); seeded RNG is deterministic and tiny; committed polygon provides a stable regression target |
| data-testid convention | Mandatory for all Playwright targets; CSS class selectors forbidden | Allow CSS class selectors | CSS classes are refactoring artefacts; data-testid is stable across UI refactors and explicitly documents test intent |
| Smoke test gate | Blocking post-deploy, not blocking pre-deploy CI | Block pre-deploy CI | Smoke tests require a running stack; pre-deploy CI has no stack. Post-deploy gate means deployment rollback is the recovery action for smoke failure |
| Accessibility CI gate | axe-core wcag2a + wcag2aa violations block PR; wcag2aaa warnings only | Manual testing only | Manual testing is too slow and inconsistent for PR-level feedback; automated axe-core catches ~57% of WCAG issues at zero marginal cost; manual screen reader testing reserved for release gate |

43. Observability / Monitoring Engineering Review

43.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | Per-object Gauge labels cause alert flooding (600 pages for one outage) | High | §26.7 — recording rules added |
| 2 | No structured logging format specification | High | §7.14, §10 |
| 3 | No distributed tracing (OpenTelemetry) | High | §26.7, §10 |
| 4 | AlertManager rules have semantic errors; no runbook links | High | §26.7 — rules rewritten |
| 5 | No log aggregation stack specified | Medium | §3.2, §10 |
| 6 | Celery queue depth and DLQ depth metrics not defined | Medium | §26.7 |
| 7 | SLIs not formally instrumented against SLOs | Medium | §26.7 — recording rules |
| 8 | No request_id / trace_id correlation between logs and metrics | Medium | §7.14 |
| 9 | Prometheus scrape configuration not specified | Medium | §26.7 |
| 10 | Renderer service has no functional health check or metrics | Medium | §26.5 |
| 11 | No on-call rotation spec or AlertManager escalation routing | Medium | §26.8 |

43.2 Observability Stack Summary

After this review the full observability stack is:

| Layer | Tool | Phase |
|---|---|---|
| Metrics | Prometheus + prometheus-fastapi-instrumentator | 1 |
| Alerting | AlertManager with runbook_url annotations | 1 |
| Dashboards | Grafana (4 dashboards) | 2 |
| Structured logs | structlog JSON with required fields + sanitiser | 1 |
| Log aggregation | Grafana Loki + Promtail (Docker log scrape) | 2 |
| Distributed tracing | OpenTelemetry → Grafana Tempo | 2 |
| On-call routing | PagerDuty/OpsGenie via AlertManager L1/L2/L3 tiers | 2 |

43.3 Alert Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| rate(counter[Xm]) > 0 | increase(counter[Xm]) >= N — rate() is per-second and stays positive once the counter increments |
| Alert directly on spacecom_tle_age_hours{norad_id=...} | Alert on spacecom:tle_stale_objects:count recording rule — prevents 600-alert floods |
| AlertManager rule with no annotations.runbook_url | Every rule must include runbook_url pointing to the relevant runbook in docs/runbooks/ |
| Grafana dashboard as sole incident channel | All CRITICAL alerts also page via PagerDuty; dashboards are diagnosis tools, not alert channels |

43.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Log aggregation | Grafana Loki | ELK stack | Loki is 10× cheaper to operate (no full-text index); Prometheus labels for log querying are sufficient for this workload; co-deploys with existing Grafana without separate ES cluster |
| Tracing backend | Grafana Tempo | Jaeger | Tempo co-deploys with Grafana/Loki with no separate storage; native Grafana datasource; OTLP ingest; no query language to learn |
| Per-object label strategy | Keep labels for Grafana; alert on recording rule aggregates | Remove per-object labels | Per-object drill-down in Grafana dashboards is operationally valuable; the alert flooding problem is solved by recording rules, not by removing labels |
| Structured logging library | structlog | Python standard logging + JSON formatter | structlog integrates natively with contextvars for request_id propagation; the context binding pattern is cleaner than threading.local |
| Renderer health check | Functional Chromium launch test | Process liveness only | Chromium hanging without crashing is a known Playwright failure mode; process liveness gives false confidence; functional check is the only reliable signal |
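The contextvars binding pattern behind the structlog decision (and the request_id correlation in Finding 8) can be shown with the standard library alone; this is a stand-in sketch, not the structlog configuration itself, and `RequestIdFilter` is a hypothetical name:

```python
import contextvars
import logging

# Stdlib sketch of per-request context binding: set request_id once at the
# request boundary; every log record emitted in that context carries it,
# enabling log/metric/trace correlation. structlog does this natively.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

request_id_var.set("req-123")
record = logging.LogRecord("spacecom", logging.INFO, __file__, 1,
                           "prediction queued", None, None)
assert RequestIdFilter().filter(record)
assert record.request_id == "req-123"
```

Because `ContextVar` is copied per async task, concurrent requests keep separate request_ids without any `threading.local` plumbing.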

44. Frontend Architecture Review

44.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No documented decision on Next.js App Router vs Pages Router; component boundary ("use client") placement unspecified | Medium | §13.1 — App Router confirmed; "use client" at app/(globe)/layout.tsx boundary |
| 2 | CesiumJS requires 'unsafe-eval' in CSP for GLSL shader compilation; existing policy blocks the globe | High | §7.7 — two-tier CSP; 'unsafe-eval' scoped to app/(globe)/ routes only |
| 3 | Globe WebGL crash removes alert panel from DOM; CesiumJS WebGL context loss is unhandled | High | §13.1 — GlobeErrorBoundary wrapping only the globe canvas; alert panel in separate PanelErrorBoundary |
| 4 | CesiumJS entity memory leak: unbounded entity accumulation causes WebGL OOM and renderer crash | Medium | §13.1 — max 500 entities; 96h orbit path limit; stale entity pruning on update |
| 5 | WebSocket reconnection strategy unspecified; naive reconnect causes thundering-herd on server restart | Medium | §13.1 — exponential backoff with ±20% jitter; RECONNECT config object; max 30s delay |
| 6 | No TanStack Query key management strategy; ad-hoc key strings cause cache stampedes and stale data | Medium | §13.1 — queryKeys key factory pattern; all query keys centralised in src/lib/queryKeys.ts |
| 7 | Safety-critical panels (alert list, corridor map) have no loading/empty/error state specification | High | §13.1 — explicit state matrix per panel; alert panel must show degraded-data warning on stale WebSocket |
| 8 | LIVE/SIMULATION/REPLAY mode isolation not enforced in UI; writes possible in replay mode | High | §13.1 — useModeGuard hook; §33.9 — AGENTS.md rule added |
| 9 | Deck.gl renders on a separate canvas above CesiumJS; z-order and input event handling are broken | Medium | §13.1 — DeckLayer from @deck.gl/cesium; single canvas; shared input handling |
| 10 | CesiumJS imported at module level causes SSR crash; next build fails | High | §13.1 — next/dynamic with ssr: false for all CesiumJS components |
| 11 | Cesium ion token injection pattern undocumented; risk of over-engineering (proxying a public credential) | Low | §7.5 — explicit NOT A SECRET annotation; §33.9 — AGENTS.md rule added |

44.2 Architecture Constraints Summary

After this review the frontend architecture constraints are:

| Constraint | Rule |
|---|---|
| App Router split | app/(auth)/ and app/(admin)/ — server components; app/(globe)/ — "use client" root layout |
| CesiumJS import | next/dynamic + ssr: false only; never a static import at module level |
| CSP | Two-tier: standard (no 'unsafe-eval') for non-globe; globe-tier ('unsafe-eval') for app/(globe)/ only |
| Error isolation | Globe crash must not affect alert panel; independent ErrorBoundary per major region |
| Entity cap | 500 CesiumJS entities maximum; prune entities not updated in last 96h |
| WebSocket reconnect | Exponential backoff, initial 1s, max 30s, ×2 multiplier, ±20% jitter |
| Query keys | All keys defined in src/lib/queryKeys.ts key factory; no inline key strings |
| Mode guard | All write operations must check useModeGuard(['LIVE']) and disable in SIMULATION/REPLAY |
| Deck.gl | DeckLayer from @deck.gl/cesium only; no separate canvas |
| Cesium ion token | NEXT_PUBLIC_CESIUM_ION_TOKEN; public credential; not proxied; not in Vault |

44.3 Anti-Patterns (Do Not Introduce)

| Anti-pattern | Correct form |
|---|---|
| `import * as Cesium from 'cesium'` at module level | `next/dynamic(() => import('./CesiumViewerInner'), { ssr: false })` |
| Single root `<ErrorBoundary>` wrapping entire app | Independent boundaries: GlobeErrorBoundary, PanelErrorBoundary, AlertErrorBoundary |
| `queryClient.invalidateQueries('objects')` (string key) | `queryClient.invalidateQueries({ queryKey: queryKeys.objects.all() })` |
| Rendering write controls (buttons, forms) without mode check | `const { isAllowed } = useModeGuard(['LIVE']); <button disabled={!isAllowed}>` |
| Deck.gl separate canvas (`new Deck({ canvas: ... })`) | `viewer.scene.primitives.add(new DeckLayer({ layers: [...] }))` |
| Storing Cesium ion token in backend env / Vault / Docker secrets | NEXT_PUBLIC_CESIUM_ION_TOKEN in .env.local; committed non-secret in CI |
| Reconnect without jitter (`setTimeout(connect, delay)`) | `delay * (1 + (Math.random() * 2 - 1) * RECONNECT.jitter)` |
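The reconnect schedule defined in §44.2 can be checked numerically; this Python sketch mirrors the TypeScript jitter formula purely for illustration (`reconnect_delay` is a hypothetical helper, not part of the frontend code):

```python
import random

# §44.2 reconnect parameters: initial 1s, max 30s, ×2 multiplier, ±20% jitter.
RECONNECT = {"initial": 1.0, "max": 30.0, "multiplier": 2.0, "jitter": 0.2}

def reconnect_delay(attempt: int, rng: random.Random) -> float:
    base = min(RECONNECT["initial"] * RECONNECT["multiplier"] ** attempt,
               RECONNECT["max"])
    # Mirrors the TS form: delay * (1 + (Math.random() * 2 - 1) * jitter)
    return base * (1 + (rng.random() * 2 - 1) * RECONNECT["jitter"])

rng = random.Random(1)
delays = [reconnect_delay(a, rng) for a in range(10)]
assert 0.8 <= delays[0] <= 1.2               # first retry lands near 1s
assert all(d <= 30.0 * 1.2 for d in delays)  # 30s cap, before ±20% jitter
```

The jitter term is what prevents the thundering-herd in Finding 5: clients that disconnected together spread their reconnects across a ±20% window instead of retrying in lockstep.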

44.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| App Router adoption | App Router with route groups | Pages Router | Route groups ((globe), (auth)) enable per-group CSP header configuration in next.config.ts; server components reduce globe-route initial JS; incremental adoption possible |
| "use client" boundary | app/(globe)/layout.tsx | Per-component "use client" annotations | Single boundary at layout level is simpler; all CesiumJS/Zustand/WebSocket code is already browser-only; per-component annotations at this scale would be noise |
| Globe CSP strategy | Route-scoped 'unsafe-eval' | Hash-based CSP for GLSL | CesiumJS generates shader source dynamically; hashes cannot cover runtime-generated strings; route-scoping is the only practical option |
| Deck.gl integration | DeckLayer from @deck.gl/cesium | Separate Deck.gl canvas | Separate canvas breaks mouse event routing and z-order; DeckLayer renders inside CesiumJS as a primitive, sharing the WebGL context |
| Cesium ion token | NEXT_PUBLIC_ env var | Backend proxy endpoint | Cesium ion is a CDN/tile service with public tokens by design; proxying adds latency and a backend dependency for a non-secret; Cesium's own documentation recommends direct browser use |

45. Platform / Infrastructure Operations Engineering Review

45.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | Python 3.11/3.12 version mismatch between Dockerfiles and service table | Medium | §30.2 — all images updated to python:3.12-slim, node:22-slim; CI version check added |
| 2 | No container resource limits; runaway simulation worker can OOM-kill the database | High | §3.3 — deploy.resources.limits added for all services; stop_grace_period added |
| 3 | Docker SIGTERM→SIGKILL grace period (10s default) too short for MC task warm shutdown | High | §3.3 — stop_grace_period: 300s for worker-sim; --without-gossip --without-mingle flags specified |
| 4 | Backend and renderer on disjoint networks — cannot communicate | Critical | §3.3 — backend added to renderer_net; network topology diagram corrected |
| 5 | Workers bypass PgBouncer — 16 direct connections per worker undermines connection pooling | Medium | §3.3 — PgBouncer added to worker_net; workers connect via pgbouncer:5432 |
| 6 | Redis ACL per-service is stated in §3.2 but undefined — compromised worker can read session tokens | High | §3.2 — full ACL definition added; three separate passwords added to §30.3 env contract |
| 7 | pg_isready -U postgres healthcheck passes before TimescaleDB extension and application DB are ready | Medium | §26.5 — healthcheck replaced with psql query against timescaledb_information.hypertables |
| 8 | daily_base_backup calls pg_basebackup from Python worker image — tool not installed | High | §26.6 — replaced with dedicated db-backup sidecar container; Celery task now verifies backup presence in MinIO |
| 9 | No pids_limit on renderer or worker containers — Chromium crash can fork-bomb host | Medium | §3.3 — pids_limit added: renderer=100, worker-sim=64, worker-ingest=16 |
| 10 | Renderer PDF scratch written to container writable layer — sensitive data persists | Medium | §3.3 — tmpfs mount at /tmp/renders (512 MB); RENDER_OUTPUT_DIR env var added |
| 11 | Blue-green deployment mechanics unspecified for Docker Compose — first production deploy would fail | High | §26.9 — scripts/blue-green-deploy.sh spec added; Caddy dynamic upstream pattern defined |

45.2 Container Runtime Safety Summary

After this review the container runtime safety posture is:

| Concern | Control |
|---|---|
| Resource isolation | deploy.resources.limits per service; DB memory-capped to survive worker OOM |
| Graceful shutdown | stop_grace_period: 300s for simulation workers; Celery --without-gossip --without-mingle |
| Process containment | pids_limit on renderer (100) and both workers |
| Sensitive scratch data | Renderer uses tmpfs at /tmp/renders; cleared on container stop |
| Network access | Backend on renderer_net; PgBouncer on worker_net; workers never reach frontend_net |
| Redis ACL | Three ACL users (backend, worker, ingest) with scoped key namespaces; default user disabled |
| DB healthcheck | Verifies TimescaleDB extension loaded and application DB accessible before dependent services start |
| Backups | Dedicated db-backup sidecar with PostgreSQL tools; Celery Beat verifies presence, not execution |

45.3 Operations Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| FROM python:3.11-slim or FROM node:20-slim in any Dockerfile | python:3.12-slim / node:22-slim; hadolint check enforces this |
| No deploy.resources.limits on CPU/memory-intensive services | All services must have limits; simulation workers especially |
| Worker DATABASE_URL pointing to db:5432 | pgbouncer:5432 — all workers route through PgBouncer |
| subprocess.run(['pg_basebackup', ...]) from a Python worker container | Dedicated db-backup sidecar container with PostgreSQL tools |
| pg_isready -U postgres as the DB healthcheck | psql -c "SELECT 1 FROM timescaledb_information.hypertables LIMIT 1" |
| docker compose stop (default 10s) for simulation workers | stop_grace_period: 300s on worker-sim service definition |
| All services sharing single REDIS_PASSWORD | Three ACL users with scoped namespaces; separate passwords |
| Blue-green deploy without specifying the Compose implementation | scripts/blue-green-deploy.sh with separate Compose project instances + Caddy dynamic upstream |

45.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python version | 3.12 (service table and Dockerfiles aligned) | 3.11 (original Dockerfiles) | 3.12 has 10–25% numeric performance improvements; free-threaded GIL prep; security support through 2028; alignment eliminates silent version drift |
| Blue-green implementation | Separate Compose project instances + Caddy dynamic upstream file | Single Compose file with blue/green service name variants | Separate projects mean the Compose file is not modified per deployment; Caddy JSON upstream reload is atomic and < 5s |
| Backup execution model | Host cron → db-backup sidecar via docker compose run | Celery task + subprocess.run | Celery workers do not have pg_basebackup; host cron is independent of application availability — backup runs even if Celery is down |
| PID limits | Per-service pids_limit in Compose | Kernel cgroup default | Compose pids_limit is applied at container creation; simpler to audit than system-level cgroup tuning; values sized per expected process count |
| Renderer scratch storage | tmpfs | Named Docker volume | PDF contents include prediction data; tmpfs guarantees no persistence; cleared on container stop/restart without manual cleanup |
| Redis ACL scope | Key prefix namespacing (~celery* for workers) | Command-level ACL only | Key-prefix ACL prevents workers from reading/writing outside their namespace; command-level-only ACL is weaker (worker could still enumerate all keys) |

46. Data Pipeline / ETL Engineering Review

46.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No Space-Track request budget tracked; 30-min TIP polling consumes 48/600 requests/day before retries | High | §31.1.1 — SpaceTrackBudget Redis counter; alert at 80%; operator re-fetches budget-checked |
| 2 | TIP 30-min polling too slow for late re-entry phase; CDM 12h polling can miss short-TCA conjunctions entirely | High | §31.1.1 — adaptive polling: TIP→5min, CDM→30min when active_tip_events > 0 |
| 3 | TLE ingest ON CONFLICT behavior unspecified; double-run hits unique constraint silently | Medium | §11 — INSERT ... ON CONFLICT DO NOTHING + spacecom_ingest_tle_conflict_total metric |
| 4 | IERS EOP cold-start: astropy falls back to months-old IERS-B, silently degrading frame transforms | High | §11 — make seed EOP bootstrap step; EOP freshness check in GET /readyz |
| 5 | AIRAC FIR updates are fully manual with no staleness detection or missed-cycle alert | Medium | §31.1.3 — spacecom_airspace_airac_age_days gauge + alert; airspace_stale in readyz; fir-update runbook as Phase 1 deliverable |
| 6 | Space weather nowcast vs. forecast not distinguished; decay predictor uses wrong F10.7 for horizon > 72h | High | §31.1.2 — forecast_horizon_hours column; decay predictor input selection table |
| 7 | IERS EOP SHA-256 verification unimplementable — IERS publishes no reference hashes | Medium | §11 — dual-mirror comparison (USNO + Paris Observatory); spacecom_eop_mirror_agreement gauge |
| 8 | No exponential backoff or circuit breaker on ingest tasks; transient failures exhaust Space-Track budget | High | §31.1.1 — retry_backoff=True, retry_backoff_max=3600, max_retries=5; pybreaker circuit breaker |
| 9 | Space-Track session cookie expires between 6h polls; re-auth behavior not specified or tested | Medium | §31.1.1 — _ensure_authenticated() with proactive 1h45m TTL; session_reauth_total metric |
| 10 | ESA SWS Kp cross-validation has no decision rule; divergence from NOAA is silently ignored | Medium | §31.1.2 — arbitrate_kp() with 2.0 Kp threshold; conservative-high selection; ADR-0018 |
| 11 | celery-redbeat default lock TTL 25min causes up to 25min scheduling gap on Beat crash during TIP event | High | §26.4 — REDBEAT_LOCK_TIMEOUT=60; REDBEAT_MAX_SLEEP_INTERVAL=5; active TIP alert threshold 10min |

46.2 Ingest Pipeline Reliability Summary

After this review the ingest pipeline reliability posture is:

| Concern | Control |
|---|---|
| Space-Track rate limit | SpaceTrackBudget Redis counter; alert at 80%; hard stop at 600/day |
| Upstream failure recovery | Exponential backoff (2s→1h, ×2, ±20% jitter); circuit breaker after 3 failures; max 5 retries then DLQ |
| TIP latency during re-entry | Adaptive polling: 5-minute TIP cycle when active TIP event detected |
| CDM conjunction coverage | 30-minute CDM cycle during active TIP events (baseline 2h) |
| TLE ingest idempotency | ON CONFLICT DO NOTHING + conflict metric |
| EOP freshness | Daily download (USNO primary); dual-mirror verification; 7-day staleness alert; cold-start bootstrap in make seed |
| AIRAC currency | 28-day staleness alert; /readyz degraded signal; manual update runbook as Phase 1 deliverable |
| Space weather horizon | forecast_horizon_hours column; predictor selects by horizon; 81-day F10.7 average beyond 72h |
| Beat HA failover gap | REDBEAT_LOCK_TIMEOUT=60s; standby acquires lock within 5s of TTL expiry |
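The budget control in the first row can be sketched with an in-memory stand-in for the Redis counter. Class and method names are illustrative, not the production API; in production the counter is a Redis key with a daily expiry.

```python
# In-memory stand-in for the SpaceTrackBudget Redis counter: alert at 80%
# of the daily allowance, hard stop at 600 requests/day.
class SpaceTrackBudget:
    DAILY_LIMIT = 600
    ALERT_FRACTION = 0.8

    def __init__(self) -> None:
        self.used = 0
        self.alerted = False

    def consume(self, n: int = 1) -> bool:
        """Reserve n requests; False means hard stop — skip the HTTP call."""
        if self.used + n > self.DAILY_LIMIT:
            return False
        self.used += n
        if self.used >= self.DAILY_LIMIT * self.ALERT_FRACTION:
            self.alerted = True  # production: fire the 80% AlertManager alert
        return True

budget = SpaceTrackBudget()
assert all(budget.consume() for _ in range(600))  # within budget
assert budget.alerted                             # 80% threshold crossed
assert not budget.consume()                       # hard stop at 600/day
```

Calling `consume()` before every Space-Track request is the rule stated in §46.4; the hard stop is what keeps CI or operator activity from silently exhausting the production allowance.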

46.3 New ADR Required

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0018-kp-source-arbitration.md | Kp Source Arbitration Policy | NOAA primary; ESA SWS cross-validation; conservative-high selection on > 2.0 Kp divergence; physics lead approval required |

46.4 Ingest Pipeline Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| INSERT INTO tle_sets ... VALUES (...) without ON CONFLICT DO NOTHING | Always use ON CONFLICT DO NOTHING + increment conflict metric |
| spacetrack_client.fetch() without budget check | Always call budget.consume(1) before any Space-Track HTTP request |
| Celery ingest task with max_retries=None or no backoff | retry_backoff=True, retry_backoff_max=3600, max_retries=5 |
| EOP verification by SHA-256 against prior download | Dual-mirror UT1-UTC value comparison (USNO + Paris Observatory) |
| REDBEAT_LOCK_TIMEOUT = 300 (default 5min or 25min) | REDBEAT_LOCK_TIMEOUT = 60 for active TIP event tolerance |
| Single F10.7 value regardless of prediction horizon | Select by forecast_horizon_hours; 81-day average beyond 72h |
| ESA SWS Kp logged but not acted upon | arbitrate_kp() decision rule; conservative-high on divergence |
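The arbitration rule referenced in the last row can be sketched directly from ADR-0018's stated policy. The function name comes from §46.1; the body below is an illustrative reading of the decision, not the production implementation.

```python
# Sketch of the ADR-0018 Kp arbitration rule: NOAA is primary; on more than
# 2.0 Kp divergence, take the conservative (higher) value, since higher Kp
# implies a denser atmosphere, shorter lifetime, and earlier alerting.
DIVERGENCE_THRESHOLD = 2.0  # Kp units

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    if abs(noaa_kp - esa_kp) > DIVERGENCE_THRESHOLD:
        return max(noaa_kp, esa_kp)  # conservative-high on divergence
    return noaa_kp                   # NOAA primary within threshold

assert arbitrate_kp(3.0, 4.0) == 3.0  # within threshold: NOAA primary
assert arbitrate_kp(3.0, 6.0) == 6.0  # divergent: conservative-high
assert arbitrate_kp(7.0, 4.0) == 7.0  # NOAA already the higher value
```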

46.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Adaptive TIP polling | Dynamic redbeat schedule override when active_tip_events > 0 | Fixed 5-min polling always | Fixed 5-min polling uses 288/600 Space-Track requests/day for TIPs alone; adaptive polling reserves budget for baseline operations |
| Space-Track budget enforcement | Redis counter with hard stop | Honour-system rate limit compliance | Hard stop prevents CI/staging test runs or operator actions from exhausting production budget unexpectedly |
| EOP verification | Dual-mirror value comparison | SHA-256 against prior download | IERS publishes no reference hashes; prior-download comparison detects corruption but not substitution; dual-mirror comparison is the de facto industry approach |
| Kp arbitration | Conservative-high (max of NOAA, ESA on divergence) | Average of both sources | Averaging introduces a systematic bias toward lower geomagnetic activity; in a safety-critical context, the conservative choice is the higher Kp (denser atmosphere, shorter lifetime, earlier alerting) |
| forecast_horizon_hours schema | Dedicated column on space_weather | Separate tables per horizon | Single table with horizon column is simpler to query (WHERE forecast_horizon_hours = 0); adding a table per horizon complicates the ingest pipeline without query benefit |

## §47 — Supply Chain / Dependency Security Engineering Review

### 47.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | `pip wheel` in Dockerfile does not enforce `--require-hashes`; hash pinning specified but not verified during build | High | §30.2 — `--require-hashes` added to `pip wheel` command with explanatory comment |
| 2 | cosign image signing absent from CI workflow; attestation claim was aspirational | High | §26.9 — full `cosign sign` + `cosign attest` YAML added to build-and-push job |
| 3 | SBOM format, CI step, and retention unspecified; ESA ECSS requirement undeliverable | High | §26.9 — SPDX-JSON via syft; `cosign attest` attachment; 365-day artifact retention |
| 4 | pip-audit absent; OWASP Dependency-Check has a high Python false-positive rate | Medium | §7.13 — pip-audit added to security-scan; OWASP DC removed from Python scope |
| 5 | No automated license scanning; CesiumJS AGPLv3 compliance check was manual | High | §7.13 — pip-licenses + license-checker-rseidelsohn gate on every PR |
| 6 | Base-image digest update process undefined; Dependabot cannot update `@sha256:` pins | Medium | §7.13 — Renovate Bot docker-digest manager; digest PRs auto-merged on passing CI |
| 7 | No `.trivyignore` file; first base-image CVE with no fix will break all CI builds | Medium | §7.13 — `.trivyignore` spec with expiry dates + CI expiry check |
| 8 | npm audit absent from CI; `npm ci` does not scan for known vulnerabilities | Medium | §7.13 + §26.9 — `npm audit --audit-level=high` in security-scan job |
| 9 | detect-secrets baseline update process undefined; incorrect `scan >` overwrites all allowances | Medium | §30.1 — correct `--update` procedure documented; CI baseline currency check added |
| 10 | No PyPI index trust policy; dependency-confusion attack surface unmitigated | High | §7.13 — private PyPI proxy spec; `spacecom-*` namespace reservation on public PyPI; ADR-0019 |
| 11 | GitHub Actions pinned by mutable `@vN` tags; tag repointing exfiltrates all workflow secrets | Critical | §26.9 — all actions pinned by full commit SHA; CI lint check enforces no `@v\d` tags |

### 47.2 Supply Chain Security Posture Summary

After this review the supply chain security posture is:

| Layer | Control |
|---|---|
| Python build-time hash verification | `pip wheel --require-hashes` enforces hash pinning during Docker build |
| Python CVE scanning | pip-audit (PyPADB); every PR; blocks on High/Critical |
| Node.js CVE scanning | `npm audit --audit-level=high`; every PR |
| Container CVE scanning | Trivy + `.trivyignore` with expiry enforcement |
| Image provenance | cosign keyless signing (Sigstore) on every image push |
| SBOM | SPDX-JSON via syft; attached as `cosign attest`; 365-day retention |
| License gate | pip-licenses + license-checker-rseidelsohn; GPL/AGPL blocks merge |
| Base image currency | Renovate docker-digest manager; weekly PRs; auto-merged on CI pass |
| Dependency currency | Dependabot (GitHub Advisory integration) for Python/Node versions |
| CI pipeline integrity | All actions SHA-pinned; lint check rejects `@vN` references |
| Secrets detection | detect-secrets (entropy + regex) primary; git-secrets secondary; baseline currency check in CI |
| PyPI index trust | Private proxy (Phase 2+); `spacecom-*` namespace stubs on public PyPI |
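The SHA-pinning, SBOM, and signing controls above combine in the build-and-push job roughly as follows. This is an illustrative sketch, not the actual SpaceCom workflow: `<full-commit-sha>` is a placeholder (the plan's own convention), and `IMAGE`/`DIGEST` are assumed to be set by earlier build steps.

```yaml
# Illustrative build-and-push fragment; flags reflect current syft/cosign CLIs.
permissions:
  contents: read
  id-token: write        # required for cosign keyless (Sigstore) signing
  packages: write
steps:
  - uses: actions/checkout@<full-commit-sha>   # vX.Y.Z -- never a mutable @vN tag
  # ... build and push, capturing the pushed image digest into DIGEST ...
  - name: Generate SBOM (SPDX-JSON)
    run: syft "${IMAGE}@${DIGEST}" -o spdx-json=sbom.spdx.json
  - name: Sign image (keyless)
    run: cosign sign --yes "${IMAGE}@${DIGEST}"
  - name: Attach SBOM attestation
    run: cosign attest --yes --type spdxjson --predicate sbom.spdx.json "${IMAGE}@${DIGEST}"
```

Signing the digest (not a tag) is what makes the provenance claim immutable; deploy jobs would then run `cosign verify` against the same digest.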

### 47.3 New ADR Required

| ADR | Title | Decision |
|---|---|---|
| `docs/adr/0019-pypi-index-trust.md` | PyPI Index Trust Policy | Private proxy for Phase 2+; public PyPI namespace reservation for `spacecom-*` packages in Phase 1 |

### 47.4 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `pip wheel -r requirements.txt` without `--require-hashes` | `pip wheel --require-hashes -r requirements.txt` |
| `uses: actions/checkout@v4` in any workflow file | `uses: actions/checkout@<full-commit-sha> # vX.Y.Z` |
| `detect-secrets scan > .secrets.baseline` | `detect-secrets scan --baseline .secrets.baseline --update` |
| OWASP Dependency-Check as Python CVE scanner | `pip-audit --requirement requirements.txt` |
| Trivy gate with no `.trivyignore` | `.trivyignore` with documented expiry dates + CI expiry check |
| Manual CesiumJS licence check at Phase 1 only | `license-checker-rseidelsohn --failOn "GPL;AGPL"` on every PR (CesiumJS exempted by name) |
| cosign mentioned in decision log but absent from CI | `cosign sign` + `cosign attest` in build-and-push job; `cosign verify` in deploy jobs |
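The `.trivyignore` expiry check can be a short CI script. The inline `# expires: YYYY-MM-DD` comment convention below is an assumption for illustration; the plan only requires that every suppression carry a documented expiry date.

```python
# Sketch of a CI expiry check for .trivyignore entries. The "expires:"
# comment format is assumed, not specified by the plan.
import datetime
import re

ENTRY = re.compile(
    r"^(?P<cve>CVE-\d{4}-\d+)\s*#\s*expires:\s*(?P<date>\d{4}-\d{2}-\d{2})")


def expired_entries(text: str, today: datetime.date) -> list[str]:
    """Return IDs whose suppression has lapsed, or carries no expiry at all."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and pure comments
        m = ENTRY.match(line)
        if m is None:
            bad.append(line.split()[0])  # no expiry comment: fail closed
            continue
        if datetime.date.fromisoformat(m.group("date")) < today:
            bad.append(m.group("cve"))
    return bad
```

Run against the repository file in CI, any non-empty result fails the build, which forces suppressions to be revisited instead of rotting.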

### 47.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python CVE scanning | pip-audit (PyPADB) | OWASP Dependency-Check | OWASP DC CPE mapping generates false positives for Python; pip-audit queries the Python-native advisory database with near-zero false positives |
| Image signing | cosign keyless (Sigstore) | Long-lived signing key | Keyless signing uses ephemeral OIDC-bound keys; no key-management overhead; verifiable against the GitHub Actions OIDC issuer |
| SBOM format | SPDX 2.3 JSON (`spdx-json`) | CycloneDX 1.5 | SPDX is the ECSS/ESA-preferred format; both are equivalent for compliance purposes; SPDX has wider tooling support in the aerospace sector |
| Base image update automation | Renovate docker-digest | Manual digest updates | Manual digest updates are always deferred; Renovate auto-merge on passing CI achieves zero-latency security-patch application for base-image OS updates |
| GitHub Actions pinning | Commit SHA with tag comment | Dependabot auto-bump of `@vN` | Tag references are mutable; SHA pins are immutable; the Renovate github-actions manager keeps SHAs current automatically |
| PyPI trust (Phase 1) | Namespace reservation on public PyPI | Private proxy | A private proxy requires infrastructure investment not available in Phase 1; namespace-squatting prevention provides meaningful protection at zero cost |

## §48 Human Factors Engineering — Specialist Review

Hat: Human Factors Engineering

Standards basis: ECSS-E-ST-10-12C (Space engineering — Human factors), CAP 1264 (Alarm management for safety-related ATC systems), EASA GM1 ATCO.B.001(d) (Competency-based training — decision making under uncertainty), Endsley (1995) Situation Awareness taxonomy, Parasuraman & Riley (1997) automation trust calibration

Review scope: §28 Human Factors Framework, §6 UI/UX Feature Specifications, §26 Infrastructure (alert delivery), §31 Data Pipeline (data freshness / degraded state)


### 48.1 Findings

Finding 1 — SA timing targets absent: §28.1 contained no quantitative time-to-comprehension targets. Situation Awareness without measurable timing criteria cannot be validated against ECSS-E-ST-10-12C Part 6.4 or used as pass/fail criteria in usability testing. Fix applied (§28.1): SA Level 1 ≤ 5s (icon/colour/position); SA Level 2 ≤ 15s (FIR intersection + sector); SA Level 3 ≤ 30s (corridor expanding/contracting). Targets designated as Phase 2 usability test pass/fail criteria.

Finding 2 — Forced-text acknowledgement minimum causes compliance noise: The 10-character minimum on alert acknowledgement text is a common anti-pattern. Under time pressure, operators produce 1234567890 or similar, which is audit record pollution rather than evidence of cognitive engagement. Fix applied (§28.5): Replaced with ACKNOWLEDGEMENT_CATEGORIES (6 structured options). Free text is optional except when OTHER is selected. Category selection satisfies audit requirements with less operator burden.

Finding 3 — No keyboard-completable acknowledgement path: ANSP ops room staff routinely hold a radio PTT with one hand. A mouse-dependent acknowledgement dialog is inaccessible in that context and constitutes a HF design failure. Fix applied (§28.5): Alt+A → Enter → Enter three-keystroke path from any application state. Documented for operator quick-reference card; included in Phase 2 usability test scenario.

Finding 4 — No startle-response mitigation: Sudden full-screen CRITICAL banners produce a documented ~5-second degraded cognitive performance window (startle effect, Staal 2004). The existing design transitions directly to full-screen without priming. Fix applied (§28.3): Three-rule mitigation: (1) progressive escalation — CRITICAL full-screen only after ≥ 1 minute in HIGH state (except impact_time_minutes < 30); (2) audio precedes visual by 500ms; (3) banner is dimmed overlay over corridor map, not a replacement.
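Rule (1) of the mitigation above reduces to a small predicate. A minimal sketch, with an assumed function name and state representation; the thresholds (60 s in HIGH, 30-minute impact override) are the ones specified.

```python
# Hedged sketch of startle-mitigation rule (1): progressive escalation to
# full-screen CRITICAL. The function name and inputs are illustrative.
AUDIO_LEAD_MS = 500  # rule (2): audio precedes the visual banner by 500 ms


def may_go_fullscreen_critical(seconds_in_high: float,
                               impact_time_minutes: float) -> bool:
    """True if the full-screen CRITICAL banner may be shown now."""
    if impact_time_minutes < 30:
        return True  # imminent impact overrides progressive escalation
    return seconds_in_high >= 60.0  # otherwise require >= 1 min in HIGH first
```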

Finding 5 — No shift handover specification: Handover is the highest-risk transition in continuous operations. Loss of situational awareness at shift change is a documented contributing factor in ATC incidents. No handover mechanism existed. Fix applied (§28.5a): Dedicated /handover view; shift_handovers table with outgoing_user, incoming_user, notes, active_alerts snapshot, open_coord_threads snapshot; immutable audit record; CRITICAL-during-handover flag on notifications.

Finding 6 — Alarm rationalisation procedure absent: Alarm systems without formal rationalisation procedures inevitably drift toward nuisance alarm rates that exceed operator tolerance. The existing quarterly review target (< 1 LOW/10 min/user) had no enforcement mechanism. Fix applied (§28.3): Quarterly rationalisation procedure with alarm_threshold_audit table; 90% MONITORING acknowledgement rate as nuisance alarm trigger; mandatory 7-day confirmation for threshold changes; 12-month no-escalation review for alert categories.

Finding 7 — Comprehension test items not specified: §28.7 stated "usability test" without scripted probabilistic comprehension items. Generic usability tests are insensitive to the specific calibration failures relevant to probabilistic re-entry data (false precision, space/aviation risk threshold conflation, uncertainty update misattribution). Fix applied (§28.7): Four scripted comprehension items with correct answer, common wrong answer, and failure mode each item detects. Pass criterion: ≥ 80% correct per item across the test cohort.

Finding 8 — No habituation countermeasures: Repeated identical stimuli (identical alarm sound, identical banner appearance) produce habituation — reduced physiological and attentional response over weeks of exposure. No design provisions existed. Fix applied (§28.3): Pseudo-random alternation of two-tone audio pattern; 1 Hz colour cycling on CRITICAL banner between two dark-amber shades; per-operator habituation metric (≥ 20 same-type acknowledgements in 30 days without escalation triggers supervisor review).
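The per-operator habituation metric can be sketched as a windowed count. The data shapes and names below are illustrative assumptions; the thresholds (20 acknowledgements, 30 days) are the ones specified.

```python
# Sketch of the habituation metric: >= 20 same-type acknowledgements in 30
# days with no escalation triggers supervisor review. Shapes are assumed.
import datetime
from collections import Counter

WINDOW_DAYS = 30
THRESHOLD = 20


def needs_supervisor_review(acks: list[dict],
                            now: datetime.datetime) -> set[str]:
    """acks: [{'type': str, 'at': datetime, 'escalated': bool}, ...]"""
    cutoff = now - datetime.timedelta(days=WINDOW_DAYS)
    recent = [a for a in acks if a["at"] >= cutoff]
    counts = Counter(a["type"] for a in recent if not a["escalated"])
    escalated_types = {a["type"] for a in recent if a["escalated"]}
    # Flag alert types the operator keeps acknowledging without ever escalating.
    return {t for t, n in counts.items()
            if n >= THRESHOLD and t not in escalated_types}
```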

Finding 9 — "Response Options" label creates legal ambiguity: The label "Response Options" implies these are prescribed choices. In a regulatory investigation following an incident, checked items could be interpreted as evidence of a standard procedure that was or was not followed. Fix applied (§28.6): Feature renamed to "Decision Prompts" throughout. Non-waivable legal disclaimer added below accordion header. Disclaimer included in printed/exported Event Detail report and in API response legal_notice field.

Finding 10 — No attention management specification: SpaceCom exists in an environment (ops room) with very high ambient interruption rates. Without explicit constraints on unsolicited notification rate, SpaceCom becomes an additional fragmentation source — the documented cause of error in multiple ATC incident analyses. Fix applied (§28.6): Three-tier rate limit: ≤ 1/10 min in steady state; ≤ 1/60s for same-event updates during active incident; zero during critical flow (acknowledgement dialog or handover screen). Queued notifications delivered as batch on critical-flow exit.

Finding 11 — Degraded-data states not differentiated for operators: Three meaningfully different system states (healthy, degraded, failed) were visually undifferentiated in the previous design. Operators cannot distinguish between data they should trust, trust with margin, or not trust at all. Fix applied (§28.8): Graded visual degradation language table (5 amber/red states with exact badge text and required operator response); multiple-amber consolidation rule; GET /readyz machine-readable staleness flags for ANSP monitoring integration; system_health_events audit table.


### 48.2 Files / Sections Modified

| Section | Change |
|---|---|
| §28.1 Situation Awareness Design Requirements | Added SA level timing targets as pass/fail usability criteria |
| §28.3 Alarm Management | Added startle-response mitigation (3 rules), alarm rationalisation procedure, habituation countermeasures |
| §28.5 Error Recovery and Irreversible Actions | Replaced 10-char text minimum with `ACKNOWLEDGEMENT_CATEGORIES`; added Alt+A → Enter → Enter keyboard path |
| §28.5a Shift Handover (new section) | Handover screen spec; `shift_handovers` table schema; integrity rules; handover-window CRITICAL flag |
| §28.6 Cognitive Load Reduction | Renamed Response Options → Decision Prompts; added legal disclaimer; added attention management rate limits |
| §28.7 HF Validation Approach | Added 4 scripted probabilistic comprehension test items with pass criterion |
| §28.8 Degraded-Data Human Factors (new section) | Graded degradation language; 5-state indicator table; multiple-amber consolidation; `GET /readyz` integration |

### 48.3 New Tables / Schema Changes

| Table | Purpose |
|---|---|
| `shift_handovers` | Immutable record of shift handovers with alert and coordination-thread snapshots |
| `alarm_threshold_audit` | Immutable record of alarm threshold changes with reviewer and rationale |
| `system_health_events` | Time-series log of degraded-data state transitions for operational reporting |
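A possible PostgreSQL shape for the `shift_handovers` record is sketched below. Only the columns named in Finding 5 come from the plan; the surrogate key, timestamp, and the `users` table reference are assumptions. Immutability would be enforced in production by revoking UPDATE/DELETE and adding a guard trigger.

```sql
-- Illustrative DDL sketch only; not the authoritative schema.
CREATE TABLE shift_handovers (
    id                 BIGSERIAL PRIMARY KEY,
    outgoing_user      BIGINT      NOT NULL REFERENCES users (id),
    incoming_user      BIGINT      NOT NULL REFERENCES users (id),
    notes              TEXT,
    active_alerts      JSONB       NOT NULL,  -- snapshot taken at handover
    open_coord_threads JSONB       NOT NULL,  -- snapshot taken at handover
    created_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);
```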

### 48.4 New ADRs Required

| ADR | Title | Decision |
|---|---|---|
| `docs/adr/0020-acknowledgement-categories.md` | Alert Acknowledgement Design | Structured category selection replaces free-text minimum; OTHER requires text; 6 categories cover all anticipated operational responses |
| `docs/adr/0021-decision-prompts-legal.md` | Decision Prompts Legal Treatment | Feature renamed from Response Options; non-waivable disclaimer required; legal rationale documented for future regulatory inquiries |

### 48.5 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| Full-screen CRITICAL banner without progressive escalation | Progressive escalation: ≥ 1 min in HIGH state before CRITICAL full-screen (except impact_time < 30 min) |
| Audio and visual CRITICAL alert fired simultaneously | Audio fires 500 ms before visual banner render |
| Alert acknowledgement with free-text character minimum | `ACKNOWLEDGEMENT_CATEGORIES` structured selection; free text only when OTHER selected |
| "Response Options" label anywhere in UI, API, or docs | "Decision Prompts" throughout; legal disclaimer present |
| Comprehension test without scripted probabilistic items | Use the 4 scripted items in §28.7; measure per-item accuracy against the 80% pass threshold |
| Degraded data shown with same visual weight as fresh data | Use exact badge text from §28.8; amber for stale, red for expired/unusable |

### 48.6 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Acknowledgement mechanism | Structured categories | Free-text minimum | Research shows forced-text minimums produce compliance noise, not evidence; structured categories produce lower operator burden with higher audit utility |
| CRITICAL escalation model | Progressive (HIGH → CRITICAL) | Immediate full-screen | The startle effect causes ~5 s of cognitive degradation; progressive escalation eliminates cold-start startle while preserving urgency |
| Audio timing | 500 ms pre-visual | Simultaneous | A pre-auditory alert primes the attentional orienting response and eliminates visual startle; 500 ms is within the ICAO-recommended alerting lead-time range |
| Shift handover | System-managed `/handover` view | Out-of-band process | Out-of-band handovers leave no audit trail and are not integrated with active alert state; a system-managed handover provides an immutable record and SA transfer assurance |
| Decision Prompts legal treatment | Non-waivable hard-coded disclaimer | Configurable disclaimer or none | A configurable disclaimer creates discovery risk (it could be disabled); absence of a disclaimer creates precedent risk; a hard-coded disclaimer is the only legally safe option |

## §49 Legal / Regulatory Compliance — Specialist Review

Standards basis: GDPR (Regulation 2016/679), UK GDPR, ePrivacy Directive, Export Administration Regulations (EAR), ITAR (22 CFR 120–130), ESA Procurement Rules, EUMETSAT Data Policy, Space Debris Mitigation Guidelines (IADC/ISO 24113), Chicago Convention Article 28, EU AI Act (Regulation 2024/1689), NIS2 Directive (2022/2555)

Review scope: Data handling, user consent, liability framing, export control, third-party data licensing, AI Act obligations, operator accountability chain, record retention, cross-border transfer, regulatory correspondence readiness


### 49.1 Findings and Fixes Applied

F1 — No GDPR lawful basis documented per processing activity. Fix applied (§29.1): RoPA requirement formalised. `legal/ROPA.md` designated as the authoritative document. Data inventory table extended to include all processing activities with lawful basis, retention period, and table reference. `shift_handovers` and `alarm_threshold_audit` added as processing activities. Annual DPO sign-off required. DPIA trigger documented.

F2 — No DPIA for conjunction alert delivery. Fix applied (§29.1): DPIA trigger documented — conjunction alert delivery constitutes systematic monitoring under GDPR Art. 35(3)(b). DPIA required before production deployment; template designated as `legal/DPIA_conjunction_alerts.md`.

F3 — TLE / space weather data redistribution may breach upstream licence. Fix applied (§24.2): `space_track_registered` boolean column added to the `organisations` table. API middleware gate blocks TLE-derived fields for non-registered orgs. `data_disclosure_log` table added for the licence audit trail. EU-SST data gated separately behind the `itar_cleared` flag.
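A minimal sketch of that middleware gate, assuming a dict-shaped payload: the field list and the logging hook are illustrative assumptions, not the production middleware.

```python
# Hedged sketch of the §24.2 redistribution gate: strip TLE-derived fields
# unless the organisation is Space-Track registered, and record every
# disclosure. TLE_DERIVED_FIELDS is an assumed, non-exhaustive list.
TLE_DERIVED_FIELDS = {"tle_line1", "tle_line2", "epoch", "mean_motion"}


def gate_tle_fields(payload: dict, org: dict, disclosure_log: list) -> dict:
    """Return the payload the org may see; log disclosures for audit."""
    if not org.get("space_track_registered", False):
        # Non-registered org: remove TLE-derived fields entirely.
        return {k: v for k, v in payload.items()
                if k not in TLE_DERIVED_FIELDS}
    # Registered org: serve in full, but append to the disclosure trail
    # (production: an INSERT into data_disclosure_log).
    disclosure_log.append({
        "org_id": org["id"],
        "fields": sorted(TLE_DERIVED_FIELDS & payload.keys()),
    })
    return payload
```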

F4 — No export control screening at registration. Fix applied (§24.2): `country_of_incorporation`, `export_control_screened_at`, `export_control_cleared`, and `itar_cleared` columns added to the `organisations` table. Onboarding flow screens against embargoed countries (ISO 3166-1 alpha-2) and the BIS Entity List. EU-SST-derived data gated behind `itar_cleared`. Documented in `legal/EXPORT_CONTROL_POLICY.md`.

F5 — Liability disclaimer in Decision Prompts insufficient as standalone protection. Fix applied (§28.6): Note added that the in-UI disclaimer is a reinforcing reminder only. Substantive liability limitation (consequential loss excluded; aggregate cap = 12 months' fees) must appear in the executed MSA (§24.2). UCTA 1977 and EU Unfair Contract Terms Directive requirements noted.

F6 — No retention / deletion schedule; erasure requests unhandled for new tables. Fix applied (§29.1, §29.3): `shift_handovers` and `alarm_threshold_audit` added to the RoPA with 7-year retention (safety-record basis). Pseudonymisation procedure in §29.3 extended to cover `shift_handovers` — user ID columns nulled, notes prefixed with pseudonym on erasure request.

F7 — Cross-border data transfer mechanism not formally documented. Fix applied (§29.5): `legal/DATA_RESIDENCY.md` designated as the authoritative sub-processor list with hosting provider, region, and SCC/IDTA status. Annual DPO review and customer notification on material sub-processor change formalised.

F8 — EU AI Act obligations not assessed. Fix applied (§24.10): New section added. Conjunction probability model classified as high-risk AI under EU AI Act Annex III (transport infrastructure safety). Eight high-risk obligations mapped (risk management, data governance, technical documentation, logging, transparency, human oversight, accuracy/robustness, conformity assessment). Human oversight statement added as a mandatory non-configurable UI element in the §19.4 conjunction probability display. EU database registration (Art. 51) added as a Phase 3 gate. `legal/EU_AI_ACT_ASSESSMENT.md` designated as the authoritative document.

F9 — No regulatory correspondence register. Fix applied (§24.11): New section added. `legal/REGULATORY_CORRESPONDENCE_LOG.md` designated as the structured register. SLAs: 2-business-day acknowledgement, 14-calendar-day response. Quarterly steering review of outstanding correspondence. Proactive engagement triggered by ≥ 3 queries from the same authority in 12 months.

F10 — Cookie / tracking consent mechanism not specified. Fix applied (§29.7): New section added. Cookie audit table defined (strictly necessary / functional / analytics). `HttpOnly; Secure; SameSite=Strict` formalised as required security attributes. Consent banner specification: three tiers; preference stored in localStorage (not a cookie); re-requested on material category changes. `legal/COOKIE_POLICY.md` designated as the authoritative document.

F11 — Incident notification obligations not mapped to regulatory timelines. Fix applied (§29.6): NIS2 Art. 23 obligations added alongside GDPR Art. 33. Early-warning deadline: 24 hours from awareness (NIS2) vs. 72 hours (GDPR). Full NIS2 notification: 72 hours. Final report: 1 month. On-call escalation to the DPO within the 24-hour window documented. `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` designated as the authoritative template document.


### 49.2 Sections Modified

| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Added Space-Track redistribution gate (`space_track_registered`), `data_disclosure_log` table, export control screening columns and onboarding flow |
| §24.10 (new) EU AI Act Obligations | Full high-risk AI obligation mapping; human oversight statement; conformity assessment and registration roadmap |
| §24.11 (new) Regulatory Correspondence Register | Structured log specification; SLAs; escalation trigger |
| §28.6 Cognitive Load Reduction | Added legal sufficiency note on Decision Prompts disclaimer; MSA cross-reference |
| §29.1 Data Inventory | Formalised as GDPR Art. 30 RoPA; added `shift_handovers`, `alarm_threshold_audit`, `data_disclosure_log` entries; DPIA trigger documented |
| §29.3 Erasure vs. Retention Conflict | Extended pseudonymisation procedure to cover `shift_handovers` |
| §29.5 Cross-Border Data Transfer Safeguards | Added `legal/DATA_RESIDENCY.md` as authoritative document with annual review requirement |
| §29.6 Security Breach Notification | Expanded to full NIS2 Art. 23 obligations table; multi-framework notification timeline |
| §29.7 (new) Cookie / Tracking Consent | Cookie audit table; `HttpOnly; Secure; SameSite=Strict` formalised; consent banner specification |

### 49.3 New Tables and Columns

| Table / Column | Purpose |
|---|---|
| `data_disclosure_log` | Immutable record of every TLE-derived data disclosure per organisation; supports Space-Track licence audit |
| `organisations.space_track_registered` | Gate controlling access to TLE-derived API fields |
| `organisations.country_of_incorporation` | Feeds export control screening at onboarding |
| `organisations.export_control_cleared` | Records completion of export control screening |
| `organisations.itar_cleared` | Gates EU-SST-derived data to cleared entities only |

### 49.4 New Documents

| Document | Purpose |
|---|---|
| `legal/ROPA.md` | GDPR Art. 30 Record of Processing Activities — authoritative version |
| `legal/DPIA_conjunction_alerts.md` | Data Protection Impact Assessment for conjunction alert delivery |
| `legal/EXPORT_CONTROL_POLICY.md` | Export control screening procedure and embargoed-country list |
| `legal/DATA_RESIDENCY.md` | Sub-processor list with hosting regions and SCC/IDTA status |
| `legal/EU_AI_ACT_ASSESSMENT.md` | High-risk AI classification; obligation mapping; conformity assessment |
| `legal/REGULATORY_CORRESPONDENCE_LOG.md` | Structured register of regulatory correspondence |
| `legal/COOKIE_POLICY.md` | Cookie audit and consent policy |
| `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` | Multi-framework notification timelines and templates |

### 49.5 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| In-UI disclaimer as sole liability protection | Substantive liability cap in executed MSA; UI disclaimer is reinforcement only |
| Serving TLE-derived data without licence verification | Gate behind `space_track_registered`; log all disclosures |
| Registering users without country-of-incorporation check | Collect at onboarding; screen against embargoed countries and the BIS Entity List before account activation |
| Treating the GDPR 72-hour obligation as the only notification deadline | NIS2 requires a 24-hour early warning for significant incidents; both timelines must be tracked simultaneously |
| Storing consent preference in a cookie | Self-defeating; use localStorage with no expiry |
| Self-classifying the conjunction model as low-risk AI | Transport infrastructure safety = Annex III high-risk; full obligations apply regardless of system size |

### 49.6 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| RoPA location | `legal/ROPA.md` (authoritative) + §29.1 mirror | MASTER_PLAN only | Regulatory auditors expect a standalone document; the MASTER_PLAN mirror keeps engineers informed |
| Space-Track gate mechanism | Per-org boolean + middleware check | Per-request licence verification | Per-request verification against the Space-Track API would add latency and a hard dependency; the boolean flag is updated at onboarding and reviewed quarterly |
| EU AI Act classification | High-risk (Annex III, transport safety) | Low-risk / unclassified | The conjunction model informs time-critical airspace decisions; conservative classification is the legally safe position; reclassification requires a legal opinion |
| Cookie consent storage | localStorage | Session cookie | Storing consent in a cookie creates a circular dependency (consent is needed to set the cookie that stores consent); localStorage avoids this without additional server round-trips |
| NIS2 applicability | Treat SpaceCom as an essential entity (space traffic management) | Treat as non-essential until formally classified | Early compliance avoids a reclassification scramble; ENISA guidance indicates space infrastructure operators are likely Annex I essential entities |

## §50 Accessibility Engineering — Specialist Review

Standards basis: WCAG 2.1 Level AA (ISO/IEC 40500:2012), WAI-ARIA 1.2, EN 301 549 v3.2.1, Section 508, APCA contrast algorithm, ATAG 2.0

Review scope: Keyboard navigation, screen reader compatibility, colour contrast, motion/animation, focus management, dynamic content announcements, form accessibility, alert/modal accessibility, time-limited interactions, ARIA live regions


### 50.1 Findings and Fixes Applied

F1 — No accessibility standard committed; EN 301 549 is mandatory for ESA procurement. Fix applied (§13.0, §25.6): WCAG 2.1 AA committed as the minimum standard in new §13.0. Definition of done updated: all PRs must pass axe-core wcag2a/aa before merge. ACR/VPAT 2.4 added to the §25.6 ESA procurement artefacts table as a required Phase 2 deliverable.

F2 — CRITICAL alert overlay inaccessible to screen reader and keyboard users. Fix applied (§28.3): Full ARIA alertdialog spec added: `role="alertdialog"`, `aria-modal="true"`, programmatic `focus()` on render, `aria-hidden="true"` on the map container, `aria-live="assertive"` announcement region, visible text status indicator for deaf operators, Escape key handling per severity level.

F3 — Structured acknowledgement form has no accessible labels. Fix applied (§28.5): Native `<input type="radio">` with `<label for="...">`, `<fieldset>` + `<legend>`, `aria-keyshortcuts` on the trigger, visible keyboard shortcut legend inside the dialog, `aria-required` on the free-text field when OTHER is selected, `aria-live="polite"` confirmation on submit.

F4 — CesiumJS globe inaccessible; no keyboard/screen reader equivalent. Fix applied (§13.2): New §13.2 specifies `ObjectTableView.tsx` as a parallel accessible table view. Accessible via Alt+T and a persistent visible button. All alert interactions completable from the table view alone. Implemented with native `<table>` elements; `aria-sort`, `aria-rowcount`, `aria-rowindex` for virtual scroll.

F5 — Colour is the sole differentiator for alert severity. Fix applied (§13.4): Non-colour severity indicators specified in §13.4: per-severity icon/shape (octagon/triangle/circle/circle-outline), text labels always visible, distinct border widths. The 1 Hz colour cycle also has a 1 Hz border-width pulse as a redundant indicator.

F6 — No keyboard navigation spec for the primary operator workflow. Fix applied (§13.3): New §13.3 specifies skip links, focus ring (3 px, ≥ 3:1 contrast, `--focus-ring` token), tab order rules (no `tabindex` > 0), a full application keyboard shortcut table (Alt+A/T/H/N, ?, Escape, arrow keys), `aria-keyshortcuts` on all trigger elements, conflict-free shortcut design.

F7 — Colour contrast ratios not specified. Fix applied (§13.4): Verified contrast table for all operational severity colours on the dark theme `#1A1A2E`. All pairs meet ≥ 4.5:1 (AA). Design token file `frontend/src/tokens/colours.ts` designated as authoritative; no hardcoded colour values in component files.

F8 — Session timeout risk during shift handover. Fix applied (§28.5a): WCAG 2.2.1 (Timing Adjustable) compliance spec added. T−2-minute warning dialog with `aria-live="polite"` announcement. Auto-extension (30 min, once per session) when the `/handover` view is active. `POST /api/v1/auth/extend-session` endpoint specified. Extension logged in `security_logs` as `SESSION_AUTO_EXTENDED_HANDOVER`.

F9 — Decision Prompts accordion not keyboard-operable or screen-reader-friendly. Fix applied (§28.6): Full WAI-ARIA Accordion pattern specified: `aria-expanded`, `aria-controls`, `role="region"`, `aria-labelledby`, native checkbox inputs with labels, arrow-key navigation, `aria-live="polite"` confirmation on checkbox state change.

F10 — No reduced-motion support. Fix applied (§28.3): `prefers-reduced-motion: reduce` CSS implementation specified for the CRITICAL banner colour cycle (a static thick border replaces the animation). CesiumJS corridor animation: JS `matchMedia` check on mount; particle animation disabled; static opacity when reduced motion is preferred. Listener on the `change` event for live preference updates without page reload.
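The banner half of that fix can be expressed directly in CSS. A minimal sketch; the class name, keyframe name, and custom property below are assumptions, not the production stylesheet.

```css
/* Illustrative prefers-reduced-motion override for the CRITICAL banner's
   1 Hz colour cycle: replace the animation with a static thick border. */
.critical-banner {
  animation: critical-cycle 1s steps(2) infinite;
}

@media (prefers-reduced-motion: reduce) {
  .critical-banner {
    animation: none;                              /* no colour cycling */
    border: 6px solid var(--severity-critical);   /* static replacement */
  }
}
```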

F11 — No accessibility testing in CI. Fix applied (§42.2, §42.5): `e2e/test_accessibility.ts` added using `@axe-core/playwright`. Scans 5 primary views. wcag2a + wcag2aa violations block the PR; wcag2aaa produces warnings only. Results published as CI artefact `a11y-report.html`. Manual screen reader test (NVDA+Firefox, VoiceOver+Safari) added to the release checklist. Decision log entry added in §42.5.


### 50.2 Sections Modified

| Section | Change |
|---|---|
| §13.0 (new) Accessibility Standard Commitment | WCAG 2.1 AA minimum standard; EN 301 549 mandatory for ESA; ACR/VPAT as Phase 2 deliverable; definition of done |
| §13.2 (new) Accessible Parallel Table View | `ObjectTableView.tsx` spec; keyboard trigger; native table markup; virtual scroll ARIA attributes |
| §13.3 (new) Keyboard Navigation Specification | Skip links; focus ring token; tab order rules; full shortcut table; `aria-keyshortcuts` convention |
| §13.4 (new) Colour and Contrast Specification | Verified contrast table; design token file; non-colour severity indicators (icons, text labels, border widths) |
| §25.6 Required ESA Procurement Artefacts | ACR/VPAT 2.4 added to artefacts table |
| §28.3 Alarm Management | CRITICAL alert ARIA spec; reduced-motion CSS spec |
| §28.5 Error Recovery | Acknowledgement form accessibility: native inputs, fieldset/legend, `aria-keyshortcuts`, confirmation announcement |
| §28.5a Shift Handover | Session timeout accessibility: T−2-min warning, auto-extension during handover, extend-session endpoint |
| §28.6 Cognitive Load Reduction | Decision Prompts ARIA Accordion pattern spec |
| §42.2 Test Suite Inventory | `test_accessibility.ts` added to e2e suite |
| §42.3 (renamed from §42.2) | axe-core implementation spec with code example; manual screen reader test checklist |
| §42.5 Decision Log | Accessibility CI gate decision added |

### 50.3 New Components

| Component / File | Purpose |
|---|---|
| `src/components/globe/ObjectTableView.tsx` | Accessible parallel table view for all globe objects |
| `frontend/src/tokens/colours.ts` | Design token file for all operational colours; authoritative contrast reference |
| `e2e/test_accessibility.ts` | `@axe-core/playwright` scans blocking PRs on WCAG 2.1 AA violations |
| `docs/RELEASE_CHECKLIST.md` | Manual screen reader test steps; keyboard-only workflow test |

### 50.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| `aria-label` on a `<div>` when a native `<button>` would do | Always prefer native HTML semantics; ARIA substitutes only when no native element exists |
| `outline: none` without a custom focus indicator | Never suppress the focus ring without providing an equivalent; use the `--focus-ring` token |
| `tabindex="2"` or any positive tabindex | Never; positive tabindex breaks natural reading order and confuses screen readers |
| Colour-only severity communication | Always pair colour with shape, text label, and border width as redundant indicators |
| Inline `aria-live="assertive"` for non-emergency announcements | `assertive` interrupts immediately; use `polite` for non-CRITICAL confirmations, `assertive` only for CRITICAL alerts |
| Session timeout that cannot be extended | WCAG 2.2.1 requires the user be able to extend or disable timing; auto-extend during safety-critical views is the correct pattern |

### 50.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Globe accessibility approach | Parallel accessible table view | Making CesiumJS accessible directly | A WebGL canvas cannot be made screen-reader accessible; a parallel data view is the only WCAG-conformant approach for complex visualisations |
| Focus ring specification | 3 px solid `#4A9FFF`, design token | Browser default outline | The browser default fails contrast requirements on dark themes; a design token ensures consistency and testability |
| axe-core CI level | wcag2a + wcag2aa block; wcag2aaa warn | All levels block, or all levels warn | All-block creates false positives (AAA is aspirational); all-warn provides no enforcement; AA is the legal and contractual minimum |
| Reduced motion: animation vs. static | Static thick border when `prefers-reduced-motion: reduce` | Slow down the animation | Slowed animation still triggers vestibular symptoms; static replacement is the only fully safe approach |
| Session auto-extension scope | Only while `/handover` is active; once per session | For any active form | Broad auto-extension creates a security risk (indefinitely open sessions); limiting to the handover scope is the narrowest sufficient accommodation |

§52 Incident Response / Disaster Recovery Engineering — Specialist Review

Standards basis: NIST SP 800-61r2, ISO/IEC 27035, ISO 22301, ITIL 4, ICAO Doc 9859, AWS/GCP Well-Architected Framework (Reliability Pillar), Google SRE Book (Chapter 14)

Review scope: Incident classification, runbook completeness, escalation chains, RTO/RPO definition and achievability, backup and restore, chaos/game day testing, on-call rotation, post-incident review, DR site strategy, alert_events integrity


52.1 Findings and Fixes Applied

F1 — RTO and RPO targets not formally defined with derivation rationale Fix applied (§26.2): Table expanded with derivation column. RTO ≤ 15 min (active TIP event) derived from 4-hour CRITICAL rate-limit window. RTO ≤ 60 min (no active event) aligns with MSA SLA. RPO zero for safety-critical tables derived from UN Liability Convention evidentiary requirements. MSA sign-off requirement added — customers must agree RTO/RPO before production deployment.

F2 — No restore time target or WAL retention period Fix applied (§26.6): WAL retained 30 days; base backups 90 days; safety-critical tables in MinIO Object Lock COMPLIANCE mode for 7 years. Restore time target < 30 minutes documented. docs/runbooks/db-restore.md designated as Phase 2 deliverable.

F3 — No runbook for prediction service outage during active re-entry event Fix applied (§26.8): New runbook row added to the required runbooks table covering: detection → 5-minute ANSP notification → incident commander designation → 15-minute update cadence → restoration checklist → PIR trigger. Full procedure in docs/runbooks/prediction-service-outage-during-active-event.md.

F4 — No chaos engineering / game day programme Fix applied (§26.8): Quarterly game day programme specified. 6 scenarios defined with inject, expected behaviour, and pass criteria. Scenario fail treated as SEV-2 with PIR. docs/runbooks/game-day-scenarios.md designated.

F5 — On-call rotation underspecified Fix applied (§26.8): 7-day rotation, minimum 2-engineer pool. L1 → L2 escalation trigger: 30 minutes without containment. L2 → L3 triggers enumerated (ANSP data affected, security breach, total outage > 15 min, regulatory notification triggered). On-call handoff log specified mirroring operator /handover model.

F6 — No P1/P2/P3 severity communication commitments Fix applied (§26.8): ANSP notification commitments per SEV level added. SEV-1 active TIP event: push + email within 5 minutes, 15-minute cadence. SEV-1 no active event: email within 15 minutes. SEV-2: email within 30 minutes if predictions affected. SEV-3/4: status page only.

F7 — No DR site or failover architecture Fix applied (§26.3): Cross-region warm standby architecture added. DB replica promoted on failover; app tier deployed from pre-pulled container images; MinIO bucket replication active; DNS health-check-based routing (TTL 60s). Estimated failover time < 15 minutes. Annual game day test (scenario 6). docs/runbooks/region-failover.md designated.

F8 — No post-incident review process Fix applied (§26.8): Mandatory PIR for all SEV-1 and SEV-2. Due within 5 business days. 7-section structure: summary, timeline, 5-whys root cause, contributing factors, impact, remediation actions (GitHub issues, incident-remediation label), what went well. Presented at engineering all-hands. Remediations are P2 priority.

F9 — alert_events not HMAC-protected Fix applied (§7.9, alert_events schema): record_hmac TEXT NOT NULL column added. Signing function specified (id, object_id, org_id, level, trigger_type, created_at, acknowledged_by, action_taken). Nightly Celery Beat integrity check re-verifies all events from past 24 hours; HMAC failure raises CRITICAL security alert. Existing alert_events_immutable trigger already prevents modification.
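The signing input fields above can be sketched as a canonical HMAC computation. The field list is the one stated in the fix; the canonicalisation scheme (pipe-joined values, `None` serialised as an empty string) and the function names are assumptions:

```python
import hashlib
import hmac

# Field order fixed so the signature is reproducible; list taken from F9.
SIGNED_FIELDS = ("id", "object_id", "org_id", "level", "trigger_type",
                 "created_at", "acknowledged_by", "action_taken")

def sign_alert_event(event: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical, order-fixed field string.

    None serialises as "" so a later acknowledgement changes the input
    and invalidates a stale signature.
    """
    canonical = "|".join(str(event.get(f)) if event.get(f) is not None else ""
                         for f in SIGNED_FIELDS)
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_alert_event(event: dict, key: bytes, stored_hmac: str) -> bool:
    # compare_digest avoids timing side channels on the comparison.
    return hmac.compare_digest(sign_alert_event(event, key), stored_hmac)
```

The nightly Celery task would call `verify_alert_event` for each row from the past 24 hours and raise the CRITICAL security alert on any mismatch.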

F10 — No incident communication templates Fix applied (§26.8): docs/runbooks/incident-comms-templates.md designated with 4 templates (initial notification, 15-min update, resolution, post-incident summary). Legal counsel review required before first use. Templates specify what never to include (speculation, premature ETAs, admissions of liability).

F11 — Operational and security incidents not separated Fix applied (§26.8): Operational vs. security incident comparison table added. Separate runbooks designated: docs/runbooks/operational-incident-response.md and docs/runbooks/security-incident-response.md. Security incidents: no public status page until legal counsel approves; DPO within 4 hours; NIS2/GDPR timelines from §29.6.


52.2 Sections Modified

| Section | Change |
| --- | --- |
| §26.2 Recovery Objectives | Derivation rationale column; MSA sign-off requirement |
| §26.3 High Availability Architecture | Cross-region warm standby DR strategy; component failover table; estimated recovery time |
| §26.6 Backup and Restore | WAL retention 30 days; restore time target < 30 min; MinIO Object Lock for 7-year legal hold; docs/runbooks/db-restore.md |
| §26.8 Incident Response | Prediction-service-outage runbook; on-call rotation spec + handoff log; ANSP comms per severity; PIR process; game day programme; incident comms templates; operational/security split |
| §7.9 Data Integrity | alert_events HMAC signing function; nightly integrity check Celery task |
| alert_events schema | record_hmac TEXT NOT NULL column added |

52.3 New Runbooks Required (Phase 2 deliverables)

| Runbook | Trigger |
| --- | --- |
| docs/runbooks/db-restore.md | Monthly restore test failure; DR failover |
| docs/runbooks/prediction-service-outage-during-active-event.md | SEV-1 during active TIP event |
| docs/runbooks/region-failover.md | Cloud region failure; annual game day |
| docs/runbooks/game-day-scenarios.md | Quarterly game day reference |
| docs/runbooks/incident-comms-templates.md | All SEV-1/2 incidents |
| docs/runbooks/operational-incident-response.md | All operational incidents |
| docs/runbooks/security-incident-response.md | All security incidents |
| docs/runbooks/on-call-handoff-log.md | Weekly rotation boundary |
| docs/post-incident-reviews/ | All SEV-1/2 incidents (within 5 business days) |

52.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| RTO/RPO as aspirational targets without derivation | Derive from operational requirements; document the rationale; agree in the MSA |
| Single-region deployment with a 1-hour RTO target | Warm standby in a second region; < 15 min estimated failover |
| Conflating operational and security incident response | Separate runbooks; different escalation chains; different communication rules |
| Improvised ANSP communications under pressure | Pre-drafted, legal-reviewed templates; deviations require incident commander approval |
| PIR as optional / informal | Mandatory for SEV-1/2; structured format; remediation tracking; all-hands presentation |
| Game day as a one-time activity | Quarterly rotation; each scenario tested at least annually; failures treated as SEV-2 |

52.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| DR strategy | Warm standby (second region) | Cold standby or active-active | Cold standby: restore time too slow for the RTO; active-active: complexity and cost disproportionate to Phase 1 scale; warm standby meets the RTO at acceptable cost |
| alert_events HMAC | Nightly batch verification | Per-request verification | Per-request adds latency to the alert delivery path; nightly batch catches tampering within 24 hours — adequate for evidentiary purposes |
| PIR timing | 5 business days | 24 hours / 30 days | 24 hours is too fast for a full 5-whys analysis; 30 days allows recurrence before remediation; 5 days balances speed with quality |
| Game day cadence | Quarterly | Monthly / annually | Monthly creates operational fatigue; annually is too infrequent to maintain muscle memory; quarterly is standard SRE practice |
| On-call escalation trigger | 30 minutes without containment | 15 minutes / 60 minutes | 15 minutes is too aggressive for complex incidents; 60 minutes risks an SLO breach before L2 is engaged; 30 minutes matches the active TIP event RTO window |

§51 Internationalisation / Localisation Engineering — Specialist Review

Standards basis: Unicode CLDR 44, IETF BCP 47, ISO 8601, ICAO Annex 2 / Annex 15 / Doc 8400 (UTC mandate), POSIX locale model, W3C Internationalisation guidelines, ICU MessageFormat 2.0, EU Regulation 2018/1139 (EASA language requirements)

Review scope: Timezone handling, date/time display, number/unit formatting, string externalisation, RTL layout, language coverage, ICAO UTC compliance, API date formats, database timezone storage


51.1 Findings and Fixes Applied

F1 — Operational times must be UTC; no local timezone conversion in ops interface Fix applied (§13.0): Iron UTC rule documented. All Persona A/C views display UTC only, formatted HH:MMZ or DD MMM YYYY HH:MMZ. Z suffix always inline, never a tooltip. No timezone conversion widget in operational interface. Local time permitted only in non-operational admin views with explicit timezone label. API times always ISO 8601 UTC.

F2 — ORM may silently convert TIMESTAMPTZ to session timezone Fix applied (§7.9): SET TIME ZONE 'UTC' enforced on every connection via SQLAlchemy engine event listener. Blocking integration test test_timestamps_round_trip_as_utc added — asserts that a known UTC datetime survives a full ORM insert/read cycle without offset conversion.
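The connection-level enforcement described above can be sketched with SQLAlchemy's `connect` event. The helper names are illustrative; the listener fires once per new DBAPI connection, before the connection enters the pool:

```python
from sqlalchemy import create_engine, event
from sqlalchemy.engine import Engine

def set_session_utc(dbapi_connection, connection_record):
    """Pin the session timezone so TIMESTAMPTZ values round-trip as UTC."""
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SET TIME ZONE 'UTC'")
    finally:
        cursor.close()

def enforce_utc(engine: Engine) -> None:
    """Attach the listener at engine creation so no query can bypass it."""
    event.listen(engine, "connect", set_session_utc)
```

Because the statement runs on every new connection rather than in application code, individual queries cannot accidentally observe a non-UTC session timezone.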

F3 — Re-entry window displayed without explicit UTC label Fix applied (§28.4): Rule 1 of probabilistic communication to non-specialists updated — all absolute times rendered as HH:MMZ per ICAO Doc 8400 UTC-suffix convention. Z suffix always rendered inline; never hover-only.

F4 — Number formatting not locale-aware in non-operational views Fix applied (§13.4): formatOperationalNumber() (ICAO decimal point, invariant) and formatDisplayNumber(locale) (Intl.NumberFormat, locale-aware) helpers specified. Raw Number.toString() and n.toFixed() banned from JSX.

F5 — No string externalisation strategy; hardcoded strings block localisation Fix applied (§13.5): next-intl adopted. All user-facing strings in messages/en.json. Message ID convention defined. eslint-plugin-i18n-json enforcement. ICAO-fixed strings explicitly excluded and annotated // ICAO-FIXED: do not translate.

F6 — NOTAM draft output must be ICAO English regardless of UI locale Fix applied (§6.13): NOTAM template strings hardcoded ICAO English phraseology in backend/app/modules/notam/templates.py, annotated # ICAO-FIXED: do not translate. Excluded from next-intl extraction. Preview renders in monospace font with lang="en" attribute.

F7 — Slash-delimited dates are ambiguous in exports Fix applied (§6.12): DD MMM YYYY format mandated for all PDF reports, CSV exports, and display previews (e.g. 04 MAR 2026). Slash-delimited dates banned from all SpaceCom outputs. Times alongside dates use HH:MMZ. NOTAM internal YYMMDDHHMM fields displayed as DD MMM YYYY HH:MMZ in preview.
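A minimal formatting helper for the mandated form, assuming UTC-aware inputs; the function names are illustrative. A fixed month table is used because `strftime("%b")` is locale-dependent and would vary across deployment locales:

```python
from datetime import datetime, timezone

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def format_export_date(dt: datetime) -> str:
    """DD MMM YYYY, e.g. 04 MAR 2026 — locale-invariant by construction."""
    return f"{dt.day:02d} {MONTHS[dt.month - 1]} {dt.year}"

def format_export_datetime(dt: datetime) -> str:
    """Date plus HH:MMZ time, always converted to UTC first."""
    dt = dt.astimezone(timezone.utc)
    return f"{format_export_date(dt)} {dt.hour:02d}:{dt.minute:02d}Z"
```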

F8 — RTL layout not considered; directional CSS utilities used Fix applied (§13.5): CSS logical properties table specified (margin-inline-start etc. replacing ml-/mr-). <html dir="ltr"> hardcoded for Phase 1; becomes dir={locale.dir} when RTL locale added — no component changes required. docs/ADDING_A_LOCALE.md checklist includes RTL gate.

F9 — Altitude units inconsistent between aviation and space personas Fix applied (users table, §13.5): altitude_unit_preference column added to users table (ft default for ANSP operators, km for space operators). API transmits metres; display layer converts. Unit label always visible. FL notation shown in parentheses for ft context. User can override in account settings.

F10 — API date formats inconsistent (Unix timestamps vs. ISO 8601) Fix applied (§14 API Versioning Policy): ISO 8601 UTC (2026-03-22T14:00:00Z) mandated for all API date fields. OpenAPI format: date-time on all _at/_time fields. Blocking contract test asserts regex match. Pydantic json_encoders specified.
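A sketch of the serialisation contract: one helper that emits the single permitted form, plus the kind of regex the blocking contract test might assert (the exact pattern is an assumption). Rejecting naive datetimes makes a missing-timezone bug fail loudly rather than silently shift times:

```python
import re
from datetime import datetime, timezone

# Assumed contract pattern: second-precision ISO 8601 with a literal Z.
ISO_UTC_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def to_api_datetime(dt: datetime) -> str:
    """Serialise any aware datetime to the single permitted API form."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime reached the serialisation layer")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```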

F11 — Language coverage undefined; English-only now but architecture must support future localisation Fix applied (§13.5): English-only explicitly committed for Phase 1. next-intl architecture allows adding a locale by adding messages/{locale}.json only — no component changes. messages/fr.json and messages/de.json scaffolded at Phase 2/3 start. docs/ADDING_A_LOCALE.md checklist documented.


51.2 Sections Modified

| Section | Change |
| --- | --- |
| §6.12 Report Generation | DD MMM YYYY date format rule; slash-delimited dates banned |
| §6.13 NOTAM Drafting Workflow | ICAO-FIXED template rule; lang="en" on NOTAM container |
| §7.9 Data Integrity | SET TIME ZONE 'UTC' connection event listener; test_timestamps_round_trip_as_utc integration test |
| §13.0 Accessibility Standard Commitment | UTC-only rule added |
| §13.4 Colour and Contrast Specification | formatOperationalNumber / formatDisplayNumber helpers; Intl.NumberFormat mandate |
| §13.5 (new) Internationalisation Architecture | next-intl; messages/en.json; ICAO-FIXED exclusions; CSS logical properties; altitude unit display; docs/ADDING_A_LOCALE.md checklist |
| §14 API Versioning Policy | ISO 8601 UTC contract; OpenAPI format: date-time; contract test; Pydantic encoder |
| §28.4 Probabilistic Communication | HH:MMZ inline UTC suffix rule |
| users table | altitude_unit_preference column added |

51.3 New Files

| File | Purpose |
| --- | --- |
| messages/en.json | Phase 1 string source of truth for next-intl |
| messages/fr.json | Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy) |
| messages/de.json | Phase 3 scaffold |
| docs/ADDING_A_LOCALE.md | Step-by-step checklist for adding a new locale; includes RTL gate |
| frontend/src/lib/formatters.ts | formatOperationalNumber, formatDisplayNumber, formatUtcTime, formatUtcDate helpers |
| tests/test_db_timezone.py | Blocking integration test for UTC round-trip integrity |

51.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Displaying local time in the ops interface | UTC only; HH:MMZ always; no conversion widget |
| Number.toString() or n.toFixed() in JSX | formatOperationalNumber() (ICAO) or formatDisplayNumber(locale) depending on context |
| 03/04/2026 in any export or report | 04 MAR 2026 — the unambiguous ICAO-aligned format |
| Translating NOTAM template strings | ICAO-FIXED; annotate and exclude from i18n tooling |
| Positive tabindex (already covered in §50) | Never; noted here because it is also an i18n anti-pattern (breaks RTL reading order) |
| Hardcoded margin-left in new components | margin-inline-start; logical properties throughout |
| Multiple API date formats in the same response | ISO 8601 UTC only; one format, no exceptions |

51.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Operational time display | UTC-only, HH:MMZ inline | User-selectable timezone | ICAO Annex 15 mandates UTC for aeronautical data; a timezone selector introduces conversion errors under time pressure |
| Date format in exports | DD MMM YYYY | ISO 8601 (2026-03-04) | ISO 8601 is unambiguous but unfamiliar to aviation professionals; DD MMM YYYY matches aviation document convention (NOTAM, METAR) and is equally unambiguous |
| Phase 1 language scope | English only | Multi-language from Phase 1 | Localisation adds QA overhead and translation cost before product-market fit is proven; the architecture supports future locales without rework |
| i18n library | next-intl | react-i18next | next-intl has first-class App Router RSC support; react-i18next requires client-component wrapping for all translated text |
| Altitude storage unit | Metres (API + DB) | Role-dependent storage | A single SI storage unit eliminates conversion bugs in the physics engine; display conversion is well-understood and testable |
| ORM timezone enforcement | Engine event listener (SET TIME ZONE 'UTC') | Application-level assertion | The engine listener fires at connection creation and cannot be bypassed by individual queries; application assertions can be accidentally omitted |

§53 Machine Learning / Data Science — Specialist Review

Standards basis: ISO/IEC 22989, ECSS-E-ST-10-04C, IADC Space Debris Mitigation Guidelines, ESA DRAMA methodology, Vallado (2013), JB2008, NRLMSISE-00, FAA Order 8040.4B, EU AI Act Art. 10

Review scope: Conjunction Pc model, SGP4 domain, atmospheric density model selection, MC convergence, survival probability, model versioning, TLE age uncertainty, backcasting, input validation, tail risk, data provenance


53.1 Findings and Fixes Applied

F1 — Conjunction probability model methodology unspecified Fix applied (§15.5): Alfano (2005) 2D Gaussian method already specified. Validity domain added: three degradation conditions (sub-100m close approach, anisotropic covariance > 100:1, Pc < 1×10⁻¹⁵ floor). API response carries pc_validity and pc_validity_warning fields. Reference test suite added against Vallado & Alfano (2009) published cases with 5% tolerance.

F2 — SGP4 used beyond valid domain without sub-150 km guard Fix applied (§15.1): Sub-150 km LOW_CONFIDENCE_PROPAGATION flag added to decay predictor. UI badge: "⚠ Re-entry imminent — prediction confidence low." BLOCKING unit test: TLE with perigee 120 km → asserts flag is set.
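The guard itself is a one-line rule; a minimal sketch with illustrative names, flagging rather than refusing so the UI can still badge the prediction:

```python
# SGP4's drag model is not trustworthy below roughly 150 km perigee (F2).
LOW_CONFIDENCE_PERIGEE_KM = 150.0

def propagation_flags(perigee_km: float) -> list[str]:
    """Return confidence flags for a propagation run; empty when nominal."""
    if perigee_km < LOW_CONFIDENCE_PERIGEE_KM:
        return ["LOW_CONFIDENCE_PROPAGATION"]
    return []
```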

F3 — Atmospheric density model not justified vs. JB2008 Fix applied (§15.2): NRLMSISE-00 Phase 1 selection rationale documented (Python binding maturity, acceptable accuracy at moderate F10.7). Known limitations stated. Phase 2 milestone: evaluate JB2008 on backcasts; migrate if MAE improvement > 15%; ADR 0016. Input validity bounds added: F10.7 [65, 300], Ap [0, 400], altitude [85, 1000] km; violation raises AtmosphericModelInputError.

F4 — MC sample count not justified by convergence analysis Fix applied (§15.2/§15.4): Convergence table added. N = 500 satisfies < 2% corridor area change between doublings on the reference object. N = 1000 for OOD or storm-warning cases. MC output updated to include p01 and p99.
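The convergence criterion from the table reduces to a relative-change check between sample-count doublings; a sketch with assumed names:

```python
def corridor_converged(area_at_n: float, area_at_2n: float,
                       tolerance: float = 0.02) -> bool:
    """Converged when corridor area changes by < 2% between N and 2N samples.

    area_at_n / area_at_2n are corridor areas from runs with N and 2N
    Monte Carlo samples respectively (any consistent area unit).
    """
    return abs(area_at_2n - area_at_n) / area_at_n < tolerance
```

Under this rule, N = 500 passes for the reference object; the check would be repeated for OOD or storm-warning cases before accepting N = 1000.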

F5 — Survival probability methodology absent Fix applied (§15.3): survival_probability, survival_model_version, survival_model_note columns added to reentry_predictions. Phase 1: simplified analytical all-survive/no-survive per material class. Phase 2: ESA DRAMA integration. NOTAM (E) field statement driven by survival_probability.

F6 — No model version governance or reproducibility Fix applied (§15.6 new): MAJOR/MINOR/PATCH version bump policy defined. Old versions retained in git tags and physics/versions/. POST /decay/predict/reproduce endpoint specified — re-runs with original model version and params for regulatory audit.

F7 — TLE age not a formal uncertainty source Fix applied (§15.2): Linear inflation model added: uncertainty_multiplier = 1 + 0.15 × tle_age_days applied to ballistic coefficient covariance before MC sampling. tle_age_at_prediction_time and uncertainty_multiplier stored in simulations.params_json and returned in API response.
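The inflation model stated above is linear and directly expressible; the function name is illustrative, the formula is the one in the fix:

```python
def tle_age_uncertainty_multiplier(tle_age_days: float) -> float:
    """Covariance inflation before MC sampling: 1 + 0.15 per day of TLE age."""
    if tle_age_days < 0:
        raise ValueError("TLE epoch is in the future")
    return 1.0 + 0.15 * tle_age_days
```

A 2-day-old TLE thus inflates the ballistic-coefficient covariance by a factor of 1.3.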

F8 — No model performance monitoring or drift detection Fix applied (§15.9 new): reentry_backcasts table specified. Celery task triggered on object status = 'decayed'; compares all 72h predictions to confirmed re-entry time. Rolling 30-prediction MAE nightly; MEDIUM alert if MAE > 2× historical baseline. Admin panel "Model Performance" widget.
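The drift rule reduces to a rolling mean of absolute errors compared against the baseline; a sketch with assumed names and error lists in hours:

```python
def rolling_mae(abs_errors_hours: list[float], window: int = 30) -> float:
    """Mean absolute error over the most recent `window` backcasts."""
    recent = abs_errors_hours[-window:]
    return sum(recent) / len(recent)

def drift_detected(abs_errors_hours: list[float], baseline_mae: float) -> bool:
    """MEDIUM alert condition: rolling MAE exceeds twice the baseline."""
    return rolling_mae(abs_errors_hours) > 2.0 * baseline_mae
```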

F9 — Input data quality gates insufficient Fix applied (§15.7 new): validate_prediction_inputs() function in backend/app/modules/physics/validation.py. Validates TLE epoch age ≤ 30 days, F10.7/Ap/perigee bounds, mass > 0. Returns structured ValidationError list; endpoint returns 422. All validation paths covered by BLOCKING unit tests.
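A sketch of the structured gate, assuming a flat dict input; field names and the dataclass shape are assumptions, while the bounds are the ones stated in F9 and F3:

```python
from dataclasses import dataclass

@dataclass
class ValidationError:
    field: str
    message: str

def validate_prediction_inputs(inputs: dict) -> list[ValidationError]:
    """Collect all violations rather than failing on the first, so the 422
    response can report every bad field at once."""
    errors: list[ValidationError] = []
    if inputs.get("tle_epoch_age_days", 0.0) > 30:
        errors.append(ValidationError("tle_epoch_age_days", "TLE epoch older than 30 days"))
    if not 65 <= inputs.get("f107", 0.0) <= 300:
        errors.append(ValidationError("f107", "F10.7 outside [65, 300]"))
    if not 0 <= inputs.get("ap", -1.0) <= 400:
        errors.append(ValidationError("ap", "Ap outside [0, 400]"))
    if not 85 <= inputs.get("perigee_km", 0.0) <= 1000:
        errors.append(ValidationError("perigee_km", "perigee outside [85, 1000] km"))
    if inputs.get("mass_kg", 0.0) <= 0:
        errors.append(ValidationError("mass_kg", "mass must be > 0"))
    return errors
```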

F10 — Tail risks not communicated; only p5–p95 shown Fix applied (§28.4, reentry_predictions schema): p01_reentry_time and p99_reentry_time columns added. Tail risk annotation displayed when the p1–p99 range > 1.5× the p5–p95 range: "Extreme case (1% probability outside): p01Z – p99Z." Included as a NOTAM draft footnote when the condition is met.
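The display condition is a simple width comparison; a sketch with assumed names, taking quantiles as epoch offsets (any consistent time unit works):

```python
def tail_annotation_required(p05: float, p95: float,
                             p01: float, p99: float) -> bool:
    """Annotate tail risk when the p01-p99 window is more than 1.5x wider
    than the p5-p95 window (the F10 display condition)."""
    return (p99 - p01) > 1.5 * (p95 - p05)
```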

F11 — No training/validation data provenance Fix applied (§15.8 new): Phase 1 explicitly documented as physics-based with no trained ML components. docs/ml/data-provenance.md designated. EU AI Act Art. 10 compliance mapped to input data provenance (tracked in simulations.params_json). Future ML component protocol: training data, validation split, model card in docs/ml/model-card-{component}.md.


53.2 Sections Modified

| Section | Change |
| --- | --- |
| §15.1 Catalog Propagator | Sub-150 km LOW_CONFIDENCE_PROPAGATION flag + unit test |
| §15.2 Decay Predictor | NRLMSISE-00 selection rationale vs. JB2008; input bounds; TLE age inflation model; MC convergence table; N=1000 for OOD/storm cases |
| §15.3 Atmospheric Breakup Model | survival_probability / survival_model_version / survival_model_note columns; Phase 1 analytical methodology |
| §15.5 Conjunction Pc | Validity domain (3 degradation conditions); pc_validity API fields; Vallado & Alfano reference test suite |
| §15.6 (new) Model Version Governance | MAJOR/MINOR/PATCH policy; version retention; reproduce endpoint |
| §15.7 (new) Prediction Input Validation | validate_prediction_inputs(); 5 validation rules; 422 response; BLOCKING tests |
| §15.8 (new) Data Provenance | Phase 1 no-ML declaration; EU AI Act Art. 10 mapping; future ML component protocol |
| §15.9 (new) Backcasting Validation | reentry_backcasts table; Celery trigger on decay; rolling MAE drift detection; admin panel widget |
| §28.4 Probabilistic Communication | Tail risk annotation (rule 6); p01/p99 display condition; NOTAM footnote |
| reentry_predictions schema | p01_reentry_time, p99_reentry_time, survival_probability, survival_model_version, survival_model_note |

53.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| reentry_backcasts table | Prediction vs. actual comparison; drift detection input |
| docs/ml/data-provenance.md | Phase 1 no-ML declaration; future ML data provenance template |
| docs/ml/model-card-{component}.md | Template for any future learned component |
| docs/adr/0016-atmospheric-density-model.md | NRLMSISE-00 vs. JB2008 decision; Phase 2 evaluation trigger |
| backend/app/modules/physics/validation.py | validate_prediction_inputs() function |
| tests/physics/test_pc_compute.py | Vallado & Alfano reference cases (BLOCKING) |

53.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Displaying only p5–p95 without tail annotation | Add p1/p99 as an explicit tail risk annotation when materially wider |
| Silently clamping out-of-range inputs | Reject with a structured ValidationError; the operator must correct the input |
| Deleting old model versions on update | Tag and retain; the reproduce endpoint requires historical version access |
| Treating TLE age as display-only staleness | TLE age is a formal uncertainty source; inflate MC covariance accordingly |
| Choosing an atmospheric model without documented rationale | Document the selection vs. alternatives; schedule re-evaluation with an objective criterion |
| No feedback loop from confirmed re-entries | The backcasting pipeline closes the loop; MAE monitoring detects drift |

53.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Phase 1 atmospheric model | NRLMSISE-00 | JB2008 | Mature Python binding; acceptable accuracy at moderate F10.7; JB2008 evaluation deferred to Phase 2 with an objective trigger |
| Pc method | Alfano (2005) 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and widely accepted; MC Pc reserved for Phase 3 high-Pc cases where the Gaussian assumption breaks down |
| MC convergence criterion | < 2% corridor area change between N doublings | Fixed N from literature | A fixed N is arbitrary; the convergence criterion is object-class specific and reproducible |
| Tail risk display threshold | p1–p99 > 1.5× p5–p95 | Always show / never show | Always showing creates visual clutter for well-constrained predictions; never showing hides operationally relevant uncertainty; the threshold balances both |
| Model version retention | Git tags + physics/versions/ directory | Docker image tags only | Docker images are routinely pruned; git tags are permanent; the reproduce endpoint needs the actual code, not just an image |

§54 Technical Documentation / Developer Experience — Specialist Review

Standards basis: OpenAPI 3.1, Keep a Changelog, Conventional Commits, Nygard ADR format, WCAG authoring guidance, MkDocs Material, spectral OpenAPI linting, ESA ECSS documentation requirements

Review scope: OpenAPI spec governance, health endpoint coverage, contribution workflow, ADR process, changelog discipline, developer onboarding, response examples, SDK strategy, runbook structure, docs pipeline, AI assistance declaration


54.1 Findings and Fixes Applied

F1 — OpenAPI spec not declared as source of truth Fix applied (§14 API Versioning Policy): FastAPI's built-in OpenAPI generation is declared as the sole source of truth. make generate-openapi regenerates openapi.yaml. CI runs openapi-diff --fail-on-incompatible to detect uncommitted drift. The spec is input to Swagger UI, Redoc, contract tests, and the SDK generator.

F2 — No /health or /readiness endpoint specified Fix applied (§14 System endpoints): New System (no auth required) group added. GET /health — liveness probe; process-alive check only. GET /readyz — readiness probe; checks PostgreSQL, Redis, Celery queue depth; returns 503 when any dependency is unhealthy. Both used by Kubernetes probes, load balancers, and DR automation DNS-flip gate (§26.3). Both included in OpenAPI spec.

F3 — CONTRIBUTING.md absent Fix applied (§13.6 new): Full contribution workflow documented. Branch naming convention table (feature/fix/chore/release/hotfix), main branch protection (1 approval, all checks pass, no force-push), Conventional Commits commit format, PR template with checklist (test, openapi regeneration, CHANGELOG, axe-core, ADR), 1-business-day review SLA, stale PR automation.

F4 — No ADR process Fix applied (§13.7 new): ADR process specified using Nygard format in docs/adr/NNNN-title.md. Trigger criteria defined (hard-to-reverse decisions, auditor context, procurement evidence). Standard template specified. Known ADR register table provided with 6 existing entries. Phase 2 ESA submission gate: all referenced ADR numbers must have corresponding files.

F5 — Changelog discipline unspecified Fix applied (§14 API Versioning Policy): Keep a Changelog format + Conventional Commits declared. [Unreleased] section with Added/Changed/Fixed/Deprecated subsections required on every PR with user-visible effect. make changelog-check CI step fails if [Unreleased] is empty for non-chore/docs commits. Release changelogs drive API key holder notifications and GitHub release notes.

F6 — Developer environment setup undocumented Fix applied (§13.8 new): docs/DEVELOPMENT.md spec covering: prerequisites (Python 3.11 pinned, Node.js 20, Docker Desktop, make), make dev-up / migrate / seed / dev bootstrap sequence, make test / test-backend / test-frontend / test-e2e commands, local URL map (API, Swagger UI, frontend, MinIO). 30-minute onboarding target. .env.example committed; .env in .gitignore.

F7 — OpenAPI response examples not required Fix applied (§14 API Versioning Policy): All endpoint schemas must include at least one examples: block. Enforced by spectral lint with custom require-response-example rule in CI. Example YAML fragment provided for GET /objects/{norad_id}. Examples serve: Swagger/Redoc docs, contract test fixtures, ESA auditor readability.

F8 — No SDK or client library strategy Fix applied (§14 API Versioning Policy): Phase 1 — no SDK; ANSP integrators receive openapi.yaml, docs/integration/ quickstarts (Python httpx/requests, TypeScript), and Postman-importable spec. Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate with openapi-generator-cli targeting Python and TypeScript. Generator config committed to tools/sdk-generator/. Published as spacecom-client PyPI and @spacecom/client npm packages.

F9 — Runbooks named but not templated Fix applied (§26.8 new subsection): Standard runbook template specified with 7 sections: Triggers, Immediate actions (first 5 minutes), Diagnosis, Resolution steps, Verification, Escalation, Post-incident. Last tested frontmatter field required. make runbook-audit CI check warns if any runbook is older than 12 months. Template preempts the most common incident-pressure failures: vague steps, no expected output, missing escalation path.

F10 — No docs-as-code pipeline Fix applied (§13.9 new): MkDocs Material as the documentation site generator. mkdocs build --strict in CI fails on broken links and missing pages. markdown-link-check for external links. vale prose style linter. openapi-diff spec drift check. ESA submission artefact: static HTML archived as docs-site-{version}.zip in release assets — reproducible point-in-time snapshot. owner: frontmatter field with quarterly docs-review cron issue.

F11 — AGENTS.md scope vs. MASTER_PLAN undefined Fix applied (§1 Vision): AI-assisted development policy added. Defines: permitted uses (code generation, refactoring, review, documentation drafting), prohibited uses (autonomous decisions on safety-critical algorithms, auth logic, regulatory compliance text; production credentials; personal data). Human review standards apply identically to AI-generated code. ESA procurement statement: human engineers are sole responsible parties regardless of authoring tool.


54.2 Sections Modified

| Section | Change |
| --- | --- |
| §1 Vision | AI-assisted development policy; AGENTS.md scope declaration; ESA procurement statement |
| §13.6 (new) Contribution Workflow | Branch naming; commit format; PR template; review SLA; main protection |
| §13.7 (new) Architecture Decision Records | Nygard ADR format; trigger criteria; template; known ADR register; Phase 2 ESA gate |
| §13.8 (new) Developer Environment Setup | docs/DEVELOPMENT.md spec; make targets; 30-minute onboarding target; .env.example policy |
| §13.9 (new) Docs-as-Code Pipeline | MkDocs Material; CI checks (strict, link, vale, openapi-diff); ESA artefact; docs ownership |
| §14 API Versioning Policy | OpenAPI as source of truth; make generate-openapi; CI drift check; changelog discipline; response examples mandate; client SDK strategy |
| §14 System Endpoints (new) | GET /health liveness spec; GET /readyz readiness spec with example responses |
| §26.8 Incident Response | Runbook standard structure template; Last tested field; make runbook-audit |

54.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| CONTRIBUTING.md | Branch naming, commit format, PR template, review SLA |
| CHANGELOG.md | Keep a Changelog format; [Unreleased] driven by PRs; release notes source |
| docs/adr/NNNN-*.md | Architecture Decision Records (Nygard format) |
| docs/DEVELOPMENT.md | Developer onboarding; make targets; environment bootstrap |
| docs/ADDING_A_LOCALE.md | Locale addition checklist (already referenced in §13.5) |
| docs/integration/ | ANSP quickstart guides (Python, TypeScript) |
| tools/sdk-generator/ | openapi-generator-cli config for Phase 2 SDK generation |
| .github/pull_request_template.md | PR checklist enforcing OpenAPI regeneration, CHANGELOG, axe-core, ADR |
| .spectral.yaml | Custom spectral ruleset including require-response-example |
| .vale.ini | Prose style linter config for docs |
| mkdocs.yml | MkDocs Material configuration |
| docs/runbooks/*.md | All runbooks follow the standard template with Last tested frontmatter |

54.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Maintaining a separate OpenAPI spec alongside FastAPI routes | Generate from code; enforce with a CI drift check |
| Undocumented GET /health with an ad-hoc response shape | Specify the schema, document it in OpenAPI, use it in DR automation |
| New engineers learning the codebase by asking colleagues | docs/DEVELOPMENT.md with a 30-min onboarding target; make dev brings up everything |
| Architectural decisions in Slack or PR comments | ADR in docs/adr/; permanent and findable by auditors and new engineers |
| Runbooks written for the first time during an incident | Template-first; test in a game day before needed |
| Publishing an API with no response examples | spectral enforces examples: blocks; Swagger UI shows realistic data |
| Building an SDK before customers ask | Phase 2 gate: ≥ 2 ANSP requests; Phase 1 is openapi.yaml + quickstarts |

54.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| OpenAPI generation direction | Code → spec (FastAPI auto-generation) | Spec → code (contract-first with codegen) | Team is Python-first; FastAPI's generation is high-fidelity; contract-first adds a separate edit step without meaningful quality gain at Phase 1 scale |
| SDK strategy | Generated from spec (Phase 2) | Hand-crafted SDK | Generated SDK stays in sync with the spec automatically; hand-crafted SDKs drift; generation deferred until customer demand justifies the maintenance cost |
| Documentation tooling | MkDocs Material | Docusaurus, GitBook | MkDocs Material is Python-native (same toolchain as backend); mkdocs build --strict provides CI integration; no JS toolchain dependency for docs |
| ADR format | Nygard (Context/Decision/Consequences) | MADR, RFC-style | Nygard is the most widely used format, recognised by ESA/public-sector auditors; minimal overhead |
| AI assistance declaration | Explicit policy in §1 Vision | Silent (no declaration) | ESA and EASA increasingly require disclosure of AI tool use in safety-relevant software; proactive disclosure pre-empts audit questions and demonstrates process maturity |

§55 Multi-Tenancy, Billing & Org Management — Specialist Review

Standards basis: GDPR Art. 17/20, PCI-DSS (if card payments introduced), SaaS subscription billing conventions, PostgreSQL Row Level Security documentation, Celery priority queue documentation, ICAO Annex 11 (operator accountability)
Review scope: Data isolation, subscription tier model, usage metering, org lifecycle, API key governance, quota enforcement, queue fairness, audit log access, billing data model, data portability


55.1 Findings and Fixes Applied

F1 — No row-level tenant isolation strategy defined Fix applied (§7.2): Comprehensive RLS policy table added covering all 8 organisation_id-carrying tables. spacecom_worker database role specified as the only BYPASSRLS principal. BLOCKING integration test specified: query as Org A session; assert zero Org B rows across all tenanted tables.
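A minimal sketch of the RLS pattern F1 describes, for one tenanted table; the policy name and the `app.current_org` session setting are illustrative assumptions, not the plan's authoritative names:

```sql
-- Illustrative RLS policy; repeat per organisation_id-carrying table.
ALTER TABLE alert_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE alert_events FORCE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON alert_events
  USING (organisation_id = current_setting('app.current_org')::uuid);

-- spacecom_worker is the only principal allowed to bypass RLS.
ALTER ROLE spacecom_worker BYPASSRLS;
```

The application sets `app.current_org` at session checkout, so a missing WHERE clause in application code cannot leak cross-tenant rows.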

F2 — Subscription tiers and feature flags not specified Fix applied (§16.1 new): Tier table defined (shadow_trial, ansp_operational, space_operator, institutional, internal) with per-tier MC concurrency, prediction quota, and feature access. require_tier() FastAPI dependency pattern specified. TIER_MC_CONCURRENCY dict ties limits to tier. Tier changes take immediate effect (no session cache).
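The require_tier() dependency pattern can be sketched as plain Python; the tier ordering, the per-tier concurrency values, and the TierForbidden exception name are assumptions — §16.1 holds the authoritative tier table:

```python
# Sketch of the require_tier() gating pattern (assumed tier order and limits).
TIER_ORDER = ["shadow_trial", "ansp_operational", "space_operator",
              "institutional", "internal"]

TIER_MC_CONCURRENCY = {  # illustrative per-tier MC concurrency limits
    "shadow_trial": 1,
    "ansp_operational": 4,
    "space_operator": 4,
    "institutional": 8,
    "internal": 8,
}

class TierForbidden(Exception):
    """Maps to HTTP 403 in the real FastAPI dependency."""

def require_tier(minimum: str):
    """Return a checker for 'org must be at or above `minimum` tier'.

    In FastAPI this is wrapped with Depends() and reads the org's tier
    from the database on every request (no session cache, per §16.1).
    """
    def check(org_tier: str) -> str:
        if TIER_ORDER.index(org_tier) < TIER_ORDER.index(minimum):
            raise TierForbidden(f"endpoint requires tier >= {minimum}")
        return org_tier
    return check
```

Because the tier is read per request, a downgrade takes effect on the very next call.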

F3 — Usage metering not modelled Fix applied (§9.2): usage_events table added — append-only, immutable trigger, indexed by (organisation_id, billing_period, event_type). Billable event types: decay_prediction_run, conjunction_screen_run, report_export, api_request, mc_quota_exhausted, reentry_plan_run. Powers org admin usage dashboard and upsell trigger.
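The append-only guard and index F3 describes can be sketched in DDL; the trigger and function names are illustrative:

```sql
-- Reject any UPDATE or DELETE on usage_events (append-only, immutable).
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'usage_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER usage_events_immutable
  BEFORE UPDATE OR DELETE ON usage_events
  FOR EACH ROW EXECUTE FUNCTION reject_mutation();

-- Hot path for the org admin dashboard and billing aggregation.
CREATE INDEX idx_usage_events_org_period
  ON usage_events (organisation_id, billing_period, event_type);
```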

F4 — Organisation onboarding and offboarding procedures absent Fix applied (§29.8 new): Onboarding gate checklist specified (MSA, export control, Space-Track, billing contact, org_admin user, ToS). Offboarding 8-step procedure with timing, owner, and GDPR Art. 17 vs. retention resolution. Suspension vs. churn distinction documented. docs/runbooks/org-onboarding.md designated.

F5 — API key lifecycle lacks org-level service account concept Fix applied (§9.2 api_keys table): is_service_account column added; user_id made nullable for service account keys; service_account_name required when is_service_account = TRUE; revoked_by column added for org_admin audit trail. CHECK constraints enforce mutual exclusivity. Org admin can see and revoke all org keys via GET/DELETE /org/api-keys.

F6 — Concurrent prediction limit not persisted and not tier-linked Fix applied (§16.1, Celery section): acquire_mc_slot now derives limit from org_tier via get_mc_concurrency_limit_by_tier(). Quota exhaustion writes usage_events row with event_type = 'mc_quota_exhausted'. Org admin usage dashboard shows hits per billing period with upgrade prompt if hits ≥ 3.
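The tier-driven slot gate can be sketched as follows; the in-memory dict and list stand in for the Redis counter and the usage_events insert, and the per-tier limits are assumptions:

```python
# Sketch of the tier-driven MC concurrency gate described in F6.
TIER_MC_CONCURRENCY = {"shadow_trial": 1, "ansp_operational": 4,
                       "space_operator": 4, "institutional": 8, "internal": 8}

def get_mc_concurrency_limit_by_tier(tier: str) -> int:
    return TIER_MC_CONCURRENCY.get(tier, 1)  # unknown tier: most restrictive

def acquire_mc_slot(active_slots: dict, usage_events: list,
                    org_id: str, org_tier: str) -> bool:
    """Claim a slot and return True, or record quota exhaustion and refuse."""
    limit = get_mc_concurrency_limit_by_tier(org_tier)
    if active_slots.get(org_id, 0) >= limit:
        usage_events.append({"organisation_id": org_id,
                             "event_type": "mc_quota_exhausted"})
        return False
    active_slots[org_id] = active_slots.get(org_id, 0) + 1
    return True
```

Each refusal leaves a usage_events row, which is what drives the upgrade prompt after three hits in a billing period.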

F7 — No org-level admin role Fix applied (§7.2 RBAC table, users.role CHECK): org_admin role added between operator and admin. Permissions: manage users within own org (up to operator), manage own org's API keys, view own org's audit log, update billing contact. Cannot cross org boundaries or assign admin/org_admin without system admin.

F8 — Shared Celery queues with no per-org priority Fix applied (Celery Queue section): TIER_TASK_PRIORITY table (3–9 by tier) with CRITICAL_EVENT_PRIORITY_BOOST = 2 when active TIP event exists. get_task_priority() function specified. Priority submitted via apply_async(priority=...). Redis noeviction policy supports native Celery priorities 0–9.
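A sketch of get_task_priority(); the per-tier base values are assumptions within the 3–9 range, and the scale direction follows the plan's convention that the boost raises urgency:

```python
# Sketch of the per-org Celery priority map described in F8 (assumed values).
TIER_TASK_PRIORITY = {"shadow_trial": 3, "ansp_operational": 7,
                      "space_operator": 6, "institutional": 8, "internal": 5}
CRITICAL_EVENT_PRIORITY_BOOST = 2
MAX_PRIORITY = 9  # top of the 0-9 priority range

def get_task_priority(org_tier: str, active_tip_event: bool = False) -> int:
    """Base priority by tier, boosted (and capped) during an active TIP event."""
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    if active_tip_event:
        base = min(base + CRITICAL_EVENT_PRIORITY_BOOST, MAX_PRIORITY)
    return base

# Submitted per task, e.g.:
#   run_decay_prediction.apply_async(args=[...], priority=get_task_priority(tier))
```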

F9 — No tenant-scoped audit log API Fix applied (§14 Org Admin endpoints): GET /org/audit-log added — paginated, filtered by organisation_id, supports ?from=&to=&event_type=&user_id=. Sources security_logs and alert_events. Accessible to org_admin and admin. Required by enterprise SaaS compliance expectations.

F10 — Billing data model absent Fix applied (§9.2): billing_contacts table (email, name, address, VAT, PO reference), subscription_periods table (immutable billing history with tier, dates, monthly fee, invoice reference). PATCH /org/billing endpoint for org_admin self-service updates. Phase 1 billing is manual; invoice_ref field accommodates future Stripe or Lago integration.

F11 — No org data export or portability mechanism Fix applied (§14 Org Admin endpoints, §29.2): POST /org/export endpoint added — async job, delivers signed ZIP within 3 business days. Used for GDPR Art. 20 portability and offboarding. §29.2 portability row updated with endpoint reference and scope clarification (user-generated content, not derived predictions).


55.2 Sections Modified

| Section | Change |
| --- | --- |
| §7.2 RBAC | org_admin role added; comprehensive RLS policy table; spacecom_worker BYPASSRLS principal; users.role CHECK constraint updated |
| §9.2 api_keys | is_service_account, service_account_name, revoked_by columns; CHECK constraints; service account index |
| §9.2 (new tables) | usage_events, billing_contacts, subscription_periods |
| §14 Org Admin endpoints (new group) | 10 org_admin-scoped endpoints covering users, API keys, audit log, usage, billing, and data export |
| §14 Admin endpoints | GET /admin/organisations, POST /admin/organisations, PATCH /admin/organisations/{id} added |
| §16.1 (new) Subscription Tiers | Tier table; require_tier() pattern; TIER_MC_CONCURRENCY; tier change immediacy |
| Celery Queue section | TIER_TASK_PRIORITY priority map; CRITICAL_EVENT_PRIORITY_BOOST; get_task_priority() function |
| MC concurrency gate | acquire_mc_slot now tier-driven; quota exhaustion writes usage_events |
| §29.2 Data Subject Rights | Portability row updated with POST /org/export endpoint and scope |
| §29.8 (new) Org Onboarding/Offboarding | 6-gate onboarding checklist; 8-step offboarding procedure; suspension vs. churn distinction |

55.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| usage_events table | Billable event metering; org admin dashboard; quota exhaustion signal |
| billing_contacts table | Invoice address, VAT, PO number per org |
| subscription_periods table | Immutable billing history; Phase 2 invoice integration anchor |
| docs/runbooks/org-onboarding.md | Onboarding gate checklist; provisioning procedure |
| backend/app/modules/billing/tiers.py | get_mc_concurrency_limit_by_tier() and TIER_TASK_PRIORITY |

55.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Relying solely on application-layer WHERE organisation_id = X | RLS at database layer; application filter is defence-in-depth only |
| Role model with only system-wide admin | org_admin for self-service tenant management; admin for cross-org system operations |
| Flat API key model with no service accounts | Service account keys (user_id IS NULL) for system integrations; org admin can audit and revoke all keys |
| Sharing Celery queue with equal priority for all orgs | Priority queue by tier + active event boost prevents low-tier bulk jobs starving safety-critical work |
| No audit log access for tenants | Tenant-scoped GET /org/audit-log; required by enterprise procurement and insurance |
| Treating subscription_tier as static configuration | Tier changes must be real-time enforced; require_tier() reads from DB on each request |

55.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Tenant isolation mechanism | PostgreSQL RLS + application filter | Application filter only | RLS enforces at DB layer; a single missing WHERE clause in application code cannot leak cross-tenant data |
| Tier change immediacy | Real-time DB read on each request | Cached in JWT claim | JWT caching means downgraded orgs continue at higher tier until token expires; unacceptable for billing correctness |
| Billing integration (Phase 1) | Manual + subscription_periods table | Stripe/Lago from day 1 | Phase 1 has ≤5 paying customers; manual invoicing is sufficient; invoice_ref field enables future integration without schema migration |
| org_admin role scope | Cannot assign admin or org_admin without system admin approval | Full self-service role management | Self-service org_admin assignment creates privilege escalation paths; system admin as approval gate is a standard SaaS pattern |
| Service account API keys | user_id IS NULL with is_service_account = TRUE flag | Separate service_accounts table | Single api_keys table is simpler; constraints enforce consistency; avoids JOIN complexity for key lookup hot path |

§56 Testing Strategy — Specialist Review

Standards basis: pytest, pytest-cov, mutmut, k6, Playwright, openapi-typescript, freezegun, ISTQB test level definitions, ESA ECSS-E-ST-40C software testing standard
Review scope: Coverage standard, test taxonomy, test data management, frontend/API contract drift, mutation testing, performance test specification, environment parity, safety-critical labelling, WebSocket E2E, MC determinism, ESA test plan artefact


56.1 Findings and Fixes Applied

F1 — No test coverage standard defined Fix applied (§17.0): Coverage thresholds declared: 80% line / 70% branch for backend (pytest-cov), 75% line for frontend (Jest). Enforced via pyproject.toml --cov-fail-under. Measured on the integration run (real DB), not unit-only. Coverage artefact required in Phase 2 ESA submission.

F2 — Test level boundary undefined Fix applied (§17.0): Three-level taxonomy defined: unit (no I/O, tests/unit/), integration (real DB + Redis, tests/integration/), E2E (full stack + browser, e2e/). Rules specify which level each category of test belongs to. Stops developers placing DB tests in tests/unit/ or mocking the database in integration tests.

F3 — Test data management strategy absent Fix applied (§17.0): Committed JSON reference data for physics; transaction-rollback isolation for integration tests; freezegun mandate for all time-dependent tests; fictional NORAD IDs (90001–90099) and generated org names for sensitive data. Prevents flaky time-dependent failures and production-data leakage into the test repo.

F4 — No contract testing between frontend and API Fix applied (§14): openapi-typescript generates frontend/src/types/api.generated.ts from openapi.yaml. Frontend imports only from the generated file. make check-api-types CI step fails on any drift. Replaces Pact-style consumer-driven contracts at Phase 1 scale — simpler, equally effective for a single-team project.

F5 — Mutation testing not specified Fix applied (§17.0): mutmut runs weekly against physics/ and alerts/ modules. Threshold: ≥ 70% mutation score. Results published to CI artefacts. > 5 percentage point drop between runs creates a mutation-regression issue automatically.

F6 — Performance test specification informal Fix applied (§27.0 new): k6 chosen as the load testing tool. Three scenarios specified: CZML catalog ramp, 200 WebSocket subscribers, decay submit constant arrival rate. SLO thresholds as k6 thresholds (test fails if breached). Baseline hardware spec documented in docs/validation/load-test-baseline.md. Results stored as JSON and trended; > 20% p95 increase creates performance-regression issue.

F7 — Test environment parity unspecified Fix applied (§17.0): docker-compose.ci.yml must use pinned image tags matching production (not latest). make test fails if TIMESCALEDB_VERSION env var does not match docker-compose.yml. MinIO used in CI (not mocked). Prevents the class of "passes in CI, fails in prod" due to minor version differences in TimescaleDB chunk behaviour.

F8 — Safety-critical tests not labelled Fix applied (§17.0): @pytest.mark.safety_critical marker defined in conftest.py. Applied to: cross-tenant isolation, HMAC integrity, sub-150km guard, shadow segregation, and any other safety-invariant test. Separate fast CI job (pytest -m safety_critical, target < 2 min) runs on every commit before the full suite.
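The marker registration and usage can be sketched as a conftest.py fragment; the marker name matches §17.0, while the example test body is purely illustrative:

```python
# conftest.py sketch: register the safety_critical marker so pytest -m
# safety_critical can run the fast gate on every commit.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "safety_critical: safety-invariant test; must pass on every commit",
    )

@pytest.mark.safety_critical
def test_cross_tenant_isolation_example():
    """Placeholder showing marker usage; the real test queries as Org A
    and asserts zero Org B rows across all tenanted tables."""
    assert True
```

Run with `pytest -m safety_critical` to execute only the marked invariant tests.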

F9 — No E2E test for WebSocket alert delivery Fix applied (§42.2 E2E test inventory, accessibility section): e2e/test_alert_websocket.ts added. Full path: submit prediction via API → Celery completes → CRITICAL alert appears in browser DOM via WebSocket within 60 seconds. BLOCKING. Intermittent failures are investigated to root cause, not quarantined.

F10 — Physics tests non-deterministic Fix applied (§17.0): np.random.seed(42) autouse fixture in tests/conftest.py. seed=42 passed explicitly to all MC calls in tests. Seed value pinned; a PR changing it without updating baselines fails the review checklist. MC-based tests are now fully reproducible across machines and Python versions.
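The seed fixture can be sketched as a conftest.py fragment; the fixture name matches the plan, and the demo MC function below is an illustrative assumption:

```python
# tests/conftest.py sketch (§17.0): pin the global NumPy seed before every test.
import numpy as np
import pytest

@pytest.fixture(autouse=True)
def seed_rng():
    np.random.seed(42)  # pinned; a PR changing this must update MC baselines
    yield

def run_mc_draws(n: int, seed: int = 42) -> np.ndarray:
    """Illustrative MC call taking an explicit seed, as the tests must."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n)
```

With the seed fixed, two runs of the same MC test produce bit-identical draws on any machine.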

F11 — No test plan document for ESA submission Fix applied (§17.0): docs/TEST_PLAN.md structure specified with 6 sections including safety-critical traceability matrix (requirement → test ID → test name → result). This is the primary software assurance evidence document for the ESA bid. Required as a Phase 2 deliverable.

Bind mount strategy (companion fix) Fix applied (§3.3 Docker Compose): Host bind mounts specified for logs, exports, config, and DB data. Eliminates the need for docker compose exec for all routine operations. /data/postgres and /data/minio outside the project directory to prevent accidental wipe. make init-dirs creates the host directory structure before first docker compose up. make logs SERVICE=backend convenience alias.


56.2 Sections Modified

| Section | Change |
| --- | --- |
| §3.3 Docker Compose | Host bind mount specification; host directory layout; make init-dirs; :ro config mounts |
| §13.8 Developer Environment Setup | make init-dirs added to bootstrap sequence |
| §17.0 (new) Test Standards and Strategy | Full test taxonomy, coverage standard, fixture isolation, freezegun, safety_critical marker, MC seed, mutation testing, env parity, docs/TEST_PLAN.md structure |
| §27.0 (new) Performance Test Specification | k6 scenarios, SLO thresholds, baseline hardware spec, result storage and trending |
| §14 API Versioning Policy | openapi-typescript contract type generation; make check-api-types CI step |
| §42.2 E2E Test Inventory | test_alert_websocket.ts added; full WebSocket delivery E2E spec |

56.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| tests/unit/, tests/integration/, e2e/ | Canonical test directory structure per taxonomy |
| e2e/test_alert_websocket.ts | WebSocket alert delivery E2E test |
| tests/conftest.py | seed_rng autouse fixture; safety_critical marker registration |
| docs/TEST_PLAN.md | ESA Phase 2 deliverable; traceability matrix |
| docs/validation/load-test-baseline.md | k6 baseline hardware and data spec |
| docs/validation/load-test-results/ | Stored k6 JSON results for trending |
| tests/load/scenarios.js | k6 scenario definitions |
| frontend/src/types/api.generated.ts | Generated TypeScript API types from openapi.yaml |
| scripts/load-test-trend.py | p95 latency trend chart generator |

56.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Mocking the database in integration tests | Transaction-rollback isolation against a real DB; mocks hide schema and RLS bugs |
| datetime.utcnow() in tests | freezegun @freeze_time decorator; tests must be time-independent |
| Non-deterministic MC tests | np.random.seed(42) autouse fixture; same seed → same output everywhere |
| Coverage measured on unit tests only | Integration run coverage includes DB-layer code; unit-only inflates the number |
| Putting safety-critical tests in the full suite only | pytest -m safety_critical fast job on every commit; never wait for the full suite to catch a safety regression |
| Performance test results not stored | JSON output committed to docs/validation/; trend script flags regressions |

56.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Frontend/API contract testing | openapi-typescript generated types + make check-api-types | Pact consumer-driven contracts | Pact requires a broker and bidirectional test setup; openapi-typescript achieves the same drift detection with a single CI command at Phase 1 team size |
| Performance test tool | k6 | Locust, Gatling | k6 is JavaScript-native (same language as frontend tests); scripting is lightweight; built-in threshold assertions; good CI integration |
| Coverage measurement scope | Integration test run | Unit test run | Unit-only coverage excludes database, Redis, and auth middleware code paths — the most likely sources of prod bugs |
| Mutation testing scope | physics/ and alerts/ only (weekly) | Full codebase (every commit) | Full-codebase mutation testing on every commit would take hours; scoping to highest-consequence modules provides meaningful signal at reasonable cost |
| Host bind mounts approach | Named directories under /opt/spacecom/ with make init-dirs | Named Docker volumes | Host bind mounts are directly accessible via SSH without docker exec; named volumes require exec or a volume driver for host access |

§57 Observability & Monitoring — Specialist Review

Hat: Observability & Monitoring
Findings reviewed: 11
Sections modified: §26.6, §26.7
Date: 2026-03-24


57.1 Findings and Fixes Applied

F1 — Prometheus metric naming convention not defined Fix applied (§26.7 new): Naming convention table added before metric definitions. Rules: spacecom_ namespace required; unit suffix mandatory; _total for counters; high-cardinality identifiers (norad_id, organisation_id, user_id, request_id) banned from metric labels; snake_case labels only. CI make lint-metrics step validates names against the convention pattern.
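The make lint-metrics check can be sketched as follows; the unit-suffix whitelist is an assumption — the authoritative pattern lives in the §26.7 convention table:

```python
# Sketch of the metric-name lint behind `make lint-metrics`.
import re

NAME_RE = re.compile(
    r"^spacecom_[a-z0-9_]+_(total|seconds|bytes|ratio|count|depth)$")
BANNED_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}
SNAKE_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def lint_metric(name: str, labels: set) -> list:
    """Return a list of convention violations for one metric definition."""
    problems = []
    if not NAME_RE.match(name):
        problems.append(f"{name}: missing spacecom_ namespace or unit suffix")
    for label in labels:
        if label in BANNED_LABELS:
            problems.append(f"{name}: high-cardinality label '{label}' banned")
        elif not SNAKE_RE.match(label):
            problems.append(f"{name}: label '{label}' is not snake_case")
    return problems
```

CI fails the build when any registered metric returns a non-empty problem list.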

F2 — SLO burn rate alerting single-window only Fix applied (§26.7): Replaced single ErrorBudgetBurnRate alert with two-alert multi-window pattern. ErrorBudgetFastBurn (1h + 5min windows, 14.4× multiplier, for: 2m) catches sudden outages. ErrorBudgetSlowBurn (6h + 1h windows, 6× multiplier, for: 15m) catches gradual degradation before the error budget is silently exhausted. Three recording rules added (rate1h, rate6h, rate5m).
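The two-alert pattern can be sketched as Prometheus rules; the recording-rule names and the 0.1% error budget (99.9% SLO) are assumptions, while the multipliers, windows, and for: durations are those stated above:

```yaml
groups:
  - name: spacecom-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: >
          spacecom:error_rate:rate1h > (14.4 * 0.001)
          and spacecom:error_rate:rate5m > (14.4 * 0.001)
        for: 2m
        labels: {severity: page}
      - alert: ErrorBudgetSlowBurn
        expr: >
          spacecom:error_rate:rate6h > (6 * 0.001)
          and spacecom:error_rate:rate1h > (6 * 0.001)
        for: 15m
        labels: {severity: ticket}
```

The short window in each pair makes the alert stop firing quickly once the error rate recovers.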

F3 — Structured log schema undefined Already substantially addressed in §2274: REQUIRED_LOG_FIELDS schema with 10 mandatory fields, sanitising processor, request_id correlation middleware, and log integrity policy. No further action required for F3 — confirmed as covered.

F4 — Distributed tracing not specified for Celery path Fix applied (§26.7): Explicit Celery W3C traceparent propagation spec added. CeleryInstrumentor handles automatic propagation; request_id passed in task kwargs as Phase 1 fallback when OTEL_SDK_DISABLED=true. Integration test stub specified to verify trace continuity from HTTP handler through worker span.

F5 — No alerting rule coverage audit Fix applied (§26.7 new): Alert coverage audit table added mapping every SLO and safety invariant to its alert rule. Gaps identified: EopMirrorDisagreement (Phase 1 gap — metric exists, alert rule missing), DbReplicationLagHigh (Phase 2 gap — requires streaming replication), and BackupJobFailed (Phase 1 gap).

F6 — High-cardinality label risk Already addressed: norad_id label was already noted as "Grafana drill-down only; alert via recording rule" in the existing metric definition comment. F1 naming convention formalises this as an explicit prohibition with a CI-enforced lint rule. No additional edit required.

F7 — On-call dashboard not specified Fix applied (§26.7): Operational Overview dashboard panel layout mandated. 8-panel grid with fixed row order; rows 1–2 visible without scroll at 1080p. Each panel maps to a specific metric and threshold. Dashboard UID pinned in AlertManager dashboard_url annotations. Design criterion: answer "is the system healthy?" in 15 seconds.

F8 — Celery queue depth alerting threshold-only Fix applied (§26.7): CelerySimulationQueueGrowing alert added using rate(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2 with for: 5m. Complements the existing threshold-based CelerySimulationQueueDeep. Growth rate alert catches a rising queue before it breaches the absolute threshold.

F9 — No DLQ monitoring Already addressed: DLQGrowing alert (increase(spacecom_dlq_depth[10m]) > 0) and spacecom_dlq_depth metric were already specified in §26.7. F9 confirmed as covered — no further action required.

F10 — Log retention and SIEM integration not specified Fix applied (§26.6 new): Application log retention policy table added. Container stdout: 7 days (Docker json-file). Loki: 90 days (covers incident investigation SLA). Safety-relevant log lines: 7 years (MinIO, matching database safety record retention). SIEM forwarding: per customer contract. Loki retention YAML configuration specified. Phase 1 interim: Celery Beat daily export of CRITICAL log lines to MinIO until Loki ruler is deployed.

F11 — No alerting runbook cross-reference mandate Fix applied (§26.7): runbook_url added to WebSocketCeilingApproaching (previously missing). Mandate added: every AlertManager rule must include annotations.runbook_url pointing to an existing file in docs/runbooks/. make lint-alerts CI step enforces this using promtool check rules plus a custom script that validates the URL resolves to a real markdown file.


57.2 Sections Modified

| Section | Change |
| --- | --- |
| §26.6 Backup and Restore | Application log retention policy table added; Loki 90-day retention config; safety-critical log line archival to MinIO |
| §26.7 Prometheus Metrics | Metric naming convention table; multi-window burn rate recording rules and alerts; Celery trace propagation spec; queue growth rate alert; alert coverage audit table; runbook_url mandate; WebSocketCeilingApproaching runbook link added; on-call dashboard panel layout mandated |

57.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| monitoring/alertmanager/spacecom-rules.yml | Updated with multi-window burn rate alerts and queue growth alert |
| monitoring/loki-config.yml | 90-day retention configuration |
| monitoring/recording-rules.yml | Three burn rate recording rules |
| docs/runbooks/capacity-limits.md | Referenced by WebSocketCeilingApproaching; Phase 2 deliverable |
| scripts/lint-alerts.py | CI script validating runbook_url annotation on every alert rule |
| monitoring/grafana/dashboards/operational-overview.json | Codified panel layout per §26.7 on-call dashboard spec |
| tests/integration/test_tracing.py | Celery trace propagation integration test stub |

57.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Single-window burn rate alert (for: 30m) | Multi-window fast+slow burn: catches both sudden outages and slow degradations |
| norad_id or organisation_id as Prometheus label | Recording rule aggregates; high-cardinality identifiers in log fields or exemplars only |
| Alert rules without runbook_url | make lint-alerts enforces presence; a page at 3am without a runbook link adds ~5 min to MTTR |
| Threshold-only queue alerts | Complement with rate-of-growth alert; threshold fires too late on a gradually filling queue |
| On-call dashboard with no defined layout | Mandated panel order; rows 1–2 visible without scroll; 15-second health answer target |
| Application logs with no retention policy | Explicit tier policy: 7 days local, 90 days Loki, 7 years for safety-relevant lines |

57.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Burn rate multipliers | 14.4× (fast, 1h) / 6× (slow, 6h) | Custom thresholds | Google SRE Workbook standard multipliers for 99.9% SLO; well-understood by on-call engineers familiar with SRE literature |
| Loki retention | 90 days | 30 days / 1 year | 30 days is insufficient for post-incident reviews triggered by regulatory queries; 1 year is expensive for high-volume structured logs; 90 days covers all contractual and regulatory investigation windows |
| Fast burn for: duration | 2 minutes | Immediate (no for) | Without a for clause, a single scraped bad value pages on-call; 2 minutes filters transient scrape errors while still alerting within 5 minutes of a real outage |
| Celery trace propagation | CeleryInstrumentor + explicit request_id kwargs | OTel only | OTel-only approach breaks Phase 1 when OTEL_SDK_DISABLED=true; explicit kwargs are a zero-dependency fallback that costs nothing and ensures log correlation always works |

§58 Performance & Scalability — Specialist Review

Hat: Performance & Scalability
Findings reviewed: 11
Sections modified: §3.2, §9.4, §16 (CZML cache), §34.2 (Caddyfile), Celery config
Date: 2026-03-24


58.1 Findings and Fixes Applied

F1 — No index strategy documented beyond primary keys Already addressed: §9.3 contains a comprehensive index specification with 10+ named indexes covering all identified hot paths: orbits (CZML generation), reentry_predictions (latest per object, partial), alert_events (unacknowledged per org, partial), jobs (queued, partial), refresh_tokens (live only, partial), PostGIS GiST indexes on all geometry columns, tle_sets (latest per object), security_logs (user+time). F1 confirmed as covered — no further action required.

F2 — PgBouncer pool size not derived from workload Fix applied (§3.2 technology table): Derivation rationale added inline. max_client_conn=200 derived from: 2 backend × 40 async + 4 sim workers × 16 + 2 ingest × 4 = 152 peak, 200 burst headroom. default_pool_size=20 derived from max_connections=50 with 5 reserved for superuser. Validation query (SHOW pools; cl_waiting > 0 = undersized) documented.

F3 — N+1 query risk in catalog and alert APIs Already addressed: §16 (CZML and API performance section) already specifies ORM loading strategies: selectinload for Event Detail and active alerts; raw SQL with explicit JOIN for CZML catalog bulk fetch (ORM overhead unacceptable at 864k rows). F3 confirmed as covered — no further action required.

F4 — Redis cache eviction policy not specified Already addressed: §16 Redis key namespace table specifies noeviction for celery:* and redbeat:*, allkeys-lru for cache:*, volatile-lru for ws:session:*. Separate Redis DB indexes mandated. F4 confirmed as covered — no further action required.

F5 — CZML cache invalidation strategy incomplete Fix applied (§16): Invalidation trigger table added (TLE re-ingest, propagation completion, new prediction, admin flush, cold start). Stale-while-revalidate strategy specified: stale key served immediately on primary expiry; background recompute enqueued; max stale age 5 minutes. warm_czml_cache Celery task specified for cold start and DR failover; estimated 30–60 seconds for 600 objects. Cold-start warm-up added to DR RTO calculation.
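The stale-while-revalidate read path can be sketched as follows; the dict cache and list queue stand in for Redis and the Celery recompute task, and the function name is illustrative:

```python
STALE_MAX_AGE = 300.0  # seconds; the 5-minute max stale age from §16

def read_czml(cache, recompute_queue, key, now):
    """Stale-while-revalidate read path (sketch).

    `cache` maps key -> (payload, fresh_until); `recompute_queue` stands in
    for the Celery queue that rebuilds the entry in the background.
    """
    entry = cache.get(key)
    if entry is None:
        return None  # cold miss: caller computes and fills the cache
    payload, fresh_until = entry
    if now <= fresh_until:
        return payload  # fresh hit
    if now <= fresh_until + STALE_MAX_AGE:
        recompute_queue.append(key)  # enqueue background recompute
        return payload  # serve stale immediately
    return None  # beyond the stale window: treat as a miss
```

Serving the stale copy while one background task recomputes is what prevents a 600-object TLE batch from triggering 600 simultaneous cache stampedes.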

F6 — Celery worker_prefetch_multiplier not tuned Fix applied (celeryconfig.py): worker_prefetch_multiplier = 1 added with rationale comment. Long MC tasks (up to 240s) with default prefetch=4 cause worker starvation. Prefetch=1 ensures fair task distribution across all available workers.

F7 — No database query plan governance Fix applied (§9.4 PostgreSQL parameters): log_min_duration_statement: 500 and shared_preload_libraries: timescaledb,pg_stat_statements added to patroni.yml. Query plan governance process specified: weekly top-10 slow query report from pg_stat_statements; any query in top-10 for two consecutive weeks requires PR with EXPLAIN ANALYSE and index addition or documented acceptance.
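The weekly report described above can be produced with a query along these lines (column names per the pg_stat_statements view in PostgreSQL 13+):

```sql
-- Weekly top-10 slow query report feeding the governance review.
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```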

F8 — Static asset delivery strategy undefined Fix applied (§34.2 Caddyfile): Three-tier static asset strategy added. /_next/static/*: Cache-Control: public, max-age=31536000, immutable (safe — Next.js content-hashes filenames). /cesium/*: Cache-Control: public, max-age=604800 (7 days; not content-hashed). HTML routes: Cache-Control: no-store (force re-fetch after deploy). Rationale: immutable caching only safe for content-hashed assets; HTML must never be cached.

F9 — Horizontal scaling trigger thresholds not defined Fix applied (§3.2 new table): Scaling trigger threshold table added covering backend CPU (>70% for 30min), WS connections (>400 sustained), simulation queue depth (>50 for 15min), MC p95 latency (>180s), DB CPU (>60% for 1h), disk usage (>70%), Redis memory (>60%). All triggers initiate a scaling review meeting, not automatic action. Decisions logged in docs/runbooks/capacity-limits.md.

F10 — TimescaleDB chunk interval not specified Already addressed: §9.4 specifies chunk intervals for all hypertables with derivation rationale table: orbits 1 day (72h CZML window spans 3 chunks), tle_sets 1 month (compression ratio), space_weather 30 days (low write rate), adsb_states 4 hours (24h rolling window). F10 confirmed as covered — no further action required.

F11 — No query timeout or statement timeout policy Fix applied (§9.4): ALTER ROLE spacecom_analyst SET statement_timeout = '30s' and ALTER ROLE spacecom_readonly SET statement_timeout = '30s'. Applied at role level so it persists regardless of connection source. User-facing error message specified for timeout exceeded. Operational roles excluded (they have idle_in_transaction_session_timeout as global backstop only).


58.2 Sections Modified

| Section | Change |
| --- | --- |
| §3.2 Service Breakdown | PgBouncer pool size derivation rationale; horizontal scaling trigger threshold table |
| §9.4 TimescaleDB Configuration | log_min_duration_statement, pg_stat_statements in patroni.yml; query plan governance process; analyst role statement_timeout; idle_in_transaction_session_timeout comment |
| §16 CZML / Cache | Invalidation trigger table; stale-while-revalidate strategy; warm_czml_cache cold-start task |
| §34.2 Caddyfile | Three-tier static asset Cache-Control strategy; HTML no-store mandate |
| celeryconfig.py | worker_prefetch_multiplier = 1 with rationale |

58.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| docs/runbooks/capacity-limits.md | Scaling decision log; WS ceiling documentation; capacity trigger thresholds |
| worker/celeryconfig.py | Updated with worker_prefetch_multiplier = 1 |

58.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Default Celery prefetch_multiplier=4 with long tasks | prefetch_multiplier=1 for MC jobs; fair distribution across workers |
| Single Redis maxmemory-policy for broker + cache | Separate DB indexes with noeviction for broker, allkeys-lru for cache |
| HTML pages with Cache-Control: public, max-age=... | no-store for HTML; immutable only for content-hashed static assets |
| Analyst queries without timeout | statement_timeout=30s at role level; prevents replica exhaustion cascading to primary |
| Monitoring slow queries without a review process | Weekly pg_stat_statements top-10 review; two-week persistence triggers mandatory PR |
| Scaling triggers defined as "when it feels slow" | Metric thresholds with sustained durations; documented decision log for audit trail |

58.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| worker_prefetch_multiplier | 1 | 4 (default) | Long MC tasks (up to 240s) make default prefetch cause severe worker imbalance; prefetch=1 adds trivial latency (one extra Redis round-trip) per task |
| Analyst timeout | 30 seconds at role level | Global statement_timeout | Global timeout would cancel legitimate long-running operations like backup restore tests and migration backfills; role-scoped is surgical |
| CZML stale-while-revalidate max age | 5 minutes | 0 (no stale) | Without stale window, TLE batch ingest (600 objects) causes 600 simultaneous cache stampedes; 5-minute stale window amortises recompute over the natural ingest cadence |
| Static asset caching | Immutable for /_next/static/, 7 days for /cesium/, no-store for HTML | Uniform TTL | Content-hash presence determines whether immutable is safe; non-uniform strategy is correct, not inconsistent |

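The stale-while-revalidate decision above can be made concrete with a small serving-policy sketch. Function and field names are illustrative assumptions, not the plan's cache API; the 5-minute window is the documented value:

```python
STALE_WINDOW_S = 300  # 5-minute stale-while-revalidate window from the decision log

def serve_czml(entry, now, recompute_scheduled):
    """Decide how to serve one cached CZML document (sketch).

    entry: (payload, fresh_until_epoch_s) or None for a cold miss.
    Returns (payload_or_None, action_label).
    """
    if entry is None:
        return None, "recompute_blocking"  # cold miss: caller must compute inline
    payload, fresh_until = entry
    if now <= fresh_until:
        return payload, "serve_fresh"
    if now <= fresh_until + STALE_WINDOW_S:
        # Serve stale immediately; schedule at most one background recompute so a
        # 600-object TLE batch ingest cannot stampede the corridor pipeline.
        action = "serve_stale" if recompute_scheduled else "serve_stale_and_revalidate"
        return payload, action
    return None, "recompute_blocking"  # beyond the stale window: too old to serve
```

The key property is that during the stale window every request is answered from cache while exactly one recompute runs, which is what amortises the recompute over the ingest cadence.
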
§59 DevOps / CI-CD Pipeline — Specialist Review

Hat: DevOps / CI-CD Pipeline · Findings reviewed: 11 · Sections modified: §30.2, §30.3, §30.7 (new) · Date: 2026-03-24


59.1 Findings and Fixes Applied

F1 — CI pipeline job dependency graph not specified. Fix applied (§30.7 new): Full GitHub Actions pipeline specified with an explicit needs: dependency graph enforcing the order: lint → (test-backend, test-frontend, migration-gate in parallel) → security-scan → build-and-push → deploy-staging → deploy-production. Parallel jobs where safe; sequential where correctness requires it.

F2 — No environment promotion gate between staging and production. Already addressed: §30.4 specifies the staging environment spec and data policy. The ADR at §30.6 records the decision: "production deploy requires manual approval gate after staging smoke tests pass." The new §30.7 workflow formalises this as a GitHub protected production environment with required reviewers. Confirmed as covered and formalised.

F3 — Secrets in CI not audited or rotated. Fix applied (§30.3): CI secrets register table added with 8 entries covering all pipeline secrets. Each entry specifies: environment scope, owner, rotation schedule (90–180 days), and blast radius on leak. Quarterly audit procedure using the GitHub Actions secrets inventory documented. Rotation procedure for environment-scoped secrets specified.

F4 — Docker image tags without immutability guarantee. Fix applied (§30.2): Production docker-compose.yml now pins images by tag@digest rather than tag alone. make update-image-digests script added to the CI post-build pipeline. Container-registry retention policy table added covering 5 image categories. Lifecycle policy documented in docs/runbooks/image-lifecycle.md.

F5 — No build provenance or SBOM in CI pipeline. Fix applied (§30.7): cosign sign --yes step added to the build-and-push job using Sigstore keyless signing (OIDC identity from GitHub Actions). SBOM artefacts are attached to the pipeline and copied into the compliance artefact store. The deploy-time cosign verify step remains the verification gate.

F6 — Pre-commit hooks not enforced in CI. Already addressed: §30.1 explicitly states "The same hooks run locally (via pre-commit) and in CI (lint job)." The new §30.7 workflow formalises this as pre-commit run --all-files in the lint job with a dedicated cache. F6 confirmed as covered and formalised.

F7 — No automated rollback trigger. Already addressed: §26.9 blue-green deploy script (step 6) already checks spacecom:api_availability:ratio_rate5m < 0.99 after a 5-minute monitoring window and executes the Caddy upstream rollback atomically if the threshold is breached. F7 confirmed as covered.

F8 — Deployment pipeline does not check for active CRITICAL events. Fix applied (§30.7): check no active CRITICAL alert step added to both deploy-staging and deploy-production jobs. Calls GET /readyz and checks the alert_gate field. "blocked" aborts the deploy with a clear error message. Emergency override requires two production-environment approvals and is logged to security_logs.

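The F8 gate logic can be sketched as a pure decision function. The alert_gate field on /readyz and the two-approval override come from the finding text; the function name and fail-closed default are illustrative assumptions:

```python
def deploy_allowed(readyz: dict, approvals: int = 0) -> tuple[bool, str]:
    """Evaluate the CI alert gate from F8 (sketch).

    readyz: parsed JSON body of GET /readyz. Fails closed if the
    alert_gate field is missing — an assumption, not a documented rule.
    """
    gate = readyz.get("alert_gate", "blocked")
    if gate != "blocked":
        return True, "gate clear"
    if approvals >= 2:
        # Documented emergency path: two production-environment approvals,
        # with the override logged to security_logs by the real pipeline.
        return True, "emergency override"
    return False, "active CRITICAL event — deploy aborted"
```
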
F9 — No branch protection or merge queue specification. Already addressed: §13.6 (CONTRIBUTING.md spec from §54) specifies: "No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push." The §30.7 workflow defines all required status checks (lint, test-backend, test-frontend, migration-gate, security-scan) which the branch protection rule references. F9 confirmed as covered.

F10 — Docker layer cache strategy not documented for CI. Fix applied (§30.7): Build cache strategy formalised in the build-and-push job using docker/build-push-action with cache-from: type=registry and cache-to: type=registry,mode=max targeting the GHCR buildcache tag. pip wheel cache keyed on requirements.txt hash. npm cache keyed on package-lock.json hash. Both use actions/cache@v4.

F11 — No database migration CI gate. Fix applied (§30.7 migration-gate job): Three-step gate on all PRs touching migrations/: (1) timed forward migration — fails if > 30 s; (2) reverse migration alembic downgrade -1 — fails if not reversible; (3) alembic check — fails if model/migration divergence. Gate runs in parallel with test jobs to minimise critical path impact.

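The three-step gate combines into a single pass/fail verdict. A sketch of that combination logic (the 30 s budget is the documented threshold; the function shape and failure messages are assumptions — the real gate would obtain its inputs by running alembic in CI):

```python
def migration_gate(forward_seconds: float, downgrade_ok: bool, models_in_sync: bool):
    """Combine the three F11 checks into (passed, failures) — sketch only.

    forward_seconds: wall-clock time of `alembic upgrade head`
    downgrade_ok:    whether `alembic downgrade -1` succeeded
    models_in_sync:  whether `alembic check` found no divergence
    """
    failures = []
    if forward_seconds > 30.0:
        failures.append(f"forward migration took {forward_seconds:.0f}s (> 30s budget)")
    if not downgrade_ok:
        failures.append("alembic downgrade -1 failed — migration not reversible")
    if not models_in_sync:
        failures.append("alembic check failed — models diverge from migrations")
    return (len(failures) == 0, failures)
```
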

59.2 Sections Modified

| Section | Change |
|---|---|
| §30.2 Multi-Stage Dockerfile | Image digest pinning spec; GHCR retention policy table; make update-image-digests |
| §30.3 Environment Variable Contract | CI secrets register table; rotation schedule; quarterly audit procedure |
| §30.7 (new) GitHub Actions Workflow | Full CI YAML with needs: graph; all 8 jobs; cosign sign; migration-gate; alert gate step; environment-gated production deploy |

59.3 New Tables and Files

| Artefact | Purpose |
|---|---|
| .github/workflows/ci.yml | Canonical CI pipeline — 8 jobs with explicit dependency graph |
| scripts/smoke-test.py | Post-deploy smoke test (already referenced in §26.9; now mandatory gate in CI) |
| scripts/update-image-digests.sh | Patches docker-compose.yml with tag@digest after each build |
| docs/runbooks/image-lifecycle.md | GHCR retention policy; lifecycle policy config procedure |
| docs/runbooks/detect-secrets-update.md | Correct baseline update procedure (already referenced in §30.1) |

59.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Jobs without needs: run in parallel by default | Explicit needs: chains; test jobs must precede build; build must precede deploy |
| Mutable image tags in production | Compose tag@digest pinning; make update-image-digests in post-build CI step |
| Long-lived CI credentials for registry push | OIDC GITHUB_TOKEN (per-job, automatic); no static GHCR_TOKEN secret needed |
| Signing at deploy-time only (cosign verify) | Sign at build-time (cosign sign); verify at deploy; both steps required for supply chain integrity |
| Deploying during active CRITICAL alert | alert_gate check in CI deploy steps; emergency override requires two approvals and is logged |
| Migrations tested only by running them forward | Three-step gate: forward (timed) + reverse (reversibility) + alembic check (model sync) |

59.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| OIDC for GHCR auth | GITHUB_TOKEN OIDC (per-job) | Static GHCR_TOKEN secret | Static tokens don't expire; OIDC tokens are per-job and cannot be reused outside the workflow |
| cosign keyless signing | Sigstore keyless (OIDC identity) | Private key signing | Keyless signing ties the signature to the GitHub Actions OIDC identity; no long-lived private key to rotate or leak |
| Alert gate scope | Blocks CRITICAL and HIGH unacknowledged alerts from non-internal orgs | All alerts | Internal test org alerts should not block production operations; unacknowledged = operator hasn't seen it yet |
| migration-gate triggers | Only on PRs touching migrations/ | Every PR | Running alembic upgrade head on every PR adds 60–90 seconds to CI for PRs that don't touch the schema; path filter reduces cost |

§60 Human Factors / Operational UX — Specialist Review

Hat: Human Factors / Operational UX · Findings reviewed: 11 · Sections modified: §28.1, §28.3, §28.5a, §28.6, §28.9 (new) · Date: 2026-03-24


60.1 Findings and Fixes Applied

F1 — No alarm management philosophy documented. Fix applied (§28.3): EEMUA 191 / ISA-18.2 alarm management KPI table added with 5 quantitative targets: alarm rate (< 1/10 min), nuisance rate (< 1%), stale CRITICAL (0 unacknowledged > 10 min), alarm flood threshold (< 10 CRITICAL in 10 min), chattering alarms (0). Measured quarterly by Persona D; included in ESA compliance artefact package.

F2 — Alarm flood scenario not bounded. Fix applied (§28.3): Batch TIP flood protocol added. Triggers at >= 5 new TIP messages in 5 minutes. Protocol: highest-priority object gets CRITICAL banner; objects 2–N are suppressed; single HIGH "Batch TIP event: N objects" summary fires; per-object alerts queue at <= 1/min after 5-minute operator grace period. batch_tip_event record type added to alert_events. Thresholds configurable per-org within safety bounds.

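The flood protocol can be sketched as a pure triage function over one 5-minute window. The 5-message threshold and the banner/summary/suppress split are the documented behaviour; the tuple shape and names are illustrative assumptions:

```python
def triage_batch_tip(tip_objects, flood_threshold=5):
    """Apply the batch TIP flood protocol to one 5-minute window (sketch).

    tip_objects: list of (object_id, priority_rank), lower rank = higher priority.
    Returns (banner_objects, summary_text_or_None, suppressed_objects).
    """
    ranked = sorted(tip_objects, key=lambda o: o[1])
    if len(ranked) < flood_threshold:
        return ranked, None, []  # below flood threshold: alert each object normally
    banner = [ranked[0]]         # highest-priority object gets the CRITICAL banner
    suppressed = ranked[1:]      # objects 2–N suppressed; queued <= 1/min after grace
    summary = f"Batch TIP event: {len(ranked)} objects"
    return banner, summary, suppressed
```
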
F3 — Mode confusion risk unmitigated. Already addressed: §28.2 specifies six mode error prevention mechanisms including persistent mode indicator, mode-switch confirmation dialog with consequence statements, temporal wash for future-preview, simulation disable during active events, audio suppression in non-LIVE modes, and simulation record segregation. F3 confirmed as covered.

F4 — Handover workflow does not account for SA transfer. Fix applied (§28.5a): Structured SA transfer prompt table added. Five prompts mapping to Endsley SA levels: active objects (L1 perception), operator assessment (L2 comprehension), expected development (L3 projection), actions taken (decision context), and handover flags (situational context). Prompts are optional but completion rate tracked as HF KPI. Non-blocking warning on submission without completion.

F5 — Acknowledgement does not distinguish seen from assessed. Already addressed: §28.5 structured acknowledgement categories distinguish MONITORING (seen, no action) from NOTAM_ISSUED, COORDINATING, ESCALATING (assessed and acted). The category taxonomy maps directly to perception vs. comprehension+projection. F5 confirmed as covered.

F6 — No specification for decision prompt content. Fix applied (§28.6): DecisionPrompt TypeScript interface specified with four mandatory fields: risk_summary (<= 20 words, no jargon), action_options (role-specific), time_available (decision window before FIR intersection), consequence_note (optional). Example instance for re-entry/FIR scenario provided. Pre-authored prompt library in docs/decision-prompts/; annual ANSP SME review required.

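A Python mirror of the DecisionPrompt shape, to make the field contract concrete. The four field names come from F6; the word-count validation is an illustrative reading of the "<= 20 words" constraint, not the specified TypeScript implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecisionPrompt:
    """Sketch mirroring the §28.6 DecisionPrompt TypeScript interface."""
    risk_summary: str                      # <= 20 words, plain language, no jargon
    action_options: list                   # role-specific action labels
    time_available: str                    # decision window before FIR intersection
    consequence_note: Optional[str] = None # optional consequence statement

    def __post_init__(self):
        # Assumed enforcement of the documented 20-word limit.
        if len(self.risk_summary.split()) > 20:
            raise ValueError("risk_summary must be 20 words or fewer")
```
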
F7 — Globe information hierarchy not specified. Fix applied (§28.1): Seven-level visual information hierarchy table added with mandatory rendering order. Priority 1 (CRITICAL object): flashing red octagon + always-visible label. Priority 2 (HIGH): amber triangle. Down to Priority 7 (ambient objects): white dots on hover only. Rule: no lower-priority element may be visually more prominent than a higher-priority element. Non-negotiable safety requirement — overrides CesiumJS performance optimisations that reorder draw calls.

F8 — No fatigue or cognitive load accommodation. Fix applied (§28.3): Server-side fatigue monitoring rules added. Four triggers: CRITICAL unacknowledged > 10 min — supervisor push+email; HIGH unacknowledged > 30 min — supervisor push; inactivity during active event (45 min) — operator+supervisor push; session age > shift_duration_hours — non-blocking operator reminder. All notifications logged to security_logs. Escalates to SpaceCom internal ops if no supervisor role configured.

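The four fatigue triggers can be sketched as one evaluation function. Thresholds are the documented values; the input parameters and returned action labels are illustrative assumptions:

```python
def fatigue_escalations(critical_unacked_min, high_unacked_min,
                        idle_during_event_min, session_age_h, shift_duration_h):
    """Evaluate the four server-side fatigue triggers from F8 (sketch).

    Returns the list of escalation actions that should fire; the real system
    would also log each notification to security_logs.
    """
    actions = []
    if critical_unacked_min > 10:
        actions.append("supervisor_push_and_email")    # CRITICAL unacked > 10 min
    if high_unacked_min > 30:
        actions.append("supervisor_push")              # HIGH unacked > 30 min
    if idle_during_event_min >= 45:
        actions.append("operator_and_supervisor_push") # inactivity during active event
    if session_age_h > shift_duration_h:
        actions.append("operator_reminder")            # non-blocking shift-age reminder
    return actions
```
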
F9 — Degraded mode display not actionable. Already addressed: §28.8 (Degraded-Data Human Factors) specifies per-degradation-type visual indicators with operator action required. §1315 specifies operational guidance text per degradation type. Acceptance criteria (§6056) requires integration test for each type. F9 confirmed as covered.

F10 — No operator training specification. Fix applied (§28.9 new): Full operator training programme specified. Six modules (M1-M6), 8 hours total minimum. M2 reference scenario defined. Recurrency requirements: annual 2-hour refresher + scenario repeat. operator_training_records schema added. GET /api/v1/admin/training-status endpoint added. Training material ownership and annual review cycle defined.

F11 — Audio alert design not fully specified. Fix applied (§28.3): Audio spec expanded with EUROCAE ED-26 / RTCA DO-256 advisory alert compliance. Tones specified: 261 Hz (C4) + 392 Hz (G4), 250 ms each with 20 ms fade. Re-alert on missed acknowledgement: replays once at 3 minutes; no further audio beyond second play (supervisor notification handles further escalation). Volume floor in ops room mode: minimum 40%. Per-session mute resets on next login.


60.2 Sections Modified

| Section | Change |
|---|---|
| §28.1 Situation Awareness | Globe visual information hierarchy table (7 levels, mandatory rendering order) |
| §28.3 Alarm Management | EEMUA 191 KPI table; batch TIP flood protocol; fatigue monitoring rules; audio spec expanded with EUROCAE ref, re-alert rule, volume floor |
| §28.5a Shift Handover | Structured SA transfer prompts (5 prompts, 3 SA levels); completion tracking |
| §28.6 Cognitive Load Reduction | Decision prompt TypeScript interface + example; pre-authored library governance |
| §28.9 (new) Operator Training | 6-module programme; reference scenario; recurrency; operator_training_records schema; API endpoint |

60.3 New Tables and Files

| Artefact | Purpose |
|---|---|
| operator_training_records | Training completion records per user/module |
| docs/training/ | Training module content directory |
| docs/training/reference-scenario-01.md | Standardised M2 reference scenario |
| docs/decision-prompts/ | Pre-authored decision prompt library (per scenario type) |
| GET /api/v1/admin/training-status | Org-admin view of operator training completion |

60.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Single "data may be delayed" degraded banner | Per-degradation-type badges with operator action required; graded response rules |
| Free-text only handover notes | Structured SA transfer prompts + notes; prompts tracked as HF KPI |
| Audio alert that loops indefinitely | Plays once; re-alerts once at 3 min; further escalation is supervisor notification, not more audio |
| Acknowledgement with 10-character text minimum | Structured category selection — captures intent, not just compliance |
| Unlimited alarm rate during batch TIP events | Batch flood protocol: suppress objects 2–N, queue at <= 1/min after grace period |
| Globe with equal visual weight for all elements | 7-level mandatory hierarchy; safety-critical objects pre-attentively distinct at all zoom levels |

60.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Alarm KPI standard | EEMUA 191 adapted for ATC | Process-control standard verbatim | EEMUA 191 is process-control oriented; ATC operations have different alarm rate expectations; adaptation noted explicitly |
| Re-alert timing | Once at 3 minutes | Continuous loop / never re-alert | Loop causes habituation; never re-alerting risks missed CRITICAL in a noisy environment; single replay at 3 min is the minimum effective prompt |
| SA transfer prompts | Optional with completion tracking | Mandatory (blocks handover submission) | Mandatory completion under time pressure produces checkbox compliance, not genuine SA transfer; optional + tracked provides accountability without creating a safety-defeating blocker |
| Operator training blocking | Flag but not block access | Auto-block untrained users | ANSP retains operational responsibility; SpaceCom cannot unilaterally block a certified ATC professional; flag + report gives ANSP the information to manage their own training compliance |

§61 Aviation & Space Regulatory Compliance — Specialist Review

61.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No formal safety case structure — argument/evidence/claims framework absent | High | §24.12 — Safety case with GSN argument structure, evidence nodes, and claims added; docs/safety/SAFETY_CASE.md |
| 2 | SAL assignment under ED-153/DO-278A not documented — no formal assurance level per component | High | §24.13 — SAL assignment table: SAL-2 for physics, alerts, HMAC, CZML; SAL-3 for auth and ingest; docs/safety/SAL_ASSIGNMENT.md |
| 3 | Hazard log lacked structured format — no ID, cause/effect decomposition, risk level, or governance | Medium | §24.4 — Hazard register restructured with 7 hazards (HZ-001 to HZ-007), structured fields, governance rules, and EUROCAE ED-153 risk matrix |
| 4 | Safety occurrence reporting procedure lacked formal structure — ANSP notification, evidence preservation, and regulatory notification flow not defined | High | §26.8a — Full safety occurrence reporting procedure with trigger conditions, 8-step response, SQL table, and clear negative scope |
| 5 | ICAO data quality mapping incomplete — Completeness attribute absent; no formal data category and classification fields in API response | Medium | §24.3 — Completeness attribute added; formal ICAO data category/classification fields specified; accuracy characterisation as Phase 3 gate |
| 6 | Verification independence not specified — no CODEOWNERS, PR review rule, or traceability for SAL-2 components | High | §17.6 — CODEOWNERS for SAL-2 paths, 2-reviewer requirement, qualification criteria, traceability to safety case evidence |
| 7 | No configuration management policy for safety-critical artefacts — source files, safety documents, and validation data not formally under CM | High | §30.8 — CM policy covering 10 artefact types, release tagging script, signed commits, deployment register, CODEOWNERS for docs/safety/ |
| 8 | Means of Compliance document not planned — no mapping from regulatory requirement to implementation evidence | Medium | §24.14 — MoC document structure with 7 initial MOC entries, status tracking, and Phase 2/3 gates |
| 9 | Post-deployment safety monitoring programme absent — no ongoing accuracy monitoring, safety KPIs, or model version monitoring | High | §26.10 — Four-component programme: prediction accuracy monitoring, safety KPI dashboard, quarterly safety review, model version monitoring |
| 10 | ANSP-side obligations not documented — SpaceCom's safety argument assumes ANSP actions that are never formally communicated | Medium | §24.15 — ANSP obligations table by category; SMS guide document; liability assignment note linking to safety case |
| 11 | Regulatory sandbox liability not formally characterised — who bears liability during trial, what insurance is required, sandbox ≠ approval | Medium | §24.2 — Sandbox liability provisions: no operational reliance clause, indemnification cap, insurance requirement, regulatory notification duty, explicit statement that sandbox ≠ regulatory approval |

Already addressed — no further action required:

  • NOTAM interface and disclaimer (§24.5 — covered in prior sessions)
  • Space law retention obligations (§24.6 — 7-year retention already specified)
  • EU AI Act compliance obligations (§24.10 — fully covered including Art. 14 human oversight statement)
  • Regulatory correspondence register (§24.11 — covered)

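Finding 3's structured hazard log is essentially a record contract. A hedged sketch of the entry shape (field names are an assumed reading of "ID, cause, effect, mitigations, risk level, status"; the HZ-NNN pattern and the ED-153 risk matrix reference come from the finding text):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HazardLogEntry:
    """Illustrative record shape for docs/safety/HAZARD_LOG.md entries."""
    hazard_id: str    # e.g. "HZ-001" — pattern from the finding text
    cause: str        # causal factor
    effect: str       # hazardous effect
    mitigations: tuple  # references to mitigating controls
    risk_level: str   # per the EUROCAE ED-153 risk matrix
    status: str       # assumed lifecycle label, e.g. "open" / "mitigated"

    def __post_init__(self):
        # Enforce the documented HZ-NNN identifier pattern.
        if not (self.hazard_id.startswith("HZ-") and self.hazard_id[3:].isdigit()):
            raise ValueError("hazard IDs follow the HZ-NNN pattern")
```

A structured shape like this is what makes the hazard log usable as evidence in the GSN safety case, rather than a table of symptoms.
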
61.2 Sections Modified

| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Regulatory sandbox liability provisions (F11): no operational reliance clause, indemnification cap, insurance requirement, sandbox ≠ approval statement |
| §24.3 ICAO Data Quality Mapping | Completeness attribute added (F5); formal ICAO data category and classification table; accuracy characterisation Phase 3 gate |
| §24.4 Safety Management System Integration | Hazard register fully restructured (F3): 7 hazards with IDs, cause/effect, risk levels, governance; system safety classification updated to reference §24.13 SAL assignment |
| §24.11 (after) | New §24.12 Safety Case Framework (F1); §24.13 SAL Assignment (F2); §24.14 Means of Compliance (F8); §24.15 ANSP-Side Obligations (F10) |
| §17.5 (after) | New §17.6 Verification Independence (F6): CODEOWNERS, 2-reviewer rule, qualification criteria, traceability |
| §26.8 Incident Response runbooks | Safety occurrence runbook pointer updated; §26.8a Safety Occurrence Reporting full procedure added (F4) |
| §26.9 (after) | New §26.10 Post-Deployment Safety Monitoring Programme (F9): accuracy monitoring, safety KPI dashboard, quarterly review, model version monitoring |
| §30.7 (after) | New §30.8 Configuration Management of Safety-Critical Artefacts (F7): CM policy table, release tagging, signed commits, deployment register |

61.3 New Documents and Tables

| Artefact | Purpose |
|---|---|
| docs/safety/SAFETY_CASE.md | GSN-structured safety case; living document; version-controlled |
| docs/safety/SAL_ASSIGNMENT.md | Software Assurance Level per component; review triggers |
| docs/safety/HAZARD_LOG.md | Structured hazard log (HZ-001 to HZ-007 and future additions) |
| docs/safety/MEANS_OF_COMPLIANCE.md | Regulatory requirement → implementation evidence mapping |
| docs/safety/ANSP_SMS_GUIDE.md | ANSP obligations and SMS integration guide |
| docs/safety/CM_POLICY.md | Configuration management policy for safety artefacts |
| docs/safety/VERIFICATION_INDEPENDENCE.md | Verification independence policy for SAL-2 components |
| docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md | Quarterly safety review output template |
| legal/SANDBOX_AGREEMENT_TEMPLATE.md | Standard regulatory sandbox letter of understanding |
| legal/ANSP_DEPLOYMENT_REGISTER.md | Configuration baseline per ANSP deployment |
| docs/validation/ACCURACY_CHARACTERISATION.md | Phase 3: formal accuracy statement (ICAO Annex 15) |
| safety_occurrences SQL table | Dedicated log for safety occurrences with full audit fields |
| monitoring/dashboards/safety-kpis.json | Grafana dashboard: 6 safety KPIs with alert thresholds |
| .github/CODEOWNERS additions | SAL-2 source paths + docs/safety/ require custodian review |

61.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| "Advisory only" UI label as sole liability protection | Legal instruments required: MSA, AUP, legal opinion; label is not contractual protection |
| Hazard log as a table of symptoms with no cause/effect structure | Structured hazard log with ID, cause, effect, mitigations, risk level, status — enables safety case argument |
| No distinction between safety occurrence and operational incident | Safety occurrences require a separate response chain (legal counsel, ANSP regulatory notification); conflating with incidents creates regulatory exposure |
| Verification by the author of safety-critical code | SAL-2 requires independent verification — CODEOWNERS enforcement is the implementation mechanism |
| Safety documents outside version control | All safety artefacts are Git-tracked; changes require custodian sign-off via CODEOWNERS; release tags capture safety snapshots |
| Sandbox trial treated as implicit regulatory approval | Explicit language required: sandbox ≠ approval; the ANSP cannot represent a trial as regulatory acceptance |
| Post-deployment safety monitoring as "we'll look at incidents when they happen" | Proactive programme: quarterly review, prediction accuracy tracking, model version monitoring — demonstrates ongoing safe operation |

61.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Safety case notation | Goal Structuring Notation (GSN) | ASCE text-only format | GSN is the standard for DO-178C and ED-153 safety cases; accepted by EASA and ESA reviewers; tooling (Astah, Visio, ArgoSAFETY) exists for formal diagrams when Phase 3 requires it |
| SAL-2 for physics and alerts | SAL-2 (not SAL-1) | SAL-1 (highest) | SAL-1 implies formal methods / formal proofs — disproportionate for decision support software where the ANSP retains authority; SAL-2 balances rigour with development practicality |
| Safety occurrence trigger scope | 4 specific trigger conditions | Any anomaly during operational use | Over-broad triggers desensitise the process; under-broad triggers miss real occurrences; 4 conditions map directly to the identified hazards |
| Post-deployment monitoring cadence | Quarterly safety review | Monthly review / ad hoc | Quarterly balances administrative overhead with meaningful trend data; monthly creates review fatigue for a small team; ad hoc provides no assurance |
| Configuration management of safety documents | Git + CODEOWNERS + release attachments | Dedicated safety management tool | Git is already the source of truth; CODEOWNERS provides access control; release attachments are the simplest artefact preservation mechanism without introducing a new tool |

§62 Geospatial / Mapping Engineering — Specialist Review

62.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No authoritative CRS contract document — frame transitions at each boundary were scattered across multiple sections with no single reference | Medium | §4.4 — CRS boundary table added; docs/COORDINATE_SYSTEMS.md defined as Phase 1 deliverable; antimeridian and pole handling specified |
| 2 | SRID not enforced by CHECK constraint — column type declares SRID 4326 but application code can insert SRID-0 geometries silently | Medium | §9.3 — CHECK constraints added for reentry_predictions, hazard_zones, airspace spatial columns; migration gate lints new spatial columns |
| 3 | No spatial GiST index on corridor polygon columns | High | Already addressed — §9.3 contains GiST indexes for reentry_predictions, hazard_zones, airspace geometry columns. No further action required. |
| 4 | CZML corridor geometry uses fixed 10-minute time-step sampling — under-represents terminal phase where displacement is highest | High | §15.4 — Adaptive sampling function added: 5 min above 300 km, 2 min at 150–300 km, 30 s at 80–150 km, 10 s below 80 km; ADR required for reference polygon regeneration |
| 5 | Antimeridian and pole handling not explicitly specified | Medium | §4.4 — Antimeridian: GEOGRAPHY type confirmed; CZML serialiser must not clamp to ±180°. Polar corridors: ST_DWithin pole proximity check; clip to 89.5° max latitude with POLAR_CORRIDOR_WARNING log |
| 6 | No test verifying PostGIS corridor polygon matches CZML polygon positions | High | §15.4 — test_czml_corridor_matches_postgis_polygon integration test added; marked safety_critical; 10 km bbox agreement tolerance |
| 7 | FIR boundary data source and update policy not documented | Medium | Already addressed — §31.1.3 documents EUROCONTROL AIRAC source, 28-day update procedure, airspace_metadata table, Prometheus staleness alert, readyz integration. No further action required. |
| 8 | Globe clustering merges objects at different altitudes sharing a ground-track sub-point | Medium | §13.2 (globe clustering) — Altitude-aware clustering rule: clustering disabled for any object with re-entry window < 30 days; prevents TIP-active objects from being absorbed into catalog clusters |
| 9 | ST_Buffer distance units ambiguous — degree-based buffer on SRID 4326 geometry produces latitude-varying results | Medium | §9.3 — Correct pattern documented: project to Web Mercator for metric buffer, or use GEOGRAPHY column buffer (natively metre-aware). Wrong pattern explicitly prohibited. |
| 10 | FIR intersection missing bounding-box pre-filter in some query paths | Medium | Already addressed — §9.3 FIR intersection query with && pre-filter and explicit ::geography::geometry cast; CI linter rule added. No further action required. |
| 11 | Altitude display mixes WGS-84 ellipsoidal and MSL datums without labelling — geoid offset (−106 m to +85 m) material at re-entry terminal altitudes | High | §13.5 — Altitude datum labelling table added: orbital context → ellipsoidal; airspace context → QNH; formatAltitude(metres, context) helper; altitude_datum field in prediction API response |

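The adaptive sampling function from finding 4 reduces to a band lookup. The band edges and intervals are the documented values; which band an exact boundary altitude falls into is an assumption of this sketch:

```python
def corridor_sample_interval_s(altitude_km: float) -> int:
    """Adaptive ground-track sampling interval per §15.4 (sketch).

    Coarse where the trajectory evolves slowly, fine in the terminal
    phase where displacement per unit time is highest.
    """
    if altitude_km > 300:
        return 300  # 5 min above 300 km
    if altitude_km > 150:
        return 120  # 2 min at 150–300 km
    if altitude_km > 80:
        return 30   # 30 s at 80–150 km
    return 10       # 10 s below 80 km
```
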
62.2 Sections Modified

| Section | Change |
|---|---|
| §4.4 (new) Coordinate Reference System Contract | CRS boundary table; docs/COORDINATE_SYSTEMS.md reference; antimeridian CZML serialiser note; polar corridor ST_DWithin proximity check and 89.5° clip |
| §4.5 (renumbered from 4.4) Implementation Checklist | Added docs/COORDINATE_SYSTEMS.md deliverable |
| §9.3 Index Specification | SRID CHECK constraints for 3 spatial tables; ST_Buffer correct/wrong patterns; explicit prohibition on degree-unit buffers |
| §13.2 Globe Object Clustering | Altitude-aware clustering rule: disable for decay-relevant objects (window < 30 days) |
| §13.5 Altitude and Distance Unit Display | Altitude datum labelling table (4 contexts); formatAltitude(metres, context) helper spec; altitude_datum API field |
| §15.4 Corridor Generation Algorithm | Adaptive ground-track sampling function (4 altitude bands); ADR requirement for reference polygon regeneration; test_czml_corridor_matches_postgis_polygon integration test |

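The formatAltitude helper's core rule — datum determined by context, never by user preference — can be sketched in a few lines. The context names and exact label strings are assumptions consistent with the datum table; the real spec lives in §13.5:

```python
def format_altitude(metres: float, context: str) -> str:
    """Sketch of the formatAltitude(metres, context) helper from §13.5.

    The datum label is a function of the display context: orbital altitudes
    are WGS-84 ellipsoidal heights; airspace altitudes use the QNH datum.
    """
    if context == "orbital":
        return f"{metres / 1000:.1f} km (ellipsoidal)"  # ellipsoidal height in km
    if context == "airspace":
        return f"{metres:.0f} m QNH"                    # aviation pressure datum
    raise ValueError(f"unknown altitude context: {context}")
```

Raising on an unknown context (rather than defaulting) enforces the rule that no altitude is ever displayed without an explicit datum.
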
62.3 New Documents and Files

| Artefact | Purpose |
|---|---|
| docs/COORDINATE_SYSTEMS.md | Authoritative CRS contract: frame at every system boundary |
| tests/integration/test_corridor_consistency.py | PostGIS vs CZML corridor bbox consistency test (safety_critical) |
| backend/app/utils/altitude.py | formatAltitude(metres, context) helper |

62.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Fixed 10-minute ground track sampling across all altitudes | Adaptive sampling: coarse above 300 km, fine in terminal phase below 150 km |
| ST_Buffer(geom_4326, 0.5) — degree buffer on geographic column | ST_Buffer(ST_Transform(geom, 3857), 50000) for Mercator metric, or ST_Buffer(geom::geography, 50000) for geodetic metric |
| ST_Intersects(airspace.geometry, corridor) without explicit cast | Always ::geography::geometry cast when mixing GEOGRAPHY and GEOMETRY types; enforced by CI linter |
| Clustering all objects by screen position | Disable CesiumJS EntityCluster for decay-relevant objects; altitude is a critical dimension for orbital objects |
| Altitude labelled as km without datum | Datum is always explicit: (ellipsoidal) or QNH or MSL per context |
| SRID declared in column type only | Add CHECK constraint: CHECK (ST_SRID(geom::geometry) = 4326) — prevents SRID-0 insertion from application layer |

62.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Adaptive sampling bands | 4 bands (> 300 km / 150–300 km / 80–150 km / < 80 km) | Single fine step (30 s) everywhere | Fine step everywhere generates unnecessary data volume in the high-altitude portion where trajectory changes are slow; 4 bands give fidelity where it matters at manageable data volume |
| Antimeridian strategy | GEOGRAPHY type (spherical arithmetic) for corridors | Split polygons at ±180° | Splitting at antimeridian requires downstream consumers (CesiumJS, PostGIS) to handle multi-polygon; GEOGRAPHY avoids the split natively |
| Polar corridor clip at 89.5° | ST_DWithin + clip | Full polar treatment | True polar passages are extremely rare for the tracked object population; full treatment (azimuthal projection, pole-aware alpha-shape) is disproportionate; clip + warning is the pragmatic safe choice |
| Altitude datum labelling | Per-context datum in formatAltitude helper | Global user setting | Datum is physically determined by the altitude context (orbital = ellipsoidal; aviation = QNH), not user preference; a user setting would allow operators to view the wrong datum label |
| Corridor consistency test tolerance | 10 km (0.1°) bbox agreement | Exact match | Sub-pixel globe rendering differences make exact match impractical; 10 km is far below the display resolution at most zoom levels and well below any operationally significant discrepancy |

§63 Real-Time Systems / WebSocket Engineering — Specialist Review

63.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | No message sequence numbers or ordering guarantee | High | Already addressed — seq field in event envelope; ?since_seq= reconnect replay; 200-event / 5-min ring buffer; resync_required on stale gap. No further action required. |
| 2 | No application-level delivery acknowledgement — delivered_websocket = TRUE set at send-time, not client-receipt | High | §4 WebSocket schema — alert.received / alert.receipt_confirmed round-trip for CRITICAL/HIGH; ws_receipt_confirmed column in alert_events; 10s timeout triggers email fallback |
| 3 | Fan-out architecture for multiple backend instances not specified | High | §4 WebSocket schema — Redis Pub/Sub fan-out via spacecom:alert:{org_id} channels; per-instance local connection registry; docs/adr/0020-websocket-fanout-redis-pubsub.md |
| 4 | No client-side reconnection backoff policy | High | Already addressed — src/lib/ws.ts specifies initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2. No further action required. |
| 5 | No state reconciliation protocol after reconnect | High | Already addressed — resync_required event triggers REST re-fetch; ?since_seq= replays up to 200 events. No further action required. |
| 6 | Dead WebSocket connection does not trigger ANSP fallback notification | High | §4 WebSocket schema — on_connection_closed schedules Celery task with 120s / 30s (active TIP) grace; on_reconnect revokes pending task; org primary contact emailed with TIP-aware subject line |
| 7 | No back-pressure or per-client send queue monitoring | High | §4 WebSocket schema — ConnectionManager with per-connection asyncio.Queue; circuit breaker at 50 queued events closes slow-client connection; spacecom_ws_send_queue_overflow_total counter |
| 8 | Offline clients do not see missed alerts surfaced on reconnect | Medium | §4 WebSocket schema — `GET /alerts?since=<ts>&include_offline=true`; received_while_offline: true annotation; localStorage last_seen_ts; amber border visual treatment in notification centre |
| 9 | Multi-tab acknowledgement not synced | Medium | Already addressed — alert.acknowledged event type in WebSocket schema broadcasts to all org connections. No further action required. |
| 10 | No per-org WebSocket connection visibility during TIP events | Medium | §4 WebSocket schema + Observability — spacecom_ws_org_connected and spacecom_ws_org_connection_count gauges; ANSPNoLiveConnectionDuringTIPEvent alert rule; on-call dashboard panel 9 |
| 11 | Caddy idle timeout silently terminates long-lived WebSocket connections | High | §26.9 Caddy configuration — idle_timeout 0 for WebSocket paths; read_timeout 0 / write_timeout 0 on WS reverse proxy transport; flush_interval -1; ping interval < proxy idle timeout rule documented |
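
The F4 backoff parameters (initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2) can be sketched as a delay schedule. This is an illustrative Python rendering of the src/lib/ws.ts policy, not the TypeScript client itself; the function name is hypothetical.

```python
import random

def reconnect_delays(attempts, initial_ms=1000, max_ms=30000,
                     multiplier=2, jitter=0.2, rng=random.random):
    """Delay (ms) before each reconnect attempt: exponential growth, capped at
    max_ms, with ±20% jitter to avoid thundering-herd reconnects after a
    backend restart."""
    delays = []
    for n in range(attempts):
        base = min(initial_ms * multiplier ** n, max_ms)
        spread = base * jitter
        delays.append(base - spread + 2 * spread * rng())
    return delays
```

With jitter centred (rng returning 0.5) the schedule is 1 s, 2 s, 4 s, … capped at 30 s, matching the parameters in the finding above.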

63.2 Sections Modified

| Section | Change |
|---------|--------|
| §4 WebSocket event schema | App-level receipt ACK protocol (F2); Redis Pub/Sub fan-out spec with code (F3); dead-connection ANSP fallback (F6); ConnectionManager back-pressure with per-connection queue (F7); offline missed-alert REST endpoint and notification centre treatment (F8); per-org Prometheus gauges and ANSPNoLiveConnectionDuringTIPEvent alert rule (F10) |
| §26.9 Caddy upstream configuration | WebSocket-specific Caddyfile additions: idle_timeout 0, WS path matcher, read_timeout 0, write_timeout 0, flush_interval -1; ping interval < proxy idle timeout rule (F11) |
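
The F2 receipt round-trip reduces to a simple pattern: send the alert, wait up to 10 s for the client's alert.received, otherwise take the email path. A minimal sketch, with the three callables standing in for the real WebSocket send, receipt future, and email task (all names illustrative):

```python
import asyncio

async def deliver_with_receipt(send, wait_for_receipt, email_fallback, timeout_s=10):
    """Push the alert, then require an application-level receipt within
    timeout_s; on timeout, fall back to email so CRITICAL/HIGH alerts are
    never assumed delivered at send() time."""
    await send()  # push alert over the WebSocket
    try:
        await asyncio.wait_for(wait_for_receipt(), timeout_s)
        return "ws_receipt_confirmed"   # client acknowledged in time
    except asyncio.TimeoutError:
        await email_fallback()          # no receipt -> email path
        return "email_fallback"
```

The return value maps onto the ws_receipt_confirmed column: TRUE only after the client's confirmation, never at send time.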

63.3 New Tables, Metrics, and Files

| Artefact | Purpose |
|----------|---------|
| alert_events.ws_receipt_confirmed | Tracks whether client confirmed receipt of CRITICAL/HIGH alerts |
| alert_events.ws_receipt_at | Timestamp of client receipt confirmation |
| spacecom_ws_send_queue_overflow_total{org_id} | Counter: WS send queue circuit breaker activations |
| spacecom_ws_org_connected{org_id, org_name} | Gauge: whether org has ≥1 active WS connection |
| spacecom_ws_org_connection_count{org_id} | Gauge: count of active WS connections per org |
| ANSPNoLiveConnectionDuringTIPEvent | Prometheus alert rule: warning when ANSP has no WS connection during active TIP |
| On-call dashboard panel 9 | ANSP Connection Status table (below fold) |
| docs/adr/0020-websocket-fanout-redis-pubsub.md | ADR: Redis Pub/Sub for cross-instance WS fan-out |
| docs/runbooks/websocket-proxy-config.md | Runbook: WS proxy timeout configuration for cloud deployments |
| docs/runbooks/ansp-connection-lost.md | Runbook: ANSP with no live connection during TIP event |
| `GET /alerts?since=<ts>&include_offline=true` | Missed-alert reconciliation endpoint |
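
The spacecom_ws_send_queue_overflow_total artefact above corresponds to the F7 back-pressure rule: a slow client's queue is bounded at 50 events, and overflow closes the connection rather than buffering without limit. A minimal sketch of that rule (class and attribute names illustrative; a deque stands in for the asyncio.Queue):

```python
from collections import deque

class ConnectionSendQueue:
    """Per-connection send queue with a circuit breaker: once `limit` events
    are pending, close the connection so the client reconnects and recovers
    via ?since_seq= replay instead of blocking the fan-out loop."""
    def __init__(self, limit=50):
        self.queue = deque()
        self.limit = limit
        self.closed = False
        self.overflow_total = 0  # mirrors spacecom_ws_send_queue_overflow_total

    def enqueue(self, event):
        if self.closed:
            return False
        if len(self.queue) >= self.limit:
            self.closed = True        # trip breaker: force slow client to reconnect
            self.overflow_total += 1  # increment the overflow counter once
            return False
        self.queue.append(event)
        return True
```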

63.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|--------------|------------------|
| delivered_websocket = TRUE set at send() time | App-level receipt ACK with 10s timeout; FALSE triggers email fallback |
| Single fan-out loop blocks on slow client | Per-connection async send queue with circuit breaker; slow client disconnected, not blocking |
| Caddy default idle timeout terminates quiet WS connections | idle_timeout 0 + read_timeout 0 on WS paths; ping interval enforced below proxy timeout |
| No distinction between "connected to SpaceCom" and "receiving alerts during TIP event" | Per-org connection gauge + ANSPNoLiveConnectionDuringTIPEvent alert distinguishes the two |
| resync_required causes silent state restoration with no visual indication | received_while_offline: true annotation + amber border in notification centre |
| Dead socket detected by ping-pong, silently closed | Grace-period Celery task schedules ANSP notification; cancelled on reconnect |
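
The ?since_seq= replay and resync_required behaviours referenced throughout this section follow one rule: replay events newer than the client's last seq from the 200-event ring buffer, or demand a full resync when the gap has scrolled out. A minimal sketch under those assumptions (class name illustrative):

```python
from collections import deque

class EventRingBuffer:
    """200-event replay buffer behind ?since_seq=: contiguous replay when the
    client's gap is still buffered, resync_required when it is not."""
    def __init__(self, capacity=200):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off the front

    def publish(self, seq, event):
        self.buffer.append((seq, event))

    def replay_since(self, since_seq):
        # Oldest buffered seq must be since_seq + 1 or earlier for a gapless replay.
        if self.buffer and self.buffer[0][0] > since_seq + 1:
            return "resync_required"  # gap predates the buffer: REST re-fetch
        return [e for s, e in self.buffer if s > since_seq]
```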

63.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Fan-out mechanism | Redis Pub/Sub | Sticky sessions (consistent hash) | Sticky sessions break blue-green deploys; Pub/Sub is stateless and works with any instance count |
| App-level ACK scope | CRITICAL and HIGH only | All events | Ack overhead for ingest.status and spaceweather.change is disproportionate; only safety-relevant alerts need receipt confirmation |
| Dead connection grace period | 120s normal / 30s active TIP | Immediate notification | False-positive notifications from brief network hiccups destroy operator trust in the system; grace period filters transient drops |
| Back-pressure circuit breaker | Close slow client (force reconnect) | Drop messages silently | Silently dropping alert messages is unacceptable; forced reconnect triggers the ?since_seq= replay mechanism, giving the client another chance to receive the queued events |
| Caddy WS idle timeout | 0 (no timeout) on WS paths only | Global 0 | Non-WS paths benefit from timeout protection against slow HTTP clients; WS paths require persistent connections; path-specific override is the correct scope |

§64 Data Governance & Privacy Engineering — Specialist Review

64.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | No DPIA document — pre-processing obligation for high-risk processing of aviation professionals' behavioural data | High | §29.1 — Full DPIA structure added (EDPB WP248 template, 7 sections, key risk findings identified); legal/DPIA.md designated as Phase 2 gate before EU/UK ANSP shadow activation |
| 2 | Right-to-erasure conflict with 7-year safety retention unresolved | High | Already addressed — §29.3 documents pseudonymisation procedure; Art. 17(3)(b) exemption explicitly invoked. No further action required. |
| 3 | IP addresses stored full-resolution for 7 years — no necessity assessment, no minimisation policy | High | §29.1 — IP retention updated to 90 days full / hash retained for longer period; hash_old_ip_addresses Celery task specified; necessity assessment documented |
| 4 | No Record of Processing Activities (RoPA) document | Medium | Already addressed — §29.1 contains the RoPA table with all required Art. 30 fields; legal/ROPA.md designated as authoritative. No further action required. |
| 5 | Cross-border transfer mechanisms not documented per jurisdiction pair | Medium | Already addressed — §29.5 documents EU default hosting, SCCs for cross-border transfers, Australian APP8, data residency policy in legal/DATA_RESIDENCY.md. No further action required. |
| 6 | Handover notes and acknowledgement text retained as-written indefinitely — free-text personal references not pseudonymised | Medium | §29.3 — pseudonymise_old_freetext Celery task added; 2-year operational retention window; text replaced with [text pseudonymised after operational retention window] |
| 7 | No DSAR procedure or SLA — endpoint exists but no documented process | High | §29.4a — Full DSAR procedure added: 7-step runbook, 30-day SLA, 60-day extension provision, legal/DSAR_LOG.md, export scope defined, exemptions documented |
| 8 | Audit log mixes personal data and integrity records — single table, conflicting retention obligations | High | §29.9 — integrity_audit_log table split out for non-personal operational records (7-year retention); security_logs constrained to user-action types with CHECK; migration plan specified |
| 9 | No formal sub-processor register — sub-processor details scattered across multiple documents | Medium | §29.4 — legal/SUB_PROCESSORS.md register added with 5 sub-processors, transfer mechanism, DPA status; customer notification obligation documented |
| 10 | operator_training_records has no retention or pseudonymisation policy | Medium | §28.9 — Retention policy: active + 2 years post-deletion; user_tombstone column; pseudonymisation task extended to cover training records |
| 11 | ToS acceptance implies consent is the universal lawful basis — incorrect and creates compliance exposure | High | §29.10 — Lawful basis mapping table added (5 processing activities); clarification that ToS acceptance evidences consent only for specific acknowledgements; Privacy Notice requirement restated |
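
The F3 minimisation task follows directly from the 90-day policy: after the necessity window, the full IP is replaced by a salted digest that stays linkable for audit without retaining the raw address. A minimal sketch, assuming rows are dicts with "ip" and "seen_at" fields (the function name matches the hash_old_ip_addresses task above; the row shape is illustrative):

```python
import hashlib
from datetime import datetime, timedelta, timezone

def hash_old_ip_addresses(rows, salt, now=None, keep_days=90):
    """Replace IPs older than keep_days with a salted SHA-256 digest in place.
    Idempotent: already-hashed values (sha256: prefix) are skipped."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    for row in rows:
        if row["seen_at"] < cutoff and not row["ip"].startswith("sha256:"):
            digest = hashlib.sha256((salt + row["ip"]).encode()).hexdigest()
            row["ip"] = f"sha256:{digest}"
    return rows
```

Idempotency matters because the Celery task runs periodically over the same table; the sha256: prefix guard makes re-runs a no-op.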

64.2 Sections Modified

| Section | Change |
|---------|--------|
| §28.9 Operator Training | Training records retention policy and pseudonymisation (F10): 2-year post-deletion window; user_tombstone column; Celery task extension |
| §29.1 Data Inventory | IP address retention updated to 90-day full / hash retained (F3); hash_old_ip_addresses Celery task; IP necessity assessment; DPIA structure expanded to full EDPB WP248 template (F1) |
| §29.3 Erasure Procedure | Free-text field periodic pseudonymisation added (F6): 2-year operational window; pseudonymise_old_freetext Celery task for shift_handovers.notes_text and alert_events.action_taken |
| §29.4 Data Processing Agreements | Sub-processor register table added (F9): 5 sub-processors, locations, transfer mechanisms |
| §29.4a (new) DSAR Procedure | Full 7-step DSAR procedure with 30-day SLA, export scope, exemption documentation (F7) |
| §29.9 (new) Audit Log Separation | integrity_audit_log table split; security_logs constrained to user-action types; migration plan (F8) |
| §29.10 (new) Lawful Basis Mapping | Per-activity lawful basis table; ToS acceptance ≠ universal consent; Privacy Notice requirement (F11) |

64.3 New Documents and Tables

| Artefact | Purpose |
|----------|---------|
| legal/DPIA.md | Data Protection Impact Assessment (EDPB WP248 template) — Phase 2 gate |
| legal/SUB_PROCESSORS.md | Art. 28 sub-processor register with transfer mechanisms |
| legal/DSAR_LOG.md | Log of all Data Subject Access Requests received and fulfilled |
| docs/runbooks/dsar-procedure.md | Step-by-step DSAR handling runbook |
| tasks/privacy_maintenance.py | Celery tasks: hash_old_ip_addresses, pseudonymise_old_freetext (extended to training records) |
| integrity_audit_log table | Non-personal operational audit records separated from security_logs |
| operator_training_records.user_tombstone | Pseudonymisation field for post-deletion training records |
| operator_training_records.pseudonymised_at | Timestamp tracking pseudonymisation |
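
The pseudonymise_old_freetext task listed above implements the F6 rule: free text older than the 2-year operational window is replaced in place, preserving the record while removing personal references. A minimal sketch, assuming records are dicts with "created_at" and "notes_text" fields (field names illustrative; the placeholder string is the one specified in §29.3):

```python
from datetime import datetime, timedelta, timezone

PLACEHOLDER = "[text pseudonymised after operational retention window]"

def pseudonymise_old_freetext(records, now=None, window_days=730):
    """Replace free-text fields past the ~2-year operational window with the
    placeholder, in place. Returns the number of records changed; re-runs
    are no-ops for already-pseudonymised rows."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    changed = 0
    for rec in records:
        if rec["created_at"] < cutoff and rec["notes_text"] != PLACEHOLDER:
            rec["notes_text"] = PLACEHOLDER
            changed += 1
    return changed
```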

64.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|--------------|------------------|
| DPIA treated as optional documentation exercise | Pre-processing legal obligation; EU personal data cannot be processed without completing it first |
| Full IP address retained for 7 years "for security" | 90-day necessity window; hash retained for longer-term audit; necessity assessment documented |
| Single security_logs table for both personal data and operational integrity records | Separate tables with separate retention policies; integrity_audit_log for non-personal records |
| ToS acceptance as universal consent mechanism | Lawful basis is determined by processing purpose; most SpaceCom processing is Art. 6(1)(b) or (f), not consent |
| Sub-processor details spread across multiple documents | Single legal/SUB_PROCESSORS.md register with mandatory Art. 28(3) fields |
| Free-text operational fields retained as-written indefinitely | 2-year operational window then pseudonymisation in place; record preserved, personal reference removed |

64.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| DPIA processing category | Art. 35(3)(b) — systematic monitoring of publicly accessible area | Art. 35(3)(a) — large-scale special category data | No special category data is processed; the systematic monitoring category is the correct trigger given real-time operational pattern tracking of named aviation professionals |
| IP hashing threshold | 90 days | 30 days / 1 year | 90 days covers the active investigation window for the vast majority of security incidents; shorter is unnecessarily restrictive for legitimate investigation; longer retains more than necessary |
| Free-text pseudonymisation window | 2 years post-creation | Immediate deletion / 7-year retention as-written | 2 years covers all active PIR, investigation, and regulatory inquiry periods while removing personal references well before maximum retention; deletion would destroy operational context needed for safety record; 7-year as-written retention is disproportionate |
| Audit log split mechanism | Separate table with CHECK constraint on security_logs | Application-level routing only | Database constraint enforces the separation at ingest time; application routing alone is fragile and will be bypassed as code evolves |
| DSAR response channel | Encrypted ZIP to verified email | In-platform download only | In-platform download is unavailable after account deletion; verified email ensures identity confirmation and provides a paper trail |

Appendix §65 — Cost Engineering / FinOps Hat Review

Hat: Cost Engineering / FinOps
Reviewer focus: Infrastructure cost visibility, unit economics, per-resource attribution, cost anti-patterns, egress waste, idle resource cost

65.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | No unit economics model — impossible to reason about margin per customer tier | HIGH | §27.7 (new) | Added unit economics model with cost-to-serve breakdown and break-even analysis; reference doc docs/business/UNIT_ECONOMICS.md |
| F2 | Storage table lacked cost figures — MC blob cost invisible to planners | MEDIUM | §27.4 | Added Cloud Cost/Year column to storage table; S3-IA pricing for MC blobs; noted dominant cost driver |
| F3 | No metric tracking external API calls (Space-Track budget at risk) | MEDIUM | §27.1 | Added spacecom_ingest_api_calls_total{source} counter; alert when Space-Track calls approach the 100/day AUP limit |
| F4 | No per-org simulation CPU tracking — Enterprise chargeback impossible | MEDIUM | §27.1 | Added spacecom_simulation_cpu_seconds_total{org_id, norad_id} counter; monthly usage report task |
| F5 | CZML egress cost unquantified; no brotli compression mandate | LOW | §27.5 | Added CZML egress cost estimate (~$17/mo at Phase 2–3); brotli compression policy added |
| F6 | Celery worker idle cost not analysed — $1,120/mo regardless of usage | HIGH | §27.3 | Added idle cost analysis; scale-to-zero rejected (violates MC SLO); scale-to-1 KEDA policy for Tier 3 documented |
| F7 | No per-org email rate limit — SMTP quota at risk during flapping events | MEDIUM | §4 (WebSocket/alerts) | Added 50 emails/hour/org rate limit with digest fallback; Celery hourly digest task; cost rationale |
| F8 | Renderer always-on rationale not documented; co-location OOM risk unaddressed | LOW | §35.5 | Added on-demand analysis table; confirmed always-on at Tier 1–2; documented co-location isolation requirement |
| F9 | Backup storage cost not projected — surprise cost at Tier 3 | LOW | §27.4 | Added WAL backup cost projection; $100–200/month at Tier 3 steady state |
| F10 | No Redis memory budget — result backend accumulation can cause OOM | HIGH | §27.8 (new) | Added Redis memory budget table by purpose/DB index; maxmemory 2gb; result_expires=3600 requirement |
| F11 | No per-org cost attribution mechanism for Enterprise tier negotiations | MEDIUM | §27.1 | Added monthly usage report Celery task; per-org CPU-seconds → cost-per-run attribution |
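
The F7 rate limit is a plain hour-window counter: the first 50 alert emails per org per hour go out individually, the rest are queued for the hourly digest. A minimal sketch under those assumptions, with a dict standing in for the Redis counter (class and method names illustrative):

```python
class OrgEmailLimiter:
    """Per-org email rate limit: at most `limit` individual sends per
    (org, hour) window; overflow is deferred to an hourly digest rather
    than dropped, protecting the SMTP quota during flapping events."""
    def __init__(self, limit=50):
        self.limit = limit
        self.counts = {}   # (org_id, hour_bucket) -> emails sent this hour
        self.digest = {}   # org_id -> alerts deferred to the digest

    def send(self, org_id, alert, hour_bucket):
        key = (org_id, hour_bucket)
        if self.counts.get(key, 0) < self.limit:
            self.counts[key] = self.counts.get(key, 0) + 1
            return "sent"
        self.digest.setdefault(org_id, []).append(alert)
        return "digested"
```

In production the counter would be a Redis INCR with an expiring key per hour bucket, which keeps the check O(1) per email as the decision log below notes.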

65.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §27.1 Workload Characterisation | Added cost-tracking Prometheus counters (F3, F4) and per-org usage report task (F11) |
| §27.3 Deployment Tiers | Added Celery worker idle cost analysis and scale-to-zero decision table (F6) |
| §27.4 Storage Growth Projections | Added Cloud Cost/Year column; storage cost summary; backup cost projection (F2, F9) |
| §27.5 Network and External Bandwidth | Added CZML egress cost estimate and brotli compression policy (F5) |
| §27.7 Unit Economics Model (new) | Full unit economics model: cost-to-serve, revenue per tier, break-even analysis (F1) |
| §27.8 Redis Memory Budget (new) | Redis memory budget by purpose; maxmemory setting; result cleanup requirement (F10) |
| §4 WebSocket / Alerts | Added per-org email rate limit (50/hr) with digest fallback; SMTP cost rationale (F7) |
| §35.5 Renderer Container Constraints | Added on-demand analysis; memory isolation rationale; co-location risk guidance (F8) |

65.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| docs/business/UNIT_ECONOMICS.md | Unit economics model; cost-to-serve per tier; break-even analysis; update quarterly |
| docs/infra/REDIS_SIZING.md | Redis memory budget by purpose; eviction policy decisions; sizing rationale |
| docs/business/usage_reports/{org_id}/{year}-{month}.json | Per-org monthly usage reports for Enterprise tier chargeback |
| backend/app/metrics.py (additions) | spacecom_ingest_api_calls_total and spacecom_simulation_cpu_seconds_total counters |
| backend/app/alerts/email_delivery.py | Per-org email rate limiting logic with Redis counter and digest queue |
| backend/celeryconfig.py (addition) | result_expires = 3600 to prevent Redis result backend accumulation |

65.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| Scale-to-zero simulation workers | 60–120s cold-start violates 10-min MC SLO; scale-to-1 minimum is the correct floor |
| Co-locating renderer with simulation workers | Chromium 2–4 GB render memory + MC worker memory = OOM on 32 GB nodes; isolated container required |
| Unbounded alert emails per org | SMTP relay quota exhausted during flapping events; 50/hr cap with digest is operationally equivalent at lower cost |
| Redis without result_expires | MC sub-task result accumulation; 500 sub-tasks × 1 MB = 500 MB peak; without expiry, accumulates across runs indefinitely |
| Single Redis noeviction policy | Blocks cache use alongside broker in same instance; DB-index split with allkeys-lru on cache DB required |

65.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Simulation worker floor | Scale-to-1 minimum at Tier 3 | Scale-to-zero | Cold-start from zero violates 10-min MC SLO; one warm worker absorbs small queues instantly |
| Email rate limit mechanism | Redis hour-window counter + Celery digest task | Database-level throttle / no limit | Redis counter is O(1) per email with sub-millisecond latency; DB throttle adds per-email DB write at high fan-out; no limit is an SMTP quota risk |
| Unit economics granularity | Per-org CPU-seconds via Prometheus | Per-request DB logging | Prometheus counter aggregation has negligible overhead; DB per-request logging at MC sub-task granularity = 500 writes/run |
| Redis maxmemory target | 2 GB (cache.r6g.large with 8 GB RAM) | 4 GB / 1 GB | 2× headroom above the 700–750 MB peak estimate; leaves room for the OS and other processes; alerting below 4 GB fires before OOM |
| CZML compression priority | Brotli before gzip in Caddy encode block | gzip only | Brotli achieves 70–80% reduction vs. gzip's 60–75%; modern browsers universally support brotli; on-premise clients are always browser-based |

Appendix §66 — Open Source / Dependency Licensing Hat Review

Hat: OSS Licensing Engineer
Reviewer focus: Licence obligations for closed-source SaaS, SBOM completeness, redistribution constraints, IP risk in ESA bid context, contractor IP ownership

66.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | CesiumJS AGPLv3 commercial licence not explicitly gated as Phase 1 blocker | CRITICAL | §6 Phase 1 checklist, §29.11 (new) | Added Phase 1 blocking gate requiring cesium-commercial.pdf; dedicated §29.11 F1 section with phase-gate language |
| F2 | SBOM covered container image (syft) but not dependency manifests (pip-licenses/license-checker JSON merge) | HIGH | §26.9 CI table, §6 Phase 1 checklist, §29.11 (new) | Added manifest SBOM merge to build-and-push; docs/compliance/sbom/ as versioned store; Phase 1 gate updated |
| F3 | Space-Track AUP redistribution risk not analysed in detail for API endpoint and credential exposure | MEDIUM | §29.11 (new) | Added two-vector redistribution analysis (API exposure + credential in client-side code); confirmed detect-secrets coverage |
| F4 | poliastro LGPLv3 licence not documented; LGPL dynamic linking compliance undocumented | MEDIUM | §29.11 (new) | Added LGPL compliance assessment; legal/LGPL_COMPLIANCE.md required; standard pip install satisfies LGPL |
| F5 | TimescaleDB dual-licence (TSL vs Apache 2.0) not assessed; risk if TSL-only features adopted | MEDIUM | §29.11 (new) | Added feature-by-feature TimescaleDB licence table; confirmed SpaceCom uses only Apache 2.0 features; re-assessment gate if multi-node adopted |
| F6 | Redis SSPL adoption (7.4+) not assessed; Valkey alternative not documented | MEDIUM | §29.11 (new) | Added SSPL internal-use assessment; legal counsel confirmation required before Phase 3; Valkey/Redis 7.2 as fallback |
| F7 | Playwright/Chromium binary licence not captured in SBOM | LOW | §29.11 (new) | Confirmed Apache 2.0 (Playwright) + BSD-3 (Chromium); captured by syft container scan; no redistribution |
| F8 | Caddy enterprise plugin licence risk not noted; audit process not defined | LOW | §29.11 (new) | Added plugin licence audit requirement; PR checklist for Caddyfile changes |
| F9 | PostGIS GPLv2 linking exception not documented | LOW | §29.11 (new) | Confirmed linking exception applies to PostgreSQL extension use; legal/LGPL_COMPLIANCE.md to document |
| F10 | pip-licenses --fail-on list missing SSPL; no SSPL check on npm side | MEDIUM | §29.11 (new), §7.13 CI step | Added SSPL to Python fail-on list; SSPL added to npm failOn; exact version pinning requirement stated |
| F11 | No CLA or work-for-hire mechanism before contractor contributions | HIGH | §29.11 (new), §6 Phase 2 checklist | Added CLA template requirement (legal/CLA.md); CONTRIBUTING.md disclosure; Phase 2 gate |

66.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §6 Phase 1 legal/compliance checklist | Added CesiumJS commercial licence as explicit blocking gate; expanded SBOM checklist item to cover manifest SBOMs; added LGPL/PostGIS and TimescaleDB/Redis licence document gates |
| §26.9 CI workflow table | Updated build-and-push job to include manifest SBOM merge and docs/compliance/sbom/ artefact storage |
| §29.11 (new) | Full OSS licence compliance section: F1–F11 covering all material dependencies |

66.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| legal/OSS_LICENCE_REGISTER.md | Authoritative per-dependency licence record; updated on major version changes |
| legal/LICENCES/cesium-commercial.pdf | Executed CesiumJS commercial licence — Phase 1 blocking gate |
| legal/LICENCES/timescaledb-licence-assessment.md | TimescaleDB Apache 2.0 vs. TSL feature confirmation |
| legal/LICENCES/redis-sspl-assessment.md | Redis SSPL internal-use assessment; legal counsel sign-off |
| legal/LGPL_COMPLIANCE.md | poliastro LGPL dynamic linking compliance; PostGIS GPLv2 linking exception |
| legal/CLA.md | Contributor Licence Agreement template for external contributors |
| docs/compliance/sbom/ | Versioned SBOM artefacts: syft SPDX-JSON + manifest JSONs per release |
| CONTRIBUTING.md | CLA requirement disclosure; external contributor instructions |

66.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| "CesiumJS licence can wait until Phase 2" | AGPLv3 network use provision applies from the first external demo — waiting creates retroactive non-compliance exposure in an ESA bid context |
| Excluding CesiumJS from the licence gate without a commercial licence on file | CI exclusion hides the issue; the gate is correct only when the commercial licence exists |
| Assuming LGPL dynamic linking is automatically satisfied | Must be documented; LGPL allows relinking — standard pip install satisfies this but the compliance position must be written down |
| Single Redis noeviction policy | Already rejected in §65; Redis SSPL also motivates Valkey evaluation as BSD-3 alternative |
| Assuming all TimescaleDB features are Apache 2.0 | TSL features (multi-node, data tiering) would require a Timescale commercial agreement; feature use must be tracked |

66.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| CesiumJS licence | Commercial licence from Cesium Ion; Phase 1 blocker | Open-source the frontend (comply with AGPLv3) | Source disclosure of SpaceCom's frontend is commercially unacceptable; commercial licence is the only viable path for a closed-source product |
| Redis SSPL response | Legal counsel assessment; Valkey as fallback | Immediate migration to Valkey | Internal-use assessment is likely favourable; premature migration introduces risk; assess first |
| poliastro LGPL | Document standard pip install compliance | Seek MIT-licensed alternative | Standard pip install satisfies LGPL dynamic linking; replacing poliastro would require significant re-engineering for marginal legal gain |
| SBOM format | SPDX-JSON (syft) + pip-licenses/license-checker manifests merged | CycloneDX only | SPDX is the format required by ECSS and EU Cyber Resilience Act; CycloneDX can be generated alongside if required by a specific customer |

Appendix §67 — Distributed Systems / Consistency Hat Review

Hat: Distributed Systems Engineer
Reviewer focus: Consistency guarantees, failure modes, split-brain scenarios, clock skew, ordering, idempotency, CAP trade-offs

67.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | Chord callback doesn't validate result count — partial results silently produce truncated predictions | CRITICAL | §27.2 chord section | Added result count guard in aggregate_mc_results; raises ValueError on mismatch; spacecom_mc_chord_partial_result_total counter; DLQ routing |
| F2 | No Celery autoretry_for=(OperationalError,) on DB-writing tasks — Patroni 30s failover window causes permanent task failure | HIGH | §27.6 PgBouncer section | Added autoretry_for=(OperationalError,) policy; max_retries=3, retry_backoff=5, cap 30s; applies to all DB-writing Celery tasks |
| F3 | Redis Sentinel split-brain risk not documented or assessed | MEDIUM | §26 Redis Sentinel section | Added split-brain assessment; accepted risk for ephemeral data; min-replicas-to-write 1 mitigates; ADR-0021 required |
| F4 | HMAC signing race — prediction INSERT then HMAC UPDATE creates window of unsigned prediction | HIGH | §10 HMAC section | Fixed: pre-generate UUID in application before INSERT; compute HMAC with UUID; single-phase write; migration from BIGSERIAL to UUID PK documented |
| F5 | alert_events.seq assigned via MAX(seq)+1 trigger — concurrent inserts produce duplicates | HIGH | §4 WebSocket/events section | Replaced with CREATE SEQUENCE alert_seq_global; globally monotonic; per-org ordering via WHERE org_id = $1 ORDER BY seq |
| F6 | Clock skew between server and client causes CZML ground track timing drift — no detection mechanism | MEDIUM | §4 API section | Added chronyd/timesyncd host requirement; node_timex_sync_status Grafana alert; GET /api/v1/time endpoint; client-side skew warning banner at >5s |
| F7 | MinIO multipart upload has no retry on write quorum failure — MC blob lost silently | HIGH | §27.4 storage section | Added autoretry_for=(S3Error,) with 30s backoff; MinIO ILM rule to abort incomplete multipart uploads after 24h |
| F8 | celery-redbeat double-fire on restart: only TLE ingest has ON CONFLICT DO NOTHING; space weather and IERS EOP lack upsert | MEDIUM | §11 ingest section | Added upsert patterns for all periodic ingest tables; unique constraint requirements stated |
| F9 | WebSocket fan-out cross-channel ordering — no cross-org ordering guarantee | LOW | — | Already addressed — Redis Pub/Sub ordering is per-channel (per-org); sequence numbers provide intra-org ordering. No further action required. |
| F10 | reentry_predictions FK referenced with default CASCADE — accidental simulation delete cascades to legal-hold predictions | HIGH | §9 schema | Changed all REFERENCES reentry_predictions(id) to ON DELETE RESTRICT in alert_events, prediction_outcomes, superseded_by FK |
| F11 | No distributed trace context propagation through chord sub-tasks and callback | MEDIUM | §26.9 OTel section | Added chord trace context injection/extraction pattern; verified CeleryInstrumentor for single tasks; manual propagate.inject/extract for chord callback continuity |
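
The F1 guard is a one-line invariant: a chord callback must refuse to aggregate fewer results than sub-tasks dispatched. A minimal sketch of that check, with a mean standing in for the real statistics (the aggregation shown is illustrative, not the actual MC post-processing):

```python
def aggregate_mc_results(results, expected_count):
    """Chord callback guard: a 400-sample run is not a 500-sample run, so a
    count mismatch raises rather than silently producing a truncated
    prediction. In production the mismatch also increments
    spacecom_mc_chord_partial_result_total and routes the run to the DLQ."""
    if len(results) != expected_count:
        raise ValueError(
            f"partial chord result: got {len(results)} of {expected_count}"
        )
    return sum(results) / len(results)  # placeholder for real aggregation
```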

67.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §27.2 MC Parallelism | Added chord result count validation in aggregate_mc_results; partial result counter |
| §27.6 DNS / PgBouncer | Added Celery autoretry_for=(OperationalError,) policy for Patroni failover window |
| §26 Redis Sentinel | Added split-brain risk assessment; min-replicas-to-write 1 config; ADR-0021 |
| §10 HMAC signing | Fixed two-phase write race: pre-generate UUID, single-phase INSERT; PK migration note |
| §4 WebSocket schema | Added alert_seq_global PostgreSQL SEQUENCE replacing MAX(seq)+1 trigger |
| §4 API / health | Added GET /api/v1/time clock skew endpoint; NTP sync requirement; client banner |
| §27.4 Storage | Added MinIO multipart upload retry; incomplete upload ILM expiry rule |
| §11 Ingest | Added upsert patterns for space_weather and IERS EOP; unique constraint requirements |
| §9 Data Model | Changed REFERENCES reentry_predictions(id) to ON DELETE RESTRICT on 3 FKs |
| §26.9 OTel/Tracing | Added chord trace context propagation pattern; propagate.inject/extract for callback |
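
The F4 fix above eliminates the unsigned-prediction window by generating the primary-key UUID in the application, covering it with the HMAC, and writing row plus signature in a single INSERT. A minimal sketch of the signing and verification halves, using stdlib hmac over a canonical JSON encoding (function names and field layout are illustrative):

```python
import hashlib
import hmac
import json
import uuid

def build_signed_prediction(payload, key):
    """Single-phase write (F4): the UUID exists before the INSERT, so the HMAC
    can cover it and the row is never stored unsigned."""
    row = {"id": str(uuid.uuid4()), **payload}
    message = json.dumps(row, sort_keys=True).encode()  # canonical encoding
    row["hmac"] = hmac.new(key, message, hashlib.sha256).hexdigest()
    return row  # id and hmac written together in one INSERT

def verify_prediction(row, key):
    """Recompute the HMAC over everything except the signature itself."""
    unsigned = {k: v for k, v in row.items() if k != "hmac"}
    message = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, row["hmac"])
```

Contrast with the rejected two-phase pattern: INSERT then UPDATE leaves a window in which a valid-looking but unsigned prediction exists in the database.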

67.3 New ADRs Required

| ADR | Decision |
|-----|----------|
| docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md | Accept Redis Sentinel split-brain risk for ephemeral data; min-replicas-to-write 1 mitigation; email rate limit counter inconsistency accepted as cost control gap |

67.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| MAX(seq)+1 for sequence assignment in trigger | Race condition under concurrent inserts — two transactions read the same MAX and both write the same seq; PostgreSQL SEQUENCE is lock-free and gap-tolerant |
| Two-phase HMAC (INSERT then UPDATE) | Creates a window where a valid unsigned prediction exists in the DB; single-phase INSERT with pre-generated UUID eliminates the window |
| No retry on Celery DB tasks during Patroni failover | The 30s failover window is a known operational event; retries with 5s exponential backoff capped at 30s fit entirely within the failover window |
| ON DELETE CASCADE on legal-hold FK references | Accidental deletion of a simulation row would cascade to 7-year-retention safety records; RESTRICT forces explicit deletion of dependents first, making accidental cascade impossible |
| Scale-to-zero with immediate cold-start | Already rejected in §65; the distributed systems perspective adds: cold-start during Patroni failover + worker cold-start = double failure; always keep 1 warm worker |
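
The F2 retry arithmetic is worth making explicit: with retry_backoff=5 and max_retries=3, the waits are 5 s, 10 s, 20 s (capped at 30 s), so the retry sequence spans the ~30 s Patroni failover window instead of failing on the first OperationalError. A sketch of that schedule (the function is illustrative; in the real codebase these are Celery task options, not a helper):

```python
def retry_schedule(max_retries=3, backoff_s=5, cap_s=30):
    """Exponential backoff waits (seconds) for Celery autoretry: 5, 10, 20, …
    each capped at cap_s, mirroring max_retries=3 / retry_backoff=5 / cap 30s."""
    return [min(backoff_s * 2 ** n, cap_s) for n in range(max_retries)]
```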

67.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Chord result count validation | ValueError → DLQ → HTTP 500 + Retry-After | Silently write partial result | A 400-sample prediction is not a 500-sample prediction; confidence intervals and corridor widths are wrong; it is safer to fail visibly |
| reentry_predictions PK type | Migrate BIGSERIAL → UUID; pre-generate in application | Keep BIGSERIAL; use two-phase HMAC | UUID pre-generation eliminates the race window; UUID is also a safer choice for distributed deployments where sequence coordination between nodes is not possible |
| alert_seq assignment | Single global alert_seq_global SEQUENCE | Per-org sequences | Single sequence is simpler to manage; global monotonicity is sufficient for per-org ordering by filtering on org_id; per-org sequences require one sequence per org — complex at scale |
| Redis split-brain response | Accept risk; document in ADR | Migrate to Redis Cluster (stronger consistency) | Redis Cluster adds significant operational complexity (hash slots, resharding, client-side routing); split-brain on Sentinel with 3 nodes is rare and the affected data is ephemeral or cost-control only |

Appendix §68 — Commercial / Pricing Architecture Hat Review

Hat: Commercial Strategy / Pricing Architect
Reviewer focus: Pricing model design, deal structure, revenue protection, margin preservation, enterprise negotiation guardrails, commercial signals in technical architecture

68.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | No contracts table — feature access not gated on commercial state; admin can enable Enterprise features with no contract | CRITICAL | §9 data model, §24 commercial section | Added contracts table with financial terms, feature enablement flags, discount approval constraint, PS tracking; nightly sync task |
| F2 | Usage data not surfaced to commercial team or org admins — renewal conversations lack data | HIGH | §27.7 unit economics | Added monthly usage summary emails to commercial team and org admins; send_usage_summary_emails Beat task |
| F3 | No shadow trial time limit — ANSP could remain in shadow mode indefinitely without signing production contract | HIGH | §9 organisations table | Added shadow_trial_expires_at column; enforcement via daily Celery task that auto-deactivates expired trials |
| F4 | No discount approval guard-rails — single admin can give 100% discount | MEDIUM | §9 contracts table | Added CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL) constraint; discount >20% requires named approver |
| F5 | No inbound API request counter — usage-based billing for Persona E/F impossible | MEDIUM | §27.1 metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code}; FastAPI middleware |
| F6 | On-premise deployments have no licence key enforcement — multi-instance or post-expiry use undetectable | HIGH | §34 infrastructure section | Added RSA JWT licence key mechanism; licence-expired degraded mode; hourly Celery re-validation; key rotation script |
| F7 | No contract expiry alerts — contracts expire silently; revenue risk | HIGH | §4 Celery tasks | Added check_contract_expiry Beat task at 90/30/7-day thresholds; courtesy notice to org admin at 30 days |
| F8 | Free/shadow tier has no MC simulation quota — free usage consumes paid-tier worker capacity | MEDIUM | §9 organisations table, §27.7 | Added monthly_mc_run_quota column (default 100); POST /api/v1/decay/predict quota enforcement with 429 + Retry-After |
| F9 | No MRR/ARR tracking — commercial team cannot measure revenue targets | HIGH | §9 contracts table, §27.7 | contracts.monthly_value_cents + spacecom_mrr_eur Prometheus gauge updated nightly; Grafana MRR panel |
| F10 | Professional Services not documented as a revenue line — first-year contract value underestimated | MEDIUM | §27.7 unit economics | Added PS revenue table (engagement types, values); contracts.ps_value_cents; Year 1 total contract value formula |
| F11 | Multi-ANSP coordination panel available to all tiers — high-value Enterprise feature not packaging-protected | MEDIUM | §9 organisations table | Added feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE; gated in UI by feature flag; synced from contracts.enables_multi_ansp_coordination |
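The 90/30/7-day expiry alert logic (F7) reduces to a pure function the daily check_contract_expiry Beat task would call; a minimal sketch, assuming the task runs once per UTC day — the function name and signature are illustrative, not the final task interface:

```python
from datetime import date

# Alert thresholds from F7: escalate at 90, 30, and 7 days before expiry.
THRESHOLDS = (90, 30, 7)

def expiry_alerts_due(expiry: date, today: date) -> list[int]:
    """Return the thresholds (in days) that fire today for one contract.

    A threshold fires only on the exact day the remaining term equals it,
    so a daily Beat task never sends duplicate alerts for the same
    threshold, and an expired contract produces no further alerts.
    """
    days_left = (expiry - today).days
    return [t for t in THRESHOLDS if days_left == t]
```

Keeping the threshold comparison exact (rather than `<=`) is what makes the task idempotent across daily runs without storing per-alert state.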

68.2 Sections Modified

| Section | Change summary |
|---|---|
| §9 organisations table | Added shadow_trial_expires_at, monthly_mc_run_quota, feature_multi_ansp_coordination, licence_key, licence_expires_at columns |
| §9 (new contracts table) | Full contracts table with financial terms, discount approval constraint, feature enablement, PS tracking |
| §24 commercial section | Added contracts table spec, MRR tracking, feature sync task, discount enforcement |
| §27.1 cost-tracking metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code} counter |
| §27.7 unit economics | Added PS revenue table; shadow trial quota enforcement code; usage summary emails |
| §34 on-premise deployment | Added RSA JWT licence key mechanism; degraded mode on expiry; key rotation process |
| §4 Celery Beat tasks | Added check_contract_expiry 90/30/7-day alert task; send_usage_summary_emails monthly task |

68.3 New Files and Documents Required

| File | Purpose |
|---|---|
| docs/business/UNIT_ECONOMICS.md | Updated with PS revenue line, Year 1 total contract value formula, MRR tracking |
| tasks/commercial/contract_expiry_alerts.py | Contract expiry Celery task (90/30/7-day thresholds) |
| tasks/commercial/send_commercial_summary.py | Monthly commercial team usage summary email |
| tasks/commercial/sync_feature_flags.py | Nightly sync of org feature flags from active contracts |
| scripts/generate_licence_key.py | RSA JWT licence key generation script (requires private key) |
| legal/contracts/ | Contract document store (MSA PDFs, signed sandbox agreements) |

68.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|---|---|
| Admin toggle for feature access without contract gate | Single admin can bypass commercial controls; contracts table with nightly sync is the authoritative source |
| Unlimited MC runs for free tier | Free-tier heavy users degrade paid-tier SLO by consuming simulation worker capacity; 100-run/month quota is enforceable without impacting legitimate evaluation |
| Honour-system on-premise licensing | Without a licence key, post-expiry use is undetectable and unenforceable; JWT with RSA signature provides cryptographic enforcement with no ongoing connectivity requirement |
| Silent contract expiry | Revenue loss from silent expiry is predictable and preventable; 90/30/7-day alerts are standard SaaS practice |
| Infinite shadow trial | Shadow mode is a commercial transition stage, not a permanent state; shadow_trial_expires_at enforces the commercial expectation established in the Regulatory Sandbox Agreement |

68.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Feature flag sync | Nightly Celery task syncs from contracts | Real-time sync on every request | Real-time sync adds a DB query per request; nightly sync is sufficient for contract-level changes, which happen at most monthly |
| Licence key format | RSA-signed JWT | Database-backed licence check | JWT is verifiable offline (no network required for air-gapped deployments); the RSA signature prevents forgery without access to the SpaceCom private key |
| Discount approval threshold | Up to 20% without approval; >20% requires named approver | Flat approval for all discounts | 0-20% is sales discretion; >20% represents strategic pricing requiring commercial leadership sign-off; the DB constraint makes this enforceable rather than advisory |
| PS revenue tracking | contracts.ps_value_cents one-time field | Separate PS contracts table | PS is almost always bundled with the main contract at first engagement; a separate table adds complexity for marginal benefit at Phase 2-3 scale |
| MRR metric | Prometheus gauge from nightly Celery task | Real-time DB query in Grafana | Prometheus gauge is consistent with other business metrics; Grafana can scrape it without a DB connection; historical MRR trend is automatically recorded |
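The nightly MRR gauge decision above reduces to a simple aggregation over contracts active on the day the Celery task runs. A sketch under the assumption that each contracts row exposes monthly_value_cents plus start and end dates (field names mirror the plan's contracts table; the function name is illustrative, and the real task would feed the result into the spacecom_mrr_eur gauge):

```python
from datetime import date

def mrr_eur(contracts: list[dict], today: date) -> float:
    """Sum monthly recurring revenue (EUR) over contracts active today.

    One-time Professional Services revenue (ps_value_cents) is
    deliberately excluded — it is booked once, not recurring, so it
    belongs in total contract value rather than MRR.
    """
    cents = sum(
        c["monthly_value_cents"]
        for c in contracts
        if c["start_date"] <= today <= c["end_date"]
    )
    return cents / 100.0
```

Because the gauge is recomputed from the contracts table nightly rather than incremented, a missed run self-heals on the next execution.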

§69 Cross-Hat Governance and Decision Authority

This section resolves conflicts between specialist reviews. SpaceCom uses hats to surface expert constraints, not to create parallel authorities. Where hats conflict, this section defines who decides, how the decision is recorded, and which interpretation governs implementation.

69.1 Decision Authority Model

| Decision class | Primary owner | Mandatory reviewers | Tie-break principle |
|---|---|---|---|
| Product packaging, contracts, commercial entitlements | Product / Commercial owner | Legal, Engineering | Contractual and legal truth beats UI shorthand |
| Safety-critical alerting, operational UX, hazard communication | Safety case owner | Human Factors, Regulatory, Engineering | Safer operator outcome beats convenience or sales flexibility |
| Core architecture, infrastructure, CI/CD, consistency | Architecture / Platform owner | Security, SRE, DevOps | Lower operational risk and clearer failure semantics beat elegance |
| Privacy, data governance, lawful basis, retention | Legal / Privacy owner | Product, Engineering | Regulatory obligation beats implementation convenience |
| External licensing / open source / procurement artefacts | Legal / Procurement owner | Engineering, Product | Licence compliance beats delivery speed |

Any unresolved cross-hat conflict is recorded in docs/governance/CROSS_HAT_CONFLICT_REGISTER.md before implementation proceeds.

69.2 Arbitration Rules Adopted

  1. Commercial source of truth: contracts is the authoritative source for features, quotas, and deployment rights. subscription_tier is descriptive only.
  2. CI/CD platform: SpaceCom uses self-hosted GitLab. All GitHub Actions references in the plan are interpreted as GitLab CI equivalents and must be implemented in .gitlab-ci.yml, protected environments, and GitLab approval rules.
  3. Redis split by trust class: redis_app holds higher-integrity application state; redis_worker holds broker/result/cache state. Split-brain acceptance applies only to redis_worker.
  4. Commercial enforcement deferral: Licence expiry, shadow-trial expiry, and quota exhaustion must not interrupt active TIP / CRITICAL operations. Enforcement is deferred, logged, and applied after the active event closes.
  5. Alert escalation matrix: Progressive escalation is the default. Immediate bypass is allowed only for imminent-impact or integrity-compromise conditions formally listed in the alert definition and traced into safety artefacts.
  6. Renderer privilege exception: The renderer SYS_ADMIN capability is an approved exception, not a precedent. Any similar request from another service requires a new ADR and security review.
  7. Phase 0 blockers: Space-Track AUP architecture and Cesium commercial licensing are Phase 0 gates. Work that would lock in ingest or frontend architecture must not proceed before those gates are closed.
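Arbitration rule 4 (commercial enforcement deferral) can be sketched as a small state holder that queues enforcement actions while a TIP / CRITICAL event is active. The class and method names are illustrative, actions are represented as plain strings, and a real implementation would act on the contracts and licence records and emit audit log entries:

```python
from dataclasses import dataclass, field

@dataclass
class DeferredEnforcement:
    """Queue commercial enforcement during active TIP / CRITICAL events.

    Per arbitration rule 4: licence expiry, shadow-trial expiry, and
    quota exhaustion are logged but never applied mid-event; deferred
    actions are applied once the event closes.
    """
    active_critical_event: bool = False
    pending: list[str] = field(default_factory=list)
    applied: list[str] = field(default_factory=list)

    def enforce(self, action: str) -> bool:
        """Apply immediately if no event is active, else defer and log.
        Returns True only when the action was applied now."""
        if self.active_critical_event:
            self.pending.append(action)  # deferred, not dropped
            return False
        self.applied.append(action)
        return True

    def close_event(self) -> None:
        """Event closed: apply everything deferred while it was active."""
        self.active_critical_event = False
        self.applied.extend(self.pending)
        self.pending.clear()
```

The key property is that enforcement is deferred, never discarded: every action queued during the event is applied at close, preserving the commercial guarantee without interrupting an active hazard response.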

69.3 Phase 0 Governance Gates

Before Phase 1 implementation begins, the following must be complete:

  • Space-Track AUP architecture decision recorded in docs/adr/0016-space-track-aup-architecture.md
  • Cesium commercial licence executed and stored at legal/LICENCES/cesium-commercial.pdf
  • GitLab CI/CD authority confirmed in platform docs and reflected in .gitlab-ci.yml
  • contracts entitlement model and synchronisation path approved by Product, Legal, and Engineering
  • Redis trust split (redis_app / redis_worker) approved by Architecture, Security, and SRE

These are architectural commitment gates, not paperwork gates. If any remain open, implementation that would cement the affected design area is blocked.

69.4 Intervention Register

| Conflict | Sections affected | Intervention | Owner | Status |
|---|---|---|---|---|
| subscription_tier vs contracts authority | §16.1, §24, §68 | contracts made authoritative; org flags become derived cache | Product / Commercial | Accepted |
| GitHub Actions vs self-hosted GitLab | §26.9, §30.4, §30.7, delivery checklists | GitLab CI/CD designated authoritative | Platform | Accepted |
| Shared Redis vs accepted split-brain risk | §3.2, §3.3, §65, §67 | Redis split into app-state and worker-state trust domains | Architecture / Security | Accepted |
| Commercial enforcement during incidents | §9, §27.7, §34, §68 | Enforcement deferred during active TIP / CRITICAL event | Product / Operations | Accepted |
| HF progressive escalation vs safety urgency | §28.3, §60, §61 | Immediate-bypass matrix added for imminent-impact and integrity events | Safety case owner | Accepted |
| Non-root/container hardening vs renderer SYS_ADMIN | §3.3, §7.11 | Renderer documented as approved exception with tighter isolation | Security / Platform | Accepted |
| Implementation starting before legal/licence blockers close | §6, §19, §21, §29.11 | Blockers moved into Phase 0 governance gates | Programme owner | Accepted |