SpaceCom/docs/Master_Plan.md
Allissa Auld 266ee30d4b docs: align roadmap with tender-fit requirements
Plan aircraft-risk modelling, CCSDS RDM support, tender-grade replay validation, and ESA software assurance artefacts in the implementation and master plans.
2026-04-17 20:31:37 +02:00


SpaceCom Master Development Plan

1. Vision

SpaceCom is a dual-domain re-entry debris hazard analysis platform that bridges the space and aviation domains. It is built by space engineers and operates as two interconnected products sharing a common physics core.

Space Domain (upstream): A technical analysis platform for space operators, orbital analysts, and space agencies — providing decay prediction with full uncertainty quantification, conjunction screening, controlled re-entry corridor planning, and a programmatic API layer for integration with existing space operations systems.

Aviation Domain (downstream): An operational decision support tool for ANSPs, airspace managers, and incident commanders — translating space domain predictions into actionable aviation safety outputs: hazard corridors, FIR intersection analysis, NOTAM drafting assistance, multi-ANSP coordination, and plain-language uncertainty communication.

SpaceCom's strategic position is the interface layer between two domains that currently do not speak the same language. The aviation safety gap is the commercial differentiator and the most underserved operational need in the market. The space domain physics depth — numerical decay prediction, atmospheric density modelling, conjunction probability, and controlled re-entry planning — is the technical credibility that distinguishes SpaceCom from aviation software vendors with bolt-on orbital mechanics.

Positioning statement for procurement: "SpaceCom is the missing operational layer between space domain awareness and aviation domain action — built by space engineers, designed for the people who have to make decisions when something is coming down."

AI-assisted development policy (F11): SpaceCom uses AI coding assistants (currently Claude Code) in the development workflow. AGENTS.md at the repository root defines the boundaries and conventions for this use. Key constraints:

  • AI assistants may generate, refactor, and review code, and draft documentation
  • AI assistants may not make autonomous decisions about safety-critical algorithm changes, authentication logic, or regulatory compliance text — all such changes require human review and an approved PR with explicit reviewer sign-off
  • AI-generated code is subject to identical review and testing standards as human-authored code — there is no reduced scrutiny for AI-generated contributions
  • AI assistants must not be given production credentials, access to live Space-Track API keys, or personal data
  • For ESA procurement purposes: all code in the repository, regardless of how it was authored, is the responsibility of the named human engineers. AI assistance is a development tool, not a co-author with liability

This policy is stated explicitly because ESA and other public-sector procurement frameworks increasingly ask whether and how AI tools are used in safety-relevant software development.


2. What We Keep from the Existing Codebase

The prototype established several good foundational choices:

  • Docker Compose orchestration — frontend, backend, and database run as isolated containers with a single docker compose up
  • FastAPI backend — lightweight, async-ready Python API server; already serves CZML orbital data
  • TimescaleDB + PostGIS — time-series hypertables for orbit data and geographic types for footprints; the orbits hypertable and reentry_predictions polygon column are well-suited to the domain
  • CesiumJS globe — proven 3D geospatial viewer with CZML support, already rendering orbital tracks with OSM tiles
  • CZML as the orbital data interchange format — native to Cesium, supports time-dynamic position, styling, and labels
  • Schema tables: objects, orbits, conjunctions, reentry_predictions — solid starting point for the data model (see §9 for required expansions)
  • Worker service slot — the architecture already anticipates background data ingestion

3. Architecture

3.1 Layered Design

┌─────────────────────────────────────────────────────┐
│                   Frontend (Web)                     │
│   Next.js + TypeScript + CesiumJS + Deck.gl          │
│   httpOnly cookies · CSP · security headers          │
├─────────────────────────────────────────────────────┤
│              TLS Termination (Caddy/Nginx)           │
│              HTTPS + WSS only; HSTS preload          │
├─────────────────────────────────────────────────────┤
│                   API Gateway                        │
│   FastAPI · RBAC middleware · rate limiting          │
│   JWT (RS256) · MFA enforcement · audit logging     │
├─────────────────────────────────────────────────────┤
│                 Core Services                        │
│   Hazard Engine · Event Orchestrator · CZML Builder  │
│   Frame Transform Service · Space Weather Cache      │
│   HMAC integrity signing · Alert integrity guard     │
├─────────────────────────────────────────────────────┤
│         Computational Workers (isolated network)     │
│   Celery tasks: propagation, decay, Monte Carlo      │
│   Per-job CPU time limits · resource caps            │
├─────────────────────────────────────────────────────┤
│    Report Renderer (network-isolated container)      │
│   Playwright headless · no external network access   │
├─────────────────────────────────────────────────────┤
│            Data Layer (backend_net only)             │
│   TimescaleDB+PostGIS · Redis (AUTH+TLS)             │
│   MinIO (private buckets · pre-signed URLs)          │
└─────────────────────────────────────────────────────┘

3.2 Service Breakdown

| Service | Runtime | Responsibility | Tier 2 Spec | Tier 3 Spec |
|---|---|---|---|---|
| frontend | Next.js on Node 22 / Nginx static | Globe UI, dashboards, event timeline, simulation controls | 2 vCPU / 4 GB | 2× (load balanced) |
| backend | FastAPI on Python 3.12 | REST + WebSocket API, authentication, RBAC, request validation, CZML generation, HMAC signing | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB (blue-green) |
| worker-sim | Python 3.12 + Celery --queue=simulation --concurrency=16 --pool=prefork | MC decay prediction (chord sub-tasks), breakup, conjunction, controlled re-entry. Isolated from frontend network. | 2× 16 vCPU / 32 GB | 4× 16 vCPU / 32 GB |
| worker-ingest | Python 3.12 + Celery --queue=ingest --concurrency=2 | TLE polling, space weather, DISCOS, IERS EOP. Never competes with simulation queue. | 2 vCPU / 4 GB | 2× 2 vCPU / 4 GB (celery-redbeat HA) |
| renderer | Python 3.12 + Playwright | PDF report generation only. No external network access. Receives sanitised data from backend via internal API call only. | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB |
| db | TimescaleDB (PostgreSQL 17 + PostGIS) | Persistent storage. RLS policies enforced. Append-only triggers on audit tables. | 8 vCPU / 64 GB / 1 TB NVMe | Primary + standby: 8 vCPU / 128 GB each; Patroni failover |
| redis | Redis 7 | Broker + cache + celery-redbeat schedule. AUTH required. TLS in production. ACL users per service. | 2 vCPU / 8 GB | Redis Sentinel: 3× 2 vCPU / 8 GB |
| minio | MinIO (S3-compatible) | Object storage. All buckets private. Pre-signed URLs only. | 4 vCPU / 8 GB / 4 TB | Distributed: 4× 4 vCPU / 16 GB / 2 TB NVMe |
| etcd | etcd 3 | Patroni DCS (distributed configuration store) for DB leader election | — | 3× 1 vCPU / 2 GB |
| pgbouncer | PgBouncer 1.22 | Connection pooler between all application services and TimescaleDB. Transaction-mode pooling. Prevents connection count exceeding max_connections at Tier 3. Single failover target point for Patroni switchover. | 1 vCPU / 1 GB | 1 vCPU / 1 GB (updated by Patroni on failover) |
| prometheus | Prometheus 2.x | Metrics scraping from all services; recording rules; AlertManager rules | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| grafana | Grafana OSS | Four dashboards (§26.7); Loki + Tempo + Prometheus datasources | 1 vCPU / 2 GB | 1 vCPU / 2 GB |
| loki | Grafana Loki 2.9 | Log aggregation; queried by Grafana; Promtail ships container logs | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| promtail | Grafana Promtail 2.9 | Scrapes Docker json-file logs; labels by service; ships to Loki | 0.5 vCPU / 512 MB | 0.5 vCPU / 512 MB |
| tempo | Grafana Tempo | Distributed trace backend (Phase 2); OTLP ingest; queried by Grafana | 2 vCPU / 4 GB | — |

Horizontal Scaling Trigger Thresholds (F9 — §58)

Tier upgrades are not automatic — SpaceCom is VPS-based and requires deliberate provisioning. The following thresholds trigger a scaling review meeting (not an automated action). The responsible engineer creates a tracked issue within 5 business days.

| Metric | Threshold | Sustained for | Tier transition indicated |
|---|---|---|---|
| Backend CPU utilisation | > 70% | 30 min | Tier 1 → Tier 2 (add second backend instance) |
| spacecom_ws_connected_clients | > 400 | 1 hour | Tier 1 → Tier 2 (WS ceiling at 500; add second backend) |
| Celery simulation queue depth | > 50 | 15 min (no active event) | Add simulation worker instance |
| MC p95 latency | > 180 s (75% of 240 s SLO) | 3 consecutive runs | Add simulation worker instance |
| DB CPU utilisation | > 60% | 1 hour | Tier 2 → Tier 3 (read replica + Patroni) |
| DB disk used | > 70% of provisioned | — | Expand disk before hitting 85% |
| Redis memory used | > 60% of maxmemory | — | Increase maxmemory or add Redis instance |

Scaling decisions are recorded in docs/runbooks/capacity-limits.md with: metric value at decision time, decision made, provisioning timeline, and owner. This file is the authoritative capacity log for ESA and ANSP audits.
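The "sustained for" semantics above can be pinned down in code. A minimal sketch, assuming metrics arrive at a fixed sampling interval (e.g. one Prometheus scrape per minute); the function name and sampling model are illustrative, not part of the plan:

```python
def breach_sustained(samples: list[float], threshold: float, samples_needed: int) -> bool:
    """True when the trailing `samples_needed` samples all exceed `threshold`.

    With a 1-minute scrape interval, "backend CPU > 70% sustained for 30 min"
    becomes breach_sustained(cpu_samples, 70.0, 30). A True result opens a
    tracked scaling-review issue; it never triggers automated provisioning.
    """
    if len(samples) < samples_needed:
        return False  # not enough history to call the breach sustained
    return all(v > threshold for v in samples[-samples_needed:])
```

A single dip below the threshold resets the window, which is the intended behaviour: transient spikes during an active event should not open capacity issues.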

Redis ACL Definition

SpaceCom uses two Redis trust domains:

  • redis_app for sessions, rate limits, WebSocket delivery state, commercial-enforcement deferrals, and other application state where stronger consistency and tighter access separation are required
  • redis_worker for Celery broker/result traffic and ephemeral cache data, where limited inconsistency during failover is acceptable

This split is deliberate. It prevents worker-side compromise from reaching session state and avoids applying the distributed-systems split-brain risk acceptance for ephemeral workloads to user-session or entitlement-adjacent state.

Each Redis service gets its own ACL users with the minimum required key namespace:

# redis_app/acl.conf - bind-mounted into the application Redis container
# Backend: sole application client of redis_app; full keyspace here covers session tokens, rate-limit counters, and WebSocket tracking
user spacecom_backend on >${REDIS_BACKEND_PASSWORD} ~* &* +@all

# Disable unauthenticated default user
user default off

# redis_worker/acl.conf - bind-mounted into the worker Redis container
# Simulation worker: Celery broker/result namespaces only
user spacecom_worker on >${REDIS_WORKER_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Ingest worker: same scope as simulation worker
user spacecom_ingest on >${REDIS_INGEST_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Disable unauthenticated default user
user default off

Mount in docker-compose.yml:

redis_app:
  volumes:
    - ./redis_app/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...

redis_worker:
  volumes:
    - ./redis_worker/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...

Separate passwords (REDIS_BACKEND_PASSWORD, REDIS_WORKER_PASSWORD, REDIS_INGEST_PASSWORD) are defined in §30.3. Each rotates independently on the 90-day schedule. Redis Sentinel split-brain risk acceptance in §67 applies to redis_worker only; redis_app is treated as higher-integrity application state and is not covered by that acceptance.

3.3 Docker Compose Services and Network Segmentation

Services are assigned to isolated Docker networks. A compromised container on one network cannot directly reach services on another.

networks:
  frontend_net:   # frontend → backend only
  backend_net:    # backend → db, redis, minio, pgbouncer
  worker_net:     # worker → pgbouncer, redis, minio (no backend access; pgbouncer pools DB connections)
  renderer_net:   # backend → renderer only; renderer has no external egress
  db_net:         # db, pgbouncer, etcd — never exposed to frontend_net

services:
  frontend:    networks: [frontend_net]
  backend:     networks: [frontend_net, backend_net, renderer_net]  # +renderer_net: backend calls renderer API
  worker-sim:  networks: [worker_net]
  worker-ingest: networks: [worker_net]
  renderer:    networks: [renderer_net]   # backend-initiated calls only; no outbound to backend_net
  db:          networks: [backend_net, db_net]   # workers reach the DB via pgbouncer only
  pgbouncer:   networks: [backend_net, worker_net, db_net]  # pooling for both backend AND workers
  etcd:        networks: [db_net]
  redis:       networks: [backend_net, worker_net]
  minio:       networks: [backend_net, worker_net]

Network topology rules:

  • Workers connect to DB via pgbouncer:5432, not db:5432 directly — enforced by workers' DATABASE_URL env var pointing to PgBouncer.
  • The backend is on renderer_net so it can call renderer:8001; the renderer cannot initiate connections to backend_net.
  • db_net contains only TimescaleDB, PgBouncer, and etcd. No application service connects directly to this network except PgBouncer.

Container resource limits — without explicit limits a runaway simulation worker OOM-kills the database (Linux OOM killer targets the largest RSS consumer):

services:
  backend:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
        reservations: { memory: 512M }

  worker-sim:
    deploy:
      resources:
        limits: { cpus: '16.0', memory: 32G }
        reservations: { memory: 2G }
    stop_grace_period: 300s   # allows long MC jobs to finish before SIGKILL
    command: >
      celery -A app.worker worker
        --queue=simulation
        --concurrency=16
        --pool=prefork
        --without-gossip
        --without-mingle
        --max-tasks-per-child=100
    pids_limit: 64            # prefork: 16 children + Beat + parent + overhead

  worker-ingest:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 4G }
    stop_grace_period: 60s
    pids_limit: 16

  renderer:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
    pids_limit: 100           # Chromium spawns ~5 processes per render × concurrent renders
    tmpfs:
      - /tmp/renders:size=512m,mode=1777   # PDF scratch; never written to persistent layer
    environment:
      RENDER_OUTPUT_DIR: /tmp/renders

  db:
    deploy:
      resources:
        limits: { memory: 64G }   # explicit cap; prevents OOM killer targeting db

  redis:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 8G }

  minio:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }

Note: deploy.resources limits are honoured by Docker Compose v2 without Swarm mode (the Compose Specification adopted them from the 3.x file format). Verify with docker compose version ≥ 2.0.

All containers run as non-root users, with read-only root filesystems and dropped capabilities (see §7.10), except for the renderer container's documented SYS_ADMIN exception in §7.11. That exception is accepted only for the renderer, must never be copied to other services, and requires stricter network isolation and annual review.

Host Bind Mounts

All directories that operators need to access directly on the VPS — logs, generated exports, config, and backups — are bind-mounted from the host filesystem. This means no docker compose exec is required for routine operations: log tailing, reading generated files, editing config, or recovering a backup.

services:
  backend:
    volumes:
      - ./logs/backend:/app/logs          # structured JSON logs; tail directly on host
      - ./exports:/app/exports            # org export ZIPs, report PDFs
      - ./config/backend.toml:/app/config/settings.toml:ro  # edit on host; container reads

  worker-sim:
    volumes:
      - ./logs/worker-sim:/app/logs
      - ./exports:/app/exports            # shared export directory with backend

  worker-ingest:
    volumes:
      - ./logs/worker-ingest:/app/logs

  frontend:
    volumes:
      - ./logs/frontend:/app/logs

  db:
    volumes:
      - /data/postgres:/var/lib/postgresql/data   # DB data on host disk; survives container recreation
      - ./backups/db:/backups                      # pg_basebackup output directly accessible on host

  minio:
    volumes:
      - /data/minio:/data                          # object storage on host disk

Host-side directory layout (under /opt/spacecom/):

/opt/spacecom/
  logs/
    backend/          ← tail -f logs/backend/app.log
    worker-sim/
    worker-ingest/
    frontend/
  exports/            ← ls exports/ to see generated reports and org export ZIPs
  config/
    backend.toml      ← edit directly; restart backend container to apply
  backups/
    db/               ← pg_basebackup archives; rsync to offsite from here
/data/
  postgres/           ← TimescaleDB data files (outside /opt to avoid accidental compose down -v)
  minio/              ← MinIO object data

Key rules:

  • /data/postgres and /data/minio live outside the project directory so docker compose down -v cannot accidentally wipe them (Compose only removes named volumes, not bind-mounted host paths, but keeping them separate is an additional safeguard)
  • Log directories are created by make init-dirs before first docker compose up; containers write to them as a non-root user (UID 1000); host operator reads as the same UID or via sudo
  • Config files are mounted :ro (read-only) inside the container — a misconfigured backend cannot overwrite its own config
  • make logs SERVICE=backend is a convenience alias for tail -f /opt/spacecom/logs/backend/app.log

Port Exposure Map

| Port | Service | Exposed to | Notes |
|---|---|---|---|
| 80 | Caddy | Public internet | HTTP → HTTPS redirect only |
| 443 | Caddy | Public internet | TLS termination; proxies to backend/frontend |
| 8000 | Backend API | Internal (frontend_net) | Never directly internet-facing |
| 3000 | Frontend (Next.js) | Internal (frontend_net) | Caddy proxies; HMR port 3001 dev-only |
| 5432 | TimescaleDB | Internal (db_net) | Never exposed to frontend_net or host |
| 6379 | Redis | Internal (backend_net, worker_net) | AUTH required; no public exposure |
| 9000 | MinIO API | Internal (backend_net, worker_net) | Pre-signed URL access only from outside |
| 9001 | MinIO Console | Internal (db_net) | Never exposed publicly; admin use only |
| 5555 | Flower (Celery monitor) | Internal only | VPN/bastion access only in production |
| 2379/2380 | etcd (Patroni DCS) | Internal (db_net) | Never exposed outside db_net |

CI check: scripts/check_ports.py — parses docker-compose.yml and all docker-compose.*.yml overrides; fails if any port from the "never-exposed" category appears in a ports: mapping. Runs in every CI pipeline.
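A sketch of the check's core logic, using a deliberately naive line scanner rather than a YAML parser so it stays dependency-free in CI; the forbidden-port set and function names here are illustrative, and the port table above remains authoritative:

```python
import re

# Ports from the "never-exposed" category above (illustrative copy)
FORBIDDEN_HOST_PORTS = {8000, 3000, 5432, 6379, 9000, 9001, 5555, 2379, 2380}

def published_host_ports(compose_text: str) -> set[int]:
    """Collect host ports published via `ports:` mappings like '- "5432:5432"'."""
    ports: set[int] = set()
    in_ports, indent = False, 0
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped == "ports:":
            in_ports, indent = True, len(line) - len(line.lstrip())
            continue
        if in_ports:
            if stripped.startswith("-") and len(line) - len(line.lstrip()) > indent:
                m = re.match(r'-\s*"?(\d+):\d+', stripped)
                if m:
                    ports.add(int(m.group(1)))
            elif stripped:  # dedented back out of the ports list
                in_ports = False
    return ports

def violations(compose_text: str) -> list[int]:
    """Non-empty result fails the CI job."""
    return sorted(published_host_ports(compose_text) & FORBIDDEN_HOST_PORTS)
```

The real script must also walk every docker-compose.*.yml override, since an override file can add a ports: mapping the base file never had.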

Infrastructure-Level Egress Filtering

Docker's built-in iptables rules prevent inter-network lateral movement but do not restrict egress to the public internet from within a network. An egress filtering layer is mandatory at Tier 2 and Tier 3.

Allowed outbound destinations (whitelist):

| Service | Allowed destination | Protocol | Purpose |
|---|---|---|---|
| ingest_worker | www.space-track.org | HTTPS/443 | TLE / conjunction data |
| ingest_worker | services.swpc.noaa.gov | HTTPS/443 | Space weather |
| ingest_worker | discosweb.esac.esa.int | HTTPS/443 | DISCOS object catalogue |
| ingest_worker | celestrak.org | HTTPS/443 | TLE cross-validation |
| ingest_worker | iers.org | HTTPS/443 | EOP download |
| backend | SMTP relay (org-internal) | SMTP/587 | Alert email |
| All containers | Internal Docker networks | Any | Normal operation |
| All containers | All other destinations | Any | BLOCKED |

Implementation: UFW or nftables rules on host (Tier 2); network policy + Calico/Cilium (Tier 3 Kubernetes migration); explicit allow-list in docs/runbooks/egress-filtering.md. Violations logged at WARN; repeated violations at CRITICAL.
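Host-level filtering is the enforcement layer; as defence in depth, ingest code can also refuse to construct requests to hosts outside the whitelist. A hedged sketch: `assert_egress_allowed` and the dict layout are illustrative, and the authoritative allow-list lives in docs/runbooks/egress-filtering.md:

```python
from urllib.parse import urlparse

# Mirror of the whitelist table above (illustrative copy; the runbook is authoritative)
ALLOWED_EGRESS: dict[str, set[str]] = {
    "ingest_worker": {
        "www.space-track.org",
        "services.swpc.noaa.gov",
        "discosweb.esac.esa.int",
        "celestrak.org",
        "iers.org",
    },
}

def assert_egress_allowed(service: str, url: str) -> None:
    """Raise before any outbound request to a non-whitelisted destination."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_EGRESS.get(service, set()):
        # The host firewall would block this anyway; failing here gives a clean log line
        raise PermissionError(f"egress blocked: {service} -> {host or '<no host>'}")
```

Calling this immediately before each HTTP request means a mis-merged ingest URL fails loudly in tests rather than silently at the firewall.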


4. Coordinate Frames and Time Systems

This section is non-negotiable infrastructure. Silent frame mismatches invalidate all downstream computation. All developers must understand and implement the conventions below before writing any propagation or display code.

4.1 Reference Frame Pipeline

TLE input
   │
   ▼ sgp4 library propagation
TEME (True Equator Mean Equinox)     ← SGP4 native output; do NOT store as final product
   │
   ▼ IAU 2006 precession-nutation (or Vallado TEME→J2000 simplification)
GCRF / J2000 (Geocentric Celestial Reference Frame)
   │                │
   │                ▼ CZML INERTIAL frame ← CesiumJS expects GCRF/ICRF, not TEME
   │
   ▼ IAU Earth Orientation Parameters (EOP): IERS Bulletin A/B
ITRF (International Terrestrial Reference Frame)   ← Earth-fixed; use for database storage
   │
   ▼ WGS84 geodetic transformation
Latitude / Longitude / Altitude     ← For display, hazard zones, airspace intersections

Implementation: Use astropy (astropy.coordinates, astropy.time) for all frame conversions. It handles IERS EOP download and interpolation automatically. For performance-critical batch conversions, pre-load EOP tables and vectorise.

4.2 CesiumJS Frame Convention

  • CZML position with referenceFrame: "INERTIAL" expects ICRF/J2000 Cartesian coordinates in metres
  • SGP4 outputs are in TEME and must be rotated to J2000 before being written into CZML
  • CZML position with referenceFrame: "FIXED" expects ITRF Cartesian in metres
  • Never pipe raw TEME coordinates into CesiumJS

4.3 Time System Conventions

| System | Where Used | Notes |
|---|---|---|
| UTC | System-wide reference. All API timestamps, database timestamps, CZML epochs | Convert immediately at ingestion boundary |
| UT1 | Earth rotation angle for ITRF↔GCRF conversion | UT1-UTC offset from IERS EOP |
| TT (Terrestrial Time) | astropy internal; precession-nutation models | ~69 s ahead of UTC |
| TLE epoch | Encoded in TLE line 1 as year + day-of-year fraction | Parse to UTC immediately |
| GPS time | May appear in precision ephemeris products | GPS = UTC + 18 s as of 2024 |

Rule: Store all timestamps as TIMESTAMPTZ in UTC. Convert to local time only at presentation boundaries.
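The TLE-epoch convention above ("parse to UTC immediately") is small enough to pin down in code. A sketch using only the standard library; the function name and the usual 57-year pivot for two-digit years are stated assumptions:

```python
from datetime import datetime, timedelta, timezone

def tle_epoch_to_utc(epoch_field: str) -> datetime:
    """Convert a TLE line-1 epoch field ('YYDDD.DDDDDDDD') to a UTC datetime.

    Two-digit years pivot at 57 (the common TLE convention: 57-99 -> 19xx,
    00-56 -> 20xx). Day-of-year is 1-based, so day 1.0 is Jan 1 00:00 UTC.
    """
    yy = int(epoch_field[:2])
    year = 2000 + yy if yy < 57 else 1900 + yy
    day_of_year = float(epoch_field[2:])
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1.0)
```

Converting at the ingestion boundary, as the rule requires, means no TLE-native epoch ever reaches the database or API layer.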

4.4 Coordinate Reference System Contract (F1 — §62)

The CRS used at every system boundary is documented in docs/COORDINATE_SYSTEMS.md. This is the authoritative single-page reference for any engineer writing frame conversion code.

| Boundary | CRS | Format | Notes |
|---|---|---|---|
| SGP4 output | TEME (True Equator Mean Equinox) | Cartesian metres | Must not leave physics/ without conversion |
| Physics → CZML builder | GCRF/J2000 | Cartesian metres | Explicit teme_to_gcrf() call |
| CZML position (INERTIAL) | GCRF/J2000 | Cartesian metres | referenceFrame: "INERTIAL" |
| CZML position (FIXED) | ITRF | Cartesian metres | referenceFrame: "FIXED" |
| Database storage (orbits) | GCRF/J2000 | Cartesian metres | Consistent with CZML inertial |
| Corridor polygon (DB) | WGS-84 geographic | GEOGRAPHY(POLYGON) SRID 4326 | Geodetic lat/lon from ITRF→WGS-84 |
| FIR boundary (DB) | WGS-84 geographic | GEOMETRY(POLYGON, 4326) | Planar approx. for regional FIRs |
| API response | WGS-84 geographic | GeoJSON (EPSG:4326) | Degrees; always lon,lat order (GeoJSON spec) |
| Globe display (CesiumJS) | ICRF (= GCRF for practical purposes) | Cartesian metres via CZML | CesiumJS handles geodetic display |
| Altitude display | WGS-84 ellipsoidal | km or ft (user preference) | See §4.4a for datum labelling |

Antimeridian and pole handling (F5 — §62):

  • Antimeridian: Corridor polygons stored as GEOGRAPHY handle antimeridian crossing correctly — PostGIS GEOGRAPHY uses spherical arithmetic and does not wrap coordinates. CesiumJS CZML polygon positions must be expressed as a continuous polyline; for antimeridian-crossing corridors, the CZML serialiser must not clamp coordinates to ±180° — pass the raw ITRF→geodetic output. CesiumJS handles coordinate wrapping internally when referenceFrame: "FIXED" is used for corridor polygons.
  • Polar orbits: For objects with inclination > 80°, the ground track corridor may approach or cross the poles. ST_AsGeoJSON on a GEOGRAPHY polygon that passes within ~1° of a pole can produce degenerate output (longitude undefined at the pole itself). Mitigation: before storing, check ST_DWithin(corridor, ST_GeogFromText('SRID=4326;POINT(0 90)'), 111000) (within 1° of north pole) or south pole equivalent — if true, log a POLAR_CORRIDOR_WARNING and clip the polygon to 89.5° max latitude. This is a rare case (ISS incl. 51.6°; most rocket bodies are below 75° incl.) but must not crash the pipeline.
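The pole mitigation above runs in SQL via ST_DWithin; an equivalent application-side guard for the serialiser path can be sketched as follows (the function name, threshold constant, and vertex representation are illustrative):

```python
POLAR_CLIP_LAT_DEG = 89.5  # matches the 89.5° max-latitude clip described above

def clip_polar_corridor(vertices: list[tuple[float, float]], log_warning=print):
    """Clamp (lon, lat) corridor vertices that approach a pole.

    Longitude is undefined at the pole itself, so vertices beyond +/-89.5°
    latitude are clamped and a POLAR_CORRIDOR_WARNING is emitted; the
    pipeline degrades gracefully instead of crashing on degenerate geometry.
    """
    clipped, polar = [], False
    for lon, lat in vertices:
        if abs(lat) > POLAR_CLIP_LAT_DEG:
            polar = True
            lat = max(-POLAR_CLIP_LAT_DEG, min(POLAR_CLIP_LAT_DEG, lat))
        clipped.append((lon, lat))
    if polar:
        log_warning("POLAR_CORRIDOR_WARNING: corridor clipped to ±89.5° latitude")
    return clipped
```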

docs/COORDINATE_SYSTEMS.md is a Phase 1 deliverable. Tests in tests/test_frame_utils.py serve as executable verification of the contract.

4.5 Implementation Checklist

  • frame_utils.py: teme_to_gcrf(), gcrf_to_itrf(), itrf_to_geodetic()
  • Unit tests against Vallado 2013 reference cases
  • EOP data auto-refresh: weekly Celery task pulling IERS Bulletin A; verify SHA-256 checksum of downloaded file before applying
  • CZML builder uses gcrf_to_czml_inertial() — explicit function, never implicit conversion
  • docs/COORDINATE_SYSTEMS.md committed with CRS boundary table
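One way to make the CRS contract executable inside frame_utils.py is to tag every state vector with its frame and refuse implicit conversions at serialisation boundaries. A sketch under stated assumptions: Frame, StateVector, and to_czml_inertial() are illustrative names, not items from the checklist above:

```python
from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    TEME = "TEME"   # SGP4 native output; must never reach CZML or the DB
    GCRF = "GCRF"   # J2000 inertial; CZML INERTIAL and orbit storage
    ITRF = "ITRF"   # Earth-fixed; CZML FIXED and geodetic conversion

@dataclass(frozen=True)
class StateVector:
    frame: Frame
    x: float  # metres
    y: float
    z: float

class FrameError(Exception):
    """Raised when a position in the wrong frame reaches a serialisation boundary."""

def to_czml_inertial(state: StateVector) -> list[float]:
    """CZML referenceFrame 'INERTIAL' positions must be GCRF/J2000."""
    if state.frame is not Frame.GCRF:
        raise FrameError(f"CZML INERTIAL requires GCRF, got {state.frame.value}")
    return [state.x, state.y, state.z]
```

With this shape, "never pipe raw TEME into CesiumJS" becomes a unit-testable property rather than a convention that depends on reviewer vigilance.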

5. User Personas

All UX decisions are traceable to one of the six personas (A–F) defined here. Navigation, default views, information hierarchy, and alert behaviour must serve user tasks — not the system's internal module structure.

Persona A — Operational Airspace Manager

Role: ANSP or aviation authority staff. Responsible for airspace safety decisions in real-time or near-real-time.

Primary question: "Is any airspace under my responsibility affected in the next 6–12 hours, and what do I need to do about it?"

Key needs: Immediate situational awareness, clear go/no-go spatial display for their region, alert acknowledgement workflow, one-click advisory export, minimal cognitive load.

Tolerance for complexity: Very low.


Persona B — Safety Analyst

Role: Space agency, authority research arm, or consultancy. Conducts detailed re-entry risk assessments for regulatory submissions or post-event reports.

Primary question: "What is the full uncertainty envelope, what assumptions drove the prediction, and how does this compare to previous similar events?"

Key needs: Full simulation parameter access, run comparison, numerical uncertainty detail, full data provenance, configurable report generation, historical replay.

Tolerance for complexity: High.


Persona C — Incident Commander

Role: Senior official coordinating response during an active re-entry event. Uses the platform as a shared situational awareness tool in a briefing room.

Primary question: "Where exactly is it coming down, when, and what is the worst-case affected area right now?"

Key needs: Clean large-format display, auto-narrowing corridor updates, countdown timer, plain-language status summary, shareable live-view URL.

Tolerance for complexity: Low.


Persona D — Systems Administrator / Data Manager

Role: Technical operator managing system health, data ingest, model configuration, and user accounts.

Primary question: "Is everything ingesting correctly, are data sources healthy, and are workers keeping up?"

Key needs: System health dashboard, ingest job status, worker queue metrics, model version management, user and role management.

Tolerance for complexity: High technical tolerance.


Persona E — Space Operator

Role: Satellite or launch vehicle operator responsible for one or more objects in the SpaceCom catalog. May be a commercial operator, a national space agency operating assets, or a launch service provider managing spent upper stages.

Primary question: "What is the current decay prediction for my objects, when do I need to act, and if I have manoeuvre capability, what deorbit window minimises ground risk?"

Key needs: Object-scoped view showing only their registered objects; decay prediction with full Monte Carlo detail; controlled re-entry corridor planner (for objects with remaining propellant); conjunction alert for their own objects; API key management for programmatic integration with their own operations centre; exportable predictions for regulatory submission under national space law.

Tolerance for complexity: High — these are trained orbital engineers, not ATC professionals.

Regulatory context: Many space operators have legal obligations under national space law (e.g., Australia Space (Launches and Returns) Act 2018, FAA AST licensing) to demonstrate responsible end-of-life management. SpaceCom outputs serve as supporting evidence for those submissions. The platform must produce artefacts suitable for regulatory audit.


Persona F — Orbital Analyst

Role: Technical analyst at a space agency, research institution, safety consultancy, or the SSA/STM office of a national authority. Conducts orbital analysis, validates predictions, and produces technical assessments — potentially across the full catalog, not just owned objects.

Primary question: "What does the full orbital picture look like for this object class, how do SpaceCom predictions compare to other tools, and what are the statistical properties of the prediction ensemble?"

Key needs: Full catalog read access; conjunction screening across arbitrary object pairs; simulation parameter tuning and comparison; bulk export (CSV, JSON, CCSDS formats); access to raw propagation outputs (state vectors, covariance matrices); historical validation runs; API access for batch processing.

Tolerance for complexity: Very high — this persona builds the technical evidence base that other personas act on.


6. UX Design Specification

This section translates engineering capability into concrete interface designs. All designs are persona-linked and phase-scheduled.

6.1 Information Architecture — Task-Based Navigation

Navigation is organised around user tasks, not backend modules. Module names never appear in the UI.

The platform has two navigation domains — Aviation (default for Persona A/B/C) and Space (for Persona E/F). Both are accessible from the top navigation. The root route (/) defaults to the domain matched to the user's role on login.

Aviation Domain Navigation:

/                   → Operational Overview       (Persona A, C primary)
/watch/{norad_id}   → Object Watch Page          (Persona A, B)
/events             → Active Events + Timeline   (Persona A, C)
/events/{id}        → Event Detail               (Persona A, B, C)
/airspace           → Airspace Impact View       (Persona A)
/analysis           → Analyst Workspace          (Persona B primary)
/catalog            → Object Catalog             (Persona B)
/reports            → Report Management          (Persona A, B)
/admin              → System Administration      (Persona D)

Space Domain Navigation:

/space                        → Space Operator Overview      (Persona E, F primary)
/space/objects                → My Objects Dashboard         (Persona E — owned objects only)
/space/objects/{norad_id}     → Object Technical Detail      (Persona E, F)
/space/reentry/plan           → Controlled Re-entry Planner  (Persona E)
/space/conjunction            → Conjunction Screening        (Persona F)
/space/analysis               → Orbital Analyst Workspace    (Persona F)
/space/export                 → Bulk Export                  (Persona F)
/space/api                    → API Keys + Documentation     (Persona E, F)

The 3D globe is a shared component embedded within pages, not a standalone page. Different pages focus and configure the globe differently.


6.2 Operational Overview Page (/)

Landing page for Persona A and C. Loads immediately without configuration.

Layout:

┌─────────────────────────────────────────────────────────────────┐
│  [● LIVE]  SpaceCom    [Space Weather: ELEVATED ▲]  [Alerts: 2] │
├──────────────────────────────┬──────────────────────────────────┤
│                              │  ACTIVE EVENTS                   │
│    3D GLOBE                  │  ● CZ-5B R/B  44878             │
│    (active events +          │    Window: 08h – 20h from now   │
│     affected FIRs only)      │    Most likely ~14h from now     │
│                              │    YMMM FIR — HIGH               │
│                              │    [View] [Corridor]             │
│                              │  ─────────────────────────────   │
│                              │  ○ SL-16 R/B  28900             │
│                              │    Window: 54h–90h from now     │
│                              │    Most likely ~72h from now     │
│                              │    Ocean — LOW                   │
│                              │                                  │
│                              │  72-HOUR TIMELINE                │
│                              │  [Gantt strip]                   │
│                              │                                  │
│                              │  SPACE WEATHER                   │
│                              │  Activity: ELEVATED              │
│                              │  Extend window: add ≥2h buffer   │
├──────────────────────────────┴──────────────────────────────────┤
│  [● Live]  ──────────●──────────────────────────────  +72h      │
└─────────────────────────────────────────────────────────────────┘

Globe default state: Active decay objects and their corridors only. All other objects hidden. Affected FIR boundaries highlighted. No orbital tracks unless the user expands an event card.

Temporal uncertainty display — Persona A/C: Event cards and the Operational Overview show window ranges in plain language (Window: 08h–20h from now / Most likely ~14h from now), never ± N notation. The ± form implies symmetric uncertainty, which re-entry distributions are not. The Analyst Workspace (Persona B) additionally shows raw p05/p50/p95 UTC times.
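The plain-language rendering described above can be sketched as follows. The function name and the whole-hour rounding are illustrative assumptions — the spec only fixes the wording:

```python
from datetime import datetime, timezone

def plain_language_window(p05: datetime, p50: datetime, p95: datetime,
                          now: datetime) -> dict:
    """Render an asymmetric re-entry window as hours-from-now strings,
    never as a symmetric "plus/minus N" figure."""
    def hours_from_now(t: datetime) -> int:
        # Whole-hour rounding is an assumption, not part of the spec.
        return round((t - now).total_seconds() / 3600)

    return {
        "window": f"Window: {hours_from_now(p05):02d}h–{hours_from_now(p95):02d}h from now",
        "most_likely": f"Most likely ~{hours_from_now(p50)}h from now",
    }
```

With a p05 8 hours out, p50 14 hours out, and p95 20 hours out, this reproduces the card text in the Operational Overview mock-up.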


6.3 Time Navigation System

Three modes — always visible, always unambiguous. Mixing modes without explicit user intent is prohibited.

| Mode | Indicator | Description |
| --- | --- | --- |
| LIVE | Green pulsing pill: ● LIVE | Current real-world state. Globe and predictions update from live feeds. |
| REPLAY | Amber pill: ⏪ REPLAY 2024-01-14 03:22 UTC | Replaying a historical event. All data fixed. No live updates. |
| SIMULATION | Purple pill: ⚗ SIMULATION — [object name] | Custom scenario. Data is synthetic. Must never be confused with live. |

The mode indicator is persistent in the top nav bar. Switching modes requires explicit action through a mode-switch dialogue — it cannot happen implicitly.

Mode-switch dialogue specification:

When the user initiates a mode switch (e.g., LIVE → SIMULATION), the following modal must appear. The dialogue must explicitly state the current mode, the target mode, and all operational consequences:

SWITCH TO SIMULATION MODE?
──────────────────────────────────────────────────────────────
You are currently viewing LIVE data.
Switching to SIMULATION will display synthetic scenario data.

  ⚠ Alerts and notifications are suppressed in SIMULATION.
  ⚠ Simulation data must never be used for operational decisions.
  ⚠ Other users will not see your simulation.

[Cancel]                          [Switch to Simulation ▶]
──────────────────────────────────────────────────────────────

Rules:

  • Cancel on left, destructive action on right (consistent with aviation HMI conventions)
  • The dialogue must always show both the current mode and target mode — never just "are you sure?"
  • Equivalent dialogues apply for all mode transitions (LIVE ↔ REPLAY, LIVE ↔ SIMULATION, etc.)

Simulation mode block during active alerts: If the organisation has disable_simulation_during_active_events enabled (admin setting, default: off), the SIMULATION mode switch is blocked whenever there are unacknowledged CRITICAL or HIGH alerts. A modal replaces the switch dialogue:

CANNOT ENTER SIMULATION
──────────────────────────────────────────────────────────────
2 active CRITICAL alerts require acknowledgement.
Acknowledge all active alerts before running simulations.

[View active alerts]                              [Cancel]
──────────────────────────────────────────────────────────────

Document disable_simulation_during_active_events prominently in the admin UI: "Enable only if your organisation has a dedicated SpaceCom monitoring role separate from simulation users."

Timeline control — two zoom levels:

  • Event scale (default): 72 hours, 6-hour intervals. Re-entry windows shown as coloured bars.
  • Orbital scale: 4-hour window, 15-minute intervals. For orbital passes and conjunction events.

LIVE mode scrub: User can drag the playhead into the future to preview a predicted corridor. A "Return to Live" button appears whenever the playhead is not at current time.

Future-preview temporal wash: When the timeline playhead is not at current time (user is previewing a future state), the entire right-panel event list and alert badges are overlaid with a temporal wash (semi-transparent grey overlay) and a persistent label:

┌──────────────────────────────────────────────────────────────┐
│  ⏩ PREVIEWING  +4h 00m — not current state  [Return to Live] │
└──────────────────────────────────────────────────────────────┘

The wash and label prevent a controller from acting on predicted-future data as though it were current. The globe corridor may show the projected state; the event list must be visually distinct. Alert badges are greyed and annotated "(projected)" in preview mode. Alert sounds and notifications are suppressed while previewing.


6.4 Uncertainty Visualisation — Three Phased Modes

Three representations are planned across phases. All are user-selectable via the UncertaintyModeSelector once implemented. Each page context has a recommended default.

Mode selector (appears in the layer controls panel whenever corridor data is loaded):

Corridor Display
● Percentile Corridors    ← Phase 1
○ Probability Heatmap     ← Phase 2
○ Monte Carlo Particles   ← Phase 3

Modes B and C appear greyed in the selector until their phase ships.


Mode A — Percentile Corridors (Phase 1, default for Persona A/C)

What it shows: Three nested polygon swaths on the globe — 5th, 50th, and 95th percentile ground track corridors from Monte Carlo output.

Visual encoding:

  • 95th percentile: wide, 15% opacity amber fill, dashed border — hazard extent
  • 50th percentile: medium, 35% opacity amber fill, solid border — nominal corridor
  • 5th percentile: narrow, 60% opacity amber fill, bold border — high-probability core

Colour by risk level: Ocean-only → blue family; partial land → amber; significant land → red-orange.

Over time: As the re-entry window narrows, the outer swath contracts automatically in LIVE mode. The user watches the corridor "tighten" in real-time.


Mode B — Probability Heatmap (Phase 2, default for Persona B)

What it shows: Continuous colour-ramp Deck.gl heatmap. Each cell's colour encodes probability density of ground impact across the full Monte Carlo sample set.

Visual encoding: Perceptually uniform, colour-blind-safe sequential palette (viridis or custom blue-white-orange). Scale normalised to the maximum probability cell; legend with percentile labels always shown.

Interaction: Hover a cell → tooltip shows "~N% probability of impact within this 50×50 km cell." The heatmap is recomputed client-side if the user adjusts the re-entry window bounds via the timeline.
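The per-cell probability behind the tooltip can be derived directly from the Monte Carlo sample set. A simplified lat/lon binning sketch (names are illustrative; a real implementation would bin in a projected 50×50 km grid, since longitude degree width varies with latitude):

```python
from collections import Counter

CELL_KM = 50.0
KM_PER_DEG = 111.0  # rough km per degree of latitude; fine for a sketch

def cell_probability(samples: list) -> dict:
    """Bin MC impact points (lat, lon) into ~50 km cells and return
    per-cell impact probability = sample count / total samples."""
    deg = CELL_KM / KM_PER_DEG
    counts = Counter((int(lat // deg), int(lon // deg)) for lat, lon in samples)
    total = len(samples)
    return {cell: n / total for cell, n in counts.items()}
```

Per-cell probabilities sum to 1 over the full sample set, so recomputing after the user narrows the window bounds just means rebinning the filtered samples.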


Mode C — Monte Carlo Particle Visualisation (Phase 3, Persona B advanced / Persona C briefing)

What it shows: 50–200 animated MC sample trajectory lines converging from re-entry interface altitude (~80 km) to impact. Particle colour encodes F10.7 assumption (cool = low solar activity = later re-entry, warm = high). Impact points persist as dots.

Interaction: Play/pause animation; scrub to any point in the trajectory; click a particle to see its parameter set (F10.7, Ap, B*).

Performance: Use CesiumJS Primitive API with per-instance colour attributes — not Entity API. Trajectory geometry pre-baked server-side and streamed as binary format (/viz/mc-trajectories/{prediction_id}). Never compute trajectories in the browser.

Not the default for Persona A — the animation can be alarming without quantitative context.

Weighted opacity: Particles render with opacity proportional to their sample weight, not uniform opacity. This visually down-weights outlier trajectories so that low-probability high-consequence paths do not visually dominate.
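A minimal sketch of the weight-proportional opacity mapping; the min/max alpha clamp values are illustrative assumptions, chosen so outliers stay faintly visible rather than vanishing:

```python
def particle_opacities(weights: list,
                       min_alpha: float = 0.05,
                       max_alpha: float = 0.9) -> list:
    """Opacity proportional to sample weight, normalised to the heaviest
    sample and clamped to [min_alpha, max_alpha]."""
    w_max = max(weights)
    return [max(min_alpha, min(max_alpha, max_alpha * w / w_max))
            for w in weights]
```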

Mandatory first-use overlay: When Mode C is first enabled (per user, tracked in user preferences), a one-time overlay appears before the animation starts:

MONTE CARLO PARTICLE VIEW
──────────────────────────────────────────────────────────────
Each animated line shows one possible re-entry scenario sampled
from the prediction distribution. Colour encodes the solar
activity assumption used for that sample.

These are not equally likely outcomes — particle opacity
reflects sample weight. For operational planning, the
Percentile Corridors view (Mode A) gives a more reliable
summary.

[Understood — show animation]
──────────────────────────────────────────────────────────────

The overlay is dismissed permanently per user on first acknowledgement. It cannot be bypassed — the animation does not play until the user explicitly acknowledges.


6.5 Globe Information Hierarchy and Layer Management

Default view state: Active decay objects and their corridors, FIR boundaries for affected regions. "Show everything" is never the default.

Layer management panel:

LAYERS
────────────────────────────────────────
Objects
  ☑ Active decay objects (TIP issued)
  ☑ Decaying objects (perigee < 250 km)
  ☐ All tracked payloads
  ☐ Rocket bodies
  ☐ Debris catalog

Orbital Tracks
  ☐ Ground tracks (selected object only)
  ☐ All objects — [!] performance warning

Predictions & Corridors
  ☑ Re-entry corridors (active events)
  ☐ Re-entry corridors (all predicted)
  ☐ Fragment impact points
  ☐ Conjunction geometry

Airspace (Phase 2)
  ☐ FIR / UIR boundaries
  ☐ Controlled airspace
  ☐ Affected sectors (hazard intersection)

Reference
  ☐ Population density grid
  ☐ Critical infrastructure
────────────────────────────────────────
Corridor Display:  [Percentile ▾]

Layer state persists to localStorage per session. Shared URLs encode active layer state in query parameters.

Object clustering: At camera altitudes above 5,000 km, objects cluster. The cluster badge shows the object count and the highest urgency level present. Clusters expand below 2,000 km.

Altitude-aware clustering rule (F8 — §62): Objects at different altitudes with the same ground-track sub-point are not co-located — they have different re-entry windows and different hazard profiles. Two objects that share a 2D screen position but differ by > 100 km in altitude must not be merged into a single cluster. Implementation rule: CesiumJS EntityCluster clustering is disabled for any object with reentry_predictions showing a window < 30 days (i.e., any decay-relevant object in the watch/alert state). Objects in the normal catalog (window > 30 days) may continue to use screen-space clustering. This prevents the pathological case where a TIP-active object at 200 km is merged into a cluster with a nominal object at 500 km that shares its ground track, making the TIP object invisible in the cluster badge.
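The eligibility rule above reduces to a small predicate. A sketch, with illustrative names — the real check would read the object's reentry_predictions record:

```python
from datetime import datetime, timedelta, timezone

def clustering_allowed(reentry_window_end, now: datetime,
                       threshold_days: int = 30) -> bool:
    """Screen-space clustering is permitted only for objects with no
    decay relevance: prediction absent, or re-entry window more than
    `threshold_days` away. Decay-relevant objects render individually."""
    if reentry_window_end is None:
        return True  # normal catalog object, no re-entry prediction
    return (reentry_window_end - now) > timedelta(days=threshold_days)
```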

Urgency / Priority Visual Encoding (colour-blind-safe — shape distinguishes as well as colour):

| State | Symbol | Colour | Meaning |
| --- | --- | --- | --- |
| TIP issued, window < 6h | ◆ filled diamond | Red #D32F2F | Imminent re-entry |
| TIP issued, window 6–24h | ◇ outlined diamond | Orange #E65100 | Active threat |
| Predicted decay, window < 7d | ▲ triangle | Amber #F9A825 | Elevated watch |
| Decaying, window > 7d | ● circle | Yellow-grey | Monitor |
| Conjunction Pc > 1:1000 | ✕ cross | Purple #6A1B9A | Conjunction risk |
| Normal tracked | · dot | Grey #546E7A | Catalog |

Never use red/green as the sole distinguishing pair.


6.6 Alert System UX

Alert taxonomy:

| Level | Trigger | Visual Treatment | Requires Acknowledgement? |
| --- | --- | --- | --- |
| CRITICAL | TIP issued, window < 6h, hazard intersects active FIR | Full-width banner (red), audio tone (ops room mode) | Yes — named user; timestamp + note recorded |
| HIGH | Window < 24h, conjunction Pc > 1:1000 | Persistent badge (orange) | Yes — dismissal recorded |
| MEDIUM | New TIP issued (any), window < 7d, new CDM | Toast (amber), 8s auto-dismiss | No — logged |
| LOW | New TLE ingested, space weather index change | Notification centre only | No |

Alert fatigue mitigation:

  • Mute rules: per-user, per-session LOW suppression
  • Geographic filtering: alerts scoped to user's configured FIR list
  • Deduplication: window shrinks that don't cross a threshold do not re-trigger
  • Rate limit: same trigger condition cannot produce more than 1 CRITICAL alert per object per 4-hour window without a manual operator reset
  • Alert generation triggered only by backend logic on verified data — never by direct API call from a client
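The per-object CRITICAL rate limit can be implemented as a small backend guard. A sketch — the state shape (an in-memory dict and reset set) and names are assumptions; production state would live in the database:

```python
from datetime import datetime, timedelta, timezone

RATE_LIMIT_WINDOW = timedelta(hours=4)

def may_emit_critical(last_critical: dict, norad_id: int,
                      now: datetime, manual_resets: set) -> bool:
    """Allow at most one CRITICAL alert per object per 4-hour window,
    unless an operator has issued a manual reset for that object."""
    if norad_id in manual_resets:
        manual_resets.discard(norad_id)   # a reset is single-use
        last_critical[norad_id] = now
        return True
    prev = last_critical.get(norad_id)
    if prev is not None and now - prev < RATE_LIMIT_WINDOW:
        return False
    last_critical[norad_id] = now
    return True
```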

Ops room workload buffer (OPS_ROOM_SUPPRESS_MINUTES): An optional per-organisation setting (default: 0 — disabled). When set to N > 0, CRITICAL alert full-screen banners are queued for up to N minutes before display. The top-nav badge increments immediately so peripheral attention is captured; only the full-screen interrupt is deferred. This matches FAA AC 25.1329 alert prioritisation philosophy: acknowledge at a glance, act when workload permits. Must be documented in the admin UI with a mandatory warning: "Only enable if your operations room has a dedicated SpaceCom monitoring role. If a single controller manages all alerts, suppression introduces delay that may be safety-significant."

Audio alert specification:

  • Trigger: CRITICAL alert only (no audio for HIGH or lower)
  • Sound: two-tone ascending chime pattern (not a siren — ops rooms have sirens from other systems)
  • Behaviour: plays once on alert display; does not loop; stops on alert acknowledgement (not just banner dismiss)
  • Volume: configurable per-device (default 50% system volume); mutable by operator per-session
  • Ops room mode: organisation-level setting that enables audio (default: off; requires explicit activation)

Alert storm detection: If the system generates > 5 CRITICAL alerts within 1 hour across all objects, generate a meta-alert to Persona D. The meta-alert presents a disambiguation prompt rather than a bare count:

[META-ALERT — ALERT VOLUME ANOMALY]
──────────────────────────────────────────────────────────────
5 CRITICAL alerts generated within 1 hour.

This may indicate:
  (a) Multiple genuine re-entry events — verify via Space-Track
      independently before taking operational action.
  (b) System integrity issue — check ingest pipeline and data
      source health for signs of false data injection.

[Open /admin health dashboard →]   [View all CRITICAL alerts →]
──────────────────────────────────────────────────────────────

Acknowledgement workflow:

CRITICAL acknowledgement requires two steps to prevent accidental confirmation:

Step 1 — Alert banner with summary and Open Map link:

[CRITICAL ALERT]
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — TIP Issued
Re-entry window: 2026-03-16 14:00 – 22:00 UTC  (8h)
Affected FIRs: YMMM, YSSY
Risk level: HIGH  |  [Open map →]
[Review and Acknowledge →]
───────────────────────────────────────────────────────

Step 2 — Confirmation modal (appears on clicking "Review and Acknowledge"):

ACKNOWLEDGE CRITICAL ALERT
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — Re-entry window 14:00–22:00 UTC 16 Mar

Action taken (required — minimum 10 characters):
[_____________________________________________]

[Cancel]           [Confirm — J. Smith, 09:14 UTC]
───────────────────────────────────────────────────────

The Confirm button is disabled until the Action taken field contains ≥ 10 characters. This prevents reflexive one-click acknowledgement during an incident and ensures a minimal action record is always created.

Acknowledgements stored in alert_events (append-only). Records cannot be modified or deleted.


6.7 Timeline / Gantt View

Full timeline accessible from /events and as a compact strip on the Operational Overview.

                    NOW     +6h      +12h     +24h     +48h     +72h
Object              │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
CZ-5B R/B  44878   │   [■■■■■[══════ window ═══════]■■■]        │
  YMMM FIR — HIGH  │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
SL-16 R/B  28900   │        │        │ [■[══════════════════════════→
  NZZC FIR — MED   │        │        │        │        │        │

● = nominal re-entry point; ══ = uncertainty window; colour = risk level.

Click event bar → Event Detail page; hover → tooltip with window bounds and affected FIRs. Zoom range: 6h to 7d.


6.8 Event Detail Page (/events/{id})

┌──────────────────────────────────────────────────────────────┐
│  ← Events  │  CZ-5B R/B  NORAD 44878  │  [■ CRITICAL]       │
│             │  Re-entry window: 14:00–22:00 UTC 16 Mar 2026 │
├──────────────────────────────┬───────────────────────────────┤
│                              │  OBJECT                       │
│    3D GLOBE                  │  Mass: 21,600 kg (● DISCOS)   │
│    (focused on corridor)     │  B*: 0.000215 /ER             │
│    Mode: [Percentile ▾]      │  Data confidence: ● DISCOS    │
│    [Layers]                  │                               │
│                              │  PREDICTION                   │
│                              │  Model: cowell_nrlmsise00 v2  │
│                              │  F10.7 assumed: 148 sfu       │
│                              │  MC samples: 500              │
│                              │  HMAC: ✓ verified             │
│                              │                               │
│                              │  WINDOW                       │
│                              │  5th pct:  13:12 UTC          │
│                              │  50th pct: 17:43 UTC          │
│                              │  95th pct: 22:08 UTC          │
│                              │                               │
│                              │  TIP MESSAGES                 │
│                              │  MSG #3 — 09:00 UTC today     │
│                              │  [All TIP history →]          │
├──────────────────────────────┴───────────────────────────────┤
│  AFFECTED AIRSPACE (Phase 2)                                 │
│  YMMM FIR  ████ HIGH    entry 14:20–19:10 UTC              │
├──────────────────────────────────────────────────────────────┤
│  [Run Simulation]  [Generate Report]  [Share Link]           │
└──────────────────────────────────────────────────────────────┘

HMAC verification status is displayed prominently. If ✗ verification failed appears, a banner reads: "This prediction record may have been tampered with. Do not use for operational decisions. Contact your system administrator."

Data confidence annotates every physical property: ● DISCOS (green), ● estimated (amber), ● unknown (grey). When source is unknown or estimated, a warning callout appears above the prediction panel.

Corridor Evolution widget (Phase 2): A compact 2D strip on the Event Detail page showing how the p50 corridor footprint is evolving over time — three overlapping semi-transparent polygon outlines at T+0h, T+2h, T+4h from the current prediction. Updated automatically in LIVE mode. Gives Persona A Level 3 situation awareness (projection) at a glance without requiring simulation tools. Labelled: "Corridor evolution — how prediction is narrowing". If the corridor is widening (unusual), an amber warning appears: "Uncertainty is increasing — check space weather."

Duty Manager View (Phase 2): A [Duty Manager View] toggle button on the Event Detail header. When active, collapses all technical detail and presents a large-text, decluttered view containing only:

┌──────────────────────────────────────────────────────────────┐
│  CZ-5B R/B  NORAD 44878                    [■ CRITICAL]      │
│                                                              │
│  RE-ENTRY WINDOW                                             │
│  Start:   14:00 UTC  16 Mar 2026                             │
│  End:     22:00 UTC  16 Mar 2026                             │
│  Most likely:  17:43 UTC                                     │
│                                                              │
│  AFFECTED FIRs                                               │
│  YMMM (Airservices Australia) — HIGH RISK                    │
│  YSSY (Airservices Australia) — MEDIUM RISK                  │
│                                                              │
│  [Draft NOTAM]   [Log Action]   [Share Link]                 │
└──────────────────────────────────────────────────────────────┘

Toggle back to full view via [Technical Detail]. State is not persisted between sessions — always starts in full view.

Response Options accordion (Phase 2): An expandable panel at the bottom of the Event Detail page, visible to operator and above roles. Contextualised to the current risk level and FIR intersection. These are considerations only — all decisions rest with the ANSP:

RESPONSE OPTIONS  [▼ expand]
──────────────────────────────────────────────────────────────
Based on current prediction (risk: HIGH, window: 8h):

The following actions are for your consideration.
All operational decisions rest with the ANSP.

  ☐  Issue SIGMET or advisory to aircraft in YMMM FIR
  ☐  Notify adjacent ANSPs (YMMM borders: WAAF, OPKR)
  ☐  Draft NOTAM for authorised issuance   [Open →]
  ☐  Coordinate with FMP on traffic flow impact
  ☐  Establish watching brief schedule (every 30 min)

[Log coordination note]
──────────────────────────────────────────────────────────────

Checkbox states and coordination notes are appended to alert_events (append-only). The Response Options items are dynamically generated by the backend based on risk level and affected FIR count — not hardcoded in the frontend.
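Server-side generation of the checklist could look like the following sketch. The item wording mirrors the mock-up above; the function signature and parameters are illustrative, and the real catalogue of items would live in backend configuration:

```python
def response_options(risk: str, fir_count: int, adjacent_firs: list) -> list:
    """Assemble the Response Options checklist from the current risk
    level and FIR intersection. Items are considerations only."""
    options = []
    if risk in ("HIGH", "CRITICAL"):
        options.append("Issue SIGMET or advisory to aircraft in affected FIR(s)")
    if adjacent_firs:
        options.append(f"Notify adjacent ANSPs ({', '.join(adjacent_firs)})")
    if fir_count > 0:
        options.append("Draft NOTAM for authorised issuance")
        options.append("Coordinate with FMP on traffic flow impact")
    options.append("Establish watching brief schedule (every 30 min)")
    return options
```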


6.9 Simulation Job Management UX

Persistent collapsible bottom-drawer panel visible on any page. Jobs continue running when the user navigates away.

SIMULATION JOBS                                     [▲ collapse]
────────────────────────────────────────────────────────────────
● Running  Decay prediction — 44878    312/500  ████░  62%
           F10.7: 148, Ap: 12, B*±10%            ~45s rem
           [Cancel]

✓ Complete  Decay prediction — 44878    High F10.7 scenario
           Completed 09:02 UTC          [View results]  [Compare]

✗ Failed    Breakup simulation — 28900
           Error: DISCOS data missing   [Retry]  [Details]
────────────────────────────────────────────────────────────────

Simulation comparison: Two completed runs for the same object can be overlaid on the globe with distinct colours and a split-panel parameter comparison.


6.10 Space Weather Widget

SPACE WEATHER                                    [09:14 UTC]
────────────────────────────────────────────────────────────
Solar Activity       ●●●○○  ELEVATED
                     F10.7 observed: 148 sfu  (81d avg: 132)

Geomagnetic          ●●●●○  ACTIVE
                     Kp: 5.3  /  Ap daily: 27

Re-entry Impact      ▲ Active conditions — extend precaution window
                     Add ≥2h buffer beyond 95th percentile.

Forecast (24h)       Activity expected to decline — Kp 3–4
────────────────────────────────────────────────────────────
Source: NOAA SWPC    Updated: 09:00 UTC    [Full history →]

Operational status summary is generated by the backend based on F10.7 deviation from the 81-day average. The "Re-entry Impact" line delivers an operationally actionable statement — not a percentage — with a concrete recommended precaution buffer computed by the backend and delivered as a structured field:

| Condition | Re-entry Impact line | Recommended buffer |
| --- | --- | --- |
| F10.7 < 90 or Kp < 2 | Low activity — predictions at nominal accuracy | +0h |
| F10.7 90–140, Kp 2–4 | Moderate activity — standard uncertainty applies | +1h |
| F10.7 140–200, Kp 4–6 | Active conditions — extend precaution window. Add ≥2h buffer beyond 95th percentile. | +2h |
| F10.7 > 200 or Kp > 6 | High activity — predictions less reliable. Add ≥4h buffer beyond 95th percentile. | +4h |

The buffer recommendation is surfaced on the Event Detail page as an explicit callout when conditions are Elevated or above: "Space weather active: consider extending your airspace precaution window to [95th pct time + buffer]."
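The buffer lookup from the condition table reduces to a few threshold checks. A sketch — the boundary handling (≥ vs >) at band edges is an assumption the table does not pin down:

```python
def precaution_buffer_hours(f107: float, kp: float) -> int:
    """Recommended precaution buffer (hours) beyond the 95th-percentile
    re-entry time, following the space-weather condition table."""
    if f107 > 200 or kp > 6:
        return 4
    if f107 > 140 or kp > 4:
        return 2
    if f107 >= 90 or kp >= 2:
        return 1
    return 0
```

At the widget's example values (F10.7 observed 148 sfu, Kp 5.3) this yields the +2h buffer shown in the Re-entry Impact line.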


6.11 2D Plan View (Phase 2)

Globe/map toggle ([🌐 Globe] [🗺 Plan]) synchronises selected object, active corridor, and time position. State is preserved on switch.

2D view features: Mercator or azimuthal equidistant projection; ICAO chart symbology for airspace; ground-track corridor as horizontal projection only; altitude/time cross-section panel below showing corridor vertical extent at each FIR crossing.


6.12 Reporting Workflow

Report configuration dialogue:

NEW REPORT — CZ-5B R/B (44878)
──────────────────────────────────────────────────────────────
Simulation:  [Run #3 — 09:14 UTC ▾]

Report Type:
  ○ Operational Briefing     (1–2 pages, plain language)
  ○ Technical Assessment     (full uncertainty, model provenance)
  ○ Regulatory Submission    (formal format, appendices)

Include Sections:
  ☑ Object properties and data confidence
  ☑ Re-entry window and uncertainty percentiles
  ☑ Ground track corridor map
  ☑ Affected airspace and FIR crossing times
  ☑ Space weather conditions at prediction time
  ☑ Model version and simulation parameters
  ☐ Full MC sample distribution
  ☐ TIP message history

Prepared by: J. Smith          Authority: CASA
──────────────────────────────────────────────────────────────
[Preview]  [Generate PDF]  [Cancel]

Report identity: Every report has a unique ID, the simulation ID it was derived from, a generation timestamp, and the analyst's name. Reports are stored in MinIO and listed in /reports.

Date format in all reports and exports (F7): Slash-delimited dates (03/04/2026) are ambiguous between DD/MM and MM/DD and are banned from all SpaceCom outputs. All dates in PDF reports, CSV exports, and NOTAM drafts use DD MMM YYYY format (e.g. 04 MAR 2026) — unambiguous across all locales and consistent with ICAO and aviation convention. All times alongside dates use HH:MMZ (e.g. 04 MAR 2026 14:00Z). This applies to: PDF prediction reports, CSV bulk exports, NOTAM draft (B)/(C) fields (which use ICAO YYMMDDHHMM format internally but are displayed as DD MMM YYYY HH:MMZ in the preview).
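A formatting helper enforcing the DD MMM YYYY HH:MMZ rule might look like this (function name is illustrative; a real implementation would sit in a shared formatting module used by the report renderer and export pipeline):

```python
from datetime import datetime, timezone

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def display_datetime(t: datetime) -> str:
    """Format a timestamp as 'DD MMM YYYY HH:MMZ' in UTC — the only
    date/time form permitted in reports, exports and NOTAM previews."""
    t = t.astimezone(timezone.utc)
    return f"{t.day:02d} {MONTHS[t.month - 1]} {t.year} {t.hour:02d}:{t.minute:02d}Z"
```

Uppercase English month abbreviations are hardcoded rather than locale-derived, so the output cannot vary with the server locale.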

Report rendering: Server-side Playwright in the isolated renderer container. The map image is a headless Chromium screenshot of the globe at the relevant configuration. All user-supplied text is HTML-escaped before interpolation. The renderer has no external network access — it receives only sanitised, structured data from the backend API.


6.13 NOTAM Drafting Workflow (Phase 2)

SpaceCom cannot issue NOTAMs. Only designated NOTAM offices authorised by the relevant AIS authority can issue them. SpaceCom's role is to produce a draft in ICAO Annex 15 format ready for review and formal submission by an authorised originator.

Trigger: From the Event Detail page, Persona A clicks [Draft NOTAM]. This is only available when a hazard corridor intersects one or more FIRs.

Draft NOTAM output (ICAO Annex 15 / OPADD format):

Field format follows ICAO Annex 15 Appendix 6 and EUROCONTROL OPADD. Timestamps use YYMMDDHHmm format (not ISO 8601 — ICAO Annex 15 §5.1.2). (B) = p10 − 30 min; (C) = p90 + 30 min (see mapping table below).

NOTAM DRAFT — FOR REVIEW AND AUTHORISED ISSUANCE ONLY
══════════════════════════════════════════════════════
Generated by SpaceCom v2.1 | Prediction ID: pred-44878-20260316-003
Data source: USSPACECOM TIP #3 + SpaceCom decay prediction
⚠ This is a DRAFT only. Must be reviewed and issued by authorised NOTAM office.

Q) YMMM/QWELW/IV/NBO/AE/000/999/2200S13300E999
A) YMMM
B) 2603161330
C) 2603162230
E) UNCONTROLLED SPACE OBJECT RE-ENTRY. OBJECT: CZ-5B ROCKET BODY
   NORAD ID 44878. PREDICTED RE-ENTRY WINDOW 1400-2200 UTC 16 MAR
   2026. NOMINAL RE-ENTRY POINT APRX 22S 133E. 95TH PERCENTILE
   CORRIDOR 18S 115E TO 28S 155E. DEBRIS SURVIVAL PSB. AIRSPACE
   WITHIN CORRIDOR MAY BE AFFECTED ALL LEVELS DURING WINDOW.
   REF SPACECOM PRED-44878-20260316-003.
F) SFC
G) UNL

NOTAM field mapping (ICAO Annex 15 Appendix 6):

| NOTAM field | SpaceCom data source | Format rule |
| --- | --- | --- |
| (Q) Q-line | FIR ICAO designator + NOTAM code QWELW (re-entry warning) | Generated from airspace.icao_designator; subject code WE (airspace warning), condition LW (laser/space) |
| (A) FIR | airspace.icao_designator for each intersecting FIR | One NOTAM per FIR; multi-FIR events generate multiple drafts |
| (B) Valid from | prediction.p10_reentry_time − 30 minutes | YYMMDDHHmm (UTC); example: 2603161330 |
| (C) Valid to | prediction.p90_reentry_time + 30 minutes | YYMMDDHHmm (UTC) |
| (D) Schedule | Omitted (continuous) | Do not include (D) field for continuous validity |
| (E) Description | Templated from sanitised object name, NORAD ID, p50 time, corridor bounds | sanitise_icao() applied; ICAO Doc 8400 abbreviations used (PSB not "possible", APRX not "approximately") |
| (F)/(G) Limits | SFC / UNL | Hardcoded for re-entry events; do not compute from corridor altitude |

(B)/(C) field: re-entry window to NOTAM validity — time-critical cancellation: The (C) validity time does not mean the hazard persists until then — it is the worst-case boundary. When re-entry is confirmed, the NOTAM cancellation draft must be initiated immediately. The Event Detail page surfaces a prominent [Draft NOTAM Cancellation — RE-ENTRY CONFIRMED] button at the moment the event status changes to confirmed, with a UI note: "Cancellation draft should be submitted to the NOTAM office without delay."

Unit test: Generate a draft for a prediction with p10=2026-03-16T14:00Z, p90=2026-03-16T22:00Z; assert (B) field is 2603161330 and (C) field is 2603162230. Assert Q-line matches regex \(Q\) [A-Z]{4}/QWELW/IV/NBO/AE/\d{3}/\d{3}/\d{4}[NS]\d{5}[EW]\d{3}.
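A sketch of the (B)/(C) computation that satisfies the unit test above (function name is an assumption):

```python
from datetime import datetime, timedelta, timezone

def notam_validity_fields(p10: datetime, p90: datetime) -> tuple:
    """(B) = p10 minus 30 minutes, (C) = p90 plus 30 minutes,
    both rendered as ICAO YYMMDDHHmm in UTC."""
    fmt = "%y%m%d%H%M"
    b = (p10 - timedelta(minutes=30)).astimezone(timezone.utc).strftime(fmt)
    c = (p90 + timedelta(minutes=30)).astimezone(timezone.utc).strftime(fmt)
    return b, c
```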

NOTAM cancellation draft: When an event is closed (re-entry confirmed, object decayed), the Event Detail page offers [Draft NOTAM Cancellation] — generates a CANX NOTAM draft referencing the original.

Regulatory note displayed in the UI: A persistent banner on the NOTAM draft page reads: "This draft is generated for review purposes only. It must be reviewed for accuracy, formatted to local AIS standards, and issued by an authorised NOTAM originator. SpaceCom does not issue NOTAMs."

NOTAM language and i18n exclusion (F6): ICAO Annex 15 specifies that NOTAMs use ICAO standard phraseology in English (or the language of the state for domestic NOTAMs). NOTAM template strings are never internationalised:

  • All NOTAM template strings are hardcoded ICAO English phraseology in backend/app/modules/notam/templates.py
  • Each template string is annotated # ICAO-FIXED: do not translate
  • The NOTAM draft is excluded from the next-intl message extraction tooling
  • The NOTAM preview panel renders in a fixed-width monospace font to match traditional NOTAM format
  • lang="en" attribute is set on the NOTAM text container regardless of the operator's UI locale

The draft is stored in the notam_drafts table (see §9.2) for audit purposes.


6.14 Shadow Mode (Phase 2)

Shadow mode allows ANSPs to run SpaceCom in parallel with existing procedures during a trial period, without acting operationally on its outputs. This is the primary mechanism for building regulatory acceptance evidence.

Activation: admin role only, per-organisation setting in /admin.

Visual treatment when shadow mode is active:

┌─────────────────────────────────────────────────────────────────┐
│  ⚗ SHADOW MODE — Predictions are not for operational use        │
│  All outputs are recorded for validation. No alerts are         │
│  delivered externally. Contact your administrator to disable.   │
└─────────────────────────────────────────────────────────────────┘
  • A persistent amber banner spans the top of every page
  • The mode indicator pill shows ⚗ SHADOW in amber
  • All alert levels are demoted to INFORMATIONAL — no banners, no audio tones, no email delivery
  • Prediction records have shadow_mode = TRUE in the database (see §9)
  • Shadow predictions are excluded from all operational views but accessible in /analysis

Validation reporting: After each real re-entry event, Persona B can generate a Shadow Validation Report comparing SpaceCom shadow predictions against the actual observed re-entry time/location. These reports form the evidence base for regulatory adoption.

Shadow Mode Exit Criteria (regulatory hand-off specification — Finding 6):

Shadow mode is a formal regulatory activity, not a product trial. Exit to operational use requires:

| Criterion | Requirement |
|---|---|
| Minimum shadow period | 90 days, or covering ≥ 3 re-entry events above the CRITICAL alert threshold, whichever is longer |
| Prediction accuracy | corridor_contains_observed ≥ 90% across shadow period events (from prediction_outcomes) |
| False positive rate | fir_false_positive_rate ≤ 20% — no more than 1 in 5 corridor-intersecting FIR alerts is a false alarm |
| False negative rate | fir_false_negative = 0 during the shadow period — no re-entry event missed entirely |
| Exit document | shadow-mode-exit-report-{org_id}-{date}.pdf generated from prediction_outcomes; contains automated statistics + ANSP Safety Department sign-off field |
| Regulatory hand-off | Written confirmation from the ANSP's Accountable Manager or Head of ATM Safety that their internal Safety Case / Tool Acceptance process is complete |
| System state | shadow_mode_cleared = TRUE is set by SpaceCom admin only after receipt of the written ANSP confirmation |

The exit report template lives at docs/templates/shadow-mode-exit-report.md. Persona B generates the statistics from the admin analysis panel; the ANSP prints, signs, and returns the PDF. No software system can substitute for the ANSP's internal Safety Department sign-off.
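The quantitative exit criteria can be checked mechanically. A minimal sketch, assuming a simplified per-event record shape (the real statistics come from the `prediction_outcomes` table, and the false-positive rate there is computed per FIR alert rather than per event):

```python
# Sketch of the shadow-mode exit check. `Outcome` is an illustrative
# per-event record, not the real prediction_outcomes schema.
from dataclasses import dataclass


@dataclass
class Outcome:
    corridor_contains_observed: bool       # observed re-entry inside corridor
    fir_alert_was_false_positive: bool     # corridor-intersecting FIR alert was a false alarm


def exit_criteria_met(outcomes: list[Outcome], shadow_days: int,
                      missed_events: int) -> bool:
    """Apply the quantitative exit criteria from the table above."""
    if shadow_days < 90 or len(outcomes) < 3:
        return False                       # minimum period / event count
    if missed_events > 0:
        return False                       # false negative rate must be zero
    contained = sum(o.corridor_contains_observed for o in outcomes)
    false_pos = sum(o.fir_alert_was_false_positive for o in outcomes)
    return (contained / len(outcomes) >= 0.90      # accuracy ≥ 90%
            and false_pos / len(outcomes) <= 0.20)  # false positives ≤ 20%
```

The exit report generator would run this over the shadow period's records; the sign-off fields remain manual.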

Commercial trial-to-operational conversion (Finding 5):

A successful shadow exit automatically generates a commercial offer. The admin panel transitions the organisation's subscription_status from 'shadow_trial' to 'offered' and Persona D receives a task notification. The offer package includes:

  • Commercial offer document (generated from docs/templates/commercial-offer-ansp.md): tier, pricing, SLA schedule, DPA status
  • MSA execution path: ANSPs that accept the offer sign the MSA; no separate negotiation required for the standard ANSP Operational tier
  • Onboarding checklist: docs/onboarding/ansp-onboarding-checklist.md

If an ANSP does not convert within 30 days of receiving the offer, subscription_status moves to 'offered_lapsed' and Persona D is notified. The admin panel shows conversion pipeline status for all ANSP organisations. Maximum concurrent ANSP shadow deployments in Phase 2: 2 (resource constraint — each requires a dedicated SpaceCom integration lead for the 90-day shadow period).


6.15 Space Operator Portal UX (Phase 2)

The Space Operator Portal (/space) is the second front door. It serves Persona E and F with a technically dense interface — different visual language from the aviation-facing portal.

Space Operator Overview (/space):

┌─────────────────────────────────────────────────────────────────┐
│  SpaceCom · Space Portal    [API] [Export] [Persona E: ORBCO]   │
├─────────────────────┬───────────────────────────────────────────┤
│                     │  MY OBJECTS (3)                           │
│  3D GLOBE           │  ┌────────────────────────────────────┐   │
│  (owned objects     │  │ CZ-5B R/B  44878                   │   │
│   only, with        │  │ Perigee: 178 km  ↓ Decaying fast   │   │
│   full orbital      │  │ Re-entry: 16 Mar ± 8h              │   │
│   tracks and        │  │ [Predict] [Plan deorbit] [Export]  │   │
│   decay vectors)    │  ├────────────────────────────────────┤   │
│                     │  │ SL-16 R/B  28900                   │   │
│                     │  │ Perigee: 312 km  ~ Stable          │   │
│                     │  │ [Predict] [Export]                 │   │
│                     │  └────────────────────────────────────┘   │
│                     │  CONJUNCTION ALERTS (MY OBJECTS)          │
│                     │  No active conjunctions > Pc 1:10000      │
├─────────────────────┴───────────────────────────────────────────┤
│  API USAGE   Requests today: 143 / 1000   [Manage keys →]       │
└─────────────────────────────────────────────────────────────────┘

Controlled Re-entry Planner (/space/reentry/plan):

Available for objects with remaining manoeuvre capability (flagged in owned_objects.has_propulsion).

CONTROLLED RE-ENTRY PLANNER — CZ-5B R/B (44878)
─────────────────────────────────────────────────────────────────
Delta-V budget: [▓▓▓░░░░░] 12.4 m/s remaining

Target re-entry window:  [2026-03-20 ▾]  to  [2026-03-22 ▾]
Avoid FIRs:              [☑ YMMM]  [☑ YSSY]  [☑ Populated land]
Preferred landing:       ● Ocean   ○ Specific zone

CANDIDATE WINDOWS
──────────────────────────────────────────────────────────────────
  #1  2026-03-21 03:14 UTC    ΔV: 8.2 m/s    Risk: ● LOW
      Landing: South Pacific  FIR: NZZO (ocean)
      [Select] [View corridor]

  #2  2026-03-21 09:47 UTC    ΔV: 11.1 m/s   Risk: ● LOW
      Landing: Indian Ocean   FIR: FJDG (ocean)
      [Select] [View corridor]

  #3  2026-03-21 15:30 UTC    ΔV: 9.8 m/s    Risk: ▲ MEDIUM
      Landing: 22S 133E       FIR: YMMM (land)
      [Select] [View corridor]
──────────────────────────────────────────────────────────────────
[Export manoeuvre plan (CCSDS)]  [Generate operator report]

The planner outputs are suitable for submission to national space regulators as evidence of responsible end-of-life management under the ESA Zero Debris Charter and national space law requirements.

Zero Debris Charter compliance output format (Finding 2):

The planner produces a controlled-reentry-compliance-report-{norad_id}-{date}.pdf containing:

  • Ranked deorbit window analysis (delta-V budget, window start/end, corridor risk score per window)
  • FIR avoidance corridors for each candidate window
  • Probability of casualty on the ground (Pc_ground) computed using NASA Debris Assessment Software methodology (1-in-10,000 IADC casualty threshold; documented in model card)
  • Comparison table: each candidate window vs. the 1:10,000 Pc_ground threshold; compliant windows flagged green
  • Zero Debris Charter alignment statement (auto-generated from object disposition)

Machine-readable companion: application/vnd.spacecom.reentry-compliance+json — returned alongside the PDF download URL as compliance_report_url in the planning job result. Format documented in docs/api-guide/compliance-export.md.

The Pc_ground calculation uses the fragment survivability model (§15.3 material class lookup) and the ESA DRAMA casualty area methodology. objects.material_class IS NULL → conservative all-survive assumption → higher Pc_ground — creates an incentive for operators to provide accurate physical data.
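The comparison-table logic is a straightforward threshold check once Pc_ground is computed. A minimal sketch, assuming an illustrative window record; the real Pc_ground values come from the DRAMA-based casualty model, not from this helper:

```python
# Sketch of the compliant-window flagging against the 1-in-10,000 IADC
# casualty threshold. CandidateWindow is an illustrative shape.
from dataclasses import dataclass

IADC_PC_THRESHOLD = 1.0 / 10_000


@dataclass
class CandidateWindow:
    start_utc: str
    delta_v_ms: float
    pc_ground: float   # output of the casualty-area model, not computed here


def flag_compliant(windows: list[CandidateWindow]) -> list[dict]:
    """Build comparison-table rows; compliant windows are flagged green in the PDF."""
    return [
        {
            "start_utc": w.start_utc,
            "delta_v_ms": w.delta_v_ms,
            "pc_ground": w.pc_ground,
            "compliant": w.pc_ground < IADC_PC_THRESHOLD,
        }
        for w in windows
    ]
```

Note that a NULL `material_class` inflates `pc_ground` (all-survive assumption), so such windows fail this check more often, which is the intended incentive.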

ECCN classification review (already in §21 Phase 2 DoD) must resolve before this output is shared with non-US entities.


6.16 Accessibility Requirements

  • WCAG 2.1 Level AA compliance — required for government and aviation authority procurement
  • Colour-blind-safe palette throughout; urgency uses shape + colour, never colour alone
  • High-contrast mode available in user settings (WCAG AAA scheme)
  • Dark mode as a first-class theme (not an afterthought)
  • All interactive elements keyboard-accessible; tab order logical
  • Alerts announced via aria-live="assertive" (CRITICAL) and aria-live="polite" (MEDIUM/LOW)
  • Globe canvas has aria-label describing current view context
  • Minimum touch target size 44×44 px
  • Tested at 1080p (ops room), 1440p (analyst workstation), 1024×768 (tablet minimum)
  • Automated axe-core audit via @axe-core/playwright run on the 5 core pages on every PR; 0 critical, 0 serious violations required to merge; known acceptable third-party violations (e.g., CesiumJS canvas contrast) recorded in tests/e2e/axe-exclusions.json with a justification comment — not silently suppressed. Implementation:
    // tests/e2e/accessibility.spec.ts
    import { test, expect } from '@playwright/test';
    import AxeBuilder from '@axe-core/playwright';
    import { loadAxeExclusions } from './axe-exclusions'; // helper that parses axe-exclusions.json

    const corePages: Array<[string, string]> = [
      ['operational-overview', '/'], ['event-detail', '/events/seed-event'],
      ['notam-draft', '/notam/draft/seed-draft'], ['space-portal', '/space/objects'],
      ['settings', '/settings'],
    ];

    for (const [name, path] of corePages) {
      test(`${name} — WCAG 2.1 AA`, async ({ page }) => {
        await page.goto(path);
        const results = await new AxeBuilder({ page })
          .withTags(['wcag2a', 'wcag2aa'])
          .exclude(loadAxeExclusions())   // documented exclusions only — never silent suppression
          .analyze();
        expect(results.violations).toEqual([]);
      });
    }

6.17 Multi-ANSP Coordination Panel (Phase 2)

When an event's predicted corridor intersects FIRs belonging to more than one registered organisation, an additional panel appears on the Event Detail page. This panel provides shared situational awareness across ANSPs without replacing voice coordination.

MULTI-ANSP COORDINATION
──────────────────────────────────────────────────────────────
FIRs affected by this event:
  YMMM  Airservices Australia  — ✓ Acknowledged 09:14 UTC  J. Smith
  NZZC  Airways NZ             — ○ Not yet acknowledged

Last activity:
  09:22 UTC  YMMM — "Watching brief established, coordinating with FMP"
──────────────────────────────────────────────────────────────
[Log coordination note]

Rules:

  • Each ANSP sees the acknowledgement status and latest coordination note from all other ANSPs on the event; they do not see each other's internal alert state
  • Coordination notes are free text, appended to alert_events (append-only, auditable), with organisation name, user name, and UTC timestamp
  • The panel is read-only for organisations that have not yet acknowledged; they can acknowledge and then log notes
  • Visibility is scoped: organisations only see the panel for events that intersect their registered FIRs — they do not see coordination panels for unrelated events from other orgs

This does not replace voice or direct coordination — it creates a shared digital record that both ANSPs can reference. The panel carries a permanent banner: "This coordination panel is for shared situational awareness only. It does not replace formal ATS coordination procedures or voice coordination."

Authority and precedence (Finding 5): The panel has no command authority. If two ANSPs log conflicting assessments, neither supersedes the other in SpaceCom — the system records both. The authoritative coordination outcome is always the result of direct ATS coordination outside the system. SpaceCom coordination notes are supporting evidence, not operational decisions.

WebSocket latency for coordination updates: Coordination note updates must be visible to all parties within 2 seconds of posting (p99). This is specified as a performance SLA for the coordination panel WebSocket channel (distinct from the 5-second SLA for alert events). Latency > 2 seconds means an ANSP may have acted on a stale picture during a fast-moving event.

Data retention for coordination records (ICAO Annex 11 §2.26): Coordination notes are safety records. Minimum retention: 5 years in append-only storage. The coordination_notes table (stored append-only in alert_events.coordination_notes JSONB[] or as a separate table) is included in the safety record retention category (§27.4) and excluded from standard data drop policies.
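The append-only note record can be sketched as a pure constructor; persistence is whatever shape §9 settles on (JSONB[] or a separate table), and this helper is illustrative:

```python
# Sketch of building an append-only coordination note record with the three
# required attribution fields. The dict shape is an assumption.
from datetime import datetime, timezone


def build_coordination_note(org_name: str, user_name: str, text: str) -> dict:
    """Notes are append-only: callers may only add records, never mutate them."""
    return {
        "organisation": org_name,
        "user": user_name,
        "text": text,
        "logged_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
    }
```

The record carries no alert-state fields, consistent with the rule that ANSPs never see each other's internal alert state.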


6.18 First-Time User Onboarding State (Phase 1)

When a new organisation has no configured FIRs and no active events, the globe is empty. An empty globe is indistinguishable from "the system isn't working" for first-time users. An onboarding state prevents this misinterpretation.

Trigger: Organisation has fir_list IS NULL OR fir_list = '{}' at login.

Display: Three setup cards replace the Active Events panel:

WELCOME TO SPACECOM
──────────────────────────────────────────────────────────────
To see relevant events and receive alerts, complete setup:

  1. Configure your FIR watch list
     Determines which re-entry events you see and which
     alerts you receive.                        [Configure →]

  2. Set alert delivery preferences
     Email, WebSocket, or webhook for CRITICAL alerts.
                                                [Configure →]

  3. Optional: Enable Shadow Mode for a trial period
     Run SpaceCom in parallel with existing procedures —
     outputs are not for operational use until disabled.
                                                [Configure →]

──────────────────────────────────────────────────────────────

Cards disappear permanently once step 1 (FIR list) is complete. Steps 2 and 3 remain accessible from /admin at any time. The setup cards are not a modal — they appear inline and the user can still access all navigation.


6.19 Degraded Mode UI Guidance (Phase 1)

The StalenessWarningBanner (triggered by /readyz returning 207) must include an operational guidance line keyed to the specific type of data degradation, not just a generic "data may be stale" message. Persona A's question in degraded mode is not "is the data stale?" — it is "can I use this for an operational decision right now?"

| Degradation type | Banner operational guidance |
|---|---|
| Space weather data stale > 3h | "Uncertainty estimates may be wider than shown. Treat all corridors as potentially broader than the 95th percentile boundary." |
| TLE data stale > 24h | "Object position data is more than 24 hours old. Do not use for precision airspace decisions without independent position verification." |
| Active prediction older than 6h without refresh | "This prediction reflects conditions from [timestamp]. A fresh prediction run is recommended before operational use. [Trigger refresh →]" |
| IERS EOP data stale > 7 days | "Coordinate frame transformations may have minor errors. Technical assessments only — do not use for precision airspace boundary work." |

Banner behaviour:

  • The banner type is set by the backend via the /readyz response body (degradation_type enum)
  • Each degradation type has its own banner message — not a generic "degraded" label
  • The banner persists until the degradation is resolved; it cannot be dismissed by the user
  • When multiple degradations are active, show the highest-impact degradation first, with a (+N more) expand link
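The "highest-impact first" rule implies a fixed severity ordering over the `degradation_type` enum. A minimal sketch; the enum values and their relative ranking below are assumptions derived from the table, not specified values:

```python
# Sketch of ordering active degradations for the banner. Enum value names
# and the ranking are illustrative assumptions.
IMPACT_ORDER = [
    "tle_stale_24h",           # position data unusable for precision decisions
    "prediction_stale_6h",
    "space_weather_stale_3h",
    "iers_eop_stale_7d",
]


def banner_order(active: set[str]) -> list[str]:
    """Return active degradations highest-impact first; index 0 is the
    banner shown, the rest sit behind the (+N more) expand link."""
    return [d for d in IMPACT_ORDER if d in active]
```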

6.20 Secondary Display Mode (Phase 2)

An ops room secondary monitor display mode — strips all navigation chrome and presents only the operational picture on a full-screen secondary display alongside existing ATC tools.

Activation: [Secondary Display] link in the user menu, or URL parameter ?display=secondary. Opens in a new window or full-screen.

Layout: Full-screen globe on the left (~70% width), vertical event list on the right (~30% width). No top navigation, no admin links, no simulation controls. No sidebar panels. The LIVE/SHADOW/SIMULATION mode indicator remains visible (always). CRITICAL alert banners still appear.

Design principle: This is a CSS-level change — hide navigation and chrome elements, maximise the operational data density. No new data is added; no existing data is removed.


7. Security Architecture

This section is as non-negotiable as §4. Security must be built in from Week 1, not audited at Phase 3. The primary security risk in an aviation safety system is not data exfiltration — it is data corruption that produces plausible but wrong outputs that are acted upon operationally. A false all-clear for a genuine re-entry threat is the highest-consequence attack against this system's mission.

7.1 Threat Model (STRIDE)

Key trust boundaries and their principal threats:

| Boundary | Spoofing | Tampering | Repudiation | Info Disclosure | DoS | Elevation |
|---|---|---|---|---|---|---|
| Browser → API | JWT forgery | Request injection | Unlogged mutations | Token leak via XSS | Auth endpoint flood | RBAC bypass |
| API → DB | Credential leak | SQL injection | No audit trail | Column over-fetch | N+1 queries | RLS bypass |
| Ingest → External feeds | DNS/BGP hijack → wrong TLE | Man-in-middle alters F10.7 | — | Credential interception | Feed DoS | — |
| Celery worker → DB | Compromised worker | Corrupt sim output written to DB | Unlogged task | Param leak in logs | Runaway MC task | Worker → backend pivot |
| Playwright renderer → backend | — | User content → XSS → SSRF | — | Local file read | Hang/timeout | RCE via browser exploit |
| Redis | — | Cache poisoning | — | Token interception | Queue flood | — |

Mitigations for each threat are specified in the sections below.


7.2 Role-Based Access Control (RBAC)

Seven roles map to the personas. Every API endpoint enforces the minimum required role via a FastAPI dependency.

| Role | Assigned To | Permissions |
|---|---|---|
| viewer | Read-only external stakeholders | View objects, predictions, corridors; read-only globe (aviation domain) |
| analyst | Persona B | viewer + submit simulations, generate reports, access historical data, shadow validation reports |
| operator | Persona A, C | analyst + acknowledge alerts, issue advisories, draft NOTAMs, access operational tools |
| org_admin | Organisation administrator | operator + invite/remove users within their own org; assign roles up to operator within own org; view own org's audit log; manage own org's API keys; update own org's billing contact; cannot access other orgs' data; cannot assign admin or org_admin without system admin approval |
| admin | Persona D (system-wide) | Full access: user management across all orgs, ingest configuration, model version deployment, shadow mode toggle, subscription management |
| space_operator | Persona E | Object-scoped access (owned objects only via owned_objects table); decay predictions and controlled re-entry planning for own objects; conjunction alerts for own objects; API key management; CCSDS export; no access to other organisations' simulation data |
| orbital_analyst | Persona F | Full catalog read; conjunction screening across any object pair; simulation submission; bulk export (CSV, JSON, CCSDS); raw state vector and covariance access; API key management; no alert acknowledgement |

Object ownership scoping for space_operator: The owned_objects table maps operators to their registered NORAD IDs. All queries from a space_operator user are automatically scoped to their owned object list — enforced by a PostgreSQL RLS policy on the owned_objects join, not only at the application layer:

-- space_operator users see only their owned objects in catalog queries
CREATE POLICY objects_owner_scope ON objects
  USING (
    current_setting('app.current_role') != 'space_operator'
    OR id IN (
      SELECT object_id FROM owned_objects
      WHERE organisation_id = current_setting('app.current_org_id')::INTEGER
    )
  );

Multi-tenancy: If multiple organisations use the system, every table that contains organisation-specific data (simulations, reports, alert_events, hazard_zones) must include an organisation_id column. PostgreSQL Row-Level Security (RLS) policies enforce the boundary at the database layer — not only at the application layer:

ALTER TABLE simulations ENABLE ROW LEVEL SECURITY;
CREATE POLICY simulations_org_isolation ON simulations
  USING (organisation_id = current_setting('app.current_org_id')::INTEGER);

The application sets app.current_org_id at the start of every database session from the authenticated user's JWT claims.
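A minimal sketch of how the request path might stamp those session variables, assuming a driver-agnostic parameterised statement (the helper name and exact wiring into the session factory are illustrative):

```python
# Sketch: build the SET-config statement run at the start of every request's
# DB session so RLS policies can read the org id and role. Names illustrative.
def org_context_sql(org_id: int, role: str) -> tuple[str, dict]:
    """SQL + params executed once per session, from verified JWT claims.

    set_config(..., true) makes the setting transaction-local, so a pooled
    connection cannot leak one org's context into the next request.
    """
    sql = (
        "SELECT set_config('app.current_org_id', %(org)s, true), "
        "set_config('app.current_role', %(role)s, true)"
    )
    return sql, {"org": str(org_id), "role": role}
```

The transaction-local flag matters under PgBouncer-style pooling: the setting dies with the transaction rather than lingering on the physical connection.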

Comprehensive RLS policy coverage (F1): The simulations example above is the template. Every table that carries organisation_id must have RLS enabled and an isolation policy applied. The full set:

| Table | RLS policy | Notes |
|---|---|---|
| simulations | organisation_id = current_org_id | — |
| reentry_predictions | organisation_id = current_org_id | shadow policy layered separately |
| alert_events | organisation_id = current_org_id | append-only; no UPDATE/DELETE anyway |
| hazard_zones | organisation_id = current_org_id | — |
| reports | organisation_id = current_org_id | — |
| api_keys | organisation_id = current_org_id | admins bypass to revoke any key |
| usage_events | organisation_id = current_org_id | billing metering records |
| objects | organisation_id IS NULL OR organisation_id = current_org_id | NULL = catalog-wide; org-specific = owned objects only |

RLS bypass for system-level tasks: Celery workers and internal admin processes run under a dedicated database role (spacecom_worker) that bypasses RLS (BYPASSRLS). This role is never used by the API request path. Integration test (BLOCKING): establish two orgs with data; issue a query as Org A's session; assert zero Org B rows returned. This test runs in CI against a real database (not mocked).

Shadow mode segregation — database-layer enforcement (Finding 9):

Shadow predictions must be excluded from operational API responses at the RLS layer, not only via application WHERE clauses. A backend query bug or misconfigured join must not expose shadow records to viewer/operator sessions — that would be a regulatory incident.

ALTER TABLE reentry_predictions ENABLE ROW LEVEL SECURITY;

-- Non-admin sessions never see shadow records unless the session flag is set
CREATE POLICY shadow_segregation ON reentry_predictions
  USING (
    shadow_mode = FALSE
    OR current_setting('spacecom.include_shadow', TRUE) = 'true'
  );

The spacecom.include_shadow session variable is set to 'true' only by the backend's shadow-admin code path, which requires admin role and explicit shadow-mode context. Regular backend sessions never set this variable. Integration test: query reentry_predictions as viewer role with no WHERE shadow_mode clause; verify zero shadow rows returned.

Four-eyes principle for admin role elevation (Finding 6):

A single compromised admin account must not be able to silently elevate a backdoor account. Elevation to admin requires a second admin to approve within 30 minutes.

CREATE TABLE pending_role_changes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  target_user_id INTEGER NOT NULL REFERENCES users(id),
  requested_role TEXT NOT NULL,
  requested_by INTEGER NOT NULL REFERENCES users(id),
  approval_token_hash TEXT NOT NULL,  -- SHA-256 of emailed token
  expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '30 minutes',
  approved_by INTEGER REFERENCES users(id),
  approved_at TIMESTAMPTZ,
  rejected_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Workflow:

  1. PATCH /admin/users/{id}/role with role=admin creates a pending_role_changes row and triggers an email to all other active admins containing a single-use approval token
  2. POST /admin/role-changes/{change_id}/approve?token=<token> — any other admin can approve; completing the role change is atomic
  3. Rows past expires_at are auto-rejected by a nightly job and logged as ROLE_CHANGE_EXPIRED
  4. All outcomes (ROLE_CHANGE_APPROVED, ROLE_CHANGE_REJECTED, ROLE_CHANGE_EXPIRED) are logged to security_logs as HIGH severity
  5. The requesting admin cannot approve their own pending change (enforced by approved_by != requested_by constraint)

RBAC enforcement pattern (FastAPI):

def require_role(*roles: str):
    def dependency(current_user: User = Depends(get_current_user)):
        if current_user.role not in roles:
            log_auth_failure(current_user, roles)
            raise HTTPException(status_code=403, detail="Insufficient permissions")
        return current_user
    return dependency

# Applied per router group — not per individual endpoint where it is easy to miss
router = APIRouter(dependencies=[Depends(require_role("operator", "admin"))])

7.3 Authentication

JWT Implementation

  • Algorithm: RS256 (asymmetric). Never HS256 with a shared secret. Never none.
  • Key storage: RSA private signing key stored in Docker secrets / secrets manager (see §7.5). Never in an environment variable or .env file.
  • Token storage in browser: httpOnly, Secure, SameSite=Strict cookies only. Never localStorage (vulnerable to XSS). Never query parameters (appear in server logs).
  • Access token lifetime: 15 minutes.
  • Refresh token lifetime: 24 hours for operator/analyst; 8 hours for admin.
  • Refresh token rotation with family reuse detection (Finding 5): Invalidate the old token on every refresh. Tokens belong to a family_id (UUID assigned at first issuance). If a token from a superseded generation within a family is presented — i.e. it was already rotated and a newer token in the same family exists — the entire family is immediately revoked, logged as REFRESH_TOKEN_REUSE (HIGH severity), and an email alert is sent to the user ("Suspicious login detected — all sessions revoked"). This detects refresh token theft: the legitimate user retries after the attacker consumed the token first, causing the reuse to surface. The refresh_tokens table includes family_id UUID NOT NULL and superseded_at TIMESTAMPTZ (set when a new token replaces this one in rotation).
  • Refresh token storage: refresh_tokens table in the database (see §9.2). This enables server-side revocation — Redis-only storage loses revocations on restart.
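The family reuse decision reduces to a three-way branch on the presented token's row. A minimal sketch, assuming an illustrative record shape; revocation, logging, and the user email are represented only by the returned action name:

```python
# Sketch of refresh-token family reuse detection. token_row mirrors a
# refresh_tokens record (shape is an assumption); side effects omitted.
def handle_refresh(token_row: dict) -> str:
    """Return the action the auth layer takes for a presented refresh token."""
    if token_row["revoked"]:
        return "reject"
    if token_row["superseded_at"] is not None:
        # A newer token exists in this family_id: the presented token was
        # already rotated. Revoke the whole family, log REFRESH_TOKEN_REUSE
        # (HIGH severity), email the user.
        return "revoke_family"
    return "rotate"   # normal path: issue successor, set superseded_at
```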

Multi-Factor Authentication (MFA)

TOTP-based MFA (RFC 6238) is required for all roles from Phase 1. Implementation:

  • On first login after account creation, user is presented with TOTP QR code (via pyotp) and required to verify before completing registration
  • Recovery codes (8 × 10-character alphanumeric) generated at setup; stored as bcrypt hashes in users.mfa_recovery_codes
  • MFA bypass via recovery code is logged as a security event (MEDIUM alert to admins)
  • MFA is enforced at the JWT issuance step — tokens are not issued until MFA is verified
  • Failed MFA attempts after 5 consecutive failures trigger a 30-minute account lockout and a MEDIUM alert

SSO / Identity Provider Abstraction

"Integrate with SkyNav SSO later" cannot remain a deferred decision. The auth layer must be designed as a pluggable provider from the start:

class AuthProvider(Protocol):
    async def authenticate(self, credentials: Credentials) -> User: ...
    async def issue_tokens(self, user: User) -> TokenPair: ...
    async def revoke(self, refresh_token: str) -> None: ...

class LocalJWTProvider(AuthProvider): ...   # Phase 1: local JWT + TOTP
class OIDCProvider(AuthProvider): ...       # Phase 3: OIDC/SAML SSO

All endpoint logic depends on AuthProvider — switching from local JWT to OIDC requires no endpoint changes.


7.4 API Security

Rate Limiting

Implemented with slowapi (Redis token bucket). Limits are per-user for authenticated endpoints, per-IP for auth endpoints:

| Endpoint | Limit | Window |
|---|---|---|
| POST /token (login) | 10 per IP | 1 minute; exponential backoff after 5 failures |
| POST /token/refresh | 30 per user | 1 hour |
| POST /decay/predict | 10 per user | 1 hour |
| POST /conjunctions/screen | 5 per user | 1 hour |
| POST /reports | 20 per user | 1 day |
| WS /ws/events connection attempts | 10 per user | 1 minute |
| General authenticated read endpoints | 300 per user | 1 minute |
| General unauthenticated (if any) | 60 per IP | 1 minute |

Rate limit headers returned on every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
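The semantics slowapi enforces (via its Redis backend) are those of a token bucket. The following in-memory, single-process sketch is for illustration only, not the production limiter:

```python
# Minimal in-memory token bucket illustrating the per-user limit semantics.
# Production uses slowapi's Redis-backed limiter, not this class.
import time


class TokenBucket:
    def __init__(self, limit: int, window_s: float):
        self.capacity = limit
        self.tokens = float(limit)
        self.rate = limit / window_s          # refill rate, tokens per second
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill continuously over the window."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket of `TokenBucket(10, 60)` models the login endpoint's 10-per-minute limit: bursts up to 10 are admitted, then requests drain in at one per 6 seconds.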

Simulation Parameter Validation

All physical parameters must be validated against their physically meaningful ranges before a simulation job is accepted. Type validation alone is insufficient — NRLMSISE-00 will silently produce garbage for out-of-range inputs without raising an error:

class DecayPredictParams(BaseModel):
    f107: float = Field(..., ge=65.0, le=300.0,
        description="F10.7 solar flux (sfu). Physically valid: 65–300.")
    ap: float = Field(..., ge=0.0, le=400.0,
        description="Geomagnetic Ap index. Valid: 0–400.")
    mc_samples: int = Field(..., ge=10, le=1000,
        description="Monte Carlo sample count. Server cap: 1000 regardless of input.")
    bstar_uncertainty_pct: float = Field(..., ge=0.0, le=50.0)

    @validator('mc_samples')
    def cap_mc_samples(cls, v):
        return min(v, 1000)  # Server-side cap regardless of submitted value

Server-Side Request Forgery (SSRF) Mitigation

The Ingest module fetches from five external sources. These URLs must be:

  • Hardcoded constants in ingest/sources.py — never loaded from user input, API parameters, or database values
  • Fetched via an HTTP client configured with an allowlist of expected IP ranges per source; connections to private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, ::1, fc00::/7) are blocked at the HTTP client layer
ALLOWED_HOSTS = {
    "www.space-track.org": ["18.0.0.0/8"],   # approximate; update with actual ranges
    "celestrak.org": [...],
    "swpc.noaa.gov": [...],
    "discosweb.esoc.esa.int": [...],
    "maia.usno.navy.mil": [...],
}
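The private-range block can be implemented with the stdlib `ipaddress` module, checked against each resolved address before the HTTP client connects. A minimal sketch over exactly the ranges listed above:

```python
# Sketch of the SSRF private-range guard applied at the HTTP client layer.
import ipaddress

_BLOCKED = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "::1/128", "fc00::/7",
)]


def is_blocked(resolved_ip: str) -> bool:
    """True if a resolved ingest address must be refused before connecting."""
    addr = ipaddress.ip_address(resolved_ip)
    return any(addr in net for net in _BLOCKED)
```

The check must run on the resolved address (post-DNS), not the hostname, so a DNS rebind to a private address is still caught.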

CZML and CZML Injection

Object names and descriptions sourced from Space-Track are interpolated into CZML documents and ultimately rendered in CesiumJS. A malicious object name containing <script> or CesiumJS-specific injection must be sanitised:

  • HTML-encode all string fields from external sources before inserting into CZML
  • CesiumJS evaluates CZML description fields as HTML in info boxes — treat as untrusted HTML; use DOMPurify on the client before passing to CesiumJS description properties
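On the server side, the HTML-encoding step for external strings is simply stdlib escaping before CZML interpolation (the client still runs DOMPurify on description HTML as a second layer); the helper name below is illustrative:

```python
# Sketch of server-side HTML-encoding for externally sourced strings
# (object names, descriptions) before they are inserted into CZML documents.
import html


def czml_safe(value: str) -> str:
    """HTML-encode an untrusted string, including quotes, for CZML fields."""
    return html.escape(value, quote=True)
```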

NOTAM Draft Content Sanitisation (Finding 10)

NOTAM drafts are templated from prediction data, object names, and operator-supplied fields. Object names originate from Space-Track and from manual POST /objects input. ICAO plain-text format is vulnerable to special-character injection and, if the draft is ever rendered to PDF by the Playwright renderer, to XSS.

import logging
import re

logger = logging.getLogger(__name__)

_ICAO_SAFE = re.compile(r"[^A-Z0-9\-_ /]")

def sanitise_icao(value: str, field_name: str = "field") -> str:
    """
    Strip characters outside ICAO plain-text safe set before NOTAM template interpolation.

    Args:
        value: Raw string from user input or external source.
        field_name: Field identifier for logging if value is modified.

    Returns:
        Sanitised string safe for ICAO plain-text insertion.
    """
    upper = value.upper()
    sanitised = _ICAO_SAFE.sub("", upper)
    if sanitised != upper:
        logger.info("sanitise_icao: modified %s field", field_name)
    return sanitised or "[REDACTED]"

Rules:

  • sanitise_icao() is called on every user-sourced field before interpolation into NOTAM_TEMPLATE.format(...)
  • TLE remarks fields are stripped entirely from NOTAM output (not an ICAO-relevant field)
  • NOTAM template uses str.format() with named arguments, not f-strings with raw variables
  • sanitise_icao is listed in AGENTS.md as a security-critical function — any change requires a dedicated security review

7.5 Secrets Management

"All secrets via environment variables" is a development-only posture.

Development: .env file. Never committed. .gitignore must include .env, .env.*.

Production: Docker secrets (Compose secrets: stanza) for Phase 1 production deployment; HashiCorp Vault or cloud-provider secrets manager (AWS Secrets Manager, GCP Secret Manager) for Phase 3.

Secrets rotation schedule:

| Secret | Rotation Frequency | Method |
|---|---|---|
| JWT RS256 private key | 90 days | Key ID in JWT header; both old and new keys valid during 24h rotation window |
| Space-Track.org credentials | 90 days | Space-Track account supports credential rotation; coordinated with ops team |
| Database password | 90 days | Dual-credential rotation (see procedure below); zero-downtime |
| Redis ACL passwords (backend, worker, ingest) | 90 days | Update ACL password via redis-cli ACL SETUSER; restart dependent services with new env var; old password invalid immediately |
| MinIO access key | 90 days | MinIO admin API |
| Cesium ion access token | NOT A SECRET | Public browser credential — shipped in NEXT_PUBLIC_CESIUM_ION_TOKEN. Read via Ion.defaultAccessToken = process.env.NEXT_PUBLIC_CESIUM_ION_TOKEN. Do not proxy through the backend. Do not store in Docker secrets or Vault. Rotate only if the token is explicitly revoked on cesium.com. |

Database password rotation procedure — a hard PgBouncer restart drops idle connections cleanly but kills active transactions. Use the drain-then-swap sequence instead:

  1. Update Postgres role (new password valid immediately; old password still in PgBouncer config): ALTER ROLE spacecom_app PASSWORD 'new_secret';
  2. Drain PgBouncer — issue PAUSE pgbouncer;. New connections queue; existing transactions complete. Timeout: 30s (if not drained, proceed and accept brief 503s).
  3. Update PgBouncer config with new password, then RESUME pgbouncer;. Application connections resume using new password.
  4. Verify ingest and API within 5 minutes — GET /admin/ingest-status and GET /readyz must return 200.
  5. Confirm old-credential revocation after a 15-minute grace period — the ALTER ROLE in step 1 already replaced the password (re-running it is a no-op), and any sessions authenticated with the old credential completed during the drain.
  6. Rotate Patroni replication credentials separatelypatronictl reload with updated postgresql.parameters.hba_file; does not affect application connections.

Full runbook: docs/runbooks/db-password-rotation.md.

Anti-patterns — enforced by git-secrets pre-commit hook and CI scan:

  • No secrets in requirements.txt, docker-compose.yml, Dockerfile, source files, or logs
  • Secret patterns (AWS keys, private key headers, connection strings) trigger CI failure
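A minimal sketch of the kind of pattern matching the CI scan performs for the three pattern families named above. The regexes are illustrative examples, not the full git-secrets rule set:

```python
import re

# Illustrative subset of the scanned pattern families: AWS access key IDs,
# PEM private key headers, and connection strings with embedded credentials.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                  # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"postgres(ql)?://[^:\s]+:[^@\s]+@"),  # DSN with password
]

def find_secrets(text: str) -> list[str]:
    """Return every substring that matches a known secret pattern."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]
```

CI fails the build when `find_secrets` returns a non-empty list for any staged file.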

7.6 Transport Security

External-facing:

  • HTTPS only. HTTP → HTTPS 301 redirect.
  • Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
  • TLS 1.2 minimum; TLS 1.3 preferred. Disable TLS 1.0, 1.1, SSLv3.
  • Cipher suite: Mozilla "Intermediate" configuration or better.
  • WebSocket connections: wss:// only. The ws.ts client enforces this.

Internal service communication:

  • Backend → DB: PostgreSQL TLS with client certificate verification
  • Backend → Redis: Redis 7 TLS mode (tls-port, tls-cert-file, tls-key-file, tls-ca-cert-file)
  • Backend → MinIO: HTTPS (MinIO production mode requires TLS)
  • Backend → Renderer: HTTPS on internal Docker network; renderer does not accept connections from any other service

Certificate management:

  • Production: Let's Encrypt via Caddy (auto-renewal, OCSP stapling)
  • Certificate expiry monitored: alert 30 days before expiry via cert-manager or custom Celery task

7.7 Content Security Policy and Security Headers

SpaceCom uses two distinct CSP tiers because CesiumJS requires 'unsafe-eval' (GLSL shader compilation) — a directive that would be unacceptable on non-globe routes.

Tier 1 — Non-globe routes (login, settings, admin, API responses):

Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob:;
  connect-src 'self' wss://[domain];
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), camera=(), microphone=()

Tier 2 — Globe routes (app/(globe)/ — all routes under the (globe) layout group only):

Content-Security-Policy:
  default-src 'self';
  script-src 'self' 'unsafe-eval' https://cesium.com;
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob: https://*.cesium.com https://*.openstreetmap.org;
  connect-src 'self' wss://[domain] https://cesium.com https://api.cesium.com;
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Implementation in next.config.ts:

// next.config.ts — CSP_STANDARD and CSP_GLOBE hold the two policy strings above
const headers = async () => [
  {
    source: '/((?!dashboard|monitor).*)',  // non-globe routes
    headers: [{ key: 'Content-Security-Policy', value: CSP_STANDARD }],
  },
  {
    source: '/(dashboard|monitor)(.*)',    // globe routes — unsafe-eval allowed
    headers: [{ key: 'Content-Security-Policy', value: CSP_GLOBE }],
  },
];

'unsafe-eval' is required by CesiumJS for runtime GLSL shader compilation. Scope it only to globe routes. This is a known, documented exception — it must never appear in the standard-tier CSP.

'unsafe-inline' for style-src is also required by CesiumJS and appears in both tiers. It must not be used for script-src in the standard tier.

Renderer page CSP (the headless Playwright context, which must be the most restrictive):

Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self';
  img-src 'self' data: blob:;
  connect-src 'none';
  frame-ancestors 'none';

7.8 WebSocket Security

WS /ws/events authentication:

  • JWT token must be verified at connection establishment (HTTP Upgrade request)
  • Browser WebSocket APIs cannot send custom headers — use the httpOnly auth cookie (set by the login flow) which is automatically sent with the Upgrade request; verify it in the WebSocket handshake handler
  • Do not accept tokens via query parameters (?token=...) — they appear in server access logs

Connection management:

  • Per-user concurrent connection limit: 5. Enforced in the upgrade handler by checking a Redis counter.
  • Server-side ping every 30 seconds; close connections that do not respond within 60 seconds
  • All incoming WebSocket messages (if bidirectional) validated against a JSON schema before processing
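The per-user cap can be enforced with an optimistic increment-then-check against the Redis counter. A sketch, using an in-memory stand-in for Redis (INCR/DECR are atomic in real Redis, which is what makes the claim-then-rollback safe under concurrency; the helper names are assumptions):

```python
MAX_CONNECTIONS_PER_USER = 5

class InMemoryCounter:
    """Stand-in for the Redis counter used in the upgrade handler."""

    def __init__(self):
        self.values = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def decr(self, key):
        self.values[key] = self.values.get(key, 0) - 1
        return self.values[key]

def try_register_connection(counter, user_id: int) -> bool:
    """Optimistically claim a slot at upgrade time; roll back when over the cap."""
    key = f"ws:conn:{user_id}"
    if counter.incr(key) > MAX_CONNECTIONS_PER_USER:
        counter.decr(key)  # roll back so the counter stays accurate
        return False
    return True
```

On connection close the handler decrements the same key, freeing the slot.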

7.9 Data Integrity

This is the most important security property of the system. Predictions that drive aviation safety decisions must be trustworthy and tamper-evident.

HMAC Signing of Predictions

Every row written to reentry_predictions and hazard_zones is signed at creation time with an application-secret HMAC:

import hmac, hashlib, json

def sign_prediction(prediction: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

HMAC signing race fix (F4 — §67): If reentry_predictions.id is a DB-assigned BIGSERIAL, the application must INSERT first (to get the id), then compute the HMAC using that id, then UPDATE the row — a two-phase write. Between the INSERT and the UPDATE there is a brief window where a valid prediction row exists with an empty record_hmac, which the nightly HMAC verification job (§10.2) would flag as a violation.

Fix: Use UUID as the primary key (DEFAULT gen_random_uuid()) and assign the UUID in the application before the INSERT. The application pre-generates the UUID, computes the HMAC against the full prediction dict including that UUID, then inserts the complete row in a single write:

import uuid

def write_prediction_to_db(prediction: dict):
    prediction_id = str(uuid.uuid4())
    prediction['id'] = prediction_id
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    # Single INSERT — no two-phase write; no race window
    db.execute(text("""
        INSERT INTO reentry_predictions (id, object_id, ..., record_hmac)
        VALUES (:id, :object_id, ..., :record_hmac)
    """), prediction)

Migration: ALTER TABLE reentry_predictions ALTER COLUMN id TYPE UUID USING gen_random_uuid(); ALTER TABLE reentry_predictions ALTER COLUMN id SET DEFAULT gen_random_uuid();. Note that USING gen_random_uuid() assigns fresh UUIDs to existing rows, so the FK references (alert_events.prediction_id, prediction_outcomes.prediction_id) must be re-pointed in the same transaction via an old-id → new-UUID mapping table. Include in the next schema migration (alembic revision --autogenerate for the type change, plus a hand-written data migration for the FK remapping).

The HMAC is stored in a record_hmac column. Before serving any prediction to a client, the backend verifies the HMAC. A failed verification:

  • Is logged as a security event (CRITICAL alert to admins)
  • Results in the prediction being marked integrity_failed = TRUE
  • The prediction is not served; the API returns a 503 with a message directing the user to contact the system administrator
  • The Event Detail page displays ✗ HMAC verification failed and a warning banner
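Verification before serving is the mirror of signing. A sketch, repeating sign_prediction from above for self-containment (the verify_prediction helper name is an assumption):

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def sign_prediction(prediction: dict, secret: bytes) -> str:
    """Canonical-payload HMAC over the safety-critical fields (as above)."""
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

def verify_prediction(prediction: dict, secret: bytes) -> bool:
    """Constant-time comparison of the stored HMAC against a recomputed one."""
    expected = sign_prediction(prediction, secret)
    return hmac.compare_digest(expected, prediction["record_hmac"])
```

hmac.compare_digest avoids timing side channels that a plain `==` comparison would leak.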

Prediction Immutability

Once written, prediction records must not be modified:

CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reentry_predictions_immutable
  BEFORE UPDATE OR DELETE ON reentry_predictions
  FOR EACH ROW EXECUTE FUNCTION prevent_prediction_modification();

Apply the same trigger to hazard_zones.

HMAC Key Rotation Procedure (Finding 1)

The immutability trigger blocks all UPDATEs on reentry_predictions, including legitimate HMAC re-signing during key rotation. The rotation path must be explicit and auditable:

Schema additions to reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN rotated_at TIMESTAMPTZ,
  ADD COLUMN rotated_by INTEGER REFERENCES users(id);

Parameterised immutability trigger — allows UPDATE only on record_hmac when the session flag is set by the privileged hmac_admin role:

CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
  -- Allow HMAC-only rotation when the flag is set by the hmac_admin role.
  -- Every column except the HMAC and the rotation bookkeeping fields must
  -- be unchanged — checked by comparing the row images with those fields removed.
  IF TG_OP = 'UPDATE'
     AND current_setting('spacecom.hmac_rotation', TRUE) = 'true'
     AND NEW.record_hmac IS DISTINCT FROM OLD.record_hmac
     AND (to_jsonb(NEW) - 'record_hmac' - 'rotated_at' - 'rotated_by')
       = (to_jsonb(OLD) - 'record_hmac' - 'rotated_at' - 'rotated_by')
  THEN
    RETURN NEW;
  END IF;
  RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

hmac_admin database role: A dedicated hmac_admin Postgres role is the only role permitted to SET LOCAL spacecom.hmac_rotation = true. The backend application role does not have this privilege. The rotation script connects as hmac_admin, sets the flag per-transaction, re-signs each row, and commits. Every changed row is logged to security_logs as event type HMAC_ROTATION.

Dual sign-off: The rotation script must be run with two operators present. The runbook records the initiating operator's user ID in the rotated_by column, and the second operator independently verifies that a random sample of re-signed HMACs match the new key before the script is considered complete.

The HMAC rotation runbook lives at docs/runbooks/hmac-key-rotation.md and cross-references the zero-downtime JWT keypair rotation runbook for the dual-key validity window.

Append-Only alert_events

CREATE OR REPLACE FUNCTION prevent_alert_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'alert_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER alert_events_immutable
  BEFORE UPDATE OR DELETE ON alert_events
  FOR EACH ROW EXECUTE FUNCTION prevent_alert_modification();

Cross-Source Validation

Do not silently trust a single data source:

  • TLE cross-validation: When the same NORAD ID is received from both Space-Track and CelesTrak within a 6-hour window, compare the key orbital elements. If they differ by more than a defined threshold (e.g., semi-major axis > 1 km, inclination > 0.01°), flag for human review rather than silently using one.
  • All-clear double check: A prediction record showing no hazard for an object that has an active TIP message triggers an integrity alert. A single-source all-clear cannot override a TIP message.
  • Space weather cross-validation: Ingest F10.7 from both NOAA SWPC and ESA Space Weather Service. If they disagree by > 20%, alert and use the more conservative (higher) value until the discrepancy resolves.
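The three checks above reduce to small comparison helpers. A sketch using the thresholds from the bullets; the function names and dict keys are illustrative:

```python
# Thresholds from the cross-validation rules above
SMA_THRESHOLD_KM = 1.0
INC_THRESHOLD_DEG = 0.01
F107_DISCREPANCY = 0.20

def tle_sources_agree(a: dict, b: dict) -> bool:
    """Compare key orbital elements from two catalogue sources; False -> human review."""
    return (
        abs(a["semi_major_axis_km"] - b["semi_major_axis_km"]) <= SMA_THRESHOLD_KM
        and abs(a["inclination_deg"] - b["inclination_deg"]) <= INC_THRESHOLD_DEG
    )

def choose_f107(noaa: float, esa: float) -> tuple[float, bool]:
    """Return (value to use, alert needed). On >20% disagreement, alert and
    use the more conservative (higher) value until the discrepancy resolves."""
    discrepant = abs(noaa - esa) / max(noaa, esa) > F107_DISCREPANCY
    return (max(noaa, esa) if discrepant else noaa), discrepant
```

When `choose_f107` flags a discrepancy, the all-clear rule still applies: a single-source all-clear never overrides an active TIP message.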

IERS EOP Integrity

The weekly IERS Bulletin A download must be verified before application:

IERS_BULLETIN_A_SHA256 = {
    # Updated by the ops workflow after each weekly IERS publication —
    # the file changes weekly, so a single hash cannot be pinned long-term;
    # each registered value is checked against the IERS publication notice
    "finals2000A.all": "expected_hash_here",
}
# If the hash check fails, the existing EOP table is retained; a MEDIUM alert is generated
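The hash gate itself is a one-line comparison plus an apply-or-retain decision. A sketch under stated assumptions (the function names and the decision tuple shape are illustrative):

```python
import hashlib

def verify_eop_download(content: bytes, expected_sha256: str) -> bool:
    """True when the downloaded finals2000A.all matches the registered hash."""
    return hashlib.sha256(content).hexdigest() == expected_sha256

def eop_apply_decision(content: bytes, expected_sha256: str):
    """Return ("apply", None) on match; otherwise retain the existing EOP
    table and raise a MEDIUM alert, per the rule above."""
    if verify_eop_download(content, expected_sha256):
        return ("apply", None)
    return ("retain_existing", "MEDIUM")
```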

alert_events HMAC integrity (F9): alert_events records are safety-critical audit evidence (UN Liability Convention, ICAO). They carry the same HMAC protection as reentry_predictions:

def sign_alert_event(event: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": event["id"],
        "object_id": event["object_id"],
        "organisation_id": event["organisation_id"],
        "level": event["level"],
        "trigger_type": event["trigger_type"],
        "created_at": event["created_at"].isoformat(),
        "acknowledged_by": event["acknowledged_by"],
        "action_taken": event.get("action_taken"),
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

Nightly integrity check (Celery Beat, 02:00 UTC):

from datetime import datetime, timedelta, timezone

@celery.task
def verify_alert_event_hmac():
    """Re-verify the HMAC on all alert_events created in the past 24 hours."""
    since = datetime.now(timezone.utc) - timedelta(hours=24)
    rows = db.execute(
        text("SELECT id FROM alert_events WHERE created_at >= :since"),
        {"since": since},
    ).fetchall()
    for row in rows:
        event = db.get(AlertEvent, row.id)
        expected = sign_alert_event(event.__dict__, HMAC_SECRET)
        if not hmac.compare_digest(expected, event.record_hmac):
            log_security_event("ALERT_EVENT_HMAC_FAILURE", {"event_id": row.id})
            alert_admin_critical(f"alert_events HMAC integrity failure: id={row.id}")

Database timezone enforcement (F2): PostgreSQL TIMESTAMPTZ stores internally in UTC, but ORM connections can silently apply server or session timezone offsets. All timestamps must remain UTC end-to-end:

# database.py — connection pool creation
from sqlalchemy import event, text

@event.listens_for(engine.sync_engine, "connect")
def set_timezone(dbapi_conn, connection_record):
    cursor = dbapi_conn.cursor()
    cursor.execute("SET TIME ZONE 'UTC'")
    cursor.close()

Integration test (tests/test_db_timezone.py — BLOCKING):

def test_timestamps_round_trip_as_utc(db_session):
    """Ensure ORM never silently converts UTC timestamps to local time."""
    known_utc = datetime(2026, 3, 22, 14, 0, 0, tzinfo=timezone.utc)
    obj = ReentryPrediction(p50_reentry_time=known_utc, ...)
    db_session.add(obj)
    db_session.flush()
    db_session.refresh(obj)
    assert obj.p50_reentry_time == known_utc
    assert obj.p50_reentry_time.tzinfo == timezone.utc

Any non-UTC representation of a timestamp is a display-layer concern only — never stored or transmitted as local time.


7.10 Infrastructure Security

Container Hardening

Applied to all service Dockerfiles and Compose definitions:

# Applied to all services
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp:size=256m,mode=1777
user: "1000:1000"   # non-root; created in Dockerfile as: RUN useradd -r -u 1000 appuser
cap_drop:
  - ALL
cap_add: []         # No capabilities added; NET_BIND_SERVICE not needed if ports > 1024

Renderer container — most restrictive:

renderer:
  security_opt:
    - no-new-privileges:true
    - seccomp:renderer-seccomp.json   # Custom seccomp profile for Chromium
  networks:
    - renderer_net      # internal-only network; reaches nothing but the backend API
  read_only: true
  tmpfs:
    - /tmp:size=512m    # Playwright needs /tmp
    - /home/appuser:size=256m  # Chromium profile directory
  cap_drop:
    - ALL
  cap_add:
    - SYS_ADMIN         # Required by Chromium sandbox; document this explicitly

SYS_ADMIN for Chromium is a known requirement. Mitigate by ensuring the renderer container has no network access to anything other than the backend internal API, and by setting a strict seccomp profile.

Redis Authentication and ACLs

# redis.conf (production)
requirepass ""          # Disabled; use ACL only
aclfile /etc/redis/users.acl

# users.acl
user backend on >[backend_password] ~* &* +@all -@dangerous
user worker on >[worker_password] ~celery:* &celery:* +RPUSH +LPOP +LLEN +SUBSCRIBE +PUBLISH +XADD +XREAD
user default off        # Disable default user

MinIO Bucket Policies

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::*"
  }]
}

All buckets are private. Report downloads use 5-minute pre-signed URLs (reduced from 15 minutes — user downloads immediately). Pre-signed URL generation is logged to security_logs (event type PRESIGNED_URL_GENERATED) with user_id, object_key, expires_at, and client_ip — this creates an audit trail of who obtained access to which object.

MC blob access — server-side proxy (Finding 2): Simulation trajectory blobs (MC samples) must not be served as direct pre-signed MinIO URLs to the browser. Instead, the visualiser calls GET /viz/mc-trajectories/{simulation_id} which the backend fetches from MinIO server-side and streams to the authenticated client. This keeps MinIO URLs entirely off the client and prevents URL sharing or exfiltration. The backend enforces the requesting user's organisation matches the simulation's organisation_id before proxying.


7.11 Playwright Renderer Security

The renderer is the highest attack-surface component. It runs a real browser on the server.

Isolation: The renderer service runs in its own container on renderer_net. It accepts HTTPS connections only from the backend's internal IP. It makes no outbound connections beyond backend:8000 (enforced by network segmentation + Playwright request interception — see below).

Data flow: The renderer receives only a report_id (integer) from the backend job queue. It constructs the report URL internally as http://backend:8000/reports/{report_id}/preview — user-supplied values are never interpolated into the URL. The report_id is validated as a positive integer before use. The renderer has no access to the database, Redis, or MinIO directly.

Playwright request interception (Finding 4) — allowlist, not blocklist:

from playwright.async_api import Page, Route

async def setup_request_interception(page: Page) -> None:
    """Block any Playwright navigation to hosts other than the backend."""
    async def handle_route(route: Route) -> None:
        url = route.request.url
        if not url.startswith("http://backend:8000/"):
            await route.abort("blockedbyclient")
        else:
            await route.continue_()
    await page.route("**/*", handle_route)

This is a defence-in-depth layer: even if a bug causes the renderer to receive a crafted URL, the interception handler prevents navigation to any external or internal host outside backend:8000.

Input sanitisation before reaching the renderer:

import bleach

ALLOWED_TAGS = []  # No HTML allowed in user-supplied report fields
ALLOWED_ATTRS = {}

def sanitise_report_field(value: str) -> str:
    """Strip all HTML from user-supplied strings before renderer interpolation."""
    return bleach.clean(value, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)

Report template: The renderer loads a report template from the local filesystem (bundled in the container image). It does not fetch templates from URLs or the database. User-supplied content is inserted via a strict templating engine (Jinja2 with autoescape=True).

Timeouts: Report generation has a hard 30-second timeout. Playwright's page.goto() timeout set to 10 seconds. If the timeout is exceeded, the job fails with a clear error — the renderer does not hang indefinitely.

No dangerouslySetInnerHTML: The report React template must never use dangerouslySetInnerHTML. All text insertion via {value} (React's built-in escaping).


7.12 Compute Resource Governance

| Limit | Value | Enforcement |
| --- | --- | --- |
| mc_samples maximum | 1000 | Pydantic validator at API layer; also re-validated inside the Celery task body (Finding 3) |
| Concurrent simulations per user | 3 | Checked against simulations table before job acceptance; returns 429 if exceeded |
| Pending jobs per user | 10 | Checked at submission time |
| Decay prediction CPU time limit | 300 s | Celery time_limit=300, soft_time_limit=270 |
| Breakup simulation CPU time limit | 600 s | Celery time_limit=600, soft_time_limit=570 |
| Ephemeris response points maximum | 100,000 | Enforced by calculating (end - start) / step; returns 400 if exceeded with a message to reduce range or increase step |
| CZML document size | 50 MB | Streaming response with max size enforced; client must paginate for larger ranges |
| WebSocket connections per user | 5 | Redis counter checked at upgrade time |
| Simulation workers | Separate Celery worker pool from ingest workers | Prevents runaway simulations from starving TLE/space-weather ingestion |
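The ephemeris point-count guard can be computed before any propagation work is scheduled. A sketch; the exception type that the API layer maps to a 400 is an assumption:

```python
from datetime import datetime, timedelta

MAX_EPHEMERIS_POINTS = 100_000

def ephemeris_point_count(start: datetime, end: datetime, step: timedelta) -> int:
    """Apply the (end - start) / step cap from the governance table.

    Raises ValueError (mapped to HTTP 400 by the API layer) when the cap
    is exceeded or the arguments are degenerate.
    """
    if end <= start or step <= timedelta(0):
        raise ValueError("end must be after start and step must be positive")
    points = int((end - start) / step)  # fencepost: +1 if both endpoints are sampled
    if points > MAX_EPHEMERIS_POINTS:
        raise ValueError(
            f"{points} points exceeds the {MAX_EPHEMERIS_POINTS} cap; "
            "reduce the time range or increase the step"
        )
    return points
```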

Celery task-layer validation (Finding 3): Celery tasks are callable directly via Redis write (e.g., by a compromised worker), bypassing the API layer entirely. Every task function must validate its own arguments independently of the API endpoint:

from functools import wraps

from pydantic import ValidationError

def validate_task_args(validator_class):
    """Decorator: re-validate task kwargs using the same Pydantic model as the API endpoint."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                validator_class(**kwargs)
            except ValidationError as exc:
                raise ValueError(f"Task arg validation failed: {exc}") from exc
            return func(*args, **kwargs)
        return wrapper
    return decorator

@app.task(bind=True)
@validate_task_args(DecayPredictParams)
def run_mc_decay_prediction(self, *, norad_id: int, f107: float, ap: float, mc_samples: int, ...):
    ...

ValueError raised inside a Celery task is treated as a non-retryable failure — the task goes to the dead-letter queue and does not silently drop. This applies to all simulation and prediction tasks. Document in AGENTS.md: "Task functions are a security boundary. Validate all task arguments inside the task body."

Orphaned job recovery (Celery Beat task): A Celery worker killed mid-execution (OOM, pod eviction, container restart) leaves its job in status = 'running' indefinitely unless a cleanup task intervenes. Add a Celery Beat periodic task that runs every 5 minutes:

from datetime import datetime, timezone

from sqlalchemy import func, text

@app.task
def recover_orphaned_jobs():
    """Mark jobs stuck in 'running' beyond 2× their estimated duration as failed."""
    orphans = (
        db.query(Job)
        .filter(
            Job.status == "running",
            Job.started_at < func.now() - (
                func.coalesce(Job.estimated_duration_seconds, 600) * 2
            ) * text("interval '1 second'"),
        )
        .all()
    )
    for job in orphans:
        job.status = "failed"
        job.error_code = "PRESUMED_DEAD"
        job.error_message = "Worker did not complete within 2× estimated duration"
        job.completed_at = datetime.now(timezone.utc)
    db.commit()

Integration test (tests/test_jobs/test_celery_failure.py): set a job to status='running' with started_at = NOW() - 1200s and estimated_duration_seconds = 300; run the Beat task; assert status = 'failed' and error_code = 'PRESUMED_DEAD'.
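The orphan condition that the Beat task and the integration test share can be isolated as a pure predicate, which makes the 2× rule unit-testable without a database. A sketch; the helper name is an assumption:

```python
from datetime import datetime, timedelta

ORPHAN_FACTOR = 2
DEFAULT_ESTIMATE_S = 600  # fallback when estimated_duration_seconds is NULL

def is_orphaned(started_at: datetime, estimated_duration_seconds, now: datetime) -> bool:
    """True when a 'running' job has exceeded 2x its estimated duration."""
    estimate = estimated_duration_seconds or DEFAULT_ESTIMATE_S
    return (now - started_at) > timedelta(seconds=ORPHAN_FACTOR * estimate)
```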


7.13 Supply Chain and Dependency Security

Python dependency pinning:

All dependencies pinned with exact versions and hashes using pip-tools:

# requirements.in → pip-compile → requirements.txt with hashes
fastapi==0.111.0 --hash=sha256:...

Install with pip install --require-hashes -r requirements.txt in all Docker builds.

Node.js: package-lock.json committed and npm ci used in Docker builds (not npm install).

Base images: All FROM statements use pinned digest tags:

FROM python:3.12.3-slim@sha256:abc123...

Never FROM python:3.12-slim (floating tag).

PyPI index trust policy — dependency confusion protection:

All Python packages must be fetched from a controlled index, not directly from public PyPI without restrictions. Configure pip.conf mounted into all build containers:

# pip.conf (mounted at /etc/pip.conf in builder stage)
[global]
index-url = https://pypi.internal.spacecom.io/simple/
# Proxy mode: passes through to PyPI but logs and scans before serving
# extra-index-url is intentionally absent — no fallback to raw public PyPI

For Phase 1 (no internal proxy available): register all spacecom-* package names on public PyPI as empty stubs to prevent dependency confusion squatting. Document in docs/adr/0019-pypi-index-trust.md.

Automated scanning (CI pipeline):

| Tool | Target | Trigger | Notes |
| --- | --- | --- | --- |
| pip-audit | Python dependencies | Every PR; blocks on High/Critical | Queries the PyPA Advisory Database; lower false-positive rate than OWASP DC for Python |
| npm audit | Node.js dependencies | Every PR; blocks on High/Critical | --audit-level=high; run after npm ci |
| Trivy | Container images | Every PR; blocks on Critical/High | .trivyignore applied (see below); JSON output archived |
| Bandit | Python source code | Every PR; blocks on High severity | |
| ESLint security plugin | TypeScript source | Every PR | |
| pip-licenses | Python transitive deps | Every PR; blocks on GPL/AGPL | GPL/AGPL deny-list via --fail-on; CesiumJS (npm) is exempted in the npm step |
| license-checker-rseidelsohn | npm transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted (documented commercial licence); other AGPL packages require approval |
| Renovate Bot | Docker image digests + all deps | Weekly PRs; digest PRs auto-merged if CI passes | Replaces Dependabot for Docker digest pins; Dependabot retained for GitHub Security Advisory integration |
| git-secrets + detect-secrets | All commits | Pre-commit; blocks commit on secret patterns | detect-secrets is canonical (entropy + regex); git-secrets retained for pattern matching |
| cosign verify | Container images at deploy | Every staging/production deploy | Verifies Sigstore keyless signature before pulling |

OWASP Dependency-Check is removed from the Python scanning stack — it has high false-positive rates due to CPE name mapping issues for Python packages and is superseded by pip-audit. It may be retained for future Java/Kotlin components.

Trivy configuration — .trivyignore:

# .trivyignore
# Each entry requires: CVE ID, expiry date (90-day max), and documented justification.
# Process: PR required with senior engineer approval. Expired entries fail CI.
# Format: CVE-YYYY-NNNNN  expires:YYYY-MM-DD  reason:<one-line justification>
#
# Example (do not add without process):
# CVE-2024-12345  expires:2024-12-31  reason:builder-stage only; not present in runtime image

CI check rejects entries past their expiry date:

python scripts/check_trivyignore_expiry.py .trivyignore || \
  (echo "ERROR: .trivyignore contains expired entry — review or remove" && exit 1)
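A possible shape for scripts/check_trivyignore_expiry.py, matching the entry format documented in the file header. The plan does not specify the script itself, so treat this as a sketch:

```python
import re
from datetime import date

# Entry format: CVE-YYYY-NNNNN  expires:YYYY-MM-DD  reason:<one-line justification>
ENTRY = re.compile(r"^(CVE-\d{4}-\d+)\s+expires:(\d{4}-\d{2}-\d{2})\s+reason:\S")

def expired_entries(lines, today=None):
    """Return CVE IDs whose expiry date has passed; a non-empty result fails CI."""
    today = today or date.today()
    expired = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no suppression
        m = ENTRY.match(line)
        if m and date.fromisoformat(m.group(2)) < today:
            expired.append(m.group(1))
    return expired
```

The CI wrapper exits non-zero when the returned list is non-empty, producing the error shown above.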

License scanning CI steps:

# security-scan job
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Fail on GPL/AGPL (CesiumJS has commercial licence; excluded by name in npm step)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3)"

- name: npm licence gate
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL"

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4
  with:
    name: licences-${{ github.sha }}
    path: "*.json"
    retention-days: 365

Base image digest updates — Renovate configuration:

Dependabot does not update @sha256: digest pins in Dockerfiles. Renovate's docker-digest manager handles this:

// renovate.json
{
  "extends": ["config:base"],
  "packageRules": [
    {
      "matchDatasources": ["docker"],
      "matchUpdateTypes": ["digest"],
      "automerge": true,
      "automergeType": "pr",
      "schedule": ["every weekend"],
      "commitMessageSuffix": "(base image digest update)"
    },
    {
      "matchDatasources": ["pypi"],
      "automerge": false
    }
  ],
  "github-actions": {
    "enabled": true,
    "pinDigests": true
  }
}

Digest-only updates auto-merge on passing CI. Version bumps (e.g., python:3.12python:3.13) require manual PR review. Renovate is added alongside Dependabot; Dependabot retains GitHub Security Advisory integration for Python/Node CVE PRs.


7.14 Audit and Security Logging

Security event categories (stored in security_logs table and shipped to SIEM):

| Event | Level | Retention |
| --- | --- | --- |
| Successful login | INFO | 90 days |
| Failed login (IP + user) | WARNING | 180 days |
| MFA failure | WARNING | 180 days |
| Account lockout | HIGH | 180 days |
| Token refresh | INFO | 30 days |
| Authorisation failure (403) | WARNING | 180 days |
| Admin action (user create/delete/role change) | HIGH | 1 year |
| Prediction HMAC failure | CRITICAL | 2 years |
| Alert storm detection | CRITICAL | 2 years |
| IERS EOP hash mismatch | HIGH | 1 year |
| Report generated | INFO | 1 year |
| Ingest source error | WARNING | 90 days |

Security event human-alerting matrix (Finding 7): A Grafana dashboard no one is watching provides no protection during an active attack. The following events must trigger an immediate out-of-band alert to a human (PagerDuty, email, or Slack) — not only log to the database:

| Event type | Severity | Alert channel | Response SLA |
| --- | --- | --- | --- |
| HMAC_VERIFICATION_FAILURE | CRITICAL | PagerDuty + admin email | Immediate |
| REFRESH_TOKEN_REUSE | HIGH | Email to affected user + admin email | < 5 min |
| ROLE_CHANGE_APPROVED / ROLE_CHANGE_EXPIRED | HIGH | Admin email summary | < 15 min |
| REGISTRATION_BLOCKED_SANCTIONS | HIGH | Admin email | < 15 min |
| RBAC_VIOLATION ≥ 10 events in 5 min (same user_id) | HIGH | PagerDuty | Immediate |
| INGEST_VALIDATION_FAILURE ≥ 5 events in 1 hour (same source) | MEDIUM | Admin email | < 1 hour |
| Space-Track ingest gap > 4 hours | CRITICAL | PagerDuty (cross-ref §31) | Immediate |
| Any level = CRITICAL security event | CRITICAL | PagerDuty + SIEM | Immediate |

Implemented as AlertManager rules (Prometheus security_event_total counter with event_type label) and/or direct webhook dispatch from the security_logs insert trigger. Rules defined in monitoring/alertmanager/security-rules.yml.

Space-Track credential rotation — ingest gap specification (Finding 8): Space-Track supports only one active credential set; rotation is a hard cut with no parallel-credential window. The rotation runbook at docs/runbooks/space-track-credential-rotation.md must include: (a) record last successful ingest time before starting; (b) update Docker secret and restart ingest_worker; (c) verify ingest succeeds within 10 minutes of restart (GET /admin/ingest-status shows last_success_at for Space-Track source); (d) if ingest does not resume within 10 minutes, roll back to previous credentials and raise a CRITICAL alert. The existing 4-hour ingest failure CRITICAL alert (§31) is the backstop — this runbook step reduces mean time to detect to 10 minutes.

Structured log format — all services emit JSON via structlog. Every log record must include these fields:

# backend/app/logging_config.py
REQUIRED_LOG_FIELDS = {
    "timestamp":       "ISO-8601 UTC",
    "level":           "DEBUG|INFO|WARNING|ERROR|CRITICAL",
    "service":         "backend|worker|ingest|renderer",
    "logger":          "module.path",
    "message":         "human-readable summary",
    "request_id":      "UUID | null — set for HTTP requests; propagated into Celery tasks",
    "job_id":          "UUID | null — Celery job_id when inside a task",
    "user_id":         "integer | null",
    "organisation_id": "integer | null",
    "duration_ms":     "integer | null — HTTP response time",
    "status_code":     "integer | null — HTTP responses only",
}

The sanitising formatter wraps the structlog JSON processor (strips JWT substrings, Space-Track passwords, database DSNs before the record is written). Docker log driver: json-file with max-size=100m, max-file=5 for Tier 1; forwarded to Loki via Promtail in Tier 2+.

Log sanitisation: The structlog sanitising processor runs as the final processor in the chain before emission, stripping known sensitive patterns (JWT token substrings, Space-Track password patterns, database DSN with credentials).
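A sketch of such a processor; the two regexes are illustrative (real JWTs and DSNs have more variants than shown), and the processor signature follows the standard structlog processor convention:

```python
import re

SENSITIVE_PATTERNS = [
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped token
    re.compile(r"postgres(?:ql)?://[^:\s]+:[^@\s]+@"),                  # DSN with password
]

def sanitise_event(logger, method_name, event_dict):
    """Final structlog processor: redact sensitive substrings in string values."""
    for key, value in list(event_dict.items()):
        if isinstance(value, str):
            for pattern in SENSITIVE_PATTERNS:
                value = pattern.sub("[REDACTED]", value)
            event_dict[key] = value
    return event_dict
```

Registered as the last processor before the JSON renderer, so every emitted record passes through it.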

Log integrity: Logs are shipped in real-time to an external destination (Loki in Tier 2; S3/MinIO append-only bucket or SIEM for long-term safety record retention). Logs stored only on the container filesystem are considered volatile and untrusted for security purposes.

Request ID correlation middleware — every HTTP request generates a request_id that propagates through logs, Celery tasks, and Prometheus exemplars so an on-call engineer can jump from a metric spike to the causative log line with one click:

# backend/app/middleware.py
import uuid
import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Honour an upstream-supplied ID; otherwise mint one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        structlog.contextvars.bind_contextvars(request_id=request_id)
        try:
            response = await call_next(request)
            response.headers["X-Request-ID"] = request_id
            return response
        finally:
            # Clear even if the handler raises, so the ID cannot leak into
            # log lines emitted for a subsequent request
            structlog.contextvars.clear_contextvars()

When submitting a Celery task, include request_id in task kwargs and bind it in the task preamble:

structlog.contextvars.bind_contextvars(request_id=kwargs.get("request_id"), job_id=str(self.request.id))

This links every log line from the HTTP layer through to the Celery task execution. The request_id equals the OpenTelemetry trace_id when OTel is enabled (Phase 2), giving a single correlation key across logs and traces.
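The propagation pattern can be sketched with stdlib contextvars standing in for structlog.contextvars; the function names are illustrative, and submit_task stands in for task.delay:

```python
import contextvars

# Correlation variable — structlog.contextvars wraps the same mechanism
request_id_var = contextvars.ContextVar("request_id", default=None)

def submit_task(task_fn, **kwargs):
    """HTTP layer: inject the current request_id into the task kwargs."""
    kwargs.setdefault("request_id", request_id_var.get())
    return task_fn(**kwargs)  # stand-in for task.delay(**kwargs)

def example_task(*, request_id=None, norad_id=None):
    """Task preamble: re-bind the propagated request_id in the worker context."""
    request_id_var.set(request_id)
    return {"request_id": request_id_var.get(), "norad_id": norad_id}
```

Every log line the task emits after the preamble then carries the originating HTTP request's correlation key.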

security_logs table:

CREATE TABLE security_logs (
  id BIGSERIAL PRIMARY KEY,
  logged_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL,
  event_type TEXT NOT NULL,
  user_id INTEGER,
  organisation_id INTEGER,
  source_ip INET,
  user_agent TEXT,
  resource TEXT,
  detail JSONB,
  -- Prevent tampering
  record_hash TEXT    -- SHA-256 of (logged_at || level || event_type || detail)
);
-- Append-only trigger (same pattern as alert_events)
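The record_hash computation might look like this; the "||" separator and sorted-key JSON canonicalisation are assumptions about the canonical form, not the production definition:

```python
import hashlib
import json

def security_log_record_hash(logged_at: str, level: str, event_type: str, detail: dict) -> str:
    """SHA-256 over the tamper-evident fields of a security_logs row.

    Assumes an ISO-8601 timestamp string and sorted-key JSON for `detail`
    so the hash is reproducible at verification time.
    """
    canonical = "||".join([logged_at, level, event_type, json.dumps(detail, sort_keys=True)])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A nightly job recomputing this over each row and comparing to the stored value detects any tampering that slipped past the append-only trigger.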

7.15 Security SDLC — Embedded, Not Bolted On

Security activities are integrated into every sprint from Week 1, not deferred to a Phase 3 audit.

Week 1 (mandatory before any other code):

  • RBAC schema implemented; require_role dependency applied to all router groups
  • JWT RS256 + httpOnly cookies implemented; HS256 never used
  • MFA (TOTP) implemented and required for all roles
  • CSP and security headers applied to frontend and backend
  • Docker network segmentation and container hardening applied to all services
  • Redis AUTH and ACL configured
  • MinIO: all buckets private; pre-signed URLs only
  • Dependency pinning (pip-compile) and Dependabot configured
  • git-secrets pre-commit hook installed in repo
  • Bandit and ESLint security plugin in CI; blocks merge on High severity
  • Trivy container scanning in CI; blocks merge on Critical/High
  • security_logs table and log sanitisation formatter implemented
  • Append-only DB triggers on alert_events

Phase 1 (ongoing):

  • HMAC signing implemented for reentry_predictions before decay predictor ships (Week 9)
  • Immutability triggers on reentry_predictions and hazard_zones
  • Cross-source TLE and space weather validation implemented with ingest module (Week 3-6)
  • IERS EOP hash verification implemented (Week 1)
  • Rate limiting (slowapi) configured for all endpoint groups (Week 2)
  • Simulation parameter range validation (Week 9, with decay predictor)

Phase 2:

  • OWASP ZAP DAST scan run against staging environment in the Phase 2 CI pipeline
  • Threat model document (STRIDE) reviewed and updated for Phase 2 attack surface
  • Playwright renderer: isolated container, sanitised input, timeouts, seccomp profile, Playwright request interception allowlist (Week 19-20, when reports ship)
  • NOTAM draft content sanitisation: sanitise_icao() function in reentry/notam.py applied to all user-sourced fields before NOTAM template interpolation; unit test: object name containing "><script>alert(1)</script>" produces a sanitised NOTAM draft and does not raise (Week 17-18, with NOTAM drafting feature)
  • Shadow mode RLS integration test: query reentry_predictions as viewer role with no WHERE clause; assert zero shadow rows returned
  • Refresh token family reuse detection integration test: simulate attacker consuming a rotated token; assert entire family revoked + REFRESH_TOKEN_REUSE logged
  • RLS policies reviewed and integration-tested for multi-tenancy boundary

Phase 3:

  • External penetration test by a qualified third party — scope must include: API auth bypass, privilege escalation, SSRF via ingest, XSS → Playwright escalation, WebSocket auth bypass, data integrity attacks on predictions, Redis/MinIO lateral movement
  • All Critical and High penetration test findings remediated before production go-live
  • SOC 2 Type I readiness review (if required by customer contracts)
  • Acceptance Test Procedure (ATP) defined and run (Finding 10): docs/bid/acceptance-test-procedure.md exists with test script structured as: test ID, requirement reference, preconditions, steps, expected result, pass/fail criteria. ATP is runnable by a non-SpaceCom operator (evaluator) using documented environment setup. ATP covers: physics accuracy (§17 validation), NOTAM format (Q-line regex test), alert delivery latency (synthetic TIP → measure delivery time), HMAC integrity (tampered record → 503), multi-tenancy boundary (Org A cannot access Org B data). ATP seed data committed at docs/bid/atp-seed-data/. ATP successfully run by an independent evaluator on the staging environment before any institutional procurement submission.
  • Competitive differentiation review completed: docs/competitive-analysis.md updated; any competitor capability that closed a differentiation gap has been assessed and a product response documented
  • Security runbook: incident response procedure for each CRITICAL threat scenario

7.16 Aviation Safety Integrity — Operational Scenarios

Scenario 1 — False all-clear attack:

An attacker who modifies reentry_predictions records to suppress a genuine hazard corridor could cause an airspace manager to conclude a FIR is safe when it is not.

Mitigations layered in depth:

  1. HMAC signing on every prediction record (§7.9) — modification is immediately detected
  2. Immutability DB trigger (§7.9) — modifications fail at the database layer
  3. TIP message cross-check: a prediction showing no hazard for an object with an active TIP message triggers a CRITICAL integrity alert regardless of the prediction's content
  4. The UI displays HMAC status on every prediction — ✗ verification failed is immediately visible to the operator
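Mitigation 3 reduces to a guard clause; the names here are illustrative:

```python
def tip_cross_check(prediction_shows_hazard: bool, has_active_tip: bool):
    """Integrity cross-check: a prediction showing no hazard for an object
    with an active TIP message is itself a CRITICAL integrity signal,
    regardless of the prediction's own content."""
    if has_active_tip and not prediction_shows_hazard:
        return "CRITICAL_INTEGRITY_ALERT"
    return None  # consistent: no independent evidence contradicts the prediction
```

Because the TIP feed is an independent source, this check fires even if an attacker forges a valid HMAC on a suppressed prediction.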

Scenario 2 — Alert storm attack:

An attacker flooding the alert system with false CRITICALs induces alert fatigue; operators disable alerts; a genuine event is missed.

Mitigations:

  1. Alert generation runs only from backend business logic on verified, HMAC-checked data — not from direct API calls
  2. Rate limiting on CRITICAL alert generation per object per window (§6.6)
  3. Alert storm detection: > 5 CRITICALs in 1 hour triggers a meta-alert to admins
  4. Geographic filtering means alert volume per operator is naturally bounded to their region
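Mitigation 3 can be sketched as a sliding-window count; the threshold and window mirror the values above, and the function name is illustrative:

```python
from datetime import datetime, timedelta

def alert_storm_detected(critical_times, now, threshold=5, window=timedelta(hours=1)):
    """More than `threshold` CRITICAL alerts inside `window` triggers a
    meta-alert to admins instead of flooding operators."""
    recent = [t for t in critical_times if timedelta(0) <= now - t <= window]
    return len(recent) > threshold
```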

8. Functional Modules

Each module is a Python package under backend/modules/ with its own router, schemas, service layer, and (where applicable) Celery tasks. Modules communicate via internal function calls and the shared database — not HTTP between modules.

Phase 1 Modules

| Module | Package | Purpose |
|---|---|---|
| Catalog | modules.catalog | CRUD for space objects: NORAD ID, TLE sets, physical properties (from ESA DISCOS), B* drag term, radar cross-section. Source of truth for all tracked objects. |
| Catalog Propagator | modules.propagator.catalog | SGP4/SDP4 for general catalog tracking. Outputs GCRF state vectors and geodetic coordinates. Feeds the globe display. Not used for decay prediction. |
| Decay Predictor | modules.propagator.decay | Numerical integrator (RK7(8) adaptive step) with NRLMSISE-00 atmospheric density model, J2-J6 geopotential, and solar radiation pressure. Used for all re-entry window estimation. Monte Carlo uncertainty (vary F10.7 ±20%, Ap, B* ±10%). All outputs HMAC-signed on creation. Shadow mode flag propagated to all output records. |
| Reentry | modules.reentry | Phase 1 scope: re-entry window prediction (time ± uncertainty) and ground track corridor (percentile swaths). Phase 2 expands to full breakup/survivability. |
| Space Weather | modules.spaceweather | Ingests NOAA SWPC: F10.7, Ap/Kp, Dst, solar wind. Cross-validates against ESA Space Weather Service. Generates operational_status string. Drives Decay Predictor density models. |
| Visualisation | modules.viz | Generates CZML documents from ephemeris (J2000 Cartesian — explicit TEME→J2000 conversion), hazard zones, and debris corridors. Pre-bakes MC trajectory binary blobs for Mode C. All object name/description fields HTML-escaped before CZML output. |
| Ingest | modules.ingest | Background workers: Space-Track.org TLE polling, CelesTrak TLE polling, TIP message ingestion, ESA DISCOS physical property import, NOAA SWPC space weather polling, IERS EOP refresh. All external URLs are hardcoded constants; SSRF mitigation enforced at HTTP client layer. |
| Public API | modules.api | Versioned REST API (/api/v1/) as a first-class product for programmatic access by Persona E/F. Includes API key management (generation, rotation, revocation, usage tracking), CCSDS-format export endpoints, bulk ephemeris endpoints, and rate limiting per API key. API keys are separate credentials from the web session JWT and managed independently. |

Phase 2 Modules

| Module | Package | Purpose |
|---|---|---|
| Atmospheric Breakup | modules.breakup | ORSAT-like atmospheric re-entry breakup: aerothermal loading → structural failure → fragment generation → ballistic descent → ground impact with kinetic energy and casualty area. Produces fragment descriptors and uncertainty bounds for the sub-/trans-sonic descent layer. |
| Conjunction | modules.conjunction | All-vs-all conjunction screening: apogee/perigee filter → TCA refinement → collision probability (Alfano/Foster). Feeds conjunctions table. |
| Upper Atmosphere | modules.weather.upper | NRLMSISE-00 / JB2008 density model driven by space weather inputs. 80-600 km profiles for Decay Predictor and Atmospheric Breakup. |
| Lower Atmosphere | modules.weather.lower | GFS/ECMWF tropospheric wind and density profiles for 0-80 km terminal descent, including wind-sensitive dispersion inputs for fragment clouds after main breakup. |
| Hazard | modules.hazard | Fuses Decay Predictor + Atmospheric Breakup + atmosphere modules into hazard zones with uncertainty bounds. All output records HMAC-signed and immutable. Shadow mode flag preserved on all hazard zone records. |
| Airspace | modules.airspace | FIR/UIR boundaries, controlled airspace, routes. PostGIS hazard-airspace intersection. |
| Air Risk | modules.air_risk | Combines hazard outputs with air traffic density / ADS-B state, aircraft class assumptions, and vulnerability bands to generate time-sliced exposure scores and operator-facing air-risk products. Supports conservative-baseline comparison against blunt closure areas. |
| On-Orbit Fragmentation | modules.fragmentation | NASA Standard Breakup Model for on-orbit collision/explosion fragmentation. Separate from atmospheric breakup — different physics. |
| Space Operator Portal | modules.space_portal | The second front door. Owned object management (owned_objects table); object-scoped prediction views; CCSDS export; API key portal; controlled re-entry planner interface. Enforces space_operator RBAC object-ownership scoping. |
| Controlled Re-entry Planner | modules.reentry.controlled | For objects with remaining manoeuvre capability: given a delta-V budget and avoidance constraints (FIR exclusions, land avoidance, population density weighting), generates ranked candidate deorbit windows with corridor risk scores. Outputs suitable for national space law regulatory submissions and ESA Zero Debris Charter evidence. |
| NOTAM Drafting | modules.notam | Generates ICAO Annex 15 format NOTAM drafts from hazard corridor outputs. Produces cancellation drafts on event close. Stores all drafts in notam_drafts table. Displays mandatory regulatory disclaimer. Never submits NOTAMs — draft production only. |

Phase 3 Modules

| Module | Package | Purpose |
|---|---|---|
| Reroute | modules.reroute | Strategic pre-flight route intersection analysis only. Given a filed route, identifies which segments intersect the hazard corridor and outputs the geographic avoidance boundary. Does not generate specific alternate routes — avoidance boundary only, to keep SpaceCom in a purely informational role. |
| Feedback | modules.feedback | Prediction vs. observed outcome comparison. Atmospheric density scaling recalibration from historical re-entries. Maneuver detection (TLE-to-TLE ΔV estimation). Shadow validation reporting for ANSP regulatory adoption evidence. |
| Alerts | modules.alerts | WebSocket push + email notifications. Enforces alert rate limits and deduplication server-side. Stores all events in append-only alert_events. Shadow mode: all alerts suppressed to INFORMATIONAL; no external delivery. |
| Launch Safety | modules.launch_safety | Screen proposed launch trajectories against the live catalog for conjunction risk during ascent and parking orbit phases. Natural extension of the conjunction module. Serves launch operators as a third customer segment. |

9. Data Model Evolution

9.1 Retain and Expand from Existing Schema

objects table

ALTER TABLE objects ADD COLUMN IF NOT EXISTS
  bstar DOUBLE PRECISION,              -- SGP4 drag parameter (1/Earth-radii)
  cd_a_over_m DOUBLE PRECISION,        -- C_D * A / m (m²/kg); physical model
  rcs_m2 DOUBLE PRECISION,             -- Radar cross-section from Space-Track
  rcs_size_class TEXT,                 -- SMALL | MEDIUM | LARGE
  mass_kg DOUBLE PRECISION,
  cross_section_m2 DOUBLE PRECISION,
  material TEXT,
  shape TEXT,
  data_confidence TEXT DEFAULT 'unknown',  -- 'discos' | 'estimated' | 'unknown'
  object_type TEXT,                    -- PAYLOAD | ROCKET BODY | DEBRIS | UNKNOWN
  launch_date DATE,
  launch_site TEXT,
  decay_date DATE,
  organisation_id INTEGER REFERENCES organisations(id),  -- multi-tenancy
  -- Physics model parameters (Finding 3, 5, 7)
  attitude_known BOOLEAN DEFAULT FALSE,    -- FALSE = tumbling; affects A uncertainty sampling
  material_class TEXT,                     -- 'aluminium'|'stainless_steel'|'titanium'|'carbon_composite'|'unknown'
  cd_override DOUBLE PRECISION,            -- operator-provided C_D override (space_operator only)
  bstar_override DOUBLE PRECISION,         -- operator-provided B* override (space_operator only)
  cr_coefficient DOUBLE PRECISION DEFAULT 1.3  -- radiation pressure coefficient; 1.3 = standard non-cooperative

orbits table — full state vectors

ALTER TABLE orbits ADD COLUMN IF NOT EXISTS
  reference_frame TEXT DEFAULT 'GCRF',
  pos_x_km DOUBLE PRECISION,
  pos_y_km DOUBLE PRECISION,
  pos_z_km DOUBLE PRECISION,
  vel_x_kms DOUBLE PRECISION,
  vel_y_kms DOUBLE PRECISION,
  vel_z_kms DOUBLE PRECISION,
  lat_deg DOUBLE PRECISION,
  lon_deg DOUBLE PRECISION,
  alt_km DOUBLE PRECISION,
  speed_kms DOUBLE PRECISION,
  -- RTN position covariance (upper triangle of 3×3)
  cov_rr DOUBLE PRECISION,
  cov_rt DOUBLE PRECISION,
  cov_rn DOUBLE PRECISION,
  cov_tt DOUBLE PRECISION,
  cov_tn DOUBLE PRECISION,
  cov_nn DOUBLE PRECISION,
  propagator TEXT DEFAULT 'sgp4',
  tle_epoch TIMESTAMPTZ

conjunctions table

ALTER TABLE conjunctions ADD COLUMN IF NOT EXISTS
  collision_probability DOUBLE PRECISION,
  probability_method TEXT,
  combined_radial_sigma_m DOUBLE PRECISION,
  combined_transverse_sigma_m DOUBLE PRECISION,
  combined_normal_sigma_m DOUBLE PRECISION

reentry_predictions table

ALTER TABLE reentry_predictions ADD COLUMN IF NOT EXISTS
  confidence_level DOUBLE PRECISION,
  propagator TEXT,
  f107_assumed DOUBLE PRECISION,
  ap_assumed DOUBLE PRECISION,
  monte_carlo_n INTEGER,
  ground_track_corridor GEOGRAPHY(POLYGON),  -- GEOGRAPHY: global corridors may cross antimeridian
  reentry_window_open TIMESTAMPTZ,
  reentry_window_close TIMESTAMPTZ,
  nominal_reentry_point GEOGRAPHY(POINT),    -- GEOGRAPHY: global point
  nominal_reentry_alt_km DOUBLE PRECISION DEFAULT 80.0,
  p01_reentry_time TIMESTAMPTZ,  -- 1st percentile — extreme early case; displayed as tail risk annotation (F10)
  p05_reentry_time TIMESTAMPTZ,
  p50_reentry_time TIMESTAMPTZ,
  p95_reentry_time TIMESTAMPTZ,
  p99_reentry_time TIMESTAMPTZ,  -- 99th percentile — extreme late case; displayed as tail risk annotation (F10)
  sigma_along_track_km DOUBLE PRECISION,
  sigma_cross_track_km DOUBLE PRECISION,
  organisation_id INTEGER REFERENCES organisations(id),
  record_hmac TEXT NOT NULL,           -- HMAC-SHA256 of canonical field set
  integrity_failed BOOLEAN DEFAULT FALSE,
  superseded_by INTEGER REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- write-once; RESTRICT prevents deleting a prediction that supersedes another (F10 — §67)
  ood_flag BOOLEAN DEFAULT FALSE,              -- TRUE if any input parameter falls outside the model's validated operating envelope
  ood_reason TEXT,                             -- comma-separated list of which parameters triggered OOD (e.g. "high_am_ratio,low_data_confidence")
  prediction_valid_until TIMESTAMPTZ,          -- computed at creation: p50_reentry_time - 4h; UI warns if NOW() > this and prediction is not superseded
  model_version TEXT NOT NULL,                 -- semantic version of decay predictor used; must match current deployed version or trigger re-run prompt
  -- Multi-source conflict detection (Finding 10)
  prediction_conflict BOOLEAN DEFAULT FALSE,   -- TRUE if SpaceCom window does not overlap TIP or ESA window
  conflict_sources TEXT[],                     -- e.g. ['space_track_tip', 'esa_esac']
  conflict_union_p10 TIMESTAMPTZ,              -- union of all non-overlapping windows: earliest bound
  conflict_union_p90 TIMESTAMPTZ               -- union of all non-overlapping windows: latest bound
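The conflict test behind prediction_conflict can be sketched as a window-overlap check; representing each window as an (open, close) tuple is an assumption of this sketch:

```python
from datetime import datetime

def window_conflict(spacecom_window, external_window):
    """Finding 10: two (open, close) windows conflict when they do not overlap.
    The union bounds are what conflict_union_p10/p90 would store."""
    a_open, a_close = spacecom_window
    b_open, b_close = external_window
    overlaps = a_open <= b_close and b_open <= a_close
    union = (min(a_open, b_open), max(a_close, b_close))
    return (not overlaps), union
```

On conflict, the UI presents the union bounds rather than either source's narrower window, which is the conservative choice for operators.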

superseded_by is write-once after creation: it can be set once by an analyst or above, but never changed once set. A DB constraint enforces this (trigger that raises if superseded_by is being changed from a non-NULL value). The UI displays a ⚠ Superseded — see [newer run] banner on any prediction where superseded_by IS NOT NULL. This preserves the immutability guarantee (old records are never deleted) while giving analysts a mechanism to communicate "this is not the current operational view."

The same superseded_by pattern applies to the simulations table (self-referential FK).

Immutability trigger (see §7.9) applied to this table in the initial migration.

9.2 New Tables

-- Organisations (for multi-tenancy)
CREATE TABLE organisations (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Commercial tier (Finding 3, 5)
  subscription_tier TEXT NOT NULL DEFAULT 'shadow_trial'
    CHECK (subscription_tier IN ('shadow_trial','ansp_operational','space_operator','institutional','internal')),
  subscription_status TEXT NOT NULL DEFAULT 'active'
    CHECK (subscription_status IN ('active','offered','offered_lapsed','churned','suspended')),
  subscription_started_at TIMESTAMPTZ,
  subscription_expires_at TIMESTAMPTZ,
  -- Shadow trial gate (F3 - §68): expiry normally auto-deactivates shadow mode, but enforcement is deferred while an active TIP / CRITICAL operational event exists
  shadow_trial_expires_at TIMESTAMPTZ,          -- NULL = no trial expiry (paid or internal); set on sandbox agreement signing
  -- Resource quotas (F8 — §68): 0 = unlimited (paid tiers); >0 = monthly cap
  monthly_mc_run_quota INTEGER NOT NULL DEFAULT 100  -- 100 for free/shadow_trial; 0 = unlimited for paid; deferred during active TIP/CRITICAL event
    CHECK (monthly_mc_run_quota >= 0),
  -- Feature flags (F11 — §68): Enterprise-only features gated here
  feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,  -- Enterprise only
  -- On-premise licence (F6 — §68)
  licence_key TEXT,                             -- JWT signed by SpaceCom; checked at startup for on-premise deployments
  licence_expires_at TIMESTAMPTZ,               -- derived from licence_key; stored for query efficiency
  -- Data residency (Finding 8)
  hosting_jurisdiction TEXT NOT NULL DEFAULT 'eu'
    CHECK (hosting_jurisdiction IN ('eu','uk','au','us','on_premise')),
  data_residency_confirmed BOOLEAN DEFAULT FALSE  -- DPA clause confirmed for this org
);
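The shadow trial gate described in the shadow_trial_expires_at comment can be sketched as follows; function and parameter names are illustrative:

```python
from datetime import datetime

def shadow_trial_access_allowed(shadow_trial_expires_at, now, active_critical_event=False):
    """F3 gate: expiry normally deactivates shadow mode, but enforcement is
    deferred while an active TIP / CRITICAL operational event exists."""
    if shadow_trial_expires_at is None:
        return True   # NULL = no trial expiry (paid or internal tier)
    if now <= shadow_trial_expires_at:
        return True   # trial still running
    return active_critical_event  # expired: allowed only while a live event is active
```

The deferral clause ensures a trial never cuts an operator off in the middle of a genuine re-entry event.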

-- Users
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  email TEXT NOT NULL UNIQUE,
  password_hash TEXT NOT NULL,           -- bcrypt, cost factor >= 12
  role TEXT NOT NULL DEFAULT 'viewer'
    CHECK (role IN ('viewer','analyst','operator','org_admin','admin','space_operator','orbital_analyst')),
  mfa_secret TEXT,                       -- TOTP secret (encrypted at rest)
  mfa_recovery_codes TEXT[],             -- bcrypt hashes of recovery codes
  mfa_enabled BOOLEAN DEFAULT FALSE,
  failed_mfa_attempts INTEGER DEFAULT 0,
  locked_until TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_login_at TIMESTAMPTZ,
  tos_accepted_at TIMESTAMPTZ,          -- NULL = ToS not yet accepted; access blocked until set
  tos_version TEXT,                     -- semver of ToS accepted (e.g. "1.2.0")
  tos_accepted_ip INET,                 -- IP address at time of acceptance (GDPR consent evidence)
  data_source_acknowledgement BOOLEAN DEFAULT FALSE, -- must be TRUE before API key access
  altitude_unit_preference TEXT NOT NULL DEFAULT 'ft'
    CHECK (altitude_unit_preference IN ('m', 'ft', 'km'))
    -- 'ft' default for ansp_operator; 'km' default for space_operator (set at account creation based on role)
);

-- Refresh tokens (server-side revocation)
CREATE TABLE refresh_tokens (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
  token_hash TEXT NOT NULL UNIQUE,        -- SHA-256 of the raw token
  family_id UUID NOT NULL,               -- All tokens from the same initial issuance share a family_id
  issued_at TIMESTAMPTZ DEFAULT NOW(),
  expires_at TIMESTAMPTZ NOT NULL,
  revoked_at TIMESTAMPTZ,                -- NULL = valid
  superseded_at TIMESTAMPTZ,             -- Set when this token is rotated out (newer token in family exists)
  replaced_by UUID REFERENCES refresh_tokens(id),  -- for rotation chain audit
  source_ip INET,
  user_agent TEXT
);
CREATE INDEX ON refresh_tokens (user_id, revoked_at);
CREATE INDEX ON refresh_tokens (family_id);  -- for family revocation on reuse detection
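Family-based reuse detection, sketched with an in-memory store; the production version operates on the refresh_tokens table above, and all names here are illustrative:

```python
import uuid

class TokenStore:
    """Sketch of rotation with family revocation: presenting a token that was
    already superseded means it was stolen, so the whole family is revoked."""

    def __init__(self):
        self.tokens = {}  # token_hash -> {"family_id", "superseded", "revoked"}

    def issue(self, family_id=None):
        family_id = family_id or str(uuid.uuid4())
        token_hash = str(uuid.uuid4())  # stand-in for SHA-256 of the raw token
        self.tokens[token_hash] = {"family_id": family_id,
                                   "superseded": False, "revoked": False}
        return token_hash, family_id

    def rotate(self, presented):
        rec = self.tokens[presented]
        if rec["revoked"]:
            raise PermissionError("token revoked")
        if rec["superseded"]:
            # Reuse detected: revoke every token in the family
            # (production also logs a REFRESH_TOKEN_REUSE security event)
            for t in self.tokens.values():
                if t["family_id"] == rec["family_id"]:
                    t["revoked"] = True
            raise PermissionError("REFRESH_TOKEN_REUSE")
        rec["superseded"] = True
        return self.issue(rec["family_id"])
```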

-- Security event log (append-only)
CREATE TABLE security_logs (
  id BIGSERIAL PRIMARY KEY,
  logged_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL,
  event_type TEXT NOT NULL,
  user_id INTEGER,
  organisation_id INTEGER,
  source_ip INET,
  user_agent TEXT,
  resource TEXT,
  detail JSONB,
  record_hash TEXT  -- SHA-256(logged_at || level || event_type || detail) for tamper detection
);
CREATE TRIGGER security_logs_immutable
  BEFORE UPDATE OR DELETE ON security_logs
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- TLE history (hypertable)
-- No surrogate PK: TimescaleDB requires any UNIQUE/PK constraint to include the partition column.
-- Natural unique key is (object_id, ingested_at). Reference TLE records by this composite key.
CREATE TABLE tle_sets (
  object_id INTEGER REFERENCES objects(id),
  epoch TIMESTAMPTZ NOT NULL,
  line1 TEXT NOT NULL,
  line2 TEXT NOT NULL,
  source TEXT NOT NULL,
  ingested_at TIMESTAMPTZ DEFAULT NOW(),
  inclination_deg DOUBLE PRECISION,
  raan_deg DOUBLE PRECISION,
  eccentricity DOUBLE PRECISION,
  arg_perigee_deg DOUBLE PRECISION,
  mean_anomaly_deg DOUBLE PRECISION,
  mean_motion_rev_per_day DOUBLE PRECISION,
  bstar DOUBLE PRECISION,
  apogee_km DOUBLE PRECISION,
  perigee_km DOUBLE PRECISION,
  cross_validated BOOLEAN DEFAULT FALSE,  -- TRUE if confirmed by second source
  cross_validation_delta_sma_km DOUBLE PRECISION,  -- SMA difference between sources
  UNIQUE (object_id, ingested_at)         -- natural key; safe for TimescaleDB (includes partition col)
);
SELECT create_hypertable('tle_sets', 'ingested_at');
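The cross_validation_delta_sma_km value can be derived from the two sources' mean motions via Kepler's third law. This sketch ignores the Brouwer/Kozai mean-element subtlety of TLE mean motion, which is acceptable for a coarse cross-source consistency check:

```python
import math

MU_EARTH_KM3_S2 = 398600.4418  # Earth's gravitational parameter (km^3/s^2)

def sma_from_mean_motion(mean_motion_rev_per_day):
    """Semi-major axis (km) from mean motion: a = (mu / n^2)^(1/3)."""
    n_rad_s = mean_motion_rev_per_day * 2.0 * math.pi / 86400.0
    return (MU_EARTH_KM3_S2 / n_rad_s ** 2) ** (1.0 / 3.0)

def cross_validation_delta_sma_km(mm_primary, mm_secondary):
    """SMA disagreement between two TLE sources for the same epoch."""
    return abs(sma_from_mean_motion(mm_primary) - sma_from_mean_motion(mm_secondary))
```

A delta above a few kilometres for near-simultaneous epochs would flag the TLE pair for review rather than setting cross_validated.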

-- Space weather (hypertable)
CREATE TABLE space_weather (
  time TIMESTAMPTZ NOT NULL,
  f107_obs DOUBLE PRECISION,             -- observed F10.7 (current day)
  f107_prior_day DOUBLE PRECISION,       -- prior-day F10.7 (NRLMSISE-00 f107 input)
  f107_81day_avg DOUBLE PRECISION,       -- 81-day centred average (NRLMSISE-00 f107A input)
  ap_daily INTEGER,                      -- daily Ap index (linear; NOT Kp)
  ap_3h_history DOUBLE PRECISION[19],    -- 3-hourly Ap values for prior 57h (NRLMSISE-00 full mode)
  kp_3hourly DOUBLE PRECISION[],         -- 3-hourly Kp (for storm detection; Kp > 5 triggers storm flag)
  dst_index INTEGER,
  uncertainty_multiplier DOUBLE PRECISION,
  operational_status TEXT,
  source TEXT DEFAULT 'noaa_swpc',
  secondary_source TEXT,                 -- ESA SWS cross-validation value
  cross_validation_delta_f107 DOUBLE PRECISION  -- difference between sources
);
SELECT create_hypertable('space_weather', 'time');

-- TIP messages
CREATE TABLE tip_messages (
  id BIGSERIAL PRIMARY KEY,
  object_id INTEGER REFERENCES objects(id),
  norad_id INTEGER NOT NULL,
  message_time TIMESTAMPTZ NOT NULL,
  message_number INTEGER,
  reentry_window_open TIMESTAMPTZ,
  reentry_window_close TIMESTAMPTZ,
  predicted_region TEXT,
  source TEXT DEFAULT 'usspacecom',
  raw_message TEXT
);

-- Alert events (append-only)
CREATE TABLE alert_events (
  id BIGSERIAL PRIMARY KEY,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  level TEXT NOT NULL
    CHECK (level IN ('INFO','WARNING','CRITICAL')),
  trigger_type TEXT NOT NULL,
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  message TEXT NOT NULL,
  acknowledged_at TIMESTAMPTZ,
  acknowledged_by INTEGER REFERENCES users(id) ON DELETE SET NULL,  -- SET NULL on GDPR erasure; log entry preserved
  acknowledgement_note TEXT,
  delivered_websocket BOOLEAN DEFAULT FALSE,
  delivered_email BOOLEAN DEFAULT FALSE,
  fir_intersection_km2 DOUBLE PRECISION,       -- area of FIR polygon intersected by the triggering corridor (km²); NULL for non-spatial alerts
  intersection_percentile TEXT
    CHECK (intersection_percentile IN ('p50','p95')),  -- which corridor percentile triggered the alert
  prediction_id BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT,  -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
  record_hmac TEXT NOT NULL DEFAULT ''  -- HMAC-SHA256 of safety-critical fields; signed at insert; verified nightly (F9)
);
CREATE TRIGGER alert_events_immutable
  BEFORE UPDATE OR DELETE ON alert_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Simulations
CREATE TABLE simulations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  module TEXT NOT NULL,
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  params_json JSONB NOT NULL,
  started_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending','running','complete','failed','cancelled')),
  result_uri TEXT,
  model_version TEXT,
  celery_task_id TEXT,
  error_detail TEXT,
  created_by INTEGER REFERENCES users(id)
);

-- Reports
CREATE TABLE reports (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  simulation_id UUID REFERENCES simulations(id),
  object_id INTEGER REFERENCES objects(id),
  organisation_id INTEGER REFERENCES organisations(id),
  report_type TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  storage_uri TEXT NOT NULL,
  params_json JSONB,
  report_number TEXT
);

-- Prediction outcomes (algorithmic accountability — links predictions to observed re-entry events)
CREATE TABLE prediction_outcomes (
  id SERIAL PRIMARY KEY,
  prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id) ON DELETE RESTRICT,  -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
  norad_id INTEGER NOT NULL,
  observed_reentry_time TIMESTAMPTZ,           -- actual re-entry time from post-event analysis (The Aerospace Corporation, US18SCS, etc.)
  observed_reentry_source TEXT,                -- 'aerospace_corp' | 'us18scs' | 'esa_esoc' | 'manual'
  p50_error_minutes DOUBLE PRECISION,          -- predicted p50 minus observed (+ = predicted late, - = predicted early)
  corridor_contains_observed BOOLEAN,          -- TRUE if observed impact point fell within p95 corridor
  fir_false_positive BOOLEAN,                  -- TRUE if a CRITICAL alert fired but no observable debris reached the affected FIR
  fir_false_negative BOOLEAN,                  -- TRUE if observable debris reached a FIR but no CRITICAL alert was generated
  ood_flag_at_prediction BOOLEAN,              -- snapshot of ood_flag from the prediction record at prediction time
  notes TEXT,
  recorded_at TIMESTAMPTZ DEFAULT NOW(),
  recorded_by INTEGER REFERENCES users(id)     -- analyst who logged the outcome
);

-- Hazard zones
CREATE TABLE hazard_zones (
  id BIGSERIAL PRIMARY KEY,
  simulation_id UUID REFERENCES simulations(id),
  organisation_id INTEGER REFERENCES organisations(id),
  valid_from TIMESTAMPTZ NOT NULL,
  valid_to TIMESTAMPTZ NOT NULL,
  geometry GEOGRAPHY(POLYGON, 4326) NOT NULL,
  altitude_min_km DOUBLE PRECISION,
  altitude_max_km DOUBLE PRECISION,
  risk_level TEXT,
  confidence DOUBLE PRECISION,
  sigma_along_track_km DOUBLE PRECISION,
  sigma_cross_track_km DOUBLE PRECISION,
  record_hmac TEXT NOT NULL
);
CREATE INDEX ON hazard_zones USING GIST (geometry);
CREATE INDEX ON hazard_zones (valid_from, valid_to);
CREATE TRIGGER hazard_zones_immutable
  BEFORE UPDATE OR DELETE ON hazard_zones
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();
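A sketch of the record_hmac signing and verification for hazard zone rows; the field list and sorted-key JSON serialisation are assumptions, not the production canonical form:

```python
import hashlib
import hmac
import json

# Illustrative subset of the safety-critical fields
_SIGNED_FIELDS = ("simulation_id", "valid_from", "valid_to", "geometry", "risk_level")

def sign_hazard_record(secret: bytes, record: dict) -> str:
    """HMAC-SHA256 over a canonical serialisation of the signed fields."""
    canonical = json.dumps({k: record.get(k) for k in _SIGNED_FIELDS}, sort_keys=True)
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_hazard_record(secret: bytes, record: dict, record_hmac: str) -> bool:
    """Constant-time comparison against the stored record_hmac."""
    return hmac.compare_digest(sign_hazard_record(secret, record), record_hmac)
```

Unlike the plain SHA-256 record_hash on security_logs, the keyed HMAC means an attacker with database write access still cannot forge a valid signature.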

-- Airspace boundaries
CREATE TABLE airspace (
  id BIGSERIAL PRIMARY KEY,
  designator TEXT NOT NULL,
  name TEXT,
  type TEXT NOT NULL,
  geometry GEOMETRY(POLYGON, 4326) NOT NULL,  -- GEOMETRY (not GEOGRAPHY): FIR polygons are stored split at the antimeridian on import; ~3× faster for ST_Intersects
  lower_fl INTEGER,
  upper_fl INTEGER,
  icao_region TEXT
);
CREATE INDEX ON airspace USING GIST (geometry);

-- Debris fragments
CREATE TABLE fragments (
  id BIGSERIAL PRIMARY KEY,
  simulation_id UUID REFERENCES simulations(id),
  mass_kg DOUBLE PRECISION,
  characteristic_length_m DOUBLE PRECISION,
  cross_section_m2 DOUBLE PRECISION,
  material TEXT,
  ballistic_coefficient_kgm2 DOUBLE PRECISION,
  pre_entry_survived BOOLEAN,
  impact_point GEOGRAPHY(POINT, 4326),
  impact_velocity_kms DOUBLE PRECISION,
  impact_angle_deg DOUBLE PRECISION,
  kinetic_energy_j DOUBLE PRECISION,
  casualty_area_m2 DOUBLE PRECISION,
  dispersion_semi_major_km DOUBLE PRECISION,
  dispersion_semi_minor_km DOUBLE PRECISION,
  dispersion_orientation_deg DOUBLE PRECISION
);
CREATE INDEX ON fragments USING GIST (impact_point);
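The kinetic_energy_j and casualty_area_m2 columns follow standard forms; the 0.36 m² projected human area in the casualty-area formula is the conventional assumption:

```python
import math

def kinetic_energy_j(mass_kg, impact_velocity_kms):
    """KE = 1/2 m v^2, with impact velocity converted from km/s to m/s."""
    v_ms = impact_velocity_kms * 1000.0
    return 0.5 * mass_kg * v_ms ** 2

def casualty_area_m2(cross_section_m2, human_area_m2=0.36):
    """Standard casualty-area form: A_c = (sqrt(A_human) + sqrt(A_fragment))^2,
    with the conventional 0.36 m^2 projected human cross-section."""
    return (math.sqrt(human_area_m2) + math.sqrt(cross_section_m2)) ** 2
```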

-- Owned objects (space operator registration)
CREATE TABLE owned_objects (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  object_id INTEGER REFERENCES objects(id) NOT NULL,
  norad_id INTEGER NOT NULL,
  registered_at TIMESTAMPTZ DEFAULT NOW(),
  registration_reference TEXT,           -- National space law registration number
  has_propulsion BOOLEAN DEFAULT FALSE,  -- Enables controlled re-entry planner
  UNIQUE (organisation_id, object_id)
);
CREATE INDEX ON owned_objects (organisation_id);

-- API keys (for Persona E/F programmatic access)
CREATE TABLE api_keys (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
  user_id INTEGER REFERENCES users(id),  -- NULL for org-level service account keys (F5)
  is_service_account BOOLEAN NOT NULL DEFAULT FALSE,  -- TRUE = org-level key, no human user
  service_account_name TEXT,             -- required when is_service_account = TRUE; e.g. "ANSP Integration Service"
  key_hash TEXT NOT NULL UNIQUE,         -- SHA-256 of raw key; raw key shown once at creation
  name TEXT NOT NULL,                    -- Human label, e.g. "Ops Centre Integration"
  role TEXT NOT NULL,                    -- space_operator | orbital_analyst
  created_at TIMESTAMPTZ DEFAULT NOW(),
  last_used_at TIMESTAMPTZ,
  expires_at TIMESTAMPTZ,
  revoked_at TIMESTAMPTZ,
  revoked_by INTEGER REFERENCES users(id),  -- org_admin or admin who revoked (F5)
  requests_today INTEGER DEFAULT 0,
  daily_limit INTEGER DEFAULT 1000,
  -- API key scope and rate limit overrides (Finding 11)
  allowed_endpoints TEXT[],              -- NULL = all endpoints for role; e.g. ['GET /space/objects']
  rate_limit_override JSONB,             -- e.g. {"decay_predict": {"limit": 5, "window": "1h"}}
  CONSTRAINT service_account_name_required CHECK (
    (is_service_account = FALSE) OR (service_account_name IS NOT NULL)
  ),
  CONSTRAINT user_or_service CHECK (
    (user_id IS NOT NULL AND is_service_account = FALSE)
    OR (user_id IS NULL AND is_service_account = TRUE)
  )
);
CREATE INDEX ON api_keys (organisation_id, revoked_at);
CREATE INDEX ON api_keys (organisation_id, is_service_account);  -- org admin key listing
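The key_hash contract above (store only the SHA-256 digest; show the raw key exactly once at creation) reduces to a small helper pair. A minimal sketch, assuming Python on the backend; the function names and the sck key prefix are illustrative, not part of the schema:

```python
import hashlib
import secrets

def generate_api_key(prefix: str = "sck") -> tuple[str, str]:
    """Return (raw_key, key_hash). Only key_hash is persisted in api_keys;
    the raw key is displayed to the user exactly once at creation."""
    raw = f"{prefix}_{secrets.token_urlsafe(32)}"
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)
```

Authenticating by recomputing the digest and looking up key_hash means the UNIQUE index on key_hash doubles as the authentication path.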

-- Async job tracking — all Celery-backed POST endpoints return a job reference (Finding 3)
CREATE TABLE jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  user_id INTEGER NOT NULL REFERENCES users(id),
  job_type TEXT NOT NULL
    CHECK (job_type IN ('decay_predict','report','reentry_plan','propagate')),
  status TEXT NOT NULL DEFAULT 'queued'
    CHECK (status IN ('queued','running','complete','failed','cancelled')),
  celery_task_id TEXT,                  -- Celery AsyncResult ID for internal tracking
  params_hash TEXT,                     -- SHA-256 of input params; used for idempotency check
  result_url TEXT,                      -- populated when status='complete'; e.g. '/decay/predictions/123'
  error_code TEXT,                      -- populated when status='failed'
  error_message TEXT,
  estimated_duration_seconds INTEGER,   -- populated at creation from historical p50 for job_type
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ
);
CREATE INDEX ON jobs (organisation_id, status, created_at DESC);
CREATE INDEX ON jobs (celery_task_id);

-- Idempotency key store — prevents duplicate mutations from network retries (Finding 5)
CREATE TABLE idempotency_keys (
  key TEXT NOT NULL,                    -- client-provided UUID
  user_id INTEGER NOT NULL REFERENCES users(id),
  endpoint TEXT NOT NULL,              -- e.g. 'POST /decay/predict'
  response_status INTEGER NOT NULL,
  response_body JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '24 hours',
  PRIMARY KEY (key, user_id, endpoint)
);
CREATE INDEX ON idempotency_keys (expires_at);  -- for TTL cleanup job
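The composite primary key (key, user_id, endpoint) supports a lookup-or-execute flow on every mutating request: replay the stored response if the triple has been seen, otherwise run the handler once and record what it returned. A sketch of that flow, with the table stubbed as an in-memory dict and the handler signature assumed:

```python
from typing import Callable

# Stand-in for the idempotency_keys table: composite key -> stored response
_store: dict[tuple[str, int, str], tuple[int, dict]] = {}

def with_idempotency(key: str, user_id: int, endpoint: str,
                     handler: Callable[[], tuple[int, dict]]) -> tuple[int, dict]:
    """Replay the recorded (status, body) for a repeated (key, user, endpoint)
    triple; otherwise execute the handler once and record its response."""
    pk = (key, user_id, endpoint)
    if pk in _store:
        return _store[pk]          # network retry: no second mutation
    status, body = handler()
    _store[pk] = (status, body)    # real code: INSERT with the 24h expires_at
    return status, body
```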

-- Usage metering (F3) — billable events; append-only
CREATE TABLE usage_events (
  id BIGSERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  user_id INTEGER REFERENCES users(id),          -- NULL for API key / system-triggered events
  api_key_id UUID REFERENCES api_keys(id),        -- set when triggered via API key
  event_type TEXT NOT NULL
    CHECK (event_type IN (
      'decay_prediction_run',
      'conjunction_screen_run',
      'report_export',
      'api_request',
      'mc_quota_exhausted',          -- quota hit; signals upsell opportunity
      'reentry_plan_run'
    )),
  quantity INTEGER NOT NULL DEFAULT 1,            -- e.g. number of API requests batched
  billing_period TEXT NOT NULL,                   -- 'YYYY-MM' — month this event counts toward
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  detail JSONB                                    -- event-specific metadata (object_id, mc_n, etc.)
);
CREATE INDEX ON usage_events (organisation_id, billing_period, event_type);
CREATE INDEX ON usage_events (organisation_id, created_at DESC);
-- Append-only enforcement
CREATE TRIGGER usage_events_immutable
  BEFORE UPDATE OR DELETE ON usage_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Billing contacts (F10)
CREATE TABLE billing_contacts (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id) UNIQUE,
  billing_email TEXT NOT NULL,
  billing_name TEXT NOT NULL,
  billing_address TEXT,
  vat_number TEXT,                               -- EU VAT registration; required for B2B invoicing
  purchase_order_number TEXT,                    -- PO reference required by some ANSP procurement depts
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_by INTEGER REFERENCES users(id)        -- must be org_admin or admin
);

-- Subscription periods (F10) — immutable record of what was billed when
CREATE TABLE subscription_periods (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  tier TEXT NOT NULL,
  period_start TIMESTAMPTZ NOT NULL,
  period_end TIMESTAMPTZ,                        -- NULL = current (open) period
  monthly_fee_eur NUMERIC(10, 2),                -- agreed contract price; NULL for internal/trial
  currency TEXT NOT NULL DEFAULT 'EUR',
  invoice_ref TEXT,                              -- external billing system invoice ID (e.g. Stripe invoice_id)
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON subscription_periods (organisation_id, period_start DESC);

-- NOTAM drafts (audit trail; never submitted by SpaceCom)
CREATE TABLE notam_drafts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  prediction_id BIGINT REFERENCES reentry_predictions(id),
  organisation_id INTEGER REFERENCES organisations(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  draft_type TEXT NOT NULL
    CHECK (draft_type IN ('new','cancellation')),
  fir_designators TEXT[] NOT NULL,
  valid_from TIMESTAMPTZ,
  valid_to TIMESTAMPTZ,
  draft_text TEXT NOT NULL,              -- Full ICAO-format draft text
  reviewed_by INTEGER REFERENCES users(id) ON DELETE SET NULL,  -- SET NULL on GDPR erasure; draft preserved
  reviewed_at TIMESTAMPTZ,
  review_note TEXT,
  safety_record BOOLEAN DEFAULT TRUE,    -- always retained; excluded from data drop policy
  generated_during_degraded BOOLEAN DEFAULT FALSE  -- TRUE if ingest was degraded at generation time
  -- No issuance fields — SpaceCom never issues NOTAMs
);

-- Degraded mode audit log (Finding 7 — operational ANSP disclosure requirement)
-- Records every transition into and out of degraded mode for incident investigation
CREATE TABLE degraded_mode_events (
  id BIGSERIAL PRIMARY KEY,
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  ended_at TIMESTAMPTZ,                     -- NULL = currently degraded
  affected_sources TEXT[] NOT NULL,         -- e.g. ['space_track', 'noaa_swpc']
  severity TEXT NOT NULL
    CHECK (severity IN ('WARNING','CRITICAL')),
  trigger_reason TEXT NOT NULL,             -- human-readable: 'Space-Track ingest gap > 4h'
  resolved_by TEXT,                         -- 'auto-recovery' | user_id | 'manual'
  safety_record BOOLEAN DEFAULT TRUE        -- always retained under safety record policy
);
-- Append-only: no UPDATE or DELETE permitted
CREATE TRIGGER degraded_mode_events_immutable
  BEFORE UPDATE OR DELETE ON degraded_mode_events
  FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Shadow validation records (compare shadow predictions to actual events)
CREATE TABLE shadow_validations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  prediction_id BIGINT REFERENCES reentry_predictions(id),
  organisation_id INTEGER REFERENCES organisations(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  created_by INTEGER REFERENCES users(id),
  actual_reentry_time TIMESTAMPTZ,
  actual_reentry_location GEOGRAPHY(POINT, 4326),
  actual_source TEXT,                    -- 'aerospace_corp_db' | 'tip_message' | 'manual'
  p50_error_minutes DOUBLE PRECISION,    -- actual - predicted p50 in minutes
  in_p95_corridor BOOLEAN,               -- did actual point fall within 95th pct corridor?
  notes TEXT
);

-- Legal opinions (jurisdiction-level gate for shadow mode and operational deployment)
CREATE TABLE legal_opinions (
  id SERIAL PRIMARY KEY,
  jurisdiction TEXT NOT NULL UNIQUE,      -- e.g. 'AU', 'EU', 'UK', 'US'
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending','in_progress','complete','not_required')),
  opinion_date DATE,
  counsel_firm TEXT,
  shadow_mode_cleared BOOLEAN DEFAULT FALSE,  -- opinion confirms shadow deployment is permissible
  operational_cleared BOOLEAN DEFAULT FALSE,  -- opinion confirms operational deployment is permissible
  liability_cap_agreed BOOLEAN DEFAULT FALSE,
  notes TEXT,
  document_minio_key TEXT,                -- reference to stored opinion document in MinIO
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Shared immutability function (used by multiple triggers)
-- NOTE: in the actual migration this function must be created BEFORE the
-- append-only triggers above (usage_events, degraded_mode_events) that reference it
CREATE OR REPLACE FUNCTION prevent_modification()
RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'Table % is append-only or immutable after creation', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

-- Shared updated_at function (used by mutable tables)
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$;

-- updated_at triggers for all mutable tables
CREATE TRIGGER organisations_updated_at
  BEFORE UPDATE ON organisations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER users_updated_at
  BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER simulations_updated_at
  BEFORE UPDATE ON simulations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER jobs_updated_at
  BEFORE UPDATE ON jobs FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER notam_drafts_updated_at
  BEFORE UPDATE ON notam_drafts FOR EACH ROW EXECUTE FUNCTION set_updated_at();

Shadow mode flag on predictions and hazard zones: Add shadow_mode BOOLEAN DEFAULT FALSE to both reentry_predictions and hazard_zones. Shadow records are excluded from all operational API responses (WHERE shadow_mode = FALSE applied to all operational endpoints) but accessible via /analysis and the Feedback/shadow validation workflow.


9.3 Index Strategy

All indexes must be created CONCURRENTLY on live hypertables to avoid table locks (see §9.4). The following indexes are required beyond TimescaleDB's automatic chunk indexes:

-- orbits hypertable: object + time range queries (CZML generation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS orbits_object_epoch_idx
  ON orbits (object_id, epoch DESC);

-- reentry_predictions: latest prediction per object (Event Detail, operational overview)
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_object_created_idx
  ON reentry_predictions (object_id, created_at DESC)
  WHERE integrity_failed = FALSE AND shadow_mode = FALSE;

-- alert_events: unacknowledged alerts per org (badge count — called on every page load)
-- Partial index on acknowledged_at IS NULL: only live unacked rows indexed; shrinks as alerts are acknowledged
CREATE INDEX CONCURRENTLY IF NOT EXISTS alert_events_unacked_idx
  ON alert_events (organisation_id, level, created_at DESC)
  WHERE acknowledged_at IS NULL;

-- jobs: Celery worker polls for queued jobs; partial index keeps this tiny and fast
CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_queued_idx
  ON jobs (organisation_id, created_at)
  WHERE status = 'queued';

-- refresh_tokens: token validation only cares about live (non-revoked) tokens
CREATE INDEX CONCURRENTLY IF NOT EXISTS refresh_tokens_live_idx
  ON refresh_tokens (token_hash)
  WHERE revoked_at IS NULL;

-- idempotency_keys: TTL cleanup job scans by expiry time
-- (expires_at is NOT NULL, so a partial "IS NOT NULL" predicate would index every row anyway)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idempotency_keys_expired_idx
  ON idempotency_keys (expires_at);

-- PostGIS spatial: all columns used in ST_Intersects / ST_Contains / ST_Distance
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_corridor_gist
  ON reentry_predictions USING GIST (ground_track_corridor);
-- airspace.geometry GIST index already present (see §9.2)
-- fragments.impact_point GIST index already present (created at table definition, §9.2)
CREATE INDEX CONCURRENTLY IF NOT EXISTS hazard_zones_polygon_gist
  ON hazard_zones USING GIST (polygon);

-- tle_sets hypertable: latest TLE per object (cross-validation, propagation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS tle_sets_object_ingested_idx
  ON tle_sets (object_id, ingested_at DESC);

-- security_logs: recent events per user (audit queries)
CREATE INDEX CONCURRENTLY IF NOT EXISTS security_logs_user_time_idx
  ON security_logs (user_id, created_at DESC);

Spatial type convention:

  • GEOGRAPHY — used for global features that may cross the antimeridian (corridor polygons, nominal re-entry points, fragment impact points). Geodetic calculations; correct for global spans.
  • GEOMETRY(POLYGON, 4326) — used for regional features always within ±180° longitude (FIR/UIR airspace boundaries). Planar approximation; ~3× faster for ST_Intersects than GEOGRAPHY; accurate enough for airspace boundary intersection within a single hemisphere.

SRID enforcement (F2 — §62): Declaring the SRID in the column type (GEOMETRY(POLYGON, 4326)) prevents implicit SRID mismatch errors, but does not prevent application code from inserting a geometry constructed with SRID 0. Add explicit CHECK constraints on all spatial columns:

-- Ensure corridor polygon SRID is correct
ALTER TABLE reentry_predictions
  ADD CONSTRAINT chk_corridor_srid
  CHECK (ST_SRID(ground_track_corridor::geometry) = 4326);

ALTER TABLE hazard_zones
  ADD CONSTRAINT chk_hazard_zone_srid
  CHECK (ST_SRID(geometry) = 4326);

ALTER TABLE airspace
  ADD CONSTRAINT chk_airspace_srid
  CHECK (ST_SRID(geometry) = 4326);

The CI migration gate (alembic check) will flag any migration that adds a spatial column without a matching SRID CHECK constraint.
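One possible shape for that gate, sketched as a regex heuristic over the migration SQL. The function name and patterns are illustrative; a real implementation would parse per-column rather than per-file:

```python
import re

# Heuristics, not a SQL parser: flag any migration that introduces a spatial
# column type but contains no ST_SRID(...) CHECK anywhere in the same file.
SPATIAL_COL = re.compile(r"\b(GEOMETRY|GEOGRAPHY)\s*\(", re.IGNORECASE)
SRID_CHECK = re.compile(r"\bST_SRID\s*\(", re.IGNORECASE)

def migration_violates_srid_gate(sql: str) -> bool:
    """True if the migration adds a spatial column without a matching
    SRID CHECK constraint; CI fails the pipeline on True."""
    return bool(SPATIAL_COL.search(sql)) and not SRID_CHECK.search(sql)
```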

ST_Buffer distance units (F9 — §62): ST_Buffer on a GEOMETRY(POLYGON, 4326) column uses degree-units, not metres. At 60°N, 1° of longitude ≈ 55 km; at the equator, 1° ≈ 111 km — an uncertainty buffer expressed in degrees gives wildly different areas at different latitudes. Projecting to Web Mercator (EPSG:3857) does not solve this: Mercator "metres" are true only at the equator and shrink by cos(latitude), so a 50,000-unit buffer at 60°N covers only ~25 km of ground. Buffer on GEOGRAPHY instead — it accepts metres natively and is geodetically correct at any latitude:

-- CORRECT: buffer 50 km around a corridor point at any latitude
SELECT ST_Buffer(
  ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography,
  50000  -- 50 km in metres; geodetic, latitude-independent
)::geometry AS buffered_geom;

-- WRONG: buffer in degrees — DO NOT USE
-- SELECT ST_Buffer(geom, 0.5) FROM ...  ← 0.5° is ~55 km at the equator but only ~28 km at 60°N

-- ALSO WRONG: ST_Buffer after ST_Transform(..., 3857) — Mercator distance distortion
-- makes the effective buffer radius latitude-dependent; if a planar CRS is genuinely
-- required, use a local projection (e.g. the appropriate UTM zone), never 3857

Corridor polygons are already stored as GEOGRAPHY (§9.2), so the metric buffer applies directly:

SELECT ST_Buffer(corridor::geography, 50000)  -- 50 km buffer, geodetically correct
FROM reentry_predictions WHERE ...

FIR intersection query optimisation: Apply a bounding-box pre-filter before the full polygon intersection test to eliminate most rows cheaply. airspace.geometry is GEOMETRY while hazard_zones.geometry and corridor parameters are GEOGRAPHY — always cast GEOGRAPHY → GEOMETRY explicitly before passing to ST_Intersects with an airspace column; PostgreSQL cannot use the GiST index and falls back to a seq scan if the types are mixed implicitly:

-- Corridor (GEOGRAPHY) intersecting FIR boundaries (GEOMETRY): explicit cast required
SELECT a.designator, a.name
FROM airspace a
WHERE a.geometry && ST_Envelope($1::geography::geometry)   -- fast bbox pre-filter (uses GIST)
  AND ST_Intersects(a.geometry, $1::geography::geometry);  -- exact test (GEOMETRY, not GEOGRAPHY)
-- $1 = corridor polygon passed as GEOGRAPHY from application layer

Add a CI linter rule (or custom ruff plugin) that rejects ST_Intersects(airspace.geometry, <expr>) unless <expr> is explicitly cast to ::geometry. This prevents the mixed-type silent seq-scan regression from being introduced during maintenance.

Cache the FIR intersection result per prediction_id in Redis (TTL: until the prediction is superseded) — the intersection for a given prediction never changes.
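The cache-aside pattern this describes, sketched with the Redis client stubbed as a dict; the fir:<prediction_id> key scheme is an assumption, and invalidation happens by deleting the key when the prediction is superseded:

```python
import json
from typing import Callable

def fir_intersections(prediction_id: int, cache: dict,
                      compute: Callable[[int], list[str]]) -> list[str]:
    """Cache-aside lookup: the FIR set for a given prediction never changes,
    so any cache hit is authoritative. `cache` stands in for Redis GET/SET;
    `compute` runs the §9.3 spatial query on a miss."""
    key = f"fir:{prediction_id}"   # assumed key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    firs = compute(prediction_id)
    cache[key] = json.dumps(firs)  # real code: SET with supersession-driven delete
    return firs
```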


9.4 TimescaleDB Configuration and Continuous Aggregates

Hypertable chunk intervals — set explicitly at creation; default 7-day chunks are too large for the orbits CZML query pattern (most queries cover ≤ 72h):

-- orbits: 1-day chunks (72h CZML window spans 3 chunks; good chunk exclusion)
SELECT create_hypertable('orbits', 'epoch',
  chunk_time_interval => INTERVAL '1 day',
  if_not_exists => TRUE);

-- tle_sets: 1-month chunks (~1,800 rows/day at 600 objects × 3 TLE updates; queried by object_id not time range)
-- Small chunks (7 days) produce poor compression ratios (~12,600 rows/chunk); 1 month improves ratio ~4×
SELECT create_hypertable('tle_sets', 'ingested_at',
  chunk_time_interval => INTERVAL '1 month',
  if_not_exists => TRUE);

-- space_weather: 30-day chunks (~3000 rows/month at 15-min cadence)
SELECT create_hypertable('space_weather', 'time',
  chunk_time_interval => INTERVAL '30 days',
  if_not_exists => TRUE);

Continuous aggregates — pre-compute recurring expensive queries instead of scanning raw hypertable rows on every request:

-- 81-day rolling F10.7 average (queried on every Space Weather Widget render)
CREATE MATERIALIZED VIEW space_weather_daily
  WITH (timescaledb.continuous) AS
  SELECT time_bucket('1 day', time) AS day,
         AVG(f107_obs)              AS f107_daily_avg,
         MAX(kp_3hourly[1])         AS kp_max_daily
  FROM space_weather
  GROUP BY day
WITH NO DATA;

SELECT add_continuous_aggregate_policy('space_weather_daily',
  start_offset      => INTERVAL '90 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

Backend queries for the 81-day F10.7 average read from space_weather_daily (the continuous aggregate), not from the raw space_weather hypertable.

Compression policy intervals — compression must not target recently-written chunks. TimescaleDB decompresses a chunk before any write to it; compressing hot chunks adds 50–200 ms latency per write batch. Set compress_after well beyond the active write window:

| Hypertable | Chunk interval | compress_after | Write cadence | Reasoning |
|---|---|---|---|---|
| orbits | 1 day | 7 days | 1 min (continuous) | Data is queryable but not written after ~24h; 7-day buffer prevents write-decompress thrash |
| adsb_states | 4 hours | 14 days | 60s (Celery Beat) | Rolling 24h retention; compress only after data is past retention interest |
| space_weather | 30 days | 60 days | 15 min | Very low write rate; compress after one full 30-day chunk is closed |
| tle_sets | 1 month | 2 months | Every 4h ingest | ~1,800 rows/day; 1-month chunks give good compression ratio; 2-month buffer ensures active month is never compressed |

-- Apply compression policies (run after hypertable creation)
SELECT add_compression_policy('orbits',       INTERVAL '7 days');
SELECT add_compression_policy('adsb_states',  INTERVAL '14 days');
SELECT add_compression_policy('space_weather', INTERVAL '60 days');
SELECT add_compression_policy('tle_sets',     INTERVAL '2 months');

Autovacuum tuning — append-only tables still accumulate dead tuples from aborted transactions and MVCC overhead. Default 20% threshold is too conservative for high-write safety tables:

ALTER TABLE alert_events SET (
  autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum at 1% dead tuples (default: 20%)
  autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE security_logs SET (
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE reentry_predictions SET (
  autovacuum_vacuum_cost_delay    = 2,      -- allow aggressive vacuum on query-critical table
  autovacuum_analyze_scale_factor = 0.01
);

PostgreSQL-level settings via patroni.yml:

postgresql:
  parameters:
    idle_in_transaction_session_timeout: 30000  # 30s -- prevents analytics sessions blocking autovacuum
    max_connections: 50                          # pgBouncer handles client multiplexing; DB needs only 50
    log_min_duration_statement: 500             # F7 §58: log queries > 500ms; shipped to Loki via Promtail
    shared_preload_libraries: timescaledb,pg_stat_statements  # F7 §58: enable slow query tracking
    pg_stat_statements.track: all              # track all statements including nested
    # Analyst role statement timeout (F11 §58): prevents runaway analytics queries starving ops connections
    # Applied at role level, not globally, to avoid impacting operational paths

Query plan governance (F7 — §58): Slow queries (> 500ms) appear in PostgreSQL logs and are shipped to Loki. A weekly Grafana report queries pg_stat_statements via the postgres-exporter and surfaces the top-10 queries by total_exec_time. Any query appearing in the top-10 for two consecutive weeks requires a PR with an EXPLAIN ANALYSE output and either an index addition or a documented acceptance rationale. The EXPLAIN ANALYSE output is recorded in the migration file header comment for index additions. CI migration timeout (§9.4) applies: migrations running > 30s against the test dataset require review before merge.

Analyst role query timeout (F11 — §58): Persona B/F analyst queries route to the read replica (§3.2) but must still be bounded to prevent a runaway query exhausting replica connections and triggering replication lag. Apply a statement_timeout at the database role level so it applies regardless of connection source:

-- Applied once at schema setup; persists across reconnections
ALTER ROLE spacecom_analyst SET statement_timeout = '30s';
ALTER ROLE spacecom_readonly SET statement_timeout = '30s';

-- Operational roles have no statement timeout — but idle-in-transaction timeout applies globally
-- (idle_in_transaction_session_timeout = 30s in patroni.yml)

The spacecom_analyst role is the PgBouncer user for the read replica pool. All analyst-originated queries automatically inherit the 30s limit. If a query exceeds 30s it receives ERROR: canceling statement due to statement timeout; the frontend displays a user-facing message: "This query exceeded the 30-second limit. Refine your filters or contact your administrator." Logged at WARNING to Loki.

PgBouncer transaction mode + asyncpg prepared statement cache — asyncpg caches prepared statements per server-side connection. In PgBouncer transaction mode, the connection returned after each transaction may differ from the one the statement was prepared on, causing ERROR: prepared statement "..." does not exist under load. Disable the cache in the SQLAlchemy async engine config:

engine = create_async_engine(
    DATABASE_URL,
    connect_args={"prepared_statement_cache_size": 0},
)

This is non-negotiable when using PgBouncer transaction mode. Do not revert this setting in the belief that it is a performance regression — it prevents a hard production failure mode. See ADR 0008.

Migration safety on live hypertables (additions to the Alembic policy in §26.9):

  • Always use CREATE INDEX CONCURRENTLY for new indexes — no table lock; safe during live ingest
  • Never add a column with a non-null default to a populated hypertable in one migration: (1) add nullable, (2) backfill in batches, (3) add NOT NULL constraint separately
  • Test every migration against production-sized data; record execution time in the migration file header comment
  • Set a CI migration timeout: if a migration runs > 30s against the test dataset, it must be reviewed before merge
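The batched backfill in the second bullet can be driven by a simple range generator so each UPDATE touches a bounded id window and each transaction stays short; the batch size here is an assumption to tune against production data:

```python
from collections.abc import Iterator

def backfill_batches(min_id: int, max_id: int,
                     batch_size: int = 10_000) -> Iterator[tuple[int, int]]:
    """Yield inclusive (lo, hi) id ranges for batched
    UPDATE ... WHERE id BETWEEN lo AND hi statements, keeping each
    transaction small so autovacuum and replication keep up."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1
```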

10. Technology Stack

| Layer | Technology | Rationale |
|---|---|---|
| Frontend framework | Next.js 15 + TypeScript | Type safety, SSR for dashboards, static export option |
| 3D globe | CesiumJS (retained) | Native CZML support; proven in prototype |
| 2D overlays | Deck.gl | WebGL heatmaps (Mode B), arc layers, hex grids |
| Server state | TanStack Query | Caching, background refetch, stale-while-revalidate. API responses never stored in Zustand. |
| UI state | Zustand | Pure UI state only: timeline mode, selected object, layer visibility, alert acknowledgements |
| URL state | nuqs | Shareable deep links; selected object/event/time reflected in URL |
| Backend framework | FastAPI (retained) | Async, OpenAPI auto-docs, Pydantic validation |
| Task queue | Celery + Redis | Battle-tested for scientific compute; Flower monitoring |
| Catalog propagation | sgp4 | SGP4/SDP4; catalog tracking only, not decay prediction |
| Numerical integrator | scipy.integrate.DOP853 or custom RK7(8) | Adaptive step-size for Cowell decay prediction |
| Atmospheric density | nrlmsise00 Python wrapper | NRLMSISE-00; driven by F10.7 and Ap |
| Frame transformations | astropy | IAU 2006 precession/nutation, IERS EOP, TEME→GCRF→ITRF |
| Astrodynamics utilities | poliastro (optional) | Conjunction geometry helpers |
| Auth | python-jose (RS256 JWT) + pyotp (TOTP MFA) | Asymmetric JWT; TOTP RFC 6238 |
| Rate limiting | slowapi | Redis token bucket; per-user and per-IP limits |
| HTML sanitisation | bleach | User-supplied content before Playwright rendering |
| Password hashing | passlib[bcrypt] | bcrypt cost factor ≥ 12 |
| Database | TimescaleDB + PostGIS (retained) | Time-series + geospatial; RLS for multi-tenancy |
| Cache / broker | Redis 7 | Broker + pub/sub: maxmemory-policy noeviction (Celery queues must never be evicted). Separate Redis DB index for application cache: allkeys-lru. AUTH + TLS in production. |
| Connection pooler | PgBouncer 1.22 | Transaction-mode pooling between all app services and TimescaleDB. Prevents connection exhaustion at Tier 3; single failover target for Patroni switchover. max_client_conn=200, default_pool_size=20. Pool sizing derivation (F2 — §58): PostgreSQL max_connections=50; reserve 5 for superuser/admin; 45 available server connections. default_pool_size=20 per pool (one pool per DB user); leaves headroom for Alembic migrations and ad-hoc DBA access. max_client_conn=200: (2 backend workers × 40 async connections) + (4 sim workers × 16 threads) + (2 ingest workers × 4) = 152 peak; 200 provides burst headroom. Validate with SHOW POOLS; in psql connected to PgBouncer: sustained cl_waiting > 0 means the pool is undersized. |
| Object storage | MinIO | Private buckets; pre-signed URLs only |
| Containerisation | Docker Compose (retained); Caddy as TLS-terminating reverse proxy | Single-command dev; HTTPS auto-provisioning |
| Testing — backend | pytest + hypothesis | Property-based tests for numerical and security invariants |
| Testing — frontend | Vitest + Playwright | Unit tests + E2E including security header checks |
| SAST — Python | Bandit | Static analysis; CI blocks on High severity |
| SAST — TypeScript | ESLint security plugin | Static analysis; CI blocks on High severity |
| Container scanning | Trivy | CI blocks on Critical/High CVEs |
| DAST | OWASP ZAP | Phase 2 pipeline against staging |
| Dependency management | pip-tools + npm ci | Pinned hashes; --require-hashes |
| Report rendering | Playwright headless (isolated renderer container) | Server-side globe screenshot; no client-side canvas |
| Secrets management | Docker secrets (Phase 1 production) → HashiCorp Vault (Phase 3) | |
| Task scheduler HA | celery-redbeat | Redis-backed Beat scheduler; distributed locking; multiple instances safe |
| DB HA / failover | Patroni + etcd | Automatic TimescaleDB primary/standby failover; ≤ 30s RTO |
| Redis HA | Redis Sentinel (3 nodes) | Master failover ≤ 10s; transparent to application via redis-py Sentinel client |
| Monitoring | Prometheus + Grafana | Business-level metrics from Phase 1; four dashboards (§26.7); AlertManager with runbook links |
| Log aggregation | Grafana Loki + Promtail | Phase 2; Promtail scrapes Docker log files; Loki stores and queries; co-deployed with Grafana; no index servers required |
| Distributed tracing | OpenTelemetry → Grafana Tempo | Phase 2; FastAPI + SQLAlchemy + Celery auto-instrumented; OTLP exporter; trace_id = request_id for log correlation; ADR 0017 |
| Structured logging | structlog | JSON structured logs with required fields; sanitising processor strips secrets; request_id propagated through HTTP → Celery chain |
| On-call alerting | PagerDuty or OpsGenie | Routes Prometheus AlertManager alerts; L1/L2/L3 escalation tiers (§26.8) |
| CI/CD pipeline | GitLab CI | Native to the self-hosted GitLab monorepo; stage-based builds for Python/Node; protected environments and approval rules for deploys |
| Container registry | GitLab Container Registry | Co-located with source; sha-<commit> is the canonical immutable tag; latest tag is forbidden in production deployments; image vulnerability attestations via cosign |
| Pre-commit | pre-commit framework | Hooks: detect-secrets, ruff (lint + format), mypy (type gate), hadolint (Dockerfile), prettier (JS/HTML), sqlfluff (migrations); spec in .pre-commit-config.yaml; same hooks re-run in CI |
| Local task runner | make | Standard targets: make dev (full-stack hot-reload), make test (pytest + vitest), make migrate (alembic upgrade head), make seed (fixture load), make lint (all pre-commit hooks), make clean (prune volumes) |

11. Data Source Inventory

| Source | Data | Access | Priority |
|---|---|---|---|
| Space-Track.org | TLE catalog, CDMs, object catalog, RCS data, TIP messages | REST API (account required); credentials in secrets manager | P1 |
| CelesTrak | TLE subsets (active sats, decaying objects) | Public REST API / CSV | P1 |
| USSPACECOM TIP Messages | Tracking and Impact Prediction for decaying objects | Via Space-Track.org | P1 |
| NOAA SWPC | F10.7, Ap/Kp, Dst, solar wind; 3-day forecasts | Public REST API and FTP | P1 |
| ESA Space Weather Service | F10.7, Kp cross-validation source | Public REST API | P1 |
| ESA DISCOS | Physical object properties: mass, dimensions, shape, materials | REST API (account required) | P1 |
| IERS Bulletin A/B | UT1-UTC offsets, polar motion | Public FTP (usno.navy.mil); SHA-256 verified on download | P1 |
| GFS / ECMWF | Tropospheric winds and density 0–80 km | NOMADS (NOAA) public FTP | P2 |
| ILRS / CDDIS | Laser ranging POD products for validation | Public FTP | P2 (validation) |
| FIR/UIR boundaries | FIR and UIR boundary polygons for airspace intersection | EUROCONTROL AIRAC dataset (subscription) for ECAC states; FAA Digital-Terminal Procedures for US; OpenAIP as fallback for non-AIRAC regions. GeoJSON format loaded into airspace table. Updated every 28 days on AIRAC cycle. | P1 |

Deprecated reference: "18th SDS" → use Space-Track.org consistently.

ESA DISCOS redistribution rights (Finding 9): ESA DISCOS is subject to an ESAC user agreement. Data may not be redistributed or used in commercial products without explicit ESA permission. SpaceCom is a commercial platform. Required actions before Phase 2 shadow deployment:

  • Obtain written clarification from ESA/ESAC on whether DISCOS-derived physical properties (mass, dimensions) may be: (a) used internally to drive SpaceCom's own predictions; (b) exposed in API responses to ANSP customers; (c) included in generated PDF reports
  • If redistribution is not permitted, DISCOS data is used only as internal model input — API responses and reports show source: estimated rather than exposing raw DISCOS values; the data_confidence UI flag continues to show ● DISCOS for internal tracking but is not labelled as DISCOS in customer-facing outputs
  • Include the DISCOS redistribution clarification in the Phase 2 legal gate checklist alongside the Space-Track AUP opinion

Airspace data scope and SUA disclosure (Finding 4): Phase 2 FIR/UIR scope covers ECAC states (EUROCONTROL AIRAC) and US FIRs (FAA). The following airspace types are explicitly out of scope for Phase 2 and disclosed to users:

  • Special Use Airspace (SUA): danger areas, restricted areas, prohibited areas (ICAO Annex 11)
  • Terminal Manoeuvring Areas (TMAs) and Control Zones (CTRs)
  • Oceanic FIRs (ICAO Annex 2 special procedures; OACCs handle coordination)

A persistent disclosure note on the Airspace Impact Panel reads: "SpaceCom FIR intersection analysis covers FIR/UIR boundaries only. It does not account for special use airspace, terminal areas, or oceanic procedures. Controllers must apply their local procedures for these airspace types." Phase 3 consideration: SUA polygon overlay from national AIP sources. Document in docs/adr/0014-airspace-scope.md.

All source URLs are hardcoded constants in ingest/sources.py. The outbound HTTP client blocks connections to private IP ranges. No source URL is configurable via API or database at runtime.

Space-Track AUP — conditional architecture (Finding 9): The AUP clarification is a Phase 1 architectural decision gate, not a Phase 2 deliverable. The current design assumes shared ingest (a single SpaceCom Space-Track credential fetches TLEs for all organisations). If the AUP prohibits redistribution of derived predictions to customers who have not themselves agreed to the AUP, the ingest architecture must change:

  • Path A — redistribution permitted: Current shared-ingest design is valid. Each customer organisation's access is governed by SpaceCom's AUP click-wrap and the MSA. No architectural change.
  • Path B — redistribution not permitted: Per-organisation Space-Track credentials required. Each ANSP/operator must hold their own Space-Track account. SpaceCom acts as a processing layer using each org's own credentials. Architecture change: space_track_credentials table (per-org, encrypted); per-org ingest worker configuration; significant additional complexity.

The decision must be documented in docs/adr/0016-space-track-aup-architecture.md with the chosen path and evidence (written AUP clarification). This ADR is a prerequisite for Phase 1 ingest architecture finalisation — marked as a blocking decision in the Phase 1 DoD.

Space weather raw format specifications:

| Source | Endpoint constant | Format | Key fields consumed |
| --- | --- | --- | --- |
| NOAA SWPC F10.7 | NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json" | JSON array | time_tag, flux (solar flux units) |
| NOAA SWPC Kp/Ap | NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json" | JSON array | time_tag, kp_index, ap |
| NOAA SWPC 3-day forecast | NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json" | JSON | Kp array |
| ESA SWS Kp | ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions" | REST JSON | kp_index (cross-validation) |

An integration test asserts that each response contains the expected top-level keys. If a key is absent, the test fails and the schema change is caught before it reaches production ingest.
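The key-presence assertion reduces to a small helper that the integration test can call on each fetched payload (helper name and source keys are illustrative, matching the table above):

```python
# Sketch of the contract check: fail fast if the upstream JSON schema drifts.
REQUIRED_KEYS: dict[str, set[str]] = {
    "noaa_f107": {"time_tag", "flux"},
    "noaa_kp": {"time_tag", "kp_index", "ap"},
}

def assert_contract(source: str, payload: list[dict]) -> None:
    """Raise AssertionError if the first record lacks any expected key."""
    if not payload:
        raise AssertionError(f"{source}: empty response")
    missing = REQUIRED_KEYS[source] - payload[0].keys()
    if missing:
        raise AssertionError(f"{source}: missing keys {sorted(missing)}")
```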

TLE validation at ingestion gate: Before any TLE record is written to the database, ingest/cross_validator.py must verify:

  1. Both lines are exactly 69 characters (standard TLE format)
  2. Modulo-10 checksum passes on line 1 and line 2
  3. Epoch field parses to a valid UTC datetime
  4. BSTAR drag term is within physically plausible bounds (−0.5 to +0.5)

Failed validation is logged to security_logs type INGEST_VALIDATION_FAILURE with the raw TLE and failure reason. The record is not written to the database.
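The modulo-10 checksum in step 2 follows the standard TLE rule: digits contribute their value, each minus sign contributes 1, and all other characters contribute 0 over the first 68 columns. A minimal sketch:

```python
def tle_checksum(line: str) -> int:
    """Modulo-10 checksum over columns 1-68 of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1  # minus signs count as 1 per the TLE convention
    return total % 10

def checksum_valid(line: str) -> bool:
    """Column 69 must equal the computed checksum (steps 1-2 of the gate)."""
    return (len(line) == 69
            and line[68].isdigit()
            and int(line[68]) == tle_checksum(line))
```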

TLE ingest idempotency — ON CONFLICT behaviour: The tle_sets table has UNIQUE (object_id, ingested_at). If the ingest worker runs twice for the same object within the same second (e.g., orphan recovery task + normal schedule overlap, or a worker restart mid-task), the second insert must not raise an exception or silently discard the row without tracking. Required semantics:

# ingest/writer.py
import structlog
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.ext.asyncio import AsyncSession
async def write_tle_set(session: AsyncSession, tle: TLERecord) -> bool:
    """Insert TLE record. Returns True if inserted, False if duplicate."""
    stmt = pg_insert(TLESet).values(
        object_id=tle.object_id,
        ingested_at=tle.ingested_at,
        tle_line1=tle.line1,
        tle_line2=tle.line2,
        epoch=tle.epoch,
        source=tle.source,
    ).on_conflict_do_nothing(
        index_elements=["object_id", "ingested_at"]
    ).returning(TLESet.object_id)

    result = await session.execute(stmt)
    inserted = result.scalar_one_or_none() is not None  # RETURNING yields a row only on actual insert
    if not inserted:
        spacecom_ingest_tle_conflict_total.inc()   # metric; non-zero signals scheduling race
        structlog.get_logger().debug("tle_insert_skipped_duplicate",
                                     object_id=tle.object_id, ingested_at=tle.ingested_at)
    return inserted

Prometheus counter spacecom_ingest_tle_conflict_total — a sustained non-zero rate warrants investigation of the Beat schedule overlap. A brief spike during worker restart is acceptable.

Ingest idempotency requirement for all periodic tasks (F8 — §67): TLE ingest uses ON CONFLICT DO NOTHING (above). All other periodic ingest tasks must use equivalent upsert semantics to survive celery-redbeat double-fire on restart:

-- Space weather ingest: upsert on (fetched_at) unique constraint
INSERT INTO space_weather (fetched_at, kp, f107, ...)
VALUES (:fetched_at, :kp, :f107, ...)
ON CONFLICT (fetched_at) DO NOTHING;

-- DISCOS object metadata: upsert on (norad_id) — update if data changed
INSERT INTO objects (norad_id, name, launch_date, updated_at, ...)
VALUES (:norad_id, :name, :launch_date, :updated_at, ...)
ON CONFLICT (norad_id) DO UPDATE SET
    name = EXCLUDED.name,
    launch_date = EXCLUDED.launch_date,
    updated_at = EXCLUDED.updated_at
WHERE objects.updated_at < EXCLUDED.updated_at;  -- only update if newer

-- IERS EOP: upsert on (date) unique constraint
INSERT INTO iers_eop (date, ut1_utc, x_pole, y_pole, ...)
VALUES (:date, :ut1_utc, :x_pole, :y_pole, ...)
ON CONFLICT (date) DO NOTHING;

Add unique constraints if not present: UNIQUE (fetched_at) on space_weather; UNIQUE (date) on iers_eop. These prevent double-write corruption at the DB level regardless of application retry logic.

IERS EOP cold-start requirement: On a fresh deployment with no cached EOP data, astropy's IERS_Auto falls back to the bundled IERS-B table (which lags the current date by weeks to months), silently degrading UT1-UTC precision from ~1 ms (IERS-A) to ~10–50 ms (IERS-B). For epochs beyond the IERS-B table end date, astropy raises IERSRangeError, crashing all frame transforms.

The EOP ingest task must run as part of make seed before any propagation task starts:

# Makefile
seed: migrate
	docker compose exec backend python -m ingest.eop --bootstrap   # downloads + caches current IERS-A
	docker compose exec backend python -m ingest.fir --bootstrap    # loads FIR boundaries
	docker compose exec backend sh -c 'psql "$$DATABASE_URL" -f fixtures/dev_seed.sql'  # SQL fixture loaded via psql (DATABASE_URL assumed in backend env)

The EOP ingest task in Celery Beat is ordered before the TLE ingest task: EOP runs at 00:00 UTC, TLE ingest at 00:10 UTC (ensuring fresh EOP before the first propagation of the day).
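The ordering above translates to a Beat schedule along these lines (task names are assumptions; only the crontab offsets are specified by this section):

```python
# celery_app.py — Beat schedule sketch: EOP at 00:00 UTC, TLE ingest at 00:10 UTC
from celery.schedules import crontab

beat_schedule = {
    "eop-daily": {
        "task": "ingest.tasks.refresh_eop",
        "schedule": crontab(hour=0, minute=0),   # fresh EOP first
    },
    "tle-daily": {
        "task": "ingest.tasks.ingest_tles",
        "schedule": crontab(hour=0, minute=10),  # first propagation of the day sees new EOP
    },
}
```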

IERS EOP verification — dual-mirror comparison: The IERS does not publish SHA-256 hashes alongside its EOP files. Comparing each download's hash against the previous download detects corruption but not substitution. Instead, the file is downloaded from both the USNO mirror and the Paris Observatory mirror, and the two are verified to agree:

# ingest/eop.py
IERS_MIRRORS = [
    "https://maia.usno.navy.mil/ser7/finals2000A.all",
    "https://hpiers.obspm.fr/iers/series/opa/eopc04",   # IERS-C04 series
]

async def fetch_and_verify_eop() -> bytes:
    contents = []
    for url in IERS_MIRRORS:
        resp = await http_client.get(url, timeout=30)
        resp.raise_for_status()
        contents.append(resp.content)

    # Verify UT1-UTC values agree within 0.1 ms across mirrors (format-normalised comparison)
    if not _eop_values_agree(contents[0], contents[1], tolerance_ms=0.1):
        structlog.get_logger().error("eop_mirror_disagreement")
        spacecom_eop_mirror_agreement.set(0)
        raise EOPVerificationError("IERS EOP mirrors disagree — rejecting both")

    spacecom_eop_mirror_agreement.set(1)
    return contents[0]   # USNO is primary; Paris Observatory is the verification witness

Prometheus gauge spacecom_eop_mirror_agreement (1 = mirrors agree, 0 = disagreement detected). Alert on spacecom_eop_mirror_agreement == 0.
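The _eop_values_agree helper is referenced but not shown. A minimal sketch, assuming both files have already been parsed (format-normalised) into {MJD: UT1-UTC seconds} mappings:

```python
def eop_values_agree(a: dict[int, float], b: dict[int, float],
                     tolerance_ms: float = 0.1) -> bool:
    """Compare UT1-UTC (seconds) on the MJDs both series cover."""
    common = a.keys() & b.keys()
    if not common:
        return False  # no overlapping dates is itself a verification failure
    return all(abs(a[mjd] - b[mjd]) * 1000.0 <= tolerance_ms for mjd in common)
```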


12. Backend Directory Structure

backend/
  app/
    main.py              # FastAPI app factory, middleware, router mounting
    config.py            # Settings via pydantic-settings (env vars); no secrets in code
    auth/
      provider.py        # AuthProvider protocol + LocalJWTProvider implementation
      jwt.py             # RS256 token issue, verify, refresh; key loaded from secrets
      mfa.py             # TOTP (pyotp); recovery code generation and verification
      deps.py            # get_current_user, require_role() dependency factory
      middleware.py      # Auth middleware; rate limit enforcement
    frame_utils.py       # TEME→GCRF→ITRF→WGS84 + IERS EOP refresh + hash verification
    time_utils.py        # Time system conversions
    integrity.py         # HMAC sign/verify for predictions and hazard zones
    logging_config.py    # Sanitising log formatter; security event logger
  modules/
    catalog/
      router.py          # /api/v1/objects; requires viewer role minimum
      schemas.py
      service.py
      models.py
    propagator/
      catalog.py         # SGP4 catalog propagation
      decay.py           # RK7(8) + NRLMSISE-00 + Monte Carlo; HMAC-signs output
      tasks.py           # Celery tasks with time_limit, soft_time_limit
      router.py          # /api/v1/propagate, /api/v1/decay; requires analyst role
    reentry/
      router.py          # /api/v1/reentry; requires viewer role
      service.py
      corridor.py        # Percentile corridor polygon generation
    spaceweather/
      router.py          # /api/v1/spaceweather; requires viewer role
      service.py         # Cross-validates NOAA SWPC vs ESA SWS; generates status string
      tasks.py           # Celery Beat: NOAA SWPC polling every 3h
      noaa_swpc.py       # NOAA SWPC client; URL hardcoded constant
      esa_sws.py         # ESA SWS cross-validation client
    viz/
      router.py          # /api/v1/czml; requires viewer role
      czml_builder.py    # CZML output; all strings HTML-escaped; J2000 INERTIAL frame
      mc_geometry.py     # MC trajectory binary blob pre-baking
    ingest/
      sources.py         # Hardcoded external URLs and IP allowlists (SSRF mitigation)
      tasks.py           # Celery Beat-scheduled tasks
      spacetrack.py      # Space-Track client; credentials from secrets manager only
      celestrak.py       # CelesTrak client
      discos.py          # ESA DISCOS client
      iers.py            # IERS EOP fetcher + SHA-256 verification
      cross_validator.py # TLE and space weather cross-source comparison
    alerts/
      router.py          # /api/v1/alerts; requires operator role for acknowledge
      service.py         # Alert trigger evaluation; rate limit enforcement; deduplication
      notifier.py        # WebSocket push + email; storm detection
      integrity_guard.py # TIP vs prediction cross-check; HMAC failure escalation
    reports/
      router.py          # /api/v1/reports; requires analyst role
      builder.py         # Section assembly; all user fields sanitised via bleach
      renderer_client.py # Internal HTTPS call to renderer service with sanitised payload
    security/
      audit.py           # Security event logger; writes to security_logs
      sanitiser.py       # Log formatter that strips credential patterns
    breakup/
      atmospheric.py
      on_orbit.py
      tasks.py
      router.py
    conjunction/
      screener.py
      probability.py
      tasks.py
      router.py
    weather/
      upper.py
      lower.py
    hazard/
      router.py
      fusion.py          # HMAC-signs all hazard_zones output; propagates shadow_mode flag
      tasks.py
    airspace/
      router.py
      loader.py
      intersection.py
    notam/
      router.py          # /api/v1/notam; requires operator role
      drafter.py         # ICAO Annex 15 format generation
      disclaimer.py      # Mandatory regulatory disclaimer text
    space_portal/
      router.py          # /api/v1/space; space_operator and orbital_analyst roles
      owned_objects.py   # Owned object CRUD; RLS enforcement
      controlled_reentry.py  # Deorbit window optimisation
      ccsds_export.py    # CCSDS OEM/CDM format export
      api_keys.py        # API key lifecycle management
    launch_safety/       # Phase 3
      screener.py
      router.py
    reroute/             # Phase 3; strategic pre-flight avoidance boundary only
    feedback/            # Phase 3; includes shadow_validation.py
  migrations/            # Alembic; includes immutability triggers in initial migration
  tests/
    conftest.py          # db_session fixture (SAVEPOINT/ROLLBACK); testcontainers setup for Celery tests
    physics/
      test_frame_utils.py
      test_propagator/
      test_decay/
      test_nrlmsise.py
      test_hypothesis.py   # Hypothesis property-based tests (§42.3)
      test_mc_corridor.py  # MC seeded RNG corridor validation (§42.4)
      test_breakup/
    test_integrity.py    # HMAC sign/verify; tamper detection
    test_auth.py         # JWT; MFA; rate limiting; RBAC enforcement
    test_rbac.py         # Every endpoint tested for correct role enforcement
    test_websocket.py    # WS sequence replay; token expiry warning; close codes 4001/4002
    test_ingest/
      test_contracts.py  # Space-Track + NOAA key presence AND value-range assertions
    test_spaceweather/
    test_jobs/
      test_celery_failure.py  # Timeout → 'failed'; orphan recovery Beat task
    smoke/               # Post-deploy; all idempotent; run in ≤ 2 min; require smoke_user seed
      test_api_health.py    # GET /readyz → 200/207; GET /healthz → 200
      test_auth_smoke.py    # Login → JWT; refresh → new token
      test_catalog_smoke.py # GET /catalog → 200; 'data' key present
      test_ws_smoke.py      # WS connect → heartbeat within 5s
      test_db_smoke.py      # SELECT 1 via backend health endpoint
    quarantine/          # Flaky tests awaiting fix; excluded from blocking CI (see §33.10 policy)
  requirements.in        # pip-tools source
  requirements.txt       # pip-compile output with hashes
  Dockerfile             # FROM pinned digest; non-root user; read-only FS

12.1 Repository docs/ Directory Structure

All documentation files live under docs/ in the monorepo root. Files referenced elsewhere in this plan must exist at these paths.

docs/
  README.md                          # Documentation index — what's here and where to look
  MASTER_PLAN.md                     # This document
  AGENTS.md                          # Guidance for AI coding agents working in this repo (see §33.9)
  CHANGELOG.md                       # Keep a Changelog format; human-maintained; one entry per release

  adr/                               # Architecture Decision Records (MADR format)
    README.md                        # ADR index with status column
    0001-rs256-asymmetric-jwt.md
    0002-dual-frontend-architecture.md
    0003-monte-carlo-chord-pattern.md
    0004-geography-vs-geometry-spatial-types.md
    0005-lazy-raise-sqlalchemy.md
    0006-timescaledb-chunk-intervals.md
    0007-cesiumjs-commercial-licence.md
    0008-pgbouncer-transaction-mode.md
    0009-ccsds-oem-gcrf-reference-frame.md
    0010-alert-threshold-rationale.md
    # ... continued; one ADR per consequential decision in §20

  runbooks/
    README.md                        # Runbook index with owner and last-reviewed date
    TEMPLATE.md                      # Standard runbook template (see §33.4)
    db-failover.md
    celery-recovery.md
    hmac-failure.md
    ingest-failure.md
    gdpr-breach-notification.md
    safety-occurrence-notification.md
    secrets-rotation-jwt.md
    secrets-rotation-spacetrack.md
    secrets-rotation-hmac.md
    blue-green-deploy.md
    restore-from-backup.md

  model-card-decay-predictor.md      # Living document; updated per model version (§32.1)
  ood-bounds.md                      # OOD detection thresholds (§32.3)
  recalibration-procedure.md         # Recalibration governance (§32.4)
  alert-threshold-history.md         # Alert threshold change log (§24.8)

  query-baselines/                   # EXPLAIN ANALYZE output; one file per critical query
    czml_catalog_100obj.txt
    fir_intersection_baseline.txt
    # ... one file per query baseline recorded in Phase 1

  validation/                        # Validation procedure and reference data (§17)
    README.md                        # How to run each validation suite
    reference-data/
      vallado-sgp4-cases.json        # Vallado (2013) SGP4 reference state vectors
      iers-frame-test-cases.json     # IERS precession-nutation reference cases
      aerospace-corp-reentries.json  # Historical re-entry outcomes for backcast validation
    backcast-validation-v1.0.0.pdf   # Phase 1 validation report (≥3 events)
    backcast-validation-v2.0.0.pdf   # Phase 2 validation report (≥10 events)

  api-guide/                         # Persona E/F API developer documentation (§33.10)
    README.md                        # API guide index
    authentication.md
    rate-limiting.md
    webhooks.md
    code-examples/
      python-quickstart.py
      typescript-quickstart.ts
    error-reference.md

  user-guides/                       # Operational persona documentation (§33.7)
    aviation-portal-guide.md         # Persona A/B/C
    space-portal-guide.md            # Persona E/F
    admin-guide.md                   # Persona D

  test-plan.md                       # Test suite index with scope and blocking classification (§33.11)

  public-reports/                    # Quarterly transparency reports (§32.6)
    # quarterly-accuracy-YYYY-QN.pdf

  legal/                             # Legal opinion documents (MinIO primary; this dir for dev reference)
    # legal-opinion-template.md

13. Frontend Directory Structure and Architecture

frontend/
  src/
    app/
      page.tsx                         # Operational Overview
      watch/[norad_id]/page.tsx        # Object Watch Page
      events/
        page.tsx                       # Active Events + full Timeline/Gantt
        [id]/page.tsx                  # Event Detail
      airspace/page.tsx                # Airspace Impact View
      analysis/page.tsx                # Analyst Workspace
      catalog/page.tsx                 # Object Catalog
      reports/
        page.tsx
        [id]/page.tsx
      admin/page.tsx                   # System Administration (admin role only)
      space/
        page.tsx                       # Space Operator Overview
        objects/
          page.tsx                     # My Objects Dashboard (space_operator: owned only)
          [norad_id]/page.tsx          # Object Technical Detail
        reentry/
          plan/page.tsx                # Controlled Re-entry Planner
        conjunction/page.tsx           # Conjunction Screening (orbital_analyst)
        analysis/page.tsx              # Orbital Analyst Workspace
        export/page.tsx                # Bulk Export
        api/page.tsx                   # API Keys + Documentation
      layout.tsx                       # Root layout: nav, ModeIndicator, AlertBadge,
                                       # JobsPanel; applies security headers via middleware

    middleware.ts                      # Next.js middleware: enforce HTTPS, set CSP
                                       # and security headers on every response,
                                       # redirect unauthenticated users to /login

    components/
      globe/
        CesiumViewer.tsx
        LayerPanel.tsx
        ViewToggle.tsx
        ClusterLayer.tsx
        CorridorLayer.tsx
        corridor/
          PercentileCorridors.tsx      # Mode A
          ProbabilityHeatmap.tsx       # Mode B (Phase 2)
          ParticleTrajectories.tsx     # Mode C (Phase 3)
        UncertaintyModeSelector.tsx
      plan/
        PlanView.tsx                   # Phase 2
        AltitudeCrossSection.tsx       # Phase 2
      timeline/
        TimelineStrip.tsx
        TimelineGantt.tsx
        TimelineControls.tsx
        ModeIndicator.tsx
      panels/
        ObjectInfoPanel.tsx
        PredictionPanel.tsx            # Includes HMAC status indicator
        AirspaceImpactPanel.tsx        # Phase 2
        ConjunctionPanel.tsx           # Phase 2
      alerts/
        AlertBanner.tsx
        AlertBadge.tsx
        NotificationCentre.tsx
        AcknowledgeDialog.tsx
      jobs/
        JobsPanel.tsx
        JobProgressBar.tsx
        SimulationComparison.tsx
      spaceweather/
        SpaceWeatherWidget.tsx
      reports/
        ReportConfigDialog.tsx
        ReportPreview.tsx
      space/
        SpaceOverview.tsx
        OwnedObjectCard.tsx
        ControlledReentryPlanner.tsx
        DeorbitWindowList.tsx
        ApiKeyManager.tsx
        CcsdsExportPanel.tsx
        ShadowBanner.tsx             # Amber banner displayed when shadow mode active
      notam/
        NotamDraftViewer.tsx
        NotamCancellationDialog.tsx
        NotamRegulatoryDisclaimer.tsx
      shadow/
        ShadowModeIndicator.tsx
        ShadowValidationReport.tsx
      dashboard/
        EventSummaryCard.tsx
        SystemHealthCard.tsx
      shared/
        DataConfidenceBadge.tsx
        IntegrityStatusBadge.tsx       # ✓ HMAC verified / ✗ HMAC failed
        UncertaintyBound.tsx
        CountdownTimer.tsx

    hooks/
      useObjects.ts
      usePrediction.ts                 # Polls HMAC status; shows warning if failed
      useEphemeris.ts
      useSpaceWeather.ts
      useAlerts.ts
      useSimulation.ts
      useCZML.ts
      useWebSocket.ts                  # Cookie-based auth; per-user connection limit

    stores/                            # Zustand — UI state only; no API responses
      timelineStore.ts                 # Mode, playhead position, playback speed
      selectionStore.ts                # Selected object/event/zone IDs
      layerStore.ts                    # Layer visibility, corridor display mode
      jobsStore.ts                     # Active job IDs (content fetched via TanStack Query)
      alertStore.ts                    # Unread count, mute rules
      uiStore.ts                       # Panel state, theme (dark/light/high-contrast)

    lib/
      api.ts                           # Typed fetch wrapper; credentials: 'include'
                                       # for httpOnly cookie auth; never reads tokens
      czml.ts
      ws.ts                            # wss:// enforced; cookie auth at upgrade
      corridorGeometry.ts
      mcBinaryDecoder.ts
      reportUtils.ts

    types/
      objects.ts
      predictions.ts                   # Includes hmac_status, integrity_failed fields
      alerts.ts
      spaceweather.ts
      simulation.ts
      czml.ts

  public/
    branding/
  middleware.ts                        # Root Next.js middleware for security headers
  next.config.ts                       # Content-Security-Policy defined here for SSR
  tsconfig.json
  package.json
  package-lock.json                    # Committed; npm ci used in Docker builds

13.0 Accessibility Standard Commitment

Minimum standard: WCAG 2.1 Level AA, which is incorporated by reference into EN 301 549 v3.2.1 — the mandatory accessibility standard for ICT procured by EU public sector bodies including ESA. (ISO/IEC 40500:2012 corresponds to the earlier WCAG 2.0.) Failure to meet EN 301 549 is a bid disqualifier for any EU public sector tender.

All frontend work must meet these criteria before a PR is merged:

  • WCAG 2.1 AA automated check passes (axe-core — see §42)
  • Keyboard-only operation possible for all primary operator workflows
  • Screen reader (NVDA + Firefox; VoiceOver + Safari) tested for primary workflow on each release
  • Colour contrast ≥ 4.5:1 for all informational text; ≥ 3:1 for UI components and graphical elements
  • No functionality conveyed by colour alone

Deliverable: Accessibility Conformance Report (ACR / VPAT 2.4) produced before Phase 2 ESA bid submission. Maintained thereafter for each major release.

UTC-only rule for operational interface (F1): ICAO Annex 2 and Annex 15 mandate UTC for all aeronautical operational communications. The following is a hard rule — no exceptions without explicit documentation and legal/safety sign-off:

  • All times displayed in Persona A/C operational views (alert panels, event detail, NOTAM draft, shift handover) are UTC only, formatted as HH:MMZ or DD MMM YYYY HH:MMZ
  • No timezone conversion widget or local-time toggle in the operational interface
  • Local time display is permitted only in non-operational views (account settings, admin billing pages) and must be clearly labelled with the timezone name
  • The Z suffix or UTC label is persistently visible — never hidden in a tooltip or hover state
  • All API timestamps returned as ISO 8601 UTC (2026-03-22T14:00:00Z) — never local time strings
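The two display formats above can be produced by a pair of small formatters (a sketch; function and constant names are assumptions):

```typescript
// UTC-only formatters for the operational interface: HH:MMZ and DD MMM YYYY HH:MMZ.
const MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"];

export function formatUtcShort(d: Date): string {
  const hh = String(d.getUTCHours()).padStart(2, "0");
  const mm = String(d.getUTCMinutes()).padStart(2, "0");
  return `${hh}:${mm}Z`;   // Z suffix always rendered, never tooltip-only
}

export function formatUtcLong(d: Date): string {
  const dd = String(d.getUTCDate()).padStart(2, "0");
  return `${dd} ${MONTHS[d.getUTCMonth()]} ${d.getUTCFullYear()} ${formatUtcShort(d)}`;
}
```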

13.1 State Management Separation

TanStack Query: All API-derived data — object lists, predictions, ephemeris, space weather, alerts, simulation results. Handles caching, background refetch, and stale-while-revalidate.

Zustand: Pure UI state with no server dependency — selected IDs, layer visibility, timeline mode and position, panel open/closed state, theme, alert mute rules.

URL state (nuqs): Shareable, bookmarkable — selected NORAD ID, active event ID, time position in replay mode, active layer set. Browser back/forward works correctly. Requires NuqsAdapter wrapping the App Router root layout to hydrate correctly on SSR.

Never in state: Raw API response bodies. No useEffect that writes API responses into Zustand.

Authentication in the client: The api.ts fetch wrapper uses credentials: 'include' to send the httpOnly auth cookie automatically. The client never reads, stores, or handles the JWT token directly — it is invisible to JavaScript. CSRF is mitigated by SameSite=Strict on the cookie.
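A minimal sketch of the api.ts wrapper (base path and error shape are assumptions; the essential part is credentials: 'include' and the absence of any token handling):

```typescript
// Typed fetch wrapper: the httpOnly cookie rides along automatically;
// JavaScript never reads, stores, or forwards the JWT.
export async function apiFetch<T>(path: string, init: RequestInit = {}): Promise<T> {
  const resp = await fetch(`/api/v1${path}`, {
    ...init,
    credentials: "include",
  });
  if (!resp.ok) throw new Error(`API ${resp.status} on ${path}`);
  return resp.json() as Promise<T>;
}
```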

Next.js App Router component boundary (ADR 0018): The project uses App Router. The globe and all operational views are client components; static pages (onboarding, settings, admin) are React Server Components where practical.

| Route group | RSC/Client | Rationale |
| --- | --- | --- |
| app/(globe)/ — operational views | "use client" root layout | CesiumJS, WebSocket, Zustand hooks require browser APIs |
| app/(static)/ — onboarding, settings | Server Components by default | No browser APIs needed; faster initial load |
| app/(auth)/ — login, MFA | Server Components + Client islands | Form validation islands only |

Rules enforced in AGENTS.md:

  • Never add "use client" to a leaf component without a comment explaining which browser API requires it
  • app/(globe)/layout.tsx is the single "use client" boundary for all operational views — child components inherit it without re-declaring
  • nuqs requires <NuqsAdapter> at the root of app/(globe)/layout.tsx

TanStack Query key factory (src/lib/queryKeys.ts) — stable hierarchical keys prevent cache invalidation bugs:

export const queryKeys = {
  objects: {
    all: ()           => ['objects'] as const,
    list: (f: ObjectFilters) => ['objects', 'list', f] as const,
    detail: (id: number)    => ['objects', 'detail', id] as const,
    tleHistory: (id: number) => ['objects', id, 'tle-history'] as const,
  },
  predictions: {
    byObject: (id: number) => ['predictions', id] as const,
  },
  alerts: {
    all:    ()           => ['alerts'] as const,
    unacked: (orgId: number) => ['alerts', 'unacked', orgId] as const,
  },
  jobs: {
    detail: (jobId: string) => ['jobs', jobId] as const,
  },
} as const;
// On WS alert.new: queryClient.invalidateQueries({ queryKey: queryKeys.alerts.all() })
// On acknowledge mutation: optimistic setQueryData, then invalidate on settle

React error boundary hierarchy — a CesiumJS crash must never remove the alert panel from the DOM:

// app/(globe)/layout.tsx
<AppErrorBoundary fallback={<AppCrashPage />}>
  <GlobeErrorBoundary fallback={<GlobeUnavailable />}>
    <GlobeCanvas />               {/* WebGL context loss isolated here */}
  </GlobeErrorBoundary>
  <PanelErrorBoundary name="alerts">
    <AlertPanel />                {/* Survives globe crash */}
  </PanelErrorBoundary>
  <PanelErrorBoundary name="events">
    <EventList />
  </PanelErrorBoundary>
</AppErrorBoundary>

GlobeUnavailable displays: "Globe unavailable — WebGL context lost. Re-entry event data below remains operational." Alert and event panels remain visible and functional. Add GlobeErrorBoundary to AGENTS.md safety-critical component list.

Loading and empty state specification — for safety-critical panels, loading and empty must be visually distinct from each other and from error:

| State | Visual treatment | Required text |
| --- | --- | --- |
| Loading | Skeleton matching panel layout | — |
| Empty | Explicit affirmative message | AlertPanel: "No unacknowledged alerts"; EventList: "No active re-entry events" |
| Error | Inline error with retry button | Never blank |

Rule: safety-critical panels (AlertPanel, EventList, PredictionPanel) must never render blank. DataConfidenceBadge must always show a value — display "Unknown" explicitly, never render nothing.
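The DataConfidenceBadge rule reduces to a tiny total function (type and function names are assumptions):

```typescript
// "Never render nothing": an absent confidence value becomes an explicit label.
type DataConfidence = "DISCOS" | "estimated" | "operator-declared" | null | undefined;

export function confidenceLabel(c: DataConfidence): string {
  return c ?? "Unknown"; // explicit "Unknown" instead of an empty badge
}
```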

WebSocket reconnection policy (src/lib/ws.ts):

const RECONNECT = {
  initialDelayMs: 1_000,
  maxDelayMs:     30_000,
  multiplier:     2,
  jitter:         0.2,   // ±20% — spreads reconnections after mass outage/deploy
};
// TOKEN_EXPIRY_WARNING handler: trigger silent POST /auth/token/refresh;
//   on success send AUTH_REFRESH; on failure show re-login modal (60s grace before disconnect)
// Reconnect sends ?since_seq=<last_seq> for missed event replay
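The policy constants map to a delay function along these lines (a self-contained sketch that duplicates the constants; attempt is zero-based and rand is injectable for testing):

```typescript
// Exponential backoff with cap and ±jitter, per the RECONNECT policy above.
const POLICY = { initialDelayMs: 1_000, maxDelayMs: 30_000, multiplier: 2, jitter: 0.2 };

export function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  const base = Math.min(
    POLICY.initialDelayMs * POLICY.multiplier ** attempt,
    POLICY.maxDelayMs,
  );
  const factor = 1 + (rand() * 2 - 1) * POLICY.jitter; // rand() = 0.5 → no jitter
  return Math.round(base * factor);
}
```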

Operational mode guard (src/hooks/useModeGuard.ts) — enforces LIVE/SIMULATION/REPLAY write restrictions:

export function useModeGuard(allowedModes: OperationalMode[]) {
  const { mode } = useTimelineStore();
  return { isAllowed: allowedModes.includes(mode), currentMode: mode };
}
// Usage: const { isAllowed } = useModeGuard(['LIVE']);
// All write-action components (acknowledge alert, submit NOTAM draft, trigger prediction)
// must call useModeGuard(['LIVE']) and disable + annotate button in other modes.

Deck.gl + CesiumJS integration — use DeckLayer from @deck.gl/cesium (rendered inside CesiumJS as a primitive; correct z-order and shared input handling). Never use a separate Deck.gl canvas:

import { DeckLayer } from '@deck.gl/cesium';
import { HeatmapLayer } from '@deck.gl/aggregation-layers';

const deckLayer = new DeckLayer({
  layers: [new HeatmapLayer({ id: 'mc-heatmap', data: mcTrajectories,
    getPosition: d => [d.lon, d.lat], getWeight: d => d.weight,
    radiusPixels: 30, intensity: 1, threshold: 0.03 })],
});
viewer.scene.primitives.add(deckLayer);
// Remove when switching away from Mode B: viewer.scene.primitives.remove(deckLayer)

CesiumJS client-side memory constraints:

| Constraint | Value | Enforcement |
| --- | --- | --- |
| Max CZML entity count in globe | 500 | Prune lowest-perigee objects beyond 500; useCZML monitors count |
| Orbit path duration | 72h forward / 24h back | Longer paths accumulate geometry |
| Heatmap cell resolution (Mode B) | 0.5° × 0.5° | Higher resolution requires more GPU memory |
| Stale entity pruning | Remove entities not updated in 48h | Prevents ghost entities in long sessions |
| Globe entity count | Prometheus gauge spacecom_globe_entity_count | WARNING alert at 450; prune trigger at 500 |

Bundle size budget and dynamic imports:

| Bundle | Strategy | Budget (gzipped) |
| --- | --- | --- |
| Login / onboarding / settings | Static; no CesiumJS/Deck.gl | < 200 KB |
| Globe route initial load | CesiumJS lazy-loaded; spinner shown | < 500 KB before CesiumJS |
| Globe fully loaded | CesiumJS + Deck.gl + app | < 8 MB |

// src/components/globe/GlobeCanvas.tsx
import dynamic from 'next/dynamic';
const CesiumViewer = dynamic(
  () => import('./CesiumViewerInner'),
  { ssr: false, loading: () => <GlobeLoadingState /> }
);

bundlewatch (or @next/bundle-analyzer) in CI; warning (non-blocking) if initial route bundle exceeds budget. Baseline stored in .bundle-size-baseline.


13.2 Accessible Parallel Table View (F4)

The CesiumJS WebGL globe is inherently inaccessible: no keyboard navigation, no screen reader support, no motor-impairment accommodation. All interactions available via the globe must also be available via a parallel data table view.

Component: src/components/globe/ObjectTableView.tsx

  • Accessible via keyboard shortcut Alt+T from any operational view, and via a persistent visible "Table view" button in the globe toolbar
  • Displays all objects currently rendered on the globe: NORAD ID, name, orbit type, conjunction status badge, predicted re-entry window, alert level
  • Sortable by any column (aria-sort updated on header click/keypress); filterable by alert level
  • Row selection focuses the object's Event Detail panel (same as map click)
  • All alert acknowledgement actions reachable from the table view — no functionality requires the globe
  • Implemented as <table> with <thead>, <tbody>, <th scope="col">, <th scope="row"> — no ARIA table role substitutes where native HTML suffices
  • Pagination or virtual scroll for large object sets; aria-rowcount and aria-rowindex set correctly for virtualised rows

The table view is the primary interaction surface for users who cannot use the map. It must be functionally complete, not a read-only summary.


13.3 Keyboard Navigation Specification (F6)

All primary operator workflows must be completable by keyboard alone. Required implementation:

Skip links (rendered as the first focusable element in the page, visible on focus):

<a href="#alert-panel" class="skip-link">Skip to alert panel</a>
<a href="#main-content" class="skip-link">Skip to main content</a>
<a href="#object-table" class="skip-link">Skip to object table</a>

Focus ring: Minimum 3px solid outline, ≥ 3:1 contrast against adjacent colours (meets the AA baseline of WCAG 2.4.7 Focus Visible and follows the WCAG 2.2 Focus Appearance guidance, SC 2.4.13). Never outline: none without a custom focus indicator. Defined in design tokens: --focus-ring: 3px solid #4A9FFF.

Tab order: Follows DOM order (no tabindex > 0). Logical flow: nav → alert panel → map toolbar → main content. Modal dialogs trap focus within the dialog while open; focus returns to the trigger element on close.
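The wrap-around behaviour of the focus trap reduces to a pure index computation, sketched here in TypeScript (the helper name is an assumption, not the codebase API):

```typescript
// Given the number of focusable elements inside an open dialog, the index of
// the currently focused element, and whether Shift was held, return the index
// that should receive focus next so Tab cycles within the dialog.
function nextFocusIndex(count: number, current: number, shiftKey: boolean): number {
  if (count === 0) return -1;               // nothing focusable in the dialog
  const delta = shiftKey ? -1 : 1;          // Shift+Tab moves backwards
  return (current + delta + count) % count; // wrap at both ends
}
```

A keydown handler for Tab inside the dialog calls this, focuses the element at the returned index, and prevents the default browser behaviour.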

Application keyboard shortcuts (all documented in UI via ? help overlay):

| Shortcut | Action |
| --- | --- |
| Alt+A | Focus most-recent active CRITICAL alert |
| Alt+T | Toggle table / globe view |
| Alt+H | Open shift handover view |
| Alt+N | Open NOTAM draft for active event |
| ? | Open keyboard shortcut reference overlay |
| Escape | Close modal / dismiss non-CRITICAL overlay |
| Arrow keys | Navigate within alert list, table rows, accordion items |

All shortcuts declared via aria-keyshortcuts on their trigger elements. No shortcut conflicts with browser or screen reader reserved keys.


13.4 Colour and Contrast Specification (F7)

All colour pairs must meet WCAG 2.1 AA contrast requirements. Documented in frontend/src/tokens/colours.ts as design tokens; no hardcoded colour values in component files.

Operational severity palette (dark theme — background: #1A1A2E):

| Severity | Background | Text | Contrast ratio | Status |
| --- | --- | --- | --- | --- |
| CRITICAL | #7B4000 | #FFFFFF | 7.2:1 | ✓ AA |
| HIGH | #7A3B00 | #FFD580 | 5.1:1 | ✓ AA |
| MEDIUM | #1A3A5C | #90CAF9 | 4.6:1 | ✓ AA |
| LOW | #1E3A2F | #81C784 | 4.5:1 | ✓ AA (minimum) |
| Focus ring | #1A1A2E | #4A9FFF | 4.8:1 | ✓ AA |

All pairs verified with the APCA algorithm for large display text (corridor labels on the globe). If a colour fails at the target background, the background is adjusted — the text colour is kept consistent for operator recognition.

Number formatting (F4): Probability values, altitudes, and distances must be formatted correctly across locales:

  • Operational interface (Persona A/C): Always use ICAO-standard decimal point (.) regardless of browser locale — deviating from locale convention is intentional and matches ICAO Doc 8400 standards; this is documented as an explicit design decision
  • Admin / reporting / Space Operator views: Use Intl.NumberFormat(locale) for locale-aware formatting (comma decimal separator in DE/FR/ES locales)
  • Helper: formatOperationalNumber(n: number): string — always . decimal, 3 significant figures for probabilities; formatDisplayNumber(n: number, locale: string): string — locale-aware
  • Never use raw Number.toString() or n.toFixed() in JSX — both ignore locale
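A minimal sketch of the two helpers named above (the rounding and API details here are assumptions):

```typescript
// Operational display (Persona A/C): always ICAO decimal point, 3 significant
// figures. The toPrecision + Number round-trip is locale-independent, which is
// exactly the intent for operational values.
function formatOperationalNumber(n: number): string {
  return Number(n.toPrecision(3)).toString();
}

// Admin / reporting display: locale-aware via Intl.NumberFormat (comma decimal
// separator in DE/FR/ES locales).
function formatDisplayNumber(n: number, locale: string): string {
  return new Intl.NumberFormat(locale).format(n);
}
```

formatOperationalNumber(0.001234) yields "0.00123" in every locale; formatDisplayNumber(1234.5, "de-DE") yields "1.234,5".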

Non-colour severity indicators (F5): Colour must never be the sole differentiator. Each severity level also carries:

| Severity | Icon/shape | Text label | Border width |
| --- | --- | --- | --- |
| CRITICAL | ⬟ (octagon) | "CRITICAL" always visible | 3px solid |
| HIGH | ▲ (triangle) | "HIGH" always visible | 2px solid |
| MEDIUM | ● (circle) | "MEDIUM" always visible | 1px solid |
| LOW | ○ (circle outline) | "LOW" always visible | 1px dashed |

The 1 Hz CRITICAL colour cycle (§28.3 habituation countermeasure) must also include a redundant non-colour animation: 1 Hz border-width pulse (2px → 4px → 2px). Users with prefers-reduced-motion: reduce see a static thick border instead (see §28.3 reduced-motion rules).
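The pulse and its reduced-motion fallback can be sketched in CSS (the class name is assumed):

```css
/* Redundant non-colour cue: 1 Hz border-width pulse on CRITICAL alerts */
@keyframes critical-border-pulse {
  0%, 100% { border-width: 2px; }
  50%      { border-width: 4px; }
}

.alert-critical {
  border: 2px solid currentColor;
  animation: critical-border-pulse 1s infinite;
}

/* Static thick border replaces the animation (§28.3 reduced-motion rules) */
@media (prefers-reduced-motion: reduce) {
  .alert-critical {
    animation: none;
    border-width: 4px;
  }
}
```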


13.5 Internationalisation Architecture (F5, F8, F11)

Language scope — Phase 1: English only. No other locale is served. This is not a gap — it is an explicit decision that allows Phase 1 to ship without a localisation workflow. The architecture is designed so that adding a new locale requires only adding a messages/{locale}.json file and testing; no component code changes.

String externalisation strategy:

  • Library: next-intl (native Next.js App Router support, RSC-compatible, type-safe message keys)
  • Source of truth: messages/en.json — all user-facing strings, namespaced by feature area
  • Message ID convention: {feature}.{component}.{element} e.g. alerts.critical.title, handover.accept.button
  • No bare string literals in JSX (enforced by eslint-plugin-i18n-json or equivalent)
  • ICAO-fixed strings are excluded from i18n scope and must never appear in messages/en.json — they are hardcoded constants. Examples: NOTAM, UTC, SIGMET, category codes (NOTAM_ISSUED), ICAO phraseology in NOTAM templates. These are annotated // ICAO-FIXED: do not translate in source
messages/
  en.json          # Source of truth — Phase 1 complete
  fr.json          # Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy)
  de.json          # Phase 3 scaffold

CSS logical properties (F8): All new components use CSS logical properties instead of directional utilities, making RTL support a configuration change rather than a code rewrite:

| Avoid | Use instead |
| --- | --- |
| margin-left, ml-* | margin-inline-start, ms-* |
| margin-right, mr-* | margin-inline-end, me-* |
| padding-left, pl-* | padding-inline-start, ps-* |
| padding-right, pr-* | padding-inline-end, pe-* |
| left: 0 | inset-inline-start: 0 |
| text-align: left | text-align: start |

The <html> element carries dir="ltr" (hardcoded for Phase 1). When an RTL locale is added, this becomes dir={locale.dir} — no component changes required. RTL testing with an Arabic locale is a Phase 3 gate before any Middle East deployment.

Altitude and distance unit display (F9): Aviation and space domain use different unit conventions. All altitudes and distances are stored and transmitted in metres (SI base unit) in the database and API. The display layer converts based on users.altitude_unit_preference:

| Role default | Unit | Display example |
| --- | --- | --- |
| ansp_operator | ft | 39,370 ft (FL394) |
| space_operator | km | 12.0 km |
| analyst | km | 12.0 km |

Rules:

  • Unit label always shown alongside the value — no bare numbers
  • aria-label provides full unit name: aria-label="39,370 feet (Flight Level 394)"
  • User can override their default in account settings via PATCH /api/v1/users/me
  • API always returns metres; unit conversion is client-side only
  • FL (Flight Level) shown in parentheses for ft display when altitude > 0 ft MSL and context is airspace
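A sketch of the client-side metres-to-feet conversion with the FL suffix (the helper name is an assumption; 1 ft = 0.3048 m exactly, and FL is altitude in hundreds of feet):

```typescript
// Convert API metres to an aviation feet display with Flight Level suffix.
function formatAltitudeFeet(metres: number): string {
  const feet = Math.round(metres / 0.3048);
  const flightLevel = Math.round(feet / 100);  // FL = hundreds of feet
  const display = new Intl.NumberFormat('en-US').format(feet);
  return `${display} ft (FL${flightLevel})`;
}
```

formatAltitudeFeet(12000) produces "39,370 ft (FL394)", matching the role-default table above.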

Altitude datum labelling (F11 — §62): The SGP4 propagator and NRLMSISE-00 output altitudes above the WGS-84 ellipsoid. Aviation altimetry uses altitude above Mean Sea Level (MSL). The geoid height (difference between ellipsoid and MSL) varies globally from approximately −106 m to +85 m (EGM2008). For operational altitudes (below ~25 km / 82,000 ft during re-entry terminal phase), this difference is significant.

Required labelling rule: All altitude displays must specify the datum. The datum is a non-configurable system constant per altitude context:

| Altitude context | Datum | Display example | Notes |
| --- | --- | --- | --- |
| Orbital altitude (> 80 km) | WGS-84 ellipsoid | 185 km (ellipsoidal) | SGP4 output; geoid difference negligible at orbital altitudes |
| Re-entry corridor boundary | WGS-84 ellipsoid | 80 km (ellipsoidal) | Model boundary altitude |
| Fragment impact altitude | WGS-84 ellipsoid | 0 km (ellipsoidal) → display as ground level | Converted at display time |
| Airspace sector boundary (FL) | Barometric, standard pressure setting (1013.25 hPa) | FL390 / 39,000 ft (STD) | Aviation standard; NOT ellipsoidal |
| Terrain clearance / NOTAM lower bound | MSL (approx. ellipsoidal for > 1,000 ft) | 5,000 ft MSL | Use MSL label explicitly |

Implementation: formatAltitude(metres, context) helper accepts a context parameter ('orbital' | 'airspace' | 'notam') and appends the appropriate datum label. The datum label is rendered in a smaller secondary font weight alongside the altitude value — not in aria-label alone.
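A sketch of the context-to-datum mapping that formatAltitude(metres, context) encodes (labels per the table above; unit selection per F9 and the secondary-font rendering are elided, and the helper body is an assumption):

```typescript
type AltitudeContext = 'orbital' | 'airspace' | 'notam';

// Datum label per altitude context — a non-configurable system constant.
const DATUM_LABEL: Record<AltitudeContext, string> = {
  orbital: 'ellipsoidal',  // WGS-84 ellipsoid (SGP4 / NRLMSISE-00 output)
  airspace: 'FL',          // barometric flight levels
  notam: 'MSL',            // mean sea level for NOTAM lower bounds
};

function formatAltitude(metres: number, context: AltitudeContext): string {
  const km = (metres / 1000).toFixed(0);  // unit rules from F9 elided here
  return `${km} km (${DATUM_LABEL[context]})`;
}
```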

API response datum field: The prediction API response must include altitude_datum: "WGS84_ELLIPSOIDAL" alongside any altitude value. Consumers must not assume a datum that is not stated.

Future locale addition checklist (documented in docs/ADDING_A_LOCALE.md):

  1. Add messages/{locale}.json translated by a native-speaker aviation professional
  2. Verify all ICAO-fixed strings are excluded from translation
  3. Set dir for the locale (ltr/rtl)
  4. Run automated RTL layout tests if dir=rtl
  5. Confirm operational time display still shows UTC (not locale timezone)
  6. Legal review of any jurisdiction-specific compliance text

13.6 Contribution Workflow (F3)

CONTRIBUTING.md at the repository root is a required document. It defines how contributors (internal engineers, auditors, future ESA-directed reviewers) engage with the codebase.

Branch naming convention:

| Branch type | Pattern | Example |
| --- | --- | --- |
| Feature | feature/{ticket-id}-short-description | feature/SC-142-decay-unit-pref |
| Bug fix | fix/{ticket-id}-short-description | fix/SC-200-hmac-null-check |
| Chore / dependency | chore/{description} | chore/bump-fastapi-0.115 |
| Release | release/{semver} | release/1.2.0 |
| Hotfix | hotfix/{semver} | hotfix/1.1.1 |

No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push.

Commit message format: Conventional Commits — type(scope): description. Types: feat, fix, chore, docs, refactor, test, ci. Example: feat(decay): add p01/p99 tail risk columns.

PR template (.github/pull_request_template.md):

## Summary
<!-- What does this PR do? -->

## Linked ticket
<!-- e.g. SC-142 -->

## Checklist
- [ ] `make test` passes locally
- [ ] OpenAPI spec regenerated (`make generate-openapi`) if API changed
- [ ] CHANGELOG.md updated under `[Unreleased]`
- [ ] axe-core accessibility check passes if UI changed
- [ ] Contract test passes if API response shape changed
- [ ] ADR created if an architectural decision was made

Review SLA: Pull requests must receive a first review within 1 business day of opening. Stale PRs (no activity > 3 business days) are labelled stale automatically.


13.7 Architecture Decision Records (F4)

ADRs (Nygard format) are the lightweight record for code-level and architectural decisions. They live in docs/adr/ and are numbered sequentially.

When to write an ADR: Any decision that is:

  • Hard to reverse (e.g., choosing a library, a DB schema approach, an algorithm)
  • Likely to confuse a future contributor who finds the code without context
  • Required by a public-sector procurement framework (ESA specifically requests evidence of a structured decision process)
  • Referenced in a specialist review appendix (§45–§54 all reference ADR numbers)

Format (docs/adr/NNNN-title.md):

# ADR NNNN: Title

**Status:** Proposed | Accepted | Deprecated | Superseded by ADR MMMM
**Date:** YYYY-MM-DD

## Context
What problem are we solving? What constraints apply?

## Decision
What did we decide?

## Consequences
What becomes easier? What becomes harder? What is now out of scope?

Known ADRs referenced in this plan:

| ADR | Topic |
| --- | --- |
| 0001 | FastAPI over Django REST Framework |
| 0002 | TimescaleDB + PostGIS for orbital time-series |
| 0003 | CesiumJS + Deck.gl for 3D globe rendering |
| 0004 | next-intl for string externalisation |
| 0005 | Append-only alert_events with HMAC signing |
| 0016 | NRLMSISE-00 vs JB2008 atmospheric density model |

All ADR numbers referenced in this document must have a corresponding docs/adr/NNNN-*.md file before Phase 2 ESA submission. New ADRs start at the next available number.


13.8 Developer Environment Setup (F6)

docs/DEVELOPMENT.md is a required onboarding document. A new engineer must be able to run a fully functional local environment within 30 minutes of reading it. The document covers:

  1. Prerequisites: Python 3.11 (pinned in .python-version), Node.js 20 LTS, Docker Desktop, make
  2. Environment bootstrap:
    cp .env.example .env          # review and fill required values
    make init-dirs                # creates logs/, exports/, config/, backups/ on host
    make dev-up                   # docker compose up -d postgres redis minio
    make migrate                  # alembic upgrade head
    make seed                     # load development fixture data (10 tracked objects, sample TIPs)
    make dev                      # starts: uvicorn + Next.js dev server + Celery worker
    
  3. Running tests:
    make test                     # full test suite (backend + frontend)
    make test-backend             # backend only (pytest)
    make test-frontend            # frontend only (jest + playwright)
    make test-e2e                 # Playwright end-to-end (requires make dev running)
    
  4. Useful local URLs:
    • API: http://localhost:8000 / Swagger UI: http://localhost:8000/docs
    • Frontend: http://localhost:3000
    • MinIO console: http://localhost:9001 (credentials in .env.example)
  5. Common issues: documented in a ## Troubleshooting section covering: Docker port conflicts, TimescaleDB first-run migration failure, CesiumJS ion token missing.

.env.example is committed and kept up-to-date with all required variables (no value — keys only). .env is in .gitignore and must never be committed.


13.9 Docs-as-Code Pipeline (F10)

All project documentation (this plan, runbooks, ADRs, OpenAPI spec, data provenance records) is version-controlled in the repository and validated by CI.

Documentation site: MkDocs Material. Source in docs/. Published to GitHub Pages on merge to main. Configuration in mkdocs.yml.

CI documentation checks (run on every PR):

  • mkdocs build --strict — fails on broken links, missing pages, invalid nav
  • markdown-link-check docs/ — external link validation (warns, does not fail, to avoid flaky CI on transient outages)
  • openapi-diff — spec drift check (see §14 F1)
  • vale --config=.vale.ini docs/ — prose style linter (SpaceCom style guide: no passive voice in runbooks, consistent terminology table for re-entry vs reentry)

ESA submission artefact: The MkDocs build output (static HTML) is archived as a CI artefact on each release tag. This provides a reproducible, point-in-time documentation snapshot for the ESA bid submission. The submission artefact is docs-site-{version}.zip stored in the GitHub release assets.

Docs owner: Each section of the documentation has an owner: frontmatter field. The owner is responsible for keeping the section current after their feature area changes. Missing or stale ownership is flagged by a quarterly docs-review GitHub issue auto-created by a cron workflow.
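One possible GitHub Actions job shape for the PR-time documentation checks listed above (step details are assumptions; the tool names are from this section):

```yaml
# .github/workflows/docs.yml — PR-time documentation checks (sketch)
docs-checks:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install mkdocs-material
    - run: mkdocs build --strict          # broken links, missing pages, invalid nav
    - run: vale --config=.vale.ini docs/  # prose style linter
    # external link validation warns only — transient outages must not fail CI
    - run: find docs -name '*.md' -print0 | xargs -0 -n1 npx markdown-link-check || true
```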


14. API Design

Base path: /api/v1. All endpoints require authentication (minimum viewer role) unless noted. Role requirements listed per group.

System (no auth required)

  • GET /health — liveness probe; returns 200 {"status": "ok", "version": "<semver>"} if the process is running. Used by Docker/Kubernetes liveness probe and load balancer health check. Does not check downstream dependencies — a healthy response means only that the API process is alive.
  • GET /readyz — readiness probe; returns 200 {"status": "ready", "checks": {...}} when all dependencies are reachable. Returns 503 if any required dependency is unhealthy. Checks performed: PostgreSQL (query SELECT 1), Redis (PING), Celery worker queue depth < 1000. Used by DR automation to confirm the new primary is accepting traffic before updating DNS (§26.3). Also included in OpenAPI spec under tags: ["System"].
// GET /readyz — healthy response example
{
  "status": "ready",
  "checks": {
    "postgres": "ok",
    "redis": "ok",
    "celery_queue_depth": 42
  },
  "version": "1.2.3"
}
// GET /readyz — unhealthy response (503)
{
  "status": "not_ready",
  "checks": {
    "postgres": "ok",
    "redis": "error: connection refused",
    "celery_queue_depth": 42
  }
}

Auth

  • POST /auth/token — login; returns httpOnly cookie (access) + httpOnly cookie (refresh); rate-limited 10/min/IP
  • POST /auth/token/refresh — rotate refresh token; rate-limited
  • POST /auth/mfa/verify — complete MFA; issues full-access token
  • POST /auth/logout — revoke refresh token; clear cookies

Catalog (viewer minimum)

  • GET /objects — list/search (paginated; filter by type, perigee, decay status, data_confidence)
  • GET /objects/{norad_id} — detail with TLE, physical properties, data confidence annotation
  • POST /objects — manual entry (operator role)
  • GET /objects/{norad_id}/tle-history — full TLE history including cross-validation status

Propagation (analyst role)

  • POST /propagate — submit catalog propagation job
  • GET /propagate/{task_id} — poll status
  • GET /objects/{norad_id}/ephemeris?start=&end=&step= — time range and step validation (Finding 7):

    | Parameter | Constraint | Error code |
    | --- | --- | --- |
    | start | ≥ TLE epoch − 7 days; ≤ now + 90 days | EPHEMERIS_START_OUT_OF_RANGE |
    | end | start < end ≤ start + 30 days | EPHEMERIS_END_OUT_OF_RANGE |
    | step | ≥ 10 seconds and ≤ 86,400 seconds | EPHEMERIS_STEP_OUT_OF_RANGE |
    | Computed points | (end − start) / step ≤ 100,000 | EPHEMERIS_TOO_MANY_POINTS |
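The constraints above can be expressed as a single validation pass; a sketch (the function name and signature are assumptions, not the codebase API):

```python
from datetime import datetime, timedelta

MAX_EPHEMERIS_POINTS = 100_000

def validate_ephemeris_request(start: datetime, end: datetime, step_s: int,
                               tle_epoch: datetime, now: datetime):
    """Return the error code for the first violated constraint, or None if valid."""
    if start < tle_epoch - timedelta(days=7) or start > now + timedelta(days=90):
        return "EPHEMERIS_START_OUT_OF_RANGE"
    if not (start < end <= start + timedelta(days=30)):
        return "EPHEMERIS_END_OUT_OF_RANGE"
    if not (10 <= step_s <= 86_400):
        return "EPHEMERIS_STEP_OUT_OF_RANGE"
    if (end - start).total_seconds() / step_s > MAX_EPHEMERIS_POINTS:
        return "EPHEMERIS_TOO_MANY_POINTS"
    return None
```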

Decay Prediction (analyst role)

  • POST /decay/predict — submit decay job; returns 202 Accepted (Finding 3). MC concurrency gate: per-organisation Redis semaphore limits to 1 concurrent MC run (Phase 1); 2 for analyst+ (Phase 2); 429 + Retry-After on limit; admin bypasses.

    Async job lifecycle (Finding 3):

    POST /decay/predict
    Idempotency-Key: <client-uuid>          ← optional; prevents duplicate on retry
    → 202 Accepted
    {
      "jobId": "uuid",
      "status": "queued",
      "statusUrl": "/jobs/uuid",
      "estimatedDurationSeconds": 45
    }
    
    GET /jobs/{job_id}
    → 200 OK
    {
      "jobId": "uuid",
      "status": "running" | "complete" | "failed" | "cancelled",
      "resultUrl": "/decay/predictions/12345",   // present when complete
      "error": null | {"code": "...", "message": "..."},
      "createdAt": "...",
      "completedAt": "...",
      "durationSeconds": 42
    }
    

    WebSocket PREDICTION_COMPLETE / PREDICTION_FAILED events are the primary completion signal. GET /jobs/{id} is the polling fallback (recommended interval: 5 seconds; do not poll faster). All Celery-backed POST endpoints (/reports, /space/reentry/plan, /propagate) follow the same lifecycle pattern.
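The client side of the polling fallback might look like this (the fetch wrapper and interval parameter are assumptions; the 5-second floor is from the text above):

```typescript
type Job = {
  jobId: string;
  status: 'queued' | 'running' | 'complete' | 'failed' | 'cancelled';
  resultUrl?: string;
  error?: { code: string; message: string } | null;
};

// Poll GET /jobs/{id} until a terminal state; used only when the WebSocket
// PREDICTION_COMPLETE / PREDICTION_FAILED event has not arrived.
async function pollJob(
  jobId: string,
  fetchJson: (url: string) => Promise<Job>,
  intervalMs = 5000,  // do not poll faster than 5 s
): Promise<Job> {
  for (;;) {
    const job = await fetchJson(`/api/v1/jobs/${jobId}`);
    if (job.status === 'complete') return job;
    if (job.status === 'failed' || job.status === 'cancelled') {
      throw new Error(job.error?.code ?? job.status);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```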

  • GET /jobs/{job_id} — poll job status (all job types); 404 if job does not belong to the requesting user's organisation

  • GET /decay/predictions?norad_id=&status= — list (cursor-paginated)

Re-entry (viewer role)

  • GET /reentry/predictions — list with HMAC status; filterable by FIR, time window, confidence, integrity_failed
  • GET /reentry/predictions/{id} — full detail; HMAC verified before serving; integrity_failed records return 503
  • GET /reentry/tip-messages?norad_id= — TIP messages

Space Weather (viewer role)

  • GET /spaceweather/current — F10.7, Kp, Ap, Dst + operational_status + uncertainty_multiplier + cross-validation delta
  • GET /spaceweather/history?start=&end= — history
  • GET /spaceweather/forecast — 3-day NOAA SWPC forecast

Conjunctions (viewer role)

  • GET /conjunctions — active events filterable by Pc threshold
  • GET /conjunctions/{id} — detail with covariance and probability
  • POST /conjunctions/screen — submit screening (analyst role)

Visualisation (viewer role)

  • GET /czml/objects — full CZML catalog (J2000 INERTIAL; all strings HTML-escaped); max payload policy: 5 MB. If estimated payload exceeds 5 MB, the endpoint returns HTTP 413 with {"error": "catalog_too_large", "use_delta": true}.
  • GET /czml/objects?since=<iso8601> — delta CZML: returns only objects whose position or metadata has changed since the given timestamp. Clients must use this after the initial full load. Response includes X-CZML-Full-Required: true header if the server cannot produce a valid delta (e.g. client timestamp > 30 minutes old) — client must re-fetch the full catalog. Delta responses are always ≤ 500 KB for the 100-object catalog.
  • GET /czml/hazard/{zone_id} — HMAC verified before serving
  • GET /czml/event/{event_id} — full event CZML
  • GET /viz/mc-trajectories/{prediction_id} — binary MC blob for Mode C
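The full-vs-delta decision on the client reduces to inspecting the last response; a sketch (function and parameter names are assumptions):

```typescript
type CzmlFetchPlan = 'full' | 'delta';

// Decide the next CZML fetch after a response, per the endpoint rules above:
// 413 + use_delta → switch to delta; X-CZML-Full-Required: true → refetch full;
// otherwise stay in steady-state delta polling.
function nextCzmlFetch(
  status: number,
  headers: Record<string, string>,
  body: { use_delta?: boolean } | null,
): CzmlFetchPlan {
  if (status === 413 && body?.use_delta) return 'delta';
  if (headers['x-czml-full-required'] === 'true') return 'full';
  return 'delta';
}
```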

Hazard (viewer role)

  • GET /hazard/zones — active zones; HMAC status included in response
  • GET /hazard/zones/{id} — detail; HMAC verified before serving; integrity_failed records return 503

Alerts (viewer read; operator acknowledge)

  • GET /alerts — alert history
  • POST /alerts/{id}/acknowledge — records user ID + timestamp + note in alert_events
  • GET /alerts/unread-count — unread critical/high count for badge

Reports (analyst role)

  • GET /reports — list (organisation-scoped via RLS)
  • POST /reports — initiate generation (async)
  • GET /reports/{id} — metadata + pre-signed 15-minute download URL
  • GET /reports/{id}/preview — HTML preview

Org Admin (org_admin role — scoped to own organisation) (F7, F9, F11)

  • GET /org/users — list users in own org
  • POST /org/users/invite — invite a new user (sends email; creates user with viewer role pending activation)
  • PATCH /org/users/{id}/role — assign role up to operator within own org; cannot assign org_admin or admin
  • DELETE /org/users/{id} — deactivate user (revokes sessions and API keys; triggers pseudonymisation for GDPR)
  • GET /org/api-keys — list all API keys in own org (including service account keys)
  • DELETE /org/api-keys/{id} — revoke any key in own org
  • GET /org/audit-log — paginated org-scoped audit log from security_logs and alert_events filtered by organisation_id; supports ?from=&to=&event_type=&user_id= (F9)
  • GET /org/usage — usage summary for current and previous billing period (predictions run, quota hits, API calls); sourced from usage_events table
  • PATCH /org/billing — update billing_contacts row (email, PO number, VAT number)
  • POST /org/export — trigger asynchronous org data export (F11); returns job ID; export includes all predictions, alert events, handover logs, and NOTAM drafts for the org; delivered as signed ZIP within 3 business days; used for GDPR portability and offboarding

Admin (admin role only)

  • GET /admin/ingest-status — last run time and status per source
  • GET /admin/worker-status — Celery queue depth and health
  • GET /admin/security-events — recent security_logs entries
  • POST /admin/users — create user
  • PATCH /admin/users/{id}/role — change role (logged as HIGH security event)
  • GET /admin/organisations — list all organisations with tier, status, usage summary
  • POST /admin/organisations — provision new organisation (onboarding gate — see §29.8)
  • PATCH /admin/organisations/{id} — update tier, status, subscription dates

Space Portal (space_operator or orbital_analyst role)

  • GET /space/objects — list owned objects (space_operator: scoped; orbital_analyst: full catalog)
  • GET /space/objects/{norad_id} — full technical detail with state vectors, covariance, TLE history
  • GET /space/objects/{norad_id}/ephemeris — raw GCRF state vectors; CCSDS OEM format available via Accept: application/ccsds-oem
  • POST /space/reentry/plan — submit controlled re-entry planning job; requires owned_objects.has_propulsion = TRUE
  • GET /space/reentry/plan/{task_id} — poll; returns ranked deorbit windows with risk scores and FIR avoidance status
  • POST /space/conjunction/screen — submit screening (orbital_analyst only)
  • GET /space/export/bulk — bulk ephemeris/prediction export (JSON, CSV, CCSDS)

NOTAM Drafting (operator role)

  • POST /notam/draft — generate draft NOTAM from prediction ID; returns ICAO-format draft text + mandatory disclaimer
  • GET /notam/drafts — list drafts for organisation
  • GET /notam/drafts/{id} — draft detail
  • POST /notam/drafts/{id}/cancel-draft — generate cancellation draft for a previous new-NOTAM draft

API Key Management (space_operator or orbital_analyst)

  • POST /api-keys — create new API key; raw key returned once and never stored
  • GET /api-keys — list active keys (hashed IDs only, never raw keys)
  • DELETE /api-keys/{id} — revoke key immediately
  • GET /api-keys/usage — per-key request counts and last-used timestamp
  • WS /ws/events — real-time stream; 5 concurrent connections per user enforced. Per-instance subscriber ceiling: 500 connections. New connections beyond this limit receive HTTP 503 at the WebSocket upgrade. A ws_connected_clients Prometheus gauge tracks current count per backend instance; alert fires at 400 (WARNING) to trigger horizontal scaling before the ceiling is reached. At Tier 2 (2 backend instances), the effective ceiling is 1,000 simultaneous WebSocket clients — documented as a known capacity limit in docs/runbooks/capacity-limits.md.

WebSocket event payload schema:

All events share an envelope:

{
  "type": "<event_type>",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "data": { ... }
}

| type | Trigger | data fields |
| --- | --- | --- |
| alert.new | New alert generated | alert_id, level, norad_id, object_name, fir_ids[] |
| alert.acknowledged | Alert acknowledged by any user in org | alert_id, acknowledged_by, note_preview |
| alert.superseded | Alert superseded by a new one | old_alert_id, new_alert_id |
| prediction.updated | New re-entry prediction for a tracked object | prediction_id, norad_id, p50_utc, supersedes_id |
| ingest.status | Ingest job completed or failed | source, status (ok/failed), record_count, next_run_at |
| spaceweather.change | Operational status band changes | old_status, new_status, kp, f107 |
| tip.new | New TIP message ingested | norad_id, object_name, tip_epoch, predicted_reentry_utc |

Reconnection and missed-event recovery: Each event carries a monotonically increasing seq number per organisation. On reconnect, the client sends ?since_seq=<last_seq> in the WebSocket upgrade URL. The server replays up to 200 missed events from an in-memory ring buffer (last 5 minutes). If the client has been disconnected > 5 minutes, it receives a {"type": "resync_required"} event and must re-fetch state via REST.
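The client-side reconnect logic might be sketched as (helper names are assumptions, not the codebase API):

```typescript
// Build the reconnect URL with the last seen sequence number, per the
// reconnection rules above. A null cursor means a fresh connection.
function buildEventsUrl(base: string, lastSeq: number | null): string {
  return lastSeq === null
    ? `${base}/ws/events`
    : `${base}/ws/events?since_seq=${lastSeq}`;
}

// Advance the seq cursor on each event. On resync_required the client must
// drop its cursor and re-fetch state via REST before resubscribing.
function handleEvent(ev: { type: string; seq?: number }, lastSeq: number | null): number | null {
  if (ev.type === 'resync_required') return null;
  return ev.seq ?? lastSeq;
}
```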

Per-org sequence number implementation (F5 — §67): The seq counter for each org must be assigned using a PostgreSQL SEQUENCE object, not MAX(seq)+1 in a trigger. MAX(seq)+1 under concurrent inserts for the same org produces duplicate sequence numbers:

-- Migration: create one sequence per org on org creation
-- (or use a single global sequence with per-org prefix — simpler)
CREATE SEQUENCE IF NOT EXISTS alert_seq_global
    START 1 INCREMENT 1 NO CYCLE;

-- In the alert_events INSERT trigger or application code:
-- NEW.seq := nextval('alert_seq_global');
-- This is globally unique and monotonically increasing; per-org ordering
-- is derived by filtering on org_id + ordering by seq.

Preferred approach: A single global alert_seq_global sequence assigned at INSERT time. Per-org ordering is maintained because seq is globally monotonic — any two events for the same org will have the correct relative ordering by seq. The WebSocket ring buffer lookup uses WHERE org_id = $1 AND seq > $2 ORDER BY seq which remains correct with a global sequence.

Acceptable shorthand: DEFAULT nextval('alert_seq_global') on the column — no org-scoped locking is needed. Sequences are lock-free and gap-tolerant, so concurrent inserts across orgs and within the same org all receive unique, strictly increasing values; clients must not assume seq values are contiguous. The pattern that must not be used is the MAX(seq)+1 trigger described above.

Application-level receipt acknowledgement (F2 — §63): delivered_websocket = TRUE in alert_events is set at send-time, not client-receipt time. For safety-critical CRITICAL and HIGH alerts, the client must send an explicit receipt acknowledgement within 10 seconds:

// Client → Server: after rendering a CRITICAL/HIGH alert.new event
{ "type": "alert.received", "alert_id": "<uuid>", "seq": <n> }

Server response:

{ "type": "alert.receipt_confirmed", "alert_id": "<uuid>", "seq": <n+1> }

If no alert.received arrives within 10 seconds of delivery, the server marks alert_events.ws_receipt_confirmed = FALSE and triggers the email fallback for that alert (same logic as offline delivery). This distinguishes "sent to socket" from "rendered on screen."

ALTER TABLE alert_events
  ADD COLUMN ws_receipt_confirmed BOOLEAN,
  ADD COLUMN ws_receipt_at TIMESTAMPTZ;
-- NULL = not yet sent; TRUE = client confirmed receipt; FALSE = sent but no receipt within 10s

Fan-out architecture across multiple backend instances (F3 — §63): With ≥2 backend instances (Tier 2), a WebSocket connection from org A may be on instance-1 while a new alert fires on instance-2. Without a cross-instance broadcast mechanism, org A's operator misses the alert.

Required: Redis Pub/Sub fan-out:

# backend/app/alerts/fanout.py
import json

import redis.asyncio as aioredis

ALERT_CHANNEL_PREFIX = "spacecom:alert:"

async def publish_alert(redis: aioredis.Redis, org_id: str, event: dict):
    """Publish alert event to Redis channel; all backend instances receive and forward to connected clients."""
    channel = f"{ALERT_CHANNEL_PREFIX}{org_id}"
    await redis.publish(channel, json.dumps(event))

async def subscribe_org_alerts(redis: aioredis.Redis, org_id: str):
    """Each backend instance subscribes to its connected orgs' channels on startup."""
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"{ALERT_CHANNEL_PREFIX}{org_id}")
    return pubsub

Each backend instance maintains a local registry of {org_id: [websocket_connections]}. On receiving a Redis Pub/Sub message, the instance forwards to all local connections for that org. This decouples alert generation (any instance) from delivery (per-instance local connections).
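The per-instance forwarding step described above might be sketched as (the registry shape and send method are assumptions):

```python
async def forward_pubsub_messages(pubsub, local_registry: dict, org_id: str):
    """Forward Redis Pub/Sub alert messages to this instance's local sockets for the org."""
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations and other control frames
        payload = message["data"]
        for ws in local_registry.get(org_id, []):
            await ws.send_text(payload)
```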

ADR: docs/adr/0020-websocket-fanout-redis-pubsub.md — documents this pattern and the decision against sticky sessions (which would break blue-green deploys).

Dead-connection ANSP fallback notification (F6 — §63): When the ping-pong mechanism detects a dead connection, the current behaviour is to close the socket. There is no notification to the ANSP that their live monitoring connection has silently dropped.

Required behaviour:

  1. On ping-pong timeout: close socket; record ws_disconnected_at in Redis session key for that connection
  2. If no reconnect within WS_DEAD_CONNECTION_GRACE_SECONDS (default: 120s): send email to the org's ANSP contact (organisations.primary_contact_email) with subject: "SpaceCom live connection dropped — please check your browser"
  3. If an active TIP event exists for the org's FIRs when the disconnection is detected: grace period is reduced to 30s and the email subject is: "URGENT: SpaceCom connection dropped during active re-entry event"
  4. On reconnect (before grace period expires): cancel the pending fallback email
# backend/app/alerts/ws_health.py
import redis.asyncio as aioredis

WS_DEAD_CONNECTION_GRACE_SECONDS = 120
WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP = 30

async def on_connection_closed(org_id: str, user_id: str, redis: aioredis.Redis):
    active_tip = await redis.get(f"spacecom:active_tip:{org_id}")
    grace = WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP if active_tip else WS_DEAD_CONNECTION_GRACE_SECONDS
    # Schedule fallback notification via Celery
    notify_ws_dead.apply_async(
        args=[org_id, user_id],
        countdown=grace,
        task_id=f"ws-dead-{org_id}-{user_id}"  # revocable if reconnect arrives
    )

async def on_reconnect(org_id: str, user_id: str):
    # Cancel pending dead-connection notification
    celery_app.control.revoke(f"ws-dead-{org_id}-{user_id}")

Per-org email alert rate limit (F7 — §65 FinOps):

Email alerts are triggered both by the alert delivery pipeline (when WebSocket delivery is unconfirmed) and by degraded-mode notifications. Without a rate limit, a flapping prediction window or ingest instability can generate hundreds of alert emails per hour to the same ANSP contact, exhausting the SMTP relay quota and creating alert fatigue.

Rate limit policy: Maximum 50 alert emails per org per hour. When the limit is reached, subsequent alerts within the window are queued and delivered as a digest email at the end of the hour.

# backend/app/alerts/email_delivery.py
import json
from datetime import datetime

import redis.asyncio as aioredis

EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50

async def send_alert_email(org_id: str, alert: dict, redis: aioredis.Redis):
    """Send an alert email, subject to the per-org rate limit; overflow goes to the hourly digest queue."""
    rate_key = f"spacecom:email_rate:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
    count = await redis.incr(rate_key)
    if count == 1:
        await redis.expire(rate_key, 3600)  # expire at end of hour window

    if count <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR:
        # Send immediately
        await _dispatch_email(org_id, alert)
    else:
        # Add to digest queue; Celery task drains it at hour boundary
        digest_key = f"spacecom:email_digest:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
        await redis.rpush(digest_key, json.dumps(alert))
        await redis.expire(digest_key, 7200)  # safety expire

@shared_task
def send_hourly_digest_emails():
    """Drain digest queues and send consolidated digest emails. Runs at HH:59."""
    # Find all digest keys matching current hour; send one digest per org
    ...
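The elided drain step can be sketched as pure grouping logic, with Redis access stubbed out; the key layout follows the rate-limit code above, while the digest body format and the function name are illustrative assumptions:

```python
import json

def build_digest_bodies(digest_entries: dict[str, list[str]]) -> dict[str, str]:
    """Group queued alert JSON blobs into one digest body per org.

    digest_entries maps a digest key ('spacecom:email_digest:<org_id>:<YYYYMMDDHH>',
    per the rate-limit code above) to the JSON-encoded alerts RPUSHed that hour.
    """
    bodies: dict[str, str] = {}
    for key, blobs in digest_entries.items():
        org_id = key.split(":")[2]  # third segment of the digest key
        alerts = [json.loads(b) for b in blobs]
        lines = [
            f"- [{a.get('severity', 'info')}] {a.get('title', 'alert')}"
            for a in alerts
        ]
        bodies[org_id] = (
            f"{len(alerts)} alert(s) were rate-limited this hour:\n" + "\n".join(lines)
        )
    return bodies
```

In the real task, the keys would be discovered with a Redis SCAN over the current hour's pattern and each list drained with LRANGE before sending one email per org.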

Contract expiry alerts (F7 — §68):

Without proactive expiry alerts, contracts expire silently. Add a Celery Beat task (tasks/commercial/contract_expiry_alerts.py) that runs daily at 07:00 UTC and checks contracts.valid_until:

@shared_task
def check_contract_expiry():
    """Alert commercial team of contracts expiring within 90/30/7 days."""
    thresholds = [
        (90, "90-day renewal notice"),
        (30, "30-day renewal notice — action required"),
        (7,  "URGENT: 7-day contract expiry warning"),
    ]
    for days, subject_prefix in thresholds:
        target_date = date.today() + timedelta(days=days)
        expiring = db.execute(text("""
            SELECT c.id, o.name, c.monthly_value_cents, c.currency,
                   c.valid_until, o.primary_contact_email
            FROM contracts c
            JOIN organisations o ON o.id = c.org_id
            WHERE DATE(c.valid_until) = :target_date
              AND c.contract_type NOT IN ('sandbox', 'internal')
              AND c.auto_renew = FALSE
        """), {"target_date": target_date}).fetchall()
        for contract in expiring:
            send_email(
                to="commercial@spacecom.io",
                subject=f"[SpaceCom] {subject_prefix}: {contract.name}",
                body=f"Contract for {contract.name} expires on {contract.valid_until.date()}. "
                     f"Monthly value: {contract.monthly_value_cents/100:.2f} {contract.currency}."
            )

Add to celery-redbeat at crontab(hour=7, minute=0). Also send a courtesy expiry notice to the org admin contact at the 30-day threshold so they can initiate their internal procurement process.

Celery schedule: Add send_hourly_digest_emails to celery-redbeat at crontab(minute=59).

Cost rationale: SMTP relay services (SES, Mailgun) charge per email. At 50/hour cap and 10 orgs, maximum 500 emails/hour = 12,000/day. At $0.10/1,000 (SES) = $1.20/day ≈ $37/month at sustained maximum. Without rate limiting during a flapping event, a single incident could generate thousands of emails in minutes.

Per-client back-pressure and send queue circuit breaker (F7 — §63): A slow client whose network buffers are full will cause await websocket.send_json(event) to block in the FastAPI handler. Without a per-client queue depth check, a single slow client can block the fan-out loop for all clients.

# backend/app/alerts/ws_manager.py
WS_SEND_QUEUE_MAX = 50  # events; beyond this, circuit-breaker triggers

class ConnectionManager:
    def __init__(self):
        self._connections: dict[str, list[WebSocket]] = {}
        self._send_queues: dict[WebSocket, asyncio.Queue] = {}

    async def broadcast_to_org(self, org_id: str, event: dict):
        for ws in self._connections.get(org_id, []):
            queue = self._send_queues[ws]
            if queue.qsize() >= WS_SEND_QUEUE_MAX:
                # Circuit breaker: drop this connection; client will reconnect and replay
                spacecom_ws_send_queue_overflow_total.labels(org_id=org_id).inc()
                await ws.close(code=4003, reason="Send queue overflow — reconnect to resume")
            else:
                await queue.put(event)

    async def _send_worker(self, ws: WebSocket):
        """Dedicated coroutine per connection — decouples send from broadcast loop."""
        queue = self._send_queues[ws]
        while True:
            event = await queue.get()
            try:
                await ws.send_json(event)
            except Exception:
                break  # connection closed; worker exits

Prometheus counter: spacecom_ws_send_queue_overflow_total{org_id} — any non-zero value warrants investigation.

Missed-alert display for offline clients (F8 — §63): When a client reconnects after receiving resync_required, it calls the REST API to re-fetch current state. The notification centre must explicitly surface alerts that arrived during the offline period:

GET /api/v1/alerts?since=<last_seen_ts>&include_offline=true — returns all unacknowledged alerts since last_seen_ts, annotated with "received_while_offline": true. The notification centre renders these with a distinct visual treatment: amber border + "Received while you were offline" label. The client stores last_seen_ts in localStorage (updated on each WebSocket message); this survives page reload but not localStorage clear.

WebSocket connection metadata — per-org operational visibility (F10 — §63):

New Prometheus metrics:

ws_org_connected = Gauge(
    'spacecom_ws_org_connected',
    'Whether at least one WebSocket connection is active for this org',
    ['org_id', 'org_name']
)
ws_org_connections = Gauge(
    'spacecom_ws_org_connection_count',
    'Number of active WebSocket connections for this org',
    ['org_id']
)

Updated when connections open/close. Alert rule:

- alert: ANSPNoLiveConnectionDuringTIPEvent
  expr: |
    spacecom_active_tip_events > 0
    and on(org_id) spacecom_ws_org_connected == 0
  for: 5m
  severity: warning
  annotations:
    summary: "ANSP {{ $labels.org_name }} has no live WebSocket connection during active TIP event"
    runbook_url: "https://spacecom.internal/docs/runbooks/ansp-connection-lost.md"

On-call dashboard panel 9 (below the fold): "ANSP Connection Status" — table of org names, connection count, last-connected timestamp, TIP-event indicator. Rows with connected = 0 and active TIP highlighted in amber.

Protocol version negotiation (Finding 8): Client connects with ?protocol_version=1. The server's first message is always:

{"type": "CONNECTED", "protocolVersion": 1, "serverVersion": "2.1.3", "seq": 0}

When a breaking event schema change ships, both versions are supported in parallel for 6 months. Clients on a deprecated version receive:

{"type": "PROTOCOL_DEPRECATION_WARNING", "currentVersion": 1, "sunsetDate": "2026-12-01",
 "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"}

After sunset, old-version connections are closed with code 4002 ("Protocol version deprecated"). Protocol version history is maintained in docs/api-guide/websocket-protocol.md.

Token refresh during long-lived sessions (Finding 4): Access tokens expire in 15 minutes. The server sends a TOKEN_EXPIRY_WARNING event 2 minutes before expiry:

{"type": "TOKEN_EXPIRY_WARNING", "expiresInSeconds": 120, "seq": N}

The client calls POST /auth/token/refresh (standard REST — does not interrupt the WebSocket), then sends on the existing connection:

{"type": "AUTH_REFRESH", "token": "<new_access_token>"}

Server responds: {"type": "AUTH_REFRESHED", "seq": N}. If the client does not refresh before expiry, the server closes with code 4001 ("Token expired — reconnect with a new token"). Clients distinguish 4001 (auth expiry, refresh and reconnect) from 4002 (protocol deprecated, upgrade required) from network errors (reconnect with backoff).

Mode awareness: In SIMULATION or REPLAY mode, the client's WebSocket connection remains open but alert.new and tip.new events are suppressed for the duration of the mode session. Simulation-generated events are delivered on a separate WS /ws/simulation/{session_id} channel.

Alert Webhooks (admin role — registration; delivery to registered HTTPS endpoints)

For ANSPs with programmatic dispatch systems that cannot consume a browser WebSocket.

  • POST /webhooks — register a webhook endpoint; {"url": "https://ansp.example.com/hook", "events": ["alert.new", "tip.new"], "secret": "<shared_secret>"}
  • GET /webhooks — list registered webhooks for the organisation
  • DELETE /webhooks/{id} — deregister
  • POST /webhooks/{id}/test — send a synthetic alert.new event to verify delivery

Delivery semantics: At-least-once. SpaceCom POSTs the event envelope to the registered URL. Signature: X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, body)> header on every delivery. Retry policy: 3 retries with exponential backoff (1s, 5s, 30s). After 3 failures, the webhook is marked degraded and the org admin is notified by email. After 10 consecutive failures, the webhook is auto-disabled.
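The signature scheme can be produced and verified with the standard library alone; a sketch, where the helper names are illustrative and the header value format follows the specification above:

```python
import hashlib
import hmac

def sign_webhook_body(secret: str, body: bytes) -> str:
    """Produce the X-SpaceCom-Signature header value for a delivery."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify_webhook_signature(secret: str, body: bytes, header: str) -> bool:
    """Receiver-side check. compare_digest gives a constant-time comparison;
    never compare signature strings with ==."""
    expected = sign_webhook_body(secret, body)
    return hmac.compare_digest(expected, header)
```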

alert_webhooks table:

CREATE TABLE alert_webhooks (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  url TEXT NOT NULL,
  secret_hash TEXT NOT NULL,        -- bcrypt hash of the shared secret; never stored in plaintext
  event_types TEXT[] NOT NULL,
  status TEXT NOT NULL DEFAULT 'active',  -- active | degraded | disabled
  failure_count INTEGER DEFAULT 0,
  last_delivery_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Structured Event Export (viewer minimum)

First step toward SWIM / machine-readable ANSP system integration (Phase 3 target).

  • GET /events/{id}/export?format=geojson — returns the event's re-entry corridor and impact zone as a GeoJSON FeatureCollection with ICAO FIR IDs and prediction metadata in properties
  • GET /events/{id}/export?format=czml — CZML event package (same as GET /czml/event/{event_id})
  • GET /events/{id}/export?format=ccsds-oem — raw OEM for the object's trajectory at time of prediction

The GeoJSON export is the preferred integration surface for ANSP systems that are not SWIM-capable. The properties object includes: norad_id, object_name, p05_utc, p50_utc, p95_utc, affected_fir_ids[], risk_level, prediction_id, prediction_hmac (for downstream integrity verification), generated_at.

API Conventions (Finding 9)

Field naming: All API request and response bodies use camelCase. Database column names and Python internal models use snake_case. The conversion is handled automatically by a shared base model:

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class APIModel(BaseModel):
    """Base class for all API response/request models. Serialises to camelCase JSON."""
    model_config = ConfigDict(
        alias_generator=to_camel,
        populate_by_name=True,   # allows snake_case in tests and internal code
    )

class PredictionResponse(APIModel):
    prediction_id: int           # → "predictionId" in JSON
    p50_reentry_time: datetime   # → "p50ReentryTime"
    ood_flag: bool               # → "oodFlag"

All Pydantic response models inherit from APIModel. All request bodies also inherit from APIModel (with populate_by_name=True, clients may send either case). Document in docs/api-guide/conventions.md.
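For clarity, the mapping `to_camel` performs can be illustrated without Pydantic; this is a minimal stand-in for simple identifiers, and `pydantic.alias_generators.to_camel` remains the authoritative implementation:

```python
def to_camel(snake: str) -> str:
    """snake_case to camelCase: keep the first segment, capitalise the rest."""
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)
```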

Error Response Schema (Finding 2)

All error responses use the SpaceComError envelope — including FastAPI's default Pydantic validation errors (which are overridden):

class SpaceComError(BaseModel):
    error: str        # machine-readable code from the error registry
    message: str      # human-readable; safe to display in UI
    detail: dict | None = None
    requestId: str    # from X-Request-ID header; enables log correlation

@app.exception_handler(RequestValidationError)
async def validation_error_handler(request, exc):
    return JSONResponse(status_code=422, content=SpaceComError(
        error="VALIDATION_ERROR",
        message="Request validation failed",
        detail={"fields": exc.errors()},
        requestId=request.headers.get("X-Request-ID", ""),
    ).model_dump(by_alias=True))

Canonical error code registry — all codes, HTTP status, and recovery actions documented in docs/api-guide/error-reference.md. CI check: any HTTPException raised in application code must use a code from the registry. Sample entries:

| Code | HTTP status | Meaning | Recovery |
|---|---|---|---|
| VALIDATION_ERROR | 422 | Request body or query param invalid | Fix the indicated fields |
| INVALID_CURSOR | 400 | Pagination cursor malformed or expired | Restart from page 1 |
| RATE_LIMITED | 429 | Rate limit exceeded | Wait retryAfterSeconds |
| EPHEMERIS_TOO_MANY_POINTS | 400 | Computed points exceed 100,000 | Reduce range or increase step |
| IDEMPOTENCY_IN_PROGRESS | 409 | Duplicate request still processing | Wait and retry statusUrl |
| HMAC_VERIFICATION_FAILED | 503 | Prediction integrity check failed | Contact administrator |
| API_KEY_INVALID | 401 | API key revoked, expired, or invalid | Re-issue key |
| PREDICTION_CONFLICT | 200 (not error) | Multi-source window disagreement | See conflictSources field |

Rate Limit Error Response (Finding 6)

429 Too Many Requests responses include Retry-After (RFC 7231 §7.1.3) and a structured body:

HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742134847

{
  "error": "RATE_LIMITED",
  "message": "Rate limit exceeded for POST /decay/predict: 10 requests per hour",
  "retryAfterSeconds": 47,
  "limit": 10,
  "window": "1h",
  "requestId": "..."
}

retryAfterSeconds = X-RateLimit-Reset - now(). Clients implementing backoff must honour Retry-After and must not retry before it elapses.

Idempotency Keys (Finding 5)

Mutation endpoints that have real-world consequences support idempotency keys:

POST /decay/predict
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

Server behaviour:

  • First receipt: process normally; store (key, user_id, endpoint, response_body) in idempotency_keys table with 24-hour TTL
  • Duplicate within 24h: return stored response with HTTP 200 + header Idempotency-Replay: true; do not re-execute
  • Still processing: return 409 Conflict with body {"error": "IDEMPOTENCY_IN_PROGRESS", "statusUrl": "/jobs/uuid"}
  • After 24h: key expired; treat as new request

Applies to: POST /decay/predict, POST /reports, POST /notam/draft, POST /alerts/{id}/acknowledge, POST /admin/users. Documented in docs/api-guide/idempotency.md.

API Key Authentication Model (Finding 11)

API key requests use key-only auth — no JWT required:

Authorization: Bearer apikey_<base64url_encoded_key>

The prefix apikey_ distinguishes API keys from JWT Bearer tokens at the middleware layer. The raw key is hashed with SHA-256 before storage; the raw key is shown exactly once at creation.

Rules:

  • API key rate limits are independent from JWT session rate limits — separate Redis buckets per key
  • Webhook deliveries are not counted against any rate limit bucket (server-initiated, not client-initiated)
  • allowed_endpoints scope: null = all endpoints for the key's role; a non-null array restricts to listed paths. 403 returned for requests to unlisted endpoints with {"error": "ENDPOINT_NOT_IN_KEY_SCOPE"}
  • Revoked/expired/invalid key: always 401 with body {"error": "API_KEY_INVALID", "message": "API key is revoked or expired"} — indistinguishable from never-valid (prevents enumeration)

Document in docs/api-guide/api-keys.md.
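Key issuance and lookup under these rules can be sketched as follows; the apikey_ prefix and SHA-256 hashing are as specified, while the helper names and 24-byte entropy choice are assumptions:

```python
import base64
import hashlib
import secrets

def issue_api_key() -> tuple[str, str]:
    """Return (raw_key, sha256_hash). The raw key is shown exactly once at
    creation; only the hash is persisted."""
    token = base64.urlsafe_b64encode(secrets.token_bytes(24)).decode().rstrip("=")
    raw = "apikey_" + token
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def lookup_hash(presented_key: str) -> str:
    """Hash the presented bearer credential for a DB lookup by key hash."""
    return hashlib.sha256(presented_key.encode()).hexdigest()
```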

System Endpoints (Finding 10)

GET /readyz is included in the OpenAPI spec as a documented endpoint (tagged System), so integrators and SWIM consumers can discover and monitor it:

@app.get(
    "/readyz",
    tags=["System"],
    summary="Readiness and degraded-state check",
    response_model=ReadinessResponse,
    responses={
        200: {"description": "System operational"},
        207: {"description": "System degraded — one or more data sources stale"},
        503: {"description": "System unavailable — database or Redis unreachable"},
    }
)

GET /healthz (liveness probe) remains undocumented in OpenAPI — infrastructure-only. /readyz is the recommended integration health check endpoint for ANSP monitoring systems and the Phase 3 SWIM integration.

Clock skew detection and server time endpoint (F6 — §67):

CZML availability timestamps and prediction windows are generated using server UTC. If the server clock drifts (NTP sync failure after container restart, hypervisor clock skew, or VM migration), CZML ground track windows will be offset from real time. A client whose clock differs from the server clock by > 5 seconds will show predictions in the wrong temporal position.

Infrastructure requirement: All SpaceCom hosts must run chronyd or systemd-timesyncd with NTP synchronisation to a reliable source. Add to the deployment runbook (docs/runbooks/host-setup.md):

# Ubuntu/Debian
timedatectl set-ntp true
timedatectl status  # confirm NTPSynchronized: yes

Add Grafana alert: node_timex_sync_status != 1 → WARNING: "NTP sync lost on {{ $labels.instance }}".

Client-side clock skew display: Add GET /api/v1/time endpoint (unauthenticated, rate-limited to 1 req/s per IP):

@router.get("/api/v1/time")
async def server_time():
    return {"utc": datetime.utcnow().isoformat() + "Z", "unix": time.time()}

The frontend calls this on page load and computes skew_seconds = server_unix - Date.now()/1000. If abs(skew_seconds) > 5: display a persistent WARNING banner: "Your browser clock differs from the server by {N}s — prediction windows may appear offset. Please synchronise your system clock."

Pagination Standard

All list endpoints use cursor-based pagination (not offset-based). Offset pagination degrades as OFFSET N forces the DB to scan and discard N rows; at 7-year retention depth this becomes a full table scan.

Canonical response envelope — applied to every list endpoint (Finding 1):

{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xNlQxNDozMDowMFoiLCJpZCI6NDQ4Nzh9",
    "has_more": true,
    "limit": 50,
    "total_count": null
  }
}

Rules:

  • data (not items) is the canonical array key across all list endpoints
  • next_cursor is base64url(json({"created_at": "<iso8601>", "id": <int>})) — opaque to clients, decoded server-side
  • total_count is always null — count queries on large tables force full scans; document this explicitly in docs/api-guide/pagination.md
  • limit defaults to 50; maximum 200; specified per endpoint group in OpenAPI description
  • Empty result: {"data": [], "pagination": {"next_cursor": null, "has_more": false, "limit": 50, "total_count": null}} — never 404
  • Invalid/expired cursor: 400 Bad Request with body {"error": "INVALID_CURSOR", "message": "Cursor is malformed or refers to a deleted record", "request_id": "..."}

Standard query parameters:

  • limit — page size (default: 50, maximum: 200)
  • cursor — opaque cursor token from a previous response (absent = first page)

Cursor decodes server-side to WHERE (created_at, id) < (cursor_ts, cursor_id) ORDER BY created_at DESC, id DESC. Tokens are valid for 24 hours.
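The cursor round-trip can be sketched directly from that definition (helper names are illustrative; the padding handling is an implementation detail assumed here):

```python
import base64
import json

def encode_cursor(created_at_iso: str, row_id: int) -> str:
    """base64url(json({"created_at": ..., "id": ...})), padding stripped."""
    payload = json.dumps(
        {"created_at": created_at_iso, "id": row_id}, separators=(",", ":")
    )
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Server-side decode; feeds the (created_at, id) keyset WHERE clause."""
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    obj = json.loads(base64.urlsafe_b64decode(padded))
    return obj["created_at"], obj["id"]
```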

Implementation:

from typing import Generic, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class PaginationMeta(BaseModel):
    next_cursor: str | None
    has_more: bool
    limit: int
    total_count: None = None  # always None; never compute count

class PaginatedResponse(BaseModel, Generic[T]):
    data: list[T]
    pagination: PaginationMeta

def paginate_query(q, cursor: str | None, limit: int) -> PaginatedResponse:
    """Shared utility used by all list endpoints — enforces envelope consistency."""
    ...

Enforcement: An OpenAPI CI check confirms every endpoint tagged list has limit and cursor query parameters and returns the PaginatedResponse schema. Violations fail CI.

Affected endpoints (all paginated): /objects, /decay/predictions, /reentry/predictions, /alerts, /conjunctions, /reports, /notam/drafts, /space/objects, /api-keys/usage, /admin/security-events.


API Latency Budget — CZML Catalog Endpoint

The CZML catalog endpoint (GET /czml/objects) is the most latency-sensitive read path and the primary SLO driver (p95 < 2s). Latency budget allocation:

| Component | Budget | Notes |
|---|---|---|
| DNS + TLS handshake (new connection) | 50 ms | Not applicable on keep-alive; amortised to ~0 for repeat requests |
| Caddy proxy overhead | 5 ms | Header processing only |
| FastAPI routing + middleware (auth, RBAC, rate limit) | 30 ms | Each middleware ~5–10 ms; keep middleware count ≤ 5 on this path |
| PgBouncer connection acquisition | 10 ms | Pool saturation adds latency; monitor pgbouncer_pool_waiting metric |
| DB query execution (PostGIS geometry) | 800 ms | Includes GiST index scan + geometry serialisation |
| CZML serialisation (Pydantic → JSON) | 200 ms | Validated by benchmark; exceeding this indicates schema complexity regression |
| HTTP response transmission (5 MB @ 1 Gbps internal) | 40 ms | Internal network; negligible |
| Total budget (new connection) | ~1,135 ms | ~865 ms headroom to 2s p95 SLO |

Any new middleware added to the CZML endpoint path must be profiled and must not exceed its allocated budget. Exceeding the DB or serialisation budget requires a performance investigation before merge.


API Versioning Policy

Base path: /api/v1. All versioned endpoints follow Semantic Versioning applied to the API contract:

  • Non-breaking changes (additive: new optional fields, new endpoints, new query params): deployed without version bump; announced in CHANGELOG.md
  • Breaking changes (removed fields, changed types, changed auth requirements, removed endpoints): require a new major version (/api/v2); old version supported in parallel for a minimum of 6 months before sunset
  • Deprecation signalling: Deprecated endpoints return Deprecation: true and Sunset: <date> response headers (RFC 8594)
  • Version negotiation: Clients may send Accept: application/vnd.spacecom.v1+json to pin to a specific version; default is always the latest stable version
  • Breaking change notice: Minimum 3 months written notice (email to registered API key holders + CHANGELOG.md entry) before any breaking change is deployed

Changelog discipline (F5): CHANGELOG.md follows the Keep a Changelog format with Conventional Commits as the commit-level input. Every PR must add an entry under [Unreleased] if it has a user-visible effect. On release, [Unreleased] is renamed to [{semver}] - {date}.

## [Unreleased]
### Added
- `p01_reentry_time` and `p99_reentry_time` fields on decay prediction response (SC-188)
### Changed
- `altitude_unit_preference` default for ANSP operators changed from `m` to `ft` (SC-201)
### Fixed
- HMAC integrity check now correctly handles NULL `action_taken` field (SC-195)
### Deprecated
- `GET /objects/{id}/trajectory` — use `GET /objects/{id}/ephemeris` (sunset 2027-06-01)
  • make changelog-check (CI step) fails if [Unreleased] section is empty and the diff contains non-chore/docs commits
  • Release changelogs are the source for API key holder email notifications and GitHub release notes

OpenAPI spec as source of truth (F1): FastAPI generates the OpenAPI 3.1 spec automatically from route decorators, Pydantic schemas, and docstrings. The spec is the authoritative contract — not a separately maintained document. CI enforces this:

  • GET /api/v1/openapi.json is served by the running API; CI downloads it and diffs against the committed openapi.yaml
  • Any uncommitted drift fails the build with openapi-diff --fail-on-incompatible
  • The committed openapi.yaml is regenerated by running make generate-openapi (calls python -m app.generate_spec) — this is a required step in the PR checklist for any API change
  • The spec is the input to all downstream tooling: Swagger UI (/docs), Redoc (/redoc), contract tests, and the client SDK generator

API date/time contract (F10): All date/time fields in API responses must use ISO 8601 with UTC offset — never Unix timestamps, never local time strings:

  • Format: "2026-03-22T14:00:00Z" (UTC, Z suffix)
  • OpenAPI annotation: format: date-time on every _at-suffixed and _time-suffixed field
  • Contract test (BLOCKING): every field matching /_at$|_time$/ in every response schema asserts it matches ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$
  • Pydantic models use datetime with model_config = {"json_encoders": {datetime: lambda v: v.isoformat().replace("+00:00", "Z")}}
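The serialisation rule and the blocking regex can be exercised together (a sketch; the function names are illustrative):

```python
import re
from datetime import datetime, timezone

API_DATETIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")

def to_api_datetime(dt: datetime) -> str:
    """Serialise an aware UTC datetime to the contract format (Z suffix)."""
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

def is_contract_compliant(value: str) -> bool:
    """The assertion applied to every _at / _time field in response schemas."""
    return API_DATETIME_RE.fullmatch(value) is not None
```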

Frontend ↔ API contract testing (F4): The TypeScript types used by the Next.js frontend must be validated against the OpenAPI spec on every CI run — preventing the common drift where the Pydantic response model changes but the frontend interface is not updated until a runtime error surfaces.

Implementation: openapi-typescript generates TypeScript types from openapi.yaml into frontend/src/types/api.generated.ts. The frontend imports only from this generated file — no hand-written API response interfaces. A CI check (make check-api-types) regenerates the types and fails if the git diff is non-empty:

# CI step: check-api-types
openapi-typescript openapi.yaml -o frontend/src/types/api.generated.ts
git diff --exit-code frontend/src/types/api.generated.ts \
  || (echo "API types out of sync — run: make generate-api-types" && exit 1)

This is a one-way contract: the spec is authoritative; the TypeScript types are derived. Any API change that affects the frontend must regenerate types before the PR can merge. This replaces the need for a separate consumer-driven contract test framework (Pact) at Phase 1 scale.

OpenAPI response examples (F7): Every endpoint schema in the OpenAPI spec must include at least one examples: block demonstrating a realistic success response. This is enforced by a CI lint step (spectral lint openapi.yaml --ruleset .spectral.yaml) with a custom rule require-response-example. Missing examples fail the build. The examples serve three purposes: Swagger UI and Redoc interactive documentation, contract test fixture baseline, and ESA auditor review readability.

# Example: openapi.yaml fragment for GET /objects/{norad_id}
responses:
  '200':
    content:
      application/json:
        schema:
          $ref: '#/components/schemas/ObjectDetail'
        examples:
          debris_object:
            summary: Tracked debris fragment in decay
            value:
              norad_id: 48274
              name: "CZ-3B DEB"
              object_type: "DEBRIS"
              perigee_km: 187.4
              apogee_km: 312.1
              data_confidence: "nominal"
              propagation_quality: "degraded"
              propagation_warning: "tle_age_7_14_days"

Client SDK strategy (F8): Phase 1 — no dedicated SDK. ANSP integrators are provided:

  1. The committed openapi.yaml for import into Postman, Insomnia, or any OpenAPI-compatible tooling
  2. A docs/integration/ directory with language-specific quickstart guides (Python, JavaScript/TypeScript) showing auth, object fetch, and WebSocket subscription patterns
  3. Python integration examples using httpx (async) and requests (sync) — not a packaged SDK

Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate one using openapi-generator-cli targeting Python and TypeScript. Generated clients are published under the @spacecom/ npm scope and spacecom-client PyPI package. The generator configuration is committed to tools/sdk-generator/ so regeneration is reproducible from the spec.


15. Propagation Architecture — Technical Detail

15.1 Catalog Propagator (SGP4)

from sgp4.api import Satrec, jday
from app.frame_utils import teme_to_gcrf, gcrf_to_itrf, itrf_to_geodetic

def propagate_catalog(tle_line1: str, tle_line2: str, times_utc: list[datetime]) -> list[OrbitalState]:
    sat = Satrec.twoline2rv(tle_line1, tle_line2)
    results = []
    for t in times_utc:
        jd, fr = jday(t.year, t.month, t.day, t.hour, t.minute, t.second + t.microsecond/1e6)
        e, r_teme, v_teme = sat.sgp4(jd, fr)
        if e != 0:
            raise PropagationError(f"SGP4 error code {e}")
        r_gcrf, v_gcrf = teme_to_gcrf(r_teme, v_teme, t)
        lat, lon, alt = itrf_to_geodetic(gcrf_to_itrf(r_gcrf, t))
        results.append(OrbitalState(
            time=t, reference_frame='GCRF',
            pos_x_km=r_gcrf[0], pos_y_km=r_gcrf[1], pos_z_km=r_gcrf[2],
            vel_x_kms=v_gcrf[0], vel_y_kms=v_gcrf[1], vel_z_kms=v_gcrf[2],
            lat_deg=lat, lon_deg=lon, alt_km=alt, propagator='sgp4'
        ))
    return results

Scope limitation: SGP4 accurate to ~1 km for perigee > 300 km and epoch age < 7 days. Do not use for decay prediction.

SGP4 validity gates — enforced at query time (Finding 1):

| Condition | Action | UI signal |
|---|---|---|
| tle_epoch_age ≤ 7 days | Normal propagation | propagation_quality: 'nominal' |
| 7 days < tle_epoch_age ≤ 14 days | Propagate with warning | propagation_quality: 'degraded'; amber DataConfidenceBadge; API includes propagation_warning: 'tle_age_7_14_days' |
| tle_epoch_age > 14 days | Return estimate with explicit caveat | propagation_quality: 'unreliable'; object position not rendered on globe without user acknowledgement; API returns propagation_warning: 'tle_age_exceeds_14_days' |
| perigee_altitude < 200 km | Do not use SGP4 | Route all propagation requests to the numerical decay predictor; SGP4 is invalid in this density regime |

The epoch age check runs at the start of propagate_catalog(). The perigee altitude gate is enforced during TLE ingest — objects crossing below 200 km perigee are automatically flagged for decay prediction and removed from SGP4 catalog propagation tasks.
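The gates reduce to a small pure function (a sketch; the return shape is illustrative, the thresholds and field values are the ones specified above):

```python
def propagation_quality(tle_epoch_age_days: float, perigee_km: float) -> dict:
    """Apply the SGP4 validity gates at query time."""
    if perigee_km < 200:
        # SGP4 invalid in this density regime: route to the decay predictor
        return {"route": "numerical_decay_predictor"}
    if tle_epoch_age_days <= 7:
        return {"route": "sgp4", "propagation_quality": "nominal"}
    if tle_epoch_age_days <= 14:
        return {"route": "sgp4", "propagation_quality": "degraded",
                "propagation_warning": "tle_age_7_14_days"}
    return {"route": "sgp4", "propagation_quality": "unreliable",
            "propagation_warning": "tle_age_exceeds_14_days"}
```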

Sub-150 km propagation confidence guard (F2): For the numerical decay predictor, objects with current perigee < 150 km are in a regime where atmospheric density model uncertainty dominates and SGP4/numerical model errors grow rapidly. Predictions in this regime are flagged:

if perigee_km < 150:
    prediction.propagation_confidence = 'LOW_CONFIDENCE_PROPAGATION'
    prediction.propagation_confidence_reason = (
        f'Perigee {perigee_km:.0f} km below 150 km; '
        'atmospheric density uncertainty dominant; re-entry imminent'
    )

LOW_CONFIDENCE_PROPAGATION is surfaced in the UI as a red badge: "⚠ Re-entry imminent — prediction confidence low; consult Space-Track TIP directly." Unit test (BLOCKING): construct a TLE with perigee = 120 km; call the decay predictor; assert propagation_confidence == 'LOW_CONFIDENCE_PROPAGATION'.

15.2 Decay Predictor (Numerical)

Physics: J2–J6 geopotential, NRLMSISE-00 drag, solar radiation pressure (cannonball model), WGS84 oblate Earth.

NRLMSISE-00 Input Vector (Finding 2)

NRLMSISE-00 requires a fully specified input vector. Using a single F10.7 value for both the 81-day average and the prior-day slot, or using Kp instead of Ap, introduces systematic density errors that are worst during geomagnetic storms — exactly when prediction uncertainty matters most.

# Required NRLMSISE-00 inputs — both stored in space_weather table
nrlmsise_input = NRLMSISEInput(
    f107A = f107_81day_avg,       # 81-day centred average F10.7 (NOT current)
    f107  = f107_prior_day,        # prior-day F10.7 value (NOT current day)
    ap    = ap_daily,              # daily Ap index (linear) — NOT Kp (logarithmic)
    ap_a  = ap_3h_history_57h,    # 19-element array of 3-hourly Ap for prior 57h
                                   # enables full NRLMSISE accuracy (flags.switches[9]=1)
)

The space_weather table already stores f107_81day_avg and ap_daily. Add f107_prior_day DOUBLE PRECISION and ap_3h_history DOUBLE PRECISION[19] columns (the 3-hourly Ap history array for the 57 hours preceding each observation). The ingest worker populates both from the NOAA SWPC Space Weather JSON endpoint.

Atmospheric density model selection rationale (F3): NRLMSISE-00 is used for Phase 1. JB2008 (Bowman et al. 2008) is the current USSF operational standard and is demonstrably more accurate during high solar activity periods (F10.7 > 150) and geomagnetic storms (Kp > 5). NRLMSISE-00 is chosen for Phase 1 because:

  • Python bindings are mature (nrlmsise00 PyPI package); JB2008 has no equivalent mature Python binding
  • For the typical F10.7 range (70–150 sfu) at solar minimum/moderate activity, the accuracy difference is < 10%
  • Phase 2 milestone: evaluate JB2008 against NRLMSISE-00 on historical re-entry backcasts; if MAE improvement > 15%, migrate; decision documented in docs/adr/0016-atmospheric-density-model.md

NRLMSISE-00 input validity bounds (F3): Inputs outside these ranges produce unphysical density estimates; the prediction is rejected rather than silently accepted:

NRLMSISE_INPUT_BOUNDS = {
    "f107": (65.0, 300.0),   # physical solar flux range; < 65 indicates data gap
    "f107A": (65.0, 300.0),
    "ap": (0.0, 400.0),      # Ap index physical range
    "altitude_km": (85.0, 1000.0),  # validated density range
}

If any bound is violated, raise AtmosphericModelInputError with field and value — never silently clamp.

Altitude scope: NRLMSISE-00 is used from 150 km to 800 km. Above 800 km, the model is applied but the prediction carries ood_flag = TRUE with ood_reason = 'above_nrlmsise_validated_range_800km' (Finding 11).

Geomagnetic storm sensitivity (Finding 11): During the MC sampling, when the current 3-hour Kp index exceeds 5, sample F10.7 and Ap from storm-period values (current observed, not 81-day average). The prediction is annotated:

  • space_weather_warning: 'geomagnetic_storm' field on the reentry_predictions record
  • UI amber callout: "Active geomagnetic storm — thermospheric density is elevated; re-entry timing uncertainty is significantly increased"
  • The storm flag persists for the lifetime of the prediction; it is not cleared when the storm ends (the prediction was made during disturbed conditions)
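The storm-aware sampling rule above can be sketched as follows. This is illustrative only: the field names and the widened sampling spreads (25% on Ap, 10% on F10.7 during a storm) are assumptions, not values the plan specifies; only the Kp > 5 trigger, the use of current observed values during storms, and the F10.7/Ap validity floors come from this section.

```python
import numpy as np

def sample_space_weather(rng, sw, storm_kp_threshold=5.0):
    """One MC draw of (F10.7, Ap, warning). Field names and spreads are assumed."""
    if sw["kp_3h"] > storm_kp_threshold:
        # Storm: sample around current observed values, not the 81-day average
        f107 = rng.normal(sw["f107_current"], 0.10 * sw["f107_current"])
        ap = rng.normal(sw["ap_current"], 0.25 * sw["ap_current"])
        warning = "geomagnetic_storm"
    else:
        # Quiet conditions: sample around the standard inputs
        f107 = rng.normal(sw["f107_81day_avg"], 0.05 * sw["f107_81day_avg"])
        ap = rng.normal(sw["ap_daily"], 0.10 * sw["ap_daily"])
        warning = None
    # Clamp to the NRLMSISE input validity floors (see bounds table above)
    return max(f107, 65.0), max(ap, 0.0), warning
```

The returned warning string is what would be persisted as space_weather_warning on the prediction record, which is why it is emitted per draw rather than cleared afterwards.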

Ballistic Coefficient Uncertainty Model (Finding 3)

The ballistic coefficient β = m / (C_D × A) is the dominant uncertainty in drag-driven decay. Its three components are sampled independently in the Monte Carlo:

Parameter Distribution Rationale
C_D Uniform(2.0, 2.4) Standard assumption for non-cooperative objects in free molecular flow; no direct measurement available
A (stable attitude, attitude_known = TRUE) Normal(A_discos, 0.05 × A_discos) 5% shape uncertainty for known-attitude objects
A (tumbling, attitude_known = FALSE) Normal(A_discos_mean, 0.25 × A_discos_mean) 25% uncertainty; tumbling objects present a time-varying cross-section
m Normal(m_discos, 0.10 × m_discos) 10% mass uncertainty; DISCOS masses are not independently verified

OOD rules:

  • attitude_known = FALSE AND mass_kg IS NULL → ood_flag = TRUE, ood_reason = 'tumbling_no_mass' — outside validated regime
  • cd_a_over_m IS NULL AND mass_kg IS NULL AND cross_section_m2 IS NULL → ood_flag = TRUE, ood_reason = 'no_physical_properties'

Objects with known physical properties can have operator-provided overrides stored in objects.cd_override DOUBLE PRECISION and objects.bstar_override DOUBLE PRECISION. When overrides are present, the MC samples around the override value rather than the DISCOS-derived value.
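A per-draw sampler for the three β components in the table above might look like the sketch below. The distributions (Uniform C_D, Normal area and mass, 5%/25% area sigma by attitude knowledge) are from this section; the 5% spread applied around an operator override is an assumption, since the plan says only that the MC "samples around the override value" without fixing a width, and the dict-style object access is illustrative.

```python
import numpy as np

def sample_beta_components(rng, obj):
    """One MC draw of (C_D, A, m) per the distribution table above."""
    if obj.get("cd_override") is not None:
        # Override present: sample around the operator-provided value
        # (5% spread is an assumption — the plan does not fix this width)
        cd = rng.normal(obj["cd_override"], 0.05 * obj["cd_override"])
    else:
        cd = rng.uniform(2.0, 2.4)  # standard free-molecular-flow assumption
    # Area: 5% sigma when attitude is known, 25% when tumbling
    sigma_frac = 0.05 if obj.get("attitude_known") else 0.25
    area = rng.normal(obj["area_m2"], sigma_frac * obj["area_m2"])
    # Mass: 10% sigma on the DISCOS value
    mass = rng.normal(obj["mass_kg"], 0.10 * obj["mass_kg"])
    # Clamp to physical positivity (guards the rare extreme draw)
    return cd, max(area, 1e-6), max(mass, 1e-3)
```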

Solar Radiation Pressure (Finding 7)

SRP is included using the cannonball model:

a_srp = P_sr × C_r × (A/m) × r̂_sun

where P_sr = 4.56 × 10⁻⁶ N/m² at 1 AU (scaled by (1 AU / r_sun)²), C_r is the radiation pressure coefficient stored in objects.cr_coefficient DOUBLE PRECISION DEFAULT 1.3.

SRP is significant (> 5% of drag contribution) for objects with area-to-mass ratio > 0.01 m²/kg at altitudes > 500 km. OOD flag: area_to_mass > 0.01 AND perigee > 500 km AND cr_coefficient IS NULL → ood_reason = 'srp_significant_cr_unknown'.
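The cannonball formula above translates directly to code. This sketch follows the formula's sign convention as written (acceleration along r̂_sun, with r taken as the spacecraft-to-Sun vector); shadow/eclipse handling is omitted.

```python
import numpy as np

P_SR_1AU = 4.56e-6        # N/m², solar radiation pressure at 1 AU (from the text)
AU_KM = 149_597_870.7     # astronomical unit in km

def srp_acceleration(r_sun_km, cr, area_m2, mass_kg):
    """Cannonball SRP acceleration (m/s²); r_sun_km is the spacecraft→Sun vector."""
    r = np.asarray(r_sun_km, dtype=float)
    dist = np.linalg.norm(r)
    p_sr = P_SR_1AU * (AU_KM / dist) ** 2      # inverse-square scaling with Sun distance
    return p_sr * cr * (area_m2 / mass_kg) * (r / dist)
```

For the default C_r = 1.3 and an area-to-mass ratio of 0.01 m²/kg (the significance threshold above), the magnitude at 1 AU is ≈ 5.9 × 10⁻⁸ m/s².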

Integrator Configuration (Finding 9)

from scipy.integrate import solve_ivp

integrator_config = dict(
    method   = "DOP853",         # Dormand–Prince 8(5,3) explicit RK — adaptive step
    rtol     = 1e-9,             # relative tolerance (parts-per-billion)
    atol     = 1e-9,             # absolute tolerance (km); ≈ 1 µm position error
    max_step = 60.0,             # seconds; constrained to capture density variation at perigee
    t_span   = (t0, t0 + 120 * 86400),  # 120-day maximum integration window
    events   = [
        altitude_80km_event,     # terminal: breakup trigger
        altitude_200km_event,    # non-terminal: log perigee passage
    ],
    dense_output = False,
)

Stopping criterion: integration terminates when altitude ≤ 80 km (breakup trigger fires) or when the 120-day span elapses without reaching 80 km (result: propagation_timeout; stored as status = 'timeout' in simulations). The 120-day cap is a safety stop — any object not re-entering within 120 days from a sub-450 km perigee TLE is anomalous and should be flagged for human review.

The max_step = 60s constraint near perigee prevents the integrator from stepping over atmospheric density variations. For altitudes above 300 km, the max step is relaxed to 300s (5 min) via a step-size hook that checks current altitude.
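A minimal sketch of the two pieces described above, under stated assumptions: scipy event functions carry their `terminal`/`direction` attributes exactly as shown, but the spherical-Earth altitude here is a simplification of the WGS84 oblate model the propagator actually uses, and the "step-size hook" is realised by segmenting the integration and re-invoking solve_ivp with the appropriate max_step (solve_ivp has no in-flight step-size callback).

```python
import numpy as np

R_EARTH_KM = 6378.137  # WGS84 equatorial radius; spherical approximation here

def altitude_80km_event(t, y):
    """Terminal solve_ivp event: zero-crossing at 80 km altitude (breakup trigger)."""
    return np.linalg.norm(y[:3]) - (R_EARTH_KM + 80.0)
altitude_80km_event.terminal = True   # stop the integration at the crossing
altitude_80km_event.direction = -1    # fire only on descending crossings

def max_step_for(alt_km):
    """Altitude-dependent max_step: 60 s near perigee, 300 s above 300 km."""
    return 300.0 if alt_km > 300.0 else 60.0
```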

TLE age uncertainty inflation (F7): TLE age is a formal uncertainty source, not just a staleness indicator. For decaying objects, position uncertainty grows with TLE age due to unmodelled atmospheric drag variations. A linear inflation model is applied to the ballistic coefficient covariance before MC sampling:

# Applied in decay_predictor.py before MC sampling
tle_age_days = (prediction_epoch - tle_epoch).total_seconds() / 86400
if tle_age_days > 0 and perigee_km < 450:
    uncertainty_multiplier = 1.0 + 0.15 * tle_age_days
    sigma_cd *= uncertainty_multiplier
    sigma_area *= uncertainty_multiplier

The 0.15/day coefficient is derived from Vallado (2013) §9.6 propagation error growth for LEO objects in ballistic flight. tle_age_at_prediction_time and uncertainty_multiplier are stored in simulations.params_json and included in the prediction API response for provenance.

Monte Carlo convergence criterion (F4): N = 500 for production is not arbitrary — it satisfies the following convergence criterion tested on the reference object (mc-ensemble-params.json):

N p95 corridor area (km²) Change from N/2
100 baseline
250 ~12%
500 ~4%
1000 ~1.8%
2000 ~0.9%

Convergence criterion: corridor area change < 2% between doublings. N = 500 satisfies this for the reference object (the 500 → 1000 doubling changes corridor area by only ~1.8%). N = 1000 is used for objects with ood_flag = TRUE or space_weather_warning = 'geomagnetic_storm' (higher uncertainty → higher N needed for stable tail estimates). Server cap remains 1000.
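The criterion and the N-selection rule above reduce to two small helpers; a sketch (function names are illustrative):

```python
def corridor_converged(area_n, area_half_n, tol=0.02):
    """Convergence test from the table above: < 2% corridor-area change per doubling."""
    return abs(area_n - area_half_n) / area_half_n < tol

def choose_mc_n(ood_flag=False, storm_warning=False, cap=1000):
    """N = 500 standard; N = 1000 for OOD-flagged or storm-flagged predictions."""
    n = 1000 if (ood_flag or storm_warning) else 500
    return min(n, cap)  # server cap always applies
```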

Monte Carlo:

N = 500 (standard); N = 1000 (OOD flag or storm warning); server cap 1000
Per-sample variation: C_D ~ U(2.0, 2.4); A ~ N(A_discos, σ_A × uncertainty_multiplier);
  m ~ N(m_discos, σ_m); F10.7 and Ap from storm-aware sampling
Output: p01/p05/p25/p50/p75/p95/p99 re-entry times; ground track corridor polygon; per-sample binary blob for Mode C
All output records HMAC-signed before database write

15.3 Atmospheric Breakup Model

Simplified ORSAT approach: aerothermal heating → failure altitude → fragment generation → RK4 ballistic descent → impact (velocity, angle, KE, casualty area). Distinct from NASA SBM on-orbit fragmentation.

Breakup altitude trigger (Finding 5): Structural breakup begins when the numerical integrator crosses altitude = 78 km (midpoint of the 75–80 km range supported by NASA Debris Assessment Software and ESA DRAMA for aluminium-structured objects; documented in model card under "Breakup Altitude Rationale").

Fragment generation: Below 78 km, the fragment cloud is generated using the NASA Standard Breakup Model (NASA-TM-2018-220054) parameter set for the object's mass class:

  • Mass class A: < 100 kg
  • Mass class B: 100–1000 kg
  • Mass class C: > 1000 kg (rocket bodies, large platforms)

Survivability by material (Finding 5): Fragment demise altitude is determined by material class using the ESA DRAMA demise altitude lookup:

material_class Typical demise altitude Notes
aluminium 60–70 km Most fragments demise; some survive
stainless_steel 45–55 km Higher survival probability
titanium 40–50 km High survival; used in tanks and fasteners
carbon_composite 55–65 km Largely demises but reinforced structures may survive
unknown Conservative: 0 km (surface impact) All fragments assumed to survive — drives ood_flag = TRUE

material_class TEXT added to objects table. When material_class IS NULL, the ood_flag is set and the conservative all-survive assumption is used. The NOTAM (E) field debris survival statement changes from a static disclaimer to a model-driven statement: DEBRIS SURVIVAL PROBABLE (when calculated survivability > 50%) or DEBRIS SURVIVAL POSSIBLE (10–50%) or COMPLETE DEMISE EXPECTED (< 10%).
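The statement mapping above is a pure function of survival_probability; a sketch (boundary handling: exactly 50% falls in the "POSSIBLE" band, per the > 50% / 10–50% thresholds):

```python
def notam_survival_statement(survival_probability):
    """Map survival_probability (0.0–1.0) to the NOTAM (E) field wording above."""
    if survival_probability > 0.5:
        return "DEBRIS SURVIVAL PROBABLE"
    if survival_probability >= 0.1:
        return "DEBRIS SURVIVAL POSSIBLE"
    return "COMPLETE DEMISE EXPECTED"
```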

Casualty area: Computed from fragment mass and velocity using the ESA DRAMA methodology. Stored per-fragment in fragment_impacts table. The aggregate casualty area polygon drives the "ground risk" display in the Event Detail page (Phase 3 feature).

Survival probability output (F5): The aggregate object-level survival probability is stored in reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN survival_probability DOUBLE PRECISION,  -- fraction of object mass expected to survive to surface (0.0–1.0)
  ADD COLUMN survival_model_version TEXT,            -- e.g. 'phase1_analytical_v1', 'drama_3.2'
  ADD COLUMN survival_model_note TEXT;               -- human-readable caveat, e.g. 'Phase 1: simplified analytical; no fragmentation modelling'

Phase 1 method: simplified analytical — ballistic coefficient of the intact object projected to surface; if material_class = 'unknown', survival_probability = 1.0 (conservative all-survive). Phase 2: integrate ESA DRAMA output files where available from the space operator's licence submission. The NOTAM (E) field statement is driven by survival_probability (already specified above).

15.4 Corridor Generation Algorithm (Finding 4)

The re-entry corridor polygon is generated by reentry/corridor.py. The algorithm must be specified explicitly — the choice between convex hull, alpha-shape, and ellipse fit produces materially different FIR intersection results.

Algorithm:

def generate_corridor_polygon(
    mc_trajectories: list[list[GroundPoint]],
    percentile: float = 0.95,
    alpha: float = 0.1,           # degrees; ~11 km at equator
    buffer_km: float = 50.0,      # lateral dispersion buffer below 80 km
    max_vertices: int = 1000,
) -> Polygon:
    """
    Generate a re-entry hazard corridor polygon from Monte Carlo trajectories.

    Algorithm:
      1. For each MC trajectory, collect ground positions at 10-min intervals
         from the 80 km altitude crossing to the final impact point.
      2. Retain the central `percentile` fraction of trajectories by re-entry time
         (discard the earliest p_low and latest p_high tails).
      3. Compute the alpha-shape (concave hull) of the combined point set
         using alpha = 0.1°. Alpha-shape is preferred over convex hull for
         elongated re-entry corridors (convex hull overestimates width by 2–5×).
      4. Buffer the polygon by `buffer_km` to account for lateral fragment
         dispersion below 80 km.
      5. Simplify to <= `max_vertices` vertices (Douglas-Peucker, tolerance 0.01°).
      6. Store the raw MC endpoint cloud as JSONB in `reentry_predictions.mc_endpoint_cloud`
         for audit and Mode C replay.

    Returns:
        Polygon in EPSG:4326 (WGS84), suitable for PostGIS GEOGRAPHY storage.
    """

The alpha-shape library (alphashape) is added to requirements.in. The 50 km buffer accounts for the fact that fragments detach from the main object trajectory below 80 km and disperse laterally. This value is documented in the model card with a reference to ESA DRAMA lateral dispersion statistics.
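Step 2 of the algorithm (retaining the central percentile of trajectories) is the only step not pinned down by a library call; a minimal sketch of that trimming, assuming re-entry times are supplied as a flat array of seconds:

```python
import numpy as np

def central_fraction_mask(reentry_times_s, fraction=0.95):
    """Keep the central `fraction` of MC trajectories by re-entry time,
    discarding the symmetric early/late tails (step 2 of the corridor algorithm)."""
    t = np.asarray(reentry_times_s, dtype=float)
    tail = (1.0 - fraction) / 2.0
    lo, hi = np.quantile(t, [tail, 1.0 - tail])
    return (t >= lo) & (t <= hi)   # boolean mask over trajectories
```

The mask is then applied before the alpha-shape step, so the polygon is built only from the retained trajectories' ground points.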

Adaptive ground-track sampling for CZML corridor fidelity (F4 — §62):

Step 1 of the corridor algorithm above samples at 10-minute intervals. For the high-deceleration terminal phase (below ~150 km), 10 minutes corresponds to hundreds of kilometres of ground track — the polygon will miss the actual terminal geometry. Adaptive sampling is required:

def adaptive_ground_points(trajectory: list[StateVector]) -> list[GroundPoint]:
    """
    Return ground points at altitude-dependent intervals:
      > 300 km: every 5 min  (slow deceleration; sparse sampling adequate)
      150300 km: every 2 min
      80150 km: every 30 s  (rapid deceleration; must resolve terminal corridor)
      < 80 km: every 10 s   (fragment phase; maximum spatial resolution)
    """
    points = []
    for sv in trajectory:
        alt_km = sv.altitude_km
        step_s = 300 if alt_km > 300 else (
                 120 if alt_km > 150 else (
                  30 if alt_km > 80 else 10))
        # only emit a point if sufficient time has elapsed since the last point
        if not points or (sv.t - points[-1].t) >= step_s:
            points.append(to_ground_point(sv))
    return points

This is a breaking change to the corridor algorithm: the reference polygon in docs/validation/reference-data/mc-corridor-reference.geojson must be regenerated after this change is implemented. The ADR for this change must document the old vs. new polygon area difference for the reference object.

PostGIS vs CZML corridor consistency test (F6 — §62):

The PostGIS ground_track_corridor polygon (used for FIR intersection and alert generation) and the CZML polygon positions (displayed on the globe) are independently derived. A serialisation bug in the CZML builder could render the corridor in the wrong location while the database record remains correct — operators would see one corridor, alerts would be generated based on another.

Required integration test in tests/integration/test_corridor_consistency.py:

@pytest.mark.safety_critical
def test_czml_corridor_matches_postgis_polygon(db_session):
    """
    The bounding box of the CZML polygon positions must agree with the
    PostGIS corridor polygon bounding box to within 10 km in each direction.
    """
    prediction = db_session.query(ReentryPrediction).filter(
        ReentryPrediction.ground_track_corridor.isnot(None)
    ).first()

    # Generate CZML from the prediction
    czml_doc = generate_czml_for_prediction(prediction)
    czml_polygon = extract_polygon_positions(czml_doc)  # list of (lat, lon)

    # Get PostGIS bounding box
    postgis_bbox = db_session.execute(
        text("SELECT ST_Envelope(ground_track_corridor::geometry) FROM reentry_predictions WHERE id = :id"),
        {"id": prediction.id}
    ).scalar()
    postgis_coords = extract_bbox_corners(postgis_bbox)  # (min_lat, max_lat, min_lon, max_lon)

    czml_bbox = bounding_box_of(czml_polygon)
    assert abs(czml_bbox.min_lat - postgis_coords.min_lat) < 0.1   # ~10 km latitude tolerance
    assert abs(czml_bbox.max_lat - postgis_coords.max_lat) < 0.1
    # Antimeridian-aware longitude comparison
    assert lon_diff_deg(czml_bbox.min_lon, postgis_coords.min_lon) < 0.1
    assert lon_diff_deg(czml_bbox.max_lon, postgis_coords.max_lon) < 0.1

This test is marked safety_critical because a discrepancy > 10 km between displayed and stored corridor is a direct contribution to HZ-004.

Unit test: Generate a corridor from a known synthetic MC dataset (100 trajectories, straight ground track); verify the resulting polygon contains all input points; verify the polygon area is less than the convex hull area (confirming the alpha-shape is tighter); verify the polygon has ≤ 1000 vertices.

MC test data generation strategy (Finding 10): Generating hundreds of MC trajectories at test time is slow and non-deterministic. Committing raw trajectory arrays is a large binary blob. Use seeded RNG:

# tests/physics/conftest.py
@pytest.fixture(scope="session")
def synthetic_mc_ensemble():
    """500 synthetic trajectories from seeded RNG — deterministic, no external downloads."""
    rng = np.random.default_rng(seed=42)  # seed must never change without updating reference polygon
    return generate_mc_ensemble(
        rng, n=500,
        object_params={  # Reference object: committed, never change without ADR
            "mass_kg": 1000.0, "cd": 2.2, "area_m2": 1.0, "perigee_km": 185.0,
        },
    )

Commit to docs/validation/reference-data/:

  • mc-corridor-reference.geojson — pre-computed corridor polygon (run python tools/generate_mc_reference.py once; review and commit)
  • mc-ensemble-params.json — RNG seed, object parameters, generation timestamp

Test asserts: (a) generated corridor polygon matches committed reference within 5% area difference; (b) corridor contains ≥ 95% of input trajectories. If the corridor algorithm changes, the reference polygon must be explicitly regenerated and the change reviewed — the seed itself never changes.

15.5 Conjunction Probability (Pc) Computation Method (Finding 8)

The Pc method is specified in conjunction/pc_compute.py and must be documented in the API response.

Phase 1–2 method: Alfano/Foster 2D Gaussian

def compute_pc_alfano(
    r1: np.ndarray,   # primary position (km, GCRF)
    v1: np.ndarray,   # primary velocity (km/s)
    cov1: np.ndarray, # 6×6 covariance (km², km²/s²)
    r2: np.ndarray,   # secondary position
    v2: np.ndarray,
    cov2: np.ndarray,
    hbr: float,       # combined hard-body radius (m)
) -> float:
    """
    Compute probability of collision using Alfano (2005) 2D Gaussian method.

    Projects combined covariance onto the encounter plane, integrates the
    bivariate normal distribution over the combined hard-body area.
    Standard method in the space surveillance community.

    Reference: Alfano (2005), "A Numerical Implementation of Spherical Object
    Collision Probability", Journal of the Astronautical Sciences.
    """

API response field: Every conjunction record includes pc_method: "alfano_2d_gaussian" so consumers can correctly interpret the result.

Covariance source: TLE format carries no covariance. SpaceCom estimates covariance via TLE differencing (Vallado & Cefola method): multiple TLEs for the same object within a 24-hour window are used to estimate position uncertainty. This is documented in the API as covariance_source: "tle_differencing" and flagged as covariance_quality: 'low' when fewer than 3 TLEs are available within 24 hours.
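The differencing step reduces to a sample covariance over positions propagated to a common epoch; a sketch under stated assumptions (the 'nominal' quality label for ≥ 3 TLEs is an assumption — the text names only 'low'):

```python
import numpy as np

def covariance_from_tle_differencing(positions_km):
    """Estimate a 3×3 position covariance (km²) from k positions at a common
    epoch, each propagated from a different TLE within the 24 h window."""
    p = np.asarray(positions_km, dtype=float)     # shape (k, 3)
    quality = "low" if len(p) < 3 else "nominal"  # < 3 TLEs → covariance_quality: 'low'
    cov = np.cov(p.T)                             # sample covariance over the k draws
    return cov, quality
```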

pc_discrepancy_flag implementation: The log-scale comparison is confirmed as:

pc_discrepancy_flag = abs(math.log10(pc_spacecom) - math.log10(pc_spacetrack)) > 1.0

Not a linear comparison. A discrepancy is an order-of-magnitude difference in probability — this threshold is correct.
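A direct implementation of the one-liner above needs one guard the snippet omits: math.log10 raises on zero. Flooring both inputs at the 1 × 10⁻¹⁵ reporting floor from the validity domain handles this; treating the floor as the clamp value here is an assumption (the text specifies the floor for reporting, not for this comparison).

```python
import math

PC_FLOOR = 1e-15  # minimum reportable Pc (numerical precision limit, per the validity domain)

def pc_discrepancy(pc_spacecom, pc_spacetrack, threshold_decades=1.0):
    """Order-of-magnitude comparison of two Pc estimates on a log scale.
    Flooring avoids math.log10(0) when either source reports a vanishing Pc."""
    la = math.log10(max(pc_spacecom, PC_FLOOR))
    lb = math.log10(max(pc_spacetrack, PC_FLOOR))
    return abs(la - lb) > threshold_decades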

Validity domain (F1): The Alfano 2D Gaussian method is valid under the following conditions. Outside these conditions, the Pc estimate is flagged with pc_validity: 'degraded' in the API response:

  • Short-encounter assumption: valid when the encounter duration is short compared to the orbital period (satisfied for LEO conjunction geometries)
  • Linear relative motion: degrades when miss_distance_km < 0.1 (non-linear trajectory effects become significant); flag: pc_validity_warning: 'sub_100m_close_approach'
  • Gaussian covariance: degrades when the position uncertainty ellipsoid aspect ratio (σ_max/σ_min) > 100; flag: pc_validity_warning: 'highly_anisotropic_covariance'
  • Minimum Pc floor: values below 1×10⁻¹⁵ are reported as < 1e-15 and not computed precisely (numerical precision limit)
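The two degradation checks above that are expressible as simple predicates can be sketched as follows (function and flag names follow the text; the sigma arguments are assumed to be the extreme axes of the position uncertainty ellipsoid):

```python
def pc_validity_warnings(miss_distance_km, sigma_max, sigma_min):
    """Return the pc_validity_warning flags triggered by this geometry."""
    warnings = []
    if miss_distance_km < 0.1:          # linear relative motion breaks down < 100 m
        warnings.append("sub_100m_close_approach")
    if sigma_min > 0 and sigma_max / sigma_min > 100:  # highly anisotropic covariance
        warnings.append("highly_anisotropic_covariance")
    return warnings
```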

Reference implementation test (F1): tests/physics/test_pc_compute.py — BLOCKING:

# Reference cases from Vallado & Alfano (2009), Table 1
VALLADO_ALFANO_CASES = [
    # (miss_dist_m, sigma_r1_m, sigma_t1_m, sigma_n1_m,
    #  sigma_r2_m, sigma_t2_m, sigma_n2_m, hbr_m, expected_pc)
    (100.0, 50.0, 200.0, 50.0, 50.0, 200.0, 50.0, 10.0, 3.45e-3),
    (500.0, 100.0, 500.0, 100.0, 100.0, 500.0, 100.0, 5.0, 2.1e-5),
]

@pytest.mark.parametrize("case", VALLADO_ALFANO_CASES)
def test_pc_against_vallado_alfano(case):
    *geometry, expected_pc = case    # cases are plain tuples; last element is the expected Pc
    pc = compute_pc_alfano(*build_conjunction_geometry(geometry))
    assert abs(pc - expected_pc) / expected_pc < 0.05  # within 5%

Phase 3 consideration: Monte Carlo Pc for conjunctions where pc_spacecom > 1e-3 (high-probability cases where the Gaussian assumption may break down due to non-linear trajectory evolution). Document in docs/adr/0015-pc-computation-method.md.

15.6 Model Version Governance (F6)

All components of the prediction pipeline are versioned together as a single model_version string using semantic versioning (MAJOR.MINOR.PATCH):

Change type Version bump Examples
Pc methodology or propagator algorithm change MAJOR Switch from Alfano 2D to Monte Carlo Pc; replace DOP853 integrator
Atmospheric model or input processing change MINOR NRLMSISE-00 → JB2008; change TLE age inflation coefficient
Bug fix in existing model PATCH Fix F10.7 index lookup off-by-one; correct frame transformation

Rules:

  • Old model versions are never deleted — tagged in git (model/v1.2.3) and retained in backend/app/modules/physics/versions/
  • reentry_predictions.model_version is set at creation and immutable thereafter
  • A model version bump requires: updated unit tests, updated docs/validation/reference-data/, entry in CHANGELOG.md, ADR if MAJOR

Reproducibility endpoint (F6):

POST /api/v1/decay/predict/reproduce
Body: { "prediction_id": "uuid" }

Re-runs the prediction using the exact model version and parameters from simulations.params_json recorded at the time of the original prediction. Returns a new prediction record with reproduced_from_prediction_id set. This endpoint is used for regulatory audit ("what model produced this output?") and post-incident review. Available to analyst role and above.

15.7 Prediction Input Validation (F9)

A validate_prediction_inputs() function in backend/app/modules/physics/validation.py gates all decay prediction submissions. Inputs that fail validation are rejected with structured errors — never silently clamped to a valid range.

def validate_prediction_inputs(params: PredictionParams) -> list[ValidationError]:
    errors = []
    tle_age_days = (utcnow() - params.tle_epoch).days
    if tle_age_days > 30:
        errors.append(ValidationError("INVALID_TLE_EPOCH",
            f"TLE epoch is {tle_age_days} days old; maximum 30 days"))
    if not (65.0 <= params.f107 <= 300.0):
        errors.append(ValidationError("F107_OUT_OF_RANGE",
            f"F10.7 = {params.f107}; valid range [65, 300]"))
    if not (0.0 <= params.ap <= 400.0):
        errors.append(ValidationError("AP_OUT_OF_RANGE",
            f"Ap = {params.ap}; valid range [0, 400]"))
    if params.perigee_km > 1200.0:
        errors.append(ValidationError("PERIGEE_TOO_HIGH",
            f"Perigee {params.perigee_km} km > 1200 km; not a re-entry candidate"))
    if params.mass_kg is not None and params.mass_kg <= 0:
        errors.append(ValidationError("INVALID_MASS",
            f"Mass {params.mass_kg} kg must be > 0"))
    return errors

If errors is non-empty, the endpoint returns 422 Unprocessable Entity with the full error list. Unit tests (BLOCKING) cover each validation path including boundary values.

15.8 Data Provenance Specification (F11)

Phase 1 model classification: No trained ML model components. All prediction parameters are derived from:

  • Physical constants (gravitational parameter, WGS84 Earth model)
  • Published atmospheric model coefficients (NRLMSISE-00)
  • Published orbital mechanics algorithms (SGP4, Alfano 2005 Pc)
  • Empirical constants from peer-reviewed literature (NASA Standard Breakup Model, ESA DRAMA demise altitudes, Vallado ballistic coefficient uncertainty)

This is documented explicitly in docs/ml/data-provenance.md as: "SpaceCom Phase 1 uses no trained machine learning components. All model parameters are derived from physical constants and published peer-reviewed sources cited below."

EU AI Act Art. 10 compliance (Phase 1): Because Phase 1 has no training data, the data governance obligations of Art. 10 apply to input data rather than training data. Input data provenance is tracked in simulations.params_json (TLE source, space weather source, timestamp, version).

Future ML component protocol: Any future learned component (e.g., drag coefficient ML model, debris type classifier) must be accompanied by:

  • Training dataset: source, date range, preprocessing steps, known biases
  • Validation split: method, size, metrics
  • Performance on historical re-entry backcasts (§15.9 backcasting pipeline)
  • Documented in docs/ml/data-provenance.md under the component name
  • docs/ml/model-card-{component}.md following the Google Model Card format

15.9 Backcasting Validation Pipeline (F8)

When a re-entry is confirmed (object decays — objects.status = 'decayed'), the backcasting pipeline runs automatically:

# Triggered by Celery task on object status change to 'decayed'
@celery.task
def run_reentry_backcast(object_id: int, confirmed_reentry_time: datetime):
    """Compare all predictions made in 72h before re-entry to actual outcome."""
    predictions = db.query(ReentryPrediction).filter(
        ReentryPrediction.object_id == object_id,
        ReentryPrediction.created_at >= confirmed_reentry_time - timedelta(hours=72),
    ).all()
    for pred in predictions:
        error_hours = (pred.p50_reentry_time - confirmed_reentry_time).total_seconds() / 3600
        db.add(ReentryBackcast(
            prediction_id=pred.id,
            object_id=object_id,
            confirmed_reentry_time=confirmed_reentry_time,
            p50_error_hours=error_hours,
            lead_time_hours=(confirmed_reentry_time - pred.created_at).total_seconds() / 3600,
            model_version=pred.model_version,
        ))

CREATE TABLE reentry_backcasts (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id   BIGINT NOT NULL REFERENCES reentry_predictions(id),
    object_id       INTEGER NOT NULL REFERENCES objects(id),
    confirmed_reentry_time TIMESTAMPTZ NOT NULL,
    p50_error_hours DOUBLE PRECISION NOT NULL,  -- signed: positive = predicted late
    lead_time_hours DOUBLE PRECISION NOT NULL,
    model_version   TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON reentry_backcasts (model_version, created_at DESC);

Drift detection: Rolling 30-prediction MAE by model version, computed nightly. If MAE > 2× historical baseline for the current model version, raise MEDIUM alert to Persona D flagging for model review. Surfaced in the admin analytics panel as a "Model Performance" widget.
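The drift rule above is a rolling-window MAE comparison; a sketch (the 2× factor and 30-prediction window are from the text; the function name and early-return for an empty window are assumptions):

```python
def model_drift_alert(signed_errors_hours, baseline_mae_hours, window=30, factor=2.0):
    """Nightly check: alert when the rolling-window MAE for the current model
    version exceeds `factor` × its historical baseline."""
    recent = signed_errors_hours[-window:]   # most recent backcast errors
    if not recent:
        return False                          # no backcasts yet — nothing to compare
    mae = sum(abs(e) for e in recent) / len(recent)
    return mae > factor * baseline_mae_hours
```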


16. Cross-Cutting Concerns

16.1 Subscription Tiers and Feature Flags (F2, F6)

SpaceCom gates commercial entitlements on the contracts table, which is the single authoritative commercial source of truth. organisations.subscription_tier is a presentation and segmentation shorthand only, and must never be used as the authority for feature access, quota limits, or shadow/production eligibility. Active contract state is materialised into derived organisation flags and quotas by a synchronisation job so runtime checks remain cheap and explicit.

Tier Intended customer MC concurrent runs Decay predictions/month Conjunction screening API access Multi-ANSP coordination
shadow_trial Evaluators / test orgs 1 20 Read-only (catalog) No No
ansp_operational ANSP Phase 1 1 200 Yes (Phase 2) Yes Yes
space_operator Space operator orgs 2 500 Own objects only Yes No
institutional Space agencies, research 4 Unlimited Yes Yes Yes
internal SpaceCom internal Unlimited Unlimited Yes Yes Yes

Feature flag enforcement pattern:

def require_tier(*tiers: str):
    def dependency(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
        org = db.get(Organisation, current_user.organisation_id)
        # subscription_tier here is the contract-derived materialised value,
        # kept current by the contracts synchronisation job — never operator-set
        if org.subscription_tier not in tiers:
            raise HTTPException(status_code=403, detail={
                "code": "TIER_INSUFFICIENT",
                "current_tier": org.subscription_tier,
                "required_tiers": list(tiers),
            })
        return org
    return dependency

# Applied at router level alongside require_role:
router = APIRouter(dependencies=[
    Depends(require_role("analyst", "operator", "org_admin", "admin")),
    Depends(require_tier("ansp_operational", "institutional", "internal")),
])

Quota enforcement pattern (MC concurrent runs):

TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def get_mc_concurrency_limit(org: Organisation) -> int:
    return TIER_MC_CONCURRENCY.get(org.subscription_tier, 1)

Quota exhaustion is a billable signal: Every 429 TIER_QUOTA_EXCEEDED response writes a usage_events row with event_type = 'mc_quota_exhausted' (see §9.2 usage_events table). This powers the org admin's usage dashboard and the upsell trigger in the admin panel.

Tier changes take effect immediately — no session restart required. The require_tier dependency reads from the database on each request; there is no tier caching that could allow a downgraded tier to continue accessing premium features.

Uncertainty and Confidence

Every prediction includes:

  • confidence_level (0.0–1.0) — derived from MC spread
  • uncertainty_bounds — explicit p05/p50/p95 times, corridor ellipse axes
  • model_version — semantic version
  • monte_carlo_n — ≥ 100 preliminary, ≥ 500 operational
  • f107_assumed, ap_assumed — critical for reproducibility
  • record_hmac — tamper-evident signature, verified before serving

TLE covariance: TLE format contains no covariance. Use TLE differencing (multiple TLEs within 24h) or empirical Vallado & Cefola covariance. Document clearly in API responses.

Multi-source prediction conflict resolution (Finding 10):

Space-Track TIP messages and SpaceCom's internal decay predictor may produce non-overlapping re-entry windows for the same object simultaneously. ESA ESAC may publish a third window. The aviation regulatory principle of most-conservative applies — the hazard presented to ANSPs must encompass the full credible uncertainty range.

Resolution rules (applied at the reentry_predictions layer):

Situation Rule
SpaceCom p10–p90 and TIP window overlap Display SpaceCom corridor as primary; TIP window shown as secondary reference band on Event Detail page
SpaceCom p10–p90 and TIP window do not overlap Set prediction_conflict = TRUE on the prediction; HIGH severity data quality warning displayed; hazard corridor presented to ANSPs uses the union of SpaceCom p10–p90 and TIP window
ESA ESAC window available Overlay as third reference band; include in PREDICTION_CONFLICT assessment if non-overlapping
All sources agree (all windows overlap) No flag; SpaceCom corridor is primary

Schema addition to reentry_predictions:

ALTER TABLE reentry_predictions
  ADD COLUMN prediction_conflict BOOLEAN DEFAULT FALSE,
  ADD COLUMN conflict_sources TEXT[],   -- e.g. ['spacecom', 'space_track_tip']
  ADD COLUMN conflict_union_p10 TIMESTAMPTZ,
  ADD COLUMN conflict_union_p90 TIMESTAMPTZ;

The Event Detail page shows a ⚠ PREDICTION CONFLICT banner (HIGH severity style) when prediction_conflict = TRUE, listing the conflicting sources and their windows. The hazard corridor polygon uses conflict_union_p10/conflict_union_p90 when the flag is set. Document in docs/model-card-decay-predictor.md under "Conflict Resolution with Authoritative Sources."
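The overlap/union rules above can be sketched as a single resolution function. This is illustrative: it covers only the two-source (SpaceCom vs TIP) case, and the returned dict keys mirror the schema columns above.

```python
from datetime import datetime, timedelta

def resolve_prediction_conflict(sc_p10, sc_p90, tip_start, tip_end):
    """Overlapping windows → no flag; disjoint windows → flag the prediction
    and present the most-conservative union window to ANSPs."""
    overlap = sc_p10 <= tip_end and tip_start <= sc_p90
    if overlap:
        return {"prediction_conflict": False}
    return {
        "prediction_conflict": True,
        "conflict_sources": ["spacecom", "space_track_tip"],
        "conflict_union_p10": min(sc_p10, tip_start),   # earliest credible entry
        "conflict_union_p90": max(sc_p90, tip_end),     # latest credible entry
    }
```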

Auditability

  • Every simulation in simulations with full params_json and result URI
  • Reports stored with simulation_id reference
  • alert_events and security_logs are append-only with DB-level triggers
  • All API mutations logged with user ID, timestamp, and payload hash
  • TIP messages stored verbatim for audit

Error Handling

  • Structured error responses: { "error": "code", "message": "...", "detail": {...} }
  • Celery failures captured in simulations.status = 'failed'; surfaced in jobs panel
  • Frame transformation failures fail loudly — never silently continue with TEME
  • HMAC failures return 503 and trigger CRITICAL security event — never silently serve a tampered record
  • TanStack Query error states render inline messages with retry; not page-level errors

Performance Patterns

SQLAlchemy async — lazy="raise" on all relationships: Async SQLAlchemy prohibits lazy-loaded relationship access outside an async context. Setting lazy="raise" converts silent N+1 errors into loud InvalidRequestError at development time rather than silent blocking DB calls in production:

class ReentryPrediction(Base):
    object:       Mapped["SpaceObject"]   = relationship(lazy="raise")
    tip_messages: Mapped[list["TipMessage"]] = relationship(lazy="raise")
    # Forces all callers to use joinedload/selectinload explicitly

Required eager-loading patterns for the three highest-traffic endpoints:

  • Event Detail: selectinload(ReentryPrediction.object), selectinload(ReentryPrediction.tip_messages)
  • Active alerts: selectinload(AlertEvent.prediction)
  • CZML catalog: raw SQL with a single JOIN rather than ORM (bulk fetch; ORM overhead unacceptable at 864k rows)

CZML caching — two-tier strategy: CZML data for the current 72h window changes only when a new TLE is ingested or a propagation job completes. Cache the full serialised CZML blob:

CZML_CACHE_KEY = "cache:czml:catalog:{catalog_hash}:{window_start}:{window_end}"
# TTL: 15 minutes in LIVE mode (refreshed after new TLE ingest event)
# TTL: permanent in REPLAY mode (historical data never changes)

Per-object CZML fragments cached separately under cache:czml:obj:{norad_id}:{...}. When a TLE is re-ingested for one object, invalidate only that object's fragment and recompute the full catalog CZML from the cached fragments.
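
The catalog_hash can be derived from the ingested TLE set so that any TLE change automatically produces a new catalog key. A sketch of the key construction (the hash recipe here is an assumption, not a committed format):

```python
import hashlib

def catalog_hash(tles: dict[int, str]) -> str:
    """Stable hash over (norad_id, tle_epoch) pairs; independent of insertion order."""
    h = hashlib.sha256()
    for norad_id in sorted(tles):
        h.update(f"{norad_id}:{tles[norad_id]}".encode())
    return h.hexdigest()[:16]

def fragment_key(norad_id: int, window_start: str, window_end: str) -> str:
    return f"cache:czml:obj:{norad_id}:{window_start}:{window_end}"

def catalog_key(tles: dict[int, str], window_start: str, window_end: str) -> str:
    return f"cache:czml:catalog:{catalog_hash(tles)}:{window_start}:{window_end}"
```

Because re-ingesting one object's TLE changes that object's epoch entry, the catalog key rotates naturally while untouched per-object fragment keys stay valid for reuse.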

CZML cache invalidation triggers (F5 — §58):

| Event | Invalidation scope | Mechanism |
|-------|--------------------|-----------|
| New TLE ingested for object X | `cache:czml:obj:{norad_id_x}:*` only | Ingest task calls `redis.delete(pattern)` after TLE commit |
| Propagation job completes for object X | `cache:czml:obj:{norad_id_x}:*` + full catalog key | Propagation Celery task issues invalidation on success |
| New prediction created for object X | `cache:czml:obj:{norad_id_x}:*` | Prediction task issues invalidation on completion |
| Manual cache flush (admin API) | `cache:czml:*` | DELETE /api/v1/admin/cache/czml (requires admin role) |
| Cold start / DR failover | Warm-up Celery task `warm_czml_cache` | Beat task runs at startup (see below) |

Stale-while-revalidate strategy: The CZML cache key includes a stale_ok variant. When the primary key is expired but the stale key (cache:czml:catalog:stale:{hash}) exists, serve the stale response immediately and enqueue a background recompute. Maximum stale age: 5 minutes. This prevents a cache stampede during TLE batch ingest (up to 600 simultaneous invalidations).
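
The stale-while-revalidate flow can be sketched with a dict-backed cache stand-in (key names follow the scheme above; the recompute hook is illustrative):

```python
STALE_MAX_AGE = 300  # seconds; maximum age served from the stale key

def get_catalog_czml(cache: dict, key: str, recompute_queue: list, now: float):
    """Serve fresh if present; else serve stale (up to 5 min old) and enqueue a
    background recompute; else signal that a synchronous recompute is needed."""
    if key in cache:
        return cache[key], "fresh"
    stale_key = key.replace("cache:czml:catalog:", "cache:czml:catalog:stale:")
    entry = cache.get(stale_key)
    if entry is not None:
        blob, written_at = entry
        if now - written_at <= STALE_MAX_AGE:
            recompute_queue.append(key)  # background refresh; avoids the stampede
            return blob, "stale"
    return None, "miss"
```

Because the stale branch answers immediately and queues exactly one recompute per expired key, a batch of invalidations degrades to slightly old corridors rather than a burst of simultaneous recomputations.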

Cache warm-up on cold start (F5 — §58):

@app.task
def warm_czml_cache():
    """Run at container startup and after DR failover. Estimated: 3060s for 600 objects."""
    objects = db.query(Object).filter(Object.active == True).all()
    for obj in objects:
        generate_czml_fragment.delay(obj.norad_id)
    # Full catalog key assembled by CZML endpoint after all fragments present

Cold-start warm-up time (600 objects, 16 simulation workers): estimated 30-60 seconds. Included in DR RTO calculation (§26.3) as "cache warm-up: ~1 min" line item.

Redis key namespaces and eviction policy:

| Namespace | Contents | Eviction policy | Notes |
|-----------|----------|-----------------|-------|
| `celery:*` | Celery broker queues | noeviction (must never be evicted) | Use separate Redis instance or DB 0 with noeviction |
| `redbeat:*` | celery-redbeat schedules | noeviction | Loss causes silent scheduled-job disappearance |
| `cache:*` | Application cache (CZML, space weather, HMAC results) | allkeys-lru | Cache misses acceptable; broker loss is not |
| `ws:session:*` | WebSocket session state | volatile-lru (with TTL set) | Expires on session end |

Run Celery broker and application cache as separate Redis database indexes (SELECT 0 vs SELECT 1) so eviction policies can differ. The Sentinel configuration monitors both.

Cache TTLs:

  • cache:czml:catalog → 15 minutes
  • cache:spaceweather:current → 5 minutes
  • cache:prediction:{id}:fir_intersection → until superseded (keyed to prediction ID)
  • cache:prediction:{id}:hmac_verified → 60 minutes

Bulk export — Celery offload for Persona F: The /space/export/bulk endpoint must not materialise the full result set in the backend container — for the full catalog this risks OOM. Implement as a Celery task that writes to MinIO and returns a pre-signed download URL, consistent with the existing report generation pattern:

@app.post("/space/export/bulk")
async def trigger_bulk_export(params: BulkExportParams, ...):
    task = generate_bulk_export.delay(params.dict(), user_id=current_user.id)
    return {"task_id": task.id, "status": "queued"}

@app.get("/space/export/bulk/{task_id}")
async def get_bulk_export(task_id: str, ...):
    # Returns {"status": "complete", "download_url": presigned_url} when done

If a streaming response is preferred over task-based, use SQLAlchemy yield_per=1000 cursor streaming — never materialise the full result set.
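
If the streaming variant is chosen, the cursor-batched shape looks like this (the row serialiser and names are illustrative; with SQLAlchemy, `rows` would come from `session.execute(stmt).yield_per(1000)`):

```python
import csv
import io
from typing import Iterable, Iterator

def stream_export_csv(rows: Iterable[tuple], header: tuple, batch: int = 1000) -> Iterator[str]:
    """Yield CSV text in bounded chunks; never holds the full result set in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for i, row in enumerate(rows, start=1):
        writer.writerow(row)
        if i % batch == 0:
            yield buf.getvalue()   # flush one bounded chunk to the response
            buf.seek(0)
            buf.truncate(0)
    if buf.tell():
        yield buf.getvalue()       # trailing partial chunk
```

In FastAPI this generator would be wrapped in a StreamingResponse, so peak memory is one batch of rows regardless of catalog size.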

Analytics query routing to read replica: Persona B and F analytics queries (simulation comparison, historical validation, bulk export) are I/O intensive and must not compete with operational read paths on the primary TimescaleDB instance during active TIP events. Route to the Patroni standby:

def get_db(write: bool = False, analytics: bool = False) -> AsyncSession:
    if write:
        return AsyncSession(primary_engine)
    if analytics:
        return AsyncSession(replica_engine)  # Patroni standby
    return AsyncSession(primary_engine)      # operational reads: primary (avoids replica lag)

Monitor replication lag: if replica lag > 30s, log a warning and redirect analytics queries to primary.
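
The lag-aware fallback is a small pure decision on top of get_db(); however lag is measured (e.g. from pg_stat_replication on the primary), the routing rule reduces to this sketch (names mirror the snippet above and are illustrative):

```python
REPLICA_LAG_LIMIT_S = 30.0

def choose_engine(write: bool, analytics: bool, replica_lag_s: float) -> str:
    """Pick the engine for a request; fall back to primary when the standby lags."""
    if write:
        return "primary"
    if analytics:
        if replica_lag_s > REPLICA_LAG_LIMIT_S:
            # here: log a warning that analytics traffic is redirected to primary
            return "primary"
        return "replica"
    return "primary"  # operational reads stay on primary to avoid replica lag
```

Keeping the decision pure makes the 30s threshold trivially unit-testable and easy to tune per deployment tier.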

Query plan baseline: Add to Phase 1 setup: run EXPLAIN (ANALYZE, BUFFERS) on the primary CZML query with 100 objects and record the output in docs/query-baselines/. Re-run at Phase 3 load test and compare — if planning time or execution time has increased > 2×, investigate index bloat or chunk count growth before the load test proceeds.


17. Validation Strategy

17.0 Test Standards and Strategy (F1-F3, F5, F7, F8, F10, F11)

Test Taxonomy (F2)

Three levels — every developer must know which level a new test belongs to before writing it:

| Level | Definition | I/O boundary | Tool | Location |
|-------|------------|--------------|------|----------|
| Unit | Single function or class; all dependencies mocked or stubbed | No I/O | pytest | tests/unit/ |
| Integration | Multiple components; real PostgreSQL + Redis; no external network | Real DB, no internet | pytest + testcontainers | tests/integration/ |
| E2E | Full stack including browser; Celery worker running; real DB | Full stack | Playwright | e2e/ |

Rules:

  • Physics algorithm tests (SGP4, MC, Pc) are unit tests — pure functions, no DB
  • HMAC signing, RLS isolation, and rate-limit tests are integration tests — require a real DB transaction
  • Alert delivery, WebSocket flow, and NOTAM draft UI are E2E tests
  • A test that mocks the database is a unit test regardless of what it is testing — name it accordingly

Coverage Standard (F1)

| Scope | Tool | Minimum threshold | CI gate |
|-------|------|-------------------|---------|
| Backend line coverage | pytest-cov | 80% | Fail below threshold |
| Backend branch coverage | pytest-cov --branch | 70% | Fail below threshold |
| Frontend line coverage | Jest --coverage | 75% | Fail below threshold |
| Safety-critical paths | pytest -m safety_critical | 100% (all pass, none skipped) | Always blocking |

# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=app --cov-branch --cov-fail-under=80 --cov-report=term-missing"

[tool.coverage.run]
omit = ["*/migrations/*", "*/tests/*", "*/__pycache__/*"]

Coverage is measured on the integration test run (not unit-only) so that database-layer code paths are included. Coverage reports are uploaded to CI artefacts on every run; a coverage trend chart is required in the Phase 2 ESA submission.

Test Data Management (F3)

Fixtures, not factories for shared reference data: Physics reference cases (TLE sets, re-entry events, conjunction scenarios) are committed JSON files in docs/validation/reference-data/. Tests load them as pytest fixtures — never fetch from the internet at test time.

Isolated fixtures for integration tests: Each integration test that writes to the database runs inside a transaction that is rolled back at teardown. No shared mutable state between tests:

from sqlalchemy.orm import Session

@pytest.fixture
def db_session(engine):
    with engine.connect() as conn:
        txn = conn.begin()
        session = Session(bind=conn)  # session joined to the external transaction
        yield session
        session.close()
        txn.rollback()  # all writes from this test disappear

Time-dependent tests: Any test that checks TLE age, token expiry, or billing period uses freezegun to freeze time to a known epoch. Tests must never rely on datetime.utcnow() producing a particular value:

from freezegun import freeze_time

@freeze_time("2026-01-15T12:00:00Z")
def test_tle_age_degraded_warning():
    # TLE epoch is 2026-01-08 → age = 7 days → expects 'degraded'
    ...
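
The helper exercised by a test like this might look as follows (the function name is illustrative; the 7-day degraded threshold is taken from the comment above):

```python
from datetime import datetime, timedelta, timezone

TLE_DEGRADED_AGE = timedelta(days=7)

def tle_age_status(tle_epoch: datetime, now: datetime) -> str:
    """Classify TLE freshness. `now` is injected explicitly so that freezegun
    (or a plain argument, as here) fully controls the clock."""
    age = now - tle_epoch
    return "degraded" if age >= TLE_DEGRADED_AGE else "nominal"
```

Passing `now` as an argument rather than calling datetime.utcnow() inside the function is what makes the frozen-time test deterministic.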

Sensitive test data: Real NORAD IDs, real Space-Track credentials, and real ANSP organisation names must never appear in committed test fixtures. Use fictional NORAD IDs (90001-90099 are reserved for test objects by convention) and generated organisation names (test-org-{uuid4()[:8]}).

Safety-Critical Test Markers (F8)

All tests that verify safety-critical behaviour carry @pytest.mark.safety_critical. These run on every commit (not just pre-merge) and must all pass before any deployment:

# conftest.py
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "safety_critical: test verifies a safety-critical invariant; always runs; zero tolerance for failure or skip"
    )
# Usage
@pytest.mark.safety_critical
def test_cross_tenant_isolation():
    ...

@pytest.mark.safety_critical
def test_hmac_integrity_failure_quarantines_record():
    ...

@pytest.mark.safety_critical
def test_sub_150km_low_confidence_flag():
    ...

The full list of safety_critical-marked tests is maintained in docs/TEST_PLAN.md (see F11). CI runs pytest -m safety_critical as a separate fast job (target: < 2 minutes) before the full suite.

Physics Test Determinism (F10)

Monte Carlo tests are non-deterministic by default. All MC-based tests seed the random number generator explicitly:

import numpy as np

@pytest.fixture(autouse=True)
def seed_rng():
    """Seed numpy RNG for all physics tests. Produces identical output across runs."""
    np.random.seed(42)
    yield
    # no teardown needed — each test gets a fresh seed via autouse

@pytest.mark.safety_critical
def test_mc_convergence_criterion():
    result = run_mc_decay(tle=TEST_TLE, n=500, seed=42)
    assert result.corridor_area_change_pct < 2.0

The seed value 42 is fixed in tests/conftest.py and must not be changed without updating the baseline expected values. A PR that changes the seed without updating expected values fails the review checklist.

Mutation Testing (F5)

mutmut is run weekly (not on every commit — too slow) against the backend/app/modules/physics/ and backend/app/modules/alerts/ directories. These are the highest-consequence paths.

mutmut run --paths-to-mutate=backend/app/modules/physics/,backend/app/modules/alerts/
mutmut results

Threshold: Mutation score ≥ 70% for physics and alerts modules. Results published to CI artefacts. A score drop of > 5 percentage points between weekly runs creates a mutation-regression GitHub issue automatically.

Test Environment Parity (F7)

The CI test environment must use identical Docker images to production. Enforced by:

  • docker-compose.ci.yml extends docker-compose.yml — same image tags, no overrides to DB version or Redis version
  • TimescaleDB version in CI is pinned to the same tag as production (timescale/timescaledb-ha:pg16-latest is not acceptable — must be timescale/timescaledb-ha:pg16.3-ts2.14.2)
  • make test in CI fails if TIMESCALEDB_VERSION env var does not match the value in docker-compose.yml
  • MinIO is used in CI, not mocked — make test brings up the full service stack including MinIO before running integration tests

ESA Test Plan Document (F11)

docs/TEST_PLAN.md is a required Phase 2 deliverable. Structure:

# SpaceCom Test Plan

## 1. Test levels and tools
## 2. Coverage targets and current status
## 3. Safety-critical test traceability matrix
   | Requirement | Test ID | Test name | Result |
   |-------------|---------|-----------|--------|
   | Sub-150km propagation guard | SC-TEST-001 | test_sub_150km_low_confidence_flag | PASS |
   | Cross-tenant data isolation | SC-TEST-002 | test_cross_tenant_isolation | PASS |
   ...
## 4. Known test limitations
## 5. Test environment specification
## 6. Performance test results (latest k6 run)

The traceability matrix links each safety-critical requirement (drawn from §15, §7.2, §26) to its @pytest.mark.safety_critical test. This is the primary evidence document for ESA software assurance review.


Important: Comparing SGP4 against Space-Track TLEs is circular. All validation uses independent reference datasets.

Reference data location: docs/validation/reference-data/ — committed to the repository and loaded automatically by the test suite. No external downloads required at test time.

How to run all validation suites:

make test                            # runs pytest including all validation suites
pytest tests/test_frame_utils.py -v  # frame transforms only
pytest tests/test_decay/ -v          # decay predictor + backcast comparison
pytest tests/test_propagator/ -v     # SGP4 propagator

How to add a new validation case: Add the reference data to the appropriate JSON file in docs/validation/reference-data/, add a test case in the relevant test module, and document the source in the file's header comment.


17.1 Frame Transformation Validation

| Test | Reference | Pass criterion | Run command |
|------|-----------|----------------|-------------|
| TEME→GCRF transform | Vallado (2013), Table 3-5 | Position error < 1 m; velocity error < 0.001 m/s | pytest tests/test_frame_utils.py::test_teme_gcrf_vallado |
| GCRF→ITRF transform | Vallado (2013), Table 3-4 | Position error < 1 m | pytest tests/test_frame_utils.py::test_gcrf_itrf_vallado |
| ITRF→WGS84 geodetic | IAU SOFA test vectors | Lat/lon error < 1 μrad; altitude error < 1 mm | pytest tests/test_frame_utils.py::test_itrf_geodetic |
| Round-trip WGS84→ITRF→GCRF→ITRF→WGS84 | Self-consistency | Round-trip error < 1e-12 (near machine precision) | pytest tests/test_frame_utils.py::test_roundtrip |
| IERS EOP application | IERS Bulletin A reference values | UT1-UTC error < 1 μs; pole offset error < 0.1 mas | pytest tests/test_frame_utils.py::test_iers_eop |
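
The round-trip criterion is a pure self-consistency check: apply a chain of transforms and their inverses and require near-machine-precision recovery. Illustrated here with a bare z-axis rotation pair standing in for the real frame chain (all names are illustrative):

```python
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def round_trip_error(r_km: np.ndarray, theta: float) -> float:
    """Forward then inverse transform; rotation inverses are transposes."""
    forward = rot_z(theta)
    back = forward.T
    return float(np.linalg.norm(back @ (forward @ r_km) - r_km))
```

The real test_roundtrip replaces the rotation pair with the full WGS84→ITRF→GCRF chain, but the pass criterion is the same: residual indistinguishable from floating-point noise.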

Committed test vectors (Finding 6): The following reference data files must be committed to the repository before any frame transformation or propagation code is merged. Tests are parameterised fixtures that load from these files; they fail (not skip) if a file is absent:

| File | Content | Source |
|------|---------|--------|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | ≥ 3 cases from Vallado (2013) §3.7: input UTC epoch + GCRF position → expected ITRF position, accurate to < 1 m | Vallado (2013), Fundamentals of Astrodynamics, Table 3-4 |
| docs/validation/reference-data/sgp4_propagation_cases.json | ISS (NORAD 25544) and one historical re-entry object: state vector at epoch and after 1h and 24h propagation | STK or GMAT reference propagation |
| docs/validation/reference-data/iers_eop_case.json | One epoch with published IERS Bulletin B UT1-UTC and polar motion values; expected GCRF→ITRF transform result | IERS Bulletin B (iers.org) |

# tests/physics/test_frame_transforms.py
import json
import numpy as np
import pytest
from pathlib import Path

CASES_FILE = Path("docs/validation/reference-data/frame_transform_gcrf_to_itrf.json")

def test_reference_data_exists():
    """Fail hard if committed test vectors are missing — do not skip."""
    assert CASES_FILE.exists(), f"Required reference data missing: {CASES_FILE}"

@pytest.mark.parametrize("case", json.loads(CASES_FILE.read_text()))
def test_gcrf_to_itrf(case):
    result = gcrf_to_itrf(case["gcrf_km"], parse_utc(case["epoch_utc"]))
    assert np.linalg.norm(result - case["expected_itrf_km"]) < 0.001  # 1 m tolerance

Reference data files: docs/validation/reference-data/vallado-sgp4-cases.json and docs/validation/reference-data/iers-frame-test-cases.json.

Operational significance of failure: A frame transform error propagates directly into corridor polygon coordinates. A 1 km error at re-entry altitude produces a ground-track offset of 5-15 km. A failing frame test is a blocking CI failure.


17.2 SGP4 Propagator Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| State vector at epoch | Vallado (2013) test set, 10 objects spanning LEO/MEO/GEO/HEO | Position error < 1 km at epoch; < 10 km after 7-day propagation |
| Epoch parsing | NORAD 2-line epoch format → UTC | Round-trip to 1 ms precision |
| TLE line 1/2 checksum | Modulo-10 algorithm | Pass/fail; corrupted checksum rejected before propagation |

Operational significance of failure: SGP4 position error at epoch > 1 km produces a corridor centred in the wrong place. Blocking CI failure.
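
The modulo-10 checksum named in the table is simple enough to sketch in full: digits contribute their value, a minus sign contributes 1, every other character contributes 0, and the sum modulo 10 must equal the final column of the line:

```python
def tle_checksum(line: str) -> int:
    """Compute the modulo-10 checksum over the first 68 characters of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1  # minus signs count as 1 by convention
    return total % 10

def tle_line_valid(line: str) -> bool:
    """A 69-character TLE line is valid when its checksum matches column 69."""
    return len(line) == 69 and line[68].isdigit() and tle_checksum(line) == int(line[68])
```

Rejecting a corrupted line here, before propagation, is cheap insurance against silently propagating garbage elements.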


17.3 Decay Predictor Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| NRLMSISE-00 density output | Picone et al. (2002) Table 1 reference atmosphere | Density within 1% of reference at 5 altitude/solar activity combinations |
| Historical backcast: p50 error | The Aerospace Corporation observed re-entry database (≥3 events Phase 1; ≥10 events Phase 2) | Median p50 error < 4h for rocket bodies with known physical properties |
| Historical backcast: corridor containment | Same database | p95 corridor contains observed impact in ≥90% of validation events |
| Historical replay: airspace disruption | Long March 5B Spanish airspace closure reconstruction with replay inputs and operator review | Affected FIR/time-window outputs judged operationally plausible and traceable in replay report |
| Air-risk ranking consistency | Documented crossing-scenario corpus (≥10 unique spacecraft/aircraft crossing cases by Phase 2) | Highest-ranked exposure slices remain stable under seed and traffic-density perturbations, or the differences are explained in the validation note |
| Conservative-baseline comparison | Same replay corpus vs. full-FIR or fixed-radius precautionary closure baseline | Refined outputs reduce affected area or duration in a majority of replay cases without undercutting the agreed p95 protective envelope |
| Cross-tool comparison | GMAT (NASA open source), 3 defined test cases | Re-entry time agreement within 1h for objects with identical inputs |
| Monte Carlo statistical consistency | Self-consistency: 500-sample run vs. 1000-sample run on same inputs | p05/p50/p95 agree within 2% (reducing with more samples) |

Reference data files: docs/validation/reference-data/aerospace-corp-reentries.json for decay-only validation and docs/validation/reference-data/reentry-airspace/ for airspace-risk replay cases (Long March 5B, Columbia-derived cloud case, and documented crossing scenarios). GMAT comparison is a manual procedure documented in docs/validation/README.md (GMAT is not run in CI — too slow; comparison run once per major model version).

Operational significance of failure: Decay predictor p50 error > 4h means corridors are offset in time; operators could see a hazard window that doesn't match the actual re-entry. Major model version gate.
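
The statistical-consistency criterion in the table compares percentile triplets between two run sizes. A sketch of the comparison helper (the relative-difference definition, normalised by the run-A window width, is an assumption; the 2% tolerance follows the table):

```python
import numpy as np

def percentile_agreement_pct(samples_a: np.ndarray, samples_b: np.ndarray) -> float:
    """Max disagreement (in %) between p05/p50/p95 of two MC runs, normalised
    by the run-A p05-p95 window width so the metric is scale-free."""
    pa = np.percentile(samples_a, [5, 50, 95])
    pb = np.percentile(samples_b, [5, 50, 95])
    width = pa[2] - pa[0]
    return float(np.max(np.abs(pa - pb)) / width * 100.0)
```

The check would then assert `percentile_agreement_pct(run_500, run_1000) < 2.0`, with both runs seeded as described in §17.0.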


17.4 Breakup Model Validation

| Test | Reference | Pass criterion |
|------|-----------|----------------|
| Fragment count distribution | ESA DRAMA published results for similar-mass objects | Fragment count within 30% of DRAMA reference for a 500 kg object at 70 km |
| Energy conservation at breakup altitude | Internal check | Total kinetic + potential energy conserved within 1% through fragmentation step |
| Casualty area geometry | Hand-calculated reference case | Casualty area polygon area within 10% of analytic calculation |

Operational significance of failure: Breakup model failure does not block Phase 1. It is an advisory failure in Phase 2. Blocking before Phase 3 regulatory submission.


17.5 Security Validation

| Test | Reference | Pass criterion | Blocking? |
|------|-----------|----------------|-----------|
| RBAC enforcement | test_rbac.py (every endpoint, every role) | 403 for insufficient role; 401 for unauthenticated; 0 mismatches | Yes |
| HMAC tamper detection | test_integrity.py (direct DB row modification) | API returns 503 + CRITICAL security_logs entry | Yes |
| Rate limiting | test_auth.py (per-endpoint threshold) | 429 after threshold; 200 after reset window | Yes |
| CSP headers | Playwright E2E | Content-Security-Policy header present on all pages | Yes |
| Container non-root | CI docker inspect check | No container running as root UID | Yes |
| Trivy CVE scan | Trivy against all built images | 0 Critical/High CVEs | Yes |

17.6 Verification Independence (F6 — §61)

EUROCAE ED-153 / DO-278A §6.4 requires that SAL-2 software components undergo independent verification — meaning the person who verifies (reviews/tests) a SAL-2 requirement, design, or code artefact must not be the same person who produced it.

Policy: docs/safety/VERIFICATION_INDEPENDENCE.md

Scope: All SAL-2 components identified in §24.13:

  • physics/ (decay prediction engine)
  • alerts/ (alert generation pipeline)
  • HMAC integrity verification functions
  • CZML corridor generation and frame transform

Implementation in GitHub:

# .github/CODEOWNERS
# SAL-2 components require an independent reviewer (not the PR author)
/backend/app/physics/     @safety-reviewer
/backend/app/alerts/      @safety-reviewer
/backend/app/integrity/   @safety-reviewer
/backend/app/czml/        @safety-reviewer

The @safety-reviewer team must have ≥1 member who is not the PR author. GitHub branch protection for main must include:

  • require_code_owner_reviews: true for the above paths
  • dismiss_stale_reviews: true (new commits require re-review)
  • SAL-2 PRs require ≥2 approvals (one of which must be from @safety-reviewer)

Verification traceability: The PR review record (GitHub PR number + reviewer + approval timestamp) serves as evidence for verification independence in the safety case (§24.12 E1.1). This record is referenced in the MoC document (§24.14 MOC-002).

Who qualifies as an independent reviewer for SAL-2: Any engineer who:

  1. Did not write the code being reviewed
  2. Has sufficient domain knowledge to evaluate correctness (orbital mechanics familiarity for physics/; alerting logic familiarity for alerts/)
  3. Is designated in the @safety-reviewer GitHub team

Before ANSP shadow activation, the safety case custodian confirms that all SAL-2 components committed in the release have a documented independent reviewer.


18. Additional Physics Considerations

| Topic | Why It Matters | Phase |
|-------|----------------|-------|
| Solar radiation pressure (SRP) | Dominates drag above ~800 km for high A/m objects | Phase 1 (decay predictor) |
| J2-J6 geopotential | J2 alone: ~7°/day RAAN error | Phase 1 (decay predictor) |
| Attitude and tumbling | Drag coefficient 2-3× different; captured via B* Monte Carlo | Phase 2 |
| Lift during re-entry | Non-spherical fragments: tens of km cross-track shift | Phase 2 (breakup) |
| Maneuver detection | Active satellites maneuver; TLE-to-TLE ΔV estimation | Phase 2 |
| Ionospheric drag | Captured via NRLMSISE-00 ion density profile | Phase 1 (via model) |
| Re-entry heating uncertainty | Emissivity/melt temperatures poorly known for debris | Phase 2 |

19. Development Phases — Detailed

Phase 1: Analytical Prototype (Weeks 1-10)

Goal: Real object tracking, decay prediction with uncertainty quantification, functional Persona A/B interface. Security infrastructure fully in place before any other feature ships.

Week Backend Deliverable Frontend Deliverable Security / SRE Deliverable
1-2 FastAPI scaffolding, Alembic migrations, Docker Compose with Tier 2 service topology. frame_utils.py, time_utils.py. IERS EOP refresh + SHA-256 verify. Append-only DB triggers. HMAC signing infrastructure. Liveness + readiness probes on all services. GET /healthz, GET /readyz with DB + Redis checks. Dead letter queue for Celery. task_acks_late, task_reject_on_worker_lost configured. Celery queue routing (ingest vs simulation). celery-redbeat configured. Legal/compliance: users table tos_accepted_at/tos_version/tos_accepted_ip/data_source_acknowledgement fields. First-login ToS/AUP/Privacy Notice acceptance flow (blocks access until all accepted). SBOM generated via syft; CesiumJS commercial licence verified. Privacy Notice drafted and published. Next.js scaffolding. Root layout: nav, ModeIndicator, AlertBadge, JobsPanel stub. Dark mode + high-contrast theme. CSP and security headers via Next.js middleware. ToS/AUP acceptance gate on first login (blocks dashboard until accepted). RBAC schema + require_role(). JWT RS256 + httpOnly cookies. MFA (TOTP). Redis AUTH + ACLs. MinIO private buckets. Docker network segmentation. Container hardening. git-secrets. Bandit + ESLint security in CI. Trivy. Dependency pinning. Dependabot. security_logs + sanitising formatter. Docker Compose depends_on: condition: service_healthy wired. Documentation: docs/ directory tree created; AGENTS.md committed; initial ADRs for JWT, dual frontend, Monte Carlo chord, frame library; docs/runbooks/TEMPLATE.md + index; CHANGELOG.md first entry; docs/validation/reference-data/ with Vallado and IERS cases; docs/alert-threshold-history.md initial entry. 
DevOps/Platform: self-hosted GitLab CI pipeline (lint, test-backend, test-frontend, security-scan, build-and-push jobs); multi-stage Dockerfiles for all services; .pre-commit-config.yaml with all six hooks; .env.example committed with all variables documented; Makefile with dev, test, migrate, seed, lint, clean targets; Docker layer + pip + npm build cache configured; sha-<commit> image tagging in the GitLab container registry in place. Prometheus metrics: spacecom_active_tip_events, spacecom_tle_age_hours, spacecom_hmac_verification_failures_total instrumented.
3-4 Catalog module: object CRUD, TLE import. TLE cross-validation. ESA DISCOS import. Ingest Celery Beat (celery-redbeat). Hardcoded URLs, SSRF-mitigated HTTP client. WAL archiving configured. Daily backup Celery task. TimescaleDB compression policy on orbits. Retention policy scaffolded. Object Catalog page. DataConfidenceBadge. Object Watch page stub. Rate limiting (slowapi). Simulation parameter range validation. Prometheus: spacecom_ingest_success_total, spacecom_ingest_failure_total per source. AlertManager rule: consecutive ingest failures → warning.
5-6 Space Weather: NOAA SWPC + ESA SWS cross-validation. operational_status string. TIP message ingestion. Prometheus: spacecom_prediction_age_seconds per NORAD ID. Readiness probe: TLE staleness + space weather age checks. SpaceWeatherWidget. Alert taxonomy: CRITICAL banner, NotificationCentre, AcknowledgeDialog. Degraded mode banner (reads readyz 207 response). alert_events append-only verified. Alert rate-limit and deduplication. Alert storm detection. AlertManager rule: spacecom_active_tip_events > 0 AND prediction_age > 3600 → critical.
7-8 Catalog Propagator (SGP4): TEME→GCRF, CZML (J2000). Ephemeris caching. Frame transform validation. All CZML strings HTML-escaped. MC chord architecture: run_mc_decay_prediction → group(run_single_trajectory) → aggregate_mc_results. Chord result backend (Redis) sized. Globe: real object positions, LayerPanel, clustering, urgency symbols. TimelineStrip. Live mode scrub. WebSocket auth: cookie-based; connection limit. WS ping/pong. Prometheus: spacecom_simulation_duration_seconds histogram.
9-10 Decay Predictor: RK7(8) + NRLMSISE-00 + Monte Carlo chord. HMAC-signed output. Immutability triggers. Corridor polygon generation. Re-entry API. Validate against ≥3 historical re-entries. Monthly restore test Celery task implemented. Mode A (Percentile Corridors). Event Detail: PredictionPanel with p05/p50/p95, HMAC status badge. TimelineGantt. Operational Overview. UncertaintyModeSelector (B/C greyed). HMAC tamper detection E2E test. All-clear TIP cross-check guard. First backup restore test executed and passing. spacecom_simulation_duration_seconds p95 verified < 240s on Tier 2 hardware.

Phase 2: Operational Analysis (Weeks 11-22)

Week Backend Deliverable Frontend Deliverable Security / Regulatory
11-12 Atmospheric Breakup: aerothermal, fragments, ballistic descent, casualty area. Fragment impact points on globe. Fragment detail panel. OWASP ZAP DAST against staging.
13-14 Conjunction: all-vs-all screening, Alfano probability. Conjunction events on globe. ConjunctionPanel. STRIDE threat model reviewed for Phase 2 surface.
15-16 Upper/Lower Atmosphere. Hazard module: fused zones, HMAC-signed, immutable, shadow_mode flag. Mode B (Probability Heatmap): Deck.gl. UncertaintyModeSelector unlocks Mode B. RLS multi-tenancy integration tests. Shadow records excluded from operational API (integration test).
17-18 Airspace: FIR/UIR load, PostGIS intersection. Airspace impact table. NOTAM Drafting: ICAO format, notam_drafts table, mandatory disclaimer. Shadow mode admin toggle. AirspaceImpactPanel. NOTAM draft flow: NotamDraftViewer, disclaimer banner, review/cancel. 2D Plan View. ViewToggle. /airspace page. ShadowBanner + ShadowModeIndicator. Regulatory disclaimer verified present on all NOTAM drafts. axe-core accessibility audit.
19-20 Report builder: bleach sanitisation, Playwright renderer (isolated, no-network, timeouts, seccomp). MinIO storage. Shadow validation schema + shadow_validations table. ReportConfigDialog, ReportPreview, /reports page. IntegrityStatusBadge. SimulationComparison. ShadowValidationReport scaffold. Renderer: network_mode: none enforced; sanitisation tests passing; 30s timeout verified.
21-22 Space Operator Portal: owned_objects, controlled re-entry planner (deorbit window optimiser), CCSDS export, api_keys table + lifecycle. modules.api with per-key rate limiting. Legal gate: legal opinion commissioned and received for primary deployment jurisdiction; legal_opinions table populated; shadow mode admin toggle wired to shadow_mode_cleared flag. Space-Track AUP redistribution clarification obtained (written confirmation from 18th Space Control Squadron or counsel opinion on permissible use). ECCN classification review commissioned for Controlled Re-entry Planner. GDPR compliance review: data inventory completed, lawful bases documented, DPA template drafted, erasure procedure (handle_erasure_request) implemented. /space portal: SpaceOverview, ControlledReentryPlanner, DeorbitWindowList, ApiKeyManager, CcsdsExportPanel. Shadow mode admin toggle displays legal clearance status. Object ownership RLS policy tested: space_operator cannot access non-owned objects. API key rate limiting verified. API Terms accepted at key creation and recorded. Jurisdiction screening at registration (OFAC/EU/UK sanctions list check).

Phase 3: Operational Deployment (Weeks 23-32)

Week Backend Deliverable Frontend Deliverable Security / Regulatory / SRE
23-24 Alerts module: thresholds, email delivery, geographic filtering, alert_events. Shadow mode: alerts suppressed. ADS-B feed integration: OpenSky Network REST API (https://opensky-network.org/api/states/all); polled every 60s via Celery Beat; flight state vectors stored in adsb_states (non-hypertable; rolling 24h window); route intersection advisory module reads adsb_states to identify flights in re-entry corridors. Air Risk module initialisation: aircraft exposure scoring, time-slice aggregation, and vulnerability banding by aircraft class. Tier 3 HA infrastructure: TimescaleDB streaming replication + Patroni + etcd. Redis Sentinel (3 nodes). 4× simulation workers (64 total cores). Blue-green deployment pipeline wired. Full alert lifecycle UI: geographic filtering, mute rules, acknowledgement audit. Route overlay on globe. AirRiskPanel by FIR/time slice. Route intersection advisory (avoidance boundary only). Legal/regulatory: MSA template finalised by counsel; Regulatory Sandbox Agreement template finalised. First ANSP shadow deployment executed under signed Regulatory Sandbox Agreement and confirmed legal clearance. GDPR breach notification procedure tested (tabletop exercise). Professional indemnity, cyber liability, and product liability insurance confirmed in place. SRE: Patroni failover tested (primary killed; standby promotes; backend reconnects; verify zero lost predictions). Redis Sentinel failover tested. SLO baseline measurements taken on Tier 3 hardware.
25-26 Feedback: prediction vs. outcome. Density scaling recalibration. Maneuver detection. Shadow validation report generation. Historical replay corpus: Long March 5B, Columbia-derived cloud case, and documented crossing-scenario set. Conservative-baseline comparison reporting for airspace closures. Launch safety module. Deployment freeze gate (CI/CD: block deploy if CRITICAL/HIGH alert active). ANSP communication plan implemented (degradation push + email). Incident response runbooks written (DB failover, Celery recovery, HMAC failure, ingest failure). Prediction accuracy dashboard. Historical comparison. ShadowValidationReport. Air-risk replay comparison views. /space Persona F workspace. Launch safety portal. Vault / cloud secrets manager. Secrets rotation. Begin first ANSP shadow mode deployment. SRE: PagerDuty/OpsGenie integrated with Prometheus AlertManager. SEV-1/2/3/4 routing configured. First on-call rotation established.
2728 Mode C binary MC endpoint. Load testing (100 users, <2s CZML p95; MC p95 < 240s). Prometheus + Grafana: three dashboards (Operational Overview, System Health, SLO Burn Rate). Full AlertManager rules. ECSS compliance artefacts: SMP, VVP, PAP, DMP. MinIO lifecycle rules: MC blobs > 90 days → cold tier. Mode C (Monte Carlo Particles). UncertaintyModeSelector unlocks Mode C. Final Playwright E2E suite. Grafana Operational Overview embedded in /admin. External penetration test (auth bypass, RBAC escalation, SSRF, XSS→Playwright, WS auth bypass, data integrity, object ownership bypass, API key abuse). All Critical/High remediated. Load test: SLO p95 targets verified under 100-user concurrent load.
2932 Regulatory acceptance package: safety case framework, ICAO data quality mapping, shadow validation evidence, SMS integration guide. TRL 6 demonstration. Data archival pipeline (Parquet export to MinIO cold before chunk drop). Storage growth verified against projections. ESA bid legal: background IP schedule documented; Consortium Agreement with academic partner signed (IP ownership, publication rights, revenue share); SBOM submitted as part of ESA artefact package. ECCN classification determination received; export screening process in place for all new customer registrations. ToS version updated to reflect any regulatory feedback from first ANSP deployments; re-acceptance triggered. Regulatory submission report type. TRL demonstration artefacts. SOC 2 Type I readiness review. Production runbook + incident response per threat scenario. ECSS compliance review. Monthly restore test passing in CI. Error budget dashboard showing < 10% burn rate.
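The ADS-B ingest described above polls OpenSky's `/states/all` endpoint, which returns each flight state as a positional JSON array. A minimal sketch of turning one raw state row into a typed record, assuming the field ordering documented in the public OpenSky REST API (index 0 icao24, 1 callsign, 5 longitude, 6 latitude, 7 baro_altitude, 8 on_ground, 9 velocity); the `AdsbState` type is illustrative and not the SpaceCom `adsb_states` schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdsbState:
    """One flight state vector from OpenSky /states/all (illustrative type)."""
    icao24: str
    callsign: Optional[str]
    longitude: Optional[float]
    latitude: Optional[float]
    baro_altitude_m: Optional[float]
    velocity_ms: Optional[float]
    on_ground: bool

def parse_state(row: list) -> AdsbState:
    # OpenSky returns each state as a positional array; indices assumed
    # from the public API docs: 0 icao24, 1 callsign, 5 lon, 6 lat,
    # 7 baro_altitude, 8 on_ground, 9 velocity.
    callsign = row[1].strip() if row[1] else None
    return AdsbState(
        icao24=row[0],
        callsign=callsign or None,
        longitude=row[5],
        latitude=row[6],
        baro_altitude_m=row[7],
        velocity_ms=row[9],
        on_ground=bool(row[8]),
    )
```

In the polling task, each parsed record would be upserted into `adsb_states` keyed on `icao24`, with rows older than the 24h window pruned on each cycle.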

20. Key Decisions and Tradeoffs

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Propagator split | SGP4 catalog + numerical decay | SGP4 for everything | SGP4 diverges by days–weeks for re-entry time prediction |
| Numerical integrator | RK7(8) adaptive + NRLMSISE-00 | poliastro Cowell | Direct force model control |
| Frame library | astropy | Manual SOFA Fortran | Handles IERS EOP; well-tested IAU 2006 |
| Atmospheric density | NRLMSISE-00 (P1), JB2008 option (P2) | Simple exponential | Community standard; captures solar cycle |
| Breakup model | Simplified ORSAT-like | Full DRAMA/SESAM | DRAMA requires licensing; simplified recovers ~80% utility |
| Uncertainty visualisation | Three modes, phased (A→B→C), user-selectable | Single fixed mode | Serves different personas; operational users need corridors, analysts need heatmaps |
| JWT algorithm | RS256 (asymmetric) | HS256 (shared secret) | Compromise of one service does not expose signing key to all services |
| Token storage | httpOnly Secure SameSite=Strict cookie | localStorage | XSS cannot read httpOnly cookies; localStorage is trivially exfiltrated |
| Token revocation | DB refresh_tokens table | Redis-only | Revocations survive restarts; enables rotation-chain audit |
| MFA | TOTP (RFC 6238) required for all roles | Optional MFA | Aviation authority context; government procurement baseline |
| Secrets management | Docker secrets (P1 prod) → Vault (P3) | Env vars only | Env vars appear in process listings and crash dumps; no audit trail |
| Alert integrity | Backend-only generation on verified data | Client-triggered alerts | Prevents false alert injection via API |
| Prediction integrity | HMAC-signed, immutable after creation | Mutable with audit log | Tamper-evident at database level; modification is impossible, not just logged |
| Multi-tenancy | RLS at database layer + organisation_id | Application-layer only | DB-level enforcement cannot be bypassed by application bugs |
| Renderer isolation | Separate renderer container, no external network | Playwright in backend container | Limits blast radius of XSS→SSRF escalation |
| Server state | TanStack Query | Zustand for everything | Automatic cache, background refetch; Zustand is not a data cache |
| Navigation model | Task-based (events, airspace, analysis) | Module-based | Users think in tasks, not modules |
| Report rendering | Playwright headless server-side | Client-side canvas | Reliable at print resolution; consistent; not affected by client GPU |
| Monorepo | Monorepo | Separate repos | Small team, shared types, simpler CI |
| ORM | SQLAlchemy 2.0 | Raw SQL | Mature async support; Alembic migrations |
| Domain architecture | Dual front door (aviation + space portal), shared physics core | Single aviation-only product | Space operator revenue stream; ESA bid credibility; space credibility supports aviation trust |
| Space operator object scoping | PostgreSQL RLS on owned_objects join | Application-layer filtering only | DB-level enforcement; prevents application bugs from leaking cross-operator data |
| NOTAM output | Draft only + mandatory disclaimer; never submitted | System-assisted NOTAM submission | SpaceCom is not a NOTAM originator; keeps platform in purely informational role; reduces regulatory approval burden |
| Reroute module scope | Strategic pre-flight avoidance boundary only | Specific alternate route generation | Specific routes require ATC integration and aircraft performance data SpaceCom does not have; avoidance boundary keeps SpaceCom legally defensible |
| Shadow mode | Org-level flag; all alerts suppressed; records segregated | Per-prediction flag | Enables ANSP trial deployments; accumulates validation evidence for regulatory acceptance; segregation prevents operational confusion |
| Controlled re-entry planner output | CCSDS-format manoeuvre plan + risk-scored deorbit windows | Aviation-format only | Space operators submit to national regulators and ops centres in CCSDS; Zero Debris Charter evidence format |
| API access | Separate API keys (not session JWT); per-key rate limiting | Session cookie only | Space operators integrate SpaceCom into operations centres programmatically; API keys are revocable machine credentials |
| MC parallelism model | Celery group + chord (fan-out sub-tasks across worker pool) | multiprocessing.Pool within single task | Chord distributes across all worker containers; Pool limited to one container's cores; chord scales horizontally |
| Worker topology | Two separate Celery pools: ingest and simulation | Single shared queue | Runaway simulation jobs cannot starve TLE ingestion; critical for reliability during active TIP events |
| Celery Beat HA | celery-redbeat (Redis-backed, distributed locking) | Standard Celery Beat (single process) | Beat SPOF means scheduled ingest silently stops; redbeat enables multiple instances with leader election |
| DB HA | TimescaleDB streaming replication + Patroni auto-failover | Single-instance DB | RPO = 0 for critical tables; 15-minute RTO requires automatic failover, not manual |
| Redis HA | Redis Sentinel (3 nodes) | Single Redis | Master failure without Sentinel means all Celery queues and WebSocket pub/sub stop |
| Deployment gate | CI/CD checks for active CRITICAL/HIGH alerts before deploying | Manual judgement | Prevents deployments during active TIP events; protects operational continuity |
| MC blade sizing | 16 vCPU per simulation worker container | Smaller containers | MC chord sub-tasks fill all available cores; below 16 cores p95 SLO of 240s is not met |
| Temporal uncertainty display | Plain window range ("08h–20h from now / most likely ~14h") for Persona A/C; p05/p50/p95 UTC for Persona B | ± Nh notation everywhere | ± implies symmetric uncertainty which re-entry distributions are not; window range is operationally actionable |
| Space weather impact communication | Operational buffer recommendation ("+2h beyond 95th pct") rather than % deviation | Percentage string | Percentage is meaningless without a known baseline; buffer hours are immediately usable by an ops duty manager |
| TLS termination | Caddy with automatic ACME (internet-facing) / internal CA (air-gapped) | nginx + manual certs | Caddy handles cert lifecycle automatically; decision tree in §34 |
| Pagination | Cursor-based (created_at, id) | Offset-based | Offset degrades to full-table scan at 7-year retention depth; cursor is O(1) regardless of dataset size |
| CZML delta protocol | ?since=<iso8601> parameter; max 5 MB full payload; X-CZML-Full-Required header on stale client | Full catalog always | 100-object catalog at 1-min cadence is ~10–50 MB/hr per connected client without delta; delta reduces this to <500 KB/hr |
| MC concurrency gate | Per-org Redis semaphore; 1 concurrent MC run (Phase 1); 429 + Retry-After on limit | Unbounded fan-out | 5 concurrent MC requests = 2,500 sub-tasks queued; p95 SLO collapses without backpressure |
| TimescaleDB compress_after | 7 days for orbits (not 1 day) | Compress as soon as possible | Compressing hot chunks forces decompress on every write; 1-day compress_after causes 50–200 ms write latency thrash |
| Renderer memory limit | mem_limit: 4g Docker cap on renderer container | No memory limit | Chromium print rendering at A4/300DPI consumes 2–4 GB; 4 uncapped renderer instances can OOM a 32 GB node |
| Static asset caching | Cloudflare CDN (internet-facing); nginx sidecar (on-premise) | No CDN | CesiumJS bundle ~5–10 MB; 100 concurrent first-load = 500 MB–1 GB burst without caching |
| WAF/DDoS protection | Upstream provider (Cloudflare/AWS Shield) for internet-facing; network perimeter for air-gapped | Application-layer rate limiting only | Application-layer is insufficient for volumetric attacks; must be at ingress |
| Multi-region deployment | Single region per customer jurisdiction; separate instances, not shared cluster | Active-active multi-region | Data sovereignty; simpler compliance certification; Phase 1–3 customer base doesn't justify multi-region cost |
| MinIO erasure coding | EC:2 (4-node) | EC:4 or RAID | EC:2 tolerates 1 write failure / 2 read failures; balanced between protection and storage efficiency at 4 nodes |
| DB connection routing | PgBouncer as single stable connection target | Direct Patroni primary connection | Patroni failover transparent to application; stable DNS target through primary changes |
| Egress filtering | Host-level UFW/nftables allow-list (Tier 2); Calico/Cilium network policy (Tier 3) | Trust Docker network isolation | Docker isolation is inter-network only; outbound internet egress unrestricted without host-level filtering |
| Mode-switch dialogue | Explicit current-mode + target-mode + consequences listed; Cancel left, destructive action right | Generic "Are you sure?" | Aviation HMI conventions; listed consequences prevent silent simulation-during-live error |
| Future-preview temporal wash | Semi-transparent overlay + persistent label on event list when timeline scrubber is not at current time | No visual distinction | Prevents controller from acting on predicted-future data as though it is current operational state |
| Simulation block during active alerts | Optional org-level disable_simulation_during_active_events flag | Always allow simulation entry | Prevents an analyst accidentally entering simulation while CRITICAL alerts require attention in the same ops room |
| Prediction superseding | Write-once superseded_by FK on reentry_predictions / simulations | Mutable or delete | Preserves immutability guarantee; gives analysts a way to mark outdated predictions without removing the audit record |
| CRITICAL acknowledgement gate | 10-character minimum free-text field; two-step confirmation modal | Single click | Prevents reflexive acknowledgement; creates meaningful action record for every acknowledged CRITICAL event |
| Multi-ANSP coordination panel | Shared acknowledgement status and coordination notes across ANSP orgs on the same event | Out-of-band only | Creates shared digital situational awareness record without replacing voice coordination; reduces risk of conflicting parallel NOTAMs |
| Legal opinion timing | Phase 2 gate (before shadow deployment); not Phase 3 | Phase 3 task | Common law duty of care may attach regardless of UI disclaimers; liability limitation must be in executed agreements before any ANSP relies on the system |
| Commercial contract instruments | Three instruments: MSA + AUP click-wrap + API Terms | Single platform ToS | Each instrument addresses a different access pathway; API access by Persona E/F must have separate terms recorded against the key |
| Shadow mode legal gate | legal_opinions.shadow_mode_cleared must be TRUE before shadow mode can be activated for an org | Admin can enable freely | Shadow deployment is a formal regulatory activity; without a completed legal opinion it exposes SpaceCom to uncapped liability in the deployment jurisdiction |
| GDPR erasure vs. retention | Pseudonymise user references in append-only tables on erasure request; never delete safety records | Hard delete on request | UN Liability Convention requires 7-year retention; GDPR right to erasure is satisfied by removing the link to the individual, not the record itself |
| Space-Track data redistribution | Obtain written clarification from 18th SCS before exposing TLE/CDM data via the SpaceCom API | Assume permissible | Space-Track AUP prohibits redistribution to unregistered parties; violation could result in loss of Space-Track access, disabling the platform's primary data source |
| OSS licence compliance | CesiumJS commercial licence required for closed-source deployment; SBOM generated from Phase 1 | Assume all dependencies are permissively licensed | CesiumJS AGPLv3 requires source disclosure for network-served applications; undiscovered licence violations create IP risk in ESA bid |
| Insurance | Professional indemnity + cyber liability + product liability required before operational deployment | No insurance requirement | Aviation safety context; potential claims from incorrect predictions that inform airspace decisions could exceed SpaceCom's balance sheet without coverage |
| Connection pooling | PgBouncer transaction-mode pooler between all app services and TimescaleDB | Direct connections from app | Tier 3 connection count (2× backend + 4× workers + 2× ingest) exceeds max_connections=100 without a pooler; Patroni failover updates only PgBouncer |
| Redis eviction policy | noeviction for Celery/redbeat (separate DB index); allkeys-lru for application cache | Single Redis with one policy | Broker message eviction causes silent job loss; cache eviction is acceptable |
| Bulk export implementation | Celery task → MinIO → presigned URL (async offload pattern) | Streaming response from API handler | Full catalog export can be gigabytes; materialising in API handler risks OOM on the backend container |
| Analytics query routing | Patroni standby replica for Persona B/F analytics; primary for operational reads | All reads to primary | Analytics queries during a TIP event would compete with operational reads on the primary; standby already provisioned at Tier 3 |
| SQLAlchemy lazy loading | lazy="raise" on all relationships | Default lazy loading | Async SQLAlchemy silently blocks the event loop on lazy-loaded relationships; raise converts silent N+1s into loud development-time errors |
| CZML cache strategy | Per-object fragment cache + full catalog assembly; TTL keyed to last propagation job | No cache; query DB on each request | CZML catalog fetch at 100 objects = 864k rows; uncached this misses the 2s p95 SLO under concurrent load |
| Hypertable chunk interval (orbits) | 1-day chunks (not default 7-day) | Default 7-day | 72h CZML query spans 3 × 1-day chunks; spans 11 × 7-day chunks — chunk exclusion is far less effective with the default |
| Continuous aggregate for F10.7 81-day avg | TimescaleDB continuous aggregate space_weather_daily | Compute from raw rows per request | At 100 concurrent users, 100 identical scans of 11,664 raw rows; continuous aggregate reduces this to a single-row lookup |
| CI/CD orchestration | GitHub Actions | Jenkins / GitLab CI | Project is GitHub-native; Actions has OIDC → GHCR; no separate CI server to operate |
| Container image tags | sha-<commit> as canonical immutable tag; semantic version alias for releases | latest tag only | latest is mutable and non-reproducible; sha-<commit> gives exact traceability from deployed image back to source commit |
| Multi-stage Docker builds | Builder stage (full toolchain) + runtime stage (distroless/slim) | Single-stage with all tools | Eliminates build toolchain, compiler, and dev dependencies from production image; typically reduces image size by 60–80% |
| Local dev hot-reload | Backend: FastAPI --reload via bind-mounted ./backend volume; Frontend: Next.js Vite HMR | Rebuild container on change | Full container rebuild per code change adds 30–90 s per iteration; volume mount + process reload is < 1s |
| .env.example contract | .env.example with all required variables, descriptions, and stage flags committed to repo; actual .env in .gitignore | Ad-hoc variable discovery from runtime errors | Engineers must be able to run cp .env.example .env and have a working local stack within 15 minutes of cloning |
| Staging environment strategy | main branch continuously deployed to staging via GitHub Actions; production deploy requires manual approval gate after staging smoke tests pass | Manual staging deploys | Reduces time-to-detect integration regressions; staging serves as TRL artefact evidence environment |
| Secrets rotation | Per-secret rotation runbook: Space-Track credentials, JWT signing keys, ANSP tokens; old + new key both valid during 5-minute transition window; security_logs entry required; rotated via Vault dynamic secrets in Phase 3 | Manual rotation with downtime | Aviation context: key rotation must not cause service interruption; zero-downtime rotation is a reliability requirement, not a convenience |
| Build cache strategy | Docker layer cache: cache-from/cache-to targeting GHCR in GitHub Actions; pip wheel cache: actions/cache keyed on requirements.txt hash; npm cache: actions/cache keyed on package-lock.json hash | No cache; full rebuild each push | Without cache, a full rebuild takes 8–12 minutes; with cache, incremental pushes take 2–3 minutes — critical for CI as a useful merge gate |
| Image retention policy | Tagged release images kept indefinitely; untagged/orphaned images purged weekly via GHCR lifecycle policy; staging images retained 30 days; dev branch images retained 7 days | No policy; manual cleanup | Unmanaged GHCR storage grows unboundedly; stale images also represent unaudited CVE surface |
| Pre-commit hook completeness | Six hooks: detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff | git-secrets only | git-secrets scans only for known secret patterns; detect-secrets uses entropy analysis; hadolint prevents insecure Dockerfile patterns; sqlfluff catches migration anti-patterns before code review |
| alembic check in CI | CI job runs alembic check to detect SQLAlchemy model/migration divergence; fails if models have unapplied changes | Only run migrations, no divergence check | SQLAlchemy models can diverge from migrations silently; alembic check catches the gap before it reaches production |
| FIR boundary data source | EUROCONTROL AIRAC (ECAC states) + FAA Digital-Terminal Procedures (US) + OpenAIP (fallback); 28-day update cadence | Manually curated GeoJSON, updated ad hoc | FIR boundaries change on AIRAC cycles; stale boundaries produce wrong airspace intersection results during live TIP events |
| ADS-B data source | OpenSky Network REST API (Phase 3 MVP); commercial upgrade path to Flightradar24 or FAA SWIM ADS-B if required | Direct receiver hardware | OpenSky is free, global, and sufficient for route overlay and intersection advisory; commercial upgrade only if coverage gaps identified in ANSP trials |
| CCSDS OEM reference frame | GCRF (Geocentric Celestial Reference Frame); time system UTC; OBJECT_ID = NORAD catalog number; missing international designator populated as UNKNOWN | ITRF or TEME | GCRF is the standard output of SpaceCom's frame transform pipeline; downstream mission control tools expect GCRF for propagation inputs |
| CCSDS CDM field population | SpaceCom populates: HEADER, RELATIVE_METADATA, OBJECT1/2 identifiers, state vectors, covariance (if available); fields not held by SpaceCom emitted as N/A per CCSDS 508.0-B-1 §4.3 | Omit empty fields | N/A is the CCSDS-specified sentinel for unknown values; silent omission causes downstream parser failures |
| CDM ingestion display | Space-Track CDM Pc displayed alongside SpaceCom-computed Pc with explicit provenance labels; > 10× discrepancy triggers DATA_CONFIDENCE warning on conjunction panel | Show only one value | Space operators need both values; discrepancy without explanation erodes trust in both |
| WebSocket event schema | Typed event envelope with type discriminator, monotonic seq, and ts; reconnect with ?since_seq= replay of up to 200 events / 5-minute ring buffer; resync_required on stale reconnect | Schema-free JSON stream | Untyped streams require every consumer to reverse-engineer the schema; schema enables typed client generation |
| Alert webhook delivery | At-least-once POST to registered HTTPS endpoint; HMAC-SHA256 signature; 3 retries with exponential backoff; degraded status after 3 failures; auto-disable after 10 consecutive failures | WebSocket / email only | ANSPs with existing dispatch infrastructure (AFTN, internal webhook receivers) cannot integrate via browser WebSocket; webhooks are the programmatic last-mile |
| API versioning | /api/v1 base; breaking changes require /api/v2 parallel deployment; 6-month support overlap; Deprecation / Sunset headers (RFC 8594); 3-month written notice to API key holders | No versioning policy; breaking changes deployed ad hoc | Space operators building operations centre integrations need stable contracts; silent breaking changes disable their integrations |
| SWIM integration path | Phase 2: GeoJSON structured export; Phase 3: FIXM review + EUROCONTROL SWIM-TI AMQP publish endpoint | Not applicable | European ANSP procurement increasingly requires SWIM compatibility; GeoJSON export is low-cost first step; full SWIM-TI is Phase 3 |
| Space-Track API contract test | Integration test asserts expected JSON keys present in Space-Track response; ingest health alert fires after 4 consecutive hours with 0 successful Space-Track records | No contract test; breakage discovered at runtime | Space-Track API has had historical breaking changes; silent format change means ingest returns no data while health metrics appear normal |
| TLE checksum validation | Modulo-10 checksum on both lines verified before DB write; BSTAR range check; failed records logged to security_logs type INGEST_VALIDATION_FAILURE | Accept TLE at face value | Corrupted TLEs (network errors, encoding issues) would propagate incorrect state vectors without validation |
| Model card | docs/model-card-decay-predictor.md maintained alongside the model; covers validated orbital regime envelope, known failure modes, systematic biases, and performance by object type | Accuracy statement only in §24.3 | Regulators and ANSPs require a documented operational envelope, not just a headline accuracy figure; ESA TRL artefact requirement |
| Historical backcast selection | Validation report explicitly documents selection criteria, identifies underrepresented object categories, and states accuracy conditional on object type | Single unconditional accuracy figure | Observable re-entry population is biased toward large well-tracked objects; publishing an unconditional accuracy figure misrepresents model generalisation |
| Out-of-distribution detection | ood_flag = TRUE and ood_reason set at prediction time if any input falls outside validated bounds; UI shows mandatory warning callout | Serve all predictions identically | NRLMSISE-00 calibration domain does not include tumbling objects, very high area-to-mass ratio, or objects with no physical property data |
| Prediction staleness warning | prediction_valid_until = p50_reentry_time - 4h; UI warns independently of system-level TLE staleness if NOW() > prediction_valid_until and not superseded | No time-based staleness on predictions | An hours-old prediction for an imminent re-entry has implicitly grown uncertainty; operators need a signal independent of the system health banner |
| Alert threshold governance | Thresholds documented with rationale; change approval requires engineering lead sign-off + shadow-mode validation period; change log maintained in docs/alert-threshold-history.md | Thresholds set in code with no governance | CRITICAL trigger (window < 6h, FIR intersection) has airspace closure consequences; undocumented threshold changes cannot be reviewed by regulators or ANSPs |
| FIR intersection auditability | alert_events.fir_intersection_km2 and intersection_percentile recorded at alert generation; UI shows "p95 corridor intersects ~N km² of FIR XXXX" | Alert log shows only "intersects FIR XXXX" | Intersection without area and percentile context is not auditable; regulators and ANSPs need to know how much intersection triggered the alert |
| Recalibration governance | Recalibration requires hold-out validation dataset, minimum accuracy improvement threshold, sign-off authority, rollback procedure, and notification to ANSP shadow partners | Recalibration run and deployed without gates | Unchecked recalibration can silently degrade accuracy for object types not in the calibration set |
| Model version governance | Changes classified as patch/minor/major; major changes require active prediction re-runs with supersession + ANSP notification; rollback path documented | No governance; model updated silently | A major model version change producing materially different corridors without re-running active predictions creates undocumented divergence between what ANSPs are seeing and current best predictions |
| Adverse outcome monitoring | prediction_outcomes table records observed re-entry outcomes against predictions; quarterly accuracy report generated from feedback pipeline; false positive/negative rates in Grafana | No post-deployment accuracy tracking | Without outcome monitoring SpaceCom cannot demonstrate performance within acceptable bounds to regulators; shadow validation reports are episodic, not continuous |
| Geographic coverage annotation | FIR intersection results carry data_coverage_quality flag per FIR; OpenAIP-sourced boundaries flagged as lower confidence | All FIR intersections treated equally | AIRAC coverage varies by region; operators in non-ECAC regions receive lower-quality intersection assessments without knowing it |
| Public transparency report | Quarterly aggregate accuracy/reliability report published (no personal data); covers prediction count, backcast accuracy, error rates, known limitations | No public reporting | Civil aviation safety tools operate in a regulated transparency environment; ESA bid credibility and regulatory acceptance require demonstrable performance |
| docs/ directory structure | Canonical tree defined in §12.1; all documentation files live at known paths committed to the repo | Ad-hoc file creation by individual engineers | Documentation that exists only in prose references gets created inconsistently or not at all |
| Architecture Decision Records | MADR-format ADRs in docs/adr/; one per consequential decision in §20; linked from relevant code via inline comment | §20 table in master plan only | Engineers working in the repo cannot find decision rationale without reading a 5000-line plan document |
| OpenAPI documentation standard | Every public endpoint has summary, description, tags, and at least one responses example; enforced by CI check | Auto-generated stubs only | Auto-generation produces syntactically correct docs that are useless to API integrators (Persona E/F) |
| Runbook format | Standard template in docs/runbooks/TEMPLATE.md; required sections: Trigger, Severity, Preconditions, Steps, Verification, Rollback, Notify; runbook index maintained | Free-form runbooks written ad-hoc | Runbooks written under pressure without a template consistently omit the rollback and notification steps |
| Docstring standard | Google-style docstrings required on all public functions in propagator/, reentry/, breakup/, conjunction/, integrity.py; parameters include physical units | No docstring requirement | Physics functions without units and limitations documented cannot be reviewed or audited by third-party evaluators for ESA TRL |
| Validation procedure | §17 specifies reference data location, run commands, pass/fail tolerances per suite; docs/validation/README.md describes how to add new cases | Checklist of what to validate without procedure | A third party cannot reproduce the validation without knowing where the reference data is and what tolerance constitutes a pass |
| User documentation | Phase 2 delivers aviation portal guide + API quickstart; Phase 3 delivers space portal guide + in-app contextual help; stored in docs/user-guides/ | No user documentation | ANSP SMS acceptance requires user documentation; aviation operators cannot learn an unfamiliar safety tool from the UI alone |
| CHANGELOG.md format | Keep a Changelog conventions; human-maintained; one entry per release with Added/Changed/Deprecated/Removed/Fixed/Security sections | No format specified | Changelogs written by different engineers without a format are unusable by operators and regulators |
| AGENTS.md | Project-root file defining behaviour guidance for AI coding agents; specifies codebase conventions, test requirements, and safety-critical file restrictions; committed to repo | Untracked file, undefined purpose | An undocumented AGENTS.md is either ignored or followed inconsistently, undermining its purpose |
| Test documentation | Module docstrings on physics/security test files state the invariant, reference source, and operational significance of failure; docs/test-plan.md lists all suites with scope and blocking classification | No test documentation requirement | ECSS-Q-ST-80C requires a test specification as a separate deliverable from the test code |
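The TLE checksum validation decision above relies on the standard modulo-10 scheme: digits contribute their face value, a minus sign contributes 1, all other characters (letters, spaces, periods, plus signs) contribute 0, and the sum modulo 10 must equal the final column. A minimal, self-contained check (a sketch, not SpaceCom's actual ingest code):

```python
def tle_checksum_ok(line: str) -> bool:
    """Verify the modulo-10 checksum in the last column of a TLE line.

    Digits contribute their value, '-' contributes 1, everything else
    (letters, spaces, '.', '+') contributes 0.
    """
    body, check = line[:-1], line[-1]
    if not check.isdigit():
        return False
    total = sum(int(c) if c.isdigit() else (1 if c == "-" else 0) for c in body)
    return total % 10 == int(check)

# Line 1 of the widely published ISS example TLE (checksum digit: 7)
line1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"
```

In the ingest path this gate would run on both lines before the DB write, with failures logged as INGEST_VALIDATION_FAILURE; the BSTAR range check mentioned in the table is a separate step.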
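For the alert webhook row, signing and verification with HMAC-SHA256 can be sketched as follows. Binding a timestamp into the MAC is a common replay defence; the timestamp scheme, header names, and the 5-minute skew window are assumptions here, not specified by the plan:

```python
import hmac
import hashlib
import time

def sign_webhook(secret: bytes, body: bytes, ts: int) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>' (illustrative scheme)."""
    msg = str(ts).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, ts: int, sig: str,
                   max_skew_s: int = 300) -> bool:
    """Receiver-side check: reject stale deliveries, then compare MACs."""
    if abs(time.time() - ts) > max_skew_s:
        return False  # stale or replayed delivery
    expected = sign_webhook(secret, body, ts)
    return hmac.compare_digest(expected, sig)  # constant-time comparison
```

`hmac.compare_digest` matters here: a naive `==` comparison leaks timing information that an attacker probing the receiver endpoint could exploit.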
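The cursor-based pagination decision can be illustrated with a keyset cursor over `(created_at, id)`. The encoding and filtering below are an in-memory sketch; in production the filter is a SQL `WHERE (created_at, id) < (:c, :i)` clause, and the opaque cursor format is an assumption:

```python
import base64
import json
from datetime import datetime, timezone

def encode_cursor(created_at: datetime, row_id: int) -> str:
    """Pack the keyset position into an opaque URL-safe token."""
    payload = {"c": created_at.isoformat(), "i": row_id}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(cursor: str) -> tuple:
    payload = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(payload["c"]), payload["i"]

def page(rows, cursor=None, limit=2):
    """Rows are (created_at, id) tuples sorted newest-first.

    SQL equivalent: WHERE (created_at, id) < (:c, :i)
                    ORDER BY created_at DESC, id DESC LIMIT :limit
    """
    if cursor:
        c, i = decode_cursor(cursor)
        rows = [r for r in rows if (r[0], r[1]) < (c, i)]
    out = rows[:limit]
    next_cur = encode_cursor(out[-1][0], out[-1][1]) if len(out) == limit else None
    return out, next_cur
```

Because the filter is an index-backed tuple comparison rather than `OFFSET n`, each page costs the same regardless of how deep into the 7-year retention window the client has scrolled.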
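The MC concurrency gate row describes a per-org semaphore with 429 + Retry-After semantics. An in-process stand-in for the Redis-backed version (the class name, default limit, and Retry-After value are illustrative; production must use Redis so the gate holds across backend replicas):

```python
class McConcurrencyGate:
    """Per-organisation concurrency gate.

    In-process sketch of the Redis semaphore described in the decision
    table; values here are illustrative, not SpaceCom's configuration.
    """

    def __init__(self, limit_per_org: int = 1, retry_after_s: int = 120):
        self.limit = limit_per_org
        self.retry_after_s = retry_after_s
        self._running: dict = {}

    def try_acquire(self, org_id: str):
        """Return (True, {}) if the MC run may start, else (False, headers)
        carrying the Retry-After value for a 429 response."""
        if self._running.get(org_id, 0) >= self.limit:
            return False, {"Retry-After": str(self.retry_after_s)}
        self._running[org_id] = self._running.get(org_id, 0) + 1
        return True, {}

    def release(self, org_id: str) -> None:
        """Called when the chord completes or fails."""
        self._running[org_id] = max(0, self._running.get(org_id, 0) - 1)
```

The API handler would call `try_acquire` before dispatching the Celery chord and `release` in the chord's completion callback, so abandoned runs cannot permanently exhaust an organisation's quota.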
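The WebSocket event schema row specifies a typed envelope with a monotonic `seq`, a 200-event replay ring buffer, and a `resync_required` signal on stale reconnect. A minimal sketch of that server-side mechanism (envelope fields follow the table; class and method names are assumptions):

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

@dataclass
class Envelope:
    type: str      # event discriminator, e.g. "alert.created" (illustrative)
    seq: int       # monotonic sequence number
    ts: str        # ISO-8601 emission time
    payload: dict

class EventBuffer:
    """Ring buffer backing ?since_seq= reconnect replay."""

    def __init__(self, maxlen: int = 200):
        self._buf = deque(maxlen=maxlen)  # oldest events fall off the left
        self._seq = count(1)

    def publish(self, type_: str, payload: dict) -> Envelope:
        env = Envelope(type_, next(self._seq),
                       datetime.now(timezone.utc).isoformat(), payload)
        self._buf.append(env)
        return env

    def replay_since(self, since_seq: int):
        """Return the events the client missed, or None to signal
        resync_required (the requested seq has aged out of the ring)."""
        if self._buf and since_seq < self._buf[0].seq - 1:
            return None
        return [e for e in self._buf if e.seq > since_seq]
```

On `None`, the server would send `resync_required` and the client would refetch current state over REST instead of trusting a gapped event stream.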

21. Definition of Done per Phase

Phase 1 Complete When:

Physics and data:

  • 100+ real objects tracked with current TLE data
  • Frame transformation unit tests pass against IERS/Vallado reference cases (round-trip error < 1 m)
  • SGP4 CZML uses J2000 INERTIAL frame (not TEME)
  • Space weather polled from NOAA SWPC; cross-validated against ESA SWS; operational status widget visible
  • TIP messages ingested and displayed for decaying objects
  • TLE cross-validation flags discrepancies > threshold for human review
  • IERS EOP hash verification passing
  • Decay predictor: ≥3 historical re-entry backcast windows overlap actual events
  • Mode A (Percentile Corridors): p05/p50/p95 swaths render with correct visual encoding
  • TimelineGantt displays all active events; click-to-navigate functional
  • LIVE/REPLAY/SIMULATION mode indicator correct on all pages

Security (all required before Phase 1 is considered complete):

  • RBAC enforced: automated test_rbac.py verifies every endpoint returns 403 for insufficient role, 401 for unauthenticated
  • JWT RS256 with httpOnly cookies; localStorage token storage absent from codebase (grep check in CI)
  • MFA (TOTP) enforced for all roles; recovery codes functional
  • Rate limiting: 429 responses verified by integration tests for all configured limits
  • Simulation parameter range validation: out-of-range values return 400 with clear message
  • Prediction HMAC: tamper test (direct DB row modification) triggers 503 + CRITICAL security_log entry
  • alert_events append-only trigger: UPDATE/DELETE raise exception (verified by test)
  • reentry_predictions immutability trigger: same (verified by test)
  • Redis AUTH enabled; default user disabled; ACL per service verified
  • MinIO: all buckets verified private; direct object URL returns 403; pre-signed URL required
  • Docker: all containers verified non-root (docker inspect check in CI)
  • Docker: network segmentation verified — frontend container cannot reach database port
  • Bandit: 0 High severity findings in CI
  • ESLint security: 0 High findings in CI
  • Trivy: 0 Critical/High CVEs in all container images
  • CSP headers present on all pages; verified by Playwright E2E test
  • axe-core: 0 critical, 0 serious violations on all pages (CI check)
  • WCAG 2.1 AA colour contrast: automated check passes

UX:

  • Globe: object clustering active at global zoom; urgency symbols correct (colour-blind-safe)
  • DataConfidenceBadge visible on all object detail and prediction panels
  • UncertaintyModeSelector visible; Mode B/C greyed with "Phase 2/3" label
  • JobsPanel shows live sample progress for running decay jobs
  • Shared deep links work: /events/{id} loads correct event; globe focuses on corridor
  • All pages keyboard-navigable; modal focus trap verified
  • Report generation: Operational Briefing type functional; PDF includes globe corridor map

Human Factors (Phase 1 items — all required before Phase 1 is considered complete):

  • Event cards display window range notation (Window: Xh–Yh from now / Most likely ~Zh from now); no ± notation appears in operational-facing UI (grep check)
  • Mode-switch dialogue: switching to SIMULATION shows current mode, target mode, and "alerts suppressed" consequence; Cancel left, Switch right; Playwright E2E test verifies dialogue content
  • Future-preview temporal wash: dragging timeline scrubber past current time applies overlay and PREVIEWING +Xh label to event panel; alert badges show "(projected)"; verified by Playwright test
  • CRITICAL acknowledgement: two-step flow (banner → confirmation modal); Confirm button disabled until Action taken field ≥ 10 characters; verified by Playwright test
  • Audio alert: non-looping two-tone chime plays once on CRITICAL alert; stops on acknowledgement; does not play in SIMULATION or REPLAY mode; verified by integration test with audio mock
  • Alert storm meta-alert: > 5 CRITICAL alerts within 1 hour generates Persona D meta-alert with disambiguation prompt (verified by test with synthetic alerts)
  • Onboarding state: new organisation with no FIRs configured sees three-card setup prompt on first login (Playwright test)
  • Degraded mode banner: /readyz 207 response triggers correct per-degradation-type operational guidance text in UI (integration test for each degradation type: space weather stale, TLE stale)
  • superseded_by constraint: setting superseded_by on a prediction a second time raises DB exception (integration test); UI shows ⚠ Superseded banner on any prediction where superseded_by IS NOT NULL
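
The alert-storm meta-alert condition above reduces to a trailing-window count. A minimal sketch, assuming alert timestamps are already available in memory (the real check would query `alert_events`):

```python
from datetime import datetime, timedelta

STORM_THRESHOLD = 5            # meta-alert fires on MORE than 5 CRITICAL alerts
STORM_WINDOW = timedelta(hours=1)


def is_alert_storm(critical_alert_times: list[datetime], now: datetime) -> bool:
    """True when more than STORM_THRESHOLD CRITICAL alerts fall inside the trailing window."""
    recent = [t for t in critical_alert_times if now - STORM_WINDOW <= t <= now]
    return len(recent) > STORM_THRESHOLD


base = datetime(2026, 1, 1, 12, 0)
times = [base + timedelta(minutes=8 * i) for i in range(6)]  # 6 alerts in 40 minutes
assert is_alert_storm(times, base + timedelta(minutes=45))       # storm: 6 > 5 in window
assert not is_alert_storm(times[:3], base + timedelta(minutes=45))  # 3 alerts: no storm
```

Note the strict `>` matches the "> 5 CRITICAL alerts within 1 hour" wording: exactly five alerts does not trigger the Persona D meta-alert.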

Legal / Compliance (Phase 1 items — all required before Phase 1 is considered complete):

  • Space-Track AUP architectural decision gate (Finding 9): Written AUP clarification obtained from 18th Space Control Squadron or legal counsel opinion. docs/adr/0016-space-track-aup-architecture.md committed with Path A (shared ingest) or Path B (per-org credentials) decision recorded and evidenced. Ingest architecture finalised accordingly. This is a blocking Phase 1 decision — ingest code must not be written until the path is decided.
  • ToS / AUP / Privacy Notice acceptance gate: first login blocks dashboard access until all three documents are accepted; users.tos_accepted_at, users.tos_version, users.tos_accepted_ip populated on acceptance (integration test: unauthenticated attempt to skip returns 403)
  • ToS version change triggers re-acceptance: bump tos_version in config; verify existing users are blocked on next login until they re-accept (integration test)
  • CesiumJS commercial licence executed and stored at legal/LICENCES/cesium-commercial.pdf; legal_clearances.cesium_commercial_executed = TRUE; blocking gate for any external demo (§29.11 F1)
  • SBOM generated at build time via syft (SPDX-JSON, container image) + pip-licenses + license-checker-rseidelsohn (dependency manifests); stored in docs/compliance/sbom/ as versioned artefacts; all dependency licences reviewed against legal/OSS_LICENCE_REGISTER.md; CI pip-licenses --fail-on gate includes GPL/AGPL/SSPL; no unapproved licence in transitive closure (§29.11 F2, F10)
  • legal/LGPL_COMPLIANCE.md created documenting poliastro LGPL dynamic linking compliance and PostGIS GPLv2 linking exception (§29.11 F4, F9)
  • legal/LICENCES/timescaledb-licence-assessment.md and legal/LICENCES/redis-sspl-assessment.md created with licence assessment sign-off (§29.11 F5, F6)
  • legal_opinions table present in schema; admin UI shows legal clearance status per org; shadow mode toggle displays warning if shadow_mode_cleared = FALSE
  • GDPR breach notification procedure documented in the incident response runbook; tabletop exercise completed with the engineering team

Infrastructure / DevOps (all required before Phase 1 is considered complete):

  • Docker Compose starts full stack with single command (make dev)
  • make test executes pytest + vitest in one command; all tests pass on a clean clone
  • make migrate runs all Alembic migrations against a fresh DB without error
  • make seed loads fixture data; globe shows test objects on first load
  • .env.example present with all required variables documented; a new engineer can reach a working local stack in ≤ 15 minutes
  • Multi-stage Dockerfiles in place for backend, worker, renderer, and frontend: builder stage uses full toolchain; runtime stage is distroless/slim; docker inspect confirms no build tools (gcc, pip, npm) present in runtime image
  • All containers run as non-root UID (baked in Dockerfile USER directive — not set at runtime); verified by docker inspect check in CI
  • Self-hosted GitLab CI pipeline exists with jobs: lint (pre-commit all hooks), test-backend (pytest), test-frontend (vitest + Playwright), security-scan (Bandit + Trivy + ESLint security), build-and-push (multi-stage build -> GitLab container registry with sha-<commit> tag)
  • .pre-commit-config.yaml committed with all six hooks; CI re-runs all hooks and fails if any fail
  • alembic check step in CI fails if SQLAlchemy models have unapplied changes
  • Build cache: Docker layer cache, pip wheel cache, npm cache all configured in GitLab CI; incremental push CI time < 4 minutes
  • pytest suite: frame utils, integrity, auth, RBAC, propagator, decay, space weather, ingest, API integration
  • Playwright E2E: mode switch, alert acknowledge, CZML render, job progress, report generation, CSP headers
  • Port exposure CI check: scripts/check_ports.py passes with no never-exposed port in a ports: mapping
  • Caddy TLS active on local dev stack with self-signed cert or ACME staging cert; HSTS header present (Strict-Transport-Security: max-age=63072000); TLS 1.1 and below not offered (verified by nmap --script ssl-enum-ciphers)
  • docs/runbooks/egress-filtering.md exists documenting the allowed outbound destination whitelist; implementation method (UFW/nftables) noted
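
The port exposure CI check can be sketched like this. The never-exposed port set and the compose structure are illustrative assumptions; the real `scripts/check_ports.py` would parse the YAML file (e.g. with PyYAML) rather than take a pre-parsed dict.

```python
NEVER_EXPOSED = {5432, 6379, 9000}  # hypothetical internal-only set: DB, Redis, MinIO


def find_violations(compose: dict) -> list[str]:
    """Return 'service:port' for every never-exposed container port found in a ports: mapping."""
    violations = []
    for name, svc in compose.get("services", {}).items():
        for mapping in svc.get("ports", []):
            # mapping may be "host:container" or "host:container/proto"
            container_port = int(str(mapping).split(":")[-1].split("/")[0])
            if container_port in NEVER_EXPOSED:
                violations.append(f"{name}:{container_port}")
    return violations


compose = {
    "services": {
        "backend": {"ports": ["8000:8000"]},
        "timescaledb": {"ports": ["5432:5432"]},  # would fail CI
        "worker-sim": {},                          # no host exposure: fine
    }
}
assert find_violations(compose) == ["timescaledb:5432"]
```

The design point is that internal services should be reachable only over the compose network; any `ports:` mapping for them is a regression the pipeline catches mechanically.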

Performance / Database (Phase 1 items — all required before Phase 1 is considered complete):

  • pgBouncer in Docker Compose; all app services connect via pgBouncer (not directly to TimescaleDB); verified by netstat or connection-source query showing only pgBouncer IPs in pg_stat_activity
  • All required indexes present: orbits_object_epoch_idx, reentry_pred_object_created_idx, alert_events_unacked_idx, reentry_pred_corridor_gist, hazard_zones_polygon_gist, fragments_impact_gist, tle_sets_object_ingested_idx — verified by \d+ or pg_indexes query
  • orbits hypertable chunk interval set to 1 day; space_weather to 30 days; tle_sets to 7 days — verified by timescaledb_information.chunks
  • space_weather_daily continuous aggregate created and policy active; Space Weather Widget backend query reads from the aggregate (verified by EXPLAIN showing space_weather_daily in plan, not raw space_weather)
  • Autovacuum settings applied to alert_events, security_logs, reentry_predictions — verified via pg_class reloptions
  • lazy="raise" set on all SQLAlchemy relationships; test suite passes with no MissingGreenlet or InvalidRequestError exceptions (test suite itself verifies this by accessing relationships without explicit loading — should raise)
  • Redis Celery broker DB index (SELECT 0) has maxmemory-policy noeviction; application cache DB index (SELECT 1) has allkeys-lru — verified by CONFIG GET maxmemory-policy on each DB
  • CZML catalog endpoint: EXPLAIN (ANALYZE, BUFFERS) output recorded in docs/query-baselines/czml_catalog_100obj.txt; p95 response time < 2s verified by load test with 10 concurrent users
  • CZML delta endpoint (?since=) functional: integration test verifies delta response contains only changed objects; X-CZML-Full-Required: true returned when client timestamp > 30 min old
  • Compression policies applied with correct compress_after intervals (see §9.4 table): orbits = 7 days, adsb_states = 14 days, space_weather = 60 days, tle_sets = 14 days — verified by timescaledb_information.jobs
  • Cursor-based pagination: integration test on /reentry/predictions with 200+ rows confirms next_cursor present and second page returns non-overlapping rows; limit=201 returns 400
  • MC concurrency gate: integration test submits two concurrent POST /decay/predict requests from the same organisation; second request returns HTTP 429 with Retry-After header while first is running; first completes normally
  • Renderer Docker memory limit set to 4 GB in docker-compose.yml; docker inspect confirms HostConfig.Memory = 4294967296
  • Bulk export endpoint: integration test with 10,000-row dataset confirms response is a task ID + status URL, not an inline response body
  • tests/load/ directory exists with at least a k6 or Locust scenario for the CZML catalog endpoint; docs/test-plan.md load test section specifies scenario, ramp shape, and SLO assertion
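
The cursor-pagination gate above implies an opaque, order-preserving cursor over a (created_at, id) keyset plus a hard limit cap. A minimal sketch, assuming those two sort columns (the real endpoint may use different keys):

```python
import base64
import json

MAX_LIMIT = 200  # limit=201 must return 400


def encode_cursor(created_at: str, row_id: int) -> str:
    """Opaque cursor over the keyset (created_at, id)."""
    return base64.urlsafe_b64encode(json.dumps([created_at, row_id]).encode()).decode()


def decode_cursor(cursor: str) -> tuple[str, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, row_id


def validate_limit(limit: int) -> int:
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError("400: limit must be between 1 and 200")
    return limit


cursor = encode_cursor("2026-04-17T20:31:37Z", 4821)
assert decode_cursor(cursor) == ("2026-04-17T20:31:37Z", 4821)
```

Keyset pagination (WHERE (created_at, id) < (cursor values) ORDER BY created_at DESC, id DESC) is what makes the "second page returns non-overlapping rows" assertion hold even when new predictions are inserted between page fetches, which OFFSET pagination cannot guarantee.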

Technical Writing / Documentation (Phase 1 items — all required before Phase 1 is considered complete):

  • docs/ directory tree created and committed matching the structure in §12.1; all referenced documentation paths exist (even if files are stubs with "TODO" content)
  • AGENTS.md committed to repo root; contains codebase conventions, test requirements, and safety-critical file restrictions (see §33.9)
  • docs/adr/ contains minimum 5 ADRs for the most consequential Phase 1 decisions: JWT algorithm choice, dual frontend architecture, Monte Carlo chord pattern, frame library choice, TimescaleDB chunk intervals
  • docs/runbooks/TEMPLATE.md committed; docs/runbooks/README.md index lists all required runbooks with owner field; at least db-failover.md, ingest-failure.md, and hmac-failure.md are complete (not stubs)
  • docs/validation/README.md documents how to run each validation suite and where reference data files live; docs/validation/reference-data/ contains Vallado SGP4 cases and IERS frame test cases
  • CHANGELOG.md exists at repo root in Keep a Changelog format; first entry records Phase 1 initial release
  • docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off (required by §24.8)
  • OpenAPI docs: CI check confirms no public endpoint has an empty description field; spot-check 5 endpoints in code review to verify summary and at least one responses example

Ethics / Algorithmic Accountability (Phase 1 items — all required before Phase 1 is considered complete):

  • ood_flag and ood_reason populated at prediction time: integration test with an object whose data_confidence = 'unknown' and no DISCOS physical properties confirms ood_flag = TRUE and ood_reason contains 'low_data_confidence'; prediction is served but UI shows mandatory warning callout above the prediction panel
  • prediction_valid_until field present: verify it equals p50_reentry_time - 4h for a test prediction; UI shows staleness warning when NOW() > prediction_valid_until and prediction is not superseded (Playwright test simulates time travel)
  • alert_events.fir_intersection_km2 and intersection_percentile recorded: synthetic CRITICAL alert with known corridor area confirms both fields populated; UI renders "p95 corridor intersects ~N km² of FIR XXXX" (Playwright test)
  • Alert threshold values documented: docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off
  • prediction_outcomes table exists in schema; POST /api/v1/predictions/{id}/outcome endpoint (requires analyst role) accepts observed re-entry time and source (integration test: unauthenticated attempt returns 401)

Interoperability (Phase 1 items — all required before Phase 1 is considered complete):

  • TLE checksum validation: integration test sends a TLE with deliberately corrupted checksum; verify it is rejected and logged to security_logs type INGEST_VALIDATION_FAILURE; valid TLE with same content but correct checksum is accepted
  • Space weather format contract test: CI integration test against mocked NOAA SWPC response asserts (a) expected top-level JSON keys present (time_tag, flux / kp_index); (b) F10.7 values in physical range 50–350 sfu; (c) Kp values in range 0–90 (NOAA integer format); test is @pytest.mark.contract and runs against mocks in standard CI, against live API in nightly sandbox job
  • Space-Track contract test: integration test against mocked Space-Track response asserts (a) expected JSON keys present for TLE and CDM queries; (b) B* values trigger warning when outside [-0.5, 0.5]; (c) epoch field parseable as ISO-8601; spacecom_ingest_success_total{source="spacetrack"} Prometheus metric > 0 after a live ingest cycle (nightly sandbox only)
  • FIR boundary data loaded: airspace table populated with FIR/UIR polygons for at least the test ANSP region; source documented in ingest/sources.py; AIRAC update date recorded in airspace_metadata table
  • WebSocket event schema: WS /ws/events delivers typed event envelopes; integration test sends a synthetic alert.new event and verifies the client receives {"type": "alert.new", "seq": <n>, "data": {...}}; reconnect with ?since_seq=<n> replays missed event
  • API versioning headers: all API endpoints return Content-Type: application/vnd.spacecom.v1+json; deprecated endpoints (if any) return Deprecation: true and Sunset: <date> headers (verified by Playwright E2E check)
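
The TLE checksum rule referenced above is the standard modulo-10 scheme: sum every digit in the first 68 columns, count each minus sign as 1, and the result modulo 10 must equal the 69th character. A self-contained sketch:

```python
def tle_checksum_valid(line: str) -> bool:
    """TLE checksum: digit sum plus 1 per minus sign over columns 1-68,
    modulo 10, must equal the final (69th) character."""
    body, check = line[:68], line[68]
    total = sum(int(c) for c in body if c.isdigit()) + body.count("-")
    return total % 10 == int(check)


# Widely circulated historical ISS TLE line 1 (checksum digit 7)
line1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"
assert tle_checksum_valid(line1)
assert not tle_checksum_valid(line1[:-1] + "8")  # corrupted checksum rejected
```

In the ingest path, a rejection here should also write the security_logs entry of type INGEST_VALIDATION_FAILURE that the integration test checks for.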

SRE / Reliability (all required before Phase 1 is considered complete):

  • Health probes: /healthz returns 200 on all services; /readyz returns 200 (healthy) or 207 (degraded) as appropriate; Docker Compose depends_on: condition: service_healthy wired for all service dependencies
  • Celery queue routing: integration test confirms ingest.* tasks appear only on ingest queue and propagator.* tasks appear only on simulation queue; no cross-queue contamination possible
  • celery-redbeat schedule persistence: Beat process restart test verifies scheduled jobs survive without duplicate scheduling; Redis key redbeat:* present after restart
  • Crash-safety: kill a worker-sim container mid-task; verify task is requeued (not lost) on worker restart; task_acks_late = True and task_reject_on_worker_lost = True confirmed by log inspection
  • Dead letter queue: a task that exhausts all retries appears in the DLQ; DLQ depth metric visible in Prometheus
  • WAL archiving: pg_basebackup and WAL segments appearing in MinIO db-wal-archive bucket within 10 minutes of first write (verified by bucket list)
  • Daily backup Celery task: backup_database task appears in Celery Beat schedule; execution logged in celery-beat.log; resulting archive object visible in MinIO db-backups bucket
  • TimescaleDB compression policy: orbits compression policy applied; timescaledb_information.jobs shows policy active; manual CALL run_job() compresses at least one chunk
  • Prometheus metrics: spacecom_active_tip_events, spacecom_tle_age_hours, spacecom_hmac_verification_failures_total, spacecom_celery_queue_depth all visible in Prometheus UI with correct labels
  • MC chord distribution: run_mc_decay_prediction fans out 500 sub-tasks; Celery Flower shows sub-tasks distributed across both worker-sim instances (not all on one worker)
  • MC p95 latency SLO: 500-sample MC run completes in < 240s on Tier 1 dev hardware (8 vCPU/32 GB) under load test; documented baseline recorded for Tier 2 comparison
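
The 200-healthy / 207-degraded readiness distinction above can be sketched as a per-feed staleness check. The threshold values are assumptions for illustration; the real limits belong in service configuration.

```python
from datetime import datetime, timedelta

# Hypothetical staleness thresholds; real values live in service config
THRESHOLDS = {"space_weather": timedelta(hours=6), "tle": timedelta(hours=24)}


def readyz(last_ingest: dict[str, datetime], now: datetime) -> tuple[int, list[str]]:
    """200 when every feed is fresh; 207 plus the degradation list otherwise.

    The degradation list is what drives the per-degradation-type guidance
    text in the UI banner.
    """
    degraded = [feed for feed, limit in THRESHOLDS.items()
                if now - last_ingest[feed] > limit]
    return (207, degraded) if degraded else (200, [])


now = datetime(2026, 1, 1, 12, 0)
fresh = {"space_weather": now - timedelta(hours=1), "tle": now - timedelta(hours=2)}
assert readyz(fresh, now) == (200, [])

stale = dict(fresh, space_weather=now - timedelta(hours=9))
assert readyz(stale, now) == (207, ["space_weather"])
```

Returning the list of degraded feeds, not just the status code, is what lets the aviation portal show "space weather stale" guidance rather than a generic degraded banner.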

Phase 2 Complete When:

  • Atmospheric breakup: fragments, casualty areas, fragment globe display
  • Mode B (Probability Heatmap): Deck.gl layer renders; hover tooltip shows probability
  • Conjunction screening: known close approaches identified; Pc computed for ≥1 test case
  • 2D Plan View: FIR boundaries, horizontal corridor projection, altitude cross-section
  • Airspace intersection table: affected FIRs with entry/exit times on Event Detail
  • Hazard zones: HMAC-signed and immutability trigger verified
  • PDF reports: Technical Assessment and Regulatory Submission types functional
  • Renderer container: network_mode: none enforced; sanitisation tests passing; 30s timeout verified
  • OWASP ZAP DAST: 0 High/Critical findings against staging environment
  • RLS multi-tenancy: Org A user cannot access Org B records (integration test)
  • SimulationComparison: two runs overlaid on globe with distinct colours

Phase 2 SRE / Reliability:

  • Monthly restore test: restore_test Celery task executes on schedule; restores latest backup to isolated db-restore-test container; row count reconciliation passes; result logged to security_logs (type RESTORE_TEST)
  • TimescaleDB retention policy: 90-day drop policy active on orbits and space_weather; manual chunk drop test in staging confirms chunks older than 90 days are removed without affecting newer data
  • Archival pipeline: Parquet export Celery task runs before chunk drop; resulting .parquet files visible in MinIO db-archive bucket; spot-check query against archived Parquet returns expected rows
  • Degraded mode UI: stop space weather ingest; confirm /readyz returns 207; confirm StalenessWarningBanner appears in aviation portal within one polling cycle (≤ 60s); restart ingest; confirm banner clears
  • Error budget dashboard: Grafana SRE Error Budgets dashboard shows Phase 2 SLO burn rates for prediction latency and data freshness; alert fires in Prometheus when burn rate exceeds 2× for > 1 hour
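
The burn-rate alert above follows the standard SRE definition: observed error fraction divided by the allowed error fraction, where 1.0 means the budget is consumed exactly on schedule. A minimal sketch (the SLO targets here are illustrative):

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Burn rate = observed error fraction / allowed error fraction.

    1.0 consumes the error budget exactly at the rate the SLO permits;
    2.0 exhausts it in half the window.
    """
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Hypothetical SLO: 99% of predictions within the latency target -> 1% budget
assert abs(burn_rate(10, 1000, 0.99) - 1.0) < 1e-9   # on budget
assert burn_rate(25, 1000, 0.99) > 2.0               # alert if sustained > 1 hour
```

The "2× for > 1 hour" condition in the checklist is a duration qualifier on exactly this quantity; in Prometheus it would be expressed as a `for: 1h` clause on the burn-rate expression.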

Phase 2 Human Factors:

  • Corridor Evolution widget: Event Detail page shows p50 corridor footprint at T+0h/+2h/+4h; auto-updates in LIVE mode; an amber warning appears if the corridor is widening
  • Duty Manager View: toggle on Event Detail collapses to large-text window/FIR/action-buttons only; toggles back to technical detail
  • Response Options accordion: contextualised action checklist visible to operator+ role; checkbox states and coordination notes persisted to alert_events
  • Multi-ANSP Coordination Panel: visible on events where ≥2 registered organisations share affected FIRs; acknowledgement status and coordination notes from each ANSP visible; integration test confirms Org A cannot see Org B coordination notes on unrelated events
  • Simulation block: disable_simulation_during_active_events org setting functional; mode switch blocked with correct modal when unacknowledged CRITICAL alerts exist (integration test)
  • Space weather buffer recommendation: Event Detail shows [95th pct time + buffer] callout when conditions are Elevated or above; buffer computed by backend from F10.7/Kp thresholds (integration test verifies all four threshold bands)
  • Secondary Display Mode: ?display=secondary URL opens chrome-free full-screen operational view; navigation, admin links, and simulation controls not present; CRITICAL banners still appear (Playwright test)
  • Mode C first-use overlay: MC particle animation blocked until user acknowledges one-time explanation overlay; preference stored in user record; never shown again after first acknowledgement
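
The space weather buffer item above maps F10.7/Kp readings to one of four bands, each carrying a buffer added to the 95th-percentile time. The band names, boundaries, and buffer sizes below are hypothetical placeholders; only the backend thresholds referenced in the checklist are authoritative.

```python
# Hypothetical bands: (label, buffer hours added to the 95th-percentile time)
BANDS = [
    ("Quiet",    0.0),
    ("Elevated", 0.5),
    ("High",     1.0),
    ("Severe",   2.0),
]


def buffer_hours(f107: float, kp: float) -> tuple[str, float]:
    """Classify conditions by whichever index (solar flux or geomagnetic) is worse."""
    if f107 >= 200 or kp >= 7:
        idx = 3
    elif f107 >= 160 or kp >= 5:
        idx = 2
    elif f107 >= 120 or kp >= 4:
        idx = 1
    else:
        idx = 0
    return BANDS[idx]


assert buffer_hours(90, 2) == ("Quiet", 0.0)       # no callout shown
assert buffer_hours(130, 3) == ("Elevated", 0.5)   # callout threshold
assert buffer_hours(150, 6) == ("High", 1.0)       # Kp drives the band here
assert buffer_hours(210, 8) == ("Severe", 2.0)
```

Taking the worse of the two indices matters because elevated Kp alone (a geomagnetic storm at moderate solar flux) still inflates drag uncertainty; that is the case the third assertion exercises, and the integration test in the checklist must cover all four bands.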

Phase 2 Performance / Database:

  • FIR intersection query: EXPLAIN (ANALYZE) confirms bounding-box pre-filter (&&) eliminates > 90% of airspace rows before exact ST_Intersects; p95 intersection query time < 200ms with full airspace table loaded
  • Analytics query routing: Persona B/F workspace queries confirmed routing to replica engine via pg_stat_activity source host check; replication lag monitored in Grafana (alert if > 30s)
  • Query plan regression: re-run EXPLAIN (ANALYZE, BUFFERS) on CZML catalog query; compare to Phase 1 baseline in docs/query-baselines/; planning time and execution time increase < 2× (if exceeded, investigate before Phase 3 load test)
  • Hypertable migration: at least one migration involving orbits executed using CREATE INDEX CONCURRENTLY; CI migration timeout gate in place (> 30s fails CI)
  • Query plan regression CI job active: tests/load/check_query_baselines.py runs after each migration in staging; fails if any baseline query execution time increases > 2× vs recorded baseline; PR comment generated with comparison table
  • ws_connected_clients Prometheus gauge reporting per backend instance; Grafana alert configured at 400 (WARNING) — verified by injecting 5 synthetic WebSocket connections and confirming gauge increments
  • Space weather backfill cap: integration test simulates 24-hour ingest gap; verify ingest task logs WARN and backfills only last 6 hours; no duplicate timestamps written; space_weather_daily aggregate remains consistent
  • CDN / static asset caching: bundle-size CI step active; PR comment shows bundle size delta; CI fails if main JS bundle grows > 10% vs. previous build; Caddy cache headers for /_next/static/* set Cache-Control: public, max-age=31536000, immutable
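
The backfill-cap behaviour above is a simple window clamp: if the ingest gap exceeds six hours, backfill only the trailing six hours and log a WARN. A minimal sketch (function and field names are illustrative):

```python
from datetime import datetime, timedelta

BACKFILL_CAP = timedelta(hours=6)


def backfill_window(last_sample: datetime, now: datetime) -> tuple[datetime, bool]:
    """Return (backfill start, gap_truncated).

    Gaps longer than the cap are truncated to the trailing 6 hours; the
    caller should emit a WARN log when gap_truncated is True.
    """
    gap = now - last_sample
    if gap > BACKFILL_CAP:
        return now - BACKFILL_CAP, True
    return last_sample, False


now = datetime(2026, 1, 2, 0, 0)
start, truncated = backfill_window(now - timedelta(hours=24), now)
assert truncated and start == now - timedelta(hours=6)

start, truncated = backfill_window(now - timedelta(hours=2), now)
assert not truncated and start == now - timedelta(hours=2)
```

Writing backfilled rows idempotently (e.g. via ON CONFLICT on the timestamp key) is what keeps the "no duplicate timestamps" and continuous-aggregate-consistency assertions true when the task retries.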

Phase 2 Legal / Compliance:

  • Regulatory classification ADR committed: docs/adr/0012-regulatory-classification.md documents the chosen position (Position A — ATM/ANS Support Tool, non-safety-critical) with rationale; legal counsel has reviewed the position against EASA IR 2017/373; position is referenced in all ANSP service contracts
  • Legal opinion received for primary deployment jurisdiction; legal_opinions table updated with shadow_mode_cleared = TRUE; shadow mode admin toggle no longer shows legal warning for that jurisdiction
  • Space-Track AUP redistribution clarification obtained (written); legal position documented; AUP click-wrap wording updated to reflect agreed terms
  • ESA DISCOS redistribution rights clarified (written): Written confirmation from ESA/ESAC on permissible use of DISCOS-derived properties in commercial API responses and generated reports; if redistribution is not permitted, API response and report templates updated to show source: estimated rather than raw DISCOS values
  • GDPR DPA signed with each shadow ANSP partner before shadow mode begins: DPA template reviewed by counsel; executed DPA on file for each organisation before shadow_mode_cleared is set to TRUE; data processing not permitted for any ANSP organisation without a signed DPA
  • GDPR data inventory documented; pseudonymisation procedure handle_erasure_request() implemented and tested: user deleted → name/email replaced with [user deleted - ID:{hash}] in alert_events/security_logs; core safety records preserved
  • Jurisdiction screening at user registration: sanctioned-country check fires before account creation; blocked attempt logged to security_logs type REGISTRATION_BLOCKED_SANCTIONS
  • MSA template reviewed by aviation law counsel; Regulatory Sandbox Agreement template finalised; first shadow mode deployment covered by a signed Regulatory Sandbox Agreement on file
  • Controlled Re-entry Planner carries in-platform export control notice; data_source_acknowledgement = TRUE enforced before API key issuance (integration test: attempt to create API key without acknowledgement returns 403)
  • Professional indemnity, cyber liability, and product liability insurance confirmed in place before first shadow deployment; certificates stored in MinIO legal-docs bucket
  • Shadow mode exit criteria documented and tooled: docs/templates/shadow-mode-exit-report.md exists; Persona B can generate exit statistics from admin panel; exit to operational use for any ANSP requires written Safety Department confirmation on file before shadow_mode_cleared is set

Phase 2 Technical Writing / Documentation:

  • docs/user-guides/aviation-portal-guide.md complete and reviewed by at least one Persona A representative before first ANSP shadow deployment; covers: dashboard overview, alert acknowledgement workflow, NOTAM draft workflow, degraded mode response
  • docs/api-guide/ complete: authentication.md, rate-limiting.md, webhooks.md, error-reference.md, Python and TypeScript quickstart examples; reviewed by a Persona E/F tester
  • All public functions in propagator/decay.py, propagator/catalog.py, reentry/corridor.py, integrity.py, and breakup/atmospheric.py have Google-style docstrings with parameter units; mypy pre-commit hook enforces no untyped function signatures
  • docs/test-plan.md complete: lists all test suites, physical invariant tested, reference source, pass/fail tolerance, and blocking classification; reviewed by physics lead
  • docs/adr/ contains ≥ 10 ADRs covering all consequential Phase 2 decisions added during the phase
  • All runbooks referenced in the §21 DoD are complete (not stubs): gdpr-breach-notification.md, safety-occurrence-notification.md, secrets-rotation-jwt.md, blue-green-deploy.md, restore-from-backup.md

Phase 2 Ethics / Algorithmic Accountability:

  • Model card published: docs/model-card-decay-predictor.md complete with validated orbital regime envelope, object type performance breakdown, known failure modes, and systematic biases; reviewed by the physics lead before Phase 2 ANSP shadow deployments
  • Backcast validation report: ≥10 historical re-entry events validated; report documents selection criteria, identifies underrepresented object categories (small debris, tumbling objects), and states accuracy conditional on object type — not as a single unconditional figure; stored in MinIO docs bucket
  • Out-of-distribution bounds defined: docs/ood-bounds.md specifies the threshold values for ood_flag triggers (area-to-mass ratio, minimum data confidence, minimum TLE count); CI test confirms all thresholds are checked in propagator/decay.py
  • Alert threshold governance: any threshold change requires a PR reviewed by engineering lead + product owner; docs/alert-threshold-history.md entry created; change must complete a minimum 2-week shadow-mode validation period before deploying to any operational ANSP connection
  • FIR coverage quality flag: airspace table has data_source and coverage_quality columns; intersection results for OpenAIP-sourced FIRs include a coverage_quality: 'low' flag in the API response; UI shows a coverage quality callout for non-AIRAC FIRs
  • Recalibration governance documented: docs/recalibration-procedure.md exists specifying hold-out validation dataset, minimum accuracy improvement threshold (> 5% improvement on hold-out, no regression on any object type category), sign-off authority (physics lead + engineering lead), ANSP notification procedure
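
The recalibration acceptance criterion above combines two independent gates: an overall improvement threshold and a no-regression check per object-type category. A minimal sketch, assuming "> 5%" means relative improvement on the hold-out set (the procedure document would pin this down) and hypothetical category names:

```python
def recalibration_accepted(old_acc: dict[str, float], new_acc: dict[str, float],
                           overall_old: float, overall_new: float) -> bool:
    """Accept only if overall hold-out accuracy improves by > 5% (relative)
    AND no object-type category regresses."""
    improved = (overall_new - overall_old) / overall_old > 0.05
    no_regression = all(new_acc[cat] >= old_acc[cat] for cat in old_acc)
    return improved and no_regression


old = {"payload": 0.80, "rocket_body": 0.75, "debris": 0.60}
new = {"payload": 0.85, "rocket_body": 0.80, "debris": 0.66}
assert recalibration_accepted(old, new, overall_old=0.72, overall_new=0.77)

new_bad = dict(new, debris=0.55)  # one category regresses: release blocked
assert not recalibration_accepted(old, new_bad, overall_old=0.72, overall_new=0.77)
```

The per-category gate is the ethically important one: a model can improve its headline accuracy while getting worse on exactly the underrepresented categories (small debris, tumbling objects) the backcast report flags.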

Phase 2 Interoperability:

  • CCSDS OEM response: GET /space/objects/{norad_id}/ephemeris with Accept: application/ccsds-oem returns a valid CCSDS 502.0-B-3 OEM file; integration test validates all mandatory keyword fields (OBJECT_ID, CENTER_NAME, REF_FRAME=GCRF, TIME_SYSTEM=UTC, START_TIME, STOP_TIME) are present; test parses with a reference CCSDS OEM parser
  • CCSDS CDM export: bulk export includes CDM-format conjunction records; mandatory CDM fields populated; N/A used per CCSDS 508.0-B-1 §4.3 for unknown values; integration test validates with reference CDM parser
  • CDM ingestion display: Space-Track CDM Pc and SpaceCom-computed Pc both visible on conjunction panel with distinct provenance labels; DATA_CONFIDENCE warning fires when values differ by > 10× (integration test with synthetic divergent CDM)
  • Alert webhook: POST /webhooks registers endpoint; synthetic alert.new event POSTed to registered URL within 5s of trigger; X-SpaceCom-Signature header present and verifiable with shared secret; retry fires on 500 response from webhook receiver (integration test with mock server)
  • GeoJSON structured export: GET /events/{id}/export?format=geojson returns valid GeoJSON FeatureCollection; properties includes norad_id, p50_utc, affected_fir_ids, risk_level, prediction_hmac; validates against GeoJSON schema (RFC 7946)
  • ADS-B feed: OpenSky Network integration active; live flight positions overlay on globe in aviation portal; route intersection advisory receives ADS-B flight tracks as input
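
The OEM mandatory-keyword assertion above can be illustrated with a toy keyword scan. This is deliberately not a CCSDS parser (the checklist requires validating with a reference parser); it only shows the shape of the mandatory-field check, and the sample metadata values are invented.

```python
MANDATORY = {"OBJECT_ID", "CENTER_NAME", "REF_FRAME", "TIME_SYSTEM",
             "START_TIME", "STOP_TIME"}


def missing_oem_keywords(oem_text: str) -> set[str]:
    """Return mandatory metadata keywords absent from an OEM file body."""
    present = {line.split("=")[0].strip()
               for line in oem_text.splitlines() if "=" in line}
    return MANDATORY - present


sample = """CCSDS_OEM_VERS = 3.0
META_START
OBJECT_NAME = TEST SAT
OBJECT_ID = 1998-067A
CENTER_NAME = EARTH
REF_FRAME = GCRF
TIME_SYSTEM = UTC
START_TIME = 2026-01-01T00:00:00
STOP_TIME = 2026-01-02T00:00:00
META_STOP"""

assert missing_oem_keywords(sample) == set()
assert missing_oem_keywords(sample.replace("REF_FRAME = GCRF\n", "")) == {"REF_FRAME"}
```

A keyword scan like this is useful as a cheap pre-check in the serialiser's own unit tests; the contract-level guarantee still comes from round-tripping through an independent CCSDS implementation.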

Phase 2 DevOps / Platform Engineering:

  • Staging environment spec documented: resources, data (synthetic only — no production data in staging), secrets set (separate from production), continuous deployment from main branch
  • GitLab staging deploy job: merge to main triggers automatic staging deploy; production deploy requires manual approval in GitLab after staging smoke tests pass
  • OWASP ZAP DAST run against staging in CI pipeline; results reviewed; 0 High/Critical required to unblock production deploy approval
  • Secrets rotation runbooks written for all critical secrets: Space-Track credentials, JWT RS256 signing keypair, MinIO access keys, Redis AUTH password; each runbook includes: who initiates, affected services, zero-downtime rotation procedure, verification step, security_logs entry required
  • JWT RS256 keypair rotation tested without downtime: old public key retained during 5-minute transition window; tokens signed with old key remain valid until expiry; verified by integration test
  • Image retention container-registry lifecycle policy in place: untagged images purged weekly; staging images retained 30 days; dev images retained 7 days; policy verified in registry settings
  • CI observability: GitLab pipeline duration tracked; image size delta posted as merge request comment (fail if > 20% increase); test failure rate visible in CI dashboard
  • alembic check CI gate: no migration added a NOT NULL column without a default in the same step; CI job validates hypertable migrations use CONCURRENTLY (grep check on all new migration files)

Phase 2 Additional Regulatory / Dual Domain Items:

  • Shadow mode: admin can enable/disable per organisation; ShadowBanner displayed on all pages when active; shadow records have shadow_mode = TRUE; shadow records excluded from all operational API responses (integration test)
  • NOTAM drafting: draft generated in ICAO Annex 15 format from any event with FIR intersection; mandatory regulatory disclaimer present (automated test verifies its presence in every draft); stored in notam_drafts
  • Space Operator Portal: space_operator user can view only owned objects (non-owned objects return 404, not 403, to prevent object enumeration); ControlledReentryPlanner functional for has_propulsion = TRUE objects
  • CCSDS export: ephemeris export in OEM format passes CCSDS 502.0-B-3 structural validation
  • API keys: create, use, and revoke flow functional; per-key rate limiting returns 429 at daily limit; raw key displayed only at creation (never retrievable after)
  • TIP message provenance displayed in UI: source label reads "USSPACECOM TIP (not certified aeronautical information)" — not just "TIP Message #N"
  • Data confidence warnings: objects with data_confidence = 'unknown' display a warning callout on all prediction panels explaining the impact on prediction quality

Phase 3 Complete When:

  • Mode C (Monte Carlo Particles): animated trajectories render; click-particle shows params
  • Real-time alerts delivered within 30 seconds of trigger condition
  • Geographic alert filtering: alerts scoped to user's FIR list
  • Route intersection analysis functional against sample flight plans
  • Feedback: density scaling recalibration demonstrated from ≥2 historical re-entries
  • Load test: 100 concurrent users; CZML load < 2s at p95
  • External penetration test completed; all Critical/High findings remediated
  • Full axe-core audit + manual screen reader test (NVDA + VoiceOver) passes
  • Secrets manager (Vault or equivalent) replacing Docker secrets for all production credentials
  • All credentials on rotation schedule; rotation verified without downtime
  • Prometheus + Grafana operational; certificate expiry alert configured
  • Production deployment runbook documented; incident response procedure per threat scenario
  • Security audit log shipping to external SIEM verified
  • Shadow validation report generated for ≥1 historical re-entry event demonstrating prediction accuracy
  • ECSS compliance artefacts produced: Software Management Plan, V&V Plan, Product Assurance Plan, Data Management Plan (required for ESA contract bids)
  • TRL 6 demonstration: system demonstrated in operationally relevant environment with real TLE data, real space weather, and ≥1 ANSP shadow deployment
  • Regulatory acceptance package complete: safety case framework, ICAO Annex 15 data quality mapping, SMS integration guide
  • Legal opinion obtained on operational liability per target deployment jurisdictions (Australia, EU, UK minimum)
  • First ANSP shadow mode deployment active with ≥4 weeks of shadow prediction records

Phase 3 Infrastructure / HA:

  • Patroni configuration validated: scripts/check_patroni_config.py passes confirming maximum_lag_on_failover, synchronous_mode: true, synchronous_mode_strict: true, wal_level: replica, recovery_target_timeline: latest all present in patroni.yml
  • Patroni failover drill: manually kill the primary DB container; verify standby promoted within 30s; backend API continues serving requests (latency spike acceptable; no 5xx errors after 35s); PgBouncer reconnects automatically to new primary
  • MinIO EC:2 verified: 4-node MinIO starts cleanly; integration test writes a 100 MB object; shut down one MinIO node; read succeeds; write succeeds; shut down second node; write fails with expected error; read still succeeds (EC:2 read quorum = 2 of 4)
  • WAF/DDoS protection confirmed in place at ingress (Cloudflare/AWS Shield or equivalent network-level appliance for on-premise); security architecture review sign-off
  • DNS architecture documented: docs/runbooks/dns-architecture.md covers split-horizon zones, PgBouncer VIP, Redis Sentinel VIP, and service discovery records for Tier 3 deployment
  • Backup restore test checklist completed successfully (see §34.5): all 6 checklist items passed within the 30-day window before Phase 3 sign-off
  • TLS certificate lifecycle runbook complete: docs/runbooks/tls-cert-lifecycle.md documents ACME auto-renewal path and internal CA path for air-gapped deployments; cert expiry Prometheus alerts firing at 60/30/7-day thresholds

Phase 3 Performance:

  • Formal load test passed: tests/load/ scenario with k6 or Locust; 100 concurrent users; CZML catalog load < 2s p95; MC job submit < 500ms; alert WebSocket delivery < 30s; test report committed to docs/validation/load-test-report-phase3.md
  • MC concurrency gate tested at scale: 10 simultaneous MC submissions across 5 organisations; each org receives 429 for its second request; no deadlock or Redis key leak observed; Celery worker queue depth remains bounded
  • WebSocket subscriber ceiling verified: load test opens 450 connections to a single backend instance; 451st connection receives HTTP 503; ws_connected_clients gauge reads 450; scaling trigger fires at 400 (alert visible in Grafana)
  • CZML delta adoption: Playwright E2E test confirms the frontend sends ?since= parameter on all CZML polls after initial load; no full-catalog request occurs after page load in LIVE mode
  • Bundle size CI gate active and green: final production build JS bundle documented; bundle-size CI step has passed for ≥2 consecutive deploys without manual override
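
The subscriber-ceiling behaviour verified above can be sketched as a simple capacity gate. Only the 450/400 limits come from the plan; the class and method names are illustrative, not the real backend API.

```python
# Minimal sketch of the WebSocket subscriber ceiling: accept up to
# 450 connections (the 451st is refused with 503 upstream) and raise
# a scaling signal once 400 are connected.
class WsCapacityGate:
    MAX_WS_CLIENTS = 450   # hard ceiling per backend instance
    SCALE_ALERT_AT = 400   # Grafana scaling trigger

    def __init__(self) -> None:
        self.connected = 0  # mirrors the ws_connected_clients gauge

    def try_accept(self) -> tuple[bool, bool]:
        """Return (accepted, scale_alert)."""
        if self.connected >= self.MAX_WS_CLIENTS:
            return False, False
        self.connected += 1
        return True, self.connected >= self.SCALE_ALERT_AT

gate = WsCapacityGate()
results = [gate.try_accept() for _ in range(451)]
assert results[450] == (False, False)  # 451st connection refused
assert results[399] == (True, True)    # alert fires at the 400th accept
assert gate.connected == 450
```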

22. Open Physics Questions for Engineering Review

  1. JB2008 vs NRLMSISE-00 — Recommend: NRLMSISE-00 for Phase 1 with a pluggable density model interface that accepts JB2008 in Phase 2 without API or schema changes.

  2. Covariance source for conjunction probability — Recommend: SP ephemeris covariance from Space-Track for active payloads; empirical covariance with explicit UI warning for debris.

  3. Re-entry termination altitude — Recommend: 80 km for Phase 1; parametric interface for Phase 2 breakup module (default 80 km, allow up to 120 km).

  4. F10.7 forecast horizon — For objects re-entering 5–14 days out, NOAA 3-day forecasts have degraded skill. Recommend: 81-day smoothed average as baseline with ±20% MC variation; document clearly in the SpaceWeatherWidget and every prediction panel.


23. Dual Domain Architecture

23.1 The Interface Problem

Two technically adjacent domains — space operations and civil aviation — manage debris re-entry hazards using incompatible tools, data formats, and operational vocabularies. The gap between them is the market.

```
SPACE DOMAIN                          THE GAP                     AVIATION DOMAIN
────────────────                    ──────────                   ────────────────
TLE / SGP4                                                        NOTAM
CDMs / TIP messages          No standard interface               FIR restrictions
CCSDS orbit products         No common tool                      ATC procedures
Kp / F10.7 indices           No shared language                  En-route charts
Probability of casualty      ← SpaceCom bridges this →          Plain English hazard brief
```

23.2 Shared Physics Core

One physics engine serves both front doors. Neither domain gets a different model — they get different views of the same computation.

```
                    ┌─────────────────────────────────┐
                    │         PHYSICS CORE            │
                    │  Catalog Propagator (SGP4)      │
                    │  Decay Predictor (RK7(8)+NRLMS) │
                    │  Monte Carlo ensemble           │
                    │  Conjunction Screener           │
                    │  Atmospheric Breakup (ORSAT)    │
                    │  Frame transforms (TEME→WGS84)  │
                    └────────────┬────────────────────┘
                                 │
               ┌─────────────────┴─────────────────┐
               │                                   │
    ┌──────────▼───────────┐          ┌────────────▼──────────┐
    │   SPACE DOMAIN UI    │          │  AVIATION DOMAIN UI   │
    │  /space portal       │          │  / (operational view) │
    │  Persona E, F        │          │  Persona A, B, C      │
    │                      │          │                       │
    │  State vectors       │          │  Hazard corridors     │
    │  Covariance matrices │          │  FIR intersection     │
    │  CCSDS formats       │          │  NOTAM drafts         │
    │  Deorbit windows     │          │  Plain-language status│
    │  API keys            │          │  Alert acknowledgement│
    │  Conjunction data    │          │  Gantt timeline       │
    └──────────────────────┘          └───────────────────────┘
```

23.3 Domain-Specific Output Formats

| Output | Space Domain | Aviation Domain |
|---|---|---|
| Trajectory | CCSDS OEM (state vectors) | CZML (J2000 INERTIAL for CesiumJS) |
| Re-entry prediction | p05/p50/p95 times + covariance | Percentile corridor polygons on globe |
| Hazard | Probability of casualty (Pc) value | Risk level (LOW/MEDIUM/HIGH/CRITICAL) |
| Uncertainty | Monte Carlo ensemble statistics | Corridor width visual encoding |
| Conjunction | CDM-format Pc value | Not surfaced to Persona A |
| Space weather | F10.7 / Ap / Kp raw indices | "Elevated activity — wider uncertainty" |
| Deorbit plan | CCSDS manoeuvre plan | Corridor risk map on globe |
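
The Pc-to-risk-level translation in the table above can be sketched as follows. The threshold values are illustrative placeholders, not the calibrated product thresholds.

```python
def risk_level(pc: float) -> str:
    """Map a probability-of-casualty value to the aviation-facing
    risk band. Thresholds here are illustrative only; 1e-4 echoes the
    commonly cited casualty-risk benchmark but is not the product value."""
    if pc >= 1e-3:
        return "CRITICAL"
    if pc >= 1e-4:
        return "HIGH"
    if pc >= 1e-5:
        return "MEDIUM"
    return "LOW"

assert risk_level(2e-4) == "HIGH"
assert risk_level(5e-7) == "LOW"
```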

23.4 Competitive Position

| Competitor | Their Strength | SpaceCom Advantage |
|---|---|---|
| ESA ESOC Re-entry Prediction Service | Authoritative technical product; longest-running service | Aviation-facing operational UX; ANSP decision support; NOTAM drafting; multi-ANSP coordination |
| OKAPI:Orbits + DLR + TU Braunschweig | Academic orbital mechanics depth; space operator integrations | Purpose-built ANSP interface; controlled re-entry planner; shadow mode for regulatory adoption |
| Aviation weather vendors (e.g., StormGeo) | Deep ANSP relationships; established procurement pathways | Space domain physics credibility; TLE/CDM ingestion; conjunction screening |
| General STM platforms | Broad catalog management | Operational decision support depth; aviation integration layer |

SpaceCom's moat is the combination of space physics credibility AND aviation operational usability. Neither side alone is sufficient to win regulated aviation authority contracts.

Differentiation capabilities — must be maintained regardless of competitor moves (Finding 4):

These are the capabilities that competitors cannot quickly replicate and that directly determine whether ANSPs and institutional buyers choose SpaceCom over alternatives:

| Capability | Why it matters | Maintenance requirement |
|---|---|---|
| ANSP operational workflow integration | NOTAM drafting, multi-ANSP coordination, and shadow mode are purpose-built for ANSP operations — not retrofitted | Must be validated with ≥ 2 ANSP safety teams before Phase 2 shadow deployment |
| Regulatory adoption path | Shadow mode + exit criteria + ANSP Safety Department sign-off creates a documented adoption trail that institutional procurements require | Shadow mode exit report template must remain current; exit statistics generated automatically |
| Physics + aviation in one product | Neither a pure orbital analytics tool nor a pure aviation tool can cover both sides without the other's domain expertise | Dual-domain architecture (§23) must be maintained; any feature removal from either domain triggers an ADR |
| ESA/DISCOS data integration | Institutional credibility with ESA and national space agencies depends on using authoritative ESA data sources | DISCOS redistribution rights must be resolved before Phase 2; integration maintained as P1 data source |

A docs/competitive-analysis.md document (maintained by the product owner, reviewed quarterly) tracks competitor feature releases and assesses impact on these claims. Any competitor capability that closes a differentiation gap triggers a product review within 30 days.

23.5 SWIM Integration Path

European ANSPs increasingly exchange operational data via SWIM (System Wide Information Management), defined by ICAO Doc 10039 and implemented in Europe via EUROCONTROL SWIM-TI (AMQP/MQTT transport, FIXM/AIXM 5.1 schemas). Full SWIM compliance is a Phase 3+ target; the path is:

| Phase | Deliverable | Standard |
|---|---|---|
| Phase 2 | GeoJSON structured event export (/events/{id}/export?format=geojson) with ICAO FIR IDs and prediction metadata | GeoJSON + ISO 19115 metadata |
| Phase 3 | Review FIXM Core 4.x schema for re-entry hazard representation; define SpaceCom extension namespace | FIXM Core 4.2 |
| Phase 3 | SWIM-TI AMQP endpoint (publish-only) for alert.new and tip.new events to EUROCONTROL Network Manager B2B service | EUROCONTROL SWIM-TI Yellow Profile |

Phase 2 GeoJSON export is the immediate deliverable. Phase 3 SWIM-TI integration is scoped but requires a EUROCONTROL B2B service account and FIXM schema extension review — neither is blocking for Phase 1 or 2.


24. Regulatory Compliance Framework

24.1 The Regulatory Gap SpaceCom Operates In

There is currently no binding international regulatory framework governing re-entry debris hazard notifications to civil aviation. SpaceCom operates at the boundary between two regulatory regimes that have not yet formally agreed on how to bridge them.

This creates risk (no approved pathway to slot into) but also opportunity (SpaceCom can help define the standard and accumulate first-mover evidence).

24.2 Liability and Operational Status

Legal opinion is a Phase 2 gate, not a Phase 3 task. Shadow mode deployments with ANSPs must not occur without a completed legal opinion for the deployment jurisdiction. "Advisory only" UI labelling is not contractual protection — liability limitation must be in executed agreements. In common law jurisdictions (Australia, UK, US), a voluntary undertaking of responsibility to a known class of relying professionals can create a duty of care regardless of disclaimers (Hedley Byrne & Co v Heller and equivalents). Shadow mode activation in the admin panel is gated by legal_opinions.shadow_mode_cleared = TRUE for the organisation's jurisdiction.

Legal opinion scope (per deployment jurisdiction — Australia, EU, UK, US minimum):

  • Whether "decision support information" labelling limits liability for incorrect predictions that inform airspace decisions
  • Whether the platform creates duty-of-care obligations regardless of labelling
  • Whether Space-Track data redistribution via the SpaceCom API requires a separate licensing agreement with 18th Space Control Squadron
  • Whether CDM data (national security-adjacent) is subject to export controls in target jurisdictions
  • Whether the Controlled Re-entry Planner falls under ECCN 9E515 (spacecraft operations technical data) for non-US users

Operational status classification for SpaceCom outputs — not a UI label, a formal determination made in consultation with the ANSP's legal and SMS teams:

  • Aeronautical information (ICAO Annex 15) — highest standard; triggers data quality obligations
  • Decision support information — intermediate; requires formal ANSP SMS acceptance
  • Situational awareness information — lowest; advisory only; no procedural authority

Commercial contract requirements — three instruments required before any access:

  1. Master Services Agreement (MSA) — executed before any ANSP or space operator accesses the system. Must be reviewed by aviation law counsel. Minimum required terms:

    • Limitation of liability: capped at 12 months of fees paid, or a fixed cap for government/sovereign customers (to be determined by counsel)
    • Exclusion of consequential and indirect loss
    • Explicit statement that SpaceCom outputs are decision support information, not certified aeronautical information and not a substitute for ANSP operational procedures
    • ANSP's acknowledgement that they retain full authority and responsibility for all operational decisions
    • SLOs from §26.1 incorporated by reference
    • Governing law and jurisdiction clause
    • Data Processing Agreement (DPA) addendum for GDPR-scope deployments (see §29)
    • Right to suspend service without liability for maintenance, degraded mode, data quality concerns, or active security incidents
  2. Acceptable Use Policy (AUP) — click-wrap accepted in-platform at first login, recorded in users.tos_accepted_at, users.tos_version, and users.tos_accepted_ip. Must re-accept when version changes (system blocks access until accepted). Includes:

    • Acknowledgement that orbital data originates from Space-Track, subject to Space-Track terms
    • Prohibition on redistributing SpaceCom-derived data to third parties without written consent
    • Acknowledgement that the platform is decision support only, not certified aeronautical information
    • Export control acknowledgement (user is responsible for compliance in their jurisdiction)
  3. API Terms — embedded in the API key issuance flow for Persona E/F programmatic access. Accepted at key creation; recorded against the api_keys record. Includes the Space-Track redistribution acknowledgement and the export control notice.

Space-Track data redistribution gate (F3): Space-Track.org Terms of Service prohibit redistribution of TLE data to non-registered entities. The SpaceCom API must not serve TLE-derived fields (raw TLE strings, tle_epoch, tle_line1/2) to organisations that have not confirmed Space-Track registration. Implementation:

```sql
-- Add to organisations table
ALTER TABLE organisations ADD COLUMN space_track_registered BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN space_track_registered_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN space_track_username TEXT; -- for audit
```

API middleware check (applied to any response containing TLE-derived fields):

```python
from fastapi import HTTPException

def check_space_track_gate(org: Organisation):
    if not org.space_track_registered:
        raise HTTPException(
            status_code=403,
            detail="TLE-derived data requires Space-Track registration. "
                   "Register at space-track.org and confirm in your organisation settings."
        )
```

All TLE-derived disclosures are logged in data_disclosure_log:

```sql
CREATE TABLE data_disclosure_log (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id      UUID NOT NULL REFERENCES organisations(id),
    source      TEXT NOT NULL,  -- 'space_track', 'esa_sst', etc.
    endpoint    TEXT NOT NULL,
    disclosed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    record_count INTEGER
);
CREATE INDEX ON data_disclosure_log (org_id, source, disclosed_at DESC);
```
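
A minimal sketch of gating plus disclosure logging together, with an in-memory list standing in for the data_disclosure_log table; function and data shapes are illustrative.

```python
from datetime import datetime, timezone

# In-memory stand-in for data_disclosure_log (illustrative only).
disclosure_log: list[dict] = []

def serve_tle_fields(org: dict, endpoint: str, records: list[dict]) -> list[dict]:
    """Refuse TLE-derived fields for unregistered orgs; otherwise
    log the disclosure and return the records."""
    if not org["space_track_registered"]:
        raise PermissionError("TLE-derived data requires Space-Track registration")
    disclosure_log.append({
        "org_id": org["id"],
        "source": "space_track",
        "endpoint": endpoint,
        "disclosed_at": datetime.now(timezone.utc),
        "record_count": len(records),
    })
    return records

org = {"id": "a1", "space_track_registered": True}
serve_tle_fields(org, "/api/v1/objects", [{"tle_epoch": "2026-03-16"}])
assert disclosure_log[0]["record_count"] == 1
```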

Contracts table and MRR tracking (F1, F4, F9 — §68):

The contracts table enforces that feature access is gated on commercial state, provides MRR data for the commercial team, and records discount approval for audit:

```sql
CREATE TABLE contracts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL REFERENCES organisations(id),  -- UUID to match organisations.id
  contract_type TEXT NOT NULL
    CHECK (contract_type IN ('sandbox','professional','enterprise','on_premise','internal')),
  -- Financial terms
  monthly_value_cents INTEGER NOT NULL DEFAULT 0,  -- 0 for sandbox/internal
  currency CHAR(3) NOT NULL DEFAULT 'EUR',
  discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
    CHECK (discount_pct >= 0 AND discount_pct <= 100),
  -- Discount approval guard (F4): discounts >20% require second approver
  discount_approved_by INTEGER REFERENCES users(id),  -- NULL if discount_pct <= 20
  discount_approval_note TEXT,
  -- Term
  valid_from TIMESTAMPTZ NOT NULL,
  valid_until TIMESTAMPTZ NOT NULL,
  auto_renew BOOLEAN NOT NULL DEFAULT FALSE,
  -- Feature access — what this contract enables
  enables_operational_mode BOOLEAN NOT NULL DEFAULT FALSE,
  enables_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,
  enables_api_access BOOLEAN NOT NULL DEFAULT FALSE,
  -- Audit
  created_by INTEGER REFERENCES users(id),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  signed_msa_at TIMESTAMPTZ,        -- NULL until MSA countersigned
  msa_document_ref TEXT,            -- path in MinIO legal bucket
  -- Professional Services (F10)
  ps_value_cents INTEGER NOT NULL DEFAULT 0,  -- one-time PS revenue on this contract
  ps_description TEXT
);
CREATE INDEX ON contracts (org_id, valid_until DESC);
-- Active-contract lookup. Note: NOW() is not IMMUTABLE, so it cannot
-- appear in a partial index predicate; filter on valid_until at query time.
CREATE INDEX ON contracts (valid_until);

-- Constraint: discounts >20% must have a named approver
ALTER TABLE contracts ADD CONSTRAINT discount_approval_required
  CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL);
```

Feature access enforcement (F1): Feature flags in organisations must be set from the active contract, not by admin toggle alone. A Celery task (tasks/commercial/sync_feature_flags.py) runs nightly and on contract creation/update to sync organisations.feature_multi_ansp_coordination from the active contract's enables_multi_ansp_coordination. An admin toggle that disagrees with the active contract is overwritten by the nightly sync.

MRR dashboard (F9): Add a Grafana panel (internal dashboard, not customer-facing) showing current MRR:

-- Recording rule or direct query:
SELECT SUM(monthly_value_cents) / 100.0 AS mrr_eur
FROM contracts
WHERE valid_from <= NOW() AND valid_until >= NOW()
  AND contract_type NOT IN ('sandbox', 'internal');

Expose as spacecom_mrr_eur Prometheus gauge updated by the nightly sync_feature_flags task. Grafana panel: "Current MRR (€)" — single stat panel, comparison to previous month.
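
A sketch of the MRR computation the nightly task would perform before setting the spacecom_mrr_eur gauge, mirroring the SQL above; row shapes are illustrative.

```python
from datetime import datetime, timezone

def current_mrr_eur(contracts: list[dict], now: datetime) -> float:
    """Sum monthly value of active, revenue-bearing contracts (EUR)."""
    return sum(
        c["monthly_value_cents"]
        for c in contracts
        if c["valid_from"] <= now <= c["valid_until"]
        and c["contract_type"] not in ("sandbox", "internal")
    ) / 100.0

now = datetime(2026, 4, 1, tzinfo=timezone.utc)
contracts = [
    {"contract_type": "professional", "monthly_value_cents": 250_000,
     "valid_from": datetime(2026, 1, 1, tzinfo=timezone.utc),
     "valid_until": datetime(2026, 12, 31, tzinfo=timezone.utc)},
    {"contract_type": "sandbox", "monthly_value_cents": 0,
     "valid_from": datetime(2026, 1, 1, tzinfo=timezone.utc),
     "valid_until": datetime(2026, 12, 31, tzinfo=timezone.utc)},
]
assert current_mrr_eur(contracts, now) == 2500.0
```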

Export control screening (F4): ITAR 22 CFR §120.15 and EAR 15 CFR §736 prohibit providing certain SSA capabilities to nationals of embargoed countries and denied parties. Required at organisation onboarding:

```sql
ALTER TABLE organisations ADD COLUMN country_of_incorporation CHAR(2); -- ISO 3166-1 alpha-2
ALTER TABLE organisations ADD COLUMN export_control_screened_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN export_control_cleared BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN itar_cleared BOOLEAN NOT NULL DEFAULT FALSE; -- US-person or licensed
```

Onboarding flow:

  1. Collect country_of_incorporation at registration
  2. Flag embargoed countries (CU, IR, KP, RU, SY) for manual review — account held in PENDING_EXPORT_REVIEW state
  3. Screen organisation name against BIS Entity List (automated lookup; manual review on partial match)
  4. EU-SST-derived data gated behind itar_cleared = TRUE (EU-SST has its own access restrictions for non-EU entities)
  5. All screening decisions logged with reviewer ID and date

Documented in legal/EXPORT_CONTROL_POLICY.md. Legal counsel review required before any deployment that could serve US-origin technical data (TLE from 18th Space Control Squadron) to non-US persons.
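
The onboarding screening decision can be sketched as follows; the embargoed-country list comes from the flow above, while the entity-list match flag is a placeholder for the real BIS Entity List lookup.

```python
# Country list per the onboarding flow above; matching logic illustrative.
EMBARGOED = {"CU", "IR", "KP", "RU", "SY"}

def screening_state(country: str, entity_list_match: bool) -> str:
    """Hold embargoed-country or entity-list-matched orgs for manual
    review; everything else proceeds (final clearance still logged)."""
    if country.upper() in EMBARGOED or entity_list_match:
        return "PENDING_EXPORT_REVIEW"
    return "CLEARED"

assert screening_state("AU", entity_list_match=False) == "CLEARED"
assert screening_state("IR", entity_list_match=False) == "PENDING_EXPORT_REVIEW"
assert screening_state("DE", entity_list_match=True) == "PENDING_EXPORT_REVIEW"
```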

Regulatory Sandbox Agreement — a lightweight 2-page letter of understanding required before any ANSP shadow mode activation. Specifies:

  • Trial period start and end dates
  • ANSP's confirmation that SpaceCom outputs are for internal validation only (not operational)
  • SpaceCom's commitment to produce a shadow validation report at trial end
  • Data protection terms for the trial period
  • How incidents during the trial are handled by both parties
  • Mutual agreement that the trial does not create any ongoing commercial obligation

Regulatory sandbox liability clarification (F11 — §61): The sandbox agreement is not a liability shield by itself. During shadow mode, SpaceCom is a tool under evaluation — liability exposure depends on how the ANSP uses outputs and what the sandbox agreement says about consequences of errors. Required provisions:

  • No operational reliance clause: ANSP certifies in writing that no operational decisions will be made on the basis of SpaceCom outputs during the trial. Any breach of this clause by the ANSP shifts liability to the ANSP.
  • Incident notification: If a SpaceCom output error is identified during the trial, SpaceCom notifies the ANSP within 2 hours (matching the safety occurrence runbook at §26.8). The sandbox agreement specifies whether this constitutes a notifiable occurrence under the ANSP's SMS.
  • Indemnification cap: SpaceCom's aggregate liability during the sandbox period is capped at AUD/EUR 50,000 (or local equivalent). Catastrophic loss claims are excluded (consistent with MSA terms).
  • Insurance requirement: SpaceCom must carry professional indemnity insurance with minimum cover AUD/EUR 1 million before activating any sandbox with an ANSP. Certificate of currency provided to the ANSP before activation.
  • Regulatory notification duty: If the ANSP's safety regulator requires notification of third-party tool trials (e.g., EASA, CASA, CAA), that obligation rests with the ANSP. SpaceCom provides a one-page system description document to support the ANSP's notification.
  • Sandbox ≠ approval pathway: A successful sandbox trial is evidence for a future regulatory submission — it is not itself an approval. Neither party should represent the sandbox as a form of regulatory acceptance.

legal/SANDBOX_AGREEMENT_TEMPLATE.md captures the standard text. Legal counsel review required before any amendment.

The shadow mode admin toggle must display a warning if no Regulatory Sandbox Agreement is on record (legal_opinions.shadow_mode_cleared = FALSE for the org's jurisdiction):

```
⚠ No legal clearance on record for this organisation's jurisdiction.
  Shadow mode should not be activated without a completed legal opinion
  and a signed Regulatory Sandbox Agreement.
  [View legal status →]
```

24.3 ICAO Data Quality Mapping (Annex 15)

SpaceCom outputs that may enter aeronautical information channels must be characterised against ICAO's five data quality attributes:

| Attribute | SpaceCom Characterisation | Required Action |
|---|---|---|
| Accuracy | Decay predictor accuracy characterised from ≥10 historical re-entry backcasts vs. The Aerospace Corporation database. Published as a formal accuracy statement in GET /api/v1/reentry/predictions/{id} response. | Phase 3: produce accuracy characterisation document |
| Resolution | Corridor boundaries expressed as geographic polygons with stated precision. Position uncertainty stated as formal resolution value in prediction response. | Included in prediction API response from Phase 1 |
| Integrity | HMAC-SHA256 on all prediction and hazard zone records. Integrity assurance level: Essential (1×10⁻⁵). Documented in system description. | Implemented Phase 1 (§7.9) |
| Traceability | Full parameter provenance in simulations.params_json and prediction records. Accessible to regulatory auditors via dedicated API. | Phase 1 |
| Timeliness | Maximum latency from TIP message ingestion to updated prediction available: 30 minutes. Maximum latency from NOAA SWPC space weather update to prediction recalculation: 4 hours. Published as formal SLA. | Phase 3 SLA document |

F5 — Completeness attribute and ICAO Annex 15 §3.2 data quality classification (§61):

ICAO Annex 15 §3.2 defines a sixth implicit attribute — Completeness — meaning all data fields required by the receiving system are present and within range. SpaceCom must:

  • Define a formal completeness schema for each prediction response (required fields, allowed nulls, value ranges)
  • Return data_quality.completeness_pct in the prediction response (fields present / fields required × 100)
  • Reject predictions with completeness < 90% from the alert pipeline (alert not generated; operator notified of incomplete prediction)
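
A sketch of the completeness check described above; the required field names are illustrative, not the formal completeness schema.

```python
# Fields present / fields required x 100, with sub-90% predictions
# blocked from the alert pipeline (illustrative field names).
REQUIRED_FIELDS = ("norad_id", "window_p05", "window_p50", "window_p95",
                   "corridor_geojson")

def completeness_pct(prediction: dict) -> float:
    present = sum(1 for f in REQUIRED_FIELDS
                  if prediction.get(f) is not None)
    return 100.0 * present / len(REQUIRED_FIELDS)

pred = {"norad_id": 44878, "window_p05": "2026-03-16T14:00Z",
        "window_p50": "2026-03-16T18:00Z", "window_p95": "2026-03-16T22:00Z",
        "corridor_geojson": None}
assert completeness_pct(pred) == 80.0   # below 90%: alert suppressed
```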

ICAO data category and classification required in the prediction response (Annex 15 Table A3-1):

| Field | Value |
|---|---|
| data_category | AERONAUTICAL_ADVISORY (until formal AIP entry process established) |
| originator | SPACECOM + system version string |
| effective_from | ISO 8601 UTC timestamp |
| integrity_assurance | ESSENTIAL (1×10⁻⁵ probability of undetected error) |
| accuracy_class | CLASS_2 (advisory, not certified — until accuracy characterisation completes Phase 3 validation) |

Formal accuracy characterisation (docs/validation/ACCURACY_CHARACTERISATION.md) is a Phase 3 gate before the API can be presented to any ANSP as meeting Annex 15 data quality standards.

24.4 Safety Management System Integration

Any ANSP formally adopting SpaceCom must include it in their SMS (ICAO Annex 19). SpaceCom provides the following artefacts to support ANSP SMS assessment:

Hazard register (SpaceCom's contribution to the ANSP's SMS — F3, §61 structured format):

Maintained as docs/safety/HAZARD_LOG.md. Each hazard uses the structured schema below. Hazard IDs are permanent — retired hazards are marked CLOSED, not deleted.

| ID | Description | Cause | Effect | Mitigations | Severity | Likelihood | Risk Level | Status |
|---|---|---|---|---|---|---|---|---|
| HZ-001 | SpaceCom unavailable during active re-entry event | Infrastructure failure; deployment error; DDoS | ANSP cannot access current re-entry prediction during event window | Patroni HA failover (§26.3); 15-min RTO SLO; automated ANSP push notification + email; documented fallback procedure | Hazardous | Low (SLO 99.9%) | Medium | OPEN |
| HZ-002 | False all-clear prediction (false negative — corridor misses actual impact zone) | TLE age; atmospheric model error; MC sampling variance; adversarial data manipulation | ANSP issues all-clear; aircraft enters debris corridor | HMAC integrity check; dual-source TLE validation; TIP cross-check guard; shadow validation evidence; accuracy characterisation (Phase 3); @pytest.mark.safety_critical tests | Catastrophic | Very Low | High | OPEN |
| HZ-003 | False hazard prediction (false positive — corridor over-stated) | Atmospheric model conservatism; TLE propagation error | Unnecessary airspace restriction; operational disruption; credibility loss | Cross-source TLE validation; HMAC; p95 corridor with stated uncertainty; accuracy characterisation | Major | Low | Medium | OPEN |
| HZ-004 | Corridor displayed in wrong reference frame | ECI/ECEF/geographic frame conversion error; CZML frame parameter misconfiguration | Corridor shown at wrong lat/lon; operator makes decisions on incorrect geographic basis | Frame transform unit tests against IERS references (§17); CZML frame convention enforced via CI | Hazardous | Very Low | Medium | OPEN |
| HZ-005 | Outdated prediction served (stale data) | Ingest pipeline failure; TLE source outage; cache not invalidating | Operator sees prediction that no longer reflects current orbital state | Data staleness indicators in UI; automated stale alert to operators; ingest health monitoring; CZML cache invalidation triggers (§35) | Major | Low | Medium | OPEN |
| HZ-006 | Prediction integrity failure (HMAC mismatch) | Database modification; backup restore error; storage corruption | Prediction record cannot be verified; may have been tampered with | Prediction quarantined automatically; CRITICAL security alert; prediction withheld from API | Catastrophic | Very Low | High | OPEN |
| HZ-007 | Unauthorised access to prediction data | Compromised credentials; RLS bypass; API misconfiguration | Competitor or adversary obtains early re-entry corridor data; potential ITAR exposure | PostgreSQL RLS; JWT validation; rate limiting; security_logs audit trail; penetration testing | Major | Low | Medium | OPEN |

Hazard log governance:

  • Review: quarterly, and after each SEV-1 incident, model version update, or material system change
  • New hazards identified during safety occurrence reporting are added within 5 business days
  • Risk level = Severity × Likelihood using EUROCAE ED-153 risk classification matrix
  • OPEN hazards with High risk level are Phase 2 gate blockers — must reach MITIGATED before ANSP shadow activation
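
The Severity × Likelihood combination can be sketched as a lookup table; the band assignments below are taken from the hazard rows above, not from the normative ED-153 matrix.

```python
# Illustrative severity/likelihood lookup consistent with HZ-001..HZ-007
# (placeholder bands, not the normative ED-153 classification table).
RISK_MATRIX = {
    ("Catastrophic", "Very Low"): "High",    # HZ-002, HZ-006
    ("Hazardous", "Low"): "Medium",          # HZ-001
    ("Hazardous", "Very Low"): "Medium",     # HZ-004
    ("Major", "Low"): "Medium",              # HZ-003, HZ-005, HZ-007
}

def hazard_risk_level(severity: str, likelihood: str) -> str:
    return RISK_MATRIX[(severity, likelihood)]

assert hazard_risk_level("Catastrophic", "Very Low") == "High"
assert hazard_risk_level("Major", "Low") == "Medium"
```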

System safety classification: Safety-related (not safety-critical under DO-278A). Relevant components target the SAL-2 assurance level (see §24.13). Development assurance standard: EUROCAE ED-78A equivalent for relevant components.

Change management: SpaceCom must notify all ANSP users before model version updates that affect prediction outputs. Version changes tracked in simulations.model_version and surfaced in the UI.

24.5 NOTAM System Interface

SpaceCom's position in the NOTAM workflow:

SpaceCom generates → NOTAM draft (ICAO format) → Reviewed by Persona A → Submitted by authorised NOTAM originator → Issued NOTAM

SpaceCom never submits NOTAMs. The draft is a decision support artefact. The mandatory disclaimer on every draft is a non-removable regulatory requirement, not a UI preference.

NOTAM timing requirements by jurisdiction:

  • Routine NOTAMs: 24–48 hours minimum lead time
  • Short-notice (re-entry window < 24 hours): ASAP; NOTAM issued with minimum lead time
  • SpaceCom alert thresholds align with these: CRITICAL alert at < 6h, HIGH at < 24h

24.6 Space Law Considerations

UN Liability Convention (1972): All SpaceCom prediction records, simulation runs, and alert acknowledgements may be legally discoverable in an international liability claim. The immutable audit trail (§7.9) therefore also serves as an evidence preservation mechanism. reentry_predictions, alert_events, notam_drafts, and shadow_validations are retained for a minimum of 7 years.

National space laws with re-entry obligations:

  • Australia: Space (Launches and Returns) Act 2018. CASA and the Australian Space Agency have coordination protocols. SpaceCom's controlled re-entry planner outputs are suitable as evidence for operator obligations under this Act.
  • EU/ESA: EU Space Programme Regulation; ESA Zero Debris Charter. SpaceCom supports Zero Debris by characterising re-entry risk and supporting responsible end-of-life planning.
  • US: FAA AST re-entry licensing generates data that SpaceCom should ingest when available. 51 USC Chapter 509 obligations may affect US space operator customers.

Space Traffic Management evolution: US Office of Space Commerce is developing civil STM frameworks that may eventually replace Space-Track as the primary civil space data source. SpaceCom's ingest architecture must be adaptable (hardcoded URL constants in ingest/sources.py make this a 1-file change when the source changes).

24.7 ICAO Framework Alignment

Existing: ICAO Doc 10100 (Manual on Space Weather Information, 2019) designates three ICAO-recognised Space Weather Centres (NOAA SWPC, ESA/ESAC, Japan Meteorological Agency). SpaceCom's space weather widget must reference these designated centres by name and ICAO recognition status.

Emerging re-entry guidance: ICAO is in early stages of developing re-entry hazard notification guidance (no published document as of 2025). SpaceCom should:

  • Monitor ICAO Air Navigation Commission and Meteorology Panel working group outputs
  • Design hazard corridor outputs in a format that parallels SIGMET structure (the closest existing ICAO framework: WHO/WHAT/WHERE/WHEN/INTENSITY/FORECAST) — this positions SpaceCom well for whatever standard emerges
  • Consider engaging ICAO working groups as a stakeholder; SpaceCom could become a reference implementation

SIGMET parallel structure for re-entry corridor outputs:

```
REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)
WHO:      CZ-5B ROCKET BODY / NORAD 44878
WHAT:     UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE
WHERE:    CORRIDOR 18S115E TO 28S155E / FL000 TO UNL
WHEN:     FROM 2026031614 TO 2026031622 UTC / WINDOW ±4H (P95)
RISK:     HIGH / LAND AREA IN CORRIDOR: 12%
FORECAST: CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H
SOURCE:   SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3
```
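
An advisory in this layout is mechanical to generate once the fields exist. A hypothetical formatter (the field names and function are illustrative, not a settled SpaceCom schema):

```python
# Hypothetical formatter for the SIGMET-parallel advisory layout; field names
# are illustrative, not an actual SpaceCom schema.
def format_reentry_advisory(fields: dict[str, str]) -> str:
    order = ["WHO", "WHAT", "WHERE", "WHEN", "RISK", "FORECAST", "SOURCE"]
    header = "REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)"
    lines = [header]
    for key in order:
        # Left-pad each label to a fixed 10-character column, as in the example
        lines.append(f"{key + ':':<10}{fields[key]}")
    return "\n".join(lines)
```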

24.8 Alert Threshold Governance

Alert threshold values are consequential algorithmic decisions. A CRITICAL threshold that is too sensitive causes unnecessary airspace disruption; one that is too conservative creates false-negative risk. Both outcomes have legal, operational, and reputational consequences.

Current threshold values and rationale:

| Threshold | Value | Rationale |
|---|---|---|
| CRITICAL window | < 6h | Aligns with ICAO minimum NOTAM lead time for short-notice restrictions; 6h allows ANSP to issue NOTAM with ≥2h lead time |
| HIGH window | < 24h | Operational planning horizon for pre-tactical airspace management |
| FIR intersection trigger | p95 corridor intersects any non-zero area of the FIR | Conservative: any non-zero intersection at p95 level generates an alert; minimum area threshold is an org-configurable setting (default: 0) |
| Alert rate limit | 1 CRITICAL per object per 4h window | Prevents alert flooding from repeated window-shrink events without substantive new information |
| Alert storm threshold | > 5 CRITICAL in 1h | Empirically chosen; above this rate the response-time expectation for individual alerts cannot be met |

These values are recorded in docs/alert-threshold-history.md with initial entry date and author sign-off.

Threshold change procedure:

  1. Engineer proposes change in a PR with rationale documented in docs/alert-threshold-history.md
  2. PR requires review by engineering lead and product owner before merge
  3. Change is deployed to staging; minimum 2-week shadow-mode observation period against real TLE/TIP data
  4. Shadow observation review: false positive rate and false negative rate compared against pre-change baseline
  5. If baseline comparison passes: change deployed to production; all ANSP shadow deployment partners notified in writing with new threshold values
  6. If any ANSP objects: change is held until concerns are resolved

Threshold values are not configurable at runtime by operators. They are code constants reviewed through the above process. Org-configurable alert settings (geographic FIR filter, mute rules, OPS_ROOM_SUPPRESS_MINUTES) are UX preferences, not threshold changes.
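
The governance stance above (reviewed code constants, never runtime settings) could look like this in practice; the module shape and names are illustrative:

```python
# Hypothetical module mirroring §24.8: thresholds are code constants, changed
# only via the documented PR + shadow-observation process, never at runtime.
from datetime import timedelta
from types import MappingProxyType  # read-only view: no runtime mutation

ALERT_THRESHOLDS = MappingProxyType({
    "CRITICAL_WINDOW": timedelta(hours=6),    # < 6h to predicted re-entry
    "HIGH_WINDOW": timedelta(hours=24),       # < 24h to predicted re-entry
    "FIR_MIN_AREA_KM2_DEFAULT": 0.0,          # any non-zero p95 intersection alerts
    "RATE_LIMIT_WINDOW": timedelta(hours=4),  # max 1 CRITICAL per object per window
    "STORM_THRESHOLD_PER_HOUR": 5,            # > 5 CRITICAL in 1h = alert storm
})

def severity_for_window(time_to_reentry: timedelta) -> str:
    """Map predicted time-to-re-entry to an alert severity tier."""
    if time_to_reentry < ALERT_THRESHOLDS["CRITICAL_WINDOW"]:
        return "CRITICAL"
    if time_to_reentry < ALERT_THRESHOLDS["HIGH_WINDOW"]:
        return "HIGH"
    return "MEDIUM"
```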

24.9 Degraded Mode and Availability

SpaceCom must specify degraded mode behaviour for ANSP adoption:

| Condition | System Behaviour | ANSP Action |
|---|---|---|
| Ingest pipeline failure (TLE data > 6h stale) | MEDIUM alert to all operators; staleness indicator on all objects; predictions greyed out | Consult Space-Track directly; activate fallback procedure |
| Space weather data > 4h stale | WARNING banner on SpaceWeatherWidget; uncertainty multiplier conservatively set to HIGH | Note wider uncertainty on any operational decisions |
| System unavailable | Push notification to all registered users; email to ANSP contacts | Activate fallback procedure documented in SpaceCom SMS integration guide |
| HMAC verification failure on a prediction | Prediction withheld; CRITICAL security alert; prediction marked integrity_failed | Do not use the withheld prediction; contact SpaceCom immediately |

Degraded mode notification: When SpaceCom is down or data is stale beyond defined thresholds, all connected ANSPs receive push notification (WebSocket if connected; email fallback) so they can activate their fallback procedures. SpaceCom must never go silent when operationally relevant events are active.
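
The staleness rows above can be encoded as a small classifier; the thresholds follow the table, while the function and condition names are illustrative:

```python
# Illustrative staleness classifier for the §24.9 degraded-mode table;
# thresholds come from the table, names are hypothetical.
from datetime import datetime, timedelta, timezone

TLE_STALE_AFTER = timedelta(hours=6)
SPACE_WEATHER_STALE_AFTER = timedelta(hours=4)

def degraded_conditions(now: datetime, tle_ts: datetime, swx_ts: datetime) -> list[str]:
    """Return the degraded-mode conditions active at `now`."""
    active = []
    if now - tle_ts > TLE_STALE_AFTER:
        active.append("TLE_STALE")           # MEDIUM alert; predictions greyed out
    if now - swx_ts > SPACE_WEATHER_STALE_AFTER:
        active.append("SPACE_WEATHER_STALE") # WARNING banner; HIGH uncertainty
    return active
```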


24.10 EU AI Act Obligations

Classification: SpaceCom's conjunction probability model (§19) and any ML-based alert prioritisation constitute an AI system under EU AI Act Art. 3(1). AI systems used in transport infrastructure safety fall under Annex III, point 4 (AI systems intended to be used for dispatching, monitoring, and maintenance of transport infrastructure including aviation). This classification implies high-risk AI system obligations.

High-risk AI system obligations (EU AI Act Chapter III Section 2):

| Obligation | Article | SpaceCom implementation |
|---|---|---|
| Risk management system | Art. 9 | Integrate with existing SMS (§24.4); maintain AI-specific risk register in legal/EU_AI_ACT_ASSESSMENT.md |
| Data governance | Art. 10 | TLE training data provenance documented; simulations.params_json stores full input provenance; bias assessment required for orbital prediction models |
| Technical documentation | Art. 11 + Annex IV | legal/EU_AI_ACT_ASSESSMENT.md — system description, capabilities, limitations, human oversight measures, accuracy characterisation |
| Record-keeping / automatic logging | Art. 12 | reentry_predictions and alert_events tables provide automatic event logging; immutable (APPEND-only with HMAC) |
| Transparency to users | Art. 13 | Conjunction probability values labelled with model version (simulations.model_version), TLE age, EOP currency; uncertainty bounds displayed |
| Human oversight | Art. 14 | All decisions remain with duty controller (§24.2 AUP; §28.6 Decision Prompts disclaimer); no autonomous action taken by SpaceCom |
| Accuracy, robustness, cybersecurity | Art. 15 | Accuracy characterisation (§24.3 ICAO Data Quality); adversarial robustness covered by §7 and §36 security review |
| Conformity assessment | Art. 43 | Self-assessment pathway available for transport safety AI without third-party involvement at first deployment; document in legal/EU_AI_ACT_ASSESSMENT.md |
| EU database registration | Art. 51 | High-risk AI systems must be registered in the EU AI Act database before placing on market; legal milestone in deployment roadmap |

Human oversight statement (required in UI — Art. 14): The conjunction probability display (§19.4) must include the following non-configurable statement in the model information panel:

"This probability estimate is generated by an AI model and is subject to uncertainty arising from TLE age, atmospheric model limitations, and manoeuvre uncertainty. All operational decisions remain with the duty controller. This system does not replace ANSP procedures."

Gap analysis and roadmap: legal/EU_AI_ACT_ASSESSMENT.md must document: current compliance state → gaps → remediation actions → target dates. Phase 2 gate: conformity assessment documentation complete. Phase 3 gate: EU database registration completed before commercial EU deployment.


24.11 Regulatory Correspondence Register

For an ANSP-facing product, regulators and institutional stakeholders (CAA, EASA, national ANSPs, ESA, ICAO) will issue queries, audits, formal requests, and correspondence. A missed regulatory deadline can constitute a licence breach or grounds for suspension of operations.

Correspondence log: legal/REGULATORY_CORRESPONDENCE_LOG.md — structured register with the following fields per entry:

| Field | Description |
|---|---|
| Date received | ISO 8601 |
| Authority | Regulatory body name and country |
| Reference number | Authority's reference (if given) |
| Subject | Brief description |
| Deadline | Formal response deadline (ISO 8601) |
| Owner | Named individual responsible for response |
| Status | PENDING / RESPONDED / CLOSED / ESCALATED |
| Response date | Date formal response sent |
| Notes | Internal context, legal counsel involvement |

SLAs:

  • All regulatory correspondence acknowledged (receipt confirmed to sender) within 2 business days
  • Substantive response or extension request within 14 calendar days (or as required by the correspondence)
  • All correspondence older than 14 days without a RESPONDED or CLOSED status triggers an escalation to the CEO
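
The 14-day escalation rule is straightforward to automate against the register. A hypothetical check, assuming entries have already been parsed into dicts keyed on the fields listed above:

```python
# Illustrative escalation sweep over the correspondence register in
# legal/REGULATORY_CORRESPONDENCE_LOG.md; entry keys and the function name
# are hypothetical, the 14-day rule comes from §24.11.
from datetime import date

def needs_escalation(entries: list[dict], today: date) -> list[dict]:
    """Entries older than 14 days that are neither RESPONDED nor CLOSED."""
    return [
        e for e in entries
        if (today - date.fromisoformat(e["date_received"])).days > 14
        and e["status"] not in ("RESPONDED", "CLOSED")
    ]
```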

Proactive regulatory engagement: The correspondence register is reviewed at each quarterly steering meeting. Any authority that has issued ≥3 queries in a 12-month period warrants a proactive engagement call to identify and address systemic concerns before they become formal regulatory actions.


24.12 Safety Case Framework (F1 — §61)

A safety case is a structured argument that a system is acceptably safe for a specified use in a defined context. SpaceCom must produce and maintain a safety case before any operational ANSP deployment. The safety case is a living document, updated at each material system change.

Safety case structure (Goal Structuring Notation — GSN, consistent with EUROCAE ED-153 / IEC 61508 safety case guidance):

G1: SpaceCom is acceptably safe to use as a decision support tool
    for re-entry hazard awareness in civil airspace operations

  C1: Context — SpaceCom operates as decision support (not autonomous authority);
      all operational decisions remain with the ANSP duty controller

  S1: Argument strategy — safety achieved by hazard identification,
      risk reduction, and operational constraints

    G1.1: All identified hazards are mitigated to acceptable risk levels
      Sn1: Hazard Log (docs/safety/HAZARD_LOG.md)
      E1.1.1: HZ-001 through HZ-007 mitigation evidence (§24.4)
      E1.1.2: Shadow validation report (≥30 day trial)

    G1.2: System integrity is maintained through all operational modes
      Sn2: HMAC integrity on all safety-critical records (§7.9)
      E1.2.1: `@pytest.mark.safety_critical` test suite — 100% pass
      E1.2.2: Integrity failure quarantine demonstrated (§56 E2E test)

    G1.3: Operators are trained and capable of correct system use
      Sn3: Operator Training Programme (§28.9)
      E1.3.1: Training completion records (operator_training_records table)
      E1.3.2: Reference scenario completion evidence

    G1.4: Degraded mode provides adequate notification for fallback
      Sn4: Degraded mode specification (§24.9)
      E1.4.1: ANSP communication plan activated in game day exercise (§26.8)

    G1.5: Regulatory obligations are met for the deployment jurisdiction
      Sn5: Means of Compliance document (§24.14)
      E1.5.1: Legal opinions for deployment jurisdictions (§24.2)
      E1.5.2: ANSP SMS integration guide (§24.15)

Safety case document: docs/safety/SAFETY_CASE.md. Version-controlled; each tagged release includes a safety case snapshot. Safety case review is required before:

  • ANSP shadow mode activation
  • Model version updates that affect prediction outputs
  • New deployment jurisdiction
  • Any change to alert thresholds (§24.8)

Safety case custodian: Named individual (Phase 2: CEO or CTO until a dedicated safety manager is appointed). Changes to the safety case require the custodian's sign-off.


24.13 Software Assurance Level (SAL) Assignment (F2 — §61)

EUROCAE ED-153 / DO-278A defines Software Assurance Levels for ground-based aviation software systems. The appropriate SAL determines the rigour of development, verification, and documentation activities required.

SpaceCom SAL assignment:

| Component | Failure Condition | Severity Class | SAL | Rationale |
|---|---|---|---|---|
| Re-entry prediction engine (physics/) | False all-clear (HZ-002) | Hazardous | SAL-2 | Undetected false negative could contribute to an airspace safety event; highest-consequence component |
| Alert generation pipeline (alerts/) | Failed alert delivery; wrong threshold applied | Hazardous | SAL-2 | Failure to generate a CRITICAL alert during an active event is equivalent in consequence to HZ-002 |
| HMAC integrity verification | Integrity failure undetected | Hazardous | SAL-2 | Loss of integrity checking removes the primary guard against data manipulation |
| CZML corridor rendering | Wrong geographic position displayed (HZ-004) | Hazardous | SAL-2 | Geographic display error directly misleads operator |
| API authentication and authorisation | Unauthorised data access (HZ-007) | Major | SAL-3 | Privacy and data governance impact; not directly causal of airspace event |
| Ingest pipeline (worker/) | Stale data not detected (HZ-005) | Major | SAL-3 | Staleness monitoring is a mitigation for HZ-005; failure of staleness monitoring increases HZ-005 likelihood |
| Frontend (non-safety-critical paths) | Cosmetic / non-operational UI failure | Minor | SAL-4 | Not in the safety-critical path |

SAL-2 implications (minimum activities required):

  • Independent verification of requirements, design, and code for SAL-2 components (see §24.16 Verification Independence)
  • Formal test coverage: 100% statement coverage for SAL-2 modules (enforced via @pytest.mark.safety_critical)
  • Configuration management of all SAL-2 source files and their test artefacts (see §30.8)
  • SAL-2 components documented in the safety case with traceability from requirement → design → code → test

SAL assignment document: docs/safety/SAL_ASSIGNMENT.md — reviewed at each architecture change and before any ANSP deployment.


24.14 Means of Compliance (MoC) Document (F8 — §61)

A Means of Compliance document maps each regulatory or standard requirement to the specific implementation evidence that demonstrates compliance. Required before any formal regulatory submission (ESA bid, EASA consultation response, ANSP safety acceptance).

Document: docs/safety/MEANS_OF_COMPLIANCE.md

Structure:

| Requirement ID | Source | Requirement Text (summary) | Means of Compliance | Evidence Location | Status |
|---|---|---|---|---|---|
| MOC-001 | EUROCAE ED-153 §5.3 | Software requirements defined and verifiable | Requirements documented in relevant §sections of MASTER_PLAN; acceptance criteria in TEST_PLAN | docs/TEST_PLAN.md; relevant §sections | PARTIAL |
| MOC-002 | EUROCAE ED-153 §6.4 | Independent verification of SAL-2 software | Verification independence policy (§24.16); separate reviewer for safety-critical PRs | docs/safety/VERIFICATION_INDEPENDENCE.md | PLANNED |
| MOC-003 | ICAO Annex 15 §3.2 | Data quality attributes characterised | ICAO data quality table (§24.3); accuracy characterisation document | docs/validation/ACCURACY_CHARACTERISATION.md | PARTIAL (Phase 3) |
| MOC-004 | ICAO Annex 19 | ANSP SMS integration supported | SMS integration guide; hazard register; training programme | docs/safety/ANSP_SMS_GUIDE.md; docs/safety/HAZARD_LOG.md | PLANNED |
| MOC-005 | EU AI Act Art. 9 | Risk management system documented | AI Act assessment; hazard log; safety case | legal/EU_AI_ACT_ASSESSMENT.md; docs/safety/HAZARD_LOG.md | IN PROGRESS |
| MOC-006 | DO-278A §10 | Configuration management of safety artefacts | CM policy (§30.8); Git tagging of releases; signed commits | docs/safety/CM_POLICY.md | PLANNED |
| MOC-007 | ED-153 §7.2 | Safety occurrence reporting procedure | Runbook in §26.8; SAFETY_OCCURRENCE log type | docs/runbooks/; security_logs table | IMPLEMENTED |

The MoC document is a Phase 2 deliverable. PARTIAL items become Phase 3 gates. PLANNED items require assigned owners and completion dates before ANSP shadow activation.


24.15 ANSP-Side Obligations Document (F10 — §61)

SpaceCom cannot unilaterally satisfy all regulatory requirements — the receiving ANSP has obligations that SpaceCom must document and communicate. Failing to do so is a gap in the safety argument.

Document: docs/safety/ANSP_SMS_GUIDE.md — provided to every ANSP before shadow mode activation.

ANSP obligations by category:

| Category | ANSP Obligation | SpaceCom Provides |
|---|---|---|
| SMS integration | Include SpaceCom in ANSP SMS under ICAO Annex 19 | Hazard register contribution (§24.4); SAL assignment; safety case |
| Change notification | Notify SpaceCom of any ANSP procedure changes that affect how SpaceCom outputs are used | Change notification contact in MSA |
| Operator training | Ensure all SpaceCom users complete the operator training programme (§28.9) | Training modules; completion API; training records |
| Fallback procedure | Maintain and exercise a fallback procedure for SpaceCom unavailability | Fallback procedure template in onboarding documentation |
| Occurrence reporting | Report any safety occurrence involving SpaceCom outputs to SpaceCom within 24 hours | Safety occurrence form; contact details; §26.8 runbook |
| Regulatory notification | Notify applicable safety regulator of SpaceCom use if required by national SMS regulations | System description one-pager for regulator submission |
| Shadow validation | Participate in ≥30-day shadow validation trial; provide evaluation feedback | Shadow validation report template; shadow validation dashboard |
| AUP acceptance | Ensure all users accept the AUP (§24.2) | Automated AUP flow; compliance report for ANSP admin |

Liability assignment note (links to §24.2 and §24.12 F11): The ANSP SMS guide explicitly states that the ANSP retains full operational authority and accountability for all air traffic decisions, regardless of SpaceCom outputs. SpaceCom is a decision support tool. This statement must appear in the ANSP SMS guide, the AUP, and the safety case context node C1 (§24.12).

25.1 Target Tender Profile

SpaceCom targets ESA tenders in the following programme areas:

  • Space Safety Programme — re-entry risk, SSA services, space debris
  • GSTP (General Support Technology Programme) — technology development with commercial potential
  • ARTES (Advanced Research in Telecommunications Systems) — if the commercial operator portal reaches satellite operators
  • Space-Air Traffic Integration studies — the category matching ESA's OKAPI:Orbits award

25.2 Differentiation from ESA ESOC Re-entry Prediction Service

ESA's re-entry prediction service (reentry.esoc.esa.int) is a technical product for space operators and agencies. SpaceCom is not a competitor to this service — it is a complementary operational layer that could consume ESOC outputs:

| Dimension | ESA ESOC Service | SpaceCom |
|---|---|---|
| Primary user | Space agencies, debris researchers | ANSPs, airspace managers, space operators |
| Output format | Technical prediction reports | Operational decision support + NOTAM drafts |
| Aviation integration | None | Core feature |
| ANSP decision workflow | Not designed for this | Primary design target |
| Space operator portal | Not provided | Phase 2 deliverable |
| Shadow mode / regulatory adoption | Not provided | Built-in |

In an ESA bid: Position SpaceCom as the user-facing operational layer that sits on top of the space surveillance and prediction infrastructure that ESA already operates. ESA invests in the physics; SpaceCom invests in the interface that makes the physics actionable for aviation authorities and space operators.

25.3 TRL Roadmap (ESA Definitions)

| Phase | End TRL | Evidence |
|---|---|---|
| Phase 1 complete | TRL 4 | Validated decay predictor (≥3 historical backcasts); SGP4 globe with real TLE data; Mode A corridors; HMAC integrity; full security infrastructure |
| Phase 2 complete | TRL 5 | Atmospheric breakup; Mode B heatmap; NOTAM drafting; space operator portal; CCSDS export; shadow mode; ≥1 ANSP shadow deployment running |
| Phase 3 complete | TRL 6 | System demonstrated in operationally relevant environment; ≥1 ANSP shadow deployment with ≥4 weeks validation data; external penetration test passed; ECSS compliance artefacts complete |
| Post-Phase 3 | TRL 7 | System prototype demonstrated in operational environment (live ANSP deployment, not shadow) |

25.4 ECSS Standards Compliance

ESA contracts require compliance with the European Cooperation for Space Standardization (ECSS). Required compliance mapping:

| Standard | Title | SpaceCom Compliance |
|---|---|---|
| ECSS-Q-ST-80C | Software Product Assurance | Software Management Plan, V&V Plan, Product Assurance Plan — produced Phase 3 |
| ECSS-E-ST-10-04C | Space environment | NRLMSISE-00 and JB2008 compliance with ECSS atmospheric model requirements |
| ECSS-E-ST-10-12C | Methods for re-entry and debris footprint calculation | Decay predictor and atmospheric breakup model methodology documented and traceable |
| ECSS-U-AS-010C | Space sustainability | Zero Debris Charter alignment statement; controlled re-entry planner outputs |

Compliance matrix document (produced Phase 3): Maps every ECSS requirement to the relevant SpaceCom component, test, or document. Required for ESA tender submission.

25.5 ESA Zero Debris Charter Alignment

SpaceCom directly supports the Zero Debris Charter objectives:

| Charter Objective | SpaceCom Support |
|---|---|
| Responsible end-of-life disposal | Controlled re-entry planner generates CCSDS-format manoeuvre plans minimising ground risk |
| Transparency of re-entry risk | Public hazard corridor data; NOTAM drafting; multi-ANSP coordination |
| Reduction of casualty risk | Atmospheric breakup model; casualty area computation; population density weighting in deorbit optimiser |
| Data sharing | API layer for space operator integration; CCSDS export; open prediction endpoints |

Include Zero Debris Charter alignment statement in all ESA bid submissions.

25.6 Required ESA Procurement Artefacts

All ESA contracts require these management documents. SpaceCom must produce them by Phase 3:

| Document | ECSS Reference | Content |
|---|---|---|
| Software Management Plan (SMP) | ECSS-Q-ST-80C §5 | Development methodology, configuration management, change control, documentation standards |
| Verification and Validation Plan (VVP) | ECSS-Q-ST-80C §6 | Test strategy, traceability from requirements to test cases, acceptance criteria |
| Product Assurance Plan (PAP) | ECSS-Q-ST-80C §4 | Safety, reliability, quality standards and how they are met |
| Data Management Plan (DMP) | ECSS-Q-ST-80C §8 | How data produced under contract is handled, shared, archived, and made reproducible |
| Software Requirements Specification (SRS) | Tailored ECSS-E-ST-40C | Software requirements baseline, interfaces, external dependencies, and bounded assumptions including air-risk and RDM exchange boundaries |
| Software Design Description (SDD) | Tailored ECSS-E-ST-40C | Module architecture, algorithm choices, interface contracts, and validation assumptions |
| User Manual / Ops Guide | Tailored ECSS-E-ST-40C | Installation, configuration, operator workflows, limitations, and degraded-mode handling |
| Test Plan + Test Report | Tailored ECSS-Q-ST-80C | Planned validation campaign, executed results, deviations, and acceptance evidence for procurement submission |
| Accessibility Conformance Report (ACR/VPAT 2.4) | EN 301 549 v3.2.1 | WCAG 2.1 AA conformance declaration; mandatory for EU public sector ICT procurement; maps each success criterion to Supports / Partially Supports / Does Not Support with remarks |

Scaffold documents for all procurement-facing artefacts should be created at Phase 1 start and maintained throughout development — not produced from scratch at Phase 3.

For contracts with explicit software prototype review gates (e.g. PDR, TRR, CDR, QR, FR), the SRS, SDD, User Manual, Test Plan, and Test Report are updated incrementally at each milestone rather than back-filled only at final review.

25.7 Consortium Strategy

ESA study contracts typically favour consortia that combine:

  • Technical depth (university or research institute)
  • Industrial relevance (commercial applicability)
  • End-user representation (the entity that will use the output)

SpaceCom's ideal consortium for an ESA bid:

  • SpaceCom (lead) — system integration, aviation domain interface, commercial deployment
  • Academic partner (orbital mechanics / atmospheric density modelling credibility — equivalent to TU Braunschweig in the OKAPI:Orbits consortium)
  • ANSP or aviation authority (end-user representation — demonstrates the aviation gap is real and the solution is wanted)

Without a credentialled academic or research partner for the physics components, ESA evaluators may question the technical depth. Identify and approach potential academic partners before submitting to any ESA tender.

25.8 Intellectual Property Framework for ESA Bids

ESA contracts operate under the ESA General Conditions of Contract, which distinguish between background IP (pre-existing IP brought into the contract) and foreground IP (IP created during the contract). The default terms grant ESA a non-exclusive, royalty-free licence to use foreground IP, while the contractor retains ownership. These terms are negotiable and must be agreed before contract signature.

Required IP actions before bid submission:

  1. Background IP schedule: Document all SpaceCom components that constitute background IP — physics engine, data model, UX design, proprietary algorithms. This schedule protects SpaceCom's ability to continue commercial deployment after the ESA contract ends without ESA claiming rights to the core product.

  2. Foreground IP boundary: Define clearly what will be created during the ESA contract (e.g., specific ECSS compliance artefacts, validation datasets, TRL demonstration reports) versus what SpaceCom brings in as background IP. Narrow the foreground IP scope to ESA-specific deliverables only.

  3. Software Bill of Materials (SBOM): Required for ECSS compliance and as part of the ESA bid artefact package. Generated via syft or cyclonedx-bom. Must identify all third-party licences. AGPLv3-licensed components cannot appear in the SBOM of a closed-source ESA deliverable without a commercial licence; verify every dependency's licence individually rather than by reputation (CesiumJS itself, for example, is Apache-2.0, but visualisation and geospatial dependencies are a common source of copyleft terms).

  4. Consortium Agreement: Must be signed by all consortium members before bid submission. Must specify:

    • IP ownership for each consortium member's contributions
    • Publication rights for academic partners (must not conflict with any commercial confidentiality obligations)
    • Revenue share for any commercial use arising from the contract
    • Liability allocation between consortium members
    • Exit terms if a member withdraws
  5. Export control pre-clearance: Confirm with counsel that the planned ESA deliverable does not require an export licence for transfer to ESA (a Paris-based intergovernmental organisation). Generally covered under EAR licence exception GOV, but verify for any controlled technology components.
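
The SBOM licence screen from step 3 can be sketched as a small check over CycloneDX JSON; the `components`/`licenses` layout below is the standard CycloneDX structure, while the function name is illustrative:

```python
# Illustrative check over a CycloneDX JSON SBOM (as produced by syft or
# cyclonedx-bom) flagging AGPL-licensed components before an ESA submission.
import json

def agpl_components(sbom_json: str) -> list[str]:
    """Return names of components carrying an AGPL SPDX licence id."""
    sbom = json.loads(sbom_json)
    flagged = []
    for comp in sbom.get("components", []):
        for lic in comp.get("licenses", []):
            lic_id = lic.get("license", {}).get("id", "")
            if lic_id.startswith("AGPL"):
                flagged.append(comp.get("name", "<unnamed>"))
    return flagged
```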


26. SRE and Reliability Framework

26.1 Service Level Objectives

SpaceCom is most critical during active re-entry events — peak load coincides with highest operational stakes. Standard availability metrics are insufficient. SLOs must be defined against event-correlated conditions, not just averages.

| Service Level Indicator | SLO | Measurement Window | Notes |
|---|---|---|---|
| Prediction API availability | 99.9% | Rolling 30 days | 43.2 min error budget per 30-day window |
| Prediction API availability (active TIP event) | 99.95% | Duration of TIP window | Stricter; degradation during events is SEV-1 |
| Decay prediction latency | p50 < 90s | Per MC job | 500-sample chord run |
| Decay prediction latency | p95 < 240s | Per MC job | Drives worker sizing (§27) |
| CZML ephemeris load | p95 < 2s | Per request | 100-object catalog |
| TIP message ingest latency | < 30 min from publication | Per TIP message | Drives CRITICAL alert timing |
| Space weather update latency | < 15 min from NOAA SWPC | Per update cycle | Drives uncertainty multiplier refresh |
| Alert WebSocket delivery latency | < 10s from trigger | Per alert | Measured trigger→client receipt |
| Corridor update after new TIP | < 60 min | Per TIP message | Full MC rerun triggered |

Error budget policy: When the 30-day rolling error budget is exhausted, no further deployments or planned maintenance are permitted until the rolling budget has recovered. Tracked in the Grafana SLO dashboard (§26.8).
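
The availability rows above reduce to simple arithmetic; this illustrative helper reproduces the 30-day error budget for the 99.9% target:

```python
# Error-budget arithmetic for the availability SLOs above:
# a 99.9% target over a rolling 30-day window leaves
# (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes of budget.
def error_budget_minutes(slo: float, window_days: float) -> float:
    return (1.0 - slo) * window_days * 24 * 60
```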

SLOs must be written into the model user agreement (§24.2) and agreed with each ANSP customer before operational deployment. ANSPs need defined thresholds to determine when to activate their fallback procedures.

Customer-facing SLA (Finding 7) — contractual commitments in the MSA:

Internal SLOs are aspirational targets; the SLA is a binding contractual commitment with defined measurement, exclusions, and credits. The MSA template includes the following SLA schedule:

| Metric | SLA commitment | Measurement | Exclusions |
|---|---|---|---|
| Monthly availability | 99.5% | External uptime monitor; excludes scheduled maintenance (max 4h/month; 48h advance notice) | Force majeure; upstream data source outages (Space-Track, NOAA SWPC) lasting > 4h |
| Critical alert delivery | Within 5 minutes of trigger (p95) | alert_events.created_at → delivered_websocket/email = TRUE timestamp | Customer network connectivity issues |
| Prediction freshness | p50 updated within 4h of new TLE availability | tle_sets.ingested_at → reentry_predictions.created_at | Space-Track API outage > 4h |
| Support response — CRITICAL incident | Initial response within 1 hour | From customer report or automated alert, whichever earlier | Outside contracted support hours (on-call for CRITICAL) |
| Support response — P1 resolution | Within 8 hours | From initial response | |
| Service credits | 1 day credit per 0.1% availability below SLA | Applied to next invoice | |

Any SRE threshold change that could cause an SLA breach (e.g., raising the ingest failure alert threshold beyond 4 hours) must be reviewed by the product owner before deployment. Tracked in docs/sla/sla-schedule-v{N}.md (versioned; MSA references the current version by number).
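
The service-credit line in the SLA schedule can be made concrete. This sketch assumes "per 0.1%" means whole 0.1-percentage-point increments of the monthly availability shortfall; the function name and rounding choice are illustrative, not contract language:

```python
# Illustrative service-credit computation for the SLA schedule above:
# 1 day of credit per full 0.1 percentage point below the 99.5% commitment.
# The "whole increments" interpretation is an assumption, not MSA wording.
def service_credit_days(measured_availability: float, sla: float = 0.995) -> int:
    shortfall_pp = max(0.0, (sla - measured_availability) * 100)  # percentage points
    return int(shortfall_pp / 0.1 + 1e-9)  # count whole 0.1pp increments
```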


26.2 Recovery Objectives

| Objective | Target | Scope | Derivation |
|---|---|---|---|
| RTO (active TIP event) | ≤ 15 minutes | Prediction API restoration | CRITICAL alert rate-limit window is 4 hours per object; a 15-minute outage is tolerable within this window without skipping a CRITICAL cycle; beyond 15 minutes the ANSP must activate fallback procedures |
| RTO (no active event) | ≤ 60 minutes | Full system restoration | 1-hour window aligns with MSA SLA commitment; exceeding this triggers the P1 communication plan |
| RPO (safety-critical tables) | Zero | reentry_predictions, alert_events, security_logs, notam_drafts — synchronous replication required | UN Liability Convention evidentiary requirements; loss of a single alert acknowledgement record could be material in a liability investigation |
| RPO (operational data) | ≤ 5 minutes | orbits, tle_sets, simulations — async replication acceptable | 5-minute data age is within the staleness tolerance for TLE-based predictions; loss of in-flight simulations is recoverable by re-submission |

MSA sign-off requirement: RTO and RPO targets must be explicitly stated and agreed in the Master Services Agreement with each ANSP customer before any production deployment. Customers must acknowledge that the fallback procedure (Space-Track direct + ESOC public re-entry page) is their responsibility during the RTO window. RTO/RPO targets are not unilaterally changeable by SpaceCom — any tightening requires customer notification ≥30 days in advance; any relaxation requires customer consent.


26.3 High Availability Architecture

TimescaleDB — Streaming Replication + Patroni

```yaml
# Primary + hot standby; Patroni manages leader election and failover
db_primary:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
    PATRONI_REPLICATION_USERNAME: replicator
  networks: [db_net]

db_standby:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_REPLICA: "true"
  networks: [db_net]

etcd:
  image: bitnami/etcd:3   # Patroni DCS
  networks: [db_net]
```

  • Synchronous replication for reentry_predictions, alert_events, security_logs, notam_drafts (RPO = 0): synchronous_standby_names = 'FIRST 1 (db_standby)' with table-level synchronous commit override
  • Asynchronous replication for orbits, tle_sets (RPO ≤ 5 min): default async
  • Patroni auto-failover: standby promoted within ~30s of primary failure, well within the 15-minute RTO

Required Patroni configuration parameters (must be present in patroni.yml; CI validation via scripts/check_patroni_config.py):

```yaml
bootstrap:
  dcs:
    maximum_lag_on_failover: 1048576    # 1 MB; standby > 1 MB behind primary is excluded from failover election
    synchronous_mode: true              # Enable synchronous replication mode
    synchronous_mode_strict: true       # Primary refuses writes if no synchronous standby confirmed; prevents split-brain

postgresql:
  parameters:
    wal_level: replica                  # Required for streaming replication; 'minimal' breaks replication
    recovery_target_timeline: latest    # Follow timeline switches after failover; required for correct standby behaviour
```

Rationale:

  • maximum_lag_on_failover: without this, a severely lagged standby could be promoted as primary and serve stale data for safety-critical tables.
  • synchronous_mode_strict: true: trades availability for consistency — primary halts rather than allowing an unconfirmed write to proceed without a standby. Acceptable given 15-minute RTO SLO.
  • wal_level: replica: minimal disables the WAL detail needed for streaming replication; must be explicitly set.
  • recovery_target_timeline: latest: without this, a promoted standby after failover may not follow future timeline switches, causing divergence.
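
A minimal sketch of what scripts/check_patroni_config.py might assert, assuming patroni.yml has already been parsed into a dict (the real script would load the YAML file first; the function name and structure are illustrative):

```python
# Hypothetical core of scripts/check_patroni_config.py: verify the required
# Patroni parameters above are present with the mandated values. Takes a
# parsed config dict; a real script would yaml.safe_load("patroni.yml") first.
REQUIRED = {
    ("bootstrap", "dcs", "maximum_lag_on_failover"): 1048576,
    ("bootstrap", "dcs", "synchronous_mode"): True,
    ("bootstrap", "dcs", "synchronous_mode_strict"): True,
    ("postgresql", "parameters", "wal_level"): "replica",
    ("postgresql", "parameters", "recovery_target_timeline"): "latest",
}

def config_violations(cfg: dict) -> list[str]:
    """Return a description of every required parameter that is missing or wrong."""
    problems = []
    for path, expected in REQUIRED.items():
        node = cfg
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if node != expected:
            problems.append(f"{'.'.join(path)} != {expected!r}")
    return problems
```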

Redis — Sentinel (3 Nodes)

```yaml
redis-master:
  image: redis:7-alpine
  command: redis-server /etc/redis/redis.conf
redis-sentinel-1:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-2:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-3:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
```

Three Sentinel instances form a quorum. If the master fails, Sentinel promotes a replica within ~10s. The backend and workers use redis-py's Sentinel client which transparently follows the master after failover.

Redis Sentinel split-brain risk assessment (F3 — §67): In a network partition where Sentinel nodes disagree on master reachability, two Sentinels could theoretically promote two different replicas simultaneously. The min-replicas-to-write 1 setting on the Redis master (a server directive, not a Sentinel one) mitigates this: the old master stops accepting writes when it loses contact with its replicas, forcing clients to the new master.

SpaceCom's Redis data is largely ephemeral — Celery broker messages, WebSocket session state, application cache. A split-brain that loses a small number of Celery tasks or cache entries is survivable. The one persistent concern is the per-org email rate limit counter (spacecom:email_rate:{org_id}:{hour}, §65 F7): a split-brain could result in two independent counters, both allowing up to 50 emails, for a brief period before the split resolves. This is accepted: the 50/hr limit is a cost control, not a safety guarantee. Email volume during a short Sentinel split-brain is not a safety risk.

Risk acceptance and configuration: Set sentinel.conf values:

sentinel down-after-milliseconds spacecom-redis 5000
sentinel failover-timeout spacecom-redis 60000
sentinel parallel-syncs spacecom-redis 1
min-replicas-to-write 1
min-replicas-max-lag 10

ADR: docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md

Cross-Region Disaster Recovery — Warm Standby (F7)

Single-region deployment cannot meet the RTO ≤ 60 minutes target against a full cloud region failure. A warm standby in a second region provides the required recovery path.

Strategy: Warm standby (not hot active-active) — reduces cost and complexity while meeting RTO.

| Component | Primary region | DR region | Failover mechanism |
|---|---|---|---|
| TimescaleDB | Primary + hot standby | Read replica (streaming replication from primary) | Promote replica; update DNS; runbook: db-failover-dr |
| Application tier | Running | Stopped; container images pre-pulled from GHCR | Deploy from images on failover; < 10 minutes |
| MinIO (object storage) | Active | Active (bucket replication enabled) | Already in sync; no failover needed |
| Redis | Active | Cold (config ready) | Restart on failover; session loss acceptable (operators re-authenticate) |
| DNS | Primary A record | Secondary A record in Route 53 (or equiv.) | Health-check-based routing; TTL 60s; auto-failover on primary health check failure |

Failover time estimate: DB promotion 2–5 minutes + DNS propagation 1 minute + app deploy 10 minutes ≈ 15 minutes (within the RTO for an active TIP event).

Runbook: docs/runbooks/region-failover.md — tested annually as game day scenario 6. Post-failover checklist: verify HMAC validation on restored primary; verify WAL integrity; notify ANSPs of region switch; schedule return to primary region within 48 hours.


26.4 Celery Reliability

Task Acknowledgement and Crash Safety

# celeryconfig.py
task_acks_late = True            # Task not acknowledged until complete; if worker dies mid-task, task is requeued
task_reject_on_worker_lost = True  # Orphaned tasks requeued, not dropped
task_serializer = 'json'
result_expires = 86400           # Results expire after 24h; database is the durable store
worker_prefetch_multiplier = 1   # F6 §58: long MC tasks (up to 240s) — prefetch=1 prevents worker A
                                 # holding 4 tasks while workers B/C/D are idle; fair distribution

Dead Letter Queue

Failed tasks (exception, timeout, or permanent error) must be captured, not silently dropped:

# In Celery task base class
class SpaceComTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Update simulations table to status='failed'
        update_simulation_status(task_id, 'failed', error_detail=str(exc))
        # Route to dead letter queue for inspection
        dead_letter_queue.rpush('dlq:failed_tasks', json.dumps({
            'task_id': task_id, 'task_name': self.name,
            'error': str(exc), 'failed_at': utcnow().isoformat()
        }))
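For the on-call engineer inspecting that queue, a hypothetical triage helper (name and shape illustrative, not an existing module) that groups DLQ payloads by task name:

```python
import json
from collections import Counter

def summarise_dlq(raw_entries: list[str]) -> dict[str, int]:
    """Group dead-lettered task payloads (JSON strings, as written by
    SpaceComTask.on_failure) by task name; undecodable entries are counted
    separately rather than dropped."""
    counts = Counter()
    for raw in raw_entries:
        try:
            counts[json.loads(raw)["task_name"]] += 1
        except (json.JSONDecodeError, KeyError):
            counts["<unparseable>"] += 1
    return dict(counts)

# Usage: summarise_dlq(redis.lrange("dlq:failed_tasks", 0, -1))
```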

Queue Routing (Ingest vs Simulation Isolation)

CELERY_TASK_ROUTES = {
    'modules.ingest.*':       {'queue': 'ingest'},
    'modules.propagator.*':   {'queue': 'simulation'},
    'modules.breakup.*':      {'queue': 'simulation'},
    'modules.conjunction.*':  {'queue': 'simulation'},
    'modules.reentry.controlled.*': {'queue': 'simulation'},
}

Two separate worker processes — never competing on the same queue:

# Ingest worker: always running, low concurrency
celery worker --queue=ingest --concurrency=2 --hostname=ingest@%h

# Simulation worker: high concurrency for MC sub-tasks (see §27.2)
celery worker --queue=simulation --concurrency=16 --pool=prefork --hostname=sim@%h

Per-organisation priority isolation (F8): All organisations share the simulation queue, but job priority is set at submission time based on subscription tier and event criticality. This prevents a shadow_trial org's bulk simulation from starving a CRITICAL alert computation for an ansp_operational org.

TIER_TASK_PRIORITY = {
    "internal": 9,
    "institutional": 8,
    "ansp_operational": 7,
    "space_operator": 5,
    "shadow_trial": 3,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2  # added when active TIP event exists for the org's objects

def get_task_priority(org_tier: str, has_active_tip: bool) -> int:
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    return min(10, base + (CRITICAL_EVENT_PRIORITY_BOOST if has_active_tip else 0))

# At submission:
task.apply_async(priority=get_task_priority(org.subscription_tier, active_tip))

Redis with maxmemory-policy noeviction supports Celery task priorities (F9); on the Redis broker, priorities are emulated via per-priority sub-queues rather than a native broker feature. Workers process higher-priority tasks first when multiple tasks are queued. Ingest tasks always route to the separate ingest queue and are unaffected by simulation priority.
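Celery only honours apply_async(priority=...) on the Redis broker when the transport options enable the sub-queue emulation. A sketch of the celeryconfig.py addition this assumes (option names should be verified against the Celery/kombu versions in use):

```python
# celeryconfig.py — sketch; Redis-broker priority is emulated with split queues.
broker_transport_options = {
    "priority_steps": list(range(10)),   # one sub-queue per priority level 0-9
    "sep": ":",                          # separator used in sub-queue names
    "queue_order_strategy": "priority",  # drain higher-priority sub-queues first
}
```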

Celery Beat — High Availability with celery-redbeat

Standard Celery Beat is a single-process SPOF. celery-redbeat stores the schedule in Redis with distributed locking — multiple Beat instances can run; only one holds the lock at a time:

CELERY_BEAT_SCHEDULER = 'redbeat.RedBeatScheduler'
REDBEAT_REDIS_URL = settings.redis_url
REDBEAT_LOCK_TIMEOUT = 60        # 60s; crashed leader blocks scheduling for at most 60s
REDBEAT_MAX_SLEEP_INTERVAL = 5   # standby instances check for lock every 5s after TTL expiry

The default REDBEAT_LOCK_TIMEOUT = max_interval × 5 (typically 25 minutes) is too long during active TIP events — a crashed Beat leader would prevent TIP polling for up to 25 minutes. At 60 seconds, a failover causes at most a 60-second scheduling gap. The standby Beat instance acquires the lock within 5 seconds of TTL expiry (REDBEAT_MAX_SLEEP_INTERVAL = 5).

During an active TIP window (spacecom_active_tip_events > 0), the AlertManager rule for TIP ingest failure uses a 10-minute threshold rather than the baseline 4-hour threshold — ensuring a Beat failover gap does not silently miss critical TIP updates.


26.5 Health Checks

Every service exposes two endpoints. Docker Compose depends_on: condition: service_healthy uses these — the backend does not start until the database is healthy.

Liveness probe (GET /healthz) — process is alive; returns 200 whenever the process can respond, without checking dependencies.

Readiness probe (GET /readyz) — process is ready to serve traffic:

from fastapi.responses import JSONResponse
from sqlalchemy import text

@app.get("/readyz")
async def readiness(db: AsyncSession = Depends(get_db)):
    checks = {}

    # Database connectivity
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Redis connectivity
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"

    # Data freshness
    tle_age = await get_oldest_active_tle_age_hours()
    sw_age = await get_space_weather_age_hours()
    eop_age = await get_eop_age_days()
    airac_age = await get_airspace_airac_age_days()
    checks["tle_age_hours"] = tle_age
    checks["space_weather_age_hours"] = sw_age
    checks["eop_age_days"] = eop_age
    checks["airac_age_days"] = airac_age

    degraded = []
    if checks["database"] != "ok" or checks["redis"] != "ok":
        return JSONResponse(status_code=503, content={"status": "unavailable", "checks": checks})
    if tle_age > 6:
        degraded.append("tle_stale")
    if sw_age > 4:
        degraded.append("space_weather_stale")
    if eop_age > 7:
        degraded.append("eop_stale")       # IERS-A older than 7 days; frame transform accuracy degraded
    if airac_age > 28:
        degraded.append("airspace_stale")  # AIRAC cycle missed

    status_code = 207 if degraded else 200
    return JSONResponse(status_code=status_code, content={
        "status": "degraded" if degraded else "ok",
        "degraded": degraded, "checks": checks
    })

The 207 Degraded response triggers the staleness banner in the UI (§24.8) without taking the service offline. The load balancer treats 207 as healthy (traffic continues); the operational banner warns users.

Renderer service health check — the renderer container runs Playwright/Chromium. If Chromium hangs (a known Playwright failure mode), the container process stays alive and appears healthy while all report generation jobs silently time out. The renderer GET /healthz must verify Chromium can respond, not just that the Python process is alive:

# renderer/app/health.py
import asyncio
from playwright.async_api import async_playwright
from fastapi.responses import JSONResponse

async def health_check():
    """Liveness probe: verify Chromium can launch and load a blank page within 5s."""
    try:
        async with async_playwright() as p:
            browser = await asyncio.wait_for(p.chromium.launch(), timeout=5.0)
            page = await browser.new_page()
            await asyncio.wait_for(page.goto("about:blank"), timeout=3.0)
            await browser.close()
        return {"status": "ok", "chromium": "responsive"}
    except Exception:  # TimeoutError from the waits, or any Chromium launch failure
        renderer_chromium_restarts.inc()
        return JSONResponse({"status": "chromium_unresponsive"}, status_code=503)

Docker Compose healthcheck for renderer:

renderer:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8001/healthz"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 15s

If the healthcheck fails 3 times consecutively, Docker restarts the renderer container. The renderer_chromium_restarts_total counter increments on each failed probe, and its increase triggers the RendererChromiumUnresponsive alert.

Degraded state in GET /readyz for API clients and SWIM (Finding 7): The degraded array in the response is the machine-readable signal for any automated integration (Phase 3 SWIM, API polling clients). API clients must not scrape the UI to determine system state — the health endpoint is the authoritative source. Response fields:

| Field | Type | Meaning |
|---|---|---|
| status | "ok" \| "degraded" \| "unavailable" | Overall system state |
| degraded | string[] | Active degradation reasons: "tle_stale", "space_weather_stale", "ingest_source_failure", "prediction_service_overloaded" |
| degraded_since | ISO8601 \| null | Timestamp of when the current degraded state began (from degraded_mode_events) |
| checks | object | Per-subsystem check results |

Every transition into or out of degraded state is written to degraded_mode_events (see §9.2). NOTAM drafts generated while status = "degraded" have generated_during_degraded = TRUE and the draft (E) field includes: NOTE: GENERATED DURING DEGRADED DATA STATE - VERIFY INDEPENDENTLY BEFORE ISSUANCE.
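For integrators polling GET /readyz, a minimal sketch of client-side handling of these fields (the action names are the client's own policy, not part of the API):

```python
def classify_readyz(payload: dict) -> str:
    """Map a /readyz JSON body to a client-side action.
    Returns one of: 'proceed', 'proceed_with_warning', 'halt'."""
    status = payload.get("status", "unavailable")
    if status == "unavailable":
        return "halt"                      # 503: core dependencies down
    if status == "degraded" or payload.get("degraded"):
        return "proceed_with_warning"      # 207: serve, but surface data staleness
    return "proceed"                       # 200: all checks ok
```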

Docker Compose health check definitions:

backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s

db:
  healthcheck:
    # pg_isready alone passes before the spacecom database and TimescaleDB extension are loaded.
    # This check verifies that the application database is accessible and TimescaleDB is active
    # before any dependent service (pgbouncer, backend) is marked healthy.
    test: ["CMD-SHELL",
           "psql -U spacecom_app -d spacecom -c 'SELECT 1 FROM timescaledb_information.hypertables LIMIT 1'"]
    interval: 5s
    timeout: 3s
    retries: 10
    start_period: 30s   # TimescaleDB extension load and initial setup can take up to 20s

pgbouncer:
  depends_on:
    db:
      condition: service_healthy
  healthcheck:
    test: ["CMD-SHELL", "psql -h localhost -p 5432 -U spacecom_app -d spacecom -c 'SELECT 1'"]
    interval: 5s
    timeout: 3s
    retries: 5

26.6 Backup and Restore

Continuous WAL Archiving (RPO = 0 for critical tables)

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'mc cp %p minio/wal-archive/$(hostname)/%f'  # MinIO via mc client
archive_timeout = 60  # Force WAL segment every 60s even if no writes

Daily Base Backup

pg_basebackup is a PostgreSQL client tool that is not present in the Python runtime worker image. The backup must run in a dedicated sidecar container that has PostgreSQL client tools installed, invoked by the Celery Beat task via docker compose run:

# docker-compose.yml — backup sidecar (no persistent service; run on demand)
services:
  db-backup:
    image: timescale/timescaledb:2.14-pg17   # same image as db; has pg_basebackup
    entrypoint: []
    command: >
      sh -c "pg_basebackup -h db -U postgres -D /backup
             --format=tar --compress=9 --wal-method=stream &&
             mc cp /backup/*.tar.gz minio/db-backups/base-$(date +%F)/"
    networks: [db_net]
    volumes:
      - backup_scratch:/backup
    profiles: [backup]    # not started by default; invoked explicitly
    environment:
      PGPASSWORD: ${POSTGRES_PASSWORD}
      MC_HOST_minio: http://${MINIO_ACCESS_KEY}:${MINIO_SECRET_KEY}@minio:9000

volumes:
  backup_scratch:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=20g    # large enough for compressed base backup

The Celery Beat task triggers the sidecar via the Docker socket (backend container must have /var/run/docker.sock mounted in development — not in production). In production (Tier 2+), use a dedicated cron job on the host:

# /etc/cron.d/spacecom-backup — runs outside Docker, uses Docker CLI
0 2 * * * root docker compose -f /opt/spacecom/docker-compose.yml \
  --profile backup run --rm db-backup >> /var/log/spacecom-backup.log 2>&1

The Celery Beat task in production polls MinIO for today's backup object to verify completion, and fires an alert if it is absent by 03:00 UTC:

# Celery Beat: daily at 03:00 UTC (verification, not execution)
@celery.task
def verify_daily_backup():
    """Verify today's base backup exists in MinIO; alert if absent."""
    # Backups land under the prefix base-YYYY-MM-DD/ (see the sidecar's mc cp
    # target), so list by prefix rather than stat-ing a single key.
    prefix = f"base-{utcnow().date()}/"
    if any(minio_client.list_objects("db-backups", prefix=prefix, recursive=True)):
        structlog.get_logger().info("backup_verified", prefix=prefix)
    else:
        structlog.get_logger().error("backup_missing", prefix=prefix)
        alert_admin(f"Daily base backup missing: {prefix}")
        raise RuntimeError(f"backup missing: {prefix}")  # marks task as FAILED in Celery result backend

Monthly Restore Test

# Celery Beat: first Sunday of each month at 03:00 UTC
@celery.task
def monthly_restore_test():
    """Restore latest backup to ephemeral container; run test suite; alert on failure."""
    # 1. Spin up a test TimescaleDB container from latest base backup + WAL
    # 2. Run db/test_restore.py: verify row counts, hypertable integrity, HMAC spot-checks
    # 3. Tear down container
    # 4. Log result to security_logs; alert admin if test fails

If the monthly restore test fails, the failure is treated as SEV-2. The incident is not resolved until a successful restore is verified.
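Step 2's row-count verification can be a plain comparison against counts captured at the restore target timestamp. A hypothetical helper for db/test_restore.py (table list and tolerance are illustrative):

```python
def compare_row_counts(manifest: dict[str, int], restored: dict[str, int],
                       tolerance: float = 0.01) -> list[str]:
    """Compare restored row counts against counts captured at the restore
    target timestamp. Safety-critical tables must match exactly; hypertables
    may drift within `tolerance` (rows written around the WAL cut-off)."""
    EXACT = {"reentry_predictions", "alert_events", "notam_drafts", "security_logs"}
    failures = []
    for table, expected in manifest.items():
        got = restored.get(table, 0)
        if table in EXACT:
            if got != expected:
                failures.append(f"{table}: {got} != {expected} (exact match required)")
        elif expected and abs(got - expected) / expected > tolerance:
            failures.append(f"{table}: {got} vs {expected} (> {tolerance:.0%} drift)")
    return failures
```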

WAL retention: 30 days of WAL segments retained in MinIO; base backups retained for 90 days; reentry_predictions, alert_events, notam_drafts, security_logs additionally archived to cold storage for 7 years (MinIO lifecycle policy, separate bucket with Object Lock COMPLIANCE mode — prevents deletion even by bucket owner).

Application log retention policy (F10 — §57):

| Log tier | Storage | Retention | Rationale |
|---|---|---|---|
| Container stdout (json-file) | Docker log driver on host | 7 days (max-size=100m, max-file=5) | Short-lived; Promtail ships to Loki in Tier 2+ |
| Loki (structured application logs) | Grafana Loki | 90 days | Covers 30-day incident investigation SLA with headroom |
| Safety-relevant log lines (level=CRITICAL, security_logs events, alert-related log lines) | MinIO append-only bucket | 7 years (same as database safety records) | Regulatory parity with alert_events 7-year hold; NIS2 Art. 23 evidence requirement |
| SIEM-forwarded events | External SIEM (customer-specified) | Per customer contract | ANSP customers may have their own retention obligations |

Loki retention is set in monitoring/loki-config.yml:

limits_config:
  retention_period: 2160h   # 90 days
compactor:
  retention_enabled: true

Safety-relevant log shipping: a Promtail pipeline stage tags log lines with the label safety_critical=true when level=CRITICAL or the logger name contains alert or security. A separate Loki ruler rule ships these to MinIO via a Loki-to-S3 connector (Phase 2). Phase 1 interim: a Celery Beat task exports CRITICAL log lines from Loki to MinIO daily.
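One possible shape for that Promtail stage, assuming structlog emits JSON lines with level and logger fields (stage and template-function syntax should be verified against the deployed Promtail version):

```yaml
# monitoring/promtail-config.yml (sketch)
pipeline_stages:
  - json:
      expressions:
        level: level
        logger: logger
  - template:
      source: safety_critical
      template: '{{ if or (eq .level "CRITICAL") (regexMatch "alert|security" .logger) }}true{{ end }}'
  - labels:
      safety_critical:
```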

Restore time target: Full restore to latest WAL segment in < 30 minutes (tested monthly). This satisfies the RTO ≤ 60 minutes (no active event) with 30 minutes headroom for DNS propagation and smoke tests. Documented step-by-step in docs/runbooks/db-restore.md (Phase 2 deliverable).

Retention Schedule

-- Online retention (TimescaleDB compression + drop policies)
SELECT add_compression_policy('orbits', INTERVAL '7 days');
SELECT add_retention_policy('orbits', INTERVAL '90 days');   -- Archive before drop; see below
SELECT add_retention_policy('space_weather', INTERVAL '2 years');
SELECT add_retention_policy('tle_sets', INTERVAL '1 year');

-- Archival pipeline: Celery task runs before each chunk drop
-- Exports chunk to Parquet in MinIO cold storage before TimescaleDB drops it
-- Legal hold: reentry_predictions, alert_events, notam_drafts, shadow_validations → 7 years
-- No retention policy on these tables; MinIO lifecycle rule retains for 7 years

26.7 Prometheus Metrics

Metrics must be instrumented from Phase 1 — not added at Phase 3 as an afterthought. Business-level metrics are more important than infrastructure metrics for this domain.

Metric naming convention (F1 — §57):

All custom metrics must follow {namespace}_{subsystem}_{name}_{unit} with these rules:

| Rule | Compliant example | Non-compliant example |
|---|---|---|
| Namespace is always spacecom_ | spacecom_ingest_success_total | ingest_success |
| Unit suffix required (Prometheus base units) | spacecom_simulation_duration_seconds | spacecom_simulation_duration |
| Counters end in _total | spacecom_hmac_verification_failures_total | spacecom_hmac_failures |
| Gauges end in _seconds, _bytes, _ratio, or a domain unit | spacecom_celery_queue_depth | spacecom_queue |
| Histograms end in _seconds or _bytes | spacecom_alert_delivery_latency_seconds | spacecom_alert_latency |
| Labels use snake_case | queue_name, source | queueName, Source |
| High-cardinality fields are NEVER labels | — | norad_id, organisation_id, user_id, request_id as Prometheus labels |
| Per-object drill-down uses recording rules | spacecom:tle_age_hours:max recording rule | spacecom_tle_age_hours{norad_id="25544"} alerted directly |

High-cardinality identifiers belong in log fields (structlog) or Prometheus exemplars — not in metric labels. A metric with an unbounded label creates one time series per unique value and will OOM Prometheus at scale.
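This rule is mechanically checkable. A hypothetical guard for a unit test that introspects the metrics module (norad_id is deliberately excluded here because the per-object gauges in this section use it as a bounded, drill-down-only label):

```python
FORBIDDEN_LABEL_NAMES = {"organisation_id", "user_id", "request_id"}

def validate_metric_labels(metric_name: str, label_names: list[str]) -> None:
    """Reject metric definitions that use unbounded identifiers as labels;
    each unique value would create a new Prometheus time series."""
    bad = FORBIDDEN_LABEL_NAMES.intersection(label_names)
    if bad:
        raise ValueError(
            f"{metric_name}: high-cardinality label(s) {sorted(bad)} — "
            "use structlog fields or exemplars instead (naming convention F1)"
        )
```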

Business-level metrics (custom — most critical):

# Phase 1 — instrument from day 1
from prometheus_client import Counter, Gauge, Histogram

active_tip_events    = Gauge('spacecom_active_tip_events', 'Objects with active TIP messages')
prediction_age       = Gauge('spacecom_prediction_age_seconds', 'Age of latest prediction per object',
                           ['norad_id'])  # per-object label: Grafana drill-down only; alert via recording rule
tle_age              = Gauge('spacecom_tle_age_hours', 'TLE data age per object',
                           ['norad_id', 'source'])  # source label lets TipIngestStale select source="tip"
ingest_success       = Counter('spacecom_ingest_success_total', 'Successful ingest runs', ['source'])
ingest_failure       = Counter('spacecom_ingest_failure_total', 'Failed ingest runs', ['source'])
hmac_failures        = Counter('spacecom_hmac_verification_failures_total', 'HMAC check failures')
simulation_duration  = Histogram('spacecom_simulation_duration_seconds', 'MC run duration', ['module'],
                           buckets=[30, 60, 90, 120, 180, 240, 300, 600])
alert_delivery_lat   = Histogram('spacecom_alert_delivery_latency_seconds', 'Alert trigger → WS receipt',
                           buckets=[1, 2, 5, 10, 15, 20, 30, 60])
ws_connected         = Gauge('spacecom_ws_connected_clients', 'Active WebSocket connections', ['instance'])
celery_queue_depth   = Gauge('spacecom_celery_queue_depth', 'Tasks waiting in queue', ['queue'])
dlq_depth            = Gauge('spacecom_dlq_depth', 'Tasks in dead letter queue')
renderer_active_jobs = Gauge('renderer_active_jobs', 'Reports being generated')
renderer_job_dur     = Histogram('renderer_job_duration_seconds', 'Report generation time',
                           buckets=[2, 5, 10, 15, 20, 25, 30])
renderer_chromium_restarts = Counter('renderer_chromium_restarts_total', 'Chromium process restarts')

SLI recording rules — pre-aggregate before alerting; avoids per-object flooding (Finding 1, 7):

# monitoring/recording-rules.yml
groups:
  - name: spacecom_sli
    rules:
      # SLI: API availability (non-5xx fraction) — feeds availability SLO
      - record: spacecom:api_availability:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

      # SLI: max TLE age across all objects (single series; alertable without flooding)
      - record: spacecom:tle_age_hours:max
        expr: max(spacecom_tle_age_hours)

      # SLI: count of objects with stale TLEs (for dashboard)
      - record: spacecom:tle_stale_objects:count
        expr: count(spacecom_tle_age_hours > 6) or vector(0)

      # SLI: max prediction age across active TIP objects
      - record: spacecom:prediction_age_seconds:max
        expr: max(spacecom_prediction_age_seconds)

      # SLI: alert delivery latency p99
      - record: spacecom:alert_delivery_latency:p99_rate5m
        expr: histogram_quantile(0.99, rate(spacecom_alert_delivery_latency_seconds_bucket[5m]))

      # Error budget burn rate — multi-window (F2 — §57)
      - record: spacecom:error_budget_burn:rate1h
        expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[1h])

      - record: spacecom:error_budget_burn:rate6h
        expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[6h])

      # Fast-burn window (5 min) — catches sudden outages
      - record: spacecom:error_budget_burn:rate5m
        expr: 1 - spacecom:api_availability:ratio_rate5m

Alerting rules (Prometheus AlertManager):

# monitoring/alertmanager/spacecom-rules.yml
groups:
  - name: spacecom_critical
    rules:
      - alert: HmacVerificationFailure
        expr: increase(spacecom_hmac_verification_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "HMAC verification failure detected — prediction integrity compromised"
          runbook_url: "https://spacecom.internal/docs/runbooks/hmac-integrity-failure.md"

      - alert: TipIngestStale
        expr: spacecom_tle_age_hours{source="tip"} > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TIP data > 30 min old — active re-entry warning may be stale"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      - alert: ActiveTipNoPrediction
        # prediction age on the left so {{ $value }} renders the age, not the TIP count;
        # on() because the recording rule strips instance/job labels the raw gauge carries
        expr: spacecom:prediction_age_seconds:max > 3600 and on() spacecom_active_tip_events > 0
        labels:
          severity: critical
        annotations:
          summary: "Active TIP event but newest prediction is {{ $value | humanizeDuration }} old"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      # Fast burn: 1h + 5min windows (catches sudden outages quickly) — F2 §57
      - alert: ErrorBudgetFastBurn
        expr: >
          spacecom:error_budget_burn:rate1h > (14.4 * 0.001)
          and
          spacecom:error_budget_burn:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          burn_window: fast
        annotations:
          summary: "Error budget burning fast — 1h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

      # Slow burn: 6h + 1h windows (catches gradual degradation before budget exhausts) — F2 §57
      - alert: ErrorBudgetSlowBurn
        expr: >
          spacecom:error_budget_burn:rate6h > (6 * 0.001)
          and
          spacecom:error_budget_burn:rate1h > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          burn_window: slow
        annotations:
          summary: "Error budget burning slowly — 6h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

  - name: spacecom_warning
    rules:
      - alert: TleStale
        # Alert on recording rule aggregate — single alert, not 600 per-NORAD alerts
        expr: spacecom:tle_stale_objects:count > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} objects have TLE age > 6h"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: IngestConsecutiveFailures
        # increase() yields a failure count over the window, giving a direct >= 3
        # threshold; a rate() expression would give an unintuitive per-second value
        expr: increase(spacecom_ingest_failure_total[15m]) >= 3
        labels:
          severity: warning
        annotations:
          summary: "Ingest source {{ $labels.source }} failed ≥ 3 times in 15 min"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: CelerySimulationQueueDeep
        expr: spacecom_celery_queue_depth{queue="simulation"} > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue depth {{ $value }} — workers may be overwhelmed"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: DLQGrowing
        # delta(), not increase(): spacecom_dlq_depth is a gauge, not a counter
        expr: delta(spacecom_dlq_depth[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Dead letter queue growing — tasks exhausting retries"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: WebSocketCeilingApproaching
        expr: spacecom_ws_connected_clients > 400
        labels:
          severity: warning
        annotations:
          summary: "WS connections {{ $value }}/500 — scale backend before ceiling hit"
          runbook_url: "https://spacecom.internal/docs/runbooks/capacity-limits.md"

      # Queue depth growth rate alert — fires before threshold is breached (F8 — §57)
      - alert: CelerySimulationQueueGrowing
        # deriv(), not rate(): queue depth is a gauge; deriv() gives the per-second slope
        expr: deriv(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue growing at {{ $value | humanize }} tasks/sec — workers not keeping up"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: RendererChromiumUnresponsive
        expr: increase(renderer_chromium_restarts_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Renderer Chromium restarted — report generation may be delayed"
          runbook_url: "https://spacecom.internal/docs/runbooks/renderer-recovery.md"

Alert authoring rule (F11 — §57): Every AlertManager alert rule MUST include annotations.runbook_url pointing to an existing file in docs/runbooks/. CI lint step (make lint-alerts) validates this using promtool check rules plus a custom Python script that asserts every rule has a non-empty runbook_url annotation that resolves to an existing markdown file. A PR that adds an alert without a runbook fails CI.

Alert coverage audit (F5 — §57): The following table maps every SLO and safety invariant to its alert rule. Gaps must be closed before Phase 2.

| SLO / Safety invariant | Alert rule | Severity | Gap? |
|---|---|---|---|
| API availability 99.9% | ErrorBudgetFastBurn, ErrorBudgetSlowBurn | CRITICAL / WARNING | Covered |
| TLE age < 6h | TleStale | WARNING | Covered |
| TIP ingest freshness < 30 min | TipIngestStale | CRITICAL | Covered |
| Active TIP + prediction age > 1h | ActiveTipNoPrediction | CRITICAL | Covered |
| HMAC verification integrity | HmacVerificationFailure | CRITICAL | Covered |
| Ingest consecutive failures | IngestConsecutiveFailures | WARNING | Covered |
| Celery queue depth threshold | CelerySimulationQueueDeep | WARNING | Covered |
| Celery queue depth growth rate | CelerySimulationQueueGrowing | WARNING | Covered |
| DLQ depth > 0 | DLQGrowing | WARNING | Covered |
| WS connection ceiling approach | WebSocketCeilingApproaching | WARNING | Covered |
| Renderer Chromium crash | RendererChromiumUnresponsive | WARNING | Covered |
| EOP mirror disagreement | EopMirrorDisagreement | CRITICAL | Gap — add Phase 1 |
| DB replication lag > 30s | DbReplicationLagHigh | WARNING | Gap — add Phase 2 |
| Backup job failure | BackupJobFailed | CRITICAL | Gap — add Phase 1 |
| Security event anomaly | In security-rules.yml | CRITICAL | Covered |
| Alert HMAC integrity (nightly) | In security-rules.yml | CRITICAL | Covered |

Prometheus scrape configuration (monitoring/prometheus.yml):

scrape_configs:
  - job_name: backend
    static_configs:
      - targets: ['backend:8000']
    metrics_path: /metrics   # enabled by prometheus-fastapi-instrumentator

  - job_name: renderer
    static_configs:
      - targets: ['renderer:8001']
    metrics_path: /metrics

  - job_name: celery
    static_configs:
      - targets: ['celery-exporter:9808']   # celery-exporter sidecar

  - job_name: postgres
    static_configs:
      - targets: ['postgres-exporter:9187']  # postgres_exporter; also scrapes PgBouncer stats

  - job_name: redis
    static_configs:
      - targets: ['redis-exporter:9121']     # redis_exporter

Add to docker-compose.yml (Phase 2 service topology): postgres-exporter, redis-exporter, celery-exporter sidecar, loki, promtail, tempo (all on monitor_net). Add to requirements.in: prometheus-fastapi-instrumentator, structlog, opentelemetry-sdk, opentelemetry-instrumentation-fastapi, opentelemetry-instrumentation-sqlalchemy, opentelemetry-instrumentation-celery.

Distributed tracing — OpenTelemetry (Phase 2, ADR 0017):

# backend/app/main.py — instrument at startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.celery import CeleryInstrumentor
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
CeleryInstrumentor().instrument()

The trace_id from each span equals the request_id bound in structlog.contextvars (set by RequestIDMiddleware). This gives a single correlation key across Grafana Loki log search and Grafana Tempo trace view — one click from a log entry to its trace, and from a trace span to its log lines. Phase 1 fallback: set OTEL_SDK_DISABLED=true; no collector is needed and no spans are emitted, so log correlation relies on the explicit request_id propagation described below.

Celery trace propagation (F4 — §57): CeleryInstrumentor automatically propagates W3C traceparent headers through the Celery task message body. The trace started at POST /api/v1/decay/predict continues unbroken through the queue wait and into the worker execution. To verify propagation is working:

# tests/integration/test_tracing.py
import uuid

def test_celery_trace_propagation(client):
    """Trace started in the HTTP handler must appear in the Celery worker span."""
    response = client.post("/api/v1/decay/predict", ...)
    task_id = response.json()["job_id"]
    # Poll until the task completes, then assert the worker span carries the
    # same 128-bit trace_id as the X-Request-ID issued by the middleware.
    span = get_span_by_task_id(task_id)  # test helper: reads the in-memory span exporter
    assert span.context.trace_id == uuid.UUID(response.headers["X-Request-ID"]).int

Additionally, request_id must be passed explicitly in Celery task kwargs as a belt-and-suspenders fallback for Phase 1 when OTel is disabled (OTEL_SDK_DISABLED=true). The worker binds it via structlog.contextvars.bind_contextvars(request_id=kwargs["request_id"]). This ensures log correlation works in Phase 1 without a running Tempo instance.
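A sketch of the belt-and-suspenders pattern described above, using a stdlib ContextVar to stand in for structlog.contextvars (function names are illustrative; the real worker calls structlog.contextvars.bind_contextvars):

```python
from contextvars import ContextVar

# Stand-in for the structlog contextvars binding in this sketch
request_id_var: ContextVar = ContextVar("request_id", default=None)

def enqueue_prediction(object_id: int, request_id: str) -> dict:
    # HTTP handler side: pass request_id explicitly in the task kwargs so the
    # worker can correlate logs even with OTEL_SDK_DISABLED=true.
    return {"object_id": object_id, "request_id": request_id}

def run_prediction_task(**kwargs) -> str:
    # Worker side: bind the propagated request_id before the first log call.
    request_id_var.set(kwargs["request_id"])
    return f"predicting object {kwargs['object_id']} (request {request_id_var.get()})"
```

With this in place, every worker log line carries the originating request_id regardless of whether the OTel SDK is active.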

Chord sub-task and callback trace propagation (F11 — §67): CeleryInstrumentor propagates traceparent through individual task messages. For the MC chord pattern (a group of sub-tasks followed by a callback), trace context propagation must flow: FastAPI handler → run_mc_decay_prediction → 500× run_single_trajectory sub-tasks → aggregate_mc_results callback. Each hop in the chord must carry the same trace_id to enable end-to-end p95 latency attribution.

CeleryInstrumentor handles single task propagation automatically. For chord callbacks, verify that the parent trace_id appears in the aggregate_mc_results span — if the span is orphaned (different trace_id), set the trace context explicitly in the chord header:

from opentelemetry import propagate, context

def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    carrier = {}
    propagate.inject(carrier)  # inject current trace context
    params['_trace_context'] = carrier  # pass through chord params
    ...

def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    ctx = propagate.extract(params.get('_trace_context', {}))
    token = context.attach(ctx)  # re-attach parent trace context in callback
    try:
        ...  # callback body
    finally:
        context.detach(token)

This ensures the Tempo waterfall for an MC prediction shows one continuous trace from HTTP request through all 500 sub-tasks to DB write, enabling per-prediction p95 breakdown.

Celery queue depth Beat task (updates celery_queue_depth and dlq_depth every 30s):

@app.task
def update_queue_depth_metrics():
    # NOTE: list keys must match the broker's actual queue naming — with the
    # default Redis broker the list key is the bare queue name; the 'celery:'
    # prefix here assumes a configured broker key prefix.
    for queue_name in ['ingest', 'simulation', 'default']:
        depth = redis_client.llen(f'celery:{queue_name}')
        celery_queue_depth.labels(queue=queue_name).set(depth)
    dlq_depth.set(redis_client.llen('dlq:failed_tasks'))
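The 30-second cadence is registered via Celery Beat. A minimal beat_schedule entry might look like the following — the task's module path is an assumption:

```python
# Celery app configuration excerpt — schedules the metrics task every 30s
# to match the dashboard refresh cadence.
beat_schedule = {
    "update-queue-depth-metrics": {
        "task": "app.tasks.monitoring.update_queue_depth_metrics",  # assumed path
        "schedule": 30.0,  # seconds
    },
}
```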

Four Grafana dashboards (updated from three):

  1. Operational Overview — primary on-call dashboard (F7 — §57): an on-call engineer must be able to answer "is the system healthy?" within 15 seconds of opening this dashboard. Panel order and layout are therefore mandated:

    | Row | Panel | Metric | Alert threshold shown |
    |-----|-------|--------|-----------------------|
    | 1 (top) | Active TIP events (stat) | spacecom_active_tip_events | Red if > 0 |
    | 1 | System status (state timeline) | All alert rule states | Any CRITICAL = red bar |
    | 2 | Ingest freshness per source (gauge) | spacecom_tle_age_hours per source | Yellow > 2h, Red > 6h |
    | 2 | Prediction age — active objects (gauge) | spacecom:prediction_age_seconds:max | Red > 3600s |
    | 3 | Error budget burn rate (time series) | spacecom:error_budget_burn:rate1h | Reference line at 14.4× |
    | 3 | Alert delivery latency p99 (stat) | spacecom:alert_delivery_latency:p99_rate5m | Red > 30s |
    | 4 | Celery queue depth (time series) | spacecom_celery_queue_depth per queue | Reference line at 20 |
    | 4 | DLQ depth (stat) | spacecom_dlq_depth | Red if > 0 |

    Rows 1 and 2 must be visible without scrolling on a 1080p monitor. The dashboard UID is pinned in the AlertManager dashboard_url annotations.

  2. System Health: DB replication lag, Redis memory, container CPU/RAM, error rates by endpoint, renderer job duration

  3. SLO Burn Rate: error budget consumption rate from recording rules, fast/slow burn rates, availability by SLO, latency percentiles vs. targets, WS delivery latency p99

  4. Tracing (Phase 2, Grafana Tempo): per-request traces for decay prediction and CZML catalog; p95 span breakdown by service


26.8 Incident Response

On-Call Rotation and Escalation

| Tier | Responder | Response SLA | Escalation trigger |
|------|-----------|--------------|--------------------|
| L1 On-call | Rotating engineer (weekly rotation) | 5 min (SEV-1) / 15 min (SEV-2) | Auto-escalate to L2 if no acknowledgement after SLA |
| L2 Escalation | Tech lead / senior engineer | 10 min (SEV-1) | Auto-escalate to L3 after 10 min |
| L3 Incident commander | Engineering or product lead | SEV-1 only | Manual phone call; no auto-escalation |

AlertManager routing:

# monitoring/alertmanager/routing.yml
route:
  receiver: slack-ops-channel
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: {severity: critical}
      receiver: pagerduty-l1
      continue: true   # also send to Slack
    - match: {severity: warning}
      receiver: slack-ops-channel

On-call guide: docs/runbooks/on-call-guide.md — required Phase 2 deliverable. Must cover: rotation schedule, handover checklist, escalation contact list, how to acknowledge PagerDuty alerts, Grafana dashboard URLs, and the "active TIP event protocol" (escalate all SEV-2+ to SEV-1 automatically when spacecom_active_tip_events > 0).

On-call rotation spec (F5):

  • 7-day rotation; minimum 2 engineers in the pool before going on-call
  • L1 → L2 escalation if incident not contained within 30 minutes of L1 acknowledgement
  • L2 → L3 escalation triggers: ANSP data affected; confirmed security breach; total outage > 15 minutes; regulatory notification obligation triggered (NIS2 24h, GDPR 72h)
  • On-call handoff: At rotation boundary, outgoing on-call documents system state in docs/runbooks/on-call-handoff-log.md: active incidents, degraded services, pending maintenance, known risks. Incoming on-call acknowledges in the same log. Mirrors the operator /handover concept (§28.5a) applied to engineering shifts.

ANSP communication commitments per severity (F6):

| Severity | ANSP notification timing | Channel | Update cadence |
|----------|--------------------------|---------|----------------|
| SEV-1 (active TIP event) | Within 5 minutes of detection | Push + email | Every 15 minutes until resolved |
| SEV-1 (no active event) | Within 15 minutes | Email | Every 30 minutes until resolved |
| SEV-2 | Within 30 minutes if prediction data affected | Email | On resolution |
| SEV-3/4 | Status page update only | Status page | On resolution |

Resolution notification always includes: what was affected, duration, root cause summary (1 sentence), and confirmation that prediction integrity was verified post-incident.

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV-1 | System unavailable or prediction integrity compromised during active TIP event | 5 minutes | DB down with TIP window open; HMAC failure on active prediction |
| SEV-2 | Core functionality broken; no active TIP event | 15 minutes | Workers down; ingest stopped > 2h; Redis down |
| SEV-3 | Degraded functionality; operational but impaired | 60 minutes | TLE stale > 6h; space weather stale; slow CZML > 5s p95 |
| SEV-4 | Minor; no operational impact | Next business day | UI cosmetic; log noise; non-critical test failure |

Runbook Standard Structure (F9)

Every runbook in docs/runbooks/ must follow this template. Inconsistent runbooks written under incident pressure are a leading cause of missed steps and extended resolution times.

# Runbook: {Title}

**Owner:** {team or role}
**Last tested:** {YYYY-MM-DD} (game day or real incident)
**Severity scope:** SEV-1 | SEV-2 | SEV-3 (as applicable)

## Triggers
<!-- What conditions cause this runbook to be invoked? Alert name, symptom, or explicit escalation. -->

## Immediate actions (first 5 minutes)
<!-- Numbered steps. Each step must be independently executable. No "investigate" — specific commands only. -->
1.
2.

## Diagnosis
<!-- How to confirm the root cause before taking corrective action. -->

## Resolution steps
<!-- Numbered. Each step: what to do, expected output, what to do if the expected output is NOT seen. -->
1.
2.

## Verification
<!-- How to confirm the incident is resolved. Specific health check commands or metrics to inspect. -->

## Escalation
<!-- If unresolved after N minutes: who to page, what information to have ready. -->

## Post-incident
<!-- Mandatory PIR? Log entry required? Notification required? -->

All runbooks are reviewed and updated after each game day or real incident in which they were used. The Last tested field must not be older than 12 months — a CI check (make runbook-audit) warns if any runbook has not been updated within that window.
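The make runbook-audit check can be implemented as a small script that parses the template's Last tested field; a sketch (the helper name and 365-day threshold interpretation of "12 months" are assumptions):

```python
import re
from datetime import date, timedelta

# Matches the "**Last tested:** YYYY-MM-DD" field from the runbook template
LAST_TESTED_RE = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")

def runbook_is_stale(markdown: str, today: date, max_age_days: int = 365) -> bool:
    """Return True if the Last tested field is missing or older than 12 months."""
    m = LAST_TESTED_RE.search(markdown)
    if m is None:
        return True  # a missing field counts as stale
    tested = date.fromisoformat(m.group(1))
    return today - tested > timedelta(days=max_age_days)

assert runbook_is_stale("**Last tested:** 2024-01-01", date(2026, 1, 1))
assert not runbook_is_stale("**Last tested:** 2025-12-01", date(2026, 1, 1))
```

In CI the script would iterate over docs/runbooks/*.md and emit a warning line per stale file.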

Required Runbooks (Phase 2 deliverable)

Each runbook is a step-by-step operational procedure, not a general guide:

| Runbook | Key Steps |
|---------|-----------|
| DB failover | Confirm primary down → Patroni status → manual failover if Patroni stuck → verify standby promoting → update connection strings → verify HMAC validation working on new primary |
| Celery worker recovery | Check queue depth → inspect dead letter queue → restart worker containers → verify simulation jobs resuming → check ingest worker catching up |
| HMAC integrity failure | Identify affected prediction ID → quarantine record (integrity_failed = TRUE) → notify affected ANSP users → investigate modification source → escalate to security incident if tampering confirmed |
| TIP ingest failure | Check Space-Track API status → verify credentials not expired → check outbound network → manual TIP fetch if automated ingest blocked → notify operators of manual TIP status |
| Ingest pipeline staleness | Check Celery Beat health (redbeat lock status) → check worker queue → inspect ingest failure counter in Prometheus → trigger manual ingest job → notify operators of staleness |
| GDPR personal data breach | Contain breach (revoke credentials, isolate affected service) → assess scope (which data, how many data subjects, which jurisdictions) → notify legal counsel within 4 hours → if EU/UK data subjects affected: notify supervisory authority within 72 hours of discovery; notify affected data subjects "without undue delay" if high risk → log in security_logs with type DATA_BREACH → document remediation |
| Safety occurrence notification | If a SpaceCom integrity failure (HMAC fail, data source outage, incorrect prediction) is identified during a period when an ANSP was actively managing a re-entry event: notify affected ANSP within 2 hours → create security_logs record with type SAFETY_OCCURRENCE → notify legal counsel before any external communications → preserve all prediction records, alert_events, and ingest logs from the relevant period (do not rotate or archive). Full procedure: docs/runbooks/safety-occurrence.md — see §26.8a below. |
| Prediction service outage during active re-entry event (F3) | Detect via spacecom_active_tip_events > 0 + prediction API health check fail → immediate ANSP push notification + email within 5 minutes ("SpaceCom prediction service is unavailable. Activate your fallback procedure: consult Space-Track TIP messages directly and ESOC re-entry page.") → designate incident commander → communication cadence every 15 minutes until resolved → service restoration checklist: restore prediction API → verify HMAC integrity on latest predictions → notify ANSPs of restoration with prediction freshness timestamp → trigger PIR. Full procedure: docs/runbooks/prediction-service-outage-during-active-event.md |

§26.8a Safety Occurrence Reporting Procedure (F4 — §61)

A safety occurrence is any event or condition in which a SpaceCom error may have contributed to, or could have contributed to, a reduction in aviation safety. This is distinct from an operational incident (which is defined by system availability/performance). Safety occurrences require a different response chain that includes regulatory and legal notification.

Trigger conditions:

  • HMAC integrity failure on any prediction that was served to an ANSP operator during an active TIP event
  • A confirmed incorrect prediction (false positive or false negative) where the ANSP was managing airspace based on SpaceCom outputs
  • Data staleness in excess of the operational threshold (TLE > 6h old) during an active re-entry event window without degradation notification having been sent
  • Any SpaceCom system failure during which an ANSP continued operational use without receiving a degradation notification

Response procedure (docs/runbooks/safety-occurrence.md):

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Detect and classify: confirm the occurrence meets trigger criteria; assign SAFETY_OCCURRENCE vs. standard incident | On-call engineer | Within 30 min of detection |
| 2 | Preserve evidence: set do_not_archive = TRUE on all affected prediction records, alert_events, and ingest logs; export to MinIO safety archive | On-call engineer | Within 1 hour |
| 3 | Internal escalation: notify incident commander + legal counsel; do NOT communicate externally until legal counsel is engaged | Incident commander | Within 1 hour |
| 4 | ANSP notification: contact affected ANSP primary contact and safety manager using the safety occurrence notification template (not the standard incident template); include what happened, what data was affected, what the ANSP should do in response | Incident commander + legal counsel review | Within 2 hours |
| 5 | Log: create security_logs record with type = 'SAFETY_OCCURRENCE'; include ANSP ID, affected prediction IDs, notification timestamp, and legal counsel name | On-call engineer | Same session |
| 6 | ANSP SMS obligation: inform the ANSP in writing that they may have an obligation to report this occurrence to their safety regulator under their SMS; SpaceCom cannot make this determination for the ANSP | Legal counsel | Within 24 hours |
| 7 | PIR: conduct a safety-occurrence-specific post-incident review (same structure as §26.8 PIR but with additional sections: regulatory notification status, hazard log update required?) | Engineering lead | Within 5 business days |
| 8 | Hazard log update: if the occurrence reveals a new hazard or changes the likelihood/severity of an existing hazard, update docs/safety/HAZARD_LOG.md and trigger a safety case review | Safety case custodian | Within 10 business days |

Safety occurrence log table:

-- Add to security_logs or create a dedicated table
CREATE TABLE safety_occurrences (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    occurred_at         TIMESTAMPTZ NOT NULL,
    detected_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    org_ids             UUID[] NOT NULL,                          -- affected ANSPs
    trigger_type        TEXT NOT NULL,                            -- 'HMAC_FAILURE', 'INCORRECT_PREDICTION', 'STALE_DATA', 'SILENT_FAILURE'
    affected_predictions UUID[] NOT NULL DEFAULT '{}',
    evidence_archived   BOOLEAN NOT NULL DEFAULT FALSE,
    ansp_notified_at    TIMESTAMPTZ,
    legal_notified_at   TIMESTAMPTZ,
    hazard_log_updated  BOOLEAN NOT NULL DEFAULT FALSE,
    pir_completed_at    TIMESTAMPTZ,
    notes               TEXT
);

What is NOT a safety occurrence (to avoid over-classification):

  • Standard availability incidents with degradation notification sent promptly
  • Cosmetic UI errors not in the alert/prediction path
  • Prediction updates that change values within stated uncertainty bounds
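The evidence-preservation step (step 2 of the response procedure) can be sketched in SQL. The do_not_archive columns on predictions and alert_events are assumed to exist per the preservation requirement; only safety_occurrences is defined above:

```sql
-- Step 2: flag affected records so retention/rotation jobs skip them
UPDATE predictions  SET do_not_archive = TRUE WHERE id = ANY(:affected_ids);
UPDATE alert_events SET do_not_archive = TRUE WHERE prediction_id = ANY(:affected_ids);

-- Step 5: record the occurrence itself
INSERT INTO safety_occurrences (occurred_at, org_ids, trigger_type, affected_predictions)
VALUES (:occurred_at, :org_ids, 'HMAC_FAILURE', :affected_ids);
```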

ANSP Communication Plan

When SpaceCom is degraded during an active TIP event, operators must be notified immediately through a defined channel:

  • WebSocket push (if connected): automatic via the degraded-mode notification (§24.8)
  • Email fallback: automated email to all operator role users with active sessions within the last 24h, identifying the degradation type and estimated resolution
  • Documented fallback: every SpaceCom user onboarding includes the fallback procedure: "In the absence of SpaceCom, consult Space-Track TIP messages directly at space-track.org and coordinate with your national space surveillance authority per existing procedures"

Incident communication templates (F10): Pre-drafted templates in docs/runbooks/incident-comms-templates.md — reviewed by legal counsel before first use. On-call engineers must use these templates verbatim; deviations require incident commander approval. Templates cover:

  1. Initial notification (< 5 minutes): impact, what we know, what we are doing, next update time
  2. 15-minute update: progress, updated ETA if known, revised fallback guidance if needed
  3. Resolution notification: confirmed restoration, prediction integrity verified, brief root cause (one sentence), PIR date
  4. Post-incident summary (within 5 business days): full timeline, root cause, remediations implemented

What never appears in templates: speculation about cause before the root cause is confirmed; estimated recovery time until known with confidence; any admission of negligence or legal liability.

Post-Incident Review Process (F8)

Mandatory for all SEV-1 and SEV-2 incidents. PIR due within 5 business days of resolution.

PIR document structure (docs/post-incident-reviews/YYYY-MM-DD-{slug}.md):

  1. Incident summary — what happened, when, duration, severity
  2. Timeline — minute-by-minute from first alert to resolution
  3. Root cause — using 5-whys methodology; stop when a process or system gap is identified
  4. Contributing factors — what made the impact worse or detection slower
  5. Impact — users/ANSPs affected; data at risk; SLO breach duration
  6. Remediation actions — each with owner, GitHub issue link, and deadline; tracked with incident-remediation label
  7. What went well — to reinforce effective practices

PIR presented at the next engineering all-hands. Remediation actions are P2 priority — no new feature work by the responsible engineer until overdue remediations are closed.

Chaos Engineering / Game Day Programme (F4)

Quarterly game day; scenarios rotated so each is tested at least annually. Document in docs/runbooks/game-day-scenarios.md.

Minimum scenario set:

| # | Scenario | Expected behaviour | Pass criterion |
|---|----------|--------------------|----------------|
| 1 | PostgreSQL primary killed | Patroni promotes standby; API recovers within RTO | API returns 200 within 15 minutes; no data loss |
| 2 | Celery worker crash during active MC simulation | Job moves to DLQ; orphan recovery task re-queues; operator sees FAILED state | Job visible in DLQ within 2 minutes; re-queue succeeds |
| 3 | Space-Track ingest unavailable 6 hours | Staleness degraded mode activates; operators notified; predictions greyed | Staleness alert fires within 15 minutes of ingest stop |
| 4 | Redis failure | Sessions expire gracefully; WebSocket reconnects; no silent data loss | Users see "session expired" prompt; no 500 errors |
| 5 | Full prediction service restart during active CRITICAL alert | Alert state preserved in DB; re-subscribing WebSocket clients receive current state | No alert acknowledgement lost; reconnection < 30 seconds |
| 6 | Full region failover (annually) | DNS fails over to DR region; prediction API resumes | Recovery within RTO; HMAC verification passes on new primary |

Each scenario: defined inject → observe → record actual behaviour → pass/fail vs. criterion → remediation window 2 weeks. Any scenario fail is treated as a SEV-2 incident with a PIR.

Operational vs. Security Incident Runbooks (F11)

Operational and security incidents have different response teams, communication obligations, and legal constraints:

| Dimension | Operational incident | Security incident |
|-----------|----------------------|-------------------|
| Primary responder | On-call engineer | On-call engineer + DPO within 4h |
| Communication | Status page + ANSP email | No public status page until legal counsel approves |
| Regulatory obligation | SLA breach notification (MSA) | NIS2 24h early warning; GDPR 72h (if personal data) |
| Evidence preservation | Normal log retention | Immediate log freeze; do not rotate or archive |

Separate runbooks:

  • docs/runbooks/operational-incident-response.md — standard on-call playbook
  • docs/runbooks/security-incident-response.md — invokes DPO, legal counsel, NIS2/GDPR timelines; references §29.6 notification obligations

26.9 Deployment Strategy

Zero-Downtime Deployment (Blue-Green)

The TLS-terminating Caddy instance routes between blue (current) and green (new) backend instances:

Client → Caddy → [Blue backend] (current)
                → [Green backend] (new — deployed but not yet receiving traffic)

Docker Compose implementation for Tier 2 (single-host):

Docker Compose service names are fixed, so blue and green run as two separate Compose project instances. The deploy script at scripts/blue-green-deploy.sh manages the cutover:

#!/usr/bin/env bash
# scripts/blue-green-deploy.sh
set -euo pipefail

NEW_IMAGE="${1:?Usage: blue-green-deploy.sh <image-tag>}"
COMPOSE_FILE="docker-compose.yml"
BLUE_PROJECT="spacecom-blue"
GREEN_PROJECT="spacecom-green"

# 1. Determine which colour is currently active
ACTIVE=$(cat /opt/spacecom/.active-colour 2>/dev/null || echo "blue")
if [[ "$ACTIVE" == "blue" ]]; then NEXT="green"; else NEXT="blue"; fi
ACTIVE_PROJECT=$( [[ $ACTIVE == green ]] && echo "$GREEN_PROJECT" || echo "$BLUE_PROJECT" )
NEXT_PROJECT=$( [[ $NEXT == green ]] && echo "$GREEN_PROJECT" || echo "$BLUE_PROJECT" )

# 2. Start next-colour project with new image
SPACECOM_BACKEND_IMAGE="$NEW_IMAGE" \
  docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" up -d backend

# 3. Wait for next-colour healthcheck (up to 60s)
for i in $(seq 1 12); do
  if docker compose -p "$NEXT_PROJECT" exec -T backend \
       curl -sf http://localhost:8000/healthz; then
    break
  fi
  if [[ $i -eq 12 ]]; then echo "Health check failed — aborting"; exit 1; fi
  sleep 5
done

# 4. Run smoke tests against next-colour directly (green exposed on 8001, blue on 8000)
SMOKE_TARGET="http://localhost:$( [[ $NEXT == green ]] && echo 8001 || echo 8000 )" \
  python scripts/smoke-test.py || { echo "Smoke tests failed — aborting"; exit 1; }

# 5. Shift Caddy upstream to next colour (atomic file swap + reload)
echo "{ \"upstream\": \"backend-$NEXT:8000\" }" > /opt/spacecom/caddy-upstream.json
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile

echo "$NEXT" > /opt/spacecom/.active-colour
echo "✓ Traffic shifted to $NEXT. Monitoring for 5 minutes..."
sleep 300

# 6. Verify availability via Prometheus (optional gate)
ERROR_RATE=$(curl -s "http://localhost:9090/api/v1/query?query=spacecom:api_availability:ratio_rate5m" \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE < 0.99" | bc -l) )); then
  echo "Availability $ERROR_RATE < 0.99 — rolling back"
  # Swap back to the previously active colour
  echo "{ \"upstream\": \"backend-$ACTIVE:8000\" }" > /opt/spacecom/caddy-upstream.json
  docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
  echo "$ACTIVE" > /opt/spacecom/.active-colour
  exit 1
fi

# 7. Decommission old colour
docker compose -p "$ACTIVE_PROJECT" stop backend \
  && docker compose -p "$ACTIVE_PROJECT" rm -f backend
echo "✓ Blue-green deploy complete. Active: $NEXT"

Caddy upstream configuration — Caddy reads a JSON file that the deploy script rewrites atomically:

# /etc/caddy/Caddyfile
reverse_proxy {
  dynamic file /opt/spacecom/caddy-upstream.json
  lb_policy first
  health_uri /healthz
  health_interval 5s
}

WebSocket long-lived connection timeout configuration (F11 — §63): HTTP reverse proxies have default idle timeouts that silently terminate long-lived WebSocket connections. Caddy's default HTTP server idle timeout is governed by idle_timeout (default: 5 minutes); many cloud load balancers default to 60 seconds. A WebSocket with no traffic for this period is silently closed by the proxy — the FastAPI server and client may not detect this for minutes, creating a "ghost connection" that is alive at the socket level but dead at the application level.

Required Caddyfile additions for WebSocket paths:

# /etc/caddy/Caddyfile
{
  servers {
    timeouts {
      idle_timeout 0  # disable idle timeout globally — WS connections can be silent for extended periods
    }
  }
}

spacecom.io {
  # WebSocket endpoints: no idle timeout, no read timeout
  @websockets {
    path /ws/*
    header Connection *Upgrade*
    header Upgrade websocket
  }
  handle @websockets {
    reverse_proxy backend:8000 {
      transport http {
        read_timeout  0      # no read timeout — WS connection can be idle
        write_timeout 0      # no write timeout — WS send can be slow on poor networks
      }
      flush_interval -1      # immediate flush; do not buffer WS frames
    }
  }

  # Non-WebSocket paths: retain normal timeouts
  handle {
    reverse_proxy backend:8000 {
      transport http {
        read_timeout  30s
        write_timeout 30s
      }
    }
  }
}

Ping-pong interval must be less than proxy idle timeout: The FastAPI WebSocket handler sends a ping every WS_PING_INTERVAL_SECONDS (default: 30s). With idle_timeout 0 in Caddy, this prevents proxy-side termination. If running behind a cloud load balancer with a fixed idle timeout, the ping interval must be set to (load_balancer_idle_timeout - 10s) — documented in docs/runbooks/websocket-proxy-config.md.

Rollback: scripts/blue-green-rollback.sh — resets /opt/spacecom/caddy-upstream.json to the previous colour and reloads Caddy. Rollback completes in < 5 seconds (no container restart required).

Deployment sequence:

  1. Deploy green backend alongside blue (both running)
  2. Run smoke tests against green directly (X-Deploy-Target: green header)
  3. Shift 10% of traffic to green (canary); monitor error rate for 5 minutes
  4. If clean: shift 100% to green; keep blue running for 10 minutes
  5. If error spike: shift 0% back to blue instantly (< 5s rollback via blue-green-rollback.sh)
  6. Decommission blue after 10 minutes of clean green operation

Alembic Migration Safety Policy

Every database migration must be backwards-compatible with the previous application version. Required sequence for any schema change:

  1. Migration only: deploy migration; verify old app still functions with new schema (additive changes only — new nullable columns, new tables, new indexes)
  2. Application deploy: deploy new application version that uses the new schema
  3. Cleanup migration (if needed): remove old columns/constraints after old app version is fully retired

Never: rename a column, change a column type, or drop a column in a single migration that deploys simultaneously with the application change.
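The three-step expand/contract sequence above, sketched as SQL for a hypothetical corridor_version column (the column and table names are illustrative):

```sql
-- Deploy N, step 1 (migration only): additive, nullable — old app keeps working
ALTER TABLE predictions ADD COLUMN corridor_version TEXT;  -- nullable, no default

-- Deploy N+1, step 2 (application deploy): new app writes corridor_version;
-- backfill old rows in batches between deploys
UPDATE predictions SET corridor_version = 'v1' WHERE id BETWEEN 1 AND 100000;

-- Deploy N+2, step 3 (cleanup migration, after deploy N is fully retired)
ALTER TABLE predictions ALTER COLUMN corridor_version SET NOT NULL;
```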

Hypertable-specific migration rules:

  • Avoid blocking index builds: standard CREATE INDEX blocks all reads and writes for the duration. On plain tables, always use CREATE INDEX CONCURRENTLY. Note that TimescaleDB hypertables do not support CONCURRENTLY; there, use CREATE INDEX ... WITH (timescaledb.transaction_per_chunk) so only one chunk is locked at a time — safe during live ingest.
  • Never add a column with a non-null default to a populated hypertable in a single migration. Required sequence: (1) add nullable column, (2) backfill in batches with UPDATE ... WHERE id BETWEEN x AND y, (3) add NOT NULL constraint in a separate deployment.
  • Test every migration against a production-sized data copy before applying to production. Record the measured execution time in the migration file header comment: # Execution time on 10M-row orbits table: 45s.
  • Set a CI migration timeout gate: if a migration runs > 30 seconds against the test dataset, it must be reviewed by a senior engineer before merge.

TIP Event Deployment Freeze

No deployments permitted when a CRITICAL or HIGH alert is active for any tracked object. Enforced by a CI/CD gate:

# Pre-deploy gate script, invoked by the CI deploy job
import requests

class DeploymentBlocked(Exception):
    """Raised to fail the CI job when active alerts block deployment."""

def check_deployment_gate():
    response = requests.get(f"{API_URL}/api/v1/alerts?level=CRITICAL,HIGH&active=true",
                            headers={"X-Deploy-Check": settings.deploy_check_secret})
    response.raise_for_status()
    active = response.json()["total"]
    if active > 0:
        raise DeploymentBlocked(
            f"{active} active CRITICAL/HIGH alerts. Deployment blocked until events resolve."
        )

The deploy check secret is a read-only service credential — it cannot acknowledge alerts or modify data.

CI/CD Pipeline Specification

GitLab CI pipeline jobs (.gitlab-ci.yml):

| Job | Trigger | Steps | Failure behaviour |
|-----|---------|-------|-------------------|
| lint | All pushes + PRs | pre-commit run --all-files (detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff) | Blocks merge |
| test-backend | All pushes + PRs | pytest --cov --cov-fail-under=80; alembic check (model/migration divergence) | Blocks merge |
| test-frontend | All pushes + PRs | vitest run; playwright test | Blocks merge |
| security-scan | All pushes + PRs | bandit -r backend/; pip-audit --require backend/requirements.txt; npm audit --audit-level=high (frontend); eslint --plugin security; trivy image on built images (.trivyignore applied); pip-licenses + license-checker-rseidelsohn gate; .secrets.baseline currency check | Blocks merge on High/Critical |
| build-and-push | Merge to main or release/* | Multi-stage docker build; docker push ghcr.io/spacecom/<service>:sha-<commit> via OIDC; cosign sign all images; syft SPDX-JSON SBOM generated and attached as cosign attest; pip-licenses --format=json + license-checker-rseidelsohn --json manifests merged into SBOM and uploaded as workflow artifact (365-day retention); docs/compliance/sbom/ updated with versioned SBOM artefact | Blocks deploy |
| deploy-staging | After build-and-push on main | Docker Compose update on staging host; smoke tests | Blocks production deploy gate |
| deploy-production | Manual approval after deploy-staging passes | check_deployment_gate() (no active CRITICAL/HIGH alerts); blue-green deploy | Manual |

Image tagging convention:

  • sha-<commit> — immutable canonical tag; always pushed
  • v<major>.<minor>.<patch> — release alias pushed on tagged commits
  • latest — never pushed; forbidden in production Compose files (CI grep check enforces this)
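The "no latest in production Compose files" grep check can be sketched as a small shell function (the script path and function name are illustrative):

```shell
# Hypothetical ci/check-image-tags.sh helper
check_no_latest() {
  # Succeeds only if none of the given Compose files pin an image to :latest;
  # grep prints any offending line with file and line number.
  ! grep -HnE 'image:.*:latest' "$@"
}

# Example: a sha-pinned file passes, a :latest file fails
printf 'image: ghcr.io/spacecom/backend:sha-abc123\n' > /tmp/ok.yml
printf 'image: ghcr.io/spacecom/backend:latest\n' > /tmp/bad.yml
check_no_latest /tmp/ok.yml && echo "ok.yml passes"
check_no_latest /tmp/bad.yml || echo "bad.yml blocked"
```

In CI the function would be called with every docker-compose*.yml in the repository and a non-zero exit fails the job.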

Build cache strategy:

# .github/workflows/ci.yml (build-and-push job excerpt)
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}   # OIDC — no stored secret
- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: ghcr.io/spacecom/backend:sha-${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/spacecom/backend:buildcache
    cache-to: type=registry,ref=ghcr.io/spacecom/backend:buildcache,mode=max

pip and Next.js build caches use actions/cache keyed on the corresponding requirements/lock file hash:

- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f  # v4.0.2
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('backend/requirements.txt') }}
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f  # v4.0.2
  with:
    path: frontend/.next/cache
    key: npm-${{ hashFiles('frontend/package-lock.json') }}

cosign image signing and SBOM attestation (added after each docker push):

# .github/workflows/ci.yml — build-and-push job (after docker push steps)
- uses: sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20  # v3.5.0

- name: Sign all service images with cosign (keyless, OIDC)
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      cosign sign --yes \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- name: Generate SBOM and attach as cosign attestation
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      syft ghcr.io/spacecom/${svc}:sha-${{ github.sha }} \
        -o spdx-json=sbom-${svc}.spdx.json
      # Validate non-empty
      jq -e '.packages | length > 0' sbom-${svc}.spdx.json
      cosign attest --yes \
        --predicate sbom-${svc}.spdx.json \
        --type spdxjson \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4
  with:
    name: sbom-${{ github.sha }}
    path: "*.spdx.json"
    retention-days: 365   # ESA bid artefacts; ECSS minimum 1 year

- name: Verify signature before deploy (deploy jobs only)
  if: github.event_name == 'workflow_dispatch'
  run: |
    cosign verify ghcr.io/spacecom/backend:sha-${{ github.sha }} \
      --certificate-identity-regexp="https://github.com/spacecom/spacecom/.*" \
      --certificate-oidc-issuer="https://token.actions.githubusercontent.com"

All GitHub Actions pinned by commit SHA (mutable @vN tags allow tag-repointing attacks that exfiltrate all workflow secrets):

# Correct form — all third-party actions in .github/workflows/*.yml:
- uses: docker/setup-buildx-action@4fd812986e6c8c2a69e18311145f9371337f27d  # v3.4.0
- uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567    # v3.3.0
- uses: docker/build-push-action@1a162644f9a7e87d8f4b053101d1d9a712edc18c # v6.3.0
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683        # v4.2.2
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f             # v4.0.2
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08  # v4.3.4

CI lint check enforces no mutable tags remain:

if grep -rE 'uses: [^@]+@v[0-9]' .github/workflows/; then
  echo "ERROR: Actions must be pinned by commit SHA, not tag"; exit 1
fi

Use pinact or Renovate's github-actions manager to automate SHA updates.
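A minimal Renovate configuration for this — a sketch assuming Renovate's built-in presets (`helpers:pinGitHubActionDigests` pins actions to digests); the schedule is illustrative:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": [
    "config:recommended",
    "helpers:pinGitHubActionDigests"
  ],
  "github-actions": {
    "schedule": ["before 6am on monday"]
  }
}
```

Renovate then raises PRs that bump the commit SHA and update the trailing `# vN` comment together, so the lint check above stays green.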

Local Development Environment

First-time setup (target: working stack in ≤ 15 minutes from clean clone):

git clone https://github.com/spacecom/spacecom && cd spacecom
cp .env.example .env          # fill in Space-Track credentials only; all others have safe defaults
pip install pre-commit && pre-commit install
make dev                      # starts full stack with hot-reload
make seed                     # loads test objects, FIRs, and synthetic TIP events
# → Open http://localhost:3000; globe shows 10 test objects

make targets:

| Target | What it does |
| --- | --- |
| `make dev` | `docker compose up` with `./backend` and `./frontend/src` bind-mounted for hot-reload |
| `make test` | `pytest` (backend) + `vitest run` (frontend) + `playwright test` (E2E) |
| `make migrate` | `alembic upgrade head` inside the running backend container |
| `make seed` | Loads `fixtures/dev_seed.sql` + synthetic TIP events via seed script |
| `make lint` | Runs all pre-commit hooks against all files |
| `make clean` | `docker compose down -v` — removes all containers and volumes (destructive, prompts) |
| `make shell-db` | Opens a `psql` shell inside the TimescaleDB container |
| `make shell-backend` | Opens a bash shell inside the running backend container |

Hot-reload configuration (docker-compose.override.yml — dev only, not committed to CI):

services:
  backend:
    volumes:
      - ./backend:/app   # bind mount — FastAPI --reload picks up changes instantly
    command: ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0"]
  frontend:
    volumes:
      - ./frontend/src:/app/src   # Next.js / Vite HMR

.env.example structure (excerpt):

# === Required: obtain before first run ===
SPACETRACK_USERNAME=your_email@example.com
SPACETRACK_PASSWORD=your_password

# === Required: generate locally ===
JWT_PRIVATE_KEY_PATH=./certs/jwt_private.pem   # openssl genrsa -out certs/jwt_private.pem 2048
JWT_PUBLIC_KEY_PATH=./certs/jwt_public.pem

# === Safe defaults for local dev (change for production) ===
POSTGRES_PASSWORD=spacecom_dev
REDIS_PASSWORD=spacecom_dev
MINIO_ACCESS_KEY=spacecom_dev
MINIO_SECRET_KEY=spacecom_dev_secret
HMAC_SECRET=dev_hmac_secret_change_in_prod

# === Stage flags ===
ENVIRONMENT=development    # development | staging | production
SHADOW_MODE_DEFAULT=false
DISABLE_SIMULATION_DURING_ACTIVE_EVENTS=false

All production-only variables are clearly marked. The README's "Getting Started" section mirrors the first-time setup steps above.

Staging Environment

Purpose: Continuous integration target for main branch. Serves as the TRL artefact evidence environment — all shadow validation records and OWASP ZAP reports reference the staging deployment.

| Property | Staging | Production |
| --- | --- | --- |
| Infrastructure | Tier 2 (single-host Docker Compose) | Tier 3 (multi-host HA) |
| Data | Synthetic only — no production data | Real TLE/TIP/space weather |
| Secrets | Separate credential set; non-production Space-Track account | Production credential set in Vault |
| Deploy trigger | Automatic on merge to main | Manual approval in GitHub Actions |
| OWASP ZAP | Runs against every staging deploy | Run on demand before Phase 3 milestones |
| Retention | Environment resets weekly (fresh `make seed` run) | Persistent |

Secrets Rotation Procedure

Zero-downtime rotation is required. Service interruption during rotation is a reliability failure.

JWT RS256 Signing Keypair:

  1. Generate new keypair: openssl genrsa -out jwt_private_new.pem 2048 && openssl rsa -in jwt_private_new.pem -pubout -out jwt_public_new.pem
  2. Load new public key into JWT_PUBLIC_KEY_NEW env var on all backend instances (old key still active)
  3. Backend now validates tokens signed with either old or new key
  4. Update JWT_PRIVATE_KEY to new key; new tokens are signed with new key
  5. Wait for all old tokens to expire (max 1h for access tokens; 30 days for refresh tokens)
  6. Remove JWT_PUBLIC_KEY_NEW; old public key no longer needed
  7. Log security_logs entry type KEY_ROTATION with rotation timestamp and initiator
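Step 3's "either old or new key" behaviour amounts to a key-ring loop — a minimal sketch; `SignatureError` and the verifier callables stand in for the PyJWT RS256 decode path, and the names are illustrative:

```python
class SignatureError(Exception):
    """Raised when a token fails verification against one configured key."""


def verify_with_rotation(token: str, verifiers: list) -> dict:
    """Try each verifier (newest key first); accept the first that succeeds.

    Each verifier wraps one public key — e.g. a closure over
    jwt.decode(token, public_key, algorithms=["RS256"]). During rotation
    the list holds [new_key_verifier, old_key_verifier]; after step 6 it
    shrinks back to a single entry.
    """
    last_error = None
    for verify in verifiers:
        try:
            return verify(token)
        except SignatureError as exc:
            last_error = exc
    raise last_error or SignatureError("no verifiers configured")
```

Because the loop tries keys in order, tokens signed before step 4 keep validating until they expire in step 5, with no service interruption.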

Space-Track Credentials:

  1. Create new Space-Track account or update password via Space-Track web portal
  2. Update SPACETRACK_USERNAME / SPACETRACK_PASSWORD in secrets manager (Docker secrets / Vault)
  3. Trigger one manual ingest cycle; verify 200 response from Space-Track API
  4. Deactivate old credentials in Space-Track portal
  5. Log security_logs entry type CREDENTIAL_ROTATION

MinIO Access Keys:

  1. Create new access key pair via MinIO console (mc admin user add)
  2. Update MINIO_ACCESS_KEY / MINIO_SECRET_KEY in secrets manager
  3. Restart backend and worker services (rolling restart — blue-green ensures zero downtime)
  4. Verify pre-signed URL generation succeeds
  5. Delete old access key from MinIO console

HMAC Secret (prediction signing key):

  • Do not rotate casually. All existing HMAC-signed predictions will fail verification after rotation.
  • Pre-rotation: re-sign all existing predictions with new key (batch migration script required)
  • Post-rotation: update HMAC_SECRET in secrets manager; verify batch re-sign by spot-checking 10 predictions
  • Rotation must be approved by engineering lead; security_logs entry type HMAC_KEY_ROTATION required
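The pre-rotation batch re-sign can be sketched with the stdlib `hmac` module — mirroring the `sign_prediction` call used by `aggregate_mc_results`; the canonical-JSON signing form shown here is an assumption, not the production signer:

```python
import hashlib
import hmac
import json


def sign_prediction(prediction: dict, secret: bytes) -> str:
    """HMAC-SHA256 over the canonical JSON form, excluding the signature field."""
    payload = {k: v for k, v in prediction.items() if k != "record_hmac"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, canonical.encode(), hashlib.sha256).hexdigest()


def resign_batch(predictions: list[dict], old_secret: bytes, new_secret: bytes) -> int:
    """Verify each record against the old key, then re-sign with the new key.

    Records that fail old-key verification are skipped, not overwritten —
    they indicate tampering or a prior partial rotation and must be
    investigated before the rotation proceeds.
    """
    resigned = 0
    for p in predictions:
        expected = sign_prediction(p, old_secret)
        if not hmac.compare_digest(expected, p.get("record_hmac", "")):
            continue  # flag for manual review
        p["record_hmac"] = sign_prediction(p, new_secret)
        resigned += 1
    return resigned
```

The spot-check in the post-rotation step is then just `sign_prediction(record, new_secret) == record["record_hmac"]` on 10 sampled rows.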

26.10 Post-Deployment Safety Monitoring Programme (F9 — §61)

Pre-deployment testing and shadow validation demonstrate that a system was safe at a point in time. Post-deployment monitoring demonstrates that it remains safe in operational conditions. DO-278A §12 and EUROCAE ED-153 both require evidence of ongoing safety monitoring after deployment.

Programme components:

26.10.1 Prediction Accuracy Monitoring

After each actual re-entry event where SpaceCom generated predictions:

  1. Record the actual re-entry time and location (from The Aerospace Corporation / ESA re-entry campaign results)
  2. Compare against SpaceCom's p50 corridor centre and p95 bounds
  3. Record in shadow_validations table: actual_reentry_time, actual_impact_region, p50_error_km, p95_captured (boolean)
  4. Compute running accuracy statistics: % of events where actual impact was within p95 corridor; median error in km
  5. Publish accuracy statistics to GET /api/v1/admin/accuracy-report (accessible to ANSP admins)

Alert trigger: If rolling 12-month p95 capture rate drops below 80% (target: 95%), engineering review is mandatory before the next ANSP shadow activation or model update deployment.
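The running statistics in steps 4–5 and the 80% alert trigger can be sketched as follows; the row fields follow the `shadow_validations` columns recorded in step 3:

```python
def capture_stats(validations: list[dict]) -> dict:
    """Rolling accuracy statistics over shadow_validations rows.

    Each row is assumed to carry p95_captured (bool) and p50_error_km (float).
    """
    if not validations:
        return {"n": 0, "p95_capture_rate": None, "median_p50_error_km": None}
    n = len(validations)
    captured = sum(1 for v in validations if v["p95_captured"])
    errors = sorted(v["p50_error_km"] for v in validations)
    mid = n // 2
    median = errors[mid] if n % 2 else (errors[mid - 1] + errors[mid]) / 2
    rate = captured / n
    return {
        "n": n,
        "p95_capture_rate": rate,
        "median_p50_error_km": median,
        "review_required": rate < 0.80,  # mandatory engineering review threshold
    }
```

The same output feeds `GET /api/v1/admin/accuracy-report` and the safety KPI dashboard.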

26.10.2 Safety KPI Dashboard

Prometheus recording rules and Grafana dashboard (monitoring/dashboards/safety-kpis.json):

| KPI | Metric | Target | Alert threshold |
| --- | --- | --- | --- |
| HMAC verification failures | spacecom_hmac_verification_failures_total | 0 / month | Any failure → SEV-1 |
| Safety occurrences | safety_occurrences table count | 0 / year | ≥1 → safety case review |
| Alert false positive rate | Manual: PIR review | < 5% | Engineering review if exceeded |
| Operator training currency | operator_training_records expiry | 100% current | < 95% → ANSP admin notification |
| p95 corridor capture rate | shadow_validations rolling 12-month | ≥ 95% | < 80% → model review |
| Prediction freshness (TLE age at prediction time) | spacecom_tle_age_hours histogram | p95 < 6h | > 24h → MEDIUM alert |
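The HMAC and TLE-age thresholds above can be expressed as Prometheus alerting rules — an illustrative sketch; the rule file path and label conventions are assumptions:

```yaml
# monitoring/rules/safety-kpis.yml (sketch — align names with the deployed registry)
groups:
  - name: safety-kpis
    rules:
      - alert: HmacVerificationFailure
        expr: increase(spacecom_hmac_verification_failures_total[1h]) > 0
        labels:
          severity: sev1
        annotations:
          summary: "HMAC verification failure — prediction integrity check failed"
      - alert: TleAgeStale
        expr: histogram_quantile(0.95, sum by (le) (rate(spacecom_tle_age_hours_bucket[6h]))) > 24
        labels:
          severity: medium
        annotations:
          summary: "p95 TLE age at prediction time exceeds 24 h"
```

The remaining KPIs (training currency, capture rate, false positive rate) are computed from database tables rather than Prometheus and surface on the Grafana dashboard via SQL data sources.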

26.10.3 Quarterly Safety Review

Mandatory quarterly safety review meeting. Output: docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md.

Agenda:

  1. Safety KPI review (all metrics above)
  2. Safety occurrences since last review (zero is an acceptable answer — record it)
  3. Hazard log review: has any hazard likelihood or severity changed since last quarter?
  4. MoC status update: progress on PLANNED items
  5. Model changes in period: were any SAL-2 components modified? If so, safety case impact assessment
  6. ANSP feedback: any concerns raised by ANSP customers regarding safety or accuracy?
  7. Actions: owner, deadline, priority

Attendance required: Safety case custodian + engineering lead. One ANSP contact may be invited as an observer (good practice for regulatory demonstration).

26.10.4 Model Version Safety Monitoring

When a new model version is deployed (changes to physics/ or alerts/ SAL-2 components):

  1. Shadow run new model in parallel for ≥14 days before replacing production model
  2. Compare new vs. old: prediction differences > 50 km for p50, or > 100 km for p95, require engineering review before promotion
  3. After promotion: monitor shadow_validations for the next 3 re-entry events; regression alert if p95 capture rate declines
  4. Record in simulations.model_version; all predictions annotated with the model version they used
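The 50 km / 100 km promotion gate in step 2 can be sketched as follows; the pairwise-diff record shape is an assumption (one row per object, carrying the great-circle distance between old and new p50 centres and p95 bounds):

```python
def promotion_gate(pairs: list[dict],
                   p50_limit_km: float = 50.0,
                   p95_limit_km: float = 100.0) -> dict:
    """Compare shadow (new) vs production (old) model predictions pairwise.

    Any object whose p50 centres differ by > 50 km, or whose p95 bounds
    differ by > 100 km, blocks automatic promotion pending engineering review.
    """
    blocked = [
        p for p in pairs
        if p["p50_diff_km"] > p50_limit_km or p["p95_diff_km"] > p95_limit_km
    ]
    return {"auto_promote": not blocked, "needs_review": blocked}
```

Run over all objects predicted during the ≥14-day shadow period; an empty `needs_review` list is the precondition for promotion.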

27. Capacity Planning

27.0 Performance Test Specification (F6)

Performance tests live in tests/load/ and are run with k6. They are not part of the standard make test suite — they require a running environment with realistic data. They run:

  • Manually before any Phase gate release
  • Automatically on the staging environment nightly (scheduled k6 Cloud or self-hosted k6)
  • Results committed to docs/validation/load-test-results/ after each Phase gate

Scenarios

// tests/load/scenarios.js
export const options = {
  scenarios: {
    czml_catalog: {
      executor: 'ramping-vus',
      startVUs: 0, stages: [
        { duration: '30s', target: 50 },
        { duration: '2m',  target: 100 },
        { duration: '30s', target: 0 },
      ],
    },
    websocket_subscribers: {
      executor: 'constant-vus', vus: 200, duration: '3m',
    },
    decay_submit: {
      executor: 'constant-arrival-rate', rate: 5, timeUnit: '1m',
      preAllocatedVUs: 10, duration: '5m',
    },
  },
};

SLO Assertions (k6 thresholds — test fails if breached)

| Scenario | Metric | Threshold |
| --- | --- | --- |
| CZML catalog (GET /objects + CZML) | p95 response time | < 2 000 ms |
| API auth (POST /auth/token) | p99 response time | < 500 ms |
| Decay prediction submit | p95 response time | < 500 ms (202 accept only) |
| WebSocket connection | 200 concurrent connections stable for 3 min | 0 connection drops |
| WebSocket alert delivery | Time from DB insert to browser receipt | < 30 000 ms p95 |
| /readyz probe | p99 response time | < 100 ms |

Baseline Environment

Performance tests are only comparable if run against a consistent hardware baseline:

# docs/validation/load-test-baseline.md
- Host: 8 vCPU / 32 GB RAM (Tier 2 single-host)
- TimescaleDB: 100 tracked objects, 90 days of orbit history
- Celery workers: simulation ×16 concurrency, ingest ×2
- Redis: empty (no warm cache) at test start

Results from a different hardware spec must be labelled separately and not compared to the baseline. A performance regression is defined as any threshold breach on the same baseline hardware.

k6 outputs a JSON summary; a CI step uploads it to docs/validation/load-test-results/YYYY-MM-DD-{env}.json. A lightweight Python script (scripts/load-test-trend.py) plots p95 latency over time for the past 10 runs and embeds the chart in docs/TEST_PLAN.md. A > 20% increase in any p95 metric between consecutive runs on the same hardware creates a performance-regression GitHub issue automatically.
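The >20% consecutive-run rule that scripts/load-test-trend.py applies can be sketched as follows; the summary JSON shape here is an assumption — adapt the field names to the actual k6 output:

```python
def find_regressions(runs: list[dict], threshold: float = 0.20) -> list[dict]:
    """Flag any p95 metric that grew by more than `threshold` between
    consecutive runs on the same baseline hardware.

    runs: chronological summaries, each {"date": ..., "p95_ms": {metric: value}}.
    """
    regressions = []
    for prev, curr in zip(runs, runs[1:]):
        for metric, value in curr["p95_ms"].items():
            baseline = prev["p95_ms"].get(metric)
            if baseline and value > baseline * (1 + threshold):
                regressions.append({
                    "metric": metric,
                    "from_ms": baseline,
                    "to_ms": value,
                    "run": curr["date"],
                })
    return regressions
```

A non-empty return drives the automatic performance-regression GitHub issue; runs on different hardware must be filtered out before calling this.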

27.1 Workload Characterisation

| Workload | CPU Profile | Memory | Dominant Constraint |
| --- | --- | --- | --- |
| MC decay prediction (500 samples) | CPU-bound, parallelisable | 200–500 MB per process | CPU cores on simulation workers |
| SGP4 catalog propagation (100 objects) | Trivial | < 100 MB | None — analytical model |
| CZML generation | I/O-bound (DB read) | < 500 MB | DB query latency |
| Atmospheric breakup | CPU-bound, light | ~200 MB | Negligible vs. MC |
| Conjunction screening (100 objects) | CPU-bound, seconds | ~500 MB | Acceptable on any worker |
| Controlled re-entry planner | CPU-bound, similar to MC | 500 MB | Same pool as MC |
| Playwright renderer | Memory-bound (Chromium) | 1–2 GB per instance | Isolated container |
| TimescaleDB queries | I/O-bound | 64 GB (buffer cache) | NVMe IOPS for spatial queries |

Cost-tracking metrics (F3, F4, F11):

Add the following Prometheus counters to enable per-org cost attribution and external API budget visibility. These feed the unit economics model (§27.7) and the Enterprise tier chargeback reports.

# backend/app/metrics.py (add to existing prometheus_client registry)
from prometheus_client import Counter

# F3 — External API call budget tracking
ingest_api_calls_total = Counter(
    "spacecom_ingest_api_calls_total",
    "Total external API calls made by the ingest worker",
    labelnames=["source"]  # "space_track", "celestrak", "noaa_swpc", "esa_discos", "iers"
)
# Usage: ingest_api_calls_total.labels(source="space_track").inc()
# Alert: if space_track calls > 100/day → investigate polling loop bug (Space-Track AUP limit: 200/day)

# F4 — Per-org simulation CPU attribution
simulation_cpu_seconds_total = Counter(
    "spacecom_simulation_cpu_seconds_total",
    "Total CPU-seconds consumed by MC simulations, by org and object",
    labelnames=["org_id", "norad_id"]
)
# Usage: simulation_cpu_seconds_total.labels(org_id=str(org_id), norad_id=str(norad_id)).inc(elapsed)
# This is the primary input to infrastructure_cost_per_mc_run in §27.7

F5 — Inbound API request counter (§68):

# backend/app/metrics.py (add to existing prometheus_client registry)
api_requests_total = Counter(
    "spacecom_api_requests_total",
    "Total inbound API requests, by org, endpoint, and API version",
    labelnames=["org_id", "endpoint", "version", "status_code"]
)
# Usage (FastAPI middleware):
# api_requests_total.labels(
#     org_id=str(request.state.org_id),
#     endpoint=request.url.path,
#     version=request.headers.get("X-API-Version", "v1"),
#     status_code=str(response.status_code)
# ).inc()

This counter is the foundation for future API tier enforcement (e.g., 1,000 requests/month for Professional; unlimited for Enterprise) and for supporting usage-based billing for Persona E/F API consumers. Add to the FastAPI middleware stack alongside prometheus_fastapi_instrumentator.

F11 — Per-org cost attribution for Enterprise tier:

Enterprise contracts may include usage-based clauses (e.g., MC simulation credits). The simulation_cpu_seconds_total metric provides the raw data; a monthly Celery task (tasks/billing/generate_usage_report.py) aggregates it per org:

@shared_task
def generate_monthly_usage_report(org_id: str, year: int, month: int):
    """Aggregate simulation CPU-seconds and ingest API calls per org for billing review."""
    # Query Prometheus/VictoriaMetrics for the org's metrics over the billing period
    # Output: docs/business/usage_reports/{org_id}/{year}-{month:02d}.json
    # Fields: total_mc_runs, total_cpu_seconds, estimated_cost_usd (at $0.40/run internal rate)

Per-org usage reports are stored in docs/business/usage_reports/ and referenced in Enterprise QBRs. The cost rate ($0.40/run at Tier 3 scale) is updated quarterly in docs/business/UNIT_ECONOMICS.md.

Usage surfaced to commercial team and org admins (F2 — §68):

Usage data must reach two audiences: the commercial team (for renewal and expansion conversations) and the org admin (to understand value received).

Commercial team: Monthly Celery Beat task (tasks/commercial/send_commercial_summary.py) emails commercial@spacecom.io on the 1st of each month with:

  • Per-org: MC simulation count, PDF reports generated, WebSocket connection hours, alert events (by severity)
  • Trend vs. previous 3 months (growth signal for expansion conversations)
  • Contracts expiring within 90 days (renewal pipeline)

Org admin: Monthly usage summary email to each org's admin contact showing their own usage. Template: "In [month], your team ran [N] decay predictions, generated [M] PDF reports, and received [K] CRITICAL alerts. Your monthly quota: [Q] simulations (used: [N])." This email reinforces value perception ahead of renewal conversations.

Both emails use the generate_monthly_usage_report output. Add send_usage_summary_emails to celery-redbeat at crontab(day_of_month=1, hour=6).

27.2 Monte Carlo Parallelism Architecture

The MC decay predictor must use Celery group + chord to distribute sample computation across the full worker pool. multiprocessing.Pool within a single task is limited to one container's cores.

from celery import group, chord

@celery.task
def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    """Fan out 500 samples as individual sub-tasks; aggregate with chord callback."""
    sample_tasks = group(
        run_single_trajectory.s(object_id, params, seed=i)
        for i in range(params['mc_samples'])
    )
    result = chord(sample_tasks)(aggregate_mc_results.s(object_id, params))
    return result.id

@celery.task
def run_single_trajectory(object_id: int, params: dict, seed: int) -> dict:
    """Single RK7(8) + NRLMSISE-00 trajectory integration. CPU time: 220s."""
    rng = np.random.default_rng(seed)
    f107 = params['f107'] * rng.normal(1.0, 0.20)  # ±20% variation
    bstar = params['bstar'] * rng.normal(1.0, 0.10)
    return integrate_trajectory(object_id, f107, bstar, params)

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])

Worker concurrency for chord sub-tasks:

  • Each sub-task is short (2–20s) and CPU-bound
  • Worker --pool=prefork --concurrency=16: 16 OS processes per container
  • 2 simulation worker containers: 32 concurrent sub-tasks
  • 500 samples / 32 = ~16 batches × ~10s average = ~160s per MC run (p50)
  • p95 target of 240s met with headroom
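The batch arithmetic above can be captured in a small helper for capacity planning — a minimal sketch that ignores chord dispatch overhead and per-task variance:

```python
import math


def mc_wall_time_s(samples: int, concurrent_slots: int, mean_task_s: float) -> float:
    """Rough p50 wall-time for one chord fan-out: samples are consumed in
    batches of `concurrent_slots` parallel sub-tasks."""
    batches = math.ceil(samples / concurrent_slots)
    return batches * mean_task_s
```

With the Tier 2 pool (2 containers × 16 processes) this reproduces the ~160s figure; doubling the pool halves the batch count accordingly.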

Chord result backend: Sub-task results stored in Redis temporarily (< 1 MB each × 500 = 500 MB peak per run). Results expire after 1 hour (result_expires = 3600 in celeryconfig.py — §27.8). The aggregate callback reads all results, computes the final prediction, and writes to TimescaleDB — Redis is not the durable store.

Chord callback result count validation (F1 — §67): Redis noeviction prevents eviction, but if Redis is misconfigured or hits maxmemory and rejects writes, sub-task results may be missing when the chord callback fires. The callback must validate that it received the expected number of results before writing to TimescaleDB:

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    expected = params['mc_samples']
    if len(results) != expected:
        # Partial result — do not write a silently truncated prediction
        raise ValueError(
            f"MC chord received {len(results)}/{expected} results for object {object_id}. "
            "Redis result backend may be under memory pressure. Aborting."
        )
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])

The ValueError causes the chord callback to fail and be routed to the DLQ (Dead Letter Queue). The originating API call receives a task failure, and the client receives HTTP 500 with Retry-After. A spacecom_mc_chord_partial_result_total counter fires, triggering a CRITICAL alert: "MC chord received partial results — Redis memory budget exceeded."

27.3 Deployment Tiers

Tier 1 — Development and Demonstration

Single machine, Docker Compose, all services co-located. No HA. Suitable for development, internal demos, and ESA TRL 4 demonstrations.

| Spec | Minimum | Recommended |
| --- | --- | --- |
| CPU | 8 cores | 16 cores |
| RAM | 16 GB | 32 GB |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD |
| Cloud equivalent | t3.2xlarge ~$240/mo | m6i.4xlarge ~$540/mo |

MC prediction p95: ~400–800s (exceeds SLO — acceptable for demo; noted in demo briefings).


Tier 2 — Phase 1–2 Production

Separate containers per service. Meets SLOs under moderate load (≤ 5 concurrent simulation users). Single-node per service — no HA. Suitable for shadow mode deployments and early ANSP pilots.

| Service | vCPU | RAM | Storage | Cloud (AWS) | Monthly |
| --- | --- | --- | --- | --- | --- |
| Backend API | 4 | 8 GB | | c6i.xlarge | ~$140 |
| Simulation Workers ×2 | 16 each | 32 GB each | | c6i.4xlarge ×2 | ~$560 each |
| Ingest Worker | 2 | 4 GB | | t3.medium | ~$30 |
| Renderer | 4 | 8 GB | | c6i.xlarge | ~$140 |
| TimescaleDB | 8 | 64 GB | 1 TB NVMe | r6i.2xlarge | ~$420 |
| Redis | 2 | 8 GB | | cache.r6g.large | ~$120 |
| MinIO / S3 | 4 | 8 GB | 4 TB | i3.xlarge + EBS | ~$200 |
| Total | | | | | ~$2,200/mo |

On-premise equivalent (Tier 2): Two servers — compute host (2× AMD EPYC 7313P, 32 total cores, 192 GB RAM) + storage host (8 vCPU, 256 GB RAM, 2 TB NVMe + 8 TB HDD). Capital cost: ~$25,000–35,000.


Tier 3 — Phase 3 HA Production

Full redundancy. Meets 99.9% availability SLO including during active TIP events. Required before any formal operational ANSP deployment.

| Service | Count | vCPU each | RAM each | Notes |
| --- | --- | --- | --- | --- |
| Backend API | 2 | 4 | 8 GB | Load balanced; blue-green deployable |
| Simulation Workers | 4 | 16 | 32 GB | 64 total cores; chord sub-tasks fill all |
| Ingest Worker | 2 | 2 | 4 GB | celery-redbeat leader election |
| Renderer | 2 | 4 | 8 GB | Network-isolated; Chromium memory budget |
| TimescaleDB Primary | 1 | 8 | 128 GB | Patroni-managed; synchronous replication |
| TimescaleDB Standby | 1 | 8 | 128 GB | Hot standby; auto-failover ≤ 30s |
| Redis Sentinel ×3 | 3 | 2 | 8 GB | Quorum; master failover ≤ 10s |
| MinIO (distributed) | 4 | 4 | 16 GB | Erasure coding EC:2; 2× 2 TB NVMe each |
| Cloud total (AWS) | | | | ~$6,000–7,000/mo |

With 64 simulation worker cores: 500-sample MC in ~80s p50, ~120s p95 — well within SLO.

MinIO Erasure Coding (Tier 3): 4-node distributed MinIO uses EC:2 (2 parity shards). This provides:

  • Read quorum: any 2 of 4 nodes (tolerates 2 simultaneous node failures for reads)
  • Write quorum: requires 3 of 4 nodes (tolerates 1 simultaneous node failure for writes)
  • Effective storage: 50% of raw — the Tier 3 spec (4 nodes × 2× 2 TB NVMe = 16 TB raw) yields 8 TB usable
  • Configured via MINIO_ERASURE_SET_DRIVE_COUNT=4 and server startup with all 4 node endpoints
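The quorum and capacity figures follow directly from the shard ratio — a small sketch (EC:2 here means 2 parity shards per 4-drive erasure set):

```python
def ec_usable_tb(raw_tb: float, data_shards: int, parity_shards: int) -> float:
    """Usable MinIO capacity under erasure coding: raw × data/(data+parity).

    A 4-drive erasure set with EC:2 stores 2 data + 2 parity shards per
    object, so usable capacity is 50% of raw.
    """
    return raw_tb * data_shards / (data_shards + parity_shards)
```

This is why the Tier 3 raw capacity must be double the planned usable capacity; moving to EC:1 would raise the ratio to 75% at the cost of tolerating only a single failed drive.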

Multi-region stance: SpaceCom is single-region through all three phases. Reasoning:

  • Phase 1–3 customer base is small (ESA evaluation, early ANSP pilots); cross-region replication cost and operational complexity are not justified.
  • Government and defence customers may have data sovereignty requirements — a single, clearly defined deployment region (customer-specified) is simpler to certify than an active-active multi-region setup.
  • When a second jurisdiction customer is onboarded, deploy a separate, independent instance in their required jurisdiction rather than extending a single global cluster. Each instance has its own data, its own compliance scope, and its own operational team contact.
  • This decision is documented as ADR-0010 (see §34 decision log).

On-premise equivalent (Tier 3): Three servers — 2× compute (2× EPYC 7343, 32 cores, 256 GB RAM each) + 1× storage (128 GB RAM, 4× 2 TB NVMe RAID-10, 16 TB HDD). Capital cost: ~$60,000–80,000.

Celery worker idle cost and scale-to-zero decision (F6):

Simulation workers are the largest cloud line item ($560/mo each at Tier 2 on c6i.4xlarge). Their actual compute utilisation depends on MC run frequency:

| Usage pattern | Active compute/day | Idle fraction | Monthly cost (Tier 2, ×2 workers) |
| --- | --- | --- | --- |
| Light (5 MC runs/day × 80s p50) | ~7 min/day | ~99.5% | $1,120 |
| Moderate (20 MC runs/day × 80s) | ~27 min/day | ~98.1% | $1,120 |
| Heavy (100 MC runs/day × 80s) | ~133 min/day | ~90.7% | $1,120 |

Scale-to-zero analysis:

| Approach | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Always-on (Tier 1–2) | Zero cold-start; SLO met immediately | High idle cost when lightly used | Use at Tier 1–2 — cost is ~$1,120/mo regardless; latency SLO requires workers ready |
| Scale-to-1 minimum (Tier 3) | Reduced idle cost vs. 4×; one worker handles ingest keepalive tasks | Cold-start for burst: 3 new workers × 30–60s spin-up; MC SLO may breach during burst | Use at Tier 3 — scale-to-1 minimum; HPA/KEDA scales 1→4 on celery_queue_length > 10 |
| Scale-to-zero | Maximum idle savings | 60–120s cold-start violates 10-min MC SLO when all workers are down | Do not use — cold-start from zero exceeds acceptable latency for on-demand simulation |

Implementation at Tier 3 (Kubernetes): Use KEDA ScaledObject with celery trigger:

triggers:
  - type: redis
    metadata:
      listName: celery          # Celery default queue
      listLength: "10"          # scale up when >10 tasks queued
      activationListLength: "1" # keep at least 1 replica (scale-to-1 minimum)

Minimum replica count: 1. Maximum: 4. Scale-down stabilisation window: 5 minutes (prevents oscillation during multi-run bursts).

Ingest worker: Always-on, single instance (2 vCPU, $30/mo at Tier 2). celery-redbeat tasks run on 1-minute and hourly schedules; scale-to-zero is not appropriate. At Tier 3, 2 instances for redundancy; no autoscaling needed.


27.4 Storage Growth Projections

| Data | Retention | Raw Growth/Year | Compressed/Year | Cloud Cost/Year (est.) | Notes |
| --- | --- | --- | --- | --- | --- |
| orbits (100 objects, 1/min) | 90 days online | ~15 GB | ~2 GB | ~$20 (EBS gp3, rolling) | TimescaleDB compression ~7:1 |
| tle_sets | 1 year | ~55 MB | ~30 MB | Negligible | |
| space_weather | 2 years | ~5 MB | ~2 MB | Negligible | |
| MC simulation blobs (MinIO) | 2 years | 500 GB–2 TB | Not compressed | $140–$560/yr (S3-IA after 90d) | Dominant cost — S3-IA at $0.0125/GB/mo |
| PDF reports (MinIO) | 7 years | 10–90 GB | 5–45 GB | $5–$45/yr (S3 Glacier) | $0.004/GB/mo Glacier tier |
| WAL archive (backup) | 30 days rolling | ~25 GB/month | | ~$100/yr (300 GB peak × $0.023/GB/mo × 12) | S3 Standard; rolls over; cost is steady-state |
| security_logs | 2 years online; 7-year archive | ~500 MB/year | | Negligible | Legal hold |
| reentry_predictions | 7 years | ~100 MB/year | | Negligible | Legal hold |
| Safety records (alert_events, notam_drafts, prediction_outcomes, degraded_mode_events, coordination notes) | 5-year minimum append-only archive | ~200 MB/year | | Negligible | ICAO Annex 11 §2.26; safety investigation requirement |

Storage cost summary (Phase 2 steady-state): MC blobs dominate at sustained use. At 50 runs/day × 120 MB/run = 2.2 TB/year, 2-year retention on S3-IA ≈ $660/year in object storage alone. This should be captured in the unit economics model (§27.7). Storage cost is the primary variable cost that scales with usage depth (number of MC runs), not with number of users.

Backup cost projection (F9): WAL archive at 30-day rolling window: ~300 GB peak occupancy on S3 Standard ≈ $83/year (Tier 2). At Tier 3 with synchronous replication, the base-backup is ~2× TimescaleDB data size. At 1 TB compressed DB size: one weekly base-backup (retained 4 weeks) = 4 TB S3 occupancy → **$1,100/year** at Tier 3. Include backup S3 bucket costs in infrastructure budget from Phase 3 onwards. Budget line: infra/backup-s3 ≈ $100–200/month at steady Tier 3 scale.

Safety record retention policy (Finding 11): Safety-relevant event records have a distinct retention category separate from general operational data. A safety_record BOOLEAN DEFAULT FALSE flag on alert_events and notam_drafts marks records that must survive the standard retention drop. Records with safety_record = TRUE are excluded from TimescaleDB drop policies and transferred to MinIO cold tier (append-only) after 90 days online, retained for 5 years minimum. The TimescaleDB retention job checks WHERE safety_record = FALSE before dropping chunks. safety_record is set to TRUE at insert time for any event with alert_level IN ('HIGH', 'CRITICAL') and for all NOTAM drafts.

MC blob storage dominates at scale. At sustained use (50 MC runs/day × 120 MB/run): 2.2 TB/year. The Tier 3 distributed MinIO (8 TB usable under EC:2 from 16 TB raw across 4 nodes) covers approximately 3–4 years before expansion.

Cold tier tiering decision (two object classes with different requirements):

| Object class | Cold tier target | Reason |
| --- | --- | --- |
| MC simulation blobs (mc_blobs/ prefix) | MinIO ILM warm tier or S3 Infrequent Access | Blobs may need to be replayed for Mode C visualisation of historical events (e.g., regulatory dispute review, incident investigation). Glacier 12h restore latency is operationally unacceptable for this use case. |
| Compliance-only documents (reports/, notam_drafts/) | S3 Glacier / Glacier Deep Archive acceptable | These are legal records requiring 7-year retention; retrieval is for audit or legal discovery only; 12h restore latency is acceptable. |

MinIO ILM rules configured in docs/runbooks/minio-lifecycle.md. Lifecycle transitions: MC blobs after 90 days → ILM warm (lower-cost MinIO tier or S3-IA); compliance docs after 1 year → Glacier.

MinIO multipart upload retry and incomplete upload expiry (F7 — §67):

MC simulation blobs (~120 MB each) are uploaded as multipart uploads. During a MinIO node failure in EC:2 distributed mode, write quorum (3/4 nodes) may be temporarily unavailable. An in-flight multipart upload will fail with MinioException / S3Error. Without a retry policy, the MC prediction is written to TimescaleDB but the blob is lost — the historical replay functionality silently fails.

# worker/tasks/blob_upload.py
from minio.error import S3Error

@shared_task(
    autoretry_for=(S3Error, ConnectionError),
    max_retries=3,
    retry_backoff=30,    # 30s, 60s, 120s — allow node recovery
    retry_jitter=True,
)
def upload_mc_blob(prediction_id: str, blob_data: bytes):
    """Upload MC simulation blob to MinIO with retry on quorum failure."""
    object_key = f"mc_blobs/{prediction_id}.msgpack"
    minio_client.put_object(
        bucket_name="spacecom-simulations",
        object_name=object_key,
        data=io.BytesIO(blob_data),
        length=len(blob_data),
        content_type="application/msgpack",
    )

Incomplete multipart upload cleanup: Configure MinIO lifecycle rule to abort incomplete multipart uploads after 24 hours. Add to docs/runbooks/minio-lifecycle.md:

mc ilm rule add --expire-delete-marker --noncurrent-expire-days 1 \
  spacecom/spacecom-simulations --abort-incomplete-multipart-upload-days 1

This prevents orphaned multipart upload parts accumulating on disk during node failures or application crashes mid-upload.

27.5 Network and External Bandwidth

| Traffic | Direction | Volume | Notes |
| --- | --- | --- | --- |
| Space-Track TLE polling | Outbound | ~1 MB per run, every 4h | ~6 MB/day |
| NOAA SWPC space weather | Outbound | ~50 KB per fetch, hourly | ~1 MB/day |
| ESA DISCOS | Outbound | ~10 MB/day (initial bulk); ~100 KB/day incremental | |
| CZML to clients | Outbound | ~5–15 MB per user page load (full); <500 KB/hr delta | Scales linearly with users; delta protocol essential |
| WebSocket to clients | Outbound | ~1 KB/event × events/day | Low bandwidth, persistent connection |
| PDF reports (download) | Outbound | ~2–5 MB per report | Low frequency; MinIO presigned URL avoids backend proxy |
| MinIO internal traffic | Internal | Dominated by MC blob writes | Keep on internal Docker network |

CZML egress cost estimate and compression policy (F5):

At Phase 2 (10 concurrent users), daily CZML egress:

  • Initial full loads: 10 users × 3 page loads/day × 15 MB = 450 MB/day
  • Delta updates (delta protocol, §6): 10 users × 8h active × 500 KB/hr = 40 MB/day
  • Total: ~490 MB/day ≈ 15 GB/month

At $0.085/GB AWS CloudFront egress: ~$1.28/month (Phase 2) → ~$6.40/month (50 users Phase 3).
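The estimate above can be reproduced with a small cost model — a sketch whose per-user constants (3 loads/day, 15 MB full load, 8 active hours, 500 KB/hr delta) are the assumptions stated in the bullets, over a 30-day month:

```python
def monthly_egress_usd(users: int,
                       loads_per_day: int = 3,
                       full_mb: float = 15.0,
                       active_hours: float = 8.0,
                       delta_kb_per_hr: float = 500.0,
                       usd_per_gb: float = 0.085) -> float:
    """CZML egress cost: full page loads plus delta-protocol updates."""
    full_mb_day = users * loads_per_day * full_mb
    delta_mb_day = users * active_hours * delta_kb_per_hr / 1000
    gb_month = (full_mb_day + delta_mb_day) * 30 / 1000
    return gb_month * usd_per_gb
```

At 10 users this yields roughly $1.25/month, confirming that egress spend is negligible next to compute; latency, not cost, is the reason to compress.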

CZML egress is not a significant cost driver at this scale, but is significant for latency and user experience. Compression policy:

| Encoding | CZML size reduction | Implementation |
| --- | --- | --- |
| gzip (Accept-Encoding) | 60–75% | Caddy encode gzip — already included in §26.9 Caddy config |
| Brotli | 70–80% | Caddy encode zstd br gzip — use br for browser clients |
| CZML delta protocol (?since=) | 95%+ for incremental updates | Already specified in §6 |

Minimum requirement: Caddy encode block must include br before gzip in the content negotiation order. A 15 MB CZML payload compresses to ~3–5 MB with brotli. Verify with curl -H "Accept-Encoding: br" -I <url> — response must show Content-Encoding: br.

Network is not a constraint for this workload at the scales described. Standard 1 Gbps datacenter networking is sufficient. For on-premise government deployments, standard enterprise LAN is adequate.


27.6 DNS Architecture and Service Discovery

Tier 1–2 (Docker Compose)

Docker Compose provides built-in DNS resolution by service name within each network. Services reference each other by container name (e.g., db, redis, minio). No additional DNS infrastructure required.

PgBouncer as single DB connection target: At Tier 2, the backend and workers connect to pgbouncer:5432, not directly to db:5432. PgBouncer multiplexes connections and acts as a stable endpoint:

  • In a Patroni failover, pgbouncer is reconfigured to point to the new primary; application code never changes connection strings.
  • PgBouncer configuration: docs/runbooks/pgbouncer-config.md

Celery task retry during Patroni failover (F2 — §67): During the ≤ 30s Patroni leader election window, all writes to PgBouncer fail with FATAL: no connection available or OperationalError: server closed the connection unexpectedly. Celery tasks that execute a DB write during this window will raise sqlalchemy.exc.OperationalError. Without a retry policy, these tasks fail permanently and are routed to the DLQ.

All Celery tasks that write to the database must declare:

from celery import shared_task
from sqlalchemy.exc import OperationalError

@shared_task(
    autoretry_for=(OperationalError,),
    max_retries=3,
    retry_backoff=5,        # exponential: 5s, 10s, 20s
    retry_backoff_max=30,   # cap at 30s (within failover window)
    retry_jitter=True,
)
def my_db_writing_task(...):
    ...

This covers: aggregate_mc_results, write_alert_event, write_prediction_outcome, all ingest tasks. Tasks that only read from DB should also retry on OperationalError since PgBouncer may pause reads during leader election. Add integration test: simulate OperationalError on first two attempts → task succeeds on third attempt.
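The integration test above can be sketched without a broker by modelling the retry loop that `autoretry_for` produces. All classes here are stand-ins — the real test would use Celery eager mode and `sqlalchemy.exc.OperationalError`:

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""

class FlakyDB:
    """Stand-in for a connection that recovers after Patroni failover."""
    def __init__(self, failures: int):
        self.failures = failures
        self.attempts = 0

    def write(self) -> str:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"

def run_with_retries(db: FlakyDB, max_retries: int = 3) -> str:
    # Mirrors autoretry_for=(OperationalError,) with max_retries=3:
    # one initial attempt plus up to three retries.
    for attempt in range(max_retries + 1):
        try:
            return db.write()
        except OperationalError:
            if attempt == max_retries:
                raise  # retries exhausted -> task fails, routed to DLQ
    raise AssertionError("unreachable")

# Fails twice, succeeds on the third attempt — the required behaviour:
db = FlakyDB(failures=2)
assert run_with_retries(db) == "ok" and db.attempts == 3
```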

Tier 3 (HA / Kubernetes migration path)

At Tier 3, introduce split-horizon DNS:

| Zone | Scope | Purpose |
|---|---|---|
| spacecom.internal | Internal services | Service discovery: backend.spacecom.internal, db.spacecom.internal (→ PgBouncer VIP) |
| spacecom.io (or customer domain) | Public internet | Caddy termination endpoint; ACME certificate domain |

Service discovery implementation:

  • Cloud (AWS/GCP/Azure): Use cloud-native internal DNS (Route 53 private hosted zones / Cloud DNS) + load balancer for each service tier
  • On-premise: CoreDNS deployed as a DaemonSet (Kubernetes) or as a Docker container on the management network; service records updated via Patroni callback scripts on failover

Key DNS records (Tier 3):

| Record | Type | Value |
|---|---|---|
| db.spacecom.internal | A | PgBouncer VIP (stable through Patroni failover) |
| redis.spacecom.internal | A | Redis Sentinel VIP |
| minio.spacecom.internal | A | MinIO load balancer (all 4 nodes) |
| backend.spacecom.internal | A | Backend API load balancer (2 instances) |

27.7 Unit Economics Model

Reference document: docs/business/UNIT_ECONOMICS.md — maintained alongside this plan; update whenever pricing or infrastructure costs change.

Unit economics express the cost to serve one organisation per month and the revenue generated, enabling margin analysis per tier.

Cost-to-serve model (Phase 2, cloud-hosted, per org):

| Cost driver | Basis | Monthly cost per org |
|---|---|---|
| Simulation workers (shared pool) | 2 workers shared across all orgs; allocate by MC run share | $1,120 ÷ org count |
| TimescaleDB (shared instance) | ~$420/mo; fixed regardless of org count up to Phase 2 capacity | $420 ÷ org count |
| Redis (shared) | ~$120/mo | $120 ÷ org count |
| MinIO / S3 storage | Variable; ~$660/yr at heavy MC use → $55/mo | $5–55/mo |
| Backend API (shared) | ~$140/mo | $140 ÷ org count |
| Ingest worker (shared) | ~$30/mo | Allocated to platform overhead |
| Email relay | ~$0.001/email × volume | $0–5/mo |
| CZML egress | ~$0.085/GB | $1–7/mo |
| Total variable (1 org, Tier 2) | | ~$1,860/mo platform + $60–70 per-org variable |

Revenue per tier (target pricing — cross-reference §55 commercial model):

| Tier | Monthly ARR / org | Gross margin target |
|---|---|---|
| Free / Evaluation | $0 | Negative — cost of ESA relationship |
| Professional (shadow) | $3,000–6,000/mo | 50–70% at ≥3 orgs on platform |
| Enterprise (operational) | $15,000–40,000/mo | 65–75% at Tier 3 scale |

Break-even analysis: At Tier 2 platform cost (~$2,200/mo), break-even at Professional tier requires ≥1 paying org at $3,000/mo. Each additional Professional org at shared infrastructure has near-zero incremental infrastructure cost until capacity boundaries (MC concurrency limit, DB connection pooler limit).

Key unit economics metric: infrastructure_cost_per_mc_run. At Tier 2 (2 workers, $1,120/mo) and 500 runs/month: $2.24/run. At Tier 3 (4 workers KEDA scale-to-1, ~$800/mo amortised at medium utilisation) and 2,000 runs/month: $0.40/run. This metric should be tracked alongside spacecom_simulation_cpu_seconds_total (§27.1).
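The metric computation under the figures above, as a small helper (the function name follows the metric name; it is not an existing module):

```python
def infrastructure_cost_per_mc_run(worker_cost_usd_per_month: float,
                                   mc_runs_per_month: int) -> float:
    """USD of simulation-worker spend amortised over completed MC runs."""
    return round(worker_cost_usd_per_month / mc_runs_per_month, 2)

# Tier 2: 2 workers at $1,120/mo, 500 runs/month -> $2.24/run
assert infrastructure_cost_per_mc_run(1120, 500) == 2.24
# Tier 3: ~$800/mo amortised, 2,000 runs/month -> $0.40/run
assert infrastructure_cost_per_mc_run(800, 2000) == 0.40
```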

Professional Services as a revenue line (F10 — §68):

Professional Services (PS) revenue is a distinct revenue stream from recurring SaaS fees. For safety-critical aviation systems, PS typically represents 30–50% of first-year contract value and includes:

| PS engagement type | Typical value | Description |
|---|---|---|
| Implementation support | $15,000–40,000 | Deployment, configuration, integration with ANSP SMS |
| Regulatory documentation | $10,000–25,000 | SpaceCom system description for ANSP regulatory submissions; assists with EASA/CASA/CAA shadow mode notifications |
| Training (initial) | $5,000–15,000 | On-site or remote training for duty controllers, analysts, and IT administrators |
| Safety Management System integration | $8,000–20,000 | Integrating SpaceCom alert triggers into the ANSP's existing SMS occurrence reporting workflow |
| Annual training refresh | $2,000–5,000/yr | Recurring annual training for new staff and procedure updates |

PS revenue is tracked in the contracts.ps_value_cents column (§68 F1). Include PS as a budget line in docs/business/UNIT_ECONOMICS.md:

  • Year 1 total contract value = MRR × 12 + PS value
  • PS is recognised as one-time revenue at delivery (milestone-based); SaaS fees are recognised monthly
  • PS delivery requires dedicated engineering and commercial capacity — budget 1–2 days of senior engineer time per $5,000 of PS value
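The recognition and budgeting rules above, as arithmetic. Figures in the usage line are illustrative; `days_per_5k` uses the midpoint of the 1–2 day budget:

```python
def year1_total_contract_value(mrr_usd: float, ps_value_usd: float) -> float:
    """Year 1 TCV = recurring SaaS (recognised monthly) + one-time PS at delivery."""
    return mrr_usd * 12 + ps_value_usd

def ps_delivery_budget_days(ps_value_usd: float, days_per_5k: float = 1.5) -> float:
    """Senior-engineer days to budget for PS delivery (1-2 days per $5,000)."""
    return ps_value_usd / 5000 * days_per_5k

# A Professional org at $4,000 MRR with a $30,000 implementation package:
assert year1_total_contract_value(4000, 30000) == 78000
assert ps_delivery_budget_days(30000) == 9.0
```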

Shadow trial MC quota (F8 — §68): Free/shadow trial orgs are limited to 100 MC simulation runs per month (organisations.monthly_mc_run_quota = 100). Enforcement at POST /api/v1/decay/predict:

if org.subscription_tier in ('shadow_trial',) and org.monthly_mc_run_quota > 0:
    runs_this_month = get_monthly_mc_run_count(org_id)
    if runs_this_month >= org.monthly_mc_run_quota:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "monthly_quota_exceeded",
                "quota": org.monthly_mc_run_quota,
                "used": runs_this_month,
                "resets_at": first_of_next_month().isoformat(),
                "upgrade_url": "/settings/billing"
            }
        )

Commercial controls must not interrupt active operations. If the organisation is in an active TIP / CRITICAL operational state, quota exhaustion is logged and surfaced to commercial/admin dashboards but enforcement is deferred until the event closes.
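The deferral rule above can be combined with the quota check as a pure decision function — a sketch; `OrgQuotaState` and all field names are illustrative, not the real ORM model:

```python
from dataclasses import dataclass

@dataclass
class OrgQuotaState:
    tier: str
    quota: int                    # organisations.monthly_mc_run_quota
    runs_this_month: int
    active_critical_event: bool   # active TIP / CRITICAL operational state

def should_enforce_quota(s: OrgQuotaState) -> bool:
    """True only when the 429 in the snippet above should actually be raised."""
    if s.tier != "shadow_trial" or s.quota <= 0:
        return False              # commercial tiers are not metered here
    if s.runs_this_month < s.quota:
        return False              # quota not yet exhausted
    if s.active_critical_event:
        return False              # defer: log + surface, never block live ops
    return True
```

The active-event branch is where "logged and surfaced to commercial/admin dashboards" happens in the real endpoint; enforcement resumes once the event closes.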


27.8 Redis Memory Budget

Reference document: docs/infra/REDIS_SIZING.md — sizing rationale and eviction policy decisions.

Redis serves three distinct purposes with different memory characteristics. Using a single Redis instance (with separate DB indexes for broker vs. cache) requires explicit memory budgeting:

| Purpose | DB index | Key pattern | Estimated peak memory | Eviction policy |
|---|---|---|---|---|
| Celery broker + result backend | DB 0 | celery-task-meta-*, _kombu.* | 500 MB (500 MC sub-tasks × ~1 MB results) | noeviction |
| celery-redbeat schedule | DB 1 | redbeat:* | < 1 MB | noeviction |
| WebSocket session tracking | DB 2 | spacecom:ws:*, spacecom:active_tip:* | < 10 MB | noeviction |
| Application cache (CZML, NOTAM) | DB 3 | spacecom:cache:* | 50–200 MB | allkeys-lru |
| Redis Pub/Sub fan-out (alerts) | — | spacecom:alert:* channels | Transient; ~1 KB/message | N/A (pub/sub, no persistence) |
| Total budget | | | ~700–750 MB peak | |
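The budget total and the headroom claim below can be checked mechanically — a sketch; key names are illustrative:

```python
# Sanity check of the Redis memory budget table (values in MB; the
# application cache is taken at the upper bound of its 50-200 MB range).
BUDGET_MB = {
    "celery_broker_and_results": 500,  # DB 0
    "redbeat_schedule": 1,             # DB 1
    "websocket_sessions": 10,          # DB 2
    "app_cache_peak": 200,             # DB 3
}
peak_mb = sum(BUDGET_MB.values())
assert 700 <= peak_mb <= 750           # matches the stated peak total
assert 2048 / peak_mb >= 2.5           # maxmemory 2gb gives >= 2.5x headroom
```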

Sizing decision: Use cache.r6g.large (8 GB RAM) with maxmemory 2gb — provides 2.5× headroom above peak estimate for burst conditions (multiple simultaneous MC runs × result backend). Set maxmemory-policy noeviction globally; the application cache (DB 3) must handle cache misses gracefully (it does — CZML regeneration on miss is defined in §6).

Redis memory alert: Add Grafana alert redis_memory_used_bytes > 1.5GB → WARNING; > 1.8GB → CRITICAL. At CRITICAL, check for result backend accumulation (expired Celery results not cleaned up) before scaling.

Redis result cleanup: Celery result_expires must be set to 3600 (1 hour). Verify in backend/celeryconfig.py:

result_expires = 3600  # Clean up MC sub-task results after 1 hour

28. Human Factors Framework

SpaceCom is a safety-critical decision support system used by time-pressured operators in aviation operations rooms. Human factors are not a UX concern — they are a safety assurance concern. This section documents the HF design requirements, standards basis, and validation approach.

Standards basis: ICAO Doc 9683 (Human Factors in Air Traffic Management), FAA AC 25.1329 (Flight Guidance Systems — alert prioritisation philosophy), EUROCONTROL HRS-HSP-005, ISA-18.2 (alarm management, adapted for ATC context), Endsley (1995) Situation Awareness model.


28.1 Situation Awareness Design Requirements

SpaceCom must support all three levels of Endsley's SA model for Persona A (ANSP duty manager):

| SA Level | Requirement | Implementation | Time target |
|---|---|---|---|
| Level 1 — Perception | Correct hazard information visible at a glance | Globe with urgency symbols; active events panel; risk level badges | ≤ 5 seconds from alert appearance — icon, colour, and position alone must convey object + risk level without reading text |
| Level 2 — Comprehension | Operator understands what the hazard means for their sector | Plain-language event cards; window range notation; FIR intersection list; data confidence indicators | ≤ 15 seconds to identify earliest FIR intersection window and whether it falls within the operator's sector |
| Level 3 — Projection | Operator can anticipate future state without simulation tools | Corridor Evolution widget (T+0/+2/+4h); Gantt timeline; space weather buffer callout | ≤ 30 seconds to determine whether the corridor is expanding or contracting using the Corridor Evolution widget |

These time targets are pass/fail criteria for the Phase 2 ANSP usability test (§28.7).

Globe visual information hierarchy (F7 — §60): The globe displays objects, corridors, hazard zones, FIR boundaries, and ADS-B routes simultaneously. Under operational stress, operators must not be required to search for the critical element — it must be pre-attentively distinct. The following hierarchy is mandatory and enforced by the rendering layer:

| Priority | Element | Visual treatment | Pre-attentive channel |
|---|---|---|---|
| 1 — Immediate | Active CRITICAL object | Flashing red octagon (2 Hz; reduced-motion: static + thick border) + label always visible | Motion + colour + shape |
| 2 — Urgent | Active HIGH object | Amber triangle, label visible at zoom ≥ 4 | Colour + shape |
| 3 — Monitor | Active MEDIUM object | Yellow circle, label on hover | Colour + shape |
| 4 — Context | Re-entry corridors (p05–p95) | Semi-transparent red fill, no label until hover | Colour + opacity |
| 5 — Awareness | FIR boundary overlay | Thin white lines, low opacity (30%) | Position |
| 6 — Background | ADS-B routes | Thin grey lines, visible only at zoom ≥ 5 | Position |
| 7 — Ambient | All other tracked objects | Small white dots, no label until hover | Position |

Rule: no element at priority N may be more visually prominent than an element at priority N-1. The rendering layer enforces draw order and applies opacity/size reduction to lower-priority elements when a priority-1 element is present. This is a non-negotiable safety requirement — a CesiumJS performance optimisation that re-orders draw calls or flattens layers must not override this hierarchy. An operator who cannot reach SA Level 1 in ≤ 5 seconds on a CRITICAL alert constitutes a design failure requiring a redesign cycle before shadow deployment; the numeric target exists so the usability test produces a pass/fail result rather than an opinion.
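The prominence rule is testable as an invariant. A sketch — "prominence" here is an illustrative scalar that the real rendering layer would derive from size, opacity, and motion:

```python
def hierarchy_ok(elements) -> bool:
    """elements: iterable of (priority, prominence) pairs; priority 1 = highest.

    True iff no element is more prominent than any element of a higher
    priority, i.e. maximum prominence is non-increasing as the priority
    number grows (equal prominence across priorities is permitted).
    """
    max_by_priority = {}
    for prio, prom in elements:
        max_by_priority[prio] = max(prom, max_by_priority.get(prio, 0.0))
    prios = sorted(max_by_priority)
    return all(max_by_priority[a] >= max_by_priority[b]
               for a, b in zip(prios, prios[1:]))
```

A frontend regression test could sample the scene graph after each render pass and assert `hierarchy_ok` whenever a priority-1 element is present.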

Level 3 SA support is specifically identified as a gap in pure corridor-display systems and is addressed by the Corridor Evolution widget (§6.8).


28.2 Mode Error Prevention

Mode confusion is the most common cause of automation-related incidents in aviation. SpaceCom has three operational modes (LIVE / REPLAY / SIMULATION) that must be unambiguously distinct at all times.

Mode error prevention mechanisms:

  1. Persistent mode indicator pill in top nav — never hidden, never small
  2. Mode-switch dialogue with explicit current-mode, target-mode, and consequence statements (§6.3)
  3. Future-preview temporal wash when the timeline scrubber is not at current time (§6.3)
  4. Optional disable_simulation_during_active_events org setting to block simulation entry during live incidents (§6.3)
  5. Audio alerts suppressed in SIMULATION and REPLAY modes
  6. All simulation-generated records have simulation_id IS NOT NULL — they cannot appear in operational views

28.3 Alarm Management

Alarm management requirements follow the principle: every alarm should demand action, every required action should have an alarm, and no alarm should be generated that does not demand action.

Alarm rationalisation:

  • CRITICAL: demands immediate action — full-screen banner + audio
  • HIGH: demands timely action — persistent badge + acknowledgement required
  • MEDIUM: informs — toast, auto-dismiss, logged
  • LOW: awareness only — notification centre

Alarm management philosophy and KPIs (F1 — §60): SpaceCom adopts the EEMUA 191 / ISA-18.2 alarm management framework adapted for space/aviation operations. The following KPIs are measured quarterly by Persona D and included in the ESA compliance artefact package:

| EEMUA 191 KPI | Target | Definition |
|---|---|---|
| Alarm rate (steady-state) | < 1 alarm per 10 minutes per operator | Alarms requiring attention across all levels; excludes LOW awareness-only |
| Nuisance alarm rate | < 1% of all alarms | Alarms acknowledged as MONITORING within 30s without any other action — indicates no actionable information |
| Stale alarms | 0 CRITICAL unacknowledged > 10 min | Unacknowledged CRITICAL alerts older than 10 minutes; triggers supervisor notification (F8) |
| Alarm flood threshold | < 10 CRITICAL alarms within 10 minutes | Beyond this rate, an alert storm meta-alert fires and the batch-flood suppression protocol activates |
| Chattering alarms | 0 | Any alarm that fires and clears more than 3 times in 30 minutes without operator action |

Alarm quality requirements:

  • Nuisance alarm rate target: < 1 LOW alarm per 10 minutes per user in steady-state operations (logged and reviewed quarterly by Persona D)
  • Alert deduplication: consecutive window-shrink events do not re-trigger CRITICAL if the threshold was not crossed
  • 4-hour per-object CRITICAL rate limit prevents alarm flooding from a single event
  • Alert storm meta-alert disambiguates between genuine multi-object events and system integrity issues (§6.6)

Batch TIP flood handling (F2 — §60): Space-Track releases TIP messages in batches — a single NOAA solar storm event can produce 50+ new TIP entries within a 10-minute window. Without mitigation, this generates 50 simultaneous CRITICAL alerts, constituting an alarm flood that exceeds EEMUA 191 KPIs and cognitively overwhelms the operator.

Protocol when ingest detects ≥ 5 new TIP messages within a 5-minute window:

  1. Batch gate activates: Individual CRITICAL banners suppressed for objects 2–N of the batch. Object 1 (highest-priority by predicted Pc or earliest window) receives the standard CRITICAL banner.
  2. Batch summary alert fires: A single HIGH-level "Batch TIP event: N objects with new TIP data" summary appears in the notification centre. The summary is actionable — it links to a pre-filtered catalog view showing all newly-TIP-flagged objects sorted by predicted re-entry window.
  3. Batch event logged: A batch_tip_event record is created in alert_events with trigger_type = 'BATCH_TIP', affected_objects = [NORAD ID list], and batch_size = N. This is distinct from individual object alert records.
  4. Per-object alerts queue: Individual CRITICAL alerts for objects 2–N are queued and delivered at a maximum rate of 1 per minute, only if the operator has not opened the batch summary view within 5 minutes of the batch gate activating. This prevents indefinite suppression while preventing flood.

The threshold (≥ 5 TIP in 5 minutes) and maximum queue delivery rate (1/min) are configurable per-org via org-admin settings, subject to minimum values (≥ 3 and ≤ 2/min respectively) to prevent safety-defeating misconfiguration.
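The gate decision reduces to a small pure function — a sketch with the default thresholds above; the safety floor mirrors the "≥ 3 minimum" configurability rule:

```python
def classify_tip_batch(new_tip_count: int,
                       batch_threshold: int = 5,
                       min_threshold: int = 3) -> str:
    """Decide the alert presentation mode for a burst of new TIP messages
    observed within the detection window."""
    # Per-org configuration cannot drop below the safety floor:
    threshold = max(batch_threshold, min_threshold)
    if new_tip_count >= threshold:
        # Object 1 gets the standard CRITICAL banner; objects 2-N are
        # queued behind a single HIGH batch-summary alert.
        return "batch_gate"
    return "individual_alerts"
```

A 50-object solar-storm batch thus produces one CRITICAL banner plus one HIGH summary instead of 50 simultaneous CRITICAL alerts.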

Audio alarm specification (F11 — §60):

  • Two-tone ascending chime: 261 Hz (C4) followed by 392 Hz (G4), each 250ms, 20ms fade-in/out (not siren — ops rooms have sirens from other systems already)
  • Conforms to EUROCAE ED-26 / RTCA DO-256 advisory alert audio guidelines (advisory category — attention-getting without startle)
  • Plays once on first presentation; does not loop automatically
  • Re-alert on missed acknowledgement: If a CRITICAL alert remains unacknowledged for 3 minutes, the chime replays once. Replays at most once — the second chime is the final audio prompt. Further escalation is via supervisor notification (F8), not repeated audio (which would cause habituation)
  • Stops on acknowledgement — not on banner dismiss; banner dismiss without acknowledgement is not permitted for CRITICAL severity
  • Per-device volume control via OS; per-session software mute (persists for session only; resets on next login to prevent operators permanently muting safety alerts)
  • Enabled by org-level "ops room mode" setting (default: off); must be explicitly enabled by org admin — not auto-enabled to prevent unexpected audio in environments where audio is not appropriate
  • Volume floor in ops room mode: minimum 40% of device maximum; operators cannot mute below this floor when ops room mode is active (configurable per-org, minimum 30%)

Startle-response mitigation — sudden full-screen CRITICAL banners cause ~5 seconds of degraded cognitive performance in research studies. The following rules prevent cold-start startle:

  1. Progressive escalation mandatory: A CRITICAL alert may only be presented full-screen if the same object has already been in HIGH state for ≥ 1 minute during the current session. If the alert arrives cold (no prior HIGH state), the system must hold the alert in HIGH presentation for 30 seconds before upgrading to CRITICAL full-screen. Exception: impact_time_minutes < 30 bypasses the 30s hold.
  2. Audio precedes visual by 500ms: The two-tone chime fires 500ms before the full-screen banner renders. This primes the operator's attentional system and eliminates the startle peak.
  3. Banner is overlay, not replacement: The CRITICAL full-screen banner is a dimmed overlay (backdrop rgba(0,0,0,0.72)) rendered above the corridor map - the map, aircraft positions, and FIR boundaries remain visible beneath it. The banner must never replace the map render, as spatial context is required for the decision the operator is being asked to make.
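Rules (1)–(3), together with the imminent-impact and data-integrity bypasses from the override matrix below, reduce to a presentation decision. A sketch — names and units (seconds/minutes) are illustrative, and the 500ms audio lead is handled by the renderer, not here:

```python
def presentation_for_critical(prior_high_seconds: float,
                              held_in_high_seconds: float,
                              impact_time_minutes: float,
                              data_integrity_compromise: bool = False) -> str:
    """How a newly-CRITICAL alert may be presented right now."""
    if impact_time_minutes < 30 or data_integrity_compromise:
        return "full_screen_critical"   # immediate-bypass cases
    if prior_high_seconds >= 60:
        return "full_screen_critical"   # operator already primed by HIGH state
    if held_in_high_seconds < 30:
        return "high_hold"              # cold start: hold in HIGH for 30s first
    return "full_screen_critical"       # hold elapsed; upgrade permitted
```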

Cross-hat alert override matrix: The Human Factors, Safety, and Regulatory hats jointly approve the following override rule set:

  • impact_time_minutes < 30 or equivalent imminent-impact state: bypass progressive delay; immediate full-screen CRITICAL permitted
  • data-integrity compromise (HMAC_INVALID, corrupted prediction provenance, or equivalent): immediate full-screen CRITICAL permitted
  • degraded-data or connectivity-only events without direct hazard change: progressive escalation remains mandatory
  • all immediate-bypass cases require explicit rationale in the alert type definition and traceability into the safety case and hazard log

CRITICAL alert accessibility requirements (F2): When the CRITICAL alert banner renders:

  • focus() is called on the alert dialog element programmatically
  • role="alertdialog" and aria-modal="true" on the banner container
  • aria-labelledby points to the alert title; aria-describedby points to the conjunction summary text
  • aria-hidden="true" set on the map container while the alertdialog is active; removed on dismiss
  • aria-live="assertive" region announces alert title immediately on render (separate from the dialog, for screen readers that do not expose alertdialog role automatically)
  • Visible text status indicator "⚠ Audio alert active" accompanies the audio tone for deaf or hard-of-hearing operators (audio-only notification is not sufficient as a sole channel)
  • All alert action buttons reachable by Tab from within the dialog; Escape closes only if the alert has a non-CRITICAL severity; CRITICAL requires explicit category selection before dismiss

Alarm rationalisation procedure — alarm systems degrade over time through threshold drift and alert-to-alert desensitisation. The following procedure is mandatory:

  • Persona D (Operations Analyst) reviews alert event logs quarterly
  • Any alarm type that fired ≥ 5 times in a 90-day period and was acknowledged as MONITORING ≥ 90% of the time is a nuisance alarm candidate — threshold review required before next quarter
  • Any alarm threshold change must be recorded in alarm_threshold_audit (object, old threshold, new threshold, reviewer, rationale, date); immutable append-only
  • ANSP customers may request threshold adjustments for their own organisation via the org-admin settings; changes take effect after a mandatory 7-day confirmation period and are logged in alarm_threshold_audit
  • Alert categories that have never triggered a NOTAM_ISSUED or ESCALATING acknowledgement in 12 months are escalated to Persona D for review of whether the alert should be demoted one severity level

Habituation countermeasures — repeated identical stimuli produce reduced response (habituation). The following design rules counteract alarm habituation:

  • CRITICAL audio uses two alternating tones (261 Hz and 392 Hz, ~0.25s each); the alternation pattern is varied pseudo-randomly within the specification range so the exact sound is never identical across sessions
  • CRITICAL banner background colour cycles through two dark-amber shades (#7B4000 / #6B3400) at 1 Hz — subtle variation without strobing, enough to maintain arousal without inducing distraction
  • Per-object CRITICAL rate limit (4-hour window) prevents habituation to a single persistent event
  • alert_events habituation report: any operator who has acknowledged ≥ 20 alerts of the same type in a 30-day window without a single ESCALATING or NOTAM_ISSUED response is flagged for supervisor review — this indicates potential habituation or threshold misconfiguration

Reduced-motion support (F10): WCAG 2.3.3 (Animation from Interactions — Level AAA) and WCAG 2.3.1 (Three Flashes or Below Threshold — Level A) apply. The 1 Hz CRITICAL banner colour cycle and any animated corridor rendering must respect the OS-level prefers-reduced-motion: reduce media query:

/* Default: animated */
.critical-banner { animation: amber-cycle 1s step-end infinite; }

/* Reduced motion: static high-contrast state */
@media (prefers-reduced-motion: reduce) {
  .critical-banner {
    animation: none;
    background-color: #7B4000;
    border: 4px solid #FFD580; /* thick static border as redundant indicator */
  }
}

Fatigue and cognitive load monitoring (F8 — §60): Operators on long shifts exhibit reduced alertness. The following server-side rules trigger supervisor notifications without requiring operator interaction:

| Condition | Trigger | Supervisor notification |
|---|---|---|
| Unacknowledged CRITICAL alert | > 10 minutes without acknowledgement | Push + email to org supervisor role: "CRITICAL alert unacknowledged for 10 minutes — [object, time]" |
| Stale HIGH alert | > 30 minutes without acknowledgement | Push to org supervisor: "HIGH alert unacknowledged for 30 minutes" |
| Long session without interaction | Logged-in operator: no UI interaction for 45 min during active event | Push to operator + supervisor: "Possible inactivity during active event — please verify" |
| Shift duration exceeded | Session age > org.shift_duration_hours (default 8h) | Non-blocking reminder to operator: "Your shift duration setting is 8 hours — consider handover" |

Supervisor notifications are sent to users with org_admin or supervisor role. If no supervisor role is configured for the org, the notification escalates to SpaceCom internal ops via the existing PagerDuty route with severity: warning. All supervisor notifications are logged to security_logs with event_type = SUPERVISOR_NOTIFICATION.

For CesiumJS corridor animations: check window.matchMedia('(prefers-reduced-motion: reduce)').matches on mount; if true, disable trajectory particle animation (Mode C) and set corridor opacity to a static value instead of pulsing. The preference is re-checked on change via addEventListener('change', ...) without requiring a page reload.


28.4 Probabilistic Communication to Non-Specialist Operators

Re-entry timing predictions are inherently probabilistic. Aviation operations personnel (Persona A/C) are trained in operational procedures, not orbital mechanics. The following design rules ensure probabilistic information is communicated without creating false precision or misinterpretation:

  1. No ± notation for Persona A/C — use explicit window ranges (08h–20h from now) with a "most likely" label; all absolute times rendered as HH:MMZ (e.g., 14:00Z) or DD MMM YYYY HH:MMZ (e.g., 22 MAR 2026 14:00Z) per ICAO Doc 8400 UTC-suffix convention; the Z suffix is not a tooltip — it is always rendered inline
  2. Space weather impact as operational buffer, not percentage — "Add ≥ 2h beyond the 95th percentile", not "+18% wider uncertainty"
  3. Mode C particles require a mandatory first-use overlay explaining that particles are not equiprobable; weighted opacity down-weights outliers (§6.4)
  4. "What does this mean?" expandable panel on Event Detail for Persona C (incident commanders) explaining the window in operational terms
  5. Data confidence badges contextualise all physical property estimates — unknown source triggers a warning callout above the prediction panel
  6. Tail risk annotation (F10): The p5–p95 window is the primary display, but a 10% probability of re-entry outside that range is operationally significant. Below the primary window, display: "Extreme case (2% probability outside this range): [p01_reentry_time]Z – [p99_reentry_time]Z" — labelled clearly as a tail annotation, not the primary window. This annotation is shown only when p99_reentry_time - p01_reentry_time > 1.5 × (p95_reentry_time - p05_reentry_time) (i.e., the tails are materially wider than the primary window). Also included as a footnote in NOTAM drafts when this condition is met.
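The F10 display condition is a one-line predicate. A sketch — times are expressed in hours here for readability; the real fields are timestamps:

```python
def show_tail_annotation(p01: float, p05: float, p95: float, p99: float) -> bool:
    """True when the p01-p99 tails are materially wider than the p05-p95
    primary window, per the 1.5x rule above."""
    return (p99 - p01) > 1.5 * (p95 - p05)
```

For example, a primary window of 8h with tails spanning 26h triggers the annotation; tails spanning 12h do not.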

28.5 Error Recovery and Irreversible Actions

| Action | Recovery mechanism |
|---|---|
| Analyst runs prediction with wrong parameters | superseded_by FK on reentry_predictions — marks old run as superseded; UI shows warning banner; original record preserved |
| Controller accidentally acknowledges CRITICAL alert | Two-step confirmation; structured category selection (see below) + optional free text; append-only audit log preserves full record |
| Analyst shares link to superseded prediction | "⚠ Superseded — see [newer run]" banner appears on the superseded prediction page for any viewer |
| Operator enters SIMULATION during live incident | disable_simulation_during_active_events org setting blocks mode switch while unacknowledged CRITICAL/HIGH alerts exist |

Structured acknowledgement categories — replaces 10-character text minimum. Research consistently shows forced-text minimums under time pressure produce reflexive compliance (1234567890, aaaaaaaaaa) rather than genuine engagement, creating audit noise rather than evidence:

export const ACKNOWLEDGEMENT_CATEGORIES = [
  { value: 'NOTAM_ISSUED',       label: 'NOTAM issued or requested' },
  { value: 'COORDINATING',       label: 'Coordinating with adjacent FIR' },
  { value: 'MONITORING',         label: 'Monitoring — no action required yet' },
  { value: 'ESCALATING',         label: 'Escalating to incident command' },
  { value: 'OUTSIDE_MY_SECTOR',  label: 'Outside my sector — passing to responsible unit' },
  { value: 'OTHER',              label: 'Other (free text required below)' },
] as const;
// Category selection is mandatory. Free text is optional except when value = 'OTHER'.
// alert_events.action_taken stores the category code; action_notes stores optional text.

Acknowledgement form accessibility requirements (F3):

  • Each category option rendered as <input type="radio"> with an explicit <label for="..."> — no ARIA substitutes where native HTML suffices
  • The radio group wrapped in <fieldset> with <legend>Select acknowledgement category</legend>
  • The keyboard shortcut Alt+A documented via aria-keyshortcuts="Alt+A" on the alert panel trigger element
  • A visible keyboard shortcut legend displayed within the acknowledgement dialog: "Keyboard: Alt+A to focus · Tab to change category · Enter to submit"
  • Free-text field (OTHER) labelled <label for="action_notes">Describe action taken (required)</label>; aria-required="true" when OTHER is selected
  • On submit, a screen-reader-visible confirmation announced via aria-live="polite": "Acknowledgement recorded: [category label]"

Keyboard-completable acknowledgement flow — CRITICAL acknowledgement must be completable in ≤ 3 keyboard interactions from any application state (operators frequently work with one hand on radio PTT):

Alt+A   → focus most-recent active CRITICAL alert in alert panel
Enter   → open acknowledgement dialogue (category pre-selected: MONITORING)
Enter   → submit (Tab to change category; free-text field skipped unless OTHER selected)

This keyboard path must be documented in the operator quick-reference card and tested in the Phase 2 usability study against the ≤ 3 interaction target.


28.5a Shift Handover

Shift handover is a high-risk transition point: situational awareness held by one operator must be reliably transferred to a second operator under time pressure. Recorded aviation safety events have repeatedly involved information loss at handover. SpaceCom must not become a contributing factor.

Handover screen (Persona A/C): Dedicated /handover view within Secondary Display Mode (§6.20). Accessible from main nav; also triggered automatically when an operator session exceeds org.shift_duration_hours (configurable; default: 8h).

The handover screen shows:

  1. All active CRITICAL and HIGH alerts with current status and acknowledgement history
  2. Any unresolved multi-ANSP coordination threads (§6.9)
  3. Recent window-change events (last 2h) in reverse chronological order
  4. Free-text handover notes field (plain text, ≤ 2,000 characters)
  5. "Accept handover" button — records handover event with both operator IDs and timestamp

Handover record schema:

CREATE TABLE shift_handovers (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organisations(id),
    outgoing_user   UUID NOT NULL REFERENCES users(id),
    incoming_user   UUID NOT NULL REFERENCES users(id),
    handed_over_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    notes           TEXT,                          -- operator free text, ≤ 2000 chars
    active_alerts   JSONB NOT NULL DEFAULT '[]',   -- snapshot of alert IDs + status at handover
    open_coord_threads JSONB NOT NULL DEFAULT '[]', -- snapshot of open coordination thread IDs
    CHECK (incoming_user <> outgoing_user)          -- handover integrity rule enforced in-schema
);

CREATE INDEX ON shift_handovers (org_id, handed_over_at DESC);

Handover integrity rules:

  • incoming_user must be a different users.id from outgoing_user
  • active_alerts and open_coord_threads are system-populated snapshots — the outgoing operator cannot edit them; only notes is free-form
  • Handover record is immutable after creation; retained for 7 years (aviation safety audit basis)
  • If a CRITICAL alert fires within 5 minutes of a handover record being created, the alert email/push notification includes a "⚠ Alert during handover window" flag so the incoming operator and their supervisor are aware

Structured SA transfer prompts (F4 — §60): The handover notes field (free text) is insufficient for reliable SA transfer under time pressure. The handover screen must also include a structured prompt section that the outgoing operator completes — mapping to Endsley's three SA levels:

| SA Level | Structured prompt | Type |
|----------|-------------------|------|
| Level 1 — Perception | "Active objects of concern right now:" | Multi-select from current TIP-flagged objects |
| Level 2 — Comprehension | "My assessment of the most critical object:" | Dropdown: Within sector / Adjacent sector / Low confidence / Not a concern yet + optional text |
| Level 3 — Projection | "Expected development in next 2 hours:" | Dropdown: Window narrowing / Window stable / Window widening / Awaiting new prediction + optional text |
| Decision context | "Actions I have taken or initiated:" | Multi-select from ACKNOWLEDGEMENT_CATEGORIES + free text |
| Handover flags | "Incoming operator should know:" | Checkboxes: Space weather active, Pending coordination thread, Degraded data, Unusual pattern |

The structured prompts are optional (the outgoing operator cannot be forced to complete them under time pressure) but their completion status is recorded. If the outgoing operator submits handover without completing any structured prompts, a non-blocking warning appears: "Structured SA transfer not completed — incoming operator will rely on notes only." Completion rate is reported quarterly as a human factors KPI.
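The quarterly completion-rate KPI can be derived directly from handover records. A minimal sketch, assuming each record exposes its structured prompt answers as a `structured_prompts` dict (the field name is an assumption for illustration):

```python
def structured_prompt_completion_rate(handovers: list[dict]) -> float:
    """HF KPI: fraction of handover records in which the outgoing operator
    completed at least one structured SA prompt (any non-empty answer)."""
    if not handovers:
        return 0.0
    completed = sum(
        1 for h in handovers
        if any(v not in (None, "", [], {})
               for v in h.get("structured_prompts", {}).values())
    )
    return completed / len(handovers)
```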

Session timeout accessibility (F8): WCAG 2.2.1 (Timing Adjustable — Level A) requires users be warned before session expiry and given the opportunity to extend. For operators completing a handover (which may take longer for users with cognitive or motor impairments):

  • At T−2 minutes before session expiry: an aria-live="polite" announcement fires and a non-modal warning dialog appears: "Your session will expire in 2 minutes. [Extend session] [Save and log out]"
  • If the /handover view is active when the warning fires, the session is automatically extended by 30 minutes without user interaction (silently); the warning dialog is suppressed; the extension is logged in security_logs with event_type = SESSION_AUTO_EXTENDED_HANDOVER
  • The silent auto-extension only applies once per session to prevent indefinite extension; after the 30-minute extension the standard warning dialog fires normally
  • Session extension endpoint: POST /api/v1/auth/extend-session — returns a new expiry timestamp; requires valid current session cookie
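A sketch of the once-per-session auto-extension rule; `Session` and `on_expiry_warning` are illustrative names, not the actual session implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

AUTO_EXTENSION = timedelta(minutes=30)

@dataclass
class Session:
    expires_at: datetime
    auto_extended: bool = False  # the silent extension may fire once per session

def on_expiry_warning(session: Session, handover_view_active: bool,
                      now: datetime) -> str:
    """Decide what happens when the T−2-minute expiry warning fires.

    Returns 'auto_extended' (silent 30-minute extension; logged in
    security_logs as SESSION_AUTO_EXTENDED_HANDOVER — logging call
    omitted in this sketch) or 'warn' (show the standard dialog).
    """
    if handover_view_active and not session.auto_extended:
        session.expires_at = now + AUTO_EXTENSION
        session.auto_extended = True
        return "auto_extended"
    return "warn"
```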

### 28.6 Cognitive Load Reduction

Event Detail Duty Manager View: Decluttered large-text view for Persona A showing only window, FIRs, risk level, and three action buttons. Collapses all technical detail. Designed for ops room use at a secondary glance distance. (§6.8)

Decision Prompts accordion (formerly "Response Options"): Contextualised checklist of possible ANSP actions. Not automated — for consideration only. Checkbox states create a lightweight action record without requiring Persona A to open a separate logging system. (§6.8)

The feature is renamed from "Response Options" to "Decision Prompts" throughout UI text, documentation, and API field names. "Options" implies equivalence; "Prompts" correctly signals that the list is an aide-mémoire, not a prescribed workflow.

Legal treatment of Decision Prompts: Every Decision Prompts accordion must display the following non-waivable disclaimer in 11px grey text immediately below the accordion header:

"Decision Prompts are non-prescriptive aide-mémoire items generated from common ANSP practice. They do not constitute operational procedures. All decisions remain with the duty controller in accordance with applicable air traffic regulations and your organisation's established procedures."

This disclaimer is: (a) hard-coded, not configurable; (b) included in the printed/exported Event Detail report; (c) present in the API response for Decision Prompts payloads ("legal_notice" field). Rationale: SpaceCom is decision support, not decision authority. Without an explicit disclaimer, a regulator or court could interpret a checked Decision Prompt item as evidence of a prescribed procedure not followed.

Decision prompt content template (F6 — §60): Each Decision Prompt entry must provide four fields to be actionable under operational stress:

```typescript
interface DecisionPrompt {
  id: string;
  risk_summary: string;       // Plain-language risk in ≤ 20 words. No jargon. No Pc values.
  action_options: string[];   // Specific named actions available to this operator role
  time_available: string;     // "Decision window: X hours before earliest FIR intersection"
  consequence_note?: string;  // Optional: consequence of inaction (shown only if significant)
}

// Example for a re-entry/FIR intersection:
const examplePrompt: DecisionPrompt = {
  id: 'reentry_fir_intersection',
  risk_summary: 'Object expected to re-enter atmosphere over London FIR within 8–14 hours.',
  action_options: [
    'Issue precautionary NOTAM for affected flight levels',
    'Coordinate with adjacent FIR controllers (Paris, Amsterdam)',
    'Notify airline operations centres in affected region',
    'Continue monitoring — no action required yet',
  ],
  time_available: 'Decision window: ~6 hours before earliest FIR intersection (08:00Z)',
  consequence_note: 'If window narrows below 4 hours without NOTAM, affected departures may require last-minute rerouting.',
};
```

Decision Prompts are pre-authored for each alert scenario type in docs/decision-prompts/ and reviewed annually by a subject-matter expert from an ANSP partner. They are not auto-generated by the system. New prompt types require approval from both the SpaceCom safety case owner and at least one ANSP reviewer.

Legal sufficiency note (F5): The in-UI disclaimer is a reinforcing reminder only. Under UCTA 1977 and the EU Unfair Contract Terms Directive, liability limitation requires that the customer was given a reasonable opportunity to discover and understand the term at contract formation. The substantive liability limitation clause (consequential loss excluded; aggregate cap = 12 months fees paid) must appear in the executed Master Services Agreement (§24.2). The UI disclaimer does not substitute for executed contractual terms.

Decision Prompts accessibility (F9): The accordion must implement the WAI-ARIA Accordion design pattern:

  • Accordion header: <button aria-expanded="true|false" aria-controls="panel-{id}"> (a native <button> already has the button role; no role attribute needed). Enter and Space toggle open/close
  • Panel: <div id="panel-{id}" role="region" aria-labelledby="header-{id}">
  • Arrow keys navigate between accordion items when focus is on a header button
  • Each prompt item: <input type="checkbox" id="prompt-{n}"> with <label for="prompt-{n}"> (a native checkbox exposes its checked state to assistive technology automatically; no aria-checked attribute and no ARIA role substitute should be used)
  • On checkbox state change: aria-live="polite" region announces "Action recorded: [prompt text]"
  • aria-keyshortcuts on the accordion container documents any applicable shortcuts

Attention management — operational environments have high ambient interruption rates. SpaceCom must not become an additional source of cognitive fragmentation:

| State | Interaction rate limit | Rationale |
|-------|------------------------|-----------|
| Steady-state (no active CRITICAL/HIGH) | ≤ 1 unsolicited notification per 10 minutes per user | Preserve peripheral attentional channel for ATC primary tasks |
| Active event (≥ 1 unacknowledged CRITICAL) | ≤ 1 update notification per 60 seconds for the same event | Prevent update flooding during the critical decision window |
| Critical flow (user actively in acknowledgement or handover screen) | Zero unsolicited notifications | Do not interrupt the operator while they are completing a safety-critical task |

Critical flow state is entered when: acknowledgement dialog is open, or /handover view is active. It is exited on dialog close or handover acceptance. During critical flow, all queued notifications are held and delivered as a batch summary immediately on exit.
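The three attention states reduce to a small gating policy. The following sketch (an illustrative class, not the production notifier) shows one way to rate-limit per user and per event, and to hold and batch notifications during critical flow:

```python
from datetime import datetime, timedelta
from typing import Optional

STEADY_INTERVAL = timedelta(minutes=10)   # steady-state limit, per user
ACTIVE_INTERVAL = timedelta(seconds=60)   # active-event limit, per event

class NotificationGate:
    """Attention-management gate for a single user (§28.6 sketch)."""

    def __init__(self):
        self.last_sent: dict[str, datetime] = {}  # key: 'user' or an event id
        self.in_critical_flow = False
        self.held: list[str] = []

    def offer(self, message: str, now: datetime,
              event_id: Optional[str] = None) -> bool:
        """Return True if the notification may be delivered now."""
        if self.in_critical_flow:
            self.held.append(message)  # hold for batch delivery on exit
            return False
        key = event_id or "user"
        interval = ACTIVE_INTERVAL if event_id else STEADY_INTERVAL
        last = self.last_sent.get(key)
        if last is not None and now - last < interval:
            return False  # rate-limited
        self.last_sent[key] = now
        return True

    def exit_critical_flow(self) -> list[str]:
        """On dialog close / handover acceptance: release held items as one batch."""
        self.in_critical_flow = False
        batch, self.held = self.held, []
        return batch
```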

Secondary Display Mode: Chrome-free full-screen operational view optimised for secondary monitor in an ops room alongside existing ATC displays. (§6.20)

First-time user onboarding: New organisations with no configured FIRs see a three-card guided setup rather than an empty globe. (§6.18)


### 28.7 HF Validation Approach

HF design cannot be fully validated by automated tests alone. The following validation activities are planned:

| Activity | Phase | Method |
|----------|-------|--------|
| Cognitive walkthrough of CRITICAL alert handling | Phase 1 | Developer walk-through against §28.3 alarm management requirements |
| ANSP user testing — Persona A operational scenario | Phase 2 | Structured usability test: duty manager handles a simulated TIP event; time-to-decision and error rate measured |
| Multi-ANSP coordination scenario | Phase 2 | Two-ANSP test with shared event; assess whether coordination panel reduces perceived workload vs. out-of-band comms only |
| Mode confusion scenario | Phase 2 | Participants switch between LIVE and SIMULATION; measure rate of mode errors without and with the temporal wash |
| Alarm fatigue assessment | Phase 3 | Review of LOW alarm rate over a 30-day shadow deployment; adjust thresholds if nuisance rate > 1 per 10 minutes per user |
| Final HF review by qualified human factors specialist | Phase 3 | Required for TRL 6 demonstration and ECSS-E-ST-10-12C compliance evidence |

Probabilistic comprehension test items — the Phase 2 usability study must include the following scripted comprehension items delivered verbally to participants after they view a TIP event detail screen. Items are designed to distinguish genuine probabilistic comprehension from confidence masking:

| Item | Correct answer | Common wrong answer (detects) |
|------|----------------|-------------------------------|
| "What does the re-entry window of 08h–20h from now mean — does it mean the object will come down in the middle of that period?" | No — most likely landing is in the modal estimate shown, but the object could land anywhere in the window | "Yes, probably in the middle" — detects false precision from window endpoints |
| "If SpaceCom shows Impact Probability 0.03, should you start evacuating the FIR corridor?" | Not automatically — impact probability is one input; operational decision depends on assets at risk, corridor extent, and existing procedures | "Yes, 0.03 is high for space" — detects calibration gap between space and aviation risk thresholds |
| "The window has just widened by 4 hours. Does that mean SpaceCom detected new debris or a new threat?" | No — window widening usually means updated atmospheric data or a revised mass/BC estimate increased uncertainty | "Yes, something new happened" — detects misattribution of uncertainty update to new threat |
| "SpaceCom shows 'Data confidence: TLE age 4 days'. Does that mean the prediction is wrong?" | No — it means the prediction has higher positional uncertainty; the window should be treated as wider in practice | "Yes, ignore it" — detects over-application of data quality warning |

Participants who answer ≥ 2 items incorrectly indicate a comprehension design failure requiring UI revision before shadow deployment. Target: ≥ 80% correct on each item across the test cohort.
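Scoring these items is mechanical. The following sketch (illustrative names, assuming one boolean per item per participant) flags participants under the ≥ 2-wrong rule and computes per-item pass rates against the 80% target:

```python
def comprehension_flags(results: dict[str, list[bool]]) -> list[str]:
    """Participants answering >= 2 items incorrectly indicate a
    comprehension design failure requiring UI revision."""
    return sorted(p for p, answers in results.items()
                  if answers.count(False) >= 2)

def item_pass_rates(results: dict[str, list[bool]]) -> list[float]:
    """Per-item correct rate across the cohort; target >= 0.8 on each item."""
    n = len(results)
    return [sum(col) / n for col in zip(*results.values())]
```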


### 28.8 Degraded-Data Human Factors

Operators must be able to distinguish "SpaceCom is working normally" from "SpaceCom is working but with reduced fidelity" from "SpaceCom is in a failure state" — three states that require fundamentally different responses. Undifferentiated degradation presentation causes two failure modes: operators continuing to act on stale data as if it were fresh (over-trust), or operators stopping using the system entirely during a tolerable degradation (under-trust).

Visual degradation language:

| State | Indicator | Operator action required |
|-------|-----------|--------------------------|
| All data fresh | Green status pill in system tray (§6.6) | None |
| TLE age ≥ 48h for any active CRITICAL/HIGH object | Amber "⚠ TLE stale" badge on affected event card | Widen mental model of corridor uncertainty; consult space domain Persona B/D |
| EOP data stale (> 7 days) | Amber system badge + `eop_stale` exposed in GET /readyz | Frame transform accuracy reduced; no action required unless close-approach timing is critical |
| Space weather stale (> 2h for active event) | Amber badge on Kp readout in Event Detail | Kp-dependent atmospheric drag estimates are less reliable; apply additional margin |
| AIRAC data > 35 days old | Red "⚠ AIRAC expired" badge on any FIR overlay | FIR boundaries may have changed; do not issue NOTAM text based on SpaceCom FIR names without manual verification |
| Backend unreachable | Full-screen "SpaceCom Offline" modal | No predictions available; fall back to organisational offline procedures |

Graded response rules:

  1. A single stale data source never suppresses the main operational view. Operators must be able to see the event and make decisions; stale data badges are contextual, not blocking.
  2. Multiple simultaneous amber badges (≥ 3) trigger a consolidated "Multiple data sources degraded" yellow banner at top of screen — prevents badge blindness when individual badges are numerous.
  3. The GET /readyz endpoint (§26.5) exposes all staleness states as machine-readable flags. ANSPs may configure their own monitoring to receive readyz alerts via webhook.
  4. Degraded-data states are recorded in system_health_events table and included in the quarterly operational report to Persona D.

Operator quick-reference language for degraded states — the operator quick-reference card must include a "SpaceCom status indicators" section using the exact badge text from the UI (copy-match required). Operators must not need to translate between UI text and documentation text.


### 28.9 Operator Training and Competency Specification (F10 — §60)

SpaceCom is a safety-critical decision support system. ANSP customers deploying it in operational environments will be asked by their safety regulators what training operators received. This section defines the minimum training specification. Individual ANSPs may add requirements; they may not remove them.

Minimum initial training programme:

| Module | Delivery | Duration | Completion criteria |
|--------|----------|----------|---------------------|
| M1 — System overview and safety philosophy | Instructor-led or self-paced e-learning | 2 hours | Quiz score ≥ 80% |
| M2 — Operational interface walkthrough | Instructor-led hands-on with staging environment | 3 hours | Complete reference scenario (see below) |
| M3 — Alert acknowledgement workflow | Scenario-based with role-play | 1 hour | Keyboard-completable ack in ≤ 3 interactions |
| M4 — NOTAM drafting and disclaimer | Instructor-led with sample NOTAMs | 1 hour | Produce a compliant NOTAM draft from a scenario |
| M5 — Degraded mode response | Scenario-based | 30 min | Correctly identify each degraded state + action |
| M6 — Shift handover procedure | Pair exercise | 30 min | Complete a structured handover with SA prompts |

Total minimum initial training: 8 hours. Training is completed before any operational use. Simulator/staging environment only — no training on production data.

Reference scenario (M2): A CRITICAL re-entry alert fires for an object with a 6–14 hour window intersecting two FIRs. The trainee must: acknowledge the alert, identify the FIR intersection, assess the corridor evolution, draft a NOTAM, and complete a handover to a colleague — all within 20 minutes. This scenario is standardised in docs/training/reference-scenario-01.md.

Recurrency requirements:

  • Annual refresher: 2 hours, covering any UI changes in the preceding 12 months + repeat of M3 scenario
  • After any incident where SpaceCom was a contributing factor: mandatory debrief + targeted re-training before return to operational use
  • After a major version upgrade (breaking UI changes): M2 + affected modules before using upgraded system operationally

Competency record model:

```sql
CREATE TABLE operator_training_records (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         INTEGER NOT NULL REFERENCES users(id),
    module_id       TEXT NOT NULL,          -- 'M1'..'M6' or custom ANSP module codes
    completed_at    TIMESTAMPTZ NOT NULL,
    score           INTEGER,                -- quiz score where applicable; NULL for practical
    instructor_id   INTEGER REFERENCES users(id),
    training_env    TEXT NOT NULL DEFAULT 'staging',  -- 'staging' | 'simulator'
    notes           TEXT,
    UNIQUE (user_id, module_id, completed_at)
);
```

GET /api/v1/admin/training-status (org_admin only) returns completion status for all users in the organisation. Users without all required modules completed are flagged; their access is not automatically blocked (the ANSP retains operational responsibility) but the flag is visible to org_admin and included in the quarterly compliance report.
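The aggregation behind GET /api/v1/admin/training-status can be sketched as follows, assuming training rows arrive as plain dicts (the function name and row shape are illustrative):

```python
REQUIRED_MODULES = {"M1", "M2", "M3", "M4", "M5", "M6"}

def missing_modules(user_ids: list[str],
                    records: list[dict]) -> dict[str, list[str]]:
    """Map each user in the organisation to the required modules they have
    not yet completed. Users with an empty list are fully trained; others
    are flagged (access is not blocked — the ANSP retains operational
    responsibility), and the flag feeds the quarterly compliance report."""
    done: dict[str, set[str]] = {u: set() for u in user_ids}
    for r in records:
        if r["user_id"] in done:
            done[r["user_id"]].add(r["module_id"])
    return {u: sorted(REQUIRED_MODULES - mods) for u, mods in done.items()}
```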

Training material ownership: docs/training/ directory maintained by SpaceCom. ANSP-specific scenario variants stored in docs/training/ansp-variants/. Annual review cycle tied to the CHANGELOG review process.

Training records data retention and pseudonymisation (F10 — §64): operator_training_records is personal data — it records when a named individual completed specific training activities. For former employees whose accounts are deleted, these records must not be retained indefinitely as identified personal data.

Retention policy:

  • Active users: retain for the duration of active employment (account status = 'active') plus 2 years after account deletion (for certification audit purposes — an ANSP may need to verify training history after an operator leaves)
  • After 2 years post-deletion: pseudonymise user_id → tombstone token; retain completion dates and module IDs for aggregate training statistics

```sql
-- Add to operator_training_records
ALTER TABLE operator_training_records
    ADD COLUMN pseudonymised_at TIMESTAMPTZ,
    ADD COLUMN user_tombstone   TEXT;  -- SHA-256 prefix of deleted user_id; replaces user_id link
```

The monthly pseudonymise_old_freetext Celery task (§29.3) is extended to also pseudonymise training records where the linked users row has been deleted for > 2 years:

```python
db.execute(text("""
    UPDATE operator_training_records otr
    SET user_tombstone = CONCAT('tombstone:', LEFT(ENCODE(DIGEST(otr.user_id::text, 'sha256'), 'hex'), 16)),
        pseudonymised_at = NOW()
    WHERE otr.pseudonymised_at IS NULL
      AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = otr.user_id)
      AND otr.completed_at < NOW() - INTERVAL '2 years'
"""))
```

---

## 29. Data Protection Framework

SpaceCom processes personal data in the course of providing its services. For EU and UK deployments (ESA bid context), GDPR / UK GDPR compliance is mandatory. For Australian ANSP customers, the Privacy Act 1988 (Cth) applies. This section documents the data protection design requirements.

**Standards basis:** GDPR (EU) 2016/679, UK GDPR, Privacy Act 1988 (Cth), EDPB Guidelines on data breach notification, ICO guidance on legitimate interests, CNIL recommendations on consent records.

---

### 29.1 Data Inventory

**Record of Processing Activities (RoPA) — GDPR Art. 30:** This table constitutes the RoPA. It is maintained in `legal/ROPA.md` (authoritative version) and mirrored here. Organisations with ≥ 250 employees or processing high-risk data must maintain a written RoPA; space traffic management constitutes high-risk processing (Art. 35 DPIA trigger — see below). The DPO must review and sign off the RoPA annually.

| Data type | Personal? | Lawful basis (GDPR Art. 6) | Retention | Table / Location |
|-----------|-----------|---------------------------|-----------|-----------------|
| User email, name, organisation | Yes | Contract performance (Art. 6(1)(b)) | Account lifetime + 1 year after deletion | `users` |
| IP address in security logs | Yes (pseudonymous) | Legitimate interests — security (Art. 6(1)(f)) | **90 days full; hash retained for 7 years** | `security_logs` |
| IP address at ToS acceptance | Yes | Legitimate interests — consent evidence (Art. 6(1)(f)) | **90 days full; hash retained for account lifetime + 1 year** | `users.tos_accepted_ip` |
| Alert acknowledgement text | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Multi-ANSP coordination notes | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Shift handover records | Yes (outgoing/incoming user IDs) | Legitimate interests — aviation safety / operational continuity (Art. 6(1)(f)) | 7 years | `shift_handovers` |
| Alarm threshold audit records | Yes (reviewer ID) | Legitimate interests — safety governance (Art. 6(1)(f)) | 7 years | `alarm_threshold_audit` |
| API request logs | Yes (pseudonymous — IP) | Legitimate interests — security / billing (Art. 6(1)(f)) | 90 days | Log files / SIEM |
| MFA secrets (TOTP) | Yes (sensitive account data) | Contract performance (Art. 6(1)(b)) | Account lifetime; immediately deleted on account deletion | `users.mfa_secret` (encrypted at rest) |
| Space-Track data disclosure log | No (records org-level disclosure, not individuals) | Legitimate interests — licence compliance (Art. 6(1)(f)) | 5 years | `data_disclosure_log` |

**IP address data minimisation policy (F3 — §64):** IP addresses are personal data (CJEU *Breyer*, C-582/14). The full IP address is needed for fraud detection and security investigation within the first 90 days; beyond that, only a hashed form is needed for statistical/audit purposes.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs weekly):
```python
@shared_task
def hash_old_ip_addresses():
    """Replace full IP addresses with SHA-256 hashes after the 90-day audit window."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    db.execute(text("""
        UPDATE security_logs
        SET ip_address = CONCAT('sha256:', LEFT(ENCODE(DIGEST(ip_address, 'sha256'), 'hex'), 16))
        WHERE created_at < :cutoff
          AND ip_address NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.execute(text("""
        UPDATE users
        SET tos_accepted_ip = CONCAT('sha256:', LEFT(ENCODE(DIGEST(tos_accepted_ip, 'sha256'), 'hex'), 16))
        WHERE tos_accepted_at < :cutoff  -- age measured from ToS acceptance, not account creation
          AND tos_accepted_ip NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.commit()
```

Necessity assessment for IP storage (required in DPIA §2): Full IP is necessary for: (a) detecting account takeover (geolocation anomaly), (b) rate-limiting bypass investigation, (c) regulatory/legal requests within the statutory window. Hashed form is sufficient for: (d) long-term audit log integrity (proving an event occurred from a non-obvious source), (e) statistical reporting. The 90-day threshold is the operational window for security investigations; beyond this, benefit does not outweigh data subjects' privacy interests.

DPIA requirement and structure (F1 — §64): GDPR Article 35 mandates a DPIA before processing that is likely to result in high risk. SpaceCom's processing falls under Art. 35(3)(c) — systematic monitoring — because it tracks the operational behaviour of aviation professionals (login times, alert acknowledgements, decision patterns, handover text) in a system used to support safety decisions. This is a pre-processing obligation: EU personal data cannot lawfully be processed without completing the DPIA first.

Document: legal/DPIA.md — a Phase 2 gate (must be complete before any EU/UK ANSP shadow activation).

Required DPIA structure (EDPB WP248 rev.01 template):

| Section | Content required |
|---------|------------------|
| 1. Description of processing | Purpose, nature, scope, context of processing; categories of data; data flows; recipients |
| 2. Necessity and proportionality | Why is this data necessary? Could the purpose be achieved with less data? Legal basis per activity (mapped in §29.1 RoPA) |
| 3. Risk identification | Risks to data subjects: unauthorised access to operational patterns; re-identification of pseudonymised safety records; cross-border transfer exposure; disclosure to authorities |
| 4. Risk mitigation measures | Technical: RLS, HMAC, TLS, MFA, pseudonymisation. Organisational: DPA with ANSPs, export control screening, sub-processor contracts |
| 5. Residual risk assessment | Risk level after mitigations: Low / Medium / High. If High residual risk: prior consultation with supervisory authority required (Art. 36) |
| 6. DPO opinion | Designated DPO's written sign-off or objection |
| 7. Review schedule | DPIA reviewed when processing changes materially; at least every 3 years |

The DPIA covers all processing activities in the RoPA. Key risk finding anticipated: the alert acknowledgement audit trail (who acknowledged what, when) creates a de facto performance monitoring record for individual ANSP controllers — this must be addressed in Section 3 with mitigations in Section 4 (pseudonymisation after operational retention window, access restricted to org_admin and admin roles).

Privacy Notice — must be published at the registration URL and linked from the ToS acceptance flow. Must cover: data controller identity, categories of data collected, purposes and lawful bases, retention periods, data subject rights, third-party processors (cloud provider, SIEM), cross-border transfer safeguards.


### 29.2 Data Subject Rights Implementation

| Right | Mechanism | Notes |
|-------|-----------|-------|
| Access (Art. 15) | GET /api/v1/users/me/data-export — returns all personal data held for the authenticated user as a JSON download | Available to all logged-in users |
| Rectification (Art. 16) | PATCH /api/v1/users/me — allows name, email, organisation update | Email change triggers re-verification |
| Erasure (Art. 17) | POST /api/v1/users/me/erasure-request → calls `handle_erasure_request(user_id)` | See §29.3 |
| Restriction (Art. 18) | Admin-level: `users.access_restricted = TRUE` suspends account without deleting data | Used where erasure conflicts with retention requirement |
| Portability (Art. 20) | POST /org/export (org_admin or admin) — asynchronous export of all org personal data in machine-readable JSON; fulfilled within 30 days; also used for offboarding (§29.8). Covers user-generated content (acknowledgements, handover notes); not derived physics predictions. | F11 |
| Objection (Art. 21) | For legitimate interests processing: handled by erasure or restriction pathway | No automated profiling that would trigger Art. 22 |

### 29.3 Erasure vs. Retention Conflict — Pseudonymisation Procedure

The 7-year retention requirement (UN Liability Convention, aviation safety records) conflicts with GDPR Article 17 right to erasure for personal data embedded in alert_events and security_logs. Resolution: pseudonymise, do not delete.

```python
import hashlib

from sqlalchemy import text
from sqlalchemy.orm import Session


def handle_erasure_request(user_id: int, db: Session):
    """
    Satisfy GDPR Art. 17 erasure request while preserving safety-critical records.
    Called when a user account is deleted or an explicit erasure request is received.
    """
    # Stable pseudonym — deterministic hash of user_id, not reversible
    pseudonym = f"[user deleted - ID:{hashlib.sha256(str(user_id).encode()).hexdigest()[:12]}]"

    # Pseudonymise user references in append-only safety tables
    db.execute(
        text("UPDATE alert_events SET acknowledged_by_name = :p WHERE acknowledged_by = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    db.execute(
        text("UPDATE security_logs SET user_email = :p WHERE user_id = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    # Pseudonymise shift handover records — user ID links cleared (columns are
    # nullable for this purpose); pseudonym prepended to notes so the safety
    # record remains attributable to the erased account without identifying it
    db.execute(
        text("""UPDATE shift_handovers
                SET outgoing_user = NULL, incoming_user = NULL,
                    notes = CASE WHEN outgoing_user = :uid OR incoming_user = :uid
                                 THEN CONCAT('[pseudonymised: ', :p, '] ', COALESCE(notes,''))
                                 ELSE notes END
                WHERE outgoing_user = :uid OR incoming_user = :uid"""),
        {"p": pseudonym, "uid": user_id}
    )
    # Delete the user record itself (and cascade to refresh_tokens, api_keys)
    db.execute(text("DELETE FROM users WHERE id = :uid"), {"uid": user_id})
    db.commit()
    # Log the erasure event (note: this log entry is itself pseudonymised from creation)
    log_security_event("USER_ERASURE_COMPLETED", details={"pseudonym": pseudonym})
```

The core safety records (alert_events, security_logs, reentry_predictions) are preserved. The link to the identified individual is severed. This satisfies GDPR recital 26 (pseudonymous data is not personal data when re-identification is not reasonably possible) and Article 17(3)(b) (erasure obligation does not apply where processing is necessary for compliance with a legal obligation).

Free-text field periodic pseudonymisation (F6 — §64): Handover notes (`shift_handovers.notes`) and alert acknowledgement text (`alert_events.action_taken`) are free-text fields where operators may name colleagues, reference individuals' decisions, or include other personal references. The 7-year retention of these fields as-written creates personal data retained far beyond its operational value. After the operational retention window (2 years — the period within which a re-entry event's record could be actively referenced by an ANSP), free-text personal references must be pseudonymised in place.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs monthly):

```python
@shared_task
def pseudonymise_old_freetext():
    """
    Replace identifiable free-text in operational records after 2-year operational window.
    The record itself is retained; only the human-entered text is sanitised.
    """
    cutoff = datetime.utcnow() - timedelta(days=730)  # 2 years
    # Replace acknowledgement text with sanitised marker — preserve the fact of acknowledgement
    db.execute(text("""
        UPDATE alert_events
        SET action_taken = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND action_taken IS NOT NULL
          AND action_taken NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    # Preserve handover structure; pseudonymise notes text
    db.execute(text("""
        UPDATE shift_handovers
        SET notes = '[text pseudonymised after operational retention window]'
        WHERE handed_over_at < :cutoff
          AND notes IS NOT NULL
          AND notes NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    db.commit()
```

The 2-year operational window is chosen because: (a) PIR processes complete within 5 business days; (b) regulatory investigations of re-entry events typically complete within 12–18 months; (c) 2 years provides margin. Beyond 2 years, the text serves no legitimate purpose that outweighs the data subject's interest in not having their decision-making text retained indefinitely.


### 29.4a Data Subject Access Request Procedure (F7 — §64)

The GET /api/v1/users/me/data-export endpoint exists (§29.2). The DSAR procedure — how requests are received, processed, and responded to within the statutory deadline — must also be documented.

DSAR SLA: 30 calendar days from receipt of the verified request (GDPR Art. 12(3)). Extension to 60 days permitted for complex requests with written notice to the data subject within the first 30 days.
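The deadline arithmetic, per the SLA above (a sketch; `dsar_deadline` is an illustrative helper, not an existing function):

```python
from datetime import date, timedelta

def dsar_deadline(received: date, extended: bool = False) -> date:
    """Response due 30 calendar days from receipt of the verified request;
    a complex request may be extended to 60 days, provided the data
    subject is notified in writing within the first 30 days."""
    return received + timedelta(days=60 if extended else 30)
```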

DSAR procedure (docs/runbooks/dsar-procedure.md):

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Receive request (email to privacy@spacecom.io or in-app POST /api/v1/users/me/data-export-request) | DPO/designated contact | Day 0 |
| 2 | Verify identity of requestor (must be the data subject or authorised representative) | DPO | Within 3 business days |
| 3 | Assess scope: what data is held? Which tables? What exemptions apply (safety record retention)? | DPO + engineering | Within 7 days |
| 4 | Generate export: GET /api/v1/users/me/data-export for self-service; admin endpoint for cases where account is deleted/suspended | Engineering | Within 20 days |
| 5 | Deliver export: encrypted ZIP sent to verified email address | DPO | By day 28 |
| 6 | Document: log in legal/DSAR_LOG.md — request date, identity verified, scope, delivery date, any exemptions invoked | DPO | Same day as delivery |
| 7 | If exemption applied (safety records retained): provide written explanation of the exemption and residual rights | DPO | Included in delivery |

GET /api/v1/users/me/data-export response scope — must include all of:

  • users record fields (excluding password hash)
  • alert_events where acknowledged_by = user.id (pre-pseudonymisation only)
  • shift_handovers where outgoing_user = user.id or incoming_user = user.id
  • operator_training_records for the user
  • api_keys metadata (not the key value itself)
  • security_logs where user_id = user.id (pre-IP-hashing only)
  • tos_accepted_at, tos_version from users

Fields excluded from DSAR export (not personal data or subject to legitimate processing exemption):

  • reentry_predictions (not personal data)
  • security_logs entries of type HMAC_KEY_ROTATION, DEPLOY_* (operational audit, not personal)
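The scope and exclusion rules above can be sketched as a small assembly helper. This is illustrative only — the function name and the plain-dict row representation are assumptions, not the codebase's API; real code would query the tables listed in this section:

```python
def build_dsar_export(user: dict, api_keys: list[dict], security_logs: list[dict]) -> dict:
    """Assemble a DSAR export payload per the scope rules above (sketch only)."""
    return {
        # users record, excluding the password hash
        "user": {k: v for k, v in user.items() if k != "password_hash"},
        # API key metadata only -- never the key value itself
        "api_keys": [
            {k: v for k, v in key.items() if k not in ("key_value", "key_hash")}
            for key in api_keys
        ],
        # security_logs rows for this user, excluding purely operational event types
        "security_logs": [
            row for row in security_logs
            if row.get("user_id") == user["id"]
            and not row["event_type"].startswith(("HMAC_KEY_ROTATION", "DEPLOY_"))
        ],
    }
```

The same exclusion logic applies whether the export runs via the self-service endpoint or the admin path for deleted accounts.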

29.4 Data Processing Agreements

A Data Processing Agreement (DPA) is required in every commercial relationship where SpaceCom acts as a data processor for customer personal data (GDPR Art. 28).

SpaceCom acts as data processor for: user data belonging to ANSP and space operator customers (the customers are the data controllers for their employees' data).

SpaceCom acts as data controller for: its own user authentication data, security logs, and analytics.

Required DPA provisions (GDPR Art. 28(3)):

  • Processing only on documented instructions of the controller
  • Confidentiality obligations on authorised processors
  • Technical and organisational security measures (reference §7)
  • Sub-processor approval process (cloud provider, SIEM)
  • Data subject rights assistance obligations
  • Deletion or return of data on contract termination
  • Audit and inspection rights for the controller

The DPA template must be reviewed by counsel before any EU/UK commercial deployment. It is a standard addendum to the MSA.

Sub-processor register (F9 — §64): GDPR Article 28(2) requires that the controller authorises sub-processors, and Article 28(4) requires that the processor imposes equivalent obligations on sub-processors. The DPA template references a sub-processor register; that register must exist as a standalone document.

Document: legal/SUB_PROCESSORS.md — Phase 2 gate (required before first EU/UK commercial deployment).

| Sub-processor | Service | Personal data transferred | Location | Transfer mechanism | DPA in place |
|---|---|---|---|---|---|
| Cloud host (e.g. AWS/Hetzner) | Infrastructure hosting | All categories (hosted on their infrastructure) | EU-central-1 (Frankfurt) | Adequacy / SCCs | AWS DPA / Hetzner DPA |
| GitHub | Source code hosting, CI/CD | Developer usernames; may appear in test fixtures | US | EU SCCs (Module 2) | GitHub DPA |
| Email delivery provider (e.g. Postmark, SES) | Transactional email (alert notifications) | User email address, name, alert content | US | EU SCCs (Module 2) | Provider DPA |
| Grafana Cloud (if used) | Observability / monitoring | IP addresses in logs ingested to Loki | US/EU | SCCs / EU region option | Grafana DPA |
| Sentry (if used) | Error tracking | Stack traces may contain user IDs, request data | US | EU SCCs | Sentry DPA |

Customer notification obligation: ANSPs (as data controllers) must be notified ≥30 days before any new sub-processor is added. The DPA addendum requires this. The sub-processor register is the mechanism for tracking and triggering notifications.


29.5 Cross-Border Data Transfer Safeguards

For EU/UK customers where SpaceCom infrastructure is hosted outside the EU/UK (e.g., AWS us-east-1):

  • Use EU/UK regions where available, or
  • Execute Standard Contractual Clauses (SCCs — 2021 EU SCCs / UK IDTA) with the cloud provider
  • Document the transfer mechanism in the Privacy Notice

For Australian customers: the Privacy Act's Australian Privacy Principle 8 (cross-border disclosure) requires contractual protections equivalent to the APPs when transferring personal data internationally.

Data residency policy (Finding 8):

  • Default hosting: EU jurisdiction (eu-central-1 / Frankfurt or equivalent) — satisfies EU data residency requirements for ECAC ANSP customers; stated in the MSA and DPA
  • On-premise option: Institutional tier supports customer-managed on-premise deployment (§34 specifies the deployment model); customer's own infrastructure, own jurisdiction; SpaceCom provides a deployment package and support contract
  • Multi-tenancy isolation: Each ANSP organisation's operational data (alert_events, notam_drafts, coordination notes) is accessible only to that organisation's users — enforced by RLS (§7.2). Multi-tenancy does not mean data co-mingling
  • Subprocessor disclosure: docs/legal/data-residency-policy.md lists hosting provider, region, and any subprocessors; updated when subprocessors change; referenced in the DPA; customers notified of material subprocessor changes ≥ 30 days in advance
  • organisations.hosting_jurisdiction and organisations.data_residency_confirmed columns (§9.2) track per-organisation residency state; admin UI surfaces this to Persona D
  • Authoritative document: legal/DATA_RESIDENCY.md — lists hosting provider, region, all sub-processors with their data residency and SCCs/IDTA status; reviewed and re-signed annually by DPO; customers notified of material sub-processor changes ≥30 days in advance per DPA obligations

29.6 Security Breach Notification

Regulatory notification obligations by framework:

| Framework | Trigger | Deadline | Authority | Template location |
|---|---|---|---|---|
| GDPR Art. 33 | Personal data breach affecting EU/UK data subjects | 72 hours of discovery | National DPA (e.g. ICO, CNIL, BfDI) | legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md |
| UK GDPR | As above for UK data subjects | 72 hours | ICO | As above |
| NIS2 Art. 23 | Significant incident affecting network/information systems of an essential entity | Early warning: 24 hours of becoming aware; full notification: 72 hours; final report: 1 month | National CSIRT + competent authority (space traffic management is likely an essential sector under NIS2 Annex I) | As above |
| Australian Privacy Act | Eligible data breach (serious harm likely) | ASAP (no fixed period; promptness required) | OAIC | As above |

Incident response timeline:

| Step | Timing | Action |
|---|---|---|
| Detect and contain | Immediately | Revoke affected credentials; isolate affected service; preserve logs |
| Assess scope | Within 2 hours | Determine: categories of data affected, approximate number of data subjects, jurisdictions, NIS2 applicability |
| Notify legal counsel and DPO | Within 4 hours of detection | Counsel advises on notification obligations across all applicable frameworks |
| NIS2 early warning | Within 24 hours of awareness | If significant incident: notify national CSIRT with initial information; no need for complete picture at this stage |
| Notify supervisory authority (EU/UK GDPR) | Within 72 hours of discovery | Via national DPA portal; even if incomplete — update as more known |
| NIS2 full notification | Within 72 hours of awareness | Full incident notification to national CSIRT / competent authority |
| Notify data subjects | Without undue delay | If breach likely to result in high risk to individuals |
| NIS2 final report | Within 1 month of full notification | Detailed description, impact assessment, cross-border impact, measures taken |
| Document | Ongoing | GDPR Art. 33(5) requires documentation of all breaches; NIS2 requires audit trail |

GDPR and NIS2 breach notification is integrated into the §26.8 incident response runbook. The security_logs record type DATA_BREACH triggers the breach notification workflow. On-call engineers must be trained to recognise when NIS2 thresholds (significant impact on service continuity or data integrity) are met and escalate to the DPO within the 24-hour window. Full obligations mapped in legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md.
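The overlapping deadlines in the timeline can be derived mechanically from the detection timestamp. A minimal sketch — the function name is hypothetical, and the NIS2 "one month" final-report window is approximated here as 30 days:

```python
from datetime import datetime, timedelta, timezone

def notification_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Compute regulatory notification deadlines from the detection time (sketch)."""
    full_notification = detected_at + timedelta(hours=72)
    return {
        # NIS2 Art. 23 early warning: 24h from becoming aware
        "nis2_early_warning": detected_at + timedelta(hours=24),
        # GDPR Art. 33: 72h from discovery, to the national DPA
        "gdpr_supervisory_authority": detected_at + timedelta(hours=72),
        # NIS2 full notification: 72h from awareness
        "nis2_full_notification": full_notification,
        # NIS2 final report: 1 month after the full notification (approximated as 30 days)
        "nis2_final_report": full_notification + timedelta(days=30),
    }
```

Surfacing these computed deadlines to the on-call engineer at DATA_BREACH time removes one manual step from the escalation path.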


29.7 Cookie Compliance (ePrivacy Directive)

Even as a B2B SaaS operating within corporate networks, SpaceCom must comply with the ePrivacy Directive (2002/58/EC as amended) for any non-essential cookies set on EU/UK user browsers.

Cookie audit (required at least annually — legal/COOKIE_POLICY.md):

| Cookie name | Category | Purpose | Lifetime | Consent required? |
|---|---|---|---|---|
| session | Strictly necessary | Authenticated session token | Session / 8h inactivity | No |
| csrf_token | Strictly necessary | CSRF protection | Session | No |
| tos_version | Strictly necessary | ToS acceptance tracking | 1 year | No |
| feature_flags | Functional | A/B flags for UI features | 30 days | Yes (functional consent) |
| _analytics | Analytics | Usage telemetry (if implemented) | 13 months | Yes (analytics consent) |

Security requirements for all session cookies (ePrivacy + §36 security):

Set-Cookie: session=...; HttpOnly; Secure; SameSite=Strict; Path=/; Max-Age=28800

Consent implementation:

  • Consent banner displayed on first visit to any EU/UK user before any non-essential cookies are set
  • Three options: Accept all / Functional only / Strictly necessary only
  • Consent preference stored in user_cookie_preferences or localStorage (no cookie used to store consent — self-defeating)
  • Consent is re-requested if cookie categories change materially
  • B2B context note: even if the organisation has a corporate cookie policy, individual users' consent is required under ePrivacy; organisational IT policies do not substitute for individual consent

Cookie policy: legal/COOKIE_POLICY.md — published at registration URL and linked from the consent banner. Reviewed when new cookies are introduced or existing cookies change purpose.
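The consent-gating rule — a non-essential cookie is set only when the stored preference covers its category — can be sketched as a lookup. Cookie and category names follow the audit table above; the helper and the consent-level encoding are illustrative assumptions:

```python
# Ordered consent levels matching the three banner options
CONSENT_LEVELS = {"strictly_necessary": 0, "functional": 1, "accept_all": 2}

# Category per cookie, from the cookie audit table
COOKIE_CATEGORIES = {
    "session": "strictly_necessary",
    "csrf_token": "strictly_necessary",
    "tos_version": "strictly_necessary",
    "feature_flags": "functional",
    "_analytics": "analytics",
}

# Minimum consent level required to set a cookie of each category
CATEGORY_LEVEL = {"strictly_necessary": 0, "functional": 1, "analytics": 2}

def may_set_cookie(cookie_name: str, user_consent: str) -> bool:
    """True when the user's stored consent level covers the cookie's category."""
    required = CATEGORY_LEVEL[COOKIE_CATEGORIES[cookie_name]]
    return required <= CONSENT_LEVELS[user_consent]
```

Strictly necessary cookies pass at every consent level, so the session and CSRF cookies are never blocked by a minimal consent choice.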


29.8 Organisation Onboarding and Offboarding (F4)

Onboarding workflow

New organisation provisioning requires explicit admin action — self-serve registration is not available in Phase 1 (safety-critical context; all organisations are individually vetted).

Onboarding gates (all must be satisfied before subscription_status = 'active'):

  1. Legal: MSA executed (countersigned PDF stored in legal/contracts/{org_id}/msa.pdf)
  2. Export control: export_control_cleared = TRUE on the organisations row (BIS Entity List check; see §24.2)
  3. Space-Track: If the organisation requires Space-Track data: space_track_registered = TRUE; space_track_username recorded; data disclosure log seeded
  4. Billing: billing_contacts row created; VAT number validated for EU customers
  5. Admin user: at least one org_admin user created with MFA enrolled
  6. ToS: primary org_admin user has tos_accepted_at IS NOT NULL

Each gate is a checklist step in docs/runbooks/org-onboarding.md. Completing all gates creates a subscription_periods row with period_start = NOW().
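A minimal sketch of the gate check, assuming the gate states are readable as boolean fields (the field names here are illustrative, not the actual schema):

```python
def unmet_onboarding_gates(org: dict) -> list[str]:
    """Return the names of onboarding gates not yet satisfied (sketch only)."""
    gates = {
        "msa_executed": org.get("msa_executed", False),
        "export_control_cleared": org.get("export_control_cleared", False),
        # Space-Track gate applies only to orgs that need Space-Track data
        "space_track_ok": (not org.get("requires_space_track", False))
                          or org.get("space_track_registered", False),
        "billing_contact": org.get("billing_contact_created", False),
        "org_admin_with_mfa": org.get("org_admin_mfa_enrolled", False),
        "tos_accepted": org.get("tos_accepted", False),
    }
    return [name for name, ok in gates.items() if not ok]

# subscription_status may move to 'active' only when the returned list is empty.
```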

Offboarding workflow

When an organisation's subscription ends (churn, termination, or suspension), the offboarding procedure:

| Step | Action | Who | When |
|---|---|---|---|
| 1 | Set subscription_status = 'churned' / 'suspended' | Admin | Immediately |
| 2 | Revoke all api_keys for the org | Admin (automated) | Immediately |
| 3 | Invalidate all active sessions (refresh_tokens) | Admin (automated) | Immediately |
| 4 | Notify org primary contact: 30-day data export window | Admin | Same day |
| 5 | Generate and deliver org data export archive | Admin | Within 3 business days |
| 6 | After 30-day window: pseudonymise user personal data | Automated job | Day 31 |
| 7 | Retain non-personal safety records (7-year minimum) | DB — no action | Ongoing |
| 8 | Confirm deletion in writing to org billing contact | Admin | After step 6 |

GDPR Art. 17 vs. retention conflict: User personal data (name, email, IP addresses) is pseudonymised per §29.3 after the 30-day window. Safety records (alert_events, reentry_predictions, shift_handovers) are retained for 7 years per UN Liability Convention — the organisation row remains in the database with subscription_status = 'churned' as the foreign key anchor. No safety record is deleted.

Suspension vs. termination: A suspended organisation (subscription_status = 'suspended') retains data and can be reactivated by an admin. A churned organisation enters the 30-day export window immediately. Suspension is used for payment failure; churn for voluntary or contractual termination.


29.9 Audit Log Personal Data Separation (F8 — §64)

security_logs currently serves two distinct purposes with conflicting retention requirements:

  • Integrity audit records (HMAC checks, ingest events, deploy markers): no personal data; 7-year retention under UN Liability Convention
  • Personal data processing records (user logins, IP addresses, acknowledgement events): personal data; subject to data minimisation, IP hashing at 90 days, erasure on request

Mixing these in one table means a single retention policy applies to both — either over-retaining personal data (7 years) or under-retaining operational integrity records. Required separation:

-- New table: operational integrity audit — no personal data, 7-year retention
CREATE TABLE integrity_audit_log (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type   TEXT NOT NULL,  -- 'HMAC_VERIFICATION', 'INGEST_SUCCESS', 'DEPLOY_COMPLETED', etc.
    source       TEXT,           -- service name, job ID
    details      JSONB,          -- operational context; must not contain user IDs or IPs
    severity     TEXT NOT NULL DEFAULT 'INFO'
);

-- Existing security_logs: personal data processing records — IP hashing at 90d, erasure on request
-- Add constraint: security_logs must only hold user-action event types
ALTER TABLE security_logs ADD CONSTRAINT chk_security_logs_type
    CHECK (event_type IN (
        'LOGIN', 'LOGOUT', 'MFA_ENROLLED', 'PASSWORD_RESET', 'API_KEY_CREATED',
        'API_KEY_REVOKED', 'TOS_ACCEPTED', 'DATA_BREACH', 'USER_ERASURE_COMPLETED',
        'SAFETY_OCCURRENCE', 'DEPLOY_ALERT_GATE_OVERRIDE', 'HMAC_KEY_ROTATION',
        'AIRSPACE_UPDATE', 'EXPORT_CONTROL_SCREENED', 'SHADOW_MODE_ACTIVATED'
    ));

Migration: Existing security_logs records of type INGEST_*, HMAC_VERIFICATION_* (pass/fail), DEPLOY_COMPLETED are migrated to integrity_audit_log. The personal-data-containing events remain in security_logs with the updated retention and IP-hashing policy.
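The migration's routing rule can be expressed as a single classification function — a sketch, with the prefixes taken from the migration note above (the function name is hypothetical):

```python
# Event-type prefixes that identify operational integrity records
INTEGRITY_PREFIXES = ("INGEST_", "HMAC_VERIFICATION_", "DEPLOY_COMPLETED")

def target_table(event_type: str) -> str:
    """Route a legacy security_logs event type to its post-migration table."""
    if event_type.startswith(INTEGRITY_PREFIXES):
        return "integrity_audit_log"
    # Everything else is a user-action record and stays in security_logs,
    # including HMAC_KEY_ROTATION, which is in the CHECK constraint list above
    return "security_logs"
```

Running this rule over the existing rows before applying the CHECK constraint guarantees the constraint cannot fail on legacy data.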

Benefit: integrity_audit_log can be retained for 7 years without any privacy obligation. security_logs is subject to the 90-day IP hashing, erasure-on-request, and 2-year text pseudonymisation policies without affecting integrity records.


29.10 Lawful Basis Mapping and ToS Acceptance Clarification (F11 — §64)

The first-login ToS/AUP acceptance flow (§3.1, §13) gates access and records tos_accepted_at. This mechanism does not mean consent (Art. 6(1)(a)) is the universal lawful basis for all processing. The RoPA (§29.1) maps the correct basis per activity; this section clarifies the principle.

Lawful basis is determined by purpose, not by the collection mechanism:

| Processing activity | Correct basis | Why NOT consent |
|---|---|---|
| Delivering alerts and predictions the user subscribed to | Art. 6(1)(b) — contract performance | User contracted for the service; consent would be revocable and would prevent service delivery |
| Security logging of user actions | Art. 6(1)(f) — legitimate interests (fraud/security) | Required regardless of consent; security cannot be conditional on consent |
| Audit trail for UN Liability Convention | Art. 6(1)(c) — legal obligation | Statutory retention requirement; consent is irrelevant |
| Fatigue monitoring triggers (§28.3 — server-side thresholds) | Art. 6(1)(b) or (f) | Part of the contracted service and/or legitimate safety interest; not health data (Art. 9) because no health information is processed — only activity patterns |
| Sending marketing or product update emails (not core service) | Art. 6(1)(a) — consent | Marketing emails require opt-in consent separate from service ToS |

ToS acceptance is consent evidence only for: (a) acknowledgement of terms, (b) Space-Track redistribution acknowledgement, (c) export control acknowledgement. It is not a blanket consent to all processing.

Implementation requirement: The Privacy Notice (§29.1) must state the correct lawful basis for each category of processing, not imply consent for all. Legal counsel review required before publication.


29.11 Open Source / Dependency Licence Compliance (§66)

SpaceCom is a closed-source SaaS product. Certain open-source licence obligations apply regardless of whether source code is distributed, because SpaceCom serves a web application to end users over a network. This section documents licence assessments for all material dependencies.

Reference document: legal/OSS_LICENCE_REGISTER.md — authoritative per-dependency licence record, updated on every major dependency version change.

F1 — CesiumJS AGPLv3 Commercial Licence

CesiumJS is licensed under AGPLv3. The AGPL network-use provision (AGPLv3 §13) requires that any software that incorporates AGPLv3 code and is served over a network must make its complete corresponding source available to users. SpaceCom is closed-source and does not satisfy this requirement under the AGPLv3 terms.

Required action: A commercial licence from Cesium Ion must be executed and stored at legal/LICENCES/cesium-commercial.pdf before any Phase 1 demo or ESA evaluation deployment. The CI licence gate (license-checker-rseidelsohn --excludePackages "cesium") is correct only when a valid commercial licence exists — the exclusion without the licence is a false negative. The commercial licence is referenced in ADR-0007 (docs/adr/0007-cesiumjs-commercial-licence.md).

Phase gate: legal/LICENCES/cesium-commercial.pdf present and legal_clearances.cesium_commercial_executed = TRUE is a Phase 1 go/no-go criterion. Block all external deployments until confirmed.

F3 — Space-Track AUP Redistribution Prohibition

Space-Track Terms of Service prohibit redistribution of TLE and CDM data to unregistered parties. SpaceCom's ingest pipeline fetches TLE/CDM data under a single registered account and serves derived predictions to ANSP users. The redistribution risk surfaces in two ways:

  1. Raw TLE exposure via API: If SpaceCom's API returns raw TLE strings (e.g., in /objects/{id}/tle), and those strings are accessible to unauthenticated users or third-party integrations, this may constitute redistribution. All TLE endpoints must require authentication and must not be proxied to unregistered downstream systems.

  2. Credentials in client-side code or SBOM: SPACE_TRACK_PASSWORD must never appear in frontend/ source, git history, SBOM artefacts, or any publicly accessible location. Validate with detect-secrets (already in pre-commit hook) and git secrets --scan-history.

ADR: docs/adr/0016-space-track-aup-architecture.md — records the chosen path (shared ingest vs. per-org credentials) with AUP clarification evidence.

F4 — Python Dependency Licence Assessment

| Package | Licence | Risk | Mitigation |
|---|---|---|---|
| NumPy | BSD-3 | None | — |
| SciPy | BSD-3 | None | — |
| astropy | BSD-3 | None | — |
| sgp4 | MIT | None | — |
| poliastro | MIT / LGPLv3 (components) | Low | LGPLv3 requires dynamic linking ability; standard pip install satisfies LGPL dynamic linking. SpaceCom does not ship a modified poliastro — no relinking obligation arises. Document in legal/LGPL_COMPLIANCE.md. |
| FastAPI | MIT | None | — |
| SQLAlchemy | MIT | None | — |
| Celery | BSD-3 | None | — |
| Pydantic | MIT | None | — |
| Playwright (Python) | Apache 2.0 | None | Chromium binary downloaded at build time; not redistributed. Captured in SBOM. |

LGPL compliance document: legal/LGPL_COMPLIANCE.md must confirm: (a) poliastro is installed via pip as a separate library, (b) SpaceCom does not statically link or incorporate modified poliastro source, (c) users can substitute a modified poliastro by reinstalling — this is satisfied by standard Python packaging. No further action required beyond this documentation.

F5 — TimescaleDB Licence Assessment

TimescaleDB uses a dual-licence model:

| Feature | Licence | SpaceCom use? |
|---|---|---|
| Hypertables, continuous aggregates, compression, time_bucket() | Apache 2.0 | Yes — all core features used by SpaceCom |
| Multi-node distributed hypertables | Timescale Licence (TSL) | No — single-node at all tiers |
| Data tiering (automated S3 tiering) | TSL | No — SpaceCom uses MinIO ILM / manual S3 lifecycle, not TimescaleDB tiering |

Assessment: SpaceCom uses only Apache 2.0-licensed TimescaleDB features. No Timescale commercial agreement required. Document in legal/LICENCES/timescaledb-licence-assessment.md. Re-assess if multi-node or data tiering features are adopted at Tier 3.

F6 — Redis SSPL Assessment

Redis 7.4+ adopted the Server Side Public Licence (SSPL). SSPL § 13 requires that any entity offering the software as a service must open-source their entire service stack. The relevant question for SpaceCom is whether deploying Redis as an internal component of SpaceCom constitutes "offering Redis as a service."

Assessment: SpaceCom operates Redis internally — users interact with SpaceCom's API and WebSocket interface, not directly with Redis. This is not offering Redis as a service. The SSPL obligation does not apply to internal use of Redis as a component. However, legal counsel should confirm this position before Phase 3 (operational deployment).

Alternative if legal counsel disagrees: Pin to Redis 7.2.x (BSD-3-Clause, last release before SSPL adoption) or migrate to Valkey (BSD-3-Clause fork maintained by Linux Foundation). Either is a drop-in replacement. Document the chosen path in legal/LICENCES/redis-sspl-assessment.md.

Action: Update pip-licenses fail-on list to include "Server Side Public License" as a blocking licence category. Redis itself is not in the Python dependency tree (it is a Docker service), so this is a docker-image licence check. Add to Trivy scan policy.

F7 — Playwright and Chromium Binary Licence

Playwright (Python) is Apache 2.0. The Chromium binary bundled by Playwright uses the Chromium licence (BSD-3-Clause for most code; additional component licences apply for media codecs). Chromium is not redistributed by SpaceCom — Playwright downloads it at container build time via playwright install chromium.

Assessment: Internal use only; no redistribution. SBOM captures the Playwright version; Chromium binary version is captured by syft scanning the container image at the cosign attest step. No further action required.

F8 — Caddy Licence Assessment

Caddy server is Apache 2.0. Community plugins (the modules used in §26.9: encode, reverse_proxy, tls, file_server) are Apache 2.0. No Caddy enterprise plugins are used by SpaceCom. Caddy DNS challenge modules (if used for ACME wildcard certificates) must be verified — the caddy-dns/cloudflare module is MIT.

Audit requirement: On any Caddyfile change that adds a new module, verify its licence before merging. Add to the PR checklist for infrastructure changes.

F9 — PostGIS Licence Assessment

PostGIS is GPLv2+ with a linking exception for use with PostgreSQL. The linking exception reads: "the copyright holders of PostGIS grant you permission to use PostGIS as a PostgreSQL extension without this resulting in the entire combined work becoming subject to the GPL." SpaceCom uses PostGIS as a PostgreSQL extension (loaded via CREATE EXTENSION postgis) — the linking exception applies.

SpaceCom does not distribute PostGIS, does not modify PostGIS source, and does not ship a combined work — PostGIS is a runtime dependency of the database service. No GPLv2 obligation arises. Document in legal/LGPL_COMPLIANCE.md alongside the poliastro LGPL note.

F10 — Licence Change Monitoring CI Check

The existing pip-licenses --fail-on list (§7.13) catches Python GPL/AGPL. Additions required:

# .github/workflows/ci.yml (security-scan job — update existing step)
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Block: GPL v2, GPL v3, AGPL v3, SSPL (if any Python package adopts it)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3);Server Side Public License"

- name: npm licence gate (updated)
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL;SSPL"

Additionally, pin all Python and Node dependencies to exact versions in requirements.txt and package-lock.json. Renovate Bot PRs (§7.13) provide controlled upgrade paths; the licence gate re-runs on each Renovate PR to catch licence changes introduced by version upgrades.

F11 — Contributor Licence Agreement for External Contributors

Before any contractor, partner, or third-party engineer contributes code to SpaceCom:

  1. A CLA or work-for-hire clause must be in their contract confirming that all IP created for SpaceCom is owned by SpaceCom (or the appointing entity, per agreement).
  2. The CLA template is at legal/CLA.md — a simple assignment of copyright for contributions made under contract.
  3. The GitHub repository's CONTRIBUTING.md must state: "External contributions require a signed CLA. Contact legal@spacecom.io before submitting a PR."

Phase gate: Before any Phase 2 ESA validation partnership involves third-party engineering, confirm all engineers have executed the CLA or have work-for-hire clauses in their contracts. Unattributed IP in an ESA bid creates serious procurement risk.


30. DevOps / Platform Engineering

30.1 Pre-commit Hook Specification

All six hooks are required. The same hooks run locally (via pre-commit) and in CI (lint job). A push to GitHub that bypasses local hooks will fail CI.

.pre-commit-config.yaml:

repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
      - id: ruff
        args: ['--fix']
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0
    hooks:
      - id: mypy
        additional_dependencies: ['types-requests', 'sqlalchemy[mypy]']

  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker

  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier
        types_or: [javascript, typescript, html, css, json, yaml]

  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.0
    hooks:
      - id: sqlfluff-lint
        args: ['--dialect', 'postgres']
      - id: sqlfluff-fix
        args: ['--dialect', 'postgres']

All hooks are pinned by rev; update via pre-commit autoupdate in a dedicated dependency update PR. The detect-secrets baseline (.secrets.baseline) is committed to the repo and updated whenever legitimate secrets-like strings are added.

detect-secrets baseline maintenance process — incorrect baseline updates are the most common way this hook is neutralised. The correct procedure must be documented and enforced:

# docs/runbooks/detect-secrets-update.md (required runbook)

# CORRECT: update baseline to add a new allowance while preserving existing ones
# (in detect-secrets v1.x, scan --baseline updates the file in place)
detect-secrets scan --baseline .secrets.baseline
git add .secrets.baseline
git commit -m "chore: update detect-secrets baseline for <reason>"

# WRONG — overwrites ALL existing allowances:
# detect-secrets scan > .secrets.baseline   ← NEVER do this

CI check verifies baseline currency on every PR (stale baseline = hook not enforced). The git diff -I filter ignores the generated_at timestamp that a scan rewrites on every run:

# In lint job, after running pre-commit:
detect-secrets scan --baseline .secrets.baseline
git diff -I '"generated_at"' --exit-code .secrets.baseline || \
  (echo "ERROR: .secrets.baseline is stale — run: detect-secrets scan --baseline .secrets.baseline and commit the result" && exit 1)

detect-secrets is the canonical secrets scanner (entropy + regex). git-secrets (listed in §7.13) is also retained for its AWS credential pattern matching, which complements detect-secrets. Both run as pre-commit hooks; there is no conflict — they check different pattern sets.


30.2 Multi-Stage Dockerfile Pattern

All service Dockerfiles follow the builder/runtime two-stage pattern. No exceptions without documented justification.

Backend (example — same pattern for worker and ingest):

# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /build

# Install build dependencies (not copied to runtime stage)
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev

COPY backend/requirements.txt .
# --require-hashes enforces that every package in requirements.txt carries a hash annotation.
# pip-compile --generate-hashes produces these. Without this flag, hash pinning is specified
# but not verified during build — a dependency confusion attack would be silently installed.
RUN pip install --upgrade pip && \
    pip wheel --no-cache-dir --require-hashes --wheel-dir /wheels -r requirements.txt

# Stage 2: runtime
FROM python:3.12-slim AS runtime
WORKDIR /app

# Create non-root user at build time
RUN groupadd --gid 1001 appuser && \
    useradd --uid 1001 --gid appuser --no-create-home appuser

# Install only compiled wheels — no build tools
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl && \
    rm -rf /wheels

COPY backend/app ./app

USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Frontend:

FROM node:22-slim AS builder
WORKDIR /build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM node:22-slim AS runtime
WORKDIR /app
RUN groupadd --gid 1001 appuser && useradd --uid 1001 --gid appuser --no-create-home appuser
COPY --from=builder /build/.next/standalone ./
COPY --from=builder /build/.next/static ./.next/static
COPY --from=builder /build/public ./public
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]

Version pin rule: All Python service images use python:3.12-slim. All frontend/Node images use node:22-slim. Any FROM line using a different tag fails the hadolint pre-commit hook and CI lint step. Do not drift these — the service table in §3.2 and the Dockerfiles must agree.

CI verification — the build-and-push job includes:

# Verify no build tools in runtime image (which exits non-zero when gcc is absent)
if docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA which gcc; then
  echo "ERROR: gcc present in runtime image" && exit 1
fi
# Verify the image runs as the non-root appuser (uid 1001) by default
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA id -u | grep -qx "1001" || exit 1
# Verify correct Python version
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA python --version | grep -q "Python 3.12" || exit 1

Image digest pinning in production Compose files (F4 — §59): The production docker-compose.yml pins images by digest, not by mutable tag, to guarantee bit-for-bit reproducibility and prevent registry-side tampering:

# docker-compose.yml — production image references
# Update digests via: make update-image-digests (runs after each build-and-push)
services:
  backend:
    image: ghcr.io/spacecom/backend:sha-abc1234@sha256:a1b2c3d4...  # tag + digest
  worker-sim:
    image: ghcr.io/spacecom/worker:sha-abc1234@sha256:e5f6a7b8...

make update-image-digests script (run by CI after build-and-push): queries GHCR for the digest of each newly pushed image and patches docker-compose.yml via sed. The patched file is committed back to the release branch as a separate commit.
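A sketch of the digest-patching step, assuming a simple regex rewrite of image: lines (the real script resolves digests from GHCR after build-and-push; the function name is illustrative):

```python
import re

def pin_digest(compose_text: str, image_repo: str, tag: str, digest: str) -> str:
    """Rewrite an `image:` line for image_repo to the tag + digest form (sketch).

    Matches both plain-tag and already-digest-pinned references, so re-running
    after each build replaces the previous pin.
    """
    pattern = re.compile(
        rf"image:\s*{re.escape(image_repo)}:[^\s@]+(@sha256:[0-9a-f.]+)?"
    )
    return pattern.sub(f"image: {image_repo}:{tag}@{digest}", compose_text)
```

Applying it per service keeps each docker-compose.yml entry pinned to exactly one build.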

GHCR image retention policy (F4 — §59):

| Image type | Tag pattern | Retention |
|---|---|---|
| Release images | sha-<commit> on tagged release | Indefinite |
| Staging images | sha-<commit> on main push | 30 days |
| Dev branch images | sha-<commit> on PR branch | 7 days |
| Build cache manifests | buildcache | Overwritten each build; no accumulation |
| Untagged images (orphaned layers) | — | Purged weekly via GHCR lifecycle policy |

GHCR lifecycle policy is configured via the GitHub repository settings (Packages → Manage versions). The policy is documented in docs/runbooks/image-lifecycle.md and reviewed quarterly alongside the secrets audit.


30.3 Environment Variable Contract

All environment variables are documented in .env.example. Variables are grouped by category and stage:

| Variable | Required | Stage | Description |
|---|---|---|---|
| SPACETRACK_USERNAME | Yes | All | Space-Track.org account email |
| SPACETRACK_PASSWORD | Yes | All | Space-Track.org password |
| JWT_PRIVATE_KEY_PATH | Yes | All | Path to RS256 PEM private key |
| JWT_PUBLIC_KEY_PATH | Yes | All | Path to RS256 PEM public key |
| JWT_PUBLIC_KEY_NEW_PATH | No | Rotation only | Second public key during keypair rotation window |
| POSTGRES_PASSWORD | Yes | All | TimescaleDB password |
| REDIS_BACKEND_PASSWORD | Yes | All | Redis ACL password for spacecom_backend user (full keyspace access) |
| REDIS_WORKER_PASSWORD | Yes | All | Redis ACL password for spacecom_worker user (Celery namespaces only) |
| REDIS_INGEST_PASSWORD | Yes | All | Redis ACL password for spacecom_ingest user (Celery namespaces only) |
| MINIO_ACCESS_KEY | Yes | All | MinIO access key |
| MINIO_SECRET_KEY | Yes | All | MinIO secret key |
| HMAC_SECRET | Yes | All | Prediction signing key (rotate per §26.9 procedure) |
| ENVIRONMENT | Yes | All | development / staging / production |
| DEPLOY_CHECK_SECRET | Yes | Staging/Prod | Read-only CI/CD gate credential |
| SENTRY_DSN | No | Staging/Prod | Error reporting DSN |
| PAGERDUTY_ROUTING_KEY | No | Prod only | AlertManager → PagerDuty routing key |
| VAULT_ADDR | No | Phase 3 | HashiCorp Vault address |
| VAULT_TOKEN | No | Phase 3 | Vault authentication token |
| DISABLE_SIMULATION_DURING_ACTIVE_EVENTS | No | All | Org-level simulation block; default false |
| OPS_ROOM_SUPPRESS_MINUTES | No | All | Alert audio suppression window; default 0 |

CI validates that .env.example is up-to-date by checking that every variable referenced in the codebase (os.getenv(...), settings.*) has an entry in .env.example. Missing entries fail CI.
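A sketch of that completeness check as a pure function (the name `missing_env_entries` is an assumption; the real step also resolves `settings.*` attributes, which this sketch omits):

```python
import re

# Matches os.getenv("NAME") and os.environ.get("NAME") references in source text
GETENV_RE = re.compile(r"""os\.(?:getenv|environ\.get)\(\s*["']([A-Z0-9_]+)["']""")
# Matches documented variables at line start in .env.example (NAME=value)
ENV_LINE_RE = re.compile(r"^([A-Z0-9_]+)=", re.MULTILINE)

def missing_env_entries(source_texts: list[str], env_example: str) -> set[str]:
    """Return variables referenced in code but absent from .env.example."""
    referenced = {name for text in source_texts for name in GETENV_RE.findall(text)}
    documented = set(ENV_LINE_RE.findall(env_example))
    return referenced - documented
```

A non-empty return value fails the CI job with the missing names listed.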

CI secrets register (F3 — §59): GitHub Actions secrets are audited quarterly. The following table is the authoritative register — any secret not in this table must not exist in the repository settings.

| Secret name | Environment | Owner | Rotation schedule | What breaks if leaked |
|---|---|---|---|---|
| GITHUB_TOKEN | All | GitHub-managed (OIDC) | Per-job (automatic) | GHCR push access |
| DEPLOY_CHECK_SECRET | Staging, Production | Engineering lead | 90 days | CI can skip alert gate |
| STAGING_SSH_KEY | Staging | Engineering lead | 180 days | Staging server access |
| PRODUCTION_SSH_KEY | Production | Engineering lead + 1 | 90 days | Production server access |
| SPACETRACK_USERNAME_STAGING | Staging | DevOps | On offboarding | Space-Track ingest |
| SPACETRACK_PASSWORD_STAGING | Staging | DevOps | 90 days | Space-Track ingest |
| SENTRY_DSN | Staging, Production | DevOps | On rotation | Error reporting only |
| PAGERDUTY_ROUTING_KEY | Production | Engineering lead | On rotation | On-call alerting |

Rotation procedure: use gh secret set <NAME> --env <ENV> from a local machine; never paste secrets into PR descriptions or issue comments. Quarterly audit: gh secret list --env production output reviewed by engineering lead; any unrecognised secret triggers a security review.


30.4 Staging Environment Specification

Staging is a Tier 2 deployment (single-host Docker Compose) running continuously on a dedicated server or cloud VM.

Data policy: Staging never holds production data. On weekly reset (make clean && make seed), the database is wiped and synthetic fixtures are loaded. Synthetic fixtures include:

  • 50 tracked objects with pre-computed TLE histories
  • 5 synthetic TIP events across the test FIR set
  • 3 synthetic CRITICAL alert events at various acknowledgement states
  • 2 shadow mode test organisations

Credential policy: Staging uses a separate Space-Track account (if available) or rate-limited credentials. JWT keypairs, HMAC secrets, and MinIO keys are all distinct from production. Staging credentials are stored in GitHub Actions environment secrets, not in the production Vault.

OWASP ZAP integration:

# .github/workflows/ci.yml (post-staging-deploy step)
- name: OWASP ZAP baseline scan
  uses: zaproxy/action-baseline@v0.11.0
  with:
    target: 'https://staging.spacecom.io'
    rules_file_name: '.zap/rules.tsv'
    fail_action: true

ZAP results are uploaded as GitHub Actions artefacts and must be reviewed before production deploy approval is granted in Phase 2+.


30.5 CI Observability

Build duration: Each GitHub Actions job reports duration to a summary table. A Grafana dashboard (CI Health) tracks p50/p95 job durations over time. Alert if any job's p95 duration increases > 2× week-over-week.

Image size delta: The build-and-push job posts a PR comment with the compressed image size delta versus the previous main build:

Backend image: 187 MB → 192 MB (+2.7%) ✅
Worker image: 203 MB → 289 MB (+42.4%) ⚠️ Investigate before merge

If any image grows > 20% in a single PR, CI posts a warning. If any image exceeds the tier limits below, CI fails:

| Image | Max size (compressed) |
|---|---|
| backend | 300 MB |
| worker | 350 MB |
| frontend | 200 MB |
| renderer | 500 MB (Chromium) |
| ingest | 250 MB |
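The growth warning and the hard tier limits can be combined into a single check; this sketch assumes sizes have already been extracted from the registry in MB (function name illustrative):

```python
# Hard per-image limits (compressed MB) and the single-PR growth warning threshold
TIER_LIMITS_MB = {"backend": 300, "worker": 350, "frontend": 200, "renderer": 500, "ingest": 250}
GROWTH_WARN_RATIO = 0.20   # warn above +20% growth in a single PR

def check_image_sizes(
    current_mb: dict[str, float], previous_mb: dict[str, float]
) -> tuple[list[str], list[str]]:
    """Return (warnings, failures). Failures fail CI; warnings become PR comments."""
    warnings, failures = [], []
    for image, size in current_mb.items():
        prev = previous_mb.get(image)
        if prev and (size - prev) / prev > GROWTH_WARN_RATIO:
            warnings.append(f"{image}: {prev:.0f} MB -> {size:.0f} MB (+{(size - prev) / prev:.1%})")
        limit = TIER_LIMITS_MB.get(image)
        if limit is not None and size > limit:
            failures.append(f"{image}: {size:.0f} MB exceeds {limit} MB tier limit")
    return warnings, failures
```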

Test failure rate: GitHub Actions test reports (JUnit XML output from pytest and vitest) are stored as artefacts. A weekly CI health review checks for flaky tests (passing < 90% of the time) and schedules them for investigation.
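The flaky-test criterion (passing < 90% of the time across runs that both passed and failed) reduces to a pure function; names are assumed, and the real sweep parses the stored JUnit XML artefacts first:

```python
def flaky_tests(history: dict[str, list[bool]], threshold: float = 0.90) -> set[str]:
    """Tests with mixed outcomes and a pass rate below threshold.
    `history` maps test id -> pass/fail outcomes across recent CI runs.
    Always-failing tests are excluded: those are broken, not flaky."""
    return {
        test for test, runs in history.items()
        if runs and any(runs) and not all(runs) and sum(runs) / len(runs) < threshold
    }
```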


30.6 DevOps Decision Log

| Decision | Chosen | Rationale |
|---|---|---|
| CI/CD orchestration | GitHub Actions | Project is GitHub-native; OIDC → GHCR eliminates long-lived registry credentials; matrix builds supported |
| Container registry | GHCR | Co-located with source; free for this repo; cosign attestation support |
| Image tagging | sha-<commit> canonical; version alias on release tags; latest forbidden | latest is mutable; the sha tag gives exact source traceability |
| Multi-stage builds | Builder + distroless/slim runtime for all services | 60–80% image size reduction; eliminates compiler/build tools from production attack surface |
| Hot-reload strategy | docker-compose.override.yml with bind-mounted source volumes | < 1 s reload vs. 30–90 s container rebuild; override file not committed to CI |
| Local task runner | make | Universally available, no extra install; self-documenting targets; shell-level DX standard |
| Pre-commit stack | 6 hooks: detect-secrets + ruff + mypy + hadolint + prettier + sqlfluff | Each addresses a distinct failure mode; hooks run in CI to enforce for engineers who skip local install |
| Staging data | Synthetic fixtures only; weekly reset | Production data in staging creates GDPR complexity; synthetic data is sufficient for integration testing |
| Secrets rotation | Zero-downtime per-secret runbook; HMAC rotation requires batch re-sign migration | Aviation context: rotation cannot cause service interruption; HMAC is special-cased due to signed-data dependency |
| HMAC key rotation | Requires batch re-sign of all existing predictions; engineering lead approval required | All existing HMAC signatures become invalid on key change; silent re-sign is safer than mass verification failures |

30.7 GitHub Actions CI Workflow Specification (F1, F5, F8, F10 - §59)

The CI pipeline must enforce a strict job dependency graph. Jobs that do not declare needs: run in parallel by default — this is incorrect for a safety-critical pipeline where a failed test must prevent a build reaching production.

Canonical job dependency graph:

lint ──┬── test-backend ───┬── security-scan ──── build-and-push ──── deploy-staging ──── deploy-production
       ├── test-frontend ──┤                                               ↑ (auto)          ↑ (manual gate)
       └── migration-gate ─┘   (migration-gate runs only when migrations/ changed)

.github/workflows/ci.yml (abbreviated — full spec below):

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pre-commit
          key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
      - run: pip install pre-commit
      - run: pre-commit run --all-files   # F6 §59: enforce hooks in CI

  test-backend:
    needs: [lint]
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4   # F10 §59: pip wheel cache
        with:
          path: ~/.cache/pip
          key: pip-${{ hashFiles('backend/requirements.txt') }}
      - run: pip install -r backend/requirements.txt
      - run: pytest -m safety_critical --tb=short -q   # fast safety gate first
      - run: pytest --cov=backend --cov-fail-under=80

  test-frontend:
    needs: [lint]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - uses: actions/cache@v4   # F10 §59: npm cache
        with:
          path: ~/.npm
          key: npm-${{ hashFiles('frontend/package-lock.json') }}
      - run: npm ci --prefix frontend
      - run: npm run test --prefix frontend

  migration-gate:              # F11 §59: migration reversibility + timing gate
    needs: [lint]
    if: contains(join(github.event.commits.*.modified, ','), 'migrations/')   # object filter + join: substring match over changed paths
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
    steps:
      - uses: actions/checkout@v4
      - run: pip install alembic psycopg2-binary
      - name: Forward migration (timed)
        run: |
          START=$(date +%s)
          alembic upgrade head
          END=$(date +%s)
          ELAPSED=$((END - START))
          echo "Migration took ${ELAPSED}s"
          if [ "$ELAPSED" -gt 30 ]; then
            echo "::error::Migration took ${ELAPSED}s > 30s budget — requires review"
            exit 1
          fi
      - name: Reverse migration (reversibility check)
        run: alembic downgrade -1
      - name: Model/migration sync check
        run: alembic check

  security-scan:
    needs: [test-backend, test-frontend, migration-gate]
    if: ${{ !cancelled() && !failure() }}   # run even when conditional migration-gate is skipped
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install bandit && bandit -r backend/app -ll
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm audit --prefix frontend --audit-level=high
      # The new image does not exist until build-and-push, and a mutable `latest`
      # tag would contradict the tagging policy — so scan the repo filesystem here;
      # the pushed image is covered by the signed SBOM in build-and-push.
      - name: Trivy vulnerability scan (filesystem)
        uses: aquasecurity/trivy-action@0.24.0
        with:
          scan-type: fs
          scan-ref: .
          severity: CRITICAL,HIGH
          exit-code: '1'

  build-and-push:
    needs: [security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions: { contents: read, packages: write, id-token: write }
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}   # OIDC — no long-lived token
      - name: Build and push (with layer cache)   # F10 §59
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache
          cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache,mode=max
      - name: Sign image with cosign (F5 §59)
        uses: sigstore/cosign-installer@v3
      - run: |
          cosign sign --yes \
            ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
      - name: Generate SBOM and attach (F5 §59)
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          upload-artifact: true

  deploy-staging:
    needs: [build-and-push]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://staging.spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — deploy blocked. Override with workflow_dispatch."
            exit 1
          fi
      - name: SSH deploy to staging
        run: |
          ssh deploy@staging.spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production   # GitHub protected environment with required approvers - manual gate
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — production deploy blocked."
            exit 1
          fi
      - name: SSH deploy to production
        run: |
          ssh deploy@spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"

/api/v1/readyz alert gate field (F8 — §59): The existing GET /readyz response is extended with an alert_gate field:

# Returns "clear" | "blocked"
alert_gate = "blocked" if db.query(AlertEvent).filter(
    AlertEvent.level.in_(["CRITICAL", "HIGH"]),
    AlertEvent.acknowledged_at.is_(None),
    AlertEvent.organisation_id != INTERNAL_ORG_ID,  # internal test alerts don't block deploys
).count() > 0 else "clear"

Emergency deploy override: use workflow_dispatch with input override_alert_gate: true — requires two approvals in the GitHub production environment. All overrides are logged to security_logs with event_type = DEPLOY_ALERT_GATE_OVERRIDE.


30.8 Configuration Management of Safety-Critical Artefacts (F7 — §61)

EUROCAE ED-153 / DO-278A §10 requires that safety-critical software and its associated artefacts are placed under configuration management. This extends beyond the code itself to include requirements, test cases, design documents, and safety evidence.

Policy document: docs/safety/CM_POLICY.md

Artefacts under CM:

| Artefact | Location | CM control |
|---|---|---|
| SAL-2 source files (physics/, alerts/, integrity/, czml/) | Git main branch | Signed commits required; CODEOWNERS enforcement; no direct push to main |
| Hazard Log | docs/safety/HAZARD_LOG.md | Git-tracked; changes require safety case custodian sign-off (CODEOWNERS rule) |
| Safety Case | docs/safety/SAFETY_CASE.md | Git-tracked; changes require safety case custodian sign-off |
| SAL Assignment | docs/safety/SAL_ASSIGNMENT.md | Git-tracked; changes require safety case custodian sign-off |
| Means of Compliance | docs/safety/MEANS_OF_COMPLIANCE.md | Git-tracked; changes require safety case custodian sign-off |
| Verification Independence Policy | docs/safety/VERIFICATION_INDEPENDENCE.md | Git-tracked |
| Test plan (safety-critical markers) | docs/TEST_PLAN.md | Git-tracked; safety_critical marker additions/removals reviewed in PR |
| Reference validation data | docs/validation/reference-data/ | Git-tracked; immutable once committed (SHA verified in CI) |
| Accuracy Characterisation | docs/validation/ACCURACY_CHARACTERISATION.md | Git-tracked; Phase 3 deliverable |
| ANSP SMS Guide | docs/safety/ANSP_SMS_GUIDE.md | Git-tracked |
| Release artefacts (SBOM, Trivy report, cosign signature) | GHCR + MinIO safety archive | Tagged per release; 7-year retention |

Release tagging for safety artefacts:

Every production release (vMAJOR.MINOR.PATCH) creates a Git tag that captures:

# scripts/tag-safety-release.sh
VERSION=$1
git tag -a "$VERSION" -m "Release $VERSION — safety artefacts frozen at this tag"
# Attach safety snapshot to the release
gh release create "$VERSION" \
  docs/safety/SAFETY_CASE.md \
  docs/safety/HAZARD_LOG.md \
  docs/safety/SAL_ASSIGNMENT.md \
  docs/safety/MEANS_OF_COMPLIANCE.md \
  --title "SpaceCom $VERSION" \
  --notes "Safety artefacts attached. See CHANGELOG.md for changes."

Signed commits for SAL-2 paths: backend/app/physics/, backend/app/alerts/, backend/app/integrity/, backend/app/czml/ require GPG-signed commits. Branch protection rule: require_signed_commits: true on main. This provides non-repudiation for safety-critical code changes.

CODEOWNERS additions:

# .github/CODEOWNERS
# Safety artefacts — require safety case custodian review
/docs/safety/              @safety-custodian
/docs/validation/          @safety-custodian

Configuration baseline: At each ANSP deployment, a configuration baseline is recorded in legal/ANSP_DEPLOYMENT_REGISTER.md:

  • SpaceCom version deployed (Git tag)
  • Commit SHA
  • SBOM hash
  • Safety case version
  • SAL assignment version
  • Deployment jurisdiction and date

This baseline is the reference for any subsequent regulatory audit or safety occurrence investigation.


31. Interoperability / Systems Integration

31.1 External Data Source Contracts

For each inbound data source, the integration contract must be explicit. Implicit assumptions about format are the most common source of silent ingest failures.

31.1.1 Space-Track.org

Endpoints consumed:

| Data | Endpoint | Format | Baseline interval | Active TIP interval |
|---|---|---|---|---|
| TLE catalog | /basicspacedata/query/class/gp/DECAY_DATE/null-val/orderby/NORAD_CAT_ID asc/format/json | JSON array | Every 6h | Every 6h (unchanged) |
| CDMs | /basicspacedata/query/class/cdm_public/format/json | JSON array | Every 2h | Every 30min |
| TIP messages | /basicspacedata/query/class/tip/format/json | JSON array | Every 30min | Every 5min |
| Object catalog | /basicspacedata/query/class/satcat/format/json | JSON array | Daily | Daily |

Adaptive polling: When spacecom_active_tip_events > 0 (any object with predicted re-entry within 6 hours), the Celery Beat schedule dynamically switches TIP polling to 5-minute intervals and CDM polling to 30-minute intervals. This is implemented via redbeat schedule overrides, not by running additional tasks — the existing Beat entry's run_every is updated in Redis. When all TIP events clear, intervals revert to baseline.
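The interval switch reduces to a pure selection rule; a sketch with assumed constant names (the real implementation writes the chosen run_every values into the redbeat entries in Redis):

```python
from datetime import timedelta

# Baseline vs. active-TIP polling intervals from the table above; TLE is unchanged
BASELINE = {"tip": timedelta(minutes=30), "cdm": timedelta(hours=2), "tle": timedelta(hours=6)}
ACTIVE_TIP = {"tip": timedelta(minutes=5), "cdm": timedelta(minutes=30), "tle": timedelta(hours=6)}

def polling_intervals(active_tip_events: int) -> dict[str, timedelta]:
    """Select Beat run_every values; intervals revert when all TIP events clear."""
    return ACTIVE_TIP if active_tip_events > 0 else BASELINE
```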

Space-Track request budget (600 requests/day):

Space-Track enforces a 600 requests/day limit per account. Budget must be tracked and protected:

# ingest/budget.py
DAILY_REQUEST_BUDGET = 600
BUDGET_ALERT_THRESHOLD = 0.80   # alert at 80% consumed

class SpaceTrackBudget:
    """Redis counter tracking daily Space-Track API requests. Resets at midnight UTC."""

    def __init__(self, redis_client):
        self._redis = redis_client

    @property
    def _key(self) -> str:
        # Recomputed per call so a long-lived instance rolls over at midnight UTC
        return f"spacetrack:budget:{date.today().isoformat()}"

    def consume(self, n: int = 1) -> bool:
        """Deduct n requests. Raises SpaceTrackBudgetExhausted once the daily
        budget is exceeded; logs a warning at 80% consumption."""
        current = self._redis.incrby(self._key, n)
        self._redis.expireat(self._key, self._next_midnight())
        if current > DAILY_REQUEST_BUDGET:
            raise SpaceTrackBudgetExhausted(f"Daily budget exhausted ({current}/{DAILY_REQUEST_BUDGET})")
        if current / DAILY_REQUEST_BUDGET >= BUDGET_ALERT_THRESHOLD:
            structlog.get_logger().warning(
                "spacetrack_budget_warning",
                consumed=current, budget=DAILY_REQUEST_BUDGET,
            )
        return True

    def remaining(self) -> int:
        return max(0, DAILY_REQUEST_BUDGET - int(self._redis.get(self._key) or 0))

Prometheus gauge: spacecom_spacetrack_budget_remaining — alert at < 100 remaining requests.

Exponential backoff and circuit breaker:

# ingest/tasks.py
@app.task(
    bind=True,
    autoretry_for=(SpaceTrackError, httpx.TimeoutException, httpx.ConnectError),
    retry_backoff=True,       # 2s, 4s, 8s, 16s, 32s ...
    retry_backoff_max=3600,   # cap at 1 hour
    retry_jitter=True,        # ±20% jitter per retry
    max_retries=5,            # task → DLQ on 6th failure
    acks_late=True,
)
def ingest_tle_catalog(self):
    if not circuit_breaker.is_closed("spacetrack"):
        raise SpaceTrackCircuitOpen("Circuit open — Space-Track unreachable")
    try:
        budget.consume(1)
        result = spacetrack_client.fetch_tle_catalog()
        circuit_breaker.record_success("spacetrack")
        return result
    except (SpaceTrackError, httpx.TimeoutException) as exc:
        circuit_breaker.record_failure("spacetrack")
        raise self.retry(exc=exc)

Circuit breaker config: open after 3 consecutive failures; half-open after 30 minutes; close after 1 successful probe. Implemented via pybreaker or equivalent. State stored in Redis for cross-worker visibility.
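A minimal in-memory sketch of that breaker policy (production keeps the state in Redis, as noted above; the method names match the call sites in ingest_tle_catalog):

```python
import time

class CircuitBreaker:
    """Sketch of the stated policy: open after 3 consecutive failures,
    half-open after 30 minutes, close again after 1 successful probe."""

    FAILURE_THRESHOLD = 3
    COOL_DOWN_S = 30 * 60

    def __init__(self, clock=time.monotonic):
        self._clock = clock                      # injectable clock for testing
        self._failures = 0
        self._opened_at: float | None = None

    def is_closed(self, _source: str = "spacetrack") -> bool:
        if self._opened_at is None:
            return True
        # Half-open: allow a probe once the cool-down has elapsed; a failed
        # probe re-opens via record_failure (failures are already at threshold)
        return self._clock() - self._opened_at >= self.COOL_DOWN_S

    def record_success(self, _source: str = "spacetrack") -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self, _source: str = "spacetrack") -> None:
        self._failures += 1
        if self._failures >= self.FAILURE_THRESHOLD:
            self._opened_at = self._clock()
```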

Session expiry handling:

Space-Track uses cookie-based sessions that expire after ~2 hours of inactivity. A 6-hour TLE poll interval guarantees session expiry between polls. The spacetrack library must be configured to re-authenticate transparently on 401/403:

# ingest/spacetrack.py
class SpaceTrackClient:
    def __init__(self):
        self._session_valid_until: datetime | None = None
        self._SESSION_TTL = timedelta(hours=1, minutes=45)  # conservative re-auth before expiry

    async def _ensure_authenticated(self):
        if self._session_valid_until is None or datetime.utcnow() >= self._session_valid_until:
            await self._authenticate()
            self._session_valid_until = datetime.utcnow() + self._SESSION_TTL
            spacecom_ingest_session_reauth_total.labels(source="spacetrack").inc()

    async def fetch_tle_catalog(self):
        await self._ensure_authenticated()
        # ... fetch logic

Metric spacecom_ingest_session_reauth_total{source="spacetrack"} distinguishes routine re-auth from genuine authentication failures. An alert fires if reauth_total increments more than once per hour (indicates session instability, not normal expiry).

Contract test (asserts on every CI run against a live Space-Track response):

def test_spacetrack_tle_schema(spacetrack_client):
    response = spacetrack_client.query("gp", limit=1)
    required_keys = {"NORAD_CAT_ID", "TLE_LINE1", "TLE_LINE2", "EPOCH", "BSTAR", "OBJECT_NAME"}
    assert required_keys.issubset(response[0].keys()), f"Missing keys: {required_keys - response[0].keys()}"

Failure alerting: spacecom_ingest_success_total{source="spacetrack"} counter. AlertManager rules:

  • Baseline: if counter does not increment for 4 consecutive hours during expected polling windows → CRITICAL INGEST_SOURCE_FAILURE alert.
  • Active TIP window: if spacecom_ingest_success_total{source="spacetrack", type="tip"} does not increment for > 10 minutes when spacecom_active_tip_events > 0 → immediate L1 page (bypasses standard 4h threshold).

31.1.2 NOAA SWPC Space Weather

All endpoints are hardcoded constants in ingest/sources.py. Format is JSON for all P1 endpoints.

# ingest/sources.py
NOAA_F107_URL      = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"
NOAA_KP_URL        = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"
NOAA_DST_URL       = "https://services.swpc.noaa.gov/json/geomag/dst/index.json"
NOAA_FORECAST_URL  = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"
ESA_SWS_KP_URL     = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"

Nowcast vs. forecast distinction: NRLMSISE-00 decay predictions spanning hours to days require different F10.7/Ap inputs depending on the prediction horizon. These must be stored separately and selected by the decay predictor at query time:

-- space_weather table: forecast_horizon_hours column required
-- Nullable by design: NULL is the sentinel for the 81-day F10.7 average
ALTER TABLE space_weather ADD COLUMN forecast_horizon_hours INTEGER DEFAULT 0;
-- 0 = nowcast (observed); 24/48/72 = NOAA 3-day forecast horizon; NULL = 81-day average
COMMENT ON COLUMN space_weather.forecast_horizon_hours IS
  '0=nowcast; 24/48/72=NOAA 3-day forecast; NULL=81-day F10.7 average for long-horizon use';

Decay predictor input selection rule (documented in model card and decay.py):

| Prediction horizon | F10.7 source | Ap source |
|---|---|---|
| t < 6h | Nowcast (horizon=0) | Nowcast (horizon=0) |
| 6h ≤ t < 72h | NOAA 3-day forecast (horizon=24/48/72) | NOAA 3-day forecast |
| t ≥ 72h | 81-day F10.7 average (horizon=NULL) | Storm-aware climatological Ap |

Beyond 72h: the NOAA forecast expires. The model uses the 81-day F10.7 average (a standard NRLMSISE-00 input) and the long-range uncertainty is reflected in wider Monte Carlo corridor bounds. This is documented in the model card under "Space Weather Input Uncertainty Beyond 72h".
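The selection table reduces to a three-branch rule; a sketch with assumed label strings (the authoritative mapping lives in the model card and decay.py):

```python
def space_weather_inputs(horizon_hours: float) -> dict[str, str]:
    """Select F10.7/Ap input sources by prediction horizon (per the table above)."""
    if horizon_hours < 6:
        return {"f107": "nowcast", "ap": "nowcast", "horizon": "0"}
    if horizon_hours < 72:
        return {"f107": "noaa_3day_forecast", "ap": "noaa_3day_forecast", "horizon": "24/48/72"}
    # Beyond 72h the NOAA forecast has expired: fall back to climatological inputs
    return {"f107": "81_day_average", "ap": "climatological_storm_aware", "horizon": "NULL"}
```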

ESA SWS Kp cross-validation decision rule: ESA SWS Kp is a cross-validation source, not a fallback. A decision rule is required when NOAA and ESA values diverge — without one, the cross-validation is observational only:

# ingest/space_weather.py
NOAA_ESA_KP_DIVERGENCE_THRESHOLD = 2.0  # Kp units; ADR-0018

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    """Select Kp value for NRLMSISE-00 input. Conservative-high on divergence."""
    divergence = abs(noaa_kp - esa_kp)
    if divergence > NOAA_ESA_KP_DIVERGENCE_THRESHOLD:
        structlog.get_logger().warning(
            "kp_source_divergence",
            noaa_kp=noaa_kp, esa_kp=esa_kp, divergence=divergence,
        )
        spacecom_kp_divergence_events_total.inc()
        # Conservative: higher Kp → denser atmosphere → shorter predicted lifetime → earlier alerting
        return max(noaa_kp, esa_kp)
    return noaa_kp   # NOAA is primary source

The threshold (2.0 Kp) and the conservative-high selection policy are documented in docs/adr/0018-kp-source-arbitration.md and reviewed by the physics lead. The spacecom_kp_divergence_events_total counter is monitored; a sustained rate of divergence warrants investigation of source calibration.

Schema contract test (CI):

def test_noaa_kp_schema(noaa_client):
    response = noaa_client.get_kp()
    assert isinstance(response, list) and len(response) > 0
    assert {"time_tag", "kp_index"}.issubset(response[0].keys())

def test_space_weather_forecast_horizon_stored(db_session):
    """Verify nowcast and forecast rows are stored with distinct horizon values."""
    nowcast = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=0).first()
    forecast_72 = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=72).first()
    assert nowcast is not None, "Nowcast row missing"
    assert forecast_72 is not None, "72h forecast row missing"

31.1.3 FIR Boundary Data

Source: EUROCONTROL AIRAC dataset (primary for ECAC states); FAA Digital-Terminal Procedures Publication (US); OpenAIP (fallback for non-AIRAC regions).

Format: GeoJSON FeatureCollection with properties.icao_id (FIR ICAO designator) and properties.name.

Update procedure (runs on each 28-day AIRAC cycle):

  1. Download new AIRAC dataset from EUROCONTROL (subscription required; credentials in secrets manager)
  2. Convert to GeoJSON via ingest/fir_loader.py
  3. Compare new boundaries against current airspace table; log added/removed/changed FIRs to security_logs type AIRSPACE_UPDATE
  4. Stage new boundaries in airspace_staging table; run intersection regression test against 10 known prediction corridors
  5. If regression passes: swap airspace and airspace_staging in a single transaction
  6. Record update in airspace_metadata table: airac_cycle, record_count, updated_at, updated_by

airspace_metadata table:

CREATE TABLE airspace_metadata (
  id SERIAL PRIMARY KEY,
  airac_cycle TEXT NOT NULL,       -- e.g. "2026-03"
  effective_date DATE NOT NULL,
  expiry_date DATE NOT NULL,       -- effective_date + 28 days; used for staleness detection
  record_count INTEGER NOT NULL,
  source TEXT NOT NULL,            -- 'eurocontrol' | 'faa' | 'openaip'
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  updated_by TEXT NOT NULL
);

AIRAC staleness detection: The AIRAC update procedure is manual — there is no automated mechanism to trigger it. Without monitoring, a missed cycle goes undetected for up to 28 days.

Required additions:

  1. Prometheus gauge: spacecom_airspace_airac_age_days = EXTRACT(EPOCH FROM NOW() - MAX(effective_date)) / 86400 from airspace_metadata. Alert rule:
- alert: AIRACAirspaceStale
  expr: spacecom_airspace_airac_age_days > 29
  for: 1h
  labels:
    severity: warning
  annotations:
    runbook_url: "https://spacecom.internal/docs/runbooks/fir-update.md"
    summary: "FIR boundary data is {{ $value }} days old — AIRAC cycle may be missed"
  2. GET /readyz integration: "airspace_stale" is added to the degraded array when airac_age_days > 28 (already incorporated into §26.5 readyz check above).

  3. FIR update runbook (docs/runbooks/fir-update.md) is a Phase 1 deliverable — it must exist before shadow deployment. Add to the Phase 1 DoD runbook checklist alongside secrets-rotation-jwt.md.

31.1.4 TLE Validation Gate

Before any TLE record is written to the database, ingest/cross_validator.py enforces:

def validate_tle(line1: str, line2: str) -> TLEValidationResult:
    errors = []
    if len(line1) != 69:
        errors.append(f"Line 1 length {len(line1)} != 69")
    if len(line2) != 69:
        errors.append(f"Line 2 length {len(line2)} != 69")
    if not _tle_checksum_valid(line1):
        errors.append("Line 1 checksum failed")
    if not _tle_checksum_valid(line2):
        errors.append("Line 2 checksum failed")
    epoch = _parse_epoch(line1[18:32])
    if epoch is None:
        errors.append("Epoch field invalid")
    # BSTAR field (cols 54-61) is exponent-coded: "±MMMMM±E" means ±0.MMMMM x 10^±E,
    # so a plain float() on the raw field would raise ValueError
    raw_bstar = line1[53:61]
    bstar = float(f"{raw_bstar[0]}0.{raw_bstar[1:6]}e{raw_bstar[6:8]}".replace(' ', ''))
    # Perigee from line 2: eccentricity (cols 27-33, implied decimal) and mean motion (cols 53-63, rev/day)
    ecc = float(f"0.{line2[26:33]}")
    mean_motion_rad_s = float(line2[52:63]) * 2 * math.pi / 86400.0
    sma_km = (398600.4418 / mean_motion_rad_s ** 2) ** (1.0 / 3.0)   # GM_earth in km^3/s^2
    perigee_km = sma_km * (1 - ecc) - 6378.137                       # equatorial radius, km
    # Finding 10: BSTAR validation revised
    # Lower bound removed: valid high-density objects (e.g. tungsten sphere) have B* << 0.0001
    # Zero or negative B* is physically meaningless (negative drag) → hard reject
    if bstar <= 0.0:
        errors.append(f"BSTAR {bstar} is zero or negative — physically invalid")
    elif bstar > 0.5:
        # Physically implausible at altitude > 300 km; log warning but do not reject
        log_security_event("TLE_VALIDATION_WARNING", {
            "tle": [line1, line2], "reason": "HIGH_BSTAR", "bstar": bstar
        }, level="WARNING")
    # Hard reject only the impossible combination: very high drag at high altitude
    if bstar > 0.5 and perigee_km > 300:
        errors.append(f"BSTAR {bstar} implausible for perigee {perigee_km:.0f} km — high drag at high altitude")
    if errors:
        log_security_event("INGEST_VALIDATION_FAILURE", {"tle": [line1, line2], "errors": errors})
        return TLEValidationResult(valid=False, errors=errors)
    return TLEValidationResult(valid=True)

31.2 CCSDS Format Specifications

31.2.1 OEM (Orbit Ephemeris Message) — CCSDS 502.0-B-3

Emitted by GET /space/objects/{norad_id}/ephemeris when Accept: application/ccsds-oem.

Header keyword population:

| Keyword | Value | Source |
|---|---|---|
| CCSDS_OEM_VERS | 3.0 | Fixed |
| CREATION_DATE | ISO 8601 UTC timestamp | datetime.utcnow() |
| ORIGINATOR | SPACECOM | Fixed |
| OBJECT_NAME | objects.name | DB |
| OBJECT_ID | COSPAR designator if known; NORAD-<norad_id> otherwise | DB |
| CENTER_NAME | EARTH | Fixed |
| REF_FRAME | GCRF | Fixed — SpaceCom frame transform output |
| TIME_SYSTEM | UTC | Fixed |
| START_TIME | Query start parameter | Request |
| STOP_TIME | Query end parameter | Request |

Unknown fields: Any keyword for which SpaceCom holds no data is emitted as N/A per CCSDS 502.0-B-3 §4.1.

31.2.2 CDM (Conjunction Data Message) — CCSDS 508.0-B-1

Emitted by GET /space/export/bulk?format=ccsds-cdm.

Field population table (abbreviated):

| Field | Populated? | Source |
|---|---|---|
| CREATION_DATE | Yes | datetime.utcnow() |
| ORIGINATOR | Yes | SPACECOM |
| TCA | Yes | SpaceCom conjunction screener |
| MISS_DISTANCE | Yes | SpaceCom conjunction screener |
| COLLISION_PROBABILITY | Yes | SpaceCom Alfano Pc |
| COLLISION_PROBABILITY_METHOD | Yes | ALFANO-2005 |
| OBJ1/2 COVARIANCE_* | Conditional | From Space-Track CDM if available; N/A for debris without covariance |
| OBJ1/2 RECOMMENDED_OD_SPAN | No | N/A — SpaceCom does not hold OD span |
| OBJ1/2 SEDR | No | N/A |

CDM ingestion and Pc reconciliation: When a Space-Track CDM is ingested for an object that SpaceCom has also screened, both Pc values are stored:

  • conjunctions.pc_spacecom — SpaceCom Alfano result
  • conjunctions.pc_spacetrack — from ingested CDM
  • conjunctions.pc_discrepancy_flag — set TRUE when abs(log10(pc_spacecom/pc_spacetrack)) > 1 (order-of-magnitude difference)

The conjunction panel displays both values with their provenance labels. When pc_discrepancy_flag = TRUE, a DATA_CONFIDENCE warning callout is shown explaining possible causes (different epoch, different covariance source, different Pc method).
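The discrepancy flag itself is a one-line rule; a sketch (the handling of non-positive Pc values is an assumption, not specified above):

```python
import math

def pc_discrepancy_flag(pc_spacecom: float, pc_spacetrack: float) -> bool:
    """TRUE when the two Pc values differ by more than an order of magnitude."""
    if pc_spacecom <= 0 or pc_spacetrack <= 0:
        return True   # assumption: a zero/invalid Pc on either side warrants the callout
    return abs(math.log10(pc_spacecom / pc_spacetrack)) > 1
```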


31.2.3 RDM (Re-entry Data Message) — CCSDS 508.1-B-1

Emitted by GET /reentry/predictions/{prediction_id}/export?format=ccsds-rdm.

Planned population rules:

  • SpaceCom populates creation metadata, object identifiers, prediction provenance, prediction epoch, and the primary predicted re-entry time range from the active prediction record.
  • Where the active prediction carries prediction_conflict = TRUE, the export includes both the primary SpaceCom range and the conservative union range used for aviation-facing products, with explicit conflict provenance.
  • Corridor, fragment-cloud, and air-risk annotations are included only when supported by the active model version and marked with the model version identifier used to generate them.
  • Unknown optional fields are emitted as N/A rather than silently omitted, matching the CCSDS handling already used for OEM/CDM unknowns.
  • Raw upstream TIP or third-party reference messages are not overwritten; they remain separate provenance sources and are cross-referenced in the export metadata and audit trail.

31.3 WebSocket Event Reference

Full event type catalogue for WS /ws/events. All events share the envelope:

{
  "type": "alert.new",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "org_id": 7,
  "data": { ... }
}

Event type specifications:

alert.new
  data: {alert_id, level, norad_id, object_name, fir_ids[], predicted_reentry_utc, corridor_wkt}

alert.acknowledged
  data: {alert_id, acknowledged_by_name, note_preview (first 80 chars), acknowledged_at}

alert.superseded
  data: {old_alert_id, new_alert_id, reason}

prediction.updated
  data: {prediction_id, norad_id, p50_utc, p05_utc, p95_utc, supersedes_id (nullable), corridor_wkt}

tip.new
  data: {norad_id, object_name, tip_epoch, predicted_reentry_utc, source_label ("USSPACECOM TIP")}

ingest.status
  data: {source, status ("ok"|"failed"), record_count (nullable), next_run_at, failure_reason (nullable)}

spaceweather.change
  data: {old_status, new_status, kp, f107, recommended_buffer_hours}

resync_required
  data: {reason ("reconnect_too_stale"), last_known_seq}

Reconnection protocol:

  1. Client stores last received seq
  2. On reconnect: upgrade with ?since_seq=<last_seq>
  3. Server delivers all events with seq > last_seq from a 5-minute / 200-event ring buffer
  4. If the gap is too large: server sends {"type": "resync_required"}; client must call REST endpoints to re-fetch current state before resuming WebSocket consumption
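
The server-side replay decision in steps 3–4 can be sketched with an in-memory ring; this is illustrative only (the real implementation also enforces the 5-minute time bound and per-org filtering):

```python
from collections import deque

RING_SIZE = 200  # per the 5-minute / 200-event ring buffer above

class EventRing:
    """Illustrative replay buffer for the ?since_seq reconnect path."""

    def __init__(self):
        self.buffer = deque(maxlen=RING_SIZE)

    def publish(self, event: dict):
        self.buffer.append(event)

    def replay_since(self, last_seq: int) -> list:
        """Return events with seq > last_seq, or a resync sentinel when
        the oldest buffered event is already past the client's gap."""
        if self.buffer and self.buffer[0]["seq"] > last_seq + 1:
            return [{"type": "resync_required",
                     "data": {"reason": "reconnect_too_stale",
                              "last_known_seq": last_seq}}]
        return [e for e in self.buffer if e["seq"] > last_seq]
```

Because `seq` is monotonic, the gap test reduces to comparing the oldest buffered `seq` against `last_seq + 1`: if it is larger, at least one event has already been evicted and the client must resync via REST.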

Simulation/Replay isolation: During SIMULATION or REPLAY mode, the client is connected to WS /ws/simulation/{session_id} instead of WS /ws/events. No LIVE events are delivered while in a simulation session.


31.4 Alert Webhook Specification

Registration:

POST /api/v1/webhooks
Content-Type: application/json
Authorization: Bearer <admin_jwt>

{
  "url": "https://ansp-dispatch.example.com/spacecom/hook",
  "events": ["alert.new", "tip.new"],
  "secret": "webhook_shared_secret_min_32_chars"
}

Response includes webhook_id. The secret is encrypted at rest and is never displayed again after registration; it must remain recoverable server-side, since SpaceCom uses it to compute the HMAC delivery signatures below (a one-way hash such as bcrypt would make signing impossible).

Delivery:

POST https://ansp-dispatch.example.com/spacecom/hook
Content-Type: application/json
X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, raw_body)>
X-SpaceCom-Event: alert.new
X-SpaceCom-Delivery: <uuid>

{ "type": "alert.new", "seq": 1042, ... }

Receiver verification (example):

import hmac, hashlib

def verify_signature(secret: str, body: bytes, header_sig: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)

Retry and status lifecycle:

| State | Condition | Action |
|-------|-----------|--------|
| active | Deliveries succeeding | Normal operation |
| degraded | 3 consecutive delivery failures | Org admin notified by email; deliveries continue |
| disabled | 10 consecutive delivery failures | No further deliveries; manual re-enable via `PATCH /webhooks/{id}` required |
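
The lifecycle reduces to a small state machine keyed on consecutive failures; an illustrative sketch (class name and in-memory state are assumptions — production state lives in the database):

```python
# Thresholds from the lifecycle table: 3 consecutive failures -> degraded,
# 10 consecutive failures -> disabled. Any success resets the counter.
DEGRADED_AFTER = 3
DISABLED_AFTER = 10

class WebhookState:
    def __init__(self):
        self.consecutive_failures = 0
        self.status = "active"

    def record_delivery(self, success: bool):
        if self.status == "disabled":
            return  # no further deliveries until manual re-enable
        if success:
            self.consecutive_failures = 0
            self.status = "active"
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= DISABLED_AFTER:
            self.status = "disabled"   # requires PATCH /webhooks/{id}
        elif self.consecutive_failures >= DEGRADED_AFTER:
            self.status = "degraded"   # org admin emailed; keep delivering
```

A single successful delivery from the degraded state returns the webhook to active; only the disabled state is sticky.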

31.5 Interoperability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| ADS-B source | OpenSky Network REST API | Free, global, sufficient for Phase 3 route overlay; upgrade path to FAA SWIM ADS-B if coverage gaps emerge |
| CCSDS OEM reference frame | GCRF | SpaceCom frame transform pipeline output; downstream tools expect GCRF |
| CCSDS CDM unknown fields | N/A per CCSDS 508.0-B-1 §4.3 | Silent omission causes downstream parser failures; N/A is the standard sentinel |
| CDM Pc reconciliation | Both Space-Track CDM Pc and SpaceCom Pc displayed with provenance; discrepancy flag on order-of-magnitude difference | Transparency over false precision; operators need to see the discrepancy, not have SpaceCom silently override it |
| FIR update mechanism | Staging table swap + regression test on 28-day AIRAC cycle | Direct overwrite during a live TIP event would corrupt ongoing airspace intersection queries |
| WebSocket event schema | Typed envelope with `type` discriminator + monotonic `seq` | Enables typed client generation; `seq` enables reliable missed-event recovery |
| Webhook signature | HMAC-SHA256 with `sha256=` prefix (same convention as GitHub webhooks) | Operators will already know this pattern; reduces integration friction |
| SWIM integration timing | Phase 2: GeoJSON export; Phase 3: FIXM review + AMQP endpoint | Full SWIM-TI requires EUROCONTROL B2B account and FIXM extension work — not Phase 1/2 blocking |
| API versioning | `/api/v1` base; 6-month parallel support on breaking changes; RFC 8594 headers | Space operators need stable contracts; 6-month overlap is industry standard for operational API changes |
| Space weather format | JSON REST endpoints (not legacy ASCII FTP) | ASCII FTP format is brittle; NOAA SWPC JSON API is stable and machine-readable; contract test catches format changes |

32. Ethics / Algorithmic Accountability

SpaceCom makes algorithmic predictions that inform operational airspace decisions. False negatives are catastrophic; false positives cause economic disruption and erode operator trust. This section documents the accountability framework that governs how the prediction model is specified, validated, changed, and monitored.

Applicable frameworks: IEEE 7001-2021 (Transparency of Autonomous Systems), NIST AI RMF (Govern/Map/Measure/Manage), ICAO Safety Management (Annex 19), ECSS-Q-ST-80C (Software Product Assurance).


32.1 Decay Predictor Model Card

The model card is a living document maintained at docs/model-card-decay-predictor.md. It is a required artefact for ESA Phase 2 TRL demonstrations and ANSP SMS acceptance. It must be updated whenever the model version changes.

Required sections:

# Decay Predictor Model Card — SpaceCom v<X.Y.Z>

## Model summary
Numerical decay predictor using RK7(8) adaptive integrator + NRLMSISE-00 atmospheric
density model + J2–J6 geopotential + solar radiation pressure. Monte Carlo uncertainty
via 500-sample ensemble varying F10.7 (±20%), Ap, and B* (±10%).

## Validated orbital regime
- Perigee altitude: 100–600 km
- Inclination: 0–98°
- Object type: rocket bodies and payloads with RCS > 0.1 m²
- B* range: 0.0001–0.3
- Area-to-mass ratio: 0.005–0.04 m²/kg

## Known out-of-distribution inputs (ood_flag triggers)
| Parameter | OOD condition | Expected behaviour |
|-----------|--------------|-------------------|
| Area-to-mass ratio | > 0.04 m²/kg | Underestimates atmospheric drag; re-entry time predicted too late |
| data_confidence | 'unknown' | Physical properties estimated from object type defaults; wide systematic uncertainty |
| TLE count in history | < 5 TLEs in last 30 days | B* estimate unreliable; uncertainty may be significantly underestimated |
| Perigee altitude | < 100 km | Object may already be in final decay corridor; NRLMSISE-00 not calibrated below 100 km |

## Performance characterisation
(Updated from backcast validation report — see MinIO docs/backcast-validation-v<X>.pdf)

| Object category | N backcasts | p50 error (median) | p50 error (95th pct) | Corridor containment |
|----------------|-------------|-------------------|---------------------|---------------------|
| Rocket bodies, RCS > 2 m² | TBD | TBD | TBD | TBD |
| Payloads, RCS 0.5–2 m² | TBD | TBD | TBD | TBD |
| Small debris / unknown RCS | TBD (underrepresented) | TBD | TBD | TBD |

## Known systematic biases
- NRLMSISE-00 underestimates atmospheric density during geomagnetic storms at altitudes 200–350 km.
  Effect: predictions during Kp > 5 events tend to predict re-entry slightly later than observed.
  Mitigation: space weather buffer recommendation adds ≥2h beyond p95 during Elevated/Severe/Extreme conditions.
- Tumbling objects: effective drag area unknown; B* from TLEs reflects tumble-averaged drag.
  Effect: uncertainty may be systematically underestimated for highly elongated objects.
- Calibration data bias: validation events are dominated by large well-tracked objects from major launch
  programmes. Small debris and objects from less-tracked orbital regimes are underrepresented.

## Not intended for
- Objects with perigee < 100 km (already in terminal descent corridor)
- Crewed vehicles (use mission-specific tools)
- Objects undergoing active manoeuvring
- Predictions beyond 21 days (F10.7 forecast skill degrades sharply beyond 3 days)

32.2 Backcast Validation Requirements

Phase 1 minimum: ≥3 historical re-entries selected from The Aerospace Corporation observed re-entry database. Selection criteria documented.

Phase 2 target: ≥10 historical re-entries. The validation report (docs/backcast-validation-v<X>.pdf) must explicitly:

  1. Document selection criteria — which events were chosen and why. Selection must include at least one event from each of: rocket bodies, payloads, and at least one high-area-to-mass object if available.
  2. Flag underrepresented categories — explicitly state which object types have < 3 validation events and what the implication is for accuracy claims in those categories.
  3. State accuracy as conditional — not "p50 accuracy is ±2h" but "for rocket bodies (N=7): median p50 error is 1.8h; for payloads (N=3): median p50 error is 3.1h; for small debris (N=0): no validation data available."
  4. Include negative results — events where the p95 corridor did not contain the observed impact point must be included and analysed.
  5. Compare across model versions — each new validation report must include a comparison table against the previous version's results.

The validation report is generated by modules.feedback and stored in MinIO docs/ bucket with a version tag matching the model version.


32.3 Out-of-Distribution Detection

At prediction creation time, propagator/decay.py evaluates each input object against the OOD bounds defined in docs/ood-bounds.md and sets reentry_predictions.ood_flag and ood_reason accordingly.

OOD checks (initial set — update in docs/ood-bounds.md as model is validated):

def check_ood(obj: ObjectParams) -> tuple[bool, list[str]]:
    reasons = []
    if obj.area_to_mass_ratio is not None and obj.area_to_mass_ratio > 0.04:
        reasons.append("high_am_ratio")
    if obj.data_confidence == "unknown":
        reasons.append("low_data_confidence")
    if obj.tle_count_last_30d is not None and obj.tle_count_last_30d < 5:
        reasons.append("sparse_tle_history")
    if obj.perigee_km is not None and obj.perigee_km < 100:
        reasons.append("sub_100km_perigee")
    if obj.bstar is not None and not (0.0001 <= obj.bstar <= 0.3):
        reasons.append("bstar_out_of_range")
    return len(reasons) > 0, reasons

UI presentation when ood_flag = TRUE:

⚠ OUT-OF-CALIBRATION-RANGE PREDICTION
──────────────────────────────────────────────────────────────
This prediction uses inputs outside the model's validated range:
  • high_am_ratio — effective drag may be underestimated
  • low_data_confidence — physical properties estimated from defaults

Timing uncertainty may be significantly larger than shown.
For operational planning, treat the p95 window as a minimum bound.

[What does this mean? →]
──────────────────────────────────────────────────────────────

The callout is mandatory and non-dismissable. It appears above the prediction panel wherever the prediction is displayed. It does not prevent the prediction from being used — operators retain full autonomy.


32.4 Recalibration Governance

The modules.feedback pipeline computes atmospheric density scaling coefficients from observed re-entry outcomes recorded in prediction_outcomes. Updating these coefficients changes all future predictions.

Recalibration procedure:

  1. Trigger: Automated check in the feedback pipeline flags when the last 10 outcomes show a systematic bias (median p50 error > 1.5× the historical baseline).
  2. Candidate coefficients: New coefficients computed from the full prediction_outcomes history using a hold-out split (80% train / 20% hold-out). Hold-out set is fixed and never used in training.
  3. Validation gate: New coefficients must achieve:
    • ≥ 5% improvement in median p50 error on hold-out set
    • No regression (> 10% worsening) on any validated object type category
    • Corridor containment rate ≥ 95% on hold-out set
  4. Sign-off: Physics lead + engineering lead both must approve via PR review. PR includes the validation comparison table.
  5. Active prediction handling: Before deployment, a batch job re-runs all active predictions (status = active, not superseded) using the new coefficients. Each re-run creates a new prediction record linked via superseded_by. ANSPs with active shadow deployments receive an automated notification: "SpaceCom model recalibrated — active predictions updated. Previous predictions superseded. New model version: X.Y.Z."
  6. Rollback: If a post-deployment accuracy regression is detected, the previous coefficient set is restored via the same procedure (treated as a new recalibration). The rollback is logged to security_logs type MODEL_ROLLBACK.
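
The validation gate in step 3 can be encoded as a single check; an illustrative sketch (signature, units, and field names are assumptions):

```python
def passes_validation_gate(
    baseline_p50_error_h: float,
    candidate_p50_error_h: float,
    category_deltas: dict,   # per-category fractional change in p50 error (+ = worse)
    containment_rate: float, # corridor containment on the hold-out set
) -> bool:
    """Illustrative encoding of the three recalibration gate criteria."""
    improvement = (baseline_p50_error_h - candidate_p50_error_h) / baseline_p50_error_h
    if improvement < 0.05:   # >= 5% improvement in median p50 error required
        return False
    if any(delta > 0.10 for delta in category_deltas.values()):
        return False         # > 10% regression on some validated category
    return containment_rate >= 0.95

ok = passes_validation_gate(
    baseline_p50_error_h=2.0,
    candidate_p50_error_h=1.8,   # 10% improvement on hold-out
    category_deltas={"rocket_body": -0.10, "payload": 0.02},
    containment_rate=0.97,
)
```

All three criteria are conjunctive: a candidate coefficient set that improves the headline error but regresses any single object-type category still fails the gate.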

32.5 Model Version Governance

Version classification:

| Classification | Examples | Active prediction re-run? | ANSP notification required? |
|----------------|----------|---------------------------|------------------------------|
| Patch | Documentation update, logging improvement, no physics change | No | No |
| Minor | Performance improvement, OOD bound adjustment, new object type support | No (optional for analyst review) | Yes — changelog summary |
| Major | Integrator change, density model change, MC parameter change, recalibration | Yes — all active predictions superseded | Yes — written notice to all shadow deployment partners; 2-week notice before deployment |

Version string: Semantic version (MAJOR.MINOR.PATCH) embedded in every prediction record at creation time as model_version. The currently deployed version is exposed via GET /api/v1/system/model-version.

Cross-version prediction display: When a prediction was made with a model version that differs from the current deployed version by a major bump, the UI shows:

⚠ Prediction generated with model v1.2.0 — current model is v2.0.0 (major update).
  This prediction reflects older parameters. Re-run recommended for operational planning.
  [Re-run with current model →]
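
The display rule reduces to a semantic-version major comparison; a minimal sketch (helper name illustrative):

```python
def needs_rerun_warning(prediction_version: str, deployed_version: str) -> bool:
    """Show the cross-version callout only on a major-version difference.

    Versions are MAJOR.MINOR.PATCH strings, per the governance table above.
    """
    pred_major = int(prediction_version.split(".")[0])
    deployed_major = int(deployed_version.split(".")[0])
    return deployed_major > pred_major
```

Minor and patch differences deliberately do not trigger the callout — per the classification table, they never supersede active predictions.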

32.6 Adverse Outcome Monitoring

Continuous monitoring of prediction accuracy post-deployment is a regulatory credibility requirement. It is also the primary input to the recalibration pipeline.

Data flow:

  1. Analyst logs observed re-entry outcome via POST /api/v1/predictions/{id}/outcome after post-event analysis (source: The Aerospace Corporation observed re-entry database, US18SCS reports, or ESA ESOC confirmation)
  2. prediction_outcomes record created with p50_error_minutes, corridor_contains_observed, fir_false_positive, fir_false_negative
  3. Feedback pipeline runs weekly: aggregates outcomes, computes rolling accuracy metrics, flags systematic biases
  4. Grafana Model Accuracy dashboard shows: rolling 90-day median p50 error, corridor containment rate, false positive rate (CRITICAL alerts with no confirmed hazard), false negative rate (confirmed hazard with no CRITICAL alert)
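
The weekly aggregation in step 3 can be sketched over `prediction_outcomes` rows; field names follow the columns above, while the function name and dict interface are illustrative:

```python
from statistics import median

def accuracy_metrics(outcomes: list) -> dict:
    """Aggregate prediction_outcomes rows into rolling dashboard metrics.

    Each outcome dict mirrors the table columns: p50_error_minutes,
    corridor_contains_observed, fir_false_positive, fir_false_negative.
    """
    n = len(outcomes)
    return {
        "n": n,
        "median_p50_error_minutes": median(o["p50_error_minutes"] for o in outcomes),
        "corridor_containment_rate": sum(o["corridor_contains_observed"] for o in outcomes) / n,
        "false_positive_rate": sum(o["fir_false_positive"] for o in outcomes) / n,
        "false_negative_rate": sum(o["fir_false_negative"] for o in outcomes) / n,
    }
```

The same aggregation, restricted to a 90-day window and grouped by model version, would feed the Grafana panels and the quarterly transparency report.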

Quarterly transparency report: Generated automatically from prediction_outcomes. Contains aggregate (non-personal) data:

  • Total predictions served in the quarter
  • Number of outcomes recorded (and percentage — coverage of the total)
  • Median p50 error, 95th percentile error
  • Corridor containment rate
  • False positive rate (CRITICAL alerts with no confirmed hazard) and estimated false negative rate
  • Known model limitations summary (from model card)
  • Model version(s) active during the quarter

Report stored in MinIO public-reports/ bucket and made available on SpaceCom's public documentation site. The report is a Phase 3 deliverable.


32.7 Geographic Coverage Quality

FIR intersection quality varies by boundary data source. Operators in non-ECAC regions receive lower-quality airspace intersection assessments than European counterparts. This disparity must be acknowledged, not hidden.

Coverage quality levels:

| Source | Coverage quality | Regions |
|--------|------------------|---------|
| EUROCONTROL AIRAC | High | All ECAC states (Europe, Turkey, Israel, parts of North Africa) |
| FAA Digital-Terminal Procedures | High | Continental US, Alaska, Hawaii, US territories |
| OpenAIP | Medium | Global fallback; community-maintained; may lag AIRAC |
| Manual / not loaded | Low | Any region where no FIR data has been imported |

The airspace table has a coverage_quality column (high / medium / low). The airspace intersection API response includes coverage_quality per affected FIR. The UI shows a coverage quality callout on the airspace impact table when any affected FIR is medium or low:

⚠ FIR boundary quality: MEDIUM (OpenAIP source)
  Intersection calculations for this region use community-maintained boundary data.
  Verify with official AIRAC charts before operational use.
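
The callout trigger is a worst-quality check across affected FIRs; an illustrative sketch (field names follow the intersection API response described above):

```python
# Lower rank = better quality; callout fires on anything below "high".
QUALITY_RANK = {"high": 0, "medium": 1, "low": 2}

def coverage_callout_needed(affected_firs: list) -> bool:
    """Show the coverage quality callout when any affected FIR is
    below 'high' quality.

    Each dict mirrors the per-FIR intersection response:
    {"fir_id": ..., "coverage_quality": "high" | "medium" | "low"}.
    """
    return any(QUALITY_RANK[f["coverage_quality"]] > 0 for f in affected_firs)
```

A single medium- or low-quality FIR among the affected set is enough to trigger the callout, even when the remaining FIRs have high-quality boundaries.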

32.8 Ethics Accountability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| Model card | Required artefact; maintained alongside model in `docs/` | Regulators and ANSPs need a documented operational envelope; ESA TRL process requires it |
| Backcast accuracy statement | Conditional on object type; selection bias explicitly documented | Single unconditional figure misrepresents model generalisation to non-specialist audiences |
| OOD detection | Evaluated at prediction time; `ood_flag` + UI warning callout; prediction still served | Operators retain autonomy; OOD flag informs rather than blocks; hiding it would create false confidence |
| Recalibration governance | Hold-out validation + dual sign-off + active prediction re-run + ANSP notification | Ungoverned recalibration is an ungoverned change to a safety-critical model |
| Alert threshold governance | Documented rationale; change requires PR review + 2-week shadow validation + ANSP notification | Threshold values are consequential algorithmic decisions; they must be as auditable as code changes |
| Prediction staleness warning | `prediction_valid_until = p50 - 4h`; warning independent of system health banner | A prediction for an imminent re-entry event has growing implicit uncertainty; operators need a signal |
| Adverse outcome monitoring | `prediction_outcomes` table; weekly pipeline; quarterly public report | Without outcome data, performance claims are assertions not evidence; public report builds regulatory trust |
| FIR coverage disparity | `coverage_quality` column on `airspace`; disclosed per-FIR in intersection results | Hiding coverage quality differences from operators would be a form of false precision |
| False positive / negative framing | Both tracked in `prediction_outcomes`; both in quarterly report | Optimising only for one error type can silently worsen the other; both must be visible |
| Public transparency report | Aggregate accuracy data; no personal data; quarterly cadence | Aviation safety infrastructure operates in a regulated transparency environment; SpaceCom must too |

33. Technical Writing / Documentation Engineering

33.1 Documentation Principles

SpaceCom documentation has three distinct audiences with different needs:

| Audience | Primary docs | Format |
|----------|--------------|--------|
| Engineers building the system | ADRs, inline docstrings, test plan, AGENTS.md | Markdown in repo |
| Operators using the system | User guides, API guide, in-app help | Hosted docs site / PDF |
| Regulators and auditors | Model card, validation reports, runbooks, CHANGELOG | Formal documents; version-controlled |

Documentation that serves the wrong audience in the wrong format fails both audiences. The §12.1 docs/ directory tree encodes this separation by subdirectory.


33.2 Architecture Decision Record (ADR) Standard

Format: MADR — Markdown Architectural Decision Records. Lightweight, git-friendly, no tooling dependency.

File naming: docs/adr/NNNN-short-title.md where NNNN is a zero-padded sequence number.

Template:

# NNNN — <Title>

**Status:** Accepted | Superseded by [MMMM](MMMM-title.md) | Deprecated

## Context

<What is the issue or design question this decision addresses? What forces are at play?>

## Decision

<What was decided?>

## Consequences

**Positive:** <What does this decision make easier or better?>
**Negative / trade-offs:** <What does this decision make harder or require accepting?>
**Neutral:** <Other effects worth noting>

## Alternatives considered

| Alternative | Why rejected |
|-------------|-------------|
| ...         | ...         |

Linking from code: When a code section implements a non-obvious decision, add an inline comment: # See docs/adr/0003-monte-carlo-chord-pattern.md. This makes the rationale discoverable from the code, not just from the plan.

Required initial ADR set (Phase 1):

| ADR | Decision |
|-----|----------|
| 0001 | RS256 asymmetric JWT over HS256 |
| 0002 | Dual front-door architecture (aviation + space portals) |
| 0003 | Monte Carlo chord pattern (Celery group + chord) |
| 0004 | GEOGRAPHY vs GEOMETRY spatial column types |
| 0005 | `lazy="raise"` on all SQLAlchemy relationships |
| 0006 | TimescaleDB chunk intervals (orbits: 1 day, space_weather: 30 days) |
| 0007 | CesiumJS commercial licence requirement |
| 0008 | PgBouncer transaction-mode pooling |
| 0009 | CCSDS OEM GCRF reference frame |
| 0010 | Alert threshold rationale (6h CRITICAL, 24h HIGH) |

33.3 OpenAPI Documentation Standard

FastAPI auto-generates OpenAPI 3.1 schema from Python type annotations. Auto-generation is necessary but not sufficient. The following requirements are enforced by CI.

Per-endpoint requirements:

@router.get(
    "/reentry/predictions/{id}",
    summary="Get re-entry prediction by ID",
    description=(
        "Returns a single re-entry prediction with HMAC integrity verification. "
        "If the prediction's HMAC fails verification, returns 503 — do not use the data. "
        "Requires `viewer` role minimum. OOD-flagged predictions include a warning field."
    ),
    tags=["Re-entry"],
    responses={
        200: {"description": "Prediction returned; check `integrity_failed` field"},
        401: {"description": "Not authenticated"},
        403: {"description": "Insufficient role"},
        404: {"description": "Prediction not found or belongs to another organisation"},
        503: {"description": "HMAC integrity check failed — prediction data is untrusted"},
    },
)
async def get_prediction(id: int, ...):

CI enforcement: A pytest fixture iterates the FastAPI app's routes and asserts that description is non-empty for every route with path starting /api/v1/. Fails CI with a list of non-compliant endpoints.
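
The fixture logic reduces to a scan over route metadata; a dependency-free sketch of the check (the real fixture pulls path/description pairs from the FastAPI app's route table):

```python
from __future__ import annotations

def undocumented_api_routes(routes: list) -> list:
    """Return paths under /api/v1/ whose description is empty or missing.

    `routes` is a list of (path, description) pairs; the real CI fixture
    builds these from the FastAPI app's routes and fails the test run
    with the returned list when it is non-empty.
    """
    return [
        path for path, description in routes
        if path.startswith("/api/v1/") and not (description or "").strip()
    ]
```

Non-API routes (health checks, static assets) are deliberately exempt, so the check only gates the documented public contract.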

Rate limiting documentation: Endpoints with rate limits include the limit in the description field: "Rate limited: 10 requests/minute per user. Returns 429 with Retry-After header when exceeded."


33.4 Runbook Standard

Template (docs/runbooks/TEMPLATE.md):

# Runbook: <Title>

**Severity:** SEV-1 | SEV-2 | SEV-3 | SEV-4
**Owner:** <team or role>
**Last reviewed:** YYYY-MM-DD
**Estimated duration:** <X minutes>

## Trigger condition

<What condition causes this runbook to be needed? What alert or observation triggers it?>

## Preconditions

- [ ] You have SSH access to the production host
- [ ] <other preconditions>

## Steps

1. <First step — be specific; include exact commands>
2. <Second step>
   ```bash
   # exact command with expected output noted
   docker compose ps
   ```
3. ...

## Verification

<How do you confirm the runbook was successful? What does healthy state look like?>

## Rollback

<If the steps made things worse, how do you undo them?>

## Notify

- Engineering lead notified (Slack #incidents)
- On-call via PagerDuty if SEV-1/2
- ANSP partners notified if operational disruption (template: docs/runbooks/ansp-notification-template.md)

**Runbook index** (`docs/runbooks/README.md`):

| Runbook | Severity | Owner | Last reviewed |
|---------|----------|-------|--------------|
| `db-failover.md` | SEV-1 | Platform | Phase 3 |
| `celery-recovery.md` | SEV-2 | Platform | Phase 3 |
| `hmac-failure.md` | SEV-1 | Security | Phase 1 |
| `ingest-failure.md` | SEV-2 | Platform | Phase 1 |
| `gdpr-breach-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `safety-occurrence-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `secrets-rotation-jwt.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-spacetrack.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-hmac.md` | SEV-1 | Engineering Lead | Phase 2 |
| `blue-green-deploy.md` | SEV-3 | Platform | Phase 3 |
| `restore-from-backup.md` | SEV-2 | Platform | Phase 2 |

---

### 33.5 Docstring Standard

All public functions in the following modules must have Google-style docstrings:
`propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `breakup/atmospheric.py`, `conjunction/probability.py`, `integrity.py`, `frame_utils.py`, `time_utils.py`.

**Required docstring sections:** `Args` (with physical units for all dimensional quantities), `Returns`, `Raises`, and `Notes` (for numerical limitations or known edge cases).

```python
def integrate_trajectory(
    object_id: int,
    f107: float,
    bstar: float,
    params: dict,
) -> TrajectoryResult:
    """Integrate a single RK7(8) decay trajectory from current epoch to re-entry.

    Uses NRLMSISE-00 atmospheric density model with J2–J6 geopotential and
    solar radiation pressure. Terminates at 80 km altitude (configurable via
    params['termination_altitude_km']).

    Args:
        object_id: NORAD catalog number of the decaying object.
        f107: Solar flux index (10.7 cm) in solar flux units (sfu).
            Valid range: 65–300 sfu. Values outside this range are accepted
            but produce extrapolated NRLMSISE-00 results (see docs/ood-bounds.md).
        bstar: BSTAR drag term from TLE (units: 1/Earth_radius).
            Valid range: 0.0001–0.3 per docs/ood-bounds.md.
        params: Simulation parameters dict. Required keys:
            'mc_samples' (int), 'termination_altitude_km' (float, default 80.0).

    Returns:
        TrajectoryResult with fields: reentry_time (UTC datetime),
        impact_lat_deg (float), impact_lon_deg (float), final_velocity_ms (float).

    Raises:
        IntegrationDivergenceError: If the integrator step size shrinks below
            1e-6 seconds (indicates numerical instability — log and flag as OOD).
        ValueError: If object_id is not in the database.

    Notes:
        NRLMSISE-00 is calibrated for 100–600 km altitude. Below 100 km the
        density is extrapolated and uncertainty grows significantly. The OOD
        flag is set by the caller based on ood-bounds.md thresholds, not here.
    """

Enforcement: mypy pre-commit hook enforces no untyped function signatures. A separate CI check using pydocstyle or ruff with docstring rules enforces non-empty docstrings on public functions in the listed modules.


33.6 CHANGELOG.md Format

Follows Keep a Changelog conventions. Human-maintained — not auto-generated from commit messages.

# Changelog

All notable changes to SpaceCom are documented here.
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)

## [Unreleased]

## [1.0.0] — 2026-MM-DD

### Added
- Re-entry decay predictor (RK7(8) + NRLMSISE-00 + Monte Carlo 500 samples)
- Percentile corridor visualisation (Mode A)
- Space weather widget (NOAA SWPC + ESA SWS cross-validation)
- CRITICAL/HIGH/MEDIUM/LOW alert system with two-step CRITICAL acknowledgement
- Shadow mode with per-org legal clearance gate

### Security
- JWT RS256 with httpOnly cookies; TOTP MFA enforced for all roles
- HMAC-SHA256 integrity on all prediction and hazard zone records
- Append-only `alert_events` and `security_logs` tables

## [0.1.0] — 2026-MM-DD (Phase 1 internal)
...

Who maintains it: The engineer cutting the release writes the entry. Product owner reviews before tagging. Entries are written for operators and regulators — not for engineers.


33.7 User Documentation Plan

| Document | Audience | Phase | Format | Location |
|----------|----------|-------|--------|----------|
| Aviation Portal User Guide | Persona A/B/C | Phase 2 | Markdown → PDF | `docs/user-guides/aviation-portal-guide.md` |
| Space Portal User Guide | Persona E/F | Phase 3 | Markdown → PDF | `docs/user-guides/space-portal-guide.md` |
| Administrator Guide | Persona D | Phase 2 | Markdown | `docs/user-guides/admin-guide.md` |
| API Developer Guide | Persona E/F | Phase 2 | Markdown → hosted | `docs/api-guide/` |
| In-app contextual help | Persona A/C | Phase 3 | React component content | `frontend/src/components/shared/HelpContent.ts` |

Aviation Portal User Guide — required sections:

  1. Dashboard overview (what you see on first login)
  2. Understanding the globe display and urgency symbols
  3. Reading a re-entry event: window range, corridor, risk level
  4. Alert acknowledgement workflow (step-by-step with screenshots)
  5. NOTAM draft workflow and mandatory disclaimer
  6. Degraded mode: what the banners mean and what to do
  7. Sharing views: deep links
  8. Contacting SpaceCom support

Review requirement: The aviation portal guide must be reviewed by at least one Persona A representative (ANSP duty manager or equivalent) before first shadow deployment. Their sign-off is recorded in docs/user-guides/review-log.md.


33.8 API Developer Guide

Located at docs/api-guide/. This is the primary onboarding resource for Persona E (space operators using API keys) and Persona F (orbital analysts with programmatic access).

Minimum content for Phase 2:

authentication.md:

  • How to create an API key (step-by-step with screenshots)
  • How to attach the key to requests (Authorization: Bearer <key> header)
  • API key scopes and which endpoints each scope can access
  • How to revoke a key

rate-limiting.md:

  • Per-endpoint rate limits in a table
  • 429 response format and Retry-After header usage
  • Burst vs. sustained limits

error-reference.md:

400 Bad Request        — Invalid parameters; see `detail` field
401 Unauthorized       — Missing or invalid API key
403 Forbidden          — API key does not have the required scope
404 Not Found          — Resource not found or not owned by your account
422 Unprocessable      — Request body failed schema validation
429 Too Many Requests  — Rate limit exceeded; see Retry-After header
503 Service Unavailable — HMAC integrity check failed; do not use the returned data

code-examples/python-quickstart.py:

import requests

API_BASE = "https://api.spacecom.io/api/v1"
API_KEY = "sk_live_..."   # from your API key dashboard

session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"

# Get list of tracked objects currently decaying
resp = session.get(f"{API_BASE}/objects", params={"decay_status": "decaying"})
resp.raise_for_status()
objects = resp.json()["results"]
print(f"{len(objects)} objects in active decay")

# Get OEM ephemeris for the first object
norad_id = objects[0]["norad_id"]
resp = session.get(
    f"{API_BASE}/space/objects/{norad_id}/ephemeris",
    headers={"Accept": "application/ccsds-oem"},
    params={"start": "2026-03-17T00:00:00Z", "end": "2026-03-18T00:00:00Z"}
)
print(resp.text)   # CCSDS OEM format

33.9 AGENTS.md Specification

AGENTS.md at the project root provides guidance to AI coding agents (such as Claude Code) working in this codebase. It is a first-class documentation artefact — committed to the repo, version-controlled, and referenced in the onboarding guide.

Required sections:

# SpaceCom — Agent Guidance

## Codebase overview
<3-paragraph summary of architecture, key modules, and safety context>

## Safety-critical files — extra care required
The following files have safety-critical implications. Any change must include
a test and a brief rationale comment:
- `backend/app/frame_utils.py` — frame transforms affect corridor coordinates
- `backend/app/integrity.py` — HMAC signing affects prediction integrity guarantees
- `backend/app/modules/propagator/decay.py` — physics model
- `backend/app/modules/alerts/service.py` — alert trigger logic
- `backend/migrations/` — schema changes affect immutability triggers

## Test requirements
- All backend changes must pass `make test` before committing
- Physics function changes require a new test case in the relevant test module
- Security-relevant changes require a `test_rbac.py` or `test_integrity.py` case
- Never mock the database in integration tests — use the test DB container

## Code conventions
- FastAPI endpoints must have `summary`, `description`, and `responses` (see §33.3)
- Public physics/security functions must have Google-style docstrings with units
- All new decisions should have an ADR in `docs/adr/` (see §33.2)
- New runbooks go in `docs/runbooks/` using the template at `docs/runbooks/TEMPLATE.md`

## Playwright / E2E test selector convention
- Every interactive element targeted by a Playwright test **must** have a `data-testid="<component>-<action>"` attribute
  - Examples: `data-testid="alert-acknowledge-btn"`, `data-testid="notam-draft-submit"`, `data-testid="decay-predict-form"`
- Playwright tests must use `page.getByTestId(...)` or accessible role selectors (`page.getByRole(...)`) **only**
- CSS class selectors, XPath, and `page.locator('.')` are forbidden in test files
- A CI lint step (`grep -r 'page\.locator\b\|page\.\$' tests/e2e/`) must return empty

## What not to do
- Do not add `latest` tags to Docker image references
- Do not store secrets in `.env` files committed to git
- Do not make changes to alert thresholds without updating `docs/alert-threshold-history.md`
- Do not change `model_version` in `decay.py` without following the model version governance procedure (§32.5)
- Do not proxy the Cesium ion token server-side — it is a public browser credential by design (`NEXT_PUBLIC_CESIUM_ION_TOKEN`). Do not store it in Vault, Docker secrets, or treat it as sensitive.
- Do not add write operations (POST/PUT/DELETE API calls, Zustand mutations) to components rendered in SIMULATION or REPLAY mode without calling `useModeGuard(['LIVE'])` first and disabling the control in non-LIVE modes.

33.10 Test Documentation Standard

Test pyramid and coverage gates — enforced in CI; make test runs all layers:

| Layer | Scope | Minimum gate | CI enforcement |
|---|---|---|---|
| Unit | backend/app/ excluding migrations/, schemas/ | 80% line coverage | pytest --cov=backend/app --cov-fail-under=80 |
| Integration | Every API endpoint × every applicable role | 100% of routes in test_rbac.py | RBAC matrix fixture enumerates all FastAPI routes via app.routes |
| E2E | 5 critical user journeys (see below) | All journeys pass | Playwright job in CI; blocks merge |
| Physics validation | All suites in docs/test-plan.md marked Blocking | 0 failures | Separate CI job; always runs before merge |
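The Integration layer's enforcement mechanism — enumerating every FastAPI route so the RBAC matrix cannot silently miss an endpoint — can be sketched as follows. The `Route` dataclass here is an illustrative stand-in for `fastapi.routing.APIRoute`; in the real fixture the objects come straight from `app.routes` on the FastAPI instance.

```python
from dataclasses import dataclass

@dataclass
class Route:
    """Stand-in for fastapi.routing.APIRoute (path + HTTP methods)."""
    path: str
    methods: set[str]

def enumerate_routes(routes: list[Route]) -> list[tuple[str, str]]:
    """Expand every route into (method, path) pairs so the RBAC matrix can be
    parametrised exhaustively — no endpoint can be forgotten when a new route
    is added."""
    pairs: list[tuple[str, str]] = []
    for route in routes:
        # HEAD/OPTIONS are auto-generated and not RBAC-relevant
        for method in sorted(route.methods - {"HEAD", "OPTIONS"}):
            pairs.append((method, route.path))
    return pairs

# pytest consumes this via @pytest.mark.parametrize("method,path", enumerate_routes(...))
```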

5 critical user journeys (E2E blocking):

  1. CRITICAL alert → acknowledge → NOTAM draft saved
  2. Analyst submits decay prediction → job completes → corridor visible on globe
  3. Admin creates user → user logs in → MFA enrolment complete
  4. Space operator registers object → views conjunction list
  5. Admin enables shadow mode → shadow prediction absent from viewer response

Module docstring requirement for all physics and security test modules:

"""
test_frame_utils.py — Frame Transformation Validation Suite

Physical invariant tested:
    TEME → GCRF → ITRF → WGS84 coordinate chain must agree with
    Vallado (2013) reference state vectors to within specified tolerances.

Reference source:
    Vallado, D.A. (2013). Fundamentals of Astrodynamics and Applications, 4th ed.
    Table 3-4 (GCRF↔ITRF) and Table 3-5 (TEME→GCRF). Reference vectors in
    docs/validation/reference-data/vallado-sgp4-cases.json.

Operational significance of failure:
    A frame transform error propagates directly into corridor polygon coordinates.
    A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km.
    ALL tests in this module are BLOCKING CI failures.

How to add a new test case:
    1. Add the reference state vector to vallado-sgp4-cases.json
    2. Add a parametrised test case to TestTEMEGCRF or TestGCRFITRF
    3. Document the source in a comment on the test case
"""

docs/test-plan.md structure:

| Suite | Module(s) | Physical invariant / behaviour | Reference | Pass tolerance | Blocking? |
|---|---|---|---|---|---|
| Frame transforms | tests/physics/test_frame_utils.py | TEME→GCRF→ITRF→WGS84 chain accuracy | Vallado (2013) Table 3-4/3-5 | Position < 1 km | Yes |
| SGP4 propagator | tests/physics/test_propagator/ | State vector at epoch; 7-day propagation | Vallado (2013) test set | < 1 km at epoch; < 10 km at +7d | Yes |
| Decay predictor | tests/physics/test_decay/ | p50 re-entry time accuracy; corridor containment | Aerospace Corp database | Median error < 4h; containment ≥ 90% | Phase 2+ |
| NRLMSISE-00 density | tests/physics/test_decay/test_nrlmsise.py | Density agrees with reference atmosphere | Picone et al. (2002) Table 1 | < 1% at 5 reference points | Yes |
| Hypothesis invariants | tests/physics/test_hypothesis.py | SGP4 round-trip; p95 corridor containment; RLS tenant isolation | Internal + Vallado | See §42.3 | Yes |
| HMAC integrity | tests/test_integrity.py | Tampered record detected; correct error response | Internal | 503 + CRITICAL log entry | Yes |
| RBAC enforcement | tests/test_rbac.py | Every endpoint returns correct status for every role | Internal | 0 mismatches | Yes |
| Rate limiting | tests/test_auth.py | 429 at threshold; 200 after reset | Internal | Exact threshold | Yes |
| WebSocket | tests/test_websocket.py | Sequence replay; token expiry warning; close codes 4001/4002 | Internal spec §14 | All assertions pass | Yes |
| Contract tests | tests/test_ingest/test_contracts.py | Space-Track + NOAA key presence AND value ranges | Internal | 0 violations | Yes (in CI against mocks) |
| Celery lifecycle | tests/test_jobs/test_celery_failure.py | Timed-out job → failed; orphan recovery Beat task | Internal | State correct within 5 min | Yes |
| MC corridor | tests/physics/test_mc_corridor.py | Corridor contains ≥ 95% of p95 trajectories; polygon matches committed reference | Internal (seeded RNG seed=42) | Area delta < 5% | Phase 2+ |
| Smoke suite | tests/smoke/ | API/WS health; auth; catalog non-empty; DB connectivity | Internal | All pass in ≤ 2 min | Yes (post-deploy) |
| E2E journeys | tests/e2e/ (Playwright) | 5 critical user journeys; WCAG 2.1 AA axe-core scan | Internal | 0 journey failures; 0 axe violations | Yes |
| Breakup energy conservation | tests/physics/test_breakup/ | Energy conserved through fragmentation | Internal analytic | < 1% error | Phase 2+ |

Test database isolation strategy — prevents test state leakage and enables parallel execution (pytest-xdist):

  • Unit tests and single-connection integration tests: db_session fixture wraps each test in a SAVEPOINT/ROLLBACK TO SAVEPOINT transaction. No committed data persists between tests.
  • Celery integration tests (multi-connection, multi-process): use testcontainers-python (PostgresContainer) to spin up a dedicated DB container per pytest-xdist worker. The container is created at session scope and torn down at session end. Each test worker sets search_path to its own schema (test_worker_<worker_id>) for additional isolation.
  • Never use the development or production DB for tests. The DATABASE_URL in test config must point to localhost:5433 (test container) or the testcontainers dynamic port. CI enforces this via environment variable assertion at test startup.
  • pytest.ini configuration:
    [pytest]
    addopts = -x --strict-markers -p no:warnings
    markers =
        quarantine: flaky tests excluded from blocking CI
        contract: external API contract tests; run against mocks in CI
        smoke: post-deploy smoke tests
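The SAVEPOINT/ROLLBACK isolation pattern from the first bullet can be sketched with the standard library — here sqlite3 stands in for the Postgres test container, and the real `db_session` fixture wraps the SQLAlchemy session the same way:

```python
import sqlite3

def run_in_savepoint(conn: sqlite3.Connection, work) -> None:
    """The pattern the db_session fixture applies around each test: open a
    SAVEPOINT, run the test body, then roll the SAVEPOINT back so no
    committed data persists between tests."""
    conn.execute("SAVEPOINT test_sp")
    try:
        work(conn)
    finally:
        conn.execute("ROLLBACK TO SAVEPOINT test_sp")  # undo the test's writes
        conn.execute("RELEASE SAVEPOINT test_sp")

# Demonstration: an INSERT inside the savepoint leaves no trace afterwards.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; explicit savepoints
conn.execute("CREATE TABLE predictions (id INTEGER)")
run_in_savepoint(conn, lambda c: c.execute("INSERT INTO predictions VALUES (1)"))
leaked = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
```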
    

Flaky test policy:

  1. A test is "flaky" if it fails without a code change ≥ 2 times in any 30-day window (tracked via GitHub Actions JUnit artefact history)
  2. On second flaky failure: the test is decorated with @pytest.mark.quarantine and moved to tests/quarantine/; a GitHub issue is filed automatically by the CI workflow
  3. Quarantined tests are excluded from blocking CI (pytest -m "not quarantine") but continue to run in a non-blocking nightly job so failures are visible
  4. A test in quarantine > 14 days without a fix must be deleted — a never-fixed flaky test provides no safety value and actively erodes trust in CI
  5. The quarantine list is reviewed at each sprint review; any test in quarantine > 30 days blocks the next sprint release gate
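Rule 1's detection logic is a sliding-window count over the JUnit artefact history; a minimal sketch (function name and inputs are illustrative):

```python
from datetime import datetime, timedelta, timezone

def is_flaky(failure_dates: list[datetime], now: datetime,
             window_days: int = 30, threshold: int = 2) -> bool:
    """Flaky-test rule 1: a test is flaky when >= 2 failures (without a code
    change) land inside the trailing 30-day window of its failure history."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for d in failure_dates if d >= cutoff) >= threshold
```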

33.11 Technical Writing Decision Log

| Decision | Chosen | Rationale |
|---|---|---|
| ADR format | MADR (Markdown) | Lightweight; git-native; no tooling; linkable from code comments |
| ADR location | docs/adr/ in monorepo | Engineers find rationale where they work, not in a separate wiki |
| Changelog format | Keep a Changelog (human-maintained) | Commit messages are for engineers; changelogs are for operators and regulators; auto-generation produces wrong audience tone |
| Docstring style | Google-style | Most readable inline; compatible with Sphinx if API reference generation is needed; ruff can enforce it |
| Runbook format | Standard template with Trigger/Steps/Verification/Rollback/Notify | On-call engineers under pressure skip steps that aren't explicitly numbered; Rollback and Notify are consistently omitted without a template |
| User documentation timing | Phase 2 for aviation portal; Phase 3 for space portal | ANSP SMS acceptance requires user documentation before shadow deployment; space portal can follow |
| API guide location | docs/api-guide/ in repo | Co-located with code; version-controlled; engineers update it when they change the API |
| AGENTS.md | Committed to repo root; safety-critical files explicitly listed | An undocumented AGENTS.md is ignored or followed inconsistently; explicit safety-critical file list is the highest-value content |
| Test documentation | Module docstring + docs/test-plan.md | ECSS-Q-ST-80C requires test specification as a separate artefact; module docstrings are the lowest-friction way to maintain it |
| OpenAPI enforcement | CI check on empty description fields | Developers don't write documentation voluntarily; CI enforcement is the only reliable mechanism |

34. Infrastructure Design

This section consolidates infrastructure-level specifications: TLS lifecycle, port map, reverse-proxy configuration, WAF/DDoS posture, object storage configuration, backup validation, egress control, and the HA database parameters. For Patroni parameters see §26.3; for port exposure details see §3.3; for storage tiering see §27.4; for DNS/service discovery see §27.6.


34.1 TLS Certificate Lifecycle

Certificate Issuance Decision Tree

Is the deployment internet-facing?
├── YES → Use Caddy ACME (Let's Encrypt / ZeroSSL)
│         Caddy automatically renews; no manual steps required
│         Domain must be publicly resolvable (A record pointing to Caddy host)
│
└── NO (air-gapped / on-premise with no public DNS)
    ├── Does the customer operate an internal CA?
    │   ├── YES → Request cert from customer CA; configure Caddy with cert_file + key_file
    │   │         Document CA chain in `docs/runbooks/tls-cert-lifecycle.md`
    │   └── NO  → Generate internal CA with `step-ca` (Smallstep)
    │               Run step-ca as a sidecar container on the management network
    │               Issue Caddy cert from internal CA; clients import internal CA root cert

Cert Expiry Alert Thresholds

Prometheus alert rules in monitoring/alerts/tls.yml:

| Alert | Threshold | Severity |
|---|---|---|
| TLSCertExpiringSoon | < 60 days remaining | WARNING |
| TLSCertExpiringImminent | < 30 days remaining | HIGH |
| TLSCertExpiryCritical | < 7 days remaining | CRITICAL (pages on-call) |

For ACME-managed certs: Caddy renews at 30 days remaining by default; the 30-day alert should never fire in steady state. The 7-day CRITICAL alert is the backstop for ACME renewal failures.

Runbook Entry

docs/runbooks/tls-cert-lifecycle.md must cover:

  1. How to verify current cert expiry (echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates)
  2. ACME renewal troubleshooting (Caddy logs: caddy logs --tail 100)
  3. Manual certificate replacement procedure for air-gapped deployments
  4. Internal CA cert distribution to client browsers / API consumers

34.2 Caddy Reverse Proxy Configuration

# /etc/caddy/Caddyfile
# Production Caddyfile stub — customise domain and backend addresses
{
    email admin@your-domain.com          # ACME account email
    # For air-gapped: comment out email, add tls /path/to/cert /path/to/key
}

your-domain.com {
    # TLS — automatic ACME for internet-facing; replace with manual cert for air-gapped
    tls {
        protocols tls1.2 tls1.3         # Disable TLS 1.0 and 1.1
    }

    # Security headers
    header {
        Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
        -Server                          # Strip Server header (do not expose Caddy version)
        -X-Powered-By                    # Strip if present
    }

    # WebSocket proxy (backend WebSocket endpoint)
    handle /ws/* {
        reverse_proxy backend:8000 {
            header_up Host {host}
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # API and SSR routes
    handle /api/* {
        reverse_proxy backend:8000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # Static assets — served with long-lived immutable cache headers (F8 — §58)
    # Next.js content-hashes all filenames under /_next/static/ — safe for max-age=1y
    handle /_next/static/* {
        header Cache-Control "public, max-age=31536000, immutable"
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Cesium workers and static resources (large; benefit most from caching)
    handle /cesium/* {
        header Cache-Control "public, max-age=604800"   # 7 days; not content-hashed
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Frontend (Next.js) — HTML and dynamic routes (no caching)
    handle {
        header Cache-Control "no-store"   # HTML must never be cached; contains stale JS references otherwise
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }
}

Notes:

  • MinIO console (9001) and Flower (5555) are not exposed through Caddy in production. VPN/bastion access only.
  • Static asset Cache-Control: immutable is safe only because Next.js content-hashes all filenames. HTML pages must use no-store to force browsers to re-fetch the latest JS bundle references after a deploy.
  • HTTP (port 80) is implicitly redirected to HTTPS by Caddy when a TLS block is present.
  • max-age=63072000 = 2 years; standard for HSTS preload submission.

34.3 WAF and DDoS Protection

SpaceCom's application-layer rate limiting (§7.7) is a mitigation for abusive authenticated clients, not a defence against volumetric DDoS or web application attacks. A dedicated WAF/DDoS layer is required at Tier 2+ production deployments.

Internet-facing deployments (cloud or hosted):

  • Deploy behind Cloudflare (free tier minimum; Pro tier for WAF rules) or AWS Shield Standard + AWS WAF
  • Cloudflare: enable DDoS protection, OWASP managed ruleset, Bot Fight Mode
  • Configure Caddy to only accept connections from Cloudflare IP ranges (Cloudflare publishes the range; verify with curl https://www.cloudflare.com/ips-v4)

Air-gapped / on-premise government deployments:

  • Customer's upstream network perimeter (firewall/IPS) provides the DDoS and WAF layer
  • Document the perimeter protection requirement in the customer deployment checklist (docs/runbooks/on-premise-deployment.md)
  • SpaceCom is not responsible for perimeter DDoS mitigation in customer-managed deployments; this is a contractual boundary that must be documented in the MSA

On-premise licence key enforcement (F6 — §68):

On-premise deployments run on customer infrastructure. Without a licence key mechanism, a customer could run additional instances, share the deployment, or continue operating after licence expiry.

Licence key design: A JWT signed with SpaceCom's RSA private key (2048-bit minimum). Claims:

{
  "sub": "<org_id>",
  "org_name": "Civil Aviation Authority of Australia",
  "contract_type": "on_premise",
  "valid_from": "2026-01-01T00:00:00Z",
  "valid_until": "2027-01-01T00:00:00Z",
  "features": ["operational_mode", "multi_ansp_coordination"],
  "max_users": 50,
  "iss": "spacecom.io",
  "iat": 1735689600
}

Enforcement: At startup, backend/app/main.py verifies the licence JWT using SpaceCom's public key (bundled in the Docker image). If validation fails or the licence has expired: the backend starts in licence-expired degraded mode — read-only access to historical data; no new predictions or alerts; all write endpoints return HTTP 402 Payment Required with {"error": "licence_expired", "contact": "commercial@spacecom.io"}. An hourly Celery Beat task re-validates the licence. If it expires mid-operation, running simulations complete but no new simulations are accepted after the check fires.

Key rotation: New licence JWT issued via scripts/generate_licence_key.py (requires SpaceCom private key, stored in HashiCorp Vault — never committed to the repository). Customer sets SPACECOM_LICENCE_KEY environment variable; container restart picks it up. SpaceCom's RSA public key is embedded in the Docker image at build time (/etc/spacecom/licence_pubkey.pem).
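A sketch of the startup check described above. In production the claims dict comes from a signature-verifying decode (e.g. PyJWT's `jwt.decode(token, pubkey, algorithms=["RS256"], issuer="spacecom.io")`); the custom valid_from/valid_until window is SpaceCom's own claim set, so it must be checked explicitly after signature verification. The function name is illustrative:

```python
from datetime import datetime, timezone

def licence_state(claims: dict, now: datetime) -> str:
    """Evaluate the custom validity-window claims of an already
    signature-verified licence JWT. Outside the window the backend enters
    licence-expired degraded mode (reads OK; writes return HTTP 402)."""
    valid_from = datetime.fromisoformat(claims["valid_from"].replace("Z", "+00:00"))
    valid_until = datetime.fromisoformat(claims["valid_until"].replace("Z", "+00:00"))
    if now < valid_from or now >= valid_until:
        return "licence_expired"
    return "licensed"
```

The same function serves the hourly Celery Beat re-validation: only the `now` argument changes between startup and periodic checks.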

CI/DAST complement: OWASP ZAP DAST (§21 Phase 2 DoD) tests the application layer; WAF covers infrastructure-layer attack patterns. Both are required — they cover different threat categories.


34.4 MinIO Object Storage Configuration

Erasure Coding (Tier 3)

4-node distributed MinIO uses EC:2 (2 data + 2 parity shards per erasure set):

# MinIO server startup command (each of 4 nodes runs the same command)
minio server \
  http://minio-1:9000/data \
  http://minio-2:9000/data \
  http://minio-3:9000/data \
  http://minio-4:9000/data \
  --console-address ":9001"

EC:2 on 4 nodes means:

  • Each object is split into 4 shards (2 data + 2 parity)
  • Read quorum: 2 shards (tolerates 2 simultaneous node failures for reads)
  • Write quorum: 3 shards (tolerates 1 simultaneous node failure for writes)
  • Usable capacity: 50% of raw total

ILM (Information Lifecycle Management) Policies

Configured via mc ilm add commands in docs/runbooks/minio-lifecycle.md:

| Bucket | Prefix | Transition after | Target |
|---|---|---|---|
| mc-blobs | (all) | 90 days | MinIO warm tier or S3-IA |
| pdf-reports | (all) | 365 days | S3 Glacier |
| notam-drafts | (all) | 365 days | S3 Glacier |
| db-wal-archive | (all) | 31 days | Delete (WAL older than 30 days not needed for point-in-time recovery) |

34.5 Backup Restore Test Verification Checklist

Monthly restore test procedure (executed by the restore_test Celery task; results logged to security_logs type RESTORE_TEST). A human engineer must verify all six items before marking the restore test as passed:

| # | Verification item | How to verify |
|---|---|---|
| 1 | Row count match | SELECT COUNT(*) FROM reentry_predictions on restored DB equals baseline count captured before backup |
| 2 | Latest record present | Most recent reentry_predictions.created_at in restored DB is within 5 minutes of the backup timestamp |
| 3 | HMAC spot-check | Run integrity.verify_prediction(id) on 5 randomly selected prediction IDs; all must return VALID |
| 4 | Append-only trigger functional | Attempt UPDATE reentry_predictions SET risk_level = 'LOW' WHERE id = <test_id>; must raise exception |
| 5 | Hypertable chunks intact | SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'orbits' matches expected chunk count for the backup date range |
| 6 | Foreign key integrity | pg_restore completed with 0 FK constraint violations (check restore log for ERROR: insert or update on table ... violates foreign key constraint) |

Restore test failures are treated as CRITICAL alerts. The restore test target DB (db-restore-test container) must be isolated from the production network (not attached to db_net).
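Verification item 3 (HMAC spot-check) reduces to recomputing the MAC over the canonical record body and comparing in constant time. A sketch — the field layout and canonicalisation are illustrative; the real implementation lives in backend/app/integrity.py:

```python
import hashlib
import hmac
import json

def verify_prediction(record: dict, key: bytes) -> bool:
    """Recompute the HMAC over the canonical record body (all fields except
    the stored MAC, serialised deterministically) and compare in constant
    time. Returns False for any tampered field."""
    body = {k: v for k, v in record.items() if k != "hmac"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["hmac"])
```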


34.6 Infrastructure Design Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Reverse proxy | Caddy | nginx + certbot | Caddy automatic ACME eliminates manual cert management; simpler config; native HTTP/2 and HTTP/3 |
| TLS air-gapped | Internal CA (step-ca) | Self-signed per-service | Internal CA allows cert chain trust; self-signed requires per-client exception management |
| WAF/DDoS | Upstream provider (Cloudflare/AWS Shield) | Application-layer rate limiting only | Volumetric DDoS bypasses application-layer; WAF covers OWASP attack patterns at network ingress |
| MinIO erasure coding | EC:2 on 4 nodes | EC:4 (higher parity) | EC:4 on 4 nodes would require 4-node write quorum; any single failure blocks writes; EC:2 balances protection and availability |
| Multi-region | Single region per jurisdiction | Active-active global cluster | Data sovereignty; compliance certification scope; Phase 13 customer base size doesn't justify multi-region operational complexity |
| DB connection target | PgBouncer VIP | Direct Patroni primary connection string | Application connection strings don't change during Patroni failover; stable operational target |
| Cold tier (MC blobs) | MinIO ILM warm → S3-IA | S3 Glacier | MC blobs may be replayed for Mode C visualisation; 12h Glacier restore latency is operationally unacceptable |
| Cold tier (compliance) | S3 Glacier / Deep Archive | Warm S3 | Compliance docs need 7-year retention but rare retrieval; Glacier cost is 80–90% lower than S3-IA |
| Egress filtering | Host-level UFW/nftables | Rely on Docker network isolation | Docker isolation is inter-network only; outbound internet egress must be filtered at host level |
| HSTS max-age | 63072000 (2 years) | 31536000 (1 year) | 2 years is the HSTS preload list minimum; aligns with standard hardening guides |

35. Performance Engineering

This section consolidates performance specifications, load test definitions, and scalability constraints across the system. For compression policy configuration see §9.4; for latency budget and pagination standard see §14; for WebSocket subscriber ceiling see §14; for renderer memory limits see §3 / §27.


35.1 Load Test Specification

Tool: k6 (preferred) or Locust. Scripts in tests/load/. Scenarios must be deterministic and reproducible on a freshly seeded database.

Scenario: CZML Catalog (Phase 1 baseline, Phase 3 SLO gate)

// tests/load/czml_catalog.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 20 },   // Ramp to 20 users
    { duration: '5m', target: 100 },  // Ramp to 100 users (SLO target)
    { duration: '5m', target: 100 },  // Sustain 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    'http_req_duration{endpoint:czml_full}':  ['p(95)<2000'],   // Phase 3 SLO
    'http_req_duration{endpoint:czml_delta}': ['p(95)<500'],    // Delta must be faster
    'http_req_failed': ['rate<0.01'],                           // < 1% error rate
  },
};

export default function () {
  // First load: full catalog
  const fullRes = http.get('/czml/objects', {
    tags: { endpoint: 'czml_full' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(fullRes, { 'full catalog 200': (r) => r.status === 200 });

  // Subsequent loads: delta
  const since = new Date(Date.now() - 60000).toISOString();
  const deltaRes = http.get(`/czml/objects?since=${since}`, {
    tags: { endpoint: 'czml_delta' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(deltaRes, { 'delta 200': (r) => r.status === 200 });

  sleep(5);  // Think time: user views globe for ~5s before next action
}

Scenario: MC Prediction Submission

// tests/load/mc_predict.js — tests concurrency gate
export const options = {
  vus: 10,           // 10 concurrent MC submissions from 5 orgs (2 per org)
  duration: '3m',
  thresholds: {
    'http_req_duration{endpoint:mc_submit}': ['p(95)<500'],
    // 429s are expected (concurrency gate) — not counted as failures
    'checks': ['rate>0.95'],
  },
};

Scenario: WebSocket Alert Delivery

// tests/load/ws_alerts.js — verifies < 30s delivery under load
// Opens 100 persistent WebSocket connections; triggers 10 synthetic alerts;
// measures time from alert POST to WS delivery on all 100 clients

Load test execution:

  • Phase 1: run czml_catalog scenario on Tier 1 dev hardware; record p95 baseline
  • Phase 2: run after each major migration; confirm no regression vs Phase 1 baseline
  • Phase 3: full suite (all three scenarios) on Tier 2 staging; all thresholds must pass before production deploy approval

Load test reports committed to docs/validation/load-test-report-phase{N}.md.


35.2 CZML Delta Protocol

The full CZML catalog grows proportionally with object count and time-step density. The delta protocol prevents repeat full-catalog downloads after initial page load.

Client responsibility:

  1. On page load: fetch GET /czml/objects (full catalog). Cache X-CZML-Timestamp response header as lastSync.
  2. Every 30s (or on reconnect): fetch GET /czml/objects?since=<lastSync>.
  3. On receipt of X-CZML-Full-Required: true: discard globe state and re-fetch full catalog.
  4. On receipt of HTTP 413: the server cannot serve the full catalog (too large); contact system admin.

Server responsibility:

  • Full response: include X-CZML-Timestamp: <server_time_iso8601> header.
  • Delta response: include only objects with updated_at > since. If since is more than 30 minutes ago, return X-CZML-Full-Required: true with an empty CZML body (client must re-fetch).
  • Maximum full payload: 5 MB. If estimated size exceeds limit, return HTTP 413 with {"error": "catalog_too_large", "use_delta": true}.

Prometheus metric: czml_delta_ratio = delta requests / (delta + full requests). Target: > 0.95 in steady state (95% of CZML requests are delta).
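The client responsibilities above collapse into a small per-poll decision table; a sketch (the action names are illustrative):

```python
def next_sync_action(status: int, headers: dict) -> str:
    """Map one poll response to the client action required by the delta
    protocol: 413 means the server refuses the full catalog; the
    full-required header forces a state reset; otherwise merge the delta."""
    if status == 413:
        return "catalog_too_large"   # stop polling; contact system admin
    if headers.get("X-CZML-Full-Required") == "true":
        return "refetch_full"        # discard globe state; re-fetch full catalog
    return "apply_delta"             # merge changed objects; advance lastSync
                                     # to the X-CZML-Timestamp header value
```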


35.3 Monte Carlo Concurrency Gate

Unbounded MC fan-out collapses SLOs when multiple users submit concurrent jobs. The concurrency gate is implemented as a per-organisation Redis semaphore:

# worker/tasks/decay.py

import redis
from celery import current_app

REDIS = redis.Redis.from_url(settings.REDIS_URL)
MC_SEMAPHORE_TTL = 600  # seconds; covers maximum expected MC duration + margin

def acquire_mc_slot(org_id: int, org_tier: str) -> bool:
    """Returns True if slot acquired, False if at capacity. Limit derived from subscription tier (F6)."""
    from app.modules.billing.tiers import get_mc_concurrency_limit
    limit = get_mc_concurrency_limit(org_tier)
    key = f"mc_running:{org_id}"
    pipe = REDIS.pipeline()
    pipe.incr(key)
    pipe.expire(key, MC_SEMAPHORE_TTL)
    count, _ = pipe.execute()
    if count > limit:
        REDIS.decr(key)
        return False
    return True

def release_mc_slot(org_id: int) -> None:
    key = f"mc_running:{org_id}"
    current = REDIS.get(key)
    if current and int(current) > 0:
        REDIS.decr(key)

API layer:

# backend/api/decay.py

@router.post("/decay/predict")
async def submit_decay(req: DecayRequest, user: User = Depends(current_user)):
    if not acquire_mc_slot(user.organisation_id, user.org_tier):  # organisation's subscription tier, not the user's role
        raise HTTPException(
            status_code=429,
            detail="MC concurrency limit reached for your organisation",
            headers={"Retry-After": "120"},
        )
    task = run_mc_decay_prediction.delay(...)
    return {"task_id": task.id}

The Celery chord callback (on_chord_done) calls release_mc_slot. A TTL of 600s ensures the slot is released even if the worker crashes mid-task.

Quota exhaustion logging (F6): When acquire_mc_slot returns False, before returning 429, the endpoint writes a usage_events row: event_type = 'mc_quota_exhausted'. This makes quota pressure visible to the org admin and to the SpaceCom sales team (via admin panel). The org admin's usage dashboard shows: predictions run this month, quota hits this month, and a prompt to upgrade if hits ≥ 3 in a billing period.


35.4 Query Plan Regression Gate

CI job: performance-regression (runs in staging pipeline after make migrate):

# scripts/check_query_baselines.py
"""
Runs EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each query in
docs/query-baselines/*.sql against the migrated staging DB.
Compares execution time to the baseline JSON stored in the same directory.
Fails with exit code 1 if any query exceeds 2× the recorded baseline.
Emits a GitHub PR comment with a comparison table.
"""

BASELINE_DIR = "docs/query-baselines"
THRESHOLD_MULTIPLIER = 2.0

queries = {
    "czml_catalog_100obj": "SELECT ...",         # from czml_catalog_100obj.sql
    "fir_intersection":    "SELECT ...",         # from fir_intersection.sql
    "prediction_list":     "SELECT ...",         # from prediction_list_cursor.sql
}

Baselines are JSON files containing {"planning_time_ms": N, "execution_time_ms": N, "recorded_at": "..."}. Updated manually after a deliberate schema change with a PR comment explaining the expected regression.
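The comparison step of check_query_baselines.py can be sketched as a pure function over the baseline and measured JSON (field names follow the baseline format above):

```python
THRESHOLD_MULTIPLIER = 2.0

def check_query(baseline: dict, measured: dict,
                multiplier: float = THRESHOLD_MULTIPLIER) -> list[str]:
    """Compare one query's EXPLAIN ANALYZE timings against its committed
    baseline. Returns violation messages; an empty list means the gate
    passes for this query."""
    violations = []
    for field in ("planning_time_ms", "execution_time_ms"):
        if measured[field] > baseline[field] * multiplier:
            violations.append(
                f"{field}: {measured[field]:.1f} ms exceeds {multiplier}x "
                f"baseline {baseline[field]:.1f} ms"
            )
    return violations
```

The CI script aggregates these lists across all baselined queries and exits non-zero if any list is non-empty.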


35.5 Renderer Container Constraints

The renderer service (Playwright + Chromium) is memory-intensive during print-resolution globe captures:

# docker-compose.yml (renderer service)
renderer:
  image: spacecom/renderer:sha-${GIT_SHA}
  mem_limit: 4g
  memswap_limit: 4g       # No swap; if OOM, container restarts cleanly
  networks: [renderer_net]
  environment:
    RENDERER_MAX_PAGES: "4"       # Maximum concurrent render jobs
    RENDERER_TIMEOUT_S: "30"      # Per-render timeout; matches §21 DoD
    RENDERER_MAX_RESOLUTION: "300dpi"

Renderer Prometheus metrics:

  • renderer_memory_usage_bytes — current RSS of Chromium process; alert at 3.5 GB (WARN before OOM)
  • renderer_jobs_active — concurrent in-flight renders; alert if > 3 for > 60s (capacity signal)
  • renderer_timeout_total — count of renders killed by timeout; alert if > 0 in a 5-min window

Maximum report constraints (enforced in worker/tasks/renderer.py):

  • Maximum report pages: 50
  • Maximum globe snapshot resolution: 300 DPI (A4 format)
  • Reports exceeding these limits are rejected at submission with HTTP 400

Renderer memory isolation and on-demand rationale (F8 — §65 FinOps):

The renderer is the second-most memory-intensive service after TimescaleDB. At Tier 2 it is allocated a dedicated c6i.xlarge (~$140/mo) or equivalent. Unlike simulation workers, the renderer is called infrequently — typically a few times per day when a duty manager requests a PDF briefing pack.

On-demand vs. always-on analysis:

| Approach | Benefit | Cost/risk | Decision |
|---|---|---|---|
| Always-on (current) | Zero latency to first render; Chromium warm | $140/mo even if 0 renders/day | Use at Tier 1–2 — cost is predictable; latency matters for interactive report requests |
| On-demand (start on request, stop after idle) | Saves $140/mo on lightly used deployments | 15–30s Chromium cold-start per report; complicates deployment | Consider at Tier 3 with HPA scale-to-zero on renderer_jobs_active if customer SLA permits a 30s wait |
| Shared with simulation worker | Saves dedicated instance | Chromium OOM risk during concurrent MC + render | Do not use — Chromium 2–4 GB footprint during render + MC worker memory = OOM on 32 GB nodes |

Memory isolation is non-negotiable: The renderer container is on an isolated Docker network (renderer_net) with no direct DB access and no simulation worker co-location. This is both a security boundary (§7, §35.5) and a memory isolation boundary. A runaway Chromium process will OOM its own container and restart cleanly without affecting simulation workers or the backend API.

Cost-saving lever (on-premise): For on-premise deployments where the renderer runs on the same physical server as simulation workers, monitor renderer_memory_usage_bytes + spacecom_simulation_worker_memory_bytes via Grafana. Add a combined alert renderer + workers > 80% host RAM to detect co-location pressure before OOM.


35.6 Static Asset CDN Strategy

CesiumJS uncompressed: ~8 MB. With gzip compression: ~2.5 MB. At 100 concurrent first-time users: ~250 MB outbound in a burst.

Internet-facing (Cloudflare):

  • All paths under /_next/static/* and /static/* are served with Cache-Control: public, max-age=31536000, immutable (1 year, immutable — Next.js uses content-hash filenames)
  • Caddy upstream caches are bypassed for these paths (Cloudflare edge is the cache)
  • CesiumJS assets: cache hit ratio target > 0.98 after warm-up

On-premise:

  • Deploy an nginx sidecar container (static-cache) on frontend_net serving the Next.js out/ or .next/static/ directory directly
  • Caddy routes /_next/static/* → static-cache:80 (bypasses Next.js server)
  • Configure in docs/runbooks/on-premise-deployment.md

Bundle size monitoring (CI):

# .github/workflows/ci.yml (bundle-size job)
- name: Check bundle size
  run: |
    npm run build 2>&1 | grep "First Load JS"
    # Fails if main bundle > previous + 10% (threshold stored in .bundle-size-baseline)
    node scripts/check_bundle_size.js

Baseline stored in .bundle-size-baseline at repo root (plain number in bytes). Updated manually with a PR comment when a deliberate size increase is approved.
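The gate logic is simple enough to sketch. The repository script is Node (scripts/check_bundle_size.js), so this Python version is only an illustration of the +10% rule against the stored baseline:

```python
from pathlib import Path

THRESHOLD = 0.10  # a build fails if it exceeds the baseline by more than +10%

def check_bundle_size(current_bytes: int, baseline_file: Path) -> bool:
    """Compare the current main-bundle size against .bundle-size-baseline
    (a plain byte count). Returns True if within the +10% threshold."""
    baseline = int(baseline_file.read_text().strip())
    limit = baseline * (1 + THRESHOLD)
    if current_bytes > limit:
        print(f"FAIL: bundle {current_bytes} B exceeds baseline {baseline} B by >10%")
        return False
    print(f"OK: bundle {current_bytes} B within +10% of baseline {baseline} B")
    return True
```

Because the baseline is a single committed number, a deliberate size increase is approved simply by updating the file in the same PR.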


35.7 Performance Engineering Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Load test tool | k6 | Locust, JMeter | k6 is script-based (TypeScript-friendly), CI-native, outputs Prometheus-compatible metrics; Locust requires a Python process; JMeter is XML-heavy |
| CZML delta | `?since=<iso8601>` server-side filter | Client-side WebSocket push of changed entities | Server-side filter is simpler and works with HTTP caching; push requires server to track per-client state |
| MC semaphore | Redis INCR/DECR with TTL | DB-level lock | Redis is already the Celery broker; DB-level lock adds latency on every MC submit; TTL prevents deadlock on worker crash |
| Pagination | Cursor (created_at, id) | Keyset on single column | Single-column keyset has ties at same created_at (batch ingest); compound key is unique and stable |
| Query regression gate | EXPLAIN (ANALYZE, BUFFERS) JSON baseline | pg_stat_statements | EXPLAIN is deterministic per run on a warm buffer; pg_stat_statements averages across all historic executions and requires prod traffic to populate |
| Renderer memory cap | 4 GB Docker mem_limit | ulimit in container | Docker mem_limit is enforced by the kernel cgroup; ulimit only applies to the shell process, not Chromium subprocesses |
| Bundle size gate | +10% threshold vs. stored baseline | Absolute byte limit | Percentage is proportional to current size; absolute limits become irrelevant as bundles grow or shrink |

36. Security Architecture — Red Team / Adversarial Review

This section records the findings of an adversarial review against the §7 security architecture. Where findings were resolved by updating existing sections (§7.2, §7.3, §7.4, §7.9, §7.10, §7.11, §7.12, §7.14, §9.2), this section provides the finding rationale and cross-reference for traceability.

36.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | HMAC key rotation has no path through the immutability trigger | §7.9 — HMAC Key Rotation Procedure | Critical |
| 2 | Pre-signed MinIO URLs unscoped and unproxied for MC blobs | §7.10 — MinIO Bucket Policies | High |
| 3 | Celery task arguments not validated at the task layer | §7.12 — Compute Resource Governance | High |
| 4 | Playwright renderer SSRF mitigation incomplete | §7.11 — request interception allowlist | High |
| 5 | Refresh token theft: no family reuse detection | §7.3 + §9.2 refresh_tokens schema | High |
| 6 | Admin role elevation with no four-eyes approval | §7.2 + pending_role_changes table | High |
| 7 | Security events logged but no human alert matrix | §7.14 — security alerting matrix | Medium |
| 8 | Space-Track credential rotation has no ingest-gap spec | §7.14 — rotation runbook cross-reference | Medium |
| 9 | Shadow mode segregation application-layer only | §7.2 — shadow_segregation RLS policy | High |
| 10 | NOTAM draft content not sanitised — injection path | §7.4 — sanitise_icao() function | High |
| 11 | Supply chain posture not fully specified | §7.13 — already fully covered; no gap found | N/A |

36.2 Attack Paths Considered

The following attack paths were evaluated in this review:

Insider threat paths:

  • Compromised admin account silently elevating a backdoor account → mitigated by four-eyes approval (Finding 6)
  • Admin with access to the HMAC rotation script replacing legitimate predictions with forged ones → mitigated by dual sign-off + rotated_by audit trail (Finding 1)
  • ANSP operator sharing a pre-signed report URL with an external party → mitigated by 5-minute TTL + audit log (Finding 2)

Compromised worker paths:

  • Compromised ingest_worker (shares worker_net with Redis) writing crafted Celery task args → mitigated by task-layer validation (Finding 3)
  • Compromised worker exfiltrating simulation trajectory URLs → mitigated by server-side MC blob proxy (Finding 2)
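The task-layer mitigation for crafted Celery arguments can be sketched as a guard that runs inside the worker task body, so arguments injected directly into Redis are still checked before any compute happens. The function name, bounds, and argument list below are illustrative, not the actual §7.12 limits:

```python
# Hypothetical bounds for illustration; the governed limits live in §7.12.
MAX_MC_ITERATIONS = 10_000
MAX_HORIZON_DAYS = 120

class TaskArgumentError(ValueError):
    """Raised when a task receives arguments outside the governed range."""

def validate_mc_args(object_id: int, iterations: int, horizon_days: int) -> None:
    """Runs inside the worker, not the producer: a compromised producer that
    pushes crafted args straight into the Redis broker is still rejected."""
    if not isinstance(object_id, int) or object_id <= 0:
        raise TaskArgumentError(f"object_id must be a positive int, got {object_id!r}")
    if not 1 <= iterations <= MAX_MC_ITERATIONS:
        raise TaskArgumentError(f"iterations out of range: {iterations}")
    if not 1 <= horizon_days <= MAX_HORIZON_DAYS:
        raise TaskArgumentError(f"horizon_days out of range: {horizon_days}")
```

In the real codebase this guard would be the first statement of the Celery task function, before any resource allocation.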

Authentication/session paths:

  • Refresh token exfiltration + replay before legitimate client retries → mitigated by family reuse detection + full-family revocation (Finding 5)
  • Compromised admin credential creating backdoor admin → mitigated by four-eyes principle (Finding 6)

Renderer SSRF paths:

  • Bug causing renderer to navigate to a crafted URL → mitigated by Playwright request interception allowlist (Finding 4)
  • Report ID injection → mitigated by integer validation + hardcoded URL construction (Finding 4)
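The interception allowlist reduces to a default-deny predicate over the target URL; in the real renderer this predicate would be installed via Playwright route interception (page.route). The host names here are illustrative assumptions:

```python
from urllib.parse import urlsplit

# Illustrative allowlist: the real list enumerates only the origins a report
# page legitimately loads from (frontend, static assets).
ALLOWED_HOSTS = {"frontend", "static-cache"}

def is_allowed(url: str) -> bool:
    """Default-deny: anything not explicitly allowlisted is blocked, which
    covers cloud metadata endpoints and internal admin APIs without having
    to enumerate them as a denylist."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        return False
    return parts.hostname in ALLOWED_HOSTS
```

A denylist of known-bad targets (169.254.169.254, internal hosts) would be fragile; the allowlist fails closed when a new internal service is added.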

Data integrity paths:

  • Shadow prediction leaking into operational response via query bug → mitigated by RLS shadow_segregation policy (Finding 9)
  • NOTAM draft XSS → Playwright PDF renderer execution → mitigated by sanitise_icao() + Jinja2 autoescape (Finding 10)
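For the NOTAM injection path, a plausible shape for the sanitiser is a whitelist character class, so markup can never survive into the Jinja2 template. This is a sketch only; the authoritative sanitise_icao() is specified in §7.4:

```python
import re

# Conservative whitelist: uppercase letters, digits, space, and the few
# punctuation marks that appear in ICAO NOTAM (E) text. Everything else
# (including < and >) is stripped rather than escaped.
_DISALLOWED = re.compile(r"[^A-Z0-9 /\-\.\(\)]")

def sanitise_icao(text: str) -> str:
    """Uppercase the input and strip any character outside the whitelist."""
    return _DISALLOWED.sub("", text.upper()).strip()
```

Stripping (rather than escaping) is the safer choice here because the output is destined for a fixed-field teleprinter format, not HTML.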

Credential rotation paths:

  • HMAC key compromise: attacker forges predictions → mitigated by rotation procedure with hmac_admin role isolation (Finding 1)
  • Space-Track credential rotation creates an undetected ingest gap → mitigated by 10-minute verification step in runbook (Finding 8)

36.3 Security Architecture ADRs

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0007-hmac-rotation-procedure.md | HMAC key rotation with parameterised immutability trigger | hmac_admin role + SET LOCAL spacecom.hmac_rotation flag; dual sign-off required |
| docs/adr/0008-admin-four-eyes.md | Admin role elevation requires four-eyes approval | pending_role_changes table; 30-minute token; second admin must approve |
| docs/adr/0009-shadow-mode-rls.md | Shadow mode segregated at RLS layer, not application layer | shadow_segregation RLS policy; spacecom.include_shadow session variable; admin-only |
| docs/adr/0010-refresh-token-families.md | Refresh token family reuse detection | family_id column; full family revocation on reuse; user email alert |
| docs/adr/0011-mc-blob-proxy.md | MC trajectory blobs proxied server-side, not pre-signed URL | GET /viz/mc-trajectories/{id} backend proxy; MinIO URLs never exposed to browser |
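The ADR 0010 reuse rule can be illustrated with a minimal in-memory sketch; the production store is the refresh_tokens table in §9.2, and the names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TokenRecord:
    family_id: str
    superseded: bool = False
    revoked: bool = False

class RefreshTokenStore:
    """Sketch of the family reuse rule: presenting a superseded token is
    treated as theft, and every token in that family is revoked, including
    the one the attacker (or legitimate client) currently holds."""
    def __init__(self) -> None:
        self.tokens: dict[str, TokenRecord] = {}

    def rotate(self, old: str, new: str) -> None:
        rec = self.tokens[old]
        rec.superseded = True  # old token stays on record to detect reuse
        self.tokens[new] = TokenRecord(family_id=rec.family_id)

    def redeem(self, token: str) -> bool:
        rec = self.tokens.get(token)
        if rec is None or rec.revoked:
            return False
        if rec.superseded:
            # Reuse detected: revoke the whole family.
            for t in self.tokens.values():
                if t.family_id == rec.family_id:
                    t.revoked = True
            return False
        return True
```

Full-family revocation is what makes the scheme robust: whichever of the two parties (attacker or legitimate user) replays the stale token, both sessions die and the user is alerted.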

36.4 Penetration Test Scope (Phase 3)

The Phase 3 external penetration test (referenced in §7.15) must include the following adversarial scenarios derived from this review:

  1. HMAC rotation bypass — attempt to forge a prediction record by exploiting the immutability trigger with and without the hmac_admin role
  2. Pre-signed URL exfiltration — verify that MC blob URLs are not present in any browser-side response; verify pre-signed report URLs cannot be used after 5 minutes
  3. Celery task injection — attempt to enqueue tasks with out-of-range arguments directly via Redis; verify the task validates and rejects them
  4. Playwright SSRF — attempt to trigger renderer navigation to http://169.254.169.254/ (AWS metadata) or http://backend:8000/internal/admin; verify interception blocks both
  5. Refresh token theft simulation — replay a superseded refresh token; verify full family revocation and email alert
  6. Admin privilege escalation — attempt to elevate a viewer account to admin via a single compromised admin account without the four-eyes approval token; verify the attempt is blocked and logged
  7. Shadow mode leak — query GET /decay/predictions as viewer; inject a shadow prediction directly at the DB layer; verify the API response never returns it
  8. NOTAM injection — submit an object with a name containing <script>alert(1)</script> via POST /objects; generate a NOTAM draft; verify PDF render does not execute script

36.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| HMAC rotation trigger | Parameterised SET LOCAL flag scoped to hmac_admin role | Separate migration to drop and recreate trigger | SET LOCAL is session-scoped; cannot be set by application role; minimises window of bypass |
| Family reuse detection | Full family revocation on superseded token reuse | Single token revocation | Full revocation is the only action that guarantees the attacker's session is destroyed even if the legitimate user doesn't notice |
| MC blob delivery | Server-side proxy endpoint | Pre-signed MinIO URL with short TTL | Pre-signed URLs can be shared or logged in browser history; server-side proxy enforces org scoping on every request |
| Admin four-eyes | Email approval token with 30-minute window | Yubikey hardware confirmation | Email approval is achievable without additional hardware; 30-minute window prevents indefinite pending states |
| Shadow RLS | PostgreSQL RLS policy | Application-layer WHERE shadow_mode = FALSE | RLS is enforced by the database engine regardless of query construction; application-layer filters can be omitted by bugs or direct DB queries |

37. Aviation Regulatory / ATM Compliance Review

This section records findings from an ATM systems engineering review against the ICAO/EUROCONTROL regulatory environment that governs ANSP customers. Findings were incorporated into §6.13 (NOTAM format), §6.14 (shadow exit), §6.17 (multi-ANSP panel), §11 (data sources / airspace scope), §16 (prediction conflict), §21 Phase 2 DoD, §27.4 (safety record retention), and §9.2 (schema additions).

37.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | Regulatory classification (EASA IR 2017/373 position) unresolved | §21 Phase 2 DoD + ADR 0012 | Critical |
| 2 | NOTAM format non-compliant with ICAO Annex 15 field formatting | §6.13 — field mapping table, Q-line, YYMMDDHHmm timestamps | High |
| 3 | Re-entry window → NOTAM (B)/(C) mapping not specified | §6.13 — p10 − 30 min / p90 + 30 min rule + cancellation urgency | High |
| 4 | FIR scope excludes SUA, TMAs, oceanic — undisclosed | §11 — airspace scope disclosure; ADR 0014 | Medium |
| 5 | Multi-ANSP coordination panel has no authority/precedence spec | §6.17 — advisory-only banner, retention, WebSocket SLA | Medium |
| 6 | Shadow mode exit criteria not specified | §6.14 — exit criteria table, exit report template | High |
| 7 | Degraded mode disclosure insufficient for ANSP operational use | §9.2 degraded_mode_events table; §14 GET /readyz schema; NOTAM (E) injection | High |
| 8 | GDPR DPA must be signed before shadow mode begins, not Phase 3 | §21 Phase 2 DoD legal gate | High |
| 9 | ESA DISCOS redistribution rights unaddressed | §11 — redistribution rights requirement; §21 Phase 2 DoD | High |
| 10 | Multi-source prediction conflict resolution not specified | §16 — conflict resolution rules; prediction_conflict schema columns | High |
| 11 | Safety-relevant records have no distinct retention policy | §27.4 — safety_record flag; 5-year safety category | Medium |

37.2 Regulatory Framework References

| Framework | Relevance | Position taken |
|---|---|---|
| EASA IR (EU) 2017/373 | Requirements for ATM/ANS providers; may apply if ANSP integrates SpaceCom into operational workflow | Position A: advisory tool; not ATM/ANS provision — documented in ADR 0012 |
| ICAO Annex 15 (AIS) + Appendix 6 | NOTAM format specification | NOTAM drafts now comply with Annex 15 field formatting (§6.13) |
| ICAO Annex 11 (ATS) §2.26 | ATC record retention recommendation | Safety records retained ≥ 5 years (§27.4) |
| ICAO Doc 8400 | ICAO abbreviations and codes used in NOTAM (E) field | sanitise_icao() uses Doc 8400 abbreviation list |
| EUROCONTROL OPADD | Operating Procedures for AIS Dynamic Data; EUR regional NOTAM practice | Q-line format and series conventions follow OPADD (§6.13) |
| GDPR Article 28 | Data processor obligations when processing ANSP staff personal data | DPA must be signed before any ANSP data processing, including shadow mode |
| UN Liability Convention 1972 | 7-year record retention for space object liability claims | reentry_predictions, alert_events retained 7 years (§27.4) |

37.3 Regulatory ADRs

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0012-regulatory-classification.md | EASA IR 2017/373 position | Position A: ATM/ANS Support Tool; decision support only; not ATM/ANS provision; written ANSP agreements required |
| docs/adr/0013-notam-format.md | ICAO Annex 15 NOTAM field compliance | Field mapping table; YYMMDDHHmm timestamps; Q-line QWELW; (B) = p10 − 30 min; (C) = p90 + 30 min |
| docs/adr/0014-airspace-scope.md | Phase 2 airspace data scope | FIR/UIR only (ECAC + US); SUA/TMA/oceanic explicitly out of scope; disclosed in UI; Phase 3 SUA consideration |
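The ADR 0013 window mapping is mechanical once the percentile datetimes are known; a minimal sketch, assuming UTC-aware inputs:

```python
from datetime import datetime, timedelta, timezone

PAD = timedelta(minutes=30)  # operational padding per ADR 0013

def notam_timestamp(dt: datetime) -> str:
    """Format a UTC datetime as the ICAO Annex 15 ten-digit group YYMMDDHHmm."""
    return dt.astimezone(timezone.utc).strftime("%y%m%d%H%M")

def notam_window(p10: datetime, p90: datetime) -> tuple[str, str]:
    """Map the statistical re-entry window to NOTAM fields:
    (B) = p10 - 30 min, (C) = p90 + 30 min."""
    return notam_timestamp(p10 - PAD), notam_timestamp(p90 + PAD)
```

Using p10/p90 rather than a symmetric band around p50 means the padded window tracks the actual (often asymmetric) uncertainty distribution.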

37.4 Compliance Checklist (Phase 2 Gate)

Before the first ANSP shadow deployment:

  • docs/adr/0012-regulatory-classification.md committed and reviewed by aviation law counsel
  • NOTAM draft generator produces ICAO-compliant output (unit test passes Q-line regex and YYMMDDHHmm field checks)
  • Airspace scope disclosure note present in Airspace Impact Panel (Playwright test verifies text)
  • Multi-ANSP coordination advisory-only banner present in panel (Playwright test verifies text)
  • degraded_mode_events table active; transitions logged; GET /readyz response includes degraded_since
  • NOTAM draft (E) field injects degraded-state warning when generated_during_degraded = TRUE (integration test)
  • DPA signed with each ANSP shadow partner; DPA template reviewed by counsel
  • ESA DISCOS redistribution rights clarified in writing; API/report templates updated if required
  • prediction_conflict flag operational; Event Detail page shows ⚠ PREDICTION CONFLICT when set
  • Safety record retention policy active: safety_record = TRUE records excluded from TimescaleDB drop; degraded_mode_events retained 5 years
  • Shadow mode exit report template (docs/templates/shadow-mode-exit-report.md) exists and Persona B can generate statistics from admin panel

37.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Regulatory classification | Position A — advisory, non-safety-critical ATM/ANS Support Tool | Position B — Functional System under IR 2017/373 | Position B would require ED-78A system safety assessment, ATCO HMI compliance, and EASA change management — disproportionate for a decision-support tool where a human verifies all outputs before acting |
| NOTAM timestamp format | YYMMDDHHmm (ICAO Annex 15 §5.1.2) | ISO 8601 YYYY-MM-DDTHH:mmZ | ICAO Annex 15 is unambiguous; ISO 8601 would require the NOTAM office to reformat before issuance |
| NOTAM window mapping | (B) = p10 − 30 min; (C) = p90 + 30 min | (B) = p50 − 3 h; (C) = p50 + 3 h | p10/p90 are the actual statistical bounds; symmetric windows around p50 ignore the often-asymmetric uncertainty distribution |
| Degraded NOTAM warning | Machine-inserted line in (E) field | UI-only warning on the draft page | The (E) field is what the NOTAM office receives; a UI-only warning is lost when the draft is copied to the NOTAM office's system |
| Multi-source conflict | Union of windows when non-overlapping | SpaceCom window always primary regardless | ICAO most-conservative principle; ANSPs must be protected against the case where SpaceCom is wrong and TIP is right |
| Safety record retention | safety_record flag on row; excluded from drop policy | Separate table for safety records | Flag approach avoids data duplication and works with TimescaleDB chunk-level policies; excluded records stay in the same hypertable partition for query performance |

38. Orbital Mechanics / Astrodynamics Review

This section records findings from an astrodynamics specialist review of the physics specification. Findings were incorporated into §15.1 (SGP4 validity gates), §15.2 (NRLMSISE-00 inputs, MC uncertainty model, SRP, integrator config), §15.3 (breakup altitude trigger, material survivability), §15.4 (new — corridor generation algorithm), §15.5 (new — Pc computation method), §17.1 (committed test vectors), §31.1 (BSTAR validation), and the objects/space_weather schema in §9.

38.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | SGP4 validity limits not enforced at query time | §15.1 — epoch age gates, perigee < 200 km routing | High |
| 2 | NRLMSISE-00 input vector under-specified | §15.2 — f107_prior_day, ap_3h_history, Ap vs Kp | High |
| 3 | Ballistic coefficient uncertainty model not specified | §15.2 — C_D/A/m sampling distributions; objects schema | High |
| 4 | Corridor generation algorithm not specified | §15.4 (new) — alpha-shape, 50 km buffer, ≤ 1000 vertices | High |
| 5 | Breakup altitude trigger not specified | §15.3 — 78 km trigger, NASA SBM, material survivability | High |
| 6 | Frame transformation test vectors not committed | §17.1 — 3 required JSON files; fail-not-skip test pattern | Medium |
| 7 | Solar radiation pressure absent from decay predictor | §15.2 — cannonball SRP model, cr_coefficient column | Medium |
| 8 | Pc computation method not specified | §15.5 (new) — Alfano 2D Gaussian, TLE differencing covariance | Medium |
| 9 | Integrator tolerances and stopping criterion not specified | §15.2 — atol=1e-9, rtol=1e-9, max_step=60s, 120-day cap | High |
| 10 | BSTAR validation range excludes valid high-density objects | §31.1 — removed lower floor; warn-not-reject for B* > 0.5 | Medium |
| 11 | NRLMSISE-00 altitude limit and storm handling not specified | §15.2 — 800 km OOD boundary; Kp > 5 storm flag | Medium |

38.2 Physics Model Decisions

| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Catalog propagator | SGP4 (sgp4 library) | SP (Special Perturbations) via GMAT | SGP4 is the standard for TLE-based catalog propagation; SP requires a full state vector with covariance — not available from TLEs |
| Decay integrator | DOP853 (RK7/8 adaptive) | RK4 fixed step | DOP853 has embedded error control; RK4 fixed step requires manual step-size management and may miss density variations near perigee |
| Atmospheric model | NRLMSISE-00 | JB2008 (Jacchia-Bowman 2008) | NRLMSISE-00 is well-validated, open-source, and widely used in community tools; JB2008 is more accurate during storms but requires additional data inputs not yet in scope |
| Corridor shape | Alpha-shape (concave hull) | Convex hull | Convex hull overestimates corridor width by 2–5× for elongated re-entry ground tracks; alpha-shape produces tighter, more operationally useful polygons |
| C_D sampling | Uniform(2.0, 2.4) | Fixed value 2.2 | Uniform sampling covers the credible range without assuming a specific distribution; fixed value understates uncertainty |
| SRP model | Cannonball (scalar) | Panelled model | Cannonball model is standard for non-cooperative objects; panelled model requires detailed attitude and geometry data unavailable for most catalog objects |
| Pc method | Alfano 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and the community standard; Monte Carlo Pc added as a Phase 3 consideration for high-Pc events |
| BSTAR lower bound | No lower bound (reject ≤ 0 only) | 0.0001 lower bound | Dense objects (tungsten, stainless steel tanks) can have B* << 0.0001; the previous lower bound would silently reject valid high-density object TLEs |
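The C_D sampling decision above amounts to one uniform draw per Monte Carlo iteration. A sketch of a single draw follows; the A and m spreads shown are illustrative placeholders, since the actual per-object-class distributions are specified in §15.2:

```python
import random

def sample_ballistic_inputs(a_nominal: float, m_nominal: float,
                            rng: random.Random) -> tuple[float, float, float, float]:
    """One Monte Carlo draw of the drag inputs.
    C_D ~ Uniform(2.0, 2.4) per the decision log; the area and mass spreads
    below are assumed for illustration only."""
    c_d = rng.uniform(2.0, 2.4)
    area = a_nominal * rng.uniform(0.9, 1.1)    # assumed +/-10% spread
    mass = m_nominal * rng.uniform(0.95, 1.05)  # assumed +/-5% spread
    b_coeff = c_d * area / mass  # ballistic parameter C_D*A/m (m^2/kg)
    return c_d, area, mass, b_coeff
```

Passing an explicit random.Random keeps each MC run reproducible from its seed, which matters for the replay validation cases.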

38.3 Model Card Additions Required

The following items must be added to docs/model-card-decay-predictor.md:

  • Breakup altitude rationale: 78 km trigger; reference to NASA Debris Assessment Software range (75–80 km for Al structures)
  • Monte Carlo uncertainty model: C_D, A, m sampling distributions and their justifications
  • SRP significance: conditions under which SRP > 5% of drag (area-to-mass > 0.01 m²/kg, altitude > 500 km)
  • NRLMSISE-00 altitude scope: validated 150–800 km; OOD flag above 800 km
  • Geomagnetic storm sensitivity: Kp > 5 triggers storm-period sampling; prediction uncertainty is elevated
  • Corridor generation algorithm: alpha-shape with α = 0.1°, 50 km buffer; reference to alpha-shape literature
  • Pc computation: Alfano 2D Gaussian; TLE differencing covariance; quality flag when < 3 TLEs available
  • SGP4 validity limits: 7-day degraded, 14-day unreliable, 200 km perigee routing to decay predictor
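The SGP4 validity limits in the last item reduce to a small routing function; a sketch, assuming epoch age and perigee are computed upstream:

```python
def sgp4_validity(epoch_age_days: float, perigee_km: float) -> str:
    """Gate per §15.1 (sketch): low-perigee objects are routed to the
    numerical decay predictor; otherwise the TLE epoch age flags the
    propagation as ok, degraded (> 7 days), or unreliable (> 14 days)."""
    if perigee_km < 200.0:
        return "route_to_decay_predictor"
    if epoch_age_days > 14.0:
        return "unreliable"
    if epoch_age_days > 7.0:
        return "degraded"
    return "ok"
```

Routing the perigee check first matters: an object already below 200 km is outside SGP4's drag-model assumptions regardless of how fresh its TLE is.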

38.4 Validation Test Vector Requirements

| File | Required before | Blocking if absent |
|---|---|---|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | Any frame transform code merged | Yes — test fails hard |
| docs/validation/reference-data/sgp4_propagation_cases.json | SGP4 propagator merged | Yes |
| docs/validation/reference-data/iers_eop_case.json | IERS EOP application merged | Yes |
| docs/validation/reference-data/nrlmsise00_density_cases.json | Decay predictor merged | Yes — referenced in §17.3 |
| docs/validation/reference-data/aerospace-corp-reentries.json | Phase 1 backcast validation | Yes for Phase 2 gate |
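"Blocking if absent" means the fail-not-skip pattern: a missing vector file raises a hard failure rather than a pytest skip, so a broken checkout can never silently pass the validation suite. A minimal loader sketch (function name assumed):

```python
import json
from pathlib import Path

def load_reference_vectors(path: str) -> list:
    """Load committed test vectors; a missing file is a hard test failure.
    Deliberately raises instead of calling pytest.skip so CI goes red."""
    p = Path(path)
    if not p.exists():
        raise AssertionError(
            f"Reference data {path} is missing; commit the test vectors "
            "before merging. This test must fail, not skip."
        )
    return json.loads(p.read_text())
```

Any validation test then starts with this loader, inheriting the fail-hard behaviour without per-test boilerplate.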

39. API Design / Developer Experience Review

This section records findings from a senior API design review. Findings were incorporated into §9.2 (new jobs and idempotency_keys tables; expanded api_keys schema), §14 (canonical pagination envelope, error schema, rate limit 429 body, async job lifecycle, ephemeris validation, WebSocket token refresh, WebSocket protocol versioning, field naming convention, GET /readyz in OpenAPI, API key auth model).

39.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | Pagination envelope not canonical across endpoints | §14 — PaginatedResponse[T], data key, total_count: null | High |
| 2 | Error response shape inconsistent; no error code registry | §14 — SpaceComError base, RequestValidationError override, registry table | High |
| 3 | Async job lifecycle for POST /decay/predict not specified | §14 — 202 response, /jobs/{id} endpoint; §9.2 — jobs table | High |
| 4 | WebSocket token refresh path not specified | §14 — TOKEN_EXPIRY_WARNING, AUTH_REFRESH, close codes 4001/4002 | High |
| 5 | Idempotency keys not specified for mutation endpoints | §14 — idempotency spec; §9.2 — idempotency_keys table | Medium |
| 6 | 429 missing Retry-After header and structured body | §14 — retryAfterSeconds body field, Retry-After header spec | Medium |
| 7 | Ephemeris endpoint lacks time range and step validation | §14 — 4-row validation table with error codes | Medium |
| 8 | WebSocket protocol versioning not specified | §14 — ?protocol_version=N, deprecation warning event, sunset close code | Medium |
| 9 | Field naming convention not decided | §14 — APIModel base class, alias_generator=to_camel | Medium |
| 10 | GET /readyz not in OpenAPI spec | §14 — tags=["System"] decorated endpoint | Low |
| 11 | API key auth model, rate limits, and scope not specified | §14 — apikey_ prefix, independent buckets, allowed_endpoints scope | High |

39.2 Developer Experience Contracts

The following contracts are enforced by CI and must not be broken without an ADR:

| Contract | Enforcement |
|---|---|
| All list endpoints return {"data": [...], "pagination": {...}} | OpenAPI CI check: list-tagged endpoints validated against PaginatedResponse schema |
| All errors return {"error": "...", "message": "...", "requestId": "..."} | AST/grep CI check: HTTPException and JSONResponse must reference registry codes |
| POST endpoints returning async jobs return 202 with statusUrl | OpenAPI CI check: endpoints tagged async validated for 202 response schema |
| 429 responses include Retry-After header | Integration test: rate-limited request asserts Retry-After header present |
| Idempotency-Key header documented for mutation endpoints | OpenAPI CI check: endpoints tagged mutation declare the header parameter |
| GET /readyz is in the OpenAPI spec | Schema validation: readyz path present in generated openapi.json |
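The cursor behind the pagination contract can be sketched as a base64url wrapper over the compound (created_at, id) key; the exact payload layout below is an assumption, the point is that the cursor is opaque to clients:

```python
import base64
import json
from datetime import datetime, timezone

def encode_cursor(created_at: datetime, row_id: int) -> str:
    """Opaque compound cursor over (created_at, id). base64url keeps it
    URL-safe and discourages clients from parsing or constructing it."""
    payload = json.dumps([created_at.isoformat(), row_id])
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> tuple[datetime, int]:
    ts, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(ts), int(row_id)
```

The compound key is what makes the cursor stable under batch ingest: ties on created_at are broken by id, so no row is skipped or repeated across pages.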

39.3 New Endpoints Added

| Endpoint | Role | Purpose |
|---|---|---|
| GET /jobs/{job_id} | viewer (own jobs only) | Poll async job status; returns resultUrl on completion |
| DELETE /jobs/{job_id} | viewer (own jobs only) | Cancel a queued job (no effect if already running) |

39.4 New API Guide Documents Required

| Document | Content |
|---|---|
| docs/api-guide/conventions.md | camelCase rule, APIModel base class, error envelope, request ID tracing |
| docs/api-guide/pagination.md | Cursor encoding, total_count: null rationale, empty result shape |
| docs/api-guide/error-reference.md | Canonical error code registry with HTTP status, description, recovery action |
| docs/api-guide/idempotency.md | Idempotency key protocol, 24h TTL, replay header, in-progress behaviour |
| docs/api-guide/async-jobs.md | Job lifecycle, WebSocket vs polling, recommended poll interval |
| docs/api-guide/websocket-protocol.md | Protocol version history, token refresh flow, close codes, reconnection |
| docs/api-guide/api-keys.md | Key creation, apikey_ prefix, scope, independent rate limits |

39.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Pagination key | data | items, results | data is the most common convention (JSON:API, GitHub API, Stripe); items is ambiguous with Python iterables |
| total_count | Always null | Compute count on every list request | COUNT(*) on a 7-year-retention hypertable can be a full scan; cursor pagination does not need count; document the omission |
| Error base model | SpaceComError with requestId | Per-endpoint error types | Uniform shape allows generic client error handling; requestId enables log correlation without exposing internals |
| Field naming | camelCase via alias_generator | snake_case (Python default) | Frontend and API consumer convention is camelCase; populate_by_name=True keeps internal code readable |
| Async job surface | /jobs/{id} unified endpoint | Per-type endpoints (/decay/{id}, /reports/{id}) | Unified job surface simplifies client polling logic; type-specific result URLs are returned in resultUrl field |
| WebSocket close codes | 4001 auth expiry, 4002 protocol deprecated | Generic 1008 for all auth failures | Application-specific close codes enable clients to take the correct action (refresh token vs. upgrade protocol) without scraping close reason text |
| Idempotency TTL | 24 hours | 1 hour, 7 days | 24 hours covers retry windows caused by network outages, client restarts, and overnight batch jobs; longer risks unbounded table growth |
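The alias generator behind the field-naming decision is a one-line transform; pydantic v2 ships an equivalent as pydantic.alias_generators.to_camel, shown here standalone for clarity:

```python
def to_camel(name: str) -> str:
    """snake_case -> camelCase, the alias_generator convention in §14.
    The first segment stays lowercase; subsequent segments are capitalised."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)
```

With populate_by_name=True on the model config, internal code keeps snake_case attribute access while serialised payloads expose the camelCase aliases.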

40. Commercial Strategy Review

SpaceCom is a standalone commercial product. Institutional procurements (ESA STAR #182213 and similar) are market opportunities pursued with existing capabilities — the product is not built to suit any single bid. This section records findings from a commercial strategy review; incorporations are in the product and architecture sections, not in bid-specific requirements.

40.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | ESA bid requirements not mapped to plan | Scoped as per-bid process only — docs/bid/ created per procurement opportunity, not a structural plan requirement | Critical (clarified) |
| 2 | Zero Debris Charter compliance output format not specified | §6 — Controlled Re-entry Planner compliance report spec, Pc_ground, compliance_report_url | High |
| 3 | No commercial tier structure | §9.2 — subscription_tier, subscription_status on organisations; tier table defined | High |
| 4 | Competitive differentiation not anchored to maintained capabilities | §23.4 — maintained capabilities table; docs/competitive-analysis.md quarterly review | Medium |
| 5 | Shadow trial-to-operational conversion not specified | §6.14 — conversion path, offer package, subscription_status transitions, 2-concurrent-deployment cap | High |
| 6 | Delivery schedule vs. procurement milestones | Light touch: per-procurement milestone reconciliation doc created at bid time; not a structural plan requirement | High (scoped) |
| 7 | No customer-facing SLA | §26.1 — SLA schedule table in MSA; measurement methodology; service credits | High |
| 8 | Data residency requirements not addressed | §29.5 — EU default hosting; on-premise option; hosting_jurisdiction column; subprocessor disclosure | High |
| 9 | Space-Track AUP conditional architecture not specified | §11 — Path A/B conditional architecture; ADR 0016; Phase 1 architectural decision gate | High |
| 10 | No Acceptance Test Procedure specification | §21 Phase 3 DoD — ATP requirement; independent evaluator; docs/bid/acceptance-test-procedure.md | Medium |
| 11 | Go-to-market sequence not validated against resource constraints | §6.14 — 2-concurrent-shadow cap; integration lead assignment; onboarding package spec | Medium |

40.2 Commercial Tier Structure

| Tier | Customer | Feature access | Pricing model |
|---|---|---|---|
| Shadow Trial | ANSP (pre-commercial) | Full aviation portal; shadow mode only; 90-day maximum; 2 concurrent deployments maximum | Free — bilateral agreement or institutional funding |
| ANSP Operational | ANSP (post-shadow) | Full aviation portal; live alerts; NOTAM drafting; multi-ANSP coordination | Annual SaaS subscription per ANSP (seat-unlimited within org) |
| Space Operator | Satellite operators | Space portal; decay prediction; conjunction; CCSDS export; API access | Per-object-per-month or flat subscription with object cap |
| Institutional | ESA, national agencies, research | Full access; data export; API; bulk historical; on-premise deployment option | Bilateral contract or grant-funded; source code escrow option |

Tier is stored in organisations.subscription_tier. Tier-based feature gating added to RBAC: e.g., shadow_trial orgs cannot activate live alert delivery to external systems.
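Tier-based gating can be sketched as a feature lookup consulted by the RBAC layer; the feature names below are illustrative, and the authoritative tier definitions are the table above:

```python
# Illustrative gate table keyed by organisations.subscription_tier.
TIER_FEATURES = {
    "shadow_trial": {"aviation_portal", "shadow_alerts"},
    "ansp_operational": {"aviation_portal", "shadow_alerts",
                         "live_alert_delivery", "notam_drafting"},
}

def feature_enabled(tier: str, feature: str) -> bool:
    """RBAC helper: unknown tiers get no features (fail closed)."""
    return feature in TIER_FEATURES.get(tier, set())
```

Failing closed on unknown tiers means a mis-set subscription_tier value disables features rather than accidentally granting live alert delivery to a trial org.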

40.3 Procurement Readiness Process

For each institutional procurement opportunity pursued:

  1. Create docs/bid/{procurement-id}/traceability.md — maps the procurement's SoR requirements to existing MASTER_PLAN.md section(s); gaps marked NOT MET or PARTIALLY MET
  2. Create docs/bid/{procurement-id}/milestone-reconciliation.md — maps procurement milestones (KO, PDR, CDR, AT) to SpaceCom phase completion dates
  3. Run ATP (docs/bid/acceptance-test-procedure.md) on the staging environment before submission
  4. Create docs/bid/{procurement-id}/kpi-and-validation-plan.md — maps tender KPIs to replay cases, conservative baselines, evidence artefacts, and any partner-supplied validation input still required
  5. Update docs/competitive-analysis.md to confirm differentiation claims are current

This is a per-opportunity process maintained by the product owner — it does not drive changes to the core plan unless a genuine product gap is identified.

40.4 Customer Onboarding Specification

| Artefact | Location | Purpose |
|---|---|---|
| ANSP onboarding checklist | docs/onboarding/ansp-onboarding-checklist.md | Integration lead walkthrough; environment setup; FIR configuration; user training |
| Admin setup guide | docs/onboarding/admin-setup.md | Persona D configuration; shadow mode activation; user provisioning |
| Shadow exit report template | docs/templates/shadow-mode-exit-report.md | Statistics + ANSP Safety Department sign-off |
| Commercial offer template | docs/templates/commercial-offer-ansp.md | Auto-populated from org data; sent at shadow exit |

40.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Plan structure vs. bid | Product-first; bid traceability is a per-opportunity overlay | Restructure plan around ESA SoR | SpaceCom serves multiple market segments; structuring around one procurement creates lock-in and excludes ANSP and space operator commercial pathways |
| Default hosting jurisdiction | EU (eu-central-1) | US-based hosting | ECAC ANSP customers are predominantly EU/UK; EU hosting satisfies data residency without per-customer complexity |
| Shadow deployment cap | 2 concurrent | Unlimited | Each shadow deployment requires a dedicated integration lead for 90 days; 2 concurrent is the realistic Phase 2 capacity without specialist hiring |
| Space-Track AUP gate | Phase 1 architectural decision | Phase 2 clarification | The shared vs. per-org ingest architecture is a fundamental Phase 1 design choice; deferring to Phase 2 would require rearchitecting already-shipped code |
| SLA in MSA | Separate SLA schedule versioned independently | Inline in MSA body | SLA values change more frequently than contract terms; versioned schedule allows SLA updates without full MSA re-execution |

41. Database Engineering Review

41.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | tle_sets BIGSERIAL PK incompatible with TimescaleDB hypertable uniqueness requirement | High | §9.2 tle_sets |
| 2 | TEXT enum columns lacking CHECK constraints (12 columns across 7 tables) | High | §9.2 all affected tables |
| 3 | asyncpg prepared statement cache conflicts with PgBouncer transaction mode | High | §9.4 |
| 4 | prediction_outcomes.prediction_id and alert_events.prediction_id typed INTEGER; references BIGSERIAL column | Medium | §9.2 |
| 5 | idempotency_keys already has composite PRIMARY KEY — confirmed safe; upsert pattern documented | N/A (already correct) | §9.2 |
| 6 | Mixed GEOGRAPHY/GEOMETRY types break GiST index selectivity on cross-table spatial joins | Medium | §9.3 |
| 7 | acknowledged_by and reviewed_by FKs block GDPR erasure with default RESTRICT | Medium | §9.2 |
| 8 | Mutable tables missing updated_at column and trigger | Medium | §9.2 |
| 9 | DB password rotation procedure killed in-flight transactions via hard restart | Medium | §7.5 |
| 10 | tle_sets chunk interval (7 days) too small; poor compression ratio for ingest rate | Low | §9.4 |
| 11 | Missing partial indexes on hot-path filtered queries (jobs, refresh_tokens, idempotency_keys, alert_events) | Low | §9.3 |

41.2 Schema Integrity Rules

Rules enforced after this review:

  1. Hypertable natural keys — No surrogate BIGSERIAL PK on hypertables. Reference tle_sets rows by (object_id, ingested_at). If a surrogate is needed, use UNIQUE (surrogate_id, partition_col) composite.
  2. CHECK constraints mandatory — Every TEXT column with a finite valid value set must have a CHECK (col IN (...)) constraint. Application-layer validation is supplemental, not primary.
  3. asyncpg pool config — prepared_statement_cache_size=0 must be set on all async engine instances. Enforced by a test that creates a test engine and asserts the connect_arg is present.
  4. BIGINT FK parity — Any FK referencing a BIGSERIAL column must be BIGINT. Linted in CI via a custom Alembic migration checker.
  5. Spatial type discipline — Every ST_Intersects / ST_Contains call mixing GEOGRAPHY and GEOMETRY sources must include an explicit ::geometry cast on the GEOGRAPHY operand. Linted via ruff custom rule.
  6. ON DELETE SET NULL on audit FKs — FKs in audit/safety tables (security_logs, alert_events.acknowledged_by, notam_drafts.reviewed_by) use ON DELETE SET NULL. Hard DELETE on users is reserved for GDPR erasure only; see §29.
  7. updated_at trigger — All mutable (non-append-only) tables must have updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() and a BEFORE UPDATE trigger using set_updated_at(). Append-only tables (those with prevent_modification() trigger) are excluded.
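Rule 4's CI check can be sketched as a small stdlib linter. This is a minimal illustration, not the production checker (which hooks into Alembic); `lint_migration_sql` and its regex are hypothetical names introduced here.

```python
import re

# Hypothetical CI lint for Rule 4: flag INTEGER columns that carry a
# REFERENCES clause, since any FK referencing a BIGSERIAL parent must be
# BIGINT. Flagged columns are reviewed manually against the parent schema.
FK_INTEGER = re.compile(
    r"^\s*(\w+)\s+INTEGER\b[^,]*\bREFERENCES\b",
    re.IGNORECASE | re.MULTILINE,
)

def lint_migration_sql(sql: str) -> list[str]:
    """Return names of INTEGER FK columns that should likely be BIGINT."""
    return FK_INTEGER.findall(sql)

ddl = """
CREATE TABLE prediction_outcomes (
    id BIGSERIAL PRIMARY KEY,
    prediction_id INTEGER NOT NULL REFERENCES predictions(id),
    reviewed_by BIGINT REFERENCES users(id) ON DELETE SET NULL
);
"""
assert lint_migration_sql(ddl) == ["prediction_id"]
```

A regex pass over migration SQL is deliberately coarse: it cannot know the referenced column's type, so it flags candidates for review rather than failing silently-correct columns.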

41.3 GDPR Erasure Procedure (users table)

Per Finding 7 — a hard DELETE FROM users WHERE id = $1 is not the correct GDPR erasure mechanism. The correct procedure:

  1. Null out PII columns: UPDATE users SET email = 'erased-' || id || '@erased.invalid', password_hash = 'ERASED', mfa_secret = NULL, mfa_recovery_codes = NULL, tos_accepted_ip = NULL WHERE id = $1
  2. Security logs, alert acknowledgements, and NOTAM review records are preserved with user_id = NULL (ON DELETE SET NULL handles this automatically if a hard DELETE is later required by specific legal instruction)
  3. Log the erasure in security_logs with event_type = 'GDPR_ERASURE' before nulling
  4. The users row itself is retained as a tombstone (email contains the erased marker) — this preserves referential integrity for organisation_id links and prevents FK violations in tables without SET NULL

Full procedure: docs/runbooks/gdpr-erasure.md (Phase 2 gate, per §29).
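The ordering constraint in the procedure (log the erasure before nulling PII) can be sketched as follows. The statement text and the `execute` callback are illustrative only; the production implementation lives in the runbook above.

```python
# Minimal sketch of the §41.3 erasure order. `gdpr_erase` is a hypothetical
# helper; `execute` stands in for a DB session's statement executor.
def gdpr_erase(user_id: int, execute) -> str:
    tombstone_email = f"erased-{user_id}@erased.invalid"
    # Step 3 first: record the erasure while the identity still exists.
    execute(
        "INSERT INTO security_logs (event_type, user_id) "
        "VALUES ('GDPR_ERASURE', $1)",
        user_id,
    )
    # Step 1: null out PII, leaving the row as a tombstone. Steps 2 and 4
    # need no action here: audit FKs rely on ON DELETE SET NULL, and the
    # users row is retained, so no hard DELETE is issued.
    execute(
        "UPDATE users SET email = $1, password_hash = 'ERASED', "
        "mfa_secret = NULL, mfa_recovery_codes = NULL, "
        "tos_accepted_ip = NULL WHERE id = $2",
        tombstone_email, user_id,
    )
    return tombstone_email

executed = []
email = gdpr_erase(42, lambda sql, *args: executed.append(sql))
assert email == "erased-42@erased.invalid"
assert executed[0].startswith("INSERT INTO security_logs")
```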

41.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Hypertable surrogate key | Remove BIGSERIAL; use UNIQUE(object_id, ingested_at) | Add UNIQUE(id, ingested_at) composite | Natural key is semantically stable and meaningful; composite surrogate is confusing and rarely queried by raw id |
| CHECK constraints vs. Postgres ENUM | CHECK (col IN (...)) | CREATE TYPE ENUM | CHECK constraints are simpler to extend in migrations (no ALTER TYPE ADD VALUE); ENUM changes require pg_dump for type renaming |
| GDPR erasure | Tombstone update, not hard DELETE | Hard DELETE with CASCADE | Hard DELETE cascades into safety records (NOTAM drafts, alert logs) that must be retained under EASA/ICAO safety record requirements; tombstone preserves the record while removing identity |
| Spatial type mixing | Explicit ::geometry cast; document in §9.3 | Migrate all columns to GEOGRAPHY | Airspace GEOMETRY gives 3× ST_Intersects speedup for regional FIR queries; global corridors correctly use GEOGRAPHY; cast is cheap and safe |

42. Test Engineering / QA Review

42.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | No formal test pyramid with per-layer coverage gates | High | §33.10 |
| 2 | No database isolation strategy for integration tests | High | §33.10 |
| 3 | Hypothesis property-based tests unspecified | High | §33.10 table, §12 |
| 4 | WebSocket test strategy missing | High | §33.10 table, §12 |
| 5 | Playwright E2E tests lack data-testid selector convention | Medium | §33.9 |
| 6 | No smoke test suite for post-deploy verification | Medium | §12, §33.10 |
| 7 | No flaky test policy | Medium | §33.10 |
| 8 | Contract tests lack value-range assertions | Medium | DoD checklists |
| 9 | Celery task timeout → jobs state transition untested; no orphan cleanup | Medium | §7.12 |
| 10 | MC simulation test data generation strategy not specified | Low | §15.4 |
| 11 | Accessibility testing not integrated into CI with implementation spec | Low | §6.16 |

42.2 Test Suite Inventory

Full test suite after this review:

tests/
  conftest.py              # db_session (SAVEPOINT); testcontainers for Celery tests; pytest.ini markers
  physics/
    test_frame_utils.py    # Vallado reference cases — all BLOCKING
    test_propagator/       # SGP4 state vectors — BLOCKING
    test_decay/            # Decay predictor backcast — Phase 2+
    test_nrlmsise.py       # NRLMSISE-00 density reference — BLOCKING
    test_hypothesis.py     # Hypothesis property-based invariants — BLOCKING
    test_mc_corridor.py    # MC seeded RNG corridor — Phase 2+
    test_breakup/          # Breakup energy conservation — Phase 2+
  test_integrity.py        # HMAC sign/verify/tamper — BLOCKING
  test_auth.py             # JWT; MFA; rate limiting — BLOCKING
  test_rbac.py             # Every endpoint × every role — BLOCKING
  test_websocket.py        # WS lifecycle; sequence replay; close codes — BLOCKING
  test_ingest/
    test_contracts.py      # Space-Track + NOAA key + value range — BLOCKING (mocked)
  test_spaceweather/       # Space weather ingest logic
  test_jobs/
    test_celery_failure.py # Timeout → failed; orphan recovery — BLOCKING
  smoke/                   # Post-deploy; idempotent; ≤ 2 min — BLOCKING post-deploy
  quarantine/              # Flaky tests awaiting fix; non-blocking nightly only
  e2e/                     # Playwright; 5 user journeys + axe WCAG 2.1 AA — BLOCKING
    test_accessibility.ts  # axe-core scan on every primary view; fails PR on any WCAG 2.1 AA violation
    test_alert_websocket.ts  # submit prediction → Celery completes → CRITICAL alert in browser via WS (F9)
  load/                    # k6 performance scenarios — non-blocking (nightly)

Accessibility test specification (F11):

e2e/test_accessibility.ts uses @axe-core/playwright to scan each primary view on every PR:

import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

const VIEWS_TO_SCAN = [
  '/',                          // Operational Overview
  '/events',                    // Active Events
  '/events/[sample-id]',        // Event Detail
  '/handover',                  // Shift Handover
  '/space/objects',             // Space Operator Overview
];

for (const view of VIEWS_TO_SCAN) {
  test(`WCAG 2.1 AA: ${view}`, async ({ page }) => {
    await page.goto(view);
    const results = await new AxeBuilder({ page })
      .withTags(['wcag2a', 'wcag2aa'])
      .analyze();
    // Any wcag2a/wcag2aa violation fails the PR; the full result set is
    // serialised into the a11y-report.html CI artefact.
    expect(results.violations).toEqual([]);
  });
}

CI gate: any axe-core violation at wcag2a or wcag2aa level fails the PR. wcag2aaa violations are reported as warnings only. Results published as CI artefact (a11y-report.html).

WebSocket alert delivery E2E test (F9): e2e/test_alert_websocket.ts is a BLOCKING E2E test that verifies the full end-to-end path from prediction submission to browser alert receipt. This test requires the full stack (Celery workers running, WebSocket server live):

// e2e/test_alert_websocket.ts
import { test, expect } from '@playwright/test';

test('CRITICAL alert appears in browser via WebSocket after prediction job completes', async ({ page }) => {
  // 1. Authenticate as operator
  await page.goto('/login');
  await page.fill('[name=email]', process.env.E2E_OPERATOR_EMAIL!);
  await page.fill('[name=password]', process.env.E2E_OPERATOR_PASSWORD!);
  await page.click('[type=submit]');
  await page.waitForURL('/');

  // 2. Submit a decay prediction via the API that will produce a CRITICAL alert.
  //    page.request shares the browser context's cookies and baseURL, so no
  //    manual Cookie-header assembly is needed.
  const submit = await page.request.post('/api/v1/decay/predict', {
    data: { norad_id: 90001, mc_samples: 50 },  // test object; always produces CRITICAL
  });
  const job = await submit.json();

  // 3. Wait for the CRITICAL alert banner to appear in the browser (max 60s)
  await expect(page.locator('[role="alertdialog"][data-severity="CRITICAL"]'))
    .toBeVisible({ timeout: 60_000 });

  // 4. Assert the alert references our prediction
  const alertText = await page.locator('[role="alertdialog"]').textContent();
  expect(alertText).toContain('90001');
});

The 60-second timeout covers: Celery task queue, MC computation (50 samples), alert threshold evaluation, WebSocket push to all org subscribers, React state update, and DOM render. If this test fails intermittently, the failure is investigated as a potential latency regression — it must not be moved to quarantine/ without a root-cause investigation.

Manual screen reader test (release checklist — not automated):

  • NVDA + Firefox (Windows): primary operator workflow (alert receipt → acknowledgement → NOTAM draft)
  • VoiceOver + Safari (macOS): same workflow
  • Keyboard-only: full workflow without mouse
  • Added to release gate checklist in docs/RELEASE_CHECKLIST.md

42.3 Hypothesis Invariant Specifications

Minimum 5 required Hypothesis properties in tests/physics/test_hypothesis.py:

| Property | Strategy | Assertion | max_examples |
|---|---|---|---|
| SGP4 round-trip position | Random valid TLE orbital elements | Forward propagate T days then back; position error < 1 m | 200 |
| p95 corridor containment | Seeded MC ensemble (seed=42, N=500) | Corridor contains ≥ 95% of input trajectories | 50 |
| NRLMSISE-00 density positive | Random altitude 100–800 km, valid F10.7/Ap | Density always > 0 kg/m³ | 500 |
| RLS tenant isolation | Two different organisation IDs | Session set to org A never returns rows for org B | 100 |
| Pagination non-overlap | Cursor pagination with random page sizes | Pages are non-overlapping and cover full dataset | 100 |
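The last property in the table can be illustrated with a stdlib-only stand-in (the real test uses Hypothesis strategies; the `paginate` helper here is hypothetical and only sketches keyset pagination):

```python
import random

# Stdlib sketch of the "pagination non-overlap" property: for random page
# sizes, cursor pages must be disjoint, ordered, and cover the dataset.
def paginate(rows: list[int], page_size: int) -> list[list[int]]:
    """Keyset-style pagination: each page starts after the last seen id."""
    pages, cursor = [], None
    while True:
        page = [r for r in rows if cursor is None or r > cursor][:page_size]
        if not page:
            return pages
        pages.append(page)
        cursor = page[-1]

rng = random.Random(42)          # seeded, like the MC tests
rows = sorted(rng.sample(range(10_000), 500))
for _ in range(100):             # rough analogue of max_examples=100
    pages = paginate(rows, rng.randint(1, 50))
    flat = [r for p in pages for r in p]
    assert flat == rows          # full coverage, no overlap, stable order
```

Hypothesis adds shrinking and strategy-driven input generation on top of the same invariant; the assertion itself is unchanged.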

42.4 MC Corridor Test Data Specification

Reference data committed to docs/validation/reference-data/:

| File | Contents | Regeneration |
|---|---|---|
| mc-ensemble-params.json | RNG seed=42, object params, generation timestamp | Never change seed; add to file if params change |
| mc-corridor-reference.geojson | Pre-computed p95 corridor polygon | Run python tools/generate_mc_reference.py after algorithm change; review diff before committing |

Test asserts area delta < 5% between computed and reference polygon. If the algorithm changes, the reference polygon must be explicitly regenerated and the change logged in CHANGELOG.md.
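The area-delta assertion can be sketched in plain Python, assuming planar coordinates for illustration (production code compares real geodesic corridor polygons in PostGIS; `shoelace_area` and `area_delta_ok` are hypothetical helpers):

```python
# Sketch of the < 5% area-delta regression check against the committed
# reference polygon. Planar shoelace area is used purely for illustration.
def shoelace_area(poly: list[tuple[float, float]]) -> float:
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1]
                   for i in range(n))) / 2

def area_delta_ok(computed, reference, tolerance=0.05) -> bool:
    ref = shoelace_area(reference)
    return abs(shoelace_area(computed) - ref) / ref <= tolerance

reference = [(0, 0), (10, 0), (10, 4), (0, 4)]      # area 40
computed = [(0, 0), (10, 0), (10, 4.1), (0, 4.1)]   # area 41 → 2.5% delta
assert area_delta_ok(computed, reference)
assert not area_delta_ok([(0, 0), (10, 0), (10, 5), (0, 5)], reference)
```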

42.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| DB isolation | SAVEPOINT for unit/single-connection; testcontainers for Celery | Shared test DB with cleanup | SAVEPOINT is zero-overhead and perfectly isolated; testcontainers gives true process isolation for multi-connection Celery tests without manual teardown |
| Flaky test policy | Quarantine after 2 failures in 30 days; delete if unfixed > 14 days | Retry flaky tests automatically | Auto-retry masks root causes; quarantine with mandatory resolution timeline creates accountability |
| Hypothesis in blocking CI | Yes, max_examples ≥ 200 for physics | Optional/nightly only | Safety-critical physics invariants must be checked on every commit; 200 examples adds < 30s to CI at default shrink settings |
| MC test data | Seeded RNG + committed reference polygon | Committed raw trajectory arrays | Raw arrays are large (~MB); seeded RNG is deterministic and tiny; committed polygon provides a stable regression target |
| data-testid convention | Mandatory for all Playwright targets; CSS class selectors forbidden | Allow CSS class selectors | CSS classes are refactoring artefacts; data-testid is stable across UI refactors and explicitly documents test intent |
| Smoke test gate | Blocking post-deploy, not blocking pre-deploy CI | Block pre-deploy CI | Smoke tests require a running stack; pre-deploy CI has no stack. Post-deploy gate means deployment rollback is the recovery action for smoke failure |
| Accessibility CI gate | axe-core wcag2a + wcag2aa violations block PR; wcag2aaa warnings only | Manual testing only | Manual testing is too slow and inconsistent for PR-level feedback; automated axe-core catches ~57% of WCAG issues at zero marginal cost; manual screen reader testing reserved for release gate |

43. Observability / Monitoring Engineering Review

43.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | Per-object Gauge labels cause alert flooding (600 pages for one outage) | High | §26.7 — recording rules added |
| 2 | No structured logging format specification | High | §7.14, §10 |
| 3 | No distributed tracing (OpenTelemetry) | High | §26.7, §10 |
| 4 | AlertManager rules have semantic errors; no runbook links | High | §26.7 — rules rewritten |
| 5 | No log aggregation stack specified | Medium | §3.2, §10 |
| 6 | Celery queue depth and DLQ depth metrics not defined | Medium | §26.7 |
| 7 | SLIs not formally instrumented against SLOs | Medium | §26.7 — recording rules |
| 8 | No request_id / trace_id correlation between logs and metrics | Medium | §7.14 |
| 9 | Prometheus scrape configuration not specified | Medium | §26.7 |
| 10 | Renderer service has no functional health check or metrics | Medium | §26.5 |
| 11 | No on-call rotation spec or AlertManager escalation routing | Medium | §26.8 |

43.2 Observability Stack Summary

After this review the full observability stack is:

| Layer | Tool | Phase |
|---|---|---|
| Metrics | Prometheus + prometheus-fastapi-instrumentator | 1 |
| Alerting | AlertManager with runbook_url annotations | 1 |
| Dashboards | Grafana (4 dashboards) | 2 |
| Structured logs | structlog JSON with required fields + sanitiser | 1 |
| Log aggregation | Grafana Loki + Promtail (Docker log scrape) | 2 |
| Distributed tracing | OpenTelemetry → Grafana Tempo | 2 |
| On-call routing | PagerDuty/OpsGenie via AlertManager L1/L2/L3 tiers | 2 |

43.3 Alert Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| rate(counter[Xm]) > 0 | increase(counter[Xm]) >= N — rate() is per-second and stays positive once the counter increments |
| Alert directly on spacecom_tle_age_hours{norad_id=...} | Alert on spacecom:tle_stale_objects:count recording rule — prevents 600-alert floods |
| AlertManager rule with no annotations.runbook_url | Every rule must include runbook_url pointing to the relevant runbook in docs/runbooks/ |
| Grafana dashboard as sole incident channel | All CRITICAL alerts also page via PagerDuty; dashboards are diagnosis tools, not alert channels |

43.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Log aggregation | Grafana Loki | ELK stack | Loki is 10× cheaper to operate (no full-text index); Prometheus labels for log querying are sufficient for this workload; co-deploys with existing Grafana without separate ES cluster |
| Tracing backend | Grafana Tempo | Jaeger | Tempo co-deploys with Grafana/Loki with no separate storage; native Grafana datasource; OTLP ingest; no query language to learn |
| Per-object label strategy | Keep labels for Grafana; alert on recording rule aggregates | Remove per-object labels | Per-object drill-down in Grafana dashboards is operationally valuable; the alert flooding problem is solved by recording rules, not by removing labels |
| Structured logging library | structlog | Python standard logging + JSON formatter | structlog integrates natively with contextvars for request_id propagation; the context binding pattern is cleaner than threading.local |
| Renderer health check | Functional Chromium launch test | Process liveness only | Chromium hanging without crashing is a known Playwright failure mode; process liveness gives false confidence; functional check is the only reliable signal |
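The contextvars binding pattern behind the structlog decision (and the request_id correlation in Finding 8) can be shown with the standard library alone; this is a stand-in sketch, not the structlog configuration itself, and `RequestIdFilter` is a hypothetical name:

```python
import contextvars
import logging

# Stdlib sketch of per-request context binding: set request_id once at the
# request boundary; every log record emitted in that context carries it,
# enabling log/metric/trace correlation. structlog does this natively.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

request_id_var.set("req-123")
record = logging.LogRecord("spacecom", logging.INFO, __file__, 1,
                           "prediction queued", None, None)
assert RequestIdFilter().filter(record)
assert record.request_id == "req-123"
```

Because `ContextVar` is copied per async task, concurrent requests keep separate request_ids without any `threading.local` plumbing.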

44. Frontend Architecture Review

44.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No documented decision on Next.js App Router vs Pages Router; component boundary ("use client") placement unspecified | Medium | §13.1 — App Router confirmed; "use client" at app/(globe)/layout.tsx boundary |
| 2 | CesiumJS requires 'unsafe-eval' in CSP for GLSL shader compilation; existing policy blocks the globe | High | §7.7 — two-tier CSP; 'unsafe-eval' scoped to app/(globe)/ routes only |
| 3 | Globe WebGL crash removes alert panel from DOM; CesiumJS WebGL context loss is unhandled | High | §13.1 — GlobeErrorBoundary wrapping only the globe canvas; alert panel in separate PanelErrorBoundary |
| 4 | CesiumJS entity memory leak: unbounded entity accumulation causes WebGL OOM and renderer crash | Medium | §13.1 — max 500 entities; 96h orbit path limit; stale entity pruning on update |
| 5 | WebSocket reconnection strategy unspecified; naive reconnect causes thundering-herd on server restart | Medium | §13.1 — exponential backoff with ±20% jitter; RECONNECT config object; max 30s delay |
| 6 | No TanStack Query key management strategy; ad-hoc key strings cause cache stampedes and stale data | Medium | §13.1 — queryKeys key factory pattern; all query keys centralised in src/lib/queryKeys.ts |
| 7 | Safety-critical panels (alert list, corridor map) have no loading/empty/error state specification | High | §13.1 — explicit state matrix per panel; alert panel must show degraded-data warning on stale WebSocket |
| 8 | LIVE/SIMULATION/REPLAY mode isolation not enforced in UI; writes possible in replay mode | High | §13.1 — useModeGuard hook; §33.9 — AGENTS.md rule added |
| 9 | Deck.gl renders on a separate canvas above CesiumJS; z-order and input event handling are broken | Medium | §13.1 — DeckLayer from @deck.gl/cesium; single canvas; shared input handling |
| 10 | CesiumJS imported at module level causes SSR crash; next build fails | High | §13.1 — next/dynamic with ssr: false for all CesiumJS components |
| 11 | Cesium ion token injection pattern undocumented; risk of over-engineering (proxying a public credential) | Low | §7.5 — explicit NOT A SECRET annotation; §33.9 — AGENTS.md rule added |

44.2 Architecture Constraints Summary

After this review the frontend architecture constraints are:

| Constraint | Rule |
|---|---|
| App Router split | app/(auth)/ and app/(admin)/ — server components; app/(globe)/ — "use client" root layout |
| CesiumJS import | next/dynamic + ssr: false only; never a static import at module level |
| CSP | Two-tier: standard (no 'unsafe-eval') for non-globe; globe-tier ('unsafe-eval') for app/(globe)/ only |
| Error isolation | Globe crash must not affect alert panel; independent ErrorBoundary per major region |
| Entity cap | 500 CesiumJS entities maximum; prune entities not updated in last 96h |
| WebSocket reconnect | Exponential backoff, initial 1s, max 30s, ×2 multiplier, ±20% jitter |
| Query keys | All keys defined in src/lib/queryKeys.ts key factory; no inline key strings |
| Mode guard | All write operations must check useModeGuard(['LIVE']) and disable in SIMULATION/REPLAY |
| Deck.gl | DeckLayer from @deck.gl/cesium only; no separate canvas |
| Cesium ion token | NEXT_PUBLIC_CESIUM_ION_TOKEN; public credential; not proxied; not in Vault |

44.3 Anti-Patterns (Do Not Introduce)

| Anti-pattern | Correct form |
|---|---|
| `import * as Cesium from 'cesium'` at module level | `next/dynamic(() => import('./CesiumViewerInner'), { ssr: false })` |
| Single root `<ErrorBoundary>` wrapping entire app | Independent boundaries: GlobeErrorBoundary, PanelErrorBoundary, AlertErrorBoundary |
| `queryClient.invalidateQueries('objects')` (string key) | `queryClient.invalidateQueries({ queryKey: queryKeys.objects.all() })` |
| Rendering write controls (buttons, forms) without mode check | `const { isAllowed } = useModeGuard(['LIVE']); <button disabled={!isAllowed}>` |
| Deck.gl separate canvas (`new Deck({ canvas: ... })`) | `viewer.scene.primitives.add(new DeckLayer({ layers: [...] }))` |
| Storing Cesium ion token in backend env / Vault / Docker secrets | NEXT_PUBLIC_CESIUM_ION_TOKEN in .env.local; committed non-secret in CI |
| Reconnect without jitter (`setTimeout(connect, delay)`) | `delay * (1 + (Math.random() * 2 - 1) * RECONNECT.jitter)` |
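The reconnect schedule defined in §44.2 can be checked numerically; this Python sketch mirrors the TypeScript jitter formula purely for illustration (`reconnect_delay` is a hypothetical helper, not part of the frontend code):

```python
import random

# §44.2 reconnect parameters: initial 1s, max 30s, ×2 multiplier, ±20% jitter.
RECONNECT = {"initial": 1.0, "max": 30.0, "multiplier": 2.0, "jitter": 0.2}

def reconnect_delay(attempt: int, rng: random.Random) -> float:
    base = min(RECONNECT["initial"] * RECONNECT["multiplier"] ** attempt,
               RECONNECT["max"])
    # Mirrors the TS form: delay * (1 + (Math.random() * 2 - 1) * jitter)
    return base * (1 + (rng.random() * 2 - 1) * RECONNECT["jitter"])

rng = random.Random(1)
delays = [reconnect_delay(a, rng) for a in range(10)]
assert 0.8 <= delays[0] <= 1.2               # first retry lands near 1s
assert all(d <= 30.0 * 1.2 for d in delays)  # 30s cap, before ±20% jitter
```

The jitter term is what prevents the thundering-herd in Finding 5: clients that disconnected together spread their reconnects across a ±20% window instead of retrying in lockstep.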

44.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| App Router adoption | App Router with route groups | Pages Router | Route groups ((globe), (auth)) enable per-group CSP header configuration in next.config.ts; server components reduce globe-route initial JS; incremental adoption possible |
| "use client" boundary | app/(globe)/layout.tsx | Per-component "use client" annotations | Single boundary at layout level is simpler; all CesiumJS/Zustand/WebSocket code is already browser-only; per-component annotations at this scale would be noise |
| Globe CSP strategy | Route-scoped 'unsafe-eval' | Hash-based CSP for GLSL | CesiumJS generates shader source dynamically; hashes cannot cover runtime-generated strings; route-scoping is the only practical option |
| Deck.gl integration | DeckLayer from @deck.gl/cesium | Separate Deck.gl canvas | Separate canvas breaks mouse event routing and z-order; DeckLayer renders inside CesiumJS as a primitive, sharing the WebGL context |
| Cesium ion token | NEXT_PUBLIC_ env var | Backend proxy endpoint | Cesium ion is a CDN/tile service with public tokens by design; proxying adds latency and a backend dependency for a non-secret; Cesium's own documentation recommends direct browser use |

45. Platform / Infrastructure Operations Engineering Review

45.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | Python 3.11/3.12 version mismatch between Dockerfiles and service table | Medium | §30.2 — all images updated to python:3.12-slim, node:22-slim; CI version check added |
| 2 | No container resource limits; runaway simulation worker can OOM-kill the database | High | §3.3 — deploy.resources.limits added for all services; stop_grace_period added |
| 3 | Docker SIGTERM→SIGKILL grace period (10s default) too short for MC task warm shutdown | High | §3.3 — stop_grace_period: 300s for worker-sim; --without-gossip --without-mingle flags specified |
| 4 | Backend and renderer on disjoint networks — cannot communicate | Critical | §3.3 — backend added to renderer_net; network topology diagram corrected |
| 5 | Workers bypass PgBouncer — 16 direct connections per worker undermines connection pooling | Medium | §3.3 — PgBouncer added to worker_net; workers connect via pgbouncer:5432 |
| 6 | Redis ACL per-service is stated in §3.2 but undefined — compromised worker can read session tokens | High | §3.2 — full ACL definition added; three separate passwords added to §30.3 env contract |
| 7 | pg_isready -U postgres healthcheck passes before TimescaleDB extension and application DB are ready | Medium | §26.5 — healthcheck replaced with psql query against timescaledb_information.hypertables |
| 8 | daily_base_backup calls pg_basebackup from Python worker image — tool not installed | High | §26.6 — replaced with dedicated db-backup sidecar container; Celery task now verifies backup presence in MinIO |
| 9 | No pids_limit on renderer or worker containers — Chromium crash can fork-bomb host | Medium | §3.3 — pids_limit added: renderer=100, worker-sim=64, worker-ingest=16 |
| 10 | Renderer PDF scratch written to container writable layer — sensitive data persists | Medium | §3.3 — tmpfs mount at /tmp/renders (512 MB); RENDER_OUTPUT_DIR env var added |
| 11 | Blue-green deployment mechanics unspecified for Docker Compose — first production deploy would fail | High | §26.9 — scripts/blue-green-deploy.sh spec added; Caddy dynamic upstream pattern defined |

45.2 Container Runtime Safety Summary

After this review the container runtime safety posture is:

| Concern | Control |
|---|---|
| Resource isolation | deploy.resources.limits per service; DB memory-capped to survive worker OOM |
| Graceful shutdown | stop_grace_period: 300s for simulation workers; Celery --without-gossip --without-mingle |
| Process containment | pids_limit on renderer (100) and both workers |
| Sensitive scratch data | Renderer uses tmpfs at /tmp/renders; cleared on container stop |
| Network access | Backend on renderer_net; PgBouncer on worker_net; workers never reach frontend_net |
| Redis ACL | Three ACL users (backend, worker, ingest) with scoped key namespaces; default user disabled |
| DB healthcheck | Verifies TimescaleDB extension loaded and application DB accessible before dependent services start |
| Backups | Dedicated db-backup sidecar with PostgreSQL tools; Celery Beat verifies presence, not execution |

45.3 Operations Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| FROM python:3.11-slim or FROM node:20-slim in any Dockerfile | python:3.12-slim / node:22-slim; hadolint check enforces this |
| No deploy.resources.limits on CPU/memory-intensive services | All services must have limits; simulation workers especially |
| Worker DATABASE_URL pointing to db:5432 | pgbouncer:5432 — all workers route through PgBouncer |
| subprocess.run(['pg_basebackup', ...]) from a Python worker container | Dedicated db-backup sidecar container with PostgreSQL tools |
| pg_isready -U postgres as the DB healthcheck | psql -c "SELECT 1 FROM timescaledb_information.hypertables LIMIT 1" |
| docker compose stop (default 10s) for simulation workers | stop_grace_period: 300s on worker-sim service definition |
| All services sharing single REDIS_PASSWORD | Three ACL users with scoped namespaces; separate passwords |
| Blue-green deploy without specifying the Compose implementation | scripts/blue-green-deploy.sh with separate Compose project instances + Caddy dynamic upstream |

45.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python version | 3.12 (service table and Dockerfiles aligned) | 3.11 (original Dockerfiles) | 3.12 has 10–25% numeric performance improvements; free-threaded GIL prep; security support through 2028; alignment eliminates silent version drift |
| Blue-green implementation | Separate Compose project instances + Caddy dynamic upstream file | Single Compose file with blue/green service name variants | Separate projects mean the Compose file is not modified per deployment; Caddy JSON upstream reload is atomic and < 5s |
| Backup execution model | Host cron → db-backup sidecar via docker compose run | Celery task + subprocess.run | Celery workers do not have pg_basebackup; host cron is independent of application availability — backup runs even if Celery is down |
| PID limits | Per-service pids_limit in Compose | Kernel cgroup default | Compose pids_limit is applied at container creation; simpler to audit than system-level cgroup tuning; values sized per expected process count |
| Renderer scratch storage | tmpfs | Named Docker volume | PDF contents include prediction data; tmpfs guarantees no persistence; cleared on container stop/restart without manual cleanup |
| Redis ACL scope | Key prefix namespacing (~celery* for workers) | Command-level ACL only | Key-prefix ACL prevents workers from reading/writing outside their namespace; command-level-only ACL is weaker (worker could still enumerate all keys) |

46. Data Pipeline / ETL Engineering Review

46.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No Space-Track request budget tracked; 30-min TIP polling consumes 48/600 requests/day before retries | High | §31.1.1 — SpaceTrackBudget Redis counter; alert at 80%; operator re-fetches budget-checked |
| 2 | TIP 30-min polling too slow for late re-entry phase; CDM 12h polling can miss short-TCA conjunctions entirely | High | §31.1.1 — adaptive polling: TIP→5min, CDM→30min when active_tip_events > 0 |
| 3 | TLE ingest ON CONFLICT behavior unspecified; double-run hits unique constraint silently | Medium | §11 — INSERT ... ON CONFLICT DO NOTHING + spacecom_ingest_tle_conflict_total metric |
| 4 | IERS EOP cold-start: astropy falls back to months-old IERS-B, silently degrading frame transforms | High | §11 — make seed EOP bootstrap step; EOP freshness check in GET /readyz |
| 5 | AIRAC FIR updates are fully manual with no staleness detection or missed-cycle alert | Medium | §31.1.3 — spacecom_airspace_airac_age_days gauge + alert; airspace_stale in readyz; fir-update runbook as Phase 1 deliverable |
| 6 | Space weather nowcast vs. forecast not distinguished; decay predictor uses wrong F10.7 for horizon > 72h | High | §31.1.2 — forecast_horizon_hours column; decay predictor input selection table |
| 7 | IERS EOP SHA-256 verification unimplementable — IERS publishes no reference hashes | Medium | §11 — dual-mirror comparison (USNO + Paris Observatory); spacecom_eop_mirror_agreement gauge |
| 8 | No exponential backoff or circuit breaker on ingest tasks; transient failures exhaust Space-Track budget | High | §31.1.1 — retry_backoff=True, retry_backoff_max=3600, max_retries=5; pybreaker circuit breaker |
| 9 | Space-Track session cookie expires between 6h polls; re-auth behavior not specified or tested | Medium | §31.1.1 — _ensure_authenticated() with proactive 1h45m TTL; session_reauth_total metric |
| 10 | ESA SWS Kp cross-validation has no decision rule; divergence from NOAA is silently ignored | Medium | §31.1.2 — arbitrate_kp() with 2.0 Kp threshold; conservative-high selection; ADR-0018 |
| 11 | celery-redbeat default lock TTL 25min causes up to 25min scheduling gap on Beat crash during TIP event | High | §26.4 — REDBEAT_LOCK_TIMEOUT=60; REDBEAT_MAX_SLEEP_INTERVAL=5; active TIP alert threshold 10min |

46.2 Ingest Pipeline Reliability Summary

After this review the ingest pipeline reliability posture is:

| Concern | Control |
|---|---|
| Space-Track rate limit | SpaceTrackBudget Redis counter; alert at 80%; hard stop at 600/day |
| Upstream failure recovery | Exponential backoff (2s→1h, ×2, ±20% jitter); circuit breaker after 3 failures; max 5 retries then DLQ |
| TIP latency during re-entry | Adaptive polling: 5-minute TIP cycle when active TIP event detected |
| CDM conjunction coverage | 30-minute CDM cycle during active TIP events (baseline 2h) |
| TLE ingest idempotency | ON CONFLICT DO NOTHING + conflict metric |
| EOP freshness | Daily download (USNO primary); dual-mirror verification; 7-day staleness alert; cold-start bootstrap in make seed |
| AIRAC currency | 28-day staleness alert; /readyz degraded signal; manual update runbook as Phase 1 deliverable |
| Space weather horizon | forecast_horizon_hours column; predictor selects by horizon; 81-day F10.7 average beyond 72h |
| Beat HA failover gap | REDBEAT_LOCK_TIMEOUT=60s; standby acquires lock within 5s of TTL expiry |
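The budget control in the first row can be sketched with an in-memory stand-in for the Redis counter. Class and method names are illustrative, not the production API; in production the counter is a Redis key with a daily expiry.

```python
# In-memory stand-in for the SpaceTrackBudget Redis counter: alert at 80%
# of the daily allowance, hard stop at 600 requests/day.
class SpaceTrackBudget:
    DAILY_LIMIT = 600
    ALERT_FRACTION = 0.8

    def __init__(self) -> None:
        self.used = 0
        self.alerted = False

    def consume(self, n: int = 1) -> bool:
        """Reserve n requests; False means hard stop — skip the HTTP call."""
        if self.used + n > self.DAILY_LIMIT:
            return False
        self.used += n
        if self.used >= self.DAILY_LIMIT * self.ALERT_FRACTION:
            self.alerted = True  # production: fire the 80% AlertManager alert
        return True

budget = SpaceTrackBudget()
assert all(budget.consume() for _ in range(600))  # within budget
assert budget.alerted                             # 80% threshold crossed
assert not budget.consume()                       # hard stop at 600/day
```

Calling `consume()` before every Space-Track request is the rule stated in §46.4; the hard stop is what keeps CI or operator activity from silently exhausting the production allowance.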

46.3 New ADR Required

| ADR | Title | Decision |
|---|---|---|
| docs/adr/0018-kp-source-arbitration.md | Kp Source Arbitration Policy | NOAA primary; ESA SWS cross-validation; conservative-high selection on > 2.0 Kp divergence; physics lead approval required |

46.4 Ingest Pipeline Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| INSERT INTO tle_sets ... VALUES (...) without ON CONFLICT DO NOTHING | Always use ON CONFLICT DO NOTHING + increment conflict metric |
| spacetrack_client.fetch() without budget check | Always call budget.consume(1) before any Space-Track HTTP request |
| Celery ingest task with max_retries=None or no backoff | retry_backoff=True, retry_backoff_max=3600, max_retries=5 |
| EOP verification by SHA-256 against prior download | Dual-mirror UT1-UTC value comparison (USNO + Paris Observatory) |
| REDBEAT_LOCK_TIMEOUT = 300 (default 5min or 25min) | REDBEAT_LOCK_TIMEOUT = 60 for active TIP event tolerance |
| Single F10.7 value regardless of prediction horizon | Select by forecast_horizon_hours; 81-day average beyond 72h |
| ESA SWS Kp logged but not acted upon | arbitrate_kp() decision rule; conservative-high on divergence |
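The arbitration rule referenced in the last row can be sketched directly from ADR-0018's stated policy. The function name comes from §46.1; the body below is an illustrative reading of the decision, not the production implementation.

```python
# Sketch of the ADR-0018 Kp arbitration rule: NOAA is primary; on more than
# 2.0 Kp divergence, take the conservative (higher) value, since higher Kp
# implies a denser atmosphere, shorter lifetime, and earlier alerting.
DIVERGENCE_THRESHOLD = 2.0  # Kp units

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    if abs(noaa_kp - esa_kp) > DIVERGENCE_THRESHOLD:
        return max(noaa_kp, esa_kp)  # conservative-high on divergence
    return noaa_kp                   # NOAA primary within threshold

assert arbitrate_kp(3.0, 4.0) == 3.0  # within threshold: NOAA primary
assert arbitrate_kp(3.0, 6.0) == 6.0  # divergent: conservative-high
assert arbitrate_kp(7.0, 4.0) == 7.0  # NOAA already the higher value
```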

46.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Adaptive TIP polling | Dynamic redbeat schedule override when active_tip_events > 0 | Fixed 5-min polling always | Fixed 5-min polling uses 288/600 Space-Track requests/day for TIPs alone; adaptive polling reserves budget for baseline operations |
| Space-Track budget enforcement | Redis counter with hard stop | Honour-system rate limit compliance | Hard stop prevents CI/staging test runs or operator actions from exhausting production budget unexpectedly |
| EOP verification | Dual-mirror value comparison | SHA-256 against prior download | IERS publishes no reference hashes; prior-download comparison detects corruption but not substitution; dual-mirror comparison is the de facto industry approach |
| Kp arbitration | Conservative-high (max of NOAA, ESA on divergence) | Average of both sources | Averaging introduces a systematic bias toward lower geomagnetic activity; in a safety-critical context, the conservative choice is the higher Kp (denser atmosphere, shorter lifetime, earlier alerting) |
| forecast_horizon_hours schema | Dedicated column on space_weather | Separate tables per horizon | Single table with horizon column is simpler to query (WHERE forecast_horizon_hours = 0); adding a table per horizon complicates the ingest pipeline without query benefit |

## §47 — Supply Chain / Dependency Security Engineering Review

### 47.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | `pip wheel` in Dockerfile does not enforce `--require-hashes`; hash pinning specified but not verified during build | High | §30.2 — `--require-hashes` added to `pip wheel` command with explanatory comment |
| 2 | cosign image signing absent from CI workflow; attestation claim was aspirational | High | §26.9 — full `cosign sign` + `cosign attest` YAML added to build-and-push job |
| 3 | SBOM format, CI step, and retention unspecified; ESA ECSS requirement undeliverable | High | §26.9 — SPDX-JSON via syft; `cosign attest` attachment; 365-day artifact retention |
| 4 | pip-audit absent; OWASP Dependency-Check has a high Python false-positive rate | Medium | §7.13 — pip-audit added to security-scan; OWASP DC removed from Python scope |
| 5 | No automated license scanning; CesiumJS AGPLv3 compliance check was manual | High | §7.13 — pip-licenses + license-checker-rseidelsohn gate on every PR |
| 6 | Base-image digest update process undefined; Dependabot cannot update `@sha256:` pins | Medium | §7.13 — Renovate Bot docker-digest manager; digest PRs auto-merged on passing CI |
| 7 | No `.trivyignore` file; first base-image CVE with no fix will break all CI builds | Medium | §7.13 — `.trivyignore` spec with expiry dates + CI expiry check |
| 8 | npm audit absent from CI; `npm ci` does not scan for known vulnerabilities | Medium | §7.13 + §26.9 — `npm audit --audit-level=high` in security-scan job |
| 9 | detect-secrets baseline update process undefined; incorrect `scan >` overwrites all allowances | Medium | §30.1 — correct `--update` procedure documented; CI baseline currency check added |
| 10 | No PyPI index trust policy; dependency-confusion attack surface unmitigated | High | §7.13 — private PyPI proxy spec; `spacecom-*` namespace reservation on public PyPI; ADR-0019 |
| 11 | GitHub Actions pinned by mutable `@vN` tags; tag repointing exfiltrates all workflow secrets | Critical | §26.9 — all actions pinned by full commit SHA; CI lint check enforces no `@v\d` tags |

### 47.2 Supply Chain Security Posture Summary

After this review the supply chain security posture is:

| Layer | Control |
|---|---|
| Python build-time hash verification | `pip wheel --require-hashes` enforces hash pinning during Docker build |
| Python CVE scanning | pip-audit (PyPADB); every PR; blocks on High/Critical |
| Node.js CVE scanning | `npm audit --audit-level=high`; every PR |
| Container CVE scanning | Trivy + `.trivyignore` with expiry enforcement |
| Image provenance | cosign keyless signing (Sigstore) on every image push |
| SBOM | SPDX-JSON via syft; attached as `cosign attest`; 365-day retention |
| License gate | pip-licenses + license-checker-rseidelsohn; GPL/AGPL blocks merge |
| Base image currency | Renovate docker-digest manager; weekly PRs; auto-merged on CI pass |
| Dependency currency | Dependabot (GitHub Advisory integration) for Python/Node versions |
| CI pipeline integrity | All actions SHA-pinned; lint check rejects `@vN` references |
| Secrets detection | detect-secrets (entropy + regex) primary; git-secrets secondary; baseline currency check in CI |
| PyPI index trust | Private proxy (Phase 2+); `spacecom-*` namespace stubs on public PyPI |
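The SHA-pinning, SBOM, and signing controls above combine in the build-and-push job roughly as follows. This is an illustrative sketch, not the actual SpaceCom workflow: `<full-commit-sha>` is a placeholder (the plan's own convention), and `IMAGE`/`DIGEST` are assumed to be set by earlier build steps.

```yaml
# Illustrative build-and-push fragment; flags reflect current syft/cosign CLIs.
permissions:
  contents: read
  id-token: write        # required for cosign keyless (Sigstore) signing
  packages: write
steps:
  - uses: actions/checkout@<full-commit-sha>   # vX.Y.Z -- never a mutable @vN tag
  # ... build and push, capturing the pushed image digest into DIGEST ...
  - name: Generate SBOM (SPDX-JSON)
    run: syft "${IMAGE}@${DIGEST}" -o spdx-json=sbom.spdx.json
  - name: Sign image (keyless)
    run: cosign sign --yes "${IMAGE}@${DIGEST}"
  - name: Attach SBOM attestation
    run: cosign attest --yes --type spdxjson --predicate sbom.spdx.json "${IMAGE}@${DIGEST}"
```

Signing the digest (not a tag) is what makes the provenance claim immutable; deploy jobs would then run `cosign verify` against the same digest.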

### 47.3 New ADR Required

| ADR | Title | Decision |
|---|---|---|
| `docs/adr/0019-pypi-index-trust.md` | PyPI Index Trust Policy | Private proxy for Phase 2+; public PyPI namespace reservation for `spacecom-*` packages in Phase 1 |

### 47.4 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `pip wheel -r requirements.txt` without `--require-hashes` | `pip wheel --require-hashes -r requirements.txt` |
| `uses: actions/checkout@v4` in any workflow file | `uses: actions/checkout@<full-commit-sha> # vX.Y.Z` |
| `detect-secrets scan > .secrets.baseline` | `detect-secrets scan --baseline .secrets.baseline --update` |
| OWASP Dependency-Check as Python CVE scanner | `pip-audit --requirement requirements.txt` |
| Trivy gate with no `.trivyignore` | `.trivyignore` with documented expiry dates + CI expiry check |
| Manual CesiumJS licence check at Phase 1 only | `license-checker-rseidelsohn --failOn "GPL;AGPL"` on every PR (CesiumJS exempted by name) |
| cosign mentioned in decision log but absent from CI | `cosign sign` + `cosign attest` in build-and-push job; `cosign verify` in deploy jobs |
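The `.trivyignore` expiry check can be a short CI script. The inline `# expires: YYYY-MM-DD` comment convention below is an assumption for illustration; the plan only requires that every suppression carry a documented expiry date.

```python
# Sketch of a CI expiry check for .trivyignore entries. The "expires:"
# comment format is assumed, not specified by the plan.
import datetime
import re

ENTRY = re.compile(
    r"^(?P<cve>CVE-\d{4}-\d+)\s*#\s*expires:\s*(?P<date>\d{4}-\d{2}-\d{2})")


def expired_entries(text: str, today: datetime.date) -> list[str]:
    """Return IDs whose suppression has lapsed, or carries no expiry at all."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and pure comments
        m = ENTRY.match(line)
        if m is None:
            bad.append(line.split()[0])  # no expiry comment: fail closed
            continue
        if datetime.date.fromisoformat(m.group("date")) < today:
            bad.append(m.group("cve"))
    return bad
```

Run against the repository file in CI, any non-empty result fails the build, which forces suppressions to be revisited instead of rotting.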

### 47.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python CVE scanning | pip-audit (PyPADB) | OWASP Dependency-Check | OWASP DC CPE mapping generates false positives for Python; pip-audit queries the Python-native advisory database with near-zero false positives |
| Image signing | cosign keyless (Sigstore) | Long-lived signing key | Keyless signing uses ephemeral OIDC-bound keys; no key-management overhead; verifiable against the GitHub Actions OIDC issuer |
| SBOM format | SPDX 2.3 JSON (`spdx-json`) | CycloneDX 1.5 | SPDX is the ECSS/ESA-preferred format; both are equivalent for compliance purposes; SPDX has wider tooling support in the aerospace sector |
| Base image update automation | Renovate docker-digest | Manual digest updates | Manual digest updates are always deferred; Renovate auto-merge on passing CI achieves zero-latency security-patch application for base-image OS updates |
| GitHub Actions pinning | Commit SHA with tag comment | Dependabot auto-bump of `@vN` | Tag references are mutable; SHA pins are immutable; the Renovate github-actions manager keeps SHAs current automatically |
| PyPI trust (Phase 1) | Namespace reservation on public PyPI | Private proxy | A private proxy requires infrastructure investment not available in Phase 1; namespace-squatting prevention provides meaningful protection at zero cost |

## §48 Human Factors Engineering — Specialist Review

Hat: Human Factors Engineering

Standards basis: ECSS-E-ST-10-12C (Space engineering — Human factors), CAP 1264 (Alarm management for safety-related ATC systems), EASA GM1 ATCO.B.001(d) (Competency-based training — decision making under uncertainty), Endsley (1995) Situation Awareness taxonomy, Parasuraman & Riley (1997) automation trust calibration

Review scope: §28 Human Factors Framework, §6 UI/UX Feature Specifications, §26 Infrastructure (alert delivery), §31 Data Pipeline (data freshness / degraded state)


### 48.1 Findings

Finding 1 — SA timing targets absent: §28.1 contained no quantitative time-to-comprehension targets. Situation Awareness without measurable timing criteria cannot be validated against ECSS-E-ST-10-12C Part 6.4 or used as pass/fail criteria in usability testing. Fix applied (§28.1): SA Level 1 ≤ 5s (icon/colour/position); SA Level 2 ≤ 15s (FIR intersection + sector); SA Level 3 ≤ 30s (corridor expanding/contracting). Targets designated as Phase 2 usability test pass/fail criteria.

Finding 2 — Forced-text acknowledgement minimum causes compliance noise: The 10-character minimum on alert acknowledgement text is a common anti-pattern. Under time pressure, operators produce 1234567890 or similar, which is audit record pollution rather than evidence of cognitive engagement. Fix applied (§28.5): Replaced with ACKNOWLEDGEMENT_CATEGORIES (6 structured options). Free text is optional except when OTHER is selected. Category selection satisfies audit requirements with less operator burden.

Finding 3 — No keyboard-completable acknowledgement path: ANSP ops room staff routinely hold a radio PTT with one hand. A mouse-dependent acknowledgement dialog is inaccessible in that context and constitutes a HF design failure. Fix applied (§28.5): Alt+A → Enter → Enter three-keystroke path from any application state. Documented for operator quick-reference card; included in Phase 2 usability test scenario.

Finding 4 — No startle-response mitigation: Sudden full-screen CRITICAL banners produce a documented ~5-second degraded cognitive performance window (startle effect, Staal 2004). The existing design transitions directly to full-screen without priming. Fix applied (§28.3): Three-rule mitigation: (1) progressive escalation — CRITICAL full-screen only after ≥ 1 minute in HIGH state (except impact_time_minutes < 30); (2) audio precedes visual by 500ms; (3) banner is dimmed overlay over corridor map, not a replacement.
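Rule (1) of the mitigation above reduces to a small predicate. A minimal sketch, with an assumed function name and state representation; the thresholds (60 s in HIGH, 30-minute impact override) are the ones specified.

```python
# Hedged sketch of startle-mitigation rule (1): progressive escalation to
# full-screen CRITICAL. The function name and inputs are illustrative.
AUDIO_LEAD_MS = 500  # rule (2): audio precedes the visual banner by 500 ms


def may_go_fullscreen_critical(seconds_in_high: float,
                               impact_time_minutes: float) -> bool:
    """True if the full-screen CRITICAL banner may be shown now."""
    if impact_time_minutes < 30:
        return True  # imminent impact overrides progressive escalation
    return seconds_in_high >= 60.0  # otherwise require >= 1 min in HIGH first
```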

Finding 5 — No shift handover specification: Handover is the highest-risk transition in continuous operations. Loss of situational awareness at shift change is a documented contributing factor in ATC incidents. No handover mechanism existed. Fix applied (§28.5a): Dedicated /handover view; shift_handovers table with outgoing_user, incoming_user, notes, active_alerts snapshot, open_coord_threads snapshot; immutable audit record; CRITICAL-during-handover flag on notifications.

Finding 6 — Alarm rationalisation procedure absent: Alarm systems without formal rationalisation procedures inevitably drift toward nuisance alarm rates that exceed operator tolerance. The existing quarterly review target (< 1 LOW/10 min/user) had no enforcement mechanism. Fix applied (§28.3): Quarterly rationalisation procedure with alarm_threshold_audit table; 90% MONITORING acknowledgement rate as nuisance alarm trigger; mandatory 7-day confirmation for threshold changes; 12-month no-escalation review for alert categories.

Finding 7 — Comprehension test items not specified: §28.7 stated "usability test" without scripted probabilistic comprehension items. Generic usability tests are insensitive to the specific calibration failures relevant to probabilistic re-entry data (false precision, space/aviation risk threshold conflation, uncertainty update misattribution). Fix applied (§28.7): Four scripted comprehension items with correct answer, common wrong answer, and failure mode each item detects. Pass criterion: ≥ 80% correct per item across the test cohort.

Finding 8 — No habituation countermeasures: Repeated identical stimuli (identical alarm sound, identical banner appearance) produce habituation — reduced physiological and attentional response over weeks of exposure. No design provisions existed. Fix applied (§28.3): Pseudo-random alternation of two-tone audio pattern; 1 Hz colour cycling on CRITICAL banner between two dark-amber shades; per-operator habituation metric (≥ 20 same-type acknowledgements in 30 days without escalation triggers supervisor review).
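The per-operator habituation metric can be sketched as a windowed count. The data shapes and names below are illustrative assumptions; the thresholds (20 acknowledgements, 30 days) are the ones specified.

```python
# Sketch of the habituation metric: >= 20 same-type acknowledgements in 30
# days with no escalation triggers supervisor review. Shapes are assumed.
import datetime
from collections import Counter

WINDOW_DAYS = 30
THRESHOLD = 20


def needs_supervisor_review(acks: list[dict],
                            now: datetime.datetime) -> set[str]:
    """acks: [{'type': str, 'at': datetime, 'escalated': bool}, ...]"""
    cutoff = now - datetime.timedelta(days=WINDOW_DAYS)
    recent = [a for a in acks if a["at"] >= cutoff]
    counts = Counter(a["type"] for a in recent if not a["escalated"])
    escalated_types = {a["type"] for a in recent if a["escalated"]}
    # Flag alert types the operator keeps acknowledging without ever escalating.
    return {t for t, n in counts.items()
            if n >= THRESHOLD and t not in escalated_types}
```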

Finding 9 — "Response Options" label creates legal ambiguity: The label "Response Options" implies these are prescribed choices. In a regulatory investigation following an incident, checked items could be interpreted as evidence of a standard procedure that was or was not followed. Fix applied (§28.6): Feature renamed to "Decision Prompts" throughout. Non-waivable legal disclaimer added below accordion header. Disclaimer included in printed/exported Event Detail report and in API response legal_notice field.

Finding 10 — No attention management specification: SpaceCom exists in an environment (ops room) with very high ambient interruption rates. Without explicit constraints on unsolicited notification rate, SpaceCom becomes an additional fragmentation source — the documented cause of error in multiple ATC incident analyses. Fix applied (§28.6): Three-tier rate limit: ≤ 1/10 min in steady state; ≤ 1/60s for same-event updates during active incident; zero during critical flow (acknowledgement dialog or handover screen). Queued notifications delivered as batch on critical-flow exit.

Finding 11 — Degraded-data states not differentiated for operators: Three meaningfully different system states (healthy, degraded, failed) were visually undifferentiated in the previous design. Operators cannot distinguish between data they should trust, trust with margin, or not trust at all. Fix applied (§28.8): Graded visual degradation language table (5 amber/red states with exact badge text and required operator response); multiple-amber consolidation rule; GET /readyz machine-readable staleness flags for ANSP monitoring integration; system_health_events audit table.


### 48.2 Files / Sections Modified

| Section | Change |
|---|---|
| §28.1 Situation Awareness Design Requirements | Added SA level timing targets as pass/fail usability criteria |
| §28.3 Alarm Management | Added startle-response mitigation (3 rules), alarm rationalisation procedure, habituation countermeasures |
| §28.5 Error Recovery and Irreversible Actions | Replaced 10-char text minimum with `ACKNOWLEDGEMENT_CATEGORIES`; added Alt+A → Enter → Enter keyboard path |
| §28.5a Shift Handover (new section) | Handover screen spec; `shift_handovers` table schema; integrity rules; handover-window CRITICAL flag |
| §28.6 Cognitive Load Reduction | Renamed Response Options → Decision Prompts; added legal disclaimer; added attention management rate limits |
| §28.7 HF Validation Approach | Added 4 scripted probabilistic comprehension test items with pass criterion |
| §28.8 Degraded-Data Human Factors (new section) | Graded degradation language; 5-state indicator table; multiple-amber consolidation; `GET /readyz` integration |

### 48.3 New Tables / Schema Changes

| Table | Purpose |
|---|---|
| `shift_handovers` | Immutable record of shift handovers with alert and coordination-thread snapshots |
| `alarm_threshold_audit` | Immutable record of alarm threshold changes with reviewer and rationale |
| `system_health_events` | Time-series log of degraded-data state transitions for operational reporting |
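A possible PostgreSQL shape for the `shift_handovers` record is sketched below. Only the columns named in Finding 5 come from the plan; the surrogate key, timestamp, and the `users` table reference are assumptions. Immutability would be enforced in production by revoking UPDATE/DELETE and adding a guard trigger.

```sql
-- Illustrative DDL sketch only; not the authoritative schema.
CREATE TABLE shift_handovers (
    id                 BIGSERIAL PRIMARY KEY,
    outgoing_user      BIGINT      NOT NULL REFERENCES users (id),
    incoming_user      BIGINT      NOT NULL REFERENCES users (id),
    notes              TEXT,
    active_alerts      JSONB       NOT NULL,  -- snapshot taken at handover
    open_coord_threads JSONB       NOT NULL,  -- snapshot taken at handover
    created_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);
```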

### 48.4 New ADRs Required

| ADR | Title | Decision |
|---|---|---|
| `docs/adr/0020-acknowledgement-categories.md` | Alert Acknowledgement Design | Structured category selection replaces free-text minimum; OTHER requires text; 6 categories cover all anticipated operational responses |
| `docs/adr/0021-decision-prompts-legal.md` | Decision Prompts Legal Treatment | Feature renamed from Response Options; non-waivable disclaimer required; legal rationale documented for future regulatory inquiries |

### 48.5 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| Full-screen CRITICAL banner without progressive escalation | Progressive escalation: ≥ 1 min in HIGH state before CRITICAL full-screen (except impact_time < 30 min) |
| Audio and visual CRITICAL alert fired simultaneously | Audio fires 500 ms before visual banner render |
| Alert acknowledgement with free-text character minimum | `ACKNOWLEDGEMENT_CATEGORIES` structured selection; free text only when OTHER selected |
| "Response Options" label anywhere in UI, API, or docs | "Decision Prompts" throughout; legal disclaimer present |
| Comprehension test without scripted probabilistic items | Use the 4 scripted items in §28.7; measure per-item accuracy against the 80% pass threshold |
| Degraded data shown with same visual weight as fresh data | Use exact badge text from §28.8; amber for stale, red for expired/unusable |

### 48.6 Decision Log

| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Acknowledgement mechanism | Structured categories | Free-text minimum | Research shows forced-text minimums produce compliance noise, not evidence; structured categories produce lower operator burden with higher audit utility |
| CRITICAL escalation model | Progressive (HIGH → CRITICAL) | Immediate full-screen | The startle effect causes ~5 s of cognitive degradation; progressive escalation eliminates cold-start startle while preserving urgency |
| Audio timing | 500 ms pre-visual | Simultaneous | A pre-auditory alert primes the attentional orienting response and eliminates visual startle; 500 ms is within the ICAO-recommended alerting lead-time range |
| Shift handover | System-managed `/handover` view | Out-of-band process | Out-of-band handovers leave no audit trail and are not integrated with active alert state; a system-managed handover provides an immutable record and SA transfer assurance |
| Decision Prompts legal treatment | Non-waivable hard-coded disclaimer | Configurable disclaimer or none | A configurable disclaimer creates discovery risk (it could be disabled); absence of a disclaimer creates precedent risk; a hard-coded disclaimer is the only legally safe option |

## §49 Legal / Regulatory Compliance — Specialist Review

Standards basis: GDPR (Regulation 2016/679), UK GDPR, ePrivacy Directive, Export Administration Regulations (EAR), ITAR (22 CFR 120–130), ESA Procurement Rules, EUMETSAT Data Policy, Space Debris Mitigation Guidelines (IADC/ISO 24113), Chicago Convention Article 28, EU AI Act (Regulation 2024/1689), NIS2 Directive (2022/2555)

Review scope: Data handling, user consent, liability framing, export control, third-party data licensing, AI Act obligations, operator accountability chain, record retention, cross-border transfer, regulatory correspondence readiness


### 49.1 Findings and Fixes Applied

F1 — No GDPR lawful basis documented per processing activity. Fix applied (§29.1): RoPA requirement formalised. `legal/ROPA.md` designated as the authoritative document. Data inventory table extended to include all processing activities with lawful basis, retention period, and table reference. `shift_handovers` and `alarm_threshold_audit` added as processing activities. Annual DPO sign-off required. DPIA trigger documented.

F2 — No DPIA for conjunction alert delivery. Fix applied (§29.1): DPIA trigger documented — conjunction alert delivery constitutes systematic monitoring under GDPR Art. 35(3)(b). DPIA required before production deployment; template designated as `legal/DPIA_conjunction_alerts.md`.

F3 — TLE / space weather data redistribution may breach upstream licence. Fix applied (§24.2): `space_track_registered` boolean column added to the `organisations` table. API middleware gate blocks TLE-derived fields for non-registered orgs. `data_disclosure_log` table added for the licence audit trail. EU-SST data gated separately behind the `itar_cleared` flag.
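A minimal sketch of that middleware gate, assuming a dict-shaped payload: the field list and the logging hook are illustrative assumptions, not the production middleware.

```python
# Hedged sketch of the §24.2 redistribution gate: strip TLE-derived fields
# unless the organisation is Space-Track registered, and record every
# disclosure. TLE_DERIVED_FIELDS is an assumed, non-exhaustive list.
TLE_DERIVED_FIELDS = {"tle_line1", "tle_line2", "epoch", "mean_motion"}


def gate_tle_fields(payload: dict, org: dict, disclosure_log: list) -> dict:
    """Return the payload the org may see; log disclosures for audit."""
    if not org.get("space_track_registered", False):
        # Non-registered org: remove TLE-derived fields entirely.
        return {k: v for k, v in payload.items()
                if k not in TLE_DERIVED_FIELDS}
    # Registered org: serve in full, but append to the disclosure trail
    # (production: an INSERT into data_disclosure_log).
    disclosure_log.append({
        "org_id": org["id"],
        "fields": sorted(TLE_DERIVED_FIELDS & payload.keys()),
    })
    return payload
```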

F4 — No export control screening at registration. Fix applied (§24.2): `country_of_incorporation`, `export_control_screened_at`, `export_control_cleared`, and `itar_cleared` columns added to the `organisations` table. Onboarding flow screens against embargoed countries (ISO 3166-1 alpha-2) and the BIS Entity List. EU-SST-derived data gated behind `itar_cleared`. Documented in `legal/EXPORT_CONTROL_POLICY.md`.

F5 — Liability disclaimer in Decision Prompts insufficient as standalone protection. Fix applied (§28.6): Note added that the in-UI disclaimer is a reinforcing reminder only. Substantive liability limitation (consequential loss excluded; aggregate cap = 12 months' fees) must appear in the executed MSA (§24.2). UCTA 1977 and EU Unfair Contract Terms Directive requirements noted.

F6 — No retention / deletion schedule; erasure requests unhandled for new tables. Fix applied (§29.1, §29.3): `shift_handovers` and `alarm_threshold_audit` added to the RoPA with 7-year retention (safety-record basis). Pseudonymisation procedure in §29.3 extended to cover `shift_handovers` — user ID columns nulled, notes prefixed with pseudonym on erasure request.

F7 — Cross-border data transfer mechanism not formally documented. Fix applied (§29.5): `legal/DATA_RESIDENCY.md` designated as the authoritative sub-processor list with hosting provider, region, and SCC/IDTA status. Annual DPO review and customer notification on material sub-processor change formalised.

F8 — EU AI Act obligations not assessed. Fix applied (§24.10): New section added. Conjunction probability model classified as high-risk AI under EU AI Act Annex III (transport infrastructure safety). Eight high-risk obligations mapped (risk management, data governance, technical documentation, logging, transparency, human oversight, accuracy/robustness, conformity assessment). Human oversight statement added as a mandatory non-configurable UI element in the §19.4 conjunction probability display. EU database registration (Art. 51) added as a Phase 3 gate. `legal/EU_AI_ACT_ASSESSMENT.md` designated as the authoritative document.

F9 — No regulatory correspondence register. Fix applied (§24.11): New section added. `legal/REGULATORY_CORRESPONDENCE_LOG.md` designated as the structured register. SLAs: 2-business-day acknowledgement, 14-calendar-day response. Quarterly steering review of outstanding correspondence. Proactive engagement triggered by ≥ 3 queries from the same authority in 12 months.

F10 — Cookie / tracking consent mechanism not specified. Fix applied (§29.7): New section added. Cookie audit table defined (strictly necessary / functional / analytics). `HttpOnly; Secure; SameSite=Strict` formalised as required security attributes. Consent banner specification: three tiers; preference stored in localStorage (not a cookie); re-requested on material category changes. `legal/COOKIE_POLICY.md` designated as the authoritative document.

F11 — Incident notification obligations not mapped to regulatory timelines. Fix applied (§29.6): NIS2 Art. 23 obligations added alongside GDPR Art. 33. Early-warning deadline: 24 hours from awareness (NIS2) vs. 72 hours (GDPR). Full NIS2 notification: 72 hours. Final report: 1 month. On-call escalation to the DPO within the 24-hour window documented. `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` designated as the authoritative template document.


### 49.2 Sections Modified

| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Added Space-Track redistribution gate (`space_track_registered`), `data_disclosure_log` table, export control screening columns and onboarding flow |
| §24.10 (new) EU AI Act Obligations | Full high-risk AI obligation mapping; human oversight statement; conformity assessment and registration roadmap |
| §24.11 (new) Regulatory Correspondence Register | Structured log specification; SLAs; escalation trigger |
| §28.6 Cognitive Load Reduction | Added legal sufficiency note on Decision Prompts disclaimer; MSA cross-reference |
| §29.1 Data Inventory | Formalised as GDPR Art. 30 RoPA; added `shift_handovers`, `alarm_threshold_audit`, `data_disclosure_log` entries; DPIA trigger documented |
| §29.3 Erasure vs. Retention Conflict | Extended pseudonymisation procedure to cover `shift_handovers` |
| §29.5 Cross-Border Data Transfer Safeguards | Added `legal/DATA_RESIDENCY.md` as authoritative document with annual review requirement |
| §29.6 Security Breach Notification | Expanded to full NIS2 Art. 23 obligations table; multi-framework notification timeline |
| §29.7 (new) Cookie / Tracking Consent | Cookie audit table; `HttpOnly; Secure; SameSite=Strict` formalised; consent banner specification |

### 49.3 New Tables and Columns

| Table / Column | Purpose |
|---|---|
| `data_disclosure_log` | Immutable record of every TLE-derived data disclosure per organisation; supports Space-Track licence audit |
| `organisations.space_track_registered` | Gate controlling access to TLE-derived API fields |
| `organisations.country_of_incorporation` | Feeds export control screening at onboarding |
| `organisations.export_control_cleared` | Records completion of export control screening |
| `organisations.itar_cleared` | Gates EU-SST-derived data to cleared entities only |

### 49.4 New Documents

| Document | Purpose |
|---|---|
| `legal/ROPA.md` | GDPR Art. 30 Record of Processing Activities — authoritative version |
| `legal/DPIA_conjunction_alerts.md` | Data Protection Impact Assessment for conjunction alert delivery |
| `legal/EXPORT_CONTROL_POLICY.md` | Export control screening procedure and embargoed-country list |
| `legal/DATA_RESIDENCY.md` | Sub-processor list with hosting regions and SCC/IDTA status |
| `legal/EU_AI_ACT_ASSESSMENT.md` | High-risk AI classification; obligation mapping; conformity assessment |
| `legal/REGULATORY_CORRESPONDENCE_LOG.md` | Structured register of regulatory correspondence |
| `legal/COOKIE_POLICY.md` | Cookie audit and consent policy |
| `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` | Multi-framework notification timelines and templates |

### 49.5 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| In-UI disclaimer as sole liability protection | Substantive liability cap in executed MSA; UI disclaimer is reinforcement only |
| Serving TLE-derived data without licence verification | Gate behind `space_track_registered`; log all disclosures |
| Registering users without country-of-incorporation check | Collect at onboarding; screen against embargoed countries and the BIS Entity List before account activation |
| Treating the GDPR 72-hour obligation as the only notification deadline | NIS2 requires a 24-hour early warning for significant incidents; both timelines must be tracked simultaneously |
| Storing consent preference in a cookie | Self-defeating; use localStorage with no expiry |
| Self-classifying the conjunction model as low-risk AI | Transport infrastructure safety = Annex III high-risk; full obligations apply regardless of system size |

### 49.6 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| RoPA location | `legal/ROPA.md` (authoritative) + §29.1 mirror | MASTER_PLAN only | Regulatory auditors expect a standalone document; the MASTER_PLAN mirror keeps engineers informed |
| Space-Track gate mechanism | Per-org boolean + middleware check | Per-request licence verification | Per-request verification against the Space-Track API would add latency and a hard dependency; the boolean flag is updated at onboarding and reviewed quarterly |
| EU AI Act classification | High-risk (Annex III, transport safety) | Low-risk / unclassified | The conjunction model informs time-critical airspace decisions; conservative classification is the legally safe position; reclassification requires a legal opinion |
| Cookie consent storage | localStorage | Session cookie | Storing consent in a cookie creates a circular dependency (consent is needed to set the cookie that stores consent); localStorage avoids this without additional server round-trips |
| NIS2 applicability | Treat SpaceCom as an essential entity (space traffic management) | Treat as non-essential until formally classified | Early compliance avoids a reclassification scramble; ENISA guidance indicates space infrastructure operators are likely Annex I essential entities |

## §50 Accessibility Engineering — Specialist Review

Standards basis: WCAG 2.1 Level AA (ISO/IEC 40500:2012), WAI-ARIA 1.2, EN 301 549 v3.2.1, Section 508, APCA contrast algorithm, ATAG 2.0

Review scope: Keyboard navigation, screen reader compatibility, colour contrast, motion/animation, focus management, dynamic content announcements, form accessibility, alert/modal accessibility, time-limited interactions, ARIA live regions


### 50.1 Findings and Fixes Applied

F1 — No accessibility standard committed; EN 301 549 is mandatory for ESA procurement. Fix applied (§13.0, §25.6): WCAG 2.1 AA committed as the minimum standard in new §13.0. Definition of done updated: all PRs must pass axe-core wcag2a/aa before merge. ACR/VPAT 2.4 added to the §25.6 ESA procurement artefacts table as a required Phase 2 deliverable.

F2 — CRITICAL alert overlay inaccessible to screen reader and keyboard users. Fix applied (§28.3): Full ARIA alertdialog spec added: `role="alertdialog"`, `aria-modal="true"`, programmatic `focus()` on render, `aria-hidden="true"` on the map container, `aria-live="assertive"` announcement region, visible text status indicator for deaf operators, Escape key handling per severity level.

F3 — Structured acknowledgement form has no accessible labels. Fix applied (§28.5): Native `<input type="radio">` with `<label for="...">`, `<fieldset>` + `<legend>`, `aria-keyshortcuts` on the trigger, visible keyboard shortcut legend inside the dialog, `aria-required` on the free-text field when OTHER is selected, `aria-live="polite"` confirmation on submit.

F4 — CesiumJS globe inaccessible; no keyboard/screen reader equivalent. Fix applied (§13.2): New §13.2 specifies `ObjectTableView.tsx` as a parallel accessible table view. Accessible via Alt+T and a persistent visible button. All alert interactions completable from the table view alone. Implemented with native `<table>` elements; `aria-sort`, `aria-rowcount`, `aria-rowindex` for virtual scroll.

F5 — Colour is the sole differentiator for alert severity. Fix applied (§13.4): Non-colour severity indicators specified in §13.4: per-severity icon/shape (octagon/triangle/circle/circle-outline), text labels always visible, distinct border widths. The 1 Hz colour cycle also has a 1 Hz border-width pulse as a redundant indicator.

F6 — No keyboard navigation spec for the primary operator workflow. Fix applied (§13.3): New §13.3 specifies skip links, focus ring (3 px, ≥ 3:1 contrast, `--focus-ring` token), tab order rules (no `tabindex` > 0), a full application keyboard shortcut table (Alt+A/T/H/N, ?, Escape, arrow keys), `aria-keyshortcuts` on all trigger elements, conflict-free shortcut design.

F7 — Colour contrast ratios not specified. Fix applied (§13.4): Verified contrast table for all operational severity colours on the dark theme `#1A1A2E`. All pairs meet ≥ 4.5:1 (AA). Design token file `frontend/src/tokens/colours.ts` designated as authoritative; no hardcoded colour values in component files.

F8 — Session timeout risk during shift handover. Fix applied (§28.5a): WCAG 2.2.1 (Timing Adjustable) compliance spec added. T−2-minute warning dialog with `aria-live="polite"` announcement. Auto-extension (30 min, once per session) when the `/handover` view is active. `POST /api/v1/auth/extend-session` endpoint specified. Extension logged in `security_logs` as `SESSION_AUTO_EXTENDED_HANDOVER`.

F9 — Decision Prompts accordion not keyboard-operable or screen-reader-friendly. Fix applied (§28.6): Full WAI-ARIA Accordion pattern specified: `aria-expanded`, `aria-controls`, `role="region"`, `aria-labelledby`, native checkbox inputs with labels, arrow-key navigation, `aria-live="polite"` confirmation on checkbox state change.

F10 — No reduced-motion support. Fix applied (§28.3): `prefers-reduced-motion: reduce` CSS implementation specified for the CRITICAL banner colour cycle (a static thick border replaces the animation). CesiumJS corridor animation: JS `matchMedia` check on mount; particle animation disabled; static opacity when reduced motion is preferred. Listener on the `change` event for live preference updates without page reload.
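The banner half of that fix can be expressed directly in CSS. A minimal sketch; the class name, keyframe name, and custom property below are assumptions, not the production stylesheet.

```css
/* Illustrative prefers-reduced-motion override for the CRITICAL banner's
   1 Hz colour cycle: replace the animation with a static thick border. */
.critical-banner {
  animation: critical-cycle 1s steps(2) infinite;
}

@media (prefers-reduced-motion: reduce) {
  .critical-banner {
    animation: none;                              /* no colour cycling */
    border: 6px solid var(--severity-critical);   /* static replacement */
  }
}
```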

F11 — No accessibility testing in CI. Fix applied (§42.2, §42.5): `e2e/test_accessibility.ts` added using `@axe-core/playwright`. Scans 5 primary views. wcag2a + wcag2aa violations block the PR; wcag2aaa produces warnings only. Results published as CI artefact `a11y-report.html`. Manual screen reader test (NVDA+Firefox, VoiceOver+Safari) added to the release checklist. Decision log entry added in §42.5.


### 50.2 Sections Modified

| Section | Change |
|---|---|
| §13.0 (new) Accessibility Standard Commitment | WCAG 2.1 AA minimum standard; EN 301 549 mandatory for ESA; ACR/VPAT as Phase 2 deliverable; definition of done |
| §13.2 (new) Accessible Parallel Table View | `ObjectTableView.tsx` spec; keyboard trigger; native table markup; virtual scroll ARIA attributes |
| §13.3 (new) Keyboard Navigation Specification | Skip links; focus ring token; tab order rules; full shortcut table; `aria-keyshortcuts` convention |
| §13.4 (new) Colour and Contrast Specification | Verified contrast table; design token file; non-colour severity indicators (icons, text labels, border widths) |
| §25.6 Required ESA Procurement Artefacts | ACR/VPAT 2.4 added to artefacts table |
| §28.3 Alarm Management | CRITICAL alert ARIA spec; reduced-motion CSS spec |
| §28.5 Error Recovery | Acknowledgement form accessibility: native inputs, fieldset/legend, `aria-keyshortcuts`, confirmation announcement |
| §28.5a Shift Handover | Session timeout accessibility: T−2-min warning, auto-extension during handover, extend-session endpoint |
| §28.6 Cognitive Load Reduction | Decision Prompts ARIA Accordion pattern spec |
| §42.2 Test Suite Inventory | `test_accessibility.ts` added to e2e suite |
| §42.3 (renamed from §42.2) | axe-core implementation spec with code example; manual screen reader test checklist |
| §42.5 Decision Log | Accessibility CI gate decision added |

### 50.3 New Components

| Component / File | Purpose |
|---|---|
| `src/components/globe/ObjectTableView.tsx` | Accessible parallel table view for all globe objects |
| `frontend/src/tokens/colours.ts` | Design token file for all operational colours; authoritative contrast reference |
| `e2e/test_accessibility.ts` | `@axe-core/playwright` scans blocking PRs on WCAG 2.1 AA violations |
| `docs/RELEASE_CHECKLIST.md` | Manual screen reader test steps; keyboard-only workflow test |

### 50.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| `aria-label` on a `<div>` when a native `<button>` would do | Always prefer native HTML semantics; ARIA substitutes only when no native element exists |
| `outline: none` without a custom focus indicator | Never suppress the focus ring without providing an equivalent; use the `--focus-ring` token |
| `tabindex="2"` or any positive tabindex | Never; positive tabindex breaks natural reading order and confuses screen readers |
| Colour-only severity communication | Always pair colour with shape, text label, and border width as redundant indicators |
| Inline `aria-live="assertive"` for non-emergency announcements | `assertive` interrupts immediately; use `polite` for non-CRITICAL confirmations, `assertive` only for CRITICAL alerts |
| Session timeout that cannot be extended | WCAG 2.2.1 requires the user be able to extend or disable timing; auto-extend during safety-critical views is the correct pattern |

### 50.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Globe accessibility approach | Parallel accessible table view | Making CesiumJS accessible directly | A WebGL canvas cannot be made screen-reader accessible; a parallel data view is the only WCAG-conformant approach for complex visualisations |
| Focus ring specification | 3 px solid `#4A9FFF`, design token | Browser default outline | The browser default fails contrast requirements on dark themes; a design token ensures consistency and testability |
| axe-core CI level | wcag2a + wcag2aa block; wcag2aaa warn | All levels block, or all levels warn | All-block creates false positives (AAA is aspirational); all-warn provides no enforcement; AA is the legal and contractual minimum |
| Reduced motion: animation vs. static | Static thick border when `prefers-reduced-motion: reduce` | Slow down the animation | Slowed animation still triggers vestibular symptoms; static replacement is the only fully safe approach |
| Session auto-extension scope | Only while `/handover` is active; once per session | For any active form | Broad auto-extension creates a security risk (indefinitely open sessions); limiting to the handover scope is the narrowest sufficient accommodation |

§52 Incident Response / Disaster Recovery Engineering — Specialist Review

Standards basis: NIST SP 800-61r2, ISO/IEC 27035, ISO 22301, ITIL 4, ICAO Doc 9859, AWS/GCP Well-Architected Framework (Reliability Pillar), Google SRE Book (Chapter 14)

Review scope: Incident classification, runbook completeness, escalation chains, RTO/RPO definition and achievability, backup and restore, chaos/game day testing, on-call rotation, post-incident review, DR site strategy, alert_events integrity


52.1 Findings and Fixes Applied

F1 — RTO and RPO targets not formally defined with derivation rationale Fix applied (§26.2): Table expanded with derivation column. RTO ≤ 15 min (active TIP event) derived from 4-hour CRITICAL rate-limit window. RTO ≤ 60 min (no active event) aligns with MSA SLA. RPO zero for safety-critical tables derived from UN Liability Convention evidentiary requirements. MSA sign-off requirement added — customers must agree RTO/RPO before production deployment.

F2 — No restore time target or WAL retention period Fix applied (§26.6): WAL retained 30 days; base backups 90 days; safety-critical tables in MinIO Object Lock COMPLIANCE mode for 7 years. Restore time target < 30 minutes documented. docs/runbooks/db-restore.md designated as Phase 2 deliverable.

F3 — No runbook for prediction service outage during active re-entry event Fix applied (§26.8): New runbook row added to the required runbooks table covering: detection → 5-minute ANSP notification → incident commander designation → 15-minute update cadence → restoration checklist → PIR trigger. Full procedure in docs/runbooks/prediction-service-outage-during-active-event.md.

F4 — No chaos engineering / game day programme Fix applied (§26.8): Quarterly game day programme specified. 6 scenarios defined with inject, expected behaviour, and pass criteria. Scenario fail treated as SEV-2 with PIR. docs/runbooks/game-day-scenarios.md designated.

F5 — On-call rotation underspecified Fix applied (§26.8): 7-day rotation, minimum 2-engineer pool. L1 → L2 escalation trigger: 30 minutes without containment. L2 → L3 triggers enumerated (ANSP data affected, security breach, total outage > 15 min, regulatory notification triggered). On-call handoff log specified mirroring operator /handover model.

F6 — No P1/P2/P3 severity communication commitments Fix applied (§26.8): ANSP notification commitments per SEV level added. SEV-1 active TIP event: push + email within 5 minutes, 15-minute cadence. SEV-1 no active event: email within 15 minutes. SEV-2: email within 30 minutes if predictions affected. SEV-3/4: status page only.

F7 — No DR site or failover architecture Fix applied (§26.3): Cross-region warm standby architecture added. DB replica promoted on failover; app tier deployed from pre-pulled container images; MinIO bucket replication active; DNS health-check-based routing (TTL 60s). Estimated failover time < 15 minutes. Annual game day test (scenario 6). docs/runbooks/region-failover.md designated.

F8 — No post-incident review process Fix applied (§26.8): Mandatory PIR for all SEV-1 and SEV-2. Due within 5 business days. 7-section structure: summary, timeline, 5-whys root cause, contributing factors, impact, remediation actions (GitHub issues, incident-remediation label), what went well. Presented at engineering all-hands. Remediations are P2 priority.

F9 — alert_events not HMAC-protected Fix applied (§7.9, alert_events schema): record_hmac TEXT NOT NULL column added. Signing function specified (id, object_id, org_id, level, trigger_type, created_at, acknowledged_by, action_taken). Nightly Celery Beat integrity check re-verifies all events from past 24 hours; HMAC failure raises CRITICAL security alert. Existing alert_events_immutable trigger already prevents modification.
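The signing input fields above can be sketched as a canonical HMAC computation. The field list is the one stated in the fix; the canonicalisation scheme (pipe-joined values, `None` serialised as an empty string) and the function names are assumptions:

```python
import hashlib
import hmac

# Field order fixed so the signature is reproducible; list taken from F9.
SIGNED_FIELDS = ("id", "object_id", "org_id", "level", "trigger_type",
                 "created_at", "acknowledged_by", "action_taken")

def sign_alert_event(event: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical, order-fixed field string.

    None serialises as "" so a later acknowledgement changes the input
    and invalidates a stale signature.
    """
    canonical = "|".join(str(event.get(f)) if event.get(f) is not None else ""
                         for f in SIGNED_FIELDS)
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_alert_event(event: dict, key: bytes, stored_hmac: str) -> bool:
    # compare_digest avoids timing side channels on the comparison.
    return hmac.compare_digest(sign_alert_event(event, key), stored_hmac)
```

The nightly Celery task would call `verify_alert_event` for each row from the past 24 hours and raise the CRITICAL security alert on any mismatch.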

F10 — No incident communication templates Fix applied (§26.8): docs/runbooks/incident-comms-templates.md designated with 4 templates (initial notification, 15-min update, resolution, post-incident summary). Legal counsel review required before first use. Templates specify what never to include (speculation, premature ETAs, admissions of liability).

F11 — Operational and security incidents not separated Fix applied (§26.8): Operational vs. security incident comparison table added. Separate runbooks designated: docs/runbooks/operational-incident-response.md and docs/runbooks/security-incident-response.md. Security incidents: no public status page until legal counsel approves; DPO within 4 hours; NIS2/GDPR timelines from §29.6.


52.2 Sections Modified

| Section | Change |
| --- | --- |
| §26.2 Recovery Objectives | Derivation rationale column; MSA sign-off requirement |
| §26.3 High Availability Architecture | Cross-region warm standby DR strategy; component failover table; estimated recovery time |
| §26.6 Backup and Restore | WAL retention 30 days; restore time target < 30 min; MinIO Object Lock for 7-year legal hold; docs/runbooks/db-restore.md |
| §26.8 Incident Response | Prediction-service-outage runbook; on-call rotation spec + handoff log; ANSP comms per severity; PIR process; game day programme; incident comms templates; operational/security split |
| §7.9 Data Integrity | alert_events HMAC signing function; nightly integrity check Celery task |
| alert_events schema | record_hmac TEXT NOT NULL column added |

52.3 New Runbooks Required (Phase 2 deliverables)

| Runbook | Trigger |
| --- | --- |
| docs/runbooks/db-restore.md | Monthly restore test failure; DR failover |
| docs/runbooks/prediction-service-outage-during-active-event.md | SEV-1 during active TIP event |
| docs/runbooks/region-failover.md | Cloud region failure; annual game day |
| docs/runbooks/game-day-scenarios.md | Quarterly game day reference |
| docs/runbooks/incident-comms-templates.md | All SEV-1/2 incidents |
| docs/runbooks/operational-incident-response.md | All operational incidents |
| docs/runbooks/security-incident-response.md | All security incidents |
| docs/runbooks/on-call-handoff-log.md | Weekly rotation boundary |
| docs/post-incident-reviews/ | All SEV-1/2 incidents (within 5 business days) |

52.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| RTO/RPO as aspirational targets without derivation | Derive from operational requirements; document the rationale; agree in the MSA |
| Single-region deployment with a 1-hour RTO target | Warm standby in a second region; < 15 min estimated failover |
| Conflating operational and security incident response | Separate runbooks; different escalation chains; different communication rules |
| Improvised ANSP communications under pressure | Pre-drafted, legal-reviewed templates; deviations require incident commander approval |
| PIR as optional / informal | Mandatory for SEV-1/2; structured format; remediation tracking; all-hands presentation |
| Game day as a one-time activity | Quarterly rotation; each scenario tested at least annually; failures treated as SEV-2 |

52.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| DR strategy | Warm standby (second region) | Cold standby or active-active | Cold standby: restore time too slow for the RTO; active-active: complexity and cost disproportionate to Phase 1 scale; warm standby meets the RTO at acceptable cost |
| alert_events HMAC | Nightly batch verification | Per-request verification | Per-request adds latency to the alert delivery path; nightly batch catches tampering within 24 hours — adequate for evidentiary purposes |
| PIR timing | 5 business days | 24 hours / 30 days | 24 hours is too fast for a full 5-whys analysis; 30 days allows recurrence before remediation; 5 days balances speed with quality |
| Game day cadence | Quarterly | Monthly / annually | Monthly creates operational fatigue; annually is too infrequent to maintain muscle memory; quarterly is standard SRE practice |
| On-call escalation trigger | 30 minutes without containment | 15 minutes / 60 minutes | 15 minutes is too aggressive for complex incidents; 60 minutes risks an SLO breach before L2 is engaged; 30 minutes matches the active TIP event RTO window |

§51 Internationalisation / Localisation Engineering — Specialist Review

Standards basis: Unicode CLDR 44, IETF BCP 47, ISO 8601, ICAO Annex 2 / Annex 15 / Doc 8400 (UTC mandate), POSIX locale model, W3C Internationalisation guidelines, ICU MessageFormat 2.0, EU Regulation 2018/1139 (EASA language requirements)

Review scope: Timezone handling, date/time display, number/unit formatting, string externalisation, RTL layout, language coverage, ICAO UTC compliance, API date formats, database timezone storage


51.1 Findings and Fixes Applied

F1 — Operational times must be UTC; no local timezone conversion in ops interface Fix applied (§13.0): Iron UTC rule documented. All Persona A/C views display UTC only, formatted HH:MMZ or DD MMM YYYY HH:MMZ. Z suffix always inline, never a tooltip. No timezone conversion widget in operational interface. Local time permitted only in non-operational admin views with explicit timezone label. API times always ISO 8601 UTC.

F2 — ORM may silently convert TIMESTAMPTZ to session timezone Fix applied (§7.9): SET TIME ZONE 'UTC' enforced on every connection via SQLAlchemy engine event listener. Blocking integration test test_timestamps_round_trip_as_utc added — asserts that a known UTC datetime survives a full ORM insert/read cycle without offset conversion.
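The connection-level enforcement described above can be sketched with SQLAlchemy's `connect` event. The helper names are illustrative; the listener fires once per new DBAPI connection, before the connection enters the pool:

```python
from sqlalchemy import create_engine, event
from sqlalchemy.engine import Engine

def set_session_utc(dbapi_connection, connection_record):
    """Pin the session timezone so TIMESTAMPTZ values round-trip as UTC."""
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SET TIME ZONE 'UTC'")
    finally:
        cursor.close()

def enforce_utc(engine: Engine) -> None:
    """Attach the listener at engine creation so no query can bypass it."""
    event.listen(engine, "connect", set_session_utc)
```

Because the statement runs on every new connection rather than in application code, individual queries cannot accidentally observe a non-UTC session timezone.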

F3 — Re-entry window displayed without explicit UTC label Fix applied (§28.4): Rule 1 of probabilistic communication to non-specialists updated — all absolute times rendered as HH:MMZ per ICAO Doc 8400 UTC-suffix convention. Z suffix always rendered inline; never hover-only.

F4 — Number formatting not locale-aware in non-operational views Fix applied (§13.4): formatOperationalNumber() (ICAO decimal point, invariant) and formatDisplayNumber(locale) (Intl.NumberFormat, locale-aware) helpers specified. Raw Number.toString() and n.toFixed() banned from JSX.

F5 — No string externalisation strategy; hardcoded strings block localisation Fix applied (§13.5): next-intl adopted. All user-facing strings in messages/en.json. Message ID convention defined. eslint-plugin-i18n-json enforcement. ICAO-fixed strings explicitly excluded and annotated // ICAO-FIXED: do not translate.

F6 — NOTAM draft output must be ICAO English regardless of UI locale Fix applied (§6.13): NOTAM template strings hardcoded ICAO English phraseology in backend/app/modules/notam/templates.py, annotated # ICAO-FIXED: do not translate. Excluded from next-intl extraction. Preview renders in monospace font with lang="en" attribute.

F7 — Slash-delimited dates are ambiguous in exports Fix applied (§6.12): DD MMM YYYY format mandated for all PDF reports, CSV exports, and display previews (e.g. 04 MAR 2026). Slash-delimited dates banned from all SpaceCom outputs. Times alongside dates use HH:MMZ. NOTAM internal YYMMDDHHMM fields displayed as DD MMM YYYY HH:MMZ in preview.
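A minimal formatting helper for the mandated form, assuming UTC-aware inputs; the function names are illustrative. A fixed month table is used because `strftime("%b")` is locale-dependent and would vary across deployment locales:

```python
from datetime import datetime, timezone

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def format_export_date(dt: datetime) -> str:
    """DD MMM YYYY, e.g. 04 MAR 2026 — locale-invariant by construction."""
    return f"{dt.day:02d} {MONTHS[dt.month - 1]} {dt.year}"

def format_export_datetime(dt: datetime) -> str:
    """Date plus HH:MMZ time, always converted to UTC first."""
    dt = dt.astimezone(timezone.utc)
    return f"{format_export_date(dt)} {dt.hour:02d}:{dt.minute:02d}Z"
```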

F8 — RTL layout not considered; directional CSS utilities used Fix applied (§13.5): CSS logical properties table specified (margin-inline-start etc. replacing ml-/mr-). <html dir="ltr"> hardcoded for Phase 1; becomes dir={locale.dir} when RTL locale added — no component changes required. docs/ADDING_A_LOCALE.md checklist includes RTL gate.

F9 — Altitude units inconsistent between aviation and space personas Fix applied (users table, §13.5): altitude_unit_preference column added to users table (ft default for ANSP operators, km for space operators). API transmits metres; display layer converts. Unit label always visible. FL notation shown in parentheses for ft context. User can override in account settings.

F10 — API date formats inconsistent (Unix timestamps vs. ISO 8601) Fix applied (§14 API Versioning Policy): ISO 8601 UTC (2026-03-22T14:00:00Z) mandated for all API date fields. OpenAPI format: date-time on all _at/_time fields. Blocking contract test asserts regex match. Pydantic json_encoders specified.
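A sketch of the serialisation contract: one helper that emits the single permitted form, plus the kind of regex the blocking contract test might assert (the exact pattern is an assumption). Rejecting naive datetimes makes a missing-timezone bug fail loudly rather than silently shift times:

```python
import re
from datetime import datetime, timezone

# Assumed contract pattern: second-precision ISO 8601 with a literal Z.
ISO_UTC_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def to_api_datetime(dt: datetime) -> str:
    """Serialise any aware datetime to the single permitted API form."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime reached the serialisation layer")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```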

F11 — Language coverage undefined; English-only now but architecture must support future localisation Fix applied (§13.5): English-only explicitly committed for Phase 1. next-intl architecture allows adding a locale by adding messages/{locale}.json only — no component changes. messages/fr.json and messages/de.json scaffolded at Phase 2/3 start. docs/ADDING_A_LOCALE.md checklist documented.


51.2 Sections Modified

| Section | Change |
| --- | --- |
| §6.12 Report Generation | DD MMM YYYY date format rule; slash-delimited dates banned |
| §6.13 NOTAM Drafting Workflow | ICAO-FIXED template rule; lang="en" on NOTAM container |
| §7.9 Data Integrity | SET TIME ZONE 'UTC' connection event listener; test_timestamps_round_trip_as_utc integration test |
| §13.0 Accessibility Standard Commitment | UTC-only rule added |
| §13.4 Colour and Contrast Specification | formatOperationalNumber / formatDisplayNumber helpers; Intl.NumberFormat mandate |
| §13.5 (new) Internationalisation Architecture | next-intl; messages/en.json; ICAO-FIXED exclusions; CSS logical properties; altitude unit display; docs/ADDING_A_LOCALE.md checklist |
| §14 API Versioning Policy | ISO 8601 UTC contract; OpenAPI format: date-time; contract test; Pydantic encoder |
| §28.4 Probabilistic Communication | HH:MMZ inline UTC suffix rule |
| users table | altitude_unit_preference column added |

51.3 New Files

| File | Purpose |
| --- | --- |
| messages/en.json | Phase 1 string source of truth for next-intl |
| messages/fr.json | Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy) |
| messages/de.json | Phase 3 scaffold |
| docs/ADDING_A_LOCALE.md | Step-by-step checklist for adding a new locale; includes RTL gate |
| frontend/src/lib/formatters.ts | formatOperationalNumber, formatDisplayNumber, formatUtcTime, formatUtcDate helpers |
| tests/test_db_timezone.py | Blocking integration test for UTC round-trip integrity |

51.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Displaying local time in the ops interface | UTC only; HH:MMZ always; no conversion widget |
| Number.toString() or n.toFixed() in JSX | formatOperationalNumber() (ICAO) or formatDisplayNumber(locale) depending on context |
| 03/04/2026 in any export or report | 04 MAR 2026 — the unambiguous ICAO-aligned format |
| Translating NOTAM template strings | ICAO-FIXED; annotate and exclude from i18n tooling |
| Positive tabindex (already covered in §50) | Never; noted here because it is also an i18n anti-pattern (breaks RTL reading order) |
| Hardcoded margin-left in new components | margin-inline-start; logical properties throughout |
| Multiple API date formats in the same response | ISO 8601 UTC only; one format, no exceptions |

51.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Operational time display | UTC-only, HH:MMZ inline | User-selectable timezone | ICAO Annex 15 mandates UTC for aeronautical data; a timezone selector introduces conversion errors under time pressure |
| Date format in exports | DD MMM YYYY | ISO 8601 (2026-03-04) | ISO 8601 is unambiguous but unfamiliar to aviation professionals; DD MMM YYYY matches aviation document convention (NOTAM, METAR) and is equally unambiguous |
| Phase 1 language scope | English only | Multi-language from Phase 1 | Localisation adds QA overhead and translation cost before product-market fit is proven; the architecture supports future locales without rework |
| i18n library | next-intl | react-i18next | next-intl has first-class App Router RSC support; react-i18next requires client-component wrapping for all translated text |
| Altitude storage unit | Metres (API + DB) | Role-dependent storage | A single SI storage unit eliminates conversion bugs in the physics engine; display conversion is well-understood and testable |
| ORM timezone enforcement | Engine event listener (SET TIME ZONE 'UTC') | Application-level assertion | The engine listener fires at connection creation and cannot be bypassed by individual queries; application assertions can be accidentally omitted |

§53 Machine Learning / Data Science — Specialist Review

Standards basis: ISO/IEC 22989, ECSS-E-ST-10-04C, IADC Space Debris Mitigation Guidelines, ESA DRAMA methodology, Vallado (2013), JB2008, NRLMSISE-00, FAA Order 8040.4B, EU AI Act Art. 10

Review scope: Conjunction Pc model, SGP4 domain, atmospheric density model selection, MC convergence, survival probability, model versioning, TLE age uncertainty, backcasting, input validation, tail risk, data provenance


53.1 Findings and Fixes Applied

F1 — Conjunction probability model methodology unspecified Fix applied (§15.5): Alfano (2005) 2D Gaussian method already specified. Validity domain added: three degradation conditions (sub-100m close approach, anisotropic covariance > 100:1, Pc < 1×10⁻¹⁵ floor). API response carries pc_validity and pc_validity_warning fields. Reference test suite added against Vallado & Alfano (2009) published cases with 5% tolerance.

F2 — SGP4 used beyond valid domain without sub-150 km guard Fix applied (§15.1): Sub-150 km LOW_CONFIDENCE_PROPAGATION flag added to decay predictor. UI badge: "⚠ Re-entry imminent — prediction confidence low." BLOCKING unit test: TLE with perigee 120 km → asserts flag is set.
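The guard itself is a one-line rule; a minimal sketch with illustrative names, flagging rather than refusing so the UI can still badge the prediction:

```python
# SGP4's drag model is not trustworthy below roughly 150 km perigee (F2).
LOW_CONFIDENCE_PERIGEE_KM = 150.0

def propagation_flags(perigee_km: float) -> list[str]:
    """Return confidence flags for a propagation run; empty when nominal."""
    if perigee_km < LOW_CONFIDENCE_PERIGEE_KM:
        return ["LOW_CONFIDENCE_PROPAGATION"]
    return []
```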

F3 — Atmospheric density model not justified vs. JB2008 Fix applied (§15.2): NRLMSISE-00 Phase 1 selection rationale documented (Python binding maturity, acceptable accuracy at moderate F10.7). Known limitations stated. Phase 2 milestone: evaluate JB2008 on backcasts; migrate if MAE improvement > 15%; ADR 0016. Input validity bounds added: F10.7 [65, 300], Ap [0, 400], altitude [85, 1000] km; violation raises AtmosphericModelInputError.

F4 — MC sample count not justified by convergence analysis Fix applied (§15.2/§15.4): Convergence table added. N = 500 satisfies < 2% corridor area change between doublings on the reference object. N = 1000 for OOD or storm-warning cases. MC output updated to include p01 and p99.
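The convergence criterion from the table reduces to a relative-change check between sample-count doublings; a sketch with assumed names:

```python
def corridor_converged(area_at_n: float, area_at_2n: float,
                       tolerance: float = 0.02) -> bool:
    """Converged when corridor area changes by < 2% between N and 2N samples.

    area_at_n / area_at_2n are corridor areas from runs with N and 2N
    Monte Carlo samples respectively (any consistent area unit).
    """
    return abs(area_at_2n - area_at_n) / area_at_n < tolerance
```

Under this rule, N = 500 passes for the reference object; the check would be repeated for OOD or storm-warning cases before accepting N = 1000.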

F5 — Survival probability methodology absent Fix applied (§15.3): survival_probability, survival_model_version, survival_model_note columns added to reentry_predictions. Phase 1: simplified analytical all-survive/no-survive per material class. Phase 2: ESA DRAMA integration. NOTAM (E) field statement driven by survival_probability.

F6 — No model version governance or reproducibility Fix applied (§15.6 new): MAJOR/MINOR/PATCH version bump policy defined. Old versions retained in git tags and physics/versions/. POST /decay/predict/reproduce endpoint specified — re-runs with original model version and params for regulatory audit.

F7 — TLE age not a formal uncertainty source Fix applied (§15.2): Linear inflation model added: uncertainty_multiplier = 1 + 0.15 × tle_age_days applied to ballistic coefficient covariance before MC sampling. tle_age_at_prediction_time and uncertainty_multiplier stored in simulations.params_json and returned in API response.
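The inflation model stated above is linear and directly expressible; the function name is illustrative, the formula is the one in the fix:

```python
def tle_age_uncertainty_multiplier(tle_age_days: float) -> float:
    """Covariance inflation before MC sampling: 1 + 0.15 per day of TLE age."""
    if tle_age_days < 0:
        raise ValueError("TLE epoch is in the future")
    return 1.0 + 0.15 * tle_age_days
```

A 2-day-old TLE thus inflates the ballistic-coefficient covariance by a factor of 1.3.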

F8 — No model performance monitoring or drift detection Fix applied (§15.9 new): reentry_backcasts table specified. Celery task triggered on object status = 'decayed'; compares all 72h predictions to confirmed re-entry time. Rolling 30-prediction MAE nightly; MEDIUM alert if MAE > 2× historical baseline. Admin panel "Model Performance" widget.
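The drift rule reduces to a rolling mean of absolute errors compared against the baseline; a sketch with assumed names and error lists in hours:

```python
def rolling_mae(abs_errors_hours: list[float], window: int = 30) -> float:
    """Mean absolute error over the most recent `window` backcasts."""
    recent = abs_errors_hours[-window:]
    return sum(recent) / len(recent)

def drift_detected(abs_errors_hours: list[float], baseline_mae: float) -> bool:
    """MEDIUM alert condition: rolling MAE exceeds twice the baseline."""
    return rolling_mae(abs_errors_hours) > 2.0 * baseline_mae
```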

F9 — Input data quality gates insufficient Fix applied (§15.7 new): validate_prediction_inputs() function in backend/app/modules/physics/validation.py. Validates TLE epoch age ≤ 30 days, F10.7/Ap/perigee bounds, mass > 0. Returns structured ValidationError list; endpoint returns 422. All validation paths covered by BLOCKING unit tests.
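A sketch of the structured gate, assuming a flat dict input; field names and the dataclass shape are assumptions, while the bounds are the ones stated in F9 and F3:

```python
from dataclasses import dataclass

@dataclass
class ValidationError:
    field: str
    message: str

def validate_prediction_inputs(inputs: dict) -> list[ValidationError]:
    """Collect all violations rather than failing on the first, so the 422
    response can report every bad field at once."""
    errors: list[ValidationError] = []
    if inputs.get("tle_epoch_age_days", 0.0) > 30:
        errors.append(ValidationError("tle_epoch_age_days", "TLE epoch older than 30 days"))
    if not 65 <= inputs.get("f107", 0.0) <= 300:
        errors.append(ValidationError("f107", "F10.7 outside [65, 300]"))
    if not 0 <= inputs.get("ap", -1.0) <= 400:
        errors.append(ValidationError("ap", "Ap outside [0, 400]"))
    if not 85 <= inputs.get("perigee_km", 0.0) <= 1000:
        errors.append(ValidationError("perigee_km", "perigee outside [85, 1000] km"))
    if inputs.get("mass_kg", 0.0) <= 0:
        errors.append(ValidationError("mass_kg", "mass must be > 0"))
    return errors
```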

F10 — Tail risks not communicated; only p5–p95 shown Fix applied (§28.4, reentry_predictions schema): p01_reentry_time and p99_reentry_time columns added. Tail risk annotation displayed when the p1–p99 range > 1.5× the p5–p95 range: "Extreme case (1% probability outside): p01Z – p99Z." Included as a NOTAM draft footnote when the condition is met.
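The display condition is a simple width comparison; a sketch with assumed names, taking quantiles as epoch offsets (any consistent time unit works):

```python
def tail_annotation_required(p05: float, p95: float,
                             p01: float, p99: float) -> bool:
    """Annotate tail risk when the p01-p99 window is more than 1.5x wider
    than the p5-p95 window (the F10 display condition)."""
    return (p99 - p01) > 1.5 * (p95 - p05)
```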

F11 — No training/validation data provenance Fix applied (§15.8 new): Phase 1 explicitly documented as physics-based with no trained ML components. docs/ml/data-provenance.md designated. EU AI Act Art. 10 compliance mapped to input data provenance (tracked in simulations.params_json). Future ML component protocol: training data, validation split, model card in docs/ml/model-card-{component}.md.


53.2 Sections Modified

| Section | Change |
| --- | --- |
| §15.1 Catalog Propagator | Sub-150 km LOW_CONFIDENCE_PROPAGATION flag + unit test |
| §15.2 Decay Predictor | NRLMSISE-00 selection rationale vs. JB2008; input bounds; TLE age inflation model; MC convergence table; N=1000 for OOD/storm cases |
| §15.3 Atmospheric Breakup Model | survival_probability / survival_model_version / survival_model_note columns; Phase 1 analytical methodology |
| §15.5 Conjunction Pc | Validity domain (3 degradation conditions); pc_validity API fields; Vallado & Alfano reference test suite |
| §15.6 (new) Model Version Governance | MAJOR/MINOR/PATCH policy; version retention; reproduce endpoint |
| §15.7 (new) Prediction Input Validation | validate_prediction_inputs(); 5 validation rules; 422 response; BLOCKING tests |
| §15.8 (new) Data Provenance | Phase 1 no-ML declaration; EU AI Act Art. 10 mapping; future ML component protocol |
| §15.9 (new) Backcasting Validation | reentry_backcasts table; Celery trigger on decay; rolling MAE drift detection; admin panel widget |
| §28.4 Probabilistic Communication | Tail risk annotation (rule 6); p01/p99 display condition; NOTAM footnote |
| reentry_predictions schema | p01_reentry_time, p99_reentry_time, survival_probability, survival_model_version, survival_model_note |

53.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| reentry_backcasts table | Prediction vs. actual comparison; drift detection input |
| docs/ml/data-provenance.md | Phase 1 no-ML declaration; future ML data provenance template |
| docs/ml/model-card-{component}.md | Template for any future learned component |
| docs/adr/0016-atmospheric-density-model.md | NRLMSISE-00 vs. JB2008 decision; Phase 2 evaluation trigger |
| backend/app/modules/physics/validation.py | validate_prediction_inputs() function |
| tests/physics/test_pc_compute.py | Vallado & Alfano reference cases (BLOCKING) |

53.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Displaying only p5–p95 without tail annotation | Add p1/p99 as an explicit tail risk annotation when materially wider |
| Silently clamping out-of-range inputs | Reject with a structured ValidationError; the operator must correct the input |
| Deleting old model versions on update | Tag and retain; the reproduce endpoint requires historical version access |
| Treating TLE age as display-only staleness | TLE age is a formal uncertainty source; inflate MC covariance accordingly |
| Choosing an atmospheric model without documented rationale | Document the selection vs. alternatives; schedule re-evaluation with an objective criterion |
| No feedback loop from confirmed re-entries | The backcasting pipeline closes the loop; MAE monitoring detects drift |

53.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Phase 1 atmospheric model | NRLMSISE-00 | JB2008 | Mature Python binding; acceptable accuracy at moderate F10.7; JB2008 evaluation deferred to Phase 2 with an objective trigger |
| Pc method | Alfano (2005) 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and widely accepted; MC Pc reserved for Phase 3 high-Pc cases where the Gaussian assumption breaks down |
| MC convergence criterion | < 2% corridor area change between N doublings | Fixed N from literature | A fixed N is arbitrary; the convergence criterion is object-class specific and reproducible |
| Tail risk display threshold | p1–p99 > 1.5× p5–p95 | Always show / never show | Always showing creates visual clutter for well-constrained predictions; never showing hides operationally relevant uncertainty; the threshold balances both |
| Model version retention | Git tags + physics/versions/ directory | Docker image tags only | Docker images are routinely pruned; git tags are permanent; the reproduce endpoint needs the actual code, not just an image |

§54 Technical Documentation / Developer Experience — Specialist Review

Standards basis: OpenAPI 3.1, Keep a Changelog, Conventional Commits, Nygard ADR format, WCAG authoring guidance, MkDocs Material, spectral OpenAPI linting, ESA ECSS documentation requirements

Review scope: OpenAPI spec governance, health endpoint coverage, contribution workflow, ADR process, changelog discipline, developer onboarding, response examples, SDK strategy, runbook structure, docs pipeline, AI assistance declaration


54.1 Findings and Fixes Applied

F1 — OpenAPI spec not declared as source of truth Fix applied (§14 API Versioning Policy): FastAPI's built-in OpenAPI generation is declared as the sole source of truth. make generate-openapi regenerates openapi.yaml. CI runs openapi-diff --fail-on-incompatible to detect uncommitted drift. The spec is input to Swagger UI, Redoc, contract tests, and the SDK generator.

F2 — No /health or /readiness endpoint specified Fix applied (§14 System endpoints): New System (no auth required) group added. GET /health — liveness probe; process-alive check only. GET /readyz — readiness probe; checks PostgreSQL, Redis, Celery queue depth; returns 503 when any dependency is unhealthy. Both used by Kubernetes probes, load balancers, and DR automation DNS-flip gate (§26.3). Both included in OpenAPI spec.

F3 — CONTRIBUTING.md absent Fix applied (§13.6 new): Full contribution workflow documented. Branch naming convention table (feature/fix/chore/release/hotfix), main branch protection (1 approval, all checks pass, no force-push), Conventional Commits commit format, PR template with checklist (test, openapi regeneration, CHANGELOG, axe-core, ADR), 1-business-day review SLA, stale PR automation.

F4 — No ADR process Fix applied (§13.7 new): ADR process specified using Nygard format in docs/adr/NNNN-title.md. Trigger criteria defined (hard-to-reverse decisions, auditor context, procurement evidence). Standard template specified. Known ADR register table provided with 6 existing entries. Phase 2 ESA submission gate: all referenced ADR numbers must have corresponding files.

F5 — Changelog discipline unspecified Fix applied (§14 API Versioning Policy): Keep a Changelog format + Conventional Commits declared. [Unreleased] section with Added/Changed/Fixed/Deprecated subsections required on every PR with user-visible effect. make changelog-check CI step fails if [Unreleased] is empty for non-chore/docs commits. Release changelogs drive API key holder notifications and GitHub release notes.

F6 — Developer environment setup undocumented Fix applied (§13.8 new): docs/DEVELOPMENT.md spec covering: prerequisites (Python 3.11 pinned, Node.js 20, Docker Desktop, make), make dev-up / migrate / seed / dev bootstrap sequence, make test / test-backend / test-frontend / test-e2e commands, local URL map (API, Swagger UI, frontend, MinIO). 30-minute onboarding target. .env.example committed; .env in .gitignore.

F7 — OpenAPI response examples not required Fix applied (§14 API Versioning Policy): All endpoint schemas must include at least one examples: block. Enforced by spectral lint with custom require-response-example rule in CI. Example YAML fragment provided for GET /objects/{norad_id}. Examples serve: Swagger/Redoc docs, contract test fixtures, ESA auditor readability.

F8 — No SDK or client library strategy Fix applied (§14 API Versioning Policy): Phase 1 — no SDK; ANSP integrators receive openapi.yaml, docs/integration/ quickstarts (Python httpx/requests, TypeScript), and Postman-importable spec. Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate with openapi-generator-cli targeting Python and TypeScript. Generator config committed to tools/sdk-generator/. Published as spacecom-client PyPI and @spacecom/client npm packages.

F9 — Runbooks named but not templated Fix applied (§26.8 new subsection): Standard runbook template specified with 7 sections: Triggers, Immediate actions (first 5 minutes), Diagnosis, Resolution steps, Verification, Escalation, Post-incident. Last tested frontmatter field required. make runbook-audit CI check warns if any runbook is older than 12 months. Template preempts the most common incident-pressure failures: vague steps, no expected output, missing escalation path.

F10 — No docs-as-code pipeline Fix applied (§13.9 new): MkDocs Material as the documentation site generator. mkdocs build --strict in CI fails on broken links and missing pages. markdown-link-check for external links. vale prose style linter. openapi-diff spec drift check. ESA submission artefact: static HTML archived as docs-site-{version}.zip in release assets — reproducible point-in-time snapshot. owner: frontmatter field with quarterly docs-review cron issue.

F11 — AGENTS.md scope vs. MASTER_PLAN undefined Fix applied (§1 Vision): AI-assisted development policy added. Defines: permitted uses (code generation, refactoring, review, documentation drafting), prohibited uses (autonomous decisions on safety-critical algorithms, auth logic, regulatory compliance text; production credentials; personal data). Human review standards apply identically to AI-generated code. ESA procurement statement: human engineers are sole responsible parties regardless of authoring tool.


54.2 Sections Modified

| Section | Change |
| --- | --- |
| §1 Vision | AI-assisted development policy; AGENTS.md scope declaration; ESA procurement statement |
| §13.6 (new) Contribution Workflow | Branch naming; commit format; PR template; review SLA; main protection |
| §13.7 (new) Architecture Decision Records | Nygard ADR format; trigger criteria; template; known ADR register; Phase 2 ESA gate |
| §13.8 (new) Developer Environment Setup | docs/DEVELOPMENT.md spec; make targets; 30-minute onboarding target; .env.example policy |
| §13.9 (new) Docs-as-Code Pipeline | MkDocs Material; CI checks (strict, link, vale, openapi-diff); ESA artefact; docs ownership |
| §14 API Versioning Policy | OpenAPI as source of truth; make generate-openapi; CI drift check; changelog discipline; response examples mandate; client SDK strategy |
| §14 System Endpoints (new) | GET /health liveness spec; GET /readyz readiness spec with example responses |
| §26.8 Incident Response | Runbook standard structure template; Last tested field; make runbook-audit |

54.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| CONTRIBUTING.md | Branch naming, commit format, PR template, review SLA |
| CHANGELOG.md | Keep a Changelog format; [Unreleased] driven by PRs; release notes source |
| docs/adr/NNNN-*.md | Architecture Decision Records (Nygard format) |
| docs/DEVELOPMENT.md | Developer onboarding; make targets; environment bootstrap |
| docs/ADDING_A_LOCALE.md | Locale addition checklist (already referenced in §13.5) |
| docs/integration/ | ANSP quickstart guides (Python, TypeScript) |
| tools/sdk-generator/ | openapi-generator-cli config for Phase 2 SDK generation |
| .github/pull_request_template.md | PR checklist enforcing OpenAPI regeneration, CHANGELOG, axe-core, ADR |
| .spectral.yaml | Custom spectral ruleset including require-response-example |
| .vale.ini | Prose style linter config for docs |
| mkdocs.yml | MkDocs Material configuration |
| docs/runbooks/*.md | All runbooks follow the standard template with Last tested frontmatter |

54.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Maintaining a separate OpenAPI spec alongside FastAPI routes | Generate from code; enforce with a CI drift check |
| Undocumented GET /health with an ad-hoc response shape | Specify the schema, document it in OpenAPI, use it in DR automation |
| New engineers learning the codebase by asking colleagues | docs/DEVELOPMENT.md with a 30-min onboarding target; make dev brings up everything |
| Architectural decisions in Slack or PR comments | ADR in docs/adr/; permanent and findable by auditors and new engineers |
| Runbooks written for the first time during an incident | Template-first; test in a game day before needed |
| Publishing an API with no response examples | spectral enforces examples: blocks; Swagger UI shows realistic data |
| Building an SDK before customers ask | Phase 2 gate: ≥ 2 ANSP requests; Phase 1 is openapi.yaml + quickstarts |

54.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| OpenAPI generation direction | Code → spec (FastAPI auto-generation) | Spec → code (contract-first with codegen) | Team is Python-first; FastAPI's generation is high-fidelity; contract-first adds a separate edit step without meaningful quality gain at Phase 1 scale |
| SDK strategy | Generated from spec (Phase 2) | Hand-crafted SDK | Generated SDK stays in sync with the spec automatically; hand-crafted SDKs drift; generation deferred until customer demand justifies the maintenance cost |
| Documentation tooling | MkDocs Material | Docusaurus, GitBook | MkDocs Material is Python-native (same toolchain as backend); mkdocs build --strict provides CI integration; no JS toolchain dependency for docs |
| ADR format | Nygard (Context/Decision/Consequences) | MADR, RFC-style | Nygard is the most widely used format, recognised by ESA/public-sector auditors; minimal overhead |
| AI assistance declaration | Explicit policy in §1 Vision | Silent (no declaration) | ESA and EASA increasingly require disclosure of AI tool use in safety-relevant software; proactive disclosure pre-empts audit questions and demonstrates process maturity |

§55 Multi-Tenancy, Billing & Org Management — Specialist Review

Standards basis: GDPR Art. 17/20, PCI-DSS (if card payments introduced), SaaS subscription billing conventions, PostgreSQL Row Level Security documentation, Celery priority queue documentation, ICAO Annex 11 (operator accountability)
Review scope: Data isolation, subscription tier model, usage metering, org lifecycle, API key governance, quota enforcement, queue fairness, audit log access, billing data model, data portability


55.1 Findings and Fixes Applied

F1 — No row-level tenant isolation strategy defined Fix applied (§7.2): Comprehensive RLS policy table added covering all 8 organisation_id-carrying tables. spacecom_worker database role specified as the only BYPASSRLS principal. BLOCKING integration test specified: query as Org A session; assert zero Org B rows across all tenanted tables.
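A minimal sketch of the RLS pattern F1 describes, for one tenanted table; the policy name and the `app.current_org` session setting are illustrative assumptions, not the plan's authoritative names:

```sql
-- Illustrative RLS policy; repeat per organisation_id-carrying table.
ALTER TABLE alert_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE alert_events FORCE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON alert_events
  USING (organisation_id = current_setting('app.current_org')::uuid);

-- spacecom_worker is the only principal allowed to bypass RLS.
ALTER ROLE spacecom_worker BYPASSRLS;
```

The application sets `app.current_org` at session checkout, so a missing WHERE clause in application code cannot leak cross-tenant rows.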

F2 — Subscription tiers and feature flags not specified Fix applied (§16.1 new): Tier table defined (shadow_trial, ansp_operational, space_operator, institutional, internal) with per-tier MC concurrency, prediction quota, and feature access. require_tier() FastAPI dependency pattern specified. TIER_MC_CONCURRENCY dict ties limits to tier. Tier changes take immediate effect (no session cache).
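The require_tier() dependency pattern can be sketched as plain Python; the tier ordering, the per-tier concurrency values, and the TierForbidden exception name are assumptions — §16.1 holds the authoritative tier table:

```python
# Sketch of the require_tier() gating pattern (assumed tier order and limits).
TIER_ORDER = ["shadow_trial", "ansp_operational", "space_operator",
              "institutional", "internal"]

TIER_MC_CONCURRENCY = {  # illustrative per-tier MC concurrency limits
    "shadow_trial": 1,
    "ansp_operational": 4,
    "space_operator": 4,
    "institutional": 8,
    "internal": 8,
}

class TierForbidden(Exception):
    """Maps to HTTP 403 in the real FastAPI dependency."""

def require_tier(minimum: str):
    """Return a checker for 'org must be at or above `minimum` tier'.

    In FastAPI this is wrapped with Depends() and reads the org's tier
    from the database on every request (no session cache, per §16.1).
    """
    def check(org_tier: str) -> str:
        if TIER_ORDER.index(org_tier) < TIER_ORDER.index(minimum):
            raise TierForbidden(f"endpoint requires tier >= {minimum}")
        return org_tier
    return check
```

Because the tier is read per request, a downgrade takes effect on the very next call.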

F3 — Usage metering not modelled Fix applied (§9.2): usage_events table added — append-only, immutable trigger, indexed by (organisation_id, billing_period, event_type). Billable event types: decay_prediction_run, conjunction_screen_run, report_export, api_request, mc_quota_exhausted, reentry_plan_run. Powers org admin usage dashboard and upsell trigger.
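The append-only guard and index F3 describes can be sketched in DDL; the trigger and function names are illustrative:

```sql
-- Reject any UPDATE or DELETE on usage_events (append-only, immutable).
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'usage_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER usage_events_immutable
  BEFORE UPDATE OR DELETE ON usage_events
  FOR EACH ROW EXECUTE FUNCTION reject_mutation();

-- Hot path for the org admin dashboard and billing aggregation.
CREATE INDEX idx_usage_events_org_period
  ON usage_events (organisation_id, billing_period, event_type);
```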

F4 — Organisation onboarding and offboarding procedures absent Fix applied (§29.8 new): Onboarding gate checklist specified (MSA, export control, Space-Track, billing contact, org_admin user, ToS). Offboarding 8-step procedure with timing, owner, and GDPR Art. 17 vs. retention resolution. Suspension vs. churn distinction documented. docs/runbooks/org-onboarding.md designated.

F5 — API key lifecycle lacks org-level service account concept Fix applied (§9.2 api_keys table): is_service_account column added; user_id made nullable for service account keys; service_account_name required when is_service_account = TRUE; revoked_by column added for org_admin audit trail. CHECK constraints enforce mutual exclusivity. Org admin can see and revoke all org keys via GET/DELETE /org/api-keys.

F6 — Concurrent prediction limit not persisted and not tier-linked Fix applied (§16.1, Celery section): acquire_mc_slot now derives limit from org_tier via get_mc_concurrency_limit_by_tier(). Quota exhaustion writes usage_events row with event_type = 'mc_quota_exhausted'. Org admin usage dashboard shows hits per billing period with upgrade prompt if hits ≥ 3.
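The tier-driven slot gate can be sketched as follows; the in-memory dict and list stand in for the Redis counter and the usage_events insert, and the per-tier limits are assumptions:

```python
# Sketch of the tier-driven MC concurrency gate described in F6.
TIER_MC_CONCURRENCY = {"shadow_trial": 1, "ansp_operational": 4,
                       "space_operator": 4, "institutional": 8, "internal": 8}

def get_mc_concurrency_limit_by_tier(tier: str) -> int:
    return TIER_MC_CONCURRENCY.get(tier, 1)  # unknown tier: most restrictive

def acquire_mc_slot(active_slots: dict, usage_events: list,
                    org_id: str, org_tier: str) -> bool:
    """Claim a slot and return True, or record quota exhaustion and refuse."""
    limit = get_mc_concurrency_limit_by_tier(org_tier)
    if active_slots.get(org_id, 0) >= limit:
        usage_events.append({"organisation_id": org_id,
                             "event_type": "mc_quota_exhausted"})
        return False
    active_slots[org_id] = active_slots.get(org_id, 0) + 1
    return True
```

Each refusal leaves a usage_events row, which is what drives the upgrade prompt after three hits in a billing period.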

F7 — No org-level admin role Fix applied (§7.2 RBAC table, users.role CHECK): org_admin role added between operator and admin. Permissions: manage users within own org (up to operator), manage own org's API keys, view own org's audit log, update billing contact. Cannot cross org boundaries or assign admin/org_admin without system admin.

F8 — Shared Celery queues with no per-org priority Fix applied (Celery Queue section): TIER_TASK_PRIORITY table (3–9 by tier) with CRITICAL_EVENT_PRIORITY_BOOST = 2 when active TIP event exists. get_task_priority() function specified. Priority submitted via apply_async(priority=...). Redis noeviction policy supports native Celery priorities 0–9.
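A sketch of get_task_priority(); the per-tier base values are assumptions within the 3–9 range, and the scale direction follows the plan's convention that the boost raises urgency:

```python
# Sketch of the per-org Celery priority map described in F8 (assumed values).
TIER_TASK_PRIORITY = {"shadow_trial": 3, "ansp_operational": 7,
                      "space_operator": 6, "institutional": 8, "internal": 5}
CRITICAL_EVENT_PRIORITY_BOOST = 2
MAX_PRIORITY = 9  # top of the 0-9 priority range

def get_task_priority(org_tier: str, active_tip_event: bool = False) -> int:
    """Base priority by tier, boosted (and capped) during an active TIP event."""
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    if active_tip_event:
        base = min(base + CRITICAL_EVENT_PRIORITY_BOOST, MAX_PRIORITY)
    return base

# Submitted per task, e.g.:
#   run_decay_prediction.apply_async(args=[...], priority=get_task_priority(tier))
```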

F9 — No tenant-scoped audit log API Fix applied (§14 Org Admin endpoints): GET /org/audit-log added — paginated, filtered by organisation_id, supports ?from=&to=&event_type=&user_id=. Sources security_logs and alert_events. Accessible to org_admin and admin. Required by enterprise SaaS compliance expectations.

F10 — Billing data model absent Fix applied (§9.2): billing_contacts table (email, name, address, VAT, PO reference), subscription_periods table (immutable billing history with tier, dates, monthly fee, invoice reference). PATCH /org/billing endpoint for org_admin self-service updates. Phase 1 billing is manual; invoice_ref field accommodates future Stripe or Lago integration.

F11 — No org data export or portability mechanism Fix applied (§14 Org Admin endpoints, §29.2): POST /org/export endpoint added — async job, delivers signed ZIP within 3 business days. Used for GDPR Art. 20 portability and offboarding. §29.2 portability row updated with endpoint reference and scope clarification (user-generated content, not derived predictions).


55.2 Sections Modified

| Section | Change |
| --- | --- |
| §7.2 RBAC | org_admin role added; comprehensive RLS policy table; spacecom_worker BYPASSRLS principal; users.role CHECK constraint updated |
| §9.2 api_keys | is_service_account, service_account_name, revoked_by columns; CHECK constraints; service account index |
| §9.2 (new tables) | usage_events, billing_contacts, subscription_periods |
| §14 Org Admin endpoints (new group) | 10 org_admin-scoped endpoints covering users, API keys, audit log, usage, billing, and data export |
| §14 Admin endpoints | GET /admin/organisations, POST /admin/organisations, PATCH /admin/organisations/{id} added |
| §16.1 (new) Subscription Tiers | Tier table; require_tier() pattern; TIER_MC_CONCURRENCY; tier change immediacy |
| Celery Queue section | TIER_TASK_PRIORITY priority map; CRITICAL_EVENT_PRIORITY_BOOST; get_task_priority() function |
| MC concurrency gate | acquire_mc_slot now tier-driven; quota exhaustion writes usage_events |
| §29.2 Data Subject Rights | Portability row updated with POST /org/export endpoint and scope |
| §29.8 (new) Org Onboarding/Offboarding | 6-gate onboarding checklist; 8-step offboarding procedure; suspension vs. churn distinction |

55.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| usage_events table | Billable event metering; org admin dashboard; quota exhaustion signal |
| billing_contacts table | Invoice address, VAT, PO number per org |
| subscription_periods table | Immutable billing history; Phase 2 invoice integration anchor |
| docs/runbooks/org-onboarding.md | Onboarding gate checklist; provisioning procedure |
| backend/app/modules/billing/tiers.py | get_mc_concurrency_limit_by_tier() and TIER_TASK_PRIORITY |

55.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Relying solely on application-layer WHERE organisation_id = X | RLS at database layer; application filter is defence-in-depth only |
| Role model with only system-wide admin | org_admin for self-service tenant management; admin for cross-org system operations |
| Flat API key model with no service accounts | Service account keys (user_id IS NULL) for system integrations; org admin can audit and revoke all keys |
| Sharing Celery queue with equal priority for all orgs | Priority queue by tier + active event boost prevents low-tier bulk jobs starving safety-critical work |
| No audit log access for tenants | Tenant-scoped GET /org/audit-log; required by enterprise procurement and insurance |
| Treating subscription_tier as static configuration | Tier changes must be real-time enforced; require_tier() reads from DB on each request |

55.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Tenant isolation mechanism | PostgreSQL RLS + application filter | Application filter only | RLS enforces at DB layer; a single missing WHERE clause in application code cannot leak cross-tenant data |
| Tier change immediacy | Real-time DB read on each request | Cached in JWT claim | JWT caching means downgraded orgs continue at higher tier until token expires; unacceptable for billing correctness |
| Billing integration (Phase 1) | Manual + subscription_periods table | Stripe/Lago from day 1 | Phase 1 has ≤5 paying customers; manual invoicing is sufficient; invoice_ref field enables future integration without schema migration |
| org_admin role scope | Cannot assign admin or org_admin without system admin approval | Full self-service role management | Self-service org_admin assignment creates privilege escalation paths; system admin as approval gate is a standard SaaS pattern |
| Service account API keys | user_id IS NULL with is_service_account = TRUE flag | Separate service_accounts table | Single api_keys table is simpler; constraints enforce consistency; avoids JOIN complexity for key lookup hot path |

§56 Testing Strategy — Specialist Review

Standards basis: pytest, pytest-cov, mutmut, k6, Playwright, openapi-typescript, freezegun, ISTQB test level definitions, ESA ECSS-E-ST-40C software testing standard
Review scope: Coverage standard, test taxonomy, test data management, frontend/API contract drift, mutation testing, performance test specification, environment parity, safety-critical labelling, WebSocket E2E, MC determinism, ESA test plan artefact


56.1 Findings and Fixes Applied

F1 — No test coverage standard defined Fix applied (§17.0): Coverage thresholds declared: 80% line / 70% branch for backend (pytest-cov), 75% line for frontend (Jest). Enforced via pyproject.toml --cov-fail-under. Measured on the integration run (real DB), not unit-only. Coverage artefact required in Phase 2 ESA submission.

F2 — Test level boundary undefined Fix applied (§17.0): Three-level taxonomy defined: unit (no I/O, tests/unit/), integration (real DB + Redis, tests/integration/), E2E (full stack + browser, e2e/). Rules specify which level each category of test belongs to. Stops developers placing DB tests in tests/unit/ or mocking the database in integration tests.

F3 — Test data management strategy absent Fix applied (§17.0): Committed JSON reference data for physics; transaction-rollback isolation for integration tests; freezegun mandate for all time-dependent tests; fictional NORAD IDs (90001–90099) and generated org names for sensitive data. Prevents flaky time-dependent failures and production-data leakage into the test repo.

F4 — No contract testing between frontend and API Fix applied (§14): openapi-typescript generates frontend/src/types/api.generated.ts from openapi.yaml. Frontend imports only from the generated file. make check-api-types CI step fails on any drift. Replaces Pact-style consumer-driven contracts at Phase 1 scale — simpler, equally effective for a single-team project.

F5 — Mutation testing not specified Fix applied (§17.0): mutmut runs weekly against physics/ and alerts/ modules. Threshold: ≥ 70% mutation score. Results published to CI artefacts. > 5 percentage point drop between runs creates a mutation-regression issue automatically.

F6 — Performance test specification informal Fix applied (§27.0 new): k6 chosen as the load testing tool. Three scenarios specified: CZML catalog ramp, 200 WebSocket subscribers, decay submit constant arrival rate. SLO thresholds as k6 thresholds (test fails if breached). Baseline hardware spec documented in docs/validation/load-test-baseline.md. Results stored as JSON and trended; > 20% p95 increase creates performance-regression issue.

F7 — Test environment parity unspecified Fix applied (§17.0): docker-compose.ci.yml must use pinned image tags matching production (not latest). make test fails if TIMESCALEDB_VERSION env var does not match docker-compose.yml. MinIO used in CI (not mocked). Prevents the class of "passes in CI, fails in prod" due to minor version differences in TimescaleDB chunk behaviour.

F8 — Safety-critical tests not labelled Fix applied (§17.0): @pytest.mark.safety_critical marker defined in conftest.py. Applied to: cross-tenant isolation, HMAC integrity, sub-150km guard, shadow segregation, and any other safety-invariant test. Separate fast CI job (pytest -m safety_critical, target < 2 min) runs on every commit before the full suite.
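The marker registration and usage can be sketched as a conftest.py fragment; the marker name matches §17.0, while the example test body is purely illustrative:

```python
# conftest.py sketch: register the safety_critical marker so pytest -m
# safety_critical can run the fast gate on every commit.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "safety_critical: safety-invariant test; must pass on every commit",
    )

@pytest.mark.safety_critical
def test_cross_tenant_isolation_example():
    """Placeholder showing marker usage; the real test queries as Org A
    and asserts zero Org B rows across all tenanted tables."""
    assert True
```

Run with `pytest -m safety_critical` to execute only the marked invariant tests.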

F9 — No E2E test for WebSocket alert delivery Fix applied (§42.2 E2E test inventory, accessibility section): e2e/test_alert_websocket.ts added. Full path: submit prediction via API → Celery completes → CRITICAL alert appears in browser DOM via WebSocket within 60 seconds. BLOCKING. Intermittent failures are investigated to root cause, not quarantined.

F10 — Physics tests non-deterministic Fix applied (§17.0): np.random.seed(42) autouse fixture in tests/conftest.py. seed=42 passed explicitly to all MC calls in tests. Seed value pinned; a PR changing it without updating baselines fails the review checklist. MC-based tests are now fully reproducible across machines and Python versions.
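The seed fixture can be sketched as a conftest.py fragment; the fixture name matches the plan, and the demo MC function below is an illustrative assumption:

```python
# tests/conftest.py sketch (§17.0): pin the global NumPy seed before every test.
import numpy as np
import pytest

@pytest.fixture(autouse=True)
def seed_rng():
    np.random.seed(42)  # pinned; a PR changing this must update MC baselines
    yield

def run_mc_draws(n: int, seed: int = 42) -> np.ndarray:
    """Illustrative MC call taking an explicit seed, as the tests must."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n)
```

With the seed fixed, two runs of the same MC test produce bit-identical draws on any machine.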

F11 — No test plan document for ESA submission Fix applied (§17.0): docs/TEST_PLAN.md structure specified with 6 sections including safety-critical traceability matrix (requirement → test ID → test name → result). This is the primary software assurance evidence document for the ESA bid. Required as a Phase 2 deliverable.

Bind mount strategy (companion fix) Fix applied (§3.3 Docker Compose): Host bind mounts specified for logs, exports, config, and DB data. Eliminates the need for docker compose exec for all routine operations. /data/postgres and /data/minio outside the project directory to prevent accidental wipe. make init-dirs creates the host directory structure before first docker compose up. make logs SERVICE=backend convenience alias.


56.2 Sections Modified

| Section | Change |
| --- | --- |
| §3.3 Docker Compose | Host bind mount specification; host directory layout; make init-dirs; :ro config mounts |
| §13.8 Developer Environment Setup | make init-dirs added to bootstrap sequence |
| §17.0 (new) Test Standards and Strategy | Full test taxonomy, coverage standard, fixture isolation, freezegun, safety_critical marker, MC seed, mutation testing, env parity, docs/TEST_PLAN.md structure |
| §27.0 (new) Performance Test Specification | k6 scenarios, SLO thresholds, baseline hardware spec, result storage and trending |
| §14 API Versioning Policy | openapi-typescript contract type generation; make check-api-types CI step |
| §42.2 E2E Test Inventory | test_alert_websocket.ts added; full WebSocket delivery E2E spec |

56.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| tests/unit/, tests/integration/, e2e/ | Canonical test directory structure per taxonomy |
| e2e/test_alert_websocket.ts | WebSocket alert delivery E2E test |
| tests/conftest.py | seed_rng autouse fixture; safety_critical marker registration |
| docs/TEST_PLAN.md | ESA Phase 2 deliverable; traceability matrix |
| docs/validation/load-test-baseline.md | k6 baseline hardware and data spec |
| docs/validation/load-test-results/ | Stored k6 JSON results for trending |
| tests/load/scenarios.js | k6 scenario definitions |
| frontend/src/types/api.generated.ts | Generated TypeScript API types from openapi.yaml |
| scripts/load-test-trend.py | p95 latency trend chart generator |

56.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Mocking the database in integration tests | Transaction-rollback isolation against a real DB; mocks hide schema and RLS bugs |
| datetime.utcnow() in tests | freezegun @freeze_time decorator; tests must be time-independent |
| Non-deterministic MC tests | np.random.seed(42) autouse fixture; same seed → same output everywhere |
| Coverage measured on unit tests only | Integration run coverage includes DB-layer code; unit-only inflates the number |
| Putting safety-critical tests in the full suite only | pytest -m safety_critical fast job on every commit; never wait for the full suite to catch a safety regression |
| Performance test results not stored | JSON output committed to docs/validation/; trend script flags regressions |

56.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Frontend/API contract testing | openapi-typescript generated types + make check-api-types | Pact consumer-driven contracts | Pact requires a broker and bidirectional test setup; openapi-typescript achieves the same drift detection with a single CI command at Phase 1 team size |
| Performance test tool | k6 | Locust, Gatling | k6 is JavaScript-native (same language as frontend tests); scripting is lightweight; built-in threshold assertions; good CI integration |
| Coverage measurement scope | Integration test run | Unit test run | Unit-only coverage excludes database, Redis, and auth middleware code paths — the most likely sources of prod bugs |
| Mutation testing scope | physics/ and alerts/ only (weekly) | Full codebase (every commit) | Full-codebase mutation testing on every commit would take hours; scoping to highest-consequence modules provides meaningful signal at reasonable cost |
| Host bind mounts approach | Named directories under /opt/spacecom/ with make init-dirs | Named Docker volumes | Host bind mounts are directly accessible via SSH without docker exec; named volumes require exec or a volume driver for host access |

§57 Observability & Monitoring — Specialist Review

Hat: Observability & Monitoring
Findings reviewed: 11
Sections modified: §26.6, §26.7
Date: 2026-03-24


57.1 Findings and Fixes Applied

F1 — Prometheus metric naming convention not defined Fix applied (§26.7 new): Naming convention table added before metric definitions. Rules: spacecom_ namespace required; unit suffix mandatory; _total for counters; high-cardinality identifiers (norad_id, organisation_id, user_id, request_id) banned from metric labels; snake_case labels only. CI make lint-metrics step validates names against the convention pattern.
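The make lint-metrics check can be sketched as follows; the unit-suffix whitelist is an assumption — the authoritative pattern lives in the §26.7 convention table:

```python
# Sketch of the metric-name lint behind `make lint-metrics`.
import re

NAME_RE = re.compile(
    r"^spacecom_[a-z0-9_]+_(total|seconds|bytes|ratio|count|depth)$")
BANNED_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}
SNAKE_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def lint_metric(name: str, labels: set) -> list:
    """Return a list of convention violations for one metric definition."""
    problems = []
    if not NAME_RE.match(name):
        problems.append(f"{name}: missing spacecom_ namespace or unit suffix")
    for label in labels:
        if label in BANNED_LABELS:
            problems.append(f"{name}: high-cardinality label '{label}' banned")
        elif not SNAKE_RE.match(label):
            problems.append(f"{name}: label '{label}' is not snake_case")
    return problems
```

CI fails the build when any registered metric returns a non-empty problem list.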

F2 — SLO burn rate alerting single-window only Fix applied (§26.7): Replaced single ErrorBudgetBurnRate alert with two-alert multi-window pattern. ErrorBudgetFastBurn (1h + 5min windows, 14.4× multiplier, for: 2m) catches sudden outages. ErrorBudgetSlowBurn (6h + 1h windows, 6× multiplier, for: 15m) catches gradual degradation before the error budget is silently exhausted. Three recording rules added (rate1h, rate6h, rate5m).
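The two-alert pattern can be sketched as Prometheus rules; the recording-rule names and the 0.1% error budget (99.9% SLO) are assumptions, while the multipliers, windows, and for: durations are those stated above:

```yaml
groups:
  - name: spacecom-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: >
          spacecom:error_rate:rate1h > (14.4 * 0.001)
          and spacecom:error_rate:rate5m > (14.4 * 0.001)
        for: 2m
        labels: {severity: page}
      - alert: ErrorBudgetSlowBurn
        expr: >
          spacecom:error_rate:rate6h > (6 * 0.001)
          and spacecom:error_rate:rate1h > (6 * 0.001)
        for: 15m
        labels: {severity: ticket}
```

The short window in each pair makes the alert stop firing quickly once the error rate recovers.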

F3 — Structured log schema undefined Already substantially addressed in §2274: REQUIRED_LOG_FIELDS schema with 10 mandatory fields, sanitising processor, request_id correlation middleware, and log integrity policy. No further action required for F3 — confirmed as covered.

F4 — Distributed tracing not specified for Celery path Fix applied (§26.7): Explicit Celery W3C traceparent propagation spec added. CeleryInstrumentor handles automatic propagation; request_id passed in task kwargs as Phase 1 fallback when OTEL_SDK_DISABLED=true. Integration test stub specified to verify trace continuity from HTTP handler through worker span.

F5 — No alerting rule coverage audit Fix applied (§26.7 new): Alert coverage audit table added mapping every SLO and safety invariant to its alert rule. Gaps identified: EopMirrorDisagreement (Phase 1 gap — metric exists, alert rule missing), DbReplicationLagHigh (Phase 2 gap — requires streaming replication), and BackupJobFailed (Phase 1 gap).

F6 — High-cardinality label risk Already addressed: norad_id label was already noted as "Grafana drill-down only; alert via recording rule" in the existing metric definition comment. F1 naming convention formalises this as an explicit prohibition with a CI-enforced lint rule. No additional edit required.

F7 — On-call dashboard not specified Fix applied (§26.7): Operational Overview dashboard panel layout mandated. 8-panel grid with fixed row order; rows 1–2 visible without scroll at 1080p. Each panel maps to a specific metric and threshold. Dashboard UID pinned in AlertManager dashboard_url annotations. Design criterion: answer "is the system healthy?" in 15 seconds.

F8 — Celery queue depth alerting threshold-only Fix applied (§26.7): CelerySimulationQueueGrowing alert added using rate(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2 with for: 5m. Complements the existing threshold-based CelerySimulationQueueDeep. Growth rate alert catches a rising queue before it breaches the absolute threshold.

F9 — No DLQ monitoring Already addressed: DLQGrowing alert (increase(spacecom_dlq_depth[10m]) > 0) and spacecom_dlq_depth metric were already specified in §26.7. F9 confirmed as covered — no further action required.

F10 — Log retention and SIEM integration not specified Fix applied (§26.6 new): Application log retention policy table added. Container stdout: 7 days (Docker json-file). Loki: 90 days (covers incident investigation SLA). Safety-relevant log lines: 7 years (MinIO, matching database safety record retention). SIEM forwarding: per customer contract. Loki retention YAML configuration specified. Phase 1 interim: Celery Beat daily export of CRITICAL log lines to MinIO until Loki ruler is deployed.

F11 — No alerting runbook cross-reference mandate Fix applied (§26.7): runbook_url added to WebSocketCeilingApproaching (previously missing). Mandate added: every AlertManager rule must include annotations.runbook_url pointing to an existing file in docs/runbooks/. make lint-alerts CI step enforces this using promtool check rules plus a custom script that validates the URL resolves to a real markdown file.


57.2 Sections Modified

| Section | Change |
| --- | --- |
| §26.6 Backup and Restore | Application log retention policy table added; Loki 90-day retention config; safety-critical log line archival to MinIO |
| §26.7 Prometheus Metrics | Metric naming convention table; multi-window burn rate recording rules and alerts; Celery trace propagation spec; queue growth rate alert; alert coverage audit table; runbook_url mandate; WebSocketCeilingApproaching runbook link added; on-call dashboard panel layout mandated |

57.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| monitoring/alertmanager/spacecom-rules.yml | Updated with multi-window burn rate alerts and queue growth alert |
| monitoring/loki-config.yml | 90-day retention configuration |
| monitoring/recording-rules.yml | Three burn rate recording rules |
| docs/runbooks/capacity-limits.md | Referenced by WebSocketCeilingApproaching; Phase 2 deliverable |
| scripts/lint-alerts.py | CI script validating runbook_url annotation on every alert rule |
| monitoring/grafana/dashboards/operational-overview.json | Codified panel layout per §26.7 on-call dashboard spec |
| tests/integration/test_tracing.py | Celery trace propagation integration test stub |

57.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Single-window burn rate alert (for: 30m) | Multi-window fast+slow burn: catches both sudden outages and slow degradations |
| norad_id or organisation_id as Prometheus label | Recording rule aggregates; high-cardinality identifiers in log fields or exemplars only |
| Alert rules without runbook_url | make lint-alerts enforces presence; a page at 3am without a runbook link adds ~5 min to MTTR |
| Threshold-only queue alerts | Complement with rate-of-growth alert; threshold fires too late on a gradually filling queue |
| On-call dashboard with no defined layout | Mandated panel order; rows 1–2 visible without scroll; 15-second health answer target |
| Application logs with no retention policy | Explicit tier policy: 7 days local, 90 days Loki, 7 years for safety-relevant lines |

57.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| Burn rate multipliers | 14.4× (fast, 1h) / 6× (slow, 6h) | Custom thresholds | Google SRE Workbook standard multipliers for 99.9% SLO; well-understood by on-call engineers familiar with SRE literature |
| Loki retention | 90 days | 30 days / 1 year | 30 days is insufficient for post-incident reviews triggered by regulatory queries; 1 year is expensive for high-volume structured logs; 90 days covers all contractual and regulatory investigation windows |
| Fast burn for: duration | 2 minutes | Immediate (no for) | Without a for clause, a single scraped bad value pages on-call; 2 minutes filters transient scrape errors while still alerting within 5 minutes of a real outage |
| Celery trace propagation | CeleryInstrumentor + explicit request_id kwargs | OTel only | OTel-only approach breaks Phase 1 when OTEL_SDK_DISABLED=true; explicit kwargs are a zero-dependency fallback that costs nothing and ensures log correlation always works |

§58 Performance & Scalability — Specialist Review

Hat: Performance & Scalability
Findings reviewed: 11
Sections modified: §3.2, §9.4, §16 (CZML cache), §34.2 (Caddyfile), Celery config
Date: 2026-03-24


58.1 Findings and Fixes Applied

F1 — No index strategy documented beyond primary keys Already addressed: §9.3 contains a comprehensive index specification with 10+ named indexes covering all identified hot paths: orbits (CZML generation), reentry_predictions (latest per object, partial), alert_events (unacknowledged per org, partial), jobs (queued, partial), refresh_tokens (live only, partial), PostGIS GiST indexes on all geometry columns, tle_sets (latest per object), security_logs (user+time). F1 confirmed as covered — no further action required.

F2 — PgBouncer pool size not derived from workload Fix applied (§3.2 technology table): Derivation rationale added inline. max_client_conn=200 derived from: 2 backend × 40 async + 4 sim workers × 16 + 2 ingest × 4 = 152 peak, 200 burst headroom. default_pool_size=20 derived from max_connections=50 with 5 reserved for superuser. Validation query (SHOW pools; cl_waiting > 0 = undersized) documented.

F3 — N+1 query risk in catalog and alert APIs Already addressed: §16 (CZML and API performance section) already specifies ORM loading strategies: selectinload for Event Detail and active alerts; raw SQL with explicit JOIN for CZML catalog bulk fetch (ORM overhead unacceptable at 864k rows). F3 confirmed as covered — no further action required.

F4 — Redis cache eviction policy not specified Already addressed: §16 Redis key namespace table specifies noeviction for celery:* and redbeat:*, allkeys-lru for cache:*, volatile-lru for ws:session:*. Separate Redis DB indexes mandated. F4 confirmed as covered — no further action required.

F5 — CZML cache invalidation strategy incomplete Fix applied (§16): Invalidation trigger table added (TLE re-ingest, propagation completion, new prediction, admin flush, cold start). Stale-while-revalidate strategy specified: stale key served immediately on primary expiry; background recompute enqueued; max stale age 5 minutes. warm_czml_cache Celery task specified for cold start and DR failover; estimated 30–60 seconds for 600 objects. Cold-start warm-up added to DR RTO calculation.
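The stale-while-revalidate read path can be sketched as follows; the dict cache and list queue stand in for Redis and the Celery recompute task, and the function name is illustrative:

```python
STALE_MAX_AGE = 300.0  # seconds; the 5-minute max stale age from §16

def read_czml(cache, recompute_queue, key, now):
    """Stale-while-revalidate read path (sketch).

    `cache` maps key -> (payload, fresh_until); `recompute_queue` stands in
    for the Celery queue that rebuilds the entry in the background.
    """
    entry = cache.get(key)
    if entry is None:
        return None  # cold miss: caller computes and fills the cache
    payload, fresh_until = entry
    if now <= fresh_until:
        return payload  # fresh hit
    if now <= fresh_until + STALE_MAX_AGE:
        recompute_queue.append(key)  # enqueue background recompute
        return payload  # serve stale immediately
    return None  # beyond the stale window: treat as a miss
```

Serving the stale copy while one background task recomputes is what prevents a 600-object TLE batch from triggering 600 simultaneous cache stampedes.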

F6 — Celery worker_prefetch_multiplier not tuned Fix applied (celeryconfig.py): worker_prefetch_multiplier = 1 added with rationale comment. Long MC tasks (up to 240s) with default prefetch=4 cause worker starvation. Prefetch=1 ensures fair task distribution across all available workers.

F7 — No database query plan governance Fix applied (§9.4 PostgreSQL parameters): log_min_duration_statement: 500 and shared_preload_libraries: timescaledb,pg_stat_statements added to patroni.yml. Query plan governance process specified: weekly top-10 slow query report from pg_stat_statements; any query in top-10 for two consecutive weeks requires PR with EXPLAIN ANALYSE and index addition or documented acceptance.
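The weekly report described above can be produced with a query along these lines (column names per the pg_stat_statements view in PostgreSQL 13+):

```sql
-- Weekly top-10 slow query report feeding the governance review.
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```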

F8 — Static asset delivery strategy undefined Fix applied (§34.2 Caddyfile): Three-tier static asset strategy added. /_next/static/*: Cache-Control: public, max-age=31536000, immutable (safe — Next.js content-hashes filenames). /cesium/*: Cache-Control: public, max-age=604800 (7 days; not content-hashed). HTML routes: Cache-Control: no-store (force re-fetch after deploy). Rationale: immutable caching only safe for content-hashed assets; HTML must never be cached.

F9 — Horizontal scaling trigger thresholds not defined Fix applied (§3.2 new table): Scaling trigger threshold table added covering backend CPU (>70% for 30min), WS connections (>400 sustained), simulation queue depth (>50 for 15min), MC p95 latency (>180s), DB CPU (>60% for 1h), disk usage (>70%), Redis memory (>60%). All triggers initiate a scaling review meeting, not automatic action. Decisions logged in docs/runbooks/capacity-limits.md.

F10 — TimescaleDB chunk interval not specified Already addressed: §9.4 specifies chunk intervals for all hypertables with derivation rationale table: orbits 1 day (72h CZML window spans 3 chunks), tle_sets 1 month (compression ratio), space_weather 30 days (low write rate), adsb_states 4 hours (24h rolling window). F10 confirmed as covered — no further action required.

F11 — No query timeout or statement timeout policy Fix applied (§9.4): ALTER ROLE spacecom_analyst SET statement_timeout = '30s' and ALTER ROLE spacecom_readonly SET statement_timeout = '30s'. Applied at role level so it persists regardless of connection source. User-facing error message specified for timeout exceeded. Operational roles excluded (they have idle_in_transaction_session_timeout as global backstop only).


58.2 Sections Modified

| Section | Change |
| --- | --- |
| §3.2 Service Breakdown | PgBouncer pool size derivation rationale; horizontal scaling trigger threshold table |
| §9.4 TimescaleDB Configuration | log_min_duration_statement, pg_stat_statements in patroni.yml; query plan governance process; analyst role statement_timeout; idle_in_transaction_session_timeout comment |
| §16 CZML / Cache | Invalidation trigger table; stale-while-revalidate strategy; warm_czml_cache cold-start task |
| §34.2 Caddyfile | Three-tier static asset Cache-Control strategy; HTML no-store mandate |
| celeryconfig.py | worker_prefetch_multiplier = 1 with rationale |

58.3 New Tables and Files

| Artefact | Purpose |
| --- | --- |
| docs/runbooks/capacity-limits.md | Scaling decision log; WS ceiling documentation; capacity trigger thresholds |
| worker/celeryconfig.py | Updated with worker_prefetch_multiplier = 1 |

58.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
| --- | --- |
| Default Celery prefetch_multiplier=4 with long tasks | prefetch_multiplier=1 for MC jobs; fair distribution across workers |
| Single Redis maxmemory-policy for broker + cache | Separate DB indexes with noeviction for broker, allkeys-lru for cache |
| HTML pages with Cache-Control: public, max-age=... | no-store for HTML; immutable only for content-hashed static assets |
| Analyst queries without timeout | statement_timeout=30s at role level; prevents replica exhaustion cascading to primary |
| Monitoring slow queries without a review process | Weekly pg_stat_statements top-10 review; two-week persistence triggers mandatory PR |
| Scaling triggers defined as "when it feels slow" | Metric thresholds with sustained durations; documented decision log for audit trail |

58.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
| --- | --- | --- | --- |
| worker_prefetch_multiplier | 1 | 4 (default) | Long MC tasks (up to 240s) make default prefetch cause severe worker imbalance; prefetch=1 adds trivial latency (one extra Redis round-trip) per task |
| Analyst timeout | 30 seconds at role level | Global statement_timeout | Global timeout would cancel legitimate long-running operations like backup restore tests and migration backfills; role-scoped is surgical |
| CZML stale-while-revalidate max age | 5 minutes | 0 (no stale) | Without stale window, TLE batch ingest (600 objects) causes 600 simultaneous cache stampedes; 5-minute stale window amortises recompute over the natural ingest cadence |
| Static asset caching | Immutable for /_next/static/, 7 days for /cesium/, no-store for HTML | Uniform TTL | Content-hash presence determines whether immutable is safe; non-uniform strategy is correct, not inconsistent |

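The stale-while-revalidate decision above can be made concrete with a small serving-policy sketch. Function and field names are illustrative assumptions, not the plan's cache API; the 5-minute window is the documented value:

```python
STALE_WINDOW_S = 300  # 5-minute stale-while-revalidate window from the decision log

def serve_czml(entry, now, recompute_scheduled):
    """Decide how to serve one cached CZML document (sketch).

    entry: (payload, fresh_until_epoch_s) or None for a cold miss.
    Returns (payload_or_None, action_label).
    """
    if entry is None:
        return None, "recompute_blocking"  # cold miss: caller must compute inline
    payload, fresh_until = entry
    if now <= fresh_until:
        return payload, "serve_fresh"
    if now <= fresh_until + STALE_WINDOW_S:
        # Serve stale immediately; schedule at most one background recompute so a
        # 600-object TLE batch ingest cannot stampede the corridor pipeline.
        action = "serve_stale" if recompute_scheduled else "serve_stale_and_revalidate"
        return payload, action
    return None, "recompute_blocking"  # beyond the stale window: too old to serve
```

The key property is that during the stale window every request is answered from cache while exactly one recompute runs, which is what amortises the recompute over the ingest cadence.
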
§59 DevOps / CI-CD Pipeline — Specialist Review

Hat: DevOps / CI-CD Pipeline · Findings reviewed: 11 · Sections modified: §30.2, §30.3, §30.7 (new) · Date: 2026-03-24


59.1 Findings and Fixes Applied

F1 — CI pipeline job dependency graph not specified. Fix applied (§30.7 new): Full GitHub Actions pipeline specified with an explicit needs: dependency graph enforcing the order: lint → (test-backend, test-frontend, migration-gate in parallel) → security-scan → build-and-push → deploy-staging → deploy-production. Parallel jobs where safe; sequential where correctness requires it.

F2 — No environment promotion gate between staging and production. Already addressed: §30.4 specifies the staging environment spec and data policy. The ADR at §30.6 records the decision: "production deploy requires manual approval gate after staging smoke tests pass." The new §30.7 workflow formalises this as a GitHub protected production environment with required reviewers. Confirmed as covered and formalised.

F3 — Secrets in CI not audited or rotated. Fix applied (§30.3): CI secrets register table added with 8 entries covering all pipeline secrets. Each entry specifies: environment scope, owner, rotation schedule (90–180 days), and blast radius on leak. Quarterly audit procedure using the GitHub Actions secrets inventory documented. Rotation procedure for environment-scoped secrets specified.

F4 — Docker image tags without immutability guarantee. Fix applied (§30.2): Production docker-compose.yml now pins images by tag@digest rather than tag alone. make update-image-digests script added to the CI post-build pipeline. Container-registry retention policy table added covering 5 image categories. Lifecycle policy documented in docs/runbooks/image-lifecycle.md.

F5 — No build provenance or SBOM in CI pipeline. Fix applied (§30.7): cosign sign --yes step added to the build-and-push job using Sigstore keyless signing (OIDC identity from GitHub Actions). SBOM artefacts are attached to the pipeline and copied into the compliance artefact store. The deploy-time cosign verify step remains the verification gate.

F6 — Pre-commit hooks not enforced in CI. Already addressed: §30.1 explicitly states "The same hooks run locally (via pre-commit) and in CI (lint job)." The new §30.7 workflow formalises this as pre-commit run --all-files in the lint job with a dedicated cache. F6 confirmed as covered and formalised.

F7 — No automated rollback trigger. Already addressed: §26.9 blue-green deploy script (step 6) already checks spacecom:api_availability:ratio_rate5m < 0.99 after a 5-minute monitoring window and executes the Caddy upstream rollback atomically if the threshold is breached. F7 confirmed as covered.

F8 — Deployment pipeline does not check for active CRITICAL events. Fix applied (§30.7): check no active CRITICAL alert step added to both deploy-staging and deploy-production jobs. Calls GET /readyz and checks the alert_gate field. "blocked" aborts the deploy with a clear error message. Emergency override requires two production-environment approvals and is logged to security_logs.

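The F8 gate logic can be sketched as a pure decision function. The alert_gate field on /readyz and the two-approval override come from the finding text; the function name and fail-closed default are illustrative assumptions:

```python
def deploy_allowed(readyz: dict, approvals: int = 0) -> tuple[bool, str]:
    """Evaluate the CI alert gate from F8 (sketch).

    readyz: parsed JSON body of GET /readyz. Fails closed if the
    alert_gate field is missing — an assumption, not a documented rule.
    """
    gate = readyz.get("alert_gate", "blocked")
    if gate != "blocked":
        return True, "gate clear"
    if approvals >= 2:
        # Documented emergency path: two production-environment approvals,
        # with the override logged to security_logs by the real pipeline.
        return True, "emergency override"
    return False, "active CRITICAL event — deploy aborted"
```
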
F9 — No branch protection or merge queue specification. Already addressed: §13.6 (CONTRIBUTING.md spec from §54) specifies: "No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push." The §30.7 workflow defines all required status checks (lint, test-backend, test-frontend, migration-gate, security-scan) which the branch protection rule references. F9 confirmed as covered.

F10 — Docker layer cache strategy not documented for CI. Fix applied (§30.7): Build cache strategy formalised in the build-and-push job using docker/build-push-action with cache-from: type=registry and cache-to: type=registry,mode=max targeting the GHCR buildcache tag. pip wheel cache keyed on requirements.txt hash. npm cache keyed on package-lock.json hash. Both use actions/cache@v4.

F11 — No database migration CI gate. Fix applied (§30.7 migration-gate job): Three-step gate on all PRs touching migrations/: (1) timed forward migration — fails if > 30 s; (2) reverse migration alembic downgrade -1 — fails if not reversible; (3) alembic check — fails if model/migration divergence. Gate runs in parallel with test jobs to minimise critical path impact.

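The three-step gate combines into a single pass/fail verdict. A sketch of that combination logic (the 30 s budget is the documented threshold; the function shape and failure messages are assumptions — the real gate would obtain its inputs by running alembic in CI):

```python
def migration_gate(forward_seconds: float, downgrade_ok: bool, models_in_sync: bool):
    """Combine the three F11 checks into (passed, failures) — sketch only.

    forward_seconds: wall-clock time of `alembic upgrade head`
    downgrade_ok:    whether `alembic downgrade -1` succeeded
    models_in_sync:  whether `alembic check` found no divergence
    """
    failures = []
    if forward_seconds > 30.0:
        failures.append(f"forward migration took {forward_seconds:.0f}s (> 30s budget)")
    if not downgrade_ok:
        failures.append("alembic downgrade -1 failed — migration not reversible")
    if not models_in_sync:
        failures.append("alembic check failed — models diverge from migrations")
    return (len(failures) == 0, failures)
```
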

59.2 Sections Modified

| Section | Change |
|---|---|
| §30.2 Multi-Stage Dockerfile | Image digest pinning spec; GHCR retention policy table; make update-image-digests |
| §30.3 Environment Variable Contract | CI secrets register table; rotation schedule; quarterly audit procedure |
| §30.7 (new) GitHub Actions Workflow | Full CI YAML with needs: graph; all 8 jobs; cosign sign; migration-gate; alert gate step; environment-gated production deploy |

59.3 New Tables and Files

| Artefact | Purpose |
|---|---|
| .github/workflows/ci.yml | Canonical CI pipeline — 8 jobs with explicit dependency graph |
| scripts/smoke-test.py | Post-deploy smoke test (already referenced in §26.9; now mandatory gate in CI) |
| scripts/update-image-digests.sh | Patches docker-compose.yml with tag@digest after each build |
| docs/runbooks/image-lifecycle.md | GHCR retention policy; lifecycle policy config procedure |
| docs/runbooks/detect-secrets-update.md | Correct baseline update procedure (already referenced in §30.1) |

59.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Jobs without needs: run in parallel by default | Explicit needs: chains; test jobs must precede build; build must precede deploy |
| Mutable image tags in production | Compose tag@digest pinning; make update-image-digests in post-build CI step |
| Long-lived CI credentials for registry push | OIDC GITHUB_TOKEN (per-job, automatic); no static GHCR_TOKEN secret needed |
| Signing at deploy-time only (cosign verify) | Sign at build-time (cosign sign); verify at deploy; both steps required for supply chain integrity |
| Deploying during active CRITICAL alert | alert_gate check in CI deploy steps; emergency override requires two approvals and is logged |
| Migrations tested only by running them forward | Three-step gate: forward (timed) + reverse (reversibility) + alembic check (model sync) |

59.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| OIDC for GHCR auth | GITHUB_TOKEN OIDC (per-job) | Static GHCR_TOKEN secret | Static tokens don't expire; OIDC tokens are per-job and cannot be reused outside the workflow |
| cosign keyless signing | Sigstore keyless (OIDC identity) | Private key signing | Keyless signing ties the signature to the GitHub Actions OIDC identity; no long-lived private key to rotate or leak |
| Alert gate scope | Blocks CRITICAL and HIGH unacknowledged alerts from non-internal orgs | All alerts | Internal test org alerts should not block production operations; unacknowledged = operator hasn't seen it yet |
| migration-gate triggers | Only on PRs touching migrations/ | Every PR | Running alembic upgrade head on every PR adds 60–90 seconds to CI for PRs that don't touch the schema; path filter reduces cost |

§60 Human Factors / Operational UX — Specialist Review

Hat: Human Factors / Operational UX · Findings reviewed: 11 · Sections modified: §28.1, §28.3, §28.5a, §28.6, §28.9 (new) · Date: 2026-03-24


60.1 Findings and Fixes Applied

F1 — No alarm management philosophy documented. Fix applied (§28.3): EEMUA 191 / ISA-18.2 alarm management KPI table added with 5 quantitative targets: alarm rate (< 1/10 min), nuisance rate (< 1%), stale CRITICAL (0 unacknowledged > 10 min), alarm flood threshold (< 10 CRITICAL in 10 min), chattering alarms (0). Measured quarterly by Persona D; included in ESA compliance artefact package.

F2 — Alarm flood scenario not bounded. Fix applied (§28.3): Batch TIP flood protocol added. Triggers at >= 5 new TIP messages in 5 minutes. Protocol: highest-priority object gets CRITICAL banner; objects 2–N are suppressed; single HIGH "Batch TIP event: N objects" summary fires; per-object alerts queue at <= 1/min after 5-minute operator grace period. batch_tip_event record type added to alert_events. Thresholds configurable per-org within safety bounds.

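The flood protocol can be sketched as a pure triage function over one 5-minute window. The 5-message threshold and the banner/summary/suppress split are the documented behaviour; the tuple shape and names are illustrative assumptions:

```python
def triage_batch_tip(tip_objects, flood_threshold=5):
    """Apply the batch TIP flood protocol to one 5-minute window (sketch).

    tip_objects: list of (object_id, priority_rank), lower rank = higher priority.
    Returns (banner_objects, summary_text_or_None, suppressed_objects).
    """
    ranked = sorted(tip_objects, key=lambda o: o[1])
    if len(ranked) < flood_threshold:
        return ranked, None, []  # below flood threshold: alert each object normally
    banner = [ranked[0]]         # highest-priority object gets the CRITICAL banner
    suppressed = ranked[1:]      # objects 2–N suppressed; queued <= 1/min after grace
    summary = f"Batch TIP event: {len(ranked)} objects"
    return banner, summary, suppressed
```
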
F3 — Mode confusion risk unmitigated. Already addressed: §28.2 specifies six mode error prevention mechanisms including persistent mode indicator, mode-switch confirmation dialog with consequence statements, temporal wash for future-preview, simulation disable during active events, audio suppression in non-LIVE modes, and simulation record segregation. F3 confirmed as covered.

F4 — Handover workflow does not account for SA transfer. Fix applied (§28.5a): Structured SA transfer prompt table added. Five prompts mapping to Endsley SA levels: active objects (L1 perception), operator assessment (L2 comprehension), expected development (L3 projection), actions taken (decision context), and handover flags (situational context). Prompts are optional but completion rate tracked as HF KPI. Non-blocking warning on submission without completion.

F5 — Acknowledgement does not distinguish seen from assessed. Already addressed: §28.5 structured acknowledgement categories distinguish MONITORING (seen, no action) from NOTAM_ISSUED, COORDINATING, ESCALATING (assessed and acted). The category taxonomy maps directly to perception vs. comprehension+projection. F5 confirmed as covered.

F6 — No specification for decision prompt content. Fix applied (§28.6): DecisionPrompt TypeScript interface specified with four mandatory fields: risk_summary (<= 20 words, no jargon), action_options (role-specific), time_available (decision window before FIR intersection), consequence_note (optional). Example instance for re-entry/FIR scenario provided. Pre-authored prompt library in docs/decision-prompts/; annual ANSP SME review required.

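A Python mirror of the DecisionPrompt shape, to make the field contract concrete. The four field names come from F6; the word-count validation is an illustrative reading of the "<= 20 words" constraint, not the specified TypeScript implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecisionPrompt:
    """Sketch mirroring the §28.6 DecisionPrompt TypeScript interface."""
    risk_summary: str                      # <= 20 words, plain language, no jargon
    action_options: list                   # role-specific action labels
    time_available: str                    # decision window before FIR intersection
    consequence_note: Optional[str] = None # optional consequence statement

    def __post_init__(self):
        # Assumed enforcement of the documented 20-word limit.
        if len(self.risk_summary.split()) > 20:
            raise ValueError("risk_summary must be 20 words or fewer")
```
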
F7 — Globe information hierarchy not specified. Fix applied (§28.1): Seven-level visual information hierarchy table added with mandatory rendering order. Priority 1 (CRITICAL object): flashing red octagon + always-visible label. Priority 2 (HIGH): amber triangle. Down to Priority 7 (ambient objects): white dots on hover only. Rule: no lower-priority element may be visually more prominent than a higher-priority element. Non-negotiable safety requirement — overrides CesiumJS performance optimisations that reorder draw calls.

F8 — No fatigue or cognitive load accommodation. Fix applied (§28.3): Server-side fatigue monitoring rules added. Four triggers: CRITICAL unacknowledged > 10 min — supervisor push+email; HIGH unacknowledged > 30 min — supervisor push; inactivity during active event (45 min) — operator+supervisor push; session age > shift_duration_hours — non-blocking operator reminder. All notifications logged to security_logs. Escalates to SpaceCom internal ops if no supervisor role configured.

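The four fatigue triggers can be sketched as one evaluation function. Thresholds are the documented values; the input parameters and returned action labels are illustrative assumptions:

```python
def fatigue_escalations(critical_unacked_min, high_unacked_min,
                        idle_during_event_min, session_age_h, shift_duration_h):
    """Evaluate the four server-side fatigue triggers from F8 (sketch).

    Returns the list of escalation actions that should fire; the real system
    would also log each notification to security_logs.
    """
    actions = []
    if critical_unacked_min > 10:
        actions.append("supervisor_push_and_email")    # CRITICAL unacked > 10 min
    if high_unacked_min > 30:
        actions.append("supervisor_push")              # HIGH unacked > 30 min
    if idle_during_event_min >= 45:
        actions.append("operator_and_supervisor_push") # inactivity during active event
    if session_age_h > shift_duration_h:
        actions.append("operator_reminder")            # non-blocking shift-age reminder
    return actions
```
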
F9 — Degraded mode display not actionable. Already addressed: §28.8 (Degraded-Data Human Factors) specifies per-degradation-type visual indicators with operator action required. §1315 specifies operational guidance text per degradation type. Acceptance criteria (§6056) requires integration test for each type. F9 confirmed as covered.

F10 — No operator training specification. Fix applied (§28.9 new): Full operator training programme specified. Six modules (M1-M6), 8 hours total minimum. M2 reference scenario defined. Recurrency requirements: annual 2-hour refresher + scenario repeat. operator_training_records schema added. GET /api/v1/admin/training-status endpoint added. Training material ownership and annual review cycle defined.

F11 — Audio alert design not fully specified. Fix applied (§28.3): Audio spec expanded with EUROCAE ED-26 / RTCA DO-256 advisory alert compliance. Tones specified: 261 Hz (C4) + 392 Hz (G4), 250 ms each with 20 ms fade. Re-alert on missed acknowledgement: replays once at 3 minutes; no further audio beyond second play (supervisor notification handles further escalation). Volume floor in ops room mode: minimum 40%. Per-session mute resets on next login.


60.2 Sections Modified

| Section | Change |
|---|---|
| §28.1 Situation Awareness | Globe visual information hierarchy table (7 levels, mandatory rendering order) |
| §28.3 Alarm Management | EEMUA 191 KPI table; batch TIP flood protocol; fatigue monitoring rules; audio spec expanded with EUROCAE ref, re-alert rule, volume floor |
| §28.5a Shift Handover | Structured SA transfer prompts (5 prompts, 3 SA levels); completion tracking |
| §28.6 Cognitive Load Reduction | Decision prompt TypeScript interface + example; pre-authored library governance |
| §28.9 (new) Operator Training | 6-module programme; reference scenario; recurrency; operator_training_records schema; API endpoint |

60.3 New Tables and Files

| Artefact | Purpose |
|---|---|
| operator_training_records | Training completion records per user/module |
| docs/training/ | Training module content directory |
| docs/training/reference-scenario-01.md | Standardised M2 reference scenario |
| docs/decision-prompts/ | Pre-authored decision prompt library (per scenario type) |
| GET /api/v1/admin/training-status | Org-admin view of operator training completion |

60.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Single "data may be delayed" degraded banner | Per-degradation-type badges with operator action required; graded response rules |
| Free-text only handover notes | Structured SA transfer prompts + notes; prompts tracked as HF KPI |
| Audio alert that loops indefinitely | Plays once; re-alerts once at 3 min; further escalation is supervisor notification, not more audio |
| Acknowledgement with 10-character text minimum | Structured category selection — captures intent, not just compliance |
| Unlimited alarm rate during batch TIP events | Batch flood protocol: suppress objects 2–N, queue at <= 1/min after grace period |
| Globe with equal visual weight for all elements | 7-level mandatory hierarchy; safety-critical objects pre-attentively distinct at all zoom levels |

60.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Alarm KPI standard | EEMUA 191 adapted for ATC | Process-control standard verbatim | EEMUA 191 is process-control oriented; ATC operations have different alarm rate expectations; adaptation noted explicitly |
| Re-alert timing | Once at 3 minutes | Continuous loop / never re-alert | Loop causes habituation; never re-alerting risks missed CRITICAL in a noisy environment; single replay at 3 min is the minimum effective prompt |
| SA transfer prompts | Optional with completion tracking | Mandatory (blocks handover submission) | Mandatory completion under time pressure produces checkbox compliance, not genuine SA transfer; optional + tracked provides accountability without creating a safety-defeating blocker |
| Operator training blocking | Flag but not block access | Auto-block untrained users | ANSP retains operational responsibility; SpaceCom cannot unilaterally block a certified ATC professional; flag + report gives ANSP the information to manage their own training compliance |

§61 Aviation & Space Regulatory Compliance — Specialist Review

61.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No formal safety case structure — argument/evidence/claims framework absent | High | §24.12 — Safety case with GSN argument structure, evidence nodes, and claims added; docs/safety/SAFETY_CASE.md |
| 2 | SAL assignment under ED-153/DO-278A not documented — no formal assurance level per component | High | §24.13 — SAL assignment table: SAL-2 for physics, alerts, HMAC, CZML; SAL-3 for auth and ingest; docs/safety/SAL_ASSIGNMENT.md |
| 3 | Hazard log lacked structured format — no ID, cause/effect decomposition, risk level, or governance | Medium | §24.4 — Hazard register restructured with 7 hazards (HZ-001 to HZ-007), structured fields, governance rules, and EUROCAE ED-153 risk matrix |
| 4 | Safety occurrence reporting procedure lacked formal structure — ANSP notification, evidence preservation, and regulatory notification flow not defined | High | §26.8a — Full safety occurrence reporting procedure with trigger conditions, 8-step response, SQL table, and clear negative scope |
| 5 | ICAO data quality mapping incomplete — Completeness attribute absent; no formal data category and classification fields in API response | Medium | §24.3 — Completeness attribute added; formal ICAO data category/classification fields specified; accuracy characterisation as Phase 3 gate |
| 6 | Verification independence not specified — no CODEOWNERS, PR review rule, or traceability for SAL-2 components | High | §17.6 — CODEOWNERS for SAL-2 paths, 2-reviewer requirement, qualification criteria, traceability to safety case evidence |
| 7 | No configuration management policy for safety-critical artefacts — source files, safety documents, and validation data not formally under CM | High | §30.8 — CM policy covering 10 artefact types, release tagging script, signed commits, deployment register, CODEOWNERS for docs/safety/ |
| 8 | Means of Compliance document not planned — no mapping from regulatory requirement to implementation evidence | Medium | §24.14 — MoC document structure with 7 initial MOC entries, status tracking, and Phase 2/3 gates |
| 9 | Post-deployment safety monitoring programme absent — no ongoing accuracy monitoring, safety KPIs, or model version monitoring | High | §26.10 — Four-component programme: prediction accuracy monitoring, safety KPI dashboard, quarterly safety review, model version monitoring |
| 10 | ANSP-side obligations not documented — SpaceCom's safety argument assumes ANSP actions that are never formally communicated | Medium | §24.15 — ANSP obligations table by category; SMS guide document; liability assignment note linking to safety case |
| 11 | Regulatory sandbox liability not formally characterised — who bears liability during trial, what insurance is required, sandbox ≠ approval | Medium | §24.2 — Sandbox liability provisions: no operational reliance clause, indemnification cap, insurance requirement, regulatory notification duty, explicit statement that sandbox ≠ regulatory approval |

Already addressed — no further action required:

  • NOTAM interface and disclaimer (§24.5 — covered in prior sessions)
  • Space law retention obligations (§24.6 — 7-year retention already specified)
  • EU AI Act compliance obligations (§24.10 — fully covered including Art. 14 human oversight statement)
  • Regulatory correspondence register (§24.11 — covered)

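Finding 3's structured hazard log is essentially a record contract. A hedged sketch of the entry shape (field names are an assumed reading of "ID, cause, effect, mitigations, risk level, status"; the HZ-NNN pattern and the ED-153 risk matrix reference come from the finding text):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HazardLogEntry:
    """Illustrative record shape for docs/safety/HAZARD_LOG.md entries."""
    hazard_id: str    # e.g. "HZ-001" — pattern from the finding text
    cause: str        # causal factor
    effect: str       # hazardous effect
    mitigations: tuple  # references to mitigating controls
    risk_level: str   # per the EUROCAE ED-153 risk matrix
    status: str       # assumed lifecycle label, e.g. "open" / "mitigated"

    def __post_init__(self):
        # Enforce the documented HZ-NNN identifier pattern.
        if not (self.hazard_id.startswith("HZ-") and self.hazard_id[3:].isdigit()):
            raise ValueError("hazard IDs follow the HZ-NNN pattern")
```

A structured shape like this is what makes the hazard log usable as evidence in the GSN safety case, rather than a table of symptoms.
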
61.2 Sections Modified

| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Regulatory sandbox liability provisions (F11): no operational reliance clause, indemnification cap, insurance requirement, sandbox ≠ approval statement |
| §24.3 ICAO Data Quality Mapping | Completeness attribute added (F5); formal ICAO data category and classification table; accuracy characterisation Phase 3 gate |
| §24.4 Safety Management System Integration | Hazard register fully restructured (F3): 7 hazards with IDs, cause/effect, risk levels, governance; system safety classification updated to reference §24.13 SAL assignment |
| §24.11 (after) | New §24.12 Safety Case Framework (F1); §24.13 SAL Assignment (F2); §24.14 Means of Compliance (F8); §24.15 ANSP-Side Obligations (F10) |
| §17.5 (after) | New §17.6 Verification Independence (F6): CODEOWNERS, 2-reviewer rule, qualification criteria, traceability |
| §26.8 Incident Response runbooks | Safety occurrence runbook pointer updated; §26.8a Safety Occurrence Reporting full procedure added (F4) |
| §26.9 (after) | New §26.10 Post-Deployment Safety Monitoring Programme (F9): accuracy monitoring, safety KPI dashboard, quarterly review, model version monitoring |
| §30.7 (after) | New §30.8 Configuration Management of Safety-Critical Artefacts (F7): CM policy table, release tagging, signed commits, deployment register |

61.3 New Documents and Tables

| Artefact | Purpose |
|---|---|
| docs/safety/SAFETY_CASE.md | GSN-structured safety case; living document; version-controlled |
| docs/safety/SAL_ASSIGNMENT.md | Software Assurance Level per component; review triggers |
| docs/safety/HAZARD_LOG.md | Structured hazard log (HZ-001 to HZ-007 and future additions) |
| docs/safety/MEANS_OF_COMPLIANCE.md | Regulatory requirement → implementation evidence mapping |
| docs/safety/ANSP_SMS_GUIDE.md | ANSP obligations and SMS integration guide |
| docs/safety/CM_POLICY.md | Configuration management policy for safety artefacts |
| docs/safety/VERIFICATION_INDEPENDENCE.md | Verification independence policy for SAL-2 components |
| docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md | Quarterly safety review output template |
| legal/SANDBOX_AGREEMENT_TEMPLATE.md | Standard regulatory sandbox letter of understanding |
| legal/ANSP_DEPLOYMENT_REGISTER.md | Configuration baseline per ANSP deployment |
| docs/validation/ACCURACY_CHARACTERISATION.md | Phase 3: formal accuracy statement (ICAO Annex 15) |
| safety_occurrences SQL table | Dedicated log for safety occurrences with full audit fields |
| monitoring/dashboards/safety-kpis.json | Grafana dashboard: 6 safety KPIs with alert thresholds |
| .github/CODEOWNERS additions | SAL-2 source paths + docs/safety/ require custodian review |

61.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| "Advisory only" UI label as sole liability protection | Legal instruments required: MSA, AUP, legal opinion; label is not contractual protection |
| Hazard log as a table of symptoms with no cause/effect structure | Structured hazard log with ID, cause, effect, mitigations, risk level, status — enables safety case argument |
| No distinction between safety occurrence and operational incident | Safety occurrences require a separate response chain (legal counsel, ANSP regulatory notification); conflating with incidents creates regulatory exposure |
| Verification by the author of safety-critical code | SAL-2 requires independent verification — CODEOWNERS enforcement is the implementation mechanism |
| Safety documents outside version control | All safety artefacts are Git-tracked; changes require custodian sign-off via CODEOWNERS; release tags capture safety snapshots |
| Sandbox trial treated as implicit regulatory approval | Explicit language required: sandbox ≠ approval; the ANSP cannot represent a trial as regulatory acceptance |
| Post-deployment safety monitoring as "we'll look at incidents when they happen" | Proactive programme: quarterly review, prediction accuracy tracking, model version monitoring — demonstrates ongoing safe operation |

61.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Safety case notation | Goal Structuring Notation (GSN) | ASCE text-only format | GSN is the standard for DO-178C and ED-153 safety cases; accepted by EASA and ESA reviewers; tooling (Astah, Visio, ArgoSAFETY) exists for formal diagrams when Phase 3 requires it |
| SAL-2 for physics and alerts | SAL-2 (not SAL-1) | SAL-1 (highest) | SAL-1 implies formal methods / formal proofs — disproportionate for decision support software where the ANSP retains authority; SAL-2 balances rigour with development practicality |
| Safety occurrence trigger scope | 4 specific trigger conditions | Any anomaly during operational use | Over-broad triggers desensitise the process; under-broad triggers miss real occurrences; 4 conditions map directly to the identified hazards |
| Post-deployment monitoring cadence | Quarterly safety review | Monthly review / ad hoc | Quarterly balances administrative overhead with meaningful trend data; monthly creates review fatigue for a small team; ad hoc provides no assurance |
| Configuration management of safety documents | Git + CODEOWNERS + release attachments | Dedicated safety management tool | Git is already the source of truth; CODEOWNERS provides access control; release attachments are the simplest artefact preservation mechanism without introducing a new tool |

§62 Geospatial / Mapping Engineering — Specialist Review

62.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No authoritative CRS contract document — frame transitions at each boundary were scattered across multiple sections with no single reference | Medium | §4.4 — CRS boundary table added; docs/COORDINATE_SYSTEMS.md defined as Phase 1 deliverable; antimeridian and pole handling specified |
| 2 | SRID not enforced by CHECK constraint — column type declares SRID 4326 but application code can insert SRID-0 geometries silently | Medium | §9.3 — CHECK constraints added for reentry_predictions, hazard_zones, airspace spatial columns; migration gate lints new spatial columns |
| 3 | No spatial GiST index on corridor polygon columns | High | Already addressed — §9.3 contains GiST indexes for reentry_predictions, hazard_zones, airspace geometry columns. No further action required. |
| 4 | CZML corridor geometry uses fixed 10-minute time-step sampling — under-represents terminal phase where displacement is highest | High | §15.4 — Adaptive sampling function added: 5 min above 300 km, 2 min at 150–300 km, 30 s at 80–150 km, 10 s below 80 km; ADR required for reference polygon regeneration |
| 5 | Antimeridian and pole handling not explicitly specified | Medium | §4.4 — Antimeridian: GEOGRAPHY type confirmed; CZML serialiser must not clamp to ±180°. Polar corridors: ST_DWithin pole proximity check; clip to 89.5° max latitude with POLAR_CORRIDOR_WARNING log |
| 6 | No test verifying PostGIS corridor polygon matches CZML polygon positions | High | §15.4 — test_czml_corridor_matches_postgis_polygon integration test added; marked safety_critical; 10 km bbox agreement tolerance |
| 7 | FIR boundary data source and update policy not documented | Medium | Already addressed — §31.1.3 documents EUROCONTROL AIRAC source, 28-day update procedure, airspace_metadata table, Prometheus staleness alert, readyz integration. No further action required. |
| 8 | Globe clustering merges objects at different altitudes sharing a ground-track sub-point | Medium | §13.2 (globe clustering) — Altitude-aware clustering rule: clustering disabled for any object with re-entry window < 30 days; prevents TIP-active objects from being absorbed into catalog clusters |
| 9 | ST_Buffer distance units ambiguous — degree-based buffer on SRID 4326 geometry produces latitude-varying results | Medium | §9.3 — Correct pattern documented: project to Web Mercator for metric buffer, or use GEOGRAPHY column buffer (natively metre-aware). Wrong pattern explicitly prohibited. |
| 10 | FIR intersection missing bounding-box pre-filter in some query paths | Medium | Already addressed — §9.3 FIR intersection query with && pre-filter and explicit ::geography::geometry cast; CI linter rule added. No further action required. |
| 11 | Altitude display mixes WGS-84 ellipsoidal and MSL datums without labelling — geoid offset (−106 m to +85 m) material at re-entry terminal altitudes | High | §13.5 — Altitude datum labelling table added: orbital context → ellipsoidal; airspace context → QNH; formatAltitude(metres, context) helper; altitude_datum field in prediction API response |

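The adaptive sampling function from finding 4 reduces to a band lookup. The band edges and intervals are the documented values; which band an exact boundary altitude falls into is an assumption of this sketch:

```python
def corridor_sample_interval_s(altitude_km: float) -> int:
    """Adaptive ground-track sampling interval per §15.4 (sketch).

    Coarse where the trajectory evolves slowly, fine in the terminal
    phase where displacement per unit time is highest.
    """
    if altitude_km > 300:
        return 300  # 5 min above 300 km
    if altitude_km > 150:
        return 120  # 2 min at 150–300 km
    if altitude_km > 80:
        return 30   # 30 s at 80–150 km
    return 10       # 10 s below 80 km
```
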
62.2 Sections Modified

| Section | Change |
|---|---|
| §4.4 (new) Coordinate Reference System Contract | CRS boundary table; docs/COORDINATE_SYSTEMS.md reference; antimeridian CZML serialiser note; polar corridor ST_DWithin proximity check and 89.5° clip |
| §4.5 (renumbered from 4.4) Implementation Checklist | Added docs/COORDINATE_SYSTEMS.md deliverable |
| §9.3 Index Specification | SRID CHECK constraints for 3 spatial tables; ST_Buffer correct/wrong patterns; explicit prohibition on degree-unit buffers |
| §13.2 Globe Object Clustering | Altitude-aware clustering rule: disable for decay-relevant objects (window < 30 days) |
| §13.5 Altitude and Distance Unit Display | Altitude datum labelling table (4 contexts); formatAltitude(metres, context) helper spec; altitude_datum API field |
| §15.4 Corridor Generation Algorithm | Adaptive ground-track sampling function (4 altitude bands); ADR requirement for reference polygon regeneration; test_czml_corridor_matches_postgis_polygon integration test |

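The formatAltitude helper's core rule — datum determined by context, never by user preference — can be sketched in a few lines. The context names and exact label strings are assumptions consistent with the datum table; the real spec lives in §13.5:

```python
def format_altitude(metres: float, context: str) -> str:
    """Sketch of the formatAltitude(metres, context) helper from §13.5.

    The datum label is a function of the display context: orbital altitudes
    are WGS-84 ellipsoidal heights; airspace altitudes use the QNH datum.
    """
    if context == "orbital":
        return f"{metres / 1000:.1f} km (ellipsoidal)"  # ellipsoidal height in km
    if context == "airspace":
        return f"{metres:.0f} m QNH"                    # aviation pressure datum
    raise ValueError(f"unknown altitude context: {context}")
```

Raising on an unknown context (rather than defaulting) enforces the rule that no altitude is ever displayed without an explicit datum.
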
62.3 New Documents and Files

| Artefact | Purpose |
|---|---|
| docs/COORDINATE_SYSTEMS.md | Authoritative CRS contract: frame at every system boundary |
| tests/integration/test_corridor_consistency.py | PostGIS vs CZML corridor bbox consistency test (safety_critical) |
| backend/app/utils/altitude.py | formatAltitude(metres, context) helper |

62.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|---|---|
| Fixed 10-minute ground track sampling across all altitudes | Adaptive sampling: coarse above 300 km, fine in terminal phase below 150 km |
| ST_Buffer(geom_4326, 0.5) — degree buffer on geographic column | ST_Buffer(ST_Transform(geom, 3857), 50000) for Mercator metric, or ST_Buffer(geom::geography, 50000) for geodetic metric |
| ST_Intersects(airspace.geometry, corridor) without explicit cast | Always ::geography::geometry cast when mixing GEOGRAPHY and GEOMETRY types; enforced by CI linter |
| Clustering all objects by screen position | Disable CesiumJS EntityCluster for decay-relevant objects; altitude is a critical dimension for orbital objects |
| Altitude labelled as km without datum | Datum is always explicit: (ellipsoidal) or QNH or MSL per context |
| SRID declared in column type only | Add CHECK constraint: CHECK (ST_SRID(geom::geometry) = 4326) — prevents SRID-0 insertion from application layer |

62.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Adaptive sampling bands | 4 bands (> 300 km / 150–300 km / 80–150 km / < 80 km) | Single fine step (30 s) everywhere | Fine step everywhere generates unnecessary data volume in the high-altitude portion where trajectory changes are slow; 4 bands give fidelity where it matters at manageable data volume |
| Antimeridian strategy | GEOGRAPHY type (spherical arithmetic) for corridors | Split polygons at ±180° | Splitting at antimeridian requires downstream consumers (CesiumJS, PostGIS) to handle multi-polygon; GEOGRAPHY avoids the split natively |
| Polar corridor clip at 89.5° | ST_DWithin + clip | Full polar treatment | True polar passages are extremely rare for the tracked object population; full treatment (azimuthal projection, pole-aware alpha-shape) is disproportionate; clip + warning is the pragmatic safe choice |
| Altitude datum labelling | Per-context datum in formatAltitude helper | Global user setting | Datum is physically determined by the altitude context (orbital = ellipsoidal; aviation = QNH), not user preference; a user setting would allow operators to view the wrong datum label |
| Corridor consistency test tolerance | 10 km (0.1°) bbox agreement | Exact match | Sub-pixel globe rendering differences make exact match impractical; 10 km is far below the display resolution at most zoom levels and well below any operationally significant discrepancy |

§63 Real-Time Systems / WebSocket Engineering — Specialist Review

63.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | No message sequence numbers or ordering guarantee | High | Already addressed — seq field in event envelope; ?since_seq= reconnect replay; 200-event / 5-min ring buffer; resync_required on stale gap. No further action required. |
| 2 | No application-level delivery acknowledgement — delivered_websocket = TRUE set at send-time, not client-receipt | High | §4 WebSocket schema — alert.received / alert.receipt_confirmed round-trip for CRITICAL/HIGH; ws_receipt_confirmed column in alert_events; 10s timeout triggers email fallback |
| 3 | Fan-out architecture for multiple backend instances not specified | High | §4 WebSocket schema — Redis Pub/Sub fan-out via spacecom:alert:{org_id} channels; per-instance local connection registry; docs/adr/0020-websocket-fanout-redis-pubsub.md |
| 4 | No client-side reconnection backoff policy | High | Already addressed — src/lib/ws.ts specifies initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2. No further action required. |
| 5 | No state reconciliation protocol after reconnect | High | Already addressed — resync_required event triggers REST re-fetch; ?since_seq= replays up to 200 events. No further action required. |
| 6 | Dead WebSocket connection does not trigger ANSP fallback notification | High | §4 WebSocket schema — on_connection_closed schedules Celery task with 120s / 30s (active TIP) grace; on_reconnect revokes pending task; org primary contact emailed with TIP-aware subject line |
| 7 | No back-pressure or per-client send queue monitoring | High | §4 WebSocket schema — ConnectionManager with per-connection asyncio.Queue; circuit breaker at 50 queued events closes slow-client connection; spacecom_ws_send_queue_overflow_total counter |
| 8 | Offline clients do not see missed alerts surfaced on reconnect | Medium | §4 WebSocket schema — `GET /alerts?since=<ts>&include_offline=true`; received_while_offline: true annotation; localStorage last_seen_ts; amber border visual treatment in notification centre |
| 9 | Multi-tab acknowledgement not synced | Medium | Already addressed — alert.acknowledged event type in WebSocket schema broadcasts to all org connections. No further action required. |
| 10 | No per-org WebSocket connection visibility during TIP events | Medium | §4 WebSocket schema + Observability — spacecom_ws_org_connected and spacecom_ws_org_connection_count gauges; ANSPNoLiveConnectionDuringTIPEvent alert rule; on-call dashboard panel 9 |
| 11 | Caddy idle timeout silently terminates long-lived WebSocket connections | High | §26.9 Caddy configuration — idle_timeout 0 for WebSocket paths; read_timeout 0 / write_timeout 0 on WS reverse proxy transport; flush_interval -1; ping interval < proxy idle timeout rule documented |
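
The F4 backoff parameters (initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2) can be sketched as a delay schedule. This is an illustrative Python rendering of the src/lib/ws.ts policy, not the TypeScript client itself; the function name is hypothetical.

```python
import random

def reconnect_delays(attempts, initial_ms=1000, max_ms=30000,
                     multiplier=2, jitter=0.2, rng=random.random):
    """Delay (ms) before each reconnect attempt: exponential growth, capped at
    max_ms, with ±20% jitter to avoid thundering-herd reconnects after a
    backend restart."""
    delays = []
    for n in range(attempts):
        base = min(initial_ms * multiplier ** n, max_ms)
        spread = base * jitter
        delays.append(base - spread + 2 * spread * rng())
    return delays
```

With jitter centred (rng returning 0.5) the schedule is 1 s, 2 s, 4 s, … capped at 30 s, matching the parameters in the finding above.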

63.2 Sections Modified

| Section | Change |
|---------|--------|
| §4 WebSocket event schema | App-level receipt ACK protocol (F2); Redis Pub/Sub fan-out spec with code (F3); dead-connection ANSP fallback (F6); ConnectionManager back-pressure with per-connection queue (F7); offline missed-alert REST endpoint and notification centre treatment (F8); per-org Prometheus gauges and ANSPNoLiveConnectionDuringTIPEvent alert rule (F10) |
| §26.9 Caddy upstream configuration | WebSocket-specific Caddyfile additions: idle_timeout 0, WS path matcher, read_timeout 0, write_timeout 0, flush_interval -1; ping interval < proxy idle timeout rule (F11) |
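
The F2 receipt round-trip reduces to a simple pattern: send the alert, wait up to 10 s for the client's alert.received, otherwise take the email path. A minimal sketch, with the three callables standing in for the real WebSocket send, receipt future, and email task (all names illustrative):

```python
import asyncio

async def deliver_with_receipt(send, wait_for_receipt, email_fallback, timeout_s=10):
    """Push the alert, then require an application-level receipt within
    timeout_s; on timeout, fall back to email so CRITICAL/HIGH alerts are
    never assumed delivered at send() time."""
    await send()  # push alert over the WebSocket
    try:
        await asyncio.wait_for(wait_for_receipt(), timeout_s)
        return "ws_receipt_confirmed"   # client acknowledged in time
    except asyncio.TimeoutError:
        await email_fallback()          # no receipt -> email path
        return "email_fallback"
```

The return value maps onto the ws_receipt_confirmed column: TRUE only after the client's confirmation, never at send time.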

63.3 New Tables, Metrics, and Files

| Artefact | Purpose |
|----------|---------|
| alert_events.ws_receipt_confirmed | Tracks whether client confirmed receipt of CRITICAL/HIGH alerts |
| alert_events.ws_receipt_at | Timestamp of client receipt confirmation |
| spacecom_ws_send_queue_overflow_total{org_id} | Counter: WS send queue circuit breaker activations |
| spacecom_ws_org_connected{org_id, org_name} | Gauge: whether org has ≥1 active WS connection |
| spacecom_ws_org_connection_count{org_id} | Gauge: count of active WS connections per org |
| ANSPNoLiveConnectionDuringTIPEvent | Prometheus alert rule: warning when ANSP has no WS connection during active TIP |
| On-call dashboard panel 9 | ANSP Connection Status table (below fold) |
| docs/adr/0020-websocket-fanout-redis-pubsub.md | ADR: Redis Pub/Sub for cross-instance WS fan-out |
| docs/runbooks/websocket-proxy-config.md | Runbook: WS proxy timeout configuration for cloud deployments |
| docs/runbooks/ansp-connection-lost.md | Runbook: ANSP with no live connection during TIP event |
| `GET /alerts?since=<ts>&include_offline=true` | Missed-alert reconciliation endpoint |
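
The spacecom_ws_send_queue_overflow_total artefact above corresponds to the F7 back-pressure rule: a slow client's queue is bounded at 50 events, and overflow closes the connection rather than buffering without limit. A minimal sketch of that rule (class and attribute names illustrative; a deque stands in for the asyncio.Queue):

```python
from collections import deque

class ConnectionSendQueue:
    """Per-connection send queue with a circuit breaker: once `limit` events
    are pending, close the connection so the client reconnects and recovers
    via ?since_seq= replay instead of blocking the fan-out loop."""
    def __init__(self, limit=50):
        self.queue = deque()
        self.limit = limit
        self.closed = False
        self.overflow_total = 0  # mirrors spacecom_ws_send_queue_overflow_total

    def enqueue(self, event):
        if self.closed:
            return False
        if len(self.queue) >= self.limit:
            self.closed = True        # trip breaker: force slow client to reconnect
            self.overflow_total += 1  # increment the overflow counter once
            return False
        self.queue.append(event)
        return True
```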

63.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|--------------|------------------|
| delivered_websocket = TRUE set at send() time | App-level receipt ACK with 10s timeout; FALSE triggers email fallback |
| Single fan-out loop blocks on slow client | Per-connection async send queue with circuit breaker; slow client disconnected, not blocking |
| Caddy default idle timeout terminates quiet WS connections | idle_timeout 0 + read_timeout 0 on WS paths; ping interval enforced below proxy timeout |
| No distinction between "connected to SpaceCom" and "receiving alerts during TIP event" | Per-org connection gauge + ANSPNoLiveConnectionDuringTIPEvent alert distinguishes the two |
| resync_required causes silent state restoration with no visual indication | received_while_offline: true annotation + amber border in notification centre |
| Dead socket detected by ping-pong, silently closed | Grace-period Celery task schedules ANSP notification; cancelled on reconnect |
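
The ?since_seq= replay and resync_required behaviours referenced throughout this section follow one rule: replay events newer than the client's last seq from the 200-event ring buffer, or demand a full resync when the gap has scrolled out. A minimal sketch under those assumptions (class name illustrative):

```python
from collections import deque

class EventRingBuffer:
    """200-event replay buffer behind ?since_seq=: contiguous replay when the
    client's gap is still buffered, resync_required when it is not."""
    def __init__(self, capacity=200):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off the front

    def publish(self, seq, event):
        self.buffer.append((seq, event))

    def replay_since(self, since_seq):
        # Oldest buffered seq must be since_seq + 1 or earlier for a gapless replay.
        if self.buffer and self.buffer[0][0] > since_seq + 1:
            return "resync_required"  # gap predates the buffer: REST re-fetch
        return [e for s, e in self.buffer if s > since_seq]
```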

63.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Fan-out mechanism | Redis Pub/Sub | Sticky sessions (consistent hash) | Sticky sessions break blue-green deploys; Pub/Sub is stateless and works with any instance count |
| App-level ACK scope | CRITICAL and HIGH only | All events | Ack overhead for ingest.status and spaceweather.change is disproportionate; only safety-relevant alerts need receipt confirmation |
| Dead connection grace period | 120s normal / 30s active TIP | Immediate notification | False-positive notifications from brief network hiccups destroy operator trust in the system; grace period filters transient drops |
| Back-pressure circuit breaker | Close slow client (force reconnect) | Drop messages silently | Silently dropping alert messages is unacceptable; forced reconnect triggers the ?since_seq= replay mechanism, giving the client another chance to receive the queued events |
| Caddy WS idle timeout | 0 (no timeout) on WS paths only | Global 0 | Non-WS paths benefit from timeout protection against slow HTTP clients; WS paths require persistent connections; path-specific override is the correct scope |

§64 Data Governance & Privacy Engineering — Specialist Review

64.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | No DPIA document — pre-processing obligation for high-risk processing of aviation professionals' behavioural data | High | §29.1 — Full DPIA structure added (EDPB WP248 template, 7 sections, key risk findings identified); legal/DPIA.md designated as Phase 2 gate before EU/UK ANSP shadow activation |
| 2 | Right-to-erasure conflict with 7-year safety retention unresolved | High | Already addressed — §29.3 documents pseudonymisation procedure; Art. 17(3)(b) exemption explicitly invoked. No further action required. |
| 3 | IP addresses stored full-resolution for 7 years — no necessity assessment, no minimisation policy | High | §29.1 — IP retention updated to 90 days full / hash retained for longer period; hash_old_ip_addresses Celery task specified; necessity assessment documented |
| 4 | No Record of Processing Activities (RoPA) document | Medium | Already addressed — §29.1 contains the RoPA table with all required Art. 30 fields; legal/ROPA.md designated as authoritative. No further action required. |
| 5 | Cross-border transfer mechanisms not documented per jurisdiction pair | Medium | Already addressed — §29.5 documents EU default hosting, SCCs for cross-border transfers, Australian APP8, data residency policy in legal/DATA_RESIDENCY.md. No further action required. |
| 6 | Handover notes and acknowledgement text retained as-written indefinitely — free-text personal references not pseudonymised | Medium | §29.3 — pseudonymise_old_freetext Celery task added; 2-year operational retention window; text replaced with [text pseudonymised after operational retention window] |
| 7 | No DSAR procedure or SLA — endpoint exists but no documented process | High | §29.4a — Full DSAR procedure added: 7-step runbook, 30-day SLA, 60-day extension provision, legal/DSAR_LOG.md, export scope defined, exemptions documented |
| 8 | Audit log mixes personal data and integrity records — single table, conflicting retention obligations | High | §29.9 — integrity_audit_log table split out for non-personal operational records (7-year retention); security_logs constrained to user-action types with CHECK; migration plan specified |
| 9 | No formal sub-processor register — sub-processor details scattered across multiple documents | Medium | §29.4 — legal/SUB_PROCESSORS.md register added with 5 sub-processors, transfer mechanism, DPA status; customer notification obligation documented |
| 10 | operator_training_records has no retention or pseudonymisation policy | Medium | §28.9 — Retention policy: active + 2 years post-deletion; user_tombstone column; pseudonymisation task extended to cover training records |
| 11 | ToS acceptance implies consent is the universal lawful basis — incorrect and creates compliance exposure | High | §29.10 — Lawful basis mapping table added (5 processing activities); clarification that ToS acceptance evidences consent only for specific acknowledgements; Privacy Notice requirement restated |
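
The F3 minimisation task follows directly from the 90-day policy: after the necessity window, the full IP is replaced by a salted digest that stays linkable for audit without retaining the raw address. A minimal sketch, assuming rows are dicts with "ip" and "seen_at" fields (the function name matches the hash_old_ip_addresses task above; the row shape is illustrative):

```python
import hashlib
from datetime import datetime, timedelta, timezone

def hash_old_ip_addresses(rows, salt, now=None, keep_days=90):
    """Replace IPs older than keep_days with a salted SHA-256 digest in place.
    Idempotent: already-hashed values (sha256: prefix) are skipped."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    for row in rows:
        if row["seen_at"] < cutoff and not row["ip"].startswith("sha256:"):
            digest = hashlib.sha256((salt + row["ip"]).encode()).hexdigest()
            row["ip"] = f"sha256:{digest}"
    return rows
```

Idempotency matters because the Celery task runs periodically over the same table; the sha256: prefix guard makes re-runs a no-op.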

64.2 Sections Modified

| Section | Change |
|---------|--------|
| §28.9 Operator Training | Training records retention policy and pseudonymisation (F10): 2-year post-deletion window; user_tombstone column; Celery task extension |
| §29.1 Data Inventory | IP address retention updated to 90-day full / hash retained (F3); hash_old_ip_addresses Celery task; IP necessity assessment; DPIA structure expanded to full EDPB WP248 template (F1) |
| §29.3 Erasure Procedure | Free-text field periodic pseudonymisation added (F6): 2-year operational window; pseudonymise_old_freetext Celery task for shift_handovers.notes_text and alert_events.action_taken |
| §29.4 Data Processing Agreements | Sub-processor register table added (F9): 5 sub-processors, locations, transfer mechanisms |
| §29.4a (new) DSAR Procedure | Full 7-step DSAR procedure with 30-day SLA, export scope, exemption documentation (F7) |
| §29.9 (new) Audit Log Separation | integrity_audit_log table split; security_logs constrained to user-action types; migration plan (F8) |
| §29.10 (new) Lawful Basis Mapping | Per-activity lawful basis table; ToS acceptance ≠ universal consent; Privacy Notice requirement (F11) |

64.3 New Documents and Tables

| Artefact | Purpose |
|----------|---------|
| legal/DPIA.md | Data Protection Impact Assessment (EDPB WP248 template) — Phase 2 gate |
| legal/SUB_PROCESSORS.md | Art. 28 sub-processor register with transfer mechanisms |
| legal/DSAR_LOG.md | Log of all Data Subject Access Requests received and fulfilled |
| docs/runbooks/dsar-procedure.md | Step-by-step DSAR handling runbook |
| tasks/privacy_maintenance.py | Celery tasks: hash_old_ip_addresses, pseudonymise_old_freetext (extended to training records) |
| integrity_audit_log table | Non-personal operational audit records separated from security_logs |
| operator_training_records.user_tombstone | Pseudonymisation field for post-deletion training records |
| operator_training_records.pseudonymised_at | Timestamp tracking pseudonymisation |
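
The pseudonymise_old_freetext task listed above implements the F6 rule: free text older than the 2-year operational window is replaced in place, preserving the record while removing personal references. A minimal sketch, assuming records are dicts with "created_at" and "notes_text" fields (field names illustrative; the placeholder string is the one specified in §29.3):

```python
from datetime import datetime, timedelta, timezone

PLACEHOLDER = "[text pseudonymised after operational retention window]"

def pseudonymise_old_freetext(records, now=None, window_days=730):
    """Replace free-text fields past the ~2-year operational window with the
    placeholder, in place. Returns the number of records changed; re-runs
    are no-ops for already-pseudonymised rows."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    changed = 0
    for rec in records:
        if rec["created_at"] < cutoff and rec["notes_text"] != PLACEHOLDER:
            rec["notes_text"] = PLACEHOLDER
            changed += 1
    return changed
```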

64.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|--------------|------------------|
| DPIA treated as optional documentation exercise | Pre-processing legal obligation; EU personal data cannot be processed without completing it first |
| Full IP address retained for 7 years "for security" | 90-day necessity window; hash retained for longer-term audit; necessity assessment documented |
| Single security_logs table for both personal data and operational integrity records | Separate tables with separate retention policies; integrity_audit_log for non-personal records |
| ToS acceptance as universal consent mechanism | Lawful basis is determined by processing purpose; most SpaceCom processing is Art. 6(1)(b) or (f), not consent |
| Sub-processor details spread across multiple documents | Single legal/SUB_PROCESSORS.md register with mandatory Art. 28(3) fields |
| Free-text operational fields retained as-written indefinitely | 2-year operational window then pseudonymisation in place; record preserved, personal reference removed |

64.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| DPIA processing category | Art. 35(3)(b) — systematic monitoring of publicly accessible area | Art. 35(3)(a) — large-scale special category data | No special category data is processed; the systematic monitoring category is the correct trigger given real-time operational pattern tracking of named aviation professionals |
| IP hashing threshold | 90 days | 30 days / 1 year | 90 days covers the active investigation window for the vast majority of security incidents; shorter is unnecessarily restrictive for legitimate investigation; longer retains more than necessary |
| Free-text pseudonymisation window | 2 years post-creation | Immediate deletion / 7-year retention as-written | 2 years covers all active PIR, investigation, and regulatory inquiry periods while removing personal references well before maximum retention; deletion would destroy operational context needed for safety record; 7-year as-written retention is disproportionate |
| Audit log split mechanism | Separate table with CHECK constraint on security_logs | Application-level routing only | Database constraint enforces the separation at ingest time; application routing alone is fragile and will be bypassed as code evolves |
| DSAR response channel | Encrypted ZIP to verified email | In-platform download only | In-platform download is unavailable after account deletion; verified email ensures identity confirmation and provides a paper trail |

Appendix §65 — Cost Engineering / FinOps Hat Review

Hat: Cost Engineering / FinOps
Reviewer focus: Infrastructure cost visibility, unit economics, per-resource attribution, cost anti-patterns, egress waste, idle resource cost

65.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | No unit economics model — impossible to reason about margin per customer tier | HIGH | §27.7 (new) | Added unit economics model with cost-to-serve breakdown and break-even analysis; reference doc docs/business/UNIT_ECONOMICS.md |
| F2 | Storage table lacked cost figures — MC blob cost invisible to planners | MEDIUM | §27.4 | Added Cloud Cost/Year column to storage table; S3-IA pricing for MC blobs; noted dominant cost driver |
| F3 | No metric tracking external API calls (Space-Track budget at risk) | MEDIUM | §27.1 | Added spacecom_ingest_api_calls_total{source} counter; alert when Space-Track calls approach the 100/day AUP limit |
| F4 | No per-org simulation CPU tracking — Enterprise chargeback impossible | MEDIUM | §27.1 | Added spacecom_simulation_cpu_seconds_total{org_id, norad_id} counter; monthly usage report task |
| F5 | CZML egress cost unquantified; no brotli compression mandate | LOW | §27.5 | Added CZML egress cost estimate (~$17/mo at Phase 2–3); brotli compression policy added |
| F6 | Celery worker idle cost not analysed — $1,120/mo regardless of usage | HIGH | §27.3 | Added idle cost analysis; scale-to-zero rejected (violates MC SLO); scale-to-1 KEDA policy for Tier 3 documented |
| F7 | No per-org email rate limit — SMTP quota at risk during flapping events | MEDIUM | §4 (WebSocket/alerts) | Added 50 emails/hour/org rate limit with digest fallback; Celery hourly digest task; cost rationale |
| F8 | Renderer always-on rationale not documented; co-location OOM risk unaddressed | LOW | §35.5 | Added on-demand analysis table; confirmed always-on at Tier 1–2; documented co-location isolation requirement |
| F9 | Backup storage cost not projected — surprise cost at Tier 3 | LOW | §27.4 | Added WAL backup cost projection; $100–200/month at Tier 3 steady state |
| F10 | No Redis memory budget — result backend accumulation can cause OOM | HIGH | §27.8 (new) | Added Redis memory budget table by purpose/DB index; maxmemory 2gb; result_expires=3600 requirement |
| F11 | No per-org cost attribution mechanism for Enterprise tier negotiations | MEDIUM | §27.1 | Added monthly usage report Celery task; per-org CPU-seconds → cost-per-run attribution |
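
The F7 rate limit is a plain hour-window counter: the first 50 alert emails per org per hour go out individually, the rest are queued for the hourly digest. A minimal sketch under those assumptions, with a dict standing in for the Redis counter (class and method names illustrative):

```python
class OrgEmailLimiter:
    """Per-org email rate limit: at most `limit` individual sends per
    (org, hour) window; overflow is deferred to an hourly digest rather
    than dropped, protecting the SMTP quota during flapping events."""
    def __init__(self, limit=50):
        self.limit = limit
        self.counts = {}   # (org_id, hour_bucket) -> emails sent this hour
        self.digest = {}   # org_id -> alerts deferred to the digest

    def send(self, org_id, alert, hour_bucket):
        key = (org_id, hour_bucket)
        if self.counts.get(key, 0) < self.limit:
            self.counts[key] = self.counts.get(key, 0) + 1
            return "sent"
        self.digest.setdefault(org_id, []).append(alert)
        return "digested"
```

In production the counter would be a Redis INCR with an expiring key per hour bucket, which keeps the check O(1) per email as the decision log below notes.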

65.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §27.1 Workload Characterisation | Added cost-tracking Prometheus counters (F3, F4) and per-org usage report task (F11) |
| §27.3 Deployment Tiers | Added Celery worker idle cost analysis and scale-to-zero decision table (F6) |
| §27.4 Storage Growth Projections | Added Cloud Cost/Year column; storage cost summary; backup cost projection (F2, F9) |
| §27.5 Network and External Bandwidth | Added CZML egress cost estimate and brotli compression policy (F5) |
| §27.7 Unit Economics Model (new) | Full unit economics model: cost-to-serve, revenue per tier, break-even analysis (F1) |
| §27.8 Redis Memory Budget (new) | Redis memory budget by purpose; maxmemory setting; result cleanup requirement (F10) |
| §4 WebSocket / Alerts | Added per-org email rate limit (50/hr) with digest fallback; SMTP cost rationale (F7) |
| §35.5 Renderer Container Constraints | Added on-demand analysis; memory isolation rationale; co-location risk guidance (F8) |

65.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| docs/business/UNIT_ECONOMICS.md | Unit economics model; cost-to-serve per tier; break-even analysis; update quarterly |
| docs/infra/REDIS_SIZING.md | Redis memory budget by purpose; eviction policy decisions; sizing rationale |
| docs/business/usage_reports/{org_id}/{year}-{month}.json | Per-org monthly usage reports for Enterprise tier chargeback |
| backend/app/metrics.py (additions) | spacecom_ingest_api_calls_total and spacecom_simulation_cpu_seconds_total counters |
| backend/app/alerts/email_delivery.py | Per-org email rate limiting logic with Redis counter and digest queue |
| backend/celeryconfig.py (addition) | result_expires = 3600 to prevent Redis result backend accumulation |

65.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| Scale-to-zero simulation workers | 60–120s cold-start violates 10-min MC SLO; scale-to-1 minimum is the correct floor |
| Co-locating renderer with simulation workers | Chromium 2–4 GB render memory + MC worker memory = OOM on 32 GB nodes; isolated container required |
| Unbounded alert emails per org | SMTP relay quota exhausted during flapping events; 50/hr cap with digest is operationally equivalent at lower cost |
| Redis without result_expires | MC sub-task result accumulation; 500 sub-tasks × 1 MB = 500 MB peak; without expiry, accumulates across runs indefinitely |
| Single Redis noeviction policy | Blocks cache use alongside broker in same instance; DB-index split with allkeys-lru on cache DB required |

65.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Simulation worker floor | Scale-to-1 minimum at Tier 3 | Scale-to-zero | Cold-start from zero violates 10-min MC SLO; one warm worker absorbs small queues instantly |
| Email rate limit mechanism | Redis hour-window counter + Celery digest task | Database-level throttle / no limit | Redis counter is O(1) per email with sub-millisecond latency; DB throttle adds per-email DB write at high fan-out; no limit is an SMTP quota risk |
| Unit economics granularity | Per-org CPU-seconds via Prometheus | Per-request DB logging | Prometheus counter aggregation has negligible overhead; DB per-request logging at MC sub-task granularity = 500 writes/run |
| Redis maxmemory target | 2 GB (cache.r6g.large with 8 GB RAM) | 4 GB / 1 GB | 2× headroom above the 700–750 MB peak estimate; leaves room for the OS and other processes; alerting below 4 GB fires before OOM |
| CZML compression priority | Brotli before gzip in Caddy encode block | gzip only | Brotli achieves 70–80% reduction vs. gzip's 60–75%; modern browsers universally support brotli; on-premise clients are always browser-based |

Appendix §66 — Open Source / Dependency Licensing Hat Review

Hat: OSS Licensing Engineer
Reviewer focus: Licence obligations for closed-source SaaS, SBOM completeness, redistribution constraints, IP risk in ESA bid context, contractor IP ownership

66.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | CesiumJS AGPLv3 commercial licence not explicitly gated as Phase 1 blocker | CRITICAL | §6 Phase 1 checklist, §29.11 (new) | Added Phase 1 blocking gate requiring cesium-commercial.pdf; dedicated §29.11 F1 section with phase-gate language |
| F2 | SBOM covered container image (syft) but not dependency manifests (pip-licenses/license-checker JSON merge) | HIGH | §26.9 CI table, §6 Phase 1 checklist, §29.11 (new) | Added manifest SBOM merge to build-and-push; docs/compliance/sbom/ as versioned store; Phase 1 gate updated |
| F3 | Space-Track AUP redistribution risk not analysed in detail for API endpoint and credential exposure | MEDIUM | §29.11 (new) | Added two-vector redistribution analysis (API exposure + credential in client-side code); confirmed detect-secrets coverage |
| F4 | poliastro LGPLv3 licence not documented; LGPL dynamic linking compliance undocumented | MEDIUM | §29.11 (new) | Added LGPL compliance assessment; legal/LGPL_COMPLIANCE.md required; standard pip install satisfies LGPL |
| F5 | TimescaleDB dual-licence (TSL vs Apache 2.0) not assessed; risk if TSL-only features adopted | MEDIUM | §29.11 (new) | Added feature-by-feature TimescaleDB licence table; confirmed SpaceCom uses only Apache 2.0 features; re-assessment gate if multi-node adopted |
| F6 | Redis SSPL adoption (7.4+) not assessed; Valkey alternative not documented | MEDIUM | §29.11 (new) | Added SSPL internal-use assessment; legal counsel confirmation required before Phase 3; Valkey/Redis 7.2 as fallback |
| F7 | Playwright/Chromium binary licence not captured in SBOM | LOW | §29.11 (new) | Confirmed Apache 2.0 (Playwright) + BSD-3 (Chromium); captured by syft container scan; no redistribution |
| F8 | Caddy enterprise plugin licence risk not noted; audit process not defined | LOW | §29.11 (new) | Added plugin licence audit requirement; PR checklist for Caddyfile changes |
| F9 | PostGIS GPLv2 linking exception not documented | LOW | §29.11 (new) | Confirmed linking exception applies to PostgreSQL extension use; legal/LGPL_COMPLIANCE.md to document |
| F10 | pip-licenses --fail-on list missing SSPL; no SSPL check on npm side | MEDIUM | §29.11 (new), §7.13 CI step | Added SSPL to Python fail-on list; SSPL added to npm failOn; exact version pinning requirement stated |
| F11 | No CLA or work-for-hire mechanism before contractor contributions | HIGH | §29.11 (new), §6 Phase 2 checklist | Added CLA template requirement (legal/CLA.md); CONTRIBUTING.md disclosure; Phase 2 gate |

66.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §6 Phase 1 legal/compliance checklist | Added CesiumJS commercial licence as explicit blocking gate; expanded SBOM checklist item to cover manifest SBOMs; added LGPL/PostGIS and TimescaleDB/Redis licence document gates |
| §26.9 CI workflow table | Updated build-and-push job to include manifest SBOM merge and docs/compliance/sbom/ artefact storage |
| §29.11 (new) | Full OSS licence compliance section: F1–F11 covering all material dependencies |

66.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| legal/OSS_LICENCE_REGISTER.md | Authoritative per-dependency licence record; updated on major version changes |
| legal/LICENCES/cesium-commercial.pdf | Executed CesiumJS commercial licence — Phase 1 blocking gate |
| legal/LICENCES/timescaledb-licence-assessment.md | TimescaleDB Apache 2.0 vs. TSL feature confirmation |
| legal/LICENCES/redis-sspl-assessment.md | Redis SSPL internal-use assessment; legal counsel sign-off |
| legal/LGPL_COMPLIANCE.md | poliastro LGPL dynamic linking compliance; PostGIS GPLv2 linking exception |
| legal/CLA.md | Contributor Licence Agreement template for external contributors |
| docs/compliance/sbom/ | Versioned SBOM artefacts: syft SPDX-JSON + manifest JSONs per release |
| CONTRIBUTING.md | CLA requirement disclosure; external contributor instructions |

66.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| "CesiumJS licence can wait until Phase 2" | AGPLv3 network use provision applies from the first external demo — waiting creates retroactive non-compliance exposure in an ESA bid context |
| Excluding CesiumJS from the licence gate without a commercial licence on file | CI exclusion hides the issue; the gate is correct only when the commercial licence exists |
| Assuming LGPL dynamic linking is automatically satisfied | Must be documented; LGPL allows relinking — standard pip install satisfies this but the compliance position must be written down |
| Single Redis noeviction policy | Already rejected in §65; Redis SSPL also motivates Valkey evaluation as BSD-3 alternative |
| Assuming all TimescaleDB features are Apache 2.0 | TSL features (multi-node, data tiering) would require a Timescale commercial agreement; feature use must be tracked |

66.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| CesiumJS licence | Commercial licence from Cesium Ion; Phase 1 blocker | Open-source the frontend (comply with AGPLv3) | Source disclosure of SpaceCom's frontend is commercially unacceptable; commercial licence is the only viable path for a closed-source product |
| Redis SSPL response | Legal counsel assessment; Valkey as fallback | Immediate migration to Valkey | Internal-use assessment is likely favourable; premature migration introduces risk; assess first |
| poliastro LGPL | Document standard pip install compliance | Seek MIT-licensed alternative | Standard pip install satisfies LGPL dynamic linking; replacing poliastro would require significant re-engineering for marginal legal gain |
| SBOM format | SPDX-JSON (syft) + pip-licenses/license-checker manifests merged | CycloneDX only | SPDX is the format required by ECSS and EU Cyber Resilience Act; CycloneDX can be generated alongside if required by a specific customer |

Appendix §67 — Distributed Systems / Consistency Hat Review

Hat: Distributed Systems Engineer
Reviewer focus: Consistency guarantees, failure modes, split-brain scenarios, clock skew, ordering, idempotency, CAP trade-offs

67.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|------------------|-------------|
| F1 | Chord callback doesn't validate result count — partial results silently produce truncated predictions | CRITICAL | §27.2 chord section | Added result count guard in aggregate_mc_results; raises ValueError on mismatch; spacecom_mc_chord_partial_result_total counter; DLQ routing |
| F2 | No Celery autoretry_for=(OperationalError,) on DB-writing tasks — Patroni 30s failover window causes permanent task failure | HIGH | §27.6 PgBouncer section | Added autoretry_for=(OperationalError,) policy; max_retries=3, retry_backoff=5, cap 30s; applies to all DB-writing Celery tasks |
| F3 | Redis Sentinel split-brain risk not documented or assessed | MEDIUM | §26 Redis Sentinel section | Added split-brain assessment; accepted risk for ephemeral data; min-replicas-to-write 1 mitigates; ADR-0021 required |
| F4 | HMAC signing race — prediction INSERT then HMAC UPDATE creates window of unsigned prediction | HIGH | §10 HMAC section | Fixed: pre-generate UUID in application before INSERT; compute HMAC with UUID; single-phase write; migration from BIGSERIAL to UUID PK documented |
| F5 | alert_events.seq assigned via MAX(seq)+1 trigger — concurrent inserts produce duplicates | HIGH | §4 WebSocket/events section | Replaced with CREATE SEQUENCE alert_seq_global; globally monotonic; per-org ordering via WHERE org_id = $1 ORDER BY seq |
| F6 | Clock skew between server and client causes CZML ground track timing drift — no detection mechanism | MEDIUM | §4 API section | Added chronyd/timesyncd host requirement; node_timex_sync_status Grafana alert; GET /api/v1/time endpoint; client-side skew warning banner at >5s |
| F7 | MinIO multipart upload has no retry on write quorum failure — MC blob lost silently | HIGH | §27.4 storage section | Added autoretry_for=(S3Error,) with 30s backoff; MinIO ILM rule to abort incomplete multipart uploads after 24h |
| F8 | celery-redbeat double-fire on restart: only TLE ingest has ON CONFLICT DO NOTHING; space weather and IERS EOP lack upsert | MEDIUM | §11 ingest section | Added upsert patterns for all periodic ingest tables; unique constraint requirements stated |
| F9 | WebSocket fan-out cross-channel ordering — no cross-org ordering guarantee | LOW | — | Already addressed — Redis Pub/Sub ordering is per-channel (per-org); sequence numbers provide intra-org ordering. No further action required. |
| F10 | reentry_predictions FK referenced with default CASCADE — accidental simulation delete cascades to legal-hold predictions | HIGH | §9 schema | Changed all REFERENCES reentry_predictions(id) to ON DELETE RESTRICT in alert_events, prediction_outcomes, superseded_by FK |
| F11 | No distributed trace context propagation through chord sub-tasks and callback | MEDIUM | §26.9 OTel section | Added chord trace context injection/extraction pattern; verified CeleryInstrumentor for single tasks; manual propagate.inject/extract for chord callback continuity |
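
The F1 guard is a one-line invariant: a chord callback must refuse to aggregate fewer results than sub-tasks dispatched. A minimal sketch of that check, with a mean standing in for the real statistics (the aggregation shown is illustrative, not the actual MC post-processing):

```python
def aggregate_mc_results(results, expected_count):
    """Chord callback guard: a 400-sample run is not a 500-sample run, so a
    count mismatch raises rather than silently producing a truncated
    prediction. In production the mismatch also increments
    spacecom_mc_chord_partial_result_total and routes the run to the DLQ."""
    if len(results) != expected_count:
        raise ValueError(
            f"partial chord result: got {len(results)} of {expected_count}"
        )
    return sum(results) / len(results)  # placeholder for real aggregation
```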

67.2 Sections Modified

| Section | Change summary |
|---------|----------------|
| §27.2 MC Parallelism | Added chord result count validation in aggregate_mc_results; partial result counter |
| §27.6 DNS / PgBouncer | Added Celery autoretry_for=(OperationalError,) policy for Patroni failover window |
| §26 Redis Sentinel | Added split-brain risk assessment; min-replicas-to-write 1 config; ADR-0021 |
| §10 HMAC signing | Fixed two-phase write race: pre-generate UUID, single-phase INSERT; PK migration note |
| §4 WebSocket schema | Added alert_seq_global PostgreSQL SEQUENCE replacing MAX(seq)+1 trigger |
| §4 API / health | Added GET /api/v1/time clock skew endpoint; NTP sync requirement; client banner |
| §27.4 Storage | Added MinIO multipart upload retry; incomplete upload ILM expiry rule |
| §11 Ingest | Added upsert patterns for space_weather and IERS EOP; unique constraint requirements |
| §9 Data Model | Changed REFERENCES reentry_predictions(id) to ON DELETE RESTRICT on 3 FKs |
| §26.9 OTel/Tracing | Added chord trace context propagation pattern; propagate.inject/extract for callback |
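
The F4 fix above eliminates the unsigned-prediction window by generating the primary-key UUID in the application, covering it with the HMAC, and writing row plus signature in a single INSERT. A minimal sketch of the signing and verification halves, using stdlib hmac over a canonical JSON encoding (function names and field layout are illustrative):

```python
import hashlib
import hmac
import json
import uuid

def build_signed_prediction(payload, key):
    """Single-phase write (F4): the UUID exists before the INSERT, so the HMAC
    can cover it and the row is never stored unsigned."""
    row = {"id": str(uuid.uuid4()), **payload}
    message = json.dumps(row, sort_keys=True).encode()  # canonical encoding
    row["hmac"] = hmac.new(key, message, hashlib.sha256).hexdigest()
    return row  # id and hmac written together in one INSERT

def verify_prediction(row, key):
    """Recompute the HMAC over everything except the signature itself."""
    unsigned = {k: v for k, v in row.items() if k != "hmac"}
    message = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, row["hmac"])
```

Contrast with the rejected two-phase pattern: INSERT then UPDATE leaves a window in which a valid-looking but unsigned prediction exists in the database.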

67.3 New ADRs Required

| ADR | Decision |
|-----|----------|
| docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md | Accept Redis Sentinel split-brain risk for ephemeral data; min-replicas-to-write 1 mitigation; email rate limit counter inconsistency accepted as cost control gap |

67.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|--------------|--------------|
| MAX(seq)+1 for sequence assignment in trigger | Race condition under concurrent inserts — two transactions read the same MAX and both write the same seq; PostgreSQL SEQUENCE is lock-free and gap-tolerant |
| Two-phase HMAC (INSERT then UPDATE) | Creates a window where a valid unsigned prediction exists in the DB; single-phase INSERT with pre-generated UUID eliminates the window |
| No retry on Celery DB tasks during Patroni failover | The 30s failover window is a known operational event; retries with 5s exponential backoff capped at 30s fit entirely within the failover window |
| ON DELETE CASCADE on legal-hold FK references | Accidental deletion of a simulation row would cascade to 7-year-retention safety records; RESTRICT forces explicit deletion of dependents first, making accidental cascade impossible |
| Scale-to-zero with immediate cold-start | Already rejected in §65; the distributed systems perspective adds: cold-start during Patroni failover + worker cold-start = double failure; always keep 1 warm worker |
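
The F2 retry arithmetic is worth making explicit: with retry_backoff=5 and max_retries=3, the waits are 5 s, 10 s, 20 s (capped at 30 s), so the retry sequence spans the ~30 s Patroni failover window instead of failing on the first OperationalError. A sketch of that schedule (the function is illustrative; in the real codebase these are Celery task options, not a helper):

```python
def retry_schedule(max_retries=3, backoff_s=5, cap_s=30):
    """Exponential backoff waits (seconds) for Celery autoretry: 5, 10, 20, …
    each capped at cap_s, mirroring max_retries=3 / retry_backoff=5 / cap 30s."""
    return [min(backoff_s * 2 ** n, cap_s) for n in range(max_retries)]
```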

67.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|-----------------|----------------------|-----------|
| Chord result count validation | ValueError → DLQ → HTTP 500 + Retry-After | Silently write partial result | A 400-sample prediction is not a 500-sample prediction; confidence intervals and corridor widths are wrong; it is safer to fail visibly |
| reentry_predictions PK type | Migrate BIGSERIAL → UUID; pre-generate in application | Keep BIGSERIAL; use two-phase HMAC | UUID pre-generation eliminates the race window; UUID is also a safer choice for distributed deployments where sequence coordination between nodes is not possible |
| alert_seq assignment | Single global alert_seq_global SEQUENCE | Per-org sequences | Single sequence is simpler to manage; global monotonicity is sufficient for per-org ordering by filtering on org_id; per-org sequences require one sequence per org — complex at scale |
| Redis split-brain response | Accept risk; document in ADR | Migrate to Redis Cluster (stronger consistency) | Redis Cluster adds significant operational complexity (hash slots, resharding, client-side routing); split-brain on Sentinel with 3 nodes is rare and the affected data is ephemeral or cost-control only |

Appendix §68 — Commercial / Pricing Architecture Hat Review

Hat: Commercial Strategy / Pricing Architect
Reviewer focus: Pricing model design, deal structure, revenue protection, margin preservation, enterprise negotiation guardrails, commercial signals in technical architecture

68.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | No contracts table — feature access not gated on commercial state; admin can enable Enterprise features with no contract | CRITICAL | §9 data model, §24 commercial section | Added contracts table with financial terms, feature enablement flags, discount approval constraint, PS tracking; nightly sync task |
| F2 | Usage data not surfaced to commercial team or org admins — renewal conversations lack data | HIGH | §27.7 unit economics | Added monthly usage summary emails to commercial team and org admins; send_usage_summary_emails Beat task |
| F3 | No shadow trial time limit — ANSP could remain in shadow mode indefinitely without signing production contract | HIGH | §9 organisations table | Added shadow_trial_expires_at column; enforcement via daily Celery task that auto-deactivates expired trials |
| F4 | No discount approval guard-rails — single admin can give 100% discount | MEDIUM | §9 contracts table | Added CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL) constraint; discount >20% requires named approver |
| F5 | No inbound API request counter — usage-based billing for Persona E/F impossible | MEDIUM | §27.1 metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code}; FastAPI middleware |
| F6 | On-premise deployments have no licence key enforcement — multi-instance or post-expiry use undetectable | HIGH | §34 infrastructure section | Added RSA JWT licence key mechanism; licence-expired degraded mode; hourly Celery re-validation; key rotation script |
| F7 | No contract expiry alerts — contracts expire silently; revenue risk | HIGH | §4 Celery tasks | Added check_contract_expiry Beat task at 90/30/7-day thresholds; courtesy notice to org admin at 30 days |
| F8 | Free/shadow tier has no MC simulation quota — free usage consumes paid-tier worker capacity | MEDIUM | §9 organisations table, §27.7 | Added monthly_mc_run_quota column (default 100); POST /api/v1/decay/predict quota enforcement with 429 + Retry-After |
| F9 | No MRR/ARR tracking — commercial team cannot measure revenue targets | HIGH | §9 contracts table, §27.7 | contracts.monthly_value_cents + spacecom_mrr_eur Prometheus gauge updated nightly; Grafana MRR panel |
| F10 | Professional Services not documented as a revenue line — first-year contract value underestimated | MEDIUM | §27.7 unit economics | Added PS revenue table (engagement types, values); contracts.ps_value_cents; Year 1 total contract value formula |
| F11 | Multi-ANSP coordination panel available to all tiers — high-value Enterprise feature not packaging-protected | MEDIUM | §9 organisations table | Added feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE; gated in UI by feature flag; synced from contracts.enables_multi_ansp_coordination |
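The 90/30/7-day expiry alert logic (F7) reduces to a pure function the daily check_contract_expiry Beat task would call; a minimal sketch, assuming the task runs once per UTC day — the function name and signature are illustrative, not the final task interface:

```python
from datetime import date

# Alert thresholds from F7: escalate at 90, 30, and 7 days before expiry.
THRESHOLDS = (90, 30, 7)

def expiry_alerts_due(expiry: date, today: date) -> list[int]:
    """Return the thresholds (in days) that fire today for one contract.

    A threshold fires only on the exact day the remaining term equals it,
    so a daily Beat task never sends duplicate alerts for the same
    threshold, and an expired contract produces no further alerts.
    """
    days_left = (expiry - today).days
    return [t for t in THRESHOLDS if days_left == t]
```

Keeping the threshold comparison exact (rather than `<=`) is what makes the task idempotent across daily runs without storing per-alert state.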

68.2 Sections Modified

| Section | Change summary |
|---|---|
| §9 organisations table | Added shadow_trial_expires_at, monthly_mc_run_quota, feature_multi_ansp_coordination, licence_key, licence_expires_at columns |
| §9 (new contracts table) | Full contracts table with financial terms, discount approval constraint, feature enablement, PS tracking |
| §24 commercial section | Added contracts table spec, MRR tracking, feature sync task, discount enforcement |
| §27.1 cost-tracking metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code} counter |
| §27.7 unit economics | Added PS revenue table; shadow trial quota enforcement code; usage summary emails |
| §34 on-premise deployment | Added RSA JWT licence key mechanism; degraded mode on expiry; key rotation process |
| §4 Celery Beat tasks | Added check_contract_expiry 90/30/7-day alert task; send_usage_summary_emails monthly task |

68.3 New Files and Documents Required

| File | Purpose |
|---|---|
| docs/business/UNIT_ECONOMICS.md | Updated with PS revenue line, Year 1 total contract value formula, MRR tracking |
| tasks/commercial/contract_expiry_alerts.py | Contract expiry Celery task (90/30/7-day thresholds) |
| tasks/commercial/send_commercial_summary.py | Monthly commercial team usage summary email |
| tasks/commercial/sync_feature_flags.py | Nightly sync of org feature flags from active contracts |
| scripts/generate_licence_key.py | RSA JWT licence key generation script (requires private key) |
| legal/contracts/ | Contract document store (MSA PDFs, signed sandbox agreements) |

68.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|---|---|
| Admin toggle for feature access without contract gate | Single admin can bypass commercial controls; contracts table with nightly sync is the authoritative source |
| Unlimited MC runs for free tier | Free-tier heavy users degrade paid-tier SLO by consuming simulation worker capacity; 100-run/month quota is enforceable without impacting legitimate evaluation |
| Honour-system on-premise licensing | Without a licence key, post-expiry use is undetectable and unenforceable; JWT with RSA signature provides cryptographic enforcement with no ongoing connectivity requirement |
| Silent contract expiry | Revenue loss from silent expiry is predictable and preventable; 90/30/7-day alerts are standard SaaS practice |
| Infinite shadow trial | Shadow mode is a commercial transition stage, not a permanent state; shadow_trial_expires_at enforces the commercial expectation established in the Regulatory Sandbox Agreement |

68.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Feature flag sync | Nightly Celery task syncs from contracts | Real-time sync on every request | Real-time sync adds a DB query per request; nightly sync is sufficient for contract-level changes, which happen at most monthly |
| Licence key format | RSA-signed JWT | Database-backed licence check | JWT is verifiable offline (no network required for air-gapped deployments); the RSA signature prevents forgery without access to the SpaceCom private key |
| Discount approval threshold | Up to 20% without approval; >20% requires named approver | Flat approval for all discounts | 0-20% is sales discretion; >20% represents strategic pricing requiring commercial leadership sign-off; the DB constraint makes this enforceable rather than advisory |
| PS revenue tracking | contracts.ps_value_cents one-time field | Separate PS contracts table | PS is almost always bundled with the main contract at first engagement; a separate table adds complexity for marginal benefit at Phase 2-3 scale |
| MRR metric | Prometheus gauge from nightly Celery task | Real-time DB query in Grafana | Prometheus gauge is consistent with other business metrics; Grafana can scrape it without a DB connection; historical MRR trend is automatically recorded |
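The nightly MRR gauge decision above reduces to a simple aggregation over contracts active on the day the Celery task runs. A sketch under the assumption that each contracts row exposes monthly_value_cents plus start and end dates (field names mirror the plan's contracts table; the function name is illustrative, and the real task would feed the result into the spacecom_mrr_eur gauge):

```python
from datetime import date

def mrr_eur(contracts: list[dict], today: date) -> float:
    """Sum monthly recurring revenue (EUR) over contracts active today.

    One-time Professional Services revenue (ps_value_cents) is
    deliberately excluded — it is booked once, not recurring, so it
    belongs in total contract value rather than MRR.
    """
    cents = sum(
        c["monthly_value_cents"]
        for c in contracts
        if c["start_date"] <= today <= c["end_date"]
    )
    return cents / 100.0
```

Because the gauge is recomputed from the contracts table nightly rather than incremented, a missed run self-heals on the next execution.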

§69 Cross-Hat Governance and Decision Authority

This section resolves conflicts between specialist reviews. SpaceCom uses hats to surface expert constraints, not to create parallel authorities. Where hats conflict, this section defines who decides, how the decision is recorded, and which interpretation governs implementation.

69.1 Decision Authority Model

| Decision class | Primary owner | Mandatory reviewers | Tie-break principle |
|---|---|---|---|
| Product packaging, contracts, commercial entitlements | Product / Commercial owner | Legal, Engineering | Contractual and legal truth beats UI shorthand |
| Safety-critical alerting, operational UX, hazard communication | Safety case owner | Human Factors, Regulatory, Engineering | Safer operator outcome beats convenience or sales flexibility |
| Core architecture, infrastructure, CI/CD, consistency | Architecture / Platform owner | Security, SRE, DevOps | Lower operational risk and clearer failure semantics beat elegance |
| Privacy, data governance, lawful basis, retention | Legal / Privacy owner | Product, Engineering | Regulatory obligation beats implementation convenience |
| External licensing / open source / procurement artefacts | Legal / Procurement owner | Engineering, Product | Licence compliance beats delivery speed |

Any unresolved cross-hat conflict is recorded in docs/governance/CROSS_HAT_CONFLICT_REGISTER.md before implementation proceeds.

69.2 Arbitration Rules Adopted

  1. Commercial source of truth: contracts is the authoritative source for features, quotas, and deployment rights. subscription_tier is descriptive only.
  2. CI/CD platform: SpaceCom uses self-hosted GitLab. All GitHub Actions references in the plan are interpreted as GitLab CI equivalents and must be implemented in .gitlab-ci.yml, protected environments, and GitLab approval rules.
  3. Redis split by trust class: redis_app holds higher-integrity application state; redis_worker holds broker/result/cache state. Split-brain acceptance applies only to redis_worker.
  4. Commercial enforcement deferral: Licence expiry, shadow-trial expiry, and quota exhaustion must not interrupt active TIP / CRITICAL operations. Enforcement is deferred, logged, and applied after the active event closes.
  5. Alert escalation matrix: Progressive escalation is the default. Immediate bypass is allowed only for imminent-impact or integrity-compromise conditions formally listed in the alert definition and traced into safety artefacts.
  6. Renderer privilege exception: The renderer SYS_ADMIN capability is an approved exception, not a precedent. Any similar request from another service requires a new ADR and security review.
  7. Phase 0 blockers: Space-Track AUP architecture and Cesium commercial licensing are Phase 0 gates. Work that would lock in ingest or frontend architecture must not proceed before those gates are closed.
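Arbitration rule 4 (commercial enforcement deferral) can be sketched as a small state holder that queues enforcement actions while a TIP / CRITICAL event is active. The class and method names are illustrative, actions are represented as plain strings, and a real implementation would act on the contracts and licence records and emit audit log entries:

```python
from dataclasses import dataclass, field

@dataclass
class DeferredEnforcement:
    """Queue commercial enforcement during active TIP / CRITICAL events.

    Per arbitration rule 4: licence expiry, shadow-trial expiry, and
    quota exhaustion are logged but never applied mid-event; deferred
    actions are applied once the event closes.
    """
    active_critical_event: bool = False
    pending: list[str] = field(default_factory=list)
    applied: list[str] = field(default_factory=list)

    def enforce(self, action: str) -> bool:
        """Apply immediately if no event is active, else defer and log.
        Returns True only when the action was applied now."""
        if self.active_critical_event:
            self.pending.append(action)  # deferred, not dropped
            return False
        self.applied.append(action)
        return True

    def close_event(self) -> None:
        """Event closed: apply everything deferred while it was active."""
        self.active_critical_event = False
        self.applied.extend(self.pending)
        self.pending.clear()
```

The key property is that enforcement is deferred, never discarded: every action queued during the event is applied at close, preserving the commercial guarantee without interrupting an active hazard response.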

69.3 Phase 0 Governance Gates

Before Phase 1 implementation begins, the following must be complete:

  • Space-Track AUP architecture decision recorded in docs/adr/0016-space-track-aup-architecture.md
  • Cesium commercial licence executed and stored at legal/LICENCES/cesium-commercial.pdf
  • GitLab CI/CD authority confirmed in platform docs and reflected in .gitlab-ci.yml
  • contracts entitlement model and synchronisation path approved by Product, Legal, and Engineering
  • Redis trust split (redis_app / redis_worker) approved by Architecture, Security, and SRE

These are architectural commitment gates, not paperwork gates. If any remain open, implementation that would cement the affected design area is blocked.

69.4 Intervention Register

| Conflict | Sections affected | Intervention | Owner | Status |
|---|---|---|---|---|
| subscription_tier vs contracts authority | §16.1, §24, §68 | contracts made authoritative; org flags become derived cache | Product / Commercial | Accepted |
| GitHub Actions vs self-hosted GitLab | §26.9, §30.4, §30.7, delivery checklists | GitLab CI/CD designated authoritative | Platform | Accepted |
| Shared Redis vs accepted split-brain risk | §3.2, §3.3, §65, §67 | Redis split into app-state and worker-state trust domains | Architecture / Security | Accepted |
| Commercial enforcement during incidents | §9, §27.7, §34, §68 | Enforcement deferred during active TIP / CRITICAL event | Product / Operations | Accepted |
| HF progressive escalation vs safety urgency | §28.3, §60, §61 | Immediate-bypass matrix added for imminent-impact and integrity events | Safety case owner | Accepted |
| Non-root/container hardening vs renderer SYS_ADMIN | §3.3, §7.11 | Renderer documented as approved exception with tighter isolation | Security / Platform | Accepted |
| Implementation starting before legal/licence blockers close | §6, §19, §21, §29.11 | Blockers moved into Phase 0 governance gates | Programme owner | Accepted |