SpaceCom Master Development Plan
1. Vision
SpaceCom is a dual-domain re-entry debris hazard analysis platform that bridges the space and aviation domains. It is built by space engineers and operates as two interconnected products sharing a common physics core.
Space Domain (upstream): A technical analysis platform for space operators, orbital analysts, and space agencies — providing decay prediction with full uncertainty quantification, conjunction screening, controlled re-entry corridor planning, and a programmatic API layer for integration with existing space operations systems.
Aviation Domain (downstream): An operational decision support tool for ANSPs, airspace managers, and incident commanders — translating space domain predictions into actionable aviation safety outputs: hazard corridors, FIR intersection analysis, NOTAM drafting assistance, multi-ANSP coordination, and plain-language uncertainty communication.
SpaceCom's strategic position is the interface layer between two domains that currently do not speak the same language. The aviation safety gap is the commercial differentiator and the most underserved operational need in the market. The space domain physics depth — numerical decay prediction, atmospheric density modelling, conjunction probability, and controlled re-entry planning — is the technical credibility that distinguishes SpaceCom from aviation software vendors with bolt-on orbital mechanics.
Positioning statement for procurement: "SpaceCom is the missing operational layer between space domain awareness and aviation domain action — built by space engineers, designed for the people who have to make decisions when something is coming down."
AI-assisted development policy (F11): SpaceCom uses AI coding assistants (currently Claude Code) in the development workflow. AGENTS.md at the repository root defines the boundaries and conventions for this use. Key constraints:
- AI assistants may generate, refactor, and review code, and draft documentation
- AI assistants may not make autonomous decisions about safety-critical algorithm changes, authentication logic, or regulatory compliance text — all such changes require human review and an approved PR with explicit reviewer sign-off
- AI-generated code is subject to identical review and testing standards as human-authored code — there is no reduced scrutiny for AI-generated contributions
- AI assistants must not be given production credentials, access to live Space-Track API keys, or personal data
- For ESA procurement purposes: all code in the repository, regardless of how it was authored, is the responsibility of the named human engineers. AI assistance is a development tool, not a co-author with liability
This policy is stated explicitly because ESA and other public-sector procurement frameworks increasingly ask whether and how AI tools are used in safety-relevant software development.
2. What We Keep from the Existing Codebase
The prototype established several good foundational choices:
- Docker Compose orchestration — frontend, backend, and database run as isolated containers with a single `docker compose up`
- FastAPI backend — lightweight, async-ready Python API server; already serves CZML orbital data
- TimescaleDB + PostGIS — time-series hypertables for orbit data and geographic types for footprints; the `orbits` hypertable and `reentry_predictions` polygon column are well-suited to the domain
- CesiumJS globe — proven 3D geospatial viewer with CZML support, already rendering orbital tracks with OSM tiles
- CZML as the orbital data interchange format — native to Cesium, supports time-dynamic position, styling, and labels
- Schema tables: `objects`, `orbits`, `conjunctions`, `reentry_predictions` — solid starting point for the data model (see §9 for required expansions)
- Worker service slot — the architecture already anticipates background data ingestion
3. Architecture
3.1 Layered Design
┌─────────────────────────────────────────────────────┐
│ Frontend (Web) │
│ Next.js + TypeScript + CesiumJS + Deck.gl │
│ httpOnly cookies · CSP · security headers │
├─────────────────────────────────────────────────────┤
│ TLS Termination (Caddy/Nginx) │
│ HTTPS + WSS only; HSTS preload │
├─────────────────────────────────────────────────────┤
│ API Gateway │
│ FastAPI · RBAC middleware · rate limiting │
│ JWT (RS256) · MFA enforcement · audit logging │
├─────────────────────────────────────────────────────┤
│ Core Services │
│ Hazard Engine · Event Orchestrator · CZML Builder │
│ Frame Transform Service · Space Weather Cache │
│ HMAC integrity signing · Alert integrity guard │
├─────────────────────────────────────────────────────┤
│ Computational Workers (isolated network) │
│ Celery tasks: propagation, decay, Monte Carlo │
│ Per-job CPU time limits · resource caps │
├─────────────────────────────────────────────────────┤
│ Report Renderer (network-isolated container) │
│ Playwright headless · no external network access │
├─────────────────────────────────────────────────────┤
│ Data Layer (backend_net only) │
│ TimescaleDB+PostGIS · Redis (AUTH+TLS) │
│ MinIO (private buckets · pre-signed URLs) │
└─────────────────────────────────────────────────────┘
3.2 Service Breakdown
| Service | Runtime | Responsibility | Tier 2 Spec | Tier 3 Spec |
|---|---|---|---|---|
| `frontend` | Next.js on Node 22 / Nginx static | Globe UI, dashboards, event timeline, simulation controls | 2 vCPU / 4 GB | 2× (load balanced) |
| `backend` | FastAPI on Python 3.12 | REST + WebSocket API, authentication, RBAC, request validation, CZML generation, HMAC signing | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB (blue-green) |
| `worker-sim` | Python 3.12 + Celery `--queue=simulation --concurrency=16 --pool=prefork` | MC decay prediction (chord sub-tasks), breakup, conjunction, controlled re-entry. Isolated from frontend network. | 2× 16 vCPU / 32 GB | 4× 16 vCPU / 32 GB |
| `worker-ingest` | Python 3.12 + Celery `--queue=ingest --concurrency=2` | TLE polling, space weather, DISCOS, IERS EOP. Never competes with simulation queue. | 2 vCPU / 4 GB | 2× 2 vCPU / 4 GB (celery-redbeat HA) |
| `renderer` | Python 3.12 + Playwright | PDF report generation only. No external network access. Receives sanitised data from backend via internal API call only. | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB |
| `db` | TimescaleDB (PostgreSQL 17 + PostGIS) | Persistent storage. RLS policies enforced. Append-only triggers on audit tables. | 8 vCPU / 64 GB / 1 TB NVMe | Primary + standby: 8 vCPU / 128 GB each; Patroni failover |
| `redis` | Redis 7 | Broker + cache + celery-redbeat schedule. AUTH required. TLS in production. ACL users per service. | 2 vCPU / 8 GB | Redis Sentinel: 3× 2 vCPU / 8 GB |
| `minio` | MinIO (S3-compatible) | Object storage. All buckets private. Pre-signed URLs only. | 4 vCPU / 8 GB / 4 TB | Distributed: 4× 4 vCPU / 16 GB / 2 TB NVMe |
| `etcd` | etcd 3 | Patroni DCS (distributed configuration store) for DB leader election | — | 3× 1 vCPU / 2 GB |
| `pgbouncer` | PgBouncer 1.22 | Connection pooler between all application services and TimescaleDB. Transaction-mode pooling. Prevents connection count exceeding `max_connections` at Tier 3. Single failover target point for Patroni switchover. | 1 vCPU / 1 GB | 1 vCPU / 1 GB (updated by Patroni on failover) |
| `prometheus` | Prometheus 2.x | Metrics scraping from all services; recording rules; AlertManager rules | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| `grafana` | Grafana OSS | Four dashboards (§26.7); Loki + Tempo + Prometheus datasources | 1 vCPU / 2 GB | 1 vCPU / 2 GB |
| `loki` | Grafana Loki 2.9 | Log aggregation; queried by Grafana; Promtail ships container logs | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| `promtail` | Grafana Promtail 2.9 | Scrapes Docker json-file logs; labels by service; ships to Loki | 0.5 vCPU / 512 MB | 0.5 vCPU / 512 MB |
| `tempo` | Grafana Tempo | Distributed trace backend (Phase 2); OTLP ingest; queried by Grafana | — | 2 vCPU / 4 GB |
Horizontal Scaling Trigger Thresholds (F9 — §58)
Tier upgrades are not automatic — SpaceCom is VPS-based and requires deliberate provisioning. The following thresholds trigger a scaling review meeting (not an automated action). The responsible engineer creates a tracked issue within 5 business days.
| Metric | Threshold | Sustained for | Tier transition indicated |
|---|---|---|---|
| Backend CPU utilisation | > 70% | 30 min | Tier 1 → Tier 2 (add second backend instance) |
| `spacecom_ws_connected_clients` | > 400 sustained | 1 hour | Tier 1 → Tier 2 (WS ceiling at 500; add second backend) |
| Celery simulation queue depth | > 50 | 15 min (no active event) | Add simulation worker instance |
| MC p95 latency | > 180 s (75% of 240 s SLO) | 3 consecutive runs | Add simulation worker instance |
| DB CPU utilisation | > 60% | 1 hour | Tier 2 → Tier 3 (read replica + Patroni) |
| DB disk used | > 70% of provisioned | — | Expand disk before hitting 85% |
| Redis memory used | > 60% of `maxmemory` | — | Increase `maxmemory` or add Redis instance |
Scaling decisions are recorded in docs/runbooks/capacity-limits.md with: metric value at decision time, decision made, provisioning timeline, and owner. This file is the authoritative capacity log for ESA and ANSP audits.
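The "sustained for" column implies distinguishing a transient spike from a sustained breach. A minimal sketch of that check — a hypothetical helper, not an existing SpaceCom module — operating on `(timestamp, value)` samples such as a Prometheus range query would return:

```python
from datetime import datetime, timedelta, timezone

def breach_sustained(samples, threshold, window):
    """Return True if every sample in the trailing window exceeds the threshold.

    samples: list of (datetime, float), ascending by time.
    window:  timedelta, e.g. timedelta(minutes=30) for the backend-CPU rule.
    A real implementation should also verify the samples actually span the window.
    """
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    recent = [value for ts, value in samples if ts >= cutoff]
    # A single sample at or below threshold inside the window resets the clock.
    return bool(recent) and all(value > threshold for value in recent)
```

A sustained breach then opens the tracked scaling-review issue within 5 business days, per the process above.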
Redis ACL Definition
SpaceCom uses two Redis trust domains:
- `redis_app` for sessions, rate limits, WebSocket delivery state, commercial-enforcement deferrals, and other application state where stronger consistency and tighter access separation are required
- `redis_worker` for Celery broker/result traffic and ephemeral cache data, where limited inconsistency during failover is acceptable
This split is deliberate. It prevents worker-side compromise from reaching session state and avoids applying the distributed-systems split-brain risk acceptance for ephemeral workloads to user-session or entitlement-adjacent state.
Each Redis service gets its own ACL users with the minimum required key namespace:
# redis_app/acl.conf - bind-mounted into the application Redis container
# Backend: application-state access only (session tokens, rate-limit counters, WebSocket tracking)
user spacecom_backend on >${REDIS_BACKEND_PASSWORD} ~* &* +@all
# Disable unauthenticated default user
user default off
# redis_worker/acl.conf - bind-mounted into the worker Redis container
# Simulation worker: Celery broker/result namespaces only
user spacecom_worker on >${REDIS_WORKER_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous
# Ingest worker: same scope as simulation worker
user spacecom_ingest on >${REDIS_INGEST_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous
# Disable unauthenticated default user
user default off
Mount in docker-compose.yml:
redis_app:
volumes:
- ./redis_app/acl.conf:/etc/redis/acl.conf:ro
command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...
redis_worker:
volumes:
- ./redis_worker/acl.conf:/etc/redis/acl.conf:ro
command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...
Separate passwords (REDIS_BACKEND_PASSWORD, REDIS_WORKER_PASSWORD, REDIS_INGEST_PASSWORD) are defined in §30.3. Each rotates independently on the 90-day schedule. Redis Sentinel split-brain risk acceptance in §67 applies to redis_worker only; redis_app is treated as higher-integrity application state and is not covered by that acceptance.
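For illustration, the trust-domain split maps to per-service redis-py connection settings along these lines. This is a sketch: `redis_kwargs` and `TRUST_DOMAINS` are hypothetical helpers, the host names follow the compose service names above, and the kwargs shown are standard `redis.Redis(...)` parameters.

```python
import os

# Trust-domain map: which Redis instance and ACL user each service authenticates as
# (usernames and password env vars match acl.conf and §30.3 above).
TRUST_DOMAINS = {
    "backend":       {"host": "redis_app",    "username": "spacecom_backend", "password_env": "REDIS_BACKEND_PASSWORD"},
    "worker-sim":    {"host": "redis_worker", "username": "spacecom_worker",  "password_env": "REDIS_WORKER_PASSWORD"},
    "worker-ingest": {"host": "redis_worker", "username": "spacecom_ingest",  "password_env": "REDIS_INGEST_PASSWORD"},
}

def redis_kwargs(service: str) -> dict:
    """Connection kwargs for redis.Redis(**redis_kwargs(service)); TLS on in production."""
    domain = TRUST_DOMAINS[service]
    return {
        "host": domain["host"],
        "port": 6379,
        "ssl": True,                                    # §3.2: TLS in production
        "username": domain["username"],                 # ACL user from acl.conf
        "password": os.environ.get(domain["password_env"], ""),
    }
```

Because each service only ever receives its own password env var, a compromised worker cannot even construct valid credentials for `redis_app`.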
3.3 Docker Compose Services and Network Segmentation
Services are assigned to isolated Docker networks. A compromised container on one network cannot directly reach services on another.
networks:
frontend_net: # frontend → backend only
backend_net: # backend → db, redis, minio, pgbouncer
worker_net: # worker → pgbouncer, redis, minio (no backend access; pgbouncer pools DB connections)
renderer_net: # backend → renderer only; renderer has no external egress
db_net: # db, pgbouncer, etcd (Tier 3) — never exposed to frontend_net
services:
frontend: networks: [frontend_net]
backend: networks: [frontend_net, backend_net, renderer_net] # +renderer_net: backend calls renderer API
worker-sim: networks: [worker_net]
worker-ingest: networks: [worker_net]
renderer: networks: [renderer_net] # backend-initiated calls only; no outbound to backend_net
db: networks: [backend_net, worker_net, db_net]
pgbouncer: networks: [backend_net, worker_net, db_net] # pooling for both backend AND workers
redis: networks: [backend_net, worker_net]
minio: networks: [backend_net, worker_net]
Network topology rules:
- Workers connect to DB via `pgbouncer:5432`, not `db:5432` directly — enforced by the workers' `DATABASE_URL` env var pointing to PgBouncer.
- The backend is on `renderer_net` so it can call `renderer:8001`; the renderer cannot initiate connections to `backend_net`.
- `db_net` contains only TimescaleDB, PgBouncer, and etcd. No application service connects directly to this network except PgBouncer.
Container resource limits — without explicit limits a runaway simulation worker OOM-kills the database (Linux OOM killer targets the largest RSS consumer):
services:
backend:
deploy:
resources:
limits: { cpus: '4.0', memory: 8G }
reservations: { memory: 512M }
worker-sim:
deploy:
resources:
limits: { cpus: '16.0', memory: 32G }
reservations: { memory: 2G }
stop_grace_period: 300s # allows long MC jobs to finish before SIGKILL
command: >
celery -A app.worker worker
--queue=simulation
--concurrency=16
--pool=prefork
--without-gossip
--without-mingle
--max-tasks-per-child=100
pids_limit: 64 # prefork: 16 children + Beat + parent + overhead
worker-ingest:
deploy:
resources:
limits: { cpus: '2.0', memory: 4G }
stop_grace_period: 60s
pids_limit: 16
renderer:
deploy:
resources:
limits: { cpus: '4.0', memory: 8G }
pids_limit: 100 # Chromium spawns ~5 processes per render × concurrent renders
tmpfs:
- /tmp/renders:size=512m,mode=1777 # PDF scratch; never written to persistent layer
environment:
RENDER_OUTPUT_DIR: /tmp/renders
db:
deploy:
resources:
limits: { memory: 64G } # explicit cap; prevents OOM killer targeting db
redis:
deploy:
resources:
limits: { cpus: '2.0', memory: 8G }
minio:
deploy:
resources:
limits: { cpus: '4.0', memory: 8G }
Note: `deploy.resources` limits are honoured by Docker Compose v2 without Swarm mode under the Compose Specification. Verify with `docker compose version` (must report ≥ 2.0).
All containers run as non-root users, with read-only root filesystems and dropped capabilities (see §7.10), except for the renderer container's documented SYS_ADMIN exception in §7.11. That exception is accepted only for the renderer, must never be copied to other services, and requires stricter network isolation and annual review.
Host Bind Mounts
All directories that operators need to access directly on the VPS — logs, generated exports, config, and backups — are bind-mounted from the host filesystem. This means no docker compose exec is required for routine operations: log tailing, reading generated files, editing config, or recovering a backup.
services:
backend:
volumes:
- ./logs/backend:/app/logs # structured JSON logs; tail directly on host
- ./exports:/app/exports # org export ZIPs, report PDFs
- ./config/backend.toml:/app/config/settings.toml:ro # edit on host; container reads
worker-sim:
volumes:
- ./logs/worker-sim:/app/logs
- ./exports:/app/exports # shared export directory with backend
worker-ingest:
volumes:
- ./logs/worker-ingest:/app/logs
frontend:
volumes:
- ./logs/frontend:/app/logs
db:
volumes:
- /data/postgres:/var/lib/postgresql/data # DB data on host disk; survives container recreation
- ./backups/db:/backups # pg_basebackup output directly accessible on host
minio:
volumes:
- /data/minio:/data # object storage on host disk
Host-side directory layout (under /opt/spacecom/):
/opt/spacecom/
logs/
backend/ ← tail -f logs/backend/app.log
worker-sim/
worker-ingest/
frontend/
exports/ ← ls exports/ to see generated reports and org export ZIPs
config/
backend.toml ← edit directly; restart backend container to apply
backups/
db/ ← pg_basebackup archives; rsync to offsite from here
data/
postgres/ ← TimescaleDB data files (outside /opt to avoid accidental compose down -v)
minio/ ← MinIO object data
Key rules:
- `/data/postgres` and `/data/minio` live outside the project directory so `docker compose down -v` cannot accidentally wipe them (Compose only removes named volumes, not bind-mounted host paths, but keeping them separate is an additional safeguard)
- Log directories are created by `make init-dirs` before first `docker compose up`; containers write to them as a non-root user (UID 1000); the host operator reads as the same UID or via `sudo`
- Config files are mounted `:ro` (read-only) inside the container — a misconfigured backend cannot overwrite its own config
- `make logs SERVICE=backend` is a convenience alias for `tail -f /opt/spacecom/logs/backend/app.log`
Port Exposure Map
| Port | Service | Exposed to | Notes |
|---|---|---|---|
| 80 | Caddy | Public internet | HTTP → HTTPS redirect only |
| 443 | Caddy | Public internet | TLS termination; proxies to backend/frontend |
| 8000 | Backend API | Internal (`frontend_net`) | Never directly internet-facing |
| 3000 | Frontend (Next.js) | Internal (`frontend_net`) | Caddy proxies; HMR port 3001 dev-only |
| 5432 | TimescaleDB | Internal (`db_net`) | Never exposed to `frontend_net` or host |
| 6379 | Redis | Internal (`backend_net`, `worker_net`) | AUTH required; no public exposure |
| 9000 | MinIO API | Internal (`backend_net`, `worker_net`) | Pre-signed URL access only from outside |
| 9001 | MinIO Console | Internal (`db_net`) | Never exposed publicly; admin use only |
| 5555 | Flower (Celery monitor) | Internal only | VPN/bastion access only in production |
| 2379/2380 | etcd (Patroni DCS) | Internal (`db_net`) | Never exposed outside `db_net` |
CI check: scripts/check_ports.py — parses docker-compose.yml and all docker-compose.*.yml overrides; fails if any port from the "never-exposed" category appears in a ports: mapping. Runs in every CI pipeline.
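The core of that check can be sketched as follows — assuming the real `scripts/check_ports.py` loads each compose file with PyYAML first; the function below operates on the already-parsed mapping, with the forbidden-port set taken from the table above:

```python
# Ports that must never appear in a `ports:` mapping (see port exposure map above).
NEVER_EXPOSED = {8000, 3000, 5432, 6379, 9000, 9001, 5555, 2379, 2380}

def entry_ports(entry):
    """Extract numeric ports from a compose ports entry, e.g. '80:80',
    '127.0.0.1:5432:5432', or '8080:80/tcp'."""
    ports = []
    for part in str(entry).split(":"):
        part = part.split("/")[0]          # strip '/tcp' or '/udp' suffix
        if part.isdigit():
            ports.append(int(part))
    return ports

def port_violations(compose: dict):
    """Yield (service, port) for every forbidden port published by any service."""
    for name, svc in compose.get("services", {}).items():
        for entry in svc.get("ports", []) or []:
            for port in set(entry_ports(entry)):
                if port in NEVER_EXPOSED:
                    yield name, port
```

The CI job then fails whenever `list(port_violations(parsed))` is non-empty for any compose file or override.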
Infrastructure-Level Egress Filtering
Docker's built-in iptables rules prevent inter-network lateral movement but do not restrict egress to the public internet from within a network. An egress filtering layer is mandatory at Tier 2 and Tier 3.
Allowed outbound destinations (whitelist):
| Service | Allowed destination | Protocol | Purpose |
|---|---|---|---|
| `ingest_worker` | `www.space-track.org` | HTTPS/443 | TLE / conjunction data |
| `ingest_worker` | `services.swpc.noaa.gov` | HTTPS/443 | Space weather |
| `ingest_worker` | `discosweb.esac.esa.int` | HTTPS/443 | DISCOS object catalogue |
| `ingest_worker` | `celestrak.org` | HTTPS/443 | TLE cross-validation |
| `ingest_worker` | `iers.org` | HTTPS/443 | EOP download |
| `backend` | SMTP relay (org-internal) | SMTP/587 | Alert email |
| All containers | Internal Docker networks | Any | Normal operation |
| All containers | All other destinations | Any | BLOCKED |
Implementation: UFW or nftables rules on host (Tier 2); network policy + Calico/Cilium (Tier 3 Kubernetes migration); explicit allow-list in docs/runbooks/egress-filtering.md. Violations logged at WARN; repeated violations at CRITICAL.
4. Coordinate Frames and Time Systems
This section is non-negotiable infrastructure. Silent frame mismatches invalidate all downstream computation. All developers must understand and implement the conventions below before writing any propagation or display code.
4.1 Reference Frame Pipeline
TLE input
│
▼ sgp4 library propagation
TEME (True Equator Mean Equinox) ← SGP4 native output; do NOT store as final product
│
▼ IAU 2006 precession-nutation (or Vallado TEME→J2000 simplification)
GCRF / J2000 (Geocentric Celestial Reference Frame)
│ │
│ ▼ CZML INERTIAL frame ← CesiumJS expects GCRF/ICRF, not TEME
│
▼ IAU Earth Orientation Parameters (EOP): IERS Bulletin A/B
ITRF (International Terrestrial Reference Frame) ← Earth-fixed; use for database storage
│
▼ WGS84 geodetic transformation
Latitude / Longitude / Altitude ← For display, hazard zones, airspace intersections
Implementation: Use astropy (astropy.coordinates, astropy.time) for all frame conversions. It handles IERS EOP download and interpolation automatically. For performance-critical batch conversions, pre-load EOP tables and vectorise.
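As a concrete illustration of one rotation in this chain — hedged: production code uses astropy as stated above — the pure-Python sketch below implements only the GMST (IAU-82, as in Vallado) rotation from TEME to the pseudo-Earth-fixed (PEF) frame. It deliberately ignores polar motion and EOP corrections, which are exactly what astropy's ITRS transform adds.

```python
import math

def gmst_rad(jd_ut1: float) -> float:
    """Greenwich Mean Sidereal Time (IAU-82), in radians, from a UT1 Julian date."""
    t = (jd_ut1 - 2451545.0) / 36525.0       # Julian centuries since J2000
    gmst_s = (67310.54841
              + (876600.0 * 3600.0 + 8640184.812866) * t
              + 0.093104 * t * t
              - 6.2e-6 * t ** 3)             # seconds of time
    return math.radians((gmst_s % 86400.0) / 240.0)   # 240 s of time per degree

def teme_to_pef(r_teme, jd_ut1):
    """Rotate a TEME position vector (metres) about +Z by GMST into PEF.
    PEF is Earth-fixed *before* polar motion; full ITRF requires EOP corrections."""
    theta = gmst_rad(jd_ut1)
    x, y, z = r_teme
    return (math.cos(theta) * x + math.sin(theta) * y,
            -math.sin(theta) * x + math.cos(theta) * y,
            z)
```

The rotation is about +Z only, so it preserves vector length and leaves the z-component untouched — both properties make good unit-test invariants for `frame_utils.py`.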
4.2 CesiumJS Frame Convention
- CZML `position` with `referenceFrame: "INERTIAL"` expects ICRF/J2000 Cartesian coordinates in metres
- SGP4 outputs are in TEME and must be rotated to J2000 before being written into CZML
- CZML `position` with `referenceFrame: "FIXED"` expects ITRF Cartesian in metres
- Never pipe raw TEME coordinates into CesiumJS
4.3 Time System Conventions
| System | Where Used | Notes |
|---|---|---|
| UTC | System-wide reference. All API timestamps, database timestamps, CZML epochs | Convert immediately at ingestion boundary |
| UT1 | Earth rotation angle for ITRF↔GCRF conversion | UT1-UTC offset from IERS EOP |
| TT (Terrestrial Time) | `astropy` internal; precession-nutation models | ~69 s ahead of UTC |
| TLE epoch | Encoded in TLE line 1 as year + day-of-year fraction | Parse to UTC immediately |
| GPS time | May appear in precision ephemeris products | GPS = UTC + 18 s as of 2024 |
Rule: Store all timestamps as TIMESTAMPTZ in UTC. Convert to local time only at presentation boundaries.
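Two of these conversions are simple enough to pin down in code. The sketch below uses hypothetical helper names and only the standard library; the year-windowing rule is the standard two-digit-year TLE convention, and the GPS offset is the fixed value from the table above.

```python
from datetime import datetime, timedelta, timezone

def tle_epoch_to_utc(epoch_field: str) -> datetime:
    """Parse a TLE epoch field (YYDDD.DDDDDDDD) to an aware UTC datetime.
    Two-digit years 57-99 map to 1957-1999; 00-56 map to 2000-2056."""
    yy = int(epoch_field[:2])
    year = 1900 + yy if yy >= 57 else 2000 + yy
    day_of_year = float(epoch_field[2:])       # 1.0 == Jan 1 00:00 UTC
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1.0)

def utc_to_gps(t: datetime) -> datetime:
    """GPS = UTC + 18 s (leap-second count as of 2024; revisit when IERS announces a new one)."""
    return t + timedelta(seconds=18)
```

Parsing to an aware `datetime` at the ingestion boundary enforces the TIMESTAMPTZ-in-UTC rule mechanically: naive datetimes simply never enter the pipeline.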
4.4 Coordinate Reference System Contract (F1 — §62)
The CRS used at every system boundary is documented in docs/COORDINATE_SYSTEMS.md. This is the authoritative single-page reference for any engineer writing frame conversion code.
| Boundary | CRS | Format | Notes |
|---|---|---|---|
| SGP4 output | TEME (True Equator Mean Equinox) | Cartesian metres | Must not leave `physics/` without conversion |
| Physics → CZML builder | GCRF/J2000 | Cartesian metres | Explicit `teme_to_gcrf()` call |
| CZML `position` (INERTIAL) | GCRF/J2000 | Cartesian metres | `referenceFrame: "INERTIAL"` |
| CZML `position` (FIXED) | ITRF | Cartesian metres | `referenceFrame: "FIXED"` |
| Database storage (`orbits`) | GCRF/J2000 | Cartesian metres | Consistent with CZML inertial |
| Corridor polygon (DB) | WGS-84 geographic | `GEOGRAPHY(POLYGON)` SRID 4326 | Geodetic lat/lon from ITRF→WGS-84 |
| FIR boundary (DB) | WGS-84 geographic | `GEOMETRY(POLYGON, 4326)` | Planar approx. for regional FIRs |
| API response | WGS-84 geographic | GeoJSON (EPSG:4326) | Degrees; always lon,lat order (GeoJSON spec) |
| Globe display (CesiumJS) | ICRF (= GCRF for practical purposes) | Cartesian metres via CZML | CesiumJS handles geodetic display |
| Altitude display | WGS-84 ellipsoidal | km or ft (user preference) | See §4.4a for datum labelling |
Antimeridian and pole handling (F5 — §62):
- Antimeridian: Corridor polygons stored as `GEOGRAPHY` handle antimeridian crossing correctly — PostGIS GEOGRAPHY uses spherical arithmetic and does not wrap coordinates. CesiumJS CZML polygon positions must be expressed as a continuous polyline; for antimeridian-crossing corridors, the CZML serialiser must not clamp coordinates to ±180° — pass the raw ITRF→geodetic output. CesiumJS handles coordinate wrapping internally when `referenceFrame: "FIXED"` is used for corridor polygons.
- Polar orbits: For objects with inclination > 80°, the ground track corridor may approach or cross the poles. `ST_AsGeoJSON` on a GEOGRAPHY polygon that passes within ~1° of a pole can produce degenerate output (longitude is undefined at the pole itself). Mitigation: before storing, check `ST_DWithin(corridor, ST_GeogFromText('SRID=4326;POINT(0 90)'), 111000)` (within 1° of the north pole) or the south-pole equivalent — if true, log a `POLAR_CORRIDOR_WARNING` and clip the polygon to 89.5° max latitude. This is a rare case (ISS incl. 51.6°; most rocket bodies are below 75° incl.) but must not crash the pipeline.
docs/COORDINATE_SYSTEMS.md is a Phase 1 deliverable. Tests in tests/test_frame_utils.py serve as executable verification of the contract.
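An application-side equivalent of the polar mitigation can pre-screen corridor vertices before they reach the database. This is a sketch under stated assumptions — the authoritative check remains the PostGIS `ST_DWithin` query, and `screen_polar_corridor` is a hypothetical helper:

```python
POLE_PROXIMITY_DEG = 1.0    # matches the 111 km ST_DWithin radius (~1° of latitude)
LAT_CLIP_DEG = 89.5         # clip limit from the mitigation above

def screen_polar_corridor(vertices):
    """vertices: list of (lon_deg, lat_deg) corridor polygon vertices.

    Returns (clipped_vertices, polar_warning). When any vertex lies within ~1° of
    either pole, latitudes are clamped to ±89.5° and the caller should log
    POLAR_CORRIDOR_WARNING. Longitudes are passed through untouched (antimeridian
    handling is the CZML serialiser's concern, per the rule above).
    """
    polar_warning = any(abs(lat) >= 90.0 - POLE_PROXIMITY_DEG for _, lat in vertices)
    if not polar_warning:
        return list(vertices), False
    clipped = [(lon, max(-LAT_CLIP_DEG, min(LAT_CLIP_DEG, lat))) for lon, lat in vertices]
    return clipped, True
```

Running this in the worker before `INSERT` keeps degenerate polygons out of the pipeline even if the SQL-side check is bypassed.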
4.5 Implementation Checklist
- `frame_utils.py`: `teme_to_gcrf()`, `gcrf_to_itrf()`, `itrf_to_geodetic()`
- Unit tests against Vallado 2013 reference cases
- EOP data auto-refresh: weekly Celery task pulling IERS Bulletin A; verify SHA-256 checksum of the downloaded file before applying
- CZML builder uses `gcrf_to_czml_inertial()` — explicit function, never implicit conversion
- `docs/COORDINATE_SYSTEMS.md` committed with CRS boundary table
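The checksum step in the EOP refresh task can be as small as the following sketch; where the expected digest comes from (a pinned value, a sidecar file) is a deployment decision not specified here, and the function name is an assumption:

```python
import hashlib

def verify_eop_checksum(payload: bytes, expected_sha256: str) -> bool:
    """Return True only when the downloaded Bulletin A bytes match the expected SHA-256.
    On mismatch the Celery task must discard the file and alert, never apply it."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256.strip().lower()
```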
5. User Personas
All UX decisions are traceable to one of the four personas defined here. Navigation, default views, information hierarchy, and alert behaviour must serve user tasks — not the system's internal module structure.
Persona A — Operational Airspace Manager
Role: ANSP or aviation authority staff. Responsible for airspace safety decisions in real-time or near-real-time.
Primary question: "Is any airspace under my responsibility affected in the next 6–12 hours, and what do I need to do about it?"
Key needs: Immediate situational awareness, clear go/no-go spatial display for their region, alert acknowledgement workflow, one-click advisory export, minimal cognitive load.
Tolerance for complexity: Very low.
Persona B — Safety Analyst
Role: Space agency, authority research arm, or consultancy. Conducts detailed re-entry risk assessments for regulatory submissions or post-event reports.
Primary question: "What is the full uncertainty envelope, what assumptions drove the prediction, and how does this compare to previous similar events?"
Key needs: Full simulation parameter access, run comparison, numerical uncertainty detail, full data provenance, configurable report generation, historical replay.
Tolerance for complexity: High.
Persona C — Incident Commander
Role: Senior official coordinating response during an active re-entry event. Uses the platform as a shared situational awareness tool in a briefing room.
Primary question: "Where exactly is it coming down, when, and what is the worst-case affected area right now?"
Key needs: Clean large-format display, auto-narrowing corridor updates, countdown timer, plain-language status summary, shareable live-view URL.
Tolerance for complexity: Low.
Persona D — Systems Administrator / Data Manager
Role: Technical operator managing system health, data ingest, model configuration, and user accounts.
Primary question: "Is everything ingesting correctly, are data sources healthy, and are workers keeping up?"
Key needs: System health dashboard, ingest job status, worker queue metrics, model version management, user and role management.
Tolerance for complexity: High technical tolerance.
Persona E — Space Operator
Role: Satellite or launch vehicle operator responsible for one or more objects in the SpaceCom catalog. May be a commercial operator, a national space agency operating assets, or a launch service provider managing spent upper stages.
Primary question: "What is the current decay prediction for my objects, when do I need to act, and if I have manoeuvre capability, what deorbit window minimises ground risk?"
Key needs: Object-scoped view showing only their registered objects; decay prediction with full Monte Carlo detail; controlled re-entry corridor planner (for objects with remaining propellant); conjunction alert for their own objects; API key management for programmatic integration with their own operations centre; exportable predictions for regulatory submission under national space law.
Tolerance for complexity: High — these are trained orbital engineers, not ATC professionals.
Regulatory context: Many space operators have legal obligations under national space law (e.g., Australia Space (Launches and Returns) Act 2018, FAA AST licensing) to demonstrate responsible end-of-life management. SpaceCom outputs serve as supporting evidence for those submissions. The platform must produce artefacts suitable for regulatory audit.
Persona F — Orbital Analyst
Role: Technical analyst at a space agency, research institution, safety consultancy, or the SSA/STM office of a national authority. Conducts orbital analysis, validates predictions, and produces technical assessments — potentially across the full catalog, not just owned objects.
Primary question: "What does the full orbital picture look like for this object class, how do SpaceCom predictions compare to other tools, and what are the statistical properties of the prediction ensemble?"
Key needs: Full catalog read access; conjunction screening across arbitrary object pairs; simulation parameter tuning and comparison; bulk export (CSV, JSON, CCSDS formats); access to raw propagation outputs (state vectors, covariance matrices); historical validation runs; API access for batch processing.
Tolerance for complexity: Very high — this persona builds the technical evidence base that other personas act on.
6. UX Design Specification
This section translates engineering capability into concrete interface designs. All designs are persona-linked and phase-scheduled.
6.1 Information Architecture — Task-Based Navigation
Navigation is organised around user tasks, not backend modules. Module names never appear in the UI.
The platform has two navigation domains — Aviation (default for Persona A/B/C) and Space (for Persona E/F). Both are accessible from the top navigation. The root route (/) defaults to the domain matched to the user's role on login.
Aviation Domain Navigation:
/ → Operational Overview (Persona A, C primary)
/watch/{norad_id} → Object Watch Page (Persona A, B)
/events → Active Events + Timeline (Persona A, C)
/events/{id} → Event Detail (Persona A, B, C)
/airspace → Airspace Impact View (Persona A)
/analysis → Analyst Workspace (Persona B primary)
/catalog → Object Catalog (Persona B)
/reports → Report Management (Persona A, B)
/admin → System Administration (Persona D)
Space Domain Navigation:
/space → Space Operator Overview (Persona E, F primary)
/space/objects → My Objects Dashboard (Persona E — owned objects only)
/space/objects/{norad_id} → Object Technical Detail (Persona E, F)
/space/reentry/plan → Controlled Re-entry Planner (Persona E)
/space/conjunction → Conjunction Screening (Persona F)
/space/analysis → Orbital Analyst Workspace (Persona F)
/space/export → Bulk Export (Persona F)
/space/api → API Keys + Documentation (Persona E, F)
The 3D globe is a shared component embedded within pages, not a standalone page. Different pages focus and configure the globe differently.
6.2 Operational Overview Page (/)
Landing page for Persona A and C. Loads immediately without configuration.
Layout:
┌─────────────────────────────────────────────────────────────────┐
│ [● LIVE] SpaceCom [Space Weather: ELEVATED ▲] [Alerts: 2] │
├──────────────────────────────┬──────────────────────────────────┤
│ │ ACTIVE EVENTS │
│ 3D GLOBE │ ● CZ-5B R/B 44878 │
│ (active events + │ Window: 08h – 20h from now │
│ affected FIRs only) │ Most likely ~14h from now │
│ │ YMMM FIR — HIGH │
│ │ [View] [Corridor] │
│ │ ───────────────────────────── │
│ │ ○ SL-16 R/B 28900 │
│ │ Window: 54h – 90h from now │
│ │ Most likely ~72h from now │
│ │ Ocean — LOW │
│ │ │
│ │ 72-HOUR TIMELINE │
│ │ [Gantt strip] │
│ │ │
│ │ SPACE WEATHER │
│ │ Activity: ELEVATED │
│ │ Extend window: add ≥2h buffer │
├──────────────────────────────┴──────────────────────────────────┤
│ [● Live] ──────────●────────────────────────────── +72h │
└─────────────────────────────────────────────────────────────────┘
Globe default state: Active decay objects and their corridors only. All other objects hidden. Affected FIR boundaries highlighted. No orbital tracks unless the user expands an event card.
Temporal uncertainty display — Persona A/C: Event cards and the Operational Overview show window ranges in plain language (Window: 08h – 20h from now / Most likely ~14h from now), never ± N notation. The ± form implies symmetric uncertainty, which re-entry distributions are not. The Analyst Workspace (Persona B) additionally shows raw p05/p50/p95 UTC times.
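The plain-language rendering can be pinned down as a small formatter. This is a sketch — the function name, rounding, and zero-padding are assumptions — but the output strings match the event-card copy above:

```python
from datetime import datetime, timedelta, timezone

def format_event_window(now, p05, p50, p95):
    """Render p05/p95 as a window and p50 as the mode, in whole hours from now.
    Deliberately avoids ± notation: re-entry time distributions are asymmetric."""
    def hours(t):
        return round((t - now).total_seconds() / 3600)
    return (
        f"Window: {hours(p05):02d}h – {hours(p95):02d}h from now",
        f"Most likely ~{hours(p50)}h from now",
    )
```

The Analyst Workspace would show the raw p05/p50/p95 UTC timestamps alongside; only Persona A/C surfaces use this reduced form.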
6.3 Time Navigation System
Three modes — always visible, always unambiguous. Mixing modes without explicit user intent is prohibited.
| Mode | Indicator | Description |
|---|---|---|
| LIVE | Green pulsing pill: `● LIVE` | Current real-world state. Globe and predictions update from live feeds. |
| REPLAY | Amber pill: `⏪ REPLAY 2024-01-14 03:22 UTC` | Replaying a historical event. All data fixed. No live updates. |
| SIMULATION | Purple pill: `⚗ SIMULATION — [object name]` | Custom scenario. Data is synthetic. Must never be confused with live. |
The mode indicator is persistent in the top nav bar. Switching modes requires explicit action through a mode-switch dialogue — it cannot happen implicitly.
Mode-switch dialogue specification:
When the user initiates a mode switch (e.g., LIVE → SIMULATION), the following modal must appear. The dialogue must explicitly state the current mode, the target mode, and all operational consequences:
SWITCH TO SIMULATION MODE?
──────────────────────────────────────────────────────────────
You are currently viewing LIVE data.
Switching to SIMULATION will display synthetic scenario data.
⚠ Alerts and notifications are suppressed in SIMULATION.
⚠ Simulation data must never be used for operational decisions.
⚠ Other users will not see your simulation.
[Cancel] [Switch to Simulation ▶]
──────────────────────────────────────────────────────────────
Rules:
- Cancel on left, destructive action on right (consistent with aviation HMI conventions)
- The dialogue must always show both the current mode and target mode — never just "are you sure?"
- Equivalent dialogues apply for all mode transitions (LIVE ↔ REPLAY, LIVE ↔ SIMULATION, etc.)
Simulation mode block during active alerts: If the organisation has disable_simulation_during_active_events enabled (admin setting, default: off), the SIMULATION mode switch is blocked whenever there are unacknowledged CRITICAL or HIGH alerts. A modal replaces the switch dialogue:
CANNOT ENTER SIMULATION
──────────────────────────────────────────────────────────────
2 active CRITICAL alerts require acknowledgement.
Acknowledge all active alerts before running simulations.
[View active alerts] [Cancel]
──────────────────────────────────────────────────────────────
Document disable_simulation_during_active_events prominently in the admin UI: "Enable only if your organisation has a dedicated SpaceCom monitoring role separate from simulation users."
Timeline control — two zoom levels:
- Event scale (default): 72 hours, 6-hour intervals. Re-entry windows shown as coloured bars.
- Orbital scale: 4-hour window, 15-minute intervals. For orbital passes and conjunction events.
LIVE mode scrub: User can drag the playhead into the future to preview a predicted corridor. A "Return to Live" button appears whenever the playhead is not at current time.
Future-preview temporal wash: When the timeline playhead is not at current time (user is previewing a future state), the entire right-panel event list and alert badges are overlaid with a temporal wash (semi-transparent grey overlay) and a persistent label:
┌──────────────────────────────────────────────────────────────┐
│ ⏩ PREVIEWING +4h 00m — not current state [Return to Live] │
└──────────────────────────────────────────────────────────────┘
The wash and label prevent a controller from acting on predicted-future data as though it were current. The globe corridor may show the projected state; the event list must be visually distinct. Alert badges are greyed and annotated "(projected)" in preview mode. Alert sounds and notifications are suppressed while previewing.
6.4 Uncertainty Visualisation — Three Phased Modes
Three representations are planned across phases. All are user-selectable via the UncertaintyModeSelector once implemented. Each page context has a recommended default.
Mode selector (appears in the layer controls panel whenever corridor data is loaded):
Corridor Display
● Percentile Corridors ← Phase 1
○ Probability Heatmap ← Phase 2
○ Monte Carlo Particles ← Phase 3
Modes B and C appear greyed in the selector until their phase ships.
Mode A — Percentile Corridors (Phase 1, default for Persona A/C)
What it shows: Three nested polygon swaths on the globe — 5th, 50th, and 95th percentile ground track corridors from Monte Carlo output.
Visual encoding:
- 95th percentile: wide, 15% opacity amber fill, dashed border — hazard extent
- 50th percentile: medium, 35% opacity amber fill, solid border — nominal corridor
- 5th percentile: narrow, 60% opacity amber fill, bold border — high-probability core
Colour by risk level: Ocean-only → blue family; partial land → amber; significant land → red-orange.
Over time: As the re-entry window narrows, the outer swath contracts automatically in LIVE mode. The user watches the corridor "tighten" in real time.
Mode B — Probability Heatmap (Phase 2, default for Persona B)
What it shows: Continuous colour-ramp Deck.gl heatmap. Each cell's colour encodes probability density of ground impact across the full Monte Carlo sample set.
Visual encoding: Perceptually uniform, colour-blind-safe sequential palette (viridis or custom blue-white-orange). Scale normalised to the maximum probability cell; legend with percentile labels always shown.
Interaction: Hover a cell → tooltip shows "~N% probability of impact within this 50×50 km cell." The heatmap is recomputed client-side if the user adjusts the re-entry window bounds via the timeline.
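The window-bounds recompute can be sketched as a re-binning of the MC impact samples. The production path is client-side per the spec; Python is used here only for brevity, and the tuple layout of `samples` is a simplified stand-in for the real sample set:

```python
from collections import Counter

def recompute_heatmap(samples, t_start, t_end, cell_deg=0.5):
    """Re-bin Monte Carlo impact samples after the user narrows the
    re-entry window. `samples` is a list of (t_impact, lat, lon, weight)
    tuples (illustrative schema). Returns {cell: density} normalised to
    the maximum cell, so the legend scale stays 0..1."""
    cells = Counter()
    for t, lat, lon, w in samples:
        if t_start <= t <= t_end:          # drop samples outside the new window
            cell = (round(lat / cell_deg), round(lon / cell_deg))
            cells[cell] += w
    peak = max(cells.values(), default=0.0)
    return {c: v / peak for c, v in cells.items()} if peak else {}
```

Normalising to the peak cell after filtering keeps the colour ramp meaningful even when the narrowed window contains only a fraction of the samples.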
Mode C — Monte Carlo Particle Visualisation (Phase 3, Persona B advanced / Persona C briefing)
What it shows: 50–200 animated MC sample trajectory lines converging from re-entry interface altitude (~80 km) to impact. Particle colour encodes F10.7 assumption (cool = low solar activity = later re-entry, warm = high). Impact points persist as dots.
Interaction: Play/pause animation; scrub to any point in the trajectory; click a particle to see its parameter set (F10.7, Ap, B*).
Performance: Use CesiumJS Primitive API with per-instance colour attributes — not Entity API. Trajectory geometry pre-baked server-side and streamed as binary format (/viz/mc-trajectories/{prediction_id}). Never compute trajectories in the browser.
Not the default for Persona A — the animation can be alarming without quantitative context.
Weighted opacity: Particles render with opacity proportional to their sample weight, not uniform opacity. This visually down-weights outlier trajectories so that low-probability high-consequence paths do not visually dominate.
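One possible mapping from sample weight to opacity, assuming normalised non-negative weights. The floor value keeps every particle faintly visible, which goes slightly beyond strict proportionality; both constants are assumptions, not spec values:

```python
def particle_opacities(weights, min_alpha=0.05, max_alpha=0.9):
    """Map MC sample weights to per-particle opacity so that outlier
    trajectories fade rather than dominate. Linear scaling with a
    visibility floor is an illustrative choice."""
    w_max = max(weights)
    return [min_alpha + (max_alpha - min_alpha) * (w / w_max)
            for w in weights]
```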
Mandatory first-use overlay: When Mode C is first enabled (per user, tracked in user preferences), a one-time overlay appears before the animation starts:
MONTE CARLO PARTICLE VIEW
──────────────────────────────────────────────────────────────
Each animated line shows one possible re-entry scenario sampled
from the prediction distribution. Colour encodes the solar
activity assumption used for that sample.
These are not equally likely outcomes — particle opacity
reflects sample weight. For operational planning, the
Percentile Corridors view (Mode A) gives a more reliable
summary.
[Understood — show animation]
──────────────────────────────────────────────────────────────
The overlay is dismissed permanently per user on first acknowledgement and never shown again. It cannot be bypassed — the animation does not play until the user explicitly acknowledges.
6.5 Globe Information Hierarchy and Layer Management
Default view state: Active decay objects and their corridors, FIR boundaries for affected regions. "Show everything" is never the default.
Layer management panel:
LAYERS
────────────────────────────────────────
Objects
☑ Active decay objects (TIP issued)
☑ Decaying objects (perigee < 250 km)
☐ All tracked payloads
☐ Rocket bodies
☐ Debris catalog
Orbital Tracks
☐ Ground tracks (selected object only)
☐ All objects — [!] performance warning
Predictions & Corridors
☑ Re-entry corridors (active events)
☐ Re-entry corridors (all predicted)
☐ Fragment impact points
☐ Conjunction geometry
Airspace (Phase 2)
☐ FIR / UIR boundaries
☐ Controlled airspace
☐ Affected sectors (hazard intersection)
Reference
☐ Population density grid
☐ Critical infrastructure
────────────────────────────────────────
Corridor Display: [Percentile ▾]
Layer state persists to localStorage per session. Shared URLs encode active layer state in query parameters.
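A sketch of the shareable-URL encoding, in Python for consistency with the other examples. The `layers` parameter name and comma separator are assumptions; the spec only requires that layer state round-trips through query parameters:

```python
from urllib.parse import urlencode, parse_qs

def layers_to_query(active_layers) -> str:
    """Encode the active layer set into a shareable query string.
    Sorting makes the URL deterministic for identical layer sets."""
    return urlencode({"layers": ",".join(sorted(active_layers))})

def layers_from_query(query: str) -> set:
    """Inverse: restore the layer set from a shared URL's query string."""
    raw = parse_qs(query).get("layers", [""])[0]
    return set(filter(None, raw.split(",")))

q = layers_to_query({"corridors_active", "fir_boundaries"})
```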
Object clustering: At zoom > 5,000 km, objects cluster. Badge shows count and highest urgency level. Clusters expand at < 2,000 km.
Altitude-aware clustering rule (F8 — §62): Objects at different altitudes with the same ground-track sub-point are not co-located — they have different re-entry windows and different hazard profiles. Two objects that share a 2D screen position but differ by > 100 km in altitude must not be merged into a single cluster. Implementation rule: CesiumJS EntityCluster clustering is disabled for any object with reentry_predictions showing a window < 30 days (i.e., any decay-relevant object in the watch/alert state). Objects in the normal catalog (window > 30 days) may continue to use screen-space clustering. This prevents the pathological case where a TIP-active object at 200 km is merged into a cluster with a nominal object at 500 km that shares its ground track, making the TIP object invisible in the cluster badge.
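The F8 rule reduces to a pairwise eligibility predicate. In CesiumJS the actual mechanism is excluding decay-relevant entities from the EntityCluster; the sketch below expresses the same logic in backend terms, with the dict shape of `obj` as an illustrative stand-in for the real catalog record:

```python
from datetime import datetime, timedelta, timezone

ALTITUDE_MERGE_LIMIT_KM = 100
DECAY_RELEVANT_WINDOW_DAYS = 30

def may_cluster(obj_a: dict, obj_b: dict, now: datetime) -> bool:
    """Screen-space clustering eligibility per the altitude-aware rule.
    `obj` dicts carry 'altitude_km' and an optional 'reentry_window_end'
    (illustrative field names)."""
    for obj in (obj_a, obj_b):
        window_end = obj.get("reentry_window_end")
        if window_end and window_end - now < timedelta(days=DECAY_RELEVANT_WINDOW_DAYS):
            return False  # decay-relevant objects never cluster
    # objects more than 100 km apart in altitude are never merged
    return abs(obj_a["altitude_km"] - obj_b["altitude_km"]) <= ALTITUDE_MERGE_LIMIT_KM
```

This directly prevents the pathological case described above: a TIP-active object at 200 km fails the first check, so it can never disappear into a cluster badge.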
Urgency / Priority Visual Encoding (colour-blind-safe — shape distinguishes as well as colour):
| State | Symbol | Colour | Meaning |
|---|---|---|---|
| TIP issued, window < 6h | ◆ filled diamond | Red #D32F2F | Imminent re-entry |
| TIP issued, window 6–24h | ◇ outlined diamond | Orange #E65100 | Active threat |
| Predicted decay, window < 7d | ▲ triangle | Amber #F9A825 | Elevated watch |
| Decaying, window > 7d | ● circle | Yellow-grey | Monitor |
| Conjunction Pc > 1:1000 | ✕ cross | Purple #6A1B9A | Conjunction risk |
| Normal tracked | · dot | Grey #546E7A | Catalog |
Never use red/green as the sole distinguishing pair.
6.6 Alert System UX
Alert taxonomy:
| Level | Trigger | Visual Treatment | Requires Acknowledgement? |
|---|---|---|---|
| CRITICAL | TIP issued, window < 6h, hazard intersects active FIR | Full-width banner (red), audio tone (ops room mode) | Yes — named user; timestamp + note recorded |
| HIGH | Window < 24h, conjunction Pc > 1:1000 | Persistent badge (orange) | Yes — dismissal recorded |
| MEDIUM | New TIP issued (any), window < 7d, new CDM | Toast (amber), 8s auto-dismiss | No — logged |
| LOW | New TLE ingested, space weather index change | Notification centre only | No |
Alert fatigue mitigation:
- Mute rules: per-user, per-session LOW suppression
- Geographic filtering: alerts scoped to user's configured FIR list
- Deduplication: window shrinks that don't cross a threshold do not re-trigger
- Rate limit: same trigger condition cannot produce more than 1 CRITICAL alert per object per 4-hour window without a manual operator reset
- Alert generation triggered only by backend logic on verified data — never by direct API call from a client
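The per-object CRITICAL rate limit can be sketched as a small gate. This is a minimal in-memory illustration (class and method names are assumptions); production state lives in the backend alongside the append-only `alert_events` log:

```python
from datetime import datetime, timedelta

RATE_LIMIT_WINDOW = timedelta(hours=4)

class CriticalAlertGate:
    """At most one CRITICAL alert per object per 4-hour window,
    unless an operator performs a manual reset."""

    def __init__(self):
        self._last_fired: dict[int, datetime] = {}

    def allow(self, norad_id: int, now: datetime) -> bool:
        last = self._last_fired.get(norad_id)
        if last is not None and now - last < RATE_LIMIT_WINDOW:
            return False  # suppressed: same object fired recently
        self._last_fired[norad_id] = now
        return True

    def operator_reset(self, norad_id: int) -> None:
        """Manual operator reset re-arms the gate for one object."""
        self._last_fired.pop(norad_id, None)
```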
Ops room workload buffer (OPS_ROOM_SUPPRESS_MINUTES): An optional per-organisation setting (default: 0 — disabled). When set to N > 0, CRITICAL alert full-screen banners are queued for up to N minutes before display. The top-nav badge increments immediately so peripheral attention is captured; only the full-screen interrupt is deferred. This matches FAA AC 25.1329 alert prioritisation philosophy: acknowledge at a glance, act when workload permits. Must be documented in the admin UI with a mandatory warning: "Only enable if your operations room has a dedicated SpaceCom monitoring role. If a single controller manages all alerts, suppression introduces delay that may be safety-significant."
Audio alert specification:
- Trigger: CRITICAL alert only (no audio for HIGH or lower)
- Sound: two-tone ascending chime pattern (not a siren — ops rooms have sirens from other systems)
- Behaviour: plays once on alert display; does not loop; stops on alert acknowledgement (not just banner dismiss)
- Volume: configurable per-device (default 50% system volume); mutable by operator per-session
- Ops room mode: organisation-level setting that enables audio (default: off; requires explicit activation)
Alert storm detection: If the system generates ≥ 5 CRITICAL alerts within 1 hour across all objects, a meta-alert is sent to Persona D. The meta-alert presents a disambiguation prompt rather than a bare count:
[META-ALERT — ALERT VOLUME ANOMALY]
──────────────────────────────────────────────────────────────
5 CRITICAL alerts generated within 1 hour.
This may indicate:
(a) Multiple genuine re-entry events — verify via Space-Track
independently before taking operational action.
(b) System integrity issue — check ingest pipeline and data
source health for signs of false data injection.
[Open /admin health dashboard →] [View all CRITICAL alerts →]
──────────────────────────────────────────────────────────────
Acknowledgement workflow:
CRITICAL acknowledgement requires two steps to prevent accidental confirmation:
Step 1 — Alert banner with summary and Open Map link:
[CRITICAL ALERT]
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — TIP Issued
Re-entry window: 2026-03-16 14:00 – 22:00 UTC (8h)
Affected FIRs: YMMM, YSSY
Risk level: HIGH | [Open map →]
[Review and Acknowledge →]
───────────────────────────────────────────────────────
Step 2 — Confirmation modal (appears on clicking "Review and Acknowledge"):
ACKNOWLEDGE CRITICAL ALERT
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — Re-entry window 14:00–22:00 UTC 16 Mar
Action taken (required — minimum 10 characters):
[_____________________________________________]
[Cancel] [Confirm — J. Smith, 09:14 UTC]
───────────────────────────────────────────────────────
The Confirm button is disabled until the Action taken field contains ≥ 10 characters. This prevents reflexive one-click acknowledgement during an incident and ensures a minimal action record is always created.
Acknowledgements stored in alert_events (append-only). Records cannot be modified or deleted.
6.7 Timeline / Gantt View
Full timeline accessible from /events and as a compact strip on the Operational Overview.
NOW +6h +12h +24h +48h +72h
Object │ │ │ │ │ │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
CZ-5B R/B 44878 │ [■■■■■[══════ window ═══════]■■■] │
YMMM FIR — HIGH │ │ │ │ │ │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
SL-16 R/B 28900 │ │ │ [■[══════════════════════════→
NZZC FIR — MED │ │ │ │ │ │
■ = nominal re-entry point; ══ = uncertainty window; colour = risk level.
Click event bar → Event Detail page; hover → tooltip with window bounds and affected FIRs. Zoom range: 6h to 7d.
6.8 Event Detail Page (/events/{id})
┌──────────────────────────────────────────────────────────────┐
│ ← Events │ CZ-5B R/B NORAD 44878 │ [■ CRITICAL] │
│ │ Re-entry window: 14:00–22:00 UTC 16 Mar 2026 │
├──────────────────────────────┬───────────────────────────────┤
│ │ OBJECT │
│ 3D GLOBE │ Mass: 21,600 kg (● DISCOS) │
│ (focused on corridor) │ B*: 0.000215 /ER │
│ Mode: [Percentile ▾] │ Data confidence: ● DISCOS │
│ [Layers] │ │
│ │ PREDICTION │
│ │ Model: cowell_nrlmsise00 v2 │
│ │ F10.7 assumed: 148 sfu │
│ │ MC samples: 500 │
│ │ HMAC: ✓ verified │
│ │ │
│ │ WINDOW │
│ │ 5th pct: 13:12 UTC │
│ │ 50th pct: 17:43 UTC │
│ │ 95th pct: 22:08 UTC │
│ │ │
│ │ TIP MESSAGES │
│ │ MSG #3 — 09:00 UTC today │
│ │ [All TIP history →] │
├──────────────────────────────┴───────────────────────────────┤
│ AFFECTED AIRSPACE (Phase 2) │
│ YMMM FIR ████ HIGH entry 14:20–19:10 UTC │
├──────────────────────────────────────────────────────────────┤
│ [Run Simulation] [Generate Report] [Share Link] │
└──────────────────────────────────────────────────────────────┘
HMAC verification status is displayed prominently. If ✗ verification failed appears, a banner reads: "This prediction record may have been tampered with. Do not use for operational decisions. Contact your system administrator."
Data confidence annotates every physical property: ● DISCOS (green), ● estimated (amber), ● unknown (grey). When source is unknown or estimated, a warning callout appears above the prediction panel.
Corridor Evolution widget (Phase 2): A compact 2D strip on the Event Detail page showing how the p50 corridor footprint is evolving over time — three overlapping semi-transparent polygon outlines at T+0h, T+2h, T+4h from the current prediction. Updated automatically in LIVE mode. Gives Persona A Level 3 situation awareness (projection) at a glance without requiring simulation tools. Labelled: "Corridor evolution — how prediction is narrowing". If the corridor is widening (unusual), an amber warning appears: "Uncertainty is increasing — check space weather."
Duty Manager View (Phase 2): A [Duty Manager View] toggle button on the Event Detail header. When active, collapses all technical detail and presents a large-text, decluttered view containing only:
┌──────────────────────────────────────────────────────────────┐
│ CZ-5B R/B NORAD 44878 [■ CRITICAL] │
│ │
│ RE-ENTRY WINDOW │
│ Start: 14:00 UTC 16 Mar 2026 │
│ End: 22:00 UTC 16 Mar 2026 │
│ Most likely: 17:43 UTC │
│ │
│ AFFECTED FIRs │
│ YMMM (Airservices Australia) — HIGH RISK │
│ YSSY (Airservices Australia) — MEDIUM RISK │
│ │
│ [Draft NOTAM] [Log Action] [Share Link] │
└──────────────────────────────────────────────────────────────┘
Toggle back to full view via [Technical Detail]. State is not persisted between sessions — always starts in full view.
Response Options accordion (Phase 2): An expandable panel at the bottom of the Event Detail page, visible to operator and above roles. Contextualised to the current risk level and FIR intersection. These are considerations only — all decisions rest with the ANSP:
RESPONSE OPTIONS [▼ expand]
──────────────────────────────────────────────────────────────
Based on current prediction (risk: HIGH, window: 8h):
The following actions are for your consideration.
All operational decisions rest with the ANSP.
☐ Issue SIGMET or advisory to aircraft in YMMM FIR
☐ Notify adjacent ANSPs (YMMM borders: WAAF, OPKR)
☐ Draft NOTAM for authorised issuance [Open →]
☐ Coordinate with FMP on traffic flow impact
☐ Establish watching brief schedule (every 30 min)
[Log coordination note]
──────────────────────────────────────────────────────────────
Checkbox states and coordination notes are appended to alert_events (append-only). The Response Options items are dynamically generated by the backend based on risk level and affected FIR count — not hardcoded in the frontend.
6.9 Simulation Job Management UX
Persistent collapsible bottom-drawer panel visible on any page. Jobs continue running when the user navigates away.
SIMULATION JOBS [▲ collapse]
────────────────────────────────────────────────────────────────
● Running Decay prediction — 44878 312/500 ████░ 62%
F10.7: 148, Ap: 12, B*±10% ~45s rem
[Cancel]
✓ Complete Decay prediction — 44878 High F10.7 scenario
Completed 09:02 UTC [View results] [Compare]
✗ Failed Breakup simulation — 28900
Error: DISCOS data missing [Retry] [Details]
────────────────────────────────────────────────────────────────
Simulation comparison: Two completed runs for the same object can be overlaid on the globe with distinct colours and a split-panel parameter comparison.
6.10 Space Weather Widget
SPACE WEATHER [09:14 UTC]
────────────────────────────────────────────────────────────
Solar Activity ●●●○○ ELEVATED
F10.7 observed: 148 sfu (81d avg: 132)
Geomagnetic ●●●●○ ACTIVE
Kp: 5.3 / Ap daily: 27
Re-entry Impact ▲ Active conditions — extend precaution window
Add ≥2h buffer beyond 95th percentile.
Forecast (24h) Activity expected to decline — Kp 3–4
────────────────────────────────────────────────────────────
Source: NOAA SWPC Updated: 09:00 UTC [Full history →]
Operational status summary is generated by the backend based on F10.7 deviation from the 81-day average. The "Re-entry Impact" line delivers an operationally actionable statement — not a percentage — with a concrete recommended precaution buffer computed by the backend and delivered as a structured field:
| Condition | Re-entry Impact line | Recommended buffer |
|---|---|---|
| F10.7 < 90 or Kp < 2 | Low activity — predictions at nominal accuracy | +0h |
| F10.7 90–140, Kp 2–4 | Moderate activity — standard uncertainty applies | +1h |
| F10.7 140–200, Kp 4–6 | Active conditions — extend precaution window. Add ≥2h buffer beyond 95th percentile. | +2h |
| F10.7 > 200 or Kp > 6 | High activity — predictions less reliable. Add ≥4h buffer beyond 95th percentile. | +4h |
The buffer recommendation is surfaced on the Event Detail page as an explicit callout when conditions are Elevated or above: "Space weather active: consider extending your airspace precaution window to [95th pct time + buffer]."
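A sketch of the backend buffer computation from the table above. The table mixes "or" conditions with paired ranges, so taking the worse of the two drivers is an interpretive assumption, as are the exact tier boundaries at the edges:

```python
def precaution_buffer_hours(f107: float, kp: float) -> int:
    """Recommended precaution buffer (hours beyond the 95th percentile)
    from F10.7 (sfu) and Kp. Worst-of-both-drivers logic is assumed."""
    def f107_tier(x: float) -> int:
        if x > 200: return 3
        if x > 140: return 2
        if x >= 90: return 1
        return 0
    def kp_tier(x: float) -> int:
        if x > 6: return 3
        if x > 4: return 2
        if x >= 2: return 1
        return 0
    return [0, 1, 2, 4][max(f107_tier(f107), kp_tier(kp))]
```

The widget example above (F10.7 148, Kp 5.3) lands in the "Active conditions" tier, yielding the +2h buffer shown.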
6.11 2D Plan View (Phase 2)
Globe/map toggle ([🌐 Globe] [🗺 Plan]) synchronises selected object, active corridor, and time position. State is preserved on switch.
2D view features: Mercator or azimuthal equidistant projection; ICAO chart symbology for airspace; ground-track corridor as horizontal projection only; altitude/time cross-section panel below showing corridor vertical extent at each FIR crossing.
6.12 Reporting Workflow
Report configuration dialogue:
NEW REPORT — CZ-5B R/B (44878)
──────────────────────────────────────────────────────────────
Simulation: [Run #3 — 09:14 UTC ▾]
Report Type:
○ Operational Briefing (1–2 pages, plain language)
○ Technical Assessment (full uncertainty, model provenance)
○ Regulatory Submission (formal format, appendices)
Include Sections:
☑ Object properties and data confidence
☑ Re-entry window and uncertainty percentiles
☑ Ground track corridor map
☑ Affected airspace and FIR crossing times
☑ Space weather conditions at prediction time
☑ Model version and simulation parameters
☐ Full MC sample distribution
☐ TIP message history
Prepared by: J. Smith Authority: CASA
──────────────────────────────────────────────────────────────
[Preview] [Generate PDF] [Cancel]
Report identity: Every report has a unique ID, the simulation ID it was derived from, a generation timestamp, and the analyst's name. Reports are stored in MinIO and listed in /reports.
Date format in all reports and exports (F7): Slash-delimited dates (03/04/2026) are ambiguous between DD/MM and MM/DD and are banned from all SpaceCom outputs. All dates in PDF reports, CSV exports, and NOTAM drafts use DD MMM YYYY format (e.g. 04 MAR 2026) — unambiguous across all locales and consistent with ICAO and aviation convention. All times alongside dates use HH:MMZ (e.g. 04 MAR 2026 14:00Z). This applies to: PDF prediction reports, CSV bulk exports, NOTAM draft (B)/(C) fields (which use ICAO YYMMDDHHMM format internally but are displayed as DD MMM YYYY HH:MMZ in the preview).
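The F7 rule can be sketched as a single formatting helper (name illustrative). An explicit month table avoids the locale-dependent `strftime('%b')`, since the output must be English regardless of server locale:

```python
from datetime import datetime, timezone

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def format_ops_datetime(dt: datetime) -> str:
    """Render a timestamp per F7: DD MMM YYYY HH:MMZ, never
    slash-delimited, always UTC."""
    dt = dt.astimezone(timezone.utc)
    return (f"{dt.day:02d} {MONTHS[dt.month - 1]} {dt.year} "
            f"{dt.hour:02d}:{dt.minute:02d}Z")

stamp = format_ops_datetime(datetime(2026, 3, 4, 14, 0, tzinfo=timezone.utc))
```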
Report rendering: Server-side Playwright in the isolated renderer container. The map image is a headless Chromium screenshot of the globe at the relevant configuration. All user-supplied text is HTML-escaped before interpolation. The renderer has no external network access — it receives only sanitised, structured data from the backend API.
6.13 NOTAM Drafting Workflow (Phase 2)
SpaceCom cannot issue NOTAMs. Only designated NOTAM offices authorised by the relevant AIS authority can issue them. SpaceCom's role is to produce a draft in ICAO Annex 15 format ready for review and formal submission by an authorised originator.
Trigger: From the Event Detail page, Persona A clicks [Draft NOTAM]. This is only available when a hazard corridor intersects one or more FIRs.
Draft NOTAM output (ICAO Annex 15 / OPADD format):
Field format follows ICAO Annex 15 Appendix 6 and EUROCONTROL OPADD. Timestamps use YYMMDDHHmm format (not ISO 8601 — ICAO Annex 15 §5.1.2). (B) = p10 − 30 min; (C) = p90 + 30 min (see mapping table below).
NOTAM DRAFT — FOR REVIEW AND AUTHORISED ISSUANCE ONLY
══════════════════════════════════════════════════════
Generated by SpaceCom v2.1 | Prediction ID: pred-44878-20260316-003
Data source: USSPACECOM TIP #3 + SpaceCom decay prediction
⚠ This is a DRAFT only. Must be reviewed and issued by authorised NOTAM office.
Q) YMMM/QWELW/IV/NBO/AE/000/999/2200S13300E999
A) YMMM
B) 2603161330
C) 2603162230
E) UNCONTROLLED SPACE OBJECT RE-ENTRY. OBJECT: CZ-5B ROCKET BODY
NORAD ID 44878. PREDICTED RE-ENTRY WINDOW 1400-2200 UTC 16 MAR
2026. NOMINAL RE-ENTRY POINT APRX 22S 133E. 95TH PERCENTILE
CORRIDOR 18S 115E TO 28S 155E. DEBRIS SURVIVAL PSB. AIRSPACE
WITHIN CORRIDOR MAY BE AFFECTED ALL LEVELS DURING WINDOW.
REF SPACECOM PRED-44878-20260316-003.
F) SFC
G) UNL
NOTAM field mapping (ICAO Annex 15 Appendix 6):
| NOTAM field | SpaceCom data source | Format rule |
|---|---|---|
| (Q) Q-line | FIR ICAO designator + NOTAM code QWELW (re-entry warning) | Generated from airspace.icao_designator; subject code WE (airspace warning), condition LW (laser/space) |
| (A) FIR | airspace.icao_designator for each intersecting FIR | One NOTAM per FIR; multi-FIR events generate multiple drafts |
| (B) Valid from | prediction.p10_reentry_time − 30 minutes | YYMMDDHHmm (UTC); example: 2603161330 |
| (C) Valid to | prediction.p90_reentry_time + 30 minutes | YYMMDDHHmm (UTC) |
| (D) Schedule | Omitted (continuous) | Do not include (D) field for continuous validity |
| (E) Description | Templated from sanitised object name, NORAD ID, p50 time, corridor bounds | sanitise_icao() applied; ICAO Doc 8400 abbreviations used (PSB not "possible", APRX not "approximately") |
| (F)/(G) Limits | SFC / UNL | Hardcoded for re-entry events; do not compute from corridor altitude |
(B)/(C) field: re-entry window to NOTAM validity — time-critical cancellation: The (C) validity time does not mean the hazard persists until then — it is the worst-case boundary. When re-entry is confirmed, the NOTAM cancellation draft must be initiated immediately. The Event Detail page surfaces a prominent [Draft NOTAM Cancellation — RE-ENTRY CONFIRMED] button at the moment the event status changes to confirmed, with a UI note: "Cancellation draft should be submitted to the NOTAM office without delay."
Unit test: Generate a draft for a prediction with p10=2026-03-16T14:00Z, p90=2026-03-16T22:00Z; assert (B) field is 2603161330 and (C) field is 2603162230. Assert Q-line matches regex \(Q\) [A-Z]{4}/QWELW/IV/NBO/AE/\d{3}/\d{3}/\d{4}[NS]\d{5}[EW]\d{3}.
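The (B)/(C) computation is small enough to show directly. The helper name is illustrative; the ±30-minute offsets and YYMMDDHHmm rendering follow the mapping table above:

```python
from datetime import datetime, timedelta, timezone

def notam_validity_fields(p10: datetime, p90: datetime) -> tuple[str, str]:
    """(B) = p10 - 30 min, (C) = p90 + 30 min, both as ICAO
    YYMMDDHHmm in UTC (Annex 15, not ISO 8601)."""
    def icao(t: datetime) -> str:
        return t.astimezone(timezone.utc).strftime("%y%m%d%H%M")
    return icao(p10 - timedelta(minutes=30)), icao(p90 + timedelta(minutes=30))

b, c = notam_validity_fields(
    datetime(2026, 3, 16, 14, 0, tzinfo=timezone.utc),
    datetime(2026, 3, 16, 22, 0, tzinfo=timezone.utc),
)
```

Run against the spec's unit-test fixture, this yields (B) 2603161330 and (C) 2603162230 as required.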
NOTAM cancellation draft: When an event is closed (re-entry confirmed, object decayed), the Event Detail page offers [Draft NOTAM Cancellation] — generates a CANX NOTAM draft referencing the original.
Regulatory note displayed in the UI: A persistent banner on the NOTAM draft page reads: "This draft is generated for review purposes only. It must be reviewed for accuracy, formatted to local AIS standards, and issued by an authorised NOTAM originator. SpaceCom does not issue NOTAMs."
NOTAM language and i18n exclusion (F6): ICAO Annex 15 specifies that NOTAMs use ICAO standard phraseology in English (or the language of the state for domestic NOTAMs). NOTAM template strings are never internationalised:
- All NOTAM template strings are hardcoded ICAO English phraseology in
backend/app/modules/notam/templates.py - Each template string is annotated
# ICAO-FIXED: do not translate - The NOTAM draft is excluded from the
next-intlmessage extraction tooling - The NOTAM preview panel renders in a fixed-width monospace font to match traditional NOTAM format
lang="en"attribute is set on the NOTAM text container regardless of the operator's UI locale
The draft is stored in the notam_drafts table (see §9.2) for audit purposes.
6.14 Shadow Mode (Phase 2)
Shadow mode allows ANSPs to run SpaceCom in parallel with existing procedures during a trial period, without acting operationally on its outputs. This is the primary mechanism for building regulatory acceptance evidence.
Activation: admin role only, per-organisation setting in /admin.
Visual treatment when shadow mode is active:
┌─────────────────────────────────────────────────────────────────┐
│ ⚗ SHADOW MODE — Predictions are not for operational use │
│ All outputs are recorded for validation. No alerts are │
│ delivered externally. Contact your administrator to disable. │
└─────────────────────────────────────────────────────────────────┘
- A persistent amber banner spans the top of every page
- The mode indicator pill shows ⚗ SHADOW in amber
- All alert levels are demoted to INFORMATIONAL — no banners, no audio tones, no email delivery
- Prediction records have shadow_mode = TRUE in the database (see §9)
- Shadow predictions are excluded from all operational views but accessible in /analysis
Validation reporting: After each real re-entry event, Persona B can generate a Shadow Validation Report comparing SpaceCom shadow predictions against the actual observed re-entry time/location. These reports form the evidence base for regulatory adoption.
Shadow Mode Exit Criteria (regulatory hand-off specification — Finding 6):
Shadow mode is a formal regulatory activity, not a product trial. Exit to operational use requires:
| Criterion | Requirement |
|---|---|
| Minimum shadow period | 90 days, or covering ≥ 3 re-entry events above the CRITICAL alert threshold, whichever is longer |
| Prediction accuracy | corridor_contains_observed ≥ 90% across shadow period events (from prediction_outcomes) |
| False positive rate | fir_false_positive_rate ≤ 20% — no more than 1 in 5 corridor-intersecting FIR alerts is a false alarm |
| False negative rate | fir_false_negative = 0 during the shadow period — no re-entry event missed entirely |
| Exit document | shadow-mode-exit-report-{org_id}-{date}.pdf generated from prediction_outcomes; contains automated statistics + ANSP Safety Department sign-off field |
| Regulatory hand-off | Written confirmation from the ANSP's Accountable Manager or Head of ATM Safety that their internal Safety Case / Tool Acceptance process is complete |
| System state | shadow_mode_cleared = TRUE is set by SpaceCom admin only after receipt of the written ANSP confirmation |
The exit report template lives at docs/templates/shadow-mode-exit-report.md. Persona B generates the statistics from the admin analysis panel; the ANSP prints, signs, and returns the PDF. No software system can substitute for the ANSP's internal Safety Department sign-off.
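The automatable subset of the exit criteria can be sketched as a single check. Field names mirror the table but are illustrative; the written ANSP sign-off and regulatory hand-off steps are deliberately outside this function, since no software check can substitute for them:

```python
def shadow_exit_ready(stats: dict) -> tuple[bool, list[str]]:
    """Evaluate the automated shadow-mode exit criteria from
    prediction_outcomes statistics (illustrative field names)."""
    failures = []
    # "90 days or >= 3 events, whichever is longer" => both must hold
    if stats["shadow_days"] < 90 or stats["critical_events"] < 3:
        failures.append("minimum shadow period not met")
    if stats["corridor_contains_observed"] < 0.90:
        failures.append("prediction accuracy below 90%")
    if stats["fir_false_positive_rate"] > 0.20:
        failures.append("false positive rate above 20%")
    if stats["fir_false_negatives"] != 0:
        failures.append("missed re-entry event during shadow period")
    return (not failures, failures)
```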
Commercial trial-to-operational conversion (Finding 5):
A successful shadow exit automatically generates a commercial offer. The admin panel transitions the organisation's subscription_status from 'shadow_trial' to 'offered' and Persona D receives a task notification. The offer package includes:
- Commercial offer document (generated from docs/templates/commercial-offer-ansp.md): tier, pricing, SLA schedule, DPA status
- MSA execution path: ANSPs that accept the offer sign the MSA; no separate negotiation required for the standard ANSP Operational tier
- Onboarding checklist: docs/onboarding/ansp-onboarding-checklist.md
If an ANSP does not convert within 30 days of receiving the offer, subscription_status moves to 'offered_lapsed' and Persona D is notified. The admin panel shows conversion pipeline status for all ANSP organisations. Maximum concurrent ANSP shadow deployments in Phase 2: 2 (resource constraint — each requires a dedicated SpaceCom integration lead for the 90-day shadow period).
6.15 Space Operator Portal UX (Phase 2)
The Space Operator Portal (/space) is the second front door. It serves Personas E and F with a technically dense interface — a different visual language from the aviation-facing portal.
Space Operator Overview (/space):
┌─────────────────────────────────────────────────────────────────┐
│ SpaceCom · Space Portal [API] [Export] [Persona E: ORBCO] │
├─────────────────────┬───────────────────────────────────────────┤
│ │ MY OBJECTS (3) │
│ 3D GLOBE │ ┌────────────────────────────────────┐ │
│ (owned objects │ │ CZ-5B R/B 44878 │ │
│ only, with │ │ Perigee: 178 km ↓ Decaying fast │ │
│ full orbital │ │ Re-entry: 16 Mar ± 8h │ │
│ tracks and │ │ [Predict] [Plan deorbit] [Export] │ │
│ decay vectors) │ ├────────────────────────────────────┤ │
│ │ │ SL-16 R/B 28900 │ │
│ │ │ Perigee: 312 km ~ Stable │ │
│ │ │ [Predict] [Export] │ │
│ │ └────────────────────────────────────┘ │
│ │ CONJUNCTION ALERTS (MY OBJECTS) │
│ │ No active conjunctions > Pc 1:10000 │
├─────────────────────┴───────────────────────────────────────────┤
│ API USAGE Requests today: 143 / 1000 [Manage keys →] │
└─────────────────────────────────────────────────────────────────┘
Controlled Re-entry Planner (/space/reentry/plan):
Available for objects with remaining manoeuvre capability (flagged in owned_objects.has_propulsion).
CONTROLLED RE-ENTRY PLANNER — CZ-5B R/B (44878)
─────────────────────────────────────────────────────────────────
Delta-V budget: [▓▓▓░░░░░] 12.4 m/s remaining
Target re-entry window: [2026-03-20 ▾] to [2026-03-22 ▾]
Avoid FIRs: [☑ YMMM] [☑ YSSY] [☑ Populated land]
Preferred landing: ● Ocean ○ Specific zone
CANDIDATE WINDOWS
──────────────────────────────────────────────────────────────────
#1 2026-03-21 03:14 UTC ΔV: 8.2 m/s Risk: ● LOW
Landing: South Pacific FIR: NZZO (ocean)
[Select] [View corridor]
#2 2026-03-21 09:47 UTC ΔV: 11.1 m/s Risk: ● LOW
Landing: Indian Ocean FIR: FJDG (ocean)
[Select] [View corridor]
#3 2026-03-21 15:30 UTC ΔV: 9.8 m/s Risk: ▲ MEDIUM
Landing: 22S 133E FIR: YMMM (land)
[Select] [View corridor]
──────────────────────────────────────────────────────────────────
[Export manoeuvre plan (CCSDS)] [Generate operator report]
The planner outputs are suitable for submission to national space regulators as evidence of responsible end-of-life management under the ESA Zero Debris Charter and national space law requirements.
Zero Debris Charter compliance output format (Finding 2):
The planner produces a controlled-reentry-compliance-report-{norad_id}-{date}.pdf containing:
- Ranked deorbit window analysis (delta-V budget, window start/end, corridor risk score per window)
- FIR avoidance corridors for each candidate window
- Probability of casualty on the ground (Pc_ground) computed using NASA Debris Assessment Software methodology (1-in-10,000 IADC casualty threshold; documented in model card)
- Comparison table: each candidate window vs. the 1:10,000 Pc_ground threshold; compliant windows flagged green
- Zero Debris Charter alignment statement (auto-generated from object disposition)
Machine-readable companion: application/vnd.spacecom.reentry-compliance+json — returned alongside the PDF download URL as compliance_report_url in the planning job result. Format documented in docs/api-guide/compliance-export.md.
The Pc_ground calculation uses the fragment survivability model (§15.3 material class lookup) and the ESA DRAMA casualty area methodology. A NULL objects.material_class triggers the conservative all-survive assumption and therefore a higher Pc_ground — creating an incentive for operators to provide accurate physical data.
ECCN classification review (already in §21 Phase 2 DoD) must resolve before this output is shared with non-US entities.
6.16 Accessibility Requirements
- WCAG 2.1 Level AA compliance — required for government and aviation authority procurement
- Colour-blind-safe palette throughout; urgency uses shape + colour, never colour alone
- High-contrast mode available in user settings (WCAG AAA scheme)
- Dark mode as a first-class theme (not an afterthought)
- All interactive elements keyboard-accessible; tab order logical
- Alerts announced via aria-live="assertive" (CRITICAL) and aria-live="polite" (MEDIUM/LOW)
- Globe canvas has an aria-label describing the current view context
- Minimum touch target size 44×44 px
- Tested at 1080p (ops room), 1440p (analyst workstation), 1024×768 (tablet minimum)
- Automated axe-core audit via @axe-core/playwright run on the 5 core pages on every PR; 0 critical, 0 serious violations required to merge; known acceptable third-party violations (e.g., CesiumJS canvas contrast) recorded in tests/e2e/axe-exclusions.json with a justification comment — not silently suppressed. Implementation:

// tests/e2e/accessibility.spec.ts
import AxeBuilder from '@axe-core/playwright';

for (const [name, path] of [
  ['operational-overview', '/'],
  ['event-detail', '/events/seed-event'],
  ['notam-draft', '/notam/draft/seed-draft'],
  ['space-portal', '/space/objects'],
  ['settings', '/settings'],
]) {
  test(`${name} — WCAG 2.1 AA`, async ({ page }) => {
    await page.goto(path);
    const results = await new AxeBuilder({ page })
      .withTags(['wcag2a', 'wcag2aa'])
      .exclude(loadAxeExclusions()) // loads axe-exclusions.json
      .analyze();
    expect(results.violations).toEqual([]);
  });
}
6.17 Multi-ANSP Coordination Panel (Phase 2)
When an event's predicted corridor intersects FIRs belonging to more than one registered organisation, an additional panel appears on the Event Detail page. This panel provides shared situational awareness across ANSPs without replacing voice coordination.
MULTI-ANSP COORDINATION
──────────────────────────────────────────────────────────────
FIRs affected by this event:
YMMM Airservices Australia — ✓ Acknowledged 09:14 UTC J. Smith
NZZC Airways NZ — ○ Not yet acknowledged
Last activity:
09:22 UTC YMMM — "Watching brief established, coordinating with FMP"
──────────────────────────────────────────────────────────────
[Log coordination note]
Rules:
- Each ANSP sees the acknowledgement status and latest coordination note from all other ANSPs on the event; they do not see each other's internal alert state
- Coordination notes are free text, appended to alert_events (append-only, auditable), with organisation name, user name, and UTC timestamp
- The panel is read-only for organisations that have not yet acknowledged; they can acknowledge and then log notes
- Visibility is scoped: organisations only see the panel for events that intersect their registered FIRs — they do not see coordination panels for unrelated events from other orgs
This does not replace voice or direct coordination — it creates a shared digital record that both ANSPs can reference. The panel carries a permanent banner: "This coordination panel is for shared situational awareness only. It does not replace formal ATS coordination procedures or voice coordination."
Authority and precedence (Finding 5): The panel has no command authority. If two ANSPs log conflicting assessments, neither supersedes the other in SpaceCom — the system records both. The authoritative coordination outcome is always the result of direct ATS coordination outside the system. SpaceCom coordination notes are supporting evidence, not operational decisions.
WebSocket latency for coordination updates: Coordination note updates must be visible to all parties within 2 seconds of posting (p99). This is specified as a performance SLA for the coordination panel WebSocket channel (distinct from the 5-second SLA for alert events). Latency > 2 seconds means an ANSP may have acted on a stale picture during a fast-moving event.
Data retention for coordination records (ICAO Annex 11 §2.26): Coordination notes are safety records. Minimum retention: 5 years in append-only storage. The coordination_notes table (stored append-only in alert_events.coordination_notes JSONB[] or as a separate table) is included in the safety record retention category (§27.4) and excluded from standard data drop policies.
6.18 First-Time User Onboarding State (Phase 1)
When a new organisation has no configured FIRs and no active events, the globe is empty. An empty globe is indistinguishable from "the system isn't working" for first-time users. An onboarding state prevents this misinterpretation.
Trigger: Organisation has fir_list IS NULL OR fir_list = '{}' at login.
Display: Three setup cards replace the Active Events panel:
WELCOME TO SPACECOM
──────────────────────────────────────────────────────────────
To see relevant events and receive alerts, complete setup:
1. Configure your FIR watch list
Determines which re-entry events you see and which
alerts you receive. [Configure →]
2. Set alert delivery preferences
Email, WebSocket, or webhook for CRITICAL alerts.
[Configure →]
3. Optional: Enable Shadow Mode for a trial period
Run SpaceCom in parallel with existing procedures —
outputs are not for operational use until disabled.
[Configure →]
──────────────────────────────────────────────────────────────
Cards disappear permanently once step 1 (FIR list) is complete. Steps 2 and 3 remain accessible from /admin at any time. The setup cards are not a modal — they appear inline and the user can still access all navigation.
6.19 Degraded Mode UI Guidance (Phase 1)
The StalenessWarningBanner (triggered by /readyz returning 207) must include an operational guidance line keyed to the specific type of data degradation, not just a generic "data may be stale" message. Persona A's question in degraded mode is not "is the data stale?" — it is "can I use this for an operational decision right now?"
| Degradation type | Banner operational guidance |
|---|---|
| Space weather data stale > 3h | "Uncertainty estimates may be wider than shown. Treat all corridors as potentially broader than the 95th percentile boundary." |
| TLE data stale > 24h | "Object position data is more than 24 hours old. Do not use for precision airspace decisions without independent position verification." |
| Active prediction older than 6h without refresh | "This prediction reflects conditions from [timestamp]. A fresh prediction run is recommended before operational use. [Trigger refresh →]" |
| IERS EOP data stale > 7 days | "Coordinate frame transformations may have minor errors. Technical assessments only — do not use for precision airspace boundary work." |
Banner behaviour:
- The banner type is set by the backend via the /readyz response body (degradation_type enum)
- Each degradation type has its own banner message — not a generic "degraded" label
- The banner persists until the degradation is resolved; it cannot be dismissed by the user
- When multiple degradations are active, show the highest-impact degradation first, with a (+N more) expand link
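The "highest-impact first" selection is a simple priority ordering. A minimal sketch — the enum values and their relative ranking here are assumptions for illustration, not the actual degradation_type enum:

```python
# Assumed enum values and ordering — highest operational impact first.
DEGRADATION_PRIORITY = [
    "tle_stale_24h",
    "iers_eop_stale_7d",
    "space_weather_stale_3h",
    "prediction_stale_6h",
]

def select_banner(active: list[str]) -> tuple[str, int]:
    """Return the primary degradation to display and the '+N more' count."""
    ordered = [d for d in DEGRADATION_PRIORITY if d in active]
    return ordered[0], len(ordered) - 1
```

For example, with stale space weather and stale TLEs both active, the TLE banner shows with a "(+1 more)" expand link.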
6.20 Secondary Display Mode (Phase 2)
An ops room secondary monitor display mode — strips all navigation chrome and presents only the operational picture on a full-screen secondary display alongside existing ATC tools.
Activation: [Secondary Display] link in the user menu, or URL parameter ?display=secondary. Opens in a new window or full-screen.
Layout: Full-screen globe on the left (~70% width), vertical event list on the right (~30% width). No top navigation, no admin links, no simulation controls. No sidebar panels. The LIVE/SHADOW/SIMULATION mode indicator remains visible (always). CRITICAL alert banners still appear.
Design principle: This is a CSS-level change — hide navigation and chrome elements, maximise the operational data density. No new data is added; no existing data is removed.
7. Security Architecture
This section is as non-negotiable as §4. Security must be built in from Week 1, not audited at Phase 3. The primary security risk in an aviation safety system is not data exfiltration — it is data corruption that produces plausible but wrong outputs that are acted upon operationally. A false all-clear for a genuine re-entry threat is the highest-consequence attack against this system's mission.
7.1 Threat Model (STRIDE)
Key trust boundaries and their principal threats:
| Boundary | Spoofing | Tampering | Repudiation | Info Disclosure | DoS | Elevation |
|---|---|---|---|---|---|---|
| Browser → API | JWT forgery | Request injection | Unlogged mutations | Token leak via XSS | Auth endpoint flood | RBAC bypass |
| API → DB | Credential leak | SQL injection | No audit trail | Column over-fetch | N+1 queries | RLS bypass |
| Ingest → External feeds | DNS/BGP hijack → wrong TLE | Man-in-middle alters F10.7 | — | Credential interception | Feed DoS | — |
| Celery worker → DB | Compromised worker | Corrupt sim output written to DB | Unlogged task | Param leak in logs | Runaway MC task | Worker → backend pivot |
| Playwright renderer → backend | — | User content → XSS → SSRF | — | Local file read | Hang/timeout | RCE via browser exploit |
| Redis | — | Cache poisoning | — | Token interception | Queue flood | — |
Mitigations for each threat are specified in the sections below.
7.2 Role-Based Access Control (RBAC)
The roles map to the personas (plus a read-only viewer role and the Phase 2 space-domain roles). Every API endpoint enforces the minimum required role via a FastAPI dependency.
| Role | Assigned To | Permissions |
|---|---|---|
| viewer | Read-only external stakeholders | View objects, predictions, corridors; read-only globe (aviation domain) |
| analyst | Persona B | viewer + submit simulations, generate reports, access historical data, shadow validation reports |
| operator | Persona A, C | analyst + acknowledge alerts, issue advisories, draft NOTAMs, access operational tools |
| org_admin | Organisation administrator | operator + invite/remove users within their own org; assign roles up to operator within own org; view own org's audit log; manage own org's API keys; update own org's billing contact; cannot access other orgs' data; cannot assign admin or org_admin without system admin approval |
| admin | Persona D (system-wide) | Full access: user management across all orgs, ingest configuration, model version deployment, shadow mode toggle, subscription management |
| space_operator | Persona E | Object-scoped access (owned objects only via owned_objects table); decay predictions and controlled re-entry planning for own objects; conjunction alerts for own objects; API key management; CCSDS export; no access to other organisations' simulation data |
| orbital_analyst | Persona F | Full catalog read; conjunction screening across any object pair; simulation submission; bulk export (CSV, JSON, CCSDS); raw state vector and covariance access; API key management; no alert acknowledgement |
Object ownership scoping for space_operator: The owned_objects table maps operators to their registered NORAD IDs. All queries from a space_operator user are automatically scoped to their owned object list — enforced by a PostgreSQL RLS policy on the owned_objects join, not only at the application layer:
-- space_operator users see only their owned objects in catalog queries
CREATE POLICY objects_owner_scope ON objects
USING (
current_setting('app.current_role') != 'space_operator'
OR id IN (
SELECT object_id FROM owned_objects
WHERE organisation_id = current_setting('app.current_org_id')::INTEGER
)
);
Multi-tenancy: If multiple organisations use the system, every table that contains organisation-specific data (simulations, reports, alert_events, hazard_zones) must include an organisation_id column. PostgreSQL Row-Level Security (RLS) policies enforce the boundary at the database layer — not only at the application layer:
ALTER TABLE simulations ENABLE ROW LEVEL SECURITY;
CREATE POLICY simulations_org_isolation ON simulations
USING (organisation_id = current_setting('app.current_org_id')::INTEGER);
The application sets app.current_org_id at the start of every database session from the authenticated user's JWT claims.
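One way the per-session binding can look with a plain DB-API cursor (a sketch: `scope_session_to_org` is an assumed helper name, and the real code path may go through SQLAlchemy instead):

```python
def scope_session_to_org(cursor, org_id: int) -> None:
    """Bind the RLS session variable from the authenticated user's JWT claim.

    The third set_config() argument (is_local = true) scopes the value to
    the current transaction, so it cannot leak across pooled connections.
    """
    cursor.execute(
        "SELECT set_config('app.current_org_id', %s, true)",
        (str(org_id),),
    )
```

Using transaction-local scope (rather than `SET SESSION`) matters when connections are shared through PgBouncer: the setting dies with the transaction instead of following the pooled connection to the next request.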
Comprehensive RLS policy coverage (F1): The simulations example above is the template. Every table that carries organisation_id must have RLS enabled and an isolation policy applied. The full set:
| Table | RLS policy | Notes |
|---|---|---|
| simulations | organisation_id = current_org_id | |
| reentry_predictions | organisation_id = current_org_id | shadow policy layered separately |
| alert_events | organisation_id = current_org_id | append-only; no UPDATE/DELETE anyway |
| hazard_zones | organisation_id = current_org_id | |
| reports | organisation_id = current_org_id | |
| api_keys | organisation_id = current_org_id | admins bypass to revoke any key |
| usage_events | organisation_id = current_org_id | billing metering records |
| objects | organisation_id IS NULL OR organisation_id = current_org_id | NULL = catalog-wide; org-specific = owned objects only |
RLS bypass for system-level tasks: Celery workers and internal admin processes run under a dedicated database role (spacecom_worker) that bypasses RLS (BYPASSRLS). This role is never used by the API request path. Integration test (BLOCKING): establish two orgs with data; issue a query as Org A's session; assert zero Org B rows returned. This test runs in CI against a real database (not mocked).
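The shape of the BLOCKING isolation test can be sketched as below. In CI this runs against a real Postgres instance with RLS enabled; `FakeRlsSession` here only stands in for the database session fixture so the assertion structure is visible — it is not a substitute for the real-database requirement.

```python
class FakeRlsSession:
    """Stand-in for a session whose RLS policy filters rows by org id."""
    def __init__(self, org_id: int, rows: list[dict]):
        self.org_id = org_id
        self.rows = rows

    def query_simulations(self) -> list[dict]:
        # Real test: plain SELECT * FROM simulations with
        # app.current_org_id set — no WHERE clause in the application.
        return [r for r in self.rows if r["organisation_id"] == self.org_id]

def test_rls_org_isolation():
    rows = [
        {"id": 1, "organisation_id": 1},  # Org A
        {"id": 2, "organisation_id": 2},  # Org B
    ]
    session_a = FakeRlsSession(org_id=1, rows=rows)
    visible = session_a.query_simulations()
    # BLOCKING assertion: zero Org B rows in Org A's session
    assert all(r["organisation_id"] == 1 for r in visible)
    assert len(visible) == 1
```

The key property being asserted: isolation must hold with no application-side WHERE clause at all, because RLS — not the query author — is the enforcement boundary.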
Shadow mode segregation — database-layer enforcement (Finding 9):
Shadow predictions must be excluded from operational API responses at the RLS layer, not only via application WHERE clauses. A backend query bug or misconfigured join must not expose shadow records to viewer/operator sessions — that would be a regulatory incident.
ALTER TABLE reentry_predictions ENABLE ROW LEVEL SECURITY;
-- Non-admin sessions never see shadow records unless the session flag is set
CREATE POLICY shadow_segregation ON reentry_predictions
USING (
shadow_mode = FALSE
OR current_setting('spacecom.include_shadow', TRUE) = 'true'
);
The spacecom.include_shadow session variable is set to 'true' only by the backend's shadow-admin code path, which requires admin role and explicit shadow-mode context. Regular backend sessions never set this variable. Integration test: query reentry_predictions as viewer role with no WHERE shadow_mode clause; verify zero shadow rows returned.
Four-eyes principle for admin role elevation (Finding 6):
A single compromised admin account must not be able to silently elevate a backdoor account. Elevation to admin requires a second admin to approve within 30 minutes.
CREATE TABLE pending_role_changes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
target_user_id INTEGER NOT NULL REFERENCES users(id),
requested_role TEXT NOT NULL,
requested_by INTEGER NOT NULL REFERENCES users(id),
approval_token_hash TEXT NOT NULL, -- SHA-256 of emailed token
expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '30 minutes',
approved_by INTEGER REFERENCES users(id),
approved_at TIMESTAMPTZ,
rejected_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Workflow:
- PATCH /admin/users/{id}/role with role=admin creates a pending_role_changes row and triggers an email to all other active admins containing a single-use approval token
- POST /admin/role-changes/{change_id}/approve?token=<token> — any other admin can approve; completing the role change is atomic
- Rows past expires_at are auto-rejected by a nightly job and logged as ROLE_CHANGE_EXPIRED
- All outcomes (ROLE_CHANGE_APPROVED, ROLE_CHANGE_REJECTED, ROLE_CHANGE_EXPIRED) are logged to security_logs as HIGH severity
- The requesting admin cannot approve their own pending change (enforced by an approved_by != requested_by constraint)
RBAC enforcement pattern (FastAPI):
def require_role(*roles: str):
    def dependency(current_user: User = Depends(get_current_user)):
        if current_user.role not in roles:
            log_auth_failure(current_user, roles)
            raise HTTPException(status_code=403, detail="Insufficient permissions")
        return current_user
    return dependency

# Applied per router group — not per individual endpoint where it is easy to miss
router = APIRouter(dependencies=[Depends(require_role("operator", "admin"))])
7.3 Authentication
JWT Implementation
- Algorithm: RS256 (asymmetric). Never HS256 with a shared secret. Never none.
- Key storage: RSA private signing key stored in Docker secrets / secrets manager (see §7.5). Never in an environment variable or .env file.
- Token storage in browser: httpOnly, Secure, SameSite=Strict cookies only. Never localStorage (vulnerable to XSS). Never query parameters (they appear in server logs).
- Access token lifetime: 15 minutes.
- Refresh token lifetime: 24 hours for operator/analyst; 8 hours for admin.
- Refresh token rotation with family reuse detection (Finding 5): Invalidate the old token on every refresh. Tokens belong to a family_id (UUID assigned at first issuance). If a token from a superseded generation within a family is presented — i.e. it was already rotated and a newer token in the same family exists — the entire family is immediately revoked, logged as REFRESH_TOKEN_REUSE (HIGH severity), and an email alert is sent to the user ("Suspicious login detected — all sessions revoked"). This detects refresh token theft: the legitimate user retries after the attacker consumed the token first, causing the reuse to surface. The refresh_tokens table includes family_id UUID NOT NULL and superseded_at TIMESTAMPTZ (set when a new token replaces this one in rotation).
- Refresh token storage: refresh_tokens table in the database (see §9.2). This enables server-side revocation — Redis-only storage loses revocations on restart.
Multi-Factor Authentication (MFA)
TOTP-based MFA (RFC 6238) is required for all roles from Phase 1. Implementation:
- On first login after account creation, the user is presented with a TOTP QR code (via pyotp) and required to verify before completing registration
- Recovery codes (8 × 10-character alphanumeric) generated at setup; stored as bcrypt hashes in users.mfa_recovery_codes
- MFA is enforced at the JWT issuance step — tokens are not issued until MFA is verified
- Failed MFA attempts after 5 consecutive failures trigger a 30-minute account lockout and a MEDIUM alert
SSO / Identity Provider Abstraction
"Integrate with SkyNav SSO later" cannot remain a deferred decision. The auth layer must be designed as a pluggable provider from the start:
class AuthProvider(Protocol):
    async def authenticate(self, credentials: Credentials) -> User: ...
    async def issue_tokens(self, user: User) -> TokenPair: ...
    async def revoke(self, refresh_token: str) -> None: ...

class LocalJWTProvider(AuthProvider): ...  # Phase 1: local JWT + TOTP
class OIDCProvider(AuthProvider): ...      # Phase 3: OIDC/SAML SSO
All endpoint logic depends on AuthProvider — switching from local JWT to OIDC requires no endpoint changes.
7.4 API Security
Rate Limiting
Implemented with slowapi (Redis token bucket). Limits are per-user for authenticated endpoints, per-IP for auth endpoints:
| Endpoint | Limit | Window |
|---|---|---|
| POST /token (login) | 10 per IP | 1 minute; exponential backoff after 5 failures |
| POST /token/refresh | 30 per user | 1 hour |
| POST /decay/predict | 10 per user | 1 hour |
| POST /conjunctions/screen | 5 per user | 1 hour |
| POST /reports | 20 per user | 1 day |
| WS /ws/events connection attempts | 10 per user | 1 minute |
| General authenticated read endpoints | 300 per user | 1 minute |
| General unauthenticated (if any) | 60 per IP | 1 minute |
Rate limit headers returned on every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
Simulation Parameter Validation
All physical parameters must be validated against their physically meaningful ranges before a simulation job is accepted. Type validation alone is insufficient — NRLMSISE-00 will silently produce garbage for out-of-range inputs without raising an error:
class DecayPredictParams(BaseModel):
    f107: float = Field(..., ge=65.0, le=300.0,
                        description="F10.7 solar flux (sfu). Physically valid: 65–300.")
    ap: float = Field(..., ge=0.0, le=400.0,
                      description="Geomagnetic Ap index. Valid: 0–400.")
    mc_samples: int = Field(..., ge=10,
                            description="Monte Carlo sample count. Server cap: 1000 regardless of input.")
    bstar_uncertainty_pct: float = Field(..., ge=0.0, le=50.0)

    @validator('mc_samples')
    def cap_mc_samples(cls, v):
        # No le= bound on the Field: an upper bound would reject values
        # above 1000 with a 422 instead of silently capping them at the
        # server limit, as specified.
        return min(v, 1000)
Server-Side Request Forgery (SSRF) Mitigation
The Ingest module fetches from five external sources. These URLs must be:
- Hardcoded constants in ingest/sources.py — never loaded from user input, API parameters, or database values
- Fetched via an HTTP client configured with an allowlist of expected IP ranges per source; connections to private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, ::1, fc00::/7) are blocked at the HTTP client layer
ALLOWED_HOSTS = {
    "www.space-track.org": ["18.0.0.0/8"],  # approximate; update with actual ranges
    "celestrak.org": [...],
    "swpc.noaa.gov": [...],
    "discosweb.esoc.esa.int": [...],
    "maia.usno.navy.mil": [...],
}
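The private-range block at the HTTP client layer reduces to a check against the ranges listed above. A minimal sketch using the standard library — `is_blocked_address` is an assumed helper name, and IPv4 loopback is included here as an assumption (it is not in the list above but is a standard SSRF target):

```python
import ipaddress

# Ranges from the bullet list above, plus 127.0.0.0/8 (assumed blocked too).
_BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "127.0.0.0/8", "::1/128", "fc00::/7",
)]

def is_blocked_address(addr: str) -> bool:
    """True if a resolved IP falls inside a blocked private/link-local range.

    Must be applied to the address the client actually connects to (after
    DNS resolution), otherwise a DNS-rebinding host bypasses the check.
    """
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in _BLOCKED_NETS if net.version == ip.version)
```

The "check the resolved address, not the hostname" caveat in the docstring is the essential part: hostname allowlisting alone does not stop a DNS answer pointing at 169.254.169.254.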
XSS and CZML Injection
Object names and descriptions sourced from Space-Track are interpolated into CZML documents and ultimately rendered in CesiumJS. A malicious object name containing <script> or CesiumJS-specific injection must be sanitised:
- HTML-encode all string fields from external sources before inserting into CZML
- CesiumJS evaluates CZML description fields as HTML in info boxes — treat them as untrusted HTML; run DOMPurify on the client before passing them to CesiumJS description properties
NOTAM Draft Content Sanitisation (Finding 10)
NOTAM drafts are templated from prediction data, object names, and operator-supplied fields. Object names originate from Space-Track and from manual POST /objects input. ICAO plain-text format is vulnerable to special-character injection and, if the draft is ever rendered to PDF by the Playwright renderer, to XSS.
import re

_ICAO_SAFE = re.compile(r"[^A-Z0-9\-_ /]")

def sanitise_icao(value: str, field_name: str = "field") -> str:
    """
    Strip characters outside the ICAO plain-text safe set before NOTAM
    template interpolation.

    Args:
        value: Raw string from user input or external source.
        field_name: Field identifier for logging if value is modified.

    Returns:
        Sanitised string safe for ICAO plain-text insertion.
    """
    upper = value.upper()
    sanitised = _ICAO_SAFE.sub("", upper)
    if sanitised != upper:
        logger.info("sanitise_icao: modified %s field", field_name)
    return sanitised or "[REDACTED]"
Rules:
- sanitise_icao() is called on every user-sourced field before interpolation into NOTAM_TEMPLATE.format(...)
- TLE remarks fields are stripped entirely from NOTAM output (not an ICAO-relevant field)
- The NOTAM template uses str.format() with named arguments, not f-strings with raw variables
- sanitise_icao is listed in AGENTS.md as a security-critical function — any change requires a dedicated security review
7.5 Secrets Management
"All secrets via environment variables" is a development-only posture.
Development: .env file. Never committed. .gitignore must include .env, .env.*.
Production: Docker secrets (Compose secrets: stanza) for Phase 1 production deployment; HashiCorp Vault or cloud-provider secrets manager (AWS Secrets Manager, GCP Secret Manager) for Phase 3.
Secrets rotation schedule:
| Secret | Rotation Frequency | Method |
|---|---|---|
| JWT RS256 private key | 90 days | Key ID in JWT header; both old and new keys valid during 24h rotation window |
| Space-Track.org credentials | 90 days | Space-Track account supports credential rotation; coordinated with ops team |
| Database password | 90 days | Dual-credential rotation (see procedure below); zero-downtime |
| Redis ACL passwords (backend, worker, ingest) | 90 days | Update ACL password via redis-cli ACL SETUSER; restart dependent services with new env var; old password invalid immediately |
| MinIO access key | 90 days | MinIO admin API |
| Cesium ion access token | NOT A SECRET | Public browser credential — shipped in NEXT_PUBLIC_CESIUM_ION_TOKEN. Read via Ion.defaultAccessToken = process.env.NEXT_PUBLIC_CESIUM_ION_TOKEN. Do not proxy through the backend. Do not store in Docker secrets or Vault. Rotate only if the token is explicitly revoked on cesium.com. |
Database password rotation procedure — a hard PgBouncer restart drops idle connections cleanly but kills active transactions. Use the drain-then-swap sequence instead:
1. Update the Postgres role (new password valid immediately; old password still in the PgBouncer config): ALTER ROLE spacecom_app PASSWORD 'new_secret';
2. Drain PgBouncer — issue PAUSE pgbouncer;. New connections queue; existing transactions complete. Timeout: 30 s (if not drained, proceed and accept brief 503s).
3. Update the PgBouncer config with the new password, then RESUME pgbouncer;. Application connections resume using the new password.
4. Verify ingest/API within 5 minutes — /admin/ingest-status and GET /readyz must return 200.
5. Retire the old password after a 15-minute grace period (the ALTER ROLE in step 1 already replaced it, so no further command is needed; old session tokens expired during the drain).
6. Rotate Patroni replication credentials separately — patronictl reload with an updated postgresql.parameters.hba_file; does not affect application connections.
Full runbook: docs/runbooks/db-password-rotation.md.
Anti-patterns — enforced by git-secrets pre-commit hook and CI scan:
- No secrets in requirements.txt, docker-compose.yml, Dockerfile, source files, or logs
7.6 Transport Security
External-facing:
- HTTPS only. HTTP → HTTPS 301 redirect.
- Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
- TLS 1.2 minimum; TLS 1.3 preferred. Disable TLS 1.0, 1.1, SSLv3.
- Cipher suite: Mozilla "Intermediate" configuration or better.
- WebSocket connections: wss:// only. The ws.ts client enforces this.
Internal service communication:
- Backend → DB: PostgreSQL TLS with client certificate verification
- Backend → Redis: Redis 7 TLS mode (tls-port, tls-cert-file, tls-key-file, tls-ca-cert-file)
- Backend → MinIO: HTTPS (MinIO production mode requires TLS)
- Backend → Renderer: HTTPS on internal Docker network; renderer does not accept connections from any other service
Certificate management:
- Production: Let's Encrypt via Caddy (auto-renewal, OCSP stapling)
- Certificate expiry monitored: alert 30 days before expiry via cert-manager or a custom Celery task
7.7 Content Security Policy and Security Headers
SpaceCom uses two distinct CSP tiers because CesiumJS requires 'unsafe-eval' (GLSL shader compilation) — a directive that would be unacceptable on non-globe routes.
Tier 1 — Non-globe routes (login, settings, admin, API responses):
Content-Security-Policy:
default-src 'self';
script-src 'self';
style-src 'self' 'unsafe-inline';
img-src 'self' data: blob:;
connect-src 'self' wss://[domain];
worker-src blob:;
frame-ancestors 'none';
base-uri 'self';
form-action 'self';
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), camera=(), microphone=()
Tier 2 — Globe routes (app/(globe)/ — all routes under the (globe) layout group only):
Content-Security-Policy:
default-src 'self';
script-src 'self' 'unsafe-eval' https://cesium.com;
style-src 'self' 'unsafe-inline';
img-src 'self' data: blob: https://*.cesium.com https://*.openstreetmap.org;
connect-src 'self' wss://[domain] https://cesium.com https://api.cesium.com;
worker-src blob:;
frame-ancestors 'none';
base-uri 'self';
form-action 'self';
Implementation in next.config.ts:
// next.config.ts — route-scoped CSP headers (path-to-regexp source patterns)
const headers = async () => [
  {
    source: '/((?!dashboard|monitor).*)', // non-globe routes
    headers: [{ key: 'Content-Security-Policy', value: CSP_STANDARD }],
  },
  {
    source: '/(dashboard|monitor)(.*)', // globe routes — unsafe-eval allowed
    headers: [{ key: 'Content-Security-Policy', value: CSP_GLOBE }],
  },
];
'unsafe-eval' is required by CesiumJS for runtime GLSL shader compilation. Scope it only to globe routes. This is a known, documented exception — it must never appear in the standard-tier CSP.
'unsafe-inline' for style-src is also required by CesiumJS and appears in both tiers. It must not be used for script-src in the standard tier.
Renderer page CSP (the headless Playwright context, which must be the most restrictive):
Content-Security-Policy:
default-src 'self';
script-src 'self';
style-src 'self';
img-src 'self' data: blob:;
connect-src 'none';
frame-ancestors 'none';
7.8 WebSocket Security
WS /ws/events authentication:
- JWT token must be verified at connection establishment (HTTP Upgrade request)
- Browser WebSocket APIs cannot send custom headers — use the
httpOnlyauth cookie (set by the login flow) which is automatically sent with the Upgrade request; verify it in the WebSocket handshake handler - Do not accept tokens via query parameters (
?token=...) — they appear in server access logs
Connection management:
- Per-user concurrent connection limit: 5. Enforced in the upgrade handler by checking a Redis counter.
- Server-side ping every 30 seconds; close connections that do not respond within 60 seconds
- All incoming WebSocket messages (if bidirectional) validated against a JSON schema before processing
7.9 Data Integrity
This is the most important security property of the system. Predictions that drive aviation safety decisions must be trustworthy and tamper-evident.
HMAC Signing of Predictions
Every row written to reentry_predictions and hazard_zones is signed at creation time with an application-secret HMAC:
import hmac, hashlib, json
def sign_prediction(prediction: dict, secret: bytes) -> str:
payload = json.dumps({
"id": prediction["id"],
"object_id": prediction["object_id"],
"p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
"model_version": prediction["model_version"],
"f107_assumed": prediction["f107_assumed"],
}, sort_keys=True)
return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
HMAC signing race fix (F4 — §67): If reentry_predictions.id is a DB-assigned BIGSERIAL, the application must INSERT first (to get the id), then compute the HMAC using that id, then UPDATE the row — a two-phase write. Between the INSERT and the UPDATE there is a brief window where a valid prediction row exists with an empty record_hmac, which the nightly HMAC verification job (§10.2) would flag as a violation.
Fix: Use UUID as the primary key (DEFAULT gen_random_uuid()) and assign the UUID in the application before the INSERT. The application pre-generates the UUID, computes the HMAC against the full prediction dict including that UUID, then inserts the complete row in a single write:
import uuid
def write_prediction_to_db(prediction: dict):
prediction_id = str(uuid.uuid4())
prediction['id'] = prediction_id
prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
# Single INSERT — no two-phase write; no race window
db.execute(text("""
INSERT INTO reentry_predictions (id, object_id, ..., record_hmac)
VALUES (:id, :object_id, ..., :record_hmac)
"""), prediction)
Migration: convert the primary key to UUID — ALTER TABLE reentry_predictions ALTER COLUMN id TYPE UUID USING gen_random_uuid(); ALTER TABLE reentry_predictions ALTER COLUMN id SET DEFAULT gen_random_uuid();. Note that USING gen_random_uuid() assigns fresh identifiers, so the FK references (alert_events.prediction_id, prediction_outcomes.prediction_id) must be remapped to the new UUIDs within the same transaction. Include in the next schema migration (alembic revision --autogenerate).
The HMAC is stored in a record_hmac column. Before serving any prediction to a client, the backend verifies the HMAC. A failed verification:
- Is logged as a security event (CRITICAL alert to admins)
- Results in the prediction being marked integrity_failed = TRUE
- The prediction is not served; the API returns a 503 with a message directing the user to contact the system administrator
- The Event Detail page displays ✗ HMAC verification failed and a warning banner
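The serving-side check is the mirror of sign_prediction: recompute over the same canonical payload and compare in constant time. A sketch, assuming the timestamp is already serialised to an ISO-8601 string:

```python
import hashlib
import hmac
import json

def verify_prediction(prediction: dict, secret: bytes) -> bool:
    """Recompute the HMAC over the signed fields and compare in constant time.

    Mirrors sign_prediction; here p50_reentry_time is assumed to already be an
    ISO-8601 string (as stored/serialised), rather than a datetime.
    """
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"],
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, prediction["record_hmac"])
```

hmac.compare_digest prevents timing side-channels from leaking how many leading hex characters of the stored HMAC match.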
Prediction Immutability
Once written, prediction records must not be modified:
CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER reentry_predictions_immutable
BEFORE UPDATE OR DELETE ON reentry_predictions
FOR EACH ROW EXECUTE FUNCTION prevent_prediction_modification();
Apply the same trigger to hazard_zones.
HMAC Key Rotation Procedure (Finding 1)
The immutability trigger blocks all UPDATEs on reentry_predictions, including legitimate HMAC re-signing during key rotation. The rotation path must be explicit and auditable:
Schema additions to reentry_predictions:
ALTER TABLE reentry_predictions
ADD COLUMN rotated_at TIMESTAMPTZ,
ADD COLUMN rotated_by INTEGER REFERENCES users(id);
Parameterised immutability trigger — allows UPDATE only on record_hmac when the session flag is set by the privileged hmac_admin role:
CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
  -- Allow HMAC-only rotation when the flag is set by the hmac_admin role
  IF TG_OP = 'UPDATE'
     AND current_setting('spacecom.hmac_rotation', TRUE) = 'true'
     AND NEW.record_hmac IS DISTINCT FROM OLD.record_hmac
     -- every column except the rotation bookkeeping fields must be unchanged
     AND (to_jsonb(NEW) - 'record_hmac' - 'rotated_at' - 'rotated_by')
       = (to_jsonb(OLD) - 'record_hmac' - 'rotated_at' - 'rotated_by')
  THEN
    RETURN NEW;
  END IF;
  RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
hmac_admin database role: A dedicated hmac_admin Postgres role is the only role permitted to SET LOCAL spacecom.hmac_rotation = true. The backend application role does not have this privilege. The rotation script connects as hmac_admin, sets the flag per-transaction, re-signs each row, and commits. Every changed row is logged to security_logs as event type HMAC_ROTATION.
Dual sign-off: The rotation script must be run with two operators present. The runbook requires that both operators record their user IDs in the rotated_by column (use the initiating operator) and that the second operator independently verifies a random sample of re-signed HMACs match the new key before the script is considered complete.
The HMAC rotation runbook lives at docs/runbooks/hmac-key-rotation.md and cross-references the zero-downtime JWT keypair rotation runbook for the dual-key validity window.
Append-Only alert_events
CREATE OR REPLACE FUNCTION prevent_alert_modification()
RETURNS TRIGGER AS $$
BEGIN
RAISE EXCEPTION 'alert_events is append-only';
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER alert_events_immutable
BEFORE UPDATE OR DELETE ON alert_events
FOR EACH ROW EXECUTE FUNCTION prevent_alert_modification();
Cross-Source Validation
Do not silently trust a single data source:
- TLE cross-validation: When the same NORAD ID is received from both Space-Track and CelesTrak within a 6-hour window, compare the key orbital elements. If they differ by more than a defined threshold (e.g., semi-major axis > 1 km, inclination > 0.01°), flag for human review rather than silently using one.
- All-clear double check: A prediction record showing no hazard for an object that has an active TIP message triggers an integrity alert. A single-source all-clear cannot override a TIP message.
- Space weather cross-validation: Ingest F10.7 from both NOAA SWPC and ESA Space Weather Service. If they disagree by > 20%, alert and use the more conservative (higher) value until the discrepancy resolves.
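The TLE comparison rule above reduces to a small predicate. A sketch — the field names are illustrative, not the actual schema:

```python
def tle_elements_agree(a: dict, b: dict,
                       sma_threshold_km: float = 1.0,
                       inc_threshold_deg: float = 0.01) -> bool:
    """Compare key orbital elements from two sources for the same NORAD ID.

    Returns False — meaning 'flag for human review' — when either element
    differs by more than the documented threshold. Field names are
    illustrative assumptions.
    """
    if abs(a["semi_major_axis_km"] - b["semi_major_axis_km"]) > sma_threshold_km:
        return False
    if abs(a["inclination_deg"] - b["inclination_deg"]) > inc_threshold_deg:
        return False
    return True
```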
IERS EOP Integrity
The weekly IERS Bulletin A download must be verified before application:
IERS_BULLETIN_A_SHA256 = {
# Updated manually each quarter; verified against IERS publications
"finals2000A.all": "expected_hash_here",
}
# If hash fails, the existing EOP table is retained; a MEDIUM alert is generated
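The verification step itself, sketched as a pure function over the downloaded bytes:

```python
import hashlib

def verify_eop_download(data: bytes, expected_sha256: str) -> bool:
    """Hash the downloaded finals2000A.all bytes against the pinned value.

    On mismatch the caller retains the existing EOP table and raises a
    MEDIUM alert rather than applying the unverified file.
    """
    return hashlib.sha256(data).hexdigest() == expected_sha256
```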
alert_events HMAC integrity (F9): alert_events records are safety-critical audit evidence (UN Liability Convention, ICAO). They carry the same HMAC protection as reentry_predictions:
def sign_alert_event(event: dict, secret: bytes) -> str:
payload = json.dumps({
"id": event["id"],
"object_id": event["object_id"],
"organisation_id": event["organisation_id"],
"level": event["level"],
"trigger_type": event["trigger_type"],
"created_at": event["created_at"].isoformat(),
"acknowledged_by": event["acknowledged_by"],
"action_taken": event.get("action_taken"),
}, sort_keys=True)
return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
Nightly integrity check (Celery Beat, 02:00 UTC):
@celery.task
def verify_alert_event_hmac():
    """Re-verify HMAC on all alert_events created in the past 24 hours."""
    since = utcnow() - timedelta(hours=24)
    rows = db.execute(
        text("SELECT id FROM alert_events WHERE created_at >= :since"),
        {"since": since}
    ).fetchall()
    for row in rows:
        event = db.get(AlertEvent, row.id)
        expected = sign_alert_event(event.__dict__, HMAC_SECRET)
        if not hmac.compare_digest(expected, event.record_hmac):
            log_security_event("ALERT_EVENT_HMAC_FAILURE", {"event_id": row.id})
            alert_admin_critical(f"alert_events HMAC integrity failure: id={row.id}")
Database timezone enforcement (F2): PostgreSQL TIMESTAMPTZ stores internally in UTC, but ORM connections can silently apply server or session timezone offsets. All timestamps must remain UTC end-to-end:
# database.py — connection pool creation
from sqlalchemy import event, text
@event.listens_for(engine.sync_engine, "connect")
def set_timezone(dbapi_conn, connection_record):
cursor = dbapi_conn.cursor()
cursor.execute("SET TIME ZONE 'UTC'")
cursor.close()
Integration test (tests/test_db_timezone.py — BLOCKING):
def test_timestamps_round_trip_as_utc(db_session):
"""Ensure ORM never silently converts UTC timestamps to local time."""
known_utc = datetime(2026, 3, 22, 14, 0, 0, tzinfo=timezone.utc)
obj = ReentryPrediction(p50_reentry_time=known_utc, ...)
db_session.add(obj)
db_session.flush()
db_session.refresh(obj)
assert obj.p50_reentry_time == known_utc
assert obj.p50_reentry_time.tzinfo == timezone.utc
Any non-UTC representation of a timestamp is a display-layer concern only — never stored or transmitted as local time.
7.10 Infrastructure Security
Container Hardening
Applied to all service Dockerfiles and Compose definitions:
# Applied to all services
security_opt:
- no-new-privileges:true
read_only: true
tmpfs:
- /tmp:size=256m,mode=1777
user: "1000:1000" # non-root; created in Dockerfile as: RUN useradd -r -u 1000 appuser
cap_drop:
- ALL
cap_add: [] # No capabilities added; NET_BIND_SERVICE not needed if ports > 1024
Renderer container — most restrictive:
renderer:
security_opt:
- no-new-privileges:true
- seccomp:renderer-seccomp.json # Custom seccomp profile for Chromium
network_mode: none # Overridden by renderer_net which allows only internal backend API
read_only: true
tmpfs:
- /tmp:size=512m # Playwright needs /tmp
- /home/appuser:size=256m # Chromium profile directory
cap_drop:
- ALL
cap_add:
- SYS_ADMIN # Required by Chromium sandbox; document this explicitly
SYS_ADMIN for Chromium is a known requirement. Mitigate by ensuring the renderer container has no network access to anything other than the backend internal API, and by setting a strict seccomp profile.
Redis Authentication and ACLs
# redis.conf (production)
# requirepass is deliberately unset — authentication is ACL-only
aclfile /etc/redis/users.acl
# users.acl
user backend on >[backend_password] ~* &* +@all -@dangerous
user worker on >[worker_password] ~celery:* &celery:* +RPUSH +LPOP +LLEN +SUBSCRIBE +PUBLISH +XADD +XREAD
# Disable the default user
user default off
MinIO Bucket Policies
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::*"
}]
}
All buckets are private. Report downloads use 5-minute pre-signed URLs (reduced from 15 minutes — user downloads immediately). Pre-signed URL generation is logged to security_logs (event type PRESIGNED_URL_GENERATED) with user_id, object_key, expires_at, and client_ip — this creates an audit trail of who obtained access to which object.
MC blob access — server-side proxy (Finding 2): Simulation trajectory blobs (MC samples) must not be served as direct pre-signed MinIO URLs to the browser. Instead, the visualiser calls GET /viz/mc-trajectories/{simulation_id} which the backend fetches from MinIO server-side and streams to the authenticated client. This keeps MinIO URLs entirely off the client and prevents URL sharing or exfiltration. The backend enforces the requesting user's organisation matches the simulation's organisation_id before proxying.
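The proxy's core — organisation check, then chunked streaming — can be sketched independently of the web framework. The function name and the 64 KiB chunk size are illustrative choices:

```python
from typing import BinaryIO, Iterator

CHUNK_SIZE = 64 * 1024  # stream in 64 KiB chunks; illustrative choice

def stream_mc_blob(source: BinaryIO, user_org_id: int,
                   simulation_org_id: int) -> Iterator[bytes]:
    """Server-side proxy core for MC trajectory blobs.

    Enforces the organisation boundary, then streams the object body without
    ever exposing a MinIO URL to the client. `source` stands in for the MinIO
    GET response body.
    """
    if user_org_id != simulation_org_id:
        raise PermissionError("simulation belongs to a different organisation")
    while chunk := source.read(CHUNK_SIZE):
        yield chunk
```

The endpoint wraps this generator in a streaming response, so large trajectory sets never need to be buffered in backend memory.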
7.11 Playwright Renderer Security
The renderer is the highest attack-surface component. It runs a real browser on the server.
Isolation: The renderer service runs in its own container on renderer_net. It accepts HTTPS connections only from the backend's internal IP. It makes no outbound connections beyond backend:8000 (enforced by network segmentation + Playwright request interception — see below).
Data flow: The renderer receives only a report_id (integer) from the backend job queue. It constructs the report URL internally as http://backend:8000/reports/{report_id}/preview — user-supplied values are never interpolated into the URL. The report_id is validated as a positive integer before use. The renderer has no access to the database, Redis, or MinIO directly.
Playwright request interception (Finding 4) — allowlist, not blocklist:
async def setup_request_interception(page: Page) -> None:
"""Block any Playwright navigation to hosts other than the backend."""
async def handle_route(route: Route) -> None:
url = route.request.url
if not url.startswith("http://backend:8000/"):
await route.abort("blockedbyclient")
else:
await route.continue_()
await page.route("**/*", handle_route)
This is a defence-in-depth layer: even if a bug causes the renderer to receive a crafted URL, the interception handler prevents navigation to any external or internal host outside backend:8000.
Input sanitisation before reaching the renderer:
import bleach
ALLOWED_TAGS = [] # No HTML allowed in user-supplied report fields
ALLOWED_ATTRS = {}
def sanitise_report_field(value: str) -> str:
"""Strip all HTML from user-supplied strings before renderer interpolation."""
return bleach.clean(value, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
Report template: The renderer loads a report template from the local filesystem (bundled in the container image). It does not fetch templates from URLs or the database. User-supplied content is inserted via a strict templating engine (Jinja2 with autoescape=True).
Timeouts: Report generation has a hard 30-second timeout. Playwright's page.goto() timeout set to 10 seconds. If the timeout is exceeded, the job fails with a clear error — the renderer does not hang indefinitely.
No dangerouslySetInnerHTML: The report React template must never use dangerouslySetInnerHTML. All text insertion via {value} (React's built-in escaping).
7.12 Compute Resource Governance
| Limit | Value | Enforcement |
|---|---|---|
| mc_samples maximum | 1000 | Pydantic validator at API layer; also re-validated inside the Celery task body (Finding 3) |
| Concurrent simulations per user | 3 | Checked against simulations table before job acceptance; returns 429 if exceeded |
| Pending jobs per user | 10 | Checked at submission time |
| Decay prediction CPU time limit | 300 s | Celery time_limit=300, soft_time_limit=270 |
| Breakup simulation CPU time limit | 600 s | Celery time_limit=600, soft_time_limit=570 |
| Ephemeris response points maximum | 100,000 | Enforced by calculating (end - start) / step; returns 400 if exceeded with a message to reduce range or increase step |
| CZML document size | 50 MB | Streaming response with max size enforced; client must paginate for larger ranges |
| WebSocket connections per user | 5 | Redis counter checked at upgrade time |
| Simulation workers | Separate Celery worker pool from ingest workers | Prevents runaway simulations from starving TLE/space-weather ingestion |
Celery task-layer validation (Finding 3): Celery tasks are callable directly via Redis write (e.g., by a compromised worker), bypassing the API layer entirely. Every task function must validate its own arguments independently of the API endpoint:
from functools import wraps
def validate_task_args(validator_class):
"""Decorator: re-validate task kwargs using the same Pydantic model as the API endpoint."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
try:
validator_class(**kwargs)
except ValidationError as exc:
raise ValueError(f"Task arg validation failed: {exc}") from exc
return func(*args, **kwargs)
return wrapper
return decorator
@app.task(bind=True)
@validate_task_args(DecayPredictParams)
def run_mc_decay_prediction(self, *, norad_id: int, f107: float, ap: float, mc_samples: int, ...):
...
ValueError raised inside a Celery task is treated as a non-retryable failure — the task goes to the dead-letter queue and does not silently drop. This applies to all simulation and prediction tasks. Document in AGENTS.md: "Task functions are a security boundary. Validate all task arguments inside the task body."
Orphaned job recovery (Celery Beat task): A Celery worker killed mid-execution (OOM, pod eviction, container restart) leaves its job in status = 'running' indefinitely unless a cleanup task intervenes. Add a Celery Beat periodic task that runs every 5 minutes:
@app.task
def recover_orphaned_jobs():
"""Mark jobs stuck in 'running' beyond 2× their estimated duration as failed."""
orphans = (
db.query(Job)
.filter(
Job.status == "running",
Job.started_at < func.now() - (
func.coalesce(Job.estimated_duration_seconds, 600) * 2
) * text("interval '1 second'"),
)
.all()
)
for job in orphans:
job.status = "failed"
job.error_code = "PRESUMED_DEAD"
job.error_message = "Worker did not complete within 2× estimated duration"
job.completed_at = datetime.utcnow()
db.commit()
Integration test (tests/test_jobs/test_celery_failure.py): set a job to status='running' with started_at = NOW() - 1200s and estimated_duration_seconds = 300; run the Beat task; assert status = 'failed' and error_code = 'PRESUMED_DEAD'.
7.13 Supply Chain and Dependency Security
Python dependency pinning:
All dependencies pinned with exact versions and hashes using pip-tools:
# requirements.in → pip-compile → requirements.txt with hashes
fastapi==0.111.0 --hash=sha256:...
Install with pip install --require-hashes -r requirements.txt in all Docker builds.
Node.js: package-lock.json committed and npm ci used in Docker builds (not npm install).
Base images: All FROM statements use pinned digest tags:
FROM python:3.12.3-slim@sha256:abc123...
Never FROM python:3.12-slim (floating tag).
PyPI index trust policy — dependency confusion protection:
All Python packages must be fetched from a controlled index, not directly from public PyPI without restrictions. Configure pip.conf mounted into all build containers:
# pip.conf (mounted at /etc/pip.conf in builder stage)
[global]
index-url = https://pypi.internal.spacecom.io/simple/
# Proxy mode: passes through to PyPI but logs and scans before serving
# extra-index-url is intentionally absent — no fallback to raw public PyPI
For Phase 1 (no internal proxy available): register all spacecom-* package names on public PyPI as empty stubs to prevent dependency confusion squatting. Document in docs/adr/0019-pypi-index-trust.md.
Automated scanning (CI pipeline):
| Tool | Target | Trigger | Notes |
|---|---|---|---|
| pip-audit | Python dependencies | Every PR; blocks on High/Critical | Queries the PyPA Advisory Database; lower false-positive rate than OWASP DC for Python |
| npm audit | Node.js dependencies | Every PR; blocks on High/Critical | --audit-level=high; run after npm ci |
| Trivy | Container images | Every PR; blocks on Critical/High | .trivyignore applied (see below); JSON output archived |
| Bandit | Python source code | Every PR; blocks on High severity | |
| ESLint security plugin | TypeScript source | Every PR | |
| pip-licenses | Python transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted by name with documented commercial licence |
| license-checker-rseidelsohn | npm transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted; other AGPL packages require approval |
| Renovate Bot | Docker image digests + all deps | Weekly PRs; digest PRs auto-merged if CI passes | Replaces Dependabot for Docker digest pins; Dependabot retained for GitHub Security Advisory integration |
| git-secrets + detect-secrets | All commits | Pre-commit; blocks commit on secret patterns | detect-secrets is canonical (entropy + regex); git-secrets retained for pattern matching |
| cosign verify | Container images at deploy | Every staging/production deploy | Verifies Sigstore keyless signature before pulling |
OWASP Dependency-Check is removed from the Python scanning stack — it has high false-positive rates due to CPE name mapping issues for Python packages and is superseded by pip-audit. It may be retained for future Java/Kotlin components.
Trivy configuration — .trivyignore:
# .trivyignore
# Each entry requires: CVE ID, expiry date (90-day max), and documented justification.
# Process: PR required with senior engineer approval. Expired entries fail CI.
# Format: CVE-YYYY-NNNNN expires:YYYY-MM-DD reason:<one-line justification>
#
# Example (do not add without process):
# CVE-2024-12345 expires:2024-12-31 reason:builder-stage only; not present in runtime image
CI check rejects entries past their expiry date:
python scripts/check_trivyignore_expiry.py .trivyignore || \
(echo "ERROR: .trivyignore contains expired entry — review or remove" && exit 1)
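A sketch of the expiry parser behind scripts/check_trivyignore_expiry.py, assuming the documented entry format:

```python
import re
from datetime import date

# Format: CVE-YYYY-NNNNN expires:YYYY-MM-DD reason:<one-line justification>
ENTRY_RE = re.compile(r"^(CVE-\d{4}-\d+)\s+expires:(\d{4}-\d{2}-\d{2})\s+reason:\S.*$")

def expired_entries(lines: list[str], today: date) -> list[str]:
    """Return CVE IDs whose expiry date has passed; blank and comment lines
    are skipped. A stricter version would also fail CI on malformed entries —
    that check is omitted here for brevity.
    """
    expired = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = ENTRY_RE.match(line)
        if m and date.fromisoformat(m.group(2)) < today:
            expired.append(m.group(1))
    return expired
```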
License scanning CI steps:
# security-scan job
- name: Python licence gate
run: |
pip install pip-licenses
pip-licenses --format=json --output-file=python-licences.json
# Fail on GPL/AGPL (CesiumJS has commercial licence; excluded by name in npm step)
pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3)"
- name: npm licence gate
working-directory: frontend
run: |
npx license-checker-rseidelsohn --json --out npm-licences.json
# cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
npx license-checker-rseidelsohn \
--excludePackages "cesium" \
--failOn "GPL;AGPL"
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
with:
name: licences-${{ github.sha }}
path: "*.json"
retention-days: 365
Base image digest updates — Renovate configuration:
Dependabot does not update @sha256: digest pins in Dockerfiles. Renovate's docker-digest manager handles this:
// renovate.json
{
"extends": ["config:base"],
"packageRules": [
{
"matchDatasources": ["docker"],
"matchUpdateTypes": ["digest"],
"automerge": true,
"automergeType": "pr",
"schedule": ["every weekend"],
"commitMessageSuffix": "(base image digest update)"
},
{
"matchDatasources": ["pypi"],
"automerge": false
}
],
"github-actions": {
"enabled": true,
"pinDigests": true
}
}
Digest-only updates auto-merge on passing CI. Version bumps (e.g., python:3.12 → python:3.13) require manual PR review. Renovate is added alongside Dependabot; Dependabot retains GitHub Security Advisory integration for Python/Node CVE PRs.
7.14 Audit and Security Logging
Security event categories (stored in security_logs table and shipped to SIEM):
| Event | Level | Retention |
|---|---|---|
| Successful login | INFO | 90 days |
| Failed login (IP + user) | WARNING | 180 days |
| MFA failure | WARNING | 180 days |
| Account lockout | HIGH | 180 days |
| Token refresh | INFO | 30 days |
| Authorisation failure (403) | WARNING | 180 days |
| Admin action (user create/delete/role change) | HIGH | 1 year |
| Prediction HMAC failure | CRITICAL | 2 years |
| Alert storm detection | CRITICAL | 2 years |
| IERS EOP hash mismatch | HIGH | 1 year |
| Report generated | INFO | 1 year |
| Ingest source error | WARNING | 90 days |
Security event human-alerting matrix (Finding 7): A Grafana dashboard no one is watching provides no protection during an active attack. The following events must trigger an immediate out-of-band alert to a human (PagerDuty, email, or Slack) — not only log to the database:
| Event type | Severity | Alert channel | Response SLA |
|---|---|---|---|
| HMAC_VERIFICATION_FAILURE | CRITICAL | PagerDuty + admin email | Immediate |
| REFRESH_TOKEN_REUSE | HIGH | Email to affected user + admin email | < 5 min |
| ROLE_CHANGE_APPROVED / ROLE_CHANGE_EXPIRED | HIGH | Admin email summary | < 15 min |
| REGISTRATION_BLOCKED_SANCTIONS | HIGH | Admin email | < 15 min |
| RBAC_VIOLATION ≥ 10 events in 5 min (same user_id) | HIGH | PagerDuty | Immediate |
| INGEST_VALIDATION_FAILURE ≥ 5 events in 1 hour (same source) | MEDIUM | Admin email | < 1 hour |
| Space-Track ingest gap > 4 hours | CRITICAL | PagerDuty (cross-ref §31) | Immediate |
| Any level = CRITICAL security event | CRITICAL | PagerDuty + SIEM | Immediate |
Implemented as AlertManager rules (Prometheus security_event_total counter with event_type label) and/or direct webhook dispatch from the security_logs insert trigger. Rules defined in monitoring/alertmanager/security-rules.yml.
Space-Track credential rotation — ingest gap specification (Finding 8): Space-Track supports only one active credential set; rotation is a hard cut with no parallel-credential window. The rotation runbook at docs/runbooks/space-track-credential-rotation.md must include: (a) record last successful ingest time before starting; (b) update Docker secret and restart ingest_worker; (c) verify ingest succeeds within 10 minutes of restart (GET /admin/ingest-status shows last_success_at for Space-Track source); (d) if ingest does not resume within 10 minutes, roll back to previous credentials and raise a CRITICAL alert. The existing 4-hour ingest failure CRITICAL alert (§31) is the backstop — this runbook step reduces mean time to detect to 10 minutes.
Structured log format — all services emit JSON via structlog. Every log record must include these fields:
# backend/app/logging_config.py
REQUIRED_LOG_FIELDS = {
"timestamp": "ISO-8601 UTC",
"level": "DEBUG|INFO|WARNING|ERROR|CRITICAL",
"service": "backend|worker|ingest|renderer",
"logger": "module.path",
"message": "human-readable summary",
"request_id": "UUID | null — set for HTTP requests; propagated into Celery tasks",
"job_id": "UUID | null — Celery job_id when inside a task",
"user_id": "integer | null",
"organisation_id": "integer | null",
"duration_ms": "integer | null — HTTP response time",
"status_code": "integer | null — HTTP responses only",
}
The sanitising formatter wraps the structlog JSON processor (strips JWT substrings, Space-Track passwords, database DSNs before the record is written). Docker log driver: json-file with max-size=100m, max-file=5 for Tier 1; forwarded to Loki via Promtail in Tier 2+.
Log sanitisation: The structlog sanitising processor runs as the final processor in the chain before emission, stripping known sensitive patterns (JWT token substrings, Space-Track password patterns, database DSN with credentials).
Log integrity: Logs are shipped in real-time to an external destination (Loki in Tier 2; S3/MinIO append-only bucket or SIEM for long-term safety record retention). Logs stored only on the container filesystem are considered volatile and untrusted for security purposes.
Request ID correlation middleware — every HTTP request generates a request_id that propagates through logs, Celery tasks, and Prometheus exemplars so an on-call engineer can jump from a metric spike to the causative log line with one click:
# backend/app/middleware.py
import uuid
import structlog
from starlette.middleware.base import BaseHTTPMiddleware
class RequestIDMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request, call_next):
request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
structlog.contextvars.bind_contextvars(request_id=request_id)
response = await call_next(request)
response.headers["X-Request-ID"] = request_id
structlog.contextvars.clear_contextvars()
return response
When submitting a Celery task, include request_id in task kwargs and bind it in the task preamble:
structlog.contextvars.bind_contextvars(request_id=kwargs.get("request_id"), job_id=str(self.request.id))
This links every log line from the HTTP layer through to the Celery task execution. The request_id equals the OpenTelemetry trace_id when OTel is enabled (Phase 2), giving a single correlation key across logs and traces.
security_logs table:
CREATE TABLE security_logs (
id BIGSERIAL PRIMARY KEY,
logged_at TIMESTAMPTZ DEFAULT NOW(),
level TEXT NOT NULL,
event_type TEXT NOT NULL,
user_id INTEGER,
organisation_id INTEGER,
source_ip INET,
user_agent TEXT,
resource TEXT,
detail JSONB,
-- Prevent tampering
record_hash TEXT -- SHA-256 of (logged_at || level || event_type || detail)
);
-- Append-only trigger (same pattern as alert_events)
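The record_hash column's computation, as described in the DDL comment, can be sketched as follows — detail is canonicalised with sorted keys so the same record always hashes to the same value:

```python
import hashlib
import json

def security_log_record_hash(logged_at_iso: str, level: str,
                             event_type: str, detail: dict) -> str:
    """SHA-256 of (logged_at || level || event_type || detail), matching the
    record_hash column comment. The concatenation scheme here is one plausible
    reading of that comment, not a fixed wire format.
    """
    payload = logged_at_iso + level + event_type + json.dumps(detail, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```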
7.15 Security SDLC — Embedded, Not Bolted On
Security activities are integrated into every sprint from Week 1, not deferred to a Phase 3 audit.
Week 1 (mandatory before any other code):
- RBAC schema implemented; require_role dependency applied to all router groups
- JWT RS256 + httpOnly cookies implemented; HS256 never used
- MFA (TOTP) implemented and required for all roles
- CSP and security headers applied to frontend and backend
- Docker network segmentation and container hardening applied to all services
- Redis AUTH and ACL configured
- MinIO: all buckets private; pre-signed URLs only
- Dependency pinning (pip-compile) and Dependabot configured
- git-secrets pre-commit hook installed in repo
- Bandit and ESLint security plugin in CI; blocks merge on High severity
- Trivy container scanning in CI; blocks merge on Critical/High
- security_logs table and log sanitisation formatter implemented
- Append-only DB triggers on alert_events
Phase 1 (ongoing):
- HMAC signing implemented for reentry_predictions before decay predictor ships (Week 9)
- Immutability triggers on reentry_predictions and hazard_zones
- Cross-source TLE and space weather validation implemented with ingest module (Week 3–6)
- IERS EOP hash verification implemented (Week 1)
- Rate limiting (slowapi) configured for all endpoint groups (Week 2)
- Simulation parameter range validation (Week 9, with decay predictor)
Phase 2:
- OWASP ZAP DAST scan run against staging environment in the Phase 2 CI pipeline
- Threat model document (STRIDE) reviewed and updated for Phase 2 attack surface
- Playwright renderer: isolated container, sanitised input, timeouts, seccomp profile, Playwright request interception allowlist (Week 19–20, when reports ship)
- NOTAM draft content sanitisation: sanitise_icao() function in reentry/notam.py applied to all user-sourced fields before NOTAM template interpolation; unit test: an object name containing "><script>alert(1)</script> produces a sanitised NOTAM draft and does not raise (Week 17–18, with NOTAM drafting feature)
- Shadow mode RLS integration test: query reentry_predictions as viewer role with no WHERE clause; assert zero shadow rows returned
- Refresh token family reuse detection integration test: simulate an attacker consuming a rotated token; assert entire family revoked + REFRESH_TOKEN_REUSE logged
- RLS policies reviewed and integration-tested for multi-tenancy boundary
Phase 3:
- External penetration test by a qualified third party — scope must include: API auth bypass, privilege escalation, SSRF via ingest, XSS → Playwright escalation, WebSocket auth bypass, data integrity attacks on predictions, Redis/MinIO lateral movement
- All Critical and High penetration test findings remediated before production go-live
- SOC 2 Type I readiness review (if required by customer contracts)
- Acceptance Test Procedure (ATP) defined and run (Finding 10): docs/bid/acceptance-test-procedure.md exists with a test script structured as: test ID, requirement reference, preconditions, steps, expected result, pass/fail criteria. ATP is runnable by a non-SpaceCom operator (evaluator) using documented environment setup. ATP covers: physics accuracy (§17 validation), NOTAM format (Q-line regex test), alert delivery latency (synthetic TIP → measure delivery time), HMAC integrity (tampered record → 503), multi-tenancy boundary (Org A cannot access Org B data). ATP seed data committed at docs/bid/atp-seed-data/. ATP successfully run by an independent evaluator on the staging environment before any institutional procurement submission.
- Competitive differentiation review completed: docs/competitive-analysis.md updated; any competitor capability that closed a differentiation gap has been assessed and a product response documented
- Security runbook: incident response procedure for each CRITICAL threat scenario
7.16 Aviation Safety Integrity — Operational Scenarios
Scenario 1 — False all-clear attack:
An attacker who modifies reentry_predictions records to suppress a genuine hazard corridor could cause an airspace manager to conclude a FIR is safe when it is not.
Mitigations layered in depth:
- HMAC signing on every prediction record (§7.9) — modification is immediately detected
- Immutability DB trigger (§7.9) — modifications fail at the database layer
- TIP message cross-check: a prediction showing no hazard for an object with an active TIP message triggers a CRITICAL integrity alert regardless of the prediction's content
- The UI displays HMAC status on every prediction — `✗ verification failed` is immediately visible to the operator
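The HMAC layer above can be sketched as follows. This is a minimal illustration, not the production implementation: the key would live in a KMS, and the field names and `SECRET` value here are placeholders. It shows the essential properties — a deterministic canonical serialisation of the signed field subset, and constant-time comparison on verification.

```python
import hashlib
import hmac
import json

SECRET = b"replace-with-kms-managed-key"  # illustrative; production key lives in a KMS

def canonical_payload(record: dict, fields: tuple[str, ...]) -> bytes:
    """Serialise the safety-critical field subset deterministically."""
    subset = {f: record[f] for f in sorted(fields)}
    return json.dumps(subset, sort_keys=True, separators=(",", ":")).encode()

def sign_record(record: dict, fields: tuple[str, ...]) -> str:
    return hmac.new(SECRET, canonical_payload(record, fields), hashlib.sha256).hexdigest()

def verify_record(record: dict, fields: tuple[str, ...], stored_hmac: str) -> bool:
    # compare_digest avoids timing side channels on verification
    return hmac.compare_digest(sign_record(record, fields), stored_hmac)

prediction = {"id": 1, "p50_reentry_time": "2025-03-01T12:00:00Z", "norad_id": 25544}
fields = ("id", "p50_reentry_time", "norad_id")
tag = sign_record(prediction, fields)
assert verify_record(prediction, fields, tag)
prediction["p50_reentry_time"] = "2025-03-02T12:00:00Z"  # tamper with the record
assert not verify_record(prediction, fields, tag)        # modification is detected
```

Any modification of a signed field changes the canonical payload, so the stored HMAC no longer verifies — which is what drives the UI status indicator.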
Scenario 2 — Alert storm attack:
An attacker flooding the alert system with false CRITICALs induces alert fatigue; operators disable alerts; a genuine event is missed.
Mitigations:
- Alert generation runs only from backend business logic on verified, HMAC-checked data — not from direct API calls
- Rate limiting on CRITICAL alert generation per object per window (§6.6)
- Alert storm detection: > 5 CRITICALs in 1 hour triggers a meta-alert to admins
- Geographic filtering means alert volume per operator is naturally bounded to their region
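The storm-detection rule (> 5 CRITICALs in 1 hour) reduces to a sliding-window count. A minimal sketch, assuming timestamps arrive in order (class and method names are illustrative, not from the plan):

```python
from collections import deque
from datetime import datetime, timedelta

STORM_THRESHOLD = 5                # more than 5 CRITICALs...
STORM_WINDOW = timedelta(hours=1)  # ...within this window triggers a meta-alert

class AlertStormDetector:
    def __init__(self) -> None:
        self._criticals: deque = deque()  # timestamps of recent CRITICALs

    def record_critical(self, at: datetime) -> bool:
        """Record one CRITICAL; return True if the window now exceeds the threshold."""
        self._criticals.append(at)
        # evict everything older than the window
        while self._criticals and at - self._criticals[0] > STORM_WINDOW:
            self._criticals.popleft()
        return len(self._criticals) > STORM_THRESHOLD

base = datetime(2025, 1, 1, 12, 0)
detector = AlertStormDetector()
flags = [detector.record_critical(base + timedelta(minutes=i)) for i in range(6)]
# the sixth CRITICAL inside the hour tips the threshold
assert flags == [False, False, False, False, False, True]
```

A production version would key the window per object and per organisation, matching the per-object rate limiting in §6.6.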
8. Functional Modules
Each module is a Python package under backend/modules/ with its own router, schemas, service layer, and (where applicable) Celery tasks. Modules communicate via internal function calls and the shared database — not HTTP between modules.
Phase 1 Modules
| Module | Package | Purpose |
|---|---|---|
| Catalog | `modules.catalog` | CRUD for space objects: NORAD ID, TLE sets, physical properties (from ESA DISCOS), B* drag term, radar cross-section. Source of truth for all tracked objects. |
| Catalog Propagator | `modules.propagator.catalog` | SGP4/SDP4 for general catalog tracking. Outputs GCRF state vectors and geodetic coordinates. Feeds the globe display. Not used for decay prediction. |
| Decay Predictor | `modules.propagator.decay` | Numerical integrator (RK7(8) adaptive step) with NRLMSISE-00 atmospheric density model, J2–J6 geopotential, and solar radiation pressure. Used for all re-entry window estimation. Monte Carlo uncertainty (vary F10.7 ±20%, Ap, B* ±10%). All outputs HMAC-signed on creation. Shadow mode flag propagated to all output records. |
| Reentry | `modules.reentry` | Phase 1 scope: re-entry window prediction (time ± uncertainty) and ground track corridor (percentile swaths). Phase 2 expands to full breakup/survivability. |
| Space Weather | `modules.spaceweather` | Ingests NOAA SWPC: F10.7, Ap/Kp, Dst, solar wind. Cross-validates against ESA Space Weather Service. Generates `operational_status` string. Drives Decay Predictor density models. |
| Visualisation | `modules.viz` | Generates CZML documents from ephemeris (J2000 Cartesian — explicit TEME→J2000 conversion), hazard zones, and debris corridors. Pre-bakes MC trajectory binary blobs for Mode C. All object name/description fields HTML-escaped before CZML output. |
| Ingest | `modules.ingest` | Background workers: Space-Track.org TLE polling, CelesTrak TLE polling, TIP message ingestion, ESA DISCOS physical property import, NOAA SWPC space weather polling, IERS EOP refresh. All external URLs are hardcoded constants; SSRF mitigation enforced at HTTP client layer. |
| Public API | `modules.api` | Versioned REST API (`/api/v1/`) as a first-class product for programmatic access by Persona E/F. Includes API key management (generation, rotation, revocation, usage tracking), CCSDS-format export endpoints, bulk ephemeris endpoints, and rate limiting per API key. API keys are separate credentials from the web session JWT and managed independently. |
Phase 2 Modules
| Module | Package | Purpose |
|---|---|---|
| Atmospheric Breakup | `modules.breakup` | ORSAT-like atmospheric re-entry breakup: aerothermal loading → structural failure → fragment generation → ballistic descent → ground impact with kinetic energy and casualty area. Produces fragment descriptors and uncertainty bounds for the sub-/trans-sonic descent layer. |
| Conjunction | `modules.conjunction` | All-vs-all conjunction screening: apogee/perigee filter → TCA refinement → collision probability (Alfano/Foster). Feeds `conjunctions` table. |
| Upper Atmosphere | `modules.weather.upper` | NRLMSISE-00 / JB2008 density model driven by space weather inputs. 80–600 km profiles for Decay Predictor and Atmospheric Breakup. |
| Lower Atmosphere | `modules.weather.lower` | GFS/ECMWF tropospheric wind and density profiles for 0–80 km terminal descent, including wind-sensitive dispersion inputs for fragment clouds after main breakup. |
| Hazard | `modules.hazard` | Fuses Decay Predictor + Atmospheric Breakup + atmosphere modules into hazard zones with uncertainty bounds. All output records HMAC-signed and immutable. Shadow mode flag preserved on all hazard zone records. |
| Airspace | `modules.airspace` | FIR/UIR boundaries, controlled airspace, routes. PostGIS hazard-airspace intersection. |
| Air Risk | `modules.air_risk` | Combines hazard outputs with air traffic density / ADS-B state, aircraft class assumptions, and vulnerability bands to generate time-sliced exposure scores and operator-facing air-risk products. Supports conservative-baseline comparison against blunt closure areas. |
| On-Orbit Fragmentation | `modules.fragmentation` | NASA Standard Breakup Model for on-orbit collision/explosion fragmentation. Separate from atmospheric breakup — different physics. |
| Space Operator Portal | `modules.space_portal` | The second front door. Owned object management (`owned_objects` table); object-scoped prediction views; CCSDS export; API key portal; controlled re-entry planner interface. Enforces `space_operator` RBAC object-ownership scoping. |
| Controlled Re-entry Planner | `modules.reentry.controlled` | For objects with remaining manoeuvre capability: given a delta-V budget and avoidance constraints (FIR exclusions, land avoidance, population density weighting), generates ranked candidate deorbit windows with corridor risk scores. Outputs suitable for national space law regulatory submissions and ESA Zero Debris Charter evidence. |
| NOTAM Drafting | `modules.notam` | Generates ICAO Annex 15 format NOTAM drafts from hazard corridor outputs. Produces cancellation drafts on event close. Stores all drafts in `notam_drafts` table. Displays mandatory regulatory disclaimer. Never submits NOTAMs — draft production only. |
Phase 3 Modules
| Module | Package | Purpose |
|---|---|---|
| Reroute | `modules.reroute` | Strategic pre-flight route intersection analysis only. Given a filed route, identifies which segments intersect the hazard corridor and outputs the geographic avoidance boundary. Does not generate specific alternate routes — avoidance boundary only, to keep SpaceCom in a purely informational role. |
| Feedback | `modules.feedback` | Prediction vs. observed outcome comparison. Atmospheric density scaling recalibration from historical re-entries. Manoeuvre detection (TLE-to-TLE ΔV estimation). Shadow validation reporting for ANSP regulatory adoption evidence. |
| Alerts | `modules.alerts` | WebSocket push + email notifications. Enforces alert rate limits and deduplication server-side. Stores all events in append-only `alert_events`. Shadow mode: all alerts suppressed to INFORMATIONAL; no external delivery. |
| Launch Safety | `modules.launch_safety` | Screens proposed launch trajectories against the live catalog for conjunction risk during ascent and parking orbit phases. Natural extension of the conjunction module. Serves launch operators as a third customer segment. |
9. Data Model Evolution
9.1 Retain and Expand from Existing Schema
objects table
ALTER TABLE objects ADD COLUMN IF NOT EXISTS
bstar DOUBLE PRECISION, -- SGP4 drag parameter (1/Earth-radii)
cd_a_over_m DOUBLE PRECISION, -- C_D * A / m (m²/kg); physical model
rcs_m2 DOUBLE PRECISION, -- Radar cross-section from Space-Track
rcs_size_class TEXT, -- SMALL | MEDIUM | LARGE
mass_kg DOUBLE PRECISION,
cross_section_m2 DOUBLE PRECISION,
material TEXT,
shape TEXT,
data_confidence TEXT DEFAULT 'unknown', -- 'discos' | 'estimated' | 'unknown'
object_type TEXT, -- PAYLOAD | ROCKET BODY | DEBRIS | UNKNOWN
launch_date DATE,
launch_site TEXT,
decay_date DATE,
organisation_id INTEGER REFERENCES organisations(id), -- multi-tenancy
-- Physics model parameters (Finding 3, 5, 7)
attitude_known BOOLEAN DEFAULT FALSE, -- FALSE = tumbling; affects A uncertainty sampling
material_class TEXT, -- 'aluminium'|'stainless_steel'|'titanium'|'carbon_composite'|'unknown'
cd_override DOUBLE PRECISION, -- operator-provided C_D override (space_operator only)
bstar_override DOUBLE PRECISION, -- operator-provided B* override (space_operator only)
cr_coefficient DOUBLE PRECISION DEFAULT 1.3 -- radiation pressure coefficient; 1.3 = standard non-cooperative
orbits table — full state vectors
ALTER TABLE orbits ADD COLUMN IF NOT EXISTS
reference_frame TEXT DEFAULT 'GCRF',
pos_x_km DOUBLE PRECISION,
pos_y_km DOUBLE PRECISION,
pos_z_km DOUBLE PRECISION,
vel_x_kms DOUBLE PRECISION,
vel_y_kms DOUBLE PRECISION,
vel_z_kms DOUBLE PRECISION,
lat_deg DOUBLE PRECISION,
lon_deg DOUBLE PRECISION,
alt_km DOUBLE PRECISION,
speed_kms DOUBLE PRECISION,
-- RTN position covariance (upper triangle of 3×3)
cov_rr DOUBLE PRECISION,
cov_rt DOUBLE PRECISION,
cov_rn DOUBLE PRECISION,
cov_tt DOUBLE PRECISION,
cov_tn DOUBLE PRECISION,
cov_nn DOUBLE PRECISION,
propagator TEXT DEFAULT 'sgp4',
tle_epoch TIMESTAMPTZ
conjunctions table
ALTER TABLE conjunctions ADD COLUMN IF NOT EXISTS
collision_probability DOUBLE PRECISION,
probability_method TEXT,
combined_radial_sigma_m DOUBLE PRECISION,
combined_transverse_sigma_m DOUBLE PRECISION,
combined_normal_sigma_m DOUBLE PRECISION
reentry_predictions table
ALTER TABLE reentry_predictions ADD COLUMN IF NOT EXISTS
confidence_level DOUBLE PRECISION,
propagator TEXT,
f107_assumed DOUBLE PRECISION,
ap_assumed DOUBLE PRECISION,
monte_carlo_n INTEGER,
ground_track_corridor GEOGRAPHY(POLYGON), -- GEOGRAPHY: global corridors may cross antimeridian
reentry_window_open TIMESTAMPTZ,
reentry_window_close TIMESTAMPTZ,
nominal_reentry_point GEOGRAPHY(POINT), -- GEOGRAPHY: global point
nominal_reentry_alt_km DOUBLE PRECISION DEFAULT 80.0,
p01_reentry_time TIMESTAMPTZ, -- 1st percentile — extreme early case; displayed as tail risk annotation (F10)
p05_reentry_time TIMESTAMPTZ,
p50_reentry_time TIMESTAMPTZ,
p95_reentry_time TIMESTAMPTZ,
p99_reentry_time TIMESTAMPTZ, -- 99th percentile — extreme late case; displayed as tail risk annotation (F10)
sigma_along_track_km DOUBLE PRECISION,
sigma_cross_track_km DOUBLE PRECISION,
organisation_id INTEGER REFERENCES organisations(id),
record_hmac TEXT NOT NULL, -- HMAC-SHA256 of canonical field set
integrity_failed BOOLEAN DEFAULT FALSE,
superseded_by BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- write-once; RESTRICT prevents deleting a prediction that supersedes another (F10 — §67)
ood_flag BOOLEAN DEFAULT FALSE, -- TRUE if any input parameter falls outside the model's validated operating envelope
ood_reason TEXT, -- comma-separated list of which parameters triggered OOD (e.g. "high_am_ratio,low_data_confidence")
prediction_valid_until TIMESTAMPTZ, -- computed at creation: p50_reentry_time - 4h; UI warns if NOW() > this and prediction is not superseded
model_version TEXT NOT NULL, -- semantic version of decay predictor used; must match current deployed version or trigger re-run prompt
-- Multi-source conflict detection (Finding 10)
prediction_conflict BOOLEAN DEFAULT FALSE, -- TRUE if SpaceCom window does not overlap TIP or ESA window
conflict_sources TEXT[], -- e.g. ['space_track_tip', 'esa_esac']
conflict_union_p10 TIMESTAMPTZ, -- union of all non-overlapping windows: earliest bound
conflict_union_p90 TIMESTAMPTZ -- union of all non-overlapping windows: latest bound
superseded_by is write-once after creation: it can be set once by an analyst or above, but never changed once set. A DB constraint enforces this (trigger that raises if superseded_by is being changed from a non-NULL value). The UI displays a ⚠ Superseded — see [newer run] banner on any prediction where superseded_by IS NOT NULL. This preserves the immutability guarantee (old records are never deleted) while giving analysts a mechanism to communicate "this is not the current operational view."
The same superseded_by pattern applies to the simulations table (self-referential FK).
Immutability trigger (see §7.9) applied to this table in the initial migration.
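The multi-source conflict columns above reduce to a window-overlap test: `prediction_conflict` is set when the SpaceCom window shares no instant with a TIP or ESA window, and the union bounds feed `conflict_union_p10`/`conflict_union_p90`. A minimal sketch under those semantics (datetimes only; the example windows are illustrative):

```python
from datetime import datetime

def windows_overlap(a_open, a_close, b_open, b_close) -> bool:
    """True if [a_open, a_close] and [b_open, b_close] share any instant."""
    return a_open <= b_close and b_open <= a_close

def conflict_union(windows):
    """Union bounds across all source windows: earliest open, latest close."""
    return min(w[0] for w in windows), max(w[1] for w in windows)

spacecom = (datetime(2025, 3, 1, 10), datetime(2025, 3, 1, 14))
tip      = (datetime(2025, 3, 1, 15), datetime(2025, 3, 1, 18))

prediction_conflict = not windows_overlap(*spacecom, *tip)  # disjoint windows -> conflict
union_p10, union_p90 = conflict_union([spacecom, tip])      # 10:00 .. 18:00
```

Surfacing the union rather than either single window keeps the operator-facing display conservative when sources disagree.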
9.2 New Tables
-- Organisations (for multi-tenancy)
CREATE TABLE organisations (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Commercial tier (Finding 3, 5)
subscription_tier TEXT NOT NULL DEFAULT 'shadow_trial'
CHECK (subscription_tier IN ('shadow_trial','ansp_operational','space_operator','institutional','internal')),
subscription_status TEXT NOT NULL DEFAULT 'active'
CHECK (subscription_status IN ('active','offered','offered_lapsed','churned','suspended')),
subscription_started_at TIMESTAMPTZ,
subscription_expires_at TIMESTAMPTZ,
-- Shadow trial gate (F3 - §68): expiry normally auto-deactivates shadow mode, but enforcement is deferred while an active TIP / CRITICAL operational event exists
shadow_trial_expires_at TIMESTAMPTZ, -- NULL = no trial expiry (paid or internal); set on sandbox agreement signing
-- Resource quotas (F8 — §68): 0 = unlimited (paid tiers); >0 = monthly cap
monthly_mc_run_quota INTEGER NOT NULL DEFAULT 100 -- 100 for free/shadow_trial; 0 = unlimited for paid; deferred during active TIP/CRITICAL event
CHECK (monthly_mc_run_quota >= 0),
-- Feature flags (F11 — §68): Enterprise-only features gated here
feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE, -- Enterprise only
-- On-premise licence (F6 — §68)
licence_key TEXT, -- JWT signed by SpaceCom; checked at startup for on-premise deployments
licence_expires_at TIMESTAMPTZ, -- derived from licence_key; stored for query efficiency
-- Data residency (Finding 8)
hosting_jurisdiction TEXT NOT NULL DEFAULT 'eu'
CHECK (hosting_jurisdiction IN ('eu','uk','au','us','on_premise')),
data_residency_confirmed BOOLEAN DEFAULT FALSE -- DPA clause confirmed for this org
);
-- Users
CREATE TABLE users (
id SERIAL PRIMARY KEY,
organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
email TEXT NOT NULL UNIQUE,
password_hash TEXT NOT NULL, -- bcrypt, cost factor >= 12
role TEXT NOT NULL DEFAULT 'viewer'
CHECK (role IN ('viewer','analyst','operator','org_admin','admin','space_operator','orbital_analyst')),
mfa_secret TEXT, -- TOTP secret (encrypted at rest)
mfa_recovery_codes TEXT[], -- bcrypt hashes of recovery codes
mfa_enabled BOOLEAN DEFAULT FALSE,
failed_mfa_attempts INTEGER DEFAULT 0,
locked_until TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_login_at TIMESTAMPTZ,
tos_accepted_at TIMESTAMPTZ, -- NULL = ToS not yet accepted; access blocked until set
tos_version TEXT, -- semver of ToS accepted (e.g. "1.2.0")
tos_accepted_ip INET, -- IP address at time of acceptance (GDPR consent evidence)
data_source_acknowledgement BOOLEAN DEFAULT FALSE, -- must be TRUE before API key access
altitude_unit_preference TEXT NOT NULL DEFAULT 'ft'
CHECK (altitude_unit_preference IN ('m', 'ft', 'km'))
-- 'ft' default for ansp_operator; 'km' default for space_operator (set at account creation based on role)
);
-- Refresh tokens (server-side revocation)
CREATE TABLE refresh_tokens (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
token_hash TEXT NOT NULL UNIQUE, -- SHA-256 of the raw token
family_id UUID NOT NULL, -- All tokens from the same initial issuance share a family_id
issued_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL,
revoked_at TIMESTAMPTZ, -- NULL = valid
superseded_at TIMESTAMPTZ, -- Set when this token is rotated out (newer token in family exists)
replaced_by UUID REFERENCES refresh_tokens(id), -- for rotation chain audit
source_ip INET,
user_agent TEXT
);
CREATE INDEX ON refresh_tokens (user_id, revoked_at);
CREATE INDEX ON refresh_tokens (family_id); -- for family revocation on reuse detection
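The family-revocation semantics that `family_id` and `superseded_at` support can be sketched with an in-memory stand-in for this table. This is illustrative only — `RefreshTokenStore` and its method names are not part of the plan — but it shows the reuse-detection rule: presenting an already-rotated token revokes the entire family, because the legitimate client and an attacker cannot both hold the current token.

```python
import hashlib
import secrets
import uuid

class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (sketch only)."""

    def __init__(self):
        self.by_hash = {}  # token_hash -> row

    def issue(self, family_id=None):
        raw = secrets.token_urlsafe(32)
        token_hash = hashlib.sha256(raw.encode()).hexdigest()  # store hash, never the raw token
        fam = family_id or str(uuid.uuid4())
        self.by_hash[token_hash] = {"family_id": fam, "superseded": False, "revoked": False}
        return raw, fam

    def rotate(self, raw):
        row = self.by_hash.get(hashlib.sha256(raw.encode()).hexdigest())
        if row is None or row["revoked"]:
            raise PermissionError("invalid token")
        if row["superseded"]:
            # Reuse of a rotated token: someone else holds a copy -> revoke the family
            for r in self.by_hash.values():
                if r["family_id"] == row["family_id"]:
                    r["revoked"] = True
            raise PermissionError("REFRESH_TOKEN_REUSE: family revoked")
        row["superseded"] = True
        return self.issue(row["family_id"])
```

In the real schema the same logic runs against `token_hash`, `family_id`, `superseded_at`, and `revoked_at`, with the reuse event written to `security_logs`.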
-- Security event log (append-only)
CREATE TABLE security_logs (
id BIGSERIAL PRIMARY KEY,
logged_at TIMESTAMPTZ DEFAULT NOW(),
level TEXT NOT NULL,
event_type TEXT NOT NULL,
user_id INTEGER,
organisation_id INTEGER,
source_ip INET,
user_agent TEXT,
resource TEXT,
detail JSONB,
record_hash TEXT -- SHA-256(logged_at || event_type || detail) for tamper detection
);
CREATE TRIGGER security_logs_immutable
BEFORE UPDATE OR DELETE ON security_logs
FOR EACH ROW EXECUTE FUNCTION prevent_modification();
-- TLE history (hypertable)
-- No surrogate PK: TimescaleDB requires any UNIQUE/PK constraint to include the partition column.
-- Natural unique key is (object_id, ingested_at). Reference TLE records by this composite key.
CREATE TABLE tle_sets (
object_id INTEGER REFERENCES objects(id),
epoch TIMESTAMPTZ NOT NULL,
line1 TEXT NOT NULL,
line2 TEXT NOT NULL,
source TEXT NOT NULL,
ingested_at TIMESTAMPTZ DEFAULT NOW(),
inclination_deg DOUBLE PRECISION,
raan_deg DOUBLE PRECISION,
eccentricity DOUBLE PRECISION,
arg_perigee_deg DOUBLE PRECISION,
mean_anomaly_deg DOUBLE PRECISION,
mean_motion_rev_per_day DOUBLE PRECISION,
bstar DOUBLE PRECISION,
apogee_km DOUBLE PRECISION,
perigee_km DOUBLE PRECISION,
cross_validated BOOLEAN DEFAULT FALSE, -- TRUE if confirmed by second source
cross_validation_delta_sma_km DOUBLE PRECISION, -- SMA difference between sources
UNIQUE (object_id, ingested_at) -- natural key; safe for TimescaleDB (includes partition col)
);
SELECT create_hypertable('tle_sets', 'ingested_at');
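The `cross_validation_delta_sma_km` column implies deriving a semi-major axis from each source's mean motion and comparing. A minimal sketch using the standard relation a = (μ/n²)^(1/3) (the sample mean-motion values are illustrative):

```python
import math

MU_EARTH_KM3_S2 = 398600.4418  # Earth's gravitational parameter, km^3/s^2

def sma_from_mean_motion(mean_motion_rev_per_day: float) -> float:
    """Semi-major axis (km) from TLE mean motion via a = (mu / n^2)^(1/3)."""
    n_rad_s = mean_motion_rev_per_day * 2.0 * math.pi / 86400.0  # rev/day -> rad/s
    return (MU_EARTH_KM3_S2 / n_rad_s**2) ** (1.0 / 3.0)

# ISS-like orbit: ~15.5 rev/day -> a ≈ 6795 km
delta_sma_km = abs(sma_from_mean_motion(15.50) - sma_from_mean_motion(15.49))
```

A TLE pair from two sources whose derived SMAs differ by more than a small tolerance would leave `cross_validated` FALSE and record the delta for review.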
-- Space weather (hypertable)
CREATE TABLE space_weather (
time TIMESTAMPTZ NOT NULL,
f107_obs DOUBLE PRECISION, -- observed F10.7 (current day)
f107_prior_day DOUBLE PRECISION, -- prior-day F10.7 (NRLMSISE-00 f107 input)
f107_81day_avg DOUBLE PRECISION, -- 81-day centred average (NRLMSISE-00 f107A input)
ap_daily INTEGER, -- daily Ap index (linear; NOT Kp)
ap_3h_history DOUBLE PRECISION[19], -- 3-hourly Ap values for prior 57h (NRLMSISE-00 full mode)
kp_3hourly DOUBLE PRECISION[], -- 3-hourly Kp (for storm detection; Kp > 5 triggers storm flag)
dst_index INTEGER,
uncertainty_multiplier DOUBLE PRECISION,
operational_status TEXT,
source TEXT DEFAULT 'noaa_swpc',
secondary_source TEXT, -- ESA SWS cross-validation value
cross_validation_delta_f107 DOUBLE PRECISION -- difference between sources
);
SELECT create_hypertable('space_weather', 'time');
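The column comments above encode which row value feeds which NRLMSISE-00 driver input. A sketch of that mapping (the output key names are assumptions about the driver's argument names, not fixed by the plan; the column names match the table):

```python
def msise_inputs(row: dict) -> dict:
    """Map a space_weather row onto NRLMSISE-00 driver inputs (key names illustrative)."""
    return {
        "f107": row["f107_prior_day"],     # prior-day F10.7, per the model's convention
        "f107a": row["f107_81day_avg"],    # 81-day centred average
        "ap": row["ap_daily"],             # daily Ap (linear index, NOT Kp)
        "ap_array": row["ap_3h_history"],  # 3-hourly history for storm-time (full) mode
    }

row = {"f107_prior_day": 150.0, "f107_81day_avg": 145.0,
       "ap_daily": 12, "ap_3h_history": [12.0] * 19}
inputs = msise_inputs(row)
```

Keeping the prior-day / 81-day-average distinction explicit in the schema avoids the common error of feeding current-day F10.7 where the model expects the prior day.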
-- TIP messages
CREATE TABLE tip_messages (
id BIGSERIAL PRIMARY KEY,
object_id INTEGER REFERENCES objects(id),
norad_id INTEGER NOT NULL,
message_time TIMESTAMPTZ NOT NULL,
message_number INTEGER,
reentry_window_open TIMESTAMPTZ,
reentry_window_close TIMESTAMPTZ,
predicted_region TEXT,
source TEXT DEFAULT 'usspacecom',
raw_message TEXT
);
-- Alert events (append-only)
CREATE TABLE alert_events (
id BIGSERIAL PRIMARY KEY,
created_at TIMESTAMPTZ DEFAULT NOW(),
level TEXT NOT NULL
CHECK (level IN ('INFO','WARNING','CRITICAL')),
trigger_type TEXT NOT NULL,
object_id INTEGER REFERENCES objects(id),
organisation_id INTEGER REFERENCES organisations(id),
message TEXT NOT NULL,
acknowledged_at TIMESTAMPTZ,
acknowledged_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; log entry preserved
acknowledgement_note TEXT,
delivered_websocket BOOLEAN DEFAULT FALSE,
delivered_email BOOLEAN DEFAULT FALSE,
fir_intersection_km2 DOUBLE PRECISION, -- area of FIR polygon intersected by the triggering corridor (km²); NULL for non-spatial alerts
intersection_percentile TEXT
CHECK (intersection_percentile IN ('p50','p95')), -- which corridor percentile triggered the alert
prediction_id BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
record_hmac TEXT NOT NULL DEFAULT '' -- HMAC-SHA256 of safety-critical fields; signed at insert; verified nightly (F9)
);
CREATE TRIGGER alert_events_immutable
BEFORE UPDATE OR DELETE ON alert_events
FOR EACH ROW EXECUTE FUNCTION prevent_modification();
-- Simulations
CREATE TABLE simulations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
module TEXT NOT NULL,
object_id INTEGER REFERENCES objects(id),
organisation_id INTEGER REFERENCES organisations(id),
params_json JSONB NOT NULL,
started_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','running','complete','failed','cancelled')),
result_uri TEXT,
model_version TEXT,
celery_task_id TEXT,
error_detail TEXT,
created_by INTEGER REFERENCES users(id)
);
-- Reports
CREATE TABLE reports (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
simulation_id UUID REFERENCES simulations(id),
object_id INTEGER REFERENCES objects(id),
organisation_id INTEGER REFERENCES organisations(id),
report_type TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
created_by INTEGER REFERENCES users(id),
storage_uri TEXT NOT NULL,
params_json JSONB,
report_number TEXT
);
-- Prediction outcomes (algorithmic accountability — links predictions to observed re-entry events)
CREATE TABLE prediction_outcomes (
id SERIAL PRIMARY KEY,
prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
norad_id INTEGER NOT NULL,
observed_reentry_time TIMESTAMPTZ, -- actual re-entry time from post-event analysis (The Aerospace Corporation, US18SCS, etc.)
observed_reentry_source TEXT, -- 'aerospace_corp' | 'us18scs' | 'esa_esoc' | 'manual'
p50_error_minutes DOUBLE PRECISION, -- predicted p50 minus observed (+ = predicted late, - = predicted early)
corridor_contains_observed BOOLEAN, -- TRUE if observed impact point fell within p95 corridor
fir_false_positive BOOLEAN, -- TRUE if a CRITICAL alert fired but no observable debris reached the affected FIR
fir_false_negative BOOLEAN, -- TRUE if observable debris reached a FIR but no CRITICAL alert was generated
ood_flag_at_prediction BOOLEAN, -- snapshot of ood_flag from the prediction record at prediction time
notes TEXT,
recorded_at TIMESTAMPTZ DEFAULT NOW(),
recorded_by INTEGER REFERENCES users(id) -- analyst who logged the outcome
);
-- Hazard zones
CREATE TABLE hazard_zones (
id BIGSERIAL PRIMARY KEY,
simulation_id UUID REFERENCES simulations(id),
organisation_id INTEGER REFERENCES organisations(id),
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL,
geometry GEOGRAPHY(POLYGON, 4326) NOT NULL,
altitude_min_km DOUBLE PRECISION,
altitude_max_km DOUBLE PRECISION,
risk_level TEXT,
confidence DOUBLE PRECISION,
sigma_along_track_km DOUBLE PRECISION,
sigma_cross_track_km DOUBLE PRECISION,
record_hmac TEXT NOT NULL
);
CREATE INDEX ON hazard_zones USING GIST (geometry);
CREATE INDEX ON hazard_zones (valid_from, valid_to);
CREATE TRIGGER hazard_zones_immutable
BEFORE UPDATE OR DELETE ON hazard_zones
FOR EACH ROW EXECUTE FUNCTION prevent_modification();
-- Airspace boundaries
CREATE TABLE airspace (
id BIGSERIAL PRIMARY KEY,
designator TEXT NOT NULL,
name TEXT,
type TEXT NOT NULL,
geometry GEOMETRY(POLYGON, 4326) NOT NULL, -- GEOMETRY (not GEOGRAPHY): FIR boundaries never cross antimeridian; ~3× faster for ST_Intersects
lower_fl INTEGER,
upper_fl INTEGER,
icao_region TEXT
);
CREATE INDEX ON airspace USING GIST (geometry);
-- Debris fragments
CREATE TABLE fragments (
id BIGSERIAL PRIMARY KEY,
simulation_id UUID REFERENCES simulations(id),
mass_kg DOUBLE PRECISION,
characteristic_length_m DOUBLE PRECISION,
cross_section_m2 DOUBLE PRECISION,
material TEXT,
ballistic_coefficient_kgm2 DOUBLE PRECISION,
pre_entry_survived BOOLEAN,
impact_point GEOGRAPHY(POINT, 4326),
impact_velocity_kms DOUBLE PRECISION,
impact_angle_deg DOUBLE PRECISION,
kinetic_energy_j DOUBLE PRECISION,
casualty_area_m2 DOUBLE PRECISION,
dispersion_semi_major_km DOUBLE PRECISION,
dispersion_semi_minor_km DOUBLE PRECISION,
dispersion_orientation_deg DOUBLE PRECISION
);
CREATE INDEX ON fragments USING GIST (impact_point);
-- Owned objects (space operator registration)
CREATE TABLE owned_objects (
id SERIAL PRIMARY KEY,
organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
object_id INTEGER REFERENCES objects(id) NOT NULL,
norad_id INTEGER NOT NULL,
registered_at TIMESTAMPTZ DEFAULT NOW(),
registration_reference TEXT, -- National space law registration number
has_propulsion BOOLEAN DEFAULT FALSE, -- Enables controlled re-entry planner
UNIQUE (organisation_id, object_id)
);
CREATE INDEX ON owned_objects (organisation_id);
-- API keys (for Persona E/F programmatic access)
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
user_id INTEGER REFERENCES users(id), -- NULL for org-level service account keys (F5)
is_service_account BOOLEAN NOT NULL DEFAULT FALSE, -- TRUE = org-level key, no human user
service_account_name TEXT, -- required when is_service_account = TRUE; e.g. "ANSP Integration Service"
key_hash TEXT NOT NULL UNIQUE, -- SHA-256 of raw key; raw key shown once at creation
name TEXT NOT NULL, -- Human label, e.g. "Ops Centre Integration"
role TEXT NOT NULL, -- space_operator | orbital_analyst
created_at TIMESTAMPTZ DEFAULT NOW(),
last_used_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ,
revoked_at TIMESTAMPTZ,
revoked_by INTEGER REFERENCES users(id), -- org_admin or admin who revoked (F5)
requests_today INTEGER DEFAULT 0,
daily_limit INTEGER DEFAULT 1000,
-- API key scope and rate limit overrides (Finding 11)
allowed_endpoints TEXT[], -- NULL = all endpoints for role; e.g. ['GET /space/objects']
rate_limit_override JSONB, -- e.g. {"decay_predict": {"limit": 5, "window": "1h"}}
CONSTRAINT service_account_name_required CHECK (
(is_service_account = FALSE) OR (service_account_name IS NOT NULL)
),
CONSTRAINT user_or_service CHECK (
(user_id IS NOT NULL AND is_service_account = FALSE)
OR (user_id IS NULL AND is_service_account = TRUE)
)
);
CREATE INDEX ON api_keys (organisation_id, revoked_at);
CREATE INDEX ON api_keys (organisation_id, is_service_account); -- org admin key listing
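The verification path implied by `key_hash`, `revoked_at`, `expires_at`, and `daily_limit` can be sketched as follows (`rows_by_hash` stands in for the indexed DB lookup; this is a sketch, not the production middleware):

```python
import hashlib
from datetime import datetime, timezone

def verify_api_key(raw_key: str, rows_by_hash: dict, now=None) -> dict:
    """Look up a key by SHA-256 hash; enforce revocation, expiry, and daily limit."""
    now = now or datetime.now(timezone.utc)
    row = rows_by_hash.get(hashlib.sha256(raw_key.encode()).hexdigest())
    if row is None:
        raise PermissionError("unknown key")
    if row["revoked_at"] is not None:
        raise PermissionError("key revoked")
    if row["expires_at"] is not None and now >= row["expires_at"]:
        raise PermissionError("key expired")
    if row["requests_today"] >= row["daily_limit"]:
        raise PermissionError("daily limit exceeded")
    row["requests_today"] += 1  # production: atomic UPDATE, reset by a daily job
    return row
```

Storing only the SHA-256 means a database leak does not expose usable credentials — the raw key is shown once at creation, matching the `key_hash` comment above.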
-- Async job tracking — all Celery-backed POST endpoints return a job reference (Finding 3)
CREATE TABLE jobs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organisation_id INTEGER NOT NULL REFERENCES organisations(id),
user_id INTEGER NOT NULL REFERENCES users(id),
job_type TEXT NOT NULL
CHECK (job_type IN ('decay_predict','report','reentry_plan','propagate')),
status TEXT NOT NULL DEFAULT 'queued'
CHECK (status IN ('queued','running','complete','failed','cancelled')),
celery_task_id TEXT, -- Celery AsyncResult ID for internal tracking
params_hash TEXT, -- SHA-256 of input params; used for idempotency check
result_url TEXT, -- populated when status='complete'; e.g. '/decay/predictions/123'
error_code TEXT, -- populated when status='failed'
error_message TEXT,
estimated_duration_seconds INTEGER, -- populated at creation from historical p50 for job_type
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
);
CREATE INDEX ON jobs (organisation_id, status, created_at DESC);
CREATE INDEX ON jobs (celery_task_id);
-- Idempotency key store — prevents duplicate mutations from network retries (Finding 5)
CREATE TABLE idempotency_keys (
key TEXT NOT NULL, -- client-provided UUID
user_id INTEGER NOT NULL REFERENCES users(id),
endpoint TEXT NOT NULL, -- e.g. 'POST /decay/predict'
response_status INTEGER NOT NULL,
response_body JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '24 hours',
PRIMARY KEY (key, user_id, endpoint)
);
CREATE INDEX ON idempotency_keys (expires_at); -- for TTL cleanup job
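The replay behaviour this table supports — a retried mutation returns the stored response instead of re-running the handler — can be sketched with a dict-backed stand-in (class and endpoint names are illustrative):

```python
class IdempotencyCache:
    """Dict-backed stand-in for the idempotency_keys table (sketch only)."""

    def __init__(self):
        self._store = {}  # (key, user_id, endpoint) -> (status, body)

    def execute(self, key, user_id, endpoint, handler):
        pk = (key, user_id, endpoint)
        if pk in self._store:
            return self._store[pk]  # replay stored response; handler NOT re-run
        status, body = handler()
        self._store[pk] = (status, body)
        return status, body

calls = []
def create_prediction():
    calls.append(1)
    return 202, {"job_id": "abc"}  # illustrative job reference

cache = IdempotencyCache()
r1 = cache.execute("k1", 7, "POST /decay/predict", create_prediction)
r2 = cache.execute("k1", 7, "POST /decay/predict", create_prediction)
assert r1 == r2 and len(calls) == 1  # the network retry did not enqueue a second job
```

The composite primary key `(key, user_id, endpoint)` mirrors the lookup tuple here; the `expires_at` TTL bounds the replay window to 24 hours.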
-- Usage metering (F3) — billable events; append-only
CREATE TABLE usage_events (
id BIGSERIAL PRIMARY KEY,
organisation_id INTEGER NOT NULL REFERENCES organisations(id),
user_id INTEGER REFERENCES users(id), -- NULL for API key / system-triggered events
api_key_id UUID REFERENCES api_keys(id), -- set when triggered via API key
event_type TEXT NOT NULL
CHECK (event_type IN (
'decay_prediction_run',
'conjunction_screen_run',
'report_export',
'api_request',
'mc_quota_exhausted', -- quota hit; signals upsell opportunity
'reentry_plan_run'
)),
quantity INTEGER NOT NULL DEFAULT 1, -- e.g. number of API requests batched
billing_period TEXT NOT NULL, -- 'YYYY-MM' — month this event counts toward
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
detail JSONB -- event-specific metadata (object_id, mc_n, etc.)
);
CREATE INDEX ON usage_events (organisation_id, billing_period, event_type);
CREATE INDEX ON usage_events (organisation_id, created_at DESC);
-- Append-only enforcement
CREATE TRIGGER usage_events_immutable
BEFORE UPDATE OR DELETE ON usage_events
FOR EACH ROW EXECUTE FUNCTION prevent_modification();
-- Billing contacts (F10)
CREATE TABLE billing_contacts (
id SERIAL PRIMARY KEY,
organisation_id INTEGER NOT NULL REFERENCES organisations(id) UNIQUE,
billing_email TEXT NOT NULL,
billing_name TEXT NOT NULL,
billing_address TEXT,
vat_number TEXT, -- EU VAT registration; required for B2B invoicing
purchase_order_number TEXT, -- PO reference required by some ANSP procurement depts
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_by INTEGER REFERENCES users(id) -- must be org_admin or admin
);
-- Subscription periods (F10) — immutable record of what was billed when
CREATE TABLE subscription_periods (
id SERIAL PRIMARY KEY,
organisation_id INTEGER NOT NULL REFERENCES organisations(id),
tier TEXT NOT NULL,
period_start TIMESTAMPTZ NOT NULL,
period_end TIMESTAMPTZ, -- NULL = current (open) period
monthly_fee_eur NUMERIC(10, 2), -- agreed contract price; NULL for internal/trial
currency TEXT NOT NULL DEFAULT 'EUR',
invoice_ref TEXT, -- external billing system invoice ID (e.g. Stripe invoice_id)
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON subscription_periods (organisation_id, period_start DESC);
-- NOTAM drafts (audit trail; never submitted by SpaceCom)
CREATE TABLE notam_drafts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
prediction_id BIGINT REFERENCES reentry_predictions(id),
organisation_id INTEGER REFERENCES organisations(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_by INTEGER REFERENCES users(id),
draft_type TEXT NOT NULL
CHECK (draft_type IN ('new','cancellation')),
fir_designators TEXT[] NOT NULL,
valid_from TIMESTAMPTZ,
valid_to TIMESTAMPTZ,
draft_text TEXT NOT NULL, -- Full ICAO-format draft text
reviewed_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; draft preserved
reviewed_at TIMESTAMPTZ,
review_note TEXT,
safety_record BOOLEAN DEFAULT TRUE, -- always retained; excluded from data drop policy
generated_during_degraded BOOLEAN DEFAULT FALSE -- TRUE if ingest was degraded at generation time
-- No issuance fields — SpaceCom never issues NOTAMs
);
-- Degraded mode audit log (Finding 7 — operational ANSP disclosure requirement)
-- Records every transition into and out of degraded mode for incident investigation
CREATE TABLE degraded_mode_events (
id BIGSERIAL PRIMARY KEY,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ended_at TIMESTAMPTZ, -- NULL = currently degraded
affected_sources TEXT[] NOT NULL, -- e.g. ['space_track', 'noaa_swpc']
severity TEXT NOT NULL
CHECK (severity IN ('WARNING','CRITICAL')),
trigger_reason TEXT NOT NULL, -- human-readable: 'Space-Track ingest gap > 4h'
resolved_by TEXT, -- 'auto-recovery' | user_id | 'manual'
safety_record BOOLEAN DEFAULT TRUE -- always retained under safety record policy
);
-- Append-only: no UPDATE or DELETE permitted
CREATE TRIGGER degraded_mode_events_immutable
BEFORE UPDATE OR DELETE ON degraded_mode_events
FOR EACH ROW EXECUTE FUNCTION prevent_modification();
-- Shadow validation records (compare shadow predictions to actual events)
CREATE TABLE shadow_validations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
prediction_id BIGINT REFERENCES reentry_predictions(id),
organisation_id INTEGER REFERENCES organisations(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
created_by INTEGER REFERENCES users(id),
actual_reentry_time TIMESTAMPTZ,
actual_reentry_location GEOGRAPHY(POINT, 4326),
actual_source TEXT, -- 'aerospace_corp_db' | 'tip_message' | 'manual'
p50_error_minutes DOUBLE PRECISION, -- actual - predicted p50 in minutes
in_p95_corridor BOOLEAN, -- did actual point fall within 95th pct corridor?
notes TEXT
);
-- Legal opinions (jurisdiction-level gate for shadow mode and operational deployment)
CREATE TABLE legal_opinions (
id SERIAL PRIMARY KEY,
jurisdiction TEXT NOT NULL UNIQUE, -- e.g. 'AU', 'EU', 'UK', 'US'
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','in_progress','complete','not_required')),
opinion_date DATE,
counsel_firm TEXT,
shadow_mode_cleared BOOLEAN DEFAULT FALSE, -- opinion confirms shadow deployment is permissible
operational_cleared BOOLEAN DEFAULT FALSE, -- opinion confirms operational deployment is permissible
liability_cap_agreed BOOLEAN DEFAULT FALSE,
notes TEXT,
document_minio_key TEXT, -- reference to stored opinion document in MinIO
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Shared immutability function (used by multiple triggers)
CREATE OR REPLACE FUNCTION prevent_modification()
RETURNS TRIGGER AS $$
BEGIN
RAISE EXCEPTION 'Table % is append-only or immutable after creation', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;
-- Shared updated_at function (used by mutable tables)
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$;
-- updated_at triggers for all mutable tables
CREATE TRIGGER organisations_updated_at
BEFORE UPDATE ON organisations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER users_updated_at
BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER simulations_updated_at
BEFORE UPDATE ON simulations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER jobs_updated_at
BEFORE UPDATE ON jobs FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER notam_drafts_updated_at
BEFORE UPDATE ON notam_drafts FOR EACH ROW EXECUTE FUNCTION set_updated_at();
Shadow mode flag on predictions and hazard zones: Add shadow_mode BOOLEAN DEFAULT FALSE to both reentry_predictions and hazard_zones. Shadow records are excluded from all operational API responses (WHERE shadow_mode = FALSE applied to all operational endpoints) but accessible via /analysis and the Feedback/shadow validation workflow.
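One way to make the mandatory `WHERE shadow_mode = FALSE` hard to forget is to route every operational query through a single helper. A minimal sketch in SQLAlchemy Core — the table definition here is a stand-in for the real `reentry_predictions` model, not the actual SpaceCom code:

```python
# Sketch: centralised shadow-mode exclusion for operational endpoints.
# The table definition is a minimal stand-in for reentry_predictions.
from sqlalchemy import BigInteger, Boolean, Column, MetaData, Table, select

metadata = MetaData()
reentry_predictions = Table(
    "reentry_predictions", metadata,
    Column("id", BigInteger, primary_key=True),
    Column("shadow_mode", Boolean, nullable=False, server_default="false"),
)

def operational_only(stmt):
    """Apply the shadow-mode exclusion required on every operational endpoint."""
    return stmt.where(reentry_predictions.c.shadow_mode.is_(False))

# Operational endpoints build their queries through the helper:
op_query = operational_only(select(reentry_predictions))
# /analysis and the shadow-validation workflow query the table directly,
# so shadow records remain visible there:
analysis_query = select(reentry_predictions)
```

A CI grep for direct `select(reentry_predictions)` outside `/analysis` modules would then catch endpoints that bypass the helper.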
9.3 Index Strategy
All indexes must be created CONCURRENTLY on live hypertables to avoid table locks (see §9.4). The following indexes are required beyond TimescaleDB's automatic chunk indexes:
-- orbits hypertable: object + time range queries (CZML generation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS orbits_object_epoch_idx
ON orbits (object_id, epoch DESC);
-- reentry_predictions: latest prediction per object (Event Detail, operational overview)
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_object_created_idx
ON reentry_predictions (object_id, created_at DESC)
WHERE integrity_failed = FALSE AND shadow_mode = FALSE;
-- alert_events: unacknowledged alerts per org (badge count — called on every page load)
-- Partial index on acknowledged_at IS NULL: only live unacked rows indexed; shrinks as alerts are acknowledged
CREATE INDEX CONCURRENTLY IF NOT EXISTS alert_events_unacked_idx
ON alert_events (organisation_id, level, created_at DESC)
WHERE acknowledged_at IS NULL;
-- jobs: Celery worker polls for queued jobs; partial index keeps this tiny and fast
CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_queued_idx
ON jobs (organisation_id, created_at)
WHERE status = 'queued';
-- refresh_tokens: token validation only cares about live (non-revoked) tokens
CREATE INDEX CONCURRENTLY IF NOT EXISTS refresh_tokens_live_idx
ON refresh_tokens (token_hash)
WHERE revoked_at IS NULL;
-- idempotency_keys: TTL cleanup job needs only expired rows
CREATE INDEX CONCURRENTLY IF NOT EXISTS idempotency_keys_expired_idx
ON idempotency_keys (expires_at)
WHERE expires_at IS NOT NULL;
-- PostGIS spatial: all columns used in ST_Intersects / ST_Contains / ST_Distance
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_corridor_gist
ON reentry_predictions USING GIST (ground_track_corridor);
-- airspace.geometry GIST index already present (see §9.2)
CREATE INDEX CONCURRENTLY IF NOT EXISTS hazard_zones_polygon_gist
ON hazard_zones USING GIST (polygon);
CREATE INDEX CONCURRENTLY IF NOT EXISTS fragments_impact_gist
ON fragments USING GIST (impact_point);
-- tle_sets hypertable: latest TLE per object (cross-validation, propagation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS tle_sets_object_ingested_idx
ON tle_sets (object_id, ingested_at DESC);
-- security_logs: recent events per user (audit queries)
CREATE INDEX CONCURRENTLY IF NOT EXISTS security_logs_user_time_idx
ON security_logs (user_id, created_at DESC);
Spatial type convention:
- GEOGRAPHY — used for global features that may cross the antimeridian (corridor polygons, nominal re-entry points, fragment impact points). Geodetic calculations; correct for global spans.
- GEOMETRY(POLYGON, 4326) — used for regional features always within ±180° longitude (FIR/UIR airspace boundaries). Planar approximation; ~3× faster for ST_Intersects than GEOGRAPHY; accurate enough for airspace boundary intersection within a single hemisphere.
SRID enforcement (F2 — §62): Declaring the SRID in the column type (GEOMETRY(POLYGON, 4326)) prevents implicit SRID mismatch errors, but does not prevent application code from inserting a geometry constructed with SRID 0. Add explicit CHECK constraints on all spatial columns:
-- Ensure corridor polygon SRID is correct
ALTER TABLE reentry_predictions
ADD CONSTRAINT chk_corridor_srid
CHECK (ST_SRID(ground_track_corridor::geometry) = 4326);
ALTER TABLE hazard_zones
ADD CONSTRAINT chk_hazard_zone_srid
CHECK (ST_SRID(geometry) = 4326);
ALTER TABLE airspace
ADD CONSTRAINT chk_airspace_srid
CHECK (ST_SRID(geometry) = 4326);
The CI migration gate (alembic check) will flag any migration that adds a spatial column without a matching SRID CHECK constraint.
ST_Buffer distance units (F9 — §62): ST_Buffer on a GEOMETRY(POLYGON, 4326) column uses degree-units, not metres. At 60°N, 1° ≈ 55 km; at the equator, 1° ≈ 111 km — an uncertainty buffer expressed in degrees gives wildly different areas at different latitudes. Always buffer in a projected CRS, then transform back:
-- CORRECT: buffer 50 km around corridor point at any latitude
SELECT ST_Transform(
ST_Buffer(
ST_Transform(ST_SetSRID(ST_MakePoint(lon, lat), 4326), 3857), -- project to Web Mercator (metres)
50000 -- 50 km in metres
),
4326 -- back to WGS84
) AS buffered_geom;
-- WRONG: buffer in degrees — DO NOT USE
-- SELECT ST_Buffer(geom, 0.5) FROM ... ← 0.5° is ~28 km at 60°N but ~55 km at the equator
For global spans where Mercator distortion is unacceptable, use ST_Buffer on a GEOGRAPHY column instead — it accepts metres natively:
SELECT ST_Buffer(corridor::geography, 50000) -- 50 km buffer, geodetically correct
FROM reentry_predictions WHERE ...
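The latitude dependence quoted above follows directly from the cos(latitude) scaling of a degree of longitude. A pure-Python illustration (spherical-Earth approximation; the radius constant is a standard mean value, not a project parameter):

```python
# Ground length of one degree of longitude shrinks with cos(latitude) —
# this is why a degree-unit ST_Buffer gives different areas at different
# latitudes. Spherical-Earth approximation.
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def km_per_degree_lon(lat_deg: float) -> float:
    """Ground distance of one degree of longitude at the given latitude."""
    circumference_km = 2 * math.pi * EARTH_RADIUS_KM
    return math.cos(math.radians(lat_deg)) * circumference_km / 360
```

`km_per_degree_lon(0)` is ~111 km and `km_per_degree_lon(60)` is ~56 km, matching the figures in the convention above.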
FIR intersection query optimisation: Apply a bounding-box pre-filter before the full polygon intersection test to eliminate most rows cheaply. airspace.geometry is GEOMETRY while hazard_zones.geometry and corridor parameters are GEOGRAPHY — always cast GEOGRAPHY → GEOMETRY explicitly before passing to ST_Intersects with an airspace column; PostgreSQL cannot use the GiST index and falls back to a seq scan if the types are mixed implicitly:
-- Corridor (GEOGRAPHY) intersecting FIR boundaries (GEOMETRY): explicit cast required
SELECT a.designator, a.name
FROM airspace a
WHERE a.geometry && ST_Envelope($1::geography::geometry) -- fast bbox pre-filter (uses GIST)
AND ST_Intersects(a.geometry, $1::geography::geometry); -- exact test (GEOMETRY, not GEOGRAPHY)
-- $1 = corridor polygon passed as GEOGRAPHY from application layer
Add a CI linter rule (or custom ruff plugin) that rejects ST_Intersects(airspace.geometry, <expr>) unless <expr> is explicitly cast to ::geometry. This prevents the mixed-type silent seq-scan regression from being introduced during maintenance.
Cache the FIR intersection result per prediction_id in Redis (TTL: until the prediction is superseded) — the intersection for a given prediction never changes.
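A minimal cache-aside sketch for that pattern. The async `redis` client interface follows redis.asyncio conventions, and `db.query_fir_intersections` stands in for the SQL shown above — both names are illustrative assumptions, not the actual SpaceCom API:

```python
# Sketch: cache-aside for FIR intersection results, keyed by prediction_id.
# Client and query names are illustrative.
import json

FIR_CACHE_KEY = "fir_intersections:{prediction_id}"

async def get_fir_intersections(prediction_id: int, redis, db) -> list:
    key = FIR_CACHE_KEY.format(prediction_id=prediction_id)
    cached = await redis.get(key)
    if cached is not None:
        return json.loads(cached)
    designators = await db.query_fir_intersections(prediction_id)
    # No TTL: the intersection for a given prediction never changes.
    # The key is deleted explicitly when a newer prediction supersedes it.
    await redis.set(key, json.dumps(designators))
    return designators

async def on_prediction_superseded(old_prediction_id: int, redis) -> None:
    """Called when a newer prediction replaces this one."""
    await redis.delete(FIR_CACHE_KEY.format(prediction_id=old_prediction_id))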
9.4 TimescaleDB Configuration and Continuous Aggregates
Hypertable chunk intervals — set explicitly at creation; default 7-day chunks are too large for the orbits CZML query pattern (most queries cover ≤ 72h):
-- orbits: 1-day chunks (72h CZML window spans 3 chunks; good chunk exclusion)
SELECT create_hypertable('orbits', 'epoch',
chunk_time_interval => INTERVAL '1 day',
if_not_exists => TRUE);
-- tle_sets: 1-month chunks (~1,800 rows/day at 600 objects × 3 TLE updates; queried by object_id not time range)
-- Small chunks (7 days) produce poor compression ratios (~12,600 rows/chunk); 1 month improves ratio ~4×
SELECT create_hypertable('tle_sets', 'ingested_at',
chunk_time_interval => INTERVAL '1 month',
if_not_exists => TRUE);
-- space_weather: 30-day chunks (~3000 rows/month at 15-min cadence)
SELECT create_hypertable('space_weather', 'time',
chunk_time_interval => INTERVAL '30 days',
if_not_exists => TRUE);
Continuous aggregates — pre-compute recurring expensive queries instead of scanning raw hypertable rows on every request:
-- 81-day rolling F10.7 average (queried on every Space Weather Widget render)
CREATE MATERIALIZED VIEW space_weather_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day,
AVG(f107_obs) AS f107_daily_avg,
MAX(kp_3hourly[1]) AS kp_max_daily
FROM space_weather
GROUP BY day
WITH NO DATA;
SELECT add_continuous_aggregate_policy('space_weather_daily',
start_offset => INTERVAL '90 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
Backend queries for the 81-day F10.7 average read from space_weather_daily (the continuous aggregate), not from the raw space_weather hypertable.
Compression policy intervals — compression must not target recently-written chunks. TimescaleDB decompresses a chunk before any write to it; compressing hot chunks adds 50–200ms latency per write batch. Set compress_after well beyond the active write window:
| Hypertable | Chunk interval | compress_after | Write cadence | Reasoning |
|---|---|---|---|---|
| orbits | 1 day | 7 days | 1 min (continuous) | Data is queryable but not written after ~24h; 7-day buffer prevents write-decompress thrash |
| adsb_states | 4 hours | 14 days | 60s (Celery Beat) | Rolling 24h retention; compress only after data is past retention interest |
| space_weather | 30 days | 60 days | 15 min | Very low write rate; compress after one full 30-day chunk is closed |
| tle_sets | 1 month | 2 months | Every 4h ingest | ~1,800 rows/day; 1-month chunks give good compression ratio; 2-month buffer ensures active month is never compressed |
-- Apply compression policies (run after hypertable creation)
SELECT add_compression_policy('orbits', INTERVAL '7 days');
SELECT add_compression_policy('adsb_states', INTERVAL '14 days');
SELECT add_compression_policy('space_weather', INTERVAL '60 days');
SELECT add_compression_policy('tle_sets', INTERVAL '2 months');
Autovacuum tuning — append-only tables still accumulate dead tuples from aborted transactions and MVCC overhead. Default 20% threshold is too conservative for high-write safety tables:
ALTER TABLE alert_events SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum at 1% dead tuples (default: 20%)
autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE security_logs SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE reentry_predictions SET (
autovacuum_vacuum_cost_delay = 2, -- allow aggressive vacuum on query-critical table
autovacuum_analyze_scale_factor = 0.01
);
PostgreSQL-level settings via patroni.yml:
postgresql:
parameters:
idle_in_transaction_session_timeout: 30000 # 30s -- prevents analytics sessions blocking autovacuum
max_connections: 50 # pgBouncer handles client multiplexing; DB needs only 50
log_min_duration_statement: 500 # F7 §58: log queries > 500ms; shipped to Loki via Promtail
shared_preload_libraries: timescaledb,pg_stat_statements # F7 §58: enable slow query tracking
pg_stat_statements.track: all # track all statements including nested
# Analyst role statement timeout (F11 §58): prevents runaway analytics queries starving ops connections
# Applied at role level, not globally, to avoid impacting operational paths
Query plan governance (F7 — §58): Slow queries (> 500ms) appear in PostgreSQL logs and are shipped to Loki. A weekly Grafana report queries pg_stat_statements via the postgres-exporter and surfaces the top-10 queries by total_exec_time. Any query appearing in the top-10 for two consecutive weeks requires a PR with an EXPLAIN ANALYSE output and either an index addition or a documented acceptance rationale. The EXPLAIN ANALYSE output is recorded in the migration file header comment for index additions. CI migration timeout (§9.4) applies: migrations running > 30s against the test dataset require review before merge.
Analyst role query timeout (F11 — §58): Persona B/F analyst queries route to the read replica (§3.2) but must still be bounded to prevent a runaway query exhausting replica connections and triggering replication lag. Apply a statement_timeout at the database role level so it applies regardless of connection source:
-- Applied once at schema setup; persists across reconnections
ALTER ROLE spacecom_analyst SET statement_timeout = '30s';
ALTER ROLE spacecom_readonly SET statement_timeout = '30s';
-- Operational roles have no statement timeout — but idle-in-transaction timeout applies globally
-- (idle_in_transaction_session_timeout = 30s in patroni.yml)
The spacecom_analyst role is the PgBouncer user for the read replica pool. All analyst-originated queries automatically inherit the 30s limit. If a query exceeds 30s it receives ERROR: canceling statement due to statement timeout; the frontend displays a user-facing message: "This query exceeded the 30-second limit. Refine your filters or contact your administrator." Logged at WARNING to Loki.
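The mapping from the database error to the user-facing message can be kept in one place. A sketch around the PostgreSQL SQLSTATE — `57014` (`query_canceled`) is the real code raised when `statement_timeout` fires; the function name and HTTP status choice are illustrative assumptions (asyncpg exposes the code via the exception's `sqlstate` attribute):

```python
# Sketch: translate the statement-timeout SQLSTATE into the user-facing
# analyst message quoted above. Function name and 504 mapping are assumptions.
from typing import Optional, Tuple

QUERY_CANCELED = "57014"  # SQLSTATE raised when statement_timeout fires

ANALYST_TIMEOUT_MESSAGE = (
    "This query exceeded the 30-second limit. "
    "Refine your filters or contact your administrator."
)

def classify_db_error(sqlstate: Optional[str]) -> Optional[Tuple[int, str]]:
    """Return (http_status, user_message) for errors surfaced to analysts."""
    if sqlstate == QUERY_CANCELED:
        # Caller logs at WARNING to Loki; 504 signals a server-side limit
        return 504, ANALYST_TIMEOUT_MESSAGE
    return None  # unknown errors propagate to the generic 500 handler
```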
PgBouncer transaction mode + asyncpg prepared statement cache — asyncpg caches prepared statements per server-side connection. In PgBouncer transaction mode, the connection returned after each transaction may differ from the one the statement was prepared on, causing ERROR: prepared statement "..." does not exist under load. Disable the cache in the SQLAlchemy async engine config:
engine = create_async_engine(
DATABASE_URL,
connect_args={"prepared_statement_cache_size": 0},
)
This is non-negotiable when using PgBouncer transaction mode. Do not revert this setting in the belief that it is a performance regression — it prevents a hard production failure mode. See ADR 0008.
Migration safety on live hypertables (additions to the Alembic policy in §26.9):
- Always use CREATE INDEX CONCURRENTLY for new indexes — no table lock; safe during live ingest
- Never add a column with a non-null default to a populated hypertable in one migration: (1) add nullable, (2) backfill in batches, (3) add NOT NULL constraint separately
- Test every migration against production-sized data; record execution time in the migration file header comment
- Set a CI migration timeout: if a migration runs > 30s against the test dataset, it must be reviewed before merge
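The three-step column addition above can be sketched as a helper that emits the statement for each migration step. Table, column, and batch size here are illustrative; in practice each step lives in its own Alembic migration and step 2 is re-run until it touches zero rows:

```python
# Sketch of the add-nullable / backfill-in-batches / add-NOT-NULL pattern.
# Names are illustrative; each yielded statement belongs to a separate
# migration, and the batched UPDATE is repeated until it affects 0 rows.
def backfill_batches(table: str, column: str, default_sql: str,
                     batch_size: int = 10_000):
    """Yield the SQL for each migration step; no long-held row locks."""
    # Step 1: nullable column — metadata-only, instant even on a hypertable
    yield f"ALTER TABLE {table} ADD COLUMN {column} BOOLEAN"
    # Step 2: bounded-batch backfill — repeat until rowcount is 0
    yield (f"UPDATE {table} SET {column} = {default_sql} "
           f"WHERE id IN (SELECT id FROM {table} "
           f"WHERE {column} IS NULL LIMIT {batch_size})")
    # Step 3: constraint added only once the backfill is complete
    yield f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL"
```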
10. Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Frontend framework | Next.js 15 + TypeScript | Type safety, SSR for dashboards, static export option |
| 3D Globe | CesiumJS (retained) | Native CZML support; proven in prototype |
| 2D overlays | Deck.gl | WebGL heatmaps (Mode B), arc layers, hex grids |
| Server state | TanStack Query | Caching, background refetch, stale-while-revalidate. API responses never stored in Zustand. |
| UI state | Zustand | Pure UI state only: timeline mode, selected object, layer visibility, alert acknowledgements |
| URL state | nuqs | Shareable deep links; selected object/event/time reflected in URL |
| Backend framework | FastAPI (retained) | Async, OpenAPI auto-docs, Pydantic validation |
| Task queue | Celery + Redis | Battle-tested for scientific compute; Flower monitoring |
| Catalog propagation | sgp4 | SGP4/SDP4; catalog tracking only, not decay prediction |
| Numerical integrator | scipy.integrate.DOP853 or custom RK7(8) | Adaptive step-size for Cowell decay prediction |
| Atmospheric density | nrlmsise00 Python wrapper | NRLMSISE-00; driven by F10.7 and Ap |
| Frame transformations | astropy | IAU 2006 precession/nutation, IERS EOP, TEME→GCRF→ITRF |
| Astrodynamics utilities | poliastro (optional) | Conjunction geometry helpers |
| Auth | python-jose (RS256 JWT) + pyotp (TOTP MFA) | Asymmetric JWT; TOTP RFC 6238 |
| Rate limiting | slowapi | Redis token bucket; per-user and per-IP limits |
| HTML sanitisation | bleach | User-supplied content before Playwright rendering |
| Password hashing | passlib[bcrypt] | bcrypt cost factor ≥ 12 |
| Database | TimescaleDB + PostGIS (retained) | Time-series + geospatial; RLS for multi-tenancy |
| Cache / broker | Redis 7 | Broker + pub/sub: maxmemory-policy noeviction (Celery queues must never be evicted). Separate Redis DB index for application cache: allkeys-lru. AUTH + TLS in production. |
| Connection pooler | PgBouncer 1.22 | Transaction-mode pooling between all app services and TimescaleDB. Prevents connection exhaustion at Tier 3; single failover target for Patroni switchover. max_client_conn=200, default_pool_size=20. Pool sizing derivation (F2 — §58): PostgreSQL max_connections=50; reserve 5 for superuser/admin; 45 available server connections. default_pool_size=20 per pool (one pool per DB user); leaves headroom for Alembic migrations and ad-hoc DBA access. max_client_conn=200 = (2 backend workers × 40 async connections) + (4 sim workers × 16 threads) + (2 ingest workers × 4) = 152 peak; 200 provides burst headroom. Validate with SHOW pools; in psql -h pgbouncer — cl_waiting > 0 sustained means pool is undersized. |
| Object storage | MinIO | Private buckets; pre-signed URLs only |
| Containerisation | Docker Compose (retained); Caddy as TLS-terminating reverse proxy | Single-command dev; HTTPS auto-provisioning |
| Testing — backend | pytest + hypothesis | Property-based tests for numerical and security invariants |
| Testing — frontend | Vitest + Playwright | Unit tests + E2E including security header checks |
| SAST — Python | Bandit | Static analysis; CI blocks on High severity |
| SAST — TypeScript | ESLint security plugin | Static analysis; CI blocks on High severity |
| Container scanning | Trivy | CI blocks on Critical/High CVEs |
| DAST | OWASP ZAP | Phase 2 pipeline against staging |
| Dependency management | pip-tools + npm ci | Pinned hashes; --require-hashes |
| Report rendering | Playwright headless (isolated renderer container) | Server-side globe screenshot; no client-side canvas |
| Secrets management | Docker secrets (Phase 1 production) → HashiCorp Vault (Phase 3) | |
| Task scheduler HA | celery-redbeat | Redis-backed Beat scheduler; distributed locking; multiple instances safe |
| DB HA / failover | Patroni + etcd | Automatic TimescaleDB primary/standby failover; ≤ 30s RTO |
| Redis HA | Redis Sentinel (3 nodes) | Master failover ≤ 10s; transparent to application via redis-py Sentinel client |
| Monitoring | Prometheus + Grafana | Business-level metrics from Phase 1; four dashboards (§26.7); AlertManager with runbook links |
| Log aggregation | Grafana Loki + Promtail | Phase 2; Promtail scrapes Docker log files; Loki stores and queries; co-deployed with Grafana; no index servers required |
| Distributed tracing | OpenTelemetry → Grafana Tempo | Phase 2; FastAPI + SQLAlchemy + Celery auto-instrumented; OTLP exporter; trace_id = request_id for log correlation; ADR 0017 |
| Structured logging | structlog | JSON structured logs with required fields; sanitising processor strips secrets; request_id propagated through HTTP → Celery chain |
| On-call alerting | PagerDuty or OpsGenie | Routes Prometheus AlertManager alerts; L1/L2/L3 escalation tiers (§26.8) |
| CI/CD pipeline | GitLab CI | Native to the self-hosted GitLab monorepo; stage-based builds for Python/Node; protected environments and approval rules for deploys |
| Container registry | GitLab Container Registry | Co-located with source; sha-<commit> is the canonical immutable tag; latest tag is forbidden in production deployments; image vulnerability attestations via cosign |
| Pre-commit | pre-commit framework | Hooks: detect-secrets, ruff (lint + format), mypy (type gate), hadolint (Dockerfile), prettier (JS/HTML), sqlfluff (migrations); spec in .pre-commit-config.yaml; same hooks re-run in CI |
| Local task runner | make | Standard targets: make dev (full-stack hot-reload), make test (pytest + vitest), make migrate (alembic upgrade head), make seed (fixture load), make lint (all pre-commit hooks), make clean (prune volumes) |
11. Data Source Inventory
| Source | Data | Access | Priority |
|---|---|---|---|
| Space-Track.org | TLE catalog, CDMs, object catalog, RCS data, TIP messages | REST API (account required); credentials in secrets manager | P1 |
| CelesTrak | TLE subsets (active sats, decaying objects) | Public REST API / CSV | P1 |
| USSPACECOM TIP Messages | Tracking and Impact Prediction for decaying objects | Via Space-Track.org | P1 |
| NOAA SWPC | F10.7, Ap/Kp, Dst, solar wind; 3-day forecasts | Public REST API and FTP | P1 |
| ESA Space Weather Service | F10.7, Kp cross-validation source | Public REST API | P1 |
| ESA DISCOS | Physical object properties: mass, dimensions, shape, materials | REST API (account required) | P1 |
| IERS Bulletin A/B | UT1-UTC offsets, polar motion | Public FTP (usno.navy.mil); SHA-256 verified on download | P1 |
| GFS / ECMWF | Tropospheric winds and density 0–80 km | NOMADS (NOAA) public FTP | P2 |
| ILRS / CDDIS | Laser ranging POD products for validation | Public FTP | P2 (validation) |
| FIR/UIR boundaries | FIR and UIR boundary polygons for airspace intersection | EUROCONTROL AIRAC dataset (subscription) for ECAC states; FAA Digital-Terminal Procedures for US; OpenAIP as fallback for non-AIRAC regions. GeoJSON format loaded into airspace table. Updated every 28 days on AIRAC cycle. | P1 |
Deprecated reference: "18th SDS" → use Space-Track.org consistently.
ESA DISCOS redistribution rights (Finding 9): ESA DISCOS is subject to an ESAC user agreement. Data may not be redistributed or used in commercial products without explicit ESA permission. SpaceCom is a commercial platform. Required actions before Phase 2 shadow deployment:
- Obtain written clarification from ESA/ESAC on whether DISCOS-derived physical properties (mass, dimensions) may be: (a) used internally to drive SpaceCom's own predictions; (b) exposed in API responses to ANSP customers; (c) included in generated PDF reports
- If redistribution is not permitted, DISCOS data is used only as internal model input — API responses and reports show source: estimated rather than exposing raw DISCOS values; the data_confidence UI flag continues to show ● DISCOS for internal tracking but is not labelled as DISCOS in customer-facing outputs
- Include the DISCOS redistribution clarification in the Phase 2 legal gate checklist alongside the Space-Track AUP opinion
Airspace data scope and SUA disclosure (Finding 4): Phase 2 FIR/UIR scope covers ECAC states (EUROCONTROL AIRAC) and US FIRs (FAA). The following airspace types are explicitly out of scope for Phase 2 and disclosed to users:
- Special Use Airspace (SUA): danger areas, restricted areas, prohibited areas (ICAO Annex 11)
- Terminal Manoeuvring Areas (TMAs) and Control Zones (CTRs)
- Oceanic FIRs (ICAO Annex 2 special procedures; OACCs handle coordination)
A persistent disclosure note on the Airspace Impact Panel reads: "SpaceCom FIR intersection analysis covers FIR/UIR boundaries only. It does not account for special use airspace, terminal areas, or oceanic procedures. Controllers must apply their local procedures for these airspace types." Phase 3 consideration: SUA polygon overlay from national AIP sources. Document in docs/adr/0014-airspace-scope.md.
All source URLs are hardcoded constants in ingest/sources.py. The outbound HTTP client blocks connections to private IP ranges. No source URL is configurable via API or database at runtime.
Space-Track AUP — conditional architecture (Finding 9): The AUP clarification is a Phase 1 architectural decision gate, not a Phase 2 deliverable. The current design assumes shared ingest (a single SpaceCom Space-Track credential fetches TLEs for all organisations). If the AUP prohibits redistribution of derived predictions to customers who have not themselves agreed to the AUP, the ingest architecture must change:
- Path A — redistribution permitted: Current shared-ingest design is valid. Each customer organisation's access is governed by SpaceCom's AUP click-wrap and the MSA. No architectural change.
- Path B — redistribution not permitted: Per-organisation Space-Track credentials required. Each ANSP/operator must hold their own Space-Track account. SpaceCom acts as a processing layer using each org's own credentials. Architecture change: space_track_credentials table (per-org, encrypted); per-org ingest worker configuration; significant additional complexity.
The decision must be documented in docs/adr/0016-space-track-aup-architecture.md with the chosen path and evidence (written AUP clarification). This ADR is a prerequisite for Phase 1 ingest architecture finalisation — marked as a blocking decision in the Phase 1 DoD.
Space weather raw format specifications:
| Source | Endpoint constant | Format | Key fields consumed |
|---|---|---|---|
| NOAA SWPC F10.7 | NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json" | JSON array | time_tag, flux (solar flux units) |
| NOAA SWPC Kp/Ap | NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json" | JSON array | time_tag, kp_index, ap |
| NOAA SWPC 3-day forecast | NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json" | JSON | Kp array |
| ESA SWS Kp | ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions" | REST JSON | kp_index (cross-validation) |
An integration test asserts that each response contains the expected top-level keys. If a key is absent, the test fails and the schema change is caught before it reaches production ingest.
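The core of such a test reduces to a key-set comparison per source. A sketch — the expected-key registry mirrors the table above, while the function name and the HTTP fetch step (noted in the comment) are illustrative:

```python
# Sketch of the schema-drift integration test: assert the keys we consume
# are present in the first record of each upstream response.
EXPECTED_KEYS = {
    "noaa_f107": {"time_tag", "flux"},
    "noaa_kp": {"time_tag", "kp_index", "ap"},
}

def assert_schema(source: str, records: list) -> None:
    """Fail loudly if the upstream feed dropped a key we consume."""
    missing = EXPECTED_KEYS[source] - set(records[0])
    if missing:
        raise AssertionError(f"{source}: upstream schema drift, missing {missing}")

# In the real test, `records` comes from the live endpoint,
# e.g. httpx.get(NOAA_F107_URL).json()
assert_schema("noaa_f107", [{"time_tag": "2025-01-01T00:00Z", "flux": 150.2}])
```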
TLE validation at ingestion gate: Before any TLE record is written to the database, ingest/cross_validator.py must verify:
- Both lines are exactly 69 characters (standard TLE format)
- Modulo-10 checksum passes on line 1 and line 2
- Epoch field parses to a valid UTC datetime
- BSTAR drag term is within physically plausible bounds (−0.5 to +0.5)
Failed validation is logged to security_logs type INGEST_VALIDATION_FAILURE with the raw TLE and failure reason. The record is not written to the database.
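The modulo-10 checksum rule is standard: sum every digit in the first 68 columns, count each '-' as 1, ignore all other characters, and compare the sum mod 10 against the final column. A sketch of that check (the function name is illustrative):

```python
# Sketch of the modulo-10 TLE checksum check from the ingestion gate:
# digits count at face value, '-' counts as 1, everything else counts 0;
# the 69th character must equal the sum mod 10.
def tle_checksum_ok(line: str) -> bool:
    """Validate the standard TLE line-length and checksum."""
    if len(line) != 69 or not line[68].isdigit():
        return False
    total = sum(int(c) if c.isdigit() else (1 if c == "-" else 0)
                for c in line[:68])
    return total % 10 == int(line[68])
```

The widely circulated ISS example line (`1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927`) passes; flipping its last digit fails.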
TLE ingest idempotency — ON CONFLICT behavior: The tle_sets table has UNIQUE (object_id, ingested_at). If the ingest worker runs twice for the same object within the same second (e.g., orphan recovery task + normal schedule overlap, or a worker restart mid-task), the second insert must not raise an exception or silently discard the row without tracking. Required semantics:
# ingest/writer.py
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.ext.asyncio import AsyncSession
import structlog

async def write_tle_set(session: AsyncSession, tle: TLERecord) -> bool:
"""Insert TLE record. Returns True if inserted, False if duplicate."""
stmt = pg_insert(TLESet).values(
object_id=tle.object_id,
ingested_at=tle.ingested_at,
tle_line1=tle.line1,
tle_line2=tle.line2,
epoch=tle.epoch,
source=tle.source,
).on_conflict_do_nothing(
index_elements=["object_id", "ingested_at"]
).returning(TLESet.object_id)
result = await session.execute(stmt)
# RETURNING yields a row only on an actual insert; rowcount is driver-dependent
inserted = result.scalar_one_or_none() is not None
if not inserted:
spacecom_ingest_tle_conflict_total.inc() # metric; non-zero signals scheduling race
structlog.get_logger().debug("tle_insert_skipped_duplicate",
object_id=tle.object_id, ingested_at=tle.ingested_at)
return inserted
Prometheus counter spacecom_ingest_tle_conflict_total — a sustained non-zero rate warrants investigation of the Beat schedule overlap. A brief spike during worker restart is acceptable.
Ingest idempotency requirement for all periodic tasks (F8 — §67): TLE ingest uses ON CONFLICT DO NOTHING (above). All other periodic ingest tasks must use equivalent upsert semantics to survive celery-redbeat double-fire on restart:
-- Space weather ingest: upsert on (fetched_at) unique constraint
INSERT INTO space_weather (fetched_at, kp, f107, ...)
VALUES (:fetched_at, :kp, :f107, ...)
ON CONFLICT (fetched_at) DO NOTHING;
-- DISCOS object metadata: upsert on (norad_id) — update if data changed
INSERT INTO objects (norad_id, name, launch_date, ...)
VALUES (:norad_id, :name, :launch_date, ...)
ON CONFLICT (norad_id) DO UPDATE SET
name = EXCLUDED.name,
launch_date = EXCLUDED.launch_date,
updated_at = NOW()
WHERE objects.updated_at < EXCLUDED.updated_at; -- only update if newer
-- IERS EOP: upsert on (date) unique constraint
INSERT INTO iers_eop (date, ut1_utc, x_pole, y_pole, ...)
VALUES (:date, :ut1_utc, :x_pole, :y_pole, ...)
ON CONFLICT (date) DO NOTHING;
Add unique constraints if not present: UNIQUE (fetched_at) on space_weather; UNIQUE (date) on iers_eop. These prevent double-write corruption at the DB level regardless of application retry logic.
IERS EOP cold-start requirement: On a fresh deployment with no cached EOP data, astropy's IERS_Auto falls back to the bundled IERS-B table (which lags the current date by weeks to months), silently degrading UT1-UTC precision from ~1 ms (IERS-A) to ~10–50 ms (IERS-B). For epochs beyond the IERS-B table end date, astropy raises IERSRangeError, crashing all frame transforms.
The EOP ingest task must run as part of make seed before any propagation task starts:
# Makefile
seed: migrate
docker compose exec backend python -m ingest.eop --bootstrap # downloads + caches current IERS-A
docker compose exec backend python -m ingest.fir --bootstrap # loads FIR boundaries
docker compose exec backend psql "$$DATABASE_URL" -f fixtures/dev_seed.sql # SQL fixture runs via psql, not python
The EOP ingest task in Celery Beat is ordered before the TLE ingest task: EOP runs at 00:00 UTC, TLE ingest at 00:10 UTC (ensuring fresh EOP before the first propagation of the day).
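That ordering can be expressed directly in the Beat schedule. A config sketch — the task paths and entry names are illustrative assumptions, not the actual SpaceCom task registry; only the crontab offsets (00:00 vs :10, with TLE ingest on its 4-hourly cadence) reflect the requirement above:

```python
# Sketch: Beat schedule ordering EOP refresh before TLE ingest.
# Task names are illustrative assumptions.
from celery.schedules import crontab

beat_schedule = {
    "eop-daily": {
        "task": "ingest.tasks.fetch_eop",
        "schedule": crontab(hour=0, minute=0),        # 00:00 UTC
    },
    "tle-ingest": {
        "task": "ingest.tasks.fetch_tles",
        "schedule": crontab(hour="*/4", minute=10),   # 00:10, 04:10, ... UTC
    },
}
```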
IERS EOP verification — dual-mirror comparison: The IERS does not publish SHA-256 hashes alongside its EOP files. Comparing hash-against-prior-download detects corruption but not substitution. The correct approach is downloading from both the USNO mirror and the Paris Observatory mirror and verifying agreement:
# ingest/eop.py
IERS_MIRRORS = [
"https://maia.usno.navy.mil/ser7/finals2000A.all",
"https://hpiers.obspm.fr/iers/series/opa/eopc04", # IERS-C04 series
]
async def fetch_and_verify_eop() -> bytes:
contents = []
for url in IERS_MIRRORS:
resp = await http_client.get(url, timeout=30)
resp.raise_for_status()
contents.append(resp.content)
# Verify UT1-UTC values agree within 0.1 ms across mirrors (format-normalised comparison)
if not _eop_values_agree(contents[0], contents[1], tolerance_ms=0.1):
structlog.get_logger().error("eop_mirror_disagreement")
spacecom_eop_mirror_agreement.set(0)
raise EOPVerificationError("IERS EOP mirrors disagree — rejecting both")
spacecom_eop_mirror_agreement.set(1)
return contents[0] # USNO is primary; Paris Observatory is the verification witness
Prometheus gauge spacecom_eop_mirror_agreement (1 = mirrors agree, 0 = disagreement detected). Alert on spacecom_eop_mirror_agreement == 0.
12. Backend Directory Structure
backend/
app/
main.py # FastAPI app factory, middleware, router mounting
config.py # Settings via pydantic-settings (env vars); no secrets in code
auth/
provider.py # AuthProvider protocol + LocalJWTProvider implementation
jwt.py # RS256 token issue, verify, refresh; key loaded from secrets
mfa.py # TOTP (pyotp); recovery code generation and verification
deps.py # get_current_user, require_role() dependency factory
middleware.py # Auth middleware; rate limit enforcement
frame_utils.py # TEME→GCRF→ITRF→WGS84 + IERS EOP refresh + hash verification
time_utils.py # Time system conversions
integrity.py # HMAC sign/verify for predictions and hazard zones
logging_config.py # Sanitising log formatter; security event logger
modules/
catalog/
router.py # /api/v1/objects; requires viewer role minimum
schemas.py
service.py
models.py
propagator/
catalog.py # SGP4 catalog propagation
decay.py # RK7(8) + NRLMSISE-00 + Monte Carlo; HMAC-signs output
tasks.py # Celery tasks with time_limit, soft_time_limit
router.py # /api/v1/propagate, /api/v1/decay; requires analyst role
reentry/
router.py # /api/v1/reentry; requires viewer role
service.py
corridor.py # Percentile corridor polygon generation
spaceweather/
router.py # /api/v1/spaceweather; requires viewer role
service.py # Cross-validates NOAA SWPC vs ESA SWS; generates status string
tasks.py # Celery Beat: NOAA SWPC polling every 3h
noaa_swpc.py # NOAA SWPC client; URL hardcoded constant
esa_sws.py # ESA SWS cross-validation client
viz/
router.py # /api/v1/czml; requires viewer role
czml_builder.py # CZML output; all strings HTML-escaped; J2000 INERTIAL frame
mc_geometry.py # MC trajectory binary blob pre-baking
ingest/
sources.py # Hardcoded external URLs and IP allowlists (SSRF mitigation)
tasks.py # Celery Beat-scheduled tasks
spacetrack.py # Space-Track client; credentials from secrets manager only
celestrak.py # CelesTrak client
discos.py # ESA DISCOS client
iers.py # IERS EOP fetcher + SHA-256 verification
cross_validator.py # TLE and space weather cross-source comparison
alerts/
router.py # /api/v1/alerts; requires operator role for acknowledge
service.py # Alert trigger evaluation; rate limit enforcement; deduplication
notifier.py # WebSocket push + email; storm detection
integrity_guard.py # TIP vs prediction cross-check; HMAC failure escalation
reports/
router.py # /api/v1/reports; requires analyst role
builder.py # Section assembly; all user fields sanitised via bleach
renderer_client.py # Internal HTTPS call to renderer service with sanitised payload
security/
audit.py # Security event logger; writes to security_logs
sanitiser.py # Log formatter that strips credential patterns
breakup/
atmospheric.py
on_orbit.py
tasks.py
router.py
conjunction/
screener.py
probability.py
tasks.py
router.py
weather/
upper.py
lower.py
hazard/
router.py
fusion.py # HMAC-signs all hazard_zones output; propagates shadow_mode flag
tasks.py
airspace/
router.py
loader.py
intersection.py
notam/
router.py # /api/v1/notam; requires operator role
drafter.py # ICAO Annex 15 format generation
disclaimer.py # Mandatory regulatory disclaimer text
space_portal/
router.py # /api/v1/space; space_operator and orbital_analyst roles
owned_objects.py # Owned object CRUD; RLS enforcement
controlled_reentry.py # Deorbit window optimisation
ccsds_export.py # CCSDS OEM/CDM format export
api_keys.py # API key lifecycle management
launch_safety/ # Phase 3
screener.py
router.py
reroute/ # Phase 3; strategic pre-flight avoidance boundary only
feedback/ # Phase 3; includes shadow_validation.py
migrations/ # Alembic; includes immutability triggers in initial migration
tests/
conftest.py # db_session fixture (SAVEPOINT/ROLLBACK); testcontainers setup for Celery tests
physics/
test_frame_utils.py
test_propagator/
test_decay/
test_nrlmsise.py
test_hypothesis.py # Hypothesis property-based tests (§42.3)
test_mc_corridor.py # MC seeded RNG corridor validation (§42.4)
test_breakup/
test_integrity.py # HMAC sign/verify; tamper detection
test_auth.py # JWT; MFA; rate limiting; RBAC enforcement
test_rbac.py # Every endpoint tested for correct role enforcement
test_websocket.py # WS sequence replay; token expiry warning; close codes 4001/4002
test_ingest/
test_contracts.py # Space-Track + NOAA key presence AND value-range assertions
test_spaceweather/
test_jobs/
test_celery_failure.py # Timeout → 'failed'; orphan recovery Beat task
smoke/ # Post-deploy; all idempotent; run in ≤ 2 min; require smoke_user seed
test_api_health.py # GET /readyz → 200/207; GET /healthz → 200
test_auth_smoke.py # Login → JWT; refresh → new token
test_catalog_smoke.py # GET /catalog → 200; 'data' key present
test_ws_smoke.py # WS connect → heartbeat within 5s
test_db_smoke.py # SELECT 1 via backend health endpoint
quarantine/ # Flaky tests awaiting fix; excluded from blocking CI (see §33.10 policy)
requirements.in # pip-tools source
requirements.txt # pip-compile output with hashes
Dockerfile # FROM pinned digest; non-root user; read-only FS
12.1 Repository docs/ Directory Structure
All documentation files live under docs/ in the monorepo root. Files referenced elsewhere in this plan must exist at these paths.
docs/
README.md # Documentation index — what's here and where to look
MASTER_PLAN.md # This document
AGENTS.md # Guidance for AI coding agents working in this repo (see §33.9)
CHANGELOG.md # Keep a Changelog format; human-maintained; one entry per release
adr/ # Architecture Decision Records (MADR format)
README.md # ADR index with status column
0001-rs256-asymmetric-jwt.md
0002-dual-frontend-architecture.md
0003-monte-carlo-chord-pattern.md
0004-geography-vs-geometry-spatial-types.md
0005-lazy-raise-sqlalchemy.md
0006-timescaledb-chunk-intervals.md
0007-cesiumjs-commercial-licence.md
0008-pgbouncer-transaction-mode.md
0009-ccsds-oem-gcrf-reference-frame.md
0010-alert-threshold-rationale.md
# ... continued; one ADR per consequential decision in §20
runbooks/
README.md # Runbook index with owner and last-reviewed date
TEMPLATE.md # Standard runbook template (see §33.4)
db-failover.md
celery-recovery.md
hmac-failure.md
ingest-failure.md
gdpr-breach-notification.md
safety-occurrence-notification.md
secrets-rotation-jwt.md
secrets-rotation-spacetrack.md
secrets-rotation-hmac.md
blue-green-deploy.md
restore-from-backup.md
model-card-decay-predictor.md # Living document; updated per model version (§32.1)
ood-bounds.md # OOD detection thresholds (§32.3)
recalibration-procedure.md # Recalibration governance (§32.4)
alert-threshold-history.md # Alert threshold change log (§24.8)
query-baselines/ # EXPLAIN ANALYZE output; one file per critical query
czml_catalog_100obj.txt
fir_intersection_baseline.txt
# ... one file per query baseline recorded in Phase 1
validation/ # Validation procedure and reference data (§17)
README.md # How to run each validation suite
reference-data/
vallado-sgp4-cases.json # Vallado (2013) SGP4 reference state vectors
iers-frame-test-cases.json # IERS precession-nutation reference cases
aerospace-corp-reentries.json # Historical re-entry outcomes for backcast validation
backcast-validation-v1.0.0.pdf # Phase 1 validation report (≥3 events)
backcast-validation-v2.0.0.pdf # Phase 2 validation report (≥10 events)
api-guide/ # Persona E/F API developer documentation (§33.10)
README.md # API guide index
authentication.md
rate-limiting.md
webhooks.md
code-examples/
python-quickstart.py
typescript-quickstart.ts
error-reference.md
user-guides/ # Operational persona documentation (§33.7)
aviation-portal-guide.md # Persona A/B/C
space-portal-guide.md # Persona E/F
admin-guide.md # Persona D
test-plan.md # Test suite index with scope and blocking classification (§33.11)
public-reports/ # Quarterly transparency reports (§32.6)
# quarterly-accuracy-YYYY-QN.pdf
legal/ # Legal opinion documents (MinIO primary; this dir for dev reference)
# legal-opinion-template.md
13. Frontend Directory Structure and Architecture
frontend/
src/
app/
page.tsx # Operational Overview
watch/[norad_id]/page.tsx # Object Watch Page
events/
page.tsx # Active Events + full Timeline/Gantt
[id]/page.tsx # Event Detail
airspace/page.tsx # Airspace Impact View
analysis/page.tsx # Analyst Workspace
catalog/page.tsx # Object Catalog
reports/
page.tsx
[id]/page.tsx
admin/page.tsx # System Administration (admin role only)
space/
page.tsx # Space Operator Overview
objects/
page.tsx # My Objects Dashboard (space_operator: owned only)
[norad_id]/page.tsx # Object Technical Detail
reentry/
plan/page.tsx # Controlled Re-entry Planner
conjunction/page.tsx # Conjunction Screening (orbital_analyst)
analysis/page.tsx # Orbital Analyst Workspace
export/page.tsx # Bulk Export
api/page.tsx # API Keys + Documentation
layout.tsx # Root layout: nav, ModeIndicator, AlertBadge,
# JobsPanel; applies security headers via middleware
middleware.ts # Next.js middleware: enforce HTTPS, set CSP
# and security headers on every response,
# redirect unauthenticated users to /login
components/
globe/
CesiumViewer.tsx
LayerPanel.tsx
ViewToggle.tsx
ClusterLayer.tsx
CorridorLayer.tsx
corridor/
PercentileCorridors.tsx # Mode A
ProbabilityHeatmap.tsx # Mode B (Phase 2)
ParticleTrajectories.tsx # Mode C (Phase 3)
UncertaintyModeSelector.tsx
plan/
PlanView.tsx # Phase 2
AltitudeCrossSection.tsx # Phase 2
timeline/
TimelineStrip.tsx
TimelineGantt.tsx
TimelineControls.tsx
ModeIndicator.tsx
panels/
ObjectInfoPanel.tsx
PredictionPanel.tsx # Includes HMAC status indicator
AirspaceImpactPanel.tsx # Phase 2
ConjunctionPanel.tsx # Phase 2
alerts/
AlertBanner.tsx
AlertBadge.tsx
NotificationCentre.tsx
AcknowledgeDialog.tsx
jobs/
JobsPanel.tsx
JobProgressBar.tsx
SimulationComparison.tsx
spaceweather/
SpaceWeatherWidget.tsx
reports/
ReportConfigDialog.tsx
ReportPreview.tsx
space/
SpaceOverview.tsx
OwnedObjectCard.tsx
ControlledReentryPlanner.tsx
DeorbitWindowList.tsx
ApiKeyManager.tsx
CcsdsExportPanel.tsx
ShadowBanner.tsx # Amber banner displayed when shadow mode active
notam/
NotamDraftViewer.tsx
NotamCancellationDialog.tsx
NotamRegulatoryDisclaimer.tsx
shadow/
ShadowModeIndicator.tsx
ShadowValidationReport.tsx
dashboard/
EventSummaryCard.tsx
SystemHealthCard.tsx
shared/
DataConfidenceBadge.tsx
IntegrityStatusBadge.tsx # ✓ HMAC verified / ✗ HMAC failed
UncertaintyBound.tsx
CountdownTimer.tsx
hooks/
useObjects.ts
usePrediction.ts # Polls HMAC status; shows warning if failed
useEphemeris.ts
useSpaceWeather.ts
useAlerts.ts
useSimulation.ts
useCZML.ts
useWebSocket.ts # Cookie-based auth; per-user connection limit
stores/ # Zustand — UI state only; no API responses
timelineStore.ts # Mode, playhead position, playback speed
selectionStore.ts # Selected object/event/zone IDs
layerStore.ts # Layer visibility, corridor display mode
jobsStore.ts # Active job IDs (content fetched via TanStack Query)
alertStore.ts # Unread count, mute rules
uiStore.ts # Panel state, theme (dark/light/high-contrast)
lib/
api.ts # Typed fetch wrapper; credentials: 'include'
# for httpOnly cookie auth; never reads tokens
czml.ts
ws.ts # wss:// enforced; cookie auth at upgrade
corridorGeometry.ts
mcBinaryDecoder.ts
reportUtils.ts
types/
objects.ts
predictions.ts # Includes hmac_status, integrity_failed fields
alerts.ts
spaceweather.ts
simulation.ts
czml.ts
public/
branding/
middleware.ts # Root Next.js middleware for security headers
next.config.ts # Content-Security-Policy defined here for SSR
tsconfig.json
package.json
package-lock.json # Committed; npm ci used in Docker builds
13.0 Accessibility Standard Commitment
Minimum standard: WCAG 2.1 Level AA, the conformance level incorporated by reference into EN 301 549 v3.2.1 — the mandatory accessibility standard for ICT procured by EU public sector bodies including ESA. (ISO/IEC 40500:2012 standardises the earlier WCAG 2.0; EN 301 549 cites WCAG 2.1 directly.) Failure to meet EN 301 549 is a bid disqualifier for any EU public sector tender.
All frontend work must meet these criteria before a PR is merged:
- WCAG 2.1 AA automated check passes (`axe-core` — see §42)
- Keyboard-only operation possible for all primary operator workflows
- Screen reader (NVDA + Firefox; VoiceOver + Safari) tested for primary workflow on each release
- Colour contrast ≥ 4.5:1 for all informational text; ≥ 3:1 for UI components and graphical elements
- No functionality conveyed by colour alone
Deliverable: Accessibility Conformance Report (ACR / VPAT 2.4) produced before Phase 2 ESA bid submission. Maintained thereafter for each major release.
UTC-only rule for operational interface (F1): ICAO Annex 2 and Annex 15 mandate UTC for all aeronautical operational communications. The following is a hard rule — no exceptions without explicit documentation and legal/safety sign-off:
- All times displayed in Persona A/C operational views (alert panels, event detail, NOTAM draft, shift handover) are UTC only, formatted as `HH:MMZ` or `DD MMM YYYY HH:MMZ`
- No timezone conversion widget or local-time toggle in the operational interface
- Local time display is permitted only in non-operational views (account settings, admin billing pages) and must be clearly labelled with the timezone name
- The `Z` suffix or `UTC` label is persistently visible — never hidden in a tooltip or hover state
- All API timestamps returned as ISO 8601 UTC (`2026-03-22T14:00:00Z`) — never local time strings
13.1 State Management Separation
TanStack Query: All API-derived data — object lists, predictions, ephemeris, space weather, alerts, simulation results. Handles caching, background refetch, and stale-while-revalidate.
Zustand: Pure UI state with no server dependency — selected IDs, layer visibility, timeline mode and position, panel open/closed state, theme, alert mute rules.
URL state (nuqs): Shareable, bookmarkable — selected NORAD ID, active event ID, time position in replay mode, active layer set. Browser back/forward works correctly. Requires NuqsAdapter wrapping the App Router root layout to hydrate correctly on SSR.
Never in state: Raw API response bodies. No useEffect that writes API responses into Zustand.
Authentication in the client: The api.ts fetch wrapper uses credentials: 'include' to send the httpOnly auth cookie automatically. The client never reads, stores, or handles the JWT token directly — it is invisible to JavaScript. CSRF is mitigated by SameSite=Strict on the cookie.
Next.js App Router component boundary (ADR 0018): The project uses App Router. The globe and all operational views are client components; static pages (onboarding, settings, admin) are React Server Components where practical.
| Route group | RSC/Client | Rationale |
|---|---|---|
| `app/(globe)/` — operational views | `"use client"` root layout | CesiumJS, WebSocket, Zustand hooks require browser APIs |
| `app/(static)/` — onboarding, settings | Server Components by default | No browser APIs needed; faster initial load |
| `app/(auth)/` — login, MFA | Server Components + Client islands | Form validation islands only |
Rules enforced in AGENTS.md:
- Never add `"use client"` to a leaf component without a comment explaining which browser API requires it
- `app/(globe)/layout.tsx` is the single `"use client"` boundary for all operational views — child components inherit it without re-declaring
- `nuqs` requires `<NuqsAdapter>` at the root of `app/(globe)/layout.tsx`
TanStack Query key factory (src/lib/queryKeys.ts) — stable hierarchical keys prevent cache invalidation bugs:
export const queryKeys = {
objects: {
all: () => ['objects'] as const,
list: (f: ObjectFilters) => ['objects', 'list', f] as const,
detail: (id: number) => ['objects', 'detail', id] as const,
tleHistory: (id: number) => ['objects', id, 'tle-history'] as const,
},
predictions: {
byObject: (id: number) => ['predictions', id] as const,
},
alerts: {
all: () => ['alerts'] as const,
unacked: (orgId: number) => ['alerts', 'unacked', orgId] as const,
},
jobs: {
detail: (jobId: string) => ['jobs', jobId] as const,
},
} as const;
// On WS alert.new: queryClient.invalidateQueries({ queryKey: queryKeys.alerts.all() })
// On acknowledge mutation: optimistic setQueryData, then invalidate on settle
React error boundary hierarchy — a CesiumJS crash must never remove the alert panel from the DOM:
// app/(globe)/layout.tsx
<AppErrorBoundary fallback={<AppCrashPage />}>
<GlobeErrorBoundary fallback={<GlobeUnavailable />}>
<GlobeCanvas /> {/* WebGL context loss isolated here */}
</GlobeErrorBoundary>
<PanelErrorBoundary name="alerts">
<AlertPanel /> {/* Survives globe crash */}
</PanelErrorBoundary>
<PanelErrorBoundary name="events">
<EventList />
</PanelErrorBoundary>
</AppErrorBoundary>
GlobeUnavailable displays: "Globe unavailable — WebGL context lost. Re-entry event data below remains operational." Alert and event panels remain visible and functional. Add GlobeErrorBoundary to AGENTS.md safety-critical component list.
Loading and empty state specification — for safety-critical panels, loading and empty must be visually distinct from each other and from error:
| State | Visual treatment | Required text |
|---|---|---|
| Loading | Skeleton matching panel layout | — |
| Empty | Explicit affirmative message | AlertPanel: "No unacknowledged alerts"; EventList: "No active re-entry events" |
| Error | Inline error with retry button | Never blank |
Rule: safety-critical panels (AlertPanel, EventList, PredictionPanel) must never render blank. DataConfidenceBadge must always show a value — display "Unknown" explicitly, never render nothing.
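One way to make the never-blank rule mechanical is to model panel state as a discriminated union, so a blank render is unrepresentable. A sketch, where `PanelState` and `panelStateFor` are hypothetical names, not from the codebase:

```typescript
// Sketch: safety-critical panel state as a discriminated union.
// Every state maps to a required visual treatment; there is no "nothing" state.
type PanelState<T> =
  | { kind: 'loading' }                  // skeleton matching panel layout
  | { kind: 'empty'; message: string }   // explicit affirmative message
  | { kind: 'error'; message: string }   // inline error with retry; never blank
  | { kind: 'data'; items: T[] };

export function panelStateFor<T>(
  items: T[] | undefined,
  error: string | null,
  emptyMessage: string,
): PanelState<T> {
  if (error !== null) return { kind: 'error', message: error };
  if (items === undefined) return { kind: 'loading' };
  if (items.length === 0) return { kind: 'empty', message: emptyMessage };
  return { kind: 'data', items };
}
```

The render switch over `kind` then has no branch that produces an empty panel.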
WebSocket reconnection policy (src/lib/ws.ts):
const RECONNECT = {
initialDelayMs: 1_000,
maxDelayMs: 30_000,
multiplier: 2,
jitter: 0.2, // ±20% — spreads reconnections after mass outage/deploy
};
// TOKEN_EXPIRY_WARNING handler: trigger silent POST /auth/token/refresh;
// on success send AUTH_REFRESH; on failure show re-login modal (60s grace before disconnect)
// Reconnect sends ?since_seq=<last_seq> for missed event replay
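The delay computation implied by this configuration can be sketched as below; the function name `reconnectDelayMs` and the injectable `rng` parameter (for deterministic tests) are illustrative additions:

```typescript
// Sketch: exponential backoff capped at maxDelayMs, with ±20% uniform jitter.
const RECONNECT = { initialDelayMs: 1_000, maxDelayMs: 30_000, multiplier: 2, jitter: 0.2 };

export function reconnectDelayMs(attempt: number, rng: () => number = Math.random): number {
  const base = Math.min(
    RECONNECT.initialDelayMs * RECONNECT.multiplier ** attempt, // 1s, 2s, 4s, ...
    RECONNECT.maxDelayMs,                                       // capped at 30s
  );
  const jitterFactor = 1 + RECONNECT.jitter * (2 * rng() - 1);  // uniform in [0.8, 1.2]
  return Math.round(base * jitterFactor);
}
```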
Operational mode guard (src/hooks/useModeGuard.ts) — enforces LIVE/SIMULATION/REPLAY write restrictions:
export function useModeGuard(allowedModes: OperationalMode[]) {
const { mode } = useTimelineStore();
return { isAllowed: allowedModes.includes(mode), currentMode: mode };
}
// Usage: const { isAllowed } = useModeGuard(['LIVE']);
// All write-action components (acknowledge alert, submit NOTAM draft, trigger prediction)
// must call useModeGuard(['LIVE']) and disable + annotate button in other modes.
Deck.gl + CesiumJS integration — use DeckLayer from @deck.gl/cesium (rendered inside CesiumJS as a primitive; correct z-order and shared input handling). Never use a separate Deck.gl canvas:
import { DeckLayer } from '@deck.gl/cesium';
import { HeatmapLayer } from '@deck.gl/aggregation-layers';
const deckLayer = new DeckLayer({
layers: [new HeatmapLayer({ id: 'mc-heatmap', data: mcTrajectories,
getPosition: d => [d.lon, d.lat], getWeight: d => d.weight,
radiusPixels: 30, intensity: 1, threshold: 0.03 })],
});
viewer.scene.primitives.add(deckLayer);
// Remove when switching away from Mode B: viewer.scene.primitives.remove(deckLayer)
CesiumJS client-side memory constraints:
| Constraint | Value | Enforcement |
|---|---|---|
| Max CZML entity count in globe | 500 | Prune lowest-perigee objects beyond 500; useCZML monitors count |
| Orbit path duration | 72h forward / 24h back | Longer paths accumulate geometry |
| Heatmap cell resolution (Mode B) | 0.5° × 0.5° | Higher resolution requires more GPU memory |
| Stale entity pruning | Remove entities not updated in 48h | Prevents ghost entities in long sessions |
| Globe entity count Prometheus metric | `spacecom_globe_entity_count` (gauge) | WARNING alert at 450; prune trigger at 500 |
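A sketch of the pruning selection, assuming "prune lowest-perigee objects beyond 500" means the 500 lowest-perigee (most re-entry-relevant) objects are retained and the remainder pruned; the `GlobeEntity` shape is an illustrative assumption:

```typescript
// Sketch of the 500-entity cap enforcement in useCZML.
interface GlobeEntity { id: string; perigeeKm: number; }

export const MAX_GLOBE_ENTITIES = 500;

export function selectEntitiesToPrune(
  entities: GlobeEntity[],
  max: number = MAX_GLOBE_ENTITIES,
): GlobeEntity[] {
  if (entities.length <= max) return [];
  return [...entities]
    .sort((a, b) => a.perigeeKm - b.perigeeKm) // lowest perigee first: retained
    .slice(max);                               // everything past the cap is pruned
}
```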
Bundle size budget and dynamic imports:
| Bundle | Strategy | Budget (gzipped) |
|---|---|---|
| Login / onboarding / settings | Static; no CesiumJS/Deck.gl | < 200 KB |
| Globe route initial load | CesiumJS lazy-loaded; spinner shown | < 500 KB before CesiumJS |
| Globe fully loaded | CesiumJS + Deck.gl + app | < 8 MB |
// src/components/globe/GlobeCanvas.tsx
import dynamic from 'next/dynamic';
const CesiumViewer = dynamic(
() => import('./CesiumViewerInner'),
{ ssr: false, loading: () => <GlobeLoadingState /> }
);
bundlewatch (or @next/bundle-analyzer) in CI; warning (non-blocking) if initial route bundle exceeds budget. Baseline stored in .bundle-size-baseline.
13.2 Accessible Parallel Table View (F4)
The CesiumJS WebGL globe is inherently inaccessible: no keyboard navigation, no screen reader support, no motor-impairment accommodation. All interactions available via the globe must also be available via a parallel data table view.
Component: src/components/globe/ObjectTableView.tsx
- Accessible via keyboard shortcut `Alt+T` from any operational view, and via a persistent visible "Table view" button in the globe toolbar
- Displays all objects currently rendered on the globe: NORAD ID, name, orbit type, conjunction status badge, predicted re-entry window, alert level
- Sortable by any column (`aria-sort` updated on header click/keypress); filterable by alert level
- Row selection focuses the object's Event Detail panel (same as map click)
- All alert acknowledgement actions reachable from the table view — no functionality requires the globe
- Implemented as `<table>` with `<thead>`, `<tbody>`, `<th scope="col">`, `<th scope="row">` — no ARIA table role substitutes where native HTML suffices
- Pagination or virtual scroll for large object sets; `aria-rowcount` and `aria-rowindex` set correctly for virtualised rows
The table view is the primary interaction surface for users who cannot use the map. It must be functionally complete, not a read-only summary.
13.3 Keyboard Navigation Specification (F6)
All primary operator workflows must be completable by keyboard alone. Required implementation:
Skip links (rendered as the first focusable element in the page, visible on focus):
<a href="#alert-panel" class="skip-link">Skip to alert panel</a>
<a href="#main-content" class="skip-link">Skip to main content</a>
<a href="#object-table" class="skip-link">Skip to object table</a>
Focus ring: Minimum 3px solid outline, ≥ 3:1 contrast against adjacent colours (aligned with the WCAG 2.2 Focus Appearance guidance, which exceeds the 2.1 AA baseline). Never `outline: none` without a custom focus indicator. Defined in design tokens: `--focus-ring: 3px solid #4A9FFF`.
Tab order: Follows DOM order (no tabindex > 0). Logical flow: nav → alert panel → map toolbar → main content. Modal dialogs trap focus within the dialog while open; focus returns to the trigger element on close.
Application keyboard shortcuts (all documented in UI via ? help overlay):
| Shortcut | Action |
|---|---|
| `Alt+A` | Focus most-recent active CRITICAL alert |
| `Alt+T` | Toggle table / globe view |
| `Alt+H` | Open shift handover view |
| `Alt+N` | Open NOTAM draft for active event |
| `?` | Open keyboard shortcut reference overlay |
| `Escape` | Close modal / dismiss non-CRITICAL overlay |
| Arrow keys | Navigate within alert list, table rows, accordion items |
All shortcuts declared via aria-keyshortcuts on their trigger elements. No shortcut conflicts with browser or screen reader reserved keys.
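Centralising the dispatch in one pure mapping keeps the `?` overlay and the `aria-keyshortcuts` attributes consistent with actual behaviour. A sketch, with illustrative action names:

```typescript
// Sketch: single source of truth for keyboard shortcut dispatch.
export type ShortcutAction =
  | 'focus-critical-alert' | 'toggle-table-view' | 'open-handover'
  | 'open-notam-draft' | 'open-help' | null;

export function shortcutFor(key: string, altKey: boolean): ShortcutAction {
  if (altKey) {
    switch (key.toLowerCase()) {
      case 'a': return 'focus-critical-alert';
      case 't': return 'toggle-table-view';
      case 'h': return 'open-handover';
      case 'n': return 'open-notam-draft';
    }
  }
  if (key === '?') return 'open-help'; // unmodified key: help overlay
  return null;                         // unmapped keys fall through to the focused widget
}
```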
13.4 Colour and Contrast Specification (F7)
All colour pairs must meet WCAG 2.1 AA contrast requirements. Documented in frontend/src/tokens/colours.ts as design tokens; no hardcoded colour values in component files.
Operational severity palette (dark theme — background: #1A1A2E):
| Severity | Background | Text | Contrast ratio | Status |
|---|---|---|---|---|
| CRITICAL | `#7B4000` | `#FFFFFF` | 7.2:1 | ✓ AA |
| HIGH | `#7A3B00` | `#FFD580` | 5.1:1 | ✓ AA |
| MEDIUM | `#1A3A5C` | `#90CAF9` | 4.6:1 | ✓ AA |
| LOW | `#1E3A2F` | `#81C784` | 4.5:1 | ✓ AA (minimum) |
| Focus ring | `#1A1A2E` | `#4A9FFF` | 4.8:1 | ✓ AA |
All pairs verified with the APCA algorithm for large display text (corridor labels on the globe). If a colour fails at the target background, the background is adjusted — the text colour is kept consistent for operator recognition.
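The WCAG 2.1 ratios in the table can be regression-tested against the design tokens with a small helper implementing the relative-luminance formula from the WCAG 2.1 definitions; a sketch suitable for a token unit test:

```typescript
// Sketch: WCAG 2.1 contrast ratio from relative luminance.
function srgbChannel(v: number): number {
  const c = v / 255;
  return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
}

function relativeLuminance(hex: string): number {
  const n = parseInt(hex.slice(1), 16);
  const [r, g, b] = [(n >> 16) & 255, (n >> 8) & 255, n & 255].map(srgbChannel);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

export function contrastRatio(a: string, b: string): number {
  const [hi, lo] = [relativeLuminance(a), relativeLuminance(b)].sort((x, y) => y - x);
  return (hi + 0.05) / (lo + 0.05); // WCAG 2.1 contrast ratio, 1:1 to 21:1
}
```

A token test can then assert `contrastRatio(text, background) >= 4.5` for every pair in the palette table.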
Number formatting (F4): Probability values, altitudes, and distances must be formatted correctly across locales:
- Operational interface (Persona A/C): Always use ICAO-standard decimal point (`.`) regardless of browser locale — deviating from locale convention is intentional and matches ICAO Doc 8400 standards; this is documented as an explicit design decision
- Admin / reporting / Space Operator views: Use `Intl.NumberFormat(locale)` for locale-aware formatting (comma decimal separator in DE/FR/ES locales)
- Helper: `formatOperationalNumber(n: number): string` — always `.` decimal, 3 significant figures for probabilities; `formatDisplayNumber(n: number, locale: string): string` — locale-aware
- Never use raw `Number.toString()` or `n.toFixed()` in JSX — both ignore locale
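A minimal sketch of the two helpers named above; the exact significant-figure and rounding behaviour is an assumption to confirm against the operational display spec:

```typescript
// Sketch of the number-formatting helpers.
export function formatOperationalNumber(n: number): string {
  // Number -> string always uses the ICAO '.' decimal separator, independent of locale
  return Number(n.toPrecision(3)).toString();
}

export function formatDisplayNumber(n: number, locale: string): string {
  return new Intl.NumberFormat(locale).format(n); // locale-aware, e.g. comma decimal in de-DE
}
```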
Non-colour severity indicators (F5): Colour must never be the sole differentiator. Each severity level also carries:
| Severity | Icon/shape | Text label | Border width |
|---|---|---|---|
| CRITICAL | ⬟ (pentagon) | "CRITICAL" always visible | 3px solid |
| HIGH | ▲ (triangle) | "HIGH" always visible | 2px solid |
| MEDIUM | ● (circle) | "MEDIUM" always visible | 1px solid |
| LOW | ○ (circle outline) | "LOW" always visible | 1px dashed |
The 1 Hz CRITICAL colour cycle (§28.3 habituation countermeasure) must also include a redundant non-colour animation: 1 Hz border-width pulse (2px → 4px → 2px). Users with prefers-reduced-motion: reduce see a static thick border instead (see §28.3 reduced-motion rules).
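The pulse and its reduced-motion fallback can be expressed directly in CSS; a sketch, with the `.alert-critical` class name being an illustrative assumption:

```css
/* Sketch: 1 Hz border-width pulse, redundant with the colour cycle. */
@keyframes critical-border-pulse {
  0%, 100% { border-width: 2px; }
  50%      { border-width: 4px; }
}
.alert-critical {
  border-style: solid;
  animation: critical-border-pulse 1s infinite; /* 1 Hz */
}
@media (prefers-reduced-motion: reduce) {
  .alert-critical {
    animation: none;
    border-width: 4px; /* static thick border replaces the pulse */
  }
}
```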
13.5 Internationalisation Architecture (F5, F8, F11)
Language scope — Phase 1: English only. No other locale is served. This is not a gap — it is an explicit decision that allows Phase 1 to ship without a localisation workflow. The architecture is designed so that adding a new locale requires only adding a messages/{locale}.json file and testing; no component code changes.
String externalisation strategy:
- Library: `next-intl` (native Next.js App Router support, RSC-compatible, type-safe message keys)
- Source of truth: `messages/en.json` — all user-facing strings, namespaced by feature area
- Message ID convention: `{feature}.{component}.{element}` e.g. `alerts.critical.title`, `handover.accept.button`
- No bare string literals in JSX (enforced by `eslint-plugin-i18n-json` or equivalent)
- ICAO-fixed strings are excluded from i18n scope and must never appear in `messages/en.json` — they are hardcoded constants. Examples: `NOTAM`, `UTC`, `SIGMET`, category codes (`NOTAM_ISSUED`), ICAO phraseology in NOTAM templates. These are annotated `// ICAO-FIXED: do not translate` in source
messages/
en.json # Source of truth — Phase 1 complete
fr.json # Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy)
de.json # Phase 3 scaffold
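A minimal `messages/en.json` fragment illustrating the `{feature}.{component}.{element}` convention (the message values are placeholders, not approved copy):

```json
{
  "alerts": {
    "critical": {
      "title": "Critical re-entry alert",
      "acknowledge": "Acknowledge"
    }
  },
  "handover": {
    "accept": {
      "button": "Accept handover"
    }
  }
}
```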
CSS logical properties (F8): All new components use CSS logical properties instead of directional utilities, making RTL support a configuration change rather than a code rewrite:
| Avoid | Use instead |
|---|---|
| `margin-left`, `ml-*` | `margin-inline-start`, `ms-*` |
| `margin-right`, `mr-*` | `margin-inline-end`, `me-*` |
| `padding-left`, `pl-*` | `padding-inline-start`, `ps-*` |
| `padding-right`, `pr-*` | `padding-inline-end`, `pe-*` |
| `left: 0` | `inset-inline-start: 0` |
| `text-align: left` | `text-align: start` |
The `<html>` element carries `dir="ltr"` (hardcoded for Phase 1). When an RTL locale is added, this becomes `dir={locale.dir}` — no component changes required. RTL testing with an Arabic locale is a Phase 3 gate before any Middle East deployment.
Altitude and distance unit display (F9): Aviation and space domain use different unit conventions. All altitudes and distances are stored and transmitted in metres (SI base unit) in the database and API. The display layer converts based on users.altitude_unit_preference:
| Role default | Unit | Display example |
|---|---|---|
| `ansp_operator` | `ft` | `39,370 ft` (FL394) |
| `space_operator` | `km` | `12.0 km` |
| `analyst` | `km` | `12.0 km` |
Rules:
- Unit label always shown alongside the value — no bare numbers
- `aria-label` provides full unit name: `aria-label="39,370 feet (Flight Level 394)"`
- User can override their default in account settings via `PATCH /api/v1/users/me`
- API always returns metres; unit conversion is client-side only
- FL (Flight Level) shown in parentheses for `ft` display when altitude > 0 ft MSL and context is airspace
Altitude datum labelling (F11 — §62): The SGP4 propagator and NRLMSISE-00 output altitudes above the WGS-84 ellipsoid. Aviation altimetry uses altitude above Mean Sea Level (MSL). The geoid height (difference between ellipsoid and MSL) varies globally from approximately −106 m to +85 m (EGM2008). For operational altitudes (below ~25 km / 82,000 ft during re-entry terminal phase), this difference is significant.
Required labelling rule: All altitude displays must specify the datum. The datum is a non-configurable system constant per altitude context:
| Altitude context | Datum | Display example | Notes |
|---|---|---|---|
| Orbital altitude (> 80 km) | WGS-84 ellipsoid | `185 km (ellipsoidal)` | SGP4 output; geoid difference negligible at orbital altitudes |
| Re-entry corridor boundary | WGS-84 ellipsoid | `80 km (ellipsoidal)` | Model boundary altitude |
| Fragment impact altitude | WGS-84 ellipsoid | `0 km (ellipsoidal)` → display as ground level | Converted at display time |
| Airspace sector boundary (FL) | Standard pressure (1013.25 hPa) | `FL390 / 39,000 ft (STD)` | Aviation standard; flight levels reference the standard pressure setting, NOT ellipsoidal height |
| Terrain clearance / NOTAM lower bound | MSL (approx. ellipsoidal for > 1,000 ft) | `5,000 ft MSL` | Use MSL label explicitly |
Implementation: formatAltitude(metres, context) helper accepts a context parameter ('orbital' | 'airspace' | 'notam') and appends the appropriate datum label. The datum label is rendered in a smaller secondary font weight alongside the altitude value — not in aria-label alone.
API response datum field: The prediction API response must include altitude_datum: "WGS84_ELLIPSOIDAL" alongside any altitude value. Consumers must not assume a datum that is not stated.
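A sketch of the `formatAltitude` helper described above. The metre-to-foot factor is exact; the rounding, FL derivation, and label wording are illustrative assumptions:

```typescript
// Sketch: formatAltitude(metres, context) with per-context datum labels.
export type AltitudeContext = 'orbital' | 'airspace' | 'notam';

const METRES_PER_FOOT = 0.3048; // exact by definition

export function formatAltitude(metres: number, context: AltitudeContext): string {
  switch (context) {
    case 'orbital':    // WGS-84 ellipsoidal altitudes from SGP4, shown in km
      return `${(metres / 1000).toFixed(1)} km (ellipsoidal)`;
    case 'airspace': { // flight level = hundreds of feet; barometric datum
      const ft = metres / METRES_PER_FOOT; // label appended per the context table
      return `FL${Math.round(ft / 100)} / ${Math.round(ft).toLocaleString('en-US')} ft`;
    }
    case 'notam':
      return `${Math.round(metres / METRES_PER_FOOT).toLocaleString('en-US')} ft MSL`;
    default: {
      const exhaustive: never = context; // compile-time exhaustiveness guard
      throw new Error(`unknown altitude context: ${exhaustive}`);
    }
  }
}
```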
Future locale addition checklist (documented in docs/ADDING_A_LOCALE.md):
- Add `messages/{locale}.json` translated by a native-speaker aviation professional
- Verify all ICAO-fixed strings are excluded from translation
- Set `dir` for the locale (ltr/rtl)
- Run automated RTL layout tests if `dir=rtl`
- Confirm operational time display still shows UTC (not locale timezone)
- Legal review of any jurisdiction-specific compliance text
13.6 Contribution Workflow (F3)
CONTRIBUTING.md at the repository root is a required document. It defines how contributors (internal engineers, auditors, future ESA-directed reviewers) engage with the codebase.
Branch naming convention:
| Branch type | Pattern | Example |
|---|---|---|
| Feature | `feature/{ticket-id}-short-description` | `feature/SC-142-decay-unit-pref` |
| Bug fix | `fix/{ticket-id}-short-description` | `fix/SC-200-hmac-null-check` |
| Chore / dependency | `chore/{description}` | `chore/bump-fastapi-0.115` |
| Release | `release/{semver}` | `release/1.2.0` |
| Hotfix | `hotfix/{semver}` | `hotfix/1.1.1` |
No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push.
Commit message format: Conventional Commits — type(scope): description. Types: feat, fix, chore, docs, refactor, test, ci. Example: feat(decay): add p01/p99 tail risk columns.
PR template (.github/pull_request_template.md):
## Summary
<!-- What does this PR do? -->
## Linked ticket
<!-- e.g. SC-142 -->
## Checklist
- [ ] `make test` passes locally
- [ ] OpenAPI spec regenerated (`make generate-openapi`) if API changed
- [ ] CHANGELOG.md updated under `[Unreleased]`
- [ ] axe-core accessibility check passes if UI changed
- [ ] Contract test passes if API response shape changed
- [ ] ADR created if an architectural decision was made
Review SLA: Pull requests must receive a first review within 1 business day of opening. Stale PRs (no activity > 3 business days) are labelled stale automatically.
13.7 Architecture Decision Records (F4)
ADRs (Nygard format) are the lightweight record for code-level and architectural decisions. They live in docs/adr/ and are numbered sequentially.
When to write an ADR: Any decision that is:
- Hard to reverse (e.g., choosing a library, a DB schema approach, an algorithm)
- Likely to confuse a future contributor who finds the code without context
- Required by a public-sector procurement framework (ESA specifically requests evidence of a structured decision process)
- Referenced in a specialist review appendix (§45–§54 all reference ADR numbers)
Format (docs/adr/NNNN-title.md):
# ADR NNNN: Title
**Status:** Proposed | Accepted | Deprecated | Superseded by ADR MMMM
**Date:** YYYY-MM-DD
## Context
What problem are we solving? What constraints apply?
## Decision
What did we decide?
## Consequences
What becomes easier? What becomes harder? What is now out of scope?
Known ADRs referenced in this plan:
| ADR | Topic |
|---|---|
| 0001 | FastAPI over Django REST Framework |
| 0002 | TimescaleDB + PostGIS for orbital time-series |
| 0003 | CesiumJS + Deck.gl for 3D globe rendering |
| 0004 | next-intl for string externalisation |
| 0005 | Append-only alert_events with HMAC signing |
| 0016 | NRLMSISE-00 vs JB2008 atmospheric density model |
All ADR numbers referenced in this document must have a corresponding docs/adr/NNNN-*.md file before Phase 2 ESA submission. New ADRs start at the next available number.
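The "every referenced ADR must exist" rule can be enforced mechanically in CI. A minimal sketch (hypothetical helper; assumes references appear as `ADR NNNN` in prose and ADR files follow the `NNNN-title.md` convention):

```python
import re

ADR_REF_RE = re.compile(r"ADR\s+(\d{4})")

def missing_adrs(plan_text: str, adr_filenames: list[str]) -> set[str]:
    """Return ADR numbers referenced in the plan text that have no matching NNNN-*.md file."""
    referenced = set(ADR_REF_RE.findall(plan_text))
    present = {name.split("-", 1)[0] for name in adr_filenames if name.endswith(".md")}
    return referenced - present
```

A CI step would call this with the plan's text and a listing of `docs/adr/` and fail the build if the returned set is non-empty.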
13.8 Developer Environment Setup (F6)
docs/DEVELOPMENT.md is a required onboarding document. A new engineer must be able to run a fully functional local environment within 30 minutes of reading it. The document covers:
- Prerequisites: Python 3.11 (pinned in `.python-version`), Node.js 20 LTS, Docker Desktop, `make`
- Environment bootstrap:

      cp .env.example .env   # review and fill required values
      make init-dirs         # creates logs/, exports/, config/, backups/ on host
      make dev-up            # docker compose up -d postgres redis minio
      make migrate           # alembic upgrade head
      make seed              # load development fixture data (10 tracked objects, sample TIPs)
      make dev               # starts: uvicorn + Next.js dev server + Celery worker

- Running tests:

      make test              # full test suite (backend + frontend)
      make test-backend      # backend only (pytest)
      make test-frontend     # frontend only (jest + playwright)
      make test-e2e          # Playwright end-to-end (requires make dev running)

- Useful local URLs:
  - API: http://localhost:8000/ — Swagger UI: http://localhost:8000/docs
  - Frontend: http://localhost:3000
  - MinIO console: http://localhost:9001 (credentials in `.env.example`)
- Common issues: documented in a `## Troubleshooting` section covering: Docker port conflicts, TimescaleDB first-run migration failure, CesiumJS ion token missing.
.env.example is committed and kept up-to-date with all required variables (no value — keys only). .env is in .gitignore and must never be committed.
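The keys-only contract between `.env.example` and a local `.env` can be checked by a small helper, e.g. as part of local tooling or onboarding checks. A sketch under that assumption (function names are illustrative):

```python
def env_keys(text: str) -> set[str]:
    """Extract variable names from dotenv-style text, ignoring comments and blank lines."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys

def missing_env_keys(example_text: str, env_text: str) -> set[str]:
    """Keys declared in .env.example that are absent from the local .env."""
    return env_keys(example_text) - env_keys(env_text)
```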
13.9 Docs-as-Code Pipeline (F10)
All project documentation (this plan, runbooks, ADRs, OpenAPI spec, data provenance records) is version-controlled in the repository and validated by CI.
Documentation site: MkDocs Material. Source in docs/. Published to GitHub Pages on merge to main. Configuration in mkdocs.yml.
CI documentation checks (run on every PR):
- `mkdocs build --strict` — fails on broken links, missing pages, invalid nav
- `markdown-link-check docs/` — external link validation (warns, does not fail, to avoid flaky CI on transient outages)
- `openapi-diff` — spec drift check (see §14 F1)
- `vale --config=.vale.ini docs/` — prose style linter (SpaceCom style guide: no passive voice in runbooks, consistent terminology table for `re-entry` vs `reentry`)
ESA submission artefact: The MkDocs build output (static HTML) is archived as a CI artefact on each release tag. This provides a reproducible, point-in-time documentation snapshot for the ESA bid submission. The submission artefact is docs-site-{version}.zip stored in the GitHub release assets.
Docs owner: Each section of the documentation has an owner: frontmatter field. The owner is responsible for keeping the section current after their feature area changes. Missing or stale ownership is flagged by a quarterly docs-review GitHub issue auto-created by a cron workflow.
14. API Design
Base path: /api/v1. All endpoints require authentication (minimum viewer role) unless noted. Role requirements listed per group.
System (no auth required)
- `GET /health` — liveness probe; returns `200 {"status": "ok", "version": "<semver>"}` if the process is running. Used by the Docker/Kubernetes liveness probe and load balancer health check. Does not check downstream dependencies — a healthy response means only that the API process is alive.
- `GET /readyz` — readiness probe; returns `200 {"status": "ready", "checks": {...}}` when all dependencies are reachable. Returns `503` if any required dependency is unhealthy. Checks performed: PostgreSQL (query `SELECT 1`), Redis (`PING`), Celery worker queue depth < 1000. Used by DR automation to confirm the new primary is accepting traffic before updating DNS (§26.3). Also included in the OpenAPI spec under `tags: ["System"]`.
// GET /readyz — healthy response example
{
"status": "ready",
"checks": {
"postgres": "ok",
"redis": "ok",
"celery_queue_depth": 42
},
"version": "1.2.3"
}
// GET /readyz — unhealthy response (503)
{
"status": "not_ready",
"checks": {
"postgres": "ok",
"redis": "error: connection refused",
"celery_queue_depth": 42
}
}
Auth
- `POST /auth/token` — login; returns `httpOnly` cookie (access) + `httpOnly` cookie (refresh); rate-limited 10/min/IP
- `POST /auth/token/refresh` — rotate refresh token; rate-limited
- `POST /auth/mfa/verify` — complete MFA; issues full-access token
- `POST /auth/logout` — revoke refresh token; clear cookies
Catalog (viewer minimum)
- `GET /objects` — list/search (paginated; filter by type, perigee, decay status, data_confidence)
- `GET /objects/{norad_id}` — detail with TLE, physical properties, data confidence annotation
- `POST /objects` — manual entry (`operator` role)
- `GET /objects/{norad_id}/tle-history` — full TLE history including cross-validation status
Propagation (analyst role)
- `POST /propagate` — submit catalog propagation job
- `GET /propagate/{task_id}` — poll status
- `GET /objects/{norad_id}/ephemeris?start=&end=&step=` — time range and step validation (Finding 7):

| Parameter | Constraint | Error code |
|---|---|---|
| `start` | ≥ TLE epoch − 7 days; ≤ now + 90 days | `EPHEMERIS_START_OUT_OF_RANGE` |
| `end` | start < end ≤ start + 30 days | `EPHEMERIS_END_OUT_OF_RANGE` |
| `step` | ≥ 10 seconds and ≤ 86,400 seconds | `EPHEMERIS_STEP_OUT_OF_RANGE` |
| Computed points | (end − start) / step ≤ 100,000 | `EPHEMERIS_TOO_MANY_POINTS` |
Decay Prediction (analyst role)
- `POST /decay/predict` — submit decay job; returns `202 Accepted` (Finding 3). MC concurrency gate: per-organisation Redis semaphore limits to 1 concurrent MC run (Phase 1); 2 for `analyst`+ (Phase 2); `429 + Retry-After` on limit; `admin` bypasses.

  Async job lifecycle (Finding 3):

      POST /decay/predict
      Idempotency-Key: <client-uuid>     ← optional; prevents duplicate on retry
      → 202 Accepted
      {
        "jobId": "uuid",
        "status": "queued",
        "statusUrl": "/jobs/uuid",
        "estimatedDurationSeconds": 45
      }

      GET /jobs/{job_id}
      → 200 OK
      {
        "jobId": "uuid",
        "status": "running" | "complete" | "failed" | "cancelled",
        "resultUrl": "/decay/predictions/12345",   // present when complete
        "error": null | {"code": "...", "message": "..."},
        "createdAt": "...",
        "completedAt": "...",
        "durationSeconds": 42
      }

  WebSocket `PREDICTION_COMPLETE`/`PREDICTION_FAILED` events are the primary completion signal. `GET /jobs/{id}` is the polling fallback (recommended interval: 5 seconds; do not poll faster). All Celery-backed POST endpoints (`/reports`, `/space/reentry/plan`, `/propagate`) follow the same lifecycle pattern.
- `GET /jobs/{job_id}` — poll job status (all job types); `404` if the job does not belong to the requesting user's organisation
- `GET /decay/predictions?norad_id=&status=` — list (cursor-paginated)
Re-entry (viewer role)
- `GET /reentry/predictions` — list with HMAC status; filterable by FIR, time window, confidence, integrity_failed
- `GET /reentry/predictions/{id}` — full detail; HMAC verified before serving; `integrity_failed` records return 503
- `GET /reentry/tip-messages?norad_id=` — TIP messages
Space Weather (viewer role)
- `GET /spaceweather/current` — F10.7, Kp, Ap, Dst + `operational_status` + `uncertainty_multiplier` + cross-validation delta
- `GET /spaceweather/history?start=&end=` — history
- `GET /spaceweather/forecast` — 3-day NOAA SWPC forecast
Conjunctions (viewer role)
- `GET /conjunctions` — active events filterable by Pc threshold
- `GET /conjunctions/{id}` — detail with covariance and probability
- `POST /conjunctions/screen` — submit screening (`analyst` role)
Visualisation (viewer role)
- `GET /czml/objects` — full CZML catalog (J2000 INERTIAL; all strings HTML-escaped); max payload policy: 5 MB. If the estimated payload exceeds 5 MB, the endpoint returns `HTTP 413` with `{"error": "catalog_too_large", "use_delta": true}`.
- `GET /czml/objects?since=<iso8601>` — delta CZML: returns only objects whose position or metadata has changed since the given timestamp. Clients must use this after the initial full load. Response includes an `X-CZML-Full-Required: true` header if the server cannot produce a valid delta (e.g. client timestamp > 30 minutes old) — the client must re-fetch the full catalog. Delta responses are always ≤ 500 KB for the 100-object catalog.
- `GET /czml/hazard/{zone_id}` — HMAC verified before serving
- `GET /czml/event/{event_id}` — full event CZML
- `GET /viz/mc-trajectories/{prediction_id}` — binary MC blob for Mode C
Hazard (viewer role)
- `GET /hazard/zones` — active zones; HMAC status included in response
- `GET /hazard/zones/{id}` — detail; HMAC verified before serving; `integrity_failed` records return 503
Alerts (viewer read; operator acknowledge)
- `GET /alerts` — alert history
- `POST /alerts/{id}/acknowledge` — records user ID + timestamp + note in `alert_events`
- `GET /alerts/unread-count` — unread critical/high count for badge
Reports (analyst role)
- `GET /reports` — list (organisation-scoped via RLS)
- `POST /reports` — initiate generation (async)
- `GET /reports/{id}` — metadata + pre-signed 15-minute download URL
- `GET /reports/{id}/preview` — HTML preview
Org Admin (org_admin role — scoped to own organisation) (F7, F9, F11)
- `GET /org/users` — list users in own org
- `POST /org/users/invite` — invite a new user (sends email; creates user with `viewer` role pending activation)
- `PATCH /org/users/{id}/role` — assign role up to `operator` within own org; cannot assign `org_admin` or `admin`
- `DELETE /org/users/{id}` — deactivate user (revokes sessions and API keys; triggers pseudonymisation for GDPR)
- `GET /org/api-keys` — list all API keys in own org (including service account keys)
- `DELETE /org/api-keys/{id}` — revoke any key in own org
- `GET /org/audit-log` — paginated org-scoped audit log from `security_logs` and `alert_events` filtered by `organisation_id`; supports `?from=&to=&event_type=&user_id=` (F9)
- `GET /org/usage` — usage summary for current and previous billing period (predictions run, quota hits, API calls); sourced from the `usage_events` table
- `PATCH /org/billing` — update `billing_contacts` row (email, PO number, VAT number)
- `POST /org/export` — trigger asynchronous org data export (F11); returns job ID; export includes all predictions, alert events, handover logs, and NOTAM drafts for the org; delivered as signed ZIP within 3 business days; used for GDPR portability and offboarding
Admin (admin role only)
- `GET /admin/ingest-status` — last run time and status per source
- `GET /admin/worker-status` — Celery queue depth and health
- `GET /admin/security-events` — recent security_logs entries
- `POST /admin/users` — create user
- `PATCH /admin/users/{id}/role` — change role (logged as HIGH security event)
- `GET /admin/organisations` — list all organisations with tier, status, usage summary
- `POST /admin/organisations` — provision new organisation (onboarding gate — see §29.8)
- `PATCH /admin/organisations/{id}` — update tier, status, subscription dates
Space Portal (space_operator or orbital_analyst role)
- `GET /space/objects` — list owned objects (`space_operator`: scoped; `orbital_analyst`: full catalog)
- `GET /space/objects/{norad_id}` — full technical detail with state vectors, covariance, TLE history
- `GET /space/objects/{norad_id}/ephemeris` — raw GCRF state vectors; CCSDS OEM format available via `Accept: application/ccsds-oem`
- `POST /space/reentry/plan` — submit controlled re-entry planning job; requires `owned_objects.has_propulsion = TRUE`
- `GET /space/reentry/plan/{task_id}` — poll; returns ranked deorbit windows with risk scores and FIR avoidance status
- `POST /space/conjunction/screen` — submit screening (`orbital_analyst` only)
- `GET /space/export/bulk` — bulk ephemeris/prediction export (JSON, CSV, CCSDS)
NOTAM Drafting (operator role)
- `POST /notam/draft` — generate draft NOTAM from prediction ID; returns ICAO-format draft text + mandatory disclaimer
- `GET /notam/drafts` — list drafts for organisation
- `GET /notam/drafts/{id}` — draft detail
- `POST /notam/drafts/{id}/cancel-draft` — generate cancellation draft for a previous new-NOTAM draft
API Key Management (space_operator or orbital_analyst)
- `POST /api-keys` — create new API key; raw key returned once and never stored
- `GET /api-keys` — list active keys (hashed IDs only, never raw keys)
- `DELETE /api-keys/{id}` — revoke key immediately
- `GET /api-keys/usage` — per-key request counts and last-used timestamp
WebSocket (viewer minimum; cookie auth at upgrade)
- `WS /ws/events` — real-time stream; 5 concurrent connections per user enforced. Per-instance subscriber ceiling: 500 connections. New connections beyond this limit receive `HTTP 503` at the WebSocket upgrade. A `ws_connected_clients` Prometheus gauge tracks the current count per backend instance; an alert fires at 400 (WARNING) to trigger horizontal scaling before the ceiling is reached. At Tier 2 (2 backend instances), the effective ceiling is 1,000 simultaneous WebSocket clients — documented as a known capacity limit in `docs/runbooks/capacity-limits.md`.
WebSocket event payload schema:
All events share an envelope:
{
"type": "<event_type>",
"seq": 1042,
"ts": "2026-03-17T14:23:01.123Z",
"data": { ... }
}
| `type` | Trigger | `data` fields |
|---|---|---|
| `alert.new` | New alert generated | `alert_id`, `level`, `norad_id`, `object_name`, `fir_ids[]` |
| `alert.acknowledged` | Alert acknowledged by any user in org | `alert_id`, `acknowledged_by`, `note_preview` |
| `alert.superseded` | Alert superseded by a new one | `old_alert_id`, `new_alert_id` |
| `prediction.updated` | New re-entry prediction for a tracked object | `prediction_id`, `norad_id`, `p50_utc`, `supersedes_id` |
| `ingest.status` | Ingest job completed or failed | `source`, `status` (ok/failed), `record_count`, `next_run_at` |
| `spaceweather.change` | Operational status band changes | `old_status`, `new_status`, `kp`, `f107` |
| `tip.new` | New TIP message ingested | `norad_id`, `object_name`, `tip_epoch`, `predicted_reentry_utc` |
Reconnection and missed-event recovery: Each event carries a monotonically increasing seq number per organisation. On reconnect, the client sends ?since_seq=<last_seq> in the WebSocket upgrade URL. The server replays up to 200 missed events from an in-memory ring buffer (last 5 minutes). If the client has been disconnected > 5 minutes, it receives a {"type": "resync_required"} event and must re-fetch state via REST.
Per-org sequence number implementation (F5 — §67): The seq counter for each org must be assigned using a PostgreSQL SEQUENCE object, not MAX(seq)+1 in a trigger. MAX(seq)+1 under concurrent inserts for the same org produces duplicate sequence numbers:
-- Migration: create one sequence per org on org creation
-- (or use a single global sequence with per-org prefix — simpler)
CREATE SEQUENCE IF NOT EXISTS alert_seq_global
START 1 INCREMENT 1 NO CYCLE;
-- In the alert_events INSERT trigger or application code:
-- NEW.seq := nextval('alert_seq_global');
-- This is globally unique and monotonically increasing; per-org ordering
-- is derived by filtering on org_id + ordering by seq.
Preferred approach: A single global alert_seq_global sequence assigned at INSERT time. Per-org ordering is maintained because seq is globally monotonic — any two events for the same org will have the correct relative ordering by seq. The WebSocket ring buffer lookup uses WHERE org_id = $1 AND seq > $2 ORDER BY seq which remains correct with a global sequence.
No org-scoped locking is needed: `DEFAULT nextval('alert_seq_global')` on the column is safe under concurrency — PostgreSQL sequences are lock-free and gap-tolerant, so concurrent inserts across orgs share the sequence correctly, and concurrent inserts for the same org still receive strictly increasing values. The pattern to avoid remains `MAX(seq)+1` in a trigger.
Application-level receipt acknowledgement (F2 — §63): delivered_websocket = TRUE in alert_events is set at send-time, not client-receipt time. For safety-critical CRITICAL and HIGH alerts, the client must send an explicit receipt acknowledgement within 10 seconds:
// Client → Server: after rendering a CRITICAL/HIGH alert.new event
{ "type": "alert.received", "alert_id": "<uuid>", "seq": <n> }
Server response:
{ "type": "alert.receipt_confirmed", "alert_id": "<uuid>", "seq": <n+1> }
If no alert.received arrives within 10 seconds of delivery, the server marks alert_events.ws_receipt_confirmed = FALSE and triggers the email fallback for that alert (same logic as offline delivery). This distinguishes "sent to socket" from "rendered on screen."
ALTER TABLE alert_events
ADD COLUMN ws_receipt_confirmed BOOLEAN,
ADD COLUMN ws_receipt_at TIMESTAMPTZ;
-- NULL = not yet sent; TRUE = client confirmed receipt; FALSE = sent but no receipt within 10s
Fan-out architecture across multiple backend instances (F3 — §63): With ≥2 backend instances (Tier 2), a WebSocket connection from org A may be on instance-1 while a new alert fires on instance-2. Without a cross-instance broadcast mechanism, org A's operator misses the alert.
Required: Redis Pub/Sub fan-out:
# backend/app/alerts/fanout.py
import json

import redis.asyncio as aioredis

ALERT_CHANNEL_PREFIX = "spacecom:alert:"

async def publish_alert(redis: aioredis.Redis, org_id: str, event: dict):
    """Publish an alert event to the org's Redis channel; all backend instances receive it and forward to their connected clients."""
    channel = f"{ALERT_CHANNEL_PREFIX}{org_id}"
    await redis.publish(channel, json.dumps(event))

async def subscribe_org_alerts(redis: aioredis.Redis, org_id: str):
    """Each backend instance subscribes to its connected orgs' channels on startup."""
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"{ALERT_CHANNEL_PREFIX}{org_id}")
    return pubsub
Each backend instance maintains a local registry of {org_id: [websocket_connections]}. On receiving a Redis Pub/Sub message, the instance forwards to all local connections for that org. This decouples alert generation (any instance) from delivery (per-instance local connections).
ADR: docs/adr/0020-websocket-fanout-redis-pubsub.md — documents this pattern and the decision against sticky sessions (which would break blue-green deploys).
Dead-connection ANSP fallback notification (F6 — §63): When the ping-pong mechanism detects a dead connection, the current behaviour is to close the socket. There is no notification to the ANSP that their live monitoring connection has silently dropped.
Required behaviour:
- On ping-pong timeout: close socket; record `ws_disconnected_at` in the Redis session key for that connection
- If no reconnect within `WS_DEAD_CONNECTION_GRACE_SECONDS` (default: 120s): send email to the org's ANSP contact (`organisations.primary_contact_email`) with subject: "SpaceCom live connection dropped — please check your browser"
- If an active TIP event exists for the org's FIRs when the disconnection is detected: the grace period is reduced to 30s and the email subject is: "URGENT: SpaceCom connection dropped during active re-entry event"
- On reconnect (before grace period expires): cancel the pending fallback email
# backend/app/alerts/ws_health.py
import redis.asyncio as aioredis

from app.worker import celery_app             # project Celery app (import path illustrative)
from app.alerts.tasks import notify_ws_dead   # fallback-email Celery task (import path illustrative)

WS_DEAD_CONNECTION_GRACE_SECONDS = 120
WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP = 30

async def on_connection_closed(org_id: str, user_id: str, redis: aioredis.Redis):
    active_tip = await redis.get(f"spacecom:active_tip:{org_id}")
    grace = WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP if active_tip else WS_DEAD_CONNECTION_GRACE_SECONDS
    # Schedule the fallback notification via Celery
    notify_ws_dead.apply_async(
        args=[org_id, user_id],
        countdown=grace,
        task_id=f"ws-dead-{org_id}-{user_id}",  # revocable if reconnect arrives
    )

async def on_reconnect(org_id: str, user_id: str):
    # Cancel the pending dead-connection notification
    celery_app.control.revoke(f"ws-dead-{org_id}-{user_id}")
Per-org email alert rate limit (F7 — §65 FinOps):
Email alerts are triggered both by the alert delivery pipeline (when WebSocket delivery is unconfirmed) and by degraded-mode notifications. Without a rate limit, a flapping prediction window or ingest instability can generate hundreds of alert emails per hour to the same ANSP contact, exhausting the SMTP relay quota and creating alert fatigue.
Rate limit policy: Maximum 50 alert emails per org per hour. When the limit is reached, subsequent alerts within the window are queued and delivered as a digest email at the end of the hour.
# backend/app/alerts/email_delivery.py
import json
from datetime import datetime

import redis.asyncio as aioredis
from celery import shared_task

EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50

async def send_alert_email(org_id: str, alert: dict, redis: aioredis.Redis):
    """Send an alert email, subject to the per-org rate limit; overflow goes to the digest queue."""
    hour_bucket = datetime.utcnow().strftime("%Y%m%d%H")
    rate_key = f"spacecom:email_rate:{org_id}:{hour_bucket}"
    count = await redis.incr(rate_key)
    if count == 1:
        await redis.expire(rate_key, 3600)  # expire at end of hour window
    if count <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR:
        # Send immediately
        await _dispatch_email(org_id, alert)
    else:
        # Add to digest queue; a Celery task drains it at the hour boundary
        digest_key = f"spacecom:email_digest:{org_id}:{hour_bucket}"
        await redis.rpush(digest_key, json.dumps(alert))
        await redis.expire(digest_key, 7200)  # safety expire

@shared_task
def send_hourly_digest_emails():
    """Drain digest queues and send consolidated digest emails. Runs at HH:59."""
    # Find all digest keys matching the current hour; send one digest per org
    ...
Contract expiry alerts (F7 — §68):
Without proactive expiry alerts, contracts expire silently. Add a Celery Beat task (tasks/commercial/contract_expiry_alerts.py) that runs daily at 07:00 UTC and checks contracts.valid_until:
from datetime import date, timedelta

from celery import shared_task
from sqlalchemy import text

# db (session) and send_email come from project modules; their imports are omitted here

@shared_task
def check_contract_expiry():
    """Alert commercial team of contracts expiring within 90/30/7 days."""
    thresholds = [
        (90, "90-day renewal notice"),
        (30, "30-day renewal notice — action required"),
        (7, "URGENT: 7-day contract expiry warning"),
    ]
    for days, subject_prefix in thresholds:
        target_date = date.today() + timedelta(days=days)
        expiring = db.execute(text("""
            SELECT c.id, o.name, c.monthly_value_cents, c.currency,
                   c.valid_until, o.primary_contact_email
            FROM contracts c
            JOIN organisations o ON o.id = c.org_id
            WHERE DATE(c.valid_until) = :target_date
              AND c.contract_type NOT IN ('sandbox', 'internal')
              AND c.auto_renew = FALSE
        """), {"target_date": target_date}).fetchall()
        for contract in expiring:
            send_email(
                to="commercial@spacecom.io",
                subject=f"[SpaceCom] {subject_prefix}: {contract.name}",
                body=f"Contract for {contract.name} expires on {contract.valid_until.date()}. "
                     f"Monthly value: {contract.monthly_value_cents/100:.2f} {contract.currency}."
            )
Add to celery-redbeat at crontab(hour=7, minute=0). Also send a courtesy expiry notice to the org admin contact at the 30-day threshold so they can initiate their internal procurement process.
Celery schedule: Add send_hourly_digest_emails to celery-redbeat at crontab(minute=59).
Cost rationale: SMTP relay services (SES, Mailgun) charge per email. At 50/hour cap and 10 orgs, maximum 500 emails/hour = 12,000/day. At $0.10/1,000 (SES) = $1.20/day ≈ $37/month at sustained maximum. Without rate limiting during a flapping event, a single incident could generate thousands of emails in minutes.
Per-client back-pressure and send queue circuit breaker (F7 — §63): A slow client whose network buffers are full will cause await websocket.send_json(event) to block in the FastAPI handler. Without a per-client queue depth check, a single slow client can block the fan-out loop for all clients.
# backend/app/alerts/ws_manager.py
import asyncio

from fastapi import WebSocket

from app.metrics import spacecom_ws_send_queue_overflow_total  # Prometheus counter (import path illustrative)

WS_SEND_QUEUE_MAX = 50  # events; beyond this, the circuit breaker triggers

class ConnectionManager:
    def __init__(self):
        self._connections: dict[str, list[WebSocket]] = {}
        self._send_queues: dict[WebSocket, asyncio.Queue] = {}

    async def broadcast_to_org(self, org_id: str, event: dict):
        for ws in self._connections.get(org_id, []):
            queue = self._send_queues[ws]
            if queue.qsize() >= WS_SEND_QUEUE_MAX:
                # Circuit breaker: drop this connection; the client will reconnect and replay
                spacecom_ws_send_queue_overflow_total.labels(org_id=org_id).inc()
                await ws.close(code=4003, reason="Send queue overflow — reconnect to resume")
            else:
                await queue.put(event)

    async def _send_worker(self, ws: WebSocket):
        """Dedicated coroutine per connection — decouples send from the broadcast loop."""
        queue = self._send_queues[ws]
        while True:
            event = await queue.get()
            try:
                await ws.send_json(event)
            except Exception:
                break  # connection closed; worker exits
Prometheus counter: spacecom_ws_send_queue_overflow_total{org_id} — any non-zero value warrants investigation.
Missed-alert display for offline clients (F8 — §63): When a client reconnects after receiving resync_required, it calls the REST API to re-fetch current state. The notification centre must explicitly surface alerts that arrived during the offline period:
GET /api/v1/alerts?since=<last_seen_ts>&include_offline=true — returns all unacknowledged alerts since last_seen_ts, annotated with "received_while_offline": true. The notification centre renders these with a distinct visual treatment: amber border + "Received while you were offline" label. The client stores last_seen_ts in localStorage (updated on each WebSocket message); this survives page reload but not localStorage clear.
WebSocket connection metadata — per-org operational visibility (F10 — §63):
New Prometheus metrics:
from prometheus_client import Gauge

ws_org_connected = Gauge(
    'spacecom_ws_org_connected',
    'Whether at least one WebSocket connection is active for this org',
    ['org_id', 'org_name']
)
ws_org_connections = Gauge(
    'spacecom_ws_org_connection_count',
    'Number of active WebSocket connections for this org',
    ['org_id']
)
Updated when connections open/close. Alert rule:
- alert: ANSPNoLiveConnectionDuringTIPEvent
  expr: |
    spacecom_active_tip_events > 0
    and on(org_id) spacecom_ws_org_connected == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ANSP {{ $labels.org_name }} has no live WebSocket connection during active TIP event"
    runbook_url: "https://spacecom.internal/docs/runbooks/ansp-connection-lost.md"
On-call dashboard panel 9 (below the fold): "ANSP Connection Status" — table of org names, connection count, last-connected timestamp, TIP-event indicator. Rows with connected = 0 and active TIP highlighted in amber.
Protocol version negotiation (Finding 8): Client connects with ?protocol_version=1. The server's first message is always:
{"type": "CONNECTED", "protocolVersion": 1, "serverVersion": "2.1.3", "seq": 0}
When a breaking event schema change ships, both versions are supported in parallel for 6 months. Clients on a deprecated version receive:
{"type": "PROTOCOL_DEPRECATION_WARNING", "currentVersion": 1, "sunsetDate": "2026-12-01",
"migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"}
After sunset, old-version connections are closed with code 4002 ("Protocol version deprecated"). Protocol version history is maintained in docs/api-guide/websocket-protocol.md.
Token refresh during long-lived sessions (Finding 4): Access tokens expire in 15 minutes. The server sends a TOKEN_EXPIRY_WARNING event 2 minutes before expiry:
{"type": "TOKEN_EXPIRY_WARNING", "expiresInSeconds": 120, "seq": N}
The client calls POST /auth/token/refresh (standard REST — does not interrupt the WebSocket), then sends on the existing connection:
{"type": "AUTH_REFRESH", "token": "<new_access_token>"}
Server responds: {"type": "AUTH_REFRESHED", "seq": N}. If the client does not refresh before expiry, the server closes with code 4001 ("Token expired — reconnect with a new token"). Clients distinguish 4001 (auth expiry, refresh and reconnect) from 4002 (protocol deprecated, upgrade required) from network errors (reconnect with backoff).
Mode awareness: In SIMULATION or REPLAY mode, the client's WebSocket connection remains open but alert.new and tip.new events are suppressed for the duration of the mode session. Simulation-generated events are delivered on a separate WS /ws/simulation/{session_id} channel.
Alert Webhooks (admin role — registration; delivery to registered HTTPS endpoints)
For ANSPs with programmatic dispatch systems that cannot consume a browser WebSocket.
- `POST /webhooks` — register a webhook endpoint; `{"url": "https://ansp.example.com/hook", "events": ["alert.new", "tip.new"], "secret": "<shared_secret>"}`
- `GET /webhooks` — list registered webhooks for the organisation
- `DELETE /webhooks/{id}` — deregister
- `POST /webhooks/{id}/test` — send a synthetic `alert.new` event to verify delivery
Delivery semantics: At-least-once. SpaceCom POSTs the event envelope to the registered URL. Signature: X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, body)> header on every delivery. Retry policy: 3 retries with exponential backoff (1s, 5s, 30s). After 3 failures, the webhook is marked degraded and the org admin is notified by email. After 10 consecutive failures, the webhook is auto-disabled.
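Signing and verifying the `X-SpaceCom-Signature` header can be sketched as follows (receiver side uses `hmac.compare_digest` to avoid timing leaks; function names are illustrative):

```python
import hashlib
import hmac

def sign_webhook(secret: str, body: bytes) -> str:
    """Compute the X-SpaceCom-Signature header value for a delivery body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify_webhook(secret: str, body: bytes, header_value: str) -> bool:
    """Receiver-side check: constant-time comparison of expected vs received signature."""
    return hmac.compare_digest(sign_webhook(secret, body), header_value)
```

An ANSP receiver would reject any POST whose header fails `verify_webhook` before processing the event.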
alert_webhooks table:
CREATE TABLE alert_webhooks (
id SERIAL PRIMARY KEY,
organisation_id INTEGER NOT NULL REFERENCES organisations(id),
url TEXT NOT NULL,
secret_hash TEXT NOT NULL, -- bcrypt hash of the shared secret; never stored in plaintext
event_types TEXT[] NOT NULL,
status TEXT NOT NULL DEFAULT 'active', -- active | degraded | disabled
failure_count INTEGER DEFAULT 0,
last_delivery_at TIMESTAMPTZ,
last_failure_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Structured Event Export (viewer minimum)
First step toward SWIM / machine-readable ANSP system integration (Phase 3 target).
- `GET /events/{id}/export?format=geojson` — returns the event's re-entry corridor and impact zone as a GeoJSON `FeatureCollection` with ICAO FIR IDs and prediction metadata in `properties`
- `GET /events/{id}/export?format=czml` — CZML event package (same as `GET /czml/event/{event_id}`)
- `GET /events/{id}/export?format=ccsds-oem` — raw OEM for the object's trajectory at time of prediction
The GeoJSON export is the preferred integration surface for ANSP systems that are not SWIM-capable. The properties object includes: norad_id, object_name, p05_utc, p50_utc, p95_utc, affected_fir_ids[], risk_level, prediction_id, prediction_hmac (for downstream integrity verification), generated_at.
API Conventions (Finding 9)
Field naming: All API request and response bodies use camelCase. Database column names and Python internal models use snake_case. The conversion is handled automatically by a shared base model:
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel
class APIModel(BaseModel):
"""Base class for all API response/request models. Serialises to camelCase JSON."""
model_config = ConfigDict(
alias_generator=to_camel,
populate_by_name=True, # allows snake_case in tests and internal code
)
class PredictionResponse(APIModel):
prediction_id: int # → "predictionId" in JSON
p50_reentry_time: datetime # → "p50ReentryTime"
ood_flag: bool # → "oodFlag"
All Pydantic response models inherit from APIModel. All request bodies also inherit from APIModel (with populate_by_name=True, clients may send either case). Document in docs/api-guide/conventions.md.
Error Response Schema (Finding 2)
All error responses use the SpaceComError envelope — including FastAPI's default Pydantic validation errors (which are overridden):
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from pydantic import BaseModel

class SpaceComError(BaseModel):
    error: str                  # machine-readable code from the error registry
    message: str                # human-readable; safe to display in UI
    detail: dict | None = None
    requestId: str              # from X-Request-ID header; enables log correlation

@app.exception_handler(RequestValidationError)  # app: the FastAPI instance
async def validation_error_handler(request: Request, exc: RequestValidationError):
    return JSONResponse(status_code=422, content=SpaceComError(
        error="VALIDATION_ERROR",
        message="Request validation failed",
        detail={"fields": exc.errors()},
        requestId=request.headers.get("X-Request-ID", ""),
    ).model_dump(by_alias=True))
Canonical error code registry — all codes, HTTP status, and recovery actions documented in docs/api-guide/error-reference.md. CI check: any HTTPException raised in application code must use a code from the registry. Sample entries:
| Code | HTTP status | Meaning | Recovery |
|---|---|---|---|
| `VALIDATION_ERROR` | 422 | Request body or query param invalid | Fix the indicated fields |
| `INVALID_CURSOR` | 400 | Pagination cursor malformed or expired | Restart from page 1 |
| `RATE_LIMITED` | 429 | Rate limit exceeded | Wait `retryAfterSeconds` |
| `EPHEMERIS_TOO_MANY_POINTS` | 400 | Computed points exceed 100,000 | Reduce range or increase step |
| `IDEMPOTENCY_IN_PROGRESS` | 409 | Duplicate request still processing | Wait and retry `statusUrl` |
| `HMAC_VERIFICATION_FAILED` | 503 | Prediction integrity check failed | Contact administrator |
| `API_KEY_INVALID` | 401 | API key revoked, expired, or invalid | Re-issue key |
| `PREDICTION_CONFLICT` | 200 (not error) | Multi-source window disagreement | See `conflictSources` field |
Rate Limit Error Response (Finding 6)
429 Too Many Requests responses include Retry-After (RFC 7231 §7.1.3) and a structured body:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742134847

{
  "error": "RATE_LIMITED",
  "message": "Rate limit exceeded for POST /decay/predict: 10 requests per hour",
  "retryAfterSeconds": 47,
  "limit": 10,
  "window": "1h",
  "requestId": "..."
}
```
retryAfterSeconds = X-RateLimit-Reset − now(). Clients implementing backoff must honour Retry-After and must not retry before it elapses.
Idempotency Keys (Finding 5)
Mutation endpoints that have real-world consequences support idempotency keys:
POST /decay/predict
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Server behaviour:
- First receipt: process normally; store `(key, user_id, endpoint, response_body)` in the `idempotency_keys` table with 24-hour TTL
- Duplicate within 24 h: return the stored response with `HTTP 200` + header `Idempotency-Replay: true`; do not re-execute
- Still processing: return `409 Conflict` → `{"error": "IDEMPOTENCY_IN_PROGRESS", "statusUrl": "/jobs/uuid"}`
- After 24 h: key expired; treat as a new request
Applies to: POST /decay/predict, POST /reports, POST /notam/draft, POST /alerts/{id}/acknowledge, POST /admin/users. Documented in docs/api-guide/idempotency.md.
API Key Authentication Model (Finding 11)
API key requests use key-only auth — no JWT required:
Authorization: Bearer apikey_<base64url_encoded_key>
The prefix apikey_ distinguishes API keys from JWT Bearer tokens at the middleware layer. The raw key is hashed with SHA-256 before storage; the raw key is shown exactly once at creation.
Rules:
- API key rate limits are independent from JWT session rate limits — separate Redis buckets per key
- Webhook deliveries are not counted against any rate limit bucket (server-initiated, not client-initiated)
- `allowed_endpoints` scope: `null` = all endpoints for the key's role; a non-null array restricts to the listed paths. `403` is returned for requests to unlisted endpoints with `{"error": "ENDPOINT_NOT_IN_KEY_SCOPE"}`
- Revoked/expired/invalid key: always `401` → `{"error": "API_KEY_INVALID", "message": "API key is revoked or expired"}` — indistinguishable from never-valid (prevents enumeration)
Document in docs/api-guide/api-keys.md.
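The prefix dispatch and storage hashing can be sketched as follows — `classify_bearer` and `stored_key_digest` are illustrative helper names, not the actual middleware functions:

```python
import hashlib

def classify_bearer(authorization: str) -> str:
    """Middleware-layer dispatch: the apikey_ prefix selects key-only auth;
    anything else is treated as a JWT bearer token."""
    token = authorization.removeprefix("Bearer ").strip()
    return "api_key" if token.startswith("apikey_") else "jwt"

def stored_key_digest(raw_key: str) -> str:
    """The raw key is hashed with SHA-256 before storage; only this digest
    is persisted (the raw key is shown exactly once at creation)."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
```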
System Endpoints (Finding 10)
GET /readyz is included in the OpenAPI spec as a documented endpoint (tagged System), so integrators and SWIM consumers can discover and monitor it:
```python
@app.get(
    "/readyz",
    tags=["System"],
    summary="Readiness and degraded-state check",
    response_model=ReadinessResponse,
    responses={
        200: {"description": "System operational"},
        207: {"description": "System degraded — one or more data sources stale"},
        503: {"description": "System unavailable — database or Redis unreachable"},
    },
)
```
GET /healthz (liveness probe) remains undocumented in OpenAPI — infrastructure-only. /readyz is the recommended integration health check endpoint for ANSP monitoring systems and the Phase 3 SWIM integration.
Clock skew detection and server time endpoint (F6 — §67):
CZML availability timestamps and prediction windows are generated using server UTC. If the server clock drifts (NTP sync failure after container restart, hypervisor clock skew, or VM migration), CZML ground track windows will be offset from real time. A client whose clock differs from the server clock by > 5 seconds will show predictions in the wrong temporal position.
Infrastructure requirement: All SpaceCom hosts must run chronyd or systemd-timesyncd with NTP synchronisation to a reliable source. Add to the deployment runbook (docs/runbooks/host-setup.md):
```sh
# Ubuntu/Debian
timedatectl set-ntp true
timedatectl status   # confirm NTPSynchronized: yes
```
Add Grafana alert: node_timex_sync_status != 1 → WARNING: "NTP sync lost on ".
Client-side clock skew display: Add GET /api/v1/time endpoint (unauthenticated, rate-limited to 1 req/s per IP):
```python
import time
from datetime import datetime, timezone

@router.get("/api/v1/time")
async def server_time():
    # datetime.utcnow() is deprecated and returns a naive datetime; use an aware UTC time
    now = datetime.now(timezone.utc)
    return {"utc": now.isoformat().replace("+00:00", "Z"), "unix": time.time()}
```
The frontend calls this on page load and computes skew_seconds = server_unix - Date.now()/1000. If abs(skew_seconds) > 5: display a persistent WARNING banner: "Your browser clock differs from the server by {N}s — prediction windows may appear offset. Please synchronise your system clock."
Pagination Standard
All list endpoints use cursor-based pagination (not offset-based). Offset pagination degrades as OFFSET N forces the DB to scan and discard N rows; at 7-year retention depth this becomes a full table scan.
Canonical response envelope — applied to every list endpoint (Finding 1):
```json
{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xNlQxNDozMDowMFoiLCJpZCI6NDQ4Nzh9",
    "has_more": true,
    "limit": 50,
    "total_count": null
  }
}
```
Rules:
- `data` (not `items`) is the canonical array key across all list endpoints
- `next_cursor` is `base64url(json({"created_at": "<iso8601>", "id": <int>}))` — opaque to clients, decoded server-side
- `total_count` is always `null` — count queries on large tables force full scans; document this explicitly in `docs/api-guide/pagination.md`
- `limit` defaults to 50; maximum 200; specified per endpoint group in OpenAPI `description`
- Empty result: `{"data": [], "pagination": {"next_cursor": null, "has_more": false, "limit": 50, "total_count": null}}` — never `404`
- Invalid/expired cursor: `400 Bad Request` → `{"error": "INVALID_CURSOR", "message": "Cursor is malformed or refers to a deleted record", "request_id": "..."}`
Standard query parameters:
- `limit` — page size (default: 50, maximum: 200)
- `cursor` — opaque cursor token from a previous response (absent = first page)
Cursor decodes server-side to WHERE (created_at, id) < (cursor_ts, cursor_id) ORDER BY created_at DESC, id DESC. Tokens are valid for 24 hours.
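The cursor encoding can be sketched with stdlib `base64` and `json` — illustrative helper names (`encode_cursor` / `decode_cursor`), and note the server's exact JSON separators may differ:

```python
import base64
import json

def encode_cursor(created_at: str, id_: int) -> str:
    """base64url(json({"created_at": ..., "id": ...})) — opaque to clients."""
    raw = json.dumps({"created_at": created_at, "id": id_}, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Server-side decode back to the (created_at, id) keyset pair."""
    obj = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return obj["created_at"], obj["id"]
```

Round-tripping the example cursor from the envelope above yields `("2026-03-16T14:30:00Z", 44878)`.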
Implementation:
```python
from typing import Generic, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class PaginationMeta(BaseModel):
    next_cursor: str | None
    has_more: bool
    limit: int
    total_count: None = None  # always None; never compute count

class PaginatedResponse(BaseModel, Generic[T]):
    data: list[T]
    pagination: PaginationMeta

def paginate_query(q, cursor: str | None, limit: int) -> PaginatedResponse:
    """Shared utility used by all list endpoints — enforces envelope consistency."""
    ...
```
Enforcement: An OpenAPI CI check confirms every endpoint tagged list has limit and cursor query parameters and returns the PaginatedResponse schema. Violations fail CI.
Affected endpoints (all paginated): /objects, /decay/predictions, /reentry/predictions, /alerts, /conjunctions, /reports, /notam/drafts, /space/objects, /api-keys/usage, /admin/security-events.
API Latency Budget — CZML Catalog Endpoint
The CZML catalog endpoint (GET /czml/objects) is the most latency-sensitive read path and the primary SLO driver (p95 < 2s). Latency budget allocation:
| Component | Budget | Notes |
|---|---|---|
| DNS + TLS handshake (new connection) | 50 ms | Not applicable on keep-alive; amortised to ~0 for repeat requests |
| Caddy proxy overhead | 5 ms | Header processing only |
| FastAPI routing + middleware (auth, RBAC, rate limit) | 30 ms | Each middleware ~5–10 ms; keep middleware count ≤ 5 on this path |
| PgBouncer connection acquisition | 10 ms | Pool saturation adds latency; monitor pgbouncer_pool_waiting metric |
| DB query execution (PostGIS geometry) | 800 ms | Includes GiST index scan + geometry serialisation |
| CZML serialisation (Pydantic → JSON) | 200 ms | Validated by benchmark; exceeding this indicates schema complexity regression |
| HTTP response transmission (5 MB @ 1 Gbps internal) | 40 ms | Internal network; negligible |
| Total budget (new connection) | ~1,135 ms | ~865 ms headroom to 2s p95 SLO |
Any new middleware added to the CZML endpoint path must be profiled and must not exceed its allocated budget. Exceeding the DB or serialisation budget requires a performance investigation before merge.
API Versioning Policy
Base path: /api/v1. All versioned endpoints follow Semantic Versioning applied to the API contract:
- Non-breaking changes (additive: new optional fields, new endpoints, new query params): deployed without version bump; announced in `CHANGELOG.md`
- Breaking changes (removed fields, changed types, changed auth requirements, removed endpoints): require a new major version (`/api/v2`); old version supported in parallel for a minimum of 6 months before sunset
- Deprecation signalling: deprecated endpoints return `Deprecation: true` and `Sunset: <date>` response headers (RFC 8594)
- Version negotiation: clients may send `Accept: application/vnd.spacecom.v1+json` to pin to a specific version; default is always the latest stable version
- Breaking change notice: minimum 3 months written notice (email to registered API key holders + `CHANGELOG.md` entry) before any breaking change is deployed
Changelog discipline (F5): CHANGELOG.md follows the Keep a Changelog format with Conventional Commits as the commit-level input. Every PR must add an entry under [Unreleased] if it has a user-visible effect. On release, [Unreleased] is renamed to [{semver}] - {date}.
## [Unreleased]
### Added
- `p01_reentry_time` and `p99_reentry_time` fields on decay prediction response (SC-188)
### Changed
- `altitude_unit_preference` default for ANSP operators changed from `m` to `ft` (SC-201)
### Fixed
- HMAC integrity check now correctly handles NULL `action_taken` field (SC-195)
### Deprecated
- `GET /objects/{id}/trajectory` — use `GET /objects/{id}/ephemeris` (sunset 2027-06-01)
- `make changelog-check` (CI step) fails if the `[Unreleased]` section is empty and the diff contains non-chore/docs commits
- Release changelogs are the source for API key holder email notifications and GitHub release notes
OpenAPI spec as source of truth (F1): FastAPI generates the OpenAPI 3.1 spec automatically from route decorators, Pydantic schemas, and docstrings. The spec is the authoritative contract — not a separately maintained document. CI enforces this:
- `GET /api/v1/openapi.json` is served by the running API; CI downloads it and diffs against the committed `openapi.yaml`
- Any uncommitted drift fails the build with `openapi-diff --fail-on-incompatible`
- The committed `openapi.yaml` is regenerated by running `make generate-openapi` (calls `python -m app.generate_spec`) — this is a required step in the PR checklist for any API change
- The spec is the input to all downstream tooling: Swagger UI (`/docs`), Redoc (`/redoc`), contract tests, and the client SDK generator
API date/time contract (F10): All date/time fields in API responses must use ISO 8601 with UTC offset — never Unix timestamps, never local time strings:
- Format: `"2026-03-22T14:00:00Z"` (UTC, `Z` suffix)
- OpenAPI annotation: `format: date-time` on every `_at`-suffixed and `_time`-suffixed field
- Contract test (BLOCKING): every field matching `/_at$|_time$/` in every response schema asserts it matches `^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$`
- Pydantic models use `datetime` with `model_config = {"json_encoders": {datetime: lambda v: v.isoformat().replace("+00:00", "Z")}}`
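The contract test's core check can be sketched with stdlib `re` — `check_datetime_fields` is an illustrative helper, not the actual test fixture; the real test walks the response schemas, not a flat dict:

```python
import re

ISO_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")
TIME_FIELD = re.compile(r"_at$|_time$")

def check_datetime_fields(payload: dict) -> list[str]:
    """Return the names of *_at / *_time fields whose values violate the contract."""
    bad = []
    for field, value in payload.items():
        if TIME_FIELD.search(field) and not (
            isinstance(value, str) and ISO_UTC.match(value)
        ):
            bad.append(field)
    return bad
```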
Frontend ↔ API contract testing (F4): The TypeScript types used by the Next.js frontend must be validated against the OpenAPI spec on every CI run — preventing the common drift where the Pydantic response model changes but the frontend interface is not updated until a runtime error surfaces.
Implementation: openapi-typescript generates TypeScript types from openapi.yaml into frontend/src/types/api.generated.ts. The frontend imports only from this generated file — no hand-written API response interfaces. A CI check (make check-api-types) regenerates the types and fails if the git diff is non-empty:
```sh
# CI step: check-api-types
openapi-typescript openapi.yaml -o frontend/src/types/api.generated.ts
git diff --exit-code frontend/src/types/api.generated.ts \
  || (echo "API types out of sync — run: make generate-api-types" && exit 1)
```
This is a one-way contract: the spec is authoritative; the TypeScript types are derived. Any API change that affects the frontend must regenerate types before the PR can merge. This replaces the need for a separate consumer-driven contract test framework (Pact) at Phase 1 scale.
OpenAPI response examples (F7): Every endpoint schema in the OpenAPI spec must include at least one examples: block demonstrating a realistic success response. This is enforced by a CI lint step (spectral lint openapi.yaml --ruleset .spectral.yaml) with a custom rule require-response-example. Missing examples fail the build. The examples serve three purposes: Swagger UI and Redoc interactive documentation, contract test fixture baseline, and ESA auditor review readability.
```yaml
# Example: openapi.yaml fragment for GET /objects/{norad_id}
responses:
  '200':
    content:
      application/json:
        schema:
          $ref: '#/components/schemas/ObjectDetail'
        examples:
          debris_object:
            summary: Tracked debris fragment in decay
            value:
              norad_id: 48274
              name: "CZ-3B DEB"
              object_type: "DEBRIS"
              perigee_km: 187.4
              apogee_km: 312.1
              data_confidence: "nominal"
              propagation_quality: "degraded"
              propagation_warning: "tle_age_7_14_days"
```
Client SDK strategy (F8): Phase 1 — no dedicated SDK. ANSP integrators are provided:
- The committed `openapi.yaml` for import into Postman, Insomnia, or any OpenAPI-compatible tooling
- A `docs/integration/` directory with language-specific quickstart guides (Python, JavaScript/TypeScript) showing auth, object fetch, and WebSocket subscription patterns
- Python integration examples using `httpx` (async) and `requests` (sync) — not a packaged SDK
Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate one using openapi-generator-cli targeting Python and TypeScript. Generated clients are published under the @spacecom/ npm scope and spacecom-client PyPI package. The generator configuration is committed to tools/sdk-generator/ so regeneration is reproducible from the spec.
15. Propagation Architecture — Technical Detail
15.1 Catalog Propagator (SGP4)
```python
from datetime import datetime

from sgp4.api import Satrec, jday
from app.frame_utils import teme_to_gcrf, gcrf_to_itrf, itrf_to_geodetic

def propagate_catalog(tle_line1: str, tle_line2: str, times_utc: list[datetime]) -> list[OrbitalState]:
    sat = Satrec.twoline2rv(tle_line1, tle_line2)
    results = []
    for t in times_utc:
        jd, fr = jday(t.year, t.month, t.day, t.hour, t.minute, t.second + t.microsecond / 1e6)
        e, r_teme, v_teme = sat.sgp4(jd, fr)
        if e != 0:
            raise PropagationError(f"SGP4 error code {e}")
        r_gcrf, v_gcrf = teme_to_gcrf(r_teme, v_teme, t)
        lat, lon, alt = itrf_to_geodetic(gcrf_to_itrf(r_gcrf, t))
        results.append(OrbitalState(
            time=t, reference_frame='GCRF',
            pos_x_km=r_gcrf[0], pos_y_km=r_gcrf[1], pos_z_km=r_gcrf[2],
            vel_x_kms=v_gcrf[0], vel_y_kms=v_gcrf[1], vel_z_kms=v_gcrf[2],
            lat_deg=lat, lon_deg=lon, alt_km=alt, propagator='sgp4',
        ))
    return results
```
Scope limitation: SGP4 is accurate to ~1 km only for perigee > 300 km and epoch age < 7 days. Do not use it for decay prediction.
SGP4 validity gates — enforced at query time (Finding 1):
| Condition | Action | UI signal |
|---|---|---|
| `tle_epoch_age ≤ 7 days` | Normal propagation | `propagation_quality: 'nominal'` |
| `7 days < tle_epoch_age ≤ 14 days` | Propagate with warning | `propagation_quality: 'degraded'`; amber DataConfidenceBadge; API includes `propagation_warning: 'tle_age_7_14_days'` |
| `tle_epoch_age > 14 days` | Return estimate with explicit caveat | `propagation_quality: 'unreliable'`; object position not rendered on globe without user acknowledgement; API returns `propagation_warning: 'tle_age_exceeds_14_days'` |
| `perigee_altitude < 200 km` | Do not use SGP4 | Route all propagation requests to the numerical decay predictor; SGP4 is invalid in this density regime |
The epoch age check runs at the start of propagate_catalog(). The perigee altitude gate is enforced during TLE ingest — objects crossing below 200 km perigee are automatically flagged for decay prediction and removed from SGP4 catalog propagation tasks.
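The gate logic in the table above can be sketched as a single classifier — an illustrative helper (`propagation_quality` here is a standalone function; the real checks run inside `propagate_catalog()` and at TLE ingest):

```python
def propagation_quality(tle_epoch_age_days: float, perigee_altitude_km: float) -> str:
    """Classify an SGP4 propagation request against the validity gates."""
    if perigee_altitude_km < 200:
        # SGP4 invalid in this density regime — route to numerical decay predictor
        return "numerical_decay_predictor"
    if tle_epoch_age_days <= 7:
        return "nominal"
    if tle_epoch_age_days <= 14:
        return "degraded"      # amber badge + propagation_warning
    return "unreliable"        # requires user acknowledgement before rendering
```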
Sub-150 km propagation confidence guard (F2): For the numerical decay predictor, objects with current perigee < 150 km are in a regime where atmospheric density model uncertainty dominates and SGP4/numerical model errors grow rapidly. Predictions in this regime are flagged:
```python
if perigee_km < 150:
    prediction.propagation_confidence = 'LOW_CONFIDENCE_PROPAGATION'
    prediction.propagation_confidence_reason = (
        f'Perigee {perigee_km:.0f} km below 150 km; '
        'atmospheric density uncertainty dominant; re-entry imminent'
    )
```
LOW_CONFIDENCE_PROPAGATION is surfaced in the UI as a red badge: "⚠ Re-entry imminent — prediction confidence low; consult Space-Track TIP directly." Unit test (BLOCKING): construct a TLE with perigee = 120 km; call the decay predictor; assert propagation_confidence == 'LOW_CONFIDENCE_PROPAGATION'.
15.2 Decay Predictor (Numerical)
Physics: J2–J6 geopotential, NRLMSISE-00 drag, solar radiation pressure (cannonball model), WGS84 oblate Earth.
NRLMSISE-00 Input Vector (Finding 2)
NRLMSISE-00 requires a fully specified input vector. Using a single F10.7 value for both the 81-day average and the prior-day slot, or using Kp instead of Ap, introduces systematic density errors that are worst during geomagnetic storms — exactly when prediction uncertainty matters most.
```python
# Required NRLMSISE-00 inputs — both stored in space_weather table
nrlmsise_input = NRLMSISEInput(
    f107A=f107_81day_avg,     # 81-day centred average F10.7 (NOT current)
    f107=f107_prior_day,      # prior-day F10.7 value (NOT current day)
    ap=ap_daily,              # daily Ap index (linear) — NOT Kp (logarithmic)
    ap_a=ap_3h_history_57h,   # 19-element array of 3-hourly Ap for prior 57 h
                              # enables full NRLMSISE accuracy (flags.switches[9] = 1)
)
```
The space_weather table already stores f107_81day_avg and ap_daily. Add f107_prior_day DOUBLE PRECISION and ap_3h_history DOUBLE PRECISION[19] columns (the 3-hourly Ap history array for the 57 hours preceding each observation). The ingest worker populates both from the NOAA SWPC Space Weather JSON endpoint.
Atmospheric density model selection rationale (F3): NRLMSISE-00 is used for Phase 1. JB2008 (Bowman et al. 2008) is the current USSF operational standard and is demonstrably more accurate during high solar activity periods (F10.7 > 150) and geomagnetic storms (Kp > 5). NRLMSISE-00 is chosen for Phase 1 because:
- Python bindings are mature (`nrlmsise00` PyPI package); JB2008 has no equivalent mature Python binding
- For the typical F10.7 range (70–150 sfu) at solar minimum/moderate activity, the accuracy difference is < 10%
- Phase 2 milestone: evaluate JB2008 against NRLMSISE-00 on historical re-entry backcasts; if MAE improvement > 15%, migrate; decision documented in `docs/adr/0016-atmospheric-density-model.md`
NRLMSISE-00 input validity bounds (F3): Inputs outside these ranges produce unphysical density estimates; the prediction is rejected rather than silently accepted:
```python
NRLMSISE_INPUT_BOUNDS = {
    "f107": (65.0, 300.0),         # physical solar flux range; < 65 indicates data gap
    "f107A": (65.0, 300.0),
    "ap": (0.0, 400.0),            # Ap index physical range
    "altitude_km": (85.0, 1000.0), # validated density range
}
```
If any bound is violated, raise AtmosphericModelInputError with field and value — never silently clamp.
Altitude scope: NRLMSISE-00 is used from 150 km to 800 km. Above 800 km, the model is applied but the prediction carries ood_flag = TRUE with ood_reason = 'above_nrlmsise_validated_range_800km' (Finding 11).
Geomagnetic storm sensitivity (Finding 11): During the MC sampling, when the current 3-hour Kp index exceeds 5, sample F10.7 and Ap from storm-period values (current observed, not 81-day average). The prediction is annotated:
- `space_weather_warning: 'geomagnetic_storm'` field on the `reentry_predictions` record
- UI amber callout: "Active geomagnetic storm — thermospheric density is elevated; re-entry timing uncertainty is significantly increased"
- The storm flag persists for the lifetime of the prediction; it is not cleared when the storm ends (the prediction was made during disturbed conditions)
Ballistic Coefficient Uncertainty Model (Finding 3)
The ballistic coefficient β = m / (C_D × A) is the dominant uncertainty in drag-driven decay. Its three components are sampled independently in the Monte Carlo:
| Parameter | Distribution | Rationale |
|---|---|---|
| `C_D` | `Uniform(2.0, 2.4)` | Standard assumption for non-cooperative objects in free molecular flow; no direct measurement available |
| `A` (stable attitude, `attitude_known = TRUE`) | `Normal(A_discos, 0.05 × A_discos)` | 5% shape uncertainty for known-attitude objects |
| `A` (tumbling, `attitude_known = FALSE`) | `Normal(A_discos_mean, 0.25 × A_discos_mean)` | 25% uncertainty; tumbling objects present a time-varying cross-section |
| `m` | `Normal(m_discos, 0.10 × m_discos)` | 10% mass uncertainty; DISCOS masses are not independently verified |
OOD rules:
- `attitude_known = FALSE AND mass_kg IS NULL` → `ood_flag = TRUE`, `ood_reason = 'tumbling_no_mass'` — outside validated regime
- `cd_a_over_m IS NULL AND mass_kg IS NULL AND cross_section_m2 IS NULL` → `ood_flag = TRUE`, `ood_reason = 'no_physical_properties'`
Objects with known physical properties can have operator-provided overrides stored in objects.cd_override DOUBLE PRECISION and objects.bstar_override DOUBLE PRECISION. When overrides are present, the MC samples around the override value rather than the DISCOS-derived value.
Solar Radiation Pressure (Finding 7)
SRP is included using the cannonball model:
a_srp = −P_sr × C_r × (A/m) × r̂_sun
where P_sr = 4.56 × 10⁻⁶ N/m² at 1 AU (scaled by (1 AU / r_sun)²), C_r is the radiation pressure coefficient stored in objects.cr_coefficient DOUBLE PRECISION DEFAULT 1.3.
SRP is significant (> 5% of drag contribution) for objects with area-to-mass ratio > 0.01 m²/kg at altitudes > 500 km. OOD flag: area_to_mass > 0.01 AND perigee > 500 km AND cr_coefficient IS NULL → ood_reason = 'srp_significant_cr_unknown'.
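A worked magnitude check for the cannonball formula above — an illustrative helper (`srp_accel_magnitude` is an assumed name; the flight code applies the acceleration as a vector along −r̂_sun):

```python
AU_KM = 149_597_870.7   # one astronomical unit, km
P_SR_1AU = 4.56e-6      # solar radiation pressure at 1 AU, N/m²

def srp_accel_magnitude(c_r: float, area_m2: float, mass_kg: float, r_sun_km: float) -> float:
    """|a_srp| in m/s² for the cannonball model: P_sr × C_r × (A/m) × (1 AU / r_sun)²."""
    return P_SR_1AU * c_r * (area_m2 / mass_kg) * (AU_KM / r_sun_km) ** 2
```

At 1 AU, with the default `cr_coefficient = 1.3` and an area-to-mass ratio of 0.01 m²/kg (the significance threshold above), this gives ≈ 5.9 × 10⁻⁸ m/s².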
Integrator Configuration (Finding 9)
```python
from scipy.integrate import solve_ivp

integrator_config = dict(
    method="DOP853",     # RK7(8) embedded pair — adaptive step
    rtol=1e-9,           # relative tolerance (parts-per-billion)
    atol=1e-9,           # absolute tolerance (km); ≈ 1 mm position error
    max_step=60.0,       # seconds; constrained to capture density variation at perigee
    t_span=(t0, t0 + 120 * 86400),  # 120-day maximum integration window
    events=[
        altitude_80km_event,    # terminal: breakup trigger
        altitude_200km_event,   # non-terminal: log perigee passage
    ],
    dense_output=False,
)
```
Stopping criterion: integration terminates when altitude ≤ 80 km (breakup trigger fires) or when the 120-day span elapses without reaching 80 km (result: propagation_timeout; stored as status = 'timeout' in simulations). The 120-day cap is a safety stop — any object not re-entering within 120 days from a sub-450 km perigee TLE is anomalous and should be flagged for human review.
The max_step = 60s constraint near perigee prevents the integrator from stepping over atmospheric density variations. For altitudes above 300 km, the max step is relaxed to 300s (5 min) via a step-size hook that checks current altitude.
TLE age uncertainty inflation (F7): TLE age is a formal uncertainty source, not just a staleness indicator. For decaying objects, position uncertainty grows with TLE age due to unmodelled atmospheric drag variations. A linear inflation model is applied to the ballistic coefficient covariance before MC sampling:
```python
# Applied in decay_predictor.py before MC sampling
tle_age_days = (prediction_epoch - tle_epoch).total_seconds() / 86400
if tle_age_days > 0 and perigee_km < 450:
    uncertainty_multiplier = 1.0 + 0.15 * tle_age_days
    sigma_cd *= uncertainty_multiplier
    sigma_area *= uncertainty_multiplier
```
The 0.15/day coefficient is derived from Vallado (2013) §9.6 propagation error growth for LEO objects in ballistic flight. tle_age_at_prediction_time and uncertainty_multiplier are stored in simulations.params_json and included in the prediction API response for provenance.
Monte Carlo convergence criterion (F4): N = 500 for production is not arbitrary — it satisfies the following convergence criterion tested on the reference object (mc-ensemble-params.json):
| N | p95 corridor area (km²) | Change from N/2 |
|---|---|---|
| 100 | baseline | — |
| 250 | — | ~12% |
| 500 | — | ~4% |
| 1000 | — | ~1.8% |
| 2000 | — | ~0.9% |
Convergence criterion: corridor area change < 2% between doublings. N = 500 satisfies this for the reference object. N = 1000 is used for objects with ood_flag = TRUE or space_weather_warning = 'geomagnetic_storm' (higher uncertainty → higher N needed for stable tail estimates). Server cap remains 1000.
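The convergence criterion can be sketched as a check over corridor areas measured at successive doublings of N — `corridor_converged` is an illustrative helper name, not part of the validation suite:

```python
def corridor_converged(area_by_n: dict[int, float], threshold: float = 0.02) -> bool:
    """True when the relative p95 corridor-area change between the two
    largest sample sizes (the last doubling) is below `threshold`."""
    ns = sorted(area_by_n)
    prev, last = area_by_n[ns[-2]], area_by_n[ns[-1]]
    return abs(last - prev) / prev < threshold
```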
Monte Carlo:
- N = 500 (standard); N = 1000 (OOD flag or storm warning); server cap 1000
- Per-sample variation: C_D ~ U(2.0, 2.4); A ~ N(A_discos, σ_A × uncertainty_multiplier); m ~ N(m_discos, σ_m); F10.7 and Ap from storm-aware sampling
- Output: p01/p05/p25/p50/p75/p95/p99 re-entry times; ground track corridor polygon; per-sample binary blob for Mode C
- All output records are HMAC-signed before database write
15.3 Atmospheric Breakup Model
Simplified ORSAT approach: aerothermal heating → failure altitude → fragment generation → RK4 ballistic descent → impact (velocity, angle, KE, casualty area). Distinct from NASA SBM on-orbit fragmentation.
Breakup altitude trigger (Finding 5): Structural breakup begins when the numerical integrator crosses altitude = 78 km (midpoint of the 75–80 km range supported by NASA Debris Assessment Software and ESA DRAMA for aluminium-structured objects; documented in model card under "Breakup Altitude Rationale").
Fragment generation: Below 78 km, the fragment cloud is generated using the NASA Standard Breakup Model (NASA-TM-2018-220054) parameter set for the object's mass class:
- Mass class A: < 100 kg
- Mass class B: 100–1000 kg
- Mass class C: > 1000 kg (rocket bodies, large platforms)
Survivability by material (Finding 5): Fragment demise altitude is determined by material class using the ESA DRAMA demise altitude lookup:
| material_class | Typical demise altitude | Notes |
|---|---|---|
| `aluminium` | 60–70 km | Most fragments demise; some survive |
| `stainless_steel` | 45–55 km | Higher survival probability |
| `titanium` | 40–50 km | High survival; used in tanks and fasteners |
| `carbon_composite` | 55–65 km | Largely demises but reinforced structures may survive |
| `unknown` | Conservative: 0 km (surface impact) | All fragments assumed to survive — drives `ood_flag = TRUE` |
material_class TEXT added to objects table. When material_class IS NULL, the ood_flag is set and the conservative all-survive assumption is used. The NOTAM (E) field debris survival statement changes from a static disclaimer to a model-driven statement: DEBRIS SURVIVAL PROBABLE (when calculated survivability > 50%) or DEBRIS SURVIVAL POSSIBLE (10–50%) or COMPLETE DEMISE EXPECTED (< 10%).
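The mapping from calculated survivability to the NOTAM (E) field wording can be sketched directly from the thresholds above (`notam_survival_statement` is an illustrative helper name):

```python
def notam_survival_statement(survival_probability: float) -> str:
    """Map aggregate survival probability (0.0–1.0) to the NOTAM (E) wording."""
    if survival_probability > 0.50:
        return "DEBRIS SURVIVAL PROBABLE"
    if survival_probability >= 0.10:
        return "DEBRIS SURVIVAL POSSIBLE"
    return "COMPLETE DEMISE EXPECTED"
```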
Casualty area: Computed from fragment mass and velocity using the ESA DRAMA methodology. Stored per-fragment in fragment_impacts table. The aggregate casualty area polygon drives the "ground risk" display in the Event Detail page (Phase 3 feature).
Survival probability output (F5): The aggregate object-level survival probability is stored in reentry_predictions:
```sql
ALTER TABLE reentry_predictions
  ADD COLUMN survival_probability DOUBLE PRECISION,  -- fraction of object mass expected to survive to surface (0.0–1.0)
  ADD COLUMN survival_model_version TEXT,            -- e.g. 'phase1_analytical_v1', 'drama_3.2'
  ADD COLUMN survival_model_note TEXT;               -- human-readable caveat, e.g. 'Phase 1: simplified analytical; no fragmentation modelling'
```
Phase 1 method: simplified analytical — ballistic coefficient of the intact object projected to surface; if material_class = 'unknown', survival_probability = 1.0 (conservative all-survive). Phase 2: integrate ESA DRAMA output files where available from the space operator's licence submission. The NOTAM (E) field statement is driven by survival_probability (already specified above).
15.4 Corridor Generation Algorithm (Finding 4)
The re-entry corridor polygon is generated by reentry/corridor.py. The algorithm must be specified explicitly — the choice between convex hull, alpha-shape, and ellipse fit produces materially different FIR intersection results.
Algorithm:
```python
def generate_corridor_polygon(
    mc_trajectories: list[list[GroundPoint]],
    percentile: float = 0.95,
    alpha: float = 0.1,       # degrees; ~11 km at equator
    buffer_km: float = 50.0,  # lateral dispersion buffer below 80 km
    max_vertices: int = 1000,
) -> Polygon:
    """
    Generate a re-entry hazard corridor polygon from Monte Carlo trajectories.

    Algorithm:
    1. For each MC trajectory, collect ground positions at 10-min intervals
       from the 80 km altitude crossing to the final impact point.
    2. Retain the central `percentile` fraction of trajectories by re-entry time
       (discard the earliest p_low and latest p_high tails).
    3. Compute the alpha-shape (concave hull) of the combined point set
       using alpha = 0.1°. Alpha-shape is preferred over convex hull for
       elongated re-entry corridors (convex hull overestimates width by 2–5x).
    4. Buffer the polygon by `buffer_km` to account for lateral fragment
       dispersion below 80 km.
    5. Simplify to <= `max_vertices` vertices (Douglas-Peucker, tolerance 0.01°).
    6. Store the raw MC endpoint cloud as JSONB in `reentry_predictions.mc_endpoint_cloud`
       for audit and Mode C replay.

    Returns:
        Polygon in EPSG:4326 (WGS84), suitable for PostGIS GEOGRAPHY storage.
    """
```
The alpha-shape library (alphashape) is added to requirements.in. The 50 km buffer accounts for the fact that fragments detach from the main object trajectory below 80 km and disperse laterally. This value is documented in the model card with a reference to ESA DRAMA lateral dispersion statistics.
Adaptive ground-track sampling for CZML corridor fidelity (F4 — §62):
Step 1 of the corridor algorithm above samples at 10-minute intervals. For the high-deceleration terminal phase (below ~150 km), 10 minutes corresponds to hundreds of kilometres of ground track — the polygon will miss the actual terminal geometry. Adaptive sampling is required:
```python
def adaptive_ground_points(trajectory: list[StateVector]) -> list[GroundPoint]:
    """
    Return ground points at altitude-dependent intervals:
      > 300 km:   every 5 min (slow deceleration; sparse sampling adequate)
      150–300 km: every 2 min
      80–150 km:  every 30 s (rapid deceleration; must resolve terminal corridor)
      < 80 km:    every 10 s (fragment phase; maximum spatial resolution)
    """
    points = []
    for sv in trajectory:
        alt_km = sv.altitude_km
        step_s = 300 if alt_km > 300 else (
            120 if alt_km > 150 else (
                30 if alt_km > 80 else 10))
        # only emit a point if sufficient time has elapsed since the last point
        if not points or (sv.t - points[-1].t) >= step_s:
            points.append(to_ground_point(sv))
    return points
```
This is a breaking change to the corridor algorithm: the reference polygon in docs/validation/reference-data/mc-corridor-reference.geojson must be regenerated after this change is implemented. The ADR for this change must document the old vs. new polygon area difference for the reference object.
PostGIS vs CZML corridor consistency test (F6 — §62):
The PostGIS ground_track_corridor polygon (used for FIR intersection and alert generation) and the CZML polygon positions (displayed on the globe) are independently derived. A serialisation bug in the CZML builder could render the corridor in the wrong location while the database record remains correct — operators would see one corridor, alerts would be generated based on another.
Required integration test in tests/integration/test_corridor_consistency.py:
```python
@pytest.mark.safety_critical
def test_czml_corridor_matches_postgis_polygon(db_session):
    """
    The bounding box of the CZML polygon positions must agree with the
    PostGIS corridor polygon bounding box to within 10 km in each direction.
    """
    prediction = db_session.query(ReentryPrediction).filter(
        ReentryPrediction.ground_track_corridor.isnot(None)
    ).first()

    # Generate CZML from the prediction
    czml_doc = generate_czml_for_prediction(prediction)
    czml_polygon = extract_polygon_positions(czml_doc)  # list of (lat, lon)

    # Get PostGIS bounding box
    postgis_bbox = db_session.execute(
        text("SELECT ST_Envelope(ground_track_corridor::geometry) FROM reentry_predictions WHERE id = :id"),
        {"id": prediction.id},
    ).scalar()
    postgis_coords = extract_bbox_corners(postgis_bbox)  # (min_lat, max_lat, min_lon, max_lon)

    czml_bbox = bounding_box_of(czml_polygon)
    assert abs(czml_bbox.min_lat - postgis_coords.min_lat) < 0.1  # ~10 km latitude tolerance
    assert abs(czml_bbox.max_lat - postgis_coords.max_lat) < 0.1
    # Antimeridian-aware longitude comparison
    assert lon_diff_deg(czml_bbox.min_lon, postgis_coords.min_lon) < 0.1
    assert lon_diff_deg(czml_bbox.max_lon, postgis_coords.max_lon) < 0.1
```
This test is marked safety_critical because a discrepancy > 10 km between displayed and stored corridor is a direct contribution to HZ-004.
Unit test: Generate a corridor from a known synthetic MC dataset (100 trajectories, straight ground track); verify the resulting polygon contains all input points; verify the polygon area is less than the convex hull area (confirming the alpha-shape is tighter); verify the polygon has ≤ 1000 vertices.
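The area and vertex-budget invariants above can be sketched with a shoelace-area helper; the polygons below are toy stand-ins (a notched quadrilateral versus its convex hull), not real alpha-shape output:

```python
def shoelace_area(vertices):
    """Polygon area via the shoelace formula (absolute value)."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# A concave (alpha-shape-like) polygon is strictly smaller than its convex
# hull of the same extreme points — the invariant the unit test asserts.
concave = [(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)]  # notch at (2, 1)
hull = [(0, 0), (4, 0), (4, 4), (0, 4)]             # convex hull outline
assert shoelace_area(concave) < shoelace_area(hull)
assert len(concave) <= 1000  # vertex budget from the spec
```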
MC test data generation strategy (Finding 10): Generating hundreds of MC trajectories at test time is slow and non-deterministic. Committing raw trajectory arrays is a large binary blob. Use seeded RNG:
# tests/physics/conftest.py
@pytest.fixture(scope="session")
def synthetic_mc_ensemble():
    """500 synthetic trajectories from seeded RNG — deterministic, no external downloads."""
    rng = np.random.default_rng(seed=42)  # seed must never change without updating reference polygon
    return generate_mc_ensemble(
        rng, n=500,
        object_params={  # Reference object: committed, never change without ADR
            "mass_kg": 1000.0, "cd": 2.2, "area_m2": 1.0, "perigee_km": 185.0,
        },
    )
Commit to docs/validation/reference-data/:
- mc-corridor-reference.geojson — pre-computed corridor polygon (run python tools/generate_mc_reference.py once; review and commit)
- mc-ensemble-params.json — RNG seed, object parameters, generation timestamp
Test asserts: (a) generated corridor polygon matches committed reference within 5% area difference; (b) corridor contains ≥ 95% of input trajectories. If the corridor algorithm changes, the reference polygon must be explicitly regenerated and the change reviewed — the seed itself never changes.
15.5 Conjunction Probability (Pc) Computation Method (Finding 8)
The Pc method is specified in conjunction/pc_compute.py and must be documented in the API response.
Phase 1–2 method: Alfano/Foster 2D Gaussian
def compute_pc_alfano(
    r1: np.ndarray,    # primary position (km, GCRF)
    v1: np.ndarray,    # primary velocity (km/s)
    cov1: np.ndarray,  # 6×6 covariance (km², km²/s²)
    r2: np.ndarray,    # secondary position
    v2: np.ndarray,
    cov2: np.ndarray,
    hbr: float,        # combined hard-body radius (m)
) -> float:
    """
    Compute probability of collision using Alfano (2005) 2D Gaussian method.
    Projects combined covariance onto the encounter plane, integrates the
    bivariate normal distribution over the combined hard-body area.
    Standard method in the space surveillance community.
    Reference: Alfano (2005), "A Numerical Implementation of Spherical Object
    Collision Probability", Journal of the Astronautical Sciences.
    """
API response field: Every conjunction record includes pc_method: "alfano_2d_gaussian" so consumers can correctly interpret the result.
Covariance source: TLE format carries no covariance. SpaceCom estimates covariance via TLE differencing (Vallado & Cefola method): multiple TLEs for the same object within a 24-hour window are used to estimate position uncertainty. This is documented in the API as covariance_source: "tle_differencing" and flagged as covariance_quality: 'low' when fewer than 3 TLEs are available within 24 hours.
pc_discrepancy_flag implementation: The log-scale comparison is confirmed as:
pc_discrepancy_flag = abs(math.log10(pc_spacecom) - math.log10(pc_spacetrack)) > 1.0
Not a linear comparison. A discrepancy is an order-of-magnitude difference in probability — this threshold is correct.
Validity domain (F1): The Alfano 2D Gaussian method is valid under the following conditions. Outside these conditions, the Pc estimate is flagged with pc_validity: 'degraded' in the API response:
- Short-encounter assumption: valid when the encounter duration is short compared to the orbital period (satisfied for LEO conjunction geometries)
- Linear relative motion: degrades when miss_distance_km < 0.1 (non-linear trajectory effects become significant); flag: pc_validity_warning: 'sub_100m_close_approach'
- Gaussian covariance: degrades when the position uncertainty ellipsoid aspect ratio (σ_max/σ_min) > 100; flag: pc_validity_warning: 'highly_anisotropic_covariance'
- Minimum Pc floor: values below 1×10⁻¹⁵ are reported as < 1e-15 and not computed precisely (numerical precision limit)
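A minimal sketch of that flagging logic, assuming the miss distance and the covariance ellipsoid's extreme sigmas are already computed per conjunction; the 'nominal' value for the unflagged case is an assumption (the spec only names 'degraded'):

```python
def pc_validity_warnings(miss_distance_km: float,
                         sigma_max: float,
                         sigma_min: float) -> list[str]:
    """Return the pc_validity_warning codes that apply to one conjunction."""
    warnings = []
    if miss_distance_km < 0.1:  # non-linear relative-motion regime
        warnings.append("sub_100m_close_approach")
    if sigma_min > 0 and sigma_max / sigma_min > 100:  # ellipsoid aspect ratio
        warnings.append("highly_anisotropic_covariance")
    return warnings

def pc_validity(warnings: list[str]) -> str:
    """'degraded' per the spec when any warning applies; 'nominal' assumed otherwise."""
    return "degraded" if warnings else "nominal"
```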
Reference implementation test (F1): tests/physics/test_pc_compute.py — BLOCKING:
# Reference cases from Vallado & Alfano (2009), Table 1
from collections import namedtuple

PcCase = namedtuple("PcCase", [
    "miss_dist_m", "sigma_r1_m", "sigma_t1_m", "sigma_n1_m",
    "sigma_r2_m", "sigma_t2_m", "sigma_n2_m", "hbr_m", "expected_pc"])

VALLADO_ALFANO_CASES = [
    PcCase(100.0, 50.0, 200.0, 50.0, 50.0, 200.0, 50.0, 10.0, 3.45e-3),
    PcCase(500.0, 100.0, 500.0, 100.0, 100.0, 500.0, 100.0, 5.0, 2.1e-5),
]

@pytest.mark.parametrize("case", VALLADO_ALFANO_CASES)
def test_pc_against_vallado_alfano(case):
    pc = compute_pc_alfano(*build_conjunction_geometry(case))
    assert abs(pc - case.expected_pc) / case.expected_pc < 0.05  # within 5%
Phase 3 consideration: Monte Carlo Pc for conjunctions where pc_spacecom > 1e-3 (high-probability cases where the Gaussian assumption may break down due to non-linear trajectory evolution). Document in docs/adr/0015-pc-computation-method.md.
15.6 Model Version Governance (F6)
All components of the prediction pipeline are versioned together as a single model_version string using semantic versioning (MAJOR.MINOR.PATCH):
| Change type | Version bump | Examples |
|---|---|---|
| Pc methodology or propagator algorithm change | MAJOR | Switch from Alfano 2D to Monte Carlo Pc; replace DOP853 integrator |
| Atmospheric model or input processing change | MINOR | NRLMSISE-00 → JB2008; change TLE age inflation coefficient |
| Bug fix in existing model | PATCH | Fix F10.7 index lookup off-by-one; correct frame transformation |
Rules:
- Old model versions are never deleted — tagged in git (model/v1.2.3) and retained in backend/app/modules/physics/versions/
- reentry_predictions.model_version is set at creation and immutable thereafter
- A model version bump requires: updated unit tests, updated docs/validation/reference-data/, entry in CHANGELOG.md, ADR if MAJOR
Reproducibility endpoint (F6):
POST /api/v1/decay/predict/reproduce
Body: { "prediction_id": "uuid" }
Re-runs the prediction using the exact model version and parameters from simulations.params_json recorded at the time of the original prediction. Returns a new prediction record with reproduced_from_prediction_id set. This endpoint is used for regulatory audit ("what model produced this output?") and post-incident review. Available to analyst role and above.
15.7 Prediction Input Validation (F9)
A validate_prediction_inputs() function in backend/app/modules/physics/validation.py gates all decay prediction submissions. Inputs that fail validation are rejected with structured errors — never silently clamped to a valid range.
def validate_prediction_inputs(params: PredictionParams) -> list[ValidationError]:
    errors = []
    tle_age_days = (utcnow() - params.tle_epoch).days
    if tle_age_days > 30:
        errors.append(ValidationError("INVALID_TLE_EPOCH",
            f"TLE epoch is {tle_age_days} days old; maximum 30 days"))
    if not (65.0 <= params.f107 <= 300.0):
        errors.append(ValidationError("F107_OUT_OF_RANGE",
            f"F10.7 = {params.f107}; valid range [65, 300]"))
    if not (0.0 <= params.ap <= 400.0):
        errors.append(ValidationError("AP_OUT_OF_RANGE",
            f"Ap = {params.ap}; valid range [0, 400]"))
    if params.perigee_km > 1200.0:
        errors.append(ValidationError("PERIGEE_TOO_HIGH",
            f"Perigee {params.perigee_km} km > 1200 km; not a re-entry candidate"))
    if params.mass_kg is not None and params.mass_kg <= 0:
        errors.append(ValidationError("INVALID_MASS",
            f"Mass {params.mass_kg} kg must be > 0"))
    return errors
If errors is non-empty, the endpoint returns 422 Unprocessable Entity with the full error list. Unit tests (BLOCKING) cover each validation path including boundary values.
15.8 Data Provenance Specification (F11)
Phase 1 model classification: No trained ML model components. All prediction parameters are derived from:
- Physical constants (gravitational parameter, WGS84 Earth model)
- Published atmospheric model coefficients (NRLMSISE-00)
- Published orbital mechanics algorithms (SGP4, Alfano 2005 Pc)
- Empirical constants from peer-reviewed literature (NASA Standard Breakup Model, ESA DRAMA demise altitudes, Vallado ballistic coefficient uncertainty)
This is documented explicitly in docs/ml/data-provenance.md as: "SpaceCom Phase 1 uses no trained machine learning components. All model parameters are derived from physical constants and published peer-reviewed sources cited below."
EU AI Act Art. 10 compliance (Phase 1): Because Phase 1 has no training data, the data governance obligations of Art. 10 apply to input data rather than training data. Input data provenance is tracked in simulations.params_json (TLE source, space weather source, timestamp, version).
Future ML component protocol: Any future learned component (e.g., drag coefficient ML model, debris type classifier) must be accompanied by:
- Training dataset: source, date range, preprocessing steps, known biases
- Validation split: method, size, metrics
- Performance on historical re-entry backcasts (§15.9 backcasting pipeline)
- Documented in docs/ml/data-provenance.md under the component name
- A model card at docs/ml/model-card-{component}.md following the Google Model Card format
15.9 Backcasting Validation Pipeline (F8)
When a re-entry is confirmed (object decays — objects.status = 'decayed'), the backcasting pipeline runs automatically:
# Triggered by Celery task on object status change to 'decayed'
@celery.task
def run_reentry_backcast(object_id: int, confirmed_reentry_time: datetime):
    """Compare all predictions made in 72h before re-entry to actual outcome."""
    predictions = db.query(ReentryPrediction).filter(
        ReentryPrediction.object_id == object_id,
        ReentryPrediction.created_at >= confirmed_reentry_time - timedelta(hours=72),
    ).all()
    for pred in predictions:
        error_hours = (pred.p50_reentry_time - confirmed_reentry_time).total_seconds() / 3600
        db.add(ReentryBackcast(
            prediction_id=pred.id,
            object_id=object_id,
            confirmed_reentry_time=confirmed_reentry_time,
            p50_error_hours=error_hours,
            lead_time_hours=(confirmed_reentry_time - pred.created_at).total_seconds() / 3600,
            model_version=pred.model_version,
        ))
CREATE TABLE reentry_backcasts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id),
object_id INTEGER NOT NULL REFERENCES objects(id),
confirmed_reentry_time TIMESTAMPTZ NOT NULL,
p50_error_hours DOUBLE PRECISION NOT NULL, -- signed: positive = predicted late
lead_time_hours DOUBLE PRECISION NOT NULL,
model_version TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON reentry_backcasts (model_version, created_at DESC);
Drift detection: Rolling 30-prediction MAE by model version, computed nightly. If MAE > 2× historical baseline for the current model version, raise MEDIUM alert to Persona D flagging for model review. Surfaced in the admin analytics panel as a "Model Performance" widget.
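The nightly drift check could look like the following sketch; the function names are assumptions, while the 30-prediction window and the 2× baseline threshold come from the text:

```python
def rolling_mae(errors_hours: list[float], window: int = 30) -> float:
    """Mean absolute p50 error over the most recent `window` backcasts."""
    recent = errors_hours[-window:]
    return sum(abs(e) for e in recent) / len(recent)

def drift_detected(errors_hours: list[float], baseline_mae: float) -> bool:
    """True when the rolling MAE exceeds 2x the historical baseline,
    which should raise the MEDIUM alert to Persona D."""
    return rolling_mae(errors_hours) > 2.0 * baseline_mae
```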
16. Cross-Cutting Concerns
16.1 Subscription Tiers and Feature Flags (F2, F6)
SpaceCom gates commercial entitlements by contracts, which is the single authoritative commercial source of truth. organisations.subscription_tier is a presentation and segmentation shorthand only, and must never be used as the authority for feature access, quota limits, or shadow/production eligibility. Active contract state is materialised into derived organisation flags and quotas by a synchronisation job so runtime checks remain cheap and explicit.
| Tier | Intended customer | MC concurrent runs | Decay predictions/month | Conjunction screening | API access | Multi-ANSP coordination |
|---|---|---|---|---|---|---|
| shadow_trial | Evaluators / test orgs | 1 | 20 | Read-only (catalog) | No | No |
| ansp_operational | ANSP Phase 1 | 1 | 200 | Yes (Phase 2) | Yes | Yes |
| space_operator | Space operator orgs | 2 | 500 | Own objects only | Yes | No |
| institutional | Space agencies, research | 4 | Unlimited | Yes | Yes | Yes |
| internal | SpaceCom internal | Unlimited | Unlimited | Yes | Yes | Yes |
Feature flag enforcement pattern:
def require_tier(*tiers: str):
    def dependency(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
        org = db.get(Organisation, current_user.organisation_id)
        if org.subscription_tier not in tiers:
            raise HTTPException(status_code=403, detail={
                "code": "TIER_INSUFFICIENT",
                "current_tier": org.subscription_tier,
                "required_tiers": list(tiers),
            })
        return org
    return dependency

# Applied at router level alongside require_role:
router = APIRouter(dependencies=[
    Depends(require_role("analyst", "operator", "org_admin", "admin")),
    Depends(require_tier("ansp_operational", "institutional", "internal")),
])
Quota enforcement pattern (MC concurrent runs):
TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def get_mc_concurrency_limit(org: Organisation) -> int:
    return TIER_MC_CONCURRENCY.get(org.subscription_tier, 1)
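A sketch of how that limit could gate MC submissions; the in-memory counter dict stands in for whatever shared store (e.g. Redis) tracks running jobs per organisation, and the helper name is an assumption:

```python
TIER_MC_CONCURRENCY = {
    "shadow_trial": 1, "ansp_operational": 1,
    "space_operator": 2, "institutional": 4, "internal": 999,
}

def try_acquire_mc_slot(running: dict[int, int], org_id: int, tier: str) -> bool:
    """Increment the org's running-MC counter if under its tier limit.
    On False the caller returns 429 TIER_QUOTA_EXCEEDED (and writes the
    usage_events row described below)."""
    limit = TIER_MC_CONCURRENCY.get(tier, 1)  # unknown tier: most restrictive
    if running.get(org_id, 0) >= limit:
        return False
    running[org_id] = running.get(org_id, 0) + 1
    return True
```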
Quota exhaustion is a billable signal: Every 429 TIER_QUOTA_EXCEEDED response writes a usage_events row with event_type = 'mc_quota_exhausted' (see §9.2 usage_events table). This powers the org admin's usage dashboard and the upsell trigger in the admin panel.
Tier changes take effect immediately — no session restart required. The require_tier dependency reads from the database on each request; there is no tier caching that could allow a downgraded tier to continue accessing premium features.
Uncertainty and Confidence
Every prediction includes:
- confidence_level (0.0–1.0) — derived from MC spread
- uncertainty_bounds — explicit p05/p50/p95 times, corridor ellipse axes
- model_version — semantic version
- monte_carlo_n — ≥ 100 preliminary, ≥ 500 operational
- f107_assumed, ap_assumed — critical for reproducibility
- record_hmac — tamper-evident signature, verified before serving
TLE covariance: TLE format contains no covariance. Use TLE differencing (multiple TLEs within 24h) or empirical Vallado & Cefola covariance. Document clearly in API responses.
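The differencing idea can be sketched as a sample covariance over positions obtained by propagating each recent TLE to a common epoch; real code would use SGP4 state vectors, and the 'nominal' quality label is an assumption (the spec only defines 'low' for fewer than 3 TLEs):

```python
import numpy as np

def tle_differencing_covariance(positions_km: list[np.ndarray]):
    """Sample position covariance (km^2) from >= 2 TLE solutions propagated
    to a common epoch. Returns (3x3 covariance, covariance_quality)."""
    stacked = np.vstack(positions_km)        # shape (n_tles, 3)
    cov = np.cov(stacked, rowvar=False)      # 3x3 position covariance
    quality = "low" if len(positions_km) < 3 else "nominal"
    return cov, quality
```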
Multi-source prediction conflict resolution (Finding 10):
Space-Track TIP messages and SpaceCom's internal decay predictor may produce non-overlapping re-entry windows for the same object simultaneously. ESA ESAC may publish a third window. The aviation regulatory principle of most-conservative applies — the hazard presented to ANSPs must encompass the full credible uncertainty range.
Resolution rules (applied at the reentry_predictions layer):
| Situation | Rule |
|---|---|
| SpaceCom p10–p90 and TIP window overlap | Display SpaceCom corridor as primary; TIP window shown as secondary reference band on Event Detail page |
| SpaceCom p10–p90 and TIP window do not overlap | Set prediction_conflict = TRUE on the prediction; HIGH severity data quality warning displayed; hazard corridor presented to ANSPs uses the union of SpaceCom p10–p90 and TIP window |
| ESA ESAC window available | Overlay as third reference band; include in PREDICTION_CONFLICT assessment if non-overlapping |
| All sources agree (all windows overlap) | No flag; SpaceCom corridor is primary |
Schema addition to reentry_predictions:
ALTER TABLE reentry_predictions
ADD COLUMN prediction_conflict BOOLEAN DEFAULT FALSE,
ADD COLUMN conflict_sources TEXT[], -- e.g. ['spacecom', 'space_track_tip']
ADD COLUMN conflict_union_p10 TIMESTAMPTZ,
ADD COLUMN conflict_union_p90 TIMESTAMPTZ;
The Event Detail page shows a ⚠ PREDICTION CONFLICT banner (HIGH severity style) when prediction_conflict = TRUE, listing the conflicting sources and their windows. The hazard corridor polygon uses conflict_union_p10/conflict_union_p90 when the flag is set. Document in docs/model-card-decay-predictor.md under "Conflict Resolution with Authoritative Sources."
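The most-conservative union rule from the table above, sketched for the two-source (SpaceCom + TIP) case with windows as (start, end) datetime pairs; names are illustrative, not the production API:

```python
from datetime import datetime

def windows_overlap(a: tuple, b: tuple) -> bool:
    """Closed-interval overlap test for (start, end) windows."""
    return a[0] <= b[1] and b[0] <= a[1]

def resolve_conflict(spacecom: tuple, tip: tuple):
    """Return (prediction_conflict, union_p10, union_p90).
    Overlapping windows: SpaceCom corridor is primary, no flag.
    Disjoint windows: flag the conflict and present the union to ANSPs."""
    if windows_overlap(spacecom, tip):
        return False, spacecom[0], spacecom[1]
    return True, min(spacecom[0], tip[0]), max(spacecom[1], tip[1])
```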
Auditability
- Every simulation in simulations with full params_json and result URI
- Reports stored with simulation_id reference
- alert_events and security_logs are append-only with DB-level triggers
- All API mutations logged with user ID, timestamp, and payload hash
- TIP messages stored verbatim for audit
Error Handling
- Structured error responses: { "error": "code", "message": "...", "detail": {...} }
- Celery failures captured in simulations.status = 'failed'; surfaced in jobs panel
- Frame transformation failures fail loudly — never silently continue with TEME
- HMAC failures return 503 and trigger CRITICAL security event — never silently serve a tampered record
- TanStack Query error states render inline messages with retry; not page-level errors
Performance Patterns
SQLAlchemy async — lazy="raise" on all relationships:
Async SQLAlchemy prohibits lazy-loaded relationship access outside an async context. Setting lazy="raise" converts silent N+1 errors into loud InvalidRequestError at development time rather than silent blocking DB calls in production:
class ReentryPrediction(Base):
    object: Mapped["SpaceObject"] = relationship(lazy="raise")
    tip_messages: Mapped[list["TipMessage"]] = relationship(lazy="raise")
    # Forces all callers to use joinedload/selectinload explicitly
Required eager-loading patterns for the three highest-traffic endpoints:
- Event Detail: selectinload(ReentryPrediction.object), selectinload(ReentryPrediction.tip_messages)
- Active alerts: selectinload(AlertEvent.prediction)
- CZML catalog: raw SQL with a single JOIN rather than ORM (bulk fetch; ORM overhead unacceptable at 864k rows)
CZML caching — two-tier strategy: CZML data for the current 72h window changes only when a new TLE is ingested or a propagation job completes. Cache the full serialised CZML blob:
CZML_CACHE_KEY = "cache:czml:catalog:{catalog_hash}:{window_start}:{window_end}"
# TTL: 15 minutes in LIVE mode (refreshed after new TLE ingest event)
# TTL: permanent in REPLAY mode (historical data never changes)
Per-object CZML fragments cached separately under cache:czml:obj:{norad_id}:{...}. When a TLE is re-ingested for one object, invalidate only that object's fragment and recompute the full catalog CZML from the cached fragments.
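The per-object invalidation scope can be illustrated with a pattern match; fnmatch stands in here for a Redis SCAN/MATCH loop, and the key shapes follow the namespaces in the text:

```python
import fnmatch

def keys_to_invalidate(all_keys: list[str], norad_id: int) -> list[str]:
    """Select the cache:czml:obj:{norad_id}:* keys for one object,
    leaving other objects' fragments and the catalog key untouched."""
    pattern = f"cache:czml:obj:{norad_id}:*"
    return [k for k in all_keys if fnmatch.fnmatch(k, pattern)]
```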
CZML cache invalidation triggers (F5 — §58):
| Event | Invalidation scope | Mechanism |
|---|---|---|
| New TLE ingested for object X | cache:czml:obj:{norad_id_x}:* only | Ingest task calls redis.delete(pattern) after TLE commit |
| Propagation job completes for object X | cache:czml:obj:{norad_id_x}:* + full catalog key | Propagation Celery task issues invalidation on success |
| New prediction created for object X | cache:czml:obj:{norad_id_x}:* | Prediction task issues invalidation on completion |
| Manual cache flush (admin API) | cache:czml:* | DELETE /api/v1/admin/cache/czml — requires admin role |
| Cold start / DR failover | Warm-up Celery task | warm_czml_cache Beat task runs at startup (see below) |
Stale-while-revalidate strategy: The CZML cache key includes a stale_ok variant. When the primary key is expired but the stale key (cache:czml:catalog:stale:{hash}) exists, serve the stale response immediately and enqueue a background recompute. Maximum stale age: 5 minutes. This prevents a cache stampede during TLE batch ingest (up to 600 simultaneous invalidations).
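A minimal sketch of that stale-while-revalidate lookup, with a plain dict standing in for Redis and hypothetical helper names; real code would enqueue the recompute as a Celery task rather than call it inline:

```python
import time

def get_czml(cache: dict, key: str, stale_key: str,
             recompute, max_stale_s: float = 300.0):
    """Serve the fresh value if present; else serve a stale copy no older
    than max_stale_s while triggering a refresh; else compute on a cold miss."""
    if key in cache:
        return cache[key], "fresh"
    stale = cache.get(stale_key)
    if stale is not None and time.time() - stale["at"] <= max_stale_s:
        recompute()  # stands in for enqueueing a background refresh
        return stale["value"], "stale"
    value = recompute()  # cold miss: compute inline and repopulate
    cache[key] = value
    return value, "miss"
```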
Cache warm-up on cold start (F5 — §58):
@app.task
def warm_czml_cache():
    """Run at container startup and after DR failover. Estimated: 30–60s for 600 objects."""
    objects = db.query(Object).filter(Object.active == True).all()
    for obj in objects:
        generate_czml_fragment.delay(obj.norad_id)
    # Full catalog key assembled by CZML endpoint after all fragments present
Cold-start warm-up time (600 objects, 16 simulation workers): estimated 30–60 seconds. Included in DR RTO calculation (§26.3) as "cache warm-up: ~1 min" line item.
Redis key namespaces and eviction policy:
| Namespace | Contents | Eviction policy | Notes |
|---|---|---|---|
| celery:* | Celery broker queues | noeviction — must never be evicted | Use separate Redis instance or DB 0 with noeviction |
| redbeat:* | celery-redbeat schedules | noeviction | Loss causes silent scheduled job disappearance |
| cache:* | Application cache (CZML, space weather, HMAC results) | allkeys-lru | Cache misses acceptable; broker loss is not |
| ws:session:* | WebSocket session state | volatile-lru (with TTL set) | Expires on session end |
Run Celery broker and application cache as separate Redis database indexes (SELECT 0 vs SELECT 1) so eviction policies can differ. The Sentinel configuration monitors both.
Cache TTLs:
- cache:czml:catalog → 15 minutes
- cache:spaceweather:current → 5 minutes
- cache:prediction:{id}:fir_intersection → until superseded (keyed to prediction ID)
- cache:prediction:{id}:hmac_verified → 60 minutes
Bulk export — Celery offload for Persona F:
GET /space/export/bulk must not materialise the full result set in the backend container — for the full catalog this risks OOM. Implement as a Celery task that writes to MinIO and returns a pre-signed download URL, consistent with the existing report generation pattern:
@app.post("/space/export/bulk")
async def trigger_bulk_export(params: BulkExportParams, ...):
    task = generate_bulk_export.delay(params.dict(), user_id=current_user.id)
    return {"task_id": task.id, "status": "queued"}

@app.get("/space/export/bulk/{task_id}")
async def get_bulk_export(task_id: str, ...):
    # Returns {"status": "complete", "download_url": presigned_url} when done
    ...
If a streaming response is preferred over task-based, use SQLAlchemy yield_per=1000 cursor streaming — never materialise the full result set.
Analytics query routing to read replica: Persona B and F analytics queries (simulation comparison, historical validation, bulk export) are I/O intensive and must not compete with operational read paths on the primary TimescaleDB instance during active TIP events. Route to the Patroni standby:
def get_db(write: bool = False, analytics: bool = False) -> AsyncSession:
    if write:
        return AsyncSession(primary_engine)
    if analytics:
        return AsyncSession(replica_engine)  # Patroni standby
    return AsyncSession(primary_engine)  # operational reads: primary (avoids replica lag)
Monitor replication lag: if replica lag > 30s, log a warning and redirect analytics queries to primary.
Query plan baseline:
Add to Phase 1 setup: run EXPLAIN (ANALYZE, BUFFERS) on the primary CZML query with 100 objects and record the output in docs/query-baselines/. Re-run at Phase 3 load test and compare — if planning time or execution time has increased > 2×, investigate index bloat or chunk count growth before the load test proceeds.
17. Validation Strategy
17.0 Test Standards and Strategy (F1–F3, F5, F7, F8, F10, F11)
Test Taxonomy (F2)
Three levels — every developer must know which level a new test belongs to before writing it:
| Level | Definition | I/O boundary | Tool | Location |
|---|---|---|---|---|
| Unit | Single function or class; all dependencies mocked or stubbed | No I/O | pytest | tests/unit/ |
| Integration | Multiple components; real PostgreSQL + Redis; no external network | Real DB, no internet | pytest + testcontainers | tests/integration/ |
| E2E | Full stack including browser; Celery worker running; real DB | Full stack | Playwright | e2e/ |
Rules:
- Physics algorithm tests (SGP4, MC, Pc) are unit tests — pure functions, no DB
- HMAC signing, RLS isolation, and rate-limit tests are integration tests — require a real DB transaction
- Alert delivery, WebSocket flow, and NOTAM draft UI are E2E tests
- A test that mocks the database is a unit test regardless of what it is testing — name it accordingly
Coverage Standard (F1)
| Scope | Tool | Minimum threshold | CI gate |
|---|---|---|---|
| Backend line coverage | pytest-cov | 80% | Fail below threshold |
| Backend branch coverage | pytest-cov --branch | 70% | Fail below threshold |
| Frontend line coverage | Jest --coverage | 75% | Fail below threshold |
| Safety-critical paths | pytest -m safety_critical | 100% (all pass, none skipped) | Always blocking |
# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=app --cov-branch --cov-fail-under=80 --cov-report=term-missing"
[tool.coverage.run]
omit = ["*/migrations/*", "*/tests/*", "*/__pycache__/*"]
Coverage is measured on the integration test run (not unit-only) so that database-layer code paths are included. Coverage reports are uploaded to CI artefacts on every run; a coverage trend chart is required in the Phase 2 ESA submission.
Test Data Management (F3)
Fixtures, not factories for shared reference data: Physics reference cases (TLE sets, re-entry events, conjunction scenarios) are committed JSON files in docs/validation/reference-data/. Tests load them as pytest fixtures — never fetch from the internet at test time.
Isolated fixtures for integration tests: Each integration test that writes to the database runs inside a transaction that is rolled back at teardown. No shared mutable state between tests:
@pytest.fixture
def db_session(engine):
    with engine.connect() as conn:
        with conn.begin() as txn:
            yield conn
            txn.rollback()  # all writes from this test disappear
Time-dependent tests: Any test that checks TLE age, token expiry, or billing period uses freezegun to freeze time to a known epoch. Tests must never rely on datetime.utcnow() producing a particular value:
from freezegun import freeze_time

@freeze_time("2026-01-15T12:00:00Z")
def test_tle_age_degraded_warning():
    # TLE epoch is 2026-01-08 → age = 7 days → expects 'degraded'
    ...
Sensitive test data: Real NORAD IDs, real Space-Track credentials, and real ANSP organisation names must never appear in committed test fixtures. Use fictional NORAD IDs (90001–90099 are reserved for test objects by convention) and generated organisation names (test-org-{uuid4()[:8]}).
Safety-Critical Test Markers (F8)
All tests that verify safety-critical behaviour carry @pytest.mark.safety_critical. These run on every commit (not just pre-merge) and must all pass before any deployment:
# conftest.py
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "safety_critical: test verifies a safety-critical invariant; always runs; zero tolerance for failure or skip"
    )

# Usage
@pytest.mark.safety_critical
def test_cross_tenant_isolation():
    ...

@pytest.mark.safety_critical
def test_hmac_integrity_failure_quarantines_record():
    ...

@pytest.mark.safety_critical
def test_sub_150km_low_confidence_flag():
    ...
The full list of safety_critical-marked tests is maintained in docs/TEST_PLAN.md (see F11). CI runs pytest -m safety_critical as a separate fast job (target: < 2 minutes) before the full suite.
Physics Test Determinism (F10)
Monte Carlo tests are non-deterministic by default. All MC-based tests seed the random number generator explicitly:
import numpy as np

@pytest.fixture(autouse=True)
def seed_rng():
    """Seed numpy RNG for all physics tests. Produces identical output across runs."""
    np.random.seed(42)
    yield
    # no teardown needed — each test gets a fresh seed via autouse

@pytest.mark.safety_critical
def test_mc_convergence_criterion():
    result = run_mc_decay(tle=TEST_TLE, n=500, seed=42)
    assert result.corridor_area_change_pct < 2.0
The seed value 42 is fixed in tests/conftest.py and must not be changed without updating the baseline expected values. A PR that changes the seed without updating expected values fails the review checklist.
Mutation Testing (F5)
mutmut is run weekly (not on every commit — too slow) against the backend/app/modules/physics/ and backend/app/modules/alerts/ directories. These are the highest-consequence paths.
mutmut run --paths-to-mutate=backend/app/modules/physics/,backend/app/modules/alerts/
mutmut results
Threshold: Mutation score ≥ 70% for physics and alerts modules. Results published to CI artefacts. A score drop of > 5 percentage points between weekly runs creates a mutation-regression GitHub issue automatically.
Test Environment Parity (F7)
The CI test environment must use identical Docker images to production. Enforced by:
- docker-compose.ci.yml extends docker-compose.yml — same image tags, no overrides to DB version or Redis version
- TimescaleDB version in CI is pinned to the same tag as production (timescale/timescaledb-ha:pg16-latest is not acceptable — must be timescale/timescaledb-ha:pg16.3-ts2.14.2)
- make test in CI fails if TIMESCALEDB_VERSION env var does not match the value in docker-compose.yml
- MinIO is used in CI, not mocked — make test brings up the full service stack including MinIO before running integration tests
ESA Test Plan Document (F11)
docs/TEST_PLAN.md is a required Phase 2 deliverable. Structure:
# SpaceCom Test Plan
## 1. Test levels and tools
## 2. Coverage targets and current status
## 3. Safety-critical test traceability matrix
| Requirement | Test ID | Test name | Result |
|-------------|---------|-----------|--------|
| Sub-150km propagation guard | SC-TEST-001 | test_sub_150km_low_confidence_flag | PASS |
| Cross-tenant data isolation | SC-TEST-002 | test_cross_tenant_isolation | PASS |
...
## 4. Known test limitations
## 5. Test environment specification
## 6. Performance test results (latest k6 run)
The traceability matrix links each safety-critical requirement (drawn from §15, §7.2, §26) to its @pytest.mark.safety_critical test. This is the primary evidence document for ESA software assurance review.
Important: Comparing SGP4 against Space-Track TLEs is circular. All validation uses independent reference datasets.
Reference data location: docs/validation/reference-data/ — committed to the repository and loaded automatically by the test suite. No external downloads required at test time.
How to run all validation suites:
make test # runs pytest including all validation suites
pytest tests/test_frame_utils.py -v # frame transforms only
pytest tests/test_decay/ -v # decay predictor + backcast comparison
pytest tests/test_propagator/ -v # SGP4 propagator
How to add a new validation case: Add the reference data to the appropriate JSON file in docs/validation/reference-data/, add a test case in the relevant test module, and document the source in the file's header comment.
17.1 Frame Transformation Validation
| Test | Reference | Pass criterion | Run command |
|---|---|---|---|
| TEME→GCRF transform | Vallado (2013), Table 3-5 | Position error < 1 m; velocity error < 0.001 m/s | pytest tests/test_frame_utils.py::test_teme_gcrf_vallado |
| GCRF→ITRF transform | Vallado (2013), Table 3-4 | Position error < 1 m | pytest tests/test_frame_utils.py::test_gcrf_itrf_vallado |
| ITRF→WGS84 geodetic | IAU SOFA test vectors | Lat/lon error < 1 μrad; altitude error < 1 mm | pytest tests/test_frame_utils.py::test_itrf_geodetic |
| Round-trip WGS84→ITRF→GCRF→ITRF→WGS84 | Self-consistency | Round-trip error < floating-point machine precision (~1e-12) | pytest tests/test_frame_utils.py::test_roundtrip |
| IERS EOP application | IERS Bulletin A reference values | UT1-UTC error < 1 μs; pole offset error < 0.1 mas | pytest tests/test_frame_utils.py::test_iers_eop |
Committed test vectors (Finding 6): The following reference data files must be committed to the repository before any frame transformation or propagation code is merged. Tests are parameterised fixtures that load from these files; they fail (not skip) if a file is absent:
| File | Content | Source |
|---|---|---|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | ≥ 3 cases from Vallado (2013) §3.7: input UTC epoch + GCRF position → expected ITRF position, accurate to < 1 m | Vallado (2013) Fundamentals of Astrodynamics Table 3-4 |
| docs/validation/reference-data/sgp4_propagation_cases.json | ISS (NORAD 25544) and one historical re-entry object: state vector at epoch and after 1h and 24h propagation | STK or GMAT reference propagation |
| docs/validation/reference-data/iers_eop_case.json | One epoch with published IERS Bulletin B UT1-UTC and polar motion values; expected GCRF→ITRF transform result | IERS Bulletin B (iers.org) |
```python
# tests/physics/test_frame_transforms.py
import json
from pathlib import Path

import numpy as np
import pytest

# Illustrative import path; gcrf_to_itrf and parse_utc are the functions under test.
from app.physics.frame_utils import gcrf_to_itrf, parse_utc

CASES_FILE = Path("docs/validation/reference-data/frame_transform_gcrf_to_itrf.json")


def test_reference_data_exists():
    """Fail hard if committed test vectors are missing — do not skip."""
    assert CASES_FILE.exists(), f"Required reference data missing: {CASES_FILE}"


@pytest.mark.parametrize("case", json.loads(CASES_FILE.read_text()))
def test_gcrf_to_itrf(case):
    result = gcrf_to_itrf(np.asarray(case["gcrf_km"]), parse_utc(case["epoch_utc"]))
    assert np.linalg.norm(result - np.asarray(case["expected_itrf_km"])) < 0.001  # 1 m tolerance (units: km)
```
Reference data files: docs/validation/reference-data/vallado-sgp4-cases.json and docs/validation/reference-data/iers-frame-test-cases.json.
Operational significance of failure: A frame transform error propagates directly into corridor polygon coordinates. A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km. A failing frame test is a blocking CI failure.
17.2 SGP4 Propagator Validation
| Test | Reference | Pass criterion |
|---|---|---|
| State vector at epoch | Vallado (2013) test set, 10 objects spanning LEO/MEO/GEO/HEO | Position error < 1 km at epoch; < 10 km after 7-day propagation |
| Epoch parsing | NORAD 2-line epoch format → UTC | Round-trip to 1 ms precision |
| TLE line 1/2 checksum | Modulo-10 algorithm | Pass/fail; corrupted checksum rejected before propagation |
Operational significance of failure: SGP4 position error at epoch > 1 km produces a corridor centred in the wrong place. Blocking CI failure.
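The checksum test above exercises the standard NORAD modulo-10 algorithm: digits contribute their face value, each minus sign contributes 1, and all other characters contribute 0, summed over the first 68 columns. A minimal sketch (function name is illustrative):

```python
def tle_checksum(line: str) -> int:
    """Modulo-10 checksum over the first 68 characters of a TLE line:
    digits count at face value, '-' counts as 1, everything else 0."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


# A line is rejected before propagation when the computed checksum
# does not equal the digit stored in column 69.
def checksum_valid(line: str) -> bool:
    return len(line) == 69 and line[68].isdigit() and tle_checksum(line) == int(line[68])
```

Any single corrupted digit shifts the sum, so corrupted lines fail the comparison against column 69.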
17.3 Decay Predictor Validation
| Test | Reference | Pass criterion |
|---|---|---|
| NRLMSISE-00 density output | Picone et al. (2002) Table 1 reference atmosphere | Density within 1% of reference at 5 altitude/solar activity combinations |
| Historical backcast: p50 error | The Aerospace Corporation observed re-entry database (≥3 events Phase 1; ≥10 events Phase 2) | Median p50 error < 4h for rocket bodies with known physical properties |
| Historical backcast: corridor containment | Same database | p95 corridor contains observed impact in ≥90% of validation events |
| Historical replay: airspace disruption | Long March 5B Spanish airspace closure reconstruction with replay inputs and operator review | Affected FIR/time-window outputs judged operationally plausible and traceable in replay report |
| Air-risk ranking consistency | Documented crossing-scenario corpus (≥10 unique spacecraft/aircraft crossing cases by Phase 2) | Highest-ranked exposure slices remain stable under seed and traffic-density perturbations or the differences are explained in the validation note |
| Conservative-baseline comparison | Same replay corpus vs. full-FIR or fixed-radius precautionary closure baseline | Refined outputs reduce affected area or duration in a majority of replay cases without undercutting the agreed p95 protective envelope |
| Cross-tool comparison | GMAT (NASA open source) — 3 defined test cases | Re-entry time agreement within 1h for objects with identical inputs |
| Monte Carlo statistical consistency | Self-consistency: 500-sample run vs. 1000-sample run on same inputs | p05/p50/p95 agree within 2% (tolerance tightens as sample count grows) |
Reference data files: docs/validation/reference-data/aerospace-corp-reentries.json for decay-only validation and docs/validation/reference-data/reentry-airspace/ for airspace-risk replay cases (Long March 5B, Columbia-derived cloud case, and documented crossing scenarios). GMAT comparison is a manual procedure documented in docs/validation/README.md (GMAT is not run in CI — too slow; comparison run once per major model version).
Operational significance of failure: Decay predictor p50 error > 4h means corridors are offset in time; operators could see a hazard window that doesn't match the actual re-entry. Major model version gate.
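The Monte Carlo self-consistency criterion above can be checked with a small helper that compares the p05/p50/p95 of two runs against the 2% relative tolerance. This is a sketch, not the production implementation; the function name and signature are assumptions:

```python
import numpy as np


def percentile_agreement(samples_a, samples_b, tol=0.02):
    """Return True when p05/p50/p95 of two Monte Carlo runs agree
    within a relative tolerance (2% per the self-consistency criterion)."""
    pa = np.percentile(samples_a, [5, 50, 95])
    pb = np.percentile(samples_b, [5, 50, 95])
    return bool(np.all(np.abs(pa - pb) / np.abs(pb) < tol))
```

In the validation suite this would be applied to re-entry epoch samples from the 500- and 1000-sample chords on identical inputs.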
17.4 Breakup Model Validation
| Test | Reference | Pass criterion |
|---|---|---|
| Fragment count distribution | ESA DRAMA published results for similar-mass objects | Fragment count within 30% of DRAMA reference for a 500 kg object at 70 km |
| Energy conservation at breakup altitude | Internal check | Total kinetic + potential energy conserved within 1% through fragmentation step |
| Casualty area geometry | Hand-calculated reference case | Casualty area polygon area within 10% of analytic calculation |
Operational significance of failure: Breakup model failure does not block Phase 1. It is an advisory failure in Phase 2. Blocking before Phase 3 regulatory submission.
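The energy-conservation check in the table is an internal consistency test: total kinetic plus potential energy of the fragment set must stay within 1% of the parent's through the fragmentation step. A minimal sketch, assuming constant g near breakup altitude and a hypothetical body representation (mass [kg], speed [m/s], alt [m]):

```python
def energy_conserved(parent, fragments, tol=0.01):
    """Internal check: fragments' total KE + PE within `tol` (1%)
    of the parent body's at the fragmentation step."""
    g = 9.81  # m/s^2; adequate approximation for a consistency check near 70 km

    def energy(body):
        return 0.5 * body["mass"] * body["speed"] ** 2 + body["mass"] * g * body["alt"]

    e_parent = energy(parent)
    e_fragments = sum(energy(f) for f in fragments)
    return abs(e_fragments - e_parent) / e_parent <= tol
```

A fragmentation step that silently loses (or invents) energy — e.g. by mishandling mass partitioning — fails this check immediately.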
17.5 Security Validation
| Test | Reference | Pass criterion | Blocking? |
|---|---|---|---|
| RBAC enforcement | test_rbac.py — every endpoint, every role | 403 for insufficient role; 401 for unauthenticated; 0 mismatches | Yes |
| HMAC tamper detection | test_integrity.py — direct DB row modification | API returns 503 + CRITICAL security_logs entry | Yes |
| Rate limiting | test_auth.py — per-endpoint threshold | 429 after threshold; 200 after reset window | Yes |
| CSP headers | Playwright E2E | Content-Security-Policy header present on all pages | Yes |
| Container non-root | CI docker inspect check | No container running as root UID | Yes |
| Trivy CVE scan | Trivy against all built images | 0 Critical/High CVEs | Yes |
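The HMAC tamper-detection test relies on the property that any direct modification of a signed row invalidates its stored signature. A minimal sketch using the standard library — key, field names, and function names here are illustrative, not the production code (the real key lives in the secrets manager):

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"dev-only-key"  # hypothetical; never hardcode in production


def sign_prediction(row: dict) -> str:
    """HMAC-SHA256 over the canonical JSON serialisation of a prediction row."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify_prediction(row: dict, signature: str) -> bool:
    """Constant-time verification. A direct DB row edit changes the
    canonical payload, so the stored signature no longer matches."""
    return hmac.compare_digest(sign_prediction(row), signature)
```

On verification failure the API layer would return 503 and write the CRITICAL security_logs entry required by the pass criterion above.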
17.6 Verification Independence (F6 — §61)
EUROCAE ED-153 / DO-278A §6.4 requires that SAL-2 software components undergo independent verification — meaning the person who verifies (reviews/tests) a SAL-2 requirement, design, or code artefact must not be the same person who produced it.
Policy: docs/safety/VERIFICATION_INDEPENDENCE.md
Scope: All SAL-2 components identified in §24.13:
- `physics/` (decay prediction engine)
- `alerts/` (alert generation pipeline)
- HMAC integrity verification functions
- CZML corridor generation and frame transform
Implementation in GitHub:
# .github/CODEOWNERS
# SAL-2 components require an independent reviewer (not the PR author)
/backend/app/physics/ @safety-reviewer
/backend/app/alerts/ @safety-reviewer
/backend/app/integrity/ @safety-reviewer
/backend/app/czml/ @safety-reviewer
The @safety-reviewer team must have ≥1 member who is not the PR author. GitHub branch protection for main must include:
- `require_code_owner_reviews: true` for the above paths
- `dismiss_stale_reviews: true` (new commits require re-review)
- SAL-2 PRs require ≥2 approvals (one of which must be from `@safety-reviewer`)
Verification traceability: The PR review record (GitHub PR number + reviewer + approval timestamp) serves as evidence for verification independence in the safety case (§24.12 E1.1). This record is referenced in the MoC document (§24.14 MOC-002).
Who qualifies as an independent reviewer for SAL-2: Any engineer who:
- Did not write the code being reviewed
- Has sufficient domain knowledge to evaluate correctness (orbital mechanics familiarity for `physics/`; alerting logic familiarity for `alerts/`)
- Is designated in the `@safety-reviewer` GitHub team
Before ANSP shadow activation, the safety case custodian confirms that all SAL-2 components committed in the release have a documented independent reviewer.
18. Additional Physics Considerations
| Topic | Why It Matters | Phase |
|---|---|---|
| Solar radiation pressure (SRP) | Dominates drag above ~800 km for high A/m objects | Phase 1 (decay predictor) |
| J2–J6 geopotential | J2 alone: ~7°/day RAAN error | Phase 1 (decay predictor) |
| Attitude and tumbling | Drag coefficient 2–3× different; capture via B* Monte Carlo | Phase 2 |
| Lift during re-entry | Non-spherical fragments: tens-of-km cross-track shift | Phase 2 (breakup) |
| Maneuver detection | Active satellites maneuver; TLE-to-TLE ΔV estimation | Phase 2 |
| Ionospheric drag | Captured via NRLMSISE-00 ion density profile | Phase 1 (via model) |
| Re-entry heating uncertainty | Emissivity/melt temperatures poorly known for debris | Phase 2 |
19. Development Phases — Detailed
Phase 1: Analytical Prototype (Weeks 1–10)
Goal: Real object tracking, decay prediction with uncertainty quantification, functional Persona A/B interface. Security infrastructure fully in place before any other feature ships.
| Week | Backend Deliverable | Frontend Deliverable | Security / SRE Deliverable |
|---|---|---|---|
| 1–2 | FastAPI scaffolding, Alembic migrations, Docker Compose with Tier 2 service topology. frame_utils.py, time_utils.py. IERS EOP refresh + SHA-256 verify. Append-only DB triggers. HMAC signing infrastructure. Liveness + readiness probes on all services. GET /healthz, GET /readyz with DB + Redis checks. Dead letter queue for Celery. task_acks_late, task_reject_on_worker_lost configured. Celery queue routing (ingest vs simulation). celery-redbeat configured. Legal/compliance: users table tos_accepted_at/tos_version/tos_accepted_ip/data_source_acknowledgement fields. First-login ToS/AUP/Privacy Notice acceptance flow (blocks access until all accepted). SBOM generated via syft; CesiumJS commercial licence verified. Privacy Notice drafted and published. | Next.js scaffolding. Root layout: nav, ModeIndicator, AlertBadge, JobsPanel stub. Dark mode + high-contrast theme. CSP and security headers via Next.js middleware. ToS/AUP acceptance gate on first login (blocks dashboard until accepted). | RBAC schema + require_role(). JWT RS256 + httpOnly cookies. MFA (TOTP). Redis AUTH + ACLs. MinIO private buckets. Docker network segmentation. Container hardening. git-secrets. Bandit + ESLint security in CI. Trivy. Dependency pinning. Dependabot. security_logs + sanitising formatter. Docker Compose depends_on: condition: service_healthy wired. Documentation: docs/ directory tree created; AGENTS.md committed; initial ADRs for JWT, dual frontend, Monte Carlo chord, frame library; docs/runbooks/TEMPLATE.md + index; CHANGELOG.md first entry; docs/validation/reference-data/ with Vallado and IERS cases; docs/alert-threshold-history.md initial entry. DevOps/Platform: self-hosted GitLab CI pipeline (lint, test-backend, test-frontend, security-scan, build-and-push jobs); multi-stage Dockerfiles for all services; .pre-commit-config.yaml with all six hooks; .env.example committed with all variables documented; Makefile with dev, test, migrate, seed, lint, clean targets; Docker layer + pip + npm build cache configured; sha-<commit> image tagging in the GitLab container registry in place. Prometheus metrics: spacecom_active_tip_events, spacecom_tle_age_hours, spacecom_hmac_verification_failures_total instrumented. |
| 3–4 | Catalog module: object CRUD, TLE import. TLE cross-validation. ESA DISCOS import. Ingest Celery Beat (celery-redbeat). Hardcoded URLs, SSRF-mitigated HTTP client. WAL archiving configured. Daily backup Celery task. TimescaleDB compression policy on orbits. Retention policy scaffolded. | Object Catalog page. DataConfidenceBadge. Object Watch page stub. | Rate limiting (slowapi). Simulation parameter range validation. Prometheus: spacecom_ingest_success_total, spacecom_ingest_failure_total per source. AlertManager rule: consecutive ingest failures → warning. |
| 5–6 | Space Weather: NOAA SWPC + ESA SWS cross-validation. operational_status string. TIP message ingestion. Prometheus: spacecom_prediction_age_seconds per NORAD ID. Readiness probe: TLE staleness + space weather age checks. | SpaceWeatherWidget. Alert taxonomy: CRITICAL banner, NotificationCentre, AcknowledgeDialog. Degraded mode banner (reads readyz 207 response). | alert_events append-only verified. Alert rate-limit and deduplication. Alert storm detection. AlertManager rule: spacecom_active_tip_events > 0 AND prediction_age > 3600 → critical. |
| 7–8 | Catalog Propagator (SGP4): TEME→GCRF, CZML (J2000). Ephemeris caching. Frame transform validation. All CZML strings HTML-escaped. MC chord architecture: run_mc_decay_prediction → group(run_single_trajectory) → aggregate_mc_results. Chord result backend (Redis) sized. | Globe: real object positions, LayerPanel, clustering, urgency symbols. TimelineStrip. Live mode scrub. | WebSocket auth: cookie-based; connection limit. WS ping/pong. Prometheus: spacecom_simulation_duration_seconds histogram. |
| 9–10 | Decay Predictor: RK7(8) + NRLMSISE-00 + Monte Carlo chord. HMAC-signed output. Immutability triggers. Corridor polygon generation. Re-entry API. Validate against ≥3 historical re-entries. Monthly restore test Celery task implemented. | Mode A (Percentile Corridors). Event Detail: PredictionPanel with p05/p50/p95, HMAC status badge. TimelineGantt. Operational Overview. UncertaintyModeSelector (B/C greyed). | HMAC tamper detection E2E test. All-clear TIP cross-check guard. First backup restore test executed and passing. spacecom_simulation_duration_seconds p95 verified < 240s on Tier 2 hardware. |
Phase 2: Operational Analysis (Weeks 11–22)
| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory |
|---|---|---|---|
| 11–12 | Atmospheric Breakup: aerothermal, fragments, ballistic descent, casualty area. | Fragment impact points on globe. Fragment detail panel. | OWASP ZAP DAST against staging. |
| 13–14 | Conjunction: all-vs-all screening, Alfano probability. | Conjunction events on globe. ConjunctionPanel. | STRIDE threat model reviewed for Phase 2 surface. |
| 15–16 | Upper/Lower Atmosphere. Hazard module: fused zones, HMAC-signed, immutable, shadow_mode flag. | Mode B (Probability Heatmap): Deck.gl. UncertaintyModeSelector unlocks Mode B. | RLS multi-tenancy integration tests. Shadow records excluded from operational API (integration test). |
| 17–18 | Airspace: FIR/UIR load, PostGIS intersection. Airspace impact table. NOTAM Drafting: ICAO format, notam_drafts table, mandatory disclaimer. Shadow mode admin toggle. | AirspaceImpactPanel. NOTAM draft flow: NotamDraftViewer, disclaimer banner, review/cancel. 2D Plan View. ViewToggle. /airspace page. ShadowBanner + ShadowModeIndicator. | Regulatory disclaimer verified present on all NOTAM drafts. axe-core accessibility audit. |
| 19–20 | Report builder: bleach sanitisation, Playwright renderer (isolated, no-network, timeouts, seccomp). MinIO storage. Shadow validation schema + shadow_validations table. | ReportConfigDialog, ReportPreview, /reports page. IntegrityStatusBadge. SimulationComparison. ShadowValidationReport scaffold. | Renderer: network_mode: none enforced; sanitisation tests passing; 30s timeout verified. |
| 21–22 | Space Operator Portal: owned_objects, controlled re-entry planner (deorbit window optimiser), CCSDS export, api_keys table + lifecycle. modules.api with per-key rate limiting. Legal gate: legal opinion commissioned and received for primary deployment jurisdiction; legal_opinions table populated; shadow mode admin toggle wired to shadow_mode_cleared flag. Space-Track AUP redistribution clarification obtained (written confirmation from 18th Space Control Squadron or counsel opinion on permissible use). ECCN classification review commissioned for Controlled Re-entry Planner. GDPR compliance review: data inventory completed, lawful bases documented, DPA template drafted, erasure procedure (handle_erasure_request) implemented. | /space portal: SpaceOverview, ControlledReentryPlanner, DeorbitWindowList, ApiKeyManager, CcsdsExportPanel. Shadow mode admin toggle displays legal clearance status. | Object ownership RLS policy tested: space_operator cannot access non-owned objects. API key rate limiting verified. API Terms accepted at key creation and recorded. Jurisdiction screening at registration (OFAC/EU/UK sanctions list check). |
Phase 3: Operational Deployment (Weeks 23–32)
| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory / SRE |
|---|---|---|---|
| 23–24 | Alerts module: thresholds, email delivery, geographic filtering, alert_events. Shadow mode: alerts suppressed. ADS-B feed integration: OpenSky Network REST API (https://opensky-network.org/api/states/all); polled every 60s via Celery Beat; flight state vectors stored in adsb_states (non-hypertable; rolling 24h window); route intersection advisory module reads adsb_states to identify flights in re-entry corridors. Air Risk module initialisation: aircraft exposure scoring, time-slice aggregation, and vulnerability banding by aircraft class. Tier 3 HA infrastructure: TimescaleDB streaming replication + Patroni + etcd. Redis Sentinel (3 nodes). 4× simulation workers (64 total cores). Blue-green deployment pipeline wired. | Full alert lifecycle UI: geographic filtering, mute rules, acknowledgement audit. Route overlay on globe. AirRiskPanel by FIR/time slice. Route intersection advisory (avoidance boundary only). | Legal/regulatory: MSA template finalised by counsel; Regulatory Sandbox Agreement template finalised. First ANSP shadow deployment executed under signed Regulatory Sandbox Agreement and confirmed legal clearance. GDPR breach notification procedure tested (tabletop exercise). Professional indemnity, cyber liability, and product liability insurance confirmed in place. SRE: Patroni failover tested (primary killed; standby promotes; backend reconnects; verify zero lost predictions). Redis Sentinel failover tested. SLO baseline measurements taken on Tier 3 hardware. |
| 25–26 | Feedback: prediction vs. outcome. Density scaling recalibration. Maneuver detection. Shadow validation report generation. Historical replay corpus: Long March 5B, Columbia-derived cloud case, and documented crossing-scenario set. Conservative-baseline comparison reporting for airspace closures. Launch safety module. Deployment freeze gate (CI/CD: block deploy if CRITICAL/HIGH alert active). ANSP communication plan implemented (degradation push + email). Incident response runbooks written (DB failover, Celery recovery, HMAC failure, ingest failure). | Prediction accuracy dashboard. Historical comparison. ShadowValidationReport. Air-risk replay comparison views. /space Persona F workspace. Launch safety portal. | Vault / cloud secrets manager. Secrets rotation. Begin first ANSP shadow mode deployment. SRE: PagerDuty/OpsGenie integrated with Prometheus AlertManager. SEV-1/2/3/4 routing configured. First on-call rotation established. |
| 27–28 | Mode C binary MC endpoint. Load testing (100 users, <2s CZML p95; MC p95 < 240s). Prometheus + Grafana: three dashboards (Operational Overview, System Health, SLO Burn Rate). Full AlertManager rules. ECSS compliance artefacts: SMP, VVP, PAP, DMP. MinIO lifecycle rules: MC blobs > 90 days → cold tier. | Mode C (Monte Carlo Particles). UncertaintyModeSelector unlocks Mode C. Final Playwright E2E suite. Grafana Operational Overview embedded in /admin. | External penetration test (auth bypass, RBAC escalation, SSRF, XSS→Playwright, WS auth bypass, data integrity, object ownership bypass, API key abuse). All Critical/High remediated. Load test: SLO p95 targets verified under 100-user concurrent load. |
| 29–32 | Regulatory acceptance package: safety case framework, ICAO data quality mapping, shadow validation evidence, SMS integration guide. TRL 6 demonstration. Data archival pipeline (Parquet export to MinIO cold before chunk drop). Storage growth verified against projections. ESA bid legal: background IP schedule documented; Consortium Agreement with academic partner signed (IP ownership, publication rights, revenue share); SBOM submitted as part of ESA artefact package. ECCN classification determination received; export screening process in place for all new customer registrations. ToS version updated to reflect any regulatory feedback from first ANSP deployments; re-acceptance triggered. | Regulatory submission report type. TRL demonstration artefacts. | SOC 2 Type I readiness review. Production runbook + incident response per threat scenario. ECSS compliance review. Monthly restore test passing in CI. Error budget dashboard showing < 10% burn rate. |
20. Key Decisions and Tradeoffs
| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Propagator split | SGP4 catalog + numerical decay | SGP4 for everything | SGP4 diverges by days–weeks for re-entry time prediction |
| Numerical integrator | RK7(8) adaptive + NRLMSISE-00 | poliastro Cowell | Direct force model control |
| Frame library | astropy | Manual SOFA Fortran | Handles IERS EOP; well-tested IAU 2006 |
| Atmospheric density | NRLMSISE-00 (P1), JB2008 option (P2) | Simple exponential | Community standard; captures solar cycle |
| Breakup model | Simplified ORSAT-like | Full DRAMA/SESAM | DRAMA requires licensing; simplified recovers ~80% utility |
| Uncertainty visualisation | Three modes, phased (A→B→C), user-selectable | Single fixed mode | Serves different personas; operational users need corridors, analysts need heatmaps |
| JWT algorithm | RS256 (asymmetric) | HS256 (shared secret) | Compromise of one service does not expose signing key to all services |
| Token storage | httpOnly Secure SameSite=Strict cookie | localStorage | XSS cannot read httpOnly cookies; localStorage is trivially exfiltrated |
| Token revocation | DB refresh_tokens table | Redis-only | Revocations survive restarts; enables rotation-chain audit |
| MFA | TOTP (RFC 6238) required for all roles | Optional MFA | Aviation authority context; government procurement baseline |
| Secrets management | Docker secrets (P1 prod) → Vault (P3) | Env vars only | Env vars appear in process listings and crash dumps; no audit trail |
| Alert integrity | Backend-only generation on verified data | Client-triggered alerts | Prevents false alert injection via API |
| Prediction integrity | HMAC-signed, immutable after creation | Mutable with audit log | Tamper-evident at database level; modification is impossible, not just logged |
| Multi-tenancy | RLS at database layer + organisation_id | Application-layer only | DB-level enforcement cannot be bypassed by application bugs |
| Renderer isolation | Separate renderer container, no external network | Playwright in backend container | Limits blast radius of XSS→SSRF escalation |
| Server state | TanStack Query | Zustand for everything | Automatic cache, background refetch; Zustand is not a data cache |
| Navigation model | Task-based (events, airspace, analysis) | Module-based | Users think in tasks, not modules |
| Report rendering | Playwright headless server-side | Client-side canvas | Reliable at print resolution; consistent; not affected by client GPU |
| Monorepo | Monorepo | Separate repos | Small team, shared types, simpler CI |
| ORM | SQLAlchemy 2.0 | Raw SQL | Mature async support; Alembic migrations |
| Domain architecture | Dual front door (aviation + space portal), shared physics core | Single aviation-only product | Space operator revenue stream; ESA bid credibility; space credibility supports aviation trust |
| Space operator object scoping | PostgreSQL RLS on owned_objects join | Application-layer filtering only | DB-level enforcement; prevents application bugs from leaking cross-operator data |
| NOTAM output | Draft only + mandatory disclaimer; never submitted | System-assisted NOTAM submission | SpaceCom is not a NOTAM originator; keeps platform in purely informational role; reduces regulatory approval burden |
| Reroute module scope | Strategic pre-flight avoidance boundary only | Specific alternate route generation | Specific routes require ATC integration and aircraft performance data SpaceCom does not have; avoidance boundary keeps SpaceCom legally defensible |
| Shadow mode | Org-level flag; all alerts suppressed; records segregated | Per-prediction flag | Enables ANSP trial deployments; accumulates validation evidence for regulatory acceptance; segregation prevents operational confusion |
| Controlled re-entry planner output | CCSDS-format manoeuvre plan + risk-scored deorbit windows | Aviation-format only | Space operators submit to national regulators and ops centres in CCSDS; Zero Debris Charter evidence format |
| API access | Separate API keys (not session JWT); per-key rate limiting | Session cookie only | Space operators integrate SpaceCom into operations centres programmatically; API keys are revocable machine credentials |
| MC parallelism model | Celery group + chord (fan-out sub-tasks across worker pool) | multiprocessing.Pool within single task | Chord distributes across all worker containers; Pool limited to one container's cores; chord scales horizontally |
| Worker topology | Two separate Celery pools: ingest and simulation | Single shared queue | Runaway simulation jobs cannot starve TLE ingestion; critical for reliability during active TIP events |
| Celery Beat HA | celery-redbeat (Redis-backed, distributed locking) | Standard Celery Beat (single process) | Beat SPOF means scheduled ingest silently stops; redbeat enables multiple instances with leader election |
| DB HA | TimescaleDB streaming replication + Patroni auto-failover | Single-instance DB | RPO = 0 for critical tables; 15-minute RTO requires automatic failover, not manual |
| Redis HA | Redis Sentinel (3 nodes) | Single Redis | Master failure without Sentinel means all Celery queues and WebSocket pub/sub stop |
| Deployment gate | CI/CD checks for active CRITICAL/HIGH alerts before deploying | Manual judgement | Prevents deployments during active TIP events; protects operational continuity |
| MC blade sizing | 16 vCPU per simulation worker container | Smaller containers | MC chord sub-tasks fill all available cores; below 16 cores p95 SLO of 240s is not met |
| Temporal uncertainty display | Plain window range ("08h–20h from now / most likely ~14h") for Persona A/C; p05/p50/p95 UTC for Persona B | ± Nh notation everywhere | ± implies symmetric uncertainty which re-entry distributions are not; window range is operationally actionable |
| Space weather impact communication | Operational buffer recommendation ("+2h beyond 95th pct") rather than % deviation | Percentage string | Percentage is meaningless without a known baseline; buffer hours are immediately usable by an ops duty manager |
| TLS termination | Caddy with automatic ACME (internet-facing) / internal CA (air-gapped) | nginx + manual certs | Caddy handles cert lifecycle automatically; decision tree in §34 |
| Pagination | Cursor-based (created_at, id) | Offset-based | Offset degrades to full-table scan at 7-year retention depth; cursor is O(1) regardless of dataset size |
| CZML delta protocol | ?since=<iso8601> parameter; max 5 MB full payload; X-CZML-Full-Required header on stale client | Full catalog always | 100-object catalog at 1-min cadence is ~10–50 MB/hr per connected client without delta; delta reduces this to <500 KB/hr |
| MC concurrency gate | Per-org Redis semaphore; 1 concurrent MC run (Phase 1); 429 + Retry-After on limit | Unbounded fan-out | 5 concurrent MC requests = 2,500 sub-tasks queued; p95 SLO collapses without backpressure |
| TimescaleDB compress_after | 7 days for orbits (not 1 day) | Compress as soon as possible | Compressing hot chunks forces decompress on every write; 1-day compress_after causes 50–200ms write latency thrash |
| Renderer memory limit | mem_limit: 4g Docker cap on renderer container | No memory limit | Chromium print rendering at A4/300DPI consumes 2–4 GB; 4 uncapped renderer instances can OOM a 32 GB node |
| Static asset caching | Cloudflare CDN (internet-facing); nginx sidecar (on-premise) | No CDN | CesiumJS bundle ~5–10 MB; 100 concurrent first-load = 500 MB–1 GB burst without caching |
| WAF/DDoS protection | Upstream provider (Cloudflare/AWS Shield) for internet-facing; network perimeter for air-gapped | Application-layer rate limiting only | Application-layer is insufficient for volumetric attacks; must be at ingress |
| Multi-region deployment | Single region per customer jurisdiction; separate instances, not shared cluster | Active-active multi-region | Data sovereignty; simpler compliance certification; Phase 1–3 customer base doesn't justify multi-region cost |
| MinIO erasure coding | EC:2 (4-node) | EC:4 or RAID | EC:2 tolerates 1 write failure / 2 read failures; balanced between protection and storage efficiency at 4 nodes |
| DB connection routing | PgBouncer as single stable connection target | Direct Patroni primary connection | Patroni failover transparent to application; stable DNS target through primary changes |
| Egress filtering | Host-level UFW/nftables allow-list (Tier 2); Calico/Cilium network policy (Tier 3) | Trust Docker network isolation | Docker isolation is inter-network only; outbound internet egress unrestricted without host-level filtering |
| Mode-switch dialogue | Explicit current-mode + target-mode + consequences listed; Cancel left, destructive action right | Generic "Are you sure?" | Aviation HMI conventions; listed consequences prevent silent simulation-during-live error |
| Future-preview temporal wash | Semi-transparent overlay + persistent label on event list when timeline scrubber is not at current time | No visual distinction | Prevents controller from acting on predicted-future data as though it is current operational state |
| Simulation block during active alerts | Optional org-level disable_simulation_during_active_events flag | Always allow simulation entry | Prevents an analyst accidentally entering simulation while CRITICAL alerts require attention in the same ops room |
| Prediction superseding | Write-once superseded_by FK on reentry_predictions / simulations | Mutable or delete | Preserves immutability guarantee; gives analysts a way to mark outdated predictions without removing the audit record |
| CRITICAL acknowledgement gate | 10-character minimum free-text field; two-step confirmation modal | Single click | Prevents reflexive acknowledgement; creates meaningful action record for every acknowledged CRITICAL event |
| Multi-ANSP coordination panel | Shared acknowledgement status and coordination notes across ANSP orgs on the same event | Out-of-band only | Creates shared digital situational awareness record without replacing voice coordination; reduces risk of conflicting parallel NOTAMs |
| Legal opinion timing | Phase 2 gate (before shadow deployment); not Phase 3 | Phase 3 task | Common law duty of care may attach regardless of UI disclaimers; liability limitation must be in executed agreements before any ANSP relies on the system |
| Commercial contract instruments | Three instruments: MSA + AUP click-wrap + API Terms | Single platform ToS | Each instrument addresses a different access pathway; API access by Persona E/F must have separate terms recorded against the key |
| Shadow mode legal gate | legal_opinions.shadow_mode_cleared must be TRUE before shadow mode can be activated for an org | Admin can enable freely | Shadow deployment is a formal regulatory activity; without a completed legal opinion it exposes SpaceCom to uncapped liability in the deployment jurisdiction |
| GDPR erasure vs. retention | Pseudonymise user references in append-only tables on erasure request; never delete safety records | Hard delete on request | UN Liability Convention requires 7-year retention; GDPR right to erasure is satisfied by removing the link to the individual, not the record itself |
| Space-Track data redistribution | Obtain written clarification from 18th SCS before exposing TLE/CDM data via the SpaceCom API | Assume permissible | Space-Track AUP prohibits redistribution to unregistered parties; violation could result in loss of Space-Track access, disabling the platform's primary data source |
| OSS licence compliance | CesiumJS commercial licence required for closed-source deployment; SBOM generated from Phase 1 | Assume all dependencies are permissively licensed | CesiumJS AGPLv3 requires source disclosure for network-served applications; undiscovered licence violations create IP risk in ESA bid |
| Insurance | Professional indemnity + cyber liability + product liability required before operational deployment | No insurance requirement | Aviation safety context; potential claims from incorrect predictions that inform airspace decisions could exceed SpaceCom's balance sheet without coverage |
| Connection pooling | PgBouncer transaction-mode pooler between all app services and TimescaleDB | Direct connections from app | Tier 3 connection count (2× backend + 4× workers + 2× ingest) exceeds max_connections=100 without a pooler; Patroni failover updates only pgBouncer |
| Redis eviction policy | noeviction for Celery/redbeat (separate DB index); allkeys-lru for application cache | Single Redis with one policy | Broker message eviction causes silent job loss; cache eviction is acceptable |
| Bulk export implementation | Celery task → MinIO → presigned URL (async offload pattern) | Streaming response from API handler | Full catalog export can be gigabytes; materialising in API handler risks OOM on the backend container |
| Analytics query routing | Patroni standby replica for Persona B/F analytics; primary for operational reads | All reads to primary | Analytics queries during a TIP event would compete with operational reads on the primary; standby already provisioned at Tier 3 |
| SQLAlchemy lazy loading | lazy="raise" on all relationships | Default lazy loading | Async SQLAlchemy silently blocks the event loop on lazy-loaded relationships; raise converts silent N+1s into loud development-time errors |
| CZML cache strategy | Per-object fragment cache + full catalog assembly; TTL keyed to last propagation job | No cache; query DB on each request | CZML catalog fetch at 100 objects = 864k rows; uncached this misses the 2s p95 SLO under concurrent load |
| Hypertable chunk interval (orbits) | 1-day chunks (not default 7-day) | Default 7-day | 72h CZML query spans 3 × 1-day chunks vs 11 × 7-day chunks — chunk exclusion is far less effective with the default |
| Continuous aggregate for F10.7 81-day avg | TimescaleDB continuous aggregate space_weather_daily | Compute from raw rows per request | At 100 concurrent users, 100 identical scans of 11,664 raw rows; the continuous aggregate reduces this to a single-row lookup |
| CI/CD orchestration | GitHub Actions | Jenkins / GitLab CI | Project is GitHub-native; Actions has OIDC → GHCR; no separate CI server to operate |
| Container image tags | sha-<commit> as canonical immutable tag; semantic version alias for releases | latest tag only | latest is mutable and non-reproducible; sha-<commit> gives exact traceability from deployed image back to source commit |
| Multi-stage Docker builds | Builder stage (full toolchain) + runtime stage (distroless/slim) | Single-stage with all tools | Eliminates build toolchain, compiler, and dev dependencies from production image; typically reduces image size by 60–80% |
| Local dev hot-reload | Backend: FastAPI --reload via bind-mounted ./backend volume; Frontend: Next.js Vite HMR | Rebuild container on change | Full container rebuild per code change adds 30–90s per iteration; volume mount + process reload is < 1s |
| .env.example contract | .env.example with all required variables, descriptions, and stage flags committed to repo; actual .env in .gitignore | Ad-hoc variable discovery from runtime errors | Engineers must be able to run cp .env.example .env and have a working local stack within 15 minutes of cloning |
| Staging environment strategy | main branch continuously deployed to staging via GitHub Actions; production deploy requires manual approval gate after staging smoke tests pass | Manual staging deploys | Reduces time-to-detect integration regressions; staging serves as TRL artefact evidence environment |
| Secrets rotation | Per-secret rotation runbook: Space-Track credentials, JWT signing keys, ANSP tokens; old + new key both valid during 5-minute transition window; security_logs entry required; rotated via Vault dynamic secrets in Phase 3 | Manual rotation with downtime | Aviation context: key rotation must not cause service interruption; zero-downtime rotation is a reliability requirement, not a convenience |
| Build cache strategy | Docker layer cache: cache-from/cache-to targeting GHCR in GitHub Actions; pip wheel cache: actions/cache keyed on requirements.txt hash; npm cache: actions/cache keyed on package-lock.json hash | No cache; full rebuild each push | Without cache, a full rebuild takes 8–12 minutes; with cache, incremental pushes take 2–3 minutes — critical for CI as a useful merge gate |
| Image retention policy | Tagged release images kept indefinitely; untagged/orphaned images purged weekly via GHCR lifecycle policy; staging images retained 30 days; dev branch images retained 7 days | No policy; manual cleanup | Unmanaged GHCR storage grows unboundedly; stale images also represent unaudited CVE surface |
| Pre-commit hook completeness | Six hooks: detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff | git-secrets only | git-secrets scans only for known secret patterns; detect-secrets adds entropy analysis; hadolint prevents insecure Dockerfile patterns; sqlfluff catches migration anti-patterns before code review |
| alembic check in CI | CI job runs alembic check to detect SQLAlchemy model/migration divergence; fails if models have unapplied changes | Only run migrations, no divergence check | SQLAlchemy models can diverge from migrations silently; alembic check catches the gap before it reaches production |
| FIR boundary data source | EUROCONTROL AIRAC (ECAC states) + FAA Digital-Terminal Procedures (US) + OpenAIP (fallback); 28-day update cadence | Manually curated GeoJSON, updated ad hoc | FIR boundaries change on AIRAC cycles; stale boundaries produce wrong airspace intersection results during live TIP events |
| ADS-B data source | OpenSky Network REST API (Phase 3 MVP); commercial upgrade path to Flightradar24 or FAA SWIM ADS-B if required | Direct receiver hardware | OpenSky is free, global, and sufficient for route overlay and intersection advisory; commercial upgrade only if coverage gaps identified in ANSP trials |
| CCSDS OEM reference frame | GCRF (Geocentric Celestial Reference Frame); time system UTC; OBJECT_ID = NORAD catalog number; missing international designator populated as UNKNOWN | ITRF or TEME | GCRF is the standard output of SpaceCom's frame transform pipeline; downstream mission control tools expect GCRF for propagation inputs |
| CCSDS CDM field population | SpaceCom populates: HEADER, RELATIVE_METADATA, OBJECT1/2 identifiers, state vectors, covariance (if available); fields not held by SpaceCom emitted as N/A per CCSDS 508.0-B-1 §4.3 | Omit empty fields | N/A is the CCSDS-specified sentinel for unknown values; silent omission causes downstream parser failures |
| CDM ingestion display | Space-Track CDM Pc displayed alongside SpaceCom-computed Pc with explicit provenance labels; > 10× discrepancy triggers DATA_CONFIDENCE warning on conjunction panel | Show only one value | Space operators need both values; discrepancy without explanation erodes trust in both |
| WebSocket event schema | Typed event envelope with type discriminator, monotonic seq, and ts; reconnect with ?since_seq= replay of up to 200 events / 5-minute ring buffer; resync_required on stale reconnect | Schema-free JSON stream | Untyped streams require every consumer to reverse-engineer the schema; a typed schema enables typed client generation |
| Alert webhook delivery | At-least-once POST to registered HTTPS endpoint; HMAC-SHA256 signature; 3 retries with exponential backoff; degraded status after 3 failures; auto-disable after 10 consecutive failures | WebSocket / email only | ANSPs with existing dispatch infrastructure (AFTN, internal webhook receivers) cannot integrate via browser WebSocket; webhooks are the programmatic last mile |
| API versioning | /api/v1 base; breaking changes require /api/v2 parallel deployment; 6-month support overlap; Deprecation / Sunset headers (RFC 8594); 3-month written notice to API key holders | No versioning policy; breaking changes deployed ad hoc | Space operators building operations centre integrations need stable contracts; silent breaking changes disable their integrations |
| SWIM integration path | Phase 2: GeoJSON structured export; Phase 3: FIXM review + EUROCONTROL SWIM-TI AMQP publish endpoint | Not applicable | European ANSP procurement increasingly requires SWIM compatibility; GeoJSON export is low-cost first step; full SWIM-TI is Phase 3 |
| Space-Track API contract test | Integration test asserts expected JSON keys present in Space-Track response; ingest health alert fires after 4 consecutive hours with 0 successful Space-Track records | No contract test; breakage discovered at runtime | Space-Track API has had historical breaking changes; silent format change means ingest returns no data while health metrics appear normal |
| TLE checksum validation | Modulo-10 checksum on both lines verified before DB write; BSTAR range check; failed records logged to security_logs type INGEST_VALIDATION_FAILURE | Accept TLE at face value | Corrupted TLEs (network errors, encoding issues) would propagate incorrect state vectors without validation |
| Model card | docs/model-card-decay-predictor.md maintained alongside the model; covers validated orbital regime envelope, known failure modes, systematic biases, and performance by object type | Accuracy statement only in §24.3 | Regulators and ANSPs require a documented operational envelope, not just a headline accuracy figure; ESA TRL artefact requirement |
| Historical backcast selection | Validation report explicitly documents selection criteria, identifies underrepresented object categories, and states accuracy conditional on object type | Single unconditional accuracy figure | Observable re-entry population is biased toward large well-tracked objects; publishing an unconditional accuracy figure misrepresents model generalisation |
| Out-of-distribution detection | ood_flag = TRUE and ood_reason set at prediction time if any input falls outside validated bounds; UI shows mandatory warning callout | Serve all predictions identically | NRLMSISE-00 calibration domain does not include tumbling objects, very high area-to-mass ratio, or objects with no physical property data |
| Prediction staleness warning | prediction_valid_until = p50_reentry_time - 4h; UI warns independently of system-level TLE staleness if NOW() > prediction_valid_until and not superseded | No time-based staleness on predictions | An hours-old prediction for an imminent re-entry has implicitly grown uncertainty; operators need a signal independent of the system health banner |
| Alert threshold governance | Thresholds documented with rationale; change approval requires engineering lead sign-off + shadow-mode validation period; change log maintained in docs/alert-threshold-history.md | Thresholds set in code with no governance | CRITICAL trigger (window < 6h, FIR intersection) has airspace closure consequences; undocumented threshold changes cannot be reviewed by regulators or ANSPs |
| FIR intersection auditability | alert_events.fir_intersection_km2 and intersection_percentile recorded at alert generation; UI shows "p95 corridor intersects ~N km² of FIR XXXX" | Alert log shows only "intersects FIR XXXX" | Intersection without area and percentile context is not auditable; regulators and ANSPs need to know how much intersection triggered the alert |
| Recalibration governance | Recalibration requires hold-out validation dataset, minimum accuracy improvement threshold, sign-off authority, rollback procedure, and notification to ANSP shadow partners | Recalibration run and deployed without gates | Unchecked recalibration can silently degrade accuracy for object types not in the calibration set |
| Model version governance | Changes classified as patch/minor/major; major changes require active prediction re-runs with supersession + ANSP notification; rollback path documented | No governance; model updated silently | A major model version change producing materially different corridors without re-running active predictions creates undocumented divergence between what ANSPs are seeing and current best predictions |
| Adverse outcome monitoring | prediction_outcomes table records observed re-entry outcomes against predictions; quarterly accuracy report generated from feedback pipeline; false positive/negative rates in Grafana | No post-deployment accuracy tracking | Without outcome monitoring SpaceCom cannot demonstrate performance within acceptable bounds to regulators; shadow validation reports are episodic, not continuous |
| Geographic coverage annotation | FIR intersection results carry data_coverage_quality flag per FIR; OpenAIP-sourced boundaries flagged as lower confidence | All FIR intersections treated equally | AIRAC coverage varies by region; operators in non-ECAC regions receive lower-quality intersection assessments without knowing it |
| Public transparency report | Quarterly aggregate accuracy/reliability report published (no personal data); covers prediction count, backcast accuracy, error rates, known limitations | No public reporting | Civil aviation safety tools operate in a regulated transparency environment; ESA bid credibility and regulatory acceptance require demonstrable performance |
| docs/ directory structure | Canonical tree defined in §12.1; all documentation files live at known paths committed to the repo | Ad-hoc file creation by individual engineers | Documentation that exists only in prose references gets created inconsistently or not at all |
| Architecture Decision Records | MADR-format ADRs in docs/adr/; one per consequential decision in §20; linked from relevant code via inline comment | §20 table in master plan only | Engineers working in the repo cannot find decision rationale without reading a 5000-line plan document |
| OpenAPI documentation standard | Every public endpoint has summary, description, tags, and at least one responses example; enforced by CI check | Auto-generated stubs only | Auto-generation produces syntactically correct docs that are useless to API integrators (Persona E/F) |
| Runbook format | Standard template in docs/runbooks/TEMPLATE.md; required sections: Trigger, Severity, Preconditions, Steps, Verification, Rollback, Notify; runbook index maintained | Free-form runbooks written ad hoc | Runbooks written under pressure without a template consistently omit the rollback and notification steps |
| Docstring standard | Google-style docstrings required on all public functions in propagator/, reentry/, breakup/, conjunction/, integrity.py; parameters include physical units | No docstring requirement | Physics functions without documented units and limitations cannot be reviewed or audited by third-party evaluators for ESA TRL |
| Validation procedure | §17 specifies reference data location, run commands, pass/fail tolerances per suite; docs/validation/README.md describes how to add new cases | Checklist of what to validate without procedure | A third party cannot reproduce the validation without knowing where the reference data is and what tolerance constitutes a pass |
| User documentation | Phase 2 delivers aviation portal guide + API quickstart; Phase 3 delivers space portal guide + in-app contextual help; stored in docs/user-guides/ | No user documentation | ANSP SMS acceptance requires user documentation; aviation operators cannot learn an unfamiliar safety tool from the UI alone |
| CHANGELOG.md format | Keep a Changelog conventions; human-maintained; one entry per release with Added/Changed/Deprecated/Removed/Fixed/Security sections | No format specified | Changelogs written by different engineers without a format are unusable by operators and regulators |
| AGENTS.md | Project-root file defining behaviour guidance for AI coding agents; specifies codebase conventions, test requirements, and safety-critical file restrictions; committed to repo | Untracked file, undefined purpose | An undocumented AGENTS.md is either ignored or followed inconsistently, undermining its purpose |
| Test documentation | Module docstrings on physics/security test files state the invariant, reference source, and operational significance of failure; docs/test-plan.md lists all suites with scope and blocking classification | No test documentation requirement | ECSS-Q-ST-80C requires a test specification as a separate deliverable from the test code |
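The alert-webhook row above specifies an HMAC-SHA256 signature on every delivery. A minimal sketch of signing and constant-time verification — the header name and secret handling here are illustrative assumptions, not the implemented contract:

```python
import hashlib
import hmac

def sign_payload(secret: bytes, body: bytes) -> str:
    # Hex-encoded HMAC-SHA256 over the raw request body.
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_payload(secret: bytes, body: bytes, received_sig: str) -> bool:
    # compare_digest avoids timing side-channels on signature comparison.
    return hmac.compare_digest(sign_payload(secret, body), received_sig)

# Receiver side: reject the POST unless the signature matches the body.
secret = b"example-shared-secret"            # assumed per-endpoint secret
body = b'{"type": "alert.new", "seq": 42}'
sig = sign_payload(secret, body)             # carried in an assumed header, e.g. X-SpaceCom-Signature
assert verify_payload(secret, body, sig)
assert not verify_payload(secret, b'{"tampered": true}', sig)
```

Verifying over the raw bytes (not the parsed JSON) is the design choice that makes the check independent of the receiver's JSON library.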
21. Definition of Done per Phase
Phase 1 Complete When:
Physics and data:
- 100+ real objects tracked with current TLE data
- Frame transformation unit tests pass against IERS/Vallado reference cases (round-trip error < 1 m)
- SGP4 CZML uses J2000 INERTIAL frame (not TEME)
- Space weather polled from NOAA SWPC; cross-validated against ESA SWS; operational status widget visible
- TIP messages ingested and displayed for decaying objects
- TLE cross-validation flags discrepancies > threshold for human review
- IERS EOP hash verification passing
- Decay predictor: ≥3 historical re-entry backcast windows overlap actual events
- Mode A (Percentile Corridors): p05/p50/p95 swaths render with correct visual encoding
- TimelineGantt displays all active events; click-to-navigate functional
- LIVE/REPLAY/SIMULATION mode indicator correct on all pages
Security (all required before Phase 1 is considered complete):
- RBAC enforced: automated test_rbac.py verifies every endpoint returns 403 for insufficient role, 401 for unauthenticated
- JWT RS256 with httpOnly cookies; localStorage token storage absent from codebase (grep check in CI)
- MFA (TOTP) enforced for all roles; recovery codes functional
- Rate limiting: 429 responses verified by integration tests for all configured limits
- Simulation parameter range validation: out-of-range values return 400 with clear message
- Prediction HMAC: tamper test (direct DB row modification) triggers 503 + CRITICAL security_log entry
- alert_events append-only trigger: UPDATE/DELETE raise exception (verified by test)
- reentry_predictions immutability trigger: same (verified by test)
- Redis AUTH enabled; default user disabled; ACL per service verified
- MinIO: all buckets verified private; direct object URL returns 403; pre-signed URL required
- Docker: all containers verified non-root (docker inspect check in CI)
- Docker: network segmentation verified — frontend container cannot reach database port
- Bandit: 0 High severity findings in CI
- ESLint security: 0 High findings in CI
- Trivy: 0 Critical/High CVEs in all container images
- CSP headers present on all pages; verified by Playwright E2E test
- axe-core: 0 critical, 0 serious violations on all pages (CI check)
- WCAG 2.1 AA colour contrast: automated check passes
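The prediction-HMAC gate in this checklist amounts to recomputing a row-level MAC and failing closed on mismatch. A hedged sketch — the protected field set, key handling, and field names are assumptions for illustration:

```python
import hashlib
import hmac
import json

KEY = b"hmac-signing-key"  # in production this would come from a secret store, not source

def row_mac(row: dict) -> str:
    # Canonical JSON over a fixed, sorted field set keeps the MAC stable
    # regardless of dict insertion order.
    protected = {k: row[k] for k in sorted(("object_id", "p50_reentry_time", "corridor"))}
    payload = json.dumps(protected, sort_keys=True).encode()
    return hmac.new(KEY, payload, hashlib.sha256).hexdigest()

def verify_row(row: dict) -> bool:
    # On mismatch the API layer would return 503 and write a CRITICAL security_log entry.
    return hmac.compare_digest(row_mac(row), row["hmac"])

row = {"object_id": 25544, "p50_reentry_time": "2025-03-01T12:00:00Z", "corridor": "POLYGON(...)"}
row["hmac"] = row_mac(row)
assert verify_row(row)
row["p50_reentry_time"] = "2025-03-02T12:00:00Z"   # simulated direct DB tampering
assert not verify_row(row)
```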
UX:
- Globe: object clustering active at global zoom; urgency symbols correct (colour-blind-safe)
- DataConfidenceBadge visible on all object detail and prediction panels
- UncertaintyModeSelector visible; Mode B/C greyed with "Phase 2/3" label
- JobsPanel shows live sample progress for running decay jobs
- Shared deep links work: /events/{id} loads correct event; globe focuses on corridor
- All pages keyboard-navigable; modal focus trap verified
- Report generation: Operational Briefing type functional; PDF includes globe corridor map
Human Factors (Phase 1 items — all required before Phase 1 is considered complete):
- Event cards display window range notation (Window: Xh–Yh from now / Most likely ~Zh from now); no ± notation appears in operational-facing UI (grep check)
- Mode-switch dialogue: switching to SIMULATION shows current mode, target mode, and "alerts suppressed" consequence; Cancel left, Switch right; Playwright E2E test verifies dialogue content
- Future-preview temporal wash: dragging timeline scrubber past current time applies overlay and PREVIEWING +Xh label to event panel; alert badges show "(projected)"; verified by Playwright test
- CRITICAL acknowledgement: two-step flow (banner → confirmation modal); Confirm button disabled until Action taken field ≥ 10 characters; verified by Playwright test
- Audio alert: non-looping two-tone chime plays once on CRITICAL alert; stops on acknowledgement; does not play in SIMULATION or REPLAY mode; verified by integration test with audio mock
- Alert storm meta-alert: > 5 CRITICAL alerts within 1 hour generates Persona D meta-alert with disambiguation prompt (verified by test with synthetic alerts)
- Onboarding state: new organisation with no FIRs configured sees three-card setup prompt on first login (Playwright test)
- Degraded mode banner: /readyz 207 response triggers correct per-degradation-type operational guidance text in UI (integration test for each degradation type: space weather stale, TLE stale)
- superseded_by constraint: setting superseded_by on a prediction a second time raises DB exception (integration test); UI shows ⚠ Superseded banner on any prediction where superseded_by IS NOT NULL
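The window-range notation item reduces to a deterministic formatter from p05/p50/p95 times to operator-facing text that never emits ± notation. A sketch under assumed rounding rules (nearest whole hour):

```python
from datetime import datetime, timedelta, timezone

def window_label(now: datetime, p05: datetime, p50: datetime, p95: datetime) -> str:
    # Hours-from-now, rounded to the nearest whole hour; deliberately never uses ±.
    def hrs(t: datetime) -> int:
        return round((t - now).total_seconds() / 3600)
    return f"Window: {hrs(p05)}h\u2013{hrs(p95)}h from now / Most likely ~{hrs(p50)}h from now"

now = datetime(2025, 3, 1, 0, 0, tzinfo=timezone.utc)
label = window_label(now, now + timedelta(hours=6), now + timedelta(hours=11), now + timedelta(hours=18))
assert label == "Window: 6h\u201318h from now / Most likely ~11h from now"
assert "\u00b1" not in label   # mirrors the grep check: no ± in operational UI text
```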
Legal / Compliance (Phase 1 items — all required before Phase 1 is considered complete):
- Space-Track AUP architectural decision gate (Finding 9): written AUP clarification obtained from 18th Space Control Squadron or legal counsel opinion; docs/adr/0016-space-track-aup-architecture.md committed with Path A (shared ingest) or Path B (per-org credentials) decision recorded and evidenced; ingest architecture finalised accordingly. This is a blocking Phase 1 decision — ingest code must not be written until the path is decided.
- ToS / AUP / Privacy Notice acceptance gate: first login blocks dashboard access until all three documents are accepted; users.tos_accepted_at, users.tos_version, users.tos_accepted_ip populated on acceptance (integration test: unauthenticated attempt to skip returns 403)
- ToS version change triggers re-acceptance: bump tos_version in config; verify existing users are blocked on next login until they re-accept (integration test)
- CesiumJS commercial licence executed and stored at legal/LICENCES/cesium-commercial.pdf; legal_clearances.cesium_commercial_executed = TRUE — blocking gate for any external demo (§29.11 F1)
- SBOM generated at build time via syft (SPDX-JSON, container image) + pip-licenses + license-checker-rseidelsohn (dependency manifests); stored in docs/compliance/sbom/ as versioned artefacts; all dependency licences reviewed against legal/OSS_LICENCE_REGISTER.md; CI pip-licenses --fail-on gate includes GPL/AGPL/SSPL; no unapproved licence in transitive closure (§29.11 F2, F10)
- legal/LGPL_COMPLIANCE.md created documenting poliastro LGPL dynamic linking compliance and PostGIS GPLv2 linking exception (§29.11 F4, F9)
- legal/LICENCES/timescaledb-licence-assessment.md and legal/LICENCES/redis-sspl-assessment.md created with licence assessment sign-off (§29.11 F5, F6)
- legal_opinions table present in schema; admin UI shows legal clearance status per org; shadow mode toggle displays warning if shadow_mode_cleared = FALSE
- GDPR breach notification procedure documented in the incident response runbook; tabletop exercise completed with the engineering team
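The ToS acceptance and re-acceptance gates above can be sketched as a pure check of the recorded fields against the configured version. Field names mirror the users schema; CURRENT_TOS_VERSION and the dict representation are illustrative assumptions:

```python
from datetime import datetime, timezone

CURRENT_TOS_VERSION = "2025-01"  # assumed config value; bumping it forces re-acceptance

def tos_gate(user: dict) -> bool:
    """Return True if the user may proceed past login; False blocks until re-acceptance."""
    return (
        user.get("tos_accepted_at") is not None
        and user.get("tos_version") == CURRENT_TOS_VERSION
    )

def record_acceptance(user: dict, ip: str) -> None:
    # Populates the three acceptance fields checked by the integration test.
    user["tos_accepted_at"] = datetime.now(timezone.utc)
    user["tos_version"] = CURRENT_TOS_VERSION
    user["tos_accepted_ip"] = ip

user = {}
assert not tos_gate(user)              # first login: blocked until acceptance
record_acceptance(user, "203.0.113.1")
assert tos_gate(user)                  # accepted current version: allowed through
user["tos_version"] = "2024-06"        # config version later bumped past this
assert not tos_gate(user)              # stale version: blocked again
```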
Infrastructure / DevOps (all required before Phase 1 is considered complete):
- Docker Compose starts full stack with single command (make dev)
- make test executes pytest + vitest in one command; all tests pass on a clean clone
- make migrate runs all Alembic migrations against a fresh DB without error
- make seed loads fixture data; globe shows test objects on first load
- .env.example present with all required variables documented; a new engineer can reach a working local stack in ≤ 15 minutes
- Multi-stage Dockerfiles in place for backend, worker, renderer, and frontend: builder stage uses full toolchain; runtime stage is distroless/slim; docker inspect confirms no build tools (gcc, pip, npm) present in runtime image
- All containers run as non-root UID (baked in Dockerfile USER directive — not set at runtime); verified by docker inspect check in CI
- GitHub Actions pipeline exists with jobs: lint (pre-commit all hooks), test-backend (pytest), test-frontend (vitest + Playwright), security-scan (Bandit + Trivy + ESLint security), build-and-push (multi-stage build → GHCR with sha-<commit> tag)
- .pre-commit-config.yaml committed with all six hooks; CI re-runs all hooks and fails if any fail
- alembic check step in CI fails if SQLAlchemy models have unapplied changes
- Build cache: Docker layer cache, pip wheel cache, npm cache all configured in GitHub Actions; incremental push CI time < 4 minutes
- pytest suite: frame utils, integrity, auth, RBAC, propagator, decay, space weather, ingest, API integration
- Playwright E2E: mode switch, alert acknowledge, CZML render, job progress, report generation, CSP headers
- Port exposure CI check: scripts/check_ports.py passes with no never-exposed port in a ports: mapping
- Caddy TLS active on local dev stack with self-signed cert or ACME staging cert; HSTS header present (Strict-Transport-Security: max-age=63072000); TLS 1.1 and below not offered (verified by nmap --script ssl-enum-ciphers)
- docs/runbooks/egress-filtering.md exists documenting the allowed outbound destination whitelist; implementation method (UFW/nftables) noted
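The port-exposure CI check can be sketched as a pure function over a parsed Compose file. The never-expose list below is an illustrative assumption, not the actual scripts/check_ports.py policy:

```python
# Ports that must never appear in a host-facing ports: mapping (assumed list:
# TimescaleDB, Redis, MinIO — all reachable only on the internal network).
NEVER_EXPOSE = {5432, 6379, 9000}

def exposed_violations(compose: dict) -> list:
    """Return 'service:port' entries that publish a never-exposed port to the host."""
    violations = []
    for name, svc in compose.get("services", {}).items():
        for mapping in svc.get("ports", []):
            host, _, container = str(mapping).partition(":")
            port = int(container or host)      # "8080:80" -> 80; bare "80" -> 80
            if port in NEVER_EXPOSE:
                violations.append(f"{name}:{port}")
    return violations

compose = {
    "services": {
        "backend": {"ports": ["8000:8000"]},
        "db": {"ports": ["5432:5432"]},    # would fail the CI gate
    }
}
assert exposed_violations(compose) == ["db:5432"]
```

A real check would parse docker-compose.yml with a YAML loader and handle long-syntax mappings; the gate logic is the same.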
Performance / Database (Phase 1 items — all required before Phase 1 is considered complete):
- pgBouncer in Docker Compose; all app services connect via pgBouncer (not directly to TimescaleDB); verified by netstat or a connection-source query showing only pgBouncer IPs in pg_stat_activity
- All required indexes present: orbits_object_epoch_idx, reentry_pred_object_created_idx, alert_events_unacked_idx, reentry_pred_corridor_gist, hazard_zones_polygon_gist, fragments_impact_gist, tle_sets_object_ingested_idx — verified by \d+ or pg_indexes query
- orbits hypertable chunk interval set to 1 day; space_weather to 30 days; tle_sets to 7 days — verified by timescaledb_information.chunks
- space_weather_daily continuous aggregate created and policy active; Space Weather Widget backend query reads from the aggregate (verified by EXPLAIN showing space_weather_daily in plan, not raw space_weather)
- Autovacuum settings applied to alert_events, security_logs, reentry_predictions — verified via pg_class reloptions
- lazy="raise" set on all SQLAlchemy relationships; test suite passes with no MissingGreenlet or InvalidRequestError exceptions (the suite verifies this by accessing relationships without explicit loading — which should raise)
- Redis Celery broker DB index (SELECT 0) has maxmemory-policy noeviction; application cache DB index (SELECT 1) has allkeys-lru — verified by CONFIG GET maxmemory-policy on each DB
- CZML catalog endpoint: EXPLAIN (ANALYZE, BUFFERS) output recorded in docs/query-baselines/czml_catalog_100obj.txt; p95 response time < 2s verified by load test with 10 concurrent users
- CZML delta endpoint (?since=) functional: integration test verifies delta response contains only changed objects; X-CZML-Full-Required: true returned when client timestamp > 30 min old
- Compression policies applied with correct compress_after intervals (see §9.4 table): orbits = 7 days, adsb_states = 14 days, space_weather = 60 days, tle_sets = 14 days — verified by timescaledb_information.jobs
- Cursor-based pagination: integration test on /reentry/predictions with 200+ rows confirms next_cursor present and second page returns non-overlapping rows; limit=201 returns 400
- MC concurrency gate: integration test submits two concurrent POST /decay/predict requests from the same organisation; second request returns HTTP 429 with Retry-After header while first is running; first completes normally
- Renderer Docker memory limit set to 4 GB in docker-compose.yml; docker inspect confirms HostConfig.Memory = 4294967296
- Bulk export endpoint: integration test with 10,000-row dataset confirms response is a task ID + status URL, not an inline response body
- tests/load/ directory exists with at least a k6 or Locust scenario for the CZML catalog endpoint; docs/test-plan.md load test section specifies scenario, ramp shape, and SLO assertion
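The cursor-pagination acceptance test implies an opaque cursor carrying the last row's sort key, so a second page can resume strictly after it with no overlap. A sketch assuming a base64-JSON wire format (the real encoding is an implementation detail):

```python
import base64
import json

LIMIT_MAX = 200   # limit=201 returns 400 per the acceptance test

def encode_cursor(created_at, row_id):
    raw = json.dumps({"created_at": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))

def page(rows, limit, cursor=None):
    """Return one page of rows plus an opaque cursor for the next page."""
    if limit > LIMIT_MAX:
        return {"status": 400, "error": "limit exceeds maximum"}
    start = 0
    if cursor:
        last = decode_cursor(cursor)
        # Resume strictly after the (created_at, id) position — pages never overlap.
        start = next(i + 1 for i, r in enumerate(rows)
                     if (r["created_at"], r["id"]) == (last["created_at"], last["id"]))
    chunk = rows[start:start + limit]
    nxt = encode_cursor(chunk[-1]["created_at"], chunk[-1]["id"]) if len(chunk) == limit else None
    return {"status": 200, "rows": chunk, "next_cursor": nxt}

rows = [{"created_at": f"2025-03-01T00:{i:02d}:00Z", "id": i} for i in range(5)]
p1 = page(rows, limit=2)
p2 = page(rows, limit=2, cursor=p1["next_cursor"])
assert [r["id"] for r in p1["rows"]] == [0, 1]
assert [r["id"] for r in p2["rows"]] == [2, 3]
assert page(rows, limit=201)["status"] == 400
```

In the database this resume-after step would be a WHERE (created_at, id) > (:last_created_at, :last_id) keyset predicate rather than a scan.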
Technical Writing / Documentation (Phase 1 items — all required before Phase 1 is considered complete):
- docs/ directory tree created and committed matching the structure in §12.1; all referenced documentation paths exist (even if files are stubs with "TODO" content)
- AGENTS.md committed to repo root; contains codebase conventions, test requirements, and safety-critical file restrictions (see §33.9)
- docs/adr/ contains minimum 5 ADRs for the most consequential Phase 1 decisions: JWT algorithm choice, dual frontend architecture, Monte Carlo chord pattern, frame library choice, TimescaleDB chunk intervals
- docs/runbooks/TEMPLATE.md committed; docs/runbooks/README.md index lists all required runbooks with owner field; at least db-failover.md, ingest-failure.md, and hmac-failure.md are complete (not stubs)
- docs/validation/README.md documents how to run each validation suite and where reference data files live; docs/validation/reference-data/ contains Vallado SGP4 cases and IERS frame test cases
- CHANGELOG.md exists at repo root in Keep a Changelog format; first entry records Phase 1 initial release
- docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off (required by §24.8)
- OpenAPI docs: CI check confirms no public endpoint has an empty description field; spot-check 5 endpoints in code review to verify summary and at least one responses example
Ethics / Algorithmic Accountability (Phase 1 items — all required before Phase 1 is considered complete):
- ood_flag and ood_reason populated at prediction time: integration test with an object whose data_confidence = 'unknown' and no DISCOS physical properties confirms ood_flag = TRUE and ood_reason contains 'low_data_confidence'; prediction is served but UI shows mandatory warning callout above the prediction panel
- prediction_valid_until field present: verify it equals p50_reentry_time - 4h for a test prediction; UI shows staleness warning when NOW() > prediction_valid_until and prediction is not superseded (Playwright test simulates time travel)
- alert_events.fir_intersection_km2 and intersection_percentile recorded: synthetic CRITICAL alert with known corridor area confirms both fields populated; UI renders "p95 corridor intersects ~N km² of FIR XXXX" (Playwright test)
- Alert threshold values documented: docs/alert-threshold-history.md exists with initial entry recording threshold values, rationale, and author sign-off
- prediction_outcomes table exists in schema; POST /api/v1/predictions/{id}/outcome endpoint (requires analyst role) accepts observed re-entry time and source (integration test: unauthenticated attempt returns 401)
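The ood_flag/ood_reason gate reduces to checking prediction inputs against the validated envelope and collecting a reason code per violated bound. The bounds and reason codes below are illustrative assumptions:

```python
def ood_check(obj: dict):
    """Return (ood_flag, ood_reason) for a prediction input set.

    Predictions are still served when flagged; the flag only drives the
    mandatory warning callout in the UI.
    """
    reasons = []
    if obj.get("data_confidence") == "unknown":
        reasons.append("low_data_confidence")
    # Assumed bound: very high area-to-mass ratio (m²/kg) is outside the
    # NRLMSISE-00-calibrated envelope.
    if obj.get("area_to_mass") is not None and obj["area_to_mass"] > 0.1:
        reasons.append("high_area_to_mass")
    if not obj.get("physical_properties"):
        reasons.append("no_physical_properties")
    return (bool(reasons), reasons)

flag, reasons = ood_check({"data_confidence": "unknown", "physical_properties": None})
assert flag is True
assert "low_data_confidence" in reasons   # this is what the integration test asserts
```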
Interoperability (Phase 1 items — all required before Phase 1 is considered complete):
- TLE checksum validation: integration test sends a TLE with deliberately corrupted checksum; verify it is rejected and logged to `security_logs` type `INGEST_VALIDATION_FAILURE`; valid TLE with same content but correct checksum is accepted
- Space weather format contract test: CI integration test against mocked NOAA SWPC response asserts (a) expected top-level JSON keys present (`time_tag`, `flux`/`kp_index`); (b) F10.7 values in physical range 50–350 sfu; (c) Kp values in range 0–90 (NOAA integer format); test is `@pytest.mark.contract` and runs against mocks in standard CI, against live API in nightly sandbox job
- Space-Track contract test: integration test against mocked Space-Track response asserts (a) expected JSON keys present for TLE and CDM queries; (b) B* values trigger warning when outside [-0.5, 0.5]; (c) epoch field parseable as ISO-8601; `spacecom_ingest_success_total{source="spacetrack"}` Prometheus metric > 0 after a live ingest cycle (nightly sandbox only)
- FIR boundary data loaded: `airspace` table populated with FIR/UIR polygons for at least the test ANSP region; source documented in `ingest/sources.py`; AIRAC update date recorded in `airspace_metadata` table
- WebSocket event schema: `WS /ws/events` delivers typed event envelopes; integration test sends a synthetic `alert.new` event and verifies the client receives `{"type": "alert.new", "seq": <n>, "data": {...}}`; reconnect with `?since_seq=<n>` replays missed events
- API versioning headers: all API endpoints return `Content-Type: application/vnd.spacecom.v1+json`; deprecated endpoints (if any) return `Deprecation: true` and `Sunset: <date>` headers (verified by Playwright E2E check)
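The TLE checksum rule the validation above relies on is standard: the 69th column of each line is the sum of all digits in columns 1–68, with each minus sign counting as 1, modulo 10. A minimal validator along those lines (illustrative sketch only — the actual ingest code is not shown in this plan) might look like:

```python
def tle_checksum(line: str) -> int:
    """Compute the TLE checksum: sum of digits in columns 1-68; minus signs count as 1; mod 10."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


def tle_line_valid(line: str) -> bool:
    """A TLE line passes when its 69th column matches the computed checksum."""
    return len(line) >= 69 and line[68].isdigit() and int(line[68]) == tle_checksum(line)
```

A corrupted-checksum test case is then just the same 68 columns with a different final digit, which is exactly what the integration test above injects.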
SRE / Reliability (all required before Phase 1 is considered complete):
- Health probes: `/healthz` returns 200 on all services; `/readyz` returns 200 (healthy) or 207 (degraded) as appropriate; Docker Compose `depends_on: condition: service_healthy` wired for all service dependencies
- Celery queue routing: integration test confirms `ingest.*` tasks appear only on the `ingest` queue and `propagator.*` tasks appear only on the `simulation` queue; no cross-queue contamination possible
- `celery-redbeat` schedule persistence: Beat process restart test verifies scheduled jobs survive without duplicate scheduling; Redis key `redbeat:*` present after restart
- Crash-safety: kill a `worker-sim` container mid-task; verify task is requeued (not lost) on worker restart; `task_acks_late = True` and `task_reject_on_worker_lost = True` confirmed by log inspection
- Dead letter queue: a task that exhausts all retries appears in the DLQ; DLQ depth metric visible in Prometheus
- WAL archiving: `pg_basebackup` and WAL segments appearing in MinIO `db-wal-archive` bucket within 10 minutes of first write (verified by bucket list)
- Daily backup Celery task: `backup_database` task appears in Celery Beat schedule; execution logged in `celery-beat.log`; resulting archive object visible in MinIO `db-backups` bucket
- TimescaleDB compression policy: `orbits` compression policy applied; `timescaledb_information.jobs` shows policy active; manual `CALL run_job()` compresses at least one chunk
- Prometheus metrics: `spacecom_active_tip_events`, `spacecom_tle_age_hours`, `spacecom_hmac_verification_failures_total`, `spacecom_celery_queue_depth` all visible in Prometheus UI with correct labels
- MC chord distribution: `run_mc_decay_prediction` fans out 500 sub-tasks; Celery Flower shows sub-tasks distributed across both `worker-sim` instances (not all on one worker)
- MC p95 latency SLO: 500-sample MC run completes in < 240s on Tier 1 dev hardware (8 vCPU/32 GB) under load test; documented baseline recorded for Tier 2 comparison
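A note on the `/readyz` semantics above: the 200/207/503 decision reduces to a pure function over per-dependency health. A minimal sketch (the function name and the critical-dependency set are hypothetical, not the actual service code):

```python
from typing import Mapping, Tuple

# Hypothetical: dependencies whose failure makes the service unready (503)
# rather than merely degraded (207). Non-critical feeds degrade gracefully.
CRITICAL_DEPS = {"postgres", "redis"}


def readiness(dep_health: Mapping[str, bool]) -> Tuple[int, str]:
    """Map dependency health to an HTTP status: 200 healthy, 207 degraded, 503 unready."""
    failed = {name for name, ok in dep_health.items() if not ok}
    if failed & CRITICAL_DEPS:
        return 503, "unready: " + ", ".join(sorted(failed & CRITICAL_DEPS))
    if failed:
        # e.g. space weather ingest down: keep serving predictions, flag staleness.
        return 207, "degraded: " + ", ".join(sorted(failed))
    return 200, "ok"
```

This is also the hook the Phase 2 degraded-mode UI test exercises: stopping a non-critical ingest flips the result from 200 to 207 without taking the API down.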
Phase 2 Complete When:
- Atmospheric breakup: fragments, casualty areas, fragment globe display
- Mode B (Probability Heatmap): Deck.gl layer renders; hover tooltip shows probability
- Conjunction screening: known close approaches identified; Pc computed for ≥1 test case
- 2D Plan View: FIR boundaries, horizontal corridor projection, altitude cross-section
- Airspace intersection table: affected FIRs with entry/exit times on Event Detail
- Hazard zones: HMAC-signed and immutability trigger verified
- PDF reports: Technical Assessment and Regulatory Submission types functional
- Renderer container: `network_mode: none` enforced; sanitisation tests passing; 30s timeout verified
- OWASP ZAP DAST: 0 High/Critical findings against staging environment
- RLS multi-tenancy: Org A user cannot access Org B records (integration test)
- SimulationComparison: two runs overlaid on globe with distinct colours
Phase 2 SRE / Reliability:
- Monthly restore test: `restore_test` Celery task executes on schedule; restores latest backup to isolated `db-restore-test` container; row count reconciliation passes; result logged to `security_logs` (type `RESTORE_TEST`)
- TimescaleDB retention policy: 90-day drop policy active on `orbits` and `space_weather`; manual chunk drop test in staging confirms chunks older than 90 days are removed without affecting newer data
- Archival pipeline: Parquet export Celery task runs before chunk drop; resulting `.parquet` files visible in MinIO `db-archive` bucket; spot-check query against archived Parquet returns expected rows
- Degraded mode UI: stop space weather ingest; confirm `/readyz` returns 207; confirm `StalenessWarningBanner` appears in aviation portal within one polling cycle (≤ 60s); restart ingest; confirm banner clears
- Error budget dashboard: Grafana `SRE Error Budgets` dashboard shows Phase 2 SLO burn rates for prediction latency and data freshness; alert fires in Prometheus when burn rate exceeds 2× for > 1 hour
Phase 2 Human Factors:
- Corridor Evolution widget: Event Detail page shows p50 corridor footprint at T+0h/+2h/+4h; auto-updates in LIVE mode; ambering warning appears if corridor is widening
- Duty Manager View: toggle on Event Detail collapses to large-text window/FIR/action-buttons only; toggles back to technical detail
- Response Options accordion: contextualised action checklist visible to `operator`+ role; checkbox states and coordination notes persisted to `alert_events`
- Multi-ANSP Coordination Panel: visible on events where ≥2 registered organisations share affected FIRs; acknowledgement status and coordination notes from each ANSP visible; integration test confirms Org A cannot see Org B coordination notes on unrelated events
- Simulation block: `disable_simulation_during_active_events` org setting functional; mode switch blocked with correct modal when unacknowledged CRITICAL alerts exist (integration test)
- Space weather buffer recommendation: Event Detail shows `[95th pct time + buffer]` callout when conditions are Elevated or above; buffer computed by backend from F10.7/Kp thresholds (integration test verifies all four threshold bands)
- Secondary Display Mode: `?display=secondary` URL opens chrome-free full-screen operational view; navigation, admin links, and simulation controls not present; CRITICAL banners still appear (Playwright test)
- Mode C first-use overlay: MC particle animation blocked until user acknowledges one-time explanation overlay; preference stored in user record; never shown again after first acknowledgement
Phase 2 Performance / Database:
- FIR intersection query: `EXPLAIN (ANALYZE)` confirms bounding-box pre-filter (`&&`) eliminates > 90% of `airspace` rows before exact `ST_Intersects`; p95 intersection query time < 200ms with full airspace table loaded
- Analytics query routing: Persona B/F workspace queries confirmed routing to replica engine via `pg_stat_activity` source host check; replication lag monitored in Grafana (alert if > 30s)
- Query plan regression: re-run `EXPLAIN (ANALYZE, BUFFERS)` on CZML catalog query; compare to Phase 1 baseline in `docs/query-baselines/`; planning time and execution time increase < 2× (if exceeded, investigate before Phase 3 load test)
- Hypertable migration: at least one migration involving `orbits` executed using `CREATE INDEX CONCURRENTLY`; CI migration timeout gate in place (> 30s fails CI)
- Query plan regression CI job active: `tests/load/check_query_baselines.py` runs after each migration in staging; fails if any baseline query execution time increases > 2× vs recorded baseline; PR comment generated with comparison table
- `ws_connected_clients` Prometheus gauge reporting per backend instance; Grafana alert configured at 400 (WARNING) — verified by injecting 5 synthetic WebSocket connections and confirming gauge increments
- Space weather backfill cap: integration test simulates 24-hour ingest gap; verify ingest task logs `WARN` and backfills only last 6 hours; no duplicate timestamps written; `space_weather_daily` aggregate remains consistent
- CDN / static asset caching: `bundle-size` CI step active; PR comment shows bundle size delta; CI fails if main JS bundle grows > 10% vs. previous build; Caddy cache headers for `/_next/static/*` set `Cache-Control: public, max-age=31536000, immutable`
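The 6-hour backfill cap above is easy to get wrong around the window boundary; the core windowing decision can be sketched as a small pure function (hypothetical helper name, stdlib only — the real ingest task also deduplicates timestamps on write):

```python
from datetime import datetime, timedelta, timezone

BACKFILL_CAP = timedelta(hours=6)


def backfill_window(last_ingested: datetime, now: datetime) -> tuple[datetime, datetime, bool]:
    """Return (start, end, capped) for the range to re-fetch after an ingest gap.

    If the gap exceeds BACKFILL_CAP, fetch only the most recent 6 hours;
    capped=True tells the caller to emit the WARN log the test asserts on.
    """
    gap = now - last_ingested
    if gap > BACKFILL_CAP:
        return now - BACKFILL_CAP, now, True
    return last_ingested, now, False
```

The 24-hour-gap integration test then reduces to asserting `capped` is true and the window start sits exactly 6 hours behind `now`.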
Phase 2 Legal / Compliance:
- Regulatory classification ADR committed: `docs/adr/0012-regulatory-classification.md` documents the chosen position (Position A — ATM/ANS Support Tool, non-safety-critical) with rationale; legal counsel has reviewed the position against EASA IR 2017/373; position is referenced in all ANSP service contracts
- Legal opinion received for primary deployment jurisdiction; `legal_opinions` table updated with `shadow_mode_cleared = TRUE`; shadow mode admin toggle no longer shows legal warning for that jurisdiction
- Space-Track AUP redistribution clarification obtained (written); legal position documented; AUP click-wrap wording updated to reflect agreed terms
- ESA DISCOS redistribution rights clarified (written): written confirmation from ESA/ESAC on permissible use of DISCOS-derived properties in commercial API responses and generated reports; if redistribution is not permitted, API response and report templates updated to show `source: estimated` rather than raw DISCOS values
- GDPR DPA signed with each shadow ANSP partner before shadow mode begins: DPA template reviewed by counsel; executed DPA on file for each organisation before `shadow_mode_cleared` is set to `TRUE`; data processing not permitted for any ANSP organisation without a signed DPA
- GDPR data inventory documented; pseudonymisation procedure `handle_erasure_request()` implemented and tested: user deleted → name/email replaced with `[user deleted - ID:{hash}]` in `alert_events`/`security_logs`; core safety records preserved
- Jurisdiction screening at user registration: sanctioned-country check fires before account creation; blocked attempt logged to `security_logs` type `REGISTRATION_BLOCKED_SANCTIONS`
- MSA template reviewed by aviation law counsel; Regulatory Sandbox Agreement template finalised; first shadow mode deployment covered by a signed Regulatory Sandbox Agreement on file
- Controlled Re-entry Planner carries in-platform export control notice; `data_source_acknowledgement = TRUE` enforced before API key issuance (integration test: attempt to create API key without acknowledgement returns 403)
- Professional indemnity, cyber liability, and product liability insurance confirmed in place before first shadow deployment; certificates stored in MinIO `legal-docs` bucket
- Shadow mode exit criteria documented and tooled: `docs/templates/shadow-mode-exit-report.md` exists; Persona B can generate exit statistics from admin panel; exit to operational use for any ANSP requires written Safety Department confirmation on file before `shadow_mode_cleared` is set
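The `handle_erasure_request()` pseudonymisation above replaces identity fields with a stable, non-reversible placeholder so audit and safety rows remain correlatable after erasure. A stdlib sketch of the replacement step (hypothetical helper names; the real procedure also updates `alert_events` and `security_logs` rows in the database):

```python
import hashlib


def pseudonym_for(user_id: str) -> str:
    """Stable, non-reversible placeholder written into audit rows after GDPR erasure."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:12]
    return f"[user deleted - ID:{digest}]"


def erase_user_fields(record: dict, user_id: str) -> dict:
    """Replace name/email with the pseudonym; leave core safety fields untouched."""
    placeholder = pseudonym_for(user_id)
    return {**record, "name": placeholder, "email": placeholder}
```

Keying the hash on the internal user ID (not the email) means the same deleted user maps to the same placeholder across tables, without any path back to personal data.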
Phase 2 Technical Writing / Documentation:
- `docs/user-guides/aviation-portal-guide.md` complete and reviewed by at least one Persona A representative before first ANSP shadow deployment; covers: dashboard overview, alert acknowledgement workflow, NOTAM draft workflow, degraded mode response
- `docs/api-guide/` complete: `authentication.md`, `rate-limiting.md`, `webhooks.md`, `error-reference.md`, Python and TypeScript quickstart examples; reviewed by a Persona E/F tester
- All public functions in `propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `integrity.py`, and `breakup/atmospheric.py` have Google-style docstrings with parameter units; `mypy` pre-commit hook enforces no untyped function signatures
- `docs/test-plan.md` complete: lists all test suites, physical invariant tested, reference source, pass/fail tolerance, and blocking classification; reviewed by physics lead
- `docs/adr/` contains ≥ 10 ADRs covering all consequential Phase 2 decisions added during the phase
- All runbooks referenced in the §21 DoD are complete (not stubs): `gdpr-breach-notification.md`, `safety-occurrence-notification.md`, `secrets-rotation-jwt.md`, `blue-green-deploy.md`, `restore-from-backup.md`
Phase 2 Ethics / Algorithmic Accountability:
- Model card published: `docs/model-card-decay-predictor.md` complete with validated orbital regime envelope, object type performance breakdown, known failure modes, and systematic biases; reviewed by the physics lead before Phase 2 ANSP shadow deployments
- Backcast validation report: ≥10 historical re-entry events validated; report documents selection criteria, identifies underrepresented object categories (small debris, tumbling objects), and states accuracy conditional on object type — not as a single unconditional figure; stored in MinIO `docs` bucket
- Out-of-distribution bounds defined: `docs/ood-bounds.md` specifies the threshold values for `ood_flag` triggers (area-to-mass ratio, minimum data confidence, minimum TLE count); CI test confirms all thresholds are checked in `propagator/decay.py`
- Alert threshold governance: any threshold change requires a PR reviewed by engineering lead + product owner; `docs/alert-threshold-history.md` entry created; change must complete a minimum 2-week shadow-mode validation period before deploying to any operational ANSP connection
- FIR coverage quality flag: `airspace` table has `data_source` and `coverage_quality` columns; intersection results for OpenAIP-sourced FIRs include a `coverage_quality: 'low'` flag in the API response; UI shows a coverage quality callout for non-AIRAC FIRs
- Recalibration governance documented: `docs/recalibration-procedure.md` exists specifying hold-out validation dataset, minimum accuracy improvement threshold (> 5% improvement on hold-out, no regression on any object type category), sign-off authority (physics lead + engineering lead), ANSP notification procedure
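The `ood_flag` triggers listed above (area-to-mass ratio, data confidence, TLE count) compose into a single check. A sketch with placeholder thresholds — the authoritative values live in `docs/ood-bounds.md`, and the reason strings other than `low_data_confidence` are illustrative:

```python
# Placeholder thresholds; authoritative values are maintained in docs/ood-bounds.md.
AMR_MAX = 1.0          # m^2/kg: above this, drag modelling is outside the validated envelope
MIN_TLE_COUNT = 3      # fewer TLEs than this means insufficient orbit history


def ood_check(area_to_mass: float, data_confidence: str, tle_count: int) -> tuple[bool, list[str]]:
    """Return (ood_flag, ood_reasons) for a prediction input; flag is true if any reason fires."""
    reasons = []
    if area_to_mass > AMR_MAX:
        reasons.append("high_area_to_mass_ratio")
    if data_confidence == "unknown":
        reasons.append("low_data_confidence")
    if tle_count < MIN_TLE_COUNT:
        reasons.append("insufficient_tle_history")
    return bool(reasons), reasons
```

Returning the full reason list (rather than a bare boolean) is what lets the UI explain the warning and lets the CI test assert each threshold is actually exercised.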
Phase 2 Interoperability:
- CCSDS OEM response: `GET /space/objects/{norad_id}/ephemeris` with `Accept: application/ccsds-oem` returns a valid CCSDS 502.0-B-3 OEM file; integration test validates all mandatory keyword fields (`OBJECT_ID`, `CENTER_NAME`, `REF_FRAME=GCRF`, `TIME_SYSTEM=UTC`, `START_TIME`, `STOP_TIME`) are present; test parses with a reference CCSDS OEM parser
- CCSDS CDM export: bulk export includes CDM-format conjunction records; mandatory CDM fields populated; `N/A` used per CCSDS 508.0-B-1 §4.3 for unknown values; integration test validates with reference CDM parser
- CDM ingestion display: Space-Track CDM Pc and SpaceCom-computed Pc both visible on conjunction panel with distinct provenance labels; `DATA_CONFIDENCE` warning fires when values differ by > 10× (integration test with synthetic divergent CDM)
- Alert webhook: `POST /webhooks` registers endpoint; synthetic `alert.new` event POSTed to registered URL within 5s of trigger; `X-SpaceCom-Signature` header present and verifiable with shared secret; retry fires on 500 response from webhook receiver (integration test with mock server)
- GeoJSON structured export: `GET /events/{id}/export?format=geojson` returns valid GeoJSON `FeatureCollection`; `properties` includes `norad_id`, `p50_utc`, `affected_fir_ids`, `risk_level`, `prediction_hmac`; validates against GeoJSON schema (RFC 7946)
- ADS-B feed: OpenSky Network integration active; live flight positions overlay on globe in aviation portal; route intersection advisory receives ADS-B flight tracks as input
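Webhook receivers should verify `X-SpaceCom-Signature` with a constant-time comparison. A sketch of both sides, assuming HMAC-SHA256 over the raw request body (the exact canonicalisation is an assumption — the header name is from the requirement above, the signing scheme is not specified in this plan):

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body, as carried in X-SpaceCom-Signature (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Constant-time comparison guards against timing attacks on the shared secret."""
    return hmac.compare_digest(sign_payload(secret, body), header_value)
```

Verifying against the raw bytes (before any JSON parsing) matters: re-serialising the payload on the receiver side can reorder keys and break the signature.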
Phase 2 DevOps / Platform Engineering:
- Staging environment spec documented: resources, data (synthetic only — no production data in staging), secrets set (separate from production), continuous deployment from `main` branch
- GitLab staging deploy job: merge to `main` triggers automatic staging deploy; production deploy requires manual approval in GitLab after staging smoke tests pass
- OWASP ZAP DAST run against staging in CI pipeline; results reviewed; 0 High/Critical required to unblock production deploy approval
- Secrets rotation runbooks written for all critical secrets: Space-Track credentials, JWT RS256 signing keypair, MinIO access keys, Redis `AUTH` password; each runbook includes: who initiates, affected services, zero-downtime rotation procedure, verification step, `security_logs` entry required
- JWT RS256 keypair rotation tested without downtime: old public key retained during 5-minute transition window; tokens signed with old key remain valid until expiry; verified by integration test
- Image retention: container-registry lifecycle policy in place; untagged images purged weekly; staging images retained 30 days; dev images retained 7 days; policy verified in registry settings
- CI observability: GitLab pipeline duration tracked; image size delta posted as merge request comment (fail if > 20% increase); test failure rate visible in CI dashboard
- `alembic check` CI gate: no migration adds a `NOT NULL` column without a default in the same step; CI job validates hypertable migrations use `CONCURRENTLY` (grep check on all new migration files)
Phase 2 Additional Regulatory / Dual Domain Items:
- Shadow mode: admin can enable/disable per organisation; ShadowBanner displayed on all pages when active; shadow records have `shadow_mode = TRUE`; shadow records excluded from all operational API responses (integration test)
- NOTAM drafting: draft generated in ICAO Annex 15 format from any event with FIR intersection; mandatory regulatory disclaimer present (automated test verifies its presence in every draft); stored in `notam_drafts`
- Space Operator Portal: `space_operator` user can view only owned objects (non-owned objects return 404, not 403, to prevent object enumeration); ControlledReentryPlanner functional for `has_propulsion = TRUE` objects
- CCSDS export: ephemeris export in OEM format passes CCSDS 502.0-B-3 structural validation
- API keys: create, use, and revoke flow functional; per-key rate limiting returns 429 at daily limit; raw key displayed only at creation (never retrievable after)
- TIP message provenance displayed in UI: source label reads "USSPACECOM TIP (not certified aeronautical information)" — not just "TIP Message #N"
- Data confidence warnings: objects with `data_confidence = 'unknown'` display a warning callout on all prediction panels explaining the impact on prediction quality
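The per-key daily limit behind the 429 behaviour above reduces to a counter keyed on (API key, UTC day). An in-memory stand-in for illustration — in production this state would live in Redis, and the class name is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class DailyKeyLimiter:
    """In-memory stand-in for the per-API-key daily quota (production would use Redis)."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        # (api_key, utc_day) -> requests served so far
        self.counts: dict[tuple[str, str], int] = defaultdict(int)

    def check(self, api_key: str, now: datetime) -> int:
        """Return 200 if under quota (counting this request), else 429."""
        day = now.astimezone(timezone.utc).strftime("%Y-%m-%d")
        bucket = (api_key, day)
        if self.counts[bucket] >= self.daily_limit:
            return 429
        self.counts[bucket] += 1
        return 200
```

Keying on the UTC day (not a rolling 24-hour window) makes the reset time predictable for API consumers and keeps the counter a single atomic increment.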
Phase 3 Complete When:
- Mode C (Monte Carlo Particles): animated trajectories render; click-particle shows params
- Real-time alerts delivered within 30 seconds of trigger condition
- Geographic alert filtering: alerts scoped to user's FIR list
- Route intersection analysis functional against sample flight plans
- Feedback: density scaling recalibration demonstrated from ≥2 historical re-entries
- Load test: 100 concurrent users; CZML load < 2s at p95
- External penetration test completed; all Critical/High findings remediated
- Full axe-core audit + manual screen reader test (NVDA + VoiceOver) passes
- Secrets manager (Vault or equivalent) replacing Docker secrets for all production credentials
- All credentials on rotation schedule; rotation verified without downtime
- Prometheus + Grafana operational; certificate expiry alert configured
- Production deployment runbook documented; incident response procedure per threat scenario
- Security audit log shipping to external SIEM verified
- Shadow validation report generated for ≥1 historical re-entry event demonstrating prediction accuracy
- ECSS compliance artefacts produced: Software Management Plan, V&V Plan, Product Assurance Plan, Data Management Plan (required for ESA contract bids)
- TRL 6 demonstration: system demonstrated in operationally relevant environment with real TLE data, real space weather, and ≥1 ANSP shadow deployment
- Regulatory acceptance package complete: safety case framework, ICAO Annex 15 data quality mapping, SMS integration guide
- Legal opinion obtained on operational liability per target deployment jurisdictions (Australia, EU, UK minimum)
- First ANSP shadow mode deployment active with ≥4 weeks of shadow prediction records
Phase 3 Infrastructure / HA:
- Patroni configuration validated: `scripts/check_patroni_config.py` passes confirming `maximum_lag_on_failover`, `synchronous_mode: true`, `synchronous_mode_strict: true`, `wal_level: replica`, `recovery_target_timeline: latest` all present in `patroni.yml`
- Patroni failover drill: manually kill the primary DB container; verify standby promoted within 30s; backend API continues serving requests (latency spike acceptable; no 5xx errors after 35s); PgBouncer reconnects automatically to new primary
- MinIO EC:2 verified: 4-node MinIO starts cleanly; integration test writes a 100 MB object; shut down one MinIO node; read succeeds; write succeeds; shut down second node; write fails with expected error; read still succeeds (EC:2 read quorum = 2 of 4)
- WAF/DDoS protection confirmed in place at ingress (Cloudflare/AWS Shield or equivalent network-level appliance for on-premise); security architecture review sign-off
- DNS architecture documented: `docs/runbooks/dns-architecture.md` covers split-horizon zones, PgBouncer VIP, Redis Sentinel VIP, and service discovery records for Tier 3 deployment
- Backup restore test checklist completed successfully (see §34.5): all 6 checklist items passed within the 30-day window before Phase 3 sign-off
- TLS certificate lifecycle runbook complete: `docs/runbooks/tls-cert-lifecycle.md` documents ACME auto-renewal path and internal CA path for air-gapped deployments; cert expiry Prometheus alerts firing at 60/30/7-day thresholds
Phase 3 Performance:
- Formal load test passed: `tests/load/` scenario with k6 or Locust; 100 concurrent users; CZML catalog load < 2s p95; MC job submit < 500ms; alert WebSocket delivery < 30s; test report committed to `docs/validation/load-test-report-phase3.md`
- MC concurrency gate tested at scale: 10 simultaneous MC submissions across 5 organisations; each org receives `429` for its second request; no deadlock or Redis key leak observed; Celery worker queue depth remains bounded
- WebSocket subscriber ceiling verified: load test opens 450 connections to a single backend instance; 451st connection receives `HTTP 503`; `ws_connected_clients` gauge reads 450; scaling trigger fires at 400 (alert visible in Grafana)
- CZML delta adoption: Playwright E2E test confirms the frontend sends `?since=` parameter on all CZML polls after initial load; no full-catalog request occurs after page load in LIVE mode
- Bundle size CI gate active and green: final production build JS bundle documented; `bundle-size` CI step has passed for ≥2 consecutive deploys without manual override
22. Open Physics Questions for Engineering Review
- JB2008 vs NRLMSISE-00 — Recommend: NRLMSISE-00 for Phase 1 with a pluggable density model interface that accepts JB2008 in Phase 2 without API or schema changes.
- Covariance source for conjunction probability — Recommend: SP ephemeris covariance from Space-Track for active payloads; empirical covariance with explicit UI warning for debris.
- Re-entry termination altitude — Recommend: 80 km for Phase 1; parametric interface for Phase 2 breakup module (default 80 km, allow up to 120 km).
- F10.7 forecast horizon — For objects re-entering 5–14 days out, NOAA 3-day forecasts have degraded skill. Recommend: 81-day smoothed average as baseline with ±20% MC variation; document clearly in the SpaceWeatherWidget and every prediction panel.
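The ±20% MC variation recommended above amounts to sampling the smoothed F10.7 baseline once per ensemble member and reporting percentiles over the resulting ensemble. A stdlib sketch (function names hypothetical; a uniform distribution is assumed here purely for illustration):

```python
import random


def sample_f107(baseline_sfu: float, n: int, seed: int = 42) -> list[float]:
    """Draw n ensemble values of F10.7 uniformly within +/-20% of the 81-day smoothed baseline."""
    rng = random.Random(seed)  # seeded so MC runs are reproducible for validation
    return [baseline_sfu * rng.uniform(0.8, 1.2) for _ in range(n)]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over the ensemble, for p05/p50/p95 reporting."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]
```

Each sampled F10.7 value would feed one decay propagation; the p05/p50/p95 re-entry times reported to users are then percentiles over the ensemble of propagated epochs.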
23. Dual Domain Architecture
23.1 The Interface Problem
Two technically adjacent domains — space operations and civil aviation — manage debris re-entry hazards using incompatible tools, data formats, and operational vocabularies. The gap between them is the market.
SPACE DOMAIN THE GAP AVIATION DOMAIN
──────────────── ────────── ────────────────
TLE / SGP4 NOTAM
CDMs / TIP messages No standard interface FIR restrictions
CCSDS orbit products No common tool ATC procedures
Kp / F10.7 indices No shared language En-route charts
Probability of casualty ← SpaceCom bridges this → Plain English hazard brief
23.2 Shared Physics Core
One physics engine serves both front doors. Neither domain gets a different model — they get different views of the same computation.
┌─────────────────────────────────┐
│ PHYSICS CORE │
│ Catalog Propagator (SGP4) │
│ Decay Predictor (RK7(8)+NRLMS) │
│ Monte Carlo ensemble │
│ Conjunction Screener │
│ Atmospheric Breakup (ORSAT) │
│ Frame transforms (TEME→WGS84) │
└────────────┬────────────────────┘
│
┌─────────────────┴─────────────────┐
│ │
┌──────────▼───────────┐ ┌────────────▼──────────┐
│ SPACE DOMAIN UI │ │ AVIATION DOMAIN UI │
│ /space portal │ │ / (operational view) │
│ Persona E, F │ │ Persona A, B, C │
│ │ │ │
│ State vectors │ │ Hazard corridors │
│ Covariance matrices │ │ FIR intersection │
│ CCSDS formats │ │ NOTAM drafts │
│ Deorbit windows │ │ Plain-language status│
│ API keys │ │ Alert acknowledgement│
│ Conjunction data │ │ Gantt timeline │
└──────────────────────┘ └───────────────────────┘
23.3 Domain-Specific Output Formats
| Output | Space Domain | Aviation Domain |
|---|---|---|
| Trajectory | CCSDS OEM (state vectors) | CZML (J2000 INERTIAL for CesiumJS) |
| Re-entry prediction | p05/p50/p95 times + covariance | Percentile corridor polygons on globe |
| Hazard | Probability of casualty (Pc) value | Risk level (LOW/MEDIUM/HIGH/CRITICAL) |
| Uncertainty | Monte Carlo ensemble statistics | Corridor width visual encoding |
| Conjunction | CDM-format Pc value | Not surfaced to Persona A |
| Space weather | F10.7 / Ap / Kp raw indices | "Elevated activity — wider uncertainty" |
| Deorbit plan | CCSDS manoeuvre plan | Corridor risk map on globe |
23.4 Competitive Position
| Competitor | Their Strength | SpaceCom Advantage |
|---|---|---|
| ESA ESOC Re-entry Prediction Service | Authoritative technical product; longest-running service | Aviation-facing operational UX; ANSP decision support; NOTAM drafting; multi-ANSP coordination |
| OKAPI:Orbits + DLR + TU Braunschweig | Academic orbital mechanics depth; space operator integrations | Purpose-built ANSP interface; controlled re-entry planner; shadow mode for regulatory adoption |
| Aviation weather vendors (e.g., StormGeo) | Deep ANSP relationships; established procurement pathways | Space domain physics credibility; TLE/CDM ingestion; conjunction screening |
| General STM platforms | Broad catalog management | Operational decision support depth; aviation integration layer |
SpaceCom's moat is the combination of space physics credibility AND aviation operational usability. Neither side alone is sufficient to win regulated aviation authority contracts.
Differentiation capabilities — must be maintained regardless of competitor moves (Finding 4):
These are the capabilities that competitors cannot quickly replicate and that directly determine whether ANSPs and institutional buyers choose SpaceCom over alternatives:
| Capability | Why it matters | Maintenance requirement |
|---|---|---|
| ANSP operational workflow integration | NOTAM drafting, multi-ANSP coordination, and shadow mode are purpose-built for ANSP operations — not retrofitted | Must be validated with ≥ 2 ANSP safety teams before Phase 2 shadow deployment |
| Regulatory adoption path | Shadow mode + exit criteria + ANSP Safety Department sign-off creates a documented adoption trail that institutional procurements require | Shadow mode exit report template must remain current; exit statistics generated automatically |
| Physics + aviation in one product | Neither a pure orbital analytics tool nor a pure aviation tool can cover both sides without the other's domain expertise | Dual-domain architecture (§23) must be maintained; any feature removal from either domain triggers an ADR |
| ESA/DISCOS data integration | Institutional credibility with ESA and national space agencies depends on using authoritative ESA data sources | DISCOS redistribution rights must be resolved before Phase 2; integration maintained as P1 data source |
A `docs/competitive-analysis.md` document (maintained by the product owner, reviewed quarterly) tracks competitor feature releases and assesses impact on these claims. Any competitor capability that closes a differentiation gap triggers a product review within 30 days.
23.5 SWIM Integration Path
European ANSPs increasingly exchange operational data via SWIM (System Wide Information Management), defined by ICAO Doc 10039 and implemented in Europe via EUROCONTROL SWIM-TI (AMQP/MQTT transport, FIXM/AIXM 5.1 schemas). Full SWIM compliance is a Phase 3+ target; the path is:
| Phase | Deliverable | Standard |
|---|---|---|
| Phase 2 | GeoJSON structured event export (`/events/{id}/export?format=geojson`) with ICAO FIR IDs and prediction metadata | GeoJSON + ISO 19115 metadata |
| Phase 3 | Review FIXM Core 4.x schema for re-entry hazard representation; define SpaceCom extension namespace | FIXM Core 4.2 |
| Phase 3 | SWIM-TI AMQP endpoint (publish-only) for `alert.new` and `tip.new` events to EUROCONTROL Network Manager B2B service | EUROCONTROL SWIM-TI Yellow Profile |
Phase 2 GeoJSON export is the immediate deliverable. Phase 3 SWIM-TI integration is scoped but requires a EUROCONTROL B2B service account and FIXM schema extension review — neither is blocking for Phase 1 or 2.
24. Regulatory Compliance Framework
24.1 The Regulatory Gap SpaceCom Operates In
There is currently no binding international regulatory framework governing re-entry debris hazard notifications to civil aviation. SpaceCom operates at the boundary between two regulatory regimes that have not yet formally agreed on how to bridge them.
This creates risk (no approved pathway to slot into) but also opportunity (SpaceCom can help define the standard and accumulate first-mover evidence).
24.2 Liability and Operational Status
Legal opinion is a Phase 2 gate, not a Phase 3 task. Shadow mode deployments with ANSPs must not occur without a completed legal opinion for the deployment jurisdiction. "Advisory only" UI labelling is not contractual protection — liability limitation must be in executed agreements. In common law jurisdictions (Australia, UK, US), a voluntary undertaking of responsibility to a known class of relying professionals can create a duty of care regardless of disclaimers (Hedley Byrne & Co v Heller and equivalents). Shadow mode activation in the admin panel is gated by `legal_opinions.shadow_mode_cleared = TRUE` for the organisation's jurisdiction.
Legal opinion scope (per deployment jurisdiction — Australia, EU, UK, US minimum):
- Whether "decision support information" labelling limits liability for incorrect predictions that inform airspace decisions
- Whether the platform creates duty-of-care obligations regardless of labelling
- Whether Space-Track data redistribution via the SpaceCom API requires a separate licensing agreement with 18th Space Control Squadron
- Whether CDM data (national security-adjacent) is subject to export controls in target jurisdictions
- Whether the Controlled Re-entry Planner falls under ECCN 9E515 (spacecraft operations technical data) for non-US users
Operational status classification for SpaceCom outputs — not a UI label, a formal determination made in consultation with the ANSP's legal and SMS teams:
- Aeronautical information (ICAO Annex 15) — highest standard; triggers data quality obligations
- Decision support information — intermediate; requires formal ANSP SMS acceptance
- Situational awareness information — lowest; advisory only; no procedural authority
Commercial contract requirements — three instruments required before any access:
- Master Services Agreement (MSA) — executed before any ANSP or space operator accesses the system. Must be reviewed by aviation law counsel. Minimum required terms:
- Limitation of liability: capped at 12 months of fees paid, or a fixed cap for government/sovereign customers (to be determined by counsel)
- Exclusion of consequential and indirect loss
- Explicit statement that SpaceCom outputs are decision support information, not certified aeronautical information and not a substitute for ANSP operational procedures
- ANSP's acknowledgement that they retain full authority and responsibility for all operational decisions
- SLOs from §26.1 incorporated by reference
- Governing law and jurisdiction clause
- Data Processing Agreement (DPA) addendum for GDPR-scope deployments (see §29)
- Right to suspend service without liability for maintenance, degraded mode, data quality concerns, or active security incidents
- Acceptable Use Policy (AUP) — click-wrap accepted in-platform at first login, recorded in `users.tos_accepted_at`, `users.tos_version`, and `users.tos_accepted_ip`. Must re-accept when version changes (system blocks access until accepted). Includes:
- Acknowledgement that orbital data originates from Space-Track, subject to Space-Track terms
- Prohibition on redistributing SpaceCom-derived data to third parties without written consent
- Acknowledgement that the platform is decision support only, not certified aeronautical information
- Export control acknowledgement (user is responsible for compliance in their jurisdiction)
- API Terms — embedded in the API key issuance flow for Persona E/F programmatic access. Accepted at key creation; recorded against the api_keys record. Includes the Space-Track redistribution acknowledgement and the export control notice.
Space-Track data redistribution gate (F3): Space-Track.org Terms of Service prohibit redistribution of TLE data to non-registered entities. The SpaceCom API must not serve TLE-derived fields (raw TLE strings, tle_epoch, tle_line1/2) to organisations that have not confirmed Space-Track registration. Implementation:
-- Add to organisations table
ALTER TABLE organisations ADD COLUMN space_track_registered BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN space_track_registered_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN space_track_username TEXT; -- for audit
API middleware check (applied to any response containing TLE-derived fields):
from fastapi import HTTPException

def check_space_track_gate(org: Organisation):
    if not org.space_track_registered:
        raise HTTPException(
            status_code=403,
            detail="TLE-derived data requires Space-Track registration. "
                   "Register at space-track.org and confirm in your organisation settings.",
        )
All TLE-derived disclosures are logged in data_disclosure_log:
CREATE TABLE data_disclosure_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organisations(id),
source TEXT NOT NULL, -- 'space_track', 'esa_sst', etc.
endpoint TEXT NOT NULL,
disclosed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
record_count INTEGER
);
CREATE INDEX ON data_disclosure_log (org_id, source, disclosed_at DESC);
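A minimal sketch of how the gate and the disclosure log combine per response: unregistered organisations get TLE-derived fields stripped, registered ones get a `data_disclosure_log` row. The names `TLE_DERIVED_FIELDS`, `filter_tle_fields`, and `disclosure_log_entry` are illustrative, not existing SpaceCom identifiers.

```python
# Fields that must never reach organisations without Space-Track registration
TLE_DERIVED_FIELDS = {"tle_line1", "tle_line2", "tle_epoch"}

def filter_tle_fields(record: dict, space_track_registered: bool) -> dict:
    """Strip TLE-derived fields for orgs without confirmed registration."""
    if space_track_registered:
        return record
    return {k: v for k, v in record.items() if k not in TLE_DERIVED_FIELDS}

def disclosure_log_entry(org_id: str, endpoint: str, record_count: int) -> dict:
    """Row shape for data_disclosure_log (written only when TLE data is served)."""
    return {
        "org_id": org_id,
        "source": "space_track",
        "endpoint": endpoint,
        "record_count": record_count,
    }
```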
Contracts table and MRR tracking (F1, F4, F9 — §68):
The contracts table enforces that feature access is gated on commercial state, provides MRR data for the commercial team, and records discount approval for audit:
CREATE TABLE contracts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organisations(id),
contract_type TEXT NOT NULL
CHECK (contract_type IN ('sandbox','professional','enterprise','on_premise','internal')),
-- Financial terms
monthly_value_cents INTEGER NOT NULL DEFAULT 0, -- 0 for sandbox/internal
currency CHAR(3) NOT NULL DEFAULT 'EUR',
discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
CHECK (discount_pct >= 0 AND discount_pct <= 100),
-- Discount approval guard (F4): discounts >20% require second approver
discount_approved_by INTEGER REFERENCES users(id), -- NULL if discount_pct <= 20
discount_approval_note TEXT,
-- Term
valid_from TIMESTAMPTZ NOT NULL,
valid_until TIMESTAMPTZ NOT NULL,
auto_renew BOOLEAN NOT NULL DEFAULT FALSE,
-- Feature access — what this contract enables
enables_operational_mode BOOLEAN NOT NULL DEFAULT FALSE,
enables_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,
enables_api_access BOOLEAN NOT NULL DEFAULT FALSE,
-- Audit
created_by INTEGER REFERENCES users(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
signed_msa_at TIMESTAMPTZ, -- NULL until MSA countersigned
msa_document_ref TEXT, -- path in MinIO legal bucket
-- Professional Services (F10)
ps_value_cents INTEGER NOT NULL DEFAULT 0, -- one-time PS revenue on this contract
ps_description TEXT
);
CREATE INDEX ON contracts (org_id, valid_until DESC);
CREATE INDEX ON contracts (valid_until); -- NOW() is not IMMUTABLE, so a partial index on valid_until > NOW() is invalid in PostgreSQL; filter at query time instead
-- Constraint: discounts >20% must have a named approver
ALTER TABLE contracts ADD CONSTRAINT discount_approval_required
CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL);
Feature access enforcement (F1): Feature flags in organisations must be set from the active contract, not by admin toggle alone. A Celery task (tasks/commercial/sync_feature_flags.py) runs nightly and on contract creation/update to sync organisations.feature_multi_ansp_coordination from the active contract's enables_multi_ansp_coordination. An admin toggle that disagrees with the active contract is overwritten by the nightly sync.
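The sync logic described above reduces to a small pure function. The sketch below assumes simplified shapes (the real task in tasks/commercial/sync_feature_flags.py would read contracts and organisations via the ORM); `active_contract` and `sync_feature_flags` are assumed names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Contract:
    valid_from: datetime
    valid_until: datetime
    enables_multi_ansp_coordination: bool

def active_contract(contracts, now: datetime) -> Optional[Contract]:
    """Most recently started contract whose validity window covers `now`."""
    live = [c for c in contracts if c.valid_from <= now <= c.valid_until]
    return max(live, key=lambda c: c.valid_from, default=None)

def sync_feature_flags(org_flags: dict, contracts, now: datetime) -> dict:
    """Overwrite org feature flags from the active contract.

    An admin toggle that disagrees with the active contract loses here,
    which is exactly the F1 enforcement property."""
    c = active_contract(contracts, now)
    org_flags["feature_multi_ansp_coordination"] = bool(
        c and c.enables_multi_ansp_coordination
    )
    return org_flags
```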
MRR dashboard (F9): Add a Grafana panel (internal dashboard, not customer-facing) showing current MRR:
-- Recording rule or direct query:
SELECT SUM(monthly_value_cents) / 100.0 AS mrr_eur
FROM contracts
WHERE valid_from <= NOW() AND valid_until >= NOW()
AND contract_type NOT IN ('sandbox', 'internal');
Expose as spacecom_mrr_eur Prometheus gauge updated by the nightly sync_feature_flags task. Grafana panel: "Current MRR (€)" — single stat panel, comparison to previous month.
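The computation behind the gauge is the SQL above in function form. This sketch uses plain dict rows and an assumed function name; the real task would run the query and set a prometheus_client Gauge with the result.

```python
def current_mrr_eur(contracts: list, now) -> float:
    """Sum the monthly value of live, paying contracts; cents to euros.

    Mirrors the SQL: live window, sandbox/internal excluded."""
    return sum(
        c["monthly_value_cents"]
        for c in contracts
        if c["valid_from"] <= now <= c["valid_until"]
        and c["contract_type"] not in ("sandbox", "internal")
    ) / 100.0
```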
Export control screening (F4): ITAR 22 CFR §120.15 and EAR 15 CFR §736 prohibit providing certain SSA capabilities to nationals of embargoed countries and denied parties. Required at organisation onboarding:
ALTER TABLE organisations ADD COLUMN country_of_incorporation CHAR(2); -- ISO 3166-1 alpha-2
ALTER TABLE organisations ADD COLUMN export_control_screened_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN export_control_cleared BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN itar_cleared BOOLEAN NOT NULL DEFAULT FALSE; -- US-person or licensed
Onboarding flow:
- Collect country_of_incorporation at registration
- Flag embargoed countries (CU, IR, KP, RU, SY) for manual review — account held in PENDING_EXPORT_REVIEW state
- Screen organisation name against BIS Entity List (automated lookup; manual review on partial match)
- EU-SST-derived data gated behind itar_cleared = TRUE (EU-SST has its own access restrictions for non-EU entities)
- All screening decisions logged with reviewer ID and date
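The automated part of the flow above can be sketched as a single decision function. The BIS Entity List lookup is represented here by a boolean flag, and `screening_state` is an assumed name; manual review and logging happen outside this function.

```python
# Embargoed-country list from the onboarding flow above
EMBARGOED = {"CU", "IR", "KP", "RU", "SY"}

def screening_state(country_of_incorporation: str, entity_list_match: bool) -> str:
    """Account state after automated export control screening.

    Any embargoed country or Entity List hit holds the account for
    manual review; everything else proceeds."""
    if country_of_incorporation in EMBARGOED or entity_list_match:
        return "PENDING_EXPORT_REVIEW"
    return "CLEARED"
```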
Documented in legal/EXPORT_CONTROL_POLICY.md. Legal counsel review required before any deployment that could serve US-origin technical data (TLE from the 18th Space Defense Squadron) to non-US persons.
Regulatory Sandbox Agreement — a lightweight 2-page letter of understanding required before any ANSP shadow mode activation. Specifies:
- Trial period start and end dates
- ANSP's confirmation that SpaceCom outputs are for internal validation only (not operational)
- SpaceCom's commitment to produce a shadow validation report at trial end
- Data protection terms for the trial period
- How incidents during the trial are handled by both parties
- Mutual agreement that the trial does not create any ongoing commercial obligation
Regulatory sandbox liability clarification (F11 — §61): The sandbox agreement is not a liability shield by itself. During shadow mode, SpaceCom is a tool under evaluation — liability exposure depends on how the ANSP uses outputs and what the sandbox agreement says about consequences of errors. Required provisions:
- No operational reliance clause: ANSP certifies in writing that no operational decisions will be made on the basis of SpaceCom outputs during the trial. Any breach of this clause by the ANSP shifts liability to the ANSP.
- Incident notification: If a SpaceCom output error is identified during the trial, SpaceCom notifies the ANSP within 2 hours (matching the safety occurrence runbook at §26.8). The sandbox agreement specifies whether this constitutes a notifiable occurrence under the ANSP's SMS.
- Indemnification cap: SpaceCom's aggregate liability during the sandbox period is capped at AUD/EUR 50,000 (or local equivalent). Catastrophic loss claims are excluded (consistent with MSA terms).
- Insurance requirement: SpaceCom must carry professional indemnity insurance with minimum cover AUD/EUR 1 million before activating any sandbox with an ANSP. Certificate of currency provided to the ANSP before activation.
- Regulatory notification duty: If the ANSP's safety regulator requires notification of third-party tool trials (e.g., EASA, CASA, CAA), that obligation rests with the ANSP. SpaceCom provides a one-page system description document to support the ANSP's notification.
- Sandbox ≠ approval pathway: A successful sandbox trial is evidence for a future regulatory submission — it is not itself an approval. Neither party should represent the sandbox as a form of regulatory acceptance.
legal/SANDBOX_AGREEMENT_TEMPLATE.md captures the standard text. Legal counsel review required before any amendment.
The shadow mode admin toggle must display a warning if no Regulatory Sandbox Agreement is on record (legal_opinions.shadow_mode_cleared = FALSE for the org's jurisdiction):
⚠ No legal clearance on record for this organisation's jurisdiction.
Shadow mode should not be activated without a completed legal opinion
and a signed Regulatory Sandbox Agreement.
[View legal status →]
24.3 ICAO Data Quality Mapping (Annex 15)
SpaceCom outputs that may enter aeronautical information channels must be characterised against ICAO's five data quality attributes:
| Attribute | SpaceCom Characterisation | Required Action |
|---|---|---|
| Accuracy | Decay predictor accuracy characterised from ≥10 historical re-entry backcasts vs. The Aerospace Corporation database. Published as a formal accuracy statement in the GET /api/v1/reentry/predictions/{id} response. | Phase 3: produce accuracy characterisation document |
| Resolution | Corridor boundaries expressed as geographic polygons with stated precision. Position uncertainty stated as a formal resolution value in the prediction response. | Included in prediction API response from Phase 1 |
| Integrity | HMAC-SHA256 on all prediction and hazard zone records. Integrity assurance level: Essential (1×10⁻⁵). Documented in system description. | Implemented Phase 1 (§7.9) |
| Traceability | Full parameter provenance in simulations.params_json and prediction records. Accessible to regulatory auditors via dedicated API. | Phase 1 |
| Timeliness | Maximum latency from TIP message ingestion to updated prediction available: 30 minutes. Maximum latency from NOAA SWPC space weather update to prediction recalculation: 4 hours. Published as a formal SLA. | Phase 3 SLA document |
F5 — Completeness attribute and ICAO Annex 15 §3.2 data quality classification (§61):
ICAO Annex 15 §3.2 defines a sixth implicit attribute — Completeness — meaning all data fields required by the receiving system are present and within range. SpaceCom must:
- Define a formal completeness schema for each prediction response (required fields, allowed nulls, value ranges)
- Return data_quality.completeness_pct in the prediction response (fields present / fields required × 100)
- Reject predictions with completeness < 90% from the alert pipeline (alert not generated; operator notified of incomplete prediction)
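The completeness computation and the 90% alert gate reduce to a few lines. The required-field list below is illustrative only; the real completeness schema per prediction response is defined separately, as the first bullet requires.

```python
# Illustrative required-field set; the formal completeness schema defines the real one
REQUIRED_FIELDS = ("corridor", "window_start", "window_end", "p95_uncertainty_km")

def completeness_pct(prediction: dict) -> float:
    """Fields present (and non-null) / fields required × 100."""
    present = sum(1 for f in REQUIRED_FIELDS if prediction.get(f) is not None)
    return 100.0 * present / len(REQUIRED_FIELDS)

def passes_alert_gate(prediction: dict) -> bool:
    """Predictions below 90% completeness never reach the alert pipeline."""
    return completeness_pct(prediction) >= 90.0
```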
ICAO data category and classification required in the prediction response (Annex 15 Table A3-1):
| Field | Value |
|---|---|
| data_category | AERONAUTICAL_ADVISORY (until formal AIP entry process established) |
| originator | SPACECOM + system version string |
| effective_from | ISO 8601 UTC timestamp |
| integrity_assurance | ESSENTIAL (1×10⁻⁵ probability of undetected error) |
| accuracy_class | CLASS_2 (advisory, not certified — until accuracy characterisation completes Phase 3 validation) |
Formal accuracy characterisation (docs/validation/ACCURACY_CHARACTERISATION.md) is a Phase 3 gate before the API can be presented to any ANSP as meeting Annex 15 data quality standards.
24.4 Safety Management System Integration
Any ANSP formally adopting SpaceCom must include it in their SMS (ICAO Annex 19). SpaceCom provides the following artefacts to support ANSP SMS assessment:
Hazard register (SpaceCom's contribution to the ANSP's SMS — F3, §61 structured format):
Maintained as docs/safety/HAZARD_LOG.md. Each hazard uses the structured schema below. Hazard IDs are permanent — retired hazards are marked CLOSED, not deleted.
| ID | Description | Cause | Effect | Mitigations | Severity | Likelihood | Risk Level | Status |
|---|---|---|---|---|---|---|---|---|
| HZ-001 | SpaceCom unavailable during active re-entry event | Infrastructure failure; deployment error; DDoS | ANSP cannot access current re-entry prediction during event window | Patroni HA failover (§26.3); 15-min RTO SLO; automated ANSP push notification + email; documented fallback procedure | Hazardous | Low (SLO 99.9%) | Medium | OPEN |
| HZ-002 | False all-clear prediction (false negative — corridor misses actual impact zone) | TLE age; atmospheric model error; MC sampling variance; adversarial data manipulation | ANSP issues all-clear; aircraft enters debris corridor | HMAC integrity check; dual-source TLE validation; TIP cross-check guard; shadow validation evidence; accuracy characterisation (Phase 3); @pytest.mark.safety_critical tests | Catastrophic | Very Low | High | OPEN |
| HZ-003 | False hazard prediction (false positive — corridor over-stated) | Atmospheric model conservatism; TLE propagation error | Unnecessary airspace restriction; operational disruption; credibility loss | Cross-source TLE validation; HMAC; p95 corridor with stated uncertainty; accuracy characterisation | Major | Low | Medium | OPEN |
| HZ-004 | Corridor displayed in wrong reference frame | ECI/ECEF/geographic frame conversion error; CZML frame parameter misconfiguration | Corridor shown at wrong lat/lon; operator makes decisions on incorrect geographic basis | Frame transform unit tests against IERS references (§17); CZML frame convention enforced via CI | Hazardous | Very Low | Medium | OPEN |
| HZ-005 | Outdated prediction served (stale data) | Ingest pipeline failure; TLE source outage; cache not invalidating | Operator sees prediction that no longer reflects current orbital state | Data staleness indicators in UI; automated stale alert to operators; ingest health monitoring; CZML cache invalidation triggers (§35) | Major | Low | Medium | OPEN |
| HZ-006 | Prediction integrity failure (HMAC mismatch) | Database modification; backup restore error; storage corruption | Prediction record cannot be verified; may have been tampered with | Prediction quarantined automatically; CRITICAL security alert; prediction withheld from API | Catastrophic | Very Low | High | OPEN |
| HZ-007 | Unauthorised access to prediction data | Compromised credentials; RLS bypass; API misconfiguration | Competitor or adversary obtains early re-entry corridor data; potential ITAR exposure | PostgreSQL RLS; JWT validation; rate limiting; security_logs audit trail; penetration testing | Major | Low | Medium | OPEN |
Hazard log governance:
- Review: quarterly, and after each SEV-1 incident, model version update, or material system change
- New hazards identified during safety occurrence reporting are added within 5 business days
- Risk level = Severity × Likelihood using EUROCAE ED-153 risk classification matrix
- OPEN hazards with High risk level are Phase 2 gate blockers — must reach MITIGATED before ANSP shadow activation
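The Severity × Likelihood combination is a table lookup. The sketch below only reproduces the combinations that actually appear in the hazard log above; the authoritative values must come from the EUROCAE ED-153 risk classification matrix, and `risk_level` is an assumed name.

```python
# Partial matrix: exactly the (severity, likelihood) pairs used by HZ-001..HZ-007.
# The full matrix must be transcribed from EUROCAE ED-153, not from this sketch.
RISK_MATRIX = {
    ("Catastrophic", "Very Low"): "High",    # HZ-002, HZ-006
    ("Hazardous", "Low"): "Medium",          # HZ-001
    ("Hazardous", "Very Low"): "Medium",     # HZ-004
    ("Major", "Low"): "Medium",              # HZ-003, HZ-005, HZ-007
}

def risk_level(severity: str, likelihood: str) -> str:
    """Look up the hazard log's Risk Level column from its inputs."""
    return RISK_MATRIX[(severity, likelihood)]
```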
System safety classification: Safety-related (not safety-critical under DO-278A). Relevant components targeting SAL-2 assurance level (see §24.13). Development assurance standard: EUROCAE ED-78A equivalent for relevant components.
Change management: SpaceCom must notify all ANSP users before model version updates that affect prediction outputs. Version changes tracked in simulations.model_version and surfaced in the UI.
24.5 NOTAM System Interface
SpaceCom's position in the NOTAM workflow:
SpaceCom generates → NOTAM draft (ICAO format) → Reviewed by Persona A → Submitted by authorised NOTAM originator → Issued NOTAM
SpaceCom never submits NOTAMs. The draft is a decision support artefact. The mandatory disclaimer on every draft is a non-removable regulatory requirement, not a UI preference.
NOTAM timing requirements by jurisdiction:
- Routine NOTAMs: 24–48 hours minimum lead time
- Short-notice (re-entry window < 24 hours): issue as soon as possible, with whatever lead time remains
- SpaceCom alert thresholds align with these: CRITICAL alert at < 6h, HIGH at < 24h
24.6 Space Law Considerations
UN Liability Convention (1972): All SpaceCom prediction records, simulation runs, and alert acknowledgements may be legally discoverable in an international liability claim. The immutable audit trail (§7.9) is partly an evidence preservation mechanism. Retain reentry_predictions, alert_events, notam_drafts, and shadow_validations for at least 7 years.
National space laws with re-entry obligations:
- Australia: Space (Launches and Returns) Act 2018. CASA and the Australian Space Agency have coordination protocols. SpaceCom's controlled re-entry planner outputs are suitable as evidence for operator obligations under this Act.
- EU/ESA: EU Space Programme Regulation; ESA Zero Debris Charter. SpaceCom supports Zero Debris by characterising re-entry risk and supporting responsible end-of-life planning.
- US: FAA AST re-entry licensing generates data that SpaceCom should ingest when available. 51 USC Chapter 509 obligations may affect US space operator customers.
Space Traffic Management evolution: US Office of Space Commerce is developing civil STM frameworks that may eventually replace Space-Track as the primary civil space data source. SpaceCom's ingest architecture must be adaptable (hardcoded URL constants in ingest/sources.py make this a 1-file change when the source changes).
24.7 ICAO Framework Alignment
Existing: ICAO Doc 10100 (Manual on Space Weather Information, 2019) is the reference manual; ICAO has designated three global space weather centres (NOAA SWPC, the PECASUS consortium, and the ACFJ consortium). SpaceCom's space weather widget must reference these designated centres by name and ICAO recognition status.
Emerging re-entry guidance: ICAO is in early stages of developing re-entry hazard notification guidance (no published document as of 2025). SpaceCom should:
- Monitor ICAO Air Navigation Commission and Meteorology Panel working group outputs
- Design hazard corridor outputs in a format that parallels SIGMET structure (the closest existing ICAO framework: WHO/WHAT/WHERE/WHEN/INTENSITY/FORECAST) — this positions SpaceCom well for whatever standard emerges
- Consider engaging ICAO working groups as a stakeholder; SpaceCom could become a reference implementation
SIGMET parallel structure for re-entry corridor outputs:
REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)
WHO: CZ-5B ROCKET BODY / NORAD 44878
WHAT: UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE
WHERE: CORRIDOR 18S115E TO 28S155E / FL000 TO UNL
WHEN: FROM 2026031614 TO 2026031622 UTC / WINDOW ±4H (P95)
RISK: HIGH / LAND AREA IN CORRIDOR: 12%
FORECAST: CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H
SOURCE: SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3
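Generating the advisory from a prediction record is straightforward string assembly. The sketch below uses assumed field names inferred from the example block; the real prediction schema would drive this.

```python
def format_reentry_advisory(p: dict) -> str:
    """Render a prediction record in the SIGMET-parallel advisory format above.

    Field names (object_name, norad_id, ...) are assumed, chosen to match
    the lines of the example advisory."""
    lines = [
        "REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)",
        f"WHO: {p['object_name']} / NORAD {p['norad_id']}",
        f"WHAT: {p['event_type']}",
        f"WHERE: {p['corridor']}",
        f"WHEN: {p['window']}",
        f"RISK: {p['risk']}",
        f"FORECAST: {p['forecast']}",
        f"SOURCE: {p['source']}",
    ]
    return "\n".join(lines)
```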
24.8 Alert Threshold Governance
Alert threshold values are consequential algorithmic decisions. A CRITICAL threshold that is too sensitive causes unnecessary airspace disruption; one that is too conservative creates false-negative risk. Both outcomes have legal, operational, and reputational consequences.
Current threshold values and rationale:
| Threshold | Value | Rationale |
|---|---|---|
| CRITICAL window | < 6h | Aligns with ICAO minimum NOTAM lead time for short-notice restrictions; 6h allows ANSP to issue NOTAM with ≥2h lead time |
| HIGH window | < 24h | Operational planning horizon for pre-tactical airspace management |
| FIR intersection trigger | p95 corridor intersects any non-zero area of the FIR | Conservative: any non-zero intersection at p95 level generates an alert; minimum area threshold is an org-configurable setting (default: 0) |
| Alert rate limit | 1 CRITICAL per object per 4h window | Prevents alert flooding from repeated window-shrink events without substantive new information |
| Alert storm threshold | > 5 CRITICAL in 1h | Empirically chosen; above this rate the response-time expectation for individual alerts cannot be met |
These values are recorded in docs/alert-threshold-history.md with initial entry date and author sign-off.
Threshold change procedure:
- Engineer proposes change in a PR with rationale documented in docs/alert-threshold-history.md
- PR requires review by engineering lead and product owner before merge
- Change is deployed to staging; minimum 2-week shadow-mode observation period against real TLE/TIP data
- Shadow observation review: false positive rate and false negative rate compared against pre-change baseline
- If baseline comparison passes: change deployed to production; all ANSP shadow deployment partners notified in writing with new threshold values
- If any ANSP objects: change is held until concerns are resolved
Threshold values are not configurable at runtime by operators. They are code constants reviewed through the above process. Org-configurable alert settings (geographic FIR filter, mute rules, OPS_ROOM_SUPPRESS_MINUTES) are UX preferences, not threshold changes.
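"Code constants reviewed through the above process" might look like the sketch below, together with the 4-hour CRITICAL rate limit from the table. Constant and function names are assumed, not existing SpaceCom identifiers.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold constants (values from the table above); changes go through
# the PR + shadow-observation procedure, never runtime configuration.
CRITICAL_WINDOW_H = 6                       # CRITICAL when re-entry window < 6h
HIGH_WINDOW_H = 24                          # HIGH when < 24h
CRITICAL_RATE_LIMIT = timedelta(hours=4)    # 1 CRITICAL per object per 4h window
ALERT_STORM_THRESHOLD = 5                   # > 5 CRITICAL in 1h = alert storm

def may_emit_critical(last_critical_at: Optional[datetime], now: datetime) -> bool:
    """Rate limit: at most one CRITICAL alert per object per 4-hour window."""
    return last_critical_at is None or now - last_critical_at >= CRITICAL_RATE_LIMIT
```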
24.9 Degraded Mode and Availability
SpaceCom must specify degraded mode behaviour for ANSP adoption:
| Condition | System Behaviour | ANSP Action |
|---|---|---|
| Ingest pipeline failure (TLE data > 6h stale) | MEDIUM alert to all operators; staleness indicator on all objects; predictions greyed | Consult Space-Track directly; activate fallback procedure |
| Space weather data > 4h stale | WARNING banner on SpaceWeatherWidget; uncertainty multiplier set to HIGH conservatively | Note wider uncertainty on any operational decisions |
| System unavailable | Push notification to all registered users; email to ANSP contacts | Activate fallback procedure documented in SpaceCom SMS integration guide |
| HMAC verification failure on a prediction | Prediction withheld; CRITICAL security alert; prediction marked integrity_failed | Do not use the withheld prediction; contact SpaceCom immediately |
Degraded mode notification: When SpaceCom is down or data is stale beyond defined thresholds, all connected ANSPs receive push notification (WebSocket if connected; email fallback) so they can activate their fallback procedures. SpaceCom must never go silent when operationally relevant events are active.
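The staleness thresholds in the table translate directly into a condition check. This is a sketch with assumed names; the real system raises the corresponding alerts and UI indicators from these conditions.

```python
from datetime import timedelta

# Staleness thresholds from the degraded-mode table above
TLE_STALE_AFTER = timedelta(hours=6)
SPACE_WEATHER_STALE_AFTER = timedelta(hours=4)

def degraded_conditions(tle_age: timedelta, space_weather_age: timedelta) -> list:
    """Return the degraded-mode conditions currently active."""
    active = []
    if tle_age > TLE_STALE_AFTER:
        active.append("TLE_STALE")            # MEDIUM alert; predictions greyed
    if space_weather_age > SPACE_WEATHER_STALE_AFTER:
        active.append("SPACE_WEATHER_STALE")  # WARNING banner; HIGH uncertainty multiplier
    return active
```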
24.10 EU AI Act Obligations
Classification: SpaceCom's conjunction probability model (§19) and any ML-based alert prioritisation constitute an AI system under EU AI Act Art. 3(1). AI systems used as safety components in the management and operation of critical infrastructure fall under Annex III, point 2; whether aviation decision support is captured must be confirmed by counsel, but the prudent working assumption is that high-risk AI system obligations apply.
High-risk AI system obligations (EU AI Act Chapter III Section 2):
| Obligation | Article | SpaceCom implementation |
|---|---|---|
| Risk management system | Art. 9 | Integrate with existing SMS (§24.4); maintain AI-specific risk register in legal/EU_AI_ACT_ASSESSMENT.md |
| Data governance | Art. 10 | TLE training data provenance documented; simulations.params_json stores full input provenance; bias assessment required for orbital prediction models |
| Technical documentation | Art. 11 + Annex IV | legal/EU_AI_ACT_ASSESSMENT.md — system description, capabilities, limitations, human oversight measures, accuracy characterisation |
| Record-keeping / automatic logging | Art. 12 | reentry_predictions and alert_events tables provide automatic event logging; immutable (APPEND-only with HMAC) |
| Transparency to users | Art. 13 | Conjunction probability values labelled with model version (simulations.model_version), TLE age, EOP currency; uncertainty bounds displayed |
| Human oversight | Art. 14 | All decisions remain with duty controller (§24.2 AUP; §28.6 Decision Prompts disclaimer); no autonomous action taken by SpaceCom |
| Accuracy, robustness, cybersecurity | Art. 15 | Accuracy characterisation (§24.3 ICAO Data Quality); adversarial robustness covered by §7 and §36 security review |
| Conformity assessment | Art. 43 | Self-assessment pathway available for transport safety AI without third-party involvement at first deployment; document in legal/EU_AI_ACT_ASSESSMENT.md |
| EU database registration | Art. 51 | High-risk AI systems must be registered in the EU AI Act database before placing on market; legal milestone in deployment roadmap |
Human oversight statement (required in UI — Art. 14): The conjunction probability display (§19.4) must include the following non-configurable statement in the model information panel:
"This probability estimate is generated by an AI model and is subject to uncertainty arising from TLE age, atmospheric model limitations, and manoeuvre uncertainty. All operational decisions remain with the duty controller. This system does not replace ANSP procedures."
Gap analysis and roadmap: legal/EU_AI_ACT_ASSESSMENT.md must document: current compliance state → gaps → remediation actions → target dates. Phase 2 gate: conformity assessment documentation complete. Phase 3 gate: EU database registration completed before commercial EU deployment.
24.11 Regulatory Correspondence Register
For an ANSP-facing product, regulators (CAA, EASA, national ANSPs, ESA, ICAO) will issue queries, audits, formal requests, and correspondence. Missed regulatory deadlines can constitute a licence breach or grounds for suspension of operations.
Correspondence log: legal/REGULATORY_CORRESPONDENCE_LOG.md — structured register with the following fields per entry:
| Field | Description |
|---|---|
| Date received | ISO 8601 |
| Authority | Regulatory body name and country |
| Reference number | Authority's reference (if given) |
| Subject | Brief description |
| Deadline | Formal response deadline (ISO 8601) |
| Owner | Named individual responsible for response |
| Status | PENDING / RESPONDED / CLOSED / ESCALATED |
| Response date | Date formal response sent |
| Notes | Internal context, legal counsel involvement |
SLAs:
- All regulatory correspondence acknowledged (receipt confirmed to sender) within 2 business days
- Substantive response or extension request within 14 calendar days (or as required by the correspondence)
- All correspondence older than 14 days without a RESPONDED or CLOSED status triggers an escalation to the CEO
Proactive regulatory engagement: The correspondence register is reviewed at each quarterly steering meeting. Any authority that has issued ≥3 queries in a 12-month period warrants a proactive engagement call to identify and address systemic concerns before they become formal regulatory actions.
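The 14-day escalation rule is mechanical enough to automate against the register. The sketch below uses assumed field names taken from the register table; `needs_escalation` is an illustrative name.

```python
from datetime import date, timedelta

def needs_escalation(entry: dict, today: date) -> bool:
    """Escalate to the CEO when an entry is more than 14 days old and
    has not reached RESPONDED or CLOSED status."""
    still_open = entry["status"] not in ("RESPONDED", "CLOSED")
    return still_open and (today - entry["date_received"]) > timedelta(days=14)
```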
24.12 Safety Case Framework (F1 — §61)
A safety case is a structured argument that a system is acceptably safe for a specified use in a defined context. SpaceCom must produce and maintain a safety case before any operational ANSP deployment. The safety case is a living document, updated at each material system change.
Safety case structure (Goal Structuring Notation — GSN, consistent with EUROCAE ED-153 / IEC 61508 safety case guidance):
G1: SpaceCom is acceptably safe to use as a decision support tool
    for re-entry hazard awareness in civil airspace operations
  C1: Context — SpaceCom operates as decision support (not autonomous authority);
      all operational decisions remain with the ANSP duty controller
  S1: Argument strategy — safety achieved by hazard identification,
      risk reduction, and operational constraints
  G1.1: All identified hazards are mitigated to acceptable risk levels
    Sn1: Hazard Log (docs/safety/HAZARD_LOG.md)
      E1.1.1: HZ-001 through HZ-007 mitigation evidence (§24.4)
      E1.1.2: Shadow validation report (≥30 day trial)
  G1.2: System integrity is maintained through all operational modes
    Sn2: HMAC integrity on all safety-critical records (§7.9)
      E1.2.1: `@pytest.mark.safety_critical` test suite — 100% pass
      E1.2.2: Integrity failure quarantine demonstrated (§56 E2E test)
  G1.3: Operators are trained and capable of correct system use
    Sn3: Operator Training Programme (§28.9)
      E1.3.1: Training completion records (operator_training_records table)
      E1.3.2: Reference scenario completion evidence
  G1.4: Degraded mode provides adequate notification for fallback
    Sn4: Degraded mode specification (§24.9)
      E1.4.1: ANSP communication plan activated in game day exercise (§26.8)
  G1.5: Regulatory obligations are met for the deployment jurisdiction
    Sn5: Means of Compliance document (§24.14)
      E1.5.1: Legal opinions for deployment jurisdictions (§24.2)
      E1.5.2: ANSP SMS integration guide (§24.15)
Safety case document: docs/safety/SAFETY_CASE.md. Version-controlled; each tagged release includes a safety case snapshot. Safety case review is required before:
- ANSP shadow mode activation
- Model version updates that affect prediction outputs
- New deployment jurisdiction
- Any change to alert thresholds (§24.8)
Safety case custodian: Named individual (Phase 2: CEO or CTO until a dedicated safety manager is appointed). Changes to the safety case require the custodian's sign-off.
24.13 Software Assurance Level (SAL) Assignment (F2 — §61)
EUROCAE ED-153 / DO-278A defines Software Assurance Levels for ground-based aviation software systems. The appropriate SAL determines the rigour of development, verification, and documentation activities required.
SpaceCom SAL assignment:
| Component | Failure Condition | Severity Class | SAL | Rationale |
|---|---|---|---|---|
| Re-entry prediction engine (physics/) | False all-clear (HZ-002) | Hazardous | SAL-2 | Undetected false negative could contribute to an airspace safety event; highest-consequence component |
| Alert generation pipeline (alerts/) | Failed alert delivery; wrong threshold applied | Hazardous | SAL-2 | Failure to generate a CRITICAL alert during an active event is equivalent in consequence to HZ-002 |
| HMAC integrity verification | Integrity failure undetected | Hazardous | SAL-2 | Loss of integrity checking removes the primary guard against data manipulation |
| CZML corridor rendering | Wrong geographic position displayed (HZ-004) | Hazardous | SAL-2 | Geographic display error directly misleads the operator |
| API authentication and authorisation | Unauthorised data access (HZ-007) | Major | SAL-3 | Privacy and data governance impact; not directly causal of an airspace event |
| Ingest pipeline (worker/) | Stale data not detected (HZ-005) | Major | SAL-3 | Staleness monitoring is a mitigation for HZ-005; its failure increases HZ-005 likelihood |
| Frontend (non-safety-critical paths) | Cosmetic / non-operational UI failure | Minor | SAL-4 | Not in the safety-critical path |
SAL-2 implications (minimum activities required):
- Independent verification of requirements, design, and code for SAL-2 components (see §24.16 Verification Independence)
- Formal test coverage: 100% statement coverage for SAL-2 modules (enforced via @pytest.mark.safety_critical)
- Configuration management of all SAL-2 source files and their test artefacts (see §30.8)
- SAL-2 components documented in the safety case with traceability from requirement → design → code → test
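The coverage requirement can be enforced as a CI gate over the coverage report. The sketch below assumes a simple `{path: percent}` mapping (in CI this would be parsed from coverage.py output) and an illustrative module list; `sal2_coverage_ok` is an assumed name.

```python
# SAL-2 module prefixes from the assignment table above (illustrative list)
SAL2_MODULES = ("physics/", "alerts/")

def sal2_coverage_ok(coverage_by_file: dict) -> bool:
    """CI gate: every file under a SAL-2 module must report 100% statement coverage.

    Files outside SAL-2 modules are ignored by this gate."""
    return all(
        pct == 100.0
        for path, pct in coverage_by_file.items()
        if path.startswith(SAL2_MODULES)
    )
```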
SAL assignment document: docs/safety/SAL_ASSIGNMENT.md — reviewed at each architecture change and before any ANSP deployment.
24.14 Means of Compliance (MoC) Document (F8 — §61)
A Means of Compliance document maps each regulatory or standard requirement to the specific implementation evidence that demonstrates compliance. Required before any formal regulatory submission (ESA bid, EASA consultation response, ANSP safety acceptance).
Document: docs/safety/MEANS_OF_COMPLIANCE.md
Structure:
| Requirement ID | Source | Requirement Text (summary) | Means of Compliance | Evidence Location | Status |
|---|---|---|---|---|---|
| MOC-001 | EUROCAE ED-153 §5.3 | Software requirements defined and verifiable | Requirements documented in relevant §sections of MASTER_PLAN; acceptance criteria in TEST_PLAN | docs/TEST_PLAN.md; relevant §sections | PARTIAL |
| MOC-002 | EUROCAE ED-153 §6.4 | Independent verification of SAL-2 software | Verification independence policy (§24.16); separate reviewer for safety-critical PRs | docs/safety/VERIFICATION_INDEPENDENCE.md | PLANNED |
| MOC-003 | ICAO Annex 15 §3.2 | Data quality attributes characterised | ICAO data quality table (§24.3); accuracy characterisation document | docs/validation/ACCURACY_CHARACTERISATION.md | PARTIAL (Phase 3) |
| MOC-004 | ICAO Annex 19 | ANSP SMS integration supported | SMS integration guide; hazard register; training programme | docs/safety/ANSP_SMS_GUIDE.md; docs/safety/HAZARD_LOG.md | PLANNED |
| MOC-005 | EU AI Act Art. 9 | Risk management system documented | AI Act assessment; hazard log; safety case | legal/EU_AI_ACT_ASSESSMENT.md; docs/safety/HAZARD_LOG.md | IN PROGRESS |
| MOC-006 | DO-278A §10 | Configuration management of safety artefacts | CM policy (§30.8); Git tagging of releases; signed commits | docs/safety/CM_POLICY.md | PLANNED |
| MOC-007 | ED-153 §7.2 | Safety occurrence reporting procedure | Runbook in §26.8; SAFETY_OCCURRENCE log type | docs/runbooks/; security_logs table | IMPLEMENTED |
The MoC document is a Phase 2 deliverable. PARTIAL items become Phase 3 gates. PLANNED items require assigned owners and completion dates before ANSP shadow activation.
24.15 ANSP-Side Obligations Document (F10 — §61)
SpaceCom cannot unilaterally satisfy all regulatory requirements — the receiving ANSP has obligations that SpaceCom must document and communicate. Failing to do so is a gap in the safety argument.
Document: docs/safety/ANSP_SMS_GUIDE.md — provided to every ANSP before shadow mode activation.
ANSP obligations by category:
| Category | ANSP Obligation | SpaceCom Provides |
|---|---|---|
| SMS integration | Include SpaceCom in ANSP SMS under ICAO Annex 19 | Hazard register contribution (§24.4); SAL assignment; safety case |
| Change notification | Notify SpaceCom of any ANSP procedure changes that affect how SpaceCom outputs are used | Change notification contact in MSA |
| Operator training | Ensure all SpaceCom users complete the operator training programme (§28.9) | Training modules; completion API; training records |
| Fallback procedure | Maintain and exercise a fallback procedure for SpaceCom unavailability | Fallback procedure template in onboarding documentation |
| Occurrence reporting | Report any safety occurrence involving SpaceCom outputs to SpaceCom within 24 hours | Safety occurrence form; contact details; §26.8 runbook |
| Regulatory notification | Notify applicable safety regulator of SpaceCom use if required by national SMS regulations | System description one-pager for regulator submission |
| Shadow validation | Participate in ≥30-day shadow validation trial; provide evaluation feedback | Shadow validation report template; shadow validation dashboard |
| AUP acceptance | Ensure all users accept the AUP (§24.2) | Automated AUP flow; compliance report for ANSP admin |
Liability assignment note (links to §24.2 and §24.12 F11): The ANSP SMS guide explicitly states that the ANSP retains full operational authority and accountability for all air traffic decisions, regardless of SpaceCom outputs. SpaceCom is a decision support tool. This statement must appear in the ANSP SMS guide, the AUP, and the safety case context node C1 (§24.12).
25.1 Target Tender Profile
SpaceCom targets ESA tenders in the following programme areas:
- Space Safety Programme — re-entry risk, SSA services, space debris
- GSTP (General Support Technology Programme) — technology development with commercial potential
- ARTES (Advanced Research in Telecommunications Systems) — if the commercial operator portal reaches satellite operators
- Space-Air Traffic Integration studies — the category matching ESA's OKAPI:Orbits award
25.2 Differentiation from ESA ESOC Re-entry Prediction Service
ESA's re-entry prediction service (reentry.esoc.esa.int) is a technical product for space operators and agencies. SpaceCom is not a competitor to this service — it is a complementary operational layer that could consume ESOC outputs:
| Dimension | ESA ESOC Service | SpaceCom |
|---|---|---|
| Primary user | Space agencies, debris researchers | ANSPs, airspace managers, space operators |
| Output format | Technical prediction reports | Operational decision support + NOTAM drafts |
| Aviation integration | None | Core feature |
| ANSP decision workflow | Not designed for this | Primary design target |
| Space operator portal | Not provided | Phase 2 deliverable |
| Shadow mode / regulatory adoption | Not provided | Built-in |
In an ESA bid: Position SpaceCom as the user-facing operational layer that sits on top of the space surveillance and prediction infrastructure that ESA already operates. ESA invests in the physics; SpaceCom invests in the interface that makes the physics actionable for aviation authorities and space operators.
25.3 TRL Roadmap (ESA Definitions)
| Phase | End TRL | Evidence |
|---|---|---|
| Phase 1 complete | TRL 4 | Validated decay predictor (≥3 historical backcasts); SGP4 globe with real TLE data; Mode A corridors; HMAC integrity; full security infrastructure |
| Phase 2 complete | TRL 5 | Atmospheric breakup; Mode B heatmap; NOTAM drafting; space operator portal; CCSDS export; shadow mode; ≥1 ANSP shadow deployment running |
| Phase 3 complete | TRL 6 | System demonstrated in operationally relevant environment; ≥1 ANSP shadow deployment with ≥4 weeks validation data; external penetration test passed; ECSS compliance artefacts complete |
| Post-Phase 3 | TRL 7 | System prototype demonstrated in operational environment (live ANSP deployment, not shadow) |
25.4 ECSS Standards Compliance
ESA contracts require compliance with the European Cooperation for Space Standardization (ECSS). Required compliance mapping:
| Standard | Title | SpaceCom Compliance |
|---|---|---|
| ECSS-Q-ST-80C | Software Product Assurance | Software Management Plan, V&V Plan, Product Assurance Plan — produced Phase 3 |
| ECSS-E-ST-10-04C | Space environment | NRLMSISE-00 and JB2008 compliance with ECSS atmospheric model requirements |
| ECSS-E-ST-10-12C | Methods for re-entry and debris footprint calculation | Decay predictor and atmospheric breakup model methodology documented and traceable |
| ECSS-U-AS-010C | Space sustainability | Zero Debris Charter alignment statement; controlled re-entry planner outputs |
Compliance matrix document (produced Phase 3): Maps every ECSS requirement to the relevant SpaceCom component, test, or document. Required for ESA tender submission.
25.5 ESA Zero Debris Charter Alignment
SpaceCom directly supports the Zero Debris Charter objectives:
| Charter Objective | SpaceCom Support |
|---|---|
| Responsible end-of-life disposal | Controlled re-entry planner generates CCSDS-format manoeuvre plans minimising ground risk |
| Transparency of re-entry risk | Public hazard corridor data; NOTAM drafting; multi-ANSP coordination |
| Reduction of casualty risk | Atmospheric breakup model; casualty area computation; population density weighting in deorbit optimiser |
| Data sharing | API layer for space operator integration; CCSDS export; open prediction endpoints |
Include Zero Debris Charter alignment statement in all ESA bid submissions.
25.6 Required ESA Procurement Artefacts
All ESA contracts require these management documents. SpaceCom must produce them by Phase 3:
| Document | ECSS Reference | Content |
|---|---|---|
| Software Management Plan (SMP) | ECSS-Q-ST-80C §5 | Development methodology, configuration management, change control, documentation standards |
| Verification and Validation Plan (VVP) | ECSS-Q-ST-80C §6 | Test strategy, traceability from requirements to test cases, acceptance criteria |
| Product Assurance Plan (PAP) | ECSS-Q-ST-80C §4 | Safety, reliability, quality standards and how they are met |
| Data Management Plan (DMP) | ECSS-Q-ST-80C §8 | How data produced under contract is handled, shared, archived, and made reproducible |
| Software Requirements Specification (SRS) | Tailored ECSS-E-ST-40C | Software requirements baseline, interfaces, external dependencies, and bounded assumptions including air-risk and RDM exchange boundaries |
| Software Design Description (SDD) | Tailored ECSS-E-ST-40C | Module architecture, algorithm choices, interface contracts, and validation assumptions |
| User Manual / Ops Guide | Tailored ECSS-E-ST-40C | Installation, configuration, operator workflows, limitations, and degraded-mode handling |
| Test Plan + Test Report | Tailored ECSS-Q-ST-80C | Planned validation campaign, executed results, deviations, and acceptance evidence for procurement submission |
| Accessibility Conformance Report (ACR/VPAT 2.4) | EN 301 549 v3.2.1 | WCAG 2.1 AA conformance declaration; mandatory for EU public sector ICT procurement; maps each success criterion to Supports / Partially Supports / Does Not Support with remarks |
Scaffold documents for all procurement-facing artefacts should be created at Phase 1 start and maintained throughout development — not produced from scratch at Phase 3.
For contracts with explicit software prototype review gates (e.g. PDR, TRR, CDR, QR, FR), the SRS, SDD, User Manual, Test Plan, and Test Report are updated incrementally at each milestone rather than back-filled only at final review.
25.7 Consortium Strategy
ESA study contracts typically favour consortia that combine:
- Technical depth (university or research institute)
- Industrial relevance (commercial applicability)
- End-user representation (the entity that will use the output)
SpaceCom's ideal consortium for an ESA bid:
- SpaceCom (lead) — system integration, aviation domain interface, commercial deployment
- Academic partner (orbital mechanics / atmospheric density modelling credibility — equivalent to TU Braunschweig in the OKAPI:Orbits consortium)
- ANSP or aviation authority (end-user representation — demonstrates the aviation gap is real and the solution is wanted)
Without a credentialled academic or research partner for the physics components, ESA evaluators may question the technical depth. Identify and approach potential academic partners before submitting to any ESA tender.
25.8 Intellectual Property Framework for ESA Bids
ESA contracts operate under the ESA General Conditions of Contract, which distinguish between background IP (pre-existing IP brought into the contract) and foreground IP (IP created during the contract). The default terms grant ESA a non-exclusive, royalty-free licence to use foreground IP, while the contractor retains ownership. These terms are negotiable and must be agreed before contract signature.
Required IP actions before bid submission:
- Background IP schedule: Document all SpaceCom components that constitute background IP — physics engine, data model, UX design, proprietary algorithms. This schedule protects SpaceCom's ability to continue commercial deployment after the ESA contract ends without ESA claiming rights to the core product.
- Foreground IP boundary: Define clearly what will be created during the ESA contract (e.g., specific ECSS compliance artefacts, validation datasets, TRL demonstration reports) versus what SpaceCom brings in as background IP. Narrow the foreground IP scope to ESA-specific deliverables only.
- Software Bill of Materials (SBOM): Required for ECSS compliance and as part of the ESA bid artefact package. Generated via syft or cyclonedx-bom. Must identify all third-party licences. AGPLv3 components (notably CesiumJS community edition) cannot be in the SBOM of a closed-source ESA deliverable — commercial licence required.
- Consortium Agreement: Must be signed by all consortium members before bid submission. Must specify:
- IP ownership for each consortium member's contributions
- Publication rights for academic partners (must not conflict with any commercial confidentiality obligations)
- Revenue share for any commercial use arising from the contract
- Liability allocation between consortium members
- Exit terms if a member withdraws
- Export control pre-clearance: Confirm with counsel that the planned ESA deliverable does not require an export licence for transfer to ESA (a Paris-based intergovernmental organisation). Generally covered under EAR licence exception GOV, but verify for any controlled technology components.
26. SRE and Reliability Framework
26.1 Service Level Objectives
SpaceCom is most critical during active re-entry events — peak load coincides with highest operational stakes. Standard availability metrics are insufficient. SLOs must be defined against event-correlated conditions, not just averages.
| Service Level Indicator | SLO | Measurement Window | Notes |
|---|---|---|---|
| Prediction API availability | 99.9% | Rolling 30 days | 43.2 min error budget per 30-day window |
| Prediction API availability (active TIP event) | 99.95% | Duration of TIP window | Stricter; degradation during events is SEV-1 |
| Decay prediction latency p50 | < 90s | Per MC job | 500-sample chord run |
| Decay prediction latency p95 | < 240s | Per MC job | Drives worker sizing (§27) |
| CZML ephemeris load p95 | < 2s | Per request | 100-object catalog |
| TIP message ingest latency | < 30 min from publication | Per TIP message | Drives CRITICAL alert timing |
| Space weather update latency | < 15 min from NOAA SWPC | Per update cycle | Drives uncertainty multiplier refresh |
| Alert WebSocket delivery latency | < 10s from trigger | Per alert | Measured trigger→client receipt |
| Corridor update after new TIP | < 60 min | Per TIP message | Full MC rerun triggered |
Error budget policy: When the 30-day rolling error budget is exhausted, no further deployments or planned maintenance are permitted until the next measurement window opens. Tracked in Grafana SLO dashboard (§26.8).
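The budget figures in the table follow directly from the SLO arithmetic; a small sketch (function name assumed) that a dashboard or test could reuse:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime for a given availability SLO over a rolling window."""
    return window_days * 24 * 60 * (1.0 - slo)

# 99.9% over a rolling 30-day window:
print(round(error_budget_minutes(0.999, 30), 1))   # 43.2
# 99.95% over a hypothetical 7-day TIP window:
print(round(error_budget_minutes(0.9995, 7), 1))   # 5.0
```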
SLOs must be written into the model user agreement (§24.2) and agreed with each ANSP customer before operational deployment. ANSPs need defined thresholds to determine when to activate their fallback procedures.
Customer-facing SLA (Finding 7) — contractual commitments in the MSA:
Internal SLOs are aspirational targets; the SLA is a binding contractual commitment with defined measurement, exclusions, and credits. The MSA template includes the following SLA schedule:
| Metric | SLA commitment | Measurement | Exclusions |
|---|---|---|---|
| Monthly availability | 99.5% | External uptime monitor; excludes scheduled maintenance (max 4h/month; 48h advance notice) | Force majeure; upstream data source outages (Space-Track, NOAA SWPC) lasting > 4h |
| Critical alert delivery | Within 5 minutes of trigger (p95) | alert_events.created_at → delivered_websocket/email = TRUE timestamp | Customer network connectivity issues |
| Prediction freshness | p50 updated within 4h of new TLE availability | tle_sets.ingested_at → reentry_predictions.created_at | Space-Track API outage > 4h |
| Support response — CRITICAL incident | Initial response within 1 hour | From customer report or automated alert, whichever earlier | Outside contracted support hours (on-call for CRITICAL) |
| Support response — P1 resolution | Within 8 hours | From initial response | — |
| Service credits | 1 day credit per 0.1% availability below SLA | Applied to next invoice | — |
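The service-credit row can be pinned down with a worked computation. A sketch, under the assumption that credits accrue linearly per 0.1% of availability shortfall and are capped at the invoice period (the cap is an illustrative assumption, not an MSA term):

```python
def service_credit_days(measured_availability: float, sla: float = 0.995,
                        cap_days: int = 30) -> int:
    """1 day of credit per 0.1% (0.001) of availability below the SLA,
    capped at the invoice period (assumed 30 days)."""
    shortfall = max(0.0, sla - measured_availability)
    return min(cap_days, int(shortfall / 0.001))

print(service_credit_days(0.992))  # 0.3% below 99.5% -> 3 days
```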
Any SRE threshold change that could cause an SLA breach (e.g., raising the ingest failure alert threshold beyond 4 hours) must be reviewed by the product owner before deployment. Tracked in docs/sla/sla-schedule-v{N}.md (versioned; MSA references the current version by number).
26.2 Recovery Objectives
| Objective | Target | Scope | Derivation |
|---|---|---|---|
| RTO (active TIP event) | ≤ 15 minutes | Prediction API restoration | CRITICAL alert rate-limit window is 4 hours per object; 15-minute outage is tolerable within this window without skipping a CRITICAL cycle; beyond 15 minutes the ANSP must activate fallback procedures |
| RTO (no active event) | ≤ 60 minutes | Full system restoration | 1-hour window aligns with MSA SLA commitment; exceeding this triggers the P1 communication plan |
| RPO (safety-critical tables) | Zero | reentry_predictions, alert_events, security_logs, notam_drafts — synchronous replication required | UN Liability Convention evidentiary requirements; loss of a single alert acknowledgement record could be material in a liability investigation |
| RPO (operational data) | ≤ 5 minutes | orbits, tle_sets, simulations — async replication acceptable | 5-minute data age is within the staleness tolerance for TLE-based predictions; loss of in-flight simulations is recoverable by re-submission |
MSA sign-off requirement: RTO and RPO targets must be explicitly stated and agreed in the Master Services Agreement with each ANSP customer before any production deployment. Customers must acknowledge that the fallback procedure (Space-Track direct + ESOC public re-entry page) is their responsibility during the RTO window. RTO/RPO targets are not unilaterally changeable by SpaceCom — any tightening requires customer notification ≥30 days in advance; any relaxation requires customer consent.
26.3 High Availability Architecture
TimescaleDB — Streaming Replication + Patroni
# Primary + hot standby; Patroni manages leader election and failover
db_primary:
image: timescale/timescaledb-ha:pg17
environment:
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_REPLICATION_USERNAME: replicator
networks: [db_net]
db_standby:
image: timescale/timescaledb-ha:pg17
environment:
PATRONI_REPLICA: "true"
networks: [db_net]
etcd:
image: bitnami/etcd:3 # Patroni DCS
networks: [db_net]
- Synchronous replication for reentry_predictions, alert_events, security_logs, notam_drafts (RPO = 0): synchronous_standby_names = 'FIRST 1 (db_standby)' with a per-transaction synchronous_commit override for writes to these tables (PostgreSQL has no table-level synchronous commit setting)
- Asynchronous replication for orbits, tle_sets (RPO ≤ 5 min): default async
- Patroni auto-failover: standby promoted within ~30s of primary failure, well within the 15-minute RTO
Required Patroni configuration parameters (must be present in patroni.yml; CI validation via scripts/check_patroni_config.py):
bootstrap:
dcs:
maximum_lag_on_failover: 1048576 # 1 MB; standby > 1 MB behind primary is excluded from failover election
synchronous_mode: true # Enable synchronous replication mode
synchronous_mode_strict: true # Primary refuses writes if no synchronous standby confirmed; prevents split-brain
postgresql:
parameters:
wal_level: replica # Required for streaming replication; 'minimal' breaks replication
recovery_target_timeline: latest # Follow timeline switches after failover; required for correct standby behaviour
Rationale:
- maximum_lag_on_failover: without this, a severely lagged standby could be promoted as primary and serve stale data for safety-critical tables.
- synchronous_mode_strict: true: trades availability for consistency — the primary halts rather than allowing an unconfirmed write to proceed without a standby. Acceptable given the 15-minute RTO SLO.
- wal_level: replica: minimal disables the WAL detail needed for streaming replication; must be explicitly set.
- recovery_target_timeline: latest: without this, a promoted standby may not follow future timeline switches after failover, causing divergence.
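The CI validation mentioned above (scripts/check_patroni_config.py) could be as simple as asserting the required keys against the parsed patroni.yml. A sketch operating on an already-parsed dict (the real script would load the YAML first; structure assumed from the fragment above):

```python
REQUIRED = {
    ("bootstrap", "dcs", "maximum_lag_on_failover"): 1048576,
    ("bootstrap", "dcs", "synchronous_mode"): True,
    ("bootstrap", "dcs", "synchronous_mode_strict"): True,
    ("postgresql", "parameters", "wal_level"): "replica",
    ("postgresql", "parameters", "recovery_target_timeline"): "latest",
}

def check_patroni_config(cfg: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes CI."""
    errors = []
    for path, expected in REQUIRED.items():
        node = cfg
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if node != expected:
            errors.append(f"{'.'.join(path)}: expected {expected!r}, got {node!r}")
    return errors

good = {"bootstrap": {"dcs": {"maximum_lag_on_failover": 1048576,
                              "synchronous_mode": True,
                              "synchronous_mode_strict": True}},
        "postgresql": {"parameters": {"wal_level": "replica",
                                      "recovery_target_timeline": "latest"}}}
print(check_patroni_config(good))  # []
```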
Redis — Sentinel (3 Nodes)
redis-master:
image: redis:7-alpine
command: redis-server /etc/redis/redis.conf
redis-sentinel-1:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-2:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-3:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
Three Sentinel instances form a quorum. If the master fails, Sentinel promotes a replica within ~10s. The backend and workers use redis-py's Sentinel client which transparently follows the master after failover.
Redis Sentinel split-brain risk assessment (F3 — §67): In a network partition where Sentinel nodes disagree on master reachability, two Sentinels could theoretically promote two different replicas simultaneously. The min-replicas-to-write 1 setting on the Redis master mitigates this: the old master stops accepting writes when it loses contact with its replicas, forcing clients to the new master.
SpaceCom's Redis data is largely ephemeral — Celery broker messages, WebSocket session state, application cache. A split-brain that loses a small number of Celery tasks or cache entries is survivable. The one persistent concern is the per-org email rate limit counter (spacecom:email_rate:{org_id}:{hour}, §65 F7): a split-brain could result in two independent counters, both allowing up to 50 emails, for a brief period before the split resolves. This is accepted: the 50/hr limit is a cost control, not a safety guarantee. Email volume during a short Sentinel split-brain is not a safety risk.
Risk acceptance and configuration: Set the following values (the sentinel directives in sentinel.conf; the min-replicas-* directives in redis.conf on the master):
sentinel down-after-milliseconds spacecom-redis 5000
sentinel failover-timeout spacecom-redis 60000
sentinel parallel-syncs spacecom-redis 1
min-replicas-to-write 1
min-replicas-max-lag 10
ADR: docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md
Cross-Region Disaster Recovery — Warm Standby (F7)
Single-region deployment cannot meet the RTO ≤ 60 minutes target against a full cloud region failure. A warm standby in a second region provides the required recovery path.
Strategy: Warm standby (not hot active-active) — reduces cost and complexity while meeting RTO.
| Component | Primary region | DR region | Failover mechanism |
|---|---|---|---|
| TimescaleDB | Primary + hot standby | Read replica (streaming replication from primary) | Promote replica; update DNS; follow db-failover-dr runbook |
| Application tier | Running | Stopped; container images pre-pulled from GHCR | Deploy from images on failover; < 10 minutes |
| MinIO (object storage) | Active | Active (bucket replication enabled) | Already in sync; no failover needed |
| Redis | Active | Cold (config ready) | Restart on failover; session loss acceptable (operators re-authenticate) |
| DNS | Primary A record | Secondary A record in Route 53 (or equiv.) | Health-check-based routing; TTL 60s; auto-failover on primary health check failure |
Failover time estimate: DB promotion 2–5 minutes + DNS propagation ~1 minute + app deploy < 10 minutes ≈ 13–16 minutes total. At the upper bound this marginally exceeds the 15-minute active-TIP RTO, so DB promotion is the step to prioritise; it is comfortably within the 60-minute no-event RTO.
Runbook: docs/runbooks/region-failover.md — tested annually as game day scenario 6. Post-failover checklist: verify HMAC validation on restored primary; verify WAL integrity; notify ANSPs of region switch; schedule return to primary region within 48 hours.
26.4 Celery Reliability
Task Acknowledgement and Crash Safety
# celeryconfig.py
task_acks_late = True # Task not acknowledged until complete; if worker dies mid-task, task is requeued
task_reject_on_worker_lost = True # Orphaned tasks requeued, not dropped
task_serializer = 'json'
result_expires = 86400 # Results expire after 24h; database is the durable store
worker_prefetch_multiplier = 1 # F6 §58: long MC tasks (up to 240s) — prefetch=1 prevents worker A
# holding 4 tasks while workers B/C/D are idle; fair distribution
Dead Letter Queue
Failed tasks (exception, timeout, or permanent error) must be captured, not silently dropped:
# In Celery task base class (dead_letter_queue is the shared Redis client)
import json
from celery import Task

class SpaceComTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Mark the simulation as terminally failed so status never sticks at 'running'
        update_simulation_status(task_id, 'failed', error_detail=str(exc))
        # Route to dead letter queue for inspection; never drop silently
        dead_letter_queue.rpush('dlq:failed_tasks', json.dumps({
            'task_id': task_id, 'task_name': self.name,
            'error': str(exc), 'failed_at': utcnow().isoformat()
        }))
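During incident review, the dlq:failed_tasks entries can be summarised by task name; a sketch operating on the JSON payloads pushed above (helper name assumed):

```python
import json
from collections import Counter

def summarise_dlq(raw_entries: list[str]) -> dict[str, int]:
    """Count failed tasks by task name from raw DLQ JSON strings,
    as pushed by the on_failure handler above."""
    return dict(Counter(json.loads(e)["task_name"] for e in raw_entries))

entries = [
    '{"task_id": "a1", "task_name": "modules.propagator.run_mc", "error": "timeout", "failed_at": "2025-01-01T00:00:00"}',
    '{"task_id": "a2", "task_name": "modules.propagator.run_mc", "error": "timeout", "failed_at": "2025-01-01T00:05:00"}',
]
print(summarise_dlq(entries))  # {'modules.propagator.run_mc': 2}
```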
Queue Routing (Ingest vs Simulation Isolation)
CELERY_TASK_ROUTES = {
'modules.ingest.*': {'queue': 'ingest'},
'modules.propagator.*': {'queue': 'simulation'},
'modules.breakup.*': {'queue': 'simulation'},
'modules.conjunction.*': {'queue': 'simulation'},
'modules.reentry.controlled.*': {'queue': 'simulation'},
}
Two separate worker processes — never competing on the same queue:
# Ingest worker: always running, low concurrency
celery worker --queue=ingest --concurrency=2 --hostname=ingest@%h
# Simulation worker: high concurrency for MC sub-tasks (see §27.2)
celery worker --queue=simulation --concurrency=16 --pool=prefork --hostname=sim@%h
Per-organisation priority isolation (F8): All organisations share the simulation queue, but job priority is set at submission time based on subscription tier and event criticality. This prevents a shadow_trial org's bulk simulation from starving a CRITICAL alert computation for an ansp_operational org.
TIER_TASK_PRIORITY = {
"internal": 9,
"institutional": 8,
"ansp_operational": 7,
"space_operator": 5,
"shadow_trial": 3,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2 # added when active TIP event exists for the org's objects
def get_task_priority(org_tier: str, has_active_tip: bool) -> int:
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    # Clamp to the broker's 0-9 priority range
    return min(9, base + (CRITICAL_EVENT_PRIORITY_BOOST if has_active_tip else 0))
# At submission:
task.apply_async(priority=get_task_priority(org.subscription_tier, active_tip))
Redis with maxmemory-policy noeviction supports Celery task priorities natively (0–9). Workers process higher-priority tasks first when multiple tasks are queued. Ingest tasks always route to the separate ingest queue and are unaffected by simulation priority.
Celery Beat — High Availability with celery-redbeat
Standard Celery Beat is a single-process SPOF. celery-redbeat stores the schedule in Redis with distributed locking — multiple Beat instances can run; only one holds the lock at a time:
CELERY_BEAT_SCHEDULER = 'redbeat.RedBeatScheduler'
REDBEAT_REDIS_URL = settings.redis_url
REDBEAT_LOCK_TIMEOUT = 60 # 60s; crashed leader blocks scheduling for at most 60s
REDBEAT_MAX_SLEEP_INTERVAL = 5 # standby instances check for lock every 5s after TTL expiry
The default REDBEAT_LOCK_TIMEOUT = max_interval × 5 (typically 25 minutes) is too long during active TIP events — a crashed Beat leader would prevent TIP polling for up to 25 minutes. At 60 seconds, a failover causes at most a 60-second scheduling gap. The standby Beat instance acquires the lock within 5 seconds of TTL expiry (REDBEAT_MAX_SLEEP_INTERVAL = 5).
During an active TIP window (spacecom_active_tip_events > 0), the AlertManager rule for TIP ingest failure uses a 10-minute threshold rather than the baseline 4-hour threshold — ensuring a Beat failover gap does not silently miss critical TIP updates.
26.5 Health Checks
Every service exposes two endpoints. Docker Compose depends_on: condition: service_healthy uses these — the backend does not start until the database is healthy.
Liveness probe (GET /healthz) — process is alive; returns 200 unconditionally if the process can respond. Does not check dependencies.
Readiness probe (GET /readyz) — process is ready to serve traffic:
@app.get("/readyz")
async def readiness(db: AsyncSession = Depends(get_db)):
checks = {}
# Database connectivity
try:
await db.execute(text("SELECT 1"))
checks["database"] = "ok"
except Exception as e:
checks["database"] = f"error: {e}"
# Redis connectivity
try:
await redis_client.ping()
checks["redis"] = "ok"
except Exception:
checks["redis"] = "error"
# Data freshness
tle_age = await get_oldest_active_tle_age_hours()
sw_age = await get_space_weather_age_hours()
eop_age = await get_eop_age_days()
airac_age = await get_airspace_airac_age_days()
checks["tle_age_hours"] = tle_age
checks["space_weather_age_hours"] = sw_age
checks["eop_age_days"] = eop_age
checks["airac_age_days"] = airac_age
degraded = []
if checks["database"] != "ok" or checks["redis"] != "ok":
return JSONResponse(status_code=503, content={"status": "unavailable", "checks": checks})
if tle_age > 6:
degraded.append("tle_stale")
if sw_age > 4:
degraded.append("space_weather_stale")
if eop_age > 7:
degraded.append("eop_stale") # IERS-A older than 7 days; frame transform accuracy degraded
if airac_age > 28:
degraded.append("airspace_stale") # AIRAC cycle missed
status_code = 207 if degraded else 200
return JSONResponse(status_code=status_code, content={
"status": "degraded" if degraded else "ok",
"degraded": degraded, "checks": checks
})
The 207 Degraded response triggers the staleness banner in the UI (§24.8) without taking the service offline. The load balancer treats 207 as healthy (traffic continues); the operational banner warns users.
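A client of the health endpoint (load-balancer adapter, API polling integration) might map the response to an internal state like this — a sketch of the contract described above, with the helper name assumed:

```python
def classify_readyz(status_code: int, body: dict) -> str:
    """Map a GET /readyz response to a client-side state.
    207 is 'healthy but degraded': keep routing traffic, surface the banner."""
    if status_code == 503:
        return "unavailable"
    if status_code == 207:
        return "degraded:" + ",".join(body.get("degraded", []))
    return "ok"

print(classify_readyz(207, {"status": "degraded", "degraded": ["tle_stale"]}))
# degraded:tle_stale
```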
Renderer service health check — the renderer container runs Playwright/Chromium. If Chromium hangs (a known Playwright failure mode), the container process stays alive and appears healthy while all report generation jobs silently time out. The renderer GET /healthz must verify Chromium can respond, not just that the Python process is alive:
# renderer/app/health.py
import asyncio
from playwright.async_api import async_playwright
from fastapi.responses import JSONResponse
async def health_check():
"""Liveness probe: verify Chromium can launch and load a blank page within 5s."""
try:
async with async_playwright() as p:
browser = await asyncio.wait_for(p.chromium.launch(), timeout=5.0)
page = await browser.new_page()
await asyncio.wait_for(page.goto("about:blank"), timeout=3.0)
await browser.close()
return {"status": "ok", "chromium": "responsive"}
except asyncio.TimeoutError:
renderer_chromium_restarts.inc()
return JSONResponse({"status": "chromium_unresponsive"}, status_code=503)
Docker Compose healthcheck for renderer:
renderer:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8001/healthz"]
interval: 30s
timeout: 10s
retries: 3
start_period: 15s
If the healthcheck fails 3 times consecutively, Docker restarts the renderer container. The renderer_chromium_restarts_total counter increments on each restart and triggers the RendererChromiumUnresponsive alert.
Degraded state in GET /readyz for API clients and SWIM (Finding 7): The degraded array in the response is the machine-readable signal for any automated integration (Phase 3 SWIM, API polling clients). API clients must not scrape the UI to determine system state — the health endpoint is the authoritative source. Response fields:
| Field | Type | Meaning |
|---|---|---|
| status | "ok" \| "degraded" \| "unavailable" | Overall system state |
| degraded | string[] | Active degradation reasons: "tle_stale", "space_weather_stale", "ingest_source_failure", "prediction_service_overloaded" |
| degraded_since | ISO8601 \| null | Timestamp when the current degraded state began (from degraded_mode_events) |
| checks | object | Per-subsystem check results |
Every transition into or out of degraded state is written to degraded_mode_events (see §9.2). NOTAM drafts generated while status = "degraded" have generated_during_degraded = TRUE and the draft (E) field includes: NOTE: GENERATED DURING DEGRADED DATA STATE - VERIFY INDEPENDENTLY BEFORE ISSUANCE.
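The E-field annotation could be applied at draft-generation time; a minimal sketch (function name and draft shape assumed, the note text as specified above):

```python
DEGRADED_NOTE = ("NOTE: GENERATED DURING DEGRADED DATA STATE - "
                 "VERIFY INDEPENDENTLY BEFORE ISSUANCE")

def finalise_notam_e_field(e_field: str, system_degraded: bool) -> tuple[str, bool]:
    """Append the mandatory degraded-state note to the E) field and return
    (text, generated_during_degraded) for persistence on the draft row."""
    if system_degraded:
        return f"{e_field}\n{DEGRADED_NOTE}", True
    return e_field, False

text, flag = finalise_notam_e_field("DEBRIS RE-ENTRY HAZARD WI AREA ...", True)
print(flag)  # True
```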
Docker Compose health check definitions:
backend:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
interval: 10s
timeout: 5s
retries: 3
start_period: 30s
db:
healthcheck:
# pg_isready alone passes before the spacecom database and TimescaleDB extension are loaded.
# This check verifies that the application database is accessible and TimescaleDB is active
# before any dependent service (pgbouncer, backend) is marked healthy.
test: ["CMD-SHELL", "psql -U spacecom_app -d spacecom -c 'SELECT 1 FROM timescaledb_information.hypertables LIMIT 1'"]
interval: 5s
timeout: 3s
retries: 10
start_period: 30s # TimescaleDB extension load and initial setup can take up to 20s
pgbouncer:
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "psql -h localhost -p 5432 -U spacecom_app -d spacecom -c 'SELECT 1'"]
interval: 5s
timeout: 3s
retries: 5
26.6 Backup and Restore
Continuous WAL Archiving (RPO = 0 for critical tables)
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'mc cp %p minio/wal-archive/$(hostname)/%f' # MinIO via mc client
archive_timeout = 60 # Force WAL segment every 60s even if no writes
Daily Base Backup
pg_basebackup is a PostgreSQL client tool that is not present in the Python runtime worker image. The backup must run in a dedicated sidecar container that has PostgreSQL client tools installed, invoked by the Celery Beat task via docker compose run:
# docker-compose.yml — backup sidecar (no persistent service; run on demand)
services:
db-backup:
image: timescale/timescaledb-ha:pg17 # same image as the db service; includes pg_basebackup
entrypoint: []
command: >
sh -c "pg_basebackup -h db -U postgres -D /backup
--format=tar --compress=9 --wal-method=stream &&
mc cp /backup/*.tar.gz minio/db-backups/base-$(date +%F)/"
networks: [db_net]
volumes:
- backup_scratch:/backup
profiles: [backup] # not started by default; invoked explicitly
environment:
PGPASSWORD: ${POSTGRES_PASSWORD}
MC_HOST_minio: http://${MINIO_ACCESS_KEY}:${MINIO_SECRET_KEY}@minio:9000
volumes:
backup_scratch:
driver: local
driver_opts:
type: tmpfs
device: tmpfs
o: size=20g # large enough for compressed base backup
The Celery Beat task triggers the sidecar via the Docker socket (backend container must have /var/run/docker.sock mounted in development — not in production). In production (Tier 2+), use a dedicated cron job on the host:
# /etc/cron.d/spacecom-backup — runs outside Docker, uses Docker CLI
0 2 * * * root docker compose -f /opt/spacecom/docker-compose.yml \
--profile backup run --rm db-backup >> /var/log/spacecom-backup.log 2>&1
The Celery Beat task in production polls MinIO for today's backup object to verify completion, and fires an alert if it is absent by 03:00 UTC:
# Celery Beat: daily at 03:00 UTC (verification, not execution)
@celery.task
def verify_daily_backup():
    """Verify today's base backup exists in MinIO; alert if absent."""
    # The sidecar writes under db-backups/base-YYYY-MM-DD/..., so list by prefix;
    # stat_object needs an exact object key (and the key must not repeat the bucket name).
    prefix = f"base-{utcnow().date()}/"
    found = next(minio_client.list_objects("db-backups", prefix=prefix, recursive=True), None)
    if found is not None:
        structlog.get_logger().info("backup_verified", prefix=prefix)
    else:
        structlog.get_logger().error("backup_missing", prefix=prefix)
        alert_admin(f"Daily base backup missing: {prefix}")
        raise RuntimeError(f"backup missing: {prefix}")  # marks task as FAILED in Celery result backend
Monthly Restore Test
# Celery Beat: first Sunday of each month at 03:00 UTC
@celery.task
def monthly_restore_test():
"""Restore latest backup to ephemeral container; run test suite; alert on failure."""
# 1. Spin up a test TimescaleDB container from latest base backup + WAL
# 2. Run db/test_restore.py: verify row counts, hypertable integrity, HMAC spot-checks
# 3. Tear down container
# 4. Log result to security_logs; alert admin if test fails
If the monthly restore test fails, the failure is treated as SEV-2. The incident is not resolved until a successful restore is verified.
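The verification logic in db/test_restore.py (step 2 of the restore test) might reduce to a pure-function core like the following — a hedged sketch: check_restore, hmac_for_record, and the canonical field serialisation are illustrative names, not the plan's actual HMAC scheme.

```python
import hashlib
import hmac

def hmac_for_record(key: bytes, record: dict) -> str:
    """Recompute an integrity HMAC over a record's fields in a canonical order.
    (Illustrative serialisation — the real scheme is defined by the prediction-signing code.)"""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def check_restore(source_counts: dict, restored_counts: dict,
                  key: bytes, spot_records: list, stored_macs: list) -> list:
    """Return a list of failure descriptions; an empty list means the restore passed."""
    failures = []
    # Row-count parity between the live database and the restored instance
    for table, n in source_counts.items():
        if restored_counts.get(table) != n:
            failures.append(f"row count mismatch: {table}")
    # HMAC spot-checks on a sample of restored prediction records
    for rec, mac in zip(spot_records, stored_macs):
        if not hmac.compare_digest(hmac_for_record(key, rec), mac):
            failures.append(f"HMAC mismatch: prediction {rec.get('id')}")
    return failures
```

A failing list maps directly onto the SEV-2 rule above: any non-empty result keeps the incident open until a clean restore is verified.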
WAL retention: 30 days of WAL segments retained in MinIO; base backups retained for 90 days; reentry_predictions, alert_events, notam_drafts, security_logs additionally archived to cold storage for 7 years (MinIO lifecycle policy, separate bucket with Object Lock COMPLIANCE mode — prevents deletion even by bucket owner).
Application log retention policy (F10 — §57):
| Log tier | Storage | Retention | Rationale |
|---|---|---|---|
| Container stdout (json-file) | Docker log driver on host | 7 days (max-size=100m, max-file=5) | Short-lived; Promtail ships to Loki in Tier 2+ |
| Loki (structured application logs) | Grafana Loki | 90 days | Covers 30-day incident investigation SLA with headroom |
| Safety-relevant log lines (level=CRITICAL, security_logs events, alert-related log lines) | MinIO append-only bucket | 7 years (same as database safety records) | Regulatory parity with alert_events 7-year hold; NIS2 Art. 23 evidence requirement |
| SIEM-forwarded events | External SIEM (customer-specified) | Per customer contract | ANSP customers may have their own retention obligations |
Loki retention is set in monitoring/loki-config.yml:
limits_config:
retention_period: 2160h # 90 days
compactor:
retention_enabled: true
Safety-relevant log shipping: a Promtail pipeline labels stage adds a safety_critical=true label when level=CRITICAL or the logger name contains alert or security. A separate Loki ruler rule ships these to MinIO via a Loki-to-S3 connector (Phase 2). Phase 1 interim: a Celery Beat task exports CRITICAL log lines from Loki to MinIO daily.
Restore time target: Full restore to latest WAL segment in < 30 minutes (tested monthly). This satisfies the RTO ≤ 60 minutes (no active event) with 30 minutes headroom for DNS propagation and smoke tests. Documented step-by-step in docs/runbooks/db-restore.md (Phase 2 deliverable).
Retention Schedule
-- Online retention (TimescaleDB compression + drop policies)
SELECT add_compression_policy('orbits', INTERVAL '7 days');
SELECT add_retention_policy('orbits', INTERVAL '90 days'); -- Archive before drop; see below
SELECT add_retention_policy('space_weather', INTERVAL '2 years');
SELECT add_retention_policy('tle_sets', INTERVAL '1 year');
-- Archival pipeline: Celery task runs before each chunk drop
-- Exports chunk to Parquet in MinIO cold storage before TimescaleDB drops it
-- Legal hold: reentry_predictions, alert_events, notam_drafts, shadow_validations → 7 years
-- No retention policy on these tables; MinIO lifecycle rule retains for 7 years
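The "archive before drop" ordering above reduces to a cutoff computation: export any chunk that will fall to the retention policy within some lead window. A hedged sketch — chunks_to_archive, the seven-day lead, and the retention map are illustrative, not the plan's actual Celery task:

```python
from datetime import datetime, timedelta, timezone

# Online retention windows from the TimescaleDB policies above (assumed mapping)
RETENTION = {
    "orbits": timedelta(days=90),
    "space_weather": timedelta(days=730),   # 2 years
    "tle_sets": timedelta(days=365),        # 1 year
}
ARCHIVE_LEAD = timedelta(days=7)  # export to Parquet this long before the drop policy fires

def chunks_to_archive(table: str, chunk_end_times: list, now=None) -> list:
    """Chunk end-times that must be exported now: inside the lead window before their drop."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - (RETENTION[table] - ARCHIVE_LEAD)
    return [t for t in chunk_end_times if t <= cutoff]
```

The exemption for the legal-hold tables falls out naturally: they simply have no entry in the retention map, so no drop (and no archival trigger) ever applies.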
26.7 Prometheus Metrics
Metrics must be instrumented from Phase 1 — not added at Phase 3 as an afterthought. Business-level metrics are more important than infrastructure metrics for this domain.
Metric naming convention (F1 — §57):
All custom metrics must follow {namespace}_{subsystem}_{name}_{unit} with these rules:
| Rule | Example compliant | Example non-compliant |
|---|---|---|
| Namespace is always spacecom_ | spacecom_ingest_success_total | ingest_success |
| Unit suffix required (Prometheus base units) | spacecom_simulation_duration_seconds | spacecom_simulation_duration |
| Counters end in _total | spacecom_hmac_verification_failures_total | spacecom_hmac_failures |
| Gauges end in _seconds, _bytes, _ratio, or domain unit | spacecom_celery_queue_depth | spacecom_queue |
| Histograms end in _seconds or _bytes | spacecom_alert_delivery_latency_seconds | spacecom_alert_latency |
| Labels use snake_case | queue_name, source | queueName, Source |
| High-cardinality fields are NEVER labels | — | norad_id, organisation_id, user_id, request_id as Prometheus labels |
| Per-object drill-down uses recording rules | spacecom:tle_age_hours:max recording rule | spacecom_tle_age_hours{norad_id="25544"} alerted directly |
High-cardinality identifiers belong in log fields (structlog) or Prometheus exemplars — not in metric labels. A metric with an unbounded label creates one time series per unique value and will OOM Prometheus at scale.
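The naming rules are mechanical enough for the CI step to enforce. A minimal sketch of such a validator — validate_metric is an illustrative name, and only a subset of the rules (namespace, counter suffix, label hygiene) is shown; the per-kind unit-suffix checks are omitted for brevity:

```python
import re

NAMESPACE = "spacecom_"
# High-cardinality fields that must never appear as Prometheus labels
FORBIDDEN_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}

def validate_metric(name: str, kind: str, labels: tuple = ()) -> list:
    """Return the naming-convention violations for one metric (empty list = compliant)."""
    errors = []
    if not name.startswith(NAMESPACE):
        errors.append("missing spacecom_ namespace")
    if kind == "counter" and not name.endswith("_total"):
        errors.append("counter must end in _total")
    for label in labels:
        if label in FORBIDDEN_LABELS:
            errors.append(f"high-cardinality label forbidden: {label}")
        elif not re.fullmatch(r"[a-z][a-z0-9_]*", label):
            errors.append(f"label not snake_case: {label}")
    return errors
```

Running it over the metric registry at import time (or in a unit test) turns a convention into a failing build rather than a review comment.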
Business-level metrics (custom — most critical):
# Phase 1 — instrument from day 1
from prometheus_client import Counter, Gauge, Histogram

active_tip_events = Gauge('spacecom_active_tip_events', 'Objects with active TIP messages')
prediction_age = Gauge('spacecom_prediction_age_seconds', 'Age of latest prediction per object',
['norad_id']) # per-object label: Grafana drill-down only; alert via recording rule
tle_age = Gauge('spacecom_tle_age_hours', 'TLE data age per object', ['norad_id'])
ingest_success = Counter('spacecom_ingest_success_total', 'Successful ingest runs', ['source'])
ingest_failure = Counter('spacecom_ingest_failure_total', 'Failed ingest runs', ['source'])
hmac_failures = Counter('spacecom_hmac_verification_failures_total', 'HMAC check failures')
simulation_duration = Histogram('spacecom_simulation_duration_seconds', 'MC run duration', ['module'],
buckets=[30, 60, 90, 120, 180, 240, 300, 600])
alert_delivery_lat = Histogram('spacecom_alert_delivery_latency_seconds', 'Alert trigger → WS receipt',
buckets=[1, 2, 5, 10, 15, 20, 30, 60])
ws_connected = Gauge('spacecom_ws_connected_clients', 'Active WebSocket connections', ['instance'])
celery_queue_depth = Gauge('spacecom_celery_queue_depth', 'Tasks waiting in queue', ['queue'])
dlq_depth = Gauge('spacecom_dlq_depth', 'Tasks in dead letter queue')
renderer_active_jobs = Gauge('renderer_active_jobs', 'Reports being generated')
renderer_job_dur = Histogram('renderer_job_duration_seconds', 'Report generation time',
buckets=[2, 5, 10, 15, 20, 25, 30])
renderer_chromium_restarts = Counter('renderer_chromium_restarts_total', 'Chromium process restarts')
SLI recording rules — pre-aggregate before alerting; avoids per-object flooding (Finding 1, 7):
# monitoring/recording-rules.yml
groups:
- name: spacecom_sli
rules:
# SLI: API availability (non-5xx fraction) — feeds availability SLO
- record: spacecom:api_availability:ratio_rate5m
expr: >
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# SLI: max TLE age across all objects (single series; alertable without flooding)
- record: spacecom:tle_age_hours:max
expr: max(spacecom_tle_age_hours)
# SLI: count of objects with stale TLEs (for dashboard)
- record: spacecom:tle_stale_objects:count
expr: count(spacecom_tle_age_hours > 6) or vector(0)
# SLI: max prediction age across active TIP objects
- record: spacecom:prediction_age_seconds:max
expr: max(spacecom_prediction_age_seconds)
# SLI: alert delivery latency p99
- record: spacecom:alert_delivery_latency:p99_rate5m
expr: histogram_quantile(0.99, rate(spacecom_alert_delivery_latency_seconds_bucket[5m]))
# Error budget burn rate — multi-window (F2 — §57)
- record: spacecom:error_budget_burn:rate1h
expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[1h])
- record: spacecom:error_budget_burn:rate6h
expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[6h])
# Fast-burn window (5 min) — catches sudden outages
- record: spacecom:error_budget_burn:rate5m
expr: 1 - spacecom:api_availability:ratio_rate5m
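The multipliers 14.4 and 6 in the alert thresholds below come from standard multi-window burn-rate arithmetic: for a 99.9% SLO the error budget is 0.1% of a 30-day window, and a burn rate of 14.4× sustained for one hour consumes 2% of that monthly budget. A worked check of the numbers used in the rules:

```python
SLO = 0.999
BUDGET = 1 - SLO            # 0.001 error budget over the 30-day SLO window
WINDOW_H = 30 * 24          # 720 hours in the window

def budget_fraction_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the 30-day error budget consumed by sustaining burn_rate x SLO budget for `hours`."""
    return burn_rate * hours / WINDOW_H

# Fast burn: 14.4x for 1 h consumes 2% of the monthly budget
assert abs(budget_fraction_consumed(14.4, 1) - 0.02) < 1e-9
# Slow burn: 6x for 6 h consumes 5% of the monthly budget
assert abs(budget_fraction_consumed(6, 6) - 0.05) < 1e-9

# The alert expressions compare the measured error ratio against burn_rate * BUDGET:
fast_threshold = 14.4 * BUDGET   # 0.0144 — the "14.4 * 0.001" in ErrorBudgetFastBurn
slow_threshold = 6 * BUDGET      # 0.006  — the "6 * 0.001" in ErrorBudgetSlowBurn
```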
Alerting rules (Prometheus AlertManager):
# monitoring/alertmanager/spacecom-rules.yml
groups:
- name: spacecom_critical
rules:
- alert: HmacVerificationFailure
expr: increase(spacecom_hmac_verification_failures_total[5m]) > 0
labels:
severity: critical
annotations:
summary: "HMAC verification failure detected — prediction integrity compromised"
runbook_url: "https://spacecom.internal/docs/runbooks/hmac-integrity-failure.md"
- alert: TipIngestStale
expr: spacecom_tle_age_hours{source="tip"} > 0.5
for: 5m
labels:
severity: critical
annotations:
summary: "TIP data > 30 min old — active re-entry warning may be stale"
runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"
- alert: ActiveTipNoPrediction
expr: spacecom_active_tip_events > 0 and spacecom:prediction_age_seconds:max > 3600
labels:
severity: critical
annotations:
summary: "Active TIP event but newest prediction is {{ $value | humanizeDuration }} old"
runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"
# Fast burn: 1h + 5min windows (catches sudden outages quickly) — F2 §57
- alert: ErrorBudgetFastBurn
expr: >
spacecom:error_budget_burn:rate1h > (14.4 * 0.001)
and
spacecom:error_budget_burn:rate5m > (14.4 * 0.001)
for: 2m
labels:
severity: critical
burn_window: fast
annotations:
summary: "Error budget burning fast — 1h burn rate {{ $value | humanizePercentage }}"
runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"
# Slow burn: 6h + 30min windows (catches gradual degradation before budget exhausts) — F2 §57
- alert: ErrorBudgetSlowBurn
expr: >
spacecom:error_budget_burn:rate6h > (6 * 0.001)
and
spacecom:error_budget_burn:rate1h > (6 * 0.001)
for: 15m
labels:
severity: warning
burn_window: slow
annotations:
summary: "Error budget burning slowly — 6h burn rate {{ $value | humanizePercentage }}"
runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"
- name: spacecom_warning
rules:
- alert: TleStale
# Alert on recording rule aggregate — single alert, not 600 per-NORAD alerts
expr: spacecom:tle_stale_objects:count > 0
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $value }} objects have TLE age > 6h"
runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"
- alert: IngestConsecutiveFailures
# increase() gives the failure count over the window, matching the ≥ 3 threshold;
# the raw counter stays non-zero forever after a single historical failure
expr: increase(spacecom_ingest_failure_total[15m]) >= 3
labels:
severity: warning
annotations:
summary: "Ingest source {{ $labels.source }} failed ≥ 3 times in 15 min"
runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"
- alert: CelerySimulationQueueDeep
expr: spacecom_celery_queue_depth{queue="simulation"} > 20
for: 5m
labels:
severity: warning
annotations:
summary: "Simulation queue depth {{ $value }} — workers may be overwhelmed"
runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
- alert: DLQGrowing
# delta(), not increase(): spacecom_dlq_depth is a gauge, and increase() is defined for counters
expr: delta(spacecom_dlq_depth[10m]) > 0
labels:
severity: warning
annotations:
summary: "Dead letter queue growing — tasks exhausting retries"
runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
- alert: WebSocketCeilingApproaching
expr: spacecom_ws_connected_clients > 400
labels:
severity: warning
annotations:
summary: "WS connections {{ $value }}/500 — scale backend before ceiling hit"
runbook_url: "https://spacecom.internal/docs/runbooks/capacity-limits.md"
# Queue depth growth rate alert — fires before threshold is breached (F8 — §57)
- alert: CelerySimulationQueueGrowing
# deriv(), not rate(): queue depth is a gauge; rate() is only valid on counters
expr: deriv(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Simulation queue growing at {{ $value | humanize }} tasks/sec — workers not keeping up"
runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"
- alert: RendererChromiumUnresponsive
expr: increase(renderer_chromium_restarts_total[5m]) > 0
labels:
severity: warning
annotations:
summary: "Renderer Chromium restarted — report generation may be delayed"
runbook_url: "https://spacecom.internal/docs/runbooks/renderer-recovery.md"
Alert authoring rule (F11 — §57): Every AlertManager alert rule MUST include annotations.runbook_url pointing to an existing file in docs/runbooks/. CI lint step (make lint-alerts) validates this using promtool check rules plus a custom Python script that asserts every rule has a non-empty runbook_url annotation that resolves to an existing markdown file. A PR that adds an alert without a runbook fails CI.
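The core of that custom lint script could look like this — a sketch operating on already-parsed rule groups (the real make lint-alerts step would load the YAML first; missing_runbooks is an illustrative name):

```python
from pathlib import Path

def missing_runbooks(groups: list, runbook_dir: Path) -> list:
    """Return alert names whose runbook_url annotation is absent or does not
    resolve to an existing markdown file in the runbook directory."""
    failures = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are exempt
            url = rule.get("annotations", {}).get("runbook_url", "")
            filename = url.rsplit("/", 1)[-1]
            if not url or not filename.endswith(".md") or not (runbook_dir / filename).is_file():
                failures.append(rule["alert"])
    return failures
```

CI then fails the PR when the returned list is non-empty, which is exactly the "alert without a runbook fails CI" contract.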
Alert coverage audit (F5 — §57): The following table maps every SLO and safety invariant to its alert rule. Gaps must be closed before Phase 2.
| SLO / Safety invariant | Alert rule | Severity | Gap? |
|---|---|---|---|
| API availability 99.9% | ErrorBudgetFastBurn, ErrorBudgetSlowBurn | CRITICAL / WARNING | Covered |
| TLE age < 6h | TleStale | WARNING | Covered |
| TIP ingest freshness < 30 min | TipIngestStale | CRITICAL | Covered |
| Active TIP + prediction age > 1h | ActiveTipNoPrediction | CRITICAL | Covered |
| HMAC verification integrity | HmacVerificationFailure | CRITICAL | Covered |
| Ingest consecutive failures | IngestConsecutiveFailures | WARNING | Covered |
| Celery queue depth threshold | CelerySimulationQueueDeep | WARNING | Covered |
| Celery queue depth growth rate | CelerySimulationQueueGrowing | WARNING | Covered |
| DLQ depth > 0 | DLQGrowing | WARNING | Covered |
| WS connection ceiling approach | WebSocketCeilingApproaching | WARNING | Covered |
| Renderer Chromium crash | RendererChromiumUnresponsive | WARNING | Covered |
| EOP mirror disagreement | EopMirrorDisagreement | CRITICAL | Gap — add Phase 1 |
| DB replication lag > 30s | DbReplicationLagHigh | WARNING | Gap — add Phase 2 |
| Backup job failure | BackupJobFailed | CRITICAL | Gap — add Phase 1 |
| Security event anomaly | In security-rules.yml | CRITICAL | Covered |
| Alert HMAC integrity (nightly) | In security-rules.yml | CRITICAL | Covered |
Prometheus scrape configuration (monitoring/prometheus.yml):
scrape_configs:
- job_name: backend
static_configs:
- targets: ['backend:8000']
metrics_path: /metrics # enabled by prometheus-fastapi-instrumentator
- job_name: renderer
static_configs:
- targets: ['renderer:8001']
metrics_path: /metrics
- job_name: celery
static_configs:
- targets: ['celery-exporter:9808'] # celery-exporter sidecar
- job_name: postgres
static_configs:
- targets: ['postgres-exporter:9187'] # postgres_exporter; also scrapes PgBouncer stats
- job_name: redis
static_configs:
- targets: ['redis-exporter:9121'] # redis_exporter
Add to docker-compose.yml (Phase 2 service topology): postgres-exporter, redis-exporter, celery-exporter sidecar, loki, promtail, tempo (all on monitor_net). Add to requirements.in: prometheus-fastapi-instrumentator, structlog, opentelemetry-sdk, opentelemetry-instrumentation-fastapi, opentelemetry-instrumentation-sqlalchemy, opentelemetry-instrumentation-celery.
Distributed tracing — OpenTelemetry (Phase 2, ADR 0017):
# backend/app/main.py — instrument at startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.celery import CeleryInstrumentor

provider = TracerProvider()
# Exporters attach via a span processor; TracerProvider has no add_span_exporter method
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")))
trace.set_tracer_provider(provider)
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
CeleryInstrumentor().instrument()
The trace_id from each span equals the request_id bound in structlog.contextvars (set by RequestIDMiddleware). This gives a single correlation key across Grafana Loki log search and Grafana Tempo trace view — one click from a log entry to its trace, and from a trace span to its log lines. Phase 1 fallback: set OTEL_SDK_DISABLED=true; spans emit to stdout only (no collector needed).
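The equality works because a UUIDv4 request_id and a W3C trace_id are both 128 bits wide, so the mapping is lossless. A sketch of the conversion (assuming a custom ID generator seeds the tracer from the middleware's request_id — an implementation choice, not something the OTel SDK does by default; function names are illustrative):

```python
import uuid

def trace_id_from_request_id(request_id: str) -> int:
    """Map a UUID request_id to a 128-bit OTel trace_id (same width, lossless)."""
    return uuid.UUID(request_id).int

def trace_id_hex(trace_id: int) -> str:
    """W3C traceparent encodes the trace_id as 32 lowercase hex characters."""
    return format(trace_id, "032x")

# The hex form searched in Tempo is exactly the request_id with hyphens removed
rid = "a1b2c3d4-0000-4000-8000-000000000001"
assert trace_id_hex(trace_id_from_request_id(rid)) == rid.replace("-", "")
```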
Celery trace propagation (F4 — §57): CeleryInstrumentor automatically propagates W3C traceparent headers through the Celery task message body. The trace started at POST /api/v1/decay/predict continues unbroken through the queue wait and into the worker execution. To verify propagation is working:
# tests/integration/test_tracing.py
def test_celery_trace_propagation():
    """Trace started in HTTP handler must appear in Celery worker span."""
    response = client.post("/api/v1/decay/predict", ...)
    task_id = response.json()["job_id"]
    # Poll until the task completes, then assert the worker span carries the request's trace_id
    span = get_span_by_task_id(task_id)
    assert span.context.trace_id == uuid.UUID(response.headers["X-Request-ID"]).int
Additionally, request_id must be passed explicitly in Celery task kwargs as a belt-and-suspenders fallback for Phase 1 when OTel is disabled (OTEL_SDK_DISABLED=true). The worker binds it via structlog.contextvars.bind_contextvars(request_id=kwargs["request_id"]). This ensures log correlation works in Phase 1 without a running Tempo instance.
Chord sub-task and callback trace propagation (F11 — §67): CeleryInstrumentor propagates traceparent through individual task messages. For the MC chord pattern (group → chord → callback), trace context propagation must flow: FastAPI handler → run_mc_decay_prediction → 500× run_single_trajectory sub-tasks → aggregate_mc_results callback. Each hop in the chord must carry the same trace_id to enable end-to-end p95 latency attribution.
CeleryInstrumentor handles single task propagation automatically. For chord callbacks, verify that the parent trace_id appears in the aggregate_mc_results span — if the span is orphaned (different trace_id), set the trace context explicitly in the chord header:
from opentelemetry import propagate, context
def run_mc_decay_prediction(object_id: int, params: dict) -> str:
carrier = {}
propagate.inject(carrier) # inject current trace context
params['_trace_context'] = carrier # pass through chord params
...
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
ctx = propagate.extract(params.get('_trace_context', {}))
token = context.attach(ctx) # re-attach parent trace context in callback
try:
... # callback body
finally:
context.detach(token)
This ensures the Tempo waterfall for an MC prediction shows one continuous trace from HTTP request through all 500 sub-tasks to DB write, enabling per-prediction p95 breakdown.
Celery queue depth Beat task (updates celery_queue_depth and dlq_depth every 30s):
@app.task
def update_queue_depth_metrics():
for queue_name in ['ingest', 'simulation', 'default']:
    # Celery's Redis broker stores each queue as a plain list keyed by the queue name
    depth = redis_client.llen(queue_name)
    celery_queue_depth.labels(queue=queue_name).set(depth)
dlq_depth.set(redis_client.llen('dlq:failed_tasks'))
Four Grafana dashboards (updated from three):
- Operational Overview — primary on-call dashboard (F7 — §57): an on-call engineer must be able to answer "is the system healthy?" within 15 seconds of opening this dashboard. Panel order and layout is therefore mandated:

| Row | Panel | Metric | Alert threshold shown |
|---|---|---|---|
| 1 (top) | Active TIP events (stat) | spacecom_active_tip_events | Red if > 0 |
| 1 | System status (state timeline) | All alert rule states | Any CRITICAL = red bar |
| 2 | Ingest freshness per source (gauge) | spacecom_tle_age_hours per source | Yellow > 2h, Red > 6h |
| 2 | Prediction age — active objects (gauge) | spacecom:prediction_age_seconds:max | Red > 3600s |
| 3 | Error budget burn rate (time series) | spacecom:error_budget_burn:rate1h | Reference line at 14.4× |
| 3 | Alert delivery latency p99 (stat) | spacecom:alert_delivery_latency:p99_rate5m | Red > 30s |
| 4 | Celery queue depth (time series) | spacecom_celery_queue_depth per queue | Reference line at 20 |
| 4 | DLQ depth (stat) | spacecom_dlq_depth | Red if > 0 |

Rows 1–2 must be visible without scrolling on a 1080p monitor. The dashboard UID is pinned in the AlertManager dashboard_url annotations.
- System Health: DB replication lag, Redis memory, container CPU/RAM, error rates by endpoint, renderer job duration
- SLO Burn Rate: error budget consumption rate from recording rules, fast/slow burn rates, availability by SLO, latency percentiles vs. targets, WS delivery latency p99
- Tracing (Phase 2, Grafana Tempo): per-request traces for decay prediction and CZML catalog; p95 span breakdown by service
26.8 Incident Response
On-Call Rotation and Escalation
| Tier | Responder | Response SLA | Escalation trigger |
|---|---|---|---|
| L1 On-call | Rotating engineer (weekly rotation) | 5 min (SEV-1) / 15 min (SEV-2) | Auto-escalate to L2 if no acknowledgement after SLA |
| L2 Escalation | Tech lead / senior engineer | 10 min (SEV-1) | Auto-escalate to L3 after 10 min |
| L3 Incident commander | Engineering or product lead | SEV-1 only | Manual phone call; no auto-escalation |
AlertManager routing:
# monitoring/alertmanager/routing.yml
route:
receiver: slack-ops-channel
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match: {severity: critical}
receiver: pagerduty-l1
continue: true # also send to Slack
- match: {severity: warning}
receiver: slack-ops-channel
On-call guide: docs/runbooks/on-call-guide.md — required Phase 2 deliverable. Must cover: rotation schedule, handover checklist, escalation contact list, how to acknowledge PagerDuty alerts, Grafana dashboard URLs, and the "active TIP event protocol" (escalate all SEV-2+ to SEV-1 automatically when spacecom_active_tip_events > 0).
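The "active TIP event protocol" is mechanical enough to encode in the paging logic. A sketch — effective_severity and RESPONSE_SLA_MIN are illustrative names, not existing code:

```python
# Response SLAs in minutes, per the severity table (SEV-4 is "next business day", handled separately)
RESPONSE_SLA_MIN = {1: 5, 2: 15, 3: 60}

def effective_severity(reported_sev: int, active_tip_events: int) -> int:
    """Active TIP event protocol: while spacecom_active_tip_events > 0,
    a SEV-2 incident is acknowledged and staffed as SEV-1."""
    if active_tip_events > 0 and reported_sev == 2:
        return 1
    return reported_sev
```

Encoding the rule in the paging path (rather than the on-call guide alone) means the escalation happens even when the responder has not yet read the guide.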
On-call rotation spec (F5):
- 7-day rotation; minimum 2 engineers in the pool before going on-call
- L1 → L2 escalation if incident not contained within 30 minutes of L1 acknowledgement
- L2 → L3 escalation triggers: ANSP data affected; confirmed security breach; total outage > 15 minutes; regulatory notification obligation triggered (NIS2 24h, GDPR 72h)
- On-call handoff: at the rotation boundary, the outgoing on-call documents system state in docs/runbooks/on-call-handoff-log.md: active incidents, degraded services, pending maintenance, known risks. The incoming on-call acknowledges in the same log. This mirrors the operator handover concept (§28.5a), applied to engineering shifts.
ANSP communication commitments per severity (F6):
| Severity | ANSP notification timing | Channel | Update cadence |
|---|---|---|---|
| SEV-1 (active TIP event) | Within 5 minutes of detection | Push + email | Every 15 minutes until resolved |
| SEV-1 (no active event) | Within 15 minutes | | Every 30 minutes until resolved |
| SEV-2 | Within 30 minutes if prediction data affected | | On resolution |
| SEV-3/4 | Status page update only | Status page | On resolution |
Resolution notification always includes: what was affected, duration, root cause summary (1 sentence), and confirmation that prediction integrity was verified post-incident.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | System unavailable or prediction integrity compromised during active TIP event | 5 minutes | DB down with TIP window open; HMAC failure on active prediction |
| SEV-2 | Core functionality broken; no active TIP event | 15 minutes | Workers down; ingest stopped > 2h; Redis down |
| SEV-3 | Degraded functionality; operational but impaired | 60 minutes | TLE stale > 6h; space weather stale; slow CZML > 5s p95 |
| SEV-4 | Minor; no operational impact | Next business day | UI cosmetic; log noise; non-critical test failure |
Runbook Standard Structure (F9)
Every runbook in docs/runbooks/ must follow this template. Inconsistent runbooks written under incident pressure are a leading cause of missed steps and extended resolution times.
# Runbook: {Title}
**Owner:** {team or role}
**Last tested:** {YYYY-MM-DD} (game day or real incident)
**Severity scope:** SEV-1 | SEV-2 | SEV-3 (as applicable)
## Triggers
<!-- What conditions cause this runbook to be invoked? Alert name, symptom, or explicit escalation. -->
## Immediate actions (first 5 minutes)
<!-- Numbered steps. Each step must be independently executable. No "investigate" — specific commands only. -->
1.
2.
## Diagnosis
<!-- How to confirm the root cause before taking corrective action. -->
## Resolution steps
<!-- Numbered. Each step: what to do, expected output, what to do if the expected output is NOT seen. -->
1.
2.
## Verification
<!-- How to confirm the incident is resolved. Specific health check commands or metrics to inspect. -->
## Escalation
<!-- If unresolved after N minutes: who to page, what information to have ready. -->
## Post-incident
<!-- Mandatory PIR? Log entry required? Notification required? -->
All runbooks are reviewed and updated after each game day or real incident in which they were used. The Last tested field must not be older than 12 months — a CI check (make runbook-audit) warns if any runbook has not been updated within that window.
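The make runbook-audit check reduces to parsing the Last tested header out of each runbook. A sketch — stale_runbook is an illustrative name for the core predicate:

```python
import re
from datetime import date, timedelta

# Matches the template's "**Last tested:** YYYY-MM-DD" header line
LAST_TESTED = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")

def stale_runbook(markdown: str, today: date, max_age_days: int = 365) -> bool:
    """True if the runbook's Last tested date is missing or older than the audit window."""
    m = LAST_TESTED.search(markdown)
    if not m:
        return True  # a runbook without the header fails the audit outright
    tested = date.fromisoformat(m.group(1))
    return today - tested > timedelta(days=max_age_days)
```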
Required Runbooks (Phase 2 deliverable)
Each runbook is a step-by-step operational procedure, not a general guide:
| Runbook | Key Steps |
|---|---|
| DB failover | Confirm primary down → Patroni status → manual failover if Patroni stuck → verify standby promoting → update connection strings → verify HMAC validation working on new primary |
| Celery worker recovery | Check queue depth → inspect dead letter queue → restart worker containers → verify simulation jobs resuming → check ingest worker catching up |
| HMAC integrity failure | Identify affected prediction ID → quarantine record (integrity_failed = TRUE) → notify affected ANSP users → investigate modification source → escalate to security incident if tampering confirmed |
| TIP ingest failure | Check Space-Track API status → verify credentials not expired → check outbound network → manual TIP fetch if automated ingest blocked → notify operators of manual TIP status |
| Ingest pipeline staleness | Check Celery Beat health (redbeat lock status) → check worker queue → inspect ingest failure counter in Prometheus → trigger manual ingest job → notify operators of staleness |
| GDPR personal data breach | Contain breach (revoke credentials, isolate affected service) → assess scope (which data, how many data subjects, which jurisdictions) → notify legal counsel within 4 hours → if EU/UK data subjects affected: notify supervisory authority within 72 hours of discovery; notify affected data subjects "without undue delay" if high risk → log in security_logs with type DATA_BREACH → document remediation |
| Safety occurrence notification | If a SpaceCom integrity failure (HMAC fail, data source outage, incorrect prediction) is identified during a period when an ANSP was actively managing a re-entry event: notify affected ANSP within 2 hours → create security_logs record with type SAFETY_OCCURRENCE → notify legal counsel before any external communications → preserve all prediction records, alert_events, and ingest logs from the relevant period (do not rotate or archive). Full procedure: docs/runbooks/safety-occurrence.md — see §26.8a below. |
| Prediction service outage during active re-entry event (F3) | Detect via spacecom_active_tip_events > 0 + prediction API health check fail → immediate ANSP push notification + email within 5 minutes ("SpaceCom prediction service is unavailable. Activate your fallback procedure: consult Space-Track TIP messages directly and ESOC re-entry page.") → designate incident commander → communication cadence every 15 minutes until resolved → service restoration checklist: restore prediction API → verify HMAC integrity on latest predictions → notify ANSPs of restoration with prediction freshness timestamp → trigger PIR. Full procedure: docs/runbooks/prediction-service-outage-during-active-event.md |
§26.8a Safety Occurrence Reporting Procedure (F4 — §61)
A safety occurrence is any event or condition in which a SpaceCom error may have contributed to, or could have contributed to, a reduction in aviation safety. This is distinct from an operational incident (which is defined by system availability/performance). Safety occurrences require a different response chain that includes regulatory and legal notification.
Trigger conditions:
- HMAC integrity failure on any prediction that was served to an ANSP operator during an active TIP event
- A confirmed incorrect prediction (false positive or false negative) where the ANSP was managing airspace based on SpaceCom outputs
- Data staleness in excess of the operational threshold (TLE > 6h old) during an active re-entry event window without degradation notification having been sent
- Any SpaceCom system failure during which an ANSP continued operational use without receiving a degradation notification
Response procedure (docs/runbooks/safety-occurrence.md):
| Step | Action | Owner | Timing |
|---|---|---|---|
| 1 | Detect and classify: confirm the occurrence meets trigger criteria; assign SAFETY_OCCURRENCE vs. standard incident | On-call engineer | Within 30 min of detection |
| 2 | Preserve evidence: set do_not_archive = TRUE on all affected prediction records, alert_events, and ingest logs; export to MinIO safety archive | On-call engineer | Within 1 hour |
| 3 | Internal escalation: notify incident commander + legal counsel; do NOT communicate externally until legal counsel is engaged | Incident commander | Within 1 hour |
| 4 | ANSP notification: contact affected ANSP primary contact and safety manager using the safety occurrence notification template (not the standard incident template); include what happened, what data was affected, what the ANSP should do in response | Incident commander + legal counsel review | Within 2 hours |
| 5 | Log: create security_logs record with type = 'SAFETY_OCCURRENCE'; include ANSP ID, affected prediction IDs, notification timestamp, and legal counsel name | On-call engineer | Same session |
| 6 | ANSP SMS obligation: inform the ANSP in writing that they may have an obligation to report this occurrence to their safety regulator under their SMS; SpaceCom cannot make this determination for the ANSP | Legal counsel | Within 24 hours |
| 7 | PIR: conduct a safety-occurrence-specific post-incident review (same structure as §26.8 PIR but with additional sections: regulatory notification status, hazard log update required?) | Engineering lead | Within 5 business days |
| 8 | Hazard log update: if the occurrence reveals a new hazard or changes the likelihood/severity of an existing hazard, update docs/safety/HAZARD_LOG.md and trigger a safety case review | Safety case custodian | Within 10 business days |
Safety occurrence log table:
-- Add to security_logs or create a dedicated table
CREATE TABLE safety_occurrences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
occurred_at TIMESTAMPTZ NOT NULL,
detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
org_ids UUID[] NOT NULL, -- affected ANSPs
trigger_type TEXT NOT NULL, -- 'HMAC_FAILURE', 'INCORRECT_PREDICTION', 'STALE_DATA', 'SILENT_FAILURE'
affected_predictions UUID[] NOT NULL DEFAULT '{}',
evidence_archived BOOLEAN NOT NULL DEFAULT FALSE,
ansp_notified_at TIMESTAMPTZ,
legal_notified_at TIMESTAMPTZ,
hazard_log_updated BOOLEAN NOT NULL DEFAULT FALSE,
pir_completed_at TIMESTAMPTZ,
notes TEXT
);
What is NOT a safety occurrence (to avoid over-classification):
- Standard availability incidents with degradation notification sent promptly
- Cosmetic UI errors not in the alert/prediction path
- Prediction updates that change values within stated uncertainty bounds
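The classification rules above can be expressed as a small triage helper. This is an illustrative sketch only: the function name, parameters, and branching are assumptions for clarity, not part of the SpaceCom codebase.

```python
# Hypothetical triage helper mirroring the safety-occurrence classification rules.
SAFETY_TRIGGER_TYPES = {"HMAC_FAILURE", "INCORRECT_PREDICTION", "STALE_DATA", "SILENT_FAILURE"}

def classify_occurrence(trigger_type: str,
                        in_alert_or_prediction_path: bool,
                        within_stated_uncertainty: bool = False) -> str:
    """Return 'SAFETY_OCCURRENCE' or 'STANDARD_INCIDENT'."""
    if not in_alert_or_prediction_path:
        # Cosmetic UI errors and availability-only incidents are not occurrences
        return "STANDARD_INCIDENT"
    if trigger_type == "INCORRECT_PREDICTION" and within_stated_uncertainty:
        # Prediction updates within stated uncertainty bounds are expected behaviour
        return "STANDARD_INCIDENT"
    if trigger_type in SAFETY_TRIGGER_TYPES:
        return "SAFETY_OCCURRENCE"
    return "STANDARD_INCIDENT"
```

A helper like this would sit in front of the `safety_occurrences` insert so that over-classification is caught at the point of logging rather than in the quarterly review.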
ANSP Communication Plan
When SpaceCom is degraded during an active TIP event, operators must be notified immediately through a defined channel:
- WebSocket push (if connected): automatic via the degraded-mode notification (§24.8)
- Email fallback: automated email to all `operator` role users with active sessions within the last 24h, identifying the degradation type and estimated resolution
- Documented fallback: every SpaceCom user onboarding includes the fallback procedure: "In the absence of SpaceCom, consult Space-Track TIP messages directly at space-track.org and coordinate with your national space surveillance authority per existing procedures"
Incident communication templates (F10): Pre-drafted templates in docs/runbooks/incident-comms-templates.md — reviewed by legal counsel before first use. On-call engineers must use these templates verbatim; deviations require incident commander approval. Templates cover:
- Initial notification (< 5 minutes): impact, what we know, what we are doing, next update time
- 15-minute update: progress, updated ETA if known, revised fallback guidance if needed
- Resolution notification: confirmed restoration, prediction integrity verified, brief root cause (one sentence), PIR date
- Post-incident summary (within 5 business days): full timeline, root cause, remediations implemented

What never appears in templates: speculation about cause before root cause confirmed; estimated recovery time until known with confidence; any admission of negligence or legal liability.
Post-Incident Review Process (F8)
Mandatory for all SEV-1 and SEV-2 incidents. PIR due within 5 business days of resolution.
PIR document structure (docs/post-incident-reviews/YYYY-MM-DD-{slug}.md):
- Incident summary — what happened, when, duration, severity
- Timeline — minute-by-minute from first alert to resolution
- Root cause — using 5-whys methodology; stop when a process or system gap is identified
- Contributing factors — what made the impact worse or detection slower
- Impact — users/ANSPs affected; data at risk; SLO breach duration
- Remediation actions — each with owner, GitHub issue link, and deadline; tracked with the `incident-remediation` label
- What went well — to reinforce effective practices
PIR presented at the next engineering all-hands. Remediation actions are P2 priority — no new feature work by the responsible engineer until overdue remediations are closed.
Chaos Engineering / Game Day Programme (F4)
Quarterly game day; scenarios rotated so each is tested at least annually. Document in docs/runbooks/game-day-scenarios.md.
Minimum scenario set:
| # | Scenario | Expected behaviour | Pass criterion |
|---|---|---|---|
| 1 | PostgreSQL primary killed | Patroni promotes standby; API recovers within RTO | API returns 200 within 15 minutes; no data loss |
| 2 | Celery worker crash during active MC simulation | Job moves to DLQ; orphan recovery task re-queues; operator sees FAILED state | Job visible in DLQ within 2 minutes; re-queue succeeds |
| 3 | Space-Track ingest unavailable 6 hours | Staleness degraded mode activates; operators notified; predictions greyed | Staleness alert fires within 15 minutes of ingest stop |
| 4 | Redis failure | Sessions expire gracefully; WebSocket reconnects; no silent data loss | Users see "session expired" prompt; no 500 errors |
| 5 | Full prediction service restart during active CRITICAL alert | Alert state preserved in DB; re-subscribing WebSocket clients receive current state | No alert acknowledgement lost; reconnection < 30 seconds |
| 6 | Full region failover (annually) | DNS fails over to DR region; prediction API resumes | Recovery within RTO; HMAC verification passes on new primary |
Each scenario: defined inject → observe → record actual behaviour → pass/fail vs. criterion → remediation window 2 weeks. Any scenario fail is treated as a SEV-2 incident with a PIR.
Operational vs. Security Incident Runbooks (F11)
Operational and security incidents have different response teams, communication obligations, and legal constraints:
| Dimension | Operational incident | Security incident |
|---|---|---|
| Primary responder | On-call engineer | On-call engineer + DPO within 4h |
| Communication | Status page + ANSP email | No public status page until legal counsel approves |
| Regulatory obligation | SLA breach notification (MSA) | NIS2 24h early warning; GDPR 72h (if personal data) |
| Evidence preservation | Normal log retention | Immediate log freeze; do not rotate or archive |
Separate runbooks:
- `docs/runbooks/operational-incident-response.md` — standard on-call playbook
- `docs/runbooks/security-incident-response.md` — invokes DPO, legal counsel, NIS2/GDPR timelines; references §29.6 notification obligations
26.9 Deployment Strategy
Zero-Downtime Deployment (Blue-Green)
The TLS-terminating Caddy instance routes between blue (current) and green (new) backend instances:
Client → Caddy → [Blue backend] (current)
→ [Green backend] (new — deployed but not yet receiving traffic)
Docker Compose implementation for Tier 2 (single-host):
Docker Compose service names are fixed, so blue and green run as two separate Compose project instances. The deploy script at scripts/blue-green-deploy.sh manages the cutover:
#!/usr/bin/env bash
# scripts/blue-green-deploy.sh
set -euo pipefail
NEW_IMAGE="${1:?Usage: blue-green-deploy.sh <image-tag>}"
COMPOSE_FILE="docker-compose.yml"
BLUE_PROJECT="spacecom-blue"
GREEN_PROJECT="spacecom-green"
# 1. Determine which colour is currently active
ACTIVE=$(cat /opt/spacecom/.active-colour 2>/dev/null || echo "blue")
if [[ "$ACTIVE" == "blue" ]]; then NEXT="green"; else NEXT="blue"; fi
NEXT_PROJECT="$( [[ $NEXT == green ]] && echo "$GREEN_PROJECT" || echo "$BLUE_PROJECT" )"
# 2. Start next-colour project with new image
SPACECOM_BACKEND_IMAGE="$NEW_IMAGE" \
  docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" up -d backend
# 3. Wait for next-colour healthcheck (retry for up to 60s rather than checking once)
for i in $(seq 1 30); do
  if docker compose -p "$NEXT_PROJECT" exec backend curl -sf http://localhost:8000/healthz; then
    break
  fi
  [[ "$i" -eq 30 ]] && { echo "Health check failed — aborting"; exit 1; }
  sleep 2
done
# 4. Run smoke tests against next-colour directly
SMOKE_TARGET="http://localhost:$( [[ $NEXT == green ]] && echo 8001 || echo 8000 )" \
  python scripts/smoke-test.py || { echo "Smoke tests failed — aborting"; exit 1; }
# 5. Shift Caddy upstream to next colour (atomic file swap + reload)
echo "{ \"upstream\": \"backend-$NEXT:8000\" }" > /opt/spacecom/caddy-upstream.json
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
echo "$NEXT" > /opt/spacecom/.active-colour
echo "✓ Traffic shifted to $NEXT. Monitoring for 5 minutes..."
sleep 300
# 6. Verify error rate via Prometheus (optional gate)
ERROR_RATE=$(curl -s "http://localhost:9090/api/v1/query?query=spacecom:api_availability:ratio_rate5m" \
| jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE < 0.99" | bc -l) )); then
echo "Error rate $ERROR_RATE < 0.99 — rolling back"
# Swap back to active colour
echo "{ \"upstream\": \"backend-$ACTIVE:8000\" }" > /opt/spacecom/caddy-upstream.json
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
echo "$ACTIVE" > /opt/spacecom/.active-colour
exit 1
fi
# 7. Decommission old colour
OLD_PROJECT="$( [[ $ACTIVE == blue ]] && echo "$BLUE_PROJECT" || echo "$GREEN_PROJECT" )"
docker compose -p "$OLD_PROJECT" stop backend
docker compose -p "$OLD_PROJECT" rm -f backend
echo "✓ Blue-green deploy complete. Active: $NEXT"
Caddy upstream configuration — Caddy reads a JSON file that the deploy script rewrites atomically:
# /etc/caddy/Caddyfile
reverse_proxy {
dynamic file /opt/spacecom/caddy-upstream.json
lb_policy first
health_uri /healthz
health_interval 5s
}
WebSocket long-lived connection timeout configuration (F11 — §63): HTTP reverse proxies have default idle timeouts that silently terminate long-lived WebSocket connections. Caddy's default idle timeout for HTTP/2 connections is governed by idle_timeout (default: 5 minutes). Many cloud load balancers default to 60 seconds. A WebSocket with no traffic for this period is silently closed by the proxy — the FastAPI server and client may not detect this for minutes, creating a "ghost connection" that is alive at the socket level but dead at the application level.
Required Caddyfile additions for WebSocket paths:
# /etc/caddy/Caddyfile
{
servers {
timeouts {
idle_timeout 0 # disable idle timeout globally — WS connections can be silent for extended periods
}
}
}
spacecom.io {
# WebSocket endpoints: no idle timeout, no read timeout
@websockets {
path /ws/*
header Connection *Upgrade*
header Upgrade websocket
}
handle @websockets {
reverse_proxy backend:8000 {
transport http {
read_timeout 0 # no read timeout — WS connection can be idle
write_timeout 0 # no write timeout — WS send can be slow on poor networks
}
flush_interval -1 # immediate flush; do not buffer WS frames
}
}
# Non-WebSocket paths: retain normal timeouts
handle {
reverse_proxy backend:8000 {
transport http {
read_timeout 30s
write_timeout 30s
}
}
}
}
Ping-pong interval must be less than proxy idle timeout: The FastAPI WebSocket handler sends a ping every WS_PING_INTERVAL_SECONDS (default: 30s). With idle_timeout 0 in Caddy, this prevents proxy-side termination. If running behind a cloud load balancer with a fixed idle timeout, the ping interval must be set to (load_balancer_idle_timeout - 10s) — documented in docs/runbooks/websocket-proxy-config.md.
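The interval rule above can be captured as a small helper. Sketch only: the function name and signature are assumptions for illustration, not the actual SpaceCom configuration code.

```python
from typing import Optional

def effective_ping_interval(lb_idle_timeout_s: Optional[float],
                            default_s: float = 30.0) -> float:
    """Return a WS ping interval that undercuts any proxy idle timeout by 10 s.

    lb_idle_timeout_s=None models Caddy with `idle_timeout 0` (no proxy-side
    limit), so the default WS_PING_INTERVAL_SECONDS value is used unchanged.
    """
    if lb_idle_timeout_s is None:
        return default_s
    # Never exceed (idle_timeout - 10 s); floor at 1 s as a sanity bound
    return min(default_s, max(lb_idle_timeout_s - 10.0, 1.0))
```

For example, behind a load balancer with a 25 s idle timeout the ping interval drops to 15 s; behind Caddy with `idle_timeout 0` it stays at the 30 s default.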
Rollback: scripts/blue-green-rollback.sh — resets /opt/spacecom/caddy-upstream.json to the previous colour and reloads Caddy. Rollback completes in < 5 seconds (no container restart required).
Deployment sequence:
- Deploy green backend alongside blue (both running)
- Run smoke tests against green directly (`X-Deploy-Target: green` header)
- Shift 10% of traffic to green (canary); monitor error rate for 5 minutes
- If clean: shift 100% to green; keep blue running for 10 minutes
- If error spike: shift 0% back to blue instantly (< 5s rollback via `blue-green-rollback.sh`)
- Decommission blue after 10 minutes of clean green operation
Alembic Migration Safety Policy
Every database migration must be backwards-compatible with the previous application version. Required sequence for any schema change:
- Migration only: deploy migration; verify old app still functions with new schema (additive changes only — new nullable columns, new tables, new indexes)
- Application deploy: deploy new application version that uses the new schema
- Cleanup migration (if needed): remove old columns/constraints after old app version is fully retired
Never: rename a column, change a column type, or drop a column in a single migration that deploys simultaneously with the application change.
Hypertable-specific migration rules:
- Always use `CREATE INDEX CONCURRENTLY` for new indexes on hypertables — does not acquire a table lock; safe during live ingest. Standard `CREATE INDEX` (without `CONCURRENTLY`) blocks all reads and writes for the duration.
- Never add a column with a non-null default to a populated hypertable in a single migration. Required sequence: (1) add nullable column, (2) backfill in batches with `UPDATE ... WHERE id BETWEEN x AND y`, (3) add NOT NULL constraint in a separate deployment.
- Test every migration against a production-sized data copy before applying to production. Record the measured execution time in the migration file header comment: `# Execution time on 10M-row orbits table: 45s.`
- Set a CI migration timeout gate: if a migration runs > 30 seconds against the test dataset, it must be reviewed by a senior engineer before merge.
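The batched backfill in step (2) reduces to generating bounded id ranges. A minimal sketch (the helper name and batch size are illustrative assumptions, not SpaceCom code):

```python
def backfill_batches(min_id: int, max_id: int, batch_size: int = 10_000):
    """Yield inclusive (lo, hi) id ranges for `UPDATE ... WHERE id BETWEEN lo AND hi`.

    Each range is committed separately so no single UPDATE holds row locks
    across the whole hypertable during live ingest.
    """
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1
```

A migration script would iterate these ranges, issue one UPDATE per range, and commit between batches; only after the backfill completes does a separate deployment add the NOT NULL constraint.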
TIP Event Deployment Freeze
No deployments permitted when a CRITICAL or HIGH alert is active for any tracked object. Enforced by a CI/CD gate:
# scripts/check_deployment_gate.py — invoked by the .gitlab-ci.yml pre-deploy stage
import requests

class DeploymentBlocked(Exception):
    pass

def check_deployment_gate():
    # API_URL and settings come from the deploy environment configuration
    response = requests.get(
        f"{API_URL}/api/v1/alerts?level=CRITICAL,HIGH&active=true",
        headers={"X-Deploy-Check": settings.deploy_check_secret},
    )
    active = response.json()["total"]
    if active > 0:
        raise DeploymentBlocked(
            f"{active} active CRITICAL/HIGH alerts. Deployment blocked until events resolve."
        )
The deploy check secret is a read-only service credential — it cannot acknowledge alerts or modify data.
CI/CD Pipeline Specification
GitLab CI pipeline jobs (.gitlab-ci.yml):
| Job | Trigger | Steps | Failure behaviour |
|---|---|---|---|
| `lint` | All pushes + PRs | `pre-commit run --all-files` (detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff) | Blocks merge |
| `test-backend` | All pushes + PRs | `pytest --cov --cov-fail-under=80`; `alembic check` (model/migration divergence) | Blocks merge |
| `test-frontend` | All pushes + PRs | `vitest run`; `playwright test` | Blocks merge |
| `security-scan` | All pushes + PRs | `bandit -r backend/`; `pip-audit --require backend/requirements.txt`; `npm audit --audit-level=high` (frontend); `eslint --plugin security`; `trivy image` on built images (`.trivyignore` applied); `pip-licenses` + `license-checker-rseidelsohn` gate; `.secrets.baseline` currency check | Blocks merge on High/Critical |
| `build-and-push` | Merge to `main` or `release/*` | Multi-stage `docker build`; `docker push ghcr.io/spacecom/<service>:sha-<commit>` via OIDC; `cosign sign` all images; `syft` SPDX-JSON SBOM generated and attached as `cosign attest`; `pip-licenses --format=json` + `license-checker-rseidelsohn --json` manifests merged into SBOM and uploaded as workflow artifact (365-day retention); `docs/compliance/sbom/` updated with versioned SBOM artefact | Blocks deploy |
| `deploy-staging` | After `build-and-push` on `main` | Docker Compose update on staging host; smoke tests | Blocks production deploy gate |
| `deploy-production` | Manual approval after `deploy-staging` passes | `check_deployment_gate()` (no active CRITICAL/HIGH alerts); blue-green deploy | Manual |
Image tagging convention:
- `sha-<commit>` — immutable canonical tag; always pushed
- `v<major>.<minor>.<patch>` — release alias pushed on tagged commits
- `latest` — never pushed; forbidden in production Compose files (CI grep check enforces this)
Build cache strategy:
# .github/workflows/ci.yml (build-and-push job excerpt)
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }} # ephemeral workflow token, not a long-lived stored secret
- uses: docker/build-push-action@v5
with:
context: ./backend
push: true
tags: ghcr.io/spacecom/backend:sha-${{ github.sha }}
cache-from: type=registry,ref=ghcr.io/spacecom/backend:buildcache
cache-to: type=registry,ref=ghcr.io/spacecom/backend:buildcache,mode=max
pip and npm caches use actions/cache keyed on lock file hash:
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
with:
path: ~/.cache/pip
key: pip-${{ hashFiles('backend/requirements.txt') }}
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
with:
path: frontend/.next/cache
key: npm-${{ hashFiles('frontend/package-lock.json') }}
cosign image signing and SBOM attestation (added after each docker push):
# .github/workflows/ci.yml — build-and-push job (after docker push steps)
- uses: sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20 # v3.5.0
- name: Sign all service images with cosign (keyless, OIDC)
env:
COSIGN_EXPERIMENTAL: "true"
run: |
for svc in backend worker-sim worker-ingest renderer frontend; do
cosign sign --yes \
ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
done
- name: Generate SBOM and attach as cosign attestation
env:
COSIGN_EXPERIMENTAL: "true"
run: |
for svc in backend worker-sim worker-ingest renderer frontend; do
syft ghcr.io/spacecom/${svc}:sha-${{ github.sha }} \
-o spdx-json=sbom-${svc}.spdx.json
# Validate non-empty
jq -e '.packages | length > 0' sbom-${svc}.spdx.json
cosign attest --yes \
--predicate sbom-${svc}.spdx.json \
--type spdxjson \
ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
done
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
with:
name: sbom-${{ github.sha }}
path: "*.spdx.json"
retention-days: 365 # ESA bid artefacts; ECSS minimum 1 year
- name: Verify signature before deploy (deploy jobs only)
if: github.event_name == 'workflow_dispatch'
run: |
cosign verify ghcr.io/spacecom/backend:sha-${{ github.sha }} \
--certificate-identity-regexp="https://github.com/spacecom/spacecom/.*" \
--certificate-oidc-issuer="https://token.actions.githubusercontent.com"
All GitHub Actions pinned by commit SHA (mutable @vN tags allow tag-repointing attacks that exfiltrate all workflow secrets):
# Correct form — all third-party actions in .github/workflows/*.yml:
- uses: docker/setup-buildx-action@4fd812986e6c8c2a69e18311145f9371337f27d # v3.4.0
- uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567 # v3.3.0
- uses: docker/build-push-action@1a162644f9a7e87d8f4b053101d1d9a712edc18c # v6.3.0
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
CI lint check enforces no mutable tags remain:
grep -rE 'uses: [^@]+@v[0-9]' .github/workflows/ && \
echo "ERROR: Actions must be pinned by commit SHA, not tag" && exit 1
Use pinact or Renovate's github-actions manager to automate SHA updates.
Local Development Environment
First-time setup (target: working stack in ≤ 15 minutes from clean clone):
git clone https://github.com/spacecom/spacecom && cd spacecom
cp .env.example .env # fill in Space-Track credentials only; all others have safe defaults
pip install pre-commit && pre-commit install
make dev # starts full stack with hot-reload
make seed # loads test objects, FIRs, and synthetic TIP events
# → Open http://localhost:3000; globe shows 10 test objects
make targets:
| Target | What it does |
|---|---|
| `make dev` | `docker compose up` with `./backend` and `./frontend/src` bind-mounted for hot-reload |
| `make test` | `pytest` (backend) + `vitest run` (frontend) + `playwright test` (E2E) |
| `make migrate` | `alembic upgrade head` inside the running backend container |
| `make seed` | Loads `fixtures/dev_seed.sql` + synthetic TIP events via seed script |
| `make lint` | Runs all pre-commit hooks against all files |
| `make clean` | `docker compose down -v` — removes all containers and volumes (destructive, prompts) |
| `make shell-db` | Opens a `psql` shell inside the TimescaleDB container |
| `make shell-backend` | Opens a bash shell inside the running backend container |
Hot-reload configuration (docker-compose.override.yml — dev only, not committed to CI):
services:
backend:
volumes:
- ./backend:/app # bind mount — FastAPI --reload picks up changes instantly
command: ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0"]
frontend:
volumes:
- ./frontend/src:/app/src # Next.js / Vite HMR
.env.example structure (excerpt):
# === Required: obtain before first run ===
SPACETRACK_USERNAME=your_email@example.com
SPACETRACK_PASSWORD=your_password
# === Required: generate locally ===
JWT_PRIVATE_KEY_PATH=./certs/jwt_private.pem # openssl genrsa -out certs/jwt_private.pem 2048
JWT_PUBLIC_KEY_PATH=./certs/jwt_public.pem
# === Safe defaults for local dev (change for production) ===
POSTGRES_PASSWORD=spacecom_dev
REDIS_PASSWORD=spacecom_dev
MINIO_ACCESS_KEY=spacecom_dev
MINIO_SECRET_KEY=spacecom_dev_secret
HMAC_SECRET=dev_hmac_secret_change_in_prod
# === Stage flags ===
ENVIRONMENT=development # development | staging | production
SHADOW_MODE_DEFAULT=false
DISABLE_SIMULATION_DURING_ACTIVE_EVENTS=false
All production-only variables are clearly marked. The README's "Getting Started" section mirrors the first-time setup steps above.
Staging Environment
Purpose: Continuous integration target for main branch. Serves as the TRL artefact evidence environment — all shadow validation records and OWASP ZAP reports reference the staging deployment.
| Property | Staging | Production |
|---|---|---|
| Infrastructure | Tier 2 (single-host Docker Compose) | Tier 3 (multi-host HA) |
| Data | Synthetic only — no production data | Real TLE/TIP/space weather |
| Secrets | Separate credential set; non-production Space-Track account | Production credential set in Vault |
| Deploy trigger | Automatic on merge to `main` | Manual approval in GitHub Actions |
| OWASP ZAP | Runs against every staging deploy | Run on demand before Phase 3 milestones |
| Retention | Environment resets weekly (fresh `make seed` run) | Persistent |
Secrets Rotation Procedure
Zero-downtime rotation is required. Service interruption during rotation is a reliability failure.
JWT RS256 Signing Keypair:
- Generate new keypair: `openssl genrsa -out jwt_private_new.pem 2048 && openssl rsa -in jwt_private_new.pem -pubout -out jwt_public_new.pem`
- Load new public key into the `JWT_PUBLIC_KEY_NEW` env var on all backend instances (old key still active)
- Backend now validates tokens signed with either old or new key
- Update `JWT_PRIVATE_KEY` to new key; new tokens are signed with new key
- Wait for all old tokens to expire (max 1h for access tokens; 30 days for refresh tokens)
- Remove `JWT_PUBLIC_KEY_NEW`; old public key no longer needed
- Log a `security_logs` entry of type `KEY_ROTATION` with rotation timestamp and initiator
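The dual-key validation window in the middle of this sequence can be sketched as a key-rotation-aware verifier. Illustrative only: the function names are assumptions, and a real implementation would use a JWT library (e.g. PyJWT) and catch its specific signature error.

```python
def verify_with_rotation(token, public_keys, verify_fn):
    """Accept a token if any configured public key verifies it.

    public_keys is ordered new-key-first (e.g. [JWT_PUBLIC_KEY_NEW, JWT_PUBLIC_KEY])
    so freshly issued tokens verify on the first attempt during the rotation window.
    """
    last_err = None
    for key in public_keys:
        try:
            return verify_fn(token, key)
        except Exception as err:  # with PyJWT, narrow to jwt.InvalidSignatureError
            last_err = err
    raise last_err
```

Once all old tokens have expired, the key list collapses back to a single entry and the loop degenerates to a normal single-key verification.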
Space-Track Credentials:
- Create new Space-Track account or update password via Space-Track web portal
- Update `SPACETRACK_USERNAME` / `SPACETRACK_PASSWORD` in secrets manager (Docker secrets / Vault)
- Trigger one manual ingest cycle; verify 200 response from Space-Track API
- Deactivate old credentials in Space-Track portal
- Log a `security_logs` entry of type `CREDENTIAL_ROTATION`
MinIO Access Keys:
- Create new access key pair via MinIO console (`mc admin user add`)
- Update `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` in secrets manager
- Restart backend and worker services (rolling restart — blue-green ensures zero downtime)
- Verify pre-signed URL generation succeeds
- Delete old access key from MinIO console
HMAC Secret (prediction signing key):
- Do not rotate casually. All existing HMAC-signed predictions will fail verification after rotation.
- Pre-rotation: re-sign all existing predictions with new key (batch migration script required)
- Post-rotation: update `HMAC_SECRET` in secrets manager; verify batch re-sign by spot-checking 10 predictions
- Rotation must be approved by engineering lead; a `security_logs` entry of type `HMAC_KEY_ROTATION` is required
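The per-record core of the batch re-sign script might look like this. A sketch under stated assumptions: the function name and column layout are illustrative, and SHA-256 is assumed as the HMAC digest.

```python
import hashlib
import hmac

def resign_prediction(payload: bytes, old_sig: str, old_key: bytes, new_key: bytes) -> str:
    """Verify the existing HMAC under the old key, then re-sign under the new key.

    Refuses to re-sign any record whose current signature fails verification:
    rotation must never launder a tampered prediction into a validly signed one.
    """
    expected = hmac.new(old_key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, old_sig):
        raise ValueError("existing signature invalid; flag record, do not re-sign")
    return hmac.new(new_key, payload, hashlib.sha256).hexdigest()
```

The batch script would wrap this in chunked transactions and write both signatures to an audit record, so the spot-check step can compare the stored new signature against an independent recomputation.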
26.10 Post-Deployment Safety Monitoring Programme (F9 — §61)
Pre-deployment testing and shadow validation demonstrate that a system was safe at a point in time. Post-deployment monitoring demonstrates that it remains safe in operational conditions. DO-278A §12 and EUROCAE ED-153 both require evidence of ongoing safety monitoring after deployment.
Programme components:
26.10.1 Prediction Accuracy Monitoring
After each actual re-entry event where SpaceCom generated predictions:
- Record the actual re-entry time and location (from The Aerospace Corporation / ESA re-entry campaign results)
- Compare against SpaceCom's p50 corridor centre and p95 bounds
- Record in the `shadow_validations` table: `actual_reentry_time`, `actual_impact_region`, `p50_error_km`, `p95_captured` (boolean)
- Compute running accuracy statistics: % of events where actual impact was within p95 corridor; median error in km
- Publish accuracy statistics to `GET /api/v1/admin/accuracy-report` (accessible to ANSP admins)
Alert trigger: If rolling 12-month p95 capture rate drops below 80% (target: 95%), engineering review is mandatory before the next ANSP shadow activation or model update deployment.
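The alert trigger reduces to a simple statistic over `shadow_validations` rows. A minimal sketch, assuming rows are plain dicts with the `p95_captured` boolean described above (function names are illustrative):

```python
def p95_capture_rate(validations) -> float:
    """Fraction of re-entry events whose actual impact fell inside the p95 corridor."""
    if not validations:
        return float("nan")  # no events yet: no statistic to report
    return sum(1 for v in validations if v["p95_captured"]) / len(validations)

def requires_model_review(validations) -> bool:
    """True when the rolling 12-month capture rate drops below the 80% review threshold."""
    rate = p95_capture_rate(validations)
    return rate == rate and rate < 0.80  # `rate == rate` filters out NaN (no events)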
26.10.2 Safety KPI Dashboard
Prometheus recording rules and Grafana dashboard (monitoring/dashboards/safety-kpis.json):
| KPI | Metric | Target | Alert threshold |
|---|---|---|---|
| HMAC verification failures | spacecom_hmac_verification_failures_total |
0 / month | Any failure → SEV-1 |
| Safety occurrences | safety_occurrences table count |
0 / year | ≥1 → safety case review |
| Alert false positive rate | Manual: PIR review | < 5% | Engineering review if exceeded |
| Operator training currency | operator_training_records expiry |
100% current | < 95% → ANSP admin notification |
| p95 corridor capture rate | shadow_validations rolling 12-month |
≥ 95% | < 80% → model review |
| Prediction freshness (TLE age at prediction time) | spacecom_tle_age_hours histogram p95 |
< 6h | > 24h → MEDIUM alert |
26.10.3 Quarterly Safety Review
Mandatory quarterly safety review meeting. Output: docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md.
Agenda:
- Safety KPI review (all metrics above)
- Safety occurrences since last review (zero is an acceptable answer — record it)
- Hazard log review: has any hazard likelihood or severity changed since last quarter?
- MoC status update: progress on PLANNED items
- Model changes in period: were any SAL-2 components modified? If so, safety case impact assessment
- ANSP feedback: any concerns raised by ANSP customers regarding safety or accuracy?
- Actions: owner, deadline, priority
Attendance required: Safety case custodian + engineering lead. One ANSP contact may be invited as an observer (good practice for regulatory demonstration).
26.10.4 Model Version Safety Monitoring
When a new model version is deployed (changes to physics/ or alerts/ SAL-2 components):
- Shadow run new model in parallel for ≥14 days before replacing production model
- Compare new vs. old: prediction differences > 50 km for p50, or > 100 km for p95, require engineering review before promotion
- After promotion: monitor
shadow_validationsfor the next 3 re-entry events; regression alert if p95 capture rate declines - Record in
simulations.model_version; all predictions annotated with the model version they used
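The promotion comparison rule above is a one-line gate; writing it down as code pins the inclusive/exclusive boundary. Sketch only, with an assumed function name:

```python
def promotion_requires_review(p50_diff_km: float, p95_diff_km: float) -> bool:
    """Engineering review gate: new-vs-old drift >50 km at p50 or >100 km at p95."""
    return p50_diff_km > 50.0 or p95_diff_km > 100.0
```

Drift exactly at a threshold (e.g. 50 km p50) does not trigger review under this reading; if the intent is inclusive, the comparisons become `>=`.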
27. Capacity Planning
27.0 Performance Test Specification (F6)
Performance tests live in tests/load/ and are run with k6. They are not part of the standard make test suite — they require a running environment with realistic data. They run:
- Manually before any Phase gate release
- Automatically on the staging environment nightly (scheduled k6 Cloud or self-hosted k6)
- Results committed to `docs/validation/load-test-results/` after each Phase gate
Scenarios
// tests/load/scenarios.js
export const options = {
scenarios: {
czml_catalog: {
executor: 'ramping-vus',
startVUs: 0, stages: [
{ duration: '30s', target: 50 },
{ duration: '2m', target: 100 },
{ duration: '30s', target: 0 },
],
},
websocket_subscribers: {
executor: 'constant-vus', vus: 200, duration: '3m',
},
decay_submit: {
executor: 'constant-arrival-rate', rate: 5, timeUnit: '1m',
preAllocatedVUs: 10, duration: '5m',
},
},
};
SLO Assertions (k6 thresholds — test fails if breached)
| Scenario | Metric | Threshold |
|---|---|---|
| CZML catalog (`GET /objects` + CZML) | p95 response time | < 2 000 ms |
| API auth (`POST /auth/token`) | p99 response time | < 500 ms |
| Decay prediction submit | p95 response time | < 500 ms (202 accept only) |
| WebSocket connection | 200 concurrent connections stable for 3 min | 0 connection drops |
| WebSocket alert delivery | Time from DB insert to browser receipt | < 30 000 ms p95 |
| `/readyz` probe | p99 response time | < 100 ms |
Baseline Environment
Performance tests are only comparable if run against a consistent hardware baseline:
# docs/validation/load-test-baseline.md
- Host: 8 vCPU / 32 GB RAM (Tier 2 single-host)
- TimescaleDB: 100 tracked objects, 90 days of orbit history
- Celery workers: simulation ×16 concurrency, ingest ×2
- Redis: empty (no warm cache) at test start
Results from a different hardware spec must be labelled separately and not compared to the baseline. A performance regression is defined as any threshold breach on the same baseline hardware.
Storing and Trending Results
k6 outputs a JSON summary; a CI step uploads it to docs/validation/load-test-results/YYYY-MM-DD-{env}.json. A lightweight Python script (scripts/load-test-trend.py) plots p95 latency over time for the past 10 runs and embeds the chart in docs/TEST_PLAN.md. A > 20% increase in any p95 metric between consecutive runs on the same hardware creates a performance-regression GitHub issue automatically.
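The regression rule that drives the automatic issue creation can be stated as a tiny predicate, which `scripts/load-test-trend.py` would apply to each p95 metric across consecutive same-hardware runs. Sketch only; the function name is an assumption:

```python
def is_performance_regression(prev_p95_ms: float, curr_p95_ms: float,
                              threshold: float = 0.20) -> bool:
    """True when p95 latency rose by more than `threshold` (default 20%)
    between two consecutive runs on the same baseline hardware."""
    return curr_p95_ms > prev_p95_ms * (1.0 + threshold)
```

An increase of exactly 20% is not a regression under this reading; only a strict breach opens a `performance-regression` issue.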
27.1 Workload Characterisation
| Workload | CPU Profile | Memory | Dominant Constraint |
|---|---|---|---|
| MC decay prediction (500 samples) | CPU-bound, parallelisable | 200–500 MB per process | CPU cores on simulation workers |
| SGP4 catalog propagation (100 objects) | Trivial | < 100 MB | None — analytical model |
| CZML generation | I/O-bound (DB read) | < 500 MB | DB query latency |
| Atmospheric breakup | CPU-bound, light | ~200 MB | Negligible vs. MC |
| Conjunction screening (100 objects) | CPU-bound, seconds | ~500 MB | Acceptable on any worker |
| Controlled re-entry planner | CPU-bound, similar to MC | 500 MB | Same pool as MC |
| Playwright renderer | Memory-bound (Chromium) | 1–2 GB per instance | Isolated container |
| TimescaleDB queries | I/O-bound | 64 GB (buffer cache) | NVMe IOPS for spatial queries |
Cost-tracking metrics (F3, F4, F11):
Add the following Prometheus counters to enable per-org cost attribution and external API budget visibility. These feed the unit economics model (§27.7) and the Enterprise tier chargeback reports.
# backend/app/metrics.py (add to existing prometheus_client registry)
from prometheus_client import Counter
# F3 — External API call budget tracking
ingest_api_calls_total = Counter(
"spacecom_ingest_api_calls_total",
"Total external API calls made by the ingest worker",
labelnames=["source"] # "space_track", "celestrak", "noaa_swpc", "esa_discos", "iers"
)
# Usage: ingest_api_calls_total.labels(source="space_track").inc()
# Alert: if space_track calls > 100/day → investigate polling loop bug (Space-Track AUP limit: 200/day)
# F4 — Per-org simulation CPU attribution
simulation_cpu_seconds_total = Counter(
"spacecom_simulation_cpu_seconds_total",
"Total CPU-seconds consumed by MC simulations, by org and object",
labelnames=["org_id", "norad_id"]
)
# Usage: simulation_cpu_seconds_total.labels(org_id=str(org_id), norad_id=str(norad_id)).inc(elapsed)
# This is the primary input to infrastructure_cost_per_mc_run in §27.7
F5 — Inbound API request counter (§68):
# backend/app/metrics.py (add to existing prometheus_client registry)
api_requests_total = Counter(
"spacecom_api_requests_total",
"Total inbound API requests, by org, endpoint, and API version",
labelnames=["org_id", "endpoint", "version", "status_code"]
)
# Usage (FastAPI middleware):
# api_requests_total.labels(
# org_id=str(request.state.org_id),
# endpoint=request.url.path,
# version=request.headers.get("X-API-Version", "v1"),
# status_code=str(response.status_code)
# ).inc()
This counter is the foundation for future API tier enforcement (e.g., 1,000 requests/month for Professional; unlimited for Enterprise) and for supporting usage-based billing for Persona E/F API consumers. Add to the FastAPI middleware stack alongside prometheus_fastapi_instrumentator.
F11 — Per-org cost attribution for Enterprise tier:
Enterprise contracts may include usage-based clauses (e.g., MC simulation credits). The simulation_cpu_seconds_total metric provides the raw data; a monthly Celery task (tasks/billing/generate_usage_report.py) aggregates it per org:
@shared_task
def generate_monthly_usage_report(org_id: str, year: int, month: int):
"""Aggregate simulation CPU-seconds and ingest API calls per org for billing review."""
# Query Prometheus/VictoriaMetrics for the org's metrics over the billing period
# Output: docs/business/usage_reports/{org_id}/{year}-{month:02d}.json
# Fields: total_mc_runs, total_cpu_seconds, estimated_cost_usd (at $0.40/run internal rate)
Per-org usage reports are stored in docs/business/usage_reports/ and referenced in Enterprise QBRs. The cost rate ($0.40/run at Tier 3 scale) is updated quarterly in docs/business/UNIT_ECONOMICS.md.
Usage surfaced to commercial team and org admins (F2 — §68):
Usage data must reach two audiences: the commercial team (for renewal and expansion conversations) and the org admin (to understand value received).
Commercial team: Monthly Celery Beat task (tasks/commercial/send_commercial_summary.py) emails commercial@spacecom.io on the 1st of each month with:
- Per-org: MC simulation count, PDF reports generated, WebSocket connection hours, alert events (by severity)
- Trend vs. previous 3 months (growth signal for expansion conversations)
- Contracts expiring within 90 days (renewal pipeline)
Org admin: Monthly usage summary email to each org's admin contact showing their own usage. Template: "In [month], your team ran [N] decay predictions, generated [M] PDF reports, and received [K] CRITICAL alerts. Your monthly quota: [Q] simulations (used: [N])." This email reinforces value perception ahead of renewal conversations.
Both emails use the generate_monthly_usage_report output. Add send_usage_summary_emails to celery-redbeat at crontab(day_of_month=1, hour=6).
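A sketch of the corresponding schedule entry, assuming the task lives at tasks.commercial.send_usage_summary_emails (the module path is an assumption; the crontab spec comes from the text):

```python
# backend/celeryconfig.py (sketch) — celery-redbeat reads this schedule from Redis DB 1 (§27.8)
from celery.schedules import crontab

beat_schedule = {
    "send-usage-summary-emails": {
        "task": "tasks.commercial.send_usage_summary_emails",
        "schedule": crontab(day_of_month="1", hour=6),  # 1st of each month, 06:00
    },
}
```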
27.2 Monte Carlo Parallelism Architecture
The MC decay predictor must use Celery group + chord to distribute sample computation across the full worker pool. multiprocessing.Pool within a single task is limited to one container's cores.
import numpy as np
from celery import group, chord

@celery.task
def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    """Fan out 500 samples as individual sub-tasks; aggregate with chord callback."""
    sample_tasks = group(
        run_single_trajectory.s(object_id, params, seed=i)
        for i in range(params['mc_samples'])
    )
    result = chord(sample_tasks)(aggregate_mc_results.s(object_id, params))
    return result.id

@celery.task
def run_single_trajectory(object_id: int, params: dict, seed: int) -> dict:
    """Single RK7(8) + NRLMSISE-00 trajectory integration. CPU time: 2–20s."""
    rng = np.random.default_rng(seed)
    f107 = params['f107'] * rng.normal(1.0, 0.20)    # 20% (1σ) solar-flux variation
    bstar = params['bstar'] * rng.normal(1.0, 0.10)  # 10% (1σ) drag-term variation
    return integrate_trajectory(object_id, f107, bstar, params)

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])
Worker concurrency for chord sub-tasks:
- Each sub-task is short (2–20s) and CPU-bound
- Worker --pool=prefork --concurrency=16: 16 OS processes per container
- 2 simulation worker containers: 32 concurrent sub-tasks
- 500 samples / 32 = ~16 batches × ~10s average = ~160s per MC run (p50)
- p95 target of 240s met with headroom
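The capacity arithmetic above can be checked with a short helper; the function name and the fixed mean task time are illustrative only:

```python
import math

def mc_wall_time_s(samples, containers, concurrency, mean_task_s):
    """p50 wall time for a chord fan-out: ceil(samples / slots) sequential batches."""
    slots = containers * concurrency      # concurrent sub-tasks across the pool
    batches = math.ceil(samples / slots)  # 500 samples / 32 slots → 16 batches
    return batches * mean_task_s

print(mc_wall_time_s(500, containers=2, concurrency=16, mean_task_s=10))  # → 160
print(mc_wall_time_s(500, containers=4, concurrency=16, mean_task_s=10))  # → 80 (Tier 3, §27.3)
```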
Chord result backend: Sub-task results stored in Redis temporarily (< 1 MB each × 500 = 500 MB peak per run). Results expire after 1 hour (result_expires = 3600 in celeryconfig.py — §27.8). The aggregate callback reads all results, computes the final prediction, and writes to TimescaleDB — Redis is not the durable store.
Chord callback result count validation (F1 — §67): Redis noeviction prevents eviction, but if Redis is misconfigured or hits maxmemory and rejects writes, sub-task results may be missing when the chord callback fires. The callback must validate that it received the expected number of results before writing to TimescaleDB:
@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    expected = params['mc_samples']
    if len(results) != expected:
        # Partial result — do not write a silently truncated prediction
        raise ValueError(
            f"MC chord received {len(results)}/{expected} results for object {object_id}. "
            "Redis result backend may be under memory pressure. Aborting."
        )
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])
The ValueError causes the chord callback to fail and be routed to the DLQ (Dead Letter Queue). The originating API call receives a task failure, and the client receives HTTP 500 with Retry-After. A spacecom_mc_chord_partial_result_total counter fires, triggering a CRITICAL alert: "MC chord received partial results — Redis memory budget exceeded."
27.3 Deployment Tiers
Tier 1 — Development and Demonstration
Single machine, Docker Compose, all services co-located. No HA. Suitable for development, internal demos, and ESA TRL 4 demonstrations.
| Spec | Minimum | Recommended |
|---|---|---|
| CPU | 8 cores | 16 cores |
| RAM | 16 GB | 32 GB |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD |
| Cloud equivalent | t3.2xlarge ~$240/mo | m6i.4xlarge ~$540/mo |
MC prediction p95: ~400–800s (exceeds SLO — acceptable for demo; noted in demo briefings).
Tier 2 — Phase 1–2 Production
Separate containers per service. Meets SLOs under moderate load (≤ 5 concurrent simulation users). Single-node per service — no HA. Suitable for shadow mode deployments and early ANSP pilots.
| Service | vCPU | RAM | Storage | Cloud (AWS) | Monthly |
|---|---|---|---|---|---|
| Backend API | 4 | 8 GB | — | c6i.xlarge | ~$140 |
| Simulation Workers ×2 | 16 each | 32 GB each | — | c6i.4xlarge ×2 | ~$560 each |
| Ingest Worker | 2 | 4 GB | — | t3.medium | ~$30 |
| Renderer | 4 | 8 GB | — | c6i.xlarge | ~$140 |
| TimescaleDB | 8 | 64 GB | 1 TB NVMe | r6i.2xlarge | ~$420 |
| Redis | 2 | 8 GB | — | cache.r6g.large | ~$120 |
| MinIO / S3 | 4 | 8 GB | 4 TB | i3.xlarge + EBS | ~$200 |
| Total | — | — | — | — | ~$2,200/mo |
On-premise equivalent (Tier 2): Two servers — compute host (2× AMD EPYC 7313P, 32 total cores, 192 GB RAM) + storage host (8 vCPU, 256 GB RAM, 2 TB NVMe + 8 TB HDD). Capital cost: ~$25,000–35,000.
Tier 3 — Phase 3 HA Production
Full redundancy. Meets 99.9% availability SLO including during active TIP events. Required before any formal operational ANSP deployment.
| Service | Count | vCPU each | RAM each | Notes |
|---|---|---|---|---|
| Backend API | 2 | 4 | 8 GB | Load balanced; blue-green deployable |
| Simulation Workers | 4 | 16 | 32 GB | 64 total cores; chord sub-tasks fill all |
| Ingest Worker | 2 | 2 | 4 GB | celery-redbeat leader election |
| Renderer | 2 | 4 | 8 GB | Network-isolated; Chromium memory budget |
| TimescaleDB Primary | 1 | 8 | 128 GB | Patroni-managed; synchronous replication |
| TimescaleDB Standby | 1 | 8 | 128 GB | Hot standby; auto-failover ≤ 30s |
| Redis Sentinel ×3 | 3 | 2 | 8 GB | Quorum; master failover ≤ 10s |
| MinIO (distributed) | 4 | 4 | 16 GB | Erasure coding EC:2; 2× 2 TB NVMe each |
| Cloud total (AWS) | — | — | — | ~$6,000–7,000/mo |
With 64 simulation worker cores: 500-sample MC in ~80s p50, ~120s p95 — well within SLO.
MinIO Erasure Coding (Tier 3): 4-node distributed MinIO uses EC:2 (2 parity shards). This provides:
- Read quorum: any 2 of 4 nodes (tolerates 2 simultaneous node failures for reads)
- Write quorum: requires 3 of 4 nodes (tolerates 1 simultaneous node failure for writes)
- Effective storage: 50% — 8 TB raw across 4 nodes → 4 TB usable. Note that the Tier 3 table provisions 2× 2 TB NVMe per node (16 TB raw → 8 TB usable); resize if the usable-capacity target changes
- Configured via MINIO_ERASURE_SET_DRIVE_COUNT=4 and server startup with all 4 node endpoints
Multi-region stance: SpaceCom is single-region through all three phases. Reasoning:
- Phase 1–3 customer base is small (ESA evaluation, early ANSP pilots); cross-region replication cost and operational complexity is not justified.
- Government and defence customers may have data sovereignty requirements — a single, clearly defined deployment region (customer-specified) is simpler to certify than an active-active multi-region setup.
- When a second jurisdiction customer is onboarded, deploy a separate, independent instance in their required jurisdiction rather than extending a single global cluster. Each instance has its own data, its own compliance scope, and its own operational team contact.
- This decision is documented as ADR-0010 (see §34 decision log).
On-premise equivalent (Tier 3): Three servers — 2× compute (2× EPYC 7343, 32 cores, 256 GB RAM each) + 1× storage (128 GB RAM, 4× 2 TB NVMe RAID-10, 16 TB HDD). Capital cost: ~$60,000–80,000.
Celery worker idle cost and scale-to-zero decision (F6):
Simulation workers are the largest cloud line item ($560/mo each at Tier 2 on c6i.4xlarge). Their actual compute utilisation depends on MC run frequency:
| Usage pattern | Active compute/day | Idle fraction | Monthly cost at Tier 2 ×2 workers |
|---|---|---|---|
| Light (5 MC runs/day × 80s p50) | ~7 min/day | ~99.5% | $1,120 |
| Moderate (20 MC runs/day × 80s) | ~27 min/day | ~98.1% | $1,120 |
| Heavy (100 MC runs/day × 80s) | ~133 min/day | ~90.7% | $1,120 |
Scale-to-zero analysis:
| Approach | Pros | Cons | Decision |
|---|---|---|---|
| Always-on (Tier 1–2) | Zero cold-start; SLO met immediately | High idle cost when lightly used | Use at Tier 1–2 — cost is ~$1,120/mo regardless; latency SLO requires workers ready |
| Scale-to-1 minimum (Tier 3) | Reduced idle cost vs. 4×; one worker handles ingest keepalive tasks | Cold-start for burst: 3 new workers × 30–60s spin-up; MC SLO may breach during burst | Use at Tier 3 — scale-to-1 minimum; HPA/KEDA scales 1→4 on celery_queue_length > 10 |
| Scale-to-zero | Maximum idle savings | 60–120s cold-start violates 10-min MC SLO when all workers are down | Do not use — cold-start from zero exceeds acceptable latency for on-demand simulation |
Implementation at Tier 3 (Kubernetes): Use a KEDA ScaledObject with the Redis list scaler pointed at the Celery queue (KEDA has no Celery-specific trigger; the Celery default queue is a Redis list):
# ScaledObject spec (excerpt; Redis address and auth omitted)
minReplicaCount: 1            # scale-to-1 minimum
maxReplicaCount: 4
triggers:
  - type: redis
    metadata:
      listName: celery        # Celery default queue
      listLength: "10"        # scale up when >10 tasks queued
Minimum replica count: 1. Maximum: 4. Scale-down stabilisation window: 5 minutes (prevents oscillation during multi-run bursts).
Ingest worker: Always-on, single instance (2 vCPU, $30/mo at Tier 2). celery-redbeat tasks run on 1-minute and hourly schedules; scale-to-zero is not appropriate. At Tier 3, 2 instances for redundancy; no autoscaling needed.
27.4 Storage Growth Projections
| Data | Retention | Raw Growth/Year | Compressed/Year | Cloud Cost/Year (est.) | Notes |
|---|---|---|---|---|---|
| orbits (100 objects, 1/min) | 90 days online | ~15 GB | ~2 GB | ~$20 (EBS gp3, rolling) | TimescaleDB compression ~7:1 |
| tle_sets | 1 year | ~55 MB | ~30 MB | Negligible | — |
| space_weather | 2 years | ~5 MB | ~2 MB | Negligible | — |
| MC simulation blobs (MinIO) | 2 years | 500 GB–2 TB | Not compressed | $140–$560/yr (S3-IA after 90d) | Dominant cost — S3-IA at $0.0125/GB/mo |
| PDF reports (MinIO) | 7 years | 10–90 GB | 5–45 GB | $5–$45/yr (S3 Glacier) | $0.004/GB/mo Glacier tier |
| WAL archive (backup) | 30 days rolling | ~25 GB/month | — | ~$100/yr (300 GB peak × $0.023/GB/mo × 12) | S3 Standard; rolls over; cost is steady-state |
| security_logs | 2 years online; 7-year archive | ~500 MB/year | — | Negligible | Legal hold |
| reentry_predictions | 7 years | ~100 MB/year | — | Negligible | Legal hold |
| Safety records (alert_events, notam_drafts, prediction_outcomes, degraded_mode_events, coordination notes) | 5-year minimum append-only archive | ~200 MB/year | — | Negligible | ICAO Annex 11 §2.26; safety investigation requirement |
Storage cost summary (Phase 2 steady-state): MC blobs dominate at sustained use. At 50 runs/day × 120 MB/run = 2.2 TB/year, 2-year retention on S3-IA ≈ $660/year in object storage alone. This should be captured in the unit economics model (§27.7). Storage cost is the primary variable cost that scales with usage depth (number of MC runs), not with number of users.
Backup cost projection (F9): WAL archive at 30-day rolling window: ~300 GB peak occupancy on S3 Standard ≈ $83/year (Tier 2). At Tier 3 with synchronous replication, the base-backup is ~2× TimescaleDB data size. At 1 TB compressed DB size: one weekly base-backup (retained 4 weeks) = 4 TB S3 occupancy → **$1,100/year** at Tier 3. Include backup S3 bucket costs in infrastructure budget from Phase 3 onwards. Budget line: infra/backup-s3 ≈ $100–200/month at steady Tier 3 scale.
Safety record retention policy (Finding 11): Safety-relevant event records have a distinct retention category separate from general operational data. A safety_record BOOLEAN DEFAULT FALSE flag on alert_events and notam_drafts marks records that must survive the standard retention drop. Records with safety_record = TRUE are excluded from TimescaleDB drop policies and transferred to MinIO cold tier (append-only) after 90 days online, retained for 5 years minimum. The TimescaleDB retention job checks WHERE safety_record = FALSE before dropping chunks. safety_record is set to TRUE at insert time for any event with alert_level IN ('HIGH', 'CRITICAL') and for all NOTAM drafts.
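The insert-time rule can be expressed as a small predicate; this is a sketch of the policy logic, not the actual database trigger or ORM hook:

```python
def is_safety_record(alert_level=None, is_notam_draft=False):
    """safety_record = TRUE for all NOTAM drafts and for HIGH/CRITICAL alert events."""
    return is_notam_draft or alert_level in ("HIGH", "CRITICAL")

# The retention job then drops only chunks WHERE safety_record = FALSE
print(is_safety_record("CRITICAL"))           # → True
print(is_safety_record("MEDIUM"))             # → False
print(is_safety_record(is_notam_draft=True))  # → True
```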
MC blob storage dominates at scale. At sustained use (50 MC runs/day × 120 MB/run): 2.2 TB/year. The Tier 3 distributed MinIO (8 TB usable with erasure coding on 4×2 TB nodes) covers approximately 3–4 years before expansion.
Cold tier tiering decision (two object classes with different requirements):
| Object class | Cold tier target | Reason |
|---|---|---|
| MC simulation blobs (mc_blobs/ prefix) | MinIO ILM warm tier or S3 Infrequent Access | Blobs may need to be replayed for Mode C visualisation of historical events (e.g., regulatory dispute review, incident investigation). Glacier 12h restore latency is operationally unacceptable for this use case. |
| Compliance-only documents (reports/, notam_drafts/) | S3 Glacier / Glacier Deep Archive acceptable | These are legal records requiring 7-year retention; retrieval is for audit or legal discovery only; 12h restore latency is acceptable. |
MinIO ILM rules configured in docs/runbooks/minio-lifecycle.md. Lifecycle transitions: MC blobs after 90 days → ILM warm (lower-cost MinIO tier or S3-IA); compliance docs after 1 year → Glacier.
MinIO multipart upload retry and incomplete upload expiry (F7 — §67):
MC simulation blobs (~120 MB each) are uploaded as multipart uploads. During a MinIO node failure in EC:2 distributed mode, write quorum (3/4 nodes) may be temporarily unavailable. An in-flight multipart upload will fail with MinioException / S3Error. Without a retry policy, the MC prediction is written to TimescaleDB but the blob is lost — the historical replay functionality silently fails.
# worker/tasks/blob_upload.py
import io

from celery import shared_task
from minio.error import S3Error

@shared_task(
    autoretry_for=(S3Error, ConnectionError),
    max_retries=3,
    retry_backoff=30,  # 30s, 60s, 120s — allow node recovery
    retry_jitter=True,
)
def upload_mc_blob(prediction_id: str, blob_data: bytes):
    """Upload MC simulation blob to MinIO with retry on quorum failure."""
    object_key = f"mc_blobs/{prediction_id}.msgpack"
    minio_client.put_object(
        bucket_name="spacecom-simulations",
        object_name=object_key,
        data=io.BytesIO(blob_data),
        length=len(blob_data),
        content_type="application/msgpack",
    )
Incomplete multipart upload cleanup: Configure MinIO lifecycle rule to abort incomplete multipart uploads after 24 hours. Add to docs/runbooks/minio-lifecycle.md:
mc ilm rule add --expire-delete-marker --noncurrent-expire-days 1 \
spacecom/spacecom-simulations --abort-incomplete-multipart-upload-days 1
This prevents orphaned multipart upload parts accumulating on disk during node failures or application crashes mid-upload.
27.5 Network and External Bandwidth
| Traffic | Direction | Volume | Notes |
|---|---|---|---|
| Space-Track TLE polling | Outbound | ~1 MB per run, every 4h | ~6 MB/day |
| NOAA SWPC space weather | Outbound | ~50 KB per fetch, hourly | ~1 MB/day |
| ESA DISCOS | Outbound | ~10 MB/day (initial bulk); ~100 KB/day incremental | — |
| CZML to clients | Outbound | ~5–15 MB per user page load (full); <500 KB/hr delta | Scales linearly with users; delta protocol essential |
| WebSocket to clients | Outbound | ~1 KB/event × events/day | Low bandwidth, persistent connection |
| PDF reports (download) | Outbound | ~2–5 MB per report | Low frequency; MinIO presigned URL avoids backend proxy |
| MinIO internal traffic | Internal | Dominated by MC blob writes | Keep on internal Docker network |
CZML egress cost estimate and compression policy (F5):
At Phase 2 (10 concurrent users), daily CZML egress:
- Initial full loads: 10 users × 3 page loads/day × 15 MB = 450 MB/day
- Delta updates (delta protocol, §6): 10 users × 8h active × 500 KB/hr = 40 MB/day
- Total: ~490 MB/day ≈ 15 GB/month
At $0.085/GB AWS CloudFront egress: ~$1.28/month (Phase 2) → ~$6.40/month (50 users Phase 3).
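The egress estimate above folds into a small helper (defaults taken from the Phase 2 assumptions; the function name is illustrative):

```python
def monthly_czml_egress_gb(users, loads_per_day=3, full_load_mb=15,
                           active_hours=8, delta_kb_per_hr=500):
    """Full page loads plus delta updates, scaled to a 30-day month, in GB."""
    daily_mb = (users * loads_per_day * full_load_mb
                + users * active_hours * delta_kb_per_hr / 1000)
    return daily_mb * 30 / 1000

phase2_gb = monthly_czml_egress_gb(10)  # ~14.7 GB/month at 10 concurrent users
print(round(phase2_gb * 0.085, 2))      # CloudFront egress → ~$1.25/month
```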
CZML egress is not a significant cost driver at this scale, but is significant for latency and user experience. Compression policy:
| Encoding | CZML size reduction | Implementation |
|---|---|---|
| gzip (Accept-Encoding) | 60–75% | Caddy encode gzip — already included in §26.9 Caddy config |
| Brotli | 70–80% | Caddy encode zstd br gzip — use br for browser clients |
| CZML delta protocol (?since=) | 95%+ for incremental updates | Already specified in §6 |
Minimum requirement: Caddy encode block must include br before gzip in the content negotiation order. A 15 MB CZML payload compresses to ~3–5 MB with brotli. Verify with curl -H "Accept-Encoding: br" -I <url> — response must show Content-Encoding: br.
Network is not a constraint for this workload at the scales described. Standard 1 Gbps datacenter networking is sufficient. For on-premise government deployments, standard enterprise LAN is adequate.
27.6 DNS Architecture and Service Discovery
Tier 1–2 (Docker Compose)
Docker Compose provides built-in DNS resolution by service name within each network. Services reference each other by container name (e.g., db, redis, minio). No additional DNS infrastructure required.
PgBouncer as single DB connection target: At Tier 2, the backend and workers connect to pgbouncer:5432, not directly to db:5432. PgBouncer multiplexes connections and acts as a stable endpoint:
- In a Patroni failover, pgbouncer is reconfigured to point to the new primary; application code never changes connection strings.
- PgBouncer configuration: docs/runbooks/pgbouncer-config.md
Celery task retry during Patroni failover (F2 — §67): During the ≤ 30s Patroni leader election window, all writes to PgBouncer fail with FATAL: no connection available or OperationalError: server closed the connection unexpectedly. Celery tasks that execute a DB write during this window will raise sqlalchemy.exc.OperationalError. Without a retry policy, these tasks fail permanently and are routed to the DLQ.
All Celery tasks that write to the database must declare:
from celery import shared_task
from sqlalchemy.exc import OperationalError

@shared_task(
    autoretry_for=(OperationalError,),
    max_retries=3,
    retry_backoff=5,       # 5s, 10s, 20s
    retry_backoff_max=30,  # cap at 30s (within failover window)
    retry_jitter=True,
)
def my_db_writing_task(*args, **kwargs):
    ...
This covers: aggregate_mc_results, write_alert_event, write_prediction_outcome, all ingest tasks. Tasks that only read from DB should also retry on OperationalError since PgBouncer may pause reads during leader election. Add integration test: simulate OperationalError on first two attempts → task succeeds on third attempt.
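That integration test can be approximated without a live database by simulating the retry loop. OperationalError is stubbed here, and call_with_retries is a stand-in for Celery's autoretry_for behaviour (initial attempt plus max_retries retries); the real test would patch the session factory instead:

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError in this sketch."""

def call_with_retries(task, max_retries=3):
    """Approximate autoretry_for semantics: one initial try + max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return task(attempt)
        except OperationalError:
            if attempt == max_retries:
                raise

attempts = []
def flaky_db_write(attempt):
    """Fails twice (Patroni leader election window), succeeds on the third attempt."""
    attempts.append(attempt)
    if attempt < 2:
        raise OperationalError("server closed the connection unexpectedly")
    return "ok"

assert call_with_retries(flaky_db_write) == "ok"
assert attempts == [0, 1, 2]  # two failures, success on the third attempt
```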
Tier 3 (HA / Kubernetes migration path)
At Tier 3, introduce split-horizon DNS:
| Zone | Scope | Purpose |
|---|---|---|
| spacecom.internal | Internal services | Service discovery: backend.spacecom.internal, db.spacecom.internal (→ PgBouncer VIP) |
| spacecom.io (or customer domain) | Public internet | Caddy termination endpoint; ACME certificate domain |
Service discovery implementation:
- Cloud (AWS/GCP/Azure): Use cloud-native internal DNS (Route 53 private hosted zones / Cloud DNS) + load balancer for each service tier
- On-premise: CoreDNS deployed as a DaemonSet (Kubernetes) or as a Docker container on the management network; service records updated via Patroni callback scripts on failover
Key DNS records (Tier 3):
| Record | Type | Value |
|---|---|---|
| db.spacecom.internal | A | PgBouncer VIP (stable through Patroni failover) |
| redis.spacecom.internal | A | Redis Sentinel VIP |
| minio.spacecom.internal | A | MinIO load balancer (all 4 nodes) |
| backend.spacecom.internal | A | Backend API load balancer (2 instances) |
27.7 Unit Economics Model
Reference document: docs/business/UNIT_ECONOMICS.md — maintained alongside this plan; update whenever pricing or infrastructure costs change.
Unit economics express the cost to serve one organisation per month and the revenue generated, enabling margin analysis per tier.
Cost-to-serve model (Phase 2, cloud-hosted, per org):
| Cost driver | Basis | Monthly cost per org |
|---|---|---|
| Simulation workers (shared pool) | 2 workers shared across all orgs; allocate by MC run share | $1,120 ÷ org count |
| TimescaleDB (shared instance) | ~$420/mo; fixed regardless of org count up to Phase 2 capacity | $420 ÷ org count |
| Redis (shared) | ~$120/mo | $120 ÷ org count |
| MinIO / S3 storage | Variable; ~$660/yr at heavy MC use → $55/mo | $5–55/mo |
| Backend API (shared) | ~$140/mo | $140 ÷ org count |
| Ingest worker (shared) | ~$30/mo | Allocated to platform overhead |
| Email relay | ~$0.001/email × volume | $0–5/mo |
| CZML egress | ~$0.085/GB | $1–7/mo |
| Total variable (1 org, Tier 2) | — | ~$1,860/mo platform + $60–70 per-org variable |
Revenue per tier (target pricing — cross-reference §55 commercial model):
| Tier | Monthly ARR / org | Gross margin target |
|---|---|---|
| Free / Evaluation | $0 | Negative — cost of ESA relationship |
| Professional (shadow) | $3,000–6,000/mo | 50–70% at ≥3 orgs on platform |
| Enterprise (operational) | $15,000–40,000/mo | 65–75% at Tier 3 scale |
Break-even analysis: At Tier 2 platform cost (~$2,200/mo), break-even at Professional tier requires ≥1 paying org at $3,000/mo. Each additional Professional org at shared infrastructure has near-zero incremental infrastructure cost until capacity boundaries (MC concurrency limit, DB connection pooler limit).
Key unit economics metric: infrastructure_cost_per_mc_run. At Tier 2 (2 workers, $1,120/mo) and 500 runs/month: $2.24/run. At Tier 3 (4 workers KEDA scale-to-1, ~$800/mo amortised at medium utilisation) and 2,000 runs/month: $0.40/run. This metric should be tracked alongside spacecom_simulation_cpu_seconds_total (§27.1).
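A sketch of the metric's arithmetic (the helper name is illustrative; inputs are the figures above):

```python
def infra_cost_per_mc_run(worker_cost_usd_per_month, runs_per_month):
    """Monthly simulation-worker spend divided by MC runs in the same month."""
    return round(worker_cost_usd_per_month / runs_per_month, 2)

print(infra_cost_per_mc_run(1120, 500))   # Tier 2, 2 always-on workers → 2.24
print(infra_cost_per_mc_run(800, 2000))   # Tier 3, KEDA scale-to-1 amortised → 0.4
```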
Professional Services as a revenue line (F10 — §68):
Professional Services (PS) revenue is a distinct revenue stream from recurring SaaS fees. For safety-critical aviation systems, PS typically represents 30–50% of first-year contract value and includes:
| PS engagement type | Typical value | Description |
|---|---|---|
| Implementation support | $15,000–40,000 | Deployment, configuration, integration with ANSP SMS |
| Regulatory documentation | $10,000–25,000 | SpaceCom system description for ANSP regulatory submissions; assists with EASA/CASA/CAA shadow mode notifications |
| Training (initial) | $5,000–15,000 | On-site or remote training for duty controllers, analysts, and IT administrators |
| Safety Management System integration | $8,000–20,000 | Integrating SpaceCom alert triggers into the ANSP's existing SMS occurrence reporting workflow |
| Annual training refresh | $2,000–5,000/yr | Recurring annual training for new staff and procedure updates |
PS revenue is tracked in the contracts.ps_value_cents column (§68 F1). Include PS as a budget line in docs/business/UNIT_ECONOMICS.md:
- Year 1 total contract value = MRR × 12 + PS value
- PS is recognised as one-time revenue at delivery (milestone-based); SaaS fees are recognised monthly
- PS delivery requires dedicated engineering and commercial capacity — budget 1–2 days of senior engineer time per $5,000 of PS value
Shadow trial MC quota (F8 — §68): Free/shadow trial orgs are limited to 100 MC simulation runs per month (organisations.monthly_mc_run_quota = 100). Enforcement at POST /api/v1/decay/predict:
if org.subscription_tier in ('shadow_trial',) and org.monthly_mc_run_quota > 0:
    runs_this_month = get_monthly_mc_run_count(org_id)
    if runs_this_month >= org.monthly_mc_run_quota:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "monthly_quota_exceeded",
                "quota": org.monthly_mc_run_quota,
                "used": runs_this_month,
                "resets_at": first_of_next_month().isoformat(),
                "upgrade_url": "/settings/billing"
            }
        )
Commercial controls must not interrupt active operations. If the organisation is in an active TIP / CRITICAL operational state, quota exhaustion is logged and surfaced to commercial/admin dashboards but enforcement is deferred until the event closes.
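The combined quota-plus-deferral rule reduces to a small decision function. This is a sketch of the logic described above, with invented return labels standing in for the real API-layer behaviour:

```python
def quota_action(tier, used, quota, active_critical_event):
    """Return 'allow', 'reject_429', or 'defer_and_log' per the shadow-trial quota rule."""
    if tier != "shadow_trial" or quota <= 0 or used < quota:
        return "allow"
    # Quota exhausted: never interrupt an active TIP / CRITICAL operational state
    return "defer_and_log" if active_critical_event else "reject_429"

print(quota_action("shadow_trial", 100, 100, False))  # → reject_429
print(quota_action("shadow_trial", 100, 100, True))   # → defer_and_log
print(quota_action("enterprise", 5000, 0, False))     # → allow (no quota on paid tiers)
```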
27.8 Redis Memory Budget
Reference document: docs/infra/REDIS_SIZING.md — sizing rationale and eviction policy decisions.
Redis serves three distinct purposes with different memory characteristics. Using a single Redis instance (with separate DB indexes for broker vs. cache) requires explicit memory budgeting:
| Purpose | DB index | Key pattern | Estimated peak memory | Eviction policy |
|---|---|---|---|---|
| Celery broker + result backend | DB 0 | celery-task-meta-*, _kombu.* | 500 MB (500 MC sub-tasks × ~1 MB results) | noeviction |
| celery-redbeat schedule | DB 1 | redbeat:* | < 1 MB | noeviction |
| WebSocket session tracking | DB 2 | spacecom:ws:*, spacecom:active_tip:* | < 10 MB | noeviction |
| Application cache (CZML, NOTAM) | DB 3 | spacecom:cache:* | 50–200 MB | allkeys-lru |
| Redis Pub/Sub fan-out (alerts) | — | spacecom:alert:* channels | Transient; ~1 KB/message | N/A (pub/sub, no persistence) |
| Total budget | — | — | ~700–750 MB peak | — |
Sizing decision: Use cache.r6g.large (8 GB RAM) with maxmemory 2gb — provides 2.5× headroom above peak estimate for burst conditions (multiple simultaneous MC runs × result backend). Set maxmemory-policy noeviction globally; the application cache (DB 3) must handle cache misses gracefully (it does — CZML regeneration on miss is defined in §6).
Redis memory alert: Add Grafana alert redis_memory_used_bytes > 1.5GB → WARNING; > 1.8GB → CRITICAL. At CRITICAL, check for result backend accumulation (expired Celery results not cleaned up) before scaling.
Redis result cleanup: Celery result_expires must be set to 3600 (1 hour). Verify in backend/celeryconfig.py:
result_expires = 3600 # Clean up MC sub-task results after 1 hour
28. Human Factors Framework
SpaceCom is a safety-critical decision support system used by time-pressured operators in aviation operations rooms. Human factors are not a UX concern — they are a safety assurance concern. This section documents the HF design requirements, standards basis, and validation approach.
Standards basis: ICAO Doc 9683 (Human Factors in Air Traffic Management), FAA AC 25.1329 (Flight Guidance Systems — alert prioritisation philosophy), EUROCONTROL HRS-HSP-005, ISA-18.2 (alarm management, adapted for ATC context), Endsley (1995) Situation Awareness model.
28.1 Situation Awareness Design Requirements
SpaceCom must support all three levels of Endsley's SA model for Persona A (ANSP duty manager):
| SA Level | Requirement | Implementation | Time target |
|---|---|---|---|
| Level 1 — Perception | Correct hazard information visible at a glance | Globe with urgency symbols; active events panel; risk level badges | ≤ 5 seconds from alert appearance — icon, colour, and position alone must convey object + risk level without reading text |
| Level 2 — Comprehension | Operator understands what the hazard means for their sector | Plain-language event cards; window range notation; FIR intersection list; data confidence indicators | ≤ 15 seconds to identify earliest FIR intersection window and whether it falls within the operator's sector |
| Level 3 — Projection | Operator can anticipate future state without simulation tools | Corridor Evolution widget (T+0/+2/+4h); Gantt timeline; space weather buffer callout | ≤ 30 seconds to determine whether the corridor is expanding or contracting using the Corridor Evolution widget |
These time targets are pass/fail criteria for the Phase 2 ANSP usability test (§28.7).
Globe visual information hierarchy (F7 — §60): The globe displays objects, corridors, hazard zones, FIR boundaries, and ADS-B routes simultaneously. Under operational stress, operators must not be required to search for the critical element — it must be pre-attentively distinct. The following hierarchy is mandatory and enforced by the rendering layer:
| Priority | Element | Visual treatment | Pre-attentive channel |
|---|---|---|---|
| 1 — Immediate | Active CRITICAL object | Flashing red octagon (2 Hz, reduced-motion: static + thick border) + label always visible | Motion + colour + shape |
| 2 — Urgent | Active HIGH object | Amber triangle, label visible at zoom ≥ 4 | Colour + shape |
| 3 — Monitor | Active MEDIUM object | Yellow circle, label on hover | Colour + shape |
| 4 — Context | Re-entry corridors (p05–p95) | Semi-transparent red fill, no label until hover | Colour + opacity |
| 5 — Awareness | FIR boundary overlay | Thin white lines, low opacity (30%) | Position |
| 6 — Background | ADS-B routes | Thin grey lines, visible only at zoom ≥ 5 | Position |
| 7 — Ambient | All other tracked objects | Small white dots, no label until hover | Position |
Rule: no element at priority N may be more visually prominent than an element at priority N-1. The rendering layer enforces draw order and applies opacity/size reduction to lower-priority elements when a priority-1 element is present. This is a non-negotiable safety requirement — a CesiumJS performance optimisation that re-orders draw calls or flattens layers must not override this hierarchy. A display in which an operator cannot reach SA Level 1 within ≤ 5 seconds of a CRITICAL alert is a design failure and requires a redesign cycle before shadow deployment; the numeric target exists so the usability test (§28.7) yields a pass/fail result rather than a subjective impression.
Level 3 SA support is specifically identified as a gap in pure corridor-display systems and is addressed by the Corridor Evolution widget (§6.8).
28.2 Mode Error Prevention
Mode confusion is the most common cause of automation-related incidents in aviation. SpaceCom has three operational modes (LIVE / REPLAY / SIMULATION) that must be unambiguously distinct at all times.
Mode error prevention mechanisms:
- Persistent mode indicator pill in top nav — never hidden, never small
- Mode-switch dialogue with explicit current-mode, target-mode, and consequence statements (§6.3)
- Future-preview temporal wash when the timeline scrubber is not at current time (§6.3)
- Optional disable_simulation_during_active_events org setting to block simulation entry during live incidents (§6.3)
- Audio alerts suppressed in SIMULATION and REPLAY modes
- All simulation-generated records have simulation_id IS NOT NULL — they cannot appear in operational views
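The last invariant is cheap to enforce defensively at the view layer as well as in SQL; a minimal sketch, assuming records are plain dicts with a simulation_id field:

```python
def operational_records(records):
    """Operational views must exclude simulation-generated rows (simulation_id set)."""
    return [r for r in records if r.get("simulation_id") is None]

rows = [
    {"id": 1, "simulation_id": None},      # live record — visible
    {"id": 2, "simulation_id": "sim-42"},  # simulation record — filtered out
]
print([r["id"] for r in operational_records(rows)])  # → [1]
```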
28.3 Alarm Management
Alarm management requirements follow the principle: every alarm should demand action, every required action should have an alarm, and no alarm should be generated that does not demand action.
Alarm rationalisation:
- CRITICAL: demands immediate action — full-screen banner + audio
- HIGH: demands timely action — persistent badge + acknowledgement required
- MEDIUM: informs — toast, auto-dismiss, logged
- LOW: awareness only — notification centre
Alarm management philosophy and KPIs (F1 — §60): SpaceCom adopts the EEMUA 191 / ISA-18.2 alarm management framework adapted for space/aviation operations. The following KPIs are measured quarterly by Persona D and included in the ESA compliance artefact package:
| EEMUA 191 KPI | Target | Definition |
|---|---|---|
| Alarm rate (steady-state) | < 1 alarm per 10 minutes per operator | Alarms requiring attention across all levels; excludes LOW awareness-only |
| Nuisance alarm rate | < 1% of all alarms | Alarms acknowledged as MONITORING within 30s without any other action — indicates no actionable information |
| Stale alarms | 0 CRITICAL unacknowledged > 10 min | Unacknowledged CRITICAL alerts older than 10 minutes; triggers supervisor notification (F8) |
| Alarm flood threshold | < 10 CRITICAL alarms within 10 minutes | Beyond this rate, an alert storm meta-alert fires and the batch-flood suppression protocol activates |
| Chattering alarms | 0 | Any alarm that fires and clears more than 3 times in 30 minutes without operator action |
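The chattering-alarm KPI above is mechanically checkable from alarm state transitions. A minimal TypeScript sketch; the event shape and function name are illustrative assumptions, not part of the spec:

```typescript
// Illustrative sketch: detect a chattering alarm per the KPI above (fires and
// clears more than 3 times in 30 minutes without operator action).
interface AlarmTransition {
  alarmId: string;
  kind: 'FIRE' | 'CLEAR' | 'OPERATOR_ACTION';
  at: number; // epoch ms
}

const WINDOW_MS = 30 * 60 * 1000;
const MAX_CYCLES = 3;

function isChattering(events: AlarmTransition[], alarmId: string, now: number): boolean {
  const recent = events.filter(
    (e) => e.alarmId === alarmId && now - e.at <= WINDOW_MS,
  );
  // Any operator action in the window means the alarm is being handled, not chattering.
  if (recent.some((e) => e.kind === 'OPERATOR_ACTION')) return false;
  // Count complete FIRE -> CLEAR cycles within the window.
  let cycles = 0;
  let fired = false;
  for (const e of recent.sort((a, b) => a.at - b.at)) {
    if (e.kind === 'FIRE') fired = true;
    else if (e.kind === 'CLEAR' && fired) {
      cycles += 1;
      fired = false;
    }
  }
  return cycles > MAX_CYCLES;
}
```

In production this check would run over alert event history; the sketch only demonstrates the cycle-counting rule.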
Alarm quality requirements:
- Nuisance alarm rate target: < 1 LOW alarm per 10 minutes per user in steady-state operations (logged and reviewed quarterly by Persona D)
- Alert deduplication: consecutive window-shrink events do not re-trigger CRITICAL if the threshold was not crossed
- 4-hour per-object CRITICAL rate limit prevents alarm flooding from a single event
- Alert storm meta-alert disambiguates between genuine multi-object events and system integrity issues (§6.6)
Batch TIP flood handling (F2 — §60): Space-Track releases TIP messages in batches — a single NOAA solar storm event can produce 50+ new TIP entries within a 10-minute window. Without mitigation, this generates 50 simultaneous CRITICAL alerts, constituting an alarm flood that exceeds EEMUA 191 KPIs and cognitively overwhelms the operator.
Protocol when ingest detects ≥ 5 new TIP messages within a 5-minute window:
- Batch gate activates: Individual CRITICAL banners suppressed for objects 2–N of the batch. Object 1 (highest-priority by predicted Pc or earliest window) receives the standard CRITICAL banner.
- Batch summary alert fires: A single HIGH-level "Batch TIP event: N objects with new TIP data" summary appears in the notification centre. The summary is actionable — it links to a pre-filtered catalog view showing all newly-TIP-flagged objects sorted by predicted re-entry window.
- Batch event logged: A `batch_tip_event` record is created in `alert_events` with `trigger_type = 'BATCH_TIP'`, `affected_objects = [NORAD ID list]`, and `batch_size = N`. This is distinct from individual object alert records.
- Per-object alerts queue: Individual CRITICAL alerts for objects 2–N are queued and delivered at a maximum rate of 1 per minute, only if the operator has not opened the batch summary view within 5 minutes of the batch gate activating. This prevents indefinite suppression while still preventing flood.
The threshold (≥ 5 TIP in 5 minutes) and maximum queue delivery rate (1/min) are configurable per-org via org-admin settings, subject to minimum values (≥ 3 and ≤ 2/min respectively) to prevent safety-defeating misconfiguration.
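The batch-gate trigger and object-1 selection described above can be sketched as follows. The message shape and field names are assumptions; the ranking follows the spec's "highest-priority by predicted Pc or earliest window":

```typescript
// Sketch of the batch TIP gate decision. Thresholds mirror the defaults above
// and are assumed to be org-configurable within the stated minimums.
interface TipMessage {
  noradId: number;
  receivedAt: number;   // epoch ms
  predictedPc: number;  // predicted probability, used for priority ranking
  windowStart: number;  // earliest predicted re-entry window, epoch ms
}

const BATCH_THRESHOLD = 5;             // >= 5 new TIPs...
const BATCH_WINDOW_MS = 5 * 60 * 1000; // ...within 5 minutes

interface BatchDecision {
  batchGate: boolean;
  immediateCritical: TipMessage | null; // object 1: standard CRITICAL banner
  queued: TipMessage[];                 // objects 2..N: rate-limited queue
}

function decideBatchGate(incoming: TipMessage[], now: number): BatchDecision {
  const fresh = incoming.filter((t) => now - t.receivedAt <= BATCH_WINDOW_MS);
  if (fresh.length < BATCH_THRESHOLD) {
    return { batchGate: false, immediateCritical: null, queued: [] };
  }
  // Highest priority: largest predicted Pc, ties broken by earliest window.
  const ranked = [...fresh].sort(
    (a, b) => b.predictedPc - a.predictedPc || a.windowStart - b.windowStart,
  );
  return { batchGate: true, immediateCritical: ranked[0], queued: ranked.slice(1) };
}
```

The 1/min queue delivery and the 5-minute summary-view check would sit in the delivery layer, not in this gate decision.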
Audio alarm specification (F11 — §60):
- Two-tone ascending chime: 261 Hz (C4) followed by 392 Hz (G4), each 250ms, 20ms fade-in/out (not siren — ops rooms have sirens from other systems already)
- Conforms to EUROCAE ED-26 / RTCA DO-256 advisory alert audio guidelines (advisory category — attention-getting without startle)
- Plays once on first presentation; does not loop automatically
- Re-alert on missed acknowledgement: If a CRITICAL alert remains unacknowledged for 3 minutes, the chime replays once. Replays at most once — the second chime is the final audio prompt. Further escalation is via supervisor notification (F8), not repeated audio (which would cause habituation)
- Stops on acknowledgement — not on banner dismiss; banner dismiss without acknowledgement is not permitted for CRITICAL severity
- Per-device volume control via OS; per-session software mute (persists for session only; resets on next login to prevent operators permanently muting safety alerts)
- Enabled by org-level "ops room mode" setting (default: off); must be explicitly enabled by org admin — not auto-enabled to prevent unexpected audio in environments where audio is not appropriate
- Volume floor in ops room mode: minimum 40% of device maximum; operators cannot mute below this floor when ops room mode is active (configurable per-org, minimum 30%)
Startle-response mitigation — sudden full-screen CRITICAL banners cause ~5 seconds of degraded cognitive performance in research studies. The following rules prevent cold-start startle:
- Progressive escalation mandatory: A CRITICAL alert may only be presented full-screen if the same object has already been in HIGH state for ≥ 1 minute during the current session. If the alert arrives cold (no prior HIGH state), the system must hold the alert in HIGH presentation for 30 seconds before upgrading to CRITICAL full-screen. Exception: `impact_time_minutes < 30` bypasses the 30s hold.
- Audio precedes visual by 500ms: The two-tone chime fires 500ms before the full-screen banner renders. This primes the operator's attentional system and mitigates the startle peak.
- Banner is overlay, not replacement: The CRITICAL full-screen banner is a dimmed overlay (backdrop `rgba(0,0,0,0.72)`) rendered above the corridor map; the map, aircraft positions, and FIR boundaries remain visible beneath it. The banner must never replace the map render, as spatial context is required for the decision the operator is being asked to make.
Cross-hat alert override matrix: The Human Factors, Safety, and Regulatory hats jointly approve the following override rule set:
- `impact_time_minutes < 30` or equivalent imminent-impact state: bypass progressive delay; immediate full-screen CRITICAL permitted
- Data-integrity compromise (`HMAC_INVALID`, corrupted prediction provenance, or equivalent): immediate full-screen CRITICAL permitted
- Degraded-data or connectivity-only events without direct hazard change: progressive escalation remains mandatory
- All immediate-bypass cases require explicit rationale in the alert type definition and traceability into the safety case and hazard log
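The progressive-escalation rule and the override matrix combine into a single presentation decision. A hedged sketch; `criticalPresentation` and the context shape are illustrative names, not spec identifiers:

```typescript
// Sketch combining the startle-mitigation hold with the cross-hat override matrix.
type Presentation =
  | { mode: 'IMMEDIATE_CRITICAL' }             // bypass cases: full-screen now
  | { mode: 'HOLD_AT_HIGH'; holdSeconds: 30 }  // cold arrival: 30 s HIGH hold first
  | { mode: 'UPGRADE_TO_CRITICAL' };           // prior HIGH dwell satisfied

interface AlertContext {
  impactTimeMinutes: number;        // time to predicted impact
  dataIntegrityCompromise: boolean; // e.g. HMAC_INVALID, corrupted provenance
  priorHighDwellSeconds: number;    // time already spent in HIGH this session
}

function criticalPresentation(ctx: AlertContext): Presentation {
  // Override matrix: imminent impact or integrity compromise bypasses the delay.
  if (ctx.impactTimeMinutes < 30 || ctx.dataIntegrityCompromise) {
    return { mode: 'IMMEDIATE_CRITICAL' };
  }
  // Progressive escalation: >= 1 minute of prior HIGH dwell permits full-screen.
  if (ctx.priorHighDwellSeconds >= 60) {
    return { mode: 'UPGRADE_TO_CRITICAL' };
  }
  // Cold arrival: hold at HIGH presentation for 30 s before upgrading.
  return { mode: 'HOLD_AT_HIGH', holdSeconds: 30 };
}
```

Keeping this as one pure function makes the bypass rationale directly traceable into the safety case and testable in isolation.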
CRITICAL alert accessibility requirements (F2): When the CRITICAL alert banner renders:
- `focus()` is called on the alert dialog element programmatically
- `role="alertdialog"` and `aria-modal="true"` on the banner container
- `aria-labelledby` points to the alert title; `aria-describedby` points to the conjunction summary text
- `aria-hidden="true"` set on the map container while the alertdialog is active; removed on dismiss
- `aria-live="assertive"` region announces the alert title immediately on render (separate from the dialog, for screen readers that do not expose the `alertdialog` role automatically)
- Visible text status indicator "⚠ Audio alert active" accompanies the audio tone for deaf or hard-of-hearing operators (audio-only notification is not sufficient as a sole channel)
- All alert action buttons reachable by `Tab` from within the dialog; `Escape` closes only if the alert has a non-CRITICAL severity; CRITICAL requires explicit category selection before dismiss
Alarm rationalisation procedure — alarm systems degrade over time through threshold drift and alert-to-alert desensitisation. The following procedure is mandatory:
- Persona D (Operations Analyst) reviews alert event logs quarterly
- Any alarm type that fired ≥ 5 times in a 90-day period and was acknowledged as `MONITORING` ≥ 90% of the time is a nuisance alarm candidate — threshold review required before next quarter
- Any alarm threshold change must be recorded in `alarm_threshold_audit` (object, old threshold, new threshold, reviewer, rationale, date); immutable append-only
- ANSP customers may request threshold adjustments for their own organisation via the org-admin settings; changes take effect after a mandatory 7-day confirmation period and are logged in `alarm_threshold_audit`
- Alert categories that have never triggered a `NOTAM_ISSUED` or `ESCALATING` acknowledgement in 12 months are escalated to Persona D for review of whether the alert should be demoted one severity level
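The nuisance-candidate rule (≥ 5 firings with ≥ 90% MONITORING acknowledgements in the review period) reduces to a small aggregation. The record shape is illustrative, and the 90-day windowing is assumed to happen upstream in the query that produces the input:

```typescript
// Sketch of the quarterly nuisance-alarm candidate check over acknowledgement records.
interface AlertAck {
  alarmType: string;
  ackCategory: 'NOTAM_ISSUED' | 'COORDINATING' | 'MONITORING' | 'ESCALATING' | 'OUTSIDE_MY_SECTOR' | 'OTHER';
}

function nuisanceCandidates(acks: AlertAck[]): string[] {
  const byType = new Map<string, { total: number; monitoring: number }>();
  for (const a of acks) {
    const s = byType.get(a.alarmType) ?? { total: 0, monitoring: 0 };
    s.total += 1;
    if (a.ackCategory === 'MONITORING') s.monitoring += 1;
    byType.set(a.alarmType, s);
  }
  const out: string[] = [];
  for (const [type, s] of byType) {
    // >= 5 firings in the period, >= 90% acknowledged as MONITORING
    if (s.total >= 5 && s.monitoring / s.total >= 0.9) out.push(type);
  }
  return out;
}
```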
Habituation countermeasures — repeated identical stimuli produce reduced response (habituation). The following design rules counteract alarm habituation:
- CRITICAL audio uses two alternating tones (261 Hz and 392 Hz, ~0.25s each); the alternation pattern is varied pseudo-randomly within the specification range so the exact sound is never identical across sessions
- CRITICAL banner background colour cycles through two dark-amber shades (`#7B4000` / `#6B3400`) at 1 Hz — subtle variation without strobing, enough to maintain arousal without inducing distraction
- Per-object CRITICAL rate limit (4-hour window) prevents habituation to a single persistent event
- `alert_events` habituation report: any operator who has acknowledged ≥ 20 alerts of the same type in a 30-day window without a single `ESCALATING` or `NOTAM_ISSUED` response is flagged for supervisor review — this indicates potential habituation or threshold misconfiguration
Reduced-motion support (F10): WCAG 2.3.3 (Animation from Interactions — Level AAA) and WCAG 2.3.1 (Three Flashes or Below Threshold — Level A) apply. The 1 Hz CRITICAL banner colour cycle and any animated corridor rendering must respect the OS-level `prefers-reduced-motion: reduce` media query:
/* Default: animated */
.critical-banner { animation: amber-cycle 1s step-end infinite; }
/* Reduced motion: static high-contrast state */
@media (prefers-reduced-motion: reduce) {
.critical-banner {
animation: none;
background-color: #7B4000;
border: 4px solid #FFD580; /* thick static border as redundant indicator */
}
}
Fatigue and cognitive load monitoring (F8 — §60): Operators on long shifts exhibit reduced alertness. The following server-side rules trigger supervisor notifications without requiring operator interaction:
| Condition | Trigger | Supervisor notification |
|---|---|---|
| Unacknowledged CRITICAL alert | > 10 minutes without acknowledgement | Push + email to org supervisor role: "CRITICAL alert unacknowledged for 10 minutes — [object, time]" |
| Stale HIGH alert | > 30 minutes without acknowledgement | Push to org supervisor: "HIGH alert unacknowledged for 30 minutes" |
| Long session without interaction | Logged-in operator: no UI interaction for 45 min during active event | Push to operator + supervisor: "Possible inactivity during active event — please verify" |
| Shift duration exceeded | Session age > `org.shift_duration_hours` (default 8h) | Non-blocking reminder to operator: "Your shift duration setting is 8 hours — consider handover" |
Supervisor notifications are sent to users with `org_admin` or `supervisor` role. If no supervisor role is configured for the org, the notification escalates to SpaceCom internal ops via the existing PagerDuty route with `severity: warning`. All supervisor notifications are logged to `security_logs` with `event_type = SUPERVISOR_NOTIFICATION`.
For CesiumJS corridor animations: check window.matchMedia('(prefers-reduced-motion: reduce)').matches on mount; if true, disable trajectory particle animation (Mode C) and set corridor opacity to a static value instead of pulsing. The preference is re-checked on change via addEventListener('change', ...) without requiring a page reload.
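The mount-time check and change listener described above can be split into a pure settings function plus standard `matchMedia` wiring. The settings shape and the `applyToViewer` helper are hypothetical; only the DOM calls are standard API:

```typescript
// Sketch of reduced-motion handling for the corridor renderer.
interface MotionSettings {
  trajectoryParticles: boolean;        // Mode C particle animation on/off
  corridorOpacity: 'pulsing' | 'static';
  bannerAnimation: boolean;            // 1 Hz CRITICAL banner colour cycle
}

function motionSettings(prefersReduced: boolean): MotionSettings {
  return prefersReduced
    ? { trajectoryParticles: false, corridorOpacity: 'static', bannerAnimation: false }
    : { trajectoryParticles: true, corridorOpacity: 'pulsing', bannerAnimation: true };
}

// Browser wiring (illustrative; assumes a DOM environment and a hypothetical
// applyToViewer helper that pushes settings into the CesiumJS scene):
//
//   const mql = window.matchMedia('(prefers-reduced-motion: reduce)');
//   applyToViewer(motionSettings(mql.matches));                       // on mount
//   mql.addEventListener('change', (e) => applyToViewer(motionSettings(e.matches)));
```

Keeping the mapping pure means the preference change can be re-applied live, without a page reload, as the spec requires.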
28.4 Probabilistic Communication to Non-Specialist Operators
Re-entry timing predictions are inherently probabilistic. Aviation operations personnel (Persona A/C) are trained in operational procedures, not orbital mechanics. The following design rules ensure probabilistic information is communicated without creating false precision or misinterpretation:
- No `±` notation for Persona A/C — use explicit window ranges ("08h–20h from now") with a "most likely" label; all absolute times rendered as `HH:MMZ` (e.g., `14:00Z`) or `DD MMM YYYY HH:MMZ` (e.g., `22 MAR 2026 14:00Z`) per ICAO Doc 8400 UTC-suffix convention; the `Z` suffix is not a tooltip — it is always rendered inline
- Space weather impact as operational buffer, not percentage — "Add ≥ 2h beyond 95th percentile", not "+18% wider uncertainty"
- Mode C particles require a mandatory first-use overlay explaining that particles are not equiprobable; weighted opacity down-weights outliers (§6.4)
- "What does this mean?" expandable panel on Event Detail for Persona C (incident commanders) explaining the window in operational terms
- Data confidence badges contextualise all physical property estimates — an `unknown` source triggers a warning callout above the prediction panel
- Tail risk annotation (F10): The p5–p95 window is the primary display, but a 10% probability of re-entry outside that range is operationally significant. Below the primary window, display: "Extreme case (1% probability in each tail beyond this range): `p01_reentry_time`Z – `p99_reentry_time`Z" — labelled clearly as a tail annotation, not the primary window. This annotation is shown only when `p99_reentry_time - p01_reentry_time > 1.5 × (p95_reentry_time - p05_reentry_time)` (i.e., the tails are materially wider than the primary window). Also included as a footnote in NOTAM drafts when this condition is met.
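The tail-annotation visibility condition can be captured directly. The field names are illustrative shorthand for the p01/p05/p95/p99 re-entry times:

```typescript
// Sketch of the tail-annotation rule: show the p01-p99 extreme case only when
// the tails are materially wider than the p5-p95 primary window.
interface ReentryWindow {
  p01: number; // epoch ms
  p05: number;
  p95: number;
  p99: number;
}

const TAIL_WIDTH_FACTOR = 1.5;

function showTailAnnotation(w: ReentryWindow): boolean {
  return w.p99 - w.p01 > TAIL_WIDTH_FACTOR * (w.p95 - w.p05);
}
```

The same predicate would gate the NOTAM-draft footnote, so UI and NOTAM behaviour cannot drift apart.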
28.5 Error Recovery and Irreversible Actions
| Action | Recovery mechanism |
|---|---|
| Analyst runs prediction with wrong parameters | superseded_by FK on reentry_predictions — marks old run as superseded; UI shows warning banner; original record preserved |
| Controller accidentally acknowledges CRITICAL alert | Two-step confirmation; structured category selection (see below) + optional free text; append-only audit log preserves full record |
| Analyst shares link to superseded prediction | ⚠ Superseded — see [newer run] banner appears on the superseded prediction page for any viewer |
| Operator enters SIMULATION during live incident | disable_simulation_during_active_events org setting blocks mode switch while unacknowledged CRITICAL/HIGH alerts exist |
Structured acknowledgement categories — replaces 10-character text minimum. Research consistently shows forced-text minimums under time pressure produce reflexive compliance (1234567890, aaaaaaaaaa) rather than genuine engagement, creating audit noise rather than evidence:
export const ACKNOWLEDGEMENT_CATEGORIES = [
{ value: 'NOTAM_ISSUED', label: 'NOTAM issued or requested' },
{ value: 'COORDINATING', label: 'Coordinating with adjacent FIR' },
{ value: 'MONITORING', label: 'Monitoring — no action required yet' },
{ value: 'ESCALATING', label: 'Escalating to incident command' },
{ value: 'OUTSIDE_MY_SECTOR', label: 'Outside my sector — passing to responsible unit' },
{ value: 'OTHER', label: 'Other (free text required below)' },
] as const;
// Category selection is mandatory. Free text is optional except when value = 'OTHER'.
// alert_events.action_taken stores the category code; action_notes stores optional text.
Acknowledgement form accessibility requirements (F3):
- Each category option rendered as `<input type="radio">` with an explicit `<label for="...">` — no ARIA substitutes where native HTML suffices
- The radio group wrapped in `<fieldset>` with `<legend>Select acknowledgement category</legend>`
- The keyboard shortcut `Alt+A` documented via `aria-keyshortcuts="Alt+A"` on the alert panel trigger element
- A visible keyboard shortcut legend displayed within the acknowledgement dialog: "Keyboard: Alt+A to focus · Tab to change category · Enter to submit"
- Free-text field (`OTHER`) labelled `<label for="action_notes">Describe action taken (required)</label>`; `aria-required="true"` when OTHER is selected
- On submit, a screen-reader-visible confirmation announced via `aria-live="polite"`: "Acknowledgement recorded: [category label]"
Keyboard-completable acknowledgement flow — CRITICAL acknowledgement must be completable in ≤ 3 keyboard interactions from any application state (operators frequently work with one hand on radio PTT):
1. `Alt+A` → focus most-recent active CRITICAL alert in alert panel
2. `Enter` → open acknowledgement dialogue (category pre-selected: MONITORING)
3. `Enter` → submit (`Tab` to change category; free-text field skipped unless OTHER selected)
This keyboard path must be documented in the operator quick-reference card and tested in the Phase 2 usability study against the ≤ 3 interaction target.
28.5a Shift Handover
Shift handover is a high-risk transition point: situational awareness held by one operator must be reliably transferred to a second operator under time pressure. Aviation safety events have repeatedly involved information loss at handover. SpaceCom must not become a contributing factor.
Handover screen (Persona A/C): Dedicated /handover view within Secondary Display Mode (§6.20). Accessible from main nav; also triggered automatically when an operator session exceeds org.shift_duration_hours (configurable; default: 8h).
The handover screen shows:
- All active CRITICAL and HIGH alerts with current status and acknowledgement history
- Any unresolved multi-ANSP coordination threads (§6.9)
- Recent window-change events (last 2h) in reverse chronological order
- Free-text handover notes field (plain text, ≤ 2,000 characters)
- "Accept handover" button — records handover event with both operator IDs and timestamp
Handover record schema:
CREATE TABLE shift_handovers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organisations(id),
outgoing_user UUID NOT NULL REFERENCES users(id),
incoming_user UUID NOT NULL REFERENCES users(id),
handed_over_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
notes TEXT, -- operator free text, ≤ 2000 chars
active_alerts JSONB NOT NULL DEFAULT '[]', -- snapshot of alert IDs + status at handover
open_coord_threads JSONB NOT NULL DEFAULT '[]' -- snapshot of open coordination thread IDs
);
CREATE INDEX ON shift_handovers (org_id, handed_over_at DESC);
Handover integrity rules:
- `incoming_user` must be a different `users.id` from `outgoing_user`
- `active_alerts` and `open_coord_threads` are system-populated snapshots — the outgoing operator cannot edit them; only `notes` is free-form
- Handover record is immutable after creation; retained for 7 years (aviation safety audit basis)
- If a CRITICAL alert fires within 5 minutes of a handover record being created, the alert email/push notification includes a "⚠ Alert during handover window" flag so the incoming operator and their supervisor are aware
Structured SA transfer prompts (F4 — §60): The handover notes field (free text) is insufficient for reliable SA transfer under time pressure. The handover screen must also include a structured prompt section that the outgoing operator completes — mapping to Endsley's three SA levels:
| SA Level | Structured prompt | Type |
|---|---|---|
| Level 1 — Perception | "Active objects of concern right now:" | Multi-select from current TIP-flagged objects |
| Level 2 — Comprehension | "My assessment of the most critical object:" | Dropdown: Within sector / Adjacent sector / Low confidence / Not a concern yet + optional text |
| Level 3 — Projection | "Expected development in next 2 hours:" | Dropdown: Window narrowing / Window stable / Window widening / Awaiting new prediction + optional text |
| Decision context | "Actions I have taken or initiated:" | Multi-select from ACKNOWLEDGEMENT_CATEGORIES + free text |
| Handover flags | "Incoming operator should know:" | Checkboxes: Space weather active, Pending coordination thread, Degraded data, Unusual pattern |
The structured prompts are optional (the outgoing operator cannot be forced to complete them under time pressure) but their completion status is recorded. If the outgoing operator submits handover without completing any structured prompts, a non-blocking warning appears: "Structured SA transfer not completed — incoming operator will rely on notes only." Completion rate is reported quarterly as a human factors KPI.
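A possible shape for the structured SA transfer payload, with the non-blocking warning rule applied. All type and field names are assumptions, mapped to the Endsley-level table above:

```typescript
// Illustrative structured SA transfer payload (shapes assumed, not spec).
interface SaTransfer {
  level1ObjectsOfConcern: number[]; // NORAD IDs, multi-select from TIP-flagged objects
  level2Assessment?: 'WITHIN_SECTOR' | 'ADJACENT_SECTOR' | 'LOW_CONFIDENCE' | 'NOT_A_CONCERN';
  level3Projection?: 'NARROWING' | 'STABLE' | 'WIDENING' | 'AWAITING_PREDICTION';
  actionsTaken: string[];           // ACKNOWLEDGEMENT_CATEGORIES values + free text
  flags: string[];                  // e.g. 'SPACE_WEATHER_ACTIVE', 'DEGRADED_DATA'
}

// Non-blocking warning: fires only when no structured prompt was completed at all.
function saTransferWarning(t: SaTransfer): string | null {
  const anyCompleted =
    t.level1ObjectsOfConcern.length > 0 ||
    t.level2Assessment !== undefined ||
    t.level3Projection !== undefined ||
    t.actionsTaken.length > 0 ||
    t.flags.length > 0;
  return anyCompleted
    ? null
    : 'Structured SA transfer not completed — incoming operator will rely on notes only.';
}
```

Completion status (which prompts were filled) would be stored alongside the handover record to feed the quarterly human factors KPI.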
Session timeout accessibility (F8): WCAG 2.2.1 (Timing Adjustable — Level A) requires users be warned before session expiry and given the opportunity to extend. For operators completing a handover (which may take longer for users with cognitive or motor impairments):
- At T−2 minutes before session expiry: an `aria-live="polite"` announcement fires and a non-modal warning dialog appears: "Your session will expire in 2 minutes. [Extend session] [Save and log out]"
- If the `/handover` view is active when the warning fires, the session is automatically extended by 30 minutes without user interaction (silently); the warning dialog is suppressed; the extension is logged in `security_logs` with `event_type = SESSION_AUTO_EXTENDED_HANDOVER`
- The silent auto-extension only applies once per session to prevent indefinite extension; after the 30-minute extension the standard warning dialog fires normally
- Session extension endpoint: `POST /api/v1/auth/extend-session` — returns a new expiry timestamp; requires valid current session cookie
28.6 Cognitive Load Reduction
Event Detail Duty Manager View: Decluttered large-text view for Persona A showing only window, FIRs, risk level, and three action buttons. Collapses all technical detail. Designed for ops room use at a secondary glance distance. (§6.8)
Decision Prompts accordion (formerly "Response Options"): Contextualised checklist of possible ANSP actions. Not automated — for consideration only. Checkbox states create a lightweight action record without requiring Persona A to open a separate logging system. (§6.8)
The feature is renamed from "Response Options" to "Decision Prompts" throughout UI text, documentation, and API field names. "Options" implies equivalence; "Prompts" correctly signals that the list is an aide-mémoire, not a prescribed workflow.
Legal treatment of Decision Prompts: Every Decision Prompts accordion must display the following non-waivable disclaimer in 11px grey text immediately below the accordion header:
"Decision Prompts are non-prescriptive aide-mémoire items generated from common ANSP practice. They do not constitute operational procedures. All decisions remain with the duty controller in accordance with applicable air traffic regulations and your organisation's established procedures."
This disclaimer is: (a) hard-coded, not configurable; (b) included in the printed/exported Event Detail report; (c) present in the API response for Decision Prompts payloads ("legal_notice" field). Rationale: SpaceCom is decision support, not decision authority. Without an explicit disclaimer, a regulator or court could interpret a checked Decision Prompt item as evidence of a prescribed procedure not followed.
Decision prompt content template (F6 — §60): Each Decision Prompt entry must provide four fields to be actionable under operational stress:
interface DecisionPrompt {
id: string;
risk_summary: string; // Plain-language risk in ≤ 20 words. No jargon. No Pc values.
action_options: string[]; // Specific named actions available to this operator role
time_available: string; // "Decision window: X hours before earliest FIR intersection"
consequence_note?: string; // Optional: consequence of inaction (shown only if significant)
}
// Example for a re-entry/FIR intersection:
const examplePrompt: DecisionPrompt = {
id: 'reentry_fir_intersection',
risk_summary: 'Object expected to re-enter atmosphere over London FIR within 8–14 hours.',
action_options: [
'Issue precautionary NOTAM for affected flight levels',
'Coordinate with adjacent FIR controllers (Paris, Amsterdam)',
'Notify airline operations centres in affected region',
'Continue monitoring — no action required yet',
],
time_available: 'Decision window: ~6 hours before earliest FIR intersection (08:00Z)',
consequence_note: 'If window narrows below 4 hours without NOTAM, affected departures may require last-minute rerouting.',
};
Decision Prompts are pre-authored for each alert scenario type in docs/decision-prompts/ and reviewed annually by a subject-matter expert from an ANSP partner. They are not auto-generated by the system. New prompt types require approval from both the SpaceCom safety case owner and at least one ANSP reviewer.
Legal sufficiency note (F5): The in-UI disclaimer is a reinforcing reminder only. Under UCTA 1977 and the EU Unfair Contract Terms Directive, liability limitation requires that the customer was given a reasonable opportunity to discover and understand the term at contract formation. The substantive liability limitation clause (consequential loss excluded; aggregate cap = 12 months fees paid) must appear in the executed Master Services Agreement (§24.2). The UI disclaimer does not substitute for executed contractual terms.
Decision Prompts accessibility (F9): The accordion must implement the WAI-ARIA Accordion design pattern:
- Accordion header: `<button aria-expanded="true|false" aria-controls="panel-{id}">` (a native button needs no redundant `role="button"`); `Enter` and `Space` toggle open/close
- Panel: `<div id="panel-{id}" role="region" aria-labelledby="header-{id}">`
- Arrow keys navigate between accordion items when focus is on a header button
- Each prompt item: `<input type="checkbox" id="prompt-{n}">` with `<label for="prompt-{n}">` — native checkbox, not an ARIA role substitute (native checkboxes expose checked state without `aria-checked`)
- On checkbox state change: `aria-live="polite"` region announces "Action recorded: [prompt text]"
- `aria-keyshortcuts` on the accordion container documents any applicable shortcuts
Attention management — operational environments have high ambient interruption rates. SpaceCom must not become an additional source of cognitive fragmentation:
| State | Interaction rate limit | Rationale |
|---|---|---|
| Steady-state (no active CRITICAL/HIGH) | ≤ 1 unsolicited notification per 10 minutes per user | Preserve peripheral attentional channel for ATC primary tasks |
| Active event (≥ 1 unacknowledged CRITICAL) | ≤ 1 update notification per 60 seconds for the same event | Prevent update flooding during the critical decision window |
| Critical flow (user actively in acknowledgement or handover screen) | Zero unsolicited notifications | Do not interrupt the operator while they are completing a safety-critical task |
Critical flow state is entered when: acknowledgement dialog is open, or /handover view is active. It is exited on dialog close or handover acceptance. During critical flow, all queued notifications are held and delivered as a batch summary immediately on exit.
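The hold-and-batch behaviour in critical flow can be sketched as a small gate. Class and method names are illustrative, and the batch summary string is a placeholder:

```typescript
// Sketch of the critical-flow notification gate: unsolicited notifications are
// held while the operator is in a safety-critical task, then delivered as one
// batch summary on exit.
class NotificationGate {
  private inCriticalFlow = false;
  private held: string[] = [];

  // Entered when the acknowledgement dialog opens or /handover becomes active.
  enterCriticalFlow(): void {
    this.inCriticalFlow = true;
  }

  // Returns the notifications to deliver now (empty while in critical flow).
  notify(message: string): string[] {
    if (this.inCriticalFlow) {
      this.held.push(message);
      return [];
    }
    return [message];
  }

  // Exited on dialog close or handover acceptance; held items are delivered
  // together, prefixed with a batch summary line.
  exitCriticalFlow(): string[] {
    this.inCriticalFlow = false;
    const batch = this.held;
    this.held = [];
    return batch.length > 0
      ? [`${batch.length} notifications held during critical flow`, ...batch]
      : [];
  }
}
```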
Secondary Display Mode: Chrome-free full-screen operational view optimised for secondary monitor in an ops room alongside existing ATC displays. (§6.20)
First-time user onboarding: New organisations with no configured FIRs see a three-card guided setup rather than an empty globe. (§6.18)
28.7 HF Validation Approach
HF design cannot be fully validated by automated tests alone. The following validation activities are planned:
| Activity | Phase | Method |
|---|---|---|
| Cognitive walkthrough of CRITICAL alert handling | Phase 1 | Developer walk-through against §28.3 alarm management requirements |
| ANSP user testing — Persona A operational scenario | Phase 2 | Structured usability test: duty manager handles a simulated TIP event; time-to-decision and error rate measured |
| Multi-ANSP coordination scenario | Phase 2 | Two-ANSP test with shared event; assess whether coordination panel reduces perceived workload vs. out-of-band comms only |
| Mode confusion scenario | Phase 2 | Participants switch between LIVE and SIMULATION; measure rate of mode errors without and with the temporal wash |
| Alarm fatigue assessment | Phase 3 | Review of LOW alarm rate over a 30-day shadow deployment; adjust thresholds if nuisance rate > 1/10 min/user |
| Final HF review by qualified human factors specialist | Phase 3 | Required for TRL 6 demonstration and ECSS-E-ST-10-12C compliance evidence |
Probabilistic comprehension test items — the Phase 2 usability study must include the following scripted comprehension items delivered verbally to participants after they view a TIP event detail screen. Items are designed to distinguish genuine probabilistic comprehension from confidence masking:
| Item | Correct answer | Common wrong answer (detects) |
|---|---|---|
| "What does the re-entry window of 08h–20h from now mean — does it mean the object will come down in the middle of that period?" | No — most likely landing is in the modal estimate shown, but the object could land anywhere in the window | "Yes, probably in the middle" — detects false precision from window endpoints |
| "If SpaceCom shows Impact Probability 0.03, should you start evacuating the FIR corridor?" | Not automatically — impact probability is one input; operational decision depends on assets at risk, corridor extent, and existing procedures | "Yes, 0.03 is high for space" — detects calibration gap between space and aviation risk thresholds |
| "The window has just widened by 4 hours. Does that mean SpaceCom detected new debris or a new threat?" | No — window widening usually means updated atmospheric data or revised mass/BC estimate increased uncertainty | "Yes, something new happened" — detects misattribution of uncertainty update to new threat |
| "SpaceCom shows 'Data confidence: TLE age 4 days'. Does that mean the prediction is wrong?" | No — it means the prediction has higher positional uncertainty; the window should be treated as wider in practice | "Yes, ignore it" — detects over-application of data quality warning |
Participants who answer ≥ 2 items incorrectly indicate a comprehension design failure requiring UI revision before shadow deployment. Target: ≥ 80% correct on each item across the test cohort.
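The pass/fail rules above reduce to two small checks, one per participant and one per cohort. The shapes are illustrative:

```typescript
// Sketch of the Phase 2 comprehension scoring: a participant answering >= 2
// items incorrectly indicates a comprehension design failure; the cohort target
// is >= 80% correct on each item.
type ItemResults = boolean[]; // per-item correctness for one participant

function participantFails(results: ItemResults): boolean {
  return results.filter((ok) => !ok).length >= 2;
}

function cohortMeetsTarget(cohort: ItemResults[], itemCount = 4): boolean {
  for (let i = 0; i < itemCount; i++) {
    const correct = cohort.filter((r) => r[i]).length;
    if (correct / cohort.length < 0.8) return false;
  }
  return true;
}
```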
28.8 Degraded-Data Human Factors
Operators must be able to distinguish "SpaceCom is working normally" from "SpaceCom is working but with reduced fidelity" from "SpaceCom is in a failure state" — three states that require fundamentally different responses. Undifferentiated degradation presentation causes two failure modes: operators continuing to act on stale data as if it were fresh (over-trust), or operators stopping using the system entirely during a tolerable degradation (under-trust).
Visual degradation language:
| State | Indicator | Operator action required |
|---|---|---|
| All data fresh | Green status pill in system tray (§6.6) | None |
| TLE age ≥ 48h for any active CRITICAL/HIGH object | Amber "⚠ TLE stale" badge on affected event card | Widen mental model of corridor uncertainty; consult space domain Persona B/D |
| EOP data stale (>7 days) | Amber system badge + `eop_stale` exposed in `GET /readyz` | Frame transform accuracy reduced; no action required unless close-approach timing is critical |
| Space weather stale (>2h for active event) | Amber badge on Kp readout in Event Detail | Kp-dependent atmospheric drag estimates are less reliable; apply additional margin |
| AIRAC data >35 days old | Red "⚠ AIRAC expired" badge on any FIR overlay | FIR boundaries may have changed; do not issue NOTAM text based on SpaceCom FIR names without manual verification |
| Backend unreachable | Full-screen "SpaceCom Offline" modal | No predictions available; fall back to organisational offline procedures |
Graded response rules:
- A single stale data source never suppresses the main operational view. Operators must be able to see the event and make decisions; stale data badges are contextual, not blocking.
- Multiple simultaneous amber badges (≥ 3) trigger a consolidated "Multiple data sources degraded" yellow banner at top of screen — prevents badge blindness when individual badges are numerous.
- The `GET /readyz` endpoint (§26.5) exposes all staleness states as machine-readable flags. ANSPs may configure their own monitoring to receive `readyz` alerts via webhook.
- Degraded-data states are recorded in the `system_health_events` table and included in the quarterly operational report to Persona D.
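The ≥ 3 amber consolidation rule is a one-line check; the badge shape and banner text handling are illustrative:

```typescript
// Sketch of the badge-consolidation rule: three or more simultaneous amber
// badges collapse into one consolidated banner to prevent badge blindness.
type Severity = 'GREEN' | 'AMBER' | 'RED';

interface StatusBadge {
  source: string; // e.g. 'TLE', 'EOP', 'SPACE_WEATHER', 'AIRAC'
  severity: Severity;
}

function degradedBanner(badges: StatusBadge[]): string | null {
  const amber = badges.filter((b) => b.severity === 'AMBER');
  return amber.length >= 3 ? 'Multiple data sources degraded' : null;
}
```

Individual badges stay visible alongside the consolidated banner, so the contextual (non-blocking) rule above is preserved.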
Operator quick-reference language for degraded states — the operator quick-reference card must include a "SpaceCom status indicators" section using the exact badge text from the UI (copy-match required). Operators must not need to translate between UI text and documentation text.
28.9 Operator Training and Competency Specification (F10 — §60)
SpaceCom is a safety-critical decision support system. ANSP customers deploying it in operational environments will be asked by their safety regulators what training operators received. This section defines the minimum training specification. Individual ANSPs may add requirements; they may not remove them.
Minimum initial training programme:
| Module | Delivery | Duration | Completion criteria |
|---|---|---|---|
| M1 — System overview and safety philosophy | Instructor-led or self-paced e-learning | 2 hours | Quiz score ≥ 80% |
| M2 — Operational interface walkthrough | Instructor-led hands-on with staging environment | 3 hours | Complete reference scenario (see below) |
| M3 — Alert acknowledgement workflow | Scenario-based with role-play | 1 hour | Keyboard-completable ack in ≤ 3 interactions |
| M4 — NOTAM drafting and disclaimer | Instructor-led with sample NOTAMs | 1 hour | Produce a compliant NOTAM draft from a scenario |
| M5 — Degraded mode response | Scenario-based | 30 min | Correctly identify each degraded state + action |
| M6 — Shift handover procedure | Pair exercise | 30 min | Complete a structured handover with SA prompts |
Total minimum initial training: 8 hours. Training is completed before any operational use. Simulator/staging environment only — no training on production data.
Reference scenario (M2): A CRITICAL re-entry alert fires for an object with a 6–14 hour window intersecting two FIRs. The trainee must: acknowledge the alert, identify the FIR intersection, assess the corridor evolution, draft a NOTAM, and complete a handover to a colleague — all within 20 minutes. This scenario is standardised in docs/training/reference-scenario-01.md.
Recurrency requirements:
- Annual refresher: 2 hours, covering any UI changes in the preceding 12 months + repeat of M3 scenario
- After any incident where SpaceCom was a contributing factor: mandatory debrief + targeted re-training before return to operational use
- After a major version upgrade (breaking UI changes): M2 + affected modules before using upgraded system operationally
Competency record model:
```sql
CREATE TABLE operator_training_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id INTEGER NOT NULL,  -- logically references users(id); deliberately not a hard FK,
                               -- so records survive account deletion pending pseudonymisation
    module_id TEXT NOT NULL,   -- 'M1'..'M6' or custom ANSP module codes
    completed_at TIMESTAMPTZ NOT NULL,
    score INTEGER,             -- quiz score where applicable; NULL for practical
    instructor_id INTEGER REFERENCES users(id) ON DELETE SET NULL,
    training_env TEXT NOT NULL DEFAULT 'staging',  -- 'staging' | 'simulator'
    notes TEXT,
    UNIQUE (user_id, module_id, completed_at)
);
```
`GET /api/v1/admin/training-status` (org_admin only) returns completion status for all users in the organisation. Users without all required modules completed are flagged; their access is not automatically blocked (the ANSP retains operational responsibility), but the flag is visible to org_admin and included in the quarterly compliance report.
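The flag-don't-block behaviour reduces to a small completeness check. An illustrative sketch assuming a set-based representation of completed modules; the module codes M1–M6 come from the table above, while the function and field names are assumptions:

```python
# Required modules per the minimum initial training programme table.
REQUIRED_MODULES = {"M1", "M2", "M3", "M4", "M5", "M6"}

def training_status(completed: set) -> dict:
    """Flag users missing any required module. Access is not revoked;
    the flag is surfaced to org_admin and the quarterly compliance report."""
    missing = sorted(REQUIRED_MODULES - completed)
    return {"complete": not missing, "missing_modules": missing, "flagged": bool(missing)}
```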
Training material ownership: docs/training/ directory maintained by SpaceCom. ANSP-specific scenario variants stored in docs/training/ansp-variants/. Annual review cycle tied to the CHANGELOG review process.
Training records data retention and pseudonymisation (F10 — §64): operator_training_records is personal data — it records when a named individual completed specific training activities. For former employees whose accounts are deleted, these records must not be retained indefinitely as identified personal data.
Retention policy:
- Active users: retain for the duration of active employment (account `status = 'active'`) plus 2 years after account deletion (for certification audit purposes — an ANSP may need to verify training history after an operator leaves)
- After 2 years post-deletion: pseudonymise `user_id` → tombstone token; retain completion dates and module IDs for aggregate training statistics
```sql
-- Add to operator_training_records
ALTER TABLE operator_training_records
    ADD COLUMN pseudonymised_at TIMESTAMPTZ,
    ADD COLUMN user_tombstone TEXT;  -- SHA-256 prefix of deleted user_id; replaces user_id link
```
The monthly `pseudonymise_old_freetext` Celery task (§29.3, where it is defined with a monthly schedule) is extended to also pseudonymise training records where the linked `users` row has been deleted for more than 2 years:
```python
db.execute(text("""
    UPDATE operator_training_records otr
    SET user_tombstone = CONCAT('tombstone:', LEFT(ENCODE(DIGEST(otr.user_id::text, 'sha256'), 'hex'), 16)),
        pseudonymised_at = NOW()
    WHERE otr.pseudonymised_at IS NULL
      AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = otr.user_id)
      -- completed_at is the proxy for deletion age: the users row (and any
      -- deletion timestamp) is gone once the account has been erased
      AND otr.completed_at < NOW() - INTERVAL '2 years'
"""))
```
---
## 29. Data Protection Framework
SpaceCom processes personal data in the course of providing its services. For EU and UK deployments (ESA bid context), GDPR / UK GDPR compliance is mandatory. For Australian ANSP customers, the Privacy Act 1988 (Cth) applies. This section documents the data protection design requirements.
**Standards basis:** GDPR (EU) 2016/679, UK GDPR, Privacy Act 1988 (Cth), EDPB Guidelines on data breach notification, ICO guidance on legitimate interests, CNIL recommendations on consent records.
---
### 29.1 Data Inventory
**Record of Processing Activities (RoPA) — GDPR Art. 30:** This table constitutes the RoPA. It is maintained in `legal/ROPA.md` (authoritative version) and mirrored here. Organisations with ≥250 employees or processing high-risk data must maintain a written RoPA; space traffic management constitutes high-risk processing (Art. 35 DPIA trigger — see below). The DPO must review and sign off the RoPA annually.
| Data type | Personal? | Lawful basis (GDPR Art. 6) | Retention | Table / Location |
|-----------|-----------|---------------------------|-----------|-----------------|
| User email, name, organisation | Yes | Contract performance (Art. 6(1)(b)) | Account lifetime + 1 year after deletion | `users` |
| IP address in security logs | Yes (pseudonymous) | Legitimate interests — security (Art. 6(1)(f)) | **90 days full; hash retained for 7 years** | `security_logs` |
| IP address at ToS acceptance | Yes | Legitimate interests — consent evidence (Art. 6(1)(f)) | **90 days full; hash retained for account lifetime + 1 year** | `users.tos_accepted_ip` |
| Alert acknowledgement text | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Multi-ANSP coordination notes | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Shift handover records | Yes (outgoing/incoming user IDs) | Legitimate interests — aviation safety / operational continuity (Art. 6(1)(f)) | 7 years | `shift_handovers` |
| Alarm threshold audit records | Yes (reviewer ID) | Legitimate interests — safety governance (Art. 6(1)(f)) | 7 years | `alarm_threshold_audit` |
| API request logs | Yes (pseudonymous — IP) | Legitimate interests — security / billing (Art. 6(1)(f)) | 90 days | Log files / SIEM |
| MFA secrets (TOTP) | Yes (sensitive account data) | Contract performance (Art. 6(1)(b)) | Account lifetime; immediately deleted on account deletion | `users.mfa_secret` (encrypted at rest) |
| Space-Track data disclosure log | No (records org-level disclosure, not individuals) | Legitimate interests — licence compliance (Art. 6(1)(f)) | 5 years | `data_disclosure_log` |
**IP address data minimisation policy (F3 — §64):** IP addresses are personal data (CJEU *Breyer*, C-582/14). The full IP address is needed for fraud detection and security investigation within the first 90 days; beyond that, only a hashed form is needed for statistical/audit purposes.
Required Celery Beat task (`tasks/privacy_maintenance.py`, runs weekly):
```python
from datetime import datetime, timedelta

from celery import shared_task
from sqlalchemy import text

@shared_task
def hash_old_ip_addresses():
    """Replace full IP addresses with SHA-256 hashes after 90-day audit window."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    db.execute(text("""
        UPDATE security_logs
        SET ip_address = CONCAT('sha256:', LEFT(ENCODE(DIGEST(ip_address, 'sha256'), 'hex'), 16))
        WHERE created_at < :cutoff
          AND ip_address NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.execute(text("""
        UPDATE users
        SET tos_accepted_ip = CONCAT('sha256:', LEFT(ENCODE(DIGEST(tos_accepted_ip, 'sha256'), 'hex'), 16))
        WHERE created_at < :cutoff
          AND tos_accepted_ip NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.commit()
```
**Necessity assessment for IP storage (required in DPIA §2):** Full IP is necessary for: (a) detecting account takeover (geolocation anomaly), (b) rate-limiting bypass investigation, (c) regulatory/legal requests within the statutory window. Hashed form is sufficient for: (d) long-term audit log integrity (proving an event occurred from a non-obvious source), (e) statistical reporting. The 90-day threshold is the operational window for security investigations; beyond this, the benefit does not outweigh data subjects' privacy interests.
**DPIA requirement and structure (F1 — §64):** GDPR Article 35 mandates a DPIA before processing that is likely to result in high risk. SpaceCom's processing falls under Art. 35(3)(c) — systematic monitoring of a publicly accessible area on a large scale — because it tracks the online operational behaviour of aviation professionals (login times, alert acknowledgements, decision patterns, handover text) in a system used to support safety decisions. This is a pre-processing obligation: EU personal data cannot lawfully be processed without completing the DPIA first.
Document: legal/DPIA.md — a Phase 2 gate (must be complete before any EU/UK ANSP shadow activation).
Required DPIA structure (EDPB WP248 rev.01 template):
| Section | Content required |
|---|---|
| 1. Description of processing | Purpose, nature, scope, context of processing; categories of data; data flows; recipients |
| 2. Necessity and proportionality | Why is this data necessary? Could the purpose be achieved with less data? Legal basis per activity (mapped in §29.1 RoPA) |
| 3. Risk identification | Risks to data subjects: unauthorised access to operational patterns; re-identification of pseudonymised safety records; cross-border transfer exposure; disclosure to authorities |
| 4. Risk mitigation measures | Technical: RLS, HMAC, TLS, MFA, pseudonymisation. Organisational: DPA with ANSPs, export control screening, sub-processor contracts |
| 5. Residual risk assessment | Risk level after mitigations: Low / Medium / High. If High residual risk: prior consultation with supervisory authority required (Art. 36) |
| 6. DPO opinion | Designated DPO's written sign-off or objection |
| 7. Review schedule | DPIA reviewed when processing changes materially; at least every 3 years |
The DPIA covers all processing activities in the RoPA. Key risk finding anticipated: the alert acknowledgement audit trail (who acknowledged what, when) creates a de facto performance monitoring record for individual ANSP controllers — this must be addressed in Section 3 with mitigations in Section 4 (pseudonymisation after operational retention window, access restricted to org_admin and admin roles).
**Privacy Notice** — must be published at the registration URL and linked from the ToS acceptance flow. Must cover: data controller identity, categories of data collected, purposes and lawful bases, retention periods, data subject rights, third-party processors (cloud provider, SIEM), cross-border transfer safeguards.
### 29.2 Data Subject Rights Implementation
| Right | Mechanism | Notes |
|---|---|---|
| Access (Art. 15) | `GET /api/v1/users/me/data-export` — returns all personal data held for the authenticated user as a JSON download | Available to all logged-in users |
| Rectification (Art. 16) | `PATCH /api/v1/users/me` — allows name, email, organisation update | Email change triggers re-verification |
| Erasure (Art. 17) | `POST /api/v1/users/me/erasure-request` → calls `handle_erasure_request(user_id)` | See §29.3 |
| Restriction (Art. 18) | Admin-level: `users.access_restricted = TRUE` suspends account without deleting data | Used where erasure conflicts with retention requirement |
| Portability (Art. 20) | `POST /org/export` (org_admin or admin) — asynchronous export of all org personal data in machine-readable JSON; fulfilled within 30 days; also used for offboarding (§29.8). Covers user-generated content (acknowledgements, handover notes); not derived physics predictions. | F11 |
| Objection (Art. 21) | For legitimate interests processing: handled by erasure or restriction pathway | No automated profiling that would trigger Art. 22 |
### 29.3 Erasure vs. Retention Conflict — Pseudonymisation Procedure
The 7-year retention requirement (UN Liability Convention, aviation safety records) conflicts with the GDPR Article 17 right to erasure for personal data embedded in `alert_events` and `security_logs`. Resolution: pseudonymise, do not delete.
```python
import hashlib

from sqlalchemy import text
from sqlalchemy.orm import Session

def handle_erasure_request(user_id: int, db: Session):
    """
    Satisfy GDPR Art. 17 erasure request while preserving safety-critical records.
    Called when a user account is deleted or an explicit erasure request is received.
    """
    # Stable pseudonym — deterministic hash of user_id, not reversible
    pseudonym = f"[user deleted - ID:{hashlib.sha256(str(user_id).encode()).hexdigest()[:12]}]"
    # Pseudonymise user references in append-only safety tables
    db.execute(
        text("UPDATE alert_events SET acknowledged_by_name = :p WHERE acknowledged_by = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    db.execute(
        text("UPDATE security_logs SET user_email = :p WHERE user_id = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    # Pseudonymise shift handover records: only the erased user's ID is nulled;
    # the other party's link and the notes are preserved for the safety record
    db.execute(
        text("""UPDATE shift_handovers
                SET outgoing_user = CASE WHEN outgoing_user = :uid THEN NULL ELSE outgoing_user END,
                    incoming_user = CASE WHEN incoming_user = :uid THEN NULL ELSE incoming_user END,
                    notes = CONCAT('[pseudonymised: ', :p, '] ', COALESCE(notes, ''))
                WHERE outgoing_user = :uid OR incoming_user = :uid"""),
        {"p": pseudonym, "uid": user_id}
    )
    # Delete the user record itself (and cascade to refresh_tokens, api_keys)
    db.execute(text("DELETE FROM users WHERE id = :uid"), {"uid": user_id})
    db.commit()
    # Log the erasure event (note: this log entry is itself pseudonymised from creation)
    log_security_event("USER_ERASURE_COMPLETED", details={"pseudonym": pseudonym})
```
The core safety records (alert_events, security_logs, reentry_predictions) are preserved. The link to the identified individual is severed. This satisfies GDPR recital 26 (pseudonymous data is not personal data when re-identification is not reasonably possible) and Article 17(3)(b) (erasure obligation does not apply where processing is necessary for compliance with a legal obligation).
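The recital 26 argument rests on the pseudonym being deterministic (the same user maps to the same token across every table) yet not reasonably reversible (a truncated SHA-256 digest). The derivation mirrors the hashing step in `handle_erasure_request`; the helper name here is illustrative:

```python
import hashlib

def pseudonym_for(user_id: int) -> str:
    """Stable pseudonym: deterministic for a given user_id, so cross-table
    consistency survives erasure; truncated SHA-256, so the token does not
    disclose the original ID."""
    return f"[user deleted - ID:{hashlib.sha256(str(user_id).encode()).hexdigest()[:12]}]"
```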
**Free-text field periodic pseudonymisation (F6 — §64):** Handover notes (`shift_handovers.notes_text`) and alert acknowledgement text (`alert_events.action_taken`) are free-text fields where operators may name colleagues, reference individuals' decisions, or include other personal references. The 7-year retention of these fields as-written creates personal data retained far beyond its operational value. After the operational retention window (2 years — the period within which a re-entry event's record could be actively referenced by an ANSP), free-text personal references must be pseudonymised in place.
Required Celery Beat task (`tasks/privacy_maintenance.py`, runs monthly):
```python
from datetime import datetime, timedelta

from celery import shared_task
from sqlalchemy import text

@shared_task
def pseudonymise_old_freetext():
    """
    Replace identifiable free-text in operational records after 2-year operational window.
    The record itself is retained; only the human-entered text is sanitised.
    """
    cutoff = datetime.utcnow() - timedelta(days=730)  # 2 years
    # Replace acknowledgement text with sanitised marker — preserve the fact of acknowledgement
    db.execute(text("""
        UPDATE alert_events
        SET action_taken = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND action_taken IS NOT NULL
          AND action_taken NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    # Preserve handover structure; pseudonymise notes text
    db.execute(text("""
        UPDATE shift_handovers
        SET notes_text = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND notes_text IS NOT NULL
          AND notes_text NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    db.commit()
```
The 2-year operational window is chosen because: (a) PIR processes complete within 5 business days; (b) regulatory investigations of re-entry events typically complete within 12–18 months; (c) 2 years provides margin. Beyond 2 years, the text serves no legitimate purpose that outweighs the data subject's interest in not having their decision-making text retained indefinitely.
### 29.4a Data Subject Access Request Procedure (F7 — §64)
The GET /api/v1/users/me/data-export endpoint exists (§29.2). The DSAR procedure — how requests are received, processed, and responded to within the statutory deadline — must also be documented.
DSAR SLA: 30 calendar days from receipt of the verified request (GDPR Art. 12(3) — one month). Extension by up to two further months permitted for complex requests, with written notice to the data subject within the first 30 days.
DSAR procedure (docs/runbooks/dsar-procedure.md):
| Step | Action | Owner | Timing |
|---|---|---|---|
| 1 | Receive request (email to privacy@spacecom.io or in-app `POST /api/v1/users/me/data-export-request`) | DPO/designated contact | Day 0 |
| 2 | Verify identity of requestor (must be the data subject or authorised representative) | DPO | Within 3 business days |
| 3 | Assess scope: what data is held? Which tables? What exemptions apply (safety record retention)? | DPO + engineering | Within 7 days |
| 4 | Generate export: `GET /api/v1/users/me/data-export` for self-service; admin endpoint for cases where account is deleted/suspended | Engineering | Within 20 days |
| 5 | Deliver export: encrypted ZIP sent to verified email address | DPO | By day 28 |
| 6 | Document: log in `legal/DSAR_LOG.md` — request date, identity verified, scope, delivery date, any exemptions invoked | DPO | Same day as delivery |
| 7 | If exemption applied (safety records retained): provide written explanation of the exemption and residual rights | DPO | Included in delivery |
`GET /api/v1/users/me/data-export` response scope — must include all of:
- `users` record fields (excluding password hash)
- `alert_events` where `acknowledged_by = user.id` (pre-pseudonymisation only)
- `shift_handovers` where `outgoing_user = user.id` or `incoming_user = user.id`
- `operator_training_records` for the user
- `api_keys` metadata (not the key value itself)
- `security_logs` where `user_id = user.id` (pre-IP-hashing only)
- `tos_accepted_at`, `tos_version` from `users`
Fields excluded from DSAR export (not personal data or subject to legitimate processing exemption):
- `reentry_predictions` (not personal data)
- `security_logs` entries of type `HMAC_KEY_ROTATION`, `DEPLOY_*` (operational audit, not personal)
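Assembling the in-scope tables into one export document is mechanical. A minimal sketch under the scope above; `fetch` stands in for a real parameterised query helper and is an assumption, as is the function name:

```python
def build_data_export(user: dict, fetch) -> dict:
    """Assemble the DSAR export per the in-scope list. `fetch(sql, uid)`
    is a stand-in for a parameterised DB query helper."""
    uid = user["id"]
    return {
        # users record fields, excluding the password hash
        "user": {k: v for k, v in user.items() if k != "password_hash"},
        "alert_acknowledgements": fetch(
            "SELECT * FROM alert_events WHERE acknowledged_by = :uid", uid),
        "shift_handovers": fetch(
            "SELECT * FROM shift_handovers"
            " WHERE outgoing_user = :uid OR incoming_user = :uid", uid),
        "training_records": fetch(
            "SELECT * FROM operator_training_records WHERE user_id = :uid", uid),
    }
```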
### 29.4 Data Processing Agreements
A Data Processing Agreement (DPA) is required in every commercial relationship where SpaceCom acts as a data processor for customer personal data (GDPR Art. 28).
SpaceCom acts as data processor for: user data belonging to ANSP and space operator customers (the customers are the data controllers for their employees' data).
SpaceCom acts as data controller for: its own user authentication data, security logs, and analytics.
Required DPA provisions (GDPR Art. 28(3)):
- Processing only on documented instructions of the controller
- Confidentiality obligations on authorised processors
- Technical and organisational security measures (reference §7)
- Sub-processor approval process (cloud provider, SIEM)
- Data subject rights assistance obligations
- Deletion or return of data on contract termination
- Audit and inspection rights for the controller
The DPA template must be reviewed by counsel before any EU/UK commercial deployment. It is a standard addendum to the MSA.
**Sub-processor register (F9 — §64):** GDPR Article 28(2) requires that the controller authorises sub-processors, and Article 28(4) requires that the processor imposes equivalent obligations on sub-processors. The DPA template references a sub-processor register; that register must exist as a standalone document.
Document: legal/SUB_PROCESSORS.md — Phase 2 gate (required before first EU/UK commercial deployment).
| Sub-processor | Service | Personal data transferred | Location | Transfer mechanism | DPA in place |
|---|---|---|---|---|---|
| Cloud host (e.g. AWS/Hetzner) | Infrastructure hosting | All categories (hosted on their infrastructure) | EU-central-1 (Frankfurt) | Adequacy / SCCs | AWS DPA / Hetzner DPA |
| GitHub | Source code hosting, CI/CD | Developer usernames; may appear in test fixtures | US | EU SCCs (Module 2) | GitHub DPA |
| Email delivery provider (e.g. Postmark, SES) | Transactional email (alert notifications) | User email address, name, alert content | US | EU SCCs (Module 2) | Provider DPA |
| Grafana Cloud (if used) | Observability / monitoring | IP addresses in logs ingested to Loki | US/EU | SCCs / EU region option | Grafana DPA |
| Sentry (if used) | Error tracking | Stack traces may contain user IDs, request data | US | EU SCCs | Sentry DPA |
Customer notification obligation: ANSPs (as data controllers) must be notified ≥30 days before any new sub-processor is added. The DPA addendum requires this. The sub-processor register is the mechanism for tracking and triggering notifications.
### 29.5 Cross-Border Data Transfer Safeguards
For EU/UK customers where SpaceCom infrastructure is hosted outside the EU/UK (e.g., AWS us-east-1):
- Use EU/UK regions where available, or
- Execute Standard Contractual Clauses (SCCs — 2021 EU SCCs / UK IDTA) with the cloud provider
- Document the transfer mechanism in the Privacy Notice
For Australian customers: the Privacy Act's Australian Privacy Principle 8 (cross-border disclosure) requires contractual protections equivalent to the APPs when transferring personal data internationally.
**Data residency policy (Finding 8):**
- Default hosting: EU jurisdiction (eu-central-1 / Frankfurt or equivalent) — satisfies EU data residency requirements for ECAC ANSP customers; stated in the MSA and DPA
- On-premise option: `Institutional` tier supports customer-managed on-premise deployment (§34 specifies the deployment model); customer's own infrastructure, own jurisdiction; SpaceCom provides a deployment package and support contract
- Multi-tenancy isolation: each ANSP organisation's operational data (`alert_events`, `notam_drafts`, coordination notes) is accessible only to that organisation's users — enforced by RLS (§7.2). Multi-tenancy does not mean data co-mingling
- Subprocessor disclosure: `docs/legal/data-residency-policy.md` lists hosting provider, region, and any subprocessors; updated when subprocessors change; referenced in the DPA; customers notified of material subprocessor changes ≥ 30 days in advance
- Residency tracking: `organisations.hosting_jurisdiction` and `organisations.data_residency_confirmed` columns (§9.2) track per-organisation residency state; admin UI surfaces this to Persona D
- Authoritative document: `legal/DATA_RESIDENCY.md` — lists hosting provider, region, all sub-processors with their data residency and SCCs/IDTA status; reviewed and re-signed annually by DPO; customers notified of material sub-processor changes ≥30 days in advance per DPA obligations
### 29.6 Security Breach Notification
Regulatory notification obligations by framework:
| Framework | Trigger | Deadline | Authority | Template location |
|---|---|---|---|---|
| GDPR Art. 33 | Personal data breach affecting EU/UK data subjects | 72 hours of discovery | National DPA (e.g. ICO, CNIL, BfDI) | legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md |
| UK GDPR | As above for UK data subjects | 72 hours | ICO | As above |
| NIS2 Art. 23 | Significant incident affecting network/information systems of an essential entity | Early warning: 24 hours of becoming aware; full notification: 72 hours; final report: 1 month | National CSIRT + competent authority (space traffic management is likely an essential sector under NIS2 Annex I) | As above |
| Australian Privacy Act | Eligible data breach (serious harm likely) | ASAP (no fixed period; promptness required) | OAIC | As above |
Incident response timeline:
| Step | Timing | Action |
|---|---|---|
| Detect and contain | Immediately | Revoke affected credentials; isolate affected service; preserve logs |
| Assess scope | Within 2 hours | Determine: categories of data affected, approximate number of data subjects, jurisdictions, NIS2 applicability |
| Notify legal counsel and DPO | Within 4 hours of detection | Counsel advises on notification obligations across all applicable frameworks |
| NIS2 early warning | Within 24 hours of awareness | If significant incident: notify national CSIRT with initial information; no need for complete picture at this stage |
| Notify supervisory authority (EU/UK GDPR) | Within 72 hours of discovery | Via national DPA portal; even if incomplete — update as more known |
| NIS2 full notification | Within 72 hours of awareness | Full incident notification to national CSIRT / competent authority |
| Notify data subjects | Without undue delay | If breach likely to result in high risk to individuals |
| NIS2 final report | Within 1 month of full notification | Detailed description, impact assessment, cross-border impact, measures taken |
| Document | Ongoing | GDPR Art. 33(5) requires documentation of all breaches; NIS2 requires audit trail |
GDPR and NIS2 breach notification is integrated into the §26.8 incident response runbook. The security_logs record type DATA_BREACH triggers the breach notification workflow. On-call engineers must be trained to recognise when NIS2 thresholds (significant impact on service continuity or data integrity) are met and escalate to the DPO within the 24-hour window. Full obligations mapped in legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md.
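The statutory clocks in the tables above can all be derived from a single detection timestamp. A sketch (key names are illustrative; the 24 h / 72 h / 1-month intervals come from the table, with 1 month approximated as 30 days here):

```python
from datetime import datetime, timedelta

def notification_deadlines(detected_at: datetime) -> dict:
    """Compute the regulatory notification deadlines for one incident."""
    full = detected_at + timedelta(hours=72)
    return {
        "nis2_early_warning": detected_at + timedelta(hours=24),
        "gdpr_dpa_notification": full,   # GDPR Art. 33: 72 hours of discovery
        "nis2_full_notification": full,  # NIS2 Art. 23: 72 hours of awareness
        "nis2_final_report": full + timedelta(days=30),  # 1 month ~ 30 days
    }
```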
### 29.7 Cookie / Tracking Consent
Even as a B2B SaaS operating within corporate networks, SpaceCom must comply with the ePrivacy Directive (2002/58/EC as amended) for any non-essential cookies set on EU/UK user browsers.
Cookie audit (required at least annually — legal/COOKIE_POLICY.md):
| Cookie name | Category | Purpose | Lifetime | Consent required? |
|---|---|---|---|---|
| `session` | Strictly necessary | Authenticated session token | Session / 8h inactivity | No |
| `csrf_token` | Strictly necessary | CSRF protection | Session | No |
| `tos_version` | Strictly necessary | ToS acceptance tracking | 1 year | No |
| `feature_flags` | Functional | A/B flags for UI features | 30 days | Yes (functional consent) |
| `_analytics` | Analytics | Usage telemetry (if implemented) | 13 months | Yes (analytics consent) |
Security requirements for all session cookies (ePrivacy + §36 security):
```
Set-Cookie: session=...; HttpOnly; Secure; SameSite=Strict; Path=/; Max-Age=28800
```
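A trivial helper that renders exactly these attributes, shown as a sketch; a real deployment should set the cookie through the web framework's cookie API rather than hand-building the header:

```python
def session_cookie_header(token: str, max_age: int = 28800) -> str:
    """Render the required session-cookie attributes (8 h = 28800 s)."""
    return (
        f"session={token}; HttpOnly; Secure; SameSite=Strict; "
        f"Path=/; Max-Age={max_age}"
    )
```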
Consent implementation:
- Consent banner displayed on first visit to any EU/UK user before any non-essential cookies are set
- Three options: Accept all / Functional only / Strictly necessary only
- Consent preference stored in `user_cookie_preferences` or localStorage (no cookie used to store consent — self-defeating)
- Consent is re-requested if cookie categories change materially
- B2B context note: even if the organisation has a corporate cookie policy, individual users' consent is required under ePrivacy; organisational IT policies do not substitute for individual consent
Cookie policy: legal/COOKIE_POLICY.md — published at registration URL and linked from the consent banner. Reviewed when new cookies are introduced or existing cookies change purpose.
### 29.8 Organisation Onboarding and Offboarding (F4)
**Onboarding workflow**
New organisation provisioning requires explicit admin action — self-serve registration is not available in Phase 1 (safety-critical context; all organisations are individually vetted).
Onboarding gates (all must be satisfied before `subscription_status` → `active`):
- Legal: MSA executed (countersigned PDF stored in `legal/contracts/{org_id}/msa.pdf`)
- Export control: `export_control_cleared = TRUE` on the `organisations` row (BIS Entity List check; see §24.2)
- Space-Track: if the organisation requires Space-Track data: `space_track_registered = TRUE`; `space_track_username` recorded; data disclosure log seeded
- Billing: `billing_contacts` row created; VAT number validated for EU customers
- Admin user: at least one `org_admin` user created with MFA enrolled
- ToS: primary `org_admin` user has `tos_accepted_at IS NOT NULL`

Each gate is a checklist step in `docs/runbooks/org-onboarding.md`. Completing all gates creates a `subscription_periods` row with `period_start = NOW()`.
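The gate checklist lends itself to a mechanical pre-activation check. An illustrative sketch; the boolean field names are assumptions standing in for the real per-gate checks listed above:

```python
def onboarding_gates_met(org: dict) -> tuple:
    """All gates must pass before subscription_status -> 'active'.
    Returns (all_passed, list_of_failed_gate_names)."""
    checks = {
        "msa_executed": org.get("msa_executed", False),
        "export_control_cleared": org.get("export_control_cleared", False),
        "billing_contact_created": org.get("billing_contact_created", False),
        "org_admin_with_mfa": org.get("org_admin_with_mfa", False),
        "tos_accepted": org.get("tos_accepted", False),
    }
    # Space-Track registration is only a gate when the org needs that data
    if org.get("requires_space_track", False):
        checks["space_track_registered"] = org.get("space_track_registered", False)
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)
```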
**Offboarding workflow**
When an organisation's subscription ends (churn, termination, or suspension), the offboarding procedure:
| Step | Action | Who | When |
|---|---|---|---|
| 1 | Set `subscription_status = 'churned'` / `'suspended'` | Admin | Immediately |
| 2 | Revoke all `api_keys` for the org | Admin (automated) | Immediately |
| 3 | Invalidate all active sessions (`refresh_tokens`) | Admin (automated) | Immediately |
| 4 | Notify org primary contact: 30-day data export window | Admin | Same day |
| 5 | Generate and deliver org data export archive | Admin | Within 3 business days |
| 6 | After 30-day window: pseudonymise user personal data | Automated job | Day 31 |
| 7 | Retain non-personal safety records (7-year minimum) | DB — no action | Ongoing |
| 8 | Confirm deletion in writing to org billing contact | Admin | After step 6 |
GDPR Art. 17 vs. retention conflict: User personal data (name, email, IP addresses) is pseudonymised per §29.3 after the 30-day window. Safety records (alert_events, reentry_predictions, shift_handovers) are retained for 7 years per UN Liability Convention — the organisation row remains in the database with subscription_status = 'churned' as the foreign key anchor. No safety record is deleted.
Suspension vs. termination: A suspended organisation (subscription_status = 'suspended') retains data and can be reactivated by an admin. A churned organisation enters the 30-day export window immediately. Suspension is used for payment failure; churn for voluntary or contractual termination.
### 29.9 Audit Log Personal Data Separation (F8 — §64)
`security_logs` currently serves two distinct purposes with conflicting retention requirements:
- Integrity audit records (HMAC checks, ingest events, deploy markers): no personal data; 7-year retention under UN Liability Convention
- Personal data processing records (user logins, IP addresses, acknowledgement events): personal data; subject to data minimisation, IP hashing at 90 days, erasure on request
Mixing these in one table means a single retention policy applies to both — either over-retaining personal data (7 years) or under-retaining operational integrity records. Required separation:
```sql
-- New table: operational integrity audit — no personal data, 7-year retention
CREATE TABLE integrity_audit_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type TEXT NOT NULL,  -- 'HMAC_VERIFICATION', 'INGEST_SUCCESS', 'DEPLOY_COMPLETED', etc.
    source TEXT,               -- service name, job ID
    details JSONB,             -- operational context; must not contain user IDs or IPs
    severity TEXT NOT NULL DEFAULT 'INFO'
);

-- Existing security_logs: personal data processing records — IP hashing at 90d, erasure on request
-- Add constraint: security_logs must only hold user-action event types
ALTER TABLE security_logs ADD CONSTRAINT chk_security_logs_type
    CHECK (event_type IN (
        'LOGIN', 'LOGOUT', 'MFA_ENROLLED', 'PASSWORD_RESET', 'API_KEY_CREATED',
        'API_KEY_REVOKED', 'TOS_ACCEPTED', 'DATA_BREACH', 'USER_ERASURE_COMPLETED',
        'SAFETY_OCCURRENCE', 'DEPLOY_ALERT_GATE_OVERRIDE', 'HMAC_KEY_ROTATION',
        'AIRSPACE_UPDATE', 'EXPORT_CONTROL_SCREENED', 'SHADOW_MODE_ACTIVATED'
    ));
```
Migration: existing `security_logs` records of type `INGEST_*`, `HMAC_VERIFICATION_*` (pass/fail), and `DEPLOY_COMPLETED` are migrated to `integrity_audit_log`. The personal-data-containing events remain in `security_logs` with the updated retention and IP-hashing policy.
Benefit: `integrity_audit_log` can be retained for 7 years without any privacy obligation. `security_logs` is subject to the 90-day IP hashing, erasure-on-request, and 2-year text pseudonymisation policies without affecting integrity records.
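The routing rule that keeps the two tables disjoint reduces to a single lookup against the event-type allow-list from the CHECK constraint; the helper name here is an assumption:

```python
# Mirrors the CHECK constraint: user-action events stay in security_logs;
# everything else (integrity/operational events) goes to integrity_audit_log.
PERSONAL_DATA_EVENTS = {
    "LOGIN", "LOGOUT", "MFA_ENROLLED", "PASSWORD_RESET", "API_KEY_CREATED",
    "API_KEY_REVOKED", "TOS_ACCEPTED", "DATA_BREACH", "USER_ERASURE_COMPLETED",
    "SAFETY_OCCURRENCE", "DEPLOY_ALERT_GATE_OVERRIDE", "HMAC_KEY_ROTATION",
    "AIRSPACE_UPDATE", "EXPORT_CONTROL_SCREENED", "SHADOW_MODE_ACTIVATED",
}

def audit_table_for(event_type: str) -> str:
    """Route an audit event to the table matching its retention regime."""
    return "security_logs" if event_type in PERSONAL_DATA_EVENTS else "integrity_audit_log"
```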
### 29.10 Lawful Basis Mapping and ToS Acceptance Clarification (F11 — §64)
The first-login ToS/AUP acceptance flow (§3.1, §13) gates access and records `tos_accepted_at`. This mechanism does not mean consent (Art. 6(1)(a)) is the universal lawful basis for all processing. The RoPA (§29.1) maps the correct basis per activity; this section clarifies the principle.
Lawful basis is determined by purpose, not by the collection mechanism:
| Processing activity | Correct basis | Why NOT consent |
|---|---|---|
| Delivering alerts and predictions the user subscribed to | Art. 6(1)(b) — contract performance | User contracted for the service; consent would be revocable and would prevent service delivery |
| Security logging of user actions | Art. 6(1)(f) — legitimate interests (fraud/security) | Required regardless of consent; security cannot be conditional on consent |
| Audit trail for UN Liability Convention | Art. 6(1)(c) — legal obligation | Statutory retention requirement; consent is irrelevant |
| Fatigue monitoring triggers (§28.3 — server-side thresholds) | Art. 6(1)(b) or (f) | Part of the contracted service and/or legitimate safety interest; not health data (Art. 9) because no health information is processed — only activity patterns |
| Sending marketing or product update emails (not core service) | Art. 6(1)(a) — consent | Marketing emails require opt-in consent separate from service ToS |
ToS acceptance is consent evidence only for: (a) acknowledgement of terms, (b) Space-Track redistribution acknowledgement, (c) export control acknowledgement. It is not a blanket consent to all processing.
Implementation requirement: The Privacy Notice (§29.1) must state the correct lawful basis for each category of processing, not imply consent for all. Legal counsel review required before publication.
29.11 Open Source / Dependency Licence Compliance (§66)
SpaceCom is a closed-source SaaS product. Certain open-source licence obligations apply regardless of whether source code is distributed, because SpaceCom serves a web application to end users over a network. This section documents licence assessments for all material dependencies.
Reference document: legal/OSS_LICENCE_REGISTER.md — authoritative per-dependency licence record, updated on every major dependency version change.
F1 — CesiumJS AGPLv3 Commercial Licence
CesiumJS is licensed under AGPLv3. The AGPL network use provision (§13) requires that any software that incorporates AGPLv3 code and is served over a network must make its complete corresponding source available to users. SpaceCom is closed-source and does not satisfy this requirement under the AGPLv3 terms.
Required action: A commercial licence from Cesium Ion must be executed and stored at legal/LICENCES/cesium-commercial.pdf before any Phase 1 demo or ESA evaluation deployment. The CI licence gate (license-checker-rseidelsohn --excludePackages "cesium") is correct only when a valid commercial licence exists — the exclusion without the licence is a false negative. The commercial licence is referenced in ADR-0007 (docs/adr/0007-cesiumjs-commercial-licence.md).
Phase gate: legal/LICENCES/cesium-commercial.pdf present and legal_clearances.cesium_commercial_executed = TRUE is a Phase 1 go/no-go criterion. Block all external deployments until confirmed.
F3 — Space-Track AUP Redistribution Prohibition
Space-Track Terms of Service prohibit redistribution of TLE and CDM data to unregistered parties. SpaceCom's ingest pipeline fetches TLE/CDM data under a single registered account and serves derived predictions to ANSP users. The redistribution risk surfaces in two ways:
- Raw TLE exposure via API: If SpaceCom's API returns raw TLE strings (e.g., in /objects/{id}/tle), and those strings are accessible to unauthenticated users or third-party integrations, this may constitute redistribution. All TLE endpoints must require authentication and must not be proxied to unregistered downstream systems.
- Credentials in client-side code or SBOM: SPACE_TRACK_PASSWORD must never appear in frontend/ source, git history, SBOM artefacts, or any publicly accessible location. Validate with detect-secrets (already in pre-commit hook) and git secrets --scan-history.
ADR: docs/adr/0016-space-track-aup-architecture.md — records the chosen path (shared ingest vs. per-org credentials) with AUP clarification evidence.
F4 — Python Dependency Licence Assessment
| Package | Licence | Risk | Mitigation |
|---|---|---|---|
| NumPy | BSD-3 | None | — |
| SciPy | BSD-3 | None | — |
| astropy | BSD-3 | None | — |
| sgp4 | MIT | None | — |
| poliastro | MIT / LGPLv3 (components) | Low | LGPLv3 requires dynamic linking ability; standard pip install satisfies LGPL dynamic linking. SpaceCom does not ship a modified poliastro — no relinking obligation arises. Document in legal/LGPL_COMPLIANCE.md. |
| FastAPI | MIT | None | — |
| SQLAlchemy | MIT | None | — |
| Celery | BSD-3 | None | — |
| Pydantic | MIT | None | — |
| Playwright (Python) | Apache 2.0 | None | Chromium binary downloaded at build time; not redistributed. Captured in SBOM. |
LGPL compliance document: legal/LGPL_COMPLIANCE.md must confirm: (a) poliastro is installed via pip as a separate library, (b) SpaceCom does not statically link or incorporate modified poliastro source, (c) users can substitute a modified poliastro by reinstalling — this is satisfied by standard Python packaging. No further action required beyond this documentation.
F5 — TimescaleDB Licence Assessment
TimescaleDB uses a dual-licence model:
| Feature | Licence | SpaceCom use? |
|---|---|---|
| Hypertables, continuous aggregates, compression, time_bucket() | Apache 2.0 | Yes — all core features used by SpaceCom |
| Multi-node distributed hypertables | Timescale Licence (TSL) | No — single-node at all tiers |
| Data tiering (automated S3 tiering) | TSL | No — SpaceCom uses MinIO ILM / manual S3 lifecycle, not TimescaleDB tiering |
Assessment: SpaceCom uses only Apache 2.0-licensed TimescaleDB features. No Timescale commercial agreement required. Document in legal/LICENCES/timescaledb-licence-assessment.md. Re-assess if multi-node or data tiering features are adopted at Tier 3.
F6 — Redis SSPL Assessment
Redis 7.4+ adopted the Server Side Public Licence (SSPL). SSPL § 13 requires that any entity offering the software as a service must open-source their entire service stack. The relevant question for SpaceCom is whether deploying Redis as an internal component of SpaceCom constitutes "offering Redis as a service."
Assessment: SpaceCom operates Redis internally — users interact with SpaceCom's API and WebSocket interface, not directly with Redis. This is not offering Redis as a service. The SSPL obligation does not apply to internal use of Redis as a component. However, legal counsel should confirm this position before Phase 3 (operational deployment).
Alternative if legal counsel disagrees: Pin to Redis 7.2.x (BSD-3-Clause, last release before SSPL adoption) or migrate to Valkey (BSD-3-Clause fork maintained by Linux Foundation). Either is a drop-in replacement. Document the chosen path in legal/LICENCES/redis-sspl-assessment.md.
Action: Update pip-licenses fail-on list to include "Server Side Public License" as a blocking licence category. Redis itself is not in the Python dependency tree (it is a Docker service), so this is a docker-image licence check. Add to Trivy scan policy.
F7 — Playwright and Chromium Binary Licence
Playwright (Python) is Apache 2.0. The Chromium binary bundled by Playwright uses the Chromium licence (BSD-3-Clause for most code; additional component licences apply for media codecs). Chromium is not redistributed by SpaceCom — Playwright downloads it at container build time via playwright install chromium.
Assessment: Internal use only; no redistribution. SBOM captures the Playwright version; Chromium binary version is captured by syft scanning the container image at the cosign attest step. No further action required.
F8 — Caddy Licence Assessment
Caddy server is Apache 2.0. Community plugins (the modules used in §26.9: encode, reverse_proxy, tls, file_server) are Apache 2.0. No Caddy enterprise plugins are used by SpaceCom. Caddy DNS challenge modules (if used for ACME wildcard certificates) must be verified — the caddy-dns/cloudflare module is MIT.
Audit requirement: On any Caddyfile change that adds a new module, verify its licence before merging. Add to the PR checklist for infrastructure changes.
F9 — PostGIS Licence Assessment
PostGIS is GPLv2+ with a linking exception for use with PostgreSQL. The linking exception reads: "the copyright holders of PostGIS grant you permission to use PostGIS as a PostgreSQL extension without this resulting in the entire combined work becoming subject to the GPL." SpaceCom uses PostGIS as a PostgreSQL extension (loaded via CREATE EXTENSION postgis) — the linking exception applies.
SpaceCom does not distribute PostGIS, does not modify PostGIS source, and does not ship a combined work — PostGIS is a runtime dependency of the database service. No GPLv2 obligation arises. Document in legal/LGPL_COMPLIANCE.md alongside the poliastro LGPL note.
F10 — Licence Change Monitoring CI Check
The existing pip-licenses --fail-on list (§7.13) catches Python GPL/AGPL. Additions required:
# .github/workflows/ci.yml (security-scan job — update existing step)
- name: Python licence gate
run: |
pip install pip-licenses
pip-licenses --format=json --output-file=python-licences.json
# Block: GPL v2, GPL v3, AGPL v3, SSPL (if any Python package adopts it)
pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3);Server Side Public License"
- name: npm licence gate (updated)
working-directory: frontend
run: |
npx license-checker-rseidelsohn --json --out npm-licences.json
# cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
npx license-checker-rseidelsohn \
--excludePackages "cesium" \
--failOn "GPL;AGPL;SSPL"
Additionally, pin all Python and Node dependencies to exact versions in requirements.txt and package-lock.json. Renovate Bot PRs (§7.13) provide controlled upgrade paths; the licence gate re-runs on each Renovate PR to catch licence changes introduced by version upgrades.
F11 — Contributor Licence Agreement for External Contributors
Before any contractor, partner, or third-party engineer contributes code to SpaceCom:
- A CLA or work-for-hire clause must be in their contract confirming that all IP created for SpaceCom is owned by SpaceCom (or the appointing entity, per agreement).
- The CLA template is at legal/CLA.md — a simple assignment of copyright for contributions made under contract.
- The GitHub repository's CONTRIBUTING.md must state: "External contributions require a signed CLA. Contact legal@spacecom.io before submitting a PR."
Phase gate: Before any Phase 2 ESA validation partnership involves third-party engineering, confirm all engineers have executed the CLA or have work-for-hire clauses in their contracts. Unattributed IP in an ESA bid creates serious procurement risk.
30. DevOps / Platform Engineering
30.1 Pre-commit Hook Specification
All six hooks are required. The same hooks run locally (via pre-commit) and in CI (lint job). A push to GitHub that bypasses local hooks will fail CI.
.pre-commit-config.yaml:
repos:
- repo: https://github.com/Yelp/detect-secrets
rev: v1.4.0
hooks:
- id: detect-secrets
args: ['--baseline', '.secrets.baseline']
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.3.0
hooks:
- id: ruff
args: ['--fix']
- id: ruff-format
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.9.0
hooks:
- id: mypy
additional_dependencies: ['types-requests', 'sqlalchemy[mypy]']
- repo: https://github.com/hadolint/hadolint
rev: v2.12.0
hooks:
- id: hadolint-docker
- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.1.0
hooks:
- id: prettier
types_or: [javascript, typescript, html, css, json, yaml]
- repo: https://github.com/sqlfluff/sqlfluff
rev: 3.0.0
hooks:
- id: sqlfluff-lint
args: ['--dialect', 'postgres']
- id: sqlfluff-fix
args: ['--dialect', 'postgres']
All hooks are pinned by rev; update via pre-commit autoupdate in a dedicated dependency update PR. The detect-secrets baseline (.secrets.baseline) is committed to the repo and updated whenever legitimate secrets-like strings are added.
detect-secrets baseline maintenance process — incorrect baseline updates are the most common way this hook is neutralised. The correct procedure must be documented and enforced:
# docs/runbooks/detect-secrets-update.md (required runbook)
# CORRECT: update baseline to add a new allowance while preserving existing ones
detect-secrets scan --baseline .secrets.baseline --update
git add .secrets.baseline
git commit -m "chore: update detect-secrets baseline for <reason>"
# WRONG — overwrites ALL existing allowances:
# detect-secrets scan > .secrets.baseline ← NEVER do this
CI check verifies baseline currency on every PR (stale baseline = hook not enforced):
# In lint job, after running pre-commit:
detect-secrets scan --baseline .secrets.baseline --diff | \
python -c "import sys,json; d=json.load(sys.stdin); sys.exit(0 if not d else 1)" || \
(echo "ERROR: .secrets.baseline is stale — run: detect-secrets scan --baseline .secrets.baseline --update" && exit 1)
detect-secrets is the canonical secrets scanner (entropy + regex). git-secrets (listed in §7.13) is also retained for its AWS credential pattern matching, which complements detect-secrets. Both run as pre-commit hooks; there is no conflict — they check different pattern sets.
30.2 Multi-Stage Dockerfile Pattern
All service Dockerfiles follow the builder/runtime two-stage pattern. No exceptions without documented justification.
Backend (example — same pattern for worker and ingest):
# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /build
# Install build dependencies (not copied to runtime stage)
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev
COPY backend/requirements.txt .
# --require-hashes enforces that every package in requirements.txt carries a hash annotation.
# pip-compile --generate-hashes produces these. Without this flag, hash pinning is specified
# but not verified during build — a dependency confusion attack would be silently installed.
RUN pip install --upgrade pip && \
pip wheel --no-cache-dir --require-hashes --wheel-dir /wheels -r requirements.txt
# Stage 2: runtime
FROM python:3.12-slim AS runtime
WORKDIR /app
# Create non-root user at build time
RUN groupadd --gid 1001 appuser && \
useradd --uid 1001 --gid appuser --no-create-home appuser
# Install only compiled wheels — no build tools
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl && \
rm -rf /wheels
COPY backend/app ./app
USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Frontend:
FROM node:22-slim AS builder
WORKDIR /build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build
FROM node:22-slim AS runtime
WORKDIR /app
RUN groupadd --gid 1001 appuser && useradd --uid 1001 --gid appuser --no-create-home appuser
COPY --from=builder /build/.next/standalone ./
COPY --from=builder /build/.next/static ./.next/static
COPY --from=builder /build/public ./public
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]
Version pin rule: All Python service images use python:3.12-slim. All frontend/Node images use node:22-slim. Any FROM line using a different tag fails the hadolint pre-commit hook and the CI lint step. Do not let these drift — the service table in §3.2 and the Dockerfiles must agree.
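In practice hadolint enforces the pin via its own rule configuration; as a standalone illustration of the check (APPROVED_BASES and the function name are assumptions of this sketch, not an existing script), the logic is:

```python
# Approved base images per the version pin rule; extend if a new runtime is adopted
APPROVED_BASES = {"python:3.12-slim", "node:22-slim"}

def unapproved_from_lines(dockerfile_text: str) -> list[str]:
    """Return FROM lines whose base image is not on the approved list.

    Multi-stage aliases (FROM <base> AS <name>, later FROM <name>) are
    tracked so that internal stage references are not flagged.
    """
    stage_names: set[str] = set()
    violations: list[str] = []
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("FROM "):
            continue
        parts = stripped.split()
        image = parts[1]
        if len(parts) >= 4 and parts[2].upper() == "AS":
            stage_names.add(parts[3])
        if image in stage_names:
            continue  # reference to an earlier build stage, not a base image
        if image not in APPROVED_BASES:
            violations.append(stripped)
    return violations
```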
CI verification — the build-and-push job includes:
# Verify no build tools in runtime image (which exits non-zero when gcc is absent)
if docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA which gcc; then echo "ERROR: gcc present in runtime image" && exit 1; fi
# Verify the image runs as the non-root appuser (uid 1001) by default
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA id -u | grep -qx "1001" || exit 1
# Verify correct Python version
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA python --version | grep -q "Python 3.12" || exit 1
Image digest pinning in production Compose files (F4 — §59): The production docker-compose.yml pins images by digest, not by mutable tag, to guarantee bit-for-bit reproducibility and prevent registry-side tampering:
# docker-compose.yml — production image references
# Update digests via: make update-image-digests (runs after each build-and-push)
services:
backend:
image: ghcr.io/spacecom/backend:sha-abc1234@sha256:a1b2c3d4... # tag + digest
worker-sim:
image: ghcr.io/spacecom/worker:sha-abc1234@sha256:e5f6a7b8...
make update-image-digests script (run by CI after build-and-push): queries GHCR for the digest of each newly pushed image and patches docker-compose.yml via sed. The patched file is committed back to the release branch as a separate commit.
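The sed patch step could equally be written as a small helper; pin_image_digest below is an illustrative sketch of the idempotent rewrite, not an existing script:

```python
import re

def pin_image_digest(compose_text: str, service_image: str, digest: str) -> str:
    """Rewrite an image reference like

        image: ghcr.io/spacecom/backend:sha-abc1234

    to the tag+digest form

        image: ghcr.io/spacecom/backend:sha-abc1234@sha256:...

    replacing any previously pinned digest so the operation is idempotent.
    `service_image` is the tagged reference without a digest; `digest`
    includes the "sha256:" prefix.
    """
    pattern = re.compile(rf"image:\s*{re.escape(service_image)}(@sha256:[0-9a-f]+)?")
    return pattern.sub(f"image: {service_image}@{digest}", compose_text)
```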
GHCR image retention policy (F4 — §59):
| Image type | Tag pattern | Retention |
|---|---|---|
| Release images | sha-<commit> on tagged release | Indefinite |
| Staging images | sha-<commit> on main push | 30 days |
| Dev branch images | sha-<commit> on PR branch | 7 days |
| Build cache manifests | buildcache | Overwritten each build; no accumulation |
| Untagged images | (orphaned layers) | Purged weekly via GHCR lifecycle policy |
GHCR lifecycle policy is configured via the GitHub repository settings (Packages → Manage versions). The policy is documented in docs/runbooks/image-lifecycle.md and reviewed quarterly alongside the secrets audit.
30.3 Environment Variable Contract
All environment variables are documented in .env.example. Variables are grouped by category and stage:
| Variable | Required | Stage | Description |
|---|---|---|---|
| SPACETRACK_USERNAME | Yes | All | Space-Track.org account email |
| SPACETRACK_PASSWORD | Yes | All | Space-Track.org password |
| JWT_PRIVATE_KEY_PATH | Yes | All | Path to RS256 PEM private key |
| JWT_PUBLIC_KEY_PATH | Yes | All | Path to RS256 PEM public key |
| JWT_PUBLIC_KEY_NEW_PATH | No | Rotation only | Second public key during keypair rotation window |
| POSTGRES_PASSWORD | Yes | All | TimescaleDB password |
| REDIS_BACKEND_PASSWORD | Yes | All | Redis ACL password for spacecom_backend user (full keyspace access) |
| REDIS_WORKER_PASSWORD | Yes | All | Redis ACL password for spacecom_worker user (Celery namespaces only) |
| REDIS_INGEST_PASSWORD | Yes | All | Redis ACL password for spacecom_ingest user (Celery namespaces only) |
| MINIO_ACCESS_KEY | Yes | All | MinIO access key |
| MINIO_SECRET_KEY | Yes | All | MinIO secret key |
| HMAC_SECRET | Yes | All | Prediction signing key (rotate per §26.9 procedure) |
| ENVIRONMENT | Yes | All | development / staging / production |
| DEPLOY_CHECK_SECRET | Yes | Staging/Prod | Read-only CI/CD gate credential |
| SENTRY_DSN | No | Staging/Prod | Error reporting DSN |
| PAGERDUTY_ROUTING_KEY | No | Prod only | AlertManager → PagerDuty routing key |
| VAULT_ADDR | No | Phase 3 | HashiCorp Vault address |
| VAULT_TOKEN | No | Phase 3 | Vault authentication token |
| DISABLE_SIMULATION_DURING_ACTIVE_EVENTS | No | All | Org-level simulation block; default false |
| OPS_ROOM_SUPPRESS_MINUTES | No | All | Alert audio suppression window; default 0 |
CI validates that .env.example is up-to-date by checking that every variable referenced in the codebase (os.getenv(...), settings.*) has an entry in .env.example. Missing entries fail CI.
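The .env.example currency check can be sketched as follows (missing_env_vars is an illustrative name; the real check also covers settings.* attribute references, which are elided here):

```python
import re

def missing_env_vars(source_files: dict[str, str], env_example: str) -> set[str]:
    """Find variables read via os.getenv(...) that have no entry in .env.example.

    `source_files` maps path -> file content. A variable counts as documented
    if .env.example has a non-comment line of the form NAME=... for it.
    """
    getenv_pattern = re.compile(r"os\.getenv\(\s*['\"]([A-Z0-9_]+)['\"]")
    referenced: set[str] = set()
    for content in source_files.values():
        referenced.update(getenv_pattern.findall(content))
    documented = {
        line.split("=", 1)[0].strip()
        for line in env_example.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }
    return referenced - documented
```

A non-empty result fails the CI job with the list of undocumented variables.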
CI secrets register (F3 — §59): GitHub Actions secrets are audited quarterly. The following table is the authoritative register — any secret not in this table must not exist in the repository settings.
| Secret name | Environment | Owner | Rotation schedule | What breaks if leaked |
|---|---|---|---|---|
| GITHUB_TOKEN | All | GitHub-managed (OIDC) | Per-job (automatic) | GHCR push access |
| DEPLOY_CHECK_SECRET | Staging, Production | Engineering lead | 90 days | CI can skip alert gate |
| STAGING_SSH_KEY | Staging | Engineering lead | 180 days | Staging server access |
| PRODUCTION_SSH_KEY | Production | Engineering lead + 1 | 90 days | Production server access |
| SPACETRACK_USERNAME_STAGING | Staging | DevOps | On offboarding | Space-Track ingest |
| SPACETRACK_PASSWORD_STAGING | Staging | DevOps | 90 days | Space-Track ingest |
| SENTRY_DSN | Staging, Production | DevOps | On rotation | Error reporting only |
| PAGERDUTY_ROUTING_KEY | Production | Engineering lead | On rotation | On-call alerting |
Rotation procedure: use gh secret set <NAME> --env <ENV> from a local machine; never paste secrets into PR descriptions or issue comments. Quarterly audit: gh secret list --env production output reviewed by engineering lead; any unrecognised secret triggers a security review.
30.4 Staging Environment Specification
Staging is a Tier 2 deployment (single-host Docker Compose) running continuously on a dedicated server or cloud VM.
Data policy: Staging never holds production data. On weekly reset (make clean && make seed), the database is wiped and synthetic fixtures are loaded. Synthetic fixtures include:
- 50 tracked objects with pre-computed TLE histories
- 5 synthetic TIP events across the test FIR set
- 3 synthetic CRITICAL alert events at various acknowledgement states
- 2 shadow mode test organisations
Credential policy: Staging uses a separate Space-Track account (if available) or rate-limited credentials. JWT keypairs, HMAC secrets, and MinIO keys are all distinct from production. Staging credentials are stored in GitHub Actions environment secrets, not in the production Vault.
OWASP ZAP integration:
# .github/workflows/ci.yml (post-staging-deploy step)
- name: OWASP ZAP baseline scan
uses: zaproxy/action-baseline@v0.11.0
with:
target: 'https://staging.spacecom.io'
rules_file_name: '.zap/rules.tsv'
fail_action: true
ZAP results are uploaded as GitHub Actions artefacts and must be reviewed before production deploy approval is granted in Phase 2+.
30.5 CI Observability
Build duration: Each GitHub Actions job reports duration to a summary table. A Grafana dashboard (CI Health) tracks p50/p95 job durations over time. Alert if any job's p95 duration increases > 2× week-over-week.
Image size delta: The build-and-push job posts a PR comment with the compressed image size delta versus the previous main build:
Backend image: 187 MB → 192 MB (+2.7%) ✅
Worker image: 203 MB → 289 MB (+42.4%) ⚠️ Investigate before merge
If any image grows > 20% in a single PR, CI posts a warning. If any image exceeds the tier limits below, CI fails:
| Image | Max size (compressed) |
|---|---|
| backend | 300 MB |
| worker | 350 MB |
| frontend | 200 MB |
| renderer | 500 MB (Chromium) |
| ingest | 250 MB |
Test failure rate: GitHub Actions test reports (JUnit XML output from pytest and vitest) are stored as artefacts. A weekly CI health review checks for flaky tests (passing < 90% of the time) and schedules them for investigation.
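The flaky-test criterion can be sketched as a small pass-rate filter over recent run history (flaky_tests is an illustrative name; in practice the history would be aggregated from the stored JUnit XML artefacts):

```python
def flaky_tests(history: dict[str, list[bool]], threshold: float = 0.90) -> list[str]:
    """Flag tests passing less than `threshold` of the time.

    A 0% pass rate is a hard failure, not flakiness, so consistently
    failing tests are excluded — they should already be blocking merges.
    `history` maps test id -> pass/fail outcomes over recent runs.
    """
    flagged = []
    for test_id, outcomes in history.items():
        if not outcomes:
            continue
        rate = sum(outcomes) / len(outcomes)
        if 0.0 < rate < threshold:
            flagged.append(test_id)
    return sorted(flagged)
```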
30.6 DevOps Decision Log
| Decision | Chosen | Rationale |
|---|---|---|
| CI/CD orchestration | GitHub Actions | Project is GitHub-native; OIDC → GHCR eliminates long-lived registry credentials; matrix builds supported |
| Container registry | GHCR | Co-located with source; free for this repo; cosign attestation support |
| Image tagging | sha-<commit> canonical; version alias on release tags; latest forbidden | latest is mutable; sha tag gives exact source traceability |
| Multi-stage builds | Builder + distroless/slim runtime for all services | 60–80% image size reduction; eliminates compiler/build tools from production attack surface |
| Hot-reload strategy | docker-compose.override.yml with bind-mounted source volumes | < 1s reload vs. 30–90s container rebuild; override file not committed to CI |
| Local task runner | make | Universally available, no extra install; self-documenting targets; shell-level DX standard |
| Pre-commit stack | 6 hooks: detect-secrets + ruff + mypy + hadolint + prettier + sqlfluff | Each addresses a distinct failure mode; hooks run in CI to enforce for engineers who skip local install |
| Staging data | Synthetic fixtures only; weekly reset | Production data in staging creates GDPR complexity; synthetic data is sufficient for integration testing |
| Secrets rotation | Zero-downtime per-secret runbook; HMAC rotation requires batch re-sign migration | Aviation context: rotation cannot cause service interruption; HMAC is special-cased due to signed-data dependency |
| HMAC key rotation | Requires batch re-sign of all existing predictions; engineering lead approval required | All existing HMAC signatures become invalid on key change; silent re-sign is safer than mass verification failures |
30.7 GitHub Actions CI Workflow Specification (F1, F5, F8, F10 — §59)
The CI pipeline must enforce a strict job dependency graph. Jobs that do not declare needs: run in parallel by default — unacceptable for a safety-critical pipeline, where a failed test must prevent a build from reaching production.
Canonical job dependency graph:
lint ──┬── test-backend ──┬── security-scan ──── build-and-push ──── deploy-staging ──── deploy-production
└── test-frontend ─┘ ↑ (auto) ↑ (manual gate)
.github/workflows/ci.yml (abbreviated — full spec below):
name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.12' }
- uses: actions/cache@v4
with:
path: ~/.cache/pre-commit
key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
- run: pip install pre-commit
- run: pre-commit run --all-files # F6 §59: enforce hooks in CI
test-backend:
needs: [lint]
runs-on: ubuntu-latest
services:
db:
image: timescale/timescaledb:2.14-pg17
env: { POSTGRES_PASSWORD: test }
options: --health-cmd pg_isready
redis:
image: redis:7-alpine
options: --health-cmd "redis-cli ping"
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.12' }
- uses: actions/cache@v4 # F10 §59: pip wheel cache
with:
path: ~/.cache/pip
key: pip-${{ hashFiles('backend/requirements.txt') }}
- run: pip install -r backend/requirements.txt
- run: pytest -m safety_critical --tb=short -q # fast safety gate first
- run: pytest --cov=backend --cov-fail-under=80
test-frontend:
needs: [lint]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '22' }
- uses: actions/cache@v4 # F10 §59: npm cache
with:
path: ~/.npm
key: npm-${{ hashFiles('frontend/package-lock.json') }}
- run: npm ci --prefix frontend
- run: npm run test --prefix frontend
migration-gate: # F11 §59: migration reversibility + timing gate
needs: [lint]
if: contains(join(github.event.commits.*.modified, ' '), 'migrations/') # object filters use .*; join flattens per-commit path lists
runs-on: ubuntu-latest
services:
db:
image: timescale/timescaledb:2.14-pg17
env: { POSTGRES_PASSWORD: test }
options: --health-cmd pg_isready
steps:
- uses: actions/checkout@v4
- run: pip install alembic psycopg2-binary
- name: Forward migration (timed)
run: |
START=$(date +%s)
alembic upgrade head
END=$(date +%s)
ELAPSED=$((END - START))
echo "Migration took ${ELAPSED}s"
if [ "$ELAPSED" -gt 30 ]; then
echo "::error::Migration took ${ELAPSED}s > 30s budget — requires review"
exit 1
fi
- name: Reverse migration (reversibility check)
run: alembic downgrade -1
- name: Model/migration sync check
run: alembic check
security-scan:
needs: [test-backend, test-frontend, migration-gate]
if: ${{ !cancelled() && !failure() }} # migration-gate is skipped when no migrations change; run unless an upstream job failed
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install bandit && bandit -r backend/app -ll
- uses: actions/setup-node@v4
with: { node-version: '22' }
- run: npm audit --prefix frontend --audit-level=high
- name: Trivy container scan (on previous image)
uses: aquasecurity/trivy-action@master
with:
image-ref: ghcr.io/${{ github.repository }}/backend:latest
severity: CRITICAL,HIGH
exit-code: '1'
build-and-push:
needs: [security-scan]
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
permissions: { contents: read, packages: write, id-token: write }
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }} # OIDC — no long-lived token
- name: Build and push (with layer cache) # F10 §59
uses: docker/build-push-action@v5
with:
push: true
tags: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache
cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache,mode=max
- name: Sign image with cosign (F5 §59)
uses: sigstore/cosign-installer@v3
- run: |
cosign sign --yes \
ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
- name: Generate SBOM and attach (F5 §59)
uses: anchore/sbom-action@v0
with:
image: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
upload-artifact: true
deploy-staging:
needs: [build-and-push]
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Check no active CRITICAL alert (F8 §59)
run: |
STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
https://staging.spacecom.io/api/v1/readyz | jq -r '.alert_gate')
if [ "$STATUS" != "clear" ]; then
echo "::error::Active CRITICAL/HIGH alert — deploy blocked. Override with workflow_dispatch."
exit 1
fi
- name: SSH deploy to staging
run: |
ssh deploy@staging.spacecom.io \
"bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"
deploy-production:
needs: [deploy-staging]
runs-on: ubuntu-latest
environment: production # GitHub protected environment with required approvers - manual gate
steps:
- uses: actions/checkout@v4
- name: Check no active CRITICAL alert (F8 §59)
run: |
STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
https://spacecom.io/api/v1/readyz | jq -r '.alert_gate')
if [ "$STATUS" != "clear" ]; then
echo "::error::Active CRITICAL/HIGH alert — production deploy blocked."
exit 1
fi
- name: SSH deploy to production
run: |
ssh deploy@spacecom.io \
"bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"
/api/v1/readyz alert gate field (F8 — §59): The existing GET /readyz response is extended with an alert_gate field:
# Returns "clear" | "blocked"
# Returns "clear" | "blocked"
alert_gate = "blocked" if db.query(AlertEvent).filter(
    AlertEvent.level.in_(["CRITICAL", "HIGH"]),
    AlertEvent.acknowledged_at.is_(None),
    AlertEvent.organisation_id != INTERNAL_ORG_ID,  # internal test alerts don't block deploys
).count() > 0 else "clear"
Emergency deploy override: use workflow_dispatch with input override_alert_gate: true — requires two approvals in the GitHub production environment. All overrides are logged to security_logs with event_type = DEPLOY_ALERT_GATE_OVERRIDE.
30.8 Configuration Management of Safety-Critical Artefacts (F7 — §61)
EUROCAE ED-153 / DO-278A §10 requires that safety-critical software and its associated artefacts are placed under configuration management. This extends beyond the code itself to include requirements, test cases, design documents, and safety evidence.
Policy document: docs/safety/CM_POLICY.md
Artefacts under CM:
| Artefact | Location | CM Control |
|---|---|---|
| SAL-2 source files (physics/, alerts/, integrity/, czml/) | Git main branch | Signed commits required; CODEOWNERS enforcement; no direct push to main |
| Hazard Log | docs/safety/HAZARD_LOG.md | Git-tracked; changes require safety case custodian sign-off (CODEOWNERS rule) |
| Safety Case | docs/safety/SAFETY_CASE.md | Git-tracked; changes require safety case custodian sign-off |
| SAL Assignment | docs/safety/SAL_ASSIGNMENT.md | Git-tracked; changes require safety case custodian sign-off |
| Means of Compliance | docs/safety/MEANS_OF_COMPLIANCE.md | Git-tracked; changes require safety case custodian sign-off |
| Verification Independence Policy | docs/safety/VERIFICATION_INDEPENDENCE.md | Git-tracked |
| Test plan (safety-critical markers) | docs/TEST_PLAN.md | Git-tracked; safety_critical marker additions/removals reviewed in PR |
| Reference validation data | docs/validation/reference-data/ | Git-tracked; immutable once committed (SHA verified in CI) |
| Accuracy Characterisation | docs/validation/ACCURACY_CHARACTERISATION.md | Git-tracked; Phase 3 deliverable |
| ANSP SMS Guide | docs/safety/ANSP_SMS_GUIDE.md | Git-tracked |
| Release artefacts (SBOM, Trivy report, cosign signature) | GHCR + MinIO safety archive | Tagged per release; 7-year retention |
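The "SHA verified in CI" control on reference validation data amounts to comparing each file against a committed hash manifest. A minimal sketch (verify_reference_data and the manifest shape are illustrative assumptions):

```python
import hashlib

def verify_reference_data(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Compare SHA-256 of each reference file against a committed manifest.

    `files` maps path -> file bytes; `manifest` maps path -> expected hex
    digest. Returns paths that are missing or whose hash does not match —
    a non-empty result means the immutable reference data was altered.
    """
    mismatches = []
    for path, expected in manifest.items():
        data = files.get(path)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            mismatches.append(path)
    return sorted(mismatches)
```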
Release tagging for safety artefacts:
Every production release (vMAJOR.MINOR.PATCH) creates a Git tag that captures:
# scripts/tag-safety-release.sh
VERSION=$1
git tag -a "$VERSION" -m "Release $VERSION — safety artefacts frozen at this tag"
# Attach safety snapshot to the release
gh release create "$VERSION" \
docs/safety/SAFETY_CASE.md \
docs/safety/HAZARD_LOG.md \
docs/safety/SAL_ASSIGNMENT.md \
docs/safety/MEANS_OF_COMPLIANCE.md \
--title "SpaceCom $VERSION" \
--notes "Safety artefacts attached. See CHANGELOG.md for changes."
Signed commits for SAL-2 paths: backend/app/physics/, backend/app/alerts/, backend/app/integrity/, backend/app/czml/ require GPG-signed commits. Branch protection rule: require_signed_commits: true on main. This provides non-repudiation for safety-critical code changes.
CODEOWNERS additions:
# .github/CODEOWNERS
# Safety artefacts — require safety case custodian review
/docs/safety/ @safety-custodian
/docs/validation/ @safety-custodian
Configuration baseline: At each ANSP deployment, a configuration baseline is recorded in legal/ANSP_DEPLOYMENT_REGISTER.md:
- SpaceCom version deployed (Git tag)
- Commit SHA
- SBOM hash
- Safety case version
- SAL assignment version
- Deployment jurisdiction and date
This baseline is the reference for any subsequent regulatory audit or safety occurrence investigation.
31. Interoperability / Systems Integration
31.1 External Data Source Contracts
For each inbound data source, the integration contract must be explicit. Implicit assumptions about format are the most common source of silent ingest failures.
31.1.1 Space-Track.org
Endpoints consumed:
| Data | Endpoint | Format | Baseline interval | Active TIP interval |
|---|---|---|---|---|
| TLE catalog | /basicspacedata/query/class/gp/DECAY_DATE/null-val/orderby/NORAD_CAT_ID asc/format/json | JSON array | Every 6h | Every 6h (unchanged) |
| CDMs | /basicspacedata/query/class/cdm_public/format/json | JSON array | Every 2h | Every 30min |
| TIP messages | /basicspacedata/query/class/tip/format/json | JSON array | Every 30min | Every 5min |
| Object catalog | /basicspacedata/query/class/satcat/format/json | JSON array | Daily | Daily |
Adaptive polling: When spacecom_active_tip_events > 0 (any object with predicted re-entry within 6 hours), the Celery Beat schedule dynamically switches TIP polling to 5-minute intervals and CDM polling to 30-minute intervals. This is implemented via redbeat schedule overrides, not by running additional tasks — the existing Beat entry's run_every is updated in Redis. When all TIP events clear, intervals revert to baseline.
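The interval switch described above can be sketched as follows. The Beat entry key names, the redbeat key prefix, and the helper names are illustrative assumptions; only the baseline and active-TIP intervals come from the table above.

```python
BASELINE_INTERVALS_S = {"tip": 30 * 60, "cdm": 2 * 3600, "tle": 6 * 3600}
ACTIVE_TIP_INTERVALS_S = {"tip": 5 * 60, "cdm": 30 * 60, "tle": 6 * 3600}

def select_polling_intervals(active_tip_events: int) -> dict[str, int]:
    """Return the per-source polling interval (seconds) for the current TIP state."""
    return ACTIVE_TIP_INTERVALS_S if active_tip_events > 0 else BASELINE_INTERVALS_S

def apply_intervals(app, intervals: dict[str, int]) -> None:
    """Update the existing Beat entries in place via redbeat — no additional tasks.
    Entry keys (redbeat:ingest-<source>) are assumed, not the shipped names."""
    from celery.schedules import schedule
    from redbeat import RedBeatSchedulerEntry
    for source, run_every in intervals.items():
        entry = RedBeatSchedulerEntry.from_key(f"redbeat:ingest-{source}", app=app)
        entry.schedule = schedule(run_every=run_every)
        entry.save()  # persisted in Redis; Beat picks it up without a restart
```

Because the schedule lives in Redis, every Beat replica sees the same intervals and reversion to baseline is the same call with the baseline dict.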
Space-Track request budget (600 requests/day):
Space-Track enforces a 600 requests/day limit per account. Budget must be tracked and protected:
# ingest/budget.py
from datetime import date, datetime, time, timedelta, timezone

import structlog

DAILY_REQUEST_BUDGET = 600
BUDGET_ALERT_THRESHOLD = 0.80  # alert at 80% consumed

class SpaceTrackBudget:
    """Redis counter tracking daily Space-Track API requests. Resets at midnight UTC."""

    def __init__(self, redis_client):
        self._redis = redis_client
        self._key = f"spacetrack:budget:{date.today().isoformat()}"

    def consume(self, n: int = 1) -> bool:
        """Deduct n requests. Raises SpaceTrackBudgetExhausted once the daily budget
        is exceeded; logs a warning when the alert threshold is crossed."""
        current = self._redis.incrby(self._key, n)
        self._redis.expireat(self._key, self._next_midnight())
        if current > DAILY_REQUEST_BUDGET:
            raise SpaceTrackBudgetExhausted(f"Daily budget exhausted ({current}/{DAILY_REQUEST_BUDGET})")
        if current / DAILY_REQUEST_BUDGET >= BUDGET_ALERT_THRESHOLD:
            structlog.get_logger().warning(
                "spacetrack_budget_warning",
                consumed=current, budget=DAILY_REQUEST_BUDGET,
            )
        return True

    def remaining(self) -> int:
        return max(0, DAILY_REQUEST_BUDGET - int(self._redis.get(self._key) or 0))

    @staticmethod
    def _next_midnight() -> datetime:
        """Expiry for the daily counter key: the upcoming midnight UTC."""
        tomorrow = datetime.now(timezone.utc).date() + timedelta(days=1)
        return datetime.combine(tomorrow, time.min, tzinfo=timezone.utc)
Prometheus gauge: spacecom_spacetrack_budget_remaining — alert at < 100 remaining requests.
Exponential backoff and circuit breaker:
# ingest/tasks.py
@app.task(
    bind=True,
    autoretry_for=(SpaceTrackError, httpx.TimeoutException, httpx.ConnectError),
    retry_backoff=True,      # 2s, 4s, 8s, 16s, 32s ...
    retry_backoff_max=3600,  # cap at 1 hour
    retry_jitter=True,       # ±20% jitter per retry
    max_retries=5,           # task → DLQ on 6th failure
    acks_late=True,
)
def ingest_tle_catalog(self):
    if not circuit_breaker.is_closed("spacetrack"):
        raise SpaceTrackCircuitOpen("Circuit open — Space-Track unreachable")
    try:
        budget.consume(1)
        result = spacetrack_client.fetch_tle_catalog()
        circuit_breaker.record_success("spacetrack")
        return result
    except (SpaceTrackError, httpx.TimeoutException) as exc:
        circuit_breaker.record_failure("spacetrack")
        raise self.retry(exc=exc)
Circuit breaker config: open after 3 consecutive failures; half-open after 30 minutes; close after 1 successful probe. Implemented via pybreaker or equivalent. State stored in Redis for cross-worker visibility.
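A minimal in-memory sketch of the open / half-open / close policy stated above (production uses pybreaker or equivalent with state held in Redis for cross-worker visibility; the injectable clock here is purely for testability):

```python
import time

FAILURE_THRESHOLD = 3        # open after 3 consecutive failures
HALF_OPEN_AFTER_S = 30 * 60  # half-open after 30 minutes

class CircuitBreaker:
    """In-memory sketch of the policy; not the Redis-backed production breaker."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._opened_at: dict[str, float] = {}

    def is_closed(self, name: str) -> bool:
        opened = self._opened_at.get(name)
        if opened is None:
            return True
        # Half-open: allow a probe once the cool-down has elapsed
        return self._clock() - opened >= HALF_OPEN_AFTER_S

    def record_failure(self, name: str) -> None:
        self._failures[name] = self._failures.get(name, 0) + 1
        if self._failures[name] >= FAILURE_THRESHOLD:
            self._opened_at[name] = self._clock()  # (re)open the circuit

    def record_success(self, name: str) -> None:
        # One successful probe closes the circuit and clears the failure count
        self._failures.pop(name, None)
        self._opened_at.pop(name, None)
```

A failed half-open probe re-opens the circuit immediately, because the failure counter is still at the threshold when record_failure runs again.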
Session expiry handling:
Space-Track uses cookie-based sessions that expire after ~2 hours of inactivity. A 6-hour TLE poll interval guarantees session expiry between polls. The spacetrack library must be configured to re-authenticate transparently on 401/403:
# ingest/spacetrack.py
from datetime import datetime, timedelta

class SpaceTrackClient:
    def __init__(self):
        self._session_valid_until: datetime | None = None
        self._SESSION_TTL = timedelta(hours=1, minutes=45)  # conservative re-auth before expiry

    async def _ensure_authenticated(self):
        if self._session_valid_until is None or datetime.utcnow() >= self._session_valid_until:
            await self._authenticate()
            self._session_valid_until = datetime.utcnow() + self._SESSION_TTL
            spacecom_ingest_session_reauth_total.labels(source="spacetrack").inc()

    async def fetch_tle_catalog(self):
        await self._ensure_authenticated()
        # ... fetch logic
Metric spacecom_ingest_session_reauth_total{source="spacetrack"} distinguishes routine re-auth from genuine authentication failures. An alert fires if reauth_total increments more than once per hour (indicates session instability, not normal expiry).
Contract test (run on every CI pipeline against a live Space-Track response):
def test_spacetrack_tle_schema(spacetrack_client):
    response = spacetrack_client.query("gp", limit=1)
    required_keys = {"NORAD_CAT_ID", "TLE_LINE1", "TLE_LINE2", "EPOCH", "BSTAR", "OBJECT_NAME"}
    assert required_keys.issubset(response[0].keys()), f"Missing keys: {required_keys - response[0].keys()}"
Failure alerting: spacecom_ingest_success_total{source="spacetrack"} counter. AlertManager rules:
- Baseline: if the counter does not increment for 4 consecutive hours during expected polling windows → CRITICAL INGEST_SOURCE_FAILURE alert.
- Active TIP window: if spacecom_ingest_success_total{source="spacetrack", type="tip"} does not increment for > 10 minutes when spacecom_active_tip_events > 0 → immediate L1 page (bypasses the standard 4h threshold).
31.1.2 NOAA SWPC Space Weather
All endpoints are hardcoded constants in ingest/sources.py. Format is JSON for all P1 endpoints.
# ingest/sources.py
NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"
NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"
NOAA_DST_URL = "https://services.swpc.noaa.gov/json/geomag/dst/index.json"
NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"
ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"
Nowcast vs. forecast distinction: NRLMSISE-00 decay predictions spanning hours to days require different F10.7/Ap inputs depending on the prediction horizon. These must be stored separately and selected by the decay predictor at query time:
-- space_weather table: forecast_horizon_hours column required
-- (nullable: NULL marks the 81-day F10.7 average, so no NOT NULL constraint)
ALTER TABLE space_weather ADD COLUMN forecast_horizon_hours INTEGER DEFAULT 0;
-- 0 = nowcast (observed); 24/48/72 = NOAA 3-day forecast horizon; NULL = 81-day average
COMMENT ON COLUMN space_weather.forecast_horizon_hours IS
  '0=nowcast; 24/48/72=NOAA 3-day forecast; NULL=81-day F10.7 average for long-horizon use';
Decay predictor input selection rule (documented in model card and decay.py):
| Prediction horizon | F10.7 source | Ap source |
|---|---|---|
| t < 6h | Nowcast (horizon=0) | Nowcast (horizon=0) |
| 6h ≤ t < 72h | NOAA 3-day forecast (horizon=24/48/72) | NOAA 3-day forecast |
| t ≥ 72h | 81-day F10.7 average (horizon=NULL) | Storm-aware climatological Ap |
Beyond 72h: the NOAA forecast expires. The model uses the 81-day F10.7 average (a standard NRLMSISE-00 input) and the long-range uncertainty is reflected in wider Monte Carlo corridor bounds. This is documented in the model card under "Space Weather Input Uncertainty Beyond 72h".
ESA SWS Kp cross-validation decision rule: ESA SWS Kp is a cross-validation source, not a fallback. A decision rule is required when NOAA and ESA values diverge — without one, the cross-validation is observational only:
# ingest/space_weather.py
NOAA_ESA_KP_DIVERGENCE_THRESHOLD = 2.0  # Kp units; ADR-0018

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    """Select Kp value for NRLMSISE-00 input. Conservative-high on divergence."""
    divergence = abs(noaa_kp - esa_kp)
    if divergence > NOAA_ESA_KP_DIVERGENCE_THRESHOLD:
        structlog.get_logger().warning(
            "kp_source_divergence",
            noaa_kp=noaa_kp, esa_kp=esa_kp, divergence=divergence,
        )
        spacecom_kp_divergence_events_total.inc()
        # Conservative: higher Kp → denser atmosphere → shorter predicted lifetime → earlier alerting
        return max(noaa_kp, esa_kp)
    return noaa_kp  # NOAA is primary source
The threshold (2.0 Kp) and the conservative-high selection policy are documented in docs/adr/0018-kp-source-arbitration.md and reviewed by the physics lead. The spacecom_kp_divergence_events_total counter is monitored; a sustained rate of divergence warrants investigation of source calibration.
Schema contract test (CI):
def test_noaa_kp_schema(noaa_client):
    response = noaa_client.get_kp()
    assert isinstance(response, list) and len(response) > 0
    assert {"time_tag", "kp_index"}.issubset(response[0].keys())

def test_space_weather_forecast_horizon_stored(db_session):
    """Verify nowcast and forecast rows are stored with distinct horizon values."""
    nowcast = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=0).first()
    forecast_72 = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=72).first()
    assert nowcast is not None, "Nowcast row missing"
    assert forecast_72 is not None, "72h forecast row missing"
31.1.3 FIR Boundary Data
Source: EUROCONTROL AIRAC dataset (primary for ECAC states); FAA Digital-Terminal Procedures Publication (US); OpenAIP (fallback for non-AIRAC regions).
Format: GeoJSON FeatureCollection with properties.icao_id (FIR ICAO designator) and properties.name.
Update procedure (runs on each 28-day AIRAC cycle):
- Download new AIRAC dataset from EUROCONTROL (subscription required; credentials in secrets manager)
- Convert to GeoJSON via ingest/fir_loader.py
- Compare new boundaries against the current airspace table; log added/removed/changed FIRs to security_logs with event_type AIRSPACE_UPDATE
- Stage new boundaries in the airspace_staging table; run the intersection regression test against 10 known prediction corridors
- If regression passes: swap airspace and airspace_staging in a single transaction
- Record the update in the airspace_metadata table: airac_cycle, record_count, updated_at, updated_by
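The boundary-comparison step of the procedure can be sketched as a set diff. Keying FIRs by ICAO id and comparing by a geometry hash are both assumptions made for illustration:

```python
def diff_fir_sets(current: dict[str, str], staged: dict[str, str]) -> dict[str, list[str]]:
    """Compare FIR boundary sets keyed by ICAO id (values: geometry hash).
    The result feeds the AIRSPACE_UPDATE security_logs entry."""
    added = sorted(set(staged) - set(current))
    removed = sorted(set(current) - set(staged))
    changed = sorted(k for k in set(current) & set(staged) if current[k] != staged[k])
    return {"added": added, "removed": removed, "changed": changed}
```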
airspace_metadata table:
CREATE TABLE airspace_metadata (
id SERIAL PRIMARY KEY,
airac_cycle TEXT NOT NULL, -- e.g. "2026-03"
effective_date DATE NOT NULL,
expiry_date DATE NOT NULL, -- effective_date + 28 days; used for staleness detection
record_count INTEGER NOT NULL,
source TEXT NOT NULL, -- 'eurocontrol' | 'faa' | 'openaip'
updated_at TIMESTAMPTZ DEFAULT NOW(),
updated_by TEXT NOT NULL
);
AIRAC staleness detection: The AIRAC update procedure is manual — there is no automated mechanism to trigger it. Without monitoring, a missed cycle goes undetected for up to 28 days.
Required additions:
- Prometheus gauge: spacecom_airspace_airac_age_days = EXTRACT(EPOCH FROM NOW() - MAX(effective_date)) / 86400 from airspace_metadata. Alert rule:
- alert: AIRACAirspaceStale
  expr: spacecom_airspace_airac_age_days > 29
  for: 1h
  labels:
    severity: warning
  annotations:
    runbook_url: "https://spacecom.internal/docs/runbooks/fir-update.md"
    summary: "FIR boundary data is {{ $value }} days old — AIRAC cycle may be missed"
- GET /readyz integration: "airspace_stale" is added to the degraded array when airac_age_days > 28 (already incorporated into the §26.5 readyz check above).
- FIR update runbook (docs/runbooks/fir-update.md) is a Phase 1 deliverable — it must exist before shadow deployment. Add it to the Phase 1 DoD runbook checklist alongside secrets-rotation-jwt.md.
31.1.4 TLE Validation Gate
Before any TLE record is written to the database, ingest/cross_validator.py enforces:
def validate_tle(line1: str, line2: str) -> TLEValidationResult:
    errors = []
    if len(line1) != 69:
        errors.append(f"Line 1 length {len(line1)} != 69")
    if len(line2) != 69:
        errors.append(f"Line 2 length {len(line2)} != 69")
    if not _tle_checksum_valid(line1):
        errors.append("Line 1 checksum failed")
    if not _tle_checksum_valid(line2):
        errors.append("Line 2 checksum failed")
    epoch = _parse_epoch(line1[18:32])
    if epoch is None:
        errors.append("Epoch field invalid")
    # B* is stored in TLE exponential notation (e.g. " 11606-4" = 0.11606e-4),
    # so it cannot be parsed with a bare float() call
    bstar = _parse_bstar(line1[53:61])
    perigee_km = _perigee_from_line2(line2)  # from mean motion + eccentricity
    # Finding 10: BSTAR validation revised
    # Lower bound removed: valid high-density objects (e.g. tungsten sphere) have B* << 0.0001
    # Zero or negative B* is physically meaningless (negative drag) → hard reject
    if bstar <= 0.0:
        errors.append(f"BSTAR {bstar} is zero or negative — physically invalid")
    elif bstar > 0.5:
        # Physically implausible at altitude > 300 km; log warning but do not reject
        log_security_event("TLE_VALIDATION_WARNING", {
            "tle": [line1, line2], "reason": "HIGH_BSTAR", "bstar": bstar
        }, level="WARNING")
        # Hard reject only the impossible combination: very high drag at high altitude
        if perigee_km > 300:
            errors.append(f"BSTAR {bstar} implausible for perigee {perigee_km:.0f} km — high drag at high altitude")
    if errors:
        log_security_event("INGEST_VALIDATION_FAILURE", {"tle": [line1, line2], "errors": errors})
        return TLEValidationResult(valid=False, errors=errors)
    return TLEValidationResult(valid=True)
31.2 CCSDS Format Specifications
31.2.1 OEM (Orbit Ephemeris Message) — CCSDS 502.0-B-3
Emitted by GET /space/objects/{norad_id}/ephemeris when Accept: application/ccsds-oem.
Header keyword population:
| Keyword | Value | Source |
|---|---|---|
| CCSDS_OEM_VERS | 3.0 | Fixed |
| CREATION_DATE | ISO 8601 UTC timestamp | datetime.utcnow() |
| ORIGINATOR | SPACECOM | Fixed |
| OBJECT_NAME | objects.name | DB |
| OBJECT_ID | COSPAR designator if known; NORAD-<norad_id> otherwise | DB |
| CENTER_NAME | EARTH | Fixed |
| REF_FRAME | GCRF | Fixed — SpaceCom frame transform output |
| TIME_SYSTEM | UTC | Fixed |
| START_TIME | Query start parameter | Request |
| STOP_TIME | Query end parameter | Request |
Unknown fields: Any keyword for which SpaceCom holds no data is emitted as N/A per CCSDS 502.0-B-3 §4.1.
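Emission with the N/A fallback can be sketched as below. This deliberately flattens the real OEM layout (version line plus META_START/META_STOP metadata block) into a single keyword list for illustration; names are assumptions:

```python
OEM_HEADER_ORDER = [
    "CCSDS_OEM_VERS", "CREATION_DATE", "ORIGINATOR", "OBJECT_NAME", "OBJECT_ID",
    "CENTER_NAME", "REF_FRAME", "TIME_SYSTEM", "START_TIME", "STOP_TIME",
]

def render_oem_header(values: dict[str, str]) -> list[str]:
    """KEYWORD = value lines; any keyword SpaceCom holds no data for becomes N/A."""
    return [f"{kw} = {values.get(kw) or 'N/A'}" for kw in OEM_HEADER_ORDER]
```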
31.2.2 CDM (Conjunction Data Message) — CCSDS 508.0-B-1
Emitted by GET /space/export/bulk?format=ccsds-cdm.
Field population table (abbreviated):
| Field | Populated? | Source |
|---|---|---|
| CREATION_DATE | Yes | datetime.utcnow() |
| ORIGINATOR | Yes | SPACECOM |
| TCA | Yes | SpaceCom conjunction screener |
| MISS_DISTANCE | Yes | SpaceCom conjunction screener |
| COLLISION_PROBABILITY | Yes | SpaceCom Alfano Pc |
| COLLISION_PROBABILITY_METHOD | Yes | ALFANO-2005 |
| OBJ1/2 COVARIANCE_* | Conditional | From Space-Track CDM if available; N/A for debris without covariance |
| OBJ1/2 RECOMMENDED_OD_SPAN | No | N/A — SpaceCom does not hold OD span |
| OBJ1/2 SEDR | No | N/A |
CDM ingestion and Pc reconciliation: When a Space-Track CDM is ingested for an object that SpaceCom has also screened, both Pc values are stored:
- conjunctions.pc_spacecom — SpaceCom Alfano result
- conjunctions.pc_spacetrack — from ingested CDM
- conjunctions.pc_discrepancy_flag — set TRUE when abs(log10(pc_spacecom/pc_spacetrack)) > 1 (order-of-magnitude difference)
The conjunction panel displays both values with their provenance labels. When pc_discrepancy_flag = TRUE, a DATA_CONFIDENCE warning callout is shown explaining possible causes (different epoch, different covariance source, different Pc method).
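The discrepancy-flag rule, as a pure helper. The handling of zero or absent Pc values is an assumption added for the sketch — log10 is undefined there, so it falls back to direct comparison:

```python
import math

def pc_discrepancy(pc_spacecom: float, pc_spacetrack: float) -> bool:
    """TRUE when the two Pc values differ by more than an order of magnitude."""
    if pc_spacecom <= 0.0 or pc_spacetrack <= 0.0:
        # log10 undefined; treat a zero/absent Pc on one side only as a discrepancy
        return pc_spacecom != pc_spacetrack
    return abs(math.log10(pc_spacecom / pc_spacetrack)) > 1.0
```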
31.2.3 RDM (Re-entry Data Message) — CCSDS 508.1-B-1
Emitted by GET /reentry/predictions/{prediction_id}/export?format=ccsds-rdm.
Planned population rules:
- SpaceCom populates creation metadata, object identifiers, prediction provenance, prediction epoch, and the primary predicted re-entry time range from the active prediction record.
- Where the active prediction carries prediction_conflict = TRUE, the export includes both the primary SpaceCom range and the conservative union range used for aviation-facing products, with explicit conflict provenance.
- Corridor, fragment-cloud, and air-risk annotations are included only when supported by the active model version and marked with the model version identifier used to generate them.
- Unknown optional fields are emitted as N/A rather than silently omitted, matching the CCSDS handling already used for OEM/CDM unknowns.
- Raw upstream TIP or third-party reference messages are not overwritten; they remain separate provenance sources and are cross-referenced in the export metadata and audit trail.
31.3 WebSocket Event Reference
Full event type catalogue for WS /ws/events. All events share the envelope:
{
"type": "alert.new",
"seq": 1042,
"ts": "2026-03-17T14:23:01.123Z",
"org_id": 7,
"data": { ... }
}
Event type specifications:
alert.new
data: {alert_id, level, norad_id, object_name, fir_ids[], predicted_reentry_utc, corridor_wkt}
alert.acknowledged
data: {alert_id, acknowledged_by_name, note_preview (first 80 chars), acknowledged_at}
alert.superseded
data: {old_alert_id, new_alert_id, reason}
prediction.updated
data: {prediction_id, norad_id, p50_utc, p05_utc, p95_utc, supersedes_id (nullable), corridor_wkt}
tip.new
data: {norad_id, object_name, tip_epoch, predicted_reentry_utc, source_label ("USSPACECOM TIP")}
ingest.status
data: {source, status ("ok"|"failed"), record_count (nullable), next_run_at, failure_reason (nullable)}
spaceweather.change
data: {old_status, new_status, kp, f107, recommended_buffer_hours}
resync_required
data: {reason ("reconnect_too_stale"), last_known_seq}
Reconnection protocol:
- Client stores the last received seq
- On reconnect: upgrade with ?since_seq=<last_seq>
- Server delivers all events with seq > last_seq from a 5-minute / 200-event ring buffer
- If the gap is too large: server sends {"type": "resync_required"}; client must call REST endpoints to re-fetch current state before resuming WebSocket consumption
Simulation/Replay isolation: During SIMULATION or REPLAY mode, the client is connected to WS /ws/simulation/{session_id} instead of WS /ws/events. No LIVE events are delivered while in a simulation session.
31.4 Alert Webhook Specification
Registration:
POST /api/v1/webhooks
Content-Type: application/json
Authorization: Bearer <admin_jwt>
{
"url": "https://ansp-dispatch.example.com/spacecom/hook",
"events": ["alert.new", "tip.new"],
"secret": "webhook_shared_secret_min_32_chars"
}
Response includes webhook_id. The secret is bcrypt-hashed before storage; the plaintext is never retrievable after registration.
Delivery:
POST https://ansp-dispatch.example.com/spacecom/hook
Content-Type: application/json
X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, raw_body)>
X-SpaceCom-Event: alert.new
X-SpaceCom-Delivery: <uuid>
{ "type": "alert.new", "seq": 1042, ... }
Receiver verification (example):
import hmac, hashlib

def verify_signature(secret: str, body: bytes, header_sig: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
Retry and status lifecycle:
| State | Condition | Action |
|---|---|---|
| active | Deliveries succeeding | Normal operation |
| degraded | 3 consecutive delivery failures | Org admin notified by email; deliveries continue |
| disabled | 10 consecutive delivery failures | No further deliveries; manual re-enable via PATCH /webhooks/{id} required |
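The failure-count thresholds in the table map to states as a pure function (a sketch; the production lifecycle also covers the manual re-enable transition via PATCH and the counter reset on a successful delivery):

```python
DEGRADED_AFTER = 3
DISABLED_AFTER = 10

def webhook_state(consecutive_failures: int) -> str:
    """Map a webhook's consecutive delivery failure count to its lifecycle state."""
    if consecutive_failures >= DISABLED_AFTER:
        return "disabled"
    if consecutive_failures >= DEGRADED_AFTER:
        return "degraded"
    return "active"
```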
31.5 Interoperability Decision Log
| Decision | Chosen | Rationale |
|---|---|---|
| ADS-B source | OpenSky Network REST API | Free, global, sufficient for Phase 3 route overlay; upgrade path to FAA SWIM ADS-B if coverage gaps emerge |
| CCSDS OEM reference frame | GCRF | SpaceCom frame transform pipeline output; downstream tools expect GCRF |
| CCSDS CDM unknown fields | N/A per CCSDS 508.0-B-1 §4.3 | Silent omission causes downstream parser failures; N/A is the standard sentinel |
| CDM Pc reconciliation | Both Space-Track CDM Pc and SpaceCom Pc displayed with provenance; discrepancy flag on order-of-magnitude difference | Transparency over false precision; operators need to see the discrepancy, not have SpaceCom silently override it |
| FIR update mechanism | Staging table swap + regression test on 28-day AIRAC cycle | Direct overwrite during a live TIP event would corrupt ongoing airspace intersection queries |
| WebSocket event schema | Typed envelope with type discriminator + monotonic seq | Enables typed client generation; seq enables reliable missed-event recovery |
| Webhook signature | HMAC-SHA256 with sha256= prefix (same convention as GitHub webhooks) | Operators will already know this pattern; reduces integration friction |
| SWIM integration timing | Phase 2: GeoJSON export; Phase 3: FIXM review + AMQP endpoint | Full SWIM-TI requires EUROCONTROL B2B account and FIXM extension work — not Phase 1/2 blocking |
| API versioning | /api/v1 base; 6-month parallel support on breaking changes; RFC 8594 headers | Space operators need stable contracts; 6-month overlap is industry standard for operational API changes |
| Space weather format | JSON REST endpoints (not legacy ASCII FTP) | ASCII FTP format is brittle; NOAA SWPC JSON API is stable and machine-readable; contract test catches format changes |
32. Ethics / Algorithmic Accountability
SpaceCom makes algorithmic predictions that inform operational airspace decisions. False negatives are catastrophic; false positives cause economic disruption and erode operator trust. This section documents the accountability framework that governs how the prediction model is specified, validated, changed, and monitored.
Applicable frameworks: IEEE 7001-2021 (Transparency of Autonomous Systems), NIST AI RMF (Govern/Map/Measure/Manage), ICAO Safety Management (Annex 19), ECSS-Q-ST-80C (Software Product Assurance).
32.1 Decay Predictor Model Card
The model card is a living document maintained at docs/model-card-decay-predictor.md. It is a required artefact for ESA Phase 2 TRL demonstrations and ANSP SMS acceptance. It must be updated whenever the model version changes.
Required sections:
# Decay Predictor Model Card — SpaceCom v<X.Y.Z>
## Model summary
Numerical decay predictor using RK7(8) adaptive integrator + NRLMSISE-00 atmospheric
density model + J2–J6 geopotential + solar radiation pressure. Monte Carlo uncertainty
via 500-sample ensemble varying F10.7 (±20%), Ap, and B* (±10%).
## Validated orbital regime
- Perigee altitude: 100–600 km
- Inclination: 0–98°
- Object type: rocket bodies and payloads with RCS > 0.1 m²
- B* range: 0.0001–0.3
- Area-to-mass ratio: 0.005–0.04 m²/kg
## Known out-of-distribution inputs (ood_flag triggers)
| Parameter | OOD condition | Expected behaviour |
|-----------|--------------|-------------------|
| Area-to-mass ratio | > 0.04 m²/kg | Underestimates atmospheric drag; re-entry time predicted too late |
| data_confidence | 'unknown' | Physical properties estimated from object type defaults; wide systematic uncertainty |
| TLE count in history | < 5 TLEs in last 30 days | B* estimate unreliable; uncertainty may be significantly underestimated |
| Perigee altitude | < 100 km | Object may already be in final decay corridor; NRLMSISE-00 not calibrated below 100 km |
## Performance characterisation
(Updated from backcast validation report — see MinIO docs/backcast-validation-v<X>.pdf)
| Object category | N backcasts | p50 error (median) | p50 error (95th pct) | Corridor containment |
|----------------|-------------|-------------------|---------------------|---------------------|
| Rocket bodies, RCS > 2 m² | TBD | TBD | TBD | TBD |
| Payloads, RCS 0.5–2 m² | TBD | TBD | TBD | TBD |
| Small debris / unknown RCS | TBD (underrepresented) | TBD | TBD | TBD |
## Known systematic biases
- NRLMSISE-00 underestimates atmospheric density during geomagnetic storms at altitudes 200–350 km.
Effect: predictions during Kp > 5 events tend to predict re-entry slightly later than observed.
Mitigation: space weather buffer recommendation adds ≥2h beyond p95 during Elevated/Severe/Extreme conditions.
- Tumbling objects: effective drag area unknown; B* from TLEs reflects tumble-averaged drag.
Effect: uncertainty may be systematically underestimated for highly elongated objects.
- Calibration data bias: validation events are dominated by large well-tracked objects from major launch
programmes. Small debris and objects from less-tracked orbital regimes are underrepresented.
## Not intended for
- Objects with perigee < 100 km (already in terminal descent corridor)
- Crewed vehicles (use mission-specific tools)
- Objects undergoing active manoeuvring
- Predictions beyond 21 days (F10.7 forecast skill degrades sharply beyond 3 days)
32.2 Backcast Validation Requirements
Phase 1 minimum: ≥3 historical re-entries selected from The Aerospace Corporation observed re-entry database. Selection criteria documented.
Phase 2 target: ≥10 historical re-entries. The validation report (docs/backcast-validation-v<X>.pdf) must explicitly:
- Document selection criteria — which events were chosen and why. Selection must include at least one event from each of: rocket bodies, payloads, and at least one high-area-to-mass object if available.
- Flag underrepresented categories — explicitly state which object types have < 3 validation events and what the implication is for accuracy claims in those categories.
- State accuracy as conditional — not "p50 accuracy is ±2h" but "for rocket bodies (N=7): median p50 error is 1.8h; for payloads (N=3): median p50 error is 3.1h; for small debris (N=0): no validation data available."
- Include negative results — events where the p95 corridor did not contain the observed impact point must be included and analysed.
- Compare across model versions — each new validation report must include a comparison table against the previous version's results.
The validation report is generated by modules.feedback and stored in MinIO docs/ bucket with a version tag matching the model version.
32.3 Out-of-Distribution Detection
At prediction creation time, propagator/decay.py evaluates each input object against the OOD bounds defined in docs/ood-bounds.md and sets reentry_predictions.ood_flag and ood_reason accordingly.
OOD checks (initial set — update in docs/ood-bounds.md as model is validated):
def check_ood(obj: ObjectParams) -> tuple[bool, list[str]]:
    reasons = []
    if obj.area_to_mass_ratio is not None and obj.area_to_mass_ratio > 0.04:
        reasons.append("high_am_ratio")
    if obj.data_confidence == "unknown":
        reasons.append("low_data_confidence")
    if obj.tle_count_last_30d is not None and obj.tle_count_last_30d < 5:
        reasons.append("sparse_tle_history")
    if obj.perigee_km is not None and obj.perigee_km < 100:
        reasons.append("sub_100km_perigee")
    if obj.bstar is not None and not (0.0001 <= obj.bstar <= 0.3):
        reasons.append("bstar_out_of_range")
    return len(reasons) > 0, reasons
UI presentation when ood_flag = TRUE:
⚠ OUT-OF-CALIBRATION-RANGE PREDICTION
──────────────────────────────────────────────────────────────
This prediction uses inputs outside the model's validated range:
• high_am_ratio — effective drag may be underestimated
• low_data_confidence — physical properties estimated from defaults
Timing uncertainty may be significantly larger than shown.
For operational planning, treat the p95 window as a minimum bound.
[What does this mean? →]
──────────────────────────────────────────────────────────────
The callout is mandatory and non-dismissable. It appears above the prediction panel wherever the prediction is displayed. It does not prevent the prediction from being used — operators retain full autonomy.
32.4 Recalibration Governance
The modules.feedback pipeline computes atmospheric density scaling coefficients from observed re-entry outcomes recorded in prediction_outcomes. Updating these coefficients changes all future predictions.
Recalibration procedure:
- Trigger: Automated check in the feedback pipeline flags when the last 10 outcomes show a systematic bias (median p50 error > 1.5× the historical baseline).
- Candidate coefficients: New coefficients computed from the full prediction_outcomes history using a hold-out split (80% train / 20% hold-out). Hold-out set is fixed and never used in training.
- Validation gate: New coefficients must achieve:
  - ≥ 5% improvement in median p50 error on the hold-out set
  - No regression (> 10% worsening) on any validated object type category
  - Corridor containment rate ≥ 95% on the hold-out set
- Sign-off: Physics lead + engineering lead both must approve via PR review. PR includes the validation comparison table.
- Active prediction handling: Before deployment, a batch job re-runs all active predictions (status = active, not superseded) using the new coefficients. Each re-run creates a new prediction record linked via superseded_by. ANSPs with active shadow deployments receive an automated notification: "SpaceCom model recalibrated — active predictions updated. Previous predictions superseded. New model version: X.Y.Z."
- Rollback: If a post-deployment accuracy regression is detected, the previous coefficient set is restored via the same procedure (treated as a new recalibration). The rollback is logged to security_logs with event_type MODEL_ROLLBACK.
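The validation gate can be checked mechanically. A sketch, assuming the thresholds listed in the procedure (5% median-error improvement, 10% per-category regression cap, 95% containment) and a per-category change expressed as a signed fraction where positive means worse:

```python
def recalibration_gate(
    old_median_error_min: float,
    new_median_error_min: float,
    per_category_change: dict[str, float],  # signed fraction; +0.10 = 10% worse
    new_containment_rate: float,
) -> tuple[bool, list[str]]:
    """Apply the three hold-out criteria; returns (passed, failed_criteria)."""
    failed = []
    improvement = (old_median_error_min - new_median_error_min) / old_median_error_min
    if improvement < 0.05:
        failed.append("median_p50_improvement_below_5pct")
    if any(change > 0.10 for change in per_category_change.values()):
        failed.append("category_regression_over_10pct")
    if new_containment_rate < 0.95:
        failed.append("containment_below_95pct")
    return len(failed) == 0, failed
```

The returned failure list is what would land in the PR's validation comparison table for the sign-off step.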
32.5 Model Version Governance
Version classification:
| Classification | Examples | Active prediction re-run? | ANSP notification required? |
|---|---|---|---|
| Patch | Documentation update, logging improvement, no physics change | No | No |
| Minor | Performance improvement, OOD bound adjustment, new object type support | No (optional for analyst review) | Yes — changelog summary |
| Major | Integrator change, density model change, MC parameter change, recalibration | Yes — all active predictions superseded | Yes — written notice to all shadow deployment partners; 2-week notice before deployment |
Version string: Semantic version (MAJOR.MINOR.PATCH) embedded in every prediction record at creation time as model_version. The currently deployed version is exposed via GET /api/v1/system/model-version.
Cross-version prediction display: When a prediction was made with a model version that differs from the current deployed version by a major bump, the UI shows:
ℹ Prediction generated with model v1.2.0 — current model is v2.0.0 (major update).
This prediction reflects older parameters. Re-run recommended for operational planning.
[Re-run with current model →]
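The major-bump check behind this callout is a semantic-version comparison. A sketch, assuming MAJOR.MINOR.PATCH strings with an optional leading "v" (the function name is illustrative):

```python
def needs_major_version_notice(prediction_version: str, deployed_version: str) -> bool:
    """True when the deployed model differs from the prediction's model by a major bump."""
    pred_major = int(prediction_version.lstrip("v").split(".")[0])
    deployed_major = int(deployed_version.lstrip("v").split(".")[0])
    return deployed_major != pred_major
```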
32.6 Adverse Outcome Monitoring
Continuous monitoring of prediction accuracy post-deployment is a regulatory credibility requirement. It is also the primary input to the recalibration pipeline.
Data flow:
- Analyst logs observed re-entry outcome via POST /api/v1/predictions/{id}/outcome after post-event analysis (source: The Aerospace Corporation observed re-entry database, US18SCS reports, or ESA ESOC confirmation)
- A prediction_outcomes record is created with p50_error_minutes, corridor_contains_observed, fir_false_positive, fir_false_negative
- Feedback pipeline runs weekly: aggregates outcomes, computes rolling accuracy metrics, flags systematic biases
- Grafana Model Accuracy dashboard shows: rolling 90-day median p50 error, corridor containment rate, false positive rate (CRITICAL alerts with no confirmed hazard), false negative rate (confirmed hazard with no CRITICAL alert)
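The rolling metrics can be aggregated from prediction_outcomes rows as below. A sketch: field names follow the record definition above, and the false positive/negative rates here are simple proportions over recorded outcomes, not the alert-conditioned rates shown on the dashboard:

```python
import statistics

def accuracy_metrics(outcomes: list[dict]) -> dict[str, float]:
    """Aggregate prediction_outcomes rows into rolling accuracy metrics."""
    errors = [o["p50_error_minutes"] for o in outcomes]
    contained = [o["corridor_contains_observed"] for o in outcomes]
    return {
        "median_p50_error_minutes": statistics.median(errors),
        "containment_rate": sum(contained) / len(contained),
        "fir_false_positive_rate": sum(o["fir_false_positive"] for o in outcomes) / len(outcomes),
        "fir_false_negative_rate": sum(o["fir_false_negative"] for o in outcomes) / len(outcomes),
    }
```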
Quarterly transparency report: Generated automatically from prediction_outcomes. Contains aggregate (non-personal) data:
- Total predictions served in the quarter
- Number of outcomes recorded (and percentage — coverage of the total)
- Median p50 error, 95th percentile error
- Corridor containment rate
- False positive rate (CRITICAL alerts with no confirmed hazard) and estimated false negative rate
- Known model limitations summary (from model card)
- Model version(s) active during the quarter
Report stored in MinIO public-reports/ bucket and made available on SpaceCom's public documentation site. The report is a Phase 3 deliverable.
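The aggregation behind the quarterly report is straightforward; a hedged sketch follows, using an in-memory list of outcome dicts as a stand-in for the SQL query over prediction_outcomes (field names per §32.6; the function name is illustrative):

```python
# Sketch: quarterly transparency metrics from prediction_outcomes rows.
# The list of dicts stands in for the real SQL aggregation; field names per §32.6.
from statistics import median, quantiles

def quarterly_metrics(outcomes: list[dict], total_predictions: int) -> dict:
    errors = [abs(o["p50_error_minutes"]) for o in outcomes]
    n = len(outcomes)
    return {
        "predictions_served": total_predictions,
        "outcomes_recorded": n,
        "coverage_pct": round(100 * n / total_predictions, 1),
        "median_p50_error_min": median(errors),
        "p95_error_min": quantiles(errors, n=20)[-1],  # 95th percentile cut point
        "corridor_containment_rate": sum(o["corridor_contains_observed"] for o in outcomes) / n,
        "false_positive_rate": sum(o["fir_false_positive"] for o in outcomes) / n,
        "false_negative_rate": sum(o["fir_false_negative"] for o in outcomes) / n,
    }
```

Reporting coverage (outcomes recorded as a fraction of predictions served) alongside the accuracy figures matters: a high containment rate over a small, self-selected outcome sample is weaker evidence than the same rate over near-complete coverage.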
32.7 Geographic Coverage Quality
FIR intersection quality varies by boundary data source. Operators in non-ECAC regions receive lower-quality airspace intersection assessments than European counterparts. This disparity must be acknowledged, not hidden.
Coverage quality levels:
| Source | Coverage quality | Regions |
|---|---|---|
| EUROCONTROL AIRAC | High | All ECAC states (Europe, Turkey, Israel, parts of North Africa) |
| FAA Digital-Terminal Procedures | High | Continental US, Alaska, Hawaii, US territories |
| OpenAIP | Medium | Global fallback; community-maintained; may lag AIRAC |
| Manual / not loaded | Low | Any region where no FIR data has been imported |
The airspace table has a coverage_quality column (high / medium / low). The airspace intersection API response includes coverage_quality per affected FIR. The UI shows a coverage quality callout on the airspace impact table when any affected FIR is medium or low:
ℹ FIR boundary quality: MEDIUM (OpenAIP source)
Intersection calculations for this region use community-maintained boundary data.
Verify with official AIRAC charts before operational use.
32.8 Ethics Accountability Decision Log
| Decision | Chosen | Rationale |
|---|---|---|
| Model card | Required artefact; maintained alongside model in docs/ | Regulators and ANSPs need a documented operational envelope; ESA TRL process requires it |
| Backcast accuracy statement | Conditional on object type; selection bias explicitly documented | Single unconditional figure misrepresents model generalisation to non-specialist audiences |
| OOD detection | Evaluated at prediction time; ood_flag + UI warning callout; prediction still served | Operators retain autonomy; OOD flag informs rather than blocks; hiding it would create false confidence |
| Recalibration governance | Hold-out validation + dual sign-off + active prediction re-run + ANSP notification | Ungoverned recalibration is an ungoverned change to a safety-critical model |
| Alert threshold governance | Documented rationale; change requires PR review + 2-week shadow validation + ANSP notification | Threshold values are consequential algorithmic decisions; they must be as auditable as code changes |
| Prediction staleness warning | prediction_valid_until = p50 - 4h; warning independent of system health banner | A prediction for an imminent re-entry event has growing implicit uncertainty; operators need a signal |
| Adverse outcome monitoring | prediction_outcomes table; weekly pipeline; quarterly public report | Without outcome data, performance claims are assertions not evidence; public report builds regulatory trust |
| FIR coverage disparity | coverage_quality column on airspace; disclosed per-FIR in intersection results | Hiding coverage quality differences from operators would be a form of false precision |
| False positive / negative framing | Both tracked in prediction_outcomes; both in quarterly report | Optimising only for one error type can silently worsen the other; both must be visible |
| Public transparency report | Aggregate accuracy data; no personal data; quarterly cadence | Aviation safety infrastructure operates in a regulated transparency environment; SpaceCom must too |
33. Technical Writing / Documentation Engineering
33.1 Documentation Principles
SpaceCom documentation has three distinct audiences with different needs:
| Audience | Primary docs | Format |
|---|---|---|
| Engineers building the system | ADRs, inline docstrings, test plan, AGENTS.md | Markdown in repo |
| Operators using the system | User guides, API guide, in-app help | Hosted docs site / PDF |
| Regulators and auditors | Model card, validation reports, runbooks, CHANGELOG | Formal documents; version-controlled |
Documentation that serves the wrong audience in the wrong format fails both audiences. The §12.1 docs/ directory tree encodes this separation by subdirectory.
33.2 Architecture Decision Record (ADR) Standard
Format: MADR — Markdown Architectural Decision Records. Lightweight, git-friendly, no tooling dependency.
File naming: docs/adr/NNNN-short-title.md where NNNN is a zero-padded sequence number.
Template:
# NNNN — <Title>
**Status:** Accepted | Superseded by [MMMM](MMMM-title.md) | Deprecated
## Context
<What is the issue or design question this decision addresses? What forces are at play?>
## Decision
<What was decided?>
## Consequences
**Positive:** <What does this decision make easier or better?>
**Negative / trade-offs:** <What does this decision make harder or require accepting?>
**Neutral:** <Other effects worth noting>
## Alternatives considered
| Alternative | Why rejected |
|-------------|-------------|
| ... | ... |
Linking from code: When a code section implements a non-obvious decision, add an inline comment: # See docs/adr/0003-monte-carlo-chord-pattern.md. This makes the rationale discoverable from the code, not just from the plan.
Required initial ADR set (Phase 1):
| ADR | Decision |
|---|---|
| 0001 | RS256 asymmetric JWT over HS256 |
| 0002 | Dual front-door architecture (aviation + space portals) |
| 0003 | Monte Carlo chord pattern (Celery group + chord) |
| 0004 | GEOGRAPHY vs GEOMETRY spatial column types |
| 0005 | lazy="raise" on all SQLAlchemy relationships |
| 0006 | TimescaleDB chunk intervals (orbits: 1 day, space_weather: 30 days) |
| 0007 | CesiumJS commercial licence requirement |
| 0008 | PgBouncer transaction-mode pooling |
| 0009 | CCSDS OEM GCRF reference frame |
| 0010 | Alert threshold rationale (6h CRITICAL, 24h HIGH) |
33.3 OpenAPI Documentation Standard
FastAPI auto-generates OpenAPI 3.1 schema from Python type annotations. Auto-generation is necessary but not sufficient. The following requirements are enforced by CI.
Per-endpoint requirements:
@router.get(
"/reentry/predictions/{id}",
summary="Get re-entry prediction by ID",
description=(
"Returns a single re-entry prediction with HMAC integrity verification. "
"If the prediction's HMAC fails verification, returns 503 — do not use the data. "
"Requires `viewer` role minimum. OOD-flagged predictions include a warning field."
),
tags=["Re-entry"],
responses={
200: {"description": "Prediction returned; check `integrity_failed` field"},
401: {"description": "Not authenticated"},
403: {"description": "Insufficient role"},
404: {"description": "Prediction not found or belongs to another organisation"},
503: {"description": "HMAC integrity check failed — prediction data is untrusted"},
},
)
async def get_prediction(id: int, ...):
CI enforcement: A pytest fixture iterates the FastAPI app's routes and asserts that description is non-empty for every route with path starting /api/v1/. Fails CI with a list of non-compliant endpoints.
Rate limiting documentation: Endpoints with rate limits include the limit in the description field: "Rate limited: 10 requests/minute per user. Returns 429 with Retry-After header when exceeded."
33.4 Runbook Standard
Template (docs/runbooks/TEMPLATE.md):
# Runbook: <Title>
**Severity:** SEV-1 | SEV-2 | SEV-3 | SEV-4
**Owner:** <team or role>
**Last reviewed:** YYYY-MM-DD
**Estimated duration:** <X minutes>
## Trigger condition
<What condition causes this runbook to be needed? What alert or observation triggers it?>
## Preconditions
- [ ] You have SSH access to the production host
- [ ] <other preconditions>
## Steps
1. <First step — be specific; include exact commands>
2. <Second step>
```bash
# exact command with expected output noted
docker compose ps
```
3. ...

## Verification
<How do you confirm the runbook was successful? What does healthy state look like?>

## Rollback
<If the steps made things worse, how do you undo them?>

## Notify
- Engineering lead notified (Slack #incidents)
- On-call via PagerDuty if SEV-1/2
- ANSP partners notified if operational disruption (template: docs/runbooks/ansp-notification-template.md)
**Runbook index** (`docs/runbooks/README.md`):
| Runbook | Severity | Owner | Last reviewed |
|---------|----------|-------|--------------|
| `db-failover.md` | SEV-1 | Platform | Phase 3 |
| `celery-recovery.md` | SEV-2 | Platform | Phase 3 |
| `hmac-failure.md` | SEV-1 | Security | Phase 1 |
| `ingest-failure.md` | SEV-2 | Platform | Phase 1 |
| `gdpr-breach-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `safety-occurrence-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `secrets-rotation-jwt.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-spacetrack.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-hmac.md` | SEV-1 | Engineering Lead | Phase 2 |
| `blue-green-deploy.md` | SEV-3 | Platform | Phase 3 |
| `restore-from-backup.md` | SEV-2 | Platform | Phase 2 |
33.5 Docstring Standard
All public functions in the following modules must have Google-style docstrings:
`propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `breakup/atmospheric.py`, `conjunction/probability.py`, `integrity.py`, `frame_utils.py`, `time_utils.py`.
**Required docstring sections:** `Args` (with physical units for all dimensional quantities), `Returns`, `Raises`, and `Notes` (for numerical limitations or known edge cases).
```python
def integrate_trajectory(
object_id: int,
f107: float,
bstar: float,
params: dict,
) -> TrajectoryResult:
"""Integrate a single RK7(8) decay trajectory from current epoch to re-entry.
Uses NRLMSISE-00 atmospheric density model with J2–J6 geopotential and
solar radiation pressure. Terminates at 80 km altitude (configurable via
params['termination_altitude_km']).
Args:
object_id: NORAD catalog number of the decaying object.
f107: Solar flux index (10.7 cm) in solar flux units (sfu).
Valid range: 65–300 sfu. Values outside this range are accepted
but produce extrapolated NRLMSISE-00 results (see docs/ood-bounds.md).
bstar: BSTAR drag term from TLE (units: 1/Earth_radius).
Valid range: 0.0001–0.3 per docs/ood-bounds.md.
params: Simulation parameters dict. Required keys:
'mc_samples' (int), 'termination_altitude_km' (float, default 80.0).
Returns:
TrajectoryResult with fields: reentry_time (UTC datetime),
impact_lat_deg (float), impact_lon_deg (float), final_velocity_ms (float).
Raises:
IntegrationDivergenceError: If the integrator step size shrinks below
1e-6 seconds (indicates numerical instability — log and flag as OOD).
ValueError: If object_id is not in the database.
Notes:
NRLMSISE-00 is calibrated for 100–600 km altitude. Below 100 km the
density is extrapolated and uncertainty grows significantly. The OOD
flag is set by the caller based on ood-bounds.md thresholds, not here.
"""
Enforcement: mypy pre-commit hook enforces no untyped function signatures. A separate CI check using pydocstyle or ruff with docstring rules enforces non-empty docstrings on public functions in the listed modules.
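One possible ruff configuration for this gate (a sketch; the key names follow ruff's pydocstyle settings, and the per-file-ignores pattern is illustrative):

```toml
# pyproject.toml — illustrative ruff docstring enforcement (sketch)
[tool.ruff.lint]
select = ["D"]              # pydocstyle rule family

[tool.ruff.lint.pydocstyle]
convention = "google"       # matches the §33.5 docstring standard

[tool.ruff.lint.per-file-ignores]
"tests/*" = ["D"]           # docstring rules apply to source modules, not tests
```

In practice the `select` would be scoped to the safety-critical modules listed above rather than the whole tree, either via per-file-ignores or a dedicated ruff invocation in CI.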
33.6 CHANGELOG.md Format
Follows Keep a Changelog conventions. Human-maintained — not auto-generated from commit messages.
# Changelog
All notable changes to SpaceCom are documented here.
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
## [Unreleased]
## [1.0.0] — 2026-MM-DD
### Added
- Re-entry decay predictor (RK7(8) + NRLMSISE-00 + Monte Carlo 500 samples)
- Percentile corridor visualisation (Mode A)
- Space weather widget (NOAA SWPC + ESA SWS cross-validation)
- CRITICAL/HIGH/MEDIUM/LOW alert system with two-step CRITICAL acknowledgement
- Shadow mode with per-org legal clearance gate
### Security
- JWT RS256 with httpOnly cookies; TOTP MFA enforced for all roles
- HMAC-SHA256 integrity on all prediction and hazard zone records
- Append-only `alert_events` and `security_logs` tables
## [0.1.0] — 2026-MM-DD (Phase 1 internal)
...
Who maintains it: The engineer cutting the release writes the entry. Product owner reviews before tagging. Entries are written for operators and regulators — not for engineers.
33.7 User Documentation Plan
| Document | Audience | Phase | Format | Location |
|---|---|---|---|---|
| Aviation Portal User Guide | Persona A/B/C | Phase 2 | Markdown → PDF | docs/user-guides/aviation-portal-guide.md |
| Space Portal User Guide | Persona E/F | Phase 3 | Markdown → PDF | docs/user-guides/space-portal-guide.md |
| Administrator Guide | Persona D | Phase 2 | Markdown | docs/user-guides/admin-guide.md |
| API Developer Guide | Persona E/F | Phase 2 | Markdown → hosted | docs/api-guide/ |
| In-app contextual help | Persona A/C | Phase 3 | React component content | frontend/src/components/shared/HelpContent.ts |
Aviation Portal User Guide — required sections:
- Dashboard overview (what you see on first login)
- Understanding the globe display and urgency symbols
- Reading a re-entry event: window range, corridor, risk level
- Alert acknowledgement workflow (step-by-step with screenshots)
- NOTAM draft workflow and mandatory disclaimer
- Degraded mode: what the banners mean and what to do
- Sharing views: deep links
- Contacting SpaceCom support
Review requirement: The aviation portal guide must be reviewed by at least one Persona A representative (ANSP duty manager or equivalent) before first shadow deployment. Their sign-off is recorded in docs/user-guides/review-log.md.
33.8 API Developer Guide
Located at docs/api-guide/. This is the primary onboarding resource for Persona E (space operators using API keys) and Persona F (orbital analysts with programmatic access).
Minimum content for Phase 2:
authentication.md:
- How to create an API key (step-by-step with screenshots)
- How to attach the key to requests (Authorization: Bearer <key> header)
- API key scopes and which endpoints each scope can access
- How to revoke a key
rate-limiting.md:
- Per-endpoint rate limits in a table
- 429 response format and Retry-After header usage
- Burst vs. sustained limits
error-reference.md:
400 Bad Request — Invalid parameters; see `detail` field
401 Unauthorized — Missing or invalid API key
403 Forbidden — API key does not have the required scope
404 Not Found — Resource not found or not owned by your account
422 Unprocessable — Request body failed schema validation
429 Too Many Requests — Rate limit exceeded; see Retry-After header
503 Service Unavailable — HMAC integrity check failed; do not use the returned data
code-examples/python-quickstart.py:
import requests
API_BASE = "https://api.spacecom.io/api/v1"
API_KEY = "sk_live_..." # from your API key dashboard
session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"
# Get list of tracked objects currently decaying
resp = session.get(f"{API_BASE}/objects", params={"decay_status": "decaying"})
resp.raise_for_status()
objects = resp.json()["results"]
print(f"{len(objects)} objects in active decay")
# Get OEM ephemeris for the first object
norad_id = objects[0]["norad_id"]
resp = session.get(
f"{API_BASE}/space/objects/{norad_id}/ephemeris",
headers={"Accept": "application/ccsds-oem"},
params={"start": "2026-03-17T00:00:00Z", "end": "2026-03-18T00:00:00Z"}
)
print(resp.text) # CCSDS OEM format
33.9 AGENTS.md Specification
AGENTS.md at the project root provides guidance to AI coding agents (such as Claude Code) working in this codebase. It is a first-class documentation artefact — committed to the repo, version-controlled, and referenced in the onboarding guide.
Required sections:
# SpaceCom — Agent Guidance
## Codebase overview
<3-paragraph summary of architecture, key modules, and safety context>
## Safety-critical files — extra care required
The following files have safety-critical implications. Any change must include
a test and a brief rationale comment:
- `backend/app/frame_utils.py` — frame transforms affect corridor coordinates
- `backend/app/integrity.py` — HMAC signing affects prediction integrity guarantees
- `backend/app/modules/propagator/decay.py` — physics model
- `backend/app/modules/alerts/service.py` — alert trigger logic
- `backend/migrations/` — schema changes affect immutability triggers
## Test requirements
- All backend changes must pass `make test` before committing
- Physics function changes require a new test case in the relevant test module
- Security-relevant changes require a `test_rbac.py` or `test_integrity.py` case
- Never mock the database in integration tests — use the test DB container
## Code conventions
- FastAPI endpoints must have `summary`, `description`, and `responses` (see §33.3)
- Public physics/security functions must have Google-style docstrings with units
- All new decisions should have an ADR in `docs/adr/` (see §33.2)
- New runbooks go in `docs/runbooks/` using the template at `docs/runbooks/TEMPLATE.md`
## Playwright / E2E test selector convention
- Every interactive element targeted by a Playwright test **must** have a `data-testid="<component>-<action>"` attribute
- Examples: `data-testid="alert-acknowledge-btn"`, `data-testid="notam-draft-submit"`, `data-testid="decay-predict-form"`
- Playwright tests must use `page.getByTestId(...)` or accessible role selectors (`page.getByRole(...)`) **only**
- CSS class selectors, XPath, and `page.locator('.')` are forbidden in test files
- A CI lint step (`grep -r 'page\.locator\b\|page\.\$\b' tests/e2e/`) must return empty
## What not to do
- Do not add `latest` tags to Docker image references
- Do not store secrets in `.env` files committed to git
- Do not make changes to alert thresholds without updating `docs/alert-threshold-history.md`
- Do not change `model_version` in `decay.py` without following the model version governance procedure (§32.5)
- Do not proxy the Cesium ion token server-side — it is a public browser credential by design (`NEXT_PUBLIC_CESIUM_ION_TOKEN`). Do not store it in Vault, Docker secrets, or treat it as sensitive.
- Do not add write operations (POST/PUT/DELETE API calls, Zustand mutations) to components rendered in SIMULATION or REPLAY mode without calling `useModeGuard(['LIVE'])` first and disabling the control in non-LIVE modes.
33.10 Test Documentation Standard
Test pyramid and coverage gates — enforced in CI; make test runs all layers:
| Layer | Scope | Minimum gate | CI enforcement |
|---|---|---|---|
| Unit | backend/app/ excluding migrations/, schemas/ | 80% line coverage | pytest --cov=backend/app --cov-fail-under=80 |
| Integration | Every API endpoint × every applicable role | 100% of routes in test_rbac.py | RBAC matrix fixture enumerates all FastAPI routes via app.routes |
| E2E | 5 critical user journeys (see below) | All journeys pass | Playwright job in CI; blocks merge |
| Physics validation | All suites in docs/test-plan.md marked Blocking | 0 failures | Separate CI job; always runs before merge |
5 critical user journeys (E2E blocking):
- CRITICAL alert → acknowledge → NOTAM draft saved
- Analyst submits decay prediction → job completes → corridor visible on globe
- Admin creates user → user logs in → MFA enrolment complete
- Space operator registers object → views conjunction list
- Admin enables shadow mode → shadow prediction absent from viewer response
Module docstring requirement for all physics and security test modules:
"""
test_frame_utils.py — Frame Transformation Validation Suite
Physical invariant tested:
TEME → GCRF → ITRF → WGS84 coordinate chain must agree with
Vallado (2013) reference state vectors to within specified tolerances.
Reference source:
Vallado, D.A. (2013). Fundamentals of Astrodynamics and Applications, 4th ed.
Table 3-4 (GCRF↔ITRF) and Table 3-5 (TEME→GCRF). Reference vectors in
docs/validation/reference-data/vallado-sgp4-cases.json.
Operational significance of failure:
A frame transform error propagates directly into corridor polygon coordinates.
A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km.
ALL tests in this module are BLOCKING CI failures.
How to add a new test case:
1. Add the reference state vector to vallado-sgp4-cases.json
2. Add a parametrised test case to TestTEMEGCRF or TestGCRFITRF
3. Document the source in a comment on the test case
"""
docs/test-plan.md structure:
| Suite | Module(s) | Physical invariant / behaviour | Reference | Pass tolerance | Blocking? |
|---|---|---|---|---|---|
| Frame transforms | tests/physics/test_frame_utils.py | TEME→GCRF→ITRF→WGS84 chain accuracy | Vallado (2013) Table 3-4/3-5 | Position < 1 km | Yes |
| SGP4 propagator | tests/physics/test_propagator/ | State vector at epoch; 7-day propagation | Vallado (2013) test set | < 1 km at epoch; < 10 km at +7d | Yes |
| Decay predictor | tests/physics/test_decay/ | p50 re-entry time accuracy; corridor containment | Aerospace Corp database | Median error < 4h; containment ≥ 90% | Phase 2+ |
| NRLMSISE-00 density | tests/physics/test_decay/test_nrlmsise.py | Density agrees with reference atmosphere | Picone et al. (2002) Table 1 | < 1% at 5 reference points | Yes |
| Hypothesis invariants | tests/physics/test_hypothesis.py | SGP4 round-trip; p95 corridor containment; RLS tenant isolation | Internal + Vallado | See §42.3 | Yes |
| HMAC integrity | tests/test_integrity.py | Tampered record detected; correct error response | Internal | 503 + CRITICAL log entry | Yes |
| RBAC enforcement | tests/test_rbac.py | Every endpoint returns correct status for every role | Internal | 0 mismatches | Yes |
| Rate limiting | tests/test_auth.py | 429 at threshold; 200 after reset | Internal | Exact threshold | Yes |
| WebSocket | tests/test_websocket.py | Sequence replay; token expiry warning; close codes 4001/4002 | Internal spec §14 | All assertions pass | Yes |
| Contract tests | tests/test_ingest/test_contracts.py | Space-Track + NOAA key presence AND value ranges | Internal | 0 violations | Yes (in CI against mocks) |
| Celery lifecycle | tests/test_jobs/test_celery_failure.py | Timed-out job → failed; orphan recovery Beat task | Internal | State correct within 5 min | Yes |
| MC corridor | tests/physics/test_mc_corridor.py | Corridor contains ≥ 95% of p95 trajectories; polygon matches committed reference | Internal (seeded RNG seed=42) | Area delta < 5% | Phase 2+ |
| Smoke suite | tests/smoke/ | API/WS health; auth; catalog non-empty; DB connectivity | Internal | All pass in ≤ 2 min | Yes (post-deploy) |
| E2E journeys | tests/e2e/ (Playwright) | 5 critical user journeys; WCAG 2.1 AA axe-core scan | Internal | 0 journey failures; 0 axe violations | Yes |
| Breakup energy conservation | tests/physics/test_breakup/ | Energy conserved through fragmentation | Internal analytic | < 1% error | Phase 2+ |
Test database isolation strategy — prevents test state leakage and enables parallel execution (pytest-xdist):
- Unit tests and single-connection integration tests: the db_session fixture wraps each test in a SAVEPOINT / ROLLBACK TO SAVEPOINT transaction. No committed data persists between tests.
- Celery integration tests (multi-connection, multi-process): use testcontainers-python (PostgresContainer) to spin up a dedicated DB container per pytest-xdist worker. The container is created at session scope and torn down at session end. Each test worker sets search_path to its own schema (test_worker_<worker_id>) for additional isolation.
- Never use the development or production DB for tests. The DATABASE_URL in test config must point to localhost:5433 (test container) or the testcontainers dynamic port. CI enforces this via environment variable assertion at test startup.

pytest.ini configuration:

[pytest]
addopts = -x --strict-markers -p no:warnings
markers =
    quarantine: flaky tests excluded from blocking CI
    contract: external API contract tests; run against mocks in CI
    smoke: post-deploy smoke tests
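The startup assertion on DATABASE_URL can be as small as this (pure stdlib sketch; the allowed host list and the "port 5432 means dev DB" heuristic are assumptions to adapt per environment, since testcontainers assigns a dynamic localhost port):

```python
# Sketch of the test-startup guard on DATABASE_URL (§33.10). Allowed hosts and
# the port-5432 heuristic are assumptions; testcontainers uses a dynamic port.
import os
from urllib.parse import urlparse

def assert_test_database(url: str) -> None:
    """Refuse to run the test suite against anything but a local test DB."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host not in ("localhost", "127.0.0.1"):
        raise RuntimeError(f"Tests must not run against remote DB host {host!r}")
    if parsed.port == 5432:  # default Postgres port -> likely the dev database
        raise RuntimeError("Tests must use the test container port, not 5432")

# conftest.py would call this once at session start, e.g.:
# assert_test_database(os.environ["DATABASE_URL"])
```

Failing loudly at session start is cheaper than discovering mid-suite that fixtures have been truncating a shared development database.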
Flaky test policy:
- A test is "flaky" if it fails without a code change ≥ 2 times in any 30-day window (tracked via GitHub Actions JUnit artefact history)
- On second flaky failure: the test is decorated with @pytest.mark.quarantine and moved to tests/quarantine/; a GitHub issue is filed automatically by the CI workflow
- Quarantined tests are excluded from blocking CI (pytest -m "not quarantine") but continue to run in a non-blocking nightly job so failures are visible
- The quarantine list is reviewed at each sprint review; any test in quarantine > 30 days blocks the next sprint release gate
33.11 Technical Writing Decision Log
| Decision | Chosen | Rationale |
|---|---|---|
| ADR format | MADR (Markdown) | Lightweight; git-native; no tooling; linkable from code comments |
| ADR location | docs/adr/ in monorepo | Engineers find rationale where they work, not in a separate wiki |
| Changelog format | Keep a Changelog (human-maintained) | Commit messages are for engineers; changelogs are for operators and regulators; auto-generation produces wrong audience tone |
| Docstring style | Google-style | Most readable inline; compatible with Sphinx if API reference generation is needed; ruff can enforce it |
| Runbook format | Standard template with Trigger/Steps/Verification/Rollback/Notify | On-call engineers under pressure skip steps that aren't explicitly numbered; Rollback and Notify are consistently omitted without a template |
| User documentation timing | Phase 2 for aviation portal; Phase 3 for space portal | ANSP SMS acceptance requires user documentation before shadow deployment; space portal can follow |
| API guide location | docs/api-guide/ in repo | Co-located with code; version-controlled; engineers update it when they change the API |
| AGENTS.md | Committed to repo root; safety-critical files explicitly listed | An undocumented AGENTS.md is ignored or followed inconsistently; explicit safety-critical file list is the highest-value content |
| Test documentation | Module docstring + docs/test-plan.md | ECSS-Q-ST-80C requires test specification as a separate artefact; module docstrings are the lowest-friction way to maintain it |
| OpenAPI enforcement | CI check on empty description fields | Developers don't write documentation voluntarily; CI enforcement is the only reliable mechanism |
34. Infrastructure Design
This section consolidates infrastructure-level specifications: TLS lifecycle, port map, reverse-proxy configuration, WAF/DDoS posture, object storage configuration, backup validation, egress control, and the HA database parameters. For Patroni parameters see §26.3; for port exposure details see §3.3; for storage tiering see §27.4; for DNS/service discovery see §27.6.
34.1 TLS Certificate Lifecycle
Certificate Issuance Decision Tree
Is the deployment internet-facing?
├── YES → Use Caddy ACME (Let's Encrypt / ZeroSSL)
│ Caddy automatically renews; no manual steps required
│ Domain must be publicly resolvable (A record pointing to Caddy host)
│
└── NO (air-gapped / on-premise with no public DNS)
├── Does the customer operate an internal CA?
│ ├── YES → Request cert from customer CA; configure Caddy with cert_file + key_file
│ │ Document CA chain in `docs/runbooks/tls-cert-lifecycle.md`
│ └── NO → Generate internal CA with `step-ca` (Smallstep)
│ Run step-ca as a sidecar container on the management network
│ Issue Caddy cert from internal CA; clients import internal CA root cert
Cert Expiry Alert Thresholds
Prometheus alert rules in monitoring/alerts/tls.yml:
| Alert | Threshold | Severity |
|---|---|---|
| TLSCertExpiringSoon | < 60 days remaining | WARNING |
| TLSCertExpiringImminent | < 30 days remaining | HIGH |
| TLSCertExpiryCritical | < 7 days remaining | CRITICAL (pages on-call) |
For ACME-managed certs: Caddy renews at 30 days remaining by default; the 30-day alert should never fire in steady state. The 7-day CRITICAL alert is the backstop for ACME renewal failures.
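A sketch of the corresponding rule file. The expression shape assumes the standard probe_ssl_earliest_cert_expiry metric from Prometheus blackbox_exporter is being scraped; adapt the metric name to whatever exporter is actually deployed:

```yaml
# monitoring/alerts/tls.yml — sketch; assumes blackbox_exporter's
# probe_ssl_earliest_cert_expiry metric (seconds since epoch) is scraped
groups:
  - name: tls
    rules:
      - alert: TLSCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 60 * 86400
        for: 1h
        labels:
          severity: warning
      - alert: TLSCertExpiringImminent
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 86400
        for: 1h
        labels:
          severity: high
      - alert: TLSCertExpiryCritical
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
        for: 15m
        labels:
          severity: critical
```

The `for:` durations debounce transient probe failures; the thresholds mirror the table above.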
Runbook Entry
docs/runbooks/tls-cert-lifecycle.md must cover:
- How to verify current cert expiry (echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates)
- ACME renewal troubleshooting (Caddy logs, e.g. docker compose logs --tail 100 caddy)
- Manual certificate replacement procedure for air-gapped deployments
- Internal CA cert distribution to client browsers / API consumers
34.2 Caddy Reverse Proxy Configuration
# /etc/caddy/Caddyfile
# Production Caddyfile stub — customise domain and backend addresses
{
email admin@your-domain.com # ACME account email
# For air-gapped: comment out email, add tls /path/to/cert /path/to/key
}
your-domain.com {
# TLS — automatic ACME for internet-facing; replace with manual cert for air-gapped
tls {
protocols tls1.2 tls1.3 # Disable TLS 1.0 and 1.1
}
# Security headers
header {
Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
X-Content-Type-Options "nosniff"
X-Frame-Options "DENY"
Referrer-Policy "strict-origin-when-cross-origin"
-Server # Strip Server header (do not expose Caddy version)
-X-Powered-By # Strip if present
}
# WebSocket proxy (backend WebSocket endpoint)
handle /ws/* {
reverse_proxy backend:8000 {
header_up Host {host}
header_up X-Real-IP {remote_host}
header_up X-Forwarded-Proto {scheme}
}
}
# API and SSR routes
handle /api/* {
reverse_proxy backend:8000 {
header_up X-Real-IP {remote_host}
header_up X-Forwarded-Proto {scheme}
}
}
# Static assets — served with long-lived immutable cache headers (F8 — §58)
# Next.js content-hashes all filenames under /_next/static/ — safe for max-age=1y
handle /_next/static/* {
header Cache-Control "public, max-age=31536000, immutable"
reverse_proxy frontend:3000 {
header_up X-Real-IP {remote_host}
}
}
# Cesium workers and static resources (large; benefit most from caching)
handle /cesium/* {
header Cache-Control "public, max-age=604800" # 7 days; not content-hashed
reverse_proxy frontend:3000 {
header_up X-Real-IP {remote_host}
}
}
# Frontend (Next.js) — HTML and dynamic routes (no caching)
handle {
header Cache-Control "no-store" # HTML must never be cached; contains stale JS references otherwise
reverse_proxy frontend:3000 {
header_up X-Real-IP {remote_host}
header_up X-Forwarded-Proto {scheme}
}
}
}
Notes:
- MinIO console (9001) and Flower (5555) are not exposed through Caddy in production. VPN/bastion access only.
- Static asset Cache-Control: immutable is safe only because Next.js content-hashes all filenames. HTML pages must use no-store to force browsers to re-fetch the latest JS bundle references after a deploy.
- HTTP (port 80) is implicitly redirected to HTTPS by Caddy when a TLS block is present.
- max-age=63072000 = 2 years; standard for HSTS preload submission.
34.3 WAF and DDoS Protection
SpaceCom's application-layer rate limiting (§7.7) is a mitigation for abusive authenticated clients, not a defence against volumetric DDoS or web application attacks. A dedicated WAF/DDoS layer is required at Tier 2+ production deployments.
Internet-facing deployments (cloud or hosted):
- Deploy behind Cloudflare (free tier minimum; Pro tier for WAF rules) or AWS Shield Standard + AWS WAF
- Cloudflare: enable DDoS protection, OWASP managed ruleset, Bot Fight Mode
- Configure Caddy to only accept connections from Cloudflare IP ranges (Cloudflare publishes the list; verify with curl https://www.cloudflare.com/ips-v4)
Air-gapped / on-premise government deployments:
- Customer's upstream network perimeter (firewall/IPS) provides the DDoS and WAF layer
- Document the perimeter protection requirement in the customer deployment checklist (docs/runbooks/on-premise-deployment.md)
- SpaceCom is not responsible for perimeter DDoS mitigation in customer-managed deployments; this is a contractual boundary that must be documented in the MSA
On-premise licence key enforcement (F6 — §68):
On-premise deployments run on customer infrastructure. Without a licence key mechanism, a customer could run additional instances, share the deployment, or continue operating after licence expiry.
Licence key design: A JWT signed with SpaceCom's RSA private key (2048-bit minimum). Claims:
{
"sub": "<org_id>",
"org_name": "Civil Aviation Authority of Australia",
"contract_type": "on_premise",
"valid_from": "2026-01-01T00:00:00Z",
"valid_until": "2027-01-01T00:00:00Z",
"features": ["operational_mode", "multi_ansp_coordination"],
"max_users": 50,
"iss": "spacecom.io",
"iat": 1735689600
}
Enforcement: At startup, backend/app/main.py verifies the licence JWT using SpaceCom's public key (bundled in the Docker image). If validation fails or the licence has expired: the backend starts in licence-expired degraded mode — read-only access to historical data; no new predictions or alerts; all write endpoints return HTTP 402 Payment Required with {"error": "licence_expired", "contact": "commercial@spacecom.io"}. An hourly Celery Beat task re-validates the licence. If it expires mid-operation, running simulations complete but no new simulations are accepted after the check fires.
Key rotation: New licence JWT issued via scripts/generate_licence_key.py (requires SpaceCom private key, stored in HashiCorp Vault — never committed to the repository). Customer sets SPACECOM_LICENCE_KEY environment variable; container restart picks it up. SpaceCom's RSA public key is embedded in the Docker image at build time (/etc/spacecom/licence_pubkey.pem).
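Once the JWT signature is verified (PyJWT with RS256 in backend/app/main.py), the degraded-mode decision reduces to a claim-window check. A minimal sketch, assuming the helper name licence_mode and treating malformed claims the same as an expired licence:

```python
from datetime import datetime, timezone
from typing import Optional

def licence_mode(claims: dict, now: Optional[datetime] = None) -> str:
    """Return "operational" or "degraded" from verified licence claims.

    In production the claims dict comes from jwt.decode(token, public_key,
    algorithms=["RS256"]) via PyJWT; a signature failure is treated the same
    as expiry (degraded mode: read-only, writes return HTTP 402).
    """
    now = now or datetime.now(timezone.utc)
    try:
        valid_from = datetime.fromisoformat(claims["valid_from"].replace("Z", "+00:00"))
        valid_until = datetime.fromisoformat(claims["valid_until"].replace("Z", "+00:00"))
    except (KeyError, ValueError):
        return "degraded"  # malformed licence: historical read-only access only
    return "operational" if valid_from <= now < valid_until else "degraded"
```

The hourly Celery Beat re-validation task calls the same helper, so startup and mid-operation expiry share one code path.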
CI/DAST complement: OWASP ZAP DAST (§21 Phase 2 DoD) tests the application layer; WAF covers infrastructure-layer attack patterns. Both are required — they cover different threat categories.
34.4 MinIO Object Storage Configuration
Erasure Coding (Tier 3)
4-node distributed MinIO uses EC:2 (2 data + 2 parity shards per erasure set):
# MinIO server startup command (each of 4 nodes runs the same command)
minio server \
http://minio-1:9000/data \
http://minio-2:9000/data \
http://minio-3:9000/data \
http://minio-4:9000/data \
--console-address ":9001"
EC:2 on 4 nodes means:
- Each object is split into 4 shards (2 data + 2 parity)
- Read quorum: 2 shards (tolerates 2 simultaneous node failures for reads)
- Write quorum: 3 shards (tolerates 1 simultaneous node failure for writes)
- Usable capacity: 50% of raw total
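The quorum figures above follow from simple shard arithmetic. A small helper (names illustrative; it covers only MinIO's maximum-parity case, parity = N/2, where the write quorum gains +1 to avoid split-brain) reproduces the EC:2-on-4 numbers:

```python
def ec_profile(nodes: int, parity: int) -> dict:
    """Shard/quorum arithmetic for one erasure set (mirrors EC:2 on 4 nodes).

    Assumes parity == nodes/2 (MinIO's maximum-parity configuration);
    other parity settings follow different quorum rules.
    """
    assert parity == nodes // 2, "sketch covers the parity == N/2 case only"
    data = nodes - parity
    return {
        "read_quorum": data,                      # any `data` shards reconstruct an object
        "write_quorum": data + 1,                 # +1 shard prevents split-brain writes
        "read_failure_tolerance": parity,         # node losses tolerated for reads
        "write_failure_tolerance": parity - 1,    # node losses tolerated for writes
        "usable_fraction": data / nodes,          # 50% of raw capacity at EC:2 on 4
    }
```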
ILM (Information Lifecycle Management) Policies
Configured via mc ilm add commands in docs/runbooks/minio-lifecycle.md:
| Bucket | Prefix | Transition after | Target |
|---|---|---|---|
| mc-blobs | (all) | 90 days | MinIO warm tier or S3-IA |
| pdf-reports | (all) | 365 days | S3 Glacier |
| notam-drafts | (all) | 365 days | S3 Glacier |
| db-wal-archive | (all) | 31 days | Delete (WAL older than 30 days not needed for point-in-time recovery) |
34.5 Backup Restore Test Verification Checklist
Monthly restore test procedure (executed by the restore_test Celery task; results logged to security_logs type RESTORE_TEST). A human engineer must verify all six items before marking the restore test as passed:
| # | Verification item | How to verify |
|---|---|---|
| 1 | Row count match | SELECT COUNT(*) FROM reentry_predictions on restored DB equals baseline count captured before backup |
| 2 | Latest record present | Most recent reentry_predictions.created_at in restored DB is within 5 minutes of the backup timestamp |
| 3 | HMAC spot-check | Run integrity.verify_prediction(id) on 5 randomly selected prediction IDs; all must return VALID |
| 4 | Append-only trigger functional | Attempt UPDATE reentry_predictions SET risk_level = 'LOW' WHERE id = <test_id>; must raise exception |
| 5 | Hypertable chunks intact | SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'orbits' matches expected chunk count for the backup date range |
| 6 | Foreign key integrity | pg_restore completed with 0 FK constraint violations (check restore log for ERROR: insert or update on table ... violates foreign key constraint) |
Restore test failures are treated as CRITICAL alerts. The restore test target DB (db-restore-test container) must be isolated from the production network (not attached to db_net).
34.6 Infrastructure Design Decision Log
| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Reverse proxy | Caddy | nginx + certbot | Caddy automatic ACME eliminates manual cert management; simpler config; native HTTP/2 and HTTP/3 |
| TLS air-gapped | Internal CA (step-ca) | Self-signed per-service | Internal CA allows cert chain trust; self-signed requires per-client exception management |
| WAF/DDoS | Upstream provider (Cloudflare/AWS Shield) | Application-layer rate limiting only | Volumetric DDoS bypasses application-layer; WAF covers OWASP attack patterns at network ingress |
| MinIO erasure coding | EC:2 on 4 nodes | EC:4 (higher parity) | EC:4 on 4 nodes would require 4-node write quorum; any single failure blocks writes; EC:2 balances protection and availability |
| Multi-region | Single region per jurisdiction | Active-active global cluster | Data sovereignty; compliance certification scope; Phase 1–3 customer base size doesn't justify multi-region operational complexity |
| DB connection target | PgBouncer VIP | Direct Patroni primary connection string | Application connection strings don't change during Patroni failover; stable operational target |
| Cold tier (MC blobs) | MinIO ILM warm → S3-IA | S3 Glacier | MC blobs may be replayed for Mode C visualisation; 12h Glacier restore latency is operationally unacceptable |
| Cold tier (compliance) | S3 Glacier / Deep Archive | Warm S3 | Compliance docs need 7-year retention but rare retrieval; Glacier cost is 80–90% lower than S3-IA |
| Egress filtering | Host-level UFW/nftables | Rely on Docker network isolation | Docker isolation is inter-network only; outbound internet egress must be filtered at host level |
| HSTS max-age | 63072000 (2 years) | 31536000 (1 year) | 2 years is the value recommended for HSTS preload submission; aligns with standard hardening guides |
35. Performance Engineering
This section consolidates performance specifications, load test definitions, and scalability constraints across the system. For compression policy configuration see §9.4; for latency budget and pagination standard see §14; for WebSocket subscriber ceiling see §14; for renderer memory limits see §3 / §27.
35.1 Load Test Specification
Tool: k6 (preferred) or Locust. Scripts in tests/load/. Scenarios must be deterministic and reproducible on a freshly seeded database.
Scenario: CZML Catalog (Phase 1 baseline, Phase 3 SLO gate)
// tests/load/czml_catalog.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 20 }, // Ramp to 20 users
{ duration: '5m', target: 100 }, // Ramp to 100 users (SLO target)
{ duration: '5m', target: 100 }, // Sustain 100 users
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
'http_req_duration{endpoint:czml_full}': ['p(95)<2000'], // Phase 3 SLO
'http_req_duration{endpoint:czml_delta}': ['p(95)<500'], // Delta must be faster
'http_req_failed': ['rate<0.01'], // < 1% error rate
},
};
export default function () {
// First load: full catalog
const fullRes = http.get(`${__ENV.BASE_URL}/czml/objects`, { // k6 requires absolute URLs; BASE_URL passed via -e
tags: { endpoint: 'czml_full' },
headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
});
check(fullRes, { 'full catalog 200': (r) => r.status === 200 });
// Subsequent loads: delta
const since = new Date(Date.now() - 60000).toISOString();
const deltaRes = http.get(`${__ENV.BASE_URL}/czml/objects?since=${since}`, {
tags: { endpoint: 'czml_delta' },
headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
});
check(deltaRes, { 'delta 200': (r) => r.status === 200 });
sleep(5); // Think time: user views globe for ~5s before next action
}
Scenario: MC Prediction Submission
// tests/load/mc_predict.js — tests concurrency gate
export const options = {
vus: 10, // 10 concurrent MC submissions from 5 orgs (2 per org)
duration: '3m',
thresholds: {
'http_req_duration{endpoint:mc_submit}': ['p(95)<500'],
// 429s are expected (concurrency gate) — not counted as failures
'checks': ['rate>0.95'],
},
};
Scenario: WebSocket Alert Delivery
// tests/load/ws_alerts.js — verifies < 30s delivery under load
// Opens 100 persistent WebSocket connections; triggers 10 synthetic alerts;
// measures time from alert POST to WS delivery on all 100 clients
Load test execution:
- Phase 1: run czml_catalog scenario on Tier 1 dev hardware; record p95 baseline
- Phase 2: run after each major migration; confirm no regression vs Phase 1 baseline
- Phase 3: full suite (all three scenarios) on Tier 2 staging; all thresholds must pass before production deploy approval
Load test reports committed to docs/validation/load-test-report-phase{N}.md.
35.2 CZML Delta Protocol
The full CZML catalog grows proportionally with object count and time-step density. The delta protocol prevents repeat full-catalog downloads after initial page load.
Client responsibility:
- On page load: fetch GET /czml/objects (full catalog). Cache the X-CZML-Timestamp response header as lastSync.
- Every 30s (or on reconnect): fetch GET /czml/objects?since=<lastSync>.
- On receipt of X-CZML-Full-Required: true: discard globe state and re-fetch the full catalog.
- On receipt of HTTP 413: the server cannot serve the full catalog (too large); contact the system admin.
Server responsibility:
- Full response: include the X-CZML-Timestamp: <server_time_iso8601> header.
- Delta response: include only objects with updated_at > since. If since is more than 30 minutes ago, return X-CZML-Full-Required: true with an empty CZML body (client must re-fetch).
- Maximum full payload: 5 MB. If the estimated size exceeds the limit, return HTTP 413 with {"error": "catalog_too_large", "use_delta": true}.
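The server-side branch logic can be sketched as follows, with an in-memory list standing in for the DB query (function and field names illustrative):

```python
from datetime import datetime, timedelta, timezone

FULL_REQUIRED_AFTER = timedelta(minutes=30)

def delta_response(objects: list, since: datetime, now: datetime) -> dict:
    """Delta branch for GET /czml/objects?since=<lastSync>.

    `objects` is a list of {"id": ..., "updated_at": datetime} rows; the real
    handler queries the DB. Returns the headers/body the client protocol expects.
    """
    if now - since > FULL_REQUIRED_AFTER:
        # Client's lastSync is stale: force a full re-fetch rather than a huge delta
        return {"headers": {"X-CZML-Full-Required": "true"}, "body": []}
    changed = [o for o in objects if o["updated_at"] > since]
    return {"headers": {"X-CZML-Timestamp": now.isoformat()}, "body": changed}
```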
Prometheus metric: czml_delta_ratio = delta requests / (delta + full requests). Target: > 0.95 in steady state (95% of CZML requests are delta).
35.3 Monte Carlo Concurrency Gate
Unbounded MC fan-out collapses SLOs when multiple users submit concurrent jobs. The concurrency gate is implemented as a per-organisation Redis semaphore:
# worker/tasks/decay.py
import redis
from celery import current_app
from app.config import settings  # settings import was missing from the snippet; module path illustrative
REDIS = redis.Redis.from_url(settings.REDIS_URL)
MC_SEMAPHORE_TTL = 600 # seconds; covers maximum expected MC duration + margin
def acquire_mc_slot(org_id: int, org_tier: str) -> bool:
"""Returns True if slot acquired, False if at capacity. Limit derived from subscription tier (F6)."""
from app.modules.billing.tiers import get_mc_concurrency_limit
limit = get_mc_concurrency_limit(org_tier)  # name now matches the import above
key = f"mc_running:{org_id}"
pipe = REDIS.pipeline()
pipe.incr(key)
pipe.expire(key, MC_SEMAPHORE_TTL)
count, _ = pipe.execute()
if count > limit:
REDIS.decr(key)
return False
return True
def release_mc_slot(org_id: int) -> None:
key = f"mc_running:{org_id}"
current = REDIS.get(key)
if current and int(current) > 0:
REDIS.decr(key)
API layer:
# backend/api/decay.py
@router.post("/decay/predict")
async def submit_decay(req: DecayRequest, user: User = Depends(current_user)):
if not acquire_mc_slot(user.organisation_id, user.org_tier):  # org subscription tier (F6); attribute name illustrative
raise HTTPException(
status_code=429,
detail="MC concurrency limit reached for your organisation",
headers={"Retry-After": "120"},
)
task = run_mc_decay_prediction.delay(...)
return {"task_id": task.id}
The Celery chord callback (on_chord_done) calls release_mc_slot. A TTL of 600s ensures the slot is released even if the worker crashes mid-task.
Quota exhaustion logging (F6): When acquire_mc_slot returns False, before returning 429, the endpoint writes a usage_events row: event_type = 'mc_quota_exhausted'. This makes quota pressure visible to the org admin and to the SpaceCom sales team (via admin panel). The org admin's usage dashboard shows: predictions run this month, quota hits this month, and a prompt to upgrade if hits ≥ 3 in a billing period.
35.4 Query Plan Regression Gate
CI job: performance-regression (runs in staging pipeline after make migrate):
# scripts/check_query_baselines.py
"""
Runs EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each query in
docs/query-baselines/*.sql against the migrated staging DB.
Compares execution time to the baseline JSON stored in the same directory.
Fails with exit code 1 if any query exceeds 2× the recorded baseline.
Emits a GitHub PR comment with a comparison table.
"""
BASELINE_DIR = "docs/query-baselines"
THRESHOLD_MULTIPLIER = 2.0
queries = {
"czml_catalog_100obj": "SELECT ...", # from czml_catalog_100obj.sql
"fir_intersection": "SELECT ...", # from fir_intersection.sql
"prediction_list": "SELECT ...", # from prediction_list_cursor.sql
}
Baselines are JSON files containing {"planning_time_ms": N, "execution_time_ms": N, "recorded_at": "..."}. Updated manually after a deliberate schema change with a PR comment explaining the expected regression.
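The pass/fail comparison the script performs is simple; a sketch of the core check (function name illustrative, threshold from the script):

```python
THRESHOLD_MULTIPLIER = 2.0

def check_baseline(name: str, measured_ms: float, baseline: dict) -> tuple:
    """Compare one EXPLAIN ANALYZE execution time against its recorded baseline.

    `baseline` is the parsed JSON ({"execution_time_ms": ...}) stored next to
    the query in docs/query-baselines/. Returns (passed, markdown_report_row)
    for the PR comment table.
    """
    limit = baseline["execution_time_ms"] * THRESHOLD_MULTIPLIER
    passed = measured_ms <= limit
    row = (f"| {name} | {baseline['execution_time_ms']:.1f} ms "
           f"| {measured_ms:.1f} ms | {'OK' if passed else 'FAIL'} |")
    return passed, row
```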
35.5 Renderer Container Constraints
The renderer service (Playwright + Chromium) is memory-intensive during print-resolution globe captures:
# docker-compose.yml (renderer service)
renderer:
image: spacecom/renderer:sha-${GIT_SHA}
mem_limit: 4g
memswap_limit: 4g # No swap; if OOM, container restarts cleanly
networks: [renderer_net]
environment:
RENDERER_MAX_PAGES: "4" # Maximum concurrent render jobs
RENDERER_TIMEOUT_S: "30" # Per-render timeout; matches §21 DoD
RENDERER_MAX_RESOLUTION: "300dpi"
Renderer Prometheus metrics:
- renderer_memory_usage_bytes — current RSS of the Chromium process; alert at 3.5 GB (WARN before OOM)
- renderer_jobs_active — concurrent in-flight renders; alert if > 3 for > 60s (capacity signal)
- renderer_timeout_total — count of renders killed by timeout; alert if > 0 in a 5-min window
Maximum report constraints (enforced in worker/tasks/renderer.py):
- Maximum report pages: 50
- Maximum globe snapshot resolution: 300 DPI (A4 format)
- Reports exceeding these limits are rejected at submission with HTTP 400
Renderer memory isolation and on-demand rationale (F8 — §65 FinOps):
The renderer is the second-most memory-intensive service after TimescaleDB. At Tier 2 it is allocated a dedicated c6i.xlarge (~$140/mo) or equivalent. Unlike simulation workers, the renderer is called infrequently — typically a few times per day when a duty manager requests a PDF briefing pack.
On-demand vs. always-on analysis:
| Approach | Benefit | Cost/risk | Decision |
|---|---|---|---|
| Always-on (current) | Zero latency to first render; Chromium warm | $140/mo even if 0 renders/day | Use at Tier 1–2 — cost is predictable; latency matters for interactive report requests |
| On-demand (start on request, stop after idle) | Saves $140/mo on lightly used deployments | 15–30s Chromium cold-start per report; complicates deployment | Consider at Tier 3 with HPA scale-to-zero on renderer_jobs_active if customer SLA permits a 30s wait |
| Shared with simulation worker | Saves dedicated instance | Chromium OOM risk during concurrent MC + render | Do not use — Chromium 2–4 GB footprint during render + MC worker memory = OOM on 32 GB nodes |
Memory isolation is non-negotiable: The renderer container is on an isolated Docker network (renderer_net) with no direct DB access and no simulation worker co-location. This is both a security boundary (§7, §35.5) and a memory isolation boundary. A runaway Chromium process will OOM its own container and restart cleanly without affecting simulation workers or the backend API.
Cost-saving lever (on-premise): For on-premise deployments where the renderer runs on the same physical server as simulation workers, monitor renderer_memory_usage_bytes + spacecom_simulation_worker_memory_bytes via Grafana. Add a combined alert renderer + workers > 80% host RAM to detect co-location pressure before OOM.
35.6 Static Asset CDN Strategy
CesiumJS uncompressed: ~8 MB. With gzip compression: ~2.5 MB. At 100 concurrent first-time users: ~250 MB outbound in a burst.
Internet-facing (Cloudflare):
- All paths under /_next/static/* and /static/* are served with Cache-Control: public, max-age=31536000, immutable (1 year, immutable — Next.js uses content-hash filenames)
- Caddy upstream caches are bypassed for these paths (Cloudflare edge is the cache)
- CesiumJS assets: cache hit ratio target > 0.98 after warm-up
On-premise:
- Deploy an nginx sidecar container (static-cache) on frontend_net serving the Next.js out/ or .next/static/ directory directly
- Caddy routes /_next/static/* → static-cache:80 (bypasses the Next.js server)
- Configure in docs/runbooks/on-premise-deployment.md
Bundle size monitoring (CI):
# .github/workflows/ci.yml (bundle-size job)
- name: Check bundle size
run: |
npm run build 2>&1 | grep "First Load JS"
# Fails if main bundle > previous + 10% (threshold stored in .bundle-size-baseline)
node scripts/check_bundle_size.js
Baseline stored in .bundle-size-baseline at repo root (plain number in bytes). Updated manually with a PR comment when a deliberate size increase is approved.
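The gate's arithmetic is a single comparison (the repo script is Node per the workflow above; it is sketched here in Python for consistency with this plan's other examples, names illustrative):

```python
def bundle_gate(current_bytes: int, baseline_bytes: int, max_growth: float = 0.10) -> bool:
    """Pass if the new main-bundle size is within +10% of the stored baseline.

    baseline_bytes is the plain number read from .bundle-size-baseline.
    Shrinking always passes; the baseline is only updated manually via PR.
    """
    return current_bytes <= baseline_bytes * (1 + max_growth)
```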
35.7 Performance Engineering Decision Log
| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Load test tool | k6 | Locust, JMeter | k6 is script-based (TypeScript-friendly), CI-native, outputs Prometheus-compatible metrics; Locust requires a Python process; JMeter is XML-heavy |
| CZML delta | ?since=<iso8601> server-side filter | Client-side WebSocket push of changed entities | Server-side filter is simpler and works with HTTP caching; push requires server to track per-client state |
| MC semaphore | Redis INCR/DECR with TTL | DB-level lock | Redis is already the Celery broker; DB-level lock adds latency on every MC submit; TTL prevents deadlock on worker crash |
| Pagination | Cursor (created_at, id) | Keyset on single column | Single-column keyset has ties at same created_at (batch ingest); compound key is unique and stable |
| Query regression gate | EXPLAIN (ANALYZE, BUFFERS) JSON baseline | pg_stat_statements | EXPLAIN is deterministic per run on a warm buffer; pg_stat_statements averages across all historic executions and requires prod traffic to populate |
| Renderer memory cap | 4 GB Docker mem_limit | ulimit in container | Docker mem_limit is enforced by the kernel cgroup; ulimit only applies to the shell process, not Chromium subprocesses |
| Bundle size gate | +10% threshold vs. stored baseline | Absolute byte limit | Percentage is proportional to current size; absolute limits become irrelevant as bundles grow or shrink |
36. Security Architecture — Red Team / Adversarial Review
This section records the findings of an adversarial review against the §7 security architecture. Where findings were resolved by updating existing sections (§7.2, §7.3, §7.4, §7.9, §7.10, §7.11, §7.12, §7.14, §9.2), this section provides the finding rationale and cross-reference for traceability.
36.1 Finding Summary
| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | HMAC key rotation has no path through the immutability trigger | §7.9 — HMAC Key Rotation Procedure | Critical |
| 2 | Pre-signed MinIO URLs unscoped and unproxied for MC blobs | §7.10 — MinIO Bucket Policies | High |
| 3 | Celery task arguments not validated at the task layer | §7.12 — Compute Resource Governance | High |
| 4 | Playwright renderer SSRF mitigation incomplete | §7.11 — request interception allowlist | High |
| 5 | Refresh token theft: no family reuse detection | §7.3 + §9.2 refresh_tokens schema | High |
| 6 | Admin role elevation with no four-eyes approval | §7.2 + pending_role_changes table | High |
| 7 | Security events logged but no human alert matrix | §7.14 — security alerting matrix | Medium |
| 8 | Space-Track credential rotation has no ingest-gap spec | §7.14 — rotation runbook cross-reference | Medium |
| 9 | Shadow mode segregation application-layer only | §7.2 — shadow_segregation RLS policy | High |
| 10 | NOTAM draft content not sanitised — injection path | §7.4 — sanitise_icao() function | High |
| 11 | Supply chain posture not fully specified | §7.13 — already fully covered; no gap found | N/A |
36.2 Attack Paths Considered
The following attack paths were evaluated in this review:
Insider threat paths:
- Compromised admin account silently elevating a backdoor account → mitigated by four-eyes approval (Finding 6)
- Admin with access to the HMAC rotation script replacing legitimate predictions with forged ones → mitigated by dual sign-off + rotated_by audit trail (Finding 1)
- ANSP operator sharing a pre-signed report URL with an external party → mitigated by 5-minute TTL + audit log (Finding 2)
Compromised worker paths:
- Compromised ingest_worker (shares worker_net with Redis) writing crafted Celery task args → mitigated by task-layer validation (Finding 3)
- Compromised worker exfiltrating simulation trajectory URLs → mitigated by server-side MC blob proxy (Finding 2)
Authentication/session paths:
- Refresh token exfiltration + replay before legitimate client retries → mitigated by family reuse detection + full-family revocation (Finding 5)
- Compromised admin credential creating backdoor admin → mitigated by four-eyes principle (Finding 6)
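The family reuse detection that closes the refresh-token path (Finding 5) can be sketched with an in-memory store. Production keeps this state in the refresh_tokens table (§9.2); the class and method names here are illustrative only:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TokenStore:
    """In-memory sketch of refresh-token family reuse detection."""
    families: dict = field(default_factory=dict)  # family_id -> {"current": token, "revoked": bool}

    def issue(self, family_id: str, token: str) -> None:
        self.families[family_id] = {"current": token, "revoked": False}

    def refresh(self, family_id: str, presented: str, new_token: str) -> Optional[str]:
        fam = self.families.get(family_id)
        if fam is None or fam["revoked"]:
            return None
        if presented != fam["current"]:
            # A superseded token was replayed: assume theft, revoke the whole family
            fam["revoked"] = True
            return None
        fam["current"] = new_token  # normal rotation: old token is now superseded
        return new_token
```

Full-family revocation is what guarantees the attacker's session dies even when the legitimate user never notices the theft.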
Renderer SSRF paths:
- Bug causing renderer to navigate to a crafted URL → mitigated by Playwright request interception allowlist (Finding 4)
- Report ID injection → mitigated by integer validation + hardcoded URL construction (Finding 4)
Data integrity paths:
- Shadow prediction leaking into operational response via query bug → mitigated by RLS shadow_segregation policy (Finding 9)
- NOTAM draft XSS → Playwright PDF renderer execution → mitigated by sanitise_icao() + Jinja2 autoescape (Finding 10)
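The idea behind sanitise_icao() can be sketched as a character-set whitelist; the body below is illustrative — the real function (§7.4) additionally validates against the ICAO Doc 8400 abbreviation list:

```python
import re

# Characters permitted in NOTAM free text: A-Z, digits, space, and ./-
ICAO_FREETEXT_DISALLOWED = re.compile(r"[^A-Z0-9 ./\-]")

def sanitise_icao(text: str) -> str:
    """Reduce free text to the character set safe for a NOTAM (E) field.

    Uppercases, strips anything outside the whitelist, and collapses runs
    of whitespace, so <script> payloads cannot survive into the Jinja2
    template or the Playwright-rendered PDF.
    """
    cleaned = ICAO_FREETEXT_DISALLOWED.sub("", text.upper())
    return re.sub(r"\s+", " ", cleaned).strip()
```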
Credential rotation paths:
- HMAC key compromise: attacker forges predictions → mitigated by rotation procedure with hmac_admin role isolation (Finding 1)
- Space-Track credential rotation creates an undetected ingest gap → mitigated by 10-minute verification step in runbook (Finding 8)
36.3 Security Architecture ADRs
| ADR | Title | Decision |
|---|---|---|
| docs/adr/0007-hmac-rotation-procedure.md | HMAC key rotation with parameterised immutability trigger | hmac_admin role + SET LOCAL spacecom.hmac_rotation flag; dual sign-off required |
| docs/adr/0008-admin-four-eyes.md | Admin role elevation requires four-eyes approval | pending_role_changes table; 30-minute token; second admin must approve |
| docs/adr/0009-shadow-mode-rls.md | Shadow mode segregated at RLS layer, not application layer | shadow_segregation RLS policy; spacecom.include_shadow session variable; admin-only |
| docs/adr/0010-refresh-token-families.md | Refresh token family reuse detection | family_id column; full family revocation on reuse; user email alert |
| docs/adr/0011-mc-blob-proxy.md | MC trajectory blobs proxied server-side, not pre-signed URL | GET /viz/mc-trajectories/{id} backend proxy; MinIO URLs never exposed to browser |
36.4 Penetration Test Scope (Phase 3)
The Phase 3 external penetration test (referenced in §7.15) must include the following adversarial scenarios derived from this review:
- HMAC rotation bypass — attempt to forge a prediction record by exploiting the immutability trigger with and without the hmac_admin role
- Pre-signed URL exfiltration — verify that MC blob URLs are not present in any browser-side response; verify pre-signed report URLs cannot be used after 5 minutes
- Celery task injection — attempt to enqueue tasks with out-of-range arguments directly via Redis; verify the task validates and rejects them
- Playwright SSRF — attempt to trigger renderer navigation to http://169.254.169.254/ (AWS metadata) or http://backend:8000/internal/admin; verify interception blocks both
- Refresh token theft simulation — replay a superseded refresh token; verify full family revocation and email alert
- Admin privilege escalation — attempt to elevate a viewer account to admin via a single compromised admin account without the four-eyes approval token; verify the attempt is blocked and logged
- Shadow mode leak — query GET /decay/predictions as viewer; inject a shadow prediction directly at the DB layer; verify the API response never returns it
- NOTAM injection — submit an object with a name containing <script>alert(1)</script> via POST /objects; generate a NOTAM draft; verify the PDF render does not execute the script
36.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| HMAC rotation trigger | Parameterised SET LOCAL flag scoped to hmac_admin role | Separate migration to drop and recreate trigger | SET LOCAL is session-scoped; cannot be set by application role; minimises window of bypass |
| Family reuse detection | Full family revocation on superseded token reuse | Single token revocation | Full revocation is the only action that guarantees the attacker's session is destroyed even if the legitimate user doesn't notice |
| MC blob delivery | Server-side proxy endpoint | Pre-signed MinIO URL with short TTL | Pre-signed URLs can be shared or logged in browser history; server-side proxy enforces org scoping on every request |
| Admin four-eyes | Email approval token with 30-minute window | Yubikey hardware confirmation | Email approval is achievable without additional hardware; 30-minute window prevents indefinite pending states |
| Shadow RLS | PostgreSQL RLS policy | Application-layer WHERE shadow_mode = FALSE | RLS is enforced by the database engine regardless of query construction; application-layer filters can be omitted by bugs or direct DB queries |
37. Aviation Regulatory / ATM Compliance Review
This section records findings from an ATM systems engineering review against the ICAO/EUROCONTROL regulatory environment that governs ANSP customers. Findings were incorporated into §6.13 (NOTAM format), §6.14 (shadow exit), §6.17 (multi-ANSP panel), §11 (data sources / airspace scope), §16 (prediction conflict), §21 Phase 2 DoD, §27.4 (safety record retention), and §9.2 (schema additions).
37.1 Finding Summary
| # | Finding | Primary Section Updated | Severity |
|---|---|---|---|
| 1 | Regulatory classification (EASA IR 2017/373 position) unresolved | §21 Phase 2 DoD + ADR 0012 | Critical |
| 2 | NOTAM format non-compliant with ICAO Annex 15 field formatting | §6.13 — field mapping table, Q-line, YYMMDDHHmm timestamps | High |
| 3 | Re-entry window → NOTAM (B)/(C) mapping not specified | §6.13 — p10−30min / p90+30min rule + cancellation urgency | High |
| 4 | FIR scope excludes SUA, TMAs, oceanic — undisclosed | §11 — airspace scope disclosure; ADR 0014 | Medium |
| 5 | Multi-ANSP coordination panel has no authority/precedence spec | §6.17 — advisory-only banner, retention, WebSocket SLA | Medium |
| 6 | Shadow mode exit criteria not specified | §6.14 — exit criteria table, exit report template | High |
| 7 | Degraded mode disclosure insufficient for ANSP operational use | §9.2 degraded_mode_events table; §14 GET /readyz schema; NOTAM (E) injection | High |
| 8 | GDPR DPA must be signed before shadow mode begins, not Phase 3 | §21 Phase 2 DoD legal gate | High |
| 9 | ESA DISCOS redistribution rights unaddressed | §11 — redistribution rights requirement; §21 Phase 2 DoD | High |
| 10 | Multi-source prediction conflict resolution not specified | §16 — conflict resolution rules; prediction_conflict schema columns | High |
| 11 | Safety-relevant records have no distinct retention policy | §27.4 — safety_record flag; 5-year safety category | Medium |
37.2 Regulatory Framework References
| Framework | Relevance | Position taken |
|---|---|---|
| EASA IR (EU) 2017/373 | Requirements for ATM/ANS providers; may apply if ANSP integrates SpaceCom into operational workflow | Position A: advisory tool; not ATM/ANS provision — documented in ADR 0012 |
| ICAO Annex 15 (AIS) + Appendix 6 | NOTAM format specification | NOTAM drafts now comply with Annex 15 field formatting (§6.13) |
| ICAO Annex 11 (ATS) §2.26 | ATC record retention recommendation | Safety records retained ≥ 5 years (§27.4) |
| ICAO Doc 8400 | ICAO abbreviations and codes used in NOTAM (E) field | sanitise_icao() uses Doc 8400 abbreviation list |
| EUROCONTROL OPADD | Operational NOTAM Production and Distribution; EUR regional NOTAM practice | Q-line format and series conventions follow OPADD (§6.13) |
| GDPR Article 28 | Data processor obligations when processing ANSP staff personal data | DPA must be signed before any ANSP data processing, including shadow mode |
| UN Liability Convention 1972 | 7-year record retention for space object liability claims | reentry_predictions, alert_events retained 7 years (§27.4) |
37.3 Regulatory ADRs
| ADR | Title | Decision |
|---|---|---|
| docs/adr/0012-regulatory-classification.md | EASA IR 2017/373 position | Position A: ATM/ANS Support Tool; decision support only; not ATM/ANS provision; written ANSP agreements required |
| docs/adr/0013-notam-format.md | ICAO Annex 15 NOTAM field compliance | Field mapping table; YYMMDDHHmm timestamps; Q-line QWELW; (B) = p10−30min; (C) = p90+30min |
| docs/adr/0014-airspace-scope.md | Phase 2 airspace data scope | FIR/UIR only (ECAC + US); SUA/TMA/oceanic explicitly out of scope; disclosed in UI; Phase 3 SUA consideration |
37.4 Compliance Checklist (Phase 2 Gate)
Before the first ANSP shadow deployment:
- docs/adr/0012-regulatory-classification.md committed and reviewed by aviation law counsel
- NOTAM draft generator produces ICAO-compliant output (unit test passes Q-line regex and YYMMDDHHmm field checks)
- Airspace scope disclosure note present in Airspace Impact Panel (Playwright test verifies text)
- Multi-ANSP coordination advisory-only banner present in panel (Playwright test verifies text)
- degraded_mode_events table active; transitions logged; GET /readyz response includes degraded_since
- NOTAM draft (E) field injects degraded-state warning when generated_during_degraded = TRUE (integration test)
- DPA signed with each ANSP shadow partner; DPA template reviewed by counsel
- ESA DISCOS redistribution rights clarified in writing; API/report templates updated if required
- prediction_conflict flag operational; Event Detail page shows ⚠ PREDICTION CONFLICT when set
- Safety record retention policy active: safety_record = TRUE records excluded from TimescaleDB drop; degraded_mode_events retained 5 years
- Shadow mode exit report template (docs/templates/shadow-mode-exit-report.md) exists and Persona B can generate statistics from admin panel
37.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Regulatory classification | Position A — advisory, non-safety-critical ATM/ANS Support Tool | Position B — Functional System under IR 2017/373 | Position B would require ED-78A system safety assessment, ATCO HMI compliance, and EASA change management — disproportionate for a decision-support tool where a human verifies all outputs before acting |
| NOTAM timestamp format | YYMMDDHHmm (ICAO Annex 15 §5.1.2) | ISO 8601 YYYY-MM-DDTHH:mmZ | ICAO Annex 15 is unambiguous; ISO 8601 would require the NOTAM office to reformat before issuance |
| NOTAM window mapping | (B) = p10 − 30 min; (C) = p90 + 30 min | (B) = p50 − 3h; (C) = p50 + 3h | p10/p90 are the actual statistical bounds; symmetric windows around p50 ignore the often-asymmetric uncertainty distribution |
| Degraded NOTAM warning | Machine-inserted line in (E) field | UI-only warning on the draft page | The (E) field is what the NOTAM office receives; a UI-only warning is lost when the draft is copied to the NOTAM office's system |
| Multi-source conflict | Union of windows when non-overlapping | SpaceCom window always primary regardless | ICAO most-conservative principle; ANSPs must be protected against the case where SpaceCom is wrong and TIP is right |
| Safety record retention | safety_record flag on row; excluded from drop policy | Separate table for safety records | Flag approach avoids data duplication and works with TimescaleDB chunk-level policies; excluded records stay in the same hypertable partition for query performance |
38. Orbital Mechanics / Astrodynamics Review
This section records findings from an astrodynamics specialist review of the physics specification. Findings were incorporated into §15.1 (SGP4 validity gates), §15.2 (NRLMSISE-00 inputs, MC uncertainty model, SRP, integrator config), §15.3 (breakup altitude trigger, material survivability), §15.4 (new — corridor generation algorithm), §15.5 (new — Pc computation method), §17.1 (committed test vectors), §31.1 (BSTAR validation), and the objects/space_weather schema in §9.
38.1 Finding Summary
| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | SGP4 validity limits not enforced at query time | §15.1 — epoch age gates, perigee < 200 km routing | High |
| 2 | NRLMSISE-00 input vector under-specified | §15.2 — f107_prior_day, ap_3h_history, Ap vs Kp | High |
| 3 | Ballistic coefficient uncertainty model not specified | §15.2 — C_D/A/m sampling distributions; objects schema | High |
| 4 | Corridor generation algorithm not specified | §15.4 (new) — alpha-shape, 50 km buffer, ≤ 1000 vertices | High |
| 5 | Breakup altitude trigger not specified | §15.3 — 78 km trigger, NASA SBM, material survivability | High |
| 6 | Frame transformation test vectors not committed | §17.1 — 3 required JSON files; fail-not-skip test pattern | Medium |
| 7 | Solar radiation pressure absent from decay predictor | §15.2 — cannonball SRP model, cr_coefficient column | Medium |
| 8 | Pc computation method not specified | §15.5 (new) — Alfano 2D Gaussian, TLE differencing covariance | Medium |
| 9 | Integrator tolerances and stopping criterion not specified | §15.2 — atol=1e-9, rtol=1e-9, max_step=60s, 120-day cap | High |
| 10 | BSTAR validation range excludes valid high-density objects | §31.1 — removed lower floor; warn-not-reject for B* > 0.5 | Medium |
| 11 | NRLMSISE-00 altitude limit and storm handling not specified | §15.2 — 800 km OOD boundary; Kp > 5 storm flag | Medium |
38.2 Physics Model Decisions
| Decision | Chosen | Alternative Considered | Rationale |
|---|---|---|---|
| Catalog propagator | SGP4 (sgp4 library) | SP (Special Perturbations) via GMAT | SGP4 is the standard for TLE-based catalog propagation; SP requires full state vector with covariance — not available from TLEs |
| Decay integrator | DOP853 (RK7/8 adaptive) | RK4 fixed step | DOP853 is embedded error control; RK4 fixed step requires manual step-size management and may miss density variations near perigee |
| Atmospheric model | NRLMSISE-00 | JB2008 (Jacchia-Bowman 2008) | NRLMSISE-00 is well-validated, open-source, and widely used in community tools; JB2008 is more accurate during storms but requires additional data inputs not yet in scope |
| Corridor shape | Alpha-shape (concave hull) | Convex hull | Convex hull overestimates corridor width by 2–5× for elongated re-entry ground tracks; alpha-shape produces tighter, more operationally useful polygons |
| C_D sampling | Uniform(2.0, 2.4) | Fixed value 2.2 | Uniform sampling covers the credible range without assuming a specific distribution; fixed value understates uncertainty |
| SRP model | Cannonball (scalar) | Panelled model | Cannonball model is standard for non-cooperative objects; panelled model requires detailed attitude and geometry data unavailable for most catalog objects |
| Pc method | Alfano 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and the community standard; Monte Carlo Pc added as Phase 3 consideration for high-Pc events |
| BSTAR lower bound | No lower bound (reject ≤ 0 only) | 0.0001 lower bound | Dense objects (tungsten, stainless steel tanks) can have B* << 0.0001; the previous lower bound would silently reject valid high-density object TLEs |
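The C_D sampling decision above is simple enough to sketch directly. This is an illustrative stdlib fragment (the function name is hypothetical; the real sampling lives inside the Monte Carlo decay predictor alongside A and m draws):

```python
import random

def sample_drag_coefficients(n: int, seed: int = 42) -> list[float]:
    """Draw C_D uniformly over the credible range [2.0, 2.4], per the
    physics decision log: uniform covers the range without assuming a
    specific distribution, where a fixed 2.2 would understate uncertainty."""
    rng = random.Random(seed)  # seeded for reproducible MC ensembles
    return [rng.uniform(2.0, 2.4) for _ in range(n)]

samples = sample_drag_coefficients(1000)
assert all(2.0 <= c <= 2.4 for c in samples)
```

Seeding the generator mirrors the seeded-RNG approach the test plan (§42.4) uses for corridor regression data.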
38.3 Model Card Additions Required
The following items must be added to docs/model-card-decay-predictor.md:
- Breakup altitude rationale: 78 km trigger; reference to NASA Debris Assessment Software range (75–80 km for Al structures)
- Monte Carlo uncertainty model: C_D, A, m sampling distributions and their justifications
- SRP significance: conditions under which SRP > 5% of drag (area-to-mass > 0.01 m²/kg, altitude > 500 km)
- NRLMSISE-00 altitude scope: validated 150–800 km; OOD flag above 800 km
- Geomagnetic storm sensitivity: Kp > 5 triggers storm-period sampling; prediction uncertainty is elevated
- Corridor generation algorithm: alpha-shape with α = 0.1°, 50 km buffer; reference to alpha-shape literature
- Pc computation: Alfano 2D Gaussian; TLE differencing covariance; quality flag when < 3 TLEs available
- SGP4 validity limits: 7-day degraded, 14-day unreliable, 200 km perigee routing to decay predictor
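The SGP4 validity limits in the last bullet reduce to two small decision functions. A minimal sketch, assuming the gate thresholds stated above (function names are illustrative, not the codebase's actual API):

```python
def epoch_quality(age_days: float) -> str:
    """Classify TLE epoch age per the §15.1 gates:
    <= 7 days ok, 7-14 days degraded, > 14 days unreliable."""
    if age_days > 14:
        return "unreliable"
    if age_days > 7:
        return "degraded"
    return "ok"

def propagator_route(perigee_km: float) -> str:
    """Objects with perigee below 200 km are outside SGP4's useful
    regime and are routed to the numerical decay predictor."""
    return "decay_predictor" if perigee_km < 200.0 else "sgp4"

print(epoch_quality(10), propagator_route(150))  # degraded decay_predictor
```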
38.4 Validation Test Vector Requirements
| File | Required before | Blocking if absent |
|---|---|---|
| docs/validation/reference-data/frame_transform_gcrf_to_itrf.json | Any frame transform code merged | Yes — test fails hard |
| docs/validation/reference-data/sgp4_propagation_cases.json | SGP4 propagator merged | Yes |
| docs/validation/reference-data/iers_eop_case.json | IERS EOP application merged | Yes |
| docs/validation/reference-data/nrlmsise00_density_cases.json | Decay predictor merged | Yes — referenced in §17.3 |
| docs/validation/reference-data/aerospace-corp-reentries.json | Phase 1 backcast validation | Yes for Phase 2 gate |
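The "fail hard if absent" column encodes the fail-not-skip pattern from §17.1: a missing reference file must break CI, not silently skip the physics test. A minimal sketch of a loader enforcing this (the helper name is hypothetical):

```python
import json
from pathlib import Path

REFERENCE_DIR = Path("docs/validation/reference-data")

def load_reference(name: str) -> dict:
    """Load a committed test-vector file. A missing file is a hard
    failure (fail-not-skip): the test must never pass or skip just
    because the reference data was not committed."""
    path = REFERENCE_DIR / name
    if not path.exists():
        raise AssertionError(
            f"BLOCKING: reference vectors missing: {path} — "
            "commit the file before merging the dependent code"
        )
    return json.loads(path.read_text())
```

Contrast with `pytest.skip`, which would let the propagator merge with zero validation coverage.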
39. API Design / Developer Experience Review
This section records findings from a senior API design review. Findings were incorporated into §9.2 (new jobs and idempotency_keys tables; expanded api_keys schema), §14 (canonical pagination envelope, error schema, rate limit 429 body, async job lifecycle, ephemeris validation, WebSocket token refresh, WebSocket protocol versioning, field naming convention, GET /readyz in OpenAPI, API key auth model).
39.1 Finding Summary
| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | Pagination envelope not canonical across endpoints | §14 — PaginatedResponse[T], data key, total_count: null | High |
| 2 | Error response shape inconsistent; no error code registry | §14 — SpaceComError base, RequestValidationError override, registry table | High |
| 3 | Async job lifecycle for POST /decay/predict not specified | §14 — 202 response, /jobs/{id} endpoint; §9.2 — jobs table | High |
| 4 | WebSocket token refresh path not specified | §14 — TOKEN_EXPIRY_WARNING, AUTH_REFRESH, close codes 4001/4002 | High |
| 5 | Idempotency keys not specified for mutation endpoints | §14 — idempotency spec; §9.2 — idempotency_keys table | Medium |
| 6 | 429 missing Retry-After header and structured body | §14 — retryAfterSeconds body field, Retry-After header spec | Medium |
| 7 | Ephemeris endpoint lacks time range and step validation | §14 — 4-row validation table with error codes | Medium |
| 8 | WebSocket protocol versioning not specified | §14 — ?protocol_version=N, deprecation warning event, sunset close code | Medium |
| 9 | Field naming convention not decided | §14 — APIModel base class, alias_generator=to_camel | Medium |
| 10 | GET /readyz not in OpenAPI spec | §14 — tags=["System"] decorated endpoint | Low |
| 11 | API key auth model, rate limits, and scope not specified | §14 — apikey_ prefix, independent buckets, allowed_endpoints scope | High |
39.2 Developer Experience Contracts
The following contracts are enforced by CI and must not be broken without an ADR:
| Contract | Enforcement |
|---|---|
| All list endpoints return {"data": [...], "pagination": {...}} | OpenAPI CI check: list-tagged endpoints validated against PaginatedResponse schema |
| All errors return {"error": "...", "message": "...", "requestId": "..."} | AST/grep CI check: HTTPException and JSONResponse must reference registry codes |
| POST endpoints returning async jobs return 202 with statusUrl | OpenAPI CI check: endpoints tagged async validated for 202 response schema |
| 429 responses include Retry-After header | Integration test: rate-limited request asserts Retry-After header present |
| Idempotency-Key header documented for mutation endpoints | OpenAPI CI check: endpoints tagged mutation declare the header parameter |
| GET /readyz is in the OpenAPI spec | Schema validation: readyz path present in generated openapi.json |
39.3 New Endpoints Added
| Endpoint | Role | Purpose |
|---|---|---|
| GET /jobs/{job_id} | viewer (own jobs only) | Poll async job status; returns resultUrl on completion |
| DELETE /jobs/{job_id} | viewer (own jobs only) | Cancel a queued job (no effect if already running) |
39.4 New API Guide Documents Required
| Document | Content |
|---|---|
| docs/api-guide/conventions.md | camelCase rule, APIModel base class, error envelope, request ID tracing |
| docs/api-guide/pagination.md | Cursor encoding, total_count: null rationale, empty result shape |
| docs/api-guide/error-reference.md | Canonical error code registry with HTTP status, description, recovery action |
| docs/api-guide/idempotency.md | Idempotency key protocol, 24h TTL, replay header, in-progress behaviour |
| docs/api-guide/async-jobs.md | Job lifecycle, WebSocket vs polling, recommended poll interval |
| docs/api-guide/websocket-protocol.md | Protocol version history, token refresh flow, close codes, reconnection |
| docs/api-guide/api-keys.md | Key creation, apikey_ prefix, scope, independent rate limits |
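The cursor encoding that docs/api-guide/pagination.md will document can be sketched as an opaque base64url-encoded keyset. The field names below are illustrative, not the committed wire format:

```python
import base64
import json

def encode_cursor(ingested_at: str, object_id: int) -> str:
    """Opaque keyset cursor: base64url-encoded JSON of the last row's
    sort key. Opaque so clients cannot depend on its internals."""
    raw = json.dumps({"ingested_at": ingested_at, "object_id": object_id})
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> dict:
    """Server-side inverse; re-pad before decoding."""
    pad = "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(cursor + pad))

c = encode_cursor("2025-03-14T09:10:00Z", 25544)
assert decode_cursor(c) == {"ingested_at": "2025-03-14T09:10:00Z", "object_id": 25544}
```

Keyset cursors are also why `total_count: null` is viable: the next page is found by seeking past the cursor's sort key, never by offset arithmetic that would need a COUNT(*).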
39.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Pagination key | data | items, results | data is the most common convention (JSON:API, GitHub API, Stripe); items is ambiguous with Python iterables |
| total_count | Always null | Compute count on every list request | COUNT(*) on a 7-year-retention hypertable can be a full scan; cursor pagination does not need count; document the omission |
| Error base model | SpaceComError with requestId | Per-endpoint error types | Uniform shape allows generic client error handling; requestId enables log correlation without exposing internals |
| Field naming | camelCase via alias_generator | snake_case (Python default) | Frontend and API consumer convention is camelCase; populate_by_name=True keeps internal code readable |
| Async job surface | /jobs/{id} unified endpoint | Per-type endpoints (/decay/{id}, /reports/{id}) | Unified job surface simplifies client polling logic; type-specific result URLs are returned in resultUrl field |
| WebSocket close codes | 4001 auth expiry, 4002 protocol deprecated | Generic 1008 for all auth failures | Application-specific close codes enable clients to take the correct action (refresh token vs. upgrade protocol) without scraping close reason text |
| Idempotency TTL | 24 hours | 1 hour, 7 days | 24 hours covers retry windows caused by network outages, client restarts, and overnight batch jobs; longer risks unbounded table growth |
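The field-naming decision hinges on a mechanical snake_case → camelCase conversion. A minimal stdlib sketch of the transform the alias generator applies (the real implementation is Pydantic's `alias_generator=to_camel` on the APIModel base class):

```python
def to_camel(field: str) -> str:
    """snake_case -> camelCase, e.g. total_count -> totalCount.
    First segment stays lowercase; subsequent segments are capitalised."""
    head, *rest = field.split("_")
    return head + "".join(word.capitalize() for word in rest)

assert to_camel("total_count") == "totalCount"
assert to_camel("request_id") == "requestId"
assert to_camel("status") == "status"  # single-word fields unchanged
```

With `populate_by_name=True`, internal Python code keeps snake_case while the wire format is camelCase in both directions.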
40. Commercial Strategy Review
SpaceCom is a standalone commercial product. Institutional procurements (ESA STAR #182213 and similar) are market opportunities pursued with existing capabilities — the product is not built to suit any single bid. This section records findings from a commercial strategy review; incorporations are in the product and architecture sections, not in bid-specific requirements.
40.1 Finding Summary
| # | Finding | Section Updated | Severity |
|---|---|---|---|
| 1 | ESA bid requirements not mapped to plan | Scoped as per-bid process only — docs/bid/ created per procurement opportunity, not a structural plan requirement | Critical (clarified) |
| 2 | Zero Debris Charter compliance output format not specified | §6 — Controlled Re-entry Planner compliance report spec, Pc_ground, compliance_report_url | High |
| 3 | No commercial tier structure | §9.2 — subscription_tier, subscription_status on organisations; tier table defined | High |
| 4 | Competitive differentiation not anchored to maintained capabilities | §23.4 — maintained capabilities table; docs/competitive-analysis.md quarterly review | Medium |
| 5 | Shadow trial-to-operational conversion not specified | §6.14 — conversion path, offer package, subscription_status transitions, 2-concurrent-deployment cap | High |
| 6 | Delivery schedule vs. procurement milestones | Light touch: per-procurement milestone reconciliation doc created at bid time; not a structural plan requirement | High (scoped) |
| 7 | No customer-facing SLA | §26.1 — SLA schedule table in MSA; measurement methodology; service credits | High |
| 8 | Data residency requirements not addressed | §29.5 — EU default hosting; on-premise option; hosting_jurisdiction column; subprocessor disclosure | High |
| 9 | Space-Track AUP conditional architecture not specified | §11 — Path A/B conditional architecture; ADR 0016; Phase 1 architectural decision gate | High |
| 10 | No Acceptance Test Procedure specification | §21 Phase 3 DoD — ATP requirement; independent evaluator; docs/bid/acceptance-test-procedure.md | Medium |
| 11 | Go-to-market sequence not validated against resource constraints | §6.14 — 2-concurrent-shadow cap; integration lead assignment; onboarding package spec | Medium |
40.2 Commercial Tier Structure
| Tier | Customer | Feature access | Pricing model |
|---|---|---|---|
| Shadow Trial | ANSP (pre-commercial) | Full aviation portal; shadow mode only; 90-day maximum; 2 concurrent deployments maximum | Free — bilateral agreement or institutional funding |
| ANSP Operational | ANSP (post-shadow) | Full aviation portal; live alerts; NOTAM drafting; multi-ANSP coordination | Annual SaaS subscription per ANSP (seat-unlimited within org) |
| Space Operator | Satellite operators | Space portal; decay prediction; conjunction; CCSDS export; API access | Per-object-per-month or flat subscription with object cap |
| Institutional | ESA, national agencies, research | Full access; data export; API; bulk historical; on-premise deployment option | Bilateral contract or grant-funded; source code escrow option |
Tier is stored in organisations.subscription_tier. Tier-based feature gating added to RBAC: e.g., shadow_trial orgs cannot activate live alert delivery to external systems.
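The tier-based gate layered on RBAC can be sketched as a feature-set lookup. Feature names here are illustrative assumptions, not the product's registry:

```python
# Illustrative tier -> feature mapping; the real source of truth is
# organisations.subscription_tier plus the RBAC layer.
TIER_FEATURES: dict[str, set[str]] = {
    "shadow_trial": {"aviation_portal", "notam_drafting"},
    "ansp_operational": {"aviation_portal", "notam_drafting", "live_alert_delivery"},
}

def tier_allows(tier: str, feature: str) -> bool:
    """Tier gate applied after the role check passes: e.g. an operator
    role in a shadow_trial org still cannot enable live alert delivery."""
    return feature in TIER_FEATURES.get(tier, set())

assert not tier_allows("shadow_trial", "live_alert_delivery")
assert tier_allows("ansp_operational", "live_alert_delivery")
```

Keeping tier and role checks separate means the shadow-to-operational conversion (§6.14) is a single `subscription_tier` transition, not a role migration.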
40.3 Procurement Readiness Process
For each institutional procurement opportunity pursued:
- Create docs/bid/{procurement-id}/traceability.md — maps the procurement's SoR requirements to existing MASTER_PLAN.md section(s); gaps marked NOT MET or PARTIALLY MET
- Create docs/bid/{procurement-id}/milestone-reconciliation.md — maps procurement milestones (KO, PDR, CDR, AT) to SpaceCom phase completion dates
- Run ATP (docs/bid/acceptance-test-procedure.md) on the staging environment before submission
- Create docs/bid/{procurement-id}/kpi-and-validation-plan.md — maps tender KPIs to replay cases, conservative baselines, evidence artefacts, and any partner-supplied validation input still required
- Update docs/competitive-analysis.md to confirm differentiation claims are current
This is a per-opportunity process maintained by the product owner — it does not drive changes to the core plan unless a genuine product gap is identified.
40.4 Customer Onboarding Specification
| Artefact | Location | Purpose |
|---|---|---|
| ANSP onboarding checklist | docs/onboarding/ansp-onboarding-checklist.md | Integration lead walkthrough; environment setup; FIR configuration; user training |
| Admin setup guide | docs/onboarding/admin-setup.md | Persona D configuration; shadow mode activation; user provisioning |
| Shadow exit report template | docs/templates/shadow-mode-exit-report.md | Statistics + ANSP Safety Department sign-off |
| Commercial offer template | docs/templates/commercial-offer-ansp.md | Auto-populated from org data; sent at shadow exit |
40.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Plan structure vs. bid | Product-first; bid traceability is a per-opportunity overlay | Restructure plan around ESA SoR | SpaceCom serves multiple market segments; structuring around one procurement creates lock-in and excludes ANSP and space operator commercial pathways |
| Default hosting jurisdiction | EU (eu-central-1) | US-based hosting | ECAC ANSP customers are predominantly EU/UK; EU hosting satisfies data residency without per-customer complexity |
| Shadow deployment cap | 2 concurrent | Unlimited | Each shadow deployment requires a dedicated integration lead for 90 days; 2 concurrent is the realistic Phase 2 capacity without specialist hiring |
| Space-Track AUP gate | Phase 1 architectural decision | Phase 2 clarification | The shared vs. per-org ingest architecture is a fundamental Phase 1 design choice; deferring to Phase 2 would require rearchitecting already-shipped code |
| SLA in MSA | Separate SLA schedule versioned independently | Inline in MSA body | SLA values change more frequently than contract terms; versioned schedule allows SLA updates without full MSA re-execution |
41. Database Engineering Review
41.1 Finding Summary
| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | tle_sets BIGSERIAL PK incompatible with TimescaleDB hypertable uniqueness requirement | High | §9.2 tle_sets |
| 2 | TEXT enum columns lacking CHECK constraints (12 columns across 7 tables) | High | §9.2 all affected tables |
| 3 | asyncpg prepared statement cache conflicts with PgBouncer transaction mode | High | §9.4 |
| 4 | prediction_outcomes.prediction_id and alert_events.prediction_id typed INTEGER; references BIGSERIAL column | Medium | §9.2 |
| 5 | idempotency_keys already has composite PRIMARY KEY — confirmed safe; upsert pattern documented | N/A (already correct) | §9.2 |
| 6 | Mixed GEOGRAPHY/GEOMETRY types break GiST index selectivity on cross-table spatial joins | Medium | §9.3 |
| 7 | acknowledged_by and reviewed_by FKs block GDPR erasure with default RESTRICT | Medium | §9.2 |
| 8 | Mutable tables missing updated_at column and trigger | Medium | §9.2 |
| 9 | DB password rotation procedure killed in-flight transactions via hard restart | Medium | §7.5 |
| 10 | tle_sets chunk interval (7 days) too small; poor compression ratio for ingest rate | Low | §9.4 |
| 11 | Missing partial indexes on hot-path filtered queries (jobs, refresh_tokens, idempotency_keys, alert_events) | Low | §9.3 |
41.2 Schema Integrity Rules
Rules enforced after this review:
- Hypertable natural keys — No surrogate BIGSERIAL PK on hypertables. Reference tle_sets rows by (object_id, ingested_at). If a surrogate is needed, use a UNIQUE (surrogate_id, partition_col) composite.
- CHECK constraints mandatory — Every TEXT column with a finite valid value set must have a CHECK (col IN (...)) constraint. Application-layer validation is supplemental, not primary.
- asyncpg pool config — prepared_statement_cache_size=0 must be set on all async engine instances. Enforced by a test that creates a test engine and asserts the connect_arg is present.
- BIGINT FK parity — Any FK referencing a BIGSERIAL column must be BIGINT. Linted in CI via a custom Alembic migration checker.
- Spatial type discipline — Every ST_Intersects/ST_Contains call mixing GEOGRAPHY and GEOMETRY sources must include an explicit ::geometry cast on the GEOGRAPHY operand. Linted via a ruff custom rule.
- ON DELETE SET NULL on audit FKs — FKs in audit/safety tables (security_logs, alert_events.acknowledged_by, notam_drafts.reviewed_by) use ON DELETE SET NULL. Hard DELETE on users is reserved for GDPR erasure only; see §29.
- updated_at trigger — All mutable (non-append-only) tables must have updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() and a BEFORE UPDATE trigger using set_updated_at(). Append-only tables (those with a prevent_modification() trigger) are excluded.
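The BIGINT FK parity rule is linted in CI. As a rough illustration of what the checker looks for — the real checker inspects Alembic operations, not raw SQL text, so this regex-based sketch is an assumption about shape, not the actual implementation:

```python
import re

# Flag any column declared INTEGER that also declares a REFERENCES
# clause: BIGSERIAL PKs on the referenced side require BIGINT here.
FK_WIDTH_PATTERN = re.compile(r"(\w+)\s+INTEGER\s+REFERENCES", re.IGNORECASE)

def lint_fk_width(ddl: str) -> list[str]:
    """Return the names of INTEGER FK columns that should be BIGINT."""
    return [m.group(1) for m in FK_WIDTH_PATTERN.finditer(ddl)]

bad = lint_fk_width("prediction_id INTEGER REFERENCES predictions(id)")
assert bad == ["prediction_id"]  # Finding 4's exact failure mode
```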
41.3 GDPR Erasure Procedure (users table)
Per Finding 7 — a hard DELETE FROM users WHERE id = $1 is not the correct GDPR erasure mechanism. The correct procedure:
- Null out PII columns: UPDATE users SET email = 'erased-' || id || '@erased.invalid', password_hash = 'ERASED', mfa_secret = NULL, mfa_recovery_codes = NULL, tos_accepted_ip = NULL WHERE id = $1
- Security logs, alert acknowledgements, and NOTAM review records are preserved with user_id = NULL (ON DELETE SET NULL handles this automatically if a hard DELETE is later required by specific legal instruction)
- Log the erasure in security_logs with event_type = 'GDPR_ERASURE' before nulling
- The users row itself is retained as a tombstone (email contains the erased marker) — this preserves referential integrity for organisation_id links and prevents FK violations in tables without SET NULL
Full procedure: docs/runbooks/gdpr-erasure.md (Phase 2 gate, per §29).
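The tombstone column values from step 1 can be expressed as a small helper; a sketch only — the canonical form is the SQL UPDATE above, and this function name is hypothetical:

```python
def tombstone_values(user_id: int) -> dict:
    """Column values for the GDPR tombstone UPDATE: identity is
    removed, but the row survives so organisation_id links and
    FKs without SET NULL stay intact."""
    return {
        "email": f"erased-{user_id}@erased.invalid",  # .invalid TLD can never deliver
        "password_hash": "ERASED",
        "mfa_secret": None,
        "mfa_recovery_codes": None,
        "tos_accepted_ip": None,
    }

assert tombstone_values(42)["email"] == "erased-42@erased.invalid"
```

Embedding the user id in the erased marker keeps tombstone emails unique, so the users.email UNIQUE constraint is not violated by repeated erasures.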
41.4 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Hypertable surrogate key | Remove BIGSERIAL; use UNIQUE(object_id, ingested_at) | Add UNIQUE(id, ingested_at) composite | Natural key is semantically stable and meaningful; composite surrogate is confusing and rarely queried by raw id |
| CHECK constraints vs. Postgres ENUM | CHECK (col IN (...)) | CREATE TYPE ENUM | CHECK constraints are simpler to extend in migrations (no ALTER TYPE ADD VALUE); ENUM changes require pg_dump for type renaming |
| GDPR erasure | Tombstone update, not hard DELETE | Hard DELETE with CASCADE | Hard DELETE cascades into safety records (NOTAM drafts, alert logs) that must be retained under EASA/ICAO safety record requirements; tombstone preserves the record while removing identity |
| Spatial type mixing | Explicit ::geometry cast; document in §9.3 | Migrate all columns to GEOGRAPHY | Airspace GEOMETRY gives 3× ST_Intersects speedup for regional FIR queries; global corridors correctly use GEOGRAPHY; cast is cheap and safe |
42. Test Engineering / QA Review
42.1 Finding Summary
| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | No formal test pyramid with per-layer coverage gates | High | §33.10 |
| 2 | No database isolation strategy for integration tests | High | §33.10 |
| 3 | Hypothesis property-based tests unspecified | High | §33.10 table, §12 |
| 4 | WebSocket test strategy missing | High | §33.10 table, §12 |
| 5 | Playwright E2E tests lack data-testid selector convention | Medium | §33.9 |
| 6 | No smoke test suite for post-deploy verification | Medium | §12, §33.10 |
| 7 | No flaky test policy | Medium | §33.10 |
| 8 | Contract tests lack value-range assertions | Medium | DoD checklists |
| 9 | Celery task timeout → jobs state transition untested; no orphan cleanup | Medium | §7.12 |
| 10 | MC simulation test data generation strategy not specified | Low | §15.4 |
| 11 | Accessibility testing not integrated into CI with implementation spec | Low | §6.16 |
42.2 Test Suite Inventory
Full test suite after this review:
tests/
conftest.py # db_session (SAVEPOINT); testcontainers for Celery tests; pytest.ini markers
physics/
test_frame_utils.py # Vallado reference cases — all BLOCKING
test_propagator/ # SGP4 state vectors — BLOCKING
test_decay/ # Decay predictor backcast — Phase 2+
test_nrlmsise.py # NRLMSISE-00 density reference — BLOCKING
test_hypothesis.py # Hypothesis property-based invariants — BLOCKING
test_mc_corridor.py # MC seeded RNG corridor — Phase 2+
test_breakup/ # Breakup energy conservation — Phase 2+
test_integrity.py # HMAC sign/verify/tamper — BLOCKING
test_auth.py # JWT; MFA; rate limiting — BLOCKING
test_rbac.py # Every endpoint × every role — BLOCKING
test_websocket.py # WS lifecycle; sequence replay; close codes — BLOCKING
test_ingest/
test_contracts.py # Space-Track + NOAA key + value range — BLOCKING (mocked)
test_spaceweather/ # Space weather ingest logic
test_jobs/
test_celery_failure.py # Timeout → failed; orphan recovery — BLOCKING
smoke/ # Post-deploy; idempotent; ≤ 2 min — BLOCKING post-deploy
quarantine/ # Flaky tests awaiting fix; non-blocking nightly only
e2e/ # Playwright; 5 user journeys + axe WCAG 2.1 AA — BLOCKING
test_accessibility.ts # axe-core scan on every primary view; fails PR on any WCAG 2.1 AA violation
test_alert_websocket.ts # submit prediction → Celery completes → CRITICAL alert in browser via WS (F9)
load/ # k6 performance scenarios — non-blocking (nightly)
Accessibility test specification (F11):
e2e/test_accessibility.ts uses @axe-core/playwright to scan each primary view on every PR:
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
const VIEWS_TO_SCAN = [
'/', // Operational Overview
'/events', // Active Events
'/events/[sample-id]', // Event Detail
'/handover', // Shift Handover
'/space/objects', // Space Operator Overview
];
for (const view of VIEWS_TO_SCAN) {
test(`WCAG 2.1 AA: ${view}`, async ({ page }) => {
await page.goto(view);
// Scan restricted to WCAG 2.1 A/AA rule tags; any violation fails the test
const results = await new AxeBuilder({ page })
.withTags(['wcag2a', 'wcag2aa'])
.analyze();
expect(results.violations).toEqual([]);
});
}
CI gate: any axe-core violation at wcag2a or wcag2aa level fails the PR. wcag2aaa violations are reported as warnings only. Results published as CI artefact (a11y-report.html).
WebSocket alert delivery E2E test (F9): e2e/test_alert_websocket.ts is a BLOCKING E2E test that verifies the full end-to-end path from prediction submission to browser alert receipt. This test requires the full stack (Celery workers running, WebSocket server live):
// e2e/test_alert_websocket.ts
import { test, expect } from '@playwright/test';
test('CRITICAL alert appears in browser via WebSocket after prediction job completes', async ({ page }) => {
// 1. Authenticate as operator
await page.goto('/login');
await page.fill('[name=email]', process.env.E2E_OPERATOR_EMAIL!);
await page.fill('[name=password]', process.env.E2E_OPERATOR_PASSWORD!);
await page.click('[type=submit]');
await page.waitForURL('/');
// 2. Submit a decay prediction via API that will produce a CRITICAL alert.
// page.request shares the browser context's cookies and resolves the
// relative URL against baseURL, so no manual Cookie header is needed.
const response = await page.request.post('/api/v1/decay/predict', {
data: { norad_id: 90001, mc_samples: 50 }, // test object; always produces CRITICAL
});
expect(response.status()).toBe(202); // async job contract (§39)
// 3. Wait for the CRITICAL alert banner to appear in the browser (max 60s)
await expect(page.locator('[role="alertdialog"][data-severity="CRITICAL"]'))
.toBeVisible({ timeout: 60_000 });
// 4. Assert the alert references our prediction
await expect(page.locator('[role="alertdialog"]')).toContainText('90001');
});
The 60-second timeout covers: Celery task queue, MC computation (50 samples), alert threshold evaluation, WebSocket push to all org subscribers, React state update, and DOM render. If this test fails intermittently, the failure is investigated as a potential latency regression — it must not be moved to quarantine/ without a root-cause investigation.
Manual screen reader test (release checklist — not automated):
- NVDA + Firefox (Windows): primary operator workflow (alert receipt → acknowledgement → NOTAM draft)
- VoiceOver + Safari (macOS): same workflow
- Keyboard-only: full workflow without mouse
- Added to release gate checklist in docs/RELEASE_CHECKLIST.md
42.3 Hypothesis Invariant Specifications
Minimum 5 required Hypothesis properties in tests/physics/test_hypothesis.py:
| Property | Strategy | Assertion | max_examples |
|---|---|---|---|
| SGP4 round-trip position | Random valid TLE orbital elements | Forward propagate T days then back; position error < 1 m | 200 |
| p95 corridor containment | Seeded MC ensemble (seed=42, N=500) | Corridor contains ≥ 95% of input trajectories | 50 |
| NRLMSISE-00 density positive | Random altitude 100–800 km, valid F10.7/Ap | Density always > 0 kg/m³ | 500 |
| RLS tenant isolation | Two different organisation IDs | Session set to org A never returns rows for org B | 100 |
| Pagination non-overlap | Cursor pagination with random page sizes | Pages are non-overlapping and cover full dataset | 100 |
42.4 MC Corridor Test Data Specification
Reference data committed to docs/validation/reference-data/:
| File | Contents | Regeneration |
|---|---|---|
| mc-ensemble-params.json | RNG seed=42, object params, generation timestamp | Never change seed; add to file if params change |
| mc-corridor-reference.geojson | Pre-computed p95 corridor polygon | Run python tools/generate_mc_reference.py after algorithm change; review diff before committing |
Test asserts area delta < 5% between computed and reference polygon. If the algorithm changes, the reference polygon must be explicitly regenerated and the change logged in CHANGELOG.md.
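The area-delta check reduces to a polygon area computation and a relative tolerance. A minimal planar sketch — the real test works on GeoJSON with geodesic areas, and these function names are illustrative:

```python
def polygon_area(points: list[tuple[float, float]]) -> float:
    """Shoelace area of a simple polygon given as (x, y) vertices."""
    n = len(points)
    s = sum(points[i][0] * points[(i + 1) % n][1]
            - points[(i + 1) % n][0] * points[i][1]
            for i in range(n))
    return abs(s) / 2.0

def within_tolerance(computed, reference, rel_tol: float = 0.05) -> bool:
    """Regression gate: computed corridor area must be within 5% of
    the committed reference polygon's area."""
    a, b = polygon_area(computed), polygon_area(reference)
    return abs(a - b) / b < rel_tol

ref = [(0, 0), (4, 0), (4, 3), (0, 3)]       # area 12
new = [(0, 0), (4, 0), (4, 3.1), (0, 3.1)]   # area 12.4, ~3.3% delta
assert within_tolerance(new, ref)
```

An area-only check deliberately tolerates small vertex-level jitter from floating-point differences while still catching genuine algorithm regressions.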
42.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| DB isolation | SAVEPOINT for unit/single-connection; testcontainers for Celery | Shared test DB with cleanup | SAVEPOINT is zero-overhead and perfectly isolated; testcontainers gives true process isolation for multi-connection Celery tests without manual teardown |
| Flaky test policy | Quarantine after 2 failures in 30 days; delete if unfixed > 14 days | Retry flaky tests automatically | Auto-retry masks root causes; quarantine with mandatory resolution timeline creates accountability |
| Hypothesis in blocking CI | Yes, max_examples ≥ 200 for physics | Optional/nightly only | Safety-critical physics invariants must be checked on every commit; 200 examples adds < 30s to CI at default shrink settings |
| MC test data | Seeded RNG + committed reference polygon | Committed raw trajectory arrays | Raw arrays are large (~MB); seeded RNG is deterministic and tiny; committed polygon provides a stable regression target |
| data-testid convention | Mandatory for all Playwright targets; CSS class selectors forbidden | Allow CSS class selectors | CSS classes are refactoring artefacts; data-testid is stable across UI refactors and explicitly documents test intent |
| Smoke test gate | Blocking post-deploy, not blocking pre-deploy CI | Block pre-deploy CI | Smoke tests require a running stack; pre-deploy CI has no stack. Post-deploy gate means deployment rollback is the recovery action for smoke failure |
| Accessibility CI gate | axe-core wcag2a + wcag2aa violations block PR; wcag2aaa warnings only | Manual testing only | Manual testing is too slow and inconsistent for PR-level feedback; automated axe-core catches ~57% of WCAG issues at zero marginal cost; manual screen reader testing reserved for release gate |
43. Observability / Monitoring Engineering Review
43.1 Finding Summary
| # | Finding | Severity | Location updated |
|---|---|---|---|
| 1 | Per-object Gauge labels cause alert flooding (600 pages for one outage) | High | §26.7 — recording rules added |
| 2 | No structured logging format specification | High | §7.14, §10 |
| 3 | No distributed tracing (OpenTelemetry) | High | §26.7, §10 |
| 4 | AlertManager rules have semantic errors; no runbook links | High | §26.7 — rules rewritten |
| 5 | No log aggregation stack specified | Medium | §3.2, §10 |
| 6 | Celery queue depth and DLQ depth metrics not defined | Medium | §26.7 |
| 7 | SLIs not formally instrumented against SLOs | Medium | §26.7 — recording rules |
| 8 | No request_id / trace_id correlation between logs and metrics | Medium | §7.14 |
| 9 | Prometheus scrape configuration not specified | Medium | §26.7 |
| 10 | Renderer service has no functional health check or metrics | Medium | §26.5 |
| 11 | No on-call rotation spec or AlertManager escalation routing | Medium | §26.8 |
43.2 Observability Stack Summary
After this review the full observability stack is:
| Layer | Tool | Phase |
|---|---|---|
| Metrics | Prometheus + prometheus-fastapi-instrumentator | 1 |
| Alerting | AlertManager with runbook_url annotations | 1 |
| Dashboards | Grafana (4 dashboards) | 2 |
| Structured logs | structlog JSON with required fields + sanitiser | 1 |
| Log aggregation | Grafana Loki + Promtail (Docker log scrape) | 2 |
| Distributed tracing | OpenTelemetry → Grafana Tempo | 2 |
| On-call routing | PagerDuty/OpsGenie via AlertManager L1/L2/L3 tiers | 2 |
43.3 Alert Anti-Patterns (Do Not Reintroduce)
| Anti-pattern | Correct form |
|---|---|
| rate(counter[Xm]) > 0 | increase(counter[Xm]) >= N — rate() is per-second and stays positive once counter increments |
| Alert directly on spacecom_tle_age_hours{norad_id=...} | Alert on spacecom:tle_stale_objects:count recording rule — prevents 600-alert floods |
| AlertManager rule with no annotations.runbook_url | Every rule must include runbook_url pointing to the relevant runbook in docs/runbooks/ |
| Grafana dashboard as sole incident channel | All CRITICAL alerts also page via PagerDuty; dashboards are diagnosis tools, not alert channels |
43.4 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Log aggregation | Grafana Loki | ELK stack | Loki is 10× cheaper to operate (no full-text index); Prometheus labels for log querying are sufficient for this workload; co-deploys with existing Grafana without separate ES cluster |
| Tracing backend | Grafana Tempo | Jaeger | Tempo co-deploys with Grafana/Loki with no separate storage; native Grafana datasource; OTLP ingest; no query language to learn |
| Per-object label strategy | Keep labels for Grafana; alert on recording rule aggregates | Remove per-object labels | Per-object drill-down in Grafana dashboards is operationally valuable; the alert flooding problem is solved by recording rules, not by removing labels |
| Structured logging library | structlog | Python standard logging + JSON formatter | structlog integrates natively with contextvars for request_id propagation; the context binding pattern is cleaner than threading.local |
| Renderer health check | Functional Chromium launch test | Process liveness only | Chromium hanging without crashing is a known Playwright failure mode; process liveness gives false confidence; functional check is the only reliable signal |
§44 — Frontend Architecture Review
44.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No documented decision on Next.js App Router vs Pages Router; component boundary ("use client") placement unspecified | Medium | §13.1 — App Router confirmed; "use client" at app/(globe)/layout.tsx boundary |
| 2 | CesiumJS requires 'unsafe-eval' in CSP for GLSL shader compilation; existing policy blocks the globe | High | §7.7 — two-tier CSP; 'unsafe-eval' scoped to app/(globe)/ routes only |
| 3 | Globe WebGL crash removes alert panel from DOM; CesiumJS WebGL context loss is unhandled | High | §13.1 — GlobeErrorBoundary wrapping only the globe canvas; alert panel in separate PanelErrorBoundary |
| 4 | CesiumJS entity memory leak: unbounded entity accumulation causes WebGL OOM and renderer crash | Medium | §13.1 — max 500 entities; 96h orbit path limit; stale entity pruning on update |
| 5 | WebSocket reconnection strategy unspecified; naive reconnect causes thundering-herd on server restart | Medium | §13.1 — exponential backoff with ±20% jitter; RECONNECT config object; max 30s delay |
| 6 | No TanStack Query key management strategy; ad-hoc key strings cause cache stampedes and stale data | Medium | §13.1 — queryKeys key factory pattern; all query keys centralised in src/lib/queryKeys.ts |
| 7 | Safety-critical panels (alert list, corridor map) have no loading/empty/error state specification | High | §13.1 — explicit state matrix per panel; alert panel must show degraded-data warning on stale WebSocket |
| 8 | LIVE/SIMULATION/REPLAY mode isolation not enforced in UI; writes possible in replay mode | High | §13.1 — useModeGuard hook; §33.9 — AGENTS.md rule added |
| 9 | Deck.gl renders on a separate canvas above CesiumJS; z-order and input event handling are broken | Medium | §13.1 — DeckLayer from @deck.gl/cesium; single canvas; shared input handling |
| 10 | CesiumJS imported at module level causes SSR crash; next build fails | High | §13.1 — next/dynamic with ssr: false for all CesiumJS components |
| 11 | Cesium ion token injection pattern undocumented; risk of over-engineering (proxying a public credential) | Low | §7.5 — explicit NOT A SECRET annotation; §33.9 — AGENTS.md rule added |
44.2 Architecture Constraints Summary
After this review the frontend architecture constraints are:
| Constraint | Rule |
|---|---|
| App Router split | app/(auth)/ and app/(admin)/ — server components; app/(globe)/ — "use client" root layout |
| CesiumJS import | next/dynamic + ssr: false only; never a static import at module level |
| CSP | Two-tier: standard (no 'unsafe-eval') for non-globe; globe-tier ('unsafe-eval') for app/(globe)/ only |
| Error isolation | Globe crash must not affect alert panel; independent ErrorBoundary per major region |
| Entity cap | 500 CesiumJS entities maximum; prune entities not updated in last 96h |
| WebSocket reconnect | Exponential backoff, initial 1s, max 30s, ×2 multiplier, ±20% jitter |
| Query keys | All keys defined in src/lib/queryKeys.ts key factory; no inline key strings |
| Mode guard | All write operations must check useModeGuard(['LIVE']) and disable in SIMULATION/REPLAY |
| Deck.gl | DeckLayer from @deck.gl/cesium only; no separate canvas |
| Cesium ion token | NEXT_PUBLIC_CESIUM_ION_TOKEN; public credential; not proxied; not in Vault |
44.3 Anti-Patterns (Do Not Introduce)
| Anti-pattern | Correct form |
|---|---|
| import * as Cesium from 'cesium' at module level | next/dynamic(() => import('./CesiumViewerInner'), { ssr: false }) |
| Single root <ErrorBoundary> wrapping entire app | Independent boundaries: GlobeErrorBoundary, PanelErrorBoundary, AlertErrorBoundary |
| queryClient.invalidateQueries('objects') (string key) | queryClient.invalidateQueries({ queryKey: queryKeys.objects.all() }) |
| Rendering write controls (buttons, forms) without mode check | const { isAllowed } = useModeGuard(['LIVE']); <button disabled={!isAllowed}> |
| Deck.gl separate canvas (new Deck({ canvas: ... })) | viewer.scene.primitives.add(new DeckLayer({ layers: [...] })) |
| Storing Cesium ion token in backend env / Vault / Docker secrets | NEXT_PUBLIC_CESIUM_ION_TOKEN in .env.local; committed non-secret in CI |
| Reconnect without jitter (setTimeout(connect, delay)) | delay * (1 + (Math.random() * 2 - 1) * RECONNECT.jitter) |
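The jittered-backoff rule in the table above is language-agnostic; the following sketch illustrates it in Python (the frontend implementation would be TypeScript, and the `reconnect_delay` helper name is illustrative — only the RECONNECT values come from the §44.2 constraint table):

```python
import random

# Values from the §44.2 constraint table: initial 1s, max 30s, x2, ±20% jitter.
RECONNECT = {
    "initial_delay_s": 1.0,
    "max_delay_s": 30.0,
    "multiplier": 2.0,
    "jitter": 0.2,
}

def reconnect_delay(attempt: int, rng=random) -> float:
    """Delay before reconnect attempt `attempt` (0-based), with ±20% jitter."""
    base = min(
        RECONNECT["initial_delay_s"] * RECONNECT["multiplier"] ** attempt,
        RECONNECT["max_delay_s"],
    )
    # Same shape as the table's formula: delay * (1 + U(-1, 1) * jitter)
    return base * (1 + (rng.random() * 2 - 1) * RECONNECT["jitter"])
```

The jitter term is what prevents the thundering-herd reconnect named in finding 5: clients that disconnected together no longer retry on the same schedule.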
44.4 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| App Router adoption | App Router with route groups | Pages Router | Route groups ((globe), (auth)) enable per-group CSP header configuration in next.config.ts; server components reduce globe-route initial JS; incremental adoption possible |
| "use client" boundary | app/(globe)/layout.tsx | Per-component "use client" annotations | Single boundary at layout level is simpler; all CesiumJS/Zustand/WebSocket code already browser-only; per-component annotations at this scale would be noise |
| Globe CSP strategy | Route-scoped 'unsafe-eval' | Hash-based CSP for GLSL | CesiumJS generates shader source dynamically; hashes cannot cover runtime-generated strings; route-scoping is the only practical option |
| Deck.gl integration | DeckLayer from @deck.gl/cesium | Separate Deck.gl canvas | Separate canvas breaks mouse event routing and z-order; DeckLayer renders inside CesiumJS as a primitive, sharing the WebGL context |
| Cesium ion token | NEXT_PUBLIC_ env var | Backend proxy endpoint | Cesium ion is a CDN/tile service with public tokens by design; proxying adds latency and a backend dependency for a non-secret; Cesium's own documentation recommends direct browser use |
§45 — Platform / Infrastructure Operations Engineering Review
45.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | Python 3.11/3.12 version mismatch between Dockerfiles and service table | Medium | §30.2 — all images updated to python:3.12-slim, node:22-slim; CI version check added |
| 2 | No container resource limits; runaway simulation worker can OOM-kill the database | High | §3.3 — deploy.resources.limits added for all services; stop_grace_period added |
| 3 | Docker SIGTERM→SIGKILL grace period (10s default) too short for MC task warm shutdown | High | §3.3 — stop_grace_period: 300s for worker-sim; --without-gossip --without-mingle flags specified |
| 4 | Backend and renderer on disjoint networks — cannot communicate | Critical | §3.3 — backend added to renderer_net; network topology diagram corrected |
| 5 | Workers bypass PgBouncer — 16 direct connections per worker undermines connection pooling | Medium | §3.3 — PgBouncer added to worker_net; workers connect via pgbouncer:5432 |
| 6 | Redis ACL per-service is stated in §3.2 but undefined — compromised worker can read session tokens | High | §3.2 — full ACL definition added; three separate passwords added to §30.3 env contract |
| 7 | pg_isready -U postgres healthcheck passes before TimescaleDB extension and application DB are ready | Medium | §26.5 — healthcheck replaced with psql query against timescaledb_information.hypertables |
| 8 | daily_base_backup calls pg_basebackup from Python worker image — tool not installed | High | §26.6 — replaced with dedicated db-backup sidecar container; Celery task now verifies backup presence in MinIO |
| 9 | No pids_limit on renderer or worker containers — Chromium crash can fork-bomb host | Medium | §3.3 — pids_limit added: renderer=100, worker-sim=64, worker-ingest=16 |
| 10 | Renderer PDF scratch written to container writable layer — sensitive data persists | Medium | §3.3 — tmpfs mount at /tmp/renders (512 MB); RENDER_OUTPUT_DIR env var added |
| 11 | Blue-green deployment mechanics unspecified for Docker Compose — first production deploy would fail | High | §26.9 — scripts/blue-green-deploy.sh spec added; Caddy dynamic upstream pattern defined |
45.2 Container Runtime Safety Summary
After this review the container runtime safety posture is:
| Concern | Control |
|---|---|
| Resource isolation | deploy.resources.limits per service; DB memory-capped to survive worker OOM |
| Graceful shutdown | stop_grace_period: 300s for simulation workers; Celery --without-gossip --without-mingle |
| Process containment | pids_limit on renderer (100) and both workers |
| Sensitive scratch data | Renderer uses tmpfs at /tmp/renders; cleared on container stop |
| Network access | Backend on renderer_net; PgBouncer on worker_net; workers never reach frontend_net |
| Redis ACL | Three ACL users (backend, worker, ingest) with scoped key namespaces; default user disabled |
| DB healthcheck | Verifies TimescaleDB extension loaded and application DB accessible before dependent services start |
| Backups | Dedicated db-backup sidecar with PostgreSQL tools; Celery Beat verifies presence not execution |
45.3 Operations Anti-Patterns (Do Not Reintroduce)
| Anti-pattern | Correct form |
|---|---|
| FROM python:3.11-slim or FROM node:20-slim in any Dockerfile | python:3.12-slim / node:22-slim; hadolint check enforces this |
| No deploy.resources.limits on CPU/memory-intensive services | All services must have limits; simulation workers especially |
| Worker DATABASE_URL pointing to db:5432 | pgbouncer:5432 — all workers route through PgBouncer |
| subprocess.run(['pg_basebackup', ...]) from a Python worker container | Dedicated db-backup sidecar container with PostgreSQL tools |
| pg_isready -U postgres as the DB healthcheck | psql -c "SELECT 1 FROM timescaledb_information.hypertables LIMIT 1" |
| docker compose stop (default 10s) for simulation workers | stop_grace_period: 300s on worker-sim service definition |
| All services sharing single REDIS_PASSWORD | Three ACL users with scoped namespaces; separate passwords |
| Blue-green deploy without specifying the Compose implementation | scripts/blue-green-deploy.sh with separate Compose project instances + Caddy dynamic upstream |
45.4 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python version | 3.12 (service table and Dockerfiles aligned) | 3.11 (original Dockerfiles) | 3.12 has 10–25% numeric performance improvements; free-threaded GIL prep; security support through 2028; alignment eliminates silent version drift |
| Blue-green implementation | Separate Compose project instances + Caddy dynamic upstream file | Single Compose file with blue/green service name variants | Separate projects mean the Compose file is not modified per deployment; Caddy JSON upstream reload is atomic and < 5s |
| Backup execution model | Host cron → db-backup sidecar via docker compose run | Celery task + subprocess.run | Celery workers do not have pg_basebackup; host cron is independent of application availability — backup runs even if Celery is down |
| PID limits | Per-service pids_limit in Compose | Kernel cgroup default | Compose pids_limit is applied at container creation; simpler to audit than system-level cgroup tuning; values sized per expected process count |
| Renderer scratch storage | tmpfs | Named Docker volume | PDF contents include prediction data; tmpfs guarantees no persistence; cleared on container stop/restart without manual cleanup |
| Redis ACL scope | Key prefix namespacing (~celery* for workers) | Command-level ACL only | Key-prefix ACL prevents workers from reading/writing outside their namespace; command-level-only ACL is weaker (worker could still enumerate all keys) |
§46 — Data Pipeline / ETL Engineering Review
46.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No Space-Track request budget tracked; 30-min TIP polling consumes 48/600 requests/day before retries | High | §31.1.1 — SpaceTrackBudget Redis counter; alert at 80%; operator re-fetches budget-checked |
| 2 | TIP 30-min polling too slow for late re-entry phase; CDM 12h polling can miss short-TCA conjunctions entirely | High | §31.1.1 — adaptive polling: TIP→5min, CDM→30min when active_tip_events > 0 |
| 3 | TLE ingest ON CONFLICT behavior unspecified; double-run hits unique constraint silently | Medium | §11 — INSERT ... ON CONFLICT DO NOTHING + spacecom_ingest_tle_conflict_total metric |
| 4 | IERS EOP cold-start: astropy falls back to months-old IERS-B, silently degrading frame transforms | High | §11 — make seed EOP bootstrap step; EOP freshness check in GET /readyz |
| 5 | AIRAC FIR updates are fully manual with no staleness detection or missed-cycle alert | Medium | §31.1.3 — spacecom_airspace_airac_age_days gauge + alert; airspace_stale in readyz; fir-update runbook as Phase 1 deliverable |
| 6 | Space weather nowcast vs. forecast not distinguished; decay predictor uses wrong F10.7 for horizon > 72h | High | §31.1.2 — forecast_horizon_hours column; decay predictor input selection table |
| 7 | IERS EOP SHA-256 verification unimplementable — IERS publishes no reference hashes | Medium | §11 — dual-mirror comparison (USNO + Paris Observatory); spacecom_eop_mirror_agreement gauge |
| 8 | No exponential backoff or circuit breaker on ingest tasks; transient failures exhaust Space-Track budget | High | §31.1.1 — retry_backoff=True, retry_backoff_max=3600, max_retries=5; pybreaker circuit breaker |
| 9 | Space-Track session cookie expires between 6h polls; re-auth behavior not specified or tested | Medium | §31.1.1 — _ensure_authenticated() with proactive 1h45m TTL; session_reauth_total metric |
| 10 | ESA SWS Kp cross-validation has no decision rule; divergence from NOAA is silently ignored | Medium | §31.1.2 — arbitrate_kp() with 2.0 Kp threshold; conservative-high selection; ADR-0018 |
| 11 | celery-redbeat default lock TTL 25min causes up to 25min scheduling gap on Beat crash during TIP event | High | §26.4 — REDBEAT_LOCK_TIMEOUT=60; REDBEAT_MAX_SLEEP_INTERVAL=5; active TIP alert threshold 10min |
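The adaptive-polling fix in finding 2 reduces to a small interval-selection rule. A minimal sketch (the function and dictionary names are illustrative; the 5/30-minute active intervals come from finding 2 and the 30-minute TIP / 2-hour CDM baselines from §46.2):

```python
# Baseline cadence vs accelerated cadence while any TIP event is active.
BASELINE_MINUTES = {"tip": 30, "cdm": 120}
ACTIVE_MINUTES = {"tip": 5, "cdm": 30}

def polling_intervals(active_tip_events: int) -> dict:
    """Select the redbeat schedule intervals for the current TIP state.

    The actual implementation would apply this as a dynamic redbeat
    schedule override (per the §46.5 decision log), not a static config.
    """
    return ACTIVE_MINUTES if active_tip_events > 0 else BASELINE_MINUTES
```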
46.2 Ingest Pipeline Reliability Summary
After this review the ingest pipeline reliability posture is:
| Concern | Control |
|---|---|
| Space-Track rate limit | SpaceTrackBudget Redis counter; alert at 80%; hard stop at 600/day |
| Upstream failure recovery | Exponential backoff (2s→1h, ×2, ±20% jitter); circuit breaker after 3 failures; max 5 retries then DLQ |
| TIP latency during re-entry | Adaptive polling: 5-minute TIP cycle when active TIP event detected |
| CDM conjunction coverage | 30-minute CDM cycle during active TIP events (baseline 2h) |
| TLE ingest idempotency | ON CONFLICT DO NOTHING + conflict metric |
| EOP freshness | Daily download (USNO primary); dual-mirror verification; 7-day staleness alert; cold-start bootstrap in make seed |
| AIRAC currency | 28-day staleness alert; /readyz degraded signal; manual update runbook as Phase 1 deliverable |
| Space weather horizon | forecast_horizon_hours column; predictor selects by horizon; 81-day F10.7 average beyond 72h |
| Beat HA failover gap | REDBEAT_LOCK_TIMEOUT=60s; standby acquires lock within 5s of TTL expiry |
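The SpaceTrackBudget control in the table above (alert at 80%, hard stop at 600/day) can be sketched as follows. This is an in-memory illustration with an injectable dict-like store; the production design specifies a Redis counter (INCR with a key expiring at midnight UTC), and everything beyond the class name and `budget.consume()` call shape (both taken from §46.1/§46.4) is an assumption:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would exceed the daily Space-Track budget."""

class SpaceTrackBudget:
    def __init__(self, store: dict, limit: int = 600, alert_fraction: float = 0.8):
        self.store = store                        # dict keyed by UTC date string
        self.limit = limit                        # hard stop: 600 requests/day
        self.alert_at = int(limit * alert_fraction)  # alert threshold: 80%

    def consume(self, n: int, today: str) -> int:
        """Reserve n requests for `today`; raise rather than exceed the limit."""
        used = self.store.get(today, 0)
        if used + n > self.limit:
            raise BudgetExceeded(f"Space-Track daily budget exhausted ({used}/{self.limit})")
        self.store[today] = used + n
        if self.store[today] >= self.alert_at:
            pass  # production: fire the 80% budget alert / increment a metric
        return self.store[today]
```

Calling `budget.consume(1)` before every Space-Track HTTP request (as §46.4 requires) makes the hard stop unbypassable by CI runs or operator actions.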
46.3 New ADR Required
| ADR | Title | Decision |
|---|---|---|
| docs/adr/0018-kp-source-arbitration.md | Kp Source Arbitration Policy | NOAA primary; ESA SWS cross-validation; conservative-high selection on > 2.0 Kp divergence; physics lead approval required |
46.4 Ingest Pipeline Anti-Patterns (Do Not Reintroduce)
| Anti-pattern | Correct form |
|---|---|
| INSERT INTO tle_sets ... VALUES (...) without ON CONFLICT DO NOTHING | Always use ON CONFLICT DO NOTHING + increment conflict metric |
| spacetrack_client.fetch() without budget check | Always call budget.consume(1) before any Space-Track HTTP request |
| Celery ingest task with max_retries=None or no backoff | retry_backoff=True, retry_backoff_max=3600, max_retries=5 |
| EOP verification by SHA-256 against prior download | Dual-mirror UT1-UTC value comparison (USNO + Paris Observatory) |
| REDBEAT_LOCK_TIMEOUT = 300 (default 5min or 25min) | REDBEAT_LOCK_TIMEOUT = 60 for active TIP event tolerance |
| Single F10.7 value regardless of prediction horizon | Select by forecast_horizon_hours; 81-day average beyond 72h |
| ESA SWS Kp logged but not acted upon | arbitrate_kp() decision rule; conservative-high on divergence |
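The arbitrate_kp() decision rule referenced in the last row (and specified in ADR-0018) is small enough to state directly. A sketch, assuming the signature shown (the 2.0 Kp threshold and the conservative-high rule are from the source; the parameter names are illustrative):

```python
def arbitrate_kp(noaa_kp: float, esa_kp: float, threshold: float = 2.0) -> float:
    """Kp source arbitration per ADR-0018 (sketch).

    NOAA is the primary source. If ESA SWS diverges by more than `threshold`,
    take the conservative-high value: a higher Kp implies a denser atmosphere,
    a shorter predicted lifetime, and earlier alerting.
    """
    if abs(noaa_kp - esa_kp) > threshold:
        return max(noaa_kp, esa_kp)
    return noaa_kp
```

Note the asymmetry: within tolerance the NOAA value is used unchanged; the ESA value only wins when it is both divergent and higher.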
46.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Adaptive TIP polling | Dynamic redbeat schedule override when active_tip_events > 0 | Fixed 5-min polling always | Fixed 5-min polling uses 288/600 Space-Track requests/day for TIPs alone; adaptive polling reserves budget for baseline operations |
| Space-Track budget enforcement | Redis counter with hard stop | Honour-system rate limit compliance | Hard stop prevents CI/staging test runs or operator actions from exhausting production budget unexpectedly |
| EOP verification | Dual-mirror value comparison | SHA-256 against prior download | IERS publishes no reference hashes; prior-download comparison detects corruption but not substitution; dual-mirror comparison is the de facto industry approach |
| Kp arbitration | Conservative-high (max of NOAA, ESA on divergence) | Average of both sources | Averaging introduces a systematic bias toward lower geomagnetic activity; in a safety-critical context, the conservative choice is the higher Kp (denser atmosphere, shorter lifetime, earlier alerting) |
| forecast_horizon_hours schema | Dedicated column on space_weather | Separate tables per horizon | Single table with horizon column is simpler to query (WHERE forecast_horizon_hours = 0); adding a table per horizon complicates the ingest pipeline without query benefit |
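The dual-mirror EOP verification chosen in the decision log above replaces SHA-256 hashing (IERS publishes no reference hashes) with value comparison. A minimal sketch: compare the UT1-UTC value parsed from the USNO and Paris Observatory mirrors; the tolerance value and function name are assumptions, and disagreement would feed the spacecom_eop_mirror_agreement gauge:

```python
def eop_mirrors_agree(ut1_utc_primary_s: float,
                      ut1_utc_secondary_s: float,
                      tolerance_s: float = 1e-4) -> bool:
    """Dual-mirror EOP check (sketch): True when both mirrors report the
    same UT1-UTC offset within tolerance. Detects both corruption and
    substitution, which prior-download comparison cannot."""
    return abs(ut1_utc_primary_s - ut1_utc_secondary_s) <= tolerance_s
```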
§47 — Supply Chain / Dependency Security Engineering Review
47.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | pip wheel in Dockerfile does not enforce --require-hashes; hash pinning specified but not verified during build | High | §30.2 — --require-hashes added to pip wheel command with explanatory comment |
| 2 | cosign image signing absent from CI workflow; attestation claim was aspirational | High | §26.9 — full cosign sign + cosign attest YAML added to build-and-push job |
| 3 | SBOM format, CI step, and retention unspecified; ESA ECSS requirement undeliverable | High | §26.9 — SPDX-JSON via syft; cosign attest attachment; 365-day artifact retention |
| 4 | pip-audit absent; OWASP Dependency-Check has high Python false-positive rate | Medium | §7.13 — pip-audit added to security-scan; OWASP DC removed from Python scope |
| 5 | No automated license scanning; CesiumJS AGPLv3 compliance check was manual | High | §7.13 — pip-licenses + license-checker-rseidelsohn gate on every PR |
| 6 | Base image digest update process undefined; Dependabot cannot update @sha256: pins | Medium | §7.13 — Renovate Bot docker-digest manager; digest PRs auto-merged on passing CI |
| 7 | No .trivyignore file; first base-image CVE with no fix will break all CI builds | Medium | §7.13 — .trivyignore spec with expiry dates + CI expiry check |
| 8 | npm audit absent from CI; npm ci does not scan for known vulnerabilities | Medium | §7.13 + §26.9 — npm audit --audit-level=high in security-scan job |
| 9 | detect-secrets baseline update process undefined; incorrect scan > overwrites all allowances | Medium | §30.1 — correct --update procedure documented; CI baseline currency check added |
| 10 | No PyPI index trust policy; dependency confusion attack surface unmitigated | High | §7.13 — private PyPI proxy spec; spacecom-* namespace reservation on public PyPI; ADR-0019 |
| 11 | GitHub Actions pinned by mutable @vN tags; tag repointing exfiltrates all workflow secrets | Critical | §26.9 — all actions pinned by full commit SHA; CI lint check enforces no @v\d tags |
47.2 Supply Chain Security Posture Summary
After this review the supply chain security posture is:
| Layer | Control |
|---|---|
| Python build-time hash verification | pip wheel --require-hashes enforces hash pinning during Docker build |
| Python CVE scanning | pip-audit (PyPADB); every PR; blocks on High/Critical |
| Node.js CVE scanning | npm audit --audit-level=high; every PR |
| Container CVE scanning | Trivy + .trivyignore with expiry enforcement |
| Image provenance | cosign keyless signing (Sigstore) on every image push |
| SBOM | SPDX-JSON via syft; attached as cosign attest; 365-day retention |
| License gate | pip-licenses + license-checker-rseidelsohn; GPL/AGPL blocks merge |
| Base image currency | Renovate docker-digest manager; weekly PRs; auto-merged on CI pass |
| Dependency currency | Dependabot (GitHub Advisory integration) for Python/Node versions |
| CI pipeline integrity | All actions SHA-pinned; lint check rejects @vN references |
| Secrets detection | detect-secrets (entropy + regex) primary; git-secrets secondary; baseline currency check in CI |
| PyPI index trust | Private proxy (Phase 2+); spacecom-* namespace stubs on public PyPI |
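The .trivyignore expiry enforcement in the table above implies a small CI check. A sketch, assuming a per-line expiry comment convention of the form `CVE-2024-XXXX  # expires: 2025-06-30 reason` (the comment format, function name, and return contract are all illustrative; the plan specifies only "documented expiry dates + CI expiry check"):

```python
from datetime import date

def expired_suppressions(lines: list[str], today: date) -> list[str]:
    """Return CVE ids in a .trivyignore whose expiry date has passed.

    CI fails when the returned list is non-empty, so suppressions cannot
    silently outlive the upstream fix they were waiting on.
    """
    expired = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or pure comment
        cve, _, comment = line.partition("#")
        if "expires:" in comment:
            expiry = date.fromisoformat(comment.split("expires:")[1].split()[0])
            if expiry < today:
                expired.append(cve.strip())
    return expired
```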
47.3 New ADR Required
| ADR | Title | Decision |
|---|---|---|
| docs/adr/0019-pypi-index-trust.md | PyPI Index Trust Policy | Private proxy for Phase 2+; public PyPI namespace reservation for spacecom-* packages in Phase 1 |
47.4 Anti-Patterns (Do Not Reintroduce)
| Anti-pattern | Correct form |
|---|---|
| pip wheel -r requirements.txt without --require-hashes | pip wheel --require-hashes -r requirements.txt |
| uses: actions/checkout@v4 in any workflow file | uses: actions/checkout@<full-commit-sha> # vX.Y.Z |
| detect-secrets scan > .secrets.baseline | detect-secrets scan --baseline .secrets.baseline --update |
| OWASP Dependency-Check as Python CVE scanner | pip-audit --requirement requirements.txt |
| Trivy gate with no .trivyignore | .trivyignore with documented expiry dates + CI expiry check |
| Manual CesiumJS licence check at Phase 1 only | license-checker-rseidelsohn --failOn "GPL;AGPL" on every PR (CesiumJS exempted by name) |
| cosign mentioned in decision log but absent from CI | cosign sign + cosign attest in build-and-push job; cosign verify in deploy jobs |
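The "no mutable @vN tags" CI lint named in finding 11 can be sketched as a one-regex check over workflow files. The regex and the walk over `.github/workflows/*.yml` are illustrative assumptions; the rule itself (reject `@v\d` tag references, require full commit SHAs) is from §26.9:

```python
import re

# Matches a `uses:` line that pins an action by a mutable @vN tag.
# A full 40-hex-char SHA pin (with a trailing `# vX.Y.Z` comment) does not match.
TAG_REF = re.compile(r"^\s*-?\s*uses:\s*\S+@v\d")

def find_mutable_action_refs(workflow_text: str) -> list[str]:
    """Return workflow lines that reference an action by mutable tag."""
    return [ln for ln in workflow_text.splitlines() if TAG_REF.search(ln)]
```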
47.5 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Python CVE scanning | pip-audit (PyPADB) | OWASP Dependency-Check | OWASP DC CPE mapping generates false positives for Python; pip-audit queries the Python-native advisory database with near-zero false positives |
| Image signing | cosign keyless (Sigstore) | Long-lived signing key | Keyless signing uses ephemeral OIDC-bound keys; no key management overhead; verifiable against GitHub Actions OIDC issuer |
| SBOM format | SPDX 2.3 JSON (spdx-json) | CycloneDX 1.5 | SPDX is the ECSS/ESA-preferred format; both are equivalent for compliance purposes; SPDX has wider tooling support in the aerospace sector |
| Base image update automation | Renovate docker-digest | Manual digest updates | Manual digest updates are always deferred; Renovate auto-merge on passing CI achieves zero-latency security patch application for base image OS updates |
| GitHub Actions pinning | Commit SHA with tag comment | Dependabot auto-bump of @vN | Tag references are mutable; SHA pins are immutable; Renovate github-actions manager keeps SHAs current automatically |
| PyPI trust (Phase 1) | Namespace reservation on public PyPI | Private proxy | Private proxy requires infrastructure investment not available in Phase 1; namespace squatting prevention provides meaningful protection at zero cost |
§48 — Human Factors Engineering Specialist Review
Hat: Human Factors Engineering
Standard basis: ECSS-E-ST-10-12C (Space engineering — Human factors), CAP 1264 (Alarm management for safety-related ATC systems), EASA GM1 ATCO.B.001(d) (Competency-based training — decision making under uncertainty), Endsley (1995) Situation Awareness taxonomy, Parasuraman & Riley (1997) automation trust calibration
Review scope: §28 Human Factors Framework, §6 UI/UX Feature Specifications, §26 Infrastructure (alert delivery), §31 Data Pipeline (data freshness / degraded state)
48.1 Findings
Finding 1 — SA timing targets absent: §28.1 contained no quantitative time-to-comprehension targets. Situation Awareness without measurable timing criteria cannot be validated against ECSS-E-ST-10-12C Part 6.4 or used as pass/fail criteria in usability testing. Fix applied (§28.1): SA Level 1 ≤ 5s (icon/colour/position); SA Level 2 ≤ 15s (FIR intersection + sector); SA Level 3 ≤ 30s (corridor expanding/contracting). Targets designated as Phase 2 usability test pass/fail criteria.
Finding 2 — Forced-text acknowledgement minimum causes compliance noise: The 10-character minimum on alert acknowledgement text is a common anti-pattern. Under time pressure, operators produce 1234567890 or similar, which is audit record pollution rather than evidence of cognitive engagement.
Fix applied (§28.5): Replaced with ACKNOWLEDGEMENT_CATEGORIES (6 structured options). Free text is optional except when OTHER is selected. Category selection satisfies audit requirements with less operator burden.
Finding 3 — No keyboard-completable acknowledgement path: ANSP ops room staff routinely hold a radio PTT with one hand. A mouse-dependent acknowledgement dialog is inaccessible in that context and constitutes a HF design failure.
Fix applied (§28.5): Alt+A → Enter → Enter three-keystroke path from any application state. Documented for operator quick-reference card; included in Phase 2 usability test scenario.
Finding 4 — No startle-response mitigation: Sudden full-screen CRITICAL banners produce a documented ~5-second degraded cognitive performance window (startle effect, Staal 2004). The existing design transitions directly to full-screen without priming.
Fix applied (§28.3): Three-rule mitigation: (1) progressive escalation — CRITICAL full-screen only after ≥ 1 minute in HIGH state (except impact_time_minutes < 30); (2) audio precedes visual by 500ms; (3) banner is dimmed overlay over corridor map, not a replacement.
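Rule (1) of the mitigation reduces to a small escalation predicate. A sketch, assuming the function and parameter names shown (the ≥ 1 minute dwell and the impact_time_minutes < 30 override are from the fix; everything else is illustrative):

```python
def allow_full_screen_critical(minutes_in_high: float,
                               impact_time_minutes: float) -> bool:
    """Progressive-escalation rule from §28.3 (sketch).

    A CRITICAL full-screen banner is permitted only after the event has
    spent at least one minute in the HIGH state, except when impact is
    under 30 minutes away, where immediacy overrides the priming rule.
    """
    if impact_time_minutes < 30:
        return True  # imminent impact: escalate immediately
    return minutes_in_high >= 1.0
```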
Finding 5 — No shift handover specification: Handover is the highest-risk transition in continuous operations. Loss of situational awareness at shift change is a documented contributing factor in ATC incidents. No handover mechanism existed.
Fix applied (§28.5a): Dedicated /handover view; shift_handovers table with outgoing_user, incoming_user, notes, active_alerts snapshot, open_coord_threads snapshot; immutable audit record; CRITICAL-during-handover flag on notifications.
Finding 6 — Alarm rationalisation procedure absent: Alarm systems without formal rationalisation procedures inevitably drift toward nuisance alarm rates that exceed operator tolerance. The existing quarterly review target (< 1 LOW/10 min/user) had no enforcement mechanism.
Fix applied (§28.3): Quarterly rationalisation procedure with alarm_threshold_audit table; 90% MONITORING acknowledgement rate as nuisance alarm trigger; mandatory 7-day confirmation for threshold changes; 12-month no-escalation review for alert categories.
Finding 7 — Comprehension test items not specified: §28.7 stated "usability test" without scripted probabilistic comprehension items. Generic usability tests are insensitive to the specific calibration failures relevant to probabilistic re-entry data (false precision, space/aviation risk threshold conflation, uncertainty update misattribution). Fix applied (§28.7): Four scripted comprehension items with correct answer, common wrong answer, and failure mode each item detects. Pass criterion: ≥ 80% correct per item across the test cohort.
Finding 8 — No habituation countermeasures: Repeated identical stimuli (identical alarm sound, identical banner appearance) produce habituation — reduced physiological and attentional response over weeks of exposure. No design provisions existed. Fix applied (§28.3): Pseudo-random alternation of two-tone audio pattern; 1 Hz colour cycling on CRITICAL banner between two dark-amber shades; per-operator habituation metric (≥ 20 same-type acknowledgements in 30 days without escalation triggers supervisor review).
Finding 9 — "Response Options" label creates legal ambiguity: The label "Response Options" implies these are prescribed choices. In a regulatory investigation following an incident, checked items could be interpreted as evidence of a standard procedure that was or was not followed.
Fix applied (§28.6): Feature renamed to "Decision Prompts" throughout. Non-waivable legal disclaimer added below accordion header. Disclaimer included in printed/exported Event Detail report and in API response legal_notice field.
Finding 10 — No attention management specification: SpaceCom exists in an environment (ops room) with very high ambient interruption rates. Without explicit constraints on unsolicited notification rate, SpaceCom becomes an additional fragmentation source — the documented cause of error in multiple ATC incident analyses. Fix applied (§28.6): Three-tier rate limit: ≤ 1/10 min in steady state; ≤ 1/60s for same-event updates during active incident; zero during critical flow (acknowledgement dialog or handover screen). Queued notifications delivered as batch on critical-flow exit.
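The three-tier rate limit in the fix above can be sketched as a small gate. This is a simplification (it keys the steady-state limit per event rather than globally, and the class and method names are illustrative); the tier values and the queue-then-batch behaviour are from §28.6:

```python
class NotificationGate:
    """Attention-management sketch: steady state <= 1 notification per
    10 min; active incident <= 1 same-event update per 60 s; zero during
    critical flow, with suppressed notifications queued for batch delivery."""

    def __init__(self):
        self.last_sent: dict[str, float] = {}  # event_id -> last send time (s)
        self.queue: list[str] = []             # suppressed during critical flow

    def offer(self, event_id: str, now_s: float, *,
              active_incident: bool, in_critical_flow: bool) -> bool:
        """Return True if the notification may be shown now."""
        if in_critical_flow:
            self.queue.append(event_id)  # never interrupt ack/handover flow
            return False
        min_gap_s = 60.0 if active_incident else 600.0
        last = self.last_sent.get(event_id)
        if last is not None and now_s - last < min_gap_s:
            return False  # rate-limited; drop, do not queue
        self.last_sent[event_id] = now_s
        return True

    def flush(self) -> list[str]:
        """Batch delivered when the operator exits the critical flow."""
        batch, self.queue = self.queue, []
        return batch
```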
Finding 11 — Degraded-data states not differentiated for operators: Three meaningfully different system states (healthy, degraded, failed) were visually undifferentiated in the previous design. Operators cannot distinguish between data they should trust, trust with margin, or not trust at all.
Fix applied (§28.8): Graded visual degradation language table (5 amber/red states with exact badge text and required operator response); multiple-amber consolidation rule; GET /readyz machine-readable staleness flags for ANSP monitoring integration; system_health_events audit table.
48.2 Files / Sections Modified
| Section | Change |
|---|---|
| §28.1 Situation Awareness Design Requirements | Added SA level timing targets as pass/fail usability criteria |
| §28.3 Alarm Management | Added startle-response mitigation (3 rules), alarm rationalisation procedure, habituation countermeasures |
| §28.5 Error Recovery and Irreversible Actions | Replaced 10-char text minimum with ACKNOWLEDGEMENT_CATEGORIES; added Alt+A → Enter → Enter keyboard path |
| §28.5a Shift Handover (new section) | Handover screen spec; shift_handovers table schema; integrity rules; handover-window CRITICAL flag |
| §28.6 Cognitive Load Reduction | Renamed Response Options → Decision Prompts; added legal disclaimer; added attention management rate limits |
| §28.7 HF Validation Approach | Added 4 scripted probabilistic comprehension test items with pass criterion |
| §28.8 Degraded-Data Human Factors (new section) | Graded degradation language; 5-state indicator table; multiple-amber consolidation; GET /readyz integration |
48.3 New Tables / Schema Changes
| Table | Purpose |
|---|---|
| shift_handovers | Immutable record of shift handovers with alert and coordination thread snapshots |
| alarm_threshold_audit | Immutable record of alarm threshold changes with reviewer and rationale |
| system_health_events | Time-series log of degraded-data state transitions for operational reporting |
48.4 New ADR Required
| ADR | Title | Decision |
|---|---|---|
| docs/adr/0020-acknowledgement-categories.md | Alert Acknowledgement Design | Structured category selection replaces free-text minimum; OTHER requires text; 6 categories cover all anticipated operational responses |
| docs/adr/0021-decision-prompts-legal.md | Decision Prompts Legal Treatment | Feature renamed from Response Options; non-waivable disclaimer required; legal rationale documented for future regulatory inquiries |
48.5 Anti-Patterns (Do Not Reintroduce)
| Anti-pattern | Correct form |
|---|---|
| Full-screen CRITICAL banner without progressive escalation | Progressive escalation: ≥ 1 min in HIGH state before CRITICAL full-screen (except impact_time < 30 min) |
| Audio and visual CRITICAL alert fired simultaneously | Audio fires 500ms before visual banner render |
| Alert acknowledgement with free-text character minimum | ACKNOWLEDGEMENT_CATEGORIES structured selection; free text only when OTHER selected |
| "Response Options" label anywhere in UI, API, or docs | "Decision Prompts" throughout; legal disclaimer present |
| Comprehension test without scripted probabilistic items | Use the 4 scripted items in §28.7; measure per-item accuracy against 80% pass threshold |
| Degraded data shown with same visual weight as fresh data | Use exact badge text from §28.8; amber for stale, red for expired/unusable |
48.6 Decision Log
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Acknowledgement mechanism | Structured categories | Free-text minimum | Research shows forced-text minimums produce compliance noise, not evidence; structured categories produce lower operator burden with higher audit utility |
| CRITICAL escalation model | Progressive (HIGH → CRITICAL) | Immediate full-screen | Startle effect causes ~5s cognitive degradation; progressive escalation eliminates cold-start startle while preserving urgency |
| Audio timing | 500ms pre-visual | Simultaneous | Pre-auditory alert primes attentional orienting response; eliminates visual startling; 500ms is within the ICAO recommended alerting lead-time range |
| Shift handover | System-managed /handover view | Out-of-band process | Out-of-band handovers leave no audit trail and are not integrated with active alert state; system-managed handover provides immutable record and SA transfer assurance |
| Decision Prompts legal treatment | Non-waivable hard-coded disclaimer | Configurable disclaimer or none | Configurable disclaimer creates discovery risk (could be disabled); absence of disclaimer creates precedent risk; hard-coded disclaimer is the only legally safe option |
§49 Legal / Compliance Engineering — Specialist Review
Standards basis: GDPR (Regulation 2016/679), UK GDPR, ePrivacy Directive, Export Administration Regulations (EAR), ITAR (22 CFR 120–130), ESA Procurement Rules, EUMETSAT Data Policy, Space Debris Mitigation Guidelines (IADC/ISO 24113), Chicago Convention Article 28, EU AI Act (Regulation 2024/1689), NIS2 Directive (2022/2555)
Review scope: Data handling, user consent, liability framing, export control, third-party data licensing, AI Act obligations, operator accountability chain, record retention, cross-border transfer, regulatory correspondence readiness
49.1 Findings and Fixes Applied
F1 — No GDPR lawful basis documented per processing activity
Fix applied (§29.1): RoPA requirement formalised. legal/ROPA.md designated as authoritative document. Data inventory table extended to include all processing activities with lawful basis, retention period, and table reference. shift_handovers and alarm_threshold_audit added as processing activities. Annual DPO sign-off required. DPIA trigger documented.
F2 — No DPIA for conjunction alert delivery
Fix applied (§29.1): DPIA trigger documented — conjunction alert delivery constitutes systematic monitoring under GDPR Art. 35(3)(b). DPIA required before production deployment; template designated as legal/DPIA_conjunction_alerts.md.
F3 — TLE / space weather data redistribution may breach upstream licence
Fix applied (§24.2): space_track_registered boolean column added to organisations table. API middleware gate blocks TLE-derived fields for non-registered orgs. data_disclosure_log table added for licence audit trail. EU-SST gated separately behind itar_cleared flag.
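The middleware gate described above can be sketched as a pure field filter; the field list and function names here are illustrative, not the authoritative API schema:

```python
# Sketch: strip TLE-derived fields from an API response payload when the
# requesting organisation has not confirmed Space-Track registration.
# TLE_DERIVED_FIELDS below is a placeholder set, not the committed schema.
TLE_DERIVED_FIELDS = {"tle_line1", "tle_line2", "mean_motion", "bstar", "epoch"}

def gate_tle_fields(payload: dict, space_track_registered: bool) -> dict:
    """Return a copy of the payload with TLE-derived fields removed for
    non-registered organisations; registered organisations see everything."""
    if space_track_registered:
        return dict(payload)
    return {k: v for k, v in payload.items() if k not in TLE_DERIVED_FIELDS}
```

In the real middleware, each gated response would also be written to data_disclosure_log for the licence audit trail.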
F4 — No export control screening at registration
Fix applied (§24.2): country_of_incorporation, export_control_screened_at, export_control_cleared, and itar_cleared columns added to organisations table. Onboarding flow screens against embargoed countries (ISO 3166-1 alpha-2) and BIS Entity List. EU-SST-derived data gated behind itar_cleared. Documented in legal/EXPORT_CONTROL_POLICY.md.
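A minimal sketch of the onboarding screen, assuming a hypothetical helper name; the embargoed-country set below is a placeholder — the authoritative list lives in legal/EXPORT_CONTROL_POLICY.md and must be maintained by counsel:

```python
# Illustrative only — NOT the authoritative embargo list.
EMBARGOED_ISO_ALPHA2 = {"KP", "IR", "SY", "CU"}

def screen_organisation(country_of_incorporation: str,
                        on_bis_entity_list: bool) -> bool:
    """True if the organisation clears automated screening; a False result
    blocks account activation pending manual review."""
    country = country_of_incorporation.strip().upper()
    if len(country) != 2:
        raise ValueError("country_of_incorporation must be ISO 3166-1 alpha-2")
    return country not in EMBARGOED_ISO_ALPHA2 and not on_bis_entity_list
```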
F5 — Liability disclaimer in Decision Prompts insufficient as standalone protection
Fix applied (§28.6): Note added that the in-UI disclaimer is a reinforcing reminder only. Substantive liability limitation (consequential loss excluded; aggregate cap = 12 months fees) must appear in the executed MSA (§24.2). UCTA 1977 and EU Unfair Contract Terms Directive requirement noted.
F6 — No retention / deletion schedule; erasure requests unhandled for new tables
Fix applied (§29.1, §29.3): shift_handovers and alarm_threshold_audit added to RoPA with 7-year retention (safety record basis). Pseudonymisation procedure in §29.3 extended to cover shift_handovers — user ID columns nulled, notes prefixed with pseudonym on erasure request.
F7 — Cross-border data transfer mechanism not formally documented
Fix applied (§29.5): legal/DATA_RESIDENCY.md designated as authoritative sub-processor list with hosting provider, region, SCC/IDTA status. Annual DPO review and customer notification on material sub-processor change formalised.
F8 — EU AI Act obligations not assessed
Fix applied (§24.10): New section added. Conjunction probability model classified as high-risk AI under EU AI Act Annex III (transport infrastructure safety). Eight high-risk obligations mapped (risk management, data governance, technical documentation, logging, transparency, human oversight, accuracy/robustness, conformity assessment). Human oversight statement added as mandatory non-configurable UI element in §19.4 conjunction probability display. EU database registration (Art. 51) added as Phase 3 gate. legal/EU_AI_ACT_ASSESSMENT.md designated as authoritative document.
F9 — No regulatory correspondence register
Fix applied (§24.11): New section added. legal/REGULATORY_CORRESPONDENCE_LOG.md designated as structured register. SLAs: 2-business-day acknowledgement, 14-calendar-day response. Quarterly steering review of outstanding correspondence. Proactive engagement triggered by ≥3 queries from same authority in 12 months.
F10 — Cookie / tracking consent mechanism not specified
Fix applied (§29.7): New section added. Cookie audit table defined (strictly necessary / functional / analytics). HttpOnly; Secure; SameSite=Strict formalised as required security attributes. Consent banner specification: three tiers, preference stored in localStorage (not a cookie), re-requested on material category changes. legal/COOKIE_POLICY.md designated as authoritative document.
F11 — Incident notification obligations not mapped to regulatory timelines
Fix applied (§29.6): NIS2 Art. 23 obligations added alongside GDPR Art. 33. Early warning deadline: 24 hours of awareness (NIS2) vs. 72 hours (GDPR). Full NIS2 notification: 72 hours. Final report: 1 month. On-call escalation requirement to DPO within 24-hour window documented. legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md designated as authoritative template document.
49.2 Sections Modified
| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Added Space-Track redistribution gate (space_track_registered), data_disclosure_log table, export control screening columns and onboarding flow |
| §24.10 (new) EU AI Act Obligations | Full high-risk AI obligation mapping; human oversight statement; conformity assessment and registration roadmap |
| §24.11 (new) Regulatory Correspondence Register | Structured log specification; SLAs; escalation trigger |
| §28.6 Cognitive Load Reduction | Added legal sufficiency note on Decision Prompts disclaimer; MSA cross-reference |
| §29.1 Data Inventory | Formalised as GDPR Art. 30 RoPA; added shift_handovers, alarm_threshold_audit, data_disclosure_log entries; DPIA trigger documented |
| §29.3 Erasure vs. Retention Conflict | Extended pseudonymisation procedure to cover shift_handovers |
| §29.5 Cross-Border Data Transfer Safeguards | Added legal/DATA_RESIDENCY.md as authoritative document with annual review requirement |
| §29.6 Security Breach Notification | Expanded to full NIS2 Art. 23 obligations table; multi-framework notification timeline |
| §29.7 (new) Cookie / Tracking Consent | Cookie audit table; HttpOnly; Secure; SameSite=Strict formalised; consent banner specification |
49.3 New Tables / Columns
| Table / Column | Purpose |
|---|---|
| data_disclosure_log | Immutable record of every TLE-derived data disclosure per organisation; supports Space-Track licence audit |
| organisations.space_track_registered | Gate controlling access to TLE-derived API fields |
| organisations.country_of_incorporation | Feeds export control screening at onboarding |
| organisations.export_control_cleared | Records completion of export control screening |
| organisations.itar_cleared | Gates EU-SST-derived data to cleared entities only |
49.4 New Legal Documents (required before Phase 2 gate)
| Document | Purpose |
|---|---|
| legal/ROPA.md | GDPR Art. 30 Record of Processing Activities — authoritative version |
| legal/DPIA_conjunction_alerts.md | Data Protection Impact Assessment for conjunction alert delivery |
| legal/EXPORT_CONTROL_POLICY.md | Export control screening procedure and embargoed-country list |
| legal/DATA_RESIDENCY.md | Sub-processor list with hosting regions and SCC/IDTA status |
| legal/EU_AI_ACT_ASSESSMENT.md | High-risk AI classification; obligation mapping; conformity assessment |
| legal/REGULATORY_CORRESPONDENCE_LOG.md | Structured register of regulatory correspondence |
| legal/COOKIE_POLICY.md | Cookie audit and consent policy |
| legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md | Multi-framework notification timelines and templates |
49.5 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| In-UI disclaimer as sole liability protection | Substantive liability cap in executed MSA; UI disclaimer is reinforcement only |
| Serving TLE-derived data without licence verification | Gate behind space_track_registered; log all disclosures |
| Registering users without country-of-incorporation check | Collect at onboarding; screen against embargoed countries and BIS Entity List before account activation |
| Treating GDPR 72-hour obligation as the only notification deadline | NIS2 requires 24-hour early warning for significant incidents; both timelines must be tracked simultaneously |
| Storing consent preference in a cookie | Self-defeating; use localStorage with no expiry |
| Self-classifying the conjunction model as low-risk AI | Transport infrastructure safety = Annex III high-risk; full obligations apply regardless of system size |
49.6 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| RoPA location | legal/ROPA.md (authoritative) + §29.1 mirror | MASTER_PLAN only | Regulatory auditors expect a standalone document; MASTER_PLAN mirror keeps engineers informed |
| Space-Track gate mechanism | Per-org boolean + middleware check | Per-request licence verification | Per-request verification against Space-Track API would add latency and a hard dependency; boolean flag updated at onboarding and reviewed quarterly |
| EU AI Act classification | High-risk (Annex III, transport safety) | Low-risk / unclassified | The conjunction model informs time-critical airspace decisions; conservative classification is the legally safe position; reclassification requires legal opinion |
| Cookie consent storage | localStorage | Session cookie | Storing consent in a cookie creates a circular dependency (need consent to set cookie, but cookie stores consent); localStorage avoids this without additional server round-trips |
| NIS2 applicability | Treat SpaceCom as essential entity (space traffic management) | Treat as non-essential until formally classified | Early compliance avoids a reclassification scramble; ENISA guidance indicates space infrastructure operators are likely Annex I essential entities |
§50 Accessibility Engineering — Specialist Review
Standards basis: WCAG 2.1 Level AA (ISO/IEC 40500:2012), WAI-ARIA 1.2, EN 301 549 v3.2.1, Section 508, APCA contrast algorithm, ATAG 2.0
Review scope: Keyboard navigation, screen reader compatibility, colour contrast, motion/animation, focus management, dynamic content announcements, form accessibility, alert/modal accessibility, time-limited interactions, ARIA live regions
50.1 Findings and Fixes Applied
F1 — No accessibility standard committed; EN 301 549 mandatory for ESA procurement
Fix applied (§13.0, §25.6): WCAG 2.1 AA committed as minimum standard in new §13.0. Definition of done updated: all PRs must pass axe-core wcag2a/aa before merge. ACR/VPAT 2.4 added to §25.6 ESA procurement artefacts table as a required Phase 2 deliverable.
F2 — CRITICAL alert overlay inaccessible to screen reader and keyboard users
Fix applied (§28.3): Full ARIA alertdialog spec added: role="alertdialog", aria-modal="true", programmatic focus() on render, aria-hidden="true" on map container, aria-live="assertive" announcement region, visible text status indicator for deaf operators, Escape key handling per severity level.
F3 — Structured acknowledgement form has no accessible labels
Fix applied (§28.5): Native <input type="radio"> with <label for="...">, <fieldset> + <legend>, aria-keyshortcuts on trigger, visible keyboard shortcut legend inside dialog, aria-required on free-text field when OTHER selected, aria-live="polite" confirmation on submit.
F4 — CesiumJS globe inaccessible; no keyboard/screen reader equivalent
Fix applied (§13.2): New §13.2 specifies ObjectTableView.tsx as a parallel accessible table view. Accessible via Alt+T and a persistent visible button. All alert interactions completable from table view alone. Implemented with native <table> elements; aria-sort, aria-rowcount, aria-rowindex for virtual scroll.
F5 — Colour is the sole differentiator for alert severity
Fix applied (§13.4): Non-colour severity indicators specified: per-severity icon/shape (octagon/triangle/circle/circle-outline), text labels always visible, distinct border widths. 1 Hz colour cycle also has a 1 Hz border-width pulse as redundant indicator.
F6 — No keyboard navigation spec for primary operator workflow
Fix applied (§13.3): New §13.3 specifies skip links, focus ring (3px, ≥3:1 contrast, --focus-ring token), tab order rules (no tabindex > 0), full application keyboard shortcut table (Alt+A/T/H/N, ?, Escape, arrow keys), aria-keyshortcuts on all trigger elements, conflict-free shortcut design.
F7 — Colour contrast ratios not specified
Fix applied (§13.4): Verified contrast table for all operational severity colours on dark theme #1A1A2E. All pairs meet ≥4.5:1 (AA). Design token file frontend/src/tokens/colours.ts designated as authoritative; no hardcoded colour values in component files.
F8 — Session timeout risk during shift handover
Fix applied (§28.5a): WCAG 2.2.1 (Timing Adjustable) compliance spec added. T−2 minute warning dialog with aria-live="polite" announcement. Auto-extension (30 min, once per session) when /handover view is active. POST /api/v1/auth/extend-session endpoint specified. Extension logged in security_logs as SESSION_AUTO_EXTENDED_HANDOVER.
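The auto-extension rule — once per session, 30 minutes, only while /handover is active — reduces to a small pure decision function; names and the tuple return shape are illustrative, not the committed endpoint contract:

```python
from datetime import datetime, timedelta

# Sketch of the §28.5a auto-extension rule behind
# POST /api/v1/auth/extend-session.
def maybe_extend_session(expires_at: datetime,
                         handover_view_active: bool,
                         already_extended: bool) -> tuple[datetime, bool]:
    """Return (new_expiry, extended_flag). Extend by 30 minutes at most
    once per session, and only during an active handover view."""
    if handover_view_active and not already_extended:
        return expires_at + timedelta(minutes=30), True
    return expires_at, already_extended
```

Keeping the decision pure makes the "once per session" and "handover only" constraints trivially unit-testable before the security_logs write is wired in.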
F9 — Decision Prompts accordion not keyboard-operable or screen-reader-friendly
Fix applied (§28.6): Full WAI-ARIA Accordion pattern specified: aria-expanded, aria-controls, role="region", aria-labelledby, native checkbox inputs with labels, arrow-key navigation, aria-live="polite" confirmation on checkbox state change.
F10 — No reduced-motion support
Fix applied (§28.3): prefers-reduced-motion: reduce CSS implementation specified for CRITICAL banner colour cycle (static thick border replaces animation). CesiumJS corridor animation: JS matchMedia check on mount; particle animation disabled; static opacity when reduced motion preferred. Listener on change event for live preference updates without page reload.
F11 — No accessibility testing in CI
Fix applied (§42.2, §42.5): e2e/test_accessibility.ts added using @axe-core/playwright. Scans 5 primary views. wcag2a + wcag2aa violations block PR; wcag2aaa warnings only. Results as CI artefact a11y-report.html. Manual screen reader test (NVDA+Firefox, VoiceOver+Safari) added to release checklist. Decision log entry added in §42.5.
50.2 Sections Modified
| Section | Change |
|---|---|
| §13.0 (new) Accessibility Standard Commitment | WCAG 2.1 AA minimum standard; EN 301 549 mandatory for ESA; ACR/VPAT as Phase 2 deliverable; definition of done |
| §13.2 (new) Accessible Parallel Table View | ObjectTableView.tsx spec; keyboard trigger; native table markup; virtual scroll ARIA attributes |
| §13.3 (new) Keyboard Navigation Specification | Skip links; focus ring token; tab order rules; full shortcut table; aria-keyshortcuts convention |
| §13.4 (new) Colour and Contrast Specification | Verified contrast table; design token file; non-colour severity indicators (icons, text labels, border widths) |
| §25.6 Required ESA Procurement Artefacts | ACR/VPAT 2.4 added to artefacts table |
| §28.3 Alarm Management | CRITICAL alert ARIA spec; reduced-motion CSS spec |
| §28.5 Error Recovery | Acknowledgement form accessibility: native inputs, fieldset/legend, aria-keyshortcuts, confirmation announcement |
| §28.5a Shift Handover | Session timeout accessibility: T−2 min warning, auto-extension during handover, extend-session endpoint |
| §28.6 Cognitive Load Reduction | Decision Prompts ARIA Accordion pattern spec |
| §42.2 Test Suite Inventory | test_accessibility.ts added to e2e suite |
| §42.3 (renamed from 42.2) | axe-core implementation spec with code example; manual screen reader test checklist |
| §42.5 Decision Log | Accessibility CI gate decision added |
50.3 New Components
| Component / File | Purpose |
|---|---|
| src/components/globe/ObjectTableView.tsx | Accessible parallel table view for all globe objects |
| frontend/src/tokens/colours.ts | Design token file for all operational colours; authoritative contrast reference |
| e2e/test_accessibility.ts | @axe-core/playwright scans blocking PRs on WCAG 2.1 AA violations |
| docs/RELEASE_CHECKLIST.md | Manual screen reader test steps; keyboard-only workflow test |
50.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| aria-label on a <div> when a native <button> would do | Always prefer native HTML semantics; ARIA substitutes only when no native element exists |
| outline: none without a custom focus indicator | Never suppress focus ring without providing an equivalent; use --focus-ring token |
| tabindex="2" or any positive tabindex | Never; positive tabindex breaks natural reading order and confuses screen readers |
| Colour-only severity communication | Always pair colour with shape, text label, and border width as redundant indicators |
| Inline aria-live="assertive" for non-emergency announcements | assertive interrupts immediately; use polite for non-CRITICAL confirmations, assertive only for CRITICAL alerts |
| Session timeout that cannot be extended | WCAG 2.2.1 requires user ability to extend or disable timing; auto-extend during safety-critical views is the correct pattern |
50.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Globe accessibility approach | Parallel accessible table view | Making CesiumJS accessible directly | WebGL canvas cannot be made screen-reader accessible; a parallel data view is the only WCAG-conformant approach for complex visualisations |
| Focus ring specification | 3px solid #4A9FFF, design token | Browser default outline | Browser default fails contrast requirements on dark themes; design token ensures consistency and testability |
| axe-core CI level | wcag2a + wcag2aa block; wcag2aaa warn | All levels block, or all levels warn | All-block creates false positives (AAA is aspirational); all-warn provides no enforcement; AA is the legal and contractual minimum |
| Reduced-motion: animation vs. static | Static thick border when prefers-reduced-motion: reduce | Slow down animation | Slowing animation still triggers vestibular symptoms; static replacement is the only fully safe approach |
| Session auto-extension scope | Only during /handover active; once per session | For any active form | Broad auto-extension creates security risk (indefinitely open sessions); limiting to handover scope is the narrowest sufficient accommodation |
§52 Incident Response / Disaster Recovery Engineering — Specialist Review
Standards basis: NIST SP 800-61r2, ISO/IEC 27035, ISO 22301, ITIL 4, ICAO Doc 9859, AWS/GCP Well-Architected Framework (Reliability Pillar), Google SRE Book (Chapter 14)
Review scope: Incident classification, runbook completeness, escalation chains, RTO/RPO definition and achievability, backup and restore, chaos/game day testing, on-call rotation, post-incident review, DR site strategy, alert_events integrity
52.1 Findings and Fixes Applied
F1 — RTO and RPO targets not formally defined with derivation rationale
Fix applied (§26.2): Table expanded with derivation column. RTO ≤ 15 min (active TIP event) derived from 4-hour CRITICAL rate-limit window. RTO ≤ 60 min (no active event) aligns with MSA SLA. RPO zero for safety-critical tables derived from UN Liability Convention evidentiary requirements. MSA sign-off requirement added — customers must agree RTO/RPO before production deployment.
F2 — No restore time target or WAL retention period
Fix applied (§26.6): WAL retained 30 days; base backups 90 days; safety-critical tables in MinIO Object Lock COMPLIANCE mode for 7 years. Restore time target < 30 minutes documented. docs/runbooks/db-restore.md designated as Phase 2 deliverable.
F3 — No runbook for prediction service outage during active re-entry event
Fix applied (§26.8): New runbook row added to the required runbooks table covering: detection → 5-minute ANSP notification → incident commander designation → 15-minute update cadence → restoration checklist → PIR trigger. Full procedure in docs/runbooks/prediction-service-outage-during-active-event.md.
F4 — No chaos engineering / game day programme
Fix applied (§26.8): Quarterly game day programme specified. 6 scenarios defined with inject, expected behaviour, and pass criteria. Scenario fail treated as SEV-2 with PIR. docs/runbooks/game-day-scenarios.md designated.
F5 — On-call rotation underspecified
Fix applied (§26.8): 7-day rotation, minimum 2-engineer pool. L1 → L2 escalation trigger: 30 minutes without containment. L2 → L3 triggers enumerated (ANSP data affected, security breach, total outage > 15 min, regulatory notification triggered). On-call handoff log specified mirroring operator /handover model.
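The escalation triggers above can be sketched as a single decision function; the parameter names are illustrative and the runbook remains authoritative:

```python
# Sketch of the §26.8 L1 → L2 → L3 escalation decision.
def escalation_level(minutes_without_containment: float,
                     ansp_data_affected: bool = False,
                     security_breach: bool = False,
                     total_outage_minutes: float = 0.0,
                     regulatory_notification_triggered: bool = False) -> int:
    """Return the on-call level that should own the incident right now."""
    # Any L3 trigger overrides the containment timer.
    if (ansp_data_affected or security_breach
            or total_outage_minutes > 15 or regulatory_notification_triggered):
        return 3
    # L1 → L2 after 30 minutes without containment.
    if minutes_without_containment >= 30:
        return 2
    return 1
```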
F6 — No P1/P2/P3 severity communication commitments
Fix applied (§26.8): ANSP notification commitments per SEV level added. SEV-1 active TIP event: push + email within 5 minutes, 15-minute cadence. SEV-1 no active event: email within 15 minutes. SEV-2: email within 30 minutes if predictions affected. SEV-3/4: status page only.
F7 — No DR site or failover architecture
Fix applied (§26.3): Cross-region warm standby architecture added. DB replica promoted on failover; app tier deployed from pre-pulled container images; MinIO bucket replication active; DNS health-check-based routing (TTL 60s). Estimated failover time < 15 minutes. Annual game day test (scenario 6). docs/runbooks/region-failover.md designated.
F8 — No post-incident review process
Fix applied (§26.8): Mandatory PIR for all SEV-1 and SEV-2. Due within 5 business days. 7-section structure: summary, timeline, 5-whys root cause, contributing factors, impact, remediation actions (GitHub issues, incident-remediation label), what went well. Presented at engineering all-hands. Remediations are P2 priority.
F9 — alert_events not HMAC-protected
Fix applied (§7.9, alert_events schema): record_hmac TEXT NOT NULL column added. Signing function specified (id, object_id, org_id, level, trigger_type, created_at, acknowledged_by, action_taken). Nightly Celery Beat integrity check re-verifies all events from past 24 hours; HMAC failure raises CRITICAL security alert. Existing alert_events_immutable trigger already prevents modification.
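A minimal sketch of the signing function over the eight fields listed above, assuming HMAC-SHA256 and a unit-separator join as the canonical serialisation (the committed §7.9 spec is authoritative; key management is out of scope here):

```python
import hashlib
import hmac

# Field order is fixed; None serialises as the empty string so that
# unacknowledged events sign deterministically.
SIGNED_FIELDS = ("id", "object_id", "org_id", "level", "trigger_type",
                 "created_at", "acknowledged_by", "action_taken")

def sign_alert_event(record: dict, key: bytes) -> str:
    """Hex HMAC-SHA256 over the canonical field serialisation."""
    canonical = "\x1f".join(
        "" if record.get(f) is None else str(record[f]) for f in SIGNED_FIELDS
    )
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_alert_event(record: dict, key: bytes, stored_hmac: str) -> bool:
    """Constant-time comparison, as used by the nightly integrity check."""
    return hmac.compare_digest(sign_alert_event(record, key), stored_hmac)
```

The nightly Celery task would call verify_alert_event for every row from the past 24 hours and raise the CRITICAL security alert on the first mismatch.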
F10 — No incident communication templates
Fix applied (§26.8): docs/runbooks/incident-comms-templates.md designated with 4 templates (initial notification, 15-min update, resolution, post-incident summary). Legal counsel review required before first use. Templates specify what never to include (speculation, premature ETAs, admissions of liability).
F11 — Operational and security incidents not separated
Fix applied (§26.8): Operational vs. security incident comparison table added. Separate runbooks designated: docs/runbooks/operational-incident-response.md and docs/runbooks/security-incident-response.md. Security incidents: no public status page until legal counsel approves; DPO within 4 hours; NIS2/GDPR timelines from §29.6.
52.2 Sections Modified
| Section | Change |
|---|---|
| §26.2 Recovery Objectives | Derivation rationale column; MSA sign-off requirement |
| §26.3 High Availability Architecture | Cross-region warm standby DR strategy; component failover table; estimated recovery time |
| §26.6 Backup and Restore | WAL retention 30 days; restore time target < 30 min; MinIO Object Lock for 7-year legal hold; docs/runbooks/db-restore.md |
| §26.8 Incident Response | Prediction-service-outage runbook; on-call rotation spec + handoff log; ANSP comms per severity; PIR process; game day programme; incident comms templates; operational/security split |
| §7.9 Data Integrity | alert_events HMAC signing function; nightly integrity check Celery task |
| alert_events schema | record_hmac TEXT NOT NULL column added |
52.3 New Runbooks Required (Phase 2 deliverables)
| Runbook | Trigger |
|---|---|
| docs/runbooks/db-restore.md | Monthly restore test failure; DR failover |
| docs/runbooks/prediction-service-outage-during-active-event.md | SEV-1 during active TIP event |
| docs/runbooks/region-failover.md | Cloud region failure; annual game day |
| docs/runbooks/game-day-scenarios.md | Quarterly game day reference |
| docs/runbooks/incident-comms-templates.md | All SEV-1/2 incidents |
| docs/runbooks/operational-incident-response.md | All operational incidents |
| docs/runbooks/security-incident-response.md | All security incidents |
| docs/runbooks/on-call-handoff-log.md | Weekly rotation boundary |
| docs/post-incident-reviews/ | All SEV-1/2 incidents (within 5 business days) |
52.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| RTO/RPO as aspirational targets without derivation | Derive from operational requirements; document rationale; agree in MSA |
| Single-region deployment with 1-hour RTO target | Warm standby in a second region; < 15 min estimated failover |
| Conflating operational and security incident response | Separate runbooks; different escalation chains; different communication rules |
| Improvised ANSP communications under pressure | Pre-drafted legal-reviewed templates; deviations require incident commander approval |
| PIR as optional / informal | Mandatory for SEV-1/2; structured format; remediation tracking; all-hands presentation |
| Game day as a one-time activity | Quarterly rotation; each scenario tested at least annually; failures treated as SEV-2 |
52.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| DR strategy | Warm standby (second region) | Cold standby or active-active | Cold standby: restore time too slow for RTO; active-active: complexity and cost disproportionate to Phase 1 scale; warm standby meets RTO at acceptable cost |
| alert_events HMAC | Nightly batch verification | Per-request verification | Per-request adds latency to the alert delivery path; nightly batch catches tampering within 24 hours — adequate for evidentiary purposes |
| PIR timing | 5 business days | 24 hours / 30 days | 24 hours is too fast for full 5-whys analysis; 30 days allows recurrence before remediation; 5 days balances speed with quality |
| Game day cadence | Quarterly | Monthly / annually | Monthly creates operational fatigue; annually is too infrequent to maintain muscle memory; quarterly is standard SRE practice |
| On-call escalation trigger | 30 minutes containment | 15 minutes / 60 minutes | 15 minutes is too aggressive for complex incidents; 60 minutes risks SLO breach before L2 engaged; 30 minutes matches the active TIP event RTO window |
§51 Internationalisation / Localisation Engineering — Specialist Review
Standards basis: Unicode CLDR 44, IETF BCP 47, ISO 8601, ICAO Annex 2 / Annex 15 / Doc 8400 (UTC mandate), POSIX locale model, W3C Internationalisation guidelines, ICU MessageFormat 2.0, EU Regulation 2018/1139 (EASA language requirements)
Review scope: Timezone handling, date/time display, number/unit formatting, string externalisation, RTL layout, language coverage, ICAO UTC compliance, API date formats, database timezone storage
51.1 Findings and Fixes Applied
F1 — Operational times must be UTC; no local timezone conversion in ops interface
Fix applied (§13.0): Iron UTC rule documented. All Persona A/C views display UTC only, formatted HH:MMZ or DD MMM YYYY HH:MMZ. Z suffix always inline, never a tooltip. No timezone conversion widget in operational interface. Local time permitted only in non-operational admin views with explicit timezone label. API times always ISO 8601 UTC.
F2 — ORM may silently convert TIMESTAMPTZ to session timezone
Fix applied (§7.9): SET TIME ZONE 'UTC' enforced on every connection via SQLAlchemy engine event listener. Blocking integration test test_timestamps_round_trip_as_utc added — asserts that a known UTC datetime survives a full ORM insert/read cycle without offset conversion.
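The invariant the blocking test enforces can be sketched with stdlib datetime alone — the real test runs the full SQLAlchemy insert/read cycle against Postgres; isoformat round-tripping here is only a stand-in for the ORM write/read, and the commented listener shows the intended engine-event shape:

```python
from datetime import datetime, timezone

# Connection-level guard from §7.9, assuming an engine object (sketch only):
#   from sqlalchemy import event
#   @event.listens_for(engine, "connect")
#   def _force_utc(dbapi_conn, _record):
#       with dbapi_conn.cursor() as cur:
#           cur.execute("SET TIME ZONE 'UTC'")

def round_trips_as_utc(dt: datetime) -> bool:
    """Assertion shape of test_timestamps_round_trip_as_utc: an aware UTC
    datetime must survive a write/read cycle with zero offset shift."""
    if dt.tzinfo is None:
        raise ValueError("operational timestamps must be timezone-aware")
    restored = datetime.fromisoformat(dt.isoformat())  # stand-in for ORM cycle
    return restored == dt and restored.utcoffset() == timezone.utc.utcoffset(None)
```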
F3 — Re-entry window displayed without explicit UTC label
Fix applied (§28.4): Rule 1 of probabilistic communication to non-specialists updated — all absolute times rendered as HH:MMZ per ICAO Doc 8400 UTC-suffix convention. Z suffix always rendered inline; never hover-only.
F4 — Number formatting not locale-aware in non-operational views
Fix applied (§13.4): formatOperationalNumber() (ICAO decimal point, invariant) and formatDisplayNumber(locale) (Intl.NumberFormat, locale-aware) helpers specified. Raw Number.toString() and n.toFixed() banned from JSX.
F5 — No string externalisation strategy; hardcoded strings block localisation
Fix applied (§13.5): next-intl adopted. All user-facing strings in messages/en.json. Message ID convention defined. eslint-plugin-i18n-json enforcement. ICAO-fixed strings explicitly excluded and annotated // ICAO-FIXED: do not translate.
F6 — NOTAM draft output must be ICAO English regardless of UI locale
Fix applied (§6.13): NOTAM template strings hardcoded ICAO English phraseology in backend/app/modules/notam/templates.py, annotated # ICAO-FIXED: do not translate. Excluded from next-intl extraction. Preview renders in monospace font with lang="en" attribute.
F7 — Slash-delimited dates are ambiguous in exports
Fix applied (§6.12): DD MMM YYYY format mandated for all PDF reports, CSV exports, and display previews (e.g. 04 MAR 2026). Slash-delimited dates banned from all SpaceCom outputs. Times alongside dates use HH:MMZ. NOTAM internal YYMMDDHHMM fields displayed as DD MMM YYYY HH:MMZ in preview.
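A sketch of the export formatter, using fixed English month tokens rather than locale-dependent strftime output so the format cannot drift with the server locale (helper names are illustrative):

```python
from datetime import datetime, timezone

_MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def format_export_date(dt: datetime) -> str:
    """§6.12 rule: e.g. 04 MAR 2026 — unambiguous, slash-free."""
    return f"{dt.day:02d} {_MONTHS[dt.month - 1]} {dt.year}"

def format_export_datetime(dt: datetime) -> str:
    """Date plus HH:MMZ time; refuses non-UTC input per the iron UTC rule."""
    if dt.utcoffset() != timezone.utc.utcoffset(None):
        raise ValueError("export timestamps must be UTC")
    return f"{format_export_date(dt)} {dt.hour:02d}:{dt.minute:02d}Z"
```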
F8 — RTL layout not considered; directional CSS utilities used
Fix applied (§13.5): CSS logical properties table specified (margin-inline-start etc. replacing ml-/mr-). <html dir="ltr"> hardcoded for Phase 1; becomes dir={locale.dir} when RTL locale added — no component changes required. docs/ADDING_A_LOCALE.md checklist includes RTL gate.
F9 — Altitude units inconsistent between aviation and space personas
Fix applied (users table, §13.5): altitude_unit_preference column added to users table (ft default for ANSP operators, km for space operators). API transmits metres; display layer converts. Unit label always visible. FL notation shown in parentheses for ft context. User can override in account settings.
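The convert-at-display rule can be sketched as follows; the display strings are illustrative and the actual conversion lives in the frontend display layer:

```python
# API transmits metres; the display layer converts per
# users.altitude_unit_preference. 1 ft = 0.3048 m exactly (ICAO).
FT_PER_M = 1 / 0.3048

def format_altitude(metres: float, unit_preference: str) -> str:
    """Unit label always visible; FL notation in parentheses for ft context."""
    if unit_preference == "ft":
        feet = metres * FT_PER_M
        fl = round(feet / 100)  # flight level = altitude in hundreds of feet
        return f"{feet:,.0f} ft (FL{fl:03d})"
    if unit_preference == "km":
        return f"{metres / 1000:.1f} km"
    raise ValueError(f"unknown altitude unit preference: {unit_preference}")
```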
F10 — API date formats inconsistent (Unix timestamps vs. ISO 8601)
Fix applied (§14 API Versioning Policy): ISO 8601 UTC (2026-03-22T14:00:00Z) mandated for all API date fields. OpenAPI format: date-time on all _at/_time fields. Blocking contract test asserts regex match. Pydantic json_encoders specified.
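A sketch of what the blocking contract test's regex and the serialiser might look like; the regex pattern and `encode_utc` name are illustrative, not the committed Pydantic configuration:

```python
import re
from datetime import datetime, timezone

# Pattern the contract test could assert against every *_at / *_time field.
ISO8601_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def encode_utc(dt: datetime) -> str:
    """Serialise a datetime as ISO 8601 UTC with a literal Z suffix,
    normalising any tz-aware input to UTC first."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```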
F11 — Language coverage undefined; English-only now but architecture must support future localisation
Fix applied (§13.5): English-only explicitly committed for Phase 1. next-intl architecture allows adding a locale by adding messages/{locale}.json only — no component changes. messages/fr.json and messages/de.json scaffolded at Phase 2/3 start. docs/ADDING_A_LOCALE.md checklist documented.
51.2 Sections Modified
| Section | Change |
|---|---|
| §6.12 Report Generation | DD MMM YYYY date format rule; slash-delimited dates banned |
| §6.13 NOTAM Drafting Workflow | ICAO-FIXED template rule; lang="en" on NOTAM container |
| §7.9 Data Integrity | SET TIME ZONE 'UTC' connection event listener; test_timestamps_round_trip_as_utc integration test |
| §13.0 Accessibility Standard Commitment | UTC-only rule added |
| §13.4 Colour and Contrast Specification | formatOperationalNumber / formatDisplayNumber helpers; Intl.NumberFormat mandate |
| §13.5 (new) Internationalisation Architecture | next-intl; messages/en.json; ICAO-FIXED exclusions; CSS logical properties; altitude unit display; docs/ADDING_A_LOCALE.md checklist |
| §14 API Versioning Policy | ISO 8601 UTC contract; OpenAPI format: date-time; contract test; Pydantic encoder |
| §28.4 Probabilistic Communication | HH:MMZ inline UTC suffix rule |
| users table | altitude_unit_preference column added |
51.3 New Files
| File | Purpose |
|---|---|
| messages/en.json | Phase 1 string source of truth for next-intl |
| messages/fr.json | Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy) |
| messages/de.json | Phase 3 scaffold |
| docs/ADDING_A_LOCALE.md | Step-by-step checklist for adding a new locale; includes RTL gate |
| frontend/src/lib/formatters.ts | formatOperationalNumber, formatDisplayNumber, formatUtcTime, formatUtcDate helpers |
| tests/test_db_timezone.py | Blocking integration test for UTC round-trip integrity |
51.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Displaying local time in the ops interface | UTC only; HH:MMZ always; no conversion widget |
| Number.toString() or n.toFixed() in JSX | formatOperationalNumber() (ICAO) or formatDisplayNumber(locale) depending on context |
| 03/04/2026 in any export or report | 04 MAR 2026 — unambiguous ICAO-aligned format |
| Translating NOTAM template strings | ICAO-FIXED; annotate and exclude from i18n tooling |
| Positive tabindex (already covered §50) | Never; noted here as it is also an i18n anti-pattern (breaks RTL reading order) |
| Hardcoded margin-left in new components | margin-inline-start; logical properties throughout |
| Multiple API date formats in same response | ISO 8601 UTC only; one format, no exceptions |
51.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Operational time display | UTC-only, HH:MMZ inline | User-selectable timezone | ICAO Annex 15 mandates UTC for aeronautical data; a timezone selector introduces conversion errors under time pressure |
| Date format in exports | DD MMM YYYY | ISO 8601 (2026-03-04) | ISO 8601 is unambiguous but unfamiliar to aviation professionals; DD MMM YYYY matches aviation document convention (NOTAM, METARs) and is equally unambiguous |
| Phase 1 language scope | English only | Multi-language from Phase 1 | Localisation adds QA overhead and translation cost before product-market fit is proven; architecture supports future locales without rework |
| i18n library | next-intl | react-i18next | next-intl has first-class App Router RSC support; react-i18next requires client-component wrapping for all translated text |
| Altitude storage unit | Metres (API + DB) | Role-dependent storage | Single SI storage unit eliminates conversion bugs in physics engine; display conversion is well-understood and testable |
| ORM timezone enforcement | Engine event listener (SET TIME ZONE UTC) | Application-level assertion | Engine listener fires at connection creation and cannot be bypassed by individual queries; application assertions can be accidentally omitted |
§53 Machine Learning / Data Science — Specialist Review
Standards basis: ISO/IEC 22989, ECSS-E-ST-10-04C, IADC Space Debris Mitigation Guidelines, ESA DRAMA methodology, Vallado (2013), JB2008, NRLMSISE-00, FAA Order 8040.4B, EU AI Act Art. 10

Review scope: Conjunction Pc model, SGP4 domain, atmospheric density model selection, MC convergence, survival probability, model versioning, TLE age uncertainty, backcasting, input validation, tail risk, data provenance
53.1 Findings and Fixes Applied
F1 — Conjunction probability model methodology unspecified
Fix applied (§15.5): Alfano (2005) 2D Gaussian method already specified. Validity domain added: three degradation conditions (sub-100m close approach, anisotropic covariance > 100:1, Pc < 1×10⁻¹⁵ floor). API response carries pc_validity and pc_validity_warning fields. Reference test suite added against Vallado & Alfano (2009) published cases with 5% tolerance.
F2 — SGP4 used beyond valid domain without sub-150 km guard
Fix applied (§15.1): Sub-150 km LOW_CONFIDENCE_PROPAGATION flag added to decay predictor. UI badge: "⚠ Re-entry imminent — prediction confidence low." BLOCKING unit test: TLE with perigee 120 km → asserts flag is set.
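The guard itself is trivial, which is exactly why it deserves a BLOCKING test; a sketch (function name assumed):

```python
LOW_CONFIDENCE_PERIGEE_KM = 150.0

def propagation_flags(perigee_altitude_km: float) -> list[str]:
    """Attach LOW_CONFIDENCE_PROPAGATION below the SGP4 validity floor,
    driving the UI badge and the 120 km perigee unit test."""
    flags: list[str] = []
    if perigee_altitude_km < LOW_CONFIDENCE_PERIGEE_KM:
        flags.append("LOW_CONFIDENCE_PROPAGATION")
    return flags
```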
F3 — Atmospheric density model not justified vs. JB2008
Fix applied (§15.2): NRLMSISE-00 Phase 1 selection rationale documented (Python binding maturity, acceptable accuracy at moderate F10.7). Known limitations stated. Phase 2 milestone: evaluate JB2008 on backcasts; migrate if MAE improvement > 15%; ADR 0016. Input validity bounds added: F10.7 [65, 300], Ap [0, 400], altitude [85, 1000] km; violation raises AtmosphericModelInputError.
F4 — MC sample count not justified by convergence analysis
Fix applied (§15.2/§15.4): Convergence table added. N = 500 satisfies < 2% corridor area change between doublings on the reference object. N = 1000 for OOD or storm-warning cases. MC output updated to include p01 and p99.
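The < 2% doubling criterion reduces to a one-line check; a sketch with an assumed function name:

```python
def has_converged(area_at_n: float, area_at_2n: float, tol: float = 0.02) -> bool:
    """True when corridor area changes by less than tol (default 2%)
    between sample counts N and 2N."""
    return abs(area_at_2n - area_at_n) / area_at_n < tol
```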
F5 — Survival probability methodology absent
Fix applied (§15.3): survival_probability, survival_model_version, survival_model_note columns added to reentry_predictions. Phase 1: simplified analytical all-survive/no-survive per material class. Phase 2: ESA DRAMA integration. NOTAM (E) field statement driven by survival_probability.
F6 — No model version governance or reproducibility
Fix applied (§15.6 new): MAJOR/MINOR/PATCH version bump policy defined. Old versions retained in git tags and physics/versions/. POST /decay/predict/reproduce endpoint specified — re-runs with original model version and params for regulatory audit.
F7 — TLE age not a formal uncertainty source
Fix applied (§15.2): Linear inflation model added: uncertainty_multiplier = 1 + 0.15 × tle_age_days applied to ballistic coefficient covariance before MC sampling. tle_age_at_prediction_time and uncertainty_multiplier stored in simulations.params_json and returned in API response.
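A sketch of the linear inflation model above; the helper names are assumed, and the multiplier is applied to the ballistic coefficient covariance exactly as stated:

```python
def tle_uncertainty_multiplier(tle_age_days: float, rate: float = 0.15) -> float:
    """uncertainty_multiplier = 1 + 0.15 x tle_age_days."""
    if tle_age_days < 0:
        raise ValueError("TLE epoch is in the future")
    return 1.0 + rate * tle_age_days

def inflate_bc_covariance(bc_covariance: float, tle_age_days: float) -> float:
    """Apply the multiplier to the ballistic coefficient covariance
    before MC sampling."""
    return bc_covariance * tle_uncertainty_multiplier(tle_age_days)
```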
F8 — No model performance monitoring or drift detection
Fix applied (§15.9 new): reentry_backcasts table specified. Celery task triggered on object status = 'decayed'; compares all 72h predictions to confirmed re-entry time. Rolling 30-prediction MAE nightly; MEDIUM alert if MAE > 2× historical baseline. Admin panel "Model Performance" widget.
F9 — Input data quality gates insufficient
Fix applied (§15.7 new): validate_prediction_inputs() function in backend/app/modules/physics/validation.py. Validates TLE epoch age ≤ 30 days, F10.7/Ap/perigee bounds, mass > 0. Returns structured ValidationError list; endpoint returns 422. All validation paths covered by BLOCKING unit tests.
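A sketch of `validate_prediction_inputs()` under the bounds stated above (TLE age ≤ 30 days, F10.7 [65, 300], Ap [0, 400], mass > 0); the perigee bound reuses the §15.2 altitude range [85, 1000] km as an assumption, and the dataclass shape is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ValidationError:
    field: str
    message: str

def validate_prediction_inputs(tle_age_days: float, f107: float, ap: float,
                               perigee_km: float, mass_kg: float) -> list[ValidationError]:
    """Return structured errors; an empty list means the inputs pass.
    The endpoint maps a non-empty list to a 422 response."""
    errors: list[ValidationError] = []
    if tle_age_days > 30:
        errors.append(ValidationError("tle_age", "TLE epoch older than 30 days"))
    if not 65 <= f107 <= 300:
        errors.append(ValidationError("f107", "F10.7 outside [65, 300]"))
    if not 0 <= ap <= 400:
        errors.append(ValidationError("ap", "Ap outside [0, 400]"))
    if not 85 <= perigee_km <= 1000:
        errors.append(ValidationError("perigee", "Perigee outside [85, 1000] km"))
    if mass_kg <= 0:
        errors.append(ValidationError("mass", "Mass must be positive"))
    return errors
```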
F10 — Tail risks not communicated; only p5–p95 shown
Fix applied (§28.4, reentry_predictions schema): p01_reentry_time and p99_reentry_time columns added. Tail risk annotation displayed when p1–p99 range > 1.5× p5–p95 range: "Extreme case (1% probability outside): p01Z – p99Z." Included as NOTAM draft footnote when condition met.
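The 1.5x display gate can be expressed directly on the percentile timestamps; a sketch with an assumed function name:

```python
from datetime import datetime, timedelta

def tail_risk_annotation_required(p01: datetime, p05: datetime,
                                  p95: datetime, p99: datetime) -> bool:
    """Show the tail annotation (and NOTAM footnote) when the p1-p99
    window exceeds 1.5x the p5-p95 window."""
    return (p99 - p01) > 1.5 * (p95 - p05)
```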
F11 — No training/validation data provenance
Fix applied (§15.8 new): Phase 1 explicitly documented as physics-based with no trained ML components. docs/ml/data-provenance.md designated. EU AI Act Art. 10 compliance mapped to input data provenance (tracked in simulations.params_json). Future ML component protocol: training data, validation split, model card in docs/ml/model-card-{component}.md.
53.2 Sections Modified
| Section | Change |
|---|---|
| §15.1 Catalog Propagator | Sub-150 km LOW_CONFIDENCE_PROPAGATION flag + unit test |
| §15.2 Decay Predictor | NRLMSISE-00 selection rationale vs. JB2008; input bounds; TLE age inflation model; MC convergence table; N=1000 for OOD/storm cases |
| §15.3 Atmospheric Breakup Model | survival_probability / survival_model_version / survival_model_note columns; Phase 1 analytical methodology |
| §15.5 Conjunction Pc | Validity domain (3 degradation conditions); pc_validity API fields; Vallado & Alfano reference test suite |
| §15.6 (new) Model Version Governance | MAJOR/MINOR/PATCH policy; version retention; reproduce endpoint |
| §15.7 (new) Prediction Input Validation | validate_prediction_inputs(); 5 validation rules; 422 response; BLOCKING tests |
| §15.8 (new) Data Provenance | Phase 1 no-ML declaration; EU AI Act Art. 10 mapping; future ML component protocol |
| §15.9 (new) Backcasting Validation | reentry_backcasts table; Celery trigger on decay; rolling MAE drift detection; admin panel widget |
| §28.4 Probabilistic Communication | Tail risk annotation (rule 6); p01/p99 display condition; NOTAM footnote |
| reentry_predictions schema | p01_reentry_time, p99_reentry_time, survival_probability, survival_model_version, survival_model_note |
53.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| reentry_backcasts table | Prediction vs. actual comparison; drift detection input |
| docs/ml/data-provenance.md | Phase 1 no-ML declaration; future ML data provenance template |
| docs/ml/model-card-{component}.md | Template for any future learned component |
| docs/adr/0016-atmospheric-density-model.md | NRLMSISE-00 vs. JB2008 decision; Phase 2 evaluation trigger |
| backend/app/modules/physics/validation.py | validate_prediction_inputs() function |
| tests/physics/test_pc_compute.py | Vallado & Alfano reference cases (BLOCKING) |
53.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Displaying only p5–p95 without tail annotation | Add p1/p99 as explicit tail risk annotation when materially wider |
| Silently clamping out-of-range inputs | Reject with structured ValidationError; operator must correct the input |
| Deleting old model versions on update | Tag and retain; reproduce endpoint requires historical version access |
| Treating TLE age as display-only staleness | TLE age is a formal uncertainty source; inflate MC covariance accordingly |
| Choosing atmospheric model without documented rationale | Document selection vs. alternatives; schedule re-evaluation with objective criterion |
| No feedback loop from confirmed re-entries | Backcasting pipeline closes the loop; MAE monitoring detects drift |
53.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Phase 1 atmospheric model | NRLMSISE-00 | JB2008 | Mature Python binding; acceptable accuracy at moderate F10.7; JB2008 evaluation deferred to Phase 2 with objective trigger |
| Pc method | Alfano (2005) 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and widely accepted; MC Pc reserved for Phase 3 high-Pc cases where Gaussian assumption breaks down |
| MC convergence criterion | < 2% corridor area change between N doublings | Fixed N from literature | Fixed N is arbitrary; convergence criterion is object-class specific and reproducible |
| Tail risk display threshold | p1–p99 > 1.5× p5–p95 | Always show / never show | Always showing creates visual clutter for well-constrained predictions; never showing hides operationally relevant uncertainty; threshold balances both |
| Model version retention | Git tags + physics/versions/ directory | Docker image tags only | Docker images are routinely pruned; git tags are permanent; reproduce endpoint needs the actual code, not just an image |
§54 Technical Documentation / Developer Experience — Specialist Review
Standards basis: OpenAPI 3.1, Keep a Changelog, Conventional Commits, Nygard ADR format, WCAG authoring guidance, MkDocs Material, spectral OpenAPI linting, ESA ECSS documentation requirements

Review scope: OpenAPI spec governance, health endpoint coverage, contribution workflow, ADR process, changelog discipline, developer onboarding, response examples, SDK strategy, runbook structure, docs pipeline, AI assistance declaration
54.1 Findings and Fixes Applied
F1 — OpenAPI spec not declared as source of truth
Fix applied (§14 API Versioning Policy): FastAPI's built-in OpenAPI generation is declared as the sole source of truth. make generate-openapi regenerates openapi.yaml. CI runs openapi-diff --fail-on-incompatible to detect uncommitted drift. The spec is input to Swagger UI, Redoc, contract tests, and the SDK generator.
F2 — No /health or /readiness endpoint specified
Fix applied (§14 System endpoints): New System (no auth required) group added. GET /health — liveness probe; process-alive check only. GET /readyz — readiness probe; checks PostgreSQL, Redis, Celery queue depth; returns 503 when any dependency is unhealthy. Both used by Kubernetes probes, load balancers, and DR automation DNS-flip gate (§26.3). Both included in OpenAPI spec.
F3 — CONTRIBUTING.md absent
Fix applied (§13.6 new): Full contribution workflow documented. Branch naming convention table (feature/fix/chore/release/hotfix), main branch protection (1 approval, all checks pass, no force-push), Conventional Commits commit format, PR template with checklist (test, openapi regeneration, CHANGELOG, axe-core, ADR), 1-business-day review SLA, stale PR automation.
F4 — No ADR process
Fix applied (§13.7 new): ADR process specified using Nygard format in docs/adr/NNNN-title.md. Trigger criteria defined (hard-to-reverse decisions, auditor context, procurement evidence). Standard template specified. Known ADR register table provided with 6 existing entries. Phase 2 ESA submission gate: all referenced ADR numbers must have corresponding files.
F5 — Changelog discipline unspecified
Fix applied (§14 API Versioning Policy): Keep a Changelog format + Conventional Commits declared. [Unreleased] section with Added/Changed/Fixed/Deprecated subsections required on every PR with user-visible effect. make changelog-check CI step fails if [Unreleased] is empty for non-chore/docs commits. Release changelogs drive API key holder notifications and GitHub release notes.
F6 — Developer environment setup undocumented
Fix applied (§13.8 new): docs/DEVELOPMENT.md spec covering: prerequisites (Python 3.11 pinned, Node.js 20, Docker Desktop, make), make dev-up / migrate / seed / dev bootstrap sequence, make test / test-backend / test-frontend / test-e2e commands, local URL map (API, Swagger UI, frontend, MinIO). 30-minute onboarding target. .env.example committed; .env in .gitignore.
F7 — OpenAPI response examples not required
Fix applied (§14 API Versioning Policy): All endpoint schemas must include at least one examples: block. Enforced by spectral lint with custom require-response-example rule in CI. Example YAML fragment provided for GET /objects/{norad_id}. Examples serve: Swagger/Redoc docs, contract test fixtures, ESA auditor readability.
F8 — No SDK or client library strategy
Fix applied (§14 API Versioning Policy): Phase 1 — no SDK; ANSP integrators receive openapi.yaml, docs/integration/ quickstarts (Python httpx/requests, TypeScript), and Postman-importable spec. Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate with openapi-generator-cli targeting Python and TypeScript. Generator config committed to tools/sdk-generator/. Published as spacecom-client PyPI and @spacecom/client npm packages.
F9 — Runbooks named but not templated
Fix applied (§26.8 new subsection): Standard runbook template specified with 7 sections: Triggers, Immediate actions (first 5 minutes), Diagnosis, Resolution steps, Verification, Escalation, Post-incident. Last tested frontmatter field required. make runbook-audit CI check warns if any runbook is older than 12 months. Template preempts the most common incident-pressure failures: vague steps, no expected output, missing escalation path.
F10 — No docs-as-code pipeline
Fix applied (§13.9 new): MkDocs Material as the documentation site generator. mkdocs build --strict in CI fails on broken links and missing pages. markdown-link-check for external links. vale prose style linter. openapi-diff spec drift check. ESA submission artefact: static HTML archived as docs-site-{version}.zip in release assets — reproducible point-in-time snapshot. owner: frontmatter field with quarterly docs-review cron issue.
F11 — AGENTS.md scope vs. MASTER_PLAN undefined
Fix applied (§1 Vision): AI-assisted development policy added. Defines: permitted uses (code generation, refactoring, review, documentation drafting), prohibited uses (autonomous decisions on safety-critical algorithms, auth logic, regulatory compliance text; production credentials; personal data). Human review standards apply identically to AI-generated code. ESA procurement statement: human engineers are sole responsible parties regardless of authoring tool.
54.2 Sections Modified
| Section | Change |
|---|---|
| §1 Vision | AI-assisted development policy; AGENTS.md scope declaration; ESA procurement statement |
| §13.6 (new) Contribution Workflow | Branch naming; commit format; PR template; review SLA; main protection |
| §13.7 (new) Architecture Decision Records | Nygard ADR format; trigger criteria; template; known ADR register; Phase 2 ESA gate |
| §13.8 (new) Developer Environment Setup | docs/DEVELOPMENT.md spec; make targets; 30-minute onboarding target; .env.example policy |
| §13.9 (new) Docs-as-Code Pipeline | MkDocs Material; CI checks (strict, link, vale, openapi-diff); ESA artefact; docs ownership |
| §14 API Versioning Policy | OpenAPI as source of truth; make generate-openapi; CI drift check; changelog discipline; response examples mandate; client SDK strategy |
| §14 System Endpoints (new) | GET /health liveness spec; GET /readyz readiness spec with example responses |
| §26.8 Incident Response | Runbook standard structure template; Last tested field; make runbook-audit |
54.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| CONTRIBUTING.md | Branch naming, commit format, PR template, review SLA |
| CHANGELOG.md | Keep a Changelog format; [Unreleased] driven by PRs; release notes source |
| docs/adr/NNNN-*.md | Architecture Decision Records (Nygard format) |
| docs/DEVELOPMENT.md | Developer onboarding; make targets; environment bootstrap |
| docs/ADDING_A_LOCALE.md | (already referenced §13.5) — Locale addition checklist |
| docs/integration/ | ANSP quickstart guides (Python, TypeScript) |
| tools/sdk-generator/ | openapi-generator-cli config for Phase 2 SDK generation |
| .github/pull_request_template.md | PR checklist enforcing OpenAPI regeneration, CHANGELOG, axe-core, ADR |
| .spectral.yaml | Custom spectral ruleset including require-response-example |
| .vale.ini | Prose style linter config for docs |
| mkdocs.yml | MkDocs Material configuration |
| docs/runbooks/*.md | All runbooks follow the standard template with Last tested frontmatter |
54.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Maintaining a separate OpenAPI spec alongside FastAPI routes | Generate from code; enforce with CI drift check |
| Undocumented GET /health with ad-hoc response shape | Specify the schema, document it in OpenAPI, use it in DR automation |
| New engineers learning the codebase by asking colleagues | docs/DEVELOPMENT.md with 30-min onboarding target; make dev brings up everything |
| Architectural decisions in Slack or PR comments | ADR in docs/adr/; permanent and findable by auditors and new engineers |
| Runbooks written for the first time during an incident | Template-first; test in game day before needed |
| Publishing an API with no response examples | spectral enforces examples: blocks; Swagger UI shows realistic data |
| Building an SDK before customers ask | Phase 2 gate: ≥ 2 ANSP requests; Phase 1 is openapi.yaml + quickstarts |
54.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| OpenAPI generation direction | Code → spec (FastAPI auto-generation) | Spec → code (contract-first with codegen) | Team is Python-first; FastAPI's generation is high-fidelity; contract-first adds a separate edit step without meaningful quality gain at Phase 1 scale |
| SDK strategy | Generated from spec (Phase 2) | Hand-crafted SDK | Generated SDK stays in sync with spec automatically; hand-crafted SDKs drift; generation deferred until customer demand justifies maintenance cost |
| Documentation tooling | MkDocs Material | Docusaurus, GitBook | MkDocs Material is Python-native (same toolchain as backend); mkdocs build --strict provides CI integration; no JS toolchain dependency for docs |
| ADR format | Nygard (Context/Decision/Consequences) | MADR, RFC-style | Nygard is the most widely recognised format; recognised by ESA/public-sector auditors; minimal overhead |
| AI assistance declaration | Explicit policy in §1 Vision | Silent (no declaration) | ESA and EASA increasingly require disclosure of AI tool use in safety-relevant software; proactive disclosure pre-empts audit questions and demonstrates process maturity |
§55 Multi-Tenancy, Billing & Org Management — Specialist Review
Standards basis: GDPR Art. 17/20, PCI-DSS (if card payments introduced), SaaS subscription billing conventions, PostgreSQL Row Level Security documentation, Celery priority queue documentation, ICAO Annex 11 (operator accountability)

Review scope: Data isolation, subscription tier model, usage metering, org lifecycle, API key governance, quota enforcement, queue fairness, audit log access, billing data model, data portability
55.1 Findings and Fixes Applied
F1 — No row-level tenant isolation strategy defined
Fix applied (§7.2): Comprehensive RLS policy table added covering all 8 organisation_id-carrying tables. spacecom_worker database role specified as the only BYPASSRLS principal. BLOCKING integration test specified: query as Org A session; assert zero Org B rows across all tenanted tables.
F2 — Subscription tiers and feature flags not specified
Fix applied (§16.1 new): Tier table defined (shadow_trial, ansp_operational, space_operator, institutional, internal) with per-tier MC concurrency, prediction quota, and feature access. require_tier() FastAPI dependency pattern specified. TIER_MC_CONCURRENCY dict ties limits to tier. Tier changes take immediate effect (no session cache).
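A sketch of the TIER_MC_CONCURRENCY mapping and its lookup. The tier names come from the table above; the numeric limits here are illustrative placeholders, not the values specified in §16.1:

```python
TIER_MC_CONCURRENCY = {  # illustrative limits only
    "shadow_trial": 1,
    "ansp_operational": 4,
    "space_operator": 4,
    "institutional": 8,
    "internal": 16,
}

class TierError(Exception):
    """Raised for an unknown or misconfigured tier name."""

def get_mc_concurrency_limit_by_tier(tier: str) -> int:
    """Resolve the MC concurrency limit; read per-request from the DB row
    (never a session cache) so tier changes take immediate effect."""
    try:
        return TIER_MC_CONCURRENCY[tier]
    except KeyError:
        raise TierError(f"Unknown tier: {tier}") from None
```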
F3 — Usage metering not modelled
Fix applied (§9.2): usage_events table added — append-only, immutable trigger, indexed by (organisation_id, billing_period, event_type). Billable event types: decay_prediction_run, conjunction_screen_run, report_export, api_request, mc_quota_exhausted, reentry_plan_run. Powers org admin usage dashboard and upsell trigger.
F4 — Organisation onboarding and offboarding procedures absent
Fix applied (§29.8 new): Onboarding gate checklist specified (MSA, export control, Space-Track, billing contact, org_admin user, ToS). Offboarding 8-step procedure with timing, owner, and GDPR Art. 17 vs. retention resolution. Suspension vs. churn distinction documented. docs/runbooks/org-onboarding.md designated.
F5 — API key lifecycle lacks org-level service account concept
Fix applied (§9.2 api_keys table): is_service_account column added; user_id made nullable for service account keys; service_account_name required when is_service_account = TRUE; revoked_by column added for org_admin audit trail. CHECK constraints enforce mutual exclusivity. Org admin can see and revoke all org keys via GET/DELETE /org/api-keys.
F6 — Concurrent prediction limit not persisted and not tier-linked
Fix applied (§16.1, Celery section): acquire_mc_slot now derives limit from org_tier via get_mc_concurrency_limit_by_tier(). Quota exhaustion writes usage_events row with event_type = 'mc_quota_exhausted'. Org admin usage dashboard shows hits per billing period with upgrade prompt if hits ≥ 3.
F7 — No org-level admin role
Fix applied (§7.2 RBAC table, users.role CHECK): org_admin role added between operator and admin. Permissions: manage users within own org (up to operator), manage own org's API keys, view own org's audit log, update billing contact. Cannot cross org boundaries or assign admin/org_admin without system admin.
F8 — Shared Celery queues with no per-org priority
Fix applied (Celery Queue section): TIER_TASK_PRIORITY table (3–9 by tier) with CRITICAL_EVENT_PRIORITY_BOOST = 2 when active TIP event exists. get_task_priority() function specified. Priority submitted via apply_async(priority=...). Redis noeviction policy supports native Celery priorities 0–9.
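The priority derivation above can be sketched as follows. The 3-9 band and the boost of 2 are from the fix; the per-tier values within the band are assumed for illustration, and the result is capped at Celery's native maximum of 9:

```python
TIER_TASK_PRIORITY = {  # assumed mapping within the 3-9 band
    "shadow_trial": 3,
    "space_operator": 6,
    "ansp_operational": 7,
    "institutional": 8,
    "internal": 9,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2
MAX_CELERY_PRIORITY = 9  # Redis-backed Celery supports priorities 0-9

def get_task_priority(tier: str, active_tip_event: bool) -> int:
    """Base priority from tier, boosted during an active TIP event;
    passed to apply_async(priority=...)."""
    priority = TIER_TASK_PRIORITY[tier]
    if active_tip_event:
        priority = min(priority + CRITICAL_EVENT_PRIORITY_BOOST, MAX_CELERY_PRIORITY)
    return priority
```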
F9 — No tenant-scoped audit log API
Fix applied (§14 Org Admin endpoints): GET /org/audit-log added — paginated, filtered by organisation_id, supports ?from=&to=&event_type=&user_id=. Sources security_logs and alert_events. Accessible to org_admin and admin. Required by enterprise SaaS compliance expectations.
F10 — Billing data model absent
Fix applied (§9.2): billing_contacts table (email, name, address, VAT, PO reference), subscription_periods table (immutable billing history with tier, dates, monthly fee, invoice reference). PATCH /org/billing endpoint for org_admin self-service updates. Phase 1 billing is manual; invoice_ref field accommodates future Stripe or Lago integration.
F11 — No org data export or portability mechanism
Fix applied (§14 Org Admin endpoints, §29.2): POST /org/export endpoint added — async job, delivers signed ZIP within 3 business days. Used for GDPR Art. 20 portability and offboarding. §29.2 portability row updated with endpoint reference and scope clarification (user-generated content, not derived predictions).
55.2 Sections Modified
| Section | Change |
|---|---|
| §7.2 RBAC | org_admin role added; comprehensive RLS policy table; spacecom_worker BYPASSRLS principal; users.role CHECK constraint updated |
| §9.2 api_keys | is_service_account, service_account_name, revoked_by columns; CHECK constraints; service account index |
| §9.2 (new tables) | usage_events, billing_contacts, subscription_periods |
| §14 Org Admin endpoints (new group) | 10 org_admin-scoped endpoints covering users, API keys, audit log, usage, billing, and data export |
| §14 Admin endpoints | GET /admin/organisations, POST /admin/organisations, PATCH /admin/organisations/{id} added |
| §16.1 (new) Subscription Tiers | Tier table; require_tier() pattern; TIER_MC_CONCURRENCY; tier change immediacy |
| Celery Queue section | TIER_TASK_PRIORITY priority map; CRITICAL_EVENT_PRIORITY_BOOST; get_task_priority() function |
| MC concurrency gate | acquire_mc_slot now tier-driven; quota exhaustion writes usage_events |
| §29.2 Data Subject Rights | Portability row updated with POST /org/export endpoint and scope |
| §29.8 (new) Org Onboarding/Offboarding | 6-gate onboarding checklist; 8-step offboarding procedure; suspension vs. churn distinction |
55.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| usage_events table | Billable event metering; org admin dashboard; quota exhaustion signal |
| billing_contacts table | Invoice address, VAT, PO number per org |
| subscription_periods table | Immutable billing history; Phase 2 invoice integration anchor |
| docs/runbooks/org-onboarding.md | Onboarding gate checklist; provisioning procedure |
| backend/app/modules/billing/tiers.py | get_mc_concurrency_limit_by_tier() and TIER_TASK_PRIORITY |
55.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Relying solely on application-layer WHERE organisation_id = X | RLS at database layer; application filter is defence-in-depth only |
| Role model with only system-wide admin | org_admin for self-service tenant management; admin for cross-org system operations |
| Flat API key model with no service accounts | Service account keys (user_id IS NULL) for system integrations; org admin can audit and revoke all keys |
| Sharing Celery queue with equal priority for all orgs | Priority queue by tier + active event boost prevents low-tier bulk jobs starving safety-critical work |
| No audit log access for tenants | Tenant-scoped GET /org/audit-log; required by enterprise procurement and insurance |
| Treating subscription_tier as static configuration | Tier changes must be real-time enforced; require_tier() reads from DB on each request |
55.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Tenant isolation mechanism | PostgreSQL RLS + application filter | Application filter only | RLS enforces at DB layer; a single missing WHERE clause in application code cannot leak cross-tenant data |
| Tier change immediacy | Real-time DB read on each request | Cached in JWT claim | JWT caching means downgraded orgs continue at higher tier until token expires; unacceptable for billing correctness |
| Billing integration (Phase 1) | Manual + subscription_periods table | Stripe/Lago from day 1 | Phase 1 has ≤ 5 paying customers; manual invoicing is sufficient; invoice_ref field enables future integration without schema migration |
| org_admin role scope | Cannot assign admin or org_admin without system admin approval | Full self-service role management | Self-service org_admin assignment creates privilege escalation paths; system admin as approval gate is a standard SaaS pattern |
| Service account API keys | user_id IS NULL with is_service_account = TRUE flag | Separate service_accounts table | Single api_keys table is simpler; constraints enforce consistency; avoids JOIN complexity for key lookup hot path |
§56 Testing Strategy — Specialist Review
Standards basis: pytest, pytest-cov, mutmut, k6, Playwright, openapi-typescript, freezegun, ISTQB test level definitions, ESA ECSS-E-ST-40C software testing standard

Review scope: Coverage standard, test taxonomy, test data management, frontend/API contract drift, mutation testing, performance test specification, environment parity, safety-critical labelling, WebSocket E2E, MC determinism, ESA test plan artefact
56.1 Findings and Fixes Applied
F1 — No test coverage standard defined
Fix applied (§17.0): Coverage thresholds declared: 80% line / 70% branch for backend (pytest-cov), 75% line for frontend (Jest). Enforced via pyproject.toml --cov-fail-under. Measured on the integration run (real DB), not unit-only. Coverage artefact required in Phase 2 ESA submission.
F2 — Test level boundary undefined
Fix applied (§17.0): Three-level taxonomy defined: unit (no I/O, tests/unit/), integration (real DB + Redis, tests/integration/), E2E (full stack + browser, e2e/). Rules specify which level each category of test belongs to. Stops developers placing DB tests in tests/unit/ or mocking the database in integration tests.
F3 — Test data management strategy absent
Fix applied (§17.0): Committed JSON reference data for physics; transaction-rollback isolation for integration tests; freezegun mandate for all time-dependent tests; fictional NORAD IDs (90001–90099) and generated org names for sensitive data. Prevents flaky time-dependent failures and production-data leakage into the test repo.
F4 — No contract testing between frontend and API
Fix applied (§14): openapi-typescript generates frontend/src/types/api.generated.ts from openapi.yaml. Frontend imports only from the generated file. make check-api-types CI step fails on any drift. Replaces Pact-style consumer-driven contracts at Phase 1 scale — simpler, equally effective for a single-team project.
F5 — Mutation testing not specified
Fix applied (§17.0): mutmut runs weekly against physics/ and alerts/ modules. Threshold: ≥ 70% mutation score. Results published to CI artefacts. > 5 percentage point drop between runs creates a mutation-regression issue automatically.
F6 — Performance test specification informal
Fix applied (§27.0 new): k6 chosen as the load testing tool. Three scenarios specified: CZML catalog ramp, 200 WebSocket subscribers, decay submit constant arrival rate. SLO thresholds as k6 thresholds (test fails if breached). Baseline hardware spec documented in docs/validation/load-test-baseline.md. Results stored as JSON and trended; > 20% p95 increase creates performance-regression issue.
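The ">20% p95 increase" trend check can be sketched in a few lines of Python (the shape of `scripts/load-test-trend.py`). The JSON field name `p95_ms` is an assumption for illustration, not the actual k6 summary schema:

```python
import json

def p95_regression(prev_json, new_json, threshold=0.20):
    """Flag a regression when p95 latency grows by more than `threshold` between runs."""
    prev = json.loads(prev_json)["p95_ms"]
    new = json.loads(new_json)["p95_ms"]
    return (new - prev) / prev > threshold

# 250 ms -> 320 ms is a 28% increase: a performance-regression issue is created.
assert p95_regression('{"p95_ms": 250}', '{"p95_ms": 320}') is True
# 250 ms -> 260 ms is 4%: within tolerance, no issue.
assert p95_regression('{"p95_ms": 250}', '{"p95_ms": 260}') is False
```

The real script reads the stored JSON files from docs/validation/load-test-results/ and compares consecutive runs.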
F7 — Test environment parity unspecified
Fix applied (§17.0): docker-compose.ci.yml must use pinned image tags matching production (not latest). make test fails if TIMESCALEDB_VERSION env var does not match docker-compose.yml. MinIO used in CI (not mocked). Prevents the class of "passes in CI, fails in prod" due to minor version differences in TimescaleDB chunk behaviour.
F8 — Safety-critical tests not labelled
Fix applied (§17.0): @pytest.mark.safety_critical marker defined in conftest.py. Applied to: cross-tenant isolation, HMAC integrity, sub-150km guard, shadow segregation, and any other safety-invariant test. Separate fast CI job (pytest -m safety_critical, target < 2 min) runs on every commit before the full suite.
F9 — No E2E test for WebSocket alert delivery
Fix applied (§42.2 E2E test inventory, accessibility section): e2e/test_alert_websocket.ts added. Full path: submit prediction via API → Celery completes → CRITICAL alert appears in browser DOM via WebSocket within 60 seconds. BLOCKING. Intermittent failures are root-cause investigated, not quarantined.
F10 — Physics tests non-deterministic
Fix applied (§17.0): np.random.seed(42) autouse fixture in tests/conftest.py. seed=42 passed explicitly to all MC calls in tests. Seed value pinned; a PR changing it without updating baselines fails the review checklist. MC-based tests are now fully reproducible across machines and Python versions.
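The reproducibility property the fixture guarantees can be demonstrated directly. In the real suite the seed_rng body is wrapped in @pytest.fixture(autouse=True); the bare-function form below keeps the sketch self-contained (numpy assumed available):

```python
import random
import numpy as np

def seed_rng():
    """Body of the autouse seed_rng fixture: pin both RNG streams."""
    np.random.seed(42)
    random.seed(42)

# Same seed -> identical MC draws on any machine and Python version.
seed_rng()
first = np.random.normal(size=5)
seed_rng()
second = np.random.normal(size=5)
assert np.array_equal(first, second)
```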
F11 — No test plan document for ESA submission
Fix applied (§17.0): docs/TEST_PLAN.md structure specified with 6 sections including safety-critical traceability matrix (requirement → test ID → test name → result). This is the primary software assurance evidence document for the ESA bid. Required as a Phase 2 deliverable.
Bind mount strategy (companion fix)
Fix applied (§3.3 Docker Compose): Host bind mounts specified for logs, exports, config, and DB data. Eliminates the need for docker compose exec for all routine operations. /data/postgres and /data/minio outside the project directory to prevent accidental wipe. make init-dirs creates the host directory structure before first docker compose up. make logs SERVICE=backend convenience alias.
56.2 Sections Modified
| Section | Change |
|---|---|
| §3.3 Docker Compose | Host bind mount specification; host directory layout; make init-dirs; :ro config mounts |
| §13.8 Developer Environment Setup | make init-dirs added to bootstrap sequence |
| §17.0 (new) Test Standards and Strategy | Full test taxonomy, coverage standard, fixture isolation, freezegun, safety_critical marker, MC seed, mutation testing, env parity, docs/TEST_PLAN.md structure |
| §27.0 (new) Performance Test Specification | k6 scenarios, SLO thresholds, baseline hardware spec, result storage and trending |
| §14 API Versioning Policy | openapi-typescript contract type generation; make check-api-types CI step |
| §42.2 E2E Test Inventory | test_alert_websocket.ts added; full WebSocket delivery E2E spec |
56.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| tests/unit/, tests/integration/, e2e/ | Canonical test directory structure per taxonomy |
| e2e/test_alert_websocket.ts | WebSocket alert delivery E2E test |
| tests/conftest.py | seed_rng autouse fixture; safety_critical marker registration |
| docs/TEST_PLAN.md | ESA Phase 2 deliverable; traceability matrix |
| docs/validation/load-test-baseline.md | k6 baseline hardware and data spec |
| docs/validation/load-test-results/ | Stored k6 JSON results for trending |
| tests/load/scenarios.js | k6 scenario definitions |
| frontend/src/types/api.generated.ts | Generated TypeScript API types from openapi.yaml |
| scripts/load-test-trend.py | p95 latency trend chart generator |
56.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Mocking the database in integration tests | Transaction-rollback isolation against a real DB; mocks hide schema and RLS bugs |
| datetime.utcnow() in tests | freezegun @freeze_time decorator; tests must be time-independent |
| Non-deterministic MC tests | np.random.seed(42) autouse fixture; same seed → same output everywhere |
| Coverage measured on unit tests only | Integration run coverage includes DB-layer code; unit-only inflates the number |
| Putting safety-critical tests in the full suite only | pytest -m safety_critical fast job on every commit; never wait for the full suite to catch a safety regression |
| Performance test results not stored | JSON output committed to docs/validation/; trend script flags regressions |
56.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Frontend/API contract testing | openapi-typescript generated types + make check-api-types | Pact consumer-driven contracts | Pact requires a broker and bidirectional test setup; openapi-typescript achieves the same drift detection with a single CI command at Phase 1 team size |
| Performance test tool | k6 | Locust, Gatling | k6 is JavaScript-native (same language as frontend tests); scripting is lightweight; built-in threshold assertions; good CI integration |
| Coverage measurement scope | Integration test run | Unit test run | Unit-only coverage excludes database, Redis, and auth middleware code paths — the most likely sources of prod bugs |
| Mutation testing scope | physics/ and alerts/ only (weekly) | Full codebase (every commit) | Full-codebase mutation testing on every commit would take hours; scoping to highest-consequence modules provides meaningful signal at reasonable cost |
| Host bind mounts approach | Named directories under /opt/spacecom/ with make init-dirs | Named Docker volumes | Host bind mounts are directly accessible via SSH without docker exec; named volumes require exec or a volume driver for host access |
§57 Observability & Monitoring — Specialist Review
Hat: Observability & Monitoring
Findings reviewed: 11
Sections modified: §26.6, §26.7
Date: 2026-03-24
57.1 Findings and Fixes Applied
F1 — Prometheus metric naming convention not defined
Fix applied (§26.7 new): Naming convention table added before metric definitions. Rules: spacecom_ namespace required; unit suffix mandatory; _total for counters; high-cardinality identifiers (norad_id, organisation_id, user_id, request_id) banned from metric labels; snake_case labels only. CI make lint-metrics step validates names against the convention pattern.
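The lint check can be sketched as a small validator (the shape of the make lint-metrics step). The unit-suffix list below is illustrative, assumed from the metric names used elsewhere in this plan; the convention table in §26.7 is authoritative:

```python
import re

BANNED_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}
# Illustrative unit-suffix list; _total doubles as the counter suffix.
NAME_RE = re.compile(r"^spacecom_[a-z0-9_]+_(seconds|bytes|ratio|total|count|depth)$")

def lint_metric(name, labels, is_counter=False):
    """Return a list of convention violations (empty list = clean)."""
    errors = []
    if not NAME_RE.match(name):
        errors.append(f"{name}: needs spacecom_ prefix, snake_case, unit suffix")
    if is_counter and not name.endswith("_total"):
        errors.append(f"{name}: counters must end in _total")
    for label in set(labels) & BANNED_LABELS:
        errors.append(f"{name}: high-cardinality label {label!r} banned")
    return errors

assert lint_metric("spacecom_http_request_duration_seconds", {"route"}) == []
assert lint_metric("spacecom_alerts_sent_total", {"norad_id"}, is_counter=True) != []
```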
F2 — SLO burn rate alerting single-window only
Fix applied (§26.7): Replaced single ErrorBudgetBurnRate alert with two-alert multi-window pattern. ErrorBudgetFastBurn (1h + 5min windows, 14.4× multiplier, for: 2m) catches sudden outages. ErrorBudgetSlowBurn (6h + 1h windows, 6× multiplier, for: 15m) catches gradual degradation before the budget exhausts silently. Three recording rules added (rate1h, rate6h, rate5m).
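The two-window condition can be sketched as follows for the 99.9% SLO (error budget 0.001); the function name is illustrative, the multiplier is the one in the rule:

```python
ERROR_BUDGET = 0.001  # 99.9% availability SLO

def fast_burn_fires(err_1h, err_5m, multiplier=14.4):
    """Fires only when BOTH windows burn faster than the multiplier: the long
    window proves the burn is sustained, the short window that it is ongoing."""
    return err_1h / ERROR_BUDGET > multiplier and err_5m / ERROR_BUDGET > multiplier

# Hard outage: a 2% error ratio is a 20x burn in both windows -> page.
assert fast_burn_fires(err_1h=0.02, err_5m=0.02) is True
# Already-recovered spike: 1h window still elevated, 5m window clean -> no page.
assert fast_burn_fires(err_1h=0.02, err_5m=0.0001) is False
```

The second case is exactly what the single-window alert got wrong: it keeps paging for an hour after the incident has ended.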
F3 — Structured log schema undefined
Already substantially addressed in §2274: REQUIRED_LOG_FIELDS schema with 10 mandatory fields, sanitising processor, request_id correlation middleware, and log integrity policy. No further action required for F3 — confirmed as covered.
F4 — Distributed tracing not specified for Celery path
Fix applied (§26.7): Explicit Celery W3C traceparent propagation spec added. CeleryInstrumentor handles automatic propagation; request_id passed in task kwargs as Phase 1 fallback when OTEL_SDK_DISABLED=true. Integration test stub specified to verify trace continuity from HTTP handler through worker span.
F5 — No alerting rule coverage audit
Fix applied (§26.7 new): Alert coverage audit table added mapping every SLO and safety invariant to its alert rule. Two gaps identified: EopMirrorDisagreement alert (Phase 1 gap — metric exists, alert rule missing), DbReplicationLagHigh (Phase 2 gap — requires streaming replication). BackupJobFailed alert identified as Phase 1 gap.
F6 — High-cardinality label risk
Already addressed: norad_id label was already noted as "Grafana drill-down only; alert via recording rule" in the existing metric definition comment. F1 naming convention formalises this as an explicit prohibition with a CI-enforced lint rule. No additional edit required.
F7 — On-call dashboard not specified
Fix applied (§26.7): Operational Overview dashboard panel layout mandated. 8-panel grid with fixed row order; rows 1–2 visible without scrolling at 1080p. Each panel maps to a specific metric and threshold. Dashboard UID pinned in AlertManager dashboard_url annotations. Design criterion: answer "is the system healthy?" within 15 seconds.
F8 — Celery queue depth alerting threshold-only
Fix applied (§26.7): CelerySimulationQueueGrowing alert added using rate(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2 with for: 5m. Complements the existing threshold-based CelerySimulationQueueDeep. Growth rate alert catches a rising queue before it breaches the absolute threshold.
F9 — No DLQ monitoring
Already addressed: DLQGrowing alert (increase(spacecom_dlq_depth[10m]) > 0) and spacecom_dlq_depth metric were already specified in §26.7. F9 confirmed as covered — no further action required.
F10 — Log retention and SIEM integration not specified
Fix applied (§26.6 new): Application log retention policy table added. Container stdout: 7 days (Docker json-file). Loki: 90 days (covers incident investigation SLA). Safety-relevant log lines: 7 years (MinIO, matching database safety record retention). SIEM forwarding: per customer contract. Loki retention YAML configuration specified. Phase 1 interim: Celery Beat daily export of CRITICAL log lines to MinIO until Loki ruler is deployed.
F11 — No alerting runbook cross-reference mandate
Fix applied (§26.7): runbook_url added to WebSocketCeilingApproaching (previously missing). Mandate added: every AlertManager rule must include annotations.runbook_url pointing to an existing file in docs/runbooks/. make lint-alerts CI step enforces this using promtool check rules plus a custom script that validates the URL resolves to a real markdown file.
57.2 Sections Modified
| Section | Change |
|---|---|
| §26.6 Backup and Restore | Application log retention policy table added; Loki 90-day retention config; safety-critical log line archival to MinIO |
| §26.7 Prometheus Metrics | Metric naming convention table; multi-window burn rate recording rules and alerts; Celery trace propagation spec; queue growth rate alert; alert coverage audit table; runbook_url mandate; WebSocketCeilingApproaching runbook link added; on-call dashboard panel layout mandated |
57.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| monitoring/alertmanager/spacecom-rules.yml | Updated with multi-window burn rate alerts and queue growth alert |
| monitoring/loki-config.yml | 90-day retention configuration |
| monitoring/recording-rules.yml | Three burn rate recording rules |
| docs/runbooks/capacity-limits.md | Referenced by WebSocketCeilingApproaching; Phase 2 deliverable |
| scripts/lint-alerts.py | CI script validating runbook_url annotation on every alert rule |
| monitoring/grafana/dashboards/operational-overview.json | Codified panel layout per §26.7 on-call dashboard spec |
| tests/integration/test_tracing.py | Celery trace propagation integration test stub |
57.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Single-window burn rate alert (for: 30m) | Multi-window fast+slow burn: catches both sudden outages and slow degradations |
| norad_id or organisation_id as Prometheus label | Recording rule aggregates; high-cardinality identifiers in log fields or exemplars only |
| Alert rules without runbook_url | make lint-alerts enforces presence; a page at 3am without a runbook link adds ~5 min to MTTR |
| Threshold-only queue alerts | Complement with rate-of-growth alert; threshold fires too late on a gradually filling queue |
| On-call dashboard with no defined layout | Mandated panel order; rows 1–2 visible without scroll; 15-second health answer target |
| Application logs with no retention policy | Explicit tier policy: 7 days local, 90 days Loki, 7 years for safety-relevant lines |
57.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Burn rate multipliers | 14.4× (fast, 1h) / 6× (slow, 6h) | Custom thresholds | Google SRE Workbook standard multipliers for 99.9% SLO; well-understood by on-call engineers familiar with SRE literature |
| Loki retention | 90 days | 30 days / 1 year | 30 days is insufficient for post-incident reviews triggered by regulatory queries; 1 year is expensive for high-volume structured logs; 90 days covers all contractual and regulatory investigation windows |
| Fast burn for: duration | 2 minutes | Immediate (no for:) | Without a for: clause, a single scraped bad value pages on-call; 2 minutes filters transient scrape errors while still alerting within 5 minutes of a real outage |
| Celery trace propagation | CeleryInstrumentor + explicit request_id kwargs | OTel only | OTel-only approach breaks Phase 1 when OTEL_SDK_DISABLED=true; explicit kwargs are a zero-dependency fallback that costs nothing and ensures log correlation always works |
§58 Performance & Scalability — Specialist Review
Hat: Performance & Scalability
Findings reviewed: 11
Sections modified: §3.2, §9.4, §16 (CZML cache), §34.2 (Caddyfile), Celery config
Date: 2026-03-24
58.1 Findings and Fixes Applied
F1 — No index strategy documented beyond primary keys
Already addressed: §9.3 contains a comprehensive index specification with 10+ named indexes covering all identified hot paths: orbits (CZML generation), reentry_predictions (latest per object, partial), alert_events (unacknowledged per org, partial), jobs (queued, partial), refresh_tokens (live only, partial), PostGIS GiST indexes on all geometry columns, tle_sets (latest per object), security_logs (user+time). F1 confirmed as covered — no further action required.
F2 — PgBouncer pool size not derived from workload
Fix applied (§3.2 technology table): Derivation rationale added inline. max_client_conn=200 derived from: 2 backend × 40 async + 4 sim workers × 16 + 2 ingest × 4 = 152 peak, 200 burst headroom. default_pool_size=20 derived from max_connections=50 with 5 reserved for superuser. Validation query (SHOW pools; cl_waiting > 0 = undersized) documented.
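The derivation above, written out as arithmetic (figures taken from the §3.2 rationale):

```python
# Client side: peak concurrent connections through PgBouncer.
backend_api = 2 * 40   # 2 backend instances x 40 async connections each
sim_workers = 4 * 16   # 4 simulation workers x 16
ingest      = 2 * 4    # 2 ingest workers x 4
peak = backend_api + sim_workers + ingest
assert peak == 152

max_client_conn = 200  # 152 peak + burst headroom
assert max_client_conn > peak

# Server side: max_connections=50 minus 5 reserved for superuser leaves 45;
# default_pool_size=20 keeps the pool well inside that budget.
assert 20 <= 50 - 5
```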
F3 — N+1 query risk in catalog and alert APIs
Already addressed: §16 (CZML and API performance section) already specifies ORM loading strategies: selectinload for Event Detail and active alerts; raw SQL with explicit JOIN for CZML catalog bulk fetch (ORM overhead unacceptable at 864k rows). F3 confirmed as covered — no further action required.
F4 — Redis cache eviction policy not specified
Already addressed: §16 Redis key namespace table specifies noeviction for celery:* and redbeat:*, allkeys-lru for cache:*, volatile-lru for ws:session:*. Separate Redis DB indexes mandated. F4 confirmed as covered — no further action required.
F5 — CZML cache invalidation strategy incomplete
Fix applied (§16): Invalidation trigger table added (TLE re-ingest, propagation completion, new prediction, admin flush, cold start). Stale-while-revalidate strategy specified: stale key served immediately on primary expiry; background recompute enqueued; max stale age 5 minutes. warm_czml_cache Celery task specified for cold start and DR failover; estimated 30–60 seconds for 600 objects. Cold-start warm-up added to DR RTO calculation.
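A minimal sketch of the stale-while-revalidate decision, using an in-process dict as a stand-in for the real Redis-backed cache; the function and key names are illustrative:

```python
import time

MAX_STALE_S = 300  # 5-minute max stale age per the spec

def get_czml(cache, key, recompute_queue, now=None):
    """Serve fresh if unexpired, serve stale + enqueue recompute within the
    stale window, and treat anything older as a miss."""
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is None:
        return None                     # cold miss: warm task or sync compute
    value, expires_at = entry
    if now < expires_at:
        return value                    # fresh hit
    if now < expires_at + MAX_STALE_S:
        recompute_queue.append(key)     # serve stale, recompute in background
        return value
    return None                         # beyond max stale age: miss

cache = {"cache:czml:catalog": ("<czml-doc>", 100.0)}
queue = []
assert get_czml(cache, "cache:czml:catalog", queue, now=50.0) == "<czml-doc>"
assert get_czml(cache, "cache:czml:catalog", queue, now=200.0) == "<czml-doc>"
assert queue == ["cache:czml:catalog"]  # background recompute enqueued
assert get_czml(cache, "cache:czml:catalog", queue, now=1000.0) is None
```

The stale-served path is what prevents the 600-object cache stampede described in the decision log below.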
F6 — Celery worker_prefetch_multiplier not tuned
Fix applied (celeryconfig.py): worker_prefetch_multiplier = 1 added with rationale comment. Long MC tasks (up to 240s) with default prefetch=4 cause worker starvation. Prefetch=1 ensures fair task distribution across all available workers.
F7 — No database query plan governance
Fix applied (§9.4 PostgreSQL parameters): log_min_duration_statement: 500 and shared_preload_libraries: timescaledb,pg_stat_statements added to patroni.yml. Query plan governance process specified: weekly top-10 slow query report from pg_stat_statements; any query in top-10 for two consecutive weeks requires PR with EXPLAIN ANALYSE and index addition or documented acceptance.
F8 — Static asset delivery strategy undefined
Fix applied (§34.2 Caddyfile): Three-tier static asset strategy added. /_next/static/*: Cache-Control: public, max-age=31536000, immutable (safe — Next.js content-hashes filenames). /cesium/*: Cache-Control: public, max-age=604800 (7 days; not content-hashed). HTML routes: Cache-Control: no-store (force re-fetch after deploy). Rationale: immutable caching only safe for content-hashed assets; HTML must never be cached.
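The tier policy can be expressed as a small resolver; the header values match the spec above, while the function itself is an illustrative restatement of the Caddyfile logic:

```python
def cache_control(path):
    """Map a request path to its Cache-Control header per the three-tier policy."""
    if path.startswith("/_next/static/"):
        return "public, max-age=31536000, immutable"  # content-hashed: safe forever
    if path.startswith("/cesium/"):
        return "public, max-age=604800"               # 7 days: not content-hashed
    return "no-store"                                 # HTML: re-fetch after deploy

assert cache_control("/_next/static/chunks/main-abc123.js").endswith("immutable")
assert cache_control("/cesium/Widgets/widgets.css") == "public, max-age=604800"
assert cache_control("/events/90001") == "no-store"
```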
F9 — Horizontal scaling trigger thresholds not defined
Fix applied (§3.2 new table): Scaling trigger threshold table added covering backend CPU (>70% for 30min), WS connections (>400 sustained), simulation queue depth (>50 for 15min), MC p95 latency (>180s), DB CPU (>60% for 1h), disk usage (>70%), Redis memory (>60%). All triggers initiate a scaling review meeting, not automatic action. Decisions logged in docs/runbooks/capacity-limits.md.
F10 — TimescaleDB chunk interval not specified
Already addressed: §9.4 specifies chunk intervals for all hypertables with derivation rationale table: orbits 1 day (72h CZML window spans 3 chunks), tle_sets 1 month (compression ratio), space_weather 30 days (low write rate), adsb_states 4 hours (24h rolling window). F10 confirmed as covered — no further action required.
F11 — No query timeout or statement timeout policy
Fix applied (§9.4): ALTER ROLE spacecom_analyst SET statement_timeout = '30s' and ALTER ROLE spacecom_readonly SET statement_timeout = '30s'. Applied at role level so it persists regardless of connection source. User-facing error message specified for timeout exceeded. Operational roles excluded (they have idle_in_transaction_session_timeout as global backstop only).
58.2 Sections Modified
| Section | Change |
|---|---|
| §3.2 Service Breakdown | PgBouncer pool size derivation rationale; horizontal scaling trigger threshold table |
| §9.4 TimescaleDB Configuration | log_min_duration_statement, pg_stat_statements in patroni.yml; query plan governance process; analyst role statement_timeout; idle_in_transaction_session_timeout comment |
| §16 CZML / Cache | Invalidation trigger table; stale-while-revalidate strategy; warm_czml_cache cold-start task |
| §34.2 Caddyfile | Three-tier static asset Cache-Control strategy; HTML no-store mandate |
| celeryconfig.py | worker_prefetch_multiplier = 1 with rationale |
58.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| docs/runbooks/capacity-limits.md | Scaling decision log; WS ceiling documentation; capacity trigger thresholds |
| worker/celeryconfig.py | Updated with worker_prefetch_multiplier = 1 |
58.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Default Celery prefetch_multiplier=4 with long tasks | prefetch_multiplier=1 for MC jobs; fair distribution across workers |
| Single Redis maxmemory-policy for broker + cache | Separate DB indexes with noeviction for broker, allkeys-lru for cache |
| HTML pages with Cache-Control: public, max-age=... | no-store for HTML; immutable only for content-hashed static assets |
| Analyst queries without timeout | statement_timeout=30s at role level; prevents replica exhaustion cascading to primary |
| Monitoring slow queries without a review process | Weekly pg_stat_statements top-10 review; two-week persistence triggers mandatory PR |
| Scaling triggers defined as "when it feels slow" | Metric thresholds with sustained durations; documented decision log for audit trail |
58.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| worker_prefetch_multiplier | 1 | 4 (default) | Long MC tasks (up to 240s) make default prefetch cause severe worker imbalance; prefetch=1 adds trivial latency (one extra Redis round-trip) per task |
| Analyst timeout | 30 seconds at role level | Global statement_timeout | Global timeout would cancel legitimate long-running operations like backup restore tests and migration backfills; role-scoped is surgical |
| CZML stale-while-revalidate max age | 5 minutes | 0 (no stale) | Without stale window, TLE batch ingest (600 objects) causes 600 simultaneous cache stampedes; 5-minute stale window amortises recompute over the natural ingest cadence |
| Static asset caching | Immutable for /_next/static/, 7 days for /cesium/, no-store for HTML | Uniform TTL | Content-hash presence determines whether immutable is safe; non-uniform strategy is correct, not inconsistent |
§59 DevOps / CI-CD Pipeline — Specialist Review
Hat: DevOps / CI-CD Pipeline
Findings reviewed: 11
Sections modified: §30.2, §30.3, §30.7 (new)
Date: 2026-03-24
59.1 Findings and Fixes Applied
F1 — CI pipeline job dependency graph not specified
Fix applied (§30.7 new): Full GitHub Actions CI pipeline specified with explicit needs: ordering enforcing the dependency order: lint → (test-backend ∥ test-frontend ∥ migration-gate) → security-scan → build-and-push → deploy-staging → deploy-production. Parallel jobs where safe; sequential where correctness requires it.
F2 — No environment promotion gate between staging and production
Already addressed: §30.4 specifies the staging environment spec and data policy. The ADR at §30.6 records the decision: "production deploy requires manual approval gate after staging smoke tests pass." The new §30.7 workflow formalises this as a GitHub protected production environment with required reviewers. Confirmed as covered and formalised.
F3 — Secrets in CI not audited or rotated
Fix applied (§30.3): CI secrets register table added with 8 entries covering all pipeline secrets. Each entry specifies: environment scope, owner, rotation schedule (90–180 days), and blast radius on leak. Quarterly audit procedure using the GitHub Actions secrets inventory documented. Rotation procedure for environment-scoped protected secrets specified.
F4 — Docker image tags without immutability guarantee
Fix applied (§30.2): Production docker-compose.yml now pins images by tag@digest rather than tag alone. make update-image-digests script added to CI post-build pipeline. Container-registry retention policy table added covering 5 image categories. Lifecycle policy documented in docs/runbooks/image-lifecycle.md.
F5 — No build provenance or SBOM in CI pipeline
Fix applied (§30.7): cosign sign --yes step added to build-and-push job using Sigstore keyless signing (OIDC identity from GitLab CI). SBOM artefacts are attached to the pipeline and copied into the compliance artefact store. The deploy-time cosign verify step remains the verification gate.
F6 — Pre-commit hooks not enforced in CI
Already addressed: §30.1 explicitly states "The same hooks run locally (via pre-commit) and in CI (lint job)." The new §30.7 workflow formalises this as pre-commit run --all-files in the lint job with a dedicated cache. F6 confirmed as covered and formalised.
F7 — No automated rollback trigger
Already addressed: §26.9 blue-green deploy script (step 6) already checks spacecom:api_availability:ratio_rate5m < 0.99 after a 5-minute monitoring window and executes the Caddy upstream rollback atomically if the threshold is breached. F7 confirmed as covered.
F8 — Deployment pipeline does not check for active CRITICAL events
Fix applied (§30.7): check no active CRITICAL alert step added to both deploy-staging and deploy-production jobs. Calls GET /readyz and checks alert_gate field. "blocked" aborts the deploy with a clear error message. Emergency override requires two production-environment approvals and is logged to security_logs.
F9 — No branch protection or merge queue specification
Already addressed: §13.6 (CONTRIBUTING.md spec from §54) specifies: "No direct commits to main. All changes via pull request. main is branch-protected: 1 required approval, all status checks must pass, no force-push." The §30.7 workflow defines all required status checks (lint, test-backend, test-frontend, migration-gate, security-scan) which the branch protection rule references. F9 confirmed as covered.
F10 — Docker layer cache strategy not documented for CI
Fix applied (§30.7): Build cache strategy formalised in the build-and-push job using docker/build-push-action with cache-from: type=registry and cache-to: type=registry,mode=max targeting the GHCR buildcache tag. pip wheel cache keyed on requirements.txt hash. npm cache keyed on package-lock.json hash. Both use actions/cache@v4.
F11 — No database migration CI gate
Fix applied (§30.7 migration-gate job): Three-step gate on all PRs touching migrations/: (1) timed forward migration — fails if > 30s; (2) reverse migration alembic downgrade -1 — fails if not reversible; (3) alembic check — fails if model/migration divergence. Gate runs in parallel with test jobs to minimise critical path impact.
59.2 Sections Modified
| Section | Change |
|---|---|
| §30.2 Multi-Stage Dockerfile | Image digest pinning spec; GHCR retention policy table; make update-image-digests |
| §30.3 Environment Variable Contract | CI secrets register table; rotation schedule; quarterly audit procedure |
| §30.7 (new) GitHub Actions Workflow | Full CI YAML with needs: graph; all 8 jobs; cosign sign; migration-gate; alert gate step; environment-gated production deploy |
59.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| .github/workflows/ci.yml | Canonical CI pipeline — 8 jobs with explicit dependency graph |
| scripts/smoke-test.py | Post-deploy smoke test (already referenced in §26.9; now mandatory gate in CI) |
| scripts/update-image-digests.sh | Patches docker-compose.yml with tag@digest after each build |
| docs/runbooks/image-lifecycle.md | GHCR retention policy; lifecycle policy config procedure |
| docs/runbooks/detect-secrets-update.md | Correct baseline update procedure (already referenced in §30.1) |
59.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Jobs without needs: run in parallel by default | Explicit needs: chains; test jobs must precede build; build must precede deploy |
| Mutable image tags in production Compose | tag@digest pinning; make update-image-digests in post-build CI step |
| Long-lived CI credentials for registry push | OIDC GITHUB_TOKEN (per-job, automatic); no static GHCR_TOKEN secret needed |
| Signing at deploy-time only (cosign verify) | Sign at build-time (cosign sign); verify at deploy; both steps required for supply chain integrity |
| Deploying during active CRITICAL alert | alert_gate check in CI deploy steps; emergency override requires two approvals and is logged |
| Migrations tested only by running them forward | Three-step gate: forward (timed) + reverse (reversibility) + alembic check (model sync) |
59.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| OIDC for GHCR auth | GITHUB_TOKEN OIDC (per-job) | Static GHCR_TOKEN secret | Static tokens don't expire; OIDC tokens are per-job and cannot be reused outside the workflow |
| cosign keyless signing | Sigstore keyless (OIDC identity) | Private key signing | Keyless signing ties the signature to the GitHub Actions OIDC identity; no long-lived private key to rotate or leak |
| Alert gate scope | Blocks CRITICAL and HIGH unacknowledged alerts from non-internal orgs | All alerts | Internal test org alerts should not block production operations; unacknowledged = operator hasn't seen it yet |
| migration-gate triggers | Only on PRs touching migrations/ | Every PR | Running alembic upgrade head on every PR adds 60–90 seconds to CI for PRs that don't touch the schema; path filter reduces cost |
§60 Human Factors / Operational UX — Specialist Review
Hat: Human Factors / Operational UX
Findings reviewed: 11
Sections modified: §28.1, §28.3, §28.5a, §28.6, §28.9 (new)
Date: 2026-03-24
60.1 Findings and Fixes Applied
F1 — No alarm management philosophy documented
Fix applied (§28.3): EEMUA 191 / ISA-18.2 alarm management KPI table added with 5 quantitative targets: alarm rate (< 1/10min), nuisance rate (< 1%), stale CRITICAL (0 unacknowledged > 10min), alarm flood threshold (< 10 CRITICAL in 10min), chattering alarms (0). Measured quarterly by Persona D; included in ESA compliance artefact package.
F2 — Alarm flood scenario not bounded
Fix applied (§28.3): Batch TIP flood protocol added. Triggers at >= 5 new TIP messages in 5 minutes. Protocol: highest-priority object gets CRITICAL banner; objects 2-N are suppressed; single HIGH "Batch TIP event: N objects" summary fires; per-object alerts queue at <= 1/min after 5-minute operator grace period. batch_tip_event record type added to alert_events. Thresholds configurable per-org within safety bounds.
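A sketch of the flood suppression decision, using the fictional NORAD ID range (90001–90099) reserved for test data; the message shape and priority field are illustrative, the thresholds match the protocol above:

```python
FLOOD_THRESHOLD = 5  # >= 5 new TIP messages within the 5-minute window

def handle_tip_batch(tips):
    """tips: [(norad_id, priority)] in one window; higher priority = more urgent.
    Returns (alerts_to_emit, suppressed_ids_queued_for_later)."""
    if len(tips) < FLOOD_THRESHOLD:
        return [("ALERT", norad) for norad, _ in tips], []   # normal per-object path
    ranked = sorted(tips, key=lambda t: t[1], reverse=True)
    suppressed = [norad for norad, _ in ranked[1:]]          # queued <= 1/min post-grace
    emitted = [
        ("CRITICAL", ranked[0][0]),                          # highest-priority banner
        ("HIGH", f"Batch TIP event: {len(tips)} objects"),   # single summary alert
    ]
    return emitted, suppressed

emitted, suppressed = handle_tip_batch(
    [(90001, 3), (90002, 9), (90003, 1), (90004, 5), (90005, 2)]
)
assert emitted == [("CRITICAL", 90002), ("HIGH", "Batch TIP event: 5 objects")]
assert suppressed == [90004, 90001, 90005, 90003]  # queued, not dropped
```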
F3 — Mode confusion risk unmitigated
Already addressed: §28.2 specifies six mode error prevention mechanisms including persistent mode indicator, mode-switch confirmation dialog with consequence statements, temporal wash for future-preview, simulation disable during active events, audio suppression in non-LIVE modes, and simulation record segregation. F3 confirmed as covered.
F4 — Handover workflow does not account for SA transfer
Fix applied (§28.5a): Structured SA transfer prompt table added. Five prompts mapping to Endsley SA levels: active objects (L1 perception), operator assessment (L2 comprehension), expected development (L3 projection), actions taken (decision context), and handover flags (situational context). Prompts are optional but completion rate tracked as HF KPI. Non-blocking warning on submission without completion.
F5 — Acknowledgement does not distinguish seen from assessed
Already addressed: §28.5 structured acknowledgement categories distinguish MONITORING (seen, no action) from NOTAM_ISSUED, COORDINATING, ESCALATING (assessed and acted). The category taxonomy maps directly to perception vs. comprehension+projection. F5 confirmed as covered.
F6 — No specification for decision prompt content
Fix applied (§28.6): DecisionPrompt TypeScript interface specified with four mandatory fields: risk_summary (<= 20 words, no jargon), action_options (role-specific), time_available (decision window before FIR intersection), consequence_note (optional). Example instance for re-entry/FIR scenario provided. Pre-authored prompt library in docs/decision-prompts/; annual ANSP SME review required.
F7 — Globe information hierarchy not specified
Fix applied (§28.1): Seven-level visual information hierarchy table added with mandatory rendering order. Priority 1 (CRITICAL object): flashing red octagon + always-visible label. Priority 2 (HIGH): amber triangle. Down to Priority 7 (ambient objects): white dots on hover only. Rule: no lower-priority element may be visually more prominent than a higher-priority element. Non-negotiable safety requirement — overrides CesiumJS performance optimisations that reorder draw calls.
F8 — No fatigue or cognitive load accommodation
Fix applied (§28.3): Server-side fatigue monitoring rules added. Four triggers: CRITICAL unacknowledged > 10 min — supervisor push+email; HIGH unacknowledged > 30 min — supervisor push; inactivity during active event (45 min) — operator+supervisor push; session age > shift_duration_hours — non-blocking operator reminder. All notifications logged to security_logs. Escalates to SpaceCom internal ops if no supervisor role configured.
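A sketch of the server-side trigger evaluation; the state field names are illustrative, the thresholds match the four rules above:

```python
def fatigue_escalations(state):
    """Evaluate the four fatigue/escalation triggers against current state.
    Returns (recipient, channel) pairs for the notifications to send."""
    actions = []
    if state["critical_unacked_min"] > 10:
        actions.append(("supervisor", "push+email"))
    if state["high_unacked_min"] > 30:
        actions.append(("supervisor", "push"))
    if state["active_event"] and state["inactivity_min"] >= 45:
        actions.append(("operator+supervisor", "push"))
    if state["session_hours"] > state["shift_duration_hours"]:
        actions.append(("operator", "reminder"))  # non-blocking
    return actions

state = {"critical_unacked_min": 12, "high_unacked_min": 0,
         "active_event": True, "inactivity_min": 50,
         "session_hours": 9, "shift_duration_hours": 8}
assert fatigue_escalations(state) == [
    ("supervisor", "push+email"),
    ("operator+supervisor", "push"),
    ("operator", "reminder"),
]
```

Per the fix, every emitted notification is also written to security_logs, and escalation falls through to SpaceCom internal ops when no supervisor role is configured.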
F9 — Degraded mode display not actionable
Already addressed: §28.8 (Degraded-Data Human Factors) specifies per-degradation-type visual indicators with operator action required. §1315 specifies operational guidance text per degradation type. Acceptance criteria (§6056) requires an integration test for each type. F9 confirmed as covered.
F10 — No operator training specification
Fix applied (§28.9 new): Full operator training programme specified. Six modules (M1-M6), 8 hours total minimum. M2 reference scenario defined. Recurrency requirements: annual 2-hour refresher + scenario repeat. operator_training_records schema added. GET /api/v1/admin/training-status endpoint added. Training material ownership and annual review cycle defined.
F11 — Audio alert design not fully specified
Fix applied (§28.3): Audio spec expanded with EUROCAE ED-26 / RTCA DO-256 advisory alert compliance. Tones specified: 261 Hz (C4) + 392 Hz (G4), 250ms each with 20ms fade. Re-alert on missed acknowledgement: replays once at 3 minutes; no further audio beyond second play (supervisor notification handles further escalation). Volume floor in ops room mode: minimum 40%. Per-session mute resets on next login.
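The re-alert rule above (one initial play, a single replay at 3 minutes, nothing after the second play) can be sketched as a small decision function. The function name and signature are illustrative, not from the plan.

```typescript
// Sketch of the §28.3 re-alert rule. Audio plays once on alert, replays once
// at 3 minutes if still unacknowledged, and never plays a third time
// (supervisor notification handles further escalation).
function shouldPlayAudio(
  minutesSinceAlert: number,
  playsSoFar: number,
  acknowledged: boolean
): boolean {
  if (acknowledged) return false;
  if (playsSoFar === 0) return true;                           // initial play
  if (playsSoFar === 1 && minutesSinceAlert >= 3) return true; // single replay
  return false;                                                // no third play
}
```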
60.2 Sections Modified
| Section | Change |
|---|---|
| §28.1 Situation Awareness | Globe visual information hierarchy table (7 levels, mandatory rendering order) |
| §28.3 Alarm Management | EEMUA 191 KPI table; batch TIP flood protocol; fatigue monitoring rules; audio spec expanded with EUROCAE ref, re-alert rule, volume floor |
| §28.5a Shift Handover | Structured SA transfer prompts (5 prompts, 3 SA levels); completion tracking |
| §28.6 Cognitive Load Reduction | Decision prompt TypeScript interface + example; pre-authored library governance |
| §28.9 (new) Operator Training | 6-module programme; reference scenario; recurrency; operator_training_records schema; API endpoint |
60.3 New Tables and Files
| Artefact | Purpose |
|---|---|
| operator_training_records | Training completion records per user/module |
| docs/training/ | Training module content directory |
| docs/training/reference-scenario-01.md | Standardised M2 reference scenario |
| docs/decision-prompts/ | Pre-authored decision prompt library (per scenario type) |
| GET /api/v1/admin/training-status | Org-admin view of operator training completion |
60.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Single "data may be delayed" degraded banner | Per-degradation-type badges with operator action required; graded response rules |
| Free-text only handover notes | Structured SA transfer prompts + notes; prompts tracked as HF KPI |
| Audio alert that loops indefinitely | Plays once; re-alerts once at 3 min; further escalation is supervisor notification, not more audio |
| Acknowledgement with 10-character text minimum | Structured category selection — captures intent, not just compliance |
| Unlimited alarm rate during batch TIP events | Batch flood protocol: suppress objects 2-N, queue at <= 1/min after grace period |
| Globe with equal visual weight for all elements | 7-level mandatory hierarchy; safety-critical objects pre-attentively distinct at all zoom levels |
60.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Alarm KPI standard | EEMUA 191 adapted for ATC | Process-control standard verbatim | EEMUA 191 is process-control oriented; ATC operations have different alarm rate expectations; adaptation noted explicitly |
| Re-alert timing | Once at 3 minutes | Continuous loop / never re-alert | Loop causes habituation; never re-alerting risks missed CRITICAL in a noisy environment; single replay at 3 min is the minimum effective prompt |
| SA transfer prompts | Optional with completion tracking | Mandatory (blocks handover submission) | Mandatory completion under time pressure produces checkbox compliance, not genuine SA transfer; optional + tracked provides accountability without creating a safety-defeating blocker |
| Operator training blocking | Flag but not block access | Auto-block untrained users | ANSP retains operational responsibility; SpaceCom cannot unilaterally block a certified ATC professional; flag + report gives ANSP the information to manage their own training compliance |
§61 Aviation & Space Regulatory Compliance — Specialist Review
61.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No formal safety case structure — argument/evidence/claims framework absent | High | §24.12 — Safety case with GSN argument structure, evidence nodes, and claims added; docs/safety/SAFETY_CASE.md |
| 2 | SAL assignment under ED-153/DO-278A not documented — no formal assurance level per component | High | §24.13 — SAL assignment table: SAL-2 for physics, alerts, HMAC, CZML; SAL-3 for auth and ingest; docs/safety/SAL_ASSIGNMENT.md |
| 3 | Hazard log lacked structured format — no ID, cause/effect decomposition, risk level, or governance | Medium | §24.4 — Hazard register restructured with 7 hazards (HZ-001 to HZ-007), structured fields, governance rules, and EUROCAE ED-153 risk matrix |
| 4 | Safety occurrence reporting procedure lacked formal structure — ANSP notification, evidence preservation, and regulatory notification flow not defined | High | §26.8a — Full safety occurrence reporting procedure with trigger conditions, 8-step response, SQL table, and clear negative scope |
| 5 | ICAO data quality mapping incomplete — Completeness attribute absent; no formal data category and classification fields in API response | Medium | §24.3 — Completeness attribute added; formal ICAO data category/classification fields specified; accuracy characterisation as Phase 3 gate |
| 6 | Verification independence not specified — no CODEOWNERS, PR review rule, or traceability for SAL-2 components | High | §17.6 — CODEOWNERS for SAL-2 paths, 2-reviewer requirement, qualification criteria, traceability to safety case evidence |
| 7 | No configuration management policy for safety-critical artefacts — source files, safety documents, and validation data not formally under CM | High | §30.8 — CM policy covering 10 artefact types, release tagging script, signed commits, deployment register, CODEOWNERS for docs/safety/ |
| 8 | Means of Compliance document not planned — no mapping from regulatory requirement to implementation evidence | Medium | §24.14 — MoC document structure with 7 initial MOC entries, status tracking, and Phase 2/3 gates |
| 9 | Post-deployment safety monitoring programme absent — no ongoing accuracy monitoring, safety KPIs, or model version monitoring | High | §26.10 — Four-component programme: prediction accuracy monitoring, safety KPI dashboard, quarterly safety review, model version monitoring |
| 10 | ANSP-side obligations not documented — SpaceCom's safety argument assumes ANSP actions that are never formally communicated | Medium | §24.15 — ANSP obligations table by category; SMS guide document; liability assignment note linking to safety case |
| 11 | Regulatory sandbox liability not formally characterised — who bears liability during trial, what insurance is required, sandbox ≠ approval | Medium | §24.2 — Sandbox liability provisions: no operational reliance clause, indemnification cap, insurance requirement, regulatory notification duty, explicit statement that sandbox ≠ regulatory approval |
Already addressed — no further action required:
- NOTAM interface and disclaimer (§24.5 — covered in prior sessions)
- Space law retention obligations (§24.6 — 7-year retention already specified)
- EU AI Act compliance obligations (§24.10 — fully covered including Art. 14 human oversight statement)
- Regulatory correspondence register (§24.11 — covered)
61.2 Sections Modified
| Section | Change |
|---|---|
| §24.2 Liability and Operational Status | Regulatory sandbox liability provisions (F11): no operational reliance clause, indemnification cap, insurance requirement, sandbox ≠ approval statement |
| §24.3 ICAO Data Quality Mapping | Completeness attribute added (F5); formal ICAO data category and classification table; accuracy characterisation Phase 3 gate |
| §24.4 Safety Management System Integration | Hazard register fully restructured (F3): 7 hazards with IDs, cause/effect, risk levels, governance; system safety classification updated to reference §24.13 SAL assignment |
| §24.11 (after) | New §24.12 Safety Case Framework (F1); §24.13 SAL Assignment (F2); §24.14 Means of Compliance (F8); §24.15 ANSP-Side Obligations (F10) |
| §17.5 (after) | New §17.6 Verification Independence (F6): CODEOWNERS, 2-reviewer rule, qualification criteria, traceability |
| §26.8 Incident Response runbooks | Safety occurrence runbook pointer updated; §26.8a Safety Occurrence Reporting full procedure added (F4) |
| §26.9 (after) | New §26.10 Post-Deployment Safety Monitoring Programme (F9): accuracy monitoring, safety KPI dashboard, quarterly review, model version monitoring |
| §30.7 (after) | New §30.8 Configuration Management of Safety-Critical Artefacts (F7): CM policy table, release tagging, signed commits, deployment register |
61.3 New Documents and Tables
| Artefact | Purpose |
|---|---|
| docs/safety/SAFETY_CASE.md | GSN-structured safety case; living document; version-controlled |
| docs/safety/SAL_ASSIGNMENT.md | Software Assurance Level per component; review triggers |
| docs/safety/HAZARD_LOG.md | Structured hazard log (HZ-001 to HZ-007 and future additions) |
| docs/safety/MEANS_OF_COMPLIANCE.md | Regulatory requirement → implementation evidence mapping |
| docs/safety/ANSP_SMS_GUIDE.md | ANSP obligations and SMS integration guide |
| docs/safety/CM_POLICY.md | Configuration management policy for safety artefacts |
| docs/safety/VERIFICATION_INDEPENDENCE.md | Verification independence policy for SAL-2 components |
| docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md | Quarterly safety review output template |
| legal/SANDBOX_AGREEMENT_TEMPLATE.md | Standard regulatory sandbox letter of understanding |
| legal/ANSP_DEPLOYMENT_REGISTER.md | Configuration baseline per ANSP deployment |
| docs/validation/ACCURACY_CHARACTERISATION.md | Phase 3: formal accuracy statement (ICAO Annex 15) |
| safety_occurrences SQL table | Dedicated log for safety occurrences with full audit fields |
| monitoring/dashboards/safety-kpis.json | Grafana dashboard: 6 safety KPIs with alert thresholds |
| .github/CODEOWNERS additions | SAL-2 source paths + docs/safety/ require custodian review |
61.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| "Advisory only" UI label as sole liability protection | Legal instruments required: MSA, AUP, legal opinion; label is not contractual protection |
| Hazard log as a table of symptoms with no cause/effect structure | Structured hazard log with ID, cause, effect, mitigations, risk level, status — enables safety case argument |
| No distinction between safety occurrence and operational incident | Safety occurrences require a separate response chain (legal counsel, ANSP regulatory notification); conflating with incidents creates regulatory exposure |
| Verification by the author of safety-critical code | SAL-2 requires independent verification — CODEOWNERS enforcement is the implementation mechanism |
| Safety documents outside version control | All safety artefacts are Git-tracked; changes require custodian sign-off via CODEOWNERS; release tags capture safety snapshots |
| Sandbox trial treated as implicit regulatory approval | Explicit language required: sandbox ≠ approval; the ANSP cannot represent a trial as regulatory acceptance |
| Post-deployment safety monitoring as "we'll look at incidents when they happen" | Proactive programme: quarterly review, prediction accuracy tracking, model version monitoring — demonstrates ongoing safe operation |
61.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Safety case notation | Goal Structuring Notation (GSN) | ASCE text-only format | GSN is the standard for DO-178C and ED-153 safety cases; accepted by EASA and ESA reviewers; tooling (Astah, Visio, ArgoSAFETY) exists for formal diagrams when Phase 3 requires it |
| SAL-2 for physics and alerts | SAL-2 (not SAL-1) | SAL-1 (highest) | SAL-1 implies formal methods / formal proofs — disproportionate for decision support software where the ANSP retains authority; SAL-2 balances rigour with development practicality |
| Safety occurrence trigger scope | 4 specific trigger conditions | Any anomaly during operational use | Over-broad triggers desensitise the process; under-broad triggers miss real occurrences; 4 conditions map directly to the identified hazards |
| Post-deployment monitoring cadence | Quarterly safety review | Monthly review / ad hoc | Quarterly balances administrative overhead with meaningful trend data; monthly creates review fatigue for a small team; ad hoc provides no assurance |
| Configuration management of safety documents | Git + CODEOWNERS + release attachments | Dedicated safety management tool | Git is already the source of truth; CODEOWNERS provides access control; release attachments are the simplest artefact preservation mechanism without introducing a new tool |
§62 Geospatial / Mapping Engineering — Specialist Review
62.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No authoritative CRS contract document — frame transitions at each boundary were scattered across multiple sections with no single reference | Medium | §4.4 — CRS boundary table added; docs/COORDINATE_SYSTEMS.md defined as Phase 1 deliverable; antimeridian and pole handling specified |
| 2 | SRID not enforced by CHECK constraint — column type declares SRID 4326 but application code can insert SRID-0 geometries silently | Medium | §9.3 — CHECK constraints added for reentry_predictions, hazard_zones, airspace spatial columns; migration gate lints new spatial columns |
| 3 | No spatial GiST index on corridor polygon columns | High | Already addressed — §9.3 contains GiST indexes for reentry_predictions, hazard_zones, airspace geometry columns. No further action required. |
| 4 | CZML corridor geometry uses fixed 10-minute time-step sampling — under-represents terminal phase where displacement is highest | High | §15.4 — Adaptive sampling function added: 5 min above 300 km, 2 min at 150–300 km, 30 s at 80–150 km, 10 s below 80 km; ADR required for reference polygon regeneration |
| 5 | Antimeridian and pole handling not explicitly specified | Medium | §4.4 — Antimeridian: GEOGRAPHY type confirmed; CZML serialiser must not clamp to ±180°. Polar corridors: ST_DWithin pole proximity check; clip to 89.5° max latitude with POLAR_CORRIDOR_WARNING log |
| 6 | No test verifying PostGIS corridor polygon matches CZML polygon positions | High | §15.4 — test_czml_corridor_matches_postgis_polygon integration test added; marked safety_critical; 10 km bbox agreement tolerance |
| 7 | FIR boundary data source and update policy not documented | Medium | Already addressed — §31.1.3 documents EUROCONTROL AIRAC source, 28-day update procedure, airspace_metadata table, Prometheus staleness alert, readyz integration. No further action required. |
| 8 | Globe clustering merges objects at different altitudes sharing a ground-track sub-point | Medium | §13.2 (globe clustering) — Altitude-aware clustering rule: clustering disabled for any object with re-entry window < 30 days; prevents TIP-active objects from being absorbed into catalog clusters |
| 9 | ST_Buffer distance units ambiguous — degree-based buffer on SRID 4326 geometry produces latitude-varying results | Medium | §9.3 — Correct pattern documented: project to Web Mercator for metric buffer, or use GEOGRAPHY column buffer (natively metre-aware). Wrong pattern explicitly prohibited. |
| 10 | FIR intersection missing bounding-box pre-filter in some query paths | Medium | Already addressed — §9.3 FIR intersection query with && pre-filter and explicit ::geography::geometry cast; CI linter rule added. No further action required. |
| 11 | Altitude display mixes WGS-84 ellipsoidal and MSL datums without labelling — geoid offset (−106 m to +85 m) material at re-entry terminal altitudes | High | §13.5 — Altitude datum labelling table added: orbital context → ellipsoidal; airspace context → QNH; formatAltitude(metres, context) helper; altitude_datum field in prediction API response |
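The four adaptive sampling bands from finding 4 can be sketched as a simple interval function. The band boundaries and intervals are from the resolution above; the function name is illustrative.

```typescript
// Sketch of the §15.4 adaptive ground-track sampling bands: coarse where the
// trajectory changes slowly, fine in the terminal phase where displacement
// per unit time is highest.
function sampleIntervalSeconds(altitudeKm: number): number {
  if (altitudeKm > 300) return 300; // 5 min above 300 km
  if (altitudeKm > 150) return 120; // 2 min at 150–300 km
  if (altitudeKm > 80) return 30;   // 30 s at 80–150 km
  return 10;                        // 10 s below 80 km
}
```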
62.2 Sections Modified
| Section | Change |
|---|---|
| §4.4 (new) Coordinate Reference System Contract | CRS boundary table; docs/COORDINATE_SYSTEMS.md reference; antimeridian CZML serialiser note; polar corridor ST_DWithin proximity check and 89.5° clip |
| §4.5 (renumbered from 4.4) Implementation Checklist | Added docs/COORDINATE_SYSTEMS.md deliverable |
| §9.3 Index Specification | SRID CHECK constraints for 3 spatial tables; ST_Buffer correct/wrong patterns; explicit prohibition on degree-unit buffers |
| §13.2 Globe Object Clustering | Altitude-aware clustering rule: disable for decay-relevant objects (window < 30 days) |
| §13.5 Altitude and Distance Unit Display | Altitude datum labelling table (4 contexts); formatAltitude(metres, context) helper spec; altitude_datum API field |
| §15.4 Corridor Generation Algorithm | Adaptive ground-track sampling function (4 altitude bands); ADR requirement for reference polygon regeneration; test_czml_corridor_matches_postgis_polygon integration test |
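The formatAltitude(metres, context) helper from §13.5 can be sketched for two of the four contexts the table defines. The datum mapping (orbital → ellipsoidal, airspace → QNH) is from the finding resolution; the label strings and unit choices here are assumptions.

```typescript
// Sketch of the §13.5 formatAltitude helper for two contexts; the full
// labelling table defines four. Labels and units are illustrative.
type AltitudeContext = "orbital" | "airspace";

function formatAltitude(metres: number, context: AltitudeContext): string {
  if (context === "orbital") {
    // Orbital display: WGS-84 ellipsoidal height in km, datum stated explicitly.
    return `${(metres / 1000).toFixed(1)} km (ellipsoidal)`;
  }
  // Airspace display: feet referenced to QNH, per aviation convention.
  return `${Math.round(metres * 3.28084)} ft QNH`;
}
```

Keeping the datum in the formatted string (rather than a separate field) makes it impossible to render an altitude without its datum, which is the point of the finding.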
62.3 New Documents and Files
| Artefact | Purpose |
|---|---|
| docs/COORDINATE_SYSTEMS.md | Authoritative CRS contract: frame at every system boundary |
| tests/integration/test_corridor_consistency.py | PostGIS vs CZML corridor bbox consistency test (safety_critical) |
| backend/app/utils/altitude.py | formatAltitude(metres, context) helper |
62.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| Fixed 10-minute ground track sampling across all altitudes | Adaptive sampling: coarse above 300 km, fine in terminal phase below 150 km |
| ST_Buffer(geom_4326, 0.5) — degree buffer on geographic column | ST_Buffer(ST_Transform(geom, 3857), 50000) for Mercator metric, or ST_Buffer(geom::geography, 50000) for geodetic metric |
| ST_Intersects(airspace.geometry, corridor) without explicit cast | Always ::geography::geometry cast when mixing GEOGRAPHY and GEOMETRY types; enforced by CI linter |
| Clustering all objects by screen position | Disable CesiumJS EntityCluster for decay-relevant objects; altitude is a critical dimension for orbital objects |
| Altitude labelled as km without datum | Datum is always explicit: (ellipsoidal) or QNH or MSL per context |
| SRID declared in column type only | Add CHECK constraint: CHECK (ST_SRID(geom::geometry) = 4326) — prevents SRID-0 insertion from application layer |
62.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Adaptive sampling bands | 4 bands (> 300 km / 150–300 km / 80–150 km / < 80 km) | Single fine step (30 s) everywhere | Fine step everywhere generates unnecessary data volume in the high-altitude portion where trajectory changes are slow; 4 bands give fidelity where it matters at manageable data volume |
| Antimeridian strategy | GEOGRAPHY type (spherical arithmetic) for corridors | Split polygons at ±180° | Splitting at antimeridian requires downstream consumers (CesiumJS, PostGIS) to handle multi-polygon; GEOGRAPHY avoids the split natively |
| Polar corridor clip at 89.5° | ST_DWithin + clip | Full polar treatment | True polar passages are extremely rare for the tracked object population; full treatment (azimuthal projection, pole-aware alpha-shape) is disproportionate; clip + warning is the pragmatic safe choice |
| Altitude datum labelling | Per-context datum in formatAltitude helper | Global user setting | Datum is physically determined by the altitude context (orbital = ellipsoidal; aviation = QNH), not user preference; a user setting would allow operators to view the wrong datum label |
| Corridor consistency test tolerance | 10 km (0.1°) bbox agreement | Exact match | Sub-pixel globe rendering differences make exact match impractical; 10 km is far below the display resolution at most zoom levels and well below any operationally significant discrepancy |
§63 Real-Time Systems / WebSocket Engineering — Specialist Review
63.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No message sequence numbers or ordering guarantee | High | Already addressed — seq field in event envelope; ?since_seq= reconnect replay; 200-event / 5-min ring buffer; resync_required on stale gap. No further action required. |
| 2 | No application-level delivery acknowledgement — delivered_websocket = TRUE set at send-time, not client-receipt | High | §4 WebSocket schema — alert.received / alert.receipt_confirmed round-trip for CRITICAL/HIGH; ws_receipt_confirmed column in alert_events; 10s timeout triggers email fallback |
| 3 | Fan-out architecture for multiple backend instances not specified | High | §4 WebSocket schema — Redis Pub/Sub fan-out via spacecom:alert:{org_id} channels; per-instance local connection registry; docs/adr/0020-websocket-fanout-redis-pubsub.md |
| 4 | No client-side reconnection backoff policy | High | Already addressed — src/lib/ws.ts specifies initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2. No further action required. |
| 5 | No state reconciliation protocol after reconnect | High | Already addressed — resync_required event triggers REST re-fetch; ?since_seq= replays up to 200 events. No further action required. |
| 6 | Dead WebSocket connection does not trigger ANSP fallback notification | High | §4 WebSocket schema — on_connection_closed schedules Celery task with 120s / 30s (active TIP) grace; on_reconnect revokes pending task; org primary contact emailed with TIP-aware subject line |
| 7 | No back-pressure or per-client send queue monitoring | High | §4 WebSocket schema — ConnectionManager with per-connection asyncio.Queue; circuit breaker at 50 queued events closes slow-client connection; spacecom_ws_send_queue_overflow_total counter |
| 8 | Offline clients do not see missed alerts surfaced on reconnect | Medium | §4 WebSocket schema — GET /alerts?since=<ts>&include_offline=true; received_while_offline: true annotation; localStorage last_seen_ts; amber border visual treatment in notification centre |
| 9 | Multi-tab acknowledgement not synced | Medium | Already addressed — alert.acknowledged event type in WebSocket schema broadcasts to all org connections. No further action required. |
| 10 | No per-org WebSocket connection visibility during TIP events | Medium | §4 WebSocket schema + Observability — spacecom_ws_org_connected and spacecom_ws_org_connection_count gauges; ANSPNoLiveConnectionDuringTIPEvent alert rule; on-call dashboard panel 9 |
| 11 | Caddy idle timeout silently terminates long-lived WebSocket connections | High | §26.9 Caddy configuration — idle_timeout 0 for WebSocket paths; read_timeout 0 / write_timeout 0 on WS reverse proxy transport; flush_interval -1; ping interval < proxy idle timeout rule documented |
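The reconnection backoff confirmed under finding 4 (src/lib/ws.ts: initialDelayMs=1000, maxDelayMs=30000, multiplier=2, jitter=0.2) can be sketched as a delay function. The parameters are from the finding; the exact jitter formula is not reproduced in this summary, so the symmetric ±20% interpretation below is an assumption.

```typescript
// Sketch of the src/lib/ws.ts backoff parameters: exponential growth with a
// cap and ±20% jitter (jitter formula assumed, parameters from the plan).
const initialDelayMs = 1000;
const maxDelayMs = 30000;
const multiplier = 2;
const jitter = 0.2;

function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  // Cap first, then jitter, so the cap bounds the centre of the distribution.
  const base = Math.min(initialDelayMs * Math.pow(multiplier, attempt), maxDelayMs);
  const spread = base * jitter * (2 * rand() - 1); // uniform in ±20% of base
  return Math.round(base + spread);
}
```

The injectable `rand` parameter keeps the function deterministic under test while defaulting to real jitter in production.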
63.2 Sections Modified
| Section | Change |
|---|---|
| §4 WebSocket event schema | App-level receipt ACK protocol (F2); Redis Pub/Sub fan-out spec with code (F3); dead-connection ANSP fallback (F6); ConnectionManager back-pressure with per-connection queue (F7); offline missed-alert REST endpoint and notification centre treatment (F8); per-org Prometheus gauges and ANSPNoLiveConnectionDuringTIPEvent alert rule (F10) |
| §26.9 Caddy upstream configuration | WebSocket-specific Caddyfile additions: idle_timeout 0, WS path matcher, read_timeout 0, write_timeout 0, flush_interval -1; ping interval < proxy idle timeout rule (F11) |
63.3 New Tables, Metrics, and Files
| Artefact | Purpose |
|---|---|
| alert_events.ws_receipt_confirmed | Tracks whether client confirmed receipt of CRITICAL/HIGH alerts |
| alert_events.ws_receipt_at | Timestamp of client receipt confirmation |
| spacecom_ws_send_queue_overflow_total{org_id} | Counter: WS send queue circuit breaker activations |
| spacecom_ws_org_connected{org_id, org_name} | Gauge: whether org has ≥1 active WS connection |
| spacecom_ws_org_connection_count{org_id} | Gauge: count of active WS connections per org |
| ANSPNoLiveConnectionDuringTIPEvent | Prometheus alert rule: warning when ANSP has no WS connection during active TIP |
| On-call dashboard panel 9 | ANSP Connection Status table (below fold) |
| docs/adr/0020-websocket-fanout-redis-pubsub.md | ADR: Redis Pub/Sub for cross-instance WS fan-out |
| docs/runbooks/websocket-proxy-config.md | Runbook: WS proxy timeout configuration for cloud deployments |
| docs/runbooks/ansp-connection-lost.md | Runbook: ANSP with no live connection during TIP event |
| GET /alerts?since=<ts>&include_offline=true | Missed-alert reconciliation endpoint |
63.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| delivered_websocket = TRUE set at send() time | App-level receipt ACK with 10s timeout; FALSE triggers email fallback |
| Single fan-out loop blocks on slow client | Per-connection async send queue with circuit breaker; slow client disconnected, not blocking |
| Caddy default idle timeout terminates quiet WS connections | idle_timeout 0 + read_timeout 0 on WS paths; ping interval enforced below proxy timeout |
| No distinction between "connected to SpaceCom" and "receiving alerts during TIP event" | Per-org connection gauge + ANSPNoLiveConnectionDuringTIPEvent alert distinguishes the two |
| resync_required causes silent state restoration with no visual indication | received_while_offline: true annotation + amber border in notification centre |
| Dead socket detected by ping-pong, silently closed | Grace-period Celery task schedules ANSP notification; cancelled on reconnect |
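The back-pressure rule in the table above (per-connection queue, circuit breaker at 50 queued events) can be sketched in TypeScript; note the plan specifies the server-side implementation as a Python ConnectionManager with an asyncio.Queue per connection, so the class and threshold names here are illustrative.

```typescript
// Sketch of the F7 back-pressure circuit breaker: a slow client is
// disconnected (triggering the ?since_seq= replay on reconnect) rather than
// blocking the fan-out loop or silently dropping alerts.
const MAX_QUEUED_EVENTS = 50; // breaker threshold from F7

class ClientSendQueue {
  private queue: string[] = [];
  closed = false;

  constructor(private onClose: () => void) {}

  enqueue(event: string): void {
    if (this.closed) return;
    this.queue.push(event);
    if (this.queue.length > MAX_QUEUED_EVENTS) {
      this.closed = true; // slow client: force reconnect, don't drop silently
      this.onClose();
    }
  }

  drain(): string | undefined {
    return this.queue.shift();
  }
}
```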
63.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Fan-out mechanism | Redis Pub/Sub | Sticky sessions (consistent hash) | Sticky sessions break blue-green deploys; Pub/Sub is stateless and works with any instance count |
| App-level ACK scope | CRITICAL and HIGH only | All events | Ack overhead for ingest.status and spaceweather.change is disproportionate; only safety-relevant alerts need receipt confirmation |
| Dead connection grace period | 120s normal / 30s active TIP | Immediate notification | False-positive notifications from brief network hiccups destroy operator trust in the system; grace period filters transient drops |
| Back-pressure circuit breaker | Close slow client (force reconnect) | Drop messages silently | Silently dropping alert messages is unacceptable; forced reconnect triggers the ?since_seq= replay mechanism, giving the client another chance to receive the queued events |
| Caddy WS idle timeout | 0 (no timeout) on WS paths only | Global 0 | Non-WS paths benefit from timeout protection against slow HTTP clients; WS paths require persistent connections; path-specific override is the correct scope |
§64 Data Governance & Privacy Engineering — Specialist Review
64.1 Finding Summary
| # | Finding | Severity | Resolution |
|---|---|---|---|
| 1 | No DPIA document — pre-processing obligation for high-risk processing of aviation professionals' behavioural data | High | §29.1 — Full DPIA structure added (EDPB WP248 template, 7 sections, key risk findings identified); legal/DPIA.md designated as Phase 2 gate before EU/UK ANSP shadow activation |
| 2 | Right-to-erasure conflict with 7-year safety retention unresolved | High | Already addressed — §29.3 documents pseudonymisation procedure; Art. 17(3)(b) exemption explicitly invoked. No further action required. |
| 3 | IP addresses stored full-resolution for 7 years — no necessity assessment, no minimisation policy | High | §29.1 — IP retention updated to 90 days full / hash retained for longer period; hash_old_ip_addresses Celery task specified; necessity assessment documented |
| 4 | No Record of Processing Activities (RoPA) document | Medium | Already addressed — §29.1 contains the RoPA table with all required Art. 30 fields; legal/ROPA.md designated as authoritative. No further action required. |
| 5 | Cross-border transfer mechanisms not documented per jurisdiction pair | Medium | Already addressed — §29.5 documents EU default hosting, SCCs for cross-border transfers, Australian APP8, data residency policy in legal/DATA_RESIDENCY.md. No further action required. |
| 6 | Handover notes and acknowledgement text retained as-written indefinitely — free-text personal references not pseudonymised | Medium | §29.3 — pseudonymise_old_freetext Celery task added; 2-year operational retention window; text replaced with [text pseudonymised after operational retention window] |
| 7 | No DSAR procedure or SLA — endpoint exists but no documented process | High | §29.4a — Full DSAR procedure added: 7-step runbook, 30-day SLA, 60-day extension provision, legal/DSAR_LOG.md, export scope defined, exemptions documented |
| 8 | Audit log mixes personal data and integrity records — single table, conflicting retention obligations | High | §29.9 — integrity_audit_log table split out for non-personal operational records (7-year retention); security_logs constrained to user-action types with CHECK; migration plan specified |
| 9 | No formal sub-processor register — sub-processor details scattered across multiple documents | Medium | §29.4 — legal/SUB_PROCESSORS.md register added with 5 sub-processors, transfer mechanism, DPA status; customer notification obligation documented |
| 10 | operator_training_records has no retention or pseudonymisation policy | Medium | §28.9 — Retention policy: active + 2 years post-deletion; user_tombstone column; pseudonymisation task extended to cover training records |
| 11 | ToS acceptance implies consent is the universal lawful basis — incorrect and creates compliance exposure | High | §29.10 — Lawful basis mapping table added (5 processing activities); clarification that ToS acceptance evidences consent only for specific acknowledgements; Privacy Notice requirement restated |
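The IP retention rule from finding 3 (90 days full resolution, hash retained thereafter) can be sketched as a decision-and-hash step; the plan implements this as the hash_old_ip_addresses Celery task, so the TypeScript function, signature, and salt handling below are illustrative assumptions.

```typescript
// Sketch of the F3 retention rule: raw IP kept for the 90-day investigation
// window, keyed hash thereafter (preserves audit linkability without
// storing the raw address). Salt sourcing is an assumption.
import { createHmac } from "node:crypto";

const FULL_IP_RETENTION_DAYS = 90;

function retainedIpValue(ip: string, ageDays: number, salt: string): string {
  if (ageDays <= FULL_IP_RETENTION_DAYS) return ip; // within necessity window
  return createHmac("sha256", salt).update(ip).digest("hex");
}
```

A keyed hash (rather than plain SHA-256) matters here: IPv4 space is small enough to brute-force an unsalted hash, which would defeat the minimisation.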
64.2 Sections Modified
| Section | Change |
|---|---|
| §28.9 Operator Training | Training records retention policy and pseudonymisation (F10): 2-year post-deletion window; user_tombstone column; Celery task extension |
| §29.1 Data Inventory | IP address retention updated to 90-day full / hash retained (F3); hash_old_ip_addresses Celery task; IP necessity assessment; DPIA structure expanded to full EDPB WP248 template (F1) |
| §29.3 Erasure Procedure | Free-text field periodic pseudonymisation added (F6): 2-year operational window; pseudonymise_old_freetext Celery task for shift_handovers.notes_text and alert_events.action_taken |
| §29.4 Data Processing Agreements | Sub-processor register table added (F9): 5 sub-processors, locations, transfer mechanisms |
| §29.4a (new) DSAR Procedure | Full 7-step DSAR procedure with 30-day SLA, export scope, exemption documentation (F7) |
| §29.9 (new) Audit Log Separation | integrity_audit_log table split; security_logs constrained to user-action types; migration plan (F8) |
| §29.10 (new) Lawful Basis Mapping | Per-activity lawful basis table; ToS acceptance ≠ universal consent; Privacy Notice requirement (F11) |
64.3 New Documents and Tables
| Artefact | Purpose |
|---|---|
| legal/DPIA.md | Data Protection Impact Assessment (EDPB WP248 template) — Phase 2 gate |
| legal/SUB_PROCESSORS.md | Art. 28 sub-processor register with transfer mechanisms |
| legal/DSAR_LOG.md | Log of all Data Subject Access Requests received and fulfilled |
| docs/runbooks/dsar-procedure.md | Step-by-step DSAR handling runbook |
| tasks/privacy_maintenance.py | Celery tasks: hash_old_ip_addresses, pseudonymise_old_freetext (extended to training records) |
| integrity_audit_log table | Non-personal operational audit records separated from security_logs |
| operator_training_records.user_tombstone | Pseudonymisation field for post-deletion training records |
| operator_training_records.pseudonymised_at | Timestamp tracking pseudonymisation |
64.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|---|---|
| DPIA treated as optional documentation exercise | Pre-processing legal obligation; EU personal data cannot be processed without completing it first |
| Full IP address retained for 7 years "for security" | 90-day necessity window; hash retained for longer-term audit; necessity assessment documented |
| Single security_logs table for both personal data and operational integrity records | Separate tables with separate retention policies; integrity_audit_log for non-personal records |
| ToS acceptance as universal consent mechanism | Lawful basis is determined by processing purpose; most SpaceCom processing is Art. 6(1)(b) or (f), not consent |
| Sub-processor details spread across multiple documents | Single legal/SUB_PROCESSORS.md register with mandatory Art. 28(3) fields |
| Free-text operational fields retained as-written indefinitely | 2-year operational window then pseudonymisation in place; record preserved, personal reference removed |
64.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| DPIA processing category | Art. 35(3)(b) — systematic monitoring of publicly accessible area | Art. 35(3)(a) — large-scale special category data | No special category data is processed; the systematic monitoring category is the correct trigger given real-time operational pattern tracking of named aviation professionals |
| IP hashing threshold | 90 days | 30 days / 1 year | 90 days covers the active investigation window for the vast majority of security incidents; shorter is unnecessarily restrictive for legitimate investigation; longer retains more than necessary |
| Free-text pseudonymisation window | 2 years post-creation | Immediate deletion / 7-year retention as-written | 2 years covers all active PIR, investigation, and regulatory inquiry periods while removing personal references well before maximum retention; deletion would destroy operational context needed for safety record; 7-year as-written retention is disproportionate |
| Audit log split mechanism | Separate table with CHECK constraint on security_logs | Application-level routing only | Database constraint enforces the separation at ingest time; application routing alone is fragile and will be bypassed as code evolves |
| DSAR response channel | Encrypted ZIP to verified email | In-platform download only | In-platform download is unavailable after account deletion; verified email ensures identity confirmation and provides a paper trail |
Appendix §65 — Cost Engineering / FinOps Hat Review
Hat: Cost Engineering / FinOps
Reviewer focus: Infrastructure cost visibility, unit economics, per-resource attribution, cost anti-patterns, egress waste, idle resource cost
65.1 Findings and Fixes
| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | No unit economics model — impossible to reason about margin per customer tier | HIGH | §27.7 (new) | Added unit economics model with cost-to-serve breakdown and break-even analysis; reference doc docs/business/UNIT_ECONOMICS.md |
| F2 | Storage table lacked cost figures — MC blob cost invisible to planners | MEDIUM | §27.4 | Added Cloud Cost/Year column to storage table; S3-IA pricing for MC blobs; noted dominant cost driver |
| F3 | No metric tracking external API calls (Space-Track budget at risk) | MEDIUM | §27.1 | Added spacecom_ingest_api_calls_total{source} counter; alert at Space-Track 100/day approaching AUP limit |
| F4 | No per-org simulation CPU tracking — Enterprise chargeback impossible | MEDIUM | §27.1 | Added spacecom_simulation_cpu_seconds_total{org_id, norad_id} counter; monthly usage report task |
| F5 | CZML egress cost unquantified; no brotli compression mandate | LOW | §27.5 | Added CZML egress cost estimate (~$1–7/mo at Phase 2–3); brotli compression policy added |
| F6 | Celery worker idle cost not analysed — $1,120/mo regardless of usage | HIGH | §27.3 | Added idle cost analysis; scale-to-zero rejected (violates MC SLO); scale-to-1 KEDA policy for Tier 3 documented |
| F7 | No per-org email rate limit — SMTP quota at risk during flapping events | MEDIUM | §4 (WebSocket/alerts) | Added 50 emails/hour/org rate limit with digest fallback; Celery hourly digest task; cost rationale |
| F8 | Renderer always-on rationale not documented; co-location OOM risk unaddressed | LOW | §35.5 | Added on-demand analysis table; confirmed always-on at Tier 1–2; documented co-location isolation requirement |
| F9 | Backup storage cost not projected — surprise cost at Tier 3 | LOW | §27.4 | Added WAL backup cost projection; $100–200/month at Tier 3 steady state |
| F10 | No Redis memory budget — result backend accumulation can cause OOM | HIGH | §27.8 (new) | Added Redis memory budget table by purpose/DB index; maxmemory 2gb; result_expires=3600 requirement |
| F11 | No per-org cost attribution mechanism for Enterprise tier negotiations | MEDIUM | §27.1 | Added monthly usage report Celery task; per-org CPU-seconds → cost-per-run attribution |
65.2 Sections Modified
| Section | Change summary |
|---|---|
| §27.1 Workload Characterisation | Added cost-tracking Prometheus counters (F3, F4) and per-org usage report task (F11) |
| §27.3 Deployment Tiers | Added Celery worker idle cost analysis and scale-to-zero decision table (F6) |
| §27.4 Storage Growth Projections | Added Cloud Cost/Year column; storage cost summary; backup cost projection (F2, F9) |
| §27.5 Network and External Bandwidth | Added CZML egress cost estimate and brotli compression policy (F5) |
| §27.7 Unit Economics Model (new) | Full unit economics model: cost-to-serve, revenue per tier, break-even analysis (F1) |
| §27.8 Redis Memory Budget (new) | Redis memory budget by purpose; maxmemory setting; result cleanup requirement (F10) |
| §4 WebSocket / Alerts | Added per-org email rate limit (50/hr) with digest fallback; SMTP cost rationale (F7) |
| §35.5 Renderer Container Constraints | Added on-demand analysis; memory isolation rationale; co-location risk guidance (F8) |
65.3 New Files and Documents Required
| File | Purpose |
|---|---|
| docs/business/UNIT_ECONOMICS.md | Unit economics model; cost-to-serve per tier; break-even analysis; update quarterly |
| docs/infra/REDIS_SIZING.md | Redis memory budget by purpose; eviction policy decisions; sizing rationale |
| docs/business/usage_reports/{org_id}/{year}-{month}.json | Per-org monthly usage reports for Enterprise tier chargeback |
| backend/app/metrics.py (additions) | spacecom_ingest_api_calls_total and spacecom_simulation_cpu_seconds_total counters |
| backend/app/alerts/email_delivery.py | Per-org email rate limiting logic with Redis counter and digest queue |
| backend/celeryconfig.py (addition) | result_expires = 3600 to prevent Redis result backend accumulation |
65.4 Anti-Patterns Rejected
| Anti-pattern | Why rejected |
|---|---|
| Scale-to-zero simulation workers | 60–120s Chromium-style cold-start violates 10-min MC SLO; scale-to-1 minimum is the correct floor |
| Co-locating renderer with simulation workers | Chromium 2–4 GB render memory + MC worker memory = OOM on 32 GB nodes; isolated container required |
| Unbounded alert emails per org | SMTP relay quota exhausted during flapping events; 50/hr cap with digest is operationally equivalent at lower cost |
| Redis without result_expires | MC sub-task result accumulation; 500 sub-tasks × 1 MB = 500 MB peak; without expiry, accumulates across runs indefinitely |
| Single Redis noeviction policy | Blocks cache use alongside broker in same instance; DB-index split with allkeys-lru on cache DB required |
65.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Simulation worker floor | Scale-to-1 minimum at Tier 3 | Scale-to-zero | Cold-start from zero violates 10-min MC SLO; one warm worker absorbs small queues instantly |
| Email rate limit mechanism | Redis hour-window counter + Celery digest task | Database-level throttle / no limit | Redis counter is O(1) per email with sub-millisecond latency; DB throttle adds per-email DB write at high fan-out; no limit is an SMTP quota risk |
| Unit economics granularity | Per-org CPU-seconds via Prometheus | Per-request DB logging | Prometheus counter aggregation has negligible overhead; DB per-request logging at MC sub-task granularity = 500 writes/run |
| Redis maxmemory target | 2 GB (cache.r6g.large with 8 GB RAM) | 4 GB / 1 GB | 2× headroom above 700–750 MB peak estimate; leaves OS and other processes room; below 4 GB alerts before OOM |
| CZML compression priority | Brotli before gzip in Caddy encode block | gzip only | Brotli achieves 70–80% reduction vs. gzip's 60–75%; modern browsers universally support brotli; on-premise clients are always browser-based |
Appendix §66 — Open Source / Dependency Licensing Hat Review
Hat: OSS Licensing Engineer
Reviewer focus: Licence obligations for closed-source SaaS, SBOM completeness, redistribution constraints, IP risk in ESA bid context, contractor IP ownership
66.1 Findings and Fixes
| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | CesiumJS AGPLv3 commercial licence not explicitly gated as Phase 1 blocker | CRITICAL | §6 Phase 1 checklist, §29.11 (new) | Added Phase 1 blocking gate requiring cesium-commercial.pdf; dedicated §29.11 F1 section with phase-gate language |
| F2 | SBOM covered container image (syft) but not dependency manifests (pip-licenses/license-checker JSON merge) | HIGH | §26.9 CI table, §6 Phase 1 checklist, §29.11 (new) | Added manifest SBOM merge to build-and-push; docs/compliance/sbom/ as versioned store; Phase 1 gate updated |
| F3 | Space-Track AUP redistribution risk not analysed in detail for API endpoint and credential exposure | MEDIUM | §29.11 (new) | Added two-vector redistribution analysis (API exposure + credential in client-side code); confirmed detect-secrets coverage |
| F4 | poliastro LGPLv3 licence not documented; LGPL dynamic linking compliance undocumented | MEDIUM | §29.11 (new) | Added LGPL compliance assessment; legal/LGPL_COMPLIANCE.md required; standard pip install satisfies LGPL |
| F5 | TimescaleDB dual-licence (TSL vs Apache 2.0) not assessed; risk if TSL-only features adopted | MEDIUM | §29.11 (new) | Added feature-by-feature TimescaleDB licence table; confirmed SpaceCom uses only Apache 2.0 features; re-assessment gate if multi-node adopted |
| F6 | Redis SSPL adoption (7.4+) not assessed; Valkey alternative not documented | MEDIUM | §29.11 (new) | Added SSPL internal-use assessment; legal counsel confirmation required before Phase 3; Valkey/Redis 7.2 as fallback |
| F7 | Playwright/Chromium binary licence not captured in SBOM | LOW | §29.11 (new) | Confirmed Apache 2.0 (Playwright) + BSD-3 (Chromium); captured by syft container scan; no redistribution |
| F8 | Caddy enterprise plugin licence risk not noted; audit process not defined | LOW | §29.11 (new) | Added plugin licence audit requirement; PR checklist for Caddyfile changes |
| F9 | PostGIS GPLv2 linking exception not documented | LOW | §29.11 (new) | Confirmed linking exception applies to PostgreSQL extension use; legal/LGPL_COMPLIANCE.md to document |
| F10 | pip-licenses --fail-on list missing SSPL; no SSPL check on npm side | MEDIUM | §29.11 (new), §7.13 CI step | Added SSPL to Python fail-on list; SSPL added to npm failOn; exact version pinning requirement stated |
| F11 | No CLA or work-for-hire mechanism before contractor contributions | HIGH | §29.11 (new), §6 Phase 2 checklist | Added CLA template requirement (legal/CLA.md); CONTRIBUTING.md disclosure; Phase 2 gate |
66.2 Sections Modified
| Section | Change summary |
|---|---|
| §6 Phase 1 legal/compliance checklist | Added CesiumJS commercial licence as explicit blocking gate; expanded SBOM checklist item to cover manifest SBOMs; added LGPL/PostGIS and TimescaleDB/Redis licence document gates |
| §26.9 CI workflow table | Updated build-and-push job to include manifest SBOM merge and docs/compliance/sbom/ artefact storage |
| §29.11 (new) | Full OSS licence compliance section: F1–F11 covering all material dependencies |
66.3 New Files and Documents Required
| File | Purpose |
|---|---|
| legal/OSS_LICENCE_REGISTER.md | Authoritative per-dependency licence record; updated on major version changes |
| legal/LICENCES/cesium-commercial.pdf | Executed CesiumJS commercial licence — Phase 1 blocking gate |
| legal/LICENCES/timescaledb-licence-assessment.md | TimescaleDB Apache 2.0 vs. TSL feature confirmation |
| legal/LICENCES/redis-sspl-assessment.md | Redis SSPL internal-use assessment; legal counsel sign-off |
| legal/LGPL_COMPLIANCE.md | poliastro LGPL dynamic linking compliance; PostGIS GPLv2 linking exception |
| legal/CLA.md | Contributor Licence Agreement template for external contributors |
| docs/compliance/sbom/ | Versioned SBOM artefacts: syft SPDX-JSON + manifest JSONs per release |
| CONTRIBUTING.md | CLA requirement disclosure; external contributor instructions |
66.4 Anti-Patterns Rejected
| Anti-pattern | Why rejected |
|---|---|
| "CesiumJS licence can wait until Phase 2" | AGPLv3 network use provision applies from the first external demo — waiting creates retroactive non-compliance exposure in an ESA bid context |
| Excluding CesiumJS from the licence gate without a commercial licence on file | CI exclusion hides the issue; the gate is correct only when the commercial licence exists |
| Assuming LGPL dynamic linking is automatically satisfied | Must be documented; LGPL allows relinking — standard pip install satisfies this but the compliance position must be written down |
| Single Redis noeviction policy | Already rejected in §65; Redis SSPL also motivates Valkey evaluation as BSD-3 alternative |
| Assuming all TimescaleDB features are Apache 2.0 | TSL features (multi-node, data tiering) would require a Timescale commercial agreement; feature use must be tracked |
66.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| CesiumJS licence | Commercial licence from Cesium Ion; Phase 1 blocker | Open-source the frontend (comply with AGPLv3) | Source disclosure of SpaceCom's frontend is commercially unacceptable; commercial licence is the only viable path for a closed-source product |
| Redis SSPL response | Legal counsel assessment; Valkey as fallback | Immediate migration to Valkey | Internal-use assessment is likely favourable; premature migration introduces risk; assess first |
| poliastro LGPL | Document standard pip install compliance | Seek MIT-licensed alternative | Standard pip install satisfies LGPL dynamic linking; replacing poliastro would require significant re-engineering for marginal legal gain |
| SBOM format | SPDX-JSON (syft) + pip-licenses/license-checker manifests merged | CycloneDX only | SPDX is the format required by ECSS and EU Cyber Resilience Act; CycloneDX can be generated alongside if required by a specific customer |
Appendix §67 — Distributed Systems / Consistency Hat Review
Hat: Distributed Systems Engineer
Reviewer focus: Consistency guarantees, failure modes, split-brain scenarios, clock skew, ordering, idempotency, CAP trade-offs
67.1 Findings and Fixes
| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | Chord callback doesn't validate result count — partial results silently produce truncated predictions | CRITICAL | §27.2 chord section | Added result count guard in aggregate_mc_results; raises ValueError on mismatch; spacecom_mc_chord_partial_result_total counter; DLQ routing |
| F2 | No Celery autoretry_for=(OperationalError,) on DB-writing tasks — Patroni 30s failover window causes permanent task failure | HIGH | §27.6 PgBouncer section | Added autoretry_for=(OperationalError,) policy; max_retries=3, retry_backoff=5, cap 30s; applies to all DB-writing Celery tasks |
| F3 | Redis Sentinel split-brain risk not documented or assessed | MEDIUM | §26 Redis Sentinel section | Added split-brain assessment; accepted risk for ephemeral data; min-replicas-to-write 1 mitigates; ADR-0021 required |
| F4 | HMAC signing race — prediction INSERT then HMAC UPDATE creates window of unsigned prediction | HIGH | §10 HMAC section | Fixed: pre-generate UUID in application before INSERT; compute HMAC with UUID; single-phase write; migration from BIGSERIAL to UUID PK documented |
| F5 | alert_events.seq assigned via MAX(seq)+1 trigger — concurrent inserts produce duplicates | HIGH | §4 WebSocket/events section | Replaced with CREATE SEQUENCE alert_seq_global; globally monotonic; per-org ordering via WHERE org_id = $1 ORDER BY seq |
| F6 | Clock skew between server and client causes CZML ground track timing drift — no detection mechanism | MEDIUM | §4 API section | Added chronyd/timesyncd host requirement; node_timex_sync_status Grafana alert; GET /api/v1/time endpoint; client-side skew warning banner at >5s |
| F7 | MinIO multipart upload has no retry on write quorum failure — MC blob lost silently | HIGH | §27.4 storage section | Added autoretry_for=(S3Error,) with 30s backoff; MinIO ILM rule to abort incomplete multipart uploads after 24h |
| F8 | celery-redbeat double-fire on restart: only TLE ingest has ON CONFLICT DO NOTHING; space weather and IERS EOP lack upsert | MEDIUM | §11 ingest section | Added upsert patterns for all periodic ingest tables; unique constraint requirements stated |
| F9 | WebSocket fan-out cross-channel ordering — no cross-org ordering guarantee | LOW | — | Already addressed — Redis Pub/Sub ordering is per-channel (per-org); sequence numbers provide intra-org ordering. No further action required. |
| F10 | reentry_predictions FK referenced with default CASCADE — accidental simulation delete cascades to legal-hold predictions | HIGH | §9 schema | Changed all REFERENCES reentry_predictions(id) to ON DELETE RESTRICT in alert_events, prediction_outcomes, superseded_by FK |
| F11 | No distributed trace context propagation through chord sub-tasks and callback | MEDIUM | §26.9 OTel section | Added chord trace context injection/extraction pattern; verified CeleryInstrumentor for single tasks; manual propagate.inject/extract for chord callback continuity |
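F1's result-count guard in the chord callback can be sketched as follows. The Celery chord wiring, Prometheus counter increment, and DLQ routing are omitted, and the aggregation body is a placeholder assumption, not the plan's actual statistics.

```python
def aggregate_mc_results(results: list[dict], expected_count: int) -> dict:
    """Chord callback guard: refuse to aggregate a partial Monte Carlo run.

    A 400-sample result set must not be summarised as if it were the full
    500-sample run; confidence intervals and corridor widths would be wrong.
    """
    received = len(results)
    if received != expected_count:
        # Production would increment spacecom_mc_chord_partial_result_total
        # here; the raised error routes the task to the dead-letter queue.
        raise ValueError(
            f"MC chord returned {received} of {expected_count} sub-task results; "
            "refusing to aggregate a partial prediction"
        )
    # Placeholder aggregation: a real implementation computes the full
    # distribution, not just a median.
    impacts = sorted(r["impact_epoch"] for r in results)
    return {"samples": received, "median_impact_epoch": impacts[received // 2]}
```

Failing visibly (ValueError, then 500 + Retry-After at the API edge) is the behaviour chosen in the §67.5 decision log; the guard is what makes that failure deterministic.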
67.2 Sections Modified
| Section | Change summary |
|---|---|
| §27.2 MC Parallelism | Added chord result count validation in aggregate_mc_results; partial result counter |
| §27.6 DNS / PgBouncer | Added Celery autoretry_for=(OperationalError,) policy for Patroni failover window |
| §26 Redis Sentinel | Added split-brain risk assessment; min-replicas-to-write 1 config; ADR-0021 |
| §10 HMAC signing | Fixed two-phase write race: pre-generate UUID, single-phase INSERT; PK migration note |
| §4 WebSocket schema | Added alert_seq_global PostgreSQL SEQUENCE replacing MAX(seq)+1 trigger |
| §4 API / health | Added GET /api/v1/time clock skew endpoint; NTP sync requirement; client banner |
| §27.4 Storage | Added MinIO multipart upload retry; incomplete upload ILM expiry rule |
| §11 Ingest | Added upsert patterns for space_weather and IERS EOP; unique constraint requirements |
| §9 Data Model | Changed REFERENCES reentry_predictions(id) to ON DELETE RESTRICT on 3 FKs |
| §26.9 OTel/Tracing | Added chord trace context propagation pattern; propagate.inject/extract for callback |
67.3 New ADRs Required
| ADR | Decision |
|---|---|
| docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md | Accept Redis Sentinel split-brain risk for ephemeral data; min-replicas-to-write 1 mitigation; email rate limit counter inconsistency accepted as cost control gap |
67.4 Anti-Patterns Rejected
| Anti-pattern | Why rejected |
|---|---|
| MAX(seq)+1 for sequence assignment in trigger | Race condition under concurrent inserts — two transactions read same MAX and both write the same seq; PostgreSQL SEQUENCE is lock-free and gap-tolerant |
| Two-phase HMAC (INSERT then UPDATE) | Creates a window where a valid unsigned prediction exists in the DB; single-phase INSERT with pre-generated UUID eliminates the window |
| No retry on Celery DB tasks during Patroni failover | The 30s failover window is a known operational event; retries with 5s backoff cap at 30s, fitting entirely within the failover window |
| ON DELETE CASCADE on legal-hold FK references | Accidental deletion of a simulation row would cascade to 7-year-retention safety records; RESTRICT forces explicit deletion of dependents first, making accidental cascade impossible |
| Scale-to-zero with immediate cold-start | Already rejected in §65; distributed systems perspective adds: cold-start during Patroni failover + worker cold-start = double failure; always keep 1 warm worker |
67.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Chord result count validation | ValueError → DLQ → HTTP 500 + Retry-After | Silently write partial result | A 400-sample prediction is not a 500-sample prediction; confidence intervals and corridor widths are wrong; it is safer to fail visibly |
| reentry_predictions PK type | Migrate BIGSERIAL → UUID; pre-generate in application | Keep BIGSERIAL; use two-phase HMAC | UUID pre-generation eliminates the race window; UUID is also a safer choice for distributed deployments where sequence coordination between nodes is not possible |
| alert_seq assignment | Single global alert_seq_global SEQUENCE | Per-org sequences | Single sequence is simpler to manage; global monotonicity is sufficient for per-org ordering by filtering on org_id; per-org sequences require one sequence per org — complex at scale |
| Redis split-brain response | Accept risk; document in ADR | Migrate to Redis Cluster (stronger consistency) | Redis Cluster adds significant operational complexity (hash slots, resharding, client-side routing); split-brain on Sentinel with 3 nodes is rare and the affected data is ephemeral or cost-control only |
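The single-phase HMAC write chosen over the two-phase INSERT/UPDATE can be sketched as below. Field names, the canonicalisation scheme, and key handling are illustrative assumptions, not the plan's exact signing specification.

```python
import hashlib
import hmac
import json
import uuid

def build_signed_prediction(payload: dict, signing_key: bytes) -> dict:
    """Single-phase write: the row ID exists before the INSERT, so the HMAC
    can cover it and the record is never stored unsigned."""
    record = dict(payload)
    record["id"] = str(uuid.uuid4())  # pre-generated in the application, not the DB
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["hmac"] = hmac.new(signing_key, canonical.encode(), hashlib.sha256).hexdigest()
    return record  # INSERT this dict in one statement; no unsigned window exists

def verify_prediction(record: dict, signing_key: bytes) -> bool:
    """Recompute the HMAC over everything except the signature field."""
    body = {k: v for k, v in record.items() if k != "hmac"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    expected = hmac.new(signing_key, canonical.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["hmac"])
```

Because the UUID is generated in the application, the signature covers the primary key itself, which is what closes the F4 race: there is no moment at which a valid but unsigned prediction row exists.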
Appendix §68 — Commercial / Pricing Architecture Hat Review
Hat: Commercial Strategy / Pricing Architect
Reviewer focus: Pricing model design, deal structure, revenue protection, margin preservation, enterprise negotiation guardrails, commercial signals in technical architecture
68.1 Findings and Fixes
| # | Finding | Severity | Section modified | Fix applied |
|---|---|---|---|---|
| F1 | No contracts table — feature access not gated on commercial state; admin can enable Enterprise features with no contract | CRITICAL | §9 data model, §24 commercial section | Added contracts table with financial terms, feature enablement flags, discount approval constraint, PS tracking; nightly sync task |
| F2 | Usage data not surfaced to commercial team or org admins — renewal conversations lack data | HIGH | §27.7 unit economics | Added monthly usage summary emails to commercial team and org admins; send_usage_summary_emails Beat task |
| F3 | No shadow trial time limit — ANSP could remain in shadow mode indefinitely without signing production contract | HIGH | §9 organisations table | Added shadow_trial_expires_at column; enforcement via daily Celery task that auto-deactivates expired trials |
| F4 | No discount approval guard-rails — single admin can give 100% discount | MEDIUM | §9 contracts table | Added CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL) constraint; discount >20% requires named approver |
| F5 | No inbound API request counter — usage-based billing for Persona E/F impossible | MEDIUM | §27.1 metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code}; FastAPI middleware |
| F6 | On-premise deployments have no licence key enforcement — multi-instance or post-expiry use undetectable | HIGH | §34 infrastructure section | Added RSA JWT licence key mechanism; licence-expired degraded mode; hourly Celery re-validation; key rotation script |
| F7 | No contract expiry alerts — contracts expire silently; revenue risk | HIGH | §4 Celery tasks | Added check_contract_expiry Beat task at 90/30/7-day thresholds; courtesy notice to org admin at 30 days |
| F8 | Free/shadow tier has no MC simulation quota — free usage consumes paid-tier worker capacity | MEDIUM | §9 organisations table, §27.7 | Added monthly_mc_run_quota column (default 100); POST /api/v1/decay/predict quota enforcement with 429 + Retry-After |
| F9 | No MRR/ARR tracking — commercial team cannot measure revenue targets | HIGH | §9 contracts table, §27.7 | contracts.monthly_value_cents + spacecom_mrr_eur Prometheus gauge updated nightly; Grafana MRR panel |
| F10 | Professional Services not documented as a revenue line — first-year contract value underestimated | MEDIUM | §27.7 unit economics | Added PS revenue table (engagement types, values); contracts.ps_value_cents; Year 1 total contract value formula |
| F11 | Multi-ANSP coordination panel available to all tiers — high-value Enterprise feature not packaging-protected | MEDIUM | §9 organisations table | Added feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE; gated in UI by feature flag; synced from contracts.enables_multi_ansp_coordination |
68.2 Sections Modified
| Section | Change summary |
|---|---|
| §9 organisations table | Added shadow_trial_expires_at, monthly_mc_run_quota, feature_multi_ansp_coordination, licence_key, licence_expires_at columns |
| §9 (new contracts table) | Full contracts table with financial terms, discount approval constraint, feature enablement, PS tracking |
| §24 commercial section | Added contracts table spec, MRR tracking, feature sync task, discount enforcement |
| §27.1 cost-tracking metrics | Added spacecom_api_requests_total{org_id, endpoint, version, status_code} counter |
| §27.7 unit economics | Added PS revenue table; shadow trial quota enforcement code; usage summary emails |
| §34 on-premise deployment | Added RSA JWT licence key mechanism; degraded mode on expiry; key rotation process |
| §4 Celery Beat tasks | Added check_contract_expiry 90/30/7-day alert task; send_usage_summary_emails monthly task |
68.3 New Files and Documents Required
| File | Purpose |
|---|---|
| docs/business/UNIT_ECONOMICS.md | Updated with PS revenue line, Year 1 total contract value formula, MRR tracking |
| tasks/commercial/contract_expiry_alerts.py | Contract expiry Celery task (90/30/7-day thresholds) |
| tasks/commercial/send_commercial_summary.py | Monthly commercial team usage summary email |
| tasks/commercial/sync_feature_flags.py | Nightly sync of org feature flags from active contracts |
| scripts/generate_licence_key.py | RSA JWT licence key generation script (requires private key) |
| legal/contracts/ | Contract document store (MSA PDFs, signed sandbox agreements) |
68.4 Anti-Patterns Rejected
| Anti-pattern | Why rejected |
|---|---|
| Admin toggle for feature access without contract gate | Single admin can bypass commercial controls; contracts table with nightly sync is the authoritative source |
| Unlimited MC runs for free tier | Free-tier heavy users degrade paid-tier SLO by consuming simulation worker capacity; 100-run/month quota is enforceable without impacting legitimate evaluation |
| Honour-system on-premise licensing | Without a licence key, post-expiry use is undetectable and unenforceable; JWT with RSA signature provides cryptographic enforcement with no ongoing connectivity requirement |
| Silent contract expiry | Revenue loss from silent expiry is predictable and preventable; 90/30/7-day alerts are standard SaaS practice |
| Infinite shadow trial | Shadow mode is a commercial transition stage, not a permanent state; shadow_trial_expires_at enforces the commercial expectation established in the Regulatory Sandbox Agreement |
68.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|---|---|---|---|
| Feature flag sync | Nightly Celery task syncs from contracts | Real-time sync on every request | Real-time sync adds DB query per request; nightly sync is sufficient for contract-level changes which happen at most monthly |
| Licence key format | RSA-signed JWT | Database-backed licence check | JWT is verifiable offline (no network required for air-gapped deployments); RSA signature prevents forgery without access to SpaceCom private key |
| Discount approval threshold | 20% without approval; >20% requires named approver | Flat approval for all discounts | 0-20% is sales discretion; >20% represents strategic pricing requiring commercial leadership sign-off; DB constraint makes this enforceable rather than advisory |
| PS revenue tracking | contracts.ps_value_cents one-time field | Separate PS contracts table | PS is almost always bundled with the main contract at first engagement; a separate table adds complexity for marginal benefit at Phase 2–3 scale |
| MRR metric | Prometheus gauge from nightly Celery task | Real-time DB query in Grafana | Prometheus gauge is consistent with other business metrics; Grafana can scrape it without a DB connection; historical MRR trend is automatically recorded |
§69 Cross-Hat Governance and Decision Authority
This section resolves conflicts between specialist reviews. SpaceCom uses hats to surface expert constraints, not to create parallel authorities. Where hats conflict, this section defines who decides, how the decision is recorded, and which interpretation governs implementation.
69.1 Decision Authority Model
| Decision class | Primary owner | Mandatory reviewers | Tie-break principle |
|---|---|---|---|
| Product packaging, contracts, commercial entitlements | Product / Commercial owner | Legal, Engineering | Contractual and legal truth beats UI shorthand |
| Safety-critical alerting, operational UX, hazard communication | Safety case owner | Human Factors, Regulatory, Engineering | Safer operator outcome beats convenience or sales flexibility |
| Core architecture, infrastructure, CI/CD, consistency | Architecture / Platform owner | Security, SRE, DevOps | Lower operational risk and clearer failure semantics beat elegance |
| Privacy, data governance, lawful basis, retention | Legal / Privacy owner | Product, Engineering | Regulatory obligation beats implementation convenience |
| External licensing / open source / procurement artefacts | Legal / Procurement owner | Engineering, Product | Licence compliance beats delivery speed |
Any unresolved cross-hat conflict is recorded in docs/governance/CROSS_HAT_CONFLICT_REGISTER.md before implementation proceeds.
69.2 Arbitration Rules Adopted
- Commercial source of truth: `contracts` is the authoritative source for features, quotas, and deployment rights; `subscription_tier` is descriptive only.
- CI/CD platform: SpaceCom uses self-hosted GitLab. All GitHub Actions references in the plan are interpreted as GitLab CI equivalents and must be implemented in `.gitlab-ci.yml`, protected environments, and GitLab approval rules.
- Redis split by trust class: `redis_app` holds higher-integrity application state; `redis_worker` holds broker/result/cache state. Split-brain acceptance applies only to `redis_worker`.
- Commercial enforcement deferral: Licence expiry, shadow-trial expiry, and quota exhaustion must not interrupt active TIP / CRITICAL operations. Enforcement is deferred, logged, and applied after the active event closes.
- Alert escalation matrix: Progressive escalation is the default. Immediate bypass is allowed only for imminent-impact or integrity-compromise conditions formally listed in the alert definition and traced into safety artefacts.
- Renderer privilege exception: The renderer `SYS_ADMIN` capability is an approved exception, not a precedent. Any similar request from another service requires a new ADR and security review.
- Phase 0 blockers: Space-Track AUP architecture and Cesium commercial licensing are Phase 0 gates. Work that would lock in ingest or frontend architecture must not proceed before those gates are closed.
69.3 Phase 0 Governance Gates
Before Phase 1 implementation begins, the following must be complete:
- Space-Track AUP architecture decision recorded in `docs/adr/0016-space-track-aup-architecture.md`
- Cesium commercial licence executed and stored at `legal/LICENCES/cesium-commercial.pdf`
- GitLab CI/CD authority confirmed in platform docs and reflected in `.gitlab-ci.yml`
- `contracts` entitlement model and synchronisation path approved by Product, Legal, and Engineering
- Redis trust split (`redis_app` / `redis_worker`) approved by Architecture, Security, and SRE
These are architectural commitment gates, not paperwork gates. If any remain open, implementation that would cement the affected design area is blocked.
69.4 Intervention Register
| Conflict | Sections affected | Intervention | Owner | Status |
|---|---|---|---|---|
| subscription_tier vs contracts authority | §16.1, §24, §68 | contracts made authoritative; org flags become derived cache | Product / Commercial | Accepted |
| GitHub Actions vs self-hosted GitLab | §26.9, §30.4, §30.7, delivery checklists | GitLab CI/CD designated authoritative | Platform | Accepted |
| Shared Redis vs accepted split-brain risk | §3.2, §3.3, §65, §67 | Redis split into app-state and worker-state trust domains | Architecture / Security | Accepted |
| Commercial enforcement during incidents | §9, §27.7, §34, §68 | Enforcement deferred during active TIP / CRITICAL event | Product / Operations | Accepted |
| HF progressive escalation vs safety urgency | §28.3, §60, §61 | Immediate-bypass matrix added for imminent-impact and integrity events | Safety case owner | Accepted |
| Non-root/container hardening vs renderer SYS_ADMIN | §3.3, §7.11 | Renderer documented as approved exception with tighter isolation | Security / Platform | Accepted |
| Implementation starting before legal/licence blockers close | §6, §19, §21, §29.11 | Blockers moved into Phase 0 governance gates | Programme owner | Accepted |