# SpaceCom Master Development Plan

## 1. Vision

SpaceCom is a dual-domain re-entry debris hazard analysis platform that bridges the space and aviation domains. It is built by space engineers and operates as two interconnected products sharing a common physics core.

**Space Domain (upstream):** A technical analysis platform for space operators, orbital analysts, and space agencies — providing decay prediction with full uncertainty quantification, conjunction screening, controlled re-entry corridor planning, and a programmatic API layer for integration with existing space operations systems.

**Aviation Domain (downstream):** An operational decision support tool for ANSPs, airspace managers, and incident commanders — translating space domain predictions into actionable aviation safety outputs: hazard corridors, FIR intersection analysis, NOTAM drafting assistance, multi-ANSP coordination, and plain-language uncertainty communication.

SpaceCom's strategic position is the interface layer between two domains that currently do not speak the same language. The aviation safety gap is the commercial differentiator and the most underserved operational need in the market. The space domain physics depth — numerical decay prediction, atmospheric density modelling, conjunction probability, and controlled re-entry planning — is the technical credibility that distinguishes SpaceCom from aviation software vendors with bolt-on orbital mechanics.

**Positioning statement for procurement:** *"SpaceCom is the missing operational layer between space domain awareness and aviation domain action — built by space engineers, designed for the people who have to make decisions when something is coming down."*

**AI-assisted development policy (F11):** SpaceCom uses AI coding assistants (currently Claude Code) in the development workflow. `AGENTS.md` at the repository root defines the boundaries and conventions for this use. Key constraints:
- AI assistants may generate, refactor, and review code, and draft documentation
- AI assistants **may not** make autonomous decisions about safety-critical algorithm changes, authentication logic, or regulatory compliance text — all such changes require human review and an approved PR with explicit reviewer sign-off
- AI-generated code is subject to identical review and testing standards as human-authored code — there is no reduced scrutiny for AI-generated contributions
- AI assistants **must not** be given production credentials, access to live Space-Track API keys, or personal data
- For ESA procurement purposes: all code in the repository, regardless of how it was authored, is the responsibility of the named human engineers. AI assistance is a development tool, not a co-author with liability

This policy is stated explicitly because ESA and other public-sector procurement frameworks increasingly ask whether and how AI tools are used in safety-relevant software development.

---
## 2. What We Keep from the Existing Codebase

The prototype established several good foundational choices:

- **Docker Compose orchestration** — frontend, backend, and database run as isolated containers with a single `docker compose up`
- **FastAPI backend** — lightweight, async-ready Python API server; already serves CZML orbital data
- **TimescaleDB + PostGIS** — time-series hypertables for orbit data and geographic types for footprints; the `orbits` hypertable and `reentry_predictions` polygon column are well-suited to the domain
- **CesiumJS globe** — proven 3D geospatial viewer with CZML support, already rendering orbital tracks with OSM tiles
- **CZML as the orbital data interchange format** — native to Cesium, supports time-dynamic position, styling, and labels
- **Schema tables: `objects`, `orbits`, `conjunctions`, `reentry_predictions`** — solid starting point for the data model (see §9 for required expansions)
- **Worker service slot** — the architecture already anticipates background data ingestion

---
## 3. Architecture

### 3.1 Layered Design

```
┌─────────────────────────────────────────────────────┐
│ Frontend (Web)                                      │
│ Next.js + TypeScript + CesiumJS + Deck.gl           │
│ httpOnly cookies · CSP · security headers           │
├─────────────────────────────────────────────────────┤
│ TLS Termination (Caddy/Nginx)                       │
│ HTTPS + WSS only; HSTS preload                      │
├─────────────────────────────────────────────────────┤
│ API Gateway                                         │
│ FastAPI · RBAC middleware · rate limiting           │
│ JWT (RS256) · MFA enforcement · audit logging       │
├─────────────────────────────────────────────────────┤
│ Core Services                                       │
│ Hazard Engine · Event Orchestrator · CZML Builder   │
│ Frame Transform Service · Space Weather Cache       │
│ HMAC integrity signing · Alert integrity guard      │
├─────────────────────────────────────────────────────┤
│ Computational Workers (isolated network)            │
│ Celery tasks: propagation, decay, Monte Carlo       │
│ Per-job CPU time limits · resource caps             │
├─────────────────────────────────────────────────────┤
│ Report Renderer (network-isolated container)        │
│ Playwright headless · no external network access    │
├─────────────────────────────────────────────────────┤
│ Data Layer (backend_net only)                       │
│ TimescaleDB+PostGIS · Redis (AUTH+TLS)              │
│ MinIO (private buckets · pre-signed URLs)           │
└─────────────────────────────────────────────────────┘
```
### 3.2 Service Breakdown

| Service | Runtime | Responsibility | Tier 2 Spec | Tier 3 Spec |
|---------|---------|----------------|-------------|-------------|
| `frontend` | Next.js on Node 22 / Nginx static | Globe UI, dashboards, event timeline, simulation controls | 2 vCPU / 4 GB | 2× (load balanced) |
| `backend` | FastAPI on Python 3.12 | REST + WebSocket API, authentication, RBAC, request validation, CZML generation, HMAC signing | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB (blue-green) |
| `worker-sim` | Python 3.12 + Celery `--queue=simulation --concurrency=16 --pool=prefork` | MC decay prediction (chord sub-tasks), breakup, conjunction, controlled re-entry. Isolated from frontend network. | 2× 16 vCPU / 32 GB | 4× 16 vCPU / 32 GB |
| `worker-ingest` | Python 3.12 + Celery `--queue=ingest --concurrency=2` | TLE polling, space weather, DISCOS, IERS EOP. Never competes with simulation queue. | 2 vCPU / 4 GB | 2× 2 vCPU / 4 GB (celery-redbeat HA) |
| `renderer` | Python 3.12 + Playwright | PDF report generation only. No external network access. Receives sanitised data from backend via internal API call only. | 4 vCPU / 8 GB | 2× 4 vCPU / 8 GB |
| `db` | TimescaleDB (PostgreSQL 17 + PostGIS) | Persistent storage. RLS policies enforced. Append-only triggers on audit tables. | 8 vCPU / 64 GB / 1 TB NVMe | Primary + standby: 8 vCPU / 128 GB each; Patroni failover |
| `redis` | Redis 7 | Broker + cache + celery-redbeat schedule. AUTH required. TLS in production. ACL users per service. | 2 vCPU / 8 GB | Redis Sentinel: 3× 2 vCPU / 8 GB |
| `minio` | MinIO (S3-compatible) | Object storage. All buckets private. Pre-signed URLs only. | 4 vCPU / 8 GB / 4 TB | Distributed: 4× 4 vCPU / 16 GB / 2 TB NVMe |
| `etcd` | etcd 3 | Patroni DCS (distributed configuration store) for DB leader election | — | 3× 1 vCPU / 2 GB |
| `pgbouncer` | PgBouncer 1.22 | Connection pooler between all application services and TimescaleDB. Transaction-mode pooling. Prevents connection count exceeding `max_connections` at Tier 3. Gives Patroni a single point to retarget on switchover. | 1 vCPU / 1 GB | 1 vCPU / 1 GB (updated by Patroni on failover) |
| `prometheus` | Prometheus 2.x | Metrics scraping from all services; recording rules; AlertManager rules | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| `grafana` | Grafana OSS | Four dashboards (§26.7); Loki + Tempo + Prometheus datasources | 1 vCPU / 2 GB | 1 vCPU / 2 GB |
| `loki` | Grafana Loki 2.9 | Log aggregation; queried by Grafana; Promtail ships container logs | 2 vCPU / 4 GB | 2 vCPU / 8 GB |
| `promtail` | Grafana Promtail 2.9 | Scrapes Docker json-file logs; labels by service; ships to Loki | 0.5 vCPU / 512 MB | 0.5 vCPU / 512 MB |
| `tempo` | Grafana Tempo | Distributed trace backend (Phase 2); OTLP ingest; queried by Grafana | — | 2 vCPU / 4 GB |
#### Horizontal Scaling Trigger Thresholds (F9 — §58)

Tier upgrades are not automatic — SpaceCom is VPS-based and requires deliberate provisioning. The following thresholds trigger a **scaling review meeting** (not an automated action). The responsible engineer creates a tracked issue within 5 business days.

| Metric | Threshold | Sustained for | Tier transition indicated |
|--------|-----------|--------------|--------------------------|
| Backend CPU utilisation | > 70% | 30 min | Tier 1 → Tier 2 (add second backend instance) |
| `spacecom_ws_connected_clients` | > 400 | 1 hour | Tier 1 → Tier 2 (WS ceiling at 500; add second backend) |
| Celery simulation queue depth | > 50 | 15 min (no active event) | Add simulation worker instance |
| MC p95 latency | > 180s (75% of 240s SLO) | 3 consecutive runs | Add simulation worker instance |
| DB CPU utilisation | > 60% | 1 hour | Tier 2 → Tier 3 (read replica + Patroni) |
| DB disk used | > 70% of provisioned | — | Expand disk before hitting 85% |
| Redis memory used | > 60% of `maxmemory` | — | Increase `maxmemory` or add Redis instance |

Scaling decisions are recorded in `docs/runbooks/capacity-limits.md` with: metric value at decision time, decision made, provisioning timeline, and owner. This file is the authoritative capacity log for ESA and ANSP audits.
#### Redis ACL Definition

SpaceCom uses two Redis trust domains:
- `redis_app` for sessions, rate limits, WebSocket delivery state, commercial-enforcement deferrals, and other application state where stronger consistency and tighter access separation are required
- `redis_worker` for Celery broker/result traffic and ephemeral cache data, where limited inconsistency during failover is acceptable

This split is deliberate. It prevents worker-side compromise from reaching session state and avoids applying the distributed-systems split-brain risk acceptance for ephemeral workloads to user-session or entitlement-adjacent state.

Each Redis service gets its own ACL users with the minimum required key namespace:

```conf
# redis_app/acl.conf - bind-mounted into the application Redis container
# Backend: application-state access only (session tokens, rate-limit counters, WebSocket tracking)
user spacecom_backend on >${REDIS_BACKEND_PASSWORD} ~* &* +@all

# Disable unauthenticated default user
user default off
```

```conf
# redis_worker/acl.conf - bind-mounted into the worker Redis container
# Simulation worker: Celery broker/result namespaces only
user spacecom_worker on >${REDIS_WORKER_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Ingest worker: same scope as simulation worker
user spacecom_ingest on >${REDIS_INGEST_PASSWORD} ~celery* ~_kombu* ~unacked* &celery* +@all -@dangerous

# Disable unauthenticated default user
user default off
```
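
The `${…}` placeholders in the ACL files above are not expanded by `redis-server` itself, so the deploy step has to render concrete files before the containers start. A minimal stdlib sketch under that assumption (the `render_acl` helper and its paths are illustrative, not an existing script in the repo):

```python
import os
from pathlib import Path
from string import Template

def render_acl(template_path: str, output_path: str) -> None:
    """Render a Redis ACL template by substituting ${VAR} placeholders.

    redis-server does not expand environment variables inside an --aclfile,
    so a concrete file must exist before the container starts. Raises
    KeyError if a referenced variable is unset, failing closed.
    """
    text = Path(template_path).read_text()
    Path(output_path).write_text(Template(text).substitute(os.environ))
```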
Mount in `docker-compose.yml`:
```yaml
redis_app:
  volumes:
    - ./redis_app/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...

redis_worker:
  volumes:
    - ./redis_worker/acl.conf:/etc/redis/acl.conf:ro
  command: redis-server --aclfile /etc/redis/acl.conf --tls-port 6379 ...
```

Separate passwords (`REDIS_BACKEND_PASSWORD`, `REDIS_WORKER_PASSWORD`, `REDIS_INGEST_PASSWORD`) are defined in §30.3. Each rotates independently on the 90-day schedule. Redis Sentinel split-brain risk acceptance in §67 applies to `redis_worker` only; `redis_app` is treated as higher-integrity application state and is not covered by that acceptance.
### 3.3 Docker Compose Services and Network Segmentation

Services are assigned to isolated Docker networks. A compromised container on one network cannot directly reach services on another.

```yaml
networks:
  frontend_net:   # frontend → backend only
  backend_net:    # backend → db, redis, minio, pgbouncer
  worker_net:     # worker → pgbouncer, redis, minio (no backend access; pgbouncer pools DB connections)
  renderer_net:   # backend → renderer only; renderer has no external egress
  db_net:         # db, pgbouncer — never exposed to frontend_net

services:
  frontend:      { networks: [frontend_net] }
  backend:       { networks: [frontend_net, backend_net, renderer_net] }  # +renderer_net: backend calls renderer API
  worker-sim:    { networks: [worker_net] }
  worker-ingest: { networks: [worker_net] }
  renderer:      { networks: [renderer_net] }  # backend-initiated calls only; no outbound to backend_net
  db:            { networks: [backend_net, worker_net, db_net] }
  pgbouncer:     { networks: [backend_net, worker_net, db_net] }  # pooling for both backend AND workers
  redis:         { networks: [backend_net, worker_net] }
  minio:         { networks: [backend_net, worker_net] }
```

**Network topology rules:**
- Workers connect to DB via `pgbouncer:5432`, not `db:5432` directly — enforced by workers' `DATABASE_URL` env var pointing to PgBouncer.
- The backend is on `renderer_net` so it can call `renderer:8001`; the renderer cannot initiate connections to `backend_net`.
- `db_net` contains only TimescaleDB, PgBouncer, and etcd. No application service connects directly to this network except PgBouncer.

**Container resource limits** — without explicit limits a runaway simulation worker OOM-kills the database (Linux OOM killer targets the largest RSS consumer):
```yaml
services:
  backend:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
        reservations: { memory: 512M }

  worker-sim:
    deploy:
      resources:
        limits: { cpus: '16.0', memory: 32G }
        reservations: { memory: 2G }
    stop_grace_period: 300s   # allows long MC jobs to finish before SIGKILL
    command: >
      celery -A app.worker worker
      --queue=simulation
      --concurrency=16
      --pool=prefork
      --without-gossip
      --without-mingle
      --max-tasks-per-child=100
    pids_limit: 64            # prefork: 16 children + Beat + parent + overhead

  worker-ingest:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 4G }
    stop_grace_period: 60s
    pids_limit: 16

  renderer:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
    pids_limit: 100           # Chromium spawns ~5 processes per render × concurrent renders
    tmpfs:
      - /tmp/renders:size=512m,mode=1777   # PDF scratch; never written to persistent layer
    environment:
      RENDER_OUTPUT_DIR: /tmp/renders

  db:
    deploy:
      resources:
        limits: { memory: 64G }   # explicit cap; prevents OOM killer targeting db

  redis:
    deploy:
      resources:
        limits: { cpus: '2.0', memory: 8G }

  minio:
    deploy:
      resources:
        limits: { cpus: '4.0', memory: 8G }
```

Note: `deploy.resources` limits are honoured by Docker Compose v2 under the Compose Specification, without requiring Swarm mode. Verify with `docker compose version` ≥ 2.0.

All containers run as non-root users, with read-only root filesystems and dropped capabilities (see §7.10), except for the renderer container's documented `SYS_ADMIN` exception in §7.11. That exception is accepted only for the renderer, must never be copied to other services, and requires stricter network isolation and annual review.
#### Host Bind Mounts

All directories that operators need to access directly on the VPS — logs, generated exports, config, and backups — are bind-mounted from the host filesystem. This means no `docker compose exec` is required for routine operations: log tailing, reading generated files, editing config, or recovering a backup.

```yaml
services:
  backend:
    volumes:
      - ./logs/backend:/app/logs        # structured JSON logs; tail directly on host
      - ./exports:/app/exports          # org export ZIPs, report PDFs
      - ./config/backend.toml:/app/config/settings.toml:ro   # edit on host; container reads

  worker-sim:
    volumes:
      - ./logs/worker-sim:/app/logs
      - ./exports:/app/exports          # shared export directory with backend

  worker-ingest:
    volumes:
      - ./logs/worker-ingest:/app/logs

  frontend:
    volumes:
      - ./logs/frontend:/app/logs

  db:
    volumes:
      - /data/postgres:/var/lib/postgresql/data   # DB data on host disk; survives container recreation
      - ./backups/db:/backups                     # pg_basebackup output directly accessible on host

  minio:
    volumes:
      - /data/minio:/data               # object storage on host disk
```
**Host-side directory layout** (under `/opt/spacecom/`, with data directories outside it):
```
/opt/spacecom/
  logs/
    backend/        ← tail -f logs/backend/app.log
    worker-sim/
    worker-ingest/
    frontend/
  exports/          ← ls exports/ to see generated reports and org export ZIPs
  config/
    backend.toml    ← edit directly; restart backend container to apply
  backups/
    db/             ← pg_basebackup archives; rsync to offsite from here
/data/
  postgres/         ← TimescaleDB data files (outside /opt to avoid accidental compose down -v)
  minio/            ← MinIO object data
```

**Key rules:**
- `/data/postgres` and `/data/minio` live **outside** the project directory so `docker compose down -v` cannot accidentally wipe them (Compose only removes named volumes, not bind-mounted host paths, but keeping them separate is an additional safeguard)
- Log directories are created by `make init-dirs` before first `docker compose up`; containers write to them as a non-root user (UID 1000); host operator reads as the same UID or via `sudo`
- Config files are mounted `:ro` (read-only) inside the container — a misconfigured backend cannot overwrite its own config
- `make logs SERVICE=backend` is a convenience alias for `tail -f /opt/spacecom/logs/backend/app.log`
#### Port Exposure Map

| Port | Service | Exposed to | Notes |
|------|---------|------------|-------|
| 80 | Caddy | Public internet | HTTP → HTTPS redirect only |
| 443 | Caddy | Public internet | TLS termination; proxies to backend/frontend |
| 8000 | Backend API | Internal (`frontend_net`) | Never directly internet-facing |
| 3000 | Frontend (Next.js) | Internal (`frontend_net`) | Caddy proxies; HMR port 3001 dev-only |
| 5432 | TimescaleDB | Internal (`db_net`) | **Never exposed to `frontend_net` or host** |
| 6379 | Redis | Internal (`backend_net`, `worker_net`) | AUTH required; no public exposure |
| 9000 | MinIO API | Internal (`backend_net`, `worker_net`) | Pre-signed URL access only from outside |
| 9001 | MinIO Console | Internal (`db_net`) | Never exposed publicly; admin use only |
| 5555 | Flower (Celery monitor) | Internal only | VPN/bastion access only in production |
| 2379/2380 | etcd (Patroni DCS) | Internal (`db_net`) | Never exposed outside db_net |

**CI check:** `scripts/check_ports.py` — parses `docker-compose.yml` and all `docker-compose.*.yml` overrides; fails if any port from the "never-exposed" category appears in a `ports:` mapping. Runs in every CI pipeline.

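
A sketch of the shape such a check could take, assuming the compose files use the short `ports:` list syntax (the long `target:`/`published:` syntax would need separate handling; function and constant names here are illustrative, not the actual contents of `scripts/check_ports.py`):

```python
import re
import sys
from pathlib import Path

# Container ports that must never be published to the host, per the
# Port Exposure Map: TimescaleDB, Redis, MinIO Console, Flower, etcd.
NEVER_EXPOSED = {5432, 6379, 9001, 5555, 2379, 2380}

# Matches short-syntax entries such as `- "8080:80"` or `- 127.0.0.1:8080:80`
PORT_LINE = re.compile(r'^\s*-\s*"?(?:[\d.]+:)?(\d+):(\d+)')

def forbidden_exposures(compose_text: str) -> list[str]:
    """Return `ports:` entries whose *container* port is in the forbidden set."""
    violations = []
    in_ports = False
    ports_indent = 0
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped == "ports:":
            in_ports = True
            ports_indent = len(line) - len(line.lstrip())
            continue
        if in_ports:
            indent = len(line) - len(line.lstrip())
            if stripped and indent <= ports_indent:
                in_ports = False  # dedent: left the ports: block
                continue
            m = PORT_LINE.match(line)
            if m and int(m.group(2)) in NEVER_EXPOSED:
                violations.append(stripped)
    return violations

if __name__ == "__main__":
    bad = []
    for path in sorted(Path(".").glob("docker-compose*.yml")):
        bad += [f"{path}: {v}" for v in forbidden_exposures(path.read_text())]
    if bad:
        print("\n".join(bad))
        sys.exit(1)
```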
#### Infrastructure-Level Egress Filtering

Docker's built-in `iptables` rules prevent *inter-network* lateral movement but **do not** restrict egress to the public internet from within a network. An egress filtering layer is mandatory at Tier 2 and Tier 3.

**Allowed outbound destinations (whitelist):**

| Service | Allowed destination | Protocol | Purpose |
|---------|---------------------|----------|---------|
| `ingest_worker` | `www.space-track.org` | HTTPS/443 | TLE / conjunction data |
| `ingest_worker` | `services.swpc.noaa.gov` | HTTPS/443 | Space weather |
| `ingest_worker` | `discosweb.esac.esa.int` | HTTPS/443 | DISCOS object catalogue |
| `ingest_worker` | `celestrak.org` | HTTPS/443 | TLE cross-validation |
| `ingest_worker` | `iers.org` | HTTPS/443 | EOP download |
| `backend` | SMTP relay (org-internal) | SMTP/587 | Alert email |
| All containers | Internal Docker networks | Any | Normal operation |
| All containers | **All other destinations** | **Any** | **BLOCKED** |

**Implementation:** UFW or `nftables` rules on host (Tier 2); network policy + Calico/Cilium (Tier 3 Kubernetes migration); explicit allow-list in `docs/runbooks/egress-filtering.md`. Violations logged at WARN; repeated violations at CRITICAL.

---
## 4. Coordinate Frames and Time Systems

**This section is non-negotiable infrastructure.** Silent frame mismatches invalidate all downstream computation. All developers must understand and implement the conventions below before writing any propagation or display code.

### 4.1 Reference Frame Pipeline

```
TLE input
    │
    ▼  sgp4 library propagation
TEME (True Equator Mean Equinox)            ← SGP4 native output; do NOT store as final product
    │
    ▼  IAU 2006 precession-nutation (or Vallado TEME→J2000 simplification)
GCRF / J2000 (Geocentric Celestial Reference Frame)
    │          │
    │          ▼  CZML INERTIAL frame       ← CesiumJS expects GCRF/ICRF, not TEME
    │
    ▼  Earth Orientation Parameters (EOP): IERS Bulletin A/B
ITRF (International Terrestrial Reference Frame)   ← Earth-fixed; use for database storage
    │
    ▼  WGS84 geodetic transformation
Latitude / Longitude / Altitude             ← For display, hazard zones, airspace intersections
```

**Implementation:** Use `astropy` (`astropy.coordinates`, `astropy.time`) for all frame conversions. It handles IERS EOP download and interpolation automatically. For performance-critical batch conversions, pre-load EOP tables and vectorise.

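
As a concrete illustration of the final step in the pipeline (the `itrf_to_geodetic()` named in the §4.5 checklist), a self-contained WGS-84 conversion can be sketched as below. Production code should prefer the astropy path above; this fixed-point iteration is shown only to make the geometry explicit, and it is degenerate exactly at the poles:

```python
import math

# WGS-84 ellipsoid constants
WGS84_A = 6378137.0                       # semi-major axis [m]
WGS84_F = 1.0 / 298.257223563             # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)      # first eccentricity squared

def itrf_to_geodetic(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Convert an ITRF/ECEF position (metres) to WGS-84 geodetic coordinates.

    Returns (latitude_deg, longitude_deg, altitude_m). The fixed-point
    iteration converges to well below millimetre level for near-Earth
    altitudes; do not use at the poles, where p -> 0 is degenerate.
    """
    lon = math.atan2(y, x)
    p = math.hypot(x, y)
    lat = math.atan2(z, p * (1.0 - WGS84_E2))  # reduced-latitude first guess
    for _ in range(8):
        sin_lat = math.sin(lat)
        n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * sin_lat * sin_lat)
        lat = math.atan2(z + WGS84_E2 * n * sin_lat, p)
    sin_lat = math.sin(lat)
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * sin_lat * sin_lat)
    alt = p / math.cos(lat) - n
    return math.degrees(lat), math.degrees(lon), alt
```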
### 4.2 CesiumJS Frame Convention

- CZML `position` with `referenceFrame: "INERTIAL"` expects **ICRF/J2000 Cartesian** coordinates in **metres**
- SGP4 outputs are in **TEME** and must be rotated to J2000 before being written into CZML
- CZML `position` with `referenceFrame: "FIXED"` expects **ITRF Cartesian** in metres
- Never pipe raw TEME coordinates into CesiumJS

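
To make the INERTIAL convention concrete, a minimal builder for the CZML `position` property might look as follows (`referenceFrame`, `epoch`, and `cartesian` are standard CZML fields; the helper itself is an illustrative sketch that expects an aware UTC epoch and positions already rotated out of TEME):

```python
from datetime import datetime, timezone

def czml_inertial_position(epoch: datetime, times_s, gcrf_positions_m):
    """Assemble a CZML `position` property in the INERTIAL (ICRF/J2000) frame.

    times_s: per-sample offsets in seconds from `epoch`.
    gcrf_positions_m: matching (x, y, z) tuples in metres, already rotated
    TEME -> GCRF. Raw TEME state vectors must never be passed here.
    """
    cartesian = []
    for t, (x, y, z) in zip(times_s, gcrf_positions_m):
        cartesian.extend([t, x, y, z])  # CZML packs samples as [t, x, y, z, ...]
    return {
        "referenceFrame": "INERTIAL",
        "epoch": epoch.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "cartesian": cartesian,
    }
```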
### 4.3 Time System Conventions

| System | Where Used | Notes |
|--------|-----------|-------|
| **UTC** | System-wide reference. All API timestamps, database timestamps, CZML epochs | Convert immediately at ingestion boundary |
| **UT1** | Earth rotation angle for ITRF↔GCRF conversion | UT1-UTC offset from IERS EOP |
| **TT (Terrestrial Time)** | `astropy` internal; precession-nutation models | ~69 s ahead of UTC |
| **TLE epoch** | Encoded in TLE line 1 as year + day-of-year fraction | Parse to UTC immediately |
| **GPS time** | May appear in precision ephemeris products | GPS = UTC + 18 s as of 2024 |

**Rule:** Store all timestamps as `TIMESTAMPTZ` in UTC. Convert to local time only at presentation boundaries.

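
As an example of converting at the ingestion boundary, a TLE line-1 epoch field can be parsed to an aware UTC datetime with the stdlib alone (illustrative helper; the two-digit-year pivot at 57 follows the standard TLE convention):

```python
from datetime import datetime, timedelta, timezone

def tle_epoch_to_utc(epoch_field: str) -> datetime:
    """Parse a TLE line-1 epoch field (YYDDD.DDDDDDDD) to an aware UTC datetime.

    Two-digit years 57-99 map to 1957-1999 and 00-56 to 2000-2056, per the
    standard TLE convention. Day-of-year is 1-based: 1.0 is Jan 1, 00:00 UTC.
    """
    yy = int(epoch_field[:2])
    year = 1900 + yy if yy >= 57 else 2000 + yy
    day_of_year = float(epoch_field[2:])
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1.0)
```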
### 4.4 Coordinate Reference System Contract (F1 — §62)

The CRS used at every system boundary is documented in `docs/COORDINATE_SYSTEMS.md`. This is the authoritative single-page reference for any engineer writing frame conversion code.

| Boundary | CRS | Format | Notes |
|----------|-----|--------|-------|
| SGP4 output | TEME (True Equator Mean Equinox) | Cartesian metres | Must not leave `physics/` without conversion |
| Physics → CZML builder | GCRF/J2000 | Cartesian metres | Explicit `teme_to_gcrf()` call |
| CZML `position` (INERTIAL) | GCRF/J2000 | Cartesian metres | `referenceFrame: "INERTIAL"` |
| CZML `position` (FIXED) | ITRF | Cartesian metres | `referenceFrame: "FIXED"` |
| Database storage (`orbits`) | GCRF/J2000 | Cartesian metres | Consistent with CZML inertial |
| Corridor polygon (DB) | WGS-84 geographic | `GEOGRAPHY(POLYGON)` SRID 4326 | Geodetic lat/lon from ITRF→WGS-84 |
| FIR boundary (DB) | WGS-84 geographic | `GEOMETRY(POLYGON, 4326)` | Planar approx. for regional FIRs |
| API response | WGS-84 geographic | GeoJSON (EPSG:4326) | Degrees; always lon,lat order (GeoJSON spec) |
| Globe display (CesiumJS) | ICRF (= GCRF for practical purposes) | Cartesian metres via CZML | CesiumJS handles geodetic display |
| Altitude display | WGS-84 ellipsoidal | km or ft (user preference) | See §4.4a for datum labelling |

**Antimeridian and pole handling (F5 — §62):**

- **Antimeridian:** Corridor polygons stored as `GEOGRAPHY` handle antimeridian crossing correctly — PostGIS GEOGRAPHY uses spherical arithmetic and does not wrap coordinates. CesiumJS CZML polygon positions must be expressed as a continuous polyline; for antimeridian-crossing corridors, the CZML serialiser must not clamp coordinates to ±180° — pass the raw ITRF→geodetic output. CesiumJS handles coordinate wrapping internally when `referenceFrame: "FIXED"` is used for corridor polygons.
- **Polar orbits:** For objects with inclination > 80°, the ground track corridor may approach or cross the poles. `ST_AsGeoJSON` on a GEOGRAPHY polygon that passes within ~1° of a pole can produce degenerate output (longitude undefined at the pole itself). Mitigation: before storing, check `ST_DWithin(corridor, ST_GeogFromText('SRID=4326;POINT(0 90)'), 111000)` (within 1° of north pole) or south pole equivalent — if true, log a `POLAR_CORRIDOR_WARNING` and clip the polygon to 89.5° max latitude. This is a rare case (ISS incl. 51.6°; most rocket bodies are below 75° incl.) but must not crash the pipeline.

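
The polar-orbit mitigation can equally be applied application-side before the polygon reaches PostGIS. A sketch under that assumption (hypothetical `clip_polar_corridor` helper; vertices as (lon, lat) degree tuples, clipped at 89.5° per the rule above):

```python
POLE_GUARD_DEG = 0.5  # corridors within this many degrees of a pole are clipped

def clip_polar_corridor(ring):
    """Clamp ring latitudes to ±(90° − POLE_GUARD_DEG) and flag polar proximity.

    ring is a list of (lon_deg, lat_deg) vertices. Returns (clipped_ring,
    polar_warning); when polar_warning is True the caller should log
    POLAR_CORRIDOR_WARNING before handing the polygon to PostGIS.
    """
    max_lat = 90.0 - POLE_GUARD_DEG
    polar_warning = any(abs(lat) > max_lat for _, lat in ring)
    clipped = [(lon, max(-max_lat, min(max_lat, lat))) for lon, lat in ring]
    return clipped, polar_warning
```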
**`docs/COORDINATE_SYSTEMS.md`** is a Phase 1 deliverable. Tests in `tests/test_frame_utils.py` serve as executable verification of the contract.

### 4.5 Implementation Checklist

- [ ] `frame_utils.py`: `teme_to_gcrf()`, `gcrf_to_itrf()`, `itrf_to_geodetic()`
- [ ] Unit tests against Vallado 2013 reference cases
- [ ] EOP data auto-refresh: weekly Celery task pulling IERS Bulletin A; verify SHA-256 checksum of downloaded file before applying
- [ ] CZML builder uses `gcrf_to_czml_inertial()` — explicit function, never implicit conversion
- [ ] `docs/COORDINATE_SYSTEMS.md` committed with CRS boundary table

---
## 5. User Personas

All UX decisions are traceable to one of the personas defined here. Navigation, default views, information hierarchy, and alert behaviour must serve user tasks — not the system's internal module structure.

### Persona A — Operational Airspace Manager

**Role:** ANSP or aviation authority staff. Responsible for airspace safety decisions in real-time or near-real-time.

**Primary question:** "Is any airspace under my responsibility affected in the next 6–12 hours, and what do I need to do about it?"

**Key needs:** Immediate situational awareness, clear go/no-go spatial display for their region, alert acknowledgement workflow, one-click advisory export, minimal cognitive load.

**Tolerance for complexity:** Very low.

---

### Persona B — Safety Analyst

**Role:** Space agency, authority research arm, or consultancy. Conducts detailed re-entry risk assessments for regulatory submissions or post-event reports.

**Primary question:** "What is the full uncertainty envelope, what assumptions drove the prediction, and how does this compare to previous similar events?"

**Key needs:** Full simulation parameter access, run comparison, numerical uncertainty detail, full data provenance, configurable report generation, historical replay.

**Tolerance for complexity:** High.

---

### Persona C — Incident Commander

**Role:** Senior official coordinating response during an active re-entry event. Uses the platform as a shared situational awareness tool in a briefing room.

**Primary question:** "Where exactly is it coming down, when, and what is the worst-case affected area right now?"

**Key needs:** Clean large-format display, auto-narrowing corridor updates, countdown timer, plain-language status summary, shareable live-view URL.

**Tolerance for complexity:** Low.

---

### Persona D — Systems Administrator / Data Manager

**Role:** Technical operator managing system health, data ingest, model configuration, and user accounts.

**Primary question:** "Is everything ingesting correctly, are data sources healthy, and are workers keeping up?"

**Key needs:** System health dashboard, ingest job status, worker queue metrics, model version management, user and role management.

**Tolerance for complexity:** High technical tolerance.

---
### Persona E — Space Operator

**Role:** Satellite or launch vehicle operator responsible for one or more objects in the SpaceCom catalog. May be a commercial operator, a national space agency operating assets, or a launch service provider managing spent upper stages.

**Primary question:** "What is the current decay prediction for my objects, when do I need to act, and if I have manoeuvre capability, what deorbit window minimises ground risk?"

**Key needs:** Object-scoped view showing only their registered objects; decay prediction with full Monte Carlo detail; controlled re-entry corridor planner (for objects with remaining propellant); conjunction alert for their own objects; API key management for programmatic integration with their own operations centre; exportable predictions for regulatory submission under national space law.

**Tolerance for complexity:** High — these are trained orbital engineers, not ATC professionals.

**Regulatory context:** Many space operators have legal obligations under national space law (e.g., Australia's Space (Launches and Returns) Act 2018, FAA AST licensing) to demonstrate responsible end-of-life management. SpaceCom outputs serve as supporting evidence for those submissions. The platform must produce artefacts suitable for regulatory audit.

---

### Persona F — Orbital Analyst

**Role:** Technical analyst at a space agency, research institution, safety consultancy, or the SSA/STM office of a national authority. Conducts orbital analysis, validates predictions, and produces technical assessments — potentially across the full catalog, not just owned objects.

**Primary question:** "What does the full orbital picture look like for this object class, how do SpaceCom predictions compare to other tools, and what are the statistical properties of the prediction ensemble?"

**Key needs:** Full catalog read access; conjunction screening across arbitrary object pairs; simulation parameter tuning and comparison; bulk export (CSV, JSON, CCSDS formats); access to raw propagation outputs (state vectors, covariance matrices); historical validation runs; API access for batch processing.

**Tolerance for complexity:** Very high — this persona builds the technical evidence base that other personas act on.

---
|
||
|
||
## 6. UX Design Specification

This section translates engineering capability into concrete interface designs. All designs are persona-linked and phase-scheduled.

### 6.1 Information Architecture — Task-Based Navigation

Navigation is organised around user tasks, not backend modules. Module names never appear in the UI.

The platform has two navigation domains — **Aviation** (default for Persona A/B/C) and **Space** (for Persona E/F). Both are accessible from the top navigation. The root route (`/`) defaults to the domain matched to the user's role on login.

**Aviation Domain Navigation:**
```
/                   → Operational Overview (Persona A, C primary)
/watch/{norad_id}   → Object Watch Page (Persona A, B)
/events             → Active Events + Timeline (Persona A, C)
/events/{id}        → Event Detail (Persona A, B, C)
/airspace           → Airspace Impact View (Persona A)
/analysis           → Analyst Workspace (Persona B primary)
/catalog            → Object Catalog (Persona B)
/reports            → Report Management (Persona A, B)
/admin              → System Administration (Persona D)
```

**Space Domain Navigation:**
```
/space                      → Space Operator Overview (Persona E, F primary)
/space/objects              → My Objects Dashboard (Persona E — owned objects only)
/space/objects/{norad_id}   → Object Technical Detail (Persona E, F)
/space/reentry/plan         → Controlled Re-entry Planner (Persona E)
/space/conjunction          → Conjunction Screening (Persona F)
/space/analysis             → Orbital Analyst Workspace (Persona F)
/space/export               → Bulk Export (Persona F)
/space/api                  → API Keys + Documentation (Persona E, F)
```

The 3D globe is a shared component embedded within pages, not a standalone page. Different pages focus and configure the globe differently.

---

### 6.2 Operational Overview Page (`/`)

Landing page for Persona A and C. Loads immediately without configuration.

**Layout:**

```
┌─────────────────────────────────────────────────────────────────┐
│ [● LIVE] SpaceCom   [Space Weather: ELEVATED ▲]     [Alerts: 2] │
├──────────────────────────────┬──────────────────────────────────┤
│                              │ ACTIVE EVENTS                    │
│        3D GLOBE              │ ● CZ-5B R/B 44878                │
│   (active events +           │   Window: 08h – 20h from now     │
│   affected FIRs only)        │   Most likely ~14h from now      │
│                              │   YMMM FIR — HIGH                │
│                              │   [View] [Corridor]              │
│                              │ ─────────────────────────────    │
│                              │ ○ SL-16 R/B 28900                │
│                              │   Window: 54h – 90h from now     │
│                              │   Most likely ~72h from now      │
│                              │   Ocean — LOW                    │
│                              │                                  │
│                              │ 72-HOUR TIMELINE                 │
│                              │ [Gantt strip]                    │
│                              │                                  │
│                              │ SPACE WEATHER                    │
│                              │ Activity: ELEVATED               │
│                              │ Extend window: add ≥2h buffer    │
├──────────────────────────────┴──────────────────────────────────┤
│ [● Live] ──────────●────────────────────────────── +72h         │
└─────────────────────────────────────────────────────────────────┘
```

**Globe default state:** Active decay objects and their corridors only. All other objects hidden. Affected FIR boundaries highlighted. No orbital tracks unless the user expands an event card.

**Temporal uncertainty display — Persona A/C:** Event cards and the Operational Overview show window ranges in plain language (`Window: 08h – 20h from now / Most likely ~14h from now`), never `± N` notation. The `±` form implies symmetric uncertainty, which re-entry distributions are not. The Analyst Workspace (Persona B) additionally shows raw p05/p50/p95 UTC times.

---

### 6.3 Time Navigation System

Three modes — always visible, always unambiguous. Mixing modes without explicit user intent is prohibited.

| Mode | Indicator | Description |
|------|-----------|-------------|
| **LIVE** | Green pulsing pill: `● LIVE` | Current real-world state. Globe and predictions update from live feeds. |
| **REPLAY** | Amber pill: `⏪ REPLAY 2024-01-14 03:22 UTC` | Replaying a historical event. All data fixed. No live updates. |
| **SIMULATION** | Purple pill: `⚗ SIMULATION — [object name]` | Custom scenario. Data is synthetic. Must never be confused with live. |

The mode indicator is persistent in the top nav bar. Switching modes requires explicit action through a mode-switch dialogue — it cannot happen implicitly.

**Mode-switch dialogue specification:**

When the user initiates a mode switch (e.g., LIVE → SIMULATION), the following modal must appear. The dialogue must explicitly state the current mode, the target mode, and all operational consequences:

```
SWITCH TO SIMULATION MODE?
──────────────────────────────────────────────────────────────
You are currently viewing LIVE data.
Switching to SIMULATION will display synthetic scenario data.

⚠ Alerts and notifications are suppressed in SIMULATION.
⚠ Simulation data must never be used for operational decisions.
⚠ Other users will not see your simulation.

[Cancel]                             [Switch to Simulation ▶]
──────────────────────────────────────────────────────────────
```

Rules:
- Cancel on left, destructive action on right (consistent with aviation HMI conventions)
- The dialogue must always show both the current mode and target mode — never just "are you sure?"
- Equivalent dialogues apply for all mode transitions (LIVE ↔ REPLAY, LIVE ↔ SIMULATION, etc.)

**Simulation mode block during active alerts:** If the organisation has `disable_simulation_during_active_events` enabled (admin setting, default: off), the SIMULATION mode switch is blocked whenever there are unacknowledged CRITICAL or HIGH alerts. A modal replaces the switch dialogue:

```
CANNOT ENTER SIMULATION
──────────────────────────────────────────────────────────────
2 active CRITICAL alerts require acknowledgement.
Acknowledge all active alerts before running simulations.

[View active alerts]                                 [Cancel]
──────────────────────────────────────────────────────────────
```

Document `disable_simulation_during_active_events` prominently in the admin UI: *"Enable only if your organisation has a dedicated SpaceCom monitoring role separate from simulation users."*
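
The mode-switch rules above reduce to a single guard function. A minimal sketch in Python, assuming illustrative names (`Mode`, `can_switch`) rather than the shipped API; the active-alert block applies only to the SIMULATION target, per `disable_simulation_during_active_events`:

```python
from enum import Enum

class Mode(Enum):
    LIVE = "live"
    REPLAY = "replay"
    SIMULATION = "simulation"

def can_switch(current: Mode, target: Mode, *, confirmed: bool,
               block_during_events: bool,
               unacked_critical_or_high: int) -> tuple[bool, str]:
    """Gate a mode switch: always explicit, optionally blocked by alerts."""
    if current == target:
        return True, "no-op"
    if not confirmed:
        # Switching must go through the mode-switch dialogue - never implicit.
        return False, "confirmation required via mode-switch dialogue"
    if (target is Mode.SIMULATION and block_during_events
            and unacked_critical_or_high > 0):
        # Only SIMULATION is blocked; LIVE <-> REPLAY remains available.
        return False, (f"{unacked_critical_or_high} unacknowledged "
                       "CRITICAL/HIGH alerts require acknowledgement")
    return True, "ok"
```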

**Timeline control — two zoom levels:**

- **Event scale (default):** 72 hours, 6-hour intervals. Re-entry windows shown as coloured bars.
- **Orbital scale:** 4-hour window, 15-minute intervals. For orbital passes and conjunction events.

**LIVE mode scrub:** User can drag the playhead into the future to preview a predicted corridor. A "Return to Live" button appears whenever the playhead is not at current time.

**Future-preview temporal wash:** When the timeline playhead is not at current time (user is previewing a future state), the entire right-panel event list and alert badges are overlaid with a temporal wash (semi-transparent grey overlay) and a persistent label:

```
┌──────────────────────────────────────────────────────────────┐
│ ⏩ PREVIEWING +4h 00m — not current state   [Return to Live] │
└──────────────────────────────────────────────────────────────┘
```

The wash and label prevent a controller from acting on predicted-future data as though it were current. The globe corridor may show the projected state; the event list must be visually distinct. Alert badges are greyed and annotated "(projected)" in preview mode. Alert sounds and notifications are suppressed while previewing.

---

### 6.4 Uncertainty Visualisation — Three Phased Modes

Three representations are planned across phases. All are user-selectable via the `UncertaintyModeSelector` once implemented. Each page context has a recommended default.

**Mode selector** (appears in the layer controls panel whenever corridor data is loaded):
```
Corridor Display
● Percentile Corridors     ← Phase 1
○ Probability Heatmap      ← Phase 2
○ Monte Carlo Particles    ← Phase 3
```

Modes B and C appear greyed in the selector until their phase ships.

---

#### Mode A — Percentile Corridors (Phase 1, default for Persona A/C)

**What it shows:** Three nested polygon swaths on the globe — 5th, 50th, and 95th percentile ground track corridors from Monte Carlo output.

**Visual encoding:**
- 95th percentile: wide, 15% opacity amber fill, dashed border — hazard extent
- 50th percentile: medium, 35% opacity amber fill, solid border — nominal corridor
- 5th percentile: narrow, 60% opacity amber fill, bold border — high-probability core

**Colour by risk level:** Ocean-only → blue family; partial land → amber; significant land → red-orange.

**Over time:** As the re-entry window narrows, the outer swath contracts automatically in LIVE mode. The user watches the corridor "tighten" in real time.

---

#### Mode B — Probability Heatmap (Phase 2, default for Persona B)

**What it shows:** Continuous colour-ramp Deck.gl heatmap. Each cell's colour encodes probability density of ground impact across the full Monte Carlo sample set.

**Visual encoding:** Perceptually uniform, colour-blind-safe sequential palette (viridis or custom blue-white-orange). Scale normalised to the maximum probability cell; legend with percentile labels always shown.

**Interaction:** Hover a cell → tooltip shows "~N% probability of impact within this 50×50 km cell." The heatmap is recomputed client-side if the user adjusts the re-entry window bounds via the timeline.
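
The per-cell tooltip probability is a straightforward histogram over the MC impact sample. A minimal planar sketch (real cells would be computed on the ellipsoid; `cell_probabilities` is an illustrative name, not the shipped API):

```python
from collections import Counter

CELL_KM = 50.0  # cell size quoted in the tooltip copy

def cell_probabilities(impact_points_km, cell_km=CELL_KM):
    """Bin MC impact points (x, y in km, local planar frame) into
    cell_km x cell_km cells; return per-cell impact probability."""
    counts = Counter(
        (int(x // cell_km), int(y // cell_km)) for x, y in impact_points_km
    )
    n = len(impact_points_km)
    return {cell: c / n for cell, c in counts.items()}
```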

---

#### Mode C — Monte Carlo Particle Visualisation (Phase 3, Persona B advanced / Persona C briefing)

**What it shows:** 50–200 animated MC sample trajectory lines converging from re-entry interface altitude (~80 km) to impact. Particle colour encodes F10.7 assumption (cool = low solar activity = later re-entry, warm = high). Impact points persist as dots.

**Interaction:** Play/pause animation; scrub to any point in the trajectory; click a particle to see its parameter set (F10.7, Ap, B*).

**Performance:** Use CesiumJS `Primitive` API with per-instance colour attributes — not `Entity` API. Trajectory geometry pre-baked server-side and streamed as binary format (`/viz/mc-trajectories/{prediction_id}`). Never compute trajectories in the browser.

**Not the default for Persona A** — the animation can be alarming without quantitative context.

**Weighted opacity:** Particles render with opacity proportional to their sample weight, not uniform opacity. This visually down-weights outlier trajectories so that low-probability high-consequence paths do not visually dominate.
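
A minimal sketch of the weight-to-opacity mapping; the `lo`/`hi` bounds are illustrative assumptions (a floor keeps outliers faintly visible rather than invisible):

```python
def particle_opacities(weights, lo=0.08, hi=0.9):
    """Map MC sample weights to per-particle opacity, proportional to weight.

    Weights are normalised to the maximum, so the most probable particle
    renders at `hi`; outliers fall toward the `lo` floor instead of vanishing.
    """
    w_max = max(weights)
    return [lo + (hi - lo) * (w / w_max) for w in weights]
```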

**Mandatory first-use overlay:** When Mode C is first enabled (per user, tracked in user preferences), a one-time overlay appears before the animation starts:

```
MONTE CARLO PARTICLE VIEW
──────────────────────────────────────────────────────────────
Each animated line shows one possible re-entry scenario sampled
from the prediction distribution. Colour encodes the solar
activity assumption used for that sample.

These are not equally likely outcomes — particle opacity
reflects sample weight. For operational planning, the
Percentile Corridors view (Mode A) gives a more reliable
summary.

[Understood — show animation]
──────────────────────────────────────────────────────────────
```

The overlay is dismissed permanently per user on first acknowledgement and never shown again. It cannot be bypassed — the animation does not play until the user explicitly acknowledges.

---

### 6.5 Globe Information Hierarchy and Layer Management

**Default view state:** Active decay objects and their corridors, FIR boundaries for affected regions. "Show everything" is never the default.

**Layer management panel:**

```
LAYERS
────────────────────────────────────────
Objects
  ☑ Active decay objects (TIP issued)
  ☑ Decaying objects (perigee < 250 km)
  ☐ All tracked payloads
  ☐ Rocket bodies
  ☐ Debris catalog

Orbital Tracks
  ☐ Ground tracks (selected object only)
  ☐ All objects — [!] performance warning

Predictions & Corridors
  ☑ Re-entry corridors (active events)
  ☐ Re-entry corridors (all predicted)
  ☐ Fragment impact points
  ☐ Conjunction geometry

Airspace (Phase 2)
  ☐ FIR / UIR boundaries
  ☐ Controlled airspace
  ☐ Affected sectors (hazard intersection)

Reference
  ☐ Population density grid
  ☐ Critical infrastructure
────────────────────────────────────────
Corridor Display: [Percentile ▾]
```

Layer state persists to `localStorage` per session. Shared URLs encode active layer state in query parameters.

**Object clustering:** At zoom > 5,000 km, objects cluster. Badge shows count and highest urgency level. Clusters expand at < 2,000 km.

**Altitude-aware clustering rule (F8 — §62):** Objects at different altitudes with the same ground-track sub-point are not co-located — they have different re-entry windows and different hazard profiles. Two objects that share a 2D screen position but differ by > 100 km in altitude must **not** be merged into a single cluster. Implementation rule: CesiumJS `EntityCluster` clustering is disabled for any object with `reentry_predictions` showing a window < 30 days (i.e., any decay-relevant object in the watch/alert state). Objects in the normal catalog (`window > 30 days`) may continue to use screen-space clustering. This prevents the pathological case where a TIP-active object at 200 km is merged into a cluster with a nominal object at 500 km that shares its ground track, making the TIP object invisible in the cluster badge.
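
The eligibility test behind this rule reduces to a single predicate evaluated before objects are handed to the clustering layer. A minimal sketch with an illustrative helper name (`cluster_eligible` is not the shipped API):

```python
from datetime import datetime, timedelta, timezone

DECAY_RELEVANT_WINDOW = timedelta(days=30)

def cluster_eligible(predicted_reentry_utc, now_utc):
    """Screen-space clustering is allowed only for objects whose predicted
    re-entry is more than 30 days out, or that have no decay prediction.
    Decay-relevant objects must always render individually."""
    if predicted_reentry_utc is None:
        return True  # normal catalog object - no decay prediction at all
    return (predicted_reentry_utc - now_utc) > DECAY_RELEVANT_WINDOW
```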

**Urgency / Priority Visual Encoding** (colour-blind-safe — shape distinguishes as well as colour):

| State | Symbol | Colour | Meaning |
|-------|--------|--------|---------|
| TIP issued, window < 6h | ◆ filled diamond | Red `#D32F2F` | Imminent re-entry |
| TIP issued, window 6–24h | ◆ outlined diamond | Orange `#E65100` | Active threat |
| Predicted decay, window < 7d | ▲ triangle | Amber `#F9A825` | Elevated watch |
| Decaying, window > 7d | ● circle | Yellow-grey | Monitor |
| Conjunction Pc > 1:1000 | ✕ cross | Purple `#6A1B9A` | Conjunction risk |
| Normal tracked | · dot | Grey `#546E7A` | Catalog |

Never use red/green as the sole distinguishing pair.

---

### 6.6 Alert System UX

**Alert taxonomy:**

| Level | Trigger | Visual Treatment | Requires Acknowledgement? |
|-------|---------|-----------------|--------------------------|
| **CRITICAL** | TIP issued, window < 6h, hazard intersects active FIR | Full-width banner (red), audio tone (ops room mode) | Yes — named user; timestamp + note recorded |
| **HIGH** | Window < 24h, conjunction Pc > 1:1000 | Persistent badge (orange) | Yes — dismissal recorded |
| **MEDIUM** | New TIP issued (any), window < 7d, new CDM | Toast (amber), 8s auto-dismiss | No — logged |
| **LOW** | New TLE ingested, space weather index change | Notification centre only | No |

**Alert fatigue mitigation:**
- Mute rules: per-user, per-session LOW suppression
- Geographic filtering: alerts scoped to user's configured FIR list
- Deduplication: window shrinks that don't cross a threshold do not re-trigger
- Rate limit: same trigger condition cannot produce more than 1 CRITICAL alert per object per 4-hour window without a manual operator reset
- Alert generation triggered only by backend logic on verified data — never by direct API call from a client
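
The deduplication and rate-limit rules above can be sketched as two small checks. Names (`crossed_threshold`, `allow_critical`) are illustrative; only the 6 h / 24 h / 7 d alert boundaries and the 4-hour limit come from this spec, and the manual-reset path is omitted for brevity:

```python
from datetime import datetime, timedelta, timezone

RATE_LIMIT = timedelta(hours=4)
THRESHOLDS_H = (6, 24, 7 * 24)  # CRITICAL / HIGH / MEDIUM window boundaries

def crossed_threshold(old_window_h: float, new_window_h: float) -> bool:
    """A shrinking window only re-alerts when it crosses an alert boundary."""
    return any(new_window_h < t <= old_window_h for t in THRESHOLDS_H)

def allow_critical(last_emitted: dict, norad_id: int, now: datetime) -> bool:
    """At most one CRITICAL alert per object per 4-hour window."""
    last = last_emitted.get(norad_id)
    if last is not None and now - last < RATE_LIMIT:
        return False
    last_emitted[norad_id] = now
    return True
```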

**Ops room workload buffer (`OPS_ROOM_SUPPRESS_MINUTES`):** An optional per-organisation setting (default: 0 — disabled). When set to N > 0, CRITICAL alert full-screen banners are queued for up to N minutes before display. The top-nav badge increments immediately so peripheral attention is captured; only the full-screen interrupt is deferred. This matches FAA AC 25.1329 alert prioritisation philosophy: acknowledge at a glance, act when workload permits. Must be documented in the admin UI with a mandatory warning: *"Only enable if your operations room has a dedicated SpaceCom monitoring role. If a single controller manages all alerts, suppression introduces delay that may be safety-significant."*

**Audio alert specification:**
- Trigger: CRITICAL alert only (no audio for HIGH or lower)
- Sound: two-tone ascending chime pattern (not a siren — ops rooms have sirens from other systems)
- Behaviour: plays once on alert display; does not loop; stops on alert acknowledgement (not just banner dismiss)
- Volume: configurable per-device (default 50% system volume); mutable by operator per-session
- Ops room mode: organisation-level setting that enables audio (default: off; requires explicit activation)

**Alert storm detection:** If the system generates > 5 CRITICAL alerts within 1 hour across all objects, generate a meta-alert to Persona D. The meta-alert presents a disambiguation prompt rather than a bare count:

```
[META-ALERT — ALERT VOLUME ANOMALY]
──────────────────────────────────────────────────────────────
5 CRITICAL alerts generated within 1 hour.

This may indicate:
(a) Multiple genuine re-entry events — verify via Space-Track
    independently before taking operational action.
(b) System integrity issue — check ingest pipeline and data
    source health for signs of false data injection.

[Open /admin health dashboard →]  [View all CRITICAL alerts →]
──────────────────────────────────────────────────────────────
```
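
The storm trigger is a sliding-window count over CRITICAL alert timestamps. A minimal sketch, with `StormDetector` as an illustrative name:

```python
from collections import deque
from datetime import datetime, timedelta, timezone

STORM_THRESHOLD = 5          # meta-alert when count exceeds this...
STORM_WINDOW = timedelta(hours=1)  # ...within any rolling 1-hour window

class StormDetector:
    """Meta-alert when > 5 CRITICAL alerts fire within 1 hour."""
    def __init__(self):
        self._times = deque()

    def record(self, at: datetime) -> bool:
        """Record one CRITICAL alert; return True if this tips into a storm."""
        self._times.append(at)
        # Drop timestamps that have aged out of the rolling window.
        while self._times and at - self._times[0] > STORM_WINDOW:
            self._times.popleft()
        return len(self._times) > STORM_THRESHOLD
```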

**Acknowledgement workflow:**

CRITICAL acknowledgement requires two steps to prevent accidental confirmation:

**Step 1** — Alert banner with summary and Open Map link:
```
[CRITICAL ALERT]
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — TIP Issued
Re-entry window: 2026-03-16 14:00 – 22:00 UTC (8h)
Affected FIRs: YMMM, YSSY
Risk level: HIGH          | [Open map →]
[Review and Acknowledge →]
───────────────────────────────────────────────────────
```

**Step 2** — Confirmation modal (appears on clicking "Review and Acknowledge"):
```
ACKNOWLEDGE CRITICAL ALERT
───────────────────────────────────────────────────────
CZ-5B R/B (44878) — Re-entry window 14:00–22:00 UTC 16 Mar

Action taken (required — minimum 10 characters):
[_____________________________________________]

[Cancel]           [Confirm — J. Smith, 09:14 UTC]
───────────────────────────────────────────────────────
```

The Confirm button is disabled until the `Action taken` field contains ≥ 10 characters. This prevents reflexive one-click acknowledgement during an incident and ensures a minimal action record is always created.

Acknowledgements stored in `alert_events` (append-only). Records cannot be modified or deleted.
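
A minimal sketch of the acknowledgement record and its validation, with illustrative names (`Acknowledgement`, `acknowledge`); `frozen=True` stands in for the append-only property actually enforced at the `alert_events` table level:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MIN_ACTION_NOTE = 10  # characters - blocks reflexive one-click acknowledgement

@dataclass(frozen=True)  # immutable in memory; the table is append-only
class Acknowledgement:
    alert_id: str
    user: str
    action_taken: str
    at: datetime

def acknowledge(alert_id: str, user: str, action_taken: str) -> Acknowledgement:
    """Create an acknowledgement record, rejecting too-short action notes."""
    if len(action_taken.strip()) < MIN_ACTION_NOTE:
        raise ValueError("action note must be at least 10 characters")
    return Acknowledgement(alert_id, user, action_taken,
                           datetime.now(timezone.utc))
```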

---

### 6.7 Timeline / Gantt View

Full timeline accessible from `/events` and as a compact strip on the Operational Overview.

```
                    NOW     +6h      +12h     +24h     +48h     +72h
Object               │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
CZ-5B R/B 44878      │   [■■■■■[══════ window ═══════]■■■]        │
YMMM FIR — HIGH      │        │        │        │        │        │
────────────────────┼────────┼────────┼────────┼────────┼────────┼────
SL-16 R/B 28900      │        │        │   [■[══════════════════════════→
NZZC FIR — MED       │        │        │        │        │        │
```

`■` = nominal re-entry point; `══` = uncertainty window; colour = risk level.

Click event bar → Event Detail page; hover → tooltip with window bounds and affected FIRs. Zoom range: 6h to 7d.

---

### 6.8 Event Detail Page (`/events/{id}`)

```
┌──────────────────────────────────────────────────────────────┐
│ ← Events │ CZ-5B R/B NORAD 44878          [■ CRITICAL]       │
│          │ Re-entry window: 14:00–22:00 UTC 16 Mar 2026      │
├──────────────────────────────┬───────────────────────────────┤
│                              │ OBJECT                        │
│        3D GLOBE              │ Mass: 21,600 kg (● DISCOS)    │
│   (focused on corridor)      │ B*: 0.000215 /ER              │
│   Mode: [Percentile ▾]       │ Data confidence: ● DISCOS     │
│   [Layers]                   │                               │
│                              │ PREDICTION                    │
│                              │ Model: cowell_nrlmsise00 v2   │
│                              │ F10.7 assumed: 148 sfu        │
│                              │ MC samples: 500               │
│                              │ HMAC: ✓ verified              │
│                              │                               │
│                              │ WINDOW                        │
│                              │ 5th pct:  13:12 UTC           │
│                              │ 50th pct: 17:43 UTC           │
│                              │ 95th pct: 22:08 UTC           │
│                              │                               │
│                              │ TIP MESSAGES                  │
│                              │ MSG #3 — 09:00 UTC today      │
│                              │ [All TIP history →]           │
├──────────────────────────────┴───────────────────────────────┤
│ AFFECTED AIRSPACE (Phase 2)                                  │
│ YMMM FIR ████ HIGH   entry 14:20–19:10 UTC                   │
├──────────────────────────────────────────────────────────────┤
│ [Run Simulation]  [Generate Report]  [Share Link]            │
└──────────────────────────────────────────────────────────────┘
```

**HMAC verification status** is displayed prominently. If `✗ verification failed` appears, a banner reads: "This prediction record may have been tampered with. Do not use for operational decisions. Contact your system administrator."

**Data confidence** annotates every physical property: `● DISCOS` (green), `● estimated` (amber), `● unknown` (grey). When source is `unknown` or `estimated`, a warning callout appears above the prediction panel.

**Corridor Evolution widget (Phase 2):** A compact 2D strip on the Event Detail page showing how the p50 corridor footprint is evolving over time — three overlapping semi-transparent polygon outlines at T+0h, T+2h, T+4h from the current prediction. Updated automatically in LIVE mode. Gives Persona A Level 3 situation awareness (projection) at a glance without requiring simulation tools. Labelled: *"Corridor evolution — how prediction is narrowing"*. If the corridor is widening (unusual), an amber warning appears: *"Uncertainty is increasing — check space weather."*

**Duty Manager View (Phase 2):** A `[Duty Manager View]` toggle button on the Event Detail header. When active, collapses all technical detail and presents a large-text, decluttered view containing only:

```
┌──────────────────────────────────────────────────────────────┐
│ CZ-5B R/B NORAD 44878                         [■ CRITICAL]   │
│                                                              │
│ RE-ENTRY WINDOW                                              │
│   Start:       14:00 UTC 16 Mar 2026                         │
│   End:         22:00 UTC 16 Mar 2026                         │
│   Most likely: 17:43 UTC                                     │
│                                                              │
│ AFFECTED FIRs                                                │
│   YMMM (Airservices Australia) — HIGH RISK                   │
│   YSSY (Airservices Australia) — MEDIUM RISK                 │
│                                                              │
│ [Draft NOTAM]   [Log Action]   [Share Link]                  │
└──────────────────────────────────────────────────────────────┘
```

Toggle back to full view via `[Technical Detail]`. State is not persisted between sessions — always starts in full view.

**Response Options accordion (Phase 2):** An expandable panel at the bottom of the Event Detail page, visible to `operator` and above roles. Contextualised to the current risk level and FIR intersection. These are considerations only — all decisions rest with the ANSP:

```
RESPONSE OPTIONS                                   [▼ expand]
──────────────────────────────────────────────────────────────
Based on current prediction (risk: HIGH, window: 8h):

The following actions are for your consideration.
All operational decisions rest with the ANSP.

☐ Issue SIGMET or advisory to aircraft in YMMM FIR
☐ Notify adjacent ANSPs (YMMM borders: WAAF, OPKR)
☐ Draft NOTAM for authorised issuance [Open →]
☐ Coordinate with FMP on traffic flow impact
☐ Establish watching brief schedule (every 30 min)

[Log coordination note]
──────────────────────────────────────────────────────────────
```

Checkbox states and coordination notes are appended to `alert_events` (append-only). The Response Options items are dynamically generated by the backend based on risk level and affected FIR count — not hardcoded in the frontend.

---

### 6.9 Simulation Job Management UX

Persistent collapsible bottom-drawer panel visible on any page. Jobs continue running when the user navigates away.

```
SIMULATION JOBS                                    [▲ collapse]
────────────────────────────────────────────────────────────────
● Running   Decay prediction — 44878     312/500 ████░ 62%
            F10.7: 148, Ap: 12, B*±10%             ~45s rem
            [Cancel]

✓ Complete  Decay prediction — 44878     High F10.7 scenario
            Completed 09:02 UTC     [View results] [Compare]

✗ Failed    Breakup simulation — 28900
            Error: DISCOS data missing     [Retry] [Details]
────────────────────────────────────────────────────────────────
```

**Simulation comparison:** Two completed runs for the same object can be overlaid on the globe with distinct colours and a split-panel parameter comparison.

---

### 6.10 Space Weather Widget

```
SPACE WEATHER                                     [09:14 UTC]
────────────────────────────────────────────────────────────
Solar Activity     ●●●○○  ELEVATED
F10.7 observed: 148 sfu (81d avg: 132)

Geomagnetic        ●●●●○  ACTIVE
Kp: 5.3 / Ap daily: 27

Re-entry Impact    ▲ Active conditions — extend precaution window
Add ≥2h buffer beyond 95th percentile.

Forecast (24h)     Activity expected to decline — Kp 3–4
────────────────────────────────────────────────────────────
Source: NOAA SWPC    Updated: 09:00 UTC    [Full history →]
```

**Operational status summary** is generated by the backend based on F10.7 deviation from the 81-day average. The "Re-entry Impact" line delivers an operationally actionable statement — not a percentage — with a concrete recommended precaution buffer computed by the backend and delivered as a structured field:

| Condition | Re-entry Impact line | Recommended buffer |
|-----------|----------------------|--------------------|
| F10.7 < 90 or Kp < 2 | Low activity — predictions at nominal accuracy | +0h |
| F10.7 90–140, Kp 2–4 | Moderate activity — standard uncertainty applies | +1h |
| F10.7 140–200, Kp 4–6 | Active conditions — extend precaution window. Add ≥2h buffer beyond 95th percentile. | +2h |
| F10.7 > 200 or Kp > 6 | High activity — predictions less reliable. Add ≥4h buffer beyond 95th percentile. | +4h |

The buffer recommendation is surfaced on the Event Detail page as an explicit callout when conditions are Elevated or above: *"Space weather active: consider extending your airspace precaution window to [95th pct time + buffer]."*
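
The buffer table reduces to one backend function. Note the table's boundary conditions overlap as written (e.g. low F10.7 with high Kp matches two rows); this sketch resolves that by letting the stronger of the two drivers win, which is an interpretation, not a stated rule, and `reentry_buffer_hours` is an illustrative name:

```python
def reentry_buffer_hours(f107: float, kp: float) -> int:
    """Recommended precaution buffer (hours) beyond the 95th-percentile
    re-entry time, per the space-weather table; stronger driver wins."""
    if f107 > 200 or kp > 6:
        return 4   # high activity - predictions less reliable
    if f107 >= 140 or kp >= 4:
        return 2   # active conditions
    if f107 >= 90 or kp >= 2:
        return 1   # moderate activity
    return 0       # low activity - nominal accuracy
```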

---

### 6.11 2D Plan View (Phase 2)

Globe/map toggle (`[🌐 Globe] [🗺 Plan]`) synchronises selected object, active corridor, and time position. State is preserved on switch.

**2D view features:** Mercator or azimuthal equidistant projection; ICAO chart symbology for airspace; ground-track corridor as horizontal projection only; altitude/time cross-section panel below showing corridor vertical extent at each FIR crossing.

---

### 6.12 Reporting Workflow

**Report configuration dialogue:**

```
NEW REPORT — CZ-5B R/B (44878)
──────────────────────────────────────────────────────────────
Simulation: [Run #3 — 09:14 UTC ▾]

Report Type:
○ Operational Briefing (1–2 pages, plain language)
○ Technical Assessment (full uncertainty, model provenance)
○ Regulatory Submission (formal format, appendices)

Include Sections:
☑ Object properties and data confidence
☑ Re-entry window and uncertainty percentiles
☑ Ground track corridor map
☑ Affected airspace and FIR crossing times
☑ Space weather conditions at prediction time
☑ Model version and simulation parameters
☐ Full MC sample distribution
☐ TIP message history

Prepared by: J. Smith        Authority: CASA
──────────────────────────────────────────────────────────────
[Preview]   [Generate PDF]   [Cancel]
```

**Report identity:** Every report has a unique ID, the simulation ID it was derived from, a generation timestamp, and the analyst's name. Reports are stored in MinIO and listed in `/reports`.

**Date format in all reports and exports (F7):** Slash-delimited dates (`03/04/2026`) are ambiguous between DD/MM and MM/DD and are banned from all SpaceCom outputs. All dates in PDF reports, CSV exports, and NOTAM drafts use **`DD MMM YYYY`** format (e.g. `04 MAR 2026`) — unambiguous across all locales and consistent with ICAO and aviation convention. All times alongside dates use `HH:MMZ` (e.g. `04 MAR 2026 14:00Z`). This applies to: PDF prediction reports, CSV bulk exports, NOTAM draft `(B)`/`(C)` fields (which use ICAO `YYMMDDHHMM` format internally but are displayed as `DD MMM YYYY HH:MMZ` in the preview).
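
A minimal sketch of the F7 display formatter (`display_datetime` is an illustrative name). The month table is spelled out rather than using `strftime("%b")`, which follows the process locale and could reintroduce locale-dependent output:

```python
from datetime import datetime, timezone

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def display_datetime(dt: datetime) -> str:
    """Render a datetime as `DD MMM YYYY HH:MMZ`, always in UTC."""
    dt = dt.astimezone(timezone.utc)
    return (f"{dt.day:02d} {MONTHS[dt.month - 1]} {dt.year} "
            f"{dt.hour:02d}:{dt.minute:02d}Z")
```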

**Report rendering:** Server-side Playwright in the isolated `renderer` container. The map image is a headless Chromium screenshot of the globe at the relevant configuration. All user-supplied text is HTML-escaped before interpolation. The renderer has no external network access — it receives only sanitised, structured data from the backend API.

---

### 6.13 NOTAM Drafting Workflow (Phase 2)

SpaceCom cannot issue NOTAMs. Only designated NOTAM offices authorised by the relevant AIS authority can issue them. SpaceCom's role is to produce a draft in ICAO Annex 15 format ready for review and formal submission by an authorised originator.

**Trigger:** From the Event Detail page, Persona A clicks `[Draft NOTAM]`. This is only available when a hazard corridor intersects one or more FIRs.

**Draft NOTAM output (ICAO Annex 15 / OPADD format):**

Field format follows ICAO Annex 15 Appendix 6 and EUROCONTROL OPADD. Timestamps use `YYMMDDHHmm` format (not ISO 8601 — ICAO Annex 15 §5.1.2). `(B)` = `p10 − 30 min`; `(C)` = `p90 + 30 min` (see mapping table below).

```
NOTAM DRAFT — FOR REVIEW AND AUTHORISED ISSUANCE ONLY
══════════════════════════════════════════════════════
Generated by SpaceCom v2.1 | Prediction ID: pred-44878-20260316-003
Data source: USSPACECOM TIP #3 + SpaceCom decay prediction
⚠ This is a DRAFT only. Must be reviewed and issued by authorised NOTAM office.

Q) YMMM/QWELW/IV/NBO/AE/000/999/2200S13300E999
A) YMMM
B) 2603161330
C) 2603162230
E) UNCONTROLLED SPACE OBJECT RE-ENTRY. OBJECT: CZ-5B ROCKET BODY
   NORAD ID 44878. PREDICTED RE-ENTRY WINDOW 1400-2200 UTC 16 MAR
   2026. NOMINAL RE-ENTRY POINT APRX 22S 133E. 95TH PERCENTILE
   CORRIDOR 18S 115E TO 28S 155E. DEBRIS SURVIVAL PSB. AIRSPACE
   WITHIN CORRIDOR MAY BE AFFECTED ALL LEVELS DURING WINDOW.
   REF SPACECOM PRED-44878-20260316-003.
F) SFC
G) UNL
```

**NOTAM field mapping (ICAO Annex 15 Appendix 6):**

| NOTAM field | SpaceCom data source | Format rule |
|---|---|---|
| `(Q)` Q-line | FIR ICAO designator + NOTAM code `QWELW` (re-entry warning) | Generated from `airspace.icao_designator`; subject code `WE` (airspace warning), condition `LW` (laser/space) |
| `(A)` FIR | `airspace.icao_designator` for each intersecting FIR | One NOTAM per FIR; multi-FIR events generate multiple drafts |
| `(B)` Valid from | `prediction.p10_reentry_time − 30 minutes` | `YYMMDDHHmm` (UTC); example: `2603161330` |
| `(C)` Valid to | `prediction.p90_reentry_time + 30 minutes` | `YYMMDDHHmm` (UTC) |
| `(D)` Schedule | Omitted (continuous) | Do not include `(D)` field for continuous validity |
| `(E)` Description | Templated from sanitised object name, NORAD ID, p50 time, corridor bounds | `sanitise_icao()` applied; ICAO Doc 8400 abbreviations used (`PSB` not "possible", `APRX` not "approximately") |
| `(F)/(G)` Limits | `SFC` / `UNL` | Hardcoded for re-entry events; do not compute from corridor altitude |

**`(B)`/`(C)` field: re-entry window to NOTAM validity — time-critical cancellation:** The `(C)` validity time does not mean the hazard persists until then — it is the worst-case boundary. When re-entry is confirmed, the NOTAM cancellation draft must be initiated immediately. The Event Detail page surfaces a prominent `[Draft NOTAM Cancellation — RE-ENTRY CONFIRMED]` button at the moment the event status changes to `confirmed`, with a UI note: "Cancellation draft should be submitted to the NOTAM office without delay."

**Unit test:** Generate a draft for a prediction with `p10=2026-03-16T14:00Z`, `p90=2026-03-16T22:00Z`; assert `(B)` field is `2603161330` and `(C)` field is `2603162230`. Assert Q-line matches regex `\(Q\) [A-Z]{4}/QWELW/IV/NBO/AE/\d{3}/\d{3}/\d{4}[NS]\d{5}[EW]\d{3}`.

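The `(B)`/`(C)` derivation and `YYMMDDHHmm` formatting follow directly from the mapping table. A minimal sketch — the function names here are illustrative, not the codebase's:

```python
from datetime import datetime, timedelta, timezone

def notam_timestamp(dt: datetime) -> str:
    """Format a UTC datetime as ICAO YYMMDDHHmm (not ISO 8601)."""
    return dt.astimezone(timezone.utc).strftime("%y%m%d%H%M")

def notam_validity(p10: datetime, p90: datetime) -> tuple[str, str]:
    """(B) = p10 - 30 min, (C) = p90 + 30 min, per the mapping table."""
    return (
        notam_timestamp(p10 - timedelta(minutes=30)),
        notam_timestamp(p90 + timedelta(minutes=30)),
    )

# The unit-test case above:
p10 = datetime(2026, 3, 16, 14, 0, tzinfo=timezone.utc)
p90 = datetime(2026, 3, 16, 22, 0, tzinfo=timezone.utc)
assert notam_validity(p10, p90) == ("2603161330", "2603162230")
```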
**NOTAM cancellation draft:** When an event is closed (re-entry confirmed, object decayed), the Event Detail page offers `[Draft NOTAM Cancellation]` — generates a CANX NOTAM draft referencing the original.

**Regulatory note displayed in the UI:** A persistent banner on the NOTAM draft page reads: *"This draft is generated for review purposes only. It must be reviewed for accuracy, formatted to local AIS standards, and issued by an authorised NOTAM originator. SpaceCom does not issue NOTAMs."*

**NOTAM language and i18n exclusion (F6):** ICAO Annex 15 specifies that NOTAMs use ICAO standard phraseology in English (or the language of the state for domestic NOTAMs). NOTAM template strings are **never internationalised**:
- All NOTAM template strings are hardcoded ICAO English phraseology in `backend/app/modules/notam/templates.py`
- Each template string is annotated `# ICAO-FIXED: do not translate`
- The NOTAM draft is excluded from the `next-intl` message extraction tooling
- The NOTAM preview panel renders in a fixed-width monospace font to match traditional NOTAM format
- `lang="en"` attribute is set on the NOTAM text container regardless of the operator's UI locale

The draft is stored in the `notam_drafts` table (see §9.2) for audit purposes.

---

### 6.14 Shadow Mode (Phase 2)

Shadow mode allows ANSPs to run SpaceCom in parallel with existing procedures during a trial period, without acting operationally on its outputs. This is the primary mechanism for building regulatory acceptance evidence.

**Activation:** `admin` role only, per-organisation setting in `/admin`.

**Visual treatment when shadow mode is active:**

```
┌─────────────────────────────────────────────────────────────────┐
│ ⚗ SHADOW MODE — Predictions are not for operational use         │
│ All outputs are recorded for validation. No alerts are          │
│ delivered externally. Contact your administrator to disable.    │
└─────────────────────────────────────────────────────────────────┘
```

- A persistent amber banner spans the top of every page
- The mode indicator pill shows `⚗ SHADOW` in amber
- All alert levels are demoted to INFORMATIONAL — no banners, no audio tones, no email delivery
- Prediction records have `shadow_mode = TRUE` in the database (see §9)
- Shadow predictions are excluded from all operational views but accessible in `/analysis`

**Validation reporting:** After each real re-entry event, Persona B can generate a Shadow Validation Report comparing SpaceCom shadow predictions against the actual observed re-entry time/location. These reports form the evidence base for regulatory adoption.

**Shadow Mode Exit Criteria (regulatory hand-off specification — Finding 6):**

Shadow mode is a formal regulatory activity, not a product trial. Exit to operational use requires:

| Criterion | Requirement |
|---|---|
| Minimum shadow period | 90 days, or covering ≥ 3 re-entry events above the CRITICAL alert threshold, whichever is longer |
| Prediction accuracy | `corridor_contains_observed ≥ 90%` across shadow period events (from `prediction_outcomes`) |
| False positive rate | `fir_false_positive_rate ≤ 20%` — no more than 1 in 5 corridor-intersecting FIR alerts is a false alarm |
| False negative rate | `fir_false_negative = 0` during the shadow period — no re-entry event missed entirely |
| Exit document | `shadow-mode-exit-report-{org_id}-{date}.pdf` generated from `prediction_outcomes`; contains automated statistics + ANSP Safety Department sign-off field |
| Regulatory hand-off | Written confirmation from the ANSP's Accountable Manager or Head of ATM Safety that their internal Safety Case / Tool Acceptance process is complete |
| System state | `shadow_mode_cleared = TRUE` is set by SpaceCom `admin` only after receipt of the written ANSP confirmation |

The exit report template lives at `docs/templates/shadow-mode-exit-report.md`. Persona B generates the statistics from the admin analysis panel; the ANSP prints, signs, and returns the PDF. No software system can substitute for the ANSP's internal Safety Department sign-off.
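The automated portion of these criteria reduces to a single predicate over the shadow-period statistics. A sketch, reading "whichever is longer" as both the 90-day and 3-event conditions holding; the `ShadowStats` name and its fields are illustrative assumptions, not the actual `prediction_outcomes` schema:

```python
from dataclasses import dataclass

@dataclass
class ShadowStats:
    """Aggregates assumed to be derived from prediction_outcomes."""
    days_in_shadow: int
    critical_events_covered: int
    corridor_contains_observed_pct: float  # percentage, 0-100
    fir_false_positive_rate_pct: float     # percentage, 0-100
    fir_false_negatives: int

def shadow_exit_eligible(s: ShadowStats) -> bool:
    """Automated checks only: the ANSP Safety Department sign-off and the
    regulatory hand-off confirmation remain manual steps outside the system."""
    return (
        s.days_in_shadow >= 90
        and s.critical_events_covered >= 3
        and s.corridor_contains_observed_pct >= 90.0
        and s.fir_false_positive_rate_pct <= 20.0
        and s.fir_false_negatives == 0
    )
```

The predicate feeds the statistics section of the exit report; it cannot set `shadow_mode_cleared`, which stays a manual admin action.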

**Commercial trial-to-operational conversion (Finding 5):**

A successful shadow exit automatically generates a commercial offer. The admin panel transitions the organisation's `subscription_status` from `'shadow_trial'` to `'offered'` and Persona D receives a task notification. The offer package includes:
- Commercial offer document (generated from `docs/templates/commercial-offer-ansp.md`): tier, pricing, SLA schedule, DPA status
- MSA execution path: ANSPs that accept the offer sign the MSA; no separate negotiation required for the standard ANSP Operational tier
- Onboarding checklist: `docs/onboarding/ansp-onboarding-checklist.md`

If an ANSP does not convert within 30 days of receiving the offer, `subscription_status` moves to `'offered_lapsed'` and Persona D is notified. The admin panel shows conversion pipeline status for all ANSP organisations. Maximum concurrent ANSP shadow deployments in Phase 2: **2** (resource constraint — each requires a dedicated SpaceCom integration lead for the 90-day shadow period).

---

### 6.15 Space Operator Portal UX (Phase 2)

The Space Operator Portal (`/space`) is the second front door. It serves Personas E and F with a technically dense interface — a different visual language from the aviation-facing portal.

**Space Operator Overview (`/space`):**

```
┌─────────────────────────────────────────────────────────────────┐
│ SpaceCom · Space Portal   [API] [Export] [Persona E: ORBCO]     │
├─────────────────────┬───────────────────────────────────────────┤
│                     │ MY OBJECTS (3)                            │
│   3D GLOBE          │ ┌────────────────────────────────────┐    │
│   (owned objects    │ │ CZ-5B R/B              44878       │    │
│   only, with        │ │ Perigee: 178 km ↓ Decaying fast    │    │
│   full orbital      │ │ Re-entry: 16 Mar ± 8h              │    │
│   tracks and        │ │ [Predict] [Plan deorbit] [Export]  │    │
│   decay vectors)    │ ├────────────────────────────────────┤    │
│                     │ │ SL-16 R/B              28900       │    │
│                     │ │ Perigee: 312 km ~ Stable           │    │
│                     │ │ [Predict] [Export]                 │    │
│                     │ └────────────────────────────────────┘    │
│                     │ CONJUNCTION ALERTS (MY OBJECTS)           │
│                     │ No active conjunctions > Pc 1:10000       │
├─────────────────────┴───────────────────────────────────────────┤
│ API USAGE   Requests today: 143 / 1000        [Manage keys →]   │
└─────────────────────────────────────────────────────────────────┘
```

**Controlled Re-entry Planner (`/space/reentry/plan`):**

Available for objects with remaining manoeuvre capability (flagged in `owned_objects.has_propulsion`).

```
CONTROLLED RE-ENTRY PLANNER — CZ-5B R/B (44878)
─────────────────────────────────────────────────────────────────
Delta-V budget: [▓▓▓░░░░░] 12.4 m/s remaining

Target re-entry window: [2026-03-20 ▾] to [2026-03-22 ▾]
Avoid FIRs: [☑ YMMM] [☑ YSSY] [☑ Populated land]
Preferred landing: ● Ocean ○ Specific zone

CANDIDATE WINDOWS
──────────────────────────────────────────────────────────────────
#1 2026-03-21 03:14 UTC   ΔV: 8.2 m/s    Risk: ● LOW
   Landing: South Pacific    FIR: NZZO (ocean)
   [Select] [View corridor]

#2 2026-03-21 09:47 UTC   ΔV: 11.1 m/s   Risk: ● LOW
   Landing: Indian Ocean     FIR: FJDG (ocean)
   [Select] [View corridor]

#3 2026-03-21 15:30 UTC   ΔV: 9.8 m/s    Risk: ▲ MEDIUM
   Landing: 22S 133E         FIR: YMMM (land)
   [Select] [View corridor]
──────────────────────────────────────────────────────────────────
[Export manoeuvre plan (CCSDS)] [Generate operator report]
```

The planner outputs are suitable for submission to national space regulators as evidence of responsible end-of-life management under the ESA Zero Debris Charter and national space law requirements.

**Zero Debris Charter compliance output format (Finding 2):**

The planner produces a `controlled-reentry-compliance-report-{norad_id}-{date}.pdf` containing:
- Ranked deorbit window analysis (delta-V budget, window start/end, corridor risk score per window)
- FIR avoidance corridors for each candidate window
- Probability of casualty on the ground (Pc_ground) computed using NASA Debris Assessment Software methodology (1-in-10,000 IADC casualty threshold; documented in model card)
- Comparison table: each candidate window vs. the 1:10,000 Pc_ground threshold; compliant windows flagged green
- Zero Debris Charter alignment statement (auto-generated from object disposition)

Machine-readable companion: `application/vnd.spacecom.reentry-compliance+json` — returned alongside the PDF download URL as `compliance_report_url` in the planning job result. Format documented in `docs/api-guide/compliance-export.md`.

The Pc_ground calculation uses the fragment survivability model (§15.3 material class lookup) and the ESA DRAMA casualty area methodology. `objects.material_class IS NULL` → conservative all-survive assumption → higher Pc_ground — creates an incentive for operators to provide accurate physical data.
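The green/red flagging in the comparison table is a straight threshold check against 1:10,000. A hedged sketch — the window dictionary shape and function name are illustrative only; the actual Pc_ground values come from the fragment survivability and casualty area models:

```python
IADC_PC_THRESHOLD = 1.0 / 10_000  # IADC casualty threshold per window

def flag_windows(windows: list[dict]) -> list[dict]:
    """Mark each candidate deorbit window compliant/non-compliant against
    the 1:10,000 Pc_ground threshold, as in the report's comparison table."""
    return [
        {**w, "compliant": w["pc_ground"] <= IADC_PC_THRESHOLD}
        for w in windows
    ]

windows = [
    {"window": "2026-03-21T03:14Z", "pc_ground": 3.1e-5},   # ocean corridor
    {"window": "2026-03-21T15:30Z", "pc_ground": 2.4e-4},   # land corridor
]
flagged = flag_windows(windows)
assert [w["compliant"] for w in flagged] == [True, False]
```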

ECCN classification review (already in §21 Phase 2 DoD) must resolve before this output is shared with non-US entities.

---

### 6.16 Accessibility Requirements

- **WCAG 2.1 Level AA compliance** — required for government and aviation authority procurement
- Colour-blind-safe palette throughout; urgency uses shape + colour, never colour alone
- High-contrast mode available in user settings (WCAG AAA scheme)
- Dark mode as a first-class theme (not an afterthought)
- All interactive elements keyboard-accessible; tab order logical
- Alerts announced via `aria-live="assertive"` (CRITICAL) and `aria-live="polite"` (MEDIUM/LOW)
- Globe canvas has `aria-label` describing current view context
- Minimum touch target size 44×44 px
- Tested at 1080p (ops room), 1440p (analyst workstation), 1024×768 (tablet minimum)
- Automated axe-core audit via `@axe-core/playwright` run on the 5 core pages on every PR; 0 critical, 0 serious violations required to merge; known acceptable third-party violations (e.g., CesiumJS canvas contrast) recorded in `tests/e2e/axe-exclusions.json` with a justification comment — not silently suppressed. Implementation:
```typescript
// tests/e2e/accessibility.spec.ts
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

for (const [name, path] of [
  ['operational-overview', '/'], ['event-detail', '/events/seed-event'],
  ['notam-draft', '/notam/draft/seed-draft'], ['space-portal', '/space/objects'],
  ['settings', '/settings'],
]) {
  test(`${name} — WCAG 2.1 AA`, async ({ page }) => {
    await page.goto(path);
    const results = await new AxeBuilder({ page })
      .withTags(['wcag2a', 'wcag2aa'])
      .exclude(loadAxeExclusions()) // loads axe-exclusions.json
      .analyze();
    expect(results.violations).toEqual([]);
  });
}
```

---

### 6.17 Multi-ANSP Coordination Panel (Phase 2)

When an event's predicted corridor intersects FIRs belonging to more than one registered organisation, an additional panel appears on the Event Detail page. This panel provides shared situational awareness across ANSPs without replacing voice coordination.

```
MULTI-ANSP COORDINATION
──────────────────────────────────────────────────────────────
FIRs affected by this event:
YMMM Airservices Australia — ✓ Acknowledged 09:14 UTC J. Smith
NZZC Airways NZ           — ○ Not yet acknowledged

Last activity:
09:22 UTC YMMM — "Watching brief established, coordinating with FMP"
──────────────────────────────────────────────────────────────
[Log coordination note]
```

Rules:
- Each ANSP sees the acknowledgement status and latest coordination note from all other ANSPs on the event; they do not see each other's internal alert state
- Coordination notes are free text, appended to `alert_events` (append-only, auditable), with organisation name, user name, and UTC timestamp
- The panel is read-only for organisations that have not yet acknowledged; they can acknowledge and then log notes
- Visibility is scoped: organisations only see the panel for events that intersect their registered FIRs — they do not see coordination panels for unrelated events from other orgs

This does not replace voice or direct coordination — it creates a shared digital record that both ANSPs can reference. The panel carries a permanent banner: *"This coordination panel is for shared situational awareness only. It does not replace formal ATS coordination procedures or voice coordination."*

**Authority and precedence (Finding 5):** The panel has no command authority. If two ANSPs log conflicting assessments, neither supersedes the other in SpaceCom — the system records both. The authoritative coordination outcome is always the result of direct ATS coordination outside the system. SpaceCom coordination notes are supporting evidence, not operational decisions.

**WebSocket latency for coordination updates:** Coordination note updates must be visible to all parties within 2 seconds of posting (p99). This is specified as a performance SLA for the coordination panel WebSocket channel (distinct from the 5-second SLA for alert events). Latency > 2 seconds means an ANSP may have acted on a stale picture during a fast-moving event.

**Data retention for coordination records (ICAO Annex 11 §2.26):** Coordination notes are safety records. Minimum retention: 5 years in append-only storage. The `coordination_notes` table (stored append-only in `alert_events.coordination_notes JSONB[]` or as a separate table) is included in the safety record retention category (§27.4) and excluded from standard data drop policies.

---

### 6.18 First-Time User Onboarding State (Phase 1)

When a new organisation has no configured FIRs and no active events, the globe is empty. An empty globe is indistinguishable from "the system isn't working" for first-time users. An onboarding state prevents this misinterpretation.

**Trigger:** Organisation has `fir_list IS NULL OR fir_list = '{}'` at login.

**Display:** Three setup cards replace the Active Events panel:

```
WELCOME TO SPACECOM
──────────────────────────────────────────────────────────────
To see relevant events and receive alerts, complete setup:

1. Configure your FIR watch list
   Determines which re-entry events you see and which
   alerts you receive. [Configure →]

2. Set alert delivery preferences
   Email, WebSocket, or webhook for CRITICAL alerts.
   [Configure →]

3. Optional: Enable Shadow Mode for a trial period
   Run SpaceCom in parallel with existing procedures —
   outputs are not for operational use until disabled.
   [Configure →]

──────────────────────────────────────────────────────────────
```

Cards disappear permanently once step 1 (FIR list) is complete. Steps 2 and 3 remain accessible from `/admin` at any time. The setup cards are not a modal — they appear inline and the user can still access all navigation.

---

### 6.19 Degraded Mode UI Guidance (Phase 1)

The `StalenessWarningBanner` (triggered by `/readyz` returning 207) must include an operational guidance line keyed to the specific type of data degradation, not just a generic "data may be stale" message. Persona A's question in degraded mode is not "is the data stale?" — it is "can I use this for an operational decision right now?"

| Degradation type | Banner operational guidance |
|-----------------|----------------------------|
| Space weather data stale > 3h | *"Uncertainty estimates may be wider than shown. Treat all corridors as potentially broader than the 95th percentile boundary."* |
| TLE data stale > 24h | *"Object position data is more than 24 hours old. Do not use for precision airspace decisions without independent position verification."* |
| Active prediction older than 6h without refresh | *"This prediction reflects conditions from [timestamp]. A fresh prediction run is recommended before operational use. [Trigger refresh →]"* |
| IERS EOP data stale > 7 days | *"Coordinate frame transformations may have minor errors. Technical assessments only — do not use for precision airspace boundary work."* |

Banner behaviour:
- The banner type is set by the backend via the `/readyz` response body (`degradation_type` enum)
- Each degradation type has its own banner message — not a generic "degraded" label
- The banner persists until the degradation is resolved; it cannot be dismissed by the user
- When multiple degradations are active, show the highest-impact degradation first, with a `(+N more)` expand link

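The per-type guidance can be a static lookup keyed by the `degradation_type` enum, with an impact ordering driving the `(+N more)` rule. A sketch — the enum values and the ranking below are illustrative assumptions, not the backend's actual identifiers:

```python
# Illustrative enum values; the real /readyz degradation_type names may differ.
DEGRADATION_GUIDANCE = {
    "tle_stale": ("Object position data is more than 24 hours old. Do not use "
                  "for precision airspace decisions without independent "
                  "position verification."),
    "prediction_stale": ("A fresh prediction run is recommended before "
                         "operational use."),
    "space_weather_stale": ("Uncertainty estimates may be wider than shown. "
                            "Treat all corridors as potentially broader than "
                            "the 95th percentile boundary."),
    "eop_stale": ("Coordinate frame transformations may have minor errors. "
                  "Technical assessments only."),
}

# Assumed impact ranking, highest first.
IMPACT_ORDER = ["tle_stale", "prediction_stale", "space_weather_stale", "eop_stale"]

def banner_content(active: list[str]) -> tuple[str, int]:
    """Return the highest-impact guidance message and the (+N more) count."""
    ranked = sorted(active, key=IMPACT_ORDER.index)
    return DEGRADATION_GUIDANCE[ranked[0]], len(ranked) - 1
```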
---

### 6.20 Secondary Display Mode (Phase 2)

An ops room secondary monitor display mode — strips all navigation chrome and presents only the operational picture on a full-screen secondary display alongside existing ATC tools.

**Activation:** `[Secondary Display]` link in the user menu, or URL parameter `?display=secondary`. Opens in a new window or full-screen.

**Layout:** Full-screen globe on the left (~70% width), vertical event list on the right (~30% width). No top navigation, no admin links, no simulation controls. No sidebar panels. The LIVE/SHADOW/SIMULATION mode indicator remains visible (always). CRITICAL alert banners still appear.

**Design principle:** This is a CSS-level change — hide navigation and chrome elements, maximise the operational data density. No new data is added; no existing data is removed.

---

## 7. Security Architecture

**This section is as non-negotiable as §4.** Security must be built in from Week 1, not audited at Phase 3. The primary security risk in an aviation safety system is not data exfiltration — it is data corruption that produces plausible but wrong outputs that are acted upon operationally. A false all-clear for a genuine re-entry threat is the highest-consequence attack against this system's mission.

### 7.1 Threat Model (STRIDE)

Key trust boundaries and their principal threats:

| Boundary | Spoofing | Tampering | Repudiation | Info Disclosure | DoS | Elevation |
|----------|----------|-----------|-------------|-----------------|-----|-----------|
| Browser → API | JWT forgery | Request injection | Unlogged mutations | Token leak via XSS | Auth endpoint flood | RBAC bypass |
| API → DB | Credential leak | SQL injection | No audit trail | Column over-fetch | N+1 queries | RLS bypass |
| Ingest → External feeds | DNS/BGP hijack → wrong TLE | Man-in-the-middle alters F10.7 | — | Credential interception | Feed DoS | — |
| Celery worker → DB | Compromised worker | Corrupt sim output written to DB | Unlogged task | Param leak in logs | Runaway MC task | Worker → backend pivot |
| Playwright renderer → backend | — | User content → XSS → SSRF | — | Local file read | Hang/timeout | RCE via browser exploit |
| Redis | — | Cache poisoning | — | Token interception | Queue flood | — |

Mitigations for each threat are specified in the sections below.

---

### 7.2 Role-Based Access Control (RBAC)

Seven roles correspond to the platform's personas. Every API endpoint enforces the minimum required role via a FastAPI dependency.

| Role | Assigned To | Permissions |
|------|------------|------------|
| `viewer` | Read-only external stakeholders | View objects, predictions, corridors; read-only globe (aviation domain) |
| `analyst` | Persona B | viewer + submit simulations, generate reports, access historical data, shadow validation reports |
| `operator` | Persona A, C | analyst + acknowledge alerts, issue advisories, draft NOTAMs, access operational tools |
| `org_admin` | Organisation administrator | operator + invite/remove users within their own org; assign roles up to `operator` within own org; view own org's audit log; manage own org's API keys; update own org's billing contact; cannot access other orgs' data; cannot assign `admin` or `org_admin` without system admin approval |
| `admin` | Persona D (system-wide) | Full access: user management across all orgs, ingest configuration, model version deployment, shadow mode toggle, subscription management |
| `space_operator` | Persona E | Object-scoped access (owned objects only via `owned_objects` table); decay predictions and controlled re-entry planning for own objects; conjunction alerts for own objects; API key management; CCSDS export; no access to other organisations' simulation data |
| `orbital_analyst` | Persona F | Full catalog read; conjunction screening across any object pair; simulation submission; bulk export (CSV, JSON, CCSDS); raw state vector and covariance access; API key management; no alert acknowledgement |

**Object ownership scoping for `space_operator`:** The `owned_objects` table maps operators to their registered NORAD IDs. All queries from a `space_operator` user are automatically scoped to their owned object list — enforced by a PostgreSQL RLS policy on the `owned_objects` join, not only at the application layer:

```sql
-- space_operator users see only their owned objects in catalog queries
CREATE POLICY objects_owner_scope ON objects
  USING (
    current_setting('app.current_role') != 'space_operator'
    OR id IN (
      SELECT object_id FROM owned_objects
      WHERE organisation_id = current_setting('app.current_org_id')::INTEGER
    )
  );
```

**Multi-tenancy:** If multiple organisations use the system, every table that contains organisation-specific data (`simulations`, `reports`, `alert_events`, `hazard_zones`) must include an `organisation_id` column. PostgreSQL Row-Level Security (RLS) policies enforce the boundary at the database layer — not only at the application layer:

```sql
ALTER TABLE simulations ENABLE ROW LEVEL SECURITY;
CREATE POLICY simulations_org_isolation ON simulations
  USING (organisation_id = current_setting('app.current_org_id')::INTEGER);
```

The application sets `app.current_org_id` at the start of every database session from the authenticated user's JWT claims.

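One way to set those session variables is PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which scopes the value to the current transaction so pooled connections cannot leak another organisation's scope. A sketch that only builds the parameterised statements (execution through the app's DB driver is omitted; the function name is ours):

```python
def rls_session_setup(org_id: int, role: str) -> list[tuple[str, dict]]:
    """Parameterised statements to run at the start of every DB transaction,
    before any query. set_config(..., true) makes the value transaction-local,
    so a pooled connection reverts to no scope when the transaction ends."""
    return [
        ("SELECT set_config('app.current_org_id', :org, true)",
         {"org": str(org_id)}),
        ("SELECT set_config('app.current_role', :role, true)",
         {"role": role}),
    ]
```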
**Comprehensive RLS policy coverage (F1):** The `simulations` example above is the template. Every table that carries `organisation_id` must have RLS enabled and an isolation policy applied. The full set:

| Table | RLS policy | Notes |
|-------|-----------|-------|
| `simulations` | `organisation_id = current_org_id` | |
| `reentry_predictions` | `organisation_id = current_org_id` | shadow policy layered separately |
| `alert_events` | `organisation_id = current_org_id` | append-only; no UPDATE/DELETE anyway |
| `hazard_zones` | `organisation_id = current_org_id` | |
| `reports` | `organisation_id = current_org_id` | |
| `api_keys` | `organisation_id = current_org_id` | admins bypass to revoke any key |
| `usage_events` | `organisation_id = current_org_id` | billing metering records |
| `objects` | `organisation_id IS NULL OR organisation_id = current_org_id` | NULL = catalog-wide; org-specific = owned objects only |

**RLS bypass for system-level tasks:** Celery workers and internal admin processes run under a dedicated database role (`spacecom_worker`) that bypasses RLS (`BYPASSRLS`). This role is never used by the API request path. Integration test (BLOCKING): establish two orgs with data; issue a query as Org A's session; assert zero Org B rows returned. This test runs in CI against a real database (not mocked).

**Shadow mode segregation — database-layer enforcement (Finding 9):**

Shadow predictions must be excluded from operational API responses at the RLS layer, not only via application `WHERE` clauses. A backend query bug or misconfigured join must not expose shadow records to `viewer`/`operator` sessions — that would be a regulatory incident.

```sql
ALTER TABLE reentry_predictions ENABLE ROW LEVEL SECURITY;

-- Non-admin sessions never see shadow records unless the session flag is set
CREATE POLICY shadow_segregation ON reentry_predictions
  USING (
    shadow_mode = FALSE
    OR current_setting('spacecom.include_shadow', TRUE) = 'true'
  );
```

The `spacecom.include_shadow` session variable is set to `'true'` only by the backend's shadow-admin code path, which requires `admin` role and explicit shadow-mode context. Regular backend sessions never set this variable. Integration test: query `reentry_predictions` as `viewer` role with no `WHERE shadow_mode` clause; verify zero shadow rows returned.

**Four-eyes principle for admin role elevation (Finding 6):**

A single compromised admin account must not be able to silently elevate a backdoor account. Elevation to `admin` requires a second admin to approve within 30 minutes.

```sql
CREATE TABLE pending_role_changes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    target_user_id INTEGER NOT NULL REFERENCES users(id),
    requested_role TEXT NOT NULL,
    requested_by INTEGER NOT NULL REFERENCES users(id),
    approval_token_hash TEXT NOT NULL, -- SHA-256 of emailed token
    expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '30 minutes',
    approved_by INTEGER REFERENCES users(id),
    approved_at TIMESTAMPTZ,
    rejected_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

Workflow:
1. `PATCH /admin/users/{id}/role` with `role=admin` creates a `pending_role_changes` row and triggers an email to all other active admins containing a single-use approval token
2. `POST /admin/role-changes/{change_id}/approve?token=<token>` — any other admin can approve; completing the role change is atomic
3. Rows past `expires_at` are auto-rejected by a nightly job and logged as `ROLE_CHANGE_EXPIRED`
4. All outcomes (`ROLE_CHANGE_APPROVED`, `ROLE_CHANGE_REJECTED`, `ROLE_CHANGE_EXPIRED`) are logged to `security_logs` as HIGH severity
5. The requesting admin cannot approve their own pending change (enforced by `approved_by != requested_by` constraint)

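The single-use token can follow the standard store-only-the-hash pattern implied by `approval_token_hash`, with the no-self-approval rule checked alongside the hash. A sketch — the helper names are ours, not the codebase's:

```python
import hashlib
import secrets

def new_approval_token() -> tuple[str, str]:
    """Single-use token emailed to the other admins; only its SHA-256 hash
    is persisted in pending_role_changes.approval_token_hash."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode()).hexdigest()

def approve(presented_token: str, stored_hash: str,
            requested_by: int, approving_admin: int) -> bool:
    """Token must match, and the approver must not be the requester."""
    if approving_admin == requested_by:
        return False  # four-eyes: no self-approval
    digest = hashlib.sha256(presented_token.encode()).hexdigest()
    # compare_digest avoids leaking the hash through timing differences
    return secrets.compare_digest(digest, stored_hash)
```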
**RBAC enforcement pattern (FastAPI):**

```python
from fastapi import APIRouter, Depends, HTTPException

def require_role(*roles: str):
    def dependency(current_user: User = Depends(get_current_user)):
        if current_user.role not in roles:
            log_auth_failure(current_user, roles)
            raise HTTPException(status_code=403, detail="Insufficient permissions")
        return current_user
    return dependency

# Applied per router group — not per individual endpoint where it is easy to miss
router = APIRouter(dependencies=[Depends(require_role("operator", "admin"))])
```

---

### 7.3 Authentication

#### JWT Implementation

- **Algorithm:** `RS256` (asymmetric). Never `HS256` with a shared secret. Never `none`.
- **Key storage:** RSA private signing key stored in Docker secrets / secrets manager (see §7.5). Never in an environment variable or `.env` file.
- **Token storage in browser:** `httpOnly`, `Secure`, `SameSite=Strict` cookies only. Never `localStorage` (vulnerable to XSS). Never query parameters (appear in server logs).
- **Access token lifetime:** 15 minutes.
- **Refresh token lifetime:** 24 hours for `operator`/`analyst`; 8 hours for `admin`.
- **Refresh token rotation with family reuse detection (Finding 5):** Invalidate the old token on every refresh. Tokens belong to a `family_id` (UUID assigned at first issuance). If a token from a superseded generation within a family is presented — i.e. it was already rotated and a newer token in the same family exists — the entire family is immediately revoked, logged as `REFRESH_TOKEN_REUSE` (HIGH severity), and an email alert is sent to the user ("Suspicious login detected — all sessions revoked"). This detects refresh token theft: the legitimate user retries after the attacker consumed the token first, causing the reuse to surface. The `refresh_tokens` table includes `family_id UUID NOT NULL` and `superseded_at TIMESTAMPTZ` (set when a new token replaces this one in rotation).
- **Refresh token storage:** `refresh_tokens` table in the database (see §9.2). This enables server-side revocation — Redis-only storage loses revocations on restart.

#### Multi-Factor Authentication (MFA)

TOTP-based MFA (RFC 6238) is required for all roles from Phase 1. Implementation:

- On first login after account creation, user is presented with TOTP QR code (via `pyotp`) and required to verify before completing registration
- Recovery codes (8 × 10-character alphanumeric) generated at setup; stored as bcrypt hashes in `users.mfa_recovery_codes`
- MFA bypass via recovery code is logged as a security event (MEDIUM alert to admins)
- MFA is enforced at the JWT issuance step — tokens are not issued until MFA is verified
- Failed MFA attempts after 5 consecutive failures trigger a 30-minute account lockout and a MEDIUM alert

#### SSO / Identity Provider Abstraction

"Integrate with SkyNav SSO later" cannot remain a deferred decision. The auth layer must be designed as a pluggable provider from the start:

```python
from typing import Protocol

class AuthProvider(Protocol):
    async def authenticate(self, credentials: Credentials) -> User: ...
    async def issue_tokens(self, user: User) -> TokenPair: ...
    async def revoke(self, refresh_token: str) -> None: ...

class LocalJWTProvider(AuthProvider): ...  # Phase 1: local JWT + TOTP
class OIDCProvider(AuthProvider): ...      # Phase 3: OIDC/SAML SSO
```

All endpoint logic depends on `AuthProvider` — switching from local JWT to OIDC requires no endpoint changes.

---

### 7.4 API Security

#### Rate Limiting

Implemented with `slowapi` (Redis token bucket). Limits are per-user for authenticated endpoints, per-IP for auth endpoints:

| Endpoint | Limit | Window |
|----------|-------|--------|
| `POST /token` (login) | 10 per IP | 1 minute; exponential backoff after 5 failures |
| `POST /token/refresh` | 30 per user | 1 hour |
| `POST /decay/predict` | 10 per user | 1 hour |
| `POST /conjunctions/screen` | 5 per user | 1 hour |
| `POST /reports` | 20 per user | 1 day |
| `WS /ws/events` connection attempts | 10 per user | 1 minute |
| General authenticated read endpoints | 300 per user | 1 minute |
| General unauthenticated (if any) | 60 per IP | 1 minute |

Rate limit headers are returned on every response: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.
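The token-bucket semantics `slowapi` enforces can be illustrated with a small in-process sketch — e.g. the 10-per-hour `POST /decay/predict` limit is a bucket of capacity 10 refilled over 3600 s. `TokenBucket` and `check_rate_limit` are hypothetical names; the production state lives in Redis, not a process dict:

```python
import time


class TokenBucket:
    """Token bucket: `capacity` requests per `window_seconds`, refilled continuously."""

    def __init__(self, capacity: int, window_seconds: float):
        self.capacity = capacity
        self.refill_rate = capacity / window_seconds  # tokens per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per (user, endpoint) key — e.g. 10 requests/hour for POST /decay/predict.
buckets: dict[tuple[str, str], TokenBucket] = {}


def check_rate_limit(user_id: str, endpoint: str,
                     capacity: int, window: float) -> bool:
    bucket = buckets.setdefault((user_id, endpoint), TokenBucket(capacity, window))
    return bucket.allow()
```

A burst up to `capacity` is allowed immediately; further requests are rejected until tokens refill at `capacity / window` per second.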
#### Simulation Parameter Validation

All physical parameters must be validated against their physically meaningful ranges before a simulation job is accepted. Type validation alone is insufficient — NRLMSISE-00 will silently produce garbage for out-of-range inputs without raising an error:

```python
from pydantic import BaseModel, Field, validator

class DecayPredictParams(BaseModel):
    f107: float = Field(..., ge=65.0, le=300.0,
                        description="F10.7 solar flux (sfu). Physically valid: 65–300.")
    ap: float = Field(..., ge=0.0, le=400.0,
                      description="Geomagnetic Ap index. Valid: 0–400.")
    mc_samples: int = Field(..., ge=10,
                            description="Monte Carlo sample count. Server cap: 1000 regardless of input.")
    bstar_uncertainty_pct: float = Field(..., ge=0.0, le=50.0)

    @validator('mc_samples')
    def cap_mc_samples(cls, v):
        return min(v, 1000)  # Server-side cap regardless of submitted value
```

Note that `mc_samples` deliberately carries no `le=` bound: an upper bound would reject oversized values with a 422 before the validator ran, whereas the intended behaviour (and the one restated in §7.12) is to cap them at 1000 and proceed.
#### Server-Side Request Forgery (SSRF) Mitigation

The Ingest module fetches from five external sources. These URLs must be:

- **Hardcoded constants** in `ingest/sources.py` — never loaded from user input, API parameters, or database values
- **Fetched via an HTTP client configured with an allowlist** of expected IP ranges per source; connections to private IP ranges (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `169.254.0.0/16`, `::1`, `fc00::/7`) are blocked at the HTTP client layer

```python
ALLOWED_HOSTS = {
    "www.space-track.org": ["18.0.0.0/8"],  # approximate; update with actual ranges
    "celestrak.org": [...],
    "swpc.noaa.gov": [...],
    "discosweb.esoc.esa.int": [...],
    "maia.usno.navy.mil": [...],
}
```
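A sketch of the private-range block using the stdlib `ipaddress` module. `resolve_and_check` is a hypothetical helper; in production the check belongs inside the HTTP client's connection hook so redirects and DNS rebinding are re-checked on every connection:

```python
import ipaddress
import socket

_BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "127.0.0.0/8", "::1/128", "fc00::/7",
)]


def is_blocked_address(ip: str) -> bool:
    """True if `ip` falls in any private, loopback, or link-local range."""
    addr = ipaddress.ip_address(ip)
    # `addr in net` is False (not an error) when address/network versions differ.
    return any(addr in net for net in _BLOCKED_NETWORKS)


def resolve_and_check(host: str) -> list[str]:
    """Resolve a hostname and refuse if any resolved address is blocked."""
    addrs = {info[4][0]
             for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}
    blocked = [a for a in addrs if is_blocked_address(a)]
    if blocked:
        raise ConnectionError(f"SSRF guard: {host} resolved to blocked address(es) {blocked}")
    return sorted(addrs)
```

Checking every resolved address (not just the first) matters: an attacker-controlled DNS name can return a mix of public and private records.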
#### CZML Injection

Object names and descriptions sourced from Space-Track are interpolated into CZML documents and ultimately rendered in CesiumJS. A malicious object name containing `<script>` or CesiumJS-specific injection must be sanitised:

- HTML-encode all string fields from external sources before inserting into CZML
- CesiumJS renders CZML `description` fields as HTML in info boxes — treat them as untrusted HTML; run DOMPurify on the client before passing them to CesiumJS `description` properties
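Server-side, the HTML-encoding step can be as simple as the stdlib `html.escape`; the helper name is illustrative. DOMPurify on the client remains the second layer:

```python
import html


def sanitise_czml_string(value: str) -> str:
    """HTML-encode an externally-sourced string before it enters a CZML document.

    Escapes &, <, >, and (with quote=True) quote characters, so a hostile
    object name renders as inert text in the Cesium info box.
    """
    return html.escape(value, quote=True)
```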
#### NOTAM Draft Content Sanitisation (Finding 10)

NOTAM drafts are templated from prediction data, object names, and operator-supplied fields. Object names originate from Space-Track and from manual `POST /objects` input. ICAO plain-text format is vulnerable to special-character injection and, if the draft is ever rendered to PDF by the Playwright renderer, to XSS.

```python
import logging
import re

logger = logging.getLogger(__name__)

_ICAO_SAFE = re.compile(r"[^A-Z0-9\-_ /]")

def sanitise_icao(value: str, field_name: str = "field") -> str:
    """
    Strip characters outside the ICAO plain-text safe set before NOTAM template interpolation.

    Args:
        value: Raw string from user input or external source.
        field_name: Field identifier for logging if value is modified.

    Returns:
        Sanitised string safe for ICAO plain-text insertion.
    """
    upper = value.upper()
    sanitised = _ICAO_SAFE.sub("", upper)
    if sanitised != upper:
        logger.info("sanitise_icao: modified %s field", field_name)
    return sanitised or "[REDACTED]"
```

Rules:
- `sanitise_icao()` is called on every user-sourced field before interpolation into `NOTAM_TEMPLATE.format(...)`
- TLE remarks fields are stripped entirely from NOTAM output (not an ICAO-relevant field)
- The NOTAM template uses `str.format()` with named arguments, not f-strings with raw variables
- `sanitise_icao` is listed in `AGENTS.md` as a security-critical function — any change requires a dedicated security review

---
### 7.5 Secrets Management

"All secrets via environment variables" is a development-only posture.

**Development:** `.env` file. Never committed. `.gitignore` must include `.env` and `.env.*`.

**Production:** Docker secrets (Compose `secrets:` stanza) for the Phase 1 production deployment; HashiCorp Vault or a cloud-provider secrets manager (AWS Secrets Manager, GCP Secret Manager) for Phase 3.

**Secrets rotation schedule:**

| Secret | Rotation Frequency | Method |
|--------|-------------------|--------|
| JWT RS256 private key | 90 days | Key ID in JWT header; both old and new keys valid during 24h rotation window |
| Space-Track.org credentials | 90 days | Space-Track account supports credential rotation; coordinated with ops team |
| Database password | 90 days | Dual-credential rotation (see procedure below); zero-downtime |
| Redis ACL passwords (backend, worker, ingest) | 90 days | Update ACL password via `redis-cli ACL SETUSER`; restart dependent services with new env var; old password invalid immediately |
| MinIO access key | 90 days | MinIO admin API |
| **Cesium ion access token** | **NOT A SECRET** | Public browser credential — shipped in `NEXT_PUBLIC_CESIUM_ION_TOKEN`. Read via `Ion.defaultAccessToken = process.env.NEXT_PUBLIC_CESIUM_ION_TOKEN`. Do not proxy through the backend. Do not store in Docker secrets or Vault. Rotate only if the token is explicitly revoked on cesium.com. |
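The dual-key window in the JWT row relies on `kid`-based key selection: during the 24 h overlap the verifier holds both public keys and picks one from the unverified token header. A stdlib sketch of that lookup (the registry contents and key names are placeholders; real verification then proceeds with the selected key via the JWT library):

```python
import base64
import json

# Hypothetical registry: during the 24 h rotation window both the outgoing and
# the incoming RS256 public keys are present, addressed by `kid`.
PUBLIC_KEYS = {
    "2026-01-key": "-----BEGIN PUBLIC KEY----- (old) -----END PUBLIC KEY-----",
    "2026-04-key": "-----BEGIN PUBLIC KEY----- (new) -----END PUBLIC KEY-----",
}


def select_verification_key(token: str) -> str:
    """Read the unverified JWT header and pick the matching public key by kid."""
    header_b64 = token.split(".")[0]
    header_b64 += "=" * (-len(header_b64) % 4)  # restore base64url padding
    header = json.loads(base64.urlsafe_b64decode(header_b64))
    try:
        return PUBLIC_KEYS[header["kid"]]
    except KeyError:
        raise ValueError(f"Unknown or retired key id: {header.get('kid')}")
```

Tokens carrying a retired `kid` fail closed with a clear error instead of silently verifying against the wrong key.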
**Database password rotation procedure** — a hard PgBouncer restart drops idle connections cleanly but kills active transactions. Use the drain-then-swap sequence instead:

1. **Update the Postgres role** (new password valid immediately; old password still in the PgBouncer config): `ALTER ROLE spacecom_app PASSWORD 'new_secret';`
2. **Drain PgBouncer** — issue `PAUSE pgbouncer;`. New connections queue; existing transactions complete. Timeout: 30 s (if not drained, proceed and accept brief 503s).
3. **Update the PgBouncer config** with the new password, then `RESUME pgbouncer;`. Application connections resume using the new password.
4. **Verify ingest/API within 5 minutes** — `/admin/ingest-status` and `GET /readyz` must return 200.
5. **Confirm the old password is rejected** after a 15-minute grace period — step 1 already replaced it at the Postgres role, so a test connection with the old password must fail. Sessions established before the drain are unaffected: authentication happens only at connect time.
6. **Rotate Patroni replication credentials separately** — `patronictl reload` with updated `postgresql.parameters.hba_file`; does not affect application connections.

Full runbook: `docs/runbooks/db-password-rotation.md`.

**Anti-patterns — enforced by `git-secrets` pre-commit hook and CI scan:**
- No secrets in `requirements.txt`, `docker-compose.yml`, `Dockerfile`, source files, or logs
- Secret patterns (AWS keys, private key headers, connection strings) trigger CI failure

---
### 7.6 Transport Security

**External-facing:**
- HTTPS only. HTTP → HTTPS 301 redirect.
- `Strict-Transport-Security: max-age=63072000; includeSubDomains; preload`
- TLS 1.2 minimum; TLS 1.3 preferred. Disable TLS 1.0, 1.1, and SSLv3.
- Cipher suites: Mozilla "Intermediate" configuration or better.
- WebSocket connections: `wss://` only. The `ws.ts` client enforces this.

**Internal service communication:**
- Backend → DB: PostgreSQL TLS with client certificate verification
- Backend → Redis: Redis 7 TLS mode (`tls-port`, `tls-cert-file`, `tls-key-file`, `tls-ca-cert-file`)
- Backend → MinIO: HTTPS (MinIO production mode requires TLS)
- Backend → Renderer: HTTPS on the internal Docker network; the renderer does not accept connections from any other service

**Certificate management:**
- Production: Let's Encrypt via Caddy (auto-renewal, OCSP stapling)
- Certificate expiry monitored: alert 30 days before expiry via `cert-manager` or a custom Celery task

---
### 7.7 Content Security Policy and Security Headers

SpaceCom uses **two distinct CSP tiers** because CesiumJS requires `'unsafe-eval'` (GLSL shader compilation) — a directive that would be unacceptable on non-globe routes.

**Tier 1 — Non-globe routes** (login, settings, admin, API responses):

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob:;
  connect-src 'self' wss://[domain];
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), camera=(), microphone=()
```

**Tier 2 — Globe routes** (`app/(globe)/` — all routes under the `(globe)` layout group only):

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self' 'unsafe-eval' https://cesium.com;
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob: https://*.cesium.com https://*.openstreetmap.org;
  connect-src 'self' wss://[domain] https://cesium.com https://api.cesium.com;
  worker-src blob:;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';
```

**Implementation in `next.config.ts`:**

```typescript
// next.config.ts
// Globe routes (the `(globe)` layout group) are served at /dashboard and /monitor.
const headers = async () => [
  {
    source: '/((?!dashboard|monitor).*)', // non-globe routes
    headers: [{ key: 'Content-Security-Policy', value: CSP_STANDARD }],
  },
  {
    source: '/(dashboard|monitor)(.*)', // globe routes — unsafe-eval allowed
    headers: [{ key: 'Content-Security-Policy', value: CSP_GLOBE }],
  },
];
```

`'unsafe-eval'` is required by CesiumJS for runtime GLSL shader compilation. Scope it **only** to globe routes. This is a known, documented exception — it must never appear in the standard-tier CSP.

`'unsafe-inline'` for `style-src` is also required by CesiumJS and appears in both tiers. It must never be used for `script-src` in either tier.

**Renderer page CSP** (the headless Playwright context — the most restrictive tier):

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  style-src 'self';
  img-src 'self' data: blob:;
  connect-src 'none';
  frame-ancestors 'none';
```

---
### 7.8 WebSocket Security

`WS /ws/events` authentication:
- The JWT must be verified at connection establishment (the HTTP Upgrade request)
- Browser WebSocket APIs cannot send custom headers — use the `httpOnly` auth cookie (set by the login flow), which is sent automatically with the Upgrade request; verify it in the WebSocket handshake handler
- Do not accept tokens via query parameters (`?token=...`) — they appear in server access logs

Connection management:
- Per-user concurrent connection limit: 5, enforced in the upgrade handler by checking a Redis counter
- Server-side ping every 30 seconds; close connections that do not respond within 60 seconds
- All incoming WebSocket messages (if bidirectional) are validated against a JSON schema before processing
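The per-user limit maps to an `INCR`/`DECR` pair on a per-user Redis key; this in-process sketch shows the acquire/release contract the upgrade and disconnect handlers must honour (function names hypothetical):

```python
from collections import defaultdict

MAX_CONNECTIONS_PER_USER = 5

# In-process stand-in for the Redis counter keyed by user id.
_active: defaultdict[str, int] = defaultdict(int)


def acquire_ws_slot(user_id: str) -> bool:
    """Reserve a connection slot at upgrade time; reject the 6th concurrent socket."""
    if _active[user_id] >= MAX_CONNECTIONS_PER_USER:
        return False
    _active[user_id] += 1
    return True


def release_ws_slot(user_id: str) -> None:
    """Must run on every close path (including ping timeout), or slots leak."""
    _active[user_id] = max(0, _active[user_id] - 1)
```

The release call belongs in a `finally` block around the connection handler so that abnormal closes still free the slot.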
---
### 7.9 Data Integrity

This is the most important security property of the system. Predictions that drive aviation safety decisions must be trustworthy and tamper-evident.

#### HMAC Signing of Predictions

Every row written to `reentry_predictions` and `hazard_zones` is signed at creation time with an application-secret HMAC:

```python
import hashlib
import hmac
import json

def sign_prediction(prediction: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": prediction["id"],
        "object_id": prediction["object_id"],
        "p50_reentry_time": prediction["p50_reentry_time"].isoformat(),
        "model_version": prediction["model_version"],
        "f107_assumed": prediction["f107_assumed"],
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
```

**HMAC signing race fix (F4 — §67):** If `reentry_predictions.id` is a DB-assigned `BIGSERIAL`, the application must INSERT first (to get the `id`), then compute the HMAC using that `id`, then UPDATE the row — a two-phase write. Between the INSERT and the UPDATE there is a brief window where a valid prediction row exists with an empty `record_hmac`, which the nightly HMAC verification job (§10.2) would flag as a violation.

**Fix:** Use `UUID` as the primary key (`DEFAULT gen_random_uuid()`) and assign the UUID in the application **before** the INSERT. The application pre-generates the UUID, computes the HMAC against the full prediction dict including that UUID, then inserts the complete row in a single write:

```python
import uuid

from sqlalchemy import text

def write_prediction_to_db(prediction: dict):
    prediction_id = str(uuid.uuid4())
    prediction['id'] = prediction_id
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    # Single INSERT — no two-phase write; no race window
    db.execute(text("""
        INSERT INTO reentry_predictions (id, object_id, ..., record_hmac)
        VALUES (:id, :object_id, ..., :record_hmac)
    """), prediction)
```

Migration: `ALTER TABLE reentry_predictions ALTER COLUMN id TYPE UUID USING gen_random_uuid(); ALTER TABLE reentry_predictions ALTER COLUMN id SET DEFAULT gen_random_uuid();` — requires cascade updates to FK references (`alert_events.prediction_id`, `prediction_outcomes.prediction_id`). Include in the next schema migration (`alembic revision --autogenerate`).

The HMAC is stored in a `record_hmac` column. Before serving any prediction to a client, the backend verifies the HMAC. A failed verification:
- Is logged as a security event (CRITICAL alert to admins)
- Marks the prediction `integrity_failed = TRUE`
- The prediction is not served; the API returns a 503 with a message directing the user to contact the system administrator
- The Event Detail page displays `✗ HMAC verification failed` and a warning banner
#### Prediction Immutability

Once written, prediction records must not be modified:

```sql
CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
    RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reentry_predictions_immutable
BEFORE UPDATE OR DELETE ON reentry_predictions
FOR EACH ROW EXECUTE FUNCTION prevent_prediction_modification();
```

Apply the same trigger to `hazard_zones`.
#### HMAC Key Rotation Procedure (Finding 1)

The immutability trigger blocks all UPDATEs on `reentry_predictions`, including legitimate HMAC re-signing during key rotation. The rotation path must be explicit and auditable:

**Schema additions to `reentry_predictions`:**
```sql
ALTER TABLE reentry_predictions
    ADD COLUMN rotated_at TIMESTAMPTZ,
    ADD COLUMN rotated_by INTEGER REFERENCES users(id);
```

**Parameterised immutability trigger** — allows an UPDATE only when the session flag has been set by the privileged `hmac_admin` role and only the HMAC (plus rotation metadata) has changed:
```sql
CREATE OR REPLACE FUNCTION prevent_prediction_modification()
RETURNS TRIGGER AS $$
BEGIN
    -- Allow HMAC-only rotation when the flag is set by the hmac_admin role.
    -- Compare the rows with the rotation columns removed so that every other
    -- column is provably unchanged.
    IF TG_OP = 'UPDATE'
       AND current_setting('spacecom.hmac_rotation', TRUE) = 'true'
       AND NEW.record_hmac IS DISTINCT FROM OLD.record_hmac
       AND (to_jsonb(NEW) - 'record_hmac' - 'rotated_at' - 'rotated_by')
           = (to_jsonb(OLD) - 'record_hmac' - 'rotated_at' - 'rotated_by')
    THEN
        RETURN NEW;
    END IF;
    RAISE EXCEPTION 'reentry_predictions is immutable after creation. Create a new prediction instead.';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
```

**`hmac_admin` database role:** A dedicated `hmac_admin` Postgres role is the only role permitted to `SET LOCAL spacecom.hmac_rotation = true`. The `backend` application role does not have this privilege. The rotation script connects as `hmac_admin`, sets the flag per-transaction, re-signs each row, and commits. Every changed row is logged to `security_logs` as event type `HMAC_ROTATION`.

**Dual sign-off:** The rotation script must be run with two operators present. The runbook requires that the initiating operator's user ID is recorded in the `rotated_by` column, and that the second operator independently verifies a random sample of re-signed HMACs against the new key before the rotation is considered complete.

The HMAC rotation runbook lives at `docs/runbooks/hmac-key-rotation.md` and cross-references the zero-downtime JWT keypair rotation runbook for the dual-key validity window.
#### Append-Only `alert_events`

```sql
CREATE OR REPLACE FUNCTION prevent_alert_modification()
RETURNS TRIGGER AS $$
BEGIN
    RAISE EXCEPTION 'alert_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER alert_events_immutable
BEFORE UPDATE OR DELETE ON alert_events
FOR EACH ROW EXECUTE FUNCTION prevent_alert_modification();
```
#### Cross-Source Validation

Do not silently trust a single data source:
- **TLE cross-validation:** When the same NORAD ID is received from both Space-Track and CelesTrak within a 6-hour window, compare the key orbital elements. If they differ by more than a defined threshold (e.g., semi-major axis > 1 km, inclination > 0.01°), flag for human review rather than silently using one.
- **All-clear double check:** A prediction record showing no hazard for an object that has an active TIP message triggers an integrity alert. A single-source all-clear cannot override a TIP message.
- **Space weather cross-validation:** Ingest F10.7 from both NOAA SWPC and the ESA Space Weather Service. If they disagree by more than 20%, alert and use the more conservative (higher) value until the discrepancy resolves.
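The TLE comparison and the space-weather rule can be sketched as pure functions. The thresholds mirror the examples above; the element names, and the choice of the smaller value as the base for the 20 % disagreement test, are assumptions for illustration:

```python
# Thresholds from the bullet list above; names are illustrative, not a TLE parser.
THRESHOLDS = {
    "semi_major_axis_km": 1.0,   # flag if sources differ by more than 1 km
    "inclination_deg": 0.01,     # flag if sources differ by more than 0.01 deg
}


def cross_validate_elements(space_track: dict, celestrak: dict) -> list[str]:
    """Return the element names whose cross-source disagreement exceeds threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if abs(space_track[name] - celestrak[name]) > limit
    ]


def reconcile_f107(noaa: float, esa: float) -> float:
    """Space-weather rule: on >20% disagreement, use the conservative (higher) value."""
    if abs(noaa - esa) > 0.2 * min(noaa, esa):
        # The real system also raises an alert here before returning.
        return max(noaa, esa)
    return noaa
```

A non-empty list from `cross_validate_elements` routes the object to human review rather than silently preferring one source.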
#### IERS EOP Integrity

The weekly IERS Bulletin A download must be verified before application:
```python
IERS_BULLETIN_A_SHA256 = {
    # Updated manually each quarter; verified against IERS publications
    "finals2000A.all": "expected_hash_here",
}
# If the hash check fails, the existing EOP table is retained; a MEDIUM alert is generated
```
**`alert_events` HMAC integrity (F9):** `alert_events` records are safety-critical audit evidence (UN Liability Convention, ICAO). They carry the same HMAC protection as `reentry_predictions`:

```python
import hashlib
import hmac
import json

def sign_alert_event(event: dict, secret: bytes) -> str:
    payload = json.dumps({
        "id": event["id"],
        "object_id": event["object_id"],
        "organisation_id": event["organisation_id"],
        "level": event["level"],
        "trigger_type": event["trigger_type"],
        "created_at": event["created_at"].isoformat(),
        "acknowledged_by": event["acknowledged_by"],
        "action_taken": event.get("action_taken"),
    }, sort_keys=True)
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
```

Nightly integrity check (Celery Beat, 02:00 UTC):
```python
from datetime import datetime, timedelta, timezone

@celery.task
def verify_alert_event_hmac():
    """Re-verify the HMAC on all alert_events created in the past 24 hours."""
    since = datetime.now(timezone.utc) - timedelta(hours=24)
    rows = db.execute(
        text("SELECT id FROM alert_events WHERE created_at >= :since"),
        {"since": since},
    ).fetchall()
    for row in rows:
        event = db.get(AlertEvent, row.id)
        expected = sign_alert_event(event.__dict__, HMAC_SECRET)
        if not hmac.compare_digest(expected, event.record_hmac):
            log_security_event("ALERT_EVENT_HMAC_FAILURE", {"event_id": row.id})
            alert_admin_critical(f"alert_events HMAC integrity failure: id={row.id}")
```
**Database timezone enforcement (F2):** PostgreSQL `TIMESTAMPTZ` stores values internally in UTC, but ORM connections can silently apply server or session timezone offsets. All timestamps must remain UTC end-to-end:

```python
# database.py — connection pool creation
from sqlalchemy import event, text

@event.listens_for(engine.sync_engine, "connect")
def set_timezone(dbapi_conn, connection_record):
    cursor = dbapi_conn.cursor()
    cursor.execute("SET TIME ZONE 'UTC'")
    cursor.close()
```

Integration test (`tests/test_db_timezone.py` — BLOCKING):
```python
from datetime import datetime, timezone

def test_timestamps_round_trip_as_utc(db_session):
    """Ensure the ORM never silently converts UTC timestamps to local time."""
    known_utc = datetime(2026, 3, 22, 14, 0, 0, tzinfo=timezone.utc)
    obj = ReentryPrediction(p50_reentry_time=known_utc, ...)
    db_session.add(obj)
    db_session.flush()
    db_session.refresh(obj)
    assert obj.p50_reentry_time == known_utc
    assert obj.p50_reentry_time.tzinfo == timezone.utc
```

Any non-UTC representation of a timestamp is a display-layer concern only — never stored or transmitted as local time.
---
### 7.10 Infrastructure Security

#### Container Hardening

Applied to all service Dockerfiles and Compose definitions:

```yaml
# Applied to all services
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp:size=256m,mode=1777
user: "1000:1000"  # non-root; created in Dockerfile as: RUN useradd -r -u 1000 appuser
cap_drop:
  - ALL
cap_add: []  # No capabilities added; NET_BIND_SERVICE not needed if ports > 1024
```

**Renderer container** — most restrictive:
```yaml
renderer:
  security_opt:
    - no-new-privileges:true
    - seccomp:renderer-seccomp.json  # Custom seccomp profile for Chromium
  network_mode: none  # Overridden by renderer_net, which allows only the internal backend API
  read_only: true
  tmpfs:
    - /tmp:size=512m           # Playwright needs /tmp
    - /home/appuser:size=256m  # Chromium profile directory
  cap_drop:
    - ALL
  cap_add:
    - SYS_ADMIN  # Required by the Chromium sandbox; document this explicitly
```

`SYS_ADMIN` for Chromium is a known requirement. Mitigate it by ensuring the renderer container has no network access to anything other than the backend internal API, and by applying a strict seccomp profile.

#### Redis Authentication and ACLs

```
# redis.conf (production)
requirepass ""  # Disabled; use ACL only
aclfile /etc/redis/users.acl

# users.acl
user backend on >[backend_password] ~* &* +@all -@dangerous
user worker on >[worker_password] ~celery:* &celery:* +RPUSH +LPOP +LLEN +SUBSCRIBE +PUBLISH +XADD +XREAD
user default off  # Disable the default user
```
#### MinIO Bucket Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::*"
  }]
}
```

All buckets are private. Report downloads use 5-minute pre-signed URLs (reduced from 15 minutes — users download immediately). **Pre-signed URL generation is logged** to `security_logs` (event type `PRESIGNED_URL_GENERATED`) with `user_id`, `object_key`, `expires_at`, and `client_ip` — this creates an audit trail of who obtained access to which object.

**MC blob access — server-side proxy (Finding 2):** Simulation trajectory blobs (MC samples) must not be served as direct pre-signed MinIO URLs to the browser. Instead, the visualiser calls `GET /viz/mc-trajectories/{simulation_id}`, which the backend fetches from MinIO server-side and streams to the authenticated client. This keeps MinIO URLs entirely off the client and prevents URL sharing or exfiltration. The backend enforces that the requesting user's organisation matches the simulation's `organisation_id` before proxying.
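The proxy's authorisation gate can be sketched as follows; `fetch_chunks` stands in for the server-side MinIO client read, `ForbiddenError` is a hypothetical exception type, and the real endpoint would wrap the returned iterator in a streaming HTTP response:

```python
from typing import Callable, Iterator


class ForbiddenError(Exception):
    """403: the requester's organisation does not own the simulation."""


def stream_mc_trajectories(
    user_org_id: int,
    simulation_org_id: int,
    blob_key: str,
    fetch_chunks: Callable[[str], Iterator[bytes]],
) -> Iterator[bytes]:
    """Authorise, then relay MinIO chunks to the client; no URL leaves the server.

    The check runs eagerly (before any bytes are fetched), so an unauthorised
    request fails before the storage layer is touched.
    """
    if user_org_id != simulation_org_id:
        raise ForbiddenError("simulation belongs to another organisation")
    return fetch_chunks(blob_key)
```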
---
### 7.11 Playwright Renderer Security

The renderer is the highest attack-surface component. It runs a real browser on the server.

**Isolation:** The `renderer` service runs in its own container on `renderer_net`. It accepts HTTPS connections only from the backend's internal IP. It makes no outbound connections beyond `backend:8000` (enforced by network segmentation plus Playwright request interception — see below).

**Data flow:** The renderer receives only a `report_id` (integer) from the backend job queue. It constructs the report URL internally as `http://backend:8000/reports/{report_id}/preview` — user-supplied values are never interpolated into the URL. The `report_id` is validated as a positive integer before use. The renderer has no direct access to the database, Redis, or MinIO.

**Playwright request interception (Finding 4) — allowlist, not blocklist:**
```python
from playwright.async_api import Page, Route

async def setup_request_interception(page: Page) -> None:
    """Block any Playwright navigation to hosts other than the backend."""
    async def handle_route(route: Route) -> None:
        url = route.request.url
        if not url.startswith("http://backend:8000/"):
            await route.abort("blockedbyclient")
        else:
            await route.continue_()
    await page.route("**/*", handle_route)
```

This is a defence-in-depth layer: even if a bug causes the renderer to receive a crafted URL, the interception handler prevents navigation to any external or internal host outside `backend:8000`.

**Input sanitisation before reaching the renderer:**

```python
import bleach

ALLOWED_TAGS = []   # No HTML allowed in user-supplied report fields
ALLOWED_ATTRS = {}

def sanitise_report_field(value: str) -> str:
    """Strip all HTML from user-supplied strings before renderer interpolation."""
    return bleach.clean(value, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
```

**Report template:** The renderer loads a report template from the local filesystem (bundled in the container image). It does not fetch templates from URLs or the database. User-supplied content is inserted via a strict templating engine (Jinja2 with `autoescape=True`).

**Timeouts:** Report generation has a hard 30-second timeout; Playwright's `page.goto()` timeout is set to 10 seconds. If a timeout is exceeded, the job fails with a clear error — the renderer does not hang indefinitely.

**No `dangerouslySetInnerHTML`:** The report React template must never use `dangerouslySetInnerHTML`. All text is inserted via `{value}` (React's built-in escaping).

---
### 7.12 Compute Resource Governance

| Limit | Value | Enforcement |
|-------|-------|-------------|
| `mc_samples` maximum | 1000 | Pydantic validator at API layer; **also re-validated inside the Celery task body** (Finding 3) |
| Concurrent simulations per user | 3 | Checked against `simulations` table before job acceptance; returns 429 if exceeded |
| Pending jobs per user | 10 | Checked at submission time |
| Decay prediction CPU time limit | 300 s | Celery `time_limit=300, soft_time_limit=270` |
| Breakup simulation CPU time limit | 600 s | Celery `time_limit=600, soft_time_limit=570` |
| Ephemeris response points maximum | 100,000 | Enforced by calculating `(end - start) / step`; returns 400 if exceeded, with a message to reduce the range or increase the step |
| CZML document size | 50 MB | Streaming response with max size enforced; client must paginate for larger ranges |
| WebSocket connections per user | 5 | Redis counter checked at upgrade time |
| Simulation workers | Separate Celery worker pool from ingest workers | Prevents runaway simulations from starving TLE/space-weather ingestion |

**Celery task-layer validation (Finding 3):** Celery tasks are callable directly via a Redis write (e.g., by a compromised worker), bypassing the API layer entirely. Every task function must validate its own arguments independently of the API endpoint:

```python
from functools import wraps

from pydantic import ValidationError

def validate_task_args(validator_class):
    """Decorator: re-validate task kwargs using the same Pydantic model as the API endpoint."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                validator_class(**kwargs)
            except ValidationError as exc:
                raise ValueError(f"Task arg validation failed: {exc}") from exc
            return func(*args, **kwargs)
        return wrapper
    return decorator

@app.task(bind=True)
@validate_task_args(DecayPredictParams)
def run_mc_decay_prediction(self, *, norad_id: int, f107: float, ap: float, mc_samples: int, ...):
    ...
```

A `ValueError` raised inside a Celery task is treated as a non-retryable failure — the task goes to the dead-letter queue and does not silently drop. This applies to all simulation and prediction tasks. Document in `AGENTS.md`: "Task functions are a security boundary. Validate all task arguments inside the task body."

**Orphaned job recovery (Celery Beat task):** A Celery worker killed mid-execution (OOM, pod eviction, container restart) leaves its job in `status = 'running'` indefinitely unless a cleanup task intervenes. Add a Celery Beat periodic task that runs every 5 minutes:

```python
from datetime import datetime, timezone

from sqlalchemy import func, text

@app.task
def recover_orphaned_jobs():
    """Mark jobs stuck in 'running' beyond 2× their estimated duration as failed."""
    orphans = (
        db.query(Job)
        .filter(
            Job.status == "running",
            Job.started_at < func.now() - (
                func.coalesce(Job.estimated_duration_seconds, 600) * 2
            ) * text("interval '1 second'"),
        )
        .all()
    )
    for job in orphans:
        job.status = "failed"
        job.error_code = "PRESUMED_DEAD"
        job.error_message = "Worker did not complete within 2× estimated duration"
        job.completed_at = datetime.now(timezone.utc)
    db.commit()
```

Integration test (`tests/test_jobs/test_celery_failure.py`): set a job to `status='running'` with `started_at = NOW() - 1200s` and `estimated_duration_seconds = 300`; run the Beat task; assert `status = 'failed'` and `error_code = 'PRESUMED_DEAD'`.
---
### 7.13 Supply Chain and Dependency Security
|
||
|
||
**Python dependency pinning:**
|
||
|
||
All dependencies pinned with exact versions and hashes using `pip-tools`:
|
||
```
|
||
# requirements.in → pip-compile → requirements.txt with hashes
|
||
fastapi==0.111.0 --hash=sha256:...
|
||
```
|
||
|
||
Install with `pip install --require-hashes -r requirements.txt` in all Docker builds.
|
||
|
||
**Node.js:** `package-lock.json` committed and `npm ci` used in Docker builds (not `npm install`).
|
||
|
||
**Base images:** All `FROM` statements use pinned digest tags:

```dockerfile
FROM python:3.12.3-slim@sha256:abc123...
```

Never `FROM python:3.12-slim` (floating tag).

**PyPI index trust policy — dependency confusion protection:**

All Python packages must be fetched from a controlled index, not directly from public PyPI without restrictions. Configure `pip.conf` mounted into all build containers:

```ini
# pip.conf (mounted at /etc/pip.conf in builder stage)
[global]
index-url = https://pypi.internal.spacecom.io/simple/
# Proxy mode: passes through to PyPI but logs and scans before serving
# extra-index-url is intentionally absent — no fallback to raw public PyPI
```

For Phase 1 (no internal proxy available): register all `spacecom-*` package names on public PyPI as empty stubs to prevent dependency confusion squatting. Document in `docs/adr/0019-pypi-index-trust.md`.

**Automated scanning (CI pipeline):**

| Tool | Target | Trigger | Notes |
|------|--------|---------|-------|
| **`pip-audit`** | Python dependencies | Every PR; blocks on High/Critical | Queries Python Advisory Database (PyPADB); lower false-positive rate than OWASP DC for Python |
| **`npm audit`** | Node.js dependencies | Every PR; blocks on High/Critical | `--audit-level=high`; run after `npm ci` |
| **Trivy** | Container images | Every PR; blocks on Critical/High | `.trivyignore` applied (see below); JSON output archived |
| **Bandit** | Python source code | Every PR; blocks on High severity | |
| **ESLint security plugin** | TypeScript source | Every PR | |
| **`pip-licenses`** | Python transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted by name with documented commercial licence |
| **`license-checker-rseidelsohn`** | npm transitive deps | Every PR; blocks on GPL/AGPL | CesiumJS exempted; other AGPL packages require approval |
| **Renovate Bot** | Docker image digests + all deps | Weekly PRs; digest PRs auto-merged if CI passes | Replaces Dependabot for Docker digest pins; Dependabot retained for GitHub Security Advisory integration |
| **`git-secrets` + `detect-secrets`** | All commits | Pre-commit; blocks commit on secret patterns | `detect-secrets` is canonical (entropy + regex); `git-secrets` retained for pattern matching |
| **`cosign verify`** | Container images at deploy | Every staging/production deploy | Verifies Sigstore keyless signature before pulling |

**OWASP Dependency-Check** is removed from the Python scanning stack — it has high false-positive rates due to CPE name mapping issues for Python packages and is superseded by `pip-audit`. It may be retained for future Java/Kotlin components.

**Trivy configuration — `.trivyignore`:**

```ini
# .trivyignore
# Each entry requires: CVE ID, expiry date (90-day max), and documented justification.
# Process: PR required with senior engineer approval. Expired entries fail CI.
# Format: CVE-YYYY-NNNNN expires:YYYY-MM-DD reason:<one-line justification>
#
# Example (do not add without process):
# CVE-2024-12345 expires:2024-12-31 reason:builder-stage only; not present in runtime image
```

CI check rejects entries past their expiry date:

```bash
python scripts/check_trivyignore_expiry.py .trivyignore || \
  (echo "ERROR: .trivyignore contains expired entry — review or remove" && exit 1)
```

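A minimal sketch of the parsing core of `scripts/check_trivyignore_expiry.py`, following the entry grammar in the comment block above. The `find_expired` helper name and the decision to also fail CI on malformed (unparseable) lines are assumptions:

```python
import re
from datetime import date

# Matches: CVE-YYYY-NNNNN expires:YYYY-MM-DD reason:<one-line justification>
ENTRY_RE = re.compile(
    r"^(?P<cve>CVE-\d{4}-\d{4,})\s+expires:(?P<expiry>\d{4}-\d{2}-\d{2})\s+reason:(?P<reason>\S.*)$"
)


def find_expired(lines, today=None):
    """Return (entry, expiry) pairs that should fail CI: expired or malformed."""
    today = today or date.today()
    failures = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blanks are always allowed
        m = ENTRY_RE.match(line)
        if m is None:
            failures.append((line, "MALFORMED"))  # unparseable entries also fail
            continue
        if date.fromisoformat(m.group("expiry")) < today:
            failures.append((m.group("cve"), m.group("expiry")))
    return failures
```

The script wrapper would read the file, print each failure, and `sys.exit(1)` when `find_expired` returns a non-empty list, matching the bash gate above.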
**License scanning CI steps:**

```yaml
# security-scan job
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Fail on GPL/AGPL (CesiumJS has commercial licence; excluded by name in npm step)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3)"

- name: npm licence gate
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL"

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
  with:
    name: licences-${{ github.sha }}
    path: "*.json"
    retention-days: 365
```

**Base image digest updates — Renovate configuration:**

Dependabot does not update `@sha256:` digest pins in Dockerfiles. Renovate's `docker-digest` manager handles this:

```json
// renovate.json
{
  "extends": ["config:base"],
  "packageRules": [
    {
      "matchDatasources": ["docker"],
      "matchUpdateTypes": ["digest"],
      "automerge": true,
      "automergeType": "pr",
      "schedule": ["every weekend"],
      "commitMessageSuffix": "(base image digest update)"
    },
    {
      "matchDatasources": ["pypi"],
      "automerge": false
    }
  ],
  "github-actions": {
    "enabled": true,
    "pinDigests": true
  }
}
```

Digest-only updates auto-merge on passing CI. Version bumps (e.g., `python:3.12` → `python:3.13`) require manual PR review. Renovate is added alongside Dependabot; Dependabot retains GitHub Security Advisory integration for Python/Node CVE PRs.

---

### 7.14 Audit and Security Logging

**Security event categories** (stored in `security_logs` table and shipped to SIEM):

| Event | Level | Retention |
|-------|-------|-----------|
| Successful login | INFO | 90 days |
| Failed login (IP + user) | WARNING | 180 days |
| MFA failure | WARNING | 180 days |
| Account lockout | HIGH | 180 days |
| Token refresh | INFO | 30 days |
| Authorisation failure (403) | WARNING | 180 days |
| Admin action (user create/delete/role change) | HIGH | 1 year |
| Prediction HMAC failure | CRITICAL | 2 years |
| Alert storm detection | CRITICAL | 2 years |
| IERS EOP hash mismatch | HIGH | 1 year |
| Report generated | INFO | 1 year |
| Ingest source error | WARNING | 90 days |

**Security event human-alerting matrix (Finding 7):** A Grafana dashboard no one is watching provides no protection during an active attack. The following events must trigger an immediate out-of-band alert to a human (PagerDuty, email, or Slack) — not only log to the database:

| Event type | Severity | Alert channel | Response SLA |
|---|---|---|---|
| `HMAC_VERIFICATION_FAILURE` | CRITICAL | PagerDuty + admin email | Immediate |
| `REFRESH_TOKEN_REUSE` | HIGH | Email to affected user + admin email | < 5 min |
| `ROLE_CHANGE_APPROVED` / `ROLE_CHANGE_EXPIRED` | HIGH | Admin email summary | < 15 min |
| `REGISTRATION_BLOCKED_SANCTIONS` | HIGH | Admin email | < 15 min |
| `RBAC_VIOLATION` ≥ 10 events in 5 min (same `user_id`) | HIGH | PagerDuty | Immediate |
| `INGEST_VALIDATION_FAILURE` ≥ 5 events in 1 hour (same source) | MEDIUM | Admin email | < 1 hour |
| Space-Track ingest gap > 4 hours | CRITICAL | PagerDuty (cross-ref §31) | Immediate |
| Any `level = CRITICAL` security event | CRITICAL | PagerDuty + SIEM | Immediate |

Implemented as AlertManager rules (Prometheus `security_event_total` counter with `event_type` label) and/or direct webhook dispatch from the `security_logs` insert trigger. Rules defined in `monitoring/alertmanager/security-rules.yml`.

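As a sketch, the first rows of the matrix might translate into Prometheus alerting rules like the following — rule names, thresholds, and labels are illustrative; the authoritative rules live in `monitoring/alertmanager/security-rules.yml`:

```yaml
# Sketch: Prometheus rule group consumed by Alertmanager for routing.
groups:
  - name: security-events
    rules:
      - alert: HMACVerificationFailure
        expr: increase(security_event_total{event_type="HMAC_VERIFICATION_FAILURE"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Prediction HMAC verification failed — possible data integrity attack"
      - alert: RBACViolationBurst
        expr: sum by (user_id) (increase(security_event_total{event_type="RBAC_VIOLATION"}[5m])) >= 10
        labels:
          severity: high
        annotations:
          summary: "10 or more RBAC violations in 5 minutes for a single user"
```

Routing to PagerDuty vs. admin email per the SLA column is then an Alertmanager `route`/`receiver` concern keyed on the `severity` label.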
**Space-Track credential rotation — ingest gap specification (Finding 8):** Space-Track supports only one active credential set; rotation is a hard cut with no parallel-credential window. The rotation runbook at `docs/runbooks/space-track-credential-rotation.md` must include: (a) record last successful ingest time before starting; (b) update Docker secret and restart `ingest_worker`; (c) verify ingest succeeds within 10 minutes of restart (`GET /admin/ingest-status` shows `last_success_at` for Space-Track source); (d) if ingest does not resume within 10 minutes, roll back to previous credentials and raise a CRITICAL alert. The existing 4-hour ingest failure CRITICAL alert (§31) is the backstop — this runbook step reduces mean time to detect to 10 minutes.

**Structured log format** — all services emit JSON via `structlog`. Every log record must include these fields:

```python
# backend/app/logging_config.py
REQUIRED_LOG_FIELDS = {
    "timestamp": "ISO-8601 UTC",
    "level": "DEBUG|INFO|WARNING|ERROR|CRITICAL",
    "service": "backend|worker|ingest|renderer",
    "logger": "module.path",
    "message": "human-readable summary",
    "request_id": "UUID | null — set for HTTP requests; propagated into Celery tasks",
    "job_id": "UUID | null — Celery job_id when inside a task",
    "user_id": "integer | null",
    "organisation_id": "integer | null",
    "duration_ms": "integer | null — HTTP response time",
    "status_code": "integer | null — HTTP responses only",
}
```

**Log sanitisation:** The sanitising processor runs as the final `structlog` processor in the chain, stripping known sensitive patterns (JWT token substrings, Space-Track password patterns, database DSNs with embedded credentials) before the record is written. Docker log driver: `json-file` with `max-size=100m, max-file=5` for Tier 1; forwarded to Loki via Promtail in Tier 2+.

**Log integrity:** Logs are shipped in real time to an external destination (Loki in Tier 2; an S3/MinIO append-only bucket or SIEM for long-term safety record retention). Logs stored only on the container filesystem are considered volatile and untrusted for security purposes.

**Request ID correlation middleware** — every HTTP request generates a `request_id` that propagates through logs, Celery tasks, and Prometheus exemplars so an on-call engineer can jump from a metric spike to the causative log line with one click:

```python
# backend/app/middleware.py
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware


class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        structlog.contextvars.bind_contextvars(request_id=request_id)
        try:
            response = await call_next(request)
        finally:
            # Clear even when the handler raises, so the next request on this
            # worker does not inherit a stale request_id
            structlog.contextvars.clear_contextvars()
        response.headers["X-Request-ID"] = request_id
        return response
```

When submitting a Celery task, include `request_id` in task kwargs and bind it in the task preamble:

```python
structlog.contextvars.bind_contextvars(request_id=kwargs.get("request_id"), job_id=str(self.request.id))
```

This links every log line from the HTTP layer through to the Celery task execution. The `request_id` equals the OpenTelemetry `trace_id` when OTel is enabled (Phase 2), giving a single correlation key across logs and traces.

**`security_logs` table:**

```sql
CREATE TABLE security_logs (
    id BIGSERIAL PRIMARY KEY,
    logged_at TIMESTAMPTZ DEFAULT NOW(),
    level TEXT NOT NULL,
    event_type TEXT NOT NULL,
    user_id INTEGER,
    organisation_id INTEGER,
    source_ip INET,
    user_agent TEXT,
    resource TEXT,
    detail JSONB,
    -- Prevent tampering
    record_hash TEXT -- SHA-256 of (logged_at || level || event_type || detail)
);
-- Append-only trigger (same pattern as alert_events)
```

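The `record_hash` computation is best kept in one shared helper so the insert path and any later audit-verification job agree on the canonical form. A minimal sketch — the `|` separator and the argument order are assumptions, not a spec:

```python
import hashlib


def security_log_record_hash(logged_at: str, level: str, event_type: str,
                             detail_json: str) -> str:
    """SHA-256 over the canonical concatenation backing the record_hash column.

    Serialise detail deterministically (e.g. sorted-key compact JSON) before
    calling, otherwise re-verification of old rows will produce spurious
    mismatches.
    """
    canonical = "|".join([logged_at, level, event_type, detail_json])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A nightly audit job can then recompute the hash for a sample of rows and raise a HIGH security event on any mismatch.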
---

### 7.15 Security SDLC — Embedded, Not Bolted On

Security activities are integrated into every sprint from Week 1, not deferred to a Phase 3 audit.

**Week 1 (mandatory before any other code):**
- [ ] RBAC schema implemented; `require_role` dependency applied to all router groups
- [ ] JWT RS256 + httpOnly cookies implemented; `HS256` never used
- [ ] MFA (TOTP) implemented and required for all roles
- [ ] CSP and security headers applied to frontend and backend
- [ ] Docker network segmentation and container hardening applied to all services
- [ ] Redis AUTH and ACL configured
- [ ] MinIO: all buckets private; pre-signed URLs only
- [ ] Dependency pinning (`pip-compile`) and Dependabot configured
- [ ] `git-secrets` pre-commit hook installed in repo
- [ ] Bandit and ESLint security plugin in CI; blocks merge on High severity
- [ ] Trivy container scanning in CI; blocks merge on Critical/High
- [ ] `security_logs` table and log sanitisation formatter implemented
- [ ] Append-only DB triggers on `alert_events`

**Phase 1 (ongoing):**
- [ ] HMAC signing implemented for `reentry_predictions` before decay predictor ships (Week 9)
- [ ] Immutability triggers on `reentry_predictions` and `hazard_zones`
- [ ] Cross-source TLE and space weather validation implemented with ingest module (Week 3–6)
- [ ] IERS EOP hash verification implemented (Week 1)
- [ ] Rate limiting (slowapi) configured for all endpoint groups (Week 2)
- [ ] Simulation parameter range validation (Week 9, with decay predictor)

**Phase 2:**
- [ ] OWASP ZAP DAST scan run against staging environment in the Phase 2 CI pipeline
- [ ] Threat model document (STRIDE) reviewed and updated for Phase 2 attack surface
- [ ] Playwright renderer: isolated container, sanitised input, timeouts, seccomp profile, **Playwright request interception allowlist** (Week 19–20, when reports ship)
- [ ] NOTAM draft content sanitisation: `sanitise_icao()` function in `reentry/notam.py` applied to all user-sourced fields before NOTAM template interpolation; unit test: object name containing `"><script>alert(1)</script>` produces a sanitised NOTAM draft and does not raise (Week 17–18, with NOTAM drafting feature)
- [ ] Shadow mode RLS integration test: query `reentry_predictions` as `viewer` role with no WHERE clause; assert zero shadow rows returned
- [ ] Refresh token family reuse detection integration test: simulate attacker consuming a rotated token; assert entire family revoked + `REFRESH_TOKEN_REUSE` logged
- [ ] RLS policies reviewed and integration-tested for multi-tenancy boundary

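A minimal sketch of the `sanitise_icao()` shape implied by the Phase 2 checklist — the permitted character set here is an assumption and must be aligned with the ICAO Annex 15 field rules before shipping:

```python
import re

# Assumed allowlist: upper-case letters, digits, space, and a small set of
# punctuation seen in NOTAM free text. Everything else is stripped.
_ICAO_SAFE = re.compile(r"[^A-Z0-9 ./()\-]")


def sanitise_icao(text: str) -> str:
    """Reduce a user-sourced field to characters safe for NOTAM template
    interpolation. Allowlisting (rather than escaping) keeps the output
    valid for downstream AFTN-style transmission."""
    return _ICAO_SAFE.sub("", text.upper()).strip()
```

The checklist's XSS probe then degrades to inert text, and benign object names pass through unchanged.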
**Phase 3:**
- [ ] External penetration test by a qualified third party — scope must include: API auth bypass, privilege escalation, SSRF via ingest, XSS → Playwright escalation, WebSocket auth bypass, data integrity attacks on predictions, Redis/MinIO lateral movement
- [ ] All Critical and High penetration test findings remediated before production go-live
- [ ] SOC 2 Type I readiness review (if required by customer contracts)
- [ ] **Acceptance Test Procedure (ATP) defined and run (Finding 10):** `docs/bid/acceptance-test-procedure.md` exists with test script structured as: test ID, requirement reference, preconditions, steps, expected result, pass/fail criteria. ATP is runnable by a non-SpaceCom operator (evaluator) using documented environment setup. ATP covers: physics accuracy (§17 validation), NOTAM format (Q-line regex test), alert delivery latency (synthetic TIP → measure delivery time), HMAC integrity (tampered record → 503), multi-tenancy boundary (Org A cannot access Org B data). ATP seed data committed at `docs/bid/atp-seed-data/`. ATP successfully run by an independent evaluator on the staging environment before any institutional procurement submission.
- [ ] **Competitive differentiation review completed:** `docs/competitive-analysis.md` updated; any competitor capability that closed a differentiation gap has been assessed and a product response documented
- [ ] Security runbook: incident response procedure for each CRITICAL threat scenario

---

### 7.16 Aviation Safety Integrity — Operational Scenarios

**Scenario 1 — False all-clear attack:**

An attacker who modifies `reentry_predictions` records to suppress a genuine hazard corridor could cause an airspace manager to conclude a FIR is safe when it is not.

Mitigations layered in depth:
1. HMAC signing on every prediction record (§7.9) — modification is immediately detected
2. Immutability DB trigger (§7.9) — modifications fail at the database layer
3. TIP message cross-check: a prediction showing no hazard for an object with an active TIP message triggers a CRITICAL integrity alert regardless of the prediction's content
4. The UI displays HMAC status on every prediction — `✗ verification failed` is immediately visible to the operator

**Scenario 2 — Alert storm attack:**

An attacker flooding the alert system with false CRITICALs induces alert fatigue; operators disable alerts; a genuine event is missed.

Mitigations:
1. Alert generation runs only from backend business logic on verified, HMAC-checked data — not from direct API calls
2. Rate limiting on CRITICAL alert generation per object per window (§6.6)
3. Alert storm detection: > 5 CRITICALs in 1 hour triggers a meta-alert to admins
4. Geographic filtering means alert volume per operator is naturally bounded to their region

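The storm-detection threshold in mitigation 3 can be sketched as a sliding-window counter. Class name and API are illustrative; the more-than-5-in-1-hour parameters follow the text:

```python
from collections import deque


class AlertStormDetector:
    """Sliding-window storm detector: raises a meta-alert signal when more
    than max_events CRITICAL alerts land inside window_seconds."""

    def __init__(self, max_events: int = 5, window_seconds: int = 3600):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: deque[float] = deque()

    def record_critical(self, ts: float) -> bool:
        """Record one CRITICAL alert at Unix time ts; True means storm detected."""
        self._events.append(ts)
        # Drop events that have aged out of the window
        while self._events and self._events[0] <= ts - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_events
```

In production this state would live in Redis (per-object or global key) so all workers share the same window, but the boundary behaviour is identical.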
---

## 8. Functional Modules

Each module is a Python package under `backend/modules/` with its own router, schemas, service layer, and (where applicable) Celery tasks. Modules communicate via internal function calls and the shared database — not HTTP between modules.

### Phase 1 Modules

| Module | Package | Purpose |
|--------|---------|---------|
| **Catalog** | `modules.catalog` | CRUD for space objects: NORAD ID, TLE sets, physical properties (from ESA DISCOS), B* drag term, radar cross-section. Source of truth for all tracked objects. |
| **Catalog Propagator** | `modules.propagator.catalog` | SGP4/SDP4 for general catalog tracking. Outputs GCRF state vectors and geodetic coordinates. Feeds the globe display. **Not used for decay prediction.** |
| **Decay Predictor** | `modules.propagator.decay` | Numerical integrator (RK7(8) adaptive step) with NRLMSISE-00 atmospheric density model, J2–J6 geopotential, and solar radiation pressure. Used for all re-entry window estimation. Monte Carlo uncertainty (vary F10.7 ±20%, Ap, B* ±10%). All outputs HMAC-signed on creation. Shadow mode flag propagated to all output records. |
| **Reentry** | `modules.reentry` | Phase 1 scope: re-entry window prediction (time ± uncertainty) and ground track corridor (percentile swaths). Phase 2 expands to full breakup/survivability. |
| **Space Weather** | `modules.spaceweather` | Ingests NOAA SWPC: F10.7, Ap/Kp, Dst, solar wind. Cross-validates against ESA Space Weather Service. Generates `operational_status` string. Drives Decay Predictor density models. |
| **Visualisation** | `modules.viz` | Generates CZML documents from ephemeris (J2000 Cartesian — explicit TEME→J2000 conversion), hazard zones, and debris corridors. Pre-bakes MC trajectory binary blobs for Mode C. All object name/description fields HTML-escaped before CZML output. |
| **Ingest** | `modules.ingest` | Background workers: Space-Track.org TLE polling, CelesTrak TLE polling, TIP message ingestion, ESA DISCOS physical property import, NOAA SWPC space weather polling, IERS EOP refresh. All external URLs are hardcoded constants; SSRF mitigation enforced at HTTP client layer. |
| **Public API** | `modules.api` | Versioned REST API (`/api/v1/`) as a first-class product for programmatic access by Persona E/F. Includes API key management (generation, rotation, revocation, usage tracking), CCSDS-format export endpoints, bulk ephemeris endpoints, and rate limiting per API key. API keys are separate credentials from the web session JWT and managed independently. |

### Phase 2 Modules

| Module | Package | Purpose |
|--------|---------|---------|
| **Atmospheric Breakup** | `modules.breakup` | ORSAT-like atmospheric re-entry breakup: aerothermal loading → structural failure → fragment generation → ballistic descent → ground impact with kinetic energy and casualty area. Produces fragment descriptors and uncertainty bounds for the sub-/trans-sonic descent layer. |
| **Conjunction** | `modules.conjunction` | All-vs-all conjunction screening: apogee/perigee filter → TCA refinement → collision probability (Alfano/Foster). Feeds `conjunctions` table. |
| **Upper Atmosphere** | `modules.weather.upper` | NRLMSISE-00 / JB2008 density model driven by space weather inputs. 80–600 km profiles for Decay Predictor and Atmospheric Breakup. |
| **Lower Atmosphere** | `modules.weather.lower` | GFS/ECMWF tropospheric wind and density profiles for 0–80 km terminal descent, including wind-sensitive dispersion inputs for fragment clouds after main breakup. |
| **Hazard** | `modules.hazard` | Fuses Decay Predictor + Atmospheric Breakup + atmosphere modules into hazard zones with uncertainty bounds. All output records HMAC-signed and immutable. Shadow mode flag preserved on all hazard zone records. |
| **Airspace** | `modules.airspace` | FIR/UIR boundaries, controlled airspace, routes. PostGIS hazard-airspace intersection. |
| **Air Risk** | `modules.air_risk` | Combines hazard outputs with air traffic density / ADS-B state, aircraft class assumptions, and vulnerability bands to generate time-sliced exposure scores and operator-facing air-risk products. Supports conservative-baseline comparison against blunt closure areas. |
| **On-Orbit Fragmentation** | `modules.fragmentation` | NASA Standard Breakup Model for on-orbit collision/explosion fragmentation. Separate from atmospheric breakup — different physics. |
| **Space Operator Portal** | `modules.space_portal` | The second front door. Owned object management (`owned_objects` table); object-scoped prediction views; CCSDS export; API key portal; controlled re-entry planner interface. Enforces `space_operator` RBAC object-ownership scoping. |
| **Controlled Re-entry Planner** | `modules.reentry.controlled` | For objects with remaining manoeuvre capability: given a delta-V budget and avoidance constraints (FIR exclusions, land avoidance, population density weighting), generates ranked candidate deorbit windows with corridor risk scores. Outputs suitable for national space law regulatory submissions and ESA Zero Debris Charter evidence. |
| **NOTAM Drafting** | `modules.notam` | Generates ICAO Annex 15 format NOTAM drafts from hazard corridor outputs. Produces cancellation drafts on event close. Stores all drafts in `notam_drafts` table. Displays mandatory regulatory disclaimer. Never submits NOTAMs — draft production only. |

### Phase 3 Modules

| Module | Package | Purpose |
|--------|---------|---------|
| **Reroute** | `modules.reroute` | Strategic pre-flight route intersection analysis only. Given a filed route, identifies which segments intersect the hazard corridor and outputs the geographic avoidance boundary. Does not generate specific alternate routes — avoidance boundary only, to keep SpaceCom in a purely informational role. |
| **Feedback** | `modules.feedback` | Prediction vs. observed outcome comparison. Atmospheric density scaling recalibration from historical re-entries. Maneuver detection (TLE-to-TLE ΔV estimation). Shadow validation reporting for ANSP regulatory adoption evidence. |
| **Alerts** | `modules.alerts` | WebSocket push + email notifications. Enforces alert rate limits and deduplication server-side. Stores all events in append-only `alert_events`. Shadow mode: all alerts suppressed to INFORMATIONAL; no external delivery. |
| **Launch Safety** | `modules.launch_safety` | Screen proposed launch trajectories against the live catalog for conjunction risk during ascent and parking orbit phases. Natural extension of the conjunction module. Serves launch operators as a third customer segment. |

---

## 9. Data Model Evolution

### 9.1 Retain and Expand from Existing Schema

#### `objects` table

```sql
ALTER TABLE objects
    ADD COLUMN IF NOT EXISTS bstar DOUBLE PRECISION, -- SGP4 drag parameter (1/Earth-radii)
    ADD COLUMN IF NOT EXISTS cd_a_over_m DOUBLE PRECISION, -- C_D * A / m (m²/kg); physical model
    ADD COLUMN IF NOT EXISTS rcs_m2 DOUBLE PRECISION, -- Radar cross-section from Space-Track
    ADD COLUMN IF NOT EXISTS rcs_size_class TEXT, -- SMALL | MEDIUM | LARGE
    ADD COLUMN IF NOT EXISTS mass_kg DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cross_section_m2 DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS material TEXT,
    ADD COLUMN IF NOT EXISTS shape TEXT,
    ADD COLUMN IF NOT EXISTS data_confidence TEXT DEFAULT 'unknown', -- 'discos' | 'estimated' | 'unknown'
    ADD COLUMN IF NOT EXISTS object_type TEXT, -- PAYLOAD | ROCKET BODY | DEBRIS | UNKNOWN
    ADD COLUMN IF NOT EXISTS launch_date DATE,
    ADD COLUMN IF NOT EXISTS launch_site TEXT,
    ADD COLUMN IF NOT EXISTS decay_date DATE,
    ADD COLUMN IF NOT EXISTS organisation_id INTEGER REFERENCES organisations(id), -- multi-tenancy
    -- Physics model parameters (Finding 3, 5, 7)
    ADD COLUMN IF NOT EXISTS attitude_known BOOLEAN DEFAULT FALSE, -- FALSE = tumbling; affects A uncertainty sampling
    ADD COLUMN IF NOT EXISTS material_class TEXT, -- 'aluminium'|'stainless_steel'|'titanium'|'carbon_composite'|'unknown'
    ADD COLUMN IF NOT EXISTS cd_override DOUBLE PRECISION, -- operator-provided C_D override (space_operator only)
    ADD COLUMN IF NOT EXISTS bstar_override DOUBLE PRECISION, -- operator-provided B* override (space_operator only)
    ADD COLUMN IF NOT EXISTS cr_coefficient DOUBLE PRECISION DEFAULT 1.3; -- radiation pressure coefficient; 1.3 = standard non-cooperative
```

#### `orbits` table — full state vectors

```sql
ALTER TABLE orbits
    ADD COLUMN IF NOT EXISTS reference_frame TEXT DEFAULT 'GCRF',
    ADD COLUMN IF NOT EXISTS pos_x_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS pos_y_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS pos_z_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS vel_x_kms DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS vel_y_kms DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS vel_z_kms DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS lat_deg DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS lon_deg DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS alt_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS speed_kms DOUBLE PRECISION,
    -- RTN position covariance (upper triangle of 3×3)
    ADD COLUMN IF NOT EXISTS cov_rr DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cov_rt DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cov_rn DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cov_tt DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cov_tn DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS cov_nn DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS propagator TEXT DEFAULT 'sgp4',
    ADD COLUMN IF NOT EXISTS tle_epoch TIMESTAMPTZ;
```

#### `conjunctions` table

```sql
ALTER TABLE conjunctions
    ADD COLUMN IF NOT EXISTS collision_probability DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS probability_method TEXT,
    ADD COLUMN IF NOT EXISTS combined_radial_sigma_m DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS combined_transverse_sigma_m DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS combined_normal_sigma_m DOUBLE PRECISION;
```

#### `reentry_predictions` table

```sql
ALTER TABLE reentry_predictions
    ADD COLUMN IF NOT EXISTS confidence_level DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS propagator TEXT,
    ADD COLUMN IF NOT EXISTS f107_assumed DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS ap_assumed DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS monte_carlo_n INTEGER,
    ADD COLUMN IF NOT EXISTS ground_track_corridor GEOGRAPHY(POLYGON), -- GEOGRAPHY: global corridors may cross antimeridian
    ADD COLUMN IF NOT EXISTS reentry_window_open TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS reentry_window_close TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS nominal_reentry_point GEOGRAPHY(POINT), -- GEOGRAPHY: global point
    ADD COLUMN IF NOT EXISTS nominal_reentry_alt_km DOUBLE PRECISION DEFAULT 80.0,
    ADD COLUMN IF NOT EXISTS p01_reentry_time TIMESTAMPTZ, -- 1st percentile — extreme early case; displayed as tail risk annotation (F10)
    ADD COLUMN IF NOT EXISTS p05_reentry_time TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS p50_reentry_time TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS p95_reentry_time TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS p99_reentry_time TIMESTAMPTZ, -- 99th percentile — extreme late case; displayed as tail risk annotation (F10)
    ADD COLUMN IF NOT EXISTS sigma_along_track_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS sigma_cross_track_km DOUBLE PRECISION,
    ADD COLUMN IF NOT EXISTS organisation_id INTEGER REFERENCES organisations(id),
    ADD COLUMN IF NOT EXISTS record_hmac TEXT NOT NULL, -- HMAC-SHA256 of canonical field set
    ADD COLUMN IF NOT EXISTS integrity_failed BOOLEAN DEFAULT FALSE,
    ADD COLUMN IF NOT EXISTS superseded_by INTEGER REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- write-once; RESTRICT prevents deleting a prediction that supersedes another (F10 — §67)
    ADD COLUMN IF NOT EXISTS ood_flag BOOLEAN DEFAULT FALSE, -- TRUE if any input parameter falls outside the model's validated operating envelope
    ADD COLUMN IF NOT EXISTS ood_reason TEXT, -- comma-separated list of which parameters triggered OOD (e.g. "high_am_ratio,low_data_confidence")
    ADD COLUMN IF NOT EXISTS prediction_valid_until TIMESTAMPTZ, -- computed at creation: p50_reentry_time - 4h; UI warns if NOW() > this and prediction is not superseded
    ADD COLUMN IF NOT EXISTS model_version TEXT NOT NULL, -- semantic version of decay predictor used; must match current deployed version or trigger re-run prompt
    -- Multi-source conflict detection (Finding 10)
    ADD COLUMN IF NOT EXISTS prediction_conflict BOOLEAN DEFAULT FALSE, -- TRUE if SpaceCom window does not overlap TIP or ESA window
    ADD COLUMN IF NOT EXISTS conflict_sources TEXT[], -- e.g. ['space_track_tip', 'esa_esac']
    ADD COLUMN IF NOT EXISTS conflict_union_p10 TIMESTAMPTZ, -- union of all non-overlapping windows: earliest bound
    ADD COLUMN IF NOT EXISTS conflict_union_p90 TIMESTAMPTZ; -- union of all non-overlapping windows: latest bound
```

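For illustration, the `record_hmac` column could be computed and checked with a helper pair like this — the canonical serialisation used here (sorted-key compact JSON over all fields except `record_hmac`) is an assumption; the authoritative field set and key handling are defined in §7.9:

```python
import hashlib
import hmac
import json


def sign_prediction(record: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON serialisation of the signed fields.

    Sorted keys + compact separators make the serialisation deterministic,
    so any later verifier reproduces the same byte string.
    """
    signed_fields = {k: record[k] for k in sorted(record) if k != "record_hmac"}
    canonical = json.dumps(signed_fields, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_prediction(record: dict, key: bytes) -> bool:
    """Constant-time comparison against the stored record_hmac."""
    return hmac.compare_digest(record.get("record_hmac", ""),
                               sign_prediction(record, key))
```

A verification failure would set `integrity_failed = TRUE` and emit the `HMAC_VERIFICATION_FAILURE` CRITICAL event from §7.14.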
`superseded_by` is write-once after creation: it can be set once by an `analyst` or above, but never changed once set. A DB constraint enforces this (trigger that raises if `superseded_by` is being changed from a non-NULL value). The UI displays a `⚠ Superseded — see [newer run]` banner on any prediction where `superseded_by IS NOT NULL`. This preserves the immutability guarantee (old records are never deleted) while giving analysts a mechanism to communicate "this is not the current operational view."

The same `superseded_by` pattern applies to the `simulations` table (self-referential FK).

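A sketch of the write-once guard described above — the trigger and function names are our own choosing; the behaviour (allow NULL → value once, reject any later change) follows the text:

```sql
-- Sketch: write-once guard for reentry_predictions.superseded_by.
CREATE OR REPLACE FUNCTION enforce_superseded_by_write_once()
RETURNS TRIGGER AS $$
BEGIN
    IF OLD.superseded_by IS NOT NULL
       AND NEW.superseded_by IS DISTINCT FROM OLD.superseded_by THEN
        RAISE EXCEPTION 'superseded_by is write-once (reentry_predictions.id=%)', OLD.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_superseded_by_write_once
    BEFORE UPDATE ON reentry_predictions
    FOR EACH ROW EXECUTE FUNCTION enforce_superseded_by_write_once();
```

The same function can back an identical trigger on `simulations`, since the column semantics are shared.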
**Immutability trigger** (see §7.9) applied to this table in the initial migration.

### 9.2 New Tables

```sql
-- Organisations (for multi-tenancy)
CREATE TABLE organisations (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- Commercial tier (Finding 3, 5)
    subscription_tier TEXT NOT NULL DEFAULT 'shadow_trial'
        CHECK (subscription_tier IN ('shadow_trial','ansp_operational','space_operator','institutional','internal')),
    subscription_status TEXT NOT NULL DEFAULT 'active'
        CHECK (subscription_status IN ('active','offered','offered_lapsed','churned','suspended')),
    subscription_started_at TIMESTAMPTZ,
    subscription_expires_at TIMESTAMPTZ,
    -- Shadow trial gate (F3 - §68): expiry normally auto-deactivates shadow mode, but enforcement is deferred while an active TIP / CRITICAL operational event exists
    shadow_trial_expires_at TIMESTAMPTZ, -- NULL = no trial expiry (paid or internal); set on sandbox agreement signing
    -- Resource quotas (F8 — §68): 0 = unlimited (paid tiers); >0 = monthly cap
    monthly_mc_run_quota INTEGER NOT NULL DEFAULT 100 -- 100 for free/shadow_trial; 0 = unlimited for paid; deferred during active TIP/CRITICAL event
        CHECK (monthly_mc_run_quota >= 0),
    -- Feature flags (F11 — §68): Enterprise-only features gated here
    feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE, -- Enterprise only
    -- On-premise licence (F6 — §68)
    licence_key TEXT, -- JWT signed by SpaceCom; checked at startup for on-premise deployments
    licence_expires_at TIMESTAMPTZ, -- derived from licence_key; stored for query efficiency
    -- Data residency (Finding 8)
    hosting_jurisdiction TEXT NOT NULL DEFAULT 'eu'
        CHECK (hosting_jurisdiction IN ('eu','uk','au','us','on_premise')),
    data_residency_confirmed BOOLEAN DEFAULT FALSE -- DPA clause confirmed for this org
);

-- Users
|
||
CREATE TABLE users (
|
||
id SERIAL PRIMARY KEY,
|
||
organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
|
||
email TEXT NOT NULL UNIQUE,
|
||
password_hash TEXT NOT NULL, -- bcrypt, cost factor >= 12
|
||
role TEXT NOT NULL DEFAULT 'viewer'
|
||
CHECK (role IN ('viewer','analyst','operator','org_admin','admin','space_operator','orbital_analyst')),
|
||
mfa_secret TEXT, -- TOTP secret (encrypted at rest)
|
||
mfa_recovery_codes TEXT[], -- bcrypt hashes of recovery codes
|
||
mfa_enabled BOOLEAN DEFAULT FALSE,
|
||
failed_mfa_attempts INTEGER DEFAULT 0,
|
||
locked_until TIMESTAMPTZ,
|
||
created_at TIMESTAMPTZ DEFAULT NOW(),
|
||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||
last_login_at TIMESTAMPTZ,
|
||
tos_accepted_at TIMESTAMPTZ, -- NULL = ToS not yet accepted; access blocked until set
|
||
tos_version TEXT, -- semver of ToS accepted (e.g. "1.2.0")
|
||
tos_accepted_ip INET, -- IP address at time of acceptance (GDPR consent evidence)
|
||
data_source_acknowledgement BOOLEAN DEFAULT FALSE, -- must be TRUE before API key access
|
||
altitude_unit_preference TEXT NOT NULL DEFAULT 'ft'
|
||
CHECK (altitude_unit_preference IN ('m', 'ft', 'km'))
|
||
-- 'ft' default for ansp_operator; 'km' default for space_operator (set at account creation based on role)
|
||
);
|
||
|
||
-- Refresh tokens (server-side revocation)
|
||
CREATE TABLE refresh_tokens (
|
||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||
user_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
|
||
token_hash TEXT NOT NULL UNIQUE, -- SHA-256 of the raw token
|
||
family_id UUID NOT NULL, -- All tokens from the same initial issuance share a family_id
|
||
issued_at TIMESTAMPTZ DEFAULT NOW(),
|
||
expires_at TIMESTAMPTZ NOT NULL,
|
||
revoked_at TIMESTAMPTZ, -- NULL = valid
|
||
superseded_at TIMESTAMPTZ, -- Set when this token is rotated out (newer token in family exists)
|
||
replaced_by UUID REFERENCES refresh_tokens(id), -- for rotation chain audit
|
||
source_ip INET,
|
||
user_agent TEXT
|
||
);
|
||
CREATE INDEX ON refresh_tokens (user_id, revoked_at);
|
||
CREATE INDEX ON refresh_tokens (family_id); -- for family revocation on reuse detection
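
-- Sketch (not part of the schema): on reuse detection, revoke the whole family
-- in one statement; this is the lookup family_id exists to serve:
--   UPDATE refresh_tokens SET revoked_at = NOW()
--   WHERE family_id = $1 AND revoked_at IS NULL;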

-- Security event log (append-only)
CREATE TABLE security_logs (
    id BIGSERIAL PRIMARY KEY,
    logged_at TIMESTAMPTZ DEFAULT NOW(),
    level TEXT NOT NULL,
    event_type TEXT NOT NULL,
    user_id INTEGER,
    organisation_id INTEGER,
    source_ip INET,
    user_agent TEXT,
    resource TEXT,
    detail JSONB,
    record_hash TEXT -- SHA-256(logged_at || event_type || detail) for tamper detection
);
CREATE TRIGGER security_logs_immutable
    BEFORE UPDATE OR DELETE ON security_logs
    FOR EACH ROW EXECUTE FUNCTION prevent_modification();
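
-- Sketch (assumes pgcrypto; not part of the schema): record_hash can be filled
-- by a BEFORE INSERT trigger, e.g.
--   NEW.record_hash := encode(digest(NEW.logged_at::text || NEW.event_type
--                                    || coalesce(NEW.detail::text, ''), 'sha256'), 'hex');
-- The tamper-detection job recomputes the same expression and compares.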

-- TLE history (hypertable)
-- No surrogate PK: TimescaleDB requires any UNIQUE/PK constraint to include the partition column.
-- Natural unique key is (object_id, ingested_at). Reference TLE records by this composite key.
CREATE TABLE tle_sets (
    object_id INTEGER REFERENCES objects(id),
    epoch TIMESTAMPTZ NOT NULL,
    line1 TEXT NOT NULL,
    line2 TEXT NOT NULL,
    source TEXT NOT NULL,
    ingested_at TIMESTAMPTZ DEFAULT NOW(),
    inclination_deg DOUBLE PRECISION,
    raan_deg DOUBLE PRECISION,
    eccentricity DOUBLE PRECISION,
    arg_perigee_deg DOUBLE PRECISION,
    mean_anomaly_deg DOUBLE PRECISION,
    mean_motion_rev_per_day DOUBLE PRECISION,
    bstar DOUBLE PRECISION,
    apogee_km DOUBLE PRECISION,
    perigee_km DOUBLE PRECISION,
    cross_validated BOOLEAN DEFAULT FALSE, -- TRUE if confirmed by second source
    cross_validation_delta_sma_km DOUBLE PRECISION, -- SMA difference between sources
    UNIQUE (object_id, ingested_at) -- natural key; safe for TimescaleDB (includes partition col)
);
-- Chunk interval set at creation (per §9.4); the §9.4 if_not_exists call is then a no-op
SELECT create_hypertable('tle_sets', 'ingested_at',
    chunk_time_interval => INTERVAL '1 month');

-- Space weather (hypertable)
CREATE TABLE space_weather (
    time TIMESTAMPTZ NOT NULL,
    f107_obs DOUBLE PRECISION, -- observed F10.7 (current day)
    f107_prior_day DOUBLE PRECISION, -- prior-day F10.7 (NRLMSISE-00 f107 input)
    f107_81day_avg DOUBLE PRECISION, -- 81-day centred average (NRLMSISE-00 f107A input)
    ap_daily INTEGER, -- daily Ap index (linear; NOT Kp)
    ap_3h_history DOUBLE PRECISION[19], -- 3-hourly Ap values for prior 57h (NRLMSISE-00 full mode)
    kp_3hourly DOUBLE PRECISION[], -- 3-hourly Kp (for storm detection; Kp > 5 triggers storm flag)
    dst_index INTEGER,
    uncertainty_multiplier DOUBLE PRECISION,
    operational_status TEXT,
    source TEXT DEFAULT 'noaa_swpc',
    secondary_source TEXT, -- ESA SWS cross-validation value
    cross_validation_delta_f107 DOUBLE PRECISION -- difference between sources
);
-- Chunk interval set at creation (per §9.4); the §9.4 if_not_exists call is then a no-op
SELECT create_hypertable('space_weather', 'time',
    chunk_time_interval => INTERVAL '30 days');

-- TIP messages
CREATE TABLE tip_messages (
    id BIGSERIAL PRIMARY KEY,
    object_id INTEGER REFERENCES objects(id),
    norad_id INTEGER NOT NULL,
    message_time TIMESTAMPTZ NOT NULL,
    message_number INTEGER,
    reentry_window_open TIMESTAMPTZ,
    reentry_window_close TIMESTAMPTZ,
    predicted_region TEXT,
    source TEXT DEFAULT 'usspacecom',
    raw_message TEXT
);

-- Alert events (append-only)
CREATE TABLE alert_events (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    level TEXT NOT NULL
        CHECK (level IN ('INFO','WARNING','CRITICAL')),
    trigger_type TEXT NOT NULL,
    object_id INTEGER REFERENCES objects(id),
    organisation_id INTEGER REFERENCES organisations(id),
    message TEXT NOT NULL,
    acknowledged_at TIMESTAMPTZ,
    acknowledged_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; log entry preserved
    acknowledgement_note TEXT,
    delivered_websocket BOOLEAN DEFAULT FALSE,
    delivered_email BOOLEAN DEFAULT FALSE,
    fir_intersection_km2 DOUBLE PRECISION, -- area of FIR polygon intersected by the triggering corridor (km²); NULL for non-spatial alerts
    intersection_percentile TEXT
        CHECK (intersection_percentile IN ('p50','p95')), -- which corridor percentile triggered the alert
    prediction_id BIGINT REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
    record_hmac TEXT NOT NULL DEFAULT '' -- HMAC-SHA256 of safety-critical fields; signed at insert; verified nightly (F9)
);
-- Acknowledgement fields, delivery flags, and the GDPR ON DELETE SET NULL on
-- acknowledged_by are all post-insert writes, so a blanket BEFORE UPDATE block
-- would break them. Only the safety-critical core fields are frozen:
CREATE TRIGGER alert_events_immutable
    BEFORE UPDATE ON alert_events
    FOR EACH ROW
    WHEN (NEW.created_at IS DISTINCT FROM OLD.created_at
       OR NEW.level IS DISTINCT FROM OLD.level
       OR NEW.trigger_type IS DISTINCT FROM OLD.trigger_type
       OR NEW.message IS DISTINCT FROM OLD.message
       OR NEW.prediction_id IS DISTINCT FROM OLD.prediction_id
       OR NEW.record_hmac IS DISTINCT FROM OLD.record_hmac)
    EXECUTE FUNCTION prevent_modification();
CREATE TRIGGER alert_events_no_delete
    BEFORE DELETE ON alert_events
    FOR EACH ROW EXECUTE FUNCTION prevent_modification();
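
-- Sketch (assumes pgcrypto and out-of-band signing-key delivery; the canonical
-- field serialisation is whatever F9 specifies): the nightly verification pass
-- recomputes the HMAC and flags mismatches, e.g.
--   SELECT id FROM alert_events
--   WHERE record_hmac <> encode(hmac(level || '|' || message, :signing_key, 'sha256'), 'hex');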

-- Simulations
CREATE TABLE simulations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    module TEXT NOT NULL,
    object_id INTEGER REFERENCES objects(id),
    organisation_id INTEGER REFERENCES organisations(id),
    params_json JSONB NOT NULL,
    started_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ,
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending','running','complete','failed','cancelled')),
    result_uri TEXT,
    model_version TEXT,
    celery_task_id TEXT,
    error_detail TEXT,
    created_by INTEGER REFERENCES users(id)
);

-- Reports
CREATE TABLE reports (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    simulation_id UUID REFERENCES simulations(id),
    object_id INTEGER REFERENCES objects(id),
    organisation_id INTEGER REFERENCES organisations(id),
    report_type TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by INTEGER REFERENCES users(id),
    storage_uri TEXT NOT NULL,
    params_json JSONB,
    report_number TEXT
);

-- Prediction outcomes (algorithmic accountability — links predictions to observed re-entry events)
CREATE TABLE prediction_outcomes (
    id SERIAL PRIMARY KEY,
    prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id) ON DELETE RESTRICT, -- RESTRICT prevents cascade delete of legal-hold predictions (F10 — §67)
    norad_id INTEGER NOT NULL,
    observed_reentry_time TIMESTAMPTZ, -- actual re-entry time from post-event analysis (The Aerospace Corporation, US18SCS, etc.)
    observed_reentry_source TEXT, -- 'aerospace_corp' | 'us18scs' | 'esa_esoc' | 'manual'
    p50_error_minutes DOUBLE PRECISION, -- predicted p50 minus observed (+ = predicted late, - = predicted early)
    corridor_contains_observed BOOLEAN, -- TRUE if observed impact point fell within p95 corridor
    fir_false_positive BOOLEAN, -- TRUE if a CRITICAL alert fired but no observable debris reached the affected FIR
    fir_false_negative BOOLEAN, -- TRUE if observable debris reached a FIR but no CRITICAL alert was generated
    ood_flag_at_prediction BOOLEAN, -- snapshot of ood_flag from the prediction record at prediction time
    notes TEXT,
    recorded_at TIMESTAMPTZ DEFAULT NOW(),
    recorded_by INTEGER REFERENCES users(id) -- analyst who logged the outcome
);

-- Hazard zones
CREATE TABLE hazard_zones (
    id BIGSERIAL PRIMARY KEY,
    simulation_id UUID REFERENCES simulations(id),
    organisation_id INTEGER REFERENCES organisations(id),
    valid_from TIMESTAMPTZ NOT NULL,
    valid_to TIMESTAMPTZ NOT NULL,
    geometry GEOGRAPHY(POLYGON, 4326) NOT NULL,
    altitude_min_km DOUBLE PRECISION,
    altitude_max_km DOUBLE PRECISION,
    risk_level TEXT,
    confidence DOUBLE PRECISION,
    sigma_along_track_km DOUBLE PRECISION,
    sigma_cross_track_km DOUBLE PRECISION,
    record_hmac TEXT NOT NULL
);
CREATE INDEX ON hazard_zones USING GIST (geometry);
CREATE INDEX ON hazard_zones (valid_from, valid_to);
CREATE TRIGGER hazard_zones_immutable
    BEFORE UPDATE OR DELETE ON hazard_zones
    FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Airspace boundaries
CREATE TABLE airspace (
    id BIGSERIAL PRIMARY KEY,
    designator TEXT NOT NULL,
    name TEXT,
    type TEXT NOT NULL,
    geometry GEOMETRY(POLYGON, 4326) NOT NULL, -- GEOMETRY (not GEOGRAPHY): FIR boundaries never cross antimeridian; ~3× faster for ST_Intersects
    lower_fl INTEGER,
    upper_fl INTEGER,
    icao_region TEXT
);
CREATE INDEX ON airspace USING GIST (geometry);

-- Debris fragments
CREATE TABLE fragments (
    id BIGSERIAL PRIMARY KEY,
    simulation_id UUID REFERENCES simulations(id),
    mass_kg DOUBLE PRECISION,
    characteristic_length_m DOUBLE PRECISION,
    cross_section_m2 DOUBLE PRECISION,
    material TEXT,
    ballistic_coefficient_kgm2 DOUBLE PRECISION,
    pre_entry_survived BOOLEAN,
    impact_point GEOGRAPHY(POINT, 4326),
    impact_velocity_kms DOUBLE PRECISION,
    impact_angle_deg DOUBLE PRECISION,
    kinetic_energy_j DOUBLE PRECISION,
    casualty_area_m2 DOUBLE PRECISION,
    dispersion_semi_major_km DOUBLE PRECISION,
    dispersion_semi_minor_km DOUBLE PRECISION,
    dispersion_orientation_deg DOUBLE PRECISION
);
CREATE INDEX ON fragments USING GIST (impact_point);

-- Owned objects (space operator registration)
CREATE TABLE owned_objects (
    id SERIAL PRIMARY KEY,
    organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
    object_id INTEGER REFERENCES objects(id) NOT NULL,
    norad_id INTEGER NOT NULL,
    registered_at TIMESTAMPTZ DEFAULT NOW(),
    registration_reference TEXT, -- National space law registration number
    has_propulsion BOOLEAN DEFAULT FALSE, -- Enables controlled re-entry planner
    UNIQUE (organisation_id, object_id)
);
CREATE INDEX ON owned_objects (organisation_id);

-- API keys (for Persona E/F programmatic access)
CREATE TABLE api_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organisation_id INTEGER REFERENCES organisations(id) NOT NULL,
    user_id INTEGER REFERENCES users(id), -- NULL for org-level service account keys (F5)
    is_service_account BOOLEAN NOT NULL DEFAULT FALSE, -- TRUE = org-level key, no human user
    service_account_name TEXT, -- required when is_service_account = TRUE; e.g. "ANSP Integration Service"
    key_hash TEXT NOT NULL UNIQUE, -- SHA-256 of raw key; raw key shown once at creation
    name TEXT NOT NULL, -- Human label, e.g. "Ops Centre Integration"
    role TEXT NOT NULL, -- space_operator | orbital_analyst
    created_at TIMESTAMPTZ DEFAULT NOW(),
    last_used_at TIMESTAMPTZ,
    expires_at TIMESTAMPTZ,
    revoked_at TIMESTAMPTZ,
    revoked_by INTEGER REFERENCES users(id), -- org_admin or admin who revoked (F5)
    requests_today INTEGER DEFAULT 0,
    daily_limit INTEGER DEFAULT 1000,
    -- API key scope and rate limit overrides (Finding 11)
    allowed_endpoints TEXT[], -- NULL = all endpoints for role; e.g. ['GET /space/objects']
    rate_limit_override JSONB, -- e.g. {"decay_predict": {"limit": 5, "window": "1h"}}
    CONSTRAINT service_account_name_required CHECK (
        (is_service_account = FALSE) OR (service_account_name IS NOT NULL)
    ),
    CONSTRAINT user_or_service CHECK (
        (user_id IS NOT NULL AND is_service_account = FALSE)
        OR (user_id IS NULL AND is_service_account = TRUE)
    )
);
CREATE INDEX ON api_keys (organisation_id, revoked_at);
CREATE INDEX ON api_keys (organisation_id, is_service_account); -- org admin key listing

-- Async job tracking — all Celery-backed POST endpoints return a job reference (Finding 3)
CREATE TABLE jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organisation_id INTEGER NOT NULL REFERENCES organisations(id),
    user_id INTEGER NOT NULL REFERENCES users(id),
    job_type TEXT NOT NULL
        CHECK (job_type IN ('decay_predict','report','reentry_plan','propagate')),
    status TEXT NOT NULL DEFAULT 'queued'
        CHECK (status IN ('queued','running','complete','failed','cancelled')),
    celery_task_id TEXT, -- Celery AsyncResult ID for internal tracking
    params_hash TEXT, -- SHA-256 of input params; used for idempotency check
    result_url TEXT, -- populated when status='complete'; e.g. '/decay/predictions/123'
    error_code TEXT, -- populated when status='failed'
    error_message TEXT,
    estimated_duration_seconds INTEGER, -- populated at creation from historical p50 for job_type
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
);
CREATE INDEX ON jobs (organisation_id, status, created_at DESC);
CREATE INDEX ON jobs (celery_task_id);
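
-- Sketch (query shape assumed, not prescribed): the API layer can use
-- params_hash to detect an in-flight duplicate before enqueueing:
--   SELECT id, status FROM jobs
--   WHERE organisation_id = $1 AND job_type = $2 AND params_hash = $3
--     AND status IN ('queued', 'running');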

-- Idempotency key store — prevents duplicate mutations from network retries (Finding 5)
CREATE TABLE idempotency_keys (
    key TEXT NOT NULL, -- client-provided UUID
    user_id INTEGER NOT NULL REFERENCES users(id),
    endpoint TEXT NOT NULL, -- e.g. 'POST /decay/predict'
    response_status INTEGER NOT NULL,
    response_body JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + INTERVAL '24 hours',
    PRIMARY KEY (key, user_id, endpoint)
);
CREATE INDEX ON idempotency_keys (expires_at); -- for TTL cleanup job
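
-- Sketch: the TTL cleanup job (e.g. a periodic Celery Beat task) reduces to
--   DELETE FROM idempotency_keys WHERE expires_at < NOW();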

-- Usage metering (F3) — billable events; append-only
CREATE TABLE usage_events (
    id BIGSERIAL PRIMARY KEY,
    organisation_id INTEGER NOT NULL REFERENCES organisations(id),
    user_id INTEGER REFERENCES users(id), -- NULL for API key / system-triggered events
    api_key_id UUID REFERENCES api_keys(id), -- set when triggered via API key
    event_type TEXT NOT NULL
        CHECK (event_type IN (
            'decay_prediction_run',
            'conjunction_screen_run',
            'report_export',
            'api_request',
            'mc_quota_exhausted', -- quota hit; signals upsell opportunity
            'reentry_plan_run'
        )),
    quantity INTEGER NOT NULL DEFAULT 1, -- e.g. number of API requests batched
    billing_period TEXT NOT NULL, -- 'YYYY-MM' — month this event counts toward
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    detail JSONB -- event-specific metadata (object_id, mc_n, etc.)
);
CREATE INDEX ON usage_events (organisation_id, billing_period, event_type);
CREATE INDEX ON usage_events (organisation_id, created_at DESC);
-- Append-only enforcement
CREATE TRIGGER usage_events_immutable
    BEFORE UPDATE OR DELETE ON usage_events
    FOR EACH ROW EXECUTE FUNCTION prevent_modification();
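
-- Sketch (semantics taken from the organisations quota comments): the monthly
-- MC quota check sums the current billing period and compares the result
-- against organisations.monthly_mc_run_quota (0 = unlimited):
--   SELECT COALESCE(SUM(quantity), 0)
--   FROM usage_events
--   WHERE organisation_id = $1
--     AND billing_period = to_char(NOW(), 'YYYY-MM')
--     AND event_type = 'decay_prediction_run';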

-- Billing contacts (F10)
CREATE TABLE billing_contacts (
    id SERIAL PRIMARY KEY,
    organisation_id INTEGER NOT NULL REFERENCES organisations(id) UNIQUE,
    billing_email TEXT NOT NULL,
    billing_name TEXT NOT NULL,
    billing_address TEXT,
    vat_number TEXT, -- EU VAT registration; required for B2B invoicing
    purchase_order_number TEXT, -- PO reference required by some ANSP procurement depts
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_by INTEGER REFERENCES users(id) -- must be org_admin or admin
);

-- Subscription periods (F10) — immutable record of what was billed when
CREATE TABLE subscription_periods (
    id SERIAL PRIMARY KEY,
    organisation_id INTEGER NOT NULL REFERENCES organisations(id),
    tier TEXT NOT NULL,
    period_start TIMESTAMPTZ NOT NULL,
    period_end TIMESTAMPTZ, -- NULL = current (open) period
    monthly_fee_eur NUMERIC(10, 2), -- agreed contract price; NULL for internal/trial
    currency TEXT NOT NULL DEFAULT 'EUR',
    invoice_ref TEXT, -- external billing system invoice ID (e.g. Stripe invoice_id)
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON subscription_periods (organisation_id, period_start DESC);

-- NOTAM drafts (audit trail; never submitted by SpaceCom)
CREATE TABLE notam_drafts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id BIGINT REFERENCES reentry_predictions(id),
    organisation_id INTEGER REFERENCES organisations(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_by INTEGER REFERENCES users(id),
    draft_type TEXT NOT NULL
        CHECK (draft_type IN ('new','cancellation')),
    fir_designators TEXT[] NOT NULL,
    valid_from TIMESTAMPTZ,
    valid_to TIMESTAMPTZ,
    draft_text TEXT NOT NULL, -- Full ICAO-format draft text
    reviewed_by INTEGER REFERENCES users(id) ON DELETE SET NULL, -- SET NULL on GDPR erasure; draft preserved
    reviewed_at TIMESTAMPTZ,
    review_note TEXT,
    safety_record BOOLEAN DEFAULT TRUE, -- always retained; excluded from data drop policy
    generated_during_degraded BOOLEAN DEFAULT FALSE -- TRUE if ingest was degraded at generation time
    -- No issuance fields — SpaceCom never issues NOTAMs
);

-- Degraded mode audit log (Finding 7 — operational ANSP disclosure requirement)
-- Records every transition into and out of degraded mode for incident investigation
CREATE TABLE degraded_mode_events (
    id BIGSERIAL PRIMARY KEY,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    ended_at TIMESTAMPTZ, -- NULL = currently degraded
    affected_sources TEXT[] NOT NULL, -- e.g. ['space_track', 'noaa_swpc']
    severity TEXT NOT NULL
        CHECK (severity IN ('WARNING','CRITICAL')),
    trigger_reason TEXT NOT NULL, -- human-readable: 'Space-Track ingest gap > 4h'
    resolved_by TEXT, -- 'auto-recovery' | user_id | 'manual'
    safety_record BOOLEAN DEFAULT TRUE -- always retained under safety record policy
);
-- Append-only for the core fields; ended_at / resolved_by are written when the
-- degraded episode closes, so a blanket UPDATE block would prevent recovery
-- from ever being recorded. DELETE is never permitted.
CREATE TRIGGER degraded_mode_events_immutable
    BEFORE UPDATE ON degraded_mode_events
    FOR EACH ROW
    WHEN (NEW.started_at IS DISTINCT FROM OLD.started_at
       OR NEW.affected_sources IS DISTINCT FROM OLD.affected_sources
       OR NEW.severity IS DISTINCT FROM OLD.severity
       OR NEW.trigger_reason IS DISTINCT FROM OLD.trigger_reason)
    EXECUTE FUNCTION prevent_modification();
CREATE TRIGGER degraded_mode_events_no_delete
    BEFORE DELETE ON degraded_mode_events
    FOR EACH ROW EXECUTE FUNCTION prevent_modification();

-- Shadow validation records (compare shadow predictions to actual events)
CREATE TABLE shadow_validations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id BIGINT REFERENCES reentry_predictions(id),
    organisation_id INTEGER REFERENCES organisations(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by INTEGER REFERENCES users(id),
    actual_reentry_time TIMESTAMPTZ,
    actual_reentry_location GEOGRAPHY(POINT, 4326),
    actual_source TEXT, -- 'aerospace_corp_db' | 'tip_message' | 'manual'
    p50_error_minutes DOUBLE PRECISION, -- actual - predicted p50 in minutes
    in_p95_corridor BOOLEAN, -- did actual point fall within 95th pct corridor?
    notes TEXT
);

-- Legal opinions (jurisdiction-level gate for shadow mode and operational deployment)
CREATE TABLE legal_opinions (
    id SERIAL PRIMARY KEY,
    jurisdiction TEXT NOT NULL UNIQUE, -- e.g. 'AU', 'EU', 'UK', 'US'
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending','in_progress','complete','not_required')),
    opinion_date DATE,
    counsel_firm TEXT,
    shadow_mode_cleared BOOLEAN DEFAULT FALSE, -- opinion confirms shadow deployment is permissible
    operational_cleared BOOLEAN DEFAULT FALSE, -- opinion confirms operational deployment is permissible
    liability_cap_agreed BOOLEAN DEFAULT FALSE,
    notes TEXT,
    document_minio_key TEXT, -- reference to stored opinion document in MinIO
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Shared immutability function (used by multiple triggers)
-- NB: in the actual migration this function must be created before any of the
-- CREATE TRIGGER statements above that reference it; it is listed last here
-- only for narrative grouping.
CREATE OR REPLACE FUNCTION prevent_modification()
RETURNS TRIGGER AS $$
BEGIN
    RAISE EXCEPTION 'Table % is append-only or immutable after creation', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

-- Shared updated_at function (used by mutable tables)
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$;

-- updated_at triggers for all mutable tables
CREATE TRIGGER organisations_updated_at
    BEFORE UPDATE ON organisations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER users_updated_at
    BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER simulations_updated_at
    BEFORE UPDATE ON simulations FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER jobs_updated_at
    BEFORE UPDATE ON jobs FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER notam_drafts_updated_at
    BEFORE UPDATE ON notam_drafts FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```

**Shadow mode flag on predictions and hazard zones:** Add `shadow_mode BOOLEAN DEFAULT FALSE` to both `reentry_predictions` and `hazard_zones`. Shadow records are excluded from all operational API responses (`WHERE shadow_mode = FALSE` applied to all operational endpoints) but accessible via `/analysis` and the Feedback/shadow validation workflow.
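As an illustration, an operational "latest prediction" lookup would take the shape (sketch; assumes the `integrity_failed` column referenced by §9.3's partial index):

```sql
-- Shadow and integrity-failed rows never reach operational responses
SELECT *
FROM reentry_predictions
WHERE object_id = $1
  AND shadow_mode = FALSE
  AND integrity_failed = FALSE
ORDER BY created_at DESC
LIMIT 1;
```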

---

### 9.3 Index Strategy

All indexes must be created `CONCURRENTLY` on live hypertables to avoid table locks (see §9.4). The following indexes are required beyond TimescaleDB's automatic chunk indexes:

```sql
-- orbits hypertable: object + time range queries (CZML generation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS orbits_object_epoch_idx
    ON orbits (object_id, epoch DESC);

-- reentry_predictions: latest prediction per object (Event Detail, operational overview)
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_object_created_idx
    ON reentry_predictions (object_id, created_at DESC)
    WHERE integrity_failed = FALSE AND shadow_mode = FALSE;

-- alert_events: unacknowledged alerts per org (badge count — called on every page load)
-- Partial index on acknowledged_at IS NULL: only live unacked rows indexed; shrinks as alerts are acknowledged
CREATE INDEX CONCURRENTLY IF NOT EXISTS alert_events_unacked_idx
    ON alert_events (organisation_id, level, created_at DESC)
    WHERE acknowledged_at IS NULL;

-- jobs: Celery worker polls for queued jobs; partial index keeps this tiny and fast
CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_queued_idx
    ON jobs (organisation_id, created_at)
    WHERE status = 'queued';

-- refresh_tokens: token validation only cares about live (non-revoked) tokens
CREATE INDEX CONCURRENTLY IF NOT EXISTS refresh_tokens_live_idx
    ON refresh_tokens (token_hash)
    WHERE revoked_at IS NULL;

-- idempotency_keys: TTL cleanup job scans by expiry
-- (expires_at is NOT NULL, so no partial predicate is needed)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idempotency_keys_expired_idx
    ON idempotency_keys (expires_at);

-- PostGIS spatial: all columns used in ST_Intersects / ST_Contains / ST_Distance
CREATE INDEX CONCURRENTLY IF NOT EXISTS reentry_pred_corridor_gist
    ON reentry_predictions USING GIST (ground_track_corridor);
-- airspace.geometry, hazard_zones.geometry, and fragments.impact_point
-- GIST indexes already present (see §9.2)

-- tle_sets hypertable: latest TLE per object (cross-validation, propagation)
CREATE INDEX CONCURRENTLY IF NOT EXISTS tle_sets_object_ingested_idx
    ON tle_sets (object_id, ingested_at DESC);

-- security_logs: recent events per user (audit queries)
CREATE INDEX CONCURRENTLY IF NOT EXISTS security_logs_user_time_idx
    ON security_logs (user_id, logged_at DESC);
```

**Spatial type convention:**
- `GEOGRAPHY` — used for global features that may cross the antimeridian (corridor polygons, nominal re-entry points, fragment impact points). Geodetic calculations; correct for global spans.
- `GEOMETRY(POLYGON, 4326)` — used for regional features always within ±180° longitude (FIR/UIR airspace boundaries). Planar approximation; ~3× faster for `ST_Intersects` than `GEOGRAPHY`; accurate enough for airspace boundary intersection within a single hemisphere.
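
The practical difference shows up at the antimeridian; a sketch, runnable in any PostGIS-enabled session:

```sql
-- GEOGRAPHY measures the short great-circle path across the antimeridian
-- (roughly 22 km here); GEOMETRY treats the same pair as ~359.8 planar degrees
SELECT ST_Distance('SRID=4326;POINT(179.9 0)'::geography,
                   'SRID=4326;POINT(-179.9 0)'::geography) AS metres_geodetic,
       ST_Distance('SRID=4326;POINT(179.9 0)'::geometry,
                   'SRID=4326;POINT(-179.9 0)'::geometry) AS degrees_planar;
```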

**SRID enforcement (F2 — §62):** Declaring the SRID in the column type (`GEOMETRY(POLYGON, 4326)`) prevents implicit SRID mismatch errors, but does not prevent application code from inserting a geometry constructed with SRID 0. Add explicit CHECK constraints on all spatial columns:

```sql
-- Ensure corridor polygon SRID is correct
ALTER TABLE reentry_predictions
    ADD CONSTRAINT chk_corridor_srid
    CHECK (ST_SRID(ground_track_corridor::geometry) = 4326);

ALTER TABLE hazard_zones
    ADD CONSTRAINT chk_hazard_zone_srid
    CHECK (ST_SRID(geometry) = 4326);

ALTER TABLE airspace
    ADD CONSTRAINT chk_airspace_srid
    CHECK (ST_SRID(geometry) = 4326);
```

The CI migration gate (`alembic check`) will flag any migration that adds a spatial column without a matching SRID CHECK constraint.
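
A one-off audit of existing columns can lean on PostGIS's `geometry_columns` catalog view (sketch; the CI gate itself inspects migration files, not the live catalog, and `geography_columns` is the analogous view for GEOGRAPHY columns):

```sql
-- List spatial columns whose declared SRID is not 4326 (should return no rows)
SELECT f_table_name, f_geometry_column, srid
FROM geometry_columns
WHERE srid IS DISTINCT FROM 4326;
```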

**ST_Buffer distance units (F9 — §62):** `ST_Buffer` on a `GEOMETRY(POLYGON, 4326)` column works in degree units, not metres. At 60°N, 1° of longitude ≈ 55 km; at the equator, 1° ≈ 111 km — an uncertainty buffer expressed in degrees therefore covers wildly different areas at different latitudes. Always buffer in a projected CRS, then transform back:

```sql
-- CORRECT: buffer 50 km around corridor point at any latitude
SELECT ST_Transform(
    ST_Buffer(
        ST_Transform(ST_SetSRID(ST_MakePoint(lon, lat), 4326), 3857), -- project to Web Mercator (metres)
        50000 -- 50 km in metres
    ),
    4326 -- back to WGS84
) AS buffered_geom;

-- WRONG: buffer in degrees — DO NOT USE
-- SELECT ST_Buffer(geom, 0.5) FROM ... ← 0.5° of longitude is ~28 km at 60°N but ~55 km at the equator
```

For global spans where Mercator distortion is unacceptable, use `ST_Buffer` on a `GEOGRAPHY` column instead — it accepts metres natively:
```sql
SELECT ST_Buffer(corridor::geography, 50000) -- 50 km buffer, geodetically correct
FROM reentry_predictions WHERE ...
```

**FIR intersection query optimisation:** Apply a bounding-box pre-filter before the full polygon intersection test to eliminate most rows cheaply. `airspace.geometry` is `GEOMETRY` while `hazard_zones.geometry` and corridor parameters are `GEOGRAPHY` — **always cast GEOGRAPHY → GEOMETRY explicitly** before passing to `ST_Intersects` with an airspace column; PostgreSQL cannot use the GiST index and falls back to a seq scan if the types are mixed implicitly:

```sql
-- Corridor (GEOGRAPHY) intersecting FIR boundaries (GEOMETRY): explicit cast required
SELECT a.designator, a.name
FROM airspace a
WHERE a.geometry && ST_Envelope($1::geography::geometry) -- fast bbox pre-filter (uses GIST)
  AND ST_Intersects(a.geometry, $1::geography::geometry); -- exact test (GEOMETRY, not GEOGRAPHY)
-- $1 = corridor polygon passed as GEOGRAPHY from application layer
```

Add a CI linter rule (or custom ruff plugin) that rejects `ST_Intersects(airspace.geometry, <expr>)` unless `<expr>` is explicitly cast to `::geometry`. This prevents the mixed-type silent seq-scan regression from being introduced during maintenance.

Cache the FIR intersection result per `prediction_id` in Redis (TTL: until the prediction is superseded) — the intersection for a given prediction never changes.

---

### 9.4 TimescaleDB Configuration and Continuous Aggregates

**Hypertable chunk intervals** — set explicitly at creation; default 7-day chunks are too large for the `orbits` CZML query pattern (most queries cover ≤ 72h):

```sql
-- orbits: 1-day chunks (72h CZML window spans 3 chunks; good chunk exclusion)
SELECT create_hypertable('orbits', 'epoch',
    chunk_time_interval => INTERVAL '1 day',
    if_not_exists => TRUE);

-- tle_sets: 1-month chunks (~1,800 rows/day at 600 objects × 3 TLE updates; queried by object_id not time range)
-- Small chunks (7 days) produce poor compression ratios (~12,600 rows/chunk); 1 month improves ratio ~4×
SELECT create_hypertable('tle_sets', 'ingested_at',
    chunk_time_interval => INTERVAL '1 month',
    if_not_exists => TRUE);

-- space_weather: 30-day chunks (~3000 rows/month at 15-min cadence)
SELECT create_hypertable('space_weather', 'time',
    chunk_time_interval => INTERVAL '30 days',
    if_not_exists => TRUE);
```

**Continuous aggregates** — pre-compute recurring expensive queries instead of scanning raw hypertable rows on every request:

```sql
-- 81-day rolling F10.7 average (queried on every Space Weather Widget render)
CREATE MATERIALIZED VIEW space_weather_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day,
       AVG(f107_obs) AS f107_daily_avg,
       MAX(kp_3hourly[1]) AS kp_max_daily
FROM space_weather
GROUP BY day
WITH NO DATA;

SELECT add_continuous_aggregate_policy('space_weather_daily',
    start_offset => INTERVAL '90 days',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

Backend queries for the 81-day F10.7 average read from `space_weather_daily` (the continuous aggregate), not from the raw `space_weather` hypertable.
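
The widget query then reduces to a scan of at most 81 aggregate rows (sketch):

```sql
-- 81-day F10.7 average from the continuous aggregate, not the raw hypertable
SELECT AVG(f107_daily_avg) AS f107_81day_avg
FROM space_weather_daily
WHERE day >= NOW() - INTERVAL '81 days';
```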

**Compression policy intervals** — compression must not target recently-written chunks. TimescaleDB decompresses a chunk before any write to it; compressing hot chunks adds 50–200ms latency per write batch. Set `compress_after` well beyond the active write window:

| Hypertable | Chunk interval | `compress_after` | Write cadence | Reasoning |
|---|---|---|---|---|
| `orbits` | 1 day | 7 days | 1 min (continuous) | Data is queryable but not written after ~24h; 7-day buffer prevents write-decompress thrash |
| `adsb_states` | 4 hours | 14 days | 60s (Celery Beat) | Rolling 24h retention; compress only after data is past retention interest |
| `space_weather` | 30 days | 60 days | 15 min | Very low write rate; compress after one full 30-day chunk is closed |
| `tle_sets` | 1 month | 2 months | Every 4h ingest | ~1,800 rows/day; 1-month chunks give good compression ratio; 2-month buffer ensures active month is never compressed |

```sql
|
||
-- Apply compression policies (run after hypertable creation)
|
||
SELECT add_compression_policy('orbits', INTERVAL '7 days');
|
||
SELECT add_compression_policy('adsb_states', INTERVAL '14 days');
|
||
SELECT add_compression_policy('space_weather', INTERVAL '60 days');
|
||
SELECT add_compression_policy('tle_sets', INTERVAL '2 months');
|
||
```
|
||
|
||
**Autovacuum tuning** — append-only tables still accumulate dead tuples from aborted transactions and MVCC overhead. The default 20% threshold is too conservative for high-write safety tables:

```sql
ALTER TABLE alert_events SET (
    autovacuum_vacuum_scale_factor = 0.01,   -- vacuum at 1% dead tuples (default: 20%)
    autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE security_logs SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE reentry_predictions SET (
    autovacuum_vacuum_cost_delay = 2,        -- allow aggressive vacuum on query-critical table
    autovacuum_analyze_scale_factor = 0.01
);
```

PostgreSQL-level settings via `patroni.yml`:

```yaml
postgresql:
  parameters:
    idle_in_transaction_session_timeout: 30000   # 30s -- prevents analytics sessions blocking autovacuum
    max_connections: 50                          # pgBouncer handles client multiplexing; DB needs only 50
    log_min_duration_statement: 500              # F7 §58: log queries > 500ms; shipped to Loki via Promtail
    shared_preload_libraries: timescaledb,pg_stat_statements   # F7 §58: enable slow query tracking
    pg_stat_statements.track: all                # track all statements including nested
    # Analyst role statement timeout (F11 §58): prevents runaway analytics queries starving ops connections
    # Applied at role level, not globally, to avoid impacting operational paths
```

**Query plan governance (F7 — §58):** Slow queries (> 500ms) appear in PostgreSQL logs and are shipped to Loki. A weekly Grafana report queries `pg_stat_statements` via the `postgres-exporter` and surfaces the top-10 queries by `total_exec_time`. Any query appearing in the top-10 for two consecutive weeks requires a PR with an `EXPLAIN ANALYSE` output and either an index addition or a documented acceptance rationale. The `EXPLAIN ANALYSE` output is recorded in the migration file header comment for index additions. CI migration timeout (§9.4) applies: migrations running > 30s against the test dataset require review before merge.

**Analyst role query timeout (F11 — §58):** Persona B/F analyst queries route to the read replica (§3.2) but must still be bounded to prevent a runaway query exhausting replica connections and triggering replication lag. Apply a `statement_timeout` at the database role level so it applies regardless of connection source:

```sql
-- Applied once at schema setup; persists across reconnections
ALTER ROLE spacecom_analyst SET statement_timeout = '30s';
ALTER ROLE spacecom_readonly SET statement_timeout = '30s';

-- Operational roles have no statement timeout — but idle-in-transaction timeout applies globally
-- (idle_in_transaction_session_timeout = 30s in patroni.yml)
```

The `spacecom_analyst` role is the PgBouncer user for the read replica pool. All analyst-originated queries automatically inherit the 30s limit. If a query exceeds 30s it receives `ERROR: canceling statement due to statement timeout`; the frontend displays a user-facing message: "This query exceeded the 30-second limit. Refine your filters or contact your administrator." The event is logged at WARNING to Loki.

**PgBouncer transaction mode + asyncpg prepared statement cache** — asyncpg caches prepared statements per server-side connection. In PgBouncer transaction mode, the connection returned after each transaction may differ from the one the statement was prepared on, causing `ERROR: prepared statement "..." does not exist` under load. Disable the cache in the SQLAlchemy async engine config:

```python
engine = create_async_engine(
    DATABASE_URL,
    connect_args={"prepared_statement_cache_size": 0},
)
```

This is non-negotiable when using PgBouncer transaction mode. Do not revert this setting in the belief that it is a performance regression — it prevents a hard production failure mode. See ADR 0008.

**Migration safety on live hypertables** (additions to the Alembic policy in §26.9):
- Always use `CREATE INDEX CONCURRENTLY` for new indexes — no table lock; safe during live ingest
- Never add a column with a non-null default to a populated hypertable in one migration: (1) add nullable, (2) backfill in batches, (3) add NOT NULL constraint separately
- Test every migration against production-sized data; record execution time in the migration file header comment
- Set a CI migration timeout: if a migration runs > 30s against the test dataset, it must be reviewed before merge

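The batched-backfill step in the second rule can be sketched as a loop that touches a bounded number of rows per statement, so each `UPDATE` holds its locks briefly and autovacuum can keep pace. The executor callable and the ctid-batching SQL are illustrative, not a prescribed implementation:

```python
from typing import Callable

def backfill_in_batches(
    execute: Callable[[str], int],   # runs one SQL statement, returns rows affected
    table: str,
    column: str,
    default_sql: str,
    batch_size: int = 10_000,
) -> int:
    """Backfill a newly added nullable column in small batches.

    Each UPDATE affects at most batch_size rows; loop until a short
    batch signals completion. Returns total rows updated."""
    total = 0
    while True:
        updated = execute(
            f"UPDATE {table} SET {column} = {default_sql} "
            f"WHERE ctid IN (SELECT ctid FROM {table} "
            f"WHERE {column} IS NULL LIMIT {batch_size})"
        )
        total += updated
        if updated < batch_size:
            return total
```

Only after the backfill completes does a separate migration add the `NOT NULL` constraint.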
---

## 10. Technology Stack

| Layer | Technology | Rationale |
|-------|-----------|-----------|
| Frontend framework | **Next.js 15 + TypeScript** | Type safety, SSR for dashboards, static export option |
| 3D Globe | **CesiumJS** (retained) | Native CZML support; proven in prototype |
| 2D overlays | **Deck.gl** | WebGL heatmaps (Mode B), arc layers, hex grids |
| Server state | **TanStack Query** | Caching, background refetch, stale-while-revalidate. API responses never stored in Zustand. |
| UI state | **Zustand** | Pure UI state only: timeline mode, selected object, layer visibility, alert acknowledgements |
| URL state | **nuqs** | Shareable deep links; selected object/event/time reflected in URL |
| Backend framework | **FastAPI** (retained) | Async, OpenAPI auto-docs, Pydantic validation |
| Task queue | **Celery + Redis** | Battle-tested for scientific compute; Flower monitoring |
| Catalog propagation | **`sgp4`** | SGP4/SDP4; catalog tracking only, not decay prediction |
| Numerical integrator | **`scipy.integrate.DOP853`** or custom **RK7(8)** | Adaptive step-size for Cowell decay prediction |
| Atmospheric density | **`nrlmsise00`** Python wrapper | NRLMSISE-00; driven by F10.7 and Ap |
| Frame transformations | **`astropy`** | IAU 2006 precession/nutation, IERS EOP, TEME→GCRF→ITRF |
| Astrodynamics utilities | **`poliastro`** (optional) | Conjunction geometry helpers |
| Auth | **`python-jose`** (RS256 JWT) + **`pyotp`** (TOTP MFA) | Asymmetric JWT; TOTP RFC 6238 |
| Rate limiting | **`slowapi`** | Redis token bucket; per-user and per-IP limits |
| HTML sanitisation | **`bleach`** | User-supplied content before Playwright rendering |
| Password hashing | **`passlib[bcrypt]`** | bcrypt cost factor ≥ 12 |
| Database | **TimescaleDB + PostGIS** (retained) | Time-series + geospatial; RLS for multi-tenancy |
| Cache / broker | **Redis 7** | Broker + pub/sub: `maxmemory-policy noeviction` (Celery queues must never be evicted). Separate Redis DB index for application cache: `allkeys-lru`. AUTH + TLS in production. |
| Connection pooler | **PgBouncer 1.22** | Transaction-mode pooling between all app services and TimescaleDB. Prevents connection exhaustion at Tier 3; single failover target for Patroni switchover. `max_client_conn=200`, `default_pool_size=20`. Pool sizing derivation (F2 — §58): PostgreSQL `max_connections=50`; reserve 5 for superuser/admin; 45 available server connections. `default_pool_size=20` per pool (one pool per DB user); leaves headroom for Alembic migrations and ad-hoc DBA access. `max_client_conn=200` = (2 backend workers × 40 async connections) + (4 sim workers × 16 threads) + (2 ingest workers × 4) = 152 peak; 200 provides burst headroom. Validate with `SHOW pools;` in `psql -h pgbouncer` — `cl_waiting > 0` sustained means the pool is undersized. |
| Object storage | **MinIO** | Private buckets; pre-signed URLs only |
| Containerisation | **Docker Compose** (retained); **Caddy** as TLS-terminating reverse proxy | Single-command dev; HTTPS auto-provisioning |
| Testing — backend | **pytest + hypothesis** | Property-based tests for numerical and security invariants |
| Testing — frontend | **Vitest + Playwright** | Unit tests + E2E including security header checks |
| SAST — Python | **Bandit** | Static analysis; CI blocks on High severity |
| SAST — TypeScript | **ESLint security plugin** | Static analysis; CI blocks on High severity |
| Container scanning | **Trivy** | CI blocks on Critical/High CVEs |
| DAST | **OWASP ZAP** | Phase 2 pipeline against staging |
| Dependency management | **pip-tools** + **npm ci** | Pinned hashes; `--require-hashes` |
| Report rendering | **Playwright headless** (isolated `renderer` container) | Server-side globe screenshot; no client-side canvas |
| Secrets management | **Docker secrets** (Phase 1 production) → **HashiCorp Vault** (Phase 3) | |
| Task scheduler HA | **`celery-redbeat`** | Redis-backed Beat scheduler; distributed locking; multiple instances safe |
| DB HA / failover | **Patroni** + **etcd** | Automatic TimescaleDB primary/standby failover; ≤ 30s RTO |
| Redis HA | **Redis Sentinel** (3 nodes) | Master failover ≤ 10s; transparent to application via `redis-py` Sentinel client |
| Monitoring | **Prometheus + Grafana** | Business-level metrics from Phase 1; four dashboards (§26.7); AlertManager with runbook links |
| Log aggregation | **Grafana Loki + Promtail** | Phase 2; Promtail scrapes Docker log files; Loki stores and queries; co-deployed with Grafana; no index servers required |
| Distributed tracing | **OpenTelemetry → Grafana Tempo** | Phase 2; FastAPI + SQLAlchemy + Celery auto-instrumented; OTLP exporter; trace_id = request_id for log correlation; ADR 0017 |
| Structured logging | **structlog** | JSON structured logs with required fields; sanitising processor strips secrets; `request_id` propagated through HTTP → Celery chain |
| On-call alerting | **PagerDuty or OpsGenie** | Routes Prometheus AlertManager alerts; L1/L2/L3 escalation tiers (§26.8) |
| CI/CD pipeline | **GitLab CI** | Native to the self-hosted GitLab monorepo; stage-based builds for Python/Node; protected environments and approval rules for deploys |
| Container registry | **GitLab Container Registry** | Co-located with source; `sha-<commit>` is the canonical immutable tag; `latest` tag is forbidden in production deployments; image vulnerability attestations via `cosign` |
| Pre-commit | **`pre-commit` framework** | Hooks: `detect-secrets`, `ruff` (lint + format), `mypy` (type gate), `hadolint` (Dockerfile), `prettier` (JS/HTML), `sqlfluff` (migrations); spec in `.pre-commit-config.yaml`; same hooks re-run in CI |
| Local task runner | **`make`** | Standard targets: `make dev` (full-stack hot-reload), `make test` (pytest + vitest), `make migrate` (alembic upgrade head), `make seed` (fixture load), `make lint` (all pre-commit hooks), `make clean` (prune volumes) |

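The `max_client_conn` derivation in the PgBouncer row is plain arithmetic and can be re-checked directly (worker counts are the plan's stated assumptions):

```python
# Re-derive PgBouncer client-connection sizing from the table above.
backend_workers, conns_per_backend = 2, 40   # async connections per backend worker
sim_workers, threads_per_sim = 4, 16         # one connection per sim thread
ingest_workers, conns_per_ingest = 2, 4

peak = (backend_workers * conns_per_backend
        + sim_workers * threads_per_sim
        + ingest_workers * conns_per_ingest)

assert peak == 152        # matches the stated peak
assert peak < 200         # max_client_conn=200 leaves ~30% burst headroom
```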
---

## 11. Data Source Inventory

| Source | Data | Access | Priority |
|--------|------|--------|----------|
| **Space-Track.org** | TLE catalog, CDMs, object catalog, RCS data, TIP messages | REST API (account required); credentials in secrets manager | P1 |
| **CelesTrak** | TLE subsets (active sats, decaying objects) | Public REST API / CSV | P1 |
| **USSPACECOM TIP Messages** | Tracking and Impact Prediction for decaying objects | Via Space-Track.org | P1 |
| **NOAA SWPC** | F10.7, Ap/Kp, Dst, solar wind; 3-day forecasts | Public REST API and FTP | P1 |
| **ESA Space Weather Service** | F10.7, Kp cross-validation source | Public REST API | P1 |
| **ESA DISCOS** | Physical object properties: mass, dimensions, shape, materials | REST API (account required) | P1 |
| **IERS Bulletin A/B** | UT1-UTC offsets, polar motion | Public FTP (usno.navy.mil); dual-mirror agreement verified on download | P1 |
| **GFS / ECMWF** | Tropospheric winds and density 0–80 km | NOMADS (NOAA) public FTP | P2 |
| **ILRS / CDDIS** | Laser ranging POD products for validation | Public FTP | P2 (validation) |
| **FIR/UIR boundaries** | FIR and UIR boundary polygons for airspace intersection | EUROCONTROL AIRAC dataset (subscription) for ECAC states; FAA Digital-Terminal Procedures for US; OpenAIP as fallback for non-AIRAC regions. GeoJSON format loaded into `airspace` table. Updated every 28 days on AIRAC cycle. | P1 |

**Deprecated reference:** "18th SDS" → use **Space-Track.org** consistently.

**ESA DISCOS redistribution rights (Finding 9):** ESA DISCOS is subject to an ESAC user agreement. Data may not be redistributed or used in commercial products without explicit ESA permission. SpaceCom is a commercial platform. Required actions before Phase 2 shadow deployment:
- Obtain written clarification from ESA/ESAC on whether DISCOS-derived physical properties (mass, dimensions) may be: (a) used internally to drive SpaceCom's own predictions; (b) exposed in API responses to ANSP customers; (c) included in generated PDF reports
- If redistribution is not permitted, DISCOS data is used only as internal model input — API responses and reports show `source: estimated` rather than exposing raw DISCOS values; the `data_confidence` UI flag continues to show `● DISCOS` for internal tracking but is not labelled as DISCOS in customer-facing outputs
- Include the DISCOS redistribution clarification in the Phase 2 legal gate checklist alongside the Space-Track AUP opinion

**Airspace data scope and SUA disclosure (Finding 4):** Phase 2 FIR/UIR scope covers ECAC states (EUROCONTROL AIRAC) and US FIRs (FAA). The following airspace types are explicitly **out of scope for Phase 2** and disclosed to users:
- Special Use Airspace (SUA): danger areas, restricted areas, prohibited areas (ICAO Annex 11)
- Terminal Manoeuvring Areas (TMAs) and Control Zones (CTRs)
- Oceanic FIRs (ICAO Annex 2 special procedures; OACCs handle coordination)

A persistent disclosure note on the Airspace Impact Panel reads: *"SpaceCom FIR intersection analysis covers FIR/UIR boundaries only. It does not account for special use airspace, terminal areas, or oceanic procedures. Controllers must apply their local procedures for these airspace types."* Phase 3 consideration: SUA polygon overlay from national AIP sources. Document in `docs/adr/0014-airspace-scope.md`.

All source URLs are hardcoded constants in `ingest/sources.py`. The outbound HTTP client blocks connections to private IP ranges. No source URL is configurable via API or database at runtime.

**Space-Track AUP — conditional architecture (Finding 9):** The AUP clarification is a **Phase 1 architectural decision gate**, not a Phase 2 deliverable. The current design assumes shared ingest (a single SpaceCom Space-Track credential fetches TLEs for all organisations). If the AUP prohibits redistribution of derived predictions to customers who have not themselves agreed to the AUP, the ingest architecture must change:

- **Path A — redistribution permitted:** Current shared-ingest design is valid. Each customer organisation's access is governed by SpaceCom's AUP click-wrap and the MSA. No architectural change.
- **Path B — redistribution not permitted:** Per-organisation Space-Track credentials required. Each ANSP/operator must hold their own Space-Track account. SpaceCom acts as a processing layer using each org's own credentials. Architecture change: `space_track_credentials` table (per-org, encrypted); per-org ingest worker configuration; significant additional complexity.

The decision must be documented in `docs/adr/0016-space-track-aup-architecture.md` with the chosen path and evidence (written AUP clarification). This ADR is a prerequisite for Phase 1 ingest architecture finalisation — marked as a blocking decision in the Phase 1 DoD.

**Space weather raw format specifications:**

| Source | Endpoint constant | Format | Key fields consumed |
|--------|------------------|--------|-------------------|
| NOAA SWPC F10.7 | `NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"` | JSON array | `time_tag`, `flux` (solar flux units) |
| NOAA SWPC Kp/Ap | `NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"` | JSON array | `time_tag`, `kp_index`, `ap` |
| NOAA SWPC 3-day forecast | `NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"` | JSON | `Kp` array |
| ESA SWS Kp | `ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"` | REST JSON | `kp_index` (cross-validation) |

An integration test asserts that each response contains the expected top-level keys. If a key is absent, the test fails and the schema change is caught before it reaches production ingest.

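A minimal shape-check of the kind that integration test performs might look like this — the helper name and sample payload are illustrative, not the test suite's actual fixtures:

```python
def missing_keys(payload: object, required: set[str]) -> set[str]:
    """Return required top-level keys absent from an upstream response.

    NOAA SWPC feeds are JSON arrays of records, so inspect the first
    record; a non-dict payload fails the check for every required key."""
    record = payload[0] if isinstance(payload, list) and payload else payload
    if not isinstance(record, dict):
        return required
    return required - record.keys()

# e.g. each NOAA SWPC F10.7 record must carry time_tag and flux
sample = [{"time_tag": "2024-01-01T00:00:00Z", "flux": 142.1}]
assert missing_keys(sample, {"time_tag", "flux"}) == set()
assert missing_keys(sample, {"time_tag", "flux", "station"}) == {"station"}
```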
**TLE validation at ingestion gate:** Before any TLE record is written to the database, `ingest/cross_validator.py` must verify:
1. Both lines are exactly 69 characters (standard TLE format)
2. Modulo-10 checksum passes on line 1 and line 2
3. Epoch field parses to a valid UTC datetime
4. `BSTAR` drag term is within physically plausible bounds (−0.5 to +0.5)

Failed validation is logged to `security_logs` type `INGEST_VALIDATION_FAILURE` with the raw TLE and failure reason. The record is not written to the database.

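The modulo-10 checksum in step 2 is simple to state exactly: over the first 68 characters, digits contribute their value, a minus sign contributes 1, everything else contributes 0, and the sum mod 10 must equal the 69th character. A self-contained sketch (the helper name is ours, not the module's actual layout):

```python
def tle_checksum_ok(line: str) -> bool:
    """Validate the modulo-10 checksum of one 69-character TLE line.

    Per the standard TLE convention: digits count as their value,
    '-' counts as 1, letters/spaces/'+'/'.' count as 0."""
    if len(line) != 69 or not line[68].isdigit():
        return False
    total = sum(int(c) if c.isdigit() else (1 if c == "-" else 0)
                for c in line[:68])
    return total % 10 == int(line[68])

assert tle_checksum_ok("1" * 68 + "8")       # 68 ones -> 68 % 10 == 8
assert not tle_checksum_ok("1" * 68 + "9")
assert tle_checksum_ok("-" * 68 + "8")       # '-' counts as 1
assert not tle_checksum_ok("too short")
```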
**TLE ingest idempotency — ON CONFLICT behaviour:** The `tle_sets` table has `UNIQUE (object_id, ingested_at)`. If the ingest worker runs twice for the same object within the same second (e.g., orphan recovery task + normal schedule overlap, or a worker restart mid-task), the second insert must not raise an exception or silently discard the row without tracking. Required semantics:

```python
# ingest/writer.py
async def write_tle_set(session: AsyncSession, tle: TLERecord) -> bool:
    """Insert TLE record. Returns True if inserted, False if duplicate."""
    stmt = pg_insert(TLESet).values(
        object_id=tle.object_id,
        ingested_at=tle.ingested_at,
        tle_line1=tle.line1,
        tle_line2=tle.line2,
        epoch=tle.epoch,
        source=tle.source,
    ).on_conflict_do_nothing(
        index_elements=["object_id", "ingested_at"]
    ).returning(TLESet.object_id)

    result = await session.execute(stmt)
    # RETURNING yields a row only when the insert actually happened;
    # driver rowcount semantics for INSERT ... RETURNING are not reliable
    inserted = result.scalar_one_or_none() is not None
    if not inserted:
        spacecom_ingest_tle_conflict_total.inc()  # metric; non-zero signals scheduling race
        structlog.get_logger().debug("tle_insert_skipped_duplicate",
                                     object_id=tle.object_id,
                                     ingested_at=tle.ingested_at)
    return inserted
```

Prometheus counter `spacecom_ingest_tle_conflict_total` — a sustained non-zero rate warrants investigation of the Beat schedule overlap. A brief spike during worker restart is acceptable.

**Ingest idempotency requirement for all periodic tasks (F8 — §67):** TLE ingest uses `ON CONFLICT DO NOTHING` (above). All other periodic ingest tasks must use equivalent upsert semantics to survive celery-redbeat double-fire on restart:

```sql
-- Space weather ingest: upsert on the (time) unique constraint
-- ('time' is the hypertable partition column; TimescaleDB requires unique indexes to include it)
INSERT INTO space_weather (time, kp, f107, ...)
VALUES (:time, :kp, :f107, ...)
ON CONFLICT (time) DO NOTHING;

-- DISCOS object metadata: upsert on (norad_id) — update if data changed
INSERT INTO objects (norad_id, name, launch_date, ...)
VALUES (:norad_id, :name, :launch_date, ...)
ON CONFLICT (norad_id) DO UPDATE SET
    name = EXCLUDED.name,
    launch_date = EXCLUDED.launch_date,
    updated_at = NOW()
WHERE objects.updated_at < EXCLUDED.updated_at;  -- only update if newer

-- IERS EOP: upsert on (date) unique constraint
INSERT INTO iers_eop (date, ut1_utc, x_pole, y_pole, ...)
VALUES (:date, :ut1_utc, :x_pole, :y_pole, ...)
ON CONFLICT (date) DO NOTHING;
```

Add unique constraints if not present: `UNIQUE (time)` on `space_weather` (the hypertable partition column, so TimescaleDB permits the index); `UNIQUE (date)` on `iers_eop`. These prevent double-write corruption at the DB level regardless of application retry logic.

**IERS EOP cold-start requirement:** On a fresh deployment with no cached EOP data, astropy's `IERS_Auto` falls back to the bundled IERS-B table (which lags the current date by weeks to months), silently degrading `UT1-UTC` precision from ~1 ms (IERS-A) to ~10–50 ms (IERS-B). For epochs beyond the IERS-B table end date, astropy raises `IERSRangeError`, crashing all frame transforms.

The EOP ingest task must run as part of `make seed` before any propagation task starts:

```bash
# Makefile
seed: migrate
	docker compose exec backend python -m ingest.eop --bootstrap   # downloads + caches current IERS-A
	docker compose exec backend python -m ingest.fir --bootstrap   # loads FIR boundaries
	docker compose exec backend psql "$$DATABASE_URL" -f fixtures/dev_seed.sql   # SQL fixture — load via psql, not python
```

The EOP ingest task in Celery Beat is ordered before the TLE ingest task: EOP runs at 00:00 UTC, TLE ingest at 00:10 UTC (ensuring fresh EOP before the first propagation of the day).

**IERS EOP verification — dual-mirror comparison:** The IERS does not publish SHA-256 hashes alongside its EOP files. Comparing a hash against the prior download detects corruption but not substitution. The correct approach is downloading from both the USNO mirror and the Paris Observatory mirror and verifying agreement:

```python
# ingest/eop.py
IERS_MIRRORS = [
    "https://maia.usno.navy.mil/ser7/finals2000A.all",
    "https://hpiers.obspm.fr/iers/series/opa/eopc04",  # IERS C04 series
]

async def fetch_and_verify_eop() -> bytes:
    contents = []
    for url in IERS_MIRRORS:
        resp = await http_client.get(url, timeout=30)
        resp.raise_for_status()
        contents.append(resp.content)

    # Verify UT1-UTC values agree within 0.1 ms across mirrors (format-normalised comparison)
    if not _eop_values_agree(contents[0], contents[1], tolerance_ms=0.1):
        structlog.get_logger().error("eop_mirror_disagreement")
        spacecom_eop_mirror_agreement.set(0)
        raise EOPVerificationError("IERS EOP mirrors disagree — rejecting both")

    spacecom_eop_mirror_agreement.set(1)
    return contents[0]  # USNO is primary; Paris Observatory is the verification witness
```

Prometheus gauge `spacecom_eop_mirror_agreement` (1 = mirrors agree, 0 = disagreement detected). Alert on `spacecom_eop_mirror_agreement == 0`.

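A minimal sketch of the `_eop_values_agree` comparison, operating on already-parsed `{MJD: UT1-UTC seconds}` series rather than raw file bytes — the two mirrors use different file formats, so the per-format parsing layer is elided and the helper name here is illustrative:

```python
def eop_series_agree(a: dict[int, float], b: dict[int, float],
                     tolerance_ms: float = 0.1) -> bool:
    """Compare UT1-UTC across two mirrors on their overlapping dates.

    Values are seconds; tolerance is milliseconds. An empty overlap is
    treated as disagreement — it is itself suspicious."""
    overlap = a.keys() & b.keys()
    if not overlap:
        return False
    return all(abs(a[mjd] - b[mjd]) * 1000.0 <= tolerance_ms
               for mjd in overlap)

usno = {60000: -0.0123456, 60001: -0.0123900}
paris = {60000: -0.0123456, 60001: -0.0123901}
assert eop_series_agree(usno, paris)             # 0.0001 ms apart on MJD 60001
assert not eop_series_agree(usno, {60001: 0.5})  # gross disagreement
```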
---

## 12. Backend Directory Structure

```
backend/
  app/
    main.py                  # FastAPI app factory, middleware, router mounting
    config.py                # Settings via pydantic-settings (env vars); no secrets in code
    auth/
      provider.py            # AuthProvider protocol + LocalJWTProvider implementation
      jwt.py                 # RS256 token issue, verify, refresh; key loaded from secrets
      mfa.py                 # TOTP (pyotp); recovery code generation and verification
      deps.py                # get_current_user, require_role() dependency factory
      middleware.py          # Auth middleware; rate limit enforcement
    frame_utils.py           # TEME→GCRF→ITRF→WGS84 + IERS EOP refresh + dual-mirror verification
    time_utils.py            # Time system conversions
    integrity.py             # HMAC sign/verify for predictions and hazard zones
    logging_config.py        # Sanitising log formatter; security event logger
    modules/
      catalog/
        router.py            # /api/v1/objects; requires viewer role minimum
        schemas.py
        service.py
        models.py
      propagator/
        catalog.py           # SGP4 catalog propagation
        decay.py             # RK7(8) + NRLMSISE-00 + Monte Carlo; HMAC-signs output
        tasks.py             # Celery tasks with time_limit, soft_time_limit
        router.py            # /api/v1/propagate, /api/v1/decay; requires analyst role
      reentry/
        router.py            # /api/v1/reentry; requires viewer role
        service.py
        corridor.py          # Percentile corridor polygon generation
      spaceweather/
        router.py            # /api/v1/spaceweather; requires viewer role
        service.py           # Cross-validates NOAA SWPC vs ESA SWS; generates status string
        tasks.py             # Celery Beat: NOAA SWPC polling every 3h
        noaa_swpc.py         # NOAA SWPC client; URL hardcoded constant
        esa_sws.py           # ESA SWS cross-validation client
      viz/
        router.py            # /api/v1/czml; requires viewer role
        czml_builder.py      # CZML output; all strings HTML-escaped; J2000 INERTIAL frame
        mc_geometry.py       # MC trajectory binary blob pre-baking
      ingest/
        sources.py           # Hardcoded external URLs and IP allowlists (SSRF mitigation)
        tasks.py             # Celery Beat-scheduled tasks
        spacetrack.py        # Space-Track client; credentials from secrets manager only
        celestrak.py         # CelesTrak client
        discos.py            # ESA DISCOS client
        iers.py              # IERS EOP fetcher + dual-mirror verification
        cross_validator.py   # TLE and space weather cross-source comparison
      alerts/
        router.py            # /api/v1/alerts; requires operator role for acknowledge
        service.py           # Alert trigger evaluation; rate limit enforcement; deduplication
        notifier.py          # WebSocket push + email; storm detection
        integrity_guard.py   # TIP vs prediction cross-check; HMAC failure escalation
      reports/
        router.py            # /api/v1/reports; requires analyst role
        builder.py           # Section assembly; all user fields sanitised via bleach
        renderer_client.py   # Internal HTTPS call to renderer service with sanitised payload
      security/
        audit.py             # Security event logger; writes to security_logs
        sanitiser.py         # Log formatter that strips credential patterns
      breakup/
        atmospheric.py
        on_orbit.py
        tasks.py
        router.py
      conjunction/
        screener.py
        probability.py
        tasks.py
        router.py
      weather/
        upper.py
        lower.py
      hazard/
        router.py
        fusion.py            # HMAC-signs all hazard_zones output; propagates shadow_mode flag
        tasks.py
      airspace/
        router.py
        loader.py
        intersection.py
      notam/
        router.py            # /api/v1/notam; requires operator role
        drafter.py           # ICAO Annex 15 format generation
        disclaimer.py        # Mandatory regulatory disclaimer text
      space_portal/
        router.py            # /api/v1/space; space_operator and orbital_analyst roles
        owned_objects.py     # Owned object CRUD; RLS enforcement
        controlled_reentry.py  # Deorbit window optimisation
        ccsds_export.py      # CCSDS OEM/CDM format export
        api_keys.py          # API key lifecycle management
      launch_safety/         # Phase 3
        screener.py
        router.py
      reroute/               # Phase 3; strategic pre-flight avoidance boundary only
      feedback/              # Phase 3; includes shadow_validation.py
  migrations/                # Alembic; includes immutability triggers in initial migration
  tests/
    conftest.py              # db_session fixture (SAVEPOINT/ROLLBACK); testcontainers setup for Celery tests
    physics/
      test_frame_utils.py
      test_propagator/
      test_decay/
        test_nrlmsise.py
        test_hypothesis.py   # Hypothesis property-based tests (§42.3)
        test_mc_corridor.py  # MC seeded RNG corridor validation (§42.4)
      test_breakup/
    test_integrity.py        # HMAC sign/verify; tamper detection
    test_auth.py             # JWT; MFA; rate limiting; RBAC enforcement
    test_rbac.py             # Every endpoint tested for correct role enforcement
    test_websocket.py        # WS sequence replay; token expiry warning; close codes 4001/4002
    test_ingest/
      test_contracts.py      # Space-Track + NOAA key presence AND value-range assertions
    test_spaceweather/
    test_jobs/
      test_celery_failure.py # Timeout → 'failed'; orphan recovery Beat task
    smoke/                   # Post-deploy; all idempotent; run in ≤ 2 min; require smoke_user seed
      test_api_health.py     # GET /readyz → 200/207; GET /healthz → 200
      test_auth_smoke.py     # Login → JWT; refresh → new token
      test_catalog_smoke.py  # GET /catalog → 200; 'data' key present
      test_ws_smoke.py       # WS connect → heartbeat within 5s
      test_db_smoke.py       # SELECT 1 via backend health endpoint
    quarantine/              # Flaky tests awaiting fix; excluded from blocking CI (see §33.10 policy)
  requirements.in            # pip-tools source
  requirements.txt           # pip-compile output with hashes
  Dockerfile                 # FROM pinned digest; non-root user; read-only FS
```

### 12.1 Repository `docs/` Directory Structure

All documentation files live under `docs/` in the monorepo root. Files referenced elsewhere in this plan must exist at these paths.

```
|
||
docs/
|
||
README.md # Documentation index — what's here and where to look
|
||
MASTER_PLAN.md # This document
|
||
AGENTS.md # Guidance for AI coding agents working in this repo (see §33.9)
|
||
CHANGELOG.md # Keep a Changelog format; human-maintained; one entry per release
|
||
|
||
adr/ # Architecture Decision Records (MADR format)
|
||
README.md # ADR index with status column
|
||
0001-rs256-asymmetric-jwt.md
|
||
0002-dual-frontend-architecture.md
|
||
0003-monte-carlo-chord-pattern.md
|
||
0004-geography-vs-geometry-spatial-types.md
|
||
  0005-lazy-raise-sqlalchemy.md
  0006-timescaledb-chunk-intervals.md
  0007-cesiumjs-commercial-licence.md
  0008-pgbouncer-transaction-mode.md
  0009-ccsds-oem-gcrf-reference-frame.md
  0010-alert-threshold-rationale.md
  # ... continued; one ADR per consequential decision in §20

runbooks/
  README.md                          # Runbook index with owner and last-reviewed date
  TEMPLATE.md                        # Standard runbook template (see §33.4)
  db-failover.md
  celery-recovery.md
  hmac-failure.md
  ingest-failure.md
  gdpr-breach-notification.md
  safety-occurrence-notification.md
  secrets-rotation-jwt.md
  secrets-rotation-spacetrack.md
  secrets-rotation-hmac.md
  blue-green-deploy.md
  restore-from-backup.md

model-card-decay-predictor.md        # Living document; updated per model version (§32.1)
ood-bounds.md                        # OOD detection thresholds (§32.3)
recalibration-procedure.md           # Recalibration governance (§32.4)
alert-threshold-history.md           # Alert threshold change log (§24.8)

query-baselines/                     # EXPLAIN ANALYZE output; one file per critical query
  czml_catalog_100obj.txt
  fir_intersection_baseline.txt
  # ... one file per query baseline recorded in Phase 1

validation/                          # Validation procedure and reference data (§17)
  README.md                          # How to run each validation suite
  reference-data/
    vallado-sgp4-cases.json          # Vallado (2013) SGP4 reference state vectors
    iers-frame-test-cases.json       # IERS precession-nutation reference cases
    aerospace-corp-reentries.json    # Historical re-entry outcomes for backcast validation
  backcast-validation-v1.0.0.pdf     # Phase 1 validation report (≥3 events)
  backcast-validation-v2.0.0.pdf     # Phase 2 validation report (≥10 events)

api-guide/                           # Persona E/F API developer documentation (§33.10)
  README.md                          # API guide index
  authentication.md
  rate-limiting.md
  webhooks.md
  code-examples/
    python-quickstart.py
    typescript-quickstart.ts
  error-reference.md

user-guides/                         # Operational persona documentation (§33.7)
  aviation-portal-guide.md           # Persona A/B/C
  space-portal-guide.md              # Persona E/F
  admin-guide.md                     # Persona D

test-plan.md                         # Test suite index with scope and blocking classification (§33.11)

public-reports/                      # Quarterly transparency reports (§32.6)
  # quarterly-accuracy-YYYY-QN.pdf

legal/                               # Legal opinion documents (MinIO primary; this dir for dev reference)
  # legal-opinion-template.md
```

---

## 13. Frontend Directory Structure and Architecture

```
frontend/
  src/
    app/
      page.tsx                       # Operational Overview
      watch/[norad_id]/page.tsx      # Object Watch Page
      events/
        page.tsx                     # Active Events + full Timeline/Gantt
        [id]/page.tsx                # Event Detail
      airspace/page.tsx              # Airspace Impact View
      analysis/page.tsx              # Analyst Workspace
      catalog/page.tsx               # Object Catalog
      reports/
        page.tsx
        [id]/page.tsx
      admin/page.tsx                 # System Administration (admin role only)
      space/
        page.tsx                     # Space Operator Overview
        objects/
          page.tsx                   # My Objects Dashboard (space_operator: owned only)
          [norad_id]/page.tsx        # Object Technical Detail
        reentry/
          plan/page.tsx              # Controlled Re-entry Planner
        conjunction/page.tsx         # Conjunction Screening (orbital_analyst)
        analysis/page.tsx            # Orbital Analyst Workspace
        export/page.tsx              # Bulk Export
        api/page.tsx                 # API Keys + Documentation
      layout.tsx                     # Root layout: nav, ModeIndicator, AlertBadge,
                                     # JobsPanel; applies security headers via middleware

    middleware.ts                    # Next.js middleware: enforce HTTPS, set CSP
                                     # and security headers on every response,
                                     # redirect unauthenticated users to /login

    components/
      globe/
        CesiumViewer.tsx
        LayerPanel.tsx
        ViewToggle.tsx
        ClusterLayer.tsx
        CorridorLayer.tsx
      corridor/
        PercentileCorridors.tsx      # Mode A
        ProbabilityHeatmap.tsx       # Mode B (Phase 2)
        ParticleTrajectories.tsx     # Mode C (Phase 3)
        UncertaintyModeSelector.tsx
      plan/
        PlanView.tsx                 # Phase 2
        AltitudeCrossSection.tsx     # Phase 2
      timeline/
        TimelineStrip.tsx
        TimelineGantt.tsx
        TimelineControls.tsx
        ModeIndicator.tsx
      panels/
        ObjectInfoPanel.tsx
        PredictionPanel.tsx          # Includes HMAC status indicator
        AirspaceImpactPanel.tsx      # Phase 2
        ConjunctionPanel.tsx         # Phase 2
      alerts/
        AlertBanner.tsx
        AlertBadge.tsx
        NotificationCentre.tsx
        AcknowledgeDialog.tsx
      jobs/
        JobsPanel.tsx
        JobProgressBar.tsx
        SimulationComparison.tsx
      spaceweather/
        SpaceWeatherWidget.tsx
      reports/
        ReportConfigDialog.tsx
        ReportPreview.tsx
      space/
        SpaceOverview.tsx
        OwnedObjectCard.tsx
        ControlledReentryPlanner.tsx
        DeorbitWindowList.tsx
        ApiKeyManager.tsx
        CcsdsExportPanel.tsx
        ShadowBanner.tsx             # Amber banner displayed when shadow mode active
      notam/
        NotamDraftViewer.tsx
        NotamCancellationDialog.tsx
        NotamRegulatoryDisclaimer.tsx
      shadow/
        ShadowModeIndicator.tsx
        ShadowValidationReport.tsx
      dashboard/
        EventSummaryCard.tsx
        SystemHealthCard.tsx
      shared/
        DataConfidenceBadge.tsx
        IntegrityStatusBadge.tsx     # ✓ HMAC verified / ✗ HMAC failed
        UncertaintyBound.tsx
        CountdownTimer.tsx

    hooks/
      useObjects.ts
      usePrediction.ts               # Polls HMAC status; shows warning if failed
      useEphemeris.ts
      useSpaceWeather.ts
      useAlerts.ts
      useSimulation.ts
      useCZML.ts
      useWebSocket.ts                # Cookie-based auth; per-user connection limit

    stores/                          # Zustand — UI state only; no API responses
      timelineStore.ts               # Mode, playhead position, playback speed
      selectionStore.ts              # Selected object/event/zone IDs
      layerStore.ts                  # Layer visibility, corridor display mode
      jobsStore.ts                   # Active job IDs (content fetched via TanStack Query)
      alertStore.ts                  # Unread count, mute rules
      uiStore.ts                     # Panel state, theme (dark/light/high-contrast)

    lib/
      api.ts                         # Typed fetch wrapper; credentials: 'include'
                                     # for httpOnly cookie auth; never reads tokens
      czml.ts
      ws.ts                          # wss:// enforced; cookie auth at upgrade
      corridorGeometry.ts
      mcBinaryDecoder.ts
      reportUtils.ts

    types/
      objects.ts
      predictions.ts                 # Includes hmac_status, integrity_failed fields
      alerts.ts
      spaceweather.ts
      simulation.ts
      czml.ts

  public/
    branding/
  middleware.ts                      # Root Next.js middleware for security headers
  next.config.ts                     # Content-Security-Policy defined here for SSR
  tsconfig.json
  package.json
  package-lock.json                  # Committed; npm ci used in Docker builds
```

### 13.0 Accessibility Standard Commitment

**Minimum standard: WCAG 2.1 Level AA** (ISO/IEC 40500:2012), which is incorporated by reference into **EN 301 549 v3.2.1** — the mandatory accessibility standard for ICT procured by EU public sector bodies including ESA. Failure to meet EN 301 549 is a bid disqualifier for any EU public sector tender.

All frontend work must meet these criteria before a PR is merged:
- WCAG 2.1 AA automated check passes (`axe-core` — see §42)
- Keyboard-only operation possible for all primary operator workflows
- Screen reader (NVDA + Firefox; VoiceOver + Safari) tested for the primary workflow on each release
- Colour contrast ≥ 4.5:1 for all informational text; ≥ 3:1 for UI components and graphical elements
- No functionality conveyed by colour alone

**Deliverable:** Accessibility Conformance Report (ACR / VPAT 2.4) produced before Phase 2 ESA bid submission. Maintained thereafter for each major release.

**UTC-only rule for operational interface (F1):** ICAO Annex 2 and Annex 15 mandate UTC for all aeronautical operational communications. The following is a hard rule — no exceptions without explicit documentation and legal/safety sign-off:
- All times displayed in Persona A/C operational views (alert panels, event detail, NOTAM draft, shift handover) are **UTC only**, formatted as `HH:MMZ` or `DD MMM YYYY HH:MMZ`
- No timezone conversion widget or local-time toggle in the operational interface
- Local time display is permitted only in non-operational views (account settings, admin billing pages) and must be clearly labelled with the timezone name
- The `Z` suffix or `UTC` label is persistently visible — never hidden in a tooltip or hover state
- All API timestamps returned as ISO 8601 UTC (`2026-03-22T14:00:00Z`) — never local time strings
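The two display formats above can be produced with a small pair of helpers. This is a minimal sketch — the function names are illustrative, not existing code — that formats via the `getUTC*` accessors only, so the browser timezone can never leak into the operational display:

```typescript
// Illustrative helpers for the ICAO operational time formats above.
// Always UTC: only getUTC* accessors are used, never local-time getters.
const MONTHS = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
                'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'];

function pad2(n: number): string {
  return n.toString().padStart(2, '0');
}

// "HH:MMZ" — short form for alert panels
function formatUtcTime(d: Date): string {
  return `${pad2(d.getUTCHours())}:${pad2(d.getUTCMinutes())}Z`;
}

// "DD MMM YYYY HH:MMZ" — long form for event detail and shift handover
function formatUtcDateTime(d: Date): string {
  return `${pad2(d.getUTCDate())} ${MONTHS[d.getUTCMonth()]} ` +
         `${d.getUTCFullYear()} ${formatUtcTime(d)}`;
}
```

The persistent `Z` suffix is baked into the format strings, satisfying the "never hidden in a tooltip" rule by construction.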
---

### 13.1 State Management Separation

**TanStack Query:** All API-derived data — object lists, predictions, ephemeris, space weather, alerts, simulation results. Handles caching, background refetch, and stale-while-revalidate.

**Zustand:** Pure UI state with no server dependency — selected IDs, layer visibility, timeline mode and position, panel open/closed state, theme, alert mute rules.

**URL state (nuqs):** Shareable, bookmarkable — selected NORAD ID, active event ID, time position in replay mode, active layer set. Browser back/forward works correctly. Requires `NuqsAdapter` wrapping the App Router root layout to hydrate correctly on SSR.

**Never in state:** Raw API response bodies. No `useEffect` that writes API responses into Zustand.

**Authentication in the client:** The `api.ts` fetch wrapper uses `credentials: 'include'` to send the `httpOnly` auth cookie automatically. The client never reads, stores, or handles the JWT token directly — it is invisible to JavaScript. CSRF is mitigated by `SameSite=Strict` on the cookie.

**Next.js App Router component boundary (ADR 0018):** The project uses **App Router**. The globe and all operational views are client components; static pages (onboarding, settings, admin) are React Server Components where practical.

| Route group | RSC/Client | Rationale |
|---|---|---|
| `app/(globe)/` — operational views | `"use client"` root layout | CesiumJS, WebSocket, Zustand hooks require browser APIs |
| `app/(static)/` — onboarding, settings | Server Components by default | No browser APIs needed; faster initial load |
| `app/(auth)/` — login, MFA | Server Components + Client islands | Form validation islands only |

Rules enforced in `AGENTS.md`:
- Never add `"use client"` to a leaf component without a comment explaining which browser API requires it
- `app/(globe)/layout.tsx` is the single `"use client"` boundary for all operational views — child components inherit it without re-declaring
- `nuqs` requires `<NuqsAdapter>` at the root of `app/(globe)/layout.tsx`

**TanStack Query key factory** (`src/lib/queryKeys.ts`) — stable hierarchical keys prevent cache invalidation bugs:

```typescript
export const queryKeys = {
  objects: {
    all: () => ['objects'] as const,
    list: (f: ObjectFilters) => ['objects', 'list', f] as const,
    detail: (id: number) => ['objects', 'detail', id] as const,
    tleHistory: (id: number) => ['objects', id, 'tle-history'] as const,
  },
  predictions: {
    byObject: (id: number) => ['predictions', id] as const,
  },
  alerts: {
    all: () => ['alerts'] as const,
    unacked: (orgId: number) => ['alerts', 'unacked', orgId] as const,
  },
  jobs: {
    detail: (jobId: string) => ['jobs', jobId] as const,
  },
} as const;
// On WS alert.new: queryClient.invalidateQueries({ queryKey: queryKeys.alerts.all() })
// On acknowledge mutation: optimistic setQueryData, then invalidate on settle
```

**React error boundary hierarchy** — a CesiumJS crash must never remove the alert panel from the DOM:

```tsx
// app/(globe)/layout.tsx
<AppErrorBoundary fallback={<AppCrashPage />}>
  <GlobeErrorBoundary fallback={<GlobeUnavailable />}>
    <GlobeCanvas />          {/* WebGL context loss isolated here */}
  </GlobeErrorBoundary>
  <PanelErrorBoundary name="alerts">
    <AlertPanel />           {/* Survives globe crash */}
  </PanelErrorBoundary>
  <PanelErrorBoundary name="events">
    <EventList />
  </PanelErrorBoundary>
</AppErrorBoundary>
```

`GlobeUnavailable` displays: *"Globe unavailable — WebGL context lost. Re-entry event data below remains operational."* Alert and event panels remain visible and functional. Add `GlobeErrorBoundary` to the `AGENTS.md` safety-critical component list.

**Loading and empty state specification** — for safety-critical panels, loading and empty must be visually distinct from each other and from error:

| State | Visual treatment | Required text |
|---|---|---|
| Loading | Skeleton matching panel layout | — |
| Empty | Explicit affirmative message | `AlertPanel`: "No unacknowledged alerts"; `EventList`: "No active re-entry events" |
| Error | Inline error with retry button | Never blank |

Rule: safety-critical panels (`AlertPanel`, `EventList`, `PredictionPanel`) must **never render blank**. `DataConfidenceBadge` must always show a value — display `"Unknown"` explicitly, never render nothing.

**WebSocket reconnection policy** (`src/lib/ws.ts`):

```typescript
const RECONNECT = {
  initialDelayMs: 1_000,
  maxDelayMs: 30_000,
  multiplier: 2,
  jitter: 0.2, // ±20% — spreads reconnections after mass outage/deploy
};
// TOKEN_EXPIRY_WARNING handler: trigger silent POST /auth/token/refresh;
// on success send AUTH_REFRESH; on failure show re-login modal (60s grace before disconnect)
// Reconnect sends ?since_seq=<last_seq> for missed event replay
```
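The `RECONNECT` constants imply an exponential backoff schedule of 1 s, 2 s, 4 s, … capped at 30 s, with ±20% jitter applied to each delay. A minimal sketch of that computation (the function name is illustrative; the random source is injected so the schedule is deterministic under test):

```typescript
interface ReconnectPolicy {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitter: number; // fraction, e.g. 0.2 for ±20%
}

// attempt is zero-based; rand is a [0, 1) source (Math.random in production,
// injected here so tests can pin the jitter).
function reconnectDelayMs(
  policy: ReconnectPolicy,
  attempt: number,
  rand: () => number = Math.random,
): number {
  // Exponential growth, capped at maxDelayMs
  const base = Math.min(
    policy.initialDelayMs * Math.pow(policy.multiplier, attempt),
    policy.maxDelayMs,
  );
  // Spread across [base * (1 - jitter), base * (1 + jitter)] to avoid a
  // thundering herd of simultaneous reconnections after a deploy or outage
  const factor = 1 + policy.jitter * (2 * rand() - 1);
  return Math.round(base * factor);
}
```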

**Operational mode guard** (`src/hooks/useModeGuard.ts`) — enforces LIVE/SIMULATION/REPLAY write restrictions:

```typescript
export function useModeGuard(allowedModes: OperationalMode[]) {
  const { mode } = useTimelineStore();
  return { isAllowed: allowedModes.includes(mode), currentMode: mode };
}
// Usage: const { isAllowed } = useModeGuard(['LIVE']);
// All write-action components (acknowledge alert, submit NOTAM draft, trigger prediction)
// must call useModeGuard(['LIVE']) and disable + annotate button in other modes.
```

**Deck.gl + CesiumJS integration** — use `DeckLayer` from `@deck.gl/cesium` (rendered inside CesiumJS as a primitive; correct z-order and shared input handling). Never use a separate Deck.gl canvas:

```typescript
import { DeckLayer } from '@deck.gl/cesium';
import { HeatmapLayer } from '@deck.gl/aggregation-layers';

const deckLayer = new DeckLayer({
  layers: [new HeatmapLayer({ id: 'mc-heatmap', data: mcTrajectories,
    getPosition: d => [d.lon, d.lat], getWeight: d => d.weight,
    radiusPixels: 30, intensity: 1, threshold: 0.03 })],
});
viewer.scene.primitives.add(deckLayer);
// Remove when switching away from Mode B: viewer.scene.primitives.remove(deckLayer)
```

**CesiumJS client-side memory constraints:**

| Constraint | Value | Enforcement |
|---|---|---|
| Max CZML entity count in globe | 500 | Prune lowest-perigee objects beyond 500; `useCZML` monitors count |
| Orbit path duration | 72h forward / 24h back | Longer paths accumulate geometry |
| Heatmap cell resolution (Mode B) | 0.5° × 0.5° | Higher resolution requires more GPU memory |
| Stale entity pruning | Remove entities not updated in 48h | Prevents ghost entities in long sessions |
| Globe entity count Prometheus metric | `spacecom_globe_entity_count` (gauge) | WARNING alert at 450; prune trigger at 500 |
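The 500-entity cap and 48h stale rule from the table compose into a single pruning pass. A sketch of the logic `useCZML` could apply — the types and function name are illustrative, and it assumes the intent is to *keep* the lowest-perigee (most re-entry-relevant) objects when over the cap:

```typescript
interface GlobeEntity {
  noradId: number;
  perigeeKm: number;
  lastUpdated: number; // epoch ms
}

const MAX_ENTITIES = 500;                 // hard cap from the constraints table
const STALE_MS = 48 * 3600 * 1000;        // 48h stale-entity rule

// Returns the entity set to keep: stale entities are dropped first, then the
// lowest-perigee entities are retained up to the cap.
function pruneEntities(
  entities: GlobeEntity[],
  nowMs: number,
  maxEntities: number = MAX_ENTITIES,
): GlobeEntity[] {
  return entities
    .filter(e => nowMs - e.lastUpdated <= STALE_MS) // remove ghost entities
    .sort((a, b) => a.perigeeKm - b.perigeeKm)      // lowest perigee = highest priority
    .slice(0, maxEntities);
}
```

The kept set's length is also what would feed the `spacecom_globe_entity_count` gauge.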

**Bundle size budget and dynamic imports:**

| Bundle | Strategy | Budget (gzipped) |
|---|---|---|
| Login / onboarding / settings | Static; no CesiumJS/Deck.gl | < 200 KB |
| Globe route initial load | CesiumJS lazy-loaded; spinner shown | < 500 KB before CesiumJS |
| Globe fully loaded | CesiumJS + Deck.gl + app | < 8 MB |

```typescript
// src/components/globe/GlobeCanvas.tsx
import dynamic from 'next/dynamic';
const CesiumViewer = dynamic(
  () => import('./CesiumViewerInner'),
  { ssr: false, loading: () => <GlobeLoadingState /> }
);
```

`bundlewatch` (or `@next/bundle-analyzer`) runs in CI; a warning (non-blocking) is raised if the initial route bundle exceeds its budget. The baseline is stored in `.bundle-size-baseline`.
---

### 13.2 Accessible Parallel Table View (F4)

The CesiumJS WebGL globe is inherently inaccessible: no keyboard navigation, no screen reader support, no motor-impairment accommodation. All interactions available via the globe must also be available via a **parallel data table view**.

**Component:** `src/components/globe/ObjectTableView.tsx`

- Accessible via keyboard shortcut `Alt+T` from any operational view, and via a persistent visible "Table view" button in the globe toolbar
- Displays all objects currently rendered on the globe: NORAD ID, name, orbit type, conjunction status badge, predicted re-entry window, alert level
- Sortable by any column (`aria-sort` updated on header click/keypress); filterable by alert level
- Row selection focuses the object's Event Detail panel (same as map click)
- All alert acknowledgement actions reachable from the table view — no functionality requires the globe
- Implemented as `<table>` with `<thead>`, `<tbody>`, `<th scope="col">`, `<th scope="row">` — no ARIA table role substitutes where native HTML suffices
- Pagination or virtual scroll for large object sets; `aria-rowcount` and `aria-rowindex` set correctly for virtualised rows

The table view is the **primary interaction surface** for users who cannot use the map. It must be functionally complete, not a read-only summary.
---

### 13.3 Keyboard Navigation Specification (F6)

All primary operator workflows must be completable by keyboard alone. Required implementation:

**Skip links** (rendered as the first focusable element in the page, visible on focus):
```html
<a href="#alert-panel" class="skip-link">Skip to alert panel</a>
<a href="#main-content" class="skip-link">Skip to main content</a>
<a href="#object-table" class="skip-link">Skip to object table</a>
```

**Focus ring:** Minimum 3px solid outline, ≥ 3:1 contrast against adjacent colours (per WCAG 2.2 "Focus Appearance" guidance; exceeds the WCAG 2.1 AA baseline). Never `outline: none` without a custom focus indicator. Defined in design tokens: `--focus-ring: 3px solid #4A9FFF`.

**Tab order:** Follows DOM order (no `tabindex > 0`). Logical flow: nav → alert panel → map toolbar → main content. Modal dialogs trap focus within the dialog while open; focus returns to the trigger element on close.

**Application keyboard shortcuts (all documented in UI via `?` help overlay):**

| Shortcut | Action |
|----------|--------|
| `Alt+A` | Focus most-recent active CRITICAL alert |
| `Alt+T` | Toggle table / globe view |
| `Alt+H` | Open shift handover view |
| `Alt+N` | Open NOTAM draft for active event |
| `?` | Open keyboard shortcut reference overlay |
| `Escape` | Close modal / dismiss non-CRITICAL overlay |
| `Arrow keys` | Navigate within alert list, table rows, accordion items |

All shortcuts declared via `aria-keyshortcuts` on their trigger elements. No shortcut conflicts with browser or screen reader reserved keys.
---

### 13.4 Colour and Contrast Specification (F7)

All colour pairs must meet WCAG 2.1 AA contrast requirements. Documented in `frontend/src/tokens/colours.ts` as design tokens; no hardcoded colour values in component files.

**Operational severity palette (dark theme — `background: #1A1A2E`):**

| Severity | Background | Text | Contrast ratio | Status |
|----------|-----------|------|---------------|--------|
| CRITICAL | `#7B4000` | `#FFFFFF` | 7.2:1 | ✓ AA |
| HIGH | `#7A3B00` | `#FFD580` | 5.1:1 | ✓ AA |
| MEDIUM | `#1A3A5C` | `#90CAF9` | 4.6:1 | ✓ AA |
| LOW | `#1E3A2F` | `#81C784` | 4.5:1 | ✓ AA (minimum) |
| Focus ring | `#1A1A2E` | `#4A9FFF` | 4.8:1 | ✓ AA |

All pairs are additionally checked with the APCA algorithm for large display text (corridor labels on the globe). If a colour fails at the target background, the background is adjusted — the text colour is kept consistent for operator recognition.
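The ratios in the table follow the standard WCAG 2.x formula: relative luminance of the lighter colour plus 0.05, divided by that of the darker plus 0.05. A minimal sketch that could back a design-token CI check (the function names are illustrative, not existing code; the luminance coefficients and sRGB linearisation come from the WCAG definition):

```typescript
// sRGB channel (0–255) → linearised value, per the WCAG relative-luminance definition
function linearise(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const n = parseInt(hex.replace('#', ''), 16);
  const r = linearise((n >> 16) & 0xff);
  const g = linearise((n >> 8) & 0xff);
  const b = linearise(n & 0xff);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio, always ≥ 1 (lighter over darker)
function contrastRatio(a: string, b: string): number {
  const l1 = relativeLuminance(a);
  const l2 = relativeLuminance(b);
  const [hi, lo] = l1 >= l2 ? [l1, l2] : [l2, l1];
  return (hi + 0.05) / (lo + 0.05);
}
```

A CI check over `colours.ts` would assert `contrastRatio(text, background) >= 4.5` for every informational-text pair in the table above.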

**Number formatting (F4):** Probability values, altitudes, and distances must be formatted correctly across locales:
- **Operational interface (Persona A/C):** Always use the ICAO-standard decimal point (`.`) regardless of browser locale — deviating from locale convention is intentional and matches ICAO Doc 8400 standards; this is documented as an explicit design decision
- **Admin / reporting / Space Operator views:** Use `Intl.NumberFormat(locale)` for locale-aware formatting (comma decimal separator in DE/FR/ES locales)
- Helpers: `formatOperationalNumber(n: number): string` — always `.` decimal, 3 significant figures for probabilities; `formatDisplayNumber(n: number, locale: string): string` — locale-aware
- Never use raw `Number.toString()` or `n.toFixed()` in JSX — both ignore locale
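A minimal sketch of the two helpers named above — the signatures come from this section, the bodies are illustrative:

```typescript
// Operational views: locale-independent '.' decimal, 3 significant figures.
// Scoped to probability values (toPrecision switches to exponential notation
// for large magnitudes, which is acceptable for probabilities only).
function formatOperationalNumber(n: number): string {
  return n.toPrecision(3);
}

// Admin/reporting views: locale-aware (e.g. comma decimal separator in de-DE)
function formatDisplayNumber(n: number, locale: string): string {
  return new Intl.NumberFormat(locale, { maximumSignificantDigits: 3 }).format(n);
}
```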

**Non-colour severity indicators (F5):** Colour must never be the sole differentiator. Each severity level also carries:

| Severity | Icon/shape | Text label | Border width |
|----------|-----------|-----------|-------------|
| CRITICAL | ⬟ (octagon) | "CRITICAL" always visible | 3px solid |
| HIGH | ▲ (triangle) | "HIGH" always visible | 2px solid |
| MEDIUM | ● (circle) | "MEDIUM" always visible | 1px solid |
| LOW | ○ (circle outline) | "LOW" always visible | 1px dashed |

The 1 Hz CRITICAL colour cycle (§28.3 habituation countermeasure) must also include a redundant non-colour animation: a 1 Hz border-width pulse (2px → 4px → 2px). Users with `prefers-reduced-motion: reduce` see a static thick border instead (see §28.3 reduced-motion rules).
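The table and the reduced-motion rule can be carried as a single design-token map so no component hand-rolls its own indicator. A sketch (types and names are illustrative, not existing code):

```typescript
type Severity = 'CRITICAL' | 'HIGH' | 'MEDIUM' | 'LOW';

interface SeverityIndicator {
  icon: string;    // non-colour shape from the table
  label: string;   // always-visible text label
  border: string;  // CSS border shorthand value
  pulse: boolean;  // 1 Hz border-width pulse (CRITICAL only)
}

const SEVERITY_INDICATORS: Record<Severity, SeverityIndicator> = {
  CRITICAL: { icon: '⬟', label: 'CRITICAL', border: '3px solid',  pulse: true },
  HIGH:     { icon: '▲', label: 'HIGH',     border: '2px solid',  pulse: false },
  MEDIUM:   { icon: '●', label: 'MEDIUM',   border: '1px solid',  pulse: false },
  LOW:      { icon: '○', label: 'LOW',      border: '1px dashed', pulse: false },
};

// prefers-reduced-motion: replace the pulse with a static thick border,
// per the §28.3 reduced-motion rules
function indicatorFor(severity: Severity, reducedMotion: boolean): SeverityIndicator {
  const base = SEVERITY_INDICATORS[severity];
  return base.pulse && reducedMotion
    ? { ...base, pulse: false, border: '4px solid' }
    : base;
}
```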
---

### 13.5 Internationalisation Architecture (F5, F8, F11)

**Language scope — Phase 1:** English only. No other locale is served. This is not a gap — it is an explicit decision that allows Phase 1 to ship without a localisation workflow. The architecture is designed so that adding a new locale requires only adding a `messages/{locale}.json` file and testing; no component code changes.

**String externalisation strategy:**
- Library: `next-intl` (native Next.js App Router support, RSC-compatible, type-safe message keys)
- Source of truth: `messages/en.json` — all user-facing strings, namespaced by feature area
- Message ID convention: `{feature}.{component}.{element}`, e.g. `alerts.critical.title`, `handover.accept.button`
- No bare string literals in JSX (enforced by `eslint-plugin-i18n-json` or equivalent)
- **ICAO-fixed strings are excluded from i18n scope** and must never appear in `messages/en.json` — they are hardcoded constants. Examples: `NOTAM`, `UTC`, `SIGMET`, category codes (`NOTAM_ISSUED`), ICAO phraseology in NOTAM templates. These are annotated `// ICAO-FIXED: do not translate` in source

```
messages/
  en.json    # Source of truth — Phase 1 complete
  fr.json    # Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy)
  de.json    # Phase 3 scaffold
```
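The `{feature}.{component}.{element}` convention can be enforced mechanically when `messages/en.json` is validated in CI. A hypothetical lint helper (not existing code) — exactly three dot-separated segments, each starting lowercase, which also rejects ICAO-fixed constants like `NOTAM` by construction:

```typescript
// Hypothetical lint helper for the message ID convention above.
// Three lowercase-led, dot-separated segments; camelCase allowed after
// the first character of each segment.
const MESSAGE_ID_PATTERN = /^[a-z][a-zA-Z0-9]*(\.[a-z][a-zA-Z0-9]*){2}$/;

function isValidMessageId(id: string): boolean {
  return MESSAGE_ID_PATTERN.test(id);
}
```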

**CSS logical properties (F8):** All new components use CSS logical properties instead of directional utilities, making RTL support a configuration change rather than a code rewrite:

| Avoid | Use instead |
|-------|------------|
| `margin-left`, `ml-*` | `margin-inline-start`, `ms-*` |
| `margin-right`, `mr-*` | `margin-inline-end`, `me-*` |
| `padding-left`, `pl-*` | `padding-inline-start`, `ps-*` |
| `padding-right`, `pr-*` | `padding-inline-end`, `pe-*` |
| `left: 0` | `inset-inline-start: 0` |
| `text-align: left` | `text-align: start` |

The `<html>` element carries `dir="ltr"` (hardcoded for Phase 1). When an RTL locale is added, this becomes `dir={locale.dir}` — no component changes required. RTL testing with an Arabic locale is a Phase 3 gate before any Middle East deployment.

**Altitude and distance unit display (F9):** The aviation and space domains use different unit conventions. All altitudes and distances are stored and transmitted in **metres** (SI base unit) in the database and API. The display layer converts based on `users.altitude_unit_preference`:

| Role default | Unit | Display example |
|---|---|---|
| `ansp_operator` | `ft` | `39,370 ft (FL394)` |
| `space_operator` | `km` | `12.0 km` |
| `analyst` | `km` | `12.0 km` |

Rules:
- Unit label always shown alongside the value — no bare numbers
- `aria-label` provides the full unit name: `aria-label="39,370 feet (Flight Level 394)"`
- Users can override their default in account settings via `PATCH /api/v1/users/me`
- API always returns metres; unit conversion is client-side only
- FL (Flight Level) shown in parentheses for `ft` display when altitude > 0 ft MSL and the context is airspace
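The `ft` display example in the table is produced by the exact conversion 1 ft = 0.3048 m, with the Flight Level being the altitude in hundreds of feet. A sketch of that client-side conversion (the function name is illustrative, not existing code):

```typescript
const FT_PER_METRE = 1 / 0.3048; // exact by definition: 1 ft = 0.3048 m

// Renders the ansp_operator `ft` display from the metres stored in the API,
// e.g. 12 000 m → "39,370 ft (FL394)".
function formatAltitudeFt(metres: number, airspaceContext: boolean): string {
  const feet = Math.round(metres * FT_PER_METRE);
  // Explicit en-GB locale: '.'/',' grouping matches the ICAO operational style
  const ftText = `${feet.toLocaleString('en-GB')} ft`;
  if (airspaceContext && feet > 0) {
    const fl = Math.round(feet / 100); // Flight Level = hundreds of feet
    return `${ftText} (FL${fl})`;
  }
  return ftText;
}
```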

**Altitude datum labelling (F11 — §62):** The SGP4 propagator and NRLMSISE-00 output altitudes above the WGS-84 ellipsoid. Aviation altimetry uses altitude above Mean Sea Level (MSL). The geoid height (the difference between ellipsoid and MSL) varies globally from approximately −106 m to +85 m (EGM2008). For operational altitudes (below ~25 km / 82,000 ft during the re-entry terminal phase), this difference is significant.

**Required labelling rule:** All altitude displays must specify the datum. The datum is a non-configurable system constant per altitude context:

| Altitude context | Datum | Display example | Notes |
|-----------------|-------|-----------------|-------|
| Orbital altitude (> 80 km) | WGS-84 ellipsoid | `185 km (ellipsoidal)` | SGP4 output; geoid difference negligible at orbital altitudes |
| Re-entry corridor boundary | WGS-84 ellipsoid | `80 km (ellipsoidal)` | Model boundary altitude |
| Fragment impact altitude | WGS-84 ellipsoid | `0 km (ellipsoidal)` → display as ground level | Converted at display time |
| Airspace sector boundary (FL) | QNH barometric | `FL390` / `39,000 ft (QNH)` | Aviation standard; NOT ellipsoidal |
| Terrain clearance / NOTAM lower bound | MSL (approx. ellipsoidal for > 1,000 ft) | `5,000 ft MSL` | Use the `MSL` label explicitly |

**Implementation:** The `formatAltitude(metres, context)` helper accepts a `context` parameter (`'orbital' | 'airspace' | 'notam'`) and appends the appropriate datum label. The datum label is rendered in a smaller secondary font weight alongside the altitude value — not in `aria-label` alone.

**API response datum field:** The prediction API response must include `altitude_datum: "WGS84_ELLIPSOIDAL"` alongside any altitude value. Consumers must not assume a datum that is not stated.
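A sketch of `formatAltitude(metres, context)` — the signature and context union come from this section; the display forms follow the datum table, and the bodies are illustrative rather than the shipped implementation:

```typescript
type AltitudeContext = 'orbital' | 'airspace' | 'notam';

// Illustrative body for the helper described above. Each context carries a
// fixed, non-configurable datum label per the table.
function formatAltitude(metres: number, context: AltitudeContext): string {
  switch (context) {
    case 'orbital':
      // SGP4/NRLMSISE-00 altitudes are above the WGS-84 ellipsoid
      return `${(metres / 1000).toFixed(0)} km (ellipsoidal)`;
    case 'airspace': {
      // QNH barometric flight level = hundreds of feet
      const feet = Math.round(metres / 0.3048);
      return `FL${Math.round(feet / 100)}`;
    }
    case 'notam': {
      // NOTAM lower bounds carry an explicit MSL label
      const feet = Math.round(metres / 0.3048);
      return `${feet.toLocaleString('en-GB')} ft MSL`;
    }
  }
}
```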

**Future locale addition checklist** (documented in `docs/ADDING_A_LOCALE.md`):
1. Add `messages/{locale}.json` translated by a native-speaker aviation professional
2. Verify all ICAO-fixed strings are excluded from translation
3. Set `dir` for the locale (ltr/rtl)
4. Run automated RTL layout tests if `dir=rtl`
5. Confirm operational time display still shows UTC (not locale timezone)
6. Legal review of any jurisdiction-specific compliance text
---

### 13.6 Contribution Workflow (F3)

`CONTRIBUTING.md` at the repository root is a required document. It defines how contributors (internal engineers, auditors, future ESA-directed reviewers) engage with the codebase.

**Branch naming convention:**

| Branch type | Pattern | Example |
|---|---|---|
| Feature | `feature/{ticket-id}-short-description` | `feature/SC-142-decay-unit-pref` |
| Bug fix | `fix/{ticket-id}-short-description` | `fix/SC-200-hmac-null-check` |
| Chore / dependency | `chore/{description}` | `chore/bump-fastapi-0.115` |
| Release | `release/{semver}` | `release/1.2.0` |
| Hotfix | `hotfix/{semver}` | `hotfix/1.1.1` |

No direct commits to `main`. All changes via pull request. `main` is branch-protected: 1 required approval, all status checks must pass, no force-push.
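The convention table lends itself to a mechanical check in CI. A hypothetical lint sketch (not existing tooling; the patterns are derived from the table's examples and may need loosening for real ticket prefixes):

```typescript
// Hypothetical CI lint — validates branch names against the convention table.
const BRANCH_PATTERNS: RegExp[] = [
  /^(feature|fix)\/[A-Z]+-\d+(-[a-z0-9]+)+$/, // ticket-prefixed branches
  /^chore\/[a-z0-9][a-z0-9.-]*$/,             // chore / dependency bumps
  /^(release|hotfix)\/\d+\.\d+\.\d+$/,        // semver branches
];

function isValidBranchName(name: string): boolean {
  return BRANCH_PATTERNS.some(p => p.test(name));
}
```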

**Commit message format:** [Conventional Commits](https://www.conventionalcommits.org/) — `type(scope): description`. Types: `feat`, `fix`, `chore`, `docs`, `refactor`, `test`, `ci`. Example: `feat(decay): add p01/p99 tail risk columns`.

**PR template** (`.github/pull_request_template.md`):
```markdown
## Summary
<!-- What does this PR do? -->

## Linked ticket
<!-- e.g. SC-142 -->

## Checklist
- [ ] `make test` passes locally
- [ ] OpenAPI spec regenerated (`make generate-openapi`) if API changed
- [ ] CHANGELOG.md updated under `[Unreleased]`
- [ ] axe-core accessibility check passes if UI changed
- [ ] Contract test passes if API response shape changed
- [ ] ADR created if an architectural decision was made
```

**Review SLA:** Pull requests must receive a first review within **1 business day** of opening. Stale PRs (no activity for > 3 business days) are labelled `stale` automatically.
---

### 13.7 Architecture Decision Records (F4)

ADRs (Nygard format) are the lightweight record for code-level and architectural decisions. They live in `docs/adr/` and are numbered sequentially.

**When to write an ADR:** Any decision that is:
- Hard to reverse (e.g. choosing a library, a DB schema approach, an algorithm)
- Likely to confuse a future contributor who finds the code without context
- Required by a public-sector procurement framework (ESA specifically requests evidence of a structured decision process)
- Referenced in a specialist review appendix (§45–§54 all reference ADR numbers)

**Format** (`docs/adr/NNNN-title.md`):
```markdown
# ADR NNNN: Title

**Status:** Proposed | Accepted | Deprecated | Superseded by ADR MMMM
**Date:** YYYY-MM-DD

## Context
What problem are we solving? What constraints apply?

## Decision
What did we decide?

## Consequences
What becomes easier? What becomes harder? What is now out of scope?
```

**Known ADRs referenced in this plan:**

| ADR | Topic |
|-----|-------|
| 0001 | FastAPI over Django REST Framework |
| 0002 | TimescaleDB + PostGIS for orbital time-series |
| 0003 | CesiumJS + Deck.gl for 3D globe rendering |
| 0004 | next-intl for string externalisation |
| 0005 | Append-only alert_events with HMAC signing |
| 0016 | NRLMSISE-00 vs JB2008 atmospheric density model |

All ADR numbers referenced in this document must have a corresponding `docs/adr/NNNN-*.md` file before Phase 2 ESA submission. New ADRs start at the next available number.
---

### 13.8 Developer Environment Setup (F6)

`docs/DEVELOPMENT.md` is a required onboarding document. A new engineer must be able to run a fully functional local environment within **30 minutes** of reading it. The document covers:

1. **Prerequisites:** Python 3.11 (pinned in `.python-version`), Node.js 20 LTS, Docker Desktop, `make`
2. **Environment bootstrap:**
   ```bash
   cp .env.example .env   # review and fill required values
   make init-dirs         # creates logs/, exports/, config/, backups/ on host
   make dev-up            # docker compose up -d postgres redis minio
   make migrate           # alembic upgrade head
   make seed              # load development fixture data (10 tracked objects, sample TIPs)
   make dev               # starts: uvicorn + Next.js dev server + Celery worker
   ```
3. **Running tests:**
   ```bash
   make test              # full test suite (backend + frontend)
   make test-backend      # backend only (pytest)
   make test-frontend     # frontend only (jest + playwright)
   make test-e2e          # Playwright end-to-end (requires make dev running)
   ```
4. **Useful local URLs:**
   - API: `http://localhost:8000` / Swagger UI: `http://localhost:8000/docs`
   - Frontend: `http://localhost:3000`
   - MinIO console: `http://localhost:9001` (credentials in `.env.example`)
5. **Common issues:** documented in a `## Troubleshooting` section covering Docker port conflicts, TimescaleDB first-run migration failure, and a missing CesiumJS ion token.

`.env.example` is committed and kept up to date with all required variables (keys only — no values). `.env` is in `.gitignore` and must never be committed.
---
|
||
|
||
### 13.9 Docs-as-Code Pipeline (F10)
|
||
|
||
All project documentation (this plan, runbooks, ADRs, OpenAPI spec, data provenance records) is version-controlled in the repository and validated by CI.
|
||
|
||
**Documentation site:** MkDocs Material. Source in `docs/`. Published to GitHub Pages on merge to `main`. Configuration in `mkdocs.yml`.
|
||
|
||
**CI documentation checks (run on every PR):**
|
||
- `mkdocs build --strict` — fails on broken links, missing pages, invalid nav
|
||
- `markdown-link-check docs/` — external link validation (warns, does not fail, to avoid flaky CI on transient outages)
|
||
- `openapi-diff` — spec drift check (see §14 F1)
|
||
- `vale --config=.vale.ini docs/` — prose style linter (SpaceCom style guide: no passive voice in runbooks, consistent terminology table for `re-entry` vs `reentry`)
|
||
|
||
**ESA submission artefact:** The MkDocs build output (static HTML) is archived as a CI artefact on each release tag. This provides a reproducible, point-in-time documentation snapshot for the ESA bid submission. The submission artefact is `docs-site-{version}.zip` stored in the GitHub release assets.
|
||
|
||
**Docs owner:** Each section of the documentation has an `owner:` frontmatter field. The owner is responsible for keeping the section current after their feature area changes. Missing or stale ownership is flagged by a quarterly `docs-review` GitHub issue auto-created by a cron workflow.

---

## 14. API Design

Base path: `/api/v1`. All endpoints require authentication (minimum `viewer` role) unless noted. Role requirements listed per group.

### System (no auth required)
- `GET /health` — liveness probe; returns `200 {"status": "ok", "version": "<semver>"}` if the process is running. Used by Docker/Kubernetes liveness probe and load balancer health check. Does **not** check downstream dependencies — a healthy response means only that the API process is alive.
- `GET /readyz` — readiness probe; returns `200 {"status": "ready", "checks": {...}}` when all dependencies are reachable. Returns `503` if any required dependency is unhealthy. Checks performed: PostgreSQL (query `SELECT 1`), Redis (PING), Celery worker queue depth < 1000. Used by DR automation to confirm the new primary is accepting traffic before updating DNS (§26.3). Also included in OpenAPI spec under `tags: ["System"]`.

```json
// GET /readyz — healthy response example
{
  "status": "ready",
  "checks": {
    "postgres": "ok",
    "redis": "ok",
    "celery_queue_depth": 42
  },
  "version": "1.2.3"
}

// GET /readyz — unhealthy response (503)
{
  "status": "not_ready",
  "checks": {
    "postgres": "ok",
    "redis": "error: connection refused",
    "celery_queue_depth": 42
  }
}
```
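The `/readyz` aggregation can be sketched as follows. This is an illustration, not the actual implementation: the function name and boolean arguments are assumptions, but the check names and the queue-depth threshold follow the checks list and response examples above.

```python
# Sketch of /readyz aggregation: every dependency check must pass, and the
# Celery check is a threshold (queue depth < 1000) rather than a boolean.
CELERY_QUEUE_DEPTH_LIMIT = 1000

def readiness(postgres_ok: bool, redis_ok: bool, queue_depth: int) -> tuple[int, dict]:
    """Return (http_status, response_body) for the readiness probe."""
    checks = {
        "postgres": "ok" if postgres_ok else "error",
        "redis": "ok" if redis_ok else "error",
        "celery_queue_depth": queue_depth,
    }
    ready = postgres_ok and redis_ok and queue_depth < CELERY_QUEUE_DEPTH_LIMIT
    status = 200 if ready else 503
    return status, {"status": "ready" if ready else "not_ready", "checks": checks}
```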

### Auth
- `POST /auth/token` — login; returns `httpOnly` cookie (access) + `httpOnly` cookie (refresh); rate-limited 10/min/IP
- `POST /auth/token/refresh` — rotate refresh token; rate-limited
- `POST /auth/mfa/verify` — complete MFA; issues full-access token
- `POST /auth/logout` — revoke refresh token; clear cookies

### Catalog (`viewer` minimum)
- `GET /objects` — list/search (paginated; filter by type, perigee, decay status, data_confidence)
- `GET /objects/{norad_id}` — detail with TLE, physical properties, data confidence annotation
- `POST /objects` — manual entry (`operator` role)
- `GET /objects/{norad_id}/tle-history` — full TLE history including cross-validation status

### Propagation (`analyst` role)
- `POST /propagate` — submit catalog propagation job
- `GET /propagate/{task_id}` — poll status
- `GET /objects/{norad_id}/ephemeris?start=&end=&step=` — time range and step validation (Finding 7):

| Parameter | Constraint | Error code |
|---|---|---|
| `start` | ≥ TLE epoch − 7 days; ≤ now + 90 days | `EPHEMERIS_START_OUT_OF_RANGE` |
| `end` | `start < end ≤ start + 30 days` | `EPHEMERIS_END_OUT_OF_RANGE` |
| `step` | ≥ 10 seconds and ≤ 86,400 seconds | `EPHEMERIS_STEP_OUT_OF_RANGE` |
| Computed points | `(end − start) / step ≤ 100,000` | `EPHEMERIS_TOO_MANY_POINTS` |
### Decay Prediction (`analyst` role)
- `POST /decay/predict` — submit decay job; returns `202 Accepted` (Finding 3). **MC concurrency gate:** per-organisation Redis semaphore limits to 1 concurrent MC run (Phase 1); 2 for `analyst`+ (Phase 2); `429 + Retry-After` on limit; `admin` bypasses.

**Async job lifecycle (Finding 3):**
```
POST /decay/predict
  Idempotency-Key: <client-uuid>   ← optional; prevents duplicate on retry
→ 202 Accepted
  {
    "jobId": "uuid",
    "status": "queued",
    "statusUrl": "/jobs/uuid",
    "estimatedDurationSeconds": 45
  }

GET /jobs/{job_id}
→ 200 OK
  {
    "jobId": "uuid",
    "status": "running" | "complete" | "failed" | "cancelled",
    "resultUrl": "/decay/predictions/12345",   // present when complete
    "error": null | {"code": "...", "message": "..."},
    "createdAt": "...",
    "completedAt": "...",
    "durationSeconds": 42
  }
```
WebSocket `PREDICTION_COMPLETE` / `PREDICTION_FAILED` events are the primary completion signal. `GET /jobs/{id}` is the polling fallback (recommended interval: 5 seconds; do not poll faster). All Celery-backed POST endpoints (`/reports`, `/space/reentry/plan`, `/propagate`) follow the same lifecycle pattern.

- `GET /jobs/{job_id}` — poll job status (all job types); `404` if job does not belong to the requesting user's organisation
- `GET /decay/predictions?norad_id=&status=` — list (cursor-paginated)
### Re-entry (`viewer` role)
- `GET /reentry/predictions` — list with HMAC status; filterable by FIR, time window, confidence, integrity_failed
- `GET /reentry/predictions/{id}` — full detail; HMAC verified before serving; `integrity_failed` records return 503
- `GET /reentry/tip-messages?norad_id=` — TIP messages

### Space Weather (`viewer` role)
- `GET /spaceweather/current` — F10.7, Kp, Ap, Dst + `operational_status` + `uncertainty_multiplier` + cross-validation delta
- `GET /spaceweather/history?start=&end=` — history
- `GET /spaceweather/forecast` — 3-day NOAA SWPC forecast

### Conjunctions (`viewer` role)
- `GET /conjunctions` — active events filterable by Pc threshold
- `GET /conjunctions/{id}` — detail with covariance and probability
- `POST /conjunctions/screen` — submit screening (`analyst` role)
### Visualisation (`viewer` role)
- `GET /czml/objects` — full CZML catalog (J2000 INERTIAL; all strings HTML-escaped); **max payload policy: 5 MB**. If estimated payload exceeds 5 MB, the endpoint returns `HTTP 413` with `{"error": "catalog_too_large", "use_delta": true}`.
- `GET /czml/objects?since=<iso8601>` — **delta CZML**: returns only objects whose position or metadata has changed since the given timestamp. Clients must use this after the initial full load. Response includes `X-CZML-Full-Required: true` header if the server cannot produce a valid delta (e.g. client timestamp > 30 minutes old) — client must re-fetch the full catalog. Delta responses are always ≤ 500 KB for the 100-object catalog.
- `GET /czml/hazard/{zone_id}` — HMAC verified before serving
- `GET /czml/event/{event_id}` — full event CZML
- `GET /viz/mc-trajectories/{prediction_id}` — binary MC blob for Mode C

### Hazard (`viewer` role)
- `GET /hazard/zones` — active zones; HMAC status included in response
- `GET /hazard/zones/{id}` — detail; HMAC verified before serving; `integrity_failed` records return 503

### Alerts (`viewer` read; `operator` acknowledge)
- `GET /alerts` — alert history
- `POST /alerts/{id}/acknowledge` — records user ID + timestamp + note in `alert_events`
- `GET /alerts/unread-count` — unread critical/high count for badge

### Reports (`analyst` role)
- `GET /reports` — list (organisation-scoped via RLS)
- `POST /reports` — initiate generation (async)
- `GET /reports/{id}` — metadata + pre-signed 15-minute download URL
- `GET /reports/{id}/preview` — HTML preview
### Org Admin (`org_admin` role — scoped to own organisation) (F7, F9, F11)
- `GET /org/users` — list users in own org
- `POST /org/users/invite` — invite a new user (sends email; creates user with `viewer` role pending activation)
- `PATCH /org/users/{id}/role` — assign role up to `operator` within own org; cannot assign `org_admin` or `admin`
- `DELETE /org/users/{id}` — deactivate user (revokes sessions and API keys; triggers pseudonymisation for GDPR)
- `GET /org/api-keys` — list all API keys in own org (including service account keys)
- `DELETE /org/api-keys/{id}` — revoke any key in own org
- `GET /org/audit-log` — paginated org-scoped audit log from `security_logs` and `alert_events` filtered by `organisation_id`; supports `?from=&to=&event_type=&user_id=` (F9)
- `GET /org/usage` — usage summary for current and previous billing period (predictions run, quota hits, API calls); sourced from `usage_events` table
- `PATCH /org/billing` — update `billing_contacts` row (email, PO number, VAT number)
- `POST /org/export` — trigger asynchronous org data export (F11); returns job ID; export includes all predictions, alert events, handover logs, and NOTAM drafts for the org; delivered as signed ZIP within 3 business days; used for GDPR portability and offboarding

### Admin (`admin` role only)
- `GET /admin/ingest-status` — last run time and status per source
- `GET /admin/worker-status` — Celery queue depth and health
- `GET /admin/security-events` — recent security_logs entries
- `POST /admin/users` — create user
- `PATCH /admin/users/{id}/role` — change role (logged as HIGH security event)
- `GET /admin/organisations` — list all organisations with tier, status, usage summary
- `POST /admin/organisations` — provision new organisation (onboarding gate — see §29.8)
- `PATCH /admin/organisations/{id}` — update tier, status, subscription dates
### Space Portal (`space_operator` or `orbital_analyst` role)
- `GET /space/objects` — list owned objects (`space_operator`: scoped; `orbital_analyst`: full catalog)
- `GET /space/objects/{norad_id}` — full technical detail with state vectors, covariance, TLE history
- `GET /space/objects/{norad_id}/ephemeris` — raw GCRF state vectors; CCSDS OEM format available via `Accept: application/ccsds-oem`
- `POST /space/reentry/plan` — submit controlled re-entry planning job; requires `owned_objects.has_propulsion = TRUE`
- `GET /space/reentry/plan/{task_id}` — poll; returns ranked deorbit windows with risk scores and FIR avoidance status
- `POST /space/conjunction/screen` — submit screening (`orbital_analyst` only)
- `GET /space/export/bulk` — bulk ephemeris/prediction export (JSON, CSV, CCSDS)

### NOTAM Drafting (`operator` role)
- `POST /notam/draft` — generate draft NOTAM from prediction ID; returns ICAO-format draft text + mandatory disclaimer
- `GET /notam/drafts` — list drafts for organisation
- `GET /notam/drafts/{id}` — draft detail
- `POST /notam/drafts/{id}/cancel-draft` — generate cancellation draft for a previous new-NOTAM draft

### API Key Management (`space_operator` or `orbital_analyst`)
- `POST /api-keys` — create new API key; raw key returned once and never stored
- `GET /api-keys` — list active keys (hashed IDs only, never raw keys)
- `DELETE /api-keys/{id}` — revoke key immediately
- `GET /api-keys/usage` — per-key request counts and last-used timestamp
### WebSocket (`viewer` minimum; cookie auth at upgrade)
- `WS /ws/events` — real-time stream; 5 concurrent connections per user enforced. **Per-instance subscriber ceiling: 500 connections.** New connections beyond this limit receive `HTTP 503` at the WebSocket upgrade. A `ws_connected_clients` Prometheus gauge tracks current count per backend instance; alert fires at 400 (WARNING) to trigger horizontal scaling before the ceiling is reached. At Tier 2 (2 backend instances), the effective ceiling is 1,000 simultaneous WebSocket clients — documented as a known capacity limit in `docs/runbooks/capacity-limits.md`.

**WebSocket event payload schema:**

All events share an envelope:
```json
{
  "type": "<event_type>",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "data": { ... }
}
```
| `type` | Trigger | `data` fields |
|--------|---------|---------------|
| `alert.new` | New alert generated | `alert_id`, `level`, `norad_id`, `object_name`, `fir_ids[]` |
| `alert.acknowledged` | Alert acknowledged by any user in org | `alert_id`, `acknowledged_by`, `note_preview` |
| `alert.superseded` | Alert superseded by a new one | `old_alert_id`, `new_alert_id` |
| `prediction.updated` | New re-entry prediction for a tracked object | `prediction_id`, `norad_id`, `p50_utc`, `supersedes_id` |
| `ingest.status` | Ingest job completed or failed | `source`, `status` (`ok`/`failed`), `record_count`, `next_run_at` |
| `spaceweather.change` | Operational status band changes | `old_status`, `new_status`, `kp`, `f107` |
| `tip.new` | New TIP message ingested | `norad_id`, `object_name`, `tip_epoch`, `predicted_reentry_utc` |

**Reconnection and missed-event recovery:** Each event carries a monotonically increasing `seq` number per organisation. On reconnect, the client sends `?since_seq=<last_seq>` in the WebSocket upgrade URL. The server replays up to 200 missed events from an in-memory ring buffer (last 5 minutes). If the client has been disconnected > 5 minutes, it receives a `{"type": "resync_required"}` event and must re-fetch state via REST.
**Per-org sequence number implementation (F5 — §67):** The `seq` counter for each org must be assigned using a PostgreSQL `SEQUENCE` object, not `MAX(seq)+1` in a trigger. `MAX(seq)+1` under concurrent inserts for the same org produces duplicate sequence numbers:

```sql
-- Migration: create one sequence per org on org creation
-- (or use a single global sequence with per-org prefix — simpler)
CREATE SEQUENCE IF NOT EXISTS alert_seq_global
  START 1 INCREMENT 1 NO CYCLE;

-- In the alert_events INSERT trigger or application code:
--   NEW.seq := nextval('alert_seq_global');
-- This is globally unique and monotonically increasing; per-org ordering
-- is derived by filtering on org_id + ordering by seq.
```

**Preferred approach:** A single global `alert_seq_global` sequence assigned at INSERT time. Per-org ordering is maintained because `seq` is globally monotonic — any two events for the same org will have the correct relative ordering by `seq`. The WebSocket ring buffer lookup uses `WHERE org_id = $1 AND seq > $2 ORDER BY seq` which remains correct with a global sequence.

**No org-scoped locking is required:** `DEFAULT nextval('alert_seq_global')` on the column is safe as-is — concurrent inserts across orgs share the sequence without contention, and concurrent inserts for the same org also order correctly, since sequences are lock-free and gap-tolerant. The pattern to avoid remains `MAX(seq)+1` in a trigger.
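The global-sequence argument can be demonstrated directly: filtering globally monotonic `seq` values by org preserves per-org ordering. A minimal sketch, using a list as a stand-in for the ring buffer (the function name and 200-event limit default are taken from the reconnection rules above):

```python
# Replay semantics of `WHERE org_id = $1 AND seq > $2 ORDER BY seq`:
# a globally unique, monotonic seq yields correct per-org ordering.
def replay_missed(events, org_id: str, since_seq: int, limit: int = 200):
    """events: iterable of dicts with 'org_id' and 'seq' (globally unique)."""
    missed = [e for e in events if e["org_id"] == org_id and e["seq"] > since_seq]
    return sorted(missed, key=lambda e: e["seq"])[:limit]
```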
**Application-level receipt acknowledgement (F2 — §63):** `delivered_websocket = TRUE` in `alert_events` is set at send-time, not client-receipt time. For safety-critical `CRITICAL` and `HIGH` alerts, the client must send an explicit receipt acknowledgement within 10 seconds:

```typescript
// Client → Server: after rendering a CRITICAL/HIGH alert.new event
{ "type": "alert.received", "alert_id": "<uuid>", "seq": <n> }
```

Server response:
```json
{ "type": "alert.receipt_confirmed", "alert_id": "<uuid>", "seq": <n+1> }
```

If no `alert.received` arrives within 10 seconds of delivery, the server marks `alert_events.ws_receipt_confirmed = FALSE` and triggers the email fallback for that alert (same logic as offline delivery). This distinguishes "sent to socket" from "rendered on screen."

```sql
ALTER TABLE alert_events
  ADD COLUMN ws_receipt_confirmed BOOLEAN,
  ADD COLUMN ws_receipt_at TIMESTAMPTZ;
-- NULL = not yet sent; TRUE = client confirmed receipt; FALSE = sent but no receipt within 10s
```
**Fan-out architecture across multiple backend instances (F3 — §63):** With ≥2 backend instances (Tier 2), a WebSocket connection from org A may be on instance-1 while a new alert fires on instance-2. Without a cross-instance broadcast mechanism, org A's operator misses the alert.

**Required: Redis Pub/Sub fan-out:**

```python
# backend/app/alerts/fanout.py
import json

import redis.asyncio as aioredis

ALERT_CHANNEL_PREFIX = "spacecom:alert:"

async def publish_alert(redis: aioredis.Redis, org_id: str, event: dict):
    """Publish alert event to Redis channel; all backend instances receive and forward to connected clients."""
    channel = f"{ALERT_CHANNEL_PREFIX}{org_id}"
    await redis.publish(channel, json.dumps(event))

async def subscribe_org_alerts(redis: aioredis.Redis, org_id: str):
    """Each backend instance subscribes to its connected orgs' channels on startup."""
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"{ALERT_CHANNEL_PREFIX}{org_id}")
    return pubsub
```

Each backend instance maintains a local registry of `{org_id: [websocket_connections]}`. On receiving a Redis Pub/Sub message, the instance forwards to all local connections for that org. This decouples alert generation (any instance) from delivery (per-instance local connections).

**ADR:** `docs/adr/0020-websocket-fanout-redis-pubsub.md` — documents this pattern and the decision against sticky sessions (which would break blue-green deploys).
**Dead-connection ANSP fallback notification (F6 — §63):** When the ping-pong mechanism detects a dead connection, the current behaviour is to close the socket. There is no notification to the ANSP that their live monitoring connection has silently dropped.

**Required behaviour:**
1. On ping-pong timeout: close socket; record `ws_disconnected_at` in Redis session key for that connection
2. If no reconnect within `WS_DEAD_CONNECTION_GRACE_SECONDS` (default: 120s): send email to the org's ANSP contact (`organisations.primary_contact_email`) with subject: *"SpaceCom live connection dropped — please check your browser"*
3. If an active TIP event exists for the org's FIRs when the disconnection is detected: grace period is reduced to 30s and the email subject is: *"URGENT: SpaceCom connection dropped during active re-entry event"*
4. On reconnect (before grace period expires): cancel the pending fallback email

```python
# backend/app/alerts/ws_health.py
import redis.asyncio as aioredis

# notify_ws_dead (Celery task) and celery_app are imported from the worker module

WS_DEAD_CONNECTION_GRACE_SECONDS = 120
WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP = 30

async def on_connection_closed(org_id: str, user_id: str, redis: aioredis.Redis):
    active_tip = await redis.get(f"spacecom:active_tip:{org_id}")
    grace = WS_DEAD_CONNECTION_GRACE_ACTIVE_TIP if active_tip else WS_DEAD_CONNECTION_GRACE_SECONDS
    # Schedule fallback notification via Celery
    notify_ws_dead.apply_async(
        args=[org_id, user_id],
        countdown=grace,
        task_id=f"ws-dead-{org_id}-{user_id}"  # revocable if reconnect arrives
    )

async def on_reconnect(org_id: str, user_id: str):
    # Cancel pending dead-connection notification
    celery_app.control.revoke(f"ws-dead-{org_id}-{user_id}")
```
**Per-org email alert rate limit (F7 — §65 FinOps):**

Email alerts are triggered both by the alert delivery pipeline (when WebSocket delivery is unconfirmed) and by degraded-mode notifications. Without a rate limit, a flapping prediction window or ingest instability can generate hundreds of alert emails per hour to the same ANSP contact, exhausting the SMTP relay quota and creating alert fatigue.

**Rate limit policy:** Maximum **50 alert emails per org per hour**. When the limit is reached, subsequent alerts within the window are queued and delivered as a **digest email** at the end of the hour.

```python
# backend/app/alerts/email_delivery.py
import json
from datetime import datetime

import redis.asyncio as aioredis
from celery import shared_task

EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR = 50

async def send_alert_email(org_id: str, alert: dict, redis: aioredis.Redis):
    """Send alert email subject to per-org rate limit; fall back to digest queue."""
    rate_key = f"spacecom:email_rate:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
    count = await redis.incr(rate_key)
    if count == 1:
        await redis.expire(rate_key, 3600)  # key is hour-scoped; TTL is cleanup only

    if count <= EMAIL_RATE_LIMIT_PER_ORG_PER_HOUR:
        # Send immediately
        await _dispatch_email(org_id, alert)
    else:
        # Add to digest queue; Celery task drains it at hour boundary
        digest_key = f"spacecom:email_digest:{org_id}:{datetime.utcnow().strftime('%Y%m%d%H')}"
        await redis.rpush(digest_key, json.dumps(alert))
        await redis.expire(digest_key, 7200)  # safety expire

@shared_task
def send_hourly_digest_emails():
    """Drain digest queues and send consolidated digest emails. Runs at HH:59."""
    # Find all digest keys matching current hour; send one digest per org
    ...
```
**Contract expiry alerts (F7 — §68):**

Without proactive expiry alerts, contracts expire silently. Add a Celery Beat task (`tasks/commercial/contract_expiry_alerts.py`) that runs daily at 07:00 UTC and checks `contracts.valid_until`:

```python
# tasks/commercial/contract_expiry_alerts.py
from datetime import date, timedelta

from celery import shared_task
from sqlalchemy import text

# db (session) and send_email are provided by the application layer

@shared_task
def check_contract_expiry():
    """Alert commercial team of contracts expiring within 90/30/7 days."""
    thresholds = [
        (90, "90-day renewal notice"),
        (30, "30-day renewal notice — action required"),
        (7, "URGENT: 7-day contract expiry warning"),
    ]
    for days, subject_prefix in thresholds:
        target_date = date.today() + timedelta(days=days)
        expiring = db.execute(text("""
            SELECT c.id, o.name, c.monthly_value_cents, c.currency,
                   c.valid_until, o.primary_contact_email
            FROM contracts c
            JOIN organisations o ON o.id = c.org_id
            WHERE DATE(c.valid_until) = :target_date
              AND c.contract_type NOT IN ('sandbox', 'internal')
              AND c.auto_renew = FALSE
        """), {"target_date": target_date}).fetchall()
        for contract in expiring:
            send_email(
                to="commercial@spacecom.io",
                subject=f"[SpaceCom] {subject_prefix}: {contract.name}",
                body=f"Contract for {contract.name} expires on {contract.valid_until.date()}. "
                     f"Monthly value: {contract.monthly_value_cents/100:.2f} {contract.currency}."
            )
```

Add to celery-redbeat at `crontab(hour=7, minute=0)`. Also send a courtesy expiry notice to the org admin contact at the 30-day threshold so they can initiate their internal procurement process.
**Celery schedule:** Add `send_hourly_digest_emails` to celery-redbeat at `crontab(minute=59)`.

**Cost rationale:** SMTP relay services (SES, Mailgun) charge per email. At 50/hour cap and 10 orgs, maximum 500 emails/hour = 12,000/day. At $0.10/1,000 (SES) = $1.20/day ≈ **$37/month** at sustained maximum. Without rate limiting during a flapping event, a single incident could generate thousands of emails in minutes.
**Per-client back-pressure and send queue circuit breaker (F7 — §63):** A slow client whose network buffers are full will cause `await websocket.send_json(event)` to block in the FastAPI handler. Without a per-client queue depth check, a single slow client can block the fan-out loop for all clients.

```python
# backend/app/alerts/ws_manager.py
import asyncio

from fastapi import WebSocket

# spacecom_ws_send_queue_overflow_total (Prometheus counter) is defined in the metrics module

WS_SEND_QUEUE_MAX = 50  # events; beyond this, circuit-breaker triggers

class ConnectionManager:
    def __init__(self):
        self._connections: dict[str, list[WebSocket]] = {}
        self._send_queues: dict[WebSocket, asyncio.Queue] = {}

    async def broadcast_to_org(self, org_id: str, event: dict):
        for ws in self._connections.get(org_id, []):
            queue = self._send_queues[ws]
            if queue.qsize() >= WS_SEND_QUEUE_MAX:
                # Circuit breaker: drop this connection; client will reconnect and replay
                spacecom_ws_send_queue_overflow_total.labels(org_id=org_id).inc()
                await ws.close(code=4003, reason="Send queue overflow — reconnect to resume")
            else:
                await queue.put(event)

    async def _send_worker(self, ws: WebSocket):
        """Dedicated coroutine per connection — decouples send from broadcast loop."""
        queue = self._send_queues[ws]
        while True:
            event = await queue.get()
            try:
                await ws.send_json(event)
            except Exception:
                break  # connection closed; worker exits
```

Prometheus counter: `spacecom_ws_send_queue_overflow_total{org_id}` — any non-zero value warrants investigation.
**Missed-alert display for offline clients (F8 — §63):** When a client reconnects after receiving `resync_required`, it calls the REST API to re-fetch current state. The notification centre must explicitly surface alerts that arrived during the offline period:

`GET /api/v1/alerts?since=<last_seen_ts>&include_offline=true` — returns all unacknowledged alerts since `last_seen_ts`, annotated with `"received_while_offline": true`. The notification centre renders these with a distinct visual treatment: amber border + *"Received while you were offline"* label. The client stores `last_seen_ts` in `localStorage` (updated on each WebSocket message); this survives page reload but not localStorage clear.

**WebSocket connection metadata — per-org operational visibility (F10 — §63):**

New Prometheus metrics:
```python
from prometheus_client import Gauge

ws_org_connected = Gauge(
    'spacecom_ws_org_connected',
    'Whether at least one WebSocket connection is active for this org',
    ['org_id', 'org_name']
)
ws_org_connections = Gauge(
    'spacecom_ws_org_connection_count',
    'Number of active WebSocket connections for this org',
    ['org_id']
)
```

Updated when connections open/close. Alert rule:
```yaml
- alert: ANSPNoLiveConnectionDuringTIPEvent
  expr: |
    spacecom_active_tip_events > 0
    and on(org_id) spacecom_ws_org_connected == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ANSP {{ $labels.org_name }} has no live WebSocket connection during active TIP event"
    runbook_url: "https://spacecom.internal/docs/runbooks/ansp-connection-lost.md"
```

On-call dashboard panel 9 (below the fold): *"ANSP Connection Status"* — table of org names, connection count, last-connected timestamp, TIP-event indicator. Rows with `connected = 0` and active TIP highlighted in amber.
**Protocol version negotiation (Finding 8):** Client connects with `?protocol_version=1`. The server's first message is always:
```json
{"type": "CONNECTED", "protocolVersion": 1, "serverVersion": "2.1.3", "seq": 0}
```
When a breaking event schema change ships, both versions are supported in parallel for 6 months. Clients on a deprecated version receive:
```json
{"type": "PROTOCOL_DEPRECATION_WARNING", "currentVersion": 1, "sunsetDate": "2026-12-01",
 "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration"}
```
After sunset, old-version connections are closed with code `4002` ("Protocol version deprecated"). Protocol version history is maintained in `docs/api-guide/websocket-protocol.md`.
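The greeting sequence above can be sketched server-side. A hedged illustration only: the function name is an assumption, and the deprecated-version set models a hypothetical future state in which v1 has been superseded (at the time of writing, v1 is current).

```python
# Build the server's opening messages for a connection requesting a given
# protocol version: CONNECTED always first, deprecation warning appended
# if the requested version is in the deprecated set.
def connection_greeting(requested_version: int, server_version: str,
                        deprecated_versions=frozenset({1})) -> list[dict]:
    msgs = [{"type": "CONNECTED", "protocolVersion": requested_version,
             "serverVersion": server_version, "seq": 0}]
    if requested_version in deprecated_versions:
        msgs.append({
            "type": "PROTOCOL_DEPRECATION_WARNING",
            "currentVersion": requested_version,
            "sunsetDate": "2026-12-01",
            "migrationGuideUrl": "/docs/api-guide/websocket-protocol.md#v2-migration",
        })
    return msgs
```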
**Token refresh during long-lived sessions (Finding 4):** Access tokens expire in 15 minutes. The server sends a `TOKEN_EXPIRY_WARNING` event 2 minutes before expiry:
```json
{"type": "TOKEN_EXPIRY_WARNING", "expiresInSeconds": 120, "seq": N}
```
The client calls `POST /auth/token/refresh` (standard REST — does not interrupt the WebSocket), then sends on the existing connection:
```json
{"type": "AUTH_REFRESH", "token": "<new_access_token>"}
```
Server responds: `{"type": "AUTH_REFRESHED", "seq": N}`. If the client does not refresh before expiry, the server closes with code `4001` ("Token expired — reconnect with a new token"). Clients distinguish `4001` (auth expiry, refresh and reconnect) from `4002` (protocol deprecated, upgrade required) from network errors (reconnect with backoff).
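The close-code handling rules above condense into a small client-side dispatch. A sketch, with the `4003` case taken from the send-queue circuit breaker described earlier in this section; the function name and return strings are assumptions:

```python
# Map a WebSocket close code to the client's recovery action.
def on_ws_close(code: int) -> str:
    if code == 4001:
        return "refresh_token_then_reconnect"  # auth expiry
    if code == 4002:
        return "upgrade_protocol"              # protocol version deprecated
    if code == 4003:
        return "reconnect_and_replay"          # send queue overflow
    return "reconnect_with_backoff"            # network or server error
```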
**Mode awareness:** In SIMULATION or REPLAY mode, the client's WebSocket connection remains open but `alert.new` and `tip.new` events are suppressed for the duration of the mode session. Simulation-generated events are delivered on a separate `WS /ws/simulation/{session_id}` channel.

### Alert Webhooks (`admin` role — registration; delivery to registered HTTPS endpoints)

For ANSPs with programmatic dispatch systems that cannot consume a browser WebSocket.

- `POST /webhooks` — register a webhook endpoint; `{"url": "https://ansp.example.com/hook", "events": ["alert.new", "tip.new"], "secret": "<shared_secret>"}`
- `GET /webhooks` — list registered webhooks for the organisation
- `DELETE /webhooks/{id}` — deregister
- `POST /webhooks/{id}/test` — send a synthetic `alert.new` event to verify delivery

**Delivery semantics:** At-least-once. SpaceCom POSTs the event envelope to the registered URL. Signature: `X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, body)>` header on every delivery. Retry policy: 3 retries with exponential backoff (1s, 5s, 30s). After 3 failures, the webhook is marked `degraded` and the org admin is notified by email. After 10 consecutive failures, the webhook is auto-disabled.
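On the receiving side, the `X-SpaceCom-Signature` header should be checked with a constant-time comparison. A consumer-side sketch, assuming the raw request body bytes and the shared secret registered at `POST /webhooks`; the function name is an assumption:

```python
import hashlib
import hmac

def verify_spacecom_signature(secret: str, body: bytes, header: str) -> bool:
    """Verify 'X-SpaceCom-Signature: sha256=<hex digest>' against the raw body."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing
    return hmac.compare_digest(expected, header)
```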
`alert_webhooks` table:
```sql
CREATE TABLE alert_webhooks (
  id SERIAL PRIMARY KEY,
  organisation_id INTEGER NOT NULL REFERENCES organisations(id),
  url TEXT NOT NULL,
  secret_hash TEXT NOT NULL, -- bcrypt hash of the shared secret; never stored in plaintext
  event_types TEXT[] NOT NULL,
  status TEXT NOT NULL DEFAULT 'active', -- active | degraded | disabled
  failure_count INTEGER DEFAULT 0,
  last_delivery_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
### Structured Event Export (`viewer` minimum)

First step toward SWIM / machine-readable ANSP system integration (Phase 3 target).

- `GET /events/{id}/export?format=geojson` — returns the event's re-entry corridor and impact zone as a GeoJSON `FeatureCollection` with ICAO FIR IDs and prediction metadata in `properties`
- `GET /events/{id}/export?format=czml` — CZML event package (same as `GET /czml/event/{event_id}`)
- `GET /events/{id}/export?format=ccsds-oem` — raw OEM for the object's trajectory at time of prediction

The GeoJSON export is the preferred integration surface for ANSP systems that are not SWIM-capable. The `properties` object includes: `norad_id`, `object_name`, `p05_utc`, `p50_utc`, `p95_utc`, `affected_fir_ids[]`, `risk_level`, `prediction_id`, `prediction_hmac` (for downstream integrity verification), `generated_at`.
### API Conventions (Finding 9)

**Field naming:** All API request and response bodies use `camelCase`. Database column names and Python internal models use `snake_case`. The conversion is handled automatically by a shared base model:

```python
from datetime import datetime

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class APIModel(BaseModel):
    """Base class for all API response/request models. Serialises to camelCase JSON."""
    model_config = ConfigDict(
        alias_generator=to_camel,
        populate_by_name=True,  # allows snake_case in tests and internal code
    )

class PredictionResponse(APIModel):
    prediction_id: int            # → "predictionId" in JSON
    p50_reentry_time: datetime    # → "p50ReentryTime"
    ood_flag: bool                # → "oodFlag"
```

All Pydantic response models inherit from `APIModel`. All request bodies also inherit from `APIModel` (with `populate_by_name=True`, clients may send either case). Document in `docs/api-guide/conventions.md`.
### Error Response Schema (Finding 2)

All error responses use the `SpaceComError` envelope — including FastAPI's default Pydantic validation errors (which are overridden):

```python
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from pydantic import BaseModel

class SpaceComError(BaseModel):
    error: str                  # machine-readable code from the error registry
    message: str                # human-readable; safe to display in UI
    detail: dict | None = None
    requestId: str              # from X-Request-ID header; enables log correlation

@app.exception_handler(RequestValidationError)  # `app` is the FastAPI instance
async def validation_error_handler(request: Request, exc: RequestValidationError):
    return JSONResponse(status_code=422, content=SpaceComError(
        error="VALIDATION_ERROR",
        message="Request validation failed",
        detail={"fields": exc.errors()},
        requestId=request.headers.get("X-Request-ID", ""),
    ).model_dump(by_alias=True))
```
**Canonical error code registry** — all codes, HTTP status, and recovery actions documented in `docs/api-guide/error-reference.md`. CI check: any `HTTPException` raised in application code must use a code from the registry. Sample entries:
|
||
|
||
| Code | HTTP status | Meaning | Recovery |
|
||
|---|---|---|---|
|
||
| `VALIDATION_ERROR` | 422 | Request body or query param invalid | Fix the indicated fields |
|
||
| `INVALID_CURSOR` | 400 | Pagination cursor malformed or expired | Restart from page 1 |
|
||
| `RATE_LIMITED` | 429 | Rate limit exceeded | Wait `retryAfterSeconds` |
|
||
| `EPHEMERIS_TOO_MANY_POINTS` | 400 | Computed points exceed 100,000 | Reduce range or increase step |
|
||
| `IDEMPOTENCY_IN_PROGRESS` | 409 | Duplicate request still processing | Wait and retry `statusUrl` |
|
||
| `HMAC_VERIFICATION_FAILED` | 503 | Prediction integrity check failed | Contact administrator |
|
||
| `API_KEY_INVALID` | 401 | API key revoked, expired, or invalid | Re-issue key |
|
||
| `PREDICTION_CONFLICT` | 200 (not error) | Multi-source window disagreement | See `conflictSources` field |
|
||
|
||
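The registry-based CI check can be sketched as a lookup table; `ERROR_REGISTRY` and `assert_registered` are illustrative names, with statuses mirroring the sample entries above:

```python
# Hypothetical mirror of docs/api-guide/error-reference.md: code -> HTTP status
ERROR_REGISTRY = {
    "VALIDATION_ERROR": 422,
    "INVALID_CURSOR": 400,
    "RATE_LIMITED": 429,
    "EPHEMERIS_TOO_MANY_POINTS": 400,
    "IDEMPOTENCY_IN_PROGRESS": 409,
    "HMAC_VERIFICATION_FAILED": 503,
    "API_KEY_INVALID": 401,
}


def assert_registered(code: str, status: int) -> None:
    """Fail fast when application code raises an unregistered or mismatched error code."""
    if code not in ERROR_REGISTRY:
        raise ValueError(f"error code {code!r} is not in the registry")
    if ERROR_REGISTRY[code] != status:
        raise ValueError(f"{code} must map to HTTP {ERROR_REGISTRY[code]}, got {status}")
```

A CI step could walk the AST of application modules and call `assert_registered` for every `HTTPException` it finds.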
### Rate Limit Error Response (Finding 6)

`429 Too Many Requests` responses include `Retry-After` (RFC 7231 §7.1.3) and a structured body:

```
HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742134847

{
  "error": "RATE_LIMITED",
  "message": "Rate limit exceeded for POST /decay/predict: 10 requests per hour",
  "retryAfterSeconds": 47,
  "limit": 10,
  "window": "1h",
  "requestId": "..."
}
```

`retryAfterSeconds` = `X-RateLimit-Reset − now()`. Clients implementing backoff must honour `Retry-After` and must not retry before it elapses.
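Client-side, the backoff rule above reduces to a small helper; a sketch (the `seconds_until_retry` name is illustrative) that prefers `Retry-After` and falls back to the reset header:

```python
def seconds_until_retry(headers: dict, now_unix: float) -> float:
    """How long a client must wait before retrying a 429 response.

    Prefers Retry-After; falls back to X-RateLimit-Reset - now; never negative.
    """
    if "Retry-After" in headers:
        return max(0.0, float(headers["Retry-After"]))
    reset = float(headers.get("X-RateLimit-Reset", now_unix))
    return max(0.0, reset - now_unix)
```

A well-behaved integration sleeps for this duration (plus jitter, if desired) before re-issuing the request.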
### Idempotency Keys (Finding 5)

Mutation endpoints that have real-world consequences support idempotency keys:

```
POST /decay/predict
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
```

Server behaviour:

- **First receipt:** process normally; store `(key, user_id, endpoint, response_body)` in the `idempotency_keys` table with a 24-hour TTL
- **Duplicate within 24h:** return the stored response with `HTTP 200` + header `Idempotency-Replay: true`; do not re-execute
- **Still processing:** return `409 Conflict` → `{"error": "IDEMPOTENCY_IN_PROGRESS", "statusUrl": "/jobs/uuid"}`
- **After 24h:** key expired; treat as a new request

Applies to: `POST /decay/predict`, `POST /reports`, `POST /notam/draft`, `POST /alerts/{id}/acknowledge`, `POST /admin/users`. Documented in `docs/api-guide/idempotency.md`.
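The server behaviour above is a small state machine; an in-memory sketch (the `IdempotencyStore` class is illustrative — production stores keys in the `idempotency_keys` table, not a dict):

```python
TTL_SECONDS = 24 * 3600


class IdempotencyStore:
    """Sketch of the first-receipt / in-progress / replay / expired state machine."""

    def __init__(self):
        self._entries = {}  # key -> (stored_at_unix, response_or_None)

    def begin(self, key: str, now: float):
        entry = self._entries.get(key)
        if entry and now - entry[0] < TTL_SECONDS:
            stored_at, response = entry
            if response is None:
                return ("in_progress", None)   # 409 IDEMPOTENCY_IN_PROGRESS
            return ("replay", response)        # 200 + Idempotency-Replay: true
        self._entries[key] = (now, None)       # first receipt, or expired key reused
        return ("new", None)

    def complete(self, key: str, response: dict) -> None:
        stored_at, _ = self._entries[key]
        self._entries[key] = (stored_at, response)
```

The four return paths map one-to-one onto the four bullets above.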
### API Key Authentication Model (Finding 11)

API key requests use key-only auth — no JWT required:

```
Authorization: Bearer apikey_<base64url_encoded_key>
```

The prefix `apikey_` distinguishes API keys from JWT Bearer tokens at the middleware layer. The raw key is hashed with SHA-256 before storage; the raw key is shown exactly once at creation.

Rules:

- API key rate limits are **independent** from JWT session rate limits — separate Redis buckets per key
- Webhook deliveries are **not** counted against any rate limit bucket (server-initiated, not client-initiated)
- `allowed_endpoints` scope: `null` = all endpoints for the key's role; a non-null array restricts to listed paths. `403` returned for requests to unlisted endpoints with `{"error": "ENDPOINT_NOT_IN_KEY_SCOPE"}`
- Revoked/expired/invalid key: always `401` → `{"error": "API_KEY_INVALID", "message": "API key is revoked or expired"}` — indistinguishable from never-valid (prevents enumeration)

Document in `docs/api-guide/api-keys.md`.
### System Endpoints (Finding 10)

`GET /readyz` is included in the OpenAPI spec as a documented endpoint (tagged `System`), so integrators and SWIM consumers can discover and monitor it:

```python
@app.get(
    "/readyz",
    tags=["System"],
    summary="Readiness and degraded-state check",
    response_model=ReadinessResponse,
    responses={
        200: {"description": "System operational"},
        207: {"description": "System degraded — one or more data sources stale"},
        503: {"description": "System unavailable — database or Redis unreachable"},
    },
)
async def readyz() -> ReadinessResponse:
    ...
```

`GET /healthz` (liveness probe) remains undocumented in OpenAPI — infrastructure-only. `/readyz` is the recommended integration health check endpoint for ANSP monitoring systems and the Phase 3 SWIM integration.
**Clock skew detection and server time endpoint (F6 — §67):**

CZML `availability` timestamps and prediction windows are generated using server UTC. If the server clock drifts (NTP sync failure after container restart, hypervisor clock skew, or VM migration), CZML ground track windows will be offset from real time. A client whose clock differs from the server clock by > 5 seconds will show predictions in the wrong temporal position.

**Infrastructure requirement:** All SpaceCom hosts must run `chronyd` or `systemd-timesyncd` with NTP synchronisation to a reliable source. Add to the deployment runbook (`docs/runbooks/host-setup.md`):

```bash
# Ubuntu/Debian
timedatectl set-ntp true
timedatectl status   # confirm NTPSynchronized: yes
```

Add Grafana alert: `node_timex_sync_status != 1` → WARNING: *"NTP sync lost on <host>"*.

**Client-side clock skew display:** Add a `GET /api/v1/time` endpoint (unauthenticated, rate-limited to 1 req/s per IP):

```python
import time
from datetime import datetime, timezone


@router.get("/api/v1/time")
async def server_time():
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated
    return {"utc": now.isoformat().replace("+00:00", "Z"), "unix": time.time()}
```

The frontend calls this on page load and computes `skew_seconds = server_unix - Date.now()/1000`. If `abs(skew_seconds) > 5`: display a persistent WARNING banner: *"Your browser clock differs from the server by {N}s — prediction windows may appear offset. Please synchronise your system clock."*
### Pagination Standard

All list endpoints use **cursor-based pagination** (not offset-based). Offset pagination degrades as `OFFSET N` forces the DB to scan and discard N rows; at 7-year retention depth this becomes a full table scan.

**Canonical response envelope — applied to every list endpoint (Finding 1):**

```json
{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wMy0xNlQxNDozMDowMFoiLCJpZCI6NDQ4Nzh9",
    "has_more": true,
    "limit": 50,
    "total_count": null
  }
}
```

Rules:

- `data` (not `items`) is the canonical array key across all list endpoints
- `next_cursor` is `base64url(json({"created_at": "<iso8601>", "id": <int>}))` — opaque to clients, decoded server-side
- `total_count` is always `null` — count queries on large tables force full scans; document this explicitly in `docs/api-guide/pagination.md`
- `limit` defaults to 50; maximum 200; specified per endpoint group in the OpenAPI `description`
- Empty result: `{"data": [], "pagination": {"next_cursor": null, "has_more": false, "limit": 50, "total_count": null}}` — never `404`
- Invalid/expired cursor: `400 Bad Request` → `{"error": "INVALID_CURSOR", "message": "Cursor is malformed or refers to a deleted record", "request_id": "..."}`

**Standard query parameters:**

- `limit` — page size (default: 50, maximum: 200)
- `cursor` — opaque cursor token from a previous response (absent = first page)

The cursor decodes server-side to `WHERE (created_at, id) < (cursor_ts, cursor_id) ORDER BY created_at DESC, id DESC`. Tokens are valid for 24 hours.
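The cursor encoding described above round-trips with stdlib `base64` and `json`; a sketch (helper names are illustrative):

```python
import base64
import json


def encode_cursor(created_at_iso: str, row_id: int) -> str:
    """base64url(json({"created_at": ..., "id": ...})) with padding stripped."""
    raw = json.dumps({"created_at": created_at_iso, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(token: str) -> tuple[str, int]:
    """Inverse of encode_cursor; raises ValueError on a malformed token (-> INVALID_CURSOR)."""
    padded = token + "=" * (-len(token) % 4)  # restore stripped base64 padding
    doc = json.loads(base64.urlsafe_b64decode(padded))
    return doc["created_at"], doc["id"]
```

A decode failure maps to the `INVALID_CURSOR` error defined in the registry.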
**Implementation:**

```python
from typing import Generic, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class PaginationMeta(BaseModel):
    next_cursor: str | None
    has_more: bool
    limit: int
    total_count: None = None  # always None; never compute count


class PaginatedResponse(BaseModel, Generic[T]):
    data: list[T]
    pagination: PaginationMeta


def paginate_query(q, cursor: str | None, limit: int) -> PaginatedResponse:
    """Shared utility used by all list endpoints — enforces envelope consistency."""
    ...
```

**Enforcement:** An OpenAPI CI check confirms every endpoint tagged `list` has `limit` and `cursor` query parameters and returns the `PaginatedResponse` schema. Violations fail CI.

**Affected endpoints** (all paginated): `/objects`, `/decay/predictions`, `/reentry/predictions`, `/alerts`, `/conjunctions`, `/reports`, `/notam/drafts`, `/space/objects`, `/api-keys/usage`, `/admin/security-events`.
---

### API Latency Budget — CZML Catalog Endpoint

The CZML catalog endpoint (`GET /czml/objects`) is the most latency-sensitive read path and the primary SLO driver (p95 < 2s). Latency budget allocation:

| Component | Budget | Notes |
|---|---|---|
| DNS + TLS handshake (new connection) | 50 ms | Not applicable on keep-alive; amortised to ~0 for repeat requests |
| Caddy proxy overhead | 5 ms | Header processing only |
| FastAPI routing + middleware (auth, RBAC, rate limit) | 30 ms | Each middleware ~5–10 ms; keep middleware count ≤ 5 on this path |
| PgBouncer connection acquisition | 10 ms | Pool saturation adds latency; monitor `pgbouncer_pool_waiting` metric |
| DB query execution (PostGIS geometry) | 800 ms | Includes GiST index scan + geometry serialisation |
| CZML serialisation (Pydantic → JSON) | 200 ms | Validated by benchmark; exceeding this indicates schema complexity regression |
| HTTP response transmission (5 MB @ 1 Gbps internal) | 40 ms | Internal network; negligible |
| **Total budget (new connection)** | **~1,135 ms** | **~865 ms headroom to 2s p95 SLO** |

Any new middleware added to the CZML endpoint path must be profiled and must not exceed its allocated budget. Exceeding the DB or serialisation budget requires a performance investigation before merge.

---
### API Versioning Policy

Base path: `/api/v1`. All versioned endpoints follow Semantic Versioning applied to the API contract:

- **Non-breaking changes** (additive: new optional fields, new endpoints, new query params): deployed without a version bump; announced in `CHANGELOG.md`
- **Breaking changes** (removed fields, changed types, changed auth requirements, removed endpoints): require a new major version (`/api/v2`); the old version is supported in parallel for a minimum of **6 months** before sunset
- **Deprecation signalling:** Deprecated endpoints return `Deprecation: true` and `Sunset: <date>` response headers (RFC 8594)
- **Version negotiation:** Clients may send `Accept: application/vnd.spacecom.v1+json` to pin to a specific version; the default is always the latest stable version
- **Breaking change notice:** Minimum 3 months written notice (email to registered API key holders + `CHANGELOG.md` entry) before any breaking change is deployed

**Changelog discipline (F5):** `CHANGELOG.md` follows the [Keep a Changelog](https://keepachangelog.com/) format with [Conventional Commits](https://www.conventionalcommits.org/) as the commit-level input. Every PR must add an entry under `[Unreleased]` if it has a user-visible effect. On release, `[Unreleased]` is renamed to `[{semver}] - {date}`.

```markdown
## [Unreleased]
### Added
- `p01_reentry_time` and `p99_reentry_time` fields on decay prediction response (SC-188)
### Changed
- `altitude_unit_preference` default for ANSP operators changed from `m` to `ft` (SC-201)
### Fixed
- HMAC integrity check now correctly handles NULL `action_taken` field (SC-195)
### Deprecated
- `GET /objects/{id}/trajectory` — use `GET /objects/{id}/ephemeris` (sunset 2027-06-01)
```

- `make changelog-check` (CI step) fails if the `[Unreleased]` section is empty and the diff contains non-chore/docs commits
- Release changelogs are the source for API key holder email notifications and GitHub release notes

**OpenAPI spec as source of truth (F1):** FastAPI generates the OpenAPI 3.1 spec automatically from route decorators, Pydantic schemas, and docstrings. The spec is the authoritative contract — not a separately maintained document. CI enforces this:

- `GET /api/v1/openapi.json` is served by the running API; CI downloads it and diffs against the committed `openapi.yaml`
- Any uncommitted drift fails the build with `openapi-diff --fail-on-incompatible`
- The committed `openapi.yaml` is regenerated by running `make generate-openapi` (calls `python -m app.generate_spec`) — this is a required step in the PR checklist for any API change
- The spec is the input to all downstream tooling: Swagger UI (`/docs`), Redoc (`/redoc`), contract tests, and the client SDK generator

**API date/time contract (F10):** All date/time fields in API responses must use **ISO 8601 with UTC offset** — never Unix timestamps, never local time strings:

- Format: `"2026-03-22T14:00:00Z"` (UTC, `Z` suffix)
- OpenAPI annotation: `format: date-time` on every `_at`-suffixed and `_time`-suffixed field
- Contract test (BLOCKING): every field matching `/_at$|_time$/` in every response schema asserts it matches `^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$`
- Pydantic models use `datetime` with `model_config = {"json_encoders": {datetime: lambda v: v.isoformat().replace("+00:00", "Z")}}`
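The Z-suffix serialisation and the contract regex can be exercised together; a sketch using the regex quoted above (`to_api_datetime` is an illustrative name):

```python
import re
from datetime import datetime, timezone

# Regex from the contract test (BLOCKING) bullet above
ISO_Z = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def to_api_datetime(dt: datetime) -> str:
    """Serialise an aware datetime to the contract format: UTC, Z suffix, never +00:00."""
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
```

Any serialised value that fails `ISO_Z.match` is a contract violation regardless of which layer produced it.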
**Frontend ↔ API contract testing (F4):** The TypeScript types used by the Next.js frontend must be validated against the OpenAPI spec on every CI run — preventing the common drift where the Pydantic response model changes but the frontend `interface` is not updated until a runtime error surfaces.

Implementation: `openapi-typescript` generates TypeScript types from `openapi.yaml` into `frontend/src/types/api.generated.ts`. The frontend imports only from this generated file — no hand-written API response interfaces. A CI check (`make check-api-types`) regenerates the types and fails if the git diff is non-empty:

```bash
# CI step: check-api-types
openapi-typescript openapi.yaml -o frontend/src/types/api.generated.ts
git diff --exit-code frontend/src/types/api.generated.ts \
  || (echo "API types out of sync — run: make generate-api-types" && exit 1)
```

This is a one-way contract: the spec is authoritative; the TypeScript types are derived. Any API change that affects the frontend must regenerate types before the PR can merge. This replaces the need for a separate consumer-driven contract test framework (Pact) at Phase 1 scale.

**OpenAPI response examples (F7):** Every endpoint schema in the OpenAPI spec must include at least one `examples:` block demonstrating a realistic success response. This is enforced by a CI lint step (`spectral lint openapi.yaml --ruleset .spectral.yaml`) with a custom rule `require-response-example`. Missing examples fail the build. The examples serve three purposes: Swagger UI and Redoc interactive documentation, contract test fixture baseline, and ESA auditor review readability.

```yaml
# Example: openapi.yaml fragment for GET /objects/{norad_id}
responses:
  '200':
    content:
      application/json:
        schema:
          $ref: '#/components/schemas/ObjectDetail'
        examples:
          debris_object:
            summary: Tracked debris fragment in decay
            value:
              norad_id: 48274
              name: "CZ-3B DEB"
              object_type: "DEBRIS"
              perigee_km: 187.4
              apogee_km: 312.1
              data_confidence: "nominal"
              propagation_quality: "degraded"
              propagation_warning: "tle_age_7_14_days"
```

**Client SDK strategy (F8):** Phase 1 — no dedicated SDK. ANSP integrators are provided:

1. The committed `openapi.yaml` for import into Postman, Insomnia, or any OpenAPI-compatible tooling
2. A `docs/integration/` directory with language-specific quickstart guides (Python, JavaScript/TypeScript) showing auth, object fetch, and WebSocket subscription patterns
3. Python integration examples using `httpx` (async) and `requests` (sync) — not a packaged SDK

Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate one using `openapi-generator-cli` targeting Python and TypeScript. Generated clients are published under the `@spacecom/` npm scope and as the `spacecom-client` PyPI package. The generator configuration is committed to `tools/sdk-generator/` so regeneration is reproducible from the spec.

---
## 15. Propagation Architecture — Technical Detail

### 15.1 Catalog Propagator (SGP4)

```python
from datetime import datetime

from sgp4.api import Satrec, jday

from app.frame_utils import teme_to_gcrf, gcrf_to_itrf, itrf_to_geodetic


def propagate_catalog(tle_line1: str, tle_line2: str, times_utc: list[datetime]) -> list[OrbitalState]:
    sat = Satrec.twoline2rv(tle_line1, tle_line2)
    results = []
    for t in times_utc:
        jd, fr = jday(t.year, t.month, t.day, t.hour, t.minute, t.second + t.microsecond / 1e6)
        e, r_teme, v_teme = sat.sgp4(jd, fr)
        if e != 0:
            raise PropagationError(f"SGP4 error code {e}")
        r_gcrf, v_gcrf = teme_to_gcrf(r_teme, v_teme, t)
        lat, lon, alt = itrf_to_geodetic(gcrf_to_itrf(r_gcrf, t))
        results.append(OrbitalState(
            time=t, reference_frame='GCRF',
            pos_x_km=r_gcrf[0], pos_y_km=r_gcrf[1], pos_z_km=r_gcrf[2],
            vel_x_kms=v_gcrf[0], vel_y_kms=v_gcrf[1], vel_z_kms=v_gcrf[2],
            lat_deg=lat, lon_deg=lon, alt_km=alt, propagator='sgp4'
        ))
    return results
```

**Scope limitation:** SGP4 is accurate to ~1 km for perigee > 300 km and epoch age < 7 days. Do not use it for decay prediction.

**SGP4 validity gates — enforced at query time (Finding 1):**

| Condition | Action | UI signal |
|---|---|---|
| `tle_epoch_age ≤ 7 days` | Normal propagation | `propagation_quality: 'nominal'` |
| `7 days < tle_epoch_age ≤ 14 days` | Propagate with warning | `propagation_quality: 'degraded'`; amber `DataConfidenceBadge`; API includes `propagation_warning: 'tle_age_7_14_days'` |
| `tle_epoch_age > 14 days` | Return estimate with explicit caveat | `propagation_quality: 'unreliable'`; object position not rendered on globe without user acknowledgement; API returns `propagation_warning: 'tle_age_exceeds_14_days'` |
| `perigee_altitude < 200 km` | Do not use SGP4 | Route all propagation requests to the numerical decay predictor; SGP4 is invalid in this density regime |

The epoch age check runs at the start of `propagate_catalog()`. The perigee altitude gate is enforced during TLE ingest — objects crossing below 200 km perigee are automatically flagged for decay prediction and removed from SGP4 catalog propagation tasks.
**Sub-150 km propagation confidence guard (F2):** For the numerical decay predictor, objects with current perigee < 150 km are in a regime where atmospheric density model uncertainty dominates and SGP4/numerical model errors grow rapidly. Predictions in this regime are flagged:

```python
if perigee_km < 150:
    prediction.propagation_confidence = 'LOW_CONFIDENCE_PROPAGATION'
    prediction.propagation_confidence_reason = (
        f'Perigee {perigee_km:.0f} km below 150 km; '
        'atmospheric density uncertainty dominant; re-entry imminent'
    )
```

`LOW_CONFIDENCE_PROPAGATION` is surfaced in the UI as a red badge: "⚠ Re-entry imminent — prediction confidence low; consult Space-Track TIP directly." Unit test (BLOCKING): construct a TLE with perigee = 120 km; call the decay predictor; assert `propagation_confidence == 'LOW_CONFIDENCE_PROPAGATION'`.
### 15.2 Decay Predictor (Numerical)

**Physics:** J2–J6 geopotential, NRLMSISE-00 drag, solar radiation pressure (cannonball model), WGS84 oblate Earth.

#### NRLMSISE-00 Input Vector (Finding 2)

NRLMSISE-00 requires a fully specified input vector. Using a single F10.7 value for both the 81-day average and the prior-day slot, or using Kp instead of Ap, introduces systematic density errors that are worst during geomagnetic storms — exactly when prediction uncertainty matters most.

```python
# Required NRLMSISE-00 inputs — both stored in space_weather table
nrlmsise_input = NRLMSISEInput(
    f107A = f107_81day_avg,     # 81-day centred average F10.7 (NOT current)
    f107  = f107_prior_day,     # prior-day F10.7 value (NOT current day)
    ap    = ap_daily,           # daily Ap index (linear) — NOT Kp (logarithmic)
    ap_a  = ap_3h_history_57h,  # 19-element array of 3-hourly Ap for prior 57h;
                                # enables full NRLMSISE accuracy (flags.switches[9]=1)
)
```

The `space_weather` table already stores `f107_81day_avg` and `ap_daily`. Add `f107_prior_day DOUBLE PRECISION` and `ap_3h_history DOUBLE PRECISION[19]` columns (the 3-hourly Ap history array for the 57 hours preceding each observation). The ingest worker populates both from the NOAA SWPC Space Weather JSON endpoint.
**Atmospheric density model selection rationale (F3):** NRLMSISE-00 is used for Phase 1. JB2008 (Bowman et al. 2008) is the current USSF operational standard and is demonstrably more accurate during high solar activity periods (F10.7 > 150) and geomagnetic storms (Kp > 5). NRLMSISE-00 is chosen for Phase 1 because:

- Python bindings are mature (`nrlmsise00` PyPI package); JB2008 has no equivalent mature Python binding
- For the typical F10.7 range (70–150 sfu) at solar minimum/moderate activity, the accuracy difference is < 10%
- Phase 2 milestone: evaluate JB2008 against NRLMSISE-00 on historical re-entry backcasts; if MAE improvement > 15%, migrate; decision documented in `docs/adr/0016-atmospheric-density-model.md`

**NRLMSISE-00 input validity bounds (F3):** Inputs outside these ranges produce unphysical density estimates; the prediction is rejected rather than silently accepted:

```python
NRLMSISE_INPUT_BOUNDS = {
    "f107": (65.0, 300.0),          # physical solar flux range; < 65 indicates data gap
    "f107A": (65.0, 300.0),
    "ap": (0.0, 400.0),             # Ap index physical range
    "altitude_km": (85.0, 1000.0),  # validated density range
}
```

If any bound is violated, raise `AtmosphericModelInputError` with field and value — never silently clamp.
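A minimal validator implementing the reject-never-clamp rule; `validate_nrlmsise_inputs` is an illustrative name, and the bounds dict is repeated here so the sketch is self-contained:

```python
class AtmosphericModelInputError(ValueError):
    """Raised when an NRLMSISE-00 input falls outside its physical validity range."""


NRLMSISE_INPUT_BOUNDS = {
    "f107": (65.0, 300.0),
    "f107A": (65.0, 300.0),
    "ap": (0.0, 400.0),
    "altitude_km": (85.0, 1000.0),
}


def validate_nrlmsise_inputs(**inputs: float) -> None:
    """Reject any out-of-bounds input with field and value; never silently clamp."""
    for field, value in inputs.items():
        lo, hi = NRLMSISE_INPUT_BOUNDS[field]
        if not (lo <= value <= hi):
            raise AtmosphericModelInputError(f"{field}={value} outside [{lo}, {hi}]")
```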
**Altitude scope:** NRLMSISE-00 is used from 150 km to 800 km. Above 800 km, the model is applied but the prediction carries `ood_flag = TRUE` with `ood_reason = 'above_nrlmsise_validated_range_800km'` (Finding 11).

**Geomagnetic storm sensitivity (Finding 11):** During the MC sampling, when the current 3-hour Kp index exceeds 5, sample F10.7 and Ap from storm-period values (current observed, not the 81-day average). The prediction is annotated:

- `space_weather_warning: 'geomagnetic_storm'` field on the `reentry_predictions` record
- UI amber callout: "Active geomagnetic storm — thermospheric density is elevated; re-entry timing uncertainty is significantly increased"
- The storm flag persists for the lifetime of the prediction; it is not cleared when the storm ends (the prediction was made during disturbed conditions)
#### Ballistic Coefficient Uncertainty Model (Finding 3)

The ballistic coefficient `β = m / (C_D × A)` is the dominant uncertainty in drag-driven decay. Its three components are sampled independently in the Monte Carlo:

| Parameter | Distribution | Rationale |
|---|---|---|
| `C_D` | `Uniform(2.0, 2.4)` | Standard assumption for non-cooperative objects in free molecular flow; no direct measurement available |
| `A` (stable attitude, `attitude_known = TRUE`) | `Normal(A_discos, 0.05 × A_discos)` | 5% shape uncertainty for known-attitude objects |
| `A` (tumbling, `attitude_known = FALSE`) | `Normal(A_discos_mean, 0.25 × A_discos_mean)` | 25% uncertainty; tumbling objects present a time-varying cross-section |
| `m` | `Normal(m_discos, 0.10 × m_discos)` | 10% mass uncertainty; DISCOS masses are not independently verified |

OOD rules:

- `attitude_known = FALSE AND mass_kg IS NULL` → `ood_flag = TRUE`, `ood_reason = 'tumbling_no_mass'` — outside validated regime
- `cd_a_over_m IS NULL AND mass_kg IS NULL AND cross_section_m2 IS NULL` → `ood_flag = TRUE`, `ood_reason = 'no_physical_properties'`

Objects with known physical properties can have operator-provided overrides stored in `objects.cd_override DOUBLE PRECISION` and `objects.bstar_override DOUBLE PRECISION`. When overrides are present, the MC samples around the override value rather than the DISCOS-derived value.
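One Monte Carlo draw of the three components can be sketched with stdlib `random` (the function and argument names are illustrative; production sampling runs inside the decay predictor):

```python
import random


def sample_beta_inputs(a_discos_m2: float, m_discos_kg: float,
                       attitude_known: bool, rng: random.Random) -> tuple[float, float, float]:
    """Draw one MC sample of (C_D, A, m) per the distribution table above."""
    c_d = rng.uniform(2.0, 2.4)                          # free molecular flow assumption
    sigma_a = (0.05 if attitude_known else 0.25) * a_discos_m2
    area = rng.gauss(a_discos_m2, sigma_a)               # 5% known attitude, 25% tumbling
    mass = rng.gauss(m_discos_kg, 0.10 * m_discos_kg)    # 10% DISCOS mass uncertainty
    return c_d, area, mass
```

Seeding the `Random` instance per simulation keeps ensembles reproducible for replay validation.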
#### Solar Radiation Pressure (Finding 7)

SRP is included using the cannonball model:

```
a_srp = −P_sr × C_r × (A/m) × r̂_sun
```

where `P_sr = 4.56 × 10⁻⁶ N/m²` at 1 AU (scaled by `(1 AU / r_sun)²`), and `C_r` is the radiation pressure coefficient stored in `objects.cr_coefficient DOUBLE PRECISION DEFAULT 1.3`.

SRP is significant (> 5% of the drag contribution) for objects with area-to-mass ratio > 0.01 m²/kg at altitudes > 500 km. OOD flag: `area_to_mass > 0.01 AND perigee > 500 km AND cr_coefficient IS NULL` → `ood_reason = 'srp_significant_cr_unknown'`.
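The cannonball magnitude is a one-line computation; a sketch in SI units (the function name and the explicit `r_sun_au` scaling argument are illustrative):

```python
P_SR_1AU = 4.56e-6  # N/m², solar radiation pressure at 1 AU


def srp_acceleration_magnitude(c_r: float, area_m2: float, mass_kg: float,
                               r_sun_au: float = 1.0) -> float:
    """Cannonball-model SRP acceleration magnitude in m/s², scaled by (1 AU / r_sun)²."""
    return P_SR_1AU * c_r * (area_m2 / mass_kg) / (r_sun_au ** 2)
```

The direction (along `−r̂_sun`, i.e. away from the Sun) is applied by the force model; only the magnitude is computed here.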
#### Integrator Configuration (Finding 9)

```python
from scipy.integrate import solve_ivp

integrator_config = dict(
    method = "DOP853",   # Dormand–Prince 8(5,3) — 8th-order explicit RK, adaptive step
    rtol = 1e-9,         # relative tolerance (parts-per-billion)
    atol = 1e-9,         # absolute tolerance (km); ≈ 1 µm position error
    max_step = 60.0,     # seconds; constrained to capture density variation at perigee
    t_span = (t0, t0 + 120 * 86400),  # 120-day maximum integration window
    events = [
        altitude_80km_event,   # terminal: breakup trigger
        altitude_200km_event,  # non-terminal: log perigee passage
    ],
    dense_output = False,
)
```

Stopping criterion: integration terminates when `altitude ≤ 80 km` (breakup trigger fires) or when the 120-day span elapses without reaching 80 km (result: `propagation_timeout`; stored as `status = 'timeout'` in `simulations`). The 120-day cap is a safety stop — any object not re-entering within 120 days from a sub-450 km perigee TLE is anomalous and should be flagged for human review.

The `max_step = 60s` constraint near perigee prevents the integrator from stepping over atmospheric density variations. For altitudes above 300 km, the max step is relaxed to 300 s (5 min) via a step-size hook that checks current altitude.
**TLE age uncertainty inflation (F7):** TLE age is a formal uncertainty source, not just a staleness indicator. For decaying objects, position uncertainty grows with TLE age due to unmodelled atmospheric drag variations. A linear inflation model is applied to the ballistic coefficient covariance before MC sampling:

```python
# Applied in decay_predictor.py before MC sampling
tle_age_days = (prediction_epoch - tle_epoch).total_seconds() / 86400
if tle_age_days > 0 and perigee_km < 450:
    uncertainty_multiplier = 1.0 + 0.15 * tle_age_days
    sigma_cd *= uncertainty_multiplier
    sigma_area *= uncertainty_multiplier
```

The 0.15/day coefficient is derived from Vallado (2013) §9.6 propagation error growth for LEO objects in ballistic flight. `tle_age_at_prediction_time` and `uncertainty_multiplier` are stored in `simulations.params_json` and included in the prediction API response for provenance.
**Monte Carlo convergence criterion (F4):** N = 500 for production is not arbitrary — it satisfies the following convergence criterion tested on the reference object (`mc-ensemble-params.json`):

| N | p95 corridor area (km²) | Change from N/2 |
|---|---|---|
| 100 | baseline | — |
| 250 | — | ~12% |
| 500 | — | ~4% |
| 1000 | — | ~1.8% |
| 2000 | — | ~0.9% |

Convergence criterion: corridor area change < 2% between doublings. N = 500 satisfies this for the reference object. N = 1000 is used for objects with `ood_flag = TRUE` or `space_weather_warning = 'geomagnetic_storm'` (higher uncertainty → higher N needed for stable tail estimates). The server cap remains 1000.
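The doubling criterion is a one-line check; a sketch with illustrative names:

```python
def mc_converged(area_at_n: float, area_at_half_n: float, threshold: float = 0.02) -> bool:
    """Corridor-area convergence: relative change < 2% between sample-count doublings."""
    return abs(area_at_n - area_at_half_n) / area_at_half_n < threshold
```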
**Monte Carlo:**

```
N = 500 (standard); N = 1000 (OOD flag or storm warning); server cap 1000
Per-sample variation: C_D ~ U(2.0, 2.4); A ~ N(A_discos, σ_A × uncertainty_multiplier);
m ~ N(m_discos, σ_m); F10.7 and Ap from storm-aware sampling
Output: p01/p05/p25/p50/p75/p95/p99 re-entry times; ground track corridor polygon; per-sample binary blob for Mode C
All output records HMAC-signed before database write
```
### 15.3 Atmospheric Breakup Model

Simplified ORSAT approach: aerothermal heating → failure altitude → fragment generation → RK4 ballistic descent → impact (velocity, angle, KE, casualty area). Distinct from the NASA SBM on-orbit fragmentation model.

**Breakup altitude trigger (Finding 5):** Structural breakup begins when the numerical integrator crosses `altitude = 78 km` (within the 75–80 km range supported by NASA Debris Assessment Software and ESA DRAMA for aluminium-structured objects; documented in the model card under "Breakup Altitude Rationale").

**Fragment generation:** Below 78 km, the fragment cloud is generated using the NASA Standard Breakup Model (NASA-TM-2018-220054) parameter set for the object's mass class:

- Mass class A: < 100 kg
- Mass class B: 100–1000 kg
- Mass class C: > 1000 kg (rocket bodies, large platforms)

**Survivability by material (Finding 5):** Fragment demise altitude is determined by material class using the ESA DRAMA demise altitude lookup:

| `material_class` | Typical demise altitude | Notes |
|---|---|---|
| `aluminium` | 60–70 km | Most fragments demise; some survive |
| `stainless_steel` | 45–55 km | Higher survival probability |
| `titanium` | 40–50 km | High survival; used in tanks and fasteners |
| `carbon_composite` | 55–65 km | Largely demises but reinforced structures may survive |
| `unknown` | Conservative: 0 km (surface impact) | All fragments assumed to survive — drives `ood_flag = TRUE` |

`material_class TEXT` is added to the `objects` table. When `material_class IS NULL`, the `ood_flag` is set and the conservative all-survive assumption is used. The NOTAM `(E)` field debris survival statement changes from a static disclaimer to a model-driven statement: `DEBRIS SURVIVAL PROBABLE` (calculated survivability > 50%), `DEBRIS SURVIVAL POSSIBLE` (10–50%), or `COMPLETE DEMISE EXPECTED` (< 10%).
**Casualty area:** Computed from fragment mass and velocity using the ESA DRAMA methodology. Stored per-fragment in the `fragment_impacts` table. The aggregate casualty area polygon drives the "ground risk" display in the Event Detail page (Phase 3 feature).

**Survival probability output (F5):** The aggregate object-level survival probability is stored in `reentry_predictions`:

```sql
ALTER TABLE reentry_predictions
    ADD COLUMN survival_probability DOUBLE PRECISION,  -- fraction of object mass expected to survive to surface (0.0–1.0)
    ADD COLUMN survival_model_version TEXT,            -- e.g. 'phase1_analytical_v1', 'drama_3.2'
    ADD COLUMN survival_model_note TEXT;               -- human-readable caveat, e.g. 'Phase 1: simplified analytical; no fragmentation modelling'
```

Phase 1 method: simplified analytical — the ballistic coefficient of the intact object projected to surface; if `material_class = 'unknown'`, `survival_probability = 1.0` (conservative all-survive). Phase 2: integrate ESA DRAMA output files where available from the space operator's licence submission. The NOTAM `(E)` field statement is driven by `survival_probability` (already specified above).
### 15.4 Corridor Generation Algorithm (Finding 4)

The re-entry corridor polygon is generated by `reentry/corridor.py`. The algorithm must be specified explicitly — the choice between convex hull, alpha-shape, and ellipse fit produces materially different FIR intersection results.

**Algorithm:**

```python
def generate_corridor_polygon(
    mc_trajectories: list[list[GroundPoint]],
    percentile: float = 0.95,
    alpha: float = 0.1,       # degrees; ~11 km at equator
    buffer_km: float = 50.0,  # lateral dispersion buffer below 80 km
    max_vertices: int = 1000,
) -> Polygon:
    """
    Generate a re-entry hazard corridor polygon from Monte Carlo trajectories.

    Algorithm:
    1. For each MC trajectory, collect ground positions at 10-min intervals
       from the 80 km altitude crossing to the final impact point.
    2. Retain the central `percentile` fraction of trajectories by re-entry time
       (discard the earliest p_low and latest p_high tails).
    3. Compute the alpha-shape (concave hull) of the combined point set
       using alpha = 0.1°. Alpha-shape is preferred over convex hull for
       elongated re-entry corridors (convex hull overestimates width by 2–5x).
    4. Buffer the polygon by `buffer_km` to account for lateral fragment
       dispersion below 80 km.
    5. Simplify to <= `max_vertices` vertices (Douglas-Peucker, tolerance 0.01°).
    6. Store the raw MC endpoint cloud as JSONB in `reentry_predictions.mc_endpoint_cloud`
       for audit and Mode C replay.

    Returns:
        Polygon in EPSG:4326 (WGS84), suitable for PostGIS GEOGRAPHY storage.
    """
```
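
Step 2 of the algorithm (central-percentile trimming by re-entry time) can be sketched as follows. This is a sketch under assumptions: the function name is illustrative, and the use of `numpy` quantiles is an assumption about the implementation:

```python
import numpy as np

def retain_central_percentile(reentry_times: list[float], percentile: float = 0.95) -> list[int]:
    """Indices of trajectories whose re-entry time falls in the central `percentile` band."""
    lo = (1.0 - percentile) / 2.0  # e.g. 0.025 for the 95% band
    t = np.asarray(reentry_times)
    t_lo, t_hi = np.quantile(t, [lo, 1.0 - lo])
    return [i for i, ti in enumerate(t) if t_lo <= ti <= t_hi]
```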

The alpha-shape library (`alphashape`) is added to `requirements.in`. The 50 km buffer accounts for the fact that fragments detach from the main object trajectory below 80 km and disperse laterally. This value is documented in the model card with a reference to ESA DRAMA lateral dispersion statistics.

**Adaptive ground-track sampling for CZML corridor fidelity (F4 — §62):**

Step 1 of the corridor algorithm above samples at 10-minute intervals. During the high-deceleration terminal phase (below ~150 km), 10 minutes corresponds to hundreds of kilometres of ground track — the polygon would miss the actual terminal geometry. Adaptive sampling is therefore required:

```python
def adaptive_ground_points(trajectory: list[StateVector]) -> list[GroundPoint]:
    """
    Return ground points at altitude-dependent intervals:
      > 300 km:   every 5 min (slow deceleration; sparse sampling adequate)
      150–300 km: every 2 min
      80–150 km:  every 30 s (rapid deceleration; must resolve terminal corridor)
      < 80 km:    every 10 s (fragment phase; maximum spatial resolution)
    """
    points = []
    for sv in trajectory:
        alt_km = sv.altitude_km
        step_s = 300 if alt_km > 300 else (
            120 if alt_km > 150 else (
                30 if alt_km > 80 else 10))
        # Only emit a point if sufficient time has elapsed since the last point.
        if not points or (sv.t - points[-1].t) >= step_s:
            points.append(to_ground_point(sv))
    return points
```

This is a breaking change to the corridor algorithm: the reference polygon in `docs/validation/reference-data/mc-corridor-reference.geojson` must be regenerated after this change is implemented. The ADR for this change must document the old vs. new polygon area difference for the reference object.

**PostGIS vs CZML corridor consistency test (F6 — §62):**

The PostGIS `ground_track_corridor` polygon (used for FIR intersection and alert generation) and the CZML polygon positions (displayed on the globe) are independently derived. A serialisation bug in the CZML builder could render the corridor in the wrong location while the database record remains correct — operators would see one corridor while alerts were generated from another.

**Required integration test** in `tests/integration/test_corridor_consistency.py`:

```python
@pytest.mark.safety_critical
def test_czml_corridor_matches_postgis_polygon(db_session):
    """
    The bounding box of the CZML polygon positions must agree with the
    PostGIS corridor polygon bounding box to within 10 km in each direction.
    """
    prediction = db_session.query(ReentryPrediction).filter(
        ReentryPrediction.ground_track_corridor.isnot(None)
    ).first()

    # Generate CZML from the prediction
    czml_doc = generate_czml_for_prediction(prediction)
    czml_polygon = extract_polygon_positions(czml_doc)  # list of (lat, lon)

    # Get the PostGIS bounding box
    postgis_bbox = db_session.execute(
        text("SELECT ST_Envelope(ground_track_corridor::geometry) "
             "FROM reentry_predictions WHERE id = :id"),
        {"id": prediction.id},
    ).scalar()
    postgis_coords = extract_bbox_corners(postgis_bbox)  # (min_lat, max_lat, min_lon, max_lon)

    czml_bbox = bounding_box_of(czml_polygon)
    assert abs(czml_bbox.min_lat - postgis_coords.min_lat) < 0.1  # ~10 km latitude tolerance
    assert abs(czml_bbox.max_lat - postgis_coords.max_lat) < 0.1
    # Antimeridian-aware longitude comparison
    assert lon_diff_deg(czml_bbox.min_lon, postgis_coords.min_lon) < 0.1
    assert lon_diff_deg(czml_bbox.max_lon, postgis_coords.max_lon) < 0.1
```
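
The `lon_diff_deg` helper used in the test is not defined in this section. A plausible antimeridian-aware implementation — an assumption, not the committed code — is:

```python
def lon_diff_deg(lon_a: float, lon_b: float) -> float:
    """Absolute longitude difference in degrees, wrapped across the antimeridian."""
    diff = abs(lon_a - lon_b) % 360.0
    return min(diff, 360.0 - diff)
```

Without the wrap, two points straddling ±180° (e.g. 179.95° and −179.95°, physically ~11 km apart at the equator) would compare as 359.9° apart and fail the assertion spuriously.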

This test is marked `safety_critical` because a discrepancy of more than 10 km between the displayed and stored corridor is a direct contributor to HZ-004.

**Unit test:** Generate a corridor from a known synthetic MC dataset (100 trajectories, straight ground track); verify the resulting polygon contains all input points; verify the polygon area is less than the convex hull area (confirming the alpha-shape is tighter); verify the polygon has ≤ 1000 vertices.

**MC test data generation strategy (Finding 10):** Generating hundreds of MC trajectories at test time is slow and non-deterministic, and committing raw trajectory arrays would be a large binary blob. Use a seeded RNG instead:

```python
# tests/physics/conftest.py
@pytest.fixture(scope="session")
def synthetic_mc_ensemble():
    """500 synthetic trajectories from a seeded RNG — deterministic, no external downloads."""
    rng = np.random.default_rng(seed=42)  # seed must never change without updating the reference polygon
    return generate_mc_ensemble(
        rng, n=500,
        object_params={  # Reference object: committed; never change without an ADR
            "mass_kg": 1000.0, "cd": 2.2, "area_m2": 1.0, "perigee_km": 185.0,
        },
    )
```
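
The 5% area-difference assertion against the committed reference (described below in this subsection) can be sketched with a plain shoelace area in planar coordinates; both helper names are illustrative, and a real implementation would use the project's geometry library:

```python
def polygon_area(ring: list[tuple[float, float]]) -> float:
    """Shoelace area of a simple polygon ring given as [(x, y), ...]."""
    s = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def area_matches_reference(generated, reference, tol: float = 0.05) -> bool:
    """True when the generated corridor's area is within `tol` (5%) of the reference."""
    a_ref = polygon_area(reference)
    return abs(polygon_area(generated) - a_ref) / a_ref <= tol
```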

Commit to `docs/validation/reference-data/`:

- `mc-corridor-reference.geojson` — pre-computed corridor polygon (run `python tools/generate_mc_reference.py` once; review and commit)
- `mc-ensemble-params.json` — RNG seed, object parameters, generation timestamp

The test asserts: (a) the generated corridor polygon matches the committed reference within a 5% area difference; (b) the corridor contains ≥ 95% of the input trajectories. If the corridor algorithm changes, the reference polygon must be explicitly regenerated and the change reviewed — the seed itself never changes.

### 15.5 Conjunction Probability (Pc) Computation Method (Finding 8)

The Pc method is implemented in `conjunction/pc_compute.py` and must be documented in the API response.

**Phase 1–2 method: Alfano/Foster 2D Gaussian**

```python
def compute_pc_alfano(
    r1: np.ndarray,    # primary position (km, GCRF)
    v1: np.ndarray,    # primary velocity (km/s)
    cov1: np.ndarray,  # 6×6 covariance (km², km²/s²)
    r2: np.ndarray,    # secondary position
    v2: np.ndarray,
    cov2: np.ndarray,
    hbr: float,        # combined hard-body radius (m)
) -> float:
    """
    Compute probability of collision using the Alfano (2005) 2D Gaussian method.

    Projects the combined covariance onto the encounter plane and integrates the
    bivariate normal distribution over the combined hard-body area.
    Standard method in the space surveillance community.

    Reference: Alfano (2005), "A Numerical Implementation of Spherical Object
    Collision Probability", Journal of the Astronautical Sciences.
    """
```

**API response field:** Every conjunction record includes `pc_method: "alfano_2d_gaussian"` so consumers can correctly interpret the result.

**Covariance source:** The TLE format carries no covariance. SpaceCom estimates covariance via TLE differencing (Vallado & Cefola method): multiple TLEs for the same object within a 24-hour window are used to estimate position uncertainty. This is documented in the API as `covariance_source: "tle_differencing"` and flagged as `covariance_quality: 'low'` when fewer than 3 TLEs are available within 24 hours.

**`pc_discrepancy_flag` implementation:** The log-scale comparison is confirmed as:

```python
pc_discrepancy_flag = abs(math.log10(pc_spacecom) - math.log10(pc_spacetrack)) > 1.0
```

This is deliberately not a linear comparison: a meaningful discrepancy is an order-of-magnitude difference in probability, so the log-scale threshold is correct.

**Validity domain (F1):** The Alfano 2D Gaussian method is valid under the following conditions. Outside them, the Pc estimate is flagged with `pc_validity: 'degraded'` in the API response:

- Short-encounter assumption: valid when the encounter duration is short compared to the orbital period (satisfied for LEO conjunction geometries)
- Linear relative motion: degrades when `miss_distance_km < 0.1` (non-linear trajectory effects become significant); flag: `pc_validity_warning: 'sub_100m_close_approach'`
- Gaussian covariance: degrades when the position uncertainty ellipsoid aspect ratio (σ_max/σ_min) > 100; flag: `pc_validity_warning: 'highly_anisotropic_covariance'`
- Minimum Pc floor: values below 1×10⁻¹⁵ are reported as `< 1e-15` and not computed precisely (numerical precision limit)

**Reference implementation test (F1):** `tests/physics/test_pc_compute.py` — BLOCKING:

```python
# Reference cases from Vallado & Alfano (2009), Table 1
VALLADO_ALFANO_CASES = [
    # (miss_dist_m, sigma_r1_m, sigma_t1_m, sigma_n1_m,
    #  sigma_r2_m, sigma_t2_m, sigma_n2_m, hbr_m, expected_pc)
    (100.0, 50.0, 200.0, 50.0, 50.0, 200.0, 50.0, 10.0, 3.45e-3),
    (500.0, 100.0, 500.0, 100.0, 100.0, 500.0, 100.0, 5.0, 2.1e-5),
]

@pytest.mark.parametrize("case", VALLADO_ALFANO_CASES)
def test_pc_against_vallado_alfano(case):
    expected_pc = case[-1]  # cases are plain tuples; the last element is the expected Pc
    pc = compute_pc_alfano(*build_conjunction_geometry(case))
    assert abs(pc - expected_pc) / expected_pc < 0.05  # within 5%
```

**Phase 3 consideration:** Monte Carlo Pc for conjunctions where `pc_spacecom > 1e-3` (high-probability cases where the Gaussian assumption may break down due to non-linear trajectory evolution). Document in `docs/adr/0015-pc-computation-method.md`.

### 15.6 Model Version Governance (F6)

All components of the prediction pipeline are versioned together as a single `model_version` string using semantic versioning (`MAJOR.MINOR.PATCH`):

| Change type | Version bump | Examples |
|-------------|-------------|---------|
| Pc methodology or propagator algorithm change | MAJOR | Switch from Alfano 2D to Monte Carlo Pc; replace DOP853 integrator |
| Atmospheric model or input processing change | MINOR | NRLMSISE-00 → JB2008; change TLE age inflation coefficient |
| Bug fix in existing model | PATCH | Fix F10.7 index lookup off-by-one; correct frame transformation |

Rules:

- Old model versions are **never deleted** — they are tagged in git (`model/v1.2.3`) and retained in `backend/app/modules/physics/versions/`
- `reentry_predictions.model_version` is set at creation and immutable thereafter
- A model version bump requires: updated unit tests, updated `docs/validation/reference-data/`, an entry in `CHANGELOG.md`, and an ADR if MAJOR

**Reproducibility endpoint (F6):**

```
POST /api/v1/decay/predict/reproduce
Body: { "prediction_id": "uuid" }
```

Re-runs the prediction using the exact model version and parameters from `simulations.params_json` recorded at the time of the original prediction. Returns a new prediction record with `reproduced_from_prediction_id` set. This endpoint is used for regulatory audit ("what model produced this output?") and post-incident review. Available to the `analyst` role and above.

### 15.7 Prediction Input Validation (F9)

A `validate_prediction_inputs()` function in `backend/app/modules/physics/validation.py` gates all decay prediction submissions. Inputs that fail validation are rejected with structured errors — never silently clamped to a valid range.

```python
def validate_prediction_inputs(params: PredictionParams) -> list[ValidationError]:
    errors = []
    tle_age_days = (utcnow() - params.tle_epoch).days
    if tle_age_days > 30:
        errors.append(ValidationError("INVALID_TLE_EPOCH",
            f"TLE epoch is {tle_age_days} days old; maximum 30 days"))
    if not (65.0 <= params.f107 <= 300.0):
        errors.append(ValidationError("F107_OUT_OF_RANGE",
            f"F10.7 = {params.f107}; valid range [65, 300]"))
    if not (0.0 <= params.ap <= 400.0):
        errors.append(ValidationError("AP_OUT_OF_RANGE",
            f"Ap = {params.ap}; valid range [0, 400]"))
    if params.perigee_km > 1200.0:
        errors.append(ValidationError("PERIGEE_TOO_HIGH",
            f"Perigee {params.perigee_km} km > 1200 km; not a re-entry candidate"))
    if params.mass_kg is not None and params.mass_kg <= 0:
        errors.append(ValidationError("INVALID_MASS",
            f"Mass {params.mass_kg} kg must be > 0"))
    return errors
```

If `errors` is non-empty, the endpoint returns `422 Unprocessable Entity` with the full error list. Unit tests (BLOCKING) cover each validation path, including boundary values.

### 15.8 Data Provenance Specification (F11)

**Phase 1 model classification:** No trained ML model components. All prediction parameters are derived from:

- Physical constants (gravitational parameter, WGS84 Earth model)
- Published atmospheric model coefficients (NRLMSISE-00)
- Published orbital mechanics algorithms (SGP4, Alfano 2005 Pc)
- Empirical constants from peer-reviewed literature (NASA Standard Breakup Model, ESA DRAMA demise altitudes, Vallado ballistic coefficient uncertainty)

This is documented explicitly in `docs/ml/data-provenance.md` as: *"SpaceCom Phase 1 uses no trained machine learning components. All model parameters are derived from physical constants and published peer-reviewed sources cited below."*

**EU AI Act Art. 10 compliance (Phase 1):** Because Phase 1 has no training data, the data governance obligations of Art. 10 apply to input data rather than training data. Input data provenance is tracked in `simulations.params_json` (TLE source, space weather source, timestamp, version).

**Future ML component protocol:** Any future learned component (e.g., a drag coefficient ML model or a debris type classifier) must be accompanied by:

- Training dataset: source, date range, preprocessing steps, known biases
- Validation split: method, size, metrics
- Performance on historical re-entry backcasts (§15.9 backcasting pipeline)
- Documentation in `docs/ml/data-provenance.md` under the component name
- `docs/ml/model-card-{component}.md` following the Google Model Card format

### 15.9 Backcasting Validation Pipeline (F8)

When a re-entry is confirmed (the object decays — `objects.status = 'decayed'`), the backcasting pipeline runs automatically:

```python
# Triggered by a Celery task on object status change to 'decayed'
@celery.task
def run_reentry_backcast(object_id: int, confirmed_reentry_time: datetime):
    """Compare all predictions made in the 72 h before re-entry to the actual outcome."""
    predictions = db.query(ReentryPrediction).filter(
        ReentryPrediction.object_id == object_id,
        ReentryPrediction.created_at >= confirmed_reentry_time - timedelta(hours=72),
    ).all()
    for pred in predictions:
        error_hours = (pred.p50_reentry_time - confirmed_reentry_time).total_seconds() / 3600
        db.add(ReentryBackcast(
            prediction_id=pred.id,
            object_id=object_id,
            confirmed_reentry_time=confirmed_reentry_time,
            p50_error_hours=error_hours,
            lead_time_hours=(confirmed_reentry_time - pred.created_at).total_seconds() / 3600,
            model_version=pred.model_version,
        ))
```

```sql
CREATE TABLE reentry_backcasts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id BIGINT NOT NULL REFERENCES reentry_predictions(id),
    object_id INTEGER NOT NULL REFERENCES objects(id),
    confirmed_reentry_time TIMESTAMPTZ NOT NULL,
    p50_error_hours DOUBLE PRECISION NOT NULL, -- signed: positive = predicted late
    lead_time_hours DOUBLE PRECISION NOT NULL,
    model_version TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON reentry_backcasts (model_version, created_at DESC);
```
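
The drift signal computed nightly from these backcast rows can be sketched as a rolling MAE per model version; both function names here are illustrative, not from the codebase:

```python
def rolling_mae(p50_error_hours: list[float], window: int = 30) -> float:
    """MAE over the most recent `window` signed backcast errors for one model version."""
    recent = p50_error_hours[-window:]
    return sum(abs(e) for e in recent) / len(recent)

def drift_alert(current_mae: float, baseline_mae: float) -> bool:
    """Raise the MEDIUM model-review alert when MAE exceeds 2x the historical baseline."""
    return current_mae > 2.0 * baseline_mae
```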

**Drift detection:** A rolling 30-prediction MAE by model version is computed nightly. If the MAE exceeds 2× the historical baseline for the current model version, a `MEDIUM` alert is raised to Persona D flagging the model for review. Surfaced in the admin analytics panel as a "Model Performance" widget.

---

## 16. Cross-Cutting Concerns

### 16.1 Subscription Tiers and Feature Flags (F2, F6)

SpaceCom gates commercial entitlements by `contracts`, which is the single authoritative commercial source of truth. `organisations.subscription_tier` is a presentation and segmentation shorthand only, and must never be used as the authority for feature access, quota limits, or shadow/production eligibility. Active contract state is materialised into derived organisation flags and quotas by a synchronisation job so runtime checks remain cheap and explicit.

| Tier | Intended customer | MC concurrent runs | Decay predictions/month | Conjunction screening | API access | Multi-ANSP coordination |
|------|------------------|-------------------|------------------------|-----------------------|------------|------------------------|
| `shadow_trial` | Evaluators / test orgs | 1 | 20 | Read-only (catalog) | No | No |
| `ansp_operational` | ANSP Phase 1 | 1 | 200 | Yes (Phase 2) | Yes | Yes |
| `space_operator` | Space operator orgs | 2 | 500 | Own objects only | Yes | No |
| `institutional` | Space agencies, research | 4 | Unlimited | Yes | Yes | Yes |
| `internal` | SpaceCom internal | Unlimited | Unlimited | Yes | Yes | Yes |

**Feature flag enforcement pattern:**

```python
def require_tier(*tiers: str):
    def dependency(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
        org = db.get(Organisation, current_user.organisation_id)
        if org.subscription_tier not in tiers:
            raise HTTPException(status_code=403, detail={
                "code": "TIER_INSUFFICIENT",
                "current_tier": org.subscription_tier,
                "required_tiers": list(tiers),
            })
        return org
    return dependency

# Applied at router level alongside require_role:
router = APIRouter(dependencies=[
    Depends(require_role("analyst", "operator", "org_admin", "admin")),
    Depends(require_tier("ansp_operational", "institutional", "internal")),
])
```

**Quota enforcement pattern (MC concurrent runs):**

```python
TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def get_mc_concurrency_limit(org: Organisation) -> int:
    return TIER_MC_CONCURRENCY.get(org.subscription_tier, 1)
```
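
A sketch of the admission check that would precede a `429 TIER_QUOTA_EXCEEDED` response. The function name is hypothetical; the tier table is duplicated here only so the sketch is self-contained:

```python
TIER_MC_CONCURRENCY = {
    "shadow_trial": 1,
    "ansp_operational": 1,
    "space_operator": 2,
    "institutional": 4,
    "internal": 999,
}

def can_start_mc_run(active_runs: int, tier: str) -> bool:
    """Gate a new MC run against the tier's concurrency limit; unknown tiers default to 1."""
    return active_runs < TIER_MC_CONCURRENCY.get(tier, 1)
```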

**Quota exhaustion is a billable signal:** Every `429 TIER_QUOTA_EXCEEDED` response writes a `usage_events` row with `event_type = 'mc_quota_exhausted'` (see §9.2 `usage_events` table). This powers the org admin's usage dashboard and the upsell trigger in the admin panel.

**Tier changes take effect immediately** — no session restart is required. The `require_tier` dependency reads from the database on each request; there is no tier caching that could allow a downgraded organisation to continue accessing premium features.

### Uncertainty and Confidence

Every prediction includes:

- `confidence_level` (0.0–1.0) — derived from MC spread
- `uncertainty_bounds` — explicit p05/p50/p95 times, corridor ellipse axes
- `model_version` — semantic version
- `monte_carlo_n` — ≥ 100 preliminary, ≥ 500 operational
- `f107_assumed`, `ap_assumed` — critical for reproducibility
- `record_hmac` — tamper-evident signature, verified before serving

**TLE covariance:** The TLE format contains no covariance. Use TLE differencing (multiple TLEs within 24 h) or the empirical Vallado & Cefola covariance. Document this clearly in API responses.

**Multi-source prediction conflict resolution (Finding 10):**

Space-Track TIP messages and SpaceCom's internal decay predictor may simultaneously produce non-overlapping re-entry windows for the same object, and ESA ESAC may publish a third window. The aviation regulatory principle of most-conservative applies — the hazard presented to ANSPs must encompass the full credible uncertainty range.

Resolution rules (applied at the `reentry_predictions` layer):

| Situation | Rule |
|---|---|
| SpaceCom p10–p90 and TIP window overlap | Display SpaceCom corridor as primary; TIP window shown as secondary reference band on Event Detail page |
| SpaceCom p10–p90 and TIP window do not overlap | Set `prediction_conflict = TRUE` on the prediction; HIGH severity data quality warning displayed; hazard corridor presented to ANSPs uses the **union** of SpaceCom p10–p90 and TIP window |
| ESA ESAC window available | Overlay as third reference band; include in `PREDICTION_CONFLICT` assessment if non-overlapping |
| All sources agree (all windows overlap) | No flag; SpaceCom corridor is primary |

Schema addition to `reentry_predictions`:

```sql
ALTER TABLE reentry_predictions
ADD COLUMN prediction_conflict BOOLEAN DEFAULT FALSE,
ADD COLUMN conflict_sources TEXT[], -- e.g. ['spacecom', 'space_track_tip']
ADD COLUMN conflict_union_p10 TIMESTAMPTZ,
ADD COLUMN conflict_union_p90 TIMESTAMPTZ;
```

The Event Detail page shows a `⚠ PREDICTION CONFLICT` banner (HIGH severity style) when `prediction_conflict = TRUE`, listing the conflicting sources and their windows. The hazard corridor polygon uses `conflict_union_p10`/`conflict_union_p90` when the flag is set. Document in `docs/model-card-decay-predictor.md` under "Conflict Resolution with Authoritative Sources."

### Auditability

- Every simulation is recorded in `simulations` with full `params_json` and result URI
- Reports are stored with a `simulation_id` reference
- `alert_events` and `security_logs` are append-only, enforced with DB-level triggers
- All API mutations are logged with user ID, timestamp, and payload hash
- TIP messages are stored verbatim for audit

### Error Handling

- Structured error responses: `{ "error": "code", "message": "...", "detail": {...} }`
- Celery failures are captured in `simulations.status = 'failed'` and surfaced in the jobs panel
- Frame transformation failures fail loudly — never silently continue with TEME
- HMAC failures return 503 and trigger a CRITICAL security event — never silently serve a tampered record
- TanStack Query error states render inline messages with retry, not page-level errors

### Performance Patterns

**SQLAlchemy async — `lazy="raise"` on all relationships:**

Async SQLAlchemy prohibits lazy-loaded relationship access outside an async context. Setting `lazy="raise"` converts silent N+1 errors into loud `InvalidRequestError` at development time rather than silent blocking DB calls in production:

```python
class ReentryPrediction(Base):
    object: Mapped["SpaceObject"] = relationship(lazy="raise")
    tip_messages: Mapped[list["TipMessage"]] = relationship(lazy="raise")
    # Forces all callers to use joinedload/selectinload explicitly
```

Required eager-loading patterns for the three highest-traffic endpoints:

- Event Detail: `selectinload(ReentryPrediction.object)`, `selectinload(ReentryPrediction.tip_messages)`
- Active alerts: `selectinload(AlertEvent.prediction)`
- CZML catalog: raw SQL with a single `JOIN` rather than the ORM (bulk fetch; ORM overhead is unacceptable at 864k rows)

**CZML caching — two-tier strategy:**

CZML data for the current 72 h window changes only when a new TLE is ingested or a propagation job completes. Cache the full serialised CZML blob:

```python
CZML_CACHE_KEY = "cache:czml:catalog:{catalog_hash}:{window_start}:{window_end}"
# TTL: 15 minutes in LIVE mode (refreshed after a new TLE ingest event)
# TTL: permanent in REPLAY mode (historical data never changes)
```

Per-object CZML fragments are cached separately under `cache:czml:obj:{norad_id}:{...}`. When a TLE is re-ingested for one object, invalidate only that object's fragment and recompute the full catalog CZML from the cached fragments.

**CZML cache invalidation triggers (F5 — §58):**

| Event | Invalidation scope | Mechanism |
|-------|--------------------|-----------|
| New TLE ingested for object X | `cache:czml:obj:{norad_id_x}:*` only | Ingest task calls `redis.delete(pattern)` after TLE commit |
| Propagation job completes for object X | `cache:czml:obj:{norad_id_x}:*` + full catalog key | Propagation Celery task issues invalidation on success |
| New prediction created for object X | `cache:czml:obj:{norad_id_x}:*` | Prediction task issues invalidation on completion |
| Manual cache flush (admin API) | `cache:czml:*` | `DELETE /api/v1/admin/cache/czml` — requires `admin` role |
| Cold start / DR failover | Warm-up Celery task | `warm_czml_cache` Beat task runs at startup (see below) |

**Stale-while-revalidate strategy:** The CZML cache key includes a `stale_ok` variant. When the primary key has expired but the stale key (`cache:czml:catalog:stale:{hash}`) exists, serve the stale response immediately and enqueue a background recompute. Maximum stale age: 5 minutes. This prevents a cache stampede during TLE batch ingest (up to 600 simultaneous invalidations).

**Cache warm-up on cold start (F5 — §58):**

```python
@app.task
def warm_czml_cache():
    """Run at container startup and after DR failover. Estimated: 30–60 s for 600 objects."""
    objects = db.query(Object).filter(Object.active == True).all()
    for obj in objects:
        generate_czml_fragment.delay(obj.norad_id)
    # Full catalog key is assembled by the CZML endpoint once all fragments are present
```
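
The stale-while-revalidate serving path described above can be sketched as follows. This is a sketch under assumptions: the cache is modelled as a plain dict (the real code uses Redis), the function name is illustrative, and the recompute would be enqueued via Celery rather than called inline:

```python
def get_catalog_czml(cache: dict, primary_key: str, stale_key: str, recompute):
    """
    Stale-while-revalidate sketch: serve the fresh blob if present; otherwise
    serve the stale copy immediately and trigger a recompute; on a cold miss
    (no stale copy either), compute synchronously.
    """
    blob = cache.get(primary_key)
    if blob is not None:
        return blob
    stale = cache.get(stale_key)
    if stale is not None:
        recompute()       # in production: enqueue in the background, do not block
        return stale      # served copy is at most 5 minutes stale
    return recompute()    # cold miss: no stale copy to fall back on
```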

Cold-start warm-up time (600 objects, 16 simulation workers): estimated 30–60 seconds. Included in the DR RTO calculation (§26.3) as the "cache warm-up: ~1 min" line item.

**Redis key namespaces and eviction policy:**

| Namespace | Contents | Eviction policy | Notes |
|-----------|----------|-----------------|-------|
| `celery:*` | Celery broker queues | `noeviction` — must never be evicted | Use a separate Redis instance or DB 0 with `noeviction` |
| `redbeat:*` | celery-redbeat schedules | `noeviction` | Loss causes scheduled jobs to silently disappear |
| `cache:*` | Application cache (CZML, space weather, HMAC results) | `allkeys-lru` | Cache misses are acceptable; broker loss is not |
| `ws:session:*` | WebSocket session state | `volatile-lru` (with TTL set) | Expires on session end |

Run the Celery broker and application cache as separate Redis database indexes (`SELECT 0` vs `SELECT 1`) so their eviction policies can differ. The Sentinel configuration monitors both.

Cache TTLs:

- `cache:czml:catalog` → 15 minutes
- `cache:spaceweather:current` → 5 minutes
- `cache:prediction:{id}:fir_intersection` → until superseded (keyed to prediction ID)
- `cache:prediction:{id}:hmac_verified` → 60 minutes

**Bulk export — Celery offload for Persona F:**

`GET /space/export/bulk` must not materialise the full result set in the backend container — for the full catalog this risks OOM. Implement it as a Celery task that writes to MinIO and returns a pre-signed download URL, consistent with the existing report generation pattern:

```python
@app.post("/space/export/bulk")
async def trigger_bulk_export(params: BulkExportParams, ...):
    task = generate_bulk_export.delay(params.dict(), user_id=current_user.id)
    return {"task_id": task.id, "status": "queued"}

@app.get("/space/export/bulk/{task_id}")
async def get_bulk_export(task_id: str, ...):
    # Returns {"status": "complete", "download_url": presigned_url} when done
```

If a streaming response is preferred over the task-based approach, use SQLAlchemy `yield_per=1000` cursor streaming — never materialise the full result set.

**Analytics query routing to read replica:**

Persona B and F analytics queries (simulation comparison, historical validation, bulk export) are I/O intensive and must not compete with operational read paths on the primary TimescaleDB instance during active TIP events. Route them to the Patroni standby:

```python
def get_db(write: bool = False, analytics: bool = False) -> AsyncSession:
    if write:
        return AsyncSession(primary_engine)
    if analytics:
        return AsyncSession(replica_engine)  # Patroni standby
    return AsyncSession(primary_engine)      # operational reads: primary (avoids replica lag)
```
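
The lag-aware fallback for analytics routing can be sketched as follows; the function name and engine arguments are illustrative:

```python
def pick_analytics_engine(replica_lag_s: float, primary, replica, max_lag_s: float = 30.0):
    """Route analytics to the replica unless replication lag exceeds the 30 s threshold."""
    if replica_lag_s > max_lag_s:
        # In production: also log a warning so the redirect is visible to operators.
        return primary
    return replica
```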
|
||
Monitor replication lag: if replica lag > 30s, log a warning and redirect analytics queries to primary.
|
||
|
||
**Query plan baseline:**
|
||
Add to Phase 1 setup: run `EXPLAIN (ANALYZE, BUFFERS)` on the primary CZML query with 100 objects and record the output in `docs/query-baselines/`. Re-run at Phase 3 load test and compare — if planning time or execution time has increased > 2×, investigate index bloat or chunk count growth before the load test proceeds.

---

## 17. Validation Strategy

### 17.0 Test Standards and Strategy (F1–F3, F5, F7, F8, F10, F11)

#### Test Taxonomy (F2)

Three levels — every developer must know which level a new test belongs to before writing it:

| Level | Definition | I/O boundary | Tool | Location |
|-------|-----------|-------------|------|----------|
| **Unit** | Single function or class; all dependencies mocked or stubbed | No I/O | pytest | `tests/unit/` |
| **Integration** | Multiple components; real PostgreSQL + Redis; no external network | Real DB, no internet | pytest + testcontainers | `tests/integration/` |
| **E2E** | Full stack including browser; Celery worker running; real DB | Full stack | Playwright | `e2e/` |

Rules:

- Physics algorithm tests (SGP4, MC, Pc) are **unit** tests — pure functions, no DB
- HMAC signing, RLS isolation, and rate-limit tests are **integration** tests — require a real DB transaction
- Alert delivery, WebSocket flow, and NOTAM draft UI are **E2E** tests
- A test that mocks the database is a unit test regardless of what it is testing — name it accordingly

#### Coverage Standard (F1)

| Scope | Tool | Minimum threshold | CI gate |
|-------|------|------------------|---------|
| Backend line coverage | `pytest-cov` | 80% | Fail below threshold |
| Backend branch coverage | `pytest-cov --branch` | 70% | Fail below threshold |
| Frontend line coverage | Jest `--coverage` | 75% | Fail below threshold |
| Safety-critical paths | `pytest -m safety_critical` | 100% (all pass, none skipped) | Always blocking |

```ini
# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=app --cov-branch --cov-fail-under=80 --cov-report=term-missing"

[tool.coverage.run]
omit = ["*/migrations/*", "*/tests/*", "*/__pycache__/*"]
```

Coverage is measured on the **integration test run** (not unit-only) so that database-layer code paths are included. Coverage reports are uploaded to CI artefacts on every run; a coverage trend chart is required in the Phase 2 ESA submission.

#### Test Data Management (F3)

**Fixtures, not factories, for shared reference data:** Physics reference cases (TLE sets, re-entry events, conjunction scenarios) are committed JSON files in `docs/validation/reference-data/`. Tests load them as pytest fixtures — never fetch from the internet at test time.

**Isolated fixtures for integration tests:** Each integration test that writes to the database runs inside a transaction that is rolled back at teardown. No shared mutable state between tests:

```python
@pytest.fixture
def db_session(engine):
    conn = engine.connect()
    txn = conn.begin()
    try:
        yield conn
    finally:
        txn.rollback()  # all writes from this test disappear
        conn.close()
```

**Time-dependent tests:** Any test that checks TLE age, token expiry, or billing period uses `freezegun` to freeze time to a known epoch. Tests must never rely on `datetime.utcnow()` producing a particular value:

```python
from freezegun import freeze_time

@freeze_time("2026-01-15T12:00:00Z")
def test_tle_age_degraded_warning():
    # TLE epoch is 2026-01-08 → age = 7 days → expects 'degraded'
    ...
```

**Sensitive test data:** Real NORAD IDs, real Space-Track credentials, and real ANSP organisation names must never appear in committed test fixtures. Use fictional NORAD IDs (90001–90099 are reserved for test objects by convention) and generated organisation names (`test-org-{uuid4()[:8]}`).
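
One way to implement the convention as shared fixtures (the helper names are ours; `uuid4().hex` is used so the generated suffix is purely hexadecimal):

```python
import random
import uuid

# Reserved fictional NORAD block from the convention above.
TEST_NORAD_RANGE = range(90001, 90100)

def fictional_norad_id(rng: random.Random) -> int:
    """Pick a NORAD ID from the reserved 90001-90099 test block, never a real object."""
    return rng.choice(TEST_NORAD_RANGE)

def test_org_name() -> str:
    """Generated organisation name; never a real ANSP."""
    return f"test-org-{uuid.uuid4().hex[:8]}"
```

Centralising these in `conftest.py` makes it easy to grep committed fixtures for values outside the reserved range.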

#### Safety-Critical Test Markers (F8)

All tests that verify safety-critical behaviour carry `@pytest.mark.safety_critical`. These run on every commit (not just pre-merge) and must all pass before any deployment:

```python
# conftest.py
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "safety_critical: test verifies a safety-critical invariant; always runs; zero tolerance for failure or skip"
    )
```

```python
# Usage
@pytest.mark.safety_critical
def test_cross_tenant_isolation():
    ...

@pytest.mark.safety_critical
def test_hmac_integrity_failure_quarantines_record():
    ...

@pytest.mark.safety_critical
def test_sub_150km_low_confidence_flag():
    ...
```

The full list of `safety_critical`-marked tests is maintained in `docs/TEST_PLAN.md` (see F11). CI runs `pytest -m safety_critical` as a separate fast job (target: < 2 minutes) before the full suite.
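
The "none skipped" half of the rule can be enforced in `conftest.py` as well; the hook below is a sketch (the helper name is ours, and `TestReport.keywords` containing marker names is standard pytest behaviour):

```python
def is_forbidden_skip(skipped: bool, keywords) -> bool:
    """A skipped safety_critical test violates the zero-skip rule."""
    return skipped and "safety_critical" in keywords

def pytest_runtest_logreport(report):
    # Skips surface in the 'setup' phase; fail the run loudly rather than
    # letting a safety-critical test silently drop out of the evidence set.
    if report.when == "setup" and is_forbidden_skip(report.skipped, report.keywords):
        raise RuntimeError(f"{report.nodeid}: safety_critical tests must never be skipped")
```

This turns an accidental `@pytest.mark.skip` on a safety-critical test into a hard CI failure instead of a quietly shrinking traceability matrix.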

#### Physics Test Determinism (F10)

Monte Carlo tests are non-deterministic by default. All MC-based tests seed the random number generator explicitly:

```python
import numpy as np
import pytest

@pytest.fixture(autouse=True)
def seed_rng():
    """Seed numpy RNG for all physics tests. Produces identical output across runs."""
    np.random.seed(42)
    yield
    # no teardown needed — each test gets a fresh seed via autouse

@pytest.mark.safety_critical
def test_mc_convergence_criterion():
    result = run_mc_decay(tle=TEST_TLE, n=500, seed=42)
    assert result.corridor_area_change_pct < 2.0
```

The seed value `42` is fixed in `tests/conftest.py` and must not be changed without updating the baseline expected values. A PR that changes the seed without updating expected values fails the review checklist.

#### Mutation Testing (F5)

`mutmut` is run weekly (not on every commit — too slow) against the `backend/app/modules/physics/` and `backend/app/modules/alerts/` directories. These are the highest-consequence paths.

```bash
mutmut run --paths-to-mutate=backend/app/modules/physics/,backend/app/modules/alerts/
mutmut results
```

**Threshold:** Mutation score ≥ 70% for physics and alerts modules. Results published to CI artefacts. A score drop of > 5 percentage points between weekly runs creates a `mutation-regression` GitHub issue automatically.

#### Test Environment Parity (F7)

The CI test environment must use identical Docker images to production. Enforced by:

- `docker-compose.ci.yml` extends `docker-compose.yml` — same image tags, no overrides to DB version or Redis version
- TimescaleDB version in CI is pinned to the same tag as production (`timescale/timescaledb-ha:pg16-latest` is not acceptable — must be `timescale/timescaledb-ha:pg16.3-ts2.14.2`)
- `make test` in CI fails if `TIMESCALEDB_VERSION` env var does not match the value in `docker-compose.yml`
- MinIO is used in CI, not mocked — `make test` brings up the full service stack including MinIO before running integration tests

#### ESA Test Plan Document (F11)

`docs/TEST_PLAN.md` is a required Phase 2 deliverable. Structure:

```markdown
# SpaceCom Test Plan

## 1. Test levels and tools
## 2. Coverage targets and current status
## 3. Safety-critical test traceability matrix

| Requirement | Test ID | Test name | Result |
|-------------|---------|-----------|--------|
| Sub-150km propagation guard | SC-TEST-001 | test_sub_150km_low_confidence_flag | PASS |
| Cross-tenant data isolation | SC-TEST-002 | test_cross_tenant_isolation | PASS |
...

## 4. Known test limitations
## 5. Test environment specification
## 6. Performance test results (latest k6 run)
```

The traceability matrix links each safety-critical requirement (drawn from §15, §7.2, §26) to its `@pytest.mark.safety_critical` test. This is the primary evidence document for ESA software assurance review.

---

**Important:** Comparing SGP4 against Space-Track TLEs is circular. All validation uses independent reference datasets.

**Reference data location:** `docs/validation/reference-data/` — committed to the repository and loaded automatically by the test suite. No external downloads required at test time.

**How to run all validation suites:**

```bash
make test                             # runs pytest including all validation suites
pytest tests/test_frame_utils.py -v   # frame transforms only
pytest tests/test_decay/ -v           # decay predictor + backcast comparison
pytest tests/test_propagator/ -v      # SGP4 propagator
```

**How to add a new validation case:** Add the reference data to the appropriate JSON file in `docs/validation/reference-data/`, add a test case in the relevant test module, and document the source in the file's header comment.

---

### 17.1 Frame Transformation Validation

| Test | Reference | Pass criterion | Run command |
|------|-----------|---------------|-------------|
| TEME→GCRF transform | Vallado (2013), Table 3-5 | Position error < 1 m; velocity error < 0.001 m/s | `pytest tests/test_frame_utils.py::test_teme_gcrf_vallado` |
| GCRF→ITRF transform | Vallado (2013), Table 3-4 | Position error < 1 m | `pytest tests/test_frame_utils.py::test_gcrf_itrf_vallado` |
| ITRF→WGS84 geodetic | IAU SOFA test vectors | Lat/lon error < 1 μrad; altitude error < 1 mm | `pytest tests/test_frame_utils.py::test_itrf_geodetic` |
| Round-trip WGS84→ITRF→GCRF→ITRF→WGS84 | Self-consistency | Round-trip error < 1e-12 (near floating-point machine precision) | `pytest tests/test_frame_utils.py::test_roundtrip` |
| IERS EOP application | IERS Bulletin A reference values | UT1-UTC error < 1 μs; pole offset error < 0.1 mas | `pytest tests/test_frame_utils.py::test_iers_eop` |

**Committed test vectors (Finding 6):** The following reference data files must be committed to the repository before any frame transformation or propagation code is merged. Tests are parameterised fixtures that load from these files; they fail (not skip) if a file is absent:

| File | Content | Source |
|---|---|---|
| `docs/validation/reference-data/frame_transform_gcrf_to_itrf.json` | ≥ 3 cases from Vallado (2013) §3.7: input UTC epoch + GCRF position → expected ITRF position, accurate to < 1 m | Vallado (2013) *Fundamentals of Astrodynamics* Table 3-4 |
| `docs/validation/reference-data/sgp4_propagation_cases.json` | ISS (NORAD 25544) and one historical re-entry object: state vector at epoch and after 1h and 24h propagation | STK or GMAT reference propagation |
| `docs/validation/reference-data/iers_eop_case.json` | One epoch with published IERS Bulletin B UT1-UTC and polar motion values; expected GCRF→ITRF transform result | IERS Bulletin B (iers.org) |

```python
# tests/physics/test_frame_transforms.py
import json
from pathlib import Path

import numpy as np
import pytest

CASES_FILE = Path("docs/validation/reference-data/frame_transform_gcrf_to_itrf.json")

def test_reference_data_exists():
    """Fail hard if committed test vectors are missing — do not skip."""
    assert CASES_FILE.exists(), f"Required reference data missing: {CASES_FILE}"

@pytest.mark.parametrize("case", json.loads(CASES_FILE.read_text()))
def test_gcrf_to_itrf(case):
    result = gcrf_to_itrf(case["gcrf_km"], parse_utc(case["epoch_utc"]))
    assert np.linalg.norm(result - case["expected_itrf_km"]) < 0.001  # 1 m tolerance (km units)
```

Reference data files: `docs/validation/reference-data/vallado-sgp4-cases.json` and `docs/validation/reference-data/iers-frame-test-cases.json`.

**Operational significance of failure:** A frame transform error propagates directly into corridor polygon coordinates. A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km. A failing frame test is a blocking CI failure.

---

### 17.2 SGP4 Propagator Validation

| Test | Reference | Pass criterion |
|------|-----------|---------------|
| State vector at epoch | Vallado (2013) test set, 10 objects spanning LEO/MEO/GEO/HEO | Position error < 1 km at epoch; < 10 km after 7-day propagation |
| Epoch parsing | NORAD 2-line epoch format → UTC | Round-trip to 1 ms precision |
| TLE line 1/2 checksum | Modulo-10 algorithm | Pass/fail; corrupted checksum rejected before propagation |
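
The modulo-10 checksum in the table is small enough to show in full: digits contribute their value, minus signs count as 1, and every other character (letters, spaces, periods, plus signs) counts as 0. A sketch, exercised against the widely published ISS TLE line from the standard SGP4 test set:

```python
def tle_checksum(line_body: str) -> int:
    """Modulo-10 TLE checksum over the first 68 characters of a line."""
    total = 0
    for ch in line_body:
        if ch.isdigit():
            total += int(ch)   # digits count as their value
        elif ch == "-":
            total += 1         # minus signs count as 1; everything else counts as 0
    return total % 10

def tle_line_valid(line: str) -> bool:
    """Reject a TLE line before propagation unless its 69th character matches."""
    return (len(line) == 69
            and line[68].isdigit()
            and tle_checksum(line[:68]) == int(line[68]))

# ISS (NORAD 25544) line 1 from the canonical SGP4 test set; checksum digit is 7.
ISS_L1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"
```

Any single corrupted digit shifts the sum and fails validation, which is exactly the "rejected before propagation" behaviour the table requires.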

**Operational significance of failure:** SGP4 position error at epoch > 1 km produces a corridor centred in the wrong place. Blocking CI failure.

---

### 17.3 Decay Predictor Validation

| Test | Reference | Pass criterion |
|------|-----------|---------------|
| NRLMSISE-00 density output | Picone et al. (2002) Table 1 reference atmosphere | Density within 1% of reference at 5 altitude/solar activity combinations |
| Historical backcast: p50 error | The Aerospace Corporation observed re-entry database (≥3 events Phase 1; ≥10 events Phase 2) | Median p50 error < 4h for rocket bodies with known physical properties |
| Historical backcast: corridor containment | Same database | p95 corridor contains the observed impact in ≥90% of validation events |
| Historical replay: airspace disruption | Long March 5B Spanish airspace closure reconstruction with replay inputs and operator review | Affected FIR/time-window outputs judged operationally plausible and traceable in the replay report |
| Air-risk ranking consistency | Documented crossing-scenario corpus (≥10 unique spacecraft/aircraft crossing cases by Phase 2) | Highest-ranked exposure slices remain stable under seed and traffic-density perturbations, or the differences are explained in the validation note |
| Conservative-baseline comparison | Same replay corpus vs. full-FIR or fixed-radius precautionary closure baseline | Refined outputs reduce affected area or duration in a majority of replay cases without undercutting the agreed p95 protective envelope |
| Cross-tool comparison | GMAT (NASA open source) — 3 defined test cases | Re-entry time agreement within 1h for objects with identical inputs |
| Monte Carlo statistical consistency | Self-consistency: 500-sample run vs. 1000-sample run on same inputs | p05/p50/p95 agree within 2% (tolerance tightens as sample count increases) |

Reference data files: `docs/validation/reference-data/aerospace-corp-reentries.json` for decay-only validation and `docs/validation/reference-data/reentry-airspace/` for airspace-risk replay cases (Long March 5B, Columbia-derived cloud case, and documented crossing scenarios). GMAT comparison is a manual procedure documented in `docs/validation/README.md` (GMAT is not run in CI — too slow; comparison run once per major model version).
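
The Monte Carlo self-consistency criterion above reduces to a percentile comparison between two runs; a sketch of the acceptance helper (the function names are ours):

```python
import numpy as np

PERCENTILES = (5, 50, 95)

def percentile_disagreement(run_a: np.ndarray, run_b: np.ndarray) -> float:
    """Worst relative disagreement between the p05/p50/p95 of two MC runs."""
    pa = np.percentile(run_a, PERCENTILES)
    pb = np.percentile(run_b, PERCENTILES)
    return float(np.max(np.abs(pa - pb) / np.abs(pa)))

def consistent(run_a, run_b, tol: float = 0.02) -> bool:
    """Pass criterion from the table: all three percentiles agree within 2%."""
    return percentile_disagreement(np.asarray(run_a, dtype=float),
                                   np.asarray(run_b, dtype=float)) <= tol
```

Here `run_a` and `run_b` would be the re-entry epoch samples from the 500- and 1000-sample runs; the same helper serves any other scalar MC output.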

**Operational significance of failure:** Decay predictor p50 error > 4h means corridors are offset in time; operators could see a hazard window that doesn't match the actual re-entry. Major model version gate.

---

### 17.4 Breakup Model Validation

| Test | Reference | Pass criterion |
|------|-----------|---------------|
| Fragment count distribution | ESA DRAMA published results for similar-mass objects | Fragment count within 30% of DRAMA reference for a 500 kg object at 70 km |
| Energy conservation at breakup altitude | Internal check | Total kinetic + potential energy conserved within 1% through fragmentation step |
| Casualty area geometry | Hand-calculated reference case | Casualty area polygon area within 10% of analytic calculation |

**Operational significance of failure:** Breakup model failure does not block Phase 1. It is an advisory failure in Phase 2. Blocking before Phase 3 regulatory submission.

---

### 17.5 Security Validation

| Test | Reference | Pass criterion | Blocking? |
|------|-----------|---------------|-----------|
| RBAC enforcement | `test_rbac.py` — every endpoint, every role | 403 for insufficient role; 401 for unauthenticated; 0 mismatches | Yes |
| HMAC tamper detection | `test_integrity.py` — direct DB row modification | API returns 503 + CRITICAL `security_logs` entry | Yes |
| Rate limiting | `test_auth.py` — per-endpoint threshold | 429 after threshold; 200 after reset window | Yes |
| CSP headers | Playwright E2E | `Content-Security-Policy` header present on all pages | Yes |
| Container non-root | CI `docker inspect` check | No container running as root UID | Yes |
| Trivy CVE scan | Trivy against all built images | 0 Critical/High CVEs | Yes |

---

### 17.6 Verification Independence (F6 — §61)

EUROCAE ED-153 / DO-278A §6.4 requires that SAL-2 software components undergo independent verification — meaning the person who verifies (reviews/tests) a SAL-2 requirement, design, or code artefact must not be the same person who produced it.

**Policy:** `docs/safety/VERIFICATION_INDEPENDENCE.md`

**Scope:** All SAL-2 components identified in §24.13:

- `physics/` (decay prediction engine)
- `alerts/` (alert generation pipeline)
- HMAC integrity verification functions
- CZML corridor generation and frame transform

**Implementation in GitHub:**

```yaml
# .github/CODEOWNERS
# SAL-2 components require an independent reviewer (not the PR author)
/backend/app/physics/   @safety-reviewer
/backend/app/alerts/    @safety-reviewer
/backend/app/integrity/ @safety-reviewer
/backend/app/czml/      @safety-reviewer
```

The `@safety-reviewer` team must have ≥1 member who is not the PR author. GitHub branch protection for `main` must include:

- `require_code_owner_reviews: true` for the above paths
- `dismiss_stale_reviews: true` (new commits require re-review)
- SAL-2 PRs require ≥2 approvals (one of which must be from `@safety-reviewer`)

**Verification traceability:** The PR review record (GitHub PR number + reviewer + approval timestamp) serves as evidence for verification independence in the safety case (§24.12 E1.1). This record is referenced in the MoC document (§24.14 MOC-002).

**Who qualifies as an independent reviewer for SAL-2:** Any engineer who:

1. Did not write the code being reviewed
2. Has sufficient domain knowledge to evaluate correctness (orbital mechanics familiarity for `physics/`; alerting logic familiarity for `alerts/`)
3. Is designated in the `@safety-reviewer` GitHub team

Before ANSP shadow activation, the safety case custodian confirms that all SAL-2 components committed in the release have a documented independent reviewer.

---

## 18. Additional Physics Considerations

| Topic | Why It Matters | Phase |
|-------|---------------|-------|
| **Solar radiation pressure (SRP)** | Dominates drag above ~800 km for high A/m objects | Phase 1 (decay predictor) |
| **J2–J6 geopotential** | J2 alone: ~7°/day RAAN error | Phase 1 (decay predictor) |
| **Attitude and tumbling** | Drag coefficient 2–3× different; capture via B* Monte Carlo | Phase 2 |
| **Lift during re-entry** | Non-spherical fragments: tens of km cross-track shift | Phase 2 (breakup) |
| **Maneuver detection** | Active satellites maneuver; TLE-to-TLE ΔV estimation | Phase 2 |
| **Ionospheric drag** | Captured via NRLMSISE-00 ion density profile | Phase 1 (via model) |
| **Re-entry heating uncertainty** | Emissivity/melt temperatures poorly known for debris | Phase 2 |

---

## 19. Development Phases — Detailed

### Phase 1: Analytical Prototype (Weeks 1–10)

**Goal:** Real object tracking, decay prediction with uncertainty quantification, functional Persona A/B interface. Security infrastructure fully in place before any other feature ships.

| Week | Backend Deliverable | Frontend Deliverable | Security / SRE Deliverable |
|------|--------------------|--------------------|--------------------------|
| 1–2 | FastAPI scaffolding, Alembic migrations, Docker Compose with Tier 2 service topology. `frame_utils.py`, `time_utils.py`. IERS EOP refresh + SHA-256 verify. Append-only DB triggers. HMAC signing infrastructure. Liveness + readiness probes on all services. `GET /healthz`, `GET /readyz` with DB + Redis checks. Dead letter queue for Celery. `task_acks_late`, `task_reject_on_worker_lost` configured. Celery queue routing (`ingest` vs `simulation`). `celery-redbeat` configured. **Legal/compliance**: `users` table `tos_accepted_at/tos_version/tos_accepted_ip/data_source_acknowledgement` fields. First-login ToS/AUP/Privacy Notice acceptance flow (blocks access until all accepted). SBOM generated via `syft`; CesiumJS commercial licence verified. Privacy Notice drafted and published. | Next.js scaffolding. Root layout: nav, ModeIndicator, AlertBadge, JobsPanel stub. Dark mode + high-contrast theme. CSP and security headers via Next.js middleware. ToS/AUP acceptance gate on first login (blocks dashboard until accepted). | **RBAC schema + `require_role()`. JWT RS256 + httpOnly cookies. MFA (TOTP). Redis AUTH + ACLs. MinIO private buckets. Docker network segmentation. Container hardening. `git-secrets`. Bandit + ESLint security in CI. Trivy. Dependency pinning. Dependabot. `security_logs` + sanitising formatter. Docker Compose `depends_on: condition: service_healthy` wired.** **Documentation**: `docs/` directory tree created; `AGENTS.md` committed; initial ADRs for JWT, dual frontend, Monte Carlo chord, frame library; `docs/runbooks/TEMPLATE.md` + index; `CHANGELOG.md` first entry; `docs/validation/reference-data/` with Vallado and IERS cases; `docs/alert-threshold-history.md` initial entry. **DevOps/Platform**: self-hosted GitLab CI pipeline (lint, test-backend, test-frontend, security-scan, build-and-push jobs); multi-stage Dockerfiles for all services; `.pre-commit-config.yaml` with all six hooks; `.env.example` committed with all variables documented; `Makefile` with `dev`, `test`, `migrate`, `seed`, `lint`, `clean` targets; Docker layer + pip + npm build cache configured; `sha-<commit>` image tagging in the GitLab container registry in place. Prometheus metrics: `spacecom_active_tip_events`, `spacecom_tle_age_hours`, `spacecom_hmac_verification_failures_total` instrumented. |
| 3–4 | Catalog module: object CRUD, TLE import. TLE cross-validation. ESA DISCOS import. Ingest Celery Beat (celery-redbeat). Hardcoded URLs, SSRF-mitigated HTTP client. WAL archiving configured. Daily backup Celery task. TimescaleDB compression policy on `orbits`. Retention policy scaffolded. | Object Catalog page. DataConfidenceBadge. Object Watch page stub. | Rate limiting (`slowapi`). Simulation parameter range validation. Prometheus: `spacecom_ingest_success_total`, `spacecom_ingest_failure_total` per source. AlertManager rule: consecutive ingest failures → warning. |
| 5–6 | Space Weather: NOAA SWPC + ESA SWS cross-validation. `operational_status` string. TIP message ingestion. Prometheus: `spacecom_prediction_age_seconds` per NORAD ID. Readiness probe: TLE staleness + space weather age checks. | SpaceWeatherWidget. Alert taxonomy: CRITICAL banner, NotificationCentre, AcknowledgeDialog. Degraded mode banner (reads `readyz` 207 response). | `alert_events` append-only verified. Alert rate-limit and deduplication. Alert storm detection. AlertManager rule: `spacecom_active_tip_events > 0 AND prediction_age > 3600` → critical. |
| 7–8 | Catalog Propagator (SGP4): TEME→GCRF, CZML (J2000). Ephemeris caching. Frame transform validation. All CZML strings HTML-escaped. MC chord architecture: `run_mc_decay_prediction` → `group(run_single_trajectory)` → `aggregate_mc_results`. Chord result backend (Redis) sized. | Globe: real object positions, LayerPanel, clustering, urgency symbols. TimelineStrip. Live mode scrub. | WebSocket auth: cookie-based; connection limit. WS ping/pong. Prometheus: `spacecom_simulation_duration_seconds` histogram. |
| 9–10 | Decay Predictor: RK7(8) + NRLMSISE-00 + Monte Carlo chord. HMAC-signed output. Immutability triggers. Corridor polygon generation. Re-entry API. Validate against ≥3 historical re-entries. Monthly restore test Celery task implemented. | Mode A (Percentile Corridors). Event Detail: PredictionPanel with p05/p50/p95, HMAC status badge. TimelineGantt. Operational Overview. UncertaintyModeSelector (B/C greyed). | HMAC tamper detection E2E test. All-clear TIP cross-check guard. First backup restore test executed and passing. `spacecom_simulation_duration_seconds` p95 verified < 240s on Tier 2 hardware. |

### Phase 2: Operational Analysis (Weeks 11–22)

| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory |
|------|--------------------|--------------------|----------------------|
| 11–12 | Atmospheric Breakup: aerothermal, fragments, ballistic descent, casualty area. | Fragment impact points on globe. Fragment detail panel. | OWASP ZAP DAST against staging. |
| 13–14 | Conjunction: all-vs-all screening, Alfano probability. | Conjunction events on globe. ConjunctionPanel. | STRIDE threat model reviewed for Phase 2 surface. |
| 15–16 | Upper/Lower Atmosphere. Hazard module: fused zones, HMAC-signed, immutable, `shadow_mode` flag. | Mode B (Probability Heatmap): Deck.gl. UncertaintyModeSelector unlocks Mode B. | RLS multi-tenancy integration tests. Shadow records excluded from operational API (integration test). |
| 17–18 | Airspace: FIR/UIR load, PostGIS intersection. Airspace impact table. NOTAM Drafting: ICAO format, `notam_drafts` table, mandatory disclaimer. Shadow mode admin toggle. | AirspaceImpactPanel. NOTAM draft flow: NotamDraftViewer, disclaimer banner, review/cancel. 2D Plan View. ViewToggle. `/airspace` page. ShadowBanner + ShadowModeIndicator. | Regulatory disclaimer verified present on all NOTAM drafts. axe-core accessibility audit. |
| 19–20 | Report builder: bleach sanitisation, Playwright renderer (isolated, no-network, timeouts, seccomp). MinIO storage. Shadow validation schema + `shadow_validations` table. | ReportConfigDialog, ReportPreview, `/reports` page. IntegrityStatusBadge. SimulationComparison. ShadowValidationReport scaffold. | Renderer: `network_mode: none` enforced; sanitisation tests passing; 30s timeout verified. |
| 21–22 | Space Operator Portal: `owned_objects`, controlled re-entry planner (deorbit window optimiser), CCSDS export, `api_keys` table + lifecycle. `modules.api` with per-key rate limiting. **Legal gate**: legal opinion commissioned and received for primary deployment jurisdiction; `legal_opinions` table populated; shadow mode admin toggle wired to `shadow_mode_cleared` flag. Space-Track AUP redistribution clarification obtained (written confirmation from 18th Space Control Squadron or counsel opinion on permissible use). ECCN classification review commissioned for Controlled Re-entry Planner. GDPR compliance review: data inventory completed, lawful bases documented, DPA template drafted, erasure procedure (`handle_erasure_request`) implemented. | `/space` portal: SpaceOverview, ControlledReentryPlanner, DeorbitWindowList, ApiKeyManager, CcsdsExportPanel. Shadow mode admin toggle displays legal clearance status. | Object ownership RLS policy tested: `space_operator` cannot access non-owned objects. API key rate limiting verified. API Terms accepted at key creation and recorded. Jurisdiction screening at registration (OFAC/EU/UK sanctions list check). |

### Phase 3: Operational Deployment (Weeks 23–32)

| Week | Backend Deliverable | Frontend Deliverable | Security / Regulatory / SRE |
|------|--------------------|--------------------|----------------------------|
| 23–24 | Alerts module: thresholds, email delivery, geographic filtering, `alert_events`. Shadow mode: alerts suppressed. ADS-B feed integration: **OpenSky Network REST API** (`https://opensky-network.org/api/states/all`); polled every 60s via Celery Beat; flight state vectors stored in `adsb_states` (non-hypertable; rolling 24h window); route intersection advisory module reads `adsb_states` to identify flights in re-entry corridors. Air Risk module initialisation: aircraft exposure scoring, time-slice aggregation, and vulnerability banding by aircraft class. **Tier 3 HA infrastructure**: TimescaleDB streaming replication + Patroni + etcd. Redis Sentinel (3 nodes). 4× simulation workers (64 total cores). Blue-green deployment pipeline wired. | Full alert lifecycle UI: geographic filtering, mute rules, acknowledgement audit. Route overlay on globe. AirRiskPanel by FIR/time slice. Route intersection advisory (avoidance boundary only). | **Legal/regulatory**: MSA template finalised by counsel; Regulatory Sandbox Agreement template finalised. First ANSP shadow deployment executed under signed Regulatory Sandbox Agreement and confirmed legal clearance. GDPR breach notification procedure tested (tabletop exercise). Professional indemnity, cyber liability, and product liability insurance confirmed in place. **SRE**: Patroni failover tested (primary killed; standby promotes; backend reconnects; verify zero lost predictions). Redis Sentinel failover tested. SLO baseline measurements taken on Tier 3 hardware. |
| 25–26 | Feedback: prediction vs. outcome. Density scaling recalibration. Maneuver detection. Shadow validation report generation. Historical replay corpus: Long March 5B, Columbia-derived cloud case, and documented crossing-scenario set. Conservative-baseline comparison reporting for airspace closures. Launch safety module. Deployment freeze gate (CI/CD: block deploy if CRITICAL/HIGH alert active). ANSP communication plan implemented (degradation push + email). Incident response runbooks written (DB failover, Celery recovery, HMAC failure, ingest failure). | Prediction accuracy dashboard. Historical comparison. ShadowValidationReport. Air-risk replay comparison views. `/space` Persona F workspace. Launch safety portal. | Vault / cloud secrets manager. Secrets rotation. Begin first ANSP shadow mode deployment. **SRE**: PagerDuty/OpsGenie integrated with Prometheus AlertManager. SEV-1/2/3/4 routing configured. First on-call rotation established. |
| 27–28 | Mode C binary MC endpoint. Load testing (100 users, <2s CZML p95; MC p95 < 240s). **Prometheus + Grafana**: three dashboards (Operational Overview, System Health, SLO Burn Rate). Full AlertManager rules. ECSS compliance artefacts: SMP, VVP, PAP, DMP. MinIO lifecycle rules: MC blobs > 90 days → cold tier. | Mode C (Monte Carlo Particles). UncertaintyModeSelector unlocks Mode C. Final Playwright E2E suite. Grafana Operational Overview embedded in `/admin`. | **External penetration test** (auth bypass, RBAC escalation, SSRF, XSS→Playwright, WS auth bypass, data integrity, object ownership bypass, API key abuse). All Critical/High remediated. Load test: SLO p95 targets verified under 100-user concurrent load. |
| 29–32 | Regulatory acceptance package: safety case framework, ICAO data quality mapping, shadow validation evidence, SMS integration guide. TRL 6 demonstration. Data archival pipeline (Parquet export to MinIO cold before chunk drop). Storage growth verified against projections. **ESA bid legal**: background IP schedule documented; Consortium Agreement with academic partner signed (IP ownership, publication rights, revenue share); SBOM submitted as part of ESA artefact package. ECCN classification determination received; export screening process in place for all new customer registrations. ToS version updated to reflect any regulatory feedback from first ANSP deployments; re-acceptance triggered. | Regulatory submission report type. TRL demonstration artefacts. | SOC 2 Type I readiness review. Production runbook + incident response per threat scenario. ECSS compliance review. Monthly restore test passing in CI. Error budget dashboard showing < 10% burn rate. |
|
||
|
||
---
|
||
|
||
## 20. Key Decisions and Tradeoffs
|
||
|
||
| Decision | Chosen | Alternative Considered | Rationale |
|
||
|----------|--------|----------------------|-----------|
|
||
| Propagator split | SGP4 catalog + numerical decay | SGP4 for everything | SGP4 diverges by days–weeks for re-entry time prediction |
|
||
| Numerical integrator | RK7(8) adaptive + NRLMSISE-00 | poliastro Cowell | Direct force model control |
|
||
| Frame library | `astropy` | Manual SOFA Fortran | Handles IERS EOP; well-tested IAU 2006 |
|
||
| Atmospheric density | NRLMSISE-00 (P1), JB2008 option (P2) | Simple exponential | Community standard; captures solar cycle |
| Breakup model | Simplified ORSAT-like | Full DRAMA/SESAM | DRAMA requires licensing; simplified recovers ~80% utility |
| Uncertainty visualisation | Three modes, phased (A→B→C), user-selectable | Single fixed mode | Serves different personas; operational users need corridors, analysts need heatmaps |
| JWT algorithm | RS256 (asymmetric) | HS256 (shared secret) | Compromise of one service does not expose signing key to all services |
| Token storage | httpOnly Secure SameSite=Strict cookie | localStorage | XSS cannot read httpOnly cookies; localStorage is trivially exfiltrated |
| Token revocation | DB `refresh_tokens` table | Redis-only | Revocations survive restarts; enables rotation-chain audit |
| MFA | TOTP (RFC 6238) required for all roles | Optional MFA | Aviation authority context; government procurement baseline |
| Secrets management | Docker secrets (P1 prod) → Vault (P3) | Env vars only | Env vars appear in process listings and crash dumps; no audit trail |
| Alert integrity | Backend-only generation on verified data | Client-triggered alerts | Prevents false alert injection via API |
| Prediction integrity | HMAC-signed, immutable after creation | Mutable with audit log | Tamper-evident at database level; modification is impossible, not just logged |
| Multi-tenancy | RLS at database layer + `organisation_id` | Application-layer only | DB-level enforcement cannot be bypassed by application bugs |
| Renderer isolation | Separate `renderer` container, no external network | Playwright in backend container | Limits blast radius of XSS→SSRF escalation |
| Server state | TanStack Query | Zustand for everything | Automatic cache, background refetch; Zustand is not a data cache |
| Navigation model | Task-based (events, airspace, analysis) | Module-based | Users think in tasks, not modules |
| Report rendering | Playwright headless server-side | Client-side canvas | Reliable at print resolution; consistent; not affected by client GPU |
| Monorepo | Monorepo | Separate repos | Small team, shared types, simpler CI |
| ORM | SQLAlchemy 2.0 | Raw SQL | Mature async support; Alembic migrations |
| Domain architecture | Dual front door (aviation + space portal), shared physics core | Single aviation-only product | Space operator revenue stream; ESA bid credibility; space credibility supports aviation trust |
| Space operator object scoping | PostgreSQL RLS on `owned_objects` join | Application-layer filtering only | DB-level enforcement; prevents application bugs from leaking cross-operator data |
| NOTAM output | Draft only + mandatory disclaimer; never submitted | System-assisted NOTAM submission | SpaceCom is not a NOTAM originator; keeps platform in purely informational role; reduces regulatory approval burden |
| Reroute module scope | Strategic pre-flight avoidance boundary only | Specific alternate route generation | Specific routes require ATC integration and aircraft performance data SpaceCom does not have; avoidance boundary keeps SpaceCom legally defensible |
| Shadow mode | Org-level flag; all alerts suppressed; records segregated | Per-prediction flag | Enables ANSP trial deployments; accumulates validation evidence for regulatory acceptance; segregation prevents operational confusion |
| Controlled re-entry planner output | CCSDS-format manoeuvre plan + risk-scored deorbit windows | Aviation-format only | Space operators submit to national regulators and ops centres in CCSDS; Zero Debris Charter evidence format |
| API access | Separate API keys (not session JWT); per-key rate limiting | Session cookie only | Space operators integrate SpaceCom into operations centres programmatically; API keys are revocable machine credentials |
| MC parallelism model | Celery `group` + `chord` (fan-out sub-tasks across worker pool) | `multiprocessing.Pool` within single task | Chord distributes across all worker containers; Pool limited to one container's cores; chord scales horizontally |
| Worker topology | Two separate Celery pools: `ingest` and `simulation` | Single shared queue | Runaway simulation jobs cannot starve TLE ingestion; critical for reliability during active TIP events |
| Celery Beat HA | `celery-redbeat` (Redis-backed, distributed locking) | Standard Celery Beat (single process) | Beat SPOF means scheduled ingest silently stops; redbeat enables multiple instances with leader election |
| DB HA | TimescaleDB streaming replication + Patroni auto-failover | Single-instance DB | RPO = 0 for critical tables; 15-minute RTO requires automatic failover, not manual |
| Redis HA | Redis Sentinel (3 nodes) | Single Redis | Master failure without Sentinel means all Celery queues and WebSocket pub/sub stop |
| Deployment gate | CI/CD checks for active CRITICAL/HIGH alerts before deploying | Manual judgement | Prevents deployments during active TIP events; protects operational continuity |
| MC blade sizing | 16 vCPU per simulation worker container | Smaller containers | MC chord sub-tasks fill all available cores; below 16 cores p95 SLO of 240s is not met |
| Temporal uncertainty display | Plain window range ("08h–20h from now / most likely ~14h") for Persona A/C; p05/p50/p95 UTC for Persona B | `± Nh` notation everywhere | `±` implies symmetric uncertainty which re-entry distributions are not; window range is operationally actionable |
| Space weather impact communication | Operational buffer recommendation ("+2h beyond 95th pct") rather than % deviation | Percentage string | Percentage is meaningless without a known baseline; buffer hours are immediately usable by an ops duty manager |
| TLS termination | Caddy with automatic ACME (internet-facing) / internal CA (air-gapped) | nginx + manual certs | Caddy handles cert lifecycle automatically; decision tree in §34 |
| Pagination | Cursor-based `(created_at, id)` | Offset-based | Offset must scan and discard every skipped row, degrading toward a full-table scan at 7-year retention depth; keyset seek cost is independent of page depth |
| CZML delta protocol | `?since=<iso8601>` parameter; max 5 MB full payload; `X-CZML-Full-Required` header on stale client | Full catalog always | 100-object catalog at 1-min cadence is ~10–50 MB/hr per connected client without delta; delta reduces this to <500 KB/hr |
| MC concurrency gate | Per-org Redis semaphore; 1 concurrent MC run (Phase 1); `429 + Retry-After` on limit | Unbounded fan-out | 5 concurrent MC requests = 2,500 sub-tasks queued; p95 SLO collapses without backpressure |
| TimescaleDB `compress_after` | 7 days for `orbits` (not 1 day) | Compress as soon as possible | Compressing hot chunks forces decompress on every write; 1-day compress_after causes 50–200ms write latency thrash |
| Renderer memory limit | `mem_limit: 4g` Docker cap on renderer container | No memory limit | Chromium print rendering at A4/300DPI consumes 2–4 GB; 4 uncapped renderer instances can OOM a 32 GB node |
| Static asset caching | Cloudflare CDN (internet-facing); nginx sidecar (on-premise) | No CDN | CesiumJS bundle ~5–10 MB; 100 concurrent first-load = 500 MB–1 GB burst without caching |
| WAF/DDoS protection | Upstream provider (Cloudflare/AWS Shield) for internet-facing; network perimeter for air-gapped | Application-layer rate limiting only | Application-layer is insufficient for volumetric attacks; must be at ingress |
| Multi-region deployment | Single region per customer jurisdiction; separate instances, not shared cluster | Active-active multi-region | Data sovereignty; simpler compliance certification; Phase 1–3 customer base doesn't justify multi-region cost |
| MinIO erasure coding | EC:2 (4-node) | EC:4 or RAID | EC:2 tolerates 1 write failure / 2 read failures; balanced between protection and storage efficiency at 4 nodes |
| DB connection routing | PgBouncer as single stable connection target | Direct Patroni primary connection | Patroni failover transparent to application; stable DNS target through primary changes |
| Egress filtering | Host-level UFW/nftables allow-list (Tier 2); Calico/Cilium network policy (Tier 3) | Trust Docker network isolation | Docker isolation is inter-network only; outbound internet egress unrestricted without host-level filtering |
| Mode-switch dialogue | Explicit current-mode + target-mode + consequences listed; Cancel left, destructive action right | Generic "Are you sure?" | Aviation HMI conventions; listed consequences prevent silent simulation-during-live error |
| Future-preview temporal wash | Semi-transparent overlay + persistent label on event list when timeline scrubber is not at current time | No visual distinction | Prevents controller from acting on predicted-future data as though it is current operational state |
| Simulation block during active alerts | Optional org-level `disable_simulation_during_active_events` flag | Always allow simulation entry | Prevents an analyst accidentally entering simulation while CRITICAL alerts require attention in the same ops room |
| Prediction superseding | Write-once `superseded_by` FK on `reentry_predictions` / `simulations` | Mutable or delete | Preserves immutability guarantee; gives analysts a way to mark outdated predictions without removing the audit record |
| CRITICAL acknowledgement gate | 10-character minimum free-text field; two-step confirmation modal | Single click | Prevents reflexive acknowledgement; creates meaningful action record for every acknowledged CRITICAL event |
| Multi-ANSP coordination panel | Shared acknowledgement status and coordination notes across ANSP orgs on the same event | Out-of-band only | Creates shared digital situational awareness record without replacing voice coordination; reduces risk of conflicting parallel NOTAMs |
| Legal opinion timing | Phase 2 gate (before shadow deployment); not Phase 3 | Phase 3 task | Common law duty of care may attach regardless of UI disclaimers; liability limitation must be in executed agreements before any ANSP relies on the system |
| Commercial contract instruments | Three instruments: MSA + AUP click-wrap + API Terms | Single platform ToS | Each instrument addresses a different access pathway; API access by Persona E/F must have separate terms recorded against the key |
| Shadow mode legal gate | `legal_opinions.shadow_mode_cleared` must be TRUE before shadow mode can be activated for an org | Admin can enable freely | Shadow deployment is a formal regulatory activity; without a completed legal opinion it exposes SpaceCom to uncapped liability in the deployment jurisdiction |
| GDPR erasure vs. retention | Pseudonymise user references in append-only tables on erasure request; never delete safety records | Hard delete on request | UN Liability Convention requires 7-year retention; GDPR right to erasure is satisfied by removing the link to the individual, not the record itself |
| Space-Track data redistribution | Obtain written clarification from 18th SCS before exposing TLE/CDM data via the SpaceCom API | Assume permissible | Space-Track AUP prohibits redistribution to unregistered parties; violation could result in loss of Space-Track access, disabling the platform's primary data source |
| OSS licence compliance | CesiumJS commercial licence required for closed-source deployment; SBOM generated from Phase 1 | Assume all dependencies are permissively licensed | CesiumJS AGPLv3 requires source disclosure for network-served applications; undiscovered licence violations create IP risk in ESA bid |
| Insurance | Professional indemnity + cyber liability + product liability required before operational deployment | No insurance requirement | Aviation safety context; potential claims from incorrect predictions that inform airspace decisions could exceed SpaceCom's balance sheet without coverage |
| Connection pooling | PgBouncer transaction-mode pooler between all app services and TimescaleDB | Direct connections from app | Tier 3 connection count (2× backend + 4× workers + 2× ingest) exceeds `max_connections=100` without a pooler; Patroni failover updates only PgBouncer |
| Redis eviction policy | `noeviction` for Celery/redbeat (separate DB index); `allkeys-lru` for application cache | Single Redis with one policy | Broker message eviction causes silent job loss; cache eviction is acceptable |
| Bulk export implementation | Celery task → MinIO → presigned URL (async offload pattern) | Streaming response from API handler | Full catalog export can be gigabytes; materialising in API handler risks OOM on the backend container |
| Analytics query routing | Patroni standby replica for Persona B/F analytics; primary for operational reads | All reads to primary | Analytics queries during a TIP event would compete with operational reads on the primary; standby already provisioned at Tier 3 |
| SQLAlchemy lazy loading | `lazy="raise"` on all relationships | Default lazy loading | Async SQLAlchemy silently blocks the event loop on lazy-loaded relationships; `raise` converts silent N+1s into loud development-time errors |
| CZML cache strategy | Per-object fragment cache + full catalog assembly; TTL keyed to last propagation job | No cache; query DB on each request | CZML catalog fetch at 100 objects = 864k rows; uncached this misses the 2s p95 SLO under concurrent load |
| Hypertable chunk interval (`orbits`) | 1-day chunks (not default 7-day) | Default 7-day | A 72h CZML query touches at most 4 one-day chunks; with the default it must scan up to two full 7-day chunks, so chunk exclusion prunes far less of the table |
| Continuous aggregate for F10.7 81-day avg | TimescaleDB continuous aggregate `space_weather_daily` | Compute from raw rows per request | At 100 concurrent users, 100 identical scans of 11,664 raw rows; continuous aggregate reduces this to a single-row lookup |
| CI/CD orchestration | GitHub Actions | Jenkins / GitLab CI | Project is GitHub-native; Actions has OIDC → GHCR; no separate CI server to operate |
| Container image tags | `sha-<commit>` as canonical immutable tag; semantic version alias for releases | `latest` tag only | `latest` is mutable and non-reproducible; `sha-<commit>` gives exact traceability from deployed image back to source commit |
| Multi-stage Docker builds | Builder stage (full toolchain) + runtime stage (distroless/slim) | Single-stage with all tools | Eliminates build toolchain, compiler, and dev dependencies from production image; typically reduces image size by 60–80% |
| Local dev hot-reload | Backend: FastAPI `--reload` via bind-mounted `./backend` volume; Frontend: Next.js Fast Refresh HMR | Rebuild container on change | Full container rebuild per code change adds 30–90s per iteration; volume mount + process reload is < 1s |
| `.env.example` contract | `.env.example` with all required variables, descriptions, and stage flags committed to repo; actual `.env` in `.gitignore` | Ad-hoc variable discovery from runtime errors | Engineers must be able to run `cp .env.example .env` and have a working local stack within 15 minutes of cloning |
| Staging environment strategy | `main` branch continuously deployed to staging via GitHub Actions; production deploy requires manual approval gate after staging smoke tests pass | Manual staging deploys | Reduces time-to-detect integration regressions; staging serves as TRL artefact evidence environment |
| Secrets rotation | Per-secret rotation runbook: Space-Track credentials, JWT signing keys, ANSP tokens; old + new key both valid during 5-minute transition window; `security_logs` entry required; rotated via Vault dynamic secrets in Phase 3 | Manual rotation with downtime | Aviation context: key rotation must not cause service interruption; zero-downtime rotation is a reliability requirement, not a convenience |
| Build cache strategy | Docker layer cache: `cache-from/cache-to` targeting GHCR in GitHub Actions; pip wheel cache: `actions/cache` keyed on `requirements.txt` hash; npm cache: `actions/cache` keyed on `package-lock.json` hash | No cache; full rebuild each push | Without cache, a full rebuild takes 8–12 minutes; with cache, incremental pushes take 2–3 minutes — critical for CI as a useful merge gate |
| Image retention policy | Tagged release images kept indefinitely; untagged/orphaned images purged weekly via GHCR lifecycle policy; staging images retained 30 days; dev branch images retained 7 days | No policy; manual cleanup | Unmanaged GHCR storage grows unboundedly; stale images also represent unaudited CVE surface |
| Pre-commit hook completeness | Six hooks: `detect-secrets`, `ruff`, `mypy`, `hadolint`, `prettier`, `sqlfluff` | `git-secrets` only | `git-secrets` scans only for known secret patterns; `detect-secrets` uses entropy analysis; `hadolint` prevents insecure Dockerfile patterns; `sqlfluff` catches migration anti-patterns before code review |
| `alembic check` in CI | CI job runs `alembic check` to detect SQLAlchemy model/migration divergence; fails if models have unapplied changes | Only run migrations, no divergence check | SQLAlchemy models can diverge from migrations silently; `alembic check` catches the gap before it reaches production |
| FIR boundary data source | EUROCONTROL AIRAC (ECAC states) + FAA Digital-Terminal Procedures (US) + OpenAIP (fallback); 28-day update cadence | Manually curated GeoJSON, updated ad hoc | FIR boundaries change on AIRAC cycles; stale boundaries produce wrong airspace intersection results during live TIP events |
| ADS-B data source | OpenSky Network REST API (Phase 3 MVP); commercial upgrade path to Flightradar24 or FAA SWIM ADS-B if required | Direct receiver hardware | OpenSky is free, global, and sufficient for route overlay and intersection advisory; commercial upgrade only if coverage gaps identified in ANSP trials |
| CCSDS OEM reference frame | GCRF (Geocentric Celestial Reference Frame); time system UTC; `OBJECT_ID` = NORAD catalog number; missing international designator populated as `UNKNOWN` | ITRF or TEME | GCRF is the standard output of SpaceCom's frame transform pipeline; downstream mission control tools expect GCRF for propagation inputs |
| CCSDS CDM field population | SpaceCom populates: HEADER, RELATIVE_METADATA, OBJECT1/2 identifiers, state vectors, covariance (if available); fields not held by SpaceCom emitted as `N/A` per CCSDS 508.0-B-1 §4.3 | Omit empty fields | `N/A` is the CCSDS-specified sentinel for unknown values; silent omission causes downstream parser failures |
| CDM ingestion display | Space-Track CDM Pc displayed alongside SpaceCom-computed Pc with explicit provenance labels; > 10× discrepancy triggers `DATA_CONFIDENCE` warning on conjunction panel | Show only one value | Space operators need both values; discrepancy without explanation erodes trust in both |
| WebSocket event schema | Typed event envelope with `type` discriminator, monotonic `seq`, and `ts`; reconnect with `?since_seq=` replay of up to 200 events / 5-minute ring buffer; `resync_required` on stale reconnect | Schema-free JSON stream | Untyped streams require every consumer to reverse-engineer the schema; schema enables typed client generation |
| Alert webhook delivery | At-least-once POST to registered HTTPS endpoint; HMAC-SHA256 signature; 3 retries with exponential backoff; `degraded` status after 3 failures; auto-disable after 10 consecutive failures | WebSocket / email only | ANSPs with existing dispatch infrastructure (AFTN, internal webhook receivers) cannot integrate via browser WebSocket; webhooks are the programmatic last-mile |
| API versioning | `/api/v1` base; breaking changes require `/api/v2` parallel deployment; 6-month support overlap; `Deprecation` / `Sunset` headers (RFC 8594); 3-month written notice to API key holders | No versioning policy; breaking changes deployed ad hoc | Space operators building operations centre integrations need stable contracts; silent breaking changes disable their integrations |
| SWIM integration path | Phase 2: GeoJSON structured export; Phase 3: FIXM review + EUROCONTROL SWIM-TI AMQP publish endpoint | Not applicable | European ANSP procurement increasingly requires SWIM compatibility; GeoJSON export is low-cost first step; full SWIM-TI is Phase 3 |
| Space-Track API contract test | Integration test asserts expected JSON keys present in Space-Track response; ingest health alert fires after 4 consecutive hours with 0 successful Space-Track records | No contract test; breakage discovered at runtime | Space-Track API has had historical breaking changes; silent format change means ingest returns no data while health metrics appear normal |
| TLE checksum validation | Modulo-10 checksum on both lines verified before DB write; BSTAR range check; failed records logged to `security_logs` type `INGEST_VALIDATION_FAILURE` | Accept TLE at face value | Corrupted TLEs (network errors, encoding issues) would propagate incorrect state vectors without validation |
| Model card | `docs/model-card-decay-predictor.md` maintained alongside the model; covers validated orbital regime envelope, known failure modes, systematic biases, and performance by object type | Accuracy statement only in §24.3 | Regulators and ANSPs require a documented operational envelope, not just a headline accuracy figure; ESA TRL artefact requirement |
| Historical backcast selection | Validation report explicitly documents selection criteria, identifies underrepresented object categories, and states accuracy conditional on object type | Single unconditional accuracy figure | Observable re-entry population is biased toward large well-tracked objects; publishing an unconditional accuracy figure misrepresents model generalisation |
| Out-of-distribution detection | `ood_flag = TRUE` and `ood_reason` set at prediction time if any input falls outside validated bounds; UI shows mandatory warning callout | Serve all predictions identically | NRLMSISE-00 calibration domain does not include tumbling objects, very high area-to-mass ratio, or objects with no physical property data |
| Prediction staleness warning | `prediction_valid_until` = `p50_reentry_time - 4h`; UI warns independently of system-level TLE staleness if `NOW() > prediction_valid_until` and not superseded | No time-based staleness on predictions | An hours-old prediction for an imminent re-entry has implicitly grown uncertainty; operators need a signal independent of the system health banner |
| Alert threshold governance | Thresholds documented with rationale; change approval requires engineering lead sign-off + shadow-mode validation period; change log maintained in `docs/alert-threshold-history.md` | Thresholds set in code with no governance | CRITICAL trigger (window < 6h, FIR intersection) has airspace closure consequences; undocumented threshold changes cannot be reviewed by regulators or ANSPs |
| FIR intersection auditability | `alert_events.fir_intersection_km2` and `intersection_percentile` recorded at alert generation; UI shows "p95 corridor intersects ~N km² of FIR XXXX" | Alert log shows only "intersects FIR XXXX" | Intersection without area and percentile context is not auditable; regulators and ANSPs need to know *how much* intersection triggered the alert |
| Recalibration governance | Recalibration requires hold-out validation dataset, minimum accuracy improvement threshold, sign-off authority, rollback procedure, and notification to ANSP shadow partners | Recalibration run and deployed without gates | Unchecked recalibration can silently degrade accuracy for object types not in the calibration set |
| Model version governance | Changes classified as patch/minor/major; major changes require active prediction re-runs with supersession + ANSP notification; rollback path documented | No governance; model updated silently | A major model version change producing materially different corridors without re-running active predictions creates undocumented divergence between what ANSPs are seeing and current best predictions |
| Adverse outcome monitoring | `prediction_outcomes` table records observed re-entry outcomes against predictions; quarterly accuracy report generated from feedback pipeline; false positive/negative rates in Grafana | No post-deployment accuracy tracking | Without outcome monitoring SpaceCom cannot demonstrate performance within acceptable bounds to regulators; shadow validation reports are episodic, not continuous |
| Geographic coverage annotation | FIR intersection results carry `data_coverage_quality` flag per FIR; OpenAIP-sourced boundaries flagged as lower confidence | All FIR intersections treated equally | AIRAC coverage varies by region; operators in non-ECAC regions receive lower-quality intersection assessments without knowing it |
| Public transparency report | Quarterly aggregate accuracy/reliability report published (no personal data); covers prediction count, backcast accuracy, error rates, known limitations | No public reporting | Civil aviation safety tools operate in a regulated transparency environment; ESA bid credibility and regulatory acceptance require demonstrable performance |
| `docs/` directory structure | Canonical tree defined in §12.1; all documentation files live at known paths committed to the repo | Ad-hoc file creation by individual engineers | Documentation that exists only in prose references gets created inconsistently or not at all |
| Architecture Decision Records | MADR-format ADRs in `docs/adr/`; one per consequential decision in §20; linked from relevant code via inline comment | §20 table in master plan only | Engineers working in the repo cannot find decision rationale without reading a 5000-line plan document |
| OpenAPI documentation standard | Every public endpoint has `summary`, `description`, `tags`, and at least one `responses` example; enforced by CI check | Auto-generated stubs only | Auto-generation produces syntactically correct docs that are useless to API integrators (Persona E/F) |
| Runbook format | Standard template in `docs/runbooks/TEMPLATE.md`; required sections: Trigger, Severity, Preconditions, Steps, Verification, Rollback, Notify; runbook index maintained | Free-form runbooks written ad-hoc | Runbooks written under pressure without a template consistently omit the rollback and notification steps |
| Docstring standard | Google-style docstrings required on all public functions in `propagator/`, `reentry/`, `breakup/`, `conjunction/`, `integrity.py`; parameters include physical units | No docstring requirement | Physics functions without units and limitations documented cannot be reviewed or audited by third-party evaluators for ESA TRL |
| Validation procedure | §17 specifies reference data location, run commands, pass/fail tolerances per suite; `docs/validation/README.md` describes how to add new cases | Checklist of what to validate without procedure | A third party cannot reproduce the validation without knowing where the reference data is and what tolerance constitutes a pass |
| User documentation | Phase 2 delivers aviation portal guide + API quickstart; Phase 3 delivers space portal guide + in-app contextual help; stored in `docs/user-guides/` | No user documentation | ANSP SMS acceptance requires user documentation; aviation operators cannot learn an unfamiliar safety tool from the UI alone |
| `CHANGELOG.md` format | Keep a Changelog conventions; human-maintained; one entry per release with `Added/Changed/Deprecated/Removed/Fixed/Security` sections | No format specified | Changelogs written by different engineers without a format are unusable by operators and regulators |
| `AGENTS.md` | Project-root file defining behaviour guidance for AI coding agents; specifies codebase conventions, test requirements, and safety-critical file restrictions; committed to repo | Untracked file, undefined purpose | An undocumented AGENTS.md is either ignored or followed inconsistently, undermining its purpose |
| Test documentation | Module docstrings on physics/security test files state the invariant, reference source, and operational significance of failure; `docs/test-plan.md` lists all suites with scope and blocking classification | No test documentation requirement | ECSS-Q-ST-80C requires a test specification as a separate deliverable from the test code |
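As a concrete illustration of the TLE checksum decision above, the modulo-10 check can be sketched as follows. This is a minimal sketch, not the ingest module's actual API: the function name is illustrative, and the real pipeline would route failures to `security_logs` as `INGEST_VALIDATION_FAILURE` rather than print.

```python
def tle_checksum_ok(line: str) -> bool:
    """Verify the modulo-10 checksum in column 69 of a TLE line.

    Per the TLE format, digits in columns 1-68 count at face value,
    each '-' counts as 1, and every other character counts as 0; the
    sum modulo 10 must equal the checksum digit in column 69.
    """
    if len(line) != 69 or not line[68].isdigit():
        return False
    total = sum(
        int(ch) if ch.isdigit() else (1 if ch == "-" else 0)
        for ch in line[:68]
    )
    return total % 10 == int(line[68])


# Widely published ISS (ZARYA) line-1 example:
ISS_L1 = ("1 25544U 98067A   08264.51782528 -.00002182"
          "  00000-0 -11606-4 0  2927")

print(tle_checksum_ok(ISS_L1))             # valid line
print(tle_checksum_ok(ISS_L1[:-1] + "3"))  # corrupted checksum digit
```

A BSTAR range check would run after this structural check, since a line can be checksum-valid yet physically implausible.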
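The cursor-based pagination decision implies an opaque keyset cursor over `(created_at, id)`. A minimal sketch, assuming illustrative helper names and parameter-binding style (not the actual API surface):

```python
import base64
import json
from datetime import datetime, timezone


def encode_cursor(created_at: datetime, row_id: int) -> str:
    """Opaque cursor identifying the last row of the previous page."""
    payload = json.dumps([created_at.isoformat(), row_id])
    return base64.urlsafe_b64encode(payload.encode()).decode()


def decode_cursor(cursor: str) -> tuple:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor))
    return datetime.fromisoformat(created_at), row_id


def page_predicate(cursor):
    """Keyset WHERE fragment: rows strictly after the cursor row in
    (created_at, id) order, so page cost is independent of depth."""
    if cursor is None:
        return "TRUE", {}
    created_at, row_id = decode_cursor(cursor)
    return "(created_at, id) > (:c_at, :c_id)", {"c_at": created_at, "c_id": row_id}
```

The query then appends `ORDER BY created_at, id LIMIT :page_size`, backed by a composite index on `(created_at, id)`, so each page is an index seek rather than an offset scan.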
---
## 21. Definition of Done per Phase
### Phase 1 Complete When:
**Physics and data:**
- [ ] 100+ real objects tracked with current TLE data
- [ ] Frame transformation unit tests pass against IERS/Vallado reference cases (round-trip error < 1 m)
- [ ] SGP4 CZML uses J2000 INERTIAL frame (not TEME)
- [ ] Space weather polled from NOAA SWPC; cross-validated against ESA SWS; operational status widget visible
- [ ] TIP messages ingested and displayed for decaying objects
- [ ] TLE cross-validation flags discrepancies > threshold for human review
- [ ] IERS EOP hash verification passing
- [ ] Decay predictor: ≥3 historical re-entry backcast windows overlap actual events
- [ ] Mode A (Percentile Corridors): p05/p50/p95 swaths render with correct visual encoding
- [ ] TimelineGantt displays all active events; click-to-navigate functional
- [ ] LIVE/REPLAY/SIMULATION mode indicator correct on all pages
**Security (all required before Phase 1 is considered complete):**
- [ ] RBAC enforced: automated `test_rbac.py` verifies every endpoint returns 403 for insufficient role, 401 for unauthenticated
- [ ] JWT RS256 with httpOnly cookies; `localStorage` token storage absent from codebase (grep check in CI)
- [ ] MFA (TOTP) enforced for all roles; recovery codes functional
- [ ] Rate limiting: 429 responses verified by integration tests for all configured limits
- [ ] Simulation parameter range validation: out-of-range values return 400 with clear message
- [ ] Prediction HMAC: tamper test (direct DB row modification) triggers 503 + CRITICAL security_log entry
- [ ] `alert_events` append-only trigger: UPDATE/DELETE raise exception (verified by test)
- [ ] `reentry_predictions` immutability trigger: same (verified by test)
- [ ] Redis AUTH enabled; default user disabled; ACL per service verified
- [ ] MinIO: all buckets verified private; direct object URL returns 403; pre-signed URL required
- [ ] Docker: all containers verified non-root (`docker inspect` check in CI)
- [ ] Docker: network segmentation verified — frontend container cannot reach database port
- [ ] Bandit: 0 High severity findings in CI
- [ ] ESLint security: 0 High findings in CI
- [ ] Trivy: 0 Critical/High CVEs in all container images
- [ ] CSP headers present on all pages; verified by Playwright E2E test
- [ ] axe-core: 0 critical, 0 serious violations on all pages (CI check)
- [ ] WCAG 2.1 AA colour contrast: automated check passes
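The prediction-HMAC tamper test in this group rests on a signature over the immutable fields. A minimal sketch, assuming an illustrative key literal and field set (the real key comes from the secrets store, and the canonical field list is defined by the schema, not here):

```python
import hashlib
import hmac
import json

SECRET = b"example-signing-key"  # illustrative only; never a literal in production


def sign_prediction(row: dict) -> str:
    """HMAC-SHA256 over a canonical serialisation of the immutable fields."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(SECRET, canonical, hashlib.sha256).hexdigest()


def verify_prediction(row: dict, stored_sig: str) -> bool:
    """Constant-time compare; a direct DB row edit breaks the signature."""
    return hmac.compare_digest(sign_prediction(row), stored_sig)


row = {"object_id": 25544, "p50_reentry_time": "2025-03-01T14:00:00Z"}
sig = sign_prediction(row)
tampered = {**row, "p50_reentry_time": "2025-03-02T14:00:00Z"}
print(verify_prediction(row, sig), verify_prediction(tampered, sig))
```

On verification failure the service would serve 503 and write the CRITICAL `security_logs` entry the checklist item requires.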
**UX:**
- [ ] Globe: object clustering active at global zoom; urgency symbols correct (colour-blind-safe)
- [ ] DataConfidenceBadge visible on all object detail and prediction panels
- [ ] UncertaintyModeSelector visible; Mode B/C greyed with "Phase 2/3" label
- [ ] JobsPanel shows live sample progress for running decay jobs
- [ ] Shared deep links work: `/events/{id}` loads correct event; globe focuses on corridor
- [ ] All pages keyboard-navigable; modal focus trap verified
- [ ] Report generation: Operational Briefing type functional; PDF includes globe corridor map
**Human Factors (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] Event cards display window range notation (`Window: Xh–Yh from now / Most likely ~Zh from now`); no `±` notation appears in operational-facing UI (grep check)
- [ ] Mode-switch dialogue: switching to SIMULATION shows current mode, target mode, and "alerts suppressed" consequence; Cancel left, Switch right; Playwright E2E test verifies dialogue content
- [ ] Future-preview temporal wash: dragging timeline scrubber past current time applies overlay and `PREVIEWING +Xh` label to event panel; alert badges show "(projected)"; verified by Playwright test
- [ ] CRITICAL acknowledgement: two-step flow (banner → confirmation modal); Confirm button disabled until `Action taken` field ≥ 10 characters; verified by Playwright test
- [ ] Audio alert: non-looping two-tone chime plays once on CRITICAL alert; stops on acknowledgement; does not play in SIMULATION or REPLAY mode; verified by integration test with audio mock
- [ ] Alert storm meta-alert: > 5 CRITICAL alerts within 1 hour generates Persona D meta-alert with disambiguation prompt (verified by test with synthetic alerts)
- [ ] Onboarding state: new organisation with no FIRs configured sees three-card setup prompt on first login (Playwright test)
- [ ] Degraded mode banner: `/readyz` 207 response triggers correct per-degradation-type operational guidance text in UI (integration test for each degradation type: space weather stale, TLE stale)
- [ ] `superseded_by` constraint: setting `superseded_by` on a prediction a second time raises DB exception (integration test); UI shows `⚠ Superseded` banner on any prediction where `superseded_by IS NOT NULL`
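
The window-range notation required by the first Human Factors item above can be produced by one small formatter. The sketch below assumes the p05/p95 percentile times bound the window and p50 is the most-likely time; function and parameter names are illustrative:

```python
from datetime import datetime, timedelta, timezone

def window_label(p05: datetime, p50: datetime, p95: datetime, now: datetime) -> str:
    """Render 'Window: Xh–Yh from now / Most likely ~Zh from now', never '±'."""
    def hours(t: datetime) -> int:
        # Whole hours from the reference time to the percentile time.
        return round((t - now).total_seconds() / 3600)

    return (f"Window: {hours(p05)}h–{hours(p95)}h from now "
            f"/ Most likely ~{hours(p50)}h from now")
```

The grep check in the checklist item is then the complement of this formatter: no operational-facing template may emit a `±` character.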
**Legal / Compliance (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] **Space-Track AUP architectural decision gate (Finding 9):** Written AUP clarification obtained from 18th Space Control Squadron or legal counsel opinion. `docs/adr/0016-space-track-aup-architecture.md` committed with Path A (shared ingest) or Path B (per-org credentials) decision recorded and evidenced. Ingest architecture finalised accordingly. This is a blocking Phase 1 decision — ingest code must not be written until the path is decided.
- [ ] ToS / AUP / Privacy Notice acceptance gate: first login blocks dashboard access until all three documents are accepted; `users.tos_accepted_at`, `users.tos_version`, `users.tos_accepted_ip` populated on acceptance (integration test: unauthenticated attempt to skip returns 403)
- [ ] ToS version change triggers re-acceptance: bump `tos_version` in config; verify existing users are blocked on next login until they re-accept (integration test)
- [ ] **CesiumJS commercial licence executed** and stored at `legal/LICENCES/cesium-commercial.pdf`; `legal_clearances.cesium_commercial_executed = TRUE` — **blocking gate for any external demo** (§29.11 F1)
- [ ] SBOM generated at build time via `syft` (SPDX-JSON, container image) + `pip-licenses` + `license-checker-rseidelsohn` (dependency manifests); stored in `docs/compliance/sbom/` as versioned artefacts; all dependency licences reviewed against `legal/OSS_LICENCE_REGISTER.md`; CI `pip-licenses --fail-on` gate includes GPL/AGPL/SSPL; no unapproved licence in transitive closure (§29.11 F2, F10)
- [ ] `legal/LGPL_COMPLIANCE.md` created documenting poliastro LGPL dynamic linking compliance and PostGIS GPLv2 linking exception (§29.11 F4, F9)
- [ ] `legal/LICENCES/timescaledb-licence-assessment.md` and `legal/LICENCES/redis-sspl-assessment.md` created with licence assessment sign-off (§29.11 F5, F6)
- [ ] `legal_opinions` table present in schema; admin UI shows legal clearance status per org; shadow mode toggle displays warning if `shadow_mode_cleared = FALSE`
- [ ] GDPR breach notification procedure documented in the incident response runbook; tabletop exercise completed with the engineering team
**Infrastructure / DevOps (all required before Phase 1 is considered complete):**
- [ ] Docker Compose starts full stack with single command (`make dev`)
- [ ] `make test` executes pytest + vitest in one command; all tests pass on a clean clone
- [ ] `make migrate` runs all Alembic migrations against a fresh DB without error
- [ ] `make seed` loads fixture data; globe shows test objects on first load
- [ ] `.env.example` present with all required variables documented; a new engineer can reach a working local stack in ≤ 15 minutes
- [ ] Multi-stage Dockerfiles in place for backend, worker, renderer, and frontend: builder stage uses full toolchain; runtime stage is distroless/slim; `docker inspect` confirms no build tools (gcc, pip, npm) present in runtime image
- [ ] All containers run as non-root UID (baked in Dockerfile `USER` directive — not set at runtime); verified by `docker inspect` check in CI
- [ ] Self-hosted GitLab CI pipeline exists with jobs: `lint` (pre-commit all hooks), `test-backend` (pytest), `test-frontend` (vitest + Playwright), `security-scan` (Bandit + Trivy + ESLint security), `build-and-push` (multi-stage build -> GitLab container registry with `sha-<commit>` tag)
- [ ] `.pre-commit-config.yaml` committed with all six hooks; CI re-runs all hooks and fails if any fail
- [ ] `alembic check` step in CI fails if SQLAlchemy models have unapplied changes
- [ ] Build cache: Docker layer cache, pip wheel cache, npm cache all configured in GitLab CI; incremental push CI time < 4 minutes
- [ ] pytest suite: frame utils, integrity, auth, RBAC, propagator, decay, space weather, ingest, API integration
- [ ] Playwright E2E: mode switch, alert acknowledge, CZML render, job progress, report generation, CSP headers
- [ ] Port exposure CI check: `scripts/check_ports.py` passes with no never-exposed port in a `ports:` mapping
- [ ] Caddy TLS active on local dev stack with self-signed cert or ACME staging cert; HSTS header present (`Strict-Transport-Security: max-age=63072000`); TLS 1.1 and below not offered (verified by `nmap --script ssl-enum-ciphers`)
- [ ] `docs/runbooks/egress-filtering.md` exists documenting the allowed outbound destination whitelist; implementation method (UFW/nftables) noted
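
The port-exposure gate referenced above (`scripts/check_ports.py`) reduces to one check over the parsed Compose file. A sketch under an assumed, illustrative never-exposed service list:

```python
# Illustrative list; the authoritative set belongs in scripts/check_ports.py.
NEVER_EXPOSED = {"db", "redis", "minio", "worker-sim"}

def port_violations(compose: dict) -> list[str]:
    """Return internal-only services that publish a host port via 'ports:'."""
    services = compose.get("services", {})
    return sorted(
        name for name, svc in services.items()
        if name in NEVER_EXPOSED and svc.get("ports")
    )
```

In CI the Compose file would be loaded with a YAML parser and any non-empty result would fail the job.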
**Performance / Database (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] pgBouncer in Docker Compose; all app services connect via pgBouncer (not directly to TimescaleDB); verified by `netstat` or connection-source query showing only pgBouncer IPs in `pg_stat_activity`
- [ ] All required indexes present: `orbits_object_epoch_idx`, `reentry_pred_object_created_idx`, `alert_events_unacked_idx`, `reentry_pred_corridor_gist`, `hazard_zones_polygon_gist`, `fragments_impact_gist`, `tle_sets_object_ingested_idx` — verified by `\d+` or `pg_indexes` query
- [ ] `orbits` hypertable chunk interval set to 1 day; `space_weather` to 30 days; `tle_sets` to 7 days — verified by `timescaledb_information.chunks`
- [ ] `space_weather_daily` continuous aggregate created and policy active; Space Weather Widget backend query reads from the aggregate (verified by `EXPLAIN` showing `space_weather_daily` in plan, not raw `space_weather`)
- [ ] Autovacuum settings applied to `alert_events`, `security_logs`, `reentry_predictions` — verified via `pg_class` `reloptions`
- [ ] `lazy="raise"` set on all SQLAlchemy relationships; test suite passes with no `MissingGreenlet` or `InvalidRequestError` exceptions (test suite itself verifies this by accessing relationships without explicit loading — should raise)
- [ ] Redis Celery broker DB index (`SELECT 0`) has `maxmemory-policy noeviction`; application cache DB index (`SELECT 1`) has `allkeys-lru` — verified by `CONFIG GET maxmemory-policy` on each DB
- [ ] CZML catalog endpoint: `EXPLAIN (ANALYZE, BUFFERS)` output recorded in `docs/query-baselines/czml_catalog_100obj.txt`; p95 response time < 2s verified by load test with 10 concurrent users
- [ ] CZML delta endpoint (`?since=`) functional: integration test verifies delta response contains only changed objects; `X-CZML-Full-Required: true` returned when client timestamp > 30 min old
- [ ] Compression policies applied with correct `compress_after` intervals (see §9.4 table): `orbits` = 7 days, `adsb_states` = 14 days, `space_weather` = 60 days, `tle_sets` = 14 days — verified by `timescaledb_information.jobs`
- [ ] Cursor-based pagination: integration test on `/reentry/predictions` with 200+ rows confirms `next_cursor` present and second page returns non-overlapping rows; `limit=201` returns 400
- [ ] MC concurrency gate: integration test submits two concurrent `POST /decay/predict` requests from the same organisation; second request returns `HTTP 429` with `Retry-After` header while first is running; first completes normally
- [ ] Renderer Docker memory limit set to 4 GB in `docker-compose.yml`; `docker inspect` confirms `HostConfig.Memory = 4294967296`
- [ ] Bulk export endpoint: integration test with 10,000-row dataset confirms response is a task ID + status URL, not an inline response body
- [ ] `tests/load/` directory exists with at least a k6 or Locust scenario for the CZML catalog endpoint; `docs/test-plan.md` load test section specifies scenario, ramp shape, and SLO assertion
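
The cursor-pagination item above can be grounded in a keyset cursor: an opaque token encoding the last row's sort key, plus a hard limit cap. A sketch in which the token format and cap handling are assumptions:

```python
import base64
import json

MAX_LIMIT = 200

def encode_cursor(created_at: str, row_id: int) -> str:
    """Opaque keyset cursor over (created_at, id), the query's sort key."""
    return base64.urlsafe_b64encode(json.dumps([created_at, row_id]).encode()).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, row_id

def check_limit(limit: int) -> None:
    """limit=201 must be rejected (mapped to HTTP 400), per the checklist."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
```

Because the cursor pins the page boundary to a concrete `(created_at, id)` pair, the second page cannot overlap the first even while new predictions are being inserted.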
**Technical Writing / Documentation (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] `docs/` directory tree created and committed matching the structure in §12.1; all referenced documentation paths exist (even if files are stubs with "TODO" content)
- [ ] `AGENTS.md` committed to repo root; contains codebase conventions, test requirements, and safety-critical file restrictions (see §33.9)
- [ ] `docs/adr/` contains minimum 5 ADRs for the most consequential Phase 1 decisions: JWT algorithm choice, dual frontend architecture, Monte Carlo chord pattern, frame library choice, TimescaleDB chunk intervals
- [ ] `docs/runbooks/TEMPLATE.md` committed; `docs/runbooks/README.md` index lists all required runbooks with owner field; at least `db-failover.md`, `ingest-failure.md`, and `hmac-failure.md` are complete (not stubs)
- [ ] `docs/validation/README.md` documents how to run each validation suite and where reference data files live; `docs/validation/reference-data/` contains Vallado SGP4 cases and IERS frame test cases
- [ ] `CHANGELOG.md` exists at repo root in Keep a Changelog format; first entry records Phase 1 initial release
- [ ] `docs/alert-threshold-history.md` exists with initial entry recording threshold values, rationale, and author sign-off (required by §24.8)
- [ ] OpenAPI docs: CI check confirms no public endpoint has an empty `description` field; spot-check 5 endpoints in code review to verify `summary` and at least one `responses` example
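
The empty-description gate in the last item is a short walk over the OpenAPI document. A sketch of the CI check, assuming the spec has already been parsed to a dict:

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}

def undocumented_operations(spec: dict) -> list[str]:
    """Return 'METHOD path' for every operation missing a non-empty description."""
    missing = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method in HTTP_METHODS and not (op.get("description") or "").strip():
                missing.append(f"{method.upper()} {path}")
    return sorted(missing)
```

A non-empty return value fails the CI job and lists exactly which operations need documentation.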
**Ethics / Algorithmic Accountability (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] `ood_flag` and `ood_reason` populated at prediction time: integration test with an object whose `data_confidence = 'unknown'` and no DISCOS physical properties confirms `ood_flag = TRUE` and `ood_reason` contains `'low_data_confidence'`; prediction is served but UI shows mandatory warning callout above the prediction panel
- [ ] `prediction_valid_until` field present: verify it equals `p50_reentry_time - 4h` for a test prediction; UI shows staleness warning when `NOW() > prediction_valid_until` and prediction is not superseded (Playwright test simulates time travel)
- [ ] `alert_events.fir_intersection_km2` and `intersection_percentile` recorded: synthetic CRITICAL alert with known corridor area confirms both fields populated; UI renders "p95 corridor intersects ~N km² of FIR XXXX" (Playwright test)
- [ ] Alert threshold values documented: `docs/alert-threshold-history.md` exists with initial entry recording threshold values, rationale, and author sign-off
- [ ] `prediction_outcomes` table exists in schema; `POST /api/v1/predictions/{id}/outcome` endpoint (requires `analyst` role) accepts observed re-entry time and source (integration test: unauthenticated attempt returns 401)
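
The `prediction_valid_until` rule above is simple enough to state as code, and the staleness condition mirrors the checklist (not superseded, and past validity). Names are illustrative:

```python
from datetime import datetime, timedelta, timezone

VALIDITY_MARGIN = timedelta(hours=4)

def prediction_valid_until(p50_reentry_time: datetime) -> datetime:
    """Validity ends a fixed margin before the median predicted re-entry."""
    return p50_reentry_time - VALIDITY_MARGIN

def is_stale(valid_until: datetime, superseded_by, now: datetime) -> bool:
    """The staleness warning fires only for non-superseded, expired predictions."""
    return superseded_by is None and now > valid_until
```

Superseded predictions are excluded because they already carry the `⚠ Superseded` banner; showing both warnings at once would be noise.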
**Interoperability (Phase 1 items — all required before Phase 1 is considered complete):**
- [ ] TLE checksum validation: integration test sends a TLE with deliberately corrupted checksum; verify it is rejected and logged to `security_logs` type `INGEST_VALIDATION_FAILURE`; valid TLE with same content but correct checksum is accepted
- [ ] Space weather format contract test: CI integration test against mocked NOAA SWPC response asserts (a) expected top-level JSON keys present (`time_tag`, `flux` / `kp_index`); (b) F10.7 values in physical range 50–350 sfu; (c) Kp values in range 0–90 (NOAA integer format); test is `@pytest.mark.contract` and runs against mocks in standard CI, against live API in nightly sandbox job
- [ ] Space-Track contract test: integration test against mocked Space-Track response asserts (a) expected JSON keys present for TLE and CDM queries; (b) B* values trigger warning when outside [-0.5, 0.5]; (c) epoch field parseable as ISO-8601; `spacecom_ingest_success_total{source="spacetrack"}` Prometheus metric > 0 after a live ingest cycle (nightly sandbox only)
- [ ] FIR boundary data loaded: `airspace` table populated with FIR/UIR polygons for at least the test ANSP region; source documented in `ingest/sources.py`; AIRAC update date recorded in `airspace_metadata` table
- [ ] WebSocket event schema: `WS /ws/events` delivers typed event envelopes; integration test sends a synthetic `alert.new` event and verifies the client receives `{"type": "alert.new", "seq": <n>, "data": {...}}`; reconnect with `?since_seq=<n>` replays missed event
- [ ] API versioning headers: all API endpoints return `Content-Type: application/vnd.spacecom.v1+json`; deprecated endpoints (if any) return `Deprecation: true` and `Sunset: <date>` headers (verified by Playwright E2E check)
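
The TLE checksum item above follows the standard modulo-10 rule: digits add their value, each minus sign adds one, every other character adds zero, summed over the first 68 columns. A sketch of the validator:

```python
def tle_checksum(line: str) -> int:
    """Modulo-10 checksum over columns 1-68 of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10

def tle_line_valid(line: str) -> bool:
    """Column 69 must hold the checksum digit."""
    return len(line) == 69 and line[68].isdigit() and int(line[68]) == tle_checksum(line)
```

A line failing this check is the trigger for the `INGEST_VALIDATION_FAILURE` entry in `security_logs` described above.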
**SRE / Reliability (all required before Phase 1 is considered complete):**
- [ ] Health probes: `/healthz` returns 200 on all services; `/readyz` returns 200 (healthy) or 207 (degraded) as appropriate; Docker Compose `depends_on: condition: service_healthy` wired for all service dependencies
- [ ] Celery queue routing: integration test confirms `ingest.*` tasks appear only on `ingest` queue and `propagator.*` tasks appear only on `simulation` queue; no cross-queue contamination possible
- [ ] `celery-redbeat` schedule persistence: Beat process restart test verifies scheduled jobs survive without duplicate scheduling; Redis key `redbeat:*` present after restart
- [ ] Crash-safety: kill a `worker-sim` container mid-task; verify task is requeued (not lost) on worker restart; `task_acks_late = True` and `task_reject_on_worker_lost = True` confirmed by log inspection
- [ ] Dead letter queue: a task that exhausts all retries appears in the DLQ; DLQ depth metric visible in Prometheus
- [ ] WAL archiving: `pg_basebackup` and WAL segments appearing in MinIO `db-wal-archive` bucket within 10 minutes of first write (verified by bucket list)
- [ ] Daily backup Celery task: `backup_database` task appears in Celery Beat schedule; execution logged in `celery-beat.log`; resulting archive object visible in MinIO `db-backups` bucket
- [ ] TimescaleDB compression policy: `orbits` compression policy applied; `timescaledb_information.jobs` shows policy active; manual `CALL run_job()` compresses at least one chunk
- [ ] Prometheus metrics: `spacecom_active_tip_events`, `spacecom_tle_age_hours`, `spacecom_hmac_verification_failures_total`, `spacecom_celery_queue_depth` all visible in Prometheus UI with correct labels
- [ ] MC chord distribution: `run_mc_decay_prediction` fans out 500 sub-tasks; Celery Flower shows sub-tasks distributed across both `worker-sim` instances (not all on one worker)
- [ ] MC p95 latency SLO: 500-sample MC run completes in < 240s on Tier 1 dev hardware (8 vCPU/32 GB) under load test; documented baseline recorded for Tier 2 comparison
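
The queue-routing item above corresponds to a static Celery `task_routes` mapping. The resolution helper below is a simplified sketch of the prefix matching the integration test would assert against (queue and task-name prefixes come from the checklist; the helper itself is illustrative):

```python
task_routes = {
    "ingest.*": {"queue": "ingest"},
    "propagator.*": {"queue": "simulation"},
}

def queue_for(task_name: str, routes: dict = task_routes) -> str:
    """Resolve which queue a task name routes to; unmatched names use the default."""
    for pattern, target in routes.items():
        if task_name.startswith(pattern.rstrip("*")):
            return target["queue"]
    return "celery"  # Celery's default queue name
```

Paired with `task_acks_late = True` and `task_reject_on_worker_lost = True` from the crash-safety item, this routing keeps long MC sub-tasks from starving the ingest queue.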
### Phase 2 Complete When:
- [ ] Atmospheric breakup: fragments, casualty areas, fragment globe display
- [ ] Mode B (Probability Heatmap): Deck.gl layer renders; hover tooltip shows probability
- [ ] Conjunction screening: known close approaches identified; Pc computed for ≥1 test case
- [ ] 2D Plan View: FIR boundaries, horizontal corridor projection, altitude cross-section
- [ ] Airspace intersection table: affected FIRs with entry/exit times on Event Detail
- [ ] Hazard zones: HMAC-signed and immutability trigger verified
- [ ] PDF reports: Technical Assessment and Regulatory Submission types functional
- [ ] Renderer container: `network_mode: none` enforced; sanitisation tests passing; 30s timeout verified
- [ ] OWASP ZAP DAST: 0 High/Critical findings against staging environment
- [ ] RLS multi-tenancy: Org A user cannot access Org B records (integration test)
- [ ] SimulationComparison: two runs overlaid on globe with distinct colours
**Phase 2 SRE / Reliability:**
- [ ] Monthly restore test: `restore_test` Celery task executes on schedule; restores latest backup to isolated `db-restore-test` container; row count reconciliation passes; result logged to `security_logs` (type `RESTORE_TEST`)
- [ ] TimescaleDB retention policy: 90-day drop policy active on `orbits` and `space_weather`; manual chunk drop test in staging confirms chunks older than 90 days are removed without affecting newer data
- [ ] Archival pipeline: Parquet export Celery task runs before chunk drop; resulting `.parquet` files visible in MinIO `db-archive` bucket; spot-check query against archived Parquet returns expected rows
- [ ] Degraded mode UI: stop space weather ingest; confirm `/readyz` returns 207; confirm `StalenessWarningBanner` appears in aviation portal within one polling cycle (≤ 60s); restart ingest; confirm banner clears
- [ ] Error budget dashboard: Grafana `SRE Error Budgets` dashboard shows Phase 2 SLO burn rates for prediction latency and data freshness; alert fires in Prometheus when burn rate exceeds 2× for > 1 hour
**Phase 2 Human Factors:**
- [ ] Corridor Evolution widget: Event Detail page shows p50 corridor footprint at T+0h/+2h/+4h; auto-updates in LIVE mode; an amber warning appears if the corridor is widening
- [ ] Duty Manager View: toggle on Event Detail collapses to large-text window/FIR/action-buttons only; toggles back to technical detail
- [ ] Response Options accordion: contextualised action checklist visible to `operator`+ role; checkbox states and coordination notes persisted to `alert_events`
- [ ] Multi-ANSP Coordination Panel: visible on events where ≥2 registered organisations share affected FIRs; acknowledgement status and coordination notes from each ANSP visible; integration test confirms Org A cannot see Org B coordination notes on unrelated events
- [ ] Simulation block: `disable_simulation_during_active_events` org setting functional; mode switch blocked with correct modal when unacknowledged CRITICAL alerts exist (integration test)
- [ ] Space weather buffer recommendation: Event Detail shows `[95th pct time + buffer]` callout when conditions are Elevated or above; buffer computed by backend from F10.7/Kp thresholds (integration test verifies all four threshold bands)
- [ ] Secondary Display Mode: `?display=secondary` URL opens chrome-free full-screen operational view; navigation, admin links, and simulation controls not present; CRITICAL banners still appear (Playwright test)
- [ ] Mode C first-use overlay: MC particle animation blocked until user acknowledges one-time explanation overlay; preference stored in user record; never shown again after first acknowledgement
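
The space-weather buffer recommendation above implies a monotone mapping from conditions to extra window padding. The band boundaries below are placeholders (the real F10.7/Kp thresholds live in backend configuration), but the shape of the four-band check the integration test exercises is the point:

```python
def buffer_hours(f107_sfu: float, kp: float) -> float:
    """Extra window buffer (hours) by condition band.

    Threshold values are illustrative placeholders, not the production config.
    """
    if f107_sfu >= 250 or kp >= 7:
        return 6.0   # Severe
    if f107_sfu >= 200 or kp >= 5:
        return 4.0   # High
    if f107_sfu >= 150 or kp >= 4:
        return 2.0   # Elevated
    return 0.0       # Quiet: no buffer callout shown
```

Keeping the mapping as a pure function of the latest F10.7/Kp readings makes each of the four bands directly assertable in the integration test.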
**Phase 2 Performance / Database:**
- [ ] FIR intersection query: `EXPLAIN (ANALYZE)` confirms bounding-box pre-filter (`&&`) eliminates > 90% of `airspace` rows before exact `ST_Intersects`; p95 intersection query time < 200ms with full airspace table loaded
- [ ] Analytics query routing: Persona B/F workspace queries confirmed routing to replica engine via `pg_stat_activity` source host check; replication lag monitored in Grafana (alert if > 30s)
- [ ] Query plan regression: re-run `EXPLAIN (ANALYZE, BUFFERS)` on CZML catalog query; compare to Phase 1 baseline in `docs/query-baselines/`; planning time and execution time increase < 2× (if exceeded, investigate before Phase 3 load test)
- [ ] Hypertable migration: at least one migration involving `orbits` executed using `CREATE INDEX CONCURRENTLY`; CI migration timeout gate in place (> 30s fails CI)
- [ ] Query plan regression CI job active: `tests/load/check_query_baselines.py` runs after each migration in staging; fails if any baseline query execution time increases > 2× vs recorded baseline; PR comment generated with comparison table
- [ ] `ws_connected_clients` Prometheus gauge reporting per backend instance; Grafana alert configured at 400 (WARNING) — verified by injecting 5 synthetic WebSocket connections and confirming gauge increments
- [ ] Space weather backfill cap: integration test simulates 24-hour ingest gap; verify ingest task logs `WARN` and backfills only last 6 hours; no duplicate timestamps written; `space_weather_daily` aggregate remains consistent
- [ ] CDN / static asset caching: `bundle-size` CI step active; PR comment shows bundle size delta; CI fails if main JS bundle grows > 10% vs. previous build; Caddy cache headers for `/_next/static/*` set `Cache-Control: public, max-age=31536000, immutable`
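
The query-baseline regression job above reduces to a comparison of recorded versus current execution times. A sketch of the core check in `tests/load/check_query_baselines.py` (the file name comes from the checklist; the data shapes are assumptions):

```python
def regressions(baseline_ms: dict[str, float], current_ms: dict[str, float],
                factor: float = 2.0) -> list[tuple[str, float, float]]:
    """Return (query, baseline, current) for every query slower than factor x baseline."""
    slow = []
    for query, base in sorted(baseline_ms.items()):
        current = current_ms.get(query)
        if current is not None and current > base * factor:
            slow.append((query, base, current))
    return slow
```

A non-empty result fails the job, and the same tuples feed the PR comparison-table comment.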
**Phase 2 Legal / Compliance:**
- [ ] **Regulatory classification ADR committed:** `docs/adr/0012-regulatory-classification.md` documents the chosen position (Position A — ATM/ANS Support Tool, non-safety-critical) with rationale; legal counsel has reviewed the position against EASA IR 2017/373; position is referenced in all ANSP service contracts
- [ ] Legal opinion received for primary deployment jurisdiction; `legal_opinions` table updated with `shadow_mode_cleared = TRUE`; shadow mode admin toggle no longer shows legal warning for that jurisdiction
- [ ] Space-Track AUP redistribution clarification obtained (written); legal position documented; AUP click-wrap wording updated to reflect agreed terms
- [ ] **ESA DISCOS redistribution rights clarified (written):** Written confirmation from ESA/ESAC on permissible use of DISCOS-derived properties in commercial API responses and generated reports; if redistribution is not permitted, API response and report templates updated to show `source: estimated` rather than raw DISCOS values
- [ ] **GDPR DPA signed with each shadow ANSP partner before shadow mode begins:** DPA template reviewed by counsel; executed DPA on file for each organisation before `shadow_mode_cleared` is set to `TRUE`; data processing not permitted for any ANSP organisation without a signed DPA
- [ ] GDPR data inventory documented; pseudonymisation procedure `handle_erasure_request()` implemented and tested: user deleted → name/email replaced with `[user deleted - ID:{hash}]` in `alert_events`/`security_logs`; core safety records preserved
- [ ] Jurisdiction screening at user registration: sanctioned-country check fires before account creation; blocked attempt logged to `security_logs` type `REGISTRATION_BLOCKED_SANCTIONS`
- [ ] MSA template reviewed by aviation law counsel; Regulatory Sandbox Agreement template finalised; first shadow mode deployment covered by a signed Regulatory Sandbox Agreement on file
- [ ] Controlled Re-entry Planner carries in-platform export control notice; `data_source_acknowledgement = TRUE` enforced before API key issuance (integration test: attempt to create API key without acknowledgement returns 403)
- [ ] Professional indemnity, cyber liability, and product liability insurance confirmed in place before first shadow deployment; certificates stored in MinIO `legal-docs` bucket
- [ ] **Shadow mode exit criteria documented and tooled:** `docs/templates/shadow-mode-exit-report.md` exists; Persona B can generate exit statistics from admin panel; exit to operational use for any ANSP requires written Safety Department confirmation on file before `shadow_mode_cleared` is set
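
The pseudonymisation step of `handle_erasure_request()` above can be sketched as follows: a stable hash keeps audit rows for the same erased user linkable while removing identifying data. The hash length and exact derivation are assumptions beyond the `[user deleted - ID:{hash}]` template in the checklist:

```python
import hashlib

def erasure_placeholder(user_id: int) -> str:
    """Replacement string for name/email fields in alert_events/security_logs."""
    # Stable short hash: the same erased user always maps to the same token.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()[:12]
    return f"[user deleted - ID:{digest}]"
```

Deriving the token from the internal user ID rather than name or email means the placeholder itself carries no personal data, while incident reviews can still correlate an erased user's audit trail.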
**Phase 2 Technical Writing / Documentation:**
- [ ] `docs/user-guides/aviation-portal-guide.md` complete and reviewed by at least one Persona A representative before first ANSP shadow deployment; covers: dashboard overview, alert acknowledgement workflow, NOTAM draft workflow, degraded mode response
- [ ] `docs/api-guide/` complete: `authentication.md`, `rate-limiting.md`, `webhooks.md`, `error-reference.md`, Python and TypeScript quickstart examples; reviewed by a Persona E/F tester
- [ ] All public functions in `propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `integrity.py`, and `breakup/atmospheric.py` have Google-style docstrings with parameter units; `mypy` pre-commit hook enforces no untyped function signatures
- [ ] `docs/test-plan.md` complete: lists all test suites, physical invariant tested, reference source, pass/fail tolerance, and blocking classification; reviewed by physics lead
- [ ] `docs/adr/` contains ≥ 10 ADRs covering all consequential Phase 2 decisions added during the phase
- [ ] All runbooks referenced in the §21 DoD are complete (not stubs): `gdpr-breach-notification.md`, `safety-occurrence-notification.md`, `secrets-rotation-jwt.md`, `blue-green-deploy.md`, `restore-from-backup.md`
**Phase 2 Ethics / Algorithmic Accountability:**
- [ ] Model card published: `docs/model-card-decay-predictor.md` complete with validated orbital regime envelope, object type performance breakdown, known failure modes, and systematic biases; reviewed by the physics lead before Phase 2 ANSP shadow deployments
- [ ] Backcast validation report: ≥10 historical re-entry events validated; report documents selection criteria, identifies underrepresented object categories (small debris, tumbling objects), and states accuracy conditional on object type — not as a single unconditional figure; stored in MinIO `docs` bucket
- [ ] Out-of-distribution bounds defined: `docs/ood-bounds.md` specifies the threshold values for `ood_flag` triggers (area-to-mass ratio, minimum data confidence, minimum TLE count); CI test confirms all thresholds are checked in `propagator/decay.py`
- [ ] Alert threshold governance: any threshold change requires a PR reviewed by engineering lead + product owner; `docs/alert-threshold-history.md` entry created; change must complete a minimum 2-week shadow-mode validation period before deploying to any operational ANSP connection
- [ ] FIR coverage quality flag: `airspace` table has `data_source` and `coverage_quality` columns; intersection results for OpenAIP-sourced FIRs include a `coverage_quality: 'low'` flag in the API response; UI shows a coverage quality callout for non-AIRAC FIRs
- [ ] Recalibration governance documented: `docs/recalibration-procedure.md` exists specifying hold-out validation dataset, minimum accuracy improvement threshold (> 5% improvement on hold-out, no regression on any object type category), sign-off authority (physics lead + engineering lead), ANSP notification procedure
**Phase 2 Interoperability:**
- [ ] CCSDS OEM response: `GET /space/objects/{norad_id}/ephemeris` with `Accept: application/ccsds-oem` returns a valid CCSDS 502.0-B-3 OEM file; integration test validates all mandatory keyword fields (`OBJECT_ID`, `CENTER_NAME`, `REF_FRAME=GCRF`, `TIME_SYSTEM=UTC`, `START_TIME`, `STOP_TIME`) are present; test parses with a reference CCSDS OEM parser
- [ ] CCSDS CDM export: bulk export includes CDM-format conjunction records; mandatory CDM fields populated; `N/A` used per CCSDS 508.0-B-1 §4.3 for unknown values; integration test validates with reference CDM parser
- [ ] CDM ingestion display: Space-Track CDM Pc and SpaceCom-computed Pc both visible on conjunction panel with distinct provenance labels; `DATA_CONFIDENCE` warning fires when values differ by > 10× (integration test with synthetic divergent CDM)
- [ ] Alert webhook: `POST /webhooks` registers endpoint; synthetic `alert.new` event POSTed to registered URL within 5s of trigger; `X-SpaceCom-Signature` header present and verifiable with shared secret; retry fires on 500 response from webhook receiver (integration test with mock server)
- [ ] GeoJSON structured export: `GET /events/{id}/export?format=geojson` returns valid GeoJSON `FeatureCollection`; `properties` includes `norad_id`, `p50_utc`, `affected_fir_ids`, `risk_level`, `prediction_hmac`; validates against GeoJSON schema (RFC 7946)
- [ ] ADS-B feed: OpenSky Network integration active; live flight positions overlay on globe in aviation portal; route intersection advisory receives ADS-B flight tracks as input
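
The webhook item's `X-SpaceCom-Signature` header can be produced and verified with a shared-secret HMAC over the raw request body. A sketch; the `sha256=` prefix convention is an assumption, not a documented SpaceCom format:

```python
import hashlib
import hmac

def sign_webhook(body: bytes, secret: bytes) -> str:
    """Value for the X-SpaceCom-Signature header, computed over the raw body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(body: bytes, secret: bytes, header_value: str) -> bool:
    """Receiver-side check; constant-time compare resists timing attacks."""
    return hmac.compare_digest(sign_webhook(body, secret), header_value)
```

Signing the raw bytes (not a re-serialised JSON object) is what lets the receiver verify before parsing, which the integration test's tamper case depends on.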
**Phase 2 DevOps / Platform Engineering:**
- [ ] Staging environment spec documented: resources, data (synthetic only — no production data in staging), secrets set (separate from production), continuous deployment from `main` branch
- [ ] GitLab staging deploy job: merge to `main` triggers automatic staging deploy; production deploy requires manual approval in GitLab after staging smoke tests pass
- [ ] OWASP ZAP DAST run against staging in CI pipeline; results reviewed; 0 High/Critical required to unblock production deploy approval
- [ ] Secrets rotation runbooks written for all critical secrets: Space-Track credentials, JWT RS256 signing keypair, MinIO access keys, Redis `AUTH` password; each runbook includes: who initiates, affected services, zero-downtime rotation procedure, verification step, `security_logs` entry required
- [ ] JWT RS256 keypair rotation tested without downtime: old public key retained during 5-minute transition window; tokens signed with old key remain valid until expiry; verified by integration test
- [ ] Image retention container-registry lifecycle policy in place: untagged images purged weekly; staging images retained 30 days; dev images retained 7 days; policy verified in registry settings
- [ ] CI observability: GitLab pipeline duration tracked; image size delta posted as merge request comment (fail if > 20% increase); test failure rate visible in CI dashboard
- [ ] `alembic check` CI gate: no migration added a `NOT NULL` column without a default in the same step; CI job validates hypertable migrations use `CONCURRENTLY` (grep check on all new migration files)
### Phase 2 Additional Regulatory / Dual Domain Items:
- [ ] Shadow mode: admin can enable/disable per organisation; ShadowBanner displayed on all pages when active; shadow records have `shadow_mode = TRUE`; shadow records excluded from all operational API responses (integration test)
- [ ] NOTAM drafting: draft generated in ICAO Annex 15 format from any event with FIR intersection; mandatory regulatory disclaimer present (automated test verifies its presence in every draft); stored in `notam_drafts`
- [ ] Space Operator Portal: `space_operator` user can view only owned objects (non-owned objects return 404, not 403, to prevent object enumeration); ControlledReentryPlanner functional for `has_propulsion = TRUE` objects
- [ ] CCSDS export: ephemeris export in OEM format passes CCSDS 502.0-B-3 structural validation
- [ ] API keys: create, use, and revoke flow functional; per-key rate limiting returns 429 at daily limit; raw key displayed only at creation (never retrievable after)
|
||
- [ ] TIP message provenance displayed in UI: source label reads "USSPACECOM TIP (not certified aeronautical information)" — not just "TIP Message #N"
|
||
- [ ] Data confidence warnings: objects with `data_confidence = 'unknown'` display a warning callout on all prediction panels explaining the impact on prediction quality
|
||
|
||
### Phase 3 Complete When:

- [ ] Mode C (Monte Carlo Particles): animated trajectories render; clicking a particle shows its parameters

- [ ] Real-time alerts delivered within 30 seconds of trigger condition

- [ ] Geographic alert filtering: alerts scoped to user's FIR list

- [ ] Route intersection analysis functional against sample flight plans

- [ ] Feedback: density scaling recalibration demonstrated from ≥2 historical re-entries

- [ ] Load test: 100 concurrent users; CZML load < 2s at p95

- [ ] **External penetration test completed; all Critical/High findings remediated**

- [ ] Full axe-core audit + manual screen reader test (NVDA + VoiceOver) passes

- [ ] Secrets manager (Vault or equivalent) replacing Docker secrets for all production credentials

- [ ] All credentials on rotation schedule; rotation verified without downtime

- [ ] Prometheus + Grafana operational; certificate expiry alert configured

- [ ] Production deployment runbook documented; incident response procedure per threat scenario

- [ ] Security audit log shipping to external SIEM verified

- [ ] Shadow validation report generated for ≥1 historical re-entry event demonstrating prediction accuracy

- [ ] ECSS compliance artefacts produced: Software Management Plan, V&V Plan, Product Assurance Plan, Data Management Plan (required for ESA contract bids)

- [ ] TRL 6 demonstration: system demonstrated in operationally relevant environment with real TLE data, real space weather, and ≥1 ANSP shadow deployment

- [ ] Regulatory acceptance package complete: safety case framework, ICAO Annex 15 data quality mapping, SMS integration guide

- [ ] Legal opinion obtained on operational liability per target deployment jurisdictions (Australia, EU, UK minimum)

- [ ] First ANSP shadow mode deployment active with ≥4 weeks of shadow prediction records

**Phase 3 Infrastructure / HA:**

- [ ] Patroni configuration validated: `scripts/check_patroni_config.py` passes confirming `maximum_lag_on_failover`, `synchronous_mode: true`, `synchronous_mode_strict: true`, `wal_level: replica`, `recovery_target_timeline: latest` all present in `patroni.yml`

- [ ] Patroni failover drill: manually kill the primary DB container; verify standby promoted within 30s; backend API continues serving requests (latency spike acceptable; no 5xx errors after 35s); PgBouncer reconnects automatically to new primary

- [ ] MinIO EC:2 verified: 4-node MinIO starts cleanly; integration test writes a 100 MB object; shut down one MinIO node; read succeeds; write succeeds; shut down second node; write fails with expected error; read still succeeds (EC:2 read quorum = 2 of 4)

- [ ] WAF/DDoS protection confirmed in place at ingress (Cloudflare/AWS Shield or equivalent network-level appliance for on-premise); security architecture review sign-off

- [ ] DNS architecture documented: `docs/runbooks/dns-architecture.md` covers split-horizon zones, PgBouncer VIP, Redis Sentinel VIP, and service discovery records for Tier 3 deployment

- [ ] Backup restore test checklist completed successfully (see §34.5): all 6 checklist items passed within the 30-day window before Phase 3 sign-off

- [ ] TLS certificate lifecycle runbook complete: `docs/runbooks/tls-cert-lifecycle.md` documents ACME auto-renewal path and internal CA path for air-gapped deployments; cert expiry Prometheus alerts firing at 60/30/7-day thresholds

**Phase 3 Performance:**

- [ ] Formal load test passed: `tests/load/` scenario with k6 or Locust; 100 concurrent users; CZML catalog load < 2s p95; MC job submit < 500ms; alert WebSocket delivery < 30s; test report committed to `docs/validation/load-test-report-phase3.md`

- [ ] MC concurrency gate tested at scale: 10 simultaneous MC submissions across 5 organisations; each org receives `429` for its second request; no deadlock or Redis key leak observed; Celery worker queue depth remains bounded

- [ ] WebSocket subscriber ceiling verified: load test opens 450 connections to a single backend instance; 451st connection receives `HTTP 503`; `ws_connected_clients` gauge reads 450; scaling trigger fires at 400 (alert visible in Grafana)

- [ ] CZML delta adoption: Playwright E2E test confirms the frontend sends `?since=` parameter on all CZML polls after initial load; no full-catalog request occurs after page load in LIVE mode

- [ ] Bundle size CI gate active and green: final production build JS bundle documented; `bundle-size` CI step has passed for ≥2 consecutive deploys without manual override

---

## 22. Open Physics Questions for Engineering Review

1. **JB2008 vs NRLMSISE-00** — Recommend: NRLMSISE-00 for Phase 1 with a pluggable density model interface that accepts JB2008 in Phase 2 without API or schema changes.

2. **Covariance source for conjunction probability** — Recommend: SP ephemeris covariance from Space-Track for active payloads; empirical covariance with explicit UI warning for debris.

3. **Re-entry termination altitude** — Recommend: 80 km for Phase 1; parametric interface for Phase 2 breakup module (default 80 km, allow up to 120 km).

4. **F10.7 forecast horizon** — For objects re-entering 5–14 days out, NOAA 3-day forecasts have degraded skill. Recommend: 81-day smoothed average as baseline with ±20% MC variation; document clearly in the SpaceWeatherWidget and every prediction panel.

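The pluggable density model interface recommended in item 1 can be sketched as a typed protocol. This is a minimal illustration under stated assumptions: the names `DensityModel`, `ExponentialDensity`, and `drag_acceleration` are hypothetical, and the toy exponential atmosphere stands in for a real NRLMSISE-00 binding — it is not the project's actual implementation.

```python
import math
from typing import Protocol


class DensityModel(Protocol):
    """Pluggable density model interface (illustrative). Any implementation
    (NRLMSISE-00 now, JB2008 in Phase 2) exposes the same call signature,
    so the decay predictor never changes when the model is swapped."""

    name: str

    def density(self, alt_km: float, lat_deg: float, lon_deg: float,
                f107: float, f107a: float, ap: float) -> float:
        """Return atmospheric mass density in kg/m^3."""
        ...


class ExponentialDensity:
    """Toy stand-in: rho = rho0 * exp(-alt/H) with a fixed scale height.
    NOT a real thermosphere model — it only demonstrates the interface."""

    name = "exponential-toy"

    def __init__(self, rho0: float = 1.225, scale_height_km: float = 7.2):
        self.rho0 = rho0
        self.h = scale_height_km

    def density(self, alt_km, lat_deg, lon_deg, f107, f107a, ap):
        return self.rho0 * math.exp(-alt_km / self.h)


def drag_acceleration(model: DensityModel, alt_km: float, v_ms: float,
                      bc_m2_per_kg: float) -> float:
    """a_drag = 0.5 * rho * v^2 * (Cd*A/m). The space-weather arguments are
    passed through even when a simple model ignores them, so JB2008 can use
    them without an interface change."""
    rho = model.density(alt_km, 0.0, 0.0, 150.0, 150.0, 15.0)
    return 0.5 * rho * v_ms ** 2 * bc_m2_per_kg
```

The design point is that the decay predictor depends only on the protocol, never on a concrete model class — satisfying the "Phase 2 without API or schema changes" recommendation.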
---

## 23. Dual Domain Architecture

### 23.1 The Interface Problem

Two technically adjacent domains — space operations and civil aviation — manage debris re-entry hazards using incompatible tools, data formats, and operational vocabularies. The gap between them is the market.

```
SPACE DOMAIN              THE GAP                   AVIATION DOMAIN
────────────────          ──────────                ────────────────
TLE / SGP4                                          NOTAM
CDMs / TIP messages       No standard interface     FIR restrictions
CCSDS orbit products      No common tool            ATC procedures
Kp / F10.7 indices        No shared language        En-route charts
Probability of casualty   ← SpaceCom bridges this → Plain English hazard brief
```

### 23.2 Shared Physics Core

One physics engine serves both front doors. Neither domain gets a different model — they get different views of the same computation.

```
                ┌─────────────────────────────────┐
                │          PHYSICS CORE           │
                │  Catalog Propagator (SGP4)      │
                │  Decay Predictor (RK7(8)+NRLMS) │
                │  Monte Carlo ensemble           │
                │  Conjunction Screener           │
                │  Atmospheric Breakup (ORSAT)    │
                │  Frame transforms (TEME→WGS84)  │
                └────────────┬────────────────────┘
                             │
           ┌─────────────────┴─────────────────┐
           │                                   │
┌──────────▼───────────┐          ┌────────────▼──────────┐
│  SPACE DOMAIN UI     │          │  AVIATION DOMAIN UI   │
│  /space portal       │          │  / (operational view) │
│  Persona E, F        │          │  Persona A, B, C      │
│                      │          │                       │
│  State vectors       │          │  Hazard corridors     │
│  Covariance matrices │          │  FIR intersection     │
│  CCSDS formats       │          │  NOTAM drafts         │
│  Deorbit windows     │          │  Plain-language status│
│  API keys            │          │  Alert acknowledgement│
│  Conjunction data    │          │  Gantt timeline       │
└──────────────────────┘          └───────────────────────┘
```

### 23.3 Domain-Specific Output Formats

| Output | Space Domain | Aviation Domain |
|--------|-------------|----------------|
| Trajectory | CCSDS OEM (state vectors) | CZML (J2000 INERTIAL for CesiumJS) |
| Re-entry prediction | p05/p50/p95 times + covariance | Percentile corridor polygons on globe |
| Hazard | Probability of casualty (Pc) value | Risk level (LOW/MEDIUM/HIGH/CRITICAL) |
| Uncertainty | Monte Carlo ensemble statistics | Corridor width visual encoding |
| Conjunction | CDM-format Pc value | Not surfaced to Persona A |
| Space weather | F10.7 / Ap / Kp raw indices | "Elevated activity — wider uncertainty" |
| Deorbit plan | CCSDS manoeuvre plan | Corridor risk map on globe |

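The Hazard row in the table implies a translation from a raw Pc value (space domain) to a four-band risk level (aviation domain). A minimal sketch of that mapping follows — the threshold values here are illustrative placeholders only; the real bands must come from the Phase 3 accuracy characterisation and ANSP SMS acceptance, not from this sketch.

```python
def pc_to_risk_level(prob_casualty: float) -> str:
    """Map probability of casualty (Pc) to the aviation-facing risk bands.
    Thresholds below are ASSUMED for illustration, not calibrated values."""
    if not 0.0 <= prob_casualty <= 1.0:
        raise ValueError("Pc must be in [0, 1]")
    if prob_casualty >= 1e-4:
        return "CRITICAL"
    if prob_casualty >= 1e-5:
        return "HIGH"
    if prob_casualty >= 1e-6:
        return "MEDIUM"
    return "LOW"
```

Both domains see the same Pc computation; only the presentation differs, consistent with the shared physics core principle in §23.2.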
### 23.4 Competitive Position

| Competitor | Their Strength | SpaceCom Advantage |
|-----------|---------------|-------------------|
| **ESA ESOC Re-entry Prediction Service** | Authoritative technical product; longest-running service | Aviation-facing operational UX; ANSP decision support; NOTAM drafting; multi-ANSP coordination |
| **OKAPI:Orbits + DLR + TU Braunschweig** | Academic orbital mechanics depth; space operator integrations | Purpose-built ANSP interface; controlled re-entry planner; shadow mode for regulatory adoption |
| **Aviation weather vendors (e.g., StormGeo)** | Deep ANSP relationships; established procurement pathways | Space domain physics credibility; TLE/CDM ingestion; conjunction screening |
| **General STM platforms** | Broad catalog management | Operational decision support depth; aviation integration layer |

SpaceCom's moat is the combination of space physics credibility AND aviation operational usability. Neither side alone is sufficient to win regulated aviation authority contracts.

**Differentiation capabilities — must be maintained regardless of competitor moves (Finding 4):**

These are the capabilities that competitors cannot quickly replicate and that directly determine whether ANSPs and institutional buyers choose SpaceCom over alternatives:

| Capability | Why it matters | Maintenance requirement |
|---|---|---|
| ANSP operational workflow integration | NOTAM drafting, multi-ANSP coordination, and shadow mode are purpose-built for ANSP operations — not retrofitted | Must be validated with ≥ 2 ANSP safety teams before Phase 2 shadow deployment |
| Regulatory adoption path | Shadow mode + exit criteria + ANSP Safety Department sign-off creates a documented adoption trail that institutional procurements require | Shadow mode exit report template must remain current; exit statistics generated automatically |
| Physics + aviation in one product | Neither a pure orbital analytics tool nor a pure aviation tool can cover both sides without the other's domain expertise | Dual-domain architecture (§23) must be maintained; any feature removal from either domain triggers an ADR |
| ESA/DISCOS data integration | Institutional credibility with ESA and national space agencies depends on using authoritative ESA data sources | DISCOS redistribution rights must be resolved before Phase 2; integration maintained as P1 data source |

A `docs/competitive-analysis.md` document (maintained by the product owner, reviewed quarterly) tracks competitor feature releases and assesses impact on these claims. Any competitor capability that closes a differentiation gap triggers a product review within 30 days.

### 23.5 SWIM Integration Path

European ANSPs increasingly exchange operational data via SWIM (System Wide Information Management), defined by ICAO Doc 10039 and implemented in Europe via EUROCONTROL SWIM-TI (AMQP/MQTT transport, FIXM/AIXM 5.1 schemas). Full SWIM compliance is a Phase 3+ target; the path is:

| Phase | Deliverable | Standard |
|-------|-------------|----------|
| Phase 2 | GeoJSON structured event export (`/events/{id}/export?format=geojson`) with ICAO FIR IDs and prediction metadata | GeoJSON + ISO 19115 metadata |
| Phase 3 | Review FIXM Core 4.x schema for re-entry hazard representation; define SpaceCom extension namespace | FIXM Core 4.2 |
| Phase 3 | SWIM-TI AMQP endpoint (publish-only) for `alert.new` and `tip.new` events to EUROCONTROL Network Manager B2B service | EUROCONTROL SWIM-TI Yellow Profile |

Phase 2 GeoJSON export is the immediate deliverable. Phase 3 SWIM-TI integration is scoped but requires a EUROCONTROL B2B service account and FIXM schema extension review — neither is blocking for Phase 1 or 2.

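The shape of the Phase 2 structured event export can be sketched as a single GeoJSON Feature carrying the corridor polygon, affected ICAO FIR identifiers, and prediction metadata. Property names (`fir_ids`, `p50_utc`, `model_version`) are assumptions for illustration — the authoritative schema lives with the `/events/{id}/export` endpoint and its ISO 19115 metadata.

```python
def reentry_event_geojson(corridor_coords: list, fir_ids: list, prediction_meta: dict) -> dict:
    """Assemble a GeoJSON Feature for the Phase 2 event export sketch.
    corridor_coords is a closed linear ring (first point == last point);
    fir_ids is a list of ICAO FIR identifiers intersected by the corridor."""
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [corridor_coords]},
        "properties": {
            "fir_ids": fir_ids,   # e.g. ["YBBB"] — illustrative
            **prediction_meta,    # p05/p50/p95 times, model version, etc.
        },
    }
```

A caller would pass the p95 corridor ring and the prediction record's metadata; the result serialises directly with `json.dumps`.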
---

## 24. Regulatory Compliance Framework

### 24.1 The Regulatory Gap SpaceCom Operates In

There is currently **no binding international regulatory framework** governing re-entry debris hazard notifications to civil aviation. SpaceCom operates at the boundary between two regulatory regimes that have not yet formally agreed on how to bridge them.

This creates risk (no approved pathway to slot into) but also opportunity (SpaceCom can help define the standard and accumulate first-mover evidence).

### 24.2 Liability and Operational Status

**Legal opinion is a Phase 2 gate, not a Phase 3 task.** Shadow mode deployments with ANSPs must not occur without a completed legal opinion for the deployment jurisdiction. "Advisory only" UI labelling is not contractual protection — liability limitation must be in executed agreements. In common law jurisdictions (Australia, UK, US), a voluntary undertaking of responsibility to a known class of relying professionals can create a duty of care regardless of disclaimers (*Hedley Byrne & Co v Heller* and equivalents). Shadow mode activation in the admin panel is gated by `legal_opinions.shadow_mode_cleared = TRUE` for the organisation's jurisdiction.

**Legal opinion scope** (per deployment jurisdiction — Australia, EU, UK, US minimum):

- Whether "decision support information" labelling limits liability for incorrect predictions that inform airspace decisions

- Whether the platform creates duty-of-care obligations regardless of labelling

- Whether Space-Track data redistribution via the SpaceCom API requires a separate licensing agreement with 18th Space Control Squadron

- Whether CDM data (national security-adjacent) is subject to export controls in target jurisdictions

- Whether the Controlled Re-entry Planner falls under ECCN 9E515 (spacecraft operations technical data) for non-US users

**Operational status classification** for SpaceCom outputs — not a UI label, a formal determination made in consultation with the ANSP's legal and SMS teams:

- *Aeronautical information* (ICAO Annex 15) — highest standard; triggers data quality obligations

- *Decision support information* — intermediate; requires formal ANSP SMS acceptance

- *Situational awareness information* — lowest; advisory only; no procedural authority

**Commercial contract requirements — three instruments required before any access:**

1. **Master Services Agreement (MSA)** — executed before any ANSP or space operator accesses the system. Must be reviewed by aviation law counsel. Minimum required terms:

   - Limitation of liability: capped at 12 months of fees paid, or a fixed cap for government/sovereign customers (to be determined by counsel)

   - Exclusion of consequential and indirect loss

   - Explicit statement that SpaceCom outputs are decision support information, not certified aeronautical information and not a substitute for ANSP operational procedures

   - ANSP's acknowledgement that they retain full authority and responsibility for all operational decisions

   - SLOs from §26.1 incorporated by reference

   - Governing law and jurisdiction clause

   - Data Processing Agreement (DPA) addendum for GDPR-scope deployments (see §29)

   - Right to suspend service without liability for maintenance, degraded mode, data quality concerns, or active security incidents

2. **Acceptable Use Policy (AUP)** — click-wrap accepted in-platform at first login, recorded in `users.tos_accepted_at`, `users.tos_version`, and `users.tos_accepted_ip`. Must re-accept when version changes (system blocks access until accepted). Includes:

   - Acknowledgement that orbital data originates from Space-Track, subject to Space-Track terms

   - Prohibition on redistributing SpaceCom-derived data to third parties without written consent

   - Acknowledgement that the platform is decision support only, not certified aeronautical information

   - Export control acknowledgement (user is responsible for compliance in their jurisdiction)

3. **API Terms** — embedded in the API key issuance flow for Persona E/F programmatic access. Accepted at key creation; recorded against the `api_keys` record. Includes the Space-Track redistribution acknowledgement and the export control notice.

**Space-Track data redistribution gate (F3):** Space-Track.org Terms of Service prohibit redistribution of TLE data to non-registered entities. The SpaceCom API must not serve TLE-derived fields (raw TLE strings, `tle_epoch`, `tle_line1/2`) to organisations that have not confirmed Space-Track registration. Implementation:

```sql
-- Add to organisations table
ALTER TABLE organisations ADD COLUMN space_track_registered BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN space_track_registered_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN space_track_username TEXT; -- for audit
```

API middleware check (applied to any response containing TLE-derived fields):

```python
from fastapi import HTTPException


def check_space_track_gate(org: Organisation) -> None:
    if not org.space_track_registered:
        raise HTTPException(
            status_code=403,
            detail="TLE-derived data requires Space-Track registration. "
                   "Register at space-track.org and confirm in your organisation settings.",
        )
```

All TLE-derived disclosures are logged in `data_disclosure_log`:

```sql
CREATE TABLE data_disclosure_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID NOT NULL REFERENCES organisations(id),
    source TEXT NOT NULL,  -- 'space_track', 'esa_sst', etc.
    endpoint TEXT NOT NULL,
    disclosed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    record_count INTEGER
);
CREATE INDEX ON data_disclosure_log (org_id, source, disclosed_at DESC);
```

**Contracts table and MRR tracking (F1, F4, F9 — §68):**

The `contracts` table enforces that feature access is gated on commercial state, provides MRR data for the commercial team, and records discount approval for audit:

```sql
CREATE TABLE contracts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID NOT NULL REFERENCES organisations(id),  -- matches data_disclosure_log.org_id
    contract_type TEXT NOT NULL
        CHECK (contract_type IN ('sandbox','professional','enterprise','on_premise','internal')),
    -- Financial terms
    monthly_value_cents INTEGER NOT NULL DEFAULT 0,  -- 0 for sandbox/internal
    currency CHAR(3) NOT NULL DEFAULT 'EUR',
    discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
        CHECK (discount_pct >= 0 AND discount_pct <= 100),
    -- Discount approval guard (F4): discounts >20% require second approver
    discount_approved_by INTEGER REFERENCES users(id),  -- NULL if discount_pct <= 20
    discount_approval_note TEXT,
    -- Term
    valid_from TIMESTAMPTZ NOT NULL,
    valid_until TIMESTAMPTZ NOT NULL,
    auto_renew BOOLEAN NOT NULL DEFAULT FALSE,
    -- Feature access — what this contract enables
    enables_operational_mode BOOLEAN NOT NULL DEFAULT FALSE,
    enables_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE,
    enables_api_access BOOLEAN NOT NULL DEFAULT FALSE,
    -- Audit
    created_by INTEGER REFERENCES users(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    signed_msa_at TIMESTAMPTZ,  -- NULL until MSA countersigned
    msa_document_ref TEXT,      -- path in MinIO legal bucket
    -- Professional Services (F10)
    ps_value_cents INTEGER NOT NULL DEFAULT 0,  -- one-time PS revenue on this contract
    ps_description TEXT
);
CREATE INDEX ON contracts (org_id, valid_until DESC);
-- Active-contract lookup. A partial index predicate may not call NOW()
-- (predicates must be immutable), so index the column and filter at query time.
CREATE INDEX ON contracts (valid_until);

-- Constraint: discounts >20% must have a named approver
ALTER TABLE contracts ADD CONSTRAINT discount_approval_required
    CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL);
```

**Feature access enforcement (F1):** Feature flags in `organisations` must be set from the active contract, not by admin toggle alone. A Celery task (`tasks/commercial/sync_feature_flags.py`) runs nightly and on contract creation/update to sync `organisations.feature_multi_ansp_coordination` from the active contract's `enables_multi_ansp_coordination`. An admin toggle that disagrees with the active contract is overwritten by the nightly sync.

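The pure decision at the heart of the sync task can be sketched as follows. This is an illustration, not the task's actual code: contract dicts mirror the `contracts` table columns, and the tie-break rule (most recently started active contract wins) is an assumption for the sketch.

```python
from datetime import datetime, timezone


def desired_feature_flags(contracts: list, now: datetime) -> dict:
    """Compute the organisation feature flags implied by the active contract.
    An admin toggle that disagrees with this result is overwritten."""
    active = [c for c in contracts if c["valid_from"] <= now <= c["valid_until"]]
    if not active:
        # No active contract: all contract-gated features off.
        return {"feature_multi_ansp_coordination": False,
                "feature_operational_mode": False,
                "feature_api_access": False}
    # ASSUMPTION: most recently started active contract wins.
    c = max(active, key=lambda c: c["valid_from"])
    return {
        "feature_multi_ansp_coordination": c["enables_multi_ansp_coordination"],
        "feature_operational_mode": c["enables_operational_mode"],
        "feature_api_access": c["enables_api_access"],
    }
```

Keeping this decision pure (no database access) makes the Celery task trivially testable: the task only reads contracts, calls the function, and writes the result back.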
**MRR dashboard (F9):** Add a Grafana panel (internal dashboard, not customer-facing) showing current MRR:

```sql
-- Recording rule or direct query:
SELECT SUM(monthly_value_cents) / 100.0 AS mrr_eur
FROM contracts
WHERE valid_from <= NOW() AND valid_until >= NOW()
  AND contract_type NOT IN ('sandbox', 'internal');
```

Expose as `spacecom_mrr_eur` Prometheus gauge updated by the nightly `sync_feature_flags` task. Grafana panel: *"Current MRR (€)"* — single stat panel, comparison to previous month.

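For reference, the value the nightly task would publish as the gauge is a direct Python mirror of the SQL above — a sketch assuming in-memory contract records with the same fields as the `contracts` table:

```python
def current_mrr_eur(contracts: list, now) -> float:
    """Sum monthly_value_cents over active, billable contracts; convert to EUR.
    Mirrors the MRR SQL: sandbox and internal contracts are excluded."""
    cents = sum(
        c["monthly_value_cents"] for c in contracts
        if c["valid_from"] <= now <= c["valid_until"]
        and c["contract_type"] not in ("sandbox", "internal")
    )
    return cents / 100.0
```

Having one definition of "billable, active" in both the SQL and the task code avoids the gauge and the dashboard query drifting apart.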
**Export control screening (F4):** ITAR 22 CFR §120.15 and EAR 15 CFR §736 prohibit providing certain SSA capabilities to nationals of embargoed countries and denied parties. Required at organisation onboarding:

```sql
ALTER TABLE organisations ADD COLUMN country_of_incorporation CHAR(2); -- ISO 3166-1 alpha-2
ALTER TABLE organisations ADD COLUMN export_control_screened_at TIMESTAMPTZ;
ALTER TABLE organisations ADD COLUMN export_control_cleared BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE organisations ADD COLUMN itar_cleared BOOLEAN NOT NULL DEFAULT FALSE; -- US-person or licensed
```

Onboarding flow:

1. Collect `country_of_incorporation` at registration

2. Flag embargoed countries (CU, IR, KP, RU, SY) for manual review — account held in `PENDING_EXPORT_REVIEW` state

3. Screen organisation name against BIS Entity List (automated lookup; manual review on partial match)

4. EU-SST-derived data gated behind `itar_cleared = TRUE` (EU-SST has its own access restrictions for non-EU entities)

5. All screening decisions logged with reviewer ID and date

Documented in `legal/EXPORT_CONTROL_POLICY.md`. Legal counsel review required before any deployment that could serve US-origin technical data (TLE from 18th Space Control Squadron) to non-US persons.

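The first-pass routing decision in steps 2–3 can be sketched as a pure function. The `PENDING_MANUAL_REVIEW` state name is an assumption for illustration (only `PENDING_EXPORT_REVIEW` is named above), and the BIS Entity List lookup itself is out of scope here — the function only routes on its boolean result:

```python
# Embargoed-country list per step 2 of the onboarding flow.
EMBARGOED = {"CU", "IR", "KP", "RU", "SY"}


def screening_state(country_of_incorporation: str, entity_list_match: bool) -> str:
    """Route a new organisation to its initial screening state.
    Anything other than CLEARED requires a named human reviewer (step 5)
    before the account is activated."""
    if country_of_incorporation in EMBARGOED:
        return "PENDING_EXPORT_REVIEW"
    if entity_list_match:
        return "PENDING_MANUAL_REVIEW"  # ASSUMED state name
    return "CLEARED"
```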
**Regulatory Sandbox Agreement** — a lightweight 2-page letter of understanding required before any ANSP shadow mode activation. Specifies:

- Trial period start and end dates

- ANSP's confirmation that SpaceCom outputs are for internal validation only (not operational)

- SpaceCom's commitment to produce a shadow validation report at trial end

- Data protection terms for the trial period

- How incidents during the trial are handled by both parties

- Mutual agreement that the trial does not create any ongoing commercial obligation

**Regulatory sandbox liability clarification (F11 — §61):** The sandbox agreement is not a liability shield by itself. During shadow mode, SpaceCom is a tool under evaluation — liability exposure depends on how the ANSP uses outputs and what the sandbox agreement says about consequences of errors. Required provisions:

- **No operational reliance clause:** ANSP certifies in writing that no operational decisions will be made on the basis of SpaceCom outputs during the trial. Any breach of this clause by the ANSP shifts liability to the ANSP.

- **Incident notification:** If a SpaceCom output error is identified during the trial, SpaceCom notifies the ANSP within 2 hours (matching the safety occurrence runbook at §26.8). The sandbox agreement specifies whether this constitutes a notifiable occurrence under the ANSP's SMS.

- **Indemnification cap:** SpaceCom's aggregate liability during the sandbox period is capped at AUD/EUR 50,000 (or local equivalent). Catastrophic loss claims are excluded (consistent with MSA terms).

- **Insurance requirement:** SpaceCom must carry professional indemnity insurance with minimum cover AUD/EUR 1 million before activating any sandbox with an ANSP. Certificate of currency provided to the ANSP before activation.

- **Regulatory notification duty:** If the ANSP's safety regulator requires notification of third-party tool trials (e.g., EASA, CASA, CAA), that obligation rests with the ANSP. SpaceCom provides a one-page system description document to support the ANSP's notification.

- **Sandbox ≠ approval pathway:** A successful sandbox trial is evidence for a future regulatory submission — it is not itself an approval. Neither party should represent the sandbox as a form of regulatory acceptance.

`legal/SANDBOX_AGREEMENT_TEMPLATE.md` captures the standard text. Legal counsel review required before any amendment.

The shadow mode admin toggle must display a warning if no Regulatory Sandbox Agreement is on record (`legal_opinions.shadow_mode_cleared = FALSE` for the org's jurisdiction):

```
⚠ No legal clearance on record for this organisation's jurisdiction.
  Shadow mode should not be activated without a completed legal opinion
  and a signed Regulatory Sandbox Agreement.
  [View legal status →]
```

### 24.3 ICAO Data Quality Mapping (Annex 15)

SpaceCom outputs that may enter aeronautical information channels must be characterised against ICAO's five data quality attributes:

| Attribute | SpaceCom Characterisation | Required Action |
|-----------|--------------------------|----------------|
| **Accuracy** | Decay predictor accuracy characterised from ≥10 historical re-entry backcasts vs. The Aerospace Corporation database. Published as a formal accuracy statement in `GET /api/v1/reentry/predictions/{id}` response. | Phase 3: produce accuracy characterisation document |
| **Resolution** | Corridor boundaries expressed as geographic polygons with stated precision. Position uncertainty stated as formal resolution value in prediction response. | Included in prediction API response from Phase 1 |
| **Integrity** | HMAC-SHA256 on all prediction and hazard zone records. Integrity assurance level: *Essential* (1×10⁻⁵). Documented in system description. | Implemented Phase 1 (§7.9) |
| **Traceability** | Full parameter provenance in `simulations.params_json` and prediction records. Accessible to regulatory auditors via dedicated API. | Phase 1 |
| **Timeliness** | Maximum latency from TIP message ingestion to updated prediction available: 30 minutes. Maximum latency from NOAA SWPC space weather update to prediction recalculation: 4 hours. Published as formal SLA. | Phase 3 SLA document |

**F5 — Completeness attribute and ICAO Annex 15 §3.2 data quality classification (§61):**

ICAO Annex 15 §3.2 defines a sixth implicit attribute — **Completeness** — meaning all data fields required by the receiving system are present and within range. SpaceCom must:

- Define a formal completeness schema for each prediction response (required fields, allowed nulls, value ranges)

- Return `data_quality.completeness_pct` in the prediction response (fields present / fields required × 100)

- Reject predictions with completeness < 90% from the alert pipeline (alert not generated; operator notified of incomplete prediction)

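The completeness computation and the 90% alert gate above reduce to a few lines. A minimal sketch, assuming the prediction is a flat dict and the required-field list comes from the formal completeness schema:

```python
def completeness_pct(prediction: dict, required_fields: list) -> float:
    """Completeness as defined above: fields present (and non-null)
    divided by fields required, × 100."""
    present = sum(1 for f in required_fields if prediction.get(f) is not None)
    return 100.0 * present / len(required_fields)


def passes_alert_gate(prediction: dict, required_fields: list) -> bool:
    """Predictions under 90% completeness are rejected from the alert
    pipeline (alert not generated; operator notified instead)."""
    return completeness_pct(prediction, required_fields) >= 90.0
```

Real prediction responses are nested, so the production schema would address fields by path rather than flat key — the gate logic is unchanged.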
**ICAO data category and classification** required in the prediction response (Annex 15 Table A3-1):

| Field | Value |
|-------|-------|
| `data_category` | `AERONAUTICAL_ADVISORY` (until formal AIP entry process established) |
| `originator` | `SPACECOM` + system version string |
| `effective_from` | ISO 8601 UTC timestamp |
| `integrity_assurance` | `ESSENTIAL` (1×10⁻⁵ probability of undetected error) |
| `accuracy_class` | `CLASS_2` (advisory, not certified — until accuracy characterisation completes Phase 3 validation) |

Formal accuracy characterisation (`docs/validation/ACCURACY_CHARACTERISATION.md`) is a Phase 3 gate before the API can be presented to any ANSP as meeting Annex 15 data quality standards.

### 24.4 Safety Management System Integration

Any ANSP formally adopting SpaceCom must include it in their SMS (ICAO Annex 19). SpaceCom provides the following artefacts to support ANSP SMS assessment:

**Hazard register (SpaceCom's contribution to the ANSP's SMS — F3, §61 structured format):**

Maintained as `docs/safety/HAZARD_LOG.md`. Each hazard uses the structured schema below. Hazard IDs are permanent — retired hazards are marked CLOSED, not deleted.

| ID | Description | Cause | Effect | Mitigations | Severity | Likelihood | Risk Level | Status |
|----|-------------|-------|--------|-------------|----------|------------|------------|--------|
| HZ-001 | SpaceCom unavailable during active re-entry event | Infrastructure failure; deployment error; DDoS | ANSP cannot access current re-entry prediction during event window | Patroni HA failover (§26.3); 15-min RTO SLO; automated ANSP push notification + email; documented fallback procedure | Hazardous | Low (SLO 99.9%) | Medium | OPEN |
| HZ-002 | False all-clear prediction (false negative — corridor misses actual impact zone) | TLE age; atmospheric model error; MC sampling variance; adversarial data manipulation | ANSP issues all-clear; aircraft enters debris corridor | HMAC integrity check; dual-source TLE validation; TIP cross-check guard; shadow validation evidence; accuracy characterisation (Phase 3); `@pytest.mark.safety_critical` tests | Catastrophic | Very Low | High | OPEN |
| HZ-003 | False hazard prediction (false positive — corridor over-stated) | Atmospheric model conservatism; TLE propagation error | Unnecessary airspace restriction; operational disruption; credibility loss | Cross-source TLE validation; HMAC; p95 corridor with stated uncertainty; accuracy characterisation | Major | Low | Medium | OPEN |
| HZ-004 | Corridor displayed in wrong reference frame | ECI/ECEF/geographic frame conversion error; CZML frame parameter misconfiguration | Corridor shown at wrong lat/lon; operator makes decisions on incorrect geographic basis | Frame transform unit tests against IERS references (§17); CZML frame convention enforced via CI | Hazardous | Very Low | Medium | OPEN |
| HZ-005 | Outdated prediction served (stale data) | Ingest pipeline failure; TLE source outage; cache not invalidating | Operator sees prediction that no longer reflects current orbital state | Data staleness indicators in UI; automated stale alert to operators; ingest health monitoring; CZML cache invalidation triggers (§35) | Major | Low | Medium | OPEN |
| HZ-006 | Prediction integrity failure (HMAC mismatch) | Database modification; backup restore error; storage corruption | Prediction record cannot be verified; may have been tampered with | Prediction quarantined automatically; CRITICAL security alert; prediction withheld from API | Catastrophic | Very Low | High | OPEN |
| HZ-007 | Unauthorised access to prediction data | Compromised credentials; RLS bypass; API misconfiguration | Competitor or adversary obtains early re-entry corridor data; potential ITAR exposure | PostgreSQL RLS; JWT validation; rate limiting; `security_logs` audit trail; penetration testing | Major | Low | Medium | OPEN |

**Hazard log governance:**

- Review: quarterly, and after each SEV-1 incident, model version update, or material system change

- New hazards identified during safety occurrence reporting are added within 5 business days

- Risk level = Severity × Likelihood using EUROCAE ED-153 risk classification matrix

- OPEN hazards with `High` risk level are Phase 2 gate blockers — must reach `MITIGATED` before ANSP shadow activation

**System safety classification:** Safety-related (not safety-critical under DO-278A). Relevant components target the SAL-2 assurance level (see §24.13). Development assurance standard: EUROCAE ED-78A equivalent for relevant components.

**Change management:** SpaceCom must notify all ANSP users before model version updates that affect prediction outputs. Version changes are tracked in `simulations.model_version` and surfaced in the UI.

### 24.5 NOTAM System Interface

SpaceCom's position in the NOTAM workflow:

```
SpaceCom generates → NOTAM draft (ICAO format) → Reviewed by Persona A → Submitted by authorised NOTAM originator → Issued NOTAM
```

SpaceCom never submits NOTAMs. The draft is a decision support artefact. The mandatory disclaimer on every draft is a non-removable regulatory requirement, not a UI preference.

**NOTAM timing requirements by jurisdiction:**

- Routine NOTAMs: 24–48 hours minimum lead time
- Short-notice (re-entry window < 24 hours): ASAP; NOTAM issued with minimum lead time
- SpaceCom alert thresholds align with these: CRITICAL alert at < 6h, HIGH at < 24h

### 24.6 Space Law Considerations

**UN Liability Convention (1972):** All SpaceCom prediction records, simulation runs, and alert acknowledgements may be legally discoverable in an international liability claim. The immutable audit trail (§7.9) is partially an evidence preservation mechanism. Retain `reentry_predictions`, `alert_events`, `notam_drafts`, and `shadow_validations` for a minimum of 7 years.

**National space laws with re-entry obligations:**

- **Australia:** Space (Launches and Returns) Act 2018. CASA and the Australian Space Agency have coordination protocols. SpaceCom's controlled re-entry planner outputs are suitable as evidence for operator obligations under this Act.
- **EU/ESA:** EU Space Programme Regulation; ESA Zero Debris Charter. SpaceCom supports Zero Debris by characterising re-entry risk and supporting responsible end-of-life planning.
- **US:** FAA AST re-entry licensing generates data that SpaceCom should ingest when available. 51 USC Chapter 509 obligations may affect US space operator customers.

**Space Traffic Management evolution:** The US Office of Space Commerce is developing civil STM frameworks that may eventually replace Space-Track as the primary civil space data source. SpaceCom's ingest architecture must remain adaptable: source URLs are centralised as constants in `ingest/sources.py`, so switching sources is a one-file change.

### 24.7 ICAO Framework Alignment

**Existing:** ICAO Doc 10100 (Manual on Space Weather Information, 2019) designates three ICAO-recognised Space Weather Centres (NOAA SWPC, ESA/ESAC, Japan Meteorological Agency). SpaceCom's space weather widget must reference these designated centres by name and ICAO recognition status.

**Emerging re-entry guidance:** ICAO is in early stages of developing re-entry hazard notification guidance (no published document as of 2025). SpaceCom should:

- Monitor ICAO Air Navigation Commission and Meteorology Panel working group outputs
- Design hazard corridor outputs in a format that parallels SIGMET structure (the closest existing ICAO framework: WHO/WHAT/WHERE/WHEN/INTENSITY/FORECAST) — this positions SpaceCom well for whatever standard emerges
- Consider engaging ICAO working groups as a stakeholder; SpaceCom could become a reference implementation

**SIGMET parallel structure for re-entry corridor outputs:**

```
REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)
WHO: CZ-5B ROCKET BODY / NORAD 44878
WHAT: UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE
WHERE: CORRIDOR 18S115E TO 28S155E / FL000 TO UNL
WHEN: FROM 2026031614 TO 2026031622 UTC / WINDOW ±4H (P95)
RISK: HIGH / LAND AREA IN CORRIDOR: 12%
FORECAST: CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H
SOURCE: SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3
```

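A sketch of how such an advisory could be rendered from prediction fields. The input dictionary and function name are assumptions for illustration; the output layout follows the example above.

```python
def format_reentry_advisory(p: dict) -> str:
    """Render prediction fields into the SIGMET-parallel advisory layout."""
    lines = [
        "REENTRY ADVISORY (SpaceCom format; parallel to SIGMET structure)",
        f"WHO: {p['object_name']} / NORAD {p['norad_id']}",
        f"WHAT: {p['event_type']}",
        f"WHERE: CORRIDOR {p['corridor']} / {p['levels']}",
        f"WHEN: FROM {p['start']} TO {p['end']} UTC / WINDOW {p['window']}",
        f"RISK: {p['risk']} / LAND AREA IN CORRIDOR: {p['land_pct']}%",
        f"FORECAST: {p['forecast']}",
        f"SOURCE: {p['source']}",
    ]
    return "\n".join(lines)

advisory = format_reentry_advisory({
    "object_name": "CZ-5B ROCKET BODY", "norad_id": 44878,
    "event_type": "UNCONTROLLED RE-ENTRY / DEBRIS SURVIVAL POSSIBLE",
    "corridor": "18S115E TO 28S155E", "levels": "FL000 TO UNL",
    "start": "2026031614", "end": "2026031622", "window": "±4H (P95)",
    "risk": "HIGH", "land_pct": 12,
    "forecast": "CORRIDOR EXPECTED TO NARROW 20% OVER NEXT 6H",
    "source": "SPACECOM V2.1 / PRED-44878-20260316-003 / TIP MSG #3",
})
print(advisory)
```

Keeping the renderer a pure function of the prediction record makes the advisory format trivially testable against whatever ICAO standard eventually emerges.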
### 24.8 Alert Threshold Governance

Alert threshold values are consequential algorithmic decisions. A CRITICAL threshold that is too sensitive causes unnecessary airspace disruption; one that is too permissive creates false-negative risk. Both outcomes have legal, operational, and reputational consequences.

**Current threshold values and rationale:**

| Threshold | Value | Rationale |
|-----------|-------|-----------|
| CRITICAL window | < 6h | Aligns with ICAO minimum NOTAM lead time for short-notice restrictions; 6h allows ANSP to issue NOTAM with ≥2h lead time |
| HIGH window | < 24h | Operational planning horizon for pre-tactical airspace management |
| FIR intersection trigger | p95 corridor intersects any non-zero area of the FIR | Conservative: any non-zero intersection at p95 level generates an alert; minimum area threshold is an org-configurable setting (default: 0) |
| Alert rate limit | 1 CRITICAL per object per 4h window | Prevents alert flooding from repeated window-shrink events without substantive new information |
| Alert storm threshold | > 5 CRITICAL in 1h | Empirically chosen; above this rate the response-time expectation for individual alerts cannot be met |

These values are recorded in `docs/alert-threshold-history.md` with initial entry date and author sign-off.

**Threshold change procedure:**

1. Engineer proposes change in a PR with rationale documented in `docs/alert-threshold-history.md`
2. PR requires review by engineering lead **and** product owner before merge
3. Change is deployed to staging; minimum 2-week shadow-mode observation period against real TLE/TIP data
4. Shadow observation review: false positive rate and false negative rate compared against pre-change baseline
5. If baseline comparison passes: change deployed to production; all ANSP shadow deployment partners notified in writing with new threshold values
6. If any ANSP objects: change is held until concerns are resolved

**Threshold values are not configurable at runtime by operators.** They are code constants reviewed through the above process. Org-configurable alert settings (geographic FIR filter, mute rules, `OPS_ROOM_SUPPRESS_MINUTES`) are UX preferences, not threshold changes.

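To illustrate the "code constants, not runtime configuration" rule, the thresholds might be held in a reviewed module like the following sketch. The module layout, names, and rate-limiter shape are assumptions, not the actual SpaceCom code.

```python
from datetime import datetime, timedelta

# Reviewed code constants (§24.8) — changed only via the PR procedure above.
CRITICAL_WINDOW = timedelta(hours=6)
HIGH_WINDOW = timedelta(hours=24)
FIR_MIN_INTERSECTION_AREA_KM2 = 0.0        # org-configurable default (UX preference)
CRITICAL_RATE_LIMIT = timedelta(hours=4)   # 1 CRITICAL per object per 4h window
ALERT_STORM_THRESHOLD = 5                  # CRITICAL alerts per hour

_last_critical: dict[int, datetime] = {}

def may_emit_critical(norad_id: int, now: datetime) -> bool:
    """Enforce the per-object CRITICAL rate limit."""
    last = _last_critical.get(norad_id)
    if last is not None and now - last < CRITICAL_RATE_LIMIT:
        return False
    _last_critical[norad_id] = now
    return True

t0 = datetime(2026, 3, 16, 12, 0)
assert may_emit_critical(44878, t0)                            # first alert passes
assert not may_emit_critical(44878, t0 + timedelta(hours=2))   # suppressed inside 4h window
assert may_emit_critical(44878, t0 + timedelta(hours=5))       # allowed after window elapses
```

Because the values are plain module constants, every change is a diff that the PR review and `docs/alert-threshold-history.md` procedure can see.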
### 24.9 Degraded Mode and Availability

SpaceCom must specify degraded mode behaviour for ANSP adoption:

| Condition | System Behaviour | ANSP Action |
|-----------|-----------------|-------------|
| Ingest pipeline failure (TLE data > 6h stale) | MEDIUM alert to all operators; staleness indicator on all objects; predictions greyed | Consult Space-Track directly; activate fallback procedure |
| Space weather data > 4h stale | WARNING banner on SpaceWeatherWidget; uncertainty multiplier conservatively set to HIGH | Note wider uncertainty on any operational decisions |
| System unavailable | Push notification to all registered users; email to ANSP contacts | Activate fallback procedure documented in SpaceCom SMS integration guide |
| HMAC verification failure on a prediction | Prediction withheld; CRITICAL security alert; prediction marked `integrity_failed` | Do not use the withheld prediction; contact SpaceCom immediately |

**Degraded mode notification:** When SpaceCom is down or data is stale beyond defined thresholds, all connected ANSPs receive a push notification (WebSocket if connected; email fallback) so they can activate their fallback procedures. SpaceCom must never go silent while operationally relevant events are active.

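The staleness conditions in the table reduce to a simple classification. A sketch, with thresholds taken from the table above and function and signal names invented for the example:

```python
from datetime import datetime, timedelta, timezone

TLE_STALE_AFTER = timedelta(hours=6)       # first row of the degraded-mode table
SPACE_WX_STALE_AFTER = timedelta(hours=4)  # second row

def degraded_mode_signals(last_tle: datetime, last_space_wx: datetime,
                          now: datetime) -> list[str]:
    """Return the degraded-mode signals implied by data age (§24.9 table)."""
    signals = []
    if now - last_tle > TLE_STALE_AFTER:
        signals.append("MEDIUM_ALERT_TLE_STALE")   # staleness indicator; predictions greyed
    if now - last_space_wx > SPACE_WX_STALE_AFTER:
        signals.append("SPACE_WX_WARNING")         # uncertainty multiplier set to HIGH
    return signals

now = datetime(2026, 3, 16, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=1)
stale = now - timedelta(hours=7)
assert degraded_mode_signals(stale, fresh, now) == ["MEDIUM_ALERT_TLE_STALE"]
assert degraded_mode_signals(fresh, fresh, now) == []
```

Evaluating staleness against injected timestamps (rather than reading the clock inside the function) keeps the degraded-mode rules directly testable.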
---

### 24.10 EU AI Act Obligations

**Classification:** SpaceCom's conjunction probability model (§19) and any ML-based alert prioritisation constitute an AI system under EU AI Act Art. 3(1). AI systems used in transport infrastructure safety fall under Annex III, point 4 (AI systems intended to be used for dispatching, monitoring, and maintenance of transport infrastructure including aviation). This classification implies **high-risk AI system** obligations.

**High-risk AI system obligations (EU AI Act Chapter III Section 2):**

| Obligation | Article | SpaceCom implementation |
|-----------|---------|------------------------|
| Risk management system | Art. 9 | Integrate with existing SMS (§24.4); maintain AI-specific risk register in `legal/EU_AI_ACT_ASSESSMENT.md` |
| Data governance | Art. 10 | TLE training data provenance documented; `simulations.params_json` stores full input provenance; bias assessment required for orbital prediction models |
| Technical documentation | Art. 11 + Annex IV | `legal/EU_AI_ACT_ASSESSMENT.md` — system description, capabilities, limitations, human oversight measures, accuracy characterisation |
| Record-keeping / automatic logging | Art. 12 | `reentry_predictions` and `alert_events` tables provide automatic event logging; immutable (`APPEND`-only with HMAC) |
| Transparency to users | Art. 13 | Conjunction probability values labelled with model version (`simulations.model_version`), TLE age, EOP currency; uncertainty bounds displayed |
| Human oversight | Art. 14 | All decisions remain with duty controller (§24.2 AUP; §28.6 Decision Prompts disclaimer); no autonomous action taken by SpaceCom |
| Accuracy, robustness, cybersecurity | Art. 15 | Accuracy characterisation (§24.3 ICAO Data Quality); adversarial robustness covered by §7 and §36 security review |
| Conformity assessment | Art. 43 | Self-assessment pathway available for transport safety AI without third-party involvement at first deployment; document in `legal/EU_AI_ACT_ASSESSMENT.md` |
| EU database registration | Art. 51 | High-risk AI systems must be registered in the EU AI Act database before placing on market; legal milestone in deployment roadmap |

**Human oversight statement (required in UI — Art. 14):** The conjunction probability display (§19.4) must include the following non-configurable statement in the model information panel:

> *"This probability estimate is generated by an AI model and is subject to uncertainty arising from TLE age, atmospheric model limitations, and manoeuvre uncertainty. All operational decisions remain with the duty controller. This system does not replace ANSP procedures."*

**Gap analysis and roadmap:** `legal/EU_AI_ACT_ASSESSMENT.md` must document: current compliance state → gaps → remediation actions → target dates. Phase 2 gate: conformity assessment documentation complete. Phase 3 gate: EU database registration completed before commercial EU deployment.

---

### 24.11 Regulatory Correspondence Register

For an ANSP-facing product, regulators (CAA, EASA, national ANSPs, ESA, ICAO) will issue queries, audits, formal requests, and correspondence. Missed regulatory deadlines can constitute a licence breach or grounds for suspension of operations.

**Correspondence log:** `legal/REGULATORY_CORRESPONDENCE_LOG.md` — structured register with the following fields per entry:

| Field | Description |
|-------|-------------|
| Date received | ISO 8601 |
| Authority | Regulatory body name and country |
| Reference number | Authority's reference (if given) |
| Subject | Brief description |
| Deadline | Formal response deadline (ISO 8601) |
| Owner | Named individual responsible for response |
| Status | PENDING / RESPONDED / CLOSED / ESCALATED |
| Response date | Date formal response sent |
| Notes | Internal context, legal counsel involvement |

**SLAs:**

- All regulatory correspondence acknowledged (receipt confirmed to sender) within **2 business days**
- Substantive response or extension request within **14 calendar days** (or as required by the correspondence)
- All correspondence older than 14 days without a RESPONDED or CLOSED status triggers an escalation to the CEO

**Proactive regulatory engagement:** The correspondence register is reviewed at each quarterly steering meeting. Any authority that has issued ≥3 queries in a 12-month period warrants a proactive engagement call to identify and address systemic concerns before they become formal regulatory actions.

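The escalation SLA can be checked mechanically over the register. A sketch, assuming entries are parsed into dictionaries with the field names from the table above:

```python
from datetime import date, timedelta

RESPONSE_SLA = timedelta(days=14)

def entries_to_escalate(entries: list[dict], today: date) -> list[str]:
    """References of entries past the 14-day SLA without a closing status (escalate to CEO)."""
    overdue = []
    for e in entries:
        still_open = e["status"] not in ("RESPONDED", "CLOSED")
        if still_open and today - e["date_received"] > RESPONSE_SLA:
            overdue.append(e["reference"])
    return overdue

register = [
    {"reference": "CAA-2026-017", "date_received": date(2026, 3, 1), "status": "PENDING"},
    {"reference": "EASA-2026-004", "date_received": date(2026, 3, 20), "status": "PENDING"},
    {"reference": "CAA-2026-012", "date_received": date(2026, 2, 1), "status": "CLOSED"},
]
assert entries_to_escalate(register, date(2026, 3, 25)) == ["CAA-2026-017"]
```

Run on a schedule (for example in CI), this check turns the SLA from a documented intention into an enforced one.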
---

### 24.12 Safety Case Framework (F1 — §61)

A safety case is a structured argument that a system is acceptably safe for a specified use in a defined context. SpaceCom must produce and maintain a safety case before any operational ANSP deployment. The safety case is a living document, updated at each material system change.

**Safety case structure** (Goal Structuring Notation — GSN, consistent with EUROCAE ED-153 / IEC 61508 safety case guidance):

```
G1: SpaceCom is acceptably safe to use as a decision support tool
    for re-entry hazard awareness in civil airspace operations

  C1: Context — SpaceCom operates as decision support (not autonomous authority);
      all operational decisions remain with the ANSP duty controller

  S1: Argument strategy — safety achieved by hazard identification,
      risk reduction, and operational constraints

  G1.1: All identified hazards are mitigated to acceptable risk levels
    Sn1: Hazard Log (docs/safety/HAZARD_LOG.md)
      E1.1.1: HZ-001 through HZ-007 mitigation evidence (§24.4)
      E1.1.2: Shadow validation report (≥30 day trial)

  G1.2: System integrity is maintained through all operational modes
    Sn2: HMAC integrity on all safety-critical records (§7.9)
      E1.2.1: @pytest.mark.safety_critical test suite — 100% pass
      E1.2.2: Integrity failure quarantine demonstrated (§56 E2E test)

  G1.3: Operators are trained and capable of correct system use
    Sn3: Operator Training Programme (§28.9)
      E1.3.1: Training completion records (operator_training_records table)
      E1.3.2: Reference scenario completion evidence

  G1.4: Degraded mode provides adequate notification for fallback
    Sn4: Degraded mode specification (§24.9)
      E1.4.1: ANSP communication plan activated in game day exercise (§26.8)

  G1.5: Regulatory obligations are met for the deployment jurisdiction
    Sn5: Means of Compliance document (§24.14)
      E1.5.1: Legal opinions for deployment jurisdictions (§24.2)
      E1.5.2: ANSP SMS integration guide (§24.15)
```

**Safety case document:** `docs/safety/SAFETY_CASE.md`. Version-controlled; each tagged release includes a safety case snapshot. Safety case review is required before:

- ANSP shadow mode activation
- Model version updates that affect prediction outputs
- New deployment jurisdiction
- Any change to alert thresholds (§24.8)

**Safety case custodian:** Named individual (Phase 2: CEO or CTO until a dedicated safety manager is appointed). Changes to the safety case require the custodian's sign-off.

---

### 24.13 Software Assurance Level (SAL) Assignment (F2 — §61)

EUROCAE ED-153 / DO-278A defines Software Assurance Levels for ground-based aviation software systems. The appropriate SAL determines the rigour of development, verification, and documentation activities required.

**SpaceCom SAL assignment:**

| Component | Failure Condition | Severity Class | SAL | Rationale |
|-----------|------------------|----------------|-----|-----------|
| Re-entry prediction engine (`physics/`) | False all-clear (HZ-002) | Hazardous | SAL-2 | Undetected false negative could contribute to an airspace safety event; highest-consequence component |
| Alert generation pipeline (`alerts/`) | Failed alert delivery; wrong threshold applied | Hazardous | SAL-2 | Failure to generate a CRITICAL alert during an active event is equivalent in consequence to HZ-002 |
| HMAC integrity verification | Integrity failure undetected | Hazardous | SAL-2 | Loss of integrity checking removes the primary guard against data manipulation |
| CZML corridor rendering | Wrong geographic position displayed (HZ-004) | Hazardous | SAL-2 | Geographic display error directly misleads the operator |
| API authentication and authorisation | Unauthorised data access (HZ-007) | Major | SAL-3 | Privacy and data governance impact; not directly causal of an airspace event |
| Ingest pipeline (`worker/`) | Stale data not detected (HZ-005) | Major | SAL-3 | Staleness monitoring is a mitigation for HZ-005; failure of staleness monitoring increases HZ-005 likelihood |
| Frontend (non-safety-critical paths) | Cosmetic / non-operational UI failure | Minor | SAL-4 | Not in the safety-critical path |

**SAL-2 implications** (minimum activities required):

- Independent verification of requirements, design, and code for SAL-2 components (see §24.16 Verification Independence)
- Formal test coverage: 100% statement coverage for SAL-2 modules (enforced via `@pytest.mark.safety_critical`)
- Configuration management of all SAL-2 source files and their test artefacts (see §30.8)
- SAL-2 components documented in the safety case with traceability from requirement → design → code → test

**SAL assignment document:** `docs/safety/SAL_ASSIGNMENT.md` — reviewed at each architecture change and before any ANSP deployment.

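The requirement → design → code → test traceability required for SAL-2 components lends itself to an automated completeness check. A sketch; the trace-matrix shape, requirement IDs, and file paths are invented for illustration:

```python
def untraced_requirements(trace: dict[str, dict]) -> list[str]:
    """Return requirement IDs missing any link in the requirement→design→code→test chain."""
    required_links = ("design", "code", "test")
    return [req_id for req_id, links in sorted(trace.items())
            if any(not links.get(k) for k in required_links)]

trace_matrix = {
    "REQ-PHY-001": {"design": "SDD §4.2", "code": "physics/decay.py",
                    "test": "test_decay.py::test_backcast"},
    "REQ-ALR-003": {"design": "SDD §6.1", "code": "alerts/pipeline.py",
                    "test": None},  # gap: no test linked yet
}
assert untraced_requirements(trace_matrix) == ["REQ-ALR-003"]
```

A check like this could run in CI so that a SAL-2 change cannot merge while its traceability record is incomplete.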
---

### 24.14 Means of Compliance (MoC) Document (F8 — §61)

A Means of Compliance document maps each regulatory or standard requirement to the specific implementation evidence that demonstrates compliance. It is required before any formal regulatory submission (ESA bid, EASA consultation response, ANSP safety acceptance).

**Document:** `docs/safety/MEANS_OF_COMPLIANCE.md`

**Structure:**

| Requirement ID | Source | Requirement Text (summary) | Means of Compliance | Evidence Location | Status |
|---------------|--------|---------------------------|--------------------|--------------------|--------|
| MOC-001 | EUROCAE ED-153 §5.3 | Software requirements defined and verifiable | Requirements documented in relevant §sections of MASTER_PLAN; acceptance criteria in TEST_PLAN | `docs/TEST_PLAN.md`; relevant §sections | PARTIAL |
| MOC-002 | EUROCAE ED-153 §6.4 | Independent verification of SAL-2 software | Verification independence policy (§24.16); separate reviewer for safety-critical PRs | `docs/safety/VERIFICATION_INDEPENDENCE.md` | PLANNED |
| MOC-003 | ICAO Annex 15 §3.2 | Data quality attributes characterised | ICAO data quality table (§24.3); accuracy characterisation document | `docs/validation/ACCURACY_CHARACTERISATION.md` | PARTIAL (Phase 3) |
| MOC-004 | ICAO Annex 19 | ANSP SMS integration supported | SMS integration guide; hazard register; training programme | `docs/safety/ANSP_SMS_GUIDE.md`; `docs/safety/HAZARD_LOG.md` | PLANNED |
| MOC-005 | EU AI Act Art. 9 | Risk management system documented | AI Act assessment; hazard log; safety case | `legal/EU_AI_ACT_ASSESSMENT.md`; `docs/safety/HAZARD_LOG.md` | IN PROGRESS |
| MOC-006 | DO-278A §10 | Configuration management of safety artefacts | CM policy (§30.8); Git tagging of releases; signed commits | `docs/safety/CM_POLICY.md` | PLANNED |
| MOC-007 | ED-153 §7.2 | Safety occurrence reporting procedure | Runbook in §26.8; `SAFETY_OCCURRENCE` log type | `docs/runbooks/`; `security_logs` table | IMPLEMENTED |

The MoC document is a Phase 2 deliverable. `PARTIAL` items become Phase 3 gates. `PLANNED` items require assigned owners and completion dates before ANSP shadow activation.

---

### 24.15 ANSP-Side Obligations Document (F10 — §61)

SpaceCom cannot unilaterally satisfy all regulatory requirements — the receiving ANSP has obligations that SpaceCom must document and communicate. Failing to do so is a gap in the safety argument.

**Document:** `docs/safety/ANSP_SMS_GUIDE.md` — provided to every ANSP before shadow mode activation.

**ANSP obligations by category:**

| Category | ANSP Obligation | SpaceCom Provides |
|----------|----------------|-------------------|
| SMS integration | Include SpaceCom in ANSP SMS under ICAO Annex 19 | Hazard register contribution (§24.4); SAL assignment; safety case |
| Change notification | Notify SpaceCom of any ANSP procedure changes that affect how SpaceCom outputs are used | Change notification contact in MSA |
| Operator training | Ensure all SpaceCom users complete the operator training programme (§28.9) | Training modules; completion API; training records |
| Fallback procedure | Maintain and exercise a fallback procedure for SpaceCom unavailability | Fallback procedure template in onboarding documentation |
| Occurrence reporting | Report any safety occurrence involving SpaceCom outputs to SpaceCom within 24 hours | Safety occurrence form; contact details; §26.8 runbook |
| Regulatory notification | Notify applicable safety regulator of SpaceCom use if required by national SMS regulations | System description one-pager for regulator submission |
| Shadow validation | Participate in ≥30-day shadow validation trial; provide evaluation feedback | Shadow validation report template; shadow validation dashboard |
| AUP acceptance | Ensure all users accept the AUP (§24.2) | Automated AUP flow; compliance report for ANSP admin |

**Liability assignment note (links to §24.2 and §24.12 F11):** The ANSP SMS guide explicitly states that the ANSP retains full operational authority and accountability for all air traffic decisions, regardless of SpaceCom outputs. SpaceCom is a decision support tool. This statement must appear in the ANSP SMS guide, the AUP, and the safety case context node C1 (§24.12).

### 25.1 Target Tender Profile

SpaceCom targets ESA tenders in the following programme areas:

- **Space Safety Programme** — re-entry risk, SSA services, space debris
- **GSTP (General Support Technology Programme)** — technology development with commercial potential
- **ARTES (Advanced Research in Telecommunications Systems)** — if the commercial operator portal reaches satellite operators
- **Space-Air Traffic Integration** studies — the category matching ESA's OKAPI:Orbits award

### 25.2 Differentiation from ESA ESOC Re-entry Prediction Service

ESA's re-entry prediction service (`reentry.esoc.esa.int`) is a technical product for space operators and agencies. SpaceCom is **not a competitor** to this service — it is a complementary operational layer that could consume ESOC outputs:

| Dimension | ESA ESOC Service | SpaceCom |
|-----------|-----------------|---------|
| Primary user | Space agencies, debris researchers | ANSPs, airspace managers, space operators |
| Output format | Technical prediction reports | Operational decision support + NOTAM drafts |
| Aviation integration | None | Core feature |
| ANSP decision workflow | Not designed for this | Primary design target |
| Space operator portal | Not provided | Phase 2 deliverable |
| Shadow mode / regulatory adoption | Not provided | Built-in |

**In an ESA bid:** Position SpaceCom as the *user-facing operational layer* that sits on top of the space surveillance and prediction infrastructure that ESA already operates. ESA invests in the physics; SpaceCom invests in the interface that makes the physics actionable for aviation authorities and space operators.

### 25.3 TRL Roadmap (ESA Definitions)

| Phase | End TRL | Evidence |
|-------|---------|---------|
| Phase 1 complete | **TRL 4** | Validated decay predictor (≥3 historical backcasts); SGP4 globe with real TLE data; Mode A corridors; HMAC integrity; full security infrastructure |
| Phase 2 complete | **TRL 5** | Atmospheric breakup; Mode B heatmap; NOTAM drafting; space operator portal; CCSDS export; shadow mode; ≥1 ANSP shadow deployment running |
| Phase 3 complete | **TRL 6** | System demonstrated in operationally relevant environment; ≥1 ANSP shadow deployment with ≥4 weeks validation data; external penetration test passed; ECSS compliance artefacts complete |
| Post-Phase 3 | **TRL 7** | System prototype demonstrated in operational environment (live ANSP deployment, not shadow) |

### 25.4 ECSS Standards Compliance

ESA contracts require compliance with European Cooperation for Space Standardization (ECSS) standards. Required compliance mapping:

| Standard | Title | SpaceCom Compliance |
|----------|-------|-------------------|
| **ECSS-Q-ST-80C** | Software Product Assurance | Software Management Plan, V&V Plan, Product Assurance Plan — produced Phase 3 |
| **ECSS-E-ST-10-04C** | Space environment | NRLMSISE-00 and JB2008 compliance with ECSS atmospheric model requirements |
| **ECSS-E-ST-10-12C** | Methods for re-entry and debris footprint calculation | Decay predictor and atmospheric breakup model methodology documented and traceable |
| **ECSS-U-AS-010C** | Space sustainability | Zero Debris Charter alignment statement; controlled re-entry planner outputs |

**Compliance matrix document** (produced Phase 3): Maps every ECSS requirement to the relevant SpaceCom component, test, or document. Required for ESA tender submission.

### 25.5 ESA Zero Debris Charter Alignment

SpaceCom directly supports the Zero Debris Charter objectives:

| Charter Objective | SpaceCom Support |
|-------------------|----------------|
| Responsible end-of-life disposal | Controlled re-entry planner generates CCSDS-format manoeuvre plans minimising ground risk |
| Transparency of re-entry risk | Public hazard corridor data; NOTAM drafting; multi-ANSP coordination |
| Reduction of casualty risk | Atmospheric breakup model; casualty area computation; population density weighting in deorbit optimiser |
| Data sharing | API layer for space operator integration; CCSDS export; open prediction endpoints |

Include a Zero Debris Charter alignment statement in all ESA bid submissions.

### 25.6 Required ESA Procurement Artefacts

All ESA contracts require these management documents. SpaceCom must produce them by Phase 3:

| Document | ECSS Reference | Content |
|----------|---------------|---------|
| **Software Management Plan (SMP)** | ECSS-Q-ST-80C §5 | Development methodology, configuration management, change control, documentation standards |
| **Verification and Validation Plan (VVP)** | ECSS-Q-ST-80C §6 | Test strategy, traceability from requirements to test cases, acceptance criteria |
| **Product Assurance Plan (PAP)** | ECSS-Q-ST-80C §4 | Safety, reliability, quality standards and how they are met |
| **Data Management Plan (DMP)** | ECSS-Q-ST-80C §8 | How data produced under contract is handled, shared, archived, and made reproducible |
| **Software Requirements Specification (SRS)** | Tailored ECSS-E-ST-40C | Software requirements baseline, interfaces, external dependencies, and bounded assumptions including air-risk and RDM exchange boundaries |
| **Software Design Description (SDD)** | Tailored ECSS-E-ST-40C | Module architecture, algorithm choices, interface contracts, and validation assumptions |
| **User Manual / Ops Guide** | Tailored ECSS-E-ST-40C | Installation, configuration, operator workflows, limitations, and degraded-mode handling |
| **Test Plan + Test Report** | Tailored ECSS-Q-ST-80C | Planned validation campaign, executed results, deviations, and acceptance evidence for procurement submission |
| **Accessibility Conformance Report (ACR/VPAT 2.4)** | EN 301 549 v3.2.1 | WCAG 2.1 AA conformance declaration; mandatory for EU public sector ICT procurement; maps each success criterion to Supports / Partially Supports / Does Not Support with remarks |

Scaffold documents for all procurement-facing artefacts should be created at Phase 1 start and maintained throughout development — not produced from scratch at Phase 3.

For contracts with explicit software prototype review gates (e.g. PDR, TRR, CDR, QR, FR), the SRS, SDD, User Manual, Test Plan, and Test Report are updated incrementally at each milestone rather than back-filled only at final review.

### 25.7 Consortium Strategy

ESA study contracts typically favour consortia that combine:

- **Technical depth** (university or research institute)
- **Industrial relevance** (commercial applicability)
- **End-user representation** (the entity that will use the output)

SpaceCom's ideal consortium for an ESA bid:

- **SpaceCom** (lead) — system integration, aviation domain interface, commercial deployment
- **Academic partner** (orbital mechanics / atmospheric density modelling credibility — equivalent to TU Braunschweig in the OKAPI:Orbits consortium)
- **ANSP or aviation authority** (end-user representation — demonstrates the aviation gap is real and the solution is wanted)

Without a credentialled academic or research partner for the physics components, ESA evaluators may question the technical depth. Identify and approach potential academic partners before submitting to any ESA tender.

### 25.8 Intellectual Property Framework for ESA Bids
|
||
|
||
ESA contracts operate under the ESA General Conditions of Contract, which distinguish between **background IP** (pre-existing IP brought into the contract) and **foreground IP** (IP created during the contract). The default terms grant ESA a non-exclusive, royalty-free licence to use foreground IP, while the contractor retains ownership. These terms are negotiable and must be agreed before contract signature.
|
||
|
||
**Required IP actions before bid submission:**
|
||
|
||
1. **Background IP schedule:** Document all SpaceCom components that constitute background IP — physics engine, data model, UX design, proprietary algorithms. This schedule protects SpaceCom's ability to continue commercial deployment after the ESA contract ends without ESA claiming rights to the core product.
|
||
|
||
2. **Foreground IP boundary:** Define clearly what will be created during the ESA contract (e.g., specific ECSS compliance artefacts, validation datasets, TRL demonstration reports) versus what SpaceCom brings in as background IP. Narrow the foreground IP scope to ESA-specific deliverables only.
|
||
|
||
3. **Software Bill of Materials (SBOM):** Required for ECSS compliance and as part of the ESA bid artefact package. Generated via `syft` or `cyclonedx-bom`. Must identify all third-party licences. AGPLv3 components (notably CesiumJS community edition) cannot be in the SBOM of a closed-source ESA deliverable — commercial licence required.
|
||
|
||
4. **Consortium Agreement:** Must be signed by all consortium members before bid submission. Must specify:
|
||
- IP ownership for each consortium member's contributions
|
||
- Publication rights for academic partners (must not conflict with any commercial confidentiality obligations)
|
||
- Revenue share for any commercial use arising from the contract
|
||
- Liability allocation between consortium members
|
||
- Exit terms if a member withdraws
|
||
|
||
5. **Export control pre-clearance:** Confirm with counsel that the planned ESA deliverable does not require an export licence for transfer to ESA (a Paris-based intergovernmental organisation). Generally covered under EAR licence exception GOV, but verify for any controlled technology components.

---

## 26. SRE and Reliability Framework

### 26.1 Service Level Objectives

SpaceCom is most critical during active re-entry events — peak load coincides with highest operational stakes. Standard availability metrics are insufficient. SLOs must be defined against *event-correlated* conditions, not just averages.

| Service Level Indicator | SLO | Measurement Window | Notes |
|------------------------|-----|--------------------|-------|
| Prediction API availability | 99.9% | Rolling 30 days | 43.8 min/month error budget |
| Prediction API availability (active TIP event) | 99.95% | Duration of TIP window | Stricter; degradation during events is SEV-1 |
| Decay prediction latency p50 | < 90s | Per MC job | 500-sample chord run |
| Decay prediction latency p95 | < 240s | Per MC job | Drives worker sizing (§27) |
| CZML ephemeris load p95 | < 2s | Per request | 100-object catalog |
| TIP message ingest latency | < 30 min from publication | Per TIP message | Drives CRITICAL alert timing |
| Space weather update latency | < 15 min from NOAA SWPC | Per update cycle | Drives uncertainty multiplier refresh |
| Alert WebSocket delivery latency | < 10s from trigger | Per alert | Measured trigger→client receipt |
| Corridor update after new TIP | < 60 min | Per TIP message | Full MC rerun triggered |

**Error budget policy:** When the 30-day rolling error budget is exhausted, no further deployments or planned maintenance are permitted until the next measurement window opens. Tracked in Grafana SLO dashboard (§26.8).

**SLOs must be written into the model user agreement** (§24.2) and agreed with each ANSP customer before operational deployment. ANSPs need defined thresholds to determine when to activate their fallback procedures.

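The 43.8 min/month figure in the table follows the average-month convention (43,800 minutes ≈ 30.42 days). A quick sketch of the arithmetic (helper name illustrative):

```python
def error_budget_minutes(slo: float, window_minutes: float = 43_800) -> float:
    """Allowed downtime for an availability SLO over a window (default: average month)."""
    return window_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 99.9%  → 43.8 min/month
print(round(error_budget_minutes(0.9995), 1))  # 99.95% → 21.9 min/month
```
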
**Customer-facing SLA (Finding 7) — contractual commitments in the MSA:**

Internal SLOs are aspirational targets; the SLA is a binding contractual commitment with defined measurement, exclusions, and credits. The MSA template includes the following SLA schedule:

| Metric | SLA commitment | Measurement | Exclusions |
|---|---|---|---|
| Monthly availability | 99.5% | External uptime monitor; excludes scheduled maintenance (max 4h/month; 48h advance notice) | Force majeure; upstream data source outages (Space-Track, NOAA SWPC) lasting > 4h |
| Critical alert delivery | Within 5 minutes of trigger (p95) | `alert_events.created_at` → `delivered_websocket/email = TRUE` timestamp | Customer network connectivity issues |
| Prediction freshness | p50 updated within 4h of new TLE availability | `tle_sets.ingested_at` → `reentry_predictions.created_at` | Space-Track API outage > 4h |
| Support response — CRITICAL incident | Initial response within 1 hour | From customer report or automated alert, whichever earlier | Outside contracted support hours (on-call for CRITICAL) |
| Support response — P1 resolution | Within 8 hours | From initial response | — |
| Service credits | 1 day credit per 0.1% availability below SLA | Applied to next invoice | — |

Any SRE threshold change that could cause an SLA breach (e.g., raising the ingest failure alert threshold beyond 4 hours) must be reviewed by the product owner before deployment. Tracked in `docs/sla/sla-schedule-v{N}.md` (versioned; MSA references the current version by number).

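The credit schedule ("1 day per 0.1% below SLA") is best computed in integer tenths of a percentage point to avoid float artefacts. A sketch only — the rounding convention (credit per full 0.1% step) is an assumption the MSA text should pin down explicitly:

```python
def service_credit_days(measured_availability_pct: float, sla_pct: float = 99.5) -> int:
    """Credit days owed: one day per full 0.1 percentage point below the SLA."""
    shortfall_tenths = round((sla_pct - measured_availability_pct) * 10)
    return max(0, shortfall_tenths)

print(service_credit_days(99.2))  # 0.3% below SLA → 3 days
print(service_credit_days(99.7))  # above SLA → 0 days
```
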
---

### 26.2 Recovery Objectives

| Objective | Target | Scope | Derivation |
|-----------|--------|-------|-----------|
| RTO (active TIP event) | ≤ 15 minutes | Prediction API restoration | CRITICAL alert rate-limit window is 4 hours per object; 15-minute outage is tolerable within this window without skipping a CRITICAL cycle; beyond 15 minutes the ANSP must activate fallback procedures |
| RTO (no active event) | ≤ 60 minutes | Full system restoration | 1-hour window aligns with MSA SLA commitment; exceeding this triggers the P1 communication plan |
| RPO (safety-critical tables) | Zero | `reentry_predictions`, `alert_events`, `security_logs`, `notam_drafts` — synchronous replication required | UN Liability Convention evidentiary requirements; loss of a single alert acknowledgement record could be material in a liability investigation |
| RPO (operational data) | ≤ 5 minutes | `orbits`, `tle_sets`, `simulations` — async replication acceptable | 5-minute data age is within the staleness tolerance for TLE-based predictions; loss of in-flight simulations is recoverable by re-submission |

**MSA sign-off requirement:** RTO and RPO targets must be explicitly stated in and agreed upon in the Master Services Agreement with each ANSP customer before any production deployment. Customers must acknowledge that the fallback procedure (Space-Track direct + ESOC public re-entry page) is their responsibility during the RTO window. RTO/RPO targets are not unilaterally changeable by SpaceCom — any tightening requires customer notification ≥30 days in advance; any relaxation requires customer consent.

---

### 26.3 High Availability Architecture

#### TimescaleDB — Streaming Replication + Patroni

```yaml
# Primary + hot standby; Patroni manages leader election and failover
db_primary:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
    PATRONI_REPLICATION_USERNAME: replicator
  networks: [db_net]

db_standby:
  image: timescale/timescaledb-ha:pg17
  environment:
    PATRONI_REPLICA: "true"
  networks: [db_net]

etcd:
  image: bitnami/etcd:3  # Patroni DCS
  networks: [db_net]
```

- Synchronous replication for `reentry_predictions`, `alert_events`, `security_logs`, `notam_drafts` (RPO = 0): `synchronous_standby_names = 'FIRST 1 (db_standby)'`. PostgreSQL scopes synchronous commit per transaction, not per table: the cluster default stays `synchronous_commit = local`, and transactions writing these tables issue `SET LOCAL synchronous_commit = on`
- Asynchronous replication for `orbits`, `tle_sets` (RPO ≤ 5 min): covered by the `local` default
- Patroni auto-failover: standby promoted within ~30s of primary failure, well within the 15-minute RTO

**Required Patroni configuration parameters** (must be present in `patroni.yml`; CI validation via `scripts/check_patroni_config.py`):

```yaml
bootstrap:
  dcs:
    maximum_lag_on_failover: 1048576  # 1 MB; standby > 1 MB behind primary is excluded from failover election
    synchronous_mode: true            # Enable synchronous replication mode
    synchronous_mode_strict: true     # Primary refuses writes if no synchronous standby confirmed; prevents split-brain

postgresql:
  parameters:
    wal_level: replica                # Required for streaming replication; 'minimal' breaks replication
    recovery_target_timeline: latest  # Follow timeline switches after failover; required for correct standby behaviour
```

**Rationale:**
- `maximum_lag_on_failover`: without this, a severely lagged standby could be promoted as primary and serve stale data for safety-critical tables.
- `synchronous_mode_strict: true`: trades availability for consistency — primary halts rather than allowing an unconfirmed write to proceed without a standby. Acceptable given 15-minute RTO SLO.
- `wal_level: replica`: `minimal` disables the WAL detail needed for streaming replication; must be explicitly set.
- `recovery_target_timeline: latest`: without this, a promoted standby after failover may not follow future timeline switches, causing divergence.

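The CI gate referenced above can be a few dozen lines. A sketch of what `scripts/check_patroni_config.py` might look like, operating on the already-parsed YAML (load with `yaml.safe_load` in the real script); the rule set mirrors the required parameters block, but the function and variable names are illustrative:

```python
# Required keys and values, mirroring the patroni.yml block in §26.3
REQUIRED = [
    (("bootstrap", "dcs", "synchronous_mode"), True),
    (("bootstrap", "dcs", "synchronous_mode_strict"), True),
    (("postgresql", "parameters", "wal_level"), "replica"),
    (("postgresql", "parameters", "recovery_target_timeline"), "latest"),
]

def dig(cfg, path):
    """Walk nested dicts; return None when any key along the path is absent."""
    for key in path:
        if not isinstance(cfg, dict) or key not in cfg:
            return None
        cfg = cfg[key]
    return cfg

def check_patroni_config(cfg: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes CI."""
    errors = [f"{'.'.join(p)} must be {v!r}, got {dig(cfg, p)!r}"
              for p, v in REQUIRED if dig(cfg, p) != v]
    if dig(cfg, ("bootstrap", "dcs", "maximum_lag_on_failover")) is None:
        errors.append("bootstrap.dcs.maximum_lag_on_failover must be set")
    return errors
```
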
#### Redis — Sentinel (3 Nodes)

```yaml
redis-master:
  image: redis:7-alpine
  command: redis-server /etc/redis/redis.conf
redis-replica:
  image: redis:7-alpine
  command: redis-server /etc/redis/redis.conf --replicaof redis-master 6379
redis-sentinel-1:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-2:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
redis-sentinel-3:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
```

Three Sentinel instances form a quorum, with one replica standing by for promotion. If the master fails, Sentinel promotes the replica within ~10s. The backend and workers use `redis-py`'s `Sentinel` client, which transparently follows the master after failover.

**Redis Sentinel split-brain risk assessment (F3 — §67):** In a network partition where Sentinel nodes disagree on master reachability, two Sentinels could theoretically promote two different replicas simultaneously. The `min-replicas-to-write 1` directive on the master mitigates this: the old master stops accepting writes when it loses contact with replicas, forcing clients to the new master.

SpaceCom's Redis data is largely ephemeral — Celery broker messages, WebSocket session state, application cache. A split-brain that loses a small number of Celery tasks or cache entries is survivable. The one persistent concern is the per-org email rate limit counter (`spacecom:email_rate:{org_id}:{hour}`, §65 F7): a split-brain could result in two independent counters, both allowing up to 50 emails, for a brief period before the split resolves. This is accepted: the 50/hr limit is a cost control, not a safety guarantee. Email volume during a short Sentinel split-brain is not a safety risk.

**Risk acceptance and configuration:** Set the following values — note the split between `sentinel.conf` and the master's `redis.conf`:
```
# sentinel.conf (all three Sentinel nodes)
sentinel down-after-milliseconds spacecom-redis 5000
sentinel failover-timeout spacecom-redis 60000
sentinel parallel-syncs spacecom-redis 1

# redis.conf (master) — Redis server directives, not Sentinel ones
min-replicas-to-write 1
min-replicas-max-lag 10
```
ADR: `docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md`

#### Cross-Region Disaster Recovery — Warm Standby (F7)

Single-region deployment cannot meet the RTO ≤ 60 minutes target against a full cloud region failure. A warm standby in a second region provides the required recovery path.

**Strategy:** Warm standby (not hot active-active) — reduces cost and complexity while meeting RTO.

| Component | Primary region | DR region | Failover mechanism |
|-----------|--------------|-----------|-------------------|
| TimescaleDB | Primary + hot standby | Read replica (streaming replication from primary) | Promote replica; update DNS; `make db-failover-dr` runbook |
| Application tier | Running | Stopped; container images pre-pulled from GHCR | Deploy from images on failover; < 10 minutes |
| MinIO (object storage) | Active | Active (bucket replication enabled) | Already in sync; no failover needed |
| Redis | Active | Cold (config ready) | Restart on failover; session loss acceptable (operators re-authenticate) |
| DNS | Primary A record | Secondary A record in Route 53 (or equiv.) | Health-check-based routing; TTL 60s; auto-failover on primary health check failure |

**Failover time estimate:** DB promotion 2–5 minutes + DNS propagation 1 minute + app deploy 10 minutes = **< 15 minutes** (within RTO for active TIP event).

**Runbook:** `docs/runbooks/region-failover.md` — tested annually as game day scenario 6. Post-failover checklist: verify HMAC validation on restored primary; verify WAL integrity; notify ANSPs of region switch; schedule return to primary region within 48 hours.

---

### 26.4 Celery Reliability

#### Task Acknowledgement and Crash Safety

```python
# celeryconfig.py
task_acks_late = True              # Task not acknowledged until complete; if worker dies mid-task, task is requeued
task_reject_on_worker_lost = True  # Orphaned tasks requeued, not dropped
task_serializer = 'json'
result_expires = 86400             # Results expire after 24h; database is the durable store
worker_prefetch_multiplier = 1     # F6 §58: long MC tasks (up to 240s) — prefetch=1 prevents worker A
                                   # holding 4 tasks while workers B/C/D are idle; fair distribution
```

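`task_acks_late` makes delivery at-least-once: a task that completed just before its worker died will be redelivered and run twice. Tasks must therefore be idempotent. A minimal sketch of a claim-once guard — in production this would be a Redis `SET key NX EX <ttl>`; the in-memory set and the function names here are purely illustrative:

```python
_claimed: set[str] = set()  # stands in for Redis SET ... NX EX <ttl>

def claim_once(task_id: str) -> bool:
    """Return True exactly once per task_id; duplicate deliveries get False."""
    if task_id in _claimed:
        return False
    _claimed.add(task_id)
    return True

def run_simulation(task_id: str) -> str:
    # Guard the side-effecting body so a redelivered task becomes a no-op
    if not claim_once(task_id):
        return "skipped_duplicate"
    return "executed"

print(run_simulation("mc-job-42"))  # → executed
print(run_simulation("mc-job-42"))  # → skipped_duplicate (redelivery after worker loss)
```
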
#### Dead Letter Queue

Failed tasks (exception, timeout, or permanent error) must be captured, not silently dropped:

```python
# In Celery task base class
import json

from celery import Task

class SpaceComTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Update simulations table to status='failed'
        update_simulation_status(task_id, 'failed', error_detail=str(exc))
        # Route to dead letter queue (a Redis list) for inspection
        dead_letter_queue.rpush('dlq:failed_tasks', json.dumps({
            'task_id': task_id, 'task_name': self.name,
            'error': str(exc), 'failed_at': utcnow().isoformat()
        }))
```

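Because entries in `dlq:failed_tasks` are plain JSON strings, triage tooling stays simple. A sketch of a summariser for the on-call runbook — the function name is illustrative, and in production the entries would come from `redis.lrange` rather than a Python list:

```python
import json
from collections import Counter

def summarise_dlq(raw_entries: list[str]) -> Counter:
    """Count dead-lettered tasks by task name for triage."""
    return Counter(json.loads(e)["task_name"] for e in raw_entries)

entries = [
    json.dumps({"task_id": "1", "task_name": "modules.propagator.run", "error": "timeout"}),
    json.dumps({"task_id": "2", "task_name": "modules.propagator.run", "error": "oom"}),
    json.dumps({"task_id": "3", "task_name": "modules.ingest.tle", "error": "http 503"}),
]
print(summarise_dlq(entries))  # → Counter({'modules.propagator.run': 2, 'modules.ingest.tle': 1})
```
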
#### Queue Routing (Ingest vs Simulation Isolation)

```python
CELERY_TASK_ROUTES = {
    'modules.ingest.*': {'queue': 'ingest'},
    'modules.propagator.*': {'queue': 'simulation'},
    'modules.breakup.*': {'queue': 'simulation'},
    'modules.conjunction.*': {'queue': 'simulation'},
    'modules.reentry.controlled.*': {'queue': 'simulation'},
}
```

Two separate worker processes — never competing on the same queue:
```bash
# Ingest worker: always running, low concurrency
celery worker --queues=ingest --concurrency=2 --hostname=ingest@%h

# Simulation worker: high concurrency for MC sub-tasks (see §27.2)
celery worker --queues=simulation --concurrency=16 --pool=prefork --hostname=sim@%h
```

**Per-organisation priority isolation (F8):** All organisations share the `simulation` queue, but job priority is set at submission time based on subscription tier and event criticality. This prevents a `shadow_trial` org's bulk simulation from starving a `CRITICAL` alert computation for an `ansp_operational` org.

```python
TIER_TASK_PRIORITY = {
    "internal": 9,
    "institutional": 8,
    "ansp_operational": 7,
    "space_operator": 5,
    "shadow_trial": 3,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2  # added when active TIP event exists for the org's objects

def get_task_priority(org_tier: str, has_active_tip: bool) -> int:
    base = TIER_TASK_PRIORITY.get(org_tier, 3)
    # Clamp to the valid 0-9 range. NOTE: priority orientation is broker-dependent —
    # AMQP treats higher numbers as more urgent, while Kombu's Redis transport maps
    # priorities onto sub-queues via priority_steps; verify the orientation against
    # the deployed broker and invert the score here if needed.
    return min(9, base + (CRITICAL_EVENT_PRIORITY_BOOST if has_active_tip else 0))

# At submission:
task.apply_async(priority=get_task_priority(org.subscription_tier, active_tip))
```

Celery's Redis transport emulates task priorities (0–9) by splitting each queue into priority sub-queues (`priority_steps`); `maxmemory-policy noeviction` ensures queued tasks are never evicted under memory pressure. Workers drain the more urgent sub-queues first when multiple tasks are queued. Ingest tasks always route to the separate `ingest` queue and are unaffected by simulation priority.

#### Celery Beat — High Availability with `celery-redbeat`

Standard Celery Beat is a single-process SPOF. `celery-redbeat` stores the schedule in Redis with distributed locking — multiple Beat instances can run; only one holds the lock at a time:

```python
CELERY_BEAT_SCHEDULER = 'redbeat.RedBeatScheduler'
REDBEAT_REDIS_URL = settings.redis_url
REDBEAT_LOCK_TIMEOUT = 60     # 60s; crashed leader blocks scheduling for at most 60s
beat_max_loop_interval = 5    # standby instances wake every 5s to check for the expired lock
```

The default `REDBEAT_LOCK_TIMEOUT` (5 × the Beat loop interval, typically 25 minutes) is too long during active TIP events — a crashed Beat leader would prevent TIP polling for up to 25 minutes. At 60 seconds, a failover causes at most a 60-second scheduling gap. The standby Beat instance acquires the lock within 5 seconds of TTL expiry (`beat_max_loop_interval = 5`).

During an active TIP window (`spacecom_active_tip_events > 0`), the AlertManager rule for TIP ingest failure uses a 10-minute threshold rather than the baseline 4-hour threshold — ensuring a Beat failover gap does not silently miss critical TIP updates.

---

### 26.5 Health Checks

Every service exposes two endpoints. Docker Compose `depends_on: condition: service_healthy` uses these — the backend does not start until the database is healthy.

**Liveness probe** (`GET /healthz`) — process is alive; returns 200 unconditionally if the process can respond. Does not check dependencies.

**Readiness probe** (`GET /readyz`) — process is ready to serve traffic:

```python
from fastapi import Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

@app.get("/readyz")
async def readiness(db: AsyncSession = Depends(get_db)):
    checks = {}

    # Database connectivity
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"

    # Redis connectivity
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"

    # Data freshness
    tle_age = await get_oldest_active_tle_age_hours()
    sw_age = await get_space_weather_age_hours()
    eop_age = await get_eop_age_days()
    airac_age = await get_airspace_airac_age_days()
    checks["tle_age_hours"] = tle_age
    checks["space_weather_age_hours"] = sw_age
    checks["eop_age_days"] = eop_age
    checks["airac_age_days"] = airac_age

    degraded = []
    if checks["database"] != "ok" or checks["redis"] != "ok":
        return JSONResponse(status_code=503, content={"status": "unavailable", "checks": checks})
    if tle_age > 6:
        degraded.append("tle_stale")
    if sw_age > 4:
        degraded.append("space_weather_stale")
    if eop_age > 7:
        degraded.append("eop_stale")  # IERS-A older than 7 days; frame transform accuracy degraded
    if airac_age > 28:
        degraded.append("airspace_stale")  # AIRAC cycle missed

    status_code = 207 if degraded else 200
    return JSONResponse(status_code=status_code, content={
        "status": "degraded" if degraded else "ok",
        "degraded": degraded, "checks": checks
    })
```

The `207 Degraded` response triggers the staleness banner in the UI (§24.8) without taking the service offline. The load balancer treats 207 as healthy (traffic continues); the operational banner warns users.

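The four staleness thresholds are safety-relevant and worth pinning with unit tests; factoring them out of the endpoint makes that cheap. A sketch — helper and table names are illustrative, thresholds copied from the endpoint above:

```python
# Thresholds mirror §26.5: TLE 6h, space weather 4h, EOP 7d, AIRAC 28d
THRESHOLDS = {
    "tle_stale": ("tle_age_hours", 6),
    "space_weather_stale": ("space_weather_age_hours", 4),
    "eop_stale": ("eop_age_days", 7),
    "airspace_stale": ("airac_age_days", 28),
}

def degraded_reasons(checks: dict) -> list[str]:
    """Return the active degradation reasons for a /readyz response body."""
    return [reason for reason, (field, limit) in THRESHOLDS.items()
            if checks.get(field, 0) > limit]

print(degraded_reasons({"tle_age_hours": 7.5, "space_weather_age_hours": 1,
                        "eop_age_days": 2, "airac_age_days": 30}))
# → ['tle_stale', 'airspace_stale']
```
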
**Renderer service health check** — the `renderer` container runs Playwright/Chromium. If Chromium hangs (a known Playwright failure mode), the container process stays alive and appears healthy while all report generation jobs silently time out. The renderer `GET /healthz` must verify Chromium can respond, not just that the Python process is alive:

```python
# renderer/app/health.py
import asyncio

from fastapi.responses import JSONResponse
from playwright.async_api import async_playwright

async def health_check():
    """Liveness probe: verify Chromium can launch and load a blank page within 5s."""
    try:
        async with async_playwright() as p:
            browser = await asyncio.wait_for(p.chromium.launch(), timeout=5.0)
            page = await browser.new_page()
            await asyncio.wait_for(page.goto("about:blank"), timeout=3.0)
            await browser.close()
            return {"status": "ok", "chromium": "responsive"}
    except asyncio.TimeoutError:
        renderer_chromium_restarts.inc()
        return JSONResponse({"status": "chromium_unresponsive"}, status_code=503)
```

Docker Compose healthcheck for renderer:
```yaml
renderer:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8001/healthz"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 15s
```

If the healthcheck fails 3 times consecutively, Docker restarts the renderer container. The `renderer_chromium_restarts_total` counter increments on each failed Chromium probe and triggers the `RendererChromiumUnresponsive` alert.

**Degraded state in `GET /readyz` for API clients and SWIM (Finding 7):** The `degraded` array in the response is the machine-readable signal for any automated integration (Phase 3 SWIM, API polling clients). API clients must not scrape the UI to determine system state — the health endpoint is the authoritative source. Response fields:

| Field | Type | Meaning |
|---|---|---|
| `status` | `"ok"` \| `"degraded"` \| `"unavailable"` | Overall system state |
| `degraded` | `string[]` | Active degradation reasons: `"tle_stale"`, `"space_weather_stale"`, `"ingest_source_failure"`, `"prediction_service_overloaded"` |
| `degraded_since` | `ISO8601 \| null` | Timestamp of when current degraded state began (from `degraded_mode_events`) |
| `checks` | `object` | Per-subsystem check results |

Every transition into or out of degraded state is written to `degraded_mode_events` (see §9.2). NOTAM drafts generated while `status = "degraded"` have `generated_during_degraded = TRUE` and the draft `(E)` field includes: `NOTE: GENERATED DURING DEGRADED DATA STATE - VERIFY INDEPENDENTLY BEFORE ISSUANCE`.

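An API client consuming `/readyz` can map the three states onto operator actions directly. A sketch of client-side interpretation — the function name and action strings are illustrative, not a defined client contract:

```python
def interpret_readyz(resp: dict) -> str:
    """Map a /readyz response body onto a client-side action."""
    status = resp.get("status")
    if status == "unavailable":
        return "activate_fallback"  # Space-Track direct + ESOC public re-entry page
    if status == "degraded":
        if "tle_stale" in resp.get("degraded", []):
            return "serve_with_staleness_warning"
        return "serve_with_degraded_banner"
    return "serve_normally"

print(interpret_readyz({"status": "degraded", "degraded": ["tle_stale"], "checks": {}}))
# → serve_with_staleness_warning
```
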
**Docker Compose health check definitions:**
```yaml
backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s

db:
  healthcheck:
    # pg_isready alone passes before the spacecom database and TimescaleDB extension are loaded.
    # This check verifies that the application database is accessible and TimescaleDB is active
    # before any dependent service (pgbouncer, backend) is marked healthy.
    test: ["CMD-SHELL", "psql -U spacecom_app -d spacecom -c 'SELECT 1 FROM timescaledb_information.hypertables LIMIT 1'"]
    interval: 5s
    timeout: 3s
    retries: 10
    start_period: 30s  # TimescaleDB extension load and initial setup can take up to 20s

pgbouncer:
  depends_on:
    db:
      condition: service_healthy
  healthcheck:
    test: ["CMD-SHELL", "psql -h localhost -p 5432 -U spacecom_app -d spacecom -c 'SELECT 1'"]
    interval: 5s
    timeout: 3s
    retries: 5
```

---

### 26.6 Backup and Restore

#### Continuous WAL Archiving (RPO = 0 for critical tables)

```bash
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'mc cp %p minio/wal-archive/$(hostname)/%f'  # MinIO via mc client
archive_timeout = 60  # Force WAL segment every 60s even if no writes
```

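Archive continuity is worth verifying independently of the daily backup check: a silently failing `archive_command` leaves a gap in the WAL sequence that only surfaces at restore time. A sketch of a gap detector over archived segment names, assuming default 16 MB segments (so the low half of the 24-hex-char name cycles 0x00–0xFF) and a single timeline; helper names are illustrative:

```python
def wal_index(name: str) -> int:
    """Linear index of a 24-hex-char WAL segment name (timeline ignored)."""
    hi, lo = int(name[8:16], 16), int(name[16:24], 16)
    return hi * 0x100 + lo

def missing_segments(archived: list[str]) -> list[str]:
    """Return segment names absent from an otherwise contiguous archive run."""
    tli = archived[0][:8]
    idxs = sorted(wal_index(n) for n in archived)
    present = set(idxs)
    return [f"{tli}{i // 0x100:08X}{i % 0x100:08X}"
            for i in range(idxs[0], idxs[-1] + 1) if i not in present]

print(missing_segments([
    "000000010000000000000001",
    "000000010000000000000002",
    "000000010000000000000004",
]))  # → ['000000010000000000000003']
```

A Celery Beat task could run this against a MinIO listing of `wal-archive/` and alert on any non-empty result.
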
#### Daily Base Backup

`pg_basebackup` is a PostgreSQL client tool that is not present in the Python runtime worker image. The backup must run in a dedicated sidecar container that has PostgreSQL client tools installed, invoked by the Celery Beat task via `docker compose run`:

```yaml
# docker-compose.yml — backup sidecar (no persistent service; run on demand)
services:
  db-backup:
    image: timescale/timescaledb:2.14-pg17  # PostgreSQL major version must match the db server; provides pg_basebackup
    entrypoint: []
    command: >
      sh -c "pg_basebackup -h db -U postgres -D /backup
             --format=tar --compress=9 --wal-method=stream &&
             mc cp /backup/*.tar.gz minio/db-backups/base-$(date +%F)/"
    networks: [db_net]
    volumes:
      - backup_scratch:/backup
    profiles: [backup]  # not started by default; invoked explicitly
    environment:
      PGPASSWORD: ${POSTGRES_PASSWORD}
      MC_HOST_minio: http://${MINIO_ACCESS_KEY}:${MINIO_SECRET_KEY}@minio:9000

volumes:
  backup_scratch:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=20g  # large enough for a compressed base backup; tmpfs is RAM-backed, so ensure host memory headroom
```

The Celery Beat task triggers the sidecar via the Docker socket (backend container must have `/var/run/docker.sock` mounted in development — **not in production**). In production (Tier 2+), use a dedicated cron job on the host:

```bash
# /etc/cron.d/spacecom-backup — runs outside Docker, uses Docker CLI
0 2 * * * root docker compose -f /opt/spacecom/docker-compose.yml \
    --profile backup run --rm db-backup >> /var/log/spacecom-backup.log 2>&1
```

The Celery Beat task in production polls MinIO for today's backup object to verify completion, and fires an alert if it is absent by 03:00 UTC:

```python
# Celery Beat: daily at 03:00 UTC (verification, not execution)
@celery.task
def verify_daily_backup():
    """Verify today's base backup exists in MinIO; alert if absent."""
    # The sidecar writes to db-backups/base-YYYY-MM-DD/; stat_object cannot check a
    # prefix, so list the prefix and require it to be non-empty.
    prefix = f"base-{utcnow().date()}/"
    objects = list(minio_client.list_objects("db-backups", prefix=prefix, recursive=True))
    if objects:
        structlog.get_logger().info("backup_verified", prefix=prefix)
    else:
        structlog.get_logger().error("backup_missing", prefix=prefix)
        alert_admin(f"Daily base backup missing: {prefix}")
        raise RuntimeError(f"daily base backup missing: {prefix}")  # marks task as FAILED in Celery result backend
```

#### Monthly Restore Test

```python
# Celery Beat: first Sunday of each month at 03:00 UTC
@celery.task
def monthly_restore_test():
    """Restore latest backup to ephemeral container; run test suite; alert on failure."""
    # 1. Spin up a test TimescaleDB container from latest base backup + WAL
    # 2. Run db/test_restore.py: verify row counts, hypertable integrity, HMAC spot-checks
    # 3. Tear down container
    # 4. Log result to security_logs; alert admin if test fails
```

If the monthly restore test fails, the failure is treated as SEV-2. The incident is not resolved until a successful restore is verified.

**WAL retention:** 30 days of WAL segments retained in MinIO; base backups retained for 90 days; `reentry_predictions`, `alert_events`, `notam_drafts`, `security_logs` additionally archived to cold storage for 7 years (MinIO lifecycle policy, separate bucket with Object Lock COMPLIANCE mode — prevents deletion even by bucket owner).

**Application log retention policy (F10 — §57):**

| Log tier | Storage | Retention | Rationale |
|----------|---------|-----------|-----------|
| Container stdout (json-file) | Docker log driver on host | 7 days (`max-size=100m, max-file=5`) | Short-lived; Promtail ships to Loki in Tier 2+ |
| Loki (structured application logs) | Grafana Loki | **90 days** | Covers 30-day incident investigation SLA with headroom |
| Safety-relevant log lines (`level=CRITICAL`, `security_logs` events, alert-related log lines) | MinIO append-only bucket | **7 years** (same as database safety records) | Regulatory parity with `alert_events` 7-year hold; NIS2 Art. 23 evidence requirement |
| SIEM-forwarded events | External SIEM (customer-specified) | Per customer contract | ANSP customers may have their own retention obligations |

Loki retention is set in `monitoring/loki-config.yml`:
```yaml
limits_config:
  retention_period: 2160h  # 90 days
compactor:
  retention_enabled: true
```

Safety-relevant log shipping: a Promtail pipeline stage tags log lines with `__path__` label `safety_critical=true` when `level=CRITICAL` or `logger` contains `alert` or `security`. A separate Loki ruler rule ships these to MinIO via a Loki-to-S3 connector (Phase 2). Phase 1 interim: Celery Beat task exports CRITICAL log lines from Loki to MinIO daily.

**Restore time target:** Full restore to latest WAL segment in < 30 minutes (tested monthly). This satisfies the RTO ≤ 60 minutes (no active event) with 30 minutes headroom for DNS propagation and smoke tests. Documented step-by-step in `docs/runbooks/db-restore.md` (Phase 2 deliverable).

#### Retention Schedule

```sql
-- Online retention (TimescaleDB compression + drop policies)
SELECT add_compression_policy('orbits', INTERVAL '7 days');
SELECT add_retention_policy('orbits', INTERVAL '90 days');  -- Archive before drop; see below
SELECT add_retention_policy('space_weather', INTERVAL '2 years');
SELECT add_retention_policy('tle_sets', INTERVAL '1 year');

-- Archival pipeline: Celery task runs before each chunk drop
-- Exports chunk to Parquet in MinIO cold storage before TimescaleDB drops it
-- Legal hold: reentry_predictions, alert_events, notam_drafts, shadow_validations → 7 years
-- No retention policy on these tables; MinIO lifecycle rule retains for 7 years
```

---

### 26.7 Prometheus Metrics

Metrics must be instrumented from Phase 1 — not added at Phase 3 as an afterthought. Business-level metrics are more important than infrastructure metrics for this domain.

**Metric naming convention (F1 — §57):**

All custom metrics must follow `{namespace}_{subsystem}_{name}_{unit}` with these rules:

| Rule | Example compliant | Example non-compliant |
|------|------------------|-----------------------|
| Namespace is always `spacecom_` | `spacecom_ingest_success_total` | `ingest_success` |
| Unit suffix required (Prometheus base units) | `spacecom_simulation_duration_seconds` | `spacecom_simulation_duration` |
| Counters end in `_total` | `spacecom_hmac_verification_failures_total` | `spacecom_hmac_failures` |
| Gauges end in `_seconds`, `_bytes`, `_ratio`, or domain unit | `spacecom_celery_queue_depth` | `spacecom_queue` |
| Histograms end in `_seconds` or `_bytes` | `spacecom_alert_delivery_latency_seconds` | `spacecom_alert_latency` |
| Labels use `snake_case` | `queue_name`, `source` | `queueName`, `Source` |
| **High-cardinality fields are NEVER labels** | — | `norad_id`, `organisation_id`, `user_id`, `request_id` as Prometheus labels |
| Per-object drill-down uses recording rules | `spacecom:tle_age_hours:max` recording rule | `spacecom_tle_age_hours{norad_id="25544"}` alerted directly |

High-cardinality identifiers belong in log fields (structlog) or Prometheus exemplars — not in metric labels. A metric with an unbounded label creates one time series per unique value and will OOM Prometheus at scale.

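The naming rules lend themselves to a lint check in CI. A sketch — regexes and the function name are illustrative, and the label deny list mirrors the table above (a real check would carry an allow-list for the documented drill-down exemptions such as `spacecom_tle_age_hours{norad_id}`):

```python
import re

FORBIDDEN_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}
NAME_RE = re.compile(r"^spacecom_[a-z][a-z0-9_]*$")

def lint_metric(name: str, kind: str, labels: tuple[str, ...] = ()) -> list[str]:
    """Return naming-convention violations for one metric definition."""
    errors = []
    if not NAME_RE.match(name):
        errors.append(f"{name}: must match spacecom_{{subsystem}}_{{name}}_{{unit}} in snake_case")
    if kind == "counter" and not name.endswith("_total"):
        errors.append(f"{name}: counters must end in _total")
    errors += [f"{name}: high-cardinality label {l!r} forbidden"
               for l in labels if l in FORBIDDEN_LABELS]
    return errors

print(lint_metric("spacecom_hmac_failures", "counter"))
# → ['spacecom_hmac_failures: counters must end in _total']
```
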
**Business-level metrics (custom — most critical):**

```python
# Phase 1 — instrument from day 1
from prometheus_client import Counter, Gauge, Histogram

active_tip_events = Gauge('spacecom_active_tip_events', 'Objects with active TIP messages')
prediction_age = Gauge('spacecom_prediction_age_seconds', 'Age of latest prediction per object',
                       ['norad_id'])  # per-object label: Grafana drill-down only; alert via recording rule
tle_age = Gauge('spacecom_tle_age_hours', 'TLE data age per object', ['norad_id'])
ingest_success = Counter('spacecom_ingest_success_total', 'Successful ingest runs', ['source'])
ingest_failure = Counter('spacecom_ingest_failure_total', 'Failed ingest runs', ['source'])
hmac_failures = Counter('spacecom_hmac_verification_failures_total', 'HMAC check failures')
simulation_duration = Histogram('spacecom_simulation_duration_seconds', 'MC run duration', ['module'],
                                buckets=[30, 60, 90, 120, 180, 240, 300, 600])
alert_delivery_lat = Histogram('spacecom_alert_delivery_latency_seconds', 'Alert trigger → WS receipt',
                               buckets=[1, 2, 5, 10, 15, 20, 30, 60])
ws_connected = Gauge('spacecom_ws_connected_clients', 'Active WebSocket connections', ['instance'])
celery_queue_depth = Gauge('spacecom_celery_queue_depth', 'Tasks waiting in queue', ['queue'])
dlq_depth = Gauge('spacecom_dlq_depth', 'Tasks in dead letter queue')
renderer_active_jobs = Gauge('renderer_active_jobs', 'Reports being generated')
renderer_job_dur = Histogram('renderer_job_duration_seconds', 'Report generation time',
                             buckets=[2, 5, 10, 15, 20, 25, 30])
renderer_chromium_restarts = Counter('renderer_chromium_restarts_total', 'Chromium process restarts')
```

**SLI recording rules** — pre-aggregate before alerting; avoids per-object flooding (Finding 1, 7):
|
||
|
||
```yaml
|
||
# monitoring/recording-rules.yml
|
||
groups:
|
||
- name: spacecom_sli
|
||
rules:
|
||
# SLI: API availability (non-5xx fraction) — feeds availability SLO
|
||
- record: spacecom:api_availability:ratio_rate5m
|
||
expr: >
|
||
sum(rate(http_requests_total{status!~"5.."}[5m]))
|
||
/ sum(rate(http_requests_total[5m]))
|
||
|
||
# SLI: max TLE age across all objects (single series; alertable without flooding)
|
||
- record: spacecom:tle_age_hours:max
|
||
expr: max(spacecom_tle_age_hours)
|
||
|
||
# SLI: count of objects with stale TLEs (for dashboard)
|
||
- record: spacecom:tle_stale_objects:count
|
||
expr: count(spacecom_tle_age_hours > 6) or vector(0)
|
||
|
||
# SLI: max prediction age across active TIP objects
|
||
- record: spacecom:prediction_age_seconds:max
|
||
expr: max(spacecom_prediction_age_seconds)
|
||
|
||
# SLI: alert delivery latency p99
|
||
- record: spacecom:alert_delivery_latency:p99_rate5m
|
||
expr: histogram_quantile(0.99, rate(spacecom_alert_delivery_latency_seconds_bucket[5m]))
|
||
|
||
# Error budget burn rate — multi-window (F2 — §57)
|
||
- record: spacecom:error_budget_burn:rate1h
|
||
expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[1h])
|
||
|
||
- record: spacecom:error_budget_burn:rate6h
|
||
expr: 1 - avg_over_time(spacecom:api_availability:ratio_rate5m[6h])
|
||
|
||
# Fast-burn window (5 min) — catches sudden outages
|
||
- record: spacecom:error_budget_burn:rate5m
|
||
expr: 1 - spacecom:api_availability:ratio_rate5m
|
||
```

**Alerting rules (Prometheus AlertManager):**

```yaml
# monitoring/alertmanager/spacecom-rules.yml
groups:
  - name: spacecom_critical
    rules:
      - alert: HmacVerificationFailure
        expr: increase(spacecom_hmac_verification_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "HMAC verification failure detected — prediction integrity compromised"
          runbook_url: "https://spacecom.internal/docs/runbooks/hmac-integrity-failure.md"

      - alert: TipIngestStale
        expr: spacecom_tle_age_hours{source="tip"} > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TIP data > 30 min old — active re-entry warning may be stale"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      - alert: ActiveTipNoPrediction
        # Prediction-age series on the left so {{ $value }} is the age, not the TIP count
        expr: spacecom:prediction_age_seconds:max > 3600 and on() spacecom_active_tip_events > 0
        labels:
          severity: critical
        annotations:
          summary: "Active TIP event but newest prediction is {{ $value | humanizeDuration }} old"
          runbook_url: "https://spacecom.internal/docs/runbooks/tip-ingest-failure.md"

      # Fast burn: 1h + 5min windows (catches sudden outages quickly) — F2 §57
      - alert: ErrorBudgetFastBurn
        expr: >
          spacecom:error_budget_burn:rate1h > (14.4 * 0.001)
          and
          spacecom:error_budget_burn:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          burn_window: fast
        annotations:
          summary: "Error budget burning fast — 1h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

      # Slow burn: 6h + 1h windows (catches gradual degradation before the budget exhausts) — F2 §57
      - alert: ErrorBudgetSlowBurn
        expr: >
          spacecom:error_budget_burn:rate6h > (6 * 0.001)
          and
          spacecom:error_budget_burn:rate1h > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          burn_window: slow
        annotations:
          summary: "Error budget burning slowly — 6h burn rate {{ $value | humanizePercentage }}"
          runbook_url: "https://spacecom.internal/docs/runbooks/db-failover.md"
          dashboard_url: "https://grafana.spacecom.internal/d/slo-burn-rate"

  - name: spacecom_warning
    rules:
      - alert: TleStale
        # Alert on the recording-rule aggregate — one alert, not 600 per-NORAD alerts
        expr: spacecom:tle_stale_objects:count > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} objects have TLE age > 6h"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: IngestConsecutiveFailures
        # increase(), not rate(): we want a count of failures in the window, not a per-second rate
        expr: increase(spacecom_ingest_failure_total[15m]) >= 3
        labels:
          severity: warning
        annotations:
          summary: "Ingest source {{ $labels.source }} failed ≥ 3 times in 15 min"
          runbook_url: "https://spacecom.internal/docs/runbooks/ingest-pipeline-staleness.md"

      - alert: CelerySimulationQueueDeep
        expr: spacecom_celery_queue_depth{queue="simulation"} > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue depth {{ $value }} — workers may be overwhelmed"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: DLQGrowing
        # delta(), not increase(): dlq_depth is a gauge, not a counter
        expr: delta(spacecom_dlq_depth[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Dead letter queue growing — tasks exhausting retries"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: WebSocketCeilingApproaching
        expr: spacecom_ws_connected_clients > 400
        labels:
          severity: warning
        annotations:
          summary: "WS connections {{ $value }}/500 — scale backend before ceiling hit"
          runbook_url: "https://spacecom.internal/docs/runbooks/capacity-limits.md"

      # Queue depth growth rate alert — fires before the threshold is breached (F8 — §57)
      - alert: CelerySimulationQueueGrowing
        # deriv(), not rate(): queue depth is a gauge
        expr: deriv(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Simulation queue growing at {{ $value | humanize }} tasks/sec — workers not keeping up"
          runbook_url: "https://spacecom.internal/docs/runbooks/celery-worker-recovery.md"

      - alert: RendererChromiumUnresponsive
        expr: increase(renderer_chromium_restarts_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Renderer Chromium restarted — report generation may be delayed"
          runbook_url: "https://spacecom.internal/docs/runbooks/renderer-recovery.md"
```
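The burn-rate multipliers are not arbitrary: a burn rate of B consumes the error budget B times faster than the SLO allows, so sustained for h hours it consumes B·h/720 of a 30-day budget. A standalone arithmetic check (assuming the conventional 30-day SLO window; the thresholds match the rules above):

```python
# Fraction of a 30-day error budget consumed at a given burn rate.
# For a 99.9% SLO the budget is 0.001, which is why the alert thresholds
# are written as (multiplier * 0.001).
WINDOW_HOURS = 30 * 24  # 720 h in the 30-day SLO window

def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the monthly error budget consumed at `burn_rate` for `hours`."""
    return burn_rate * hours / WINDOW_HOURS

# Fast burn: 14.4x sustained for 1 h consumes 2% of the monthly budget
assert round(budget_consumed(14.4, 1), 4) == 0.02
# Slow burn: 6x sustained for 6 h consumes 5%
assert round(budget_consumed(6, 6), 4) == 0.05
```

A burn rate of 1.0 would exhaust the budget exactly at the end of the window; 14.4 exhausts it in about two days, which is why that pair of windows pages immediately.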

**Alert authoring rule (F11 — §57):** Every AlertManager alert rule MUST include an `annotations.runbook_url` pointing to an existing file in `docs/runbooks/`. The CI lint step (`make lint-alerts`) validates this with `promtool check rules` plus a custom Python script that asserts every rule has a non-empty `runbook_url` annotation resolving to an existing markdown file. A PR that adds an alert without a runbook fails CI.
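A minimal sketch of the custom half of that lint step (the file and function names are illustrative; the plan only fixes the `make lint-alerts` entry point). It takes the already-parsed rules document so it composes with whatever YAML loader the CI step uses:

```python
# ci/lint_alerts.py — sketch of the runbook_url check (names assumed)
from pathlib import Path

def missing_runbooks(rules_doc: dict, runbook_dir: Path) -> list[str]:
    """Return names of alert rules whose runbook_url is absent or dangling."""
    bad = []
    for group in rules_doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules need no runbook
            url = rule.get("annotations", {}).get("runbook_url", "")
            filename = url.rsplit("/", 1)[-1]
            if not url or not (runbook_dir / filename).is_file():
                bad.append(rule["alert"])
    return bad
```

`make lint-alerts` would run `promtool check rules` first, then fail the build if `missing_runbooks(...)` returns a non-empty list.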

**Alert coverage audit (F5 — §57):** The following table maps every SLO and safety invariant to its alert rule. Gaps must be closed before Phase 2.

| SLO / Safety invariant | Alert rule | Severity | Gap? |
|------------------------|-----------|----------|------|
| API availability 99.9% | `ErrorBudgetFastBurn`, `ErrorBudgetSlowBurn` | CRITICAL / WARNING | Covered |
| TLE age < 6h | `TleStale` | WARNING | Covered |
| TIP ingest freshness < 30 min | `TipIngestStale` | CRITICAL | Covered |
| Active TIP + prediction age > 1h | `ActiveTipNoPrediction` | CRITICAL | Covered |
| HMAC verification integrity | `HmacVerificationFailure` | CRITICAL | Covered |
| Ingest consecutive failures | `IngestConsecutiveFailures` | WARNING | Covered |
| Celery queue depth threshold | `CelerySimulationQueueDeep` | WARNING | Covered |
| Celery queue depth growth rate | `CelerySimulationQueueGrowing` | WARNING | Covered |
| DLQ depth > 0 | `DLQGrowing` | WARNING | Covered |
| WS connection ceiling approach | `WebSocketCeilingApproaching` | WARNING | Covered |
| Renderer Chromium crash | `RendererChromiumUnresponsive` | WARNING | Covered |
| EOP mirror disagreement | `EopMirrorDisagreement` | CRITICAL | **Gap — add Phase 1** |
| DB replication lag > 30s | `DbReplicationLagHigh` | WARNING | **Gap — add Phase 2** |
| Backup job failure | `BackupJobFailed` | CRITICAL | **Gap — add Phase 1** |
| Security event anomaly | In `security-rules.yml` | CRITICAL | Covered |
| Alert HMAC integrity (nightly) | In `security-rules.yml` | CRITICAL | Covered |

**Prometheus scrape configuration** (`monitoring/prometheus.yml`):

```yaml
scrape_configs:
  - job_name: backend
    static_configs:
      - targets: ['backend:8000']
    metrics_path: /metrics  # enabled by prometheus-fastapi-instrumentator

  - job_name: renderer
    static_configs:
      - targets: ['renderer:8001']
    metrics_path: /metrics

  - job_name: celery
    static_configs:
      - targets: ['celery-exporter:9808']  # celery-exporter sidecar

  - job_name: postgres
    static_configs:
      - targets: ['postgres-exporter:9187']  # postgres_exporter; also scrapes PgBouncer stats

  - job_name: redis
    static_configs:
      - targets: ['redis-exporter:9121']  # redis_exporter
```

Add to `docker-compose.yml` (Phase 2 service topology): `postgres-exporter`, `redis-exporter`, `celery-exporter` sidecar, `loki`, `promtail`, `tempo` (all on `monitor_net`). Add to `requirements.in`: `prometheus-fastapi-instrumentator`, `structlog`, `opentelemetry-sdk`, `opentelemetry-instrumentation-fastapi`, `opentelemetry-instrumentation-sqlalchemy`, `opentelemetry-instrumentation-celery`.

**Distributed tracing — OpenTelemetry (Phase 2, ADR 0017):**

```python
# backend/app/main.py — instrument at startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.celery import CeleryInstrumentor

provider = TracerProvider()
# Exporters attach via a span processor (TracerProvider has no add_span_exporter)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
CeleryInstrumentor().instrument()
```

The `trace_id` from each span equals the `request_id` bound in `structlog.contextvars` (set by `RequestIDMiddleware`). This gives a single correlation key across Grafana Loki log search and Grafana Tempo trace view — one click from a log entry to its trace, and from a trace span to its log lines. Phase 1 fallback: set `OTEL_SDK_DISABLED=true`; spans emit to stdout only (no collector needed).
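A tiny helper pins down that correlation-key convention (a sketch; it assumes `request_id` uses the canonical W3C 32-hex rendering of the 128-bit OTel trace id):

```python
def trace_id_to_request_id(trace_id: int) -> str:
    """Render an OTel 128-bit trace id as the 32-hex string used as the
    log-side request_id, so Loki and Tempo share one correlation key."""
    return format(trace_id, "032x")

# Low-valued trace ids zero-pad to the full 32 hex digits
assert trace_id_to_request_id(0xABC).endswith("abc")
assert len(trace_id_to_request_id((1 << 128) - 1)) == 32
```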

**Celery trace propagation (F4 — §57):** `CeleryInstrumentor` automatically propagates W3C `traceparent` headers through the Celery task message body. The trace started at `POST /api/v1/decay/predict` continues unbroken through the queue wait and into the worker execution. To verify propagation is working:

```python
# tests/integration/test_tracing.py
import uuid

def test_celery_trace_propagation():
    """A trace started in the HTTP handler must appear in the Celery worker span."""
    response = client.post("/api/v1/decay/predict", ...)
    task_id = response.json()["job_id"]
    # Poll until the task completes, then fetch its span from the in-memory
    # test span exporter (get_span_by_task_id is a test helper) and assert
    # the trace_id matches the request_id returned by the API
    span = get_span_by_task_id(task_id)
    assert span.context.trace_id == uuid.UUID(response.headers["X-Request-ID"]).int
```

Additionally, `request_id` must be passed explicitly in Celery task kwargs as a belt-and-suspenders fallback for Phase 1 when OTel is disabled (`OTEL_SDK_DISABLED=true`). The worker binds it via `structlog.contextvars.bind_contextvars(request_id=kwargs["request_id"])`. This ensures log correlation works in Phase 1 without a running Tempo instance.

**Chord sub-task and callback trace propagation (F11 — §67):** `CeleryInstrumentor` propagates `traceparent` through individual task messages. For the MC chord pattern (`group` → `chord` → callback), trace context propagation must flow: FastAPI handler → `run_mc_decay_prediction` → 500× `run_single_trajectory` sub-tasks → `aggregate_mc_results` callback. Each hop in the chord must carry the same `trace_id` to enable end-to-end p95 latency attribution.

`CeleryInstrumentor` handles single-task propagation automatically. For chord callbacks, verify that the parent `trace_id` appears in the `aggregate_mc_results` span — if the span is orphaned (different `trace_id`), set the trace context explicitly in the chord header:

```python
from opentelemetry import propagate, context


def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    carrier = {}
    propagate.inject(carrier)            # inject current trace context
    params['_trace_context'] = carrier   # pass through chord params
    ...


def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    ctx = propagate.extract(params.get('_trace_context', {}))
    token = context.attach(ctx)  # re-attach parent trace context in callback
    try:
        ...  # callback body
    finally:
        context.detach(token)
```

This ensures the Tempo waterfall for an MC prediction shows one continuous trace from HTTP request through all 500 sub-tasks to DB write, enabling per-prediction p95 breakdown.

**Celery queue depth Beat task** (updates `celery_queue_depth` and `dlq_depth` every 30s):

```python
@app.task
def update_queue_depth_metrics():
    for queue_name in ['ingest', 'simulation', 'default']:
        # Redis list key per queue; adjust if the broker transport uses
        # unprefixed queue names as its list keys
        depth = redis_client.llen(f'celery:{queue_name}')
        celery_queue_depth.labels(queue=queue_name).set(depth)
    dlq_depth.set(redis_client.llen('dlq:failed_tasks'))
```

**Four Grafana dashboards** (updated from three):

1. **Operational Overview** — primary on-call dashboard (F7 — §57): an on-call engineer must be able to answer "is the system healthy?" within 15 seconds of opening this dashboard. Panel order and layout are therefore mandated:

   | Row | Panel | Metric | Alert threshold shown |
   |-----|-------|--------|-----------------------|
   | 1 (top) | Active TIP events (stat) | `spacecom_active_tip_events` | Red if > 0 |
   | 1 | System status (state timeline) | All alert rule states | Any CRITICAL = red bar |
   | 2 | Ingest freshness per source (gauge) | `spacecom_tle_age_hours` per source | Yellow > 2h, Red > 6h |
   | 2 | Prediction age — active objects (gauge) | `spacecom:prediction_age_seconds:max` | Red > 3600s |
   | 3 | Error budget burn rate (time series) | `spacecom:error_budget_burn:rate1h` | Reference line at 14.4× |
   | 3 | Alert delivery latency p99 (stat) | `spacecom:alert_delivery_latency:p99_rate5m` | Red > 30s |
   | 4 | Celery queue depth (time series) | `spacecom_celery_queue_depth` per queue | Reference line at 20 |
   | 4 | DLQ depth (stat) | `spacecom_dlq_depth` | Red if > 0 |

   Rows 1–2 must be visible without scrolling on a 1080p monitor. The dashboard UID is pinned in the AlertManager `dashboard_url` annotations.

2. **System Health**: DB replication lag, Redis memory, container CPU/RAM, error rates by endpoint, renderer job duration
3. **SLO Burn Rate**: error budget consumption rate from recording rules, fast/slow burn rates, availability by SLO, latency percentiles vs. targets, WS delivery latency p99
4. **Tracing** (Phase 2, Grafana Tempo): per-request traces for decay prediction and CZML catalog; p95 span breakdown by service
---

### 26.8 Incident Response

#### On-Call Rotation and Escalation

| Tier | Responder | Response SLA | Escalation trigger |
|---|---|---|---|
| **L1 On-call** | Rotating engineer (weekly rotation) | 5 min (SEV-1) / 15 min (SEV-2) | Auto-escalate to L2 if no acknowledgement after SLA |
| **L2 Escalation** | Tech lead / senior engineer | 10 min (SEV-1) | Auto-escalate to L3 after 10 min |
| **L3 Incident commander** | Engineering or product lead | SEV-1 only | Manual phone call; no auto-escalation |

AlertManager routing:

```yaml
# monitoring/alertmanager/routing.yml
route:
  receiver: slack-ops-channel
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: {severity: critical}
      receiver: pagerduty-l1
      continue: true  # also send to Slack
    - match: {severity: warning}
      receiver: slack-ops-channel
```

On-call guide: `docs/runbooks/on-call-guide.md` — a required Phase 2 deliverable. Must cover: rotation schedule, handover checklist, escalation contact list, how to acknowledge PagerDuty alerts, Grafana dashboard URLs, and the "active TIP event protocol" (escalate all SEV-2+ to SEV-1 automatically when `spacecom_active_tip_events > 0`).
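The active TIP event protocol reduces to a one-line escalation rule (an illustrative sketch; integer severities follow the SEV-n scheme defined later in this section, with 1 the highest):

```python
def effective_severity(sev: int, active_tip_events: int) -> int:
    """Escalate SEV-2 to SEV-1 while any TIP event is active (1 = highest).
    SEV-1 is already maximal; SEV-3/4 are unaffected by the protocol."""
    if active_tip_events > 0 and sev == 2:
        return 1
    return sev

assert effective_severity(2, active_tip_events=1) == 1  # escalated
assert effective_severity(2, active_tip_events=0) == 2  # no active event
assert effective_severity(3, active_tip_events=1) == 3  # SEV-3 untouched
```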

**On-call rotation spec (F5):**
- 7-day rotation; minimum 2 engineers in the pool before going on-call
- L1 → L2 escalation if the incident is not contained within **30 minutes** of L1 acknowledgement
- L2 → L3 escalation triggers: ANSP data affected; confirmed security breach; total outage > 15 minutes; regulatory notification obligation triggered (NIS2 24h, GDPR 72h)
- **On-call handoff:** At each rotation boundary, the outgoing on-call documents system state in `docs/runbooks/on-call-handoff-log.md`: active incidents, degraded services, pending maintenance, known risks. The incoming on-call acknowledges in the same log. Mirrors the operator `/handover` concept (§28.5a) applied to engineering shifts.

**ANSP communication commitments per severity (F6):**

| Severity | ANSP notification timing | Channel | Update cadence |
|----------|------------------------|---------|---------------|
| SEV-1 (active TIP event) | Within 5 minutes of detection | Push + email | Every 15 minutes until resolved |
| SEV-1 (no active event) | Within 15 minutes | Email | Every 30 minutes until resolved |
| SEV-2 | Within 30 minutes if prediction data affected | Email | On resolution |
| SEV-3/4 | Status page update only | Status page | On resolution |

Resolution notification always includes: what was affected, duration, root cause summary (one sentence), and confirmation that prediction integrity was verified post-incident.

#### Severity Levels

| Level | Definition | Response Time | Examples |
|-------|-----------|--------------|---------|
| **SEV-1** | System unavailable or prediction integrity compromised during active TIP event | 5 minutes | DB down with TIP window open; HMAC failure on active prediction |
| **SEV-2** | Core functionality broken; no active TIP event | 15 minutes | Workers down; ingest stopped > 2h; Redis down |
| **SEV-3** | Degraded functionality; operational but impaired | 60 minutes | TLE stale > 6h; space weather stale; slow CZML > 5s p95 |
| **SEV-4** | Minor; no operational impact | Next business day | UI cosmetic; log noise; non-critical test failure |

#### Runbook Standard Structure (F9)

Every runbook in `docs/runbooks/` must follow this template. Inconsistent runbooks written under incident pressure are a leading cause of missed steps and extended resolution times.

```markdown
# Runbook: {Title}

**Owner:** {team or role}
**Last tested:** {YYYY-MM-DD} (game day or real incident)
**Severity scope:** SEV-1 | SEV-2 | SEV-3 (as applicable)

## Triggers
<!-- What conditions cause this runbook to be invoked? Alert name, symptom, or explicit escalation. -->

## Immediate actions (first 5 minutes)
<!-- Numbered steps. Each step must be independently executable. No "investigate" — specific commands only. -->
1.
2.

## Diagnosis
<!-- How to confirm the root cause before taking corrective action. -->

## Resolution steps
<!-- Numbered. Each step: what to do, expected output, what to do if the expected output is NOT seen. -->
1.
2.

## Verification
<!-- How to confirm the incident is resolved. Specific health check commands or metrics to inspect. -->

## Escalation
<!-- If unresolved after N minutes: who to page, what information to have ready. -->

## Post-incident
<!-- Mandatory PIR? Log entry required? Notification required? -->
```

All runbooks are reviewed and updated after each game day or real incident in which they were used. The `Last tested` field must not be older than 12 months — a CI check (`make runbook-audit`) warns if any runbook has not been updated within that window.
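A sketch of the freshness half of that check (the script name and parsing approach are assumed; only the `make runbook-audit` entry point and the `Last tested` template field are fixed by this plan):

```python
# ci/runbook_audit.py — sketch (name assumed). Flags runbooks whose
# "**Last tested:** YYYY-MM-DD" line is missing or older than 12 months.
import datetime as dt
import re
from pathlib import Path

PATTERN = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")

def stale_runbooks(runbook_dir: Path, today: dt.date) -> list[str]:
    stale = []
    for path in sorted(runbook_dir.glob("*.md")):
        match = PATTERN.search(path.read_text())
        # No parseable date counts as stale: the template field is mandatory
        if match is None:
            stale.append(path.name)
            continue
        tested = dt.date.fromisoformat(match.group(1))
        if (today - tested).days > 365:
            stale.append(path.name)
    return stale
```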

#### Required Runbooks (Phase 2 deliverable)

Each runbook is a step-by-step operational procedure, not a general guide:

| Runbook | Key Steps |
|---------|----------|
| **DB failover** | Confirm primary down → Patroni status → manual failover if Patroni stuck → verify standby promoting → update connection strings → verify HMAC validation working on new primary |
| **Celery worker recovery** | Check queue depth → inspect dead letter queue → restart worker containers → verify simulation jobs resuming → check ingest worker catching up |
| **HMAC integrity failure** | Identify affected prediction ID → quarantine record (`integrity_failed = TRUE`) → notify affected ANSP users → investigate modification source → escalate to security incident if tampering confirmed |
| **TIP ingest failure** | Check Space-Track API status → verify credentials not expired → check outbound network → manual TIP fetch if automated ingest blocked → notify operators of manual TIP status |
| **Ingest pipeline staleness** | Check Celery Beat health (redbeat lock status) → check worker queue → inspect ingest failure counter in Prometheus → trigger manual ingest job → notify operators of staleness |
| **GDPR personal data breach** | Contain breach (revoke credentials, isolate affected service) → assess scope (which data, how many data subjects, which jurisdictions) → notify legal counsel within 4 hours → if EU/UK data subjects affected: notify supervisory authority within 72 hours of discovery; notify affected data subjects "without undue delay" if high risk → log in `security_logs` with type `DATA_BREACH` → document remediation |
| **Safety occurrence notification** | If a SpaceCom integrity failure (HMAC fail, data source outage, incorrect prediction) is identified during a period when an ANSP was actively managing a re-entry event: notify affected ANSP within 2 hours → create `security_logs` record with type `SAFETY_OCCURRENCE` → notify legal counsel before any external communications → preserve all prediction records, alert_events, and ingest logs from the relevant period (do not rotate or archive). Full procedure: `docs/runbooks/safety-occurrence.md` — see §26.8a below. |
| **Prediction service outage during active re-entry event (F3)** | Detect via `spacecom_active_tip_events > 0` + prediction API health check fail → immediate ANSP push notification + email within 5 minutes ("SpaceCom prediction service is unavailable. Activate your fallback procedure: consult Space-Track TIP messages directly and ESOC re-entry page.") → designate incident commander → communication cadence every 15 minutes until resolved → service restoration checklist: restore prediction API → verify HMAC integrity on latest predictions → notify ANSPs of restoration with prediction freshness timestamp → trigger PIR. Full procedure: `docs/runbooks/prediction-service-outage-during-active-event.md` |

#### §26.8a Safety Occurrence Reporting Procedure (F4 — §61)

A safety occurrence is any event or condition in which a SpaceCom error may have contributed to, or could have contributed to, a reduction in aviation safety. This is distinct from an operational incident (which is defined by system availability/performance). Safety occurrences require a different response chain that includes regulatory and legal notification.

**Trigger conditions:**
- HMAC integrity failure on any prediction that was served to an ANSP operator during an active TIP event
- A confirmed incorrect prediction (false positive or false negative) where the ANSP was managing airspace based on SpaceCom outputs
- Data staleness in excess of the operational threshold (TLE > 6h old) during an active re-entry event window without a degradation notification having been sent
- Any SpaceCom system failure during which an ANSP continued operational use without receiving a degradation notification

**Response procedure** (`docs/runbooks/safety-occurrence.md`):

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Detect and classify: confirm the occurrence meets trigger criteria; assign SAFETY_OCCURRENCE vs. standard incident | On-call engineer | Within 30 min of detection |
| 2 | Preserve evidence: set `do_not_archive = TRUE` on all affected prediction records, alert_events, and ingest logs; export to MinIO safety archive | On-call engineer | Within 1 hour |
| 3 | Internal escalation: notify incident commander + legal counsel; do NOT communicate externally until legal counsel is engaged | Incident commander | Within 1 hour |
| 4 | ANSP notification: contact affected ANSP primary contact and safety manager using the safety occurrence notification template (not the standard incident template); include what happened, what data was affected, what the ANSP should do in response | Incident commander + legal counsel review | Within 2 hours |
| 5 | Log: create `security_logs` record with `type = 'SAFETY_OCCURRENCE'`; include ANSP ID, affected prediction IDs, notification timestamp, and legal counsel name | On-call engineer | Same session |
| 6 | ANSP SMS obligation: inform the ANSP in writing that they may have an obligation to report this occurrence to their safety regulator under their SMS; SpaceCom cannot make this determination for the ANSP | Legal counsel | Within 24 hours |
| 7 | PIR: conduct a safety-occurrence-specific post-incident review (same structure as §26.8 PIR but with additional sections: regulatory notification status, hazard log update required?) | Engineering lead | Within 5 business days |
| 8 | Hazard log update: if the occurrence reveals a new hazard or changes the likelihood/severity of an existing hazard, update `docs/safety/HAZARD_LOG.md` and trigger a safety case review | Safety case custodian | Within 10 business days |

**Safety occurrence log table:**

```sql
-- Add to security_logs or create a dedicated table
CREATE TABLE safety_occurrences (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    occurred_at TIMESTAMPTZ NOT NULL,
    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    org_ids UUID[] NOT NULL,  -- affected ANSPs
    trigger_type TEXT NOT NULL,  -- 'HMAC_FAILURE', 'INCORRECT_PREDICTION', 'STALE_DATA', 'SILENT_FAILURE'
    affected_predictions UUID[] NOT NULL DEFAULT '{}',
    evidence_archived BOOLEAN NOT NULL DEFAULT FALSE,
    ansp_notified_at TIMESTAMPTZ,
    legal_notified_at TIMESTAMPTZ,
    hazard_log_updated BOOLEAN NOT NULL DEFAULT FALSE,
    pir_completed_at TIMESTAMPTZ,
    notes TEXT
);
```
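The response-table timings can be encoded so incident tooling computes each deadline from the detection timestamp (an illustrative sketch; the step keys are made up, the durations come from the response table above, and the two business-day steps are omitted for brevity):

```python
import datetime as dt

# Deadlines relative to detection, per the safety-occurrence response table
DEADLINES = {
    "classify": dt.timedelta(minutes=30),
    "preserve_evidence": dt.timedelta(hours=1),
    "internal_escalation": dt.timedelta(hours=1),
    "ansp_notification": dt.timedelta(hours=2),
    "sms_obligation_notice": dt.timedelta(hours=24),
}

def deadline(detected_at: dt.datetime, step: str) -> dt.datetime:
    """Absolute deadline for a response step, given the detection time."""
    return detected_at + DEADLINES[step]
```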

**What is NOT a safety occurrence (to avoid over-classification):**
- Standard availability incidents with a degradation notification sent promptly
- Cosmetic UI errors not in the alert/prediction path
- Prediction updates that change values within stated uncertainty bounds

#### ANSP Communication Plan

When SpaceCom is degraded during an active TIP event, operators must be notified immediately through a defined channel:
- **WebSocket push** (if connected): automatic via the degraded-mode notification (§24.8)
- **Email fallback**: automated email to all `operator` role users with active sessions within the last 24h, identifying the degradation type and estimated resolution
- **Documented fallback**: every SpaceCom user onboarding includes the fallback procedure: "In the absence of SpaceCom, consult Space-Track TIP messages directly at space-track.org and coordinate with your national space surveillance authority per existing procedures"

**Incident communication templates (F10):** Pre-drafted templates in `docs/runbooks/incident-comms-templates.md` — reviewed by legal counsel before first use. On-call engineers must use these templates verbatim; deviations require incident commander approval. Templates cover:
1. **Initial notification** (< 5 minutes): impact, what we know, what we are doing, next update time
2. **15-minute update**: progress, updated ETA if known, revised fallback guidance if needed
3. **Resolution notification**: confirmed restoration, prediction integrity verified, brief root cause (one sentence), PIR date
4. **Post-incident summary** (within 5 business days): full timeline, root cause, remediations implemented

What never appears in templates: speculation about cause before root cause is confirmed; estimated recovery time until known with confidence; any admission of negligence or legal liability.

#### Post-Incident Review Process (F8)

Mandatory for all SEV-1 and SEV-2 incidents. PIR due within **5 business days** of resolution.

**PIR document structure** (`docs/post-incident-reviews/YYYY-MM-DD-{slug}.md`):
1. **Incident summary** — what happened, when, duration, severity
2. **Timeline** — minute-by-minute from first alert to resolution
3. **Root cause** — using 5-whys methodology; stop when a process or system gap is identified
4. **Contributing factors** — what made the impact worse or detection slower
5. **Impact** — users/ANSPs affected; data at risk; SLO breach duration
6. **Remediation actions** — each with owner, GitHub issue link, and deadline; tracked with `incident-remediation` label
7. **What went well** — to reinforce effective practices

PIR presented at the next engineering all-hands. Remediation actions are P2 priority — no new feature work by the responsible engineer until overdue remediations are closed.

#### Chaos Engineering / Game Day Programme (F4)

Quarterly game day; scenarios rotated so each is tested at least annually. Document in `docs/runbooks/game-day-scenarios.md`.

**Minimum scenario set:**

| # | Scenario | Expected behaviour | Pass criterion |
|---|---------|-------------------|---------------|
| 1 | PostgreSQL primary killed | Patroni promotes standby; API recovers within RTO | API returns 200 within 15 minutes; no data loss |
| 2 | Celery worker crash during active MC simulation | Job moves to DLQ; orphan recovery task re-queues; operator sees `FAILED` state | Job visible in DLQ within 2 minutes; re-queue succeeds |
| 3 | Space-Track ingest unavailable 6 hours | Staleness degraded mode activates; operators notified; predictions greyed | Staleness alert fires within 15 minutes of ingest stop |
| 4 | Redis failure | Sessions expire gracefully; WebSocket reconnects; no silent data loss | Users see "session expired" prompt; no 500 errors |
| 5 | Full prediction service restart during active CRITICAL alert | Alert state preserved in DB; re-subscribing WebSocket clients receive current state | No alert acknowledgement lost; reconnection < 30 seconds |
| 6 | Full region failover (annually) | DNS fails over to DR region; prediction API resumes | Recovery within RTO; HMAC verification passes on new primary |

Each scenario: defined inject → observe → record actual behaviour → pass/fail vs. criterion → remediation window 2 weeks. Any scenario fail is treated as a SEV-2 incident with a PIR.
#### Operational vs. Security Incident Runbooks (F11)

Operational and security incidents have different response teams, communication obligations, and legal constraints:

| Dimension | Operational incident | Security incident |
|-----------|----------------------|-------------------|
| Primary responder | On-call engineer | On-call engineer + DPO within 4h |
| Communication | Status page + ANSP email | **No public status page until legal counsel approves** |
| Regulatory obligation | SLA breach notification (MSA) | NIS2 24h early warning; GDPR 72h (if personal data) |
| Evidence preservation | Normal log retention | Immediate log freeze; do not rotate or archive |

Separate runbooks:

- `docs/runbooks/operational-incident-response.md` — standard on-call playbook
- `docs/runbooks/security-incident-response.md` — invokes DPO, legal counsel, NIS2/GDPR timelines; references §29.6 notification obligations

---

### 26.9 Deployment Strategy

#### Zero-Downtime Deployment (Blue-Green)

The TLS-terminating Caddy instance routes between blue (current) and green (new) backend instances:

```
Client → Caddy → [Blue backend]  (current)
               → [Green backend] (new — deployed but not yet receiving traffic)
```

**Docker Compose implementation for Tier 2 (single-host):**

Docker Compose service names are fixed, so blue and green run as two separate Compose project instances. The deploy script at `scripts/blue-green-deploy.sh` manages the cutover:

```bash
#!/usr/bin/env bash
# scripts/blue-green-deploy.sh
set -euo pipefail

NEW_IMAGE="${1:?Usage: blue-green-deploy.sh <image-tag>}"
COMPOSE_FILE="docker-compose.yml"

# 1. Determine which colour is currently active
ACTIVE=$(cat /opt/spacecom/.active-colour 2>/dev/null || echo "blue")
if [[ "$ACTIVE" == "blue" ]]; then NEXT="green"; else NEXT="blue"; fi
ACTIVE_PROJECT="spacecom-$ACTIVE"
NEXT_PROJECT="spacecom-$NEXT"

# 2. Start the next-colour project with the new image
SPACECOM_BACKEND_IMAGE="$NEW_IMAGE" \
  docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" up -d backend

# 3. Wait for the next-colour healthcheck (retry: the container needs time to boot)
healthy=0
for _ in $(seq 1 12); do
  if docker compose -p "$NEXT_PROJECT" -f "$COMPOSE_FILE" exec backend \
       curl -sf http://localhost:8000/healthz; then
    healthy=1; break
  fi
  sleep 5
done
if [[ "$healthy" -ne 1 ]]; then echo "Health check failed — aborting"; exit 1; fi

# 4. Run smoke tests against the next colour directly
SMOKE_TARGET="http://localhost:$( [[ $NEXT == green ]] && echo 8001 || echo 8000 )" \
  python scripts/smoke-test.py || { echo "Smoke tests failed — aborting"; exit 1; }

# 5. Shift the Caddy upstream to the next colour (atomic file swap + reload)
echo "{ \"upstream\": \"backend-$NEXT:8000\" }" > /opt/spacecom/caddy-upstream.json
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile

echo "$NEXT" > /opt/spacecom/.active-colour
echo "✓ Traffic shifted to $NEXT. Monitoring for 5 minutes..."
sleep 300

# 6. Verify availability via Prometheus (optional gate)
AVAILABILITY=$(curl -s "http://localhost:9090/api/v1/query?query=spacecom:api_availability:ratio_rate5m" \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$AVAILABILITY < 0.99" | bc -l) )); then
  echo "Availability $AVAILABILITY < 0.99 — rolling back"
  # Swap back to the previously active colour
  echo "{ \"upstream\": \"backend-$ACTIVE:8000\" }" > /opt/spacecom/caddy-upstream.json
  docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
  echo "$ACTIVE" > /opt/spacecom/.active-colour
  exit 1
fi

# 7. Decommission the old colour
docker compose -p "$ACTIVE_PROJECT" -f "$COMPOSE_FILE" stop backend
docker compose -p "$ACTIVE_PROJECT" -f "$COMPOSE_FILE" rm -f backend
echo "✓ Blue-green deploy complete. Active: $NEXT"
```

**Caddy upstream configuration** — Caddy reads a JSON file that the deploy script rewrites atomically:

```
# /etc/caddy/Caddyfile
reverse_proxy {
    dynamic file /opt/spacecom/caddy-upstream.json
    lb_policy first
    health_uri /healthz
    health_interval 5s
}
```

**WebSocket long-lived connection timeout configuration (F11 — §63):** HTTP reverse proxies have default idle timeouts that silently terminate long-lived WebSocket connections. Caddy's server idle timeout is governed by the `idle` timeout option (default: 5 minutes); many cloud load balancers default to 60 seconds. A WebSocket with no traffic for this period is silently closed by the proxy — the FastAPI server and client may not detect this for minutes, creating a "ghost connection" that is alive at the socket level but dead at the application level.

**Required Caddyfile additions for WebSocket paths:**

```
# /etc/caddy/Caddyfile
{
    servers {
        timeouts {
            idle 0   # disable the idle timeout globally — WS connections can be silent for extended periods
        }
    }
}

spacecom.io {
    # WebSocket endpoints: no idle timeout, no read timeout
    @websockets {
        path /ws/*
        header Connection *Upgrade*
        header Upgrade websocket
    }
    handle @websockets {
        reverse_proxy backend:8000 {
            transport http {
                read_timeout 0    # no read timeout — WS connection can be idle
                write_timeout 0   # no write timeout — WS send can be slow on poor networks
            }
            flush_interval -1     # immediate flush; do not buffer WS frames
        }
    }

    # Non-WebSocket paths: retain normal timeouts
    handle {
        reverse_proxy backend:8000 {
            transport http {
                read_timeout 30s
                write_timeout 30s
            }
        }
    }
}
```

**Ping-pong interval must be less than the proxy idle timeout:** The FastAPI WebSocket handler sends a ping every `WS_PING_INTERVAL_SECONDS` (default: 30s). With the Caddy idle timeout disabled, this prevents proxy-side termination. If running behind a cloud load balancer with a fixed idle timeout, the ping interval must be set to `(load_balancer_idle_timeout - 10s)` — documented in `docs/runbooks/websocket-proxy-config.md`.

**Rollback:** `scripts/blue-green-rollback.sh` — resets `/opt/spacecom/caddy-upstream.json` to the previous colour and reloads Caddy. Rollback completes in < 5 seconds (no container restart required).

Deployment sequence:

1. Deploy green backend alongside blue (both running)
2. Run smoke tests against green directly (`X-Deploy-Target: green` header)
3. Shift 10% of traffic to green (canary); monitor error rate for 5 minutes
4. If clean: shift 100% to green; keep blue running for 10 minutes
5. If error spike: shift 0% back to blue instantly (< 5s rollback via `blue-green-rollback.sh`)
6. Decommission blue after 10 minutes of clean green operation

#### Alembic Migration Safety Policy

Every database migration must be backwards-compatible with the previous application version. Required sequence for any schema change:

1. **Migration only**: deploy the migration; verify the old app still functions with the new schema (additive changes only — new nullable columns, new tables, new indexes)
2. **Application deploy**: deploy the new application version that uses the new schema
3. **Cleanup migration** (if needed): remove old columns/constraints after the old app version is fully retired

Never rename a column, change a column type, or drop a column in a single migration that deploys simultaneously with the application change.

**Hypertable-specific migration rules:**

- Use non-blocking index creation. On plain tables, `CREATE INDEX CONCURRENTLY` does not acquire a table lock and is safe during live ingest; hypertables do not support `CONCURRENTLY`, so use `CREATE INDEX ... WITH (timescaledb.transaction_per_chunk)` instead. Standard `CREATE INDEX` blocks all reads and writes for the duration.
- Never add a column with a non-null default to a populated hypertable in a single migration. Required sequence: (1) add a nullable column, (2) backfill in batches with `UPDATE ... WHERE id BETWEEN x AND y`, (3) add the NOT NULL constraint in a separate deployment.
- Test every migration against a production-sized data copy before applying it to production. Record the measured execution time in the migration file header comment: `# Execution time on 10M-row orbits table: 45s`.
- Set a CI migration timeout gate: any migration that runs > 30 seconds against the test dataset must be reviewed by a senior engineer before merge.

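The batched backfill in the second rule can be sketched as follows — a hedged illustration, not a shipped migration; the `orbits` table, `source` column, and DB-API `conn` interface are assumptions:

```python
def batch_ranges(min_id: int, max_id: int, batch_size: int):
    """Yield inclusive (lo, hi) id windows covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1

def backfill(conn, batch_size: int = 10_000) -> None:
    """Backfill the new nullable column in id-bounded batches.

    Each batch commits separately so row locks stay short and
    autovacuum can keep pace with the dead tuples."""
    min_id, max_id = conn.execute("SELECT min(id), max(id) FROM orbits").fetchone()
    for lo, hi in batch_ranges(min_id, max_id, batch_size):
        conn.execute(
            "UPDATE orbits SET source = 'space_track' "
            "WHERE id BETWEEN %s AND %s AND source IS NULL",
            (lo, hi),
        )
        conn.commit()
```

The NOT NULL constraint is added only after `backfill` completes, in a later deployment, per the rule above.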
#### TIP Event Deployment Freeze

No deployments are permitted while a CRITICAL or HIGH alert is active for any tracked object. Enforced by a CI/CD gate:

```python
# CI pre-deploy check, invoked by the deploy-production job
# (API_URL and settings come from the deploy job's environment/config)
import requests

class DeploymentBlocked(Exception):
    pass

def check_deployment_gate():
    response = requests.get(
        f"{API_URL}/api/v1/alerts?level=CRITICAL,HIGH&active=true",
        headers={"X-Deploy-Check": settings.deploy_check_secret},
        timeout=10,
    )
    response.raise_for_status()
    active = response.json()["total"]
    if active > 0:
        raise DeploymentBlocked(
            f"{active} active CRITICAL/HIGH alerts. Deployment blocked until events resolve."
        )
```

The deploy check secret is a read-only service credential — it cannot acknowledge alerts or modify data.

#### CI/CD Pipeline Specification

**GitHub Actions pipeline jobs (`.github/workflows/ci.yml`):**

| Job | Trigger | Steps | Failure behaviour |
|-----|---------|-------|-------------------|
| `lint` | All pushes + PRs | `pre-commit run --all-files` (detect-secrets, ruff, mypy, hadolint, prettier, sqlfluff) | Blocks merge |
| `test-backend` | All pushes + PRs | `pytest --cov --cov-fail-under=80`; `alembic check` (model/migration divergence) | Blocks merge |
| `test-frontend` | All pushes + PRs | `vitest run`; `playwright test` | Blocks merge |
| `security-scan` | All pushes + PRs | `bandit -r backend/`; `pip-audit --requirement backend/requirements.txt`; `npm audit --audit-level=high` (frontend); `eslint --plugin security`; `trivy image` on built images (`.trivyignore` applied); `pip-licenses` + `license-checker-rseidelsohn` gate; `.secrets.baseline` currency check | Blocks merge on High/Critical |
| `build-and-push` | Merge to `main` or `release/*` | Multi-stage `docker build`; `docker push ghcr.io/spacecom/<service>:sha-<commit>` via OIDC; `cosign sign` all images; `syft` SPDX-JSON SBOM generated and attached as `cosign attest`; `pip-licenses --format=json` + `license-checker-rseidelsohn --json` manifests merged into SBOM and uploaded as a workflow artifact (365-day retention); `docs/compliance/sbom/` updated with versioned SBOM artefact | Blocks deploy |
| `deploy-staging` | After `build-and-push` on `main` | Docker Compose update on staging host; smoke tests | Blocks production deploy gate |
| `deploy-production` | Manual approval after `deploy-staging` passes | `check_deployment_gate()` (no active CRITICAL/HIGH alerts); blue-green deploy | Manual |

**Image tagging convention:**

- `sha-<commit>` — immutable canonical tag; always pushed
- `v<major>.<minor>.<patch>` — release alias pushed on tagged commits
- `latest` — never pushed; forbidden in production Compose files (CI grep check enforces this)

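The "no `latest` in production Compose files" gate is specified as a grep check; the same rule can be sketched in Python for a pre-commit hook — `find_latest_tags` is a hypothetical helper, not an existing script:

```python
import re

def find_latest_tags(compose_text: str) -> list[str]:
    """Return image references pinned to :latest (forbidden in production).

    Matches lines of the form `image: <ref>:latest` in a Compose file."""
    return re.findall(r"image:\s*(\S+:latest)\b", compose_text)
```

A hook would read each production Compose file, call `find_latest_tags`, and fail the commit if the list is non-empty.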
**Build cache strategy:**

```yaml
# .github/workflows/ci.yml (build-and-push job excerpt)
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }} # ephemeral per-run token; no stored secret
- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: ghcr.io/spacecom/backend:sha-${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/spacecom/backend:buildcache
    cache-to: type=registry,ref=ghcr.io/spacecom/backend:buildcache,mode=max
```

pip and frontend build caches use `actions/cache` keyed on the lock file hash:

```yaml
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('backend/requirements.txt') }}
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
  with:
    path: frontend/.next/cache
    key: npm-${{ hashFiles('frontend/package-lock.json') }}
```

**`cosign` image signing and SBOM attestation** (added after each `docker push`):

```yaml
# .github/workflows/ci.yml — build-and-push job (after docker push steps)
- uses: sigstore/cosign-installer@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20 # v3.5.0

- name: Sign all service images with cosign (keyless, OIDC)
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      cosign sign --yes \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- name: Generate SBOM and attach as cosign attestation
  env:
    COSIGN_EXPERIMENTAL: "true"
  run: |
    for svc in backend worker-sim worker-ingest renderer frontend; do
      syft ghcr.io/spacecom/${svc}:sha-${{ github.sha }} \
        -o spdx-json=sbom-${svc}.spdx.json
      # Validate non-empty
      jq -e '.packages | length > 0' sbom-${svc}.spdx.json
      cosign attest --yes \
        --predicate sbom-${svc}.spdx.json \
        --type spdxjson \
        ghcr.io/spacecom/${svc}:sha-${{ github.sha }}
    done

- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
  with:
    name: sbom-${{ github.sha }}
    path: "*.spdx.json"
    retention-days: 365 # ESA bid artefacts; ECSS minimum 1 year

- name: Verify signature before deploy (deploy jobs only)
  if: github.event_name == 'workflow_dispatch'
  run: |
    cosign verify ghcr.io/spacecom/backend:sha-${{ github.sha }} \
      --certificate-identity-regexp="https://github.com/spacecom/spacecom/.*" \
      --certificate-oidc-issuer="https://token.actions.githubusercontent.com"
```

**All GitHub Actions pinned by commit SHA** (mutable `@vN` tags allow tag-repointing attacks that can exfiltrate all workflow secrets):

```yaml
# Correct form — all third-party actions in .github/workflows/*.yml:
- uses: docker/setup-buildx-action@4fd812986e6c8c2a69e18311145f9371337f27d # v3.4.0
- uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567 # v3.3.0
- uses: docker/build-push-action@1a162644f9a7e87d8f4b053101d1d9a712edc18c # v6.3.0
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f # v4.0.2
- uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.3.4
```

A CI lint check enforces that no mutable tags remain:

```bash
if grep -rE 'uses: [^@]+@v[0-9]' .github/workflows/; then
  echo "ERROR: Actions must be pinned by commit SHA, not tag"
  exit 1
fi
```

Use `pinact` or Renovate's `github-actions` manager to automate SHA updates.

#### Local Development Environment

**First-time setup (target: working stack in ≤ 15 minutes from a clean clone):**

```bash
git clone https://github.com/spacecom/spacecom && cd spacecom
cp .env.example .env        # fill in Space-Track credentials only; all others have safe defaults
pip install pre-commit && pre-commit install
make dev                    # starts the full stack with hot-reload
make seed                   # loads test objects, FIRs, and synthetic TIP events
# → Open http://localhost:3000; the globe shows 10 test objects
```

**`make` targets:**

| Target | What it does |
|--------|--------------|
| `make dev` | `docker compose up` with `./backend` and `./frontend/src` bind-mounted for hot-reload |
| `make test` | `pytest` (backend) + `vitest run` (frontend) + `playwright test` (E2E) |
| `make migrate` | `alembic upgrade head` inside the running backend container |
| `make seed` | Loads `fixtures/dev_seed.sql` + synthetic TIP events via seed script |
| `make lint` | Runs all pre-commit hooks against all files |
| `make clean` | `docker compose down -v` — removes all containers and volumes (destructive, prompts) |
| `make shell-db` | Opens a `psql` shell inside the TimescaleDB container |
| `make shell-backend` | Opens a bash shell inside the running backend container |

**Hot-reload configuration (`docker-compose.override.yml` — dev only, not committed to CI):**

```yaml
services:
  backend:
    volumes:
      - ./backend:/app   # bind mount — FastAPI --reload picks up changes instantly
    command: ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0"]
  frontend:
    volumes:
      - ./frontend/src:/app/src   # Next.js / Vite HMR
```

**`.env.example` structure (excerpt):**

```bash
# === Required: obtain before first run ===
SPACETRACK_USERNAME=your_email@example.com
SPACETRACK_PASSWORD=your_password

# === Required: generate locally ===
JWT_PRIVATE_KEY_PATH=./certs/jwt_private.pem   # openssl genrsa -out certs/jwt_private.pem 2048
JWT_PUBLIC_KEY_PATH=./certs/jwt_public.pem

# === Safe defaults for local dev (change for production) ===
POSTGRES_PASSWORD=spacecom_dev
REDIS_PASSWORD=spacecom_dev
MINIO_ACCESS_KEY=spacecom_dev
MINIO_SECRET_KEY=spacecom_dev_secret
HMAC_SECRET=dev_hmac_secret_change_in_prod

# === Stage flags ===
ENVIRONMENT=development   # development | staging | production
SHADOW_MODE_DEFAULT=false
DISABLE_SIMULATION_DURING_ACTIVE_EVENTS=false
```

All production-only variables are clearly marked. The README's "Getting Started" section mirrors the first-time setup steps above.

#### Staging Environment

**Purpose:** Continuous integration target for the `main` branch. Serves as the TRL artefact evidence environment — all shadow validation records and OWASP ZAP reports reference the staging deployment.

| Property | Staging | Production |
|----------|---------|------------|
| Infrastructure | Tier 2 (single-host Docker Compose) | Tier 3 (multi-host HA) |
| Data | Synthetic only — no production data | Real TLE/TIP/space weather |
| Secrets | Separate credential set; non-production Space-Track account | Production credential set in Vault |
| Deploy trigger | Automatic on merge to `main` | Manual approval in GitHub Actions |
| OWASP ZAP | Runs against every staging deploy | Run on demand before Phase 3 milestones |
| Retention | Environment resets weekly (fresh `make seed` run) | Persistent |

#### Secrets Rotation Procedure

Zero-downtime rotation is required. Service interruption during rotation is a reliability failure.

**JWT RS256 Signing Keypair:**

1. Generate a new keypair: `openssl genrsa -out jwt_private_new.pem 2048 && openssl rsa -in jwt_private_new.pem -pubout -out jwt_public_new.pem`
2. Load the new public key into the `JWT_PUBLIC_KEY_NEW` env var on all backend instances (old key still active)
3. The backend now validates tokens signed with either the old or new key
4. Update `JWT_PRIVATE_KEY` to the new key; new tokens are signed with the new key
5. Wait for all old tokens to expire (max 1h for access tokens; 30 days for refresh tokens)
6. Promote the new public key to the primary slot and remove `JWT_PUBLIC_KEY_NEW`; the old public key is no longer needed
7. Log a `security_logs` entry of type `KEY_ROTATION` with rotation timestamp and initiator

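The dual-key acceptance in step 3 can be sketched as a try-in-order verifier — a hedged illustration; `decode_with_rotation` and `InvalidSignature` are hypothetical names, and in production each verifier would wrap `jwt.decode(token, public_key_pem, algorithms=["RS256"])` from PyJWT:

```python
class InvalidSignature(Exception):
    """Raised when a token fails validation against one key."""

def decode_with_rotation(token: str, verifiers) -> dict:
    """Accept tokens signed with any currently-trusted key.

    `verifiers` is a list of callables ordered newest key first; each
    returns the decoded claims or raises InvalidSignature. During rotation
    the list holds both the new and the old public-key verifier, so tokens
    signed with either key validate until the old ones expire."""
    for verify in verifiers:
        try:
            return verify(token)
        except InvalidSignature:
            continue
    raise InvalidSignature("token not signed by any configured key")
```

After step 6 the list shrinks back to a single entry and behaviour is unchanged for callers.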
**Space-Track Credentials:**

1. Create a new Space-Track account or update the password via the Space-Track web portal
2. Update `SPACETRACK_USERNAME` / `SPACETRACK_PASSWORD` in the secrets manager (Docker secrets / Vault)
3. Trigger one manual ingest cycle; verify a 200 response from the Space-Track API
4. Deactivate the old credentials in the Space-Track portal
5. Log a `security_logs` entry of type `CREDENTIAL_ROTATION`

**MinIO Access Keys:**

1. Create a new access key pair via the MinIO console (`mc admin user add`)
2. Update `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` in the secrets manager
3. Restart backend and worker services (rolling restart — blue-green ensures zero downtime)
4. Verify pre-signed URL generation succeeds
5. Delete the old access key from the MinIO console

**HMAC Secret (prediction signing key):**

- **Do not rotate casually.** All existing HMAC-signed predictions will fail verification after rotation.
- Pre-rotation: re-sign all existing predictions with the new key (batch migration script required)
- Post-rotation: update `HMAC_SECRET` in the secrets manager; verify the batch re-sign by spot-checking 10 predictions
- Rotation must be approved by the engineering lead; a `security_logs` entry of type `HMAC_KEY_ROTATION` is required

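The batch re-sign step can be sketched with the standard library — a minimal sketch under assumed record fields (`id`, `payload`, `sig`); `sign_prediction` and `resign_all` are illustrative names, not the shipped migration script:

```python
import hmac
import hashlib

def sign_prediction(payload: bytes, secret: bytes) -> str:
    """HMAC-SHA256 signature over the canonical prediction payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def resign_all(predictions, old_secret: bytes, new_secret: bytes) -> list:
    """Verify each record against the old key, then re-sign with the new one.

    Records that fail old-key verification are reported, never silently
    re-signed — a failure here indicates tampering or a prior bad rotation."""
    failed = []
    for rec in predictions:
        expected = sign_prediction(rec["payload"], old_secret)
        if not hmac.compare_digest(rec["sig"], expected):
            failed.append(rec["id"])
            continue
        rec["sig"] = sign_prediction(rec["payload"], new_secret)
    return failed
```

Any non-empty `failed` list should abort the rotation before `HMAC_SECRET` is updated.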
---

### 26.10 Post-Deployment Safety Monitoring Programme (F9 — §61)

Pre-deployment testing and shadow validation demonstrate that a system was safe at a point in time. Post-deployment monitoring demonstrates that it remains safe in operational conditions. DO-278A §12 and EUROCAE ED-153 both require evidence of ongoing safety monitoring after deployment.

**Programme components:**

#### 26.10.1 Prediction Accuracy Monitoring

After each actual re-entry event for which SpaceCom generated predictions:

1. Record the actual re-entry time and location (from The Aerospace Corporation / ESA re-entry campaign results)
2. Compare against SpaceCom's p50 corridor centre and p95 bounds
3. Record in the `shadow_validations` table: `actual_reentry_time`, `actual_impact_region`, `p50_error_km`, `p95_captured` (boolean)
4. Compute running accuracy statistics: % of events where the actual impact was within the p95 corridor; median error in km
5. Publish accuracy statistics to `GET /api/v1/admin/accuracy-report` (accessible to ANSP admins)

**Alert trigger:** If the rolling 12-month p95 capture rate drops below 80% (target: 95%), engineering review is mandatory before the next ANSP shadow activation or model update deployment.

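The running statistics in step 4 can be sketched as follows — a hedged illustration over `shadow_validations` rows represented as dicts; `accuracy_stats` is a hypothetical helper, not the shipped report endpoint:

```python
from statistics import median

def accuracy_stats(validations: list) -> tuple:
    """Rolling accuracy over shadow_validations rows.

    Each row: {"p50_error_km": float, "p95_captured": bool}.
    Returns (p95 capture rate in %, median p50 error in km)."""
    captured = sum(1 for v in validations if v["p95_captured"])
    rate = 100.0 * captured / len(validations)
    return rate, median(v["p50_error_km"] for v in validations)
```

The alert trigger above then reduces to checking whether the first return value over the trailing 12 months is below 80.0.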
#### 26.10.2 Safety KPI Dashboard

Prometheus recording rules and a Grafana dashboard (`monitoring/dashboards/safety-kpis.json`):

| KPI | Metric | Target | Alert threshold |
|-----|--------|--------|-----------------|
| HMAC verification failures | `spacecom_hmac_verification_failures_total` | 0 / month | Any failure → SEV-1 |
| Safety occurrences | `safety_occurrences` table count | 0 / year | ≥ 1 → safety case review |
| Alert false positive rate | Manual: PIR review | < 5% | Engineering review if exceeded |
| Operator training currency | `operator_training_records` expiry | 100% current | < 95% → ANSP admin notification |
| p95 corridor capture rate | `shadow_validations` rolling 12-month | ≥ 95% | < 80% → model review |
| Prediction freshness (TLE age at prediction time) | `spacecom_tle_age_hours` histogram p95 | < 6h | > 24h → MEDIUM alert |

#### 26.10.3 Quarterly Safety Review

Mandatory quarterly safety review meeting. Output: `docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md`.

Agenda:

1. Safety KPI review (all metrics above)
2. Safety occurrences since the last review (zero is an acceptable answer — record it)
3. Hazard log review: has any hazard likelihood or severity changed since last quarter?
4. MoC status update: progress on PLANNED items
5. Model changes in the period: were any SAL-2 components modified? If so, safety case impact assessment
6. ANSP feedback: any concerns raised by ANSP customers regarding safety or accuracy?
7. Actions: owner, deadline, priority

**Attendance required:** safety case custodian + engineering lead. One ANSP contact may be invited as an observer (good practice for regulatory demonstration).

#### 26.10.4 Model Version Safety Monitoring

When a new model version is deployed (changes to `physics/` or `alerts/` SAL-2 components):

1. Shadow-run the new model in parallel for ≥ 14 days before replacing the production model
2. Compare new vs. old: prediction differences > 50 km for p50, or > 100 km for p95, require engineering review before promotion
3. After promotion: monitor `shadow_validations` for the next 3 re-entry events; raise a regression alert if the p95 capture rate declines
4. Record the version in `simulations.model_version`; all predictions are annotated with the model version they used

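The promotion gate in step 2 can be sketched as a pure predicate over the shadow-run comparison — a hedged sketch in which each comparison point is assumed to carry precomputed p50/p95 corridor shifts in km between the old and new model; `promotion_requires_review` is an illustrative name:

```python
P50_GATE_KM = 50.0
P95_GATE_KM = 100.0

def promotion_requires_review(deltas) -> bool:
    """True if any shadow-run comparison exceeds the promotion gate.

    `deltas`: iterable of {"p50_shift_km": float, "p95_shift_km": float},
    one per object/epoch compared during the ≥14-day parallel run."""
    return any(
        d["p50_shift_km"] > P50_GATE_KM or d["p95_shift_km"] > P95_GATE_KM
        for d in deltas
    )
```

A True result blocks automatic promotion and routes the comparison set to engineering review.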
---

## 27. Capacity Planning

### 27.0 Performance Test Specification (F6)

Performance tests live in `tests/load/` and are run with **k6**. They are not part of the standard `make test` suite — they require a running environment with realistic data. They run:

- Manually before any Phase gate release
- Automatically on the staging environment nightly (scheduled k6 Cloud or self-hosted k6)
- Results are committed to `docs/validation/load-test-results/` after each Phase gate

#### Scenarios

```javascript
// tests/load/scenarios.js
export const options = {
  scenarios: {
    czml_catalog: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '30s', target: 50 },
        { duration: '2m', target: 100 },
        { duration: '30s', target: 0 },
      ],
    },
    websocket_subscribers: {
      executor: 'constant-vus', vus: 200, duration: '3m',
    },
    decay_submit: {
      executor: 'constant-arrival-rate', rate: 5, timeUnit: '1m',
      preAllocatedVUs: 10, duration: '5m',
    },
  },
};
```

#### SLO Assertions (k6 thresholds — test fails if breached)

| Scenario | Metric | Threshold |
|----------|--------|-----------|
| CZML catalog (`GET /objects` + CZML) | p95 response time | < 2 000 ms |
| API auth (`POST /auth/token`) | p99 response time | < 500 ms |
| Decay prediction submit | p95 response time | < 500 ms (202 accept only) |
| WebSocket connection | 200 concurrent connections stable for 3 min | 0 connection drops |
| WebSocket alert delivery | Time from DB insert to browser receipt | < 30 000 ms p95 |
| `/readyz` probe | p99 response time | < 100 ms |

#### Baseline Environment

Performance tests are only comparable if run against a consistent hardware baseline:

```markdown
# docs/validation/load-test-baseline.md
- Host: 8 vCPU / 32 GB RAM (Tier 2 single-host)
- TimescaleDB: 100 tracked objects, 90 days of orbit history
- Celery workers: simulation ×16 concurrency, ingest ×2
- Redis: empty (no warm cache) at test start
```

Results from a different hardware spec must be labelled separately and not compared to the baseline. A performance regression is defined as any threshold breach on the **same** baseline hardware.

#### Storing and Trending Results

k6 outputs a JSON summary; a CI step uploads it to `docs/validation/load-test-results/YYYY-MM-DD-{env}.json`. A lightweight Python script (`scripts/load-test-trend.py`) plots p95 latency over time for the past 10 runs and embeds the chart in `docs/TEST_PLAN.md`. A > 20% increase in any p95 metric between consecutive runs on the same hardware automatically creates a `performance-regression` GitHub issue.

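The 20% regression rule can be sketched as follows — a minimal sketch of the comparison step, assuming the k6 summaries have been reduced to `{scenario: p95_ms}` dicts; `regressions` is an illustrative name, not the actual `load-test-trend.py` API:

```python
def regressions(prev: dict, curr: dict, threshold: float = 0.20) -> list:
    """Scenarios whose p95 latency grew by more than `threshold` (default 20%).

    `prev` and `curr` map scenario name → p95 latency in ms from two
    consecutive runs on the same baseline hardware. Each flagged scenario
    should open a `performance-regression` issue."""
    flagged = []
    for name, p95 in curr.items():
        baseline = prev.get(name)
        if baseline and (p95 - baseline) / baseline > threshold:
            flagged.append(name)
    return flagged
```

Scenarios absent from the previous run have no baseline and are skipped rather than flagged.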
### 27.1 Workload Characterisation

| Workload | CPU Profile | Memory | Dominant Constraint |
|----------|-------------|--------|---------------------|
| **MC decay prediction** (500 samples) | CPU-bound, parallelisable | 200–500 MB per process | CPU cores on simulation workers |
| SGP4 catalog propagation (100 objects) | Trivial | < 100 MB | None — analytical model |
| CZML generation | I/O-bound (DB read) | < 500 MB | DB query latency |
| Atmospheric breakup | CPU-bound, light | ~200 MB | Negligible vs. MC |
| Conjunction screening (100 objects) | CPU-bound, seconds | ~500 MB | Acceptable on any worker |
| Controlled re-entry planner | CPU-bound, similar to MC | 500 MB | Same pool as MC |
| Playwright renderer | Memory-bound (Chromium) | 1–2 GB per instance | Isolated container |
| TimescaleDB queries | I/O-bound | 64 GB (buffer cache) | NVMe IOPS for spatial queries |

**Cost-tracking metrics (F3, F4, F11):**

Add the following Prometheus counters to enable per-org cost attribution and external API budget visibility. These feed the unit economics model (§27.7) and the Enterprise tier chargeback reports.

```python
# backend/app/metrics.py (add to existing prometheus_client registry)
from prometheus_client import Counter

# F3 — External API call budget tracking
ingest_api_calls_total = Counter(
    "spacecom_ingest_api_calls_total",
    "Total external API calls made by the ingest worker",
    labelnames=["source"],  # "space_track", "celestrak", "noaa_swpc", "esa_discos", "iers"
)
# Usage: ingest_api_calls_total.labels(source="space_track").inc()
# Alert: if space_track calls > 100/day → investigate polling loop bug (Space-Track AUP limit: 200/day)

# F4 — Per-org simulation CPU attribution
simulation_cpu_seconds_total = Counter(
    "spacecom_simulation_cpu_seconds_total",
    "Total CPU-seconds consumed by MC simulations, by org and object",
    labelnames=["org_id", "norad_id"],
)
# Usage: simulation_cpu_seconds_total.labels(org_id=str(org_id), norad_id=str(norad_id)).inc(elapsed)
# This is the primary input to infrastructure_cost_per_mc_run in §27.7
```

**F5 — Inbound API request counter (§68):**

```python
# backend/app/metrics.py (add to existing prometheus_client registry)
api_requests_total = Counter(
    "spacecom_api_requests_total",
    "Total inbound API requests, by org, endpoint, and API version",
    labelnames=["org_id", "endpoint", "version", "status_code"],
)
# Usage (FastAPI middleware):
# api_requests_total.labels(
#     org_id=str(request.state.org_id),
#     endpoint=request.url.path,
#     version=request.headers.get("X-API-Version", "v1"),
#     status_code=str(response.status_code),
# ).inc()
```

This counter is the foundation for future API tier enforcement (e.g., 1,000 requests/month for Professional; unlimited for Enterprise) and for supporting usage-based billing for Persona E/F API consumers. Add it to the FastAPI middleware stack alongside `prometheus_fastapi_instrumentator`.

**F11 — Per-org cost attribution for Enterprise tier:**

Enterprise contracts may include usage-based clauses (e.g., MC simulation credits). The `simulation_cpu_seconds_total` metric provides the raw data; a monthly Celery task (`tasks/billing/generate_usage_report.py`) aggregates it per org:

```python
@shared_task
def generate_monthly_usage_report(org_id: str, year: int, month: int):
    """Aggregate simulation CPU-seconds and ingest API calls per org for billing review."""
    # Query Prometheus/VictoriaMetrics for the org's metrics over the billing period
    # Output: docs/business/usage_reports/{org_id}/{year}-{month:02d}.json
    # Fields: total_mc_runs, total_cpu_seconds, estimated_cost_usd (at $0.40/run internal rate)
```

Per-org usage reports are stored in `docs/business/usage_reports/` and referenced in Enterprise QBRs. The cost rate (`$0.40/run` at Tier 3 scale) is updated quarterly in `docs/business/UNIT_ECONOMICS.md`.

**Usage surfaced to commercial team and org admins (F2 — §68):**

Usage data must reach two audiences: the commercial team (for renewal and expansion conversations) and the org admin (to understand the value received).

*Commercial team:* A monthly Celery Beat task (`tasks/commercial/send_commercial_summary.py`) emails `commercial@spacecom.io` on the 1st of each month with:
- Per-org: MC simulation count, PDF reports generated, WebSocket connection hours, alert events (by severity)
- Trend vs. previous 3 months (growth signal for expansion conversations)
- Contracts expiring within 90 days (renewal pipeline)

*Org admin:* A monthly usage summary email to each org's admin contact showing their own usage. Template: *"In [month], your team ran [N] decay predictions, generated [M] PDF reports, and received [K] CRITICAL alerts. Your monthly quota: [Q] simulations (used: [N])."* This email reinforces value perception ahead of renewal conversations.

Both emails use the `generate_monthly_usage_report` output. Add `send_usage_summary_emails` to celery-redbeat at `crontab(day_of_month=1, hour=6)`.

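A minimal sketch of the corresponding celery-redbeat schedule entries (module layout, task paths, and names are assumptions, not the repository's actual identifiers):

```python
# celeryconfig.py — beat schedule entries (paths illustrative)
from celery.schedules import crontab

beat_schedule = {
    "send-commercial-summary": {
        "task": "tasks.commercial.send_commercial_summary",
        "schedule": crontab(day_of_month="1", hour="6", minute="0"),
    },
    "send-usage-summary-emails": {
        "task": "tasks.commercial.send_usage_summary_emails",
        "schedule": crontab(day_of_month="1", hour="6", minute="0"),
    },
}
```

With celery-redbeat, the same entries are picked up by pointing the beat scheduler at `redbeat.RedBeatScheduler`.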
### 27.2 Monte Carlo Parallelism Architecture

The MC decay predictor must use **Celery `group` + `chord`** to distribute sample computation across the full worker pool. `multiprocessing.Pool` within a single task is limited to one container's cores.

```python
from celery import group, chord

@celery.task
def run_mc_decay_prediction(object_id: int, params: dict) -> str:
    """Fan out 500 samples as individual sub-tasks; aggregate with chord callback."""
    sample_tasks = group(
        run_single_trajectory.s(object_id, params, seed=i)
        for i in range(params['mc_samples'])
    )
    result = chord(sample_tasks)(aggregate_mc_results.s(object_id, params))
    return result.id

@celery.task
def run_single_trajectory(object_id: int, params: dict, seed: int) -> dict:
    """Single RK7(8) + NRLMSISE-00 trajectory integration. CPU time: 2–20s."""
    rng = np.random.default_rng(seed)
    f107 = params['f107'] * rng.normal(1.0, 0.20)    # ±20% variation
    bstar = params['bstar'] * rng.normal(1.0, 0.10)
    return integrate_trajectory(object_id, f107, bstar, params)

@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])
```

**Worker concurrency for chord sub-tasks:**
- Each sub-task is short (2–20s) and CPU-bound
- Worker `--pool=prefork --concurrency=16`: 16 OS processes per container
- 2 simulation worker containers: 32 concurrent sub-tasks
- 500 samples / 32 = ~16 batches × ~10s average = **~160s per MC run** (p50)
- p95 target of 240s met with headroom

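The batching arithmetic above can be checked directly; a small sketch (the function name is illustrative) using the plan's worker counts and a ~10s average sample time:

```python
import math

def estimate_mc_runtime_s(samples: int, containers: int, concurrency: int,
                          avg_sample_s: float) -> float:
    """Estimate wall-clock time for a chord fan-out: samples execute in
    sequential waves sized by the total number of worker slots."""
    slots = containers * concurrency        # concurrent sub-tasks
    batches = math.ceil(samples / slots)    # sequential waves
    return batches * avg_sample_s

# Tier 2: 2 containers x 16 prefork processes, 500 samples at ~10s average
print(estimate_mc_runtime_s(500, 2, 16, 10.0))  # 160.0
```

The same formula with Tier 3's 4 containers (64 slots) gives the ~80s p50 quoted in §27.3.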
**Chord result backend:** Sub-task results are stored in Redis temporarily (< 1 MB each × 500 = 500 MB peak per run). Results expire after 1 hour (`result_expires = 3600` in `celeryconfig.py` — §27.8). The aggregate callback reads all results, computes the final prediction, and writes to TimescaleDB — Redis is not the durable store.

**Chord callback result count validation (F1 — §67):** Redis `noeviction` prevents eviction, but if Redis is misconfigured or hits `maxmemory` and rejects writes, sub-task results may be missing when the chord callback fires. The callback must validate that it received the expected number of results before writing to TimescaleDB:

```python
@celery.task
def aggregate_mc_results(results: list[dict], object_id: int, params: dict) -> str:
    """Compute percentiles, build corridor polygon, HMAC-sign, write to DB."""
    expected = params['mc_samples']
    if len(results) != expected:
        # Partial result — do not write a silently truncated prediction
        raise ValueError(
            f"MC chord received {len(results)}/{expected} results for object {object_id}. "
            "Redis result backend may be under memory pressure. Aborting."
        )
    prediction = compute_percentiles_and_corridor(results)
    prediction['record_hmac'] = sign_prediction(prediction, settings.hmac_secret)
    write_prediction_to_db(prediction)
    return str(prediction['id'])
```

The `ValueError` causes the chord callback to fail and be routed to the DLQ (Dead Letter Queue). The originating API call receives a task failure, and the client receives `HTTP 500` with `Retry-After`. A `spacecom_mc_chord_partial_result_total` counter fires, triggering a CRITICAL alert: *"MC chord received partial results — Redis memory budget exceeded."*

### 27.3 Deployment Tiers

#### Tier 1 — Development and Demonstration

Single machine, Docker Compose, all services co-located. No HA. Suitable for development, internal demos, and ESA TRL 4 demonstrations.

| Spec | Minimum | Recommended |
|------|---------|-------------|
| CPU | 8 cores | 16 cores |
| RAM | 16 GB | 32 GB |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD |
| Cloud equivalent | `t3.2xlarge` ~$240/mo | `m6i.4xlarge` ~$540/mo |

MC prediction p95: ~400–800s (exceeds SLO — acceptable for demo; noted in demo briefings).

---

#### Tier 2 — Phase 1–2 Production

Separate containers per service. Meets SLOs under moderate load (≤ 5 concurrent simulation users). Single node per service — no HA. Suitable for shadow mode deployments and early ANSP pilots.

| Service | vCPU | RAM | Storage | Cloud (AWS) | Monthly |
|---------|------|-----|---------|-------------|---------|
| Backend API | 4 | 8 GB | — | `c6i.xlarge` | ~$140 |
| Simulation Workers ×2 | **16 each** | 32 GB each | — | `c6i.4xlarge` ×2 | ~$560 each |
| Ingest Worker | 2 | 4 GB | — | `t3.medium` | ~$30 |
| Renderer | 4 | 8 GB | — | `c6i.xlarge` | ~$140 |
| TimescaleDB | 8 | **64 GB** | 1 TB NVMe | `r6i.2xlarge` | ~$420 |
| Redis | 2 | 8 GB | — | `cache.r6g.large` | ~$120 |
| MinIO / S3 | 4 | 8 GB | 4 TB | `i3.xlarge` + EBS | ~$200 |
| **Total** | | | | | **~$2,200/mo** |

**On-premise equivalent (Tier 2):** Two servers — a compute host (2× AMD EPYC 7313P, 32 total cores, 192 GB RAM) and a storage host (8 cores, 256 GB RAM, 2 TB NVMe + 8 TB HDD). Capital cost: **~$25,000–35,000**.

---

#### Tier 3 — Phase 3 HA Production

Full redundancy. Meets the 99.9% availability SLO, including during active TIP events. Required before any formal operational ANSP deployment.

| Service | Count | vCPU each | RAM each | Notes |
|---------|-------|-----------|----------|-------|
| Backend API | 2 | 4 | 8 GB | Load balanced; blue-green deployable |
| Simulation Workers | **4** | **16** | 32 GB | 64 total cores; chord sub-tasks fill all |
| Ingest Worker | 2 | 2 | 4 GB | celery-redbeat leader election |
| Renderer | 2 | 4 | 8 GB | Network-isolated; Chromium memory budget |
| TimescaleDB Primary | 1 | 8 | **128 GB** | Patroni-managed; synchronous replication |
| TimescaleDB Standby | 1 | 8 | **128 GB** | Hot standby; auto-failover ≤ 30s |
| Redis Sentinel ×3 | 3 | 2 | 8 GB | Quorum; master failover ≤ 10s |
| MinIO (distributed) | 4 | 4 | 16 GB | Erasure coding EC:2; 2× 2 TB NVMe each |
| **Cloud total (AWS)** | | | | ~**$6,000–7,000/mo** |

With 64 simulation worker cores, a 500-sample MC run completes in **~80s p50, ~120s p95** — well within SLO.

**MinIO Erasure Coding (Tier 3):** 4-node distributed MinIO uses **EC:2** (2 parity shards). This provides:
- **Read quorum:** any 2 of 4 nodes (tolerates 2 simultaneous node failures for reads)
- **Write quorum:** requires 3 of 4 nodes (tolerates 1 simultaneous node failure for writes)
- **Effective storage:** 50% of raw capacity (e.g., 8 TB raw across 4 nodes → 4 TB usable). The Tier 3 table's 8 TB usable therefore requires 16 TB raw (4 nodes × 2× 2 TB NVMe each); resize if the usable target changes
- Configured via `MINIO_ERASURE_SET_DRIVE_COUNT=4` and server startup with all 4 node endpoints

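The quorum and capacity figures follow from the EC:2 arithmetic; a simplified sketch (this models MinIO's documented behaviour for the parity = N/2 case, not the server's internal logic):

```python
def ec_profile(nodes: int, parity: int) -> dict:
    """Simplified erasure-coding arithmetic for an N-shard set.
    Assumes parity == nodes / 2, as in the EC:2-on-4-nodes setup above."""
    data = nodes - parity
    return {
        "read_quorum": data,                   # any `data` shards reconstruct an object
        "write_quorum": data + 1,              # one extra shard required when parity == nodes/2
        "storage_efficiency": data / nodes,    # usable fraction of raw capacity
    }

print(ec_profile(4, 2))
# {'read_quorum': 2, 'write_quorum': 3, 'storage_efficiency': 0.5}
```
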
**Multi-region stance:** SpaceCom is **single-region** through all three phases. Reasoning:
- The Phase 1–3 customer base is small (ESA evaluation, early ANSP pilots); cross-region replication cost and operational complexity are not justified.
- Government and defence customers may have data sovereignty requirements — a single, clearly defined deployment region (customer-specified) is simpler to certify than an active-active multi-region setup.
- When a customer in a second jurisdiction is onboarded, deploy a **separate, independent instance** in their required jurisdiction rather than extending a single global cluster. Each instance has its own data, its own compliance scope, and its own operational team contact.
- This decision is documented as ADR-0010 (see §34 decision log).

**On-premise equivalent (Tier 3):** Three servers — 2× compute (2× EPYC 7343, 32 cores, 256 GB RAM each) + 1× storage (128 GB RAM, 4× 2 TB NVMe RAID-10, 16 TB HDD). Capital cost: **~$60,000–80,000**.

**Celery worker idle cost and scale-to-zero decision (F6):**

Simulation workers are the largest cloud line item ($560/mo each at Tier 2 on `c6i.4xlarge`). Their actual compute utilisation depends on MC run frequency:

| Usage pattern | Active compute/day | Idle fraction | Monthly cost at Tier 2 ×2 workers |
|--------------|-------------------|--------------|----------------------------------|
| Light (5 MC runs/day × 80s p50) | ~7 min/day | ~99.5% | $1,120 |
| Moderate (20 MC runs/day × 80s) | ~27 min/day | ~98.1% | $1,120 |
| Heavy (100 MC runs/day × 80s) | ~133 min/day | ~90.7% | $1,120 |

**Scale-to-zero analysis:**

| Approach | Pros | Cons | Decision |
|---------|------|------|---------|
| Always-on (Tier 1–2) | Zero cold-start; SLO met immediately | High idle cost when lightly used | **Use at Tier 1–2** — cost is ~$1,120/mo regardless; the latency SLO requires workers ready |
| Scale-to-1 minimum (Tier 3) | Reduced idle cost vs. 4×; one worker handles ingest keepalive tasks | Cold-start for bursts: 3 new workers × 30–60s spin-up; MC SLO may breach during a burst | **Use at Tier 3** — scale-to-1 minimum; HPA/KEDA scales 1→4 on `celery_queue_length > 10` |
| Scale-to-zero | Maximum idle savings | 60–120s cold-start violates the 10-min MC SLO when all workers are down | **Do not use** — cold-start from zero exceeds acceptable latency for on-demand simulation |

**Implementation at Tier 3 (Kubernetes):** Use a KEDA `ScaledObject` with a Redis list trigger on the Celery queue:
```yaml
triggers:
  - type: redis
    metadata:
      listName: celery              # Celery default queue
      listLength: "10"              # scale up when >10 tasks queued
      activationListLength: "1"     # activation threshold; scale-to-1 enforced via minReplicaCount
```
Minimum replica count: **1**. Maximum: **4**. Scale-down stabilisation window: 5 minutes (prevents oscillation during multi-run bursts).

**Ingest worker:** Always-on, single instance (2 vCPU, $30/mo at Tier 2). celery-redbeat tasks run on 1-minute and hourly schedules; scale-to-zero is not appropriate. At Tier 3, run 2 instances for redundancy; no autoscaling needed.

---

### 27.4 Storage Growth Projections

| Data | Retention | Raw Growth/Year | Compressed/Year | Cloud Cost/Year (est.) | Notes |
|------|-----------|----------------|----------------|----------------------|-------|
| `orbits` (100 objects, 1/min) | 90 days online | ~15 GB | ~2 GB | ~$20 (EBS gp3, rolling) | TimescaleDB compression ~7:1 |
| `tle_sets` | 1 year | ~55 MB | ~30 MB | Negligible | — |
| `space_weather` | 2 years | ~5 MB | ~2 MB | Negligible | — |
| MC simulation blobs (MinIO) | 2 years | 500 GB–2 TB | Not compressed | **$140–$560/yr** (S3-IA after 90d) | **Dominant cost** — S3-IA at $0.0125/GB/mo |
| PDF reports (MinIO) | 7 years | 10–90 GB | 5–45 GB | $5–$45/yr (S3 Glacier) | $0.004/GB/mo Glacier tier |
| WAL archive (backup) | 30 days rolling | ~25 GB/month | — | **~$100/yr** (300 GB peak × $0.023/GB/mo × 12) | S3 Standard; rolls over; cost is steady-state |
| `security_logs` | 2 years online; 7-year archive | ~500 MB/year | — | Negligible | Legal hold |
| `reentry_predictions` | 7 years | ~100 MB/year | — | Negligible | Legal hold |
| Safety records (`alert_events`, `notam_drafts`, `prediction_outcomes`, `degraded_mode_events`, coordination notes) | **5-year minimum** append-only archive | ~200 MB/year | — | Negligible | ICAO Annex 11 §2.26; safety investigation requirement |

**Storage cost summary (Phase 2 steady-state):** MC blobs dominate at sustained use. At 50 runs/day × 120 MB/run = 2.2 TB/year, 2-year retention on S3-IA ≈ **$660/year** in object storage alone. This should be captured in the unit economics model (§27.7). Storage cost is the primary variable cost that scales with usage depth (number of MC runs), not with the number of users.

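The $660/year figure can be reproduced from the stated rates (50 runs/day, ~120 MB/run, 2-year retention, S3-IA at $0.0125/GB/mo):

```python
runs_per_day = 50
blob_gb = 0.120                     # ~120 MB per MC run
retention_years = 2
s3_ia_per_gb_month = 0.0125

annual_growth_tb = runs_per_day * blob_gb * 365 / 1000             # ~2.19 TB/year
steady_state_gb = runs_per_day * blob_gb * 365 * retention_years   # occupancy once retention fills
annual_cost_usd = steady_state_gb * s3_ia_per_gb_month * 12

print(round(annual_growth_tb, 2), round(annual_cost_usd))  # 2.19 657
```
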
**Backup cost projection (F9):** WAL archive at a 30-day rolling window: ~300 GB peak occupancy on S3 Standard ≈ **$83/year** (Tier 2). At Tier 3 with synchronous replication, the base backup is ~2× the TimescaleDB data size. At 1 TB compressed DB size: one weekly base backup (retained 4 weeks) = ~4 TB S3 occupancy → **~$1,100/year** at Tier 3. Include backup S3 bucket costs in the infrastructure budget from Phase 3 onwards. Budget line: `infra/backup-s3` ≈ $100–200/month at steady Tier 3 scale.

**Safety record retention policy (Finding 11):** Safety-relevant event records have a distinct retention category separate from general operational data. A `safety_record BOOLEAN DEFAULT FALSE` flag on `alert_events` and `notam_drafts` marks records that must survive the standard retention drop. Records with `safety_record = TRUE` are excluded from TimescaleDB drop policies and transferred to the MinIO cold tier (append-only) after 90 days online, retained for 5 years minimum. The TimescaleDB retention job checks `WHERE safety_record = FALSE` before dropping chunks. `safety_record` is set to `TRUE` at insert time for any event with `alert_level IN ('HIGH', 'CRITICAL')` and for all NOTAM drafts.

**MC blob storage dominates at scale.** At sustained use (50 MC runs/day × 120 MB/run): 2.2 TB/year. The Tier 3 distributed MinIO (8 TB usable with erasure coding across 4 nodes of 2× 2 TB NVMe) covers approximately 3–4 years before expansion.

**Cold tier tiering decision (two object classes with different requirements):**

| Object class | Cold tier target | Reason |
|---|---|---|
| MC simulation blobs (`mc_blobs/` prefix) | **MinIO ILM warm tier or S3 Infrequent Access** | Blobs may need to be replayed for Mode C visualisation of historical events (e.g., regulatory dispute review, incident investigation). Glacier's 12h restore latency is operationally unacceptable for this use case. |
| Compliance-only documents (`reports/`, `notam_drafts/`) | **S3 Glacier / Glacier Deep Archive acceptable** | These are legal records requiring 7-year retention; retrieval is for audit or legal discovery only; 12h restore latency is acceptable. |

MinIO ILM rules are configured in `docs/runbooks/minio-lifecycle.md`. Lifecycle transitions: MC blobs after 90 days → ILM warm (lower-cost MinIO tier or S3-IA); compliance docs after 1 year → Glacier.

**MinIO multipart upload retry and incomplete upload expiry (F7 — §67):**

MC simulation blobs (~120 MB each) are uploaded as multipart uploads. During a MinIO node failure in EC:2 distributed mode, write quorum (3/4 nodes) may be temporarily unavailable. An in-flight multipart upload will fail with `MinioException` / `S3Error`. Without a retry policy, the MC prediction is written to TimescaleDB but the blob is lost — the historical replay functionality silently fails.

```python
# worker/tasks/blob_upload.py
import io

from minio.error import S3Error

@shared_task(
    autoretry_for=(S3Error, ConnectionError),
    max_retries=3,
    retry_backoff=30,    # 30s, 60s, 120s — allow node recovery
    retry_jitter=True,
)
def upload_mc_blob(prediction_id: str, blob_data: bytes):
    """Upload MC simulation blob to MinIO with retry on quorum failure."""
    object_key = f"mc_blobs/{prediction_id}.msgpack"
    minio_client.put_object(
        bucket_name="spacecom-simulations",
        object_name=object_key,
        data=io.BytesIO(blob_data),
        length=len(blob_data),
        content_type="application/msgpack",
    )
```

**Incomplete multipart upload cleanup:** Configure a MinIO lifecycle rule to abort incomplete multipart uploads after 24 hours. Add to `docs/runbooks/minio-lifecycle.md`:
```bash
mc ilm rule add --expire-delete-marker --noncurrent-expire-days 1 \
  --abort-incomplete-multipart-upload-days 1 \
  spacecom/spacecom-simulations
```
This prevents orphaned multipart upload parts from accumulating on disk during node failures or application crashes mid-upload.

### 27.5 Network and External Bandwidth

| Traffic | Direction | Volume | Notes |
|---------|-----------|--------|-------|
| Space-Track TLE polling | Outbound | ~1 MB per run, every 4h | ~6 MB/day |
| NOAA SWPC space weather | Outbound | ~50 KB per fetch, hourly | ~1 MB/day |
| ESA DISCOS | Outbound | ~10 MB/day (initial bulk); ~100 KB/day incremental | — |
| CZML to clients | Outbound | ~5–15 MB per user page load (full); <500 KB/hr delta | Scales linearly with users; delta protocol essential |
| WebSocket to clients | Outbound | ~1 KB/event × events/day | Low bandwidth, persistent connection |
| PDF reports (download) | Outbound | ~2–5 MB per report | Low frequency; MinIO presigned URL avoids backend proxy |
| MinIO internal traffic | Internal | Dominated by MC blob writes | Keep on internal Docker network |

**CZML egress cost estimate and compression policy (F5):**

At Phase 2 (10 concurrent users), daily CZML egress:
- Initial full loads: 10 users × 3 page loads/day × 15 MB = 450 MB/day
- Delta updates (delta protocol, §6): 10 users × 8h active × 500 KB/hr = 40 MB/day
- **Total: ~490 MB/day ≈ 15 GB/month**

At $0.085/GB AWS CloudFront egress: **~$1.28/month** (Phase 2) → **~$6.40/month** (50 users, Phase 3).

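Reproducing the egress estimate as arithmetic (user counts, session hours, and payload sizes are the plan's assumptions):

```python
users = 10
full_loads_mb = users * 3 * 15      # 3 page loads/day x 15 MB full CZML
delta_mb = users * 8 * 0.5          # 8h active x 500 KB/hr delta updates
daily_mb = full_loads_mb + delta_mb
monthly_gb = daily_mb * 30 / 1000
cost_usd = monthly_gb * 0.085       # CloudFront egress rate

print(daily_mb, round(monthly_gb, 1))  # 490.0 14.7
```

At 50 users the same arithmetic scales linearly to roughly the $6.40/month Phase 3 figure.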
CZML egress is **not a significant cost driver** at this scale, but it is significant for latency and user experience. Compression policy:

| Encoding | CZML size reduction | Implementation |
|---------|-------------------|----------------|
| gzip (Accept-Encoding) | 60–75% | Caddy `encode gzip` — already included in the §26.9 Caddy config |
| Brotli | 70–80% | Caddy `encode zstd br gzip` — use `br` for browser clients |
| CZML delta protocol (`?since=`) | 95%+ for incremental updates | Already specified in §6 |

**Minimum requirement:** The Caddy `encode` block must include `br` before `gzip` in the content negotiation order. A 15 MB CZML payload compresses to ~3–5 MB with Brotli. Verify with `curl -H "Accept-Encoding: br" -I <url>` — the response must show `Content-Encoding: br`.

Network is not a constraint for this workload at the scales described. Standard 1 Gbps datacenter networking is sufficient. For on-premise government deployments, a standard enterprise LAN is adequate.

---

### 27.6 DNS Architecture and Service Discovery

#### Tier 1–2 (Docker Compose)

Docker Compose provides built-in DNS resolution by service name within each network. Services reference each other by container name (e.g., `db`, `redis`, `minio`). No additional DNS infrastructure is required.

**PgBouncer as single DB connection target:** At Tier 2, the backend and workers connect to `pgbouncer:5432`, not directly to `db:5432`. PgBouncer multiplexes connections and acts as a stable endpoint:
- In a Patroni failover, `pgbouncer` is reconfigured to point to the new primary; application code never changes connection strings.
- PgBouncer configuration: `docs/runbooks/pgbouncer-config.md`

**Celery task retry during Patroni failover (F2 — §67):** During the ≤ 30s Patroni leader election window, writes through PgBouncer fail with `FATAL: no connection available` or `OperationalError: server closed the connection unexpectedly`. Celery tasks that execute a DB write during this window will raise `sqlalchemy.exc.OperationalError`. Without a retry policy, these tasks fail permanently and are routed to the DLQ.

All Celery tasks that write to the database must declare:
```python
@shared_task(
    autoretry_for=(OperationalError,),
    max_retries=3,
    retry_backoff=5,        # 5s, 10s, 20s
    retry_backoff_max=30,   # cap at 30s (within failover window)
    retry_jitter=True,
)
def my_db_writing_task(*args, **kwargs):
    ...
```

This covers: `aggregate_mc_results`, `write_alert_event`, `write_prediction_outcome`, and all ingest tasks. Tasks that only read from the DB should also retry on `OperationalError`, since PgBouncer may pause reads during leader election. Add an integration test: simulate `OperationalError` on the first two attempts → the task succeeds on the third attempt.

#### Tier 3 (HA / Kubernetes migration path)

At Tier 3, introduce **split-horizon DNS**:

| Zone | Scope | Purpose |
|------|-------|---------|
| `spacecom.internal` | Internal services | Service discovery: `backend.spacecom.internal`, `db.spacecom.internal` (→ PgBouncer VIP) |
| `spacecom.io` (or customer domain) | Public internet | Caddy termination endpoint; ACME certificate domain |

**Service discovery implementation:**
- **Cloud (AWS/GCP/Azure):** Use cloud-native internal DNS (Route 53 private hosted zones / Cloud DNS) plus a load balancer for each service tier
- **On-premise:** CoreDNS deployed as a DaemonSet (Kubernetes) or as a Docker container on the management network; service records updated via Patroni callback scripts on failover

**Key DNS records (Tier 3):**

| Record | Type | Value |
|--------|------|-------|
| `db.spacecom.internal` | A | PgBouncer VIP (stable through Patroni failover) |
| `redis.spacecom.internal` | A | Redis Sentinel VIP |
| `minio.spacecom.internal` | A | MinIO load balancer (all 4 nodes) |
| `backend.spacecom.internal` | A | Backend API load balancer (2 instances) |

---

### 27.7 Unit Economics Model

**Reference document:** `docs/business/UNIT_ECONOMICS.md` — maintained alongside this plan; update whenever pricing or infrastructure costs change.

Unit economics express the cost to serve one organisation per month and the revenue generated, enabling margin analysis per tier.

**Cost-to-serve model (Phase 2, cloud-hosted, per org):**

| Cost driver | Basis | Monthly cost per org |
|------------|-------|---------------------|
| Simulation workers (shared pool) | 2 workers shared across all orgs; allocate by MC run share | $1,120 ÷ org count |
| TimescaleDB (shared instance) | ~$420/mo; fixed regardless of org count up to Phase 2 capacity | $420 ÷ org count |
| Redis (shared) | ~$120/mo | $120 ÷ org count |
| MinIO / S3 storage | Variable; ~$660/yr at heavy MC use → $55/mo | $5–55/mo |
| Backend API (shared) | ~$140/mo | $140 ÷ org count |
| Ingest worker (shared) | ~$30/mo | Allocated to platform overhead |
| Email relay | ~$0.001/email × volume | $0–5/mo |
| CZML egress | ~$0.085/GB | $1–7/mo |
| **Total variable (1 org, Tier 2)** | | **~$1,860/mo platform + $60–70 per-org variable** |

**Revenue per tier (target pricing — cross-reference §55 commercial model):**

| Tier | MRR / org | Gross margin target |
|------|-----------------|-------------------|
| Free / Evaluation | $0 | Negative — cost of the ESA relationship |
| Professional (shadow) | $3,000–6,000/mo | 50–70% at ≥3 orgs on the platform |
| Enterprise (operational) | $15,000–40,000/mo | 65–75% at Tier 3 scale |

**Break-even analysis:** At Tier 2 platform cost (~$2,200/mo), break-even at the Professional tier requires ≥1 paying org at $3,000/mo. Each additional Professional org on shared infrastructure has near-zero incremental infrastructure cost until capacity boundaries are reached (MC concurrency limit, DB connection pooler limit).

**Key unit economics metric:** `infrastructure_cost_per_mc_run`. At Tier 2 (2 workers, $1,120/mo) and 500 runs/month: **$2.24/run**. At Tier 3 (4 workers, KEDA scale-to-1, ~$800/mo amortised at medium utilisation) and 2,000 runs/month: **$0.40/run**. This metric should be tracked alongside `spacecom_simulation_cpu_seconds_total` (§27.1).

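The per-run figures follow from amortised worker cost over run volume; a hypothetical helper (the function name mirrors the metric name) using the Tier 2 and Tier 3 numbers above:

```python
def infrastructure_cost_per_mc_run(worker_cost_per_month: float,
                                   runs_per_month: int) -> float:
    """Amortised simulation-worker cost per MC run."""
    return worker_cost_per_month / runs_per_month

print(infrastructure_cost_per_mc_run(1120, 500))    # 2.24  (Tier 2)
print(infrastructure_cost_per_mc_run(800, 2000))    # 0.4   (Tier 3, amortised)
```
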
**Professional Services as a revenue line (F10 — §68):**

Professional Services (PS) revenue is a distinct revenue stream from recurring SaaS fees. For safety-critical aviation systems, PS typically represents **30–50% of first-year contract value** and includes:

| PS engagement type | Typical value | Description |
|-------------------|-------------|-------------|
| Implementation support | $15,000–40,000 | Deployment, configuration, integration with the ANSP SMS |
| Regulatory documentation | $10,000–25,000 | SpaceCom system description for ANSP regulatory submissions; assists with EASA/CASA/CAA shadow mode notifications |
| Training (initial) | $5,000–15,000 | On-site or remote training for duty controllers, analysts, and IT administrators |
| Safety Management System integration | $8,000–20,000 | Integrating SpaceCom alert triggers into the ANSP's existing SMS occurrence reporting workflow |
| Annual training refresh | $2,000–5,000/yr | Recurring annual training for new staff and procedure updates |

PS revenue is tracked in the `contracts.ps_value_cents` column (§68 F1). Include PS as a budget line in `docs/business/UNIT_ECONOMICS.md`:
- **Year 1 total contract value** = MRR × 12 + PS value
- PS is recognised as one-time revenue at delivery (milestone-based); SaaS fees are recognised monthly
- PS delivery requires dedicated engineering and commercial capacity — budget 1–2 days of senior engineer time per $5,000 of PS value

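Worked example of the Year 1 definition above, for a hypothetical Professional org at $4,500 MRR with $30,000 of PS work (illustrative figures within the stated ranges, not quoted pricing):

```python
def year1_total_contract_value(mrr: float, ps_value: float) -> float:
    """Year 1 TCV = 12 months of recurring SaaS fees plus one-time PS revenue."""
    return mrr * 12 + ps_value

print(year1_total_contract_value(4_500, 30_000))  # 84000
```
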
**Shadow trial MC quota (F8 — §68):** Free/shadow trial orgs are limited to 100 MC simulation runs per month (`organisations.monthly_mc_run_quota = 100`). Enforcement at `POST /api/v1/decay/predict`:
```python
if org.subscription_tier in ('shadow_trial',) and org.monthly_mc_run_quota > 0:
    runs_this_month = get_monthly_mc_run_count(org_id)
    if runs_this_month >= org.monthly_mc_run_quota:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "monthly_quota_exceeded",
                "quota": org.monthly_mc_run_quota,
                "used": runs_this_month,
                "resets_at": first_of_next_month().isoformat(),
                "upgrade_url": "/settings/billing"
            }
        )
```

Commercial controls must not interrupt active operations. If the organisation is in an active TIP / CRITICAL operational state, quota exhaustion is logged and surfaced to the commercial/admin dashboards, but enforcement is deferred until the event closes.

---

### 27.8 Redis Memory Budget

**Reference document:** `docs/infra/REDIS_SIZING.md` — sizing rationale and eviction policy decisions.

Redis serves several distinct purposes with different memory characteristics. Using a single Redis instance (with separate DB indexes for broker vs. cache) requires explicit memory budgeting:

| Purpose | DB index | Key pattern | Estimated peak memory | Eviction policy |
|---------|----------|------------|----------------------|----------------|
| Celery broker + result backend | DB 0 | `celery-task-meta-*`, `_kombu.*` | 500 MB (500 MC sub-tasks × ~1 MB results) | `noeviction` |
| celery-redbeat schedule | DB 1 | `redbeat:*` | < 1 MB | `noeviction` |
| WebSocket session tracking | DB 2 | `spacecom:ws:*`, `spacecom:active_tip:*` | < 10 MB | `noeviction` |
| Application cache (CZML, NOTAM) | DB 3 | `spacecom:cache:*` | 50–200 MB | `allkeys-lru` |
| Redis Pub/Sub fan-out (alerts) | — | `spacecom:alert:*` channels | Transient; ~1 KB/message | N/A (pub/sub, no persistence) |
| **Total budget** | | | **~700–750 MB peak** | |

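Summing the per-purpose peaks reproduces the budget line (taking the upper bound of each table row; pub/sub is transient and excluded):

```python
peak_mb = {
    "celery_broker_and_results": 500,   # 500 MC sub-tasks x ~1 MB
    "redbeat_schedule": 1,              # < 1 MB, rounded up
    "websocket_sessions": 10,           # < 10 MB upper bound
    "application_cache": 200,           # upper end of the 50-200 MB range
}
total_peak_mb = sum(peak_mb.values())
print(total_peak_mb)  # 711
```
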
**Sizing decision:** Use `cache.r6g.large` (8 GB RAM) with `maxmemory 2gb` — this provides 2.5× headroom above the peak estimate for burst conditions (multiple simultaneous MC runs × result backend). Set `maxmemory-policy noeviction` globally; Redis applies the eviction policy per instance, not per DB index, so DB 3 cannot run `allkeys-lru` independently on a shared instance. The application cache (DB 3) must therefore handle cache misses gracefully (it does — CZML regeneration on miss is defined in §6).

**Redis memory alert:** Add a Grafana alert: `redis_memory_used_bytes > 1.5GB` → WARNING; `> 1.8GB` → CRITICAL. At CRITICAL, check for result backend accumulation (expired Celery results not cleaned up) before scaling.

**Redis result cleanup:** Celery `result_expires` must be set to `3600` (1 hour). Verify in `backend/celeryconfig.py`:
```python
result_expires = 3600   # Clean up MC sub-task results after 1 hour
```

---

## 28. Human Factors Framework

SpaceCom is a safety-critical decision support system used by time-pressured operators in aviation operations rooms. Human factors are not a UX concern — they are a safety assurance concern. This section documents the HF design requirements, standards basis, and validation approach.

**Standards basis:** ICAO Doc 9683 (Human Factors in Air Traffic Management), FAA AC 25.1329 (Flight Guidance Systems — alert prioritisation philosophy), EUROCONTROL HRS-HSP-005, ISA-18.2 (alarm management, adapted for the ATC context), Endsley (1995) Situation Awareness model.

---

### 28.1 Situation Awareness Design Requirements

SpaceCom must support all three levels of Endsley's SA model for Persona A (ANSP duty manager):

| SA Level | Requirement | Implementation | Time target |
|----------|-------------|----------------|-------------|
| **Level 1 — Perception** | Correct hazard information visible at a glance | Globe with urgency symbols; active events panel; risk level badges | **≤ 5 seconds** from alert appearance — icon, colour, and position alone must convey object + risk level without reading text |
| **Level 2 — Comprehension** | Operator understands what the hazard means for their sector | Plain-language event cards; window range notation; FIR intersection list; data confidence indicators | **≤ 15 seconds** to identify the earliest FIR intersection window and whether it falls within the operator's sector |
| **Level 3 — Projection** | Operator can anticipate future state without simulation tools | Corridor Evolution widget (T+0/+2/+4h); Gantt timeline; space weather buffer callout | **≤ 30 seconds** to determine whether the corridor is expanding or contracting using the Corridor Evolution widget |

These time targets are **pass/fail criteria** for the Phase 2 ANSP usability test (§28.7).

**Globe visual information hierarchy (F7 — §60):** The globe displays objects, corridors, hazard zones, FIR boundaries, and ADS-B routes simultaneously. Under operational stress, operators must not be required to search for the critical element — it must be pre-attentively distinct. The following hierarchy is mandatory and enforced by the rendering layer:

| Priority | Element | Visual treatment | Pre-attentive channel |
|----------|---------|------------------|-----------------------|
| 1 — Immediate | Active CRITICAL object | Flashing red octagon (2 Hz, reduced-motion: static + thick border) + label always visible | Motion + colour + shape |
| 2 — Urgent | Active HIGH object | Amber triangle, label visible at zoom ≥ 4 | Colour + shape |
| 3 — Monitor | Active MEDIUM object | Yellow circle, label on hover | Colour + shape |
| 4 — Context | Re-entry corridors (p05–p95) | Semi-transparent red fill, no label until hover | Colour + opacity |
| 5 — Awareness | FIR boundary overlay | Thin white lines, low opacity (30%) | Position |
| 6 — Background | ADS-B routes | Thin grey lines, visible only at zoom ≥ 5 | Position |
| 7 — Ambient | All other tracked objects | Small white dots, no label until hover | Position |

Rule: no element at priority N may be more visually prominent than an element at priority N-1. The rendering layer enforces draw order and applies opacity/size reduction to lower-priority elements when a priority-1 element is present. This is a **non-negotiable safety requirement**: a CesiumJS performance optimisation that re-orders draw calls or flattens layers must not override this hierarchy. If an operator cannot reach SA Level 1 in ≤ 5 seconds on a CRITICAL alert, that is a design failure requiring a redesign cycle before shadow deployment. The numeric target matters: without it the usability test cannot produce a meaningful result.
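As a sketch of how the rendering layer could enforce this rule — the type names, the `enforceHierarchy` function, and the 0.5 demotion cap are illustrative assumptions, not part of the specification:

```typescript
// Illustrative sketch of the priority-hierarchy guard. Priority 1 = most critical.
type Priority = 1 | 2 | 3 | 4 | 5 | 6 | 7;

interface GlobeElement {
  id: string;
  priority: Priority;
  opacity: number; // 0..1 as requested by the owning layer
}

// Enforce draw order strictly by priority (critical drawn last, i.e. on top),
// and demote the visual weight of lower-priority elements whenever a
// priority-1 element is present.
export function enforceHierarchy(elements: GlobeElement[]): GlobeElement[] {
  const hasCritical = elements.some((e) => e.priority === 1);
  return [...elements]
    .sort((a, b) => b.priority - a.priority) // background first, priority 1 last
    .map((e) =>
      hasCritical && e.priority > 1
        ? { ...e, opacity: Math.min(e.opacity, 0.5) } // illustrative demotion cap
        : e,
    );
}
```

The point of centralising this in one function is that no individual layer (and no renderer optimisation) can bypass the ordering.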

Level 3 SA support is specifically identified as a gap in pure corridor-display systems and is addressed by the Corridor Evolution widget (§6.8).

---

### 28.2 Mode Error Prevention

Mode confusion is one of the most common causes of automation-related incidents in aviation. SpaceCom has three operational modes (LIVE / REPLAY / SIMULATION) that must be unambiguously distinct at all times.

**Mode error prevention mechanisms:**

1. Persistent mode indicator pill in top nav — never hidden, never small
2. Mode-switch dialogue with explicit current-mode, target-mode, and consequence statements (§6.3)
3. Future-preview temporal wash when the timeline scrubber is not at current time (§6.3)
4. Optional `disable_simulation_during_active_events` org setting to block simulation entry during live incidents (§6.3)
5. Audio alerts suppressed in SIMULATION and REPLAY modes
6. All simulation-generated records have `simulation_id IS NOT NULL` — they cannot appear in operational views

---

### 28.3 Alarm Management

Alarm management requirements follow the principle: every alarm should demand action, every required action should have an alarm, and no alarm should be generated that does not demand action.

**Alarm rationalisation:**

- CRITICAL: demands immediate action — full-screen banner + audio
- HIGH: demands timely action — persistent badge + acknowledgement required
- MEDIUM: informs — toast, auto-dismiss, logged
- LOW: awareness only — notification centre

**Alarm management philosophy and KPIs (F1 — §60):** SpaceCom adopts the EEMUA 191 / ISA-18.2 alarm management framework adapted for space/aviation operations. The following KPIs are measured quarterly by Persona D and included in the ESA compliance artefact package:

| EEMUA 191 KPI | Target | Definition |
|---------------|--------|-----------|
| Alarm rate (steady-state) | < 1 alarm per 10 minutes per operator | Alarms requiring attention across all levels; excludes LOW awareness-only |
| Nuisance alarm rate | < 1% of all alarms | Alarms acknowledged as `MONITORING` within 30s without any other action — indicates no actionable information |
| Stale alarms | 0 CRITICAL unacknowledged > 10 min | Unacknowledged CRITICAL alerts older than 10 minutes; triggers supervisor notification (F8) |
| Alarm flood threshold | < 10 CRITICAL alarms within 10 minutes | Beyond this rate, an alert storm meta-alert fires and the batch-flood suppression protocol activates |
| Chattering alarms | 0 | Any alarm that fires and clears more than 3 times in 30 minutes without operator action |

**Alarm quality requirements:**

- Nuisance alarm rate target: < 1 LOW alarm per 10 minutes per user in steady-state operations (logged and reviewed quarterly by Persona D)
- Alert deduplication: consecutive window-shrink events do not re-trigger CRITICAL if the threshold was not crossed
- 4-hour per-object CRITICAL rate limit prevents alarm flooding from a single event
- Alert storm meta-alert disambiguates between genuine multi-object events and system integrity issues (§6.6)

**Batch TIP flood handling (F2 — §60):** Space-Track releases TIP messages in batches — a single NOAA solar storm event can produce 50+ new TIP entries within a 10-minute window. Without mitigation, this generates 50 simultaneous CRITICAL alerts, constituting an alarm flood that exceeds EEMUA 191 KPIs and cognitively overwhelms the operator.

Protocol when ingest detects ≥ 5 new TIP messages within a 5-minute window:

1. **Batch gate activates:** Individual CRITICAL banners suppressed for objects 2–N of the batch. Object 1 (highest-priority by predicted Pc or earliest window) receives the standard CRITICAL banner.
2. **Batch summary alert fires:** A single HIGH-level "Batch TIP event: N objects with new TIP data" summary appears in the notification centre. The summary is actionable — it links to a pre-filtered catalog view showing all newly-TIP-flagged objects sorted by predicted re-entry window.
3. **Batch event logged:** A `batch_tip_event` record is created in `alert_events` with `trigger_type = 'BATCH_TIP'`, `affected_objects = [NORAD ID list]`, and `batch_size = N`. This is distinct from individual object alert records.
4. **Per-object alerts queue:** Individual CRITICAL alerts for objects 2–N are queued and delivered at a maximum rate of 1 per minute, only if the operator has not opened the batch summary view within 5 minutes of the batch gate activating. This prevents indefinite suppression while preventing flood.

The threshold (≥ 5 TIP in 5 minutes) and maximum queue delivery rate (1/min) are configurable per-org via org-admin settings, subject to minimum values (≥ 3 and ≤ 2/min respectively) to prevent safety-defeating misconfiguration.
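A minimal sketch of the batch-gate decision under the defaults above; the `TipMessage` shape, function name, and `priorityOf` callback are illustrative assumptions:

```typescript
// Illustrative sketch: which objects in a TIP batch have their individual
// CRITICAL banners suppressed. Thresholds mirror the defaults in the text.
interface TipMessage { noradId: number; receivedAt: number; } // epoch ms

const BATCH_WINDOW_MS = 5 * 60_000;
const BATCH_THRESHOLD = 5; // per-org configurable, minimum 3

// Returns NORAD IDs whose banners are suppressed (objects 2..N), or an empty
// list when the batch gate does not activate.
export function batchGateSuppressions(
  incoming: TipMessage[],
  now: number,
  priorityOf: (m: TipMessage) => number, // higher = more urgent (e.g. Pc, earliest window)
): number[] {
  const inWindow = incoming.filter((m) => now - m.receivedAt <= BATCH_WINDOW_MS);
  if (inWindow.length < BATCH_THRESHOLD) return []; // no gate: normal per-object alerts
  const ranked = [...inWindow].sort((a, b) => priorityOf(b) - priorityOf(a));
  return ranked.slice(1).map((m) => m.noradId); // object 1 keeps its full banner
}
```

The suppressed IDs would then feed the 1/min delivery queue described in step 4.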

**Audio alarm specification (F11 — §60):**

- Two-tone ascending chime: 261 Hz (C4) followed by 392 Hz (G4), each 250ms, 20ms fade-in/out (not siren — ops rooms have sirens from other systems already)
- Conforms to EUROCAE ED-26 / RTCA DO-256 advisory alert audio guidelines (advisory category — attention-getting without startle)
- Plays once on first presentation; **does not loop automatically**
- **Re-alert on missed acknowledgement:** If a CRITICAL alert remains unacknowledged for 3 minutes, the chime replays once. Replays at most once — the second chime is the final audio prompt. Further escalation is via supervisor notification (F8), not repeated audio (which would cause habituation)
- Stops on acknowledgement — not on banner dismiss; banner dismiss without acknowledgement is not permitted for CRITICAL severity
- Per-device volume control via OS; per-session software mute (persists for session only; resets on next login to prevent operators permanently muting safety alerts)
- Enabled by org-level "ops room mode" setting (default: off); must be explicitly enabled by org admin — not auto-enabled to prevent unexpected audio in environments where audio is not appropriate
- Volume floor in ops room mode: minimum 40% of device maximum; operators cannot mute below this floor when ops room mode is active (configurable per-org, minimum 30%)
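The chime timing and the replay rule can be expressed as pure values that a Web Audio implementation would consume; the `Tone` shape and function names below are illustrative assumptions, not specified API:

```typescript
// Illustrative sketch of the two-tone chime schedule and the single-replay rule.
interface Tone { freqHz: number; startMs: number; durationMs: number; fadeMs: number; }

export function criticalChimeSchedule(): Tone[] {
  const DURATION = 250; // per-tone duration from the spec
  const FADE = 20;      // fade-in/out from the spec
  return [
    { freqHz: 261, startMs: 0, durationMs: DURATION, fadeMs: FADE },        // C4
    { freqHz: 392, startMs: DURATION, durationMs: DURATION, fadeMs: FADE }, // G4
  ];
}

// Re-alert rule: replay once if still unacknowledged after 3 minutes; never more.
export function shouldReplayChime(alertAgeMs: number, replaysSoFar: number): boolean {
  return replaysSoFar < 1 && alertAgeMs >= 3 * 60_000;
}
```

Keeping the schedule pure makes the audio behaviour unit-testable independently of the audio backend.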

**Startle-response mitigation** — sudden full-screen CRITICAL banners cause ~5 seconds of degraded cognitive performance in research studies. The following rules prevent cold-start startle:

1. **Progressive escalation mandatory:** A CRITICAL alert may only be presented full-screen if the same object has already been in HIGH state for ≥ 1 minute during the current session. If the alert arrives cold (no prior HIGH state), the system must hold the alert in HIGH presentation for 30 seconds before upgrading to CRITICAL full-screen. Exception: `impact_time_minutes < 30` bypasses the 30s hold.
2. **Audio precedes visual by 500ms:** The two-tone chime fires 500ms before the full-screen banner renders. This primes the operator's attentional system and reduces the startle peak.
3. **Banner is overlay, not replacement:** The CRITICAL full-screen banner is a dimmed overlay (backdrop `rgba(0,0,0,0.72)`) rendered above the corridor map; the map, aircraft positions, and FIR boundaries remain visible beneath it. The banner must never replace the map render, as spatial context is required for the decision the operator is being asked to make.

**Cross-hat alert override matrix:** The Human Factors, Safety, and Regulatory hats jointly approve the following override rule set:

- `impact_time_minutes < 30` or equivalent imminent-impact state: bypass progressive delay; immediate full-screen CRITICAL permitted
- data-integrity compromise (`HMAC_INVALID`, corrupted prediction provenance, or equivalent): immediate full-screen CRITICAL permitted
- degraded-data or connectivity-only events without direct hazard change: progressive escalation remains mandatory
- all immediate-bypass cases require explicit rationale in the alert type definition and traceability into the safety case and hazard log
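The progressive-escalation hold and the override matrix above combine into a single presentation decision. A sketch, with illustrative type names and the thresholds taken from the rules above:

```typescript
// Illustrative sketch of the CRITICAL presentation decision.
interface EscalationInput {
  priorHighStateMs: number;          // time the object has spent in HIGH this session
  impactTimeMinutes: number;         // minutes to predicted impact
  dataIntegrityCompromised: boolean; // e.g. HMAC_INVALID, corrupted provenance
}

type Presentation =
  | { mode: 'CRITICAL_FULLSCREEN'; holdMs: 0 }
  | { mode: 'HIGH_HOLD'; holdMs: number }; // present as HIGH, upgrade after holdMs

export function presentCritical(input: EscalationInput): Presentation {
  // Override matrix: imminent impact or integrity compromise bypasses the hold.
  if (input.impactTimeMinutes < 30 || input.dataIntegrityCompromised) {
    return { mode: 'CRITICAL_FULLSCREEN', holdMs: 0 };
  }
  // Warm alert: >= 1 minute of prior HIGH state permits immediate full-screen.
  if (input.priorHighStateMs >= 60_000) {
    return { mode: 'CRITICAL_FULLSCREEN', holdMs: 0 };
  }
  // Cold alert: hold in HIGH presentation for 30 s before upgrading.
  return { mode: 'HIGH_HOLD', holdMs: 30_000 };
}
```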

**CRITICAL alert accessibility requirements (F2):** When the CRITICAL alert banner renders:

- `focus()` is called on the alert dialog element programmatically
- `role="alertdialog"` and `aria-modal="true"` on the banner container
- `aria-labelledby` points to the alert title; `aria-describedby` points to the conjunction summary text
- `aria-hidden="true"` set on the map container while the alertdialog is active; removed on dismiss
- `aria-live="assertive"` region announces alert title immediately on render (separate from the dialog, for screen readers that do not expose `alertdialog` role automatically)
- Visible text status indicator "⚠ Audio alert active" accompanies the audio tone for deaf or hard-of-hearing operators (audio-only notification is not sufficient as a sole channel)
- All alert action buttons reachable by `Tab` from within the dialog; `Escape` closes only if the alert has a non-CRITICAL severity; CRITICAL requires explicit category selection before dismiss

**Alarm rationalisation procedure** — alarm systems degrade over time through threshold drift and alert-to-alert desensitisation. The following procedure is mandatory:

- Persona D (Operations Analyst) reviews alert event logs quarterly
- Any alarm type that fired ≥ 5 times in a 90-day period and was acknowledged as `MONITORING` ≥ 90% of the time is a **nuisance alarm candidate** — threshold review required before next quarter
- Any alarm threshold change must be recorded in `alarm_threshold_audit` (object, old threshold, new threshold, reviewer, rationale, date); immutable append-only
- ANSP customers may request threshold adjustments for their own organisation via the org-admin settings; changes take effect after a mandatory 7-day confirmation period and are logged in `alarm_threshold_audit`
- Alert categories that have never triggered a `NOTAM_ISSUED` or `ESCALATING` acknowledgement in 12 months are escalated to Persona D for review of whether the alert should be demoted one severity level
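The nuisance-alarm-candidate rule can be sketched as a pure function over acknowledged alert events; the record shape is an illustrative assumption:

```typescript
// Illustrative sketch of the quarterly nuisance-candidate rule:
// >= 5 firings in 90 days AND >= 90% acknowledged as MONITORING.
interface AlertEvent { alarmType: string; actionTaken: string; } // one acknowledged firing

export function nuisanceCandidates(events90d: AlertEvent[]): string[] {
  const byType = new Map<string, { total: number; monitoring: number }>();
  for (const e of events90d) {
    const s = byType.get(e.alarmType) ?? { total: 0, monitoring: 0 };
    s.total += 1;
    if (e.actionTaken === 'MONITORING') s.monitoring += 1;
    byType.set(e.alarmType, s);
  }
  const out: string[] = [];
  for (const [type, s] of byType) {
    if (s.total >= 5 && s.monitoring / s.total >= 0.9) out.push(type);
  }
  return out;
}
```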

**Habituation countermeasures** — repeated identical stimuli produce reduced response (habituation). The following design rules counteract alarm habituation:

- CRITICAL audio uses two alternating tones (261 Hz and 392 Hz, ~0.25s each); the alternation pattern is varied pseudo-randomly within the specification range so the exact sound is never identical across sessions
- CRITICAL banner background colour cycles through two dark-amber shades (`#7B4000` / `#6B3400`) at 1 Hz — subtle variation without strobing, enough to maintain arousal without inducing distraction
- Per-object CRITICAL rate limit (4-hour window) prevents habituation to a single persistent event
- `alert_events` habituation report: any operator who has acknowledged ≥ 20 alerts of the same type in a 30-day window without a single `ESCALATING` or `NOTAM_ISSUED` response is flagged for supervisor review — this indicates potential habituation or threshold misconfiguration

**Reduced-motion support (F10):** WCAG 2.3.3 (Animation from Interactions — Level AAA) and WCAG 2.3.1 (Three Flashes or Below Threshold — Level A) apply. The 1 Hz CRITICAL banner colour cycle and any animated corridor rendering must respect the OS-level `prefers-reduced-motion: reduce` media query:

```css
/* Default: animated */
.critical-banner { animation: amber-cycle 1s step-end infinite; }

/* Reduced motion: static high-contrast state */
@media (prefers-reduced-motion: reduce) {
  .critical-banner {
    animation: none;
    background-color: #7B4000;
    border: 4px solid #FFD580; /* thick static border as redundant indicator */
  }
}
```

**Fatigue and cognitive load monitoring (F8 — §60):** Operators on long shifts exhibit reduced alertness. The following server-side rules trigger supervisor notifications without requiring operator interaction:

| Condition | Trigger | Supervisor notification |
|-----------|---------|------------------------|
| Unacknowledged CRITICAL alert | > 10 minutes without acknowledgement | Push + email to org supervisor role: "CRITICAL alert unacknowledged for 10 minutes — [object, time]" |
| Stale HIGH alert | > 30 minutes without acknowledgement | Push to org supervisor: "HIGH alert unacknowledged for 30 minutes" |
| Long session without interaction | Logged-in operator: no UI interaction for 45 min during active event | Push to operator + supervisor: "Possible inactivity during active event — please verify" |
| Shift duration exceeded | Session age > `org.shift_duration_hours` (default 8h) | Non-blocking reminder to operator: "Your shift duration setting is 8 hours — consider handover" |

Supervisor notifications are sent to users with `org_admin` or `supervisor` role. If no supervisor role is configured for the org, the notification escalates to SpaceCom internal ops via the existing PagerDuty route with `severity: warning`. All supervisor notifications are logged to `security_logs` with `event_type = SUPERVISOR_NOTIFICATION`.

For CesiumJS corridor animations: check `window.matchMedia('(prefers-reduced-motion: reduce)').matches` on mount; if true, disable trajectory particle animation (Mode C) and set corridor opacity to a static value instead of pulsing. The preference is re-checked on change via `addEventListener('change', ...)` without requiring a page reload.

---

### 28.4 Probabilistic Communication to Non-Specialist Operators

Re-entry timing predictions are inherently probabilistic. Aviation operations personnel (Persona A/C) are trained in operational procedures, not orbital mechanics. The following design rules ensure probabilistic information is communicated without creating false precision or misinterpretation:

1. **No `±` notation for Persona A/C** — use explicit window ranges (`08h–20h from now`) with a "most likely" label; all absolute times rendered as `HH:MMZ` (e.g., `14:00Z`) or `DD MMM YYYY HH:MMZ` (e.g., `22 MAR 2026 14:00Z`) per ICAO Doc 8400 UTC-suffix convention; the `Z` suffix is not a tooltip — it is always rendered inline
2. **Space weather impact as operational buffer, not percentage** — `Add ≥2h beyond 95th percentile`, not `+18% wider uncertainty`
3. **Mode C particles require a mandatory first-use overlay** explaining that particles are not equiprobable; weighted opacity down-weights outliers (§6.4)
4. **"What does this mean?" expandable panel** on Event Detail for Persona C (incident commanders) explaining the window in operational terms
5. **Data confidence badges** contextualise all physical property estimates — `unknown` source triggers a warning callout above the prediction panel
6. **Tail risk annotation (F10):** The p5–p95 window is the primary display, but a 10% probability of re-entry outside that range is operationally significant. Below the primary window, display: *"Extreme case (2% probability outside this range: 1% in each tail): `p01_reentry_time`Z – `p99_reentry_time`Z"* — labelled clearly as a tail annotation, not the primary window. This annotation is shown only when `p99_reentry_time - p01_reentry_time > 1.5 × (p95_reentry_time - p05_reentry_time)` (i.e., the tails are materially wider than the primary window). Also included as a footnote in NOTAM drafts when this condition is met.
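The display condition for the tail annotation reduces to a one-line predicate; a sketch assuming percentile times are held as epoch milliseconds (field names mirror the prediction columns referenced above):

```typescript
// Illustrative sketch of the tail-annotation display rule.
interface WindowPercentiles {
  p01: number; p05: number; p95: number; p99: number; // epoch ms
}

// Show the annotation only when the p01–p99 span is materially wider than the
// primary p05–p95 window (factor 1.5, from rule 6 above).
export function showTailAnnotation(w: WindowPercentiles): boolean {
  return (w.p99 - w.p01) > 1.5 * (w.p95 - w.p05);
}
```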

---

### 28.5 Error Recovery and Irreversible Actions

| Action | Recovery mechanism |
|--------|--------------------|
| Analyst runs prediction with wrong parameters | `superseded_by` FK on `reentry_predictions` — marks old run as superseded; UI shows warning banner; original record preserved |
| Controller accidentally acknowledges CRITICAL alert | Two-step confirmation; structured category selection (see below) + optional free text; append-only audit log preserves full record |
| Analyst shares link to superseded prediction | `⚠ Superseded — see [newer run]` banner appears on the superseded prediction page for any viewer |
| Operator enters SIMULATION during live incident | `disable_simulation_during_active_events` org setting blocks mode switch while unacknowledged CRITICAL/HIGH alerts exist |

**Structured acknowledgement categories** — replaces 10-character text minimum. Research consistently shows forced-text minimums under time pressure produce reflexive compliance (`1234567890`, `aaaaaaaaaa`) rather than genuine engagement, creating audit noise rather than evidence:

```typescript
export const ACKNOWLEDGEMENT_CATEGORIES = [
  { value: 'NOTAM_ISSUED', label: 'NOTAM issued or requested' },
  { value: 'COORDINATING', label: 'Coordinating with adjacent FIR' },
  { value: 'MONITORING', label: 'Monitoring — no action required yet' },
  { value: 'ESCALATING', label: 'Escalating to incident command' },
  { value: 'OUTSIDE_MY_SECTOR', label: 'Outside my sector — passing to responsible unit' },
  { value: 'OTHER', label: 'Other (free text required below)' },
] as const;

// Category selection is mandatory. Free text is optional except when value = 'OTHER'.
// alert_events.action_taken stores the category code; action_notes stores optional text.
```

**Acknowledgement form accessibility requirements (F3):**

- Each category option rendered as `<input type="radio">` with an explicit `<label for="...">` — no ARIA substitutes where native HTML suffices
- The radio group wrapped in `<fieldset>` with `<legend>Select acknowledgement category</legend>`
- The keyboard shortcut `Alt+A` documented via `aria-keyshortcuts="Alt+A"` on the alert panel trigger element
- A visible keyboard shortcut legend displayed within the acknowledgement dialog: "Keyboard: Alt+A to focus · Tab to change category · Enter to submit"
- Free-text field (`OTHER`) labelled `<label for="action_notes">Describe action taken (required)</label>`; `aria-required="true"` when OTHER is selected
- On submit, a screen-reader-visible confirmation announced via `aria-live="polite"`: "Acknowledgement recorded: [category label]"

**Keyboard-completable acknowledgement flow** — CRITICAL acknowledgement must be completable in ≤ 3 keyboard interactions from any application state (operators frequently work with one hand on radio PTT):

```
Alt+A → focus most-recent active CRITICAL alert in alert panel
Enter → open acknowledgement dialogue (category pre-selected: MONITORING)
Enter → submit (Tab to change category; free-text field skipped unless OTHER selected)
```

This keyboard path must be documented in the operator quick-reference card and tested in the Phase 2 usability study against the ≤ 3 interaction target.

---

### 28.5a Shift Handover

Shift handover is a high-risk transition point: situational awareness held by one operator must be reliably transferred to a second operator under time pressure. Information loss at handover has contributed to past aviation safety events. SpaceCom must not become an additional source of that loss.

**Handover screen** (Persona A/C): Dedicated `/handover` view within Secondary Display Mode (§6.20). Accessible from main nav; also triggered automatically when an operator session exceeds `org.shift_duration_hours` (configurable; default: 8h).

The handover screen shows:

1. All active CRITICAL and HIGH alerts with current status and acknowledgement history
2. Any unresolved multi-ANSP coordination threads (§6.9)
3. Recent window-change events (last 2h) in reverse chronological order
4. Free-text handover notes field (plain text, ≤ 2,000 characters)
5. "Accept handover" button — records handover event with both operator IDs and timestamp

**Handover record schema:**

```sql
CREATE TABLE shift_handovers (
  id                 UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id             UUID NOT NULL REFERENCES organisations(id),
  outgoing_user      UUID NOT NULL REFERENCES users(id),
  incoming_user      UUID NOT NULL REFERENCES users(id),
  handed_over_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  notes              TEXT,                        -- operator free text, ≤ 2000 chars
  active_alerts      JSONB NOT NULL DEFAULT '[]', -- snapshot of alert IDs + status at handover
  open_coord_threads JSONB NOT NULL DEFAULT '[]'  -- snapshot of open coordination thread IDs
);

CREATE INDEX ON shift_handovers (org_id, handed_over_at DESC);
```

**Handover integrity rules:**

- `incoming_user` must be a different `users.id` from `outgoing_user`
- `active_alerts` and `open_coord_threads` are system-populated snapshots — the outgoing operator cannot edit them; only `notes` is free-form
- Handover record is immutable after creation; retained for 7 years (aviation safety audit basis)
- If a CRITICAL alert fires within 5 minutes of a handover record being created, the alert email/push notification includes a "⚠ Alert during handover window" flag so the incoming operator and their supervisor are aware

**Structured SA transfer prompts (F4 — §60):** The handover notes field (free text) is insufficient for reliable SA transfer under time pressure. The handover screen must also include a structured prompt section that the outgoing operator completes — mapping to Endsley's three SA levels:

| SA Level | Structured prompt | Type |
|----------|------------------|------|
| Level 1 — Perception | "Active objects of concern right now:" | Multi-select from current TIP-flagged objects |
| Level 2 — Comprehension | "My assessment of the most critical object:" | Dropdown: `Within sector / Adjacent sector / Low confidence / Not a concern yet` + optional text |
| Level 3 — Projection | "Expected development in next 2 hours:" | Dropdown: `Window narrowing / Window stable / Window widening / Awaiting new prediction` + optional text |
| Decision context | "Actions I have taken or initiated:" | Multi-select from `ACKNOWLEDGEMENT_CATEGORIES` + free text |
| Handover flags | "Incoming operator should know:" | Checkboxes: `Space weather active`, `Pending coordination thread`, `Degraded data`, `Unusual pattern` |

The structured prompts are optional (the outgoing operator cannot be forced to complete them under time pressure) but their completion status is recorded. If the outgoing operator submits handover without completing any structured prompts, a non-blocking warning appears: *"Structured SA transfer not completed — incoming operator will rely on notes only."* Completion rate is reported quarterly as a human factors KPI.

**Session timeout accessibility (F8):** WCAG 2.2.1 (Timing Adjustable — Level A) requires users be warned before session expiry and given the opportunity to extend. For operators completing a handover (which may take longer for users with cognitive or motor impairments):

- At T−2 minutes before session expiry: an `aria-live="polite"` announcement fires and a non-modal warning dialog appears: "Your session will expire in 2 minutes. [Extend session] [Save and log out]"
- If the `/handover` view is active when the warning fires, the session is **automatically extended by 30 minutes** without user interaction (silently); the warning dialog is suppressed; the extension is logged in `security_logs` with `event_type = SESSION_AUTO_EXTENDED_HANDOVER`
- The silent auto-extension only applies once per session to prevent indefinite extension; after the 30-minute extension the standard warning dialog fires normally
- Session extension endpoint: `POST /api/v1/auth/extend-session` — returns a new expiry timestamp; requires valid current session cookie
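A sketch of the expiry-warning decision these rules imply; the state shape and names are illustrative assumptions:

```typescript
// Illustrative sketch of the session-expiry decision at each timer tick.
interface SessionState {
  msToExpiry: number;
  handoverViewActive: boolean;
  autoExtendUsed: boolean; // the silent extension applies once per session
}

type ExpiryAction =
  | { kind: 'NONE' }
  | { kind: 'WARN' }                           // aria-live announcement + non-modal dialog
  | { kind: 'AUTO_EXTEND'; extendMs: number }; // silent, logged as SESSION_AUTO_EXTENDED_HANDOVER

export function onExpiryTick(s: SessionState): ExpiryAction {
  if (s.msToExpiry > 2 * 60_000) return { kind: 'NONE' }; // warning window not yet reached
  if (s.handoverViewActive && !s.autoExtendUsed) {
    return { kind: 'AUTO_EXTEND', extendMs: 30 * 60_000 }; // suppress dialog mid-handover
  }
  return { kind: 'WARN' };
}
```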

---

### 28.6 Cognitive Load Reduction

**Event Detail Duty Manager View:** Decluttered large-text view for Persona A showing only window, FIRs, risk level, and three action buttons. Collapses all technical detail. Designed for ops room use at a secondary glance distance. (§6.8)

**Decision Prompts accordion** (formerly "Response Options"): Contextualised checklist of possible ANSP actions. Not automated — for consideration only. Checkbox states create a lightweight action record without requiring Persona A to open a separate logging system. (§6.8)

The feature is renamed from "Response Options" to "Decision Prompts" throughout UI text, documentation, and API field names. "Options" implies equivalence; "Prompts" correctly signals that the list is an aide-mémoire, not a prescribed workflow.

**Legal treatment of Decision Prompts:** Every Decision Prompts accordion must display the following non-waivable disclaimer in 11px grey text immediately below the accordion header:

> *"Decision Prompts are non-prescriptive aide-mémoire items generated from common ANSP practice. They do not constitute operational procedures. All decisions remain with the duty controller in accordance with applicable air traffic regulations and your organisation's established procedures."*

This disclaimer is: (a) hard-coded, not configurable; (b) included in the printed/exported Event Detail report; (c) present in the API response for Decision Prompts payloads (`"legal_notice"` field). Rationale: SpaceCom is decision support, not decision authority. Without an explicit disclaimer, a regulator or court could interpret a checked Decision Prompt item as evidence of a prescribed procedure not followed.

**Decision prompt content template (F6 — §60):** Each Decision Prompt entry must provide four fields to be actionable under operational stress:

```typescript
interface DecisionPrompt {
  id: string;
  risk_summary: string;      // Plain-language risk in ≤ 20 words. No jargon. No Pc values.
  action_options: string[];  // Specific named actions available to this operator role
  time_available: string;    // "Decision window: X hours before earliest FIR intersection"
  consequence_note?: string; // Optional: consequence of inaction (shown only if significant)
}

// Example for a re-entry/FIR intersection:
const examplePrompt: DecisionPrompt = {
  id: 'reentry_fir_intersection',
  risk_summary: 'Object expected to re-enter atmosphere over London FIR within 8–14 hours.',
  action_options: [
    'Issue precautionary NOTAM for affected flight levels',
    'Coordinate with adjacent FIR controllers (Paris, Amsterdam)',
    'Notify airline operations centres in affected region',
    'Continue monitoring — no action required yet',
  ],
  time_available: 'Decision window: ~6 hours before earliest FIR intersection (08:00Z)',
  consequence_note: 'If window narrows below 4 hours without NOTAM, affected departures may require last-minute rerouting.',
};
```

Decision Prompts are pre-authored for each alert scenario type in `docs/decision-prompts/` and reviewed annually by a subject-matter expert from an ANSP partner. They are not auto-generated by the system. New prompt types require approval from both the SpaceCom safety case owner and at least one ANSP reviewer.

**Legal sufficiency note (F5):** The in-UI disclaimer is a reinforcing reminder only. Under UCTA 1977 and the EU Unfair Contract Terms Directive, liability limitation requires that the customer was given a reasonable opportunity to discover and understand the term at contract formation. The substantive liability limitation clause (consequential loss excluded; aggregate cap = 12 months fees paid) must appear in the executed **Master Services Agreement** (§24.2). The UI disclaimer does not substitute for executed contractual terms.

**Decision Prompts accessibility (F9):** The accordion must implement the WAI-ARIA Accordion design pattern:

- Accordion header: `<button aria-expanded="true|false" aria-controls="panel-{id}">` (a native `<button>` needs no explicit `role`) — `Enter` and `Space` toggle open/close
- Panel: `<div id="panel-{id}" role="region" aria-labelledby="header-{id}">`
- Arrow keys navigate between accordion items when focus is on a header button
- Each prompt item: `<input type="checkbox" id="prompt-{n}">` with `<label for="prompt-{n}">` — native checkbox, not an ARIA role substitute; the native element exposes its checked state, so `aria-checked` is unnecessary
- On checkbox state change: `aria-live="polite"` region announces "Action recorded: [prompt text]"
- `aria-keyshortcuts` on the accordion container documents any applicable shortcuts

**Attention management** — operational environments have high ambient interruption rates. SpaceCom must not become an additional source of cognitive fragmentation:

| State | Interaction rate limit | Rationale |
|-------|----------------------|-----------|
| Steady-state (no active CRITICAL/HIGH) | ≤ 1 unsolicited notification per 10 minutes per user | Preserve peripheral attentional channel for ATC primary tasks |
| Active event (≥ 1 unacknowledged CRITICAL) | ≤ 1 update notification per 60 seconds for the same event | Prevent update flooding during the critical decision window |
| Critical flow (user actively in acknowledgement or handover screen) | Zero unsolicited notifications | Do not interrupt the operator while they are completing a safety-critical task |

Critical flow state is entered when: acknowledgement dialog is open, or `/handover` view is active. It is exited on dialog close or handover acceptance. During critical flow, all queued notifications are held and delivered as a batch summary immediately on exit.
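A sketch of that hold-and-batch behaviour; the notification shape and the `NotificationGate` class name are illustrative assumptions:

```typescript
// Illustrative sketch of notification gating during critical flow.
interface AppNotification { id: string; text: string; }

export class NotificationGate {
  private inCriticalFlow = false;
  private held: AppNotification[] = [];

  constructor(private deliver: (batch: AppNotification[]) => void) {}

  enterCriticalFlow(): void { // ack dialog opened, or /handover view active
    this.inCriticalFlow = true;
  }

  exitCriticalFlow(): void {  // dialog closed, or handover accepted
    this.inCriticalFlow = false;
    if (this.held.length > 0) {
      this.deliver(this.held); // batch summary immediately on exit
      this.held = [];
    }
  }

  notify(n: AppNotification): void {
    if (this.inCriticalFlow) this.held.push(n); // zero unsolicited notifications
    else this.deliver([n]);
  }
}
```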

**Secondary Display Mode:** Chrome-free full-screen operational view optimised for secondary monitor in an ops room alongside existing ATC displays. (§6.20)

**First-time user onboarding:** New organisations with no configured FIRs see a three-card guided setup rather than an empty globe. (§6.18)

---

### 28.7 HF Validation Approach

HF design cannot be fully validated by automated tests alone. The following validation activities are planned:

| Activity | Phase | Method |
|----------|-------|--------|
| Cognitive walkthrough of CRITICAL alert handling | Phase 1 | Developer walk-through against §28.3 alarm management requirements |
| ANSP user testing — Persona A operational scenario | Phase 2 | Structured usability test: duty manager handles a simulated TIP event; time-to-decision and error rate measured |
| Multi-ANSP coordination scenario | Phase 2 | Two-ANSP test with shared event; assess whether coordination panel reduces perceived workload vs. out-of-band comms only |
| Mode confusion scenario | Phase 2 | Participants switch between LIVE and SIMULATION; measure rate of mode errors without and with the temporal wash |
| Alarm fatigue assessment | Phase 3 | Review of LOW alarm rate over a 30-day shadow deployment; adjust thresholds if nuisance rate > 1/10 min/user |
| Final HF review by qualified human factors specialist | Phase 3 | Required for TRL 6 demonstration and ECSS-E-ST-10-12C compliance evidence |

**Probabilistic comprehension test items** — the Phase 2 usability study must include the following scripted comprehension items, delivered verbally to participants after they view a TIP event detail screen. Items are designed to distinguish genuine probabilistic comprehension from confidence masking:

| Item | Correct answer | Common wrong answer (detects) |
|------|----------------|-------------------------------|
| "What does the re-entry window of 08h–20h from now mean — does it mean the object will come down in the middle of that period?" | No — the most likely landing is at the modal estimate shown, but the object could land anywhere in the window | "Yes, probably in the middle" — detects false precision from window endpoints |
| "If SpaceCom shows Impact Probability 0.03, should you start evacuating the FIR corridor?" | Not automatically — impact probability is one input; the operational decision depends on assets at risk, corridor extent, and existing procedures | "Yes, 0.03 is high for space" — detects calibration gap between space and aviation risk thresholds |
| "The window has just widened by 4 hours. Does that mean SpaceCom detected new debris or a new threat?" | No — window widening usually means updated atmospheric data or a revised mass/BC estimate increased uncertainty | "Yes, something new happened" — detects misattribution of an uncertainty update to a new threat |
| "SpaceCom shows 'Data confidence: TLE age 4 days'. Does that mean the prediction is wrong?" | No — it means the prediction has higher positional uncertainty; the window should be treated as wider in practice | "Yes, ignore it" — detects over-application of data quality warning |

Participants who answer ≥ 2 items incorrectly indicate a comprehension design failure requiring UI revision before shadow deployment. Target: ≥ 80% correct on each item across the test cohort.
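
The two pass/fail criteria above (any participant ≥ 2 wrong; every item ≥ 80% correct) can be expressed as a small scoring helper. A sketch only — the function name and result shape are illustrative:

```python
def assess_comprehension(results: list, n_items: int = 4) -> dict:
    """Score a usability cohort against the two comprehension criteria.

    `results` holds one list of per-item booleans (True = correct) per
    participant. Hypothetical helper, not part of the SpaceCom codebase.
    """
    # Criterion 1: any participant with >= 2 wrong answers flags a design failure
    failing = [r for r in results if r.count(False) >= 2]
    # Criterion 2: each item must reach >= 80% correct across the cohort
    per_item = [
        sum(r[i] for r in results) / len(results) for i in range(n_items)
    ]
    return {
        "participants_failing": len(failing),
        "per_item_correct": per_item,
        "passed": bool(not failing and all(p >= 0.8 for p in per_item)),
    }
```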

---

### 28.8 Degraded-Data Human Factors

Operators must be able to distinguish "SpaceCom is working normally" from "SpaceCom is working but with reduced fidelity" from "SpaceCom is in a failure state" — three states that require fundamentally different responses. Undifferentiated degradation presentation causes two failure modes: operators continuing to act on stale data as if it were fresh (over-trust), or operators abandoning the system entirely during a tolerable degradation (under-trust).

**Visual degradation language:**

| State | Indicator | Operator action required |
|-------|-----------|--------------------------|
| All data fresh | Green status pill in system tray (§6.6) | None |
| TLE age ≥ 48h for any active CRITICAL/HIGH object | Amber "⚠ TLE stale" badge on affected event card | Widen mental model of corridor uncertainty; consult space domain Persona B/D |
| EOP data stale (>7 days) | Amber system badge + `eop_stale` exposed in `GET /readyz` | Frame transform accuracy reduced; no action required unless close-approach timing is critical |
| Space weather stale (>2h for active event) | Amber badge on Kp readout in Event Detail | Kp-dependent atmospheric drag estimates are less reliable; apply additional margin |
| AIRAC data >35 days old | Red "⚠ AIRAC expired" badge on any FIR overlay | FIR boundaries may have changed; do not issue NOTAM text based on SpaceCom FIR names without manual verification |
| Backend unreachable | Full-screen "SpaceCom Offline" modal | No predictions available; fall back to organisational offline procedures |

**Graded response rules:**

1. A single stale data source **never** suppresses the main operational view. Operators must be able to see the event and make decisions; stale data badges are contextual, not blocking.
2. Multiple simultaneous amber badges (≥ 3) trigger a consolidated "Multiple data sources degraded" yellow banner at the top of the screen — prevents badge blindness when individual badges are numerous.
3. The `GET /readyz` endpoint (§26.5) exposes all staleness states as machine-readable flags. ANSPs may configure their own monitoring to receive `readyz` alerts via webhook.
4. Degraded-data states are recorded in the `system_health_events` table and included in the quarterly operational report to Persona D.
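
The badge consolidation in rule 2 can be sketched against machine-readable staleness flags of the kind `/readyz` exposes. Illustrative only — the flag names and return shape are assumptions, not the SpaceCom schema:

```python
def consolidate_badges(staleness: dict) -> dict:
    """Apply graded response rule 2: collapse >= 3 amber badges into one banner.

    `staleness` maps amber badge names (e.g. readiness flags such as
    `eop_stale`) to booleans. Hypothetical helper.
    """
    amber = sorted(name for name, stale in staleness.items() if stale)
    return {
        "badges": amber,
        # Individual badges remain visible; the banner is additive, per rule 1
        "banner": "Multiple data sources degraded" if len(amber) >= 3 else None,
    }
```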

**Operator quick-reference language for degraded states** — the operator quick-reference card must include a "SpaceCom status indicators" section using the exact badge text from the UI (copy-match required). Operators must not need to translate between UI text and documentation text.

---

### 28.9 Operator Training and Competency Specification (F10 — §60)

SpaceCom is a safety-critical decision support system. ANSP customers deploying it in operational environments will be asked by their safety regulators what training operators received. This section defines the minimum training specification. Individual ANSPs may add requirements; they may not remove them.

**Minimum initial training programme:**

| Module | Delivery | Duration | Completion criteria |
|--------|----------|----------|---------------------|
| M1 — System overview and safety philosophy | Instructor-led or self-paced e-learning | 2 hours | Quiz score ≥ 80% |
| M2 — Operational interface walkthrough | Instructor-led hands-on with staging environment | 3 hours | Complete reference scenario (see below) |
| M3 — Alert acknowledgement workflow | Scenario-based with role-play | 1 hour | Keyboard-completable ack in ≤ 3 interactions |
| M4 — NOTAM drafting and disclaimer | Instructor-led with sample NOTAMs | 1 hour | Produce a compliant NOTAM draft from a scenario |
| M5 — Degraded mode response | Scenario-based | 30 min | Correctly identify each degraded state + action |
| M6 — Shift handover procedure | Pair exercise | 30 min | Complete a structured handover with SA prompts |

Total minimum initial training: **8 hours**. Training is completed before any operational use. Simulator/staging environment only — no training on production data.

**Reference scenario (M2):** A CRITICAL re-entry alert fires for an object with a 6–14 hour window intersecting two FIRs. The trainee must: acknowledge the alert, identify the FIR intersection, assess the corridor evolution, draft a NOTAM, and complete a handover to a colleague — all within 20 minutes. This scenario is standardised in `docs/training/reference-scenario-01.md`.

**Recurrency requirements:**

- Annual refresher: 2 hours, covering any UI changes in the preceding 12 months + repeat of the M3 scenario
- After any incident where SpaceCom was a contributing factor: mandatory debrief + targeted re-training before return to operational use
- After a major version upgrade (breaking UI changes): M2 + affected modules before using the upgraded system operationally

**Competency record model:**

```sql
CREATE TABLE operator_training_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id INTEGER NOT NULL REFERENCES users(id),
    module_id TEXT NOT NULL,                       -- 'M1'..'M6' or custom ANSP module codes
    completed_at TIMESTAMPTZ NOT NULL,
    score INTEGER,                                 -- quiz score where applicable; NULL for practical
    instructor_id INTEGER REFERENCES users(id),
    training_env TEXT NOT NULL DEFAULT 'staging',  -- 'staging' | 'simulator'
    notes TEXT,
    UNIQUE (user_id, module_id, completed_at)
);
```

`GET /api/v1/admin/training-status` (org_admin only) returns completion status for all users in the organisation. Users without all required modules completed are flagged; their access is not automatically blocked (the ANSP retains operational responsibility), but the flag is visible to org_admin and included in the quarterly compliance report.
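
The completion check behind that endpoint can be sketched as follows, assuming M1–M6 are the required set. The function name and result shape are illustrative, not the real endpoint contract:

```python
REQUIRED_MODULES = {"M1", "M2", "M3", "M4", "M5", "M6"}


def training_status(completed_modules: dict) -> dict:
    """Flag users missing required training modules.

    `completed_modules` maps user_id -> set of completed module IDs, as
    would be derived from `operator_training_records`. Sketch only.
    """
    return {
        uid: {
            # Flagged-but-not-blocked: callers surface `missing` to org_admin
            "complete": REQUIRED_MODULES <= done,
            "missing": sorted(REQUIRED_MODULES - done),
        }
        for uid, done in completed_modules.items()
    }
```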

**Training material ownership:** `docs/training/` directory maintained by SpaceCom. ANSP-specific scenario variants stored in `docs/training/ansp-variants/`. Annual review cycle tied to the CHANGELOG review process.

**Training records data retention and pseudonymisation (F10 — §64):** `operator_training_records` is personal data — it records when a named individual completed specific training activities. For former employees whose accounts are deleted, these records must not be retained indefinitely as identified personal data.

Retention policy:

- Active users: retain for the duration of active employment (account `status = 'active'`) plus **2 years** after account deletion (for certification audit purposes — an ANSP may need to verify training history after an operator leaves)
- After 2 years post-deletion: pseudonymise `user_id` → tombstone token; retain completion dates and module IDs for aggregate training statistics

```sql
-- Add to operator_training_records
ALTER TABLE operator_training_records
    ADD COLUMN pseudonymised_at TIMESTAMPTZ,
    ADD COLUMN user_tombstone TEXT;  -- SHA-256 prefix of deleted user_id; replaces the user_id link
```

The monthly `pseudonymise_old_freetext` Celery task (§29.3) is extended to also pseudonymise training records where the linked `users` row has been deleted for more than 2 years:

```python
db.execute(text("""
    UPDATE operator_training_records otr
    SET user_tombstone = CONCAT('tombstone:', LEFT(ENCODE(DIGEST(otr.user_id::text, 'sha256'), 'hex'), 16)),
        pseudonymised_at = NOW()
    WHERE otr.pseudonymised_at IS NULL
      AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = otr.user_id)
      AND otr.completed_at < NOW() - INTERVAL '2 years'
"""))
```
---

## 29. Data Protection Framework

SpaceCom processes personal data in the course of providing its services. For EU and UK deployments (ESA bid context), GDPR / UK GDPR compliance is mandatory. For Australian ANSP customers, the Privacy Act 1988 (Cth) applies. This section documents the data protection design requirements.

**Standards basis:** GDPR (EU) 2016/679, UK GDPR, Privacy Act 1988 (Cth), EDPB Guidelines on data breach notification, ICO guidance on legitimate interests, CNIL recommendations on consent records.

---

### 29.1 Data Inventory

**Record of Processing Activities (RoPA) — GDPR Art. 30:** This table constitutes the RoPA. It is maintained in `legal/ROPA.md` (authoritative version) and mirrored here. Organisations with ≥250 employees or processing high-risk data must maintain a written RoPA; space traffic management constitutes high-risk processing (Art. 35 DPIA trigger — see below). The DPO must review and sign off the RoPA annually.

| Data type | Personal? | Lawful basis (GDPR Art. 6) | Retention | Table / Location |
|-----------|-----------|----------------------------|-----------|------------------|
| User email, name, organisation | Yes | Contract performance (Art. 6(1)(b)) | Account lifetime + 1 year after deletion | `users` |
| IP address in security logs | Yes (pseudonymous) | Legitimate interests — security (Art. 6(1)(f)) | **90 days full; hash retained for 7 years** | `security_logs` |
| IP address at ToS acceptance | Yes | Legitimate interests — consent evidence (Art. 6(1)(f)) | **90 days full; hash retained for account lifetime + 1 year** | `users.tos_accepted_ip` |
| Alert acknowledgement text | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Multi-ANSP coordination notes | Yes (contains user name) | Legitimate interests — aviation safety (Art. 6(1)(f)) | 7 years | `alert_events` |
| Shift handover records | Yes (outgoing/incoming user IDs) | Legitimate interests — aviation safety / operational continuity (Art. 6(1)(f)) | 7 years | `shift_handovers` |
| Alarm threshold audit records | Yes (reviewer ID) | Legitimate interests — safety governance (Art. 6(1)(f)) | 7 years | `alarm_threshold_audit` |
| API request logs | Yes (pseudonymous — IP) | Legitimate interests — security / billing (Art. 6(1)(f)) | 90 days | Log files / SIEM |
| MFA secrets (TOTP) | Yes (sensitive account data) | Contract performance (Art. 6(1)(b)) | Account lifetime; immediately deleted on account deletion | `users.mfa_secret` (encrypted at rest) |
| Space-Track data disclosure log | No (records org-level disclosure, not individuals) | Legitimate interests — licence compliance (Art. 6(1)(f)) | 5 years | `data_disclosure_log` |

**IP address data minimisation policy (F3 — §64):** IP addresses are personal data (CJEU *Breyer*, C-582/14). The full IP address is needed for fraud detection and security investigation within the first 90 days; beyond that, only a hashed form is needed for statistical/audit purposes.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs weekly):

```python
from datetime import datetime, timedelta

from celery import shared_task
from sqlalchemy import text

from app.db import db  # module-level session handle; import path illustrative


@shared_task
def hash_old_ip_addresses():
    """Replace full IP addresses with SHA-256 hashes after 90-day audit window."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    db.execute(text("""
        UPDATE security_logs
        SET ip_address = CONCAT('sha256:', LEFT(ENCODE(DIGEST(ip_address, 'sha256'), 'hex'), 16))
        WHERE created_at < :cutoff
          AND ip_address NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.execute(text("""
        UPDATE users
        SET tos_accepted_ip = CONCAT('sha256:', LEFT(ENCODE(DIGEST(tos_accepted_ip, 'sha256'), 'hex'), 16))
        WHERE tos_accepted_at < :cutoff
          AND tos_accepted_ip NOT LIKE 'sha256:%'
    """), {"cutoff": cutoff})
    db.commit()
```

**Necessity assessment for IP storage (required in DPIA §2):** Full IP is necessary for: (a) detecting account takeover (geolocation anomaly), (b) rate-limiting bypass investigation, (c) regulatory/legal requests within the statutory window. Hashed form is sufficient for: (d) long-term audit log integrity (proving an event occurred from a non-obvious source), (e) statistical reporting. The 90-day threshold is the operational window for security investigations; beyond this, the benefit does not outweigh data subjects' privacy interests.

**DPIA requirement and structure (F1 — §64):** GDPR Article 35 mandates a DPIA before processing that is likely to result in high risk. SpaceCom's processing meets the Art. 35(1) high-risk threshold (WP248 criterion: systematic monitoring of data subjects) because it tracks the online operational behaviour of aviation professionals (login times, alert acknowledgements, decision patterns, handover text) in a system used to support safety decisions. This is a pre-processing obligation: EU personal data cannot lawfully be processed without completing the DPIA first.

**Document:** `legal/DPIA.md` — a Phase 2 gate (must be complete before any EU/UK ANSP shadow activation).

**Required DPIA structure (EDPB WP248 rev.01 template):**

| Section | Content required |
|---------|------------------|
| **1. Description of processing** | Purpose, nature, scope, context of processing; categories of data; data flows; recipients |
| **2. Necessity and proportionality** | Why is this data necessary? Could the purpose be achieved with less data? Legal basis per activity (mapped in §29.1 RoPA) |
| **3. Risk identification** | Risks to data subjects: unauthorised access to operational patterns; re-identification of pseudonymised safety records; cross-border transfer exposure; disclosure to authorities |
| **4. Risk mitigation measures** | Technical: RLS, HMAC, TLS, MFA, pseudonymisation. Organisational: DPA with ANSPs, export control screening, sub-processor contracts |
| **5. Residual risk assessment** | Risk level after mitigations: Low / Medium / High. If High residual risk: prior consultation with supervisory authority required (Art. 36) |
| **6. DPO opinion** | Designated DPO's written sign-off or objection |
| **7. Review schedule** | DPIA reviewed when processing changes materially; at least every 3 years |

The DPIA covers all processing activities in the RoPA. Key risk finding anticipated: the alert acknowledgement audit trail (who acknowledged what, when) creates a de facto performance monitoring record for individual ANSP controllers — this must be addressed in Section 3 with mitigations in Section 4 (pseudonymisation after operational retention window, access restricted to org_admin and admin roles).

**Privacy Notice** — must be published at the registration URL and linked from the ToS acceptance flow. Must cover: data controller identity, categories of data collected, purposes and lawful bases, retention periods, data subject rights, third-party processors (cloud provider, SIEM), cross-border transfer safeguards.

---

### 29.2 Data Subject Rights Implementation

| Right | Mechanism | Notes |
|-------|-----------|-------|
| **Access (Art. 15)** | `GET /api/v1/users/me/data-export` — returns all personal data held for the authenticated user as a JSON download | Available to all logged-in users |
| **Rectification (Art. 16)** | `PATCH /api/v1/users/me` — allows name, email, organisation update | Email change triggers re-verification |
| **Erasure (Art. 17)** | `POST /api/v1/users/me/erasure-request` → calls `handle_erasure_request(user_id)` | See §29.3 |
| **Restriction (Art. 18)** | Admin-level: `users.access_restricted = TRUE` suspends account without deleting data | Used where erasure conflicts with retention requirement |
| **Portability (Art. 20)** | `POST /org/export` (org_admin or admin) — asynchronous export of all org personal data in machine-readable JSON; fulfilled within 30 days; also used for offboarding (§29.8). Covers user-generated content (acknowledgements, handover notes); not derived physics predictions. | F11 |
| **Objection (Art. 21)** | For legitimate interests processing: handled by erasure or restriction pathway | No automated profiling that would trigger Art. 22 |

---

### 29.3 Erasure vs. Retention Conflict — Pseudonymisation Procedure

The 7-year retention requirement (UN Liability Convention, aviation safety records) conflicts with the GDPR Article 17 right to erasure for personal data embedded in `alert_events` and `security_logs`. Resolution: **pseudonymise, do not delete**.

```python
import hashlib

from sqlalchemy import text
from sqlalchemy.orm import Session

from app.security import log_security_event  # import path illustrative


def handle_erasure_request(user_id: int, db: Session):
    """
    Satisfy GDPR Art. 17 erasure request while preserving safety-critical records.
    Called when a user account is deleted or an explicit erasure request is received.
    """
    # Stable pseudonym — deterministic hash of user_id, not reversible
    pseudonym = f"[user deleted - ID:{hashlib.sha256(str(user_id).encode()).hexdigest()[:12]}]"

    # Pseudonymise user references in append-only safety tables
    db.execute(
        text("UPDATE alert_events SET acknowledged_by_name = :p WHERE acknowledged_by = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    db.execute(
        text("UPDATE security_logs SET user_email = :p WHERE user_id = :uid"),
        {"p": pseudonym, "uid": user_id}
    )
    # Pseudonymise shift handover records — user IDs replaced, notes preserved for safety record
    db.execute(
        text("""UPDATE shift_handovers
                SET outgoing_user = NULL, incoming_user = NULL,
                    notes_text = CASE WHEN outgoing_user = :uid OR incoming_user = :uid
                                      THEN CONCAT('[pseudonymised: ', :p, '] ', COALESCE(notes_text,''))
                                      ELSE notes_text END
                WHERE outgoing_user = :uid OR incoming_user = :uid"""),
        {"p": pseudonym, "uid": user_id}
    )
    # Delete the user record itself (and cascade to refresh_tokens, api_keys)
    db.execute(text("DELETE FROM users WHERE id = :uid"), {"uid": user_id})
    db.commit()
    # Log the erasure event (note: this log entry is itself pseudonymised from creation)
    log_security_event("USER_ERASURE_COMPLETED", details={"pseudonym": pseudonym})
```

The core safety records (`alert_events`, `security_logs`, `reentry_predictions`) are preserved. The link to the identified individual is severed. This satisfies GDPR recital 26 (data ceases to be personal where re-identification is no longer reasonably possible) and Article 17(3)(b) (the erasure obligation does not apply where processing is necessary for compliance with a legal obligation).

**Free-text field periodic pseudonymisation (F6 — §64):** Handover notes (`shift_handovers.notes_text`) and alert acknowledgement text (`alert_events.action_taken`) are free-text fields in which operators may name colleagues, reference individuals' decisions, or include other personal references. The 7-year retention of these fields as written creates personal data retained far beyond its operational value. After the **operational retention window** (2 years — the period within which a re-entry event's record could be actively referenced by an ANSP), free-text personal references must be pseudonymised in place.

Required Celery Beat task (`tasks/privacy_maintenance.py`, runs monthly):

```python
# Shares imports with hash_old_ip_addresses in the same module (§29.1)
@shared_task
def pseudonymise_old_freetext():
    """
    Replace identifiable free-text in operational records after the 2-year operational window.
    The record itself is retained; only the human-entered text is sanitised.
    """
    cutoff = datetime.utcnow() - timedelta(days=730)  # 2 years
    # Replace acknowledgement text with a sanitised marker — preserve the fact of acknowledgement
    db.execute(text("""
        UPDATE alert_events
        SET action_taken = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND action_taken IS NOT NULL
          AND action_taken NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    # Preserve handover structure; pseudonymise notes text
    db.execute(text("""
        UPDATE shift_handovers
        SET notes_text = '[text pseudonymised after operational retention window]'
        WHERE created_at < :cutoff
          AND notes_text IS NOT NULL
          AND notes_text NOT LIKE '[text pseudonymised%'
    """), {"cutoff": cutoff})
    db.commit()
```

The 2-year operational window is chosen because: (a) PIR processes complete within 5 business days; (b) regulatory investigations of re-entry events typically complete within 12–18 months; (c) 2 years provides margin. Beyond 2 years, the text serves no legitimate purpose that outweighs the data subject's interest in not having their decision-making text retained indefinitely.

---

### 29.4a Data Subject Access Request Procedure (F7 — §64)

The `GET /api/v1/users/me/data-export` endpoint exists (§29.2). The DSAR procedure — how requests are received, processed, and responded to within the statutory deadline — must also be documented.

**DSAR SLA:** 30 calendar days from receipt of the verified request (GDPR Art. 12(3)). An extension of up to two further months is permitted for complex or numerous requests, with written notice to the data subject within the first 30 days.

**DSAR procedure (`docs/runbooks/dsar-procedure.md`):**

| Step | Action | Owner | Timing |
|------|--------|-------|--------|
| 1 | Receive request (email to `privacy@spacecom.io` or in-app `POST /api/v1/users/me/data-export-request`) | DPO/designated contact | Day 0 |
| 2 | Verify identity of requestor (must be the data subject or an authorised representative) | DPO | Within 3 business days |
| 3 | Assess scope: what data is held? Which tables? What exemptions apply (safety record retention)? | DPO + engineering | Within 7 days |
| 4 | Generate export: `GET /api/v1/users/me/data-export` for self-service; admin endpoint for cases where the account is deleted/suspended | Engineering | Within 20 days |
| 5 | Deliver export: encrypted ZIP sent to the verified email address | DPO | By day 28 |
| 6 | Document: log in `legal/DSAR_LOG.md` — request date, identity verified, scope, delivery date, any exemptions invoked | DPO | Same day as delivery |
| 7 | If an exemption applied (safety records retained): provide a written explanation of the exemption and residual rights | DPO | Included in delivery |

**`GET /api/v1/users/me/data-export` response scope** — must include all of:

- `users` record fields (excluding password hash)
- `alert_events` where `acknowledged_by = user.id` (pre-pseudonymisation only)
- `shift_handovers` where `outgoing_user = user.id` or `incoming_user = user.id`
- `operator_training_records` for the user
- `api_keys` metadata (not the key value itself)
- `security_logs` where `user_id = user.id` (pre-IP-hashing only)
- `tos_accepted_at`, `tos_version` from `users`

Fields excluded from the DSAR export (not personal data, or subject to a legitimate processing exemption):

- `reentry_predictions` (not personal data)
- `security_logs` entries of type `HMAC_KEY_ROTATION`, `DEPLOY_*` (operational audit, not personal)
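
The scope and exclusion lists above can be sketched as a payload assembly step. Assumptions: rows are pre-fetched per table, and the field names (`key_value`, `event_type`, `mfa_secret`) are illustrative, not the confirmed schema:

```python
def build_dsar_export(user: dict, records: dict) -> dict:
    """Assemble the DSAR export payload from pre-fetched per-table rows.

    Sketch only: `records` holds already-queried rows keyed by table name.
    """
    excluded_user_fields = {"password_hash", "mfa_secret"}  # never exported
    excluded_log_types = {"HMAC_KEY_ROTATION"}
    return {
        "user": {k: v for k, v in user.items() if k not in excluded_user_fields},
        "alert_events": records.get("alert_events", []),
        "shift_handovers": records.get("shift_handovers", []),
        "operator_training_records": records.get("operator_training_records", []),
        "api_keys": [
            {k: v for k, v in key.items() if k != "key_value"}  # metadata only
            for key in records.get("api_keys", [])
        ],
        "security_logs": [
            row for row in records.get("security_logs", [])
            if row.get("event_type") not in excluded_log_types
            and not str(row.get("event_type", "")).startswith("DEPLOY_")
        ],
    }
```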

---

### 29.4 Data Processing Agreements

A **Data Processing Agreement (DPA)** is required in every commercial relationship where SpaceCom acts as a data processor for customer personal data (GDPR Art. 28).

SpaceCom acts as **data processor** for: user data belonging to ANSP and space operator customers (the customers are the data controllers for their employees' data).

SpaceCom acts as **data controller** for: its own user authentication data, security logs, and analytics.

**Required DPA provisions (GDPR Art. 28(3)):**

- Processing only on documented instructions of the controller
- Confidentiality obligations on authorised processors
- Technical and organisational security measures (reference §7)
- Sub-processor approval process (cloud provider, SIEM)
- Data subject rights assistance obligations
- Deletion or return of data on contract termination
- Audit and inspection rights for the controller

The DPA template must be reviewed by counsel before any EU/UK commercial deployment. It is a standard addendum to the MSA.

**Sub-processor register (F9 — §64):** GDPR Article 28(2) requires that the controller authorises sub-processors, and Article 28(4) requires that the processor imposes equivalent obligations on sub-processors. The DPA template references a sub-processor register; that register must exist as a standalone document.

**Document:** `legal/SUB_PROCESSORS.md` — Phase 2 gate (required before first EU/UK commercial deployment).

| Sub-processor | Service | Personal data transferred | Location | Transfer mechanism | DPA in place |
|---------------|---------|---------------------------|----------|--------------------|--------------|
| Cloud host (e.g. AWS/Hetzner) | Infrastructure hosting | All categories (hosted on their infrastructure) | eu-central-1 (Frankfurt) | Adequacy / SCCs | AWS DPA / Hetzner DPA |
| GitHub | Source code hosting, CI/CD | Developer usernames; may appear in test fixtures | US | EU SCCs (Module 2) | GitHub DPA |
| Email delivery provider (e.g. Postmark, SES) | Transactional email (alert notifications) | User email address, name, alert content | US | EU SCCs (Module 2) | Provider DPA |
| Grafana Cloud (if used) | Observability / monitoring | IP addresses in logs ingested to Loki | US/EU | SCCs / EU region option | Grafana DPA |
| Sentry (if used) | Error tracking | Stack traces may contain user IDs, request data | US | EU SCCs | Sentry DPA |

**Customer notification obligation:** ANSPs (as data controllers) must be notified ≥30 days before any new sub-processor is added. The DPA addendum requires this. The sub-processor register is the mechanism for tracking and triggering notifications.
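
The ≥30-day notice window is simple enough to encode as a guard in the register tooling. A sketch under the assumption that such tooling exists; the helper names are hypothetical:

```python
from datetime import date, timedelta

NOTICE_PERIOD = timedelta(days=30)  # per the DPA addendum obligation


def earliest_activation(notice_sent: date) -> date:
    """Earliest date a new sub-processor may go live after customer notice."""
    return notice_sent + NOTICE_PERIOD


def notice_satisfied(notice_sent: date, planned_activation: date) -> bool:
    """True if the planned go-live respects the 30-day notification window."""
    return planned_activation >= earliest_activation(notice_sent)
```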

---

### 29.5 Cross-Border Data Transfer Safeguards

For EU/UK customers where SpaceCom infrastructure is hosted outside the EU/UK (e.g., AWS us-east-1):

- Use EU/UK regions where available, or
- Execute Standard Contractual Clauses (SCCs — 2021 EU SCCs / UK IDTA) with the cloud provider
- In either case, document the transfer mechanism in the Privacy Notice

For Australian customers: the Privacy Act's Australian Privacy Principle 8 (cross-border disclosure) requires contractual protections equivalent to the APPs when transferring personal data internationally.

**Data residency policy (Finding 8):**

- **Default hosting:** EU jurisdiction (eu-central-1 / Frankfurt or equivalent) — satisfies EU data residency requirements for ECAC ANSP customers; stated in the MSA and DPA
- **On-premise option:** `Institutional` tier supports customer-managed on-premise deployment (§34 specifies the deployment model); customer's own infrastructure, own jurisdiction; SpaceCom provides a deployment package and support contract
- **Multi-tenancy isolation:** Each ANSP organisation's operational data (`alert_events`, `notam_drafts`, coordination notes) is accessible only to that organisation's users — enforced by RLS (§7.2). Multi-tenancy does not mean data co-mingling
- **Sub-processor disclosure:** `docs/legal/data-residency-policy.md` lists the hosting provider, region, and any sub-processors; updated when sub-processors change; referenced in the DPA; customers notified of material sub-processor changes ≥ 30 days in advance
- `organisations.hosting_jurisdiction` and `organisations.data_residency_confirmed` columns (§9.2) track per-organisation residency state; the admin UI surfaces this to Persona D
- **Authoritative document:** `legal/DATA_RESIDENCY.md` — lists the hosting provider, region, and all sub-processors with their data residency and SCCs/IDTA status; reviewed and re-signed annually by the DPO; customers notified of material sub-processor changes ≥30 days in advance per DPA obligations

---
|
||
|
||
### 29.6 Security Breach Notification
|
||
|
||
**Regulatory notification obligations by framework:**
|
||
|
||
| Framework | Trigger | Deadline | Authority | Template location |
|
||
|-----------|---------|----------|-----------|------------------|
|
||
| GDPR Art. 33 | Personal data breach affecting EU/UK data subjects | 72 hours of discovery | National DPA (e.g. ICO, CNIL, BfDI) | `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` |
|
||
| UK GDPR | As above for UK data subjects | 72 hours | ICO | As above |
|
||
| NIS2 Art. 23 | Significant incident affecting network/information systems of an essential entity | **Early warning: 24 hours** of becoming aware; full notification: 72 hours; final report: 1 month | National CSIRT + competent authority (space traffic management is likely an essential sector under NIS2 Annex I) | As above |
|
||
| Australian Privacy Act | Eligible data breach (serious harm likely) | ASAP (no fixed period; promptness required) | OAIC | As above |
|
||
|

**Incident response timeline:**

| Step | Timing | Action |
|------|--------|--------|
| Detect and contain | Immediately | Revoke affected credentials; isolate affected service; preserve logs |
| Assess scope | Within 2 hours | Determine: categories of data affected, approximate number of data subjects, jurisdictions, NIS2 applicability |
| Notify legal counsel and DPO | Within 4 hours of detection | Counsel advises on notification obligations across all applicable frameworks |
| NIS2 early warning | Within 24 hours of awareness | If significant incident: notify national CSIRT with initial information; no need for complete picture at this stage |
| Notify supervisory authority (EU/UK GDPR) | Within 72 hours of becoming aware | Via national DPA portal; even if incomplete — update as more known |
| NIS2 full notification | Within 72 hours of awareness | Full incident notification to national CSIRT / competent authority |
| Notify data subjects | Without undue delay | If breach likely to result in high risk to individuals |
| NIS2 final report | Within 1 month of full notification | Detailed description, impact assessment, cross-border impact, measures taken |
| Document | Ongoing | GDPR Art. 33(5) requires documentation of all breaches; NIS2 requires an audit trail |

GDPR and NIS2 breach notification is integrated into the §26.8 incident response runbook. The `security_logs` record type `DATA_BREACH` triggers the breach notification workflow. On-call engineers must be trained to recognise when NIS2 thresholds (significant impact on service continuity or data integrity) are met and escalate to the DPO within the 24-hour window. Full obligations are mapped in `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md`.
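The deadline arithmetic in the timeline can be folded directly into the breach workflow trigger. A minimal sketch, assuming a hypothetical `notification_deadlines` helper (not part of the runbook) and approximating the NIS2 final-report month as 30 days:

```python
from datetime import datetime, timedelta

# Hypothetical helper mirroring the timeline table: NIS2 early warning at 24 h,
# GDPR/NIS2 full notification at 72 h, NIS2 final report one month after the
# full notification (approximated here as 30 days).
def notification_deadlines(aware_at: datetime) -> dict[str, datetime]:
    full = aware_at + timedelta(hours=72)
    return {
        "nis2_early_warning": aware_at + timedelta(hours=24),
        "gdpr_supervisory_authority": full,
        "nis2_full_notification": full,
        "nis2_final_report": full + timedelta(days=30),
    }

deadlines = notification_deadlines(datetime(2025, 3, 1, 9, 0))
```

The real workflow would persist these due times alongside the `DATA_BREACH` record so the on-call rotation can be paged ahead of each deadline.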

---

### 29.7 Cookie / Tracking Consent

Even as a B2B SaaS operating within corporate networks, SpaceCom must comply with the ePrivacy Directive (2002/58/EC as amended) for any non-essential cookies set on EU/UK user browsers.

**Cookie audit (required at least annually — `legal/COOKIE_POLICY.md`):**

| Cookie name | Category | Purpose | Lifetime | Consent required? |
|-------------|----------|---------|----------|-----------------|
| `session` | Strictly necessary | Authenticated session token | Session / 8h inactivity | No |
| `csrf_token` | Strictly necessary | CSRF protection | Session | No |
| `tos_version` | Strictly necessary | ToS acceptance tracking | 1 year | No |
| `feature_flags` | Functional | A/B flags for UI features | 30 days | Yes (functional consent) |
| `_analytics` | Analytics | Usage telemetry (if implemented) | 13 months | Yes (analytics consent) |

**Security requirements for all session cookies (ePrivacy + §36 security):**
```
Set-Cookie: session=...; HttpOnly; Secure; SameSite=Strict; Path=/; Max-Age=28800
```

**Consent implementation:**
- Consent banner displayed on first visit to any EU/UK user before any non-essential cookies are set
- Three options: Accept all / Functional only / Strictly necessary only
- Consent preference stored in `user_cookie_preferences` or localStorage (no cookie used to store consent — self-defeating)
- Consent is re-requested if cookie categories change materially
- B2B context note: even if the organisation has a corporate cookie policy, individual users' consent is required under ePrivacy; organisational IT policies do not substitute for individual consent
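The consent-gating rule described above can be sketched as a small helper. This is an illustration only; `ALLOWED_CATEGORIES` and `may_set_cookie` are hypothetical names, and the real implementation would read the stored `user_cookie_preferences` value:

```python
# Hypothetical consent gate: map the stored consent choice to the cookie
# categories that may be set. Strictly necessary cookies never require consent.
ALLOWED_CATEGORIES = {
    "necessary": {"strictly_necessary"},
    "functional": {"strictly_necessary", "functional"},
    "all": {"strictly_necessary", "functional", "analytics"},
}

def may_set_cookie(consent_choice: str, cookie_category: str) -> bool:
    # Unknown or absent consent falls back to strictly-necessary only.
    allowed = ALLOWED_CATEGORIES.get(consent_choice, {"strictly_necessary"})
    return cookie_category in allowed
```

Defaulting unknown choices to the strictly-necessary set means a missing or corrupted preference record can never enable tracking cookies.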

**Cookie policy:** `legal/COOKIE_POLICY.md` — published at registration URL and linked from the consent banner. Reviewed when new cookies are introduced or existing cookies change purpose.

---
### 29.8 Organisation Onboarding and Offboarding (F4)

#### Onboarding workflow

New organisation provisioning requires explicit `admin` action — self-serve registration is not available in Phase 1 (safety-critical context; all organisations are individually vetted).

**Onboarding gates (all must be satisfied before `subscription_status` → `active`):**
1. Legal: MSA executed (countersigned PDF stored in `legal/contracts/{org_id}/msa.pdf`)
2. Export control: `export_control_cleared = TRUE` on the `organisations` row (BIS Entity List check; see §24.2)
3. Space-Track: If the organisation requires Space-Track data: `space_track_registered = TRUE`; `space_track_username` recorded; data disclosure log seeded
4. Billing: `billing_contacts` row created; VAT number validated for EU customers
5. Admin user: at least one `org_admin` user created with MFA enrolled
6. ToS: primary `org_admin` user has `tos_accepted_at IS NOT NULL`

Each gate is a checklist step in `docs/runbooks/org-onboarding.md`. Completing all gates creates a `subscription_periods` row with `period_start = NOW()`.
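The all-gates-satisfied rule can be sketched as follows. The `org_state` keys are hypothetical stand-ins for the columns and checks listed above, not actual schema names:

```python
# Hypothetical gate evaluation: activation requires every gate to be satisfied;
# unmet_gates returns human-readable labels for the checklist UI.
GATES = {
    "msa_executed": "Legal: MSA executed",
    "export_control_cleared": "Export control: BIS Entity List check passed",
    "space_track_ok": "Space-Track: registered (if required)",
    "billing_contact_created": "Billing: billing_contacts row and VAT validation",
    "org_admin_with_mfa": "Admin user: org_admin with MFA enrolled",
    "tos_accepted": "ToS: tos_accepted_at set for primary org_admin",
}

def unmet_gates(org_state: dict[str, bool]) -> list[str]:
    return [label for key, label in GATES.items() if not org_state.get(key, False)]

def may_activate(org_state: dict[str, bool]) -> bool:
    return not unmet_gates(org_state)
```

Missing keys default to `False`, so a gate that was never evaluated blocks activation rather than passing silently.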

#### Offboarding workflow

When an organisation's subscription ends (churn, termination, or suspension), the offboarding procedure:

| Step | Action | Who | When |
|------|--------|-----|------|
| 1 | Set `subscription_status = 'churned'` / `'suspended'` | Admin | Immediately |
| 2 | Revoke all `api_keys` for the org | Admin (automated) | Immediately |
| 3 | Invalidate all active sessions (`refresh_tokens`) | Admin (automated) | Immediately |
| 4 | Notify org primary contact: 30-day data export window | Admin | Same day |
| 5 | Generate and deliver org data export archive | Admin | Within 3 business days |
| 6 | After 30-day window: pseudonymise user personal data | Automated job | Day 31 |
| 7 | Retain non-personal safety records (7-year minimum) | DB — no action | Ongoing |
| 8 | Confirm deletion in writing to org billing contact | Admin | After step 6 |

**GDPR Art. 17 vs. retention conflict:** User personal data (name, email, IP addresses) is pseudonymised per §29.3 after the 30-day window. Safety records (`alert_events`, `reentry_predictions`, `shift_handovers`) are retained for 7 years per UN Liability Convention — the organisation row remains in the database with `subscription_status = 'churned'` as the foreign key anchor. No safety record is deleted.

**Suspension vs. termination:** A suspended organisation (`subscription_status = 'suspended'`) retains data and can be reactivated by an admin. A churned organisation enters the 30-day export window immediately. Suspension is used for payment failure; churn for voluntary or contractual termination.
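The suspended/churned branching can be sketched as the decision the automated day-31 job applies. `offboarding_action` is an illustrative name; status values follow `subscription_status`:

```python
# Hypothetical day-31 job decision: suspended orgs are retained for possible
# reactivation; churned orgs get a 30-day export window, then pseudonymisation.
# Safety records are never deleted in any branch.
def offboarding_action(subscription_status: str, days_since_churn: int) -> str:
    if subscription_status == "suspended":
        return "retain"
    if subscription_status == "churned":
        return "export_window" if days_since_churn <= 30 else "pseudonymise"
    return "active"
```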

---

### 29.9 Audit Log Personal Data Separation (F8 — §64)

`security_logs` currently serves two distinct purposes with conflicting retention requirements:
- **Integrity audit records** (HMAC checks, ingest events, deploy markers): no personal data; 7-year retention under UN Liability Convention
- **Personal data processing records** (user logins, IP addresses, acknowledgement events): personal data; subject to data minimisation, IP hashing at 90 days, erasure on request

Mixing these in one table means a single retention policy applies to both — either over-retaining personal data (7 years) or under-retaining operational integrity records. Required separation:

```sql
-- New table: operational integrity audit — no personal data, 7-year retention
CREATE TABLE integrity_audit_log (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type  TEXT NOT NULL,  -- 'HMAC_VERIFICATION', 'INGEST_SUCCESS', 'DEPLOY_COMPLETED', etc.
    source      TEXT,           -- service name, job ID
    details     JSONB,          -- operational context; must not contain user IDs or IPs
    severity    TEXT NOT NULL DEFAULT 'INFO'
);

-- Existing security_logs: personal data processing records — IP hashing at 90d, erasure on request
-- Add constraint: security_logs must only hold user-action event types
ALTER TABLE security_logs ADD CONSTRAINT chk_security_logs_type
    CHECK (event_type IN (
        'LOGIN', 'LOGOUT', 'MFA_ENROLLED', 'PASSWORD_RESET', 'API_KEY_CREATED',
        'API_KEY_REVOKED', 'TOS_ACCEPTED', 'DATA_BREACH', 'USER_ERASURE_COMPLETED',
        'SAFETY_OCCURRENCE', 'DEPLOY_ALERT_GATE_OVERRIDE', 'HMAC_KEY_ROTATION',
        'AIRSPACE_UPDATE', 'EXPORT_CONTROL_SCREENED', 'SHADOW_MODE_ACTIVATED'
    ));
```

**Migration:** Existing `security_logs` records of type `INGEST_*`, `HMAC_VERIFICATION_*` (pass/fail), `DEPLOY_COMPLETED` are migrated to `integrity_audit_log`. The personal-data-containing events remain in `security_logs` with the updated retention and IP-hashing policy.
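A sketch of the routing rule the migration applies (a hypothetical `target_table` helper; event type names are taken from the constraint and migration list above):

```python
# Hypothetical routing rule: integrity event types move to integrity_audit_log;
# user-action events stay in security_logs. HMAC_KEY_ROTATION is a user action
# and deliberately does not match the HMAC_VERIFICATION_ prefix.
INTEGRITY_PREFIXES = ("INGEST_", "HMAC_VERIFICATION_")
INTEGRITY_EXACT = {"DEPLOY_COMPLETED"}

def target_table(event_type: str) -> str:
    if event_type in INTEGRITY_EXACT or event_type.startswith(INTEGRITY_PREFIXES):
        return "integrity_audit_log"
    return "security_logs"
```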

**Benefit:** `integrity_audit_log` can be retained for 7 years without any privacy obligation. `security_logs` is subject to the 90-day IP hashing, erasure-on-request, and 2-year text pseudonymisation policies without affecting integrity records.

---
### 29.10 Lawful Basis Mapping and ToS Acceptance Clarification (F11 — §64)

The first-login ToS/AUP acceptance flow (§3.1, §13) gates access and records `tos_accepted_at`. This mechanism does not mean consent (Art. 6(1)(a)) is the universal lawful basis for all processing. The RoPA (§29.1) maps the correct basis per activity; this section clarifies the principle.

**Lawful basis is determined by purpose, not by the collection mechanism:**

| Processing activity | Correct basis | Why NOT consent |
|--------------------|--------------|--------------------|
| Delivering alerts and predictions the user subscribed to | Art. 6(1)(b) — contract performance | User contracted for the service; consent would be revocable and would prevent service delivery |
| Security logging of user actions | Art. 6(1)(f) — legitimate interests (fraud/security) | Required regardless of consent; security cannot be conditional on consent |
| Audit trail for UN Liability Convention | Art. 6(1)(c) — legal obligation | Statutory retention requirement; consent is irrelevant |
| Fatigue monitoring triggers (§28.3 — server-side thresholds) | Art. 6(1)(b) or (f) | Part of the contracted service and/or legitimate safety interest; **not** health data (Art. 9) because no health information is processed — only activity patterns |
| Sending marketing or product update emails (not core service) | Art. 6(1)(a) — consent | Marketing emails require opt-in consent separate from service ToS |

**ToS acceptance is consent evidence only for:** (a) acknowledgement of terms, (b) Space-Track redistribution acknowledgement, (c) export control acknowledgement. It is not a blanket consent to all processing.

**Implementation requirement:** The Privacy Notice (§29.1) must state the correct lawful basis for each category of processing, not imply consent for all. Legal counsel review required before publication.
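The purpose-to-basis mapping can be expressed as an explicit lookup so code paths cannot silently default to consent. The purpose keys below are hypothetical labels for the table rows above, not actual schema or API names:

```python
# Hypothetical mapping of processing purpose → lawful basis, mirroring the
# table above. An unmapped purpose raises rather than defaulting to consent.
LAWFUL_BASIS = {
    "service_delivery": "Art. 6(1)(b) — contract performance",
    "security_logging": "Art. 6(1)(f) — legitimate interests",
    "liability_audit_trail": "Art. 6(1)(c) — legal obligation",
    "fatigue_monitoring": "Art. 6(1)(b) or (f)",
    "marketing_email": "Art. 6(1)(a) — consent",
}

def lawful_basis(purpose: str) -> str:
    if purpose not in LAWFUL_BASIS:
        raise ValueError(f"no lawful basis mapped for purpose: {purpose}")
    return LAWFUL_BASIS[purpose]
```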

---

### 29.11 Open Source / Dependency Licence Compliance (§66)

SpaceCom is a closed-source SaaS product. Certain open-source licence obligations apply regardless of whether source code is distributed, because SpaceCom serves a web application to end users over a network. This section documents licence assessments for all material dependencies.

**Reference document:** `legal/OSS_LICENCE_REGISTER.md` — authoritative per-dependency licence record, updated on every major dependency version change.
#### F1 — CesiumJS AGPLv3 Commercial Licence

CesiumJS is licensed under AGPLv3. The AGPL network use provision (§13) requires that any software that incorporates AGPLv3 code and is served over a network must make its complete corresponding source available to users. SpaceCom is closed-source and does not satisfy this requirement under the AGPLv3 terms.

**Required action:** A commercial licence from Cesium Ion must be executed and stored at `legal/LICENCES/cesium-commercial.pdf` **before any Phase 1 demo or ESA evaluation deployment**. The CI licence gate (`license-checker-rseidelsohn --excludePackages "cesium"`) is correct only when a valid commercial licence exists — the exclusion without the licence is a false negative. The commercial licence is referenced in ADR-0007 (`docs/adr/0007-cesiumjs-commercial-licence.md`).

**Phase gate:** `legal/LICENCES/cesium-commercial.pdf` present and `legal_clearances.cesium_commercial_executed = TRUE` is a **Phase 1 go/no-go** criterion. Block all external deployments until confirmed.

#### F3 — Space-Track AUP Redistribution Prohibition

Space-Track Terms of Service prohibit redistribution of TLE and CDM data to unregistered parties. SpaceCom's ingest pipeline fetches TLE/CDM data under a single registered account and serves derived predictions to ANSP users. The redistribution risk surfaces in two ways:

1. **Raw TLE exposure via API:** If SpaceCom's API returns raw TLE strings (e.g., in `/objects/{id}/tle`), and those strings are accessible to unauthenticated users or third-party integrations, this may constitute redistribution. All TLE endpoints must require authentication and must not be proxied to unregistered downstream systems.

2. **Credentials in client-side code or SBOM:** `SPACETRACK_PASSWORD` (see the §30.3 environment variable contract) must never appear in `frontend/` source, git history, SBOM artefacts, or any publicly accessible location. Validate with `detect-secrets` (already in the pre-commit hook) and `git secrets --scan-history`.

**ADR:** `docs/adr/0016-space-track-aup-architecture.md` — records the chosen path (shared ingest vs. per-org credentials) with AUP clarification evidence.

#### F4 — Python Dependency Licence Assessment

| Package | Licence | Risk | Mitigation |
|---------|---------|------|-----------|
| NumPy | BSD-3 | None | — |
| SciPy | BSD-3 | None | — |
| astropy | BSD-3 | None | — |
| sgp4 | MIT | None | — |
| poliastro | MIT / LGPLv3 (components) | Low | LGPLv3 requires the ability to relink against a modified library; standard `pip install` satisfies this as dynamic linking. SpaceCom does not ship a modified poliastro — no relinking obligation arises. Document in `legal/LGPL_COMPLIANCE.md`. |
| FastAPI | MIT | None | — |
| SQLAlchemy | MIT | None | — |
| Celery | BSD-3 | None | — |
| Pydantic | MIT | None | — |
| Playwright (Python) | Apache 2.0 | None | Chromium binary downloaded at build time; not redistributed. Captured in SBOM. |

**LGPL compliance document:** `legal/LGPL_COMPLIANCE.md` must confirm: (a) poliastro is installed via pip as a separate library, (b) SpaceCom does not statically link or incorporate modified poliastro source, (c) users can substitute a modified poliastro by reinstalling — this is satisfied by standard Python packaging. No further action required beyond this documentation.

#### F5 — TimescaleDB Licence Assessment

TimescaleDB uses a dual-licence model:

| Feature | Licence | SpaceCom use? |
|---------|---------|-------------|
| Hypertables, continuous aggregates, compression, `time_bucket()` | **Apache 2.0** | Yes — all core features used by SpaceCom |
| Multi-node distributed hypertables | **Timescale Licence (TSL)** | No — single-node at all tiers |
| Data tiering (automated S3 tiering) | **TSL** | No — SpaceCom uses MinIO ILM / manual S3 lifecycle, not TimescaleDB tiering |

**Assessment:** SpaceCom uses only Apache 2.0-licensed TimescaleDB features. No Timescale commercial agreement required. Document in `legal/LICENCES/timescaledb-licence-assessment.md`. Re-assess if multi-node or data tiering features are adopted at Tier 3.
#### F6 — Redis SSPL Assessment

From version 7.4, Redis is dual-licensed under RSALv2 and the Server Side Public Licence (SSPL). SSPL §13 requires that any entity offering the software as a service open-source their entire service stack. The relevant question for SpaceCom is whether deploying Redis **as an internal component** of SpaceCom constitutes "offering Redis as a service."

**Assessment:** SpaceCom operates Redis internally — users interact with SpaceCom's API and WebSocket interface, not directly with Redis. This is not offering Redis as a service. The SSPL obligation does not apply to internal use of Redis as a component. However, legal counsel should confirm this position before Phase 3 (operational deployment).

**Alternative if legal counsel disagrees:** Pin to **Redis 7.2.x** (BSD-3-Clause, the last release line before the licence change) or migrate to **Valkey** (BSD-3-Clause fork maintained by the Linux Foundation). Either is a drop-in replacement. Document the chosen path in `legal/LICENCES/redis-sspl-assessment.md`.

**Action:** Update the `pip-licenses` fail-on list to include `"Server Side Public License"` as a blocking licence category. Redis itself is not in the Python dependency tree (it is a Docker service), so this is a docker-image licence check. Add to the Trivy scan policy.
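A sketch of what the blocking gate might look like when run over a JSON licence report. The `Name`/`License` keys mirror `pip-licenses --format=json` output; the `blocked_packages` helper and the cesium exception handling are illustrative:

```python
# Hypothetical licence gate: flag any package whose licence is in the blocked
# set, excluding packages covered by a commercial licence (ADR-0007).
BLOCKED_LICENCES = {
    "GNU General Public License v2 (GPLv2)",
    "GNU General Public License v3 (GPLv3)",
    "GNU Affero General Public License v3 (AGPLv3)",
    "Server Side Public License",
}
COMMERCIALLY_LICENSED = {"cesium"}

def blocked_packages(report: list[dict[str, str]]) -> list[str]:
    return sorted(
        entry["Name"] for entry in report
        if entry["Name"] not in COMMERCIALLY_LICENSED
        and entry["License"] in BLOCKED_LICENCES
    )
```

Exact-name matching (rather than substring matching on "GPL") avoids false positives on LGPL-licensed components such as poliastro.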

#### F7 — Playwright and Chromium Binary Licence

Playwright (Python) is Apache 2.0. The Chromium binary bundled by Playwright uses the Chromium licence (BSD-3-Clause for most code; additional component licences apply for media codecs). Chromium is not redistributed by SpaceCom — Playwright downloads it at container build time via `playwright install chromium`.

**Assessment:** Internal use only; no redistribution. SBOM captures the Playwright version; Chromium binary version is captured by `syft` scanning the container image at the `cosign attest` step. No further action required.

#### F8 — Caddy Licence Assessment

Caddy server is Apache 2.0. Community plugins (the modules used in §26.9: `encode`, `reverse_proxy`, `tls`, `file_server`) are Apache 2.0. No Caddy enterprise plugins are used by SpaceCom. Caddy DNS challenge modules (if used for ACME wildcard certificates) must be verified — the `caddy-dns/cloudflare` module is MIT.

**Audit requirement:** On any `Caddyfile` change that adds a new module, verify its licence before merging. Add to the PR checklist for infrastructure changes.

#### F9 — PostGIS Licence Assessment

PostGIS is GPLv2+ with a linking exception for use with PostgreSQL. The linking exception reads: *"the copyright holders of PostGIS grant you permission to use PostGIS as a PostgreSQL extension without this resulting in the entire combined work becoming subject to the GPL."* SpaceCom uses PostGIS as a PostgreSQL extension (loaded via `CREATE EXTENSION postgis`) — the linking exception applies.

SpaceCom does not distribute PostGIS, does not modify PostGIS source, and does not ship a combined work — PostGIS is a runtime dependency of the database service. **No GPLv2 obligation arises.** Document in `legal/LGPL_COMPLIANCE.md` alongside the poliastro LGPL note.

#### F10 — Licence Change Monitoring CI Check

The existing `pip-licenses --fail-on` list (§7.13) catches Python GPL/AGPL. Additions required:

```yaml
# .github/workflows/ci.yml (security-scan job — update existing step)
- name: Python licence gate
  run: |
    pip install pip-licenses
    pip-licenses --format=json --output-file=python-licences.json
    # Block: GPL v2, GPL v3, AGPL v3, SSPL (if any Python package adopts it)
    pip-licenses --fail-on="GNU General Public License v2 (GPLv2);GNU General Public License v3 (GPLv3);GNU Affero General Public License v3 (AGPLv3);Server Side Public License"

- name: npm licence gate (updated)
  working-directory: frontend
  run: |
    npx license-checker-rseidelsohn --json --out npm-licences.json
    # cesium excluded: commercial licence at docs/adr/0007-cesiumjs-commercial-licence.md
    npx license-checker-rseidelsohn \
      --excludePackages "cesium" \
      --failOn "GPL;AGPL;SSPL"
```

Additionally, pin all Python and Node dependencies to **exact versions** in `requirements.txt` and `package-lock.json`. Renovate Bot PRs (§7.13) provide controlled upgrade paths; the licence gate re-runs on each Renovate PR to catch licence changes introduced by version upgrades.
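The exact-version rule is cheap to verify in CI. A minimal sketch, assuming a hypothetical `unpinned` helper (hash continuation lines beginning with `--` are skipped, and trailing line-continuation backslashes are tolerated):

```python
import re

# Hypothetical exact-pin check: every requirement line must pin with '=='.
# Comment lines and pip option/hash lines ('--hash=...') are skipped.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._\[\]-]*==\S+")

def unpinned(requirement_lines: list[str]) -> list[str]:
    reqs = [
        line.strip().rstrip("\\").strip()
        for line in requirement_lines
        if line.strip() and not line.strip().startswith(("#", "--"))
    ]
    return [r for r in reqs if not PINNED.match(r)]
```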

#### F11 — Contributor Licence Agreement for External Contributors

Before any contractor, partner, or third-party engineer contributes code to SpaceCom:

1. A **CLA or work-for-hire clause** must be in their contract confirming that all IP created for SpaceCom is owned by SpaceCom (or the appointing entity, per agreement).
2. The CLA template is at `legal/CLA.md` — a simple assignment of copyright for contributions made under contract.
3. The GitHub repository's `CONTRIBUTING.md` must state: *"External contributions require a signed CLA. Contact legal@spacecom.io before submitting a PR."*

**Phase gate:** Before any Phase 2 ESA validation partnership involves third-party engineering, confirm all engineers have executed the CLA or have work-for-hire clauses in their contracts. Unattributed IP in an ESA bid creates serious procurement risk.

---

## 30. DevOps / Platform Engineering

### 30.1 Pre-commit Hook Specification

All six hooks are required. The same hooks run locally (via `pre-commit`) and in CI (`lint` job). A push to GitHub that bypasses local hooks will fail CI.

**`.pre-commit-config.yaml`:**

```yaml
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
      - id: ruff
        args: ['--fix']
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0
    hooks:
      - id: mypy
        additional_dependencies: ['types-requests', 'sqlalchemy[mypy]']

  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker

  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier
        types_or: [javascript, typescript, html, css, json, yaml]

  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.0
    hooks:
      - id: sqlfluff-lint
        args: ['--dialect', 'postgres']
      - id: sqlfluff-fix
        args: ['--dialect', 'postgres']
```

All hooks are pinned by `rev`; update via `pre-commit autoupdate` in a dedicated dependency update PR. The `detect-secrets` baseline (`.secrets.baseline`) is committed to the repo and updated whenever legitimate secrets-like strings are added.

**`detect-secrets` baseline maintenance process** — incorrect baseline updates are the most common way this hook is neutralised. The correct procedure must be documented and enforced:

```bash
# docs/runbooks/detect-secrets-update.md (required runbook)

# CORRECT: update the baseline to add a new allowance while preserving existing ones
detect-secrets scan --baseline .secrets.baseline --update
git add .secrets.baseline
git commit -m "chore: update detect-secrets baseline for <reason>"

# WRONG — overwrites ALL existing allowances:
# detect-secrets scan > .secrets.baseline   ← NEVER do this
```

A CI check verifies baseline currency on every PR (a stale baseline means the hook is not being enforced). Running the scan against the committed baseline rewrites it in place, so staleness surfaces as a working-tree diff; the `generated_at` timestamp changes on every run and is excluded from the comparison:

```bash
# In lint job, after running pre-commit:
detect-secrets scan --baseline .secrets.baseline
git diff -I '"generated_at"' --exit-code .secrets.baseline || \
  (echo "ERROR: .secrets.baseline is stale — re-run the scan and commit the updated baseline" && exit 1)
```

`detect-secrets` is the canonical secrets scanner (entropy + regex). `git-secrets` (listed in §7.13) is also retained for its AWS credential pattern matching, which complements `detect-secrets`. Both run as pre-commit hooks; there is no conflict — they check different pattern sets.

---

### 30.2 Multi-Stage Dockerfile Pattern

All service Dockerfiles follow the builder/runtime two-stage pattern. No exceptions without documented justification.

**Backend (example — same pattern for worker and ingest):**

```dockerfile
# Stage 1: builder
FROM python:3.12-slim AS builder
WORKDIR /build

# Install build dependencies (not copied to runtime stage)
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev

COPY backend/requirements.txt .
# --require-hashes enforces that every package in requirements.txt carries a hash annotation.
# pip-compile --generate-hashes produces these. Without this flag, hash pinning is specified
# but not verified during build — a dependency confusion attack would be silently installed.
RUN pip install --upgrade pip && \
    pip wheel --no-cache-dir --require-hashes --wheel-dir /wheels -r requirements.txt

# Stage 2: runtime
FROM python:3.12-slim AS runtime
WORKDIR /app

# Create non-root user at build time
RUN groupadd --gid 1001 appuser && \
    useradd --uid 1001 --gid appuser --no-create-home appuser

# Install only compiled wheels — no build tools
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl && \
    rm -rf /wheels

COPY backend/app ./app

USER appuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

**Frontend:**

```dockerfile
FROM node:22-slim AS builder
WORKDIR /build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM node:22-slim AS runtime
WORKDIR /app
RUN groupadd --gid 1001 appuser && useradd --uid 1001 --gid appuser --no-create-home appuser
COPY --from=builder /build/.next/standalone ./
COPY --from=builder /build/.next/static ./.next/static
COPY --from=builder /build/public ./public
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]
```

**Version pin rule:** All Python service images use `python:3.12-slim`. All frontend/Node images use `node:22-slim`. Any `FROM` line using a different tag fails the `hadolint` pre-commit hook and the CI lint step. Do not let these drift — the service table in §3.2 and the Dockerfiles must agree.
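For illustration, the version pin rule reduces to a one-pass scan over `FROM` lines. This is a hypothetical sketch (`bad_from_lines` and `ALLOWED_BASES` are not existing tooling; `hadolint` enforces the real gate):

```python
import re

# Hypothetical lint for the version pin rule: every FROM line must use one of
# the two sanctioned base image tags.
ALLOWED_BASES = {"python:3.12-slim", "node:22-slim"}
FROM_LINE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE | re.MULTILINE)

def bad_from_lines(dockerfile_text: str) -> list[str]:
    return [
        m.group(0).strip()
        for m in FROM_LINE.finditer(dockerfile_text)
        if m.group(1) not in ALLOWED_BASES
    ]
```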

**CI verification** — the `build-and-push` job includes:

```bash
# Verify no build tools in runtime image (`which gcc` must find nothing)
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA which gcc && exit 1 || true
# Verify the image runs as the non-root appuser by default
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA id | grep -q "uid=1001" || exit 1
# Verify correct Python version
docker run --rm ghcr.io/spacecom/backend:sha-$GITHUB_SHA python --version | grep -q "Python 3.12" || exit 1
```

**Image digest pinning in production Compose files (F4 — §59):** The production `docker-compose.yml` pins images by digest, not by mutable tag, to guarantee bit-for-bit reproducibility and prevent registry-side tampering:

```yaml
# docker-compose.yml — production image references
# Update digests via: make update-image-digests (runs after each build-and-push)
services:
  backend:
    image: ghcr.io/spacecom/backend:sha-abc1234@sha256:a1b2c3d4...  # tag + digest
  worker-sim:
    image: ghcr.io/spacecom/worker:sha-abc1234@sha256:e5f6a7b8...
```

The `make update-image-digests` script (run by CI after `build-and-push`) queries GHCR for the digest of each newly pushed image and patches `docker-compose.yml` via `sed`. The patched file is committed back to the release branch as a separate commit.
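The patching step can be sketched in Python rather than `sed`. The `pin_digests` helper below is hypothetical (the real Makefile target queries GHCR first and handles the full Compose file):

```python
import re

# Hypothetical digest-pinning sketch: rewrite each `image:` line so it carries
# the digest of the freshly pushed build. `digests` maps image repository
# (without tag) to the new "sha256:..." value; unknown images are left alone.
IMAGE_LINE = re.compile(r"^(\s*image:\s*)(\S+?):([\w.-]+)(?:@\S+)?$", re.MULTILINE)

def pin_digests(compose_text: str, digests: dict[str, str]) -> str:
    def repl(m: re.Match) -> str:
        prefix, repo, tag = m.groups()
        digest = digests.get(repo)
        return f"{prefix}{repo}:{tag}@{digest}" if digest else m.group(0)
    return IMAGE_LINE.sub(repl, compose_text)
```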

**GHCR image retention policy (F4 — §59):**

| Image type | Tag pattern | Retention |
|-----------|-------------|-----------|
| Release images | `sha-<commit>` on tagged release | Indefinite |
| Staging images | `sha-<commit>` on `main` push | 30 days |
| Dev branch images | `sha-<commit>` on PR branch | 7 days |
| Build cache manifests | `buildcache` | Overwritten each build; no accumulation |
| Untagged images | (orphaned layers) | Purged weekly via GHCR lifecycle policy |

GHCR lifecycle policy is configured via the GitHub repository settings (Packages → Manage versions). The policy is documented in `docs/runbooks/image-lifecycle.md` and reviewed quarterly alongside the secrets audit.

---

### 30.3 Environment Variable Contract

All environment variables are documented in `.env.example`. Variables are grouped by category and stage:

| Variable | Required | Stage | Description |
|----------|----------|-------|-------------|
| `SPACETRACK_USERNAME` | Yes | All | Space-Track.org account email |
| `SPACETRACK_PASSWORD` | Yes | All | Space-Track.org password |
| `JWT_PRIVATE_KEY_PATH` | Yes | All | Path to RS256 PEM private key |
| `JWT_PUBLIC_KEY_PATH` | Yes | All | Path to RS256 PEM public key |
| `JWT_PUBLIC_KEY_NEW_PATH` | No | Rotation only | Second public key during keypair rotation window |
| `POSTGRES_PASSWORD` | Yes | All | TimescaleDB password |
| `REDIS_BACKEND_PASSWORD` | Yes | All | Redis ACL password for `spacecom_backend` user (full keyspace access) |
| `REDIS_WORKER_PASSWORD` | Yes | All | Redis ACL password for `spacecom_worker` user (Celery namespaces only) |
| `REDIS_INGEST_PASSWORD` | Yes | All | Redis ACL password for `spacecom_ingest` user (Celery namespaces only) |
| `MINIO_ACCESS_KEY` | Yes | All | MinIO access key |
| `MINIO_SECRET_KEY` | Yes | All | MinIO secret key |
| `HMAC_SECRET` | Yes | All | Prediction signing key (rotate per §26.9 procedure) |
| `ENVIRONMENT` | Yes | All | `development` / `staging` / `production` |
| `DEPLOY_CHECK_SECRET` | Yes | Staging/Prod | Read-only CI/CD gate credential |
| `SENTRY_DSN` | No | Staging/Prod | Error reporting DSN |
| `PAGERDUTY_ROUTING_KEY` | No | Prod only | AlertManager → PagerDuty routing key |
| `VAULT_ADDR` | No | Phase 3 | HashiCorp Vault address |
| `VAULT_TOKEN` | No | Phase 3 | Vault authentication token |
| `DISABLE_SIMULATION_DURING_ACTIVE_EVENTS` | No | All | Org-level simulation block; default `false` |
| `OPS_ROOM_SUPPRESS_MINUTES` | No | All | Alert audio suppression window; default `0` |

CI validates that `.env.example` is up-to-date by checking that every variable referenced in the codebase (`os.getenv(...)`, `settings.*`) has an entry in `.env.example`. Missing entries fail CI.
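The cross-check can be sketched as follows. This is an illustration covering only the `os.getenv` form (the real check also walks `settings.*` attributes); `missing_from_example` is a hypothetical name:

```python
import re

# Hypothetical .env.example currency check: every name referenced via
# os.getenv(...) in the codebase must appear as a key in .env.example.
GETENV = re.compile(r"""os\.getenv\(\s*['"]([A-Z0-9_]+)['"]""")

def missing_from_example(source_text: str, env_example_text: str) -> set[str]:
    referenced = set(GETENV.findall(source_text))
    documented = {
        line.split("=", 1)[0].strip()
        for line in env_example_text.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }
    return referenced - documented
```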

**CI secrets register (F3 — §59):** GitHub Actions secrets are audited quarterly. The following table is the authoritative register — any secret not in this table must not exist in the repository settings.

| Secret name | Environment | Owner | Rotation schedule | What breaks if leaked |
|-------------|-------------|-------|-------------------|-----------------------|
| `GITHUB_TOKEN` | All | GitHub-managed (OIDC) | Per-job (automatic) | GHCR push access |
| `DEPLOY_CHECK_SECRET` | Staging, Production | Engineering lead | 90 days | CI can skip alert gate |
| `STAGING_SSH_KEY` | Staging | Engineering lead | 180 days | Staging server access |
| `PRODUCTION_SSH_KEY` | Production | Engineering lead + 1 | 90 days | Production server access |
| `SPACETRACK_USERNAME_STAGING` | Staging | DevOps | On offboarding | Space-Track ingest |
| `SPACETRACK_PASSWORD_STAGING` | Staging | DevOps | 90 days | Space-Track ingest |
| `SENTRY_DSN` | Staging, Production | DevOps | On rotation | Error reporting only |
| `PAGERDUTY_ROUTING_KEY` | Production | Engineering lead | On rotation | On-call alerting |

Rotation procedure: use `gh secret set <NAME> --env <ENV>` from a local machine; never paste secrets into PR descriptions or issue comments. Quarterly audit: `gh secret list --env production` output is reviewed by the engineering lead; any unrecognised secret triggers a security review.
|
||
|
||
---
|
||
|
||
### 30.4 Staging Environment Specification

Staging is a Tier 2 deployment (single-host Docker Compose) running continuously on a dedicated server or cloud VM.

**Data policy:** Staging never holds production data. On weekly reset (`make clean && make seed`), the database is wiped and synthetic fixtures are loaded. Synthetic fixtures include:

- 50 tracked objects with pre-computed TLE histories
- 5 synthetic TIP events across the test FIR set
- 3 synthetic CRITICAL alert events at various acknowledgement states
- 2 shadow mode test organisations

**Credential policy:** Staging uses a separate Space-Track account (if available) or rate-limited credentials. JWT keypairs, HMAC secrets, and MinIO keys are all distinct from production. Staging credentials are stored in GitHub Actions environment secrets, not in the production Vault.

**OWASP ZAP integration:**

```yaml
# .github/workflows/ci.yml (post-staging-deploy step)
- name: OWASP ZAP baseline scan
  uses: zaproxy/action-baseline@v0.11.0
  with:
    target: 'https://staging.spacecom.io'
    rules_file_name: '.zap/rules.tsv'
    fail_action: true
```

ZAP results are uploaded as GitHub Actions artefacts and must be reviewed before production deploy approval is granted in Phase 2+.

---
### 30.5 CI Observability

**Build duration:** Each GitHub Actions job reports duration to a summary table. A Grafana dashboard (`CI Health`) tracks p50/p95 job durations over time. Alert if any job's p95 duration increases > 2× week-over-week.

**Image size delta:** The `build-and-push` job posts a PR comment with the compressed image size delta versus the previous `main` build:

```
Backend image: 187 MB → 192 MB (+2.7%) ✅
Worker image:  203 MB → 289 MB (+42.4%) ⚠️ Investigate before merge
```

If any image grows > 20% in a single PR, CI posts a warning. If any image exceeds the tier limits below, CI fails:

| Image | Max size (compressed) |
|-------|-----------------------|
| `backend` | 300 MB |
| `worker` | 350 MB |
| `frontend` | 200 MB |
| `renderer` | 500 MB (Chromium) |
| `ingest` | 250 MB |

**Test failure rate:** GitHub Actions test reports (JUnit XML output from pytest and vitest) are stored as artefacts. A weekly CI health review checks for flaky tests (passing < 90% of the time) and schedules them for investigation.
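The growth-warning and tier-limit rules above reduce to a single check per image. A sketch (the limits come from the table above; the function name and string verdicts are assumptions, not the project's CI code):

```python
TIER_LIMITS_MB = {  # max compressed size per image, from the table above
    "backend": 300, "worker": 350, "frontend": 200,
    "renderer": 500, "ingest": 250,
}
GROWTH_WARN_PCT = 20.0  # warn if an image grows by more than 20% in one PR

def check_image_size(image: str, prev_mb: float, new_mb: float) -> str:
    """Return 'fail' if over the tier limit, 'warn' if grown > 20%, else 'ok'."""
    if new_mb > TIER_LIMITS_MB[image]:
        return "fail"  # hard CI failure
    if prev_mb > 0 and (new_mb - prev_mb) / prev_mb * 100 > GROWTH_WARN_PCT:
        return "warn"  # PR comment warning
    return "ok"
```

For the worker example above (203 MB → 289 MB, +42.4%, still under the 350 MB tier limit), the function returns `"warn"`.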
---

### 30.6 DevOps Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| CI/CD orchestration | GitHub Actions | Project is GitHub-native; OIDC → GHCR eliminates long-lived registry credentials; matrix builds supported |
| Container registry | GHCR | Co-located with source; free for this repo; `cosign` attestation support |
| Image tagging | `sha-<commit>` canonical; version alias on release tags; `latest` forbidden | `latest` is mutable; `sha` tag gives exact source traceability |
| Multi-stage builds | Builder + distroless/slim runtime for all services | 60–80% image size reduction; eliminates compiler/build tools from production attack surface |
| Hot-reload strategy | `docker-compose.override.yml` with bind-mounted source volumes | < 1s reload vs. 30–90s container rebuild; override file not committed to CI |
| Local task runner | `make` | Universally available, no extra install; self-documenting targets; shell-level DX standard |
| Pre-commit stack | 6 hooks: detect-secrets + ruff + mypy + hadolint + prettier + sqlfluff | Each addresses a distinct failure mode; hooks run in CI to enforce for engineers who skip local install |
| Staging data | Synthetic fixtures only; weekly reset | Production data in staging creates GDPR complexity; synthetic data is sufficient for integration testing |
| Secrets rotation | Zero-downtime per-secret runbook; HMAC rotation requires batch re-sign migration | Aviation context: rotation cannot cause service interruption; HMAC is special-cased due to signed-data dependency |
| HMAC key rotation | Requires batch re-sign of all existing predictions; engineering lead approval required | All existing HMAC signatures become invalid on key change; silent re-sign is safer than mass verification failures |

---
### 30.7 GitHub Actions CI Workflow Specification (F1, F5, F8, F10 — §59)

The CI pipeline must enforce a strict job dependency graph. Jobs that do not declare `needs:` run in parallel by default — this is incorrect for a safety-critical pipeline, where a failed test must prevent a build reaching production.

**Canonical job dependency graph:**

```
lint ──┬── test-backend ──┬── security-scan ──── build-and-push ──── deploy-staging ──── deploy-production
       └── test-frontend ─┘                                              ↑ (auto)            ↑ (manual gate)
```

**`.github/workflows/ci.yml` (abbreviated — full spec below):**
```yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pre-commit
          key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
      - run: pip install pre-commit
      - run: pre-commit run --all-files   # F6 §59: enforce hooks in CI

  test-backend:
    needs: [lint]
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: actions/cache@v4   # F10 §59: pip wheel cache
        with:
          path: ~/.cache/pip
          key: pip-${{ hashFiles('backend/requirements.txt') }}
      - run: pip install -r backend/requirements.txt
      - run: pytest -m safety_critical --tb=short -q   # fast safety gate first
      - run: pytest --cov=backend --cov-fail-under=80

  test-frontend:
    needs: [lint]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - uses: actions/cache@v4   # F10 §59: npm cache
        with:
          path: ~/.npm
          key: npm-${{ hashFiles('frontend/package-lock.json') }}
      - run: npm ci --prefix frontend
      - run: npm run test --prefix frontend

  migration-gate:   # F11 §59: migration reversibility + timing gate
    needs: [lint]
    # Note: `github.event.commits[*].modified` is not valid expression syntax;
    # use an object filter joined into a single searchable string.
    if: contains(join(github.event.commits.*.modified, ','), 'migrations/')
    runs-on: ubuntu-latest
    services:
      db:
        image: timescale/timescaledb:2.14-pg17
        env: { POSTGRES_PASSWORD: test }
        options: --health-cmd pg_isready
    steps:
      - uses: actions/checkout@v4
      - run: pip install alembic psycopg2-binary
      - name: Forward migration (timed)
        run: |
          START=$(date +%s)
          alembic upgrade head
          END=$(date +%s)
          ELAPSED=$((END - START))
          echo "Migration took ${ELAPSED}s"
          if [ "$ELAPSED" -gt 30 ]; then
            echo "::error::Migration took ${ELAPSED}s > 30s budget — requires review"
            exit 1
          fi
      - name: Reverse migration (reversibility check)
        run: alembic downgrade -1
      - name: Model/migration sync check
        run: alembic check

  security-scan:
    needs: [test-backend, test-frontend, migration-gate]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install bandit && bandit -r backend/app -ll
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm audit --prefix frontend --audit-level=high
      # Filesystem scan: images are tagged by SHA only (no `latest` tag exists
      # per §30.6), and the SHA-tagged image is not built until the next job.
      - name: Trivy filesystem scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          severity: CRITICAL,HIGH
          exit-code: '1'

  build-and-push:
    needs: [security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions: { contents: read, packages: write, id-token: write }
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}   # job-scoped token — no long-lived credential
      - name: Build and push (with layer cache)   # F10 §59
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache
          cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME }}/backend:buildcache,mode=max
      - uses: sigstore/cosign-installer@v3
      - name: Sign image with cosign (F5 §59)
        run: |
          cosign sign --yes \
            ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
      - name: Generate SBOM and attach (F5 §59)
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/${{ env.IMAGE_NAME }}/backend:sha-${{ github.sha }}
          upload-artifact: true

  deploy-staging:
    needs: [build-and-push]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://staging.spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — deploy blocked. Override with workflow_dispatch."
            exit 1
          fi
      - name: SSH deploy to staging
        run: |
          ssh deploy@staging.spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production   # GitHub protected environment with required reviewers — manual gate
    steps:
      - uses: actions/checkout@v4
      - name: Check no active CRITICAL alert (F8 §59)
        run: |
          STATUS=$(curl -sf -H "Authorization: Bearer ${{ secrets.DEPLOY_CHECK_SECRET }}" \
            https://spacecom.io/api/v1/readyz | jq -r '.alert_gate')
          if [ "$STATUS" != "clear" ]; then
            echo "::error::Active CRITICAL/HIGH alert — production deploy blocked."
            exit 1
          fi
      - name: SSH deploy to production
        run: |
          ssh deploy@spacecom.io \
            "bash /opt/spacecom/scripts/blue-green-deploy.sh sha-${{ github.sha }}"
```
**`/api/v1/readyz` alert gate field (F8 — §59):** The existing `GET /readyz` response is extended with an `alert_gate` field:

```python
# Returns "clear" | "blocked"
alert_gate = "blocked" if db.query(AlertEvent).filter(
    AlertEvent.level.in_(["CRITICAL", "HIGH"]),
    AlertEvent.acknowledged_at.is_(None),  # SQLAlchemy idiom for IS NULL
    AlertEvent.organisation_id != INTERNAL_ORG_ID,  # internal test alerts don't block deploys
).count() > 0 else "clear"
```

Emergency deploy override: use `workflow_dispatch` with input `override_alert_gate: true` — requires two approvals in the GitHub `production` environment. All overrides are logged to `security_logs` with `event_type = DEPLOY_ALERT_GATE_OVERRIDE`.
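The gate decision is a pure predicate over open alerts, which makes it unit-testable without a database. A sketch under that framing (`AlertRow`, its field names, and the internal-org sentinel are illustrative, not the production ORM model):

```python
from dataclasses import dataclass

INTERNAL_ORG_ID = 0  # illustrative sentinel for the internal test organisation

@dataclass
class AlertRow:
    level: str
    acknowledged: bool
    organisation_id: int

def alert_gate(alerts: list[AlertRow]) -> str:
    """'blocked' if any unacknowledged CRITICAL/HIGH alert exists outside the internal org."""
    blocking = any(
        a.level in ("CRITICAL", "HIGH")
        and not a.acknowledged
        and a.organisation_id != INTERNAL_ORG_ID
        for a in alerts
    )
    return "blocked" if blocking else "clear"
```

The database query in the snippet above is the SQL translation of this predicate; keeping the rule in one testable function avoids the two drifting apart.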
---

### 30.8 Configuration Management of Safety-Critical Artefacts (F7 — §61)

EUROCAE ED-153 / DO-278A §10 requires that safety-critical software and its associated artefacts are placed under configuration management. This extends beyond the code itself to include requirements, test cases, design documents, and safety evidence.

**Policy document:** `docs/safety/CM_POLICY.md`

**Artefacts under CM:**

| Artefact | Location | CM Control |
|----------|----------|------------|
| SAL-2 source files (`physics/`, `alerts/`, `integrity/`, `czml/`) | Git `main` branch | Signed commits required; CODEOWNERS enforcement; no direct push to `main` |
| Hazard Log | `docs/safety/HAZARD_LOG.md` | Git-tracked; changes require safety case custodian sign-off (CODEOWNERS rule) |
| Safety Case | `docs/safety/SAFETY_CASE.md` | Git-tracked; changes require safety case custodian sign-off |
| SAL Assignment | `docs/safety/SAL_ASSIGNMENT.md` | Git-tracked; changes require safety case custodian sign-off |
| Means of Compliance | `docs/safety/MEANS_OF_COMPLIANCE.md` | Git-tracked; changes require safety case custodian sign-off |
| Verification Independence Policy | `docs/safety/VERIFICATION_INDEPENDENCE.md` | Git-tracked |
| Test plan (safety-critical markers) | `docs/TEST_PLAN.md` | Git-tracked; `safety_critical` marker additions/removals reviewed in PR |
| Reference validation data | `docs/validation/reference-data/` | Git-tracked; immutable once committed (SHA verified in CI) |
| Accuracy Characterisation | `docs/validation/ACCURACY_CHARACTERISATION.md` | Git-tracked; Phase 3 deliverable |
| ANSP SMS Guide | `docs/safety/ANSP_SMS_GUIDE.md` | Git-tracked |
| Release artefacts (SBOM, Trivy report, cosign signature) | GHCR + MinIO safety archive | Tagged per release; 7-year retention |

**Release tagging for safety artefacts:**

Every production release (`vMAJOR.MINOR.PATCH`) creates a Git tag that captures:

```bash
# scripts/tag-safety-release.sh
VERSION=$1
git tag -a "$VERSION" -m "Release $VERSION — safety artefacts frozen at this tag"
# Attach safety snapshot to the release
gh release create "$VERSION" \
  docs/safety/SAFETY_CASE.md \
  docs/safety/HAZARD_LOG.md \
  docs/safety/SAL_ASSIGNMENT.md \
  docs/safety/MEANS_OF_COMPLIANCE.md \
  --title "SpaceCom $VERSION" \
  --notes "Safety artefacts attached. See CHANGELOG.md for changes."
```

**Signed commits for SAL-2 paths:** `backend/app/physics/`, `backend/app/alerts/`, `backend/app/integrity/`, `backend/app/czml/` require GPG-signed commits. Branch protection rule: `require_signed_commits: true` on `main`. This provides non-repudiation for safety-critical code changes.

**CODEOWNERS additions:**

```
# .github/CODEOWNERS
# Safety artefacts — require safety case custodian review
/docs/safety/     @safety-custodian
/docs/validation/ @safety-custodian
```

**Configuration baseline:** At each ANSP deployment, a configuration baseline is recorded in `legal/ANSP_DEPLOYMENT_REGISTER.md`:

- SpaceCom version deployed (Git tag)
- Commit SHA
- SBOM hash
- Safety case version
- SAL assignment version
- Deployment jurisdiction and date

This baseline is the reference for any subsequent regulatory audit or safety occurrence investigation.

---
## 31. Interoperability / Systems Integration

### 31.1 External Data Source Contracts

For each inbound data source, the integration contract must be explicit. Implicit assumptions about format are the most common source of silent ingest failures.

#### 31.1.1 Space-Track.org

**Endpoints consumed:**

| Data | Endpoint | Format | Baseline interval | Active TIP interval |
|------|----------|--------|-------------------|---------------------|
| TLE catalog | `/basicspacedata/query/class/gp/DECAY_DATE/null-val/orderby/NORAD_CAT_ID asc/format/json` | JSON array | Every 6h | Every 6h (unchanged) |
| CDMs | `/basicspacedata/query/class/cdm_public/format/json` | JSON array | Every 2h | Every 30min |
| TIP messages | `/basicspacedata/query/class/tip/format/json` | JSON array | Every 30min | Every 5min |
| Object catalog | `/basicspacedata/query/class/satcat/format/json` | JSON array | Daily | Daily |

**Adaptive polling:** When `spacecom_active_tip_events > 0` (any object with predicted re-entry within 6 hours), the Celery Beat schedule dynamically switches TIP polling to 5-minute intervals and CDM polling to 30-minute intervals. This is implemented via redbeat schedule overrides, not by running additional tasks — the existing Beat entry's `run_every` is updated in Redis. When all TIP events clear, intervals revert to baseline.
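The interval selection above can be kept as a pure, testable function, with the redbeat override as a thin layer on top. A sketch (the interval values come from the table above; the redbeat entry key and the surrounding call shape are assumptions):

```python
# Baseline vs. active-TIP polling intervals, in seconds (from the table above).
BASELINE = {"tip": 30 * 60, "cdm": 2 * 60 * 60, "tle": 6 * 60 * 60}
ACTIVE_TIP = {"tip": 5 * 60, "cdm": 30 * 60, "tle": 6 * 60 * 60}  # TLE unchanged

def target_intervals(active_tip_events: int) -> dict[str, int]:
    """Select the Beat run_every values for the current TIP state."""
    return ACTIVE_TIP if active_tip_events > 0 else BASELINE

# Applying the selection via redbeat (sketch only; entry name is an assumption):
#   from celery.schedules import schedule
#   from redbeat import RedBeatSchedulerEntry
#   entry = RedBeatSchedulerEntry.from_key("redbeat:ingest-tip", app=celery_app)
#   entry.schedule = schedule(run_every=target_intervals(n_active)["tip"])
#   entry.save()   # persists the override in Redis; Beat picks it up on its next tick
```

Reverting to baseline is the same call with `active_tip_events == 0`, so there is no separate "restore" code path to drift.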
**Space-Track request budget (600 requests/day):**

Space-Track enforces a 600 requests/day limit per account. Budget must be tracked and protected:

```python
# ingest/budget.py
DAILY_REQUEST_BUDGET = 600
BUDGET_ALERT_THRESHOLD = 0.80  # alert at 80% consumed

class SpaceTrackBudget:
    """Redis counter tracking daily Space-Track API requests. Resets at midnight UTC."""

    def __init__(self, redis_client):
        self._redis = redis_client

    @property
    def _key(self) -> str:
        # Computed per call so a long-lived instance rolls over at midnight UTC
        return f"spacetrack:budget:{date.today().isoformat()}"

    def consume(self, n: int = 1) -> bool:
        """Deduct n requests. Warns past the alert threshold; raises once exhausted."""
        current = self._redis.incrby(self._key, n)
        self._redis.expireat(self._key, self._next_midnight())
        if current > DAILY_REQUEST_BUDGET:
            raise SpaceTrackBudgetExhausted(f"Daily budget exhausted ({current}/{DAILY_REQUEST_BUDGET})")
        if current / DAILY_REQUEST_BUDGET >= BUDGET_ALERT_THRESHOLD:
            structlog.get_logger().warning(
                "spacetrack_budget_warning",
                consumed=current, budget=DAILY_REQUEST_BUDGET,
            )
        return True

    def remaining(self) -> int:
        return max(0, DAILY_REQUEST_BUDGET - int(self._redis.get(self._key) or 0))
```

Prometheus gauge: `spacecom_spacetrack_budget_remaining` — alert at < 100 remaining requests.
**Exponential backoff and circuit breaker:**

```python
# ingest/tasks.py
@app.task(
    bind=True,
    autoretry_for=(SpaceTrackError, httpx.TimeoutException, httpx.ConnectError),
    retry_backoff=True,      # 1s, 2s, 4s, 8s, 16s ... (doubling per retry)
    retry_backoff_max=3600,  # cap at 1 hour
    retry_jitter=True,       # randomised jitter per retry
    max_retries=5,           # task → DLQ on 6th failure
    acks_late=True,
)
def ingest_tle_catalog(self):
    if not circuit_breaker.is_closed("spacetrack"):
        raise SpaceTrackCircuitOpen("Circuit open — Space-Track unreachable")
    try:
        budget.consume(1)
        result = spacetrack_client.fetch_tle_catalog()
        circuit_breaker.record_success("spacetrack")
        return result
    except (SpaceTrackError, httpx.TimeoutException) as exc:
        circuit_breaker.record_failure("spacetrack")
        raise self.retry(exc=exc)
    ```

Circuit breaker config: open after 3 consecutive failures; half-open after 30 minutes; close after 1 successful probe. Implemented via `pybreaker` or equivalent. State stored in Redis for cross-worker visibility.
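The open/half-open/close policy above can be sketched as a small state machine over an injected key-value store (a Redis hash in production for cross-worker visibility, a plain dict here for self-containment). This is an illustration of the stated policy, not the project's implementation; in practice `pybreaker` or equivalent would replace it:

```python
import time

OPEN_AFTER_FAILURES = 3      # open after 3 consecutive failures
HALF_OPEN_AFTER_S = 30 * 60  # allow a probe after a 30-minute cool-down

class SimpleBreaker:
    """Per-source circuit breaker over a shared key-value store."""

    def __init__(self, store: dict, clock=time.monotonic):
        self._store = store  # e.g. a Redis hash in production
        self._clock = clock  # injectable for testing

    def is_closed(self, source: str) -> bool:
        failures = self._store.get(f"{source}:failures", 0)
        if failures < OPEN_AFTER_FAILURES:
            return True  # closed: normal operation
        opened_at = self._store.get(f"{source}:opened_at", 0.0)
        # Half-open: permit a probe once the cool-down has elapsed
        return self._clock() - opened_at >= HALF_OPEN_AFTER_S

    def record_failure(self, source: str) -> None:
        failures = self._store.get(f"{source}:failures", 0) + 1
        self._store[f"{source}:failures"] = failures
        if failures == OPEN_AFTER_FAILURES:
            self._store[f"{source}:opened_at"] = self._clock()

    def record_success(self, source: str) -> None:
        # One successful probe closes the circuit again
        self._store[f"{source}:failures"] = 0
        self._store.pop(f"{source}:opened_at", None)
```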
**Session expiry handling:**

Space-Track uses cookie-based sessions that expire after ~2 hours of inactivity. A 6-hour TLE poll interval guarantees session expiry between polls. The `spacetrack` client must be configured to re-authenticate transparently on 401/403:

```python
# ingest/spacetrack.py
class SpaceTrackClient:
    def __init__(self):
        self._session_valid_until: datetime | None = None
        self._SESSION_TTL = timedelta(hours=1, minutes=45)  # conservative re-auth before expiry

    async def _ensure_authenticated(self):
        now = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated since Python 3.12
        if self._session_valid_until is None or now >= self._session_valid_until:
            await self._authenticate()
            self._session_valid_until = now + self._SESSION_TTL
            spacecom_ingest_session_reauth_total.labels(source="spacetrack").inc()

    async def fetch_tle_catalog(self):
        await self._ensure_authenticated()
        # ... fetch logic
```

Metric `spacecom_ingest_session_reauth_total{source="spacetrack"}` distinguishes routine re-auth from genuine authentication failures. An alert fires if `reauth_total` increments more than once per hour (indicates session instability, not normal expiry).
**Contract test (asserts on every CI run against a live Space-Track response):**

```python
def test_spacetrack_tle_schema(spacetrack_client):
    response = spacetrack_client.query("gp", limit=1)
    required_keys = {"NORAD_CAT_ID", "TLE_LINE1", "TLE_LINE2", "EPOCH", "BSTAR", "OBJECT_NAME"}
    assert required_keys.issubset(response[0].keys()), f"Missing keys: {required_keys - response[0].keys()}"
```

**Failure alerting:** `spacecom_ingest_success_total{source="spacetrack"}` counter. AlertManager rules:

- Baseline: if the counter does not increment for 4 consecutive hours during expected polling windows → CRITICAL `INGEST_SOURCE_FAILURE` alert.
- Active TIP window: if `spacecom_ingest_success_total{source="spacetrack", type="tip"}` does not increment for > 10 minutes when `spacecom_active_tip_events > 0` → immediate L1 page (bypasses standard 4h threshold).
#### 31.1.2 NOAA SWPC Space Weather

All endpoints are hardcoded constants in `ingest/sources.py`. Format is JSON for all P1 endpoints.

```python
# ingest/sources.py
NOAA_F107_URL = "https://services.swpc.noaa.gov/json/f107_cm_flux.json"
NOAA_KP_URL = "https://services.swpc.noaa.gov/json/planetary_k_index_1m.json"
NOAA_DST_URL = "https://services.swpc.noaa.gov/json/geomag/dst/index.json"
NOAA_FORECAST_URL = "https://services.swpc.noaa.gov/products/3-day-geomag-forecast.json"
ESA_SWS_KP_URL = "https://swe.ssa.esa.int/web/guest/current-space-weather-conditions"
```

**Nowcast vs. forecast distinction:** NRLMSISE-00 decay predictions spanning hours to days require different F10.7/Ap inputs depending on the prediction horizon. These must be stored separately and selected by the decay predictor at query time:

```sql
-- space_weather table: forecast_horizon_hours column required.
-- Nullable: NULL is the documented sentinel for the 81-day average row,
-- so the column must not be declared NOT NULL.
ALTER TABLE space_weather ADD COLUMN forecast_horizon_hours INTEGER DEFAULT 0;
-- 0 = nowcast (observed); 24/48/72 = NOAA 3-day forecast horizon; NULL = 81-day average
COMMENT ON COLUMN space_weather.forecast_horizon_hours IS
  '0=nowcast; 24/48/72=NOAA 3-day forecast; NULL=81-day F10.7 average for long-horizon use';
```

**Decay predictor input selection rule** (documented in model card and `decay.py`):

| Prediction horizon | F10.7 source | Ap source |
|-------------------|--------------|-----------|
| t < 6h | Nowcast (`horizon=0`) | Nowcast (`horizon=0`) |
| 6h ≤ t < 72h | NOAA 3-day forecast (`horizon=24/48/72`) | NOAA 3-day forecast |
| t ≥ 72h | 81-day F10.7 average (`horizon=NULL`) | Storm-aware climatological Ap |

Beyond 72h: the NOAA forecast expires. The model uses the 81-day F10.7 average (a standard NRLMSISE-00 input) and the long-range uncertainty is reflected in wider Monte Carlo corridor bounds. This is documented in the model card under "Space Weather Input Uncertainty Beyond 72h".
**ESA SWS Kp cross-validation decision rule:** ESA SWS Kp is a cross-validation source, not a fallback. A decision rule is required when NOAA and ESA values diverge — without one, the cross-validation is observational only:

```python
# ingest/space_weather.py
NOAA_ESA_KP_DIVERGENCE_THRESHOLD = 2.0  # Kp units; ADR-0018

def arbitrate_kp(noaa_kp: float, esa_kp: float) -> float:
    """Select Kp value for NRLMSISE-00 input. Conservative-high on divergence."""
    divergence = abs(noaa_kp - esa_kp)
    if divergence > NOAA_ESA_KP_DIVERGENCE_THRESHOLD:
        structlog.get_logger().warning(
            "kp_source_divergence",
            noaa_kp=noaa_kp, esa_kp=esa_kp, divergence=divergence,
        )
        spacecom_kp_divergence_events_total.inc()
        # Conservative: higher Kp → denser atmosphere → shorter predicted lifetime → earlier alerting
        return max(noaa_kp, esa_kp)
    return noaa_kp  # NOAA is primary source
```

The threshold (2.0 Kp) and the conservative-high selection policy are documented in `docs/adr/0018-kp-source-arbitration.md` and reviewed by the physics lead. The `spacecom_kp_divergence_events_total` counter is monitored; a sustained rate of divergence warrants investigation of source calibration.
**Schema contract test (CI):**

```python
def test_noaa_kp_schema(noaa_client):
    response = noaa_client.get_kp()
    assert isinstance(response, list) and len(response) > 0
    assert {"time_tag", "kp_index"}.issubset(response[0].keys())

def test_space_weather_forecast_horizon_stored(db_session):
    """Verify nowcast and forecast rows are stored with distinct horizon values."""
    nowcast = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=0).first()
    forecast_72 = db_session.query(SpaceWeather).filter_by(forecast_horizon_hours=72).first()
    assert nowcast is not None, "Nowcast row missing"
    assert forecast_72 is not None, "72h forecast row missing"
```
#### 31.1.3 FIR Boundary Data

**Source:** EUROCONTROL AIRAC dataset (primary for ECAC states); FAA Digital-Terminal Procedures Publication (US); OpenAIP (fallback for non-AIRAC regions).

**Format:** GeoJSON `FeatureCollection` with `properties.icao_id` (FIR ICAO designator) and `properties.name`.

**Update procedure (runs on each 28-day AIRAC cycle):**

1. Download new AIRAC dataset from EUROCONTROL (subscription required; credentials in secrets manager)
2. Convert to GeoJSON via `ingest/fir_loader.py`
3. Compare new boundaries against current `airspace` table; log added/removed/changed FIRs to `security_logs` type `AIRSPACE_UPDATE`
4. Stage new boundaries in `airspace_staging` table; run intersection regression test against 10 known prediction corridors
5. If regression passes: swap `airspace` and `airspace_staging` in a single transaction
6. Record update in `airspace_metadata` table: `airac_cycle`, `record_count`, `updated_at`, `updated_by`
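Step 5's atomic swap can be sketched as a chain of renames inside one transaction, so readers never observe a missing `airspace` table. Shown here against `sqlite3` for self-containment; production would issue the same `ALTER TABLE ... RENAME` statements over the PostgreSQL connection. The helper name is an assumption:

```python
import sqlite3

def swap_airspace_tables(conn: sqlite3.Connection) -> None:
    """Atomically promote airspace_staging: all renames commit together or not at all."""
    with conn:  # one transaction; rolled back as a unit on any exception
        conn.execute("ALTER TABLE airspace RENAME TO airspace_old")
        conn.execute("ALTER TABLE airspace_staging RENAME TO airspace")
        # Keep the outgoing boundaries around as the next staging target
        conn.execute("ALTER TABLE airspace_old RENAME TO airspace_staging")
```

The three-way rename also preserves the previous cycle's boundaries, which makes a same-transaction rollback trivial if the post-swap smoke check fails.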
**`airspace_metadata` table:**

```sql
CREATE TABLE airspace_metadata (
    id SERIAL PRIMARY KEY,
    airac_cycle TEXT NOT NULL,     -- e.g. "2026-03"
    effective_date DATE NOT NULL,
    expiry_date DATE NOT NULL,     -- effective_date + 28 days; used for staleness detection
    record_count INTEGER NOT NULL,
    source TEXT NOT NULL,          -- 'eurocontrol' | 'faa' | 'openaip'
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    updated_by TEXT NOT NULL
);
```
**AIRAC staleness detection:** The AIRAC update procedure is manual — there is no automated mechanism to trigger it. Without monitoring, a missed cycle goes undetected for up to 28 days.

Required additions:

1. **Prometheus gauge:** `spacecom_airspace_airac_age_days` = `EXTRACT(EPOCH FROM NOW() - MAX(effective_date)) / 86400` from `airspace_metadata`. Alert rule:

   ```yaml
   - alert: AIRACAirspaceStale
     expr: spacecom_airspace_airac_age_days > 29
     for: 1h
     labels:
       severity: warning
     annotations:
       runbook_url: "https://spacecom.internal/docs/runbooks/fir-update.md"
       summary: "FIR boundary data is {{ $value }} days old — AIRAC cycle may be missed"
   ```

2. **`GET /readyz` integration:** `"airspace_stale"` is added to the `degraded` array when `airac_age_days > 28` (already incorporated into the §26.5 `readyz` check above).

3. **FIR update runbook** (`docs/runbooks/fir-update.md`) is a **Phase 1 deliverable** — it must exist before shadow deployment. Add to the Phase 1 DoD runbook checklist alongside `secrets-rotation-jwt.md`.
#### 31.1.4 TLE Validation Gate

Before any TLE record is written to the database, `ingest/cross_validator.py` enforces:

```python
def validate_tle(line1: str, line2: str) -> TLEValidationResult:
    errors = []
    if len(line1) != 69:
        errors.append(f"Line 1 length {len(line1)} != 69")
    if len(line2) != 69:
        errors.append(f"Line 2 length {len(line2)} != 69")
    if not _tle_checksum_valid(line1):
        errors.append("Line 1 checksum failed")
    if not _tle_checksum_valid(line2):
        errors.append("Line 2 checksum failed")
    epoch = _parse_epoch(line1[18:32])
    if epoch is None:
        errors.append("Epoch field invalid")
    # BSTAR uses assumed-decimal-point notation (e.g. " 11606-4" → 0.11606e-4),
    # so it cannot be parsed with a bare float()
    bstar = _parse_tle_decimal(line1[53:61])
    perigee_km = _perigee_km_from_line2(line2)  # derived from mean motion + eccentricity
    # Finding 10: BSTAR validation revised
    # Lower bound removed: valid high-density objects (e.g. tungsten sphere) have B* << 0.0001
    # Zero or negative B* is physically meaningless here (no drag / negative drag) → hard reject
    if bstar <= 0.0:
        errors.append(f"BSTAR {bstar} is zero or negative — physically invalid")
    elif bstar > 0.5:
        # Log a warning for any high B*, but hard-reject only the impossible
        # combination: very high drag at high altitude
        log_security_event("TLE_VALIDATION_WARNING", {
            "tle": [line1, line2], "reason": "HIGH_BSTAR", "bstar": bstar
        }, level="WARNING")
        if perigee_km > 300:
            errors.append(f"BSTAR {bstar} implausible for perigee {perigee_km:.0f} km — high drag at high altitude")
    if errors:
        log_security_event("INGEST_VALIDATION_FAILURE", {"tle": [line1, line2], "errors": errors})
        return TLEValidationResult(valid=False, errors=errors)
    return TLEValidationResult(valid=True)
```
---

### 31.2 CCSDS Format Specifications

#### 31.2.1 OEM (Orbit Ephemeris Message) — CCSDS 502.0-B-3

Emitted by `GET /space/objects/{norad_id}/ephemeris` when `Accept: application/ccsds-oem`.

**Header keyword population:**

| Keyword | Value | Source |
|---------|-------|--------|
| `CCSDS_OEM_VERS` | `3.0` | Fixed |
| `CREATION_DATE` | ISO 8601 UTC timestamp | `datetime.utcnow()` |
| `ORIGINATOR` | `SPACECOM` | Fixed |
| `OBJECT_NAME` | `objects.name` | DB |
| `OBJECT_ID` | COSPAR designator if known; `NORAD-<norad_id>` otherwise | DB |
| `CENTER_NAME` | `EARTH` | Fixed |
| `REF_FRAME` | `GCRF` | Fixed — SpaceCom frame transform output |
| `TIME_SYSTEM` | `UTC` | Fixed |
| `START_TIME` | Query `start` parameter | Request |
| `STOP_TIME` | Query `end` parameter | Request |

**Unknown fields:** Any keyword for which SpaceCom holds no data is emitted as `N/A` per CCSDS 502.0-B-3 §4.1.
#### 31.2.2 CDM (Conjunction Data Message) — CCSDS 508.0-B-1
|
||
|
||
Emitted by `GET /space/export/bulk?format=ccsds-cdm`.
|
||
|
||
**Field population table (abbreviated):**
|
||
|
||
| Field | Populated? | Source |
|
||
|-------|-----------|--------|
|
||
| `CREATION_DATE` | Yes | `datetime.utcnow()` |
|
||
| `ORIGINATOR` | Yes | `SPACECOM` |
|
||
| `TCA` | Yes | SpaceCom conjunction screener |
|
||
| `MISS_DISTANCE` | Yes | SpaceCom conjunction screener |
|
||
| `COLLISION_PROBABILITY` | Yes | SpaceCom Alfano Pc |
|
||
| `COLLISION_PROBABILITY_METHOD` | Yes | `ALFANO-2005` |
|
||
| `OBJ1/2 COVARIANCE_*` | Conditional | From Space-Track CDM if available; `N/A` for debris without covariance |
|
||
| `OBJ1/2 RECOMMENDED_OD_SPAN` | No | `N/A` — SpaceCom does not hold OD span |
|
||
| `OBJ1/2 SEDR` | No | `N/A` |
|
||
|
||
**CDM ingestion and Pc reconciliation:**
|
||
When a Space-Track CDM is ingested for an object that SpaceCom has also screened, both Pc values are stored:
|
||
- `conjunctions.pc_spacecom` — SpaceCom Alfano result
|
||
- `conjunctions.pc_spacetrack` — from ingested CDM
|
||
- `conjunctions.pc_discrepancy_flag` — set TRUE when `abs(log10(pc_spacecom/pc_spacetrack)) > 1` (order-of-magnitude difference)
|
||
|
||
The conjunction panel displays both values with their provenance labels. When `pc_discrepancy_flag = TRUE`, a `DATA_CONFIDENCE` warning callout is shown explaining possible causes (different epoch, different covariance source, different Pc method).
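The discrepancy test above can be sketched directly from its formula. The function name is illustrative; the handling of missing or zero Pc values here (treated as incomparable) is an assumption on my part, not specified by the plan.

```python
import math

# Sketch of the order-of-magnitude discrepancy check:
# flag when the two Pc values differ by more than a factor of 10.
def pc_discrepancy_flag(pc_spacecom, pc_spacetrack):
    if not pc_spacecom or not pc_spacetrack:
        return False  # cannot compare without both values (assumed behaviour)
    return abs(math.log10(pc_spacecom / pc_spacetrack)) > 1.0
```

For example, `1e-4` against `1e-6` (two orders of magnitude apart) trips the flag, while `1e-4` against `5e-5` does not.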

---

#### 31.2.3 RDM (Re-entry Data Message) — CCSDS 508.1-B-1

Emitted by `GET /reentry/predictions/{prediction_id}/export?format=ccsds-rdm`.

**Planned population rules:**

- SpaceCom populates creation metadata, object identifiers, prediction provenance, prediction epoch, and the primary predicted re-entry time range from the active prediction record.
- Where the active prediction carries `prediction_conflict = TRUE`, the export includes both the primary SpaceCom range and the conservative union range used for aviation-facing products, with explicit conflict provenance.
- Corridor, fragment-cloud, and air-risk annotations are included only when supported by the active model version and are marked with the model version identifier used to generate them.
- Unknown optional fields are emitted as `N/A` rather than silently omitted, matching the CCSDS handling already used for OEM/CDM unknowns.
- Raw upstream TIP or third-party reference messages are not overwritten; they remain separate provenance sources and are cross-referenced in the export metadata and audit trail.

---

### 31.3 WebSocket Event Reference

Full event type catalogue for `WS /ws/events`. All events share the envelope:

```json
{
  "type": "alert.new",
  "seq": 1042,
  "ts": "2026-03-17T14:23:01.123Z",
  "org_id": 7,
  "data": { ... }
}
```

**Event type specifications:**

```
alert.new
  data: {alert_id, level, norad_id, object_name, fir_ids[], predicted_reentry_utc, corridor_wkt}

alert.acknowledged
  data: {alert_id, acknowledged_by_name, note_preview (first 80 chars), acknowledged_at}

alert.superseded
  data: {old_alert_id, new_alert_id, reason}

prediction.updated
  data: {prediction_id, norad_id, p50_utc, p05_utc, p95_utc, supersedes_id (nullable), corridor_wkt}

tip.new
  data: {norad_id, object_name, tip_epoch, predicted_reentry_utc, source_label ("USSPACECOM TIP")}

ingest.status
  data: {source, status ("ok"|"failed"), record_count (nullable), next_run_at, failure_reason (nullable)}

spaceweather.change
  data: {old_status, new_status, kp, f107, recommended_buffer_hours}

resync_required
  data: {reason ("reconnect_too_stale"), last_known_seq}
```

**Reconnection protocol:**

1. Client stores the last received `seq`
2. On reconnect: upgrade with `?since_seq=<last_seq>`
3. Server delivers all events with `seq > last_seq` from a 5-minute / 200-event ring buffer
4. If the gap is too large: server sends `{"type": "resync_required"}`; client must call REST endpoints to re-fetch current state before resuming WebSocket consumption
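The server-side replay decision in steps 3 and 4 can be sketched as a bounded buffer. Class and method names are illustrative, and the 5-minute time bound is omitted here for brevity (only the 200-event cap is shown).

```python
from collections import deque

# Sketch of the ring-buffer replay decision for ?since_seq reconnects.
class EventRingBuffer:
    def __init__(self, maxlen=200):
        self.events = deque(maxlen=maxlen)  # each event is a dict with a "seq" key

    def append(self, event):
        self.events.append(event)

    def replay_since(self, since_seq):
        """Return missed events, or a resync_required message if the gap is too large."""
        if self.events and self.events[0]["seq"] > since_seq + 1:
            # Oldest buffered event is newer than the client's next expected seq:
            # events the client missed have already been evicted from the buffer.
            return {"type": "resync_required",
                    "data": {"reason": "reconnect_too_stale",
                             "last_known_seq": since_seq}}
        return [e for e in self.events if e["seq"] > since_seq]
```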

**Simulation/Replay isolation:** During SIMULATION or REPLAY mode, the client is connected to `WS /ws/simulation/{session_id}` instead of `WS /ws/events`. No LIVE events are delivered while in a simulation session.

---

### 31.4 Alert Webhook Specification

**Registration:**

```http
POST /api/v1/webhooks
Content-Type: application/json
Authorization: Bearer <admin_jwt>

{
  "url": "https://ansp-dispatch.example.com/spacecom/hook",
  "events": ["alert.new", "tip.new"],
  "secret": "webhook_shared_secret_min_32_chars"
}
```

The response includes `webhook_id`. The `secret` is stored encrypted at rest (it must remain recoverable server-side to compute delivery HMAC signatures); the plaintext is never returned by the API after registration.

**Delivery:**

```http
POST https://ansp-dispatch.example.com/spacecom/hook
Content-Type: application/json
X-SpaceCom-Signature: sha256=<HMAC-SHA256(secret, raw_body)>
X-SpaceCom-Event: alert.new
X-SpaceCom-Delivery: <uuid>

{ "type": "alert.new", "seq": 1042, ... }
```

**Receiver verification (example):**

```python
import hmac, hashlib

def verify_signature(secret: str, body: bytes, header_sig: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
```

**Retry and status lifecycle:**

| State | Condition | Action |
|-------|-----------|--------|
| `active` | Deliveries succeeding | Normal operation |
| `degraded` | 3 consecutive delivery failures | Org admin notified by email; deliveries continue |
| `disabled` | 10 consecutive delivery failures | No further deliveries; manual re-enable via `PATCH /webhooks/{id}` required |
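The lifecycle table can be sketched as a small failure-count state machine. The thresholds (3 and 10 consecutive failures) come from the table; the class and method names are illustrative, and the email notification side effect is omitted.

```python
# Sketch of the webhook delivery state transitions described above.
class WebhookState:
    def __init__(self):
        self.consecutive_failures = 0
        self.status = "active"

    def record_delivery(self, success: bool) -> str:
        if success:
            self.consecutive_failures = 0
            if self.status != "disabled":   # disabled requires manual re-enable
                self.status = "active"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= 10:
                self.status = "disabled"
            elif self.consecutive_failures >= 3:
                self.status = "degraded"    # deliveries continue; admin notified
        return self.status
```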

---

### 31.5 Interoperability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| ADS-B source | OpenSky Network REST API | Free, global, sufficient for Phase 3 route overlay; upgrade path to FAA SWIM ADS-B if coverage gaps emerge |
| CCSDS OEM reference frame | GCRF | SpaceCom frame transform pipeline output; downstream tools expect GCRF |
| CCSDS CDM unknown fields | `N/A` per CCSDS 508.0-B-1 §4.3 | Silent omission causes downstream parser failures; `N/A` is the standard sentinel |
| CDM Pc reconciliation | Both Space-Track CDM Pc and SpaceCom Pc displayed with provenance; discrepancy flag on order-of-magnitude difference | Transparency over false precision; operators need to see the discrepancy, not have SpaceCom silently override it |
| FIR update mechanism | Staging table swap + regression test on 28-day AIRAC cycle | Direct overwrite during a live TIP event would corrupt ongoing airspace intersection queries |
| WebSocket event schema | Typed envelope with `type` discriminator + monotonic `seq` | Enables typed client generation; `seq` enables reliable missed-event recovery |
| Webhook signature | HMAC-SHA256 with `sha256=` prefix (same convention as GitHub webhooks) | Operators will already know this pattern; reduces integration friction |
| SWIM integration timing | Phase 2: GeoJSON export; Phase 3: FIXM review + AMQP endpoint | Full SWIM-TI requires a EUROCONTROL B2B account and FIXM extension work — not Phase 1/2 blocking |
| API versioning | `/api/v1` base; 6-month parallel support on breaking changes; RFC 8594 headers | Space operators need stable contracts; 6-month overlap is industry standard for operational API changes |
| Space weather format | JSON REST endpoints (not legacy ASCII FTP) | ASCII FTP format is brittle; NOAA SWPC JSON API is stable and machine-readable; contract test catches format changes |

---

## 32. Ethics / Algorithmic Accountability

SpaceCom makes algorithmic predictions that inform operational airspace decisions. False negatives are catastrophic; false positives cause economic disruption and erode operator trust. This section documents the accountability framework that governs how the prediction model is specified, validated, changed, and monitored.

**Applicable frameworks:** IEEE 7001-2021 (Transparency of Autonomous Systems), NIST AI RMF (Govern/Map/Measure/Manage), ICAO Safety Management (Annex 19), ECSS-Q-ST-80C (Software Product Assurance).

---

### 32.1 Decay Predictor Model Card

The model card is a living document maintained at `docs/model-card-decay-predictor.md`. It is a required artefact for ESA Phase 2 TRL demonstrations and ANSP SMS acceptance. It must be updated whenever the model version changes.

**Required sections:**

```markdown
# Decay Predictor Model Card — SpaceCom v<X.Y.Z>

## Model summary
Numerical decay predictor using RK7(8) adaptive integrator + NRLMSISE-00 atmospheric
density model + J2–J6 geopotential + solar radiation pressure. Monte Carlo uncertainty
via 500-sample ensemble varying F10.7 (±20%), Ap, and B* (±10%).

## Validated orbital regime
- Perigee altitude: 100–600 km
- Inclination: 0–98°
- Object type: rocket bodies and payloads with RCS > 0.1 m²
- B* range: 0.0001–0.3
- Area-to-mass ratio: 0.005–0.04 m²/kg

## Known out-of-distribution inputs (ood_flag triggers)
| Parameter | OOD condition | Expected behaviour |
|-----------|--------------|-------------------|
| Area-to-mass ratio | > 0.04 m²/kg | Underestimates atmospheric drag; re-entry time predicted too late |
| data_confidence | 'unknown' | Physical properties estimated from object type defaults; wide systematic uncertainty |
| TLE count in history | < 5 TLEs in last 30 days | B* estimate unreliable; uncertainty may be significantly underestimated |
| Perigee altitude | < 100 km | Object may already be in final decay corridor; NRLMSISE-00 not calibrated below 100 km |

## Performance characterisation
(Updated from backcast validation report — see MinIO docs/backcast-validation-v<X>.pdf)

| Object category | N backcasts | p50 error (median) | p50 error (95th pct) | Corridor containment |
|----------------|-------------|-------------------|---------------------|---------------------|
| Rocket bodies, RCS > 2 m² | TBD | TBD | TBD | TBD |
| Payloads, RCS 0.5–2 m² | TBD | TBD | TBD | TBD |
| Small debris / unknown RCS | TBD (underrepresented) | TBD | TBD | TBD |

## Known systematic biases
- NRLMSISE-00 underestimates atmospheric density during geomagnetic storms at altitudes 200–350 km.
  Effect: predictions during Kp > 5 events tend to predict re-entry slightly later than observed.
  Mitigation: space weather buffer recommendation adds ≥2h beyond p95 during Elevated/Severe/Extreme conditions.
- Tumbling objects: effective drag area unknown; B* from TLEs reflects tumble-averaged drag.
  Effect: uncertainty may be systematically underestimated for highly elongated objects.
- Calibration data bias: validation events are dominated by large well-tracked objects from major launch
  programmes. Small debris and objects from less-tracked orbital regimes are underrepresented.

## Not intended for
- Objects with perigee < 100 km (already in terminal descent corridor)
- Crewed vehicles (use mission-specific tools)
- Objects undergoing active manoeuvring
- Predictions beyond 21 days (F10.7 forecast skill degrades sharply beyond 3 days)
```

---

### 32.2 Backcast Validation Requirements

**Phase 1 minimum:** ≥3 historical re-entries selected from The Aerospace Corporation observed re-entry database. Selection criteria documented.

**Phase 2 target:** ≥10 historical re-entries. The validation report (`docs/backcast-validation-v<X>.pdf`) must explicitly:

1. **Document selection criteria** — which events were chosen and why. Selection must include at least one event from each of: rocket bodies, payloads, and at least one high-area-to-mass object if available.
2. **Flag underrepresented categories** — explicitly state which object types have < 3 validation events and what the implication is for accuracy claims in those categories.
3. **State accuracy as conditional** — not "p50 accuracy is ±2h" but "for rocket bodies (N=7): median p50 error is 1.8h; for payloads (N=3): median p50 error is 3.1h; for small debris (N=0): no validation data available."
4. **Include negative results** — events where the p95 corridor did not contain the observed impact point must be included and analysed.
5. **Compare across model versions** — each new validation report must include a comparison table against the previous version's results.

The validation report is generated by `modules.feedback` and stored in the MinIO `docs/` bucket with a version tag matching the model version.
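Requirement 3 (accuracy stated conditionally per object category rather than as one unconditional figure) can be sketched as a simple aggregation. The record shape, category labels, and function name are illustrative assumptions.

```python
import statistics

# Sketch of per-category conditional accuracy: median p50 error and sample
# count per object category, never a single pooled figure.
def conditional_accuracy(outcomes):
    """outcomes: list of dicts with 'category' and 'p50_error_hours' keys."""
    by_category = {}
    for o in outcomes:
        by_category.setdefault(o["category"], []).append(o["p50_error_hours"])
    return {
        cat: {"n": len(errs), "median_p50_error_hours": statistics.median(errs)}
        for cat, errs in by_category.items()
    }
```

A category absent from the result (or with small `n`) maps directly to the "no validation data available" / underrepresented statements the report must carry.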

---

### 32.3 Out-of-Distribution Detection

At prediction creation time, `propagator/decay.py` evaluates each input object against the OOD bounds defined in `docs/ood-bounds.md` and sets `reentry_predictions.ood_flag` and `ood_reason` accordingly.

**OOD checks (initial set — update in `docs/ood-bounds.md` as the model is validated):**

```python
def check_ood(obj: ObjectParams) -> tuple[bool, list[str]]:
    reasons = []
    if obj.area_to_mass_ratio is not None and obj.area_to_mass_ratio > 0.04:
        reasons.append("high_am_ratio")
    if obj.data_confidence == "unknown":
        reasons.append("low_data_confidence")
    if obj.tle_count_last_30d is not None and obj.tle_count_last_30d < 5:
        reasons.append("sparse_tle_history")
    if obj.perigee_km is not None and obj.perigee_km < 100:
        reasons.append("sub_100km_perigee")
    if obj.bstar is not None and not (0.0001 <= obj.bstar <= 0.3):
        reasons.append("bstar_out_of_range")
    return len(reasons) > 0, reasons
```

**UI presentation when `ood_flag = TRUE`:**

```
⚠ OUT-OF-CALIBRATION-RANGE PREDICTION
──────────────────────────────────────────────────────────────
This prediction uses inputs outside the model's validated range:
  • high_am_ratio — effective drag may be underestimated
  • low_data_confidence — physical properties estimated from defaults

Timing uncertainty may be significantly larger than shown.
For operational planning, treat the p95 window as a minimum bound.

[What does this mean? →]
──────────────────────────────────────────────────────────────
```

The callout is mandatory and non-dismissable. It appears above the prediction panel wherever the prediction is displayed. It does not prevent the prediction from being used — operators retain full autonomy.

---

### 32.4 Recalibration Governance

The `modules.feedback` pipeline computes atmospheric density scaling coefficients from observed re-entry outcomes recorded in `prediction_outcomes`. Updating these coefficients changes all future predictions.

**Recalibration procedure:**

1. **Trigger:** An automated check in the feedback pipeline flags when the last 10 outcomes show a systematic bias (median p50 error > 1.5× the historical baseline).
2. **Candidate coefficients:** New coefficients are computed from the full `prediction_outcomes` history using a hold-out split (80% train / 20% hold-out). The hold-out set is fixed and never used in training.
3. **Validation gate:** New coefficients must achieve:
   - > 5% improvement in median p50 error on the hold-out set
   - No regression (> 10% worsening) on any validated object type category
   - Corridor containment rate ≥ 95% on the hold-out set
4. **Sign-off:** The physics lead and engineering lead must both approve via PR review. The PR includes the validation comparison table.
5. **Active prediction handling:** Before deployment, a batch job re-runs all active predictions (status = `active`, not superseded) using the new coefficients. Each re-run creates a new prediction record linked via `superseded_by`. ANSPs with active shadow deployments receive an automated notification: *"SpaceCom model recalibrated — active predictions updated. Previous predictions superseded. New model version: X.Y.Z."*
6. **Rollback:** If a post-deployment accuracy regression is detected, the previous coefficient set is restored via the same procedure (treated as a new recalibration). The rollback is logged to `security_logs` with type `MODEL_ROLLBACK`.
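The three-part validation gate in step 3 can be sketched as a single predicate. Thresholds come from the procedure above; the input shapes and function name are illustrative assumptions.

```python
# Sketch of the recalibration validation gate: all three criteria must pass.
def passes_validation_gate(old_median_error, new_median_error,
                           per_category_deltas, containment_rate):
    """per_category_deltas: fractional change in error per category (+ = worse)."""
    improvement = (old_median_error - new_median_error) / old_median_error
    if improvement <= 0.05:                        # require > 5% improvement
        return False
    if any(delta > 0.10 for delta in per_category_deltas.values()):
        return False                               # > 10% regression in some category
    return containment_rate >= 0.95                # corridor containment on hold-out
```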

---

### 32.5 Model Version Governance

**Version classification:**

| Classification | Examples | Active prediction re-run? | ANSP notification required? |
|---------------|----------|--------------------------|----------------------------|
| **Patch** | Documentation update, logging improvement, no physics change | No | No |
| **Minor** | Performance improvement, OOD bound adjustment, new object type support | No (optional for analyst review) | Yes — changelog summary |
| **Major** | Integrator change, density model change, MC parameter change, recalibration | Yes — all active predictions superseded | Yes — written notice to all shadow deployment partners; 2-week notice before deployment |

**Version string:** Semantic version (`MAJOR.MINOR.PATCH`) embedded in every prediction record at creation time as `model_version`. The currently deployed version is exposed via `GET /api/v1/system/model-version`.

**Cross-version prediction display:** When a prediction was made with a model version that differs from the currently deployed version by a major bump, the UI shows:

```
ℹ Prediction generated with model v1.2.0 — current model is v2.0.0 (major update).
This prediction reflects older parameters. Re-run recommended for operational planning.
[Re-run with current model →]
```
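The major-bump comparison that drives this callout is a simple check on the stored `model_version` strings. The function name is illustrative.

```python
# Sketch of the cross-version display decision: show the callout when the
# deployed model is a major version ahead of the one that made the prediction.
def needs_rerun_callout(prediction_version: str, deployed_version: str) -> bool:
    pred_major = int(prediction_version.split(".")[0])
    deployed_major = int(deployed_version.split(".")[0])
    return deployed_major > pred_major
```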

---

### 32.6 Adverse Outcome Monitoring

Continuous monitoring of prediction accuracy post-deployment is a regulatory credibility requirement. It is also the primary input to the recalibration pipeline.

**Data flow:**

1. Analyst logs observed re-entry outcome via `POST /api/v1/predictions/{id}/outcome` after post-event analysis (source: The Aerospace Corporation observed re-entry database, US18SCS reports, or ESA ESOC confirmation)
2. `prediction_outcomes` record created with `p50_error_minutes`, `corridor_contains_observed`, `fir_false_positive`, `fir_false_negative`
3. Feedback pipeline runs weekly: aggregates outcomes, computes rolling accuracy metrics, flags systematic biases
4. Grafana `Model Accuracy` dashboard shows: rolling 90-day median p50 error, corridor containment rate, false positive rate (CRITICAL alerts with no confirmed hazard), false negative rate (confirmed hazard with no CRITICAL alert)
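The weekly aggregation in step 3 can be sketched directly from the `prediction_outcomes` fields named above. The function name and return shape are illustrative; the real pipeline also windows by date, which is omitted here.

```python
import statistics

# Sketch of the rolling-metrics aggregation over prediction_outcomes records.
def rolling_metrics(outcomes):
    """outcomes: list of dicts with p50_error_minutes (float) and the booleans
    corridor_contains_observed, fir_false_positive, fir_false_negative."""
    n = len(outcomes)
    if n == 0:
        return None
    return {
        "median_p50_error_minutes": statistics.median(
            o["p50_error_minutes"] for o in outcomes),
        "corridor_containment_rate": sum(
            o["corridor_contains_observed"] for o in outcomes) / n,
        "false_positive_rate": sum(o["fir_false_positive"] for o in outcomes) / n,
        "false_negative_rate": sum(o["fir_false_negative"] for o in outcomes) / n,
    }
```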

**Quarterly transparency report:** Generated automatically from `prediction_outcomes`. Contains aggregate (non-personal) data:

- Total predictions served in the quarter
- Number of outcomes recorded (and the percentage of total predictions covered)
- Median p50 error, 95th percentile error
- Corridor containment rate
- False positive rate (CRITICAL alerts with no confirmed hazard) and estimated false negative rate
- Known model limitations summary (from the model card)
- Model version(s) active during the quarter

The report is stored in the MinIO `public-reports/` bucket and made available on SpaceCom's public documentation site. The report is a Phase 3 deliverable.

---

### 32.7 Geographic Coverage Quality

FIR intersection quality varies by boundary data source. Operators in non-ECAC regions receive lower-quality airspace intersection assessments than their European counterparts. This disparity must be acknowledged, not hidden.

**Coverage quality levels:**

| Source | Coverage quality | Regions |
|--------|-----------------|---------|
| EUROCONTROL AIRAC | High | All ECAC states (Europe, Turkey, Israel, parts of North Africa) |
| FAA Digital-Terminal Procedures | High | Continental US, Alaska, Hawaii, US territories |
| OpenAIP | Medium | Global fallback; community-maintained; may lag AIRAC |
| Manual / not loaded | Low | Any region where no FIR data has been imported |

The `airspace` table has a `coverage_quality` column (`high` / `medium` / `low`). The airspace intersection API response includes `coverage_quality` per affected FIR. The UI shows a coverage quality callout on the airspace impact table when any affected FIR is `medium` or `low`:

```
ℹ FIR boundary quality: MEDIUM (OpenAIP source)
Intersection calculations for this region use community-maintained boundary data.
Verify with official AIRAC charts before operational use.
```
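The callout decision (shown when any affected FIR is below `high`, reporting the worst quality present) can be sketched as follows. Names and the ranking scheme are illustrative assumptions.

```python
# Sketch of the coverage-quality callout decision over affected FIRs.
QUALITY_RANK = {"high": 0, "medium": 1, "low": 2}

def coverage_callout(affected_firs):
    """affected_firs: list of dicts with 'fir_id' and 'coverage_quality' keys.
    Returns the worst quality below 'high', or None when no callout is needed."""
    worst = max(affected_firs,
                key=lambda f: QUALITY_RANK[f["coverage_quality"]],
                default=None)
    if worst is None or worst["coverage_quality"] == "high":
        return None
    return worst["coverage_quality"]
```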

---

### 32.8 Ethics Accountability Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| Model card | Required artefact; maintained alongside the model in `docs/` | Regulators and ANSPs need a documented operational envelope; the ESA TRL process requires it |
| Backcast accuracy statement | Conditional on object type; selection bias explicitly documented | A single unconditional figure misrepresents model generalisation to non-specialist audiences |
| OOD detection | Evaluated at prediction time; `ood_flag` + UI warning callout; prediction still served | Operators retain autonomy; the OOD flag informs rather than blocks; hiding it would create false confidence |
| Recalibration governance | Hold-out validation + dual sign-off + active prediction re-run + ANSP notification | Ungoverned recalibration is an ungoverned change to a safety-critical model |
| Alert threshold governance | Documented rationale; change requires PR review + 2-week shadow validation + ANSP notification | Threshold values are consequential algorithmic decisions; they must be as auditable as code changes |
| Prediction staleness warning | `prediction_valid_until` = `p50 - 4h`; warning independent of the system health banner | A prediction for an imminent re-entry event has growing implicit uncertainty; operators need a signal |
| Adverse outcome monitoring | `prediction_outcomes` table; weekly pipeline; quarterly public report | Without outcome data, performance claims are assertions, not evidence; the public report builds regulatory trust |
| FIR coverage disparity | `coverage_quality` column on `airspace`; disclosed per-FIR in intersection results | Hiding coverage quality differences from operators would be a form of false precision |
| False positive / negative framing | Both tracked in `prediction_outcomes`; both in the quarterly report | Optimising for only one error type can silently worsen the other; both must be visible |
| Public transparency report | Aggregate accuracy data; no personal data; quarterly cadence | Aviation safety infrastructure operates in a regulated transparency environment; SpaceCom must too |

---

## 33. Technical Writing / Documentation Engineering

### 33.1 Documentation Principles

SpaceCom documentation has three distinct audiences with different needs:

| Audience | Primary docs | Format |
|----------|-------------|--------|
| **Engineers building the system** | ADRs, inline docstrings, test plan, `AGENTS.md` | Markdown in repo |
| **Operators using the system** | User guides, API guide, in-app help | Hosted docs site / PDF |
| **Regulators and auditors** | Model card, validation reports, runbooks, CHANGELOG | Formal documents; version-controlled |

Documentation that serves the wrong audience in the wrong format fails all of them. The §12.1 `docs/` directory tree encodes this separation by subdirectory.

---

### 33.2 Architecture Decision Record (ADR) Standard

**Format:** [MADR — Markdown Architectural Decision Records](https://adr.github.io/madr/). Lightweight, git-friendly, no tooling dependency.

**File naming:** `docs/adr/NNNN-short-title.md` where `NNNN` is a zero-padded sequence number.

**Template:**

```markdown
# NNNN — <Title>

**Status:** Accepted | Superseded by [MMMM](MMMM-title.md) | Deprecated

## Context

<What is the issue or design question this decision addresses? What forces are at play?>

## Decision

<What was decided?>

## Consequences

**Positive:** <What does this decision make easier or better?>
**Negative / trade-offs:** <What does this decision make harder or require accepting?>
**Neutral:** <Other effects worth noting>

## Alternatives considered

| Alternative | Why rejected |
|-------------|-------------|
| ... | ... |
```

**Linking from code:** When a code section implements a non-obvious decision, add an inline comment: `# See docs/adr/0003-monte-carlo-chord-pattern.md`. This makes the rationale discoverable from the code, not just from the plan.

**Required initial ADR set (Phase 1):**

| ADR | Decision |
|-----|----------|
| 0001 | RS256 asymmetric JWT over HS256 |
| 0002 | Dual front-door architecture (aviation + space portals) |
| 0003 | Monte Carlo chord pattern (Celery group + chord) |
| 0004 | GEOGRAPHY vs GEOMETRY spatial column types |
| 0005 | `lazy="raise"` on all SQLAlchemy relationships |
| 0006 | TimescaleDB chunk intervals (orbits: 1 day, space_weather: 30 days) |
| 0007 | CesiumJS commercial licence requirement |
| 0008 | PgBouncer transaction-mode pooling |
| 0009 | CCSDS OEM GCRF reference frame |
| 0010 | Alert threshold rationale (6h CRITICAL, 24h HIGH) |

---

### 33.3 OpenAPI Documentation Standard

FastAPI auto-generates an OpenAPI 3.1 schema from Python type annotations. Auto-generation is necessary but not sufficient. The following requirements are enforced by CI.

**Per-endpoint requirements:**

```python
@router.get(
    "/reentry/predictions/{id}",
    summary="Get re-entry prediction by ID",
    description=(
        "Returns a single re-entry prediction with HMAC integrity verification. "
        "If the prediction's HMAC fails verification, returns 503 — do not use the data. "
        "Requires `viewer` role minimum. OOD-flagged predictions include a warning field."
    ),
    tags=["Re-entry"],
    responses={
        200: {"description": "Prediction returned; check `integrity_failed` field"},
        401: {"description": "Not authenticated"},
        403: {"description": "Insufficient role"},
        404: {"description": "Prediction not found or belongs to another organisation"},
        503: {"description": "HMAC integrity check failed — prediction data is untrusted"},
    },
)
async def get_prediction(id: int, ...):
```

**CI enforcement:** A pytest fixture iterates the FastAPI app's routes and asserts that `description` is non-empty for every route whose path starts with `/api/v1/`. CI fails with a list of non-compliant endpoints.
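The assertion logic of that check can be sketched as a plain function over (path, description) pairs, so it is testable without standing up the app. In the real pytest fixture the pairs would come from iterating the FastAPI app's routes; this standalone form is an assumption for clarity.

```python
# Sketch of the CI documentation check: collect /api/v1/ routes whose
# description is missing or blank.
def undocumented_routes(routes):
    """routes: iterable of (path, description) tuples."""
    return [
        path
        for path, description in routes
        if path.startswith("/api/v1/") and not (description or "").strip()
    ]
```

The CI test then asserts `undocumented_routes(...) == []` and prints the offending list on failure.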

**Rate limiting documentation:** Endpoints with rate limits state the limit in the `description` field: *"Rate limited: 10 requests/minute per user. Returns 429 with `Retry-After` header when exceeded."*

---

### 33.4 Runbook Standard

**Template** (`docs/runbooks/TEMPLATE.md`):

````markdown
# Runbook: <Title>

**Severity:** SEV-1 | SEV-2 | SEV-3 | SEV-4
**Owner:** <team or role>
**Last reviewed:** YYYY-MM-DD
**Estimated duration:** <X minutes>

## Trigger condition

<What condition causes this runbook to be needed? What alert or observation triggers it?>

## Preconditions

- [ ] You have SSH access to the production host
- [ ] <other preconditions>

## Steps

1. <First step — be specific; include exact commands>
2. <Second step>
   ```bash
   # exact command with expected output noted
   docker compose ps
   ```
3. ...

## Verification

<How do you confirm the runbook was successful? What does healthy state look like?>

## Rollback

<If the steps made things worse, how do you undo them?>

## Notify

- [ ] Engineering lead notified (Slack #incidents)
- [ ] On-call via PagerDuty if SEV-1/2
- [ ] ANSP partners notified if operational disruption (template: `docs/runbooks/ansp-notification-template.md`)
````

**Runbook index** (`docs/runbooks/README.md`):

| Runbook | Severity | Owner | Last reviewed |
|---------|----------|-------|--------------|
| `db-failover.md` | SEV-1 | Platform | Phase 3 |
| `celery-recovery.md` | SEV-2 | Platform | Phase 3 |
| `hmac-failure.md` | SEV-1 | Security | Phase 1 |
| `ingest-failure.md` | SEV-2 | Platform | Phase 1 |
| `gdpr-breach-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `safety-occurrence-notification.md` | SEV-1 | Legal + Engineering | Phase 2 |
| `secrets-rotation-jwt.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-spacetrack.md` | SEV-2 | Platform | Phase 2 |
| `secrets-rotation-hmac.md` | SEV-1 | Engineering Lead | Phase 2 |
| `blue-green-deploy.md` | SEV-3 | Platform | Phase 3 |
| `restore-from-backup.md` | SEV-2 | Platform | Phase 2 |

---

### 33.5 Docstring Standard

All public functions in the following modules must have Google-style docstrings:
`propagator/decay.py`, `propagator/catalog.py`, `reentry/corridor.py`, `breakup/atmospheric.py`, `conjunction/probability.py`, `integrity.py`, `frame_utils.py`, `time_utils.py`.

**Required docstring sections:** `Args` (with physical units for all dimensional quantities), `Returns`, `Raises`, and `Notes` (for numerical limitations or known edge cases).

```python
def integrate_trajectory(
    object_id: int,
    f107: float,
    bstar: float,
    params: dict,
) -> TrajectoryResult:
    """Integrate a single RK7(8) decay trajectory from current epoch to re-entry.

    Uses NRLMSISE-00 atmospheric density model with J2–J6 geopotential and
    solar radiation pressure. Terminates at 80 km altitude (configurable via
    params['termination_altitude_km']).

    Args:
        object_id: NORAD catalog number of the decaying object.
        f107: Solar flux index (10.7 cm) in solar flux units (sfu).
            Valid range: 65–300 sfu. Values outside this range are accepted
            but produce extrapolated NRLMSISE-00 results (see docs/ood-bounds.md).
        bstar: BSTAR drag term from TLE (units: 1/Earth_radius).
            Valid range: 0.0001–0.3 per docs/ood-bounds.md.
        params: Simulation parameters dict. Required keys:
            'mc_samples' (int), 'termination_altitude_km' (float, default 80.0).

    Returns:
        TrajectoryResult with fields: reentry_time (UTC datetime),
        impact_lat_deg (float), impact_lon_deg (float), final_velocity_ms (float).

    Raises:
        IntegrationDivergenceError: If the integrator step size shrinks below
            1e-6 seconds (indicates numerical instability — log and flag as OOD).
        ValueError: If object_id is not in the database.

    Notes:
        NRLMSISE-00 is calibrated for 100–600 km altitude. Below 100 km the
        density is extrapolated and uncertainty grows significantly. The OOD
        flag is set by the caller based on ood-bounds.md thresholds, not here.
    """
```

**Enforcement:** A `mypy` pre-commit hook enforces that no function signature is untyped. A separate CI check using `pydocstyle` or `ruff` with docstring rules enforces non-empty docstrings on public functions in the listed modules.

---

### 33.6 `CHANGELOG.md` Format

Follows [Keep a Changelog](https://keepachangelog.com/) conventions. Human-maintained — not auto-generated from commit messages.

```markdown
# Changelog

All notable changes to SpaceCom are documented here.
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)

## [Unreleased]

## [1.0.0] — 2026-MM-DD

### Added
- Re-entry decay predictor (RK7(8) + NRLMSISE-00 + Monte Carlo 500 samples)
- Percentile corridor visualisation (Mode A)
- Space weather widget (NOAA SWPC + ESA SWS cross-validation)
- CRITICAL/HIGH/MEDIUM/LOW alert system with two-step CRITICAL acknowledgement
- Shadow mode with per-org legal clearance gate

### Security
- JWT RS256 with httpOnly cookies; TOTP MFA enforced for all roles
- HMAC-SHA256 integrity on all prediction and hazard zone records
- Append-only `alert_events` and `security_logs` tables

## [0.1.0] — 2026-MM-DD (Phase 1 internal)
...
```

**Who maintains it:** The engineer cutting the release writes the entry. The product owner reviews it before tagging. Entries are written for operators and regulators — not for engineers.

---

### 33.7 User Documentation Plan

| Document | Audience | Phase | Format | Location |
|----------|----------|-------|--------|----------|
| Aviation Portal User Guide | Persona A/B/C | Phase 2 | Markdown → PDF | `docs/user-guides/aviation-portal-guide.md` |
| Space Portal User Guide | Persona E/F | Phase 3 | Markdown → PDF | `docs/user-guides/space-portal-guide.md` |
| Administrator Guide | Persona D | Phase 2 | Markdown | `docs/user-guides/admin-guide.md` |
| API Developer Guide | Persona E/F | Phase 2 | Markdown → hosted | `docs/api-guide/` |
| In-app contextual help | Persona A/C | Phase 3 | React component content | `frontend/src/components/shared/HelpContent.ts` |

**Aviation Portal User Guide — required sections:**
1. Dashboard overview (what you see on first login)
2. Understanding the globe display and urgency symbols
3. Reading a re-entry event: window range, corridor, risk level
4. Alert acknowledgement workflow (step-by-step with screenshots)
5. NOTAM draft workflow and mandatory disclaimer
6. Degraded mode: what the banners mean and what to do
7. Sharing views: deep links
8. Contacting SpaceCom support

**Review requirement:** The aviation portal guide must be reviewed by at least one Persona A representative (ANSP duty manager or equivalent) before first shadow deployment. Their sign-off is recorded in `docs/user-guides/review-log.md`.

---

### 33.8 API Developer Guide

Located at `docs/api-guide/`. This is the primary onboarding resource for Persona E (space operators using API keys) and Persona F (orbital analysts with programmatic access).

**Minimum content for Phase 2:**

**`authentication.md`:**
- How to create an API key (step-by-step with screenshots)
- How to attach the key to requests (`Authorization: Bearer <key>` header)
- API key scopes and which endpoints each scope can access
- How to revoke a key

**`rate-limiting.md`:**
- Per-endpoint rate limits in a table
- `429` response format and `Retry-After` header usage
- Burst vs. sustained limits

**`error-reference.md`:**

```
400 Bad Request — Invalid parameters; see `detail` field
401 Unauthorized — Missing or invalid API key
403 Forbidden — API key does not have the required scope
404 Not Found — Resource not found or not owned by your account
422 Unprocessable Entity — Request body failed schema validation
429 Too Many Requests — Rate limit exceeded; see Retry-After header
503 Service Unavailable — HMAC integrity check failed; do not use the returned data
```

**`code-examples/python-quickstart.py`:**

```python
import requests

API_BASE = "https://api.spacecom.io/api/v1"
API_KEY = "sk_live_..."  # from your API key dashboard

session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"

# Get list of tracked objects currently decaying
resp = session.get(f"{API_BASE}/objects", params={"decay_status": "decaying"})
resp.raise_for_status()
objects = resp.json()["results"]
print(f"{len(objects)} objects in active decay")

# Get OEM ephemeris for the first object
norad_id = objects[0]["norad_id"]
resp = session.get(
    f"{API_BASE}/space/objects/{norad_id}/ephemeris",
    headers={"Accept": "application/ccsds-oem"},
    params={"start": "2026-03-17T00:00:00Z", "end": "2026-03-18T00:00:00Z"}
)
resp.raise_for_status()
print(resp.text)  # CCSDS OEM format
```

---

### 33.9 `AGENTS.md` Specification

`AGENTS.md` at the project root provides guidance to AI coding agents (such as Claude Code) working in this codebase. It is a first-class documentation artefact — committed to the repo, version-controlled, and referenced in the onboarding guide.

**Required sections:**

```markdown
# SpaceCom — Agent Guidance

## Codebase overview
<3-paragraph summary of architecture, key modules, and safety context>

## Safety-critical files — extra care required
The following files have safety-critical implications. Any change must include
a test and a brief rationale comment:
- `backend/app/frame_utils.py` — frame transforms affect corridor coordinates
- `backend/app/integrity.py` — HMAC signing affects prediction integrity guarantees
- `backend/app/modules/propagator/decay.py` — physics model
- `backend/app/modules/alerts/service.py` — alert trigger logic
- `backend/migrations/` — schema changes affect immutability triggers

## Test requirements
- All backend changes must pass `make test` before committing
- Physics function changes require a new test case in the relevant test module
- Security-relevant changes require a `test_rbac.py` or `test_integrity.py` case
- Never mock the database in integration tests — use the test DB container

## Code conventions
- FastAPI endpoints must have `summary`, `description`, and `responses` (see §33.3)
- Public physics/security functions must have Google-style docstrings with units
- All new decisions should have an ADR in `docs/adr/` (see §33.2)
- New runbooks go in `docs/runbooks/` using the template at `docs/runbooks/TEMPLATE.md`

## Playwright / E2E test selector convention
- Every interactive element targeted by a Playwright test **must** have a `data-testid="<component>-<action>"` attribute
- Examples: `data-testid="alert-acknowledge-btn"`, `data-testid="notam-draft-submit"`, `data-testid="decay-predict-form"`
- Playwright tests must use `page.getByTestId(...)` or accessible role selectors (`page.getByRole(...)`) **only**
- CSS class selectors, XPath, and `page.locator('.')` are forbidden in test files
- A CI lint step (`grep -r 'page\.locator\b\|page\.\$\b' tests/e2e/`) must return empty

## What not to do
- Do not add `latest` tags to Docker image references
- Do not store secrets in `.env` files committed to git
- Do not make changes to alert thresholds without updating `docs/alert-threshold-history.md`
- Do not change `model_version` in `decay.py` without following the model version governance procedure (§32.5)
- Do not proxy the Cesium ion token server-side — it is a public browser credential by design (`NEXT_PUBLIC_CESIUM_ION_TOKEN`). Do not store it in Vault, Docker secrets, or treat it as sensitive.
- Do not add write operations (POST/PUT/DELETE API calls, Zustand mutations) to components rendered in SIMULATION or REPLAY mode without calling `useModeGuard(['LIVE'])` first and disabling the control in non-LIVE modes.
```

---

### 33.10 Test Documentation Standard

**Test pyramid and coverage gates** — enforced in CI; `make test` runs all layers:

| Layer | Scope | Minimum gate | CI enforcement |
|---|---|---|---|
| Unit | `backend/app/` excluding `migrations/`, `schemas/` | 80% line coverage | `pytest --cov=backend/app --cov-fail-under=80` |
| Integration | Every API endpoint × every applicable role | 100% of routes in `test_rbac.py` | RBAC matrix fixture enumerates all FastAPI routes via `app.routes` |
| E2E | 5 critical user journeys (see below) | All journeys pass | Playwright job in CI; blocks merge |
| Physics validation | All suites in `docs/test-plan.md` marked Blocking | 0 failures | Separate CI job; always runs before merge |

**5 critical user journeys (E2E blocking):**
1. CRITICAL alert → acknowledge → NOTAM draft saved
2. Analyst submits decay prediction → job completes → corridor visible on globe
3. Admin creates user → user logs in → MFA enrolment complete
4. Space operator registers object → views conjunction list
5. Admin enables shadow mode → shadow prediction absent from viewer response

**Module docstring requirement** for all physics and security test modules:

```python
"""
test_frame_utils.py — Frame Transformation Validation Suite

Physical invariant tested:
    TEME → GCRF → ITRF → WGS84 coordinate chain must agree with
    Vallado (2013) reference state vectors to within specified tolerances.

Reference source:
    Vallado, D.A. (2013). Fundamentals of Astrodynamics and Applications, 4th ed.
    Table 3-4 (GCRF↔ITRF) and Table 3-5 (TEME→GCRF). Reference vectors in
    docs/validation/reference-data/vallado-sgp4-cases.json.

Operational significance of failure:
    A frame transform error propagates directly into corridor polygon coordinates.
    A 1 km error at re-entry altitude produces a ground-track offset of 5–15 km.
    ALL tests in this module are BLOCKING CI failures.

How to add a new test case:
    1. Add the reference state vector to vallado-sgp4-cases.json
    2. Add a parametrised test case to TestTEMEGCRF or TestGCRFITRF
    3. Document the source in a comment on the test case
"""
```

**`docs/test-plan.md` structure:**

| Suite | Module(s) | Physical invariant / behaviour | Reference | Pass tolerance | Blocking? |
|-------|-----------|--------------------------------|-----------|---------------|-----------|
| Frame transforms | `tests/physics/test_frame_utils.py` | TEME→GCRF→ITRF→WGS84 chain accuracy | Vallado (2013) Table 3-4/3-5 | Position < 1 km | Yes |
| SGP4 propagator | `tests/physics/test_propagator/` | State vector at epoch; 7-day propagation | Vallado (2013) test set | < 1 km at epoch; < 10 km at +7d | Yes |
| Decay predictor | `tests/physics/test_decay/` | p50 re-entry time accuracy; corridor containment | Aerospace Corp database | Median error < 4h; containment ≥ 90% | Phase 2+ |
| NRLMSISE-00 density | `tests/physics/test_decay/test_nrlmsise.py` | Density agrees with reference atmosphere | Picone et al. (2002) Table 1 | < 1% at 5 reference points | Yes |
| Hypothesis invariants | `tests/physics/test_hypothesis.py` | SGP4 round-trip; p95 corridor containment; RLS tenant isolation | Internal + Vallado | See §42.3 | Yes |
| HMAC integrity | `tests/test_integrity.py` | Tampered record detected; correct error response | Internal | 503 + CRITICAL log entry | Yes |
| RBAC enforcement | `tests/test_rbac.py` | Every endpoint returns correct status for every role | Internal | 0 mismatches | Yes |
| Rate limiting | `tests/test_auth.py` | 429 at threshold; 200 after reset | Internal | Exact threshold | Yes |
| WebSocket | `tests/test_websocket.py` | Sequence replay; token expiry warning; close codes 4001/4002 | Internal spec §14 | All assertions pass | Yes |
| Contract tests | `tests/test_ingest/test_contracts.py` | Space-Track + NOAA key presence AND value ranges | Internal | 0 violations | Yes (in CI against mocks) |
| Celery lifecycle | `tests/test_jobs/test_celery_failure.py` | Timed-out job → `failed`; orphan recovery Beat task | Internal | State correct within 5 min | Yes |
| MC corridor | `tests/physics/test_mc_corridor.py` | Corridor contains ≥ 95% of p95 trajectories; polygon matches committed reference | Internal (seeded RNG seed=42) | Area delta < 5% | Phase 2+ |
| Smoke suite | `tests/smoke/` | API/WS health; auth; catalog non-empty; DB connectivity | Internal | All pass in ≤ 2 min | Yes (post-deploy) |
| E2E journeys | `tests/e2e/` (Playwright) | 5 critical user journeys; WCAG 2.1 AA axe-core scan | Internal | 0 journey failures; 0 axe violations | Yes |
| Breakup energy conservation | `tests/physics/test_breakup/` | Energy conserved through fragmentation | Internal analytic | < 1% error | Phase 2+ |

**Test database isolation strategy** — prevents test state leakage and enables parallel execution (`pytest-xdist`):

- **Unit tests and single-connection integration tests:** `db_session` fixture wraps each test in a `SAVEPOINT`/`ROLLBACK TO SAVEPOINT` transaction. No committed data persists between tests.
- **Celery integration tests** (multi-connection, multi-process): use `testcontainers-python` (`PostgresContainer`) to spin up a dedicated DB container per `pytest-xdist` worker. The container is created at session scope and torn down at session end. Each test worker sets `search_path` to its own schema (`test_worker_<worker_id>`) for additional isolation.
- **Never** use the development or production DB for tests. The `DATABASE_URL` in test config must point to `localhost:5433` (test container) or the `testcontainers` dynamic port. CI enforces this via environment variable assertion at test startup.
- `pytest.ini` configuration:

```ini
[pytest]
addopts = -x --strict-markers -p no:warnings
markers =
    quarantine: flaky tests excluded from blocking CI
    contract: external API contract tests; run against mocks in CI
    smoke: post-deploy smoke tests
```

**Flaky test policy:**

1. A test is "flaky" if it fails without a code change ≥ 2 times in any 30-day window (tracked via GitHub Actions JUnit artefact history)
2. On second flaky failure: the test is decorated with `@pytest.mark.quarantine` and moved to `tests/quarantine/`; a GitHub issue is filed automatically by the CI workflow
3. Quarantined tests are excluded from blocking CI (`pytest -m "not quarantine"`) but continue to run in a non-blocking nightly job so failures are visible
4. A test in quarantine > 14 days without a fix **must** be deleted — a never-fixed flaky test provides no safety value and actively erodes trust in CI
5. The quarantine list is reviewed at each sprint review; any test in quarantine > 30 days blocks the next sprint release gate
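
The 30-day window rule in item 1 can be expressed as a small helper. This is a sketch only; the actual tracking lives in the CI workflow's JUnit artefact history, and the function name is illustrative:

```python
from datetime import datetime, timedelta


def is_flaky(failure_times: list[datetime], window_days: int = 30, threshold: int = 2) -> bool:
    """True if >= `threshold` failures fall within any `window_days` window.

    `failure_times` are timestamps of failures that occurred without a code
    change to the test or its subject, per the flaky test policy.
    """
    times = sorted(failure_times)
    window = timedelta(days=window_days)
    for i, start in enumerate(times):
        # Count failures in [start, start + window]
        in_window = sum(1 for t in times[i:] if t - start <= window)
        if in_window >= threshold:
            return True
    return False
```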

---

### 33.11 Technical Writing Decision Log

| Decision | Chosen | Rationale |
|----------|--------|-----------|
| ADR format | MADR (Markdown) | Lightweight; git-native; no tooling; linkable from code comments |
| ADR location | `docs/adr/` in monorepo | Engineers find rationale where they work, not in a separate wiki |
| Changelog format | Keep a Changelog (human-maintained) | Commit messages are for engineers; changelogs are for operators and regulators; auto-generation produces wrong audience tone |
| Docstring style | Google-style | Most readable inline; compatible with Sphinx if API reference generation is needed; `ruff` can enforce it |
| Runbook format | Standard template with Trigger/Steps/Verification/Rollback/Notify | On-call engineers under pressure skip steps that aren't explicitly numbered; Rollback and Notify are consistently omitted without a template |
| User documentation timing | Phase 2 for aviation portal; Phase 3 for space portal | ANSP SMS acceptance requires user documentation before shadow deployment; space portal can follow |
| API guide location | `docs/api-guide/` in repo | Co-located with code; version-controlled; engineers update it when they change the API |
| `AGENTS.md` | Committed to repo root; safety-critical files explicitly listed | An undocumented AGENTS.md is ignored or followed inconsistently; explicit safety-critical file list is the highest-value content |
| Test documentation | Module docstring + `docs/test-plan.md` | ECSS-Q-ST-80C requires test specification as a separate artefact; module docstrings are the lowest-friction way to maintain it |
| OpenAPI enforcement | CI check on empty `description` fields | Developers don't write documentation voluntarily; CI enforcement is the only reliable mechanism |

---

## 34. Infrastructure Design

This section consolidates infrastructure-level specifications: TLS lifecycle, port map, reverse-proxy configuration, WAF/DDoS posture, object storage configuration, backup validation, egress control, and the HA database parameters. For Patroni parameters see §26.3; for port exposure details see §3.3; for storage tiering see §27.4; for DNS/service discovery see §27.6.

---

### 34.1 TLS Certificate Lifecycle

#### Certificate Issuance Decision Tree

```
Is the deployment internet-facing?
├── YES → Use Caddy ACME (Let's Encrypt / ZeroSSL)
│         Caddy automatically renews; no manual steps required
│         Domain must be publicly resolvable (A record pointing to Caddy host)
│
└── NO (air-gapped / on-premise with no public DNS)
    ├── Does the customer operate an internal CA?
    │   ├── YES → Request cert from customer CA; configure Caddy with cert_file + key_file
    │   │         Document CA chain in `docs/runbooks/tls-cert-lifecycle.md`
    │   └── NO → Generate internal CA with `step-ca` (Smallstep)
    │            Run step-ca as a sidecar container on the management network
    │            Issue Caddy cert from internal CA; clients import internal CA root cert
```

#### Cert Expiry Alert Thresholds

Prometheus alert rules in `monitoring/alerts/tls.yml`:

| Alert | Threshold | Severity |
|-------|-----------|----------|
| `TLSCertExpiringSoon` | < 60 days remaining | WARNING |
| `TLSCertExpiringImminent` | < 30 days remaining | HIGH |
| `TLSCertExpiryCritical` | < 7 days remaining | CRITICAL (pages on-call) |

For ACME-managed certs: Caddy renews at 30 days remaining by default; the 30-day alert should never fire in steady state. The 7-day CRITICAL alert is the backstop for ACME renewal failures.

#### Runbook Entry

`docs/runbooks/tls-cert-lifecycle.md` must cover:
1. How to verify current cert expiry (`echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates`)
2. ACME renewal troubleshooting (inspect the Caddy container logs: `docker logs caddy --tail 100`)
3. Manual certificate replacement procedure for air-gapped deployments
4. Internal CA cert distribution to client browsers / API consumers

---

### 34.2 Caddy Reverse Proxy Configuration

```caddyfile
# /etc/caddy/Caddyfile
# Production Caddyfile stub — customise domain and backend addresses

{
    email admin@your-domain.com  # ACME account email
    # For air-gapped: comment out email, add tls /path/to/cert /path/to/key
}

your-domain.com {
    # TLS — automatic ACME for internet-facing; replace with manual cert for air-gapped
    tls {
        protocols tls1.2 tls1.3  # Disable TLS 1.0 and 1.1
    }

    # Security headers
    header {
        Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
        -Server        # Strip Server header (do not expose Caddy version)
        -X-Powered-By  # Strip if present
    }

    # WebSocket proxy (backend WebSocket endpoint)
    handle /ws/* {
        reverse_proxy backend:8000 {
            header_up Host {host}
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # API and SSR routes
    handle /api/* {
        reverse_proxy backend:8000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }

    # Static assets — served with long-lived immutable cache headers (F8 — §58)
    # Next.js content-hashes all filenames under /_next/static/ — safe for max-age=1y
    handle /_next/static/* {
        header Cache-Control "public, max-age=31536000, immutable"
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Cesium workers and static resources (large; benefit most from caching)
    handle /cesium/* {
        header Cache-Control "public, max-age=604800"  # 7 days; not content-hashed
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
        }
    }

    # Frontend (Next.js) — HTML and dynamic routes (no caching)
    handle {
        header Cache-Control "no-store"  # HTML must never be cached; it would reference stale JS bundles otherwise
        reverse_proxy frontend:3000 {
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-Proto {scheme}
        }
    }
}
```

**Notes:**
- MinIO console (`9001`) and Flower (`5555`) are **not** exposed through Caddy in production. VPN/bastion access only.
- Static asset `Cache-Control: immutable` is safe only because Next.js content-hashes all filenames. HTML pages must use `no-store` to force browsers to re-fetch the latest JS bundle references after a deploy.
- HTTP (port 80) is implicitly redirected to HTTPS by Caddy when a TLS block is present.
- `max-age=63072000` = 2 years; exceeds the current HSTS preload submission minimum of 1 year.

---

### 34.3 WAF and DDoS Protection

SpaceCom's application-layer rate limiting (§7.7) is a mitigation for abusive authenticated clients, not a defence against volumetric DDoS or web application attacks. A dedicated WAF/DDoS layer is required at Tier 2+ production deployments.

**Internet-facing deployments (cloud or hosted):**
- Deploy behind **Cloudflare** (free tier minimum; Pro tier for WAF rules) or **AWS Shield Standard** + AWS WAF
- Cloudflare: enable DDoS protection, OWASP managed ruleset, Bot Fight Mode
- Configure Caddy to only accept connections from Cloudflare IP ranges (Cloudflare publishes the list; verify with `curl https://www.cloudflare.com/ips-v4`)

**Air-gapped / on-premise government deployments:**
- The customer's upstream network perimeter (firewall/IPS) provides the DDoS and WAF layer
- Document the perimeter protection requirement in the customer deployment checklist (`docs/runbooks/on-premise-deployment.md`)
- SpaceCom is not responsible for perimeter DDoS mitigation in customer-managed deployments; this is a contractual boundary that must be documented in the MSA

**On-premise licence key enforcement (F6 — §68):**

On-premise deployments run on customer infrastructure. Without a licence key mechanism, a customer could run additional instances, share the deployment, or continue operating after licence expiry.

**Licence key design:** A JWT signed with SpaceCom's RSA private key (2048-bit minimum). Claims:

```json
{
  "sub": "<org_id>",
  "org_name": "Civil Aviation Authority of Australia",
  "contract_type": "on_premise",
  "valid_from": "2026-01-01T00:00:00Z",
  "valid_until": "2027-01-01T00:00:00Z",
  "features": ["operational_mode", "multi_ansp_coordination"],
  "max_users": 50,
  "iss": "spacecom.io",
  "iat": 1735689600
}
```

**Enforcement:** At startup, `backend/app/main.py` verifies the licence JWT using SpaceCom's public key (bundled in the Docker image). If validation fails or the licence has expired, the backend starts in **licence-expired degraded mode**: read-only access to historical data; no new predictions or alerts; all write endpoints return `HTTP 402 Payment Required` with `{"error": "licence_expired", "contact": "commercial@spacecom.io"}`. An hourly Celery Beat task re-validates the licence. If it expires mid-operation, running simulations complete but no new simulations are accepted after the check fires.

**Key rotation:** A new licence JWT is issued via `scripts/generate_licence_key.py` (requires SpaceCom's private key, stored in HashiCorp Vault — never committed to the repository). The customer sets the `SPACECOM_LICENCE_KEY` environment variable; a container restart picks it up. SpaceCom's RSA public key is embedded in the Docker image at build time (`/etc/spacecom/licence_pubkey.pem`).

**CI/DAST complement:** OWASP ZAP DAST (§21 Phase 2 DoD) tests the application layer; the WAF covers infrastructure-layer attack patterns. Both are required — they cover different threat categories.

---

### 34.4 MinIO Object Storage Configuration

#### Erasure Coding (Tier 3)

4-node distributed MinIO uses EC:2 (2 data + 2 parity shards per erasure set):

```bash
# MinIO server startup command (each of the 4 nodes runs the same command)
minio server \
  http://minio-1:9000/data \
  http://minio-2:9000/data \
  http://minio-3:9000/data \
  http://minio-4:9000/data \
  --console-address ":9001"
```

EC:2 on 4 nodes means:
- Each object is split into 4 shards (2 data + 2 parity)
- Read quorum: 2 shards (tolerates 2 simultaneous node failures for reads)
- Write quorum: 3 shards (tolerates 1 simultaneous node failure for writes)
- Usable capacity: 50% of raw total
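
The quorum arithmetic above can be sketched for a single erasure set. This is a simplification of MinIO's behaviour (assuming one drive per node and the "+1 write quorum when parity equals half the set" rule), with an illustrative function name:

```python
def ec_properties(nodes: int, parity: int) -> dict:
    """Erasure-set arithmetic for a single erasure set of `nodes` drives.

    Simplified sketch: read quorum is the data-shard count; write quorum gains
    +1 when parity equals half the set (as with EC:2 on 4 nodes), so that two
    disjoint halves cannot both acknowledge a write.
    """
    data = nodes - parity
    write_quorum = data + 1 if parity == nodes // 2 else data
    return {
        "data_shards": data,
        "parity_shards": parity,
        "read_quorum": data,
        "write_quorum": write_quorum,
        "read_failure_tolerance": nodes - data,           # node losses reads survive
        "write_failure_tolerance": nodes - write_quorum,  # node losses writes survive
        "usable_fraction": data / nodes,
    }
```

For EC:2 on 4 nodes this reproduces the figures listed above: read quorum 2, write quorum 3, 50% usable capacity.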

#### ILM (Information Lifecycle Management) Policies

Configured via `mc ilm add` commands in `docs/runbooks/minio-lifecycle.md`:

| Bucket | Prefix | Transition after | Target |
|--------|--------|-----------------|--------|
| `mc-blobs` | (all) | 90 days | MinIO warm tier or S3-IA |
| `pdf-reports` | (all) | 365 days | S3 Glacier |
| `notam-drafts` | (all) | 365 days | S3 Glacier |
| `db-wal-archive` | (all) | 31 days | **Delete** (WAL older than 30 days is not needed for point-in-time recovery) |

---

### 34.5 Backup Restore Test Verification Checklist

Monthly restore test procedure (executed by the `restore_test` Celery task; results logged to `security_logs` type `RESTORE_TEST`). A human engineer must verify all six items before marking the restore test as passed:

| # | Verification item | How to verify |
|---|-------------------|---------------|
| 1 | **Row count match** | `SELECT COUNT(*) FROM reentry_predictions` on the restored DB equals the baseline count captured before backup |
| 2 | **Latest record present** | Most recent `reentry_predictions.created_at` in the restored DB is within 5 minutes of the backup timestamp |
| 3 | **HMAC spot-check** | Run `integrity.verify_prediction(id)` on 5 randomly selected prediction IDs; all must return `VALID` |
| 4 | **Append-only trigger functional** | Attempt `UPDATE reentry_predictions SET risk_level = 'LOW' WHERE id = <test_id>`; must raise an exception |
| 5 | **Hypertable chunks intact** | `SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'orbits'` matches the expected chunk count for the backup date range |
| 6 | **Foreign key integrity** | `pg_restore` completed with 0 FK constraint violations (check the restore log for `ERROR: insert or update on table ... violates foreign key constraint`) |

Restore test failures are treated as CRITICAL alerts. The restore test target DB (`db-restore-test` container) must be isolated from the production network (not attached to `db_net`).
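
Items 1 and 2 are mechanical comparisons; a sketch of how the `restore_test` task might evaluate them (the inputs would be queried from the baseline and restored databases; all names are illustrative):

```python
from datetime import datetime, timedelta


def check_restore(baseline_count: int, restored_count: int,
                  backup_time: datetime, latest_created_at: datetime) -> list[str]:
    """Return the list of failed checklist items (empty list = items 1 and 2 pass)."""
    failures = []
    if restored_count != baseline_count:  # item 1: row count match
        failures.append(f"row_count_mismatch: {restored_count} != {baseline_count}")
    if backup_time - latest_created_at > timedelta(minutes=5):  # item 2: latest record
        failures.append("latest_record_stale: newest record older than 5 min before backup")
    return failures
```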

---

### 34.6 Infrastructure Design Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|----------|--------|----------------------|-----------|
| Reverse proxy | Caddy | nginx + certbot | Caddy automatic ACME eliminates manual cert management; simpler config; native HTTP/2 and HTTP/3 |
| TLS air-gapped | Internal CA (`step-ca`) | Self-signed per-service | Internal CA allows cert chain trust; self-signed requires per-client exception management |
| WAF/DDoS | Upstream provider (Cloudflare/AWS Shield) | Application-layer rate limiting only | Volumetric DDoS bypasses application-layer; WAF covers OWASP attack patterns at network ingress |
| MinIO erasure coding | EC:2 on 4 nodes | EC:4 (higher parity) | EC:4 on 4 nodes would require 4-node write quorum; any single failure blocks writes; EC:2 balances protection and availability |
| Multi-region | Single region per jurisdiction | Active-active global cluster | Data sovereignty; compliance certification scope; Phase 1–3 customer base size doesn't justify multi-region operational complexity |
| DB connection target | PgBouncer VIP | Direct Patroni primary connection string | Application connection strings don't change during Patroni failover; stable operational target |
| Cold tier (MC blobs) | MinIO ILM warm → S3-IA | S3 Glacier | MC blobs may be replayed for Mode C visualisation; 12h Glacier restore latency is operationally unacceptable |
| Cold tier (compliance) | S3 Glacier / Deep Archive | Warm S3 | Compliance docs need 7-year retention but rare retrieval; Glacier cost is 80–90% lower than S3-IA |
| Egress filtering | Host-level UFW/nftables | Rely on Docker network isolation | Docker isolation is inter-network only; outbound internet egress must be filtered at host level |
| HSTS `max-age` | 63072000 (2 years) | 31536000 (1 year) | 2 years exceeds the HSTS preload submission minimum (1 year) and aligns with standard hardening guides |

---

## 35. Performance Engineering

This section consolidates performance specifications, load test definitions, and scalability constraints across the system. For compression policy configuration see §9.4; for the latency budget and pagination standard see §14; for the WebSocket subscriber ceiling see §14; for renderer memory limits see §3 / §27.

---

### 35.1 Load Test Specification

**Tool:** k6 (preferred) or Locust. Scripts in `tests/load/`. Scenarios must be deterministic and reproducible on a freshly seeded database.

#### Scenario: CZML Catalog (Phase 1 baseline, Phase 3 SLO gate)

```javascript
// tests/load/czml_catalog.js
import http from 'k6/http';
import { check, sleep } from 'k6';

const BASE_URL = __ENV.BASE_URL; // k6 requires absolute URLs, e.g. https://staging.example

export const options = {
  stages: [
    { duration: '2m', target: 20 },   // Ramp to 20 users
    { duration: '5m', target: 100 },  // Ramp to 100 users (SLO target)
    { duration: '5m', target: 100 },  // Sustain 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    'http_req_duration{endpoint:czml_full}': ['p(95)<2000'],  // Phase 3 SLO
    'http_req_duration{endpoint:czml_delta}': ['p(95)<500'],  // Delta must be faster
    'http_req_failed': ['rate<0.01'],  // < 1% error rate
  },
};

export default function () {
  // First load: full catalog
  const fullRes = http.get(`${BASE_URL}/czml/objects`, {
    tags: { endpoint: 'czml_full' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(fullRes, { 'full catalog 200': (r) => r.status === 200 });

  // Subsequent loads: delta
  const since = new Date(Date.now() - 60000).toISOString();
  const deltaRes = http.get(`${BASE_URL}/czml/objects?since=${since}`, {
    tags: { endpoint: 'czml_delta' },
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });
  check(deltaRes, { 'delta 200': (r) => r.status === 200 });

  sleep(5); // Think time: user views globe for ~5s before next action
}
```
|
||
|
||
#### Scenario: MC Prediction Submission
|
||
|
||
```javascript
|
||
// tests/load/mc_predict.js — tests concurrency gate
|
||
export const options = {
|
||
vus: 10, // 10 concurrent MC submissions from 5 orgs (2 per org)
|
||
duration: '3m',
|
||
thresholds: {
|
||
'http_req_duration{endpoint:mc_submit}': ['p(95)<500'],
|
||
// 429s are expected (concurrency gate) — not counted as failures
|
||
'checks': ['rate>0.95'],
|
||
},
|
||
};
|
||
```
|
||
|
||
#### Scenario: WebSocket Alert Delivery

```javascript
// tests/load/ws_alerts.js — verifies < 30s delivery under load
// Opens 100 persistent WebSocket connections; triggers 10 synthetic alerts;
// measures time from alert POST to WS delivery on all 100 clients
```

**Load test execution:**

- Phase 1: run `czml_catalog` scenario on Tier 1 dev hardware; record p95 baseline
- Phase 2: run after each major migration; confirm no regression vs Phase 1 baseline
- Phase 3: full suite (all three scenarios) on Tier 2 staging; all thresholds must pass before production deploy approval

Load test reports are committed to `docs/validation/load-test-report-phase{N}.md`.

---

### 35.2 CZML Delta Protocol

The full CZML catalog grows proportionally with object count and time-step density. The delta protocol prevents repeat full-catalog downloads after initial page load.

**Client responsibility:**

1. On page load: fetch `GET /czml/objects` (full catalog). Cache `X-CZML-Timestamp` response header as `lastSync`.
2. Every 30s (or on reconnect): fetch `GET /czml/objects?since=<lastSync>`.
3. On receipt of `X-CZML-Full-Required: true`: discard globe state and re-fetch full catalog.
4. On receipt of `HTTP 413`: the server cannot serve the full catalog (too large); contact system admin.

**Server responsibility:**

- Full response: include `X-CZML-Timestamp: <server_time_iso8601>` header.
- Delta response: include only objects with `updated_at > since`. If `since` is more than 30 minutes ago, return `X-CZML-Full-Required: true` with an empty CZML body (client must re-fetch).
- Maximum full payload: 5 MB. If estimated size exceeds limit, return `HTTP 413` with `{"error": "catalog_too_large", "use_delta": true}`.

**Prometheus metric:** `czml_delta_ratio` = delta requests / (delta + full requests). Target: > 0.95 in steady state (95% of CZML requests are delta).
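
The server-side rules above can be condensed into one decision function. This is a minimal sketch: `plan_czml_response` and its return shape are illustrative, not the actual backend interface.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_FULL_PAYLOAD_BYTES = 5 * 1024 * 1024     # 5 MB cap from the spec
FULL_REQUIRED_AFTER = timedelta(minutes=30)  # stale-cursor threshold

def plan_czml_response(since_iso: Optional[str], now: datetime, est_full_bytes: int) -> dict:
    """Decide which CZML response shape the server should return."""
    if since_iso is None:
        if est_full_bytes > MAX_FULL_PAYLOAD_BYTES:
            # Catalog over the 5 MB cap: refuse the full download outright.
            return {"status": 413, "body": {"error": "catalog_too_large", "use_delta": True}}
        return {"status": 200, "mode": "full",
                "headers": {"X-CZML-Timestamp": now.isoformat()}}
    since = datetime.fromisoformat(since_iso)
    if now - since > FULL_REQUIRED_AFTER:
        # Cursor older than 30 minutes: client must discard state and re-fetch.
        return {"status": 200, "mode": "empty",
                "headers": {"X-CZML-Full-Required": "true"}}
    # Normal delta: caller serialises only objects with updated_at > since.
    return {"status": 200, "mode": "delta",
            "headers": {"X-CZML-Timestamp": now.isoformat()}}
```

The real handler wraps this decision around the catalog query and serialisation; the branch structure is what matters here.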

---

### 35.3 Monte Carlo Concurrency Gate

Unbounded MC fan-out collapses SLOs when multiple users submit concurrent jobs. The concurrency gate is implemented as a per-organisation Redis semaphore:

```python
# worker/tasks/decay.py

import redis

from app.config import settings  # application config module (path illustrative)

REDIS = redis.Redis.from_url(settings.REDIS_URL)
MC_SEMAPHORE_TTL = 600  # seconds; covers maximum expected MC duration + margin


def acquire_mc_slot(org_id: int, org_tier: str) -> bool:
    """Returns True if slot acquired, False if at capacity. Limit derived from subscription tier (F6)."""
    from app.modules.billing.tiers import get_mc_concurrency_limit

    limit = get_mc_concurrency_limit(org_tier)
    key = f"mc_running:{org_id}"
    pipe = REDIS.pipeline()
    pipe.incr(key)
    pipe.expire(key, MC_SEMAPHORE_TTL)
    count, _ = pipe.execute()
    if count > limit:
        REDIS.decr(key)
        return False
    return True


def release_mc_slot(org_id: int) -> None:
    key = f"mc_running:{org_id}"
    current = REDIS.get(key)
    if current and int(current) > 0:
        REDIS.decr(key)
```

**API layer:**

```python
# backend/api/decay.py

@router.post("/decay/predict")
async def submit_decay(req: DecayRequest, user: User = Depends(current_user)):
    # The limit is keyed to the organisation's subscription tier, not the
    # user's role (attribute path illustrative)
    if not acquire_mc_slot(user.organisation_id, user.organisation.tier):
        raise HTTPException(
            status_code=429,
            detail="MC concurrency limit reached for your organisation",
            headers={"Retry-After": "120"},
        )
    task = run_mc_decay_prediction.delay(...)
    return {"task_id": task.id}
```

The Celery chord callback (`on_chord_done`) calls `release_mc_slot`. A TTL of 600s ensures the slot is released even if the worker crashes mid-task.

**Quota exhaustion logging (F6):** When `acquire_mc_slot` returns `False`, before returning `429`, the endpoint writes a `usage_events` row: `event_type = 'mc_quota_exhausted'`. This makes quota pressure visible to the org admin and to the SpaceCom sales team (via admin panel). The org admin's usage dashboard shows: predictions run this month, quota hits this month, and a prompt to upgrade if hits ≥ 3 in a billing period.

---

### 35.4 Query Plan Regression Gate

**CI job:** `performance-regression` (runs in staging pipeline after `make migrate`):

```python
# scripts/check_query_baselines.py
"""
Runs EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each query in
docs/query-baselines/*.sql against the migrated staging DB.
Compares execution time to the baseline JSON stored in the same directory.
Fails with exit code 1 if any query exceeds 2× the recorded baseline.
Emits a GitHub PR comment with a comparison table.
"""

BASELINE_DIR = "docs/query-baselines"
THRESHOLD_MULTIPLIER = 2.0

queries = {
    "czml_catalog_100obj": "SELECT ...",  # from czml_catalog_100obj.sql
    "fir_intersection": "SELECT ...",     # from fir_intersection.sql
    "prediction_list": "SELECT ...",      # from prediction_list_cursor.sql
}
```

Baselines are JSON files containing `{"planning_time_ms": N, "execution_time_ms": N, "recorded_at": "..."}`. Updated manually after a deliberate schema change with a PR comment explaining the expected regression.
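
The comparison step can be sketched as follows; `check_against_baseline` is an illustrative helper, not the script's actual interface.

```python
import json

THRESHOLD_MULTIPLIER = 2.0  # fail when measured time exceeds 2x the baseline

def check_against_baseline(name: str, measured_ms: float, baseline: dict) -> tuple:
    """Compare a measured EXPLAIN ANALYZE execution time to its stored baseline entry."""
    limit = baseline["execution_time_ms"] * THRESHOLD_MULTIPLIER
    ok = measured_ms <= limit
    return ok, f"{name}: {measured_ms:.1f} ms (limit {limit:.1f} ms) -> {'OK' if ok else 'REGRESSION'}"

# A baseline entry as stored under docs/query-baselines/:
baseline = json.loads('{"planning_time_ms": 0.4, "execution_time_ms": 12.0, "recorded_at": "2025-06-01"}')
ok, report = check_against_baseline("czml_catalog_100obj", 30.1, baseline)
# 30.1 ms exceeds the 24.0 ms limit, so ok is False and the CI job would exit 1
```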

---

### 35.5 Renderer Container Constraints

The `renderer` service (Playwright + Chromium) is memory-intensive during print-resolution globe captures:

```yaml
# docker-compose.yml (renderer service)
renderer:
  image: spacecom/renderer:sha-${GIT_SHA}
  mem_limit: 4g
  memswap_limit: 4g  # No swap; if OOM, container restarts cleanly
  networks: [renderer_net]
  environment:
    RENDERER_MAX_PAGES: "4"    # Maximum concurrent render jobs
    RENDERER_TIMEOUT_S: "30"   # Per-render timeout; matches §21 DoD
    RENDERER_MAX_RESOLUTION: "300dpi"
```

**Renderer Prometheus metrics:**

- `renderer_memory_usage_bytes` — current RSS of Chromium process; alert at 3.5 GB (WARN before OOM)
- `renderer_jobs_active` — concurrent in-flight renders; alert if > 3 for > 60s (capacity signal)
- `renderer_timeout_total` — count of renders killed by timeout; alert if > 0 in a 5-min window

**Maximum report constraints** (enforced in `worker/tasks/renderer.py`):

- Maximum report pages: 50
- Maximum globe snapshot resolution: 300 DPI (A4 format)
- Reports exceeding these limits are rejected at submission with `HTTP 400`
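
The submission-time enforcement can be sketched as follows; the helper name and the ValueError-to-HTTP-400 mapping are illustrative, not the actual `worker/tasks/renderer.py` interface.

```python
MAX_REPORT_PAGES = 50   # report-size cap from the constraints above
MAX_SNAPSHOT_DPI = 300  # A4 print-resolution cap

def validate_report_request(pages: int, dpi: int) -> None:
    """Reject over-limit report requests at submission time.

    The API layer maps ValueError to HTTP 400 per the constraints above.
    """
    if pages > MAX_REPORT_PAGES:
        raise ValueError(f"report has {pages} pages; maximum is {MAX_REPORT_PAGES}")
    if dpi > MAX_SNAPSHOT_DPI:
        raise ValueError(f"snapshot resolution {dpi} dpi exceeds {MAX_SNAPSHOT_DPI} dpi")
```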

**Renderer memory isolation and on-demand rationale (F8 — §65 FinOps):**

The renderer is the second-most memory-intensive service after TimescaleDB. At Tier 2 it is allocated a dedicated `c6i.xlarge` (~$140/mo) or equivalent. Unlike simulation workers, the renderer is called infrequently — typically a few times per day when a duty manager requests a PDF briefing pack.

**On-demand vs. always-on analysis:**

| Approach | Benefit | Cost/risk | Decision |
|---------|---------|-----------|---------|
| Always-on (current) | Zero latency to first render; Chromium warm | $140/mo even if 0 renders/day | **Use at Tier 1–2** — cost is predictable; latency matters for interactive report requests |
| On-demand (start on request, stop after idle) | Saves $140/mo on lightly used deployments | 15–30s Chromium cold-start per report; complicates deployment | Consider at Tier 3 with HPA scale-to-zero on `renderer_jobs_active` if customer SLA permits a 30s wait |
| Shared with simulation worker | Saves dedicated instance | Chromium OOM risk during concurrent MC + render | **Do not use** — Chromium 2–4 GB footprint during render + MC worker memory = OOM on 32 GB nodes |

**Memory isolation is non-negotiable:** The renderer container is on an **isolated Docker network** (`renderer_net`) with no direct DB access and no simulation worker co-location. This is both a security boundary (§7, §35.5) and a memory isolation boundary. A runaway Chromium process will OOM its own container and restart cleanly without affecting simulation workers or the backend API.

**Cost-saving lever (on-premise):** For on-premise deployments where the renderer runs on the same physical server as simulation workers, monitor `renderer_memory_usage_bytes` + `spacecom_simulation_worker_memory_bytes` via Grafana. Add a combined alert `renderer + workers > 80% host RAM` to detect co-location pressure before OOM.

---

### 35.6 Static Asset CDN Strategy

CesiumJS uncompressed: ~8 MB. With gzip compression: ~2.5 MB. At 100 concurrent first-time users: ~250 MB outbound in a burst.

**Internet-facing (Cloudflare):**

- All paths under `/_next/static/*` and `/static/*` are served with `Cache-Control: public, max-age=31536000, immutable` (1 year, immutable — Next.js uses content-hash filenames)
- Caddy upstream caches are bypassed for these paths (Cloudflare edge is the cache)
- CesiumJS assets: cache hit ratio target > 0.98 after warm-up

**On-premise:**

- Deploy an nginx sidecar container (`static-cache`) on `frontend_net` serving the Next.js `out/` or `.next/static/` directory directly
- Caddy routes `/_next/static/*` → `static-cache:80` (bypasses Next.js server)
- Configure in `docs/runbooks/on-premise-deployment.md`

**Bundle size monitoring (CI):**

```yaml
# .github/workflows/ci.yml (bundle-size job)
- name: Check bundle size
  run: |
    npm run build 2>&1 | grep "First Load JS"
    # Fails if main bundle > previous + 10% (threshold stored in .bundle-size-baseline)
    node scripts/check_bundle_size.js
```

Baseline stored in `.bundle-size-baseline` at repo root (plain number in bytes). Updated manually with a PR comment when a deliberate size increase is approved.
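
A sketch of the check that `scripts/check_bundle_size.js` performs, expressed in Python for illustration (the real gate is the Node script named above):

```python
GROWTH_LIMIT = 1.10  # +10% threshold from the CI gate above

def bundle_within_budget(current_bytes: int, baseline_bytes: int) -> bool:
    """True when the main bundle is within +10% of the committed baseline."""
    return current_bytes <= baseline_bytes * GROWTH_LIMIT

# Example: a 1,000,000-byte baseline allows up to 1,100,000 bytes
within = bundle_within_budget(1_050_000, 1_000_000)  # True: under budget
```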

---

### 35.7 Performance Engineering Decision Log

| Decision | Chosen | Alternative Considered | Rationale |
|----------|--------|----------------------|-----------|
| Load test tool | k6 | Locust, JMeter | k6 is script-based (TypeScript-friendly), CI-native, outputs Prometheus-compatible metrics; Locust requires a Python process; JMeter is XML-heavy |
| CZML delta | `?since=<iso8601>` server-side filter | Client-side WebSocket push of changed entities | Server-side filter is simpler and works with HTTP caching; push requires server to track per-client state |
| MC semaphore | Redis INCR/DECR with TTL | DB-level lock | Redis is already the Celery broker; DB-level lock adds latency on every MC submit; TTL prevents deadlock on worker crash |
| Pagination | Cursor `(created_at, id)` | Keyset on single column | Single-column keyset has ties at same `created_at` (batch ingest); compound key is unique and stable |
| Query regression gate | `EXPLAIN (ANALYZE, BUFFERS)` JSON baseline | pg_stat_statements | `EXPLAIN` is deterministic per run on a warm buffer; `pg_stat_statements` averages across all historic executions and requires prod traffic to populate |
| Renderer memory cap | 4 GB Docker `mem_limit` | ulimit in container | Docker `mem_limit` is enforced by the kernel cgroup; `ulimit` only applies to the shell process, not Chromium subprocesses |
| Bundle size gate | `+10%` threshold vs. stored baseline | Absolute byte limit | Percentage is proportional to current size; absolute limits become irrelevant as bundles grow or shrink |
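
The compound-key cursor from the pagination row above can be illustrated with an in-memory sketch; in production this is expressed as a SQL keyset predicate on `(created_at, id)` rather than Python.

```python
def keyset_page(rows: list, cursor, limit: int):
    """Paginate by the compound key (created_at, id).

    Ties on created_at (e.g. from batch ingest) are broken deterministically
    by id, so pages never skip or repeat rows.
    """
    ordered = sorted(rows, key=lambda r: (r["created_at"], r["id"]))
    if cursor is not None:
        # Tuple comparison implements WHERE (created_at, id) > (:c, :i)
        ordered = [r for r in ordered if (r["created_at"], r["id"]) > cursor]
    page = ordered[:limit]
    next_cursor = (page[-1]["created_at"], page[-1]["id"]) if page else None
    return page, next_cursor
```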

---

## 36. Security Architecture — Red Team / Adversarial Review

This section records the findings of an adversarial review against the §7 security architecture. Where findings were resolved by updating existing sections (§7.2, §7.3, §7.4, §7.9, §7.10, §7.11, §7.12, §7.14, §9.2), this section provides the finding rationale and cross-reference for traceability.

### 36.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---------|------------------------|----------|
| 1 | HMAC key rotation has no path through the immutability trigger | §7.9 — HMAC Key Rotation Procedure | Critical |
| 2 | Pre-signed MinIO URLs unscoped and unproxied for MC blobs | §7.10 — MinIO Bucket Policies | High |
| 3 | Celery task arguments not validated at the task layer | §7.12 — Compute Resource Governance | High |
| 4 | Playwright renderer SSRF mitigation incomplete | §7.11 — request interception allowlist | High |
| 5 | Refresh token theft: no family reuse detection | §7.3 + §9.2 `refresh_tokens` schema | High |
| 6 | Admin role elevation with no four-eyes approval | §7.2 + `pending_role_changes` table | High |
| 7 | Security events logged but no human alert matrix | §7.14 — security alerting matrix | Medium |
| 8 | Space-Track credential rotation has no ingest-gap spec | §7.14 — rotation runbook cross-reference | Medium |
| 9 | Shadow mode segregation application-layer only | §7.2 — shadow_segregation RLS policy | High |
| 10 | NOTAM draft content not sanitised — injection path | §7.4 — `sanitise_icao()` function | High |
| 11 | Supply chain posture not fully specified | §7.13 — already fully covered; no gap found | N/A |

### 36.2 Attack Paths Considered

The following attack paths were evaluated in this review:

**Insider threat paths:**

- Compromised admin account silently elevating a backdoor account → mitigated by four-eyes approval (Finding 6)
- Admin with access to the HMAC rotation script replacing legitimate predictions with forged ones → mitigated by dual sign-off + `rotated_by` audit trail (Finding 1)
- ANSP operator sharing a pre-signed report URL with an external party → mitigated by 5-minute TTL + audit log (Finding 2)

**Compromised worker paths:**

- Compromised `ingest_worker` (shares `worker_net` with Redis) writing crafted Celery task args → mitigated by task-layer validation (Finding 3)
- Compromised worker exfiltrating simulation trajectory URLs → mitigated by server-side MC blob proxy (Finding 2)

**Authentication/session paths:**

- Refresh token exfiltration + replay before legitimate client retries → mitigated by family reuse detection + full-family revocation (Finding 5)
- Compromised admin credential creating backdoor admin → mitigated by four-eyes principle (Finding 6)

**Renderer SSRF paths:**

- Bug causing renderer to navigate to a crafted URL → mitigated by Playwright request interception allowlist (Finding 4)
- Report ID injection → mitigated by integer validation + hardcoded URL construction (Finding 4)

**Data integrity paths:**

- Shadow prediction leaking into operational response via query bug → mitigated by RLS `shadow_segregation` policy (Finding 9)
- NOTAM draft XSS → Playwright PDF renderer execution → mitigated by `sanitise_icao()` + Jinja2 autoescape (Finding 10)

**Credential rotation paths:**

- HMAC key compromise: attacker forges predictions → mitigated by rotation procedure with `hmac_admin` role isolation (Finding 1)
- Space-Track credential rotation creates an undetected ingest gap → mitigated by 10-minute verification step in runbook (Finding 8)

### 36.3 Security Architecture ADRs

| ADR | Title | Decision |
|-----|-------|----------|
| `docs/adr/0007-hmac-rotation-procedure.md` | HMAC key rotation with parameterised immutability trigger | `hmac_admin` role + `SET LOCAL spacecom.hmac_rotation` flag; dual sign-off required |
| `docs/adr/0008-admin-four-eyes.md` | Admin role elevation requires four-eyes approval | `pending_role_changes` table; 30-minute token; second admin must approve |
| `docs/adr/0009-shadow-mode-rls.md` | Shadow mode segregated at RLS layer, not application layer | `shadow_segregation` RLS policy; `spacecom.include_shadow` session variable; admin-only |
| `docs/adr/0010-refresh-token-families.md` | Refresh token family reuse detection | `family_id` column; full family revocation on reuse; user email alert |
| `docs/adr/0011-mc-blob-proxy.md` | MC trajectory blobs proxied server-side, not pre-signed URL | `GET /viz/mc-trajectories/{id}` backend proxy; MinIO URLs never exposed to browser |

### 36.4 Penetration Test Scope (Phase 3)

The Phase 3 external penetration test (referenced in §7.15) must include the following adversarial scenarios derived from this review:

1. **HMAC rotation bypass** — attempt to forge a prediction record by exploiting the immutability trigger with and without the `hmac_admin` role
2. **Pre-signed URL exfiltration** — verify that MC blob URLs are not present in any browser-side response; verify pre-signed report URLs cannot be used after 5 minutes
3. **Celery task injection** — attempt to enqueue tasks with out-of-range arguments directly via Redis; verify the task validates and rejects them
4. **Playwright SSRF** — attempt to trigger renderer navigation to `http://169.254.169.254/` (AWS metadata) or `http://backend:8000/internal/admin`; verify interception blocks both
5. **Refresh token theft simulation** — replay a superseded refresh token; verify full family revocation and email alert
6. **Admin privilege escalation** — attempt to elevate a `viewer` account to `admin` via a single compromised admin account without the four-eyes approval token; verify the attempt is blocked and logged
7. **Shadow mode leak** — query `GET /decay/predictions` as `viewer`; inject a shadow prediction directly at the DB layer; verify the API response never returns it
8. **NOTAM injection** — submit an object with a name containing `<script>alert(1)</script>` via `POST /objects`; generate a NOTAM draft; verify PDF render does not execute script

### 36.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| HMAC rotation trigger | Parameterised `SET LOCAL` flag scoped to `hmac_admin` role | Separate migration to drop and recreate trigger | `SET LOCAL` is session-scoped; cannot be set by application role; minimises window of bypass |
| Family reuse detection | Full family revocation on superseded token reuse | Single token revocation | Full revocation is the only action that guarantees the attacker's session is destroyed even if the legitimate user doesn't notice |
| MC blob delivery | Server-side proxy endpoint | Pre-signed MinIO URL with short TTL | Pre-signed URLs can be shared or logged in browser history; server-side proxy enforces org scoping on every request |
| Admin four-eyes | Email approval token with 30-minute window | Yubikey hardware confirmation | Email approval is achievable without additional hardware; 30-minute window prevents indefinite pending states |
| Shadow RLS | PostgreSQL RLS policy | Application-layer `WHERE shadow_mode = FALSE` | RLS is enforced by the database engine regardless of query construction; application-layer filters can be omitted by bugs or direct DB queries |
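
The family reuse rule (ADR 0010) can be sketched against an in-memory stand-in for the §9.2 `refresh_tokens` table; `rotate_refresh_token` and the record shape are illustrative, not the production schema.

```python
import uuid
from typing import Optional

def rotate_refresh_token(token_id: str, tokens: dict) -> Optional[str]:
    """Rotate a refresh token with family reuse detection.

    `tokens` maps token_id -> {"family_id", "superseded", "revoked"}.
    Returns the new token id, or None when the presented token is rejected.
    """
    rec = tokens.get(token_id)
    if rec is None or rec["revoked"]:
        return None
    if rec["superseded"]:
        # Reuse of a superseded token: assume theft; revoke the whole family
        # so the attacker's session dies even if the user never notices.
        for t in tokens.values():
            if t["family_id"] == rec["family_id"]:
                t["revoked"] = True
        return None
    rec["superseded"] = True
    new_id = str(uuid.uuid4())
    tokens[new_id] = {"family_id": rec["family_id"], "superseded": False, "revoked": False}
    return new_id
```

In production the reuse branch would also emit the security event and user email alert required by Finding 5.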

---

## 37. Aviation Regulatory / ATM Compliance Review

This section records findings from an ATM systems engineering review against the ICAO/EUROCONTROL regulatory environment that governs ANSP customers. Findings were incorporated into §6.13 (NOTAM format), §6.14 (shadow exit), §6.17 (multi-ANSP panel), §11 (data sources / airspace scope), §16 (prediction conflict), §21 Phase 2 DoD, §27.4 (safety record retention), and §9.2 (schema additions).

### 37.1 Finding Summary

| # | Finding | Primary Section Updated | Severity |
|---|---------|------------------------|----------|
| 1 | Regulatory classification (EASA IR 2017/373 position) unresolved | §21 Phase 2 DoD + ADR 0012 | Critical |
| 2 | NOTAM format non-compliant with ICAO Annex 15 field formatting | §6.13 — field mapping table, Q-line, `YYMMDDHHmm` timestamps | High |
| 3 | Re-entry window → NOTAM (B)/(C) mapping not specified | §6.13 — `p10−30min` / `p90+30min` rule + cancellation urgency | High |
| 4 | FIR scope excludes SUA, TMAs, oceanic — undisclosed | §11 — airspace scope disclosure; ADR 0014 | Medium |
| 5 | Multi-ANSP coordination panel has no authority/precedence spec | §6.17 — advisory-only banner, retention, WebSocket SLA | Medium |
| 6 | Shadow mode exit criteria not specified | §6.14 — exit criteria table, exit report template | High |
| 7 | Degraded mode disclosure insufficient for ANSP operational use | §9.2 `degraded_mode_events` table; §14 `GET /readyz` schema; NOTAM `(E)` injection | High |
| 8 | GDPR DPA must be signed before shadow mode begins, not Phase 3 | §21 Phase 2 DoD legal gate | High |
| 9 | ESA DISCOS redistribution rights unaddressed | §11 — redistribution rights requirement; §21 Phase 2 DoD | High |
| 10 | Multi-source prediction conflict resolution not specified | §16 — conflict resolution rules; `prediction_conflict` schema columns | High |
| 11 | Safety-relevant records have no distinct retention policy | §27.4 — `safety_record` flag; 5-year safety category | Medium |

### 37.2 Regulatory Framework References

| Framework | Relevance | Position taken |
|---|---|---|
| EASA IR (EU) 2017/373 | Requirements for ATM/ANS providers; may apply if ANSP integrates SpaceCom into operational workflow | Position A: advisory tool; not ATM/ANS provision — documented in ADR 0012 |
| ICAO Annex 15 (AIS) + Appendix 6 | NOTAM format specification | NOTAM drafts now comply with Annex 15 field formatting (§6.13) |
| ICAO Annex 11 (ATS) §2.26 | ATC record retention recommendation | Safety records retained ≥ 5 years (§27.4) |
| ICAO Doc 8400 | ICAO abbreviations and codes used in NOTAM `(E)` field | `sanitise_icao()` uses Doc 8400 abbreviation list |
| EUROCONTROL OPADD | Operating Procedures for AIS Dynamic Data; EUR regional NOTAM practice | Q-line format and series conventions follow OPADD (§6.13) |
| GDPR Article 28 | Data processor obligations when processing ANSP staff personal data | DPA must be signed before any ANSP data processing, including shadow mode |
| UN Liability Convention 1972 | 7-year record retention for space object liability claims | `reentry_predictions`, `alert_events` retained 7 years (§27.4) |

### 37.3 Regulatory ADRs

| ADR | Title | Decision |
|-----|-------|----------|
| `docs/adr/0012-regulatory-classification.md` | EASA IR 2017/373 position | Position A: ATM/ANS Support Tool; decision support only; not ATM/ANS provision; written ANSP agreements required |
| `docs/adr/0013-notam-format.md` | ICAO Annex 15 NOTAM field compliance | Field mapping table; `YYMMDDHHmm` timestamps; Q-line `QWELW`; `(B)` = p10−30min; `(C)` = p90+30min |
| `docs/adr/0014-airspace-scope.md` | Phase 2 airspace data scope | FIR/UIR only (ECAC + US); SUA/TMA/oceanic explicitly out of scope; disclosed in UI; Phase 3 SUA consideration |

### 37.4 Compliance Checklist (Phase 2 Gate)

Before the first ANSP shadow deployment:

- [ ] `docs/adr/0012-regulatory-classification.md` committed and reviewed by aviation law counsel
- [ ] NOTAM draft generator produces ICAO-compliant output (unit test passes Q-line regex and `YYMMDDHHmm` field checks)
- [ ] Airspace scope disclosure note present in Airspace Impact Panel (Playwright test verifies text)
- [ ] Multi-ANSP coordination advisory-only banner present in panel (Playwright test verifies text)
- [ ] `degraded_mode_events` table active; transitions logged; `GET /readyz` response includes `degraded_since`
- [ ] NOTAM draft `(E)` field injects degraded-state warning when `generated_during_degraded = TRUE` (integration test)
- [ ] DPA signed with each ANSP shadow partner; DPA template reviewed by counsel
- [ ] ESA DISCOS redistribution rights clarified in writing; API/report templates updated if required
- [ ] `prediction_conflict` flag operational; Event Detail page shows `⚠ PREDICTION CONFLICT` when set
- [ ] Safety record retention policy active: `safety_record = TRUE` records excluded from TimescaleDB drop; `degraded_mode_events` retained 5 years
- [ ] Shadow mode exit report template (`docs/templates/shadow-mode-exit-report.md`) exists and Persona B can generate statistics from admin panel

### 37.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Regulatory classification | Position A — advisory, non-safety-critical ATM/ANS Support Tool | Position B — Functional System under IR 2017/373 | Position B would require ED-78A system safety assessment, ATCO HMI compliance, and EASA change management — disproportionate for a decision-support tool where a human verifies all outputs before acting |
| NOTAM timestamp format | `YYMMDDHHmm` (ICAO Annex 15 §5.1.2) | ISO 8601 `YYYY-MM-DDTHH:mmZ` | ICAO Annex 15 is unambiguous; ISO 8601 would require the NOTAM office to reformat before issuance |
| NOTAM window mapping | `(B)` = p10 − 30 min; `(C)` = p90 + 30 min | `(B)` = p50 − 3h; `(C)` = p50 + 3h | p10/p90 are the actual statistical bounds; symmetric windows around p50 ignore the often-asymmetric uncertainty distribution |
| Degraded NOTAM warning | Machine-inserted line in `(E)` field | UI-only warning on the draft page | The `(E)` field is what the NOTAM office receives; a UI-only warning is lost when the draft is copied to the NOTAM office's system |
| Multi-source conflict | Union of windows when non-overlapping | SpaceCom window always primary regardless | ICAO most-conservative principle; ANSPs must be protected against the case where SpaceCom is wrong and TIP is right |
| Safety record retention | `safety_record` flag on row; excluded from drop policy | Separate table for safety records | Flag approach avoids data duplication and works with TimescaleDB chunk-level policies; excluded records stay in the same hypertable partition for query performance |
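
The chosen (B)/(C) mapping is mechanical enough to express directly. `notam_window` is an illustrative name; the `YYMMDDHHmm` formatting follows the timestamp decision above.

```python
from datetime import datetime, timedelta

def notam_window(p10: datetime, p90: datetime) -> tuple:
    """Map a re-entry window to NOTAM (B)/(C) fields.

    (B) = p10 - 30 min, (C) = p90 + 30 min, formatted YYMMDDHHmm
    per ICAO Annex 15 and the decision log above.
    """
    fmt = "%y%m%d%H%M"
    b = (p10 - timedelta(minutes=30)).strftime(fmt)
    c = (p90 + timedelta(minutes=30)).strftime(fmt)
    return b, c

# Example: window p10 = 2025-03-01 04:10Z, p90 = 2025-03-01 09:40Z
b, c = notam_window(datetime(2025, 3, 1, 4, 10), datetime(2025, 3, 1, 9, 40))
# b == "2503010340", c == "2503011010"
```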

---

## 38. Orbital Mechanics / Astrodynamics Review

This section records findings from an astrodynamics specialist review of the physics specification. Findings were incorporated into §15.1 (SGP4 validity gates), §15.2 (NRLMSISE-00 inputs, MC uncertainty model, SRP, integrator config), §15.3 (breakup altitude trigger, material survivability), §15.4 (new — corridor generation algorithm), §15.5 (new — Pc computation method), §17.1 (committed test vectors), §31.1 (BSTAR validation), and the `objects`/`space_weather` schema in §9.

### 38.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---------|-----------------|----------|
| 1 | SGP4 validity limits not enforced at query time | §15.1 — epoch age gates, perigee < 200 km routing | High |
| 2 | NRLMSISE-00 input vector under-specified | §15.2 — f107_prior_day, ap_3h_history, Ap vs Kp | High |
| 3 | Ballistic coefficient uncertainty model not specified | §15.2 — C_D/A/m sampling distributions; `objects` schema | High |
| 4 | Corridor generation algorithm not specified | §15.4 (new) — alpha-shape, 50 km buffer, ≤ 1000 vertices | High |
| 5 | Breakup altitude trigger not specified | §15.3 — 78 km trigger, NASA SBM, material survivability | High |
| 6 | Frame transformation test vectors not committed | §17.1 — 3 required JSON files; fail-not-skip test pattern | Medium |
| 7 | Solar radiation pressure absent from decay predictor | §15.2 — cannonball SRP model, `cr_coefficient` column | Medium |
| 8 | Pc computation method not specified | §15.5 (new) — Alfano 2D Gaussian, TLE differencing covariance | Medium |
| 9 | Integrator tolerances and stopping criterion not specified | §15.2 — atol=1e-9, rtol=1e-9, max_step=60s, 120-day cap | High |
| 10 | BSTAR validation range excludes valid high-density objects | §31.1 — removed lower floor; warn-not-reject for B* > 0.5 | Medium |
| 11 | NRLMSISE-00 altitude limit and storm handling not specified | §15.2 — 800 km OOD boundary; Kp > 5 storm flag | Medium |

### 38.2 Physics Model Decisions

| Decision | Chosen | Alternative Considered | Rationale |
|----------|--------|----------------------|-----------|
| Catalog propagator | SGP4 (`sgp4` library) | SP (Special Perturbations) via GMAT | SGP4 is the standard for TLE-based catalog propagation; SP requires full state vector with covariance — not available from TLEs |
| Decay integrator | DOP853 (RK7/8 adaptive) | RK4 fixed step | DOP853 has embedded error control; RK4 fixed step requires manual step-size management and may miss density variations near perigee |
| Atmospheric model | NRLMSISE-00 | JB2008 (Jacchia-Bowman 2008) | NRLMSISE-00 is well-validated, open-source, and widely used in community tools; JB2008 is more accurate during storms but requires additional data inputs not yet in scope |
| Corridor shape | Alpha-shape (concave hull) | Convex hull | Convex hull overestimates corridor width by 2–5× for elongated re-entry ground tracks; alpha-shape produces tighter, more operationally useful polygons |
| C_D sampling | Uniform(2.0, 2.4) | Fixed value 2.2 | Uniform sampling covers the credible range without assuming a specific distribution; fixed value understates uncertainty |
| SRP model | Cannonball (scalar) | Panelled model | Cannonball model is standard for non-cooperative objects; panelled model requires detailed attitude and geometry data unavailable for most catalog objects |
| Pc method | Alfano 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and the community standard; Monte Carlo Pc added as Phase 3 consideration for high-Pc events |
| BSTAR lower bound | No lower bound (reject ≤ 0 only) | 0.0001 lower bound | Dense objects (tungsten, stainless steel tanks) can have B* << 0.0001; the previous lower bound would silently reject valid high-density object TLEs |
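
The C_D sampling decision can be illustrated with a single Monte Carlo draw. Only the Uniform(2.0, 2.4) range for C_D comes from the table above; the ±10% spreads on A and m are placeholders, since the actual distributions are specified in §15.2.

```python
import random

def sample_drag_params(a_m2: float, m_kg: float, rng: random.Random) -> dict:
    """Draw one MC sample of the ballistic-coefficient inputs.

    C_D ~ Uniform(2.0, 2.4) per the decision table; the A and m spreads
    here are illustrative placeholders for the §15.2 distributions.
    """
    cd = rng.uniform(2.0, 2.4)
    area = a_m2 * rng.uniform(0.9, 1.1)  # illustrative +/-10% spread
    mass = m_kg * rng.uniform(0.9, 1.1)  # illustrative +/-10% spread
    # Ballistic coefficient BC = m / (C_D * A), kg/m^2
    return {"C_D": cd, "A": area, "m": mass, "BC": mass / (cd * area)}
```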

### 38.3 Model Card Additions Required

The following items must be added to `docs/model-card-decay-predictor.md`:

- **Breakup altitude rationale:** 78 km trigger; reference to NASA Debris Assessment Software range (75–80 km for Al structures)
- **Monte Carlo uncertainty model:** C_D, A, m sampling distributions and their justifications
- **SRP significance:** conditions under which SRP > 5% of drag (area-to-mass > 0.01 m²/kg, altitude > 500 km)
- **NRLMSISE-00 altitude scope:** validated 150–800 km; OOD flag above 800 km
- **Geomagnetic storm sensitivity:** Kp > 5 triggers storm-period sampling; prediction uncertainty is elevated
- **Corridor generation algorithm:** alpha-shape with α = 0.1°, 50 km buffer; reference to alpha-shape literature
- **Pc computation:** Alfano 2D Gaussian; TLE differencing covariance; quality flag when < 3 TLEs available
- **SGP4 validity limits:** 7-day degraded, 14-day unreliable, 200 km perigee routing to decay predictor

### 38.4 Validation Test Vector Requirements

| File | Required before | Blocking if absent |
|------|-----------------|-------------------|
| `docs/validation/reference-data/frame_transform_gcrf_to_itrf.json` | Any frame transform code merged | Yes — test fails hard |
| `docs/validation/reference-data/sgp4_propagation_cases.json` | SGP4 propagator merged | Yes |
| `docs/validation/reference-data/iers_eop_case.json` | IERS EOP application merged | Yes |
| `docs/validation/reference-data/nrlmsise00_density_cases.json` | Decay predictor merged | Yes — referenced in §17.3 |
| `docs/validation/reference-data/aerospace-corp-reentries.json` | Phase 1 backcast validation | Yes for Phase 2 gate |
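
The fail-not-skip pattern behind the "Blocking if absent" column can be sketched as follows (`load_reference_vectors` is an illustrative helper, not the actual test fixture):

```python
import json
from pathlib import Path

def load_reference_vectors(path: str):
    """Load a committed reference test-vector file.

    Per the fail-not-skip pattern (§17.1), a missing file is a hard failure:
    the validation test must FAIL, never silently skip.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"required reference data missing: {path}; failing, not skipping")
    return json.loads(p.read_text())
```

A pytest using this helper fails with `FileNotFoundError` when the JSON file has not been committed, instead of reporting a misleading green run.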
|
||
|
||
---
|
||
|
||
## 39. API Design / Developer Experience Review

This section records findings from a senior API design review. Findings were incorporated into §9.2 (new `jobs` and `idempotency_keys` tables; expanded `api_keys` schema) and §14 (canonical pagination envelope, error schema, rate limit 429 body, async job lifecycle, ephemeris validation, WebSocket token refresh, WebSocket protocol versioning, field naming convention, `GET /readyz` in OpenAPI, API key auth model).

### 39.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---------|-----------------|----------|
| 1 | Pagination envelope not canonical across endpoints | §14 — `PaginatedResponse[T]`, `data` key, `total_count: null` | High |
| 2 | Error response shape inconsistent; no error code registry | §14 — `SpaceComError` base, `RequestValidationError` override, registry table | High |
| 3 | Async job lifecycle for `POST /decay/predict` not specified | §14 — 202 response, `/jobs/{id}` endpoint; §9.2 — `jobs` table | High |
| 4 | WebSocket token refresh path not specified | §14 — `TOKEN_EXPIRY_WARNING`, `AUTH_REFRESH`, close codes `4001`/`4002` | High |
| 5 | Idempotency keys not specified for mutation endpoints | §14 — idempotency spec; §9.2 — `idempotency_keys` table | Medium |
| 6 | `429` missing `Retry-After` header and structured body | §14 — `retryAfterSeconds` body field, `Retry-After` header spec | Medium |
| 7 | Ephemeris endpoint lacks time range and step validation | §14 — 4-row validation table with error codes | Medium |
| 8 | WebSocket protocol versioning not specified | §14 — `?protocol_version=N`, deprecation warning event, sunset close code | Medium |
| 9 | Field naming convention not decided | §14 — `APIModel` base class, `alias_generator=to_camel` | Medium |
| 10 | `GET /readyz` not in OpenAPI spec | §14 — `tags=["System"]` decorated endpoint | Low |
| 11 | API key auth model, rate limits, and scope not specified | §14 — `apikey_` prefix, independent buckets, `allowed_endpoints` scope | High |

### 39.2 Developer Experience Contracts

The following contracts are enforced by CI and must not be broken without an ADR:

| Contract | Enforcement |
|---|---|
| All list endpoints return `{"data": [...], "pagination": {...}}` | OpenAPI CI check: `list`-tagged endpoints validated against `PaginatedResponse` schema |
| All errors return `{"error": "...", "message": "...", "requestId": "..."}` | AST/grep CI check: `HTTPException` and `JSONResponse` must reference registry codes |
| `POST` endpoints returning async jobs return `202` with `statusUrl` | OpenAPI CI check: endpoints tagged `async` validated for `202` response schema |
| `429` responses include `Retry-After` header | Integration test: rate-limited request asserts `Retry-After` header present |
| `Idempotency-Key` header documented for mutation endpoints | OpenAPI CI check: endpoints tagged `mutation` declare the header parameter |
| `GET /readyz` is in the OpenAPI spec | Schema validation: `readyz` path present in generated `openapi.json` |
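
The envelope and naming contracts above can be sketched with Pydantic v2 (matching §14's `APIModel` with `alias_generator=to_camel`; the exact `Pagination` field set here is illustrative):

```python
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

T = TypeVar("T")


class APIModel(BaseModel):
    """Base model: camelCase on the wire, snake_case in Python code."""
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)


class Pagination(APIModel):
    next_cursor: Optional[str] = None   # opaque cursor for the next page
    total_count: Optional[int] = None   # always null by design (§39.5)


class PaginatedResponse(APIModel, Generic[T]):
    """Canonical list envelope: `data` + `pagination`, enforced by the CI check."""
    data: list[T]
    pagination: Pagination
```

Serialising with `model_dump(by_alias=True)` yields the wire shape (`nextCursor`, `totalCount`), while handlers keep readable snake_case internally via `populate_by_name=True`.
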

### 39.3 New Endpoints Added

| Endpoint | Role | Purpose |
|---|---|---|
| `GET /jobs/{job_id}` | `viewer` (own jobs only) | Poll async job status; returns `resultUrl` on completion |
| `DELETE /jobs/{job_id}` | `viewer` (own jobs only) | Cancel a queued job (no effect if already running) |

### 39.4 New API Guide Documents Required

| Document | Content |
|---|---|
| `docs/api-guide/conventions.md` | `camelCase` rule, `APIModel` base class, error envelope, request ID tracing |
| `docs/api-guide/pagination.md` | Cursor encoding, `total_count: null` rationale, empty result shape |
| `docs/api-guide/error-reference.md` | Canonical error code registry with HTTP status, description, recovery action |
| `docs/api-guide/idempotency.md` | Idempotency key protocol, 24h TTL, replay header, in-progress behaviour |
| `docs/api-guide/async-jobs.md` | Job lifecycle, WebSocket vs polling, recommended poll interval |
| `docs/api-guide/websocket-protocol.md` | Protocol version history, token refresh flow, close codes, reconnection |
| `docs/api-guide/api-keys.md` | Key creation, `apikey_` prefix, scope, independent rate limits |

### 39.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Pagination key | `data` | `items`, `results` | `data` is the most common convention (JSON:API, GitHub API, Stripe); `items` is ambiguous with Python iterables |
| `total_count` | Always `null` | Compute count on every list request | COUNT(*) on a 7-year-retention hypertable can be a full scan; cursor pagination does not need count; document the omission |
| Error base model | `SpaceComError` with `requestId` | Per-endpoint error types | Uniform shape allows generic client error handling; `requestId` enables log correlation without exposing internals |
| Field naming | `camelCase` via `alias_generator` | `snake_case` (Python default) | Frontend and API consumer convention is `camelCase`; `populate_by_name=True` keeps internal code readable |
| Async job surface | `/jobs/{id}` unified endpoint | Per-type endpoints (`/decay/{id}`, `/reports/{id}`) | Unified job surface simplifies client polling logic; type-specific result URLs are returned in `resultUrl` field |
| WebSocket close codes | `4001` auth expiry, `4002` protocol deprecated | Generic `1008` for all auth failures | Application-specific close codes enable clients to take the correct action (refresh token vs. upgrade protocol) without scraping close reason text |
| Idempotency TTL | 24 hours | 1 hour, 7 days | 24 hours covers retry windows caused by network outages, client restarts, and overnight batch jobs; longer risks unbounded table growth |
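
The idempotency semantics settled above can be sketched in memory — a minimal sketch only, assuming the documented 24h TTL, replay-on-completion, and reject-while-in-progress behaviour; the real implementation is the `idempotency_keys` table, and the names here are illustrative:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

TTL_SECONDS = 24 * 3600  # 24-hour TTL per the decision log


@dataclass
class IdempotencyRecord:
    status: str                      # "in_progress" | "completed"
    response: Optional[dict] = None
    created_at: float = field(default_factory=time.monotonic)


class IdempotencyStore:
    """In-memory stand-in for the `idempotency_keys` table semantics."""

    def __init__(self) -> None:
        self._records: dict[str, IdempotencyRecord] = {}

    def begin(self, key: str) -> Optional[dict]:
        """Return the cached response on replay; None means 'proceed'."""
        rec = self._records.get(key)
        if rec and time.monotonic() - rec.created_at < TTL_SECONDS:
            if rec.status == "in_progress":
                # Same key, request still running: the endpoint answers 409
                raise RuntimeError("409: request with this Idempotency-Key is in progress")
            return rec.response  # replay: serve cached response + replay header
        self._records[key] = IdempotencyRecord(status="in_progress")
        return None

    def complete(self, key: str, response: dict) -> None:
        self._records[key] = IdempotencyRecord(status="completed", response=response)
```
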

---

## 40. Commercial Strategy Review

SpaceCom is a standalone commercial product. Institutional procurements (ESA STAR #182213 and similar) are market opportunities pursued with existing capabilities — the product is not built to suit any single bid. This section records findings from a commercial strategy review; incorporations are in the product and architecture sections, not in bid-specific requirements.

### 40.1 Finding Summary

| # | Finding | Section Updated | Severity |
|---|---------|-----------------|----------|
| 1 | ESA bid requirements not mapped to plan | Scoped as per-bid process only — `docs/bid/` created per procurement opportunity, not a structural plan requirement | Critical (clarified) |
| 2 | Zero Debris Charter compliance output format not specified | §6 — Controlled Re-entry Planner compliance report spec, Pc_ground, `compliance_report_url` | High |
| 3 | No commercial tier structure | §9.2 — `subscription_tier`, `subscription_status` on `organisations`; tier table defined | High |
| 4 | Competitive differentiation not anchored to maintained capabilities | §23.4 — maintained capabilities table; `docs/competitive-analysis.md` quarterly review | Medium |
| 5 | Shadow trial-to-operational conversion not specified | §6.14 — conversion path, offer package, `subscription_status` transitions, 2-concurrent-deployment cap | High |
| 6 | Delivery schedule vs. procurement milestones | Light touch: per-procurement milestone reconciliation doc created at bid time; not a structural plan requirement | High (scoped) |
| 7 | No customer-facing SLA | §26.1 — SLA schedule table in MSA; measurement methodology; service credits | High |
| 8 | Data residency requirements not addressed | §29.5 — EU default hosting; on-premise option; `hosting_jurisdiction` column; subprocessor disclosure | High |
| 9 | Space-Track AUP conditional architecture not specified | §11 — Path A/B conditional architecture; ADR 0016; Phase 1 architectural decision gate | High |
| 10 | No Acceptance Test Procedure specification | §21 Phase 3 DoD — ATP requirement; independent evaluator; `docs/bid/acceptance-test-procedure.md` | Medium |
| 11 | Go-to-market sequence not validated against resource constraints | §6.14 — 2-concurrent-shadow cap; integration lead assignment; onboarding package spec | Medium |

### 40.2 Commercial Tier Structure

| Tier | Customer | Feature access | Pricing model |
|---|---|---|---|
| **Shadow Trial** | ANSP (pre-commercial) | Full aviation portal; shadow mode only; 90-day maximum; 2 concurrent deployments maximum | Free — bilateral agreement or institutional funding |
| **ANSP Operational** | ANSP (post-shadow) | Full aviation portal; live alerts; NOTAM drafting; multi-ANSP coordination | Annual SaaS subscription per ANSP (seat-unlimited within org) |
| **Space Operator** | Satellite operators | Space portal; decay prediction; conjunction; CCSDS export; API access | Per-object-per-month or flat subscription with object cap |
| **Institutional** | ESA, national agencies, research | Full access; data export; API; bulk historical; on-premise deployment option | Bilateral contract or grant-funded; source code escrow option |

Tier is stored in `organisations.subscription_tier`. Tier-based feature gating is added to RBAC: e.g., `shadow_trial` orgs cannot activate live alert delivery to external systems.
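
The tier gate can be sketched as a guard invoked before activating live alert delivery — a minimal sketch; the tier string values and the function name are illustrative assumptions, not schema-confirmed values:

```python
# Tiers permitted to deliver live alerts to external systems (illustrative
# string values — the authoritative set lives in organisations.subscription_tier)
LIVE_ALERT_TIERS = {"ansp_operational", "space_operator", "institutional"}


def ensure_live_alerts_allowed(subscription_tier: str) -> None:
    """Raise before live alert delivery is activated for a gated tier."""
    if subscription_tier not in LIVE_ALERT_TIERS:
        raise PermissionError(
            f"Tier '{subscription_tier}' cannot activate live alert delivery "
            "to external systems (shadow mode only)."
        )
```
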

### 40.3 Procurement Readiness Process

For each institutional procurement opportunity pursued:

1. Create `docs/bid/{procurement-id}/traceability.md` — maps the procurement's SoR requirements to existing MASTER_PLAN.md section(s); gaps marked `NOT MET` or `PARTIALLY MET`
2. Create `docs/bid/{procurement-id}/milestone-reconciliation.md` — maps procurement milestones (KO, PDR, CDR, AT) to SpaceCom phase completion dates
3. Run ATP (`docs/bid/acceptance-test-procedure.md`) on the staging environment before submission
4. Create `docs/bid/{procurement-id}/kpi-and-validation-plan.md` — maps tender KPIs to replay cases, conservative baselines, evidence artefacts, and any partner-supplied validation input still required
5. Update `docs/competitive-analysis.md` to confirm differentiation claims are current

This is a per-opportunity process maintained by the product owner — it does not drive changes to the core plan unless a genuine product gap is identified.

### 40.4 Customer Onboarding Specification

| Artefact | Location | Purpose |
|---|---|---|
| ANSP onboarding checklist | `docs/onboarding/ansp-onboarding-checklist.md` | Integration lead walkthrough; environment setup; FIR configuration; user training |
| Admin setup guide | `docs/onboarding/admin-setup.md` | Persona D configuration; shadow mode activation; user provisioning |
| Shadow exit report template | `docs/templates/shadow-mode-exit-report.md` | Statistics + ANSP Safety Department sign-off |
| Commercial offer template | `docs/templates/commercial-offer-ansp.md` | Auto-populated from org data; sent at shadow exit |

### 40.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Plan structure vs. bid | Product-first; bid traceability is a per-opportunity overlay | Restructure plan around ESA SoR | SpaceCom serves multiple market segments; structuring around one procurement creates lock-in and excludes ANSP and space operator commercial pathways |
| Default hosting jurisdiction | EU (eu-central-1) | US-based hosting | ECAC ANSP customers are predominantly EU/UK; EU hosting satisfies data residency without per-customer complexity |
| Shadow deployment cap | 2 concurrent | Unlimited | Each shadow deployment requires a dedicated integration lead for 90 days; 2 concurrent is the realistic Phase 2 capacity without specialist hiring |
| Space-Track AUP gate | Phase 1 architectural decision | Phase 2 clarification | The shared vs. per-org ingest architecture is a fundamental Phase 1 design choice; deferring to Phase 2 would require rearchitecting already-shipped code |
| SLA in MSA | Separate SLA schedule versioned independently | Inline in MSA body | SLA values change more frequently than contract terms; versioned schedule allows SLA updates without full MSA re-execution |

---

## 41. Database Engineering Review

### 41.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---------|----------|-----------------|
| 1 | `tle_sets` BIGSERIAL PK incompatible with TimescaleDB hypertable uniqueness requirement | High | §9.2 `tle_sets` |
| 2 | TEXT enum columns lacking CHECK constraints (12 columns across 7 tables) | High | §9.2 all affected tables |
| 3 | asyncpg prepared statement cache conflicts with PgBouncer transaction mode | High | §9.4 |
| 4 | `prediction_outcomes.prediction_id` and `alert_events.prediction_id` typed INTEGER; references BIGSERIAL column | Medium | §9.2 |
| 5 | `idempotency_keys` already has composite PRIMARY KEY — confirmed safe; upsert pattern documented | N/A (already correct) | §9.2 |
| 6 | Mixed GEOGRAPHY/GEOMETRY types break GiST index selectivity on cross-table spatial joins | Medium | §9.3 |
| 7 | `acknowledged_by` and `reviewed_by` FKs block GDPR erasure with default RESTRICT | Medium | §9.2 |
| 8 | Mutable tables missing `updated_at` column and trigger | Medium | §9.2 |
| 9 | DB password rotation procedure killed in-flight transactions via hard restart | Medium | §7.5 |
| 10 | `tle_sets` chunk interval (7 days) too small; poor compression ratio for ingest rate | Low | §9.4 |
| 11 | Missing partial indexes on hot-path filtered queries (jobs, refresh_tokens, idempotency_keys, alert_events) | Low | §9.3 |

### 41.2 Schema Integrity Rules

Rules enforced after this review:

1. **Hypertable natural keys** — No surrogate BIGSERIAL PK on hypertables. Reference `tle_sets` rows by `(object_id, ingested_at)`. If a surrogate is needed, use `UNIQUE (surrogate_id, partition_col)` composite.
2. **CHECK constraints mandatory** — Every TEXT column with a finite valid value set must have a `CHECK (col IN (...))` constraint. Application-layer validation is supplemental, not primary.
3. **asyncpg pool config** — `prepared_statement_cache_size=0` must be set on all async engine instances. Enforced by a test that creates a test engine and asserts the connect_arg is present.
4. **BIGINT FK parity** — Any FK referencing a BIGSERIAL column must be `BIGINT`. Linted in CI via a custom Alembic migration checker.
5. **Spatial type discipline** — Every `ST_Intersects` / `ST_Contains` call mixing GEOGRAPHY and GEOMETRY sources must include an explicit `::geometry` cast on the GEOGRAPHY operand. Linted via ruff custom rule.
6. **ON DELETE SET NULL on audit FKs** — FKs in audit/safety tables (`security_logs`, `alert_events.acknowledged_by`, `notam_drafts.reviewed_by`) use `ON DELETE SET NULL`. Hard DELETE on `users` is reserved for GDPR erasure only; see §29.
7. **updated_at trigger** — All mutable (non-append-only) tables must have `updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()` and a BEFORE UPDATE trigger using `set_updated_at()`. Append-only tables (those with `prevent_modification()` trigger) are excluded.
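
The rule 3 enforcement test can be sketched without a live database — with SQLAlchemy's asyncpg dialect the setting is passed as `create_async_engine(url, connect_args={"prepared_statement_cache_size": 0})`, and the guard below only inspects that dict (the function name is illustrative):

```python
def assert_pgbouncer_safe(connect_args: dict) -> None:
    """Fail if the dialect-level prepared statement cache is not disabled.

    PgBouncer in transaction mode hands the same server connection to many
    clients; a cached prepared statement from one client then collides with
    the next request on that connection.
    """
    value = connect_args.get("prepared_statement_cache_size")
    if value != 0:
        raise AssertionError(
            f"async engines must set prepared_statement_cache_size=0; got {value!r}"
        )
```
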

### 41.3 GDPR Erasure Procedure (users table)

Per Finding 7 — a hard `DELETE FROM users WHERE id = $1` is not the correct GDPR erasure mechanism. The correct procedure:

1. Null out PII columns: `UPDATE users SET email = 'erased-' || id || '@erased.invalid', password_hash = 'ERASED', mfa_secret = NULL, mfa_recovery_codes = NULL, tos_accepted_ip = NULL WHERE id = $1`
2. Security logs, alert acknowledgements, and NOTAM review records are preserved with `user_id = NULL` (ON DELETE SET NULL handles this automatically if a hard DELETE is later required by specific legal instruction)
3. Log the erasure in `security_logs` with `event_type = 'GDPR_ERASURE'` before nulling
4. The `users` row itself is retained as a tombstone (`email` contains the erased marker) — this preserves referential integrity for `organisation_id` links and prevents FK violations in tables without SET NULL

Full procedure: `docs/runbooks/gdpr-erasure.md` (Phase 2 gate, per §29).
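
The ordered steps above can be sketched as a statement list — column names come from this section, but the `security_logs` column set and the function itself are illustrative; execution and transaction handling belong to the real runbook code:

```python
def gdpr_erasure_statements(user_id: int) -> list[tuple[str, tuple]]:
    """Return (sql, params) pairs in the order §41.3 requires: log first, then tombstone."""
    return [
        # Step 3 runs first: record the erasure before identity is removed
        (
            "INSERT INTO security_logs (event_type, user_id) "
            "VALUES ('GDPR_ERASURE', $1)",
            (user_id,),
        ),
        # Step 1: tombstone update — identity removed, row retained for FK integrity
        (
            "UPDATE users "
            "SET email = 'erased-' || id || '@erased.invalid', "
            "    password_hash = 'ERASED', "
            "    mfa_secret = NULL, "
            "    mfa_recovery_codes = NULL, "
            "    tos_accepted_ip = NULL "
            "WHERE id = $1",
            (user_id,),
        ),
    ]
```
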

### 41.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Hypertable surrogate key | Remove BIGSERIAL; use `UNIQUE(object_id, ingested_at)` | Add `UNIQUE(id, ingested_at)` composite | Natural key is semantically stable and meaningful; a composite surrogate is confusing and rarely queried by raw id |
| CHECK constraints vs. Postgres ENUM | `CHECK (col IN (...))` | `CREATE TYPE` ENUM | CHECK constraints are simpler to evolve in migrations; Postgres ENUMs need `ALTER TYPE ... ADD VALUE` for additions and cannot remove or reorder values without recreating the type |
| GDPR erasure | Tombstone update, not hard DELETE | Hard DELETE with CASCADE | Hard DELETE cascades into safety records (NOTAM drafts, alert logs) that must be retained under EASA/ICAO safety record requirements; tombstone preserves the record while removing identity |
| Spatial type mixing | Explicit `::geometry` cast; document in §9.3 | Migrate all columns to GEOGRAPHY | Airspace GEOMETRY gives a 3× `ST_Intersects` speedup for regional FIR queries; global corridors correctly use GEOGRAPHY; the cast is cheap and safe |

---

## 42. Test Engineering / QA Review

### 42.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---------|----------|-----------------|
| 1 | No formal test pyramid with per-layer coverage gates | High | §33.10 |
| 2 | No database isolation strategy for integration tests | High | §33.10 |
| 3 | Hypothesis property-based tests unspecified | High | §33.10 table, §12 |
| 4 | WebSocket test strategy missing | High | §33.10 table, §12 |
| 5 | Playwright E2E tests lack `data-testid` selector convention | Medium | §33.9 |
| 6 | No smoke test suite for post-deploy verification | Medium | §12, §33.10 |
| 7 | No flaky test policy | Medium | §33.10 |
| 8 | Contract tests lack value-range assertions | Medium | DoD checklists |
| 9 | Celery task timeout → `jobs` state transition untested; no orphan cleanup | Medium | §7.12 |
| 10 | MC simulation test data generation strategy not specified | Low | §15.4 |
| 11 | Accessibility testing not integrated into CI with implementation spec | Low | §6.16 |

### 42.2 Test Suite Inventory

Full test suite after this review:

```
tests/
  conftest.py                 # db_session (SAVEPOINT); testcontainers for Celery tests; pytest.ini markers
  physics/
    test_frame_utils.py       # Vallado reference cases — all BLOCKING
    test_propagator/          # SGP4 state vectors — BLOCKING
    test_decay/               # Decay predictor backcast — Phase 2+
    test_nrlmsise.py          # NRLMSISE-00 density reference — BLOCKING
    test_hypothesis.py        # Hypothesis property-based invariants — BLOCKING
    test_mc_corridor.py       # MC seeded RNG corridor — Phase 2+
    test_breakup/             # Breakup energy conservation — Phase 2+
  test_integrity.py           # HMAC sign/verify/tamper — BLOCKING
  test_auth.py                # JWT; MFA; rate limiting — BLOCKING
  test_rbac.py                # Every endpoint × every role — BLOCKING
  test_websocket.py           # WS lifecycle; sequence replay; close codes — BLOCKING
  test_ingest/
    test_contracts.py         # Space-Track + NOAA key + value range — BLOCKING (mocked)
    test_spaceweather/        # Space weather ingest logic
  test_jobs/
    test_celery_failure.py    # Timeout → failed; orphan recovery — BLOCKING
  smoke/                      # Post-deploy; idempotent; ≤ 2 min — BLOCKING post-deploy
  quarantine/                 # Flaky tests awaiting fix; non-blocking nightly only
  e2e/                        # Playwright; 5 user journeys + axe WCAG 2.1 AA — BLOCKING
    test_accessibility.ts     # axe-core scan on every primary view; fails PR on any WCAG 2.1 AA violation
    test_alert_websocket.ts   # submit prediction → Celery completes → CRITICAL alert in browser via WS (F9)
  load/                       # k6 performance scenarios — non-blocking (nightly)
```
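
The `db_session` SAVEPOINT strategy noted for `conftest.py` can be demonstrated in miniature with stdlib `sqlite3` — the real fixture wraps a SQLAlchemy session the same way; the table and statuses here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # take manual control of transactions
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")

conn.execute("BEGIN")                # outer transaction held for the whole suite
conn.execute("SAVEPOINT test_case")  # fixture setup: one savepoint per test

# ... the test runs and writes freely ...
conn.execute("INSERT INTO jobs (status) VALUES ('queued')")
assert conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0] == 1

# fixture teardown: roll back to the savepoint — the write vanishes,
# leaving the database exactly as the next test expects it
conn.execute("ROLLBACK TO SAVEPOINT test_case")
assert conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0] == 0
```

This is why SAVEPOINT isolation is "zero-overhead": nothing is ever committed, so there is no cleanup to run and no cross-test contamination on a single connection.
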

**Accessibility test specification (F11):**

`e2e/test_accessibility.ts` uses `axe-playwright` (axe-core) to scan each primary view on every PR:

```typescript
import { test } from '@playwright/test';
import { injectAxe, checkA11y } from 'axe-playwright';

const VIEWS_TO_SCAN = [
  '/',                    // Operational Overview
  '/events',              // Active Events
  '/events/[sample-id]',  // Event Detail
  '/handover',            // Shift Handover
  '/space/objects',       // Space Operator Overview
];

// @playwright/test has no test.each; parameterise with a plain loop
for (const url of VIEWS_TO_SCAN) {
  test(`WCAG 2.1 AA: ${url}`, async ({ page }) => {
    await page.goto(url);
    await injectAxe(page); // axe-core must be injected before checkA11y
    await checkA11y(page, undefined, {
      axeOptions: { runOnly: { type: 'tag', values: ['wcag2a', 'wcag2aa'] } },
      detailedReport: true,
      detailedReportOptions: { html: true },
    });
  });
}
```

CI gate: any `axe-core` violation at `wcag2a` or `wcag2aa` level fails the PR. `wcag2aaa` violations are reported as warnings only. Results are published as a CI artefact (`a11y-report.html`).

**WebSocket alert delivery E2E test (F9):** `e2e/test_alert_websocket.ts` is a BLOCKING E2E test that verifies the full path from prediction submission to browser alert receipt. This test requires the full stack (Celery workers running, WebSocket server live):

```typescript
// e2e/test_alert_websocket.ts
import { test, expect } from '@playwright/test';

test('CRITICAL alert appears in browser via WebSocket after prediction job completes', async ({ page }) => {
  // 1. Authenticate as operator
  await page.goto('/login');
  await page.fill('[name=email]', process.env.E2E_OPERATOR_EMAIL!);
  await page.fill('[name=password]', process.env.E2E_OPERATOR_PASSWORD!);
  await page.click('[type=submit]');
  await page.waitForURL('/');

  // 2. Submit a decay prediction via API that will produce a CRITICAL alert.
  // page.request shares the browser context's cookies and resolves the
  // relative URL against baseURL — a bare fetch() in the Node test process
  // would do neither.
  const job = await page.request.post('/api/v1/decay/predict', {
    data: { norad_id: 90001, mc_samples: 50 }, // test object; always produces CRITICAL
  }).then(r => r.json());

  // 3. Wait for the CRITICAL alert banner to appear in the browser (max 60s)
  await expect(page.locator('[role="alertdialog"][data-severity="CRITICAL"]'))
    .toBeVisible({ timeout: 60_000 });

  // 4. Assert the alert references our prediction
  const alertText = await page.locator('[role="alertdialog"]').textContent();
  expect(alertText).toContain('90001');
});
```

The 60-second timeout covers: Celery task queue, MC computation (50 samples), alert threshold evaluation, WebSocket push to all org subscribers, React state update, and DOM render. If this test fails intermittently, the failure is investigated as a potential latency regression — it must not be moved to `quarantine/` without a root-cause investigation.

**Manual screen reader test** (release checklist — not automated):

- NVDA + Firefox (Windows): primary operator workflow (alert receipt → acknowledgement → NOTAM draft)
- VoiceOver + Safari (macOS): same workflow
- Keyboard-only: full workflow without mouse
- Added to release gate checklist in `docs/RELEASE_CHECKLIST.md`

### 42.3 Hypothesis Invariant Specifications

Minimum 5 required Hypothesis properties in `tests/physics/test_hypothesis.py`:

| Property | Strategy | Assertion | max_examples |
|---|---|---|---|
| SGP4 round-trip position | Random valid TLE orbital elements | Forward propagate T days then back; position error < 1 m | 200 |
| p95 corridor containment | Seeded MC ensemble (seed=42, N=500) | Corridor contains ≥ 95% of input trajectories | 50 |
| NRLMSISE-00 density positive | Random altitude 100–800 km, valid F10.7/Ap | Density always > 0 kg/m³ | 500 |
| RLS tenant isolation | Two different organisation IDs | Session set to org A never returns rows for org B | 100 |
| Pagination non-overlap | Cursor pagination with random page sizes | Pages are non-overlapping and cover full dataset | 100 |
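
The pagination non-overlap property, for instance, can be sketched with Hypothesis against a toy paginator — `paginate` below is an illustrative stand-in for the real cursor endpoint, not the production code:

```python
from hypothesis import given, settings, strategies as st


def paginate(items: list[int], page_size: int) -> list[list[int]]:
    """Toy offset paginator standing in for the real cursor-based endpoint."""
    pages, cursor = [], 0
    while cursor < len(items):
        pages.append(items[cursor:cursor + page_size])
        cursor += page_size
    return pages


@settings(max_examples=100)
@given(
    st.lists(st.integers(), unique=True),
    st.integers(min_value=1, max_value=10),
)
def test_pagination_non_overlap(items, page_size):
    pages = paginate(items, page_size)
    flattened = [x for page in pages for x in page]
    # Pages are non-overlapping and cover the full dataset, in order
    assert flattened == items
```
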

### 42.4 MC Corridor Test Data Specification

Reference data committed to `docs/validation/reference-data/`:

| File | Contents | Regeneration |
|---|---|---|
| `mc-ensemble-params.json` | RNG seed=42, object params, generation timestamp | Never change seed; add to file if params change |
| `mc-corridor-reference.geojson` | Pre-computed p95 corridor polygon | Run `python tools/generate_mc_reference.py` after algorithm change; review diff before committing |

Test asserts area delta < 5% between computed and reference polygon. If the algorithm changes, the reference polygon must be explicitly regenerated and the change logged in `CHANGELOG.md`.
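
The area-delta assertion can be sketched with a shoelace area on vertex lists — a simplified sketch; the real test operates on the committed GeoJSON polygons, and the function names are illustrative:

```python
def polygon_area(vertices: list[tuple[float, float]]) -> float:
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def assert_corridor_unchanged(computed, reference, tolerance=0.05) -> None:
    """Fail when the computed corridor area drifts ≥ 5% from the reference."""
    ref_area = polygon_area(reference)
    delta = abs(polygon_area(computed) - ref_area) / ref_area
    if delta >= tolerance:
        raise AssertionError(
            f"Corridor area changed by {delta:.1%} (limit {tolerance:.0%}); "
            "regenerate the reference polygon deliberately and log it in CHANGELOG.md"
        )
```
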

### 42.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| DB isolation | SAVEPOINT for unit/single-connection; testcontainers for Celery | Shared test DB with cleanup | SAVEPOINT is zero-overhead and perfectly isolated; testcontainers gives true process isolation for multi-connection Celery tests without manual teardown |
| Flaky test policy | Quarantine after 2 failures in 30 days; delete if unfixed > 14 days | Retry flaky tests automatically | Auto-retry masks root causes; quarantine with mandatory resolution timeline creates accountability |
| Hypothesis in blocking CI | Yes, max_examples ≥ 200 for physics | Optional/nightly only | Safety-critical physics invariants must be checked on every commit; 200 examples adds < 30s to CI at default shrink settings |
| MC test data | Seeded RNG + committed reference polygon | Committed raw trajectory arrays | Raw arrays are large (~MB); seeded RNG is deterministic and tiny; committed polygon provides a stable regression target |
| `data-testid` convention | Mandatory for all Playwright targets; CSS class selectors forbidden | Allow CSS class selectors | CSS classes are refactoring artefacts; `data-testid` is stable across UI refactors and explicitly documents test intent |
| Smoke test gate | Blocking post-deploy, not blocking pre-deploy CI | Block pre-deploy CI | Smoke tests require a running stack; pre-deploy CI has no stack. Post-deploy gate means deployment rollback is the recovery action for smoke failure |
| Accessibility CI gate | `axe-core` wcag2a + wcag2aa violations block PR; wcag2aaa warnings only | Manual testing only | Manual testing is too slow and inconsistent for PR-level feedback; automated axe-core catches ~57% of WCAG issues at zero marginal cost; manual screen reader testing reserved for release gate |

---

## 43. Observability / Monitoring Engineering Review

### 43.1 Finding Summary

| # | Finding | Severity | Location updated |
|---|---------|----------|-----------------|
| 1 | Per-object Gauge labels cause alert flooding (600 pages for one outage) | High | §26.7 — recording rules added |
| 2 | No structured logging format specification | High | §7.14, §10 |
| 3 | No distributed tracing (OpenTelemetry) | High | §26.7, §10 |
| 4 | AlertManager rules have semantic errors; no runbook links | High | §26.7 — rules rewritten |
| 5 | No log aggregation stack specified | Medium | §3.2, §10 |
| 6 | Celery queue depth and DLQ depth metrics not defined | Medium | §26.7 |
| 7 | SLIs not formally instrumented against SLOs | Medium | §26.7 — recording rules |
| 8 | No request_id / trace_id correlation between logs and metrics | Medium | §7.14 |
| 9 | Prometheus scrape configuration not specified | Medium | §26.7 |
| 10 | Renderer service has no functional health check or metrics | Medium | §26.5 |
| 11 | No on-call rotation spec or AlertManager escalation routing | Medium | §26.8 |

### 43.2 Observability Stack Summary

After this review the full observability stack is:

| Layer | Tool | Phase |
|---|---|---|
| Metrics | Prometheus + `prometheus-fastapi-instrumentator` | 1 |
| Alerting | AlertManager with runbook_url annotations | 1 |
| Dashboards | Grafana (4 dashboards) | 2 |
| Structured logs | `structlog` JSON with required fields + sanitiser | 1 |
| Log aggregation | Grafana Loki + Promtail (Docker log scrape) | 2 |
| Distributed tracing | OpenTelemetry → Grafana Tempo | 2 |
| On-call routing | PagerDuty/OpsGenie via AlertManager L1/L2/L3 tiers | 2 |
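
The structured-log row above carries two load-bearing ideas — required fields on every record (including the bound `request_id`) and a key-based sanitiser. Production uses `structlog` with contextvars binding; this is a stdlib sketch of just those two ideas, and the sensitive key list is illustrative:

```python
import contextvars
import json
import time

# Bound once per request (e.g. in FastAPI middleware); read by every log call
request_id_var = contextvars.ContextVar("request_id", default=None)

SENSITIVE_KEYS = {"password", "authorization", "mfa_secret", "api_key"}  # illustrative


def sanitise(fields: dict) -> dict:
    """Redact values for any key on the sensitive list before emission."""
    return {
        k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else v)
        for k, v in fields.items()
    }


def log_line(level: str, event: str, **fields) -> str:
    """Emit one JSON record; ts/level/event/request_id are always present."""
    record = {
        "ts": time.time(),
        "level": level,
        "event": event,
        "request_id": request_id_var.get(),
        **sanitise(fields),
    }
    return json.dumps(record)
```

The same `request_id` also travels as a trace attribute, which is what makes log↔trace correlation (Finding 8) a query rather than a guess.
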

### 43.3 Alert Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `rate(counter[Xm]) > 0` | `increase(counter[Xm]) >= N` — `rate()` is per-second and stays positive once counter increments |
| Alert directly on `spacecom_tle_age_hours{norad_id=...}` | Alert on `spacecom:tle_stale_objects:count` recording rule — prevents 600-alert floods |
| AlertManager rule with no `annotations.runbook_url` | Every rule must include `runbook_url` pointing to the relevant runbook in `docs/runbooks/` |
| Grafana dashboard as sole incident channel | All CRITICAL alerts also page via PagerDuty; dashboards are diagnosis tools, not alert channels |

### 43.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Log aggregation | Grafana Loki | ELK stack | Loki is 10× cheaper to operate (no full-text index); Prometheus labels for log querying are sufficient for this workload; co-deploys with existing Grafana without separate ES cluster |
| Tracing backend | Grafana Tempo | Jaeger | Tempo co-deploys with Grafana/Loki with no separate storage; native Grafana datasource; OTLP ingest; no query language to learn |
| Per-object label strategy | Keep labels for Grafana; alert on recording rule aggregates | Remove per-object labels | Per-object drill-down in Grafana dashboards is operationally valuable; the alert flooding problem is solved by recording rules, not by removing labels |
| Structured logging library | structlog | Python standard logging + JSON formatter | structlog integrates natively with contextvars for request_id propagation; the context binding pattern is cleaner than `threading.local` |
| Renderer health check | Functional Chromium launch test | Process liveness only | Chromium hanging without crashing is a known Playwright failure mode; process liveness gives false confidence; functional check is the only reliable signal |

---

## 44. Frontend Architecture Review

### 44.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|-----------|
| 1 | No documented decision on Next.js App Router vs Pages Router; component boundary (`"use client"`) placement unspecified | Medium | §13.1 — App Router confirmed; `"use client"` at `app/(globe)/layout.tsx` boundary |
| 2 | CesiumJS requires `'unsafe-eval'` in CSP for GLSL shader compilation; existing policy blocks the globe | High | §7.7 — two-tier CSP; `'unsafe-eval'` scoped to `app/(globe)/` routes only |
| 3 | Globe WebGL crash removes alert panel from DOM; CesiumJS WebGL context loss is unhandled | High | §13.1 — `GlobeErrorBoundary` wrapping only the globe canvas; alert panel in separate `PanelErrorBoundary` |
| 4 | CesiumJS entity memory leak: unbounded entity accumulation causes WebGL OOM and renderer crash | Medium | §13.1 — max 500 entities; 96h orbit path limit; stale entity pruning on update |
| 5 | WebSocket reconnection strategy unspecified; naive reconnect causes thundering-herd on server restart | Medium | §13.1 — exponential backoff with ±20% jitter; `RECONNECT` config object; max 30s delay |
| 6 | No TanStack Query key management strategy; ad-hoc key strings cause cache stampedes and stale data | Medium | §13.1 — `queryKeys` key factory pattern; all query keys centralised in `src/lib/queryKeys.ts` |
| 7 | Safety-critical panels (alert list, corridor map) have no loading/empty/error state specification | High | §13.1 — explicit state matrix per panel; alert panel must show degraded-data warning on stale WebSocket |
| 8 | LIVE/SIMULATION/REPLAY mode isolation not enforced in UI; writes possible in replay mode | High | §13.1 — `useModeGuard` hook; §33.9 — AGENTS.md rule added |
| 9 | Deck.gl renders on a separate canvas above CesiumJS; z-order and input event handling are broken | Medium | §13.1 — `DeckLayer` from `@deck.gl/cesium`; single canvas; shared input handling |
| 10 | CesiumJS imported at module level causes SSR crash; `next build` fails | High | §13.1 — `next/dynamic` with `ssr: false` for all CesiumJS components |
| 11 | Cesium ion token injection pattern undocumented; risk of over-engineering (proxying a public credential) | Low | §7.5 — explicit `NOT A SECRET` annotation; §33.9 — AGENTS.md rule added |
|
||
|
||
### 44.2 Architecture Constraints Summary

After this review the frontend architecture constraints are:

| Constraint | Rule |
|------------|------|
| App Router split | `app/(auth)/` and `app/(admin)/` — server components; `app/(globe)/` — `"use client"` root layout |
| CesiumJS import | `next/dynamic` + `ssr: false` only; never a static import at module level |
| CSP | Two-tier: standard (no `'unsafe-eval'`) for non-globe; globe-tier (`'unsafe-eval'`) for `app/(globe)/` only |
| Error isolation | Globe crash must not affect alert panel; independent `ErrorBoundary` per major region |
| Entity cap | 500 CesiumJS entities maximum; prune entities not updated in last 96h |
| WebSocket reconnect | Exponential backoff, initial 1s, max 30s, ×2 multiplier, ±20% jitter |
| Query keys | All keys defined in `src/lib/queryKeys.ts` key factory; no inline key strings |
| Mode guard | All write operations must check `useModeGuard(['LIVE'])` and disable in SIMULATION/REPLAY |
| Deck.gl | `DeckLayer` from `@deck.gl/cesium` only; no separate canvas |
| Cesium ion token | `NEXT_PUBLIC_CESIUM_ION_TOKEN`; public credential; not proxied; not in Vault |
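The reconnect policy above (1 s initial delay, ×2 multiplier, 30 s cap, ±20% jitter) is implemented in the TypeScript `RECONNECT` config; the schedule itself is language-agnostic, and a minimal Python sketch makes it testable (the function name `reconnect_delay` and the explicit `rand` parameter are illustrative, not the §13.1 signature):

```python
RECONNECT = {"initial_s": 1.0, "max_s": 30.0, "multiplier": 2.0, "jitter": 0.2}

def reconnect_delay(attempt: int, rand: float) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-based).

    `rand` is a uniform sample in [0, 1); passing 0.5 yields the
    un-jittered schedule, which makes the function easy to test.
    """
    base = min(
        RECONNECT["initial_s"] * RECONNECT["multiplier"] ** attempt,
        RECONNECT["max_s"],
    )
    # ±20% jitter de-synchronises clients after a server restart,
    # which is what prevents the thundering-herd reconnect wave.
    return base * (1 + (2 * rand - 1) * RECONNECT["jitter"])
```

With `rand=0.5` the schedule is 1, 2, 4, 8, 16, 30, 30, … seconds; real callers pass a fresh uniform sample per attempt.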
### 44.3 Anti-Patterns (Do Not Introduce)

| Anti-pattern | Correct form |
|---|---|
| `import * as Cesium from 'cesium'` at module level | `dynamic(() => import('./CesiumViewerInner'), { ssr: false })` via `next/dynamic` |
| Single root `<ErrorBoundary>` wrapping entire app | Independent boundaries: `GlobeErrorBoundary`, `PanelErrorBoundary`, `AlertErrorBoundary` |
| `queryClient.invalidateQueries('objects')` (string key) | `queryClient.invalidateQueries({ queryKey: queryKeys.objects.all() })` |
| Rendering write controls (buttons, forms) without a mode check | `const { isAllowed } = useModeGuard(['LIVE']); <button disabled={!isAllowed}>` |
| Deck.gl separate canvas (`new Deck({ canvas: ... })`) | `viewer.scene.primitives.add(new DeckLayer({ layers: [...] }))` |
| Storing the Cesium ion token in backend env / Vault / Docker secrets | `NEXT_PUBLIC_CESIUM_ION_TOKEN` in `.env.local`; committed non-secret in CI |
| Reconnect without jitter (`setTimeout(connect, delay)`) | `delay * (1 + (Math.random() * 2 - 1) * RECONNECT.jitter)` |
### 44.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| App Router adoption | App Router with route groups | Pages Router | Route groups (`(globe)`, `(auth)`) enable per-group CSP header configuration in `next.config.ts`; server components reduce globe-route initial JS; incremental adoption possible |
| `"use client"` boundary | `app/(globe)/layout.tsx` | Per-component `"use client"` annotations | Single boundary at layout level is simpler; all CesiumJS/Zustand/WebSocket code already browser-only; per-component annotations at this scale would be noise |
| Globe CSP strategy | Route-scoped `'unsafe-eval'` | Hash-based CSP for GLSL | CesiumJS generates shader source dynamically; hashes cannot cover runtime-generated strings; route-scoping is the only practical option |
| Deck.gl integration | `DeckLayer` from `@deck.gl/cesium` | Separate Deck.gl canvas | Separate canvas breaks mouse event routing and z-order; `DeckLayer` renders inside CesiumJS as a primitive, sharing the WebGL context |
| Cesium ion token | `NEXT_PUBLIC_` env var | Backend proxy endpoint | Cesium ion is a CDN/tile service with public tokens by design; proxying adds latency and a backend dependency for a non-secret; Cesium's own documentation recommends direct browser use |

---
## §45 — Platform / Infrastructure Operations Engineering Review

### 45.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | Python 3.11/3.12 version mismatch between Dockerfiles and service table | Medium | §30.2 — all images updated to `python:3.12-slim`, `node:22-slim`; CI version check added |
| 2 | No container resource limits; runaway simulation worker can OOM-kill the database | High | §3.3 — `deploy.resources.limits` added for all services; `stop_grace_period` added |
| 3 | Docker SIGTERM→SIGKILL grace period (10s default) too short for MC task warm shutdown | High | §3.3 — `stop_grace_period: 300s` for worker-sim; `--without-gossip --without-mingle` flags specified |
| 4 | Backend and renderer on disjoint networks — cannot communicate | Critical | §3.3 — `backend` added to `renderer_net`; network topology diagram corrected |
| 5 | Workers bypass PgBouncer — 16 direct connections per worker undermines connection pooling | Medium | §3.3 — PgBouncer added to `worker_net`; workers connect via `pgbouncer:5432` |
| 6 | Redis ACL per-service is stated in §3.2 but undefined — compromised worker can read session tokens | High | §3.2 — full ACL definition added; three separate passwords added to §30.3 env contract |
| 7 | `pg_isready -U postgres` healthcheck passes before TimescaleDB extension and application DB are ready | Medium | §26.5 — healthcheck replaced with `psql` query against `timescaledb_information.hypertables` |
| 8 | `daily_base_backup` calls `pg_basebackup` from Python worker image — tool not installed | High | §26.6 — replaced with dedicated `db-backup` sidecar container; Celery task now verifies backup presence in MinIO |
| 9 | No `pids_limit` on renderer or worker containers — Chromium crash can fork-bomb host | Medium | §3.3 — `pids_limit` added: renderer=100, worker-sim=64, worker-ingest=16 |
| 10 | Renderer PDF scratch written to container writable layer — sensitive data persists | Medium | §3.3 — `tmpfs` mount at `/tmp/renders` (512 MB); `RENDER_OUTPUT_DIR` env var added |
| 11 | Blue-green deployment mechanics unspecified for Docker Compose — first production deploy would fail | High | §26.9 — `scripts/blue-green-deploy.sh` spec added; Caddy dynamic upstream pattern defined |
### 45.2 Container Runtime Safety Summary

After this review the container runtime safety posture is:

| Concern | Control |
|---------|---------|
| Resource isolation | `deploy.resources.limits` per service; DB memory-capped to survive worker OOM |
| Graceful shutdown | `stop_grace_period: 300s` for simulation workers; Celery `--without-gossip --without-mingle` |
| Process containment | `pids_limit` on renderer (100) and both workers |
| Sensitive scratch data | Renderer uses `tmpfs` at `/tmp/renders`; cleared on container stop |
| Network access | Backend on `renderer_net`; PgBouncer on `worker_net`; workers never reach `frontend_net` |
| Redis ACL | Three ACL users (backend, worker, ingest) with scoped key namespaces; default user disabled |
| DB healthcheck | Verifies TimescaleDB extension loaded and application DB accessible before dependent services start |
| Backups | Dedicated `db-backup` sidecar with PostgreSQL tools; Celery Beat verifies presence, not execution |
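The runtime-safety controls above translate into Compose service definitions along these lines. This is an illustrative fragment, not the §3.3 file: the image tags and the CPU/memory figures are placeholders, while `stop_grace_period`, `pids_limit`, and the tmpfs size follow the values summarised in the table.

```yaml
services:
  worker-sim:
    image: spacecom/worker-sim:latest   # placeholder tag
    stop_grace_period: 300s             # warm MC-task shutdown before SIGKILL
    pids_limit: 64
    deploy:
      resources:
        limits:
          cpus: "4.0"                   # placeholder sizing
          memory: 4g                    # placeholder sizing
    networks: [worker_net]

  renderer:
    image: spacecom/renderer:latest     # placeholder tag
    pids_limit: 100                     # contains a Chromium fork-bomb
    volumes:
      - type: tmpfs                     # PDF scratch never touches disk
        target: /tmp/renders
        tmpfs:
          size: 536870912               # 512 MB
    networks: [renderer_net]
```

The long-syntax `tmpfs` volume is used deliberately: the short `tmpfs:` key does not accept a size option in all Compose versions.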
### 45.3 Operations Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `FROM python:3.11-slim` or `FROM node:20-slim` in any Dockerfile | `python:3.12-slim` / `node:22-slim`; hadolint check enforces this |
| No `deploy.resources.limits` on CPU/memory-intensive services | All services must have limits; simulation workers especially |
| Worker `DATABASE_URL` pointing to `db:5432` | `pgbouncer:5432` — all workers route through PgBouncer |
| `subprocess.run(['pg_basebackup', ...])` from a Python worker container | Dedicated `db-backup` sidecar container with PostgreSQL tools |
| `pg_isready -U postgres` as the DB healthcheck | `psql -c "SELECT 1 FROM timescaledb_information.hypertables LIMIT 1"` |
| `docker compose stop` (default 10s) for simulation workers | `stop_grace_period: 300s` on worker-sim service definition |
| All services sharing single `REDIS_PASSWORD` | Three ACL users with scoped namespaces; separate passwords |
| Blue-green deploy without specifying the Compose implementation | `scripts/blue-green-deploy.sh` with separate Compose project instances + Caddy dynamic upstream |
### 45.4 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Python version | 3.12 (service table and Dockerfiles aligned) | 3.11 (original Dockerfiles) | 3.12 has 10–25% numeric performance improvements; free-threaded GIL prep; security support through 2028; alignment eliminates silent version drift |
| Blue-green implementation | Separate Compose project instances + Caddy dynamic upstream file | Single Compose file with blue/green service name variants | Separate projects mean the Compose file is not modified per deployment; Caddy JSON upstream reload is atomic and < 5s |
| Backup execution model | Host cron → `db-backup` sidecar via `docker compose run` | Celery task + `subprocess.run` | Celery workers do not have `pg_basebackup`; host cron is independent of application availability — backup runs even if Celery is down |
| PID limits | Per-service `pids_limit` in Compose | Kernel cgroup default | Compose `pids_limit` is applied at container creation; simpler to audit than system-level cgroup tuning; values sized per expected process count |
| Renderer scratch storage | `tmpfs` | Named Docker volume | PDF contents include prediction data; tmpfs guarantees no persistence; cleared on container stop/restart without manual cleanup |
| Redis ACL scope | Key prefix namespacing (`~celery*` for workers) | Command-level ACL only | Key-prefix ACL prevents workers from reading/writing outside their namespace; command-level-only ACL is weaker (worker could still enumerate all keys) |

---
## §46 — Data Pipeline / ETL Engineering Review

### 46.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | No Space-Track request budget tracked; 30-min TIP polling consumes 48/600 requests/day before retries | High | §31.1.1 — `SpaceTrackBudget` Redis counter; alert at 80%; operator re-fetches budget-checked |
| 2 | TIP 30-min polling too slow for late re-entry phase; CDM 12h polling can miss short-TCA conjunctions entirely | High | §31.1.1 — adaptive polling: TIP→5min, CDM→30min when `active_tip_events > 0` |
| 3 | TLE ingest ON CONFLICT behavior unspecified; double-run hits unique constraint silently | Medium | §11 — `INSERT ... ON CONFLICT DO NOTHING` + `spacecom_ingest_tle_conflict_total` metric |
| 4 | IERS EOP cold-start: astropy falls back to months-old IERS-B, silently degrading frame transforms | High | §11 — `make seed` EOP bootstrap step; EOP freshness check in `GET /readyz` |
| 5 | AIRAC FIR updates are fully manual with no staleness detection or missed-cycle alert | Medium | §31.1.3 — `spacecom_airspace_airac_age_days` gauge + alert; `airspace_stale` in `readyz`; fir-update runbook as Phase 1 deliverable |
| 6 | Space weather nowcast vs. forecast not distinguished; decay predictor uses wrong F10.7 for horizon > 72h | High | §31.1.2 — `forecast_horizon_hours` column; decay predictor input selection table |
| 7 | IERS EOP SHA-256 verification unimplementable — IERS publishes no reference hashes | Medium | §11 — dual-mirror comparison (USNO + Paris Observatory); `spacecom_eop_mirror_agreement` gauge |
| 8 | No exponential backoff or circuit breaker on ingest tasks; transient failures exhaust Space-Track budget | High | §31.1.1 — `retry_backoff=True`, `retry_backoff_max=3600`, `max_retries=5`; `pybreaker` circuit breaker |
| 9 | Space-Track session cookie expires between 6h polls; re-auth behavior not specified or tested | Medium | §31.1.1 — `_ensure_authenticated()` with proactive 1h45m TTL; `session_reauth_total` metric |
| 10 | ESA SWS Kp cross-validation has no decision rule; divergence from NOAA is silently ignored | Medium | §31.1.2 — `arbitrate_kp()` with 2.0 Kp threshold; conservative-high selection; ADR-0018 |
| 11 | `celery-redbeat` default lock TTL 25min causes up to 25min scheduling gap on Beat crash during TIP event | High | §26.4 — `REDBEAT_LOCK_TIMEOUT=60`; `REDBEAT_MAX_SLEEP_INTERVAL=5`; active TIP alert threshold 10min |
### 46.2 Ingest Pipeline Reliability Summary

After this review the ingest pipeline reliability posture is:

| Concern | Control |
|---------|---------|
| Space-Track rate limit | `SpaceTrackBudget` Redis counter; alert at 80%; hard stop at 600/day |
| Upstream failure recovery | Exponential backoff (2s→1h, ×2, ±20% jitter); circuit breaker after 3 failures; max 5 retries then DLQ |
| TIP latency during re-entry | Adaptive polling: 5-minute TIP cycle when active TIP event detected |
| CDM conjunction coverage | 30-minute CDM cycle during active TIP events (baseline 2h) |
| TLE ingest idempotency | `ON CONFLICT DO NOTHING` + conflict metric |
| EOP freshness | Daily download (USNO primary); dual-mirror verification; 7-day staleness alert; cold-start bootstrap in `make seed` |
| AIRAC currency | 28-day staleness alert; `/readyz` degraded signal; manual update runbook as Phase 1 deliverable |
| Space weather horizon | `forecast_horizon_hours` column; predictor selects by horizon; 81-day F10.7 average beyond 72h |
| Beat HA failover gap | `REDBEAT_LOCK_TIMEOUT=60s`; standby acquires lock within 5s of TTL expiry |
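The `SpaceTrackBudget` control above can be sketched as an increment-then-check daily counter. This is a sketch, not the §31.1.1 implementation: the `BudgetExhausted` name, the `on_alert` callback, and the in-memory stand-in client are illustrative; production uses a real Redis client (where `INCRBY` is atomic) and sets a TTL on the daily key.

```python
from datetime import datetime, timezone

class BudgetExhausted(RuntimeError):
    """Raised when the daily Space-Track request budget is spent."""

class SpaceTrackBudget:
    def __init__(self, redis, daily_limit=600, alert_fraction=0.8, on_alert=None):
        self.redis = redis                  # anything with INCRBY semantics
        self.daily_limit = daily_limit
        self.alert_fraction = alert_fraction
        self.on_alert = on_alert            # e.g. raise a Prometheus alert

    def _key(self, now):
        return f"spacetrack:budget:{now:%Y-%m-%d}"  # rolls over at UTC midnight

    def consume(self, n=1, now=None):
        """Reserve `n` requests; call before every Space-Track HTTP request."""
        now = now or datetime.now(timezone.utc)
        used = self.redis.incrby(self._key(now), n)
        if used > self.daily_limit:
            raise BudgetExhausted(f"{used}/{self.daily_limit} requests used today")
        if self.on_alert and used >= self.daily_limit * self.alert_fraction:
            self.on_alert(used)
        return used

class InMemoryCounter:
    """Minimal stand-in for the Redis INCRBY call, for tests only."""
    def __init__(self):
        self._store = {}
    def incrby(self, key, n):
        self._store[key] = self._store.get(key, 0) + n
        return self._store[key]
```

Increment-then-check means the counter itself can overshoot by one batch, but no HTTP request is made once `BudgetExhausted` is raised, which is the property the hard stop needs.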
### 46.3 New ADR Required

| ADR | Title | Decision |
|-----|-------|----------|
| `docs/adr/0018-kp-source-arbitration.md` | Kp Source Arbitration Policy | NOAA primary; ESA SWS cross-validation; conservative-high selection on > 2.0 Kp divergence; physics lead approval required |
### 46.4 Ingest Pipeline Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `INSERT INTO tle_sets ... VALUES (...)` without `ON CONFLICT DO NOTHING` | Always use `ON CONFLICT DO NOTHING` + increment conflict metric |
| `spacetrack_client.fetch()` without budget check | Always call `budget.consume(1)` before any Space-Track HTTP request |
| Celery ingest task with `max_retries=None` or no backoff | `retry_backoff=True`, `retry_backoff_max=3600`, `max_retries=5` |
| EOP verification by SHA-256 against prior download | Dual-mirror UT1-UTC value comparison (USNO + Paris Observatory) |
| `REDBEAT_LOCK_TIMEOUT = 300` or left at the 25 min default | `REDBEAT_LOCK_TIMEOUT = 60` for active TIP event tolerance |
| Single F10.7 value regardless of prediction horizon | Select by `forecast_horizon_hours`; 81-day average beyond 72h |
| ESA SWS Kp logged but not acted upon | `arbitrate_kp()` decision rule; conservative-high on divergence |
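The horizon-selection rule above ("81-day average beyond 72h") reduces to a small pure function. A sketch under assumed names — `select_f107` and its parameters are illustrative, and the authoritative input-selection table lives in §31.1.2:

```python
def select_f107(horizon_hours: float, nowcast_f107: float, avg81_f107: float) -> float:
    """Pick the F10.7 input for a decay prediction at the given horizon.

    Daily nowcast/forecast values are only used out to 72 h; beyond
    that the 81-day centred average is the stabler density driver.
    """
    if horizon_hours <= 72:
        return nowcast_f107
    return avg81_f107
```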
### 46.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Adaptive TIP polling | Dynamic redbeat schedule override when `active_tip_events > 0` | Fixed 5-min polling always | Fixed 5-min polling uses 288/600 Space-Track requests/day for TIPs alone; adaptive polling reserves budget for baseline operations |
| Space-Track budget enforcement | Redis counter with hard stop | Honour-system rate limit compliance | Hard stop prevents CI/staging test runs or operator actions from exhausting production budget unexpectedly |
| EOP verification | Dual-mirror value comparison | SHA-256 against prior download | IERS publishes no reference hashes; prior-download comparison detects corruption but not substitution; dual-mirror comparison is the de facto industry approach |
| Kp arbitration | Conservative-high (max of NOAA, ESA on divergence) | Average of both sources | Averaging introduces a systematic bias toward lower geomagnetic activity; in a safety-critical context, the conservative choice is the higher Kp (denser atmosphere, shorter lifetime, earlier alerting) |
| `forecast_horizon_hours` schema | Dedicated column on `space_weather` | Separate tables per horizon | Single table with horizon column is simpler to query (`WHERE forecast_horizon_hours = 0`); adding a table per horizon complicates the ingest pipeline without query benefit |

---
## §47 — Supply Chain / Dependency Security Engineering Review

### 47.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|------------|
| 1 | `pip wheel` in Dockerfile does not enforce `--require-hashes`; hash pinning specified but not verified during build | High | §30.2 — `--require-hashes` added to `pip wheel` command with explanatory comment |
| 2 | `cosign` image signing absent from CI workflow; attestation claim was aspirational | High | §26.9 — full `cosign sign` + `cosign attest` YAML added to `build-and-push` job |
| 3 | SBOM format, CI step, and retention unspecified; ESA ECSS requirement undeliverable | High | §26.9 — SPDX-JSON via `syft`; `cosign attest` attachment; 365-day artifact retention |
| 4 | `pip-audit` absent; OWASP Dependency-Check has high Python false-positive rate | Medium | §7.13 — `pip-audit` added to `security-scan`; OWASP DC removed from Python scope |
| 5 | No automated license scanning; CesiumJS AGPLv3 compliance check was manual | High | §7.13 — `pip-licenses` + `license-checker-rseidelsohn` gate on every PR |
| 6 | Base image digest update process undefined; Dependabot cannot update `@sha256:` pins | Medium | §7.13 — Renovate Bot `docker-digest` manager; digest PRs auto-merged on passing CI |
| 7 | No `.trivyignore` file; first base-image CVE with no fix will break all CI builds | Medium | §7.13 — `.trivyignore` spec with expiry dates + CI expiry check |
| 8 | `npm audit` absent from CI; `npm ci` does not scan for known vulnerabilities | Medium | §7.13 + §26.9 — `npm audit --audit-level=high` in `security-scan` job |
| 9 | `detect-secrets` baseline update process undefined; incorrect `scan >` overwrites all allowances | Medium | §30.1 — correct `--update` procedure documented; CI baseline currency check added |
| 10 | No PyPI index trust policy; dependency confusion attack surface unmitigated | High | §7.13 — private PyPI proxy spec; `spacecom-*` namespace reservation on public PyPI; ADR-0019 |
| 11 | GitHub Actions pinned by mutable `@vN` tags; tag repointing exfiltrates all workflow secrets | Critical | §26.9 — all actions pinned by full commit SHA; CI lint check enforces no `@v\d` tags |
### 47.2 Supply Chain Security Posture Summary

After this review the supply chain security posture is:

| Layer | Control |
|-------|---------|
| Python build-time hash verification | `pip wheel --require-hashes` enforces hash pinning during Docker build |
| Python CVE scanning | `pip-audit` (PyPA Advisory Database); every PR; blocks on High/Critical |
| Node.js CVE scanning | `npm audit --audit-level=high`; every PR |
| Container CVE scanning | Trivy + `.trivyignore` with expiry enforcement |
| Image provenance | `cosign` keyless signing (Sigstore) on every image push |
| SBOM | SPDX-JSON via `syft`; attached as `cosign attest`; 365-day retention |
| License gate | `pip-licenses` + `license-checker-rseidelsohn`; GPL/AGPL blocks merge |
| Base image currency | Renovate `docker-digest` manager; weekly PRs; auto-merged on CI pass |
| Dependency currency | Dependabot (GitHub Advisory integration) for Python/Node versions |
| CI pipeline integrity | All actions SHA-pinned; lint check rejects `@vN` references |
| Secrets detection | `detect-secrets` (entropy + regex) primary; `git-secrets` secondary; baseline currency check in CI |
| PyPI index trust | Private proxy (Phase 2+); `spacecom-*` namespace stubs on public PyPI |
### 47.3 New ADR Required

| ADR | Title | Decision |
|-----|-------|----------|
| `docs/adr/0019-pypi-index-trust.md` | PyPI Index Trust Policy | Private proxy for Phase 2+; public PyPI namespace reservation for `spacecom-*` packages in Phase 1 |
### 47.4 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| `pip wheel -r requirements.txt` without `--require-hashes` | `pip wheel --require-hashes -r requirements.txt` |
| `uses: actions/checkout@v4` in any workflow file | `uses: actions/checkout@<full-commit-sha> # vX.Y.Z` |
| `detect-secrets scan > .secrets.baseline` | `detect-secrets scan --baseline .secrets.baseline --update` |
| OWASP Dependency-Check as Python CVE scanner | `pip-audit --requirement requirements.txt` |
| Trivy gate with no `.trivyignore` | `.trivyignore` with documented expiry dates + CI expiry check |
| Manual CesiumJS licence check at Phase 1 only | `license-checker-rseidelsohn --failOn "GPL;AGPL"` on every PR (CesiumJS exempted by name) |
| `cosign` mentioned in decision log but absent from CI | `cosign sign` + `cosign attest` in `build-and-push` job; `cosign verify` in deploy jobs |
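The `.trivyignore` expiry gate above can be sketched as a small CI check. Assumptions: each suppression is preceded by a `# expires: YYYY-MM-DD` comment (our own convention; Trivy itself only treats `#` lines as comments), and the function name `check_trivyignore` is illustrative:

```python
import datetime
import re

EXPIRY_RE = re.compile(r"#\s*expires:\s*(\d{4}-\d{2}-\d{2})")

def check_trivyignore(text: str, today: datetime.date) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for a .trivyignore file.

    Every suppressed CVE must be preceded by an expiry comment;
    undated or expired suppressions fail the CI gate.
    """
    problems = []
    pending_expiry = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            m = EXPIRY_RE.search(line)
            if m:
                pending_expiry = datetime.date.fromisoformat(m.group(1))
            continue
        # A non-comment line is a suppression entry; it consumes the
        # most recent expiry comment above it.
        if pending_expiry is None:
            problems.append((lineno, "no expiry date"))
        elif pending_expiry < today:
            problems.append((lineno, "suppression expired"))
        pending_expiry = None
    return problems
```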
### 47.5 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Python CVE scanning | `pip-audit` (PyPA Advisory Database) | OWASP Dependency-Check | OWASP DC CPE mapping generates false positives for Python; `pip-audit` queries the Python-native advisory database with near-zero false positives |
| Image signing | `cosign` keyless (Sigstore) | Long-lived signing key | Keyless signing uses ephemeral OIDC-bound keys; no key management overhead; verifiable against the GitHub Actions OIDC issuer |
| SBOM format | SPDX 2.3 JSON (`spdx-json`) | CycloneDX 1.5 | SPDX is the ECSS/ESA-preferred format; both are equivalent for compliance purposes; SPDX has wider tooling support in the aerospace sector |
| Base image update automation | Renovate `docker-digest` | Manual digest updates | Manual digest updates are always deferred; Renovate auto-merge on passing CI achieves near-zero-latency security patch application for base image OS updates |
| GitHub Actions pinning | Commit SHA with tag comment | Dependabot auto-bump of `@vN` | Tag references are mutable; SHA pins are immutable; the Renovate `github-actions` manager keeps SHAs current automatically |
| PyPI trust (Phase 1) | Namespace reservation on public PyPI | Private proxy | A private proxy requires infrastructure investment not available in Phase 1; namespace squatting prevention provides meaningful protection at zero cost |

---
## §48 — Human Factors Engineering Specialist Review

**Hat:** Human Factors Engineering

**Standard basis:** ECSS-E-ST-10-12C (Space engineering — Human factors), CAP 1264 (Alarm management for safety-related ATC systems), EASA GM1 ATCO.B.001(d) (Competency-based training — decision making under uncertainty), Endsley (1995) Situation Awareness taxonomy, Parasuraman & Riley (1997) automation trust calibration

**Review scope:** §28 Human Factors Framework, §6 UI/UX Feature Specifications, §26 Infrastructure (alert delivery), §31 Data Pipeline (data freshness / degraded state)

---
### 48.1 Findings

**Finding 1 — SA timing targets absent:** §28.1 contained no quantitative time-to-comprehension targets. Situation Awareness without measurable timing criteria cannot be validated against ECSS-E-ST-10-12C Part 6.4 or used as pass/fail criteria in usability testing.

**Fix applied (§28.1):** SA Level 1 ≤ 5s (icon/colour/position); SA Level 2 ≤ 15s (FIR intersection + sector); SA Level 3 ≤ 30s (corridor expanding/contracting). Targets designated as Phase 2 usability test pass/fail criteria.

**Finding 2 — Forced-text acknowledgement minimum causes compliance noise:** The 10-character minimum on alert acknowledgement text is a common anti-pattern. Under time pressure, operators produce `1234567890` or similar, which is audit record pollution rather than evidence of cognitive engagement.

**Fix applied (§28.5):** Replaced with `ACKNOWLEDGEMENT_CATEGORIES` (6 structured options). Free text is optional except when `OTHER` is selected. Category selection satisfies audit requirements with less operator burden.

**Finding 3 — No keyboard-completable acknowledgement path:** ANSP ops room staff routinely hold a radio PTT with one hand. A mouse-dependent acknowledgement dialog is inaccessible in that context and constitutes an HF design failure.

**Fix applied (§28.5):** `Alt+A → Enter → Enter` three-keystroke path from any application state. Documented for the operator quick-reference card; included in the Phase 2 usability test scenario.

**Finding 4 — No startle-response mitigation:** Sudden full-screen CRITICAL banners produce a documented ~5-second window of degraded cognitive performance (startle effect, Staal 2004). The existing design transitions directly to full-screen without priming.

**Fix applied (§28.3):** Three-rule mitigation: (1) progressive escalation — CRITICAL full-screen only after ≥ 1 minute in HIGH state (except `impact_time_minutes < 30`); (2) audio precedes visual by 500ms; (3) the banner is a dimmed overlay over the corridor map, not a replacement.

**Finding 5 — No shift handover specification:** Handover is the highest-risk transition in continuous operations. Loss of situational awareness at shift change is a documented contributing factor in ATC incidents. No handover mechanism existed.

**Fix applied (§28.5a):** Dedicated `/handover` view; `shift_handovers` table with `outgoing_user`, `incoming_user`, `notes`, `active_alerts` snapshot, `open_coord_threads` snapshot; immutable audit record; CRITICAL-during-handover flag on notifications.

**Finding 6 — Alarm rationalisation procedure absent:** Alarm systems without formal rationalisation procedures inevitably drift toward nuisance alarm rates that exceed operator tolerance. The existing quarterly review target (< 1 LOW/10 min/user) had no enforcement mechanism.

**Fix applied (§28.3):** Quarterly rationalisation procedure with `alarm_threshold_audit` table; 90% MONITORING acknowledgement rate as the nuisance alarm trigger; mandatory 7-day confirmation for threshold changes; 12-month no-escalation review for alert categories.

**Finding 7 — Comprehension test items not specified:** §28.7 stated "usability test" without scripted probabilistic comprehension items. Generic usability tests are insensitive to the specific calibration failures relevant to probabilistic re-entry data (false precision, space/aviation risk threshold conflation, uncertainty update misattribution).

**Fix applied (§28.7):** Four scripted comprehension items, each with the correct answer, the common wrong answer, and the failure mode it detects. Pass criterion: ≥ 80% correct per item across the test cohort.

**Finding 8 — No habituation countermeasures:** Repeated identical stimuli (identical alarm sound, identical banner appearance) produce habituation — reduced physiological and attentional response over weeks of exposure. No design provisions existed.

**Fix applied (§28.3):** Pseudo-random alternation of the two-tone audio pattern; 1 Hz colour cycling on the CRITICAL banner between two dark-amber shades; per-operator habituation metric (≥ 20 same-type acknowledgements in 30 days without escalation triggers supervisor review).

**Finding 9 — "Response Options" label creates legal ambiguity:** The label "Response Options" implies these are prescribed choices. In a regulatory investigation following an incident, checked items could be interpreted as evidence of a standard procedure that was or was not followed.

**Fix applied (§28.6):** Feature renamed to "Decision Prompts" throughout. Non-waivable legal disclaimer added below the accordion header. Disclaimer included in the printed/exported Event Detail report and in the API response `legal_notice` field.

**Finding 10 — No attention management specification:** SpaceCom exists in an environment (the ops room) with a very high ambient interruption rate. Without explicit constraints on the unsolicited notification rate, SpaceCom becomes an additional fragmentation source — a documented cause of error in multiple ATC incident analyses.

**Fix applied (§28.6):** Three-tier rate limit: ≤ 1/10 min in steady state; ≤ 1/60s for same-event updates during an active incident; zero during critical flow (acknowledgement dialog or handover screen). Queued notifications are delivered as a batch on critical-flow exit.

**Finding 11 — Degraded-data states not differentiated for operators:** Three meaningfully different system states (healthy, degraded, failed) were visually undifferentiated in the previous design. Operators cannot distinguish between data they should trust, trust with margin, or not trust at all.

**Fix applied (§28.8):** Graded visual degradation language table (5 amber/red states with exact badge text and required operator response); multiple-amber consolidation rule; `GET /readyz` machine-readable staleness flags for ANSP monitoring integration; `system_health_events` audit table.

---
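The three-tier attention-management limit from Finding 10 can be sketched as a gate in front of the notification dispatcher. A sketch only: `NotificationGate` and its method names are illustrative, and timestamps are plain seconds to keep the tier logic visible.

```python
from collections import deque

STEADY_INTERVAL_S = 600   # <= 1 unsolicited notification per 10 min, steady state
INCIDENT_INTERVAL_S = 60  # <= 1 same-event update per 60 s during an active incident

class NotificationGate:
    def __init__(self):
        self._last_global = None      # last steady-state delivery time
        self._last_by_event = {}      # per-event delivery times during incidents
        self._queue = deque()         # held back during critical flow
        self.critical_flow = False

    def notify(self, event_id, now, active_incident=False):
        """Return True if the notification may be delivered now."""
        if self.critical_flow:
            # Tier 3: zero interruptions during ack dialog / handover screen.
            self._queue.append((event_id, now))
            return False
        if active_incident:
            # Tier 2: per-event throttle for updates on an active incident.
            last = self._last_by_event.get(event_id)
            if last is not None and now - last < INCIDENT_INTERVAL_S:
                return False
            self._last_by_event[event_id] = now
            return True
        # Tier 1: global steady-state throttle.
        if self._last_global is not None and now - self._last_global < STEADY_INTERVAL_S:
            return False
        self._last_global = now
        return True

    def enter_critical_flow(self):
        self.critical_flow = True

    def exit_critical_flow(self):
        """Leave critical flow; return queued notifications as one batch."""
        self.critical_flow = False
        batch = list(self._queue)
        self._queue.clear()
        return batch
```

Delivering the queued items as a single batch on exit preserves the "zero during critical flow" rule without silently dropping anything.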
### 48.2 Files / Sections Modified

| Section | Change |
|---------|--------|
| §28.1 Situation Awareness Design Requirements | Added SA level timing targets as pass/fail usability criteria |
| §28.3 Alarm Management | Added startle-response mitigation (3 rules), alarm rationalisation procedure, habituation countermeasures |
| §28.5 Error Recovery and Irreversible Actions | Replaced 10-char text minimum with `ACKNOWLEDGEMENT_CATEGORIES`; added `Alt+A → Enter → Enter` keyboard path |
| §28.5a Shift Handover (new section) | Handover screen spec; `shift_handovers` table schema; integrity rules; handover-window CRITICAL flag |
| §28.6 Cognitive Load Reduction | Renamed Response Options → Decision Prompts; added legal disclaimer; added attention management rate limits |
| §28.7 HF Validation Approach | Added 4 scripted probabilistic comprehension test items with pass criterion |
| §28.8 Degraded-Data Human Factors (new section) | Graded degradation language; 5-state indicator table; multiple-amber consolidation; `GET /readyz` integration |

---
### 48.3 New Tables / Schema Changes
|
||
|
||
| Table | Purpose |
|
||
|-------|---------|
|
||
| `shift_handovers` | Immutable record of shift handovers with alert and coordination thread snapshots |
|
||
| `alarm_threshold_audit` | Immutable record of alarm threshold changes with reviewer and rationale |
|
||
| `system_health_events` | Time-series log of degraded-data state transitions for operational reporting |
|
||
|
||
---
### 48.4 New ADRs Required

| ADR | Title | Decision |
|-----|-------|----------|
| `docs/adr/0020-acknowledgement-categories.md` | Alert Acknowledgement Design | Structured category selection replaces free-text minimum; `OTHER` requires text; 6 categories cover all anticipated operational responses |
| `docs/adr/0021-decision-prompts-legal.md` | Decision Prompts Legal Treatment | Feature renamed from Response Options; non-waivable disclaimer required; legal rationale documented for future regulatory inquiries |

---
### 48.5 Anti-Patterns (Do Not Reintroduce)

| Anti-pattern | Correct form |
|---|---|
| Full-screen CRITICAL banner without progressive escalation | Progressive escalation: ≥ 1 min in HIGH state before CRITICAL full-screen (except `impact_time < 30 min`) |
| Audio and visual CRITICAL alert fired simultaneously | Audio fires 500ms before visual banner render |
| Alert acknowledgement with free-text character minimum | `ACKNOWLEDGEMENT_CATEGORIES` structured selection; free text only when `OTHER` selected |
| "Response Options" label anywhere in UI, API, or docs | "Decision Prompts" throughout; legal disclaimer present |
| Comprehension test without scripted probabilistic items | Use the 4 scripted items in §28.7; measure per-item accuracy against the 80% pass threshold |
| Degraded data shown with same visual weight as fresh data | Use exact badge text from §28.8; amber for stale, red for expired/unusable |

---
### 48.6 Decision Log

| Decision | Chosen | Alternative | Rationale |
|----------|--------|-------------|-----------|
| Acknowledgement mechanism | Structured categories | Free-text minimum | Research shows forced-text minimums produce compliance noise, not evidence; structured categories impose lower operator burden with higher audit utility |
| CRITICAL escalation model | Progressive (HIGH → CRITICAL) | Immediate full-screen | Startle effect causes ~5s of cognitive degradation; progressive escalation eliminates cold-start startle while preserving urgency |
| Audio timing | 500ms pre-visual | Simultaneous | Pre-auditory alert primes the attentional orienting response and avoids visual startle; 500ms is within the ICAO recommended alerting lead-time range |
| Shift handover | System-managed `/handover` view | Out-of-band process | Out-of-band handovers leave no audit trail and are not integrated with active alert state; system-managed handover provides an immutable record and SA transfer assurance |
| Decision Prompts legal treatment | Non-waivable hard-coded disclaimer | Configurable disclaimer or none | A configurable disclaimer creates discovery risk (it could be disabled); absence of a disclaimer creates precedent risk; a hard-coded disclaimer is the only legally safe option |

---
## §49 Legal / Compliance Engineering — Specialist Review

**Standards basis:** GDPR (Regulation 2016/679), UK GDPR, ePrivacy Directive, Export Administration Regulations (EAR), ITAR (22 CFR 120–130), ESA Procurement Rules, EUMETSAT Data Policy, Space Debris Mitigation Guidelines (IADC/ISO 24113), Chicago Convention Article 28, EU AI Act (Regulation 2024/1689), NIS2 Directive (2022/2555)

**Review scope:** Data handling, user consent, liability framing, export control, third-party data licensing, AI Act obligations, operator accountability chain, record retention, cross-border transfer, regulatory correspondence readiness

---
### 49.1 Findings and Fixes Applied

**F1 — No GDPR lawful basis documented per processing activity**

Fix applied (§29.1): RoPA requirement formalised. `legal/ROPA.md` designated as the authoritative document. Data inventory table extended to include all processing activities with lawful basis, retention period, and table reference. `shift_handovers` and `alarm_threshold_audit` added as processing activities. Annual DPO sign-off required. DPIA trigger documented.

**F2 — No DPIA for conjunction alert delivery**

Fix applied (§29.1): DPIA trigger documented — conjunction alert delivery constitutes systematic monitoring under GDPR Art. 35(3)(b). DPIA required before production deployment; template designated as `legal/DPIA_conjunction_alerts.md`.

**F3 — TLE / space weather data redistribution may breach upstream licence**

Fix applied (§24.2): `space_track_registered` boolean column added to the `organisations` table. An API middleware gate blocks TLE-derived fields for non-registered orgs. `data_disclosure_log` table added for the licence audit trail. EU-SST data gated separately behind the `itar_cleared` flag.
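
The middleware gate described above can be sketched as a response filter. The field names below are hypothetical placeholders — the authoritative list of TLE-derived fields would live with the API schema, and the real middleware would also write each blocked disclosure attempt to `data_disclosure_log` (omitted here):

```python
# Hypothetical field names; the authoritative list belongs to the API schema.
TLE_DERIVED_FIELDS = {"tle_line1", "tle_line2", "tle_epoch", "mean_motion"}

def filter_tle_fields(payload: dict, org: dict) -> dict:
    """Strip TLE-derived fields unless the org holds a Space-Track registration."""
    if org.get("space_track_registered"):
        return payload
    return {k: v for k, v in payload.items() if k not in TLE_DERIVED_FIELDS}
```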
**F4 — No export control screening at registration**

Fix applied (§24.2): `country_of_incorporation`, `export_control_screened_at`, `export_control_cleared`, and `itar_cleared` columns added to `organisations` table. Onboarding flow screens against embargoed countries (ISO 3166-1 alpha-2) and the BIS Entity List. EU-SST-derived data gated behind `itar_cleared`. Documented in `legal/EXPORT_CONTROL_POLICY.md`.
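
A minimal sketch of the onboarding screen, under the assumption that the country and entity lists are maintained in `legal/EXPORT_CONTROL_POLICY.md` and injected here. The country subset and record shape are illustrative only:

```python
from datetime import datetime, timezone

# Illustrative subset only — the authoritative embargoed-country list lives in
# legal/EXPORT_CONTROL_POLICY.md and must be reviewed, not hardcoded like this.
EMBARGOED_COUNTRIES = {"KP", "IR", "SY", "CU"}  # ISO 3166-1 alpha-2

def screen_organisation(org: dict, entity_list: set) -> dict:
    """Populate the §24.2 screening columns on an organisation record."""
    cleared = (
        org["country_of_incorporation"] not in EMBARGOED_COUNTRIES
        and org["legal_name"] not in entity_list
    )
    org["export_control_screened_at"] = datetime.now(timezone.utc).isoformat()
    org["export_control_cleared"] = cleared
    return org
```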
**F5 — Liability disclaimer in Decision Prompts insufficient as standalone protection**

Fix applied (§28.6): Note added that the in-UI disclaimer is a reinforcing reminder only. Substantive liability limitation (consequential loss excluded; aggregate cap = 12 months' fees) must appear in the executed MSA (§24.2). UCTA 1977 and EU Unfair Contract Terms Directive requirement noted.

**F6 — No retention / deletion schedule; erasure requests unhandled for new tables**

Fix applied (§29.1, §29.3): `shift_handovers` and `alarm_threshold_audit` added to RoPA with 7-year retention (safety record basis). Pseudonymisation procedure in §29.3 extended to cover `shift_handovers` — user ID columns nulled, notes prefixed with pseudonym on erasure request.

**F7 — Cross-border data transfer mechanism not formally documented**

Fix applied (§29.5): `legal/DATA_RESIDENCY.md` designated as the authoritative sub-processor list with hosting provider, region, and SCC/IDTA status. Annual DPO review and customer notification on material sub-processor change formalised.

**F8 — EU AI Act obligations not assessed**

Fix applied (§24.10): New section added. Conjunction probability model classified as high-risk AI under EU AI Act Annex III (transport infrastructure safety). Eight high-risk obligations mapped (risk management, data governance, technical documentation, logging, transparency, human oversight, accuracy/robustness, conformity assessment). Human oversight statement added as a mandatory non-configurable UI element in the §19.4 conjunction probability display. EU database registration (Art. 51) added as a Phase 3 gate. `legal/EU_AI_ACT_ASSESSMENT.md` designated as the authoritative document.

**F9 — No regulatory correspondence register**

Fix applied (§24.11): New section added. `legal/REGULATORY_CORRESPONDENCE_LOG.md` designated as a structured register. SLAs: 2-business-day acknowledgement, 14-calendar-day response. Quarterly steering review of outstanding correspondence. Proactive engagement triggered by ≥3 queries from the same authority in 12 months.

**F10 — Cookie / tracking consent mechanism not specified**

Fix applied (§29.7): New section added. Cookie audit table defined (strictly necessary / functional / analytics). `HttpOnly; Secure; SameSite=Strict` formalised as required security attributes. Consent banner specification: three tiers; preference stored in localStorage (not a cookie); re-requested on material category changes. `legal/COOKIE_POLICY.md` designated as the authoritative document.
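
The required attribute set can be produced with the standard-library cookie machinery; the cookie name and token here are illustrative:

```python
from http.cookies import SimpleCookie

def session_cookie_header(token: str) -> str:
    """Build a Set-Cookie value carrying the §29.7 required security attributes."""
    jar = SimpleCookie()
    jar["session"] = token
    jar["session"]["httponly"] = True      # not readable from JavaScript
    jar["session"]["secure"] = True        # sent over TLS only
    jar["session"]["samesite"] = "Strict"  # never sent cross-site
    return jar["session"].OutputString()
```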
**F11 — Incident notification obligations not mapped to regulatory timelines**

Fix applied (§29.6): NIS2 Art. 23 obligations added alongside GDPR Art. 33. Early warning deadline: 24 hours from awareness (NIS2) vs. 72 hours (GDPR). Full NIS2 notification: 72 hours. Final report: 1 month. On-call escalation to the DPO within the 24-hour window documented. `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` designated as the authoritative template document.
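
The parallel clocks above can be computed from a single awareness timestamp. A sketch, with "1 month" approximated as 30 days (the real mapping belongs in `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md`):

```python
from datetime import datetime, timedelta, timezone

def notification_deadlines(aware_at: datetime) -> dict:
    """Regulatory deadlines counted from the moment of awareness (§29.6).

    NIS2 Art. 23 and GDPR Art. 33 run in parallel; the earliest clock
    (the 24 h NIS2 early warning) drives on-call escalation to the DPO.
    """
    return {
        "nis2_early_warning": aware_at + timedelta(hours=24),
        "gdpr_art33_notification": aware_at + timedelta(hours=72),
        "nis2_full_notification": aware_at + timedelta(hours=72),
        "nis2_final_report": aware_at + timedelta(days=30),  # "1 month", approximated
    }
```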

---
### 49.2 Sections Modified

| Section | Change |
|---------|--------|
| §24.2 Liability and Operational Status | Added Space-Track redistribution gate (`space_track_registered`), `data_disclosure_log` table, export control screening columns and onboarding flow |
| §24.10 (new) EU AI Act Obligations | Full high-risk AI obligation mapping; human oversight statement; conformity assessment and registration roadmap |
| §24.11 (new) Regulatory Correspondence Register | Structured log specification; SLAs; escalation trigger |
| §28.6 Cognitive Load Reduction | Added legal sufficiency note on Decision Prompts disclaimer; MSA cross-reference |
| §29.1 Data Inventory | Formalised as GDPR Art. 30 RoPA; added `shift_handovers`, `alarm_threshold_audit`, `data_disclosure_log` entries; DPIA trigger documented |
| §29.3 Erasure vs. Retention Conflict | Extended pseudonymisation procedure to cover `shift_handovers` |
| §29.5 Cross-Border Data Transfer Safeguards | Added `legal/DATA_RESIDENCY.md` as authoritative document with annual review requirement |
| §29.6 Security Breach Notification | Expanded to full NIS2 Art. 23 obligations table; multi-framework notification timeline |
| §29.7 (new) Cookie / Tracking Consent | Cookie audit table; `HttpOnly; Secure; SameSite=Strict` formalised; consent banner specification |

---
### 49.3 New Tables and Columns

| Table / Column | Purpose |
|----------------|---------|
| `data_disclosure_log` | Immutable record of every TLE-derived data disclosure per organisation; supports Space-Track licence audit |
| `organisations.space_track_registered` | Gate controlling access to TLE-derived API fields |
| `organisations.country_of_incorporation` | Feeds export control screening at onboarding |
| `organisations.export_control_cleared` | Records completion of export control screening |
| `organisations.itar_cleared` | Gates EU-SST-derived data to cleared entities only |

---
### 49.4 New Legal Documents (required before Phase 2 gate)

| Document | Purpose |
|----------|---------|
| `legal/ROPA.md` | GDPR Art. 30 Record of Processing Activities — authoritative version |
| `legal/DPIA_conjunction_alerts.md` | Data Protection Impact Assessment for conjunction alert delivery |
| `legal/EXPORT_CONTROL_POLICY.md` | Export control screening procedure and embargoed-country list |
| `legal/DATA_RESIDENCY.md` | Sub-processor list with hosting regions and SCC/IDTA status |
| `legal/EU_AI_ACT_ASSESSMENT.md` | High-risk AI classification; obligation mapping; conformity assessment |
| `legal/REGULATORY_CORRESPONDENCE_LOG.md` | Structured register of regulatory correspondence |
| `legal/COOKIE_POLICY.md` | Cookie audit and consent policy |
| `legal/INCIDENT_NOTIFICATION_OBLIGATIONS.md` | Multi-framework notification timelines and templates |

---
### 49.5 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| In-UI disclaimer as sole liability protection | Substantive liability cap in executed MSA; UI disclaimer is reinforcement only |
| Serving TLE-derived data without licence verification | Gate behind `space_track_registered`; log all disclosures |
| Registering users without country-of-incorporation check | Collect at onboarding; screen against embargoed countries and BIS Entity List before account activation |
| Treating GDPR 72-hour obligation as the only notification deadline | NIS2 requires 24-hour early warning for significant incidents; both timelines must be tracked simultaneously |
| Storing consent preference in a cookie | Self-defeating; use localStorage with no expiry |
| Self-classifying the conjunction model as low-risk AI | Transport infrastructure safety = Annex III high-risk; full obligations apply regardless of system size |

---
### 49.6 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| RoPA location | `legal/ROPA.md` (authoritative) + §29.1 mirror | MASTER_PLAN only | Regulatory auditors expect a standalone document; MASTER_PLAN mirror keeps engineers informed |
| Space-Track gate mechanism | Per-org boolean + middleware check | Per-request licence verification | Per-request verification against the Space-Track API would add latency and a hard dependency; boolean flag updated at onboarding and reviewed quarterly |
| EU AI Act classification | High-risk (Annex III, transport safety) | Low-risk / unclassified | The conjunction model informs time-critical airspace decisions; conservative classification is the legally safe position; reclassification requires legal opinion |
| Cookie consent storage | localStorage | Session cookie | Storing consent in a cookie creates a circular dependency (need consent to set cookie, but cookie stores consent); localStorage avoids this without additional server round-trips |
| NIS2 applicability | Treat SpaceCom as essential entity (space traffic management) | Treat as non-essential until formally classified | Early compliance avoids a reclassification scramble; ENISA guidance indicates space infrastructure operators are likely Annex I essential entities |

---
## §50 Accessibility Engineering — Specialist Review

**Standards basis:** WCAG 2.1 Level AA, WAI-ARIA 1.2, EN 301 549 v3.2.1, Section 508, APCA contrast algorithm, ATAG 2.0

**Review scope:** Keyboard navigation, screen reader compatibility, colour contrast, motion/animation, focus management, dynamic content announcements, form accessibility, alert/modal accessibility, time-limited interactions, ARIA live regions

---
### 50.1 Findings and Fixes Applied

**F1 — No accessibility standard committed; EN 301 549 mandatory for ESA procurement**

Fix applied (§13.0, §25.6): WCAG 2.1 AA committed as the minimum standard in new §13.0. Definition of done updated: all PRs must pass axe-core wcag2a/aa checks before merge. ACR/VPAT 2.4 added to the §25.6 ESA procurement artefacts table as a required Phase 2 deliverable.

**F2 — CRITICAL alert overlay inaccessible to screen reader and keyboard users**

Fix applied (§28.3): Full ARIA alertdialog spec added: `role="alertdialog"`, `aria-modal="true"`, programmatic `focus()` on render, `aria-hidden="true"` on the map container, `aria-live="assertive"` announcement region, visible text status indicator for deaf operators, Escape key handling per severity level.

**F3 — Structured acknowledgement form has no accessible labels**

Fix applied (§28.5): Native `<input type="radio">` with `<label for="...">`, `<fieldset>` + `<legend>`, `aria-keyshortcuts` on the trigger, visible keyboard shortcut legend inside the dialog, `aria-required` on the free-text field when OTHER is selected, `aria-live="polite"` confirmation on submit.

**F4 — CesiumJS globe inaccessible; no keyboard/screen reader equivalent**

Fix applied (§13.2): New §13.2 specifies `ObjectTableView.tsx` as a parallel accessible table view, reachable via `Alt+T` and a persistent visible button. All alert interactions are completable from the table view alone. Implemented with native `<table>` elements; `aria-sort`, `aria-rowcount`, and `aria-rowindex` for virtual scroll.

**F5 — Colour is the sole differentiator for alert severity**

Fix applied (§13.4): Non-colour severity indicators specified: per-severity icon/shape (octagon/triangle/circle/circle-outline), text labels always visible, distinct border widths. The 1 Hz colour cycle also carries a 1 Hz border-width pulse as a redundant indicator.

**F6 — No keyboard navigation spec for primary operator workflow**

Fix applied (§13.3): New §13.3 specifies skip links, focus ring (3px, ≥3:1 contrast, `--focus-ring` token), tab order rules (no `tabindex > 0`), a full application keyboard shortcut table (`Alt+A/T/H/N`, `?`, `Escape`, arrow keys), `aria-keyshortcuts` on all trigger elements, and conflict-free shortcut design.

**F7 — Colour contrast ratios not specified**

Fix applied (§13.4): Verified contrast table for all operational severity colours on the dark theme background `#1A1A2E`. All pairs meet ≥4.5:1 (AA). Design token file `frontend/src/tokens/colours.ts` designated as authoritative; no hardcoded colour values in component files.
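
The AA check behind that table follows the WCAG 2.x relative-luminance and contrast-ratio formulas. A self-contained checker sketch (the white foreground below is an example pair, not one of the shipped severity colours):

```python
def _channel(c8: int) -> float:
    """Linearise one 8-bit sRGB channel per the WCAG 2.x definition."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_colour: str) -> float:
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05)."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)
```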
**F8 — Session timeout risk during shift handover**

Fix applied (§28.5a): WCAG 2.2.1 (Timing Adjustable) compliance spec added. T−2 minute warning dialog with `aria-live="polite"` announcement. Auto-extension (30 min, once per session) when the `/handover` view is active. `POST /api/v1/auth/extend-session` endpoint specified. Extension logged in `security_logs` as `SESSION_AUTO_EXTENDED_HANDOVER`.
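
The decision rule above can be captured in one small function; the action names are illustrative labels for this sketch, not the shipped client API:

```python
def session_timeout_action(seconds_remaining: int, on_handover_view: bool,
                           already_extended: bool) -> str:
    """Decide the client behaviour as a session approaches expiry (§28.5a sketch).

    Warn at T-2 min; auto-extend once (30 min) while the /handover view is
    active; otherwise the operator must respond to the warning dialog.
    """
    if seconds_remaining > 120:
        return "none"
    if on_handover_view and not already_extended:
        return "auto_extend_30min"  # logged as SESSION_AUTO_EXTENDED_HANDOVER
    return "warn_dialog"            # aria-live="polite" T-2 min warning
```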
**F9 — Decision Prompts accordion not keyboard-operable or screen-reader-friendly**

Fix applied (§28.6): Full WAI-ARIA Accordion pattern specified: `aria-expanded`, `aria-controls`, `role="region"`, `aria-labelledby`, native checkbox inputs with labels, arrow-key navigation, `aria-live="polite"` confirmation on checkbox state change.

**F10 — No reduced-motion support**

Fix applied (§28.3): `prefers-reduced-motion: reduce` CSS implementation specified for the CRITICAL banner colour cycle (a static thick border replaces the animation). CesiumJS corridor animation: JS `matchMedia` check on mount; particle animation disabled; static opacity when reduced motion is preferred. Listener on the `change` event for live preference updates without page reload.

**F11 — No accessibility testing in CI**

Fix applied (§42.2, §42.5): `e2e/test_accessibility.ts` added using `@axe-core/playwright`. Scans 5 primary views. wcag2a + wcag2aa violations block the PR; wcag2aaa issues warn only. Results published as CI artefact `a11y-report.html`. Manual screen reader test (NVDA+Firefox, VoiceOver+Safari) added to the release checklist. Decision log entry added in §42.5.

---
### 50.2 Sections Modified

| Section | Change |
|---------|--------|
| §13.0 (new) Accessibility Standard Commitment | WCAG 2.1 AA minimum standard; EN 301 549 mandatory for ESA; ACR/VPAT as Phase 2 deliverable; definition of done |
| §13.2 (new) Accessible Parallel Table View | `ObjectTableView.tsx` spec; keyboard trigger; native table markup; virtual scroll ARIA attributes |
| §13.3 (new) Keyboard Navigation Specification | Skip links; focus ring token; tab order rules; full shortcut table; `aria-keyshortcuts` convention |
| §13.4 (new) Colour and Contrast Specification | Verified contrast table; design token file; non-colour severity indicators (icons, text labels, border widths) |
| §25.6 Required ESA Procurement Artefacts | ACR/VPAT 2.4 added to artefacts table |
| §28.3 Alarm Management | CRITICAL alert ARIA spec; reduced-motion CSS spec |
| §28.5 Error Recovery | Acknowledgement form accessibility: native inputs, fieldset/legend, aria-keyshortcuts, confirmation announcement |
| §28.5a Shift Handover | Session timeout accessibility: T−2 min warning, auto-extension during handover, extend-session endpoint |
| §28.6 Cognitive Load Reduction | Decision Prompts ARIA Accordion pattern spec |
| §42.2 Test Suite Inventory | `test_accessibility.ts` added to e2e suite |
| §42.3 (renamed from 42.2) | axe-core implementation spec with code example; manual screen reader test checklist |
| §42.5 Decision Log | Accessibility CI gate decision added |

---
### 50.3 New Components

| Component / File | Purpose |
|-----------------|---------|
| `src/components/globe/ObjectTableView.tsx` | Accessible parallel table view for all globe objects |
| `frontend/src/tokens/colours.ts` | Design token file for all operational colours; authoritative contrast reference |
| `e2e/test_accessibility.ts` | `@axe-core/playwright` scans blocking PRs on WCAG 2.1 AA violations |
| `docs/RELEASE_CHECKLIST.md` | Manual screen reader test steps; keyboard-only workflow test |

---
### 50.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| `aria-label` on a `<div>` when a native `<button>` would do | Always prefer native HTML semantics; ARIA substitutes only when no native element exists |
| `outline: none` without a custom focus indicator | Never suppress the focus ring without providing an equivalent; use the `--focus-ring` token |
| `tabindex="2"` or any positive tabindex | Never; positive tabindex breaks natural reading order and confuses screen readers |
| Colour-only severity communication | Always pair colour with shape, text label, and border width as redundant indicators |
| Inline `aria-live="assertive"` for non-emergency announcements | `assertive` interrupts immediately; use `polite` for non-CRITICAL confirmations, `assertive` only for CRITICAL alerts |
| Session timeout that cannot be extended | WCAG 2.2.1 requires the ability to extend or disable timing; auto-extend during safety-critical views is the correct pattern |

---
### 50.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Globe accessibility approach | Parallel accessible table view | Making CesiumJS accessible directly | A WebGL canvas cannot be made screen-reader accessible; a parallel data view is the only WCAG-conformant approach for complex visualisations |
| Focus ring specification | 3px solid `#4A9FFF`, design token | Browser default outline | Browser default fails contrast requirements on dark themes; design token ensures consistency and testability |
| axe-core CI level | wcag2a + wcag2aa block; wcag2aaa warn | All levels block, or all levels warn | All-block creates false positives (AAA is aspirational); all-warn provides no enforcement; AA is the legal and contractual minimum |
| Reduced-motion: animation vs. static | Static thick border when `prefers-reduced-motion: reduce` | Slow down animation | Slowing animation still triggers vestibular symptoms; static replacement is the only fully safe approach |
| Session auto-extension scope | Only during `/handover` active; once per session | For any active form | Broad auto-extension creates security risk (indefinitely open sessions); limiting to handover scope is the narrowest sufficient accommodation |

---
## §52 Incident Response / Disaster Recovery Engineering — Specialist Review

**Standards basis:** NIST SP 800-61r2, ISO/IEC 27035, ISO 22301, ITIL 4, ICAO Doc 9859, AWS/GCP Well-Architected Framework (Reliability Pillar), Google SRE Book (Chapter 14)

**Review scope:** Incident classification, runbook completeness, escalation chains, RTO/RPO definition and achievability, backup and restore, chaos/game day testing, on-call rotation, post-incident review, DR site strategy, `alert_events` integrity

---
### 52.1 Findings and Fixes Applied

**F1 — RTO and RPO targets not formally defined with derivation rationale**

Fix applied (§26.2): Table expanded with a derivation column. RTO ≤ 15 min (active TIP event) derived from the 4-hour CRITICAL rate-limit window. RTO ≤ 60 min (no active event) aligns with the MSA SLA. RPO of zero for safety-critical tables derived from UN Liability Convention evidentiary requirements. MSA sign-off requirement added — customers must agree RTO/RPO before production deployment.

**F2 — No restore time target or WAL retention period**

Fix applied (§26.6): WAL retained 30 days; base backups 90 days; safety-critical tables in MinIO Object Lock COMPLIANCE mode for 7 years. Restore time target < 30 minutes documented. `docs/runbooks/db-restore.md` designated as a Phase 2 deliverable.

**F3 — No runbook for prediction service outage during active re-entry event**

Fix applied (§26.8): New runbook row added to the required runbooks table covering: detection → 5-minute ANSP notification → incident commander designation → 15-minute update cadence → restoration checklist → PIR trigger. Full procedure in `docs/runbooks/prediction-service-outage-during-active-event.md`.

**F4 — No chaos engineering / game day programme**

Fix applied (§26.8): Quarterly game day programme specified. 6 scenarios defined with inject, expected behaviour, and pass criteria. A scenario failure is treated as SEV-2 with PIR. `docs/runbooks/game-day-scenarios.md` designated.

**F5 — On-call rotation underspecified**

Fix applied (§26.8): 7-day rotation, minimum 2-engineer pool. L1 → L2 escalation trigger: 30 minutes without containment. L2 → L3 triggers enumerated (ANSP data affected, security breach, total outage > 15 min, regulatory notification triggered). On-call handoff log specified, mirroring the operator `/handover` model.
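
The escalation triggers above map cleanly onto a pure decision function. A sketch with illustrative parameter names (the authoritative trigger list is §26.8):

```python
def escalation_level(minutes_without_containment: float,
                     ansp_data_affected: bool = False,
                     security_breach: bool = False,
                     total_outage_minutes: float = 0.0,
                     regulatory_notification_triggered: bool = False) -> str:
    """Map the §26.8 escalation triggers onto an on-call level (sketch).

    Any L3 trigger wins regardless of elapsed time; otherwise 30 minutes
    without containment escalates L1 -> L2.
    """
    if (ansp_data_affected or security_breach
            or total_outage_minutes > 15 or regulatory_notification_triggered):
        return "L3"
    if minutes_without_containment >= 30:
        return "L2"
    return "L1"
```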
**F6 — No P1/P2/P3 severity communication commitments**

Fix applied (§26.8): ANSP notification commitments per SEV level added. SEV-1 with active TIP event: push + email within 5 minutes, then a 15-minute cadence. SEV-1 with no active event: email within 15 minutes. SEV-2: email within 30 minutes if predictions affected. SEV-3/4: status page only.

**F7 — No DR site or failover architecture**

Fix applied (§26.3): Cross-region warm standby architecture added. DB replica promoted on failover; app tier deployed from pre-pulled container images; MinIO bucket replication active; DNS health-check-based routing (TTL 60s). Estimated failover time < 15 minutes. Annual game day test (scenario 6). `docs/runbooks/region-failover.md` designated.

**F8 — No post-incident review process**

Fix applied (§26.8): Mandatory PIR for all SEV-1 and SEV-2 incidents, due within 5 business days. 7-section structure: summary, timeline, 5-whys root cause, contributing factors, impact, remediation actions (GitHub issues, `incident-remediation` label), what went well. Presented at engineering all-hands. Remediations carry P2 priority.

**F9 — `alert_events` not HMAC-protected**

Fix applied (§7.9, `alert_events` schema): `record_hmac TEXT NOT NULL` column added. Signing function specified over (id, object_id, org_id, level, trigger_type, created_at, acknowledged_by, action_taken). A nightly Celery Beat integrity check re-verifies all events from the past 24 hours; an HMAC failure raises a CRITICAL security alert. The existing `alert_events_immutable` trigger already prevents modification.
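
The signing scheme can be sketched with the standard library. The field list mirrors §7.9; the canonical-string separator and None-handling are assumptions to be pinned down in the real implementation, since both parties to an audit must agree on them:

```python
import hashlib
import hmac

# Signed field list per §7.9; the unit-separator join is an illustrative choice.
SIGNED_FIELDS = ("id", "object_id", "org_id", "level", "trigger_type",
                 "created_at", "acknowledged_by", "action_taken")

def sign_alert_event(record: dict, key: bytes) -> str:
    canonical = "\x1f".join(str(record.get(f, "")) for f in SIGNED_FIELDS)
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_alert_event(record: dict, key: bytes) -> bool:
    """Nightly integrity check: recompute and compare in constant time."""
    return hmac.compare_digest(sign_alert_event(record, key), record["record_hmac"])
```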
**F10 — No incident communication templates**

Fix applied (§26.8): `docs/runbooks/incident-comms-templates.md` designated with 4 templates (initial notification, 15-min update, resolution, post-incident summary). Legal counsel review required before first use. Templates specify what never to include (speculation, premature ETAs, admissions of liability).

**F11 — Operational and security incidents not separated**

Fix applied (§26.8): Operational vs. security incident comparison table added. Separate runbooks designated: `docs/runbooks/operational-incident-response.md` and `docs/runbooks/security-incident-response.md`. Security incidents: no public status page update until legal counsel approves; DPO notified within 4 hours; NIS2/GDPR timelines per §29.6.

---
### 52.2 Sections Modified

| Section | Change |
|---------|--------|
| §26.2 Recovery Objectives | Derivation rationale column; MSA sign-off requirement |
| §26.3 High Availability Architecture | Cross-region warm standby DR strategy; component failover table; estimated recovery time |
| §26.6 Backup and Restore | WAL retention 30 days; restore time target < 30 min; MinIO Object Lock for 7-year legal hold; `docs/runbooks/db-restore.md` |
| §26.8 Incident Response | Prediction-service-outage runbook; on-call rotation spec + handoff log; ANSP comms per severity; PIR process; game day programme; incident comms templates; operational/security split |
| §7.9 Data Integrity | `alert_events` HMAC signing function; nightly integrity check Celery task |
| `alert_events` schema | `record_hmac TEXT NOT NULL` column added |

---
### 52.3 New Runbooks Required (Phase 2 deliverables)

| Runbook | Trigger |
|---------|---------|
| `docs/runbooks/db-restore.md` | Monthly restore test failure; DR failover |
| `docs/runbooks/prediction-service-outage-during-active-event.md` | SEV-1 during active TIP event |
| `docs/runbooks/region-failover.md` | Cloud region failure; annual game day |
| `docs/runbooks/game-day-scenarios.md` | Quarterly game day reference |
| `docs/runbooks/incident-comms-templates.md` | All SEV-1/2 incidents |
| `docs/runbooks/operational-incident-response.md` | All operational incidents |
| `docs/runbooks/security-incident-response.md` | All security incidents |
| `docs/runbooks/on-call-handoff-log.md` | Weekly rotation boundary |
| `docs/post-incident-reviews/` | All SEV-1/2 incidents (within 5 business days) |

---
### 52.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| RTO/RPO as aspirational targets without derivation | Derive from operational requirements; document rationale; agree in MSA |
| Single-region deployment with 1-hour RTO target | Warm standby in a second region; < 15 min estimated failover |
| Conflating operational and security incident response | Separate runbooks; different escalation chains; different communication rules |
| Improvised ANSP communications under pressure | Pre-drafted legal-reviewed templates; deviations require incident commander approval |
| PIR as optional / informal | Mandatory for SEV-1/2; structured format; remediation tracking; all-hands presentation |
| Game day as a one-time activity | Quarterly rotation; each scenario tested at least annually; failures treated as SEV-2 |

---
### 52.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| DR strategy | Warm standby (second region) | Cold standby or active-active | Cold standby: restore time too slow for RTO; active-active: complexity and cost disproportionate to Phase 1 scale; warm standby meets RTO at acceptable cost |
| alert_events HMAC | Nightly batch verification | Per-request verification | Per-request adds latency to the alert delivery path; nightly batch catches tampering within 24 hours — adequate for evidentiary purposes |
| PIR timing | 5 business days | 24 hours / 30 days | 24 hours is too fast for full 5-whys analysis; 30 days allows recurrence before remediation; 5 days balances speed with quality |
| Game day cadence | Quarterly | Monthly / annually | Monthly creates operational fatigue; annually is too infrequent to maintain muscle memory; quarterly is standard SRE practice |
| On-call escalation trigger | 30 minutes without containment | 15 minutes / 60 minutes | 15 minutes is too aggressive for complex incidents; 60 minutes risks SLO breach before L2 is engaged; 30 minutes matches the active TIP event RTO window |

---

## §51 Internationalisation / Localisation Engineering — Specialist Review

**Standards basis:** Unicode CLDR 44, IETF BCP 47, ISO 8601, ICAO Annex 2 / Annex 15 / Doc 8400 (UTC mandate), POSIX locale model, W3C Internationalisation guidelines, ICU MessageFormat 2.0, EU Regulation 2018/1139 (EASA language requirements)

**Review scope:** Timezone handling, date/time display, number/unit formatting, string externalisation, RTL layout, language coverage, ICAO UTC compliance, API date formats, database timezone storage

---

### 51.1 Findings and Fixes Applied

**F1 — Operational times must be UTC; no local timezone conversion in ops interface**

Fix applied (§13.0): Iron UTC rule documented. All Persona A/C views display UTC only, formatted `HH:MMZ` or `DD MMM YYYY HH:MMZ`. `Z` suffix always inline, never a tooltip. No timezone conversion widget in operational interface. Local time permitted only in non-operational admin views with explicit timezone label. API times always ISO 8601 UTC.

**F2 — ORM may silently convert TIMESTAMPTZ to session timezone**

Fix applied (§7.9): `SET TIME ZONE 'UTC'` enforced on every connection via SQLAlchemy engine event listener. Blocking integration test `test_timestamps_round_trip_as_utc` added — asserts that a known UTC datetime survives a full ORM insert/read cycle without offset conversion.

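
A connection-level listener of this kind can be sketched as follows — a minimal illustration, not the plan's actual module (the engine DSN and helper name are placeholders):

```python
from sqlalchemy import create_engine, event


def install_utc_listener(engine):
    """Force every pooled connection to the UTC session timezone (PostgreSQL).

    Registered on the "connect" event, so it fires once per new DBAPI
    connection before it enters the pool — individual queries cannot bypass it.
    """
    @event.listens_for(engine, "connect")
    def set_utc_timezone(dbapi_connection, connection_record):
        cursor = dbapi_connection.cursor()
        cursor.execute("SET TIME ZONE 'UTC'")
        cursor.close()

    return set_utc_timezone
```

Because the listener is attached at engine creation rather than per-session, it matches the §51.5 rationale that application-level assertions can be accidentally omitted while an engine event cannot.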
**F3 — Re-entry window displayed without explicit UTC label**

Fix applied (§28.4): Rule 1 of probabilistic communication to non-specialists updated — all absolute times rendered as `HH:MMZ` per ICAO Doc 8400 UTC-suffix convention. `Z` suffix always rendered inline; never hover-only.

**F4 — Number formatting not locale-aware in non-operational views**

Fix applied (§13.4): `formatOperationalNumber()` (ICAO decimal point, invariant) and `formatDisplayNumber(locale)` (`Intl.NumberFormat`, locale-aware) helpers specified. Raw `Number.toString()` and `n.toFixed()` banned from JSX.

**F5 — No string externalisation strategy; hardcoded strings block localisation**

Fix applied (§13.5): `next-intl` adopted. All user-facing strings in `messages/en.json`. Message ID convention defined. `eslint-plugin-i18n-json` enforcement. ICAO-fixed strings explicitly excluded and annotated `// ICAO-FIXED: do not translate`.

**F6 — NOTAM draft output must be ICAO English regardless of UI locale**

Fix applied (§6.13): NOTAM template strings hardcoded ICAO English phraseology in `backend/app/modules/notam/templates.py`, annotated `# ICAO-FIXED: do not translate`. Excluded from `next-intl` extraction. Preview renders in monospace font with `lang="en"` attribute.

**F7 — Slash-delimited dates are ambiguous in exports**

Fix applied (§6.12): **`DD MMM YYYY`** format mandated for all PDF reports, CSV exports, and display previews (e.g. `04 MAR 2026`). Slash-delimited dates banned from all SpaceCom outputs. Times alongside dates use `HH:MMZ`. NOTAM internal `YYMMDDHHMM` fields displayed as `DD MMM YYYY HH:MMZ` in preview.

**F8 — RTL layout not considered; directional CSS utilities used**

Fix applied (§13.5): CSS logical properties table specified (`margin-inline-start` etc. replacing `ml-`/`mr-`). `<html dir="ltr">` hardcoded for Phase 1; becomes `dir={locale.dir}` when RTL locale added — no component changes required. `docs/ADDING_A_LOCALE.md` checklist includes RTL gate.

**F9 — Altitude units inconsistent between aviation and space personas**

Fix applied (users table, §13.5): `altitude_unit_preference` column added to `users` table (`ft` default for ANSP operators, `km` for space operators). API transmits metres; display layer converts. Unit label always visible. FL notation shown in parentheses for `ft` context. User can override in account settings.

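
The metres-in-storage, convert-at-display rule can be sketched as a single helper. This is an illustrative sketch only — the function name and output formatting are assumptions, and the parenthesised FL value is a simple hundreds-of-feet rendering, not a pressure-altitude computation:

```python
FT_PER_METRE = 3.280839895  # international foot


def format_altitude(metres: float, unit_preference: str) -> str:
    """Render a canonical metres value in the user's display unit.

    Storage and API stay in metres; only this display layer converts,
    and the unit label is always part of the output string.
    """
    if unit_preference == "ft":
        feet = metres * FT_PER_METRE
        # FL shown in parentheses for the aviation persona (hundreds of feet).
        flight_level = round(feet / 100)
        return f"{feet:,.0f} ft (FL{flight_level:03d})"
    if unit_preference == "km":
        return f"{metres / 1000:.1f} km"
    raise ValueError(f"unknown altitude unit preference: {unit_preference}")
```

Keeping the conversion in one helper makes the §51.5 decision testable: a single unit test per preference covers every view that calls it.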
**F10 — API date formats inconsistent (Unix timestamps vs. ISO 8601)**

Fix applied (§14 API Versioning Policy): ISO 8601 UTC (`2026-03-22T14:00:00Z`) mandated for all API date fields. OpenAPI `format: date-time` on all `_at`/`_time` fields. Blocking contract test asserts regex match. Pydantic `json_encoders` specified.

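
The encoder and the contract regex can be sketched in stdlib Python; in the plan the encoder is registered under Pydantic `json_encoders`, but the function itself is framework-independent (names here are illustrative):

```python
import re
from datetime import datetime, timezone

# Shape asserted by the blocking contract test: date, 'T', time, literal 'Z'.
ISO8601_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def to_iso8601_utc(dt: datetime) -> str:
    """Serialise a timezone-aware datetime as ISO 8601 UTC with a 'Z' suffix."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime rejected: API timestamps must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Rejecting naive datetimes at the encoder keeps the F2 database guarantee and the F10 API contract aligned: a timestamp that lost its timezone somewhere upstream fails loudly instead of serialising with a silent offset.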
**F11 — Language coverage undefined; English-only now but architecture must support future localisation**

Fix applied (§13.5): English-only explicitly committed for Phase 1. `next-intl` architecture allows adding a locale by adding `messages/{locale}.json` only — no component changes. `messages/fr.json` and `messages/de.json` scaffolded at Phase 2/3 start. `docs/ADDING_A_LOCALE.md` checklist documented.

---

### 51.2 Sections Modified

| Section | Change |
|---------|--------|
| §6.12 Report Generation | `DD MMM YYYY` date format rule; slash-delimited dates banned |
| §6.13 NOTAM Drafting Workflow | ICAO-FIXED template rule; `lang="en"` on NOTAM container |
| §7.9 Data Integrity | `SET TIME ZONE 'UTC'` connection event listener; `test_timestamps_round_trip_as_utc` integration test |
| §13.0 Accessibility Standard Commitment | UTC-only rule added |
| §13.4 Colour and Contrast Specification | `formatOperationalNumber` / `formatDisplayNumber` helpers; `Intl.NumberFormat` mandate |
| §13.5 (new) Internationalisation Architecture | `next-intl`; `messages/en.json`; ICAO-FIXED exclusions; CSS logical properties; altitude unit display; `docs/ADDING_A_LOCALE.md` checklist |
| §14 API Versioning Policy | ISO 8601 UTC contract; OpenAPI `format: date-time`; contract test; Pydantic encoder |
| §28.4 Probabilistic Communication | `HH:MMZ` inline UTC suffix rule |
| `users` table | `altitude_unit_preference` column added |

---

### 51.3 New Files

| File | Purpose |
|------|---------|
| `messages/en.json` | Phase 1 string source of truth for `next-intl` |
| `messages/fr.json` | Phase 2 scaffold (machine-translated placeholders; native-speaker review before deploy) |
| `messages/de.json` | Phase 3 scaffold |
| `docs/ADDING_A_LOCALE.md` | Step-by-step checklist for adding a new locale; includes RTL gate |
| `frontend/src/lib/formatters.ts` | `formatOperationalNumber`, `formatDisplayNumber`, `formatUtcTime`, `formatUtcDate` helpers |
| `tests/test_db_timezone.py` | Blocking integration test for UTC round-trip integrity |

---

### 51.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Displaying local time in the ops interface | UTC only; `HH:MMZ` always; no conversion widget |
| `Number.toString()` or `n.toFixed()` in JSX | `formatOperationalNumber()` (ICAO) or `formatDisplayNumber(locale)` depending on context |
| `03/04/2026` in any export or report | `04 MAR 2026` — unambiguous ICAO-aligned format |
| Translating NOTAM template strings | ICAO-FIXED; annotate and exclude from i18n tooling |
| Positive `tabindex` (already covered §50) | Never; noted here as it is also an i18n anti-pattern (breaks RTL reading order) |
| Hardcoded `margin-left` in new components | `margin-inline-start`; logical properties throughout |
| Multiple API date formats in same response | ISO 8601 UTC only; one format, no exceptions |

---

### 51.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Operational time display | UTC-only, `HH:MMZ` inline | User-selectable timezone | ICAO Annex 15 mandates UTC for aeronautical data; a timezone selector introduces conversion errors under time pressure |
| Date format in exports | `DD MMM YYYY` | ISO 8601 (`2026-03-04`) | ISO 8601 is unambiguous but unfamiliar to aviation professionals; `DD MMM YYYY` matches aviation document convention (NOTAM, METARs) and is equally unambiguous |
| Phase 1 language scope | English only | Multi-language from Phase 1 | Localisation adds QA overhead and translation cost before product-market fit is proven; architecture supports future locales without rework |
| i18n library | `next-intl` | `react-i18next` | `next-intl` has first-class App Router RSC support; `react-i18next` requires client-component wrapping for all translated text |
| Altitude storage unit | Metres (API + DB) | Role-dependent storage | Single SI storage unit eliminates conversion bugs in physics engine; display conversion is well-understood and testable |
| ORM timezone enforcement | Engine event listener (`SET TIME ZONE UTC`) | Application-level assertion | Engine listener fires at connection creation and cannot be bypassed by individual queries; application assertions can be accidentally omitted |

---

## §53 Machine Learning / Data Science — Specialist Review

**Standards basis:** ISO/IEC 22989, ECSS-E-ST-10-04C, IADC Space Debris Mitigation Guidelines, ESA DRAMA methodology, Vallado (2013), JB2008, NRLMSISE-00, FAA Order 8040.4B, EU AI Act Art. 10

**Review scope:** Conjunction Pc model, SGP4 domain, atmospheric density model selection, MC convergence, survival probability, model versioning, TLE age uncertainty, backcasting, input validation, tail risk, data provenance

---

### 53.1 Findings and Fixes Applied

**F1 — Conjunction probability model methodology unspecified**

Fix applied (§15.5): Alfano (2005) 2D Gaussian method already specified. Validity domain added: three degradation conditions (sub-100m close approach, anisotropic covariance > 100:1, Pc < 1×10⁻¹⁵ floor). API response carries `pc_validity` and `pc_validity_warning` fields. Reference test suite added against Vallado & Alfano (2009) published cases with 5% tolerance.

**F2 — SGP4 used beyond valid domain without sub-150 km guard**

Fix applied (§15.1): Sub-150 km `LOW_CONFIDENCE_PROPAGATION` flag added to decay predictor. UI badge: "⚠ Re-entry imminent — prediction confidence low." BLOCKING unit test: TLE with perigee 120 km → asserts flag is set.

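
The guard is a one-line threshold check; a minimal sketch (helper name and return shape are illustrative — the plan only specifies the flag string and the 150 km boundary):

```python
LOW_CONFIDENCE_PERIGEE_KM = 150.0  # SGP4 drag modelling degrades below this altitude


def propagation_flags(perigee_altitude_km: float) -> list[str]:
    """Return confidence flags for a decay prediction run."""
    flags = []
    if perigee_altitude_km < LOW_CONFIDENCE_PERIGEE_KM:
        # Mirrors the BLOCKING unit test: a 120 km perigee TLE must set this flag.
        flags.append("LOW_CONFIDENCE_PROPAGATION")
    return flags
```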
**F3 — Atmospheric density model not justified vs. JB2008**

Fix applied (§15.2): NRLMSISE-00 Phase 1 selection rationale documented (Python binding maturity, acceptable accuracy at moderate F10.7). Known limitations stated. Phase 2 milestone: evaluate JB2008 on backcasts; migrate if MAE improvement > 15%; ADR 0016. Input validity bounds added: F10.7 [65, 300], Ap [0, 400], altitude [85, 1000] km; violation raises `AtmosphericModelInputError`.

**F4 — MC sample count not justified by convergence analysis**

Fix applied (§15.2/§15.4): Convergence table added. N = 500 satisfies < 2% corridor area change between doublings on the reference object. N = 1000 for OOD or storm-warning cases. MC output updated to include p01 and p99.

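
The doubling criterion can be expressed as a small check over corridor areas measured at successive sample counts — a sketch under the assumption that areas are recorded per N in a dict (the function name is not from the plan):

```python
def mc_converged(areas_by_n: dict[int, float], tolerance: float = 0.02) -> bool:
    """Doubling criterion: corridor area must change by < 2% between N and 2N.

    `areas_by_n` maps Monte Carlo sample count -> corridor area for runs
    where each successive N doubles the previous one.
    """
    ns = sorted(areas_by_n)
    for n_small, n_big in zip(ns, ns[1:]):
        if n_big != 2 * n_small:
            raise ValueError("sample counts must double between runs")
        rel_change = abs(areas_by_n[n_big] - areas_by_n[n_small]) / areas_by_n[n_small]
        if rel_change >= tolerance:
            return False
    return True
```

This is what makes the §53.5 rationale concrete: N is not a literature constant but the smallest count for which the criterion holds on the reference object.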
**F5 — Survival probability methodology absent**

Fix applied (§15.3): `survival_probability`, `survival_model_version`, `survival_model_note` columns added to `reentry_predictions`. Phase 1: simplified analytical all-survive/no-survive per material class. Phase 2: ESA DRAMA integration. NOTAM `(E)` field statement driven by `survival_probability`.

**F6 — No model version governance or reproducibility**

Fix applied (§15.6 new): MAJOR/MINOR/PATCH version bump policy defined. Old versions retained in git tags and `physics/versions/`. `POST /decay/predict/reproduce` endpoint specified — re-runs with original model version and params for regulatory audit.

**F7 — TLE age not a formal uncertainty source**

Fix applied (§15.2): Linear inflation model added: `uncertainty_multiplier = 1 + 0.15 × tle_age_days` applied to ballistic coefficient covariance before MC sampling. `tle_age_at_prediction_time` and `uncertainty_multiplier` stored in `simulations.params_json` and returned in API response.

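
The inflation model is a direct transcription of the §15.2 formula; a minimal sketch (only the formula and the 0.15/day rate come from the plan):

```python
TLE_AGE_INFLATION_RATE = 0.15  # fractional covariance inflation per day of TLE age (§15.2)


def uncertainty_multiplier(tle_age_days: float) -> float:
    """Linear inflation applied to the ballistic-coefficient covariance
    before Monte Carlo sampling: 1 + 0.15 × TLE age in days."""
    if tle_age_days < 0:
        raise ValueError("TLE epoch is in the future")
    return 1.0 + TLE_AGE_INFLATION_RATE * tle_age_days
```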
**F8 — No model performance monitoring or drift detection**

Fix applied (§15.9 new): `reentry_backcasts` table specified. Celery task triggered on object `status = 'decayed'`; compares all 72h predictions to confirmed re-entry time. Rolling 30-prediction MAE nightly; MEDIUM alert if MAE > 2× historical baseline. Admin panel "Model Performance" widget.

**F9 — Input data quality gates insufficient**

Fix applied (§15.7 new): `validate_prediction_inputs()` function in `backend/app/modules/physics/validation.py`. Validates TLE epoch age ≤ 30 days, F10.7/Ap/perigee bounds, mass > 0. Returns structured `ValidationError` list; endpoint returns 422. All validation paths covered by BLOCKING unit tests.

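
A sketch of the validator's shape, following the reject-don't-clamp rule from §53.4. The TLE-age, F10.7, Ap, and mass rules are from the plan; the perigee bound shown reuses the §15.2 atmospheric-model altitude validity range [85, 1000] km as an assumption, since the exact perigee bound is not stated:

```python
from dataclasses import dataclass


@dataclass
class ValidationError:
    field: str
    message: str


def validate_prediction_inputs(tle_age_days, f107, ap, perigee_km, mass_kg):
    """Return a list of structured errors; the endpoint maps a non-empty list to 422."""
    errors = []
    if tle_age_days > 30:
        errors.append(ValidationError("tle", "TLE epoch older than 30 days"))
    if not 65 <= f107 <= 300:
        errors.append(ValidationError("f107", "F10.7 outside [65, 300]"))
    if not 0 <= ap <= 400:
        errors.append(ValidationError("ap", "Ap outside [0, 400]"))
    if not 85 <= perigee_km <= 1000:
        errors.append(ValidationError("perigee_km", "perigee outside model validity [85, 1000] km"))
    if mass_kg <= 0:
        errors.append(ValidationError("mass_kg", "mass must be positive"))
    # Out-of-range inputs are rejected, never silently clamped (§53.4).
    return errors
```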
**F10 — Tail risks not communicated; only p5–p95 shown**

Fix applied (§28.4, `reentry_predictions` schema): `p01_reentry_time` and `p99_reentry_time` columns added. Tail risk annotation displayed when p1–p99 range > 1.5× p5–p95 range: *"Extreme case (1% probability outside): p01Z – p99Z."* Included as NOTAM draft footnote when condition met.

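
The display condition itself is a one-line comparison of percentile windows; a sketch assuming the percentiles arrive as timezone-aware datetimes (function name illustrative):

```python
from datetime import datetime, timedelta, timezone


def show_tail_risk(p01, p05, p95, p99, factor: float = 1.5) -> bool:
    """Show the tail annotation only when the p01–p99 window is materially
    wider than the p5–p95 window (threshold factor from §28.4)."""
    return (p99 - p01) > factor * (p95 - p05)
```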
**F11 — No training/validation data provenance**

Fix applied (§15.8 new): Phase 1 explicitly documented as physics-based with no trained ML components. `docs/ml/data-provenance.md` designated. EU AI Act Art. 10 compliance mapped to input data provenance (tracked in `simulations.params_json`). Future ML component protocol: training data, validation split, model card in `docs/ml/model-card-{component}.md`.

---

### 53.2 Sections Modified

| Section | Change |
|---------|--------|
| §15.1 Catalog Propagator | Sub-150 km LOW_CONFIDENCE_PROPAGATION flag + unit test |
| §15.2 Decay Predictor | NRLMSISE-00 selection rationale vs. JB2008; input bounds; TLE age inflation model; MC convergence table; N=1000 for OOD/storm cases |
| §15.3 Atmospheric Breakup Model | `survival_probability` / `survival_model_version` / `survival_model_note` columns; Phase 1 analytical methodology |
| §15.5 Conjunction Pc | Validity domain (3 degradation conditions); `pc_validity` API fields; Vallado & Alfano reference test suite |
| §15.6 (new) Model Version Governance | MAJOR/MINOR/PATCH policy; version retention; `reproduce` endpoint |
| §15.7 (new) Prediction Input Validation | `validate_prediction_inputs()`; 5 validation rules; 422 response; BLOCKING tests |
| §15.8 (new) Data Provenance | Phase 1 no-ML declaration; EU AI Act Art. 10 mapping; future ML component protocol |
| §15.9 (new) Backcasting Validation | `reentry_backcasts` table; Celery trigger on decay; rolling MAE drift detection; admin panel widget |
| §28.4 Probabilistic Communication | Tail risk annotation (rule 6); p01/p99 display condition; NOTAM footnote |
| `reentry_predictions` schema | `p01_reentry_time`, `p99_reentry_time`, `survival_probability`, `survival_model_version`, `survival_model_note` |

---

### 53.3 New Tables and Files

| Artefact | Purpose |
|----------|---------|
| `reentry_backcasts` table | Prediction vs. actual comparison; drift detection input |
| `docs/ml/data-provenance.md` | Phase 1 no-ML declaration; future ML data provenance template |
| `docs/ml/model-card-{component}.md` | Template for any future learned component |
| `docs/adr/0016-atmospheric-density-model.md` | NRLMSISE-00 vs. JB2008 decision; Phase 2 evaluation trigger |
| `backend/app/modules/physics/validation.py` | `validate_prediction_inputs()` function |
| `tests/physics/test_pc_compute.py` | Vallado & Alfano reference cases (BLOCKING) |

---

### 53.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Displaying only p5–p95 without tail annotation | Add p1/p99 as explicit tail risk annotation when materially wider |
| Silently clamping out-of-range inputs | Reject with structured `ValidationError`; operator must correct the input |
| Deleting old model versions on update | Tag and retain; `reproduce` endpoint requires historical version access |
| Treating TLE age as display-only staleness | TLE age is a formal uncertainty source; inflate MC covariance accordingly |
| Choosing atmospheric model without documented rationale | Document selection vs. alternatives; schedule re-evaluation with objective criterion |
| No feedback loop from confirmed re-entries | Backcasting pipeline closes the loop; MAE monitoring detects drift |

---

### 53.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Phase 1 atmospheric model | NRLMSISE-00 | JB2008 | Mature Python binding; acceptable accuracy at moderate F10.7; JB2008 evaluation deferred to Phase 2 with objective trigger |
| Pc method | Alfano (2005) 2D Gaussian | Monte Carlo Pc | Alfano is computationally fast and widely accepted; MC Pc reserved for Phase 3 high-Pc cases where Gaussian assumption breaks down |
| MC convergence criterion | < 2% corridor area change between N doublings | Fixed N from literature | Fixed N is arbitrary; convergence criterion is object-class specific and reproducible |
| Tail risk display threshold | p1–p99 > 1.5× p5–p95 | Always show / never show | Always showing creates visual clutter for well-constrained predictions; never showing hides operationally relevant uncertainty; threshold balances both |
| Model version retention | Git tags + `physics/versions/` directory | Docker image tags only | Docker images are routinely pruned; git tags are permanent; `reproduce` endpoint needs the actual code, not just an image |

---

## §54 Technical Documentation / Developer Experience — Specialist Review

**Standards basis:** OpenAPI 3.1, Keep a Changelog, Conventional Commits, Nygard ADR format, WCAG authoring guidance, MkDocs Material, spectral OpenAPI linting, ESA ECSS documentation requirements

**Review scope:** OpenAPI spec governance, health endpoint coverage, contribution workflow, ADR process, changelog discipline, developer onboarding, response examples, SDK strategy, runbook structure, docs pipeline, AI assistance declaration

---

### 54.1 Findings and Fixes Applied

**F1 — OpenAPI spec not declared as source of truth**

Fix applied (§14 API Versioning Policy): FastAPI's built-in OpenAPI generation is declared as the sole source of truth. `make generate-openapi` regenerates `openapi.yaml`. CI runs `openapi-diff --fail-on-incompatible` to detect uncommitted drift. The spec is input to Swagger UI, Redoc, contract tests, and the SDK generator.

**F2 — No `/health` or `/readiness` endpoint specified**

Fix applied (§14 System endpoints): New `System (no auth required)` group added. `GET /health` — liveness probe; process-alive check only. `GET /readyz` — readiness probe; checks PostgreSQL, Redis, Celery queue depth; returns 503 when any dependency is unhealthy. Both used by Kubernetes probes, load balancers, and DR automation DNS-flip gate (§26.3). Both included in OpenAPI spec.

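
The readiness aggregation behind `GET /readyz` can be sketched as a pure function that the FastAPI route would call — check names and response shape here are illustrative, not the specified OpenAPI schema:

```python
def readiness(checks: dict[str, bool]) -> tuple[int, dict]:
    """Aggregate dependency health checks into a /readyz status and body.

    Any unhealthy dependency yields 503, so Kubernetes, load balancers,
    and the DR DNS-flip gate stop routing traffic to this instance.
    /health stays a bare process-alive 200 and never inspects dependencies.
    """
    healthy = all(checks.values())
    status = 200 if healthy else 503
    body = {"status": "ready" if healthy else "degraded", "checks": checks}
    return status, body
```

Keeping the aggregation out of the route handler makes the 503 behaviour unit-testable without a running database or broker.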
**F3 — `CONTRIBUTING.md` absent**

Fix applied (§13.6 new): Full contribution workflow documented. Branch naming convention table (feature/fix/chore/release/hotfix), `main` branch protection (1 approval, all checks pass, no force-push), Conventional Commits commit format, PR template with checklist (test, openapi regeneration, CHANGELOG, axe-core, ADR), 1-business-day review SLA, stale PR automation.

**F4 — No ADR process**

Fix applied (§13.7 new): ADR process specified using Nygard format in `docs/adr/NNNN-title.md`. Trigger criteria defined (hard-to-reverse decisions, auditor context, procurement evidence). Standard template specified. Known ADR register table provided with 6 existing entries. Phase 2 ESA submission gate: all referenced ADR numbers must have corresponding files.

**F5 — Changelog discipline unspecified**

Fix applied (§14 API Versioning Policy): Keep a Changelog format + Conventional Commits declared. `[Unreleased]` section with Added/Changed/Fixed/Deprecated subsections required on every PR with user-visible effect. `make changelog-check` CI step fails if `[Unreleased]` is empty for non-chore/docs commits. Release changelogs drive API key holder notifications and GitHub release notes.

**F6 — Developer environment setup undocumented**

Fix applied (§13.8 new): `docs/DEVELOPMENT.md` spec covering: prerequisites (Python 3.11 pinned, Node.js 20, Docker Desktop, make), `make dev-up / migrate / seed / dev` bootstrap sequence, `make test / test-backend / test-frontend / test-e2e` commands, local URL map (API, Swagger UI, frontend, MinIO). 30-minute onboarding target. `.env.example` committed; `.env` in `.gitignore`.

**F7 — OpenAPI response examples not required**

Fix applied (§14 API Versioning Policy): All endpoint schemas must include at least one `examples:` block. Enforced by `spectral lint` with custom `require-response-example` rule in CI. Example YAML fragment provided for `GET /objects/{norad_id}`. Examples serve: Swagger/Redoc docs, contract test fixtures, ESA auditor readability.

**F8 — No SDK or client library strategy**

Fix applied (§14 API Versioning Policy): Phase 1 — no SDK; ANSP integrators receive `openapi.yaml`, `docs/integration/` quickstarts (Python httpx/requests, TypeScript), and Postman-importable spec. Phase 2 gate: if ≥ 2 ANSP customers request a typed client, generate with `openapi-generator-cli` targeting Python and TypeScript. Generator config committed to `tools/sdk-generator/`. Published as `spacecom-client` PyPI and `@spacecom/client` npm packages.

**F9 — Runbooks named but not templated**

Fix applied (§26.8 new subsection): Standard runbook template specified with 7 sections: Triggers, Immediate actions (first 5 minutes), Diagnosis, Resolution steps, Verification, Escalation, Post-incident. `Last tested` frontmatter field required. `make runbook-audit` CI check warns if any runbook is older than 12 months. Template preempts the most common incident-pressure failures: vague steps, no expected output, missing escalation path.

**F10 — No docs-as-code pipeline**

Fix applied (§13.9 new): MkDocs Material as the documentation site generator. `mkdocs build --strict` in CI fails on broken links and missing pages. `markdown-link-check` for external links. `vale` prose style linter. `openapi-diff` spec drift check. ESA submission artefact: static HTML archived as `docs-site-{version}.zip` in release assets — reproducible point-in-time snapshot. `owner:` frontmatter field with quarterly `docs-review` cron issue.

**F11 — AGENTS.md scope vs. MASTER_PLAN undefined**

Fix applied (§1 Vision): AI-assisted development policy added. Defines: permitted uses (code generation, refactoring, review, documentation drafting), prohibited uses (autonomous decisions on safety-critical algorithms, auth logic, regulatory compliance text; production credentials; personal data). Human review standards apply identically to AI-generated code. ESA procurement statement: human engineers are sole responsible parties regardless of authoring tool.

---

### 54.2 Sections Modified

| Section | Change |
|---------|--------|
| §1 Vision | AI-assisted development policy; AGENTS.md scope declaration; ESA procurement statement |
| §13.6 (new) Contribution Workflow | Branch naming; commit format; PR template; review SLA; `main` protection |
| §13.7 (new) Architecture Decision Records | Nygard ADR format; trigger criteria; template; known ADR register; Phase 2 ESA gate |
| §13.8 (new) Developer Environment Setup | `docs/DEVELOPMENT.md` spec; make targets; 30-minute onboarding target; `.env.example` policy |
| §13.9 (new) Docs-as-Code Pipeline | MkDocs Material; CI checks (strict, link, vale, openapi-diff); ESA artefact; docs ownership |
| §14 API Versioning Policy | OpenAPI as source of truth; `make generate-openapi`; CI drift check; changelog discipline; response examples mandate; client SDK strategy |
| §14 System Endpoints (new) | `GET /health` liveness spec; `GET /readyz` readiness spec with example responses |
| §26.8 Incident Response | Runbook standard structure template; `Last tested` field; `make runbook-audit` |

---

### 54.3 New Tables and Files

| Artefact | Purpose |
|----------|---------|
| `CONTRIBUTING.md` | Branch naming, commit format, PR template, review SLA |
| `CHANGELOG.md` | Keep a Changelog format; `[Unreleased]` driven by PRs; release notes source |
| `docs/adr/NNNN-*.md` | Architecture Decision Records (Nygard format) |
| `docs/DEVELOPMENT.md` | Developer onboarding; make targets; environment bootstrap |
| `docs/ADDING_A_LOCALE.md` | (already referenced §13.5) — Locale addition checklist |
| `docs/integration/` | ANSP quickstart guides (Python, TypeScript) |
| `tools/sdk-generator/` | openapi-generator-cli config for Phase 2 SDK generation |
| `.github/pull_request_template.md` | PR checklist enforcing OpenAPI regeneration, CHANGELOG, axe-core, ADR |
| `.spectral.yaml` | Custom spectral ruleset including `require-response-example` |
| `.vale.ini` | Prose style linter config for docs |
| `mkdocs.yml` | MkDocs Material configuration |
| `docs/runbooks/*.md` | All runbooks follow the standard template with `Last tested` frontmatter |

---

### 54.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Maintaining a separate OpenAPI spec alongside FastAPI routes | Generate from code; enforce with CI drift check |
| Undocumented `GET /health` with ad-hoc response shape | Specify the schema, document it in OpenAPI, use it in DR automation |
| New engineers learning the codebase by asking colleagues | `docs/DEVELOPMENT.md` with 30-min onboarding target; `make dev` brings up everything |
| Architectural decisions in Slack or PR comments | ADR in `docs/adr/`; permanent and findable by auditors and new engineers |
| Runbooks written for the first time during an incident | Template-first; test in game day before needed |
| Publishing an API with no response examples | `spectral` enforces `examples:` blocks; Swagger UI shows realistic data |
| Building an SDK before customers ask | Phase 2 gate: ≥ 2 ANSP requests; Phase 1 is `openapi.yaml` + quickstarts |

---

### 54.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| OpenAPI generation direction | Code → spec (FastAPI auto-generation) | Spec → code (contract-first with codegen) | Team is Python-first; FastAPI's generation is high-fidelity; contract-first adds a separate edit step without meaningful quality gain at Phase 1 scale |
| SDK strategy | Generated from spec (Phase 2) | Hand-crafted SDK | Generated SDK stays in sync with spec automatically; hand-crafted SDKs drift; generation deferred until customer demand justifies maintenance cost |
| Documentation tooling | MkDocs Material | Docusaurus, GitBook | MkDocs Material is Python-native (same toolchain as backend); `mkdocs build --strict` provides CI integration; no JS toolchain dependency for docs |
| ADR format | Nygard (Context/Decision/Consequences) | MADR, RFC-style | Nygard is the most widely recognised format; recognised by ESA/public-sector auditors; minimal overhead |
| AI assistance declaration | Explicit policy in §1 Vision | Silent (no declaration) | ESA and EASA increasingly require disclosure of AI tool use in safety-relevant software; proactive disclosure pre-empts audit questions and demonstrates process maturity |

---

## §55 Multi-Tenancy, Billing & Org Management — Specialist Review

**Standards basis:** GDPR Art. 17/20, PCI-DSS (if card payments introduced), SaaS subscription billing conventions, PostgreSQL Row Level Security documentation, Celery priority queue documentation, ICAO Annex 11 (operator accountability)

**Review scope:** Data isolation, subscription tier model, usage metering, org lifecycle, API key governance, quota enforcement, queue fairness, audit log access, billing data model, data portability

---

### 55.1 Findings and Fixes Applied

**F1 — No row-level tenant isolation strategy defined**

Fix applied (§7.2): Comprehensive RLS policy table added covering all 8 `organisation_id`-carrying tables. `spacecom_worker` database role specified as the only `BYPASSRLS` principal. BLOCKING integration test specified: query as Org A session; assert zero Org B rows across all tenanted tables.

**F2 — Subscription tiers and feature flags not specified**

Fix applied (§16.1 new): Tier table defined (`shadow_trial`, `ansp_operational`, `space_operator`, `institutional`, `internal`) with per-tier MC concurrency, prediction quota, and feature access. `require_tier()` FastAPI dependency pattern specified. `TIER_MC_CONCURRENCY` dict ties limits to tier. Tier changes take immediate effect (no session cache).

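
The gating logic behind the `require_tier()` pattern can be sketched as a plain factory; in the plan the returned checker would be wrapped as a FastAPI dependency, and the allowed-set style here (rather than an ordered hierarchy) is an assumption for illustration:

```python
class TierNotAllowed(Exception):
    """Translated to HTTP 403 by the FastAPI dependency wrapper."""


def require_tier(*allowed: str):
    """Gate a feature to specific subscription tiers (tier names from §16.1).

    Returns a checker; because the org tier is read per-request rather than
    cached in a session, tier changes take immediate effect.
    """
    def check(org_tier: str) -> str:
        if org_tier not in allowed:
            raise TierNotAllowed(f"feature requires tier in {sorted(allowed)}")
        return org_tier

    return check
```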
**F3 — Usage metering not modelled**

Fix applied (§9.2): `usage_events` table added — append-only, immutable trigger, indexed by `(organisation_id, billing_period, event_type)`. Billable event types: `decay_prediction_run`, `conjunction_screen_run`, `report_export`, `api_request`, `mc_quota_exhausted`, `reentry_plan_run`. Powers org admin usage dashboard and upsell trigger.

**F4 — Organisation onboarding and offboarding procedures absent**

Fix applied (§29.8 new): Onboarding gate checklist specified (MSA, export control, Space-Track, billing contact, org_admin user, ToS). Offboarding 8-step procedure with timing, owner, and GDPR Art. 17 vs. retention resolution. Suspension vs. churn distinction documented. `docs/runbooks/org-onboarding.md` designated.

**F5 — API key lifecycle lacks org-level service account concept**

Fix applied (§9.2 `api_keys` table): `is_service_account` column added; `user_id` made nullable for service account keys; `service_account_name` required when `is_service_account = TRUE`; `revoked_by` column added for org_admin audit trail. CHECK constraints enforce mutual exclusivity. Org admin can see and revoke all org keys via `GET/DELETE /org/api-keys`.

**F6 — Concurrent prediction limit not persisted and not tier-linked**

Fix applied (§16.1, Celery section): `acquire_mc_slot` now derives limit from `org_tier` via `get_mc_concurrency_limit_by_tier()`. Quota exhaustion writes `usage_events` row with `event_type = 'mc_quota_exhausted'`. Org admin usage dashboard shows hits per billing period with upgrade prompt if hits ≥ 3.

**F7 — No org-level admin role**

Fix applied (§7.2 RBAC table, `users.role` CHECK): `org_admin` role added between `operator` and `admin`. Permissions: manage users within own org (up to `operator`), manage own org's API keys, view own org's audit log, update billing contact. Cannot cross org boundaries or assign `admin`/`org_admin` without system admin.

**F8 — Shared Celery queues with no per-org priority**

Fix applied (Celery Queue section): `TIER_TASK_PRIORITY` table (3–9 by tier) with `CRITICAL_EVENT_PRIORITY_BOOST = 2` when active TIP event exists. `get_task_priority()` function specified. Priority submitted via `apply_async(priority=...)`. Redis `noeviction` policy supports native Celery priorities 0–9.

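
A sketch of `get_task_priority()` under stated assumptions: the per-tier values below are illustrative placeholders within the 3–9 band (the real map lives in the Celery Queue section), the boost constant is from the plan, and the higher-number-wins direction follows the plan's convention — note that Celery's effective priority ordering differs between brokers, so the real implementation must match the Redis transport configuration:

```python
# Illustrative tier -> priority values; only the 3-9 band is specified in the plan.
TIER_TASK_PRIORITY = {
    "shadow_trial": 3,
    "space_operator": 6,
    "ansp_operational": 7,
    "institutional": 8,
    "internal": 9,
}
CRITICAL_EVENT_PRIORITY_BOOST = 2
MAX_PRIORITY = 9  # priorities run 0-9


def get_task_priority(org_tier: str, active_tip_event: bool) -> int:
    """Priority passed to apply_async(priority=...), boosted during an active TIP event."""
    priority = TIER_TASK_PRIORITY[org_tier]
    if active_tip_event:
        priority = min(priority + CRITICAL_EVENT_PRIORITY_BOOST, MAX_PRIORITY)
    return priority
```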
**F9 — No tenant-scoped audit log API**
|
||
Fix applied (§14 Org Admin endpoints): `GET /org/audit-log` added — paginated, filtered by `organisation_id`, supports `?from=&to=&event_type=&user_id=`. Sources `security_logs` and `alert_events`. Accessible to `org_admin` and `admin`. Required by enterprise SaaS compliance expectations.
**F10 — Billing data model absent**
Fix applied (§9.2): `billing_contacts` table (email, name, address, VAT, PO reference), `subscription_periods` table (immutable billing history with tier, dates, monthly fee, invoice reference). `PATCH /org/billing` endpoint for org_admin self-service updates. Phase 1 billing is manual; `invoice_ref` field accommodates future Stripe or Lago integration.
**F11 — No org data export or portability mechanism**
Fix applied (§14 Org Admin endpoints, §29.2): `POST /org/export` endpoint added — async job, delivers signed ZIP within 3 business days. Used for GDPR Art. 20 portability and offboarding. §29.2 portability row updated with endpoint reference and scope clarification (user-generated content, not derived predictions).
---
### 55.2 Sections Modified
| Section | Change |
|---------|--------|
| §7.2 RBAC | `org_admin` role added; comprehensive RLS policy table; `spacecom_worker` BYPASSRLS principal; `users.role` CHECK constraint updated |
| §9.2 `api_keys` | `is_service_account`, `service_account_name`, `revoked_by` columns; CHECK constraints; service account index |
| §9.2 (new tables) | `usage_events`, `billing_contacts`, `subscription_periods` |
| §14 Org Admin endpoints (new group) | 10 `org_admin`-scoped endpoints covering users, API keys, audit log, usage, billing, and data export |
| §14 Admin endpoints | `GET /admin/organisations`, `POST /admin/organisations`, `PATCH /admin/organisations/{id}` added |
| §16.1 (new) Subscription Tiers | Tier table; `require_tier()` pattern; `TIER_MC_CONCURRENCY`; tier change immediacy |
| Celery Queue section | `TIER_TASK_PRIORITY` priority map; `CRITICAL_EVENT_PRIORITY_BOOST`; `get_task_priority()` function |
| MC concurrency gate | `acquire_mc_slot` now tier-driven; quota exhaustion writes `usage_events` |
| §29.2 Data Subject Rights | Portability row updated with `POST /org/export` endpoint and scope |
| §29.8 (new) Org Onboarding/Offboarding | 6-gate onboarding checklist; 8-step offboarding procedure; suspension vs. churn distinction |
---
### 55.3 New Tables and Files
| Artefact | Purpose |
|----------|---------|
| `usage_events` table | Billable event metering; org admin dashboard; quota exhaustion signal |
| `billing_contacts` table | Invoice address, VAT, PO number per org |
| `subscription_periods` table | Immutable billing history; Phase 2 invoice integration anchor |
| `docs/runbooks/org-onboarding.md` | Onboarding gate checklist; provisioning procedure |
| `backend/app/modules/billing/tiers.py` | `get_mc_concurrency_limit_by_tier()` and `TIER_TASK_PRIORITY` |
---
### 55.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|-------------|-----------------|
| Relying solely on application-layer `WHERE organisation_id = X` | RLS at database layer; application filter is defence-in-depth only |
| Role model with only system-wide `admin` | `org_admin` for self-service tenant management; `admin` for cross-org system operations |
| Flat API key model with no service accounts | Service account keys (`user_id IS NULL`) for system integrations; org admin can audit and revoke all keys |
| Sharing Celery queue with equal priority for all orgs | Priority queue by tier + active event boost prevents low-tier bulk jobs starving safety-critical work |
| No audit log access for tenants | Tenant-scoped `GET /org/audit-log`; required by enterprise procurement and insurance |
| Treating `subscription_tier` as static configuration | Tier changes must be real-time enforced; `require_tier()` reads from DB on each request |
---
### 55.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Tenant isolation mechanism | PostgreSQL RLS + application filter | Application filter only | RLS enforces at DB layer; a single missing WHERE clause in application code cannot leak cross-tenant data |
| Tier change immediacy | Real-time DB read on each request | Cached in JWT claim | JWT caching means downgraded orgs continue at higher tier until token expires; unacceptable for billing correctness |
| Billing integration (Phase 1) | Manual + `subscription_periods` table | Stripe/Lago from day 1 | Phase 1 has ≤5 paying customers; manual invoicing is sufficient; `invoice_ref` field enables future integration without schema migration |
| org_admin role scope | Cannot assign `admin` or `org_admin` without system admin approval | Full self-service role management | Self-service `org_admin` assignment creates privilege escalation paths; system admin as approval gate is a standard SaaS pattern |
| Service account API keys | `user_id IS NULL` with `is_service_account = TRUE` flag | Separate `service_accounts` table | Single `api_keys` table is simpler; constraints enforce consistency; avoids JOIN complexity for key lookup hot path |
---
## §56 Testing Strategy — Specialist Review
**Standards basis:** pytest, pytest-cov, mutmut, k6, Playwright, openapi-typescript, freezegun, ISTQB test level definitions, ESA ECSS-E-ST-40C software testing standard
**Review scope:** Coverage standard, test taxonomy, test data management, frontend/API contract drift, mutation testing, performance test specification, environment parity, safety-critical labelling, WebSocket E2E, MC determinism, ESA test plan artefact
---
### 56.1 Findings and Fixes Applied
**F1 — No test coverage standard defined**
Fix applied (§17.0): Coverage thresholds declared: 80% line / 70% branch for backend (`pytest-cov`), 75% line for frontend (Jest). Enforced via `pyproject.toml` `--cov-fail-under`. Measured on the integration run (real DB), not unit-only. Coverage artefact required in Phase 2 ESA submission.
**F2 — Test level boundary undefined**
Fix applied (§17.0): Three-level taxonomy defined: unit (no I/O, `tests/unit/`), integration (real DB + Redis, `tests/integration/`), E2E (full stack + browser, `e2e/`). Rules specify which level each category of test belongs to. Stops developers placing DB tests in `tests/unit/` or mocking the database in integration tests.
**F3 — Test data management strategy absent**
Fix applied (§17.0): Committed JSON reference data for physics; transaction-rollback isolation for integration tests; `freezegun` mandate for all time-dependent tests; fictional NORAD IDs (90001–90099) and generated org names for sensitive data. Prevents flaky time-dependent failures and production-data leakage into the test repo.
**F4 — No contract testing between frontend and API**
Fix applied (§14): `openapi-typescript` generates `frontend/src/types/api.generated.ts` from `openapi.yaml`. Frontend imports only from the generated file. `make check-api-types` CI step fails on any drift. Replaces Pact-style consumer-driven contracts at Phase 1 scale — simpler, equally effective for a single-team project.
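A hypothetical shape for the drift gate, assuming the target regenerates types to a temporary path and diffs against the committed file (target name and paths follow the text; the exact recipe is an assumption):

```make
# Sketch of the CI drift gate (recipe is illustrative)
check-api-types:
	npx openapi-typescript openapi.yaml --output /tmp/api.generated.ts
	diff -u frontend/src/types/api.generated.ts /tmp/api.generated.ts \
		|| (echo "API types drifted: regenerate frontend/src/types/api.generated.ts" && exit 1)
```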
**F5 — Mutation testing not specified**
Fix applied (§17.0): `mutmut` runs weekly against `physics/` and `alerts/` modules. Threshold: ≥ 70% mutation score. Results published to CI artefacts. > 5 percentage point drop between runs creates a `mutation-regression` issue automatically.
**F6 — Performance test specification informal**
Fix applied (§27.0 new): k6 chosen as the load testing tool. Three scenarios specified: CZML catalog ramp, 200 WebSocket subscribers, decay submit constant arrival rate. SLO thresholds as k6 `thresholds` (test fails if breached). Baseline hardware spec documented in `docs/validation/load-test-baseline.md`. Results stored as JSON and trended; > 20% p95 increase creates `performance-regression` issue.
**F7 — Test environment parity unspecified**
Fix applied (§17.0): `docker-compose.ci.yml` must use pinned image tags matching production (not `latest`). `make test` fails if `TIMESCALEDB_VERSION` env var does not match `docker-compose.yml`. MinIO used in CI (not mocked). Prevents the class of "passes in CI, fails in prod" due to minor version differences in TimescaleDB chunk behaviour.
**F8 — Safety-critical tests not labelled**
Fix applied (§17.0): `@pytest.mark.safety_critical` marker defined in `conftest.py`. Applied to: cross-tenant isolation, HMAC integrity, sub-150km guard, shadow segregation, and any other safety-invariant test. Separate fast CI job (`pytest -m safety_critical`, target < 2 min) runs on every commit before the full suite.
**F9 — No E2E test for WebSocket alert delivery**
Fix applied (§42.2 E2E test inventory, accessibility section): `e2e/test_alert_websocket.ts` added. Full path: submit prediction via API → Celery completes → CRITICAL alert appears in browser DOM via WebSocket within 60 seconds. BLOCKING. Intermittent failures are root-cause investigated, not quarantined.
**F10 — Physics tests non-deterministic**
Fix applied (§17.0): `np.random.seed(42)` `autouse` fixture in `tests/conftest.py`. `seed=42` passed explicitly to all MC calls in tests. Seed value pinned; a PR changing it without updating baselines fails the review checklist. MC-based tests are now fully reproducible across machines and Python versions.
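The plan pins NumPy's global seed; the same property can be illustrated with the standard library alone (a stand-in for the real NumPy-based fixture, not the project code):

```python
import random

def mc_draw(seed: int = 42, n: int = 3) -> list[float]:
    """Stand-in for a seeded Monte Carlo draw: identical seed -> identical samples."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same pinned seed reproduces the run exactly on any machine; a different seed
# produces a different trajectory, which is why changing the pinned value must
# also update the test baselines.
assert mc_draw(42) == mc_draw(42)
assert mc_draw(42) != mc_draw(43)
```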
**F11 — No test plan document for ESA submission**
Fix applied (§17.0): `docs/TEST_PLAN.md` structure specified with 6 sections including safety-critical traceability matrix (requirement → test ID → test name → result). This is the primary software assurance evidence document for the ESA bid. Required as a Phase 2 deliverable.
**Bind mount strategy (companion fix)**
Fix applied (§3.3 Docker Compose): Host bind mounts specified for logs, exports, config, and DB data. Eliminates the need for `docker compose exec` for all routine operations. `/data/postgres` and `/data/minio` outside the project directory to prevent accidental wipe. `make init-dirs` creates the host directory structure before first `docker compose up`. `make logs SERVICE=backend` convenience alias.
---
### 56.2 Sections Modified
| Section | Change |
|---------|--------|
| §3.3 Docker Compose | Host bind mount specification; host directory layout; `make init-dirs`; `:ro` config mounts |
| §13.8 Developer Environment Setup | `make init-dirs` added to bootstrap sequence |
| §17.0 (new) Test Standards and Strategy | Full test taxonomy, coverage standard, fixture isolation, `freezegun`, safety_critical marker, MC seed, mutation testing, env parity, `docs/TEST_PLAN.md` structure |
| §27.0 (new) Performance Test Specification | k6 scenarios, SLO thresholds, baseline hardware spec, result storage and trending |
| §14 API Versioning Policy | `openapi-typescript` contract type generation; `make check-api-types` CI step |
| §42.2 E2E Test Inventory | `test_alert_websocket.ts` added; full WebSocket delivery E2E spec |
---
### 56.3 New Tables and Files
| Artefact | Purpose |
|----------|---------|
| `tests/unit/`, `tests/integration/`, `e2e/` | Canonical test directory structure per taxonomy |
| `e2e/test_alert_websocket.ts` | WebSocket alert delivery E2E test |
| `tests/conftest.py` | `seed_rng` autouse fixture; `safety_critical` marker registration |
| `docs/TEST_PLAN.md` | ESA Phase 2 deliverable; traceability matrix |
| `docs/validation/load-test-baseline.md` | k6 baseline hardware and data spec |
| `docs/validation/load-test-results/` | Stored k6 JSON results for trending |
| `tests/load/scenarios.js` | k6 scenario definitions |
| `frontend/src/types/api.generated.ts` | Generated TypeScript API types from `openapi.yaml` |
| `scripts/load-test-trend.py` | p95 latency trend chart generator |
---
### 56.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|-------------|-----------------|
| Mocking the database in integration tests | Transaction-rollback isolation against a real DB; mocks hide schema and RLS bugs |
| `datetime.utcnow()` in tests | `freezegun` `@freeze_time` decorator; tests must be time-independent |
| Non-deterministic MC tests | `np.random.seed(42)` autouse fixture; same seed → same output everywhere |
| Coverage measured on unit tests only | Integration run coverage includes DB-layer code; unit-only inflates the number |
| Putting safety-critical tests in the full suite only | `pytest -m safety_critical` fast job on every commit; never wait for the full suite to catch a safety regression |
| Performance test results not stored | JSON output committed to `docs/validation/`; trend script flags regressions |
---
### 56.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Frontend/API contract testing | `openapi-typescript` generated types + `make check-api-types` | Pact consumer-driven contracts | Pact requires a broker and bidirectional test setup; `openapi-typescript` achieves the same drift detection with a single CI command at Phase 1 team size |
| Performance test tool | k6 | Locust, Gatling | k6 is JavaScript-native (same language as frontend tests); scripting is lightweight; built-in threshold assertions; good CI integration |
| Coverage measurement scope | Integration test run | Unit test run | Unit-only coverage excludes database, Redis, and auth middleware code paths — the most likely sources of prod bugs |
| Mutation testing scope | `physics/` and `alerts/` only (weekly) | Full codebase (every commit) | Full-codebase mutation testing on every commit would take hours; scoping to highest-consequence modules provides meaningful signal at reasonable cost |
| Host bind mounts approach | Named directories under `/opt/spacecom/` with `make init-dirs` | Named Docker volumes | Host bind mounts are directly accessible via SSH without `docker exec`; named volumes require exec or a volume driver for host access |
---
## §57 Observability & Monitoring — Specialist Review
**Hat:** Observability & Monitoring
**Findings reviewed:** 11
**Sections modified:** §26.6, §26.7
**Date:** 2026-03-24
---
### 57.1 Findings and Fixes Applied
**F1 — Prometheus metric naming convention not defined**
Fix applied (§26.7 new): Naming convention table added before metric definitions. Rules: `spacecom_` namespace required; unit suffix mandatory; `_total` for counters; high-cardinality identifiers (`norad_id`, `organisation_id`, `user_id`, `request_id`) banned from metric labels; snake_case labels only. CI `make lint-metrics` step validates names against the convention pattern.
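The lint behind `make lint-metrics` could take roughly this shape. The banned-label list is from the plan; the exact regex and the set of allowed unit suffixes here are assumptions:

```python
import re

# Sketch of the metric-name lint. The suffix list and pattern are illustrative;
# the banned high-cardinality labels follow the plan.
METRIC_NAME_RE = re.compile(
    r"^spacecom_[a-z][a-z0-9_]*(_total|_seconds|_bytes|_ratio|_depth|_count)$"
)
BANNED_LABELS = {"norad_id", "organisation_id", "user_id", "request_id"}

def lint_metric(name: str, labels: set[str]) -> list[str]:
    """Return a list of convention violations for one metric definition."""
    errors = []
    if not METRIC_NAME_RE.match(name):
        errors.append(f"{name}: violates naming convention")
    for label in sorted(labels & BANNED_LABELS):
        errors.append(f"{name}: high-cardinality label '{label}' is banned")
    return errors
```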
**F2 — SLO burn rate alerting single-window only**
Fix applied (§26.7): Replaced single `ErrorBudgetBurnRate` alert with two-alert multi-window pattern. `ErrorBudgetFastBurn` (1h + 5min windows, 14.4× multiplier, `for: 2m`) catches sudden outages. `ErrorBudgetSlowBurn` (6h + 1h windows, 6× multiplier, `for: 15m`) catches gradual degradation before the budget exhausts silently. Three recording rules added (`rate1h`, `rate6h`, `rate5m`).
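A sketch of the two-alert pattern, assuming a 99.9% SLO (error-budget fraction 0.001) and recording rules named after the windows listed above (the exact rule names and severities are assumptions):

```yaml
# Sketch only: multi-window burn rate alerts for an assumed 99.9% SLO
- alert: ErrorBudgetFastBurn
  expr: >
    spacecom:api_error:rate1h > (14.4 * 0.001)
    and spacecom:api_error:rate5m > (14.4 * 0.001)
  for: 2m
  labels: {severity: page}
- alert: ErrorBudgetSlowBurn
  expr: >
    spacecom:api_error:rate6h > (6 * 0.001)
    and spacecom:api_error:rate1h > (6 * 0.001)
  for: 15m
  labels: {severity: ticket}
```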
**F3 — Structured log schema undefined**
Already substantially addressed in §2274: `REQUIRED_LOG_FIELDS` schema with 10 mandatory fields, sanitising processor, `request_id` correlation middleware, and log-integrity policy. No further action required for F3 — confirmed as covered.
**F4 — Distributed tracing not specified for Celery path**
Fix applied (§26.7): Explicit Celery W3C `traceparent` propagation spec added. `CeleryInstrumentor` handles automatic propagation; `request_id` passed in task kwargs as Phase 1 fallback when `OTEL_SDK_DISABLED=true`. Integration test stub specified to verify trace continuity from HTTP handler through worker span.
**F5 — No alerting rule coverage audit**
Fix applied (§26.7 new): Alert coverage audit table added mapping every SLO and safety invariant to its alert rule. Three gaps identified: `EopMirrorDisagreement` (Phase 1 gap — metric exists, alert rule missing), `DbReplicationLagHigh` (Phase 2 gap — requires streaming replication), and `BackupJobFailed` (Phase 1 gap).
**F6 — High-cardinality label risk**
Already addressed: `norad_id` label was already noted as "Grafana drill-down only; alert via recording rule" in the existing metric definition comment. F1 naming convention formalises this as an explicit prohibition with a CI-enforced lint rule. No additional edit required.
**F7 — On-call dashboard not specified**
Fix applied (§26.7): Operational Overview dashboard panel layout mandated. 8-panel grid with fixed row order; rows 1–2 visible without scroll at 1080p. Each panel maps to a specific metric and threshold. Dashboard UID pinned in AlertManager `dashboard_url` annotations. Design criterion: answer "is the system healthy?" within 15 seconds.
**F8 — Celery queue depth alerting threshold-only**
Fix applied (§26.7): `CelerySimulationQueueGrowing` alert added using `rate(spacecom_celery_queue_depth{queue="simulation"}[10m]) > 2` with `for: 5m`. Complements the existing threshold-based `CelerySimulationQueueDeep`. Growth rate alert catches a rising queue before it breaches the absolute threshold.
**F9 — No DLQ monitoring**
Already addressed: `DLQGrowing` alert (`increase(spacecom_dlq_depth[10m]) > 0`) and `spacecom_dlq_depth` metric were already specified in §26.7. F9 confirmed as covered — no further action required.
**F10 — Log retention and SIEM integration not specified**
Fix applied (§26.6 new): Application log retention policy table added. Container stdout: 7 days (Docker json-file). Loki: 90 days (covers incident investigation SLA). Safety-relevant log lines: 7 years (MinIO, matching database safety record retention). SIEM forwarding: per customer contract. Loki retention YAML configuration specified. Phase 1 interim: Celery Beat daily export of CRITICAL log lines to MinIO until Loki ruler is deployed.
**F11 — No alerting runbook cross-reference mandate**
Fix applied (§26.7): `runbook_url` added to `WebSocketCeilingApproaching` (previously missing). Mandate added: every AlertManager rule must include `annotations.runbook_url` pointing to an existing file in `docs/runbooks/`. `make lint-alerts` CI step enforces this using `promtool check rules` plus a custom script that validates the URL resolves to a real markdown file.
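The custom half of the check could look like the sketch below, operating on the already-parsed rule list (YAML parsing omitted; the URL-to-filename convention shown is an assumption):

```python
from pathlib import Path

# Sketch of the scripts/lint-alerts.py core. The plan pairs this with
# `promtool check rules`; `rules` is the parsed YAML rule list.
def lint_runbook_urls(rules: list[dict], runbook_dir: Path) -> list[str]:
    errors = []
    for rule in rules:
        url = rule.get("annotations", {}).get("runbook_url")
        if not url:
            errors.append(f"{rule['alert']}: missing annotations.runbook_url")
            continue
        # Convention assumed here: the URL ends in the markdown filename
        # under docs/runbooks/.
        if not (runbook_dir / url.rsplit("/", 1)[-1]).is_file():
            errors.append(f"{rule['alert']}: runbook_url does not resolve to a file")
    return errors
```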
---
### 57.2 Sections Modified
| Section | Change |
|---------|--------|
| §26.6 Backup and Restore | Application log retention policy table added; Loki 90-day retention config; safety-critical log line archival to MinIO |
| §26.7 Prometheus Metrics | Metric naming convention table; multi-window burn rate recording rules and alerts; Celery trace propagation spec; queue growth rate alert; alert coverage audit table; `runbook_url` mandate; `WebSocketCeilingApproaching` runbook link added; on-call dashboard panel layout mandated |
---
### 57.3 New Tables and Files
| Artefact | Purpose |
|----------|---------|
| `monitoring/alertmanager/spacecom-rules.yml` | Updated with multi-window burn rate alerts and queue growth alert |
| `monitoring/loki-config.yml` | 90-day retention configuration |
| `monitoring/recording-rules.yml` | Three burn rate recording rules |
| `docs/runbooks/capacity-limits.md` | Referenced by WebSocketCeilingApproaching; Phase 2 deliverable |
| `scripts/lint-alerts.py` | CI script validating `runbook_url` annotation on every alert rule |
| `monitoring/grafana/dashboards/operational-overview.json` | Codified panel layout per §26.7 on-call dashboard spec |
| `tests/integration/test_tracing.py` | Celery trace propagation integration test stub |
---
### 57.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|-------------|-----------------|
| Single-window burn rate alert (`for: 30m`) | Multi-window fast+slow burn: catches both sudden outages and slow degradations |
| `norad_id` or `organisation_id` as Prometheus label | Recording rule aggregates; high-cardinality identifiers in log fields or exemplars only |
| Alert rules without `runbook_url` | `make lint-alerts` enforces presence; a page at 3am without a runbook link adds ~5 min to MTTR |
| Threshold-only queue alerts | Complement with rate-of-growth alert; threshold fires too late on a gradually filling queue |
| On-call dashboard with no defined layout | Mandated panel order; rows 1–2 visible without scroll; 15-second health answer target |
| Application logs with no retention policy | Explicit tier policy: 7 days local, 90 days Loki, 7 years for safety-relevant lines |
---
### 57.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Burn rate multipliers | 14.4× (fast, 1h) / 6× (slow, 6h) | Custom thresholds | Google SRE Workbook standard multipliers for 99.9% SLO; well-understood by on-call engineers familiar with SRE literature |
| Loki retention | 90 days | 30 days / 1 year | 30 days is insufficient for post-incident reviews triggered by regulatory queries; 1 year is expensive for high-volume structured logs; 90 days covers all contractual and regulatory investigation windows |
| Fast burn `for: 2m` | 2 minutes | Immediate (no `for`) | Without a `for` clause, a single scraped bad value pages on-call; 2 minutes filters transient scrape errors while still alerting within 5 minutes of a real outage |
| Celery trace propagation | `CeleryInstrumentor` + explicit `request_id` kwargs | OTel only | OTel-only approach breaks Phase 1 when `OTEL_SDK_DISABLED=true`; explicit kwargs are a zero-dependency fallback that costs nothing and ensures log correlation always works |
---
## §58 Performance & Scalability — Specialist Review
**Hat:** Performance & Scalability
**Findings reviewed:** 11
**Sections modified:** §3.2, §9.4, §16 (CZML cache), §34.2 (Caddyfile), Celery config
**Date:** 2026-03-24
---
### 58.1 Findings and Fixes Applied
**F1 — No index strategy documented beyond primary keys**
Already addressed: §9.3 contains a comprehensive index specification with 10+ named indexes covering all identified hot paths: `orbits` (CZML generation), `reentry_predictions` (latest per object, partial), `alert_events` (unacknowledged per org, partial), `jobs` (queued, partial), `refresh_tokens` (live only, partial), PostGIS GiST indexes on all geometry columns, `tle_sets` (latest per object), `security_logs` (user+time). F1 confirmed as covered — no further action required.
**F2 — PgBouncer pool size not derived from workload**
Fix applied (§3.2 technology table): Derivation rationale added inline. `max_client_conn=200` derived from: 2 backend × 40 async + 4 sim workers × 16 + 2 ingest × 4 = 152 peak, 200 burst headroom. `default_pool_size=20` derived from `max_connections=50` with 5 reserved for superuser. Validation query (`SHOW pools; cl_waiting > 0 = undersized`) documented.
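The derivation in the text can be reproduced directly:

```python
# Reproducing the pool-size arithmetic from the text.
backend_conns = 2 * 40   # 2 backend instances x 40 async connections each
sim_conns     = 4 * 16   # 4 simulation workers x 16
ingest_conns  = 2 * 4    # 2 ingest workers x 4
peak = backend_conns + sim_conns + ingest_conns
assert peak == 152       # max_client_conn=200 leaves 48 connections of burst headroom
```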
**F3 — N+1 query risk in catalog and alert APIs**
Already addressed: §16 (CZML and API performance section) already specifies ORM loading strategies: `selectinload` for Event Detail and active alerts; raw SQL with explicit JOIN for CZML catalog bulk fetch (ORM overhead unacceptable at 864k rows). F3 confirmed as covered — no further action required.
**F4 — Redis cache eviction policy not specified**
Already addressed: §16 Redis key namespace table specifies `noeviction` for `celery:*` and `redbeat:*`, `allkeys-lru` for `cache:*`, `volatile-lru` for `ws:session:*`. Separate Redis DB indexes mandated. F4 confirmed as covered — no further action required.
**F5 — CZML cache invalidation strategy incomplete**
Fix applied (§16): Invalidation trigger table added (TLE re-ingest, propagation completion, new prediction, admin flush, cold start). Stale-while-revalidate strategy specified: stale key served immediately on primary expiry; background recompute enqueued; max stale age 5 minutes. `warm_czml_cache` Celery task specified for cold start and DR failover; estimated 30–60 seconds for 600 objects. Cold-start warm-up added to DR RTO calculation.
**F6 — Celery `worker_prefetch_multiplier` not tuned**
Fix applied (celeryconfig.py): `worker_prefetch_multiplier = 1` added with rationale comment. Long MC tasks (up to 240s) with default prefetch=4 cause worker starvation. Prefetch=1 ensures fair task distribution across all available workers.
**F7 — No database query plan governance**
Fix applied (§9.4 PostgreSQL parameters): `log_min_duration_statement: 500` and `shared_preload_libraries: timescaledb,pg_stat_statements` added to `patroni.yml`. Query plan governance process specified: weekly top-10 slow query report from `pg_stat_statements`; any query in top-10 for two consecutive weeks requires PR with `EXPLAIN ANALYSE` and index addition or documented acceptance.
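The weekly report can be produced with a query along these lines (column names per `pg_stat_statements` on PostgreSQL 13+; older versions expose `total_time`/`mean_time` instead):

```sql
-- Sketch of the weekly top-10 slow query report
SELECT query,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```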
**F8 — Static asset delivery strategy undefined**
Fix applied (§34.2 Caddyfile): Three-tier static asset strategy added. `/_next/static/*`: `Cache-Control: public, max-age=31536000, immutable` (safe — Next.js content-hashes filenames). `/cesium/*`: `Cache-Control: public, max-age=604800` (7 days; not content-hashed). HTML routes: `Cache-Control: no-store` (force re-fetch after deploy). Rationale: immutable caching only safe for content-hashed assets; HTML must never be cached.
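Inside the site block, the three tiers could be expressed roughly as follows (matcher names are illustrative; the header values follow the text):

```
# Caddyfile sketch of the three-tier Cache-Control strategy
@hashed path /_next/static/*
header @hashed Cache-Control "public, max-age=31536000, immutable"

@cesium path /cesium/*
header @cesium Cache-Control "public, max-age=604800"

@html path / *.html
header @html Cache-Control "no-store"
```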
**F9 — Horizontal scaling trigger thresholds not defined**
Fix applied (§3.2 new table): Scaling trigger threshold table added covering backend CPU (>70% for 30min), WS connections (>400 sustained), simulation queue depth (>50 for 15min), MC p95 latency (>180s), DB CPU (>60% for 1h), disk usage (>70%), Redis memory (>60%). All triggers initiate a scaling review meeting, not automatic action. Decisions logged in `docs/runbooks/capacity-limits.md`.
**F10 — TimescaleDB chunk interval not specified**
Already addressed: §9.4 specifies chunk intervals for all hypertables with derivation rationale table: `orbits` 1 day (72h CZML window spans 3 chunks), `tle_sets` 1 month (compression ratio), `space_weather` 30 days (low write rate), `adsb_states` 4 hours (24h rolling window). F10 confirmed as covered — no further action required.
**F11 — No query timeout or statement timeout policy**
Fix applied (§9.4): `ALTER ROLE spacecom_analyst SET statement_timeout = '30s'` and `ALTER ROLE spacecom_readonly SET statement_timeout = '30s'`. Applied at role level so it persists regardless of connection source. User-facing error message specified for timeout exceeded. Operational roles excluded (they have `idle_in_transaction_session_timeout` as global backstop only).
---
### 58.2 Sections Modified
| Section | Change |
|---------|--------|
| §3.2 Service Breakdown | PgBouncer pool size derivation rationale; horizontal scaling trigger threshold table |
| §9.4 TimescaleDB Configuration | `log_min_duration_statement`, `pg_stat_statements` in patroni.yml; query plan governance process; analyst role `statement_timeout`; `idle_in_transaction_session_timeout` comment |
| §16 CZML / Cache | Invalidation trigger table; stale-while-revalidate strategy; `warm_czml_cache` cold-start task |
| §34.2 Caddyfile | Three-tier static asset `Cache-Control` strategy; HTML `no-store` mandate |
| `celeryconfig.py` | `worker_prefetch_multiplier = 1` with rationale |
---
### 58.3 New Tables and Files
| Artefact | Purpose |
|----------|---------|
| `docs/runbooks/capacity-limits.md` | Scaling decision log; WS ceiling documentation; capacity trigger thresholds |
| `worker/celeryconfig.py` | Updated with `worker_prefetch_multiplier = 1` |
---
### 58.4 Anti-Patterns Identified
| Anti-pattern | Correct approach |
|-------------|-----------------|
| Default Celery `prefetch_multiplier=4` with long tasks | `prefetch_multiplier=1` for MC jobs; fair distribution across workers |
| Single Redis `maxmemory-policy` for broker + cache | Separate DB indexes with `noeviction` for broker, `allkeys-lru` for cache |
| HTML pages with `Cache-Control: public, max-age=...` | `no-store` for HTML; `immutable` only for content-hashed static assets |
| Analyst queries without timeout | `statement_timeout=30s` at role level; prevents replica exhaustion cascading to primary |
| Monitoring slow queries without a review process | Weekly `pg_stat_statements` top-10 review; two-week persistence triggers mandatory PR |
| Scaling triggers defined as "when it feels slow" | Metric thresholds with sustained durations; documented decision log for audit trail |
---
### 58.5 Decision Log
| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| `worker_prefetch_multiplier` | 1 | 4 (default) | Long MC tasks (up to 240s) make default prefetch cause severe worker imbalance; prefetch=1 adds trivial latency (one extra Redis round-trip) per task |
| Analyst timeout | 30 seconds at role level | Global `statement_timeout` | Global timeout would cancel legitimate long-running operations like backup restore tests and migration backfills; role-scoped is surgical |
| CZML stale-while-revalidate max age | 5 minutes | 0 (no stale) | Without stale window, TLE batch ingest (600 objects) causes 600 simultaneous cache stampedes; 5-minute stale window amortises recompute over the natural ingest cadence |
| Static asset caching | Immutable for `/_next/static/`, 7 days for `/cesium/`, no-store for HTML | Uniform TTL | Content-hash presence determines whether immutable is safe; non-uniform strategy is correct, not inconsistent |
---
## §59 DevOps / CI-CD Pipeline — Specialist Review
**Hat:** DevOps / CI-CD Pipeline
**Findings reviewed:** 11
**Sections modified:** §30.2, §30.3, §30.7 (new)
**Date:** 2026-03-24
---
### 59.1 Findings and Fixes Applied
**F1 — CI pipeline job dependency graph not specified**
Fix applied (§30.7 new): Full GitLab CI pipeline specified with explicit stage/needs ordering enforcing the dependency order: `lint` → (`test-backend` ∥ `test-frontend` ∥ `migration-gate`) → `security-scan` → `build-and-push` → `deploy-staging` → `deploy-production`. Parallel jobs where safe; sequential where correctness requires it.
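The stage/needs graph could be skeletonised as below (job bodies elided; job and stage names follow the text, everything else is illustrative):

```yaml
# .gitlab-ci.yml skeleton of the dependency graph (scripts omitted)
stages: [lint, test, security-scan, build, deploy]

lint:            {stage: lint}
test-backend:    {stage: test, needs: [lint]}
test-frontend:   {stage: test, needs: [lint]}
migration-gate:  {stage: test, needs: [lint]}
security-scan:   {stage: security-scan, needs: [test-backend, test-frontend, migration-gate]}
build-and-push:  {stage: build, needs: [security-scan]}
deploy-staging:  {stage: deploy, needs: [build-and-push]}
deploy-production:
  stage: deploy
  needs: [deploy-staging]
  environment: production   # protected environment; manual approval gate
  when: manual
```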
**F2 — No environment promotion gate between staging and production**
Already addressed: §30.4 specifies the staging environment spec and data policy. The ADR at §30.6 records the decision: "production deploy requires manual approval gate after staging smoke tests pass." The new §30.7 workflow formalises this as a GitLab protected `production` environment with required approvers. Confirmed as covered and formalised.
**F3 — Secrets in CI not audited or rotated**
Fix applied (§30.3): CI secrets register table added with 8 entries covering all pipeline secrets. Each entry specifies: environment scope, owner, rotation schedule (90-180 days), and blast radius on leak. Quarterly audit procedure using GitLab CI/CD variable inventory documented. Rotation procedure for GitLab protected variables specified.
|
||
|
||
**F4 — Docker image tags without immutability guarantee**
|
||
Fix applied (§30.2): Production `docker-compose.yml` now pins images by `tag@digest` rather than tag alone. `make update-image-digests` script added to CI post-build pipeline. Container-registry retention policy table added covering 5 image categories. Lifecycle policy documented in `docs/runbooks/image-lifecycle.md`.
|
||
|
||
**F5 — No build provenance or SBOM in CI pipeline**
|
||
Fix applied (§30.7): `cosign sign --yes` step added to `build-and-push` job using Sigstore keyless signing (OIDC identity from GitLab CI). SBOM artefacts are attached to the pipeline and copied into the compliance artefact store. The deploy-time `cosign verify` step remains the verification gate.
|
||
|
||
**F6 — Pre-commit hooks not enforced in CI**
|
||
Already addressed: §30.1 explicitly states "The same hooks run locally (via `pre-commit`) and in CI (`lint` job)." The new §30.7 workflow formalises this as `pre-commit run --all-files` in the `lint` job with a dedicated cache. F6 confirmed as covered and formalised.
|
||
|
||
**F7 — No automated rollback trigger**
|
||
Already addressed: §26.9 blue-green deploy script (step 6) already checks `spacecom:api_availability:ratio_rate5m < 0.99` after a 5-minute monitoring window and executes the Caddy upstream rollback atomically if the threshold is breached. F7 confirmed as covered.
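
A minimal sketch of that step-6 decision, assuming the 5-minute window is materialised as a list of metric samples and that any sub-threshold sample triggers rollback (the plan does not state whether the check is on the final sample or the whole window; names are illustrative):

```python
# Post-deploy health gate from §26.9 step 6 (illustrative sketch).
ROLLBACK_THRESHOLD = 0.99  # floor for spacecom:api_availability:ratio_rate5m

def should_roll_back(window_samples: list[float],
                     threshold: float = ROLLBACK_THRESHOLD) -> bool:
    """True when any availability sample collected during the 5-minute
    monitoring window breaches the threshold, triggering the atomic
    Caddy upstream rollback."""
    return any(sample < threshold for sample in window_samples)
```
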

**F8 — Deployment pipeline does not check for active CRITICAL events**

Fix applied (§30.7): `check no active CRITICAL alert` step added to both `deploy-staging` and `deploy-production` jobs. Calls `GET /readyz` and checks the `alert_gate` field. `"blocked"` aborts the deploy with a clear error message. Emergency override requires two production-environment approvals and is logged to `security_logs`.
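
The gate reduces to a small predicate; treating a missing `alert_gate` field as blocked is a fail-safe assumption here, not plan text:

```python
def deploy_allowed(readyz_payload: dict) -> bool:
    """F8 deploy gate: abort when GET /readyz reports alert_gate == "blocked".
    A missing field is treated as blocked (fail-safe assumption)."""
    return readyz_payload.get("alert_gate", "blocked") != "blocked"
```
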

**F9 — No branch protection or merge queue specification**

Already addressed: §13.6 (CONTRIBUTING.md spec from §54) specifies: "No direct commits to `main`. All changes via pull request. `main` is branch-protected: 1 required approval, all status checks must pass, no force-push." The §30.7 workflow defines all required status checks (`lint`, `test-backend`, `test-frontend`, `migration-gate`, `security-scan`) which the branch protection rule references. F9 confirmed as covered.

**F10 — Docker layer cache strategy not documented for CI**

Fix applied (§30.7): Build cache strategy formalised in the `build-and-push` job using `docker/build-push-action` with `cache-from: type=registry` and `cache-to: type=registry,mode=max` targeting the GHCR `buildcache` tag. pip wheel cache keyed on `requirements.txt` hash. npm cache keyed on `package-lock.json` hash. Both use `actions/cache@v4`.

**F11 — No database migration CI gate**

Fix applied (§30.7 `migration-gate` job): Three-step gate on all PRs touching `migrations/`: (1) timed forward migration — fails if > 30s; (2) reverse migration `alembic downgrade -1` — fails if not reversible; (3) `alembic check` — fails if model/migration divergence. Gate runs in parallel with test jobs to minimise critical path impact.

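
The three steps might look like this as a job fragment — the database service container and the `migrations/` path filter are elided; `timeout 30` fails the job when the forward migration exceeds 30 s:

```yaml
migration-gate:
  needs: lint
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: timeout 30 alembic upgrade head   # step 1: forward, fails if > 30 s
    - run: alembic downgrade -1              # step 2: must be reversible
    - run: alembic check                     # step 3: model/migration divergence
```
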

---

### 59.2 Sections Modified

| Section | Change |
|---------|--------|
| §30.2 Multi-Stage Dockerfile | Image digest pinning spec; GHCR retention policy table; `make update-image-digests` |
| §30.3 Environment Variable Contract | CI secrets register table; rotation schedule; quarterly audit procedure |
| §30.7 (new) GitHub Actions Workflow | Full CI YAML with `needs:` graph; all 8 jobs; `cosign sign`; `migration-gate`; alert gate step; environment-gated production deploy |

---

### 59.3 New Tables and Files

| Artefact | Purpose |
|----------|---------|
| `.github/workflows/ci.yml` | Canonical CI pipeline — 8 jobs with explicit dependency graph |
| `scripts/smoke-test.py` | Post-deploy smoke test (already referenced in §26.9; now mandatory gate in CI) |
| `scripts/update-image-digests.sh` | Patches `docker-compose.yml` with `tag@digest` after each build |
| `docs/runbooks/image-lifecycle.md` | GHCR retention policy; lifecycle policy config procedure |
| `docs/runbooks/detect-secrets-update.md` | Correct baseline update procedure (already referenced in §30.1) |

---

### 59.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Jobs without `needs:` run in parallel by default | Explicit `needs:` chains; test jobs must precede build; build must precede deploy |
| Mutable image tags in production Compose | `tag@digest` pinning; `make update-image-digests` in post-build CI step |
| Long-lived CI credentials for registry push | OIDC `GITHUB_TOKEN` (per-job, automatic); no static `GHCR_TOKEN` secret needed |
| Signing at deploy-time only (`cosign verify`) | Sign at build-time (`cosign sign`); verify at deploy; both steps required for supply chain integrity |
| Deploying during active CRITICAL alert | `alert_gate` check in CI deploy steps; emergency override requires two approvals and is logged |
| Migrations tested only by running them forward | Three-step gate: forward (timed) + reverse (reversibility) + `alembic check` (model sync) |

---

### 59.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| OIDC for GHCR auth | `GITHUB_TOKEN` OIDC (per-job) | Static `GHCR_TOKEN` secret | Static tokens don't expire; OIDC tokens are per-job and cannot be reused outside the workflow |
| cosign keyless signing | Sigstore keyless (OIDC identity) | Private key signing | Keyless signing ties the signature to the GitHub Actions OIDC identity; no long-lived private key to rotate or leak |
| Alert gate scope | Blocks `CRITICAL` and `HIGH` unacknowledged alerts from non-internal orgs | All alerts | Internal test org alerts should not block production operations; unacknowledged = operator hasn't seen it yet |
| migration-gate triggers | Only on PRs touching `migrations/` | Every PR | Running `alembic upgrade head` on every PR adds 60–90 seconds to CI for PRs that don't touch the schema; path filter reduces cost |

---

## §60 Human Factors / Operational UX — Specialist Review

**Hat:** Human Factors / Operational UX

**Findings reviewed:** 11

**Sections modified:** §28.1, §28.3, §28.5a, §28.6, §28.9 (new)

**Date:** 2026-03-24

---

### 60.1 Findings and Fixes Applied

**F1 — No alarm management philosophy documented**

Fix applied (§28.3): EEMUA 191 / ISA-18.2 alarm management KPI table added with 5 quantitative targets: alarm rate (< 1/10min), nuisance rate (< 1%), stale CRITICAL (0 unacknowledged > 10min), alarm flood threshold (< 10 CRITICAL in 10min), chattering alarms (0). Measured quarterly by Persona D; included in ESA compliance artefact package.

**F2 — Alarm flood scenario not bounded**

Fix applied (§28.3): Batch TIP flood protocol added. Triggers at >= 5 new TIP messages in 5 minutes. Protocol: highest-priority object gets CRITICAL banner; objects 2-N are suppressed; single HIGH "Batch TIP event: N objects" summary fires; per-object alerts queue at <= 1/min after 5-minute operator grace period. `batch_tip_event` record type added to `alert_events`. Thresholds configurable per-org within safety bounds.
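
The suppression decision can be sketched as a pure function — assuming objects arrive pre-sorted highest priority first and the 5-in-5-minutes trigger window is evaluated upstream; the action names are illustrative, not plan vocabulary:

```python
FLOOD_TRIGGER = 5  # >= 5 new TIP messages in 5 minutes (§28.3)

def flood_actions(tip_objects: list[str]) -> dict[str, str]:
    """Per-object alerting action under the batch TIP flood protocol.
    `tip_objects` must be ordered highest priority first."""
    if len(tip_objects) < FLOOD_TRIGGER:
        return {obj: "ALERT_NORMALLY" for obj in tip_objects}
    actions = {tip_objects[0]: "CRITICAL_BANNER"}  # object 1: full CRITICAL banner
    for obj in tip_objects[1:]:
        # Objects 2-N: suppressed now, queued at <= 1/min after the grace period.
        actions[obj] = "SUPPRESS_THEN_QUEUE"
    return actions
```

A single HIGH "Batch TIP event: N objects" summary alert would fire alongside, per the protocol.
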

**F3 — Mode confusion risk unmitigated**

Already addressed: §28.2 specifies six mode error prevention mechanisms including persistent mode indicator, mode-switch confirmation dialog with consequence statements, temporal wash for future-preview, simulation disable during active events, audio suppression in non-LIVE modes, and simulation record segregation. F3 confirmed as covered.

**F4 — Handover workflow does not account for SA transfer**

Fix applied (§28.5a): Structured SA transfer prompt table added. Five prompts mapping to Endsley SA levels: active objects (L1 perception), operator assessment (L2 comprehension), expected development (L3 projection), actions taken (decision context), and handover flags (situational context). Prompts are optional but completion rate tracked as HF KPI. Non-blocking warning on submission without completion.

**F5 — Acknowledgement does not distinguish seen from assessed**

Already addressed: §28.5 structured acknowledgement categories distinguish `MONITORING` (seen, no action) from `NOTAM_ISSUED`, `COORDINATING`, `ESCALATING` (assessed and acted). The category taxonomy maps directly to perception vs. comprehension+projection. F5 confirmed as covered.

**F6 — No specification for decision prompt content**

Fix applied (§28.6): `DecisionPrompt` TypeScript interface specified with four fields: `risk_summary` (<= 20 words, no jargon), `action_options` (role-specific), `time_available` (decision window before FIR intersection), `consequence_note` (optional). Example instance for re-entry/FIR scenario provided. Pre-authored prompt library in `docs/decision-prompts/`; annual ANSP SME review required.
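
A sketch of that contract — field names come from the finding above, while the example instance and the `validPrompt` guard are illustrative additions, not plan text:

```typescript
// Sketch of the §28.6 DecisionPrompt contract.
interface DecisionPrompt {
  risk_summary: string;        // <= 20 words, no jargon
  action_options: string[];    // role-specific options
  time_available: string;      // decision window before FIR intersection
  consequence_note?: string;   // optional fourth field
}

// Hypothetical guard enforcing the 20-word limit on risk_summary.
function validPrompt(p: DecisionPrompt): boolean {
  const words = p.risk_summary.trim().split(/\s+/).length;
  return words > 0 && words <= 20 && p.action_options.length > 0;
}

// Illustrative instance for the re-entry / FIR scenario class.
const example: DecisionPrompt = {
  risk_summary: "Predicted debris corridor crosses your FIR within 40 minutes.",
  action_options: ["Issue NOTAM", "Coordinate with adjacent FIR", "Continue monitoring"],
  time_available: "40 minutes",
};
```
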

**F7 — Globe information hierarchy not specified**

Fix applied (§28.1): Seven-level visual information hierarchy table added with mandatory rendering order. Priority 1 (CRITICAL object): flashing red octagon + always-visible label. Priority 2 (HIGH): amber triangle. Down to Priority 7 (ambient objects): white dots on hover only. Rule: no lower-priority element may be visually more prominent than a higher-priority element. Non-negotiable safety requirement — overrides CesiumJS performance optimisations that reorder draw calls.

**F8 — No fatigue or cognitive load accommodation**

Fix applied (§28.3): Server-side fatigue monitoring rules added. Four triggers: CRITICAL unacknowledged > 10 min — supervisor push+email; HIGH unacknowledged > 30 min — supervisor push; inactivity during active event (45 min) — operator+supervisor push; session age > shift_duration_hours — non-blocking operator reminder. All notifications logged to `security_logs`. Escalates to SpaceCom internal ops if no supervisor role configured.
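
The first two unacknowledged-alert triggers reduce to a lookup; the notification channel names here are illustrative:

```python
def unack_escalation(severity: str, minutes_unacknowledged: float) -> list[str]:
    """Unacknowledged-alert triggers from the §28.3 fatigue-monitoring
    rules. Returns the supervisor notification channels to fire."""
    if severity == "CRITICAL" and minutes_unacknowledged > 10:
        return ["supervisor_push", "supervisor_email"]
    if severity == "HIGH" and minutes_unacknowledged > 30:
        return ["supervisor_push"]
    return []
```
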

**F9 — Degraded mode display not actionable**

Already addressed: §28.8 (Degraded-Data Human Factors) specifies per-degradation-type visual indicators with operator action required. §1315 specifies operational guidance text per degradation type. Acceptance criteria (§6056) requires integration test for each type. F9 confirmed as covered.

**F10 — No operator training specification**

Fix applied (§28.9 new): Full operator training programme specified. Six modules (M1-M6), 8 hours total minimum. M2 reference scenario defined. Recurrency requirements: annual 2-hour refresher + scenario repeat. `operator_training_records` schema added. `GET /api/v1/admin/training-status` endpoint added. Training material ownership and annual review cycle defined.

**F11 — Audio alert design not fully specified**

Fix applied (§28.3): Audio spec expanded with EUROCAE ED-26 / RTCA DO-256 advisory alert compliance. Tones specified: 261 Hz (C4) + 392 Hz (G4), 250ms each with 20ms fade. Re-alert on missed acknowledgement: replays once at 3 minutes; no further audio beyond second play (supervisor notification handles further escalation). Volume floor in ops room mode: minimum 40%. Per-session mute resets on next login.

---

### 60.2 Sections Modified

| Section | Change |
|---------|--------|
| §28.1 Situation Awareness | Globe visual information hierarchy table (7 levels, mandatory rendering order) |
| §28.3 Alarm Management | EEMUA 191 KPI table; batch TIP flood protocol; fatigue monitoring rules; audio spec expanded with EUROCAE ref, re-alert rule, volume floor |
| §28.5a Shift Handover | Structured SA transfer prompts (5 prompts, 3 SA levels); completion tracking |
| §28.6 Cognitive Load Reduction | Decision prompt TypeScript interface + example; pre-authored library governance |
| §28.9 (new) Operator Training | 6-module programme; reference scenario; recurrency; `operator_training_records` schema; API endpoint |

---

### 60.3 New Tables and Files

| Artefact | Purpose |
|----------|---------|
| `operator_training_records` | Training completion records per user/module |
| `docs/training/` | Training module content directory |
| `docs/training/reference-scenario-01.md` | Standardised M2 reference scenario |
| `docs/decision-prompts/` | Pre-authored decision prompt library (per scenario type) |
| `GET /api/v1/admin/training-status` | Org-admin view of operator training completion |

---

### 60.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Single "data may be delayed" degraded banner | Per-degradation-type badges with operator action required; graded response rules |
| Free-text only handover notes | Structured SA transfer prompts + notes; prompts tracked as HF KPI |
| Audio alert that loops indefinitely | Plays once; re-alerts once at 3 min; further escalation is supervisor notification, not more audio |
| Acknowledgement with 10-character text minimum | Structured category selection — captures intent, not just compliance |
| Unlimited alarm rate during batch TIP events | Batch flood protocol: suppress objects 2-N, queue at <= 1/min after grace period |
| Globe with equal visual weight for all elements | 7-level mandatory hierarchy; safety-critical objects pre-attentively distinct at all zoom levels |

---

### 60.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Alarm KPI standard | EEMUA 191 adapted for ATC | Process-control standard verbatim | EEMUA 191 is process-control oriented; ATC operations have different alarm rate expectations; adaptation noted explicitly |
| Re-alert timing | Once at 3 minutes | Continuous loop / never re-alert | Loop causes habituation; never re-alerting risks missed CRITICAL in a noisy environment; single replay at 3 min is the minimum effective prompt |
| SA transfer prompts | Optional with completion tracking | Mandatory (blocks handover submission) | Mandatory completion under time pressure produces checkbox compliance, not genuine SA transfer; optional + tracked provides accountability without creating a safety-defeating blocker |
| Operator training blocking | Flag but not block access | Auto-block untrained users | ANSP retains operational responsibility; SpaceCom cannot unilaterally block a certified ATC professional; flag + report gives ANSP the information to manage their own training compliance |

---

## §61 Aviation & Space Regulatory Compliance — Specialist Review

### 61.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|-----------|
| 1 | No formal safety case structure — argument/evidence/claims framework absent | High | §24.12 — Safety case with GSN argument structure, evidence nodes, and claims added; `docs/safety/SAFETY_CASE.md` |
| 2 | SAL assignment under ED-153/DO-278A not documented — no formal assurance level per component | High | §24.13 — SAL assignment table: SAL-2 for physics, alerts, HMAC, CZML; SAL-3 for auth and ingest; `docs/safety/SAL_ASSIGNMENT.md` |
| 3 | Hazard log lacked structured format — no ID, cause/effect decomposition, risk level, or governance | Medium | §24.4 — Hazard register restructured with 7 hazards (HZ-001 to HZ-007), structured fields, governance rules, and EUROCAE ED-153 risk matrix |
| 4 | Safety occurrence reporting procedure lacked formal structure — ANSP notification, evidence preservation, and regulatory notification flow not defined | High | §26.8a — Full safety occurrence reporting procedure with trigger conditions, 8-step response, SQL table, and clear negative scope |
| 5 | ICAO data quality mapping incomplete — Completeness attribute absent; no formal data category and classification fields in API response | Medium | §24.3 — Completeness attribute added; formal ICAO data category/classification fields specified; accuracy characterisation as Phase 3 gate |
| 6 | Verification independence not specified — no CODEOWNERS, PR review rule, or traceability for SAL-2 components | High | §17.6 — CODEOWNERS for SAL-2 paths, 2-reviewer requirement, qualification criteria, traceability to safety case evidence |
| 7 | No configuration management policy for safety-critical artefacts — source files, safety documents, and validation data not formally under CM | High | §30.8 — CM policy covering 10 artefact types, release tagging script, signed commits, deployment register, CODEOWNERS for `docs/safety/` |
| 8 | Means of Compliance document not planned — no mapping from regulatory requirement to implementation evidence | Medium | §24.14 — MoC document structure with 7 initial MOC entries, status tracking, and Phase 2/3 gates |
| 9 | Post-deployment safety monitoring programme absent — no ongoing accuracy monitoring, safety KPIs, or model version monitoring | High | §26.10 — Four-component programme: prediction accuracy monitoring, safety KPI dashboard, quarterly safety review, model version monitoring |
| 10 | ANSP-side obligations not documented — SpaceCom's safety argument assumes ANSP actions that are never formally communicated | Medium | §24.15 — ANSP obligations table by category; SMS guide document; liability assignment note linking to safety case |
| 11 | Regulatory sandbox liability not formally characterised — who bears liability during trial, what insurance is required, sandbox ≠ approval | Medium | §24.2 — Sandbox liability provisions: no operational reliance clause, indemnification cap, insurance requirement, regulatory notification duty, explicit statement that sandbox ≠ regulatory approval |


**Already addressed — no further action required:**

- NOTAM interface and disclaimer (§24.5 — covered in prior sessions)
- Space law retention obligations (§24.6 — 7-year retention already specified)
- EU AI Act compliance obligations (§24.10 — fully covered including Art. 14 human oversight statement)
- Regulatory correspondence register (§24.11 — covered)

---

### 61.2 Sections Modified

| Section | Change |
|---------|--------|
| §24.2 Liability and Operational Status | Regulatory sandbox liability provisions (F11): no operational reliance clause, indemnification cap, insurance requirement, sandbox ≠ approval statement |
| §24.3 ICAO Data Quality Mapping | Completeness attribute added (F5); formal ICAO data category and classification table; accuracy characterisation Phase 3 gate |
| §24.4 Safety Management System Integration | Hazard register fully restructured (F3): 7 hazards with IDs, cause/effect, risk levels, governance; system safety classification updated to reference §24.13 SAL assignment |
| §24.11 (after) | New §24.12 Safety Case Framework (F1); §24.13 SAL Assignment (F2); §24.14 Means of Compliance (F8); §24.15 ANSP-Side Obligations (F10) |
| §17.5 (after) | New §17.6 Verification Independence (F6): CODEOWNERS, 2-reviewer rule, qualification criteria, traceability |
| §26.8 Incident Response runbooks | Safety occurrence runbook pointer updated; §26.8a Safety Occurrence Reporting full procedure added (F4) |
| §26.9 (after) | New §26.10 Post-Deployment Safety Monitoring Programme (F9): accuracy monitoring, safety KPI dashboard, quarterly review, model version monitoring |
| §30.7 (after) | New §30.8 Configuration Management of Safety-Critical Artefacts (F7): CM policy table, release tagging, signed commits, deployment register |

---

### 61.3 New Documents and Tables

| Artefact | Purpose |
|----------|---------|
| `docs/safety/SAFETY_CASE.md` | GSN-structured safety case; living document; version-controlled |
| `docs/safety/SAL_ASSIGNMENT.md` | Software Assurance Level per component; review triggers |
| `docs/safety/HAZARD_LOG.md` | Structured hazard log (HZ-001 to HZ-007 and future additions) |
| `docs/safety/MEANS_OF_COMPLIANCE.md` | Regulatory requirement → implementation evidence mapping |
| `docs/safety/ANSP_SMS_GUIDE.md` | ANSP obligations and SMS integration guide |
| `docs/safety/CM_POLICY.md` | Configuration management policy for safety artefacts |
| `docs/safety/VERIFICATION_INDEPENDENCE.md` | Verification independence policy for SAL-2 components |
| `docs/safety/QUARTERLY_SAFETY_REVIEW_YYYY_QN.md` | Quarterly safety review output template |
| `legal/SANDBOX_AGREEMENT_TEMPLATE.md` | Standard regulatory sandbox letter of understanding |
| `legal/ANSP_DEPLOYMENT_REGISTER.md` | Configuration baseline per ANSP deployment |
| `docs/validation/ACCURACY_CHARACTERISATION.md` | Phase 3: formal accuracy statement (ICAO Annex 15) |
| `safety_occurrences` SQL table | Dedicated log for safety occurrences with full audit fields |
| `monitoring/dashboards/safety-kpis.json` | Grafana dashboard: 6 safety KPIs with alert thresholds |
| `.github/CODEOWNERS` additions | SAL-2 source paths + `docs/safety/` require custodian review |
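
One possible shape for the `safety_occurrences` table listed above — column names beyond those implied by §26.8a (trigger conditions, ANSP notification, evidence preservation) are assumptions:

```sql
-- Illustrative sketch only; the authoritative schema lives in §26.8a.
CREATE TABLE safety_occurrences (
    id                BIGSERIAL PRIMARY KEY,
    occurred_at       TIMESTAMPTZ NOT NULL,
    trigger_condition TEXT NOT NULL,        -- one of the 4 §26.8a trigger conditions
    description       TEXT NOT NULL,
    ansp_notified_at  TIMESTAMPTZ,          -- ANSP regulatory notification step
    evidence_ref      TEXT,                 -- pointer to preserved evidence
    created_by        TEXT NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
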

---

### 61.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| "Advisory only" UI label as sole liability protection | Legal instruments required: MSA, AUP, legal opinion; label is not contractual protection |
| Hazard log as a table of symptoms with no cause/effect structure | Structured hazard log with ID, cause, effect, mitigations, risk level, status — enables safety case argument |
| No distinction between safety occurrence and operational incident | Safety occurrences require a separate response chain (legal counsel, ANSP regulatory notification); conflating with incidents creates regulatory exposure |
| Verification by the author of safety-critical code | SAL-2 requires independent verification — CODEOWNERS enforcement is the implementation mechanism |
| Safety documents outside version control | All safety artefacts are Git-tracked; changes require custodian sign-off via CODEOWNERS; release tags capture safety snapshots |
| Sandbox trial treated as implicit regulatory approval | Explicit language required: sandbox ≠ approval; the ANSP cannot represent a trial as regulatory acceptance |
| Post-deployment safety monitoring as "we'll look at incidents when they happen" | Proactive programme: quarterly review, prediction accuracy tracking, model version monitoring — demonstrates ongoing safe operation |

---

### 61.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Safety case notation | Goal Structuring Notation (GSN) | ASCE text-only format | GSN is the standard for DO-178C and ED-153 safety cases; accepted by EASA and ESA reviewers; tooling (Astah, Visio, ArgoSAFETY) exists for formal diagrams when Phase 3 requires it |
| SAL-2 for physics and alerts | SAL-2 (not SAL-1) | SAL-1 (highest) | SAL-1 implies formal methods / formal proofs — disproportionate for decision support software where the ANSP retains authority; SAL-2 balances rigour with development practicality |
| Safety occurrence trigger scope | 4 specific trigger conditions | Any anomaly during operational use | Over-broad triggers desensitise the process; under-broad triggers miss real occurrences; 4 conditions map directly to the identified hazards |
| Post-deployment monitoring cadence | Quarterly safety review | Monthly review / ad hoc | Quarterly balances administrative overhead with meaningful trend data; monthly creates review fatigue for a small team; ad hoc provides no assurance |
| Configuration management of safety documents | Git + CODEOWNERS + release attachments | Dedicated safety management tool | Git is already the source of truth; CODEOWNERS provides access control; release attachments are the simplest artefact preservation mechanism without introducing a new tool |

---

## §62 Geospatial / Mapping Engineering — Specialist Review

### 62.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|-----------|
| 1 | No authoritative CRS contract document — frame transitions at each boundary were scattered across multiple sections with no single reference | Medium | §4.4 — CRS boundary table added; `docs/COORDINATE_SYSTEMS.md` defined as Phase 1 deliverable; antimeridian and pole handling specified |
| 2 | SRID not enforced by CHECK constraint — column type declares SRID 4326 but application code can insert SRID-0 geometries silently | Medium | §9.3 — CHECK constraints added for `reentry_predictions`, `hazard_zones`, `airspace` spatial columns; migration gate lints new spatial columns |
| 3 | No spatial GiST index on corridor polygon columns | High | Already addressed — §9.3 contains GiST indexes for `reentry_predictions`, `hazard_zones`, `airspace` geometry columns. No further action required. |
| 4 | CZML corridor geometry uses fixed 10-minute time-step sampling — under-represents terminal phase where displacement is highest | High | §15.4 — Adaptive sampling function added: 5 min above 300 km, 2 min at 150–300 km, 30 s at 80–150 km, 10 s below 80 km; ADR required for reference polygon regeneration |
| 5 | Antimeridian and pole handling not explicitly specified | Medium | §4.4 — Antimeridian: GEOGRAPHY type confirmed; CZML serialiser must not clamp to ±180°. Polar corridors: `ST_DWithin` pole proximity check; clip to 89.5° max latitude with `POLAR_CORRIDOR_WARNING` log |
| 6 | No test verifying PostGIS corridor polygon matches CZML polygon positions | High | §15.4 — `test_czml_corridor_matches_postgis_polygon` integration test added; marked `safety_critical`; 10 km bbox agreement tolerance |
| 7 | FIR boundary data source and update policy not documented | Medium | Already addressed — §31.1.3 documents EUROCONTROL AIRAC source, 28-day update procedure, `airspace_metadata` table, Prometheus staleness alert, `readyz` integration. No further action required. |
| 8 | Globe clustering merges objects at different altitudes sharing a ground-track sub-point | Medium | §13.2 (globe clustering) — Altitude-aware clustering rule: clustering disabled for any object with re-entry window < 30 days; prevents TIP-active objects from being absorbed into catalog clusters |
| 9 | `ST_Buffer` distance units ambiguous — degree-based buffer on SRID 4326 geometry produces latitude-varying results | Medium | §9.3 — Correct pattern documented: project to Web Mercator for metric buffer, or use `GEOGRAPHY` column buffer (natively metre-aware). Wrong pattern explicitly prohibited. |
| 10 | FIR intersection missing bounding-box pre-filter in some query paths | Medium | Already addressed — §9.3 FIR intersection query with `&&` pre-filter and explicit `::geography::geometry` cast; CI linter rule added. No further action required. |
| 11 | Altitude display mixes WGS-84 ellipsoidal and MSL datums without labelling — geoid offset (−106 m to +85 m) material at re-entry terminal altitudes | High | §13.5 — Altitude datum labelling table added: orbital context → ellipsoidal; airspace context → QNH; `formatAltitude(metres, context)` helper; `altitude_datum` field in prediction API response |

---

### 62.2 Sections Modified

| Section | Change |
|---------|--------|
| §4.4 (new) Coordinate Reference System Contract | CRS boundary table; `docs/COORDINATE_SYSTEMS.md` reference; antimeridian CZML serialiser note; polar corridor `ST_DWithin` proximity check and 89.5° clip |
| §4.5 (renumbered from 4.4) Implementation Checklist | Added `docs/COORDINATE_SYSTEMS.md` deliverable |
| §9.3 Index Specification | SRID CHECK constraints for 3 spatial tables; ST_Buffer correct/wrong patterns; explicit prohibition on degree-unit buffers |
| §13.2 Globe Object Clustering | Altitude-aware clustering rule: disable for decay-relevant objects (window < 30 days) |
| §13.5 Altitude and Distance Unit Display | Altitude datum labelling table (4 contexts); `formatAltitude(metres, context)` helper spec; `altitude_datum` API field |
| §15.4 Corridor Generation Algorithm | Adaptive ground-track sampling function (4 altitude bands); ADR requirement for reference polygon regeneration; `test_czml_corridor_matches_postgis_polygon` integration test |
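
The four-band rule from finding 4 can be sketched as follows — band-edge inclusivity at exactly 300 km and 80 km is an assumption, since the plan only names the bands:

```python
def ground_track_step_s(altitude_km: float) -> int:
    """Adaptive ground-track sampling interval (seconds) per §15.4:
    coarse where the trajectory changes slowly, fine in the terminal phase."""
    if altitude_km > 300:
        return 300  # 5 min: high altitude, slow change
    if altitude_km >= 150:
        return 120  # 2 min
    if altitude_km >= 80:
        return 30
    return 10       # terminal phase: displacement is highest
```
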

---

### 62.3 New Documents and Files

| Artefact | Purpose |
|----------|---------|
| `docs/COORDINATE_SYSTEMS.md` | Authoritative CRS contract: frame at every system boundary |
| `tests/integration/test_corridor_consistency.py` | PostGIS vs CZML corridor bbox consistency test (safety_critical) |
| `backend/app/utils/altitude.py` | `formatAltitude(metres, context)` helper |
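
A Python-style sketch of that helper (the plan spells it `formatAltitude(metres, context)`; only two of the four §13.5 contexts are shown, and the formatting choices are illustrative — QNH is a pressure-referenced datum, so the real helper labels the datum rather than deriving it):

```python
def format_altitude(metres: float, context: str) -> str:
    """Format an altitude with an explicit datum label per §13.5:
    orbital context -> ellipsoidal km; airspace context -> feet, QNH-labelled.
    The metres->feet conversion is purely geometric."""
    if context == "orbital":
        return f"{metres / 1000:.1f} km (ellipsoidal)"
    if context == "airspace":
        return f"{metres * 3.28084:,.0f} ft QNH"
    raise ValueError(f"unknown altitude context: {context}")
```
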

---

### 62.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| Fixed 10-minute ground track sampling across all altitudes | Adaptive sampling: coarse above 300 km, fine in terminal phase below 150 km |
| `ST_Buffer(geom_4326, 0.5)` — degree buffer on geographic column | `ST_Buffer(ST_Transform(geom, 3857), 50000)` for Mercator metric, or `ST_Buffer(geom::geography, 50000)` for geodetic metric |
| `ST_Intersects(airspace.geometry, corridor)` without explicit cast | Always `::geography::geometry` cast when mixing GEOGRAPHY and GEOMETRY types; enforced by CI linter |
| Clustering all objects by screen position | Disable CesiumJS EntityCluster for decay-relevant objects; altitude is a critical dimension for orbital objects |
| Altitude labelled as `km` without datum | Datum is always explicit: `(ellipsoidal)` or `QNH` or `MSL` per context |
| SRID declared in column type only | Add CHECK constraint: `CHECK (ST_SRID(geom::geometry) = 4326)` — prevents SRID-0 insertion from application layer |
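
The buffer patterns and the SRID guard together, assuming a `geom` column on `hazard_zones` (table and column names are illustrative):

```sql
-- Prohibited (§9.3): a 0.5 "degree" buffer spans ~55 km at the equator
-- but only ~28 km at 60° latitude.
SELECT ST_Buffer(geom, 0.5) FROM hazard_zones;

-- Correct: geodetic metric buffer (GEOGRAPHY is natively metre-aware) ...
SELECT ST_Buffer(geom::geography, 50000) FROM hazard_zones;
-- ... or project to Web Mercator for a planar metric buffer:
SELECT ST_Buffer(ST_Transform(geom::geometry, 3857), 50000) FROM hazard_zones;

-- Finding 2 guard: the declared column SRID alone does not stop SRID-0 inserts.
ALTER TABLE hazard_zones
  ADD CONSTRAINT hazard_zones_geom_srid_chk
  CHECK (ST_SRID(geom::geometry) = 4326);
```
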
|
||
|
||
---

### 62.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Adaptive sampling bands | 4 bands (> 300 km / 150–300 km / 80–150 km / < 80 km) | Single fine step (30 s) everywhere | Fine step everywhere generates unnecessary data volume in the high-altitude portion where trajectory changes are slow; 4 bands give fidelity where it matters at manageable data volume |
| Antimeridian strategy | GEOGRAPHY type (spherical arithmetic) for corridors | Split polygons at ±180° | Splitting at antimeridian requires downstream consumers (CesiumJS, PostGIS) to handle multi-polygon; GEOGRAPHY avoids the split natively |
| Polar corridor clip at 89.5° | `ST_DWithin` + clip | Full polar treatment | True polar passages are extremely rare for the tracked object population; full treatment (azimuthal projection, pole-aware alpha-shape) is disproportionate; clip + warning is the pragmatic safe choice |
| Altitude datum labelling | Per-context datum in `formatAltitude` helper | Global user setting | Datum is physically determined by the altitude context (orbital = ellipsoidal; aviation = QNH), not user preference; a user setting would allow operators to view the wrong datum label |
| Corridor consistency test tolerance | 10 km (0.1°) bbox agreement | Exact match | Sub-pixel globe rendering differences make exact match impractical; 10 km is far below the display resolution at most zoom levels and well below any operationally significant discrepancy |

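The adaptive sampling decision above can be sketched as a simple band selector. The band edges come from the decision log; the step sizes per band are illustrative assumptions for this sketch, not values fixed by the plan.

```python
# Illustrative sketch of the 4-band adaptive sampling decision.
# Band edges match the decision log; the per-band step sizes are
# assumptions for illustration only.

def sample_step_seconds(altitude_km: float) -> int:
    """Return the trajectory sampling step for a given altitude band."""
    if altitude_km > 300:    # slow orbital decay: coarse sampling suffices
        return 300
    if altitude_km > 150:    # decay accelerating: tighten the step
        return 120
    if altitude_km > 80:     # approaching the re-entry interface: fine sampling
        return 30
    return 10                # below 80 km: finest step through the breakup regime
```

The rejected alternative — a single 30 s step everywhere — would apply the third band's cost to the entire trajectory.
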
---

## §63 Real-Time Systems / WebSocket Engineering — Specialist Review

### 63.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|-----------|
| 1 | No message sequence numbers or ordering guarantee | High | Already addressed — `seq` field in event envelope; `?since_seq=` reconnect replay; 200-event / 5-min ring buffer; `resync_required` on stale gap. No further action required. |
| 2 | No application-level delivery acknowledgement — `delivered_websocket = TRUE` set at send-time, not client-receipt | High | §4 WebSocket schema — `alert.received` / `alert.receipt_confirmed` round-trip for CRITICAL/HIGH; `ws_receipt_confirmed` column in `alert_events`; 10s timeout triggers email fallback |
| 3 | Fan-out architecture for multiple backend instances not specified | High | §4 WebSocket schema — Redis Pub/Sub fan-out via `spacecom:alert:{org_id}` channels; per-instance local connection registry; `docs/adr/0020-websocket-fanout-redis-pubsub.md` |
| 4 | No client-side reconnection backoff policy | High | Already addressed — `src/lib/ws.ts` specifies `initialDelayMs=1000`, `maxDelayMs=30000`, `multiplier=2`, `jitter=0.2`. No further action required. |
| 5 | No state reconciliation protocol after reconnect | High | Already addressed — `resync_required` event triggers REST re-fetch; `?since_seq=` replays up to 200 events. No further action required. |
| 6 | Dead WebSocket connection does not trigger ANSP fallback notification | High | §4 WebSocket schema — `on_connection_closed` schedules Celery task with 120s / 30s (active TIP) grace; `on_reconnect` revokes pending task; org primary contact emailed with TIP-aware subject line |
| 7 | No back-pressure or per-client send queue monitoring | High | §4 WebSocket schema — `ConnectionManager` with per-connection `asyncio.Queue`; circuit breaker at 50 queued events closes slow-client connection; `spacecom_ws_send_queue_overflow_total` counter |
| 8 | Offline clients do not see missed alerts surfaced on reconnect | Medium | §4 WebSocket schema — `GET /alerts?since=<ts>&include_offline=true`; `received_while_offline: true` annotation; `localStorage` `last_seen_ts`; amber border visual treatment in notification centre |
| 9 | Multi-tab acknowledgement not synced | Medium | Already addressed — `alert.acknowledged` event type in WebSocket schema broadcasts to all org connections. No further action required. |
| 10 | No per-org WebSocket connection visibility during TIP events | Medium | §4 WebSocket schema + Observability — `spacecom_ws_org_connected` and `spacecom_ws_org_connection_count` gauges; `ANSPNoLiveConnectionDuringTIPEvent` alert rule; on-call dashboard panel 9 |
| 11 | Caddy idle timeout silently terminates long-lived WebSocket connections | High | §26.9 Caddy configuration — `idle_timeout 0` for WebSocket paths; `read_timeout 0` / `write_timeout 0` on WS reverse proxy transport; `flush_interval -1`; ping interval < proxy idle timeout rule documented |

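The back-pressure design in finding 7 can be sketched with a per-connection bounded `asyncio.Queue`. The queue bound of 50 matches the circuit-breaker threshold in the finding; the class and method names below are illustrative, not the plan's actual API.

```python
import asyncio

# Minimal sketch of the per-connection send-queue back-pressure design (F7).
# Names are illustrative; the bound of 50 matches the circuit breaker
# threshold in the finding summary.

class ConnectionManager:
    QUEUE_LIMIT = 50  # circuit breaker: a client slower than this is disconnected

    def __init__(self) -> None:
        self._queues: dict[str, asyncio.Queue] = {}

    def register(self, conn_id: str) -> asyncio.Queue:
        q = asyncio.Queue(maxsize=self.QUEUE_LIMIT)
        self._queues[conn_id] = q
        return q  # a per-connection sender task drains this queue

    def publish(self, event: dict) -> list[str]:
        """Fan out one event; return ids of clients tripped by the breaker."""
        tripped = []
        for conn_id, q in list(self._queues.items()):
            try:
                q.put_nowait(event)        # never block the fan-out loop
            except asyncio.QueueFull:
                tripped.append(conn_id)    # slow client: close, force reconnect
                del self._queues[conn_id]  # `?since_seq=` replay recovers events
        return tripped
```

The key property is that the fan-out loop never awaits a slow client: the breaker trips, the connection is closed, and the `?since_seq=` replay path gives the client another chance at the queued events.
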
---

### 63.2 Sections Modified

| Section | Change |
|---------|--------|
| §4 WebSocket event schema | App-level receipt ACK protocol (F2); Redis Pub/Sub fan-out spec with code (F3); dead-connection ANSP fallback (F6); `ConnectionManager` back-pressure with per-connection queue (F7); offline missed-alert REST endpoint and notification centre treatment (F8); per-org Prometheus gauges and `ANSPNoLiveConnectionDuringTIPEvent` alert rule (F10) |
| §26.9 Caddy upstream configuration | WebSocket-specific Caddyfile additions: `idle_timeout 0`, WS path matcher, `read_timeout 0`, `write_timeout 0`, `flush_interval -1`; ping interval < proxy idle timeout rule (F11) |

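The Redis Pub/Sub fan-out (F3) hinges on the per-org channel naming and a stable event envelope. The sketch below shows that shape with the publish call injected; the helper names and envelope fields beyond `seq` are illustrative assumptions — production publishes via the Redis client, which is not shown here.

```python
import json

# Sketch of the per-org fan-out channel naming and event envelope (F3).
# Channel format `spacecom:alert:{org_id}` is from the finding summary;
# helper names and envelope fields beyond `seq` are illustrative.

def channel_for(org_id: str) -> str:
    return f"spacecom:alert:{org_id}"

def make_envelope(seq: int, event_type: str, payload: dict) -> str:
    """Serialise one event for publication on the org channel."""
    return json.dumps({"seq": seq, "type": event_type, "payload": payload})

def publish(pubsub_publish, org_id: str, seq: int, event_type: str, payload: dict) -> None:
    # `pubsub_publish` is injected (e.g. the Redis client's publish method) so
    # every backend instance holding connections for this org receives the event.
    pubsub_publish(channel_for(org_id), make_envelope(seq, event_type, payload))
```

Because ordering in Redis Pub/Sub is per-channel, per-org channels give exactly the intra-org ordering the `seq` field assumes (see F9 in §67.1).
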
---

### 63.3 New Tables, Metrics, and Files

| Artefact | Purpose |
|----------|---------|
| `alert_events.ws_receipt_confirmed` | Tracks whether client confirmed receipt of CRITICAL/HIGH alerts |
| `alert_events.ws_receipt_at` | Timestamp of client receipt confirmation |
| `spacecom_ws_send_queue_overflow_total{org_id}` | Counter: WS send queue circuit breaker activations |
| `spacecom_ws_org_connected{org_id, org_name}` | Gauge: whether org has ≥1 active WS connection |
| `spacecom_ws_org_connection_count{org_id}` | Gauge: count of active WS connections per org |
| `ANSPNoLiveConnectionDuringTIPEvent` | Prometheus alert rule: warning when ANSP has no WS connection during active TIP |
| On-call dashboard panel 9 | ANSP Connection Status table (below fold) |
| `docs/adr/0020-websocket-fanout-redis-pubsub.md` | ADR: Redis Pub/Sub for cross-instance WS fan-out |
| `docs/runbooks/websocket-proxy-config.md` | Runbook: WS proxy timeout configuration for cloud deployments |
| `docs/runbooks/ansp-connection-lost.md` | Runbook: ANSP with no live connection during TIP event |
| `GET /alerts?since=<ts>&include_offline=true` | Missed-alert reconciliation endpoint |

---

### 63.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| `delivered_websocket = TRUE` set at `send()` time | App-level receipt ACK with 10s timeout; `FALSE` triggers email fallback |
| Single fan-out loop blocks on slow client | Per-connection async send queue with circuit breaker; slow client disconnected, not blocking |
| Caddy default idle timeout terminates quiet WS connections | `idle_timeout 0` + `read_timeout 0` on WS paths; ping interval enforced below proxy timeout |
| No distinction between "connected to SpaceCom" and "receiving alerts during TIP event" | Per-org connection gauge + `ANSPNoLiveConnectionDuringTIPEvent` alert distinguishes the two |
| `resync_required` causes silent state restoration with no visual indication | `received_while_offline: true` annotation + amber border in notification centre |
| Dead socket detected by ping-pong, silently closed | Grace-period Celery task schedules ANSP notification; cancelled on reconnect |

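The first anti-pattern above — marking delivery at `send()` time — is replaced by an application-level round-trip with a 10 s timeout. A minimal sketch, with the send, ACK-wait, and fallback steps injected as callables (the names are illustrative, not the plan's actual API):

```python
import asyncio

# Sketch of the app-level receipt ACK for CRITICAL/HIGH alerts (F2).
# On timeout the alert escalates to the email fallback instead of being
# recorded as delivered. Injected callables are illustrative.

ACK_TIMEOUT_S = 10.0  # timeout from the finding summary

async def deliver_with_ack(send, wait_for_ack, fallback_email,
                           timeout: float = ACK_TIMEOUT_S) -> bool:
    """Return True if the client confirmed receipt, False if fallback fired."""
    await send()                                   # push the alert over the WebSocket
    try:
        await asyncio.wait_for(wait_for_ack(), timeout)
        return True                                # i.e. ws_receipt_confirmed = TRUE
    except asyncio.TimeoutError:
        await fallback_email()                     # no confirmation: escalate
        return False
```

Only CRITICAL and HIGH alerts pay this round-trip cost, per the ACK-scope decision in §63.5.
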
---

### 63.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Fan-out mechanism | Redis Pub/Sub | Sticky sessions (consistent hash) | Sticky sessions break blue-green deploys; Pub/Sub is stateless and works with any instance count |
| App-level ACK scope | CRITICAL and HIGH only | All events | Ack overhead for `ingest.status` and `spaceweather.change` is disproportionate; only safety-relevant alerts need receipt confirmation |
| Dead connection grace period | 120s normal / 30s active TIP | Immediate notification | False-positive notifications from brief network hiccups destroy operator trust in the system; grace period filters transient drops |
| Back-pressure circuit breaker | Close slow client (force reconnect) | Drop messages silently | Silently dropping alert messages is unacceptable; forced reconnect triggers the `?since_seq=` replay mechanism, giving the client another chance to receive the queued events |
| Caddy WS idle timeout | `0` (no timeout) on WS paths only | Global `0` | Non-WS paths benefit from timeout protection against slow HTTP clients; WS paths require persistent connections; path-specific override is the correct scope |

---

## §64 Data Governance & Privacy Engineering — Specialist Review

### 64.1 Finding Summary

| # | Finding | Severity | Resolution |
|---|---------|----------|-----------|
| 1 | No DPIA document — pre-processing obligation for high-risk processing of aviation professionals' behavioural data | High | §29.1 — Full DPIA structure added (EDPB WP248 template, 7 sections, key risk findings identified); `legal/DPIA.md` designated as Phase 2 gate before EU/UK ANSP shadow activation |
| 2 | Right-to-erasure conflict with 7-year safety retention unresolved | High | Already addressed — §29.3 documents pseudonymisation procedure; Art. 17(3)(b) exemption explicitly invoked. No further action required. |
| 3 | IP addresses stored full-resolution for 7 years — no necessity assessment, no minimisation policy | High | §29.1 — IP retention updated to 90 days full / hash retained for longer period; `hash_old_ip_addresses` Celery task specified; necessity assessment documented |
| 4 | No Record of Processing Activities (RoPA) document | Medium | Already addressed — §29.1 contains the RoPA table with all required Art. 30 fields; `legal/ROPA.md` designated as authoritative. No further action required. |
| 5 | Cross-border transfer mechanisms not documented per jurisdiction pair | Medium | Already addressed — §29.5 documents EU default hosting, SCCs for cross-border transfers, Australian APP8, data residency policy in `legal/DATA_RESIDENCY.md`. No further action required. |
| 6 | Handover notes and acknowledgement text retained as-written indefinitely — free-text personal references not pseudonymised | Medium | §29.3 — `pseudonymise_old_freetext` Celery task added; 2-year operational retention window; text replaced with `[text pseudonymised after operational retention window]` |
| 7 | No DSAR procedure or SLA — endpoint exists but no documented process | High | §29.4a — Full DSAR procedure added: 7-step runbook, 30-day SLA, 60-day extension provision, `legal/DSAR_LOG.md`, export scope defined, exemptions documented |
| 8 | Audit log mixes personal data and integrity records — single table, conflicting retention obligations | High | §29.9 — `integrity_audit_log` table split out for non-personal operational records (7-year retention); `security_logs` constrained to user-action types with CHECK; migration plan specified |
| 9 | No formal sub-processor register — sub-processor details scattered across multiple documents | Medium | §29.4 — `legal/SUB_PROCESSORS.md` register added with 5 sub-processors, transfer mechanism, DPA status; customer notification obligation documented |
| 10 | `operator_training_records` has no retention or pseudonymisation policy | Medium | §28.9 — Retention policy: active + 2 years post-deletion; `user_tombstone` column; pseudonymisation task extended to cover training records |
| 11 | ToS acceptance implies consent is the universal lawful basis — incorrect and creates compliance exposure | High | §29.10 — Lawful basis mapping table added (5 processing activities); clarification that ToS acceptance evidences consent only for specific acknowledgements; Privacy Notice requirement restated |

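The IP minimisation fix in finding 3 reduces to two decisions: is this record past the 90-day window, and what replaces the raw address. A minimal sketch, assuming a salted SHA-256 hash (salt handling and function names are illustrative; the real task runs under Celery against the security log tables):

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Sketch of the `hash_old_ip_addresses` policy (F3): IPs older than 90 days
# are replaced by a salted hash, so the audit trail survives without
# retaining the raw address. Salt handling is illustrative.

IP_RETENTION = timedelta(days=90)

def is_due_for_hashing(recorded_at: datetime, now: datetime) -> bool:
    return now - recorded_at > IP_RETENTION

def hash_ip(ip: str, salt: bytes) -> str:
    """Deterministic salted hash — comparable across records, not reversible."""
    return hashlib.sha256(salt + ip.encode()).hexdigest()
```
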
---

### 64.2 Sections Modified

| Section | Change |
|---------|--------|
| §28.9 Operator Training | Training records retention policy and pseudonymisation (F10): 2-year post-deletion window; `user_tombstone` column; Celery task extension |
| §29.1 Data Inventory | IP address retention updated to 90-day full / hash retained (F3); `hash_old_ip_addresses` Celery task; IP necessity assessment; DPIA structure expanded to full EDPB WP248 template (F1) |
| §29.3 Erasure Procedure | Free-text field periodic pseudonymisation added (F6): 2-year operational window; `pseudonymise_old_freetext` Celery task for `shift_handovers.notes_text` and `alert_events.action_taken` |
| §29.4 Data Processing Agreements | Sub-processor register table added (F9): 5 sub-processors, locations, transfer mechanisms |
| §29.4a (new) DSAR Procedure | Full 7-step DSAR procedure with 30-day SLA, export scope, exemption documentation (F7) |
| §29.9 (new) Audit Log Separation | `integrity_audit_log` table split; `security_logs` constrained to user-action types; migration plan (F8) |
| §29.10 (new) Lawful Basis Mapping | Per-activity lawful basis table; ToS acceptance ≠ universal consent; Privacy Notice requirement (F11) |

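The free-text pseudonymisation rule in §29.3 (F6) is an in-place replacement once the 2-year operational window has passed, so the record row survives while the personal reference does not. The replacement string matches the finding; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the `pseudonymise_old_freetext` eligibility rule (F6). The
# replacement string is from §29.3; the 2-year window is approximated as
# 730 days here for illustration.

OPERATIONAL_WINDOW = timedelta(days=2 * 365)
REDACTION = "[text pseudonymised after operational retention window]"

def pseudonymise_freetext(text: str, created_at: datetime, now: datetime) -> str:
    if now - created_at > OPERATIONAL_WINDOW:
        return REDACTION   # row survives for the safety record; reference removed
    return text            # still inside the PIR / investigation window
```
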
---

### 64.3 New Documents and Tables

| Artefact | Purpose |
|----------|---------|
| `legal/DPIA.md` | Data Protection Impact Assessment (EDPB WP248 template) — Phase 2 gate |
| `legal/SUB_PROCESSORS.md` | Art. 28 sub-processor register with transfer mechanisms |
| `legal/DSAR_LOG.md` | Log of all Data Subject Access Requests received and fulfilled |
| `docs/runbooks/dsar-procedure.md` | Step-by-step DSAR handling runbook |
| `tasks/privacy_maintenance.py` | Celery tasks: `hash_old_ip_addresses`, `pseudonymise_old_freetext` (extended to training records) |
| `integrity_audit_log` table | Non-personal operational audit records separated from `security_logs` |
| `operator_training_records.user_tombstone` | Pseudonymisation field for post-deletion training records |
| `operator_training_records.pseudonymised_at` | Timestamp tracking pseudonymisation |

---

### 64.4 Anti-Patterns Identified

| Anti-pattern | Correct approach |
|-------------|-----------------|
| DPIA treated as optional documentation exercise | Pre-processing legal obligation; EU personal data cannot be processed without completing it first |
| Full IP address retained for 7 years "for security" | 90-day necessity window; hash retained for longer-term audit; necessity assessment documented |
| Single `security_logs` table for both personal data and operational integrity records | Separate tables with separate retention policies; `integrity_audit_log` for non-personal records |
| ToS acceptance as universal consent mechanism | Lawful basis is determined by processing purpose; most SpaceCom processing is Art. 6(1)(b) or (f), not consent |
| Sub-processor details spread across multiple documents | Single `legal/SUB_PROCESSORS.md` register with mandatory Art. 28(3) fields |
| Free-text operational fields retained as-written indefinitely | 2-year operational window then pseudonymisation in place; record preserved, personal reference removed |

---

### 64.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| DPIA processing category | Art. 35(3)(b) — systematic monitoring of publicly accessible area | Art. 35(3)(a) — large-scale special category data | No special category data is processed; the systematic monitoring category is the correct trigger given real-time operational pattern tracking of named aviation professionals |
| IP hashing threshold | 90 days | 30 days / 1 year | 90 days covers the active investigation window for the vast majority of security incidents; shorter is unnecessarily restrictive for legitimate investigation; longer retains more than necessary |
| Free-text pseudonymisation window | 2 years post-creation | Immediate deletion / 7-year retention as-written | 2 years covers all active PIR, investigation, and regulatory inquiry periods while removing personal references well before maximum retention; deletion would destroy operational context needed for safety record; 7-year as-written retention is disproportionate |
| Audit log split mechanism | Separate table with CHECK constraint on `security_logs` | Application-level routing only | Database constraint enforces the separation at ingest time; application routing alone is fragile and will be bypassed as code evolves |
| DSAR response channel | Encrypted ZIP to verified email | In-platform download only | In-platform download is unavailable after account deletion; verified email ensures identity confirmation and provides a paper trail |

---

## Appendix §65 — Cost Engineering / FinOps Hat Review

**Hat:** Cost Engineering / FinOps

**Reviewer focus:** Infrastructure cost visibility, unit economics, per-resource attribution, cost anti-patterns, egress waste, idle resource cost

---

### 65.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|-----------------|-------------|
| F1 | No unit economics model — impossible to reason about margin per customer tier | HIGH | §27.7 (new) | Added unit economics model with cost-to-serve breakdown and break-even analysis; reference doc `docs/business/UNIT_ECONOMICS.md` |
| F2 | Storage table lacked cost figures — MC blob cost invisible to planners | MEDIUM | §27.4 | Added Cloud Cost/Year column to storage table; S3-IA pricing for MC blobs; noted dominant cost driver |
| F3 | No metric tracking external API calls (Space-Track budget at risk) | MEDIUM | §27.1 | Added `spacecom_ingest_api_calls_total{source}` counter; alert at Space-Track 100/day approaching AUP limit |
| F4 | No per-org simulation CPU tracking — Enterprise chargeback impossible | MEDIUM | §27.1 | Added `spacecom_simulation_cpu_seconds_total{org_id, norad_id}` counter; monthly usage report task |
| F5 | CZML egress cost unquantified; no brotli compression mandate | LOW | §27.5 | Added CZML egress cost estimate (~$1–7/mo at Phase 2–3); brotli compression policy added |
| F6 | Celery worker idle cost not analysed — $1,120/mo regardless of usage | HIGH | §27.3 | Added idle cost analysis; scale-to-zero rejected (violates MC SLO); scale-to-1 KEDA policy for Tier 3 documented |
| F7 | No per-org email rate limit — SMTP quota at risk during flapping events | MEDIUM | §4 (WebSocket/alerts) | Added 50 emails/hour/org rate limit with digest fallback; Celery hourly digest task; cost rationale |
| F8 | Renderer always-on rationale not documented; co-location OOM risk unaddressed | LOW | §35.5 | Added on-demand analysis table; confirmed always-on at Tier 1–2; documented co-location isolation requirement |
| F9 | Backup storage cost not projected — surprise cost at Tier 3 | LOW | §27.4 | Added WAL backup cost projection; $100–200/month at Tier 3 steady state |
| F10 | No Redis memory budget — result backend accumulation can cause OOM | HIGH | §27.8 (new) | Added Redis memory budget table by purpose/DB index; `maxmemory 2gb`; `result_expires=3600` requirement |
| F11 | No per-org cost attribution mechanism for Enterprise tier negotiations | MEDIUM | §27.1 | Added monthly usage report Celery task; per-org CPU-seconds → cost-per-run attribution |

---

### 65.2 Sections Modified

| Section | Change summary |
|---------|---------------|
| §27.1 Workload Characterisation | Added cost-tracking Prometheus counters (F3, F4) and per-org usage report task (F11) |
| §27.3 Deployment Tiers | Added Celery worker idle cost analysis and scale-to-zero decision table (F6) |
| §27.4 Storage Growth Projections | Added Cloud Cost/Year column; storage cost summary; backup cost projection (F2, F9) |
| §27.5 Network and External Bandwidth | Added CZML egress cost estimate and brotli compression policy (F5) |
| §27.7 Unit Economics Model (new) | Full unit economics model: cost-to-serve, revenue per tier, break-even analysis (F1) |
| §27.8 Redis Memory Budget (new) | Redis memory budget by purpose; `maxmemory` setting; result cleanup requirement (F10) |
| §4 WebSocket / Alerts | Added per-org email rate limit (50/hr) with digest fallback; SMTP cost rationale (F7) |
| §35.5 Renderer Container Constraints | Added on-demand analysis; memory isolation rationale; co-location risk guidance (F8) |

---

### 65.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| `docs/business/UNIT_ECONOMICS.md` | Unit economics model; cost-to-serve per tier; break-even analysis; update quarterly |
| `docs/infra/REDIS_SIZING.md` | Redis memory budget by purpose; eviction policy decisions; sizing rationale |
| `docs/business/usage_reports/{org_id}/{year}-{month}.json` | Per-org monthly usage reports for Enterprise tier chargeback |
| `backend/app/metrics.py` (additions) | `spacecom_ingest_api_calls_total` and `spacecom_simulation_cpu_seconds_total` counters |
| `backend/app/alerts/email_delivery.py` | Per-org email rate limiting logic with Redis counter and digest queue |
| `backend/celeryconfig.py` (addition) | `result_expires = 3600` to prevent Redis result backend accumulation |

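The `celeryconfig.py` addition above is a one-line setting; a minimal fragment is shown for context. `result_expires` is a real Celery setting; the surrounding values are assumptions for illustration, not mandated by this plan.

```python
# Illustrative backend/celeryconfig.py fragment for F10: expire task results
# after one hour so MC sub-task payloads do not accumulate in the Redis
# result backend. Values other than result_expires are assumptions.

result_expires = 3600                      # seconds; the F10 requirement
result_backend = "redis://redis:6379/1"    # assumed DB-index split per §27.8
task_acks_late = True                      # assumption: redeliver on worker loss
```
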
---

### 65.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|-------------|-------------|
| Scale-to-zero simulation workers | 60–120s Chromium-style cold-start violates 10-min MC SLO; scale-to-1 minimum is the correct floor |
| Co-locating renderer with simulation workers | Chromium 2–4 GB render memory + MC worker memory = OOM on 32 GB nodes; isolated container required |
| Unbounded alert emails per org | SMTP relay quota exhausted during flapping events; 50/hr cap with digest is operationally equivalent at lower cost |
| Redis without `result_expires` | MC sub-task result accumulation; 500 sub-tasks × 1 MB = 500 MB peak; without expiry, accumulates across runs indefinitely |
| Single Redis `noeviction` policy | Blocks cache use alongside broker in same instance; DB-index split with `allkeys-lru` on cache DB required |

---

### 65.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Simulation worker floor | Scale-to-1 minimum at Tier 3 | Scale-to-zero | Cold-start from zero violates 10-min MC SLO; one warm worker absorbs small queues instantly |
| Email rate limit mechanism | Redis hour-window counter + Celery digest task | Database-level throttle / no limit | Redis counter is O(1) per email with sub-millisecond latency; DB throttle adds per-email DB write at high fan-out; no limit is an SMTP quota risk |
| Unit economics granularity | Per-org CPU-seconds via Prometheus | Per-request DB logging | Prometheus counter aggregation has negligible overhead; DB per-request logging at MC sub-task granularity = 500 writes/run |
| Redis maxmemory target | 2 GB (`cache.r6g.large` with 8 GB RAM) | 4 GB / 1 GB | 2× headroom above 700–750 MB peak estimate; leaves room for the OS and other processes; capping below instance RAM means Redis hits `maxmemory` and alerts before the OS OOM-killer fires |
| CZML compression priority | Brotli before gzip in Caddy `encode` block | gzip only | Brotli achieves 70–80% reduction vs. gzip's 60–75%; modern browsers universally support brotli; on-premises clients are always browser-based |

---

## Appendix §66 — Open Source / Dependency Licensing Hat Review

**Hat:** OSS Licensing Engineer

**Reviewer focus:** Licence obligations for closed-source SaaS, SBOM completeness, redistribution constraints, IP risk in ESA bid context, contractor IP ownership

---

### 66.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|-----------------|-------------|
| F1 | CesiumJS AGPLv3 commercial licence not explicitly gated as Phase 1 blocker | CRITICAL | §6 Phase 1 checklist, §29.11 (new) | Added Phase 1 blocking gate requiring `cesium-commercial.pdf`; dedicated §29.11 F1 section with phase-gate language |
| F2 | SBOM covered container image (syft) but not dependency manifests (pip-licenses/license-checker JSON merge) | HIGH | §26.9 CI table, §6 Phase 1 checklist, §29.11 (new) | Added manifest SBOM merge to `build-and-push`; `docs/compliance/sbom/` as versioned store; Phase 1 gate updated |
| F3 | Space-Track AUP redistribution risk not analysed in detail for API endpoint and credential exposure | MEDIUM | §29.11 (new) | Added two-vector redistribution analysis (API exposure + credential in client-side code); confirmed `detect-secrets` coverage |
| F4 | poliastro LGPLv3 licence not documented; LGPL dynamic linking compliance undocumented | MEDIUM | §29.11 (new) | Added LGPL compliance assessment; `legal/LGPL_COMPLIANCE.md` required; standard pip install satisfies LGPL |
| F5 | TimescaleDB dual-licence (TSL vs Apache 2.0) not assessed; risk if TSL-only features adopted | MEDIUM | §29.11 (new) | Added feature-by-feature TimescaleDB licence table; confirmed SpaceCom uses only Apache 2.0 features; re-assessment gate if multi-node adopted |
| F6 | Redis SSPL adoption (7.4+) not assessed; Valkey alternative not documented | MEDIUM | §29.11 (new) | Added SSPL internal-use assessment; legal counsel confirmation required before Phase 3; Valkey/Redis 7.2 as fallback |
| F7 | Playwright/Chromium binary licence not captured in SBOM | LOW | §29.11 (new) | Confirmed Apache 2.0 (Playwright) + BSD-3 (Chromium); captured by `syft` container scan; no redistribution |
| F8 | Caddy enterprise plugin licence risk not noted; audit process not defined | LOW | §29.11 (new) | Added plugin licence audit requirement; PR checklist for Caddyfile changes |
| F9 | PostGIS GPLv2 linking exception not documented | LOW | §29.11 (new) | Confirmed linking exception applies to PostgreSQL extension use; `legal/LGPL_COMPLIANCE.md` to document |
| F10 | `pip-licenses --fail-on` list missing SSPL; no SSPL check on npm side | MEDIUM | §29.11 (new), §7.13 CI step | Added SSPL to Python fail-on list; SSPL added to npm failOn; exact version pinning requirement stated |
| F11 | No CLA or work-for-hire mechanism before contractor contributions | HIGH | §29.11 (new), §6 Phase 2 checklist | Added CLA template requirement (`legal/CLA.md`); `CONTRIBUTING.md` disclosure; Phase 2 gate |

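The fail-on gate in F10 can be applied to either tool's JSON output with the same check. Production runs `pip-licenses --fail-on` and the npm `license-checker` failOn list; the helper below is an illustrative sketch, and the blocked-licence set shows the intent (SSPL now included) rather than the exact list.

```python
# Sketch of the licence fail-on gate (F10), applied to a parsed
# pip-licenses / license-checker JSON report. Blocked set is illustrative.

BLOCKED = {"SSPL", "SSPL-1.0", "AGPL-3.0", "AGPL-3.0-only", "AGPL-3.0-or-later"}

def licence_violations(packages: list[dict]) -> list[str]:
    """Return package names whose declared licence is on the blocked list."""
    return [
        p["name"]
        for p in packages
        # Some tools report multi-licence strings separated by ';'
        if any(tok.strip() in BLOCKED for tok in p["licence"].split(";"))
    ]
```

A CI step would fail the build when `licence_violations(...)` is non-empty, with CesiumJS excluded only once the commercial licence is on file (see the anti-patterns below).
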
---

### 66.2 Sections Modified

| Section | Change summary |
|---------|---------------|
| §6 Phase 1 legal/compliance checklist | Added CesiumJS commercial licence as explicit blocking gate; expanded SBOM checklist item to cover manifest SBOMs; added LGPL/PostGIS and TimescaleDB/Redis licence document gates |
| §26.9 CI workflow table | Updated `build-and-push` job to include manifest SBOM merge and `docs/compliance/sbom/` artefact storage |
| §29.11 (new) | Full OSS licence compliance section: F1–F11 covering all material dependencies |

---

### 66.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| `legal/OSS_LICENCE_REGISTER.md` | Authoritative per-dependency licence record; updated on major version changes |
| `legal/LICENCES/cesium-commercial.pdf` | Executed CesiumJS commercial licence — Phase 1 blocking gate |
| `legal/LICENCES/timescaledb-licence-assessment.md` | TimescaleDB Apache 2.0 vs. TSL feature confirmation |
| `legal/LICENCES/redis-sspl-assessment.md` | Redis SSPL internal-use assessment; legal counsel sign-off |
| `legal/LGPL_COMPLIANCE.md` | poliastro LGPL dynamic linking compliance; PostGIS GPLv2 linking exception |
| `legal/CLA.md` | Contributor Licence Agreement template for external contributors |
| `docs/compliance/sbom/` | Versioned SBOM artefacts: `syft` SPDX-JSON + manifest JSONs per release |
| `CONTRIBUTING.md` | CLA requirement disclosure; external contributor instructions |

---

### 66.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|-------------|-------------|
| "CesiumJS licence can wait until Phase 2" | AGPLv3 network use provision applies from the first external demo — waiting creates retroactive non-compliance exposure in an ESA bid context |
| Excluding CesiumJS from the licence gate without a commercial licence on file | CI exclusion hides the issue; the gate is correct only when the commercial licence exists |
| Assuming LGPL dynamic linking is automatically satisfied | Must be documented; LGPL allows relinking — standard pip install satisfies this but the compliance position must be written down |
| Single Redis `noeviction` policy | Already rejected in §65; Redis SSPL also motivates Valkey evaluation as BSD-3 alternative |
| Assuming all TimescaleDB features are Apache 2.0 | TSL features (multi-node, data tiering) would require a Timescale commercial agreement; feature use must be tracked |

---

### 66.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| CesiumJS licence | Commercial licence from Cesium Ion; Phase 1 blocker | Open-source the frontend (comply with AGPLv3) | Source disclosure of SpaceCom's frontend is commercially unacceptable; commercial licence is the only viable path for a closed-source product |
| Redis SSPL response | Legal counsel assessment; Valkey as fallback | Immediate migration to Valkey | Internal-use assessment is likely favourable; premature migration introduces risk; assess first |
| poliastro LGPL | Document standard pip install compliance | Seek MIT-licensed alternative | Standard pip install satisfies LGPL dynamic linking; replacing poliastro would require significant re-engineering for marginal legal gain |
| SBOM format | SPDX-JSON (syft) + pip-licenses/license-checker manifests merged | CycloneDX only | SPDX is the format required by ECSS and EU Cyber Resilience Act; CycloneDX can be generated alongside if required by a specific customer |

---

## Appendix §67 — Distributed Systems / Consistency Hat Review

**Hat:** Distributed Systems Engineer

**Reviewer focus:** Consistency guarantees, failure modes, split-brain scenarios, clock skew, ordering, idempotency, CAP trade-offs

---
|
||
|
||
### 67.1 Findings and Fixes
|
||
|
||
| # | Finding | Severity | Section modified | Fix applied |
|
||
|---|---------|----------|-----------------|-------------|
|
||
| F1 | Chord callback doesn't validate result count — partial results silently produce truncated predictions | CRITICAL | §27.2 chord section | Added result count guard in `aggregate_mc_results`; raises `ValueError` on mismatch; `spacecom_mc_chord_partial_result_total` counter; DLQ routing |
| F2 | No Celery `autoretry_for=(OperationalError,)` on DB-writing tasks — Patroni 30s failover window causes permanent task failure | HIGH | §27.6 PgBouncer section | Added `autoretry_for=(OperationalError,)` policy; `max_retries=3`, `retry_backoff=5`, cap 30s; applies to all DB-writing Celery tasks |
| F3 | Redis Sentinel split-brain risk not documented or assessed | MEDIUM | §26 Redis Sentinel section | Added split-brain assessment; accepted risk for ephemeral data; `min-replicas-to-write 1` mitigates; ADR-0021 required |
| F4 | HMAC signing race — prediction INSERT then HMAC UPDATE creates window of unsigned prediction | HIGH | §10 HMAC section | Fixed: pre-generate UUID in application before INSERT; compute HMAC with UUID; single-phase write; migration from BIGSERIAL to UUID PK documented |
| F5 | `alert_events.seq` assigned via `MAX(seq)+1` trigger — concurrent inserts produce duplicates | HIGH | §4 WebSocket/events section | Replaced with `CREATE SEQUENCE alert_seq_global`; globally monotonic; per-org ordering via `WHERE org_id = $1 ORDER BY seq` |
| F6 | Clock skew between server and client causes CZML ground track timing drift — no detection mechanism | MEDIUM | §4 API section | Added `chronyd`/`timesyncd` host requirement; `node_timex_sync_status` Grafana alert; `GET /api/v1/time` endpoint; client-side skew warning banner at >5s |
| F7 | MinIO multipart upload has no retry on write quorum failure — MC blob lost silently | HIGH | §27.4 storage section | Added `autoretry_for=(S3Error,)` with 30s backoff; MinIO ILM rule to abort incomplete multipart uploads after 24h |
| F8 | celery-redbeat double-fire on restart: only TLE ingest has `ON CONFLICT DO NOTHING`; space weather and IERS EOP lack upsert | MEDIUM | §11 ingest section | Added upsert patterns for all periodic ingest tables; unique constraint requirements stated |
| F9 | WebSocket fan-out cross-channel ordering — no cross-org ordering guarantee | LOW | — | Already addressed — Redis Pub/Sub ordering is per-channel (per-org); sequence numbers provide intra-org ordering. No further action required. |
| F10 | `reentry_predictions` FK referenced with default `CASCADE` — accidental simulation delete cascades to legal-hold predictions | HIGH | §9 schema | Changed all `REFERENCES reentry_predictions(id)` to `ON DELETE RESTRICT` in `alert_events`, `prediction_outcomes`, `superseded_by` FK |
| F11 | No distributed trace context propagation through chord sub-tasks and callback | MEDIUM | §26.9 OTel section | Added chord trace context injection/extraction pattern; verified `CeleryInstrumentor` for single tasks; manual `propagate.inject/extract` for chord callback continuity |

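The F1 guard can be sketched as a plain function. This is illustrative only: the name `aggregate_mc_results` comes from the plan, but the chunk layout, parameters, and return shape are assumptions made for the sketch.

```python
def aggregate_mc_results(results, expected_chunks):
    """Chord callback guard (sketch): refuse to aggregate a partial
    result set instead of silently producing a truncated prediction."""
    if len(results) != expected_chunks:
        # A 400-sample prediction is not a 500-sample prediction:
        # fail loudly so the task can be routed to the DLQ.
        raise ValueError(
            f"chord returned {len(results)} of {expected_chunks} chunks"
        )
    # Flatten per-chunk Monte Carlo samples into one result set.
    samples = [s for chunk in results for s in chunk]
    return {"samples": samples, "n": len(samples)}
```

In the real callback the `ValueError` is what triggers DLQ routing and increments `spacecom_mc_chord_partial_result_total`.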
---

### 67.2 Sections Modified

| Section | Change summary |
|---------|---------------|
| §27.2 MC Parallelism | Added chord result count validation in `aggregate_mc_results`; partial result counter |
| §27.6 DNS / PgBouncer | Added Celery `autoretry_for=(OperationalError,)` policy for Patroni failover window |
| §26 Redis Sentinel | Added split-brain risk assessment; `min-replicas-to-write 1` config; ADR-0021 |
| §10 HMAC signing | Fixed two-phase write race: pre-generate UUID, single-phase INSERT; PK migration note |
| §4 WebSocket schema | Added `alert_seq_global` PostgreSQL SEQUENCE replacing `MAX(seq)+1` trigger |
| §4 API / health | Added `GET /api/v1/time` clock skew endpoint; NTP sync requirement; client banner |
| §27.4 Storage | Added MinIO multipart upload retry; incomplete upload ILM expiry rule |
| §11 Ingest | Added upsert patterns for space_weather and IERS EOP; unique constraint requirements |
| §9 Data Model | Changed `REFERENCES reentry_predictions(id)` to `ON DELETE RESTRICT` on 3 FKs |
| §26.9 OTel/Tracing | Added chord trace context propagation pattern; `propagate.inject/extract` for callback |

---

### 67.3 New ADRs Required

| ADR | Decision |
|-----|---------|
| `docs/adr/0021-redis-sentinel-split-brain-risk-acceptance.md` | Accept Redis Sentinel split-brain risk for ephemeral data; `min-replicas-to-write 1` mitigation; email rate limit counter inconsistency accepted as cost control gap |

---

### 67.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|-------------|-------------|
| `MAX(seq)+1` for sequence assignment in trigger | Race condition under concurrent inserts — two transactions read same MAX and both write the same seq; PostgreSQL `SEQUENCE` is lock-free and gap-tolerant |
| Two-phase HMAC (INSERT then UPDATE) | Creates a window where a valid unsigned prediction exists in the DB; single-phase INSERT with pre-generated UUID eliminates the window |
| No retry on Celery DB tasks during Patroni failover | The 30s failover window is a known operational event; retries with 5s exponential backoff, capped at 30s, span the failover window so the task survives it |
| `ON DELETE CASCADE` on legal-hold FK references | Accidental deletion of a simulation row would cascade to 7-year-retention safety records; `RESTRICT` forces explicit deletion of dependents first, making accidental cascade impossible |
| Scale-to-zero with immediate cold-start | Already rejected in §65; distributed systems perspective adds: cold-start during Patroni failover + worker cold-start = double failure; always keep 1 warm worker |

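The retry rationale above can be checked with a quick sketch of the delay schedule. Assumptions: this mirrors Celery's `retry_backoff`/`retry_backoff_max` semantics with jitter disabled (Celery enables `retry_jitter` by default) and is not the shipped task code.

```python
def backoff_schedule(base=5, max_retries=3, cap=30):
    """Delays in seconds for exponential-backoff retries (sketch):
    base doubles each attempt, individual delays capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]

# With the F2 policy (base 5 s, 3 retries, 30 s cap) the delays are
# 5 + 10 + 20 = 35 s of retry window, which bridges a 30 s Patroni failover.
delays = backoff_schedule()
```

A longer `max_retries` would simply repeat the 30 s cap for later attempts.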
---

### 67.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Chord result count validation | `ValueError` → DLQ → `HTTP 500 + Retry-After` | Silently write partial result | A 400-sample prediction is not a 500-sample prediction; confidence intervals and corridor widths are wrong; it is safer to fail visibly |
| reentry_predictions PK type | Migrate BIGSERIAL → UUID; pre-generate in application | Keep BIGSERIAL; use two-phase HMAC | UUID pre-generation eliminates the race window; UUID is also a safer choice for distributed deployments where sequence coordination between nodes is not possible |
| alert_seq assignment | Single global `alert_seq_global` SEQUENCE | Per-org sequences | Single sequence is simpler to manage; global monotonicity is sufficient for per-org ordering by filtering on org_id; per-org sequences require one sequence per org — complex at scale |
| Redis split-brain response | Accept risk; document in ADR | Migrate to Redis Cluster (stronger consistency) | Redis Cluster adds significant operational complexity (hash slots, resharding, client-side routing); split-brain on Sentinel with 3 nodes is rare and the affected data is ephemeral or cost-control only |

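The single-phase HMAC decision can be illustrated with a minimal sketch. The field names, JSON canonicalisation, and helper names here are assumptions for illustration; only the pre-generated-UUID and sign-before-INSERT pattern comes from the plan.

```python
import hashlib
import hmac
import json
import uuid

def signed_prediction_row(payload: dict, key: bytes) -> dict:
    """Sketch: generate the UUID PK in the application, compute the
    HMAC over id + payload, and return a row ready for a single-phase
    INSERT. No unsigned row ever exists in the database."""
    row = {"id": str(uuid.uuid4()), **payload}
    msg = json.dumps(row, sort_keys=True).encode()
    row["hmac"] = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return row

def verify_row(row: dict, key: bytes) -> bool:
    """Recompute the HMAC over every field except the signature."""
    unsigned = {k: v for k, v in row.items() if k != "hmac"}
    msg = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(row["hmac"], expected)
```

Any tampering with a signed field, or signing with a different key, fails verification.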
---

## Appendix §68 — Commercial / Pricing Architecture Hat Review

**Hat:** Commercial Strategy / Pricing Architect
**Reviewer focus:** Pricing model design, deal structure, revenue protection, margin preservation, enterprise negotiation guardrails, commercial signals in technical architecture

---

### 68.1 Findings and Fixes

| # | Finding | Severity | Section modified | Fix applied |
|---|---------|----------|-----------------|-------------|
| F1 | No `contracts` table — feature access not gated on commercial state; admin can enable Enterprise features with no contract | CRITICAL | §9 data model, §24 commercial section | Added `contracts` table with financial terms, feature enablement flags, discount approval constraint, PS tracking; nightly sync task |
| F2 | Usage data not surfaced to commercial team or org admins — renewal conversations lack data | HIGH | §27.7 unit economics | Added monthly usage summary emails to commercial team and org admins; `send_usage_summary_emails` Beat task |
| F3 | No shadow trial time limit — ANSP could remain in shadow mode indefinitely without signing production contract | HIGH | §9 organisations table | Added `shadow_trial_expires_at` column; enforcement via daily Celery task that auto-deactivates expired trials |
| F4 | No discount approval guard-rails — single admin can give 100% discount | MEDIUM | §9 contracts table | Added `CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL)` constraint; discount >20% requires named approver |
| F5 | No inbound API request counter — usage-based billing for Persona E/F impossible | MEDIUM | §27.1 metrics | Added `spacecom_api_requests_total{org_id, endpoint, version, status_code}`; FastAPI middleware |
| F6 | On-premise deployments have no licence key enforcement — multi-instance or post-expiry use undetectable | HIGH | §34 infrastructure section | Added RSA JWT licence key mechanism; licence-expired degraded mode; hourly Celery re-validation; key rotation script |
| F7 | No contract expiry alerts — contracts expire silently; revenue risk | HIGH | §4 Celery tasks | Added `check_contract_expiry` Beat task at 90/30/7-day thresholds; courtesy notice to org admin at 30 days |
| F8 | Free/shadow tier has no MC simulation quota — free usage consumes paid-tier worker capacity | MEDIUM | §9 organisations table, §27.7 | Added `monthly_mc_run_quota` column (default 100); `POST /api/v1/decay/predict` quota enforcement with `429 + Retry-After` |
| F9 | No MRR/ARR tracking — commercial team cannot measure revenue targets | HIGH | §9 contracts table, §27.7 | `contracts.monthly_value_cents` + `spacecom_mrr_eur` Prometheus gauge updated nightly; Grafana MRR panel |
| F10 | Professional Services not documented as a revenue line — first-year contract value underestimated | MEDIUM | §27.7 unit economics | Added PS revenue table (engagement types, values); `contracts.ps_value_cents`; Year 1 total contract value formula |
| F11 | Multi-ANSP coordination panel available to all tiers — high-value Enterprise feature not packaging-protected | MEDIUM | §9 organisations table | Added `feature_multi_ansp_coordination BOOLEAN NOT NULL DEFAULT FALSE`; gated in UI by feature flag; synced from `contracts.enables_multi_ansp_coordination` |

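The F8 quota check reduces to a small decision function. Sketch only: the real enforcement sits in the `POST /api/v1/decay/predict` handler; the function name, the use of 202 for the accepted path, and the reset-time parameter are assumptions of this illustration.

```python
def quota_gate(runs_used: int, monthly_quota: int, seconds_to_reset: int):
    """Sketch of free/shadow-tier MC quota enforcement: over-quota
    requests receive 429 with a Retry-After header pointing at the
    monthly quota reset; in-quota requests proceed to dispatch."""
    if runs_used >= monthly_quota:
        return 429, {"Retry-After": str(seconds_to_reset)}
    return 202, {}
```

Returning `Retry-After` keeps the refusal self-describing: an evaluating ANSP can see exactly when the quota window reopens.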
---

### 68.2 Sections Modified

| Section | Change summary |
|---------|---------------|
| §9 organisations table | Added `shadow_trial_expires_at`, `monthly_mc_run_quota`, `feature_multi_ansp_coordination`, `licence_key`, `licence_expires_at` columns |
| §9 (new contracts table) | Full `contracts` table with financial terms, discount approval constraint, feature enablement, PS tracking |
| §24 commercial section | Added contracts table spec, MRR tracking, feature sync task, discount enforcement |
| §27.1 cost-tracking metrics | Added `spacecom_api_requests_total{org_id, endpoint, version, status_code}` counter |
| §27.7 unit economics | Added PS revenue table; shadow trial quota enforcement code; usage summary emails |
| §34 on-premise deployment | Added RSA JWT licence key mechanism; degraded mode on expiry; key rotation process |
| §4 Celery Beat tasks | Added `check_contract_expiry` 90/30/7-day alert task; `send_usage_summary_emails` monthly task |

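The threshold logic behind `check_contract_expiry` can be sketched as a pure function. The fire-once bookkeeping via an `already_sent` set is an assumption of this sketch; the 90/30/7-day thresholds come from the plan.

```python
THRESHOLDS = (90, 30, 7)

def due_alerts(days_remaining: int, already_sent: set) -> list:
    """Sketch: return the expiry-alert thresholds that have been
    crossed (days_remaining at or below them) and not yet fired.
    The daily Beat task would email for each returned threshold."""
    return [t for t in THRESHOLDS
            if days_remaining <= t and t not in already_sent]
```

Each threshold fires exactly once as the contract runs down, so a contract can never expire silently.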
---

### 68.3 New Files and Documents Required

| File | Purpose |
|------|---------|
| `docs/business/UNIT_ECONOMICS.md` | Updated with PS revenue line, Year 1 total contract value formula, MRR tracking |
| `tasks/commercial/contract_expiry_alerts.py` | Contract expiry Celery task (90/30/7-day thresholds) |
| `tasks/commercial/send_commercial_summary.py` | Monthly commercial team usage summary email |
| `tasks/commercial/sync_feature_flags.py` | Nightly sync of org feature flags from active contracts |
| `scripts/generate_licence_key.py` | RSA JWT licence key generation script (requires private key) |
| `legal/contracts/` | Contract document store (MSA PDFs, signed sandbox agreements) |

---

### 68.4 Anti-Patterns Rejected

| Anti-pattern | Why rejected |
|-------------|-------------|
| Admin toggle for feature access without contract gate | Single admin can bypass commercial controls; `contracts` table with nightly sync is the authoritative source |
| Unlimited MC runs for free tier | Free-tier heavy users degrade paid-tier SLO by consuming simulation worker capacity; 100-run/month quota is enforceable without impacting legitimate evaluation |
| Honour-system on-premise licensing | Without a licence key, post-expiry use is undetectable and unenforceable; JWT with RSA signature provides cryptographic enforcement with no ongoing connectivity requirement |
| Silent contract expiry | Revenue loss from silent expiry is predictable and preventable; 90/30/7-day alerts are standard SaaS practice |
| Infinite shadow trial | Shadow mode is a commercial transition stage, not a permanent state; `shadow_trial_expires_at` enforces the commercial expectation established in the Regulatory Sandbox Agreement |

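The licence-expiry half of the F6 mechanism can be sketched in isolation. This deliberately shows only the expiry decision on claims that are assumed to be already signature-verified; the RSA verification step itself (public key shipped with the on-premise bundle) is out of scope for the sketch, and the claim and mode names are assumptions.

```python
import time

def licence_mode(claims, now=None):
    """Sketch: decide run mode from verified licence JWT claims.
    Expiry puts the deployment into a degraded mode rather than a
    hard shutdown, matching the licence-expired behaviour in §34."""
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return "degraded"   # expired (or missing) licence term
    return "active"
```

Because the JWT is verifiable offline, this check also works for air-gapped deployments, which is the rationale recorded in the §68.5 decision log.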
---

### 68.5 Decision Log

| Decision | Chosen approach | Rejected alternative | Rationale |
|----------|----------------|---------------------|-----------|
| Feature flag sync | Nightly Celery task syncs from `contracts` | Real-time sync on every request | Real-time sync adds DB query per request; nightly sync is sufficient for contract-level changes which happen at most monthly |
| Licence key format | RSA-signed JWT | Database-backed licence check | JWT is verifiable offline (no network required for air-gapped deployments); RSA signature prevents forgery without access to SpaceCom private key |
| Discount approval threshold | 20% without approval; >20% requires named approver | Flat approval for all discounts | 0-20% is sales discretion; >20% represents strategic pricing requiring commercial leadership sign-off; DB constraint makes this enforceable rather than advisory |
| PS revenue tracking | `contracts.ps_value_cents` one-time field | Separate PS contracts table | PS is almost always bundled with the main contract at first engagement; a separate table adds complexity for marginal benefit at Phase 2-3 scale |
| MRR metric | Prometheus gauge from nightly Celery task | Real-time DB query in Grafana | Prometheus gauge is consistent with other business metrics; Grafana can scrape it without a DB connection; historical MRR trend is automatically recorded |

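The discount rule is enforced by the DB constraint, but an application-side mirror makes the semantics explicit and testable before a row ever reaches PostgreSQL. This helper is an illustration, not part of the plan's file list.

```python
def discount_row_valid(discount_pct, approved_by):
    """Sketch mirroring the contracts-table constraint
    CHECK (discount_pct <= 20 OR discount_approved_by IS NOT NULL):
    up to 20% is sales discretion; above that a named approver
    must be recorded on the row."""
    return discount_pct <= 20 or approved_by is not None
```

The DB constraint remains authoritative; the mirror just gives a friendlier validation error at the API layer.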
---

## §69 Cross-Hat Governance and Decision Authority

This section resolves conflicts between specialist reviews. SpaceCom uses hats to surface expert constraints, not to create parallel authorities. Where hats conflict, this section defines who decides, how the decision is recorded, and which interpretation governs implementation.

### 69.1 Decision Authority Model

| Decision class | Primary owner | Mandatory reviewers | Tie-break principle |
|---|---|---|---|
| Product packaging, contracts, commercial entitlements | Product / Commercial owner | Legal, Engineering | Contractual and legal truth beats UI shorthand |
| Safety-critical alerting, operational UX, hazard communication | Safety case owner | Human Factors, Regulatory, Engineering | Safer operator outcome beats convenience or sales flexibility |
| Core architecture, infrastructure, CI/CD, consistency | Architecture / Platform owner | Security, SRE, DevOps | Lower operational risk and clearer failure semantics beat elegance |
| Privacy, data governance, lawful basis, retention | Legal / Privacy owner | Product, Engineering | Regulatory obligation beats implementation convenience |
| External licensing / open source / procurement artefacts | Legal / Procurement owner | Engineering, Product | Licence compliance beats delivery speed |

Any unresolved cross-hat conflict is recorded in `docs/governance/CROSS_HAT_CONFLICT_REGISTER.md` before implementation proceeds.

### 69.2 Arbitration Rules Adopted

1. **Commercial source of truth:** `contracts` is the authoritative source for features, quotas, and deployment rights. `subscription_tier` is descriptive only.
2. **CI/CD platform:** SpaceCom uses self-hosted GitLab. All GitHub Actions references in the plan are interpreted as GitLab CI equivalents and must be implemented in `.gitlab-ci.yml`, protected environments, and GitLab approval rules.
3. **Redis split by trust class:** `redis_app` holds higher-integrity application state; `redis_worker` holds broker/result/cache state. Split-brain acceptance applies only to `redis_worker`.
4. **Commercial enforcement deferral:** Licence expiry, shadow-trial expiry, and quota exhaustion must not interrupt active TIP / CRITICAL operations. Enforcement is deferred, logged, and applied after the active event closes.
5. **Alert escalation matrix:** Progressive escalation is the default. Immediate bypass is allowed only for imminent-impact or integrity-compromise conditions formally listed in the alert definition and traced into safety artefacts.
6. **Renderer privilege exception:** The renderer `SYS_ADMIN` capability is an approved exception, not a precedent. Any similar request from another service requires a new ADR and security review.
7. **Phase 0 blockers:** Space-Track AUP architecture and Cesium commercial licensing are Phase 0 gates. Work that would lock in ingest or frontend architecture must not proceed before those gates are closed.

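Rule 4 (commercial enforcement deferral) is the one arbitration rule that lives directly in code paths. A minimal sketch of the gate, with function and parameter names invented for illustration:

```python
def apply_commercial_enforcement(action, active_critical_event, deferred_queue):
    """Sketch of arbitration rule 4: licence-expiry, trial-expiry, and
    quota-exhaustion actions are never applied while a TIP / CRITICAL
    event is active; they are queued (and logged) for application
    after the event closes, never dropped."""
    if active_critical_event:
        deferred_queue.append(action)
        return "deferred"
    return "applied"
```

Keeping deferred actions in a durable queue means the commercial state converges once the event closes, so deferral never becomes silent forgiveness.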
### 69.3 Phase 0 Governance Gates

Before Phase 1 implementation begins, the following must be complete:

- Space-Track AUP architecture decision recorded in `docs/adr/0016-space-track-aup-architecture.md`
- Cesium commercial licence executed and stored at `legal/LICENCES/cesium-commercial.pdf`
- GitLab CI/CD authority confirmed in platform docs and reflected in `.gitlab-ci.yml`
- `contracts` entitlement model and synchronisation path approved by Product, Legal, and Engineering
- Redis trust split (`redis_app` / `redis_worker`) approved by Architecture, Security, and SRE

These are architectural commitment gates, not paperwork gates. If any remain open, implementation that would cement the affected design area is blocked.

### 69.4 Intervention Register

| Conflict | Sections affected | Intervention | Owner | Status |
|---|---|---|---|---|
| `subscription_tier` vs `contracts` authority | §16.1, §24, §68 | `contracts` made authoritative; org flags become derived cache | Product / Commercial | Accepted |
| GitHub Actions vs self-hosted GitLab | §26.9, §30.4, §30.7, delivery checklists | GitLab CI/CD designated authoritative | Platform | Accepted |
| Shared Redis vs accepted split-brain risk | §3.2, §3.3, §65, §67 | Redis split into app-state and worker-state trust domains | Architecture / Security | Accepted |
| Commercial enforcement during incidents | §9, §27.7, §34, §68 | Enforcement deferred during active TIP / CRITICAL event | Product / Operations | Accepted |
| HF progressive escalation vs safety urgency | §28.3, §60, §61 | Immediate-bypass matrix added for imminent-impact and integrity events | Safety case owner | Accepted |
| Non-root/container hardening vs renderer `SYS_ADMIN` | §3.3, §7.11 | Renderer documented as approved exception with tighter isolation | Security / Platform | Accepted |
| Implementation starting before legal/licence blockers close | §6, §19, §21, §29.11 | Blockers moved into Phase 0 governance gates | Programme owner | Accepted |