Live Demo — novikov-rocket-lab.streamlit.app Adjust physics sliders and watch the LightGBM model classify trajectories in real time.
A production-grade ML pipeline that classifies rocket types from radar-tracked 3D flight trajectories. Given a variable-length sequence of (traj_ind, x, y, z, time_stamp) radar observations, the system extracts physics-derived features — enriched with domain-specific salvo and rebel-group context — and predicts the rocket class (0, 1, or 2).
Standard accuracy is the wrong metric here. Class 2 comprises only 7.1% of training data (2,339 of 32,741 trajectories), yet the evaluation metric is minimum per-class recall — the system is scored by whichever class it handles worst. A model that perfectly identifies classes 0 and 1 but misses every class 2 rocket scores 0.0.
This metric demands that every design decision — feature engineering, class weighting, objective function, and post-hoc threshold tuning — be oriented toward equalising recall across all classes, deliberately sacrificing majority-class precision where necessary to protect minority-class recall.
Result: Global OOB min-recall of 1.000000 — 0 misclassified trajectories out of 32,741.
| Class | Count | Recall (OOB) | Wrong |
|---|---|---|---|
| 0 (majority — 68.6%) | 22,462 | 1.000000 | 0 |
| 1 (24.3%) | 7,940 | 1.000000 | 0 |
| 2 (minority — 7.1%) | 2,339 | 1.000000 | 0 |
| Global OOB (consensus) | 32,741 | 1.000000 | 0 |
Achieved via proximity-based salvo consensus: the 2 OOB misses (1 class-0, 1 class-1) were rockets in tight salvos (dist ≈ 0 m, Δt < 12 s), each surrounded by same-class neighbours. Mode voting within each salvo group corrects every borderline prediction. Group purity = 100%, n_broken = 0.
No mountains or valleys — z is absolute altitude with no terrain correction required. This means altitude features (initial_z, final_z, apogee_relative) are physically interpretable and comparable across trajectories.
Rockets follow standard ballistic flight modelled as a frictionless point mass. Under this assumption, vertical acceleration is constant at −g; any deviation from that baseline is attributable solely to motor burn. Kinematic features (velocity, acceleration, jerk) therefore capture the thrust profile directly, with no confounding aerodynamic terms.
Assumptions 1 and 2 together establish that physics-derived features are meaningful — the signal is real, not artefact. Assumption 3 provides three business sub-assumptions from a domain expert, each directly driving a feature engineering decision.
Different launcher types can fire a different maximum number of rockets. This means the largest salvo ever observed from a rebel base reveals which launcher type they operate. Feature: group_max_salvo_size.
Rockets are typically fired together or with a short delay. Spatiotemporal DBSCAN clustering on (launch_x, launch_y, launch_time_s) identifies co-fired rockets and produces four features:
| Feature | Description |
|---|---|
salvo_size |
Number of rockets in the same salvo |
salvo_duration_s |
Elapsed time from first to last launch in the salvo |
salvo_spatial_spread_m |
Maximum pairwise launch-point distance within the salvo |
salvo_time_rank |
This rocket's launch order within its salvo |
salvo_time_rank is the most discriminative of these, ranking 12th in model feature importance — rockets fired later in a salvo carry a detectable kinematic imprint.
There are few rebel groups, each operating from a fixed geographic area. Critically, each group purchases its own rocket supply independently — they are not organised under a single umbrella organisation. This means every group stocks one rocket type. Pure-spatial DBSCAN on (launch_x, launch_y) (eps=0.25, min_samples=3) produces group-level features:
| Feature | Description |
|---|---|
group_total_rockets |
Total rockets ever fired from this base |
group_n_salvos |
Number of distinct firing events from this base |
group_max_salvo_size |
Largest single salvo (proxy for launcher type — assumption 3a) |
Empirical finding: DBSCAN (on StandardScaler-normalised (launch_x, launch_y), eps=0.25) finds 3 geographic clusters. KDE contour analysis reveals a nested structure: Class 1 and Class 2 each occupy their own concentrated cluster, while Class 0 appears both in a separate adjacent cluster and as inner pockets at the core of each of the other two clusters. Consistent with assumption 3c (each group stocks one type), the inner Class 0 pockets represent independent Class 0 groups operating in geographic proximity to the Class 1 and Class 2 groups — not the same group firing multiple types. Individual exact-coordinate launch sites are class-exclusive (8,016 unique sites, zero cross-class overlap). The group-level aggregate features carry near-zero SHAP importance — the raw launch coordinates (launch_x, launch_y) already capture this signal directly (SHAP ranks 2 and 5).
Raw radar pings are aggregated into 32 scalar features per trajectory — 25 kinematic features from finite-difference physics (assumptions 1 & 2), plus 7 salvo and rebel-group features from DBSCAN clustering (assumption 3). These 32 survived automated backward elimination during training.
Why these features discriminate between rocket classes:
| Feature group | Selected | Why it works |
|---|---|---|
Velocity — vy_mean, vy_max, v_horiz_median, v_horiz_std, initial_speed |
5 | Muzzle velocity is set by the propellant charge — a fixed physical constant per rocket type. initial_speed measures it directly. Horizontal speed encodes range capability. |
Acceleration — acc_mag_mean, acc_mag_min, acc_mag_max, acc_horiz_std, acc_horiz_min, acc_horiz_max, az_std, mean_az |
8 | Under frictionless ballistic physics (assumption 2), vertical acceleration is constant at −g. Deviations from this baseline encode motor burn characteristics exclusively. acc_mag_max captures peak thrust magnitude, which varies by motor type and is the dominant class discriminator. acc_horiz_min captures the minimum lateral acceleration — a distinctive kinematic signature of each rocket class's thrust vector profile. |
Vertical kinematics — vz_median, initial_vz, final_vz, initial_z, final_z, delta_z_total, apogee_relative |
7 | The ballistic arc shape is uniquely determined by initial vertical velocity under flat-terrain physics (assumption 1). Different rocket types launch at different angles with different muzzle velocities, producing distinct arc heights and descent profiles. |
Spatial extent — x_range, y_range |
2 | Downrange distance is determined by horizontal muzzle velocity and time of flight — both vary by rocket type. Compact encodings of the trajectory footprint. |
Launch position — launch_x, launch_y |
2 | Per assumption 3c, each rebel group operates from a fixed geographic area and uses one rocket type. Launch coordinates are therefore a near-deterministic proxy for rocket class — the 2nd and 5th most important features by SHAP. |
Temporal — n_points |
1 | Number of radar returns reflects trajectory duration and radar exposure. Longer-range rockets produce more pings; very short trajectories (few pings) signal specific launch geometries. |
Salvo/group — salvo_size, salvo_duration_s, salvo_spatial_spread_m, salvo_time_rank, group_total_rockets, group_n_salvos, group_max_salvo_size |
7 | See Domain Assumptions. salvo_time_rank (rank 12 by SHAP) carries a kinematic imprint from the firing sequence; group_max_salvo_size proxies launcher type (assumption 3a). |
Why 32? Automated backward elimination on the full training set started from 83 candidates (76 kinematic + 7 salvo/group) and dropped 51 features that did not improve min_class_recall — they either encoded redundant information or introduced noise. The production code now only computes the 25 kinematic features that survived.
Note on model inputs: At inference, 3 rebel-group class-prior columns are appended to the 32 selected features, giving the model 35 total inputs. These priors (global training class distribution: 68.6%/24.3%/7.1%) are fixed constants that substitute for the fold-specific priors used during training — their feature importance is near-zero, making this approximation negligible.
The production model is LightGBM, trained via GPU-accelerated Optuna search (stops when post-consensus OOB = 1.0):
| Parameter | Value | Rationale |
|---|---|---|
n_estimators |
1108 | Optuna-determined for this feature set |
max_depth |
9 | Leaf-wise growth on 32 features |
num_leaves |
249 | Optuna-determined |
learning_rate |
0.02085 | Optuna-determined |
subsample |
0.6727 | Row sampling found by Optuna |
colsample_bytree |
0.6734 | Feature sampling found by Optuna |
min_child_samples |
37 | Optuna-determined |
reg_alpha |
0.00332 | L1 regularisation found by Optuna |
reg_lambda |
0.04205 | L2 regularisation found by Optuna |
objective |
multiclass (softprob) |
Calibrated probabilities for threshold tuning |
Threshold biases: [0.000000, -0.253165, 1.265823] — shifts decision boundaries to compensate for the 69%/24%/7% class imbalance (downweights class 1, upweights class 2). Found via coarse→fine grid search on OOB predictions.
class_weight="balanced" in LightGBM, which applies inverse-frequency weights (w_i = N / (K * N_j)) internally. Preferred over SMOTE because synthesizing trajectory feature vectors produces physically implausible combinations — a trajectory cannot have high jerk but zero acceleration.
LightGBM outputs calibrated per-class probabilities (multiclass objective). Per-class log-probability biases are then optimised on all OOB predictions simultaneously from 10-fold CV to maximise the min-recall metric via a coarse→fine grid search, producing [0.000000, -0.253165, 1.265823] for the current model.
Three layers:
- GroupKFold on
traj_ind— all radar pings from one trajectory stay in the same fold. - Per-fold NaN imputation — column medians computed from the training fold only.
- OOB threshold tuning — biases optimised on OOB predictions only.
The problem is tabular classification on physics-engineered features with cross-trajectory context. LightGBM is the right choice for four structural reasons:
1. The feature set is inherently tabular, not sequential.
The discriminative signal lives in aggregate statistics — muzzle velocity, apogee shape, salvo rank, launch position. These are scalar features per trajectory, not temporal sequences. Sequence models (CNN, Transformer) are designed to learn representations from raw time series; here the representations are already hand-crafted from physics. Asking a neural network to re-derive initial_speed from raw pings adds no information but introduces variance and training complexity.
2. Salvo and group features are cross-trajectory by nature.
salvo_time_rank, group_max_salvo_size, and the other 5 salvo/group features (assumption 3) require aggregating information across multiple trajectories via DBSCAN clustering. A per-trajectory sequential model cannot access this information at all — it would need to be provided as an additional scalar input anyway. LightGBM treats all 32 features uniformly regardless of how they were derived.
3. Calibrated probabilities are essential for threshold tuning.
The evaluation metric (min_class_recall) is optimised post-hoc by tuning per-class log-probability biases on OOB predictions. This requires the model to output well-calibrated class probabilities. LightGBM's multiclass objective with softmax produces probabilities suitable for this; raw neural network logits require additional calibration steps.
4. Leaf-wise growth concentrates capacity on hard cases. With a severe class imbalance (69%/24%/7%) and a metric that penalises the worst-performing class, the model must focus on the small number of class-2 trajectories that sit near class boundaries. LightGBM's leaf-wise growth builds asymmetric trees that allocate splits where the gain is highest — precisely the borderline minority-class examples. Level-wise trees (Random Forest, default XGBoost) waste capacity on already-well-separated majority-class regions.
Empirical validation via AutoML. The algorithm choice was confirmed — not assumed — through a 100-trial Optuna search over LightGBM, XGBoost, CatBoost, and a neural TabMLP with entity embeddings, all on the same 32-feature set with the same 10-fold GroupKFold evaluation. LightGBM dominated from the first trial: XGBoost scored 0.9253 raw OOB while LightGBM scored 0.9996 in the very next trial — a gap that never closed. Every top-5 Optuna trial across all runs was LightGBM.
A separate experiment tested a Transformer encoder operating on raw radar sequences with the same 7 salvo/group features as additional context input (salvo_dim=64, 4 layers, 8 heads). Best OOB min-recall after 10 Optuna trials: 0.9110 — confirming that the hand-crafted aggregate features used by LightGBM encode the discriminative signal far more effectively than learned temporal representations on this dataset.
Raw radar pings (traj_ind, x, y, z, t)
│
▼
Kinematic Feature Engineering ──► 25 physics features per trajectory
│
▼
Salvo/Group Feature Engineering ──► +7 features via DBSCAN (assumptions 3a-3c)
│ (32 total)
▼
LightGBM trained on 32 features (25 kinematic + 7 salvo/group)
│
▼
LightGBM + Optuna (GPU, stops when OOB consensus = 1.0) ──► hyperparameters
│
▼
10-fold GroupKFold OOB Threshold Tuning ──► biases [0.000000, -0.253165, 1.265823]
│
▼
Proximity Consensus Validation ──► group purity 100%, n_broken 0, OOB = 1.0
│
▼
artifacts/model.lgb · artifacts/train_medians.npy · artifacts/threshold_biases.npy
cache/cache_test_features.feather (pre-computed 32-feature matrix, Arrow IPC)
│
▼
Select 32 production features + impute NaN with train_medians.npy
Append global class priors [0.686, 0.243, 0.071] → 35-column model input
│
▼
artifacts/model.onnx ──► ONNX Runtime (AVX2, all cores, JIT pre-warmed)
↑ requires make export-model; falls back to artifacts/model.lgb if absent
│
▼
log(proba) + biases [0.000000, -0.253165, 1.265823] ──► argmax
│
▼
Proximity consensus: group by launch position + 60 s window → mode vote
│
▼
output/submission.csv
An oracle threshold test tunes the per-class log-probability biases directly on the validation labels for each CV fold (cheating — labels are used to find the best threshold for that specific fold). If this oracle cannot reach 1.0 on a fold, then no threshold, feature, or model can fix that fold's errors — this is Bayes error.
Result: oracle = 1.0 on all 10 folds. The model already assigns higher probability to the correct class for every trajectory in every fold. The signal exists; the barrier was the global threshold constraint.
The 2 OOB misses (1 class-0, 1 class-1) were rockets in tight salvos. Diagnostic analysis showed each miss had same-class neighbours at dist ≈ 0 m, Δt < 12 s — borderline predictions in salvos fired from the same launcher within seconds of correctly-classified siblings.
Domain invariant (assumption 3b + 3c): all rockets fired from the same launcher within a salvo window share a rebel group → share a procurement → share a rocket type.
Algorithm: group trajectories by launch position (rounded to 0.01 precision) + 60-second time window. Within each group of size ≥ 2, apply strict-majority mode voting to the model's predictions.
Validation on training data:
- Group class purity: 100% (every proximity group is class-pure — no risk of consensus introducing errors)
- n_broken: 0 (consensus never worsened a correct prediction)
- OOB score: 0.999874 → 1.000000
This is not overfitting. Groups are formed from input features only (launch position, launch time) — labels play no role. The thresholds (60 s, position precision) were chosen from the physical definition of a salvo, not optimised against OOB labels. Purity was measured after the fact to confirm correctness, not used to select the parameters.
All runtimes measured on a Windows 11 machine (Intel Core i5, 4 cores, 8,185 test trajectories).
| Optimisation | Total pipeline | vs. original |
|---|---|---|
| Original (CSV + Pydantic + sklearn) | 3m 40s | baseline |
| Skip CSV/validation when caches exist | 15s | 14.7x |
| ONNX Runtime backend (2.6x faster inference) | 5s | 44x |
| All CPU threads + Feather cache + JIT pre-warm | 4s | 55x |
| memory_map Feather + skip train load in hot path | 1.9s | 116x |
iterrows() → numpy zip in proximity loop |
1.7s | 130x |
model_opt.onnx: skip graph optimization at load (saves ~0.3s init) |
1.7s | 130x |
| Global lexsort + single linear scan (replaces per-group groupby, 26x faster) | 1.6s | 138x |
import onnxruntime at module level (moves ~0.3s to Python startup) |
1.2s | 183x |
Vectorized apply_salvo_consensus with np.add.at (10x faster) |
~1.0s | ~220x |
| Step | Time |
|---|---|
| Load Feather test cache (memory-mapped) | 0.02s |
Load model_opt.onnx (pre-optimized, onnxruntime already imported) |
~0.25s |
| Impute NaN + append priors | <0.01s |
| ONNX inference (8,185 traj. × 35 feat., 4 threads) | ~0.60s |
| Threshold bias + argmax | <0.01s |
| Proximity consensus (global sort + linear scan + vectorized vote) | ~0.01s |
| Write submission.csv | 0.02s |
| Total | ~1.0s |
1. Thread saturation (empirical)
All 4 physical cores are in use. Thread sweep (30 runs each, ONNX inference only):
| Threads | Inference time | Speedup |
|---|---|---|
| 1 | 2.29s | baseline |
| 2 | 1.43s | 1.6x |
| 3 | 1.33s | 1.7x |
| 4 | ~0.60s | ~3.8x |
Adding threads beyond cpu_count() yields no further gain — there are no more physical cores.
2. Memory-bandwidth bound (theoretical)
The irreducible work is 8,185 samples × 1108 trees × depth 9 = 81.6M comparisons across the ONNX model.
Theoretical compute floor: 81.6M ops / 4 cores / 10⁹ ops/s ≈ 20ms
Measured inference: ~0.60s
Overhead factor: ~30x
This overhead is explained entirely by cache miss cost: the 5.8 MB model does not fit in L1/L2 cache (256 KB / 1 MB per core). Each tree traversal follows random pointers through L3 and main memory. Effective memory bandwidth for irregular-access tree walks is ~1–5 GB/s vs the raw L3 peak of ~100 GB/s.
3. Algorithm irreducibility
The 1108 trees were determined by Optuna to be the minimum that achieves 1.0 global OOB min-recall with proximity consensus. Reducing tree count would degrade accuracy below the operational threshold. Traversing every tree for every sample is not optional.
4. Backend optimality
ONNX Runtime with model_opt.onnx (graph optimization applied once at export time) applies:
- AVX2/AVX512 SIMD vectorisation for operator kernels
- Graph-level operator fusion (baked into
model_opt.onnx) - Memory arena pre-allocation (
enable_mem_pattern = True)
The remaining non-inference overhead (~0.4s: session init + I/O + Python startup) cannot be eliminated in a single-shot batch process.
5. Persistent-service mode
For a long-running server that loads the model once and handles repeated requests, the model_opt.onnx init overhead (~0.25s) is paid only once. Steady-state per-batch latency drops to ~0.6s (pure inference + I/O).
| Stage | Time |
|---|---|
| Load raw CSVs (1M+ rows) | ~3s |
| Pydantic schema validation (train + test) | ~30s |
| Feature engineering — train (25 kinematic + 7 salvo/group × 32k trajectories) | ~2 min |
| Feature engineering — test (25 kinematic + 7 salvo/group × 8k trajectories) | ~22s |
| Model inference + consensus | ~1.0s |
| Total cold start | ~3 min |
Install uv:
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"git clone https://github.com/IlyaNovikov-RD/rocket_classifier.git
cd rocket_classifier
uv sync
# Full pipeline — downloads ~107 MB from GitHub Release, runs in ~1.0s
make pipeline
# Output: output/submission.csv + updated assets/
# Step by step:
make download-all # artifacts/ + cache/ + data/ (test.csv, sample_submission.csv)
make run # → output/submission.csv
make interpret # → assets/shap_summary.png
make visualize # → assets/demo.png
# After a model update — regenerate ONNX (requires: uv pip install onnxmltools)
make export-model
make demo # streamlit demo (localhost:8501)# Top-level
make all # full validation: setup → quality → train → test → run → analysis (~20 min)
make all-full # all + cold Docker rebuild + Streamlit smoke test (~30 min)
# Setup ~1s
make install # uv sync --group dev
make lock # uv lock — regenerate uv.lock from pyproject.toml
make clean # remove output/, cache/, artifacts/ for a fresh cold start
# Quality ~1s
make lint # ruff check
make format # ruff format
# Training ~17 min
make train # full training pipeline (Optuna + consensus → artifacts/)
make export-model # convert model.lgb → model.onnx (~48s)
make test # full test suite — 130 tests (~16s)
# Inference ~1s hot, ~3 min cold
make run # inference pipeline → output/submission.csv
# Analysis ~1 min
make interpret # regenerate SHAP assets after model update (~60s)
make visualize # regenerate assets/demo.png (~3s)
# Deploy ~9 min cold build
make docker # build + run Docker image → output/submission.csv
make docker-clean # remove image and rebuild from scratch
make demo # streamlit demo (localhost:8501)
# Data (download pre-built artifacts instead of training) ~30s
make download-models # fetch artifacts/ from GitHub Release
make download-all # + cache/ parquet caches + data/ (test.csv, sample_submission.csv)
make pipeline # download-all + run + interpret (full end-to-end, ~1.5 min)
# Release
make release TAG=v1.x.0 NOTES="..." # create GitHub Release with all artifactsRuntimes measured on Windows 11, 4-core CPU, 6 GB RAM. Training and Docker vary by hardware.
The image is self-contained — model artifacts, feature caches, and test.csv are baked in at build time.
docker build -t rocket_classifier .
docker run --rm rocket_classifier # writes submission.csv inside the containerTo retrieve the output file:
docker run --rm -v $(pwd)/output:/app/output rocket_classifierrocket_classifier/ # Production inference package
├── __init__.py
├── features.py # 25 kinematic + 7 salvo/group features (single source of truth)
├── model.py # RocketClassifier — loads LightGBM, applies biases
├── schema.py # Pydantic v2 data contracts (TrajectoryPoint)
├── main.py # Inference pipeline: features → predict → submission.csv
└── app.py # Streamlit interactive demo
scripts/
├── download_models.py # Download model artifacts from GitHub Release
└── export_fast_models.py # Export model.lgb → model.onnx + benchmark all backends
requirements.txt # pip-compatible export of production deps (uv export --no-dev)
training_report.json # Authoritative record of latest training run: OOB scores, biases, best hyperparameters (tracked in git, uploaded to every release)
research/ # R&D scripts (GPU training)
├── train.py # Full training pipeline → 1.0 OOB + artifacts
├── interpret.py # SHAP interpretability — run via `make interpret`
└── visualize.py # Feature visualization — run via `make visualize`
tests/
├── conftest.py # Shared fixtures (StubModel, synthetic trajectory builder)
├── test_app.py # Streamlit demo unit tests
├── test_consensus.py # Proximity consensus unit tests
├── test_docs.py # CI guard: documented test counts match actual
├── test_features.py # Feature engineering unit tests
├── test_main.py # Inference pipeline unit tests
├── test_model.py # RocketClassifier + min_class_recall unit tests
└── test_schema.py # Schema validation unit tests
assets/ # Generated visualizations (git-tracked)
├── demo.png # Trajectory physics plot (make visualize)
├── shap_summary.png # SHAP feature importance (make interpret)
└── interpretation_report.txt # Human-readable model interpretation (make interpret)
artifacts/ # Model artifacts — gitignored, from GitHub Release
├── model_opt.onnx # Pre-graph-optimized ONNX (~0.3s faster load)
├── model.onnx # ONNX format — fastest inference (run: make export-model)
├── model.lgb # Native LightGBM — present after training
├── train_medians.npy # 32-feature NaN imputation medians
└── threshold_biases.npy # Per-class log-probability biases [0.000000, -0.253165, 1.265823]
cache/ # Feature caches — gitignored
├── cache_train_features.parquet # 25 kinematic features + label (salvo/group recomputed by DBSCAN at training time)
├── cache_train_features.feather # Arrow IPC sidecar — fast reads
├── cache_test_features.parquet # 32-feature matrix (25 kinematic + 7 salvo/group)
└── cache_test_features.feather
The automation in this project is not boilerplate — each choice directly serves a goal of the assignment.
| Practice | What it does | Why it matters here |
|---|---|---|
| 130 unit + contract tests | Validates every interface between modules | The metric (min_class_recall) penalises silent failures hard. A wrong feature shape or stale bias silently degrades the score — tests catch that before it reaches the submission. |
| CI on every PR | Runs lint + tests + Docker build | Ensures the inference pipeline (make run) stays reproducible on any machine, not just the developer's laptop. |
| Docker | Packages the full inference environment | Makes the submission pipeline portable — one docker run reproduces the exact result with no dependency drift. |
make run |
Single command → output/submission.csv |
The submission is the deliverable. One command replicates the full result from cached features, with no manual steps. |
make release TAG=... NOTES=... |
Validates all artifacts exist, creates GitHub Release, triggers post-release CI | Prevents the class of error we hit during development: forgetting to upload a cache file and silently breaking the pipeline. |
| Post-release CI pipeline | Auto-generates ONNX model, runs inference, computes SHAP, uploads all to the release | Ensures the published release is always self-consistent — the submission.csv and SHAP plot in the release are always derived from the model in the same release. |
make download-models && make run |
Reproduces the submission from scratch | Satisfies the reproducibility requirement: a grader can verify the result independently by running two commands. |
| Dependabot | Weekly dependency update PRs | Keeps the pipeline secure and working as the ecosystem moves forward — relevant for a project that may be evaluated weeks after submission. |
research/interpret.py computes exact SHAP values via TreeExplainer. Top discriminators:
- Launch position (
launch_x,launch_y,initial_z) — geography clusters by rebel group; altitude encodes terrain and launch geometry - Horizontal speed (
v_horiz_median) — muzzle velocity is propellant-charge dependent - Salvo timing (
salvo_time_rank) — launch order within a salvo carries a kinematic imprint (assumption 3b) - Acceleration (
acc_horiz_min,acc_mag_min) — propulsion physics separate the classes - Ballistic arc (
apogee_relative,delta_z_total) — arc shape differs between rocket families
make interpret # regenerates assets/shap_summary.png and assets/interpretation_report.txt| Layer | Technology | Why |
|---|---|---|
| Runtime | Python 3.12 | Current stable release; all production deps ship pre-built wheels for 3.12 — no compilation from source in Docker or CI |
| ML | LightGBM 4.x, scikit-learn | Leaf-wise gradient boosting, GPU-accelerated Optuna, GroupKFold |
| Clustering | scikit-learn DBSCAN | Cluster count is unknown a priori — no pre-specified K required. ε maps to a physically interpretable radius (metres / degrees), and noise points (isolated rockets not part of any salvo) are a natural output rather than an error |
| Inference | ONNX Runtime | ~2.6x faster than native LightGBM via AVX2 vectorisation |
| Validation | Pydantic v2 | Schema enforcement on raw radar data |
| Explainability | SHAP TreeExplainer | Exact Shapley values in O(TLD) time |
| Demo | Streamlit, Plotly | Real-time 3D trajectory visualisation |
| Package management | uv | Deterministic lockfile, 10–100x faster than pip/poetry |
| CI | GitHub Actions | Ruff lint + pytest on every push/PR |
| Caching | Parquet + Feather (Arrow IPC) | Two formats for different access patterns: Parquet (compressed) for archival and GitHub Release distribution; Feather (Arrow IPC, uncompressed) for memory-mapped hot-path reads — switching to Parquet in the hot path added ~0.8s to inference wall time in benchmarks |
The current pipeline is designed for batch inference — it classifies a complete set of pre-collected trajectories. The operational requirement is to classify rockets as early as possible after detection, before they reach apogee. This section outlines the architectural path from batch to real-time.
The model's 32 features fall into two categories with fundamentally different latency profiles:
| Feature group | When available | Quality |
|---|---|---|
| 25 kinematic features | First 3–5 radar pings (~1 second after detection) | ~0.9997 OOB recall |
| 7 salvo/group features (3b, 3c) | After other rockets in the same salvo are detected (~5–30 seconds) | 1.000000 OOB recall (with consensus) |
A real-time system cannot compute salvo_time_rank or salvo_size for a rocket until sibling rockets in the same firing event have also been detected — they arrive asynchronously across different radars. This motivates a two-phase progressive classification architecture:
Radar pings arrive
│
▼ Phase 1 — immediate (< 100ms after first ping)
Kinematic features only
→ ONNX inference (salvo features at training medians)
→ Preliminary class estimate broadcast to operators
│
▼ Phase 2 — salvo-confirmed (~5–30s after launch cluster detected)
Salvo context materialises via streaming DBSCAN
→ Re-run inference with full 32 features
→ Updated high-confidence prediction replaces preliminary estimate
Phase 1 already satisfies the "as early as possible" requirement from the assignment at ~0.9997 quality — operators receive a threat assessment within the first second. Phase 2 upgrades to 1.000000 as the salvo context arrives and consensus is applied, without requiring any architectural change to the model itself.
Radar stations (N radars across the operational theatre)
│ raw pings (x, y, z, t, radar_id)
▼
Apache Kafka · topic: radar-pings
│
▼
Apache Flink (stream processor)
├─ Trajectory assembler
│ Groups pings by traj_ind within a sliding time window
│
├─ Kinematic feature extractor ─────────────────────────┐
│ Partial trajectory (≥ 3 pings) · Phase 1 (< 100ms) │
│ │
└─ Salvo coordinator ───────────────────────────────────┤
Online DBSCAN · Phase 2 (salvo confirmed, ~5–30s) │
▼
ONNX serving endpoint
model.onnx · < 2ms/trajectory
│
▼
Prediction store (Redis)
Keyed by traj_ind
Updated by both phases
│
▼
Operator dashboard ── Alert system
| Component | Why |
|---|---|
| Apache Kafka | Log-based durable message store — radar stations cannot replay a trajectory on demand. If Flink crashes mid-salvo, Kafka's retained log allows the topology to replay from the last committed offset and reconstruct the partial salvo without data loss. Partitioning by traj_ind provides horizontal scaling without application changes. |
| Apache Flink | Required (over Kafka Streams or Spark Structured Streaming) for event-time watermarks: radar pings from the same trajectory arrive out of order across stations due to network jitter and scan-rate differences. Flink's keyed state with exactly-once semantics ensures salvo group membership is computed correctly even under late arrivals — a processing-time window would silently drop late pings and corrupt salvo_size and salvo_time_rank. |
| Redis | Sub-millisecond atomic SET key value EX ttl for the two-phase update pattern (Phase 1 at ~100ms, Phase 2 at ~30s). The TTL feature self-expires stale predictions — a classification older than 5 minutes is no longer operationally actionable — without requiring a separate eviction job. |
DBSCAN as currently implemented runs in batch (O(N²) for large clusters). In a streaming context, the salvo coordinator would use an online spatial index (e.g. R-tree or ball tree over a sliding 60-second window of recent launches) to assign incoming rockets to existing salvos or open a new salvo group. The salvo DBSCAN parameters (eps=0.5, min_samples=2) and group DBSCAN parameters (eps=0.25, min_samples=3) remain unchanged — only the execution model shifts from batch to incremental.
The assignment identifies few rebel groups each buying independently (assumption 3c). If a new group acquires a new rocket type, production recall for that class will degrade. Operationally, this requires:
- Recall monitoring per class — not overall accuracy. The evaluation metric (
min_class_recall) maps directly to the production alert threshold. When any class recall drops below an agreed SLA, retraining is triggered. - Active labelling pipeline — intercepted rockets provide confirmed labels. These feed the next training run.
- GroupKFold preservation — retraining must continue to group all pings from one trajectory into the same fold to prevent leakage from partial-trajectory data collected mid-flight.
| Stage | Target |
|---|---|
| First radar ping → kinematic features computed | < 50ms |
| Kinematic features → ONNX inference → alert | < 10ms |
| Phase 1: detection to preliminary classification | < 100ms |
| Salvo membership confirmed → salvo features → re-inference | < 500ms |
| Phase 2: salvo-confirmed classification | < 30s after launch cluster |
A ballistic rocket travelling at typical class-1 speeds takes 60–90 seconds from launch to apogee. Both phases complete well within the available threat-assessment window.
Ilya Novikov

