All 50 gpu-infra-os feature components. Each submits typed JSON on action.

1. GPU Cluster Provisioner
Provision H100/H200 clusters with IB/Ethernet and MIG-aware quotas

Persist to MongoDB (outcome JSON)

Slug: cluster-provisioner. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.
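
To make "typed outcome JSON" concrete, here is a minimal sketch of a payload and the request that could persist it. Only the slug and the route come from this page; the `ClusterProvisionOutcome` fields, the `buildPersistRequest` helper, and the `{ slug, outcome }` body shape are assumptions made up for illustration, not the component's actual schema.

```typescript
// Hypothetical outcome shape; field names are illustrative only.
interface ClusterProvisionOutcome {
  gpuModel: "H100" | "H200";
  nodeCount: number;
  interconnect: "infiniband" | "ethernet";
  migEnabled: boolean;
}

interface PersistRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request without sending it, so the shape is easy to inspect.
function buildPersistRequest(
  slug: string,
  route: string,
  outcome: ClusterProvisionOutcome,
): PersistRequest {
  return {
    url: route,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Assumed envelope: the slug identifies the component, the outcome is its payload.
    body: JSON.stringify({ slug, outcome }),
  };
}

const clusterRequest = buildPersistRequest(
  "cluster-provisioner",
  "/api/gpu-infra/clusters",
  { gpuModel: "H100", nodeCount: 8, interconnect: "infiniband", migEnabled: true },
);
```

Keeping the builder separate from the network call makes the payload easy to check in a Storybook story or a unit test before anything is written to MongoDB.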

2. Auto-Scaling Policy Manager
Define scale thresholds, cooldowns, and scheduled node bounds

Persist to MongoDB (outcome JSON)

Slug: autoscaling-policy. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.
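
The interaction between thresholds, cooldowns, and node bounds can be sketched as a small decision function. The `ScalingPolicy` shape and the one-node-per-step behavior are assumptions for illustration; the page does not describe the real policy schema or scaling algorithm, and scheduled bounds are left out.

```typescript
// Hypothetical policy shape; not the component's real schema.
interface ScalingPolicy {
  scaleUpUtilization: number;   // scale up above this GPU utilization (0..1)
  scaleDownUtilization: number; // scale down below this GPU utilization (0..1)
  cooldownSeconds: number;      // minimum gap between scaling actions
  minNodes: number;
  maxNodes: number;
}

// Returns the node count the cluster should move to.
function decideNodeCount(
  policy: ScalingPolicy,
  currentNodes: number,
  utilization: number,
  secondsSinceLastAction: number,
): number {
  // Inside the cooldown window, hold steady to avoid flapping.
  if (secondsSinceLastAction < policy.cooldownSeconds) return currentNodes;

  let target = currentNodes;
  if (utilization > policy.scaleUpUtilization) target = currentNodes + 1;
  else if (utilization < policy.scaleDownUtilization) target = currentNodes - 1;

  // Clamp to the configured bounds.
  return Math.min(policy.maxNodes, Math.max(policy.minNodes, target));
}
```

With thresholds of 0.8 and 0.3, a 300-second cooldown, and bounds of 2 to 10 nodes, a 4-node cluster at 90% utilization grows to 5 nodes once the cooldown has elapsed and stays at 4 while it has not.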

3. Node Health Dashboard
GPU util, VRAM, thermals, and power per node with drain/restart actions

Persist to MongoDB (outcome JSON)

Slug: node-health. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

4. Multi-Region Cluster View
Global GPU inventory and failover across regions

Persist to MongoDB (outcome JSON)

Slug: multi-region. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

5. Spot Instance Optimizer
Balance spot vs on-demand with savings and interruption risk

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: spot-optimizer. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

6. Reserved Capacity Planner
1yr/3yr commits with savings vs on-demand

Persist to MongoDB (outcome JSON)

Slug: reserved-capacity. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

7. Cluster Topology Editor
Spine-leaf / fat-tree and NVLink group visualization

Persist to MongoDB (outcome JSON)

Slug: topology. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

8. Resource Quota Manager
Per-team GPU/CPU/RAM/storage caps and increase requests

Persist to MongoDB (outcome JSON)

Slug: resource-quota. API route: /api/gpu-infra/clusters. Optional upstream: GPU_INFRA_UPSTREAM_URL.

9. Distributed Training Launcher
Launch PyTorch/JAX/TF/DeepSpeed jobs with priority and paths

Persist to MongoDB (outcome JSON)

Slug: dist-training. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

10. Training Job Monitor
Live epochs, loss curves, GPU util, pause/resume/cancel

Persist to MongoDB (outcome JSON)

Slug: training-monitor. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

11. Experiment Tracker
Compare runs, best run, tags, and cost per experiment

Persist to MongoDB (outcome JSON)

Slug: experiment-tracker. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

12. Checkpoint Manager
Restore, promote, or prune checkpoints

Persist to MongoDB (outcome JSON)

Slug: checkpoint. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

13. Hyperparameter Sweep
Grid, random, and Bayesian sweeps over a search space

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: hyperparam-sweep. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.
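
For the grid strategy, a sweep is the Cartesian product of the values listed for each parameter. The sketch below shows that expansion; `SearchSpace`, `Trial`, and `expandGrid` are hypothetical names, and the random and Bayesian strategies are not covered.

```typescript
// Hypothetical types: each parameter maps to the candidate values to try.
type SearchSpace = Record<string, Array<number | string>>;
type Trial = Record<string, number | string>;

// Expands a search space into every combination of parameter values.
function expandGrid(space: SearchSpace): Trial[] {
  let trials: Trial[] = [{}];
  for (const [name, values] of Object.entries(space)) {
    const next: Trial[] = [];
    for (const partial of trials) {
      for (const value of values) {
        // Extend each partial trial with one value of the current parameter.
        next.push({ ...partial, [name]: value });
      }
    }
    trials = next;
  }
  return trials;
}

const gridTrials = expandGrid({
  learningRate: [1e-4, 3e-4],
  batchSize: [32, 64, 128],
});
```

Two learning rates and three batch sizes give six trials. Because the trial count is the product of the list lengths, grid sweeps grow quickly as parameters are added, which is the usual reason to switch to random or Bayesian search.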

14. Job Queue Manager
Reorder queue, bulk cancel or reprioritize

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: job-queue. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

15. Preemption Recovery Console
Recover from spot preemption with checkpoint-aware actions

Persist to MongoDB (outcome JSON)

Slug: preemption-recovery. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.

16. Job Cost Estimator
Estimate job cost ($) with spot comparison

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: job-cost-estimate. API route: /api/gpu-infra/jobs. Optional upstream: GPU_INFRA_UPSTREAM_URL.
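
A job cost estimate with a spot comparison reduces to GPU-hours multiplied by an hourly rate. The sketch below shows that arithmetic under simplifying assumptions: the `JobCostInput` shape is hypothetical, the rates are supplied by the caller, and interruption overhead on spot capacity is ignored.

```typescript
// Hypothetical input shape; rates are caller-supplied, not real prices.
interface JobCostInput {
  gpuCount: number;
  hours: number;
  onDemandRatePerGpuHour: number;
  spotRatePerGpuHour: number;
}

interface JobCostEstimate {
  onDemandCost: number;
  spotCost: number;
  savings: number;
  savingsPercent: number;
}

function estimateJobCost(input: JobCostInput): JobCostEstimate {
  const gpuHours = input.gpuCount * input.hours;
  const onDemandCost = gpuHours * input.onDemandRatePerGpuHour;
  const spotCost = gpuHours * input.spotRatePerGpuHour;
  const savings = onDemandCost - spotCost;
  // Avoid dividing by zero when the on-demand cost is zero.
  const savingsPercent = onDemandCost > 0 ? (savings / onDemandCost) * 100 : 0;
  return { onDemandCost, spotCost, savings, savingsPercent };
}

// Made-up rates chosen to keep the arithmetic easy to follow.
const jobEstimate = estimateJobCost({
  gpuCount: 8,
  hours: 10,
  onDemandRatePerGpuHour: 4,
  spotRatePerGpuHour: 1,
});
```

Here 8 GPUs for 10 hours is 80 GPU-hours, so the on-demand cost is 320, the spot cost is 80, and the saving is 240, or 75%.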

17. Model Registry
Register models with visibility and metrics

Persist to MongoDB (outcome JSON)

Slug: model-registry. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.

18. Model Deployment Wizard
vLLM/TGI/Triton deployments with SLA targets

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: model-deploy. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.

19. A/B Traffic Router
Weighted A/B variants with promotion thresholds

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: ab-traffic. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.
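
Weighted routing between variants can be sketched as a cumulative-weight lookup. The `Variant` shape and `pickVariant` are hypothetical, and promotion thresholds are not modeled. The random roll is passed in explicitly so the function stays deterministic and testable.

```typescript
// Hypothetical variant shape.
interface Variant {
  name: string;
  weight: number; // relative weight; weights need not sum to 1 or 100
}

// Picks a variant for a roll in [0, 1), for example Math.random().
function pickVariant(variants: Variant[], roll: number): Variant {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (variants.length === 0 || total <= 0) {
    throw new Error("need at least one variant with positive weight");
  }
  // Walk the variants, consuming weight until the roll falls inside one.
  let remaining = roll * total;
  for (const variant of variants) {
    if (remaining < variant.weight) return variant;
    remaining -= variant.weight;
  }
  // Only reachable through floating-point drift; fall back to the last variant.
  return variants[variants.length - 1];
}

const abVariants: Variant[] = [
  { name: "control", weight: 90 },
  { name: "candidate", weight: 10 },
];
```

With a 90/10 split, rolls below 0.9 route to `control` and the remainder route to `candidate`; in production the roll would typically come from `Math.random()` or a hash of a stable request or user ID.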

20. Model Performance Monitor
p50/p95/p99, throughput, drift, SLA breaches

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: model-perf. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.
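
The p50/p95/p99 figures are latency percentiles. One common way to compute them from raw samples is the nearest-rank method, sketched below; whether the monitor uses this method or another (for example interpolation or a streaming estimate) is not stated on this page, so treat it as an assumption.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Illustrative latencies in milliseconds: 1, 2, ..., 100.
const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);
const latencySummary = {
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
```

The numeric comparator passed to `sort` matters: without it, JavaScript sorts numbers as strings, which would place 100 before 20.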

21. Model Rollback Console
Rollback with audit: manual, SLA, or drift triggered

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: model-rollback. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.

22. Feature Store Browser
Feature sets, freshness, consumers

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: feature-store. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.

23. Pipeline Orchestrator
DAG: ingest → train → deploy with triggers

Persist to MongoDB (outcome JSON)

Slug: pipeline. API route: /api/gpu-infra/models. Optional upstream: GPU_INFRA_UPSTREAM_URL.

24. Inference Endpoint Manager
Scale, restart, or deactivate inference endpoints

Manage serving endpoints: replicas, health status, latency/RPS, and safe actions (scale / restart / deactivate). Persists to MongoDB via /api/gpu-infra/inference.

25. Inference Autoscaler Config
RPS and GPU targets with pre-warm

Tune horizontal autoscaling for inference: replica bounds, RPS and GPU targets, cooldowns, optional pre-warm, and cron-based replica overrides.
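
One way replica bounds, an RPS target, and a GPU target could combine is sketched below. The formula is modeled on the proportional rule used by horizontal autoscalers in general (desired = ceil(current × observed / target)) and is an assumption, not this component's documented behavior. `AutoscalerConfig` and `desiredReplicas` are hypothetical names; cooldowns, pre-warm, and cron overrides are left out.

```typescript
// Hypothetical config shape.
interface AutoscalerConfig {
  minReplicas: number;
  maxReplicas: number;
  targetRpsPerReplica: number;
  targetGpuUtilization: number; // 0..1
}

function desiredReplicas(
  config: AutoscalerConfig,
  currentReplicas: number,
  observedRps: number,
  observedGpuUtilization: number,
): number {
  // Replicas needed so each one serves at most the target RPS.
  const byRps = Math.ceil(observedRps / config.targetRpsPerReplica);
  // Replicas needed to bring GPU utilization back to the target.
  const byGpu = Math.ceil(
    currentReplicas * (observedGpuUtilization / config.targetGpuUtilization),
  );
  // Take the more demanding signal, then clamp to the replica bounds.
  const wanted = Math.max(byRps, byGpu);
  return Math.min(config.maxReplicas, Math.max(config.minReplicas, wanted));
}

const inferenceScaling: AutoscalerConfig = {
  minReplicas: 1,
  maxReplicas: 10,
  targetRpsPerReplica: 50,
  targetGpuUtilization: 0.7,
};
```

Taking the maximum of the two signals means a traffic spike scales the endpoint up even when GPUs look idle, and sustained GPU pressure scales it up even when request volume is flat.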

26. Inference Cost Optimizer
Quantize, batch, downsize recommendations

Compare cost optimizations (quantize, batch, downsize, spot), record tradeoffs, and persist selected strategies with projected monthly savings.

27. Cold Start Analyzer
Cold vs warm start breakdown and pre-warm cron

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: cold-start. API route: /api/gpu-infra/inference. Optional upstream: GPU_INFRA_UPSTREAM_URL.

28. Batch Inference Scheduler
Scheduled batch jobs with notifications

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: batch-inference. API route: /api/gpu-infra/inference. Optional upstream: GPU_INFRA_UPSTREAM_URL.

29. Serving SLA Monitor
SLA compliance % and incident timeline

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: serving-sla. API route: /api/gpu-infra/inference. Optional upstream: GPU_INFRA_UPSTREAM_URL.

30. GPU Cost Dashboard
Org-wide GPU spend by team and SKU

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: gpu-cost-dash. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.

31. Budget Alert Manager
Threshold alerts via Slack/email/webhook

Persist to MongoDB (outcome JSON)

Slug: budget-alert. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.
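
A threshold alert fires when spend reaches a configured percentage of the budget. The sketch below evaluates which alerts have fired; the `BudgetAlert` shape and `triggeredAlerts` are hypothetical, and actual delivery to Slack, email, or a webhook is out of scope.

```typescript
// Hypothetical alert shape; the channel list mirrors the description above.
interface BudgetAlert {
  thresholdPercent: number;
  channel: "slack" | "email" | "webhook";
}

// Returns the alerts whose threshold has been reached by current spend.
function triggeredAlerts(
  budget: number,
  spend: number,
  alerts: BudgetAlert[],
): BudgetAlert[] {
  if (budget <= 0) throw new Error("budget must be positive");
  const usedPercent = (spend / budget) * 100;
  return alerts.filter((alert) => usedPercent >= alert.thresholdPercent);
}

const budgetAlerts: BudgetAlert[] = [
  { thresholdPercent: 50, channel: "slack" },
  { thresholdPercent: 80, channel: "email" },
  { thresholdPercent: 100, channel: "webhook" },
];
const firedAlerts = triggeredAlerts(1000, 850, budgetAlerts);
```

With spend at 850 against a budget of 1000, the 50% and 80% alerts fire and the 100% alert does not. A real implementation would also need to remember which alerts were already sent so that each threshold notifies once per budget period.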

32. Cost Forecast Engine
Forecast spend with confidence bands

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: cost-forecast. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.

33. Team Cost Allocation
Chargeback-ready allocation export

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: team-allocation. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.

34. Savings Recommender
Actionable FinOps recommendations

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: savings. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.

35. Invoice & Credit Manager
Invoices, credits, redemptions

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: invoice-credit. API route: /api/gpu-infra/billing. Optional upstream: GPU_INFRA_UPSTREAM_URL.

36. Dataset Registry
Versioned datasets with lineage

Persist to MongoDB (outcome JSON)

Slug: dataset-registry. API route: /api/gpu-infra/storage. Optional upstream: GPU_INFRA_UPSTREAM_URL.

37. Storage Cost Analyzer
Hot/warm/cold/archive cost breakdown

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: storage-cost. API route: /api/gpu-infra/storage. Optional upstream: GPU_INFRA_UPSTREAM_URL.

38. Data Pipeline Monitor
Pipeline run health and staleness

Persist to MongoDB (outcome JSON)

Slug: data-pipeline. API route: /api/gpu-infra/storage. Optional upstream: GPU_INFRA_UPSTREAM_URL.

39. Artifact Lifecycle Manager
Retention, archive, delete policies for artifacts

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: artifact-lifecycle. API route: /api/gpu-infra/storage. Optional upstream: GPU_INFRA_UPSTREAM_URL.

40. Model Artifact Diff Tool
Compare two artifacts for promotion decisions

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: artifact-diff. API route: /api/gpu-infra/storage. Optional upstream: GPU_INFRA_UPSTREAM_URL.

41. API Key Manager
Create/rotate/revoke keys with audit trail

Persist to MongoDB (outcome JSON)

Slug: api-key. API route: /api/gpu-infra/developer. Optional upstream: GPU_INFRA_UPSTREAM_URL.

42. SDK Quickstart Generator
Language-specific snippets for inference and training

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: sdk-quickstart. API route: /api/gpu-infra/developer. Optional upstream: GPU_INFRA_UPSTREAM_URL.

43. Webhook Event Manager
Event subscriptions with retry policy

Persist to MongoDB (outcome JSON)

Slug: webhook. API route: /api/gpu-infra/developer. Optional upstream: GPU_INFRA_UPSTREAM_URL.
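
A retry policy for webhook delivery is often expressed as capped exponential backoff. The sketch below computes the delay schedule for such a policy; the `RetryPolicy` fields and the doubling rule are assumptions, since the page does not say which retry strategy the component configures.

```typescript
// Hypothetical retry policy shape.
interface RetryPolicy {
  maxAttempts: number; // total delivery attempts, including the first
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

// Returns the delay before each retry: base * 2^n, capped at maxDelayMs.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  // The first attempt is immediate, so there are maxAttempts - 1 retries.
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const uncapped = policy.baseDelayMs * Math.pow(2, retry);
    delays.push(Math.min(policy.maxDelayMs, uncapped));
  }
  return delays;
}

const webhookDelays = retryDelays({
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 5000,
});
```

Five attempts with a 1-second base and a 5-second cap yield delays of 1000, 2000, 4000, and 5000 ms. Adding random jitter to each delay is a common refinement that keeps many failing subscriptions from retrying in lockstep.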

44. Environment & Secret Manager
Scoped env vars from vault/SSM

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: env-secret. API route: /api/gpu-infra/developer. Optional upstream: GPU_INFRA_UPSTREAM_URL.

45. Terraform Module Generator
Generate HCL for AWS/GCP/Azure/CoreWeave

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: terraform. API route: /api/gpu-infra/developer. Optional upstream: GPU_INFRA_UPSTREAM_URL.

46. RBAC Permission Manager
Roles and resource-level permissions

Persist to MongoDB (outcome JSON)

Slug: rbac. API route: /api/gpu-infra/compliance. Optional upstream: GPU_INFRA_UPSTREAM_URL.

47. Compliance Audit Dashboard
SOC2/HIPAA/ISO/GDPR control status

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: compliance-audit. API route: /api/gpu-infra/compliance. Optional upstream: GPU_INFRA_UPSTREAM_URL.

48. Data Residency Controller
Region allow/block lists for data classes

Configure and submit to emit typed outcome JSON.

Persist to MongoDB (outcome JSON)

Slug: data-residency. API route: /api/gpu-infra/compliance. Optional upstream: GPU_INFRA_UPSTREAM_URL.
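
Allow and block lists per data class can be evaluated with a short precedence rule. In the sketch below a block entry always wins, and an empty or missing allow list permits any region that is not blocked. Both the `ResidencyRule` shape and this precedence are assumptions for illustration; the page does not define how the controller resolves conflicts.

```typescript
// Hypothetical rule shape: one rule per data class.
interface ResidencyRule {
  dataClass: string;
  allowedRegions?: string[];
  blockedRegions?: string[];
}

function isRegionPermitted(rule: ResidencyRule, region: string): boolean {
  // A block entry takes precedence over any allow entry.
  if (rule.blockedRegions && rule.blockedRegions.indexOf(region) !== -1) {
    return false;
  }
  // A non-empty allow list restricts the data class to the listed regions.
  if (rule.allowedRegions && rule.allowedRegions.length > 0) {
    return rule.allowedRegions.indexOf(region) !== -1;
  }
  // No allow list: anything not blocked is permitted (assumed default).
  return true;
}

const piiRule: ResidencyRule = {
  dataClass: "pii",
  allowedRegions: ["eu-west-1", "eu-central-1"],
  blockedRegions: ["eu-central-1"],
};
```

Under this rule `eu-west-1` is permitted, `eu-central-1` is rejected because the block entry overrides the allow entry, and `us-east-1` is rejected because it is absent from the allow list.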

49. Observability Hub
Metrics, logs, traces, correlated incidents

Persist to MongoDB (outcome JSON)

Slug: observability. API route: /api/gpu-infra/observability. Optional upstream: GPU_INFRA_UPSTREAM_URL.

50. GPU Kernel Profiler
Kernel-level profiling and roofline hints

Persist to MongoDB (outcome JSON)

Slug: gpu-profiler. API route: /api/gpu-infra/observability. Optional upstream: GPU_INFRA_UPSTREAM_URL.