50 features · GPU Infra OS

50 production-grade features for GPU infrastructure.

Clusters, distributed training, model serving, FinOps, compliance, observability — typed JSON outcomes, optional Postgres via API, Storybook demos.

GPU Infra capabilities

Everything your ML platform needs.
Nothing it doesn't.

Clusters, training, inference, FinOps, and observability — one OS for all 50 feature components in the mdbook GPU Infra package.

Cluster & resource

Provision clusters. Right-size capacity.

GPU Cluster Provisioner, autoscaling, node health, multi-region views, spot and reserved planning, topology editing, and resource quotas — the full § CLUSTER & RESOURCE MANAGEMENT surface from the spec.

Cluster snapshot

🖥️ Nodes up: 24

🌐 Regions: Mumbai · Virginia

📈 Avg GPU util: 78%

⚡ Spot mix: 42%

Job & training

Train at scale. Recover fast.

Distributed training launcher, training job monitor, experiment tracker, checkpoints, hyperparameter sweeps, job queue, preemption recovery, and job cost estimator — § JOB & TRAINING MANAGEMENT.

Training job

PyTorch · DeepSpeed · 8× H100 · checkpointing on

Epoch 12/40 · loss 0.24 · ETA 2h 14m

Model & MLOps

Ship models. Automate pipelines.

Model registry, deployment wizard, A/B traffic router, performance monitor, rollback console, feature store browser, and pipeline orchestrator — § MODEL MANAGEMENT & MLOPS.

Model registry

llama-3-70b-instruct · v2.1.0 · promoted

embedding-small · v1.4.2 · staging

Inference & serving

Serve endpoints. Hit your SLA.

Inference endpoint manager, autoscaler config, cost optimizer, cold-start analyzer, batch inference scheduler, and serving SLA monitor — § INFERENCE & SERVING.

Inference endpoint

p95 142 ms · 380 RPS · 4 replicas · SLA OK

Autoscale · min 2 · max 12 · GPU util 61%

FinOps & platform

See spend. Govern the stack.

GPU cost dashboards, budgets, forecasts, savings, and invoices; dataset and artifact lifecycle; API keys and Terraform; RBAC, compliance audit, and data residency; and the observability hub. Covers § BILLING plus the storage/data, developer tools, compliance, and observability sections of the spec.

FinOps snapshot

MTD GPU spend: $48.2k

Budget remaining: 62%

Audit events (7d): 128

Feature 1

GPU Cluster Provisioner

Provision H100/H200 clusters with IB/Ethernet and MIG-aware quotas

Feature 2

Auto-Scaling Policy Manager

Define scale thresholds, cooldowns, and scheduled node bounds
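To make thresholds, cooldowns, and node bounds concrete, here is a minimal sketch of a single scaling decision. It is not the product's implementation: `ScalePolicy`, its field names, and `desired_nodes` are hypothetical, and a scheduled policy is assumed to simply swap in different `min_nodes` and `max_nodes` at different times of day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    # All field names are hypothetical; the page names only the concepts.
    scale_up_util: float    # add a node when avg GPU util is at or above this
    scale_down_util: float  # remove a node when avg GPU util is at or below this
    cooldown_s: int         # minimum seconds between two scaling actions
    min_nodes: int          # lower node bound
    max_nodes: int          # upper node bound


def desired_nodes(policy, current_nodes, avg_gpu_util, seconds_since_last_action):
    """Return the node count the policy asks for, moving one node at a time."""
    if seconds_since_last_action < policy.cooldown_s:
        return current_nodes  # still cooling down, so do nothing
    target = current_nodes
    if avg_gpu_util >= policy.scale_up_util:
        target += 1
    elif avg_gpu_util <= policy.scale_down_util:
        target -= 1
    # Clamp the result to the configured node bounds.
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalePolicy(scale_up_util=0.80, scale_down_util=0.30,
                     cooldown_s=300, min_nodes=2, max_nodes=12)
print(desired_nodes(policy, 4, 0.91, 600))  # 5
```

The cooldown check comes first so that a noisy utilization signal cannot trigger back-to-back actions.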

Feature 3

Node Health Dashboard

GPU util, VRAM, thermals, and power per node with drain/restart actions

Feature 4

Multi-Region Cluster View

Global GPU inventory and failover across regions

Feature 5

Spot Instance Optimizer

Balance spot vs on-demand with savings and interruption risk

Feature 6

Reserved Capacity Planner

1-year and 3-year commitments with savings vs on-demand

Feature 7

Cluster Topology Editor

Spine-leaf / fat-tree and NVLink group visualization

Feature 8

Resource Quota Manager

Per-team GPU/CPU/RAM/storage caps and increase requests

Feature 9

Distributed Training Launcher

Launch PyTorch/JAX/TF/DeepSpeed jobs with priority and paths

Feature 10

Training Job Monitor

Live epochs, loss curves, GPU util, pause/resume/cancel

Feature 11

Experiment Tracker

Compare runs, best run, tags, and cost per experiment

Feature 12

Checkpoint Manager

Restore, promote, or prune checkpoints

Feature 13

Hyperparameter Sweep

Grid, random, and Bayesian sweeps over a search space
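As an illustration of the two simpler strategies, the sketch below enumerates a grid and draws random trials from the same search space. The function names and the example space are assumptions made for this sketch. Bayesian search is left out because it needs a surrogate model.

```python
import itertools
import random


def grid_sweep(space):
    """Yield every combination of the listed values, one dict per trial."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_sweep(space, n_trials, seed=0):
    """Draw n_trials configurations by sampling each parameter independently."""
    rng = random.Random(seed)  # seeded so the sweep is reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


# Hypothetical search space: 3 learning rates x 2 batch sizes.
space = {"lr": [1e-4, 3e-4, 1e-3], "batch_size": [32, 64]}

grid = list(grid_sweep(space))
print(len(grid))  # 6
print(grid[0])    # {'lr': 0.0001, 'batch_size': 32}

trials = random_sweep(space, n_trials=4)
print(len(trials))  # 4
```

Grid cost grows multiplicatively with each added parameter, which is why random search is often preferred for larger spaces.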

Feature 14

Job Queue Manager

Reorder queue, bulk cancel or reprioritize

Feature 15

Preemption Recovery Console

Recover from spot preemption with checkpoint-aware actions

Feature 16

Job Cost Estimator

Estimate job cost in dollars, with a spot vs on-demand comparison
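The arithmetic behind such an estimate can be sketched as follows. The function, the hourly rates, and the `spot_overhead` factor (extra runtime lost to interruptions) are illustrative assumptions, not figures or APIs from the product.

```python
def estimate_job_cost(gpu_count, hours, on_demand_rate, spot_rate, spot_overhead=0.0):
    """Compare on-demand and spot cost for one job.

    Rates are USD per GPU-hour. spot_overhead is the assumed fraction of
    extra runtime caused by preemptions and restarts (0.10 means 10% longer).
    """
    on_demand = gpu_count * hours * on_demand_rate
    spot = gpu_count * hours * (1.0 + spot_overhead) * spot_rate
    return {
        "on_demand_usd": round(on_demand, 2),
        "spot_usd": round(spot, 2),
        "savings_usd": round(on_demand - spot, 2),
    }


# Made-up example: 8 GPUs for 10 hours at $4.00 on-demand vs $1.50 spot,
# assuming spot interruptions stretch the run by 10%.
estimate = estimate_job_cost(8, 10, on_demand_rate=4.00, spot_rate=1.50, spot_overhead=0.10)
print(estimate)
# {'on_demand_usd': 320.0, 'spot_usd': 132.0, 'savings_usd': 188.0}
```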

Feature 17

Model Registry

Feature 18

Model Deployment Wizard

vLLM/TGI/Triton deployments with SLA targets

Feature 19

A/B Traffic Router

Weighted A/B variants with promotion thresholds
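One common way to split traffic by weight is to hash a stable request or user ID onto the cumulative weights, so the same ID always lands on the same variant. The sketch below assumes that approach. `pick_variant` and the variant names are hypothetical, and promotion thresholds are not modeled.

```python
import hashlib


def pick_variant(request_id, weights):
    """Map request_id to a variant with probability proportional to its weight.

    weights: dict of variant name -> non-negative weight (need not sum to 1).
    """
    total = sum(weights.values())
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # floating-point edge case: fall back to the last variant


weights = {"v2.1.0": 90, "v2.2.0-canary": 10}
counts = {name: 0 for name in weights}
for i in range(10_000):
    counts[pick_variant(f"req-{i}", weights)] += 1
print(counts)  # roughly a 90/10 split
```

Hashing rather than drawing a random number keeps assignment sticky, which matters when per-variant metrics are compared per user.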

Feature 20

Model Performance Monitor

p50/p95/p99, throughput, drift, SLA breaches
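For readers unfamiliar with the latency notation, the sketch below computes p50/p95/p99 with the nearest-rank method and checks a p95 target. The sample values and the 250 ms target are invented for illustration. Production monitors typically use streaming estimators instead of sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented latency samples in milliseconds.
latencies_ms = [120, 95, 142, 110, 300, 101, 98, 133, 127, 105]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)  # 110 300 300

sla_target_ms = 250  # hypothetical p95 target
print(p95 > sla_target_ms)  # True, so this window counts as an SLA breach
```

With only ten samples, one slow request dominates both p95 and p99, which is why tail percentiles need large windows to be meaningful.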

Feature 21

Model Rollback Console

Rollback with audit: manual, SLA, or drift triggered

Feature 22

Feature Store Browser

Feature sets, freshness, consumers

Feature 23

Pipeline Orchestrator

DAG: ingest → train → deploy with triggers

Feature 24

Inference Endpoint Manager

Scale, restart, or deactivate inference endpoints

Feature 25

Inference Autoscaler Config

RPS and GPU targets with pre-warm

Feature 26

Inference Cost Optimizer

Quantize, batch, downsize recommendations

Feature 27

Cold Start Analyzer

Cold vs warm start breakdown and pre-warm cron

Feature 28

Batch Inference Scheduler

Scheduled batch jobs with notifications

Feature 29

Serving SLA Monitor

SLA compliance % and incident timeline

Feature 30

GPU Cost Dashboard

Org-wide GPU spend by team and SKU

Feature 31

Budget Alert Manager

Threshold alerts via Slack/email/webhook
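A threshold alert boils down to comparing spend against fractions of the budget and firing each threshold only once. The sketch below shows that check under stated assumptions. The function name and default thresholds are hypothetical, and delivery to Slack, email, or a webhook is deliberately left out.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0), already_sent=()):
    """Return the budget fractions that spend has reached and that were not alerted yet."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in sorted(thresholds) if used >= t and t not in already_sent]


print(crossed_thresholds(38_000, 100_000))                        # []
print(crossed_thresholds(85_000, 100_000))                        # [0.5, 0.8]
print(crossed_thresholds(85_000, 100_000, already_sent=(0.5,)))   # [0.8]
```

Tracking `already_sent` per billing period keeps a channel from being paged again every time the check runs.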

Feature 32

Cost Forecast Engine

Forecast spend with confidence bands

Feature 33

Team Cost Allocation

Chargeback-ready allocation export

Feature 34

Savings Recommender

Actionable FinOps recommendations

Feature 35

Invoice & Credit Manager

Invoices, credits, redemptions

Feature 36

Dataset Registry

Versioned datasets with lineage

Feature 37

Storage Cost Analyzer

Hot/warm/cold/archive cost breakdown

Feature 38

Data Pipeline Monitor

Pipeline run health and staleness

Feature 39

Artifact Lifecycle Manager

Retention, archive, delete policies for artifacts

Feature 40

Model Artifact Diff Tool

Compare two artifacts for promotion decisions

Feature 41

API Key Manager

Create/rotate/revoke keys with audit trail

Feature 42

SDK Quickstart Generator

Language-specific snippets for inference and training

Feature 43

Webhook Event Manager

Event subscriptions with retry policy

Feature 44

Environment & Secret Manager

Scoped env vars from vault/SSM

Feature 45

Terraform Module Generator

Generate HCL for AWS/GCP/Azure/CoreWeave

Feature 46

RBAC Permission Manager

Roles and resource-level permissions

Feature 47

Compliance Audit Dashboard

SOC2/HIPAA/ISO/GDPR control status

Feature 48

Data Residency Controller

Region allow/block lists for data classes

Feature 49

Observability Hub

Metrics, logs, traces, correlated incidents

Feature 50

GPU Kernel Profiler

Kernel-level profiling and roofline hints

Everything included

Under the hood.

Clusters & capacity

Multi-region clusters, autoscaling, spot optimization, health, and resource quotas.

Training & jobs

Distributed training, experiments, checkpoints, queues, preemption recovery, and cost estimates.

Inference & serving

Endpoints, A/B traffic, autoscaler, SLA, cold-start analysis, and batch schedules.

FinOps

GPU cost dashboards, budgets, forecasts, team allocation, savings, invoices, and storage cost.

Data & artifacts

Dataset registry, pipelines, model registry, artifact lifecycle, and diff tools.

Platform & DevEx

API keys, SDK quickstart, webhooks, secrets, Terraform modules, and collaboration.

Security & compliance

RBAC, audit trails, residency controls, and production guardrails.

Observability

Unified hub, kernel profiling, and performance signals across workloads.

npm run storybook

npm run build:with-storybook

Get started

Simple enough to see.
Powerful enough to ship.

Join teams building with one design system. One codebase, any brand, every vertical.

919+ components · 14 verticals · 0 themes

Move fast. Break nothing.
