GPU Infra capabilities
Everything your ML platform needs.
Nothing it doesn't.
Clusters, training, inference, FinOps, and observability — one OS for all 50 feature components in the mdbook GPU Infra package.
Provision clusters. Right-size capacity.
GPU Cluster Provisioner, autoscaling, node health, multi-region views, spot and reserved planning, topology editing, and resource quotas — the full § CLUSTER & RESOURCE MANAGEMENT surface from the spec.
Train at scale. Recover fast.
Distributed training launcher, training job monitor, experiment tracker, checkpoints, hyperparameter sweeps, job queue, preemption recovery, and job cost estimator — § JOB & TRAINING MANAGEMENT.
Ship models. Automate pipelines.
Model registry, deployment wizard, A/B traffic router, performance monitor, rollback console, feature store browser, and pipeline orchestrator — § MODEL MANAGEMENT & MLOPS.
llama-3-70b-instruct v2.1.0 · promoted
embedding-small v1.4.2 · staging
Serve endpoints. Hit your SLA.
Inference endpoint manager, autoscaler config, cost optimizer, cold-start analyzer, batch inference scheduler, and serving SLA monitor — § INFERENCE & SERVING.
See spend. Govern the stack.
Billing from GPU cost dashboards to savings and invoices, dataset and artifact lifecycle, API keys, Terraform, RBAC, compliance audit, data residency, and the observability hub: § BILLING plus the storage/data, developer tools, compliance, and observability sections of the spec.
MTD GPU spend: $48.2k
Budget remaining: 62%
Audit events (7d): 128
GPU Infra OS
50 production-grade GPU platform tools
Clusters, training, serving, FinOps, compliance: try each feature. Outcomes persist to MongoDB (all 50 slugs), to Postgres when configured, or stay as a JSON preview only.
GPU Cluster Provisioner
Provision H100/H200 clusters with IB/Ethernet and MIG-aware quotas
Auto-Scaling Policy Manager
Define scale thresholds, cooldowns, and scheduled node bounds
Node Health Dashboard
GPU util, VRAM, thermals, and power per node with drain/restart actions
Multi-Region Cluster View
Global GPU inventory and failover across regions
Spot Instance Optimizer
Balance spot vs on-demand with savings and interruption risk
Reserved Capacity Planner
1-year and 3-year commitments with savings vs on-demand
Cluster Topology Editor
Spine-leaf / fat-tree and NVLink group visualization
Resource Quota Manager
Per-team GPU/CPU/RAM/storage caps and increase requests
Distributed Training Launcher
Launch PyTorch/JAX/TF/DeepSpeed jobs with priority and paths
Training Job Monitor
Live epochs, loss curves, GPU util, pause/resume/cancel
Experiment Tracker
Compare runs, best run, tags, and cost per experiment
Checkpoint Manager
Restore, promote, or prune checkpoints
Hyperparameter Sweep
Grid/random/Bayes sweeps over search space
Job Queue Manager
Reorder queue, bulk cancel or reprioritize
Preemption Recovery Console
Recover from spot preemption with checkpoint-aware actions
Job Cost Estimator
Estimate job cost with spot comparison
Model Registry
Register models with visibility and metrics
Model Deployment Wizard
vLLM/TGI/Triton deployments with SLA targets
A/B Traffic Router
Weighted A/B variants with promotion thresholds
Model Performance Monitor
p50/p95/p99, throughput, drift, SLA breaches
Model Rollback Console
Rollback with audit: manual, SLA, or drift triggered
Feature Store Browser
Feature sets, freshness, consumers
Pipeline Orchestrator
DAG: ingest → train → deploy with triggers
Inference Endpoint Manager
Scale, restart, or deactivate inference endpoints
Inference Autoscaler Config
RPS and GPU targets with pre-warm
Inference Cost Optimizer
Quantize, batch, downsize recommendations
Cold Start Analyzer
Cold vs warm start breakdown and pre-warm cron
Batch Inference Scheduler
Scheduled batch jobs with notifications
Serving SLA Monitor
SLA compliance % and incident timeline
GPU Cost Dashboard
Org-wide GPU spend by team and SKU
Budget Alert Manager
Threshold alerts via Slack/email/webhook
Cost Forecast Engine
Forecast spend with confidence bands
Team Cost Allocation
Chargeback-ready allocation export
Savings Recommender
Actionable FinOps recommendations
Invoice & Credit Manager
Invoices, credits, redemptions
Dataset Registry
Versioned datasets with lineage
Storage Cost Analyzer
Hot/warm/cold/archive cost breakdown
Data Pipeline Monitor
Pipeline run health and staleness
Artifact Lifecycle Manager
Retention, archive, delete policies for artifacts
Model Artifact Diff Tool
Compare two artifacts for promotion decisions
API Key Manager
Create/rotate/revoke keys with audit trail
SDK Quickstart Generator
Language-specific snippets for inference and training
Webhook Event Manager
Event subscriptions with retry policy
Environment & Secret Manager
Scoped env vars from vault/SSM
Terraform Module Generator
Generate HCL for AWS/GCP/Azure/CoreWeave
RBAC Permission Manager
Roles and resource-level permissions
Compliance Audit Dashboard
SOC2/HIPAA/ISO/GDPR control status
Data Residency Controller
Region allow/block lists for data classes
Observability Hub
Metrics, logs, traces, correlated incidents
GPU Kernel Profiler
Kernel-level profiling and roofline hints
Brand themes
Your GPU stack, your brand.
Switch themes in real time. Every screen respects the active theme.
Everything included
Under the hood.
Clusters & capacity
Multi-region clusters, autoscaling, spot optimization, health, and resource quotas.
Training & jobs
Distributed training, experiments, checkpoints, queues, preemption recovery, and cost estimates.
Inference & serving
Endpoints, A/B traffic, autoscaler, SLA, cold-start analysis, and batch schedules.
FinOps
GPU cost dashboards, budgets, forecasts, team allocation, savings, invoices, and storage cost.
Data & artifacts
Dataset registry, pipelines, model registry, artifact lifecycle, and diff tools.
Platform & DevEx
API keys, SDK quickstart, webhooks, secrets, Terraform modules, and collaboration.
Security & compliance
RBAC, audit trails, residency controls, and production guardrails.
Observability
Unified hub, kernel profiling, and performance signals across workloads.
Developers
Move fast. Break nothing.
Storybook and app stay in sync. Run locally or build static docs.
Get started
Simple enough to see.
Powerful enough to ship.
Join teams building with one design system. One codebase, any brand, every vertical.
By the numbers