GPU Infra OS
50 features · GPU Infra OS

50 production-grade features for GPU infrastructure.

Clusters, distributed training, model serving, FinOps, compliance, observability — typed JSON outcomes, optional Postgres via API, Storybook demos.
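The "typed JSON outcomes" above can be pictured as a small discriminated union. This is only a sketch of the pattern — the type name, fields, and the slug shown are hypothetical, not the package's real schema:

```typescript
// Hypothetical outcome shape — illustrates "typed JSON outcomes",
// not the actual schema shipped in the package.
type FeatureOutcome =
  | { status: "ok"; feature: string; data: Record<string, unknown> }
  | { status: "error"; feature: string; message: string };

// Build a success outcome for a feature slug.
function outcome(feature: string, data: Record<string, unknown>): FeatureOutcome {
  return { status: "ok", feature, data };
}

const o = outcome("gpu-cluster-provisioner", { nodes: 24 });
```

Because the union is discriminated on `status`, consumers can narrow on it before touching `data` or `message`.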


By the numbers

50 — GPU Infra features
14 — vertical OS packages
919 — components (library files + vertical features)

GPU Infra capabilities

Everything your ML platform needs.
Nothing it doesn't.

Clusters, training, inference, FinOps, and observability — one OS for all 50 feature components in the mdbook GPU Infra package.

Cluster & resource

Provision clusters. Right-size capacity.

GPU Cluster Provisioner, autoscaling, node health, multi-region views, spot and reserved planning, topology editing, and resource quotas — the full § CLUSTER & RESOURCE MANAGEMENT surface from the spec.

Cluster snapshot
🖥️ Nodes up: 24
🌐 Regions: Mumbai · Virginia
📈 Avg GPU util: 78%
⚡ Spot mix: 42%
Job & training

Train at scale. Recover fast.

Distributed training launcher, training job monitor, experiment tracker, checkpoints, hyperparameter sweeps, job queue, preemption recovery, and job cost estimator — § JOB & TRAINING MANAGEMENT.

Training job
PyTorch · DeepSpeed · 8× H100 · checkpointing on
Epoch 12/40 · loss 0.24 · ETA 2h 14m
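The Job Cost Estimator in this group reduces to simple arithmetic: GPUs × hours × rate, run once per pricing tier. A minimal sketch — the $3.50 on-demand and $2.00 spot H100 hourly rates below are illustrative assumptions, not quoted prices:

```typescript
// Back-of-envelope job cost: gpus × hours × $/GPU-hour.
// Rates are assumed for illustration, not real pricing.
function jobCost(gpus: number, hours: number, ratePerGpuHour: number): number {
  return gpus * hours * ratePerGpuHour;
}

// 8× H100 with ~2h14m remaining ≈ 2.25 h, priced both ways.
const onDemand = jobCost(8, 2.25, 3.5);
const spot = jobCost(8, 2.25, 2.0);
```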
Model & MLOps

Ship models. Automate pipelines.

Model registry, deployment wizard, A/B traffic router, performance monitor, rollback console, feature store browser, and pipeline orchestrator — § MODEL MANAGEMENT & MLOPS.

Model registry
llama-3-70b-instruct · v2.1.0 · promoted
embedding-small · v1.4.2 · staging
Inference & serving

Serve endpoints. Hit your SLA.

Inference endpoint manager, autoscaler config, cost optimizer, cold-start analyzer, batch inference scheduler, and serving SLA monitor — § INFERENCE & SERVING.

Inference endpoint
p95 142 ms · 380 RPS · 4 replicas · SLA OK
Autoscale · min 2 · max 12 · GPU util 61%
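The autoscale line above follows the usual target-tracking shape: scale replicas so observed GPU utilization approaches a target, clamped to the configured min/max. A sketch under assumptions — the 70% target below is invented; only the 4 replicas, 61% util, and min 2 / max 12 come from the card:

```typescript
// Target-tracking replica count: desired = ceil(current × observed/target),
// clamped to [min, max]. The 70% target is an assumed setting.
function desiredReplicas(
  current: number,
  observedUtil: number,
  targetUtil: number,
  min: number,
  max: number
): number {
  const raw = Math.ceil(current * (observedUtil / targetUtil));
  return Math.min(max, Math.max(min, raw));
}
```

At 61% observed against a 70% target, 4 replicas stay at 4; a utilization spike gets capped at the max bound rather than scaling unbounded.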
FinOps & platform

See spend. Govern the stack.

GPU cost dashboards through savings and invoices, dataset and artifact lifecycle, API keys, Terraform, RBAC, compliance audit, residency, and observability hub — per § BILLING, storage/data, developer tools, compliance, observability.

FinOps snapshot
MTD GPU spend: $48.2k
Budget remaining: 62%
Audit events (7d): 128
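The budget-remaining figure is just spend against budget. A sketch — the $127k monthly budget below is an assumed number chosen so the arithmetic reproduces the 62% shown in the snapshot:

```typescript
// Percent of budget left after month-to-date spend, rounded to whole percent.
function budgetRemainingPct(spendMtd: number, monthlyBudget: number): number {
  return Math.round((1 - spendMtd / monthlyBudget) * 100);
}

// $48.2k MTD spend against an assumed $127k budget → 62% remaining.
const remaining = budgetRemainingPct(48_200, 127_000);
```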

GPU Infra OS

50 production-grade GPU platform tools

Clusters, training, serving, FinOps, compliance — try each feature. Persist outcomes to MongoDB (all 50 slugs), Postgres when configured, or JSON preview only.

Feature 1

GPU Cluster Provisioner

Provision H100/H200 clusters with IB/Ethernet and MIG-aware quotas

Feature 2

Auto-Scaling Policy Manager

Define scale thresholds, cooldowns, and scheduled node bounds

Feature 3

Node Health Dashboard

GPU util, VRAM, thermals, and power per node with drain/restart actions

Feature 4

Multi-Region Cluster View

Global GPU inventory and failover across regions

Feature 5

Spot Instance Optimizer

Balance spot vs on-demand with savings and interruption risk
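The two halves of this trade-off — headline savings and interruption risk — can both be sketched in a few lines. The rates and the 10% rework-overhead-per-interruption model below are assumptions for illustration, not the product's formula:

```typescript
// Headline spot savings as a percentage of the on-demand rate.
function spotSavingsPct(spotRate: number, onDemandRate: number): number {
  return Math.round((1 - spotRate / onDemandRate) * 100);
}

// Expected spot cost with a crude interruption penalty: each interruption
// is assumed to add 10% rework (an illustrative model, not a real one).
function expectedSpotCost(
  spotRate: number,
  hours: number,
  interruptionProb: number
): number {
  const reworkFactor = 1 + 0.1 * interruptionProb;
  return spotRate * hours * reworkFactor;
}
```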

Feature 6

Reserved Capacity Planner

1-year/3-year commits with savings vs on-demand

Feature 7

Cluster Topology Editor

Spine-leaf / fat-tree and NVLink group visualization

Feature 8

Resource Quota Manager

Per-team GPU/CPU/RAM/storage caps and increase requests

Feature 9

Distributed Training Launcher

Launch PyTorch/JAX/TF/DeepSpeed jobs with priority and paths

Feature 10

Training Job Monitor

Live epochs, loss curves, GPU util, pause/resume/cancel

Feature 11

Experiment Tracker

Compare runs, best run, tags, and cost per experiment

Feature 12

Checkpoint Manager

Restore, promote, or prune checkpoints

Feature 13

Hyperparameter Sweep

Grid/random/Bayes sweeps over search space

Feature 14

Job Queue Manager

Reorder queue, bulk cancel or reprioritize

Feature 15

Preemption Recovery Console

Recover from spot preemption with checkpoint-aware actions

Feature 16

Job Cost Estimator

Estimate job cost with spot comparison

Feature 17

Model Registry

Register models with visibility and metrics

Feature 18

Model Deployment Wizard

vLLM/TGI/Triton deployments with SLA targets

Feature 19

A/B Traffic Router

Weighted A/B variants with promotion thresholds
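Weighted routing typically means: a request lands in the variant whose cumulative weight bracket contains a uniform draw. A sketch with the draw passed in for determinism — the variant names and weights are illustrative:

```typescript
// Pick a variant by cumulative weight bracket. `draw` is a uniform
// value in [0, 1), injected so the choice is testable.
type Variant = { name: string; weight: number };

function pickVariant(variants: Variant[], draw: number): string {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  let acc = 0;
  for (const v of variants) {
    acc += v.weight / total;
    if (draw < acc) return v.name;
  }
  return variants[variants.length - 1].name; // guard against FP edge at 1.0
}

const split: Variant[] = [
  { name: "control", weight: 90 },
  { name: "candidate", weight: 10 },
];
```

In production the draw usually comes from hashing a stable request key so a given user sticks to one variant.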

Feature 20

Model Performance Monitor

p50/p95/p99, throughput, drift, SLA breaches
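One common way a monitor derives p50/p95/p99 from a latency window is the nearest-rank percentile. A sketch, assuming raw millisecond samples rather than whatever aggregation the product actually uses:

```typescript
// Nearest-rank percentile: sort the window, take the sample at
// rank ceil(p/100 × n) (1-indexed).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const window = [100, 120, 140, 160, 400]; // illustrative latency samples (ms)
```

Nearest-rank is robust for tail percentiles like p99 because it never interpolates between a normal sample and an outlier.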

Feature 21

Model Rollback Console

Rollback with audit: manual, SLA, or drift triggered

Feature 22

Feature Store Browser

Feature sets, freshness, consumers

Feature 23

Pipeline Orchestrator

DAG: ingest → train → deploy with triggers

Feature 24

Inference Endpoint Manager

Scale, restart, or deactivate inference endpoints

Feature 25

Inference Autoscaler Config

RPS and GPU targets with pre-warm

Feature 26

Inference Cost Optimizer

Quantize, batch, downsize recommendations

Feature 27

Cold Start Analyzer

Cold vs warm start breakdown and pre-warm cron

Feature 28

Batch Inference Scheduler

Scheduled batch jobs with notify

Feature 29

Serving SLA Monitor

SLA compliance % and incident timeline

Feature 30

GPU Cost Dashboard

Org-wide GPU spend by team and SKU

Feature 31

Budget Alert Manager

Threshold alerts via Slack/email/webhook

Feature 32

Cost Forecast Engine

Forecast spend with confidence bands
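A confidence-banded forecast can be as simple as extrapolating mean daily spend and widening a ± band with the sample standard deviation. This is a deliberately minimal model for illustration — a real engine would account for trend and seasonality:

```typescript
// Naive forecast: point = mean daily spend × days ahead;
// band = daily stddev × sqrt(days ahead) (assumes independent days).
function forecastSpend(
  dailySpend: number[],
  daysAhead: number
): { point: number; low: number; high: number } {
  const n = dailySpend.length;
  const mean = dailySpend.reduce((s, x) => s + x, 0) / n;
  const variance = dailySpend.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const band = Math.sqrt(variance) * Math.sqrt(daysAhead);
  const point = mean * daysAhead;
  return { point, low: point - band, high: point + band };
}
```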

Feature 33

Team Cost Allocation

Chargeback-ready allocation export

Feature 34

Savings Recommender

Actionable FinOps recommendations

Feature 35

Invoice & Credit Manager

Invoices, credits, redemptions

Feature 36

Dataset Registry

Versioned datasets with lineage

Feature 37

Storage Cost Analyzer

Hot/warm/cold/archive cost breakdown

Feature 38

Data Pipeline Monitor

Pipeline run health and staleness

Feature 39

Artifact Lifecycle Manager

Retention, archive, delete policies for artifacts

Feature 40

Model Artifact Diff Tool

Compare two artifacts for promotion decisions

Feature 41

API Key Manager

Create/rotate/revoke keys with audit trail

Feature 42

SDK Quickstart Generator

Language-specific snippets for inference/train

Feature 43

Webhook Event Manager

Event subscriptions with retry policy

Feature 44

Environment & Secret Manager

Scoped env vars from vault/SSM

Feature 45

Terraform Module Generator

Generate HCL for AWS/GCP/Azure/CoreWeave
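Generating HCL largely means emitting well-formed blocks from structured config. A sketch of the pattern — `example_gpu_pool` and its fields are placeholders, not a real provider's resource schema:

```typescript
// Emit an HCL resource block from config. Resource type and
// attribute names are hypothetical placeholders.
function gpuPoolHcl(name: string, gpuType: string, count: number): string {
  return [
    `resource "example_gpu_pool" "${name}" {`,
    `  gpu_type = "${gpuType}"`,
    `  count    = ${count}`,
    `}`,
  ].join("\n");
}

const hcl = gpuPoolHcl("training", "h100", 8);
```

A real generator would also validate identifiers and escape string values per HCL's quoting rules rather than interpolating them directly.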

Feature 46

RBAC Permission Manager

Roles and resource-level permissions

Feature 47

Compliance Audit Dashboard

SOC2/HIPAA/ISO/GDPR control status

Feature 48

Data Residency Controller

Region allow/block lists for data classes

Feature 49

Observability Hub

Metrics, logs, traces, correlated incidents

Feature 50

GPU Kernel Profiler

Kernel-level profiling and roofline hints

Brand themes

Your GPU stack, your brand.

Switch themes in real time. Every screen respects the active theme.

Pick a brand

Click a brand to switch the page theme instantly.

Current: Eatsure

Everything included

Under the hood.

Clusters & capacity

Multi-region clusters, autoscaling, spot optimization, health, and resource quotas.

Training & jobs

Distributed training, experiments, checkpoints, queues, preemption recovery, and cost estimates.

Inference & serving

Endpoints, A/B traffic, autoscaler, SLA, cold-start analysis, and batch schedules.

FinOps

GPU cost dashboards, budgets, forecasts, team allocation, savings, invoices, and storage cost.

Data & artifacts

Dataset registry, pipelines, model registry, artifact lifecycle, and diff tools.

Platform & DevEx

API keys, SDK quickstart, webhooks, secrets, Terraform modules, and collaboration.

Security & compliance

RBAC, audit trails, residency controls, and production guardrails.

Observability

Unified hub, kernel profiling, and performance signals across workloads.

Developers

Move fast. Break nothing.

Storybook and app stay in sync. Run locally or build static docs.

npm run storybook
npm run build:with-storybook

Get started

Simple enough to see.
Powerful enough to ship.

Join teams building with one design system. One codebase, any brand, every vertical.

919+ components · 14 verticals