Self‑Hosted LLM Control Plane
Cut GPU spend by up to 40%
Command Palette (⌘ + K)
Deploy Model
Dashboard
Models
Deploy
Monitoring
FinOps
Security
GPU Utilization: 62% (target: 70–85% for efficiency)
Latency (p95): 410 ms (autoscaler holds p95 under 600 ms)
Monthly GPU Cost: $18,900 (↓ 37% from $30,000 baseline)
Fleet Overview
Click a row for details
Model | Runtime | GPU | Replicas | Status | Actions
No models yet — go to Deploy.
Recent Alerts
Idle GPU detected on node g4dn‑2xlarge — shutdown scheduled.
Latency spike on /api/mixtral — scaling replicas 2 → 3.
Simulate Alert
GPU Utilization by Node (last 5 min)
Latency p50/p95/p99 (ms)
Requests by Endpoint (share)
Cost Breakdown: Fixed vs Variable (monthly)
Savings Attribution (%)
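The savings these charts attribute follow directly from the headline figures: $30,000 baseline minus $18,900 current is $11,100 per month, an 11,100 / 30,000 = 37% reduction, consistent with the "up to 40%" claim above.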
Models
Seed
Clear
Deploy Model
Model | Runtime | GPU | Replicas | Status | Actions
No models yet.
Deploy New Model
1 • Select Model
2 • Configure
3 • Review
4 • Deploy
Model Family
LLaMA 3.1 8B
Mistral 7B
Mixtral 8x7B
Gemma 2
Runtime
vLLM
Text Generation Inference
Triton
Ray Serve
Endpoint Name
GPUs
Replicas
Enable autoscaling
Review your configuration:
—
Docker Command
—
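A sketch of what the generated command could look like for the LLaMA 3.1 8B + vLLM selection, following vLLM's published Docker usage (image tag, model ID, cache mount, and port are illustrative, not this product's actual output):

  # Serve LLaMA 3.1 8B on one GPU via vLLM's OpenAI-compatible server
  docker run --runtime nvidia --gpus all \
    -p 8000:8000 --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct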
Helm (Kubernetes)
—
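Similarly, a hypothetical Helm invocation (the chart name and values schema here are invented for illustration; the wizard generates the real manifest):

  # values.yaml (hypothetical schema, for illustration only)
  model: meta-llama/Meta-Llama-3.1-8B-Instruct
  runtime: vllm
  gpus: 1
  replicas: 1
  autoscaling:
    enabled: true

  # Install or upgrade the endpoint release with those values
  helm upgrade --install llama31-8b ./charts/llm-endpoint -f values.yaml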
Deploy Logs
Back
Next
Finish
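Once a deploy finishes, a vLLM endpoint exposes an OpenAI-compatible HTTP API, so a smoke test needs nothing beyond curl (host, port, and model ID are illustrative):

  # Minimal completion request against the new endpoint
  curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'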
Monitoring
▶ Play
⏸ Pause
GPU Utilization
Latency (p95)
Throughput
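The Latency (p95) panel corresponds to a standard histogram-quantile query; a minimal PromQL sketch, assuming the gateway exports a Prometheus request-duration histogram under an illustrative metric name:

  # p95 request latency over the last 5 minutes
  histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))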
FinOps & Cost
Monthly GPU Cost
Idle Shutdown
Intelligent Routing
Autoscaling (see the sketch below)
Spend by Team
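A minimal sketch of the autoscaling rule described above, written as a Kubernetes autoscaling/v2 HorizontalPodAutoscaler; it assumes a custom-metrics adapter (e.g. prometheus-adapter) exposes a per-pod p95_latency_ms metric, which is an assumption for illustration, not this product's actual mechanism:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: mixtral-endpoint          # hypothetical endpoint name
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: mixtral-endpoint
    minReplicas: 1
    maxReplicas: 4
    metrics:
      - type: Pods
        pods:
          metric:
            name: p95_latency_ms    # assumes prometheus-adapter exposes this
          target:
            type: AverageValue
            averageValue: "600"     # scale out when per-pod p95 exceeds 600 ms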
Security & Privacy
Access Controls
SSO Provider
Okta
Azure AD
Google Workspace
RBAC
Audit Logs
Data & Deployment
Runs entirely in your VPC/on‑prem; no data egress.
Air‑gapped & offline installer available.
SOC 2 readiness: mapped controls & evidence checklist.
Model Details
Close
Metrics
Actions
+1 Replica
-1 Replica
Runtime: vLLM
GPU: 1
Replicas: 1
Logs