## Supported Hardware and Performance (FP8)
All benchmarks below were observed at batch size 32 using FP8 precision. Actual throughput varies with workload and deployment configuration.
| GPU | Tokens / Second (Range) | Batch Size |
|---|---|---|
| NVIDIA H100 | 100 – 150 | 32 |
| NVIDIA H200 | 100 – 180 | 32 |
| NVIDIA RTX Pro 6000 (Blackwell) | 80 – 120 | 32 |
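The ranges above can be sanity-checked with a simple client-side probe. The sketch below is a minimal example, not a definitive benchmark harness: it assumes an OpenAI-compatible endpoint, and the base URL (`http://localhost:8000/v1`), API key, and model name (`trinity`) are placeholders for your own deployment. It fires 32 concurrent completions (matching the benchmark batch size) and reports aggregate tokens per second.

```python
# Minimal throughput probe against an OpenAI-compatible endpoint.
# BASE_URL, MODEL, and the API key are placeholders; substitute your own.
import asyncio
import time

from openai import AsyncOpenAI

BASE_URL = "http://localhost:8000/v1"  # placeholder endpoint
MODEL = "trinity"                      # placeholder model name
BATCH_SIZE = 32                        # matches the benchmark batch size

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


async def one_request() -> int:
    # One non-streaming completion; returns the number of generated tokens.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain FP8 quantization briefly."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(BATCH_SIZE)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tok/s")


asyncio.run(main())
```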
## Supported Configurations
| Capability | Status |
|---|---|
| Precision | FP8 |
| Model Architecture | MoE (26B total / 3B active parameters) |
| Streaming Inference | Supported |
| OpenAI-compatible API | Supported |
| Dynamic Batching | Supported |
| Production Deployment | Supported |
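Because the server exposes an OpenAI-compatible API with streaming support, the standard `openai` Python client works unchanged. A minimal streaming sketch follows; as above, the base URL, API key, and model name are placeholders, not values defined by this document.

```python
# Minimal streaming sketch against the OpenAI-compatible API.
# base_url, api_key, and model are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="trinity",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    stream=True,  # tokens arrive incrementally as they are generated
)

# Each chunk carries a delta with zero or more new characters.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```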
## Model Characteristics
| Attribute | Details |
|---|---|
| Model Family | Trinity |
| Primary Focus | Reasoning |
| Parameter Count | 26B total |
| Active Parameters | 3B |
| Token Efficiency | Comparable to competing instruction-tuned models |
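The parameter counts above translate directly into a rough memory and compute budget: at FP8, each parameter occupies one byte, so all 26B parameters (every expert must be resident) need on the order of 26 GB before KV cache and activation overhead, while only the ~3B active parameters participate in each forward pass. A back-of-the-envelope sketch, under those assumptions:

```python
# Back-of-the-envelope FP8 estimate for a 26B-total / 3B-active MoE model.
# Ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS = 26e9   # all experts must reside in GPU memory
ACTIVE_PARAMS = 3e9   # parameters actually used per token
BYTES_PER_PARAM = 1   # FP8 = 8 bits = 1 byte

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weight memory: ~{weights_gb:.0f} GB (within a single 80 GB H100)")

# Per-token forward-pass compute scales with active parameters:
# roughly 2 FLOPs per parameter per token.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"Forward-pass compute: ~{flops_per_token / 1e9:.0f} GFLOPs per token")
```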
## Performance Notes
- Optimized for high-throughput inference under batch workloads
- FP8 precision enables efficient GPU utilization without observed quality regressions
- Suitable for shared-GPU and multi-tenant deployments (see the sketch below)
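Dynamic batching means independent clients do not need to coordinate: the server coalesces whatever requests happen to be in flight. The sketch below illustrates the multi-tenant pattern from the client side, simulating two tenants submitting requests at uncoordinated times; the endpoint, model name, and tenant names are all illustrative assumptions.

```python
# Two independent tenants sharing one endpoint. The server's dynamic
# batcher coalesces their in-flight requests; clients just send.
import asyncio
import random

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def tenant(name: str, n_requests: int) -> None:
    for i in range(n_requests):
        await asyncio.sleep(random.uniform(0.0, 0.5))  # uncoordinated arrivals
        resp = await client.chat.completions.create(
            model="trinity",  # placeholder model name
            messages=[{"role": "user", "content": f"{name} request {i}"}],
            max_tokens=64,
        )
        print(f"{name}: got {resp.usage.completion_tokens} tokens")


async def main() -> None:
    # Both tenants run concurrently; batching happens server-side.
    await asyncio.gather(tenant("tenant-a", 4), tenant("tenant-b", 4))


asyncio.run(main())
```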
## Current Status
| Status |
|---|
| Available |