arcee-ai/Trinity-Mini

Supported Hardware and Performance (FP8)

All benchmarks below were measured at batch size 32 using FP8 precision. Actual performance may vary with workload and deployment configuration.

| GPU | Tokens / Second (Range) | Batch Size |
|---|---|---|
| NVIDIA H100 | 100 – 150 | 32 |
| NVIDIA H200 | 100 – 180 | 32 |
| NVIDIA RTX Pro 6000 (Blackwell) | 80 – 120 | 32 |
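The throughput ranges above translate directly into expected generation time. As a rough back-of-envelope sketch (the helper and the example numbers are illustrative, not from the card):

```python
def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate num_tokens at a given throughput."""
    return num_tokens / tokens_per_second

# At the low end of the H100 range (100 tok/s), a 1,000-token response
# takes about 10 seconds; at the high end (150 tok/s), roughly 6.7 seconds.
low_end = generation_time_s(1000, 100)   # 10.0 s
high_end = generation_time_s(1000, 150)  # ~6.7 s
```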

Supported Configurations

| Capability | Status |
|---|---|
| Precision | FP8 |
| Model Architecture | MoE (26B total / 3B active parameters) |
| Streaming Inference | Supported |
| OpenAI-compatible API | Supported |
| Dynamic Batching | Supported |
| Production Deployment | Supported |
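Since the server exposes an OpenAI-compatible API with streaming, a request body follows the standard chat-completions schema. A minimal sketch of constructing one (the base URL is a placeholder for your own deployment, and the exact route is assumed to be the usual `/v1/chat/completions`):

```python
import json

BASE_URL = "http://localhost:8000/v1"  # assumption: a local deployment endpoint

# Standard OpenAI-compatible chat-completions payload; "stream": True
# exercises the streaming inference support listed above.
payload = {
    "model": "arcee-ai/Trinity-Mini",
    "messages": [
        {"role": "user", "content": "Explain FP8 precision in one sentence."}
    ],
    "stream": True,
    "max_tokens": 256,
}

body = json.dumps(payload)  # POST this to f"{BASE_URL}/chat/completions"
```

Any OpenAI-compatible client library can be pointed at the same base URL instead of hand-building the request.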

Model Characteristics

| Attribute | Details |
|---|---|
| Model Family | Trinity |
| Primary Focus | Reasoning |
| Parameter Count | 26B total |
| Active Parameters | 3B |
| Token Efficiency | Comparable to competitive instruction-tuned models |
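The parameter counts above give a rough sense of the FP8 weight footprint: at one byte per parameter, 26B total parameters is on the order of 24 GiB of weights. A back-of-envelope sketch (ignores KV cache, activations, and runtime overhead, which the card does not specify):

```python
def fp8_weight_gib(params_billions: float) -> float:
    """Rough FP8 weight footprint: 1 byte per parameter.
    Excludes KV cache, activations, and framework overhead."""
    bytes_total = params_billions * 1e9  # FP8 stores 1 byte per parameter
    return bytes_total / 2**30

total_weights = fp8_weight_gib(26)  # ~24.2 GiB for all 26B parameters
```

Note that only 3B parameters are active per token, so per-token compute scales with the active count while memory scales with the total.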

Performance Notes

  • Optimized for high-throughput inference under batch workloads

  • FP8 precision enables efficient GPU utilization without observed quality regressions

  • Suitable for shared-GPU and multi-tenant deployments
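Dynamic batching, as listed in the supported configurations, amounts to grouping queued requests into batches up to the benchmark size between scheduling steps. A minimal illustrative sketch of that grouping logic (not the serving stack's actual scheduler):

```python
from collections import deque

MAX_BATCH = 32  # matches the benchmark batch size above

def drain_batches(queue: deque, max_batch: int = MAX_BATCH):
    """Yield batches of up to max_batch requests, emptying the queue,
    as a dynamic batcher might do between scheduling steps."""
    while queue:
        take = min(max_batch, len(queue))
        yield [queue.popleft() for _ in range(take)]

pending = deque(range(70))                      # 70 queued requests
sizes = [len(b) for b in drain_batches(pending)]  # [32, 32, 6]
```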


Current Status

State: Available

I’m running Trinity-Mini and seeing really good throughput; however, it took about 8 to 10 minutes to launch.
