## Supported Hardware and Performance (FP8)
All benchmarks below were observed at batch size 32 using FP8 precision. Actual throughput varies with workload and deployment configuration.
| GPU | Tokens / Second (Range) | Batch Size |
|---|---|---|
| NVIDIA H100 | 100 – 150 | 32 |
| NVIDIA H200 | 100 – 180 | 32 |
| NVIDIA RTX Pro 6000 (Blackwell) | 80 – 120 | 32 |
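The ranges above can be sanity-checked with a simple client-side probe. The sketch below is a minimal example, not a definitive benchmark harness: it assumes an OpenAI-compatible endpoint, and the base URL (`http://localhost:8000/v1`), API key, and model name (`trinity`) are placeholders for your own deployment. It fires 32 concurrent completions (matching the benchmark batch size) and reports aggregate tokens per second.

```python
# Minimal throughput probe against an OpenAI-compatible endpoint.
# BASE_URL, MODEL, and the API key are placeholders; substitute your own.
import asyncio
import time

from openai import AsyncOpenAI

BASE_URL = "http://localhost:8000/v1"  # placeholder endpoint
MODEL = "trinity"                      # placeholder model name
BATCH_SIZE = 32                        # matches the benchmark batch size

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


async def one_request() -> int:
    # One non-streaming completion; returns the number of generated tokens.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain FP8 quantization briefly."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(BATCH_SIZE)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tok/s")


asyncio.run(main())
```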
## Supported Configurations
| Capability | Status |
|---|---|
| Precision | FP8 |
| Model Architecture | MoE (26B total / 3B active parameters) |
| Streaming Inference | Supported |
| OpenAI-compatible API | Supported |
| Dynamic Batching | Supported |
| Production Deployment | Supported |
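Because the server exposes an OpenAI-compatible API with streaming support, the standard `openai` Python client works unchanged. A minimal streaming sketch follows; as above, the base URL, API key, and model name are placeholders, not values defined by this document.

```python
# Minimal streaming sketch against the OpenAI-compatible API.
# base_url, api_key, and model are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="trinity",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    stream=True,  # tokens arrive incrementally as they are generated
)

# Each chunk carries a delta with zero or more new characters.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```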
## Model Characteristics
| Attribute | Details |
|---|---|
| Model Family | Trinity |
| Primary Focus | Reasoning |
| Parameter Count | 26B total |
| Active Parameters | 3B |
| Token Efficiency | Comparable to competing instruction-tuned models |
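The parameter counts above translate directly into a rough memory and compute budget: at FP8, each parameter occupies one byte, so all 26B parameters (every expert must be resident) need on the order of 26 GB before KV cache and activation overhead, while only the ~3B active parameters participate in each forward pass. A back-of-the-envelope sketch, under those assumptions:

```python
# Back-of-the-envelope FP8 estimate for a 26B-total / 3B-active MoE model.
# Ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS = 26e9   # all experts must reside in GPU memory
ACTIVE_PARAMS = 3e9   # parameters actually used per token
BYTES_PER_PARAM = 1   # FP8 = 8 bits = 1 byte

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weight memory: ~{weights_gb:.0f} GB (within a single 80 GB H100)")

# Per-token forward-pass compute scales with active parameters:
# roughly 2 FLOPs per parameter per token.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"Forward-pass compute: ~{flops_per_token / 1e9:.0f} GFLOPs per token")
```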
## Performance Notes
- Optimized for high-throughput inference under batch workloads
- FP8 precision enables efficient GPU utilization without observed quality regressions
- Suitable for shared-GPU and multi-tenant deployments (see the sketch below)
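Dynamic batching means independent clients do not need to coordinate: the server coalesces whatever requests happen to be in flight. The sketch below illustrates the multi-tenant pattern from the client side, simulating two tenants submitting requests at uncoordinated times; the endpoint, model name, and tenant names are all illustrative assumptions.

```python
# Two independent tenants sharing one endpoint. The server's dynamic
# batcher coalesces their in-flight requests; clients just send.
import asyncio
import random

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def tenant(name: str, n_requests: int) -> None:
    for i in range(n_requests):
        await asyncio.sleep(random.uniform(0.0, 0.5))  # uncoordinated arrivals
        resp = await client.chat.completions.create(
            model="trinity",  # placeholder model name
            messages=[{"role": "user", "content": f"{name} request {i}"}],
            max_tokens=64,
        )
        print(f"{name}: got {resp.usage.completion_tokens} tokens")


async def main() -> None:
    # Both tenants run concurrently; batching happens server-side.
    await asyncio.gather(tenant("tenant-a", 4), tenant("tenant-b", 4))


asyncio.run(main())
```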
## Current Status
| Status |
|---|
| Available |