# Primus-Turbo

Primus-Turbo is a high-performance acceleration library dedicated to large-scale model training on AMD GPUs. Built and optimized for the AMD ROCm platform, it covers the full training stack, including core compute operators (GEMM, Attention, GroupedGEMM), communication primitives, optimizer modules, low-precision computation (FP8), and compute-communication overlap kernels.

With high performance, full feature coverage, and developer friendliness as its guiding principles, Primus-Turbo is designed to fully unleash the potential of AMD GPUs for large-scale training workloads, offering a robust and complete acceleration foundation for next-generation AI systems.

Note: JAX support is under active development. Optimizer support is planned but not yet available.

## 🚀 What's New

## 🧩 Primus Product Matrix

| Module | Role | Key Features |
| --- | --- | --- |
| Primus-LM | E2E training framework | - Supports multiple training backends (Megatron, TorchTitan, etc.)<br>- Provides high-performance, scalable distributed training<br>- Deeply integrates with Primus-Turbo and Primus-SaFE |
| Primus-Turbo | High-performance operators & modules | - Supports core training operators and modules (FlashAttention, GEMM, GroupedGEMM, DeepEP, etc.)<br>- Integrates multiple high-performance backends (e.g., CK, hipBLASLt, AITER)<br>- High performance and easy to integrate |
| Primus-SaFE | Stability & platform layer | - Cluster sanity checks and benchmarking<br>- Topology-aware Kubernetes scheduling<br>- Fault tolerance<br>- Stability enhancements |

## 📦 Quick Start

### Requirements

#### Software

- ROCm >= 6.4
- Python >= 3.10
- PyTorch >= 2.6.0 (with ROCm support)
- rocSHMEM (optional, required for experimental DeepEP). Please refer to our DeepEP Installation Guide for instructions.
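
To confirm these requirements in an existing environment, a quick check (a minimal sketch; `torch.version.hip` is populated only on ROCm builds of PyTorch):

```python
import sys
import torch

# Verify the software requirements listed above.
print("Python :", sys.version.split()[0])  # expect >= 3.10
print("PyTorch:", torch.__version__)       # expect >= 2.6.0 (ROCm build)
print("ROCm   :", torch.version.hip)       # None on non-ROCm builds; expect >= 6.4
```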

#### Hardware

| Architecture | Supported GPUs |
| --- | --- |
| GFX942 | ✅ MI300X, ✅ MI325X |
| GFX950 | ✅ MI350X, ✅ MI355X |

See AMD GPU Architecture to find the architecture for your GPU.
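
If you are unsure what your GPU reports, the architecture can also be read from Python (a sketch assuming a ROCm build of PyTorch, which exposes `gcnArchName` on device properties):

```python
import torch

# Print each visible GPU with its GFX architecture string
# (e.g. "gfx942" for MI300X/MI325X, "gfx950" for MI350X/MI355X).
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name} -> {props.gcnArchName}")
```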

### 1. Installation

#### Docker (Recommended)

Use the pre-built AMD ROCm image from Docker Hub:

```bash
# PyTorch Ecosystem
docker pull rocm/primus:v25.10

# JAX Ecosystem
docker pull rocm/jax-training:maxtext-v25.9
```

#### Install from Source

```bash
git clone https://github.com/AMD-AGI/Primus-Turbo.git --recursive
cd Primus-Turbo

pip3 install -r requirements.txt
pip3 install --no-build-isolation .

# (Optional) Set GPU_ARCHS environment variable to specify target AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation .
```
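
Once installed, an import check confirms the compiled extension loads (a minimal sketch; `turbo.ops` is the operator namespace used in the minimal example below):

```python
# The import fails if the compiled extension is missing or was built
# for the wrong GPU architecture.
import primus_turbo.pytorch as turbo

print(turbo.__file__)  # install location
print([n for n in dir(turbo.ops) if not n.startswith("_")][:10])  # sample of exposed ops
```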

### 2. Development

For contributors, use editable mode (`-e`) so that code changes take effect immediately without reinstalling.

```bash
git clone https://github.com/AMD-AGI/Primus-Turbo.git --recursive
cd Primus-Turbo

pip3 install -r requirements.txt
pip3 install --no-build-isolation -e . -v

# (Optional) Set GPU_ARCHS environment variable to specify target AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation -e . -v

# (Optional) Set PRIMUS_TURBO_FRAMEWORK to compile for a specific framework.
# Supported values: PYTORCH (default), JAX.
# For example, to compile for JAX:
PRIMUS_TURBO_FRAMEWORK="JAX" pip3 install --no-build-isolation -e . -v
```

### 3. Testing

**Option 1: Single-process mode (slow but simple)**

```bash
pytest tests/pytorch/    # run all PyTorch tests
pytest tests/jax/        # run all JAX tests
```

**Option 2: Multi-process mode (faster)**

```bash
# PyTorch tests
pytest tests/pytorch/ -n 8        # single-GPU tests (parallel)
pytest tests/pytorch/ --dist-only # multi-GPU tests

# JAX tests
pytest tests/jax/ -n 8            # single-GPU tests (parallel)
pytest tests/jax/ --dist-only     # multi-GPU tests
```

### 4. Packaging

```bash
pip3 install -r requirements.txt
python3 -m build --wheel --no-isolation
pip3 install --extra-index-url https://test.pypi.org/simple ./dist/primus_turbo-XXX.whl
```

### 5. Minimal Example

```python
import torch
import primus_turbo.pytorch as turbo

dtype = torch.bfloat16
device = "cuda:0"

# BF16 GEMM: (128 x 256) @ (256 x 512) -> (128 x 512)
a = torch.randn((128, 256), dtype=dtype, device=device)
b = torch.randn((256, 512), dtype=dtype, device=device)
c = turbo.ops.gemm(a, b)

print(c)
print(c.shape)
```
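
As a quick correctness check, the result can be compared against PyTorch's reference matmul (a minimal sketch; the tolerances are illustrative values for bfloat16, not official guidance):

```python
import torch
import primus_turbo.pytorch as turbo

a = torch.randn((128, 256), dtype=torch.bfloat16, device="cuda:0")
b = torch.randn((256, 512), dtype=torch.bfloat16, device="cuda:0")

# Compare the Turbo GEMM against the PyTorch reference implementation.
out = turbo.ops.gemm(a, b)
ref = a @ b
torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)  # illustrative bf16 tolerances
print("turbo.ops.gemm matches torch.matmul within tolerance")
```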

## 💡 Example

See Examples for usage examples.

## 📊 Performance

See Benchmarks for detailed performance results and comparisons.

๐Ÿ“ Roadmap

See the Primus-Turbo Roadmap H2 2025.

## 📜 License

Primus-Turbo is licensed under the MIT License.

© 2025 Advanced Micro Devices, Inc. All rights reserved.
