rocmlir-tuning-driver improvements by mirza-halilcevic · Pull Request #2249 · ROCm/rocMLIR

mirza-halilcevic · 2026-02-21T12:48:34Z

Motivation

This PR reduces memory usage and improves performance during tuning. Excessive memory usage causes problems on APU systems.

Technical Details

rocmlir-tuning-driver.cpp:

Avoid unnecessary iterations when greedy falls back to exhaustive in the non-accel case
Create the stream and allocate GPU buffers once, and initialize them with memset instead of using extra host buffers

ConcurrentQueue.h:

Implement rate-adaptiveness so we don't accumulate more compile results than necessary
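
As a rough illustration of what rate-adaptive backpressure can look like, here is a minimal sketch of a bounded queue whose capacity follows the consumer. It is not the ConcurrentQueue.h from this PR: the class name `AdaptiveBoundedQueue`, its methods, and the grow/shrink heuristic (double the capacity when the consumer finds the queue empty, shrink by one when it finds the queue full) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch; names and heuristic are assumptions, not the PR's code.
template <typename T>
class AdaptiveBoundedQueue {
public:
  explicit AdaptiveBoundedQueue(std::size_t maxCapacity)
      : maxCapacity_(std::max<std::size_t>(maxCapacity, 1)) {}

  // Blocks while the queue is full; returns false once the queue is closed.
  bool push(T value) {
    std::unique_lock<std::mutex> lock(mutex_);
    notFull_.wait(lock, [&] { return closed_ || items_.size() < capacity_; });
    if (closed_)
      return false;
    items_.push_back(std::move(value));
    notEmpty_.notify_one();
    return true;
  }

  // Non-blocking variant: returns false when the queue is full or closed.
  bool tryPush(T value) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (closed_ || items_.size() >= capacity_)
      return false;
    items_.push_back(std::move(value));
    notEmpty_.notify_one();
    return true;
  }

  // Blocks until an item arrives; returns false once closed and drained.
  bool pop(T &out) {
    std::unique_lock<std::mutex> lock(mutex_);
    adapt();
    notEmpty_.wait(lock, [&] { return closed_ || !items_.empty(); });
    return take(out);
  }

  // Non-blocking variant: returns false when no item is ready.
  bool tryPop(T &out) {
    std::lock_guard<std::mutex> lock(mutex_);
    adapt();
    return take(out);
  }

  // Wakes all waiters; pending items can still be popped.
  void close() {
    std::lock_guard<std::mutex> lock(mutex_);
    closed_ = true;
    notFull_.notify_all();
    notEmpty_.notify_all();
  }

  std::size_t capacity() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return capacity_;
  }

private:
  // Called with mutex_ held whenever a consumer arrives.
  void adapt() {
    if (items_.empty()) {
      // Consumer is starved: let producers buffer more, up to the hard cap.
      capacity_ = std::min(capacity_ * 2, maxCapacity_);
      notFull_.notify_all();
    } else if (items_.size() >= capacity_ && capacity_ > 1) {
      // Queue is full: producers are ahead, so hold fewer results in memory.
      --capacity_;
    }
  }

  // Called with mutex_ held; moves the front item into `out` if there is one.
  bool take(T &out) {
    if (items_.empty())
      return false;
    out = std::move(items_.front());
    items_.pop_front();
    notFull_.notify_one();
    return true;
  }

  mutable std::mutex mutex_;
  std::condition_variable notFull_, notEmpty_;
  std::deque<T> items_;
  std::size_t capacity_ = 1;
  const std::size_t maxCapacity_;
  bool closed_ = false;
};
```

In a tuning driver, compile threads would call `push` and the benchmarking thread would call `pop`. When benchmarking is the bottleneck the capacity drifts toward 1, so only a few compiled results are held in memory at a time.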

Test Plan

Several problem configs that were previously crashing on Hark Point are now able to run.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR improves the rocmlir-tuning-driver to reduce memory usage and improve performance during tuning operations, particularly targeting issues on APU systems with limited memory.

Changes:

Eliminates host buffer allocation by initializing GPU buffers directly with hipMemsetAsync
Implements rate-adaptive concurrent queue to provide backpressure and prevent excessive memory accumulation
Caches and reuses thread resources (MLIR contexts, PassManagers) across greedy tuning iterations
Tracks effective tuning kind to avoid unnecessary iterations when greedy mode falls back to exhaustive
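
The resource-reuse item above can be pictured with a small sketch. It assumes a stand-in `WorkerResources` type in place of the real MLIR contexts and PassManagers, and a hypothetical `ResourceCache` that outlives the iteration loop. None of these names come from the PR, and the `compile` method is placeholder work.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for an expensive per-thread bundle (hypothetical; the PR summary
// mentions MLIR contexts and PassManagers, which are not modeled here).
struct WorkerResources {
  static int constructions;
  WorkerResources() { ++constructions; }
  int compile(int config) { return config * 2; } // placeholder work
};
int WorkerResources::constructions = 0;

// One lazily created slot per worker. Each slot is only touched by its own
// worker, so the slots need no locking in this sketch.
class ResourceCache {
public:
  explicit ResourceCache(std::size_t numWorkers) : slots_(numWorkers) {}

  WorkerResources &get(std::size_t worker) {
    if (!slots_[worker])
      slots_[worker].reset(new WorkerResources());
    return *slots_[worker];
  }

private:
  std::vector<std::unique_ptr<WorkerResources>> slots_;
};

// The cache is created once, outside the iteration loop, so resources are
// built numWorkers times in total instead of numWorkers * numIterations.
int runIterations(std::size_t numWorkers, int numIterations) {
  ResourceCache cache(numWorkers);
  int checksum = 0;
  for (int iter = 0; iter < numIterations; ++iter)
    for (std::size_t w = 0; w < numWorkers; ++w)
      checksum += cache.get(w).compile(iter);
  return checksum;
}
```

The loop is serial here to keep the sketch short. In a threaded driver each worker thread would call `cache.get` with its own index.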

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp	Refactored buffer management to use GPU-direct initialization, moved stream creation outside benchmarking loop, implemented thread resource caching, and added effectiveKind tracking
mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h	Added rate-adaptive queue with dynamic capacity adjustment to limit memory usage from compilation results
mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp	Set effectiveKind field when creating tuning space and when falling back from Greedy to Exhaustive
mlir/include/mlir/Dialect/Rock/Tuning/RockTuning.h	Added effectiveKind field to TuningParamSet struct to track actual tuning mode used
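
A minimal sketch of how an effective-kind field can save work, based only on the descriptions in the table above. The `TuningKind` enum, `ParamSetSketch` struct, `createTuningSpace` and `plannedPasses` functions are hypothetical stand-ins, and so is the rule that an exhaustive space needs a single pass. The real `TuningParamSet` and driver loop may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical subset of tuning kinds.
enum class TuningKind { Greedy, Exhaustive };

// Stand-in for a tuning parameter set carrying both the requested mode and
// the mode that was actually used to build the space.
struct ParamSetSketch {
  TuningKind requestedKind;
  TuningKind effectiveKind;
  std::vector<int> configs; // placeholder for perf configs
};

// Assumption: greedy is unavailable in the non-accel case, so the space is
// built exhaustively and effectiveKind records that fallback.
ParamSetSketch createTuningSpace(TuningKind requested, bool accelSupported) {
  ParamSetSketch set{requested, requested, {}};
  if (requested == TuningKind::Greedy && !accelSupported)
    set.effectiveKind = TuningKind::Exhaustive;
  set.configs = {1, 2, 3};
  return set;
}

// Assumption: greedy tuning runs several rounds while an exhaustive space is
// covered in one pass, so the driver consults effectiveKind, not the request.
int plannedPasses(const ParamSetSketch &set, int greedyRounds) {
  return set.effectiveKind == TuningKind::Greedy ? greedyRounds : 1;
}
```

Without the extra field, a driver that only sees the requested kind would still schedule every greedy round even though the space it received was already exhaustive.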

* Reuse resources between greedy tuning iterations.
* Reuse hipStream and only do hostToDevice copy once.
* Improve memory footprint by making host buffers temporary and bounding the concurrent queue.
* Avoid extra iterations when falling back from greedy to exhaustive.
* Use memset to initialize GPU buffers and avoid host memory overhead.
* Fix bugs in ConcurrentQueue.
* Avoid using minCapacity for ConcurrentQueue.

mirza-halilcevic and others added 7 commits February 21, 2026 00:44

Reuse resources between greedy tuning iterations.

afaacda

Reuse hipStream and only do hostToDevice copy once.

15d1390

Improve memory footprint by making host buffers temporary and bounding the concurrent queue.

4bd4407

Avoid extra iterations when falling back from greedy to exhaustive.

ad8ede0

Use memset to initialize GPU buffers and avoid host memory overhead.

2945802

Merge branch 'develop' into tuning-driver-improvements

34dcad2

Merge branch 'develop' into tuning-driver-improvements

c239cf5

mirza-halilcevic marked this pull request as ready for review February 23, 2026 12:18

mirza-halilcevic requested a review from causten as a code owner February 23, 2026 12:18

mirza-halilcevic requested review from dhernandez0 and umangyadav February 23, 2026 12:18

umangyadav approved these changes Feb 23, 2026

umangyadav requested a review from Copilot February 23, 2026 13:20

Copilot started reviewing on behalf of umangyadav February 23, 2026 13:23

Copilot AI reviewed Feb 23, 2026

mirza-halilcevic and others added 6 commits February 23, 2026 16:48

Fix bugs in ConcurrentQueue.

a5b9fb1

Avoid using minCapacity for ConcurrentQueue.

e5d7412

Merge branch 'develop' into tuning-driver-improvements

3c3cc2e

Merge branch 'develop' into tuning-driver-improvements

52c9cf1

Merge branch 'develop' into tuning-driver-improvements

c916cbc

Merge branch 'develop' into tuning-driver-improvements

119cba7

mirza-halilcevic merged commit ac0dcc9 into develop Feb 27, 2026
7 of 14 checks passed

mirza-halilcevic deleted the tuning-driver-improvements branch February 27, 2026 19:14

mirza-halilcevic merged 13 commits into develop from tuning-driver-improvements

3 participants
