Which component requires the feature?
CuTe DSL
Feature Request
Is your feature request related to a problem? Please describe.
It would be very useful to be able to run and debug the distributed examples, such as those in https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/distributed, on a single GPU.
Describe the solution you'd like
Run on a single GPU, for example as follows:
torchrun --nnodes 1 --nproc-per-node 2 --no-python python all_reduce_one_shot_lamport.py
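To make the request concrete, here is a minimal sketch of the rank-to-device mapping that a single-GPU mode would need, so that both processes launched by the command above land on the same device. The helper `pick_device_index` and the `NUM_VISIBLE_GPUS` variable are hypothetical names used only for illustration and are not part of CuTe DSL or the existing examples. `LOCAL_RANK` is the environment variable that torchrun sets for each worker. Whether the communication layer used by the examples tolerates several ranks sharing one GPU is a separate question that this sketch does not address.

```python
import os


def pick_device_index(local_rank: int, num_visible_gpus: int) -> int:
    """Map a torchrun local rank onto a visible GPU index (hypothetical helper)."""
    if num_visible_gpus < 1:
        raise ValueError("at least one visible GPU is required")
    # Wrap around so that ranks beyond the GPU count share devices.
    # With 1 GPU and 2 ranks, both ranks resolve to device 0.
    return local_rank % num_visible_gpus


if __name__ == "__main__":
    # torchrun sets LOCAL_RANK for each worker process it launches.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # NUM_VISIBLE_GPUS is an assumed variable for this sketch; a real script
    # would more likely query the CUDA runtime for the device count.
    num_gpus = int(os.environ.get("NUM_VISIBLE_GPUS", "1"))
    device = pick_device_index(local_rank, num_gpus)
    print(f"local rank {local_rank} -> cuda:{device}")
```

With this mapping, the existing multi-GPU behavior is unchanged when there are at least as many GPUs as ranks, and ranks only share a device when there are fewer.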
Describe alternatives you've considered
Multiple GPUs
Additional context
N/A