
Conversation

@blueberrycongee

Summary

Add descriptive documentation to sgemm_sm80.cu explaining that it is actually an FP16xFP16 GEMM (HGEMM) tutorial using CuTe.

Problem

Issue #1686 reports that users cannot find an fp16 GEMM tutorial in the CuTe examples.

Solution

The existing sgemm_sm80.cu already implements FP16 GEMM using cute::half_t, but this was not documented. This PR adds a documentation block clarifying:

  • This example uses FP16 data types despite the "sgemm" filename
  • Key features: Tensor Cores, cp.async, pipelining, swizzled shared memory
  • Usage examples

Fixes #1686
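
For context, below is a hypothetical sketch of the FP16 type configuration this documentation describes. The alias names (TA, TB, TC, TI) and the MMA atom follow the CuTe tutorial style but may not match sgemm_sm80.cu line for line.

```cpp
// Hypothetical sketch: FP16 element types in the CuTe tutorial style.
// The alias names are illustrative and may not match sgemm_sm80.cu exactly.
#include <cute/tensor.hpp>

using TA = cute::half_t;  // element type of A (FP16)
using TB = cute::half_t;  // element type of B (FP16)
using TC = cute::half_t;  // element type of C (FP16)
using TI = cute::half_t;  // type of the alpha/beta scalars

static_assert(sizeof(cute::half_t) == 2, "half_t is a 16-bit type");

// On SM80 these types would dispatch to an FP16 Tensor Core MMA atom,
// e.g. (illustrative):
// auto tiled_mma = cute::make_tiled_mma(cute::SM80_16x8x16_F16F16F16F16_TN{});
```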

@hwu36 (Collaborator) commented Jan 7, 2026

If the accumulation type is FP32, it is SGEMM; if the accumulation type is FP16, it is HGEMM.
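
To make that naming rule concrete, here is a minimal CUDA sketch (illustrative only, not CUTLASS or CuTe API): both inner products take FP16 inputs and differ only in the accumulator type.

```cpp
// Minimal sketch of the naming rule above. Both kernels consume FP16 inputs;
// only the accumulation type differs.
#include <cuda_fp16.h>

// HGEMM-style inner product: FP16 inputs, FP16 accumulation.
__global__ void dot_fp16_acc(const __half* a, const __half* b, __half* out, int k) {
    __half acc = __float2half(0.0f);
    for (int i = 0; i < k; ++i)
        acc = __hfma(a[i], b[i], acc);                   // accumulate in FP16
    *out = acc;
}

// SGEMM-style (by the rule above) inner product: FP16 inputs, FP32 accumulation.
__global__ void dot_fp32_acc(const __half* a, const __half* b, float* out, int k) {
    float acc = 0.0f;
    for (int i = 0; i < k; ++i)
        acc += __half2float(a[i]) * __half2float(b[i]);  // accumulate in FP32
    *out = acc;
}
```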


Development

Successfully merging this pull request may close these issues.

[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas?
