
python interop poc #30

Open
PhilipDeegan wants to merge 1 commit into master from next

Conversation

PhilipDeegan (Member) commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Added GPU warp size detection functionality
    • Introduced Python bindings for GPU-accelerated vector operations
  • Build & Infrastructure

    • Enhanced download reliability with retries in build workflow
    • Updated CUDA architecture targeting and added NVCC build configuration
    • Enabled CPU GPU flag for comprehensive testing
  • Tests & Examples

    • Added CUDA-accelerated vector addition example with Python integration
  • Chores

    • Code cleanup and development environment improvements

✏️ Tip: You can customize this high-level summary in your review settings.

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough


This pull request updates GPU configuration and build settings, modifies CUDA header handling for runtime warp size detection, and introduces a new proof-of-concept for CUDA-accelerated Python integration using Numba JIT compilation with pybind11 bindings.

Changes

Build & Workflow Configuration
.github/workflows/build.yml, mkn.yaml, .sublime-project
Updated curl with retry logic (3 attempts, 2-second delay, fail-fast); enabled the MKN_GPU_CPU flag in the test profile; added Python Black formatting on save in the Sublime project settings.

CUDA Headers & Compiler Configuration
inc/mkn/gpu/cuda.hpp, inc/mkn/gpu/tuple.hpp, res/mkn/clang_cuda.yaml, res/mkn/nvcc.yaml
Added a getWarpSize() function for runtime warp-size resolution with a compile-time fallback via `_MKN_GPU_WARP_SIZE_`; removed a commented-out include in tuple.hpp; upgraded the CUDA arch from sm_61 to sm_80; replaced clang++-17 with generic clang++; introduced a new NVCC build configuration with C++20 and PIC settings.

Test Macro Updates
test/any/async_streaming.cpp
Replaced `KUL_DBG_FUNC_ENTER` with `MKN_KUL_DBG_FUNC_ENTER` debug macros; reordered includes.

New POC JIT Integration
test/poc/jit/eg.py, test/poc/jit/mkn.yaml, test/poc/jit/pybind.cpp, test/poc/jit/requirements.txt
Added a new proof of concept demonstrating a Numba-based GPU kernel (vector addition) with pybind11 C++ bindings exposing a FunctionSupport class for GPU memory management and array access; includes a build profile and Python dependencies (numpy, numba-cuda). A minimal launch sketch follows after this table.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • yaml defaults #29: Both PRs modify the same mkn.yaml CPU profile argument for the `-DMKN_GPU_CPU=1` flag with inverse logic (this PR enables it).
  • count zero #28: Both PRs alter warp-size handling in cuda.hpp, changing how warp_size is determined at compile-time versus runtime.

Poem

🐰 Bindings hop and kernels launch,
CUDA whispers through each branch,
Warp sizes dance, configs align,
Python meets GPU in design,
Numba-fied magic, shining bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title 'python interop poc' accurately reflects the main objective of the pull request, which adds Python interoperability through pybind11 bindings and proof-of-concept examples.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

PhilipDeegan (Member Author) commented Jan 27, 2026

@coderabbitai are you set up for this repo?

coderabbitai bot commented Jan 27, 2026

Tip

For best results, initiate chat on the files or code changes.

@PhilipDeegan Yes, I'm set up and ready to help with this repository! I have your preferences stored (like your include ordering style) and I'm operational for code reviews.

Would you like me to:

  • Review this PR (python interop poc)?
  • Show you the current CodeRabbit configuration for this repository?
  • Help with something specific?

Just let me know how I can assist! 🐰


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.


coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/cuda.hpp`:
- Around line 84-95: Make getWarpSize inline to avoid ODR violations when
included from headers, and add proper CUDA error handling: in the inline
std::uint32_t getWarpSize(size_t dev = 0) function initialize a cudaDeviceProp
variable before use, call cudaGetDeviceProperties and check its return value
(use the project's MKN_GPU_ASSERT macro or return a safe default like 32 /
throw/abort on failure), and only access devProp.warpSize on success; also
ensure the static std::uint32_t const warp_size = getWarpSize(); usage remains
valid after these safeguards.

In `@res/mkn/clang_cuda.yaml`:
- Around line 18-31: The compiler and linker entries currently invoke the
generic clang++ which can pick older toolchains; update the "compiler" and
"linker" fields so they explicitly call clang++-17 (i.e. replace uses of
"clang++" in the compiler: and linker: settings with "clang++-17") to ensure the
recommended clang 17 toolchain is used for the cxx_cuda build and CUDA SM_80
flags (references: compiler, linker, cxx_cuda).

In `@test/poc/jit/eg.py`:
- Around line 17-24: vadd is declared as a CUDA kernel but is invoked from
inside another kernel (vector_add_gpu), which Numba disallows; change vadd to a
device function by using `@cuda.jit(device=True)` instead of a kernel decorator so
it can be called from vector_add_gpu, keep the same signature (index i and
arrays a, b, c) and ensure vector_add_gpu calls vadd(...) as a regular device
function rather than launching it.

In `@test/poc/jit/pybind.cpp`:
- Around line 61-66: The pybind registration mistakenly repeats
.def_readwrite("a", &FunctionSupport::a) instead of exposing member c; update
the duplicate binding in the chain that registers FunctionSupport so it uses
.def_readwrite("c", &FunctionSupport::c) (keep the existing
.def("A",&FunctionSupport::A), .def_readwrite("b",&FunctionSupport::b) and
.def("C",&FunctionSupport::C) intact) so the struct's c member is correctly
exposed to Python.
- Around line 44-46: The make function currently takes an auto parameter and
returns a py::array_t<float> with a no-op capsule, which causes two problems:
'auto' in the parameter requires C++20 and the no-op capsule does not extend the
owning FunctionSupport lifetime (risking use-after-free). Change the signature
of make to accept an explicit float* (replace auto p with float* p) and when
constructing the py::array_t<float> pass py::cast(this) as the base object so
the Python wrapper for this FunctionSupport instance is held alive for the
array's lifetime, removing the no-op capsule.
🧹 Nitpick comments (1)
res/mkn/nvcc.yaml (1)

5-19: Consider making the CUDA arch explicit/configurable.
Without an arch flag, nvcc defaults can vary across environments and hurt reproducibility/perf. A cuda_arch property (or env override) keeps builds consistent.

Example (adjust to mkn interpolation rules)
 property:
-  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
-  cxx_cuda: -x cu --extended-lambda #-x cuda --cuda-gpu-arch="sm_80" -Xclang -fcuda-allow-variadic-functions
+  cuda_arch: sm_80
+  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
+  cxx_cuda: -x cu --extended-lambda --gpu-architecture="${cuda_arch}"

Comment on lines +84 to 95
std::uint32_t getWarpSize(size_t dev = 0) {
#ifdef _MKN_GPU_WARP_SIZE_
  static constexpr int warp_size = _MKN_GPU_WARP_SIZE_;
  return _MKN_GPU_WARP_SIZE_;
#else
  static constexpr int warp_size = warpSize;
  cudaDeviceProp devProp;
  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
  return devProp.warpSize;
#endif /*_MKN_GPU_WARP_SIZE_ */
}

static std::uint32_t const warp_size = getWarpSize();

coderabbitai bot commented Jan 27, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n inc/mkn/gpu/cuda.hpp | sed -n '80,100p'

Repository: mkn/mkn.gpu

Length of output: 800


🏁 Script executed:

rg "getWarpSize" --type cpp --type hpp -B 2 -A 8

Repository: mkn/mkn.gpu

Length of output: 82


🏁 Script executed:

rg "MKN_GPU_ASSERT" --type cpp --type hpp | head -20

Repository: mkn/mkn.gpu

Length of output: 82


🏁 Script executed:

rg "cudaGetDeviceProperties" --type cpp --type hpp -B 2 -A 3

Repository: mkn/mkn.gpu

Length of output: 82


🏁 Script executed:

rg "getWarpSize" -B 2 -A 2

Repository: mkn/mkn.gpu

Length of output: 513


🏁 Script executed:

rg "MKN_GPU_ASSERT" -B 1 -A 1 | head -40

Repository: mkn/mkn.gpu

Length of output: 1886


🏁 Script executed:

git log --oneline --all -- inc/mkn/gpu/cuda.hpp | head -10

Repository: mkn/mkn.gpu

Length of output: 344


🏁 Script executed:

rg "define MKN_GPU_ASSERT" -A 10

Repository: mkn/mkn.gpu

Length of output: 1742


🏁 Script executed:

git show 3e7997c --stat | head -20

Repository: mkn/mkn.gpu

Length of output: 315


🏁 Script executed:

git show 3e7997c -- inc/mkn/gpu/cuda.hpp | head -100

Repository: mkn/mkn.gpu

Length of output: 665


🏁 Script executed:

head -30 inc/mkn/gpu/cuda.hpp

Repository: mkn/mkn.gpu

Length of output: 1545


🏁 Script executed:

grep -n "define _MKN_GPU_WARP_SIZE_" inc/mkn/gpu/cuda.hpp

Repository: mkn/mkn.gpu

Length of output: 37


🏁 Script executed:

find . -name "*.cpp" -o -name "*.cc" | head -20

Repository: mkn/mkn.gpu

Length of output: 414


🏁 Script executed:

grep -n "_MKN_GPU_WARP_SIZE_" inc/mkn/gpu/cuda.hpp

Repository: mkn/mkn.gpu

Length of output: 155


🏁 Script executed:

wc -l inc/mkn/gpu/cuda.hpp

Repository: mkn/mkn.gpu

Length of output: 79


🏁 Script executed:

sed -n '84,100p' inc/mkn/gpu/cuda.hpp

Repository: mkn/mkn.gpu

Length of output: 599


Make getWarpSize inline and add error handling for CUDA initialization.

Line 84 defines a non-inline function in a header file, creating an ODR violation when included across multiple translation units. Additionally, cudaDeviceProp devProp at line 88 is uninitialized, and if cudaGetDeviceProperties fails (e.g., in CPU-only setups), line 90 accesses uninitialized memory—undefined behavior. The error is silently ignored with [[maybe_unused]], inconsistent with the project's MKN_GPU_ASSERT pattern used throughout the codebase.

Suggested fix
-std::uint32_t getWarpSize(size_t dev = 0) {
+inline std::uint32_t getWarpSize(size_t dev = 0) {
 #ifdef _MKN_GPU_WARP_SIZE_
   return _MKN_GPU_WARP_SIZE_;
 #else
-  cudaDeviceProp devProp;
-  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
-  return devProp.warpSize;
+  int count = 0;
+  if (cudaGetDeviceCount(&count) != cudaSuccess || count <= 0) {
+    return 32; // TODO: choose a project-appropriate fallback
+  }
+  if (dev >= static_cast<size_t>(count)) dev = 0;
+  cudaDeviceProp devProp{};
+  MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, static_cast<int>(dev)));
+  return devProp.warpSize;
 #endif /*_MKN_GPU_WARP_SIZE_    */
 }
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda.hpp` around lines 84 - 95, Make getWarpSize inline to avoid
ODR violations when included from headers, and add proper CUDA error handling:
in the inline std::uint32_t getWarpSize(size_t dev = 0) function initialize a
cudaDeviceProp variable before use, call cudaGetDeviceProperties and check its
return value (use the project's MKN_GPU_ASSERT macro or return a safe default
like 32 / throw/abort on failure), and only access devProp.warpSize on success;
also ensure the static std::uint32_t const warp_size = getWarpSize(); usage
remains valid after these safeguards.

PhilipDeegan (Member Author) commented:

What I'm trying to do is have a default warpSize, but allow for overrides, as ROCm headers can use warp size 64 while some AMD gaming GPUs only support 32.

It needs to work on host and device, and for clang, nvcc, and hipcc.

coderabbitai bot commented:

@PhilipDeegan, thanks for clarifying the design intent! Supporting runtime warp size detection with compile-time override across CUDA/ROCm platforms makes sense.

The inline keyword is still necessary to avoid ODR violations when this header is included from multiple translation units. However, I'll adjust my error handling recommendation given your cross-platform requirements:

Revised suggestions:

  1. Add inline keyword - Required for ODR safety in headers
  2. Initialize devProp - Use cudaDeviceProp devProp{}; to avoid undefined behavior if the API call fails
  3. Graceful fallback - Since this runs at static init time and needs to work across platforms, consider returning a safe default (32) if device query fails, rather than aborting:
inline std::uint32_t getWarpSize(size_t dev = 0) {
#ifdef _MKN_GPU_WARP_SIZE_
  return _MKN_GPU_WARP_SIZE_;
#else
  cudaDeviceProp devProp{};
  auto ret = cudaGetDeviceProperties(&devProp, dev);
  if (ret != cudaSuccess) {
    return 32; // Safe fallback for gaming GPUs and when CUDA unavailable
  }
  return devProp.warpSize;
#endif
}

Question: When building for ROCm/HIP, does cudaGetDeviceProperties map to hipGetDeviceProperties? If so, the same fallback logic should work. If you need platform-specific handling, that would require conditional compilation.

Does this approach work for your use case?


✏️ Learnings added
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 30
File: inc/mkn/gpu/cuda.hpp:84-95
Timestamp: 2026-01-27T21:55:44.003Z
Learning: In `inc/mkn/gpu/cuda.hpp`, the `getWarpSize()` function is designed to provide a default warp size with compile-time override via `_MKN_GPU_WARP_SIZE_`, supporting both CUDA (warp size 32) and ROCm (warp size 64), and must work on host and device with clang, nvcc, and hipcc compilers.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 23
File: inc/mkn/gpu/cuda.hpp:307-311
Timestamp: 2024-12-08T14:07:35.137Z
Learning: In `inc/mkn/gpu/cuda.hpp`, the `global_d_kernel` function is intentionally left without bounds checking, intended for advanced use where the caller is responsible for ensuring proper grid and block dimensions, while `global_gd_kernel` includes bounds checking for safer operation.

Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.

Comment on lines +17 to +24
@cuda.jit
def vadd(i, a, b, c):
    c[i] = a[i] + b[i]


@cuda.jit
def vector_add_gpu(a, b, c):
    vadd(cuda.threadIdx.x, a, b, c)
coderabbitai bot commented Jan 27, 2026

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Numba CUDA @cuda.jit device=True device function syntax kernel-to-kernel call

💡 Result:

Device function (@cuda.jit(device=True)) syntax and how to call it from a kernel

Define a device function with device=True, then call it like a normal Python function from inside a @cuda.jit kernel (or from another device function). Device functions can return values. [1]

from numba import cuda
import math

@cuda.jit(device=True)
def f(x, y):
    return x + y

@cuda.jit
def kernel(a, b, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = f(a[i], b[i])   # device-function call

What you cannot do (“kernel-to-kernel call”)

  • You cannot launch a kernel from inside another kernel in Numba (CUDA dynamic parallelism is not supported). [2]
  • A @cuda.jit(device=True) function is not callable from host (CPU) Python code—it must be called only from GPU code (a kernel/device function). [3]

Sources:
[1] Numba “Writing Device Functions” docs (stable) (numba.readthedocs.io)
[2] Numba-CUDA “Writing CUDA Kernels” (note on dynamic parallelism not supported) (nvidia.github.io)
[3] StackOverflow: device functions are only callable from kernels/device code (stackoverflow.com)



Fix kernel-to-kernel call in Numba CUDA.

vadd is declared as a CUDA kernel but invoked from within another kernel (vector_add_gpu), which Numba does not support. Kernel-to-kernel calls will fail during JIT compilation. Change vadd to a device function using @cuda.jit(device=True).

🔧 Suggested fix (make vadd a device function)
-@cuda.jit
+@cuda.jit(device=True)
 def vadd(i, a, b, c):
     c[i] = a[i] + b[i]

 @cuda.jit
 def vector_add_gpu(a, b, c):
-    vadd(cuda.threadIdx.x, a, b, c)
+    i = cuda.threadIdx.x
+    vadd(i, a, b, c)
🤖 Prompt for AI Agents
In `@test/poc/jit/eg.py` around lines 17 - 24, vadd is declared as a CUDA kernel
but is invoked from inside another kernel (vector_add_gpu), which Numba
disallows; change vadd to a device function by using `@cuda.jit(device=True)`
instead of a kernel decorator so it can be called from vector_add_gpu, keep the
same signature (index i and arrays a, b, c) and ensure vector_add_gpu calls
vadd(...) as a regular device function rather than launching it.

PhilipDeegan (Member Author) commented:

it seems to be working for me locally

coderabbitai bot commented:

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +44 to +46
py::array_t<float> make(auto p) {
  return {{32}, {sizeof(float)}, p, py::capsule(p, [](void* f) { /* noop */ })};
}
coderabbitai bot commented Jan 27, 2026

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "pybind.cpp" -type f | head -20

Repository: mkn/mkn.gpu

Length of output: 80


🏁 Script executed:

cat -n ./test/poc/jit/pybind.cpp

Repository: mkn/mkn.gpu

Length of output: 2359


🌐 Web query:

pybind11 py::array_t base parameter lifetime management reference_internal

💡 Result:

return_value_policy::reference_internal

reference_internal is only about C++ references/pointers returned from methods/properties: it keeps the implicit self (“parent”) alive as long as the returned object (“child”) is alive. Internally it’s reference + keep_alive<0,1> (i.e., keep argument 1 alive while the return value (0) is alive). [1]

So it does not manage the lifetime of a NumPy buffer unless the thing you’re returning is literally a C++ ref/pointer whose lifetime is tied to self.

py::array_t(..., base=...) (NumPy view lifetime)

When you construct a py::array / py::array_t from an existing data pointer, the last parameter (handle base) is the mechanism that ties the array’s lifetime to some Python object that owns (or keeps alive) the memory. In the constructor, if base && ptr is provided, NumPy will treat it as a non-owning/view array and keep a reference to base (so base can’t be GC’d while the array is alive). [2]

Signature (simplified):
py::array(dt, shape, strides, ptr, handle base = {}) [2]

Practical patterns

1) View into a C++ object’s internal buffer (keep self alive):
Return an array view and pass the Python self as base (so the C++ instance can’t die while the NumPy view exists). This is the “NumPy equivalent” of what people expect reference_internal to do, but it’s done via base, not via return value policy. [2]

2) Heap-owned buffer (free when array dies):
Allocate and pass a py::capsule as base with a deleter; NumPy keeps the capsule alive and the deleter runs when the array is freed. (Common idiom in pybind11.) [2][3]

Key takeaway

  • Use reference_internal for C++ ref/pointer returns tied to self. [1]
  • Use py::array_t(..., base=...) to manage NumPy view lifetimes, typically by setting base to self or to a py::capsule/owner object. [2][3]

Sources: [1] pybind11 docs on reference_internal / keep_alive; [2] pybind11/numpy.h array constructor showing base handling; [3] explanation/example of using py::capsule as base to tie deallocation to array lifetime.
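
As a Python-level illustration of what the base relationship buys (plain NumPy only, no pybind11 involved), a view's base attribute keeps the owning object reachable for as long as the view exists, which is the same mechanism the py::array_t base parameter hooks into:

# Toy NumPy-only illustration; array size and slicing here are arbitrary.
import numpy as np

owner = np.arange(32, dtype=np.float32)
view = owner[:8]            # basic slicing returns a view, not a copy
assert view.base is owner   # the view pins its owner via .base, so the
                            # owner cannot be collected while the view lives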


🌐 Web query:

C++ auto parameter in function auto p C++20 deduced parameter requirement

💡 Result:

In C++20, writing auto in a function parameter type makes the declaration an abbreviated function template: each auto becomes an invented template type parameter, and the parameter type is deduced from the call using normal template argument deduction rules. (en.cppreference.com)

What’s required / allowed

  • auto must appear at the top level of the parameter’s type (you may add const, &, &&, *, etc. around it):
    void f(auto x);        // template<class T> void f(T)
    void g(const auto& x); // template<class T> void g(const T&)
    void h(auto* p);       // template<class T> void h(T*)
    (en.cppreference.com)
  • You can constrain the deduced type with a concept:
    void f(std::integral auto x); // template<std::integral T> void f(T)
    If the constraint is false/ill-formed for the deduced T, deduction fails. (en.cppreference.com)

What’s not allowed (common pitfall)

  • In C++20, auto is not permitted inside a “compound” parameter type like std::vector<auto>; it’s only allowed as a generic parameter type placeholder at the top level of the parameter declaration. (stackoverflow.com)

If you paste your exact signature (e.g., what you mean by “auto p”), I can restate the equivalent explicit template<...> form and point out exactly why it does/doesn’t compile.



🌐 Web query:

pybind11 py::cast(this) py::array_t base buffer lifetime keep alive

💡 Result:

To keep a py::array_t (that views C++ memory) from outliving the C++ object that owns the buffer, pass a Python “owner” object as the array’s base. NumPy will hold a reference to base, so the owner stays alive as long as the array does. (pybind11.readthedocs.io)

Pattern: make the array’s base be self (this)

Inside a bound method, you can tie the array lifetime to the Python wrapper for *this:

#include <pybind11/numpy.h>
namespace py = pybind11;

py::array_t<float> MyObj::view() {
    auto base = py::cast(this, py::return_value_policy::reference); // "self"
    return py::array_t<float>(
        {n},                         // shape
        {sizeof(float)},             // strides (bytes)
        data_ptr,                    // pointer into this object
        base                         // base/owner => keeps self alive
    );
}

This is the NumPy equivalent of “keep self alive while the returned view exists”: the array does not own the data; it just references it, and base prevents premature destruction. (pybind11.readthedocs.io)

Alternative: reference_internal / keep_alive<0,1>

If you’re returning a pointer/reference to internal data (not building a NumPy array yourself), return_value_policy::reference_internal applies the equivalent of keep_alive<0,1> (keep the implicit this alive for the return value’s lifetime). (pybind11.readthedocs.io)

When the buffer isn’t owned by this: use a capsule as base

If the memory should be freed when the array dies (or is owned by something other than this), use a py::capsule as the array base with a destructor callback. (exchangetuts.com)



Fix lifetime management and C++20 incompatibility in array return.

The no-op capsule does not keep the owning FunctionSupport alive; Python can hold the array after the destructor frees memory, causing use-after-free. Additionally, auto in function parameters requires C++20—without it, this won't compile.

Change to an explicit float* parameter and pass py::cast(this) as the array's base to tie its lifetime to the Python wrapper of this:

🔧 Suggested fix
-  py::array_t<float> make(auto p) {
-    return {{32}, {sizeof(float)}, p, py::capsule(p, [](void* f) { /* noop */ })};
-  }
+  py::array_t<float> make(float* p) {
+    return py::array_t<float>(
+        {32},
+        {sizeof(float)},
+        p,
+        py::cast(this));
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  py::array_t<float> make(auto p) {
    return {{32}, {sizeof(float)}, p, py::capsule(p, [](void* f) { /* noop */ })};
  }
  py::array_t<float> make(float* p) {
    return py::array_t<float>(
        {32},
        {sizeof(float)},
        p,
        py::cast(this));
  }
🤖 Prompt for AI Agents
In `@test/poc/jit/pybind.cpp` around lines 44 - 46, The make function currently
takes an auto parameter and returns a py::array_t<float> with a no-op capsule,
which causes two problems: 'auto' in the parameter requires C++20 and the no-op
capsule does not extend the owning FunctionSupport lifetime (risking
use-after-free). Change the signature of make to accept an explicit float*
(replace auto p with float* p) and when constructing the py::array_t<float> pass
py::cast(this) as the base object so the Python wrapper for this FunctionSupport
instance is held alive for the array's lifetime, removing the no-op capsule.

PhilipDeegan (Member Author) commented:

this is a POC to show C++ side memory allocation/lifetimes, so not so important

coderabbitai bot commented:

@PhilipDeegan Understood! Since this is a POC for demonstrating memory allocation patterns, those details are fine as-is. Thanks for the context.


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.

Comment on lines +61 to +66
.def_readwrite("a", &FunctionSupport::a)
.def("A", &FunctionSupport::A)
.def_readwrite("b", &FunctionSupport::b)
.def("B", &FunctionSupport::B)
.def_readwrite("a", &FunctionSupport::a)
.def("C", &FunctionSupport::C);
coderabbitai bot commented:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual file content around lines 61-66
cat -n test/poc/jit/pybind.cpp | sed -n '55,75p'

Repository: mkn/mkn.gpu

Length of output: 780


🏁 Script executed:

# Let's also search for the FunctionSupport class definition
rg -A 20 'class FunctionSupport' test/poc/jit/

Repository: mkn/mkn.gpu

Length of output: 37


🏁 Script executed:

# Check if there are other references to binding "c" in the file
rg 'def.*"c"' test/poc/jit/pybind.cpp

Repository: mkn/mkn.gpu

Length of output: 37


🏁 Script executed:

# Search for FunctionSupport class definition in the entire repo
rg -l 'class FunctionSupport'

Repository: mkn/mkn.gpu

Length of output: 37


🏁 Script executed:

# Also look at the beginning of the pybind.cpp file to see includes and class definition
head -n 60 test/poc/jit/pybind.cpp

Repository: mkn/mkn.gpu

Length of output: 1429


Duplicate `a` binding—should expose `c` instead.

The FunctionSupport struct has three members (a, b, c) with corresponding methods (A(), B(), C()). Line 65 duplicates the `a` binding, preventing exposure of the `c` member variable that is allocated, used, and destroyed throughout the class lifecycle.

🔧 Suggested fix
       .def("A", &FunctionSupport::A)
       .def_readwrite("b", &FunctionSupport::b)
       .def("B", &FunctionSupport::B)
-      .def_readwrite("a", &FunctionSupport::a)
+      .def_readwrite("c", &FunctionSupport::c)
       .def("C", &FunctionSupport::C);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.def_readwrite("a", &FunctionSupport::a)
.def("A", &FunctionSupport::A)
.def_readwrite("b", &FunctionSupport::b)
.def("B", &FunctionSupport::B)
.def_readwrite("a", &FunctionSupport::a)
.def("C", &FunctionSupport::C);
.def_readwrite("a", &FunctionSupport::a)
.def("A", &FunctionSupport::A)
.def_readwrite("b", &FunctionSupport::b)
.def("B", &FunctionSupport::B)
.def_readwrite("c", &FunctionSupport::c)
.def("C", &FunctionSupport::C);
🤖 Prompt for AI Agents
In `@test/poc/jit/pybind.cpp` around lines 61 - 66, The pybind registration
mistakenly repeats .def_readwrite("a", &FunctionSupport::a) instead of exposing
member c; update the duplicate binding in the chain that registers
FunctionSupport so it uses .def_readwrite("c", &FunctionSupport::c) (keep the
existing .def("A",&FunctionSupport::A), .def_readwrite("b",&FunctionSupport::b)
and .def("C",&FunctionSupport::C) intact) so the struct's c member is correctly
exposed to Python.

mkn deleted a comment from coderabbitai bot Jan 27, 2026