Is the `Tensor` type suited to implementing an `im2col` operation? I tried, but only got it working with nested loops, which is of course bad for CUDA. Ultimately, I want an efficient convolution. Is that possible with the current API surface?
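
To make the target concrete, here is a minimal sketch of a gather-based `im2col` in plain Python, deliberately using no tensor library. The function names (`im2col_indices`, `im2col`, `conv2d_via_im2col`) and the single-channel, stride-1, no-padding layout are assumptions made for illustration and are not part of any existing API. The idea is that the index table depends only on the shapes, so it can be built once on the host; the data movement is then a single gather, followed by a matrix product.

```python
def im2col_indices(height, width, kh, kw):
    """Flat input indices for every im2col entry.

    Assumes a single channel, stride 1, and no padding (illustrative only).
    Row r corresponds to kernel offset (r // kw, r % kw); column c corresponds
    to output position (c // out_w, c % out_w).
    """
    out_h, out_w = height - kh + 1, width - kw + 1
    return [
        [(ki + oi) * width + (kj + oj) for oi in range(out_h) for oj in range(out_w)]
        for ki in range(kh)
        for kj in range(kw)
    ]


def im2col(flat_image, height, width, kh, kw):
    """Gather the flattened image through the index table.

    On a tensor backend this step would ideally be one gather or index-select
    call rather than Python-level iteration.
    """
    idx = im2col_indices(height, width, kh, kw)
    return [[flat_image[i] for i in row] for row in idx]


def conv2d_via_im2col(flat_image, height, width, kernel, kh, kw):
    """Cross-correlation as a (1 x kh*kw) times (kh*kw x out_h*out_w) product."""
    cols = im2col(flat_image, height, width, kh, kw)
    n_out = len(cols[0])
    return [sum(kernel[r] * cols[r][c] for r in range(kh * kw)) for c in range(n_out)]
```

If the `Tensor` API offers some form of gather or index-select plus a matrix multiply, the same structure should carry over without per-element loops on the device; whether it does is exactly the open question here.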