I’ve run into a rather weird issue and was wondering if anyone else has seen something like it. I am working on a kernel that performs a radix sort. The kernel follows published algorithms: it buckets keys on the last two bits of each unsigned integer and uses a prefix sum to work out where each key goes.
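For reference, here is a minimal CPU-side sketch of the kind of pass I mean, assuming 2-bit digits (4 buckets) processed from the least significant bits upward. It is not the kernel itself, and the names `radixPass2Bit` and `radixSort2Bit` are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stable counting pass over the 2-bit digit found at bit offset `shift`.
// Hypothetical helper for illustration; not taken from the actual kernel.
std::vector<std::uint32_t> radixPass2Bit(const std::vector<std::uint32_t>& in,
                                         unsigned shift) {
    // Histogram: how many keys fall into each of the 4 buckets.
    std::array<std::size_t, 4> count{};
    for (std::uint32_t v : in) {
        ++count[(v >> shift) & 3u];
    }

    // Exclusive prefix sum: first output index of each bucket.
    std::array<std::size_t, 4> offset{};
    std::size_t running = 0;
    for (std::size_t b = 0; b < 4; ++b) {
        offset[b] = running;
        running += count[b];
    }

    // Scatter in input order, which keeps the pass stable.
    std::vector<std::uint32_t> out(in.size());
    for (std::uint32_t v : in) {
        out[offset[(v >> shift) & 3u]++] = v;
    }
    return out;
}

// Full sort of 32-bit keys: 16 passes of 2 bits each.
std::vector<std::uint32_t> radixSort2Bit(std::vector<std::uint32_t> keys) {
    for (unsigned shift = 0; shift < 32; shift += 2) {
        keys = radixPass2Bit(keys, shift);
    }
    return keys;
}
```

Running the same input through something like this on the CPU gives a known-good result to compare against what the GPU writes out.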
I am testing the kernel with a single threadgroup of 256 threads. After the kernel executes, I read the values back on the CPU. If everything worked as expected, I’d have an ordered list of unsigned ints.
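The readback check can be sketched roughly as follows, assuming the input keys and the values read back from the GPU are both available as `std::vector<std::uint32_t>`. The function name `firstMismatch` is hypothetical; it reports the first index where the readback differs from a `std::sort` reference, which is how a failure starting at one particular thread would show up.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the first index at which `readback` differs from the sorted `input`.
// Returns readback.size() when the two agree everywhere (and have equal size).
// Hypothetical helper for illustration only.
std::size_t firstMismatch(const std::vector<std::uint32_t>& input,
                          const std::vector<std::uint32_t>& readback) {
    // Build the reference result on the CPU.
    std::vector<std::uint32_t> expected = input;
    std::sort(expected.begin(), expected.end());

    // Compare element by element over the common prefix.
    const std::size_t n = std::min(expected.size(), readback.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != readback[i]) {
            return i;
        }
    }
    // A size difference counts as a mismatch at the end of the shorter buffer.
    return (expected.size() == readback.size()) ? readback.size() : n;
}
```

With one output element per thread, the returned index would correspond directly to the first thread whose write went wrong.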
But starting at thread 53, the kernel writes garbage values to the output buffer in device memory. Here’s the weird part: in the GPU frame capture, the calculations are correct, and I can see the correct values in the debugger. If the values shown in the GPU frame capture were what got written to the device memory buffer, I’d have a correctly sorted list. If I write a primitive value to the buffer instead (like the thread id), the CPU reads back the correct number.
I have tried adding memory barriers to fix the issue, to no avail. Any ideas on what this symptom might mean?