Best data layout in threadgroup memory

Is threadgroup memory in Metal similar to shared memory in CUDA? Is there a similar bank-conflict issue in Metal? I’m trying to optimize memory access in threadgroup memory.

Thanks

I don’t know much about CUDA bank conflicts, but the memory hierarchy in Metal is similar: you have global (device/constant), shared (threadgroup), and local (thread) memory. You can read more in the MSL specification on pages 56 and 126. I hope this helps.
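
For reference, here is a minimal MSL sketch (the kernel and buffer names are made up for illustration) showing where those three levels appear in a compute kernel:

```cpp
// Metal Shading Language (C++-based). Illustrative only.
#include <metal_stdlib>
using namespace metal;

kernel void hierarchy_example(device const float *input   [[buffer(0)]],      // global (device) memory
                              device float       *output  [[buffer(1)]],      // global (device) memory
                              threadgroup float  *scratch [[threadgroup(0)]], // shared (threadgroup) memory
                              uint tid [[thread_position_in_threadgroup]],
                              uint gid [[thread_position_in_grid]])
{
    float local_value = input[gid];  // local (thread) memory, typically held in a register
    scratch[tid] = local_value;      // stage the value in threadgroup memory
    threadgroup_barrier(mem_flags::mem_threadgroup);
    output[gid] = scratch[tid];      // write back to device memory
}
```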

It’s about the data layout inside threadgroup memory, to reduce the load/store instructions needed to access elements of an array.

I see… did you find the doc useful?

No, the doc doesn’t say anything about that. But I asked an Apple GPU engineer and here’s the answer:

consecutive accesses should always be good. i.e. each thread k should access element at k. The accesses should be coalesced between neighboring threads.
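
To make that concrete, here’s a minimal sketch in MSL (the kernel name and the threadgroup size of 256 are assumptions) of “each thread k accesses element k” for a threadgroup array:

```cpp
#include <metal_stdlib>
using namespace metal;

kernel void coalesced_example(device const float *input  [[buffer(0)]],
                              device float       *output [[buffer(1)]],
                              uint tid [[thread_position_in_threadgroup]],
                              uint gid [[thread_position_in_grid]])
{
    // Assumes the kernel is dispatched with 256 threads per threadgroup.
    threadgroup float tile[256];

    // Coalesced: thread k touches element k, so neighboring threads
    // access neighboring elements of the threadgroup array.
    tile[tid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // A strided pattern such as tile[tid * 4] would scatter neighboring
    // threads across the array, which is what the advice warns against.
    output[gid] = tile[tid] * 2.0f;
}
```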

That’s spatial locality, a basic parallel-programming concept, but it doesn’t answer your question about the similarity to CUDA bank conflicts. I recommend reading any of the books out there on parallel programming; they aren’t hardware-specific, so the concepts you learn there apply across APIs.

The Apple GPU engineer didn’t mention CUDA bank conflicts; here’s the quote:

The answer is complicated for reasons I can’t go into, but a good rule of thumb is to have threads in a warp access consecutive elements, ideally 16B per element and no smaller than 4B.

I then asked him about 2B elements, and he said “consecutive accesses should always be good”.
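
Putting the two answers together, here’s a hedged sketch (the kernel name and the 256-thread threadgroup are assumptions) contrasting the 16B-per-element rule of thumb with the 2B case; in both, thread k accesses element k:

```cpp
#include <metal_stdlib>
using namespace metal;

kernel void element_width_example(device const float4 *wide_in   [[buffer(0)]],
                                  device const half   *narrow_in [[buffer(1)]],
                                  device float        *out       [[buffer(2)]],
                                  uint tid [[thread_position_in_threadgroup]],
                                  uint gid [[thread_position_in_grid]])
{
    // 16B per element (float4), the engineer's ideal case: thread k loads element k.
    threadgroup float4 wide_tile[256];
    wide_tile[tid] = wide_in[gid];

    // 2B per element (half), still consecutive: thread k loads element k,
    // which the engineer said "should always be good".
    threadgroup half narrow_tile[256];
    narrow_tile[tid] = narrow_in[gid];

    threadgroup_barrier(mem_flags::mem_threadgroup);

    out[gid] = dot(wide_tile[tid], float4(1.0f)) + float(narrow_tile[tid]);
}
```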