Inspired by this post, I realised that there is a lot of room for improvement in my MST implementation, particularly in the sorting part.
I am noting some experimental results in this post as a future reminder.
Essentially, I did a standalone performance timing on:
- directly sorting the CUDA builtin type `int3`;
- directly sorting a user-defined struct type containing `int x, y, z`, which is the same as `int3`;
- splitting the struct into two arrays (keys and values), first doing `sort_by_key` on the keys along with the indices, and then scattering the values using the indices.
The code:
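(The original snippet did not survive here; below is a minimal sketch of the first two variants, assuming Thrust and a lexicographic comparator — the original comparator may well have differed.)

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// User-defined struct, memory-layout-equivalent to the builtin int3.
struct MyInt3 { int x, y, z; };

// Lexicographic comparator for int3 (an assumption; the original
// may have compared on a single field only).
struct LessInt3 {
  __host__ __device__ bool operator()(const int3& a, const int3& b) const {
    if (a.x != b.x) return a.x < b.x;
    if (a.y != b.y) return a.y < b.y;
    return a.z < b.z;
  }
};

// Same ordering for the user-defined struct.
struct LessMyInt3 {
  __host__ __device__ bool operator()(const MyInt3& a, const MyInt3& b) const {
    if (a.x != b.x) return a.x < b.x;
    if (a.y != b.y) return a.y < b.y;
    return a.z < b.z;
  }
};

void sort_array_of_int3(thrust::device_vector<int3>& v) {
  thrust::sort(v.begin(), v.end(), LessInt3());
}

void sort_array_of_struct(thrust::device_vector<MyInt3>& v) {
  thrust::sort(v.begin(), v.end(), LessMyInt3());
}
```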
The result:
Sorting sort_array_of_int3 elapsed time: 111.8 ms
Sorting sort_array_of_struct elapsed time: 110.4 ms
Sorting sort_struct_of_array elapsed time: 33.3 ms
Obviously, it is much faster if we sort only the small data (the keys and indices), and then use `gather` to move the values.
This reduces the memory traffic during the sort (thereby improving performance), and the gather itself is very efficient as well.
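Concretely, the sort-small-then-gather pattern can be sketched like this (assuming a single `int` key and two value arrays; the array names are hypothetical):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>

// Sort only the (key, index) pairs, then move the heavier payload once.
void sort_struct_of_array(thrust::device_vector<int>& keys,
                          thrust::device_vector<int>& vals_y,
                          thrust::device_vector<int>& vals_z) {
  const int n = keys.size();
  thrust::device_vector<int> idx(n);
  thrust::sequence(idx.begin(), idx.end());  // 0, 1, ..., n-1

  // Sort the small (key, index) pairs only.
  thrust::sort_by_key(keys.begin(), keys.end(), idx.begin());

  // Gather each value array into sorted order in one pass.
  thrust::device_vector<int> tmp(n);
  thrust::gather(idx.begin(), idx.end(), vals_y.begin(), tmp.begin());
  vals_y.swap(tmp);
  thrust::gather(idx.begin(), idx.end(), vals_z.begin(), tmp.begin());
  vals_z.swap(tmp);
}
```

The sort now moves only 8 bytes per element (key plus index) instead of the whole 12-byte struct, and the gathers are simple streaming passes.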
Strangely, the performance of sorting `int3` and sorting the struct are on par here, whereas in my MST implementation I thought I observed `int3` performing better… I am not yet sure why.
Sometimes, a habit of using fewer steps dominates my mind when coding, mostly because in serial programmes fewer steps usually means better efficiency.
But in GPU programming, more steps does not necessarily mean worse performance, as long as the memory access pattern stays coalesced. That is what fascinates me about GPU programming (and I have a long way to go…).