(Replying to PARENT post)
We get really quite good results over a number of benchmarks - check out our papers!
(Replying to PARENT post)
The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires fine-tuning that is hardware-dependent down to the specific GPU model, except maybe the second-to-last point about which approach works best in general (a rough sketch of the 1-bit split primitive they mention follows the list):
Our key findings are the following:
• Effective parallel sorting algorithms must use the faster on-chip memory as much and as often as possible as a substitute for global memory operations.
• Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.
• Communication and synchronization should be done at points specified by the hardware.
• Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine-grained parallelism than others.
• A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to achieve the best results.
• Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.
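To make the first and fourth points concrete, here is a rough CUDA sketch (mine, not the paper's code) of a 1-bit "split": one radix-sort pass that partitions a tile of keys by a single bit, with both the scan and the scatter done entirely in on-chip shared memory, touching global memory only once on the way in and once on the way out. All names (split_one_bit, BLOCK, ...) are made up for the example.

    #include <cstdio>
    #include <cuda_runtime.h>

    constexpr int BLOCK = 256;   // one tile of keys, handled by one thread block

    __global__ void split_one_bit(unsigned int *keys, int bit) {
        __shared__ unsigned int key_s[BLOCK];   // tile of keys, staged on chip
        __shared__ int          scan_s[BLOCK];  // scan workspace, also on chip

        int tid = threadIdx.x;
        key_s[tid] = keys[blockIdx.x * BLOCK + tid];   // one coalesced global read
        __syncthreads();

        // flag = 1 if the selected bit is 0 (those keys go to the front of the tile)
        int flag = ((key_s[tid] >> bit) & 1u) ? 0 : 1;
        scan_s[tid] = flag;
        __syncthreads();

        // inclusive Hillis-Steele scan of the flags, never touching global memory
        for (int offset = 1; offset < BLOCK; offset <<= 1) {
            int v = (tid >= offset) ? scan_s[tid - offset] : 0;
            __syncthreads();
            scan_s[tid] += v;
            __syncthreads();
        }

        int total_zeros  = scan_s[BLOCK - 1];   // how many keys have bit == 0
        int zeros_before = scan_s[tid] - flag;  // exclusive scan value for this thread
        // stable destination: zeros keep their order at the front, ones follow behind
        int dest = flag ? zeros_before
                        : total_zeros + (tid - zeros_before);

        unsigned int k = key_s[tid];
        __syncthreads();
        key_s[dest] = k;                        // the 1-bit scatter, in shared memory
        __syncthreads();

        keys[blockIdx.x * BLOCK + tid] = key_s[tid];   // one coalesced global write
    }

    int main() {
        unsigned int h[BLOCK];
        for (int i = 0; i < BLOCK; ++i)
            h[i] = (1103515245u * i + 12345u) & 0xffffu;   // arbitrary 16-bit keys

        unsigned int *d;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

        // LSD radix sort of a single tile: one stable 1-bit split per pass
        for (int bit = 0; bit < 16; ++bit)
            split_one_bit<<<1, BLOCK>>>(d, bit);

        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("first=%u last=%u\n", h[0], h[BLOCK - 1]);
        cudaFree(d);
        return 0;
    }

In the paper's setting many such tiles run in parallel and a bucketization step combines them; this only shows the per-block building block, and the block size, scan implementation, and bits-per-pass are exactly the kind of knobs that end up being hardware-dependent.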
(Replying to PARENT post)
Finally they understood that the world moved on and better support for other languages had to be provided, so let's see how much OpenCL 2.2 and SPIR can improve the situation.
(Replying to PARENT post)
Now you can write OpenCL kernels that automatically tune themselves to run as fast as possible on different hardware, but that requires significant extra work beyond just getting them to work at all.
And finally, CUDA has a bunch of hand-tweaked libraries for doing common numerical operations (matrix multiply, FFT, ...) that are (partly) written in "NVIDIA GPU assembly" (PTX), so those operations will be faster on CUDA than on OpenCL.
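To make that concrete, using one of those libraries looks roughly like this (a minimal host-side sketch, not NVIDIA's code; cuBLAS picks a hand-tuned SGEMM kernel for whatever GPU it runs on, so you never write the multiply yourself):

    // build: nvcc sgemm_demo.cu -lcublas
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 512;   // C = A * B, all n x n, column-major (BLAS convention)
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(float));
        cudaMalloc(&dB, n * n * sizeof(float));
        cudaMalloc(&dC, n * n * sizeof(float));
        cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // the library dispatches to a kernel tuned for this GPU generation
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("C[0] = %f (expect %f)\n", hC[0], 2.0f * n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

With OpenCL you either write and tune that kernel yourself or rely on third-party libraries, which is a big part of the performance gap in practice.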
CUDA code is also (a bit) easier to write and use than OpenCL code, and the tooling is better, so that's another reason people often default to CUDA.