(Replying to PARENT post)

The problem with OpenCL isn't performance per se, but performance portability (well, it's only a problem for those who need such a thing, of course - many people don't). When you write OpenCL code and tweak it for one CPU or GPU, it might run at 1/10th the speed on another. This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.

Now you can write OpenCL kernels that automatically tweak themselves to run as fast as possible on different hardware, but that requires significant extra work over just getting it to work at all.

And finally, CUDA has a bunch of hand-tweaked libraries for doing common numerical operations (matrix multiply, FFT, ...) that are (partly) written in 'NVIDIA GPU assembly' (PTX), so those operations will be faster on CUDA than on OpenCL.

CUDA code is also (a bit) easier to write than OpenCL code and the tooling is better, so that's another reason people often default to CUDA.

๐Ÿ‘คroel_v๐Ÿ•‘8y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

The LIFT project (http://www.lift-project.org/) is specifically trying to solve the problem of performance portability. Our approach relies on a high-level model of computation (think of something like a functional, pattern-based programming language) coupled with a rewrite-based compiler that explores the space of OpenCL programs with which to implement a computation.

We get really quite good results across a number of benchmarks - check out our papers!

๐Ÿ‘ค14113๐Ÿ•‘8y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

> This is of course something you don't have with an API that works only on GPU's from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.

The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires fine-tuning that is hardware-dependent down to the specific model, except maybe the second-to-last point about which approach works best in general:

Our key findings are the following:

โ€ข Effective parallel sorting algorithms must use the faster access on-chip memory as much and as often as possible as a substitute to global memory operations.

โ€ข Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.

โ€ข Communication and synchronization should be done at points specified by the hardware.

โ€ข Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine grained parallelism than others.

โ€ข A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to be the combination that achieves the best results.

โ€ข Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.

๐Ÿ‘คvanderZwan๐Ÿ•‘8y๐Ÿ”ผ0๐Ÿ—จ๏ธ0

(Replying to PARENT post)

One big reason for developers favouring CUDA is that since the early days it supported C++, Fortran and any other language with a PTX backend, whereas Khronos wanted everyone to just shut up and use C99.

Finally they understood that the world has moved on and that better support for other languages had to be provided, so let's see how much OpenCL 2.2 and SPIR can improve the situation.

๐Ÿ‘คpjmlp๐Ÿ•‘8y๐Ÿ”ผ0๐Ÿ—จ๏ธ0