I don't have experience with OpenCL or compute shaders. My experience with CUDA taught me that the language design is tied closely to the hardware, and that to write efficient CUDA code you have to understand the hardware architecture well. It's hard for me to imagine a general GPU language that works on all hardware, unless GPU hardware becomes standardized the way x86 or ARM is.
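To illustrate the hardware coupling, here is a minimal sketch of my own (not from the parent comment): a CUDA warp-level reduction that bakes in NVIDIA's warp size of 32 and uses the NVIDIA-specific __shfl_down_sync intrinsic. On hardware with a different SIMD width (e.g. AMD's 64-wide wavefronts) the loop bounds and mask would simply be wrong.

    #include <cstdio>

    // Sum the 32 values held by one warp via register-to-register
    // shuffles. The starting offset of 16 assumes NVIDIA's warp
    // size of 32; a different SIMD width breaks this.
    __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;  // lane 0 ends up holding the warp's sum
    }

    __global__ void reduce(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;
        v = warpReduceSum(v);
        if (threadIdx.x % 32 == 0)  // one atomic per warp, not per thread
            atomicAdd(out, v);
    }

The whole point of writing it this way (shuffles instead of shared memory, one atomic per warp) is to exploit how the hardware schedules threads, which is exactly the kind of knowledge a portable GPU language would have to paper over.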
I am not an ML person by any means, but when I played with GPUs I found that the time it takes to get something done with CUDA is substantially lower than with OpenCL (nota bene: I have not yet tried Metal).
The barrier to entry also seems lower for CUDA, so this might be something the TensorFlow people consider important.
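As a rough illustration of that barrier-to-entry gap (my own sketch, not a benchmark): a complete, runnable CUDA vector add fits in about twenty lines, whereas the OpenCL equivalent also needs explicit platform/device discovery, context and queue creation, and runtime kernel compilation before the first launch.

    #include <cstdio>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        // Unified memory: no explicit host<->device copies needed.
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Kernel and host code live in one file and the kernel is compiled ahead of time by nvcc; in OpenCL the kernel typically ships as a string that the host program must build at runtime, which is a lot of ceremony before you see your first result.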