I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...
Presumably that stuff doesn't "just work" but they don't want to mention it?
Impressive if true. Unfortunately, it's not open source and is scarce on exact details of how it works.
Edit: not sure why I just sort of expect projects to be open source or at least source available these days.
The main cause of Nvidia's crazy valuation is AMD's unwillingness to invest in making its GPUs as useful as Nvidia's for ML.
Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.
I worked for Spectral Compute a few years ago. Very smart and capable technical team.
At the time, not only did they target AMD (with less compatibility than they have now), but they also outperformed the default LLVM PTX backend, and even NVCC, when compiling for Nvidia GPUs!
I don't understand how AMD has messed up so badly that I feel like celebrating a project like this. Hardware features of my laptop are physically there but not usable, particularly on Linux. So frustrating.
It’s great that there is a page about current limitations [1], but I am afraid that what most people describe as “CUDA” is a small subset of the real CUDA functionality. It would be great to have a comparison table for advanced features like warp shuffles, atomics, DPX, TMA, MMA, etc.: ideally one mapping every PTX instruction either to a direct RDNA counterpart or to the sequence of instructions used to emulate it.
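For a flavor of what would need mapping: a lot of hand-tuned CUDA embeds PTX directly, which a translator has to parse and re-express in AMD's ISA (or reject). A hypothetical helper showing the idiom, just for illustration:

    // Warp lane ID via inline PTX, a common idiom in hand-optimized kernels.
    // A CUDA-on-AMD toolchain must recognize the embedded PTX and lower it
    // to an RDNA equivalent, or refuse to compile it.
    __device__ unsigned lane_id() {
        unsigned id;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
        return id;
    }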
This sounds fabulous. I look forward to AMD being drawn kicking and screaming into direct competition with Nvidia.
Very impressive!
But I can't help but think: if something like this can be done to this extent, I wonder what went wrong, and why it has been such a struggle for OpenCL to unify the two fragmented communities. While this is very practical and has a significant impact for people who develop GPGPU/AI applications, for the heterogeneous computing community as a whole, relying on/promoting a proprietary interface/API/language to become THE interface for working with different GPUs sounds like bad news.
Can someone educate me on why OpenCL seems to be out of the picture in the comments/any of the recent discussions related to this topic?
Compiler isn't open source? That feels DOA in this day and age. There's already ZLUDA, which is open.
If they plan to open it up, it could be a useful addition to the options for breaking CUDA lock-in.
Would like to see benchmarks for the applications in the test suite.
E.g., how does Cycles compare on AMD vs Nvidia?
This is technically feasible so it might be the real thing. Parsing inline PTX and mapping it onto amdgpu would be a huge pain.
Working from CUDA source that doesn't use inline PTX to target amdgpu is roughly a regex find-and-replace to get HIP, which implements pretty much the same functionality (sketch at the end of this comment).
Some of the details would be dubious, e.g. the atomic models probably don't match, and Volta has a different instruction pointer model, but it could all be done correctly.
AMD won't do this. CUDA isn't a very nice thing in general and the legal team would have kittens. But other people totally could.
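A minimal sketch of how mechanical that rename is for PTX-free code (hypothetical kernel; assumes hipcc and the HIP runtime on the AMD side):

    #include <cuda_runtime.h>   // HIP: #include <hip/hip_runtime.h>

    // The kernel body itself is unchanged between CUDA and HIP.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    void run(float* host, int n) {
        float* dev;
        cudaMalloc(&dev, n * sizeof(float));              // HIP: hipMalloc
        cudaMemcpy(dev, host, n * sizeof(float),
                   cudaMemcpyHostToDevice);               // HIP: hipMemcpy, hipMemcpyHostToDevice
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);    // triple-chevron launch also works under hipcc
        cudaMemcpy(host, dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost);               // HIP: hipMemcpyDeviceToHost
        cudaFree(dev);                                    // HIP: hipFree
    }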
At my workplace, we were reluctant to choose between writing OpenCL (staying AMD-compatible, but missing out on CUDA features and tooling) and writing CUDA (and being vendor-locked).
Our jerry-rigged solution for now is writing kernels that are the same source for both OpenCL and CUDA, with a few macros doing a bit of adaptation (e.g. the syntax for constructing a struct). This requires no special library or complicated runtime work - but it does have the downside of forcing our code to be C'ish rather than C++'ish, which is quite annoying if you want to write anything that's templated.
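For illustration, a minimal sketch of that kind of macro adaptation (hypothetical names, not our actual headers, which are linked below):

    /* Hypothetical adapter: the same kernel source compiles as CUDA C++ or OpenCL C. */
    #if defined(__CUDACC__)
      #define KERNEL     extern "C" __global__
      #define GLOBAL_MEM /* CUDA pointers are global by default */
      #define GLOBAL_ID  (blockIdx.x * blockDim.x + threadIdx.x)
    #else /* OpenCL C */
      #define KERNEL     __kernel
      #define GLOBAL_MEM __global
      #define GLOBAL_ID  get_global_id(0)
    #endif

    /* One source, two toolchains: note the C'ish style, no templates. */
    KERNEL void saxpy(GLOBAL_MEM float* y, GLOBAL_MEM const float* x, float a, unsigned n)
    {
        unsigned i = GLOBAL_ID;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }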
Note that all of this regards device-side, not host-side, code. For the host-side, I would like, at some point, to take the modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-api-wrappers/) and derive from them something which supports CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the free time to do this on my own, so if anyone is interested in collaborating on something like that, please drop me a line.
-----
You can find the OpenCL-that-is-also-CUDA mechanism at:
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
and
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
(the files are provided alongside a tool for testing, profiling and debugging individual kernels outside of their respective applications.)
I'd love to see some benchmarks but this is something the market has been yearning for.
The real question here is whether anybody has gotten cheap, easily available AMD GPUs to run their AI workloads, and whether we can expect more people to do so.
This isn’t a solution for pros because it will always play catch up and Nvidia can always add things to make it difficult. This is like emulation.
I've written a bit of CUDA before. If I want to go pretty bare-bones, what's the equivalent setup for writing code for my AMD card?
OK, so I just stumbled on this problem: I tried out Whisper (from OpenAI), but on my CPU, because I have no CUDA and the workarounds seem hacky. So the headline sounds good!
But can this help me directly? Or would OpenAI have to use this tool for me to benefit?
It is not immediately clear to me (but I am a beginner in this space).
This targets CUDA C++, not CUDA the NVIDIA infrastructure for C, C++, Fortran, and anything else targeting PTX.
One question I always have about these sorts of translation layers is how they deal with the different warp sizes. I'd imagine a lot of CUDA code relies on 32-wide warps, while as far as I know AMD tends to have 64-wide warps. Is there some sort of emulation that needs to happen?
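This is the kind of pattern I have in mind: a typical warp reduction bakes the 32-lane assumption into both the loop bound and the full-warp mask (just an illustrative sketch, not taken from SCALE):

    // Classic CUDA warp-sum: assumes exactly 32 lanes per warp. On a 64-wide
    // AMD wavefront this needs either emulation of 32-wide warps or a rewrite
    // in terms of warpSize (plus a 64-bit lane mask for the *_sync intrinsics).
    __device__ float warp_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }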
The future is inference. Many inference stacks already support AMD although the kernels are less optimized. This will of course change over time, but if AMD can crack the inference demand, it will put NVDA under huge pressure.
If this actually works (remains to be seen), I can only say:
1) Kudos
2) Finally !
What's the licensing? Will I be able to run this as a hobbyist, for free software?
Really, really, really curious as to how they managed to pull this off, if their project works as well as they claim it does. If stuff as complex as paged/flash attention can "just work", this is really cool.
Wondering if there's an ongoing effort to do the same with MPS/Metal as a backend. If anything, given how many developers are on Macs, I think it could get immense traction.
Very clearly the business motive makes sense: go after Nvidia's GPU monopoly. Can someone help a layperson understand the pitfalls here that could prevent this from being an intelligent venture?
A major component of many CUDA programs these days involves NCCL and high bandwidth intra-node communication.
Does NCCL just work? If not, what would be involved in getting it to work?
> gfx1030, gfx1100, gfx1010, gfx1101, gfx900...
How do I find out which one I have?
Who is this Spectral Compute, and where can we see more about them?
Isn't this a bit legally dubious, like ZLUDA?
I wonder if this thing has anything in common with ZLUDA; it's permissively licensed, after all.
Can anyone explain why libcudnn is taking up several gigabytes of my hard drive?
> SCALE does not require the CUDA program or its build system to be modified.
how big of a deal is this?
Has anyone tried this and knows how well it works? It definitely sounds very compelling
Is Nvidia not likely to sue or otherwise bork this into non-existence?
Companies selling CUDA software should no doubt adopt this tool
If it's efficient, this is very good for competition.
Doesn't seem to mention CDNA?
Sounds really awesome. Any chance someone can suggest if this works also inside a Docker container?
Does it translate to OpenCL?
This sounds like DirectX vs OpenGL debate when I was younger lol
This is the way.
But the question is, can it also run SHUDA and WUDA?
A lot of people think AMD should support these translation layers but I think it's a bad idea. CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult both technically and legally. For example I think it would be against the license agreement of cuDNN or cuBLAS to run them on this. So those and other Nvidia libraries would become part of the API boundary that AMD would need to reimplement and support.
Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.