These NPUs are tying up a substantial amount of silicon area, so it would be a real shame if they end up not being used for much. I can't find a die analysis of the Snapdragon X which isolates the NPU specifically, but AMD's equivalent with the same ~50 TOPS performance target can be seen here, and it takes up about as much area as three high-performance CPU cores:
https://www.techpowerup.com/325035/amd-strix-point-silicon-p...
I thought the purpose of these things was not to be fast, but to be able to run small models with very little power usage? I have a newer AMD laptop with an NPU, and my power usage doesn't change when using the video effects that supposedly run on it, but it goes up when using the NVIDIA Studio effects.
It seems like the NPUs are for very optimized models that do small tasks, like eye contact, background blur, autocorrect models, transcription, and OCR. In particular, on Windows, I assumed they were running the full-screen OCR (and maybe embeddings for search) for the Recall feature.
Deploying a model on an NPU requires significant profile-based optimization. Picking up a model that works fine on the CPU but hasn't been optimized for an NPU usually leads to disappointing results.
The write-up on the GitHub repo is much more informative than the blog.
When running an int8 matmul through ONNX, performance is ~0.6 TOPS.
> We've tried to avoid that by making both the input matrices more square, so that tiling and reuse should be possible.
While it might be possible, it would not surprise me if a number of possible optimizations never made it into ONNX. It appears that Qualcomm does not give direct access to the NPU, and users are expected to use frameworks to convert models over to it. In my experience, conversion tools generally suck and leave a lot of optimizations on the table. It could be less that NPUs suck and more that the conversion tools suck. I'll wait until I get direct access; I don't trust conversion tools.
My view of NPUs is that they're great for tiny ML models and very fast function approximations, which is my intended use case. While LLMs are the new hotness, there are a huge number of specialized tasks that small models are really useful for.
The RTX 4080 should be capable of ~40 TFLOPS, yet they only report 2,160 billion operations per second. Shouldn't that alone be enough to reconsider the benchmark? They probably made some serious error in measuring FLOPS. The CPU beating the NPU is possible, but they should benchmark many matrix multiplications without any application synchronization in order to have a decent comparison.
The benchmark is a matrix multiplication with the shapes `(6, 1500, 256) X (6, 256, 1500)`, which just aren't that big in the AI world. I think the gap would be larger with much larger matrices.
E.g. Llama 3.1 8B, which is one of the smaller models, has matrix multiplications like `(batch, 14336, 4096) x (batch, 4096, 14336)`.
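As a rough illustration (my own back-of-the-envelope sketch, not from the article), the benchmark shape is orders of magnitude less work per call than a single Llama-class projection:

```python
# Rough multiply-accumulate counts for the two matmul shapes discussed above.
# Pure arithmetic, no framework involved.
def matmul_macs(batch, m, k, n):
    return batch * m * k * n

bench = matmul_macs(6, 1500, 256, 1500)     # benchmark: (6,1500,256) x (6,256,1500)
llama = matmul_macs(1, 14336, 4096, 14336)  # one Llama-3.1-8B-sized projection

print(f"benchmark : {bench / 1e9:.1f} GMACs per call")   # ~3.5 GMACs
print(f"llama-like: {llama / 1e9:.1f} GMACs per call")   # ~841.8 GMACs
print(f"ratio     : {llama / bench:.0f}x")                # ~244x
```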
I just don't think this benchmark is realistic enough.
We ran qprof (a Qualcomm NPU profiler) on this benchmark. The profiling results indicate that the workload was distributed to the vector cores instead of the tensor core, which provides the vast majority of the compute power in the NPU (my back-of-the-napkin math suggests that HMX is 30x stronger than HVX).
The workload is relatively small, which results in underutilization of the hardware capacity due to the overhead of input/output quantization-dequantization and NCHW-to-NHWC mapping. Padding the weights and inputs to a multiple of 64 would also help performance.
Edit: Link to the profiling graph https://imgur.com/a/2OKR93e
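For what it's worth, the padding suggestion above is easy to try from the caller's side. A minimal numpy sketch (the multiple of 64 is taken from the comment above; whether the runtime actually exploits the padded layout is an assumption):

```python
import numpy as np

def pad_to_multiple(x, multiple=64):
    """Zero-pad the last two dims of a (batch, M, K) tensor up to a multiple."""
    b, m, k = x.shape
    m_pad = (-m) % multiple
    k_pad = (-k) % multiple
    return np.pad(x, ((0, 0), (0, m_pad), (0, k_pad)))

a = np.zeros((6, 1500, 256), dtype=np.int8)
a_padded = pad_to_multiple(a)   # (6, 1536, 256); 256 is already a multiple of 64
print(a_padded.shape)
```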
Estimated HVX compute capability: 4 * 2 * 1.43 * 1024 / 8 = 1.46 TOPS in int8, in which
4 is the number of vector cores,
2 is the number of operations per cycle,
1.43 GHz is the HVX frequency,
1024 bit is the vector register width,
8 bit is the precision.
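The same estimate as a one-liner, for anyone who wants to check the arithmetic (these are the numbers from the comment above, not an official figure):

```python
# HVX int8 throughput estimate from the figures above.
cores, ops_per_cycle, freq_ghz, vec_bits, precision_bits = 4, 2, 1.43, 1024, 8
tops = cores * ops_per_cycle * freq_ghz * (vec_bits / precision_bits) / 1000
print(f"{tops:.2f} TOPS (int8)")   # ~1.46
```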
Actual article title: Benchmarking Qualcomm's NPU on the Microsoft Surface Tablet
Because this isn't about NPUs. It's about a specific NPU, on a specific benchmark, with a specific set of libraries and frameworks. So basically, this proves nothing.
I always thought that the main point of NPUs is energy efficiency (and being able to run ML models without taking over all computer resources, making it practical to integrate ML applications into the OS itself in ways that don't disturb the user or the workflow) rather than being exceptionally fast. At least this has been my experience with running Stable Diffusion on Macs. It's similar to using other specialised hardware like media encoders: they are not necessarily faster than a CPU if you throw a dozen-plus CPU cores at the task, but they will draw a minuscule fraction of the power.
OK, I am one of the developers on the onnxruntime team. I previously worked on the ROCm EP and have now been transferred to the QNN EP. The following is purely a dev rant and the opinions are mine.
So ROCm already sucks, whereas QNN sucks even harder!
The conclusion here is that NVIDIA knows how to make software that just works. AMD makes software that might work. Qualcomm, however, has no clue whatsoever how to make useful software.
The dev experience with Qualcomm is just another level of disaster. Their tools and APIs return absolutely zero useful information about what error you are getting, just an error code that you can grep for in the include headers of their SDK. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to the original computing logic, with a very high stddev on the runtime. Their docs website is not indexed by the MF search engine, let alone LLMs, so if you have any questions, good luck!
So if you don't have a reason to use QNN, just don't use it (or any other NPU, for that matter).
Back to the benchmark script. There are a lot of flaws as far as I can see.
1. The session is not warmed up and the iteration count is too small.
2. The ONNX graph is too small; I suspect the onnxruntime overhead cannot be ignored in this case. Try stacking more gemms in the graph instead of naively increasing the iteration count.
3. The "htp_performance_mode": "sustained_high_performance" setting might give lower perf compared to "burst" mode (a rough sketch of points 1 and 3 follows below).
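A minimal sketch of the warm-up and burst-mode points with the onnxruntime QNN EP. The model path and input names are placeholders of mine; "htp_performance_mode" is quoted from the comment above, and "backend_path" is the standard QNN EP option, but treat the exact option set as something to verify against the QNN EP docs:

```python
# Hedged sketch: warm up the session and use "burst" mode, per the points above.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "matmul.onnx",                       # hypothetical path to the benchmark graph
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",
        "htp_performance_mode": "burst",
    }],
)

feeds = {                                # input names are assumptions
    "A": np.random.rand(6, 1500, 256).astype(np.float32),
    "B": np.random.rand(6, 256, 1500).astype(np.float32),
}

for _ in range(10):                      # warm-up, excluded from timing
    sess.run(None, feeds)

iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    sess.run(None, feeds)
print(f"avg {1e3 * (time.perf_counter() - t0) / iters:.2f} ms/iter")
```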
A more reliable way to benchmark might be to just dump the context binary[1] and the context inputs[2], then run them with qnn-net-run to get rid of the onnxruntime overhead.
[1]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn... [2]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn...
I haven't played much with the Qualcomm NPU, but the Apple Neural Engine available in iOS and macOS was significantly faster than the CPU or GPU for many computer vision models (e.g. MediaPipe models, YOLO, Depth Anything), to the point that inference on a MacBook M2 Max was much faster on its NPU (which is the same as in older iPhones) than when executing on all 38 GPU cores.
This all depends on model architecture, conversion, and tuning. Apple provides good tooling in Xcode for benchmarking models down to the execution time of single operators and where each operator was executed (CPU, GPU, NPU), in case it couldn't run on the NPU and had to fall back to CPU/GPU. Sometimes a model has to be tweaked to use a slightly different operator if one isn't available on the NPU. On top of that, ML frameworks/runtimes such as ONNX/PyTorch/TensorFlow Lite sometimes don't implement all operators in Core ML or MPS.
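For context, the conversion side of this is roughly as follows. A hedged coremltools sketch, with a stand-in model and shapes of my choosing; whether a given op actually lands on the ANE still has to be checked in Xcode's performance report:

```python
# Hedged sketch: convert a traced PyTorch model to Core ML and restrict
# execution to CPU + Neural Engine.
import torch
import coremltools as ct

model = torch.nn.Conv2d(3, 16, 3).eval()       # stand-in for a real CV model
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # prefer the ANE, fall back to CPU
)
mlmodel.save("model.mlpackage")
```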
The author's benchmark sucks if he could only get 2 TOPS from a laptop 4080. The thing should be doing somewhere around 80 TOPS.
Given that you should take his NPU results with a truckload of salt.
> We see 1.3% of Qualcomm's NPU 45 Teraops/s claim
To me that suggests that the test is wrong.
I could see Intel massaging results, but being that far off seems incredibly improbable.
This headline is seriously misleading because the author did not test AMD or Intel NPUs. If Qualcomm's is slow, don't say all AI PCs are no good.
One should also pay attention to power efficiency; a direct comparison could be misleading here.
Snapdragon touts 45 TOPS, but that's int8.
Apple's M3 Neural Engine, for example, is a mere 18 TOPS, but that's FP16.
So Windows has the bigger number, but it's not an apples-to-apples comparison.
Did the author test int8 performance?
NPUs are efficient, not especially fast. The CPU is much bigger than the NPU and has better cache access. Of course it'll perform better.
I might be overly cynical, but I just assumed that the entire purpose of "AI PCs" was marketing; of course they don't actually achieve much. Any real hardware that's supposedly for the "AI" features will actually just be special-purpose hardware for literally anything the sales department can lump under that category.
What exactly does Windows do with an NPU? I don't own an 'AI PC', but it seems like the NPUs are slow and can't run much.
I know Apple's Neural Engine is used to power Face ID and the facial recognition stuff in Photos, among other things.
In general MAC unit utilization tends to be low for transformers, but 1.3% seems pretty bad. I wonder if they fucked up the memory interface for the NPU. All the MACs in the world are useless if you cannot feed them.
IMO, benchmarking accelerator hardware with onnxruntime is like benchmarking a CPU with a Python script.
> We've seen similar performance results to those shown here using the Qualcomm QNN SDK directly.
Why not include those results?
> The second conclusion is that the measured performance of 573 billion operations per second is only 1.3% of the 45 trillion ops/s that the marketing material promises.
It just gets so hard to take this industry seriously.
Is it possible to use the Qualcomm SNPE SDK? I thought that SDK wasn't bad. Also, for those who have access to the Qualcomm NPU: is the Hexagon SDK working properly? Do apps still need to be signed (which I never got to work) when using Hexagon?
Fairly misleading title, boiling down AI PCs to just the Microsoft Surface running Qualcomm
A memory-bound workload is memory bound. It doesn't matter how many TOPS you have if you're sitting idle waiting on DRAM during generation. You will, however, notice a difference in prefill for long prompts.
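A quick back-of-the-envelope illustration of that point (my own illustrative numbers, not from the article): at batch size 1, every generated token has to stream the whole weight set from DRAM, so bandwidth rather than TOPS sets the ceiling.

```python
# Rough decode-speed ceiling for a weight-streaming-bound LLM at batch size 1.
# All numbers are assumptions for illustration, not measurements.
params = 8e9              # 8B-parameter model
bytes_per_param = 1       # int8 weights
dram_bytes_per_s = 100e9  # ~100 GB/s of usable memory bandwidth (assumed)

bytes_per_token = params * bytes_per_param
tokens_per_s = dram_bytes_per_s / bytes_per_token
print(f"~{tokens_per_s:.1f} tokens/s ceiling, regardless of available TOPS")  # ~12.5
```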
They should have just made a PCIe card and not tried to push whole new machines on us. We are all good with the machines we already have. If you want to sell a new feature, it needs to be an add-on.
The Arm SME (Scalable Matrix Extension) could be an interesting alternative to NPUs in the future. Unlike NPUs, which at best expose some fixed-function API, it will be possible to program SME more directly.
What are all these folks hoping to accomplish? By crying and starting shit about Windows Recall, all you did was signal to their shareholders and the financial analysts that Windows Recall actually has substance and is not just a marketing facade. Otherwise, why would all those nerds be so angry?
So Microsoft takes some of the criticism on Twitter and gets the fixes in before shipping. Free appsec, nice.
Now, Microsoft doesn't care about your benchmarks, dude. Grandma isn't gonna notice these workloads finishing faster in a different program compiled to use different chips. Her last PC was EOL'd 10 years ago; it certainly can't keep up with this new AI laptop.
Are NPUs the VLIW of our times in terms of hype?
You don't seriously think MSFT expects this shit to benefit consumers, do you? Their datacenters are overheating and the billing meter is still ticking while they burn; they need to figure out how to get consumers to start paying for this shit before they go broke and Wall St sells them off for parts.
Either way, these are some of the first personal computers to have NPUs. They will improve. CPUs have had 20 years of optimization; this is literally the first try for some of these companies.
I laughed when I saw that the Qualcomm “AI PC” is described as this in the ComfyUI docs:
"Avoid", "Nothing works", "Worthless for any AI use"
> the 45 trillion operations per second that’s listed in the specs
Such a spec should ideally be accompanied by code demonstrating or approximating the claimed performance. I can't imagine a sports car advertising a 0-100 km/h spec of 2.0 seconds where a user is unable to get below 5 seconds.
Sutherland's wheel of reincarnation turns.
I think the results show that, in general, the compute just isn't used well. That the CPU took 8.4 ms and the GPU 3.2 ms shows a very small gap; I'd expect more like a 10x-20x difference here. I'd assume that onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that will change.
Also, people often mistake the reason for an NPU as "speed". That's not correct. The whole point of the NPU is to focus on low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC around the CPU, there to offload AI computations. It would be interesting to run this benchmark in an infinite loop on the three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to be lowest and also best in terms of "ops/watt".