PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
base: master
Conversation
…omplex/redundant pointer operation
A simple tech doc: mapping the ggml compute graph to the QNN compute graph, written with breakthrough help from chiwwang@QTI in April 2024.
I have found that there are different technical paths to utilize the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
The first approach (offload individual ops to QNN through the ggml backend scheduler). pros: this approach benefits greatly from the excellent "backend scheduler" feature of the ggml backend subsystem and can be a functional implementation, or a good starting point, in the upstream llama.cpp community. Accordingly, it can be verified easily with my self-made script build-run-android.sh. A minimal sketch of the per-op gate this approach relies on follows after the second approach below. cons: there might be performance concerns in the ggml-qnn backend.
The second approach (map an entire ggml compute graph to a single QNN compute graph). pros: this approach might be equivalent to the principle shown in the code quoted above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl. cons: it cannot take advantage of the backend scheduler feature, and the workload is very large. There are many undocumented (or not very clear) technical details in the QNN SDK, so I think the necessary technical support would have to be provided by Qualcomm's tech team, even if I reach the final goal via the first approach with help from the great llama.cpp community.
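To make the first approach concrete, here is a minimal sketch (in the spirit of other ggml backends) of the per-op capability check that the backend scheduler relies on: the backend reports which ops it can run, and the scheduler offloads only those nodes, leaving everything else on the CPU backend. The function name and the exact set of supported ops/types are illustrative assumptions, not the code in this PR.

```cpp
#include "ggml.h"

// hypothetical per-op capability check for a QNN backend (first approach);
// the backend scheduler queries a hook like this for every node in the
// ggml compute graph and offloads only the ops that return true
static bool ggmlqnn_can_handle_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL_MAT:
            // illustrative restriction: only offload FP32 tensors whose
            // layout the QNN op validation is known to accept
            return op->src[0]->type == GGML_TYPE_F32 &&
                   op->src[1]->type == GGML_TYPE_F32;
        default:
            return false; // everything else falls back to the CPU backend
    }
}
```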
Corrections from domain technical experts are greatly welcomed and appreciated.
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory grows hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker, nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
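To illustrate the build-finalize-execute lifecycle discussed above, here is a small self-contained C++ sketch. The types and functions are hypothetical stand-ins for Qnn_GraphHandle_t and the QNN interface-table entry points (graph create / add node / finalize / execute); the real QNN SDK signatures differ. The point is only the shape of the lifecycle: finalize once per graph, execute many times, and free graphs only when the whole session is torn down.

```cpp
#include <string>

// hypothetical stand-in for a QNN graph handle
struct qnn_graph {
    std::string name;
    bool        finalized = false;
};

static qnn_graph * graph_create(const std::string & name) {
    return new qnn_graph{name, false};
}

static void graph_add_node(qnn_graph * /*g*/, const std::string & /*op*/) {
    // translate one ggml node into a QNN op node (elided)
}

static bool graph_finalize(qnn_graph * g) {
    // the expensive "compilation" step: slow and memory-hungry,
    // so it should happen exactly once per graph
    g->finalized = true;
    return true;
}

static bool graph_execute(const qnn_graph * g) {
    return g->finalized; // cheap, repeatable inference call
}

int main() {
    qnn_graph * g = graph_create("ggml-subgraph-0");
    graph_add_node(g, "MatMul");      // one QNN node per ggml op
    if (!graph_finalize(g)) return 1; // finalize once...

    for (int step = 0; step < 4; ++step) {
        graph_execute(g);             // ...then execute many times
    }

    // QNN offers no cheap per-inference graph free, so the handle lives
    // for the whole session and is released only at shutdown
    delete g;
    return 0;
}
```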
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
This is a good question, and your concern is valid:
[updated on 02/26/2025] My previous answer might be wrong, because the first technical approach can work very well (quantized mulmat wasn't implemented yet when I wrote the simple tech doc): there are 7x-10x performance improvements in my local dev envs with the QNN backend. BTW, you can refer to my personal understanding of ggml-qnn and the other ggml backends in that simple tech doc: pros: this approach might be equivalent to the principle shown in the code quoted above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.
@chraac, today (02/27/2025) I confirmed that your PR couldn't do real LLM inference on a Snapdragon based device.
Thanks.
Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior. We value your technical contributions and encourage you to continue participating in discussions, but please focus on the technical aspects and the code itself. If you receive feedback on your code, try not to take it personally, state your point of view and move on. If you feel personally attacked or treated unfairly, please reach out to the maintainers privately, and we will do our best to address the situation. Engaging in personal conflict in public comments is not productive for anyone. Additionally, please keep in mind that while this is an open source project that welcomes contributions, it does not mean that contributions are guaranteed to be accepted. You may disagree with the maintainers' decisions, but refusing to make the changes requested by the maintainers will likely result in your contributions being rejected. If you disagree with these decisions, you always have the option to fork the project and maintain your own version.
@slaren, thanks for your valuable and very helpful suggestions and guidance. I admit that I made the same stupid mistake because of an intentional challenge from the same CN programmer in my third PR, and I will try to adjust my mindset and behavior accordingly. At the same time, I think everyone in this community can now see what happened in my first, second, and third PRs, especially in this third PR:
Yes, your opinion is definitely correct, I see. I came to this great tech community to learn real hard-core AI tech and to try to make some contributions. I understand my PR might not be accepted, and that is a normal, acceptable thing, but I really don't want to be involved in meaningless conflicts with others, especially some CN programmers from mainland China.
…for benchmark more conveniently
Hi everyone,
Thanks for your comment; your question is a really good one. I strongly agree with your point of view: there is no right or wrong from a pure tech perspective. I can see that chraac really put a lot of effort into the ggml-qnn backend in his own way, based on my initial PR, and made good progress on the Windows port and 4D mulmat. Unfortunately, I personally think:
At the same time, I think:
To avoid misunderstanding: I never thought or claimed that his/his team's effort on the ggml-qnn backend is duplicated effort, because I'm an open-minded programmer and I strongly agree that this is also the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

One more important thing: there is a big problem in the second technical approach, which I already explained in the simple tech doc; this is the key reason why his PR is not a functional PR and is marked as [WIP][draft]. Furthermore, I personally guess that the second technical approach, which I discovered and mentioned in this community in 04/2024, may not be implementable without the technical help of very skilled QNN experts and AI experts, or help from Qualcomm, because Qualcomm already provides some other similar dedicated tech stacks in QNN: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/tutorials.html#qnn-genaitransformer-backend-workflow

To further understand what I mentioned here, please refer to the following 20000+ LoC source file from my study/research of the related topic in 04/2024. To avoid misunderstanding from chraac or his team: the above guess may be incorrect, and I sincerely wish his team success.

Finally, as I described in this PR's description, I personally think a fully open-source implementation of ggml-qnn through the first tech approach would be teamwork between experienced programmers (Android system software programmers, Windows system software programmers, QNN experts) and AI experts, and that this should be a P0 team task. I hope my understanding is correct, but corrections from experts are greatly appreciated.

@null-define, thanks for your good question; it helped me brainstorm and dive deeper into the related technical issues and combine them into a clearer understanding, although that understanding might be incorrect.
* [ ] Low
* [x] Medium
* [ ] High
PR Description
This PR is a continuation of my original PR #6869.
Thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature), this implementation puts the main logic in a single source file (ggml-qnn.cpp) and all op functions in ggml-qnn-ops.cpp, because this will help other experienced programmers get involved in dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, and what Qualcomm did in ggml-opencl.cpp.

Another main reason for this coding style is that I think it will make developers' workflows easier:
This is a concise implementation focused on the final mission: how to utilize the Hexagon NPU maximally. It has no complex or redundant C++ encapsulation and works very effectively on a real Snapdragon based phone. I hope every domain programmer can understand the code and the domain technical details easily and quickly, so that it can be considered an open-source reference implementation of ggml-qnn and can/might also be used in production projects. I think this PR will be helpful to the llama.cpp community.
At the moment:
BTW, after spending so much effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend requires teamwork between experienced software programmers and AI experts, and even professional technical help from Qualcomm. In other words, I personally think NO single independent programmer or independent development team can provide a full implementation of a ggml-qnn backend, because this community has probably never seen a programmer who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, one more important thing, familiar with the source code of ggml/llama.cpp, even without being an AI expert.
Big picture of ggml-qnn backend
Please refer to the simple tech doc below: mapping ggml compute graph to QNN compute graph.
The first technical approach can be seen in this PR. Accordingly, the second technical approach can be built on top of this PR with a similar coding style.
What I did in my first PR and this PR
All of the above items can be found in project KanTV unless otherwise noted; project KanTV is an on-device AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the remaining parts of ggml-qnn are teamwork between AI experts and experienced programmers. We can work together to achieve the final goal: provide a production-level open-source ggml-qnn backend to the llama.cpp community.
Key reasons for personally disagreeing with complex C++ encapsulation
The highly-well-designed QNN SDK already manages its internal hardware and software resources very carefully, so we can use simple STL containers to manage QNN resources in this PR; another complicated C++ wrapper around the QNN SDK in the user layer might not be mandatory. A minimal sketch of this idea follows below.
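As a concrete illustration of the point above, a plain STL map is enough to own the finalized QNN graph handles for a session; no dedicated wrapper-class hierarchy is required. The identifiers below (ggmlqnn_graph_cache, ggmlqnn_get_graph) are hypothetical, not names from this PR.

```cpp
#include <string>
#include <unordered_map>

// stand-in for Qnn_GraphHandle_t from the QNN SDK
using qnn_graph_handle = void *;

// one flat map owns every finalized graph for the session (hypothetical name)
static std::unordered_map<std::string, qnn_graph_handle> ggmlqnn_graph_cache;

// look up a finalized graph by a key derived from the ggml graph shape,
// building and finalizing it on first use (hypothetical helper)
static qnn_graph_handle ggmlqnn_get_graph(const std::string & key) {
    auto it = ggmlqnn_graph_cache.find(key);
    if (it != ggmlqnn_graph_cache.end()) {
        return it->second;        // reuse: no re-finalize, no extra memory
    }
    qnn_graph_handle h = nullptr; // build + finalize via the QNN SDK (elided)
    ggmlqnn_graph_cache.emplace(key, h);
    return h;
}
```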
This implementation of ggml-qnn is mainly ported/reverse-engineered from executorch (the QNN backend implementation in executorch comes from Qualcomm). I hope every domain programmer or expert can understand the code and the domain technical details easily and quickly. Another implementation of ggml-qnn through complex C++ encapsulation would be equivalent to another executorch. Accordingly, I personally think my effort on this might also be useful for Qualcomm's QNN SDK users, although complex C++ encapsulation does have some advantages.
The core ideas and technical difficulties would be exactly the same as in this implementation even with complicated and cool C++ encapsulation. Accordingly, we can see that almost all of the core ideas come from my first PR.
As I said before, the second technical approach can be built on top of this PR with a similar coding style, without complicated C++ encapsulation. In fact, it is already implemented in my local dev envs, but it is not a functional PR, for the reason I explained in the simple tech doc.

I fully understand that we will not agree on everything; all of the above points of view are my personal perspective.
Performance of ggml-qnn backend
Performance is the key point I always care deeply about on Android and other embedded systems. Here are the results, from my local dev envs, showing how I addressed this performance issue in the early phase of the ggml-qnn backend (it still lacks many ops, although it is a really functional backend):
Before fine-tuning:
Fig-1: llama-bench with the QNN NPU backend (Hexagon NPU) on a Snapdragon 8 Gen 3 based phone

After fine-tuning:

Fig-2: llama-bench with the QNN NPU backend (Hexagon NPU) on a Snapdragon 8 Gen 3 based phone; some mulmat operations have been offloaded to the NPU (the test results in the following screenshots depend on OS load and on adjusting various parameters; this is the best result I have seen)
Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on a Snapdragon 8 Gen 3 based phone (AI experts can explain why there is such a big difference in the second data point between Fig-2 and Fig-3; this would be helpful for performance fine-tuning in this PR)

Fig-4 (updated 02-28-2025, 12:09): LLM inference with the QNN CPU backend on a Snapdragon 8 Gen 3 based phone. Inference speed is very fast, depending on OS load and parameter adjustments in my local dev envs; this PR might not reproduce this benchmark result because there is a hard-forked candidate PR using C++ encapsulation from a CN programmer in this community. As an independent programmer (which means I shouldn't have any potential conflict of interest with any programmers in CN), I'd like to contribute all related techniques/experience if this PR can be approved. This is not a condition; I understand this PR will probably be rejected just to avoid some unpleasant intentional behaviors from some CN programmers, and I hope this will gain some corresponding understanding.
LLM-inference-with-ggmlqnn-backend-on-snapdragon8-gen3-phone.mp4
How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions might also work). The dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev env is simple:
We can see that this backend works as expected from the log output of "adb logcat | grep ggml-qnn"; programmers can also use "adb logcat | grep ggml-qnn" to help with troubleshooting.
How to build ggml-qnn source code for a Snapdragon based WoA (Windows on ARM) device
The source code of ggml/llama.cpp is portable, and Qualcomm's highly-well-designed QNN SDK is also portable. Thanks to the very simple reference code (without complex C++ encapsulation) that Qualcomm provides in the QNN SDK for handling dlopen, dlclose, dlsym, and dlerror across Windows & Linux/Android, I have ported it to this PR accordingly. A sketch of that kind of portability shim is shown below.
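For illustration, below is a minimal sketch of the kind of dynamic-loading shim described above; the wrapper names (dl_open, dl_sym, dl_close) are illustrative, and the QNN library names in the comment are examples from the QNN SDK layout rather than something this PR mandates.

```cpp
// minimal cross-platform dynamic-loading shim (illustrative wrapper names);
// e.g. dl_open("libQnnCpu.so") on Android/Linux, dl_open("QnnCpu.dll") on Windows
#if defined(_WIN32)
#include <windows.h>
using dl_handle = HMODULE;
static dl_handle dl_open(const char * path)            { return LoadLibraryA(path); }
static void *    dl_sym(dl_handle h, const char * sym) { return (void *) GetProcAddress(h, sym); }
static void      dl_close(dl_handle h)                 { FreeLibrary(h); }
#else
#include <dlfcn.h>
using dl_handle = void *;
static dl_handle dl_open(const char * path)            { return dlopen(path, RTLD_NOW | RTLD_LOCAL); }
static void *    dl_sym(dl_handle h, const char * sym) { return dlsym(h, sym); }
static void      dl_close(dl_handle h)                 { dlclose(h); }
#endif
```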
Unfortunately, I have no knowledge of Windows programming and no WoA device for validation, so I currently don't know how to build the ggml-qnn source code for a Snapdragon WoA device.
How to build ggml-qnn source code for a Snapdragon based Linux device
Generally speaking, Android can be considered a special Linux distribution, so building the ggml-qnn source code for a Snapdragon based Linux device should be similar to building it for a Snapdragon based phone, but I have no Snapdragon based Linux device for validation.
Acknowledgement
@slaren, sorry to bother you, and thanks for your outstanding "backend scheduler" feature; this ggml-qnn backend is now a functional QNN backend, and the performance of LLM inference on Snapdragon 8 Gen 3 is probably improved (7x-10x). It passes test-backend-ops and llama-bench, and it does real LLM inference with the ggml-qnn backend on a Snapdragon based phone via llama-cli or a standard Android APP. You can check these facts if you have time, although I understand that you are very busy. Accordingly, I think we can discuss this PR formally so that other community developers or AI experts can get involved in the further dev activities. Thanks!
We can see that Qualcomm's top-talent engineers have provided a very important and highly difficult PR to this great community; it would be greatly appreciated if Qualcomm's engineers or experts could help with code review, regardless of the final result.