Comparison with Qualcomm AI Hub model #8194
-
🐛 Describe the bug

I ran Llama-v3.2-3B-Chat (precision: w4a16) from ai-hub-models on a Snapdragon 8 Gen 3 device, achieving 20 tokens/s.
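For anyone reproducing this comparison, here is a minimal sketch of how decode throughput (tokens/s) can be measured. `generate_fn` is a hypothetical stand-in for whatever runner you use (the AI Hub bundle or an ExecuTorch runner); it is not part of either library's API:

```python
import time


def measure_decode_tps(generate_fn, prompt: str, max_new_tokens: int = 128) -> float:
    """Return decode throughput in tokens/s.

    generate_fn is a hypothetical callable taking (prompt, max_new_tokens)
    and returning the list of generated token ids; swap in the actual API
    of your runner when reproducing the numbers above.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


# For reference: 20 tokens/s corresponds to ~50 ms per decoded token.
```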
-
Good question. I found the same issue with Llama 3.1 8B via the ExecuTorch QNN backend, whose performance is less than half of what Qualcomm's QAIHub claims.
-
cc: @cccclai
-
Yeah, we found out the model definition in `llama_transformer.py` isn't ideal for running the Llama model on the QNN backend. We've started a new model definition in /~https://github.com/pytorch/executorch/tree/e66cdaf514e15242692073db1271aae4657f2033/examples/qualcomm/oss_scripts/llama3_2 which has better latency numbers. It's still WIP, so please expect some friction if trying it out, or wait until it's more settled.
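For anyone who wants to try the WIP definition at the exact commit linked above, a minimal sketch to fetch it (only the commit hash and path come from this thread; check the script's own README for run flags before executing anything):

```python
import subprocess

# Pin to the commit referenced above so the WIP script layout matches the link.
COMMIT = "e66cdaf514e15242692073db1271aae4657f2033"

subprocess.run(
    ["git", "clone", "/~https://github.com/pytorch/executorch.git"], check=True
)
subprocess.run(["git", "checkout", COMMIT], cwd="executorch", check=True)

# The WIP model definition lives under:
#   executorch/examples/qualcomm/oss_scripts/llama3_2
```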
-
Converting to discussion