Comparison with Qualcomm AI Hub model #8194
-
🐛 Describe the bug

I ran Llama-v3.2-3B-Chat (precision: w4a16) from ai-hub-models on a Snapdragon 8 Gen 3 device, achieving 20 tokens/s.
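For anyone reproducing this comparison, here is a minimal sketch of how decode throughput (tokens/s) can be measured. `generate_fn` is a hypothetical stand-in for whatever runner you use (the AI Hub bundle or an ExecuTorch runner); it is not part of either library's API:

```python
import time


def measure_decode_tps(generate_fn, prompt: str, max_new_tokens: int = 128) -> float:
    """Return decode throughput in tokens/s.

    generate_fn is a hypothetical callable taking (prompt, max_new_tokens)
    and returning the list of generated token ids; swap in the actual API
    of your runner when reproducing the numbers above.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


# For reference: 20 tokens/s corresponds to ~50 ms per decoded token.
```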
-
Good question. I found the same issue with Llama 3.1 8B via the ExecuTorch QNN backend, whose performance is less than half of what Qualcomm's QAIHub claims.
-
cc: @cccclai
-
Yeah, we found out the model definition in `llama_transformer.py` isn't ideal for running the Llama model on the QNN backend. We've started a new model definition in /~https://github.com/pytorch/executorch/tree/e66cdaf514e15242692073db1271aae4657f2033/examples/qualcomm/oss_scripts/llama3_2 which has better latency numbers. It's still WIP, so please expect some friction if trying it out, or wait until it's more settled.
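For anyone who wants to try the WIP definition at the exact commit linked above, a minimal sketch to fetch it (only the commit hash and path come from this thread; check the script's own README for run flags before executing anything):

```python
import subprocess

# Pin to the commit referenced above so the WIP script layout matches the link.
COMMIT = "e66cdaf514e15242692073db1271aae4657f2033"

subprocess.run(
    ["git", "clone", "/~https://github.com/pytorch/executorch.git"], check=True
)
subprocess.run(["git", "checkout", COMMIT], cwd="executorch", check=True)

# The WIP model definition lives under:
#   executorch/examples/qualcomm/oss_scripts/llama3_2
```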
-
Converting to discussion