PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
base: master
Conversation
…omplex/redundant pointer operation
A simple tech doc: mapping the ggml compute graph to the QNN compute graph, written with breakthrough help from chiwwang@QTI in April 2024.
I have found that there are different technical paths to utilize the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
The first approach (offload individual ops to QNN through the ggml backend scheduler). pros: this approach benefits greatly from the excellent "backend scheduler" feature of the ggml backend subsystem and can be a functional implementation, or a good starting point, in the upstream llama.cpp community. Accordingly, it can be verified easily with my self-made script build-run-android.sh. A minimal sketch of the per-op gate this approach relies on follows after the second approach below. cons: there might be performance concerns in the ggml-qnn backend.
The second approach (map an entire ggml compute graph to a single QNN compute graph). pros: this approach might be equivalent to the principle shown in the code quoted above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl. cons: it cannot take advantage of the backend scheduler feature, and the workload is very large. There are many undocumented (or not very clear) technical details in the QNN SDK, so I think the necessary technical support would have to be provided by Qualcomm's tech team, even if I reach the final goal via the first approach with help from the great llama.cpp community.
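To make the first approach concrete, here is a minimal sketch (in the spirit of other ggml backends) of the per-op capability check that the backend scheduler relies on: the backend reports which ops it can run, and the scheduler offloads only those nodes, leaving everything else on the CPU backend. The function name and the exact set of supported ops/types are illustrative assumptions, not the code in this PR.

```cpp
#include "ggml.h"

// hypothetical per-op capability check for a QNN backend (first approach);
// the backend scheduler queries a hook like this for every node in the
// ggml compute graph and offloads only the ops that return true
static bool ggmlqnn_can_handle_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL_MAT:
            // illustrative restriction: only offload FP32 tensors whose
            // layout the QNN op validation is known to accept
            return op->src[0]->type == GGML_TYPE_F32 &&
                   op->src[1]->type == GGML_TYPE_F32;
        default:
            return false; // everything else falls back to the CPU backend
    }
}
```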
Corrections from domain technical experts are greatly welcomed and appreciated.
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory grows hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker, nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
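To illustrate the build-finalize-execute lifecycle discussed above, here is a small self-contained C++ sketch. The types and functions are hypothetical stand-ins for Qnn_GraphHandle_t and the QNN interface-table entry points (graph create / add node / finalize / execute); the real QNN SDK signatures differ. The point is only the shape of the lifecycle: finalize once per graph, execute many times, and free graphs only when the whole session is torn down.

```cpp
#include <string>

// hypothetical stand-in for a QNN graph handle
struct qnn_graph {
    std::string name;
    bool        finalized = false;
};

static qnn_graph * graph_create(const std::string & name) {
    return new qnn_graph{name, false};
}

static void graph_add_node(qnn_graph * /*g*/, const std::string & /*op*/) {
    // translate one ggml node into a QNN op node (elided)
}

static bool graph_finalize(qnn_graph * g) {
    // the expensive "compilation" step: slow and memory-hungry,
    // so it should happen exactly once per graph
    g->finalized = true;
    return true;
}

static bool graph_execute(const qnn_graph * g) {
    return g->finalized; // cheap, repeatable inference call
}

int main() {
    qnn_graph * g = graph_create("ggml-subgraph-0");
    graph_add_node(g, "MatMul");      // one QNN node per ggml op
    if (!graph_finalize(g)) return 1; // finalize once...

    for (int step = 0; step < 4; ++step) {
        graph_execute(g);             // ...then execute many times
    }

    // QNN offers no cheap per-inference graph free, so the handle lives
    // for the whole session and is released only at shutdown
    delete g;
    return 0;
}
```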
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
This is a good question, and your concern is valid:
[updated on 02/26/2025] My previous answer might be wrong, because the first technical approach can work very well (quantized mulmat wasn't implemented yet when I wrote the simple tech doc): there are 7x-10x performance improvements in my local dev envs with the QNN backend. BTW, you can refer to my personal understanding of ggml-qnn and the other ggml backends in that simple tech doc: pros: this approach might be equivalent to the principle shown in the code quoted above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.
@chraac, today (02/27/2025) I confirmed that your PR couldn't do real LLM inference on a Snapdragon based device.
Thanks.
Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior. We value your technical contributions and encourage you to continue participating in discussions, but please focus on the technical aspects and the code itself. If you receive feedback on your code, try not to take it personally, state your point of view and move on. If you feel personally attacked or treated unfairly, please reach out to the maintainers privately, and we will do our best to address the situation. Engaging in personal conflict in public comments is not productive for anyone. Additionally, please keep in mind that while this is an open source project that welcomes contributions, it does not mean that contributions are guaranteed to be accepted. You may disagree with the maintainers' decisions, but refusing to make the changes requested by the maintainers will likely result in your contributions being rejected. If you disagree with these decisions, you always have the option to fork the project and maintain your own version.
@slaren, thanks for your valuable and very helpful suggestions and guidance. I admit that I made the same stupid mistake because of an intentional challenge from the same CN programmer in my third PR, and I will try to adjust my mindset and behavior accordingly. At the same time, I think everyone in this community can now see what happened in my first, second, and third PRs, especially in this third PR:
Yes, your opinion is definitely correct, I see. I came to this great tech community to learn real hard-core AI tech and to try to make some contributions. I understand my PR might not be accepted, and that is a normal, acceptable thing, but I really don't want to be involved in meaningless conflicts with others, especially some CN programmers from mainland China.
…for benchmark more conveniently
Hi everyone,
Thanks for your comment; your question is a really good one. I strongly agree with your point of view: there is no right or wrong from a pure tech perspective. I can see that chraac really put a lot of effort into the ggml-qnn backend in his own way, based on my initial PR, and made good progress on the Windows port and 4D mulmat. Unfortunately, I personally think:
At the same time, I think:
To avoid misunderstanding: I never thought or claimed that his/his team's effort on the ggml-qnn backend is duplicated effort, because I'm an open-minded programmer and I strongly agree that this is also the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

One more important thing: there is a big problem in the second technical approach, which I already explained in the simple tech doc; this is the key reason why his PR is not a functional PR and is marked as [WIP][draft]. Furthermore, I personally guess that the second technical approach, which I discovered and mentioned in this community in 04/2024, may not be implementable without the technical help of very skilled QNN experts and AI experts, or help from Qualcomm, because Qualcomm already provides some other similar dedicated tech stacks in QNN: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/tutorials.html#qnn-genaitransformer-backend-workflow

To further understand what I mentioned here, please refer to the following 20000+ LoC source file from my study/research of the related topic in 04/2024. To avoid misunderstanding from chraac or his team: the above guess may be incorrect, and I sincerely wish his team success.

Finally, as I described in this PR's description, I personally think a fully open-source implementation of ggml-qnn through the first tech approach would be teamwork between experienced programmers (Android system software programmers, Windows system software programmers, QNN experts) and AI experts, and that this should be a P0 team task. I hope my understanding is correct, but corrections from experts are greatly appreciated.

@null-define, thanks for your good question; it helped me brainstorm and dive deeper into the related technical issues and combine them into a clearer understanding, although that understanding might be incorrect.
* [ ] Low
* [x] Medium
* [ ] High
PR Description
This PR is a continuation of my original PR #6869.
Thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature), this implementation puts the main logic in a single source file (ggml-qnn.cpp) and all op functions in ggml-qnn-ops.cpp, because this will help other experienced programmers get involved in dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, and what Qualcomm did in ggml-opencl.cpp.

Another main reason for this coding style is that I think it will make developers' workflows easier:
This is a concise implementation focused on the final mission: how to utilize the Hexagon NPU maximally. It has no complex or redundant C++ encapsulation and works very effectively on a real Snapdragon based phone. I hope every domain programmer can understand the code and the domain technical details easily and quickly, so that it can be considered an open-source reference implementation of ggml-qnn and can/might also be used in production projects. I think this PR will be helpful to the llama.cpp community.
At the moment:
BTW, after spending so much effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend requires teamwork between experienced software programmers and AI experts, and even professional technical help from Qualcomm. In other words, I personally think NO single independent programmer or independent development team can provide a full implementation of a ggml-qnn backend, because this community has probably never seen a programmer who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, one more important thing, familiar with the source code of ggml/llama.cpp, even without being an AI expert.
Big picture of ggml-qnn backend
Please refer to the simple tech doc below: mapping ggml compute graph to QNN compute graph.
The first technical approach can be seen in this PR. Accordingly, the second technical approach can be built on top of this PR with a similar coding style.
What I did in my first PR and this PR
All of the above items can be found in project KanTV unless otherwise noted; project KanTV is an on-device AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the remaining parts of ggml-qnn are teamwork between AI experts and experienced programmers. We can work together to achieve the final goal: provide a production-level open-source ggml-qnn backend to the llama.cpp community.
Key reasons for personally disagreeing with complex C++ encapsulation
The highly-well-designed QNN SDK already manages its internal hardware and software resources very carefully, so we can use simple STL containers to manage QNN resources in this PR; another complicated C++ wrapper around the QNN SDK in the user layer might not be mandatory. A minimal sketch of this idea follows below.
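As a concrete illustration of the point above, a plain STL map is enough to own the finalized QNN graph handles for a session; no dedicated wrapper-class hierarchy is required. The identifiers below (ggmlqnn_graph_cache, ggmlqnn_get_graph) are hypothetical, not names from this PR.

```cpp
#include <string>
#include <unordered_map>

// stand-in for Qnn_GraphHandle_t from the QNN SDK
using qnn_graph_handle = void *;

// one flat map owns every finalized graph for the session (hypothetical name)
static std::unordered_map<std::string, qnn_graph_handle> ggmlqnn_graph_cache;

// look up a finalized graph by a key derived from the ggml graph shape,
// building and finalizing it on first use (hypothetical helper)
static qnn_graph_handle ggmlqnn_get_graph(const std::string & key) {
    auto it = ggmlqnn_graph_cache.find(key);
    if (it != ggmlqnn_graph_cache.end()) {
        return it->second;        // reuse: no re-finalize, no extra memory
    }
    qnn_graph_handle h = nullptr; // build + finalize via the QNN SDK (elided)
    ggmlqnn_graph_cache.emplace(key, h);
    return h;
}
```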
This implementation of ggml-qnn is mainly ported/reverse-engineered from executorch (the QNN backend implementation in executorch comes from Qualcomm). I hope every domain programmer or expert can understand the code and the domain technical details easily and quickly. Another implementation of ggml-qnn through complex C++ encapsulation would be equivalent to another executorch. Accordingly, I personally think my effort on this might also be useful for Qualcomm's QNN SDK users, although complex C++ encapsulation does have some advantages.
The core ideas and technical difficulties would be exactly the same as in this implementation even with complicated and cool C++ encapsulation. Accordingly, we can see that almost all of the core ideas come from my first PR.
As I said before, the second technical approach can be built on top of this PR with a similar coding style, without complicated C++ encapsulation. In fact, it is already implemented in my local dev envs, but it is not a functional PR, for the reason I explained in the simple tech doc.

I fully understand that we will not agree on everything; all of the above points of view are my personal perspective.
Performance of ggml-qnn backend
Performance is the key point I always care deeply about on Android and other embedded systems. Here are the results, from my local dev envs, showing how I addressed this performance issue in the early phase of the ggml-qnn backend (it still lacks many ops, although it is a really functional backend):
Before fine-tuning:
Fig-1: llama-bench with the QNN NPU backend (Hexagon NPU) on a Snapdragon 8 Gen 3 based phone

After fine-tuning:

Fig-2: llama-bench with the QNN NPU backend (Hexagon NPU) on a Snapdragon 8 Gen 3 based phone; some mulmat operations have been offloaded to the NPU (the test results in the following screenshots depend on OS load and on adjusting various parameters; this is the best result I have seen)
Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on a Snapdragon 8 Gen 3 based phone (AI experts can explain why there is such a big difference in the second data point between Fig-2 and Fig-3; this would be helpful for performance fine-tuning in this PR)

Fig-4 (updated 02-28-2025, 12:09): LLM inference with the QNN CPU backend on a Snapdragon 8 Gen 3 based phone. Inference speed is very fast, depending on OS load and parameter adjustments in my local dev envs; this PR might not reproduce this benchmark result because there is a hard-forked candidate PR using C++ encapsulation from a CN programmer in this community. As an independent programmer (which means I shouldn't have any potential conflict of interest with any programmers in CN), I'd like to contribute all related techniques/experience if this PR can be approved. This is not a condition; I understand this PR will probably be rejected just to avoid some unpleasant intentional behaviors from some CN programmers, and I hope this will gain some corresponding understanding.
LLM-inference-with-ggmlqnn-backend-on-snapdragon8-gen3-phone.mp4
How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions might also work). The dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev env is simple:
We can see that this backend works as expected from the log output of "adb logcat | grep ggml-qnn"; programmers can also use "adb logcat | grep ggml-qnn" to help with troubleshooting.
How to build ggml-qnn source code for a Snapdragon based WoA (Windows on ARM) device
The source code of ggml/llama.cpp is portable, and Qualcomm's highly-well-designed QNN SDK is also portable. Thanks to the very simple reference code (without complex C++ encapsulation) that Qualcomm provides in the QNN SDK for handling dlopen, dlclose, dlsym, and dlerror across Windows & Linux/Android, I have ported it to this PR accordingly. A sketch of that kind of portability shim is shown below.
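For illustration, below is a minimal sketch of the kind of dynamic-loading shim described above; the wrapper names (dl_open, dl_sym, dl_close) are illustrative, and the QNN library names in the comment are examples from the QNN SDK layout rather than something this PR mandates.

```cpp
// minimal cross-platform dynamic-loading shim (illustrative wrapper names);
// e.g. dl_open("libQnnCpu.so") on Android/Linux, dl_open("QnnCpu.dll") on Windows
#if defined(_WIN32)
#include <windows.h>
using dl_handle = HMODULE;
static dl_handle dl_open(const char * path)            { return LoadLibraryA(path); }
static void *    dl_sym(dl_handle h, const char * sym) { return (void *) GetProcAddress(h, sym); }
static void      dl_close(dl_handle h)                 { FreeLibrary(h); }
#else
#include <dlfcn.h>
using dl_handle = void *;
static dl_handle dl_open(const char * path)            { return dlopen(path, RTLD_NOW | RTLD_LOCAL); }
static void *    dl_sym(dl_handle h, const char * sym) { return dlsym(h, sym); }
static void      dl_close(dl_handle h)                 { dlclose(h); }
#endif
```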
Unfortunately, I have no knowledge of Windows programming and no WoA device for validation, so I currently don't know how to build the ggml-qnn source code for a Snapdragon WoA device.
How to build ggml-qnn source code for a Snapdragon based Linux device
Generally speaking, Android can be considered a special Linux distribution, so building the ggml-qnn source code for a Snapdragon based Linux device should be similar to building it for a Snapdragon based phone, but I have no Snapdragon based Linux device for validation.
Acknowledgement
@slaren, sorry to bother you, and thanks for your outstanding "backend scheduler" feature; this ggml-qnn backend is now a functional QNN backend, and the performance of LLM inference on Snapdragon 8 Gen 3 is probably improved (7x-10x). It passes test-backend-ops and llama-bench, and it does real LLM inference with the ggml-qnn backend on a Snapdragon based phone via llama-cli or a standard Android APP. You can check these facts if you have time, although I understand that you are very busy. Accordingly, I think we can discuss this PR formally so that other community developers or AI experts can get involved in the further dev activities. Thanks!
We can see that Qualcomm's top-talent engineers have provided a very important and highly difficult PR to this great community; it would be greatly appreciated if Qualcomm's engineers or experts could help with code review, regardless of the final result.