[User] AMD GPU slower than CPU #3422
Comments
I don't know the solution, but if you want to use llama.cpp with your GPU in the meantime, you might want to try it with CLBlast instead of ROCm. It should give you a significant speedup compared to CPU-only: not as good as ROCm should give, but it should get you close.
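For reference, a minimal sketch of what that could look like, using the Makefile flag that also appears later in this thread (the model path and the -ngl value are placeholders, not taken from this issue):

# build with the CLBlast backend instead of ROCm/hipBLAS
make clean && make LLAMA_CLBLAST=1

# then offload layers to the GPU at run time with -ngl
./main -m ./models/your-model.gguf -p "Hello" -n 64 -ngl 20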
This issue is missing info. Please share the commands used to build llama.cpp, the output of rocminfo, and the full output of llama.cpp.
I know this is off-topic, but I wanted to compare the strongest Intel CPU (13900K) against the strongest AMD CPU (7950X3D) running the model on CPU only, with the same model size and quantization.
Performance was almost the same, slightly faster on the 7950X3D for answers, but for some reason prompt processing was almost 10x faster...
Same issue on RX 560, granted it's an older card.
I updated the question with all the details (should be more than enough). In the meantime, I tested an RTX 4700 Ti... it is probably 10x faster than the RX 7900 XTX:
4700 Ti: 56.23 tokens per second
7900 XTX: 5.62 tokens per second
The RX 560 may be slower in part because it's using the fallback code path. Could you try this commit and see if the RX 560 is any faster? Engininja2@23510ef
As for the RX 7900 XTX, I can't think of anything. The PR for RDNA mul_mat_q tunings has someone reporting solid speeds for that GPU: #2910 (comment)
Wow, I think you just fixed it for my RX 560 card @Engininja2, thank you so much! I will do some more testing and let you know how it goes.
@oliverhu if you didn't compile the binary yourself (or compiled the binary on a machine with a different card), try doing that. By default, you will only get support for the current card at compile time, unless you specify the GPU targets explicitly.
that's why I keep the AMDGPU targets in my makefile set so that it builds the most common targets and also includes the architecture of the current machine.
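A rough sketch of what an explicit-targets build can look like. The variable names (GPU_TARGETS for the Makefile, AMDGPU_TARGETS for CMake) and the gfx1100 value for the RX 7900 XTX are assumptions; check the Makefile/CMakeLists of your llama.cpp checkout for the exact spelling:

# Makefile build: request the HIP offload target for the card you actually have
# (gfx1100 is assumed here for RDNA3 / RX 7900 XTX)
make clean && make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1100

# CMake build: same idea; also set a Release build type to avoid an unoptimized binary
cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release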
I am not sure, but it may be because of the instruction set extensions (Intel SSE4.1, SSE4.2, AVX2) in your processor; as the README suggests, it supports the vector extensions AVX2 and AVX-512.
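If it helps, a quick way to see which of those vector extensions your CPU actually exposes on Linux (plain standard tools, nothing llama.cpp-specific):

# list the SSE/AVX-related flags reported by the kernel
grep -o -w -E 'sse4_1|sse4_2|avx|avx2|avx512[a-z]*' /proc/cpuinfo | sort -u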
Thanks, it's impressive to see so many community responses btw! I did compile locally, so I assume it uses the right arch. It doesn't make sense for the RTX 4700 Ti to be 10x faster than the Radeon 7900 XTX anyway, regardless of the CPU discussion...
Yeah, the 7900 XTX runs pretty fast, even faster than the scores I reported back then. I did initially notice very low scores, but that was because I was compiling with a debug build (which is the default). You gotta make sure to use -DCMAKE_BUILD_TYPE=Release when building.
I just saw the build command used here -
It's probably fixed by now. An error popped up when building HIP:
How to fix: sudo apt-get install libstdc++-13-dev
Test string:
Test results:
make clean && make LLAMA_OPENBLAS=1
make clean && make LLAMA_CLBLAST=1
make clean && make LLAMA_HIPBLAS=1
Not sure if this adds anything, but I noticed on my RX 570 that prompt ingestion was terribly slow, slower than with CLBlast or OpenBLAS, while actual inference would still be fast. Turns out it was mmq.
Without mmq:
With mmq:
12 ms vs 292 ms per token to process a prompt.
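For anyone comparing the same thing: llama.cpp of that era exposed mul_mat_q as a runtime toggle. The exact flag spelling below is an assumption from memory (check ./main --help for your build), and the model path and -ngl value are placeholders; this only illustrates how the two timings above could be reproduced:

# run with the custom mul_mat_q kernels (the default at the time)
./main -m ./models/your-model.gguf -p "test prompt" -n 64 -ngl 33

# same run, but disable mul_mat_q and fall back to the BLAS path
./main -m ./models/your-model.gguf -p "test prompt" -n 64 -ngl 33 -nommq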
Thanks for all the replies. None worked for me. Observations:
Probably hipBLAS not working? It's pretty easy to break on Linux. To restore it, you need to redo the driver installation, but do not forget the flags that install the HIP libraries. That's exactly what happened to me: I didn't understand why Stable Diffusion was so slow until I reinstalled HIP and ROCm. Just make sure that ROCm and HIP work.
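A few standard ROCm commands that work as a quick sanity check (these tools ship with ROCm; exact output varies by version, so treat this only as a rough checklist):

# should list your GPU agent with its gfx architecture (e.g. gfx1100 for a 7900 XTX)
rocminfo | grep -i gfx

# should show the card, temperature, VRAM use, etc.; if it errors out, the driver side is broken
rocm-smi

# prints the HIP platform/compiler configuration that llama.cpp would build against
hipconfig --full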
On my computer, using an RX 9700 XT I'm getting a decent 82 tokens/s, which is less than I get using mlc-llm (I usually get 90 to 105 tokens/second).
I'm using Ubuntu 22.04 with the Mesa GPU driver! The amdgpu driver had some issues and I switched back to the Mesa one. There is one problem if I'm not setting the -ngl parameter. Here is the output difference between the run without the ngl param and with ngl:
Ah, let me try mesa gpu driver as well...seems to be driver issues :( |
I have a similar issue with a CLBlast build on Windows, on my RX 5700 XT. Offloading layers to GPU causes a very significant slowdown, even compared to my slow CPU.
.\main.exe -m .\models\tinyllama\tinyllama-1.1b-chat-v0.3.Q2_K.gguf -c 4096 -p "This is a test prompt." -e -n 128 -ctk q4_0 -s 0 -t 4 -ngl 20
...
llama_print_timings: load time = 3765.29 ms
llama_print_timings: sample time = 36.10 ms / 128 runs ( 0.28 ms per token, 3545.90 tokens per second)
llama_print_timings: prompt eval time = 460.28 ms / 7 tokens ( 65.75 ms per token, 15.21 tokens per second)
llama_print_timings: eval time = 20259.19 ms / 127 runs ( 159.52 ms per token, 6.27 tokens per second)
llama_print_timings: total time = 20801.06 ms

.\main.exe -m .\models\tinyllama\tinyllama-1.1b-chat-v0.3.Q2_K.gguf -c 4096 -p "This is a test prompt." -e -n 128 -ctk q4_0 -s 0 -t 4 -ngl 0
...
llama_print_timings: load time = 298.46 ms
llama_print_timings: sample time = 42.78 ms / 128 runs ( 0.33 ms per token, 2991.77 tokens per second)
llama_print_timings: prompt eval time = 394.26 ms / 7 tokens ( 56.32 ms per token, 17.75 tokens per second)
llama_print_timings: eval time = 9202.12 ms / 127 runs ( 72.46 ms per token, 13.80 tokens per second)
llama_print_timings:
AMD, in their new release of ROCm 6, will also do a lot of optimization around FP8 in hipBLASLt.
hipBLASLt's GitHub says it requires an AMD Instinct MI200-MI300 accelerator GPU.
Too bad. Although it's AMD, so maybe I'll just try and build it on my XTX and see if it works anyway; wouldn't be the first time... What could still be interesting is WMMA. I don't know if you can already take advantage of it using hipBLAS, but if not, this would probably really accelerate stuff on the RDNA3 cards, as they come with that built deep into the architecture: https://gpuopen.com/learn/wmma_on_rdna3/ (also supported by CUDA, so maybe you can accelerate both in one go). Maybe in general, I'm not too knowledgeable there, but might it be useful to use something higher level like MIOpen and let AMD do the optimization?
Definitely share your results, I also have a 7900 XTX!
When it comes to compiling models for specific architectures, you can always look at mlc: https://llm.mlc.ai/docs/index.html. I believe they're using an Apache compiler/framework to compile a model into extremely optimized code. Not sure if it's using MIGraphX or MIOpen or ... under the hood. Inference performance using mlc is basically the best you can get on consumer hardware as far as my knowledge goes. Better than llama.cpp and better than exllama. Not sure if it's fit for enterprise applications and servers, but most of us aren't doing that anyway.
Well, I'm using TabbyML, which is currently bound to using llama.cpp, so that's why I'm stuck with that (which also means I work with different models than just Llama 2). Maybe it's time to change that...
Nope, doesn't work, it requires
@oliverhu I guess you have an Intel CPU computer :) But a cmake-based build creates the correct binary and works well.
Yeah, I have an Intel CPU with an AMD GPU, haven't had a chance to dig deeper.
@arch-btw I'm also using an RX 560 and the speed is the same as the CPU.
Just a general FYI, if you have a pretty modern CPU, that's probably even expected behavior, as it's a 7-year-old tiny 14nm GPU. Plus I don't think ROCm supports it anymore, so it might not even be using the GPU. I think Vega is the oldest it supports.
I was referring to an i7-7700HQ, so not really expected.
Testing with a Radeon Instinct MI25 is actually quite slow.
build: cfc4d75 (2564)
Well, you tested with OpenCL, so that's kinda expected. You want to use hipBLAS.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Coming from the ollama repo, maybe
Compiling without GGML_HIP_UMA=1 makes it fast for me.
What GPU are you using?
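For anyone hitting the same thing: GGML_HIP_UMA is the build option mentioned above that makes the HIP backend use unified (host) memory, which is mainly intended for APUs and can be much slower on discrete cards. A rough sketch of the two builds being compared, assuming a CMake setup of that era (exact option names depend on the llama.cpp version, so check your CMakeLists):

# build with UMA enabled (mainly meant for APUs; can be slow on discrete GPUs)
cmake -B build-uma -DGGML_HIPBLAS=ON -DGGML_HIP_UMA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-uma

# build without UMA, letting the card use its own VRAM
cmake -B build-no-uma -DGGML_HIPBLAS=ON -DGGML_HIP_UMA=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-no-uma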
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
GPU inference should be faster than CPU.
Current Behavior
I have a 13900K CPU & 7900 XTX 24G hardware. I built llama.cpp using hipBLAS and it builds. However, I noticed that when I offload all layers to the GPU, it is noticeably slower than running on the CPU.
GPU
CPU
Environment and Context
CPU: i9-13900KF
OS: Linux pia 6.2.0-33-generic #33~22.04.1-Ubuntu
GPU: 7900XTX
Python: 3.10
g++: 11.4.0
Make: 4.3
Build command
rocminfo
Additional comparison between Nvidia RTX 4700 Ti and RX 7900 XTX
I further tested the RTX 4700 Ti... it is probably 10x faster than the RX 7900 XTX.
Nvidia GPU (4700 Ti)
4700 Ti: 56.23 tokens per second
7900 XTX: 5.62 tokens per second