
[User] AMD GPU slower than CPU #3422

Closed
oliverhu opened this issue Oct 1, 2023 · 39 comments
Labels
AMD GPU (Issues specific to AMD GPUs), performance (Speed related topics), stale

Comments

@oliverhu

oliverhu commented Oct 1, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

GPU inference should be faster than CPU.

Current Behavior

I have a 13900K CPU and a 7900 XTX 24 GB GPU. I built llama.cpp with hipBLAS and the build succeeds. However, I noticed that when I offload all layers to the GPU, it is noticeably slower than running on the CPU.

GPU

./main -m ../model/llama-2-13b-chat/ggml-model-q4.gguf -n 128 -ngl 50
----
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1696212406
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from ../model/llama-2-13b-chat/ggml-model-q4.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  5120, 32000,     1,     1 ]
...
llama_model_loader: - tensor  361:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  362:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =   88.01 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6936.01 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MB
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size = 80.88 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 7411.01 MB (model: 6936.01 MB, context: 475.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


pgfplotstablecolumntypes

In addition to the built-in types provided by `pgfplots`, you can also use your own custom column types. Here are some examples of how to define and use custom column types:

1. `boolean` type:
\documentclass{article}
\usepackage{pgfplotstable}
\begin{document}
\pgfplotstabletypeset[
    columns/my_column/type={boolean},
    data=mydata,
    every head row/.style={before row={\hline}}
]{%
    my_column & other_column
llama_print_timings:        load time =  6432.57 ms
llama_print_timings:      sample time =    32.92 ms /   128 runs   (    0.26 ms per token,  3888.10 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 22756.97 ms /   128 runs   (  177.79 ms per token,     5.62 tokens per second)
llama_print_timings:       total time = 22857.59 ms
Log end

CPU

./main -m ../model/llama-2-13b-chat/ggml-model-q4.gguf -n 128

----
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1696212490
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from ../model/llama-2-13b-chat/ggml-model-q4.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  5120, 32000,     1,     1 ]
...
llama_model_loader: - tensor  361:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  362:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  = 7024.01 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size = 80.88 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 75.00 MB (model: 0.00 MB, context: 75.00 MB)

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


tikz\draw[fill=blue!50] (0,0) rectangle (1.5,1.5);
\tikz\draw[fill=red!50] (1.5,0) rectangle (3,1.5);
\tikz\draw[fill=green!50] (3,0) rectangle (4.5,1.5);
\end{tikzpicture}

In this example, the rectangles are drawn with different colors: blue, red and green.

You can also use other shapes like circles, triangles, etc. by changing the
llama_print_timings:        load time =   363.76 ms
llama_print_timings:      sample time =    36.15 ms /   128 runs   (    0.28 ms per token,  3541.29 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 19588.62 ms /   128 runs   (  153.04 ms per token,     6.53 tokens per second)
llama_print_timings:       total time = 19695.27 ms
Log end

Environment and Context

CPU: i9-13900KF
OS: Linux pia 6.2.0-33-generic #33~22.04.1-Ubuntu
GPU: 7900XTX
Python: 3.10
g++: 11.4.0
Make: 4.3

Build command

make LLAMA_HIPBLAS=1

rocminfo


❯ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    13th Gen Intel(R) Core(TM) i9-13900KF
  Uuid:                    CPU-XX
  Marketing Name:          13th Gen Intel(R) Core(TM) i9-13900KF
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   5500
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    98692092(0x5e1ebfc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    98692092(0x5e1ebfc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    98692092(0x5e1ebfc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-754358d3215edcd7
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2304
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional comparison: Nvidia RTX 4070 Ti vs RX 7900 XTX

I further tested the RTX 4070 Ti... it is roughly 10x faster than the RX 7900 XTX...

Nvidia GPU (RTX 4070 Ti)

RTX 4070 Ti: 56.23 tokens per second

llama_print_timings:        load time =   824.29 ms
llama_print_timings:      sample time =    52.74 ms /   128 runs   (    0.41 ms per token,  2427.18 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  2276.23 ms /   128 runs   (   17.78 ms per token,    56.23 tokens per second)
llama_print_timings:       total time =  2357.70 ms
Log end

RX 7900 XTX: 5.62 tokens per second

llama_print_timings:        load time =  6432.57 ms
llama_print_timings:      sample time =    32.92 ms /   128 runs   (    0.26 ms per token,  3888.10 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 22756.97 ms /   128 runs   (  177.79 ms per token,     5.62 tokens per second)
llama_print_timings:       total time = 22857.59 ms
@Azeirah
Contributor

Azeirah commented Oct 1, 2023

I don't know the solution, but if you want to use llama.cpp with your GPU in the meantime, you might want to try CLBlast instead of ROCm. It should give you a significant speedup compared to CPU-only; not as fast as ROCm should be, but it should get you close.
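For reference, a minimal sketch of that CLBlast route, assuming the CLBlast development package is installed and reusing the model and options from this issue:

make clean && make LLAMA_CLBLAST=1
./main -m ../model/llama-2-13b-chat/ggml-model-q4.gguf -n 128 -ngl 50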

@ardfork
Contributor

ardfork commented Oct 1, 2023

This issue is missing info. Please share the commands used to build llama.cpp, the output of rocminfo, and the full output of llama.cpp.

@staviq added the performance (Speed related topics) and AMD GPU (Issues specific to AMD GPUs) labels Oct 1, 2023
@mirek190

mirek190 commented Oct 1, 2023

I know this is off topic, but I wanted to compare the speed of the strongest Intel CPU (13900K) against the strongest AMD CPU (7950X3D), running the model on CPU only.

Same model size and quantization.

Look:

main.exe --model models\new3\ultralm-13b-v2.0.Q4_0.gguf --mlock --color --threads 16 --keep -1 --batch_size 512 --n_predict -1 --top_k 40 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 4096 --interactive --instruct  -ngl 0


llama_print_timings:        load time = 13046.85 ms
llama_print_timings:      sample time =    11.69 ms /    85 runs   (    0.14 ms per token,  7273.04 tokens per second)
llama_print_timings: prompt eval time =  2055.02 ms /    77 tokens (   26.69 ms per token,    37.47 tokens per second)
llama_print_timings:        eval time = 10850.53 ms /    85 runs   (  127.65 ms per token,     7.83 tokens per second)
llama_print_timings:       total time = 15315.68 ms

Almost the same performance: slightly faster on the 7950X3D for answers, but for some reason prompt processing was almost 10x faster for me...
Another difference is that the AMD CPU used about 2x less energy for the task ;P

@arch-btw
Contributor

arch-btw commented Oct 1, 2023

Same issue on RX 560, granted it's an older card.

@oliverhu
Author

oliverhu commented Oct 2, 2023

I updated the question with all the details (should be more than enough...). In the meantime, I tested an RTX 4070 Ti... it is roughly 10x faster than the RX 7900 XTX...

RTX 4070 Ti: 56.23 tokens per second

llama_print_timings:        load time =   824.29 ms
llama_print_timings:      sample time =    52.74 ms /   128 runs   (    0.41 ms per token,  2427.18 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  2276.23 ms /   128 runs   (   17.78 ms per token,    56.23 tokens per second)
llama_print_timings:       total time =  2357.70 ms
Log end

RX 7900 XTX: 5.62 tokens per second

llama_print_timings:        load time =  6432.57 ms
llama_print_timings:      sample time =    32.92 ms /   128 runs   (    0.26 ms per token,  3888.10 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 22756.97 ms /   128 runs   (  177.79 ms per token,     5.62 tokens per second)
llama_print_timings:       total time = 22857.59 ms

@Engininja2
Contributor

The RX 560 may be slower in part because it's using the fallback code for __dp4a(): its ISA lacks a corresponding opcode, so the compiler may not be choosing the fastest instructions, or it might choose not to unroll a later loop because it emitted too many instructions per dp4a.

Could you try this commit and see if the RX 560 is any faster? Engininja2@23510ef

As for the RX 7900 XTX, I can't think of anything. The PR for RDNA mul_mat_q tunings has someone reporting solid speeds for that GPU: #2910 (comment)
Maybe an environment variable like LLAMA_DEBUG is set and slowing things down, but I think that would affect the other build just as much.

@arch-btw
Contributor

arch-btw commented Oct 2, 2023

Wow, I think you just fixed it for my RX 560 card, @Engininja2, thank you so much!

I will do some more testing and let you know how it goes.

@65a
Contributor

65a commented Oct 3, 2023

@oliverhu if you didn't compile the binary yourself (or compiled it on a machine with a different card), try doing that. By default you only get support for the card present at compile time unless AMDGPU_TARGETS and/or GPU_TARGETS are set, which has bitten me since I have a random assortment of cards across several machines. I recently had a gfx1100, and inference was very fast (much faster than CPU inference on a big recent Xeon) when that architecture was in the supported ROCm architecture list at compile time.

@YellowRoseCx
Contributor

@oliverhu if you didn't compile the binary yourself (or compiled it on a machine with a different card), try doing that. By default you only get support for the card present at compile time unless AMDGPU_TARGETS and/or GPU_TARGETS are set, which has bitten me since I have a random assortment of cards across several machines. I recently had a gfx1100, and inference was very fast (much faster than CPU inference on a big recent Xeon) when that architecture was in the supported ROCm architecture list at compile time.

that's why I keep the amdgpu targets in my makefile set to:
GPU_TARGETS ?= gfx803 gfx900 gfx906 gfx908 gfx90a gfx1030 gfx1100 $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch)

that way it builds the most common targets and also adds the architecture of whatever GPU is in the current machine
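For a one-off build, the target list can also be overridden on the make command line instead of editing the Makefile; a minimal sketch, assuming a gfx1100 card and the Make-based HIP build used in this issue:

make clean && make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1100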

@Abhinavpatel00

Abhinavpatel00 commented Oct 4, 2023

I am not sure, but it may be because of the instruction set extensions in your processor (Intel SSE4.1, SSE4.2, AVX2); as the README suggests, llama.cpp supports the AVX2 and AVX512 vector extensions.

@oliverhu
Author

oliverhu commented Oct 4, 2023

Thanks, it’s impressive to see so many community responses btw! I did compile locally, so I assume it uses the right arch. It doesn’t make sense for an RTX 4070 Ti to be 10x faster than a Radeon 7900 XTX anyway, regardless of the CPU discussion…

@jammm
Contributor

jammm commented Oct 6, 2023

The RX 560 may be slower in part because it's using the fallback code for __dp4a(): its ISA lacks a corresponding opcode, so the compiler may not be choosing the fastest instructions, or it might choose not to unroll a later loop because it emitted too many instructions per dp4a.

Could you try this commit and see if the RX 560 is any faster? Engininja2@23510ef

As for the RX 7900 XTX, I can't think of anything. The PR for RDNA mul_mat_q tunings has someone reporting solid speeds for that GPU: #2910 (comment) Maybe an environment variable like LLAMA_DEBUG is set and slowing things down, but I think that would affect the other build just as much.

Yeah, the 7900 XTX runs pretty fast, even faster than the scores I reported back then. I did initially see very low scores, but that was because I was compiling a debug build (which is the default). You gotta make sure to use -DCMAKE_BUILD_TYPE=Release when building.

@jammm
Contributor

jammm commented Oct 6, 2023

I just saw the build command used here: make LLAMA_HIPBLAS=1.
This doesn't use the -O3 build flag; you'll have to specify it manually in the Makefile. LLAMA_FAST also won't work because it doesn't add -O3 to HIPFLAGS: /~https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L119
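For comparison, the CMake path applies optimization flags when a Release configuration is requested; a minimal sketch of such a build, assuming ROCm's clang is installed at the default /opt/rocm path and using gfx1100 only as an example target:

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/bin/main -m ../model/llama-2-13b-chat/ggml-model-q4.gguf -n 128 -ngl 50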

@0x131315

0x131315 commented Oct 8, 2023

It's probably fixed by now
Tested on 7950x and 7900xt, ubuntu 23.04, kernel xanmod 6.5.5, rocm-5.7.0
Tested on commit: eee42c6

An error popped up when building HIP:

/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/include/cuda_wrappers/cmath:27:15: fatal error: 'cmath' file not found

How to fix: sudo apt-get install libstdc++-13-dev

Test command:
./llama-bench --model ./models/amethyst-13b-mistral.Q4_K_M.gguf

Test results:
make clean && make

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CPU | 16 | pp 512 | 39.74 ± 0.55 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CPU | 16 | tg 128 | 7.39 ± 0.03 |

make clean && make LLAMA_OPENBLAS=1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | BLAS | 16 | pp 512 | 5.47 ± 0.04 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | BLAS | 16 | tg 128 | 7.40 ± 0.03 |

make clean && make LLAMA_CLBLAST=1

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | OpenCL | 99 | pp 512 | 151.09 ± 9.03 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | OpenCL | 99 | tg 128 | 30.96 ± 0.75 |

make clean && make LLAMA_HIPBLAS=1

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | ROCm | 99 | pp 512 | 687.51 ± 0.74 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | ROCm | 99 | tg 128 | 54.94 ± 0.06 |

@rabidcopy
Contributor

Not sure if this adds anything, but I noticed that on my RX 570 prompt ingestion was terribly slow, slower than with CLBlast or OpenBLAS, while actual inference was still fast. Turns out it was mmq. With the -nommq option (and its equivalent in the ROCm fork of KoboldCPP), prompt ingestion sped up dramatically to a completely usable state.

Without mmq: ./main -m ../openhermes-2-mistral-7b.Q5_K_M.gguf -f prompts/dan-modified.txt -n 100 -ngl 20 --threads 6 -nommq

llama_print_timings:        load time =    1268.83 ms
llama_print_timings:      sample time =       6.01 ms /    20 runs   (    0.30 ms per token,  3328.34 tokens per second)
llama_print_timings: prompt eval time =    4450.21 ms /   365 tokens (   12.19 ms per token,    82.02 tokens per second)
llama_print_timings:        eval time =    2483.49 ms /    19 runs   (  130.71 ms per token,     7.65 tokens per second)

With mmq: ./main -m ../openhermes-2-mistral-7b.Q5_K_M.gguf -f prompts/dan-modified.txt -n 100 -ngl 20 --threads 6

llama_print_timings:        load time =    6360.01 ms
llama_print_timings:      sample time =       3.29 ms /    11 runs   (    0.30 ms per token,  3345.50 tokens per second)
llama_print_timings: prompt eval time =  106569.27 ms /   365 tokens (  291.97 ms per token,     3.43 tokens per second)
llama_print_timings:        eval time =    1369.67 ms /    10 runs   (  136.97 ms per token,     7.30 tokens per second)
llama_print_timings:       total time =  107948.71 ms

12 ms vs 292 ms per token to process a prompt.
The takeaway is that on gfx803 cards you should use --nommq and avoid the custom mul_mat_q kernels. It only occurred to me because on KoboldCPP-ROCm you have to explicitly opt into mmq (or use the launcher, which defaults to it); I couldn't understand why KoboldCPP was so much faster when run from the command line. Then I realized that when using the launcher, with mmq as the default, the same slowness appeared as with llama.cpp's ./main.

@oliverhu
Author

Thanks for all the replies. None worked for me.

Observations:

  1. -O3 is applied automatically; adding another -O3 made no difference, since it is always there (also verified in the compile commands).
  2. --nommq didn't work either.
  3. Adding GPU_TARGETS ?= gfx1100 $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch) didn't work either.
  4. The binary created by cmake is faster than the one from make (5.6 vs 6.7 t/s), regardless of -DCMAKE_BUILD_TYPE=Release.

@0x131315

0x131315 commented Nov 1, 2023

Is hipBLAS perhaps not working? It's pretty easy to break on Linux. To restore it, you need to rerun the driver installation, and don't forget the flags that install the HIP libraries. That's exactly what happened to me: I didn't understand why Stable Diffusion was so slow until I reinstalled HIP and ROCm. Just make sure that ROCm and HIP work.

@bog-dan-ro

bog-dan-ro commented Nov 17, 2023

On my computer, using an RX 7900 XT, I'm getting a decent 82 tokens/s, which is less than I get using mlc-llm (I usually get 90 to 105 tokens/second):

$ ./main -m openchat_3.5-16k.Q2_K.gguf -n 512 -ngl 84
...
llama_print_timings:        load time =     602,64 ms
llama_print_timings:      sample time =      49,50 ms /   512 runs   (    0,10 ms per token, 10343,02 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    6213,93 ms /   512 runs   (   12,14 ms per token,    82,40 tokens per second)
llama_print_timings:       total time =    6353,88 ms
Log end

I'm using Ubuntu 22.04 with the Mesa GPU driver! The amdgpu driver had some issues, and I switched back to the Mesa one.
If you have an RX 7900 XTX, then you should set ngl to 96.

There is one problem: if I don't set the -ngl 84 parameter, it seems to default to 1 or a very low number and it's terribly slow... is it possible to have better auto-detection?

Here is the output difference between no_gl param and ngl params:

 diff -Nru no_gl.txt ngl.txt 
--- no_gl.txt   2023-11-17 18:41:18.800750412 +0200
+++ ngl.txt     2023-11-17 18:42:16.152175500 +0200
@@ -1,7 +1,7 @@
 Log start
 main: build = 1523 (947f64f)
 main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
-main: seed  = 1700239230
+main: seed  = 1700239309
 ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
 ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
 ggml_init_cublas: found 1 ROCm devices:
@@ -356,19 +356,23 @@
 llm_load_print_meta: LF token  = 13 '<0x0A>'
 llm_load_tensors: ggml ctx size =    0,11 MiB
 llm_load_tensors: using ROCm for GPU acceleration
-llm_load_tensors: mem required  = 2939,69 MiB
-llm_load_tensors: offloading 0 repeating layers to GPU
-llm_load_tensors: offloaded 0/35 layers to GPU
-llm_load_tensors: VRAM used: 0,00 MiB
+llm_load_tensors: mem required  =   41,12 MiB
+llm_load_tensors: offloading 32 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 35/35 layers to GPU
+llm_load_tensors: VRAM used: 2898,56 MiB
 ..................................................................................................
 llama_new_context_with_model: n_ctx      = 512
 llama_new_context_with_model: freq_base  = 1000000,0
 llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: offloading v cache to GPU
+llama_kv_cache_init: offloading k cache to GPU
+llama_kv_cache_init: VRAM kv self = 64,00 MiB
 llama_new_context_with_model: kv self size  =   64,00 MiB
 llama_build_graph: non-view tensors processed: 740/740
 llama_new_context_with_model: compute buffer total size = 74,57 MiB
 llama_new_context_with_model: VRAM scratch buffer: 73,00 MiB
-llama_new_context_with_model: total VRAM used: 73,00 MiB (model: 0,00 MiB, context: 73,00 MiB)
+llama_new_context_with_model: total VRAM used: 3035,57 MiB (model: 2898,56 MiB, context: 137,00 MiB)
 
 system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
 sampling: 
@@ -376,3 +380,4 @@
         top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
         mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
 generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0
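Until better auto-detection exists, a simple workaround is to pass an -ngl value at least as large as the model's layer count (35 in the diff above); llama.cpp caps it at the actual number of layers, so a sketch like the following should offload everything:

./main -m openchat_3.5-16k.Q2_K.gguf -n 512 -ngl 99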

@oliverhu
Author

Ah, let me try the Mesa GPU driver as well... seems to be a driver issue :(

@shibe2
Contributor

shibe2 commented Nov 18, 2023

is it possible to have a better auto detection ?

#3719

@stduhpf
Contributor

stduhpf commented Dec 9, 2023

I have a similar issue with a CLBlast build on Windows, on my rx5700XT. Offloading layers to GPU causes a very significant slowdown, even compared to my slow CPU.

.\main.exe  -m .\models\tinyllama\tinyllama-1.1b-chat-v0.3.Q2_K.gguf  -c 4096 -p "This is a test prompt." -e -n 128 -ctk q4_0 -s 0 -t 4 -ngl 20
...
llama_print_timings:        load time =    3765.29 ms
llama_print_timings:      sample time =      36.10 ms /   128 runs   (    0.28 ms per token,  3545.90 tokens per second)
llama_print_timings: prompt eval time =     460.28 ms /     7 tokens (   65.75 ms per token,    15.21 tokens per second)
llama_print_timings:        eval time =   20259.19 ms /   127 runs   (  159.52 ms per token,     6.27 tokens per second)
llama_print_timings:       total time =   20801.06 ms
.\main.exe  -m .\models\tinyllama\tinyllama-1.1b-chat-v0.3.Q2_K.gguf  -c 4096 -p "This is a test prompt." -e -n 128 -ctk q4_0 -s 0 -t 4 -ngl 0
...
llama_print_timings:        load time =     298.46 ms
llama_print_timings:      sample time =      42.78 ms /   128 runs   (    0.33 ms per token,  2991.77 tokens per second)
llama_print_timings: prompt eval time =     394.26 ms /     7 tokens (   56.32 ms per token,    17.75 tokens per second)
llama_print_timings:        eval time =    9202.12 ms /   127 runs   (   72.46 ms per token,    13.80 tokens per second)
llama_print_timings:

@cromefire

cromefire commented Dec 10, 2023

AMD's new ROCm 6 release will also do a lot of optimization around FP8 in hipBLASLt (and in general they recommend that library for ML use). Could that maybe be used to optimize further?
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/amd-rocm-6-brief.pdf

@YellowRoseCx
Contributor

AMD's new ROCm 6 release will also do a lot of optimization around FP8 in hipBLASLt (and in general they recommend that library for ML use). Could that maybe be used to optimize further? https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/amd-rocm-6-brief.pdf

hipBLASlt's github says it requires an AMD MI200-MI300 Instinct Accelerator GPU

@cromefire

cromefire commented Dec 19, 2023

hipBLASlt's github says it requires an AMD MI200-MI300 Instinct Accelerator GPU

Too bad. Although it's AMD, so maybe I'll just try and build it on my XTX and see if it works anyway, wouldn't be the first time...

What still could be interesting is WMMA, I don't know if you can already take advantage of it using hipblas, but if not, this would probably really accelerate stuff on the RDNA3 cards, as they come with that built deep into the architecture: https://gpuopen.com/learn/wmma_on_rdna3/ (also supported by CUDA, so maybe you can accelerate both in one go)

Maybe in general, I'm not too knowledgeable there, but might it be useful to use something higher level like MIOpen and letting AMD do the optimization?

@Azeirah
Contributor

Azeirah commented Dec 19, 2023

hipBLASlt's github says it requires an AMD MI200-MI300 Instinct Accelerator GPU

Too bad. Although it's AMD, so maybe I'll just try and build it on my XTX and see if it works anyway, wouldn't be the first time...

Definitely share your results, I also have a 7900xtx!

@Azeirah
Contributor

Azeirah commented Dec 19, 2023

hipBLASlt's github says it requires an AMD MI200-MI300 Instinct Accelerator GPU

Too bad. Although it's AMD, so maybe I'll just try and build it on my XTX and see if it works anyway, wouldn't be the first time...

What still could be interesting is WMMA, I don't know if you can already take advantage of it using hipblas, but if not, this would probably really accelerate stuff on the RDNA3 cards, as they come with that built deep into the architecture: https://gpuopen.com/learn/wmma_on_rdna3/ (also supported by CUDA, so maybe you can accelerate both in one go)

Maybe in general, I'm not too knowledgeable there, but might it be useful to use something higher level like MIOpen and letting AMD do the optimization?

When it comes to compiling models for specific architectures, you can always look at mlc: https://llm.mlc.ai/docs/index.html. I believe they're using an Apache compiler/framework to compile a model into extremely optimized code. Not sure if it's using MIGraphX or MIOpen or ... under the hood.

Inference performance using mlc is basically the best you can get on consumer hardware as far as my knowledge goes. Better than llama.cpp and better than exllama. Not sure if it's fit for enterprise applications and servers, but most of us aren't doing that anyway.

@cromefire

cromefire commented Dec 19, 2023

When it comes to compiling models for specific architectures, you can always look at mlc: https://llm.mlc.ai/docs/index.html. I believe they're using an Apache compiler/framework to compile a model into extremely optimized code. Not sure if it's using MIGraphX or MIOpen or ... under the hood.

Inference performance using mlc is basically the best you can get on consumer hardware as far as my knowledge goes. Better than llama.cpp and better than exllama. Not sure if it's fit for enterprise applications and servers, but most of us aren't doing that anyway.

Well I'm using TabbyML, which is currently bound to using llama.cpp, so that's why I'm stuck with that (which also means I work with different models than just llama 2). Maybe it's time to change that...

@cromefire

Definitely share your results, I also have a 7900xtx!

Nope, doesn't work, it requires amdhsa_accum_offset which is only available on CDNA2+ it seems.

@linuxyz

linuxyz commented Feb 17, 2024

@oliverhu I guess you have an Intel CPU computer :)
I can reproduce similar issues on my PC with an Intel CPU, and I think it is caused by problems in the Makefile-based hipBLAS build, which mean that GPU offload of the model layers doesn't really work.

But the cmake-based build creates a correct binary and works well.

@oliverhu
Author

Yeah, I have an Intel CPU with an AMD GPU; I haven't had a chance to dig deeper.

@dmatora

dmatora commented Feb 28, 2024

@arch-btw I'm also using an RX 560 and the speed is the same as on the CPU.
What speed did you get before and after the fix?

@cromefire

I'm also using an RX 560 and the speed is the same as on the CPU.
What speed did you get before and after the fix?

Just a general FYI: if you have a pretty modern CPU, that's probably even expected behavior, as it's a 7-year-old, tiny 14 nm GPU. Plus, I don't think ROCm supports it anymore, so it might not even be using the GPU; I think Vega is the oldest it supports.

@dmatora

dmatora commented Feb 28, 2024

I was referring to an i7-7700HQ, so not really expected.
But yes, I doubt the GPU is truly active, which is why I'm asking about your speed.

@gadLinux

Testing with a Radeon Instinct MI25, it is actually quite slow.
./llama-bench --model ./models/amethyst-13b-mistral.Q4_K_M.gguf
ggml_opencl: selecting platform: 'rusticl'
ggml_opencl: selecting device: 'AMD Radeon Instinct MI25 (radeonsi, vega10, LLVM 15.0.7, DRM 3.54, 6.5.0-14-generic)'

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | OpenCL | 99 | pp 512 | 43.12 ± 0.46 |
| llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | OpenCL | 99 | tg 128 | 3.51 ± 0.67 |

build: cfc4d75 (2564)

@cromefire

ggml_opencl: selecting platform: 'rusticl'

Well, you tested with OpenCL, so that's kinda expected. You want to use hipBLAS.


This issue was closed because it has been inactive for 14 days since being marked as stale.

@rennokki

Coming from the ollama repo: maybe ROC_ENABLE_PRE_VEGA=1 would fix it? ollama/ollama#2453 (comment)
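A minimal sketch of trying that suggestion is simply to prefix the usual run with the variable; whether it helps depends on the card and the ROCm version (it is aimed at pre-Vega GPUs such as the RX 560/570 discussed above):

ROC_ENABLE_PRE_VEGA=1 ./main -m ../model/llama-2-13b-chat/ggml-model-q4.gguf -n 128 -ngl 50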

@jimmyrcom

Compiling without GGML_HIP_UMA=1 makes it fast for me
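For anyone who had that flag enabled (as far as I know, it makes the HIP backend use unified host/device memory, which mainly benefits APUs and tends to hurt discrete cards), a minimal sketch of a clean rebuild without it, assuming the Make-based build; the exact option prefix (LLAMA_ vs GGML_) depends on how recent your checkout is:

make clean && make LLAMA_HIPBLAS=1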

@YellowRoseCx
Contributor

Compiling without GGML_HIP_UMA=1 makes it fast for me

What GPU are you using?
