Splitting model on multiple GPUs is broken (ROCm) #173
Closed
Description
Splitting a model between two AMD GPUs (Rx 7900XTX and Radeon VII) results in garbage output (gibberish).
Tested with Llama-2-13B-chat-GPTQ
and Llama-2-70B-chat-GPTQ
.
Running a model on just any one of the two card the output seems reasonable, although I cant vouch for the correctness of the 70B model as it cannot fit on a single card.
No flags seem to impact the results, although if i split the model and use --fused_mlp_thd 0
the following error occurs:
Exception
Traceback (most recent call last):
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/channel.py", line 428, in service
task.service()
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 168, in service
self.execute()
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 456, in execute
for chunk in app_iter:
File "/usr/lib/python3.11/site-packages/werkzeug/wsgi.py", line 289, in __next__
return self._next()
^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/werkzeug/wrappers/response.py", line 31, in _iter_encoded
for item in iterable:
File "/usr/lib/python3.11/site-packages/flask/helpers.py", line 149, in generator
yield from gen
File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 694, in respond_multi
yield from self.respond(self.participants[1], stop_conditions, total_tokens, res_line, num_res_tokens)
File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 532, in respond
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 487, in beam_search
if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 341, in gen_single_token
token, _ = self.batched_sample(logits,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 64, in batched_sample
if logits.shape[0] == 1: return self.sample(logits, temperature, top_k, top_p, min_p, typical, num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 147, in sample
sampled_ind = torch.multinomial(top_probs, top_probs.shape[-1] if num == -1 else min(num, top_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Compiling #146 does not seem to impact the outcome either.
The system is running Arch Linux with python-pytorch-opt-rocm 2.0.1-7
Output of rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 3950X 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 3950X 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Uuid: GPU-94c2e25f00000000
Marketing Name: AMD Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2304
BDFID: 3584
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: gfx906
Uuid: GPU-ed7030e172da5eba
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 4352
Internal Node ID: 2
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
I am available to do any testing that may help isolate the issue, I can try to test a third card as well (RX 6800XT).
Metadata
Assignees
Labels
No labels