
Tensor Parallel? Multiple GPU #7

Open · anonymousmaharaj opened this issue Jan 28, 2025 · 5 comments
Labels: enhancement (New feature or request)

anonymousmaharaj commented Jan 28, 2025

I have 2x 3090s on my server. Is it possible to run YuE on both cards at the same time?

I also get this warning:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
But why? The model is loaded on my GPU.

(screenshot attached)

FULL LOG:

```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.70it/s]
/root/.pyenv/versions/yue/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
/root/YuE/inference/infer.py:86: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See /~https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  parameter_dict = torch.load(args.resume_path, map_location='cpu')
  0%|                                                                                                                        | 0/4 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
```
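
(Editor's note: that Flash Attention warning just means the weights were materialized on CPU before being moved. A minimal sketch of loading directly onto the GPU so Flash Attention 2 initializes with a CUDA model; the checkpoint id is illustrative and may not match what infer.py actually loads:)

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint id; substitute whatever infer.py actually loads.
model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"

# device_map="cuda:0" places the weights on the GPU while they are loaded,
# so Flash Attention 2 sees a CUDA model and the warning is not emitted.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
model.eval()
```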
a43992899 (Collaborator) commented:

Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.

See the Hugging Face tutorial:
https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi

@hf-lin we should support this
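
(Editor's note: for anyone searching later, a minimal sketch of what those few lines could look like, assuming infer.py loads the stage-1 model through `transformers.AutoModelForCausalLM`; the checkpoint id is illustrative and the exact variable names in infer.py may differ:)

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate shard the layers across all visible GPUs
# (both 3090s here), so a model that does not fit on one card can still load.
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s1-7B-anneal-en-cot",  # illustrative checkpoint id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Important: do not call model.to("cuda") afterwards. With device_map="auto"
# the weights are already placed; only the inputs need to be moved, e.g.
# input_ids = input_ids.to(model.device)
```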

a43992899 added the enhancement (New feature or request) label on Jan 29, 2025
anonymousmaharaj (Author) commented:

> Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.
>
> See the Hugging Face tutorial: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
>
> @hf-lin we should support this

Thank you for the answer. Good day!

anonymousmaharaj (Author) commented:

@a43992899 I tried it both ways, but I was unable to get it to start.

One of the problems I hit is that the model starts loading on both GPUs at the same time and crashes with OOM. When I run it the standard way, as you do, everything works, but it just takes an unbearably long time.


hackey commented Feb 13, 2025

I tried several ways to distribute the model across multiple GPUs.

1. With the `device_map="auto"` parameter at startup, the model is correctly distributed across multiple GPUs, but then an error occurs:
   `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)`
   I tried several ways to fix this, including explicitly moving all tensors to the model's device. The problem may be related to `position_ids`, which is used in `LlamaRotaryEmbedding` and must be on the same device as the model, but I could not get the placement right (see the sketch after this list).

2. I also tried the proposed method with `tp_plan="auto"`, but that turned out to be more complicated.

3. I also tried `torch.nn.DataParallel`, with no success either.
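
(Editor's note: as a reference for point 1, here is a minimal sketch of the pattern that usually avoids the cuda:0/cuda:1 mismatch with `device_map="auto"`: load the model once with the device map, never move it afterwards, and send only the tokenized inputs to the model's entry device. The checkpoint id, prompt, and generate call below are illustrative and may not match infer.py's actual code path.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"  # illustrative checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # layers split across cuda:0 and cuda:1
)
# No model.to("cuda") / model.cuda() after this point: moving a sharded model
# onto one card is what typically reintroduces OOM or cross-device errors.

inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```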

Guys, if you have a chance to share an example of correctly distributing the model across different GPUs, I would be very grateful.

wangjiancheng-123 commented:

> Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.
>
> See the Hugging Face tutorial: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
>
> @hf-lin we should support this

I used two GPUs to run it successfully, but it didn't increase the speed.
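
(Editor's note: this matches how `device_map="auto"` behaves: it splits the layers across the cards and runs them one after another, so it buys memory headroom rather than throughput. An actual speedup needs tensor parallelism, which recent `transformers` releases expose as `tp_plan="auto"` under a `torchrun` launch, per the tutorial linked above. A rough sketch, with an illustrative checkpoint id and script name, assuming a transformers version that supports `tp_plan`:)

```python
# tp_infer.py - launch with: torchrun --nproc-per-node 2 tp_infer.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"  # illustrative checkpoint id

# torchrun starts one process per GPU and sets RANK for each of them.
rank = int(os.environ.get("RANK", 0))
device = torch.device(f"cuda:{rank}")

# tp_plan="auto" shards each weight matrix across the ranks, so both GPUs
# work on every layer at once, unlike device_map="auto", which pipelines.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("example prompt", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
if rank == 0:
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```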
