
Tensor Parallel? Multiple GPU #7

Open · anonymousmaharaj opened this issue Jan 28, 2025 · 5 comments
Labels: enhancement (New feature or request)

anonymousmaharaj commented Jan 28, 2025

I have 2x 3090s on my server. Is it possible to run YuE on both cards at the same time?

I also get this warning:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
But why? The model is loaded on my GPU.

(screenshot attached)

FULL LOG:

```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.70it/s]
/root/.pyenv/versions/yue/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
/root/YuE/inference/infer.py:86: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See /~https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  parameter_dict = torch.load(args.resume_path, map_location='cpu')
  0%|                                                                                                                        | 0/4 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
```
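
(Editor's note: that Flash Attention warning just means the weights were materialized on CPU before being moved. A minimal sketch of loading directly onto the GPU so Flash Attention 2 initializes with a CUDA model; the checkpoint id is illustrative and may not match what infer.py actually loads:)

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint id; substitute whatever infer.py actually loads.
model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"

# device_map="cuda:0" places the weights on the GPU while they are loaded,
# so Flash Attention 2 sees a CUDA model and the warning is not emitted.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
model.eval()
```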
a43992899 (Collaborator) commented:

Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.

See the Hugging Face tutorial:
https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi

@hf-lin we should support this
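
(Editor's note: for anyone searching later, a minimal sketch of what those few lines could look like, assuming infer.py loads the stage-1 model through `transformers.AutoModelForCausalLM`; the checkpoint id is illustrative and the exact variable names in infer.py may differ:)

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate shard the layers across all visible GPUs
# (both 3090s here), so a model that does not fit on one card can still load.
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s1-7B-anneal-en-cot",  # illustrative checkpoint id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Important: do not call model.to("cuda") afterwards. With device_map="auto"
# the weights are already placed; only the inputs need to be moved, e.g.
# input_ids = input_ids.to(model.device)
```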

a43992899 added the enhancement (New feature or request) label on Jan 29, 2025
anonymousmaharaj (Author) commented:

> Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.
>
> See the Hugging Face tutorial: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
>
> @hf-lin we should support this

Thank you for the answer. Good day!

anonymousmaharaj (Author) commented:

@a43992899 I tried it both ways, but I was unable to get it to start.

One of the problems I hit is that the model starts loading on both GPUs at the same time and crashes with OOM. When I run it the standard way, as you do, everything works, but it just takes an unbearably long time.


hackey commented Feb 13, 2025

I tried several ways to distribute the model across multiple GPUs.

1. With the `device_map="auto"` parameter at startup, the model is correctly distributed across multiple GPUs, but then an error occurs:
   `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)`
   I tried several ways to fix this, including explicitly moving all tensors to the model's device. The problem may be related to `position_ids`, which is used in `LlamaRotaryEmbedding` and must be on the same device as the model, but I could not get the placement right (see the sketch after this list).

2. I also tried the proposed method with `tp_plan="auto"`, but that turned out to be more complicated.

3. I also tried `torch.nn.DataParallel`, with no success either.
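
(Editor's note: as a reference for point 1, here is a minimal sketch of the pattern that usually avoids the cuda:0/cuda:1 mismatch with `device_map="auto"`: load the model once with the device map, never move it afterwards, and send only the tokenized inputs to the model's entry device. The checkpoint id, prompt, and generate call below are illustrative and may not match infer.py's actual code path.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"  # illustrative checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # layers split across cuda:0 and cuda:1
)
# No model.to("cuda") / model.cuda() after this point: moving a sharded model
# onto one card is what typically reintroduces OOM or cross-device errors.

inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```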

Guys, if you have a chance to share an example of correctly distributing the model across different GPUs, I would be very grateful.

wangjiancheng-123 commented:

> Yes, you need to adjust a few lines of code in /~https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py.
>
> See the Hugging Face tutorial: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
>
> @hf-lin we should support this

I used two GPUs to run it successfully, but it didn't increase the speed.
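
(Editor's note: this matches how `device_map="auto"` behaves: it splits the layers across the cards and runs them one after another, so it buys memory headroom rather than throughput. An actual speedup needs tensor parallelism, which recent `transformers` releases expose as `tp_plan="auto"` under a `torchrun` launch, per the tutorial linked above. A rough sketch, with an illustrative checkpoint id and script name, assuming a transformers version that supports `tp_plan`:)

```python
# tp_infer.py - launch with: torchrun --nproc-per-node 2 tp_infer.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/YuE-s1-7B-anneal-en-cot"  # illustrative checkpoint id

# torchrun starts one process per GPU and sets RANK for each of them.
rank = int(os.environ.get("RANK", 0))
device = torch.device(f"cuda:{rank}")

# tp_plan="auto" shards each weight matrix across the ranks, so both GPUs
# work on every layer at once, unlike device_map="auto", which pipelines.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("example prompt", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
if rank == 0:
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```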
