Replies: 1 comment
-
In short, it is not possible because of the technical architecture of GPU memory: it does not have the memory-mapping (demand-paging) capability that the CPU has, so weights must be explicitly copied into VRAM before the GPU can use them.
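For context, here is a minimal conceptual sketch of that difference, assuming a POSIX system; the model path is hypothetical and this is not llama.cpp's actual loader code. The CPU path can mmap() the model file so the kernel pages weights in from disk only as they are read, whereas VRAM has no file-backed, demand-paged equivalent and has to be filled by explicit copies from host memory.

```cpp
// Conceptual sketch only ("model.gguf" is a hypothetical path), not llama.cpp's loader.
// CPU side: mmap gives a ~460 GB file a virtual mapping; physical RAM is
// consumed only for the pages the inference loop actually touches.
#include <cstdio>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char *path = "model.gguf";              // hypothetical path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);                               // st.st_size may be ~460 GB

    // The mapping reserves address space, not RAM. Pages are faulted in
    // from disk lazily, the first time a tensor stored in them is read.
    void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Reading one weight pulls in only the small page around it.
    volatile uint8_t first_byte = static_cast<const uint8_t *>(base)[0];
    (void)first_byte;

    // GPU side (contrast): there is no file-backed, demand-paged mapping of
    // VRAM. To run a layer on the GPU, its weights must first be resident in
    // host memory and then explicitly uploaded into a VRAM allocation, which
    // is roughly what llama.cpp's layer offloading (-ngl) does per layer.

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```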
-
Could someone kindly help me understand the following?
In the ideal situation, as I understand it, CPU inference works like this: the full model (e.g. 460 GB) is loaded into RAM, and the CPU works against RAM to run inference. With a GPU the situation is similar: the full model has to be loaded into VRAM.
But in reality, when I run llama.cpp inference on CPU only (even with huge models like DeepSeek-R1 671B, about 460 GB on disk), my RAM fills up by only about 22 GB. So my CPU is actively working with roughly 22 GB of data (part of the model) loaded in RAM, while the rest of the model stays on disk and is somehow pulled in from there (see the sketch below for one way to check this).
So why not switch to GPU? My GPU has 24 GB of VRAM, so that should be enough for such a run. If the CPU can work with just a part of a huge model, can the GPU do the same?
I know there is an option for offloading some layers to the GPU, but that is different. What interests me is whether it is possible to switch from CPU to GPU entirely, since the data the CPU actually uses in RAM is only 22 GB.
Could someone point out where I'm wrong?
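For reference, one way to see the demand paging behind that 22 GB figure is to ask the kernel which pages of the mapped file are actually resident, via mincore(). This is a hedged sketch assuming Linux; "model.gguf" is a hypothetical path, and the number it prints simply reflects the page cache, which is what llama.cpp's mmap-based loading (the default, unless --no-mmap is passed) relies on.

```cpp
// Sketch: count how many pages of a mapped model file are resident in RAM.
// Assumes Linux; "model.gguf" is a hypothetical path.
#include <cstdio>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("model.gguf", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);

    void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (static_cast<size_t>(st.st_size) + page - 1) / page;
    std::vector<unsigned char> vec(pages);

    // Each entry's low bit says whether that page is currently in RAM.
    if (mincore(base, st.st_size, vec.data()) == 0) {
        size_t resident = 0;
        for (unsigned char v : vec) resident += v & 1;
        printf("resident: %.1f GiB of %.1f GiB\n",
               resident * (double)page / (1 << 30),
               st.st_size / (double)(1 << 30));
    }

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```

Run against a model file that is being (or was just) used for inference, the resident figure should stay far below the full file size, matching the 22 GB observation.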