Replies: 1 comment
-
In short, it is not possible because of the technical architecture of GPU memory: it does not have the memory-mapping (demand-paging) capability that the CPU has, so weights must be explicitly copied into VRAM before the GPU can use them.
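For context, here is a minimal conceptual sketch of that difference, assuming a POSIX system; the model path is hypothetical and this is not llama.cpp's actual loader code. The CPU path can mmap() the model file so the kernel pages weights in from disk only as they are read, whereas VRAM has no file-backed, demand-paged equivalent and has to be filled by explicit copies from host memory.

```cpp
// Conceptual sketch only ("model.gguf" is a hypothetical path), not llama.cpp's loader.
// CPU side: mmap gives a ~460 GB file a virtual mapping; physical RAM is
// consumed only for the pages the inference loop actually touches.
#include <cstdio>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char *path = "model.gguf";              // hypothetical path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);                               // st.st_size may be ~460 GB

    // The mapping reserves address space, not RAM. Pages are faulted in
    // from disk lazily, the first time a tensor stored in them is read.
    void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Reading one weight pulls in only the small page around it.
    volatile uint8_t first_byte = static_cast<const uint8_t *>(base)[0];
    (void)first_byte;

    // GPU side (contrast): there is no file-backed, demand-paged mapping of
    // VRAM. To run a layer on the GPU, its weights must first be resident in
    // host memory and then explicitly uploaded into a VRAM allocation, which
    // is roughly what llama.cpp's layer offloading (-ngl) does per layer.

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```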
-
Could someone kindly help me understand the following?
In the ideal situation, as I understand it, CPU inference works like this: the full model (e.g. 460 GB) is loaded into RAM, and the CPU works against RAM to run inference. With a GPU the situation is similar: the full model has to be loaded into VRAM.
But in reality, when I run llama.cpp inference on CPU only (even with huge models like DeepSeek-R1 671B, about 460 GB on disk), my RAM fills up by only about 22 GB. So my CPU is actively working with roughly 22 GB of data (part of the model) loaded in RAM, while the rest of the model stays on disk and is somehow pulled in from there (see the sketch below for one way to check this).
So why not switch to GPU? My GPU has 24 GB of VRAM, so that should be enough for such a run. If the CPU can work with just a part of a huge model, can the GPU do the same?
I know there is an option for offloading some layers to the GPU, but that is different. What interests me is whether it is possible to switch from CPU to GPU entirely, since the data the CPU actually uses in RAM is only 22 GB.
Could someone point out where I'm wrong?
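For reference, one way to see the demand paging behind that 22 GB figure is to ask the kernel which pages of the mapped file are actually resident, via mincore(). This is a hedged sketch assuming Linux; "model.gguf" is a hypothetical path, and the number it prints simply reflects the page cache, which is what llama.cpp's mmap-based loading (the default, unless --no-mmap is passed) relies on.

```cpp
// Sketch: count how many pages of a mapped model file are resident in RAM.
// Assumes Linux; "model.gguf" is a hypothetical path.
#include <cstdio>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("model.gguf", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);

    void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (static_cast<size_t>(st.st_size) + page - 1) / page;
    std::vector<unsigned char> vec(pages);

    // Each entry's low bit says whether that page is currently in RAM.
    if (mincore(base, st.st_size, vec.data()) == 0) {
        size_t resident = 0;
        for (unsigned char v : vec) resident += v & 1;
        printf("resident: %.1f GiB of %.1f GiB\n",
               resident * (double)page / (1 << 30),
               st.st_size / (double)(1 << 30));
    }

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```

Run against a model file that is being (or was just) used for inference, the resident figure should stay far below the full file size, matching the 22 GB observation.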