metal : use residency sets #11427

ggerganov · 2025-01-26T10:37:14Z

Using residency sets keeps the allocated memory wired and almost completely eliminates the overhead observed in #10119. For example, on M2 Ultra with a 7B Q8_0 model, requests are ~250ms faster thanks to this change. It does not seem necessary to attach the residency sets to the command queue and buffers, so the change is rather simple. For each buffer, we create an associated MTLResidencySet and add the MTLBuffer objects to it. After that, we commit it and request residency:

https://github.com/ggerganov/llama.cpp/blob/225d2e0ca1d7a7e627f2cea4a43dd77a83b9f078/ggml/src/ggml-metal/ggml-metal.m#L1084-L1091

build: b9126fe (4561)

Model	Test	t/s master	t/s gg/metal-residency-sets	Speedup
llama 3B F16	pp512	3289.51	3286.29	1.00
llama 3B F16	tg128	73.28	73.35	1.00
llama 3B Q4_0	pp512	2999.71	3002.93	1.00
llama 3B Q4_0	tg128	165.83	166.03	1.00
llama 3B Q8_0	pp512	2958.32	2960.69	1.00
llama 3B Q8_0	tg128	123.61	123.96	1.00
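
As a rough check on the Speedup column, the sketch below assumes it is the branch throughput divided by the master throughput, rounded to two decimals. The PR does not state the formula, and the function names `speedup`, `round2` and `print_speedup` are hypothetical.

```c
#include <stdio.h>

// Hypothetical helper: ratio of branch throughput to master throughput (both in t/s).
// Assumption: this is how the Speedup column is derived; the PR does not say so.
double speedup(double master_tps, double branch_tps) {
    return branch_tps / master_tps;
}

// Round a positive value to two decimals, matching the precision used in the table.
double round2(double x) {
    return (double)(long)(x * 100.0 + 0.5) / 100.0;
}

// Print the speedup for one table row,
// e.g. print_speedup("llama 3B F16 pp512", 3289.51, 3286.29).
void print_speedup(const char *label, double master_tps, double branch_tps) {
    printf("%s: %.2f\n", label, round2(speedup(master_tps, branch_tps)));
}
```

Under that assumption, every row in the table rounds to 1.00, so the benchmarked throughput is essentially unchanged by the branch.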

Metal backend changes

The backend now checks the environment variable GGML_METAL_NO_RESIDENCY. If it is set, no residency sets are created, which allows the OS to collect the GPU memory after 1 second of inactivity. This should rarely be needed because it hurts the performance of the application, but support is kept just in case.
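
A minimal sketch of such an opt-out check, assuming that the mere presence of the variable (with any value) disables residency sets. The helper name `use_residency_sets` is hypothetical and is not taken from the PR; the actual check lives in ggml-metal.m.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper (not from the PR): returns true when residency sets
// should be created, i.e. when GGML_METAL_NO_RESIDENCY is absent from the
// environment. Assumption: any value, even an empty one, counts as "set".
bool use_residency_sets(void) {
    return getenv("GGML_METAL_NO_RESIDENCY") == NULL;
}
```

With a check like this, launching a program with `GGML_METAL_NO_RESIDENCY=1` in its environment would skip residency set creation and fall back to the previous behavior.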

ggerganov · 2025-01-26T13:51:15Z

Great news - this change finally resolves the annoying overhead that I was observing. The only remaining question is how to implement this to be compatible with macOS < 15.0.

Any suggestions?

Edit: resolved

ggml-ci

metal : use residency sets

* metal : use residency sets ggml-ci
* metal : restore commandBufferWithUnretainedReferences calls [no ci]
* metal : release descriptors ggml-ci
* metal : check env GGML_METAL_NO_RESIDENCY ggml-ci
* metal : fix build + clean-up ggml-ci

ggerganov force-pushed the gg/metal-residency-sets branch from febb813 to 4dad9fa Compare January 26, 2025 10:39

github-actions bot added the ggml and Apple Metal labels Jan 26, 2025

ggerganov mentioned this pull request Jan 26, 2025

metal : GPU "idle-throttling" analysis #10119

Closed

metal : use residency sets

2674f02

ggml-ci

ggerganov force-pushed the gg/metal-residency-sets branch from 21850f6 to 2674f02 Compare January 26, 2025 14:27

github-actions bot added the build label Jan 26, 2025

ggerganov changed the base branch from gg/idle to master January 26, 2025 14:30

ggerganov added 2 commits January 26, 2025 16:32

metal : restore commandBufferWithUnretainedReferences calls [no ci]

7fb39e3

metal : release descriptors

b9126fe

ggml-ci

ggerganov marked this pull request as ready for review January 26, 2025 14:41

ggerganov added 2 commits January 26, 2025 19:31

metal : check env GGML_METAL_NO_RESIDENCY

9dc5ef4

ggml-ci

metal : fix build + clean-up

225d2e0

ggml-ci

ggerganov merged commit 178a7eb into master Jan 26, 2025
51 checks passed

ggerganov deleted the gg/metal-residency-sets branch January 26, 2025 18:06

Animaxx added a commit to Animaxx/llama.cpp that referenced this pull request Jan 28, 2025

https://github.com/ggerganov/llama.cpp/pull/11427

1b2f685

metal : use residency sets

ggerganov mentioned this pull request Jan 31, 2025

Feature Request: MoE only load activated expert(s) to GPU while rest non-used experts are not loaded (to CPU/GPU) for DeekSeek-R1 Inference on consumer GPU #11532

Open

4 tasks
