[Open-to-community] Benchmark swift-coreml-diffusers on different Mac hardware #31
Comments
I can do the M2 Pro Mac Mini!
Cool, assigned it to you above :)
With default settings, 25 steps:
MacBook Pro 14" with M1 Pro, 16-core GPU, 16 GB RAM, 8 performance cores
Mac Mini with M2 Pro, 16-core GPU, 16 GB RAM, 6 performance cores
Thanks a lot @tcapelle, that's super helpful!
Yeah, we have a simple rule (based on the number of performance cores, which is a good proxy for the rest of the hardware). It looks like it worked on both of your computers, right? (The best option was selected by default.) A couple of questions, if you can.
The ANE+GPU performance is very close on both computers! I'm expecting ANE+GPU to beat just ANE in some of the MBP M2 Pro combinations.
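For illustration, here is a minimal shell sketch of the kind of rule described above. The sysctl key is the one mentioned later in this thread; the threshold and the suggested compute units are assumptions for illustration only, not the app's actual heuristic.

```bash
#!/bin/bash
# Read the number of performance cores (Apple Silicon only).
perf_cores=$(sysctl -n hw.perflevel0.physicalcpu)

# Assumed threshold, purely illustrative: machines with more performance
# cores tend to ship with beefier GPUs, so prefer the GPU there.
if [ "$perf_cores" -ge 8 ]; then
  echo "Suggested compute units: GPU"
else
  echo "Suggested compute units: CPU and Neural Engine"
fi
```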
I used default settings, so it's sd-base-2.0.
I suppose the terminal is ok: ioreg -l | grep gpu-core-count | tail -1 | awk -F"=\ " '{print $NF}' (only produces results on Apple Silicon)
There is also this thingy: /~https://github.com/tlkh/asitop
Oh interesting. This is what they do: /~https://github.com/tlkh/asitop/blob/main/asitop/utils.py#L123
Same config as tcapelle above for comparison. Default settings with 25 steps.
MacBook Pro 14" with M1 Pro, 16-core GPU, 16 GB RAM, 8 performance cores
ANE: 15.2, 15.1, 15.3
Similar results as above, so that's cool.
(In reply to Pedro Cuenca: "Thanks a lot @emwdx! I think the app should have selected the best option (GPU) for you, is that correct?")
It did automatically select the GPU, yes :)
Amazing work Hugging Face team ❤️! Here are mine:
14" MacBook M1 Pro - 14 GPU cores / 6 performance cores - all settings default (SD 2-base)
ANE: 15.2, 15.2, 15.2
14" MacBook M2 Max - 64 GB - 30 coreshw.perflevel0.physicalcpu: 8 Settings
Result
|
Which model should we run for this benchmark?
@julien-c Ideally, the 4 we used in the benchmark: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks. But results seem consistent across models, so most people are just running the default one.
Very interesting test @abazlinton! This is the first time we've seen GPU+ANE beat either GPU or ANE alone. We'll try to improve our heuristics to select that combination by default on those systems. Thank you!
Nice computer @Tz-H! We were very interested to see performance on M2 Max, thanks a lot!
Is it possible to report RAM usage as well? It would be interesting to see how RAM is used and how it affects performance.
Hi @grapefroot! Initially I was under the impression that RAM would be an important factor for performance (it is on iOS), but in our tests we did not notice any difference between 8 GB and 16 GB Macs: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks. Things could be different if the computer is under memory pressure because other apps are running, but I'm not sure how to test for that scenario. How would you go about measuring RAM usage?
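One possible approach, sketched below: sample the app's resident memory with ps while an image generates. The process name here is an assumption; the real app binary may be named differently.

```bash
#!/bin/bash
# Sketch: print the app's resident set size (RSS, in KB) once per second.
# "Diffusers" is an assumed process name; adjust to the actual binary.
pid=$(pgrep -x Diffusers | head -1)
while kill -0 "$pid" 2>/dev/null; do
  ps -o rss= -p "$pid"
  sleep 1
done
```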
MacBook Pro 14-inch, 2023; Apple M2 Pro, 8 P-cores, 4 E-cores, 19-core GPU; 32 GB memory
Model: stable-diffusion-2-base
GPU: 11.0s, 11.1s, 11.0s
Low Power Mode: On
Hi folks, just wanted to throw in a suggestion: I think it would be better to mention in the article that all the tests were made using a SPLIT_EINSUM model. Source: personal tests, with more examples in The Murus Team PromptToImage benchmarks.
@Zabriskije the results in our table were obtained as follows:
@mja – Super interesting, thanks a lot!
@pcuenca I'm a bit confused: isn't the model downloaded within the Diffusers 1.1 app SPLIT_EINSUM?
@Zabriskije We wanted the blog post to be easy, so we decided to hide some details. But yeah, maybe it's worth pointing it out :) Barring bugs, the way the app is meant to work is:
Is this not what's happening in your case?
@pcuenca Yup, it downloads the |
Macbook Pro 14" with M2 Pro 12-Core CPU, 19-Core GPU, 32GB Unified MemoryModel: stable-diffusion-2-base
|
Data point on an Intel Mac: iMac Retina 5K, 2020
Model: stable-diffusion-2-base
Macbook Pro 14" with M2 Max 12-Core CPU, 38-Core GPU, 16-core Neural Engine, 96GB Unified MemoryModel: stable-diffusion-2-1-base
|
MacBook Pro 16" with M1 Pro | CPU: 10 cores (8 performance and 2 efficiency) | GPU: 16 Cores | Memory: 16 GB
|
MacBook Pro 16" with M2 Max 12-Core CPU, 38-Core GPU, 16-core Neural Engine, 64GB Unified MemoryModel: stable-diffusion-2-1-base
|
MacBook Pro 16" with M1 Max 10-core CPU (8P,2E) 24-core GPU 16-core, 11 Tops Neural Engine 32GB Unified MemoryModel: stable-diffusion-2-base
The big story is that GPU took ~9x the power for 1.4x performance over the ANE. I monitored power by running this and eye-balling it: |
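The exact command used above wasn't captured in this thread. One plausible option on macOS (an assumption, not necessarily what the commenter used) is powermetrics, which reports per-subsystem power on Apple Silicon:

```bash
# Stream CPU and GPU power readings every second (requires sudo).
# On Apple Silicon, the cpu_power sampler also includes ANE power.
sudo powermetrics --samplers cpu_power,gpu_power -i 1000
```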
MacBook Pro 14", M2 Max (12-core CPU, 30-core GPU, 16-core ANE), 64 GB RAM
25 steps
Stable Diffusion 2 (base)
Stable Diffusion 1.5
Mac Mini with M2 Pro, 16-core GPU, 16 GB, 6 performance cores, 4 efficiency cores
Don't see an M2 Ultra here yet, so here goes! :)
Mac Studio M2 Ultra
As a rough comparison, I ran the same parameters through the InvokeAI 3.0.2 web UI. On there, I used the SD1.5 model at 512x512 with Euler A and fp32 precision, so it's not a direct apples-to-apples comparison:
So, on ANE or GPU+ANE, InvokeAI wins by ~2.0-2.5s. But on GPU, Diffusers wins by 3.2s! Switching the scheduler to Heun slows InvokeAI to about 13.2s total execution time. Switching to fp16 precision produced more or less the same results. (I'm not sure whether fp16 vs fp32 has any meaning in the mps implementation.)
@pcuenca
Apple M2 Ultra with 24-core CPU, 60-core GPU, and 32-core Neural Engine
Model: stable-diffusion-2-1-base
GPU: 4.5 / 4.5 / 4.5
Hi there! What about a cheap M3?
Guidance Scale: 7.5
Model: stable-diffusion-2-base:
Model: stable-diffusion-2-1-base:
The ANE performance is quite amazing. I'd be curious to know the settings of the KSampler in "diffusers". Would it be possible to use a CoreML SDXL model in ComfyUI to get better performance? Thanks!
Just made some tests with ComfyUI. SDXL, cfg 7.0, 30 steps, dpmpp-2M Karras.
SDXL CoreML Sampler: 200 / 211 seconds
So CoreML is about 25% faster than the "efficient" KSampler. Pretty cool. But I noticed that while the output is very similar, it's not exactly the same. CoreML Sampler vs KSampler eff., same settings:
Hey hey,
We are on a mission to provide a first-class, one-click solution for blazingly fast diffusers inference on Mac. To get a better idea of our framework, we'd like to gather inference time benchmarks for the app. Currently, we are explicitly looking for benchmarks on:
You can do so by following the steps below:
- Select your settings under Advanced.
- Use the prompt: A Labrador playing in the fields.
Note: do make sure to run inference multiple times, as the framework sometimes needs to prepare the weights in order to run in the most efficient way possible.
Ping @pcuenca or @Vaibhavs10 with any questions!
Happy diffusing 🧨