[Open-to-community] Benchmark swift-coreml-diffusers on different Mac hardware #31
Comments
I can do the M2 Pro Mac Mini!
Cool, assigned it to you above :)
With default settings, 25 steps:
MacBook Pro 14" with M1 Pro, 16-core GPU, 16 GB RAM, 8 performance cores
Mac Mini with M2 Pro, 16-core GPU, 16 GB RAM, 6 performance cores
Thanks a lot @tcapelle, that's super helpful!
Yeah, we have a simple rule (based on the number of performance cores, which is a good proxy for the rest of the hardware). It looks like it worked on both of your computers, right? (The best option was selected by default.) A couple of questions, if you can.
The ANE+GPU performance is very close on both computers! I'm expecting ANE+GPU to beat just ANE in some of the MBP M2 Pro combinations.
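For illustration, here is a minimal shell sketch of the kind of rule described above. The sysctl key is the one mentioned later in this thread; the threshold and the suggested compute units are assumptions for illustration only, not the app's actual heuristic.

```bash
#!/bin/bash
# Read the number of performance cores (Apple Silicon only).
perf_cores=$(sysctl -n hw.perflevel0.physicalcpu)

# Assumed threshold, purely illustrative: machines with more performance
# cores tend to ship with beefier GPUs, so prefer the GPU there.
if [ "$perf_cores" -ge 8 ]; then
  echo "Suggested compute units: GPU"
else
  echo "Suggested compute units: CPU and Neural Engine"
fi
```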
I used default settings, so it's sd-base-2.0.
I suppose the terminal is ok: ioreg -l | grep gpu-core-count | tail -1 | awk -F"=\ " '{print $NF}' (only produces results on Apple Silicon)
There is also this thingy: /~https://github.com/tlkh/asitop
Oh interesting. This is what they do: /~https://github.com/tlkh/asitop/blob/main/asitop/utils.py#L123
Same config as tcapelle above for comparison. Default settings with 25 steps.
MacBook Pro 14" with M1 Pro, 16-core GPU, 16 GB RAM, 8 performance cores
ANE: 15.2, 15.1, 15.3
Similar results as above, so that's cool.
(In reply to Pedro Cuenca: "Thanks a lot @emwdx! I think the app should have selected the best option (GPU) for you, is that correct?")
It did automatically select the GPU, yes :)
Amazing work Hugging Face team ❤️! Here are mine:
14" MacBook M1 Pro - 14 GPU cores / 6 performance cores - all settings default (SD 2-base)
ANE: 15.2, 15.2, 15.2
14" MacBook M2 Max - 64 GB - 30 coreshw.perflevel0.physicalcpu: 8 Settings
Result
|
Which model should we run for this benchmark?
@julien-c Ideally, the 4 we used in the benchmark: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks. But results seem consistent across models, so most people are just running the default one.
Very interesting test @abazlinton! This is the first time we've seen GPU+ANE beat either GPU or ANE alone. We'll try to improve our heuristics to select that combination by default on those systems. Thank you!
Nice computer @Tz-H! We were very interested to see performance on M2 Max, thanks a lot!
Is it possible to report RAM usage as well? It would be interesting to see how RAM is used and how it affects performance.
Hi @grapefroot! Initially I was under the impression that RAM would be an important factor for performance (it is on iOS), but in our tests we did not notice any difference between 8 GB and 16 GB Macs: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks. Things could be different if the computer is under memory pressure because other apps are running, but I'm not sure how to test for that scenario. How would you go about measuring RAM usage?
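One possible approach, sketched below: sample the app's resident memory with ps while an image generates. The process name here is an assumption; the real app binary may be named differently.

```bash
#!/bin/bash
# Sketch: print the app's resident set size (RSS, in KB) once per second.
# "Diffusers" is an assumed process name; adjust to the actual binary.
pid=$(pgrep -x Diffusers | head -1)
while kill -0 "$pid" 2>/dev/null; do
  ps -o rss= -p "$pid"
  sleep 1
done
```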
MacBook Pro 14-inch, 2023; Apple M2 Pro, 8 P-cores, 4 E-cores, 19-core GPU; 32 GB memory
Model: stable-diffusion-2-base
GPU: 11.0s, 11.1s, 11.0s
Low Power Mode: On
Hi folks, just wanted to throw in a suggestion: I think it would be better to mention in the article that all the tests were made using a SPLIT_EINSUM model. Source: personal tests, with more examples in The Murus Team PromptToImage benchmarks.
@Zabriskije the results in our table were obtained as follows:
@mja – Super interesting, thanks a lot!
@pcuenca I'm a bit confused: isn't the model downloaded within the Diffusers 1.1 app SPLIT_EINSUM?
@Zabriskije We wanted the blog post to be easy, so we decided to hide some details. But yeah, maybe it's worth pointing it out :) Barring bugs, the way the app is meant to work is:
Is this not what's happening in your case?
@pcuenca Yup, it downloads the |
Macbook Pro 14" with M2 Pro 12-Core CPU, 19-Core GPU, 32GB Unified MemoryModel: stable-diffusion-2-base
|
Data point on an Intel Mac: iMac Retina 5K, 2020
Model: stable-diffusion-2-base
Macbook Pro 14" with M2 Max 12-Core CPU, 38-Core GPU, 16-core Neural Engine, 96GB Unified MemoryModel: stable-diffusion-2-1-base
|
MacBook Pro 16" with M1 Pro | CPU: 10 cores (8 performance and 2 efficiency) | GPU: 16 Cores | Memory: 16 GB
|
MacBook Pro 16" with M2 Max 12-Core CPU, 38-Core GPU, 16-core Neural Engine, 64GB Unified MemoryModel: stable-diffusion-2-1-base
|
MacBook Pro 16" with M1 Max 10-core CPU (8P,2E) 24-core GPU 16-core, 11 Tops Neural Engine 32GB Unified MemoryModel: stable-diffusion-2-base
The big story is that GPU took ~9x the power for 1.4x performance over the ANE. I monitored power by running this and eye-balling it: |
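The exact command used above wasn't captured in this thread. One plausible option on macOS (an assumption, not necessarily what the commenter used) is powermetrics, which reports per-subsystem power on Apple Silicon:

```bash
# Stream CPU and GPU power readings every second (requires sudo).
# On Apple Silicon, the cpu_power sampler also includes ANE power.
sudo powermetrics --samplers cpu_power,gpu_power -i 1000
```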
MacBook Pro 14", M2 Max (12-core CPU, 30-core GPU, 16-core ANE), 64 GB RAM
25 steps
Stable Diffusion 2 (base)
Stable Diffusion 1.5
Mac Mini with M2 Pro, 16-core GPU, 16 GB, 6 performance cores, 4 efficiency cores
Don't see an M2 Ultra here yet, so here goes! :)
Mac Studio M2 Ultra
As a rough comparison, I ran the same parameters through the InvokeAI 3.0.2 web UI. On there, I used the SD1.5 model at 512x512 with Euler A and fp32 precision, so it's not a direct apples-to-apples comparison:
So, on ANE or GPU+ANE, InvokeAI wins by ~2.0-2.5s. But on GPU, Diffusers wins by 3.2s! Switching the scheduler to Heun slows InvokeAI to about 13.2s total execution time. Switching to fp16 precision produced more or less the same results. (I'm not sure whether fp16 vs fp32 has any meaning in the mps implementation.)
@pcuenca
Apple M2 Ultra with 24-core CPU, 60-core GPU, and 32-core Neural Engine
Model: stable-diffusion-2-1-base
GPU: 4.5 / 4.5 / 4.5
Hi there! What about a cheap M3?
Guidance Scale: 7.5
Model: stable-diffusion-2-base:
Model: stable-diffusion-2-1-base:
The ANE performance is quite amazing. I'd be curious to know the settings of the KSampler in "diffusers". Would it be possible to use a CoreML SDXL model in ComfyUI to get better performance? Thanks!
Just made some tests with ComfyUI. SDXL, cfg 7.0, 30 steps, dpmpp-2M Karras.
SDXL CoreML Sampler: 200 / 211 seconds
So CoreML is about 25% faster than the "efficient" KSampler. Pretty cool. But I noticed that while the output is very similar, it's not exactly the same. CoreML Sampler vs KSampler eff., same settings:
Hey hey,
We are on a mission to provide a first-class, one-click solution for blazingly fast diffusers inference on Mac. To get a better idea of our framework, we'd like to gather inference time benchmarks for the app. Currently, we are explicitly looking for benchmarks on:
You can do so by following the steps below:
- Select your settings under Advanced.
- Use the prompt: A Labrador playing in the fields.
Note: do make sure to run inference multiple times, as the framework sometimes needs to prepare the weights in order to run in the most efficient way possible.
Ping @pcuenca or @Vaibhavs10 with any questions!
Happy diffusing 🧨