
TzunamiOSX

macrumors 65816
Original poster
Oct 4, 2009
1,067
445
Germany
I am playing a bit with Automatic1111 Stable Diffusion. At the moment, A1111 is running on an M1 Mac mini under Big Sur.

The performance is not very good.
(Screenshot attachment: Bildschirmfoto 2023-08-31 um 04.20.27.png)

A picture with these settings needs around 5 min 47 sec (10.86 s/it). My Mac Pro running Windows with an old Titan X gives me a picture every 40 seconds.

It looks like the GPU and Neural Engine are not being used. Would I get better speed with a newer system?

Launch settings are: ./webui.sh --skip-torch-cuda-test --precision full --no-half --medvram --opt-sub-quad-attention
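Worth noting: --precision full --no-half forces everything into fp32, which is typically slow on the MPS backend. A hedged sketch of an alternative launch, based on the flags A1111's own webui-macos-env.sh ships with; treat the exact combination as an experiment, not a recommendation:

```shell
# Assumption: a recent A1111 checkout on Apple Silicon.
# --upcast-sampling lets most of the model run in half precision;
# the env var falls back to CPU for ops the MPS backend lacks.
export PYTORCH_ENABLE_MPS_FALLBACK=1
./webui.sh --skip-torch-cuda-test --upcast-sampling --no-half-vae \
           --medvram --opt-sub-quad-attention
```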
 
Some quick feedback...

Same Automatic1111 Stable Diffusion install with the same settings.
I have updated the system to Ventura and now get better results:

Big Sur, Standard A1111: 5 min. 47 sec. (10.86s/it)
Ventura, Standard A1111: 1 min. 40 sec. (3.14s/it)

Ventura, A1111 from Slartibart's link: 1 min. 19 sec. (2.64s/it)
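As a sanity check, the wall-clock times above line up with the reported s/it rate multiplied by the step count (32 sampling steps, as listed later in the thread). A minimal sketch, using only the numbers quoted above:

```python
def total_time(s_per_it: float, steps: int = 32) -> float:
    """Rough generation time in seconds: seconds-per-iteration x sampling steps."""
    return s_per_it * steps

# Big Sur: 10.86 s/it -> 347.52 s, i.e. about 5 min 47 s
big_sur = total_time(10.86)
# Ventura: 3.14 s/it -> 100.48 s, i.e. about 1 min 40 s
ventura = total_time(3.14)
# Speedup from the OS update alone:
print(round(big_sur / ventura, 1))  # -> 3.5
```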

Looks like the Neural Engine is not in use, only the GPU.
 
That's pretty dang slow. Have you tried Draw Things from the App Store? Although I suppose Draw Things also uses a similar back end.
 
That's pretty dang slow. Have you tried Draw Things from the App Store? Although I suppose Draw Things also uses a similar back end.
What is your system, and how long does it take?

EDIT: Draw Things takes 1:42 under Monterey
 
M1 mini (base, not Pro, not Max), 16 GB RAM, Ventura. 512x768 at 32 steps takes 1:16.
This is the MPS optimization from Ventura. Monterey is a little slower. My personal problem is that my Mac Pro from 2008 with an old Titan X (Maxwell) is 2 to 3 times faster, and my 2010 Mac Pro with a Vega Frontier is at nearly the same speed as the M1 under Windows. I'm sure the 5,1 is faster under macOS (based on tests with my Mac Pro 2013), but I don't have OpenCore on it.

Exact values coming when I'm back in my room.
 
This is the MPS optimization from Ventura. My personal problem is that my Mac Pro from 2008 with an old Titan X (Maxwell) is 2 to 3 times faster, and my 2010 Mac Pro with a Vega Frontier is at nearly the same speed under Windows. I'm sure the 5,1 is faster under macOS (based on tests with my Mac Pro 2013), but I don't have OpenCore on it.
Yeah, SD on Apple Silicon is way behind PCs with discrete GPUs, even older Macs with older GPUs, let alone anything like a 4070.
 
32 Sampling steps, 512x768

Mac Pro 2008, Windows 10, Titan X (Maxwell) 12 GB
(Screenshot attachment: Bildschirmfoto 2023-09-05 um 04.16.25.png)


Mac Pro 2010, macOS Monterey, Vega Frontier 16 GB
(Screenshot attachment: 2010.png)


Mac Mini M1, macOS Monterey
(Screenshot attachment: Bildschirmfoto 2023-09-05 um 05.16.06.png)


Mac Pro 2013, macOS Monterey, 1x D500, 3 GB
(Screenshot attachment: Ohne Titel.png)


Mac Pro 2013, Windows 11, 1x D500, 3 GB
Over 20 minutes; I'll run a proper test when I have the time.
 
That doesn't look too bad. The Titan X is a dedicated GPU with a 250 W TDP running CUDA code that has been optimized over the arc of a decade. The M1 GPU is what, 15 watts?
 
That doesn't look too bad. The Titan X is a dedicated GPU with a 250 W TDP running CUDA code that has been optimized over the arc of a decade. The M1 GPU is what, 15 watts?

This is a GPU from 2015 on a 28 nm process, so the comparison doesn't look so good.
 
I have added my Mac Pro 2010, which is also faster than my M1. On my 6,1 I can't finish the picture under Windows because I get a low-VRAM error.
 
Is there anyone here with an RX 5700 or newer on an Intel Mac who can run Automatic1111? I hope to see results from newer GPUs.
 
Hey folks! I am a passionate DrawThings and Auto1111 WebUI user, alongside Blender 4.2 and DaVinci Resolve. I would like to keep this topic alive and collect more comparisons, especially between the M1-M4 Max chips. I have the 16" with the 24-core M1 Max / 32 GB and am considering an upgrade to the 30-core/36 GB M3 Max (cheaper) or the 32-core/36 GB M4 Max (much better NPU, and the faster H.264/H.265 encoders could benefit me in Resolve).

I have two setups: DrawThings with an SD1.5 model and Auto1111 WebUI with an SDXL-based one, because I couldn't get that SDXL one running in DrawThings, eh. Also, I am aware that Flux is the newest ****, but I prefer to stay with the tools I know and love, especially for anime and furry stuff. Here are some results:

Auto1111 WebUI, SDXL (ChromaMixXL), Euler A (Beta), 25 Steps:

≈ 2:00min per image for 1024x1536 (regular power)
≈ 2:40min per image for 1024x1536 (low power mode)


DrawThings, SD1.5 (IndigoFurryMix), DPM++ 2M Karras, 20 Steps:

≈ 0:33min per image for 768x1152 (regular power)
≈ 0:40min per image for 768x1152 (low power)

≈ 1:00min per image for 1024x1536 (regular power)
≈ 1:20min per image for 1024x1536 (low power)


Package power draw for both apps and models is 30-35W in regular and 18-20W in low power mode. When doing multiple renders / batches, the fans start to spin in regular mode and stay around 40-50%, while remaining almost idle in low power mode. That makes "rendering overnight" pretty convenient.

I would be interested in what the newer Max chips, especially the binned ones, use in package power during Stable Diffusion rendering on the GPU. You can check it with TG Pro or the terminal command sudo powermetrics, for anyone who would like to join this thread!
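For anyone joining in, here is the powermetrics invocation sketched out (flag names from the macOS man page; it needs sudo, and the available samplers differ between Intel and Apple Silicon Macs):

```shell
# Sample GPU power once per second, five samples (macOS only, requires sudo)
sudo powermetrics --samplers gpu_power -i 1000 -n 5
```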
How do you use low power mode with A1111?


Take a look at webui-forge.
The interface is nearly the same as A1111, but it is more optimized, has better GPU settings, and also supports Flux.

Here is the interface:
(Screenshot attachment: Bildschirmfoto 2025-01-14 um 18.27.52.png)
 