
leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
Apple has just released a framework for using Stable Diffusion models on Apple Silicon. It includes tools for converting the models to Core ML (Apple's ML framework) as well as libraries to use them. I haven't looked into this yet, but they claim some fairly impressive performance gains, along with a dramatic reduction in required RAM.

I hope someone will be quick to build an interface and a one-click installer for ease of use.

You can check it out here: https://machinelearning.apple.com/research/stable-diffusion-coreml-apple-silicon
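For anyone who wants to poke at it, the repo's workflow is a two-step affair: convert the PyTorch weights to Core ML packages, then run the bundled pipeline against them. A minimal sketch, assuming the command-line entry points and flags described in the repo's README at the time of writing (the output directory, prompt, and seed are placeholders):

```python
# Minimal sketch of the two-step apple/ml-stable-diffusion workflow.
# Flag names follow the repo's README and may change between versions.
import subprocess

MODEL_DIR = "./coreml-sd"   # placeholder output directory for converted models
IMAGE_DIR = "./sd-output"   # placeholder directory for generated images

# Step 1: convert the PyTorch Stable Diffusion components to Core ML packages.
subprocess.run([
    "python", "-m", "python_coreml_stable_diffusion.torch2coreml",
    "--convert-unet", "--convert-text-encoder",
    "--convert-vae-decoder", "--convert-safety-checker",
    "-o", MODEL_DIR,
], check=True)

# Step 2: generate an image with the converted models.
subprocess.run([
    "python", "-m", "python_coreml_stable_diffusion.pipeline",
    "--prompt", "a photo of an astronaut riding a horse on mars",
    "-i", MODEL_DIR,
    "-o", IMAGE_DIR,
    "--compute-unit", "ALL",   # ALL, CPU_AND_GPU, or CPU_AND_NE
    "--seed", "93",
], check=True)
```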
 
  • Love
  • Like
Reactions: Tagbert and Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.
[Attached image: stable-diffusion-coreml.png]

  • The image generation procedure follows the standard configuration: 50 inference steps, 512x512 output image resolution, 77 text token sequence length, classifier-free guidance (batch size of 2 for unet).
  • Weights and activations are in float16 precision for both the GPU and the ANE.
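
For context, the "batch size of 2 for unet" comes from classifier-free guidance: every denoising step runs the UNet once on an unconditional embedding and once on the text embedding, then blends the two noise predictions. A rough sketch of that step in generic Stable Diffusion terms (not Apple's code; the `unet` callable and `guidance_scale` value are stand-ins):

```python
import numpy as np

def cfg_denoise_step(unet, latents, timestep, text_emb, uncond_emb, guidance_scale=7.5):
    """One classifier-free-guidance step: the UNet sees a batch of 2 latents."""
    latent_batch = np.concatenate([latents, latents], axis=0)   # batch size 2
    emb_batch = np.concatenate([uncond_emb, text_emb], axis=0)
    noise_uncond, noise_text = np.split(
        unet(latent_batch, timestep, emb_batch), 2, axis=0
    )
    # Push the prediction away from the unconditional estimate.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```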

It seems that the neural engine and the GPU don't generate the same image. I wonder which one is faster.
[Attached image: ne-gpu.png]


 
  • Like
Reactions: gank41

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.

They have benchmarks for a wide range of hardware. Choosing low-end machines for that table is probably deliberate to show off that image generation is viable even on base M1.
 
  • Like
Reactions: Tagbert

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
They have benchmarks for a wide range of hardware. Choosing low-end machines for that table is probably deliberate to show off that image generation is viable even on base M1.
You may be right, but this would be one of the few times that Apple doesn't use best-in-class hardware. For reference, for the benchmark in PyTorch's press release on Apple Silicon support, Apple used "production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, 128GB of RAM, and 2TB SSD."

Which M1 MacBook Pro did Apple use for benchmarking? Why is it slower than an M2 MacBook Air?
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
You may be right, but this would be one of the few times that Apple doesn't use best-in-class hardware. For reference, for the benchmark in PyTorch's press release on Apple Silicon support, Apple used "production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, 128GB of RAM, and 2TB SSD."

Which M1 MacBook Pro did Apple use for benchmarking? Why is it slower than an M2 MacBook Air?
Apple claims a 40% improvement for the M2's ANE versus the M1's.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
It seems that the neural engine and the GPU don't generate the same image. I wonder which one is faster.
[Attached image: benchmarks.png]

  • In the benchmark table, we report the best performing --compute-unit and --attention-implementation values per device. The former does not modify the Core ML model and can be applied during runtime. The latter modifies the Core ML model. Note that the best performing compute unit is model version and hardware-specific.
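
For what it's worth, the --compute-unit choice maps to a plain load-time option in Core ML, which is why it doesn't require touching the converted model. A minimal coremltools sketch (the .mlpackage path here is made up):

```python
import coremltools as ct

# Selecting a compute unit does not modify the .mlpackage on disk; it only
# tells Core ML where it may schedule the work when the model is loaded.
unet = ct.models.MLModel(
    "coreml-sd/Stable_Diffusion_unet.mlpackage",   # hypothetical path
    compute_units=ct.ComputeUnit.CPU_AND_NE,       # or ALL / CPU_AND_GPU
)
```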
 
  • Like
Reactions: jdb8167

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
Apple has chosen peculiar hardware for benchmarking. It usually chooses best-in-class hardware.
Obviously, in this case Apple hardware is not in the same league as best-in-class hardware. I am surprised they decided to do any comparisons at all.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
Obviously, in this case Apple hardware is not in the same league as best-in-class hardware. I am surprised they decided to do any comparisons at all.

Not at all. In fact, these results are pretty damn close to big desktop GPUs.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
In fact, these results are pretty damn close to big desktop GPUs.
I don't consider almost twice as fast to be "pretty damn close". What GPU do you have in mind?

[Attached image: 62plg1ina4s91.png]


Are there any other benchmarks for the neural engine? I find it very interesting that the M2's neural engine is as fast as a 24-core M1 GPU in inference.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
I don't consider almost twice as fast to be "pretty damn close". What GPU do you have in mind?

The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds. I would say this is pretty damn close given the obvious performance and power usage disparity between these two GPUs. Sure, in half-precision mode the 3090 is twice as fast. But then the 3090 has FP16 tensor cores with a throughput of 285 TFLOPS, while the M1 Ultra is a 20 TFLOPS GPU. Frankly, I find it really remarkable that an M2 can generate an image in under 25 seconds. That should be close to a (desktop) RTX 3050. Only, of course, the M2 will have more usable RAM for these workloads...
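
A quick back-of-the-envelope using the figures in this thread (the nominal TFLOPS values are approximate spec-sheet numbers, and FP16 tensor throughput isn't directly comparable to FP32 shader throughput, so treat this purely as an illustration of the gap between paper specs and delivered time-to-image):

```python
# Seconds per image (from the benchmarks discussed above) vs. nominal TFLOPS
# (approximate spec-sheet values). The RTX 3090 FP16 time is derived from the
# "twice as fast as FP32" observation above. Purely illustrative arithmetic.
configs = {
    "M1 Ultra, FP16 weights (~20 TFLOPS FP32 shader)":   (9.0, 20.0),
    "RTX 3090, FP32 pipeline (~36 TFLOPS FP32 shader)":  (8.0, 36.0),
    "RTX 3090, FP16 pipeline (~285 TFLOPS FP16 tensor)": (4.0, 285.0),
}
for name, (secs, tflops) in configs.items():
    images_per_second_per_tflop = 1.0 / (secs * tflops)
    print(f"{name}: {secs:.0f} s/image, "
          f"{images_per_second_per_tflop:.4f} images/s per TFLOP")
```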
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds.
Do you have any links to support these figures?

By the way, it doesn't look good when you have to cherry-pick the benchmark to make them look comparable. Please, let's leave the cherry-picking of benchmarks to corporate marketing and loyal customers.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
Do you have any links to support these figures?

The numbers are literally in the tables you have quoted in your posts.

By the way, it doesn't look good when you have to cherry-pick the benchmark to make them look comparable. Please, let's leave the cherry-picking of benchmarks to corporate marketing and loyal customers.

How is what I wrote cherry-picking? I am simply commenting on the capabilities of the respective devices. It is absolutely clear that the 3090 is twice as fast, which is also what I wrote. But it is really impressive that Apple can deliver very good performance using so little power and despite such a huge nominal spec discrepancy.

Where this becomes relevant is when you look at the lower-end hardware. Even the base M2 is actually usable for this kind of work. It is a shame that the repo you quote only benchmarked high-end desktop GPUs. I'd love to see the results for the mobile RTX series.

P.S. The results also show that dual-rate FP16 and ML-specialized FP representations make a big difference. Apple GPUs could see a huge boost here if Apple implements these things in their GPU ALUs.
 
Last edited:
  • Like
Reactions: KingofGotham1

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
The numbers are literally in the tables you have quoted in your posts.
Apple did the benchmarks in FP16, not in FP32.

How is what I wrote cherry-picking?
To support your claim:
In fact, these results are pretty damn close to big desktop GPUs.
Of the two benchmarks (FP16 and FP32) you could have chosen, you chose the one that makes the Ultra perform similarly to the RTX 3090.
The RTX 3090 (in FP32 mode) needs 8 seconds, the M1 Ultra needs 9 seconds. I would say this is pretty damn close

It's amazing what Apple has done with their neural engine, GPU, and Core ML, but they don't need us to hype their performance.

By the way, this graph has older Nvidia GPUs.
[Attached image: chartthin.png]

 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,674
Apple did the benchmarks in FP16, not in FP32.

Apple GPUs do not have a separate, faster data path for FP16. They perform FP16 and FP32 calculations at the same speed*. Nvidia GPUs can perform FP16 calculations twice as fast as FP32 calculations.

*The AMX and the CPU can perform FP16 calculations faster, but I doubt they contribute even 20% of the performance we see here.

Of the two benchmarks (FP16 and FP32) you could choose, you chose the benchmark that makes Ultra perform similar to the RTX 3090.

Again, I was commenting on the device performance when performing a comparable amount of work. Sorry if you thought that I was cherry-picking.

My point is that Apple GPUs are still very young. The current iteration hasn't changed much from the mobile GPU in the iPhone. Other GPUs implement a bunch of tricks (dual-rate FP16, dual-issue instructions, etc.) that Apple has not implemented yet but that would be relatively straightforward to add. What interests me here is the performance potential of an immature product compared to a very mature one that already implements every possible and impossible trick in the book.
 