
AirpodsNow

macrumors regular
Original poster
Aug 15, 2017
224
145
I went down a rabbit hole about local LLMs. It seems that the new Apple Silicon is especially well suited to running these, due to the unified memory (you can assign a large amount of 'RAM' to the GPU) and higher-than-typical-PC memory bandwidth.

These days it's rather easy to use these LLMs without touching the terminal, with apps like LM Studio (text) and DiffusionBee (images). Within the apps you can download general or specialized models to create content or respond to your requests. It has been rather fun to try out. (It's like ChatGPT, but you can choose a model that is especially suited to a particular coding language, or to creating pictures that are as lifelike as possible, or focused on real animals.)
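LM Studio can also expose whatever model you have loaded through a local OpenAI-compatible server, so you can script against it without any cloud API. A minimal sketch, assuming the server is enabled on its default port (1234) and a model is already loaded in the app (the model name below is just a placeholder):

```python
import requests

# Query LM Studio's local OpenAI-compatible server (default: localhost:1234).
# Assumes the server is enabled in the app and a model is already loaded;
# "local-model" is a placeholder name.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "Explain unified memory in one sentence."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```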

I am trying to swap my 16" for a 14" MBP. Initially I was trying to figure out whether to go for a Pro or a Max with 30GB+ of RAM. Now, knowing that these models can be quite useful running locally, I am suddenly more focused on the highest bandwidth and as much RAM as I can/want to afford.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Yes, it's fun to play with but not a very good experience.

Yes, unified memory is an advantage for LLMs, but if you want to run a bigger model such as a 70B, it's quite slow at 5-8 tokens/s, with a substantial time to first token.
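That 5-8 tokens/s figure matches a rough bandwidth estimate: every generated token has to stream the (quantized) weights through memory, so bandwidth divided by model size is about the ceiling. A quick sketch of the arithmetic, assuming a ~40 GB 4-bit 70B model on a 400 GB/s Max-class chip:

```python
# Back-of-envelope decode speed: each token streams the quantized weights once,
# so memory bandwidth / model size is roughly the upper bound on tokens/s.
bandwidth_gb_s = 400   # assumed Max-class memory bandwidth
model_size_gb = 40     # ~70B parameters at 4-bit quantization
print(f"~{bandwidth_gb_s / model_size_gb:.0f} tokens/s ceiling")  # ~10; 5-8 observed is in line
```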

If you want to run a smaller model, well, it's not as good.

Hopefully Apple is planning much more support for local LLMs in the future, with better software (the MLX backend is a start) and more chip-level optimization for LLMs.
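For what it's worth, the MLX side is already usable from Python today. A minimal sketch, assuming the mlx-lm package is installed (`pip install mlx-lm`) and using an example 4-bit community conversion as the model name (swap in whichever one you actually want):

```python
from mlx_lm import load, generate

# Minimal sketch with mlx-lm; the repo name is just an example 4-bit conversion.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Why does unified memory help local LLMs?", max_tokens=128)
print(text)
```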
 
  • Like
Reactions: AirpodsNow

TechnoMonk

macrumors 68030
Oct 15, 2022
2,603
4,110
Llama 70B runs great on my 64 GB M1 Max. My Linux workstation with a 4090 runs out of memory unless you use a quantized model. DiffusionBee is a gimmick, pretty useless.
 
  • Like
Reactions: AirpodsNow

AirpodsNow

macrumors regular
Original poster
Aug 15, 2017
224
145
Llama 70B runs great on my 64 GB M1 Max. My Linux workstation with a 4090 runs out of memory unless you use a quantized model. DiffusionBee is a gimmick, pretty useless.
I will be looking for a MacBook with at least 64GB, knowing that this is rather a hard limit on what models you can run.

Is DiffusionBee a gimmick because it's limited in parameters and use? Can you explain? I also downloaded Draw Things, which was mentioned quite frequently.
 

AirpodsNow

macrumors regular
Original poster
Aug 15, 2017
224
145
Yes, it's fun to play with but not a very good experience.

Yes, unified memory is an advantage for LLMs, but if you want to run a bigger model such as a 70B, it's quite slow at 5-8 tokens/s, with a substantial time to first token.

If you want to run a smaller model, well, it's not as good.

Hopefully Apple is planning much more support for local LLMs in the future, with better software (the MLX backend is a start) and more chip-level optimization for LLMs.
People are expecting some kind of feature in the M4 that would help. I was wondering what they can do besides introducing higher SoC bandwidth? Isn't that really the main bottleneck (the amount of RAM and GPU cores is down to the customer's wallet size)?
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
People are expecting some kind of feature in the M4 that would help. I was wondering what they can do besides introducing higher SoC bandwidth? Isn't that really the main bottleneck (the amount of RAM and GPU cores is down to the customer's wallet size)?
I doubt there's any new feature in M4 that would help. I think Apple just boosted the NPU, which is not normally used for larger local LLMs.

I would like to see if there's a way to combine the AMX, NPU, and GPU units in the SoC to do LLM inference.
 
  • Like
Reactions: jdb8167

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Llama 70B runs great on my 64 GB M1 Max. My Linux workstation with a 4090 runs out of memory unless you use a quantized model. DiffusionBee is a gimmick, pretty useless.
What's time to first token like? And tokens/second? What quant?
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,603
4,110
What's time to first token like? And tokens/second? What quant?
The 4090 has a 24 GB memory limitation. It's my secondary machine now. Just hoping Apple comes out with at least 256-512 GB of RAM for the M4 Max/M4 Ultra. For most of my use cases, tokens/sec on the 4090 is a big zero.
 

TechnoMonk

macrumors 68030
Oct 15, 2022
2,603
4,110
I will be looking for a MacBook with at least 64GB, knowing that this is rather a hard limit on what models you can run.

Is DiffusionBee a gimmick because it's limited in parameters and use? Can you explain? I also downloaded Draw Things, which was mentioned quite frequently.
DiffusionBee is not good, qualitatively speaking. 64 GB should be good enough for experimenting with open source LLMs and vision models. I am just waiting for the M4 Max or even the M5 Max to get to 256 GB. That will save me a ton of money for my custom model inference testing. Apple still needs to add Neural Engine support to MLX, but it's not a showstopper, as I don't use the ANE.
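A rough rule of thumb for what fits: the weights alone take about (parameters × bits per weight) / 8 in bytes, plus headroom for the KV cache and the OS. A quick sketch of that arithmetic (the model sizes are just examples):

```python
# Rough weight-memory estimate in GB: params (billions) * bits per weight / 8.
def weight_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8

for params_b in (8, 34, 70):
    for bits in (4, 8):
        print(f"{params_b}B @ {bits}-bit ≈ {weight_gb(params_b, bits):.0f} GB of weights")
# 70B @ 4-bit ≈ 35 GB, which is why it fits in 64 GB of unified memory
# but not on a 24 GB 4090 without heavy quantization or offloading.
```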
 
  • Like
Reactions: AirpodsNow

AirpodsNow

macrumors regular
Original poster
Aug 15, 2017
224
145
64 GB should be good enough for experimenting with open source LLMs and vision models.
Do you think that will stay true given where open source LLMs are heading? Since I am new to the subject, I did read some posts saying that certain model sizes that were great for 32GB RAM configs are slowly disappearing. Is that also true for the 'next' level up?

Also, when you say 64GB, do you mean 64GB purely for the LLM, or the ~70% of it that macOS seems to assign as a maximum to the GPU?
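(From what I've read, that cap can at least be inspected, and reportedly raised, with a sysctl. A quick sketch of what I mean, assuming the iogpu.wired_limit_mb key that newer macOS versions use, where 0 means "use the built-in default":)

```python
import subprocess

# Check the GPU wired-memory cap on Apple Silicon macOS.
# iogpu.wired_limit_mb is the key on recent macOS versions (an assumption for
# older releases); 0 means the built-in default, roughly the ~70% figure.
out = subprocess.run(["sysctl", "iogpu.wired_limit_mb"], capture_output=True, text=True)
print(out.stdout.strip())
# Reportedly it can be raised (at your own risk) with e.g.:
#   sudo sysctl iogpu.wired_limit_mb=57344   # ~56 GB on a 64 GB machine
```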

I would be happy to get 64GB if that is enough, because it seems that on the current M3 Max the next option up is 128GB, whereas the 96GB config sits on the 'slower' 300GB/s bandwidth. I assume something similar will apply to the M4 Max.
 

Algr

macrumors 6502a
Jul 27, 2022
526
786
Earth (mostly)
I will be looking for a MacBook with at least 64GB, knowing that this is rather a hard limit on what models you can run.

Is DiffusionBee a gimmick because it's limited in parameters and use? Can you explain? I also downloaded Draw Things, which was mentioned quite frequently.

DiffusionBee and Draw Things both come with a default model that is pretty useless. That frustrated me for a while. But you can get far better results by downloading better models into those programs. Flux (Schnell) is the best right now. Some of the SD 1.5 models are good for specific tasks.
 
  • Like
Reactions: gank41

AirpodsNow

macrumors regular
Original poster
Aug 15, 2017
224
145
DiffusionBee and Draw Things both come with a default model that is pretty useless. That frustrated me for a while. But you can get far better results by downloading better models into those programs. Flux (Schnell) is the best right now. Some of the SD 1.5 models are good for specific tasks.
I'm using the Flux model now. I am just using the app as an easy GUI, similar to LM Studio, although it seems more complicated with Transformer Lab. I wonder if there is also a GUI app for audio that can load local models.
 