
JSRinUK

Original poster
I thought I would do some LLM testing on my new M4 Mac Studio today - my main interest being the largest LLM I could load up and actually run.

For testing, I decided on c4ai-command-r-plus-08-2024-Q6_K. This is a 104B parameter LLM. Quantised to Q6_K brings it in at about 85GB (LM Studio says 80GB). I downloaded this thinking that (i) it may not work and (ii) if it did, it’d be very slow.
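As a rough sanity check on the file size: Q6_K works out to roughly 6.56 bits per weight (if I've got the GGUF numbers right), so 104B × 6.56 ÷ 8 ≈ 85 GB - or about 79 GiB, which is presumably where LM Studio's 80GB figure comes from.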

I’m using LM Studio for this. I did try to download a similar model in Ollama (and Msty), but it just wouldn’t download. The progress would crawl from 1.5% to 2%, drop back to 1.6%, make another slow climb, drop back to 1.5% and, after a while, give up with “maximum retries exceeded”. So I gave up with them.

After downloading, LM Studio refused to load it - its Guardrails blocked the model to avoid risking “system overload”. Turning Guardrails off is the only way to run it (forewarning: I did this on my 24GB MacBook Pro before, and the system crashed and rebooted). I have reduced the Context Length to 2,000 for this test, but I’ll try higher Context Lengths later (I suspect it won’t go much higher before it becomes an issue).

It takes some time to load into memory (a little longer for me because I have my LLMs stored on an external Crucial X9 Pro SSD) but it’s not a big inconvenience if you’re not going to be swapping models around.
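(For a rough idea of the load time: the X9 Pro is rated at about 1,050 MB/s over USB, so an ~85GB file takes at least 80-90 seconds just to pull off the drive, before anything else happens.)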

Once loaded up, Activity Monitor shows a little over 92GB of memory used (and, importantly - no swap!). At first it showed higher than that, but that’s because I forgot that I had Parallels fire up my Windows VM on boot. I killed that to give the LLM all that it might need.

LM Studio shows “RAM: 80.05GB”. After also opening a browser, and Obsidian (I use AI to brainstorm my story-writing which I store as markdown files), Activity Monitor now shows “Memory Used: 100.79GB” (memory pressure all in the green, no swap).

I asked the LLM to draft a short scene for my story.

Prior to all of this, I'd asked Grok for its thoughts on whether this model would run or be useful. It estimated 0.5-1.0 tokens/sec. What I’m actually getting is 4.5 tokens/sec, with 4-9s to first token. Okay, this isn’t going to be practical for professional use - it’s outputting at a tiny bit less than comfortable reading speed (and I did go and make myself a coffee while it spat out 437 tokens) but, for my use case, I find it’s probably going to be “acceptable” if I decide I need to run a larger LLM.
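Working the numbers: 437 tokens at 4.5 tokens/sec is roughly 97 seconds, so about a minute and a half for that response. And assuming the usual rule of thumb of ~0.75 words per token, 4.5 tokens/sec is around 3.4 words/sec (~200 wpm) - just under a comfortable reading pace, which matches how it feels.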

For comparison, I’ve also downloaded c4ai-command-r-plus-08-2024-Q4_K_M - the Q4_K_M quantised version of the same model. This one is 62GB (so probably still a bit much if you have 64GB RAM). Time to first token was 2.26s (and then 7s on the next output). Speed is now about 6.5 tokens/sec (which is about reading speed for me).

In terms of output, the Q4_K_M version gave me 268 and 198 tokens, compared to Q6_K which gave me 437 and 357 for the same prompt. The first prompt was, as above, asking for a draft of a scene; the second prompt was me asking it to add a specific detail to the scene. Despite being slower, the Q6_K version definitely gave me a better response.

There is also a c4ai-command-r-plus-08-2024 (Q6_K) of the same size provided by lmstudio-community (the ones I tested above were from second-state). Output speed and quality are similar for both Q6_K versions, as they should be.
 
Have you tried the MLX quants?

One of the reasons for my M3U upgrade was to use 8-bit over 4-bit quants, as for my needs the accuracy of 4-bit was poor. Have you done any 4-bit vs 8-bit testing?
 
Have you tried the MLX quants?

One of the reasons for my M3U upgrade was to use 8-bit over 4-bit quants, as for my needs the accuracy of 4-bit was poor. Have you done any 4-bit vs 8-bit testing?
I have tried some MLX quants in the past, but I’m not making that comparison just at the moment.

My usual LLM has been Gemma-The-Writer-N-Restless-Quill-10B-max-cpu-D_AU-Q6 (and the Q4 version on my MacBook Pro), and this has worked pretty well for my purposes. I have some Q8 models downloaded (Gemma 3, Gemma 2, QwQ, Llama 3.3, etc), but I’ve not spent too much time with them or made any comparisons. I probably will do additional tests in the fullness of time but, right now, I’m seeing if I can tweak Command R Plus to my purposes.
 
I’ve done some testing with context lengths today. Using the c4ai-command-r-plus-08-2024-Q6_K, I’ve methodically been increasing the context length up to 40,000. Memory Used is around 110GB.
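For what it’s worth, the extra memory is roughly what I’d expect from the KV cache. If the architecture numbers I’ve seen for Command R+ are right (64 layers, 8 KV heads via GQA, 128-dim heads - worth double-checking), the fp16 cache costs about 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KB per token, so a 40,000-token context adds roughly 10GB on top of the weights, with the rest of the increase presumably being larger compute buffers.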

However, LM Studio seems to have an issue with context lengths this high. I have a document that’s about 19,000 words (it’s an 8-act scene from my story) and, no matter what I set the context length to, LM Studio switches to “retrieval is optimal” and just offers up citations to the LLM. This produces inadequate results.
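(At a rough 1.3-1.4 tokens per word, 19,000 words is something like 25,000-27,000 tokens, so it should fit comfortably within a 40,000 context - which makes the forced switch to retrieval all the stranger.)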

Thinking it might be a RAM issue (maybe LM Studio is wary that there’s not much left after 110GB used), I switched down to the 4-bit version c4ai-command-r-plus-08-2024-Q4_K_M with 40,000 context length (total Memory Used ~90GB) but the same thing happens.

If I put the entire text into the prompt (still in the 4-bit version), it eventually provides a “reasonable” summary of some key points, but stops half-way through the text (it gets to Act 4, but doesn’t bother with Acts 5-8).

So, regardless of how much RAM you have, context length is an issue with LM Studio. I’ll be looking at other apps to see if the same applies.

Incidentally, while processing the full 19,000+ word scene in the prompt (which took some time), the GPU temp went up to 90C and the Studio’s fan went to 1,700RPM. You can definitely hear the fan running (I’m in a fairly quiet room). It’s likely you wouldn’t hear it in an office environment, but it’s definitely there.
 
LM Studio is just a front end for llama.cpp and MLX.

The issue with speed is that the hardware just isn’t fast enough for models with more than 32B parameters, and it gets progressively slower as the context size increases. The only reason R1 works OK is that it’s a mixture-of-experts model with 37B active parameters.
 
LM Studio is just a front end for llama.cpp and MLX.

The issue with speed is that the hardware just isn’t fast enough for models with more than 32B parameters, and it gets progressively slower as the context size increases. The only reason R1 works OK is that it’s a mixture-of-experts model with 37B active parameters.
The speed issue isn’t a concern for my use case (story-writing isn’t a speedy pastime!) - but it’s good to know I have the option of dropping down to my previous LLM choice (10B) for quick exchanges. I’ll probably be trying out the llama.cpp CLI to get around LM Studio’s bizarre context length behaviour.
 
Try Ollama in place of LMS.
I did. In my original post, I said:

I did try to download a similar model in Ollama (and Msty), but it just wouldn’t download. The progress would crawl from 1.5% to 2%, drop back to 1.6%, make another slow climb, drop back to 1.5% and, after a while, give up with “maximum retries exceeded”. So I gave up with them.
 
The speed issue isn’t a concern for my use case (story-writing isn’t a speedy pastime!) - but it’s good to know I have the option of dropping down to my previous LLM choice (10B) for quick exchanges. I’ll probably be trying out the llama.cpp CLI to get around LM Studio’s bizarre context length behaviour.
It probably won’t make a difference as it’s just a wrapper on the same back end.
 
It probably won’t make a difference as it’s just a wrapper on the same back end.
I’m just trying to find out why LM Studio appears to be limiting the context length. If I set the context length to 40,000, LM Studio still splits the uploaded file into citations when the context goes over about 26,000. Up to that point, it’ll take the entire content.

Maybe there’s a setting or something I’ve not found yet, but it seems quite odd behaviour. Why allow you to set a higher context length if it won’t use it?
 
Try Command A; it’s better than Command R+ and will probably run a bit faster too. It has a similar knowledge level to Mistral Large 2411, but a better writing style. In my experience it’s the best open weights model that fits in 128 GB RAM, at least for writing tasks. Mistral Large may be slightly better at programming and physics, and reasoning models like QwQ are better for complex reasoning tasks. QwQ needs 10k tokens to answer simple questions that even Gemma 3 4b can answer almost instantly, so I generally avoid it even though its capabilities are impressive for its size.

Q4_K_M is great for models that are 24B and bigger, as the loss in quality is negligible compared to Q6 or Q8 for models in this size class.
 
Try Command A; it’s better than Command R+ and will probably run a bit faster too. Q4_K_M is great for models this size, the loss in quality is negligible compared to Q6 or Q8 for models in this size class.
That is on my list. My intent for this exercise was to try out a large LLM and see where it took me. I’m relatively happy with its responses. If I can figure out the context length issue, I’ll chalk it up as “done” and look at other models.
 
I had some success with llama.cpp, after I got it set up (thanks Grok! - the irony 🤣 ). Using the Q6 model (104B parameters, ~85GB) with a context size of 40,000 failed. Memory Used went up to about 100GB, then llama.cpp gave up. Using a context size of 20,000 also failed with:

Code:
main: prompt is too long (31035 tokens, max 19996)
However, a context size of 32,000 tokens proved successful. Memory Used maxed out at about 113GB (that’s the whole system, not just llama.cpp). It obviously took some time to summarise my text, but it did a pretty good job - offering up a ~400-word summary of the entire piece and 24 bullet points of key moments (3 for each of the 8 acts).
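For anyone wanting to try the same thing, the invocation looks roughly like this - the file names are placeholders, and older llama.cpp builds call the binary main rather than llama-cli, so adjust as needed:

Code:
./llama-cli -m c4ai-command-r-plus-08-2024-Q6_K.gguf -c 32768 -ngl 99 -n 1024 -f scene-prompt.txt
Here -c sets the context size, -ngl 99 offloads all layers to the GPU (Metal), -n caps the number of generated tokens, and -f reads the prompt from a file.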

I will need to tweak some settings to get the output the way I'd prefer it (at under 800 words for the entire output, it’s not as detailed as I’d like) but, on the whole, it was a success. I wouldn’t recommend this method for anyone on a deadline. This exercise is more about whether it can be done rather than whether it should be done.
 
This exercise is more about whether it can be done rather than whether it should be done.
Thanks.

Sort of like this guy who did the same thing for his M3 Ultra:


Basically, yes, you can get the quantized DeepSeek R1 671B to run... but it's slow. The slightly smaller MLX models work really well, though. I just wish he'd done the kind of context testing you have.
 
Excuse me if I’m being too naive, but how do you think a 32GB machine would handle a local LLM? Is there any possibility of using a Q8 model?
I've run Mistral Small 3 and 3.1 (both 24B models) at Q4_K_M on my 32 GB M2 Pro MacBook Pro without issue. You can run 12B or 14B models at Q8, though honestly Q8 for models that size is pointless. The output quality of Q6 is essentially indistinguishable from Q8 for 12B models, and even Q4_K_M has minimal degradation. For smaller models (8B and below), I prefer Q4_K_L.
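To put rough numbers on it: Q4_K_M is about 4.8 bits per weight, so a 24B model is roughly 24 × 4.8 ÷ 8 ≈ 14.5 GB of weights (the Mistral Small Q4_K_M files are around 14GB, if memory serves), plus a few GB for context and whatever macOS itself is using - comfortable on 32 GB, but not much headroom for anything bigger.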
 
I’ve done some testing with context lengths today. Using the c4ai-command-r-plus-08-2024-Q6_K, I’ve methodically been increasing the context length up to 40,000. Memory Used is around 110GB.

However, LM Studio seems to have an issue with context lengths this high. I have a document that’s about 19,000 words (it’s an 8-act scene from my story) and, no matter what I set the context length to, LM Studio switches to “retrieval is optimal” and just offers up citations to the LLM. This produces inadequate results.

Thinking it might be a RAM issue (maybe LM Studio is wary that there’s not much left after 110GB used), I switched down to the 4-bit version c4ai-command-r-plus-08-2024-Q4_K_M with 40,000 context length (total Memory Used ~90GB) but the same thing happens.

If I put the entire text into the prompt (still in the 4-bit version), it eventually provides a “reasonable” summary of some key points, but stops half-way through the text (it gets to Act 4, but doesn’t bother with Acts 5-8).

So, regardless of how much RAM you have, context length is an issue with LM Studio. I’ll be looking at other apps to see if the same applies.

Incidentally, while processing the full 19,000+ word scene in the prompt (which took some time), the GPU temp went up to 90C and the Studio’s fan went to 1,700RPM. You can definitely hear the fan running (I’m in a fairly quiet room). It’s likely you wouldn’t hear it in an office environment, but it’s definitely there.
You can let the LLM use up to 120 GB of memory if you don’t have anything else intensive open.
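On recent macOS that’s done by raising the GPU wired-memory limit with sysctl (the key was named differently on older versions, so double-check for your setup) - something like:

Code:
sudo sysctl iogpu.wired_limit_mb=122880
122,880 MB is roughly 120 GB. The setting doesn’t survive a reboot, so a restart puts things back to the default.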
 
I've run Mistral Small 3 and 3.1 (both 24B models) at Q4_K_M on my 32 GB M2 Pro MacBook Pro without issue. You can run 12B or 14B models at Q8, though honestly Q8 for models that size is pointless. The output quality of Q6 is essentially indistinguishable from Q8 for 12B models, and even Q4_K_M has minimal degradation. For smaller models (8B and below), I prefer Q4_K_L.
Okay, thank you too for the information. Now, given the small size of these models, realistically, what can be done with them? I guess they aren’t as capable as ChatGPT 4, or even 3.5, but could I feed them info, documentation and uni subjects to make them an “expert” in those and have them evaluate me or help me with my studies? That’s just one example.

Now, if these small models aren’t very useful or versatile, then I guess I won’t let this be a factor that influences me to get a 32GB machine instead of a 24GB one.
 
You can let the LLM use up to 120 GB of memory if you don’t have anything else intensive open.
I may try that at some point, but I’m not sure that would necessarily get over LM Studio’s “quirk” of letting you set a higher context length but not letting you actually use it - given that it also had this quirk when I used the smaller Q4 model (so there was plenty of RAM spare). It’s something to look at later, though.
 
Thanks.

Sort of like this guy who did the same thing for his M3 Ultra:


Basically, yes, you can get the quantized DeepSeek R1 671B to run... but it's slow. The slightly smaller MLX models work really well, though. I just wish he'd done the kind of context testing you have.
These machines are pretty amazing in what they can do, but it does seem that if you want speed you can’t really go with the largest models (although the Ultra’s higher memory bandwidth should help with that).
 