I thought I would do some LLM testing on my new M4 Mac Studio today - my interest being the largest LLM I could load up and actually run.
For testing, I decided on c4ai-command-r-plus-08-2024-Q6_K. This is a 104B parameter LLM; quantised to Q6_K, it comes in at about 85GB (LM Studio says 80GB). I downloaded it thinking that (i) it may not work and (ii) if it did, it’d be very slow.
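As a sanity check on those sizes: Q6_K works out to roughly 6.5625 bits per weight in llama.cpp’s block scheme, and I suspect the 85GB/80GB discrepancy is just decimal GB versus binary GiB. A quick back-of-envelope sketch (ignoring file metadata and any non-quantised tensors):

```python
# Rough size estimate for a quantised GGUF model.
# Assumption: Q6_K averages ~6.5625 bits per weight (llama.cpp's block
# layout); real files also carry metadata and some non-quantised
# tensors, so treat this as a ballpark only.
params = 104e9            # 104B parameters
bits_per_weight = 6.5625  # Q6_K

size_bytes = params * bits_per_weight / 8
print(f"~{size_bytes / 1e9:.1f} GB")     # ~85.3 GB (decimal, file size on disk)
print(f"~{size_bytes / 2**30:.1f} GiB")  # ~79.4 GiB (binary - LM Studio's "80GB"?)
```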
I’m using LM Studio for this. I did try to download a similar model in Ollama (and Msty), but it just wouldn’t download. The progress would creep slowly from 1.5% to 2%, drop back to 1.6%, creep up slowly again, then drop back to 1.5%, and after a while it gave up with “maximum retries exceeded”. So I gave up on them.
After downloading, LM Studio refused to load it - its Guardrails block the load to avoid risking “system overload”. Turning off Guardrails is the only way to get it running (fair warning: I turned this off on my 24GB MacBook Pro before, and the system did crash and reboot). I reduced Context Length to 2,000 for this test, but I’ll try higher Context Lengths later (I suspect it won’t go much higher before memory becomes an issue).
It takes some time to load into memory (a little longer for me because I have my LLMs stored on an external Crucial X9 Pro SSD) but it’s not a big inconvenience if you’re not going to be swapping models around.
Once loaded up, Activity Monitor shows a little over 92GB of memory used (and, importantly - no swap!). At first it showed higher than that, but that’s because I forgot that I had Parallels fire up my Windows VM on boot. I killed that to give the LLM all that it might need.
LM Studio shows “RAM: 80.05GB”. After also opening a browser and Obsidian (I use AI to brainstorm my story-writing, which I store as markdown files), Activity Monitor now shows “Memory Used: 100.79GB” (memory pressure all in the green, no swap).
I asked the LLM to draft a short scene for my story.
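Side note: the prompt doesn’t have to go through the chat UI. LM Studio can also expose an OpenAI-compatible local server (by default at http://localhost:1234/v1), so a minimal sketch like the one below should work too, assuming the server is enabled and the model is loaded. The model identifier here is a placeholder - use whatever LM Studio lists for your copy:

```python
# Minimal sketch: prompting a model loaded in LM Studio through its
# OpenAI-compatible local server. Assumes the server is running on
# the default port; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key can be any string

response = client.chat.completions.create(
    model="c4ai-command-r-plus-08-2024",  # placeholder - match your loaded model
    messages=[{"role": "user", "content": "Draft a short scene for my story."}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```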
Prior to all of this, I’d asked Grok for its thoughts on whether this model would run or be useful. It estimated 0.5-1.0 tokens/sec. What I’m actually getting is 4.5 tokens/sec, with 4-9s to first token. Okay, this isn’t going to be practical for professional use - it outputs at a little below comfortable reading speed (and I did go and make myself a coffee while it spat out 437 tokens) - but, for my use case, it’s probably going to be “acceptable” if I decide I need to run a larger LLM.
For comparison, I’ve also downloaded c4ai-command-r-plus-08-2024-Q4_K_M - the Q4_K_M quantised version of the same model. This one is 62GB (so probably still a bit much if you only have 64GB of RAM). Time to first token was 2.26s (and then 7s on the next output). Speed is now about 6.5 tokens/sec (which is about reading speed for me).
In terms of output, the Q4_K_M version gave me 268 and 198 tokens, compared to the Q6_K’s 437 and 357 for the same prompts. The first prompt was, as above, asking for a draft of a scene; the second asked it to add a specific detail to the scene. Despite being slower, the Q6_K version definitely gave me the better responses.
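To put those numbers in wall-clock terms (ignoring time to first token), a rough calculation from my measured speeds:

```python
# Rough wall-clock generation times from the speeds measured above
# (time to first token excluded).
runs = {
    "Q6_K":   (437 + 357, 4.5),  # total tokens across both prompts, tokens/sec
    "Q4_K_M": (268 + 198, 6.5),
}
for name, (tokens, tps) in runs.items():
    print(f"{name}: {tokens} tokens / {tps} tok/s ≈ {tokens / tps:.0f}s")
# Q6_K:   794 tokens / 4.5 tok/s ≈ 176s - hence the coffee
# Q4_K_M: 466 tokens / 6.5 tok/s ≈ 72s
```

So the Q6_K run takes well over twice as long overall - partly because it’s slower per token, partly because it writes longer answers.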
There is also a c4ai-command-r-plus-08-2024 (Q6_K) of the same size provided by lmstudio-community (the ones I tested above were from second-state). Output speed and quality are similar for both Q6_K versions, as they should be.