Okay, thank you too for the information. Now, given the small size of these models, realistically, what can be done with them? I guess they aren’t as capable as ChatGPT 4, or even 3.5, but can I feed them info, documentation, and uni subjects to make them an “expert” in that material and have them evaluate me or help me with my studies? That’s just one example.

Now, if these small models aren’t very useful or versatile, then I guess I won’t let this be a factor that pushes me towards a 32GB machine instead of a 24GB one.
There are model management tools that leverage a RAG-type system (Ollama uses a “knowledge base”, Msty uses “knowledge stacks”, LM Studio only allows you to attach documents to prompts). However, that doesn’t really make the model an “expert”.

They rely on RAG (Retrieval Augmented Generation), in which the model management tool takes your prompt and attempts to retrieve relevant citations from the knowledge base (your documents), letting the LLM “augment” its “generation” by looking at those citations first, before producing its answer from its parameters.
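Stripped right down, the retrieval step looks something like this (a toy sketch of the general idea, not how Ollama or Msty actually implement it; the embedding model name is just an example):

```python
# Toy RAG retrieval: chunk the documents, embed the chunks, then pull the
# chunks closest to the question and paste them into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # any small embedding model will do

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

def build_index(docs: list[str], chunk_words: int = 200):
    chunks = []
    for doc in docs:
        words = doc.split()
        chunks += [" ".join(words[i:i + chunk_words])
                   for i in range(0, len(words), chunk_words)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(question: str, chunks, vectors, k: int = 3):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(vectors @ q)[::-1][:k]   # cosine similarity on normalised vectors
    return [chunks[i] for i in best]

# The retrieved chunks are what get prepended to the prompt, e.g.
# "Answer using these excerpts:\n" + "\n---\n".join(top_chunks) + "\n\nQ: " + question
```

The quality of the answer then hinges almost entirely on whether the right chunks get retrieved.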

I’ve not found a truly reliable or accurate RAG system yet. I’d like to use LLMs that are experts on my stories and characters but I’ve not been happy with what they retrieve from a knowledge base filled with my documents.

I think the only real way of creating a workable “expert” is by fine-tuning a model, but you need a pretty powerful computer to do that.

However, if all you need to do is attach one or two short documents (in my case a single scene or an act), you can attach them to the prompt and quiz the LLM about them, and it’ll probably do a pretty good job. But giving it too much material leads to inaccuracies and overlooked details.

This is all from my experimentation with Ollama’s “knowledge base” and trying to have LM Studio manage large context. Msty might be different. I will be looking more at RAG options as I get time.
 
Thank you! Well, I’ve started using Obsidian to centralise all my documents in markdown (a lot of info across several subjects and topics from my studies), and I’d thought the next step in managing my information, notes and documents would be letting an LLM use all those md files in Obsidian like a database, thus creating an expert in the fields I need to study.

But honestly, seeing what you’re all telling me, I guess this task would be better handled by GPT-4o, o1 and o3 without even having to feed them the subject material. So I guess I’ll keep paying for the ChatGPT Plus subscription. A more rudimentary approach would be to build my own local wiki, either on the Mac or on iOS, but I guess Obsidian is already good at that; if you know a better tool for it, I’m all ears.

Anyway, even with the local LLM out of the question, the dilemma between 24GB and 32GB continues… but that would be off topic here, so thank you all for the advice and good luck with your tests, OP! I guess 24/32GB machines and yours with 128GB of RAM are in two completely different leagues, just like your 128GB machine and a 512GB M3 Ultra Studio…
 
I recently started using Obsidian myself. I’ve been using my latest story-writing exercise to understand how AI/LLM works, and it’s been quite the learning experience.

I’ve been using individual markdown files for each scene (or each act if the scene gets too big). I believe you can connect Msty to your Obsidian vault. I haven’t tried that yet, but it’s something I’m keen to explore. It may well be that Msty handles its “knowledge stacks” better than the other options I’ve tried. Time will tell.

What’s delaying me at the moment is that LM Studio downloaded my 104B model in three split files, and Msty will only look at one of them - I need to merge them before I can try this model out with Msty.

With your 24/32GB machine, you should still experiment with local LLMs to see how they might work for you. I was using a 24GB MacBook Pro before my Studio and it worked pretty well. Using a relatively small but high-quality model and feeding it one document that’s within its context length should enable you to quiz the LLM and get something from it. I don’t know what documents you have, so I can’t really offer much insight into that.

There are different LLMs that are trained for specific uses (some for coding, some for math, some for creativity, etc), so look for one that leans more into your field and see how that works out.
 
Yeah! This is actually a good idea. If there were an LLM trained in science, that would be great, but again, I’m not sure whether I should feed it my vault or just rely on its own knowledge.

The context window of a 32GB machine probably falls short for an entire vault with several subjects. I know you can’t give me more guidance without knowing the size of my database, but I really don’t know how to measure the size of the data: pages? Documents? Words? Sentences? Megabytes? Honestly, I don’t even know how big it might grow…

Do you have a quick reference for the context window size of a model that fits in 24/32GB? How is that measured, in thousands of words? And if I want the whole vault accessible, should I look for a context window big enough to fit the entire vault, or at least the subjects I’m interested in?

Excuse my basic questions, I have the feeling that I always end up derailing the threads you open lol.

By the way, there must be a way to efficiently compress the data I want to feed it, maybe with some sort of algorithm… 🤔
 
I will test c4ai-command-r-plus-08-2024-Q6_K on my 256GB M3 Ultra and see if the LMS limitation is the same.

I have a feeling you may need to set a limit of <70b on your machine...
 
Do you have a quick reference for the context window size of a model that fits in 24/32GB? How is that measured, in thousands of words?
Different LLMs have different context lengths. From what I’ve seen, most will do 2K-4K. Some will do 8K. The one I’m working on right now can do 128K. There are some, I believe, that can do a million.

The more context you want, the more RAM you need (so you may end up with a smaller model to free up more RAM for the context length).

Alex Ziskind did a calculator for working out memory requirements: https://llm-inference-calculator-rki02.kinsta.page/ I’m not sure it’s entirely accurate with its context length requirements, but it’s a starting point at least.

The context length is “number of tokens” which isn’t quite the same as “number of words”.
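
As a very rough rule of thumb, one token works out at around three-quarters of an English word. For a ballpark of the memory side, this is the sort of back-of-envelope calculation I’d use (my own rough approximation, not how that calculator works; the layer and head counts below are placeholders, so check the model card for the real ones):

```python
# Rough memory estimate: weights + KV cache (ignores compute buffers and app overhead).

def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    # e.g. Q4 quants are roughly 4.5-5 bits per weight, Q8 roughly 8.5
    return n_params_billion * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # K and V caches, one entry per layer per token (standard attention)
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# A hypothetical 27B model at ~4.5 bits/weight with a 32K context:
print(f"weights ~{weights_gb(27, 4.5):.1f} GB")               # ~15 GB
print(f"KV cache ~{kv_cache_gb(48, 8, 128, 32_768):.1f} GB")  # ~6 GB with these made-up numbers
```

So a long context can easily add several GB on top of the weights, which is why RAM disappears faster than you’d expect.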
 
I will test c4ai-command-r-plus-08-2024-Q6_K on my 256GB M3 Ultra and see if the LMS limitation is the same.

I have a feeling you may need to set a limit of <70b on your machine...
My llama.cpp experiment demonstrated that 128GB RAM can handle a 104B model at Q6 with context length of 32,000. That, in my mind, is its limit. Smaller models, 4-bit quants, and smaller context length, are all ways to make it more responsive/useable. It’s good to find the limits of this machine.

I’ll be very interested to hear how this runs on your 256GB, whether LM Studio relies on “citations” rather than the full document with the full 128K context length, and also the largest model you can run (and how usable that model is). You’ll have a faster memory bandwidth than I have so you should benefit from that.
 
Alex Ziskind did a calculator for working out memory requirements: https://llm-inference-calculator-rki02.kinsta.page/
FWIW this thing isn't accurate at all; it seems to overestimate by a significant amount. I'm using Q8 Gemma 3 27B with a 32768 context and it only uses 44GB of memory, when this says it would be 76GB+.
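
For a rough sanity check on where my 44GB actually goes (back-of-envelope only):

```python
# Q8_0 works out at roughly 8.5 bits per weight once block scales are included
weights_gb = 27e9 * 8.5 / 8 / 1e9   # ~28.7 GB for the 27B weights
observed_gb = 44                    # what I'm actually seeing
print(f"left for KV cache / buffers / overhead: ~{observed_gb - weights_gb:.0f} GB")
```

If I remember right, Gemma 3 also interleaves sliding-window layers with full-attention layers, which keeps the KV cache much smaller than a generic calculator assumes, and that could account for a good chunk of the overestimate.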

 
I suspected that before but didn’t really have a way of testing it on my MacBook. I’ll do some comparisons with real world usage when I get time.
 
I’ve managed to merge the GGUF into a single file so that Msty is able to load it. Using the same settings and context length (33,000) in Msty resulted in some strange output. No matter what I ask it (whether to summarise the full scene or just a simple question without the file attached), Msty hallucinates badly. It rants on about some unrelated story with what appears to be a back-and-forth forum discussion, nothing whatsoever to do with my prompt.

Just asking it for a 50-token summary of my two main characters (who are mentioned in the system prompt) resulted in an endless stream of hallucinated gibberish. When asking just a simple question, Msty effectively crashed out when the Memory Usage got to 100GB, and this is the first time I’ve seen “memory pressure” go into orange (there’s even a small 2.47GB swap going on). I’ve never seen this in LM Studio or llama.cpp while doing these tests.

I’ll need time to figure out what’s going on there - maybe Msty just doesn’t like large LLMs.

However, my next step will be to try loading the single merged GGUF file into Ollama and see if that works any better.
 
Yeah, the “max” in “M4 Max” for me stands for “max budget”. 😂
Yeah. They’re expensive for sure. But to run LLMs with a lot of parameters, it costs so much more to network Nvidia GPU clusters together. About $10k is the minimum one could spend for an M3 Ultra with 512GB of unified memory and 4TB of storage; a similar Nvidia server setup is closer to $100k-ish. We bought 30 Nvidia GPUs and servers and spent $2.5m. That was seven years ago. Now, my M3 Ultra Mac Studio can outperform them all clustered together. But it’s just a test rig at the moment. We have over 1PB of data on those servers, and that’s where we lack the current knowledge to access all that sensitive data locally. We need to upgrade to new Nvidia servers or determine whether we can do the same thing by clustering Mac Studios for less money. The problem is we use a lot of RAM and far more storage for financial data.
 
I’d love to be in that field. I’ve seen how much Nvidia GPUs go for and can only imagine what it must cost to get a system with sufficient VRAM.

This is why I’m amazed at what we can actually do with the Mac Studio (I go back to the ZX81!). I’m always trying to push to the edge of what the machine I have is capable of. (I still recall pushing that old ZX81 to fill up its 16K RAM pack - for no better reason than “I have to try”!)

For context, my budget for my Studio was originally around £3,200 - and I was kind of hoping 128GB would come in around there. Pushing to £3,800 for 128GB (plus 1TB SSD) went right to the edge of my comfort zone. If the M3 Ultra had started at 128GB, I may *just* have pushed to the £4,200 price (but I’d be arguing with my bank balance every step of the way).

In the end, I’m happy with what I have. 128GB on an M4 Max is of more use to me than 96GB on an M3 Ultra, I feel. I think it’s the right Studio for where I’m at right now. Spending more would just put me in a struggling position financially and, for what amounts to little more than a “hobby” combined with “learning experience” right now, I can’t justify that.
 
I did. In my original post, I said:

I did try to download a similar model in Ollama (and Msty), but it just wouldn’t download. The progress would crawl from 1.5% to 2%, then drop back to 1.6%, make another slow climb, then drop back to 1.5%, and after a while it just gave up with “maximum retries exceeded”. So I gave up on them.
Today I’ve been setting Ollama back up to have a bash with this model. It took a while to register the GGUF (trying to figure out why it wouldn’t do it wasn’t fun). Tip: setting up Ollama through Docker isn’t going to work. To start with, you need to raise Docker’s memory limit (which is okay) and, after doing that, Ollama did run the model, but as “Memory Used” ramped up, “Memory Pressure” shot through orange and into red before Docker crashed. I hadn’t seen Memory Pressure go into red before now while doing these tests.

I *may* look at setting Ollama up without Docker, but I’m also looking at putting a simple GUI on top of llama.cpp (mainly because I want to learn how to do it).
 
We need to upgrade to new Nvidia servers or determine whether we can do the same thing by clustering Mac Studios for less money.
Yours is a situation some Apple business sales rep will salivate over. The sales guys don’t know the Mac future any better than we do, but they may be able to assist you with upgrade/trade-up deals to switch you from Nvidia. I.e., maybe put you into loaded M3 Ultras now and provide high trade-up value toward M4/M5 Ultras or Mac Pros when they come out. My personal guess is that we will see new Mac Pros within a year.
 
Oddly enough, my wife knows someone who is a business sales rep for Apple here in Thailand. She makes a fortune selling to businesses with commissions. I didn’t know Apple did that before talking with her.
 
Anyway, even with the local LLM out of the question, the dilemma between 24GB and 32GB continues…
I would go for 32GB. There has been a lot of progress with small models recently. Gemma 27B, Qwen QwQ 32B, Mistral Small 3.1 24B, and LG’s 32B models came out just in the last two months. They are all pretty close to DeepSeek R1, GPT-4o and Sonnet 3.5. While it’s impossible to predict the future, I think bigger (and proprietary) models like a future GPT-5 or Sonnet/Opus 4.0 will remain supreme, but new small models will keep closing the gap. 32GB will widen the range of models available to you.
Even if you have to use proprietary models for the hardest tasks, it’s worth having a private, unrestricted model locally.
Another thing to take into account is memory bandwidth. I assume you are looking at an M4 machine, given the choice between 24 and 32GB. Those only have 120GB/s of bandwidth. Bandwidth is the second most important factor for LLM performance after memory size. It’s the bottleneck on current machines, so it determines the speed of token generation. The M4 Pro will be roughly twice as fast as the M4, and the Max roughly twice as fast as the Pro.
I think it’s still worth getting 32GB; just be aware of this when making your decision.
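A very rough way to see what the bandwidth means in practice (my own approximation; real numbers come out lower once you add KV cache reads and other overhead): generating each token streams more or less the whole quantised model through memory, so bandwidth divided by model size gives a ceiling on tokens per second.

```python
# Crude upper bound on generation speed for a dense model
def rough_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# 273 and 546 GB/s are the figures I've seen quoted for M4 Pro / M4 Max
for chip, bw in [("M4", 120), ("M4 Pro", 273), ("M4 Max", 546)]:
    ceiling = rough_tokens_per_sec(bw, 16.5)   # ~16.5 GB Q4 27B model
    print(f"{chip}: ~{ceiling:.0f} tokens/s ceiling")
```

That’s why the same model can feel several times snappier on a Pro or Max even when it fits comfortably in RAM on the base chip.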
 
Thank you for your insightful reply! Yes, I’m well aware of the bandwidth issue and I’m not denying it, but after seeing some comparisons between the M4 and M4 Pro, I think the M4 is still quite capable. After all, it has the fastest bandwidth among all the base SoCs: the M1 was 68GB/s, and the M2 and M3 were 100GB/s, so there’s a clear improvement going to 120GB/s. Obviously the M4 Pro has double the bandwidth, but I’m not getting the 48GB model as it’s too expensive for me, and I don’t need the power of the M4 Pro. So yeah, I think the regular M4 is the sweet spot, and you’re right that 32GB is interesting to have looking forward.

What I didn’t know is that there has been so much progress in small models, and maybe (probably) in the future there will be more of them, and more capable ones. So yeah, you’ve convinced me to go with 32GB of RAM, if I get the M4; I’m still weighing up whether I could or should wait a few more months to see if Apple releases an M5 Mac mini. I have an M2 iPad Pro and maybe I can get by with it until the next Mac mini refresh.
 
What models are you looking to run? Do you need a large context or RAG?

32GB would be a bare minimum and probably not that great an experience if you’re using a large context or RAG.
 
I’ve been doing some further work on this and I’m starting to come to some conclusions.

To summarise, using the 104B parameter Command-R+ at Q6 (~85GB):
  1. LM Studio ran well, but tripped up with context any higher than ~25,000 tokens - even when you set it higher in the model settings.
  2. Ollama wouldn’t download models (crawling to download 1.5% - 2%, then dropping back to 1.6%, continuing like that until failing with “maximum retries” error). No idea why.
  3. Msty & Ollama won’t work with a downloaded GGUF split over multiple files (Command-R+ is in three files). Msty will use the first split file, but doesn’t seem aware that there are others.
  4. Merging the three files allows Ollama to load and run it, but the results are rambling hallucinations. This can be somewhat mitigated by going extreme on the settings, but it’s not really practical.
  5. llama.cpp cli worked well, except that with RAM becoming tight, it wouldn’t do 40,000 context length. The context length needs to be larger than the file you’re working with (in my case around 17,000 words). It worked okay with 32,000 context length and produced good results for my lengthy document.
  6. llama.cpp works with the split files just fine and produces results comparable with LM Studio, but with a better context length (presumably due to lack of overhead as there’s no GUI).
  7. llama.cpp also works with the merged GGUF file but the results were not as good. Also, using the merged GGUF caused the Mac Studio to become unresponsive/unusable while it was running (this doesn’t happen when using split GGUFs). From this, I’m guessing there’s some issue using a single ~80GB GGUF and this may have contributed to Ollama’s rambling hallucinations. Unfortunately, Ollama won’t work with the split files so, for a model this big, I can’t use Ollama.
  8. Conclusions so far: Using LM Studio with this model is fine if you keep the context length / window below about 25,000. llama.cpp cli works better for slightly longer context length.
  9. Current project: I’m now using Grok to teach me how to put a (very simple) GUI on top of llama.cpp. If nothing else, it teaches me something new.
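Following on from points 5-7: if anyone wants to drive llama.cpp from Python rather than the CLI, this is roughly what I’d try with the llama-cpp-python bindings (untested by me with the split Command-R+ files, so treat the shard handling as an assumption, and the file names are just examples):

```python
from llama_cpp import Llama   # pip install llama-cpp-python (built with Metal support)

# Point at the FIRST shard; my understanding is that recent llama.cpp builds
# pick up the remaining parts of a split GGUF automatically - verify on yours.
llm = Llama(
    model_path="c4ai-command-r-plus-08-2024-Q6_K-00001-of-00003.gguf",
    n_ctx=32_768,       # the context length that worked for me via the CLI
    n_gpu_layers=-1,    # offload all layers to the GPU/Metal
    verbose=False,
)

scene = open("act2_scene3.md").read()   # hypothetical scene file
result = llm.create_completion(
    prompt=f"Summarise this scene:\n\n{scene}\n\nSummary:",
    max_tokens=500,
    temperature=0.7,
)
print(result["choices"][0]["text"])
```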
 
So yeah, you’ve convinced me to go with 32GB of RAM, if I get the M4…
Some kind of LLM will run on even the most basic machine. It all comes down to your use case.

I only have two Macs - this Studio, and an M4 Pro/24GB/1TB MacBook Pro. Until the Studio, I was using the MacBook Pro quite happily and, for my use case, Gemma-The-Writer-N-Restless-Quill-10B-D_AU-Q4_k_m.gguf (~6.1GB) worked just fine - very usable. There’s a slightly larger 6-bit version that I’d also started using.

The larger the model, the more RAM you need, the more likely you are to hit the swap file on low-RAM systems, and the less you’ll be able to comfortably run other apps alongside the LLM.

The larger the model, the slower it’s likely to respond. Memory bandwidth helps here.

An M4 with 32GB is likely to let you run small models pretty well and still run something else alongside them. However, if you try to fill the 32GB with a larger model, you’ll likely notice a big slowdown. It’ll still probably do it, but have your coffee ready while you wait. If you’re running something around 7B-10B at Q4, you’ll probably do fine (I don’t know for sure, as I don’t have the M4, just the M4 Pro and M4 Max, but I can’t imagine it being much slower than my MacBook).

Something like gemma-3-27b-it-Q4_K_M.gguf (~16.55GB) will probably be your top level with 32GB RAM while still having a usable system. You may do better, you may do worse, but if it runs okay for you I think it’ll be a good choice.
 
Very pleased with today’s work, in which (with a lot of help from Grok) I put together a minimalist GUI for llama.cpp. It allows me to select a file (which, in this case, is 22,000+ words), enter a brief prompt (such as “summarise this scene”), and it’ll write the output to a text file as well as log its progress. I have a .sh file that includes my system prompt and other settings - many of which I will later work out how to move to the GUI - but I can change them easily enough should I need to.

Okay, it’s basic and not entirely bug-free, but I didn’t know how to do any of this llama.cpp stuff three days ago and was relying on LM Studio / Msty / Ollama / etc. However, this minimalist option now allows me to get my largest scenes summarised using a Q6 104B LLM without (i) LM Studio’s quirk about context length, (ii) Ollama’s rambling hallucinations, or (iii) having the system become unresponsive using a merged GGUF.

I won’t be flipping over to this method for everything, of course, as it’s not a “chat” system (right now it’s a “one prompt wonder”) and the likes of LM Studio / Msty / Ollama / etc are just fine for shorter contexts and smaller models (and RAG) and general brainstorming, but I feel that I’ve learned a lot.
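
For anyone curious, the shape of it is roughly this (a heavily stripped-down sketch rather than my actual code; the llama-cli flags are how I remember them, so check --help on your build, and newer builds may need an extra switch to stay in one-shot rather than chat mode):

```python
# Minimal "one prompt wonder": pick a file, type an instruction, run llama-cli,
# show the output and save it to a text file.
import subprocess, tempfile
import tkinter as tk
from tkinter import filedialog, scrolledtext

LLAMA_CLI = "./llama-cli"                                       # path to the llama.cpp binary
MODEL = "c4ai-command-r-plus-08-2024-Q6_K-00001-of-00003.gguf"  # first shard of the split GGUF

def run():
    scene_path = filedialog.askopenfilename(title="Pick a scene/act file")
    if not scene_path:
        return
    instruction = prompt_box.get("1.0", "end").strip()
    with open(scene_path) as f:
        scene = f.read()
    # Write instruction + document to a temp file and hand it to llama-cli with -f
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(f"{instruction}\n\n{scene}\n")
        prompt_file = tmp.name
    # -m model, -c context length, -f prompt file, -n max new tokens
    proc = subprocess.run(
        [LLAMA_CLI, "-m", MODEL, "-c", "32768", "-f", prompt_file,
         "-n", "800", "--temp", "0.7"],
        capture_output=True, text=True)
    output_box.delete("1.0", "end")
    output_box.insert("end", proc.stdout)
    with open("summary_output.txt", "w") as out:   # keep a copy on disk
        out.write(proc.stdout)

root = tk.Tk()
root.title("llama.cpp one-prompt wonder")
prompt_box = scrolledtext.ScrolledText(root, height=4, width=90)
prompt_box.pack()
tk.Button(root, text="Run", command=run).pack()
output_box = scrolledtext.ScrolledText(root, height=30, width=90)
output_box.pack()
root.mainloop()
```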
 
Thanks for all the great posts. I've been running LM Studio on my M4 Air (24GB) and on my PC's GPU (RX 7800 XT). I've been pleasantly surprised at how well the Air can handle it, although I've only been doing basic stuff. For writing Excel macros it's been stunning.

I'm going to try more advanced things with it too, but I can always fall back to the PC as needed. It's obviously no Ultra(!), or even a Max, but it does have a little more headroom than I expected. I did consider the 32GB Air, but with the PC on standby I figured this would be a better backup plan.

It's been a fantastic learning experience, and the optimizations I've been able to make in my "9 to 5" workflow have been priceless.
 
BTW, what's the size of the largest version of Qwen-Coder? Is it 32B, or is it 70B? That's the model I'm most interested in at the moment for running on my (also new) M4 Max Studio with 128GB. I haven't tried to run LLMs yet because I want to build llama.cpp myself first (I'll be using llama.cpp only; I'm a C/C++ guy, and besides, if something works in LM Studio, it must also work in llama.cpp if the proper settings are used).
 