People get kind of neurotic about heat and their SSDs, but heat is a legitimate concern. Most people will put some sort of heat sink on top of their drives if they have space for it. In a laptop, a heat spreader is typically used instead: it pulls heat from the hottest parts and spreads it over a larger surface, where it can radiate away and warm more air that is then blown or convected away.
But it’s the controller that is the hottest part of the drive, so when you take it out of the thermally managed SoC package and put it on an M.2 drive next to the NAND chips, the spreader takes heat from the controller and adds it to the NAND chips themselves:
View attachment 2247246
View attachment 2247247
That’s in contrast to the MacBook Pro approach of putting the NAND chips on a cool area of the board and right where the cool air gets drawn in by the fans:
View attachment 2247248
So Apple's design approach here is actually quite SSD and performance friendly.
This is what makes it even more surprising that A2141 SSDs drop at the rate that they do. I would be curious to see what the temperatures of those NANDs were on that machine. I might check with my thermal camera when I am at the office again.
He goes into a rant about how the SSD is a “wear part” that will eventually fail. That’s true. Over time SSDs develop a few problems around maintenance of the floating gate, charge accumulation, oxide issues etc. You wind up with bits that fail validation on write or erase, you get bad sectors, etc.
But this is an entirely new thing for me:
LR2 (2:31) “My personal favorite, is when the NAND fails by shorting to ground. Many of these fail by shorting to ground entirely. Not just like a dead SSD where the data went poof, I mean when the actual NAND chips fail and bring down the power line completely because there is a 6 to 0 ohm short to ground on the NANDs main power line.”
Ok, so what about Rossmann’s argument that these parts are all shorting to ground?
Here’s where Rossmann’s credibility is so important to the discussion: he claims that half of the repairs he does on A2141s are because of SSD shorts to ground. That’s a rather incredible number and one that I can’t find any independent confirmation of. Rossmann claims this has been happening since at least 2019. Four years is a lot of time for Apple to work with their suppliers, of which they apparently have four: Samsung, Hynix, Kioxia, Western Digital. It seems odd that this would be a failure point across so many vendors for so long without a correction.
Can a digital IC like this show a short? Sure. But typically it’ll happen on the I/O ring because of some damage inflicted on the chip, not as a result of wear. One major culprit would be ESD or some other voltage surge blowing out the protection diodes on the pins. Wear would imply this was a short in a particular NAND cell, which would be a rather bizarre mechanism for sinking that much current. The I/O pins at least are beefy transistors with (relatively) fat bond wires attached; the internal cells are delicate structures with nm-scale metal layers that you’d expect to vaporize and open circuit when passing that much current.
I am not an engineer, I am not a designer. I'm a repair person. I wish I could say "XYZ in the ABC is why this product has a higher failure rate, and why it is bad design." I can't say that, and I'm not going to try to pull something out of my ass to explain why it is the case.
Rather, I would stick to what I know.
a) This machine uses a fairly non-standard NAND I cannot get access to anywhere.
b) This non-standard NAND fails at an unheard-of rate for my shop, and for any shop in the industry that gets these machines: TCRS Circuit, Paul Daniels, Jessa Jones, etc., my colleagues, whom I trust & look up to.
And can someone, anyone, explain this statement to me: LR2 (30:42) “It's more likely here that the NAND is what's bad. [...] If the .9V rail has all these little solder balls moving around that means that most likely the NAND is shorted and it is placing a lot of tension on all of the power rails that are powering the SSD which is why there is solder balls there which… **** my life…” WT-actual-F is he going on about? Is he saying the solder balls are formed by "tension" in the power rail caused by a short in a NAND chip?!? Is it somehow getting mechanically squeezed out of the capacitor?
Seeing tiny solder balls near a failed or shorted capacitor is common. It's a hint I look for because very often it's next to something that is either strained or dead.
Sometimes, those solder balls are just there because of bad manufacturing and do not indicate a failure. It has to be taken in context with the rest; this is a hint that by itself, isolated, isn't proof of something. When we are trying to figure out what happened, or what died, we usually follow clues because the board isn't going to tell you what is wrong.
View attachment 2247304
None of the MacBooks Rossmann opens in those videos show any actual evidence of the NANDs shorting to ground. What we see, if you trust his measurement methods, is that something may have shorted to ground and in most cases it’s hard to know what by the time he’s finished throwing chips out and reworking component after component.
When I remove the inductor that sits in between the TPS62180 and the NANDs, it is going to be one of the capacitors, or the NAND. When you inject voltage and no capacitors are warm, it's the NAND.
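That rule of thumb can be written down as a tiny decision function. This is only a sketch of the reasoning in the paragraph above, not a diagnostic tool: the function name, its parameters, and the capacitor labels are hypothetical, and the "not on the NAND side" branch is an assumption about what lifting the inductor tells you.

```python
from typing import Sequence


def likely_fault(short_stays_on_nand_side: bool,
                 warm_capacitors: Sequence[str]) -> str:
    """Bench heuristic: isolate the rail, inject voltage, feel for heat.

    short_stays_on_nand_side: after lifting the inductor between the buck
        converter and the NANDs, is the short still measured on the NAND side?
    warm_capacitors: labels (hypothetical) of capacitors that warm up
        when voltage is injected into the shorted rail.
    """
    if not short_stays_on_nand_side:
        # Assumption: the short is somewhere other than the NAND side.
        return "not on the NAND side"
    if warm_capacitors:
        # Something other than the NAND is dissipating the injected power.
        return "capacitor"
    # No capacitor heats up, so by elimination the NAND is the suspect.
    return "NAND"


print(likely_fault(True, []))        # no warm capacitors
print(likely_fault(True, ["C_a"]))   # hypothetical capacitor label
```

As with the solder balls, the result is a hint to follow, not proof.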
And finally, what about the fact that you can't boot from an external drive without a working internal SSD? I think how much of a problem that presents comes down to what "working" means. He goes through a sloppy interpretation of an iBoff video that needs some disentangling.
For one, he says that T2 Macs don't have UEFI.
That's wrong. They do:
View attachment 2247253
The UEFI firmware sits in the main SSD array rather than in a separate SPI connected Flash chip and the T2 acts as an eSPI client to the Intel chip which reads the UEFI firmware through the T2. So, yes, there is still UEFI firmware.
The valid detail is that this firmware sits on the NAND you are writing to every day, and it stops functioning if that NAND dies. This is senseless nitpicking and, more importantly, it proves the point that you end up with a machine that is functionally useless if the NAND dies.
The T2 also has its own SPI connected Flash that holds the secure iBoot procedure. This is the part that gets updated when you go into DFU mode. iBoot validates the UEFI; the UEFI is needed to access the ports on the machine, which eventually allows access to an external drive, and verifies that you aren't booting from an external drive to roll back to an insecure version of firmware.
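To make the dependency chain concrete, here is a minimal sketch of the boot order as described in this post: iBoot first, then the UEFI image read through the T2 from the main SSD array, then the external drive. The function and parameter names are hypothetical and this models only the ordering, not Apple's actual implementation.

```python
from typing import Optional


def boot_stops_at(iboot_ok: bool,
                  uefi_readable: bool,
                  external_drive_ok: bool) -> Optional[str]:
    """Return the first stage that blocks an external boot, or None.

    iboot_ok: the T2's own SPI flash holds a usable iBoot.
    uefi_readable: the UEFI image can be read from the main SSD array.
    external_drive_ok: a bootable external drive is attached.
    """
    if not iboot_ok:
        return "iBoot"
    if not uefi_readable:
        # Ports are not available without UEFI, so the external
        # drive is never reached in this model.
        return "UEFI"
    if not external_drive_ok:
        return "external drive"
    return None


# Dead internal NAND: UEFI cannot be read, so the external drive is moot.
print(boot_stops_at(True, False, True))
```

In this model a healthy external drive makes no difference once the UEFI image is unreadable, which is the point both sides of the argument are circling.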
If you have a board failure that blows up your power rails, can you boot from an external SSD? No. But I don't see how that's relevant to the discussion. It's no different if you had a modular SSD-- you have a board failure and a power supply failure. You can't expect anything to work. If you held the UEFI in a separate SPI NOR flash somewhere, it still relies on a power supply, it still relies on a working flash chip, and the same failure modes that apply to the main SSD apply to that part as well.
Semi true. A modular SSD doesn't have all the NAND rails created by the motherboard; the SSD is powered by PP3V3_S0. It doesn't have all of the subrails made by the main motherboard.
But let's say that wasn't true - this would be a chicken & the egg situation, and a fun one!
If the failure mode is that the NAND dies and the SSD is shorted, this is fixable. You'd unplug the SSD, plug in a new one, and you're good - because any subrails, if they were necessary, are being created on the SSD, not the main logic board.
A modular SSD would be beneficial here only if
a) The PSU is made in a way that it can turn off QUICKLY enough after detecting a short that it doesn't die in a nasty fashion when given a sudden extreme load (aka sending input voltage straight to output)
b) The user removes the faulty component before the PSU DOES die from trying to send unlimited power into a shorted-to-ground output.
If the failure mode is that the NAND PSU died, sent 12v to the NANDs, and killed them... well, now you're screwed regardless of whether your SSD is modular. Your SSD will expect 3v and get 12v. Ouch!
A modular SSD would NOT be beneficial here.
So which did it? The NAND shorted and killed the power supply, or the power supply died and killed (then shorted) the NAND? That's a question we don't get an answer to here.
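The two failure orders, and when a modular SSD would actually help under each, can be summed up in a few lines. This is just a sketch of the argument above under its stated conditions; the enum and function names are made up for illustration.

```python
from enum import Enum


class FirstFailure(Enum):
    NAND_SHORTED = "NAND shorted, then stressed the power supply"
    PSU_DIED = "power supply died, then overvolted the NAND"


def modular_ssd_helps(first_failure: FirstFailure,
                      psu_shuts_off_in_time: bool = False,
                      ssd_removed_before_psu_dies: bool = False) -> bool:
    """Would swapping a modular SSD get the machine working again?"""
    if first_failure is FirstFailure.PSU_DIED:
        # A replacement SSD would see the same bad input voltage.
        return False
    # NAND shorted first: swapping helps only if the PSU survived,
    # either by shutting off in time or because the part was pulled first.
    return psu_shuts_off_in_time or ssd_removed_before_psu_dies


print(modular_ssd_helps(FirstFailure.NAND_SHORTED, psu_shuts_off_in_time=True))
print(modular_ssd_helps(FirstFailure.PSU_DIED, psu_shuts_off_in_time=True))
```

Which branch applies to a given dead board is exactly the open question.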
The thing is... again... older machines did not do this. I have had ZERO KNOWN CASES of any pre-A2141 machine sending PPBUS_G3H straight to the NAND. It just didn't happen.
If the NAND shorts and the machine dies, the power supply doesn't have to go with it.
One thing that is absolutely true here though: you must have a backup to boot from an external drive. For everyone saying that Apple is horrible for how they made their SSDs because it's unreasonable to expect people to back up their data, you need another drive to boot from, folks.
I agree that people should back up their data.
I also think that it sucks that this machine is designed in a way where, if the main SSD dies,
a) You cannot replace it
b) You cannot boot off an external/turn it on & have it POST properly
As far as I can tell, this is an entirely hypothetical problem being hypothesized by people with an agenda.
My agenda is to make up a problem for a video that gets 90,000 views. At the expense of the credibility of my business, my livelihood, and my personal time.
I will leave other people to judge what they think of this.