MP 1,1-5,1 Hard crashes on cMP 4.1/5.1

polanskiman · Oct 4, 2023

Hello,

I recently wanted to upgrade my cMP 2009 with a new GPU, but I soon discovered that the machine was behaving strangely, so I put the GPU purchase on hold.

Current specs are:
- cMP 2009 cross flashed with 5.1 BootROM.
- 48GB Ram
- 2 * X5670 CPUs
- original ATI Radeon HD 4870 GPU
- 4 drives, one being an SSD connected below the optical drive
- MacOS El Capitan

There are no other mods to this cMP.

I hadn't used the computer for quite a while. When I went to turn it on last week, it would hard shut down and restart randomly. I thought I would just clean it up, unscrew the heatsinks and change the thermal paste. This seems to have improved the situation, as it doesn't shut down anymore under normal use. However, if I run a CPU torture test (with Prime95), it will inevitably shut down within a minute or two. This is usually followed by multiple hard shutdowns upon reboot. It can take anywhere from 2 to 6 attempts before I get control of the machine again. Once I regain control, I get error reports such as:

*** Panic Report ***
panic(cpu 19 caller 0xffffff801d3ad68f): "TLB invalidation IPI timeout: " "CPU(s) failed to respond to interrupts, unresponsive CPU bitmap: 0xc000, NMIPI acks: orig: 0x0, now: 0x0"@/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.73.11/osfmk/x86_64/pmap.c:2648
Backtrace (CPU 19), Frame : Return Address
0xffffff95852faf40 : 0xffffff801d2dab52
0xffffff95852fafc0 : 0xffffff801d3ad68f
0xffffff95852fb060 : 0xffffff801d3b45db
0xffffff95852fb150 : 0xffffff801d3b532c
0xffffff95852fb1c0 : 0xffffff801d360fed
0xffffff95852fb2f0 : 0xffffff801d35648c
0xffffff95852fb320 : 0xffffff801d3794b1
0xffffff95852fb360 : 0xffffff801d4dea9e
0xffffff95852fb370 : 0xffffff7f9f464089
0xffffff95852fb4e0 : 0xffffff801d539158
0xffffff95852fb540 : 0xffffff801d4dec88
0xffffff95852fb690 : 0xffffff801d4ebcf1
0xffffff95852fb820 : 0xffffff801d4e71cd
0xffffff95852fb890 : 0xffffff801d71796b
0xffffff95852fb990 : 0xffffff801d820944
0xffffff95852fbaa0 : 0xffffff801d33da98
0xffffff95852fbaf0 : 0xffffff801d33cb83
0xffffff95852fbb30 : 0xffffff801d348480
0xffffff95852fbd00 : 0xffffff801d34debe
0xffffff95852fbf20 : 0xffffff801d3d028f
0xffffff95852fbfb0 : 0xffffff801d3ebb12
Kernel Extensions in backtrace:
com.apple.BootCache(38.0)[C1EA21DC-CEC4-34EF-8172-8D217927D3EC]@0xffffff7f9f45f000->0xffffff7f9f468fff

BSD process name corresponding to current thread: garcon

Mac OS version:
15G22010

*** Panic Report ***
panic(cpu 4 caller 0xffffff80051cfdab): Kernel trap at 0x0000000000000000, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000000, CR3: 0x0000000008a5b000, CR4: 0x00000000000226e0
RAX: 0xffffff806ea46960, RBX: 0x0000000000000001, RCX: 0xfffffffff9cff2ec, RDX: 0x0000000000000001
RSP: 0xffffff956e613b08, RBP: 0xffffff956e613b40, RSI: 0x0000000000000002, RDI: 0xffffff805ff8e4a0
R8: 0xffffff805f4a7000, R9: 0x0000000000010001, R10: 0xffffff8061936ca0, R11: 0x0000000000000001
R12: 0xffffff805ff8e4a0, R13: 0xffffff80802a4e00, R14: 0xffffff805f5e7200, R15: 0xffffff806c260000
RFL: 0x0000000000010286, RIP: 0x0000000000000000, CS: 0x0000000000000008, SS: 0x0000000000000000
Fault CR2: 0x0000000000000000, Error code: 0x0000000000000010, Fault CPU: 0x4, PL: 0

Backtrace (CPU 4), Frame : Return Address
0xffffff956e613790 : 0xffffff80050dab52
0xffffff956e613810 : 0xffffff80051cfdab
0xffffff956e6139f0 : 0xffffff80051ebc03
0xffffff956e613a10 : 0x0
0xffffff956e613b40 : 0xffffff8005691cbd
0xffffff956e613b80 : 0xffffff8005691586
0xffffff956e613bf0 : 0xffffff800568c25a
0xffffff956e613c30 : 0xffffff800568bef1
0xffffff956e613c70 : 0xffffff800568bd36
0xffffff956e613cb0 : 0xffffff7f85a3d387
0xffffff956e613de0 : 0xffffff7f85a3c85c
0xffffff956e613ee0 : 0xffffff7f8597e216
0xffffff956e613f00 : 0xffffff800510f2da
0xffffff956e613fb0 : 0xffffff80051ca3f7
Kernel Extensions in backtrace:
com.apple.iokit.IOUSBHostFamily(1.0.1)[79D250A3-843A-3750-BE64-A252CF17A148]@0xffffff7f85978000->0xffffff7f859e1fff
dependency: com.apple.driver.AppleUSBHostMergeProperties(1.0.1)[401B7165-36E8-3EE5-849D-71DAEB3E46E4]@0xffffff7f85974000
com.apple.iokit.IOUSBFamily(900.4.1)[A0E77E09-5ABD-3E39-8453-CECBFC07C9DC]@0xffffff7f859eb000->0xffffff7f85a84fff
dependency: com.apple.iokit.IOPCIFamily(2.9)[5F5B9213-0BE4-33DA-9DC6-5859D824DC7D]@0xffffff7f8592c000
dependency: com.apple.iokit.IOUSBHostFamily(1.0.1)[79D250A3-843A-3750-BE64-A252CF17A148]@0xffffff7f85978000

BSD process name corresponding to current thread: kernel_task

Mac OS version:
15G22010

*** Panic Report ***
panic(cpu 2 caller 0xffffff80116ee0db): "Spinlock acquisition timed out: lock=0xffffff8011f05bb8, lock owner thread=0x0, current_thread: 0xffffff806d7fadb0, lock owner active on CPU 0xffffffff, current owner: 0x0"@/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.73.11/osfmk/i386/locks_i386.c:378
Backtrace (CPU 2), Frame : Return Address
0xffffff856afb8d00 : 0xffffff80116dab52
0xffffff856afb8d80 : 0xffffff80116ee0db
0xffffff856afb8dd0 : 0xffffff80116edb72
0xffffff856afb8e20 : 0xffffff80116eda2e
0xffffff856afb8e50 : 0xffffff8011711ea0
0xffffff856afb8f20 : 0xffffff80117c01db
0xffffff856afb8f60 : 0xffffff80117d797b
0xffffff856afb8f80 : 0xffffff80117cefdf
0xffffff856afb8fd0 : 0xffffff80117ebdc6

BSD process name corresponding to current thread: Prime95

Mac OS version:
15G22010
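
For anyone reading along, the `unresponsive CPU bitmap: 0xc000` value in the first report can be decoded with a few lines of Python. This is only a sketch: it assumes bit i of the mask corresponds to logical CPU i, which the report itself does not state, and `decode_cpu_bitmap` is a made-up helper name.

```python
def decode_cpu_bitmap(mask: int) -> list[int]:
    """Return the indices of the bits set in mask (assumed: bit i = logical CPU i)."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus


# Value copied from the first panic report above.
print(decode_cpu_bitmap(0xC000))
```

Under that assumption, the two logical CPUs that stopped answering interrupts are 14 and 15.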

In two instances I also got these during the boot screen:

This had never happened to me before.

- What I did first was to delete all non-Apple kexts. Didn't help.
- I also disconnected all drives except the SSD. Didn't help.
- Then I thought that the CPUs and RAM I bought a few years ago for the upgrade might be going bad, so I swapped them for the original 2.26GHz CPUs along with the original RAM. Didn't help. The machine keeps hard-crashing when I torture test it. So unless I am extremely unlucky, this doesn't seem to be a CPU / RAM problem.
- I ran the torture test from a different drive which is running macOS High Sierra. Didn't help.
- Reset NVRAM and SMC multiple times. Didn't help.

The only case so far where the computer tends to crash much less is when running the torture test in Safe Mode. It only crashed once out of the 5 or 6 times I did it. It also tends to run for much longer (10-15+ minutes) than when doing the torture test in a normal boot, where it usually crashes within 2-3 minutes of running, sometimes less.

So I don't know what else this could be other than the GPU, the PSU or perhaps the motherboard.

Any help would be appreciated in trying to debug this further. Until I get to the bottom of this I can't upgrade my GPU. No point. I don't trust the machine.

Thank you.

tsialex · Oct 4, 2023

Time to start testing known working parts one at a time with this Mac Pro and the suspected parts one at a time with a known working Mac Pro. It's the fastest and cheapest way to diagnose a defective Mac Pro.

polanskiman · Oct 4, 2023

Thanks. I guess I am stuck then. I don't have another Mac Pro.

tsialex · Oct 4, 2023

Yep, besides cleaning it thoroughly and hoping for a bad contact somewhere, there is nothing else to do without access to parts.

h9826790 · Oct 4, 2023

Since you just changed the thermal paste, a screen capture of all sensor readings (e.g. from iStat or MFC) can help us see if there is any missing reading, overheating, etc.

polanskiman · Oct 4, 2023

h9826790 said:
Since you just changed the thermal paste, a screen capture of all sensor readings (e.g. from iStat or MFC) can help us see if there is any missing reading, overheating, etc.

Under normal operation all sensors seem to be reading fine. I don't see any temps going wild or fans not working properly. The only missing temp sensor reading is the one from the GPU. Is this normal?

Screen Shot 2023-10-04 at 9.38.29 PM.png

I did a quick test with MFC and it seems you are onto something here. It looks like under heavy load and with MFC using the Full Blast preset the machine remains stable. Once I revert to Automatic in MFC, within a minute or 2 the machine crashes. This seems to point towards something getting too hot and shutting down the cMP.

I carried out a torture test (Prime95) and used MFC and Hardware Monitor to monitor things. Here is the result:

Screen Shot 2023-10-04 at 9.53.28 PM.png

I continued the test for a good 10 more minutes after the end of the above chart and then set SMC back to automatic. The temperature rose for around a minute. I was about to take a screenshot but then the cMP crashed.
The test lasted a good 25 minutes. It had never lasted this long. I suppose the Full Blast preset kept everything cool, preventing something from getting too hot. Not sure if it's the northbridge, the CPUs themselves or something else.

tsialex · Oct 4, 2023

You never even got over the normal temperatures of the CPU/northbridge. Around 75°C is a normal and expected temperature for the northbridge; the overtemp for the X58 chipset is around 95°C.

From what you wrote, if I had to guess where to look further, your problem is in the PWM circuit of the CPU tray. I'd test the Mac Pro with a known working CPU tray.

polanskiman · Oct 4, 2023

tsialex said:
You never even got over the normal temperatures of the CPU/northbridge. Around 75°C is a normal and expected temperature for the northbridge; the overtemp for the X58 chipset is around 95°C.

Indeed.
It makes me wonder why, even though the chips didn't exceed their normal operating temperatures, the machine seems able to run the torture test indefinitely when all fans are at full blast. I tried again yesterday and it ran for 40 minutes before I decided to shut it down myself. It never went past 2-3 minutes when letting the computer's SMC manage the fan speeds.

tsialex said:
From what you wrote, if I had to guess where to look further, your problem is in the PWM circuit of the CPU tray. I'd test the Mac Pro with a known working CPU tray.

What exactly is the PWM circuit for? Is that related to fan control? Could you point out where it is located on the board? Not that I will be able to repair it myself, but at least I can have a look to see if something is funky on the board.

tsialex · Oct 4, 2023

polanskiman said:
What exactly is the PWM circuit for? Is that related to fan control? Could you point out where it is located on the board? Not that I will be able to repair it myself, but at least I can have a look to see if something is funky on the board.

PWM are the line of MOSFETs where the thermal pad sits, right next to the CPU socket. It's the power supply of each CPU.

h9826790 · Oct 4, 2023

Since you can reproduce the issue by running a CPU stress test, I think we can assume the 4870 is good and focus on the CPU tray first.

And since we can see that the CPU and NB temperatures always stay within a reasonable range, I also agree that the next thing to check is the PWM thermal pad condition. Especially if the CPUs aren't delidded, the original thermal pad most likely can't provide proper cooling to the PWM chip.

Macschrauber · Oct 4, 2023

You may also clean the area where the PWM circuits live.

polanskiman · Oct 5, 2023

tsialex said:
PWM are the line of MOSFETs where the thermal pad sits, right next to the CPU socket. It's the power supply of each CPU.

h9826790 said:
Since you can reproduce the issue by running a CPU stress test, I think we can assume the 4870 is good and focus on the CPU tray first.

And since we can see that the CPU and NB temperatures always stay within a reasonable range, I also agree that the next thing to check is the PWM thermal pad condition. Especially if the CPUs aren't delidded, the original thermal pad most likely can't provide proper cooling to the PWM chip.

CPUs are de-lidded. When I re-pasted the CPUs last week, I also changed the thermal pads which run along the heatsinks for good measure. In fact I also changed the thermal pads on the GPU.
Now I am kind of convinced that this is the issue though because I got the samples from Fujipoly about 5-6 years ago. Maybe they turned bad or perhaps they are not adapted for this use? The one I used is Sarcon GR80A-0H-200GY or Sarcon 200X-m. I am not sure. I would need to double check tonight. I also have Sarcon PG80A-00-200BL but that one I have never used.

Today I ran my torture test again, this time for 2 hours straight, but instead of full blast I halved all the fan speeds, so still pretty decent cooling. Prime95 crashed after 2 hours, but the machine did not; it was still responsive and well. I restarted the test and set MFC to auto. 1-2 minutes later... hard crash.

Screen Shot 2023-10-05 at 12.27.43 PM.png

Tonight I will remove the CPU tray and remove those thermal pads. I'll get brand new ones.

Macschrauber said:
You may also clean the area where the PWM circuits live.

The board was thoroughly cleaned last week when I changed the thermal paste on the CPUs.

polanskiman · Oct 5, 2023

I think we might have succeeded in reviving this grandma.

I swapped the thermal pads I had fitted last week for new ones. This time I used Sarcon 200X-m, which has a higher thermal conductivity (17 W/m·K).
I ran the torture test again and I am happy to report that the machine was stable for a good 25 minutes. I had to shut it down as I had to leave my place, but I'll try again tonight for a longer period in order to dispel any doubts. If everything goes well, this means I can proceed to the next episode and upgrade my GPU to an RX 580.

Screen Shot 2023-10-06 at 9.22.35 AM.png

What is interesting is that there is virtually no difference in temperatures between running 12 cores and 24 cores. Also interestingly, both the BOOSTA and BOOSTB fans dropped in speed when launching the 24 core torture test, while the Intake and Exhaust fans took over part of the cooling job. I assume that the SMC detects that BOOSTA and BOOSTB aren't able to cool the CPU heatsinks enough and thus increases overall intake and exhaust air flow to compensate.

mattspace · Oct 6, 2023

polanskiman said:
What is interesting is that there is virtually no difference in temperatures between running 12 cores and 24 cores.

What's that test between - 12 physical cores vs. 24 virtual cores, or 12 virtual cores vs. 24 virtual cores?

polanskiman · Oct 6, 2023

mattspace said:
What's that test between - 12 physical cores vs. 24 virtual cores, or 12 virtual cores vs. 24 virtual cores?

I just realized those CPUs are 6 core each, not 12. 12 is the number of threads/virtual cores. I used the wrong terminology.

Prime95 allows you to select as many threads as the CPUs allow, hence 24 threads. So the test is between 12 virtual threads vs 24 virtual threads.

mattspace · Oct 6, 2023

polanskiman said:
I just realized those CPUs are 6 core each, not 12. 12 is the number of threads/virtual cores. I used the wrong terminology.

Prime95 allows you to select as many threads as the CPUs allow, hence 24 threads. So the test is between 12 virtual threads vs 24 virtual threads.

Right, but the lack of temperature difference might be because 12 core test is 12 physical cores at 100%, whereas 24 core is 24 virtual cores, so it's still running the 12 physical cores at 100%, so it would make sense that the temperature is the same.
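
To put numbers on that, here is a rough sketch. It assumes the scheduler places one thread on each physical core before doubling up on Hyper-Threading siblings, which is an assumption and not something measured in this thread. The two sockets come from the specs in the first post, the X5670 is a 6-core, 12-thread part, and `busy_physical_cores` is just an illustrative name.

```python
SOCKETS = 2            # 2 x X5670, from the specs in the first post
CORES_PER_SOCKET = 6   # each X5670 has 6 physical cores
THREADS_PER_CORE = 2   # Hyper-Threading

PHYSICAL_CORES = SOCKETS * CORES_PER_SOCKET
LOGICAL_CPUS = PHYSICAL_CORES * THREADS_PER_CORE


def busy_physical_cores(worker_threads: int) -> int:
    """Physical cores kept busy, ASSUMING threads spread across cores first."""
    return min(worker_threads, PHYSICAL_CORES)


for workers in (12, 24):
    print(workers, "threads ->", busy_physical_cores(workers),
          "of", PHYSICAL_CORES, "physical cores busy")
```

Under that assumption both runs keep all 12 physical cores loaded, which would fit the near-identical temperatures.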

polanskiman · Oct 12, 2023

Well, ladies and gents, all seems in order. I have run multiple tests of several hours each and the granny ain't letting me down. No crash. Stable like a young stud.

I got a question though. Do these temps look normal to you?

idle:

during torture test:

h9826790 · Oct 12, 2023

The temperatures are within limit. However, it seems the thermal paste isn't working that well now.

E.g. the temperature difference between the NB diode and NB heatsink is 25°C or above. A good thermal paste should be able to keep that at 15°C or below. In fact, your screen capture in post #6 looks good.

Similar numbers are also observed on both CPUs. In fact, the numbers in post #13 are still not that bad.

It seems you opened the heatsink after post #13, and somehow the thermal paste lost performance after that.
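
The check described above is simple arithmetic. A minimal sketch follows, using made-up sensor readings (the real ones are in the screenshots) and the 15°C guideline from this post; `delta_t` and `paste_looks_ok` are illustrative names only.

```python
GOOD_DELTA_C = 15.0  # guideline from this post: diode-to-heatsink delta of 15 C or below


def delta_t(diode_c: float, heatsink_c: float) -> float:
    """Temperature difference between a chip's diode and its heatsink, in C."""
    return diode_c - heatsink_c


def paste_looks_ok(diode_c: float, heatsink_c: float,
                   limit_c: float = GOOD_DELTA_C) -> bool:
    """True when the delta is within the guideline."""
    return delta_t(diode_c, heatsink_c) <= limit_c


# Hypothetical readings, NOT taken from the screenshots in this thread.
readings = {
    "NB": (78.0, 52.0),
    "CPU A": (70.0, 58.0),
}
for name, (diode, heatsink) in readings.items():
    status = "ok" if paste_looks_ok(diode, heatsink) else "check thermal interface"
    print(f"{name}: delta T = {delta_t(diode, heatsink):.0f} C -> {status}")
```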

polanskiman · Oct 12, 2023

h9826790 said:
The temperatures are within limit. However, it seems the thermal paste isn't working that well now.

E.g. the temperature difference between the NB diode and NB heatsink is 25°C or above. A good thermal paste should be able to keep that at 15°C or below. In fact, your screen capture in post #6 looks good.

Similar numbers are also observed on both CPUs. In fact, the numbers in post #13 are still not that bad.

It seems you opened the heatsink after post #13, and somehow the thermal paste lost performance after that.

Crap. I was hoping it would be fine. I did open all heatsinks because I wanted to change the rivets on the NB heatsink for new ones but also wanted to use graphite pads on all chips (NB and both CPUs). Doesn't look like the graphite pad is doing a great job.

h9826790 · Oct 12, 2023

polanskiman said:
Crap. I was hoping it would be fine. I did open all heatsinks because I wanted to change the rivets on the NB heatsink for new ones but also wanted to use graphite pads on all chips (NB and both CPUs). Doesn't look like the graphite pad is doing a great job.

A high-quality graphite pad should be good: easy to apply, and it can be re-used. However, if all three devices (two CPUs + NB) show similar numbers (~25°C delta T), then I suspect that particular one is not a good thermal paste replacement.

The only device I have that uses a graphite pad is the Radeon VII. The factory uses it rather than normal thermal paste, and that one is of very high quality. In fact, even if I replace it with liquid metal, I only get a 2-3°C improvement. Replacing normal thermal paste, I can usually get 5°C or more easily. On my overclocked 8700K, liquid metal does more than 15°C better than the factory thermal paste.

This is what I can get from liquid metal. Ambient temperature 35°C, running Prime95. The average fan speed on Intake / Exhaust / Booster is about 1000RPM. My CPU diode temperature won't go above 80°C. And the delta T is just 12°C.

WARNING: Liquid metal can be very dangerous to a computer. If you go this route, you must do everything carefully and correctly, e.g. you CANNOT apply liquid metal to the NB heatsink. Any direct contact between liquid metal and aluminium is strictly forbidden.

P.S. My single-processor cMP gives me little concern about NB cooling. I have never touched it. It still has the factory thermal paste and rivets. But from my observation of the forum, I believe the delta T should be between 10-15°C most of the time. Of course, it can go below 10°C, but that's quite rare.

polanskiman · Oct 12, 2023

I'll just go back and put the Arctic Silver or Noctua thermal paste and remove those graphite pads.

h9826790 · Oct 12, 2023

polanskiman said:
I'll just go back and put the Arctic Silver or Noctua thermal paste and remove those graphite pads.

This should be a known good and the safest way to do it.
