Status
The first post of this thread is a WikiPost and can be edited by anyone with the appropriate permissions. Your edits will be public.
Status
Not open for further replies.

burnthefires

macrumors member
May 26, 2017
59
12
After 49 pages in this forum, the conclusion, reached through much trial and error, is that cMPs have an issue booting 11.3 and later. You think yours will be excluded for some unknown reason? Why not test 20 or more boots and come back with results confirming or denying the issue?
Again, I'm not implying anything. I know it's statistically probable the issue applies to all cMPs, mine included, but from what I read here it's happening more often for some people and less frequently for others, so there must be something related to a particular part of the setup. Therefore it's also statistically possible for a setup to exist that's not affected at all.
 

sfalatko

macrumors 6502a
Sep 24, 2016
641
365
Again, I'm not implying anything. I know it's statistically probable the issue applies to all cMPs, mine included, but from what I read here it's happening more often for some people and less frequently for others, so there must be something related to a particular part of the setup. Therefore it's also statistically possible for a setup to exist that's not affected at all.
Very possible - so to start seeing if your machine is a unicorn you need to do some testing. Try 20 warm boots and see if they are all successful - then 20 cold boots and see if they are successful. If they all work let the group know. Fully document your setup to see if something in hardware might be different in your machine. Get some concrete data off your setup that could help the experts debug the situation.
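To put rough numbers on what a 20-boot test can and cannot show, here is a minimal Python sketch. It assumes every boot fails independently with the same fixed probability, which may well not hold if the hang depends on hardware state, and the function names are made up for illustration.

```python
def chance_of_clean_run(failure_rate: float, boots: int) -> float:
    """Probability that `boots` consecutive boots all succeed,
    assuming independent boots with a fixed per-boot failure rate."""
    return (1.0 - failure_rate) ** boots


def failure_rate_upper_bound(clean_boots: int, confidence: float = 0.95) -> float:
    """Largest per-boot failure rate still consistent with a run of
    `clean_boots` successes and zero failures, at the given confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_boots)


# A hypothetical machine that hangs on 10% of boots:
p_pass_anyway = chance_of_clean_run(0.10, 20)

# What 20 clean boots in a row actually tell you:
bound_after_20 = failure_rate_upper_bound(20)

print(f"{p_pass_anyway:.3f} {bound_after_20:.3f}")
```

Under those assumptions, a machine that hangs on one boot in ten would still pass 20 boots in a row about 12% of the time, and 20 clean boots only bound the failure rate to somewhere below roughly 14%. More boots give a tighter bound.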
 

iclaudius54

macrumors newbie
Jun 18, 2021
7
0
Hi All, I have a Mac Pro 4,1 flashed over to a 5,1. Installed a Sabrent NVMe drive and got Catalina on it with OC from the create.Pro guys. My Star USB card doesn't work now, but everything else seems to work.

I'm looking to try OCLP for a Big Sur installation. I have several 2.5" SATA SSDs, so I can get a sled and try it that way.

Question: Should I try to bypass 11.3 and install 11.4? Does anyone have 11.4 working correctly with OCLP? Or, should I just reinstall Catalina using OCLP? I figure my mac pro has about 2 years left on Catalina according to Apple support cycles.
 

JohnD

macrumors regular
Jun 2, 2005
150
97
Los Angeles, California
I'm not claiming there isn't anything fishy going on under the bonnet, because that's far from my expertise. I'm just saying that so far my user experience is as normal as it could be, and I wouldn't even know about this thread if I had not followed the forums. But still, since there's no definite conclusion on what's causing this issue, how can you know for sure that every cMP is affected and that I'm also going to encounter the problem sooner or later?
Because 1) even a fully stock machine, with no upgrades whatsoever, with an Apple GPU and even an Apple HDD, has the issue, and 2) the condition is random. If every hardware change resulted in the same change every time, this would be a lot easier to figure out.
 

iclaudius54

macrumors newbie
Jun 18, 2021
7
0
Don't install anything higher than BS 11.2.3 unless you are a tester in which case you expect to have problems while booting to the operating system login screen.
Thanks for the suggestion! If that's the case, should I upgrade to BS at all? Or, should I just do a Catalina install on OCLP?

Also, does anyone have issues with PCIe cards not working with OC? My Star USB 3.0 port is not working.
 

JohnD

macrumors regular
Jun 2, 2005
150
97
Los Angeles, California
Thanks for the suggestion! If that's the case, should I upgrade to BS at all? Or, should I just do a Catalina install on OCLP?

Also, does anyone have issues with PCIe cards not working with OC? My Star USB 3.0 port is not working.
Try one of the OpenCore forums, or "OpenCore - on the Mac Pro" on Facebook. This isn't an end-user support group.
 

jgleigh

macrumors regular
Apr 30, 2009
177
231
Thanks for the suggestion! If that's the case, should I upgrade to BS at all? Or, should I just do a Catalina install on OCLP?

Also, does anyone have issues with PCIe cards not working with OC? My Star USB 3.0 port is not working.
Most of us running Mac Pros have no issues with Big Sur 11.2.3.

Two support options:

You'll probably get a faster response on Discord.
 

nekton1

macrumors 65816
Apr 15, 2010
1,094
777
Asia
Perhaps I'm particularly unlucky. Except for my Titan-Ridge Thunderbolt 3 card, I removed everything related to USB and external connections and reset the SMC on each of the six attempts to boot 11.4. No dice at all. I wonder if my Mac Pro 5,1, just because it seems to be so utterly incompatible with anything above 11.2.3, might be useful as a diagnostic tool. Be that as it may, I don't have any SSDs, NVMe cards or anything that fancy. Only the Thunderbolt card.
 

nekton1

macrumors 65816
Apr 15, 2010
1,094
777
Asia
Am I misreading or are you saying you have had a Titan-Ridge Thunderbolt 3 card working properly in a cMP 5,1 (with Thunderbolt) when running 11.2.3?
 

startergo

macrumors 603
Sep 20, 2018
5,022
2,283
Upgraded to:
System Software Overview:
System Version: macOS 12.0 (21A5248p)
Kernel Version: Darwin 21.0.0
Intel 82574L:

Name: ethernet
Type: Ethernet Controller
Bus: PCI
Vendor ID: 0x8086
Device ID: 0x10f6
Subsystem Vendor ID: 0x8086
Subsystem ID: 0x0000
Revision ID: 0x0000
Link Width: x1
BSD name: en1
Kext name: Intel82574L.kext
Location: /System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/Intel82574L.kext
Version: 2.7.2
For the ethernet it says:
Status: Connected.
Ethernet 1 has a self-assigned IP address and will not be able to connect to the Internet.

WIFI is working fine (upgraded WIFI+BT)
Edit:

It looks like the booting has improved, but I have reduced my HDD/SSD/NVMe drives to 3 SSDs and one NVMe, so I can't tell for sure.
 
Last edited:

startergo

macrumors 603
Sep 20, 2018
5,022
2,283
I set the ProcessorType to 1537 and I see this:
1624150979462.png
 

khronokernel

macrumors 6502
Sep 30, 2020
278
1,425
Alberta, Canada

joevt

macrumors 604
Jun 21, 2012
6,968
4,262
io=0xff and smc=0xff boot arguments will give more printing. The problem with io=0xff is that the text scrolls very fast.
For anyone wanting to get better debug output, I've uploaded macOS 11.4 Beta 2's Kernel Debug Kit:


Don't have any machines personally with NVMe boot issues, though hope others find use for it
Some instructions for the kernel debug kit above:
To capture the boot log, use firewire or serial kprintf. I think you can do firewire kprintf and debugging at the same time. I've used serial kprintf with a hackintosh that has a COM port. It might work with a Parallels VM (you can direct serial output to a text file). Does serial kprintf work with a PCIe serial card? I have a PCIe serial card in my MacPro3,1 but haven't tried it with kprintf. I haven't tried serial or firewire kprintf since High Sierra though.

Next I updated apfs.efi and apfs_aligned.efi in
Code:
/System/Volumes/Update/mnt1/usr/standalone/i386
Those are the apfs EFI drivers that are embedded into an apfs partition when it is blessed. They don't actually get used except by the APFS jump starter (in the EFI firmware or RefindPlus or OpenCore?).

Nobody is disrespecting anything. I wanted to highlight that you are injecting code and components you build yourself from source and offering them to people as a package installer. That's all. I personally believe this is not the right approach. Acidanthera provides a very comprehensive manual for configuring OpenCore. To get the most out of OpenCore, it is important to read the manual and understand its settings, not use a system that does everything for the end user without them understanding anything behind the scenes. I believe @vit9696 also shares this philosophy.
I can use OCLP to create an OpenCore folder and compare it with my non-OCLP folder to see how they differ. Then use the manual to discover why they might differ and learn something that way.

And since there's new code in play here, that's where my focus goes next (because that's the most likely source of problems) - however, it remains a possibility that there's always been a race condition in place, but the conditions never really allowed for a failure until these new changes exposed it. (The kernel source has many comments like "This is racy because we don't get the lock!!!!" and "This is racy; it's possible to lose both" and a number of "This is racy, but..." excuses. It's possible that one of these "racy" conditions has reared its ugly head.)

There is definitely a race condition in play, but catching it "in the act" is proving rather difficult.
Another quick update - good news/bad news. Among the changes to the startup code, Apple reorganized some (not all) things from a linear procedure to a priority list. Basically, instead of code saying "do A, then do B, then do C...", there's an array of functions to call and arguments to pass, each one with a subsystem and a rank assignment. (I didn't catch this at first, because this list is stored in its own separate data segment (__KLDDATA,__init_entry_set), so my initial disassembly missed it). Anyway, that list gets sorted at runtime by subsystem and rank, then as each of the subsystems gets initialized, all of the associated functions in the list get called in order of their rank. From a programming perspective, it's much more elegant and flexible; from a reverse-engineering perspective, it's a pain in the ass - and the eventual source code release is unlikely to help, because 11.1 uses this same mechanism (to a much smaller extent) and the 11.1/11.2 source did not include this part.

I was initially intrigued by the runtime sorting of the list, because they use qsort. Some variants of qsort use random pivots, which would result in slightly different outcomes with every run (sound familiar?). Unfortunately, after tracing through the qsort they used, it's deterministic (meaning it should generate the same output every time).

My current working theory is that by reorganizing the code in this manner, one or more functions ended up being executed earlier or later than they did before this change, resulting in either creating or exposing a race condition. The good news is that because they used a priority list, if we could identify the function(s) that need to be re-ordered, it would only take a one-byte (or perhaps a few-byte) patch to fix it. The bad news is that in 11.3, there are 2315 of these function/argument pairs to analyze in order to make that determination.

I'm going to take a break from this and ponder how I might automate at least part of this process; there's no way I'm going to slog through 2315 functions just to find the needle in this haystack.
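For readers who want to picture the priority-list mechanism described above, here is a minimal Python sketch. It is not the XNU code: the entry fields, the hook names and the idea of running the whole table in one pass are assumptions for illustration (the post says the kernel calls each subsystem's functions as that subsystem is initialized).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InitEntry:
    subsystem: int                 # which subsystem the hook belongs to
    rank: int                      # call order within that subsystem
    func: Callable[[str], str]     # function to call
    arg: str                       # argument to pass


def init_hook(name: str) -> str:
    # Stand-in for a real startup function; it just reports its name.
    return name


def run_startup(entries: List[InitEntry]) -> List[str]:
    # Sort by (subsystem, rank). Python's sort is stable and deterministic,
    # so the same table always produces the same call order.
    ordered = sorted(entries, key=lambda e: (e.subsystem, e.rank))
    return [entry.func(entry.arg) for entry in ordered]


table = [
    InitEntry(subsystem=2, rank=1, func=init_hook, arg="storage"),
    InitEntry(subsystem=1, rank=2, func=init_hook, arg="pci"),
    InitEntry(subsystem=1, rank=1, func=init_hook, arg="locks"),
]

before = run_startup(table)   # locks, pci, storage

# "Patching" a single rank value reorders the calls without touching code.
table[1].rank = 0
after = run_startup(table)    # pci, locks, storage
```

The last two lines are the point of the "one-byte patch" remark: with a table like this, moving a function earlier or later is a data change, not a code change. Finding which entry to move is the hard part.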
I see two directions for solving this:
  1. Mitigate the race condition by identifying a hardware configuration that increases the rate of successful booting to an acceptable level
  2. Eliminate the race condition by patching the kernel
So far, direction 1 has not proven possible because there's no real consensus on what constitutes a good hardware configuration. The issue is just so random from machine to machine. What's more, a key piece of information from machines that do boot "reliably" is unfortunately missing:

Do such machines still boot reliably when stripped down to a minimal configuration (only graphics card and boot drive)?

If the answer is yes, then we could rule out direction 1. Otherwise, we could continue in that direction.

As for direction 2, the current status of @Syncretic's research suggests that it's like finding a needle in a haystack. Hopefully, there's more to come from this direction, because it otherwise seems promising!
So for my iMac11,2, MacBook7,1 and MacPro3,1, they all experience the hang right after "PCI Configuration Begins" fairly frequently (1 in every 10 boots). All machines have SATA SSDs and are basically stock (MacPro3,1 with BCM94360); they'll occasionally also panic on "Waiting for Root Device". My iMac can no longer boot without verbose mode, and another user on an iMac8,1 said they can rarely boot with Ethernet and Wi-Fi kexts injected via OpenCore or on disk.

Machines with too many PCI devices to query aren't finishing fast enough, which creates the boot hangs and/or errors.

ASentientBot was also able to replicate this issue on a Penryn hackintosh, so it's not specific to Apple's firmware configuration.
Not sure if this is a silly idea, but if the identified XNU race problem has to do with the PCIe ports (I assume their identification and status), would it be possible to create some kind of kext that works similarly to the kext created for USB port mapping on a Hackintosh? The custom USB port kext I mention (usually produced with a tool called Hackintool) identifies the USB ports and cuts down the number available on the system so that the recognised total is reduced to 15 ports, the maximum macOS currently allows. This usually fixes conflicts between USB ports and sets them up properly. Could such a kext, applied to PCI ports (identifying them and setting the precise tree and ports from the very start), potentially avoid the race problem when the system is booting up?
Note: I've just used Hackintool on a MBP and easily exported the PCIe listing (it generated 4 files: dsl, json, plist and text). Would it be possible to do a post-install tweak on an OC cMP with this precise custom information in mind, again avoiding the XNU race altogether? A post-install tweak could potentially be another strategy (in the same way as custom USB port mapping usually is). Some food for thought.
About "PCI Configuration Begins": setting some pci flags in boot-args will output more PCI Configuration info (a lot more - not useful unless you have serial or firewire kprintf enabled).

To solve race conditions, it may be useful to find a way to modify the scheduling of loading kexts. I wonder if a kext can be created that does nothing except match a PCIe device with a high probe score, then unload itself for that one device at a deterministic point in time, then force a rematch where our kext would return a low probe score so that the real kext would load at that time. Basically, our kext would be controlled by a scheduling kext. Is OpenCore able to inject kexts into the early boot before the root drive is discovered?

You would need to eliminate any timeouts that expect a kext to be loaded in a certain amount of time.

For those who want to disable non-essential PCIe devices, you can apply botched properties to them so macOS won't recognize them. Similar to how we do it for iMac12,x with upgraded Metal GPUs, we hide the bad devices with garbage properties so macOS treats them as generic devices and no kexts attach.
Or you could make a kext that attaches instead of the real kext like I suggested above. This sort of worked with an Nvidia GPU in a MacPro7,1 that was causing sleep issues. #332

My current thinking is this: the NVRAM is accessed through an EFI runtime service. Is it possible that supported machines have a newer EFI runtime that behaves just a little differently (especially for timing-sensitive operations)? The MP3,1 seems to use a M50FW016-class FWH flash, while the 4,1 uses an SST25VF032B-class SPI flash. The FWH interface is probably slower, but less time-sensitive than the SPI interface (although I'd guess both probably use the LPC controller for access), and that could account for why the MP3,1 seems to have a higher success rate than the MP4,1/5,1. If MacOS is expecting a fast result, or is sending commands/data too quickly, the older EFI interface (and/or the flash chip itself) could get confused, time out, or just hang.

I doubt they changed the actual EFI runtime interface, because if that were the case, I'd expect either 100% failures on older systems or runtime instability, and we don't see that.
If NVRAM speed were the problem, you could remove it from consideration by creating an EFI driver to override the NVRAM read/write. The driver would read all the NVRAM parameters into memory, then when the OS takes over, the OS would use the EFI driver to read/write NVRAM variables but only to RAM. OpenCore already has code to override the NVRAM read/write functions.
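The RAM-only override idea can be pictured with a short Python sketch. This is only a model of the behaviour, not EFI code: the class, its method names and the dictionary standing in for the flash chip are all hypothetical.

```python
from typing import Dict, Optional


class ShadowNvram:
    """Hypothetical RAM-backed stand-in for a slow NVRAM variable store."""

    def __init__(self, backing_store: Dict[str, bytes]) -> None:
        # Snapshot every variable once, before the OS takes over.
        # The backing store is never read or written again after this.
        self._ram: Dict[str, bytes] = dict(backing_store)

    def get_variable(self, name: str) -> Optional[bytes]:
        # Reads are served from RAM, so flash timing can't matter.
        return self._ram.get(name)

    def set_variable(self, name: str, value: bytes) -> None:
        # Writes land in RAM only and are lost at power-off.
        self._ram[name] = value


flash = {"boot-args": b"-v"}            # stand-in for the real flash contents
nvram = ShadowNvram(flash)
nvram.set_variable("boot-args", b"-v smc=0xff")
```

If boots became reliable with something like this in place, NVRAM timing would be a strong suspect; if nothing changed, it could be crossed off the list. The obvious cost is that nothing the OS writes survives a reboot.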


(I'm currently thinking we have a deadlock situation, but identifying every possible boot-time deadlock is virtually impossible, so I'm concentrating on events surrounding APFS initialization, which is where most of my hangs occur. In my case, the most frequent hang is after AppleAPFSContainerScheme::probe() is called, but before AppleAPFSContainer::probe() gets called.)
One way to cause a deadlock is to have multiple processes that use locks in a different order.
Process 1: lock A -> lock B -> do stuff
Process 2: lock B -> lock A -> do stuff

If Process 1 locks A and Process 2 locks B, then Process 1 can't lock B so it's stuck. Process 2 can't lock A so it's stuck. The locks have to be used in the same order like this:

Process 1: lock A -> lock B -> do stuff -> unlock B -> unlock A
Process 2: lock A -> lock B -> do stuff -> unlock B -> unlock A

If Process 1 locks A, then Process 2 has to wait. Process 1 can lock B, then do its thing, then unlock the locks.
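The safe ordering above can be tried out with a small Python sketch. It uses threads in place of processes, which is an assumption made to keep the example short; the lock names follow the post.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
finished = []


def worker(name: str) -> None:
    # Both workers take A first, then B. Whoever reaches A second simply
    # waits, and nobody can be holding B while waiting for A.
    with lock_a:
        with lock_b:
            finished.append(name)  # "do stuff"
    # Leaving the with-blocks unlocks B, then A.


threads = [
    threading.Thread(target=worker, args=(f"process {i}",)) for i in (1, 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)  # a deadlock would show up as a thread still alive
```

Swapping the order of the two `with` lines in just one of the workers would recreate the first, deadlock-prone pattern. That version might still run fine most of the time, which is what makes this kind of bug so hard to catch.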

With regard to NVRAM issues and repeated failures to boot 11.3/11.4: is it the panic log being written to NVRAM that causes the issue, both through corruption and worn-out NAND cells?

I see that "nvram_paniclog" is a valid boot argument. Its default is to write panic logs to NVRAM. However, it can be disabled by adding the boot argument nvram_paniclog=0.
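For anyone scripting their configuration, here is a minimal Python sketch of adding such a flag to an existing boot-args string without leaving a conflicting copy behind. The helper name is made up, and how the resulting string is applied (OpenCore config, the nvram command, and so on) is outside the sketch.

```python
def merge_boot_args(existing: str, *additions: str) -> str:
    """Return `existing` with each addition applied, replacing any
    earlier token that sets the same key."""
    tokens = existing.split()
    for addition in additions:
        key = addition.split("=", 1)[0]
        # Drop older tokens for the same key so the new value wins.
        tokens = [t for t in tokens if t.split("=", 1)[0] != key]
        tokens.append(addition)
    return " ".join(tokens)


new_args = merge_boot_args("-v nvram_paniclog=1", "nvram_paniclog=0", "io=0xff")
print(new_args)
```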
It's not the full crash that is written to the NVRAM, but the number and name/path of it on the disk in a NVRAM variable named PanicInfoLog. Since you can't access the root device, the kernel is crashing before or right at the IOFindBSDRoot, no crash log is being saved to the disk and the PanicInfoLog is not being saved.
I don't think I've seen a panic log NVRAM variable with a file path before. Is that specific to certain macOS versions or Macs?
I've seen nvram variables AAPL,panic-info which contain an entire panic log and also AAPL,PanicInfo#### where #### is a hex number where the panic log is split into multiple NVRAM variables with a max size of 768 bytes per variable. These panic logs are large so it may be a problem if they're getting written to NVRAM all the time.
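To show why a large panic log turns into many NVRAM writes, here is a minimal Python sketch of the splitting scheme described above. The 768-byte limit and the AAPL,PanicInfo prefix come from the post; the four-digit hex suffix counting up from 0000 and the helper names are assumptions.

```python
from typing import Dict

MAX_CHUNK = 768  # per-variable size limit mentioned in the post


def split_panic_log(log: bytes, chunk_size: int = MAX_CHUNK) -> Dict[str, bytes]:
    """Split a panic log into numbered variables of at most chunk_size bytes."""
    variables: Dict[str, bytes] = {}
    for index, start in enumerate(range(0, len(log), chunk_size)):
        # Assumed naming: four hex digits counting up from 0000.
        variables[f"AAPL,PanicInfo{index:04x}"] = log[start:start + chunk_size]
    return variables


def join_panic_log(variables: Dict[str, bytes]) -> bytes:
    """Reassemble the log; fixed-width names sort in numeric order."""
    return b"".join(variables[name] for name in sorted(variables))


sample_log = b"x" * 2000
chunks = split_panic_log(sample_log)  # three variables: 768 + 768 + 464 bytes
```

Every variable is a separate write, so a machine that panics on most boot attempts would be hitting the flash far more often than one that boots cleanly.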

I see that the io flags were mentioned briefly at https://github.com/dortania/OpenCore-Legacy-Patcher/issues/110
 

nekton1

macrumors 65816
Apr 15, 2010
1,094
777
Asia
Thanks Peter; I was under the (now mistaken) impression that the Gigabyte Titan-Ridge Thunderbolt 3 card only works in Gigabyte non-Apple motherboards. So can you confirm that TB3 can be added to a cMP 5,1 running macOS prior to 11.3 using one of these cards?
A System Information—Hardware—PCI screenshot would be useful.
 

lovely666

macrumors member
Oct 7, 2006
45
14
I don't think I've seen a panic log NVRAM variable with a file path before. Is that specific to certain macOS versions or Macs?
I've seen nvram variables AAPL,panic-info which contain an entire panic log and also AAPL,PanicInfo#### where #### is a hex number where the panic log is split into multiple NVRAM variables with a max size of 768 bytes per variable. These panic logs are large so it may be a problem if they're getting written to NVRAM all the time.

I think it was a past feature that's still partially implemented in 11.3. E.g., it has all the flags and logic to do it; however, when it gets to the final stage where it would write to the NVRAM, the method is empty. If I look back in git, it was there many versions ago, but not anymore.

On the original subject of hangs and panics: my gut instinct is that this is caused by memory corruption rather than a race condition. The kernel panics I've seen on my system and in others' screenshots lead me to believe they are often page faults, and memory corruption would also cause random hangs and random console garbage.

I've noticed in diffs between 11.2 and 11.3 a lot of refactoring of both memory allocation and locks, either of which could be responsible for memory corruption. What I don't get is why this affects MacPro5,1/4,1 so much more than other systems.

At least it's been interesting getting to know the XNU source.

IMG_4658.JPG


IMG_4661.JPG
IMG_4662.JPG
IMG_4663.JPG
 

PeterHolbrook

macrumors 68000
Sep 23, 2009
1,625
441
Thanks Peter; I was under the (now mistaken) impression that the Gigabyte Titan-Ridge Thunderbolt 3 card only works in Gigabyte non-Apple motherboards. So can you confirm that TB3 can be added to a cMP 5,1 running macOS prior to 11.3 using one of these cards?
A System Information—Hardware—PCI screenshot would be useful.
Not 11.3! Just 11.2.3.
TB.png
Screen Shot 2021-06-20 at 10.27.01 AM.png
 

trifero

macrumors 68030
May 21, 2009
2,958
2,800
I have just bought one this morning, along with an Apple Thunderbolt 3 to 2 adapter. I hope it works with my TB2 dock.
 

khronokernel

macrumors 6502
Sep 30, 2020
278
1,425
Alberta, Canada
Not really. Just curious what is the reason OCLP sets this CPU model on a MacPro7,1 board spoofing? AFAIK i5 is not installed in the MP7,1 nor iMacPro.
It should only set if the user opts to do CPU renaming with RestrictEvents in patcher settings

Are you saying on a clean run of OCLP with no user changes this occurs? If so, seems to be a bug
 