I have created a kext that may provide a workaround for the Big Sur 11.3+ race condition. Alpha testing with a small group has provided encouraging results, so I'm now posting it publicly to see if those results withstand broader scrutiny.
UPDATE 4aug21:
Some important points before we start:
- THIS IS STILL EXPERIMENTAL. It's not polished, and there are no guarantees - it may work extremely well for you, or you may see little or no benefit. It could conceivably even make things worse on your system, although I have yet to see any such case, and have no reason to believe it might do so. Be aware that at this stage, trying this makes you a guinea pig.
- THIS IS A WORKAROUND, NOT A SOLUTION. As far as I know, no one has yet identified the true root cause of the race condition. Therefore, no one can offer an actual "fix," since we don't know what the real problem is. This kext helps mitigate the symptoms, but it does not address the root cause - therefore, any related problems may still be present. In particular, I've seen a few reports of disk corruption attributed to the race condition (although I have yet to see anything like that myself); be aware that using this kext probably does little or nothing to mitigate that risk.
- THIS IS STILL IN BETA TESTING. As such, I'm currently looking for somewhat more experienced Mac users to give it a try and provide feedback. If you're unsure what to do if I simply say "inject it via OpenCore or just link it directly into the kernel" without further detail, then please don't try this just yet - if it proves to be useful, I'll clean up the code and I (or some kind soul) will write up step-by-step instructions for installation and use. (Anyone is free to try this, but my availability for helping folks with installation questions is extremely limited right now.)
- IF IT AIN'T BROKE, DON'T FIX IT. If your system isn't experiencing the race condition bug on Big Sur 11.3+ or Monterey, you shouldn't bother with this kext. All it's going to do is increase your boot time.
The kext is called latebloom (see the bottom of this post if you're curious about the name). It hooks the PCI bus probe code and inserts a pause on every iteration through the PCI bus. It does not touch the filesystem, it does not touch the network, it's only active during boot (before the login screen), and its only output is some print statements if you have verbose/debugging enabled. It won't do anything at all if it's loaded on macOS versions older than Big Sur. (Let me reiterate here that this is EXPERIMENTAL - it's very much like having an old TV that flickers, so you bang on the side of the case until the picture clears up. We can't directly address the true cause, because we don't know what it is, but we can shake up the boot process and apparently improve the odds of a successful boot.)
This has been tested on various versions of Big Sur 11.3+, and on three betas of Monterey. It's primarily intended for Mac Pro 4,1 and 5,1 systems, but it should run on any Intel-based Mac (if you're so inclined). Note that while it's effective on either warm (restart) or cold (power-up) boots, testing shows it to be much more effective on cold boots, for reasons currently unknown. I welcome more test results to see if we can figure out why, and perhaps do something about it.
The kext can either be injected via OpenCore or linked directly into the kernel. (I have not tested direct kernel linking, so that's still an action item.) There's nothing special about the OpenCore injection; I have set MinKernel to 20.4.0 to ensure that it only gets loaded on 11.3+. The Kernel/Add values I used:
Code:
<dict>
    <key>Arch</key>
    <string>x86_64</string>
    <key>BundlePath</key>
    <string>latebloom.kext</string>
    <key>Comment</key>
    <string>PCI delay for BS/Monterey</string>
    <key>Enabled</key>
    <true/>
    <key>ExecutablePath</key>
    <string>Contents/MacOS/latebloom</string>
    <key>MaxKernel</key>
    <string></string>
    <key>MinKernel</key>
    <string>20.4.0</string>
    <key>PlistPath</key>
    <string>Contents/Info.plist</string>
</dict>
If you have kernel verbose/debug enabled, you'll see some startup messages that start with "_____[ !!! *** latebloom *** !!! ]:", which might be informative.
Notably, if you see "HOOK NOT PLACED", the kext didn't actually do anything because it couldn't find the correct code to hook. PLEASE NOTIFY ME IF THIS HAPPENS. (This should not appear on any current Big Sur or Monterey releases, but new betas might trigger it.)
The kext uses three numeric boot-args for configuration.
Note that unlike kernel boot-args, these values must be numeric, must be decimal, and must contain one to four digits. If a latebloom boot-arg is found but what follows the "=" is non-numeric, it will be interpreted as 0.
latebloom=N sets the delay, in milliseconds. If latebloom=N is not present, a "safe" default of 60ms is used. If latebloom=0 is set (or if N is non-numeric, effectively making it 0), the kext will not place its hook, and will do nothing at all. This is an easy way to disable the kext (another being to set the OpenCore Enabled tag to false).
lb_range=N sets a range for random delays. By default, every delay is exactly the same (either the latebloom=N value or the default 60ms). If lb_range is set, the delay for each PCI bus probe is random in the range (latebloom +/- lb_range) milliseconds. For example, latebloom=90 lb_range=20 results in random delays between 70 and 110 ms (90 +/- 20). If lb_range=N is not present, it defaults to 0 (no random changes, always use the same delay value). Using random delays may help avoid deadlocks.
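To make the lb_range arithmetic concrete, here is a small Python sketch of how the delay window could be computed. This is only an illustration, not the kext's actual code: the function names are made up, the uniform distribution is an assumption, and clamping the lower bound at 0 (for cases where lb_range is larger than latebloom) is also an assumption, since the post doesn't say how that case is handled.

```python
import random


def delay_bounds(latebloom_ms, lb_range_ms=0):
    """Return the (low, high) delay window in milliseconds.

    Assumption: the lower bound is clamped at 0; the real kext may
    handle lb_range > latebloom differently.
    """
    low = max(0, latebloom_ms - lb_range_ms)
    high = latebloom_ms + lb_range_ms
    return low, high


def pick_delay(latebloom_ms, lb_range_ms=0, rng=random):
    """Pick one delay for a single PCI bus probe iteration.

    Assumption: delays are drawn uniformly from the window.
    """
    low, high = delay_bounds(latebloom_ms, lb_range_ms)
    return rng.randint(low, high)


if __name__ == "__main__":
    print(delay_bounds(90, 20))  # window for latebloom=90 lb_range=20
    print(delay_bounds(60))      # default: fixed 60 ms, no randomness
    print([pick_delay(90, 20) for _ in range(5)])
```

With latebloom=90 lb_range=20 the window is (70, 110), matching the example above; with no lb_range the window collapses to a single fixed value.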
lb_debug=N enables additional debugging messages. If N is 0, or if lb_debug is not present, only some progress messages are displayed during boot. If lb_debug is set to 1, additional debugging messages are printed, including one for each iteration of the PCI bus probe, along with the actual delay value for that loop. At present, the only valid values are 1 or 0 (or no lb_debug boot-arg present).
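If you want to sanity-check a boot-args string before rebooting, the rules above can be modeled roughly like this in Python. Again, this is a sketch and not the kext's parser: the function name is hypothetical, and two behaviors are my own assumptions because the post doesn't specify them, namely that a value with more than four digits is treated as 0, and that a bare token with no "=" is ignored.

```python
import re

# Defaults as described in the post: 60 ms delay, no random range, no debug.
DEFAULTS = {"latebloom": 60, "lb_range": 0, "lb_debug": 0}


def parse_latebloom_args(boot_args):
    """Model how the three latebloom boot-args would be interpreted."""
    settings = dict(DEFAULTS)
    for token in boot_args.split():
        key, sep, value = token.partition("=")
        if not sep or key not in settings:
            continue  # unrelated boot-arg (or bare token: assumed ignored)
        if re.fullmatch(r"[0-9]{1,4}", value):
            settings[key] = int(value)
        else:
            # Non-numeric values are interpreted as 0 (per the post).
            # Assumption: over-long values (5+ digits) also become 0.
            settings[key] = 0
    return settings


if __name__ == "__main__":
    print(parse_latebloom_args("-v keepsyms=1 latebloom=90 lb_range=20"))
    print(parse_latebloom_args("-v latebloom=abc"))
```

Note the second case: a typo in the latebloom value yields 0, which per the description above means the hook is not placed at all.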
If the kext encounters a problem trying to hook the PCI code, it will display an error message and pause the boot process for 4 seconds (to give you a chance to see the message). From there, your system will boot as usual, and the kext will do nothing. If there are no errors hooking the PCI code, the boot will proceed at normal speed (subject to whatever PCI delays you have set). Depending on your setup, there will likely be 50-100 PCI bus probe iterations; multiply that by your delay value to get a rough estimate of the impact to boot time. For example, if you set latebloom=1500, you're adding 1.5 seconds per loop - if there are 60 loops on your system, you've thus added 90 seconds to your boot time, which can feel like forever - so be careful with large values. Also note that unless you have lb_debug=1 in your boot-args, you'll see "silent" delays where your boot might appear to hang - be patient, especially if you're using large values (anything over, say, 200).
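The boot-time estimate is just delay times iterations. A trivial Python helper (the function name is hypothetical, and the iteration counts are the rough 50-100 figure from above, not measured values) makes it easy to check a value before you commit to a long wait:

```python
def added_boot_seconds(delay_ms, iterations):
    """Rough extra boot time: per-iteration delay times probe iterations."""
    return delay_ms * iterations / 1000


if __name__ == "__main__":
    # The example from the post: latebloom=1500 with 60 loops.
    print(added_boot_seconds(1500, 60))  # 90.0
    # The default 60 ms across the rough 50-100 iteration range.
    print(added_boot_seconds(60, 50), added_boot_seconds(60, 100))  # 3.0 6.0
```

So the default adds only a few seconds, while four-digit values quickly run into minutes.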
Note that if the kext is injected via OpenCore or linked directly into the kernel, it will always show up in kextstat (or other lists of loaded kexts), even if it failed to place its hook or if you've disabled it with latebloom=0. macOS sees it as loaded, but it's not doing anything; seeing it as loaded tells you nothing about whether or not it successfully hooked or added any delays. The only evidence for that is in the verbose boot messages and the duration of your boot process.
The best results seem to be in the 50-150ms range. However, depending on your CPU clock speed, the number of PCI devices you have, the number of USB devices you have, the number of disks you have, and other configuration items, your value will likely be unique.
The only way to determine what works best for you is to experiment - start with some value, see how it works, try increasing or decreasing it, see how it works, and eventually home in on the best value for your system. I'm requesting that once you've decided on an optimal (or at least "good enough") value for your system, please post in this thread with your system configuration and your latebloom parameters; I'm hoping we can collect enough of these to make a table of "best guess starting values" so that later users won't have to go through this process (or at least will have fewer steps to go through). If those parameters are consistent, and paint a fairly clear picture of which values work best on which configurations, I can incorporate that into the kext itself so it can choose a better default based on the system it's running on.
This is an unsigned kext, so you'll almost certainly need SIP disabled to link it directly into the kernel; OpenCore will apparently allow you to inject the kext without disabling SIP. As with my other kexts posted on MacRumors, the attached file is a ZIP of a TGZ file, so you'll need to extract it twice.
Good luck, and please post any results you get, good or bad. This may still prove to be yet another dead end, but it might also provide at least a narrow path past 11.2.3 (depending upon your use-case).
(Regarding the name "latebloom" - This is the 18th kext-based mitigation process I have attempted. The first one held the system in single-CPU mode until boot was completed, then enabled all the processors, creating a "late bloomer" situation. That method didn't prove fruitful, but the name stuck for every successive process.)
Version history:
Code:
12jul21 0.17 Initial public beta release
14jul21 0.19 Added support for Monterey beta 3