Status
The first post of this thread is a WikiPost and can be edited by anyone with the appropriate permissions. Your edits will be public.
Status
Not open for further replies.
Is it disabling itself or replacing the original?
Apparently disabling itself. The notes are misleading in that it doesn't proceed if a previous SMC instance is already in place.

It can displace a native SMC instance, even a native Apple instance, if it is initiated first but will not replace if the other instance gets in first (as in your case) ... basically a race condition!

EDIT: Probably always loses such races against Apple Hardware SMC.
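The "whoever gets in first wins" behaviour described above can be sketched as follows. This is only an illustration: `ProtocolSlot`, `install` and `boot` are hypothetical names, not taken from any actual SMC driver, and the real install logic may differ.

```python
# Hypothetical sketch of first-come-first-served installation.
# None of these names come from a real SMC driver.

class ProtocolSlot:
    """A single slot that only the first installer can claim."""

    def __init__(self):
        self.owner = None

    def install(self, name):
        """Claim the slot if it is free; otherwise do nothing and report failure."""
        if self.owner is None:
            self.owner = name
            return True
        return False


def boot(load_order):
    """Run the installers in the given order and return whoever owns the slot."""
    slot = ProtocolSlot()
    for name in load_order:
        installed = slot.install(name)
        print(f"{name}: {'installed' if installed else 'did not install'}")
    return slot.owner


# The outcome depends only on which instance is initiated first.
print(boot(["emulated SMC", "native SMC"]))
print(boot(["native SMC", "emulated SMC"]))
```

The same two installers give opposite results depending purely on order, which is what makes this a race condition.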
 
Ok, so I just enabled NvmExpressDxe.efi in Drivers (as the first driver to load) to bypass the firmware's NVMe capability, and my last 10 boots/reboots, a mix of normal boots and boots into recovery, have all been successful. If this indeed turns out to be the solution we needed, then something's not right with our firmware's NVMe booting capability.

Booting using OC compiled an hour back.

Can you guys test this?
 
Can you guys test this?
See this:

A set of outcomes is not significant by itself.

First, you will need to be able to shut down your machine, come back to it in an hour or so and see whether you get the exact same results and then repeat this again. After confirming this is the case, it might be worth others looking into it (unless of course they want to anyway).

Basically, you need to be able to confirm you can reliably reproduce the outcome of "x" successes, where x > 9 (preferably 19).
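The post does not say where the 9 and 19 thresholds come from. One way to see why a short streak proves little is to assume each boot fails independently with a fixed probability (an assumption made here for illustration, not something measured in this thread). The chance of a clean streak of n boots is then (1 - p)^n. The function names and the failure rates below are hypothetical.

```python
# Minimal sketch, assuming independent boots with a fixed failure probability.
# The failure rates below are made-up illustration values, not measurements.

def streak_probability(p_fail, n):
    """Probability of n consecutive successful boots."""
    return (1.0 - p_fail) ** n


def max_failure_rate(n, alpha=0.05):
    """Largest failure rate that still yields n straight successes
    with probability of at least alpha."""
    return 1.0 - alpha ** (1.0 / n)


for n in (10, 20):
    for p_fail in (0.05, 0.10, 0.20):
        chance = streak_probability(p_fail, n)
        print(f"n={n:2d}  p_fail={p_fail:.2f}  clean streak by luck: {chance:.1%}")
    print(f"n={n:2d}  failure rates up to {max_failure_rate(n):.1%} are not ruled out")
```

Under this assumption, a longer streak that can be repeated after a shutdown narrows down the failure rate much more than a single run of 10 boots.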
 
If this indeed turns out to be the solution we needed then something's not right with our firmware's NVMe booting capability.
The issue also occurs on machines with no NVMe drives. So at best this solution could only treat a symptom.
 
I can't reliably reproduce the outcome of "x" successes, where x > 9
Code:
## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Kernel Panic'

Kernel Panic at Attempt 1

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Booted'
ATTEMPT 09: 'Kernel Panic'

Kernel Panic at Attempt 9

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Kernel Panic'

Kernel Panic at Attempt 8

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Kernel Panic'

Kernel Panic at Attempt 4

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Booted'
ATTEMPT 09: 'Booted'
ATTEMPT 10: 'Booted'
ATTEMPT 11: 'Booted'
ATTEMPT 12: 'Booted'
ATTEMPT 13: 'Booted'
ATTEMPT 14: 'Booted'
ATTEMPT 15: 'Booted'
ATTEMPT 16: 'Booted'
ATTEMPT 17: 'Kernel Panic'

Kernel Panic at Attempt 17

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Booted'
ATTEMPT 09: 'Kernel Panic'

Kernel Panic at Attempt 9

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Kernel Panic'

Kernel Panic at Attempt 1

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Booted'
ATTEMPT 09: 'Kernel Panic'

Kernel Panic at Attempt 9

## Simple Race Condition Simulator ##
-------------------------------------
ATTEMPT 01: 'Booted'
ATTEMPT 02: 'Booted'
ATTEMPT 03: 'Booted'
ATTEMPT 04: 'Booted'
ATTEMPT 05: 'Booted'
ATTEMPT 06: 'Booted'
ATTEMPT 07: 'Booted'
ATTEMPT 08: 'Booted'
ATTEMPT 09: 'Booted'
ATTEMPT 10: 'Booted'
ATTEMPT 11: 'Booted'
ATTEMPT 12: 'Booted'
ATTEMPT 13: 'Booted'
ATTEMPT 14: 'Booted'
ATTEMPT 15: 'Kernel Panic'

Kernel Panic at Attempt 15
 
I can't reliably reproduce the outcome of "x" successes, where x > 9
Sorry, that script is just a simulation to illustrate the problem of drawing conclusions from the symptoms of a random condition.

"Reliably reproducing the outcome of 'x' successes, where x > 9" is for anyone thinking some condition they have observed is significant.

For example, the poster thought NvmExpressDxe was significant (setting aside that we already know it isn't). Before anyone else needs to bother looking at that or any other such observation, the person who thinks their observation is significant first needs to reliably reproduce "x" successes (or failures, actually), where x > 9, preferably 19, themselves.
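The simulation script itself was not posted. A minimal sketch that produces output of the same shape, assuming a fixed panic probability per boot, might look like this. The function name, the probability value and the seed are all hypothetical.

```python
import random

# Hypothetical re-creation of the simulator; the original script was not posted.

def run_until_panic(p_panic, rng, max_attempts=100):
    """Simulate boots until the first panic.

    Returns the attempt number of the panic, or None if no panic
    occurred within max_attempts.
    """
    print("## Simple Race Condition Simulator ##")
    print("-------------------------------------")
    for attempt in range(1, max_attempts + 1):
        panicked = rng.random() < p_panic
        outcome = "Kernel Panic" if panicked else "Booted"
        print(f"ATTEMPT {attempt:02d}: '{outcome}'")
        if panicked:
            print(f"\nKernel Panic at Attempt {attempt}\n")
            return attempt
    return None


rng = random.Random(2021)  # arbitrary seed so a run can be repeated
for _ in range(3):
    run_until_panic(p_panic=0.1, rng=rng)
```

With a purely random panic, the streak length before the first failure varies widely from run to run, so a single long streak says little about the configuration being tested.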
 
For example, the poster thought NVMeExpress was significant (ignoring knowing that it isn't). Before anyone needs to bother looking at that or any other such observation, the person thinking their observation is significant needs to first reliably reproduce "x" successes, where x > 9, preferably 19, themselves.

Why are you trying to moderate and/or direct the conversation in this thread?

I’m contributing my experiences. If you don’t like what you’re reading, then just ignore it and read on, but don’t tell us what we should and shouldn’t think. More so, don’t speak for others, as it is not appreciated.

Edit: fixed the quoted text
 
Why are you trying to moderate and/or direct the conversation in this thread?

You must have missed this:
After confirming this is the case, it might be worth others looking into it (unless of course they want to anyway).

Of course everyone is free to do what they want but the fact is that we have been at it for 30 pages with multiple "data points" scattered all over the place and no further along than when we started.

If these could be filtered by those making observations first, we would be able to make progress as significant issues would bubble up and others can pile in to confirm.

Surely you don't object to this?

PS: I don't dislike what you posted, btw. I'm just requesting confirmation of the matter from you. If you try the suggestion and it can't be reliably reproduced, then you would be helping to filter it out. Or perhaps you will be able to come back and confirm that it can be reproduced and seems to be significant.
 
My wild guess, based on a little experience so far, is that the race condition is influenced by:

Cold Boot / Warm Boot / SMC Reset -> Cold boots trigger more freezes, and so do SMC resets.

USB -> If I unplug all my USB devices I get 99% cold boot success. Letting OpenCore auto-select the boot entry gets the machine booted, and then I can plug the devices back in.

NVRAM -> The more memory configs are stored, the longer POST takes. A pattern of about 10 reboots reminds me of the NVRAM garbage collection cycle; see my older post with dumps and NVRAM analytics.

RAM -> The more RAM, the longer POST takes. More or less RAM could also trigger the race freeze.

Drives -> Others and I have seen that it varies with the number of drives plugged in.


Plus much more: everything that affects timing in the early boot phase.
 
You had also mentioned successful cold booting using a 256 GB SSUBX drive. My experience with that drive is that POST on cold boot is very long (1 minute before OpenCore starts vs 15 seconds without that drive). So if long POST times increase the rate of success, then that could be an interesting data point. Is that what you are observing?
 
The issue also occurs on machines with no NVMe drives. So at best this solution could only treat a symptom.

I must have missed that. I'll report back if I experience boot issues again.

Update: Failure on the 14th reboot so it's not related.
 
You had also mentioned successful cold booting using a 256 GB SSUBX drive. My experience with that drive is that POST on cold boot is very long (1 minute before OpenCore starts vs 15 seconds without that drive). So if long POST times increase the rate of success, then that could be an interesting data point. Is that what you are observing?

I'll try to test increasing and decreasing POST times (e.g. by adding memory) and flashing a firmware dump with an almost full NVRAM versus an almost empty one.

Maybe we can add a data point measuring the time from power-on to the chime (POST done).
 
While you can make it unrelatedly worse, you can't make it better. Even flashing MP51.fd won't change the outcome, I've tested this and asked @cdf to test it too, with the same results.

Yes, I know. My situation is that my machine boots most of the time, so I want a state in which it should fail.
 
I made a little test helper: a script that restarts the Mac automatically.

Automatic login should be enabled.

The script has to be set as a Login Item in the Users & Groups preference pane so that it runs after automatic login.

OpenCore, if present, should be set to auto-select Big Sur.

It creates a timestamp file in Downloads, e.g. "restarted_at_11.05.2021_20-56-44".

Unless someone clicks "Cancel", it restarts the Mac after 30 seconds of being left untouched.

Be warned that continuous restarting will wear out your firmware chip.


App attached. The Code:

Code:
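-- Build a timestamp such as 11.05.2021_20-56-44
set time_stamp to (do shell script "date +%d.%m.%Y_%H-%M-%S")

-- Leave an empty marker file in Downloads for this login
do shell script "touch ~/Downloads/restarted_at_" & time_stamp

activate me

-- Clicking Cancel aborts the script; OK or the 30-second timeout continues
display dialog "Restart Script by Macschrauber" & return & return & "Restart the Mac?" & return & return & "If nothing is touched, it will restart after 30 seconds." default button 1 giving up after 30

tell application "Finder" to restart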
 

Attachments

  • Auto Restart.app.zip
    50.7 KB · Views: 75
Just a quick update. Normally, macOS dot-revs make incremental changes. However, between 11.2 and 11.3, Apple made substantial changes to the startup code. In particular, a sequence of events that's been present since Sierra (kernel_bootstrap_thread() » bsd_early_init() » sysctl_early_init()) is completely gone, replaced with code that's doing something different. (The 11.2 code includes sysctl_load_devicetree_entries(), which lays the groundwork for parsing the device tree; the 11.3 code doesn't call that function, and doesn't even seem to look at those IOReg variables.)

This immediately precedes PE_init_iokit(), which initializes not only IOKit but also the C++ runtime environment that the higher-level stuff relies upon; it also launches the IOKit matching process, which is how the boot device gets discovered. As you might imagine, changing the startup code around IOKit also changes the timing.

While it's still possible that the underlying issue is hardware-related (e.g. some neglected code for dealing with the Ethernet ports or SMC or other 5,1-specific device, which doesn't affect newer (or older?) machines, but sometimes hobbles the 4,1/5,1), I'm beginning to think it's all about these changes to the bootstrap process. And since there's new code in play here, that's where my focus goes next (because that's the most likely source of problems). However, it remains a possibility that there's always been a race condition in place, but the conditions never really allowed for a failure until these new changes exposed it. (The kernel source has many comments like "This is racy because we don't get the lock!!!!" and "This is racy; it's possible to lose both" and a number of "This is racy, but..." excuses. It's possible that one of these "racy" conditions has reared its ugly head.)

There is definitely a race condition in play, but catching it "in the act" is proving rather difficult. I'm continuing to chase it, and I'll post updates when I have something useful to offer. Otherwise, I'm not on here very much, so my apologies if anyone has asked me something and I haven't answered. I'll eventually get caught up, once this is resolved (or I surrender and wait for the 11.3 source code release, which would be very helpful at this point).
 
While you can make it unrelatedly worse, you can't make it better. Even flashing MP51.fd won't change the outcome, I've tested this and asked @cdf to test it too, with the same results.

So Alex, when you built my boot ROM, you added the more up-to-date ROM drivers from the 2012 era. Is it possible that those who are having these problems are all using the 2010-era drivers? Just asking. I will not be trying any of this because I don’t have a spare machine that can be offline for this kind of experimentation.
 
I guess it's not boot ROM related.

As @Dayo suggested, I flashed the oldest firmware, 007F.B03 from about 2010, which has no APFS or NVMe support.

I added the APFS EFI driver to RefindPlus and booted Big Sur with that alternative APFS driver.

Same symptom for me: it boots fine with the old firmware but hangs with a lot of USB hardware connected.




With no USB hardware connected, only the internal Bluetooth:

Code:
restarted_at_11.05.2021_21-57-28
restarted_at_11.05.2021_21-59-28
restarted_at_11.05.2021_21-59-58
restarted_at_11.05.2021_22-01-29
restarted_at_11.05.2021_22-01-59
restarted_at_11.05.2021_22-03-31
restarted_at_11.05.2021_22-04-02
restarted_at_11.05.2021_22-05-33
restarted_at_11.05.2021_22-06-03
restarted_at_11.05.2021_22-07-33
restarted_at_11.05.2021_22-08-03
restarted_at_11.05.2021_22-09-35
restarted_at_11.05.2021_22-10-05
restarted_at_11.05.2021_22-11-36
restarted_at_11.05.2021_22-12-06
restarted_at_11.05.2021_22-13-37
restarted_at_11.05.2021_22-14-07
restarted_at_11.05.2021_22-15-40
restarted_at_11.05.2021_22-16-11
restarted_at_11.05.2021_22-17-41
restarted_at_11.05.2021_22-18-11
restarted_at_11.05.2021_22-19-41
restarted_at_11.05.2021_22-19-55
restarted_at_11.05.2021_22-21-48
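A log like the one above can be turned into gaps between restarts with a short script. This is a sketch only: it assumes the marker names follow the `date +%d.%m.%Y_%H-%M-%S` pattern used by the restart script, and `parse_stamp` and `restart_gaps` are hypothetical helper names.

```python
from datetime import datetime

PREFIX = "restarted_at_"
# Matches the pattern produced by: date +%d.%m.%Y_%H-%M-%S
STAMP_FORMAT = "%d.%m.%Y_%H-%M-%S"


def parse_stamp(name):
    """Convert a marker file name into a datetime."""
    return datetime.strptime(name[len(PREFIX):], STAMP_FORMAT)


def restart_gaps(names):
    """Return the seconds elapsed between consecutive markers, oldest first."""
    times = sorted(parse_stamp(name) for name in names)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# First four entries of the log above
sample = [
    "restarted_at_11.05.2021_21-57-28",
    "restarted_at_11.05.2021_21-59-28",
    "restarted_at_11.05.2021_21-59-58",
    "restarted_at_11.05.2021_22-01-29",
]
print(restart_gaps(sample))
```

Listing the gaps makes it easier to compare runs with different hardware attached and to spot an interval that stands out from the rest.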
 
So Alex when you built my boot rom you added the more up to date rom drivers from the 2012 era. Is it a possibility right now that those who are having these problems are all using the 2010 era? Just asking. I will not be trying any of this because I don’t have a spare machine that be offline from this kind of experimentation.
2012 Mac Pros suffer from the same problems.
 