Status
The first post of this thread is a WikiPost and can be edited by anyone with the appropriate permissions. Your edits will be public.
Status
Not open for further replies.
That loop in IOFindBSDRoot() has me wondering: has anyone left their system running after the "prohibited" symbol (or the "Still waiting for root device" message) appears? It's possible that the underlying problem isn't actually killing the system, but just slowing it down with a blocking operation (or some other failure) that causes everything else to time out. That IOFindBSDRoot() loop times out every 60 seconds and keeps retrying forever; if the boot device really is available, it should eventually find it.

Someone might want to try leaving the system sitting for 5-10 minutes once the "prohibited" symbol appears (or, if the symbol never appears, once the "Still waiting for root device" message appears). Sometimes a hang isn't really a hang...
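To make the "a hang isn't really a hang" idea concrete, here is a minimal C sketch of a wait loop that gives up on each 60-second round, prints a message, and starts another round until the device appears. It is not XNU's IOFindBSDRoot() code; `wait_for_root`, `appears_after`, and the simulated one-second clock are hypothetical, chosen so the example runs instantly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical probe: returns true once the root device is available. */
typedef bool (*probe_fn)(unsigned elapsed_s, void *ctx);

/* Example probe: the device becomes available at a fixed simulated second. */
static bool appears_after(unsigned elapsed_s, void *ctx) {
    unsigned ready_at = *(const unsigned *)ctx;
    return elapsed_s >= ready_at;
}

/* Retry forever in rounds of round_s simulated seconds, printing a message
 * each time a round expires without success. Returns the simulated second
 * at which the probe first succeeded. */
static unsigned wait_for_root(probe_fn probe, void *ctx, unsigned round_s) {
    assert(round_s > 0);
    unsigned elapsed = 0;
    for (;;) {
        for (unsigned t = 0; t < round_s; t++, elapsed++) {
            if (probe(elapsed, ctx))
                return elapsed;
        }
        printf("Still waiting for root device (%u s)\n", elapsed);
    }
}
```

With the device showing up at second 150 and 60-second rounds, the loop prints two "still waiting" messages and then returns 150, which is what someone leaving the machine alone for a few minutes would hope to see.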
 
Actually, I mentioned previously that the hang persists even with no WiFi card.
That is probably because this particular one you posted is related to Ethernet? I got the same one. I tried downgrading IO80211Family, but apparently that is not the one for Ethernet.
 
Actually, I mentioned previously that the hang persists even with no WiFi card.
Seems to go past that anyway. The Ethernet thing, "IONetworkInterface", would be the one to look into.
Someone flagged this a few posts back as well.

WiFi is still something to bear in mind for later if there's no joy... if only to eliminate it.
 
That is probably because this particular one you posted is related to Ethernet? I got the same one. I tried downgrading IO80211Family, but apparently that is not the one for Ethernet.
Intel82574L is a gigabit Ethernet controller.
IO80211Family covers Atheros, Broadcom, and other 802.11* WiFi controllers.
 
Here are a few screenshots of hangs I had yesterday (I haven't powered on the test machine yet today). I'm booting from a SATA SSD on a PCIe card, which has had a 50% success/fail rate over tens of reboots.

IMG_6673.JPG
IMG_6674.JPG
IMG_6675.JPG
IMG_6676.JPG
IMG_6677.JPG
IMG_6678.JPG
 
Actually I left it on for a while and I saw something print on the screen after the prohibitory sign.
If you let it sit for a while (don't recall if 30s/1m/more?), it keeps adding blank lines to the bottom, and shifting the text up. I haven't seen it add any additional text on mine though.
 
Examining things a bit further ... Those two Intel82574L entries being the last items does not mean it hangs/crashes there. Just seems there is much less logging afterwards.

Seems it tries to see whether the machine is networked with those Ethernet kexts. It then returns from "IOFindBSDRoot".

Once it returns from this, there isn't as much logging and seems things either work or you get KPs. The panic is presumably the log source.

Hopefully related to those kexts as otherwise the trail goes a bit cold without the KP log.
 
io=0xff and smc=0xff boot arguments will give more printing. The problem with io=0xff is that the text scrolls very fast.
 
I need to step away from this for a while, but I have a working theory.

Like Catalina, Big Sur uses the split volumes (e.g. "Big Sur" and "Big Sur - Data") that APFS magically sews together as a Volume Group. Because of the Signed/Sealed System Volume, Big Sur does a lot more hashing and verification, though.

In the majority of @JohnD's screen photos, I see nx_get_volume_group and getVolumeGroupMountFrom errors. Those functions are part of the APFS kext. The "getVolumeGroupMountFrom...failed with error 2" is a direct consequence of the "nx_get_volume_group: volume groups tree is not setup yet" error. A background thread should be building that tree and dealing with all the hashes, but when the bootloader is ready to go, it's not done yet. Since the Volume Group doesn't exist, neither does the Boot Volume that the bootloader wants to see - and hence the "still waiting for root device" errors.
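A minimal C sketch of the suspected ordering problem, assuming a simple model: a background thread builds the volume-group tree, and a lookup that arrives first fails with ENOENT (error 2) instead of waiting. `vg_tree`, `vg_lookup`, and `race_once` are hypothetical names, and the sleeps only stand in for hashing and verification work; this is not APFS code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the volume-group tree: built by a background
 * thread, which sets `ready` when it finishes. */
struct vg_tree {
    atomic_bool ready;
};

/* Fails with ENOENT (errno 2) if the tree is not built yet, rather than waiting. */
static int vg_lookup(struct vg_tree *t) {
    return atomic_load(&t->ready) ? 0 : ENOENT;
}

static void sleep_ms(long ms) {
    struct timespec d = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&d, NULL);
}

struct builder_arg {
    struct vg_tree *tree;
    long build_ms;              /* simulated hashing/verification time */
};

static void *build_tree(void *p) {
    struct builder_arg *a = p;
    sleep_ms(a->build_ms);
    atomic_store(&a->tree->ready, true);
    return NULL;
}

/* Start the builder, wait mount_after_ms, then try the lookup exactly once.
 * The result depends only on which thread gets there first. */
static int race_once(long build_ms, long mount_after_ms) {
    struct vg_tree tree;
    atomic_init(&tree.ready, false);
    struct builder_arg arg = { &tree, build_ms };
    pthread_t th;
    if (pthread_create(&th, NULL, build_tree, &arg) != 0)
        return -1;
    sleep_ms(mount_after_ms);
    int rc = vg_lookup(&tree);
    pthread_join(th, NULL);
    assert(atomic_load(&tree.ready));   /* the tree always gets built eventually */
    return rc;
}
```

With these timings, race_once(0, 200) should succeed and race_once(500, 0) should return ENOENT; nothing changes between the two runs except which thread wins.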

I think that perhaps Apple introduced a race condition that can cause the APFS volume group creation thread to block. If the wind is blowing in the right direction, you might get lucky; if not, it hangs. So why might the MP3,1 not suffer as much? I think that's partly due to the math - the APFS kext uses the CRC32 instruction if it's available (which it is on MP4,1/5,1), but uses a software CRC32 if not (as on the 3,1). That means the MP3,1 will always be a little slower calculating the hashes. If there really is a race condition, slower might actually be better.
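For reference, here is a table-driven software CRC-32C in C, in the same byte-at-a-time style as a software fallback would use. The Castagnoli polynomial (reflected form 0x82F63B78) is the one the SSE4.2 crc32 instruction implements; the function names are my own, and this is a sketch rather than Apple's code. Each input byte costs a table lookup, a shift, and an XOR, while the SSE4.2 instruction can consume 8 bytes at a time, which is where a speed gap between the MP3,1 and the 4,1/5,1 would come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_table[256];

/* Build the 256-entry table: entry i is the raw CRC update of byte i
 * starting from a zero state. */
static void crc32c_init_table(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        crc32c_table[i] = c;
    }
}

/* Raw update (no pre/post inversion): one lookup, shift and XOR per byte. */
static uint32_t crc32c_soft(uint32_t crc, const void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    const uint8_t *p = buf;
    for (size_t i = 0; i < len; i++)
        crc = crc32c_table[(crc ^ p[i]) & 0xFFu] ^ (crc >> 8);
    return crc;
}
```

As a sanity check, starting from 0xFFFFFFFF and XOR-ing the result with 0xFFFFFFFF, the string "123456789" gives the standard CRC-32C check value 0xE3069283.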

One thing that someone might want to try is downgrading the APFS kext to the last good Big Sur version (was that 11.2.3?). I have no idea if they mangled any of the structures such that the older kext won't work, but it might be worth a try. The actual cause might be entirely outside of APFS, but right now, APFS seems like the focal point.

Also, is everyone running with sealed volumes, or without? Or do unsupported Macs even have a choice? (I haven't kept up with the Big Sur patchers very well...)

Good luck to all, I'll be back at some point.
 
I think that perhaps Apple introduced a race condition that can cause the APFS volume group creation thread to block. If the wind is blowing in the right direction, you might get lucky; if not, it hangs. So why might the MP3,1 not suffer as much? I think that's partly due to the math - the APFS kext uses the CRC32 instruction if it's available (which it is on MP4,1/5,1), but uses a software CRC32 if not (as on the 3,1). That means the MP3,1 will always be a little slower calculating the hashes. If there really is a race condition, slower might actually be better.
Perhaps we could test this simply by spoofing out SSE 4.2 in CPUID.
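A C sketch of what that spoof would do to a CPUID-based dispatch, assuming XNU's 64-bit feature word layout (EDX in the low 32 bits, ECX in the high 32 bits, so SSE4.2, which is ECX bit 20, lands at bit 52). `select_crc32c`, `spoof_no_sse42`, and both CRC functions are illustrative names; the hardware path uses the GCC/Clang `_mm_crc32_*` intrinsics, is only compiled on x86_64, and falls back to the software version elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define HAVE_HW_CRC32C 1
#endif

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *buf, size_t len);

/* Bitwise software CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    const uint8_t *p = buf;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc;
}

#ifdef HAVE_HW_CRC32C
/* SSE4.2 path: 8 bytes per crc32 instruction, then the tail byte by byte. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint64_t c = crc;
    for (; len >= 8; p += 8, len -= 8) {
        uint64_t v;
        memcpy(&v, p, sizeof v);            /* alignment-safe load */
        c = _mm_crc32_u64(c, v);
    }
    uint32_t c32 = (uint32_t)c;
    while (len--)
        c32 = _mm_crc32_u8(c32, *p++);
    return c32;
}
#else
#define crc32c_hw crc32c_sw                 /* no hardware path on this target */
#endif

/* Assumed XNU-style feature word: EDX in bits 0-31, ECX in bits 32-63,
 * so SSE4.2 (CPUID.1:ECX bit 20) is bit 52. */
#define FEATURE_SSE4_2 (UINT64_C(1) << 52)

/* Choose the implementation once and keep the function pointer. */
static crc32c_fn select_crc32c(uint64_t features) {
    return (features & FEATURE_SSE4_2) ? crc32c_hw : crc32c_sw;
}

/* "Spoofing out" SSE4.2: clear the bit before the selection is made. */
static uint64_t spoof_no_sse42(uint64_t features) {
    return features & ~FEATURE_SSE4_2;
}
```

Because the choice is made once from the feature word, clearing that single bit before the kext reads it would be enough to force the software path on a 4,1/5,1, assuming nothing else in the kext re-reads CPUID.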
 
Does someone know how to manipulate an external Big Sur volume?
Say I mounted the volume read-write like this:
Code:
sudo mount -o nobrowse -t apfs /dev/disk11s3 /System/Volumes/Update/mnt1
Next, I updated apfs.efi and apfs_aligned.efi in
Code:
/System/Volumes/Update/mnt1/usr/standalone/i386
Do I have to rebuild the kernel cache with:
Code:
sudo kmutil install --volume-root /System/Volumes/Update/mnt1/ --update-all
or does --volume-root need to be substituted?
Do I need to create a new bootable snapshot with:
Code:
sudo bless --folder /System/Volumes/Update/mnt1/System/Library/CoreServices --bootefi --create-snapshot
 
I have a working theory.

Makes sense, as all the issues seemed to be related to problems rooting filesystem/IO resources and something not being mounted. That was odd given the focus on PCIe, but the problem was still there when the slots were empty:
It is basically timing out due to some IO issue after 10 attempts at mounting something have failed. What that "something" is exactly needs checking, but initially it looks like some disk/drive.

"Still waiting for root device" seems to be the key. It seems to be targeted at Intel, but all the stuff there seems to be filesystem related.

BTW, for those testing who are already on OC 0.6.9, you might also want to make sure the recently added EnableVectorAcceleration key is deactivated. It leverages AVX (when available) for calculating hashes, which apparently results in an approx. 30% speed bump for such operations. In light of the race-condition theory, that might be an unwanted speed bump.

EDIT: On the flip side, if the APFS driver is accelerating stuff it touches and this key does the same for related stuff the driver doesn't touch, having it on might actually be beneficial.
 
It appears that the "!BSD" message only means that when IOFindBSDRoot() is called, there are no IOBSD devices still initializing (the lack of "!BSD" should indicate that IOFindBSDRoot() had to wait (up to 30 seconds) for one or more IOBSD devices to complete their initialization).
Based on @cdf's report (that it takes 30s for !BSD to appear) and looking again at the code, it is trying to find a matching BSD (IOBSD) device, and then reporting after a 30s timeout that it has not found one. (Which may well be exactly what you are saying above @Syncretic , but I wasn't quite sure!)
 
Just adding some extra context here, not answering on behalf of the original poster:

it is trying to find a matching BSD (IOBSD) device, and then reporting after a 30s timeout that it has not found one.
This is what happens, but following the function code to the end, it is not supposed to cause any issues. It only applies under certain conditions, as the code comment points out. The function goes on to try other services, and its standard return is a success code (unless, obviously, it crashes beforehand).

The IOFindBSDRoot function is called from the setconf function in the "bsd/kern/bsd_init.c" file.
This setconf function is called in a loop which means IOFindBSDRoot is ultimately called in a loop and each time should successfully hook something up even if that bit near the beginning does not match or is not relevant.

Just some general context to help us better understand what is happening, as the stuff @cdf noted happening after 1 minute might come from subsequent runs of the function in the loop. That is, the "!BSD" message might have nothing to do with the observed problem; it may simply have been output on an earlier call of the function, which then progressed to the end of that particular loop iteration without issue. (Not checked; just raising it as a possibility.)

I wonder whether OC can intervene somewhere, but I'm not sure whether it is yet able to do more with the Apple kernel beyond driver injection/blocking and patching.
 
Sorry for going a bit off topic, but I think I have found a way of running 11.3 on my MP4,1.

So far, I have been able to boot 11.3 from my NVME SSD several times without any issues at all. I also didn't notice any other problems or glitches. Of course, I only did the update yesterday, so there may still be issues ahead.

I admit that I did not read the complete thread, but maybe the information below can help you get to the bottom of the update issues...

Originally, the 11.3 update failed with the frozen progress bar and any number of restarts did not help resolve the issue. I then decided to do a downgrade to 11.2.3. I had a full backup of the system drive but thought there might be an easier way to do a downgrade.

I booted from a separate drive and overwrote the System folder on the NVME with the one located in "macOS - Data/Previous System". After a reboot, nothing had changed; I still got the frozen progress bar.

I then used this installer (which was supposed to contain 11.2.3) to do a reinstallation on the NVME:
http://swcdn.apple.com/content/down...i7fezrmvu4vuab80m0e8a5ll/InstallAssistant.pkg

As it turned out, Apple must have updated the installer, because it actually reinstalled 11.3.

After the next reboot, the system was busy for about 30 minutes and then presented me with the setup dialogue of a new system installation (user creation, language settings, etc.). After finishing the setup, I could log in with my original user.

About this Mac now shows the OS version as 11.3 and so far everything seems to be working fine.

I am not sure if copying the old system folder actually did any good. Maybe it was just the downloaded installer that did the trick?

If there is anything I can contribute to help let me know.


EDIT: In the meantime, I had one boot failure with 11.3 (after 8 or 9 successful boots). I guess I was a bit too optimistic in my post above.
 
Which cMP CPU has the AVX instruction set? I don't think any of them do.
"X5690 Instruction Set Extensions: Intel® SSE4.2"
Just threw it out there; I'm not 100% familiar with how it works, as the documentation is on the laconic side.

Regardless, it does some form of hashing acceleration similar to what was pointed out. Might not even be for related items but still something to be aware of.
 
I think there is something with the APFS.efi. I placed one from 11.2.3 in the drivers section of OC. Of course, APFS jump start and connect are false. Boot still hangs most of the time, but so far there has been no IONVMe panic with an NVMe drive in the system. I wonder if CpuTscSync can help here as well.
 
Tagging @vit9696 ... See:
 
I do not really have any news about this, but just for completeness: the APFS filesystem does not use CRC32. Instead it goes with a Fletcher hash. Whether or not that matters, the absence of the CRC instruction should not matter for APFS, at the very least.
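For comparison, a C sketch of the Fletcher-64 object checksum as I understand Apple's published APFS description: running sums over 32-bit little-endian words modulo 2^32 - 1, folded into two check words. This is not apfs.kext's code and ignores the AVX variant; `apfs_fletcher64` is an illustrative name, and the input is assumed to be the object's contents after its 8-byte checksum field, already laid out as host-order 32-bit words on a little-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fletcher-64 variant for APFS object checksums (sketch): sums of 32-bit
 * words modulo 2^32 - 1, folded so the result fits the 8-byte header field. */
static uint64_t apfs_fletcher64(const uint32_t *words, size_t nwords) {
    const uint64_t mod = 0xFFFFFFFFu;
    uint64_t sum1 = 0, sum2 = 0;
    assert(words != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++) {
        sum1 = (sum1 + words[i]) % mod;
        sum2 = (sum2 + sum1) % mod;
    }
    uint64_t c1 = mod - ((sum1 + sum2) % mod);
    uint64_t c2 = mod - ((sum1 + c1) % mod);
    return (c2 << 32) | c1;
}
```

Note that the inner loop is only additions and a modulo per word, with no table lookups, so whether CRC32 hardware exists has no bearing on this particular checksum.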
 

I don't have much time to look at this today, but I did want to defend my CRC32 observation.
This is all from apfs.kext; I have (snipped) irrelevant chunks of code to make this more concise.

Code:
_nx_kernel_mount:
00000000000f43b2    55     pushq    %rbp
00000000000f43b3    48 89 e5     movq    %rsp, %rbp
00000000000f43b6    41 57     pushq    %r15
// (snip)
00000000000f44cb    e8 00 00 00 00     callq    _cpuid_features
00000000000f44d0    48 0f ba e0 34     btq    $0x34, %rax
00000000000f44d5    72 43     jb    0xf451a
00000000000f44d7    31 d2     xorl    %edx, %edx
00000000000f44d9    48 8d 35 60 7e 07 00     leaq    _crc32c_table(%rip), %rsi
00000000000f44e0    4c 8d 05 53 b1 f1 ff     leaq    _crc32c_soft(%rip), %r8
// (snip)
00000000000f4518    eb 07     jmp    0xf4521
00000000000f451a    4c 8d 05 b6 b0 f1 ff     leaq    _crc32c_x86_hw(%rip), %r8
00000000000f4521    4c 89 05 10 7e 07 00     movq    %r8, _crc32c(%rip)
///
/// The code above checks the stored CPUID data for SSE4.2 (bit 52 of ECX:EDX).
/// If SSE4.2 is not available, store _crc32c_soft (software CRC32) in the _crc32c variable.
/// If SSE4.2 is available, store _crc32c_x86_hw (hardware CRC32) in the _crc32c variable.
/// _crc32c will be called indirectly for CRC32 calculations.
/// (They do something similar to choose an AVX implementation of _fletcher64 if
/// AVX is available.)
///

// (snip)

///
/// Hardware CRC32 implementation (uses crc32q/crc32l/crc32w/crc32b)
///
_crc32c_x86_hw:
000000000000f5d7        55      pushq   %rbp
000000000000f5d8        48 89 e5        movq    %rsp, %rbp
000000000000f5db        89 f8   movl    %edi, %eax
000000000000f5dd        48 89 d1        movq    %rdx, %rcx
000000000000f5e0        48 c1 e9 03     shrq    $0x3, %rcx
000000000000f5e4        74 11   je      0xf5f7
000000000000f5e6        31 ff   xorl    %edi, %edi
000000000000f5e8        f2 48 0f 38 f1 04 fe    crc32q  (%rsi,%rdi,8), %rax
000000000000f5ef        48 ff c7        incq    %rdi
000000000000f5f2        48 39 f9        cmpq    %rdi, %rcx
000000000000f5f5        75 f1   jne     0xf5e8
000000000000f5f7        48 89 d1        movq    %rdx, %rcx
000000000000f5fa        48 83 e1 f8     andq    $-0x8, %rcx
000000000000f5fe        48 01 ce        addq    %rcx, %rsi
000000000000f601        83 e2 07        andl    $0x7, %edx
000000000000f604        48 83 fa 03     cmpq    $0x3, %rdx
000000000000f608        76 10   jbe     0xf61a
000000000000f60a        31 c9   xorl    %ecx, %ecx
000000000000f60c        f2 0f 38 f1 04 8e       crc32l  (%rsi,%rcx,4), %eax
000000000000f612        48 83 c6 04     addq    $0x4, %rsi
000000000000f616        48 83 c2 fc     addq    $-0x4, %rdx
000000000000f61a        48 83 fa 02     cmpq    $0x2, %rdx
000000000000f61e        72 0e   jb      0xf62e
000000000000f620        66 f2 0f 38 f1 06       crc32w  (%rsi), %eax
000000000000f626        48 83 c6 02     addq    $0x2, %rsi
000000000000f62a        48 83 c2 fe     addq    $-0x2, %rdx
000000000000f62e        48 85 d2        testq   %rdx, %rdx
000000000000f631        74 05   je      0xf638
000000000000f633        f2 0f 38 f0 06  crc32b  (%rsi), %eax
000000000000f638        5d      popq    %rbp
000000000000f639        c3      retq
///
/// Software CRC32 implementation (uses lookup table)
///
_crc32c_soft:
000000000000f63a        55      pushq   %rbp
000000000000f63b        48 89 e5        movq    %rsp, %rbp
000000000000f63e        89 f8   movl    %edi, %eax
000000000000f640        48 85 d2        testq   %rdx, %rdx
000000000000f643        74 23   je      0xf668
000000000000f645        31 c9   xorl    %ecx, %ecx
000000000000f647        4c 8d 05 f2 cc 15 00    leaq    _crc32c_table(%rip), %r8
000000000000f64e        44 0f b6 0c 0e  movzbl  (%rsi,%rcx), %r9d
000000000000f653        0f b6 f8        movzbl  %al, %edi
000000000000f656        44 31 cf        xorl    %r9d, %edi
000000000000f659        c1 e8 08        shrl    $0x8, %eax
000000000000f65c        41 33 04 b8     xorl    (%r8,%rdi,4), %eax
000000000000f660        48 ff c1        incq    %rcx
000000000000f663        48 39 ca        cmpq    %rcx, %rdx
000000000000f666        75 e6   jne     0xf64e
000000000000f668        5d      popq    %rbp
000000000000f669        c3      retq

// (snip)

///
/// drec_hash_func is a worker function that simply sets up and calls the selected CRC32
/// function (hardware or software).
/// It is used as a callback function for utf8_normalizeOptCaseFoldAndHash, which calls
/// drec_hash_func iteratively over a buffer.
///
_drec_hash_func:
00000000000a2288        55      pushq   %rbp
00000000000a2289        48 89 e5        movq    %rsp, %rbp
00000000000a228c        53      pushq   %rbx
00000000000a228d        50      pushq   %rax
00000000000a228e        48 89 d3        movq    %rdx, %rbx
00000000000a2291        48 89 f2        movq    %rsi, %rdx
00000000000a2294        48 89 fe        movq    %rdi, %rsi
00000000000a2297        8b 3b   movl    (%rbx), %edi
00000000000a2299        ff 15 99 a0 0c 00       callq   *_crc32c(%rip)
00000000000a229f        89 03   movl    %eax, (%rbx)
00000000000a22a1        48 83 c4 08     addq    $0x8, %rsp
00000000000a22a5        5b      popq    %rbx
00000000000a22a6        5d      popq    %rbp
00000000000a22a7        c3      retq

// (snip)

///
/// fs_lookup_name_with_parent_id is called from 14 different locations in apfs.kext
///
_fs_lookup_name_with_parent_id:
00000000000a178e        55      pushq   %rbp
00000000000a178f        48 89 e5        movq    %rsp, %rbp
00000000000a1792        41 57   pushq   %r15
// (snip)
// Loads drec_hash_func callback function address, then calls utf8_normalizeOptCaseFoldAndHash
00000000000a186d        48 8d 0d 14 0a 00 00    leaq    _drec_hash_func(%rip), %rcx
00000000000a1874        4c 89 ef        movq    %r13, %rdi
00000000000a1877        4c 89 f6        movq    %r14, %rsi
00000000000a187a        e8 00 00 00 00  callq   _utf8_normalizeOptCaseFoldAndHash

// (snip)

///
/// dir_rec_alloc_with_hash is called from 20 different locations in apfs.kext
///
_dir_rec_alloc_with_hash:
00000000000a1a51    55     pushq    %rbp
00000000000a1a52    48 89 e5     movq    %rsp, %rbp
00000000000a1a55    41 57     pushq    %r15
// (snip)
// Loads drec_hash_func callback function address, then calls utf8_normalizeOptCaseFoldAndHash
00000000000a1b58    48 8d 0d 29 07 00 00     leaq    _drec_hash_func(%rip), %rcx
00000000000a1b5f    4c 89 f7     movq    %r14, %rdi
00000000000a1b62    e8 00 00 00 00     callq    _utf8_normalizeOptCaseFoldAndHash

I haven't traced through all 277k lines of code. It's true that the non-AVX fletcher64 implementation doesn't appear to use CRC32. However, CRC32 is used in a great many places throughout the code. Is it enough to make a difference on an MP5,1 vs. an MP3,1? I don't know; it's just a possibility I thought was worth considering.
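To make the name-lookup call chain easier to follow, here is a C sketch of the callback pattern as I read the disassembly: drec_hash_func receives (buffer, length, pointer to running CRC), loads the CRC, updates it, and stores it back. `hash_in_chunks` is a hypothetical stand-in for utf8_normalizeOptCaseFoldAndHash that skips the normalization and case folding and simply feeds raw bytes in fixed-size chunks; the bitwise CRC-32C is a placeholder for whichever implementation _crc32c points to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for whatever _crc32c points to: bitwise CRC-32C update. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc;
}

/* Callback shape as read from the disassembly: (buffer, length, context). */
typedef void (*hash_cb)(const void *buf, size_t len, void *ctx);

/* Like _drec_hash_func: load the running CRC, update it, store it back. */
static void drec_hash_cb(const void *buf, size_t len, void *ctx) {
    uint32_t *crc = ctx;
    *crc = crc32c_update(*crc, buf, len);
}

/* Hypothetical stand-in for utf8_normalizeOptCaseFoldAndHash: no
 * normalization or case folding, just raw bytes in chunks of `chunk`. */
static void hash_in_chunks(const uint8_t *name, size_t len, size_t chunk,
                           hash_cb cb, void *ctx) {
    assert(chunk > 0);
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        cb(name + off, n, ctx);
    }
}
```

Because CRC updates compose, hashing "123456789" in 4-byte chunks gives the same value as hashing it in one call. The chunked design is harmless for correctness, but every directory-record lookup still pays the per-byte CRC cost, which is why the hardware/software choice could matter.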

I'll look at this again in a few days, time permitting (and assuming someone else hasn't cracked it yet).

Good luck.

P.S.
I think there is something with the APFS.efi. I placed one from 11.2.3 in the drivers section of OC. Of course, APFS jump start and connect are false. Boot still hangs most of the time, but so far there has been no IONVMe panic with an NVMe drive in the system. I wonder if CpuTscSync can help here as well.
I'm focusing on apfs.kext, not the EFI implementation. I'm definitely not an OC expert, but I doubt changing APFS.EFI will have much effect on what I'm looking at. I'm hoping someone will try replacing apfs.kext to see if it makes any difference.
 