OK, I've had a chance to review and test my 12.4 patch, and I think it's safe to move it to "beta" status; once some more people confirm that it's stable and has no side-effects, we can remove the "beta" label and just call it a patch.
This patch should have zero side-effects, so any observed differences are almost certainly unrelated to this patch.
Note that the kernel version in the patch is currently limited to only Monterey 12.4 (Kernel 21.5.0). This problem seems likely to persist into future kernels, but the compiler could easily choose different registers, even if the source code is identical. That means each iteration of MacOS will need to be examined, and might (or might not) need a separate patch. If no change is required, the MaxKernel value can simply be increased appropriately. If each released version of Monterey uses a slightly different register set, a config.plist capable of booting various versions of Monterey will be littered with these patches (ugh!).
Note that as far as I can tell, @khronokernel's NoAVX kext substitution solution is equally effective, and probably has the same degree of MacOS version-dependency as my patch (i.e. future MacOS releases may or may not work with the older kext that @khronokernel is using). The difference is that my patch can be adapted to future MacOS changes, but the older kext most likely cannot. Regardless, use the solution that best fits your needs.
Is there code in the QEMU/UTM emulator that could be studied for emulation solutions in the long run?
Emulation is straightforward. My AVX emulator has been complete for months; the problem I encountered is integrating it with MacOS. Surprisingly, the integration is more complicated than the emulation itself. I've been stuck trying to solve the integration issues for a long time now...
I upgraded to 12.4 the day it became available and have not once had an issue. I've been watching this issue (kernel panics, AppleFSCompressionTypeZlib, AVX). Has anyone determined why some people seem to have the issue and others do not? I'm trying to decide whether I should add Syncretic's patch.
As I noted in my earlier post, the AVX code only gets executed under the right (or, perhaps more accurately, wrong) circumstances. If your use case somehow consistently avoids those circumstances, more power to you - but if/when it happens, you'll get a kernel panic, and probably lose whatever you were working on at the time. The patch I posted should be safe, whether you encounter the problem or not. (The choice of whether or not to apply the patch is entirely yours, of course.)
The rest of this post offers a bit of technical insight into the problem. If you're not interested in the gory details, feel free to skip to the next post in the thread, or to something more entertaining. Maybe even go outside for a while.
The function _compression_decode_buffer looks at the Zlib strategies in use, and if LZ is involved, it eventually invokes _lzbitmap_decode, which contains unfiltered AVX instructions.
As noted earlier, both the AppleFSCompressionTypeZlib and AppleDiskImagesUDIFDiskImage kexts contain apparently-identical copies of _compression_decode_buffer, but the AppleDiskImagesUDIFDiskImage instance contains more public symbols (e.g. the AppleFSCompressionTypeZlib instance doesn't contain the symbol _lzbitmap_decode, despite having the same code).
The YMM code confused me for a while, causing me to overlook the most obvious reason for its existence. The YMM save/restore code isn't there for argument-passing or any other processing; it's simply doing a "callee-save" on the XMM registers that _lzbitmap_decode uses. I'm guessing that the YMM save/restore is either a macro or an inlined library function. At the beginning of _lzbitmap_decode, all 16 YMM registers are saved on the stack; then, in no fewer than six separate locations, all 16 YMM registers are restored, either from [R15] or [RAX]. What initially looked to me like manipulation of the values on the stack turned out not to involve the YMM storage area, and both R15 and RAX get loaded with the address of the YMM storage area prior to restoring the YMM registers. So all of the instructions involving the YMM registers are purely for save/restore, and really only for the XMM (low 128-bit) portion of the registers, since that's all the code uses.
Code:
// Original [RAX] restore code (AVX) (115 bytes):
c5fd6f00 vmovdqa (%rax), %ymm0
c5fd6f4820 vmovdqa 0x20(%rax), %ymm1
c5fd6f5040 vmovdqa 0x40(%rax), %ymm2
c5fd6f5860 vmovdqa 0x60(%rax), %ymm3
c5fd6fa080000000 vmovdqa 0x80(%rax), %ymm4
c5fd6fa8a0000000 vmovdqa 0xa0(%rax), %ymm5
c5fd6fb0c0000000 vmovdqa 0xc0(%rax), %ymm6
c5fd6fb8e0000000 vmovdqa 0xe0(%rax), %ymm7
c57d6f8000010000 vmovdqa 0x100(%rax), %ymm8
c57d6f8820010000 vmovdqa 0x120(%rax), %ymm9
c57d6f9040010000 vmovdqa 0x140(%rax), %ymm10
c57d6f9860010000 vmovdqa 0x160(%rax), %ymm11
c57d6fa080010000 vmovdqa 0x180(%rax), %ymm12
c57d6fa8a0010000 vmovdqa 0x1a0(%rax), %ymm13
c57d6fb0c0010000 vmovdqa 0x1c0(%rax), %ymm14
c57d6fb8e0010000 vmovdqa 0x1e0(%rax), %ymm15
A straightforward patch would be to simply replace all the VMOVDQA YMM# instructions with their MOVDQA XMM# counterparts. However, the VEX prefixes used by AVX instructions are more compact than the legacy (non-VEX) SSE encodings, and if we use the same addresses for saving/restoring, the putative patch code is actually larger than the AVX code, and won't fit.
Code:
// Direct replacement (123 bytes):
660f6f00 movdqa (%rax), %xmm0
660f6f4820 movdqa 0x20(%rax), %xmm1
660f6f5040 movdqa 0x40(%rax), %xmm2
660f6f5860 movdqa 0x60(%rax), %xmm3
// Starting at offset 0x80, switches to 32-bit offsets
660f6fa080000000 movdqa 0x80(%rax), %xmm4
660f6fa8a0000000 movdqa 0xa0(%rax), %xmm5
660f6fb0c0000000 movdqa 0xc0(%rax), %xmm6
660f6fb8e0000000 movdqa 0xe0(%rax), %xmm7
// Starting with XMM8, an additional prefix byte is required
66440f6f8000010000 movdqa 0x100(%rax), %xmm8
66440f6f8820010000 movdqa 0x120(%rax), %xmm9
66440f6f9040010000 movdqa 0x140(%rax), %xmm10
66440f6f9860010000 movdqa 0x160(%rax), %xmm11
66440f6fa080010000 movdqa 0x180(%rax), %xmm12
66440f6fa8a0010000 movdqa 0x1a0(%rax), %xmm13
66440f6fb0c0010000 movdqa 0x1c0(%rax), %xmm14
66440f6fb8e0010000 movdqa 0x1e0(%rax), %xmm15
However, we can improve the situation - we don't need to store the XMM registers at the 32-byte offsets used by the YMM registers; we can store them at 16-byte offsets instead (using only half of the reserved stack space). Doing this allows 4 additional MOVDQA instructions to use 8-bit offsets, resulting in code that's smaller than the original AVX code (and thus requires some NOPs at the end to fill out the space).
Code:
// Condensed replacement (111 bytes):
660f6f00 movdqa (%rax), %xmm0
660f6f4810 movdqa 0x10(%rax), %xmm1
660f6f5020 movdqa 0x20(%rax), %xmm2
660f6f5830 movdqa 0x30(%rax), %xmm3
// XMM4-7 still using 8-bit offsets
660f6f6040 movdqa 0x40(%rax), %xmm4
660f6f6850 movdqa 0x50(%rax), %xmm5
660f6f7060 movdqa 0x60(%rax), %xmm6
660f6f7870 movdqa 0x70(%rax), %xmm7
// Starting with offset 0x80/XMM8, 32-bit offset + extra prefix byte
66440f6f8080000000 movdqa 0x80(%rax), %xmm8
66440f6f8890000000 movdqa 0x90(%rax), %xmm9
66440f6f90a0000000 movdqa 0xa0(%rax), %xmm10
66440f6f98b0000000 movdqa 0xb0(%rax), %xmm11
66440f6fa0c0000000 movdqa 0xc0(%rax), %xmm12
66440f6fa8d0000000 movdqa 0xd0(%rax), %xmm13
66440f6fb0e0000000 movdqa 0xe0(%rax), %xmm14
66440f6fb8f0000000 movdqa 0xf0(%rax), %xmm15
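The condensed sequence comes to 111 bytes against the original 115, so 4 bytes of padding are needed to fill out the original footprint. The exact padding bytes used in the patch aren't shown above; as one illustration (an assumption on my part, not necessarily what the patch uses), a single 4-byte multi-byte NOP would do it, as would four single-byte 0x90 NOPs.
Code:
// Illustrative 4-byte fill for the remaining space (115 - 111 = 4 bytes)
0f1f4000             nopl 0x0(%rax)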
(Of course, if we were really pressed for code space, and wanted to get fancy, we could just use fxsave64 and fxrstor64, since we have a 512-byte area on the stack to work with. However, that would involve more analysis, since fxsave64/fxrstor64 are considered floating-point instructions, which might cause complications.)
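For the curious, here's what that "fancy" variant would look like. This is purely an illustrative sketch, not something the posted patch does, and it assumes RAX still points to the 512-byte save area (which should satisfy fxsave64's 16-byte alignment requirement, since the original YMM saves imply 32-byte alignment):
Code:
// Hypothetical single-instruction save/restore (NOT part of the posted patch)
480fae00             fxsave64 (%rax)     // save x87/MMX/XMM state to the 512-byte area
480fae08             fxrstor64 (%rax)    // restore it at each of the six restore sites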
Ultimately, the patches (technically, there are six of them per kext, and two kexts' worth of patches) just switch from saving/restoring YMM registers to saving/restoring XMM registers. One nice thing we don't have to worry about is preserving the upper 128 bits of the YMM registers - when using AVX instructions on an XMM register, the upper bits of the corresponding YMM/ZMM register are zeroed (much like manipulating the 32-bit registers in 64-bit mode, e.g. movl $9,%eax clears the upper 32 bits of %RAX). However, when using non-AVX instructions, the upper bits of the YMM/ZMM registers are left undisturbed. Thus, we can safely save/restore only the XMM registers without worrying about the YMM/ZMM register bits.
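To make that distinction concrete, here are the two encodings side by side (these are generic examples, not instructions taken from the kext): the VEX-encoded form of an XMM move clears the upper half of the destination YMM register, while the legacy SSE form that the patch substitutes leaves it alone.
Code:
// VEX-encoded (AVX) move to xmm0: also zeroes ymm0[255:128]
c5f96fc1             vmovdqa %xmm1, %xmm0
// Legacy SSE move to xmm0: ymm0[255:128] is left undisturbed
660f6fc1             movdqa  %xmm1, %xmm0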