M4 and the ARMv9 ‘problem’

Bokito · Oct 28, 2024

There is a lot of discussion about ARMv9 and its inclusion in the M4. Apple hasn’t commented on the matter, but Wikipedia references a comment from the LLVM compiler where the M4 is said to include ARMv9.2a but without the SVE2 extension.

Now I’ve read some articles saying that SVE2 can be crucial for heavy tasks like video encoding but I have yet to find any real world tests. Can anyone provide something? If HandBrake could be twice as fast in AV1 encoding for example with just this extension enabled I might wait another year for a new MacBook Pro. If that isn’t the case I’m strongly considering an almost maxed out M4 Max MacBook Pro.

My take is that if SVE2 isn’t included Apple is saving its gains for when performance doesn’t come as easy anymore from other upgrades. Right now the performance upgrades come really easy it seems, but there will be a time when TSMC can’t shrink their node for a considerable time and then such upgrades might come in handy.

galad · Oct 29, 2024

Nope, HandBrake will not be twice as fast when using SVE. Maybe 20-30% faster, but first someone will have to manually write a lot of SVE/SVE2 assembly code, which will take time, and it depends on how good the SVE implementation is on the actual CPU.

There are still many NEON optimizations left to write. For example, the next version of the SVT-AV1 encoder will be 30% faster even without using SVE, and x265 4.0 is at least 10% faster than before (and a bit faster when the NEON i8mm instructions are available, like on the M2 chips and later).
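For anyone who wants to check what their own Mac reports for features like i8mm, here is a minimal sketch. It assumes macOS exposes the flags as `hw.optional.arm.FEAT_*` sysctl keys; the key names `FEAT_I8MM` and `FEAT_SME` follow that convention and are an assumption here, and the helper names are hypothetical. On other systems, or when a key is missing, it returns None rather than guessing.

```python
import subprocess


def parse_sysctl_value(text):
    """Interpret the output of `sysctl -n <key>`: "1" -> True, "0" -> False."""
    value = text.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    return None  # empty or unexpected output: treat as unknown


def has_arm_feature(name):
    """Return True/False if the system reports the feature, None if unknown.

    Assumption: the flag lives under hw.optional.arm.<name>.
    """
    key = "hw.optional.arm." + name
    try:
        result = subprocess.run(
            ["sysctl", "-n", key],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # no sysctl binary, or the call failed
    if result.returncode != 0:
        return None  # key not present on this system
    return parse_sysctl_value(result.stdout)


if __name__ == "__main__":
    for feature in ("FEAT_I8MM", "FEAT_SME"):
        print(feature, has_arm_feature(feature))
```

A None result only means the key could not be read; it says nothing about whether the hardware has the feature.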

dmccloud · Oct 29, 2024

Bokito said:
There is a lot of discussion about ARMv9 and its inclusion in the M4. Apple hasn’t commented on the matter, but Wikipedia references a comment from the LLVM compiler where the M4 is said to include ARMv9.2a but without the SVE2 extension.

Now I’ve read some articles saying that SVE2 can be crucial for heavy tasks like video encoding but I have yet to find any real world tests. Can anyone provide something? If HandBrake could be twice as fast in AV1 encoding for example with just this extension enabled I might wait another year for a new MacBook Pro. If that isn’t the case I’m strongly considering an almost maxed out M4 Max MacBook Pro.

My take is that if SVE2 isn’t included Apple is saving its gains for when performance doesn’t come as easy anymore from other upgrades. Right now the performance upgrades come really easy it seems, but there will be a time when TSMC can’t shrink their node for a considerable time and then such upgrades might come in handy.

There's also a lot of confusion regarding Apple's licensing terms with respect to ARM. Whereas companies like Qualcomm and Samsung (Exynos) have architecture licenses that allow them to build SoCs using the ARM-created core designs, Apple has an ISA license which allows them to use the ARM instruction set and build additional instructions on top of that, as well as their own silicon that is compatible with the ISA. Apple is under no obligation to use all aspects of the ISA, and in some cases Apple's additions to the base ISA have made their way into the ARM ISA being used by Qualcomm and Samsung.

I'm sure Apple has its reasons for not adopting SVE2, but I don't know enough to say what they might be. I do think there may have been a clue in the Mac Mini announcement though, because at one point there was mention of "the upcoming version of Final Cut Pro". Perhaps Apple has added something into the M4 instruction set that equates to SVE2 and the new version of FCP will take advantage of that.

leman · Oct 29, 2024

It is difficult to make quantified statements about these things. Technically, M4 is indeed ARMv9.2a which if I remember correctly is virtually the same as ARMv8.7 (or something like that). ARM feature matrix and requirements are not entirely transparent. Apple does not support SVE in their CPU cores, but they support a subset of SVE in streaming mode in their matrix/vector coprocessor. I doubt that the streaming mode will help with video encoding.

It is unclear how much benefit SVE support would bring. There were some studies back in the day mentioning 20% or so, I don't know how conclusive that is. I haven't seen any empirical studies using ARM CPUs that support SVE/SVE2. SVE is a great ISA in theory, but there are some issues too. ARM has since introduced a lot of extensions targeted at flexible manipulation of small vectors, not unlike NEON. We don't know if Apple is interested in adopting SVE or whether they intend to continue using NEON.

Either way, it is not like you will miss out on performance. The M4 Pro Mini in particular seems like a beast for its form factor.

Gnattu · Oct 29, 2024

SVE is a good ISA on paper, but in reality the current implementations are not that good. Some operations actually regress with SVE on Amazon Graviton3, and to optimize performance, real multimedia software mixes NEON and SVE on that particular CPU. I think Apple first needs to figure out an implementation good enough that it is really that much better than NEON. Arm needs to provide a reference implementation in their cores, so some performance quirks are acceptable there, but that is not the case for Apple.

mr_roboto · Oct 29, 2024

dmccloud said:
There's also a lot of confusion regarding Apple's licensing terms with respect to ARM. Whereas companies like Qualcomm and Samsung (Exynos) have architecture licenses that allow them to build SoCs using the ARM-created core designs, Apple has an ISA license which allows them to use the ARM instruction set and build additional instructions on top of that, as well as their own silicon that is compatible with the ISA. Apple is under no obligation to use all aspects of the ISA, and in some cases Apple's additions to the base ISA have made their way into the ARM ISA being used by Qualcomm and Samsung.

I'm afraid you're spreading more confusion here. An "architectural license" is what Apple has. This use of language derives from the perhaps slightly odd way the CPU/chip design world uses terms like "architecture" and "intellectual property". If you have an Arm architectural license, you have the right to implement the Arm instruction set architecture in your own designs. If you have an ordinary Arm IP license, you have the right to use Arm-designed IP cores (not limited to just CPUs, Arm has a very wide selection of SoC building blocks). The exact selection of cores and fee structures for same is determined by the specific contract you negotiated with Arm, and you usually have to negotiate a new license for each chip you design. (Sublicensing also exists, so in many cases you get access to Arm IP through the contract you sign to build a chip on a particular process node.)

Qualcomm has not one but two Arm architectural licenses - their own, acquired for their first attempt at full custom CPU core designs (Krait and early Kryo cores), and Nuvia's, acquired when they bought Nuvia. However, this is now in legal dispute due to disagreements between Qualcomm and Arm over whether the Nuvia acquisition obligated QC to re-negotiate architectural licensing terms and fee structures for all work done by Nuvia.

I don't remember offhand whether Samsung ever tried its hand at custom Arm core design, but they probably have, even though most of their chips use Arm-designed cores.

Arm's architectural licenses do not give free rein to do everything. Arm has conformance tests to determine whether licensees are building CPUs compatible with Arm's standards; Apple and other architectural license holders must pass them. Apple also has not been free to build custom extensions which are exposed to the public. M1, M2, and M3 all had AMX (Apple's matrix math coprocessor instructions), but it was undocumented and not directly usable by the public. If you wanted to take advantage, all you could do was call into Apple's Accelerate framework and hope that whatever you were doing was accelerated by AMX under the hood.

That said, the AMX story shows the flip side of the coin: if you're big and influential enough, you can twist Arm's... arm to get them to make official ISA changes. AMX ended up getting transmogrified into SME, an official Arm extension, and in the process got designated as part of SVE2. However, apparently at Apple's insistence, SME implementors are free to not support full SVE2, only a very limited 'streaming' mode subset. Apple doesn't seem to like SVE and doesn't want it, while Arm's shown signs of wanting to standardize on everyone supporting SVE. So far, Apple's mostly gotten their way.

joema2 · Oct 30, 2024

Bokito said:
I’ve read some articles saying that SVE2 can be crucial for heavy tasks like video encoding but I have yet to find any real world tests. Can anyone provide something? If HandBrake could be twice as fast in AV1 encoding for example with just this extension enabled I might wait another year for a new MacBook Pro.

SVE2 is just vector instructions, IOW single instruction, multiple data (SIMD). It is good for processing arrays of data. The problem with H.264/H.265 (HEVC) encode/decode is that the core algorithms are inherently serialized. You must process one phase of the algorithm before proceeding to the next phase. None of those are amenable to parallelization, whether on a GPU or using vector instructions.

That said, it is possible to mildly accelerate AV1 decode/encode using SVE, etc, but it's a lot of work and the gains are very modest. IMO it's not generally worth it.
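One way to see why vectorizing only part of an encoder tends to give modest overall gains is Amdahl's law. The sketch below is illustrative only: the fractions and the 2x kernel speedup are hypothetical inputs, not measurements of any encoder or CPU, and the function name is made up for this example.

```python
def overall_speedup(vector_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only part of the work gets faster.

    vector_fraction: share of run time spent in code that can be vectorized (0..1)
    kernel_speedup:  how much faster that share runs with the new instructions
    """
    serial_fraction = 1.0 - vector_fraction
    return 1.0 / (serial_fraction + vector_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical inputs: vectorized kernels run 2x faster,
    # but they cover only part of the total encode time.
    for fraction in (0.3, 0.5, 0.7):
        print(fraction, round(overall_speedup(fraction, 2.0), 2))
```

With these made-up inputs, even a kernel that doubles in speed leaves the whole job well short of twice as fast unless nearly all of the run time is spent in that kernel.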

Using GPU parallelism to improve H.264/H.265 performance has been thoroughly investigated by many people. This includes the lead developer of x264, Jason Garrett-Glaser in a 2012 presentation named "OpenCL Acceleration of x264." He covered prior efforts, his team's research and outlook for the future. He said mild gains were possible but not a major improvement. Unfortunately, this presentation is no longer available.

By contrast, the gains of "fixed function" decode/encode acceleration are much greater, on the order of 400%, 500% or more. Apple Silicon CPUs already have that capability in their media engines for most common codecs. The Max and Ultra CPUs have parallel media engines, so in some cases, this can be harnessed to process H.264/H.265 data in parallel, but (I think) only for ProRes or Long GOP where the GOPs are independent. In FCP this option is called "export segmentation." For dependent GOP encoding (which is very common), they must be processed serially.
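The independent-GOP idea can be sketched roughly as follows. This is a toy model, not how FCP or the media engines actually work: `encode_gop` is a hypothetical stand-in for a real encoder call, frames are plain integers, and the `engines` parameter only mimics having several media engines. Python threads are used to show the structure, not to demonstrate real speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frames, gop_size):
    """Cut the frame list into consecutive groups of pictures."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]


def encode_gop(gop):
    """Hypothetical stand-in for a real encoder call.

    Each frame is "encoded" relative to the previous frame in the same GOP,
    so work inside a GOP is serial, but no state crosses GOP boundaries.
    """
    encoded, previous = [], 0
    for frame in gop:
        encoded.append(frame - previous)  # toy delta against the prior frame
        previous = frame
    return encoded


def encode_segmented(frames, gop_size, engines=2):
    """Encode independent GOPs concurrently; output order is preserved."""
    gops = split_into_gops(frames, gop_size)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        return list(pool.map(encode_gop, gops))


if __name__ == "__main__":
    print(encode_segmented([10, 12, 15, 20, 21, 25], gop_size=3))
```

If a GOP needed the last frame of the previous GOP as its reference, the groups could no longer be handed to separate workers, which is the dependent case that has to run serially.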

Apple Silicon media engines support decode/encode acceleration of H.264, H.265, and VP9, plus AV1 decode acceleration starting with the A17 (in the iPhone 15 Pro) and the M3. The M4 series would logically also have AV1 decoding, but I don't know if it has AV1 encoding acceleration. But it seems unlikely that SVE2 (whether available or not) would have a major impact on software decode/encode, when compared to fixed-function hardware acceleration.

Regardless of whether encode acceleration exists for AV1 on M4, I don't think either FCP or Resolve supports AV1 encoding from a software standpoint. They both support hardware-accelerated HEVC encoding.
