| 8-core | 4 high-performance Firestorm | 4 energy-efficient Icestorm |
| Speed | 3.2 GHz | 2.064 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 12 MB | 4 MB |
| 8-core | 4 high-performance Avalanche | 4 energy-efficient Blizzard |
| Speed | 3.49 GHz | 2.424 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 16 MB | 4 MB |
| 8-core | 4 high-performance Everest (V2) | 4 energy-efficient Sawtooth (V2) |
| Speed | 4.05 GHz | 2.75 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 16 MB | 4 MB |
| 10-core | 4 high-performance Everest (V3) | 6 energy-efficient Sawtooth (V3) |
| Speed | 4.41 GHz | 2.89 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 16 MB | 4 MB |
| 10-core | 4 high-performance Everest (V4) | 6 energy-efficient Sawtooth (V4) |
| Speed | 4.46 GHz | 3.04 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 22 MB | 6 MB |
Apple M4 (2024)
APL1206 (Donan, T8132)
TSMC N3E 3nm
28 billion transistors
CPU
Apple designed ARMv9.2-A
10-core big.LITTLE
Core designs from A18/A18 Pro
| 10-core | 4 high-performance Everest (V3) | 6 energy-efficient Sawtooth (V3) |
| Speed | 4.41 GHz | 2.89 GHz |
| L1 cache | 192 KB instructions / 128 KB data | 128 KB instructions / 64 KB data |
| L2 cache | 16 MB | 4 MB |
GPU
Apple G16G, Apple 10 family integrated graphics, 10 cores
160 Execution Units
1280 Arithmetic Logic Units
30,720 threads
1.58 GHz
4.32 TFLOPs (FP32)
16 Execution Units per core
8 Arithmetic Logic Units per Execution Unit
Dynamic Caching, Mesh Shading and Ray Tracing
Memory
Unified 128-bit LPDDR5X SDRAM, 7,500 MT/s (3,750 MHz), 120 GB/s
Up to 32 GB
Other
16-core Neural Engine, 38 trillion operations per second, 7th gen (H16x_Leto_J7x)
PCIe 4.0 storage controller up to 2 TB SSD
3x USB-C 4 controller with Thunderbolt 4/USB 4 support (40 Gb/s)
2x USB 3.0 Type-C (10 Gb/s)
Secure Enclave and Image Signal Processor
Video codec encoding support for 8K HEVC and 8K H.264
Video codec decoding support for 8K ProRes, 8K HEVC, 8K H.264, VP9, AV1 and JPEG
HDMI 2.1
One 8K display at 60 Hz or one 4K display at 240 Hz over Thunderbolt or HDMI
Wi-Fi 6E up to 2.4 Gbit/s 2x2
Bluetooth 5.3
> I don't think the TFLOPS calculation is right:
> 1280*1.58*2/1000 = 4.0448 TFLOPS
> And the M5 is even more off. The only thing that changed is a minor clock speed bump, but there is a nearly 20% increase in TFLOPS? Why?
> Also, I'm not 100% sure what the execution unit is here. I was under the impression Apple had 4 x execution units of 32 FP32 units each, not 2 x 64. Unless there is another control structure below what they are calling the execution unit? @leman?

I'll check back on those numbers later and will update the post if necessary. Thanks for sharing your observation.
> ok, peak flop value or sustainable (in bus widths, stack length, and other memory considerations etc)

Whenever someone reports TFLOPS it should be peak, that is, the theoretical throughput: number of FP32 units x clock speed x 2, unless otherwise stated. The issue here is that the numbers being reported are actually above the peak theoretical value, which is not possible. So the peak vs sustained/practical distinction doesn't really matter here: the reported TFLOPS have issues starting with the M3.
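The peak formula described above (FP32 units x clock speed x 2) can be sketched in a few lines. This is only an illustration of the arithmetic used in this thread: the function name `peak_fp32_tflops` is made up for the example, the factor of 2 is assumed to be the usual convention of counting a fused multiply-add as two operations, and the 1280 units, 1.58 GHz and 4.32 TFLOPS figures are the ones listed in the M4 spec post.

```python
def peak_fp32_tflops(fp32_units: int, clock_ghz: float) -> float:
    """Theoretical peak: units x clock (GHz) x 2 ops per cycle, converted to TFLOPS."""
    return fp32_units * clock_ghz * 2 / 1000


# Figures taken from the M4 spec post in this thread.
m4_peak = peak_fp32_tflops(1280, 1.58)
listed_tflops = 4.32

print(f"M4 theoretical peak: {m4_peak:.4f} TFLOPS")
print(f"Listed figure exceeds the peak: {listed_tflops > m4_peak}")
```

If the unit count and clock are right, any listed value above `m4_peak` cannot be a sustained or a peak number, which is the point being made here.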
> I don't think the TFLOPS calculation is right:
> 1280*1.58*2/1000 = 4.0448 TFLOPS
> And the M5 is even more off. The only thing that changed is a minor clock speed bump, but there is a nearly 20% increase in TFLOPS? Why?
> Also, I'm not 100% sure what the execution unit is here. I was under the impression Apple had 4 x execution units of 32 FP32 units each, not 2 x 64. Unless there is another control structure below what they are calling the execution unit? @leman?

I've edited the TFLOPS for M2, M3 and M4 according to these sources; M5 already matched the source:
> The M1 chip doesn’t have ProRes hardware acceleration. Only the subsequent chips have that.
> The M1 is entirely CPU/GPU…

Wikipedia lists the base M1 as having ProRes decode but not encode. And what does "The M1 is entirely CPU/GPU" mean?
> The M1 chip doesn’t have ProRes hardware acceleration. Only the subsequent chips have that.
> The M1 is entirely CPU/GPU…

Fixed! Thank you!
> Fixed! Thank you!

So Wikipedia is wrong on the M1 having a ProRes decoder?
> So Wikipedia is wrong on the M1 having a ProRes decoder?

AFAIK (I could be wrong) the M1 Pro and above have it; the base M1 does not.
> I've edited the TFLOPS for M2, M3 and M4 according to these sources; M5 already matched the source:
> Apple M2 (10 Core) Benchmarks & Specs (www.cpu-monkey.com)
> Apple M3 (10 Core) Benchmarks & Specs (www.cpu-monkey.com)
> Apple M4 (10 Core) Benchmarks & Specs (www.cpu-monkey.com)
> M5: https://www.cpu-monkey.com/en/igpu-apple_m5_10_core

Ah, unfortunately cpu-monkey is a little unreliable. This is not the first time I've found them making mistakes. The M2 and M3 TFLOPS are right, but the M4 and M5 are wrong. I can't confirm their clock speed for the M5, but even if it is right, the TFLOPS they calculated from it is still wrong.
> AFAIK (I could be wrong) the M1 Pro and above have it; the base M1 does not.

A little digging shows that https://www.apple.com/final-cut-pro/docs/Apple_ProRes.pdf only mentions the M1 Pro, Max and Ultra.
> Ah, unfortunately cpu-monkey is a little unreliable. This is not the first time I've found them making mistakes. The M2 and M3 TFLOPS are right, but the M4 and M5 are wrong. I can't confirm their clock speed for the M5, but even if it is right, the TFLOPS they calculated from it is still wrong.
> M4: 1280*1.58*2/1000 = 4.0448 TFLOPS
> M5: 1280*1.9*2/1000 = 4.864 TFLOPS (again, that's if they have the new GPU clock speed right)
> Another problem is that cpu-monkey lists double throughput of FP16 relative to FP32 for all of these chips, but Apple didn't gain the ability to do that until the M5. And finally, I'm pretty sure the number of execution units is 4 per core, so it should be 40 units, not 160 units*. Apple has a SIMD width of 32, for a total of 40*32 = 1,280 FP32 units (which cpu-monkey does get right, though we don't know yet about the M5 structure and how Apple doubled FP16 throughput, but cpu-monkey doesn't list that anyway). Hopefully I haven't led you astray, but if so @name99 or @leman can correct me.
> *EDIT: so tired I managed to confuse myself. It should be right now. I think where cpu-monkey got confused, and momentarily confused me, is that Apple used to allow 16*32 = 512 threads per core (I believe Apple has since bumped that up to 32*32 = 1024 threads per core), but that's not the same as the execution units/core counts.
> Metal compute shaders threadgroup & threadExecutionWidth (stackoverflow.com)

I THINK (patents and some benchmarks I've seen both suggest this) that M5 FP16 throughput is 3x FP32 -- but ONLY for matrix multiply.
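The unit counting argued over in this thread can be checked with a short sketch. It only restates numbers already posted here: the spec post's breakdown (16 execution units per core, 8 ALUs each) and the breakdown argued in the replies (4 units per core, SIMD width 32). The helper name `total_fp32_units` is hypothetical, and the 1.9 GHz M5 clock is the unconfirmed cpu-monkey figure mentioned above, not an established value.

```python
def total_fp32_units(cores: int, units_per_core: int, lanes_per_unit: int) -> int:
    """FP32 lanes across the whole GPU for a given per-core breakdown."""
    return cores * units_per_core * lanes_per_unit


# Breakdown as listed in the spec post: 10 cores x 16 EUs x 8 ALUs.
as_listed = total_fp32_units(10, 16, 8)
# Breakdown as argued in the replies: 10 cores x 4 units x SIMD width 32.
as_argued = total_fp32_units(10, 4, 32)

# Peak TFLOPS for the M5, ASSUMING the unconfirmed 1.9 GHz clock.
m5_peak_tflops = as_argued * 1.9 * 2 / 1000

print(as_listed, as_argued)
print(f"M5 peak at an assumed 1.9 GHz: {m5_peak_tflops:.3f} TFLOPS")
```

Both breakdowns give the same total, so the disagreement is about how the lanes are grouped, not about the FP32 unit count that feeds the TFLOPS figure.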