There could be a plethora of reasons, but the short answer is that right now I don't know, and I don't have an M1 Mac to investigate further. I plan on getting the 16" M1X, or whatever it ends up being called, when it comes around, and may be able to shed more light on things then. Assuming the size differences aren't major, one argument could be that the instructions themselves are smaller, but I find this unlikely: aside from the longest of x86_64's variable-length instructions, you'd normally expect CISC code to be less space-consuming, since a single instruction can do more. On the flip side, an x86_64 instruction that needs prefixes, a ModRM byte and a 32-bit immediate can easily take 7 or more bytes to encode, while every AArch64 instruction is a fixed 4 bytes, so the density argument can cut either way.
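To make the density argument a bit more concrete, here is a small example of my own (the byte counts are hand-checked from the instruction encodings, so treat them as an illustration rather than exact compiler output):
Code:
#include <stdio.h>

// Illustration of the CISC-vs-RISC density trade-off for a load-plus-add.
//
// x86_64 can fold the memory load into the add itself:
//     add eax, DWORD PTR [rdi]    ; one instruction, 2 bytes (03 07)
//
// AArch64 needs a separate load, and every instruction is a fixed 4 bytes:
//     ldr w8, [x0]                ; 4 bytes
//     add w0, w1, w8              ; 4 bytes -> 8 bytes total
int add_from_memory(const int *p, int acc) {
    return acc + *p;
}

int main(void) {
    int v = 5;
    printf("%d\n", add_from_memory(&v, 2));  // prints 7
    return 0;
}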
What I've talked about prior to this post has been data in memory, but of course the instruction stream itself is also kept in memory, pointed to by the %RIP register on x86_64. The line between data and instructions is blurry, but to put it briefly, in the grand scheme of things data accounts for far more memory usage than instructions do. Still, instruction size could be part of the difference you're seeing between M1-native and Intel binaries at the moment.
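As a back-of-the-envelope illustration of the data-versus-instructions point (a toy program of my own, not taken from any of the binaries in question): the machine code for the following is on the order of a few hundred bytes on either architecture, while the data it owns at runtime is 64 MB, and that 64 MB is the same no matter which ISA you compile it for.
Code:
#include <stdlib.h>
#include <string.h>

int main(void) {
    // 64 MB of plain data; the ISA changes the size of the handful of
    // instructions below, not the size of this allocation.
    size_t n = 64 * 1024 * 1024;
    unsigned char *buf = malloc(n);
    if (!buf)
        return 1;
    memset(buf, 0xAB, n);  // touch the pages so they actually get mapped
    free(buf);
    return 0;
}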
Let's run an experiment with the cross-compiling capabilities we do have. To Godbolt's Compiler Explorer!
[Attachment 1716165: side-by-side objdump output from Compiler Explorer, x86_64 on the left, AArch64 on the right]
Do excuse me that this is so small; it was the only way I could fit it all on screen. These are two objdumps from Compiler Explorer: the left is the x86_64 binary output in a more human-readable form, the right is the same for AArch64 (ARM).
The C code this corresponds to is the rather simple
Code:
#include <stdlib.h>

typedef struct Point {
    int x;
    int y;
} Point;

// Type your code here, or load an example.
int xTimesY(Point* p) {
    return p->x * p->y;
}

int main() {
    Point *p = (Point*) malloc(sizeof(Point));
    p->x = 5;
    p->y = 2;
    return xTimesY(p);
}
As one can tell by comparing the relative offsets, the x86_64 code is actually 110 bytes shorter than the corresponding ARM code. Of course, neither of these binaries is compiled with any level of optimisation, since even at just -O the whole program would reduce to return 10.
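For anyone who wants to repeat the comparison with optimisation turned on, one way to stop the compiler from folding everything down to a constant is to feed the values in from the command line. This is just a rough sketch of my own (the atoi-based main is not part of the original example):
Code:
#include <stdlib.h>

typedef struct Point {
    int x;
    int y;
} Point;

int xTimesY(Point* p) {
    return p->x * p->y;
}

int main(int argc, char** argv) {
    Point *p = (Point*) malloc(sizeof(Point));
    // Values come from the command line, so the compiler cannot
    // constant-fold xTimesY away even at -O2.
    p->x = (argc > 1) ? atoi(argv[1]) : 5;
    p->y = (argc > 2) ? atoi(argv[2]) : 2;
    return xTimesY(p);
}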
This is not a very comprehensive look, and more experimentation would have to be done with larger programs and with optimisation enabled, but as an initial investigation it leads me to believe that the memory footprint reduction has more to do with the linked libraries being streamlined for the M1 than with anything inherent to the instruction stream.