Anyone else a bit skeptical about Apple silicon future?

bradl · Sep 9, 2021

I just gotta say it...

It's funny that 3/4 of this entire thread reads more like the Linux Kernel Mailing List than the Linux Kernel Mailing List reads like the Linux Kernel Mailing List. 😁

Anywho, carry on!

BL.

casperes1996 · Sep 9, 2021

leman said:
The only non-standard-compliant part was casting from float to int. If you use one of supported ways to do this bitwise type reinterpretation, it's perfectly valid.

Is there even a standard way of doing that in C, not C++? I've only ever seen it done the FISR way.

leman said:
I recommend you to check out Zig. It's a modern language that was designed as a replacement for C. I think it's one of the more interesting new languages on the block. Excellent design decisions.

+1 for Zig. Haven't played all that much with it yet myself but it's looking great so far. It's the systems programming language that learned from the past 30 years.

leman said:
Standards are important for programmers because your correctly performing code might actually be buggy and stop performing correctly when you use a different compiler (version) or a different platform. The mostly unknown corner cases of the C/C++ standards and various expectations that exist between different players are the main reason why C/C++ ecosystem is such a terrible mess.

I don't know why, but I have a massive codebase that works perfectly fine prior to GCC-8, but GCC-8 onwards it just burns. It's large, and no longer needed so I haven't bothered investigating, but I was conscious of being standard compliant when I wrote my parts of it at least. I could have made a mistake anyway or it could've been one of the commits I didn't make or review.

leman said:
Yep, Swift's is the most elegant solution that directly communicates intent in the code while resulting in optimal performance (all these methods are essentially no-ops). But this only works because Swift explicitly states that floating point values are encoded with IEEE 754. C generally makes no such promises.

Agreed, Swift's solution here's pretty neat. Albeit a neat solution to an esoteric problem that maybe we shouldn't ever be needing anymore. The FISR was a neat use of reinterpret cast (though if you could do shifts on floats it wouldn't have been needed anyway), but these days CPUs have square root operations that will be faster than the FISR anyway. I'm all for the option existing to enable clever ideas like the FISR, but I can't think of any use case for it where an arguably cleaner solution doesn't exist.
Anyways reason I even quote this in the first place...

I thought C99 onwards specified IEEE 754 floating point values in the standard?

cmaier said:
When I was designing the PowerPC x704, I actually designed it using C It’s been so long that I don’t remember the details, but, essentially, each function represented a specific gate with specific transistor sizes, and the variables were wires. Then you’d use comments to control things like where the gates physically belong. Then we had tools which could parse all that and physically build the layout, but you could also compile it to make sure that the functionality was correct. Essentially we used it instead of structural Verilog (which should not be confused with behavioral Verilog, which is what most Verilog-users use).

I remember it got a little weird because you might call the nand2x1000 function in your code, but then you could override that with a comment to make it a nand2x1600 or whatever. It got kludgey because we kept writing tools like optimizers to change gate sizes. Then we figured out that having to touch the c code to change something like where a cell lives would kick off unnecessary dependencies, so we started having sidecar files with a lot of this stuff. We were making it all up as we went along.

But all that mess did heavily influence the design methodology me and my coworker came up with at AMD (we both came from Exponential, though I took a very short stop at Sun).

Your comments are a blast to read!

But really you actually used comments to have semantic meaning? Did you then have another "No this really is a comment, ignore it" for your parser? I don't really know what the process of chip design its like but I'd imagine more human readable comments could be helpful there too - or did you just rely on externally documenting everything and not within these "C" source files?

bradl said:
It's funny that 3/4 of this entire thread reads more like the Linux Kernel Mailing List than the Linux Kernel Mailing List reads like the Linux Kernel Mailing List. 😁

Haha yeah. Funny how the chip designers, kernel people and generally just programmers congregated in this thread

cmaier · Sep 9, 2021

casperes1996 said:
But really you actually used comments to have semantic meaning? Did you then have another "No this really is a comment, ignore it" for your parser? I don't really know what the process of chip design its like but I'd imagine more human readable comments could be helpful there too - or did you just rely on externally documenting everything and not within these "C" source files?

Haha yeah. Funny how the chip designers, kernel people and generally just programmers congregated in this thread

It’s been a *really* long time, but I *think* that you did something like //! to indicate that the comment was semantic. The reason I think that is that later on, at another job, I used something like that. I wasn’t responsible for tools at Exponential (the place where we did the c-code), but was, eventually, at AMD. (And at Sun it was a free-for-all - nobody would tell me what tools to use, so I just started writing my own).

I don’t remember much about the Exponential system other than the problem with dependencies getting triggered (I think we used actual makefiles! So we’d use ‘touch’ to say “no, that wasn’t a real change. Don’t worry about it.” We were netBSD cowboys

Things got a lot nicer in the system we set up at AMD. Each semantic concept in a different type of file, a perl system that replaced makefiles (with plugins for each step, that could be chosen from the command line, and would automatically check dependancies, etc.). And we used structural verilog instead of c for the netlist. Also our placement files got really funky, using a language I invented to allow placement of arrays, interleaving of structures, relative placement, calculated placement based on expressions, etc. That was the “perl” phase of my career. That turned into the C++ phase when I started writing GUI tools to allow visualizing the chip design, moving gates and wires and on-the-fly re-calculating timing, etc.

casperes1996 · Sep 9, 2021

cmaier said:
Things got a lot nicer in the system we set up at AMD. Each semantic concept in a different type of file, a perl system that replaced makefiles (with plugins for each step, that could be chosen from the command line, and would automatically check dependancies, etc.). And we used structural verilog instead of c for the netlist. Also our placement files got really funky, using a language I invented to allow placement of arrays, interleaving of structures, relative placement, calculated placement based on expressions, etc. That was the “perl” phase of my career. That turned into the C++ phase when I started writing GUI tools to allow visualizing the chip design, moving gates and wires and on-the-fly re-calculating timing, etc.

Wow you got fancy there at the end with GUI tools and everything

- Something I always did wonder about chip design. How accurately do you feel like you understand a chip's performance characteristics when you just have the design specs for it, vs. a manufactured chip? Do you need intermediate prototyping physically manufactured implementations to test design ideas or can you pretty easily stay on the abstract level right up to the point you begin manufacturing what will effectively be the final chips barring minor revisioning changes subject to final testing?

cmaier · Sep 9, 2021

casperes1996 said:
Wow you got fancy there at the end with GUI tools and everything - Something I always did wonder about chip design. How accurately do you feel like you understand a chip's performance characteristics when you just have the design specs for it, vs. a manufactured chip? Do you need intermediate prototyping physically manufactured implementations to test design ideas or can you pretty easily stay on the abstract level right up to the point you begin manufacturing what will effectively be the final chips barring minor revisioning changes subject to final testing?

We were always pretty accurate, keeping in mind that in the real world the performance of manufactured parts forms a bell curve - some always run better than others. You typically try to model the performance of where you think the middle of the distribution will be, and you miss by a bit, but you’re usually not too far off.

The way it works is you have SPICE models that describe the mathematical behavior of transistors. For the full-custom parts of the chip (oscillators, RAMs, PLLs, I/O drivers) you use SPICE as your tool to confirm the performance.

For the rest, we use “standard cells,” each of which is a specific logic gate with a specific layout. For example, you might have a dozen different 2-input NAND cells, each with a different set of transistor sizes and different layouts. You “characterize” these cells using automated tools with the spice models, to create tables. One column of the table is the 20%-80% rise or fall time for a particular input. The next column is the 20-80% rise or fall time for a particular output. There may be different tables corresponding to different alternate inputs. In other words, the behavior with respect to a transition on pin A may vary depending on the state of input pin B. Some people use percentages other than 20-80%.

Anyway, you make these tables.

You do your placement of the cells and route all the wires between them, then you do parasitics extraction. This involves an automatic tool that calculates resistances and capacitances due to the wires, and creates, effectively, a giant model of the circuit with pins of cells connected by RC networks.

You feed that into a ”static timing analyzer” that calculates how fast each cell output transitions based on its load, how long it takes the signal to propagate to the inputs of the next cell, and what the waveform looks like at that next input. It involves calculating the moments of the circuit equation. We used a tool from Synopsys called Primetime to do this. You feed that tool your constraints - “this signal has to be ready by such and such a time, or the chip won’t work.” It spits out a list of paths that take too long.

I came up with my own static timing algorithm, roughly based on an algorithm called AWE (asymptotic waveform evaluation) and built it into my graphical tool. I also built in a rough parasitics extractor (based on some work I had done years earlier at RPI), so when you moved cells around, it would guess at the new wiring, and very roughly estimate the new RC network. It would then tell you the new timing. You could even click on a cell and say “optimize size” and it would figure out the size of cell that would produce the best timing - making it bigger speeds up downstream logic, but means you put more load on upstream logic, so it was a fun optimization problem. It worked very well, and matched Primetime exactly as long as you gave it the same input (e.g. your extraction was accurate). It was pretty good considering I was a solid state physics guy and had never coded much in C in my life (and, in fact, I had never finished a coding or computer science class. I tested out in high school, and in grad school I enrolled in a bunch, but when I passed my doctoral oral exam on computer science I dropped them all).

Anyway, we also had tools to automatically add repeaters, etc. etc.

At a higher level, say you owned the integer ALUs (which, for some chips, I did). What happened was me and a couple others would floorplan the chip, creating an outline for that block. As we designed our block, we’d have “pins” on the edge, that abutted our neighbors. You’d route your wires to the pins, as would your neighbor. Moving a pin was like a nuclear submarine - both you and your neighbor had to agree in order for it to take effect. Anyway, you would have timing constraints at the pins. If you were receiving a signal, your neighbor might say “i can get it to you 50ps into the timing cycle,” and then you’d explain that to the timing tool. Similar when you are sending a signal to a neighbor.

So you are always working with these constraints, so in theory you are always “correct by construction.” Except it was massively difficult to fight the rules of physics to actually meet the constraints. You’d move logic across cycle boundaries, mess around with clocks to make some pipeline stages have a little more time than others, duplicate signals to avoid loading effects, duplicate logic, get very clever with logic design, sometimes create custom cells or custom circuits, force the metal router to do what you want by pre-routing a bunch of wires by hand, adjust where cells are located, etc. etc.

thenewperson · Sep 9, 2021

bradl said:
I just gotta say it...

It's funny that 3/4 of this entire thread reads more like the Linux Kernel Mailing List than the Linux Kernel Mailing List reads like the Linux Kernel Mailing List. 😁

Anywho, carry on!

BL.

The derailment of this thread made it worth reading lol

leman · Sep 10, 2021

casperes1996 said:
Is there even a standard way of doing that in C, not C++? I've only ever seen it done the FISR way.

Yep, unions (note that C allows type reinterpretation via unions, C++ does not). Or memcpy.

casperes1996 · Sep 10, 2021

leman said:
Yep, unions (note that C allows type reinterpretation via unions, C++ does not). Or memcpy.

Huh, never even considered that option, but it's obvious now that you mention it

cmaier · Sep 10, 2021

casperes1996 said:
Huh, never even considered that option, but it's obvious now that you mention it

When I was designing the floating point unit for the x705 (sequel to the x704) and had to design a square root block (ugh), I recall prototyping in C and using unions for that purpose.

Sqrt was a real pain. Friggin IBM with their stupid ISA updates.

(also fun: bit 0 in PowerPC is the *high order* bit. They number the bits backward. Which really really sucks when your words are all different sizes, as they are when you are dealing with floating point. Ugh. I am getting a cold sweat just remembering it)

dugbug · Sep 10, 2021

cmaier said:
(also fun: bit 0 in PowerPC is the *high order* bit. They number the bits backward. Which really really sucks when your words are all different sizes, as they are when you are dealing with floating point. Ugh. I am getting a cold sweat just remembering it)

There should be an xkcd cartoon for that. I've seen countless ICDs do it. 31 is high. 63 is high. 0? come on.

as for endianness, admit it big endian is the best.

cmaier · Sep 10, 2021

dugbug said:
There should be an xkcd cartoon for that. I've seen countless ICDs do it. 31 is high. 63 is high. 0? come on.

as for endianness, admit it big endian is the best.

I have no opinion on endianness, but I’ll fight anyone who says that the LSB is anything other than zero to the death.

nquinn · Sep 10, 2021

pshufd said:
ARM has structural advantages over x86 and Apple just showed the world what those are. In response, Intel and AMD will have to eventually go to ARM or RISC. And they will have to go through the same, painful transition that Apple is going through right now.

Intel's i9-12900 is faster than Apple's M1 in single-core and multi-core Geekbench 5. The i9-12900 should start shipping later this year or early next year. The i9-12900 beeds 250 Watts to beat the M1 running at 20 Watts though. Intel wins!

AMD recently won a big contract with Cloudflare for edge servers. Intel's performance was fine but their CPUs used hundreds of watts more than AMD's. I don't think that Apple will care to compete in many of the places where Intel and AMD compete like servers. But there will be other companies like nVidia making ARM server chips which should have the big performance per watt advantages over x86, just like Apple.

Lol you definitely can't compare a desktop intel cpu to an M1. Intel better be faster at 5-10x the watts.

Joelist · Sep 10, 2021

Whenever you see someone attributing the performance and especially the PPW of Apple Silicon to the ARM ISA, remember that there are plenty of ARM SOCs that are slow and not especially efficient. Apple Silicon does what it does because of its microarchitecture first and foremost. The advanced fab process is second and the ISA is only important in that it features the RISC fixed length instructions (which AS exploits more than any other SOC out there). What Apple Silicon does is shows just what RISC is capable of.

cmaier · Sep 10, 2021

Joelist said:
Whenever you see someone attributing the performance and especially the PPW of Apple Silicon to the ARM ISA, remember that there are plenty of ARM SOCs that are slow and not especially efficient. Apple Silicon does what it does because of its microarchitecture first and foremost. The advanced fab process is second and the ISA is only important in that it features the RISC fixed length instructions (which AS exploits more than any other SOC out there). What Apple Silicon does is shows just what RISC is capable of.

Yep. There have been studies that show that the differences between ISAs make little difference in the general purpose case. 10% here and there. But RISC vs. CISC is a much bigger deal. And I know from my own work that the difference between good physical designers and average physical designers is 20% (either performance, power, area, or a combination of both). Every time we had a tool provider come in and say that they could get better results using their EDA tools than what we were doing by hand, they always failed by 20%. Every single time.

Then you have the effect of a good microarchitecture vs a bad microarchitecture - look at Bulldozer vs what AMD sells now.

It all adds up.

casperes1996 · Sep 10, 2021

cmaier said:
When I was designing the floating point unit for the x705 (sequel to the x704) and had to design a square root block (ugh), I recall prototyping in C and using unions for that purpose.

Nice, haha. I used Unions for file system IPC in my kernel. All file system related tasks were handled by a special process, still in user space but with read/write IO privileges. To make file system requests you'd make a standard IPC call to that process and you'd. share a memory page with the file system that was, in C's world, a union FS_IPC_BUF*. I don't remember the exact structure but it was a way of separating client_request and fs_response cleanly in C for the same shared memory page. The file system process would read the union's. client_request section to figure out what to do and then fill it in as a response, which in itself had different structs depending on the type of request.

cmaier said:
(also fun: bit 0 in PowerPC is the *high order* bit. They number the bits backward. Which really really sucks when your words are all different sizes, as they are when you are dealing with floating point. Ugh. I am getting a cold sweat just remembering it)

... Who on Earth came up with that idea? Haha. I think big endian makes sense as most "regular" math around us is effectively big endian. I've at least never seen anyone write one thousand as 0001. But I guess it's also hard to quantise what big endian/little endian would mean in regular math without a notion of a byte or whatnot.
I've heard good arguments for little endian as a technical choice though. In particular this Stack answer

What is the advantage of little endian format?

Intel processors (and maybe some others) use the little endian format for storage. I always wonder why someone would want to store the bytes in reverse order. Does this format have any advantages...

softwareengineering.stackexchange.com

But yeah I can think of no good logic behind making 0 reference your msb.

throAU · Sep 10, 2021

cmaier said:
Itanium didn’t use software emulation to run x86. It had an actual x86 core on there. It was crap.

And why is everyone pretending that if you sell chips that do not have real mode that means you can’t sell any chips that do? Intel is a big company with lots of chips. It can sell chips that cater to the ever-shrinking market that needs to run software from the 1990’s, as well as chips that burn a third less power for the same performance, or a third more performance for the same power, and do so by jettisoning compatibility with real mode and some of the other seldom-used modes.

I said things are different now.
But that was in the context of x86 when Itanium was current 15 years ago.

cmaier · Sep 10, 2021

throAU said:
I said things are different now.
But that was in the context of x86 when Itanium was current 15 years ago.

I’m not sure i get your point. We are talking about now - the suggestion I made was to ditch Real mode (and, in fact, everything other than the 64-bit modes) and you’d have a much better product for 99% of people who buy PCs today.

throAU · Sep 10, 2021

cmaier said:
I’m not sure i get your point. We are talking about now - the suggestion I made was to ditch Real mode (and, in fact, everything other than the 64-bit modes) and you’d have a much better product for 99% of people who buy PCs today.

Well yeah i think that's a dumb idea too.

The silicon area to implement an entire 386 CPU for compatibility with old software is in the order of a few hundred thousand transistors. Assuming you were to do it in hardware rather than microcode, which all modern x86 cpus use.

I just don't see the massive win from ditching 32 bit because of "baggage". You're just throwing away old working software. And for intel and x86 in general, that's a large part of the reason people buy intel compatible at all.

Intel tried to ditch x86 altogether multiple times in the past. They failed.

mr_roboto · Sep 11, 2021

cmaier said:
(also fun: bit 0 in PowerPC is the *high order* bit. They number the bits backward. Which really really sucks when your words are all different sizes, as they are when you are dealing with floating point. Ugh. I am getting a cold sweat just remembering it)

You just gave me flashbacks to the time early in my career when I designed a single board computer that had several PowerPCs on it.

mr_roboto · Sep 11, 2021

casperes1996 said:
But yeah I can think of no good logic behind making 0 reference your msb.

Technically, it is more consistent. In a big-endian machine, byte address 0 is the most significant byte of a multi-byte integer, byte address 1 is the next most significant, and so on. Now how do you number -- or address -- the bits inside a word? If bit address 0 isn't the most significant, isn't that inconsistent with your byte-level scheme? Better make bit 0 the most significant.

I bet this seemed really clever to whoever came up with it inside IBM.

(Things get real fun when you start trying to deal with 32-bit and 64-bit values. Bit 0 of a 64-bit number is not at all the same as bit 0 of a 32-bit number, in this scheme...)

After that experience designing a PPC SBC, I realized why little-endian is actually superior: it has logical, consistent, and useful bit and byte numbering. Bit 0 is the least significant, and so is Byte 0.

The only sense in which BE is better than LE is that when you do a hex dump of some memory and print the 1-byte hex numbers out with addresses increasing left-to-right, any multi-byte integers are easily read without reversing byte order.

But this is the point at which you have to take a step back and realize that ultimately, what's responsible for this mess is the fact that we speakers of European languages stole our system for writing numbers from the Arabs. Arabic is a right-to-left writing system, and we've embedded their R-to-L numbers inside L-to-R languages. The Arabs write the least significant digit first, and it's on the right, and if you dump memory with addresses increasing R-to-L, multi-byte LE integers display just fine.

casperes1996 · Sep 11, 2021

mr_roboto said:
The only sense in which BE is better than LE is that when you do a hex dump of some memory and print the 1-byte hex numbers out with addresses increasing left-to-right, any multi-byte integers are easily read without reversing byte order.

Also network traffic is standard big endian, so you don't have to reverse that

cmaier · Sep 11, 2021

mr_roboto said:
Technically, it is more consistent. In a big-endian machine, byte address 0 is the most significant byte of a multi-byte integer, byte address 1 is the next most significant, and so on. Now how do you number -- or address -- the bits inside a word? If bit address 0 isn't the most significant, isn't that inconsistent with your byte-level scheme? Better make bit 0 the most significant.

I bet this seemed really clever to whoever came up with it inside IBM.

(Things get real fun when you start trying to deal with 32-bit and 64-bit values. Bit 0 of a 64-bit number is not at all the same as bit 0 of a 32-bit number, in this scheme...)

After that experience designing a PPC SBC, I realized why little-endian is actually superior: it has logical, consistent, and useful bit and byte numbering. Bit 0 is the least significant, and so is Byte 0.

The only sense in which BE is better than LE is that when you do a hex dump of some memory and print the 1-byte hex numbers out with addresses increasing left-to-right, any multi-byte integers are easily read without reversing byte order.

But this is the point at which you have to take a step back and realize that ultimately, what's responsible for this mess is the fact that we speakers of European languages stole our system for writing numbers from the Arabs. Arabic is a right-to-left writing system, and we've embedded their R-to-L numbers inside L-to-R languages. The Arabs write the least significant digit first, and it's on the right, and if you dump memory with addresses increasing R-to-L, multi-byte LE integers display just fine.

The problem for me is I was designing the hardware to do math, and now you can’t count on 2^bit telling you the value of that but position. You end up with 2^(width-bit) where the width changes because there are multiple floating point representations. Or when some instruction needs to affect “bit 9” that now can be one of multiple places depending on the instruction width. I was always doing the design the normal way then trying to flip everything around.

mr_roboto · Sep 11, 2021

cmaier said:
The problem for me is I was designing the hardware to do math, and now you can’t count on 2^bit telling you the value of that but position. You end up with 2^(width-bit) where the width changes because there are multiple floating point representations. Or when some instruction needs to affect “bit 9” that now can be one of multiple places depending on the instruction width. I was always doing the design the normal way then trying to flip everything around.

Yeah, it's a giant pain. Just baffling that IBM stuck with it, I can't imagine their designers being happy about it.

theluggage · Sep 12, 2021

mr_roboto said:
Yeah, it's a giant pain. Just baffling that IBM stuck with it, I can't imagine their designers being happy about it.

Well, there’s a reason why the two conventions got named “Big Endian” and “Little Endian” in reference to Gulliver’s Travels...

mr_roboto said:
what's responsible for this mess is the fact that we speakers of European languages stole our system for writing numbers from the Arabs

So are phrases like “Four and twenty blackbirds” or “it’s five and twenty to nine” throwbacks to pre-Arabic -numerals times? (But then there is “three score years and ten” so maybe it’s just about what sounds more poetic).

Actually, in spoken/written language it does kinda make sense to start with the most significant digit and stop as soon as you’ve reached the appropriate precision. Otherwise, have fun writing Pi in little-Endian notation...

Ok, heading to the LKML to start an argument about how many buttons a mouse should have

casperes1996 · Sep 12, 2021

theluggage said:
Actually, in spoken/written language it does kinda make sense to start with the most significant digit and stop as soon as you’ve reached the appropriate precision. Otherwise, have fun writing Pi in little-Endian notation...

In fairness, good luck writing Pi in big endian notation too, if you don't limit the precision. And as long as we're limiting precision it doesn't really matter if we start RTL or LTR does it?

theluggage said:
Ok, heading to the LKML to start an argument about how many buttons a mouse should have

16 buttons. Then it easily maps to a 16-bit bitfield. I propose the button naming scheme
LS, RVFFL ,VFFL, FFL, FL, L, LM, M, MB , RM, R, FR, FFR, VFFR, RVFFR, RS

Letters meaning
L -> Left
R -> Right
S -> Side
RV -> Really very
V -> Very
F -> Far
FF -> F***ing far
M -> Middle
B -> Back

To fit them comfortably on the device I also propose we have mice the size of small keyboards

Anyone else a bit skeptical about Apple silicon future?

macrumors 603

macrumors 604

Suspended

macrumors 604

Suspended

macrumors 65816

macrumors Core

macrumors 604

Suspended

macrumors 68000

Suspended

macrumors 6502a

macrumors 6502

Suspended

macrumors 604

macrumors G4

Suspended

macrumors G4

macrumors 6502a

macrumors 6502a

macrumors 604

Suspended

macrumors 6502a

macrumors G3

macrumors 604

Our Staff