Nvidia's new ARM CPU.

ChrisA · Mar 23, 2022

Just a note about the new ecosystem of ARM CPUs...

It looks like other companies are going with ARM for high performance applications (not just for phones) Nvidia has a new "Super Chip" that is kind of the polar opposite of the M1. Apple designed the M1 to use only a small amount of power because Apple sells mostly notebook computers. But Nvidia wanted the most powerful ARM chip, period. Their new "Grace" CPU burns 500 Watts, has 128 high performance cores and 1 Terabyte of RAM. (not Flash but actual RAM)

Obviously, this CPU would be overkill for even editing 8K video. It is also unaffordable to the average user. The intended use is to build data centers with tens of thousands of these chips.

Why should we care? It shows that "100X faster" is possible using current technology. Apple COULD if they choose build computers 100X faster than even the "Mac Studio". It is all about economics and what users are willing to pay, not so much about technology.

Apple thinks (and rightly so) "If we made $50,000 computers, no one would buy them." and Nvidia says "If we could build these for only $50K we'd sell them faster than we can make them."

Here is a link
https://nvidianews.nvidia.com/news/nvidia-introduces-grace-cpu-superchip

Realityck · Mar 23, 2022

ChrisA said:
It looks like other companies are going with ARM for high performance applications (not just for phones) Nvidia has a new "Super Chip" that is kind of the polar opposite of the M1. Apple designed the M1 to use only a small amount of power because Apple sells mostly notebook computers. But Nvidia wanted the most powerful ARM chip, period. Their new "Grace" CPU burns 500 Watts, has 128 high performance cores and 1 Terabyte of RAM. (not Flash but actual RAM)

Interesting news.
This is a another more powerful ARM specially designed for data center CPU in servers. The paragraph below points to the power efficiency that ARM contributes to.

Grace CPU Superchip also provides industry-leading energy efficiency and memory bandwidth with its innovative memory subsystem consisting of LPDDR5x memory with Error Correction Code for the best balance of speed and power consumption. The LPDDR5x memory subsystem offers double the bandwidth of traditional DDR5 designs at 1 terabyte per second while consuming dramatically less power with the entire CPU including the memory consuming just 500 watts.

opeter · Mar 23, 2022

OK, but that is the same if I would ask you: what would you buy? A bicycle or a train?

leman · Mar 23, 2022

It’s not at all an opposite of M1, it’s actually fairly similar (and Nvidia will eventually adopt UMA as well, for now they just have a quick CPU/GPU interconnect). It’s just targeted at a very different usage - it’s a supercomputer chip. And it’s unlikely that Nvidia will scale it down for PC use any time soon. What does it mean for Apple? Not much. Of course Apple could build a datacenter chip, they have the tech. But it’s not their business and they don’t have the software ecosystem to back it up. So I wouldn’t hold my breath.

BTW, Grace was announced around two years ago if I remember correctly

ChrisA · Mar 23, 2022

opeter said:
OK, but that is the same if I would ask you: what would you buy? A bicycle or a train?

My point was that we are seeing a change that is pervasive across the ecosystem. It is as if Busses and trains and bicycles are all moving to electric power.

In the past, ARM is only used in applications where power was the primary concern -- cell phones. Now we see them being more widely used.

What is happening is a return to the "Good Old Days" when every comuer company had their own CPU. How many people here remember computing BEFORE the 1980's? Back then every company had radically different architectures, then came Intel and everyone dropped that and just used the same X86 chips. The near universal use of X86 was a "change in the ecosystem"

I think now we are just starting to see another change but to call it that, it needs to be across the industry from bicycle to cars to trains.

So iPhones are the bicycles and data centers are the freight trains. They seem to be moving to ARM. But the mainstreamn world is "cars" (or Windows PCs). How long until they are converted over to ARM?

Will Dell and Azus and HP each have their own chips? If now who makes them? The next 10 years will be fun to watch,how will this transition go?

deconstruct60 · Mar 23, 2022

ChrisA said:
Just a note about the new ecosystem of ARM CPUs...

It looks like other companies are going with ARM for high performance applications (not just for phones) Nvidia has a new "Super Chip" that is kind of the polar opposite of the M1. Apple designed the M1 to use only a small amount of power because Apple sells mostly notebook computers. But Nvidia wanted the most powerful ARM chip, period. Their new "Grace" CPU burns 500 Watts, has 128 high performance cores and 1 Terabyte of RAM. (not Flash but actual RAM)

It is actually 72 cores per Grace. There is a "Grace SuperChip" that has two of those for 144 cores and expected 1TB of RAM.

Nvidia compared it to the Rome ( gen 2 Epyc ) chips they use that run 225W per package. So it is really about two of those running at 500W but the Nvidia has the soldered on RAM thrown in for the same TDP budget.
But on a W/core basis that is 3.5W/core (with the Memory overhead thrown in.) . They haven't thrown modest power consumption out the window here. Looks like they are using a tweak to TSMC N4P ( 4N for 4Nvidia).

ChrisA said:
Why should we care? It shows that "100X faster" is possible using current technology.

100X faster? Nvidia said they were 1.5X faster than the AMD Epyc gen 2 Rome on a simulated SpecInt_rate benchmark. Not saying much since AMD is about to release gen 4 about the same time Nvidia releases these Grace in 2023. But no where near two orders of magnitude faster on general purpose code.

Pushing jobs onto fixed function accelerators ... maybe. But on general applications that just run on CPU cores doing "dick , jane , spot' math, nobody is doing multiple orders of magnitude faster that the contemporary competition.

ChrisA said:
Apple COULD if they choose build computers 100X faster than even the "Mac Studio". It is all about economics and what users are willing to pay, not so much about technology.

Apple could cut their pricing and not pass along inflation costs .. but won't hold my breath on that either.
What Apple has on the CPU cluster complex really doesn't show any signs of scaling up where Neoverse ( Nvidia, Ampere , Graviton , etc) packages are deploying at this point.

Apple also trades memory bandwidth for capacity. Nvidia is spending wires/connections to do LPDDR5 ECC and Apple probably won't. Grace isn't going to be a single thread drag racing CPU package. Apple is targeting single users running single apps in the foreground, so they have made that bandwidth-for-RAS trade-off.

deconstruct60 · Mar 23, 2022

ChrisA said:
What is happening is a return to the "Good Old Days" when every comuer company had their own CPU. How many people here remember computing BEFORE the 1980's? Back then every company had radically different architectures, then came Intel and everyone dropped that and just used the same X86 chips. The near universal use of X86 was a "change in the ecosystem"

But they aren't. And didn't (at least through 90's and into 2000's. Alpha was 90's. PPC was 90's, MIPS 90's . HP PRISM , ) . The basic foundational design of Grace is likely based on Arm Neoverse.

"... Given what we know about Arm's roadmap, the Hopper CPU Superchip is based on the N2 Perseus platform, the first to support Arm v9. This platform comes as a 5nm design that supports all of the latest connectivity tech, like PCIe Gen 5.0, DDR5, HBM3, CCIX 2.0 and CXL 2.0, delivering up to 40% more performance over the V1 platform. ..."

Nvidia Unveils 144-core Grace CPU Superchip, Claims Arm Chip 1.5X Faster Than AMD's EPYC Rome

Nvidia enters the CPU business in a big way

www.tomshardware.com

Nvidia has done some work on the "uncore" aspects ( memory controller , PCI-e , NvLink , CXL 2.0 ) , but it is still unclear there is something beyond the Arm's mesh for the basic intrachip interconnect backbone and some modest tweaks to N2 cores.

Graviton 3 ... (and 2 ) similar based on Neoverse baseline design. Ampere .. same. Even Intel has some Neoverse cores in one of their upcoming DPU cards. As long as there is enough pooled R&D funds flowing into Arm that is specifically for server cores and infrasture it will probably make sense to continue to push updates out

Most of the Arm server folks are still mainly working off of shared R&D expenses for the baseline cores for their server offerings. That is changing a bit in 2023-2025 time frame but still pretty much building to shared specs and all running a shared Linux foundation for the OS.

the PC vendors each get their own? Probably not (unless radically shrink their system portfolios ).

ChrisA said:
Will Dell and Azus and HP each have their own chips? If now who makes them? The next 10 years will be fun to watch,how will this transition go?

deconstruct60 · Mar 23, 2022

leman said:
It’s not at all an opposite of M1, it’s actually fairly similar (and Nvidia will eventually adopt UMA as well, for now they just have a quick CPU/GPU interconnect). It’s just targeted at a very different usage - it’s a supercomputer chip.

It is not just a CPU/GPU interconnect. They have a Hopper GPU card with 40+GbE connector (actually 400Gb Infiniband) on the GPU card so that data in/out flow can do straight to the GPU. [ There CX7 networking chip is a significant piece of puzzle that the tech press , hype train is mostly overlooking at the moment. ]

Nvidia Chip-to-Chip NvLink scales also. 900GB/s isn't > 1TB/s, but it scales past two chips. [ would be interesting to find out if Apple is doing something similar in counting both directions bandwidth as one number as what Nvidia is doing. ]

UMA is a dual edge sword. Pragmatically it means giving up data working set scale to get the benefits of fewer copies and "highly uniform" access.

They also don't "give up" on having PCI-e v5 I/O. Take the higher power hit to work with more devices.

leman said:
BTW, Grace was announced around two years ago if I remember correctly

They sneaked a picture of "Grace Hopper" ( CPU + GPU ) at the last GTC. There are more combos now.

NVIDIA Grace-2xHopper Supercomputer Building Block at GTC 2022

NVIDIA made two brief mentions and had a small image of the NVIDIA Grace-2xHopper that looks like a supercomputer building block

www.servethehome.com

fs454 · Mar 23, 2022

I would be surprised if this isn't similar to what Apple will do with the Mac Pro this year. It's the only machine that's realistically fully unrestrained for power draw - those purchasing it want to push the bleeding edge of computing power under one hood and know it comes with a higher annual electric cost. Mac Studio is compelling for 99% of power user and pro creative types as it offers blistering performance and ridiculous efficiency, but Mac Pro's SoC will likely be a new designation or M1X-type product where it's a new die that packs a stupid amount of cores like this with little regard to power draw.

The loaded Mac Pro's current TDP is 300W at idle and over 900W at full load. Imagine the Apple Silicon compute power they could fit into that range: https://support.apple.com/en-us/HT201796

leman · Mar 23, 2022

deconstruct60 said:
UMA is a dual edge sword. Pragmatically it means giving up data working set scale to get the benefits of fewer copies and "highly uniform" access.

I don’t see it this way. There is nothing inherent to UMA that makes it non-scalable. The problem is the demand for high performance. CPUs traditionally had fairly low DRAM bandwidth, making scalable modular RAM design feasible. But this becomes more problematic if you want to go faster. High end GPUs never offered any RAM scalability. It’s a pattern that will continue in the future as demand for faster memory will only grow.

deconstruct60 · Mar 25, 2022

leman said:
I don’t see it this way. There is nothing inherent to UMA that makes it non-scalable.

I think you are trying to use UMA in the general sense that envelopes NUMA systems also. That is not really what Apple has done. When Apple says. UMA they more so mean shared, homogenous , uniform, "unified" memory architecture. For example, "UltraFusion" is largely to take the non-uniform access out of the equation for any significant impact. That won't scale. At some point run into both "speed of light" (and transitions between complexes at well under the speed of light) and Amdahl impacts were very different workloads with real-time and other completion parameters pull at heterogeneous memory types.

At some point have to substantively shift the software stack to deal with NUMA and Apple doesn't really do that to a significant degree.

leman said:
The problem is the demand for high performance. CPUs traditionally had fairly low DRAM bandwidth, making scalable modular RAM design feasible. But this becomes more problematic if you want to go faster. High end GPUs never offered any RAM scalability. It’s a pattern that will continue in the future as demand for faster memory will only grow.

Right .... which is not short-intermediate term scalable in capacity. Apple is designing a system mainly about being a GPU with some other sprinkled on top. UltraFusion is so skew in enabling talking to just another Apple chip that there isn't much left over in I/O to talk to anything else. That is a dual edged sword.

Is insatiable "faster memory" coupled to demands for longer battery life going to push DIMMs and soDIMMS (and 'plain' DDR ) out of general PC laptop market over the next 3-6 years? Probably Yes. Is that going to blow up the server market? Probably not over the same timespan.

To put it in the Nvidia Grace and Hopper context ... the Hopper has HBM3 on purpose, because LPDDR5x just isn't going to cut it for that 'half' of the problems being targeted. If have respective large pools of non-homogenous memory than pragmatically will have substantive NUMA effects. That's why Nvidia goes ahead and have built a stack that deals with it. NvLink 4 is cache-coherent. (it is alreay 'unified' that way); not unified on timings.

Search

Search

Nvidia's new ARM CPU.

ChrisA

macrumors G5

Realityck

macrumors G5

opeter

macrumors 68030

leman

macrumors Core

ChrisA

macrumors G5

deconstruct60

macrumors G5

deconstruct60

macrumors G5

Nvidia Unveils 144-core Grace CPU Superchip, Claims Arm Chip 1.5X Faster Than AMD's EPYC Rome

deconstruct60

macrumors G5

NVIDIA Grace-2xHopper Supercomputer Building Block at GTC 2022

fs454

macrumors 68000

leman

macrumors Core

deconstruct60

macrumors G5

Our Staff