
bxs

macrumors 65816
Original poster
Oct 20, 2007
1,180
553
Seattle, WA
Subject: iMac Pro is a Cray computer on my desktop

Couldn't resist this.... ;):rolleyes::D
 

Attachment: IMG_1443.jpg (1.9 MB)
Well, the Cray-1 was rated at 160 million floating-point operations per second, and I would expect a W-2140B Xeon handily tops that.
 
The iMac Pro laughs at how slow the Cray T90 is in comparison.

Well, the Cray-1 was rated at 160 million floating-point operations per second, and I would expect a W-2140B Xeon handily tops that.

Numbers are all over the place depending on the test. The numbers I see for the Cray-1 range from 80 MFLOPS to 160 MFLOPS. The Cray-1 was soundly defeated by the Power Mac G3, and most likely by the previous-generation 604e CPU as well.

While I don't have the numbers for the W-2140B, the old i7-5960X hit about 360 GFLOPS on Linpack, and the low-end iMac Pro should be a bit higher. So it's somewhere around 2,500 to 6,000 times faster than a Cray-1, :eek: depending on which benchmarks you choose. Too bad we'll likely never see a modern set of benchmarks on a Cray-1.
 
Well, the Cray-1 was rated at 160 million floating-point operations per second, and I would expect a W-2140B Xeon handily tops that.
Yes, that's true. Today the Cray XK7 is #4 on the Top500 list.

560,640 cores at 2.2 GHz per core
Rmax 17,590.0 TFlop/s
Rpeak 27,112.5 TFlop/s
Power Use 8,209 kW
 

Attachment: Screen Shot 2018-02-08 at 2.07.27 PM.png (52.3 KB)
Ah, Cray! The first computer stock I made money with. Too bad I didn't put my gains into Apple at the time (leaving aside the fact that I'd have to have been pretty patient to see the Apple move pay off).
 
bxs, what is the dock you have on the front of your iMac Pro?

I have tried a couple with my iMac but nothing will attach firmly to the tapered sides. I was lucky and retired on 29 June 1999 and took advantage buying Apple shares that day with a full year franchising. They were US $35 when the US dollar was less than the Aussie dollar, and I still have them.

At the time Bill Gates invested some money in Apple as they were going bad and were calling Steve Jobs back.
 

It's not a Dock as such.... It's a SATECHI Space Gray USB-C Clamp Hub Pro.
 
...While I don't have the numbers for the W-2140B, the old i7-5960X hit about 360 GFLOPS on Linpack, and the low-end iMac Pro should be a bit higher. So it's somewhere around 2,500 to 6,000 times faster than a Cray-1....

According to these tests, the Linpack numbers are about 380 GFLOPS for 8 cores, 460 GFLOPS for 10 cores, and 686 GFLOPS for 18 cores: http://hrtapps.com/blogs/20180202/

I don't know if he used AVX512 instructions or not. According to this page using AVX could make a 2:1 or so difference: https://software.intel.com/en-us/articles/how-intel-avx2-improves-performance-on-server-applications

Regardless, if we use the Cray-1 number of 160 MFLOPS vs the above benchmark, an 18-core iMac Pro is 4,287 times faster.
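The comparison above can be sketched in a few lines. This is only an illustration using the figures quoted in this thread (380, 460 and 686 GFLOPS for the iMac Pro, and the 80 to 160 MFLOPS range mentioned for the Cray-1); the names `IMAC_PRO_GFLOPS`, `CRAY1_MFLOPS` and `speedup` are made up for the example.

```python
# Linpack figures quoted above for the iMac Pro, in GFLOPS, keyed by core count.
IMAC_PRO_GFLOPS = {8: 380, 10: 460, 18: 686}

# Cray-1 figures mentioned in this thread, in MFLOPS (low and high estimates).
CRAY1_MFLOPS = (80, 160)


def speedup(gflops: float, mflops: float) -> float:
    """Return how many times larger a GFLOPS figure is than a MFLOPS figure."""
    return (gflops * 1e9) / (mflops * 1e6)


if __name__ == "__main__":
    for cores, gflops in IMAC_PRO_GFLOPS.items():
        ratios = ", ".join(
            f"{speedup(gflops, m):,.0f}x vs {m} MFLOPS" for m in CRAY1_MFLOPS
        )
        print(f"{cores:>2} cores: {ratios}")
```

With the 160 MFLOPS rating the 18-core figure works out to roughly 4,287x; with the 80 MFLOPS figure every ratio doubles.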

For years supercomputer people criticized microprocessors as having fast "paper" performance but poor memory bandwidth. That was true for a period but not any longer (if comparing to older supercomputers). Without adequate memory bandwidth, a CPU when processing a large array that out-strips cache will stall.

The Cray-1 memory bandwidth was 640 megabytes/sec, and the X-MP from 1982 was 10 gigabytes/sec. By contrast the Pentium 4 from 2000 was only about 2.4 gigabytes/sec.

However by 2014 the i7-4790K was at 27.2 gigabytes/sec, the W-2195 used in the iMac Pro about 85 GB/sec, the Xeon 8176 is about 160 gigabytes/sec and the IBM Power8 at 238 GB/sec.
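One rough way to put the bandwidth and FLOPS numbers side by side is bytes of memory bandwidth per floating-point operation. The sketch below is only an illustration: it mixes the Cray-1's rated 160 MFLOPS with the measured 686 GFLOPS Linpack figure for the 18-core iMac Pro, both taken from this thread, so treat the results as ballpark. The helper names are made up for the example.

```python
def bytes_per_flop(bandwidth_gb_s: float, gflops: float) -> float:
    """Memory bandwidth (GB/s) divided by floating-point rate (GFLOPS)."""
    return bandwidth_gb_s / gflops


def bandwidth_ratio(newer_gb_s: float, older_gb_s: float) -> float:
    """How many times more memory bandwidth the newer system has."""
    return newer_gb_s / older_gb_s


if __name__ == "__main__":
    # Cray-1: 640 MB/s and 160 MFLOPS, expressed in giga-units.
    cray1 = bytes_per_flop(0.640, 0.160)
    # 18-core iMac Pro (W-2195): about 85 GB/s and 686 GFLOPS.
    imac_pro = bytes_per_flop(85.0, 686.0)

    print(f"Cray-1:   {cray1:.2f} bytes/FLOP")
    print(f"iMac Pro: {imac_pro:.2f} bytes/FLOP")
    print(f"Absolute bandwidth: {bandwidth_ratio(85.0, 0.640):.0f}x the Cray-1")
```

By these figures the iMac Pro has well over a hundred times the Cray-1's absolute memory bandwidth, while its bandwidth per FLOP is lower.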

Even though people commonly complain about how "hot" iMacs run, the Cray-1 consumed 115,000 watts and required a separate refrigeration plant. The total power consumption for both the Cray and the cooling system was about 250,000 watts: https://cs.lbl.gov/assets/Images/Ne...g-Month-Throwback-Thursday-Gallery/Cray-1.jpg

The Cray-1 used freon tubes to carry the heat away from the copper-clad logic boards. Each Cray-2 logic module consumed about 300-500 watts, so they were all immersed in liquid fluorinert. There were 320 modules: https://lh3.googleusercontent.com/-...Jnvh7w32qHkdeGcjCHTQCJoC/w3907-h2394/sc_3.jpg

However the fastest supercomputers today are farther beyond the fastest desktop computers than the Cray-1 was beyond the original IBM PC. This year the US Oak Ridge Summit supercomputer is expected to reach 200 peak petaflops (or 200,000 TFLOPS, or 200,000,000 GFLOPS): https://www.top500.org/news/oak-ridge-readies-summit-supercomputer-for-2018-debut/

That is about 291,000 times faster than an 18-core iMac Pro.
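For anyone who wants to check the unit conversions, here is a small sketch. It assumes the 200 petaflops peak figure for Summit and the 686 GFLOPS Linpack figure for the 18-core iMac Pro quoted above; `PREFIX` and `to_flops` are names invented for the example.

```python
# Decimal prefixes as integer multipliers, so the conversions are exact.
PREFIX = {"M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}


def to_flops(value: int, prefix: str) -> int:
    """Convert a value with a metric prefix (e.g. 200, "P") to plain FLOPS."""
    return value * PREFIX[prefix]


if __name__ == "__main__":
    summit = to_flops(200, "P")    # expected peak, per the linked article
    imac_pro = to_flops(686, "G")  # 18-core Linpack figure from this thread

    print(f"Summit in TFLOPS: {summit // PREFIX['T']:,}")
    print(f"Summit in GFLOPS: {summit // PREFIX['G']:,}")
    print(f"Summit / iMac Pro: {summit / imac_pro:,.0f}x")
```

Note this compares a peak figure against a measured one, so the real gap on the same benchmark would likely be smaller.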
 
Yeah.... most of that performance is coming from the Cray having 1000s of cores. Now if the iMP had 1000s of cores, how would it compare?

The Cray XK7 gets 27,112.5 Tflop/s peak using 56,640 cores. Thus (27112.5/56,640) Tflop/s per core or ~0.5 Tflop/s per core. For the 10 core iMP each core gets around 0.023 Tflop/s. Just using this we see that the Cray XK7 is but 0.5/0.023 faster or about 22 times faster on a core basis.
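The per-core arithmetic can be sketched as below, using only figures that appear in this thread. Two things are assumptions of the sketch rather than settled facts: the core count shows up as both 560,640 and 56,640 in the thread, so both are evaluated, and the iMac Pro per-core figure is computed both as posted (0.023) and from the 460 GFLOPS 10-core Linpack number quoted earlier. The function name is made up for the example.

```python
XK7_RPEAK_TFLOPS = 27_112.5  # peak figure quoted in this thread


def per_core_tflops(total_tflops: float, cores: int) -> float:
    """Divide a system-wide TFLOPS figure evenly across its cores."""
    return total_tflops / cores


if __name__ == "__main__":
    # Core count as written in this post, and as listed earlier in the thread.
    xk7_posted = per_core_tflops(XK7_RPEAK_TFLOPS, 56_640)
    xk7_listed = per_core_tflops(XK7_RPEAK_TFLOPS, 560_640)

    # iMac Pro per-core figure as posted, and from 460 GFLOPS over 10 cores.
    imp_posted = 0.023
    imp_linpack = per_core_tflops(0.460, 10)

    print(f"XK7 per core (56,640 cores):  {xk7_posted:.3f} TFlop/s")
    print(f"XK7 per core (560,640 cores): {xk7_listed:.4f} TFlop/s")
    print(f"Ratio as posted:              {xk7_posted / imp_posted:.1f}x")
    print(f"Ratio with listed core count: {xk7_listed / imp_linpack:.2f}x")
```

With the core count and Linpack figure quoted earlier in the thread, the per-core gap shrinks from about 20x to roughly parity, so the result is very sensitive to which inputs are used.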
 
...The Cray XK7 gets....~0.5 Tflop/s per core. For the 10 core iMP each core gets around 0.023 Tflop/s. Just using this we see that the Cray XK7 is but 0.5/0.023 faster or about 22 times faster on a core basis.

I think that's probably an artifact of the test method and what % of the load was on the Cray's CPUs vs GPUs. The XK7 simply uses the AMD Opteron 6200 16-core CPU -- nothing magic about that. The XK7 is a hybrid design, so it also uses the nVidia Tesla K20. For hybrid supercomputers they commonly use the CPUs to feed the GPUs which do much of the work. This of course assumes the task has been written to efficiently harness those.

If so, the XK7 numbers might have little to do with the CPU. Certainly the Opteron 6200 is not 22 times faster than a Xeon W-2195 from an iMac Pro.

Using SpecIntRate_2006, the fastest current dual-socket 16-core Xeon (Intel Xeon Gold 6142M) scored 1400, whereas the fastest 6200-series Opteron only scored about 414. However, that is comparing a new Xeon to an older 6200-series Opteron, yet that is what the XK7 uses and we are comparing that to the W-2195.

https://www.spec.org/cgi-bin/osgresults?conf=default;op=form

There aren't any current SpecInt numbers for the W-2195 but the above 6142M is the same generation (Xeon Skylake-SP) as the W-2195 used in the iMac.

A contemporary CPU from AMD would be the EPYC 7351, which did 1410 on SpecIntRate_2006, slightly faster than the Xeon Gold 6142M. So it's not like AMD is slow when comparing similar generations. However I'm guessing the high TFlop/s per core from the Cray XK7 is probably not from the CPU but the GPU and overall benchmark design vs that run on the iMac Pro.

The days of supercomputer manufacturers using exotic custom-designed CPUs are long gone, as are the days of vector supercomputers. So we should not expect any current supercomputer to be unusually fast on a per-CPU basis. These are all massively parallel designs and the main issues are whether to use a hybrid architecture (CPU + GPU), how to interconnect and how to handle the software.

Seymour Cray once laughed at massively parallel designs, saying "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?" Today the answer is 1024 chickens (or with the XK7, 560,640 chickens). However the upcoming Summit supercomputer will use 9,200 of the 24-core IBM Power9 CPUs, plus 27,600 nVidia Volta GPUs. So in this case each "chicken" is pretty strong.
 
I remember comments like that from Seymour Cray. Thing is, his biggest "Ah ha!" moment came when he realized that the speed of light was a limiting factor - in the days of building-size computers, minimizing the length of wire runs between modules would dramatically speed things up. Even in the '70s, in a Scientific American edition dedicated to the microprocessor, it was recognized that the more that could be squeezed into a silicon die, the less electron transit speed would be a limiting factor, and the less relevant Cray's approach would be. He was packing his power into a limited number of CPUs because, in part, running wires to thousands of similarly-sized modules would violate the premise behind his short-wire design.

Once you can put thousands of "chickens" onto a single die, and put dozens of dies into the space occupied by one of his CPUs... the terrain shifts.

Yet the 2013 Mac Pro was criticized for the dual-GPU design. Basically, if the developers (and benchmark tests) don't write for it or the computing tasks don't require it, a hybrid CPU/GPU design still isn't ready for the average desktop computer user.

I've been wondering whether the iMac Pro's approach (return to what the customers can understand - lots of Intel CPU power) will be present in the upcoming Modular Mac Pro (leaving the eGPU modules as options), or whether it'll have some hybrid processing baked into the base unit.
 
Even though people commonly complain about how "hot" iMacs run, the Cray-1 consumed 115,000 watts and required a separate refrigeration plant. The total power consumption for both the Cray and the cooling system was about 250,000 watts: https://cs.lbl.gov/assets/Images/Ne...g-Month-Throwback-Thursday-Gallery/Cray-1.jpg


The Cray-1 used freon tubes to carry the heat away from the copper-clad logic boards. Each Cray-2 logic module consumed about 300-500 watts, so they were all immersed in liquid fluorinert. There were 320 modules: https://lh3.googleusercontent.com/-...Jnvh7w32qHkdeGcjCHTQCJoC/w3907-h2394/sc_3.jpg

Interesting history on the Cray-1.


However the fastest supercomputers today are farther beyond the fastest desktop computers than the Cray-1 was beyond the original IBM PC. This year the US Oak Ridge Summit supercomputer is expected to reach 200 peak petaflops (or 200,000 TFLOPS, or 200,000,000 GFLOPS): https://www.top500.org/news/oak-ridge-readies-summit-supercomputer-for-2018-debut/


I think that's probably an artifact of the test method and what % of the load was on the Cray's CPUs vs GPUs. The XK7 simply uses the AMD Opteron 6200 16-core CPU -- nothing magic about that. The XK7 is a hybrid design, so it also uses the nVidia Tesla K20. For hybrid supercomputers they commonly use the CPUs to feed the GPUs which do much of the work. This of course assumes the task has been written to efficiently harness those.


If so, the XK7 numbers might have little to do with the CPU. Certainly the Opteron 6200 is not 22 times faster than a Xeon W-2195 from an iMac Pro.


These are certainly cases where one is looking at GPU power but comparing to CPU power.


The 200 petaflops achieved by Oak Ridge Summit is GPU power, which is measured in double-precision math. At least they didn’t fudge the numbers with single-precision math or, worse yet, half-precision math (AMD Vega).


When comparing those Summit results with a desktop, I'd compare them to a desktop GPU, in this case the nVidia Titan V, which uses the same V100 GPU and does 6.14 TFLOPS at base clock and 7.45 TFLOPS at max boost clock in double-precision math. With good cooling it should be possible to maintain that 7.45 TFLOPS, and go higher if you overclock it. So all they really plan is something about 27,000 times faster than what is available in professional desktops: the speed of 25,000 Titan Vs with a modest overclock, which is what they are installing in the system.
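A quick sketch of that estimate, assuming the 200 petaflops (200,000 TFLOPS) peak figure for Summit and the Titan V double-precision numbers given above. The names are invented for the example, and the calculation ignores interconnect and scaling losses entirely.

```python
SUMMIT_PEAK_TFLOPS = 200_000  # 200 petaflops, as quoted in this thread

# Titan V double-precision figures from the post above, in TFLOPS.
TITAN_V_FP64_TFLOPS = {"base": 6.14, "boost": 7.45}


def gpus_needed(target_tflops: float, per_gpu_tflops: float) -> float:
    """Number of GPUs whose combined peak would equal the target."""
    return target_tflops / per_gpu_tflops


def per_gpu_rate(target_tflops: float, gpu_count: int) -> float:
    """TFLOPS each GPU must deliver for gpu_count of them to hit the target."""
    return target_tflops / gpu_count


if __name__ == "__main__":
    for clock, tflops in TITAN_V_FP64_TFLOPS.items():
        count = gpus_needed(SUMMIT_PEAK_TFLOPS, tflops)
        print(f"At {clock} clock ({tflops} TFLOPS): about {count:,.0f} Titan Vs")
    # What "25,000 Titan Vs with a modest overclock" implies per card.
    print(f"25,000 cards would each need {per_gpu_rate(SUMMIT_PEAK_TFLOPS, 25_000):.1f} TFLOPS")
```

At boost clock the count comes out near 27,000, and 25,000 cards would each need about 8 TFLOPS, which is where the "modest overclock" comes in.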


When comparing to a Xeon, I'd compare the fastest CPU-based supercomputer to the fastest desktop CPU, for an accurate gauge of how much faster a supercomputer is in raw general-purpose computing power. I'd be interested to see what speeds those kinds have achieved.


It would be nice if the Top500 maintained separate lists: one for general-purpose, multi-CPU supercomputers, one for GPU-focused systems, and one for highly specialized systems using ASICs. Google's TPU for deep learning comes to mind. It would be too easy to list absurdly high numbers when only a highly specialized function can be performed at that speed.
 

It should be noted that the Cray-1's cooling system was housed at its base and was enclosed with formed, leather-covered seats that software engineers could sit on or even sleep on. This person used to sleep on those seats; not the best comfort, but at 3am it was bliss. That Cray-1 had a mere 1,000,000 64-bit-wide 'words' (not bytes), or 8 MB if you wish, and cost millions of dollars. At my company we rented just 1/4 of the 8 MB at first as we developed software for the Cray Operating System, named COS, and as we got more and more software integrated we rented more RAM until we had access to the full 1,000,000 words or 8 MB. Those WERE the days for sure and ones to remember.
 
....These are certainly cases where one is looking at GPU power but comparing to CPU power...The 200 petaflops achieved by Oak Ridge Summit is GPU power, which is measured in double-precision math....

It's actually measuring floating point math performance, which can be achieved several different ways: vector processing, massively parallel scalar processing, or massively parallel GPU processing.

In general the vector approach (like the Cray-1 and Cray-2) is no longer competitive at the very top of the pack, but NEC is still developing and marketing vector machines. Their new machine does 157 TFLOPS. It's useful because of backward compatibility with existing vector code: https://insidehpc.com/2017/10/vectors-back-new-nec-sx-aurora-tsubasa-supercomputer/

https://www.hpcwire.com/2016/09/26/vectors-old-became-new-supercomputing/

Top 10 supercomputers: https://www.top500.org/lists/2017/06/

The current 5th fastest supercomputer, the IBM Sequoia, does not use GPUs but 98,000 IBM Power Architecture CPUs, each with 16 cores. It is termed a homogeneous design because it uses only CPUs, no GPUs or other co-processors.

The current 2nd fastest supercomputer, the Chinese Tianhe-2, uses 32,000 Intel Xeon E5-2692 CPUs and 48,000 Xeon Phi 31S1P co-processors. It is a heterogeneous design, but the Xeon Phi is not really a GPU per se but a module which contains many Pentium-type CPUs with vector extensions: https://en.wikipedia.org/wiki/Tianhe-2

The current top supercomputer is the Chinese Sunway TaihuLight, which I believe is another homogeneous design using only high-core-count custom CPUs (40,000 of them, 256 cores each): https://en.wikipedia.org/wiki/Sunway_TaihuLight

The upcoming U.S. Summit supercomputer will be a hybrid CPU/GPU design but this is only one architectural variant of several commonly in use.

...When comparing those Summit results with a desktop, I'd compare them to a desktop GPU, in this case the nVidia Titan V, which uses the same V100 GPU and does 6.14 TFLOPS at base clock and 7.45 TFLOPS at max boost clock in double-precision math...

Yes that is a fair comparison, assuming the Titan V can execute the required Linpack benchmark and it can be parallelized on a single GPU. I'm not sure if the stated nVidia TFLOPS number comes from that or whether it's a theoretical calculation.

...When comparing to a Xeon, I'd compare the fastest CPU-based supercomputer to the fastest desktop CPU...

By "fastest CPU-based supercomputer", do you mean an individual CPU from the fastest such supercomputer or the entire supercomputer? If you mean the entire CPU-based supercomputer, that's the Tianhe-2, which does about 55 petaFLOPS. An 18-core iMac Pro does about 686 GFLOPS, so the Tianhe-2 is about 80,000 times faster.

If you mean an individual CPU from the fastest CPU-based supercomputer, those aren't that fast. Since the massively parallel approach is the only competitive supercomputer architecture, each CPU cannot be very expensive, hot or fast, because they must use 10s of thousands of them. The upcoming IBM Power9 to be used on the Summit is probably near the top of current single-CPU performance. It might be somewhat faster than the fastest Xeon but it just shipped in December and is intended for internal use, so I don't know if any 3rd party benchmarks are available. It's probably very roughly in the performance range of a top Xeon to be released in the 2018 timeframe.
 
Absolutely fascinating. Maybe the most interesting thread I’ve ever seen on this forum. The progression has been incredible. Fun to think what the performance and architectures will be thirty years from now.
 