If AVX and AVX2 were so good and as universally applicable as Intel wanted to sell us, every Pro app would be using them by now, no?
Getting good to extremely good AVX/AVX2 performance depends on using the right tools to get there. It doesn't just magically fall out. Pragmatically, you either need specialized Intel compilers that are explicitly built to heavily vectorize the code, or you hand-craft assembler to do the trick. The medium "in between" approach of just using mainstream compilers and/or directives interleaved in the code often leaves performance on the floor.
So no, not every Pro app is going to use it, because some Pro app developers have a "one and only one tool" focus.
2010-2020 will be known as the decade that ended Moore's law.
Given that top-end core counts are generally still going up on top-end dies ... no, it hasn't. Moore's law is about how much stuff fits on a die, not how fast it clocks. Those aren't necessarily the same thing.
Or is it Maxon's fault for still compiling their code for a nine-year-old processor?
....
Compiling for MMX and running on a system with AVX2 should be a crime.
Supposedly this release is meant to be more oriented toward advanced CPU features.
"...Cinebench R20 and Cinema 4D R20 incorporate the latest rendering architectures, including integration of Intel’s Embree raytracing technology and advanced features on modern CPUs from AMD and Intel that allow users to render the same scene on the same hardware twice as fast as previously. ..."
https://www.maxon.net/en/products/cinebench-r20-overview/
From the results so far, though, it does somewhat seem to indicate that they may have skipped AVX and only fine-tuned for AVX-512. Embree can require that you re-orient the app to maximize the library's effectiveness. Those changes may have been aligned only to feed the AVX-512 tuning of the library.
The numbers put the 56xx (with no AVX) close to the E5 v2 systems that have just first-generation AVX. First-generation AVX was only highly exploitable with hand-crafted code in most cases. (That shouldn't be a major hurdle for high-end math libraries with multiple human optimizers assigned to improvements. For more generic apps focused on breadth of customer distribution, though, it can be an issue.) Better hooks to make it compiler-exploitable only started to come with AVX2 and expanded with AVX-512.
Actually, it's proof that Moore's law is still going strong.
.....
The "faster clock" thing is hitting a ceiling, so new CPUs work to increase the instructions-per-cycle. One big way this is done is to extend the MMX/SSE SIMD instructions to wider and wider registers, with more complete instruction coverage.
AVX uses 256-bit wide registers, with good FP coverage and basic integer instructions. This is basically "loop unrolling" in hardware. A single instruction can do 8 FP32 or "int" operations in parallel.
It's not so much loop unrolling in hardware as a software-level construct for talking about things more directly. Inference about aliasing is better done at the compiler level. What was missing was a way to "talk" directly about vectors, in a meaningful way, at significant sizes. The loop (and the small pointer increments inside the matrix/vector abstractions) is more indicative of a limitation in how the hardware lets you express things. The aliasing that some loop constructs introduce was part of the problem, and hardware wasn't going to "untangle" that.
Moore's law is still alive. The transistor budget just isn't being funneled solely into faster versions of the four basic math functions ( + , - , * , / ) and minor variations of those (address calculations for load/store).
You have to come up with code that expresses more parallelism explicitly to get to higher "instructions per clock" results. Basic classic (von Neumann) style code pragmatically has an upper bound on parallelism baked right into the style. There have been times when breaking out of that was taken too far (VLIW). But in contexts where vectors (and matrices; vectors of vectors) are the natural abstraction, there is a much better match.
The code expression is denser and introduces far fewer negative aliasing artifacts.
It's the same issue in other "specific function" areas: encryption/compression, video stream en/decode. With much bigger transistor budgets, when a workload type is a "significant enough" percentage of the total, it just pays to assign transistors to doing that specific work. It is just much, much faster throughput.
Similarly, the GPUs largely eschew the mainstream denser-branching, short-basic-block, high-interdependency classic von Neumann code found in most apps. Those cores "suck" at that and are highly skewed toward embarrassingly parallel code. And they tend to do much better on that than CPUs do (even with the vector additions).