
th0masp

macrumors 6502a
Mar 16, 2015
851
515
I installed a prebuilt binary of TensorFlow 1.13.0 on Ubuntu Server 18.04.2 LTS running on a MacPro5,1

Just out of curiosity: why are you running a server with a 1080Ti in it? Are you by any chance Google Stadia? ;)
 

thekev

macrumors 604
Aug 5, 2010
7,005
3,343
It is the first application where I've hit this issue; maybe a lot more to come!
It's very sad that my Windows PC with an old Intel Core i7 2600K has AVX, but my monster Mac Pro with a 12-core Xeon can't support it!

AVX refers to an instruction set extension, not a single feature, and SSE is the same kind of thing. SSE4 is just a later version of it that adds a few things I don't remember offhand, since I haven't looked at it in a long time. AMD supports the same instructions as well, though its implementation under the hood is completely different.

The lack of support is just a matter of hardware generation. Intel made a fairly big jump from SSE to AVX, and it really only becomes interesting with AVX2, since that generation generally also grants FMA3. The ISAs differ quite a bit: SSE uses destructive 2-operand instructions, while the VEX encoding that AVX introduced adds non-destructive 3-operand forms.
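To make the 2-op vs 3-op point concrete, here's a tiny sketch of my own (the function and the typical compiler output in the comments are illustrative, not anything from this thread): the legacy SSE encoding overwrites one of its sources, so the compiler has to insert a register copy whenever both inputs stay live, while the VEX form writes to a separate destination.
Code:
#include <immintrin.h>

// Hypothetical example: 'a' and 'b' are each needed twice, so the destructive
// SSE encoding needs an extra movaps, while the AVX/VEX encoding does not.
void add_and_mul(__m128 a, __m128 b, __m128 *sum, __m128 *prod)
{
    // SSE (-msse2), typical output:  movaps xmm2, xmm0 / addps xmm2, xmm1 / mulps xmm0, xmm1
    // AVX (-mavx),  typical output:  vaddps xmm2, xmm0, xmm1 / vmulps xmm0, xmm0, xmm1
    *sum  = _mm_add_ps(a, b);
    *prod = _mm_mul_ps(a, b);
}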

Most of the hardware that ran SSE but not AVX had different criteria for latency hiding. Most of the instructions without a cross-lane component, such as the basic add, subtract, and multiply (not divide), required 3-4 cycles to complete and could be issued at a rate of 1 instruction per cycle, which means you need 3-4 independent operations in flight to hide latency over a critical section. Skylake and newer (and to a lesser degree Haswell) moved toward standardizing as much as possible on a 4-cycle latency, with such instructions able to issue on 2 different ports each cycle. This means you now need 8 (2 per cycle x 4 cycles of latency) rather than 3-4 in flight to hide latency.
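As a rough illustration of what that means in code (my own sketch under those latency/throughput assumptions; nothing in it comes from TensorFlow or Eigen), a reduction written with a single accumulator is limited by the add latency, while splitting it across eight independent accumulators keeps enough adds in flight to keep two ports busy:
Code:
#include <immintrin.h>
#include <cstddef>

// Hypothetical sum over n floats (n assumed to be a multiple of 64 here).
// Eight independent accumulators cover a 4-cycle add latency on a core
// that can start 2 vector adds per cycle.
float sum_avx(const float *x, std::size_t n)
{
    __m256 acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = _mm256_setzero_ps();

    for (std::size_t i = 0; i < n; i += 64)        // 8 accumulators x 8 floats each
        for (int k = 0; k < 8; ++k)
            acc[k] = _mm256_add_ps(acc[k], _mm256_loadu_ps(x + i + 8 * k));

    for (int k = 1; k < 8; ++k)                    // fold the independent chains together
        acc[0] = _mm256_add_ps(acc[0], acc[k]);

    __m128 half = _mm_add_ps(_mm256_castps256_ps128(acc[0]),
                             _mm256_extractf128_ps(acc[0], 1));
    half = _mm_hadd_ps(half, half);
    half = _mm_hadd_ps(half, half);
    return _mm_cvtss_f32(half);
}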

In general it's just annoying to compile for both. Even if you aren't explicitly telling the compiler to generate vector code by way of intrinsics, pragmas, etc. in most of your codebase, you may still need either runtime dispatching or separately compiled versions, simply because of how Intel hardware handles the upper halves of the YMM registers when VEX-encoded and non-VEX code get mixed.

Quoting the relevant portion of an Intel article from around the time Sandy Bridge debuted:
https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
Another performance concern besides unaligned data issues is that mixing legacy XMM-only instructions and newer Intel AVX instructions causes delays, so minimize transitions between VEX-encoded instructions and legacy Intel SSE code. Said another way, do not mix VEX-prefixed instructions and non-VEX-prefixed instructions for optimal throughput. If you must do so, minimize transitions between the two by grouping instructions of the same VEX/non-VEX class. Alternatively, there is no transition penalty if the upper YMM bits are set to zero via VZEROUPPER or VZEROALL, which compilers should automatically insert. This insertion requires an extra instruction, so profiling is recommended.
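Here's a minimal sketch of the transition the article describes (the legacy callee is hypothetical, and compilers targeting AVX normally insert VZEROUPPER at function boundaries on their own, so the explicit intrinsic is only for illustration):
Code:
#include <immintrin.h>

void legacy_sse_routine(float *dst, int n);   // assumed to be compiled without VEX encoding

void scale_then_call_legacy(float *a, const float *b, int n)
{
    for (int i = 0; i + 8 <= n; i += 8)       // 256-bit AVX work dirties the upper YMM halves
        _mm256_storeu_ps(a + i, _mm256_mul_ps(_mm256_loadu_ps(b + i), _mm256_set1_ps(2.0f)));

    _mm256_zeroupper();                       // zero the upper halves -> no VEX/non-VEX transition penalty
    legacy_sse_routine(a, n);                 // hypothetical legacy (non-VEX) SSE code
}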

I installed a prebuilt binary of TensorFlow 1.13.0 on Ubuntu Server 18.04.2 LTS running on a MacPro5,1, and got the following error.
Code:
root@ubuntuserver:~# python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2019-07-04 05:03:17.612831: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)
root@ubuntuserver:~#
I found that the prebuilt binaries of TensorFlow 1.6 and later use AVX instructions.
Since TensorFlow is open-source software, though, I can compile it myself without AVX instructions.

Tensorflow can also be built against the mkl dnn backend. If they did this by default, it wouldn't be hard to support SSE, as mkl-dnn has a runtime dispatch. Granted, it's a really weird one, as it's almost built on top of Xbyak, but user code doesn't really interact with that layer.

https://www.tensorflow.org/guide/performance/overview
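For what it's worth, the general shape of that runtime dispatch can be sketched in a few lines with the GCC/Clang CPU-detection builtins (the kernel functions below are placeholders of mine; mkl-dnn's real mechanism selects Xbyak-generated kernels rather than plain functions):
Code:
#include <cstdio>

static void kernel_avx() { std::puts("AVX kernel"); }   // placeholder
static void kernel_sse() { std::puts("SSE kernel"); }   // placeholder

int main()
{
    __builtin_cpu_init();                     // GCC/Clang builtin; call once before queries
    if (__builtin_cpu_supports("avx"))
        kernel_avx();
    else
        kernel_sse();                         // the path a MacPro5,1's Xeons would take
    return 0;
}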
 

kazkus

macrumors member
May 25, 2019
44
55
San Francisco, CA, USA
Just out of curiosity: why are you running a server with a 1080Ti in it? Are you by any chance Google Stadia? ;)
I have a couple of MacPro5,1s with NVIDIA GPUs, and I used to use them for Machine Learning, especially with a GTX 1080 Ti as the CUDA device.
However, macOS Mojave (and later) does not support those NVIDIA GPUs and CUDA.
Also, I bought 32GB memory modules, but I found that macOS does not support them.
So, I keep macOS Mojave on one of the MacPro5,1s as my desktop machine, replacing the NVIDIA GPU with a Vega Frontier Edition, and installed Linux (Ubuntu Server) on the other MacPro5,1 for Machine Learning, fitted with 192GB (6x 32GB) of memory.

Tensorflow can also be built against the mkl dnn backend. If they did this by default, it wouldn't be hard to support SSE, as mkl-dnn has a runtime dispatch. Granted, it's a really weird one, as it's almost built on top of Xbyak, but user code doesn't really interact with that layer.
Thanks for the info about Intel's MKL DNN.
Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It also does not work when also using --config=cuda.
In addition to providing significant performance improvements for training CNN based models, compiling with the MKL creates a binary that is optimized for AVX and AVX2. The result is a single binary that is optimized and compatible with most modern (post-2011) processors.
I tried to build TensorFlow from source code, but found I could successfully install TensorFlow 1.13.1 with conda install instead.
Anyway, I use TensorFlow with CUDA on a GTX 1080 Ti, so AVX and MKL do not matter for my configuration.

Also, I installed a prebuilt binary of PyTorch 1.1.0, and it worked fine (AVX instructions are not used).
 

thekev

macrumors 604
Aug 5, 2010
7,005
3,343
I have a couple of MacPro5,1s with NVIDIA GPUs, and I used to use them for Machine Learning, especially with a GTX 1080 Ti as the CUDA device.
However, macOS Mojave (and later) does not support those NVIDIA GPUs and CUDA.
Also, I bought 32GB memory modules, but I found that macOS does not support them.
So, I keep macOS Mojave on one of the MacPro5,1s as my desktop machine, replacing the NVIDIA GPU with a Vega Frontier Edition, and installed Linux (Ubuntu Server) on the other MacPro5,1 for Machine Learning, fitted with 192GB (6x 32GB) of memory.

That sounds like a neat setup.

Thanks for the info about Intel's MKL DNN.

No problem. Most of the common machine learning frameworks support it as a back end.

I tried to build TensorFlow from source code, but found I could successfully install TensorFlow 1.13.1 with conda install instead.
Anyway, I use TensorFlow with CUDA on a GTX 1080 Ti, so AVX and MKL do not matter for my configuration.

Also, I installed a prebuilt binary of PyTorch 1.1.0, and it worked fine (AVX instructions are not used).

I haven't tried these, but you can look up the recipes used to build various packages. The link just goes to a repo that aggregates various conda recipes. Conda is just a package manager, so it's not too opaque in that regard. I did this with llvmlite a while ago.

As far as PyTorch is concerned, it supports a few different back ends. They have also been working on their own back end for inference purposes. It does something akin to lowering a network graph directly, which allows for a fairly high level of optimization. They wrote a paper on it as well, which is quite good overall, but I spotted a couple bugs in their reasoning.

Tensorflow could probably have been better integrated with various vector math libraries at an earlier stage than it is today, as it's built on top of Eigen, which has always used various intrinsics in its lower level code sections. That should be enough to make the SIMD details semi-opaque to tensorflow, but Eigen can be a bit moody at times so I'm not completely sure. It tends to redefine a lot of things.
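As a tiny illustration of that layering (my own example, not TensorFlow code), an Eigen array expression like the one below compiles down to Eigen's internal packet loops, and whether those packets end up as 128-bit SSE or 256-bit AVX is decided entirely by the flags the translation unit was built with, which is the part that leaks through to whoever ships the binary.
Code:
#include <Eigen/Dense>

// Hypothetical helper: Eigen vectorizes this expression internally with
// whatever ISA (-msse2, -mavx, ...) it was compiled for.
Eigen::ArrayXf scaled_sum(const Eigen::ArrayXf &a, const Eigen::ArrayXf &b)
{
    return 2.0f * a + b;   // becomes a packet (SIMD) loop inside Eigen
}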

SIMD code generation in itself is also not always straightforward. GCC and Clang have various flags that help with auto-vectorization, as well as support for OpenMP's simd pragma, but most of these don't typically result in optimal assembly. There's the issue of potential aliasing, which has to be considered in any direct translation of C-like languages, in addition to the lack of universally supported unroll pragmas. Efficient simdized code basically requires some amount of unrolling and proper loop ordering to achieve high throughput, since there are a lot of areas where you're lightly penalized for the use of unoptimized simd. For example, loads crossing 2 cache lines incur a penalty, and reordering data incurs additional shuffle instructions, which do not contribute to the arithmetic in any way. I tend to wish the majority of these could be abstracted a bit better at the compiler level. Right now GCC, MSVC, and various forks of Clang often disagree.
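A small sketch of the aliasing point (nothing here comes from the thread, and the exact flags and pragma behavior differ between GCC, Clang, and MSVC as noted above): without the restrict qualifiers the compiler has to assume the arrays may overlap and will either skip vectorization or emit runtime overlap checks.
Code:
// Build with something like -O3 -march=native -fopenmp-simd (GCC/Clang).
void saxpy(float *__restrict y, const float *__restrict x, float a, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];       // no possible aliasing -> a straight vector loop
}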
 

kazkus

macrumors member
May 25, 2019
44
55
San Francisco, CA, USA
...
I haven't tried these, but you can look up the recipes used to build various packages. The link just goes to a repo that aggregates various conda recipes. Conda is just a package manager, so it's not too opaque in that regard. I did this with llvmlite a while ago.

As far as PyTorch is concerned, it supports a few different back ends. They have also been working on their own back end for inference purposes. It does something akin to lowering a network graph directly, which allows for a fairly high level of optimization. They wrote a paper on it as well, which is quite good overall, but I spotted a couple bugs in their reasoning.

Tensorflow could probably have been better integrated with various vector math libraries at an earlier stage than it is today, as it's built on top of Eigen, which has always used various intrinsics in its lower level code sections. That should be enough to make the SIMD details semi-opaque to tensorflow, but Eigen can be a bit moody at times so I'm not completely sure. It tends to redefine a lot of things.

SIMD code generation in itself is also not always straightforward. GCC and Clang have various flags that help with auto-vectorization, as well as support for OpenMP's simd pragma, but most of these don't typically result in optimal assembly. There's the issue of potential aliasing, which has to be considered in any direct translation of C-like languages, in addition to the lack of universally supported unroll pragmas. Efficient simdized code basically requires some amount of unrolling and proper loop ordering to achieve high throughput, since there are a lot of areas where you're lightly penalized for the use of unoptimized simd. For example, loads crossing 2 cache lines incur a penalty, and reordering data incurs additional shuffle instructions, which do not contribute to the arithmetic in any way. I tend to wish the majority of these could be abstracted a bit better at the compiler level. Right now GCC, MSVC, and various forks of Clang often disagree.
Thanks for the interesting information.
Since I have recently been focusing on NLP, especially developing applications in Java and Python, I may need to catch up with these topics in the near future.
 

basslik

macrumors 6502
Feb 22, 2008
462
102
As mentioned in another thread, run your PC as a slave to your Mac under Vienna Ensemble Pro.


This is so not true. At least in the audio world.
Yup, still tracking awesome sessions on this beast. Pro Tools 2024.6
 