macOS SSE 3, SSE 4 and AVX in one application

silvercircle · Apr 6, 2014

How do I support SSE 3, SSE4 and AVX in one application/bundle?

Do I check at launch what options (processor) are supported and then run a specific application from within the bundle? Are there other options to accomplish this? And how can I check which option is supported?

If I select SSE 4.2 on my mid-2010 Mac Pro, the program runs a lot faster than when I select SSE 3. I want to offer the best and fastest option for every user.

gnasher729 · Apr 6, 2014

silvercircle said:
How do I support SSE 3, SSE4 and AVX in one application/bundle?

Do I check at launch what options (processor) are supported and then run a specific application from within the bundle? Are there other options to accomplish this? And how can I check which option is supported?

If I select SSE 4.2 on my mid-2010 Mac Pro, the program runs a lot faster than when I select SSE 3. I want to offer the best and fastest option for every user.

The official way to check what is supported is by calling sysctl. I haven't used code that checks for the CPU type, but as an example:

Code:

		// Requires <sys/types.h> and <sys/sysctl.h>.
		// Get the number of processors, cores, and threads by calling sysctl. If a call to
		// sysctlbyname fails, then assume there is one processor, one core per processor, and one
		// thread per core.
		size_t len;
		unsigned int procCount;
		unsigned int coreCount;
		unsigned int threadCount;
		
		if (sysctlbyname ("hw.packages", &procCount, (len = sizeof (procCount), &len), NULL, 0) != 0)
			procCount = 1;
			
		if (sysctlbyname ("hw.physicalcpu", &coreCount, (len = sizeof (coreCount), &len), NULL, 0) != 0)
			coreCount = procCount;
			
		if (sysctlbyname ("hw.logicalcpu", &threadCount, (len = sizeof (threadCount), &len), NULL, 0) != 0)
			threadCount = coreCount;
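
The same sysctlbyname call can report CPU features rather than core counts. A minimal sketch, assuming the `hw.optional.*` keys that `sysctl hw.optional` lists on Intel Macs (`hw.optional.sse3`, `hw.optional.sse4_2`, `hw.optional.avx1_0`). The names `sysctlFlag`, `SimdLevel` and `detectSimdLevel` are made up for this example, and on non-Apple systems the sketch simply reports the baseline:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Returns true if the named sysctl key exists and holds a nonzero int.
// Assumption: feature flags are exposed as integers under hw.optional.*.
// On non-Apple platforms this sketch always returns false.
bool sysctlFlag(const char *name) {
    assert(name != nullptr);
#if defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, nullptr, 0) != 0)
        return false;  // key missing or unreadable: treat as unsupported
    return value != 0;
#else
    (void)name;
    return false;
#endif
}

// Hypothetical ordering of the levels discussed in this thread.
enum class SimdLevel { Baseline, SSE3, SSE42, AVX };

// Checks from the most capable level downwards and returns the first hit.
SimdLevel detectSimdLevel() {
    if (sysctlFlag("hw.optional.avx1_0")) return SimdLevel::AVX;
    if (sysctlFlag("hw.optional.sse4_2")) return SimdLevel::SSE42;
    if (sysctlFlag("hw.optional.sse3"))   return SimdLevel::SSE3;
    return SimdLevel::Baseline;
}
```

Calling `detectSimdLevel()` once at launch and caching the result should be enough, since the answer cannot change while the program runs.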

I'd probably put the performance critical code into a class (C++ or Objective-C) with subclasses that are compiled with different compiler options, as far as possible compiling identical code, and have some factory method returning an instance of the right class, depending on the processor that you have.
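
A rough sketch of that class-plus-factory idea in C++. Everything here is an assumption for illustration: the `DotKernel` interface, the dot product standing in for the performance-critical code, and the use of the GCC/Clang builtin `__builtin_cpu_supports` instead of sysctl so that the sketch also builds off macOS. In a real project each subclass would sit in its own source file compiled with its own flags (for example `-msse4.2` or `-mavx`); in this single-file version the bodies are identical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Interface for the performance-critical routine.
class DotKernel {
public:
    virtual ~DotKernel() = default;
    virtual const char *name() const = 0;
    virtual float dot(const std::vector<float> &a,
                      const std::vector<float> &b) const = 0;
};

// Shared source for all variants. Built per file with different compiler
// options, the same loop would be vectorized differently in each subclass.
static float dotShared(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

class DotKernelBaseline : public DotKernel {
public:
    const char *name() const override { return "baseline"; }
    float dot(const std::vector<float> &a, const std::vector<float> &b) const override {
        return dotShared(a, b);
    }
};

class DotKernelSSE42 : public DotKernel {
public:
    const char *name() const override { return "sse4.2"; }
    float dot(const std::vector<float> &a, const std::vector<float> &b) const override {
        return dotShared(a, b);
    }
};

class DotKernelAVX : public DotKernel {
public:
    const char *name() const override { return "avx"; }
    float dot(const std::vector<float> &a, const std::vector<float> &b) const override {
        return dotShared(a, b);
    }
};

// Factory: returns the most capable variant the running CPU supports.
std::unique_ptr<DotKernel> makeDotKernel() {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx"))
        return std::unique_ptr<DotKernel>(new DotKernelAVX());
    if (__builtin_cpu_supports("sse4.2"))
        return std::unique_ptr<DotKernel>(new DotKernelSSE42());
#endif
    return std::unique_ptr<DotKernel>(new DotKernelBaseline());
}
```

The rest of the application only ever sees `DotKernel`, so adding another variant later means one more subclass and one more line in the factory.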

MorphingDragon · Apr 6, 2014

If you're doing SIMD via intrinsics or assembly, the usual approach is to have multiple code paths for the program kernels that require SIMD, then choose the code path you need at runtime. More advanced applications use runtime code generation. As gnasher mentioned, this is usually an application-layer class that abstracts away the details.

Code is untested; consider it C-style pseudocode.

Code:

void Kernel_SSE3(args) {
   // SSE3 code
}

void Kernel_SSE4(args) {
   // SSE4 code
}

void Kernel_AVX(args) {
   // AVX Code
}

void Kernel_FMA(args) {
  // FMA code
}

// Pointer to whichever kernel was selected at startup
void (*g_KernelFunction)(args) = nullptr;

int main(...) {
    int simdType = GetSIMDType(ReadProc());
    switch (simdType) {
        case SSE3:
            g_KernelFunction = Kernel_SSE3;
            break;

        // etc. for SSE4, AVX and FMA
    }
}
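
For anyone who wants something that compiles, here is a hedged sketch of the same function-pointer dispatch. It assumes GCC or Clang on x86 (the `target` function attribute and `__builtin_cpu_supports` are compiler extensions, not part of the thread's pseudocode), uses a trivial vector add as the kernel, and falls back to scalar code everywhere else. All function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_DISPATCH 1
#include <immintrin.h>
#endif

using AddKernel = void (*)(const float *a, const float *b, float *out, std::size_t n);

// Portable fallback, also used to finish the loop tails below.
static void addScalar(const float *a, const float *b, float *out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#ifdef HAVE_X86_DISPATCH
// Built for the SSE3 level; the add itself only needs 128-bit SSE operations.
// Processes 4 floats per iteration.
__attribute__((target("sse3")))
static void addSSE3(const float *a, const float *b, float *out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    addScalar(a + i, b + i, out + i, n - i);
}

// Built for AVX; processes 8 floats per iteration with 256-bit registers.
__attribute__((target("avx")))
static void addAVX(const float *a, const float *b, float *out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    addScalar(a + i, b + i, out + i, n - i);
}
#endif

// Chooses the code path once; callers keep the returned pointer.
AddKernel selectAddKernel() {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx"))  return addAVX;
    if (__builtin_cpu_supports("sse3")) return addSSE3;
#endif
    return addScalar;
}
```

Typical use would be to store the result of `selectAddKernel()` in a global such as `g_KernelFunction` at startup and call through it from then on.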

If you're letting the compiler do the SIMD code generation: A) most compilers don't give you that much granularity, at least not easily; B) don't rely on the compiler to output SIMD code. Hand-written code and your brain are much better at that kind of optimization. Even the Intel compiler is terrible at SIMD optimization because it's impossible to get the necessary context at compile time.

Dranix · Apr 6, 2014

Honestly, why care? All currently supported CPUs have at least SSE4.1, so simply compile for it.

Or if you care, you could simply use OpenCL in CPU mode. The OpenCL compiler generates extremely nice SSE code.

subsonix · Apr 6, 2014

Depending on what you are doing, look into Apple's Accelerate framework. It will pick the best option for whatever hardware you are running on, across all systems.
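
As a small illustration, assuming the work can be expressed as one of the vDSP routines, an element-wise add might look like the sketch below. `addVectors` is a made-up wrapper name; `vDSP_vadd` is the Accelerate call, and the plain loop is only there so the sketch still builds on non-Apple systems.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

// Adds two equally sized float vectors element by element.
std::vector<float> addVectors(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
#if defined(__APPLE__)
    // Stride 1 for all three arrays; the framework picks the instruction set.
    vDSP_vadd(a.data(), 1, b.data(), 1, out.data(), 1,
              static_cast<vDSP_Length>(a.size()));
#else
    // Fallback for platforms without Accelerate.
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
#endif
    return out;
}
```

On macOS this needs to be linked with `-framework Accelerate`.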

MorphingDragon · Apr 6, 2014

Dranix said:
Or if you care, you could simply use OpenCL in CPU mode. The OpenCL compiler generates extremely nice SSE code.

Not always.

AFAIK, OpenCL can't tell if there's array aliasing so only vector arithmetic is sped up, not loop optimization. There are some other issues like memory alignment.

It depends on what he's trying to achieve. You shouldn't use loops in OpenCL anyway as it may run on the GPU if you just use the default device.
