The MacBook Pro 2018 has two GPUs:
an AMD Radeon Pro 555X and an Intel(R) UHD Graphics 630.
I assumed the AMD 555X would clearly outperform the Intel(R) UHD Graphics 630.

However, I observed a huge performance difference for Metal Performance Shaders (MPS) between the two GPUs.
The Intel GPU runs the simple test code (an MPSMatrixMultiplication) 3 times faster than the AMD 555X.

You can compile the attached code in a Terminal by 'swiftc -O matrixMul.swift'
and run it by executing './matrixMul'

In the test code, I can select execution on the AMD 555X with the statement
let device = devices[0] // AMD Radeon Pro 555X

and I get the following:

start calculation on GPU-device <BronzeMtlDevice: 0x1071bf000>
name = AMD Radeon Pro 555X
...
GPU execution time = 12.612 seconds

The Intel(R) UHD Graphics 630 is selected by
let device = devices[1] // Intel(R) UHD Graphics 630

and I get

start calculation on GPU-device <MTLIGAccelDevice: 0x10f9c5000>
name = Intel(R) UHD Graphics 630
...
GPU execution time = 3.735 seconds

As you can see, the Intel UHD 630 performed the MPSMatrixMultiplication 3 times faster than the AMD 555X.
I thought the AMD 555X would be more powerful than the Intel UHD 630, but this test shows the opposite.
I wonder why. Any ideas?

-------------------- test code
import Metal
import Accelerate
import MetalPerformanceShaders

let devices = MTLCopyAllDevices()
print("available GPUs")
for d in devices {
    print(d)
}

// select one of the two GPUs by commenting out one of the two lines
let device = devices[0] // AMD Radeon Pro 555X
//let device = devices[1] // Intel(R) UHD Graphics 630

// commandQueue and commandBuffer
let commandQueue = device.makeCommandQueue()!;
let commandBuffer = commandQueue.makeCommandBuffer()!;

// Matrix dimensions
let n = 8192 // matrix dimension (n x n)
let rowsA = n
let columnsA = n
let rowsB = n
let columnsB = n
let rowsC = n
let columnsC = n

// matrix A data
var arrayA = [Float](repeating: 0, count: rowsA * columnsA)
for i in 0..<arrayA.count { // set random data
    arrayA[i] = Float(2 * drand48() - 1)
}

// matrix B data
var arrayB = [Float](repeating: 0, count: rowsB * columnsB)
for i in 0..<arrayB.count { // set random data
    arrayB[i] = Float(2 * drand48() - 1)
}

// MTL data buffers for Matrices A,B,C
let bufferA = device.makeBuffer(bytes: arrayA,
length: rowsA * columnsA * MemoryLayout<Float>.stride,
options: [])!;

let bufferB = device.makeBuffer(bytes: arrayB,
length: rowsB * columnsB * MemoryLayout<Float>.stride,
options: [])!;

let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride,
options: [])!;

// Matrix descriptions
let descA = MPSMatrixDescriptor(dimensions: rowsA, columns: columnsA,
rowBytes: columnsA * MemoryLayout<Float>.stride,
dataType: .float32);

let descB = MPSMatrixDescriptor(dimensions: rowsB, columns: columnsB,
rowBytes: columnsB * MemoryLayout<Float>.stride,
dataType: .float32);

let descC = MPSMatrixDescriptor(dimensions: rowsC, columns: columnsC,
rowBytes: columnsC * MemoryLayout<Float>.stride,
dataType: .float32);

// MTL matrix buffers
let matrixA = MPSMatrix(buffer: bufferA, descriptor: descA);
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descB);
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descC);

let matrixMultiplication = MPSMatrixMultiplication(device: device,
transposeLeft: false, transposeRight: false,
resultRows: rowsC, resultColumns: columnsC,
interiorColumns: columnsA, alpha: 1, beta: 0);

matrixMultiplication.encode(commandBuffer: commandBuffer, leftMatrix: matrixA,
rightMatrix: matrixB, resultMatrix: matrixC);
print("start calculation on GPU-device \(device)")

let start = DispatchTime.now().uptimeNanoseconds;
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let end = DispatchTime.now().uptimeNanoseconds
let execTime = String(format: "%.3f", 1e-9 * Double(end - start))

// we look at the result
let rawPointer = matrixC.data.contents();
let count = matrixC.rows * matrixC.columns;
let typedPointer = rawPointer.bindMemory(to: Float.self, capacity: count);
let bufferedPointer = UnsafeBufferPointer(start: typedPointer, count: count);

// Print the first 5 results, to make sure it's not all 0s or NaNs.
print("\nFirst 5 elements:")
for i in 0..<5 {
    print("element \(i) =", bufferedPointer[i]);
}
print("...")
print("last element =", bufferedPointer[n * n - 1]);
print("...")
print("GPU execution time = \(execTime) seconds")
exit(0)
------------------ end test-code
 
Because there aren't enough calculations for the speed of the GPU to be the limiting factor here. You create all that data on the CPU, storing it in main memory. Then you give the GPU a pointer to the memory, ask it to copy it into its own VRAM, and copy the result back over to system memory when it's done. All those memory operations take a whoooooole lot longer than the calculation. The iGPU has direct access to system memory. The Radeon is still many times faster, but when the cost is so heavily skewed towards the memory transfers rather than the calculation, it changes the dynamic.
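As an aside, you can see which device is which from Metal itself; a quick sketch (isLowPower has been around for a while, hasUnifiedMemory needs macOS 10.15):

Code:
import Metal

for d in MTLCopyAllDevices() {
    // The iGPU shows up as the low-power device and works directly on system memory.
    print(d.name,
          "| low power:", d.isLowPower,
          "| unified memory:", d.hasUnifiedMemory) // hasUnifiedMemory: macOS 10.15+
}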
 
Thank you for the explanation. Sounds reasonable to me.
I assumed it was something like this and tried to avoid the copy procedure with
let bufferA = device.makeBuffer(bytesNoCopy: arrayA!, ...
But the result was the same. Do you know some way to measure the GPU calculation-time without the amount of time needed for data-transfer?
 
let bufferA = device.makeBuffer(bytesNoCopy: arrayA!, ...

You can't avoid copying the data onto the GPU, because the data is created in system memory by the CPU. You can, however, avoid copying the resulting matrix back to the CPU if it's used for further calculations on the GPU before returning to the CPU.

But the result was the same. Do you know some way to measure the GPU calculation-time without the amount of time needed for data-transfer?

Afraid not :/
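Actually, thinking about it, one thing that might get you part of the way, assuming you're on macOS 10.15: MTLCommandBuffer exposes GPU-side timestamps, so you can compare the GPU's own busy time against your wall-clock measurement. A rough sketch, reusing the commandBuffer from your test code:

Code:
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Wall-clock time (what you measure now) includes encoding, scheduling and any
// driver-side work; the GPU timestamps only cover the interval the GPU actually
// spent executing this command buffer.
let gpuSeconds = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime                 // macOS 10.15+
let schedulingSeconds = commandBuffer.kernelEndTime - commandBuffer.kernelStartTime    // CPU-side scheduling
print("GPU time: \(gpuSeconds) s, CPU scheduling time: \(schedulingSeconds) s")

I'm not sure how much of the implicit data migration ends up inside that GPU interval, but it at least strips out the CPU-side setup.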
 
Quantitatively, the results still look strange to me:
The two matrices are 8192 x 8192 in size. Thus, the amount of memory needed to store one matrix (Float32) is 0.268 GB. Matrix A, matrix B and the result matrix C are transferred to and from the GPU, which is roughly 0.8 GB of data in total. On my old cMP (mid 2010) with two M.2 PCIe storage drives, I'm able to copy 1.5 GB from one M.2 drive to the other in about 1 second. I would assume that the data rate to and from the GPU on a modern MacBook Pro is at least of the same order; let's say 1 GB/s. However, the difference in execution time of the matrix multiplication between the iGPU (let's assume no time for data transfer) and the Radeon 555X is about 8.5 seconds. This would mean that the data transfer of roughly 0.8 GB to the Radeon 555X takes at least 8.5 seconds, which is many times slower than an M.2 disk copy over PCIe on an outdated Mac (mid 2010). From this, I would not expect much more than 1-2 seconds for the data transfer to the Radeon 555X.

Furthermore, I did a second test with a matrix-vector multiplication shader (MPSMatrixVectorMultiplication). In this example, a vector of length 8192 and one matrix of 0.268 GB are transferred to the GPU, i.e. only about one third of the amount of data transferred in the previous example. However, the execution time is 0.08 seconds for the Radeon 555X and 0.04 seconds for the iGPU. Of course the number of floating-point operations is much smaller in this case; it is only of the order of 8192^2, while in the previous example of the matrix multiplication it was of the order of 8192^3.

According to these observations, I'm not sure the performance difference between the iGPU and the Radeon 555X can be explained by data transfer alone. There must be another bottleneck as well.

There is also another MPS shader called MPSMatrixSum. Basically it sums up two matrices. However, there is no description of the various parameters in the Apple Developer documentation, so I have not succeeded in using it yet. It would be a more direct comparison, since it calculates C = A + B: as in the case of the matrix multiplication (C = A * B), the amount of data transferred to and from the GPU would be exactly the same (about 0.8 GB), but the number of floating-point operations is only of the order of 8192^2. So I guess it would mainly measure the data-transfer time.

Do you know how to use MPSMatrixSum?
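Failing that, I guess a hand-written compute kernel would do for the same C = A + B experiment, since the arithmetic is trivial anyway. An untested sketch, reusing device, bufferA, bufferB, bufferC and n from the code above:

Code:
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void addMatrices(const device float *a [[buffer(0)]],
                        const device float *b [[buffer(1)]],
                        device float       *c [[buffer(2)]],
                        uint id [[thread_position_in_grid]])
{
    c[id] = a[id] + b[id];
}
"""

let library  = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "addMatrices")!)

let queue   = device.makeCommandQueue()!
let buffer  = queue.makeCommandBuffer()!
let encoder = buffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferC, offset: 0, index: 2)
// One thread per matrix element (n * n in total).
encoder.dispatchThreads(MTLSize(width: n * n, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: pipeline.maxTotalThreadsPerThreadgroup,
                                                       height: 1, depth: 1))
encoder.endEncoding()
buffer.commit()
buffer.waitUntilCompleted()

That would move roughly the same amount of data as the multiplication but only do ~8192^2 additions, so the timing should be dominated by the transfer.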
 
Interesting. I must admit I didn't really look much at the numbers, I just made my statement about the transfer bottleneck based on past testing.

Regarding the use of MPSMatrixSum... Well, I have used it once... But it was the very first time I ever wrote anything to run on the GPU, it was for testing, and it is ugly as all hell. - I wasn't exactly a very good programmer back then. You can have it if you want, but I don't know how easy it'll be to read. If you have questions about it you can ask and I'll see if I can figure out what I was trying to do back then. - Also note it was re-written many times back then, trying out different combinations of threading and memory-management models on GPUs with and without VRAM, so God knows what state it is in right now. It didn't compile when I first dug it up just now; all I did was make it compile and run once, without properly having a look at it. One thing I did just notice is that in this variant of the mess I wasn't actually summing a matrix, but summing a ton of single numbers, each represented as a 1x1 matrix to make it work. God knows what I was doing back then. It looks to be a sum of the numbers 1 through 10,000.
The bulk of the execution time, when I ran it just to see that it at least completes, seems to be the CPU setting up the data the way it's written now, with GPU execution being around 18 seconds for me. Last disclaimer: it's ***** code, just meant for personally testing how MPSMatrixSum worked years ago.


Assuming the MacRumors Forums allow it, Swift file attached. Or rather a playground file, but two sides of the same coin.
 

Attachments

  • Playtime.playground.zip
    10.8 KB · Views: 330
Thanks a lot for the example code!
I think it is all about using the right storage modes and proper synchronisation between GPU and CPU.
In order to keep data traffic as low as possible, it is best to use MTLStorageMode.managed!
According to the Apple Developer Documentation: "The CPU and GPU may maintain separate copies of the resource, and any changes must be explicitly synchronized." ..."In macOS, this is the default storage mode for MTLTexture objects. In iOS and tvOS, the managed storage mode is not available."

Obviously it is not the default for the MPS objects.
So, by just making the buffers managed i.e.

let bufferA = device.makeBuffer(bytes: arrayA.baseAddress!,
length: rowsA * rowBytesA, options: [.storageModeManaged])!

instead of

let bufferA = device.makeBuffer(bytes: arrayA.baseAddress!,
length: rowsA * rowBytesA, options: [])!

and synching the result buffer with blitEncoder.synchronize(resource: bufferC), the code runs much faster now. On the Radeon 555X I achieved 1.68 seconds (0.65 Teraflops). On my cMP mid 2010 with a much more powerful Vega 56 (yes, old hardware can still be used), the code runs in 0.44 seconds, resulting in a mean floating-point performance of 2.5 Teraflops.
However, the iGPU does not support MTLStorageMode.managed, so I could not run it there for comparison.
I think the 3.7 seconds with the previous code is what you can get out of it, i.e. it is about a factor of 2 slower than the 555X.

I streamlined the code a bit, including reporting the floating-point performance. If you want to try it on your GPU, please see the attached file. It would be interesting to see what can be achieved on more recent GPUs like the Radeon VII.
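For reference, the relevant change boils down to something like this (just a sketch of the idea; the complete version is in the attached file):

Code:
// Result buffer in managed storage: CPU and GPU keep separate copies.
let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride,
                                options: [.storageModeManaged])!

// ... set up matrixA/matrixB/matrixC and the MPSMatrixMultiplication as before ...
matrixMultiplication.encode(commandBuffer: commandBuffer, leftMatrix: matrixA,
                            rightMatrix: matrixB, resultMatrix: matrixC)

// Ask the GPU to push its copy of the result back to the CPU-visible copy,
// encoded into the same command buffer, before committing.
let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
blitEncoder.synchronize(resource: bufferC)
blitEncoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()
// Only now does matrixC.data.contents() hold the synchronized result.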
 

Attachments

  • matrixMul.swift.zip
    1.5 KB · Views: 308
Right. Glad to hear that it seems like my initial conjecture about the issue wasn't too far off. The managed memory mode makes no sense on the iGPU, as its memory is always synchronised with main memory anyway, since... well, its memory IS main memory, unlike the dGPUs that have their own VRAM. Thus the memory modes shouldn't make a difference on the iGPU.
When I wrote the code I linked to you originally I remember struggling with the synchronisation. It'd run fine on my MBP with iGPU and give unexpected results on my iMac, because I used a wrong sync command back then :p.

I'll try it out, but the only GPU I have available is an Iris Pro.
I don't really have time to look through it now, but I get an error on the command buffer on my iGPU. It should still run with managed memory just with no difference in performance. If you want I can try and debug it later.
 
For iGPU use, just delete the .storageModeManaged option in the declaration of bufferA, bufferB and bufferC.
Furthermore, you can comment out the statement blitEncoder.synchronize(resource: bufferC)

Then the code should be basically the same as on my original post. Or just take that one.
 
It definitely should be possible running it on an iGPU with the storageModeManaged set though. Done that before. It will not act any different, but it's possible. Something else is going on as well
It also looks like you call the synchronise command at a weird time. You haven't committed your command buffer or waited on completion when you try to sync the result.
You've also accidentally written colsC where it should be rowsC, though they're both n so it's only about readability
 
And lastly, even after removing storageModeManaged and the blitEncoder, it still fails at runtime, with
-[MTLIGAccelCommandBuffer computeCommandEncoder]:317: failed assertion `Already have uncommitted encoder'

Definitely something else going on, but don't have more time to look at it today.
 
The error comes from the statement

let blitEncoder = commandBuffer.makeBlitCommandEncoder()!

You have to remove this as well.
Please find attached the code modified explicitly to run on the iGPU.

Because my MacBook Pro has 2 GPUs, I have to replace the

let device = MTLCreateSystemDefaultDevice()!

by forcing the selection of the iGPU 'by hand' with

let devices = MTLCopyAllDevices() // get all available devices
let device = devices[1] // selection of iGPU

If you have the iGPU only, you can use 'let device = MTLCreateSystemDefaultDevice()!' instead.
 

Attachments

  • matrixMuliGPU.swift.zip
    1.6 KB · Views: 256
Just listing the results obtained on the various GPUs:

MacBook Pro 2018 iGPU (using code matrixMuliGPU.swift):
Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on Intel(R) UHD Graphics 630
...
Values in matrix C = A * B: 16384.0 uniformly
1'099'444'518'912 floating point operations performed
Elapsed GPU time = 3.542 seconds -> 0.310 Teraflops


MacBook Pro 2018 (using code matrixMul.swift, i.e. managed buffers):
Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on AMD Radeon Pro 555X
...
Values in matrix C = A * B: 16384.0 uniformly
1'099'444'518'912 floating point operations performed
Elapsed GPU time = 1.673 seconds -> 0.657 Teraflops

MacPro mid 2010, Vega 56 (using code matrixMul.swift, i.e. managed buffers):
Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on AMD Radeon RX Vega 56
...
Values in matrix C = A * B: 16384.0 uniformly
1'099'444'518'912 floating point operations performed
Elapsed GPU time = 0.437 seconds -> 2.516 Teraflops
 
The error comes from the statement

let blitEncoder = commandBuffer.makeBlitCommandEncoder()!

Why did that cause issues? You should be able to create a blitEncoder on an iGPU as well.

Anyways, even doing all that it still reports:

Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on Intel Iris Pro Graphics
...
Error: Inconsistent calculation results

Which is certainly a new kind of issue - With your very first version I made it run properly.

PS.
Just to be clear, for the resulting matrix I get 16384.0 for the first 37887967 entries. After that it is just 0.0.
 
'Error: Inconsistent calculation results' is a message created in my code.

Look at the last few statements:
-------------------------------------
// Check consistency of resulting matrix
var ok = true
for i in 1..<nC {
    if result[i] != result[0] {
        ok = false
    }
}
if (ok) {
    print("Values in matrix C = A * B: \(result[0]) uniformly")
    let fops = getFops(matrixDim : n)
    let tFlops = getTflops(nFP: fops, time: elapsedTime)
    print(numberFormatter.string(for: fops) ?? "", "floating point operations performed")
    print("Elapsed GPU time = \(elapsedTime) seconds -> \(tFlops) Teraflops")
}
else {
    print("Error: Inconsistent calculation results")
}
--------------------------------------------

As you can see, the result is checked for consistency, and the error is reported if any number in the result matrix deviates from the first one. The Float32 numbers are checked for exact equality, which might be a bit too strict. Maybe it is better to print a few results or to print just a warning:

Replace the above code by something like this:
-----------------------------------------------------
// Check consistency of resulting matrix
var ok = true
for i in 1..<nC {
    if (abs(result[i] - result[0]) > 1e-4) {
        ok = false
    }
}
if (!ok) {
    print("Warning: Inaccurate results! Check results in more detail")
    for i in 0..<10 {
        print("element \(i) =", result[i])
    }
    print("last element = ", result[nC - 1])
}
else {
    print("Values in matrix C = A * B: \(result[0]) uniformly")
}
let fops = getFops(matrixDim : n)
let tFlops = getTflops(nFP: fops, time: elapsedTime)
print(numberFormatter.string(for: fops) ?? "", "floating point operations performed")
print("Elapsed GPU time = \(elapsedTime) seconds -> \(tFlops) Teraflops")
------------------------------------
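(getFops and getTflops are just small helpers; the exact versions are in the attached file, but they should be roughly equivalent to this:)

Code:
// n x n times n x n multiply: each of the n * n output elements needs
// n multiplications and n - 1 additions.
func getFops(matrixDim n: Int) -> Int {
    return n * n * (2 * n - 1)
}

func getTflops(nFP: Int, time: Double) -> Double {
    return 1e-12 * Double(nFP) / time
}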

Now you should at least get the timing and some idea of what is wrong with the resulting numbers.

The forum's copy-paste apparently removed some brackets when I posted the code: in several statements the [i] index went missing, i.e. result instead of result[i]. Don't know why.

Better I attach the modified code...
 

Attachments

  • matrixMuliGPU.swift.zip
    1.7 KB · Views: 281
I think your VRAM is too small for n = 8192.
Try n = 4096. I would assume you will get consistent results then.
 

This is an iGPU, using system memory for VRAM. I don't really see it being limited in that way, unless there's a seemingly very low, arbitrary cutoff for the amount of memory the iGPU is allowed to use. But then it's peculiar that it functions properly on your iGPU.
 
What did you get for n = 4096?
Are all numbers ok in this case?
The About this Mac reports some size of VRAM for the iGPU as well (See screenshot).
Maybe there is some limit...
What is displayed for your iGPU in the 'About this Mac' Info?
 

Attachments

  • AboutThisMac.png
    AboutThisMac.png
    280.4 KB · Views: 282
What did you get for n = 4096?
Are all numbers ok in this case?

Yes

Code:
Values in matrix C = A * B: 8192.0 uniformly
137'422'176'256 floating point operations performed
Elapsed GPU time = 1.204 seconds -> 0.114 Teraflops


The About this Mac reports some size of VRAM for the iGPU as well (See screenshot).
Maybe there is some limit...
What is displayed for your iGPU in the 'About this Mac' Info?

Same as yours, 1536 MB; but AFAIK that's not supposed to be a hard limit, just an initial partitioning that's adjusted as needed.
 

Attachments

  • 1575647904016.png
    1575647904016.png
    413.1 KB · Views: 216
What is reported under System Report -> Graphics/Displays?
On my MacBook as well as on my cMP (Vega 56) there is

Metal: Supported, feature set macOS GPUFamily2 v1

Maybe your Polaris iGPU is GPUFamily1?
There are different Metal GPU Feature-Set tables for the different GPU families and maybe different max buffer-size limits...
 

GPUFamily1 v4. But I think that's because I have stuck with Mojave for now. At a WWDC tech talk, Apple mentioned there still being only one GPU family on the Mac, with separate families only on iOS; though perhaps that wasn't this year, and that information is out of date now.

Small correction btw; No iGPU in a Mac is Polaris. That'd require an AMD APU or the Intel/AMD G chip. It's an Intel Iris Pro Crystal Well chip with Gen graphics.
 
On my MacPro 2010 with the Vega 56, running Mojave 10.14.6, the AMD Vega 56 is 'macOS GPUFamily2 v1'.
On the MacBook Pro 2018 with the AMD 555X, running Catalina 10.15.1, the 555X is also reported as 'macOS GPUFamily2 v1'. I think the GPU capability is reported as it is and does not depend on the OS version.

There are individual feature sets for the different GPU families, and the maximum MTLBuffer size may be one of these per-family specifications. For your GPU, just google MTLFeatureSet.macOS_GPUFamily1_v4.
Of course, the limited MTLBuffer size can be circumvented by splitting the data across multiple buffers.
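You can also just ask the device directly instead of digging through the feature-set tables; a quick sketch (assuming macOS 10.14 or later for maxBufferLength):

Code:
import Metal

for d in MTLCopyAllDevices() {
    print(d.name)
    print("  max buffer length:           \(d.maxBufferLength / (1024 * 1024)) MB")              // macOS 10.14+
    print("  recommended max working set: \(d.recommendedMaxWorkingSetSize / (1024 * 1024)) MB") // macOS only
}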
 

I stand corrected.

Definitely in part related to the OS version however, since if you go back far enough, the OS won't know about certain feature levels even if your GPU supports them. I.e. initially, when Metal was first introduced, mine was Family1 v1. Well, I don't even think there was a v# attached; v2, v3 and v4 came along later.

But yeah, fair enough, the feature set wasn't what I thought it was at least.
 