@iDron - why does the GPU get utilised when your code tells TF not to use it?
I am not sure. I googled and found people reporting similar behaviour on other devices. I wonder whether `tf.device` is not a hard switch but rather a hint that tells TF which device to prioritize. Another explanation could be that some optimization is happening at the OS level.
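One way to check this, assuming TF 2.x, is to log device placement and disable soft placement; this is just a quick sketch to see where ops actually land, not anything specific to tensorflow-metal:

```python
import tensorflow as tf

# Log which device each op actually runs on, to verify whether
# the tf.device() hint is being honoured.
tf.debugging.set_log_device_placement(True)

# Fail loudly instead of silently falling back to another device
# when the requested placement is not possible.
tf.config.set_soft_device_placement(False)

with tf.device('/CPU:0'):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)  # the placement log shows where this ran
```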
Okay, here is another little benchmark I did, looking at small batch sizes:
batch size | GPU (tf-metal) | CPU |
---|---|---|
4 | 70s | 62s |
6 | 50s | 52s |
7 | 55s | 49s |
8 | 51s | 46s |
9 | 36s | 44s |
12 | 89s | 41s |
18 | 66s | 37s |
36 | 39s | 32s |
44 | 34s | 31s |
66 | 27s | 29s |
132 | 16s | 28s |
196 | 14s | 28s |
396 | 12s | 29s |
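For context, a minimal sketch of how such a batch-size sweep can be timed; the toy model and random data below are placeholders, not my actual training setup:

```python
import time
import numpy as np
import tensorflow as tf

# Toy data standing in for the real dataset (placeholder only).
x_train = np.random.rand(3600, 64).astype("float32")
y_train = np.random.randint(0, 10, size=(3600,))

def time_one_epoch(batch_size):
    # Small placeholder model; the shape of the curve, not the
    # absolute times, is what the sweep is meant to show.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    start = time.perf_counter()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    return time.perf_counter() - start

for bs in [4, 6, 7, 8, 9, 12, 18, 36, 44, 66, 132, 196, 396]:
    print(f"batch {bs}: {time_one_epoch(bs):.1f}s")
```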
At a batch size of 9, running tf-metal is actually faster than standard TF. However, in the range just above that (e.g. 12 or 18) it is twice as slow. Looking at GPU power gives some insight: at a batch size of 9, GPU power is at 4.5W. Once I set a batch size of 10-20, it hovers at about 2W.
This is just speculation, but my attempt to understand it: 9 is one below the CPU core count of 10, so I wonder if the OS allocates one efficiency core to system tasks and gives TF the remaining 9, which it can then use quite efficiently for overhead tasks and input pipelining at batch sizes up to 9. If each core has to handle more than one sample of the batch and feed it into the GPU, this might become very inefficient at first, unless the batch size is sufficiently big (giving each CPU core tens or hundreds of samples to feed into the GPU).
GPU power only comes back to a similar level of about 5W at a batch size of around 60. The hard drop at a batch size of 10 is interesting. As batch sizes get bigger, as reported above, GPU power in my case goes up to 12W. The CPU seems to max out at about 30W in my experiments, which matches what has been reported as the full power of the CPU elsewhere:
https://www.notebookcheck.net/Apple-M1-Pro-Processor-Benchmarks-and-Specs.579915.0.html. This means TF is maxing out the CPU easily. On the one hand it is good to see the hardware actually being put to use; on the other hand it shows the necessity of a GPU. Finding the maximum GPU power of the M1 Pro is a bit difficult. Notebookcheck reports 15W
https://www.notebookcheck.net/Apple-M1-Pro-14-Core-GPU-GPU-Benchmarks-and-Specs.576651.0.html , while both AnandTech and this benchmark on GitHub
https://github.com/tlkh/tf-metal-experiments/blob/main/README.md#experiments-and-benchmarks report 43W for the M1 Max, which I then assume would be about 20W for the M1 Pro (guessing that the M1 Max's faster memory eats up a bit more power). However, I also ran a TFLOPS calculation benchmark from that GitHub repo, and the GPU power on my machine actually went up to 27W. The benchmark (matrix multiplication in TensorFlow) gave a throughput of 4.1 TFLOPS.
This means I'm using only about 45% of the GPU's power (12W of 27W) at large batch sizes in my experiment.
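For reference, a rough matmul throughput estimate can be done along these lines (my own simplified sketch in the spirit of the linked repo, not its actual code):

```python
import time
import tensorflow as tf

N = 4096       # matrix size
ITERS = 50     # number of timed multiplications

a = tf.random.normal((N, N))
b = tf.random.normal((N, N))

# Warm-up so first-run overhead is not timed.
tf.matmul(a, b)

start = time.perf_counter()
for _ in range(ITERS):
    c = tf.matmul(a, b)
_ = c.numpy()  # force execution to finish before stopping the clock
elapsed = time.perf_counter() - start

flops = 2 * N**3 * ITERS  # ~2*N^3 floating-point ops per N x N matmul
print(f"{flops / elapsed / 1e12:.2f} TFLOPS")
```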
What do you guys think? Is this (the drop in GPU usage at a batch size of 10) normal and expected, or actually something to report to Apple, and if so, how?