And if you can ELI5 - what is the difference between an ANE core and GPU core? It looks like they're both optimized for parallel processing, is one better than the other at vector/matrix math? Is there overlap?
That's a good question that is very difficult to answer properly (and it doesn't help that we don't really know many details about how the ANE actually works). One major difference is that the GPU is expected to wear many hats (from UI compositing to 3D graphics to flexible compute), while the ANE is only expected to do one operation, and do it well. Just being good at "parallel processing" is not enough, since parallel processing can mean very different things.
Historically, GPUs have been designed to run the same program on multiple different data items at the same time, while maintaining the illusion that the items are still independent (e.g. you write a program that computes a + b, but the hardware actually does a[0] + b[0], a[1] + b[1] ... a[n] + b[n] simultaneously). This is called SIMT (single instruction, multiple threads), which allows for a neat programming model and works very well for typical graphics and compute tasks.

But when you want to do matrix multiplication (or more specifically, convolution), the items are not really independent, as you have to shuffle and accumulate the results of data-parallel multiplications. Doing this efficiently requires you to break the SIMT model somewhat (that's why Nvidia calls this "cooperative matrix multiplication" — because the GPU "threads" are cooperating to perform a single matrix operation, as opposed to the usual "every thread for itself"). And while GPUs have recently been gaining cooperative capabilities (moving data between threads in the same SIMD group, etc.), there are always tradeoffs.

Dedicated matrix multiplication/convolution hardware will probably always be more efficient, simply because it doesn't care about the flexibility and execution capabilities that a general-purpose GPU must provide. And it appears that the ANE is exactly this kind of hardware (from what I understand, Nvidia uses a sort of hybrid solution with their Tensor cores, combining the general-purpose GPU processing units with some matmul-optimized hardware, but I have no idea how they do it exactly).
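To make the SIMT vs. cooperative distinction a bit more concrete, here's a rough sketch in CUDA (Nvidia's terminology, since that's where "cooperative matrix multiplication" comes from; the kernel names and tile sizes are just illustrative and have nothing to do with how Apple's hardware is actually programmed):

```cuda
#include <cuda_fp16.h>
#include <mma.h>

// Classic SIMT: the kernel reads like scalar code for "one" element,
// but the hardware runs thousands of independent copies of it in lockstep.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks its own element
    if (i < n)
        c[i] = a[i] + b[i];  // written as "a + b", executed as a[i] + b[i] for all i at once
}

// Cooperative matrix multiply: the 32 threads of a warp stop pretending to be
// independent and jointly own one 16x16x16 tile of the product. No single
// thread "has" the result; the fragments are distributed across the warp.
using namespace nvcuda;
__global__ void tile_matmul(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);     // the whole warp loads the tile together
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);  // warp-wide multiply-accumulate on Tensor cores
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}
```

The second kernel is already a partial break from pure SIMT, and it still has to live inside a general-purpose GPU pipeline. A fixed-function block like the ANE presumably goes further and just wires up the multiply-accumulate arrays directly, without the per-thread machinery at all.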