I think the latency has to be higher when accessing RAM connected to the other chip. It’s gotta pass through the crossbar, which has to be latched on both sides, so it’s going to take at least a couple cycles longer.
Pretty sure the way it will work is: chip A passes the address through the crossbar to chip B (adding at least a cycle or two), chip B fetches the data, then chip B passes the data back through the crossbar to chip A (adding at least another couple of cycles).
Due to clock skew, you may have another cycle wasted in each direction. And, of course, chip B may be busy fetching its own data (though each chip seems to have ample bandwidth to its own memory), which can also cause an indeterminate delay.
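To put rough numbers on that round trip, here's a back-of-the-envelope sketch. Every value in it is a guess for illustration, not a measured figure: the function names, the default of 2 cycles per crossbar traversal, 1 cycle lost to skew per direction, and the 100-cycle local latency are all hypothetical.

```python
def remote_extra_cycles(crossbar_cycles=2, skew_cycles=1, contention_cycles=0):
    """Extra cycles for a remote access compared to a local one.

    All defaults are assumed placeholder values, not real hardware numbers.
    """
    request_leg = crossbar_cycles + skew_cycles   # address: chip A -> chip B
    response_leg = crossbar_cycles + skew_cycles  # data: chip B -> chip A
    # contention_cycles stands in for chip B being busy with its own fetches,
    # which is indeterminate in practice.
    return request_leg + response_leg + contention_cycles


def remote_latency(local_cycles, **overheads):
    """Total remote latency: the local fetch on chip B plus crossbar overhead."""
    return local_cycles + remote_extra_cycles(**overheads)


if __name__ == "__main__":
    local = 100  # hypothetical local DRAM latency in cycles
    print("extra cycles, no contention:", remote_extra_cycles())
    print("remote latency, no contention:", remote_latency(local))
    print("remote latency, 10 cycles of contention:",
          remote_latency(local, contention_cycles=10))
```

With these guesses the crossbar overhead is a fixed handful of cycles on top of the local fetch, and the contention term is the only part that varies from access to access.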
We don't know, though, how the cache is managed, etc.