@mr_roboto &
@dmccloud:
I created a couple of Excel tables that should make the math more concrete. They demonstrate that one can calculate a figure of merit for the relative effect of changing benchmarks on device performance, and that this figure is independent of any change in calibration devices or in the baseline scores assigned to those devices. Hence your contention that my suggestion to do this "makes no sense" because of "different unit systems" or "different baselines" doesn't hold:
Consider two devices, X and Y. Suppose that, on a specific MC task in GB5, device X is faster than device Y. Further suppose that, in GB6, that task is replaced with a more challenging distributed task and that, compared with their GB5 times, this GB6 task takes device X four times as long to complete, but takes device Y only twice as long. Intuitively, the change from the GB5 task to the GB6 task favors device Y over device X by a factor of two. Note that this is independent of how long the GB5 task takes X and Y; all that matters is how much their relative performance changes when we switch to GB6.
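If it helps to see the arithmetic spelled out, here's a minimal sketch in Python. The completion times are made-up numbers of my own; only the 4x and 2x slowdowns matter:

```python
# Hypothetical GB5 completion times (seconds) for the same MC task.
# The absolute values are arbitrary; X is simply made faster than Y.
gb5_time_x = 10.0
gb5_time_y = 15.0

# GB6 replaces the task; X takes 4x as long, Y only 2x as long.
gb6_time_x = 4 * gb5_time_x
gb6_time_y = 2 * gb5_time_y

# How much the change to GB6 favors Y over X: X's slowdown vs. Y's slowdown.
figure_of_merit = (gb6_time_x / gb5_time_x) / (gb6_time_y / gb5_time_y)
print(figure_of_merit)  # 2.0, whatever the arbitrary GB5 times were
```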
In the top tables I've assigned completion times for the GB5 task to devices X and Y. These are arbitrary, except that I've made device X faster, as described above. Then, also as described above, I made the GB6 completion times for X and Y 4x and 2x as long, respectively. I then calculated the ratio by which the change from GB5 to GB6 favors Y over X, and got a figure of merit of 2, matching the common-sense intuition described above.
Note: In each case I calculate the resulting GB score as:
GB score = (baseline device score) x (baseline device time) / (test device time)
This implements Primate's prescription that the score is directly proportional to performance.
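As a sketch of that formula (the calibration numbers here are hypothetical, not Geekbench's actual baseline values):

```python
def gb_score(baseline_score, baseline_time, test_time):
    """Score proportional to performance: a device that finishes the task
    in half the baseline device's time gets twice the baseline score."""
    return baseline_score * baseline_time / test_time

# Hypothetical calibration: baseline device scores 1000 and takes 20 s.
print(gb_score(1000, 20.0, 10.0))  # test device is twice as fast -> 2000.0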
I then repeated this calculation in the bottom tables, except this time I calculated the GB5 and GB6 scores for X and Y based on entirely different calibration devices, with different task completion times, and different assigned baseline scores. These cells are highlighted in light blue.
You can see these changes have absolutely no effect on the figure of merit, which retains its value of 2. [The figure of merit, which is the relative scoring benefit seen by Y vs X in changing benchmarks, is shown in the orange cells. I show two different ways to calculate it.]
[The times highlighted in yellow, which are the task completion times in GB5 and GB6 for X and Y, of course remain the same, since they are independent of which calibration devices are used, depending only on the device and the task.]
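For anyone who'd rather see the invariance outside the spreadsheet, here's a sketch using the same hypothetical X and Y times as above and two unrelated calibration setups of my own invention:

```python
def gb_score(baseline_score, baseline_time, test_time):
    return baseline_score * baseline_time / test_time

# Same hypothetical task completion times for X and Y as above.
gb5_times = {"X": 10.0, "Y": 15.0}
gb6_times = {"X": 40.0, "Y": 30.0}

def figure_of_merit(cal5_score, cal5_time, cal6_score, cal6_time):
    # Score each device under the given calibrations, then compare how much
    # each device's score changed going from GB5 to GB6.
    change = {d: gb_score(cal6_score, cal6_time, gb6_times[d]) /
                 gb_score(cal5_score, cal5_time, gb5_times[d])
              for d in ("X", "Y")}
    return change["Y"] / change["X"]

# Two completely different sets of calibration devices and baseline scores:
print(figure_of_merit(1000, 20.0, 2500, 8.0))   # 2.0
print(figure_of_merit(1730, 55.0, 3333, 12.5))  # still 2.0
```

The calibration terms appear in both the numerator and denominator of each device's score change, so they cancel exactly, which is why the orange cells don't move when the light-blue cells do.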
Of course, this is just an illustration of the math. In practice, you wouldn't want to calculate ratios for one device vs. another. Instead, you'd calculate a ratio for each device vs. the average across all devices, as sketched below. Devices with a ratio greater than one would be relatively favored by the change in benchmark, while the opposite would be the case for devices with a ratio less than one.
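A rough sketch of what that device-vs.-average calculation could look like (the device list and times are hypothetical, and "average" here just means the mean slowdown):

```python
# Hypothetical GB5 and GB6 completion times for a handful of devices.
gb5_times = {"A": 10.0, "B": 15.0, "C": 12.0, "D": 20.0}
gb6_times = {"A": 40.0, "B": 30.0, "C": 36.0, "D": 50.0}

# Per-device slowdown going from GB5 to GB6; as above, calibration
# devices and baseline scores cancel out of any ratio of these.
slowdown = {d: gb6_times[d] / gb5_times[d] for d in gb5_times}
mean_slowdown = sum(slowdown.values()) / len(slowdown)

# Ratio > 1: device slowed less than average, i.e. it is relatively
# favored by the benchmark change; ratio < 1: relatively disfavored.
for device, s in sorted(slowdown.items()):
    print(device, round(mean_slowdown / s, 3))
```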