If I'm summarizing your argument correctly, wider ALUs are better for GPU computing but, for graphics (at least without TBDR), they can go to waste because of the small-triangle problem. Hence NVIDIA and AMD have found that an ALU width of 32 values (32-wide SIMD) is the best compromise for GPUs that are used for both graphics and compute tasks:
...Since compute tasks are usually large (you don't start a GPU compute on just 100 work items, it's usually multiple or even hundreds of thousands), making your architecture as wide as possible works well: you can get many more ALUs in without wasting the die space on controlling logic. So you get better performance at lower power investment...
...Now, for graphic, the situation is a bit different....With smaller triangles, you won't be able to collect enough pixels to keep reasonable GPU utilization and all this ALU power goes to waste...
Because of the above considerations, both AMD and Nvidia settled down on an ALU width of 32 values (32-wide SIMD). This appears to be a sweet spot between managing ALU occupation and conserving die space.
But if graphics inefficiencies are in fact the principal reason NVIDIA and AMD don't use wider ALUs in their graphics GPUs, then we would expect to find wider ALUs in their non-graphics datacenter GPUs. Is this the case?
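To make sure I have the utilization argument right, here is a toy model of how I understand it. It is only a sketch under my own simplifying assumptions: work items are packed into groups of `simd_width` lanes, a partly filled group still occupies the whole SIMD unit, and pixels from different triangles are never merged into the same group (real hardware may well do better than that). The function name and the example sizes (a 10-pixel triangle, a 100,000-item compute dispatch) are mine, not anything from the vendors.

```python
import math


def simd_utilization(work_items: int, simd_width: int) -> float:
    """Fraction of SIMD lanes doing useful work.

    Assumption (hypothetical model): work items are packed into groups of
    simd_width lanes, and a partially filled last group still costs a
    full-width issue.
    """
    if work_items <= 0 or simd_width <= 0:
        raise ValueError("work_items and simd_width must be positive")
    groups = math.ceil(work_items / simd_width)
    return work_items / (groups * simd_width)


if __name__ == "__main__":
    # Compare a tiny triangle with a large compute dispatch at several widths.
    for width in (32, 64, 128):
        small = simd_utilization(10, width)        # small triangle, 10 pixels
        large = simd_utilization(100_000, width)   # large compute dispatch
        print(f"width={width:4d}  small triangle: {small:.1%}  compute: {large:.1%}")
```

In this model the compute dispatch stays above 99% utilization at every width, while the 10-pixel triangle drops from about 31% at 32 lanes to about 8% at 128 lanes, which is the trade-off I take the quoted argument to be describing.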