You're correct that Blackwells, for example, aren't the (relatively!) simple systolic arrays that GPUs used to be, but SIMD is still used, actually or conceptually (a single instruction stream driving multiple threads), in various execution units and dataflows in these chips. I'd argue that Nvidia's experience implementing and productizing SIMD and SIMD-like architectures has given it a leg up over companies that have never done so before. The same goes for AMD. And Google has seven generations of getting systolic arrays right in its TPUs. I know that you know that theory is easy compared to making an implementation work reliably, with high performance and high efficiency, especially in silicon gates. That said, for AI workloads it's mostly not the classic SIMD units doing the work; it's the matmul hardware, the tensor cores, that gets used. A TPU is, at the risk of gross simplification, just a large matmul accelerator.
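To make the "large matmul accelerator" point concrete, here is a toy Python simulation of a systolic-array matmul: a grid of processing elements (PEs) where A-values flow right, B-values flow down, and each PE does one multiply-accumulate per cycle. This is my own sketch of one common dataflow (output-stationary, with skewed input edges), not Google's actual TPU design; the function name `systolic_matmul` and the register layout are illustrative assumptions.

```python
# Toy simulation of an output-stationary systolic array computing C = A @ B.
# Each PE(i, j) holds a running sum; A-values flow right, B-values flow down,
# skewed so matching operands meet in the right PE on the right cycle.
# This is an illustrative sketch, not any real accelerator's implementation.
def systolic_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]       # per-PE accumulators (outputs stay put)
    a_reg = [[0.0] * m for _ in range(n)]   # value each PE forwards to its right neighbor
    b_reg = [[0.0] * m for _ in range(n)]   # value each PE forwards to its lower neighbor
    # The wavefront is done once PE(n-1, m-1) has seen its last operand pair,
    # at cycle (n - 1) + (m - 1) + (k - 1).
    for t in range(n + m + k - 2):
        new_a = [[0.0] * m for _ in range(n)]
        new_b = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Left edge: row i of A enters delayed by i cycles (the skew).
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < k else 0.0)
                # Top edge: column j of B enters delayed by j cycles.
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < k else 0.0)
                C[i][j] += a_in * b_in      # the one multiply-accumulate each PE performs
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return C
```

Note the payoff that makes this worth the silicon: each operand is loaded from the edge once and reused as it marches across the array, so an n-by-m grid performs n*m MACs per cycle with only n + m values entering per cycle. (The TPU paper describes a weight-stationary variant, where one operand is pre-loaded and held in place, but the marching-wavefront idea is the same.)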
