Optimizing a Rust GPU matmul kernel