TL;DR: Make 1x1 convolutions sparse, write fast sparse matrix multiplication kernels, and get a nearly 2x speedup with smaller models.
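To make the idea concrete: a 1x1 convolution is a matrix product between an [out_channels x in_channels] weight matrix and the activations flattened to [in_channels x pixels], so a sparse weight matrix turns it into a sparse-times-dense multiplication. The sketch below is only an illustration of that equivalence, written in dependency-free Python with a CSR (compressed sparse row) layout. The function names, the CSR choice, and the toy data are assumptions made for this example, and it is in no way a fast kernel.

```python
# Illustrative sketch only: pure Python, not an optimized kernel.
# All names here (dense_to_csr, sparse_conv1x1, dense_conv1x1) are hypothetical.

def dense_to_csr(weights):
    """Convert a dense [out_channels][in_channels] matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in weights:
        for c, w in enumerate(row):
            if w != 0.0:
                values.append(w)   # keep only the nonzero weights
                col_idx.append(c)  # remember which input channel each one reads
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def sparse_conv1x1(csr, activations):
    """Apply a 1x1 convolution as a sparse-times-dense matrix product.

    activations is [in_channels][pixels]: the H*W spatial positions
    flattened into columns. Returns [out_channels][pixels].
    """
    values, col_idx, row_ptr = csr
    n_pixels = len(activations[0])
    out = []
    for r in range(len(row_ptr) - 1):
        acc = [0.0] * n_pixels
        # Only the nonzero weights of output channel r are visited.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            w, src = values[k], activations[col_idx[k]]
            for p in range(n_pixels):
                acc[p] += w * src[p]
        out.append(acc)
    return out


def dense_conv1x1(weights, activations):
    """Reference dense version, used only to check the sparse result."""
    n_pixels = len(activations[0])
    return [
        [sum(w * activations[c][p] for c, w in enumerate(row)) for p in range(n_pixels)]
        for row in weights
    ]


# Toy example: 3 input channels, 2 output channels, 4 pixels.
weights = [[0.0, 2.0, 0.0],
           [1.0, 0.0, -1.0]]
activations = [[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 10.0, 11.0, 12.0]]

csr = dense_to_csr(weights)
result = sparse_conv1x1(csr, activations)
```

The work in `sparse_conv1x1` scales with the number of nonzero weights rather than with the full weight matrix, which is where the potential saving comes from. Whether that translates into a real speedup depends on how well the kernel uses the hardware.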