The convolution empire strikes back

  • My theory is that architecture doesn't matter: convolutional, transformer, or recurrent, as long as you can efficiently train models of the same size, what counts is the dataset.

    Similarly, humans achieve about the same results when they have the same training, with small variations. What matters is not the brain but the education they get.

    Of course I am exaggerating a bit; I'm just saying there is a multitude of brain and neural-net architectures with similar abilities, and the differentiating factor is the data, not the model.

    For years we have seen hundreds of papers proposing sub-quadratic attention. They all failed to get traction; the big labs still use an almost vanilla transformer. At some point a paper declared "mixing is all you need" (MLP-Mixer) in place of "attention is all you need": just mixing, and the optimiser adapts to whatever it gets (a minimal sketch of a Mixer block follows this comment).

    If you think about it, maybe language creates a virtual layer where language operations are performed, and this works similarly in humans and AIs. That's why the architecture doesn't matter: it is running the language-OS on top. The same goes for vision.

    I place 90% of the merit of AI on language and 10% on the model architecture. Finding intelligence was inevitable; it was hiding in language, which is how we get to be intelligent as well. A human raised without language is even worse off than a primitive. Intelligence is encoded in software, not hardware. Our language software has more breadth and depth than any one of us could create or contain.
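
    For reference, here is a minimal sketch of the MLP-Mixer idea mentioned above, assuming PyTorch; the layer widths are illustrative, not the paper's. Each block has no attention at all, just one MLP mixing across patches and one mixing across channels:

      import torch
      import torch.nn as nn

      class MixerBlock(nn.Module):
          """One MLP-Mixer block: a token-mixing MLP across patches,
          then a channel-mixing MLP across features."""
          def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim)
              # Token mixing acts on the patch axis, so it is applied to
              # the transposed (batch, dim, patches) tensor.
              self.token_mlp = nn.Sequential(
                  nn.Linear(num_patches, token_hidden), nn.GELU(),
                  nn.Linear(token_hidden, num_patches))
              self.norm2 = nn.LayerNorm(dim)
              self.channel_mlp = nn.Sequential(
                  nn.Linear(dim, channel_hidden), nn.GELU(),
                  nn.Linear(channel_hidden, dim))

          def forward(self, x):                         # x: (batch, patches, dim)
              y = self.norm1(x).transpose(1, 2)         # (batch, dim, patches)
              x = x + self.token_mlp(y).transpose(1, 2)     # mix across patches
              x = x + self.channel_mlp(self.norm2(x))       # mix across channels
              return x

      x = torch.randn(8, 196, 512)                      # 14x14 patches, 512-dim embeddings
      print(MixerBlock(num_patches=196, dim=512)(x).shape)  # torch.Size([8, 196, 512])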

  • This is great, but what is a possible use case for these massive classifier models? I'm guessing they won't be running at the edge, which precludes them from real-time applications like self-driving cars, smartphones, or military systems. So then what? Facial recognition for police/governments, or targeted advertising based on your Instagram/Google photos? I'm genuinely curious.

  • This is nice because convolutional models seem better for some vision tasks, like segmentation, that are less obvious to do with ViTs. Convolution seems like something you fundamentally want in order to model translation equivariance (and, with pooling, invariance) in vision; see the sketch below.
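
    To make the equivariance point concrete: a conv layer commutes with spatial shifts, so shifting the input and then convolving equals convolving and then shifting (exactly so with circular padding; ordinary zero padding matches up to edge effects). A minimal sketch, assuming PyTorch, with arbitrary sizes:

      import torch
      import torch.nn as nn

      conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular")

      x = torch.randn(1, 3, 32, 32)
      shifted = torch.roll(x, shifts=(5, -3), dims=(2, 3))         # translate the image

      shift_after = torch.roll(conv(x), shifts=(5, -3), dims=(2, 3))
      shift_before = conv(shifted)
      print(torch.allclose(shift_after, shift_before, atol=1e-6))  # True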

  • I haven't fully read the paper yet. Isn't the strength of Vision Transformers in unsupervised learning, meaning that the data doesn't need labels? And don't ResNets require labeled data?

  • All machine learning is just convolution, in the Hopf-algebra sense of convolution.
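
    For readers unfamiliar with the term: given a coalgebra C (e.g. a Hopf algebra) and an algebra A, the convolution of two linear maps f, g : C → A is

      (f ∗ g) = m_A ∘ (f ⊗ g) ∘ Δ_C,   i.e.   (f ∗ g)(x) = Σ f(x(1)) g(x(2))

    in Sweedler notation, where Δ(x) = Σ x(1) ⊗ x(2). Whether all of machine learning really fits this frame is, of course, the commenter's claim rather than a standard result.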