Isn't noise in the data going to dominate the output size of lossless compression? Wouldn't linguistics and vision be better off with direct measurements of predictive strength?
(author here)
Recent events in ML make me feel about two-thirds vindicated in the claims made in the book. Based on the book's ideas, I began training LLMs on large corpora in the early 2010s, well before it was "cool". I figured out that LLMs could scale to giga-parameter complexity without overfitting, and that the concepts learned during this training would be reusable for other tasks (I called this the Reusability Hypothesis, to emphasize that it was deeply non-obvious; other terms like "self-supervision" are more common in the literature).
I missed on two related points. Technically, I did not think DNNs would scale up forever; I thought they would hit some barrier, and that engineers would not be able to debug the problem because of the black-box nature of DNNs. Philosophically, I wanted this work to resemble classical empirical science, in the sense that the humans involved should achieve a high degree of knowledge of the material. In the case of LLMs, I wanted researchers (including myself) to develop an understanding of key concepts in linguistics such as syntax, semantics, morphology, and so on.
This style of research actually worked! I built a statistical parser without using any labelled training data! And I did learn a ton about syntax by building these models. One nice insight was that the PCFG is a bad formalism for grammar; I wrote about this here:
https://ozoraresearch.wordpress.com/2017/03/17/chuckling-a-b...
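To make one standard complaint about PCFGs concrete (this is a toy sketch I'm adding for illustration, not my actual parser, and not necessarily the argument in the post): a PCFG scores a parse as a product of rule probabilities, with each expansion chosen independently of its context, and that independence assumption is exactly where the formalism falls down.

    from math import log2

    # Toy PCFG: probability of each rule given its left-hand side.
    rules = {
        ("S",  ("NP", "VP")): 1.0,
        ("NP", ("dogs",)):    0.5,
        ("NP", ("cats",)):    0.5,
        ("VP", ("bark",)):    0.5,
        ("VP", ("meow",)):    0.5,
    }

    def tree_logprob(tree):
        # log2 P(tree) = sum of log rule probabilities: each expansion
        # is an independent draw -- the context-free assumption.
        label, children = tree
        if isinstance(children, str):        # preterminal -> word
            return log2(rules[(label, (children,))])
        lp = log2(rules[(label, tuple(c[0] for c in children))])
        return lp + sum(tree_logprob(c) for c in children)

    # The problem in miniature: the VP expansion cannot condition on
    # the subject NP, so "dogs bark" and "dogs meow" are scored by
    # exactly the same independent coin flips.
    bark = ("S", [("NP", "dogs"), ("VP", "bark")])
    meow = ("S", [("NP", "dogs"), ("VP", "meow")])
    print(tree_logprob(bark), tree_logprob(meow))  # both -2.0 bits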
Obviously, I fell into the "Bitter Lesson" trap described by Rich Sutton: DNNs can scale up, and can improve their understanding much faster than a group of human researchers can.
One funny memory is that in 2013 I went to CVPR and told a bunch of CV researchers that they should give up on modeling P(L|I), the probability of a label given an image, and just model P(I), the probability of the image itself. They weren't too happy to hear that. I'm not sure that approach has taken over the CV world yet, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.
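To be concrete about what "model P(I)" means: here's a toy sketch (my illustration, not code from 2013) of the autoregressive chain-rule factorization, the same trick GPT uses over text tokens. The conditional model here is a deliberately dumb stand-in; a real system would learn it.

    import numpy as np

    def log_p_image(pixels, conditional):
        # Chain rule: log2 P(I) = sum_t log2 P(pixel_t | pixels_<t).
        # Swap "pixels" for "tokens" and this is exactly GPT's objective.
        return sum(conditional(pixels[:t], pixels[t])
                   for t in range(len(pixels)))

    # Stand-in conditional (hypothetical, for illustration): says each
    # pixel is probably close in value to the previous one.
    def dummy_conditional(prefix, nxt):
        prev = prefix[-1] if len(prefix) else 128
        probs = np.exp(-np.abs(np.arange(256) - prev) / 10.0)
        probs /= probs.sum()
        return float(np.log2(probs[nxt]))

    img = np.array([120, 122, 125, 124])   # a 4-pixel "image"
    print(log_p_image(img, dummy_conditional), "log2-probability (bits)")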
In hindsight, I regret the emphasis I placed on the keyword "compression". To me, compression is a rigorous way to compare models, with a built-in Occam's principle. But "compression" means many different things to different people. The important idea is that we're modeling very large unlabelled datasets, using the most natural objective metric for that setting.
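Here's a toy illustration of what I mean (the model-description bit counts are made up for the example): a model that assigns probability p to the data can code it in about -log2(p) bits, so better prediction means a smaller file, and charging each model for its own description length gives the Occam penalty for free.

    from math import log2

    def unigram_bits(data, p):
        # Codelength under an i.i.d. model: -sum log2 P(x).
        return -sum(log2(p[x]) for x in data)

    def bigram_bits(data, p):
        # Codelength under a first-order Markov model.
        bits = 1.0                           # first symbol, uniform over {a, b}
        for prev, cur in zip(data, data[1:]):
            bits -= log2(p[(prev, cur)])
        return bits

    data = "abababababab"
    uni = {"a": 0.5, "b": 0.5}
    big = {("a", "b"): 0.99, ("a", "a"): 0.01,
           ("b", "a"): 0.99, ("b", "b"): 0.01}

    # Total = (made-up) model-description bits + data bits. The bigram
    # model costs more to write down, but it predicts the alternation
    # so much better that its total is smaller.
    print("unigram:", 2 + unigram_bits(data, uni))  # 14.0 bits
    print("bigram: ", 8 + bigram_bits(data, big))   # ~9.2 bits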
edit: here is the Bitter Lesson essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Scientific models are causal. Compression is a condition on association.
I don't really know what more needs to be said here.