Physics of Language Models: Architecture Design and the Magic of Canon Layers

  • Anyone know why they mix in the 3 previous tokens? They could just as easily have used 5 or 2, right?
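
    For anyone who wants to poke at the question, here is a rough sketch of what "mixing in the k previous tokens" could look like, with the window size left as a free parameter. This is an assumption-laden illustration, not the paper's code: the function name `mix_previous_tokens` and the fixed scalar weights are made up here, and a trained model would presumably learn the weights rather than hard-code them.

```python
import numpy as np

def mix_previous_tokens(h, weights):
    """Causal weighted sum over the current token and the previous ones.

    h:       array of shape (seq_len, dim), one row per token.
    weights: sequence of length k + 1 (hypothetical fixed scalars);
             weights[0] scales the current token, weights[j] scales
             the token j positions earlier.
    """
    seq_len, _ = h.shape
    out = np.zeros_like(h, dtype=float)
    for j, w in enumerate(weights):
        if j >= seq_len:
            break  # no token lies that far back in this sequence
        # Row t receives w * h[t - j]; the first j rows have no such
        # token, so they get no contribution (implicit zero padding).
        out[j:] += w * h[:seq_len - j]
    return out

# Toy sequence: 5 tokens, 1 feature each, with values 0..4.
h = np.arange(5, dtype=float).reshape(5, 1)
three_back = mix_previous_tokens(h, [1.0, 1.0, 1.0, 1.0])  # 3 previous tokens
two_back = mix_previous_tokens(h, [1.0, 1.0, 1.0])         # 2 previous tokens
```

    Changing the window is just a matter of changing the length of `weights`, so 2 vs. 3 vs. 5 is a cheap hyperparameter to vary in a sketch like this.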