Llama2 implementation on Mojo runs at high performance

  • Is there a real link somewhere? What flags was llama2.c built with for the comparison? (Edit: it's built with `make runfast`, which doesn't parallelize across cores... I wonder if that's part of it. I also wonder if BLAS is another reason; I assume Mojo has some accelerated linear algebra library.)

    Llama2.c is a toy and not optimized: its matmul is a plain for loop in C, and it relies entirely on the compiler for speedups. You'd need to compare against llama.cpp for anything credible.
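
    For context, a minimal sketch of what llama2.c's matmul looks like (in the spirit of the routine in run.c, not a verbatim copy). Note the OpenMP pragma is only active when built with `-fopenmp`; per the comment above, `make runfast` doesn't enable that, so the loop runs on a single core and any speedup comes from compiler auto-vectorization:

    ```c
    #include <stdio.h>

    // Naive matrix-vector multiply in the style of llama2.c:
    // W is (d, n) row-major, x is (n,), result xout is (d,).
    // The pragma below is ignored unless compiled with -fopenmp.
    void matmul(float *xout, const float *x, const float *w, int n, int d) {
        #pragma omp parallel for
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }

    int main(void) {
        // Tiny smoke test: 2x3 matrix times length-3 vector.
        float w[6] = {1, 2, 3, 4, 5, 6};
        float x[3] = {1, 1, 1};
        float out[2];
        matmul(out, x, w, 3, 2);
        printf("%f %f\n", out[0], out[1]); // expect 6.0 15.0
        return 0;
    }
    ```

    Compare that with llama.cpp, which hand-tunes this inner loop (SIMD intrinsics, quantized kernels, threading), which is why it's the more credible baseline.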