What’s the largest Mamba model that has been trained so far?
It seems to scale better than transformers, but that would only become really obvious at parameter counts far beyond the experiments in this paper.