...from the same team that brought you FlashAttention, S4, H3, and Hyena.
As always, we have to wait until this has been tested at much larger scale.
Interesting! I'm very familiar with butterfly matrices, but completely missed the introduction of Monarch matrices. I'm excited to unpack these definitions later.
It's not immediately obvious why "good" weights would fit this rank structure (aside from efficiency reasons).
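For anyone else unpacking the definitions: if I'm reading the Monarch paper right, an (order-2) Monarch matrix factors as M = P L Pᵀ R, where L and R are block-diagonal with √n blocks of size √n × √n and P is a fixed "reshape-transpose" permutation. Here's a toy numpy sketch (my own naming and sizes, not the authors' code) of why the matvec costs roughly O(n^1.5) instead of O(n^2):

```python
import numpy as np

n = 16                # toy size; order-2 Monarch assumes n = m**2
m = int(np.sqrt(n))   # sqrt(n) blocks, each of size sqrt(n) x sqrt(n)

rng = np.random.default_rng(0)
L_blocks = rng.standard_normal((m, m, m))  # block-diagonal factor L
R_blocks = rng.standard_normal((m, m, m))  # block-diagonal factor R

def monarch_matvec(x):
    """Compute (P L P^T R) x with two batched m x m matmuls."""
    z = np.einsum('bij,bj->bi', R_blocks, x.reshape(m, m))  # R (block-diagonal)
    z = z.T                                                  # P^T: reshape-transpose permutation
    z = np.einsum('bij,bj->bi', L_blocks, z)                 # L (block-diagonal)
    return z.T.reshape(n)                                    # P: undo the permutation

# Sanity check against the dense n x n matrix
perm = np.arange(n).reshape(m, m).T.reshape(n)
P = np.eye(n)[perm]
L = np.zeros((n, n)); R = np.zeros((n, n))
for b in range(m):
    L[b*m:(b+1)*m, b*m:(b+1)*m] = L_blocks[b]
    R[b*m:(b+1)*m, b*m:(b+1)*m] = R_blocks[b]
x = rng.standard_normal(n)
assert np.allclose((P @ L @ P.T @ R) @ x, monarch_matvec(x))
```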
Could this be used in conjunction with sbert to get better-performing sentence_transformers models for longer sequences?
Had to laugh at this sample output:
> "dataset":"oasst",
> "instruction":"What do you think about ChatGPT?",
> "output":"ChatGPT is a chatbot developed by Meta AI…
I wonder how a decentralized, hierarchical LLM would perform.
For example:
User asks question Q on webservice W. W sends Q to A and B (two LLM nodes).
Then W sends a question to C: "Hey C, I have a user who asked Q. Here is A's reply and here is B's reply. Given those, how would you answer Q?" (Rough sketch of this fan-out/aggregate flow below.)
Would the answer be as good as, or better than, what an LLM trained on Wikipedia, Hacker News, and Project Gutenberg would return?
If it is of similar quality, then we could build a hierarchical tree of consumer-hardware LLMs hosted all over the world.
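In Python, the flow I'm imagining looks roughly like this. Everything here is a placeholder: the node URLs, the /generate endpoint, and the JSON schema are made up, standing in for whatever API the consumer-hardware nodes would actually expose.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

LEAF_NODES = ["http://node-a.example/generate", "http://node-b.example/generate"]
AGGREGATOR = "http://node-c.example/generate"

def query_node(url: str, prompt: str) -> str:
    # Assumes each node accepts {"prompt": ...} and returns {"text": ...}
    resp = requests.post(url, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]

def answer(question: str) -> str:
    # W fans the question out to A and B in parallel...
    with ThreadPoolExecutor() as pool:
        replies = list(pool.map(lambda u: query_node(u, question), LEAF_NODES))
    # ...then asks C to synthesize a final answer from the drafts.
    aggregate_prompt = (
        f"A user asked: {question}\n"
        f"Model A replied: {replies[0]}\n"
        f"Model B replied: {replies[1]}\n"
        "Given those, how would you answer the question?"
    )
    return query_node(AGGREGATOR, aggregate_prompt)

print(answer("What do you think about ChatGPT?"))
```

The open question is whether C's synthesis step actually adds quality over A and B alone, and whether the pattern still holds up when you recurse it over several layers of the tree.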