LoRA Learns Less and Forgets Less

  • I really wish people would be more careful about choosing names for these things.

    LoRa has been a popular wireless protocol for like 10 years.

  • The finding is that the best fine-tuning performance comes from training all weights, followed by applying LoRA to the MLPs, followed by applying it to the attention heads. The authors assert that the performance difference is driven by which module of the network LoRA targets.

    Isn’t it an equally valid argument that MLPs tend to contain a greater number of weights than attention heads in transformer networks, and that the performance difference can be traced to a greater number of weights having freedom to change? I’d be curious to know whether randomly choosing a subset of matrices to train, regardless of where they sit in the network, would match LoRA on a specific module with a similar number of learnable weights (a rough sketch of that comparison is below).
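
    As a rough sketch of what I mean (plain PyTorch, not the paper's code; the module names here are made up), wrapping only the chosen Linear layers makes the trainable-parameter comparison explicit:

      # Minimal LoRA sketch: freeze the base weights and add a trainable
      # low-rank update to selected Linear layers, then count trainable
      # parameters for different targeting choices.
      import torch
      import torch.nn as nn


      class LoRALinear(nn.Module):
          """y = base(x) + (alpha / r) * x A^T B^T, with base frozen and A, B trainable."""

          def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
              super().__init__()
              self.base = base
              for p in self.base.parameters():
                  p.requires_grad = False                       # freeze pretrained weights
              self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
              self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero
              self.scale = alpha / r

          def forward(self, x):
              return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


      def apply_lora(model: nn.Module, name_filter, r: int = 8):
          """Freeze the whole model, then wrap every nn.Linear whose name passes name_filter."""
          for p in model.parameters():
              p.requires_grad = False                           # only LoRA factors will train
          for parent_name, parent in list(model.named_modules()):
              for child_name, child in list(parent.named_children()):
                  full = f"{parent_name}.{child_name}" if parent_name else child_name
                  if isinstance(child, nn.Linear) and name_filter(full):
                      setattr(parent, child_name, LoRALinear(child, r=r))
          return model


      def trainable_params(model: nn.Module) -> int:
          return sum(p.numel() for p in model.parameters() if p.requires_grad)


      class ToyBlock(nn.Module):
          """Toy stand-in for one transformer block; real module names vary by model."""

          def __init__(self, d: int = 64, d_ff: int = 256):
              super().__init__()
              self.attn_q = nn.Linear(d, d)
              self.attn_v = nn.Linear(d, d)
              self.mlp_up = nn.Linear(d, d_ff)
              self.mlp_down = nn.Linear(d_ff, d)


      mlp_only = apply_lora(ToyBlock(), lambda name: name.startswith("mlp"))
      attn_only = apply_lora(ToyBlock(), lambda name: name.startswith("attn"))
      print("LoRA on MLP projections:      ", trainable_params(mlp_only), "trainable params")
      print("LoRA on attention projections:", trainable_params(attn_only), "trainable params")

    At the same rank, targeting the MLP projections gives more trainable parameters simply because those layers are wider, which is the confound I'm pointing at.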

  • I feel like this is a trivial conclusion. Keeping the rank low in the optimization is a common regularization technique.
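
    For intuition, a back-of-the-envelope comparison (illustrative sizes, not numbers from the paper): constraining the update to rank r shrinks the number of free parameters from d*k to r*(d+k).

      # Free parameters in a full update vs. a rank-r LoRA update of one d x k matrix.
      d, k, r = 4096, 4096, 16        # illustrative layer sizes and rank, not from the paper
      full_update = d * k             # unconstrained delta-W
      lora_update = r * (d + k)       # delta-W = B @ A, with B: d x r and A: r x k
      print(full_update, lora_update, full_update // lora_update)
      # ~16.8M vs ~131k free parameters, i.e. a ~128x tighter update space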

  • This paper has 12 authors, which fascinates me to no end for some unexplainable reason. How does it work? Is it a common occurrence to have this many people working on a submission? Did each of them get at least a paragraph in edgewise?

  • This study is great and addresses a question I've had about LoRA for a while.

    In a continual learning paper from last year, I found LoRA was extremely effective for faster fine-tuning and not forgetting the original dataset:

    https://arxiv.org/abs/2306.01904

  • This was a poor study: https://x.com/danielhanchen/status/1791900967472140583?s=46&...

  • This is "Low-Rank Adaptation", "a widely-used parameter-efficient finetuning method for large language models."

    Not to be confused with LoRa ("long range") [1], an Internet of Things radio technology.

    [1] https://en.wikipedia.org/wiki/LoRa

  • This is Low-rank adaptation. Not to be confused with Lake of the Ozarks Recreation Area.

  • What can we learn about Low Rank Acronyms today?

  • This is about Low-rank adaptation. Not to be confused with LoRa, the long-range proprietary radio communication technique, which hopefully doesn't learn at all.


  • What does LoRa have to do with LLMs? Whoever named this thing screwed up big time.