Zamba2-7B

  • When they say that they use two attention heads, is each attention head directed at a different aspect of the data?

    In memory research there is the idea that every event has a dual representation: a more verbatim representation and a more context-weighted one. Over early childhood, our verbatim memory representations increase in fidelity and resistance to interference, peaking somewhere around ages 6 to 10, depending on the specifics. As verbatim memory matures, another aspect of memory representation improves: what some have called gist memory, or semantic context. Gains in memory performance continue into adolescence, primarily because of a growing ability to use context and gist (broad representations that capture the details of an event by inference). That raises accuracy overall, but it also raises the likelihood of false alarms to lures primed by semantically related material during learning, precisely because recall comes to rely more heavily on context.

    So I could imagine such a system in an LLM, where one attention head is directed at exact representations and another attends to a coarser grain of information that anchors them. However, I am not familiar enough with LLMs to know whether that is just silly analogizing.
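
    For what it's worth, here is a minimal sketch (mine, not Zyphra's code) of plain two-head self-attention, just to show how the heads relate: both heads read exactly the same input, but each learns its own query/key/value projections, so any division of labor like "verbatim vs. gist" would have to emerge from training rather than being assigned. All names are made up for illustration, and the causal mask is omitted for brevity.

      import torch
      import torch.nn.functional as F

      def two_head_attention(x, wq, wk, wv, wo):
          """Toy 2-head self-attention (illustrative only, not Zamba2's code).

          x:          (seq_len, d_model) token representations
          wq, wk, wv: (2, d_model, d_head) per-head projection weights
          wo:         (2 * d_head, d_model) output projection
          """
          outputs = []
          for h in range(2):                        # both heads see the same x
              q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
              scores = q @ k.T / k.shape[-1] ** 0.5
              attn = F.softmax(scores, dim=-1)      # each head learns its own pattern
              outputs.append(attn @ v)
          # heads are concatenated and mixed back together; nothing forces one
          # head to be "verbatim" and the other "gist"
          return torch.cat(outputs, dim=-1) @ wo

      # toy usage
      d_model, d_head, seq_len = 16, 8, 4
      x = torch.randn(seq_len, d_model)
      wq, wk, wv = (torch.randn(2, d_model, d_head) for _ in range(3))
      wo = torch.randn(2 * d_head, d_model)
      print(two_head_attention(x, wq, wk, wv, wo).shape)   # torch.Size([4, 16])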

  • For anyone else looking for the weights, which as far as I can tell are not linked in the article:

    Base model: https://huggingface.co/Zyphra/Zamba2-7B

    Instruct tuned: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
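
    If it helps, here is an untested sketch of loading the instruct checkpoint with Hugging Face transformers. Per the model card you may need a recent transformers release (or Zyphra's fork) that includes the Zamba2 architecture, and the prompt is just a placeholder.

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "Zyphra/Zamba2-7B-Instruct"
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          torch_dtype=torch.bfloat16,   # ~14 GB of weights in bf16
          device_map="auto",            # needs accelerate; or place on a device manually
      )

      prompt = "Give me a two-sentence summary of hybrid SSM/attention models."
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      output = model.generate(**inputs, max_new_tokens=128)
      print(tokenizer.decode(output[0], skip_special_tokens=True))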

  • I wonder how much of the performance gains can be attributed to their improved dataset rather than their architecture. That would be an expensive experiment.

  • I'm tired of LLM releases that cherry-pick benchmarks. How does it compare to SOTA qwen2.5/phi-3.5?

    Does anyone know of an up-to-date independent leaderboard? Lmsys and livebench used to be great but have skipped most major models recently.

  • https://lifearchitect.ai/models-table/

  • Nice to see more Apache-licensed models, especially ones with different architectures.

  • For the amount of theoretical work behind those Mamba2 blocks (I can barely understand their paper on the subject), those are some extremely modest performance gains.

    Attention remains king.
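
    For anyone wondering what those blocks boil down to, here is a toy sketch (my own illustration, not Mamba2's selective, hardware-aware kernel) of the underlying linear state-space recurrence: a fixed-size hidden state is updated once per token instead of attending over the whole history, which is where the constant-memory, linear-time appeal comes from. Roughly speaking, Mamba2's theoretical work is about making these parameters input-dependent and structured so the scan runs fast on GPUs.

      import torch

      def ssm_scan(x, A, B, C):
          """Toy linear state-space recurrence (illustrative only; Mamba2 uses
          input-dependent, structured parameters and a parallel scan).

          x: (seq_len, d_in)   A: (d_state, d_state)
          B: (d_state, d_in)   C: (d_out, d_state)
          """
          h = torch.zeros(A.shape[0])        # fixed-size state, independent of seq_len
          ys = []
          for x_t in x:                      # one cheap update per token
              h = A @ h + B @ x_t            # h_t = A h_{t-1} + B x_t
              ys.append(C @ h)               # y_t = C h_t
          return torch.stack(ys)

      # toy usage: memory does not grow with sequence length, unlike attention's KV cache
      x = torch.randn(6, 4)
      A = 0.9 * torch.eye(8)
      B = torch.randn(8, 4)
      C = torch.randn(3, 8)
      print(ssm_scan(x, A, B, C).shape)      # torch.Size([6, 3])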

  • Anyone seen a URL to a tool that lets you try this one out?

  • Another day, another world record in AI.

    Reminds me of Sergey Bubka (https://en.wikipedia.org/wiki/Sergey_Bubka). Bubka broke the world record for men's pole vault 35 times during his career.

  • Any ideas what languages this supports?

  • What is magic about 7B? Why not 8B, 9B, or 11.234B? Is 7B some power of 2 reinterpreted?

  • Will it be open sourced?

  • Not transformer based?

  • No mention of or comparison with phi-3 seems odd. Isn't phi-3 leading the other models by a bit?

  • Will it be made available for ollama? Or is there another platform for running it locally?

  • who decided names for models need to end with -a?

  • any benchmarks vs phi-3?

  • If a model had been trained in 1837, would it still be useful today? And how will models be trained in 2037, when most of the web might be autogenerated on the fly, like in the cgi-bin era?

  • Is what?

  • Cool! Seems we’re moving closer and closer to realizing the Lottery Ticket Hypothesis https://arxiv.org/abs/1803.03635