Large Language Diffusion Models

  • As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2)

    But at face value, a new architectural approach at the same capacity (8B), trained on roughly 1/6th the tokens, being competitive with llama3-8b is exciting

  • Not sure why they included a hallucination as one of their first examples (The Empire Strikes Back was directed by Irvin Kershner, not George Lucas):

    "Please recommend me three famous movies"

    "The Empire Strikes Back (1980) - Directed by George Lucas"

  • It doesn't seem to support variable-length input and output, does it?

    The paper seems to use EOS padding to create fixed-length inputs and outputs.

    So is there a maximum output length? A rough sketch of the idea is below.
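
    A rough sketch of how that might look (the token ids, the `model`/`diffusion_sample` names, and the helper functions are hypothetical placeholders, not the paper's actual API): the caller fixes the generation length up front, answers shorter than that are padded out with EOS tokens, and the padding is trimmed afterwards.

    ```python
    import torch

    MASK_ID = 50257  # hypothetical id of the [MASK] token
    EOS_ID = 50256   # hypothetical id of the EOS token

    def build_input(prompt_ids: torch.Tensor, gen_len: int) -> torch.Tensor:
        """Prompt followed by a fixed-size block of mask tokens to be denoised."""
        masked_block = torch.full((gen_len,), MASK_ID, dtype=prompt_ids.dtype)
        return torch.cat([prompt_ids, masked_block])

    def trim_eos_padding(output_ids: torch.Tensor) -> torch.Tensor:
        """Everything from the first EOS onward is padding; drop it."""
        eos_positions = (output_ids == EOS_ID).nonzero(as_tuple=True)[0]
        return output_ids[: int(eos_positions[0])] if len(eos_positions) else output_ids

    # Usage sketch: gen_len acts as an explicit cap on the answer length per call.
    # x = build_input(prompt_ids, gen_len=128)
    # filled = diffusion_sample(model, x)                 # hypothetical denoising loop
    # answer = trim_eos_padding(filled[len(prompt_ids):])
    ```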

  • Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.

    Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space? (A rough sketch of the sampling loop is below.)
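
    To make the "no reordering" point concrete, here is a minimal sketch of a confidence-based remasking sampler, one common way to decode masked diffusion LMs (the names, the unmasking schedule, and the model interface are assumptions, not the paper's code): every output token has a fixed slot from the start, and sampling only chooses the order in which slots get committed.

    ```python
    import math
    import torch

    MASK_ID = 50257  # hypothetical mask-token id

    @torch.no_grad()
    def remasking_sample(model, x: torch.Tensor, steps: int = 64) -> torch.Tensor:
        """Fill masked slots by confidence; slots never move, only the fill order varies.

        x: (1, seq_len) long tensor of prompt tokens plus MASK_ID placeholders.
        model: callable returning logits of shape (1, seq_len, vocab_size).
        """
        for step in range(steps):
            masked = x == MASK_ID
            n_masked = int(masked.sum())
            if n_masked == 0:
                break
            probs = model(x).softmax(dim=-1)
            conf, preds = probs.max(dim=-1)              # best guess + confidence per slot
            conf = torch.where(masked, conf, torch.full_like(conf, -math.inf))
            # Commit the most confident guesses this step; the rest stay masked and
            # are re-predicted next step (low-confidence remasking, simplified schedule).
            k = max(math.ceil(n_masked / (steps - step)), 1)
            commit = conf.topk(k, dim=-1).indices        # slots to fill this step
            x[0, commit[0]] = preds[0, commit[0]]
        return x
    ```

    The key constraint is that the sampler can change which slot it fills next, but it can never insert or delete a slot.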