Large Language Diffusion Models

  • As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2)

    But at face value, a new architectural approach at the same capacity (8B), trained on roughly 1/6th the tokens, being competitive with llama3-8b is exciting

  • Not sure why they included a hallucination as one of their first examples (The Empire Strikes Back was directed by Irvin Kershner, not George Lucas):

    "Please recommend me three famous movies"

    "The Empire Strikes Back (1980) - Directed by George Lucas"

  • It doesn't seem to support variable-length input and output, does it?

    The paper seems to use EOS padding to create fixed-length inputs and outputs.

    So is there a maximum output length? A rough sketch of the idea is below.
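
    A rough sketch of how that might look (the token ids, the `model`/`diffusion_sample` names, and the helper functions are hypothetical placeholders, not the paper's actual API): the caller fixes the generation length up front, answers shorter than that are padded out with EOS tokens, and the padding is trimmed afterwards.

    ```python
    import torch

    MASK_ID = 50257  # hypothetical id of the [MASK] token
    EOS_ID = 50256   # hypothetical id of the EOS token

    def build_input(prompt_ids: torch.Tensor, gen_len: int) -> torch.Tensor:
        """Prompt followed by a fixed-size block of mask tokens to be denoised."""
        masked_block = torch.full((gen_len,), MASK_ID, dtype=prompt_ids.dtype)
        return torch.cat([prompt_ids, masked_block])

    def trim_eos_padding(output_ids: torch.Tensor) -> torch.Tensor:
        """Everything from the first EOS onward is padding; drop it."""
        eos_positions = (output_ids == EOS_ID).nonzero(as_tuple=True)[0]
        return output_ids[: int(eos_positions[0])] if len(eos_positions) else output_ids

    # Usage sketch: gen_len acts as an explicit cap on the answer length per call.
    # x = build_input(prompt_ids, gen_len=128)
    # filled = diffusion_sample(model, x)                 # hypothetical denoising loop
    # answer = trim_eos_padding(filled[len(prompt_ids):])
    ```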

  • Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.

    Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space? (A rough sketch of the sampling loop is below.)
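
    To make the "no reordering" point concrete, here is a minimal sketch of a confidence-based remasking sampler, one common way to decode masked diffusion LMs (the names, the unmasking schedule, and the model interface are assumptions, not the paper's code): every output token has a fixed slot from the start, and sampling only chooses the order in which slots get committed.

    ```python
    import math
    import torch

    MASK_ID = 50257  # hypothetical mask-token id

    @torch.no_grad()
    def remasking_sample(model, x: torch.Tensor, steps: int = 64) -> torch.Tensor:
        """Fill masked slots by confidence; slots never move, only the fill order varies.

        x: (1, seq_len) long tensor of prompt tokens plus MASK_ID placeholders.
        model: callable returning logits of shape (1, seq_len, vocab_size).
        """
        for step in range(steps):
            masked = x == MASK_ID
            n_masked = int(masked.sum())
            if n_masked == 0:
                break
            probs = model(x).softmax(dim=-1)
            conf, preds = probs.max(dim=-1)              # best guess + confidence per slot
            conf = torch.where(masked, conf, torch.full_like(conf, -math.inf))
            # Commit the most confident guesses this step; the rest stay masked and
            # are re-predicted next step (low-confidence remasking, simplified schedule).
            k = max(math.ceil(n_masked / (steps - step)), 1)
            commit = conf.topk(k, dim=-1).indices        # slots to fill this step
            x[0, commit[0]] = preds[0, commit[0]]
        return x
    ```

    The key constraint is that the sampler can change which slot it fills next, but it can never insert or delete a slot.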