Implementing DeepSeek R1's GRPO algorithm from scratch

  • I wonder whether they implemented the GRPO correction from this paper, which fixes GRPO's tendency to produce overly long responses: https://arxiv.org/abs/2503.20783

    Probably not, since they don't mention it.
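
    For context, here is a rough sketch of what I understand the correction to be. This is my reading of the linked paper, not code from the paper or from the article. As I understand it, the paper argues that two normalization terms in GRPO introduce bias: dividing each response's loss by its own length, and dividing advantages by the group's reward standard deviation. The function names, the toy rewards and lengths, and the max_len constant below are all made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical helper names; a toy sketch, not the paper's or the article's code.

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: center by the group mean and divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def centered_advantages(rewards):
    """Corrected variant as I read it: center by the group mean only."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def per_token_weights(advantages, lengths, max_len=None):
    """Weight that each token of a response contributes to the loss.

    max_len=None divides by the response's own length (GRPO-style).
    Passing max_len divides every response by the same constant instead.
    """
    return [a / (n if max_len is None else max_len)
            for a, n in zip(advantages, lengths)]

# Toy group of four sampled responses: reward 1 = correct, 0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [10, 10, 100, 10]  # the third response is wrong and very long

adv = centered_advantages(rewards)
length_normalized = per_token_weights(adv, lengths)
constant_normalized = per_token_weights(adv, lengths, max_len=100)

# Index 1 is a short wrong answer, index 2 is a long wrong answer.
print("own-length normalization:", length_normalized[1], length_normalized[2])
print("constant normalization:  ", constant_normalized[1], constant_normalized[2])
```

    With own-length normalization, the long wrong answer gets a smaller per-token penalty than the short wrong answer, which is the incentive toward long incorrect responses. With a constant divisor, both wrong answers are penalized equally per token.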