GPT-3.5/4 response times are linear with output tokens

  • It's funny seeing this measured experimentally, without an explanation, on a .ai site. Transformer models must be re-run once for each additional output token, and each run of the fixed-size model costs roughly the same amount of compute. So total generation time is linear in the number of output tokens, and this is the expected result.

    Seeing something different from this would point to an advancement. If someone discovered how to generate sublinearly, a company like OpenAI might even want to hide it by artificially delaying responses to preserve the cost advantage.
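    A toy sketch of why this holds (not actual model code; `fake_forward_pass` is a hypothetical stand-in for one transformer forward pass):

    ```python
    def fake_forward_pass(context):
        # Stand-in for one forward pass of a fixed-size model: the
        # cost per call is roughly constant (with KV caching, only
        # the attention over the cached context grows, and slowly).
        return len(context)  # hypothetical "next token"

    def generate(prompt_tokens, n_new_tokens):
        context = list(prompt_tokens)
        model_runs = 0
        for _ in range(n_new_tokens):
            next_tok = fake_forward_pass(context)  # one full model invocation
            context.append(next_tok)               # feed it back in
            model_runs += 1
        return context, model_runs

    _, runs = generate([1, 2, 3], 100)
    # The model runs exactly once per generated token, so with a
    # roughly constant per-run cost, wall-clock time is linear in
    # the number of output tokens.
    ```

    Each output token requires feeding the whole sequence so far back through the model, which is why response time tracks output length rather than, say, prompt length alone.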