Unlike text generation with LLMs, text-to-video generation brings unique challenges: balancing realism, prompt alignment, and artistic vision is far more nuanced and subjective than evaluating generated code.
But how do we measure the quality of the outputs?
Is color choice more important than realism, or is it the composition of the scene?
We’ve launched a Text-to-Video Model Leaderboard to explore these questions, inspired by the LLM leaderboard at lmarena.ai. Our idea: many models exist, but only an unbiased, head-to-head comparison can reveal what users of text-to-video models actually find most important.
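For anyone curious how arena-style leaderboards turn pairwise votes into a ranking, here is a minimal sketch of an Elo-style update, the approach lmarena.ai popularized. This illustrates the general technique only; the K value of 32 and starting rating of 1000 are assumptions, and this is not necessarily how our leaderboard computes its scores.

    # Sketch: Elo-style rating updates from pairwise human votes.
    # Illustrative only; K and starting ratings are assumed values.

    K = 32  # update step size (assumption)

    def expected(r_a: float, r_b: float) -> float:
        """Expected win probability of A against B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
        """Return both models' new ratings after one head-to-head vote."""
        e_a = expected(r_a, r_b)
        s_a = 1.0 if a_wins else 0.0
        return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

    # Example: both models start at 1000; a user prefers HunyuanVideo over Mochi1.
    ratings = {"HunyuanVideo": 1000.0, "Mochi1": 1000.0}
    ratings["HunyuanVideo"], ratings["Mochi1"] = update(
        ratings["HunyuanVideo"], ratings["Mochi1"], a_wins=True
    )
    print(ratings)  # HunyuanVideo moves up; Mochi1 moves down by the same amount

Over many votes from many users, the ratings converge toward a ranking that reflects aggregate preference rather than any single benchmark.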
Right now, the leaderboard includes five open-source models:
* HunyuanVideo
* Mochi1
* CogVideoX-5b
* Open-Sora 1.2
* PyramidFlow
We plan to expand it to include proprietary models from Kling AI, LumaLabs.ai, and Pika.art. You can check out the current leaderboard here: https://t2vleaderboard.lambdalabs.com/leaderboard/

We’re looking for feedback from the HN community:
* How should text-to-video models be evaluated?
* What criteria or benchmarks would you find meaningful?
* Are there other models we should include?
We’d love to hear your thoughts and suggestions!