Hi, thanks for the great benchmark!
I have a question regarding the evaluation setup for LiveCodeBench v6. For context, I have been referring to:
- https://www.emergentmind.com/topics/livecodebench-v5-v6-pro
- https://livecodebench.github.io/leaderboard.html
From the documentation, it seems that:
- LiveCodeBench v6 contains 454 problems collected from Aug 2024 to May 2025.
However, in practice I observed that:
- The test_v6 split on HuggingFace contains only 175 problems.
- The difficulty distribution appears to be 75 Easy / 75 Medium / 25 Hard, which matches the commonly reported evaluation setup (see the snippet below for how I checked).
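For reproducibility, here is a minimal sketch of how I counted the problems. The dataset path (`livecodebench/code_generation_lite`), the split name `test_v6`, and the `difficulty` field are assumptions based on my local setup, so please correct me if the official protocol loads the data differently:

```python
from collections import Counter
from datasets import load_dataset

# Assumed dataset path and split name -- this is how I loaded the data
# locally; please point me to the official loading path if it differs.
ds = load_dataset(
    "livecodebench/code_generation_lite",
    split="test_v6",
    trust_remote_code=True,
)

print(len(ds))                    # 175 in my run
print(Counter(ds["difficulty"]))  # 75 easy / 75 medium / 25 hard in my run
```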
Could you clarify the intended evaluation protocol?
- Is 454 the total dataset size, while 175 is the standard evaluation subset?
- Should experiments reported in papers follow the 175-problem split?
- Is there an official list defining this evaluation subset?
I want to make sure my evaluation setup is consistent with the intended benchmark protocol.
Thanks again for releasing LiveCodeBench!