gertlabs 12 hours ago [-]
While this benchmark has interesting results, the "Contamination free" label only works for the initial release of the benchmark. It still has the same fundamental design issues of any other benchmark-- there's a single correct answer for tasks. It looks to be largely saturated upon release.
What they did well: normalizing the harness to mini-swe-agent -- models should be able to generalize to different tools at this point. When they struggle to do that (like most Google models), they're unlikely to be useful in practice. And that kind of generalization is an inherent part of intelligence.
For a benchmark that scales, you need to remove the ceiling and provide environments with measurable goals that are NOT a single correct answer, and sufficiently complex evaluation criteria to scale well beyond the current frontier.
We're still relatively unknown in the benchmarking space, but by rotating the pool of environments and ensuring the optimal strategies in the environments themselves are affected by other agents participating in the space, we expect we'll be able to resist contamination as major labs start investing more effort to climb the leaderboard. We've already seen Chinese labs taking an interest.
fiso64 6 hours ago [-]
The fact that claude and gpt 5.5 have nearly the same scores tells me your benchmark is not capturing a significant gap in capability between these two. What the linked page says about Claude is true in my experience: It frequently forgets important instructions and likes to take lazy shortcuts. Gpt by contrast is much more attentive and takes its time when needed to deliver a complete and robust solution. I have tested both models on two private repos (c#, go) on two long-horizon tasks with well-defined stop conditions and observed the same pattern in both cases. Both models still require a large harness to reduce shortcuts and architecturally unclean code, but gpt performs much better, to the point where I find claude unusable for any significant work.
gertlabs 6 hours ago [-]
GPT 5.5 does significantly outperform Opus 4.7 in the coding parts of our evals.
We also incorporate live decision making on social games (where GPT 5.5 has actually regressed from earlier models, which shouldn't be a huge surprise if you ever tried talking it out of some of its nits).
We are still looking for a way to integrate "logical" intelligence with social intelligence in a less arbitrary way, so I'd take a look at the use case that applies to you (probably coding): https://gertlabs.com/rankings?mode=agentic_coding
vanuatu 9 hours ago [-]
1. your 'agentic coding' benchmarks are already saturated, with mimo #2? Cmon
2. game rl is fundamentally less useful than coding or work rl
gertlabs 8 hours ago [-]
Check out the methodology section at the bottom -- we are trying to better convey this information.
1. These numbers are based on percentiles, which inherently can't be saturated. Most benchmarks operate on something like 0-100% of correct answers, so it's natural to make that assumption when you see our numbers. Perhaps we should divide by 100. We create a modified score based on percentiles against other agents, which rebalances every time we add new entries. So when a new frontier model comes out, all of the existing entries get downweighted if the new model outperforms them. And MiMo V2.5 Pro is a much stronger model than people realize.
2. Agents write code to play most of these games (accounting for ~80% of the combined bench score). There is increasing evidence that nearly identical patterns of weights emerge in different models, trained on different mediums and using different algorithms. Pattern matching and extrapolation don't care if the scenario is a 3D "game" environment or a Salesforce "work RL" environment. Examples of drawing distant connections in different domains can reward similar circuitry.
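To make point 1 concrete, here is a minimal sketch of how a percentile-based score can rebalance when a new entry joins the pool. It is an illustration only, not the actual scoring code behind the leaderboard: the function name `percentile_scores`, the mid-rank tie convention, and the agent names and raw values are all assumptions made up for the example.

```python
from bisect import bisect_left, bisect_right


def percentile_scores(raw: dict[str, float]) -> dict[str, float]:
    """Map each agent's raw result to a percentile rank (0-100) within the pool.

    Hypothetical helper: uses the mid-rank convention, where an entry
    counts everything strictly below it plus half of the entries tied with it.
    """
    values = sorted(raw.values())
    n = len(values)
    scores = {}
    for name, value in raw.items():
        below = bisect_left(values, value)           # entries strictly below
        equal = bisect_right(values, value) - below  # entries tied (incl. itself)
        scores[name] = 100.0 * (below + 0.5 * equal) / n
    return scores


# Made-up raw results for four hypothetical agents.
pool = {"agent_a": 10.0, "agent_b": 20.0, "agent_c": 30.0, "agent_d": 40.0}
before = percentile_scores(pool)

# A stronger newcomer joins; every existing percentile is recomputed.
pool["agent_e"] = 50.0
after = percentile_scores(pool)

print(before["agent_d"], after["agent_d"])  # 87.5 70.0
```

In this toy pool the previous leader drops from 87.5 to 70.0 without its raw result changing, which is the downweighting effect described above: the score is relative to the field, so there is no fixed ceiling to hit.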
charleyslee 6 hours ago [-]
[flagged]
vanuatu 13 hours ago [-]
This benchmark matches my experience with GPT (I occasionally go back to Claude when I run into limits and frequently run into forgotten requirements and reward hacking)
I do have two questions / critiques:
- The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major qualms with SOTA coding models i.e. they lack code 'taste'. Ofc this is a difficult problem to solve at scale, but wanted to point that out nonetheless
- This almost feels written like a critique on SWE Bench Pro. Hopefully they fix the issues with that benchmark!
It seems like GPT here is failing due to an environment issue with connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it ran into the same problem.
I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived vs. what a typical user would ask their coding agent (such is the difficulty of benchmark construction)
dnnssl2 16 hours ago [-]
70% at launch seems pretty saturated, why ship a benchmark frontier models are about to top out on?
vanuatu 13 hours ago [-]
sell data for them to hillclimb :)
charleyslee 15 hours ago [-]
[flagged]
JacobAsmuth 13 hours ago [-]
I wonder why they didn't test Gemini 3.5 Flash (High).
charleyslee 7 hours ago [-]
in small-scale testing we found high effort on gemini 3.5 flash caused it to overthink, generating large amounts of tokens without a substantive improvement in performance.
charleyslee 16 hours ago [-]
tysm for posting this! i'm charley, cofounder of datacurve, we created this benchmark and my team and i are here to answer any q's.
ammar_x 3 hours ago [-]
Absolutely! We need new and better benchmarks like this.
I have a question: why not use the maximum available reasoning on each LLM? For example, I see that Opus 4.7 is at `max` reasoning but Sonnet 4.6 is at `high`. Wouldn't it be a fairer comparison if all were at max?
davidshepherd7 4 hours ago [-]
Did you try Opus-4.7 on a lower reasoning level? Looks like on `max` it's using far more tokens than the other frontier models.
We do this by running multi-agent simulations with large action spaces at https://gertlabs.com/rankings.
https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...