I've been underwhelmed by dedicated tools like Windsurf and Cursor, in the sense that they're usually more annoying than just using ChatGPT. They have their niche, but they're so incredibly flow-destroying that it's hard to use them for long periods of time.
I just started using Codex casually a few days ago, though, and already have 3 PRs. While different tools for different purposes make sense, Codex's fully async nature is so much nicer. It handles simple things like improving consistency and making small improvements quite well. Finally we have something that operates more like an appliance for certain classes of problems; previously it felt more like a teenager with a learner's license.
It is also worth looking at the number of unique repositories for each agent, or the number of unique large repositories (e.g., by thresholding on the number of stars). Here is a report we can check (a rough sketch of the kind of query involved is below):
https://play.clickhouse.com/play?user=play#V0lUSCByZXBvX3N0Y...
I've also added some less popular agents like jetbrains-junie, and added a link to a random pull request for each agent, so we can look at example PRs.
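Roughly, the unique-repository breakdown can be computed with a query of this shape. This is only a sketch: the column names come from the public github_events dataset as I understand it, the branch-prefix detection rule and the 100-star threshold are my assumptions, and the actual report behind the link may differ.

    WITH stars AS
    (
        -- approximate star counts from WatchEvent rows
        SELECT repo_name, count() AS star_count
        FROM github_events
        WHERE event_type = 'WatchEvent'
        GROUP BY repo_name
    )
    SELECT
        -- guess the agent from the PR head branch prefix (assumed detection rule)
        multiIf(
            head_ref LIKE 'codex/%',   'OpenAI Codex',
            head_ref LIKE 'copilot/%', 'GitHub Copilot',
            head_ref LIKE 'cursor/%',  'Cursor',
            'other') AS agent,
        uniq(pr.repo_name) AS unique_repos,
        uniqIf(pr.repo_name, star_count >= 100) AS unique_repos_100_plus_stars
    FROM github_events AS pr
    LEFT JOIN stars ON stars.repo_name = pr.repo_name
    WHERE pr.event_type = 'PullRequestEvent'
      AND pr.action = 'opened'
      AND agent != 'other'
    GROUP BY agent
    ORDER BY unique_repos DESC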
How about Google Jules?
Also, of course OpenAI Codex would perform well: the tool is heavily tailored to this type of task, whereas Cursor is a more general-purpose (within the programming domain) tool/app.
Where is Claude Code? Surprised to see it completely left out of this analysis.
Merge rate is definitely a useful signal, but there are certainly other factors we need to consider (small vs. big PR edits, refactors vs. dependency upgrades, direct merges, follow-up PRs correcting merged mistakes, how easy it is to set up these AI agents, marketing, usage fees, etc.). Similar to how NPM downloads alone don't necessarily reflect a package's true success or quality.
This might be an obvious question, but why is Claude Code not included?
Is this data not somewhat tainted by the fact that there's really zero way to identify how much a human was or wasn't "in the loop" before the PR was created?
Wasn't Codex only released recently? Why is it represented an order of magnitude more than the others?
It's hard to attribute a higher PR merge rate to higher tool quality here. Another likely factor is the complexity of the task. Just looking at the first PR I saw in the GitHub search for Codex PRs, it was a one-line change that any tool, even years ago, could have easily accomplished: https://github.com/maruyamamasaya/yasukaribike/pull/20/files
This is great work. Would love to see the Augmentcode.com remote agent. If you're down, OP, message me and I'll give you a free subscription to add it to the test.
For people using these, is there an advantage to having the agent create PRs and reviewing these versus just iterating with Cursor/Claude Code locally before committing? It seems like additional bureaucracy and process when you could fix the errors sooner and closer to the source.
All these tools seem to be GitHub-centric. Any tips for teams using GitLab to store their repositories?
Can I get a clarification on the data here: are these PRs merely reviewed by the tools, or fully authored by them?
Also, filter conditions that would be interesting: size of PR, language, files affected, distinct organizations, etc. Let me know if these get added, please!
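For example, a distinct-organizations cut could look roughly like this (a sketch only, assuming repo_name is 'org/repo' and the same branch-prefix detection as the report; not necessarily how the dashboard is actually built):

    SELECT
        -- the organization is the part of repo_name before the slash
        uniq(splitByChar('/', repo_name)[1]) AS distinct_orgs
    FROM github_events
    WHERE event_type = 'PullRequestEvent'
      AND action = 'opened'
      AND head_ref LIKE 'codex/%'   -- repeat (or group) per agent's branch prefix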
I can't be the only one annoyed by the square/circle mismatch between the legend and the graph?
How does this analysis handle potential false positives? For instance, if a user coincidentally names their branch `codex/my-branch`, would it be incorrectly included in the "Codex" statistics?
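One rough way to bound this, assuming the detection really is the head_ref prefix, is to count matching PRs that predate the product's launch, since those can only be coincidental branch names. The cutoff below is my assumption (the Codex agent's research preview launched around mid-May 2025):

    SELECT
        created_at < toDateTime('2025-05-16 00:00:00') AS before_codex_launch,
        count() AS prs
    FROM github_events
    WHERE event_type = 'PullRequestEvent'
      AND action = 'opened'
      AND head_ref LIKE 'codex/%'
    GROUP BY before_codex_launch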
Total PRs for Codex vs. Cursor are 208K vs. 705, an enormous difference in absolute numbers. Since Cursor is very popular, how is its PR count not even 1% of Codex's?
Yeah, GitHub Copilot PRs are unusable, from personal experience.
Why are there 170K PRs for a product released last month, but 700 for a product that has been around for about 6 months and was so popular it got acquired for $3B?
Missing data: I don't make a codex PR if it's nonsense.
Poor data: if I do make one, it's because I want to do one of the following:
a) Merge it (success)
b) Modify it (sometimes success, sometimes not). In one case, Codex made the wrong changes in all the right places, but it was still easier to work from that by hand.
c) Pick ideas from it (partial success)
So simple merge rates don't say much.
Is it just me, or are there a lot of documentation-related PRs? Not a majority, but enough to mask the impact of agent code.
Stats? What about the vibes leaderboard?
Agents should also sign the PR with secret keys so people can't just fake the commit message.
Seems like the high-order bit impacting results here might be how difficult the PR is?
Could be nice to add a "merged PR with a test" metric. Looking at the PRs, they are mostly without tests, so they could be bogus for all we know.
Just curious, why is there no reference to Google?
I was expecting a better definition of “performance”. Merging a garbage PR shouldn’t be a positive uptick.
thanks for posting my project bradda
Wow, this is an amazing project. Great work!
(Disclaimer: I work on coding agents at GitHub)
This data is great, and it is exciting to see the rapid growth of autonomous coding agents across GitHub.
One thing to keep in mind regarding merge rates is that each of these products creates the PR at a different phase of the work. So just tracking PR creation to PR merge tells a different story for each product.
In some cases, the work to iterate on the AI-generated code (and potentially abandon it if it isn't sufficiently good) is done in private, and only pushed to a GitHub PR once the user decides they are ready to share/merge. This is the case for Codex, for example. The merge rates for product experiences like this will look good in the stats presented here, even if many AI-generated code changes are being abandoned privately.
For other product experiences, the Draft PR is generated immediately when a task is assigned, and users can iterate on it “in the open” with the coding agent. This creates more transparency into both the success and failure cases (including logs of the agent sessions for both). This is the case for the GitHub Copilot coding agent, for example. We believe this “learning in the open” is valuable for individuals, teams, and the industry. But it does lead to the merge rates reported here appearing worse - even though, logically, they are the equivalent of what a “task assignment to merged PR” success rate would be for other tools.
We’re looking forward to continuing to evolve the notion of a Draft PR to be even more natural for these use cases, and to enabling all of these coding agents to benefit from open collaboration on GitHub.