Am I supposed to be impressed by this? I think people are now just using agents for the sake of it. I'm perfectly happy running two simple agents, one for writing and one for reviewing. I don't need to go be writing code at faster than light speed. Just focusing on the spec, and watching the agent as it does its work and intervening when it goes sideways is perfectly fine with me. I'm doing 5-7x productivity easily, and don't need more than that.
I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.
You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.
The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.
I call this "Test Theatre" and it is real. I wrote about it last year:
https://benhouston3d.com/blog/the-rise-of-test-theater
You have to actively work against it.
This is Java EE all over again.
When I graduated in 2012 it was pushed everywhere, including my uni so my undergrad thesis was done in Java.
Everyone was learning it, certifying, building things on top of other things.
EJB, JPA, JTA, JNDI, JMS and JCA.
And them more things to make it even more powerful with Servlets, JSP, JSTL, JSF.
Many companies invested and built various application servers, used by enterprises by this day.
Every engineer I've met said Java is server side future, don't bother with other tech. You'll just draw data schema, persistence mapping, business logic and ship it.
I switched to C++ after Bjarne's talk I attended in 2013. I'm glad I did although I never worked as a software engineer. Following passion and going deep into technology was a bliss for me, the difference between my undergrad Java, Master C++ and Rust PhD is like a kids toy and a real turboprop engine.
Don't follow the hype - it will go away and you'll be left with what you've invested into.
Does anyone know what this guy is having his agents build? Bc I looked a bit and all I see him ship is linkedin posts about Claude.
It's... really the same problem when you hire people to just write tests. A lot of time it just confirms that the code does what the code does. Having clear specs of what the code should do make things better and clearer.
Been running 6 AI agents for my solo operation for a few months now. One does market research, another writes content, third handles video scripts. Not coding agents - business operations agents.
The overnight thing is real but overhyped. What actually works is giving agents very narrow tasks with clear success criteria. "Research top 10 Reddit threads about X and summarize pain points" works great. "Build me a feature" overnight is a coin flip.
Biggest lesson: the bottleneck moved from execution to context management. Getting agents to remember what matters and forget what doesn't is harder than the actual task delegation.
It’s not yet possible due to context size limitations.
LLMs can’t retain most codebases nor even most code files accurately - they start making serious mistakes at ~500 lines.
Paste a ~200 line React component or API endpoint, have it fix or add something, it’s fine, but paste a huge file, it starts omitting pieces, making mistakes, and it gets worse as time goes on.
You have to keep reminding it by repeatedly refreshing context with the part in question.
Everyone who has seriously tried knows this.
For this reason alone the LLM “agent” is simply not one. Not yet. It cannot really drive itself and it’s a fundamental limitation of the technology.
Someone who knows more about model architecture might be able to chime in on why increasing the context size will/won’t help agents retain a larger working memory to acceptable degrees of accuracy, but as it stands it’s so limited that it works more like a calculator that you must actively use rather than an autonomous agent.
Cost is the silent killer of always-on agents. We track inference pricing across 40+ vendors daily and the gap between the most and least expensive option for the same model can be over 90%. Worth building cost awareness into your agent stack from day one. a7om.com/mcp
At this stage, AI is no longer a tool that enhances your ability to ship code, it has replaced you entirely in that role. You don't control what is shipped, and you can't verify if it's correct. That's a serious problem! As software engineers, we remain accountable for code we no longer fully understand.
Then, what comes next feels less like a new software practice and more like a new religion, where trust has to replaces understanding, and the code is no longer ours to question.
I've been doing differential testing in Gemini CLI using sub-agents. The idea is:
1. one agent writes/updates code from the spec
2. one agent writes/updates tests from identified edge cases in the spec.
3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as so they can update their code.
(repeat 1 and/or 2 then 3 until all tests pass)
Since the code can never fix itself to directly pass the test and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both have the same incorrect understanding of the spec (which is very improbable, like something that will happen before the heat death of the universe improbable, it is much more likely the spec isn't well grounded/ambiguous/contradictory or that the problem is too big for the LLM to handle and so the tests simply never wind up passing).
I guess to reach this point you have already decided you don't care what the code looks like.
Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?
Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.
One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Sounds like we've just gotten into lazy mode where we believe that whatever it spits out is good enough. Or rather, we want to believe it, and convince ourselves that some simple guardrail we put up will make it true, because God forbid we have to use our own brain again.
What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.
I just don't understand where these people get all this money. The answer is often "oh it's just claude max" like man I don't have 200$ MONTHLY lying around ?? That's half my rent
Why would I ever want to book a course with someone who just realized weeks ago they don't know if the code does what they want if they don't look at it?
I've been impressed by Google Jules since the Gemini 3.1 Pro update. Sometimes it's been working on a task for 4h. I've now put it in a ralph loop using a Github Action to call itself and auto merge PRs after the linter, formatter and tests pass. It does still occasionally want my approval, but most of the time I just say Sounds great!
It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj
> A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
I can't understand the mindset that would lead someone not to have realized this from the beginning.
And here I am turning my computer off at night for energy consumption, while others run a few extra ones for... for what, anyway? If you're working on problems real people are having (diseases, climate change, poverty, etc.) then sure, but exacerbating the energy transition for a blog post and your personal brand as OP seems to do? How's that not criminal
You can find approaches that improve things, but there's always going to be a chance that your code is terrible if you let an LLM generate it and don't review it with human eyes.
But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.
If the business you're supporting can't tolerate much incorrectness (at least until discovered), than guess what - you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/
Hm, what's actually being shipped here?
I've been playing around with agent orchestration recently and at least tried to make useful outputs. The biggest differences were having pipelines talk to each other and making most of the work deterministic scripts instead of more LLM calls (funnily enough).
Made a post about it here in case anyone is interested about the technicals: https://www.frequency.sh/blog/introducing-frequency/
Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests.
TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.
TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”
Thank you for coming to my TED^H^H^H TDD talk.
Remember this, guys? https://agilemanifesto.org/
All these macho men - I wonder what exactly are they shipping at that pace?
Not a rhetoric question. Trillion token burners and such.
I'm running 8 specialized AI agents on a Mac Mini right now. They handle research, content strategy, writing, security audits, code, and visual design. They run on cron schedules, have persistent memory between sessions, and each one improves weekly through self-improvement loops.
The cost concern is real but manageable. The key is routing models by task. Complex reasoning gets Opus, routine work gets Sonnet, mechanical tasks get Haiku. Not everything needs the expensive model.
The quality concern is the bigger one. What people miss about autonomous agents is that "running unsupervised" doesn't mean "running without guardrails." Each of my agents has explicit escalation rules, a security agent that audits the others, and a daily health report system that catches failures. The agents that work best are the ones with built-in disagreement, not the ones that just pass things through.
Wrote up the full architecture here if anyone's curious about the multi-agent coordination patterns: https://clelp.com/blog/how-we-built-8-agent-ai-team
Code and Claude Code hooks can conditionally tell the model anything:
#!python
print(“fix needed: method ABC needs a return type annotation on line 45”
import os
os.exit(2)
Claude Code will show that output to the model. This lets you enforce anything from TDD to a ban on window.alert() in code - deterministically.
This can be the basis for much more predictable enforcement of rules and standards in your codebase.
Once you get used to code based guardrails, you’ll see how silly the current state of the art is: why do we pack the context full of instructions, distract the model from its task, then act all surprised when it doesn’t follow them perfectly!
The concept of long-running background agents sounds appealing, but the real challenge tends to be reliability and task definition rather than raw model capability.
If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.
They're definitely inferior to proper tests, but even weak CC tests on top of CC code is an improvement over no tests. If CC does make a change that shifts something dramatically even a weak test may flag enough to get CC to investigate.
Even better though - external test suits. Recently made a S3 server of which the LLM made quick work for MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really good as TDD though.
Many times there is really no way of getting around some of the expert-human judgement complexity of the larger question of "How to get agents to build reliably".
One example I have been experimenting is using Learning Tests[1]. The idea is that when something new is introduced in the system the Agent must execute a high value test to teach itself how to use this piece of code. Because these should be high leverage i.e. they can really help any one understand the code base better, they should be exceptionally well chosen for AIs to use to iterate. But again this is just the expert-human judgement complexity shifted to identifying these for AI to learn from. In code bases that code Millions of LoC in new features in days, this would require careful work by the human.
[1] https://anthonysciamanna.com/2019/08/22/the-continuous-value...
Solo founder here, shipping a real product built mostly with AI. The code review thing is real but my actual daily pain is different. AI lies about being done. It'll say "implemented" and what it actually did is add a placeholder with a TODO comment. Or it silently adds a fallback path that returns hardcoded data when the real API fails, and now your app "works" but nothing is real.
I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.
The 80% it writes is fine. The problem is you still have to verify 100% of it.
I tried getting the ai to write the tests. It created placeholders that contained no code but returned a success.
Seems like QA is the new prompt engineering
> Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.
This resonates with my experience, and it is also a refreshing honest take: pushing back on heavy upfront process isn't laziness, it's just the natural engineers drive to build things and feel productive.
Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.
He admits the real hole himself: "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass."
But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.
The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
I am getting started to claude projects... Any usefull things . . . worth knowing that saves free limits . . . . .
It's an interesting problem that even though it's represented by just you as a single person, I think this is shared across the board with larger corporations at scale. I know for example they were seeing this with game devs in regards to the Godot engine. So many people were uploading work done by AI that has been unverified that people just can't keep up with it. And maybe some of it's good, but how do you vet all the crap out? No one knows what's being written anymore (and non-devs can code now too, which is amazing, but part of the problem that we introduced). I think in the future of being a developer will be more about verifying code integrity and working with AI to ensure it is meeting said standards. Rather than actually being in the driver's seat. Not sexy, but we're handing the keys over willingly, yet, AI is only interpreting the intent. It's going to get things wrong no matter what we do.
Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments.
Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.
In the end you'll always have to manually validate the output, to ensure that what the test case tests is correct. When you write a test case, that's always what you need to do, to ensure that the test case passes in the right conditions, and you have to test that manually.
Since you have to test that manually anyway, you can have AI write the code first; you test it; if it's the right result, you tell AI this is correct, so write test cases for this result.
Somewhat unrelated but are there good boilerplate/starter repos that are optimized for agent based development? Setting up the skills/MCPs/AGENTS.md files seems like a lot of work.
> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.
To everyone who plan on automating themselves out of a job by taking the human element out- this is the endgame that management wants: replacing your (expensive and non-tax-optimized) labor with scalable Opex.
This is a really good article but I do kind take issue with the intro, because it's the same assertion I see all over the place :
> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
> I care about this. I don't want to push slop
They clearly didn't care about that. They only cared about non stop lines of code generation and shipping anything fast. Otherwise they wouldn't need weeks to realise that they weren't reading or testing this code - it's obvious from the outset.
Maybe their approach to this changed and that's fine, but at the beginning they very much did not care and I feel people only keep saying that do because otherwise they'd need to be the one to admit the emperor isn't wearing clothes.
> When Claude writes tests for code Claude just wrote, it's checking its own work.
You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.
This is TDD? Tests first, then code? I do first the docs, then the tests, then the code. For years.
What he describes is like that. Just that the plan step is suggesting docs, not writing actual docs.
Regarding the self-congratulation machine - I simply use a different claude code session to do the reviews. There is no self-congratulation, but overly critical at times. Works well.
Honestly, sometimes the harnesses, specs, some predefined structure for skills etc all feel over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.
Claude.md file with general guidelines about our repo has worked extraordinarily good, without any external wrappers, harnesses or special prompts. Even the MD file has no specific structure, just instructions or notes in English.
The hardest part of running agents autonomously is the data quality problem. When your agent runs unsupervised, every decision is only as good as the data it pulls. Having agents access authoritative structured sources (government APIs, international org datasets) rather than scraping random pages makes a huge difference. The real failure mode is not hallucination - it is the agent confidently acting on unreliable data.
the overnight cost thing is real. "$200 in 3 days" is actually pretty tame compared to what happens when you have agents spawning sub-tasks without a budget cap.
the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.
so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.
I am afraid that we are heading to a world in which we simply give up on the idea of correct code as an aspiration to strive for. Of course code has always been bad, and of course good code has never been a goal in the whole startup ecosystem (for perfectly legitimate reasons!). But that real production code, for services that millions or even billions of people rely on, should be reliable, that if it breaks that's a problem, this is the whole _engineering_ part of software engineering. And we can say: if we give that up we're going to have a whole lot more outages, security issues, all those things we are meant to minimize as a profession. And the answer is going to be: so what? We save money overall. And people will get used to software being unreliable; which is to say, people will not have a choice but to get used to it.
One thing I've been wrestling with building persistent agents is memory quality. Most frameworks treat memory as a vector store — everything goes in, nothing gets resolved. Over time the agent is recalling contradictory facts with equal confidence.
The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.
It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.
blog looks suspicious
- privacy policy links to marketing company `beehiiv.com`. the blog author doesn't show up there.
- the profile picture url is `.../Generated_Image_March_03__2026_-_1_55PM.jpg.jpeg`
i didn't dig or read further.
The cowboy gunslinging knows no bounds.
On the off chance that the author reads this: can you enable an RSS feed please?
I want to subscribe, but I never end up reading newsletters if they land in my email inbox.
> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.
That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”
To me the last paragraph was the highest value in the article. Write out your test in plain language first, and then write the prompt for the autonomous agent using your language and the test prompt not the auto-code.
I think the solution has to be end to end tests. Maybe first run by humans, then maybe agents can learn and replicate. I can't see why unit tests really help other than for the LLM to reason about its own code a little more.
I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!
There seems lots of preparing, planning, token buying, set goal, and token cost just to niche target and related vibe coding.
"I've been building agents that write code while I sleep" and "I don't want to push slop" seem directly at odds with each other...
Different approach: copy the programmer's logic, not the agent's behavior.
Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.
Just because you can let Claude run overnight doesn't mean it makes sense if you can no longer review what it has done.
If you don't review the result, who is going to want to use or even pay for this slop?
Reviewing is the new bottleneck. If you cannot review any more code, stop producing new code.
Feels like a whole bunch of us are converging on very similar patterns right now.
I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.
On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.
The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.
Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
Great so i can wake up to a nuked git and wiped drive
Took a super intelligent AI for us to realize how important tests and TDD is.
> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.
Good luck doing that in any company that does something meaningful. I can't believe anybody can seriously be ok with such a workflow, except maybe for your little pet project at home.
How do people not understand this? LLMs are goal machines. You need to give them the specific goal if you want good results and continue to reenforce it. So of course this means speccing and design work.
People are so enamored with how fast the 20% part is now and yes it’s amazing. But the 80% part by time (designing, testing, reviewing, refactoring, repairing) still exists if you want coherent systems of non-trivial complexity.
All the old rules still apply.
I simply can't stand reading this prose, let alone bring myself to care about some vibe-code BS the author couldn't be bothered to write about themselves.
Telling Claude to turn your notes into a blog post with simple, terse language does not hide your own lack of taste.
Honestly I think the "same AI checking same AI" concern is a bit overstated at this point. If the agents don't share context - separate conversations, no common memory - Opus is good enough that they don't really fall into the same patterns. At least at the micro level, like individual functions and logic. Maybe at the macro/architectural level there's still something there but in practice I'm not seeing it much anymore.
Tests cannot show the absence of bugs.
These are fundamentals of CS that we are forgetting as we dismantle all truth and keep rocketing forward into LLM psychosis.
> I care about this. I don't want to push slop, and I had no real answer.
The answer is to write and understand code. You can't not want to push slop, and also want to just use LLMs.
Error: Reached max turns (1)
The example given in the article is acceptance criteria for a login/password entry flow. This is fairly easy to spec-out in terms of AC and TDD.
I have been asking these tools to build other types of projects where it (seems?) much more difficult to verify without a human-in-the-loop. One example is I had asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.
I asked it to add bloom. It looped for hours, failing. I would have to manually verify — because even from images — it couldn't tell what was right and wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into it.
Then I noticed that all of the planet textures were rotating oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is in column major, let me fix that <proceeds to break everything>." or focusing (incorrectly) on UV coordinates and shader code. Eventually Codex told me what I was seeing "was expected" and that I just "felt like it was wrong."
When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no"
I insisted that it change the culling configuration and then it worked fine.
These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes real hard to steer them without domain knowledge and close human review.
Now, Someone has to review tests! Just shifting ownership! Claude has just released 'Code Review'. But I don't think you can leave either one on autopilot.
Code Review: https://news.ycombinator.com/item?id=47313787
It's amazing the length at which people who want to write code go, to not write code.
Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).
Otherwise, what's the point?
I feel the whole industry is having its "Look ma! no hands!" moment.
Time to mature up, and stop acting like sailing is going where the seas take you.
I appear to be in the minority here. Perhaps because I've been practicing TDD for decades, this reads like the blog equivalent of "water is wet."
red / green / refactor is a reasonable way through this problem
I don't think this is right becasue it's talking about Claude like it's a entity in the world. Claude reviewing Claude generated code and framing it like a individual reviewing it's own code isn't the same.
Codex is really good at checking Claude’s work: https://github.com/pjlsergeant/moarcode
I think the idea of running agents while you sleep isn't going to work until AI can match or exceed human-level agency and intelligence.
Whenever I coded any serious solution as a technical co-founder, every single day there was a major new debate about the product direction. Though we made massive 'progress' and built out a whole new universe in software, we haven't yet managed to find product market fit. It's like constant tension. If the intelligence of two relatively intelligent humans with a ton of experience and complimentary expertise isn't enough to find product-market-fit after one year, this gives you an idea about how high the bar is for an AI agent.
It's like the problem was that neither me nor my domain expert co-founder who had been in his industry for over 15 years had a sufficiently accurate worldview about the industry or human psychology to be able to produce a financially viable solution. Technically, it works perfectly but it just doesn't solve anyone's problem.
So just imagine how insanely smart AI has to be to compete in the current market.
Maybe you could have 100 agents building and promoting 100 random apps per day... But my feeling is that you're going to end up spending more money on tokens and domain names then you will earn in profits. Maybe deploy them all under the same domain with different subdomains? Not great for SEO... Also, the market for all these basic low-end apps is going to be extremely competitive.
IMO, the best chance to win will be on medium and complex systems and IMO, these will need some kind of human input.
Somewhat off topic, but any theories as to why the shilling for Claude (not insinuating that's what the OP is doing) is so transparent? For example, the bots/shills often go out of their way to insist you get the $200 plan, in particular. If Anthropic's product is so good: 1) why must it be shilled so hard, and 2) why is the shilling (which is likely partially a result of the product) so obvious? Is this an OpenAI reverse psychology dirty trick, the equivalent of using robocalls to inundate voters with messages telling them to vote for your opponent so as to annoy and negatively dispose them towards your opponent?
A short story: A developer let ClaudeCode manage his AWS infrastructure. The agent ran a Terraform destroy command... Gone: 2 websites, production database, all backups and 2.5 years of data The agent didn't make a mistake. It did exactly what it was allowed to do. That's the problem dude
Do you really, honestly, have to be doing this stuff even when you sleep? To the point it hits you “wait is this even any good? Gee I don’t want to push out slop.”
If you don’t trust the agent to do it right in the first place why do you trust them to implement your tests properly? Nothing but turtles here.
I guess I'll just wait a year until a best practice emerges.
Just don’t use the same model to write and vet the code. Use two or more different models to verify the code in addition to reading it yourself.
Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)
I'm (re)writing a big project with the following approach:
1. Write tons of documentation first. I.e. NASA style, every singe known piece of information that is important to implementation. As it's a rewrite of legacy project, I know pretty much everything I need, so there is very little ideas validation/discovery in the loop for that stage. Documentation is structured in nested folders and multiple small .md files, because its amount already larger than Claude Code context (still fits into Gemini). Some of the core design documents are included into AGENTS.md(with symlink to GEMINI/CLAUDE mds)
For that particular project I spent around 1.5 months writing those docs. I used Claude to help with docs, especially based on the existing code base, but the docs are read and validated by humans, as a single source of truth. For every document I was also throwing Gemini and Codex onto it for analyzing for weaknesses or flaws (that worked great, btw).
2. TDD at it's extreme version. With unit tests, integration tests, e2e, visual testing in Maestro, etc. The whole implementation process is split in multiple modules and phases, but each phase starts with writing tests first. Again, as soon as test plan ready, I also throw it on Gemini and Codex to find flaws, missed edge cases, etc. After implementing tests, one more time - give it to Gemini/Codes to analyze and critique.
3. Actual coding. This part is the fastest now especially with docs and tests in place, but it's still crucial to split work into manageable phases/chunks, and validate every phase manually, and ocassionaly make some rounds of Gemini/Codex independently verifying if the code matches docs and doesn't contain flaws/extra duplication/etc.
I never let Claude to commit to git. I review changes quickly, checking if the structure of code makes sense, skimming over most important files to see if it looks good to me (i.e. no major bullshit, which, frankly, has never happened yet) and commit everything myself. Again, trying to make those phases small enough so my quick skim-review still meaningful.
If my manual inspection/test after each phase show something missing/deviating, first thing I ask is "check if that is in our documentation". And then repeat the loop - update docs, update/add tests, implement.
The project is still in progress, but so far I'm quite happy with the process and the speed. In a way, I feel that "writing documentation" and "TDD" has always been a good practice, but too expensive given that same time could've been spent on writing actual code. AI writing code flipped that dynamics, so I'm happy to spend more time on actual architecting/debating/making choices, then on finger tapping.
> Teams using Claude for everyday PRs are merging 40-50 a week instead of 10
How is this even possible? Am I the only SWE who feels like the easiest part of my job is writing code and this was never the main bottleneck to PR?
Before CC I'd probably spent around 20-30% of my day just writing code into an IE. That's now maybe 10% now. I'd probably also spend 20-30% of my day reading code and investigating issues, which is now maybe 10-15% of my day now using CC to help with investigation and explanations.
But there's a huge part of my day, perhaps the majority it, where I'm just thinking about technical requirements, trying to figure out the right data model & right architecture given those requirements, thinking about the UX, attending meetings, code reviews, QA, etc, etc, etc...
Are these people who are spitting out code literally doing nothing but writing code all day without any thought so now they're seeing 4-5x boosts in output?
For me it's probably made me 50% more efficient in about 40-50% of my work. So I'm probably only like 20-25% more efficient overall. And this assumes that the code I'm getting CC to produce is even comparable to my own, which in my experience it's not without significant effort which just erodes any productivity benefit from the production of code.
If your developers are raising 5x more PRs something is seriously wrong. I suspect that's only possible if they're not thinking through things and just getting CC to decide the requirements, come up with the architecture, decide on implementation details, write the code and test it. Presumably they're also not reviewing PRs, because if they were and there is this many PRs being raised then how does the team have time to spit out code all day using CC?
People who talk about 5x or 10x productivity boosts are either doing something wrong, or just building prototype. As someone who has worked in this industry for 20 years, I literally don't understand how what some people describe can even being happening in functional SWE teams building production software.
None of this really answers the problem of all this slop is being produced at record pace and still requires absorption into the company, into the practices, and be reviewed by a human being.
I don't think AI will ever solve this problem. It will never be more than a tool in the arsenal. Probably the best tool, but a tool nonetheless.
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[flagged]
[dead]
[dead]
[dead]
[flagged]
[flagged]
This _all_ (waves hands around) sounds like alot of work and expense for something that is meant to make programming easier and cheaper.
Writing _all_ (waves hands around various llm wrapper git repos) these frameworks and harnesses, built on top of ever changing models sure doesn't feel sensible.
I don't know what the best way of using these things is, but from my personal experience, the defaults get me a looong way. Letting these things churn away overnight, burning money in the process, with no human oversight seems like something we'll collectively look back at in a few years and laugh about, like using PHP!