LLMs give you superpowers if you work on a wide range of tasks and aren't a deep expert on all of them. They aren't that great if you know the area you are working in inside and out and never stray from it.
For example, using an LLM to help you write a Dockerfile when you write Dockerfiles once per project and don't have a dedicated expert like a Deployment Engineer in your company is fantastic.
Or using an LLM to get answers faster than Google for syntax errors and other minor issues is nice.
Even using LLM with careful prompting to discuss architecture tradeoffs and get its analysis (but make the final decision yourself) can be helpful.
Generally, you want to be very careful about how you constrain the LLM through prompts to ensure you keep it on a very narrow path so that it doesn't do something stupid (as LLMs are prone to do). You also often have to iterate, because LLMs will occasionally do things like hallucinate APIs that don't actually exist. But even with iteration it can often make you faster.
Seconded! I'm using LLMs in many different ways—like you—starting with small troubleshooting tasks, quick shell scripts, coding, or simply asking questions.
I use a wide variety of tools. For more private or personal tasks, I mostly rely on Claude and OpenAI; sometimes I also use Google or Perplexity—whichever gives the best results. For business purposes, I either use Copilot within VSCode or, via an internal corporate platform, Claude, OpenAI, and Google. I’ve also experimented a bit with Copilot Studio.
I’ve been working like this for about a year and a half now, though I haven’t had access to every tool the entire time.
So far, I can say this:
Yes, LLMs have increased my productivity. I’m experimenting with different programming languages, which is quite fun. I’m gaining a better understanding of various topics, and that definitely makes some things easier.
But—regardless of the model or its version—I also find myself getting really, really frustrated. The more complex the task, the more I step outside of well-trodden paths, and the more it's not just about piecing together simple components… the more they all tend to fail. And if that’s not enough: in some cases, I’d even say it takes more time to fix the mess an LLM makes than it ever saved me in the first place.
Right now, my honest conclusion is this: LLMs are useful for small code completion tasks, troubleshooting, and explaining, but that's about it. They're not taking our jobs anytime soon.
I find LLMs very useful when I need to learn something new, e.g. a new library or API, or a new language. It is much faster to just ask the LLM how to render text in OpenGL, or whatever, than read through tons of terrible documentation. Or, if I have a bunch of repetitive boilerplate stuff that I have no good templates for or cannot borrow from another part of the codebase. But when it comes to what I consider the actual 'work', the part of my codebase that is actually unique to the product I'm building, I rarely find them valuable in the "letting them write the code for me" sense.
I reflected once that very little of my time as a senior engineer is actually spent just banging out code. The actual writing of the code is never the hard part or time-consuming part for me - it's figuring out the right architecture, figuring out how to properly refactor someone else's hairball, finding performance issues, debugging rare bugs, etc. Yes, LLMs accelerate the process of writing the boilerplate, but unless you're building brand new products from scratch every 2nd week, how much boilerplate are you really writing? If the answer is "a lot", you might consider how to solve that problem without relying on LLMs!
There's no silver bullet.
It is amazing how in our field we repeatedly forget this simple advice from Fred Brooks.
In my experience, LLMs are way more useful for coding and less problem-prone when you use them without exaggerated expectations and understand that they were trained on buggy code, so of course they are going to generate buggy code. Because almost all code is buggy.
Don't delegate design to it. Use functional decomposition, do your homework, and then use LLMs to eliminate toil, to deal with the boring stuff, to guide you on unfamiliar territory. But LLMs don't eliminate the need for you to understand the code that goes out with your name. And usually, if you think a piece of LLM-generated code is perfect, remember that maybe the defects are still there and you need to improve your own knowledge and skills to find them. Always be suspicious; don't trust it blindly.
Basically yes, once the "problem" gets too big, the LLM stops being useful.
As you say, it's great for automating away boring things; as a more complicated search & replace, for instance. Or, "Implement methods so that it satisfies this interface", where the methods are pretty obvious. Or even "Fill out stub CRUD operations for this set of resources in the API".
I've recently started asking Claude Opus 4 to review my patches when I'm done, and it's occasionally caught errors, and sometimes has been good at prompting me to do something I know I really should be doing.
But once you get past a certain complexity level -- which isn't really that far -- it just stops being useful.
For one thing, the changes which need to be made often span multiple files, each of which is fairly large; so I try to think carefully about which files would need to be touched to make a change; after which point I find I have an idea what needs to be changed anyway.
That said, using the AI like a "rubber duck" programmer isn't necessarily bad. Basically, I ask it to make a change; if it makes it and it's good, great! If it's a bit random, I just take over and do it myself. I've only wasted the time of reviewing the LLM's very first change, since nearly everything else I'd have had to do anyway if I'd written the patch from scratch myself.
Furthermore, I often find it much easier to take a framework that's mostly in the right direction and modify it the way that I want, than to code up everything from scratch. So if I say, "Implement this", and then end up modifying nearly everything, it still seems like less effort than starting from scratch myself.
The key thing is that I don't work hard at trying to make the LLM do something it's clearly having trouble with. Sometimes the specification was unclear and it made a reasonable assumption; but if I tell it to do something and it's still having trouble, I just finish the task myself.
I have had a pretty similar experience to you. I have found some value here:
- I find it is pretty good at making fairly self-contained react components or even pages especially if you are using a popular UI library
- It is pretty reliable at making well-defined pure functions and I find it easier to validate that these are correct
- It can be good for boilerplate in popular frameworks
I sometimes feel like I am losing my mind because people report these super powerful end to end experiences and I have yet to see anything close in my day to day usage despite really trying. I find it completely falls over on a complete feature. I tried using aider and people seem to love it but it was just a disaster for me. I wanted to implement a fairly simple templated email feature in a Next.js app. The kind of thing that would take me about a day. This is one of the most typical development scenarios I can imagine. I described the feature in its entirety and aider completely failed, not even close. So I started describing sub-features one by one and it seemed to work better. But as I added more and more, existing parts began to break, I explained the issues to aider and it just got worse and worse with every prompt. I tried to fix it manually but the code was a mess.
> I asked friends who are enthusiastic vibe coders and they basically said "your standards are too high".
Sure, vibe coders by definition can't have any standards for the code they're generating because by definition they never look at it.
> Is the model for success here that you just say "I don't care about code quality because I don't have to maintain it because I will use LLMs for that too?"
Vibe coding may work for some purposes, but if it were currently a successful strategy in all cases, or even narrowly for improving AI, Google AI or DeepSeek or somebody would be improving their product far faster than mere humans could, by virtue of having more budget for GPUs and TPUs than you do, and more advanced AI models, too. If and when this happens you should not expect to find out by your job getting easier; rather, you'll be watching the news and extremely unexpected things will be happening. You won't find out that they were caused by AI until later, if ever.
LLMs are good for throwaway code: easier to write, harder to maintain, diagnose, and repair, even with LLMs. Which is most code that is not a product.
Fast food, assembly lines, and factories may be examples, but there is a HUGE catch: when a machine with a good setup makes your burger, car, or wristwatch, you can be 99.99% sure it is as specified. You trust the machine.
With LLMs, you have to verify each single step, and if you don't, it simply doesn't work. You cannot trust them to work autonomously 24/7.
That's why you ain't losing your job, yet.
I would never use an LLM for trying to generate a full solution to a complex problem. That's not what they're good at.
What LLMs are good at and their main value I'd argue, is nudging you along and removing the need to implement things that "just take time".
Like some days back I needed to construct a string with some information for a log entry, and the LLM we have at work suggested a solution that was both elegant and produced a nicer formatted string than what I had in mind. Instead of spending 10-15 minutes on it, I spent 30 seconds and got something that was nicer than what I would have done.
It's these little things that add up and create value, in my opinion.
I am one of those lazy IT guys that is very content working in support and ops. I understand a lot of programming concepts and do a bit of automation scripting but never really bothered to learn any proper language fully. Vibe coding was made for people like me. I just need something that works, but I can still ask for code that can be maintained and expanded if needed.
It finally clicked for me when I tried Gemini and ChatGPT side by side. I found that my style of working is more iterative than starting with a fully formed plan. Gemini did well on one-shots, but my lack of experience made the output messy. This made it clear to me that the chattier ChatGPT was working for me, since it seems to incorporate new stuff better. Great for those "Oh, crap, I didn't think of that" moments that come up for inexperienced devs like me.
With ChatGPT I use a modular approach. I first plan a high-level concept with o3, then we consider best practices for each part. After that I get the best results with 4o and Canvas, since that model doesn't seem to overthink and change direction as much. Granted, my creations are not pushing up against the limits of human knowledge, but I consistently get clean, maintainable results this way.
Recently I made a browser extension to show me local times when I hover over text on a website that shows an international time. It uses regex to find the text, and I would never have been able to crank this out myself without spending considerable time learning it.
This weekend I made a Linux app to help rice a spare monitor so it shows scrolling cheat sheets to help me memorize stuff. This turned out so well, that I might put it up on GitHub.
For dilettantes like me this opens up a whole new world of fun and possibilities.
Nope, at least Claude code is very useful IME.
Great for annoying ad-hoc programming where the objective is clear but I lack the time or motivation to do it.
Example: After benchmarking an application on various combinations of OS/arch platforms, I wanted to turn the barely structured notes into nice graphs. Claude Code easily generated Python code that used a cursed regex parser to extract the raw data and turned it into a bunch of grouped bar charts via matplotlib. Took just a couple minutes and it didn't make a single mistake. Fantastic time saver!
This is just an ad-hoc script. No need to extend or maintain it for eternity. It has served its purpose and if the input data will change, I can just throw it away and generate a new script. But if Claude hadn't done it, the graphs simply wouldn't exist.
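To give a flavour, here's a minimal sketch of the kind of script it produced (the note format, regex, and labels are made up for illustration, not the actual generated code):

```python
import re
from collections import defaultdict
import matplotlib.pyplot as plt

# Hypothetical note format: "linux/amd64: 42.3 ops/s", one line per measurement
PATTERN = re.compile(r"(\w+)/(\w+):\s*([\d.]+)\s*ops/s")

results = defaultdict(dict)  # results[os][arch] = throughput
with open("notes.txt") as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            os_name, arch, value = m.group(1), m.group(2), float(m.group(3))
            results[os_name][arch] = value

# One group of bars per OS, one bar per architecture
archs = sorted({a for per_os in results.values() for a in per_os})
oses = sorted(results)
width = 0.8 / len(archs)
for i, arch in enumerate(archs):
    xs = [j + i * width for j in range(len(oses))]
    ys = [results[o].get(arch, 0) for o in oses]
    plt.bar(xs, ys, width=width, label=arch)

plt.xticks([j + 0.4 - width / 2 for j in range(len(oses))], oses)
plt.ylabel("ops/s")
plt.legend()
plt.savefig("benchmarks.png")
```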
Update: Sorry, missed "writing self-contained throwaway pieces of code"... well for core development I too haven't really used it.
An IBM study based on conversations with 2,000 global CEOs recently found that only 25% of AI initiatives have delivered their expected ROI over the last few years, and, worse still, "64% of CEOs surveyed acknowledge that the risk of falling behind drives investment in some technologies before they have a clear understanding of the value they bring to the organization." 50% of respondents also found that "the pace of recent investments has left their organization with disconnected, piecemeal technology," almost as if they don't know what they're doing and are just putting AI in stuff for no reason.
https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-do...
The company I work for rolled out GitHub Copilot. It is quite underwhelming honestly. We have a lot of homegrown frameworks that it does not know about (I mean, to be fair, it actually cannot know about these). When I asked it to explain some piece of code it just repeated what was stated in the JavaDoc, nearly verbatim. I had a class that had obvious null issues and I asked it to fix it and it introduced quite a lot of unnecessary "// we have to check it is not null "-style comments.
I get a lot of value out of LLMs including for existing codebases and authoring / modifying code in them.
However, only maybe 10% of that is agentic coding. Thus, my recommendation would be - try non-agentic tools.
My primary workflow is something that works with the Zed editor, and which I later ported as a custom plugin to GoLand. Basically, you first chat with the AI in a sidebar, possibly embedding a couple of files in the discussion (so far nothing new), and then (this is the new part) you use contextual inline edits to rewrite code "surgically".
Importantly, the inline edits have to be contextual, they need to know both the content of the edited file, and of the conversation so far, so they will usually just have a prompt like "implement what we discussed". From all I know, only Zed's AI assistant supports this.
With this I've had a lot of success. I still effectively make all architectural decisions, it just handles the nitty-gritty details, and with enough context in the chat from the current codebase (in my case usually tens of thousands of tokens worth of embedded files) it will also adhere very well to your code-style.
I still get a lot more value from web UIs than from the agentic offerings, also for more general coding tasks. The main benefit is that I can more easily manage context. A muddied or insufficient context can sometimes degrade performance by a whole model generation, IME.
What works for me is collecting it manually and going one implementation chunk at a time. If it fails, I either do it myself or break it down into smaller chunks. As models got better these chunks got larger and larger.
Collecting context manually forces me to really consider what information is necessary to solve the problem, and it's much easier to then jump in to fix issues or break it down compared to sending it off blind. It also makes it a lot faster, since I shortcircuit the context collection step and it's easier to course-correct it.
Collecting manually is about 10 seconds of work as I have an extension that copies all files I have opened to the clipboard.
It really depends on the task and your preferences
I’ve had great success using LLMs for things that I haven’t done in a while or never before. They allow me to build without getting too bogged down into the details of syntax
Yes, they require constant attention, they are not fully independent or magical. And if you are building a project for the longer run, LLM-driven coding slows down a lot once the code base grows beyond just a couple of basic files (or when your files start getting to about 500-800+ lines)
I’ve tried several agentic editors and tools, including Cursor; they can definitely be helpful, but I’d rather just manually loop between ChatGPT (o4-mini-high for the most part) and the editor. I get a very quick and tight feedback loop in which I get plenty of control.
Git is essential for tracking changes, and tests are gold once you are at a certain size
I was just struggling with getting a standard/template npm package up and running with a few customizations. Ended up just following one of the popular npm packages. This is Claude 4 and although it is good at writing code, I feel like it gets dumber at certain tasks especially when you want it to connect things together. It very easily messes one thing up and then when you include the errors, it spirals from there into madness.
> Am I just not using the tools correctly?
No, there is no secret sauce and no secret prompting. If LLMs were capable, we'd see lots of new software generated by them, given how fast LLMs are at writing code. Theoretically, assuming a conservative 10 tokens/s and 100M tokens for the Chromium code base, you could write a new browser with LLMs in only 115 days (100M tokens / 10 tokens/s ≈ 10M seconds ≈ 115 days).
Simon Willison has helpful recent advice on this: "Here's how I use LLMs to help me write code" (11th March 2025) https://simonwillison.net/2025/Mar/11/using-llms-for-code/
There are two key points which are important to get most out of LLM coding assistants:
1. Use a high-quality model with a big context window via API (I recommend OpenRouter). E.g. Google Gemini 2.5 Pro is one of the best and delivers consistently good quality (OpenAI reasoning models can be better at problem solving, but they're kind of a mixed bag). Other people swear by the Claude Sonnet models.
2. Upgrade the coding tools you combine with these high-quality models. Google Jules and OpenAI Codex are brand new and have a totally different aim than Cursor. Don't use them (yet). Maybe they will get good enough in the future. I would focus on established tools like aider (steepest learning curve) or Roo Code (easier), paired with OpenRouter, and if you want it really easy, Claude Code (only useful with a 100-200 USD Anthropic subscription IMHO). On average you will get better results with aider, Roo, or Claude Code than with Cursor or Windsurf.
Btw. I think Cursor and Windsurf are great as a starter because you buy a subscription for 15-20 USD and are set. It may well be that the higher-quality tools burn more tokens and you spend more per month, but you also get better quality back in return.
Last but not least and can be applied to every coding assistant: Improve your coding prompts (be more specific in regards to files or sources), do smaller and more iterations until reaching your final result.
I think of LLMs as knowing a lot of things but as being relatively shallow in their knowledge.
I find them to be super useful for things that I don't already know how to do, e.g. a framework or library that I'm not familiar with. It can then give me approximate code that I will probably need to modify a fair bit, but that I can use as the basis for my work. Having an LLM code a preliminary solution is often more efficient than jumping to reading the docs immediately. I do usually need to read the docs, but by the time I look at them, I already know what I need to look up and have a feasible approach in my head.
If I know exactly how I would build something, an LLM isn't as useful, although I will admit that sometimes an LLM will come up with a clever algorithm that I wouldn't have thought up on my own.
I think that, for everyone who has been an engineer for some time, we already have a way that we write code, and LLMs are a departure. I find that I need to force myself to try them for a variety of different tasks. Over time, I understand them better and become better at integrating them into my workflows.
> I've tried:
> - Cursor (can't remember which model, the default)
> - Google's Jules
> - OpenAI Codex with o4
Cursor's "default model" rarely works for me. You have to choose one of the models yourself. Sonnet 4, Gemini 2.5 Pro, and for tricky problems, o3.
There is no public release of o4; you used o4-mini, a model with poorer performance than any of the frontier models (Sonnet 4, Gemini Pro 2.5, o3).
Jules and Codex, if they're like Claude Code, do not work well with "Build me a Facebook clone"-type instructions. You have to break everything down and make your own tech stack decisions, even if you use these tools to do so. Yes they are not perfect and make regressions or forget to run linters or check their work with the compiler, but they do work extremely well if you learn to use them, just like any other tool. They are not yet magic that works without you having to put in any effort to learn them.
It's easy to get good productivity out of LLMs in complex apps, here are my tips:
Create a directory in the root of your project called /specs
Chat with an LLM to drill into ideas, having it play the role of a Startup Advisor; work through problem definitions, what your approach is, and build a high-level plan.
If you are happy with the state of your direction, ask the LLM to build a product-strategy.md file with the sole purpose of describing to an AI Agent what the goal of the project is.
Discuss with an LLM all sorts of issues like:
Components of the site
Mono Repo vs targeted repos
Security issues and your approach to them
High level map of technologies you will use
Strong references to KISS rules and don't over complicate
A key rule is do not build features for future use
Wrap that up in a spec md file
Continue this process until you have a detailed spec, with smaller .md files indexed from your main README.md spec file
Continue the role-play with the AI as Developer, Consumer, Advisor, and End User
Break all work into Phase 1 (MVP), Phase 2, and future phases, don't get more granular (only do Phase 2 if needed)
Ask the LLM to document best-practice development standards in your CLAUDE.md or whatever you use. Discuss the standards; err toward industry standard if you are lost
Challenge the LLM while building standards, keep looping back and adjusting earlier assumptions
Instruct AI Agents like Claude Code to target specific spec files and implement only Phase 1. If you get stuck on how to do that, ask an LLM how to prompt your coding agent to focus; you will learn how they operate.
Always ask the coding agent to review any markdown files used to document your solution and update with current features, progress, and next issues.
Paste all .md files back into other AIs, e.g. the higher-end ChatGPT models, and ask them to review and identify missing areas / issues
Don't believe everything the agents say, challenge them and refuse to let them make you happy or confirm your actions, that is not their job.
Always provide context around errors that you want to solve: read the error, read the line number, paste in the whole function, or focus your Cursor.ai prompt on that file.
Work with all the AIs; each has its strengths.
Don't use the free models; pay. It's like running a business with borrowed tools: don't.
Learn like crazy, there are so many tips I'm nowhere near learning.
Be kind to your agent
I'm getting really valuable work done with aider and Gemini. But it's not fun and it's not flow-state kind of work.
Aider, in my humble opinion, has some issues with its loop. It sometimes works much better just to head over to AI studio and copy and paste. Sometimes it feels like aider tries to get things done as cheaply as possible, and the AI ends up making the same mistakes over again instead of asking for more information or more context.
But it is a tool and I view it as my job to get used to the limitations and strengths of the tool. So I see my role as adapting to a useful but quirky coworker so I can focus my energy where I'm most useful.
It may help that I'm a parent of intelligent and curious little kids. So I'm used to working with smart people who aren't very experienced and I'm patient about the long term payoff of working at their level.
In my experience it works better in more constrained and repetitive domains. For example: it is better at doing a Ruby on Rails website than an ASP.NET web service.
Particularly, with data structures it is garbage: it never understands the constraints that justify writing a new one instead of relying on the ones from the standard library.
And finally, it is incapable of understanding changes of mind. It will go back to stuff already discarded or replaced.
The worst part of all is that it insists on introducing its own "contributions". For example, recently I have been doing some work on ML and I wanted to see the effect of some ablations. It destroyed my code to add back all the stuff I had removed on purpose.
Overall, it provides small typing/search savings, but it cannot be trusted at all yet.
They can do chores that clear your day to do more challenging work.
That’s immensely valuable and pretty game changing
I'm a manager at work, so I only code during the weekends on personal projects these days. In that context, I get a lot of value out of plain-vanilla ChatGPT, it's able to write high-quality code for my toy use-cases in the 100-500 LOC range.
I have seen two apparently contradictory things.
Firstly, there absolutely are people popping up in certain domains with LLM assisted developed products that could not have managed it otherwise, with results you would not suspect were made that way if you were not told.
However, I share the same problem myself. The root of it is "analysis is harder than synthesis". i.e. if you have to be sure of the correctness of the code it's far easier to write it yourself than establish that an LLM got it right. This probably means needing to change how to split things out to LLMs in ways human co-workers would find intolerable.
I've been programming in python for over 20 years. An LLM creates code that sometimes works, but it definitely doesn't meet my standards, and there's no way I'd use the code since I couldn't support it. People who have less experience in Python might take that working code and just support that with their LLM, still having no clue what it does or why it works. That's probably fine for MVP but it won't stand up in the real world where you have to support the code or refactor it for your environment.
I tried to use an LLM to write a simple curses app - something where there's a lot of code out there, but most of the code is bad, and of course it doesn't work and there's lots of quirks. I then asked it to see if there are libraries out there that are better than curses, it gave me 'textual' which at first seemed like an HTML library, but is actually a replacement for curses. It did work, and I had some working code at the end, but I had to work around platform inconsistencies and deal with the LLM including outdated info like inline styles that are unsupported in the current version of the library. That said, I don't quite understand the code that it produced, I know it works and it looks nice, but I need to write the code myself if I want a deeper understanding of the library, so that I can support it. You won't get that from asking an LLM to write your code for you, but from you using what you learn. It's like any language learning. You could use google translate to translate what you want, and it may seem correct at first glance, but ultimately won't convey what you want, with all the nuance you want, if you just learned the language yourself.
100% agreed with your experience, AI provides little value to one's area of expertise (10+ years or more). It's the context length -- AI needs comparable training or inference-time cycles.
But just wait for the next doubling of long task capacity (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...). Or the doubling after that. AI will get there.
Have you tried using a rules file for those micromanage-y tasks you mentioned?
I have collected similar requests over time and I don't have to remind GH copilot/Claude as much anymore.
I use this analogy. In the early 90's I had been programming in assembler and sometimes in pure hex codes. I had been very good at that, creating really effective code, tight, using as little resources as possible.
But then resources became cheap and it stopped mattering. Yeah, tight, well-designed machine code is still some sort of art expression, but for practical purposes it makes sense to write a program in a higher-level language and waste a few MB...
Maybe I'm using these models incorrectly but I just don't ask it to do a lot at once and find it extremely useful.
Write tests for x in the style of this file to cover a, b, c.
Help me find a bug here within this pseudo code that covers three classes and a few functions. Here's the behavior I see, here's what I think could be happening.
I rarely give it access to all the code and usually give it small portions of code and ask for small things. I basically treat it as if I was reaching out to another senior developer in a large company or SO. They don't care to learn about all the details that don't matter, and want a good promoted question that's not wasting their time and that they can help with.
Using it this way I absolutely see the benefits and I'd say an arbitrary 1.25x sounds right (and I'm an experienced engineer in my field).
I'll just quietly keep using it this way and ignore the overwhelming hype on both sides (the "it's not a speed-up" camp and the "it's 100x" camp). IMO both are wrong, but the "it's not a speed-up" camp makes me question how they're using it the most.
I just had an LLM build my side business's website and full client portal. About 15k LOC.
It’s amazing. Better design in terms of UI / UX than I could have fathomed and so much more.
There’s a lot of duplicated code that I’ll clean up, but the site functions and will be launched for clients to start using soon.
For my day job, it’s also helping me build out the software at a faster pace than before and is an amazing rubber duck.
The comments on this thread are a perfect mixture of Group A, explaining how there is no value in AI tools, and if there was, where is the evidence? And Group B, who are getting value from the tools and have evidence of using them to deliver real software, but are being blasted by Group A as idiots who can't recognize bad code. Why so angry?
I've been writing code for 36 years, so I don't take any of the criticism to heart. If you know what you are doing, you can ship production quality code written by an LLM. I'm not going to label it "made by an AI!" because the consumer doesn't care so long as it works and who needs the "never AI!" backlash anyway?
But to the OP: your standards are too high. AI is like working with a bright intern, they are not going to do everything exactly the way that you prefer, but they are enthusiastic and can take direction. Choose your battles and focus on making the code maintainable in the long term, not perfect in the short term.
I find they're useful for the first 100 lines of a program (toy problems, boilerplate).
As the project becomes non-trivial (>1000 lines), they get increasingly likely to get confused. They can still seem helpful, but they may be confidently incorrect. This makes checking their outputs harder. Eventually silly bugs slip through, cost me more time than all of the time LLMs saved previously.
I tend to spend 15 minutes writing clear requirements (functional and non-functional specs). And then ChatGPT works its miracle. When I just ask it "do write code in X language that does so-and-so", it's proper crapware. But those 15 minutes of reqs save me 4-5 hours of writing back and forth, looking at manuals, etc. (I'm not a super dev, I'm not even a professional dev.) But I do ask it to write code for me as I am trying to solve (my) small IT problems, and from seeing various marketplaces, nobody has posted code/software to do what I want.
Perhaps one day I'll 'incorporate myself' and start posting my solutions and perhaps make some dough... but I benefit far more than the $20 a month I am paying.
The right 'prompt' (with plenty of specs and controls) saves me from the (classic!) swing-on-tree example: https://fersys.cloud/wp-content/uploads/2023/02/4.jpg
I speculate, from this, that you are giving the LLMs too big of a task. Not quite vibe coding, but along that path.
> "You've refactored most of the file but forgot a single function"). It would take many many iterations on trivial issues, and because these iterations are slow that just meant I had to context switch a lot, which is also exhausting.
Try prompts like this:
"Given these data structures: (Code of structs and enums), please implement X algorithm in this function signature. (Post exact function signature)."
Or: "This code is repetitive. Please write a macro to simplify the syntax. Here's what calling it should look like (Show macro use syntax)"
Or: "I get X error on this function call. Please correct it."
Or: "I'm copying these tables into native data structures/match arms etc. Here's the full table. Here's the first few lines of the native structures: ...""
I find that code quality is an LLM's first priority. It might not execute, but it will have good variable names and comments.
I find that so far their quality is horizontal, not vertical.
A project that involves small depth across 5 languages/silos? Extremely useful.
A long project in a single language? Nearly useless.
I feel like it's token memory. And I also feel like the solution will be deeper code modularisation.
I like to think of the painter who's trained in classical design; they spend their time thinking about Baroque diagonals and golden ratios. They have a "deep knowledge" of how paintings are classically organized. One day they encounter a strange machine that can produce any image they wish. Without considering the economic impact this would have on their craft, people often ask them, "isn't this machine a marvel?" Their reply is often, "it sheds no light for me on why certain ratios produce beautiful forms, it tells me nothing of the inspired mind of poets, nor of the nature of the Good."
In Plato's Republic, Socrates compares the ability to produce a piece of furniture with the ability to produce the image of a cabinet (or what have you) with a small compact mirror; what is the difference, if a deceivable crowd doesn't know the difference?
I've spent the past week overcoming my fear of Google's Gemini and OpenAI's ChatGPT. Things I've learned:
- Using an AI for strange tasks, like using a TTS model to turn snippets of IPA text (for a constructed language) into an audio file (via CLI) - much of the task turned out to be setting up stuff. Gemini was not very good when it came to giving me instructions for doing things in the GCP and Google Workspace browser consoles. ChatGPT was much clearer with instructions for setting up the AWS CLI locally and navigating the AWS browser console to create a dedicated user for the task etc. The final audio results were mixed, but then that's what you get when trying to beat a commercial TTS AI into doing something it really thinks you're mad to try.
- Working with ChatGPT to interrogate a Javascript library to produce a markdown file summarising the library's functionality and usage, to save me the time of repeating the exercise with LLMs during future sessions. Sadly the exercise didn't help solve the truly useless code LLMs generate when using the library ... but it's a start.
- LLMs are surprisingly good at massaging my ego - once I learned how to first instruct them to take on a given persona before performing a task: <As an English literature academic, analyse the following poem: title: Tournesols; epigraph: (After "Six Sunflowers, 1888" by van Gogh / felled by bombs in 1945); text: This presented image, dead as the hand that / drew it, an echo blown to my time yet // flames erupt from each auburn wheel - / they lick at the air, the desk: sinews // of heat shovelled on cloth. Leaves / jag and drop to touch green glaze - // I want to tooth those dry seeds, sat / by the window caught on the pot's neck // and swallow sunshine. So strong / that lost paint of the hidden man.>
I still fear LLMs, but now I fear them a little less ...
Today I tried to use Copilot to go through a code base and export unexported types, interfaces and unions (typescript). It had to also rename them to make sure they are unique in context of the package. Otherwise I would have used search&replace :)
It started out promising, renaming the symbols according to my instructions. Slower than if I had done it myself, but not horribly slow. It skipped over a few renames so I did them manually. I had to tell it to continue every 2 minutes so I could not really do anything else in the meantime.
I figured it’s quicker if I find the files in question (simple ripgrep search) and feed them to copilot. So I don’t have to wait for it to search all files.
Cool, now it started to rename random other things and ignored the naming scheme I taught it before. It took quite some time to manually fix its mess.
Maybe I should have just asked it to write a quick script to do the rename in an automated way instead :)
For me, anything before the Codex model with the web UI was a time waster.
Now, with the web UI, what's important is to constantly add tests around the code base, and if it gets stuck, to go through the logs and understand why.
It's more of a management role of "unblocking" the LLM when it gets stuck and working with it, rather than fitting it into my previous workflow.
As coding becomes more popular every year, a lot of people are at a junior level or work in an environment that only needs junior-level skills. I assume those people benefit a lot from LLMs.
There's also a lot of marketing; it's cool to hype LLMs, and I guess people like to see content about what they can do on YouTube and Instagram.
As Google on steroids - yes. Next-level helpful, and it offloads a lot of dumb and repetitive work from you.
As a developer buddy - no. LLMs don't actually think and don't actually learn like people do. That part of the overinflated expectations is gonna hit some companies hard one day.
I have been getting very good results.
What matters:
- the model: choose the SOTA (currently Claude 4 Opus). I use it mostly in Cursor.
- the prompt: give it enough context to go by, reference files (especially the ones where it can start delving deeper from), be very clear in your intentions. Do bullet points.
- for a complex problem: ask it to break down its plan for you first. Then have a look to make sure it’s ok. If you need to change anything in your plan, now’s the time. Only then ask it to build the code.
- be patient: SOTA models currently aren’t very fast
I work at a company with millions of MAU, as well as do side projects - for the company, I do spend a bit more time checking and cleaning code, but lately with the new models less and less.
For my side projects, I just bang through with the flow above.
Good luck!
I don't understand the hesitation with existing large codebases. Maybe everything I work on is just easy compared to yall.
It has been unimaginably helpful in getting me up to speed in large existing codebases.
First thing to do in a codebase is to tell it to "analyze the entire codebase and generate md docs in a /llmdocs directory". Do this manually in a loop a few times and try a few different models. They'll build on each other's output.
Chunk, embed, and index those rather than the code files themselves. Use those for context. Get full code files through tool calls when needed.
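As a rough illustration of the chunk/embed/index step, here's a minimal sketch assuming the OpenAI embeddings API and a naive paragraph-based chunker (the model name, directory, and chunk size are just placeholders):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    """Naive chunking on blank lines, capped at roughly max_chars per chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current)
    return chunks

index = []  # (doc path, chunk text, embedding vector)
for doc in Path("llmdocs").glob("*.md"):
    for piece in chunk(doc.read_text()):
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=piece
        ).data[0].embedding
        index.append((str(doc), piece, emb))
```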
A few days ago I discovered a few security issues in one of the applications we have developed.
I instantly decided to review the frontend and backend code with AI (I used Cursor and GitHub Copilot).
It reported a dozen more issues which otherwise would have taken a few weeks to find.
We asked the AI to generate code that would help with security, providing rules informing it about the technology stack, coding guidelines, project structure, and product description.
We got good recommendations, but couldn't implement the suggestions straightforwardly.
However, we took the advice and hand-coded the suggestions across all the code files.
The entire exercise took a week for a fairly large project.
As per my tech lead, it would have taken a minimum of 2 months.
So it works.
I think Jules does a good job at "generating code I'm willing to maintain." I never use Jules to write code from scratch. Instead, I usually write about 90% of the code myself, then use the agent to refactor, add tests (based on some I've already written), or make small improvements.
Most of the time, the output isn't perfect, but it's good enough to keep moving forward. And since I’ve already written most of the code, Jules tends to follow my style. The final result isn’t just 100%, it’s more like 120%, because of those little refactors and improvements I’d probably be too lazy to do if I were writing everything myself.
If you have ever done declarative programming, then you know what LLMs bring to it. Where before you had to state what you wanted in a very formal way, now you can do the same by writing a prompt in a very relaxed form. Your declarations are now fed by thousands of implementation sources, rather than just a few. Otherwise I do not see much difference. I will also warn that using a very formal language is much more concise, and you get a result in much less time. However, most people try to avoid any learning.
I tried using LLMs for 2 types of work:
1. Improve the code that I already have. Waste of time, it never works. This is not because my code is too good, but because it is SQL with complex context and I get more hallucinations than usable code, and even the usable code is only good for basic tasks, nothing more.
2. Areas I rarely use and I don't maintain an expertise on. This is where it is good value, I get 80% of what I need in 20% of the time, I take it and complete the work. But this does not happen too often, so the overall value is not there yet.
In a way it is like RPA: it does something, not great, but it saves some time.
For a mediocre coder like myself, it's a game changer. Instead of finicking around with reading a weird dictionary schema in Python, I simply give it a sample and boom, I have code. Not to mention it shows me, by example, new ways of doing things that I was only vaguely familiar with, and then I have a better understanding of how to use them. I'm talking about things like decorators and itertools.
I think it’s great for coders of all levels, but junior programmers will get lost once the LLM inevitably hallucinates, and the expert will get gains, but not like those who are in the middle.
The best value I get out of AI is just having "someone" to bounce ideas off of, or to get quick answers from. I find that it's a HUGE help if I need to write code/understand code in a language that I don't work with every day. I might get it to write me a function that takes X as input and outputs Y. I might get it to set up boilerplate. I have talked to people that have said "I don't even write code anymore". That's definitely not me. Maybe it's a skill issue, but yeah, I'm kind of in the same boat too.
Working with an AI-centric IDE on a mature codebase needs a different skill set, one that I suspect is related to people management: the pattern is describing a problem well, making a good plan, delegating, and giving feedback.
On the other side, getting a good flow is not trivial. I had to tweak rules, how I describe the problem, how I plan the work, and how I ask the agent. It takes time to become productive.
E.g., asking the agent to create a script to do string manipulation is better than asking it to do an in-place edit, as it's easier to debug and repeat.
I've only had great luck with LLM-generated Perl code (ChatGPT o3). It was able to synthesize code for a GTK2/3 application fairly consistently, without generating any syntax errors. Most of the code worked as described, and it seemed to make more mistakes misunderstanding my descriptions of features than implementing them. My colleagues suggested it was because Perl's popularity had fallen significantly before 2016, and the training data set might've had much less noise.
So you want to use an LLM on a code base. You have to feed it your code base as part of the prompt, which is limited in size.
I don't suppose there's any solution where you can somehow further train a LLM on your code base to make it become part of the neural net and not part of the prompt?
This could be useful on a large ish code base for helping with onboarding at the least.
Of course you'd have to do both the running and training locally, so there's no incentive for the LLM peddlers to offer that...
If I could offer another suggestion from what's been discussed so far - try Claude Code - they are doing something different than the other offerings around how they manage context with the LLM and the results are quite different than everything else.
Also, the big difference with this tool is that you spend more time planning. Don't expect it to one-shot; you need to think about how you go from epic to task first, THEN you let it execute.
I found they work for tasks that have been done 1000s of times already. But for creative solutions in super-specialized environments (lots of the work I do is just that) they cannot help me.
I expect they soon will be able to help me with basic refactoring that needs to be performed across a code base. Luckily my code uses strong types: type safety quickly shows where the LLM was tripping/forgetting.
I’ve landed where you are.
And given the output I’ve seen when I’ve tried to make it do more, I seriously doubt any of this magic generated software actually works.
At my current job, I need to write quite a bit of Python. I've been programming for enough decades that I can look at Python code and tell what it is doing, but creating it from scratch? No. But Copilot “knows” Python and can write the code that I can then read and tweak. Maybe someone who actually learned Python would write the code differently, but so far it works very well.
Use cases like the ones you mentioned having are truly amazing. It's a shame that the AI hype machine has left us thinking of these use cases as practically nothing, leaving us disappointed.
My belief is that true utility will make itself apparent and won't have to be forced. The usages of LLMs that provide immense utility have already spread across most of the industry.
I'd like to think the point of technology is to do the same thing but faster.
AI tools can do things faster, but at lower quality. They can't do the _same_ thing faster.
So AI is fine for specifically low quality, simple things. But for anything that requires any level of customization or novelty (which is most software), it's useless for me.
I find Junie (from JetBrains / IntelliJ) great compared to other offerings. I have 10+ years experience as well.
It writes JUnit tests with mocks, chooses to run them, and fixes the test or sometimes my (actually broken) code.
It’s not helpful for 90% of my work, but like having a hammer, it’s good to have when you know that you have a nail.
I'm getting a lot of value in areas where I don't have much experience but most of the time I still write the final version.
I'm not building commercial software and don't have a commercial job at the moment, so I'm kind of struggling with credits; otherwise I would probably blow $40-100 a day.
Give it context and copy paste the code yourself.
They're good at coming up with new code.
Give it function signature with types and it will give pretty good implementation.
Tell it to edit something, and it will lose track.
The write-lint-fix workflow with LLMs doesn't work for me - the LLM has a monkey brain and edits unrelated parts of the code.
I am struggling to programmatically get syntactically valid JSON out of LLMs, using both the OpenAI and Vertex apis.
I am using:
"response_format": { "type": "json_object" }
and with Vertex:
"generationConfig": { "responseMimeType": "application/json" }
And even:
"response_format": { "type": "json_schema", "json_schema": { ...
and with Vertex:
"generationConfig": { "responseMimeType": "application/json", "responseSchema": { ...
Neither of them is reliable. It always gives me json in the format of a markup document with a single json code block:
```json
{}
```
Sure I can strip the code fence, but it's mighty suspicious that I asked for json and got markup. I am getting a huge number of json syntax errors, so it's not even getting to the schemas.
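(For what it's worth, this is the kind of fence-stripping fallback I mean; a minimal sketch that unwraps a fenced json code block before parsing. It obviously doesn't fix the underlying syntax errors.)

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Best-effort parse: accept bare JSON or JSON wrapped in a code fence."""
    text = raw.strip()
    # Unwrap a Markdown code fence if the model added one
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```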
When I did get to the schemas, it was occasionally leaving out fields that I'd declared were required (even if null or an empty array, say). So I had to mark them as not required, since the strict schema wasn't guiding it to produce correct output, just catching it when it did.
I admit I'm challenging it by asking it to produce json that contains big strings of markup, which might even contain code blocks with nested json.
If that's a problem, I'll refactor how I send it prompts so it doesn't nest different types.
But that's not easy or efficient, because I need it to return both json and markup in one call, so if I want to use "responseMimeType": "application/json" and "responseSchema", then it can ONLY be json, and the markup NEEDS to be embedded in the json, not the other way around, and there's no way to return both while still getting json and schema validation. I'd hate to have to use tool calls as "out parameters".
But I'm still getting a lot of json parsing problems and schema validation problems that aren't related to nested json formatting.
Are other people regularly seeing markup json code blocks around what's supposed to be pure json, and getting a lot of json parsing and schema validation issues?
I think you and your friends actually agree on the application of LLM's, and I agree with that too.
So the question really comes down to what kind of project you are developing:
Get an MVP fast? LLM is great!
Proof of Concept? LLM rules!
Big/Complex project? LLM junior developer is not up to the task.
> I had to micromanage them infinitely
You are missing a crucial part of the process - writing rules
I would put myself in the 100x camp but my start in AI (and programming) was symbolic approaches in lisp/scheme, then I put that away for about a decade and got into NLP and then got into question answering and search/information retrieval. I have read probably at least 20 papers on LLMs a week every week since December 2022. So I have some familiarity with the domain but I also would not exactly call myself an expert on LLMs. However, I would say I am confident I mostly understand what is required for good results.
Getting the best possible results requires:
- an LLM trained to have the "features" (in the ML/DL sense of the word) required to follow instructions to complete your task
- an application that manages the context window of the LLM
- strategically stuffing the context window with preferences/conventions, design information, documentation, examples, and your git repo's repo map, and making sure you actually use rules and conventions files for projects
Do not assume the LLM will be able to retrieve or conjure up all of that for you. Treat it like a junior dev and lead it down the path you want it to take. It is true, there is a bit of micromanagement required, but Aider makes that very, very simple to do. Aider even makes it possible to scrape a docs page to markdown for use by the LLM. Hooking up an LLM to search is a great way to stuff the context window, BTW; it makes things much simpler. You can use the Perplexity API with Aider to quickly write project plans and fetch necessary docs this way; just turn that into markdown files you'll load up later, after you switch models to a proper code-gen model. Assume that you may end up editing some code yourself; Aider makes launching your editor easy though.
This mostly just works. For fun the first thing I did with Aider was to write a TUI chat interface for ollama and I had something I could post to github in about an hour or two.
I really think Aider is the missing ingredient for most people. I have used it to generate documentation for projects I wrote by hand, I have used it to generate code (in one of my choice languages) for projects written in a language I didn't like. It's my new favorite video game.
Join the Aider discord, read the docs, and start using it with Gemini and Sonnet. If you want local, there's more to that than what I'm willing to type in a comment here, but long story short, you also need to make a series of correct decisions to get good results locally. I do it on my RTX 4090 just fine.
I am not a contributor or author of Aider, I'm just a fanatical user and devotee to its way of doing things.
> but trying to get it to generate code that I am willing to maintain and "put my name on" took longer than writing the code would have
Why don’t you consider that the AI will be the one maintaining it?
Honestly I've been getting a lot of use out of LLMs for coding, and have been adjusting my approach to LLM usage over the past year and a half. The current approach I take that has been fairly effective is to spend a lot of focused energy writing out exactly what I'm looking to implement, sometimes taking 30 or more minutes creating a specs doc / implementation plan, and passing it to an agent with an architect persona to review and generate a comprehensive, phased implementation document. I then review the document, iterate with it to make sure the plan works well, then send it off to do the work.
I'm not yet a fan of Windsurf or Cursor, but honestly Roo Code's out-of-the-box personas for architect, and its orchestration to spin up focused subtasks, work well for me.
I am kinda treating it how I would a junior, to guide it there, give it enough information to do the work, and check it afterwards, ensuring it didn't do things like BS test coverage or write useless tests / code.
It works pretty well for me, and I've been treating prompting these bots just as a skill I improve as I go along.
Frankly it saves me a lot of time; I knocked out some work Friday afternoon that I'd estimate was probably 5 points of effort in 3 hours. I'll take the efficiency any day, as I've had less actual focused coding time for implementations than I used to in my career, due to other responsibilities.
Agreed. I have an additional question. Given what look like limitations on tasks of larger scope, what problems are people who claim that they have run the LLMs for days working on?
If you can't understand the code the LLM is writing, how can you expect to debug and troubleshoot it when your vibe coded feature gets deployed to production?
All the comments here in the positive for LLMs, including all the posts by experts, can be summed up as "these are the lotto numbers that worked for me."
> All these use cases work great, I save a lot of time.
So why are you complaining? I use AI all the time to give me suggestions and ideas. But I write the perfect code myself.
I think they're amazing but still a new tool only 3 years old and we have like 20 years left till Super-AGI
Honestly I think what this boils down to is if you are concerned with good code or concerned with working code. They make working code. And if there is a bug or implementation that fails (which there usually is with the first shot) when asked, they work to resolve that. LLMs are very results driven, and they will do most anything to try and get the results requested. Even cheat. So, for coders that are concerned with quality code... of course you'll be unimpressed. For project managers who are concerned with a working piece of software, of course you will be impressed by that. You didn't have to ask a developer to create it. As far as impact on our culture goes... much like other disruptions in the past, they will win this war. They will win because they lower the barrier to entry to get working software. You ask what the value is - it's that.
I always spend more time fighting the bot and debugging the code it gives me than it would take for anything it produces to be useful.
Yeah, it is a code-monkey-with-a-typewriter situation,
but I think this is solvable when context length goes way higher than current length
my code setup hasn't changed in 10 years
I tried to use many LLM tools. They are generally not capable of doing anything useful in a real project.
Maybe solutions like MCP, which allow the LLM to access the git history, will make the LLM useful for someone who actually works on a project.
I heard an interesting story from an architect at a large software consultancy. They are using AI in their teams to manage legacy codebases in multiple languages.
TLDR; it works for a codebase of 1M LoC. AI writes code a lot faster, completing tasks in days instead of sprints. Tasks can be parallelized. People code less, but they need to think more often.
(1) Maintain clear and structured architecture documentation (README, DDD context/module descriptions files, AGENTS-MD).
(2) Create detailed implementation plans first - explicitly mapping dependencies, tests, and potential challenges.
(3) Treat the implementation plan as a single source of truth until execution finishes. Review it manually and with LLM assistance to detect logical inconsistencies. A plan is easier to change than a scattered diff.
(4) In complex cases - instruct AI agents about relevant documents and contexts before starting tasks.
(5) Approve implementation plans before allowing AI to write code
(6) Results are better if the code agent can launch automated full-stack tests and review their outputs in the process.
The same works for me in smaller projects. Less ceremony is needed there.
Who could have foreseen that laziness would be that demanding?
I'm using Claude Desktop to develop an agentic crawler system. My workflow is VSCode next to Claude Desktop. The tooling for Claude uses a bunch of MCP servers I wrote for Evolve: https://github.com/kordless/gnosis-evolve
Core Development Capabilities:
- File Discovery & Navigation: file_explorer with pattern matching and recursive search
- Intelligent Code Search: search_in_file_fuzzy with similarity thresholds for finding relevant code sections
- Advanced Code Editing: file_diff_writer with fuzzy matching that can handle code changes even after refactoring
- Backups: backups and restores of any file at any state of change.
- System Monitoring: Real-time log analysis and container management
- Hot Deployment: docker_rebuild for instant container updates (Claude can do the rebuild)
The Agentic Workflow:
- Claude searches your codebase to understand current implementation
- Uses fuzzy search to find related code patterns and dependencies
- Makes intelligent edits using fuzzy replacement (handles formatting changes)
- Monitors logs to verify changes work correctly
- Restarts containers as needed for testing
- Iterates based on log feedback
- Error handling requires analyzing logs and adjusting parsing strategies
- Performance tuning benefits from quick deploy-test-analyze cycles
I've not had any issues with Claude being able to handle changes, even doing things like refactoring overly large HTML files with inline CSS and JS. Had it move all that to a more manageable layout and helped out by deleting large blocks when necessary.
The fuzzy matching engine is the heart of the system. It uses several different strategies working in harmony. First, it tries exact matching, which is straightforward. If that fails, it normalizes whitespace by collapsing multiple spaces, removing trailing whitespace, and standardizing line breaks, then attempts to match again. This handles cases where code has been reformatted but remains functionally identical.
When dealing with multi-line code blocks, the system gets particularly clever. It breaks both the search text and the target content into individual lines, then calculates similarity scores for each line pair. If the average similarity across all lines exceeds the threshold, it considers it a match. This allows it to find code blocks even when individual lines have been slightly modified, variable names changed, or indentation adjusted.
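A minimal sketch of that cascade (illustrative only, not the actual gnosis-evolve code; the helper names and threshold are assumptions):

```python
import difflib
import re

def fuzzy_find(search: str, content: str, threshold: float = 0.8) -> bool:
    """Cascade: exact match, then whitespace-normalized, then line-by-line similarity."""
    # 1. Exact match
    if search in content:
        return True

    # 2. Whitespace-normalized match: collapse runs of spaces/tabs,
    #    strip trailing whitespace, standardize line breaks
    def normalize(text: str) -> str:
        lines = text.replace("\r\n", "\n").split("\n")
        return "\n".join(re.sub(r"[ \t]+", " ", line).rstrip() for line in lines)

    if normalize(search) in normalize(content):
        return True

    # 3. Multi-line similarity: slide a window of the same line count over the
    #    target and accept if the average per-line similarity clears the threshold
    search_lines = normalize(search).split("\n")
    content_lines = normalize(content).split("\n")
    n = len(search_lines)
    for start in range(len(content_lines) - n + 1):
        window = content_lines[start:start + n]
        scores = [
            difflib.SequenceMatcher(None, a, b).ratio()
            for a, b in zip(search_lines, window)
        ]
        if sum(scores) / n >= threshold:
            return True
    return False
```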
Relax: once it can do what you want, you're out of a job.
In my team we call it "The Intern".
Bit long but a lot of good insights by Theo.gg here: https://youtu.be/6koQP6-6mtY?si=PHtKGR7XImIKEvq6
TLDR, to use those tools effectively you need to change yourself a bit but in a fairly good direction.
They’re fantastic for making some basic websites or dealing with boilerplate.
I write compilers. Good luck getting an LLM to be helpful in that domain. It can be helpful to break down the docs for something like LLVM but not for writing passes or codegen etc
i haven't used cursor or codex or any system that says "agentic coding experience"
i speak in thoughts in my head and it is better to just translate those thoughts to code directly.
putting them into language for an LLM to make sense of, and understanding the output, is oof... too much overhead. and yeah the micromanagement, correcting mistakes, miscommunications, it's shit
i just code like the old days and if i need any assistance, i use chatgpt
Use them like a butler or bad maid. They can do the absolute grunt work now and with time they might get better to do more serious tasks.
As much as someone else is struggling at something else? It's not for everyone, just like programming is not for everyone. I don't even type the code anymore, I just copy and paste it; is that still programming? I don't remember the last time I had to type out a complete line all the way to the ;.
There are two kinds of engineers.
Those who can’t stop raving about how much of a superpower LLMs are for coding, how it’s made them 100x more productive, and is unlocking things they could’ve never done before.
And those who, like you, find it to be an extremely finicky process that requires extreme amount of coddling to get average results at best.
The only thing I don’t understand is why people from the former group aren’t all utterly dominating the market and obliterating their competitors with their revolutionary products and blazing fast iteration speed.