AI Coding Assistants: What a Month of Testing Taught Me
The coding ability of LLMs fascinated me the first time I saw it in action with GitHub Copilot towards the end of 2021. Then ChatGPT amazed us all, and the developments just kept coming. In 2025, “agentic coding” came to the forefront to terrify and excite software engineers of all calibres. Each new iteration arrives with a shiny presentation or blog post, often illustrating a toy example or citing some impressive (but not very useful) benchmark numbers. It sounds like a dream — “All I have to do is type what I want, then the computer will write the code for me”. I think it’s too good to be true.
We’ve seen the shortfalls of AI and the risks of blindly trusting its output. The output of an LLM is engineered to look correct; it’s all just statistics under the hood. Fundamentally, these models match the statistical distributions of text sequences in a terabyte-scale dataset. Whether the code they produce is correct is not part of the objective function. This makes LLMs the perfect source of some of the hardest bugs to fix. By 2025, most developers have experienced something like that first- or second-hand, several times.
And yet, despite all that, people are endorsing agentic coding. Letting these bug machines write code unsupervised?
Then I came across the article “Field Notes From Shipping Real Code With Claude” by Diwank Tomer. This part of it convinced me that agentic coding might be worth exploring:
Your tests are your specification. They’re your safety net. They’re the encoded wisdom of every bug you’ve fixed and every edge case you’ve discovered. Guard them zealously.
With so much hype around this topic, it’s exhausting separating the signal from the noise. So I took this chance to challenge my own criticisms and try to learn more over a month of regular usage. Was this tool really worth the endorsements? Am I just holding the hammer wrong? Does it actually work in realistic scenarios? Has it made me more productive, as claimed?
I don’t think my average productivity has changed with these tools, and they are definitely overhyped. If you’re already a proficient coder, you probably won’t see much benefit. However, I did find a few spots where it genuinely feels like the right tool for the job. Unexpectedly, the experience has also taught me to write better test code and improved my code-reviewing skills. But claims of massively increased productivity don’t check out.
It hasn’t put me out of a job yet, but it has earned a spot on my toolbelt for now.
Past attempts at using AI coding
I don’t like AI completions.
It was really amazing watching demos of this magical autocomplete, but when I used it in practice I could not stand it.
I felt like I was spending more time fighting off bad completions than I was spending on the task I was actually supposed to be doing. It interrupted my train of thought constantly and ground my progress to a snail’s pace. This is probably the main reason I’m skeptical of agentic coding.
After that didn’t work out, I went back to chatting with ChatGPT in a browser window — copying and pasting snippets of code to and from my editor, carefully crafting each prompt to give it the context it needed to generate a good answer.
But again, I still feel like I’m not saving any time. In fact, I feel like I’m spending more time with all the back and forth in the LLM chat. Could bringing this experience closer to the editor address that bottleneck? Agentic coding might be a step in that direction.
Setup for agentic coding
I used OpenRouter as my LLM provider. It’s incredibly impressive how this service lets you switch between different LLMs with no subscription. For my one-month experiment, I topped up my account with $20 (AUD), and I primarily used claude-sonnet-4 via Aider. Over the past month, I’d estimate I used AI coding about 4-6 times per week (each session lasting an hour or two). This cost me about $10 in credit.
I didn’t tweak the configuration of Aider too much, but here’s what I put in my ~/.aider.conf.yaml:
model: openrouter/anthropic/claude-sonnet-4
# Tweaked the colors to match my terminal
# Note that you shouldn't use `code-theme` and
# `dark-mode` at the same time. See
# https://pygments.org/styles for available themes
code-theme: gruvbox-dark
user-input-color: "#B8BB26"
assistant-output-color: "#83A598"
# Streaming responses were too distracting.
# I prefer to do something else while it's generating
# a response, then reading through the output before I continue
stream: false
# I prefer writing my own commits after I've reviewed the diff
# of the file changes generated by aider
auto-commits: false
# Aider will be automatically triggered when you write a
# comment with the token 'AI!'
# See https://aider.chat/docs/usage/watch.html
watch-files: true
# When Aider wants to pull a new file into context, this
# auto-approves it
yes-always: true
# This file contains some of my personal coding conventions across
# all of my projects, it's loaded in read-only mode in every session
read: ~/CONVENTIONS.md
I adopted a workflow where I would use Aider one commit at a time, so that:
- the scope of each problem was deliberately kept small, which the LLM handles better
- I could review and test all of the AI output in one place (and easily revert it if needed)
So my general setup consists of launching vim as usual to edit code, opening another shell in the same project directory (via a tmux pane) to run aider, and then sending requests to Aider by inserting comments into my code in vim, such as:
# AI! Write a docstring for this function
def processor(
data: str,
client: Client,
) -> ProcessResult:
...
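With watch-files enabled, saving the file with that comment triggers Aider in the other pane, and the edit lands directly in the file. The result looks roughly like this (a hypothetical output: the exact docstring wording varies from run to run, and Client and ProcessResult are the same placeholder types as in the snippet above):
def processor(
    data: str,
    client: Client,
) -> ProcessResult:
    """Process the raw input data using the given client.

    Args:
        data: Raw input payload to be processed.
        client: Client used to perform the processing.

    Returns:
        A ProcessResult describing the outcome of processing.
    """
    ...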
After a few iterations with Aider (making my own edits along the way), I would
then review the diff using the greatest vim plugin of all time:
vim-fugitive. The command :Gdiffsplit
allows me to view the current unstaged changes side by side. Once I was happy
with the diff, I would create a commit and proceed.
Review of agentic coding performance
Let me be blunt: agentic coding sucks at logic-heavy tasks. Even when the code is well-tested, even when the prompt is specific and constrained, it stumbles. Complex control flow, non-obvious edge cases, or any situation that demands conceptual insight — these are where the models tend to hallucinate, make wrong assumptions, or miss subtle but important details. And because the output looks plausible, these errors are harder to detect.
But that’s not the whole story.
There are specific areas where agentic coding shines, and they’re not the ones most of the hype is about:
- It’s great at drafting tests: I found it very helpful for getting that first round of unit tests written, especially for tedious or repetitive cases. The AI doesn’t catch every edge case, but it often thinks of things I’ve missed — and the mechanical nature of test writing is a good match for a probabilistic pattern matcher. There’s a sketch of what this looks like just after this list.
- It helps clarify vague intentions: If I have a rough idea of what I want to do — “I need a function that converts these three formats into a common schema” — Aider helps me quickly sketch out a solution space. Sometimes it gives me what I wanted. More often, it gives me something slightly wrong, and fixing it helps crystallize my own understanding.
- It’s good for knowledge transfer: When I’m diving into unfamiliar territory, Aider can summarize the purpose of a function, help guess what a cryptic config file is doing, or give me a working example of an obscure library usage.
- It’s brilliant for boilerplate and communication: Writing docstrings, filling out __init__.py files, generating markdown summaries, or drafting a PR description — this is where agentic coding impresses me most.
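To make the test-drafting point concrete, here is the flavour of output I mean. Everything below is a hypothetical sketch: normalise_id is a stand-in function, not code from a real project, and the cases are the kind of repetitive parametrised scaffolding an LLM tends to produce on a first pass.
import pytest


def normalise_id(raw: str) -> str:
    # Hypothetical function under test: strips surrounding whitespace,
    # lowercases the identifier, and rejects empty input.
    cleaned = raw.strip().lower()
    if not cleaned:
        raise ValueError("empty identifier")
    return cleaned


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("ABC123", "abc123"),      # lowercase conversion
        ("  abc123  ", "abc123"),  # surrounding whitespace stripped
        ("abc 123", "abc 123"),    # internal whitespace preserved
    ],
)
def test_normalise_id_valid(raw: str, expected: str) -> None:
    assert normalise_id(raw) == expected


@pytest.mark.parametrize("raw", ["", "   ", "\t\n"])
def test_normalise_id_rejects_empty(raw: str) -> None:
    with pytest.raises(ValueError):
        normalise_id(raw)
The cases it invents aren’t exhaustive, and they still need pruning and correction, but a filled-in parametrize table is a much easier starting point to review and extend than a blank test file.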
These benefits aren’t flashy, and they certainly don’t support the narrative of “10x productivity” or “AI writes your code so you don’t have to”. That story falls apart in practice.
As I’ve written about previously: writing code isn’t the bottleneck. Reading code is. Understanding what it does, what it’s supposed to do, and communicating that clearly to others — that’s where time and attention go. That’s where real engineering happens. And that’s exactly where LLMs can provide real leverage.
In that sense, using Aider regularly has improved my coding skills — not because it writes correct code, but because it forces me to articulate my intentions more clearly. It’s helped me think more about test coverage and failure modes. More importantly, it’s forced me to read code more deliberately, because I can’t trust the machine — and that has a surprisingly positive side effect: it sharpens my code review reflexes.
Agentic coding isn’t magic. It’s not replacing us. But if you’re deliberate about its use, and honest about its limitations, it can be a genuinely useful tool.