Monoculture is a thing of the past. Why I rarely use just one AI anymore…
I am often asked which AI we use. That question motivated me to write this article: Which models do we use? What are their advantages and disadvantages? What are the characteristics of these ‘employees’, and how do they behave when working together?
Large language models are often used as chatbots. The training data comes largely from the internet – full of programming examples and technical descriptions (e.g. from Stack Overflow). Even the first chatbots were able to answer technical questions by outputting short – and sometimes even longer – pieces of source code.
Let’s rewind to the year 2025:
LLMs can now not only talk, but also act. They use tools (e.g. to read and write files, search for file names and text passages, execute commands and analyse their outputs) and are thus able to search, understand and – depending on the prompt – specifically modify not only individual lines of code, but entire codebases.
This marked the birth of the development agent: an agent that programs independently, following the developer’s instructions.
We have been actively using such Dev Agents in a wide variety of projects for over a year now. And the development of new models that are specifically optimised for this type of work is progressing rapidly. In a current migration project – from one CRM system to another provider – we created large parts of the migration scripts with the help of agents, for example.
I will discuss the project results in a later post. Today, I would like to share a different perspective: my experiences with various LLMs as the basis for dev agents. In this project, we used three currently widely used models – not only comparing them technically, but also observing how the collaboration felt ‘interpersonally’.
Which three LLMs did we use?
- Claude Sonnet 4.5 (Anthropic)
- Gemini 3 Flash (Preview) (Google)
- Grok Code Fast 1 (xAI)
What environment did we work in?
- Visual Studio Code
- Kilo Code as an agent platform
- OpenRouter for using the LLMs via API
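As a minimal sketch of how this setup addresses the three models, here is a call against OpenRouter’s OpenAI-compatible chat-completions endpoint. The model slugs are illustrative assumptions and may not match OpenRouter’s current catalogue exactly; in practice, Kilo Code handles these calls for you.

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# One slug per role in the project; treat the exact names as illustrative.
MODELS = {
    "orchestration": "anthropic/claude-sonnet-4.5",
    "code": "google/gemini-3-flash-preview",
    "bulk": "x-ai/grok-code-fast-1",
}

def build_request(role: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completion payload for OpenRouter."""
    return {
        "model": MODELS[role],
        "messages": [{"role": "user", "content": prompt}],
    }

def call_openrouter(payload: dict) -> str:
    """Send the payload; expects OPENROUTER_API_KEY in the environment."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because OpenRouter speaks the OpenAI wire format, switching providers is a one-line change to the model slug – which is precisely what makes the multi-model workflow below practical.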
Brief overview: Strengths and framework conditions
Claude Sonnet 4.5 is currently something of a top dog among development LLMs. We have already used it to implement several comparable tasks with very good quality and high efficiency. The disadvantage: Sonnet 4.5 quickly becomes expensive for longer tasks with many tokens.
That’s exactly why we wanted to test Grok Code Fast 1 from xAI. It was developed specifically for agentic coding and is significantly cheaper per million tokens than Sonnet 4.5 – and, as the name suggests, very fast.
We chose Gemini 3 Flash Preview because we suspected that it would lie between these two extremes: very fast, large context window (1 million tokens) and significantly cheaper than Claude Sonnet 4.5.
Which model performed best?
All three models will get you where you want to go sooner or later. But what I found exciting was how different the collaboration felt. You really get the impression that each model has its own character.
Here is my very personal – and deliberately subjective – experience:
Claude Sonnet 4.5: The conscientious senior consultant
Sonnet worked through tasks very thoroughly and conscientiously. The results were often correct on the first attempt or only needed minor adjustments, which Sonnet then implemented quickly and cleanly.
What particularly struck me:
- Sonnet explains everything in great detail: approach, intermediate steps, results – often in such depth that I found myself thinking: ‘Thank you, but I didn’t really want to know that much detail.’
- If something in the prompt was unclear, Sonnet politely asked for clarification instead of simply making assumptions.
- When I found errors in the concept or code, Sonnet analysed in great detail why this had happened – and then corrected it neatly (including an apology, sometimes almost too much so).
Sometimes Sonnet was almost too diligent: in addition to the actual task, it wrote analysis or test scripts on its own initiative to make sure nothing had been missed. Nice – but in practice, I didn’t need all of it.
My impression:
Claude Sonnet 4.5 is the conscientious senior consultant: extremely competent, explains everything thoroughly, delivers high-quality results – but with noticeable overhead and a corresponding price tag.
Grok Code Fast 1: The hyperactive junior developer
Fast 1 is really fast. It throws itself into every task and immediately starts implementing – without asking too many questions. It produces a lot of source code quickly and cheaply, and a surprising amount of it works on the first or second attempt.
However:
- Fast 1 likes to make the odd mistake.
- If you give quick feedback (stack trace in the prompt), it often corrects itself just as quickly.
- It’s clear that this model is all about speed.
However, when the errors became more complex, I sometimes had the feeling that Fast 1 didn’t analyse them in a sufficiently structured way and instead simply tried the next approach – which then might not work either. A few times we almost ‘argued’: when a bug persisted and I insisted on a real fix several times, it simply commented out the questionable code and reported that the error had been fixed.
My impression:
Grok Code Fast 1 is the hyperactive junior developer: fast, inexpensive, churns out masses of code, solves many problems ‘on demand’ – but you have to manage it closely and occasionally slow it down.
Gemini 3 Flash Preview: The matter-of-fact professional
Gemini has solved tasks for me at almost Sonnet level. It thinks carefully, analyses cleanly and delivers very good results that often work immediately.
This is what the collaboration feels like:
- Errors are corrected quickly and specifically.
- If I misunderstand something, Gemini clearly points this out and asks how I would like to proceed.
- Communication is professional, structured and efficient.
After a while, however, the collaboration seemed rather cold to me:
- Gemini reliably does what you explicitly ask it to do – but it rarely contributes additional ideas or proactive suggestions, as Sonnet likes to do.
- Explanations tend to be brief; sometimes I had to specifically ask for details.
My impression:
Gemini 3 Flash Preview is the matter-of-fact professional: efficient, reliable, competent – but sober and less proactive than Sonnet.
So what now? Not 100% happy with any of them?
To exaggerate slightly:
- either expensive and very talkative (Sonnet),
- or very fast, but sometimes sloppy (Grok),
- or efficient, but a little cold (Gemini).
The consequence for me was clear: I changed the way I work.
Instead of ‘one favourite LLM for everything’, I now use different agents for different tasks.
My current workflow with Kilo Code
In Kilo Code, I can assign different modes to different models. In simple terms, this is how it looks for me:
- Orchestration mode – Claude Sonnet 4.5
For larger tasks: coordinating, planning, structuring. Orchestration distributes the detailed work to other agents and initiates it.
I also prefer to use Sonnet 4.5 in architecture mode because it gives me the most comprehensive concepts and suggestions.
- Code mode – Gemini 3 Flash
What counts here is reliable implementation: correct, targeted, fast – without a lot of ‘chatter’.
Gemini also works very well for me in debug mode: clear, precise, solution-oriented.
- Additional profile – Grok Code Fast 1
I use Grok when there are lots of similar tasks to do – e.g. many similar scripts or templates. In these scenarios, speed wins, and Fast 1 is simply extremely fast and inexpensive.
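The division of labour above can be sketched as a simple routing function. The mode names mirror Kilo Code’s, but the logic itself is purely illustrative – in reality this mapping lives in Kilo Code’s per-mode model configuration, not in code like this.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    needs_planning: bool = False   # larger task: coordinate, plan, structure
    is_repetitive: bool = False    # many similar scripts or templates

def pick_model(task: Task) -> str:
    """Illustrative mode→model routing, mirroring the workflow described above."""
    if task.needs_planning:
        return "claude-sonnet-4.5"   # orchestration / architecture mode
    if task.is_repetitive:
        return "grok-code-fast-1"    # bulk work: speed and price win
    return "gemini-3-flash"          # default for code and debug mode
```

The point is not the three lines of logic but the principle: the routing criterion is the shape of the task, not a single favourite model.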
My conclusion: monoculture is a thing of the past
After these experiences, one thing is clear to me:
I would no longer rely on the models of a single provider.
Every agent – every LLM – has specific strengths and weaknesses. And this ‘character’ has a direct impact on how productive and pleasant the collaboration is.
The ability to flexibly combine multiple providers in open systems has become a decisive criterion for me in 2026.
It is not one ‘perfect’ AI that makes the difference – but the interplay of their different characters.

