Above the AI (agentic) Harness

By jarmesto · Business Central

Vertical Layer Business Central Needs
Most people in the industry already agree on what an AI harness is. For Business Central developers, the real question isn’t which horizontal harness to pick, but which vertical layer of knowledge to build on top of it.

ALDC and BCQuality are two ways to do this, each having its own approach and strengths. In fact, they complement each other.

I focus on ALDC because I built it from the ground up and know its foundations, improvements, and future plans well. Still, there are many other interesting approaches in the community.

The harness: what it is, and why everyone suddenly talks about it

An AI harness connects a model’s reasoning to real actions, like accessing the filesystem, running commands, handling approvals, exposing tools, and managing context over long sessions.

On its own, a language model just produces text. The harness turns that text into actions (editing files, running tests, reading results) and sends the outcome back to the model so it can decide what to do next. Without a harness, you just have a chat window, not an agent.
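To make that loop concrete, here is a minimal sketch of the think-act-observe cycle. It is not how Claude Code, Copilot or any real harness is implemented: the tool names, the `scripted_model` stand-in and the `kind:argument` decision format are all assumptions made up for illustration.

```python
from typing import Callable

# Hypothetical tools: the names and return values are illustrative only.
def read_file(arg: str) -> str:
    return f"contents of {arg}"

def run_tests(arg: str) -> str:
    return "2 passed, 0 failed"

TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": read_file,
    "run_tests": run_tests,
}

def scripted_model(history: list[str]) -> str:
    """Stand-in for a real model: chooses the next step from how many
    observations it has already received."""
    observations = [h for h in history if h.startswith("observation:")]
    script = ["read_file:src/app.al", "run_tests:all", "done:tests are green"]
    return script[len(observations)]

def agent_loop(model: Callable[[list[str]], str], task: str,
               max_steps: int = 10) -> tuple[str, list[str]]:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        decision = model(history)                  # think: the model emits text
        kind, _, arg = decision.partition(":")
        if kind == "done":
            return arg, history
        result = TOOLS[kind](arg)                  # act: the harness runs the tool
        history.append(f"observation: {result}")   # observe: outcome goes back in
    return "step budget exhausted", history

answer, history = agent_loop(scripted_model, "fix the failing test")
print(answer)
```

The model only ever returns a string. Everything that touches the outside world happens in `agent_loop`, which is the part the harness owns.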

Most of us are already familiar with a harness like Claude Code from Anthropic. For many developers, this is what an agentic coding harness looks like in practice: it has a provider abstraction, tools, context window management, persistence, and an agent loop tying it all together. When people picture a harness, they often think of something like Claude Code.

On Microsoft’s side, most coders would agree that GitHub Copilot and VS Code are also examples of harnesses. In a recent post, the VS Code team pointed out that every new model release brings up the same questions: which model is smartest or fastest?

But for coding, the model is just one part. Developers actually interact with the harness. The model is like the engine, while the harness is the car. A better model fills in the blanks more accurately, but the harness decides what those blanks are. The article also notes that different models need harnesses that work differently.

The harness isn’t just a static wrapper. It’s an active part of the product that gets regular updates. As the article says, and it’s a point worth remembering: the harness is the product.

A model by itself isn’t enough. It needs a harness to actually do anything useful in an editor.

CrewAI is another group I follow closely. I’ve watched them for a while because they often spot what’s coming next. They were building agentic systems before it became a buzzword, and their experience shows.

Their argument is that the AI harness itself is becoming a commodity. The think-act-observe cycle is converging across every player; what differentiates one harness from another is the layers on top, and increasingly even those will be common ground. Their bet for what comes next is systems that learn from each organization’s real flows instead of depending on manual configuration. I find that read convincing, and I expect we will hear much more about it.

If the horizontal harness becomes a commodity, the value shifts elsewhere. It moves up to a vertical layer (a vertical or knowledge harness) that adds domain-specific knowledge to the generic loop. BCQuality, Microsoft’s remedial knowledge base project for AL coding, is an early example of this direction. Its purpose is clear: even the best language model isn’t enough if it lacks the domain knowledge a Business Central agent needs to produce the right code.

This brings us to the reason why the way we measure agents has to change, and the VS Code team has already noticed it.

When you measure the wrong thing

The VS Code team still uses public benchmarks, but they’re clear that at the cutting edge, these aren’t enough to measure quality. SWE-bench is useful, but it focuses on public bug-fixing tasks. OpenAI even stopped reporting SWE-bench Verified after realizing that advanced models could sometimes just recall the reference patches from memory. If a model can memorize a benchmark, it’s no longer measuring real capability. The data then loses its value.

That’s why Microsoft built VSC-Bench: an offline suite for VS Code-specific tasks that public benchmarks don’t cover well, things like custom agent modes, MCP and tool use, multi-turn conversations, terminal and browser interaction. It measures solution correctness, agent effort, token efficiency, and latency together, not in isolation.

I ran into the same wall from the Business Central side, with BC-Bench. My empirical finding was simple and, in hindsight, not surprising: a framework that injects specifications, instructions, skill scaffolding and audit trails consumes more context and more tokens than a bare baseline. Measuring that framework against a base model on tokens-per-resolution isn’t a fair comparison, and it isn’t a useful one either. It’s comparing verbosity against minimalism. It tells you nothing about whether the output is auditable, reproducible, or correct for the domain.
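A small sketch shows why the single metric misleads. The numbers below are placeholders I made up for the example, not BC-Bench results, and `RunSummary`, `audited` and `report` are hypothetical names rather than part of any benchmark.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    name: str
    tasks: int      # tasks attempted
    resolved: int   # tasks resolved (assumed > 0 in this sketch)
    tokens: int     # total tokens consumed
    audited: int    # resolved tasks that also left a reviewable trace

def tokens_per_resolution(run: RunSummary) -> float:
    return run.tokens / run.resolved

def report(run: RunSummary) -> dict[str, float]:
    # Several dimensions side by side instead of one number in isolation.
    return {
        "resolution_rate": run.resolved / run.tasks,
        "audited_rate": run.audited / run.resolved,
        "tokens_per_resolution": tokens_per_resolution(run),
    }

# Placeholder values chosen only to illustrate the argument.
baseline = RunSummary("bare baseline", tasks=10, resolved=5, tokens=50_000, audited=0)
framework = RunSummary("framework", tasks=10, resolved=5, tokens=150_000, audited=5)

cheapest = min([baseline, framework], key=tokens_per_resolution).name
baseline_report = report(baseline)
framework_report = report(framework)
print(cheapest, baseline_report, framework_report)
```

Ranked by tokens-per-resolution alone, the bare baseline always comes first here. The fuller report puts the auditability difference next to the cost, which is the dimension the single number cannot see.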

That’s the same conclusion VS Code and OpenAI reached. Generic benchmarks measure the generic case. If you build for a specific surface, VS Code or Business Central, you need a benchmark that measures that surface. BC-Bench is to Business Central what VSC-Bench is to VS Code: not a verdict on a framework, but the admission that the off-the-shelf benchmark is asking the wrong question, and an attempt to build one that asks a better one.

There is a temptation here to claim the obvious next thing: that a vertical harness spends more and therefore delivers more.

But more tokens is not more quality. The point is a different one. The harness defines what context the model sees, and that context determines the quality of what it produces. A vertical harness spends more tokens because it injects domain knowledge, not verbosity, knowledge the model did not bring on its own. The cost goes up; what changes is not how much output you get, it is which output. That is the bet behind ALDC and BCQuality, and it is something you verify in real use.

ALDC: a knowledge harness for AL development?

This is where ALDC fits. Not as a harness, but as the vertical layer that sits on top of one.

A generalist harness gives you a capable agent. It can write code, edit files, run tests. What it can’t do is build a Business Central extension well. That gap, between an agent that codes and an agent that codes in AL the way it should be done, is the one ALDC sets out to close.

It does so by giving the work structure. Instead of asking the agent to solve a request in one pass, ALDC turns it into a journey with phases: first you design, then you specify, then you execute. There is a moment of architecture before a single line of code exists. There is a specification that makes that design concrete and testable. And there is an execution split into steps, each having its own focus. The agent stops improvising and starts following a method.

That framework isn’t decoration. Each phase leaves a trace of what knowledge was applied and why, so the result isn’t just code that compiles: it’s code you can audit, review and defend. That is what makes ALDC a knowledge harness. It doesn’t replace the agent’s loop, it governs it. It gives the capable agent the domain it was missing.
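As a rough illustration of "phases plus a trace", here is a toy sketch. It is not ALDC’s actual implementation or API: `Journey`, `TraceEntry`, `run_phase` and the example knowledge labels are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

PHASES = ["design", "specify", "execute"]

@dataclass
class TraceEntry:
    phase: str
    knowledge: str  # what knowledge was applied
    reason: str     # why it was applied

@dataclass
class Journey:
    request: str
    completed: list[str] = field(default_factory=list)
    trace: list[TraceEntry] = field(default_factory=list)

def run_phase(journey: Journey, phase: str, knowledge: str, reason: str) -> None:
    """Run one phase, refusing to skip ahead, and record what was applied."""
    if len(journey.completed) >= len(PHASES):
        raise ValueError("journey already complete")
    expected = PHASES[len(journey.completed)]
    if phase != expected:
        raise ValueError(f"cannot run {phase!r} before {expected!r}")
    journey.trace.append(TraceEntry(phase, knowledge, reason))
    journey.completed.append(phase)

journey = Journey("add a credit-limit check to sales orders")
run_phase(journey, "design", "architecture notes", "decide where the check lives")

try:
    # Jumping straight to code is rejected: the specification does not exist yet.
    run_phase(journey, "execute", "coding rules", "write the code")
except ValueError as err:
    skipped = str(err)

run_phase(journey, "specify", "test guidelines", "make the design testable")
run_phase(journey, "execute", "coding rules", "implement against the spec")

for entry in journey.trace:
    print(entry.phase, "|", entry.knowledge, "|", entry.reason)
```

The order check is the "method" and the trace list is the "audit": the loop underneath is untouched, it is only constrained and recorded.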

The vertical harness: where the value settles

Agentic AI harnesses (the buzzword of the moment), in their most horizontal form, are going to become a commodity, and at that point the engine stops being the valuable part. What becomes valuable is everything around the engine: whatever makes the car consume less, run smoother, and go further on a charge. The road it drives on and the map it follows are where we will start building value.

That “everything around the engine” is the vertical harness. A knowledge layer that is, by necessity, evolutionary and adaptive: it has to track the improvements of the horizontal harnesses below it and of the models themselves, absorbing what they get better at and re-focusing on what they still don’t know. BCQuality illustrates clearly how this works in practice. Its remedial layer isn’t a static rulebook; the measure of whether it works isn’t how many rules it contains, but how many agent mistakes it prevents that the model would otherwise have made. When the model gets smarter, some knowledge files become redundant and the layer should shed them. When a new BC release changes the rules, the layer should grow. The vertical harness is a living thing, sized to the gap between what the model knows and what the domain requires.
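One way to picture "sized to the gap" is to score each knowledge file by the mistakes it prevents and shed the ones that no longer prevent any. This is a sketch under my own assumptions, not how BCQuality actually evaluates its files: the file names, the counts and the `prune` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeFile:
    name: str
    mistakes_without: int  # agent mistakes on an eval set with the file withheld
    mistakes_with: int     # agent mistakes on the same set with the file injected

    @property
    def prevented(self) -> int:
        return self.mistakes_without - self.mistakes_with

def prune(files: list[KnowledgeFile], min_prevented: int = 1):
    """Split files into those still earning their context and those to shed."""
    keep = [f for f in files if f.prevented >= min_prevented]
    shed = [f for f in files if f.prevented < min_prevented]
    return keep, shed

# Placeholder files and counts, for illustration only.
files = [
    KnowledgeFile("rule-a.md", mistakes_without=8, mistakes_with=1),
    KnowledgeFile("rule-b.md", mistakes_without=0, mistakes_with=0),  # model already knows this
    KnowledgeFile("rule-c.md", mistakes_without=3, mistakes_with=2),
]

keep, shed = prune(files)
print([f.name for f in keep], [f.name for f in shed])
```

Re-running the same measurement after a model upgrade or a new BC release is what lets the layer shrink or grow, instead of only accumulating rules.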

ALDC and BCQuality don’t compete in this layer, they fit together. BCQuality structures quality knowledge so an agent can consume it; ALDC is one of the systems that can orchestrate that consumption inside a governed development flow. Different angles on the same movement: the domain layer is where the durable value of agentic coding in Business Central lives.

APM as the governance layer

And a vertical harness needs governance, or it doesn’t scale. Without it, every team accumulates its own skills, its own instructions, its own knowledge files, drifting apart until nothing can be measured or adopted consistently. APM, package management for this layer, is what standardizes it: versioned, declared, composed deliberately. It turns a pile of local customizations into something an organization can govern, measure and deploy. The vertical harness is the value; APM is what keeps that value from fragmenting into noise.

Closing (my thoughts)

Harnesses will become a commodity. The loop is converging fast, and the engine will stop being where the differentiation lives. What stays valuable is the layer around it: the knowledge that makes an agent build Business Central code well, not just code. Vertical harnesses like ALDC and BCQuality are opening that road, and APM is the governance that lets an organization actually drive on it. The open question is which other domain layers will emerge, and whether the BC community builds them deliberately or lets them fragment one team at a time.