The Death of Determinism: How AI Forces Us to Rethink Testing

Kieran Holmes

Photo by Joan Gamell on Unsplash

AI is transforming every stage of the delivery lifecycle. Clients want faster delivery, higher quality, and more innovation — yet AI introduces a new challenge: non-deterministic behaviour. In a world where outputs can differ run-to-run, traditional ways we build, test, and release software must evolve. This blog explores what life looked like before AI, why our old testing structures are no longer enough, and how engineering teams can rethink their approach for a future defined by probabilistic systems.

When testing was simple, predictable, and binary.

Let’s think about life before AI, when everything was simple… A typical agile release cycle starts with the development of new features. Developers write unit tests to ensure code quality, then QA tests those features (user stories) within the same sprint to confirm the functionality is correct. Any bugs raised can be fixed within the sprint, so the stories are completed and ready for release to the next environment. At the end of the sprint, a test exit report is produced to evidence that all features have been tested.

The next environment is UAT, which is used to test business requirements. Ideally, all tests are automated as part of a continuous pipeline, and regression tests are run for each release to make sure new features haven’t introduced any bugs. At the end of this testing period, the UAT team produces its own test exit report and raises any bugs.

Finally, the business makes a Go/No-Go decision to determine whether the features can be released into production. Each bug needs a severity attached, and the business decides whether that bug is allowed into production: low severity will not block a release, but high severity will.

This worked well in a pre-AI world, where tests and code were deterministic: you get the same output every time. AI, however, is non-deterministic, and here is the issue.

Why AI breaks the assumptions your test suite relies on.

Take Document Intelligence as an example. Azure Document Intelligence (in Foundry Tools, to give it its full name) is an AI service offered by Microsoft. It reads information off a document using OCR. The analysis result returns key-value pairs, each with a confidence score: an estimated probability that the prediction is correct. Crucially, this makes the output probabilistic; it cannot be accurate 100% of the time. Here is an example of a response for an invoice, where the extracted total is £304:

const response = {
  "status": "succeeded",
  "analyzeResult": {
    "keyValuePairs": [
      {
        "key": {
          "content": "Invoice Total"
        },
        "value": {
          "content": "£304"
        },
        "confidence": 0.45
      }
    ]
  }
}

Now this value is incorrect: the actual total on the invoice is £804, but Document Intelligence has misread the 8 as a 3, perhaps because of noise in the document or illegible handwriting. The misread is reflected in the confidence score of 0.45. If you have any tests that assert on that value, such as:

const kvp = response.analyzeResult.keyValuePairs[0]
expect(kvp.key.content).toBe('Invoice Total')
expect(kvp.value.content).toBe('£804')
expect(kvp.confidence).toBeGreaterThan(0.90)

This test would fail. A failing test means a bug, which could put a stop to the production release, and now the client isn’t happy!

At this point there are a few options. To appease the testers and fix the bug, we could retrain the model to suit the test (if we are using custom models), effectively forcing the data to fit the test. This is problematic because it defeats the purpose of meaningful tests; labelling and training a custom model is already time-consuming for developers, and repeating this process is inefficient. Another option is to accept the bugs as valid and either decide not to fix them or get the business to accept them as a risk. The final option is to scrap everything and create a new framework for testing AI systems. The Azure AI in Production guide explains that in non-AI testing the focus is on “functional specifications and user experience”, whereas with AI systems the outcome is evaluated on “predictive accuracy, bias assessment, and model robustness”.

Practical shifts needed for testing in an AI first world.

Redefine Test Success

The first step in the new way of working is to redefine test success: move from a binary pass or fail to metric thresholds. One of these metrics could be the confidence score; if the score is below 90%, the result should be double-checked by a human. The invoice total example above would trigger this human review, as the confidence score is only 45%. You can bake these metrics into your CI/CD process by adding gates to your pipelines that flag low-confidence results for review. Another recommended option is to execute tests multiple times and aggregate the results, so you can verify that the system performs reliably on average.
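Both ideas can be sketched in a few lines. This is illustrative only: `triage`, `passRate`, and the threshold values are assumptions for the example, not part of the Document Intelligence SDK or any pipeline tooling:

```typescript
// A single extracted field, as returned by an OCR-style service.
interface ExtractedField {
  key: string;
  value: string;
  confidence: number; // estimated probability (0..1) that the prediction is correct
}

const CONFIDENCE_THRESHOLD = 0.9; // illustrative gate value

// Metric-threshold gate: auto-accept confident fields, flag the rest for a human.
function triage(fields: ExtractedField[]) {
  const accepted = fields.filter(f => f.confidence >= CONFIDENCE_THRESHOLD);
  const needsReview = fields.filter(f => f.confidence < CONFIDENCE_THRESHOLD);
  return { accepted, needsReview };
}

// Multi-run aggregation: gate a release on the pass rate across N repeated runs
// rather than on any single binary pass/fail.
function passRate(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}

const fields: ExtractedField[] = [
  { key: 'Invoice Total', value: '£304', confidence: 0.45 },
  { key: 'Invoice Date', value: '01/02/2024', confidence: 0.97 },
];

const { accepted, needsReview } = triage(fields);
console.log(needsReview.map(f => f.key)); // the £304 misread goes to a human

const runs = [true, true, true, false, true]; // e.g. 5 repeated extractions
console.log(passRate(runs) >= 0.8 ? 'gate passed' : 'gate failed');
```

In a pipeline, the `needsReview` list would feed a human-review step and the pass-rate check would become the Go/No-Go gate, replacing a single binary test result.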

Client alignment

The client needs to be aware of the change in risk and conversations need to be had to agree on what is acceptable and what defines a failure. By communicating these targets, teams ensure everyone understands that a certain rate of errors or deviations is normal and manageable. If the client knows in advance that there could be a 1% failure rate, they are less likely to be surprised or upset when it occurs.

Engineering practice changes

There needs to be a cultural shift from “all tests must pass” to “all metrics meet the threshold.” This may be difficult for teams that are accustomed to absolute pass/fail criteria. People resist change — however if you are reading this you are likely someone who embraces change!

The path forward

By embracing new methodologies such as metric‑based thresholds, multi‑run testing, and tighter stakeholder alignment, teams can deliver robust AI systems without falling into the trap of deterministic assumptions. The takeaway is simple: review your current processes, identify where outdated expectations still exist, and evolve your engineering culture to match the reality of AI systems.

If you want to learn more about the Microsoft Team here at Capgemini, take a look at our open roles and consider joining the team!


The Death of Determinism: How AI Forces Us to Rethink Testing was originally published in Capgemini Microsoft Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
