
After my session at Directions EMEA in Poznan about creating AI solutions in Dynamics 365 Business Central using Managed vs Custom AI, I received an interesting question outside the session room: “You’ve shown us how to use different AI models and how to monitor AI service usage and costs. But what if I wanted to know which AI model is best for my task, in terms of performance, costs, and results? How can I do that?”
Answering this question is not so simple…
Generally speaking, the key is to run systematic comparisons on your specific task. Here’s how to approach it:
To some of the people I was speaking with, I explained that for exactly this type of evaluation I had created, a long time ago, a custom application (just for internal and personal usage), and that maybe I could share it with the community in the coming months. And here it is…
The app I mentioned in Poznan is called LLMEvaluator and it’s a Windows application designed to help developers and organizations make informed decisions by systematically evaluating and comparing multiple Azure OpenAI models across custom prompts.
This tool provides a comprehensive performance comparison framework that measures key metrics such as response time, token consumption, cost estimation, and an intelligent ROI (Return on Investment) indicator.
This app lets you evaluate multiple Azure OpenAI models simultaneously against the same prompt: you define a prompt, select the Azure OpenAI models you want to compare, and execute the benchmark.
The app then starts a concurrent execution of the selected prompt against all the selected AI models and shows the execution in real time (how every AI model answers). Internally, the app contains an execution engine that orchestrates the evaluation process.
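Conceptually, here is a minimal sketch (in C#, and not the actual LLMEvaluator code) of what such a concurrent run could look like with the Azure.AI.OpenAI 2.x SDK; the ModelUnderTest record, the method names, and the console output are illustrative assumptions, while the collected metrics (response time, token usage, estimated cost) mirror the ones described above.

// Minimal sketch (not the actual LLMEvaluator code): the same prompt is sent
// concurrently to every configured deployment, while response time, token
// usage and estimated cost are collected for the comparison.
// Assumes the Azure.AI.OpenAI 2.x SDK; ModelUnderTest is an illustrative type.
using System;
using System.ClientModel;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Azure.AI.OpenAI;
using OpenAI.Chat;

public record ModelUnderTest(
    string Endpoint, string ApiKey, string DeploymentName,
    decimal Cost1MInputTokens, decimal Cost1MOutputTokens);

public static class BenchmarkRunner
{
    public static async Task RunAsync(string prompt, IEnumerable<ModelUnderTest> models)
    {
        // One task per model: all deployments receive the same prompt at the same time.
        var runs = models.Select(async m =>
        {
            var client = new AzureOpenAIClient(new Uri(m.Endpoint), new ApiKeyCredential(m.ApiKey));
            ChatClient chat = client.GetChatClient(m.DeploymentName);

            var watch = Stopwatch.StartNew();
            ChatCompletion completion = await chat.CompleteChatAsync(new UserChatMessage(prompt));
            watch.Stop();

            // Estimated cost = tokens consumed * configured price per 1M tokens.
            decimal cost =
                completion.Usage.InputTokenCount / 1_000_000m * m.Cost1MInputTokens +
                completion.Usage.OutputTokenCount / 1_000_000m * m.Cost1MOutputTokens;

            Console.WriteLine($"{m.DeploymentName}: {watch.ElapsedMilliseconds} ms, " +
                              $"{completion.Usage.TotalTokenCount} tokens, ~{cost:F6} USD");
        });

        await Task.WhenAll(runs);
    }
}

The real app obviously does more than this (it uses the configured ApiVersion, shows the answers in real time, handles errors, and computes the ROI indicator); the sketch only illustrates the core orchestration idea.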
At the end, it gives you a set of comparison metrics like the following:
When I created the application, I also created a metric to calculate the ROI indicator of a model for the given task. This indicator (1-10) is calculated by considering:
After evaluation, the application displays a comprehensive comparison window featuring:


Top 3 recommended models for this task, with a detailed explanation (because I always love to have a podium).
Installing the app is very easy. Just download the .ZIP package from here.
In this ZIP package you have the following files:
Just extract the ZIP content into a custom folder (for example C:\LLMEvaluator) and you’re ready to go.
You need to have .NET 9.0 (or above) installed on your system as a prerequisite.
All of the app’s configuration is managed through the appsettings.json file.
Example of a configuration file:
{
  "AzureOpenAIInstances": [
    {
      "Name": "AZUREOPENAIINSTANCE1",
      "Endpoint": "https://YOURENDPOINT1.openai.azure.com/",
      "ApiKey": "your-api-key",
      "ApiVersion": "2025-01-01-preview",
      "Models": [
        {
          "Name": "gpt-4.1",
          "DeploymentName": "gpt-4.1",
          "Cost1MInputTokens": 2.00,
          "Cost1MOutputTokens": 8.0,
          "Enabled": false
        },
        {
          "Name": "gpt-5",
          "DeploymentName": "gpt-5",
          "Cost1MInputTokens": 1.25,
          "Cost1MOutputTokens": 10
        }
      ]
    },
    {
      "Name": "AZUREOPENAIINSTANCE2",
      "Endpoint": "https://YOURENDPOINT2.openai.azure.com/",
      "ApiKey": "your-api-key",
      "ApiVersion": "2025-01-01-preview",
      "Models": [
        {
          "Name": "gpt-5",
          "DeploymentName": "gpt-5",
          "Cost1MInputTokens": 1.25,
          "Cost1MOutputTokens": 10
        }
      ]
    }
  ]
}
In this file, you can configure multiple Azure OpenAI resource instances, and for each instance you can configure the models you want to compare (which obviously must be deployed). If you also want a cost comparison for the task, you can configure the cost per million tokens (input and output) for each model (costs can be retrieved from the Azure pricing portal).
You can also easily enable or disable models without removing their configuration by using the “Enabled”: false parameter.
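For reference, this is a minimal sketch (again an assumption, not the tool’s actual code) of how a configuration file like the one above could be bound to .NET types with System.Text.Json, treating models without the flag as enabled and skipping the ones explicitly disabled.

// Sketch: binding appsettings.json to .NET types with System.Text.Json.
// Class and property names simply mirror the JSON keys shown above.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public class ModelConfig
{
    public string Name { get; set; } = "";
    public string DeploymentName { get; set; } = "";
    public decimal Cost1MInputTokens { get; set; }
    public decimal Cost1MOutputTokens { get; set; }
    public bool Enabled { get; set; } = true;   // enabled unless "Enabled": false is specified
}

public class AzureOpenAIInstance
{
    public string Name { get; set; } = "";
    public string Endpoint { get; set; } = "";
    public string ApiKey { get; set; } = "";
    public string ApiVersion { get; set; } = "";
    public List<ModelConfig> Models { get; set; } = new();
}

public class AppSettings
{
    public List<AzureOpenAIInstance> AzureOpenAIInstances { get; set; } = new();
}

public static class ConfigLoader
{
    // Returns every (instance, model) pair that should take part in the benchmark.
    public static IEnumerable<(AzureOpenAIInstance Instance, ModelConfig Model)> LoadEnabledModels(string path)
    {
        var settings = JsonSerializer.Deserialize<AppSettings>(File.ReadAllText(path)) ?? new AppSettings();
        return settings.AzureOpenAIInstances
                       .SelectMany(i => i.Models.Where(m => m.Enabled).Select(m => (i, m)));
    }
}

With the example configuration above, the gpt-4.1 entry (marked “Enabled”: false) would be skipped, while both gpt-5 deployments would take part in the benchmark.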
The prompt used for evaluating the models must be created as a .txt file in the EvaluationPrompts/ folder. Directly from the app, you can create new prompts or edit and delete existing ones.
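Since prompts are just plain .txt files, you can also prepare them outside the app with any text editor; a tiny, purely illustrative sketch of how that folder can be enumerated:

// Sketch: list the evaluation prompts available in the EvaluationPrompts/ folder.
using System;
using System.IO;

foreach (string file in Directory.GetFiles("EvaluationPrompts", "*.txt"))
{
    Console.WriteLine($"{Path.GetFileNameWithoutExtension(file)} " +
                      $"({File.ReadAllText(file).Length} characters)");
}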
Here you can see the app in action with some sample prompts. First, a test with a simple prompt like Can you tell me where is Milan located?
Then a new evaluation test with a more complex prompt:
The app is distributed “as-is”, simply because before Directions EMEA it was never intended to be released outside of my internal usage. Try it: maybe it can be useful if you have AI model evaluation requirements (and also if you need some benchmarks to discuss with your customers).
Remember that there’s rarely one “best” model: the best choice depends heavily on your specific priorities and constraints. The cheapest model might be slower; the highest-performing one might be expensive. Your job is to decide which trade-offs matter most for your situation.
Original Post https://demiliani.com/2025/11/17/evaluating-llms-performances-for-your-ai-task-presenting-my-llmevaluator/






