Evaluating LLM performance for your AI task: presenting my LLMEvaluator.

After my session about creating AI solutions in Dynamics 365 Business Central using Managed vs Custom AI at Directions EMEA in Poznan, I received an interesting question outside the session room: “You’ve shown us how to use different AI models and how to monitor AI service usage and costs. But what if I wanted to know which AI model is best for my task, in terms of performance, costs, and results? How can I do that?”

Answering this question is not so simple…

Generally speaking, the key is to run systematic comparisons on your specific task. Here’s how to approach it:

  1. Define your evaluation metrics. First, clarify what “best results” means for your use case. Are you optimizing for accuracy, speed, relevance, creativity, factual correctness, or something else? Different models excel in different areas.
  2. Test with representative samples. Don’t just try one prompt. Use a diverse set of inputs that reflect real-world usage (your actual data or realistic examples). This reveals how models perform on tasks that matter to you, not just generic benchmarks.
  3. Compare across dimensions simultaneously. Create a simple comparison that tracks (see the sketch after this list):
    • Performance metrics: accuracy, speed, latency, quality scores relevant to your task
    • Costs: per-token pricing, API costs, infrastructure needs (some models are cheaper but slower)
    • Results quality: subjective assessment, consistency, edge case handling
  4. Use available benchmarks as a starting point (but only for that). Public benchmarks and vendor comparisons can help you shortlist AI models for a specific task, but remember that they may not reflect your specific task at all. Use them to narrow down candidates, not to make final decisions.
  5. Consider practical constraints. Think about rate limits, availability, integration difficulty, etc.
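
To make point 3 a bit more concrete, here is a minimal C# sketch of what a row in such a comparison table could look like (the type and field names are illustrative only, not taken from any specific tool):

// Hypothetical row of a model comparison table; the field names are illustrative only.
public record ModelComparisonRow(
    string ModelName,
    double AvgResponseTimeMs,   // performance: speed / latency
    double QualityScore,        // performance: task-specific quality (e.g. 0-1)
    decimal EstimatedCostUsd,   // cost of the evaluation run
    double SubjectiveRating,    // results: human assessment (e.g. 1-5)
    int EdgeCaseFailures);      // results: how many edge cases the model got wrong

Filling one such row per model (and per representative prompt) already gives you a much better picture than any generic leaderboard.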

To some of the people I was speaking with, I explained that for exactly this type of evaluation I had created a custom application a long time ago (just for internal and personal usage), and that maybe I could share it with the community in the coming months. And here it is…

Presenting my LLMEvaluator app

The app I mentioned in Poznan is called LLMEvaluator and it’s a Windows application designed to help developers and organizations make informed decisions by systematically evaluating and comparing multiple Azure OpenAI models across custom prompts.

This tool provides a comprehensive performance comparison framework that measures key metrics such as response time, token consumption, cost estimation, and an intelligent ROI (Return on Investment) indicator.

This app lets you evaluate multiple Azure OpenAI models simultaneously against the same prompt. You define a prompt, select the Azure OpenAI models you want to compare, and execute the benchmark.

The app then runs the selected prompt concurrently against all the selected AI models and shows the execution in real time (how every AI model answers). The app contains an execution engine that orchestrates the evaluation process:

  • Manages concurrent model evaluations with semaphore-based throttling (see the sketch after this list)
  • Coordinates task execution across multiple models
  • Aggregates results and provides completion callbacks
  • Handles dynamic concurrency (all enabled models run in parallel)

At the end, it gives you a comparison across a set of metrics such as the following (a sketch of how the derived metrics can be computed follows the list):

  • Response Time: Measures the time taken from request to completion (in milliseconds). This is useful for measuring how quickly the model answers, and also for comparing the response time of the same model deployed in different Azure regions (I often use this KPI exactly for that).
  • Token Consumption: Tracks input tokens, output tokens, and total tokens used.
  • Throughput: Calculates tokens per second (tok/s) to measure processing speed.
  • Token Efficiency Ratio: Compares output tokens to input tokens.
  • Context Utilization: Shows what percentage of the model’s context window is being used.
  • Cost Estimation: Calculates the estimated cost based on configurable per-model pricing.
  • Compression Ratio: Measures how efficiently the model compresses information.
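
As a rough illustration of how the derived metrics follow from the raw response data, here is a small C# sketch; these formulas are my reading of the metric names, not necessarily the app's exact implementation:

// Illustrative formulas for the derived metrics (my reading, not necessarily the app's exact ones).
public static class MetricMath
{
    // Throughput: tokens processed per second.
    public static double TokensPerSecond(int totalTokens, double responseTimeMs)
        => totalTokens / (responseTimeMs / 1000.0);

    // Token Efficiency Ratio: output tokens relative to input tokens.
    public static double TokenEfficiencyRatio(int outputTokens, int inputTokens)
        => (double)outputTokens / inputTokens;

    // Context Utilization: percentage of the model's context window used by the request.
    public static double ContextUtilizationPercent(int totalTokens, int contextWindowSize)
        => 100.0 * totalTokens / contextWindowSize;

    // Cost Estimation: based on the per-million-token prices configured in appsettings.json.
    public static decimal EstimatedCostUsd(int inputTokens, int outputTokens,
                                           decimal cost1MInputTokens, decimal cost1MOutputTokens)
        => inputTokens  / 1_000_000m * cost1MInputTokens
         + outputTokens / 1_000_000m * cost1MOutputTokens;
}

For example, with the gpt-4.1 prices used in the configuration file shown later ($2.00 per 1M input tokens, $8.00 per 1M output tokens), a call consuming 1,000 input and 500 output tokens would be estimated at 0.001 × 2.00 + 0.0005 × 8.00 = $0.006.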

When I created the application, I also defined a metric that calculates an ROI indicator for a model on the given task. This indicator (on a 1-10 scale) is calculated by weighting (a rough sketch follows the list):

  • 45% Response Time: Prioritizes faster models using a saturating decrease function
  • 45% Cost: Favors more cost-effective models
  • 10% Output Tokens: Rewards models that generate more comprehensive responses
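
Since the exact saturating functions are not spelled out here, the following C# sketch only illustrates how such a 1-10 score could be composed with the 45/45/10 weights; the functions and scale constants are my own assumptions, not LLMEvaluator's actual values:

using System;

public static class RoiIndicator
{
    // Hypothetical composition of the 1-10 ROI indicator; the saturating functions
    // and the scale constants are assumptions, not the app's actual values.
    public static double RoiScore(double responseTimeMs, decimal estimatedCostUsd, int outputTokens)
    {
        // Saturating decrease: close to 1 for fast/cheap runs, trending towards 0 as time/cost grow.
        double timeScore   = 1.0 / (1.0 + responseTimeMs / 5000.0);          // assumed 5-second scale
        double costScore   = 1.0 / (1.0 + (double)estimatedCostUsd / 0.01);  // assumed $0.01 scale
        double outputScore = Math.Min(1.0, outputTokens / 1000.0);           // saturates at an assumed 1,000 tokens

        double weighted = 0.45 * timeScore + 0.45 * costScore + 0.10 * outputScore;
        return 1.0 + 9.0 * weighted;   // map the 0..1 weighted score onto the 1-10 range
    }
}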

After evaluation, the application displays a comprehensive comparison window featuring:

  • Summary Panel: Quick overview of best performers (highest ROI, fastest response, lowest cost, most efficient).
  • Detailed Results Grid: Sortable table showing all metrics for each model.
  • Star Indicators: Visual highlights for best and worst performers in each category (green is the best, red is the worst).
  • Medal Rankings: 🥇🥈🥉 Top 3 recommended models for this task, with detailed explanation (because I always love to have a podium 😉).

Installation

Installing the app is very easy. Just download the .ZIP package from here.

In this ZIP package you have the following files:

  • LLMEvaluator.exe: the main application (Windows app, self-contained)
  • appsettings.json: the configuration file (details later)
  • EvaluationPrompts: folder where evaluation prompts can be stored

Just extract the ZIP content into a custom folder (for example C:\LLMEvaluator) and you’re ready to go.

You need to have .NET 9.0 (or above) installed on your system as a prerequisite.

Configuration

All of the app’s configuration is managed through the appsettings.json file.

Example of a configuration file:

{
  "AzureOpenAIInstances": [
    {
      "Name": "AZUREOPENAIINSTANCE1",
      "Endpoint": "https://YOURENDPOINT1.openai.azure.com/",
      "ApiKey": "your-api-key",
      "ApiVersion": "2025-01-01-preview",
      "Models": [
        {
          "Name": "gpt-4.1",
          "DeploymentName": "gpt-4.1",
          "Cost1MInputTokens": 2.00,
          "Cost1MOutputTokens": 8.0,
          "Enabled": false
        },
        {
          "Name": "gpt-5",
          "DeploymentName": "gpt-5",
          "Cost1MInputTokens": 1.25,
          "Cost1MOutputTokens": 10
        }
      ]
    },
    {
      "Name": "AZUREOPENAIINSTANCE2",
      "Endpoint": "https://YOURENDPOINT2.openai.azure.com/",
      "ApiKey": "your-api-key",
      "ApiVersion": "2025-01-01-preview",
      "Models": [
        {
          "Name": "gpt-5",
          "DeploymentName": "gpt-5",
          "Cost1MInputTokens": 1.25,
          "Cost1MOutputTokens": 10
        }
      ]
    }
  ]
}

In this file, you can configure multiple Azure OpenAI resource instances, and for each instance you can configure the models you want to compare (which obviously must already be deployed). If you also want a cost comparison for the task, you can configure the cost per million tokens (input and output) for each model (costs can be retrieved from the Azure Pricing portal).
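
For reference, here is a minimal C# sketch (not the app's own classes) of how such a file could be bound to typed settings in a .NET application, with property names matching the JSON keys above:

// Hypothetical settings classes mirroring appsettings.json; requires the
// Microsoft.Extensions.Configuration.Json and .Binder packages.
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

public class AzureOpenAIInstance
{
    public string Name { get; set; } = "";
    public string Endpoint { get; set; } = "";
    public string ApiKey { get; set; } = "";
    public string ApiVersion { get; set; } = "";
    public List<ModelConfig> Models { get; set; } = new();
}

public class ModelConfig
{
    public string Name { get; set; } = "";
    public string DeploymentName { get; set; } = "";
    public decimal Cost1MInputTokens { get; set; }
    public decimal Cost1MOutputTokens { get; set; }
    public bool Enabled { get; set; } = true;   // assumption: a missing "Enabled" key means the model is enabled
}

public static class ConfigLoader
{
    public static List<AzureOpenAIInstance> Load(string path = "appsettings.json")
    {
        IConfigurationRoot config = new ConfigurationBuilder()
            .AddJsonFile(path, optional: false)
            .Build();

        return config.GetSection("AzureOpenAIInstances")
                     .Get<List<AzureOpenAIInstance>>() ?? new List<AzureOpenAIInstance>();
    }
}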

You can also easily enable or disable models without removing their configuration by using the “Enabled”: false parameter.

The prompt used for evaluating the models must be created as a .txt file in the EvaluationPrompts/ folder. Directly from the app, you can create new prompts or edit/delete existing ones.

Here you can see the app in action with some sample prompts. First, a test with a simple prompt like “Can you tell me where is Milan located?”

Then a new evaluation test with a more complex prompt:

Conclusion

The app is distributed “as-is”, simply because before Directions EMEA it was never intended to be released outside of my internal usage 😜. Try it; maybe it can be useful if you have AI model evaluation requirements (and also if you need some benchmarks to discuss with your customers).

Remember that there’s rarely one “best” model: the best choice depends heavily on your specific priorities and constraints. The cheapest model might be slower; the highest-performing one might be expensive. Your job is to decide which tradeoffs matter most for your situation.

Original Post https://demiliani.com/2025/11/17/evaluating-llms-performances-for-your-ai-task-presenting-my-llmevaluator/
