Are local LLMs ready for production applications?

A Large Language Model (LLM) is an advanced type of artificial intelligence that uses deep learning techniques to understand, generate, and manipulate human-like text. These models are “large” because they consist of billions or even trillions of parameters, which enable them to capture complex patterns in language data.

LLMs are continually evolving, with improvements in both architecture and training methods leading to increasingly sophisticated capabilities (if, like me, you're passionate about this field, you'll notice that new models appear almost every week).

Large Language Models (or some of their variations) can also be used locally. A local LLM is simply deployed and executed on local hardware, rather than relying on external cloud services.

To run on local hardware, many LLMs need to be compressed. Techniques like quantization or pruning can reduce model size and improve inference speed, but they may come at the cost of some accuracy.
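As a concrete illustration, a quantized build of a model can be loaded on consumer hardware with a library like llama-cpp-python. The following is only a minimal sketch: the GGUF file name and the parameter values are assumptions you would adapt to the model and machine you actually use.

```python
# Minimal sketch: loading a 4-bit quantized GGUF model locally with llama-cpp-python.
# The file name "phi-4-Q4_K_M.gguf" is an assumption - use whatever quantized build you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-4-Q4_K_M.gguf",  # 4-bit quantized weights, a fraction of the FP16 size
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU (Metal on a Mac, CUDA elsewhere)
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what model quantization does in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```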

Over the last months I've spent a lot of my time using SLMs and LLMs locally and building solutions on top of those models. When I show LLMs running locally, or when I deliver training on these topics, I often receive the same question: are local LLMs ready for production AI-powered applications?

First of all I need to say that while it is feasible to run LLMs locally, whether it’s practical depends heavily on your specific needs, available resources, and technical expertise. For many organizations without the necessary infrastructure, cloud-based AI solutions might still be more convenient despite their cost implications.

But when analyzing real-world requirements for AI solutions, I increasingly discover that for many scenarios local LLMs are a good approach to AI. There are no cloud usage costs and they can do what you need.

Nowadays there are extremely powerful LLMs and SLMs that can be executed fully locally and give results comparable to the biggest LLMs running in the cloud. For many scenarios (RAG, agents, function calling, etc.) a local LLM/SLM can give results on par with a much bigger cloud-based model.
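Part of the reason this works so well is that very little changes on the application side. As a minimal sketch, assuming something like Ollama (or LM Studio) is serving a model locally and exposing an OpenAI-compatible endpoint on its default port, the same client code you would use against a cloud model can simply be pointed at localhost (the port and the "phi4" model tag below are assumptions):

```python
# Sketch: the same OpenAI-style client code works against a locally served model.
# Assumes Ollama is running locally and exposing its OpenAI-compatible API on port 11434;
# the model tag "phi4" depends on what you have pulled locally.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local endpoint instead of a cloud URL
    api_key="not-needed-locally",          # the local server ignores the key
)

completion = client.chat.completions.create(
    model="phi4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant for a RAG pipeline."},
        {"role": "user", "content": "Answer using only the provided context: ..."},
    ],
)
print(completion.choices[0].message.content)
```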

The latest examples are Microsoft's Phi-4 and the DeepSeek R1 (distilled) models. These models are extremely powerful and they run perfectly offline.

The Microsoft Phi-4 model is extremely memory-optimized and it's currently probably my preferred model for local AI solutions. Here you can see a video of Phi-4 running locally on my MacBook and generating code. You can see that memory usage does not grow too much and the tokens/sec rate is high (so responses are quick, and the model uses the GPU at its maximum, as visible in the video):

Microsoft Phi-4 is, in my opinion, great for creating agentic AI solutions running "on the edge" and YES, it's absolutely good for production scenarios!

DeepSeek R1 is the latest open model that is able to "reason" before giving you a response to a task (it was trained using reinforcement learning). The online version is probably the cheapest and one of the most powerful LLMs available today.

DeepSeek R1 has 671B parameters. It can theoretically be executed locally, but I don't think every company has hardware today that is able to run it (more than 400 GB of RAM is needed).

But when talking about local AI, you need to know that many LLMs (like DeepSeek R1) have distilled versions. A "distilled" version in the context of AI models typically refers to model distillation (or knowledge distillation), a technique used to create smaller, more efficient versions of large and complex machine learning models.

In very simple words, DeepSeek uses the big 671B-parameter model to generate data to post-train / fine-tune existing open LLMs like Llama 8B and 70B, and Qwen 1.5B, 7B, and 32B. These are significantly smaller models, but thanks to the high quality of the output of the huge 671B model they are still very good at reasoning.
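A very rough sketch of that data-generation step could look like the following. The endpoint, API key, model name, and prompt list are all placeholders, and a real distillation pipeline is of course far more involved; the idea is simply that the teacher's answers become the fine-tuning dataset for the smaller student model.

```python
# Very simplified sketch of distillation via data generation:
# a large "teacher" model answers prompts, and the prompt/answer pairs become
# the supervised fine-tuning dataset for a much smaller "student" model.
# The base_url, API key, and model name below are placeholders, not real endpoints.
import json
from openai import OpenAI

teacher = OpenAI(base_url="https://example-teacher-endpoint/v1", api_key="YOUR_KEY")

prompts = [
    "Solve step by step: a train travels 120 km in 1.5 hours, what is its average speed?",
    "Explain why the sum of two even numbers is always even.",
]

with open("distillation_dataset.jsonl", "w") as f:
    for prompt in prompts:
        answer = teacher.chat.completions.create(
            model="teacher-reasoning-model",  # placeholder for the huge 671B teacher
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Each line is one training example for fine-tuning the student model.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")

# The resulting JSONL file would then be used to fine-tune a smaller open model
# (e.g. a Llama or Qwen variant) so that it imitates the teacher's reasoning style.
```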

Here you can see a video of DeepSeek R1 32B (distilled) and Microsoft Phi-4 running locally on my MacBook at the same time (yes, you need a powerful machine to do that).

I asked them to complete the following task:

Write the AL code for creating a Business Central application that:

  • Contains a table for handling students (plus related list and card pages)
  • Contains a table for handling courses (plus related list and card pages)
  • Contains a document page (header + lines) for invoicing a student for a set of courses they attended.

The result is interesting: while Phi-4 immediately started writing code, DeepSeek R1 instead thinks (reasons) about a solution first. After the reasoning phase, the code is produced. The resulting AL code is honestly quite comparable between the two models (not 100% perfect, but a very good start).
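If you want to reproduce a similar side-by-side comparison on your own machine, a minimal sketch with the ollama Python package could look like the following. The model tags are assumptions based on what you have pulled locally, and the distilled 32B model needs a lot of memory:

```python
# Sketch: sending the same prompt to two locally served models and comparing the answers.
# Assumes both models have been pulled with Ollama; the tags "phi4" and "deepseek-r1:32b"
# are assumptions based on the models discussed above.
import ollama

prompt = (
    "Write the AL code for a Business Central extension with a Student table, "
    "a Course table (each with list and card pages), and a document page "
    "(header + lines) for invoicing a student for the courses attended."
)

for model in ["phi4", "deepseek-r1:32b"]:
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    print(f"\n===== {model} =====\n")
    print(reply["message"]["content"])  # deepseek-r1 includes its reasoning before the code
```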

I don't want to write too much here, but just to answer the question asked in the title: YES, in my opinion locally hosted LLMs/SLMs can be used successfully in many production scenarios! They can give you great results with no cloud usage costs. And this can be a great starting point for introducing AI in a company.

I plan to talk more about private AI solutions in the next months (probably I will also do a training tour in some countries and with partners about this topic), so stay tuned for more.

Original Post https://demiliani.com/2025/01/24/are-local-llms-ready-for-production-applications/
