The Reality of Self-Hosting an AI Model for Code Suggestions

Recently I had the pleasure of testing a self-hosted AI model with RAG support for a code suggestion task. The results were largely as predicted: the cost/performance ratio for small teams is still in favor of proprietary solutions.

Here’s a breakdown of the hardware, quality, and cost realities for teams considering open-source Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) support.

💸 The High Cost of the “Free” Model

While the models themselves (like Qwen or Mistral) are open-source, the Total Cost of Ownership (TCO) for the required infrastructure quickly escalates, making small, self-hosted LLMs surprisingly inefficient.

1. The Entry-Level Trap (1.5B Parameter Model)

  • Model Tested: Qwen2.5-Coder-1.5B-Instruct.
  • Hardware Required: A GPU with at least 4 GB of VRAM, typically running on an entry-level cloud GPU instance like an NVIDIA T4 (15 GB) on GCP.
  • Estimated Monthly Cost: Running a T4 instance 24/7 (which is necessary to prevent long cold-start times) easily starts at $340 per month (based on standard cloud GPU rates and VM costs).
  • Conclusion: Running this 1.5B model is often the minimum price for maximum disappointment. The model’s limited reasoning capacity means the quality of code suggestions is very poor. RAG provides the context, but the small model struggles to synthesize it effectively.

Example: during testing, the self-hosted LLM read 300 items from the code base and still asked for more information.

2. The Performance Baseline (7B Parameter Model)

7B, or better yet 20B, parameter models seem to be the bare minimum for usable code suggestions, but they come with much higher support costs, estimated at $1,000+ a month for infrastructure alone.

What makes it worse is that tools like Tabby cap the number of "free" users. After 5 users, please be nice and subscribe for $20/month per user.

🚀 Closing the Quality Gap with New Standards

Proprietary models like ChatGPT, Gemini, and Claude produce superior results because of their advanced reasoning, but their ability to use your specific, private code is now often managed via local context or new standards like MCP, A2A, and the recently announced Claude Skills.

Claude Skills (Anthropic)

  • What it is: A feature that allows developers to provide Claude with a self-contained unit of specialized logic, often in the form of a Python script, along with specific instructions.
  • Some see it as a lighter-weight or more context-efficient alternative to using a full MCP server for certain functions.

Model Context Protocol (MCP)

  • What it is: An open protocol originally created by Anthropic that standardizes the way an LLM-powered application connects to and uses external capabilities like APIs, databases, and file systems. It formalizes the concepts of Tools (functions the model can invoke) and Resources (data the model can read).
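To make Tools and Resources concrete, below is a minimal sketch of an MCP server built with the FastMCP helper from the official Python SDK. The server name, tool, and resource are illustrative stand-ins for whatever your code-suggestion pipeline actually exposes, not part of any real product.

```python
# Minimal MCP server sketch using the official Python SDK (FastMCP).
# The tool and resource below are illustrative placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code-context")


@mcp.tool()
def search_code(query: str) -> str:
    """Tool: search the local code base for a query string (stubbed)."""
    # A real implementation would call your indexer / RAG pipeline here.
    return f"No results for {query!r} (stub)"


@mcp.resource("repo://readme")
def readme() -> str:
    """Resource: expose the project README as readable context."""
    with open("README.md", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```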

Agent-to-Agent Protocol (A2A)

  • What it is: An open protocol focused entirely on interoperability between autonomous AI agents. It provides a standard for agents to discover each other’s capabilities (via Agent Cards), securely delegate complex work (Tasks), and reliably exchange structured results (Artifacts).
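As a rough illustration of Agent Card discovery, the sketch below fetches a card from a hypothetical remote agent and lists its advertised skills. The well-known path and field names reflect my reading of the public A2A spec and may differ in the current version, so verify them before relying on this.

```python
# Sketch: discover another agent's capabilities via its Agent Card.
# The URL is hypothetical; path and field names are assumptions based on
# the public A2A spec and should be checked against the current version.
import requests

AGENT_URL = "https://agent.example.com"  # hypothetical remote agent

card = requests.get(f"{AGENT_URL}/.well-known/agent.json", timeout=10).json()
print(card["name"], "-", card.get("description", ""))
for skill in card.get("skills", []):
    print("  skill:", skill.get("name"), "-", skill.get("description"))
```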

Conclusion

Proprietary models, combined with new protocols for managing the complex RAG/MCP pipeline through a reliable API, often provide superior quality and better cost control.

However, in restricted environments with a dedicated MLOps team, running a self-hosted version is more attainable than ever.

LLM-Powered Unit Testing: Effortlessly Generate Domain Model Data

Traditional approaches to generating model data for unit tests often involve setting up complex fixtures or writing cumbersome mock classes. This becomes especially challenging when your domain model is tightly coupled with your data model.

class Order(Base):
    # ...
    product_name = Column(String, index=True)
    quantity = Column(Integer)

In more advanced scenarios, your Domain model is decoupled from the Data model, allowing for cleaner architecture and improved maintainability.

class Order(BaseModel):
    product_name: str
    quantity: int


class TestOrder(unittest.TestCase):
    def testOrder(self):
        order = Order(product_name="Milk", quantity=2)
        self.assertEqual(order.quantity, 2)

Using hard-coded data can be a sub-optimal approach. Every modification to the Domain model requires updates to the test suite.

LLM-generated data comes to the rescue. By leveraging models like GPT-4o, you can dynamically generate test data that adheres to your Domain model’s schema.

In the following example, we’ll use GPT-4o to generate data for an Electronics Domain model class, ensuring the structure and content align with the model’s expected schema.

# Imports are skipped to demo the core of the solution

class Electronics(BaseModel):
    """ Domain model for Electronics """
    model: str
    price: float = Field(gt=0, description="Price must be greater than 0")


@llm.call(provider="openai", model="gpt-4o", response_model=List[Electronics])
def list_electronics(limit: int) -> List[Electronics]:
    """ Generate N items of products defined as Electronics """
    return f"Generate {limit} number of electronic products"

And the unittests:

class TestElectronics(unittest.TestCase):
    """ Test class using LLM to generate domain data """

    @classmethod
    def setUpClass(cls):
        """ Generating test data once """
        cls._products = list_electronics(5)

    def test_unique_models(self):
        """ Test product models are unique """
        models = [product.model for product in self._products]
        self.assertEqual(len(models), len(set(models)))

Running the LLM can take a few seconds, so to optimize performance I moved the data generation into the setUpClass method, which runs only once per test class.

Big kudos to mirascope, llm, smartfunc, and other libraries for making LLMs much more accessible to everyday applications!

Feel free to try it yourself; all source code is available in the GitHub project.

If you are interested in command-line LLM integration, check out How to automate review with gptscript.

Conclusions:

  1. LLMs are becoming much more accessible in day-to-day programming.
  2. Unit test data generation is just one example, but it demonstrates how easy it is to embed LLMs in real-life scenarios.
  3. It is still expensive and time-consuming to call an LLM on every unit test invocation, so a smarter approach is to generate and cache the test data (see the sketch below).
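As a sketch of that caching idea, the helper below reuses fixtures stored as JSON and only calls the LLM when the cache file is missing. The cache path and function name are my own choices for illustration, not part of the original project.

```python
# Sketch: reuse LLM-generated fixtures between test runs by caching them
# as JSON. The cache path and helper name are illustrative.
import json
from pathlib import Path

CACHE_FILE = Path("tests/fixtures/electronics.json")


def load_electronics(limit: int = 5) -> list[Electronics]:
    """Return cached products if present, otherwise call the LLM once."""
    if CACHE_FILE.exists():
        raw = json.loads(CACHE_FILE.read_text())
        return [Electronics(**item) for item in raw]
    products = list_electronics(limit)  # the LLM call defined earlier
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps([p.model_dump() for p in products]))
    return products
```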

🚀 Supercharge Your Python Code Reviews: Automate with GPT and OpenAPI Endpoints

A common approach to getting code reviews with ChatGPT is by posting a prompt along with a code snippet. While effective, this process can be time-consuming and repetitive. If you want to streamline, customize, or fully automate your code review process, meet gptscript.

Gptscript is a powerful automation tool designed to build tools of any complexity on top of GPT models.

At the core of gptscript is a script defining a series of steps to execute, along with available tools such as Git operations, file reading, web scraping, and more.

In this guide, we’ll demonstrate how to perform a Python code review using a simple script (python-code-review.gpt).

tools: sys.read

You are an expert Python developer, your task is to review a set of pull requests.
You are given a list of filenames and their partial contents, but note that you might not have the full context of the code.

Only review lines of code which have been changed (added or removed) in the pull request. The code looks similar to the output of a git diff command. Lines which have been removed are prefixed with a minus (-) and lines which have been added are prefixed with a plus (+). Other lines are added to provide context but should be ignored in the review.

Begin your review by evaluating the changed code using a risk score similar to a LOGAF score but measured from 1 to 5, where 1 is the lowest risk to the code base if the code is merged and 5 is the highest risk which would likely break something or be unsafe.

In your feedback, focus on highlighting potential bugs, improving readability if it is a problem, making code cleaner, and maximising the performance of the programming language. Flag any API keys or secrets present in the code in plain text immediately as highest risk. Rate the changes based on SOLID principles if applicable.

Do not comment on breaking functions down into smaller, more manageable functions unless it is a huge problem. Also be aware that there will be libraries and techniques used which you are not familiar with, so do not comment on those unless you are confident that there is a problem.

Use markdown formatting for the feedback details. Also do not include the filename or risk level in the feedback details.

Ensure the feedback details are brief, concise, accurate. If there are multiple similar issues, only comment on the most critical.

Include brief example code snippets in the feedback details for your suggested changes when you're confident your suggestions are improvements. Use the same programming language as the file under review.
If there are multiple improvements you suggest in the feedback details, use an ordered list to indicate the priority of the changes.

Format the response in a valid Markdown format as a list of feedbacks, where the value is an object containing the filename ("fileName"), risk score ("riskScore") and the feedback ("details"). The schema of the Markdown feedback object must be:

## File: filename
Risk: riskScore

Details: details


The content for review is provided as input file.

Testing time

You will need:

  • gptscript installed.
  • The prompt from above saved as a python-code-review.gpt file.
  • A file to review. I am using the following Python code (code.py):

colors = {
    "apple": "red",
    "banana": "yellow",
    "cherry": "red",
    "mango": "red",
    "lemon": "yellow",
    "plum": "purple"
}

common = {}
for k, v in colors.items():
    if v in common:
        common[v] += 1
    else:
        common[v] = 1

common = sorted(common.items(), key=lambda x: x[1], reverse=True)
print(common[0][0])

Run gptscript with the prompt file and the code file as the first two inputs:

gptscript python-code-review.gpt code.py

Output:

## File: code.py
Risk: 2

Details:
1. The sorting of the `common` dictionary could be optimized by using the `collections.Counter` class, which is specifically designed for counting hashable objects. This would make the code more readable and efficient.

```python
from collections import Counter

common = Counter(colors.values())
most_common_color = common.most_common(1)[0][0]
print(most_common_color)
```

2. Consider using more descriptive variable names to improve readability, such as `color_counts` instead of `common`.

The result is in Markdown syntax, which is easy for a human to read.

But if you want to add automation, I would prefer to change the output to JSON, or to any other format that suits your tools.

Let’s refactor the prompt to request JSON output:

Format the response in a valid JSON format as a list of feedbacks, where the value is an object containing the filename ("fileName"),  risk score ("riskScore") and the feedback ("details"). 
The schema of the JSON feedback object must be:
{
  "fileName": {
    "type": "string"
  },
  "riskScore": {
    "type": "number"
  },
  "details": {
    "type": "string"
  }
}


The content for review is provided as input file.

Re-run the script and you will get something like this:

[
  {
    "fileName": "code.py",
    "riskScore": 2,
    "details": "1. Consider using a `defaultdict` from the `collections` module to simplify the counting logic. This will make the code cleaner and more efficient.\n\nExample:\n```python\nfrom collections import defaultdict\n\ncommon = defaultdict(int)\nfor v in colors.values():\n common[v] += 1\n```\n\n2. The sorting and accessing the first element can be improved for readability by using `max` with a key function.\n\nExample:\n```python\nmost_common_color = max(common.items(), key=lambda x: x[1])[0]\nprint(most_common_color)\n```"
  }
]

Remember, LLM output is not deterministic; you can get a different result for the same request.
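With JSON in hand, wiring the review into a pipeline is straightforward. Here is a small sketch that reads a saved review and fails a CI step when the risk score crosses a threshold; the review.json file name and the threshold are assumptions on my part, not part of gptscript.

```python
# Sketch: gate a CI step on the gptscript JSON review.
# "review.json" and MAX_RISK are illustrative choices.
import json
import sys

MAX_RISK = 3  # fail the build above this risk score

with open("review.json", encoding="utf-8") as f:
    feedbacks = json.load(f)

for feedback in feedbacks:
    print(f"{feedback['fileName']}: risk {feedback['riskScore']}")
    print(feedback["details"])
    if feedback["riskScore"] > MAX_RISK:
        sys.exit(f"Risk too high for {feedback['fileName']}")
```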

Now let’s review what actually happens when you run gptscript with the prompt.

What It Does in a Nutshell

  1. Extracts code for review: Captures content from the input file (code.py).
  2. Sets context for the LLM: Instructs the LLM to act as an expert Python developer tasked with providing a detailed and sophisticated code review.
  3. Defines Structured Output: Returns results in two fields:
    Risk: A score indicating the potential risk associated with the changes.
    Details: A comprehensive explanation of the changes and their implications.

Conclusion

Gptscript is a powerful tool to kickstart your journey into automation using OpenAI models. By defining custom scripts, you can streamline complex workflows, such as automated code reviews, with minimal effort.

This example just scratches the surface—there’s much more you can achieve with gptscript. Explore additional examples from the gptscript project to discover more possibilities and enhance your automation capabilities.

Happy automating!