Traditional approaches to generating model data for unit tests often involve setting up complex fixtures or writing cumbersome mock classes. This becomes especially challenging when your Domain model is tightly coupled with your Data model.
```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)  # boilerplate SQLAlchemy requires
    product_name = Column(String, index=True)
    quantity = Column(Integer)
```
In more advanced scenarios, your Domain model is decoupled from the Data model, allowing for cleaner architecture and improved maintainability.
```python
from pydantic import BaseModel

class Order(BaseModel):
    product_name: str
    quantity: int
```
```python
import unittest

class TestOrder(unittest.TestCase):
    def test_order(self):
        order = Order(product_name="Milk", quantity=2)
        self.assertEqual(order.quantity, 2)
```
Hard-coded data like this is a sub-optimal approach: every modification to the Domain model requires matching updates across the test suite.
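For example (a hypothetical change, not from the original project), adding a required field to the Domain model immediately invalidates every hand-written construction:

```python
class Order(BaseModel):
    product_name: str
    quantity: int
    unit_price: float  # newly added required field

# Every existing hard-coded call site now fails validation:
# Order(product_name="Milk", quantity=2) raises a pydantic ValidationError
```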
LLM-generated data comes to the rescue. By leveraging models like GPT-4o, you can dynamically generate test data that adheres to your Domain model's schema.
In the following example, we'll use GPT-4o to generate data for an Electronics Domain model class, ensuring the structure and content align with the model's expected schema.
```python
from typing import List

from mirascope import llm
from pydantic import BaseModel, Field


class Electronics(BaseModel):
    """ Domain model for Electronics """
    model: str
    price: float = Field(gt=0, description="Price must be greater than 0")


@llm.call(provider="openai", model="gpt-4o", response_model=List[Electronics])
def list_electronics(limit: int) -> List[Electronics]:
    """ Generate N items of products defined as Electronics """
    return f"Generate {limit} number of electronic products"
```
And the unit tests:
```python
import unittest


class TestElectronics(unittest.TestCase):
    """ Test class using LLM to generate domain data """

    @classmethod
    def setUpClass(cls):
        """ Generating test data once """
        cls._products = list_electronics(5)

    def test_unique_models(self):
        """ Test product models are unique """
        models = [product.model for product in self._products]
        self.assertEqual(len(models), len(set(models)))
```
Running the LLM can take a few seconds, so to optimize performance, I moved the data generation to the setUpClass method, which runs once per test class rather than once per test.
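Assuming the snippets above live in a single file (I'll call it test_electronics.py, a name of my choosing), the standard unittest runner picks the tests up:

```python
if __name__ == "__main__":
    unittest.main()
```

Then `python test_electronics.py` (or `python -m unittest test_electronics`) runs the suite.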
Big kudos to mirascope, llm, smartfunc and other libraries for making LLMs much more accessible to everyday applications!
Feel free to try it yourself; all source code is available in the GitHub project.
If you are interested in command-line LLM integration, check out How to automate review with gptscript.
Conclusions:
- LLMs are becoming much more accessible in day-to-day programming.
- Unit test data generation is just one example, but it demonstrates how easy it is to embed LLMs in real-life scenarios.
- It is still expensive and time-consuming to call an LLM on every test run, so a smarter approach is to generate and cache the test data (see the sketch below).
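As a minimal sketch of that caching idea (the cache location and helper name are my own assumptions, not part of the original project), you could persist the first LLM response as JSON and reuse it on subsequent runs:

```python
import json
from pathlib import Path
from typing import List

CACHE_FILE = Path("tests/data/electronics.json")  # hypothetical cache location

def cached_electronics(limit: int) -> List[Electronics]:
    """Load cached products if present; otherwise call the LLM once and cache."""
    if CACHE_FILE.exists():
        raw = json.loads(CACHE_FILE.read_text())
        return [Electronics(**item) for item in raw]
    products = list_electronics(limit)  # the only slow/expensive LLM call
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps([p.model_dump() for p in products]))
    return products
```

setUpClass can then call cached_electronics(5) instead of list_electronics(5), paying the LLM cost only on the first run (or after the cache file is deleted).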