Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: Build a model that could look at a photo of a laptop and identify any physical damage — things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.
Along the way, we ran into issues with hallucinations, unreliable outputs and images that were not even laptops. To solve these, we ended up applying an agentic framework in an atypical way — not for task automation, but to improve the model’s performance.
In this post, we will walk through what we tried, what didn’t work and how a combination of approaches eventually helped us build something reliable.
Where we started: Monolithic prompting
Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image into an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
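To make that setup concrete, here is a minimal sketch of what a monolithic call might look like, assuming an OpenAI-style multimodal API. The model name, prompt wording and helper function are illustrative, not our exact implementation.

```python
# A minimal sketch of monolithic prompting, assuming the OpenAI Python SDK.
# One image, one large prompt, one free-form answer back.
import base64
from openai import OpenAI

client = OpenAI()

def assess_damage(image_path: str) -> str:
    # Encode the uploaded photo so it can be passed inline to the model.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any image-capable model; illustrative choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Identify any visible physical damage on this laptop, "
                         "such as cracked screens, missing keys or broken "
                         "hinges. List each issue you can see."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The appeal is obvious: one call, no orchestration. The drawback, as we learned, is that a single free-form answer gives you nowhere to catch bad inputs or bad outputs.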
We ran into three major issues early on:
Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it was seeing.
Junk image detection: It had no reliable way to flag images that were not laptops at all; pictures of desks, walls or people occasionally slipped through and received nonsensical damage reports.
Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.
At this point, it was clear we would need to iterate.
First fix: Mixing image resolutions
One thing we noticed was how much image quality affected the model's output. Users uploaded everything from sharp, high-resolution photos to blurry snapshots. That led us to research on how image resolution affects deep learning model performance.
We trained and tested the model using a mix of high- and low-resolution images.
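As a rough illustration of that mixed-resolution idea, a training-time transform like the one below randomly degrades some sharp images so the model also learns from blurry inputs. The resolution value and probability are illustrative assumptions, not our actual settings.

```python
# A sketch of mixed-resolution augmentation: with some probability,
# downscale an image and scale it back up, simulating a blurry upload.
import random
from PIL import Image

def degrade_resolution(img: Image.Image, p: float = 0.5,
                       low_res: int = 224) -> Image.Image:
    """Randomly simulate a low-quality photo while keeping the output size."""
    if random.random() < p:
        w, h = img.size
        img = img.resize((low_res, low_res), Image.BILINEAR)  # lose detail
        img = img.resize((w, h), Image.BILINEAR)  # restore original size
    return img
```

Applying something like this during training exposes the model to the full quality range it will see in production, rather than only to clean, sharp examples.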