One of the largest barriers to entry when training Stable Diffusion models is creating a fully captioned training dataset. Tools like CLIP or BLIP can auto‐generate captions for generic imagery, but the real challenge arises when you’re aiming to fine‐tune on a niche dataset—whether it’s a comic book universe, a popular TV series, or a specific movie. In those scenarios, you need captions that accurately mention characters or locations unique to that world.
In the past, I've attempted fine-tuning my own captioning model on a small subset of labeled images and then auto-labeling the rest; unfortunately, the results were never reliable enough to save any time. However, with the recent release of several multi-modal LLMs like Qwen and Llama, which can read text and analyze images, I decided to revisit the problem from a fresh angle.
One popular approach to tailoring Large Language Models (LLMs) to custom text content is called Retrieval-Augmented Generation (RAG). Usually, it works by creating “embeddings” for each document (like a web page or PDF), then using those embeddings to locate whichever documents are closest in embedding space to a user’s query. The LLM then reads those documents before generating its response. This method is effective enough for text that it’s become a standard in many applications.

But what if those documents were actually images instead of text? Could we apply the same “find the nearest matches” idea to help label images? To explore that, I built a small dataset from “The Matrix” by extracting one frame per second from a short sequence of the film, and labeled only a subset of the images in the dataset.
Using cosine similarity as my distance metric, I built a function that uses CLIP embeddings to quickly look up the 5 closest labeled images in the dataset.
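Here’s a minimal sketch of what that lookup can look like, using Hugging Face’s CLIP implementation; the checkpoint and helper names are illustrative rather than the exact code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> torch.Tensor:
    """Return a unit-length CLIP embedding for one image, shape (1, D)."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def nearest_labeled(target_path, labeled, k=5):
    """labeled: list of (path, caption) pairs. Returns the k closest by cosine similarity."""
    target = embed_image(target_path)                        # (1, D)
    refs = torch.cat([embed_image(p) for p, _ in labeled])   # (N, D), unit vectors
    sims = (refs @ target.T).squeeze(1)                      # dot product of unit vectors = cosine similarity
    top = sims.topk(min(k, len(labeled))).indices.tolist()
    return [(labeled[i][0], labeled[i][1], sims[i].item()) for i in top]
```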

Then I drafted a prompt that, after some experimentation, became:
You are a caption writer.
You look at an image, plus reference captions/images, and return a caption.
When evaluating an image:
1) Study reference captions to match the sentence structure and detail level.
2) If the images are very similar, use the same words except for what's different.
3) Do not mention reference images in the new caption.
4) Focus on characters, action, location, time, lighting.
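To deliver that prompt, the script just needs to interleave the retrieved reference pairs with the target image. Below is a rough sketch of that packaging step; the message structure is generic, and call_multimodal_llm stands in for whichever multimodal chat API (Qwen, Llama, etc.) is actually used:

```python
import base64

SYSTEM_PROMPT = "..."  # the caption-writer prompt shown above

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_messages(target_path, references):
    """references: list of (path, caption) pairs from the nearest-neighbor lookup."""
    content = []
    for ref_path, ref_caption in references:
        content.append({"type": "image", "image": encode_image(ref_path)})
        content.append({"type": "text", "text": f"Reference caption: {ref_caption}"})
    # The unlabeled target goes last, after all the reference examples.
    content.append({"type": "image", "image": encode_image(target_path)})
    content.append({"type": "text", "text": "Write a caption for this final image."})
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": content},
    ]

# new_caption = call_multimodal_llm(build_messages(target_path, top_references))
```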
To tie it all together, I built a Jupyter notebook that displays all the images, with buttons to send the information to the LLM. The LLM’s ability to consider this data was about as good as I could have hoped. Here’s an example of a generated target caption, along with the 5 image+caption combinations that seeded it:

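For anyone curious how the notebook side could be wired up, here’s a rough ipywidgets sketch; it’s illustrative rather than the exact GUI, and it reuses the build_messages and call_multimodal_llm placeholders from above:

```python
import ipywidgets as widgets
from IPython.display import display, Image as IPyImage

def review_cell(target_path, references):
    """Show one unlabeled frame with a button that requests a caption from the LLM."""
    caption_box = widgets.Textarea(description="Caption:", layout=widgets.Layout(width="90%"))
    send_button = widgets.Button(description="Ask LLM for caption")

    def on_click(_):
        # build_messages / call_multimodal_llm are the placeholder helpers sketched earlier
        caption_box.value = call_multimodal_llm(build_messages(target_path, references))

    send_button.on_click(on_click)
    display(IPyImage(filename=target_path, width=320), send_button, caption_box)
```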
Not every auto-caption worked quite as well as this example; however, the results were helpful enough that I started thinking about other strategies to support this approach. The first one I came up with was a way to pick out the unlabeled images on which this technique would be most effective.
Employing the same method I use to locate the nearest captioned images when generating a new caption, I decided to rank the “most label-able” images in the whole dataset. By measuring the cosine similarity of every unlabeled image to every labeled image, and then removing all connections that fell below a certain similarity threshold, I was able to present the user with only a small batch of “lowest hanging fruit”.
Using this technique, each batch would require minimal corrections, and the user could “ride a wave” of easily auto-labeled images across the dataset, as the labeled image manifold grew.
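Here’s a sketch of that selection step, reusing the embed_image helper from earlier; the similarity threshold and batch size are illustrative knobs to tune per dataset:

```python
import torch

def most_labelable(unlabeled_paths, labeled, threshold=0.90, batch_size=16):
    """Return the unlabeled frames whose nearest labeled neighbor is above the
    similarity threshold, sorted so the easiest candidates come first.

    labeled: list of (path, caption) pairs that already have captions."""
    labeled_embs = torch.cat([embed_image(p) for p, _ in labeled])         # (L, D) unit vectors
    unlabeled_embs = torch.cat([embed_image(p) for p in unlabeled_paths])  # (U, D) unit vectors
    sims = unlabeled_embs @ labeled_embs.T                                 # (U, L) cosine similarities
    best, _ = sims.max(dim=1)                                              # closest labeled neighbor per frame
    candidates = [(path, score.item())
                  for path, score in zip(unlabeled_paths, best)
                  if score.item() >= threshold]
    candidates.sort(key=lambda item: -item[1])                             # highest similarity first
    return candidates[:batch_size]
```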

Once I had that system in place, I wondered if I could also provide special guidance to the LLM by designating certain “key images”. Sometimes you know context or details that can’t be inferred from the nearest images alone; maybe by including that knowledge up front I could spare a lot of cleanup later. The idea was that a user would pick one or two representative images for the batch, caption them first, and then tell the LLM to pay extra attention to these examples before labeling everything else—similar to using key‐frames in classical animation.
Here’s an example of a set of images where the nearest labeled images didn’t contain enough information for the LLM to complete the job; in this case, it fails to recognize the highly stylized bullet trails that appear in the “bullet-time” shots:

With the addition of a special checkbox in my GUI to mark an image+caption pair as a key image, plus a small change to the delivery script and the LLM prompt, this worked like a charm. Here’s what I added to the LLM prompt:
When evaluating user-provided fixes:
1) Place EXTRA EMPHASIS ON THESE.
2) If the images are very similar, use the same words except for what's different.
These will be labeled with <humanFix>.
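On the delivery side, the only change needed is to splice the key examples into the request, tagged with <humanFix> so the prompt addition above can pick them out. A sketch, again built on the placeholder helpers from earlier:

```python
def build_messages_with_fixes(target_path, references, key_images):
    """key_images: list of (path, caption) pairs the user captioned or corrected by hand."""
    messages = build_messages(target_path, references)
    fixes = []
    for path, caption in key_images:
        fixes.append({"type": "image", "image": encode_image(path)})
        fixes.append({"type": "text", "text": f"<humanFix>{caption}</humanFix>"})
    # Put the human-fixed examples at the front of the user turn so the LLM
    # reads them before the ordinary references and the target image.
    messages[1]["content"] = fixes + messages[1]["content"]
    return messages
```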
Now look at how a single user‐guided caption, which includes details about the bullets, can seed an entire batch of captions that correctly recognize the iconic “bullet‐time” effect.

With that final piece in place, it was apparent just how much “key images” and user edits could amplify the quality of the generated captions. Combining these methods proved transformative—not only did I cut captioning time to about a third of what it would normally take, but the resulting dataset also wound up with higher‐quality, more consistent captions. Of course, this doesn’t completely automate captioning, but we’re still at the dawn of truly multimodal LLMs, and as their visual reasoning continues to improve, techniques like this are bound to become even more powerful.
Ultimately, data remains the largest barrier to training great models. Yet by harnessing LLM-based approaches, we’re moving closer to a future where anyone can generate the specialized data they need to train awesome models.
-NeuralVFX