

How Wynd Labs Processed 1 Billion Videos at 17x Lower Cost with ClipTagger-12b
Wynd Labs builds Grass.io, a decentralized data network (3M+ nodes) aggregating public-web content into usable AI training and search datasets.
Outcomes
With ClipTagger-12b, Wynd processed over 1 billion videos at roughly 17x lower cost than running a frontier model, while maintaining strict per-frame schema compliance.
Background
Wynd Labs operates Grass.io, a decentralized network with 3M+ nodes collecting public web video data. Their goal was to build a video clip search engine.
To power this, Wynd needed to transform their billion-video corpus into structured, queryable data. They partnered with Inference.net to build ClipTagger, a custom vision-language model that could process frames at massive scale while maintaining strict schema compliance and economic viability.
Challenges
Wynd Labs had a massive dataset of over a billion videos from its public web collection and wanted to make it searchable. Their use case required a strict, fixed JSON schema per frame (objects, actions, scene, and production attributes), which generic captioning models could not follow, so they turned to LLMs.
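For illustration only, a per-frame annotation of the kind described above might look like the following; the field names and values are assumptions for this sketch, not Wynd's actual schema.

```python
# Hypothetical per-frame annotation (illustrative fields, not the production schema).
frame_annotation = {
    "description": "A chef plates pasta in a brightly lit restaurant kitchen.",
    "objects": ["chef", "pasta", "plate", "kitchen counter"],
    "actions": ["plating food"],
    "scene": "restaurant kitchen",
    "production": {
        "shot_type": "medium",
        "lighting": "bright artificial",
    },
}
```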
Frontier models were accurate and followed the schema reliably, but were prohibitively expensive (~$5,850 per 1M frames for Claude 4 Sonnet). Wynd then experimented with multiple open-source models, but all of them displayed hallucination issues and generally poor captioning quality.
Model sizing was a trade-off: a 7B model couldn't meet their quality bar, while a 27B model would be prohibitively expensive at their scale (and still fell short of their quality requirements). Most serverless LLM APIs also weren't designed for large, asynchronous workloads, so Wynd initially approached Inference.net to scale SLM (small language model) inference through our batch API.
However, it quickly became clear that simply choosing the best small open-source model was not a good solution, and Wynd needed a new unlock.
Solutions
To reach Wynd's cost targets, we proposed knowledge distillation—a technique that transfers the capabilities of a large, intelligent teacher model into a smaller, more efficient student model. Wynd curated 2M diverse keyframes from their corpus and finalized both the prompt structure and JSON schema before training began.
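As a rough sketch of how a distillation pipeline like this can generate training data, the snippet below has a teacher model label keyframes against a fixed prompt and JSON schema, producing (image, JSON) pairs for fine-tuning the student. The client, teacher model name, and prompt text are illustrative assumptions, not the actual pipeline.

```python
# Sketch: a frontier "teacher" labels keyframes; each (image, JSON) pair becomes
# one supervised fine-tuning example for the smaller "student" model.
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumed OpenAI-compatible endpoint for the teacher

CAPTION_PROMPT = (
    "Describe this frame as JSON with objects, actions, scene, "
    "and production attributes."
)

def label_frame(image_path: str) -> dict:
    """Ask the teacher model for a schema-following annotation of one keyframe."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="teacher-model",  # placeholder teacher identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": CAPTION_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```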
After settling on the Gemma family, we experimented with several model sizes, including 7B, 12B, and 27B parameters. The 12B model proved to be the best middle ground between quality and cost. To fit the model on a single 80GB GPU, we used FP8 quantization, which delivered significant latency and cost benefits without measurable quality loss.
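The sketch below shows how a 12B vision-language model might be served with FP8 quantization on a single 80GB GPU using vLLM; the model identifier, context length, and prompt are placeholders rather than the exact production configuration.

```python
# Sketch: load a 12B student with FP8 weights/activations so it fits on one 80GB GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-12b-it",   # stand-in for the fine-tuned 12B student
    quantization="fp8",              # FP8 quantization to cut memory and latency
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
        {"type": "text", "text": "Return the fixed JSON annotation for this frame."},
    ],
}]
out = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=512))
print(out[0].outputs[0].text)
```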
We conducted rigorous evaluations of schema compliance and caption quality. Once we shared a dashboard with LLM-as-a-Judge scores, teacher–student diffs, schema error rates, and p50/p95/p99 latencies, the model was ready for joint sign-off.
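One of those signals, the schema error rate, can be computed along these lines; the schema below is a simplified stand-in for the real one.

```python
# Sketch: fraction of model outputs that fail to parse or violate the JSON schema.
import json

from jsonschema import Draft7Validator

FRAME_SCHEMA = {
    "type": "object",
    "required": ["description", "objects", "actions", "scene", "production"],
    "properties": {
        "description": {"type": "string"},
        "objects": {"type": "array", "items": {"type": "string"}},
        "actions": {"type": "array", "items": {"type": "string"}},
        "scene": {"type": "string"},
        "production": {"type": "object"},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(FRAME_SCHEMA)

def schema_error_rate(raw_outputs: list[str]) -> float:
    """Return the share of raw model outputs that are invalid JSON or off-schema."""
    errors = 0
    for raw in raw_outputs:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            errors += 1
            continue
        if list(validator.iter_errors(doc)):
            errors += 1
    return errors / max(len(raw_outputs), 1)
```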
To scale to billions of video frames, our batch API proved to be the right fit: Wynd could process their entire dataset at their own pace, gradually ramping up batch sizes.
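As an illustration, a batch submission for frame captioning might be assembled as JSONL in an OpenAI-compatible batch format, one request per keyframe; the endpoint path, model identifier, and prompt here are assumptions, not the exact Inference.net batch API.

```python
# Sketch: write one chat-completion request per keyframe into a JSONL batch file.
import json

def write_batch_file(frame_urls: list[str], path: str = "frames_batch.jsonl") -> None:
    with open(path, "w") as f:
        for i, url in enumerate(frame_urls):
            request = {
                "custom_id": f"frame-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "cliptagger-12b",  # placeholder model identifier
                    "messages": [{
                        "role": "user",
                        "content": [
                            {"type": "image_url", "image_url": {"url": url}},
                            {"type": "text",
                             "text": "Return the fixed JSON annotation for this frame."},
                        ],
                    }],
                    "response_format": {"type": "json_object"},
                },
            }
            f.write(json.dumps(request) + "\n")
```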
What's next
Wynd is currently rolling out search.video, the video clip search engine they built with ClipTagger, to select partners. They continue to use Inference.net to process additional billions of frames across their expanding corpus.
Talk with our research team
We'll pinpoint the bottleneck and propose a train-and-serve plan that beats your current SLA and unit cost.