How AI alt-text generation works on product images at scale

May 30, 2026·5 min read·by Mia Chen

How AI vision models generate alt text for product images at scale: the pipeline, the failure modes, and what actually ships to production.

AI alt text generation is the practice of sending a product image to a vision model and getting back a one-sentence plain-English description, automatically, without a human writing it. At scale it solves two problems at once: the sheer labor of describing thousands of images, and the consistency problem that makes manually written alt text nearly useless for search. If you have more than a few hundred product images and your alt text situation is a mix of blank fields and one-word entries, this is the most tractable fix available.

The problem with doing this manually

Alt text on product images sounds simple until you have ten thousand of them. Assume fifteen seconds per image at decent quality with no shortcuts. For a catalog that size, that's roughly 42 hours of work (10,000 × 15 seconds), and that's before anyone leaves or takes a new job and the next person writes descriptions in a completely different style.

The consistency gap is the real problem. One person writes "blue sneaker," another writes "Nike Air Force 1 low-top in University Blue with white sole and lace detailing." Both are technically correct. Only one does anything useful for screen readers or semantic search. When I've audited mid-size fashion and home goods stores, inconsistent alt text is almost universal; there was no standard and no enforcement mechanism. Nobody's fault. Just entropy.

AI generation solves the consistency problem. You get the same vocabulary, the same level of detail, the same output format across every asset. That's worth more than people expect.

What the generation pipeline actually does

AI alt text generation means passing an image URL to a vision-capable language model and prompting it to return a single descriptive sentence optimized for accessibility and keyword recall. The model reads the image, not a filename or tag.

const completion = await client.chat.completions.create({
  model: "meta-llama/llama-4-scout-17b-16e-instruct",
  temperature: 0.2,
  max_tokens: 120,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: SYSTEM_PROMPT },
        { type: "image_url", image_url: { url: signedImageUrl } },
      ],
    },
  ],
});

Temperature 0.2 keeps the model from getting creative. You want "white linen button-down shirt on a wooden hanger against a neutral background," not a mood piece.

The system prompt that works well is locked and explicit:

"Return ONE sentence (max 200 chars) describing the image for screen readers and for keyword-based semantic search. Use concrete nouns: garment types, colors, settings, props, model pose. Fashion-photography vocabulary when relevant. NO commentary, NO preamble, NO quotes. Just the sentence."

That prompt tells the model it's writing for two audiences simultaneously: screen readers and search. It gives domain vocabulary anchors so it defaults to specific terms. It also eliminates the chattiness vision models drift toward without constraints ("Certainly! Here's a description...").

Processing order: thumbnails first

One hard constraint shapes the whole pipeline. The vision API I've seen work at scale hard-rejects images above 33 megapixels. A 39MP camera raw converted to JPEG fails every time. Sending full-resolution originals was never going to work.

The pipeline runs in this order:

1. Generate a WebP thumbnail derivative from the original
2. Sign a download URL for the thumbnail
3. Send the thumbnail URL to the vision model
4. Write the resulting alt text to the asset row
5. Re-embed the asset so semantic search picks up the new text
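
The ordering above can be sketched as a single function. This is a rough outline under assumptions, not the production code: the `PipelineDeps` interface and every function name on it are hypothetical stand-ins for whatever your storage, vision, and embedding layers expose.

```typescript
// Hypothetical dependency interface. None of these names come from a real
// library; they stand in for your own storage, vision, and embedding code.
interface PipelineDeps {
  makeThumbnail(assetId: string): Promise<string>;      // returns a storage key for the WebP derivative
  signDownloadUrl(storageKey: string): Promise<string>; // returns a short-lived download URL
  describeImage(imageUrl: string): Promise<string>;     // vision call; "" means no alt text
  saveAltText(assetId: string, altText: string): Promise<void>;
  reembed(assetId: string): Promise<void>;
}

async function processAsset(assetId: string, deps: PipelineDeps): Promise<string> {
  const thumbKey = await deps.makeThumbnail(assetId);     // step 1: thumbnail derivative
  const signedUrl = await deps.signDownloadUrl(thumbKey); // step 2: signed URL for the thumbnail
  const altText = await deps.describeImage(signedUrl);    // step 3: the model sees the thumbnail, never the original
  if (altText !== "") {
    await deps.saveAltText(assetId, altText);             // step 4: write only when there is something to write
  }
  await deps.reembed(assetId);                            // step 5: re-embed, with or without new alt text
  return altText;
}
```

Passing the dependencies in makes the ordering easy to test with stubs, and it keeps the vision provider swappable.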

Thumbnail vs. full-resolution tradeoffs

Factor	Thumbnail (1200px WebP)	Full-resolution original
Token cost (input)	~800–1,000 tokens	~2,500–4,000 tokens
API error rate	Near zero in my testing	~15% on catalogs shot at 42MP+
File size over the wire	80–200 KB typical	8–40 MB typical
Description quality	Equivalent for product detail	No meaningful improvement
33MP hard-reject risk	None	High for modern mirrorless cameras

Working off thumbnails is a cost and reliability improvement. The API limit just made the right choice obvious.

We learned this the hard way. The first version of the pipeline sent originals, and about 15% of a fashion client's catalog, all shot on a 42MP Sony, came back with API errors we couldn't explain for two days. Once we switched to thumbnail derivatives, that error rate went to zero.

Pixel Wand runs this block inside Next.js after(), which keeps the serverless function alive until the work finishes. Without that, concurrent batch uploads silently dropped thumbnails and alt text because the function suspended mid-flight before the background work completed.

Rate limits

The 429 (rate limit) response from Groq, the inference provider behind the calls above, includes a suggested wait time in the error message. Two formats appear in practice:

Short TPM caps: "try again in 1.278s" or "try again in 924ms"
Long daily caps: "try again in 2m57.984s" or "try again in 5m29.8752s"

The right behavior differs for each. On a short wait, retry after the suggested delay. On a long wait, fail-soft and move on. Retrying a daily-budget 429 burns future capacity for no progress.

The cutoff I'd use is 10 seconds. Any suggested wait under that, retry. Over that, log at warn level and return an empty string. The caller downstream embeds the asset using tags and filename alone, which is less precise but still useful.
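
Here is a minimal sketch of that decision, assuming the error message contains the phrase "try again in" followed by one of the formats shown above. The function names and the `RetryDecision` type are mine, and treating an unparseable message as a long wait is an assumption rather than documented provider behavior.

```typescript
const RETRY_CUTOFF_MS = 10_000; // the 10-second cutoff discussed above

type RetryDecision = { action: "retry"; waitMs: number } | { action: "fail-soft" };

// Parses "1.278s", "924ms", "2m57.984s" and similar into milliseconds.
// Returns null when no wait time can be found in the message.
function parseRetryDelayMs(message: string): number | null {
  // (?!s) stops the minutes group from swallowing the "m" of "ms".
  const m = /try again in (?:(\d+)m(?!s))?(?:(\d+(?:\.\d+)?)(ms|s))?/i.exec(message);
  if (!m || (m[1] === undefined && m[2] === undefined)) return null;
  const minutesMs = m[1] !== undefined ? Number(m[1]) * 60_000 : 0;
  const unitMs = (m[3] ?? "").toLowerCase() === "ms" ? 1 : 1_000;
  const restMs = m[2] !== undefined ? Number(m[2]) * unitMs : 0;
  return Math.round(minutesMs + restMs);
}

function decideOn429(message: string): RetryDecision {
  const waitMs = parseRetryDelayMs(message);
  // Assumption: an unparseable message, or a wait of exactly 10 s, counts as a long wait.
  if (waitMs !== null && waitMs < RETRY_CUTOFF_MS) return { action: "retry", waitMs };
  return { action: "fail-soft" };
}

// decideOn429("Rate limit reached, try again in 1.278s")    -> { action: "retry", waitMs: 1278 }
// decideOn429("Rate limit reached, try again in 2m57.984s") -> { action: "fail-soft" }
```

On "fail-soft" the caller logs at warn level and returns an empty string, as described above.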

For backfill runs across large libraries, explicit pacing works better than hoping the client doesn't get rate-limited: one call at a time, 2.5 seconds between each. A 900-image supplement catalog we backfilled last quarter took most of a workday. That's fine as a background cron job. The teams I've worked with who tried to go faster all hit daily caps and had to restart anyway.
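
Explicit pacing is only a few lines. The sketch below is illustrative: `pacedBackfill` is a made-up name, and the injectable `wait` parameter is my addition so the loop can be tested without real delays. It assumes `processOne` follows the never-throws convention described in the next section.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Processes items strictly one at a time with a fixed gap between calls.
// Returns the number of items processed.
async function pacedBackfill<T>(
  items: readonly T[],
  processOne: (item: T) => Promise<void>,
  gapMs: number = 2_500,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<number> {
  let done = 0;
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await wait(gapMs); // pause between calls, not before the first one
    await processOne(items[i]);
    done++;
  }
  return done;
}
```

There is deliberately no concurrency knob. The point of the loop is to stay under the provider's limits, not to finish quickly.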

What happens when it fails

The function always resolves to a string. It never throws. Any failure (network error, model timeout, over-limit 429, malformed response) produces an empty string, which the caller treats as "no alt text generated."

A 15-second timeout wraps the API call via Promise.race. Without it, a hung request blocks the background worker indefinitely and starves other assets in the queue.

const result = await Promise.race([
  client.chat.completions.create({ ... }),
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("groq timeout")), 15_000)
  ),
]);
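
One wrinkle with a bare Promise.race is that the losing timer keeps running after the API call settles. A minimal sketch of a wrapper that clears it, and that also folds in the never-throws contract, might look like this. `withTimeout` and `altTextOrEmpty` are names I made up for illustration, and `call` stands in for whatever function issues the vision request and extracts the sentence.

```typescript
// Rejects if `work` has not settled within `ms`; always clears the timer.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timeout`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // no stray timer once either side has won
  }
}

// Always resolves to a string. Any failure (timeout, network error,
// rejected request) collapses to "", meaning "no alt text generated".
async function altTextOrEmpty(
  call: () => Promise<string>,
  ms: number = 15_000,
): Promise<string> {
  try {
    const text = await withTimeout(call(), ms, "vision");
    return text.trim();
  } catch {
    return "";
  }
}
```

Note that the timeout stops the caller from waiting but does not cancel the underlying HTTP request. Cancelling it would need an abort signal, if your client supports one.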

Assets with empty alt text stay in the backfill queue. A daily cron walks every image asset where altText is null or empty and retries them. Retrying the same asset multiple times doesn't cause problems; each attempt either writes a result or leaves the field empty and moves on.
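
The selection rule for that cron is simple enough to write down. The `Asset` shape below is a hypothetical minimum, not the real schema, and treating whitespace-only values as empty is my assumption.

```typescript
// Hypothetical asset row; a real schema will have more fields.
interface Asset {
  id: string;
  kind: "image" | "video";
  altText: string | null;
}

// True for image assets whose alt text is null, empty, or whitespace only.
function needsAltText(asset: Asset): boolean {
  return asset.kind === "image" && (asset.altText === null || asset.altText.trim() === "");
}

function selectBackfillCandidates(assets: readonly Asset[]): Asset[] {
  return assets.filter(needsAltText);
}
```

Because a failed attempt leaves the field empty, the same asset simply shows up again on the next run. No separate retry bookkeeping is needed.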

How the alt text feeds into search

The text the model generates isn't just metadata decoration. It feeds directly into the semantic search embedding.

The embedding input concatenates three sources:

"woman in red wool coat standing outdoors in winter fog | coat wool red winter outerwear | DSC_4821.jpg"

Order: alt text, tags, original filename. The alt text carries the most weight because it contains the richest vocabulary. Tags carry the user's own mental model, which is useful for matching searches that use terms the vision model wouldn't generate. The filename is a fallback for camera-named files with no other metadata.

An image that would otherwise match only its filename now surfaces on searches for "red coat winter outdoor shoot," "wool outerwear gray sky," or "woman standing fog."
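
A sketch of how that input string could be assembled, assuming the " | " separator and space-joined tags from the example above. The function name is hypothetical, and dropping empty parts is my choice so that an asset with no alt text falls back cleanly to tags and filename.

```typescript
// Order matters: alt text first, then tags, then the original filename.
function buildEmbeddingInput(
  altText: string,
  tags: readonly string[],
  filename: string,
): string {
  return [altText.trim(), tags.join(" ").trim(), filename.trim()]
    .filter((part) => part.length > 0) // skip sources that have nothing to contribute
    .join(" | ");
}

// buildEmbeddingInput("", ["coat", "wool"], "DSC_4821.jpg")
//   -> "coat wool | DSC_4821.jpg"
```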

A generated one-liner won't satisfy a thorough accessibility audit, and vendors who sell it as ADA risk mitigation are setting wrong expectations with legal teams. The ROI is search quality and catalog discoverability. In my testing with supplement brands, match rate jumped from around 30% to over 80% after a backfill run. The catalog didn't change. The metadata did.

FAQ

How accurate is AI-generated alt text?

Useful for most catalog shots, genuinely unreliable on a predictable subset. Studio shots on clean backgrounds come back correct nearly every time. The failure rate rises on images with cluttered backgrounds, small products relative to the frame, or abstract photography. The most common errors are subtle: wrong color name ("pink" for "dusty rose"), wrong fit descriptor ("oversized" for "relaxed fit"), or a generic description when the product has a distinctive detail the model missed. Plan for a spot-check on every batch and a manual override path for anything the model gets wrong.

Does AI alt text help SEO?

Yes, in two ways. First, search engines index alt text and use it to understand image content; images without alt text are essentially invisible to crawlers. Second, descriptive alt text improves the surrounding page's keyword relevance for long-tail queries. Google's own documentation confirms alt text as an image indexing signal. The gains compound on large catalogs: a brand with 10,000 product images and blank alt text fields is leaving a significant amount of indexable content on the table. That said, AI-generated alt text sets a floor. For hero images or high-priority landing pages, human-written alt text optimized for specific queries will outperform a generic model output.

What vision model should I use for product images?

For most e-commerce catalogs, a fast mid-tier model works better than a flagship. In my testing, Llama 4 Scout via Groq handles product photography well, runs fast, and is cheap enough to backfill large catalogs without meaningful spend. The main tradeoff is that it occasionally struggles with highly stylized or abstract imagery that a larger model handles better. If your catalog is primarily standard product-on-background photography, start with a fast model. If you have a lot of lifestyle or editorial shots, run a quality comparison against a larger model on a sample of 200 images before committing to either. The description quality difference on straightforward product shots rarely justifies the cost difference.

Does this work on videos?

No, at least not with this approach. Video alt text requires extracting a representative frame first; sending a video URL to a vision API doesn't work. For now, videos embed on tags and filename only.

What if the model describes the wrong thing?

It happens, and the failure modes are predictable once you've seen enough of them. Shots with busy backgrounds, images where the product is small relative to the frame, and abstract photography produce off descriptions more often. Temperature 0.2 reduces it compared to higher settings but doesn't eliminate it. In Pixel Wand, users can overwrite the generated alt text manually and the embedding re-runs on save. The subtler problem is misidentification that looks plausible: a jacket with a hidden zipper described as a pullover, or a matte finish product described as glossy. Those pass a quick skim but break on search. Domain-specific vocabulary in the system prompt narrows the gap. Still, a manual spot-check on each batch will catch the ones it doesn't.

Is there a quality threshold before the alt text is written?

Not currently, and I think that's a gap worth closing. A minimum-length filter would help; descriptions under 20 characters are often useless. My view is that any pipeline running at catalog scale should reject suspiciously short outputs rather than writing them. I don't have a strong read on what the right floor is. Twenty characters is a guess worth testing.
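
If you want to try that filter, it could look like the sketch below. The 20-character floor is the same untested guess as above. The function name is hypothetical, and stripping wrapping quotes is an extra I added because models sometimes ignore the "NO quotes" instruction.

```typescript
const MIN_ALT_TEXT_CHARS = 20; // a guess worth testing, not a validated threshold

// Returns the cleaned sentence, or "" when the output is too short to be useful.
function acceptAltText(raw: string): string {
  const cleaned = raw
    .trim()
    .replace(/^["“”]+|["“”]+$/g, "") // drop wrapping double quotes, straight or curly
    .trim();
  return cleaned.length >= MIN_ALT_TEXT_CHARS ? cleaned : "";
}

// acceptAltText("blue sneaker") -> "" (12 characters, below the floor)
```

Returning an empty string keeps rejected outputs on the same path as any other failure, so the asset stays in the backfill queue.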

What does this cost at scale?

On a free tier, the constraint is rate limits, not per-call charges. At 2.5-second inter-call pacing the ceiling is roughly 1,400 images per hour (3,600 / 2.5 = 1,440, before counting the time each request takes). On paid tiers, the cost per image is well under $0.001 based on typical thumbnail token counts (roughly 900 input tokens and 50 output tokens per image at current Llama 4 Scout pricing). For most catalogs, the bigger cost is engineering time to build and maintain the pipeline, not the inference spend.
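
For your own numbers, the arithmetic fits in two small functions. Both names are hypothetical, and the per-million-token prices are parameters on purpose: plug in your provider's current rates rather than trusting any figure hard-coded in a blog post.

```typescript
// Upper bound on throughput from pacing alone; request latency lowers it further.
function imagesPerHour(gapSeconds: number): number {
  return Math.floor(3_600 / gapSeconds);
}

// Cost of one image in USD, given token counts and per-million-token prices.
function costPerImageUsd(
  inputTokens: number,
  outputTokens: number,
  usdPerMillionInput: number,
  usdPerMillionOutput: number,
): number {
  return (inputTokens * usdPerMillionInput + outputTokens * usdPerMillionOutput) / 1_000_000;
}

// imagesPerHour(2.5) -> 1440
```

Multiply `costPerImageUsd(900, 50, inputRate, outputRate)` by your catalog size to get a backfill estimate.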

Where to start

Sort your assets by altText IS NULL, pick a vision model that accepts image URLs, and run a paced backfill on 100 images first. Check the output quality before scaling up. The first batch will show you where your system prompt needs tightening; almost every domain has at least one image type that generates generic output without explicit vocabulary guidance. Pixel Wand runs this automatically after each upload if you want to see it working against a real library before building your own.