Tools · 16 min read · PublicSoftTools Team · May 2026

AI Image Captioner Free — Generate Captions from Any Photo

Writing descriptions for images is time-consuming, and getting alt text right for accessibility is even harder. The free AI Image Captioner on PublicSoftTools generates a natural language caption from any image you select in seconds, with no signup and without sending the image to any server.

What Is an AI Image Captioner?

An AI image captioner is a model that takes an image as input and outputs a natural language description of what it contains. The model does not retrieve a stored description — it generates text by analysing the visual content of the image.

This tool uses the ViT-GPT2 architecture, which combines two powerful models: a Vision Transformer (ViT) that encodes the image into a structured representation, and GPT-2 that decodes that representation into a sentence. The entire process runs in your browser via Transformers.js — a JavaScript port of Hugging Face Transformers that uses WebAssembly to execute the model client-side.

The practical difference compared to cloud-based captioners is privacy: your images are never transmitted over the network. The only external request is the one-time download of the model weights (approximately 85 MB), which are then cached in browser storage for all subsequent sessions. Once loaded, the tool works entirely offline.

How ViT-GPT2 Works Under the Hood

Understanding the architecture helps you predict when captions will be good and when they will fall short.

Vision Transformer (ViT)

The Vision Transformer splits your image into a grid of fixed-size patches — typically 16×16 pixels each. Each patch is flattened into a vector and passed through a standard transformer encoder alongside a learnable classification token. The attention mechanism allows every patch to attend to every other patch, so the model captures both local texture and global composition simultaneously.

The output of the ViT encoder is a sequence of contextualised patch embeddings — a rich numerical representation of the image that captures objects, their spatial relationships, colours, and contextual cues. This representation is what gets handed to the language model.
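
To make the patch arithmetic concrete, here is a minimal sketch that counts how many tokens the encoder would see for one image. The 16×16 patch size comes from the description above; the 224×224 input resolution is an assumption (a common choice for base ViT checkpoints, not something stated for this tool), and the function name is purely illustrative.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Return the number of tokens fed to a ViT encoder for a square image.

    Assumes the image has already been resized to image_size x image_size
    (224 is an assumed default) and that one classification token is added.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the classification token


print(vit_sequence_length())  # 14 * 14 patches + 1 token = 197
```

Under that assumption every image, whatever its original dimensions, ends up as the same short sequence of patch embeddings, which is why fine detail in very large photos rarely changes the caption.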

GPT-2 Decoder

GPT-2 is an autoregressive language model trained on a large text corpus. In ViT-GPT2, it is fine-tuned with cross-attention layers so that it conditions on the image embeddings and generates caption tokens one at a time, with each new token also depending on everything generated before it. The result is a fluent English sentence that describes the image content.
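
The token-by-token generation described above can be illustrated with a toy loop. This is only a sketch: the lookup table is a hypothetical stand-in for the real decoder, and the `<start>` and `<end>` markers are illustrative names rather than the model's actual special tokens.

```python
# Hypothetical next-token table standing in for the real decoder.
# A real model scores every vocabulary token using the image embeddings
# and all previously generated tokens; this toy only looks at the last token.
TOY_NEXT_TOKEN = {
    "<start>": "a",
    "a": "dog",
    "dog": "sitting",
    "sitting": "on",
    "on": "grass",
    "grass": "<end>",
}


def greedy_decode(max_length: int = 16) -> str:
    """Generate tokens one at a time until <end> or max_length is reached."""
    tokens = ["<start>"]
    while len(tokens) - 1 < max_length:
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])


print(greedy_decode())              # a dog sitting on grass
print(greedy_decode(max_length=2))  # a dog
```

The two stopping conditions (an end marker or a length cap) are the same ones that determine how long a real caption turns out to be.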

The model was trained on the COCO Captions dataset and similar image-caption pairs. This means it performs best on the kinds of subjects that appear frequently in those datasets: everyday objects, people in natural settings, animals, food, sports, and outdoor scenes. Categories underrepresented in training data — abstract art, technical diagrams, screenshots, text-heavy images, unusual camera angles — are where accuracy drops.

Key Use Cases

Use Case	How AI Captions Help	Review needed?
Accessibility alt text	Generate a draft alt attribute for web images	Yes — especially for informational graphics
Social media posts	Get a factual base for photo captions	Minimal — adapt tone before publishing
Image library indexing	Generate searchable text metadata for photos	Optional — accuracy is usually good for photos
Content writing	Describe an image to use in an article or blog	Light editing for style
Screen reader content	Draft descriptions for visual content	Yes — context matters for accessibility
Product photo cataloguing	Auto-label product images for ecommerce	Yes — product-specific details need manual addition
AI image verification	Check if generated images match intended descriptions	No — useful as a sanity check

How to Generate Captions

1. Open the tool. Go to the AI Image Captioner. No login required.
2. Upload an image. Click the dropzone or drag any image onto it. JPEG, PNG, WebP, GIF, and AVIF are supported.
3. Click Generate Caption. On first use, the ViT-GPT2 model (~85 MB) downloads and caches. A status message shows progress. Subsequent captions are fast.
4. Copy and use. The caption appears below the image. Click Copy and paste it into your CMS, social post, or document.

Writing Effective Alt Text: A Practical Guide

AI captions give you a starting point, but good alt text requires human judgment. Here is how to take a generated caption and turn it into compliant, helpful alt text.

What WCAG requires

Web Content Accessibility Guidelines (WCAG) 2.1 Success Criterion 1.1.1 requires a text alternative for all non-decorative images. The alternative should convey the same information or function as the image. A caption that says "a dog sitting on grass" is technically valid, but if the image is illustrating a specific breed for a dog breed comparison article, the alt text should say "a golden retriever sitting on grass, facing forward" — adding the specific detail that serves the article's purpose.

Informational vs decorative images

Decorative images — backgrounds, dividers, purely aesthetic flourishes — should use an empty alt attribute (alt="") so screen readers skip them. The AI captioner will describe them as objects, but adding those descriptions creates noise for screen reader users without conveying anything meaningful. Reserve AI captions for images that actually communicate content.
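
If you build image markup in a script, the decorative rule is easy to encode. The helper below is a minimal sketch; `img_tag` and its `decorative` flag are hypothetical names, and deciding which images are decorative is still a human judgment.

```python
from html import escape


def img_tag(src: str, caption: str = "", decorative: bool = False) -> str:
    """Build an <img> tag; decorative images get an empty alt attribute."""
    alt = "" if decorative else caption.strip()
    # escape() with quote=True keeps quotes in captions from breaking the attribute
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


print(img_tag("divider.png", "a wavy line", decorative=True))
print(img_tag("dog.jpg", "a dog sitting on grass"))
```

Note that the generated caption is discarded for decorative images even when one is supplied, which matches the guidance above.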

Adding context the model cannot know

ViT-GPT2 sees only the image pixels — it has no knowledge of your page's topic, the person depicted, or the product shown. A caption of "a person holding a microphone" may be accurate, but the alt text for an article about a specific keynote speaker should say "Jane Smith speaking at the 2026 Tech Summit" — context only you can provide.

Keeping alt text concise

Screen reader users navigate pages sequentially. Long alt text interrupts the reading flow. Aim for one to two sentences. If more detail is needed, consider a visible caption below the image or a linked long description. For complex infographics or charts, a separate accessible table or paragraph often serves users better than a long alt attribute.
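
A rough check for the one-to-two-sentence guideline can be scripted. This is a heuristic sketch with hypothetical function names: it counts sentence-ending punctuation followed by whitespace or the end of the text, so abbreviations such as "e.g." will be over-counted.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting on ., ! or ? followed by whitespace or end."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([part for part in parts if part])


def is_concise(alt_text: str, max_sentences: int = 2) -> bool:
    """True when the alt text is non-empty and within the sentence limit."""
    return 0 < count_sentences(alt_text) <= max_sentences


print(is_concise("A golden retriever sitting on grass, facing forward."))
print(is_concise("A dog. It sits on grass. The sky is blue. Trees stand behind it."))
```

Alt text that fails the check is a candidate for a visible caption or a linked long description rather than a longer alt attribute.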

SEO Applications for Image Captions

Search engines index alt text and use it to understand image content. While Google can analyse images directly, descriptive alt text remains a ranking signal for image search and contributes to overall page relevance.

Product images

For ecommerce, product images with descriptive alt text improve rankings in Google Images, which drives purchase intent traffic. Use the AI caption as a base, then add the product name, model number, and key features. A caption like "a black kettle on a wooden table" becomes "Fellow Stagg EKG electric kettle, matte black, 0.9 litre capacity" with manual enhancement.

Blog and article images

For editorial images, keep alt text descriptive of what is shown rather than stuffing in keywords. Keyword-stuffed alt text makes for a poor user experience, and search engines may treat it as spam. The goal is a description that matches what someone would search for to find that type of image.

Structured data for images

If you are adding ImageObject structured data to pages, the description property accepts the same kind of natural language caption this tool generates. Combine with the Schema Markup Generator to build the complete JSON-LD block.
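
As an illustration of where the caption fits, the sketch below builds a small ImageObject block with Python's standard `json` module. `contentUrl` and `description` are schema.org properties; the function name and the example URL are placeholders.

```python
import json


def image_object_jsonld(content_url: str, caption: str) -> str:
    """Return a JSON-LD string describing one image with schema.org ImageObject."""
    data = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "description": caption,
    }
    return json.dumps(data, indent=2)


# Placeholder URL and caption for illustration only
print(image_object_jsonld("https://example.com/dog.jpg", "a dog sitting on grass"))
```

The printed JSON would go inside a script tag of type application/ld+json on the page that displays the image.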

Advanced Workflows

Batch captioning images

Once the model is loaded into browser memory after the first caption, subsequent captions generate quickly. For batch work, upload images one at a time and collect captions in a text editor. The model stays in memory as long as the browser tab is open. For very large batches, keep the tab open and work through images systematically. Reloading the page clears the model from memory, so it has to be loaded again from the browser cache before the next caption.

Generating alt text for a website

The most common professional use case is generating first-draft alt attributes for website images. Upload each image, copy the caption, and paste it into your HTML or CMS alt field. For decorative images (icons, dividers), use an empty alt attribute instead — captions for decorative elements can mislead screen reader users.

Converting images to searchable text

For photo libraries, upload images and use the captions as metadata. Combine this with the OCR tool for images that contain text — OCR extracts the written words while this tool generates a contextual description. Together they make a photo library fully searchable by natural language queries.
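
One way to combine the two kinds of text is a small keyword index. The sketch below assumes you have already collected a caption and OCR text per file by hand; the record layout, function names, and sample data are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(records: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map each word to the filenames whose caption or OCR text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for filename, fields in records.items():
        combined = fields.get("caption", "") + " " + fields.get("ocr_text", "")
        for word in tokenize(combined):
            index[word].add(filename)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the filenames that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Made-up sample records: captions from the captioner, text from OCR
records = {
    "label.jpg": {"caption": "a bottle of sauce on a wooden surface",
                  "ocr_text": "Smoky Chipotle 250 ml"},
    "hike.jpg": {"caption": "a person standing on a mountain summit",
                 "ocr_text": ""},
}
index = build_index(records)
print(search(index, "chipotle bottle"))  # {'label.jpg'}
```

The query "chipotle bottle" only matches because one word comes from the OCR text and the other from the caption, which is the point of keeping both.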

Describing AI-generated images

Generate an image with the AI Image Generator, then use the image captioner to describe what the model produced. This is useful for checking whether the generation matched your original intent or for creating accessible descriptions of AI artwork before publishing. If the caption reads very differently from your original prompt, the generation may need refinement.
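
A crude way to quantify how differently the caption reads from the prompt is word overlap. The sketch below uses Jaccard similarity on content words; the stopword list and function names are assumptions chosen for illustration, and exact word matching means related forms such as "snow" and "snowy" do not count as a match.

```python
import re

# Small illustrative stopword list, not a complete one
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "at", "to"}


def content_words(text: str) -> set[str]:
    """Return the distinct lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def overlap(prompt: str, caption: str) -> float:
    """Jaccard similarity between the content words of prompt and caption."""
    p, c = content_words(prompt), content_words(caption)
    if not p or not c:
        return 0.0
    return len(p & c) / len(p | c)


score = overlap("a red fox in a snowy forest", "a fox standing in the snow")
print(round(score, 2))  # 0.17: only "fox" is shared out of six distinct words
```

Treat a low score as a prompt to look at the image yourself, not as proof that the generation failed.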

Accessibility auditing workflow

If you are auditing an existing website for WCAG compliance, export all images from the site, run them through the image captioner, and compare the generated descriptions against the existing alt text. Images where the current alt text is empty, generic ("image", "photo"), or significantly different from the caption are likely accessibility failures that need review.
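
The comparison step can be sketched with Python's standard `difflib`. The list of generic words, the 0.3 similarity threshold, and the function name are all assumptions to tune for your own site; an "empty" result only means the image needs a look, since an empty alt attribute is correct for decorative images.

```python
from difflib import SequenceMatcher

# Illustrative list of alt values that say nothing about the image
GENERIC_ALT = {"image", "photo", "picture", "img", "graphic"}


def audit_alt(existing_alt: str, generated_caption: str,
              min_similarity: float = 0.3) -> str:
    """Classify existing alt text as 'empty', 'generic', 'mismatch', or 'ok'."""
    alt = existing_alt.strip().lower()
    if not alt:
        return "empty"
    if alt in GENERIC_ALT:
        return "generic"
    # Compare word sequences rather than characters
    similarity = SequenceMatcher(
        None, alt.split(), generated_caption.strip().lower().split()
    ).ratio()
    return "ok" if similarity >= min_similarity else "mismatch"


print(audit_alt("", "a dog sitting on grass"))
print(audit_alt("Photo", "a dog sitting on grass"))
print(audit_alt("a golden retriever sitting on grass", "a dog sitting on grass"))
```

A "mismatch" can also mean the generated caption is the one that is wrong, so every flagged image still needs a manual review.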

Understanding Caption Quality and Its Limits

What the model handles well

ViT-GPT2 performs best on photographs of common real-world subjects with clear composition and good lighting: people in recognisable settings (parks, offices, kitchens), animals, food, sports and fitness activities, cityscapes, and product photography. For these, the captions are typically factually accurate and grammatically natural.

Where accuracy degrades

Image Type	Typical Caption Quality	Recommended Approach
Photographs of people and objects	High accuracy	Review for names and context only
Abstract or fine art	Describes shapes and colours, misses meaning	Write manually or use caption as style descriptor
Charts and graphs	Describes the visual (a bar chart) but not the data	Use a visible caption or accessible table instead
Screenshots	Often misclassifies UI as photographs	Write manually based on the action shown
Text-heavy images	Does not read or transcribe the text	Use the OCR tool for text extraction
Low-light or blurry photos	Generic captions with low confidence	Improve image quality if possible before captioning

Single captions vs multiple runs

GPT-2 is autoregressive and generates text token by token with some randomness. Running the same image twice may produce slightly different captions. If the first caption seems off, try generating again — a different run may produce a more accurate description. For critical accessibility uses, always review the output manually regardless of which run produced it.

Privacy and Data Handling

Are my images stored anywhere?

No. The ViT-GPT2 model runs entirely in your browser. Your image is processed in-memory by the WebAssembly model and never transmitted over the network. This applies to all images, including private photos, confidential documents, or proprietary product images.

What is downloaded from Hugging Face?

When you first use the tool, the model files are downloaded from Hugging Face's CDN: chiefly the ViT encoder weights and the GPT-2 decoder weights. These are standard open-source model files, not customised for this tool. They are cached by your browser for later sessions. No image data or caption output is ever sent anywhere; the model download is the only network request.

Using the tool on sensitive images

Because all processing is local, the tool is safe to use on images you cannot send to a cloud service: medical scans, confidential product prototypes, private photographs, financial documents that happen to contain images, or any content covered by data residency requirements. The processing isolation is equivalent to running a local desktop application.

Comparing AI Captioners

Several AI captioning tools exist, ranging from free browser-based tools to paid cloud APIs. Here is how they differ in practice.

Approach	Privacy	Accuracy	Cost	Best For
Browser-based (this tool)	Images never leave device	Good for photos	Free	Private images, quick drafts
Cloud API (Azure Vision, Google Vision)	Images sent to cloud	Higher — larger models	Per-request pricing	High-volume, production use
GPT-4o / Claude vision	Images sent to cloud	Highest — multimodal LLMs	Per-token pricing	Complex images, detailed descriptions
Manual alt text	N/A	Highest — human judgment	Time cost only	Critical accessibility content

For most workflows — drafting alt text for a personal blog, captioning a library of product photos for review, or indexing images for search — the browser-based tool provides sufficient accuracy without the cost or privacy tradeoffs of cloud services. Use cloud vision APIs when you need maximum accuracy at scale, or when the images are already public and privacy is not a concern.

Common Questions

What subjects does it handle best?

ViT-GPT2 performs best on everyday subjects: people, animals, food, nature, sports, and common objects. It produces less precise captions for technical diagrams, abstract art, charts, screenshots, and stylised illustrations. For these, use the caption as a starting point and add specific context manually.

Can I use this for commercial images?

Yes. The processing happens locally in your browser — no image data leaves your device. There are no licensing restrictions from the tool itself on the captions you generate. The ViT-GPT2 model is released under the Apache 2.0 licence, which permits commercial use.

Why does the model sometimes produce very short captions?

GPT-2 generates tokens until it predicts an end-of-sequence token or reaches a maximum length. For simple compositions — a single object against a plain background — it often produces a short, accurate caption. More complex scenes with many subjects or relationships sometimes also produce short captions if the model assigns high confidence to a simple description early in generation. Generating again may produce a more detailed result.

Is there a file size limit?

The tool processes images in your browser's memory. Very large image files (above 20 MB) may be slow or cause memory issues on devices with limited RAM. For large images, compress them first with the Image Compressor before captioning — the visual content at the resolution the model uses is limited anyway, so compression does not meaningfully reduce caption quality.

What languages does it output captions in?

ViT-GPT2 outputs captions in English only. The model was trained on English image-caption pairs and does not support multilingual output. For non-English alt text, generate the English caption and translate it using the AI Translator.

Can I use the captions directly as social media posts?

The captions are factual descriptions, not social media copy — they describe what is in the image literally. For social media, use the caption as raw material and add personality, a call to action, relevant hashtags, and the tonal style of your account. The description "a person standing on a mountain summit with a view of valleys below" becomes a post when you add context, voice, and engagement hooks.

How does this compare to using the OCR tool?

These are complementary tools with different jobs. The OCR tool extracts text that appears in an image — printed words, typed labels, handwriting. The Image Captioner generates a description of the visual scene. For an image of a product label, OCR reads the label text while the captioner might say "a bottle of sauce on a wooden surface." Use both together for complete image metadata.

Caption Your First Image Now

Free, no signup. Upload any image — your photos never leave your browser.
