PublicSoftTools

AI Image Captioner

Upload any image and get an AI-generated description in seconds. Powered by ViT-GPT2 — runs entirely in your browser, no signup, images never leave your device.

How the AI Image Captioner Works

  1. Upload an image. Click the dropzone or drag an image file onto it. JPEG, PNG, WebP, GIF, and AVIF are supported.
  2. Click Generate Caption. On first use, the ViT-GPT2 model (~85 MB) downloads and caches in your browser. A clear progress message shows the status.
  3. Read and copy the caption. The generated description appears below the image. Click Copy to grab it for use in alt text, social posts, or content.
  4. Caption more images instantly. Once the model is loaded, subsequent captions generate in seconds — no further downloads needed.
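Steps 2 and 4 boil down to a lazy, load-once pattern. A minimal sketch, assuming a generic async loadPipeline function standing in for the Transformers.js model loader (the names here are illustrative, not the tool's actual code):

```javascript
// Wrap a model loader so the ~85 MB download happens at most once:
// the first caption triggers the load, later captions reuse it.
function createCaptioner(loadPipeline) {
  let pipelinePromise = null;
  return async function caption(imageSource) {
    if (!pipelinePromise) {
      pipelinePromise = loadPipeline(); // first use: download + init model
    }
    const model = await pipelinePromise;
    const output = await model(imageSource); // run image-to-text inference
    return output[0].generated_text; // Transformers.js-style output shape
  };
}
```

Because all calls share one pipelinePromise, two captions requested during the initial download wait on the same load rather than triggering it twice.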

Use Cases for AI-Generated Image Captions

Accessibility alt text. Generate draft alt attributes for images on your website. Review and refine before publishing — especially for informational images like charts or screenshots where specific detail matters.

Social media descriptions. Get a factual base description for an image you are posting, then layer the witty or casual tone of the post on top of that foundation.

Image indexing and cataloguing. Process batches of photos and use the captions as metadata for search and filtering — useful for photographers, designers, and content libraries.

Content creation starting point. A generated caption gives a descriptive anchor from which to write longer creative or editorial content about the image.
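The cataloguing workflow above can be sketched as a small helper that reuses one caption function across a batch. Here, caption is an assumed async (imageSource) => string function like the one this tool wraps around ViT-GPT2:

```javascript
// Batch-captioning sketch: run one caption function over many images,
// collecting { src, caption } pairs usable as search metadata.
async function captionAll(imageSources, caption) {
  const results = [];
  for (const src of imageSources) {
    // Sequential on purpose: the in-browser model is CPU-bound,
    // so firing captions in parallel would not speed this up.
    results.push({ src, caption: await caption(src) });
  }
  return results;
}
```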

Tips for Better Captions

Use clear, well-lit photos

The model performs best on images with a clear subject and good lighting. Blurry, dark, or heavily filtered images produce less accurate descriptions.

Crop to the main subject

If an image has a lot of background clutter, cropping it to focus on the main subject before uploading produces more specific captions.

Review for diagrams and charts

The model describes what it sees visually — it cannot interpret the data or meaning in a chart. Always write custom alt text for informational graphics.

Pair with the Image Converter

If your image is in an unusual format, convert it to JPEG or PNG first using the Image Converter before uploading here.

Frequently Asked Questions

How does the AI image captioner work?

The tool uses Transformers.js to run the ViT-GPT2 image captioning model in your browser via WebAssembly. ViT (Vision Transformer) encodes the image into feature representations, and GPT-2 then generates a natural language caption from those features. The model file (~85 MB) downloads from Hugging Face on first use and is cached in your browser.

Is my image uploaded to a server?

No. The ViT-GPT2 model runs entirely in your browser. Your image is read locally using the browser File API and processed in-memory by the WebAssembly model. No image data is transmitted to any server — not PublicSoftTools, not Hugging Face, not anyone.

Why does the first caption take longer?

On first use, the model file (~85 MB) downloads from Hugging Face CDN and is cached by your browser. This typically takes 15–60 seconds depending on your connection. Subsequent captions using the same model are much faster because the model is already loaded in browser memory.
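The 15–60 second range follows from the download size alone; a back-of-envelope conversion (illustrative numbers, not measurements from the tool):

```javascript
// Rough first-load estimate: megabytes -> megabits, divided by link speed.
function estimateDownloadSeconds(modelMB, speedMbps) {
  return (modelMB * 8) / speedMbps;
}
// ~85 MB is 680 megabits: about 68 s at 10 Mbps, about 13.6 s at 50 Mbps.
```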

What image formats are supported?

JPEG, PNG, WebP, GIF, AVIF, and any format your browser can natively display. Very large images may be slower to process; the model works with standard web-resolution images (up to a few megabytes) without issues.
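In code, a check for the explicitly listed formats can key off the MIME type the browser reports on a File object. This is a sketch of a strict whitelist; the tool itself also accepts anything the browser can decode:

```javascript
// MIME types for the formats listed above; browsers report these
// on File objects from the dropzone or a file picker.
const SUPPORTED_TYPES = new Set([
  'image/jpeg', 'image/png', 'image/webp', 'image/gif', 'image/avif',
]);

function isSupportedImage(file) {
  return SUPPORTED_TYPES.has(file.type);
}
```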

How accurate are the captions?

ViT-GPT2 produces accurate high-level descriptions for common subjects — people, animals, objects, outdoor scenes, food. It may produce generic or less precise captions for unusual subjects, technical diagrams, or heavily stylised art. For accessibility alt-text generation, treat the output as a starting point and review before publishing.

Can I use this to generate alt text for accessibility?

Yes, this is one of the primary use cases. The captions are short, descriptive, and suitable as a starting point for image alt attributes. Review and refine the output — especially for images that convey specific information (charts, text, diagrams) where the context matters.