Tools · 16 min read · PublicSoftTools Team · June 2026

Speech to Text Online: The Complete Guide to Browser-Based Voice Transcription

Speaking is roughly two to three times faster than typing for most people: the average speaking rate is about 130 words per minute (WPM), while the average typing speed is 40–60 WPM. The free Speech to Text tool converts your voice to text in real time using your browser microphone, in 15 languages, with no signup and no audio file uploads required.

How Browser Speech-to-Text Works — The Web Speech API

The tool uses the Web Speech API, a browser API that gives JavaScript access to the speech recognition and speech synthesis capabilities exposed by the browser. The API was specified by a W3C Community Group in 2012 and first shipped in Chrome in early 2013; Edge and Safari added support later.

The API has two components: SpeechRecognition (speech-to-text) and SpeechSynthesis (text-to-speech). The tool uses SpeechRecognition, which operates as follows:

Microphone access. The browser requests permission to access your microphone via the MediaDevices API. If you deny the permission, speech recognition cannot start. Permissions are per-origin and can be managed in your browser's site settings.
Audio capture. Once permission is granted, the browser captures audio from your microphone in real time. The audio data is processed internally by the browser engine.
Speech recognition engine. The recognition engine differs by browser: Chrome sends audio to Google's speech recognition servers; Edge sends audio to a cloud speech service operated by Microsoft; Safari uses Apple's on-device speech recognition engine. Firefox does not currently implement SpeechRecognition.
Result delivery. The engine returns recognition results as JavaScript events. Interim results (marked as not final) are returned quickly and shown in a distinctive style to indicate they may change. Final results are returned once the engine is confident in its transcription.
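
The four steps above can be sketched in JavaScript. This is a minimal illustration under stated assumptions, not the tool's actual source: getRecognitionCtor and createRecognizer are hypothetical helper names, and the global object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: looks up the recognition constructor on a global object.
// Chrome and Safari expose it under the webkit prefix.
function getRecognitionCtor(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// Hypothetical helper: creates a recognizer and reports text through onText.
function createRecognizer(globalObj, lang, onText) {
  const Ctor = getRecognitionCtor(globalObj);
  if (!Ctor) {
    throw new Error("SpeechRecognition is not available in this browser");
  }
  const recognition = new Ctor();
  recognition.lang = lang;
  // Results arrive as events; each result holds alternatives and an isFinal flag.
  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      onText(result[0].transcript, result.isFinal);
    }
  };
  return recognition;
}

// Browser usage (call start() from a click handler):
//   const recognition = createRecognizer(window, "en-US", (text, isFinal) => {
//     console.log(isFinal ? "final:" : "interim:", text);
//   });
//   recognition.start();
```

In a real page, calling start() is what triggers the microphone permission prompt described in the first step.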

Continuous Mode and Interim Results

The tool runs in continuous mode (recognition.continuous = true) with interim results enabled (recognition.interimResults = true). This produces the live transcription effect where:

As you speak, grey text appears showing what the recognition engine is currently processing (interim results). These may change as the engine processes more context.
When you pause, the engine finalizes its transcription for that phrase. The text turns black and is locked into the transcript — it will not change.
The session continues until you click Stop, allowing you to dictate for as long as you need without restarting.
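
The grey and black text described above comes from the isFinal flag on each result. The following is a minimal sketch of how a page might separate the two; splitResults is a hypothetical helper, and the rendering step is left as a comment.

```javascript
// Hypothetical helper: splits a result list into finalized and interim text.
// `results` mirrors SpeechRecognitionResultList: each entry is indexable
// (entry[0].transcript) and carries an isFinal flag.
function splitResults(results) {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    if (results[i].isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Browser wiring (assumes `recognition` is a SpeechRecognition instance):
//   recognition.continuous = true;
//   recognition.interimResults = true;
//   recognition.onresult = (event) => {
//     const { finalText, interimText } = splitResults(event.results);
//     // render finalText in black and interimText in grey
//   };
```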

The auto-restart feature handles the browser's built-in timeout: Chrome's speech recognition automatically stops after about 20–30 seconds of silence. The tool detects this and restarts the recognition session automatically, so long dictation sessions continue without interruption.
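
One common way to implement this kind of auto-restart is to listen for the end event and start again unless the user asked to stop. The sketch below assumes that approach; keepAlive is a hypothetical helper and may differ from what the tool does internally.

```javascript
// Hypothetical helper: keeps a recognition session alive until stop() is called.
function keepAlive(recognition) {
  let wanted = false;
  // onend fires whenever a session ends, including browser-initiated timeouts.
  recognition.onend = () => {
    if (wanted) {
      recognition.start();
    }
  };
  return {
    start() {
      wanted = true;
      recognition.start();
    },
    stop() {
      wanted = false;
      recognition.stop();
    },
  };
}

// Browser usage (assumes `recognition` is a SpeechRecognition instance):
//   const session = keepAlive(recognition);
//   startButton.onclick = () => session.start();
//   stopButton.onclick = () => session.stop();
```

Because each restarted session reports its results separately, a page using this pattern would also need to keep the text finalized in earlier sessions.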

Browser Support and Privacy Implications

Understanding which browsers support the Web Speech API and how each handles privacy is important for sensitive dictation work:

Browser	SpeechRecognition Support	Where Recognition Runs	Audio Sent to Servers?
Chrome (desktop)	Yes	Google Cloud Speech API	Yes — audio sent to Google servers
Edge (Chromium)	Yes	Microsoft cloud speech service	Yes; audio sent to Microsoft servers
Safari (macOS/iOS)	Yes	Apple on-device engine	No — processed locally
Firefox	No	N/A	N/A — API not implemented
Chrome (Android)	Yes	Google Cloud Speech API	Yes — audio sent to Google servers

For dictating sensitive content (medical notes, legal discussions, confidential business information), prefer Safari on macOS or iOS, where recognition can run on-device (availability depends on the device and language). For accuracy across many languages, Chrome often produces better results, likely because Google's speech service is trained on a larger and more diverse dataset.

Supported Languages and Dialects

The tool supports 15 languages covering the most widely spoken languages globally. Select your language before starting — the recognition engine uses the language setting to choose its model and optimize for the phonetics of that language:

Language	Code	Notes
English (US)	en-US	Default; optimized for American English accent
English (UK)	en-GB	British English accent optimization
Spanish	es-ES	Castilian Spanish; Latin American accents also recognized
French	fr-FR	Metropolitan French
German	de-DE	Standard German (Hochdeutsch)
Italian	it-IT	Standard Italian
Portuguese (Brazil)	pt-BR	Brazilian Portuguese
Arabic	ar-SA	Modern Standard Arabic; regional dialects partially supported
Chinese (Simplified)	zh-CN	Mandarin Chinese with simplified character output
Japanese	ja-JP	Japanese with kanji/hiragana/katakana output
Korean	ko-KR	Korean with hangul output
Hindi	hi-IN	Hindi with Devanagari script output
Russian	ru-RU	Russian with Cyrillic output
Turkish	tr-TR	Turkish
Dutch	nl-NL	Netherlands Dutch
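
In code, the language choice maps to the lang property of the recognition object. Below is a minimal sketch using the codes from the table; SUPPORTED_LANGUAGES and applyLanguage are hypothetical names, not taken from the tool's source.

```javascript
// Language codes from the table above.
const SUPPORTED_LANGUAGES = {
  "en-US": "English (US)",
  "en-GB": "English (UK)",
  "es-ES": "Spanish",
  "fr-FR": "French",
  "de-DE": "German",
  "it-IT": "Italian",
  "pt-BR": "Portuguese (Brazil)",
  "ar-SA": "Arabic",
  "zh-CN": "Chinese (Simplified)",
  "ja-JP": "Japanese",
  "ko-KR": "Korean",
  "hi-IN": "Hindi",
  "ru-RU": "Russian",
  "tr-TR": "Turkish",
  "nl-NL": "Dutch",
};

// Hypothetical helper: validates the code, then assigns recognition.lang.
function applyLanguage(recognition, code) {
  if (!Object.prototype.hasOwnProperty.call(SUPPORTED_LANGUAGES, code)) {
    throw new Error(`Unsupported language code: ${code}`);
  }
  recognition.lang = code;
  return SUPPORTED_LANGUAGES[code];
}
```

Setting lang before calling start() matters because the language applies to the session as a whole.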

Speech-to-Text Accuracy — What Affects It

Speech recognition accuracy has improved dramatically with deep learning. Google has reported word error rates (the share of words transcribed incorrectly) below 5% for clear English speech in quiet environments. However, several factors significantly affect real-world accuracy:

Microphone Quality

Microphone quality is the single most impactful factor in transcription accuracy. The difference between a built-in laptop microphone and a dedicated USB headset is often 10–20 percentage points in word error rate in real-world conditions. The recognition engine receives whatever audio signal your microphone captures — background noise, room reverb, and low-frequency interference all degrade the signal before recognition even begins.
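
Word error rate (WER) is usually computed as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is one way to measure it yourself when comparing microphones; wordErrorRate is a hypothetical helper that uses a word-level edit distance and ignores case.

```javascript
// Hypothetical helper: WER = word-level edit distance / reference word count.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One wrong word out of five gives a WER of 0.2 (20%).
console.log(wordErrorRate("the quick brown fox jumps", "the quick brown box jumps"));
```

Reading the same passage through two microphones and comparing the two rates gives a rough, personal estimate of the difference.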

For regular dictation, invest in a dedicated microphone. Budget USB headsets in the $20–40 range produce significantly better results than laptop built-in microphones. Dedicated USB desk microphones with cardioid polar patterns (which reject sound from the sides and rear) are even better for office environments.

Background Noise

Speech recognition models are trained primarily on speech isolated from noise. Ambient sounds — keyboard clicks, traffic, HVAC systems, conversations in the background — reduce accuracy by providing ambiguous audio input. In open-plan offices, the recognition engine may transcribe portions of nearby conversations.

Solutions: use the tool in a quiet environment; use a headset with noise cancellation (hardware-level noise cancellation, not software-level); close windows and doors; or use a directional microphone positioned close to your mouth.

Speaking Pace

Very fast speech (above 180 words per minute) causes words to blur together, especially consonant clusters at word boundaries. Very slow speech with long pauses between words causes the engine to finalize early, producing choppy transcriptions. Natural conversational pace (120–150 words per minute) produces the best results.
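
To check your own pace, divide the number of words in a transcript by the time you spent speaking. The sketch below shows the arithmetic; wordsPerMinute and classifyPace are hypothetical helpers, and the thresholds are the ones quoted in the paragraph above.

```javascript
// Hypothetical helper: words per minute from a transcript and its duration.
function wordsPerMinute(transcript, durationSeconds) {
  if (!(durationSeconds > 0)) {
    throw new Error("durationSeconds must be positive");
  }
  const words = transcript.trim().split(/\s+/).filter(Boolean).length;
  return (words * 60) / durationSeconds;
}

// Hypothetical helper: labels a pace using the ranges given in the text.
function classifyPace(wpm) {
  if (wpm > 180) return "very fast";
  if (wpm >= 120 && wpm <= 150) return "recommended";
  return "outside recommended range";
}

// 65 words spoken in 30 seconds is 130 WPM.
const pace = wordsPerMinute("word ".repeat(65), 30);
console.log(pace, classifyPace(pace));
```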

Pronunciation Clarity

Heavy regional accents, mumbling, or speaking with a hand near the mouth all reduce accuracy. The recognition engine compares audio input against probability distributions of phoneme sequences — non-standard pronunciation patterns not well-represented in the training data produce more errors.

Technical Terminology and Proper Nouns

Domain-specific vocabulary — medical terms, legal jargon, brand names, personal names — is where speech recognition makes its most frequent errors. The engine applies general language models that prioritize common words. "Paracetamol" may be transcribed as "para set of mall." "Kubernetes" may become "Cuba net ease." Reviewing and correcting technical terms is expected when dictating specialized content.
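
If the same terms are misheard repeatedly, a small replacement table can automate part of the clean-up. The following is a minimal sketch of that idea, not a feature of the tool; CORRECTIONS and applyCorrections are hypothetical names, and the entries are the two examples from the paragraph above.

```javascript
// Hypothetical correction table: misheard phrase -> intended term.
const CORRECTIONS = {
  "para set of mall": "Paracetamol",
  "cuba net ease": "Kubernetes",
};

// Escapes characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replaces each misheard phrase, ignoring case.
function applyCorrections(transcript, corrections) {
  let output = transcript;
  for (const [heard, intended] of Object.entries(corrections)) {
    const pattern = new RegExp(`\\b${escapeRegExp(heard)}\\b`, "gi");
    // A function replacement avoids special handling of "$" in the term.
    output = output.replace(pattern, () => intended);
  }
  return output;
}

console.log(applyCorrections("We deploy on Cuba net ease", CORRECTIONS));
// prints "We deploy on Kubernetes"
```

A table like this only helps with errors you have already seen, so a manual review is still needed.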

Language Selection

Selecting the wrong language, or a language variant that does not match your accent, dramatically reduces accuracy. If you are dictating British English, select en-GB rather than en-US. If you speak Spanish with a Latin American accent, es-ES is the only Spanish option the tool offers and will still recognize your speech, but accuracy will likely be lower than for a Castilian speaker.

Use Cases — Where Speech-to-Text Saves the Most Time

Meeting Notes and Action Items

Immediately after a meeting, dictating a summary is significantly faster than typing one, because speaking flows naturally from fresh memory while typing from memory requires more cognitive switching. Open the tool, speak your summary for 2–3 minutes, copy the transcript, and paste it into your note-taking app or project management tool.

For structured meeting notes, dictate in the format you want: "Action item one, Alice will send the proposal by Friday. Action item two, Bob will schedule the follow-up call for next week." The structure is captured in the transcript and easy to clean up.
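
A consistent spoken marker such as "action item" also makes the transcript easy to split programmatically. The sketch below assumes the marker is followed by a single number word, as in the example above; extractActionItems is a hypothetical helper.

```javascript
// Hypothetical helper: splits a transcript on "action item <number word>".
function extractActionItems(transcript) {
  const marker = /action item\s+\w+[,:]?\s*/gi;
  return transcript
    .split(marker)
    .slice(1) // drop anything said before the first marker
    .map((part) => part.trim().replace(/[.\s]+$/, ""))
    .filter(Boolean);
}

const notes =
  "Action item one, Alice will send the proposal by Friday. " +
  "Action item two, Bob will schedule the follow-up call for next week.";
console.log(extractActionItems(notes));
// prints the two items without the markers or trailing periods
```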

First-Draft Content Creation

Writers and content creators use dictation to overcome the blank-page problem. Speaking flows more naturally than writing because it is closer to thought — we have been speaking since childhood but most of us have been typing for far fewer years. Dictating a rough first draft removes the editing impulse that many writers struggle with while typing.

Professional authors who dictate, including Kevin J. Anderson, who has dictated millions of words of fiction while hiking, report that dictation produces higher words-per-hour rates and often more natural dialogue. The draft is rougher than a typed first draft, but the speed advantage is significant.

Accessibility for Users With Motor Difficulties

For users with repetitive strain injuries (RSI), carpal tunnel syndrome, hand pain from arthritis, or other motor difficulties affecting typing, voice input is not just faster — it may be the only practical way to input large amounts of text. The browser-based tool requires no software installation, works on any device with a microphone, and starts immediately — removing the barrier of navigating complex accessibility software setup.

Dyslexia accommodations also benefit from speech-to-text. Many individuals with dyslexia find composing text by voice significantly easier than by keyboard, as it bypasses the motor-visual-phoneme conversion difficulties associated with typing while maintaining normal thought flow.

Language Learning and Pronunciation Practice

Speaking a target language into the speech-to-text tool provides instant pronunciation feedback. If the recognition engine correctly transcribes your spoken French phrase, your pronunciation was sufficiently close to native patterns. Mispronounced words produce incorrect transcriptions — a visible, concrete signal that pronunciation needs work. This is more immediate feedback than most classroom exercises.

Select the target language in the language dropdown, speak individual words or phrases, and compare the transcript to what you intended to say. Where the transcript differs, your pronunciation was not clear enough to the recognition engine — which uses patterns similar to what a native speaker would hear.
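
The comparison step can be automated with a simple word check. This is a rough sketch rather than a pronunciation scorer: missedWords is a hypothetical helper that ignores word order and repetition, and only reports intended words that do not appear anywhere in the transcript.

```javascript
// Lowercases, strips punctuation, and splits text into words.
function normalizeWords(text) {
  return text
    .normalize("NFC")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// Hypothetical helper: intended words that are absent from the transcript.
function missedWords(intended, transcript) {
  const heard = new Set(normalizeWords(transcript));
  return normalizeWords(intended).filter((word) => !heard.has(word));
}

console.log(missedWords("Bonjour, je voudrais un café", "bonjour je voudrais un canapé"));
// prints [ 'café' ]
```

Words that show up in this list repeatedly are good candidates for focused practice.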

Voice Memos and Quick Capture

When an idea occurs to you while walking or while your hands are otherwise occupied, opening the tool on mobile and dictating a quick note is faster than typing. The transcript can be copied to a note-taking app, email draft, or messaging app with a single tap.

Transcribing Pre-Recorded Audio (Workaround)

The tool only accepts live microphone input — it does not accept uploaded audio files. For transcribing pre-recorded audio (interviews, lectures, podcasts), use the AI Audio Transcriber tool, which accepts MP3, WAV, M4A, and other audio files and uses the Whisper model for transcription. Alternatively, play pre-recorded audio through speakers in a quiet room with the speech-to-text tool running — accuracy will be lower than live dictation but may be acceptable for casual notes.

Tips for Better Accuracy

Reduce background noise

Work in the quietest environment available. Close doors and windows, turn off fans and air conditioning if possible, and move away from computers with loud fans. Even small reductions in ambient noise produce measurable accuracy improvements.

Speak at a moderate, consistent pace

Do not rush to get words out quickly or speak unnaturally slowly. A consistent, moderate speaking pace (approximately 120–150 WPM) gives the recognition engine sufficient audio context for each word without running words together.

Speak directly into the microphone

Position your microphone correctly — for a headset, the microphone element should be level with and close to your mouth (2–4 cm away), angled slightly off-axis from your mouth (to reduce plosive sounds like "p" and "b"). For a desk microphone, position it 20–30 cm in front of you, angled slightly upward. Avoid speaking downward or to the side of the microphone.

Enunciate technical terms

For technical vocabulary, proper nouns, or unusual words, slow down and enunciate slightly more clearly than usual. You can also say the word and then spell it letter by letter ("Kubernetes, K-U-B-E-R-N-E-T-E-S"); the spelled-out letters make the intended word easy to find and fix in the transcript.

Use the online notepad for integrated dictation

If you want dictation and note-taking in a single interface, the Online Notepad has a built-in Dictate button that uses the same Web Speech API. Dictated text is appended directly to your notes, auto-saved to your browser's local storage between sessions, and ready to export as a .txt file.

Review and correct the transcript immediately

Review and correct the transcript immediately after dictating, while the original spoken words are fresh in memory. Recognition errors are easier to catch and correct when you remember what you said. Correcting a transcript hours or days later is more difficult because context is lost.

Speech-to-Text vs AI Transcription — Choosing the Right Tool

Two technologies are available for converting speech to text, and they suit different use cases:

Feature	Web Speech API (this tool)	Whisper AI Transcription
Input type	Live microphone	Uploaded audio file
Real-time output	Yes — text appears as you speak	No — batch processing after upload
Accuracy on clear speech	Excellent	Excellent (often better for accents)
Accuracy on recordings	Not applicable (live only)	Excellent
Processing location	On-device (Safari) or the browser vendor's servers (Chrome, Edge)	In your browser via WebAssembly
Language count	15 languages	99+ languages
Best for	Live dictation, notes, real-time output	Pre-recorded audio, podcasts, meetings, interviews

Use the live speech-to-text tool when you are dictating in real time. Use the AI Audio Transcriber when you have pre-recorded audio — an interview, a lecture recording, a meeting recording, or a voice memo — that you need transcribed to text.

Dictation in Different Contexts — Style Guide

Dictating punctuation

Depending on the browser and language, the recognition engine may insert punctuation automatically based on speech patterns and natural pauses, in which case you do not need to say "comma" or "period." Accuracy varies, however: automatic punctuation sometimes misses sentence boundaries or inserts commas in unexpected places. Plan to review and adjust punctuation before using the transcript in formal documents.

For more control, some users dictate punctuation explicitly: "The meeting is on Friday period." The recognition engine may or may not interpret such words as punctuation marks rather than ordinary words; behavior varies by browser and language.
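
When the engine leaves spoken punctuation as plain words, a post-processing pass can convert it. The sketch below is one possible approach, not part of the tool; SPOKEN_PUNCTUATION and applySpokenPunctuation are hypothetical names, and the word list is an illustrative English subset.

```javascript
// Hypothetical mapping from spoken words to punctuation marks.
const SPOKEN_PUNCTUATION = {
  "comma": ",",
  "period": ".",
  "full stop": ".",
  "question mark": "?",
  "exclamation mark": "!",
};

// Hypothetical helper: replaces spoken punctuation and the space before it.
function applySpokenPunctuation(transcript) {
  let output = transcript;
  for (const [spoken, mark] of Object.entries(SPOKEN_PUNCTUATION)) {
    const pattern = new RegExp(`\\s*\\b${spoken}\\b`, "gi");
    output = output.replace(pattern, mark);
  }
  return output;
}

console.log(applySpokenPunctuation("On Friday comma we will meet period"));
// prints "On Friday, we will meet."
```

The trade-off is that legitimate uses of these words ("the trial period") are also replaced, so this suits dictation where you control the wording.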

Paragraph structure

The tool does not insert paragraph breaks automatically. If your dictation should be structured into paragraphs, either add the breaks manually after copying the transcript, or pause at paragraph boundaries while dictating so the breaks are easy to spot, then insert them in your destination editor after pasting.

Formal vs conversational style

Spoken language is naturally more conversational than written language. First-draft dictation often contains filler words ("um," "uh," "like," "you know"), run-on sentences, and informal constructions. Plan for an editing pass to formalize the language for professional documents, or dictate more deliberately with formal phrasing in mind.
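
Part of that editing pass can be scripted. The following is a minimal sketch that strips a few unambiguous fillers; removeFillers is a hypothetical helper, and it deliberately leaves "like" alone because the word is often meaningful.

```javascript
// Hypothetical helper: removes "um", "uh", and "you know" from a transcript.
function removeFillers(transcript) {
  return transcript
    // Match the filler, an optional trailing comma, and following spaces.
    .replace(/\b(?:um+|uh+|you know)\b,?\s*/gi, "")
    .replace(/\s{2,}/g, " ")
    .trim();
}

console.log(removeFillers("Um, I think, uh, we should you know ship it"));
// prints "I think, we should ship it"
```

Capitalization is not adjusted, so a sentence that started with a filler may need its first letter fixed by hand.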

Frequently Asked Questions

Why is the start button not working?

Three common causes: (1) You are using Firefox, which does not implement the Web Speech API — switch to Chrome, Edge, or Safari. (2) Microphone access was denied in your browser — check the site permissions in your browser settings and allow microphone access for the PublicSoftTools domain. (3) No microphone is connected — check that your device has a functioning microphone and that it is the default input device in your operating system audio settings.
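
These three causes map onto checks a page can run itself. The sketch below uses error codes defined for the Web Speech API's error event (not-allowed, service-not-allowed, audio-capture, network); explainStartFailure and the message wording are hypothetical.

```javascript
// Hypothetical helper: turns a missing API or an error code into advice.
function explainStartFailure(globalObj, errorCode) {
  if (!globalObj.SpeechRecognition && !globalObj.webkitSpeechRecognition) {
    return "This browser does not implement SpeechRecognition; try Chrome, Edge, or Safari.";
  }
  switch (errorCode) {
    case "not-allowed":
    case "service-not-allowed":
      return "Microphone access was denied; allow it in the site permissions.";
    case "audio-capture":
      return "No working microphone was found; check the default input device.";
    case "network":
      return "The recognition service could not be reached; check your connection.";
    default:
      return `Recognition failed with error: ${errorCode}`;
  }
}

// Browser wiring (assumes `recognition` is a SpeechRecognition instance):
//   recognition.onerror = (event) => {
//     console.warn(explainStartFailure(window, event.error));
//   };
```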

Does the tool work offline?

Partially. The page loads normally offline after the first visit (it is cached). However, speech recognition in Chrome requires a network connection because recognition is performed on Google's servers. Safari on iOS uses an on-device recognition engine for some languages (including English) and may work offline in those cases.

Is my voice audio stored anywhere?

The tool itself does not store any audio. In Chrome and Edge, audio is sent to the browser vendor's speech recognition service (Google for Chrome, Microsoft for Edge), and that vendor's privacy policy applies to the data. In Safari, recognition that runs on-device transmits no audio. PublicSoftTools never receives or stores audio data.

Can I dictate in multiple languages in a single session?

No — the recognition engine is set to a single language when recording starts. To switch languages, stop the current recording, change the language dropdown, and start a new recording session. The previous transcript is preserved, so you can continue from where you left off.

How accurate is the speech-to-text for technical terms?

Technical terminology is where browser-based speech recognition makes the most errors. Domain-specific terms not well-represented in general speech training data — medical terminology, legal jargon, programming language syntax, brand names — will frequently be misrecognized. Plan to review and correct technical terms manually. For highly specialized domains, consider AI transcription tools with custom vocabularies or models fine-tuned for specific domains.

Start Dictating Now

15 languages, real-time results, no signup required. Click Start Recording and speak.
