ASD123.ai Text to Speech Guide

1. Overview

Text to Speech turns written text into spoken audio directly in your browser. Two on-device models are available: Kokoro, an 82M-parameter model with very natural voices in English, Spanish, French, and Italian, and Supertonic 3, a fast multilingual model covering English, German, French, Spanish, and Italian. Tip: the Markdown Converter has a To Speech button that sends a converted document straight into this tool.

You choose a language first, then pick from the models that support it. The list is sorted by quality and shows the download size for each model. The model files are fetched from Hugging Face the first time you use them and then cached by your browser; your text is processed locally and is never sent to ASD123.ai.

First use downloads a model: Kokoro starts at ~88 MB (8-bit precision; ~156 MB at 16-bit and ~310 MB at 32-bit) and Supertonic 3 is ~380 MB. The download happens once per model and precision; after that the model loads from the browser cache. A fast connection is recommended for the first generation.

2. Basic Workflow

Pick a language

Choose English, German, French, Spanish, or Italian. The model list updates to show only the models that speak that language.

Choose a model and voice

Models are listed best-quality first, each with its download size. Pick a voice, and for Kokoro optionally a model size (8-bit, 16-bit, or 32-bit precision).

Type your text and generate

Enter or paste text, set the speaking speed, then press Generate speech. The first run downloads the model; later runs are immediate.

Play and export

Play the audio in the built-in player, change the playback speed, then download it as WAV or MP3.

3. Languages and Models

Selection is language-first: choose what you want spoken, and the tool offers the models that support it.

English · French · Spanish · Italian

Both models: Kokoro (highest quality; English has American and British named voices) and Supertonic 3 (fast, multilingual). Kokoro is the recommended default.

German

Supertonic 3 only. It supports German natively in the browser; Kokoro has no German voice.

How each model handles languages: Supertonic 3 is natively multilingual — the language is selected with a built-in language tag, no extra download. Kokoro speaks English natively; for Spanish, French, and Italian it adds a one-time eSpeak NG pronunciation pack (~19 MB) that converts the text to phonemes locally. Everything runs in your browser.

Models & versions

Model	Languages	Download	Runtime	Source
Kokoro 82M v1.0	English (US & UK), Spanish, French, Italian	~88 / 156 / 310 MB	kokoro-js 1.2.1	Hugging Face
Supertonic 3	English, German, French, Spanish, Italian	~380 MB	onnxruntime-web	Hugging Face
eSpeak NG (Kokoro non-English pronunciation)	Spanish, French, Italian	~19 MB (one-time)	espeak-ng 1.0.2 (WASM)	GitHub
lamejs (MP3 export)	—	self-hosted ~0.3 MB	1.2.7	GitHub

Model files load from Hugging Face on first use and are cached by the browser. The MP3 encoder is self-hosted, so audio export never leaves your device.
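The language-first selection in this section can be pictured as a filter over a small model catalog. The sketch below is illustrative only: the `ModelInfo` shape, the `qualityRank` field, and the `modelsForLanguage` helper are assumed names, not the tool's actual code. The language lists and sizes are the ones given in the table above.

```typescript
// Illustrative catalog; field names are assumptions, values come from this guide.
type LanguageCode = "en" | "de" | "fr" | "es" | "it";

interface ModelInfo {
  id: string;
  label: string;
  languages: LanguageCode[];
  qualityRank: number;   // assumed ordering field: lower is listed first
  minDownloadMB: number; // smallest download for the model
}

const MODEL_CATALOG: ModelInfo[] = [
  { id: "supertonic", label: "Supertonic 3", languages: ["en", "de", "fr", "es", "it"], qualityRank: 2, minDownloadMB: 380 },
  { id: "kokoro", label: "Kokoro 82M v1.0", languages: ["en", "es", "fr", "it"], qualityRank: 1, minDownloadMB: 88 },
];

// Returns the models that speak the given language, best quality first.
function modelsForLanguage(language: LanguageCode): ModelInfo[] {
  return MODEL_CATALOG
    .filter((model) => model.languages.indexOf(language) !== -1)
    .sort((a, b) => a.qualityRank - b.qualityRank);
}
```

With this catalog, German yields only Supertonic 3, while English, French, Spanish, and Italian yield Kokoro first and Supertonic 3 second.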

4. Voices

Kokoro voices

Named voices per language. English has 28, grouped into American (including Heart, Bella, and Michael) and British (including Emma, George, and Daniel) accents. Spanish adds Dora, Alex, and Santa; French has Siwis; Italian has Sara and Nicola. English voices show their upstream overall grade, the community quality rating from the Kokoro project (A = best, F = worst; Heart is A, Bella is A-), so you can pick the best one.
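If you want to reason about how letter grades order the voices, the sketch below ranks grades such as A, A-, and C+. It is an assumption-laden illustration: the numeric scoring, the `RatedVoice` shape, and both function names are made up for this example, and the third entry in the usage note is a placeholder, not a real Kokoro voice.

```typescript
interface RatedVoice {
  label: string;
  grade: string; // letter grade, optionally followed by "+" or "-"
}

// Maps a letter grade to a number so grades can be compared (assumed scoring).
function gradeScore(grade: string): number {
  const base: Record<string, number> = { A: 4, B: 3, C: 2, D: 1, F: 0 };
  const letter = grade.charAt(0).toUpperCase();
  if (!(letter in base)) {
    throw new Error("Unknown grade: " + grade);
  }
  const modifier = grade.charAt(1);
  const adjust = modifier === "+" ? 0.3 : modifier === "-" ? -0.3 : 0;
  return base[letter] + adjust;
}

// Returns a new array with the best-graded voices first.
function sortByGrade(voices: RatedVoice[]): RatedVoice[] {
  return voices.slice().sort((a, b) => gradeScore(b.grade) - gradeScore(a.grade));
}
```

For example, sorting Bella (A-), a placeholder voice graded C+, and Heart (A) puts Heart first, then Bella, then the placeholder.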

Supertonic voices

Ten preset voices, five female (Female 1–5) and five male (Male 1–5). They are language-agnostic: the same voice speaks English, German, French, Spanish, or Italian depending on the selected language.

5. Quality and Speed

Model size (Kokoro)

Choose a precision: 8-bit (~88 MB, balanced and the default), 16-bit (~156 MB, higher fidelity), or 32-bit (~310 MB, maximum). Larger sizes sound slightly cleaner but take longer to download.
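To estimate the very first Kokoro download, add the one-time ~19 MB eSpeak NG pronunciation pack to the chosen precision when the language is not English. The sketch below only does that arithmetic with the approximate sizes from this guide; the type and function names are assumptions for illustration.

```typescript
type KokoroPrecision = "8-bit" | "16-bit" | "32-bit";
type KokoroLanguage = "en" | "es" | "fr" | "it";

// Approximate sizes in MB, as listed in this guide.
const KOKORO_SIZE_MB: Record<KokoroPrecision, number> = {
  "8-bit": 88,
  "16-bit": 156,
  "32-bit": 310,
};
const ESPEAK_PACK_MB = 19; // only needed for non-English Kokoro voices

// Approximate download on the very first use of a precision and language.
// Later runs reuse the cache, and the pronunciation pack is fetched only once.
function kokoroFirstUseDownloadMB(precision: KokoroPrecision, language: KokoroLanguage): number {
  const pack = language === "en" ? 0 : ESPEAK_PACK_MB;
  return KOKORO_SIZE_MB[precision] + pack;
}
```

For example, 16-bit Spanish comes to roughly 156 + 19 = 175 MB on first use, while 8-bit English stays at about 88 MB.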

Quality steps (Supertonic)

Supertonic exposes the number of diffusion steps. More steps refine the audio at the cost of speed; the default of 8 is a good balance for most text.

Speaking speed

The Speed slider changes how fast the voice speaks when the audio is generated. It affects the rendered file, so adjust it before generating.

Playback speed

After generating, the Playback speed buttons (0.75× to 2×) change how fast the player plays the audio without regenerating it. This does not alter the downloaded file.
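The difference between the two speed controls comes down to simple arithmetic: speaking speed changes the length of the rendered file, while playback speed only changes how long it takes to listen to it. The sketch below shows the playback side, assuming the 0.75× to 2× range given above; the function names are illustrative, not the tool's code.

```typescript
const MIN_PLAYBACK_RATE = 0.75;
const MAX_PLAYBACK_RATE = 2;

// Keeps a requested rate inside the range offered by the player.
function clampPlaybackRate(rate: number): number {
  return Math.min(MAX_PLAYBACK_RATE, Math.max(MIN_PLAYBACK_RATE, rate));
}

// Time needed to listen to a clip at a given playback rate.
// The rendered clip itself, and therefore the downloaded file, is unchanged.
function listeningTimeSeconds(renderedSeconds: number, playbackRate: number): number {
  return renderedSeconds / clampPlaybackRate(playbackRate);
}
```

A 60-second clip takes 40 seconds to hear at 1.5× and 80 seconds at 0.75×, yet the exported file is still 60 seconds long. In a browser player this usually corresponds to setting `playbackRate` on the audio element, although whether the tool does exactly that is an assumption.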

6. Playback and Export

Generated audio appears in the built-in player on the right, where you can play, pause, scrub, and change the playback speed. Two download formats are available:

Download WAV saves the original, uncompressed audio exactly as the model produced it — best for editing or re-encoding.
Download MP3 encodes the audio to a compact MP3 in your browser using a self-hosted encoder. It is much smaller and convenient for sharing or listening on any device.
Both files are named after the model and voice, for example speech-kokoro-af_heart.mp3.
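For readers curious what an uncompressed WAV export involves, here is a minimal sketch that wraps mono samples in a standard 44-byte RIFF/WAVE header and builds a file name in the pattern shown above. The 16-bit PCM depth, the mono layout, and the function names are assumptions; this guide does not state the bit depth or sample rate the tool actually writes.

```typescript
// Encodes mono samples in the range [-1, 1] as a 16-bit PCM WAV file (assumed format).
function encodeWav16(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);               // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                         // fmt chunk size
  view.setUint16(20, 1, true);                          // format 1 = PCM
  view.setUint16(22, 1, true);                          // one channel
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // bytes per second
  view.setUint16(32, bytesPerSample, true);             // block align
  view.setUint16(34, 16, true);                         // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}

// Builds a name such as speech-kokoro-af_heart.mp3.
function exportFileName(model: string, voice: string, extension: "wav" | "mp3"): string {
  return "speech-" + model + "-" + voice + "." + extension;
}
```

Calling `exportFileName("kokoro", "af_heart", "mp3")` gives `speech-kokoro-af_heart.mp3`, matching the example above.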

MP3 is encoded on demand: the first MP3 download for a clip takes a moment to encode locally, then is reused if you download it again. No audio is uploaded at any point.
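The encode-once, reuse-later behavior is a form of lazy caching. The sketch below shows the idea with a stand-in encoder passed in as a parameter; `Mp3Encoder` and `createLazyMp3` are hypothetical names, and the real tool's encoder interface may look different.

```typescript
// Stand-in signature for whatever encoder is used (hypothetical).
type Mp3Encoder = (pcm: Float32Array, sampleRate: number) => Uint8Array;

// Returns a getter that encodes the clip on the first call and reuses the result afterwards.
function createLazyMp3(pcm: Float32Array, sampleRate: number, encode: Mp3Encoder): () => Uint8Array {
  let cached: Uint8Array | null = null;
  return function getMp3(): Uint8Array {
    if (cached === null) {
      cached = encode(pcm, sampleRate); // the slow step, done once per clip
    }
    return cached;
  };
}
```

The first call to the returned getter pays the encoding cost; every later call for the same clip returns the same bytes immediately.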

7. Limits and Tips

The first generation with a model downloads it in full. Keep the tab open until the download completes; afterwards it is instant from cache.
Very long passages take longer to synthesize and use more memory. Split large texts into paragraphs or sections for the most responsive results.
For German, French, Spanish, and Italian, type text with correct accents and punctuation so the model pronounces words naturally.
If a voice mispronounces a name or abbreviation, try rephrasing or spelling it phonetically.
Switching to a different Kokoro model size (precision) triggers a separate download the first time you use it.
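If you prepare long texts before pasting them in, the splitting tip above can be automated. The following is a minimal sketch, assuming a character limit you choose yourself: it splits on blank lines first and then on sentence-ending punctuation. The function name and the `maxChars` parameter are illustrative and not part of the tool.

```typescript
// Splits text into chunks of at most maxChars where possible.
// Paragraphs (separated by blank lines) are kept whole when they fit;
// longer paragraphs are split at sentence boundaries.
// A single sentence longer than maxChars is kept as one oversized chunk.
function splitForSynthesis(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  const paragraphs = text.split(/\n\s*\n/);
  for (const paragraph of paragraphs) {
    const trimmed = paragraph.trim();
    if (trimmed.length === 0) {
      continue;
    }
    if (trimmed.length <= maxChars) {
      chunks.push(trimmed);
      continue;
    }
    // A sentence ends with ., !, or ?; a trailing fragment without one is kept too.
    const sentences: string[] = trimmed.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [trimmed];
    let current = "";
    for (const sentence of sentences) {
      if (current.length > 0 && (current + sentence).trim().length > maxChars) {
        chunks.push(current.trim());
        current = "";
      }
      current += sentence;
    }
    if (current.trim().length > 0) {
      chunks.push(current.trim());
    }
  }
  return chunks;
}
```

Each returned chunk can then be generated on its own, which keeps individual runs short and memory use lower.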

8. Privacy

Your text is converted to audio entirely on your device — synthesis runs in a background Web Worker, so the page stays responsive while it works. On first use the tool downloads runtime and model files: the Kokoro/Supertonic model weights come from Hugging Face, and the kokoro-js library plus the ~19 MB eSpeak NG pronunciation pack (needed only for non-English Kokoro voices) come from the jsDelivr CDN. All of these are cached afterwards. The text you type, the generated audio, and the MP3 encoding never leave the browser — ASD123.ai never receives your text or the resulting speech.

Only your tool preferences (language, model, voice, speed, playback rate) are stored in localStorage — never your text or audio. Cached model files live in the browser's standard cache and can be cleared like any other site data. Closing the tab discards the current text and audio.
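One way to guarantee that only preferences are persisted is to copy an explicit list of fields before saving. The sketch below illustrates that pattern under stated assumptions: the storage key, the `TtsPreferences` shape, and the helper names are invented for this example, and `KeyValueStore` is a minimal interface that the browser's `localStorage` happens to satisfy.

```typescript
interface TtsPreferences {
  language: string;
  model: string;
  voice: string;
  speed: number;
  playbackRate: number;
}

// Minimal storage interface; window.localStorage matches this shape.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const PREFS_KEY = "tts-preferences"; // hypothetical key name

// Saves only the preference fields; any text in the state is deliberately dropped.
function savePreferences(store: KeyValueStore, state: TtsPreferences & { text?: string }): void {
  const prefs: TtsPreferences = {
    language: state.language,
    model: state.model,
    voice: state.voice,
    speed: state.speed,
    playbackRate: state.playbackRate,
  };
  store.setItem(PREFS_KEY, JSON.stringify(prefs));
}

// Loads saved preferences, falling back to defaults when nothing valid is stored.
function loadPreferences(store: KeyValueStore, fallback: TtsPreferences): TtsPreferences {
  const raw = store.getItem(PREFS_KEY);
  if (raw === null) {
    return fallback;
  }
  try {
    const parsed = JSON.parse(raw) as Partial<TtsPreferences>;
    return { ...fallback, ...parsed };
  } catch {
    return fallback;
  }
}
```

Because the saved object is built field by field, the text you typed cannot end up in storage even if it is part of the in-memory state.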

9. Use Cases

Listen instead of read

Paste an article, email, or draft and listen to it as audio to proofread by ear or review hands-free.

Voice over for content

Generate a quick narration track for a slide, demo, or video without a microphone or a cloud service. Playback starts automatically when the audio is ready, and a Stop button lets you abort a long generation at any time.

Language practice

Hear German, French, Spanish, or Italian text spoken aloud with Supertonic to check pronunciation and rhythm.

Pair with the other tools

Convert a PDF with the Markdown Converter or clean it with the Optimizer, then have the result read back to you.