What TTS throws away

When a podcast host says "spaaaace" and holds the vowel for 1.1 seconds, every ASR (automatic speech recognition) transcribes it as "space." When two friends banter and one reacts with "Oh. Ooh. Mm." for 2.4 seconds of pure non-verbal communication, the transcript shows three punctuated words. When a speaker abandons a clause mid-thought — "I really have no — I never thought…" — the transcript records an em-dash. The audio records an affect shift.

That's the gap. The training data for every TTS system is built from transcripts that discard the expressive vocabulary humans actually use. Drawn-out vowels, reactive backchannels, mid-sentence register shifts, the timing of a laugh — all of it is flattened into clean text before a TTS model ever sees it. The model can't learn what it's never shown.

This post is about what that gap sounds like, why it exists, and what it would take to close it. Press play below — then expand the comparisons to see what three STT systems and four TTS engines make of the same 25 seconds.

The Read · 2-host pop-culture podcast · reminiscing about childhood movie restrictions

A note on the clip: This clip contains the N-word, used colloquially by the hosts. It was selected for its paralinguistic richness — drawn-out vowels, mid-sentence pivots, backchannels, laughter — not its lexical content. STT transcripts reproduce it verbatim because verbatim fidelity is the point of the comparison.

step 1 · what STT heard

Three frontier ASRs, same 25 seconds, three different transcripts

Deepgram Nova-3 · conf 0.999

Oh. If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mama didn't let me watch anything that was rated worse than PG 13. That's true. And then when I went off college, I just had no desire to go back and watch all the Don't Be a Menaces. And I mean, I really have no I'd I never thought that there was a chance that you have seen that. But the point is that many of you at

sentiment: negative (-0.45) · intents: Express frustration about nigga movies · topics: Movie watching · summary: Speaker 0 discusses how he didn't watch the Partners in the Middle, citing his lack of desire to return to watching movies that were rated PG 13.

OpenAI gpt-4o-transcribe

If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mom didn't let me watch anything that was rated worse than PG-13. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. I mean, I really have no idea. I never thought that there was a chance that you have seen it. But the point is that many of you...

ElevenLabs Scribe v2

Oh. If you watch nigga movies, I feel like it's one of those. Mm. But for a, a massive chunk of my life I didn't, 'cause my mama didn't let me watch anything that was rated worse than PG-13. That's true. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. [laughs] And I mean, I really have no— I never thought that there was a chance that you have seen that.

audio events captured: [laughs]

What this comparison shows: (1) OpenAI drops the leading 'Oh'; Eleven keeps 'Oh' and the 'Mm' backchannel that the other two missed. (2) Deepgram's 'Express frustration' intent and 'negative' sentiment are wrong: the host is warm and reminiscent. (3) Only Eleven captures [laughs] mid-utterance. (4) Deepgram preserves the disfluency 'I really have no I'd I never thought'; OpenAI invents 'no idea' to clean it up.
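The word-level differences called out above can be surfaced mechanically. What follows is a minimal sketch using Python's standard `difflib` on the disfluent span from the Deepgram and OpenAI transcripts; the token normalization and the `word_diff` helper are assumptions made for illustration, not part of the comparison pipeline described in this post.

```python
import difflib
import re

def tokens(text: str) -> list:
    # Illustrative normalization: lowercase, keep word characters and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def word_diff(a: str, b: str) -> list:
    """Return the non-matching opcodes between two transcripts, as token lists."""
    ta, tb = tokens(a), tokens(b)
    matcher = difflib.SequenceMatcher(a=ta, b=tb, autojunk=False)
    return [
        (op, ta[i1:i2], tb[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# The same span as transcribed by two of the systems above.
deepgram = ("And I mean, I really have no I'd I never thought that there "
            "was a chance that you have seen that.")
openai = ("I mean, I really have no idea. I never thought that there "
          "was a chance that you have seen it.")

changes = word_diff(deepgram, openai)
for op, left, right in changes:
    print(op, left, "->", right)
```

A diff like this shows where two systems disagree on words. It says nothing about which one is closer to the audio, and nothing about affect.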

Listen for: The mid-sentence pivot at the em-dash and the [laughs] event Eleven caught. None of these STT outputs carry the affect of the clip — they capture words and (for one) events.

step 2 · what TTS does with that transcript

Same transcript fed to 4 frontier TTS labs: listen, then see where the time goes

The original is 25 seconds. The same 8-segment script rendered through each TTS lab takes 28.7s to 34.9s. The chart aligns all 9 versions on a shared time axis, colored by speaker. Clean = raw STT transcript; Enhanced = same transcript with lab-appropriate affective markup. Switch tabs to compare timing vs silence placement.

[Chart · two tabs: Timing (how long each segment took) and Amplitude envelope (where silence vs speech lands) · shared time axis, ticks from 10s to 30s]

Each row shows the same 8 segments: "Oh." · "If you watch…" · "Mm. But for a massive chunk…" · "That's true." · "And then when I went off…" · "I mean, I really have no —…" · "[laughs]" · "But the point…"

Total duration per row:

Original audio (Scribe v2) · 24.68s
ElevenLabs v3, clean · 28.74s
ElevenLabs v3, enhanced · 29.48s
Gemini 3.1 Flash, clean · 31.32s
Gemini 3.1 Flash, enhanced · 31.16s
Gemini 2.5 Pro, clean · 29.21s
Gemini 2.5 Pro, enhanced · 29.41s
OpenAI gpt-4o-mini-tts, clean · 34.90s
OpenAI gpt-4o-mini-tts, enhanced · 33.96s

Speakers: speaker_0 (Crissle, F) · speaker_1 (Kid Fury, M). TTS per-segment widths are estimated by char-ratio × measured total duration; the original uses Scribe v2 word timestamps.

The original (24.68s) is the reference. OpenAI expanded the most (34.9s), with stitching seams between speakers. ElevenLabs v3 and Gemini 2.5 Pro come closest to the human (~29s).
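The "char-ratio × measured total duration" estimate and the expansion relative to the original can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the segment strings are the truncated previews from the chart, so the per-segment numbers are only indicative. A real run would pass the full segment text.

```python
ORIGINAL_SECONDS = 24.68  # duration of the original clip

# Measured total durations (seconds) for each TTS rendering.
TTS_TOTALS = {
    "ElevenLabs v3, clean": 28.74,
    "ElevenLabs v3, enhanced": 29.48,
    "Gemini 3.1 Flash, clean": 31.32,
    "Gemini 3.1 Flash, enhanced": 31.16,
    "Gemini 2.5 Pro, clean": 29.21,
    "Gemini 2.5 Pro, enhanced": 29.41,
    "OpenAI gpt-4o-mini-tts, clean": 34.90,
    "OpenAI gpt-4o-mini-tts, enhanced": 33.96,
}

# Truncated previews as shown in the chart (placeholders for the full text).
SEGMENT_PREVIEWS = [
    "Oh.", "If you watch…", "Mm. But for a massive chunk…", "That's true.",
    "And then when I went off…", "I mean, I really have no —…", "[laughs]",
    "But the point…",
]

def estimate_segment_durations(segments, total_seconds):
    """Split a measured total across segments in proportion to character count."""
    total_chars = sum(len(s) for s in segments)
    if total_chars == 0:
        raise ValueError("segments contain no characters")
    return [total_seconds * len(s) / total_chars for s in segments]

def expansion_ratio(tts_seconds, original_seconds=ORIGINAL_SECONDS):
    """How much longer (or shorter) a rendering is than the original clip."""
    return tts_seconds / original_seconds

for name, seconds in TTS_TOTALS.items():
    print(f"{name}: {expansion_ratio(seconds):.2f}x the original")

widths = estimate_segment_durations(SEGMENT_PREVIEWS, TTS_TOTALS["ElevenLabs v3, clean"])
```

Because the estimate depends only on character counts, it cannot show where a TTS engine actually spends its extra time. It spreads the total evenly per character, which is why the original row uses word timestamps instead.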


Amplitude envelopes (ffmpeg showwavespic), same time scale as the Timing tab. The dramatic gaps in OpenAI's strips are stitching seams — silence between concatenated segments rather than natural breath spacing.

That gap is what this project is trying to close. ElevenLabs v3 gives you [cheerful] or [deadpan]. It will not give you the mid-thought interruption above — one speaker abandoning a clause and restarting with softer affect. The transcript reads as a clean sentence with an em-dash. The audio is something no tag captures.

What I'm building

We're building a fine-tuned script writer — a model whose output has the affective contour, the cue tags, the drawn-out vowels, and the per-segment direction baked in, so when any TTS reads it, what comes out is something a human voice actor would recognise as expressive.

The script writer needs to emit the four layers of paralinguistic detail that conventional script generation throws away: affective contour, cue tags, drawn-out vowels, and per-segment direction.

The writer has to cover the full emotional range: not just happy/sad/angry/neutral but blended states like amused-resigned, hopeful-skeptical, warm-sarcastic. Current TTS can't reach those from a flat transcript alone; a script writer that puts affect where it goes is the leverage point.

"Expressive" is the load-bearing word. The goal isn't "monologue for one tired podcaster." It's the full surface area of human speech (comedy timing, meditation silences, conversational backchannels, scene dialogue, songlike delivery) and, under all of those, the emotion spectrum a real voice carries: warmth, amusement, sarcasm, fatigue, grief, hopefulness, curiosity, plus the blended states that make up most of human affect. The script writer is the layer that says "here is exactly which emotion goes where in the script", so the downstream TTS can focus on rendering instead of guessing.

The script writer starts as a LoRA fine-tune, scaling to a full 7-8B fine-tune if tag placement accuracy demands it. (Fish Audio S2 fine-tuned a full 30B model for their annotator; my task is narrower but LoRA may not have enough capacity for ~110 inline tags.) The downstream TTS can be any engine — the first validation gate is whether tagged scripts produce measurably better output than untagged ones. Portability across TTS engines is the only way to keep iterating without vendor lock-in.

The immediate question: where do the training annotations come from? Hand-labeling 100 hours at this granularity would take months. Frontier audio LLMs (Vertex Gemini 2.5 Pro) do it well but cost money and lock you into a closed API. Self-hosting an audio LLM that gets close enough would let us scale without rate limits, without sending audio to a third party, and with a path to fine-tune the annotator later. That's the build documented here.

TTS in May 2026: what moved, what didn't

As of May 2026, audio LLMs and TTS are in two different places. Audio understanding has moved fast. Frontier closed models — Vertex Gemini 2.5 Pro, OpenAI gpt-realtime — can listen to a 5-minute podcast and produce structured descriptions: speakers, emotions, vocal events, drawn-out words. Open models are catching up at roughly one per quarter — Qwen3-Omni, Kimi-Audio, MiDashengLM. Audio synthesis has also moved. ElevenLabs v3 exposes ~70 inline tags; Hume Octave 2 takes free-text prosodic direction; Google Gemini 3.1 Flash TTS accepts scene-level direction inline. But the ceiling moved, it didn't disappear. Every closed system still applies one uniform style across each tagged region. Continuous within-span affect — sarcasm, the smirk-while-talking — remains unsolved.

The corpus + script-writer project lives in that asymmetry: I can build the training data from today's audio understanding models, and bet that synthesis catches up (or that even current synthesis renders richly-tagged scripts better than flat ones).

The annotation bottleneck nobody talks about

The biggest lever for expressive TTS quality isn't model architecture or training scale — it's annotation quality.

Fish Audio's S2 improved its Tag Activation Rate from 62.6% to 88.1%, a 41% relative improvement, by annotating its training data better. Same architecture, same pipeline, just richer labels. (arXiv:2603.08823) The DeEAR framework showed a 3x improvement in emotion expressiveness from data curation alone. And SpeechJudge found that a fine-tuned reward model (77.2% accuracy) outperforms every frontier AudioLLM at judging speech quality, while traditional metrics like UTMOS (53.7%) are barely better than coin flips. Annotation quality drives everything: training data and evaluation.
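The arithmetic behind the figures above is simple to check. The sketch below recomputes the relative improvement from the two reported Tag Activation Rates and expresses the two judge accuracies as points above chance. The 50% chance level is an assumption that the judging task is a two-way preference, which is what "coin flips" suggests.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional change from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before

tar_before, tar_after = 62.6, 88.1  # Tag Activation Rate, percent
gain = relative_improvement(tar_before, tar_after)
print(f"absolute gain: {tar_after - tar_before:.1f} points")  # 25.5 points
print(f"relative gain: {gain:.1%}")                           # 40.7%

CHANCE = 50.0  # assumption: pairwise preference judging
judge_accuracy = {"fine-tuned reward model": 77.2, "UTMOS": 53.7}
margins = {name: round(acc - CHANCE, 1) for name, acc in judge_accuracy.items()}
for name, margin in margins.items():
    print(f"{name}: {margin:+.1f} points vs chance")
```

The relative gain works out to 40.7%, which rounds to the 41% quoted above.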

We're building in the "rich transcription" tier, aiming toward "annotation-as-reward."

The plan, in one picture

Take 5-minute audio clips, run each through an annotation pipeline that emits structured JSON, use those triples as supervised-learning data. In practice, the dark box is a multi-pass pipeline — VAD, diarization, ASR, emotion classifiers, pitch extractors — because no single audio LLM captures everything. The EACL 2026 finding on "lexical dominance" (Gemini scores 96.6% on text-based audio tasks but 25-35% on audio-only paralinguistic tasks) is why. A noisy annotator means a script writer that puts [laughs] in the wrong places.
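To make "structured JSON from a multi-pass pipeline" concrete, here is a minimal sketch of merging one pass (event detection) onto another (ASR plus diarization segments) by timestamp. The `Segment` fields, the `attach_events` helper, and the sample values are all hypothetical; the post does not specify the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Segment:
    start: float                 # seconds
    end: float                   # seconds
    speaker: str                 # from a diarization pass
    text: str                    # from an ASR pass
    events: List[str] = field(default_factory=list)  # e.g. "[laughs]"
    emotion: Optional[str] = None                     # from an emotion classifier

def attach_events(segments: List[Segment],
                  events: List[Tuple[float, str]]) -> List[Segment]:
    """Attach each (time, label) event to the segment whose span contains it."""
    for time, label in events:
        for seg in segments:
            if seg.start <= time < seg.end:
                seg.events.append(label)
                break  # events outside every segment are dropped
    return segments

# Placeholder data standing in for real pass outputs.
segments = [
    Segment(0.0, 2.0, "speaker_0", "placeholder first turn"),
    Segment(2.0, 5.0, "speaker_1", "placeholder second turn"),
]
annotated = attach_events(segments, [(3.1, "[laughs]")])
print(json.dumps([asdict(s) for s in annotated], indent=2))
```

The merge step is where annotator noise compounds: a laugh detected 300 ms late can land in the wrong speaker's segment, which is one way a script writer learns to put [laughs] in the wrong place.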

The moment that started this benchmark

The first model I reached for was Audio Flamingo Next, NVIDIA's 8B audio model. On the first run, it returned identical output for every clip — same speaker count, same tone, same four breath events at the same timestamps. NPR news, a Moth story, a meditation, a stand-up set, all the same JSON. It wasn't broken. It just wasn't listening; the audio kwarg was being silently dropped by the processor.

Model leaderboards are not model deployments. The number on the HuggingFace page is one promise; the number you get on your data, in your container, is a different one. This post documents the gap.

Calibrating the annotation box is the single highest-leverage decision in the whole pipeline.
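The failure described above (identical JSON for every clip) is cheap to catch before trusting any annotator. A minimal sketch, assuming annotations are plain JSON-serializable dicts keyed by a clip identifier; the function names and sample clips are hypothetical.

```python
import hashlib
import json

def fingerprint(annotation: dict) -> str:
    """Stable hash of an annotation, independent of key order."""
    canonical = json.dumps(annotation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_identical_outputs(annotations: dict) -> list:
    """Group clip ids whose annotations are byte-for-byte identical."""
    groups = {}
    for clip_id, annotation in annotations.items():
        groups.setdefault(fingerprint(annotation), []).append(clip_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Hypothetical outputs: four very different clips, one suspiciously shared result.
same = {"speakers": 1, "tone": "calm", "breaths": [1.0, 4.2, 7.9, 11.3]}
outputs = {
    "npr_news": dict(same),
    "moth_story": dict(same),
    "meditation": dict(same),
    "standup_set": dict(same),
}
suspicious = find_identical_outputs(outputs)
print(suspicious)
```

If acoustically different clips collapse into one group, the model is probably not receiving the audio at all, and no benchmark number from that run means anything.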

Three layers of the gap, with audio

Layer 1 — The "uniform style per region" problem

Same clip. Pay attention to the mid-thought abandon: "And I mean, I really have no — I never thought that there was a chance that you have seen that." That em-dash isn't punctuation — it's an affect shift. Confident anecdote to softer acknowledgment, inside a single breath.

The Read · listen for the mid-sentence pivot ("I really have no — I never thought…") and the co-host's overlapping "That's true"


Now hear what frontier TTS does with this same transcript (4 labs × clean vs enhanced. The delta is the point)

The question: take that transcript, feed it to the best TTS engines with all available markup — how close do they get to the original? That gap is what the script-writer fine-tune is trying to close.

Same transcript re-synthesized through four frontier TTS engines, each in two modes:

Clean · raw STT transcript · no tags, no direction

Enhanced · our enhancement · tags + direction + drawn-out vowels, using each engine's own markup:

ElevenLabs v3 · bracketed tags: [laughs] [warmly] [whispers] + inline drawn-out vowels

Gemini 3.1 Flash TTS · preview model · 200+ inline audio tags · preamble-style direction

Gemini 2.5 Pro TTS · GA Pro model · higher fidelity · slower per-call

OpenAI gpt-4o-mini-tts-2025-12-15 · pinned snapshot (Dec 2025, 35% lower WER vs prior gen) · premium voices · instructions field for prosodic + accent direction

Layer 2 — The "complex blended affect" problem

The opening of a SmartLess live show. Three voices pile onto a single beat ("One more crack at it"), then hyped-up gratitude, then a self-deprecating setup about smacking Sean across the face. The same speaker simultaneously performs excitement and undercuts himself with dry self-deprecation. Two affects, same words.

SmartLess · live show opening · three voices overlapping on "one more crack at it", then the smack-across-the-face setup

What three STT systems heard (verbatim · no edits · 3 systems)

Deepgram Nova-3 · conf 0.996

Less. One more crack at it. Get one more crack at it. Let him have one more crack. Guys, we're so excited. We're here, and thank you for coming out tonight. This is very exciting for us. Yeah. I gave Sean a big smack across the face right before we came out, and I'm just waiting for her to hit me back. So just pardon me if I'm over here. It was so loud. I can't believe you heard it. A little bit red over here. I'm sure it

sentiment: neutral (-0.03) · intents: Express excitement, Request assistance · topics: Excited and excited event, Speaker response · summary: Speaker 0 is excited to have one more crack and asks the others to help. They express excitement and mention that they gave Sean a big smile.

OpenAI gpt-4o-transcribe

Guys, we're so excited we're here and thank you for coming out tonight.

ElevenLabs Scribe v2

list. Woo! One more crack at it. One more, one more. One more crack at it. Oh, oh. Let him have one more crack. Um, guys, we're so excited we're here, and thank you for coming out tonight. This is very exciting for us. Yeah. I gave Sean a big smack across the face right before we came out- I am red over here.

What this comparison shows: (1) OpenAI massively truncated — ONE sentence out of ~ten spoken; overlapping voices confused endpointing. (2) Deepgram caught everything but hallucinated 'big smile' in its summary (actual: 'big smack'). (3) ElevenLabs captured cheering, laughs, false starts, and filler — most paralinguistically rich output by far. (4) None captured all 3 speakers — Deepgram says 1, Eleven says 2, ground truth is 3.

Listen for: What each STT shows vs misses. Eleven is closest to corpus-grade; OpenAI's massive truncation is a production-risk surprise.


Layer 3 — The "drawn-out vowel" problem

Tara Brach delivering a meditation. Listen to how she pronounces "space": held for around a second, about four times longer than a dictionary pronunciation would suggest. The silences between phrases run 2-3 seconds. None of this duration information appears in the transcript.

Tara Brach meditation · slow pace, held vowels, intentional silences between phrases

What three STT systems heard (verbatim · no edits · 3 systems)

Deepgram Nova-3 · conf 0.998

Perhaps you can even sense the space between the sensations just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom. Space and aliveness.

sentiment: positive (+0.35) · intents: Improve spatiomotion · topics: Space perception · summary: Perhaps you can even sense the space between the sensations just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom.

OpenAI gpt-4o-transcribe

Perhaps you can even sense the space between the sensations, just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom. Space and aliveness.

ElevenLabs Scribe v1 · lang_prob 0.95

Perhaps you can even sense the space between the sensations, just like you can sense the space or visualize the space between the particles in an atom, in the nucleus of an atom. Space and aliveness.

What this comparison shows: Three near-identical transcripts — the easy case (single speaker, no overlap). ASRs converge on words; they diverge on paralinguistic events. Deepgram's intent 'Improve spatiomotion' is a hallucinated word. None of the transcripts encode the held vowels on 'space' or the 2.5s silences.

Listen for: Even on the easy case, all three miss what the audio actually carries: prosody. 'space' (~280ms expected) is held ~1.1s. Three transcripts, zero of them encode that.
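The duration information the transcripts drop can be recovered from word timestamps. The sketch below computes how far a word is stretched relative to an expected duration and writes that back into the text as a drawn-out spelling. Everything here is an assumption made for illustration: the `Word` type, the timestamps, and the first-vowel heuristic are hypothetical, and only the ~280ms expected and ~1.1s held figures come from the clip.

```python
from dataclasses import dataclass

VOWELS = "aeiou"

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def elongation_ratio(word: Word, expected_seconds: float) -> float:
    """Observed duration divided by the expected duration for this word."""
    if expected_seconds <= 0:
        raise ValueError("expected_seconds must be positive")
    return (word.end - word.start) / expected_seconds

def stretch_first_vowel(text: str, ratio: float, max_repeat: int = 6) -> str:
    """Repeat the first vowel roughly `ratio` times (a crude spelling heuristic)."""
    repeat = max(1, min(max_repeat, round(ratio)))
    for i, ch in enumerate(text):
        if ch.lower() in VOWELS:
            return text[:i] + ch * repeat + text[i + 1:]
    return text  # no vowel to stretch, e.g. "Mm"

# Hypothetical timestamps giving the ~1.1s hold described above.
held = Word("space", start=12.4, end=13.5)
ratio = elongation_ratio(held, expected_seconds=0.28)
print(f"{ratio:.1f}x -> {stretch_first_vowel(held.text, ratio)}")
```

A ratio near 4 turns "space" into "spaaaace". A production version would need per-word expected durations (from a pronunciation lexicon or the speaker's own median rate) rather than a single constant.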


What happens when you feed it back to TTS

Each audio block above has a collapsible TTS comparison: four labs (Gemini 3.1 Flash, Gemini 2.5 Pro, ElevenLabs v3, OpenAI gpt-4o-mini-tts), each in two modes — Clean (raw STT, no tags) and Enhanced (lab-appropriate markup).

The delta between clean and enhanced rows is the whole point. Listen to Tara Brach across the four labs — that's where the gap is loudest.

The two-speaker correction, and how the pipeline derived it

On first pass we treated The Read as single-speaker. It's a back-and-forth between Crissle West (F) and Kid Fury (M). The correction shows what the corpus pipeline does at scale: Scribe v2's speaker_id tags break the clip into 8 turns; gender labels are cross-checked (no STT returns gender natively); each turn routes to the appropriate voice per lab. Speaker IDs from one model, gender from another, voice routing from a manual mapping — "STT → faithful TTS" is not closed by any single product.
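The turn-splitting and voice-routing step described above can be sketched as follows. This assumes word-level STT output as a list of dicts with `text` and `speaker_id` keys (the field names are an assumption about the response shape), and every voice name in the mapping is a hypothetical placeholder, as is the sample text.

```python
from itertools import groupby

def group_turns(words: list) -> list:
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    for speaker_id, run in groupby(words, key=lambda w: w["speaker_id"]):
        turns.append({"speaker_id": speaker_id,
                      "text": " ".join(w["text"] for w in run)})
    return turns

# Manual mapping of (lab, speaker) to a voice; all voice names are hypothetical.
VOICE_MAP = {
    ("elevenlabs", "speaker_0"): "voice-f-placeholder",
    ("elevenlabs", "speaker_1"): "voice-m-placeholder",
}

def route(turns: list, lab: str) -> list:
    """Pair each turn with the voice chosen for that speaker at that lab."""
    return [(VOICE_MAP[(lab, t["speaker_id"])], t["text"]) for t in turns]

# Placeholder words standing in for real word-level STT output.
words = [
    {"text": "first", "speaker_id": "speaker_0"},
    {"text": "turn", "speaker_id": "speaker_0"},
    {"text": "reply", "speaker_id": "speaker_1"},
    {"text": "back", "speaker_id": "speaker_0"},
]
turns = group_turns(words)
routed = route(turns, "elevenlabs")
print(routed)
```

Grouping on consecutive runs (rather than collecting all words per speaker) preserves the back-and-forth order, which is what the 8-turn breakdown depends on.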

The accent problem

No production ASR returns accent or dialect information. Deepgram, ElevenLabs, OpenAI, CrisperWhisper — all return a BCP-47 language tag or just text. None returns "Black American", "Southern AAVE", or "British RP". Open-source classifiers exist (CommonAccent covers 16 English accents) but AAVE isn't a class in any of them.

The consequence: The Read's hosts code-switch into AAVE constantly. Feed a clean transcript to a default TTS voice and you get a generic white-American reading. We had to route accent manually — ElevenLabs via voice library, Gemini and OpenAI via preamble direction. Adherence is hit-or-miss. Accent-aware TTS isn't a production capability in May 2026. The end-to-end pipeline "audio → derived accent → matched TTS voice" doesn't exist as a shipped product.

Why no production TTS handles these well

Subtler affects are even further behind. Sarcasm requires modeling the divergence between semantic content and prosodic delivery — no production TTS exposes a controllable knob for it, for irony, or for the polite-chuckle-versus-genuine-laugh distinction. Non-frontier TTS (Bark, Sesame CSM, Kyutai Moshi) sits well behind on emotion expressiveness.

The thread through all three layers: at every current TTS tier, continuous human affect is approximated by discrete styling. Our bet: build a corpus whose training data carries the affective contour finer than the global tag, pairing paragraph-level scene metadata with per-speaker inline markup, so the writer learns when to describe affect versus when to mark it.

The same annotation pipeline that creates the training data can also validate TTS output. Fish Audio S2 and CosyVoice 3 both reuse their annotation models as RL reward signals. Building the annotator isn't just about creating training data — it's about creating the evaluation engine for everything downstream.

The open question: does a script writer trained on richly annotated data produce scripts that downstream TTS renders more expressively than an untagged baseline? That validation is in progress; we'll share tag activation rates and listener preference numbers when they're ready.

If you work in speech and see something I've gotten wrong — a methodology gap, a better tool, a paper I should have cited — I'd genuinely like to know. Reach me at @DavidAmal.