What TTS throws away

The paralinguistic gap between human speech and synthetic audio · May 2026

When a podcast host says "spaaaace" and holds the vowel for 1.1 seconds, every ASR (automatic speech recognition) transcribes it as "space." When two friends banter and one reacts with "Oh. Ooh. Mm." for 2.4 seconds of pure non-verbal communication, the transcript shows three punctuated words. When a speaker abandons a clause mid-thought — "I really have no — I never thought…" — the transcript records an em-dash. The audio records an affect shift.

That's the gap. The training data for every TTS system is built from transcripts that discard the expressive vocabulary humans actually use. Drawn-out vowels, reactive backchannels, mid-sentence register shifts, the timing of a laugh — all of it is flattened into clean text before a TTS model ever sees it. The model can't learn what it's never shown.

This post is about what that gap sounds like, why it exists, and what it would take to close it. Press play below — then expand the comparisons to see what three STT systems and four TTS engines make of the same 25 seconds.

The Read · 2-host pop-culture podcast · reminiscing about childhood movie restrictions
A note on the clip: This clip contains the N-word, used colloquially by the hosts. It was selected for its paralinguistic richness — drawn-out vowels, mid-sentence pivots, backchannels, laughter — not its lexical content. STT transcripts reproduce it verbatim because verbatim fidelity is the point of the comparison.

step 1 · what STT heard
Three frontier ASRs, same 25 seconds, three different transcripts

Deepgram Nova-3 · conf 0.999

Oh. If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mama didn't let me watch anything that was rated worse than PG 13. That's true. And then when I went off college, I just had no desire to go back and watch all the Don't Be a Menaces. And I mean, I really have no I'd I never thought that there was a chance that you have seen that. But the point is that many of you at

sentiment: negative (-0.45) · intents: Express frustration about nigga movies · topics: Movie watching · summary: Speaker 0 discusses how he didn't watch the Partners in the Middle, citing his lack of desire to return to watching movies that were rated PG 13.

OpenAI gpt-4o-transcribe

If you watch nigga movies, I feel like it's one of those. But for a massive chunk of my life, I didn't because my mom didn't let me watch anything that was rated worse than PG-13. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. I mean, I really have no idea. I never thought that there was a chance that you have seen it. But the point is that many of you...

ElevenLabs Scribe v2

Oh. If you watch nigga movies, I feel like it's one of those. Mm. But for a, a massive chunk of my life I didn't, 'cause my mama didn't let me watch anything that was rated worse than PG-13. That's true. And then when I went off to college, I just had no desire to go back and watch all the Don't Be a Menaces. [laughs] And I mean, I really have no— I never thought that there was a chance that you have seen that.

audio events captured: [laughs]

What this comparison shows: (1) OpenAI drops the leading 'Oh'; Eleven keeps 'Oh' AND the 'Mm' backchannel that the other two missed. (2) Deepgram's 'Express frustration' intent and 'negative' sentiment are wrong: the host is warm and reminiscing. (3) Only Eleven captures [laughs] mid-utterance. (4) Deepgram preserves the disfluency 'I really have no I'd I never thought'; OpenAI invents 'no idea' to clean it up.

Listen for: The mid-sentence pivot at the em-dash and the [laughs] event Eleven caught. None of these STT outputs carry the affect of the clip — they capture words and (for one) events.

step 2 · what TTS does with that transcript
Same transcript fed to 4 frontier TTS labs: listen, then see where the time goes

The original is 25 seconds. The same 8-segment script rendered through each TTS lab takes 28.7s to 34.9s. The timing chart aligns all 9 versions (the original plus eight renders) on a shared time axis, colored by speaker. Clean = raw STT transcript; Enhanced = same transcript with lab-appropriate affective markup. A second view shows amplitude envelopes, so timing can be compared against silence placement.

[Timing chart, 0-30s axis: one row per version, each split into the same 8 segments ("Oh." / "If you watch…" / "Mm. But for a massive chunk…" / "That's true." / "And then when I went off…" / "I mean, I really have no —…" / "[laughs]" / "But the point…"), colored by speaker.]

Total duration per row:
Original audio (Scribe v2): 24.68s
ElevenLabs v3, clean: 28.74s
ElevenLabs v3, enhanced: 29.48s
Gemini 3.1 Flash, clean: 31.32s
Gemini 3.1 Flash, enhanced: 31.16s
Gemini 2.5 Pro, clean: 29.21s
Gemini 2.5 Pro, enhanced: 29.41s
OpenAI gpt-4o-mini-tts, clean: 34.90s
OpenAI gpt-4o-mini-tts, enhanced: 33.96s
speaker_0 (Crissle, F) · speaker_1 (Kid Fury, M) · TTS per-segment widths estimated by char-ratio × measured total duration; original uses Scribe v2 word timestamps
Original (24.68s) is the reference. OpenAI expanded most (34.9s), with stitching seams between speakers. ElevenLabs and Gemini 2.5 Pro are closest to the human (~29s).
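The char-ratio estimate mentioned in the chart note can be sketched as follows. This is a minimal illustration, not the code behind the chart: the function name and the shortened segment texts are placeholders, and the method ignores pauses entirely, so it only approximates layout.

```python
def estimate_segment_spans(segments, total_duration):
    """Split a measured total duration across segments by character count.

    Returns (start, end, text) tuples. Pauses are not modeled, so these
    spans are a layout approximation, not real timestamps.
    """
    lengths = [len(text) for text in segments]
    total_chars = sum(lengths)
    if total_chars == 0:
        raise ValueError("segments contain no characters")

    spans = []
    cursor = 0.0
    for text, n_chars in zip(segments, lengths):
        width = total_duration * n_chars / total_chars
        spans.append((round(cursor, 2), round(cursor + width, 2), text))
        cursor += width
    return spans


# Placeholder segment texts; 28.74 is the measured ElevenLabs v3 clean total.
segments = ["Oh.", "If you watch", "That's true."]
spans = estimate_segment_spans(segments, total_duration=28.74)
for start, end, text in spans:
    print(f"{start:6.2f} to {end:6.2f}  {text}")
```

Because the widths come from text length alone, two renders with the same script differ only by a scale factor; the original row is the only one with real word timestamps.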
[Amplitude-envelope chart, 0-30s axis: one strip each for the original audio and the eight TTS renders (ElevenLabs v3, Gemini 3.1 Flash, Gemini 2.5 Pro, OpenAI gpt-4o-mini-tts; clean and enhanced), with the same total durations as the timing chart.]
Amplitude envelopes (ffmpeg showwavespic), same time scale as the Timing tab. The dramatic gaps in OpenAI's strips are stitching seams — silence between concatenated segments rather than natural breath spacing.

That gap is what this project is trying to close. ElevenLabs v3 gives you [cheerful] or [deadpan]. It will not give you the mid-thought interruption above — one speaker abandoning a clause and restarting with softer affect. The transcript reads as a clean sentence with an em-dash. The audio is something no tag captures.

What I'm building

We're building a fine-tuned script writer — a model whose output has the affective contour, the cue tags, the drawn-out vowels, and the per-segment direction baked in, so when any TTS reads it, what comes out is something a human voice actor would recognise as expressive.

The script writer needs to emit four layers of paralinguistic detail that conventional script generation throws away: the affective contour, the cue tags, the drawn-out vowels, and the per-segment direction.

The writer has to cover the full emotional range — not just happy/sad/angry/neutral but blended states like amused-resigned, hopeful-skeptical, warm-sarcastic. Current TTS can't reach those from a flat transcript alone; a script writer that puts affect where it goes is the leverage point.

"Expressive" is the load-bearing word. The goal isn't "monologue for one tired podcaster." It's the full surface area of human speech — comedy timing, meditation silences, conversational backchannels, scene dialogue, songlike delivery — and under all of those, the emotion spectrum a real voice carries: warmth, amusement, sarcasm, fatigue, grief, hopefulness, curiosity. And the blended states that are most of human affect (amused-resigned, hopeful-skeptical, excited-anxious). The script-writer is the layer that says "here is exactly which emotion goes where in the script", and the downstream TTS gets to focus on rendering instead of guessing.

The script writer starts as a LoRA fine-tune, scaling to a full 7-8B fine-tune if tag placement accuracy demands it. (Fish Audio S2 fine-tuned a full 30B model for their annotator; my task is narrower but LoRA may not have enough capacity for ~110 inline tags.) The downstream TTS can be any engine — the first validation gate is whether tagged scripts produce measurably better output than untagged ones. Portability across TTS engines is the only way to keep iterating without vendor lock-in.

The immediate question: where do the training annotations come from? Hand-labeling 100 hours at this granularity would take months. Frontier audio LLMs (Vertex Gemini 2.5 Pro) do it well but cost money and lock you into a closed API. Self-hosting an audio LLM that gets close enough would let us scale without rate limits, without sending audio to a third party, and with a path to fine-tune the annotator later. That's the build documented here.

TTS in May 2026: what moved, what didn't

As of May 2026, audio LLMs and TTS are in two different places. Audio understanding has moved fast. Frontier closed models — Vertex Gemini 2.5 Pro, OpenAI gpt-realtime — can listen to a 5-minute podcast and produce structured descriptions: speakers, emotions, vocal events, drawn-out words. Open models are catching up at roughly one per quarter — Qwen3-Omni, Kimi-Audio, MiDashengLM. Audio synthesis has also moved. ElevenLabs v3 exposes ~70 inline tags; Hume Octave 2 takes free-text prosodic direction; Google Gemini 3.1 Flash TTS accepts scene-level direction inline. But the ceiling moved, it didn't disappear. Every closed system still applies one uniform style across each tagged region. Continuous within-span affect — sarcasm, the smirk-while-talking — remains unsolved.

The corpus + script-writer project lives in that asymmetry: I can build the training data from today's audio understanding models, and bet that synthesis catches up (or that even current synthesis renders richly-tagged scripts better than flat ones).

The annotation bottleneck nobody talks about

The biggest lever for expressive TTS quality isn't model architecture or training scale — it's annotation quality.

Fish Audio's S2 improved its Tag Activation Rate from 62.6% to 88.1% — a 41% relative improvement — by better-annotating its training data. Same architecture, same pipeline, just richer labels. (arXiv:2603.08823) The DeEAR framework showed a 3x improvement in emotion expressiveness from data curation alone. And SpeechJudge found that a fine-tuned reward model (77.2% accuracy) outperforms every frontier AudioLLM at judging speech quality, while traditional metrics like UTMOS (53.7%) are barely better than coin flips. Annotation quality drives everything — training data and evaluation.

The field is converging on a spectrum of annotation approaches:

We're building in the "rich transcription" tier, aiming toward "annotation-as-reward."

The plan, in one picture

audio clip (5 min, 16 kHz) → audio LLM annotator (self-hosted, on Modal) → JSON annotation (16-field schema) → script-writer LLM (fine-tune, LoRA → full) → TTS-aware script ("[laughs] heyyy …")
The full corpus pipeline ends in a fine-tuned script-writer LLM. This post is about the dark middle box — picking the audio LLM that produces the labels everything downstream learns from.

Take 5-minute audio clips, run each through an annotation pipeline that emits structured JSON, use those triples as supervised-learning data. In practice, the dark box is a multi-pass pipeline — VAD, diarization, ASR, emotion classifiers, pitch extractors — because no single audio LLM captures everything. The EACL 2026 finding on "lexical dominance" (Gemini scores 96.6% on text-based audio tasks but 25-35% on audio-only paralinguistic tasks) is why. A noisy annotator means a script writer that puts [laughs] in the wrong places.

The moment that started this benchmark

The first model I reached for was Audio Flamingo Next, NVIDIA's 8B audio model. On the first run, it returned identical output for every clip — same speaker count, same tone, same four breath events at the same timestamps. NPR news, a Moth story, a meditation, a stand-up set, all the same JSON. It wasn't broken. It just wasn't listening; the audio kwarg was being silently dropped by the processor.

Model leaderboards are not model deployments. The number on the HuggingFace page is one promise; the number you get on your data, in your container, is a different one. This post documents the gap.
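A cheap guard against the failure described above is to fingerprint each clip's annotation and flag any group of clips that share one. The sketch below is an assumption about how such a check could look; the field names and values in `outputs` are made up for illustration and are not the real schema.

```python
import hashlib
import json


def annotation_fingerprint(annotation):
    """Hash a JSON-serializable annotation, independent of key order."""
    canonical = json.dumps(annotation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_identical_outputs(annotations_by_clip):
    """Return groups of clip ids whose annotations are byte-identical."""
    groups = {}
    for clip_id, annotation in annotations_by_clip.items():
        key = annotation_fingerprint(annotation)
        groups.setdefault(key, []).append(clip_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical annotator outputs for three very different clips.
outputs = {
    "npr_news": {"speakers": 2, "tone": "calm", "breaths": [1.2, 4.8, 9.1, 13.5]},
    "moth_story": {"speakers": 2, "tone": "calm", "breaths": [1.2, 4.8, 9.1, 13.5]},
    "standup": {"tone": "calm", "speakers": 2, "breaths": [1.2, 4.8, 9.1, 13.5]},
}
suspicious = find_identical_outputs(outputs)
if suspicious:
    print("annotator may be ignoring the audio:", suspicious)
```

Identical output across unrelated clips does not prove the audio is being dropped, but it is a strong enough signal to stop and inspect the processor inputs before running a full benchmark.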

Calibrating the annotation box is the single highest-leverage decision in the whole pipeline.

Three layers of the gap, with audio

Three clips, three failure modes.

Layer 1 — The "uniform style per region" problem

Same clip. Pay attention to the mid-thought abandon: "And I mean, I really have no — I never thought that there was a chance that you have seen that." That em-dash isn't punctuation — it's an affect shift. Confident anecdote to softer acknowledgment, inside a single breath.

The Read · listen for the mid-sentence pivot ("I really have no — I never thought…") and the co-host's overlapping "That's true"

Now hear what frontier TTS does with this same transcript (4 labs × clean vs enhanced. The delta is the point)

The question: take that transcript, feed it to the best TTS engines with all available markup — how close do they get to the original? That gap is what the script-writer fine-tune is trying to close.

Same transcript re-synthesized through 4 frontier TTS engines, in two modes:

Clean: raw STT transcript, no tags, no direction. Engines: ElevenLabs v3, Gemini 3.1 Flash TTS, Gemini 2.5 Pro TTS, OpenAI gpt-4o-mini-tts-2025-12-15.

Enhanced: our enhancement (tags + direction + drawn-out vowels), adapted per engine:
  ElevenLabs v3: bracketed tags ([laughs] [warmly] [whispers]) + inline drawn-out vowels
  Gemini 3.1 Flash TTS: preview model · 200+ inline audio tags · preamble-style direction
  Gemini 2.5 Pro TTS: GA Pro model · higher fidelity · slower per-call
  OpenAI gpt-4o-mini-tts-2025-12-15: pinned snapshot (Dec 2025, 35% lower WER vs prior gen) · premium voices · instructions field for prosodic + accent direction
What you just heard, technically (the chunk-by-chunk drift problem)

Try reproducing that em-dash pivot with ElevenLabs v3 by chunking — first half [confident], second half [warm, softer]. Every tag boundary becomes an acoustic seam. WeSCon (arXiv:2509.24629) calls these "unnatural acoustic discontinuities at segment boundaries". Other work attacks the same problem: Microsoft's EmoCtrl-TTS uses continuous valence/arousal trajectories instead of discrete labels, and CoCoEmo names the deeper limitation: single-label utterance-level control "collapses affective diversity".

Example line: "I had a great day at work, oh and then the boss called me into his office — wonderful."
Current TTS, sentence-level styling: [cheerful] applied uniformly.
What a human actually does, shifting affect: [cheerful] → [neutral] → [wary] → [deadpan, sarcastic].
The gap: even the best closed TTS (Eleven v3, Hume Octave 2, Gemini 3.1) lets you tag regions inline, but each region still applies one uniform style. Continuous within-span affect, which humans produce constantly, is unsolved.

Layer 2 — The "complex blended affect" problem

The opening of a Smartless live show. Three voices pile onto a single beat — "One more crack at it" — then hyped-up gratitude, then a self-deprecating setup about smacking Sean across the face. The same speaker simultaneously performs excitement and undercuts himself with dry self-deprecation. Two affects, same words.

Smartless · live show opening · three voices overlapping on "one more crack at it", then the smack-across-the-face setup
What three STT systems heard (verbatim · no edits · 3 systems)
Deepgram Nova-3 · conf 0.996

Less. One more crack at it. Get one more crack at it. Let him have one more crack. Guys, we're so excited. We're here, and thank you for coming out tonight. This is very exciting for us. Yeah. I gave Sean a big smack across the face right before we came out, and I'm just waiting for her to hit me back. So just pardon me if I'm over here. It was so loud. I can't believe you heard it. A little bit red over here. I'm sure it

sentiment: neutral (-0.03) · intents: Express excitement, Request assistance · topics: Excited and excited event, Speaker response · summary: Speaker 0 is excited to have one more crack and asks the others to help. They express excitement and mention that they gave Sean a big smile.

OpenAI gpt-4o-transcribe

Guys, we're so excited we're here and thank you for coming out tonight.

ElevenLabs Scribe v2

list. Woo! One more crack at it. One more, one more. One more crack at it. Oh, oh. Let him have one more crack. Um, guys, we're so excited we're here, and thank you for coming out tonight. This is very exciting for us. Yeah. I gave Sean a big smack across the face right before we came out- I am red over here.

What this comparison shows: (1) OpenAI massively truncated — ONE sentence out of ~ten spoken; overlapping voices confused endpointing. (2) Deepgram caught everything but hallucinated 'big smile' in its summary (actual: 'big smack'). (3) ElevenLabs captured cheering, laughs, false starts, and filler — most paralinguistically rich output by far. (4) None captured all 3 speakers — Deepgram says 1, Eleven says 2, ground truth is 3.

Listen for: What each STT shows vs misses. Eleven is closest to corpus-grade; OpenAI's massive truncation is a production-risk surprise.

What you just heard, technically (complex emotion as a blend, not a label)

Real human affect is rarely one label. "We're so excited" is performed for the audience while "I gave Sean a big smack" undercuts it — hype + dry self-mockery in one breath. Three voices echoing "one more crack at it" is an emotional unit no tag captures. Sentence patterns that exercise this problem:

  1. "I really appreciate you scheduling this meeting at 5:30 on a Friday — truly, what a thoughtful use of everyone's time." The hinge is "truly". Tag the whole thing [grateful] and it's wrong; tag the second half [sarcastic] and the transition is a cliff.
  2. "So the migration is done, the rollback plan is tested, and we're ready to ship — I mean, I think we're ready, mostly, unless QA found something this morning." Confidence drains across the sentence. No clean tag boundary.
  3. "Okay, okay, the offer letter is here, it's actually here, oh god, I have to decide by Monday." Excitement and anxiety are co-present from word one.
A) One global tag ("basic TTS"): [cheerful] over the whole line, "I had a great day at work, oh and then the boss called me into his office — wonderful." And "wonderful" comes out sincere instead of sarcastic. The joke dies.
B) Tags every few words ("Eleven v3 / Gemini 3.1 / Hume Octave 2"): tag boundaries become acoustic seams. WeSCon (arXiv:2509.24629) calls these "unnatural acoustic discontinuities at segment boundaries": each chunk is internally uniform, and the concatenation is audible. Also: tag overhead bloats the script.
C) Director's note (paragraph-level direction): "scene: speaker is recounting a workday to a friend. Starts genuinely cheerful, deflates into resignation, then lands the final word with dry sarcasm. No mid-sentence pauses." → "I had a great day at work, oh and then the boss called me into his office — wonderful." A continuous affect contour, what humans actually do. No hard boundaries, no token bloat.
Legend: cheerful · neutral · wary · sarcastic
Three approaches to a complex line: (A) one global tag. The joke dies. (B) chunked tags — seams audible, tokens wasted. (C) a paragraph of scene direction at the top, then the sentence flows — affect as a continuous trajectory.

I call this the "director's note" framing — direction at the scene level, like screenplay stage directions. Hume's Octave 2 calls their equivalent field "Acting Instructions". Voice acting works this way: a director gives one paragraph of intent before a scene, not [smiling] over each word.
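To make approaches (B) and (C) concrete, here is a small sketch that builds both script variants for the example line and measures how much of the chunked script is spent on tags. The chunk boundaries, the function names, and the "scene:" preamble format are illustrative assumptions; no engine is claimed to accept exactly this layout.

```python
import re

# Matches one bracketed tag plus any whitespace that follows it.
TAG_PATTERN = re.compile(r"\[[^\]]+\]\s*")


def chunked_script(chunks):
    """Approach B: prefix every chunk with its own bracketed tag."""
    return " ".join(f"[{tag}] {text}" for tag, text in chunks)


def directed_script(direction, chunks):
    """Approach C: one scene-level note, then the untagged sentence."""
    sentence = " ".join(text for _, text in chunks)
    return f"scene: {direction}\n{sentence}"


def tag_overhead(script):
    """Fraction of the script's characters spent on bracketed tags."""
    stripped = TAG_PATTERN.sub("", script)
    return 1 - len(stripped) / len(script)


# Chunk boundaries are a guess at where the affect shifts.
chunks = [
    ("cheerful", "I had a great day at work,"),
    ("neutral", "oh and then the boss"),
    ("wary", "called me into his office"),
    ("deadpan, sarcastic", "— wonderful."),
]
direction = (
    "speaker is recounting a workday to a friend. Starts genuinely cheerful, "
    "deflates into resignation, then lands the final word with dry sarcasm. "
    "No mid-sentence pauses."
)

chunked = chunked_script(chunks)
directed = directed_script(direction, chunks)
print(chunked)
print(f"tag overhead: {tag_overhead(chunked):.0%}")
print(directed)
```

The overhead figure counts bracketed tags only. The scene note in (C) is also extra text, but it is paid once per scene rather than once per chunk.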

Layer 3 — The "drawn-out vowel" problem

Tara Brach delivering a meditation. Listen to how she pronounces "space": held for around a second, roughly four times longer than the ~280ms you would expect. The silences between phrases run 2-3 seconds. None of this duration information appears in the transcript.

Tara Brach meditation · slow pace, held vowels, intentional silences between phrases
What three STT systems heard (verbatim · no edits · 3 systems)
Deepgram Nova-3 · conf 0.998

Perhaps you can even sense the space between the sensations just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom. Space and aliveness.

sentiment: positive (+0.35) · intents: Improve spatiomotion · topics: Space perception · summary: Perhaps you can even sense the space between the sensations just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom.

OpenAI gpt-4o-transcribe

Perhaps you can even sense the space between the sensations, just like you can sense the space or visualize the space between the particles in an atom, the nucleus of an atom. Space and aliveness.

ElevenLabs Scribe v1 · lang_prob 0.95

Perhaps you can even sense the space between the sensations, just like you can sense the space or visualize the space between the particles in an atom, in the nucleus of an atom. Space and aliveness.

What this comparison shows: Three near-identical transcripts — the easy case (single speaker, no overlap). ASRs converge on words; they diverge on paralinguistic events. Deepgram's intent 'Improve spatiomotion' is a hallucinated word. None of the transcripts encode the held vowels on 'space' or the 2.5s silences.

Listen for: Even on the easy case, all three miss what the audio actually carries: prosody. 'space' (~280ms expected) is held ~1.1s. Three transcripts, zero of them encode that.

What you just heard, technically (orthographic surface vs acoustic duration)

The Deepgram transcript is technically correct: every word is right. What's missing is duration: "space" is held ~1.1s (vs ~280ms expected), and there's a 2.5s silence between "sensations" and "just like". The text says "space"; the audio says "spaaaaace". Most transcription pipelines, including CrisperWhisper, normalize elongated spellings because the dictionary doesn't have entries for them.

"hey" (expected duration): ~280 ms
"heyyy" (drawn-out, duration_ratio: 3.2x): ~890 ms
A standard ASR pass (CrisperWhisper, Whisper-large-v3) will transcribe both as "hey". The signal we want lives in the duration, not the tokens.
A drawn-out vowel as duration vs spelling. The audio carries the signal; the standard transcript erases it. The corpus has to recover it from the audio directly.

If we want the script writer to emit "spaaaace" or "heyyy" in the right places, the training data has to contain those spellings. Standard ASR throws them away, so the audio-LLM annotator has to re-inject them by listening for duration ratios over a threshold.
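A minimal sketch of that re-injection step, assuming the observed duration comes from word timestamps and the expected duration from some per-word baseline. The 2.0 threshold, the cap on added letters, and the rule of stretching the first vowel run are all illustrative choices, not the pipeline's actual settings; a real system would more likely stretch the stressed vowel.

```python
import re

# First run of vowel letters; "y" is included so "hey" stretches as "heyyy".
VOWEL_RUN = re.compile(r"[aeiouy]+", re.IGNORECASE)


def duration_ratio(observed_s, expected_s):
    """How many times longer the word was held than its expected duration."""
    return observed_s / expected_s


def respell(word, ratio, threshold=2.0, max_extra=4):
    """Repeat the last letter of the first vowel run when ratio >= threshold.

    The number of added letters grows with the ratio and is capped so that
    very long holds do not produce unreadable spellings.
    """
    if ratio < threshold:
        return word
    match = VOWEL_RUN.search(word)
    if match is None:
        return word
    extra = min(max_extra, round(ratio) - 1)
    last = match.end() - 1
    return word[: last + 1] + word[last] * extra + word[last + 1 :]


# Durations quoted in this post: "space" ~1.1s vs ~280ms, "hey" ~890ms vs ~280ms.
print(respell("space", duration_ratio(1.1, 0.28)))
print(respell("hey", duration_ratio(0.89, 0.28)))
print(respell("space", duration_ratio(0.30, 0.28)))
```

The spelling is only a carrier for the duration signal. Keeping the numeric ratio alongside the re-spelled word in the annotation would let the threshold be revisited later without re-listening.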

What happens when you feed it back to TTS

Each audio block above has a collapsible TTS comparison: four labs (Gemini 3.1 Flash, Gemini 2.5 Pro, ElevenLabs v3, OpenAI gpt-4o-mini-tts), each in two modes — Clean (raw STT, no tags) and Enhanced (lab-appropriate markup).

Key findings across all four labs:

  1. Gemini 2.5 Pro absorbs slow-pace direction most aggressively — Tara Brach expanded from ~16.5s to ~31.7s (nearly 2x) but with 13-24s latency per call vs Flash's ~3-5s.
  2. ElevenLabs v3 is second-most-responsive — Tara Brach 14.3s → 25.0s (1.75x expansion); its [laughs] tag landed mid-utterance exactly right on The Read.
  3. Gemini 3.1 Flash takes direction less dramatically — Tara Brach 14.2s → 19.6s. Some tags ([softly], [whisper]) tripped the safety filter on the meditation script.
  4. Gemini's multi-speaker is hard-capped at 2 speakers via the public API (docs claim more; the API rejects 3+). OpenAI has no native multi-speaker at all.
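The expansion figures in the list above are plain ratios of enhanced to clean duration. A short check, using only the Tara Brach durations quoted in findings 1 to 3:

```python
def expansion(clean_s, enhanced_s):
    """Ratio of enhanced render length to clean render length."""
    return enhanced_s / clean_s


# (clean seconds, enhanced seconds) for the Tara Brach clip, as quoted above.
tara_brach = {
    "Gemini 2.5 Pro": (16.5, 31.7),
    "ElevenLabs v3": (14.3, 25.0),
    "Gemini 3.1 Flash": (14.2, 19.6),
}

ratios = {name: round(expansion(*pair), 2) for name, pair in tara_brach.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name}: {ratios[name]}x")
```

A larger ratio means the engine slowed down more in response to the same slow-pace direction, which is how "most responsive" is used in the findings.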

The delta between clean and enhanced rows is the whole point. Listen to Tara Brach across the four labs — that's where the gap is loudest.

The two-speaker correction, and how the pipeline derived it

On first pass we treated The Read as single-speaker. It's a back-and-forth between Crissle West (F) and Kid Fury (M). The correction shows what the corpus pipeline does at scale: Scribe v2's speaker_id tags break the clip into 8 turns; gender labels are cross-checked (no STT returns gender natively); each turn routes to the appropriate voice per lab. Speaker IDs from one model, gender from another, voice routing from a manual mapping — "STT → faithful TTS" is not closed by any single product.
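The turn-splitting and routing step can be sketched like this. The word-level input shape (text, start, end, speaker_id) is an assumption about what a diarized transcript provides, the timestamps and speaker assignments in the sample are invented, and the voice names are placeholders for a manual per-lab mapping.

```python
from itertools import groupby


def words_to_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        group = list(group)
        turns.append({
            "speaker_id": speaker,
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["text"] for w in group),
        })
    return turns


# Manual mapping from diarization label to a voice; names are placeholders.
VOICE_MAP = {"speaker_0": "voice_f_placeholder", "speaker_1": "voice_m_placeholder"}

# Invented sample: who says what and when is for illustration only.
words = [
    {"text": "Oh.", "start": 0.0, "end": 0.4, "speaker_id": "speaker_1"},
    {"text": "If", "start": 0.6, "end": 0.7, "speaker_id": "speaker_0"},
    {"text": "you", "start": 0.7, "end": 0.8, "speaker_id": "speaker_0"},
    {"text": "watch", "start": 0.8, "end": 1.1, "speaker_id": "speaker_0"},
    {"text": "Mm.", "start": 1.3, "end": 1.6, "speaker_id": "speaker_1"},
]

turns = words_to_turns(words)
for turn in turns:
    voice = VOICE_MAP[turn["speaker_id"]]
    print(f'{turn["start"]:.1f}s {voice}: {turn["text"]}')
```

Grouping on consecutive labels keeps short backchannels as their own turns, which matters here because those are exactly the events the comparison is trying to preserve.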

The accent problem

No production ASR returns accent or dialect information. Deepgram, ElevenLabs, OpenAI, CrisperWhisper — all return a BCP-47 language tag or just text. None returns "Black American", "Southern AAVE", or "British RP". Open-source classifiers exist (CommonAccent covers 16 English accents) but AAVE isn't a class in any of them.

The consequence: The Read's hosts code-switch into AAVE constantly. Feed a clean transcript to a default TTS voice and you get a generic white-American reading. We had to route accent manually — ElevenLabs via voice library, Gemini and OpenAI via preamble direction. Adherence is hit-or-miss. Accent-aware TTS isn't a production capability in May 2026. The end-to-end pipeline "audio → derived accent → matched TTS voice" doesn't exist as a shipped product.

Why no production TTS handles these well

Subtler affects are even further behind. Sarcasm requires modeling the divergence between semantic content and prosodic delivery — no production TTS exposes a controllable knob for it, for irony, or for the polite-chuckle-versus-genuine-laugh distinction. Non-frontier TTS (Bark, Sesame CSM, Kyutai Moshi) sits well behind on emotion expressiveness.

The thread through all three layers: at every current TTS tier, continuous human affect is approximated by discrete styling. Our bet: build a corpus whose training data carries the affective contour finer than the global tag, pairing paragraph-level scene metadata with per-speaker inline markup, so the writer learns when to describe affect versus when to mark it.

The same annotation pipeline that creates the training data can also validate TTS output. Fish Audio S2 and CosyVoice 3 both reuse their annotation models as RL reward signals. Building the annotator isn't just about creating training data — it's about creating the evaluation engine for everything downstream.


The open question: does a script writer trained on richly annotated data produce scripts that downstream TTS renders more expressively than an untagged baseline? That validation is in progress; we'll share tag activation rates and listener preference numbers when they're ready.

Three takeaways:

  1. Script quality matters more than model choice. The delta between clean and enhanced TTS output comes from what's in the script, not from switching engines.
  2. No single audio LLM captures everything. Drawn-out vowels, mid-sentence affect shifts, and backchannels require specialist acoustic models, not just bigger language models.
  3. The annotation pipeline is the product. Fish S2's 41% TAR improvement came from annotation alone. The data is the moat.

If you work in speech and see something I've gotten wrong — a methodology gap, a better tool, a paper I should have cited — I'd genuinely like to know. Reach me at @DavidAmal.