Agent Brief

The quickest way for a viewer to clock "this is AI content" is the voice. Even good text-to-speech has a tell — a flatness, a wrong stress on the third word of every clause. Recording your own narration fixes it instantly.

But it costs you the thing that made TTS attractive in an automated pipeline: most TTS engines can hand you word timings along with the audio. Record yourself and you've got a .wav and no idea when each word lands, which is exactly what you need for captions, cut timing, and trimming dead air.

Forced alignment gives that back. Feed it an audio file and the transcript (you already have the script you read), and it returns word-level timestamps.

The setup

faster-whisper will transcribe with word timings out of the box, and for short-form that's usually close enough:

from faster_whisper import WhisperModel

model = WhisperModel("base", compute_type="int8")  # int8 = runs fine on CPU
# transcribe() returns a lazy generator of segments plus an info object
segments, _ = model.transcribe("narration.wav", word_timestamps=True)

# Word text comes back with leading whitespace, so strip it
words = [(w.word.strip(), w.start, w.end) for seg in segments for w in seg.words]
# e.g. [('The', 0.0, 0.18), ('seen', 0.18, 0.42), ('store', 0.42, 0.71), ...]

Now you can place captions word by word, time a visual cut to the end of a phrase, or trim the silence between takes, all automatically and all from your real voice.
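
Here is a minimal sketch of two of those uses, assuming the (text, start, end) tuples built above. The function names and the thresholds (max_words, max_gap, min_silence) are made up for illustration, not part of faster-whisper, so tune them to your own pacing.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)


def group_captions(
    words: List[Word], max_words: int = 4, max_gap: float = 0.35
) -> List[Tuple[str, float, float]]:
    """Group words into caption cues.

    A new cue starts when the current one already holds max_words words,
    or when the pause before the next word is longer than max_gap seconds.
    """
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        pause = word[1] - current[-1][2] if current else 0.0
        if current and (len(current) >= max_words or pause > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    # Each cue spans from its first word's start to its last word's end
    return [(" ".join(w[0] for w in cue), cue[0][1], cue[-1][2]) for cue in cues]


def find_silences(words: List[Word], min_silence: float = 0.6) -> List[Tuple[float, float]]:
    """Return (start, end) gaps between consecutive words of at least min_silence seconds."""
    return [
        (prev[2], nxt[1])
        for prev, nxt in zip(words, words[1:])
        if nxt[1] - prev[2] >= min_silence
    ]
```

The cues map directly onto caption entries, and the silence spans are the ranges you would hand to your editor or ffmpeg to cut. Leave a little padding on each side of a cut, since the timings are approximate.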

Transcription vs. true alignment

Worth being precise, because it bites people: Whisper transcribes. It decides both what was said and when. It can mishear, so its text won't always match your script word for word.

For captions and rough timing, Whisper's word timestamps are close enough. Ship it.
For true forced alignment to a known script — you have the exact text and want every word of that text pinned to the audio — use a dedicated aligner (aeneas, or a wav2vec2 forced-alignment model). It aligns to your words instead of guessing them.

Most short-form work lives in the first bucket. Reach for the second only when an exact-text caption has to be perfect.
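
If you're in the first bucket but still want your script's spelling in the captions, there's a cheap middle ground: diff the script against what Whisper heard and borrow Whisper's timings for the words that line up. To be clear, this is not forced alignment, just a sketch using Python's difflib. The function name map_script_to_timings is hypothetical, and the input is assumed to be the (text, start, end) tuples from the transcription step.

```python
import difflib
import re
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)
Timed = Tuple[str, Optional[float], Optional[float]]


def _key(token: str) -> str:
    """Comparison key: lowercase, with punctuation removed."""
    return re.sub(r"[^\w']", "", token.lower())


def map_script_to_timings(script: str, heard: List[Word]) -> List[Timed]:
    """Attach Whisper's timings to the script's own words where the two line up.

    Script words with no counterpart in the transcript keep None timings.
    """
    script_tokens = script.split()
    matcher = difflib.SequenceMatcher(
        a=[_key(t) for t in script_tokens],
        b=[_key(w[0]) for w in heard],
        autojunk=False,
    )
    result: List[Timed] = [(t, None, None) for t in script_tokens]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal": words agree. Same-length "replace": probably a mishearing,
        # so keep the script's text and borrow the heard word's timing.
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result[i] = (script_tokens[i], heard[j][1], heard[j][2])
    return result
```

Words that come back with None are where the transcript and script disagree too much to pair up. A few of those can be interpolated from their neighbors; a lot of them is the signal to move to a dedicated aligner.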

The gotchas

Numbers and symbols get spoken differently than written ("2026" → "twenty twenty-six"). If you're matching against a script, normalize both sides first.
Retakes and filler wreck alignment. One clean take per line, minimal "um," beats one long messy take you fix in post.
Room noise drags accuracy down. A cheap mic in a quiet room beats a good mic in a live one.
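
For the first gotcha, here is a rough sketch of normalizing both sides before you compare them. The SPOKEN_FORMS table is a hypothetical, hand-maintained map of the numbers and symbols that actually appear in your scripts; a real pipeline might swap in a number-to-words library instead. It also assumes English text, since anything outside a-z, digits, and apostrophes is dropped.

```python
import re
from typing import Dict, List

# Hypothetical table: written form -> how you actually say it on the recording
SPOKEN_FORMS: Dict[str, str] = {
    "2026": "twenty twenty six",
    "%": "percent",
    "&": "and",
}


def normalize(text: str, spoken_forms: Dict[str, str] = SPOKEN_FORMS) -> List[str]:
    """Return comparable tokens: spoken forms expanded, lowercased, punctuation removed."""
    for written, spoken in spoken_forms.items():
        # Plain substring replacement, padded with spaces so tokens stay separate
        text = text.replace(written, f" {spoken} ")
    text = text.lower().replace("-", " ")  # "twenty-six" -> "twenty six"
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # drop remaining punctuation
    return text.split()
```

Run the script and the transcript through the same function and compare the token lists. Because the replacement is a plain substring match, keep the table entries specific enough that they can't fire inside a longer number.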

The payoff is the whole point: your real voice and the automated timing. You don't have to trade the human element for the pipeline.

And that's the transferable lesson — don't accept the synthetic default just because it's easier to wire up. There's often an alignment trick that lets you keep the part a human should do and automate everything around it.