[{"data":1,"prerenderedAt":279},["ShallowReactive",2],{"blog-\u002Fblog\u002Fyour-own-voice-with-forced-alignment":3},{"id":4,"title":5,"body":6,"date":270,"description":271,"draft":272,"extension":273,"meta":274,"navigation":72,"path":275,"seo":276,"stem":277,"__hash__":278},"blog\u002Fblog\u002Fyour-own-voice-with-forced-alignment.md","Record your own voice — forced alignment instead of paying for TTS",{"type":7,"value":8,"toc":265},"minimark",[9,13,21,28,33,39,181,184,188,196,221,224,228,248,251,258,261],[10,11,12],"p",{},"The quickest way for a viewer to clock \"this is AI content\" is the voice. Even good\ntext-to-speech has a tell — a flatness, a wrong stress on the third word of every clause.\nRecording your own narration fixes it instantly.",[10,14,15,16,20],{},"But it costs you the thing that made TTS attractive in an automated pipeline: TTS hands you\nexact word timings for free. Record yourself and you've got a ",[17,18,19],"code",{},".wav"," and no idea when each\nword lands — which you need for captions, cut timing, and trimming dead air.",[10,22,23,27],{},[24,25,26],"strong",{},"Forced alignment"," gives that back. Feed it an audio file and the transcript (you already\nhave the script you read), and it returns word-level timestamps.",[29,30,32],"h2",{"id":31},"the-setup","The setup",[10,34,35,38],{},[17,36,37],{},"faster-whisper"," will transcribe with word timings out of the box, and for short-form that's\nusually close enough:",[40,41,46],"pre",{"className":42,"code":43,"language":44,"meta":45,"style":45},"language-python shiki shiki-themes github-dark github-dark","from faster_whisper import WhisperModel\n\nmodel = WhisperModel(\"base\", compute_type=\"int8\")  # int8 = runs fine on CPU\nsegments, _ = model.transcribe(\"narration.wav\", word_timestamps=True)\n\nwords = [(w.word, w.start, w.end) for seg in segments for w in seg.words]\n# [('The', 0.00, 0.18), ('seen', 0.18, 0.42), ('store', 0.42, 0.71), ...]\n","python","",[17,47,48,67,74,109,137,142,175],{"__ignoreMap":45},[49,50,53,57,61,64],"span",{"class":51,"line":52},"line",1,[49,54,56],{"class":55},"sOPea","from",[49,58,60],{"class":59},"suv1-"," faster_whisper ",[49,62,63],{"class":55},"import",[49,65,66],{"class":59}," WhisperModel\n",[49,68,70],{"class":51,"line":69},2,[49,71,73],{"emptyLinePlaceholder":72},true,"\n",[49,75,77,80,83,86,90,93,97,99,102,105],{"class":51,"line":76},3,[49,78,79],{"class":59},"model ",[49,81,82],{"class":55},"=",[49,84,85],{"class":59}," WhisperModel(",[49,87,89],{"class":88},"s4wv1","\"base\"",[49,91,92],{"class":59},", ",[49,94,96],{"class":95},"s-3mD","compute_type",[49,98,82],{"class":55},[49,100,101],{"class":88},"\"int8\"",[49,103,104],{"class":59},")  ",[49,106,108],{"class":107},"sJ8bj","# int8 = runs fine on CPU\n",[49,110,112,115,117,120,123,125,128,130,134],{"class":51,"line":111},4,[49,113,114],{"class":59},"segments, _ ",[49,116,82],{"class":55},[49,118,119],{"class":59}," model.transcribe(",[49,121,122],{"class":88},"\"narration.wav\"",[49,124,92],{"class":59},[49,126,127],{"class":95},"word_timestamps",[49,129,82],{"class":55},[49,131,133],{"class":132},"s8ozJ","True",[49,135,136],{"class":59},")\n",[49,138,140],{"class":51,"line":139},5,[49,141,73],{"emptyLinePlaceholder":72},[49,143,145,148,150,153,156,159,162,165,167,170,172],{"class":51,"line":144},6,[49,146,147],{"class":59},"words ",[49,149,82],{"class":55},[49,151,152],{"class":59}," [(w.word, w.start, w.end) ",[49,154,155],{"class":55},"for",[49,157,158],{"class":59}," seg ",[49,160,161],{"class":55},"in",[49,163,164],{"class":59}," segments ",[49,166,155],{"class":55},[49,168,169],{"class":59}," w ",[49,171,161],{"class":55},[49,173,174],{"class":59}," seg.words]\n",[49,176,178],{"class":51,"line":177},7,[49,179,180],{"class":107},"# [('The', 0.00, 0.18), ('seen', 0.18, 0.42), ('store', 0.42, 0.71), ...]\n",[10,182,183],{},"Now you can place captions to the frame, time a visual cut to the end of a phrase, or trim\nthe silence between takes — automatically, from your real voice.",[29,185,187],{"id":186},"transcription-vs-true-alignment","Transcription vs. true alignment",[10,189,190,191,195],{},"Worth being precise, because it bites people: Whisper ",[192,193,194],"em",{},"transcribes"," — it decides what was\nsaid and when. It can mishear, and its text won't exactly match your script.",[197,198,199,207],"ul",{},[200,201,202,203,206],"li",{},"For ",[24,204,205],{},"captions and rough timing",", Whisper's word timestamps are close enough. Ship it.",[200,208,202,209,212,213,216,217,220],{},[24,210,211],{},"true forced alignment to a known script"," — you have the exact text and want every\nword of ",[192,214,215],{},"that text"," pinned to the audio — use a dedicated aligner (",[17,218,219],{},"aeneas",", or a wav2vec2\nforced-alignment model). It aligns to your words instead of guessing them.",[10,222,223],{},"Most short-form work lives in the first bucket. Reach for the second only when an exact-text\ncaption has to be perfect.",[29,225,227],{"id":226},"the-gotchas","The gotchas",[197,229,230,236,242],{},[200,231,232,235],{},[24,233,234],{},"Numbers and symbols"," get spoken differently than written (\"2026\" → \"twenty twenty-six\").\nIf you're matching against a script, normalize both sides first.",[200,237,238,241],{},[24,239,240],{},"Retakes and filler"," wreck alignment. One clean take per line, minimal \"um,\" beats one\nlong messy take you fix in post.",[200,243,244,247],{},[24,245,246],{},"Room noise"," drags accuracy down. A cheap mic in a quiet room beats a good mic in a live\none.",[249,250],"hr",{},[10,252,253,254,257],{},"The payoff is the whole point: your real voice ",[192,255,256],{},"and"," the automated timing. You don't have to\ntrade the human element for the pipeline.",[10,259,260],{},"And that's the transferable lesson — don't accept the synthetic default just because it's\neasier to wire up. There's often an alignment trick that lets you keep the part a human\nshould do and automate everything around it.",[262,263,264],"style",{},"html pre.shiki code .sOPea, html code.shiki .sOPea{--shiki-default:#F97583;--shiki-dark:#F97583}html pre.shiki code .suv1-, html code.shiki .suv1-{--shiki-default:#E1E4E8;--shiki-dark:#E1E4E8}html pre.shiki code .s4wv1, html code.shiki .s4wv1{--shiki-default:#9ECBFF;--shiki-dark:#9ECBFF}html pre.shiki code .s-3mD, html code.shiki .s-3mD{--shiki-default:#FFAB70;--shiki-dark:#FFAB70}html pre.shiki code .sJ8bj, html code.shiki .sJ8bj{--shiki-default:#6A737D;--shiki-dark:#6A737D}html pre.shiki code .s8ozJ, html code.shiki .s8ozJ{--shiki-default:#79B8FF;--shiki-dark:#79B8FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}",{"title":45,"searchDepth":69,"depth":69,"links":266},[267,268,269],{"id":31,"depth":69,"text":32},{"id":186,"depth":69,"text":187},{"id":226,"depth":69,"text":227},"2026-06-17","Synthetic narration is the fastest tell that something is AI slop. Narrate in your own voice and still automate the timing — forced alignment maps your audio to the script for frame-accurate captions, free.",false,"md",{},"\u002Fblog\u002Fyour-own-voice-with-forced-alignment",{"title":5,"description":271},"blog\u002Fyour-own-voice-with-forced-alignment","5oldqgydfjaGsqzKW32ckk86SN5tqwtb3c7gO4RjGU8",1781756059952]