The moment I stopped trusting the first full render
The first time I watched a transition burn a full generation budget and still land on the wrong side of the edit, I knew the problem wasn’t quality — it was commitment. I was paying for the expensive answer before I had any evidence that the prompt had pointed the model in the right direction.
That’s what pushed me toward think frames. I wanted a cheap exploratory pass that could argue with itself before the pipeline spent real compute. Instead of generating one expensive candidate and hoping, I now generate a handful of lightweight sketches, score them, and only let the winner graduate to full-quality generation.
This is the part that felt obvious only after I built it: video generation needs scratch paper. LLMs have a place to reason before they answer; my generator didn’t. Think frames are the missing margin notes.
The key insight: explore first, commit later
The idea came from a simple mismatch. A full keyframe is irreversible in the only way that matters: once I’ve paid for it, I’ve already committed to the path. If the transition is wrong, the loss isn’t just a bad frame — it’s wasted budget and a dead end in the chain.
The naive fix is to generate more full-quality candidates and pick the best one. I’ve done that. It works in the same way buying more lottery tickets works: you increase your odds by multiplying cost.
That is not the kind of engineering I enjoy defending.
Think frames changed the shape of the problem. I keep the exploration cheap, vary the prompt and commitment strength slightly, score the results with the same reward machinery I trust elsewhere, and then spend the expensive pass only on the winning path. The important shift is that the pipeline no longer asks, “Which full render is best?” It asks, “Which direction deserves to become a full render?”
Here’s the architecture in one pass: a handful of cheap low-step sketches fan out from the source frame, the reward mixer scores the cohort, and only the winning direction graduates to full-quality generation.
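In sketch form, with stubs standing in for the real stages (every name here is illustrative, not the module's actual API):

```typescript
// Hypothetical sketch of the detour. Stubs stand in for the real stages.
type Sketch = { prompt: string };
type Scored = Sketch & { score: number };

// Stub for the cheap pass: a few low-step sketches with varied focus.
async function exploreThinkFrames(basePrompt: string): Promise<Sketch[]> {
  const focuses = ["character", "environment", "mood"];
  return focuses.map((f) => ({ prompt: `${basePrompt}. Focus: ${f}` }));
}

// Stub for the reward mixer: the real one scores multiple visual signals.
function scoreCohort(sketches: Sketch[]): Scored[] {
  return sketches.map((s, i) => ({ ...s, score: i * 0.1 }));
}

// The shape of the pipeline: explore cheaply, score, promote one winner.
async function selectWinningPrompt(basePrompt: string): Promise<string> {
  const scored = scoreCohort(await exploreThinkFrames(basePrompt));
  return scored.reduce((a, b) => (b.score > a.score ? b : a)).prompt;
}
```

Only `selectWinningPrompt`'s return value ever reaches the expensive stage; everything before it is disposable.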
That small detour is the whole trick. It gives the generator room to be wrong cheaply, which is exactly what the expensive stage needs.
How I built the exploratory pass
I kept the implementation deliberately narrow. The think-frame module is not a second generator and not a separate product surface. It is a pre-generation layer that sits in front of the existing keyframe flow and feeds it better evidence.
The core comment at the top of lib/think-frames.ts says what the module is for, and I kept it that direct because the code has to earn its keep:
/**
 * Think Frames — Lightweight Exploratory Pre-Generation
 *
 * Inspired by DeepGen's "think tokens" (learnable intermediate representations
 * injected between VLM and DiT).
 *
 * Before committing to a full-quality keyframe generation, this module generates
 * lightweight "think frames" — quick low-inference-step sketches that explore
 * different transition paths. These are scored by the Reward Mixer, and only
 * the winning path proceeds to full-quality generation.
 */
That framing matters because it keeps the module honest. I’m not trying to make the sketch look good. I’m trying to make it informative.
Five focused ways to be wrong
The first design choice was to stop making every exploratory frame fight the same battle. In buildThinkFramePrompts, I vary the focus across five buckets: character, environment, mood, composition, and atmosphere. Each one gets its own suffix so the prompt explores a different preservation priority instead of collapsing everything into one mushy compromise.
const FOCUS_SUFFIXES: Record<ThinkFrame["focus"], string> = {
  character: "Focus on maintaining character identity, facial features...",
  environment: "Focus on maintaining environment, lighting, and color palette...",
  mood: "Focus on maintaining mood, atmosphere, and tonal continuity.",
  composition: "Focus on maintaining spatial composition, framing...",
  atmosphere: "Focus on maintaining texture details, material appearance...",
}
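To make the mechanics concrete, here is a hypothetical sketch of how `buildThinkFramePrompts` might apply those suffixes. The 0.55 base strength and the per-probe offsets are my assumptions for illustration, and the suffix strings are the abbreviated versions from above:

```typescript
type Focus = "character" | "environment" | "mood" | "composition" | "atmosphere";

// Abbreviated copies of the suffixes above, enough to make the sketch run.
const SUFFIXES: Record<Focus, string> = {
  character: "Focus on maintaining character identity, facial features...",
  environment: "Focus on maintaining environment, lighting, and color palette...",
  mood: "Focus on maintaining mood, atmosphere, and tonal continuity.",
  composition: "Focus on maintaining spatial composition, framing...",
  atmosphere: "Focus on maintaining texture details, material appearance...",
};

// Hypothetical sketch of buildThinkFramePrompts: one probe per focus bucket,
// each with a small, deterministic strength offset (values assumed).
function buildThinkFramePrompts(basePrompt: string, baseStrength = 0.55) {
  return (Object.keys(SUFFIXES) as Focus[]).map((focus, i) => ({
    focus,
    prompt: `${basePrompt} ${SUFFIXES[focus]}`,
    strength: baseStrength + (i - 2) * 0.05, // controlled diversity, not chaos
  }));
}
```

The point of the shape is that every probe carries its focus label with it, so a win or a loss stays attributable to one preservation priority.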
I like this pattern because it makes the exploration legible. If a candidate wins, I know what kind of preservation it was good at. If it loses, I know which dimension failed without pretending the model made a single all-purpose judgment.
The tradeoff is obvious: I’m constraining the search space on purpose. That means I may miss a weird but useful hybrid path. But in exchange I get five interpretable probes instead of one vague guess, and for this pipeline that is the better bargain.
Parallel probes, not serial hesitation
The second choice was to generate the candidates in parallel. I didn’t want the exploration pass to become a little queue of regrets. The module fans out the think frames together, then ranks the settled results after the fact.
const generationResults = await Promise.allSettled(
  prompts.map((p, idx) =>
    generator({
      sourceImageUrl,
      prompt: p.prompt,
      strength: p.strength,
      seed: baseSeed + idx,
    })
  )
)
That Promise.allSettled detail is doing real work. I wanted the cohort to survive partial failure. If one probe fails, the others still tell me something, and I don’t throw away a useful exploration round just because one branch misbehaved.
The non-obvious part is the seed progression. I offset the seed by index so each candidate gets a distinct path without turning the whole system into uncontrolled variation. The point is controlled diversity, not chaos with a nicer label.
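The fan-out and its failure tolerance can be sketched together. Here `runCohort` and the generator signature are hypothetical stand-ins, not the module's real API:

```typescript
type Probe = { prompt: string; strength: number };
type CohortResult = { probe: Probe; imageUrl: string };

// Sketch: fan the probes out in parallel, keep whatever settles successfully,
// and pair each survivor back with the probe that produced it.
async function runCohort(
  probes: Probe[],
  baseSeed: number,
  generate: (p: Probe, seed: number) => Promise<string>
): Promise<CohortResult[]> {
  const settled = await Promise.allSettled(
    // Seed offset by index: each probe gets a distinct but reproducible path.
    probes.map((p, idx) => generate(p, baseSeed + idx))
  );
  // One failed branch should not sink the round; keep the fulfilled ones.
  return settled.flatMap((r, idx) =>
    r.status === "fulfilled" ? [{ probe: probes[idx], imageUrl: r.value }] : []
  );
}
```

Because the pairing happens by index after settling, a rejected branch leaves a gap in the cohort rather than corrupting the mapping between probes and results.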
Why I score think frames relative to each other
A fixed threshold sounds tidy until you stare at a mediocre cohort. If every candidate lands around 0.65, an absolute cutoff can tell you all of them are bad and leave you nowhere. That’s too blunt for a selection step that is supposed to decide the least-wrong path.
So I use group-relative normalization in the reward mixer. The score is not just “is this candidate good?” It is “how does this candidate compare to the rest of this batch?” That’s the part that matters when the whole cohort is imperfect, which is often the real world.
The normalization function is compact, and I kept it that way because the idea should be easy to inspect:
// EPSILON is a tiny guard constant (e.g. 1e-8, assumed here) that prevents
// division by zero when the cohort has no variance.
const EPSILON = 1e-8

/**
 * Normalize an array of values using group-relative normalization:
 *   normalized[i] = (value[i] - mean) / (std + EPSILON)
 *
 * This is the core of GRPO: candidates are scored relative to their peers
 * rather than against absolute thresholds.
 */
export function normalizeGroupRelative(values: number[]): number[] {
  if (values.length === 0) return []
  if (values.length === 1) return [0]
  const mean = values.reduce((s, v) => s + v, 0) / values.length
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length
  const std = Math.sqrt(variance)
  return values.map((v) => (v - mean) / (std + EPSILON))
}
A note on what these scores actually are: normalizeGroupRelative returns z-scores — mean-centered, standard-deviation-scaled values that are unbounded in both directions. A single candidate always gets a score of zero. A cohort produces scores that tell you how far each candidate sits from the group mean, not where it lands on a fixed 0–1 scale. The reward weights below are coefficients on these relative distances, not percentages of a bounded composite.
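A small worked example makes the shape of those scores concrete. This repeats the function from above, with EPSILON assumed to be a tiny guard value:

```typescript
const EPSILON = 1e-8; // assumed tiny guard against a zero-variance cohort

function normalizeGroupRelative(values: number[]): number[] {
  if (values.length === 0) return [];
  if (values.length === 1) return [0];
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  return values.map((v) => (v - mean) / (std + EPSILON));
}

// A mediocre cohort clustered around 0.65: an absolute 0.70 cutoff would
// reject all three, but relative scores still single out the least-wrong path.
const scores = normalizeGroupRelative([0.62, 0.65, 0.71]);
// ≈ [-1.07, -0.27, 1.34]: the third probe sits about 1.3 std above the mean.
```

The values always center on zero, which is exactly the property the selection step needs: it never asks whether the cohort is good, only which member of it is best.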
What surprised me here was how much this changes the feel of selection. The pipeline stops acting like a judge with a single hard line and starts acting like a scout comparing several imperfect routes through the same terrain.
The limitation is that relative ranking only works if the cohort is meaningful. If all the probes are identical, the normalization has nothing interesting to say. That is why the focus variations and seed offsets matter so much: they make the batch worth comparing.
The reward mixer is the second half of the trick
Think frames are only useful if the scoring surface can tell the difference between “looks plausible” and “preserves the right things.” I already had a multi-signal reward mixer for candidate scoring, so I reused that structure instead of inventing a separate heuristic just for exploration.
The mixer evaluates five signals: visual drift, color harmony, motion continuity, composition stability, and narrative coherence. The default weights are explicit:
export const DEFAULT_REWARD_WEIGHTS: RewardWeights = {
  visualDrift: 0.30,
  colorHarmony: 0.25,
  motionContinuity: 0.15,
  compositionStability: 0.15,
  narrativeCoherence: 0.15,
}
I like that this makes the selection policy visible. Visual similarity matters most, but it doesn’t get to bully everything else. Color, motion, composition, and narrative continuity all still get a vote.
The important detail is that the mixer does not need every signal to be present. It skips nulls and renormalizes the remaining weights, which keeps the scorer from falling apart when one signal is unavailable. That makes the think-frame pass resilient in exactly the places I care about: partial evidence is still evidence.
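That skip-and-renormalize behavior is worth sketching, since it is what keeps partial evidence usable. This is a minimal stand-in for the idea, not the actual mixer:

```typescript
type Signals = {
  visualDrift: number | null;
  colorHarmony: number | null;
  motionContinuity: number | null;
  compositionStability: number | null;
  narrativeCoherence: number | null;
};

const DEFAULT_REWARD_WEIGHTS = {
  visualDrift: 0.30,
  colorHarmony: 0.25,
  motionContinuity: 0.15,
  compositionStability: 0.15,
  narrativeCoherence: 0.15,
} as const;

// Sketch: a weighted mean over only the signals that are present.
// Missing signals are skipped and the surviving weights renormalized,
// so one unavailable signal never silently drags the score down.
function mixReward(signals: Signals): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(DEFAULT_REWARD_WEIGHTS) as (keyof Signals)[]) {
    const value = signals[key];
    if (value === null) continue; // skip unavailable signals
    weighted += value * DEFAULT_REWARD_WEIGHTS[key];
    totalWeight += DEFAULT_REWARD_WEIGHTS[key];
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}
```

The renormalization is the load-bearing part: a uniformly good candidate scores the same whether four signals report or all five do.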
Where think frames sit in the larger pipeline
Think frames are not a side quest. They are the front door to a three-stage progressive pipeline that I use to keep quality from collapsing into a single expensive guess.
The stage boundaries are spelled out in lib/progressive-pipeline.ts:
/**
 * Stage 1 — Alignment (Generate): Think frames → select → full gen
 * Stage 2 — Refinement (Diagnose & Adjust): Fix weakest signals → re-gen
 * Stage 3 — Recovery (Last Resort): Aggressive fallback → always accept
 */
That structure matters because it gives me a place to be cautious before I become expensive. Stage 1 is where the think frames live. If the best probe looks good enough, I continue. If the result is weak, later stages can diagnose and adjust instead of blindly retrying the same mistake.
The pipeline config reflects that same philosophy:
export const DEFAULT_PIPELINE_CONFIG: PipelineConfig = {
  stage1Threshold: 0.70,
  stage2Threshold: 0.60,
  thinkFrameCount: 3,
  // ...
}
I’m intentionally not pretending the thresholds are magical. They are just gates that separate “continue exploring” from “move forward with what we have.” The think-frame pass reduces how often I have to spend full-quality compute just to discover the prompt was off by a mile.
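Reduced to a sketch, the gating is a tiny decision function. This is a simplification of the real stage flow, and the return labels are mine:

```typescript
type StageDecision = "accept" | "refine" | "recover";

// Thresholds from the pipeline config above.
const STAGE1_THRESHOLD = 0.70;
const STAGE2_THRESHOLD = 0.60;

// Sketch of the gates: a strong score continues, a middling score gets a
// diagnosis-and-adjust pass, and only a genuinely weak score falls through
// to the always-accept recovery stage.
function gate(score: number): StageDecision {
  if (score >= STAGE1_THRESHOLD) return "accept"; // Stage 1 passed
  if (score >= STAGE2_THRESHOLD) return "refine"; // Stage 2: diagnose & adjust
  return "recover"; // Stage 3: aggressive fallback
}
```

Nothing about the numbers is sacred; the value is that the pipeline's appetite for retries is written down in one place instead of being scattered across call sites.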
The cost argument is simple, and that’s why it works
I didn’t build this because it sounds elegant. I built it because full-quality generation is the expensive part, and I was tired of paying for expensive uncertainty.
Think frames let me spend a little to learn a lot. The exploration pass is lightweight by design, and the winning path is the only one that gets promoted. That means I can inspect several candidate directions without paying full price for every one of them.
The practical difference is not subtle. A cohort of cheap sketches gives me a chance to reject a bad transition before I’ve committed to a full render. That is the kind of savings that shows up as fewer wasted generations and fewer dead-end branches in the chain.
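The arithmetic is easy to check under an assumed cost ratio. The one-tenth figure below is an illustration, not a measured number, but it shows why the shape of the strategy wins:

```typescript
// Assumed, illustrative cost model: full render = 1.0 unit, think frame = 0.1.
const FULL_RENDER_COST = 1.0;
const THINK_FRAME_COST = 0.1; // assumption: a cheap low-step sketch

// Strategy A: brute force. Three full-quality candidates, pick the best.
const bruteForce = 3 * FULL_RENDER_COST;

// Strategy B: think frames. Three cheap probes, then one full render
// of the winning direction.
const thinkFrames = 3 * THINK_FRAME_COST + FULL_RENDER_COST;
```

Under those assumptions the exploratory strategy inspects the same three directions at well under half the cost, and the gap widens as the probe count grows.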
Why I didn’t just make the sketches prettier
I had to resist the temptation to optimize the wrong thing. A think frame is not supposed to be a nice preview. It is supposed to be a diagnostic artifact. If it becomes too polished, it starts hiding the very mistakes I want to catch early.
That’s why the module varies strength as part of the exploration. I’m not only changing the prompt; I’m also changing how hard the image-to-image step clings to the source. That gives me a cheap way to probe the tradeoff between preservation and creativity before I commit to the final pass.
The benefit is that I can see which path preserves identity, which one keeps composition stable, and which one drifts too far. The downside is that exploratory frames are intentionally rough, so they are not meant for human review as finished artifacts. They are for the machine that has to decide where to spend next.
The part that made the whole system feel sane
What I appreciate most is that think frames made the pipeline less superstitious. Before, the generator had to guess and the budget had to trust it. Now I have a cheap cohort, a real scorer, and a selection step that chooses the best path from a small set of interpretable alternatives.
That’s a better deal than hoping the first expensive pass gets lucky. I’m no longer asking the model to be right on the first expensive try. I’m asking it to show me its working notes first, then I spend the real budget on the note that actually makes sense.
And that, more than anything, is why think frames earned their place: they turn video generation from a single throw of the dice into a short conversation before the bill arrives.
