The first time I watched a clean-looking frame fail the chain, the problem was not blur, bad framing, or a weak prompt. The failure was more structural than that: the frame carried text, and that text was the thing that poisoned the next step. Once I stopped treating it like a visual annoyance and started treating it like contamination, the design became much easier to reason about.
That is why the text detector module exists. It does not just answer the question, "Is there text here?" It answers a better question: "How dirty is this frame, where is the dirt, and how much should the pipeline care?" That distinction matters because a subtitle strip at the bottom of the frame, a tiny watermark in the corner, and text embedded in the scene itself are not the same failure mode. They should not receive the same response.
This module sits in front of the handoff into the next scene. It acts like an airlock between perception and propagation: if a frame carries semantic residue, I want to know before the system promotes it into reference material for the next generation step.
The shape of this pipeline is the whole point. The detector is isolated from the rest of the visual scoring system. It does not try to be composition analysis, and it does not pretend OCR is a style model. It takes region boxes, classifies them spatially, computes a weighted contamination score, and hands the rest of the system a signal that can influence routing and recovery.
Why text became its own signal
The naive version of this problem is to detect text, return a boolean, and move on. I tried that framing early, and it was too blunt to be useful. A subtitle bar across the lower third and a watermark in the corner are both text, but they do not deserve the same treatment. One is usually a hard sign that the frame contains unwanted overlay material. The other is smaller, but still meaningful, because it can carry branding or UI residue that should not bleed forward into the chain.
So I split the detector into zones:
- subtitle
- watermark
- scene-content
That is the real insight in the module. Text is not only an object; it is a location-dependent failure mode. The same glyphs mean something different when they sit in the bottom 20% of the frame than when they sit in a corner, and both mean something different again when they appear inside the actual scene.
That is also why the detector returns a structured result instead of a binary alarm. The pipeline needs to know more than whether text exists. It needs to know the region count, the breakdown by zone, the classified boxes, and the final normalized score that feeds router feedback.
The actual contract of the detector
The implementation centers on a small set of types and helpers. I kept the surface area intentionally small so the signal stays legible: normalize the OCR output, classify each region, then score the result.
```typescript
export interface TextRegion {
  x: number;
  y: number;
  w: number;
  h: number;
  label: string;
}

export type TextZone = 'subtitle' | 'watermark' | 'scene-content';

export interface ClassifiedTextRegion extends TextRegion {
  zone: TextZone;
}

export interface TextDetectionResult {
  score: number;
  regionCount: number;
  subtitleCount: number;
  watermarkCount: number;
  sceneContentCount: number;
  regions: ClassifiedTextRegion[];
}

// Spatial thresholds, expressed as fractions of frame dimensions.
const SUBTITLE_ZONE_TOP = 0.80;        // subtitle zone: bottom 20% of the frame
const WATERMARK_MARGIN = 0.15;         // corner window: 15% margin from each edge
const WATERMARK_MAX_AREA_RATIO = 0.02; // watermark boxes must stay under 2% of frame area

// Severity weights per zone.
const SUBTITLE_WEIGHT = 3.0;
const WATERMARK_WEIGHT = 1.5;
const SCENE_CONTENT_WEIGHT = 1.0;

// Weighted coverage at which the score saturates to 1.0.
const SATURATION_COVERAGE = 0.05;

function clamp01(value: number): number {
  return Math.max(0, Math.min(1, value));
}

function boxArea(region: TextRegion): number {
  return Math.max(0, region.w) * Math.max(0, region.h);
}

function zoneWeight(zone: TextZone): number {
  switch (zone) {
    case 'subtitle':
      return SUBTITLE_WEIGHT;
    case 'watermark':
      return WATERMARK_WEIGHT;
    case 'scene-content':
    default:
      return SCENE_CONTENT_WEIGHT;
  }
}

export function normalizeOcrBoxes(boxes: unknown[], labels?: string[]): TextRegion[] {
  if (boxes.length === 0) return [];
  const first = boxes[0];
  // Already-normalized { x, y, w, h } objects pass through with label defaults.
  if (typeof first === 'object' && first !== null && 'x' in first && 'w' in first) {
    return boxes.map((box) => {
      const b = box as { x: number; y: number; w: number; h: number; label?: string };
      return { x: b.x, y: b.y, w: b.w, h: b.h, label: b.label ?? '' };
    });
  }
  // Florence-2 quads: [x1, y1, x2, y2, x3, y3, x4, y4] -> axis-aligned box
  // via a min/max pass over all four corners.
  if (Array.isArray(first) && first.length >= 8) {
    return boxes.map((quad, index) => {
      const q = quad as number[];
      const xs = [q[0], q[2], q[4], q[6]];
      const ys = [q[1], q[3], q[5], q[7]];
      const minX = Math.min(...xs);
      const minY = Math.min(...ys);
      const maxX = Math.max(...xs);
      const maxY = Math.max(...ys);
      return {
        x: minX,
        y: minY,
        w: maxX - minX,
        h: maxY - minY,
        label: labels?.[index] ?? '',
      };
    });
  }
  return [];
}

export function isSubtitleZone(region: TextRegion, frameHeight: number): boolean {
  const centerY = (region.y + region.h / 2) / frameHeight;
  return centerY >= SUBTITLE_ZONE_TOP;
}

export function isWatermarkZone(region: TextRegion, frameWidth: number, frameHeight: number): boolean {
  const centerX = (region.x + region.w / 2) / frameWidth;
  const centerY = (region.y + region.h / 2) / frameHeight;
  const areaRatio = boxArea(region) / (frameWidth * frameHeight);
  if (areaRatio > WATERMARK_MAX_AREA_RATIO) {
    return false;
  }
  const inCorner =
    (centerX <= WATERMARK_MARGIN || centerX >= 1 - WATERMARK_MARGIN) &&
    (centerY <= WATERMARK_MARGIN || centerY >= 1 - WATERMARK_MARGIN);
  return inCorner;
}

export function classifyTextRegions(
  regions: TextRegion[],
  frameWidth: number,
  frameHeight: number,
): ClassifiedTextRegion[] {
  return regions.map((region) => {
    let zone: TextZone;
    // Watermark is checked first so a corner overlay is not swept into
    // the subtitle bucket just because it sits low in the frame.
    if (isWatermarkZone(region, frameWidth, frameHeight)) {
      zone = 'watermark';
    } else if (isSubtitleZone(region, frameHeight)) {
      zone = 'subtitle';
    } else {
      zone = 'scene-content';
    }
    return { ...region, zone };
  });
}

export function computeTextPresenceScore(
  regions: TextRegion[],
  frameWidth: number,
  frameHeight: number,
): TextDetectionResult {
  if (regions.length === 0) {
    return {
      score: 0,
      regionCount: 0,
      subtitleCount: 0,
      watermarkCount: 0,
      sceneContentCount: 0,
      regions: [],
    };
  }
  const classified = classifyTextRegions(regions, frameWidth, frameHeight);
  const frameArea = frameWidth * frameHeight;
  let weightedCoverage = 0;
  let subtitleCount = 0;
  let watermarkCount = 0;
  let sceneContentCount = 0;
  for (const region of classified) {
    const areaRatio = boxArea(region) / frameArea;
    weightedCoverage += areaRatio * zoneWeight(region.zone);
    switch (region.zone) {
      case 'subtitle':
        subtitleCount += 1;
        break;
      case 'watermark':
        watermarkCount += 1;
        break;
      case 'scene-content':
        sceneContentCount += 1;
        break;
    }
  }
  // Normalize: weighted coverage at or above SATURATION_COVERAGE clamps to 1.0.
  const score = clamp01(weightedCoverage / SATURATION_COVERAGE);
  return {
    score,
    regionCount: classified.length,
    subtitleCount,
    watermarkCount,
    sceneContentCount,
    regions: classified,
  };
}
```
That is the real shape of the detector: not just zone classification, but zone classification plus weighted coverage plus a normalized score. The score matters because it is what the rest of the pipeline can actually use. A clean frame produces a low score. A frame with subtitle contamination pushes harder. A corner watermark contributes less than subtitles, but it still moves the needle. Scene-content text is tracked too, because even when it is legitimate on-frame text, it should not disappear into a blind spot.
Normalization is where the input drift disappears
The biggest implementation trap in OCR pipelines is format drift. Different detectors expose region data in different shapes, and if that drift leaks into the pipeline, every downstream consumer becomes brittle.
normalizeOcrBoxes is the first guardrail against that problem. I built it to handle two formats:
- already-normalized { x, y, w, h, label } objects
- raw Florence-2 quad boxes with eight floats: [x1, y1, x2, y2, x3, y3, x4, y4]
Those are not interchangeable formats, and it is important not to pretend they are. The normalized path is straightforward because the region already has width and height. The Florence-2 path requires a min/max pass across all four corners to build an axis-aligned box.
That min/max conversion matters. Using a partial coordinate pair or deriving height from the wrong point pair gives you a box that does not actually cover the text. Once that happens, both zone classification and area scoring become noisy.
A correct normalization pipeline gives the rest of the system a stable contract. I do not want the router to care whether the OCR came from fal.ai’s normalized region output or a raw Florence-2 quad. The detector absorbs that difference once, then everything downstream sees the same shape.
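To make that contract concrete, here is a small standalone sketch, simplified from the listing above, showing how a quad input collapses into the same axis-aligned box shape that the normalized path produces. The sample coordinates are illustrative.

```typescript
interface Box { x: number; y: number; w: number; h: number }

// Already-normalized path: the region arrives with width and height.
function fromObject(b: { x: number; y: number; w: number; h: number }): Box {
  return { x: b.x, y: b.y, w: b.w, h: b.h };
}

// Florence-2 quad path: eight floats, one (x, y) pair per corner.
// The min/max pass over all four corners builds an axis-aligned box
// that stays correct even when the quad is slightly rotated.
function fromQuad(q: number[]): Box {
  const xs = [q[0], q[2], q[4], q[6]];
  const ys = [q[1], q[3], q[5], q[7]];
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return { x: minX, y: minY, w: Math.max(...xs) - minX, h: Math.max(...ys) - minY };
}

// A slightly rotated quad around a text strip:
const quad = [100, 210, 400, 200, 405, 240, 105, 250];
const box = fromQuad(quad);
// box spans x 100..405 and y 200..250 — the full extent of the text.
// Deriving w/h from a single corner pair would undercount that extent.
```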
The zone rules are intentionally simple, but not loose
The zone logic is not fancy, and I like it that way. Simplicity makes it easier to keep the signal honest. But simple does not mean vague. The thresholds are specific:
- subtitle: text center in the bottom 20% of the frame
- watermark: text center in a corner with a 15% margin, and the box area must be under 2% of frame area
- scene-content: everything else
That area cap on watermarks is important. It prevents large corner overlays from being mislabeled as watermark material. If a region is too large, I do not want it to get the softer watermark treatment just because it is near an edge. The corner check and the size guard work together.
The subtitle rule is equally specific. It uses the region center, not the top edge or the bounding box intersection. That keeps the detector from overreacting to text that grazes the lower part of the frame without actually behaving like a subtitle strip.
The classification order also matters. Watermark gets checked first, then subtitle, then scene-content. A corner overlay should not be swept into the subtitle bucket just because it happens to live low in the frame. That ordering preserves the meaning of the zones.
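A standalone sketch of those rules, mirroring the thresholds from the listing above with illustrative sample boxes, makes the interaction visible: a small corner box low in the frame lands in the watermark bucket because that check runs first, while an oversized corner overlay fails the area cap and falls through to the harsher subtitle label.

```typescript
// Standalone restatement of the zone rules and their check order.
const SUBTITLE_ZONE_TOP = 0.80;
const WATERMARK_MARGIN = 0.15;
const WATERMARK_MAX_AREA_RATIO = 0.02;

interface R { x: number; y: number; w: number; h: number }

function classify(r: R, fw: number, fh: number): string {
  const cx = (r.x + r.w / 2) / fw;
  const cy = (r.y + r.h / 2) / fh;
  const smallEnough = (r.w * r.h) / (fw * fh) <= WATERMARK_MAX_AREA_RATIO;
  const inCorner =
    (cx <= WATERMARK_MARGIN || cx >= 1 - WATERMARK_MARGIN) &&
    (cy <= WATERMARK_MARGIN || cy >= 1 - WATERMARK_MARGIN);
  if (inCorner && smallEnough) return 'watermark';   // checked first
  if (cy >= SUBTITLE_ZONE_TOP) return 'subtitle';    // then subtitle
  return 'scene-content';
}

const fw = 1920, fh = 1080;

// Small logo in the bottom-right corner: watermark, not subtitle,
// even though its center sits in the bottom 20% of the frame.
const logo = classify({ x: 1800, y: 1000, w: 80, h: 40 }, fw, fh);

// Oversized corner overlay (~2.9% of frame area): fails the 2% cap,
// so it falls through to the subtitle rule instead of being softened.
const overlay = classify({ x: 1400, y: 950, w: 500, h: 120 }, fw, fh);
```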
The score is not decoration either
The score is the part that turns classification into policy.
The detector weights subtitle text most heavily, watermark text next, and scene-content text least. That mirrors the actual failure hierarchy. Subtitle-like text is almost always unwanted overlay. Watermarks are smaller but still harmful because they usually indicate branding or generated residue. Scene-content text is not necessarily wrong, but it still counts because it is part of the image’s semantic load.
The weighted score is based on area coverage rather than just region count. That is a crucial detail. Two tiny boxes should not have the same effect as a full-width subtitle strip, even if they are both in the same zone. By multiplying area ratio by zone weight and then normalizing the aggregate into the [0, 1] range, the detector gives the router a score that behaves like a real contamination measure instead of a checklist.
The saturation threshold is also deliberate. Once weighted coverage reaches 5% of frame area, the score clamps to 1.0. That tells the rest of the pipeline, clearly and early, that the frame is dirty enough to treat as a hard signal rather than a weak hint.
Here is a concrete example. Imagine three regions in a 1920 by 1080 frame:
- one subtitle region covering 1% of the frame
- one watermark region covering 0.5% of the frame
- one scene-content region covering 0.5% of the frame
The weighted coverage is:
- subtitle: 0.01 × 3.0 = 0.03
- watermark: 0.005 × 1.5 = 0.0075
- scene-content: 0.005 × 1.0 = 0.005
Total weighted coverage = 0.0425. Divide by 0.05 and the score lands at 0.85. That is the right shape for the signal. The frame is not just slightly noisy. It is contaminated enough that I want the pipeline to think twice before reusing it as a reference.
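That arithmetic can be checked directly against the constants from the listing:

```typescript
// Reproducing the worked example with the detector's weight constants.
const SUBTITLE_WEIGHT = 3.0;
const WATERMARK_WEIGHT = 1.5;
const SCENE_CONTENT_WEIGHT = 1.0;
const SATURATION_COVERAGE = 0.05;

const weightedCoverage =
  0.01 * SUBTITLE_WEIGHT +       // subtitle strip: 1% of frame area
  0.005 * WATERMARK_WEIGHT +     // corner watermark: 0.5%
  0.005 * SCENE_CONTENT_WEIGHT;  // in-scene text: 0.5%

// clamp01(weightedCoverage / SATURATION_COVERAGE)
const score = Math.max(0, Math.min(1, weightedCoverage / SATURATION_COVERAGE));
// score ≈ 0.85: well past "slightly noisy", just short of hard saturation
```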
Why scene-content still matters
It is tempting to ignore scene-content text because it is the least suspicious category, but that would be a mistake. Sometimes text really belongs in the scene: signage, labels on products, interface elements in a diegetic screen, or text that is intentionally visible in the shot. I still track it because the presence of text tells me something about the structure of the frame, even if it does not immediately trigger the same response as a subtitle strip.
That distinction helps the router stay nuanced. If a frame has only scene-content text, the score can remain lower than a frame with overlay-like text, but the signal still exists. That makes the detector useful both for gating and for analysis.
The important thing is that I am not collapsing everything into a binary failure. The detector keeps enough structure to let the pipeline behave differently for different types of text contamination.
How this feeds the rest of the system
This detector is not a dead-end report. It is part of the feedback loop.
The surrounding generation path already works with a multi-signal strategy. Candidate selection is not based on a single metric: the reward mixer evaluates multiple signals, the progressive pipeline retries weak generations, and the feedback layer turns outcomes into calibration samples. The text detector plugs into that system as another source of evidence.
That matters because text contamination is one of the easiest ways for a scene chain to lie to itself. If a contaminated frame is allowed to pass forward silently, the next step may treat the contamination as visual truth. That is exactly the kind of failure I wanted to stop at the boundary.
The detector gives the router three things at once:
- a normalized score
- a breakdown of what kind of text was found
- the classified regions themselves for inspection and debugging
That combination lets automation make a decision while still leaving a paper trail for me when I need to inspect a bad run.
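As an illustration of how a consumer might act on that triple, here is a hypothetical routing sketch. The thresholds, action names, and the routeFrame function are mine for illustration; they are not part of the module or the actual router.

```typescript
// Hypothetical consumer of the detector's output. Thresholds and
// action names here are illustrative, not the real router policy.
interface DetectionSummary {
  score: number;
  subtitleCount: number;
  watermarkCount: number;
  sceneContentCount: number;
}

type FrameAction = 'promote' | 'flag-for-review' | 'reject';

function routeFrame(result: DetectionSummary): FrameAction {
  // Near-saturated score: treat as a hard signal, never promote.
  if (result.score >= 0.8) return 'reject';
  // Overlay-like text is suspicious even at moderate coverage.
  if (result.score >= 0.3 && (result.subtitleCount > 0 || result.watermarkCount > 0)) {
    return 'flag-for-review';
  }
  // Low-score frames, including ones with only scene-content text, pass.
  return 'promote';
}

const dirty = routeFrame({ score: 0.85, subtitleCount: 1, watermarkCount: 0, sceneContentCount: 0 });
const clean = routeFrame({ score: 0.1, subtitleCount: 0, watermarkCount: 0, sceneContentCount: 1 });
```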
Debugging the failure modes
The edge cases here are what made the detector worth building carefully.
A box in the bottom part of the frame is not automatically a subtitle. It has to cross the bottom-20% threshold by center point. That prevents accidental overreach.
A tiny logo in the corner is not a watermark unless it is actually inside the corner window and under the area cap. That prevents large corner overlays from slipping through with the wrong label.
A wide text strip that is slightly above the subtitle boundary is not necessarily a subtitle, even if a human might casually call it one. I prefer the detector to stay consistent with the spatial rule rather than drift into subjective labeling.
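The center-point rule can be seen in isolation: a strip whose bottom dips into the subtitle zone still does not count unless its center crosses the 80% line. The sample geometry here is illustrative.

```typescript
// The subtitle check uses the region's vertical center, not its top edge.
const SUBTITLE_ZONE_TOP = 0.80;

function isSubtitle(y: number, h: number, frameHeight: number): boolean {
  return (y + h / 2) / frameHeight >= SUBTITLE_ZONE_TOP;
}

const frameHeight = 1080; // the 80% line sits at y = 864

// Strip grazing the zone: top at y=820, height 60 → center at 850 (≈0.787)
const grazing = isSubtitle(820, 60, frameHeight);  // false: center above the line

// Subtitle-like strip: top at y=950, height 80 → center at 990 (≈0.917)
const subtitle = isSubtitle(950, 80, frameHeight); // true
```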
And when the OCR input shape changes, normalization keeps the downstream score from collapsing. That is the part that tends to be invisible when it works and catastrophic when it fails. Once the detector owns that translation, the rest of the system does not have to care where the boxes came from.
Why the module stays pure
I keep the text detector pure on purpose. It does not call the API, it does not reach into storage, and it does not make routing decisions directly. It only transforms OCR regions into a structured judgment.
That separation buys me two things.
First, it is easy to test. I can throw synthetic regions at it and assert exactly how they land in each zone, what the score should be, and how many regions should be counted in each bucket.
Second, it keeps the rest of the pipeline honest. The contamination rule lives in one place. If I change frame extraction later, or swap OCR providers, or adjust the calibration loop, the detector still has one job: tell me whether the frame contains text, where it sits, and how badly it should count against the scene.
That kind of separation is what makes the pipeline maintainable. It means I can tune policy without rewriting geometry.
Closing
Text in a frame is not decoration. It is contamination, and once I started treating it that way, a class of failures that used to slip through generation undetected became catchable at the boundary. The detector does not guess. It measures coverage, classifies by zone, weights by severity, and returns a number the router can act on without asking permission. Every frame either earns its way forward or gets cut. The pipeline stopped being fragile the moment I stopped being polite about what contamination looks like.
