When raw capture data lies
The first time I compared a crop from a different source resolution, the box looked right until I did the math. The coordinates were being treated as if every image had the same pixel grid, and that assumption quietly poisoned everything downstream. I fixed the handoff by making the boundary do the dangerous work once, up front, so the web side only ever reloads geometry that has already been normalized, scaled, and checked.
That design choice matters because the app is not just moving photos around. It is moving segments, bounding boxes, and measurements from capture into persistence, then into a viewer that needs to behave as if the stored geometry is authoritative. If the source and target resolutions differ, or if a crop is invalid, I want that failure to stop at the boundary rather than show up later as a bad reconstruction, a broken overlay, or a measurement that looks precise and is wrong.
The core idea: make the boundary explicit
The iOS side produces geometry, the API layer normalizes it, and the web viewer consumes the persisted result without reinterpreting the original capture conditions. The viewer does not remember where a box came from. It trusts that the box already survived scaling, crop validation, and the other little lies that raw image coordinates like to tell.
The project already has the pieces for that boundary. There is a depth route that accepts an image and an optional mask, calculation modules for scale, depth, orientation, and multi-reference validation, and persistence routes for segmentations and measurements. The engineering shape is consistent: capture produces inputs, calculation code turns them into geometry with explicit assumptions, and API routes store the result for later reload. That lines up with the way Next.js route handlers are meant to act as explicit request/response boundaries.
The diagram is small on purpose. The boundary should feel boring once it is in place: a straight line from capture to normalization to persistence to reload, with the schema sitting beside it like a contract.
What gets normalized before anything is saved
The calculation layer gives away the shape of the system. AutoScaleInput carries detected segments, an optional base64 PNG depth map, and the image dimensions. Each SegmentData includes a prompt, a bounding box, a confidence value, and an optional mask. Those fields are not decorative; they are the raw material for making a measurement that survives being stored and reopened later.
export interface AutoScaleInput {
  /** Segments detected by SAM3 */
  segments: SegmentData[];
  /** Base64 PNG depth map from SAM3D (optional) */
  depthMapBase64?: string;
  /** Image dimensions */
  imageWidth: number;
  imageHeight: number;
}

export interface SegmentData {
  /** Prompt used for detection (e.g., 'door', 'window') */
  prompt: string;
  /** Bounding box of detected object */
  bbox: BoundingBox;
  /** Detection confidence (0-1) */
  confidence: number;
  /** Base64 mask data (optional) */
  mask_base64?: string;
}
The non-obvious part is that image dimensions are part of the input shape, not a hidden assumption. A bounding box only means something relative to the image it came from. Once the source and target resolutions differ, a box that was valid in one grid is just a rectangle with a memory problem. By carrying dimensions alongside the geometry, the normalization step has enough information to rescale instead of guess.
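To make that concrete, here is a minimal sketch of what a rescale step at the boundary could look like. This is illustrative, not the project's code, and the BoundingBox field names (x, y, width, height in pixels) are an assumption:

```typescript
// Hypothetical helper: rescale a bounding box from the pixel grid it was
// detected on to the grid it will be stored against. Without the source
// dimensions, this computation is impossible and the box silently lies.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

function rescaleBBox(
  bbox: BoundingBox,
  srcWidth: number,
  srcHeight: number,
  dstWidth: number,
  dstHeight: number
): BoundingBox {
  const sx = dstWidth / srcWidth;
  const sy = dstHeight / srcHeight;
  return {
    x: bbox.x * sx,
    y: bbox.y * sy,
    width: bbox.width * sx,
    height: bbox.height * sy,
  };
}
```

Because the scale factors come from dimensions carried with the data, the step is deterministic: no guessing, no heuristics about what resolution the capture "probably" used.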
Scaling is not a detail; it is the boundary
Scaling is the thing that decides whether a segment can survive the trip from capture to persistence without changing meaning. The code makes that explicit in the simplest possible way: pixel distances become scale, scale becomes inches, and inches become the unit the rest of the system can reason about.
import type { Point } from '@/training/types';

/**
 * Calculate the distance between two points in pixels
 */
export function calculatePixelDistance(start: Point, end: Point): number {
  const dx = end.x - start.x;
  const dy = end.y - start.y;
  return Math.sqrt(dx * dx + dy * dy);
}

/**
 * Calculate pixels per inch from a known reference measurement
 */
export function calculateScale(
  pixelLength: number,
  knownInches: number
): number {
  if (knownInches <= 0) return 0;
  return pixelLength / knownInches;
}

/**
 * Convert pixels to inches using the calculated scale
 */
export function pixelsToInches(pixels: number, pxPerInch: number): number {
  if (pxPerInch <= 0) return 0;
  return pixels / pxPerInch;
}
There is no magical calibration object hiding in the background; there is just a distance, a known size, and a ratio. The limitation is equally honest: if the known measurement is bad, the scale is bad, and every inch derived from it inherits that mistake. The validation step has to sit next to scaling instead of after the fact.
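Chained together, the flow is calibrate once, then convert everything else. A standalone sketch of that flow (the helper is reimplemented inline so the example runs on its own; the 80-inch door height is the standard US interior door, used here as the known reference, and the point coordinates are made up):

```typescript
// Standalone sketch of the calibrate-then-convert flow described above.
type Point = { x: number; y: number };

const calculatePixelDistance = (a: Point, b: Point): number =>
  Math.hypot(b.x - a.x, b.y - a.y);

// A standard US interior door is 80 inches tall; its pixel height in the
// image yields pixels-per-inch for the whole frame.
const doorTop: Point = { x: 400, y: 100 };
const doorBottom: Point = { x: 400, y: 900 };

const pxPerInch = calculatePixelDistance(doorTop, doorBottom) / 80; // 800 px / 80 in = 10
const windowWidthInches = 350 / pxPerInch; // a 350 px span measures 35 inches
```

Every converted measurement is a quotient of the same calibration ratio, which is exactly why a bad reference poisons everything at once.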
Why bad crop geometry has to fail early
The geometry layer includes a separate check for complex surfaces. GeometryAnalysis reports whether the surface appears flat, how many depth peaks were detected, the complexity class, and a confidence factor for calibration. This information belongs near the boundary, because a crop that spans multiple planes can look plausible while being useless for precise measurement.
export interface GeometryAnalysis {
  /** Whether surface appears flat (single depth plane) */
  isFlatSurface: boolean;
  /** Number of detected depth peaks */
  peakCount: number;
  /** Complexity classification */
  complexity: 'flat' | 'angled' | 'multi-plane';
  /** Detected peak depths (normalized 0-1) */
  peaks: Peak[];
  /** Warning message for user */
  warning?: string;
  /** Confidence factor for calibration (0-1) */
  confidenceFactor: number;
}
The important detail is the confidenceFactor. Later stages should not pretend a bay window is the same kind of measurement surface as a flat wall. If the depth histogram says the region is multi-plane, the boundary should say so plainly instead of letting a downstream viewer infer a clean rectangle from a messy scene. That honesty is what keeps bad crop geometry from poisoning later processing.
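A gate at the boundary might read only two of those fields. This is a sketch under stated assumptions: the 0.5 confidence threshold and the return shape are illustrative, not values from the project:

```typescript
// Sketch of boundary gating on GeometryAnalysis. Field names come from
// the interface above; the threshold and return shape are assumptions.
interface GeometryGate {
  complexity: 'flat' | 'angled' | 'multi-plane';
  confidenceFactor: number;
}

function acceptForCalibration(g: GeometryGate): { ok: boolean; reason?: string } {
  if (g.complexity === 'multi-plane') {
    // A region spanning multiple depth planes cannot anchor a scale.
    return { ok: false, reason: 'Region spans multiple depth planes' };
  }
  if (g.confidenceFactor < 0.5) {
    return { ok: false, reason: 'Calibration confidence too low' };
  }
  return { ok: true };
}
```

The point is where the check runs: before persistence, so a rejected crop never becomes a stored fact that the viewer has to second-guess.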
How the API layer accepts the data
The depth route shows the shape of the server boundary very clearly. It accepts a base64 image and an optional mask, then returns a depth map, image dimensions, min and max depth, source metadata, and processing time. The API is not just a file drop; it is an explicit transformation boundary.
import { NextRequest, NextResponse } from 'next/server';

// Configuration - Always-on RunPod unified endpoint
const RUNPOD_ENDPOINT_URL = process.env.RUNPOD_ENDPOINT_URL; // https://xxx-xxxx.proxy.runpod.net

interface DepthRequest {
  image: string; // base64
  mask?: string; // base64, optional - if provided, depth only for masked region
}

interface DepthResponse {
  success: boolean;
  depthBase64: string;
  imageWidth: number;
  imageHeight: number;
  minDepth: number;
  maxDepth: number;
  depthSource?: string;
  depthSourceMode?: string;
  processingTime: number;
  provider: 'runpod';
}
The clarity comes from naming the response fields. A viewer that receives imageWidth, imageHeight, and depth metadata does not need to reverse-engineer the capture conditions. It can render from stored facts instead of reconstructing intent. That is a much safer contract than handing the web app a blob and asking it to be clever.
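The request side of the boundary deserves the same explicitness: reject malformed input before it reaches the model endpoint. This is a hedged sketch, not the route's actual validation; the specific checks and error strings are assumptions:

```typescript
// Sketch of request validation at the depth route boundary: refuse
// malformed base64 up front rather than forwarding it to the endpoint.
interface DepthRequest {
  image: string;  // base64
  mask?: string;  // base64, optional
}

function validateDepthRequest(body: Partial<DepthRequest>): string | null {
  if (typeof body.image !== 'string' || body.image.length === 0) {
    return 'image is required and must be base64';
  }
  const b64 = /^[A-Za-z0-9+/]+={0,2}$/;
  if (!b64.test(body.image)) return 'image is not valid base64';
  if (body.mask !== undefined && !b64.test(body.mask)) {
    return 'mask is not valid base64';
  }
  return null; // null means the request may cross the boundary
}
```

In a Next.js route handler this would run first, returning a 400 before any expensive depth inference happens.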
Persistence is only safe after the boundary has done its job
The persistence routes exist to store the normalized result, not the raw uncertainty. Once the geometry is saved, the viewer can reload it without re-running the same assumptions.
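One way to encode that guarantee is in the persisted shape itself: store the grid the geometry is valid in right next to the geometry. This record shape and invariant check are hypothetical, sketched from the fields the earlier interfaces carry:

```typescript
// Hypothetical persisted record: geometry plus the pixel grid it was
// normalized against, so reload never has to guess capture resolution.
interface PersistedSegment {
  prompt: string;
  bbox: { x: number; y: number; width: number; height: number };
  imageWidth: number;  // grid the bbox was normalized against
  imageHeight: number;
  pxPerInch: number;   // calibration result, already validated
}

// Invariant check before write: the bbox must fit the stored grid and
// the calibration must be usable. A record failing this never persists.
function isReloadable(s: PersistedSegment): boolean {
  const { bbox } = s;
  return (
    s.pxPerInch > 0 &&
    bbox.x >= 0 && bbox.y >= 0 &&
    bbox.x + bbox.width <= s.imageWidth &&
    bbox.y + bbox.height <= s.imageHeight
  );
}
```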
Multi-reference validation keeps the viewer honest
The system compares references instead of blindly trusting one. The multi-reference check is especially useful when both a door and a window are detected. It calculates scale from each, compares them, and warns when the ratio falls outside the agreed range.
/**
 * Multi-Reference Cross-Validation for Auto-Calibration
 *
 * When both a door AND a window are detected, calculates scale from each
 * and validates that they agree. Disagreement indicates potential issues
 * like camera angle, lens distortion, or misdetection.
 *
 * Expected behavior:
 * - If only one reference: use it directly
 * - If both references: compare scales, warn if ratio outside 0.85-1.15
 * - Prefer door (larger, more reliable) as primary reference
 */
import type { BoundingBox } from '@/training/types';

export interface ReferenceObject {
  type: 'door' | 'window' | 'garage_door';
  bbox: BoundingBox;
  confidence: number;
  knownDimensionInches: number;
  isVertical: boolean;
}
The design does not try to make every reference equally trustworthy. Doors are preferred as the primary reference because they are larger and more reliable. The code treats field data the same way: some geometry is sturdy, some is decorative, and the distinction has to be known before anything gets written down.
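The cross-check itself can be sketched directly from the behavior the module's comment promises: compare the two scales, warn outside the 0.85-1.15 band, prefer the door. The input shape here is an assumption; only the ratio band and the door preference come from the source:

```typescript
// Sketch of the cross-validation described in the comment block above:
// compute scale from each reference, compare, warn when they disagree.
interface ScaledReference {
  type: 'door' | 'window' | 'garage_door';
  pxPerInch: number;
}

function crossValidate(
  door: ScaledReference,
  win: ScaledReference
): { pxPerInch: number; warning?: string } {
  const ratio = door.pxPerInch / win.pxPerInch;
  // Prefer the door as primary, per the module's stated behavior.
  if (ratio < 0.85 || ratio > 1.15) {
    return {
      pxPerInch: door.pxPerInch,
      warning: `References disagree (ratio ${ratio.toFixed(2)})`,
    };
  }
  return { pxPerInch: door.pxPerInch };
}
```

Note that disagreement does not discard the measurement; it attaches a warning, so the viewer can show uncertainty instead of silently rendering a number that two independent references could not agree on.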
The handoff from capture to viewer reload
The boundary is explicit because hidden geometry rules always come back to bite you later. Every dangerous assumption is made visible before persistence: image dimensions are carried with the segment data, depth analysis reports flatness and variance, orientation correction uses device sensors, multi-reference validation compares independent references, and crop geometry gets checked instead of assumed.
The persisted record is not a raw transcript of a camera frame. It is a cleaned, bounded, and explicit description of what the frame meant. When the viewer reloads that record, it is rendering a decision that already survived the boundary. A bad crop no longer sneaks through as a confident rectangle, and a resolution mismatch no longer turns into a quiet measurement bug. Once the handoff is explicit, the rest of the system gets to be boring — and in measurement software, boring is a compliment.
