The first bug in the live coaching system was not transcription quality. It was startup order.
I originally let audio frames flow as soon as the websocket connected, which meant the first chunks could arrive before the GPT-4o Realtime session had actually finished coming up. On paper that sounds harmless. In the console, it showed up as a maddeningly specific failure: the beginning of the utterance would go missing, or the transcript would start half a beat late, and the coaching prompt that came back felt detached from the sentence that triggered it. The system was technically alive, but it was not yet ready to hear.
That failure forced me to stop thinking about the voice path as a feature and start treating it as a boundary problem. Once the call enters the backend, every frame has to keep its shape, its order, and its timing. If the boundary is sloppy, the downstream model may still produce text, but the experience loses the only thing that matters in a live coaching loop: relevance at the moment of speech.
The boundary matters more than the model
The core flow in the backend is straightforward, but each step has to stay disciplined:
- ACS delivers media frames into the backend over websocket.
- The server waits for the realtime session handshake to complete.
- Audio is buffered, resampled from 16kHz to 24kHz, and forwarded into input_audio_buffer.append.
- GPT-4o Realtime returns partial transcripts and coaching signals.
- The backend streams those results to the frontend through SignalR.
- Transcripts are buffered to Redis so the session can be persisted and replayed.
That is the real shape of the system in backend/app/services/media_bridge.py and backend/app/main.py. The important thing is not that there are many parts. It is that the parts have different jobs and different clocks. Audio ingress, model ingestion, transcript delivery, and persistence cannot all be treated as the same path. If they are, the slowest branch steals time from the user.
I like to think of the resampler as the turnstile between two clocks: one clock is the live call, the other is the model input stream. The turnstile does not make the crowd smaller. It makes sure people pass through in the right cadence.
That flow is the architecture I kept returning to while debugging the live path. It is deliberately boring. Boring is good here. A voice stream that behaves predictably is worth more than one that tries to be clever.
The first thing I fixed: session readiness
The earliest failure taught me more than any benchmark could have. Audio was arriving before the session was ready.
Once I saw that pattern, the fix was obvious: session.updated became the gate. No audio got appended to the model until the realtime session had acknowledged its configuration. That one change removed the most annoying class of startup bugs, because it separated transport readiness from model readiness. Before that, the code was implicitly assuming that a socket being open meant the whole pipeline was ready. It does not. An open socket is just a pipe. The session state is the contract.
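The gating idea can be sketched with a plain asyncio event. This is an illustrative shape, not the real MediaBridge API: the class and method names here are assumptions, and the real bridge tracks far more state.

```python
import asyncio
from collections.abc import Awaitable, Callable


class SessionGate:
    """Hold audio until the realtime session acknowledges its config.

    Minimal sketch: frames that arrive before session.updated are
    buffered rather than dropped, so the start of the utterance
    survives the handshake.
    """

    def __init__(self) -> None:
        self._ready = asyncio.Event()
        self._pending: list[bytes] = []

    def on_session_updated(self) -> None:
        # Fired when the session.updated event arrives from the model.
        self._ready.set()

    async def forward(
        self, frame: bytes, send: Callable[[bytes], Awaitable[None]]
    ) -> None:
        # An open socket is just a pipe; only the session state
        # authorizes appends.
        if not self._ready.is_set():
            self._pending.append(frame)
            return
        # Flush anything that beat the handshake, in order.
        while self._pending:
            await send(self._pending.pop(0))
        await send(frame)
```

The point of the sketch is the separation: transport readiness (the socket) and model readiness (the session event) are two different conditions, and only the second one opens the gate.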
This is also where the bridge state matters. In the backend, the MediaBridge owns the websocket, the resampler, the SignalR service, transcript buffering, and the session lifecycle. The docstring in media_bridge.py says exactly what it does: it bridges ACS audio streaming to GPT-4o Realtime, resamples audio, streams transcripts and insights via SignalR, and buffers transcripts to Redis for persistence. That is not a decorative abstraction. That is the object that keeps the live call and the model conversation from stepping on each other.
The other useful detail in the bridge is that it tracks session-level counters like sequence gaps and total frames. Those fields matter because live media is not a perfect stream. If frames arrive out of order or are dropped, the bridge needs to know before the transcript starts drifting. The console does not get better because the backend ignores the problem; it gets better when the backend names the problem.
How I made the resampling path deterministic
The resampling step is where the system stops being generic audio plumbing and becomes a contract.
In backend/app/main.py, FastResampler is initialized once at application startup so the filter coefficients are precomputed before any live traffic arrives. That matters because the conversion path is fixed: the incoming audio is 16kHz PCM, the model input is 24kHz PCM, and the conversion is always the same. There is no reason to rebuild the filter on every frame or every session.
The actual implementation is simple enough to explain directly. This is the shape of the conversion I use:
```python
from dataclasses import dataclass

import numpy as np
from scipy import signal


@dataclass
class FastResampler:
    source_rate: int = 16000
    target_rate: int = 24000

    def __post_init__(self) -> None:
        if self.source_rate != 16000 or self.target_rate != 24000:
            raise ValueError("This resampler is tuned for 16kHz -> 24kHz audio.")
        # 16kHz -> 24kHz is a fixed 3:2 ratio, so the polyphase filter
        # can be designed once, before any live traffic arrives.
        self.up = 3
        self.down = 2
        # Low-pass at the original Nyquist: 1/up of the upsampled
        # stream's Nyquist frequency.
        self.filter_coeffs = signal.firwin(
            numtaps=192,
            cutoff=1.0 / self.up,
            window="hamming",
        ).astype(np.float32)

    def resample(self, pcm_16k: bytes) -> bytes:
        if not pcm_16k:
            return b""
        # int16 PCM -> normalized float in [-1, 1].
        audio = np.frombuffer(pcm_16k, dtype=np.int16).astype(np.float32)
        audio = audio / 32768.0
        resampled = signal.resample_poly(
            audio,
            up=self.up,
            down=self.down,
            window=self.filter_coeffs,
        )
        # Re-scale before the int16 cast; casting normalized floats
        # directly would truncate nearly every sample to zero.
        resampled = np.clip(resampled * 32767.0, -32768.0, 32767.0).astype(np.int16)
        return resampled.tobytes()
```
That snippet captures the part that matters: convert to float, resample with a fixed filter, scale back to int16, and send the bytes onward in the exact shape the realtime socket expects. The mistake people make here is subtle. If you normalize to [-1, 1] and then cast straight back to int16, you do not get usable PCM. You get almost nothing. The signal has to be re-scaled before the cast.
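The scaling pitfall is easy to demonstrate in isolation with a few samples:

```python
import numpy as np

# Three int16 samples, normalized to [-1, 1] the way the resampler does.
pcm = np.array([12000, -8000, 300], dtype=np.int16)
normalized = pcm.astype(np.float32) / 32768.0

# Wrong: casting normalized floats straight back to int16 truncates
# every sample toward zero, leaving near-silence.
wrong = normalized.astype(np.int16)  # -> [0, 0, 0]

# Right: re-scale into the int16 range first. The round trip is exact
# to within one least-significant bit.
right = np.clip(normalized * 32767.0, -32768.0, 32767.0).astype(np.int16)
```

The "wrong" branch is what a pipeline that forgets the re-scale actually produces: audio that is technically valid PCM but inaudibly quiet.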
That is why I prefer deterministic signal processing over improvisation. The model does not need creativity from the resampler. It needs consistency.
Why the bridge is split across audio, transcript, and delivery paths
The fastest way to wreck a live coaching system is to make the audio path wait on the text path.
The backend avoids that by separating responsibilities. The audio side is responsible for transport, buffering, session readiness, resampling, and append operations into the realtime socket. The text side is responsible for partial transcript delivery, coaching output, and persistence through Redis and SignalR. Those two sides talk to each other, but neither side is allowed to own the whole call.
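One minimal way to sketch that isolation is two asyncio queues with independent consumers, so a slow text consumer costs the text path only. The names and shapes here are illustrative, not the repo's API.

```python
import asyncio


async def drain_audio(q: asyncio.Queue, sink: list[bytes]) -> None:
    # Transport side: keep pulling frames; never awaits the text path.
    while (frame := await q.get()) is not None:
        sink.append(frame)


async def drain_text(q: asyncio.Queue, sink: list[str]) -> None:
    # Delivery side (SignalR in the real system): a stall here does
    # not back up the media stream.
    while (item := await q.get()) is not None:
        await asyncio.sleep(0.01)  # simulate a slow delivery hop
        sink.append(item)


async def demo() -> tuple[list[bytes], list[str]]:
    audio_q: asyncio.Queue = asyncio.Queue()
    text_q: asyncio.Queue = asyncio.Queue()
    frames: list[bytes] = []
    texts: list[str] = []
    audio_task = asyncio.create_task(drain_audio(audio_q, frames))
    text_task = asyncio.create_task(drain_text(text_q, texts))
    for i in range(5):
        audio_q.put_nowait(b"frame-%d" % i)
    text_q.put_nowait("partial transcript")
    audio_q.put_nowait(None)  # sentinel: stream ended
    text_q.put_nowait(None)
    await asyncio.gather(audio_task, text_task)
    return frames, texts
```

All five frames land even though the text consumer sleeps on every item, which is the whole point of the split: neither side is allowed to own the other's clock.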
That split is why the docstring in media_bridge.py explicitly calls out streaming transcripts and insights via SignalR. SignalR is not just a UI convenience. It is the delivery layer for everything that should reach the console as soon as it is available. Partial transcripts keep the user oriented while the call is still in flight, and coaching insights can follow the same path without freezing the media stream.
The transcript buffer to Redis is the other piece that keeps this sane. Live output is ephemeral by nature, but the product still needs memory. Buffering transcript state in Redis lets the backend preserve the session after the call and keeps the system from treating the UI as the source of truth. The UI is the display. Redis is the memory. The bridge is what keeps them consistent.
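The buffering itself can be as simple as a Redis list per session. This is a sketch under assumptions: the real bridge's key scheme and client wiring are not shown in the source, and `client` here is anything with Redis-style `rpush`/`lrange` methods (a `redis.Redis` instance qualifies).

```python
class TranscriptBuffer:
    """Per-session transcript buffer backed by a Redis-style list."""

    def __init__(self, client, session_id: str) -> None:
        self.client = client
        # Hypothetical key scheme, one list per session.
        self.key = f"transcript:{session_id}"

    def append(self, text: str) -> None:
        # The UI is the display; Redis is the memory.
        self.client.rpush(self.key, text)

    def replay(self) -> list[str]:
        # Full ordered transcript, available after the call ends.
        return self.client.lrange(self.key, 0, -1)
```

Appends are cheap and ordered, and replay is a single range read, which is exactly the access pattern a post-call persistence step needs.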
The problem I was really solving
The hard problem was not audio conversion. It was turning live speech into a usable interaction before the moment passed.
If a coaching hint arrives after the speaker has already moved on, the content may still be correct, but the timing is gone. That is why the boundary matters more than the model itself. A clean transcript that lands late is just a transcript. A slightly rough transcript that lands on time can still help a recruiter steer the conversation. In a live call, timing is part of correctness.
That is also why I stopped chasing cleverness in the bridge. I did not want a system that guessed when it was ready. I wanted a system that knew. The session handshake had to finish first. The audio format had to be explicit. The resampling had to be repeatable. The output delivery had to stay separate from the input pipeline. Those are ordinary engineering choices, but they are the difference between a console that feels reactive and one that feels out of sync with the person speaking.
Why the order of operations matters so much
There are a lot of ways to build a voice pipeline that technically works. Most of them fail in the same place: they treat the live stream like a batch job.
A batch job can wait for the whole file. A live call cannot. A batch job can recover from a one-second stall. A live call cannot without making the user feel the gap. A batch job can pad out its timing with retries. In a live conversation, every extra hop shows up as friction.
That is why the startup sequence mattered so much. The bridge had to exist, the realtime session had to be updated, and only then could audio start flowing into input_audio_buffer.append. After that, the resampler could do its job, SignalR could carry partial output to the frontend, and Redis could preserve the transcript state. Each step depends on the one before it, but none of them should block the live media stream longer than necessary.
The practical lesson is simple: the place where two systems meet is where correctness lives. If the boundary is sloppy, every downstream component inherits the mess. If the boundary is clean, the rest of the pipeline gets to stay boring.
Closing
Once I fixed the session ordering and made the resampling path deterministic, the rest of the console started behaving like a live system instead of a lucky one. Audio, model state, transcript persistence, and UI delivery are different problems with different clocks, and the backend finally treats them that way.
The next round of work is no longer about whether the call stays usable. It is about how far the coaching surface can be pushed once the boundary itself is trustworthy.
