Why Browser Studios Sound Like Zoom Calls (and How We Fixed It)
ApexStream Engineering
Broadcast Audio Team
Why your podcast sounds like a Zoom call
Open any "browser-based studio" (Restream, StreamYard, Riverside's web player, take your pick), record three guests, and listen back on monitors. Within ten seconds you'll hear the same three things every podcast producer learns to hate:
- A faint pumping as someone laughs (browser auto-compression chasing the peak).
- A metallic edge on consonants (resampling artefacts from 48 kHz mics down to a 16 kHz Hangouts-grade pipeline).
- A whomp when two guests overlap (the side-chain ducker over-reacting because it can't see the ratio).
None of these are hardware problems. Every guest is on a perfectly good USB mic. The problem is the audio chain inside the browser tab — and Chrome doesn't ship a broadcast pipeline.
This post is a tour of what we built instead.
The "browser studio" problem in one diagram
When a guest joins a typical browser studio, their voice takes this path:
Mic → OS → Chrome WebAudio → MediaStreamSource → Default AGC + AEC + NS
→ Resampler (target rate decided by Chrome)
→ MediaRecorder (Opus @ ~64 kbps, mono, 48 kHz nominal but resampled)
→ Cloud mixer → MP3 export
There are at least four lossy decisions in that chain that the studio app has no control over:
- Chrome's automatic gain control chooses an attack/release that's optimised for video calls, not voiceover. It pumps under loud laughter.
- Echo cancellation is on by default, even when guests wear headphones, and it strips low-frequency body from male voices.
- Resampling — Chrome will quietly resample to whatever the bottleneck device is. We've seen 48 kHz mics arrive at the mixer at 16 kHz on bad Bluetooth.
- MediaRecorder's Opus encoder runs at default settings (mono, ~64 kbps), which is adequate for a Hangouts call but too low a bitrate for a podcast master.
The output is functional. It is not broadcast.
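For the first two items, an app can at least ask the browser for an unprocessed capture when it requests the microphone. The sketch below is illustrative only and is not taken from any studio's codebase: `buildRawVoiceConstraints` is a hypothetical helper, and the local interface simply mirrors the standard `MediaTrackConstraints` fields so the snippet compiles outside a browser. Browsers treat these values as requests and may ignore them, so it is worth reading the applied settings back.

```typescript
// Local mirror of the standard MediaTrackConstraints fields used here,
// declared so the sketch type-checks without DOM typings.
interface RawVoiceConstraints {
  autoGainControl: boolean;
  echoCancellation: boolean;
  noiseSuppression: boolean;
  sampleRate: { ideal: number };
  channelCount: { ideal: number };
}

// Hypothetical helper: request a capture with the call-oriented
// processing stages switched off and a preferred sample rate.
function buildRawVoiceConstraints(sampleRate = 48000): { audio: RawVoiceConstraints } {
  return {
    audio: {
      autoGainControl: false,
      echoCancellation: false, // only sensible when the guest wears headphones
      noiseSuppression: false,
      sampleRate: { ideal: sampleRate },
      channelCount: { ideal: 1 },
    },
  };
}

// In a browser:
// const stream = await navigator.mediaDevices.getUserMedia(buildRawVoiceConstraints());
// console.log(stream.getAudioTracks()[0].getSettings()); // what was actually applied
```

Turning echo cancellation off is only safe when the guest is on headphones; with open speakers the far-end audio will leak back into the mic.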
The ApexStream pipeline
Here's what a guest's voice looks like inside our Studio:
Mic → AudioGraph (custom Web Audio routing)
├─ Per-channel gate (200 ms hold, voice-tuned)
├─ Per-channel 3-band EQ (low 120 Hz / mid 2.5 kHz / high 8 kHz, voice-optimised)
├─ Per-channel ducking bus (200 ms loop interval — CPU-tuned)
└─ Master bus
├─ Broadcast limiter (threshold −3, knee 6, ratio 20, attack 1 ms, release 100 ms)
├─ Makeup gain +6 dB (compensate for limiter ceiling)
└─ Output @ 48 kHz, locked, no resampling
→ Cloud recorder (256 kbps Opus, CBR, stereo, in-band FEC)
→ Post-render: two-pass loudnorm → −16 LUFS / −1.5 dBTP / 11 LU LRA → 256 kbps MP3
Every box on that diagram is a deliberate engineering decision we made because the browser default was wrong for paying audiences. A few highlights:
48 kHz, locked
The very first line of our AudioGraph constructor:
new AudioContextClass({ sampleRate: 48000 });
Specifying the sample rate at construction time prevents Chrome from silently resampling our 48 kHz capture down to whatever the cheapest output device suggests. It's a one-line fix that no browser studio bothers to make because the symptom — a metallic edge on consonants — is hard to A/B test if you've never heard a clean voiceover.
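A defensive version of that constructor call might also confirm the rate the context actually reports. This is a minimal sketch, not our production constructor: `createLockedContext` and the `AudioContextLike` interface are hypothetical names, declared locally so the snippet runs outside a browser.

```typescript
// Minimal shape of an AudioContext, declared locally for portability.
interface AudioContextLike {
  readonly sampleRate: number;
}
type AudioContextCtor = new (options?: { sampleRate?: number }) => AudioContextLike;

// Hypothetical helper: construct the context at a fixed rate and
// fail loudly if the context reports anything else.
function createLockedContext(Ctor: AudioContextCtor, sampleRate = 48000): AudioContextLike {
  const ctx = new Ctor({ sampleRate });
  if (ctx.sampleRate !== sampleRate) {
    throw new Error(`Expected ${sampleRate} Hz, got ${ctx.sampleRate} Hz`);
  }
  return ctx;
}

// In a browser:
// const ctx = createLockedContext(window.AudioContext);
```

Failing at construction is cheaper than discovering a rate mismatch in the exported file.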
Broadcast limiter, not a "compressor"
Browser studios reach for DynamicsCompressorNode with default settings and hope. We hand-tuned a limiter:
- Threshold −3 dBFS so peaks have headroom.
- 20:1 ratio — anything above threshold is stopped, not soft-shaped.
- 1 ms attack, 100 ms release so transient peaks are caught but laughter doesn't pump.
- +6 dB makeup gain to bring the average level back up to broadcast.
The result is a master bus that sounds like a console, not a video call.
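As a rough illustration, the settings above map onto the parameters of a standard `DynamicsCompressorNode` followed by a `GainNode`. That mapping is an assumption made for this sketch, not a description of our node graph; `dbToLinear`, `limiterSettings`, and `staticOutputDb` are hypothetical names. The static curve ignores the knee and the attack/release timing and only shows what a 20:1 ratio does to a peak.

```typescript
// Convert decibels to the linear amplitude multiplier a GainNode expects.
function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Values from the text and the pipeline diagram, in the units
// DynamicsCompressorNode uses (dB for threshold/knee, seconds for timing).
const limiterSettings = {
  threshold: -3,
  knee: 6,
  ratio: 20,
  attack: 0.001,
  release: 0.1,
};

const makeupGain = dbToLinear(6); // +6 dB is roughly 2x amplitude

// Hard-knee static curve: below threshold the signal passes unchanged,
// above it every `ratio` dB of input yields 1 dB of output.
function staticOutputDb(inputDb: number, thresholdDb: number, ratio: number): number {
  if (inputDb <= thresholdDb) return inputDb;
  return thresholdDb + (inputDb - thresholdDb) / ratio;
}

// In a browser (assumed wiring, with `ctx` and `source` already created):
// const limiter = new DynamicsCompressorNode(ctx, limiterSettings);
// const makeup = new GainNode(ctx, { gain: makeupGain });
// source.connect(limiter).connect(makeup).connect(ctx.destination);
```

With these numbers a peak arriving 3 dB over the threshold leaves the limiter only 0.15 dB over it, which is what "stopped, not soft-shaped" means in practice.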
Per-channel gate at 200 ms
Most browser studios run a gate in the 50 ms loop range. That's CPU-expensive and produces a "chattery" gate when guests breathe near the threshold. We tuned ours to 200 ms — slow enough to not chatter, fast enough to catch a laptop fan in the gap between sentences.
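The hold behaviour can be sketched as a small state machine: the gate opens whenever the level crosses the threshold and stays open until the hold time has elapsed since the last crossing. This is an illustrative sketch, not our gate; `createGate` is a hypothetical name and the threshold used in the usage example is an arbitrary value chosen for the demonstration.

```typescript
// Hypothetical gate with a hold timer. `levelDb` is the measured channel
// level and `nowMs` a monotonically increasing timestamp in milliseconds.
function createGate(thresholdDb: number, holdMs = 200): (levelDb: number, nowMs: number) => boolean {
  // Time of the most recent reading at or above the threshold.
  let lastAboveMs = Number.NEGATIVE_INFINITY;

  return function isOpen(levelDb: number, nowMs: number): boolean {
    if (levelDb >= thresholdDb) {
      lastAboveMs = nowMs;
    }
    // Open while we are still inside the hold window.
    return nowMs - lastAboveMs <= holdMs;
  };
}

// Usage: a breath that dips under the threshold for 100 ms does not close
// the gate, while a 250 ms gap between sentences does.
const gate = createGate(-45);
gate(-30, 0);   // speech: gate opens
gate(-60, 100); // short dip: still open
gate(-60, 250); // longer gap: closed
```

A shorter hold closes sooner after each dip, which is where the chatter near the threshold comes from.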
Two-pass loudness mastering
Before an episode is published, every export runs through FFmpeg's loudnorm filter twice:
Pass 1: ffmpeg -i in.wav -af loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json -f null -
Pass 2: ffmpeg -i in.wav -af loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=...:linear=true ...
Pass 1 measures the actual integrated loudness of the file. Pass 2 corrects to the target with the measurement as input, which lets loudnorm apply a single linear gain and land far closer to the target than single-pass dynamic loudnorm. The output is −16 LUFS, −1.5 dBTP, 11 LU LRA: Apple Podcasts' published spec, exactly. Spotify's spec is different (−14 LUFS); we hit theirs on the same render path with a config flip.
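To make the hand-off between the two passes concrete, here is a sketch of turning the JSON that loudnorm prints in pass 1 into the pass 2 filter string. The JSON field names and the `measured_*` and `offset` options are loudnorm's own; `buildPass2Filter`, the target objects, and the sample measurement are hypothetical. The Spotify target assumes only the integrated loudness changes, since that is the only figure quoted above.

```typescript
// Fields of the JSON object loudnorm prints in pass 1 (values are strings).
interface LoudnormMeasurement {
  input_i: string;
  input_tp: string;
  input_lra: string;
  input_thresh: string;
  target_offset: string;
}

interface LoudnessTarget {
  I: number;   // integrated loudness, LUFS
  TP: number;  // true peak ceiling, dBTP
  LRA: number; // loudness range, LU
}

const APPLE_TARGET: LoudnessTarget = { I: -16, TP: -1.5, LRA: 11 };
// Assumption: only the integrated target differs for Spotify.
const SPOTIFY_TARGET: LoudnessTarget = { ...APPLE_TARGET, I: -14 };

// Hypothetical helper: build the -af value for pass 2 from pass 1's output.
function buildPass2Filter(m: LoudnormMeasurement, t: LoudnessTarget = APPLE_TARGET): string {
  return [
    `loudnorm=I=${t.I}`,
    `TP=${t.TP}`,
    `LRA=${t.LRA}`,
    `measured_I=${m.input_i}`,
    `measured_TP=${m.input_tp}`,
    `measured_LRA=${m.input_lra}`,
    `measured_thresh=${m.input_thresh}`,
    `offset=${m.target_offset}`,
    "linear=true",
  ].join(":");
}

// Made-up measurement, for illustration only.
const sample: LoudnormMeasurement = {
  input_i: "-23.10",
  input_tp: "-4.20",
  input_lra: "7.30",
  input_thresh: "-33.50",
  target_offset: "0.40",
};
console.log(buildPass2Filter(sample));
console.log(buildPass2Filter(sample, SPOTIFY_TARGET));
```

Keeping the target as data is what makes a per-platform render a configuration change rather than a second code path.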
Facebook is a special case
Browser studios all assume Facebook accepts 48 kHz audio. It does, for the first 60 minutes. Then the stream cuts. Facebook's RTMP audio path expects 44.1 kHz, and a 48 kHz feed builds up clock drift that the FB ingest server eventually rejects.
We hardcode 44.1 kHz in our Facebook relay path:
// apps/media-node/src/services/relay/grouper.ts
if (provider === 'facebook') {
args.push('-c:a', 'aac', '-b:a', '192k', '-ar', '44100');
}
It's two lines. It is the difference between a 60-minute podcast that drops every time and one that doesn't.
"Per-participant ISO recording" — the other moat
Audio quality at the mixer is one half of the story. The other half is what happens to the individual track for each guest — the ISO recording.
Browser studios generally record only the mixed master. If your co-host's audio sounds bad in post, you're stuck with the mix.
ApexStream Studio runs a dual pipeline for every guest:
- A per-participant MediaRecorder writes a lossless WAV/MP4 to the local browser disk in real time.
- The same track is chunk-uploaded to Cloudflare R2 with a SHA-256 verification on each chunk.
If the guest's network drops, the local copy survives. If their laptop dies mid-record, the cloud copy is already on R2. If your post-production team wants to re-EQ one guest who was too far from the mic, they get the raw file. No mix-bake-in.
The cloud uploads are tenant-prefixed (recording:{customerId}:chunks:{sessionId}) so a multi-tenant bug can't cross-pollinate sessions, and the stitch step refuses to publish if the expected chunk count doesn't match — a recording is either complete or it's flagged, never silently truncated.
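The "complete or flagged" rule can be sketched as a check that runs before stitching. This is an illustration under stated assumptions, not our stitch service: `ChunkRecord`, `chunkPrefix`, and `verifyBeforeStitch` are hypothetical names, and the digests are assumed to have been computed elsewhere (by the browser before upload and by the server on receipt).

```typescript
// One entry per chunk: its position in the recording and its SHA-256 hex digest.
interface ChunkRecord {
  index: number;
  sha256: string;
}

type StitchCheck = { ok: true } | { ok: false; reason: string };

// Key layout quoted in the text: recording:{customerId}:chunks:{sessionId}
function chunkPrefix(customerId: string, sessionId: string): string {
  return `recording:${customerId}:chunks:${sessionId}`;
}

// Refuse to stitch unless every expected chunk arrived with a matching digest.
function verifyBeforeStitch(expected: ChunkRecord[], uploaded: ChunkRecord[]): StitchCheck {
  if (uploaded.length !== expected.length) {
    return { ok: false, reason: `expected ${expected.length} chunks, found ${uploaded.length}` };
  }
  // Index the uploaded digests by chunk position for lookup.
  const digestByIndex: { [index: number]: string } = {};
  for (const chunk of uploaded) {
    digestByIndex[chunk.index] = chunk.sha256;
  }
  for (const chunk of expected) {
    if (digestByIndex[chunk.index] !== chunk.sha256) {
      return { ok: false, reason: `chunk ${chunk.index} is missing or failed SHA-256 verification` };
    }
  }
  return { ok: true };
}
```

Returning a reason rather than a bare boolean gives the flagged recording something actionable to show in the dashboard.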
Why this is hard to copy
None of these decisions are secret. Anyone with FFmpeg and a textbook can implement broadcast loudness mastering. Browser studios don't, for three reasons:
- Their stack is Chrome-shaped. Working around Chrome's built-in AGC/AEC/NS without breaking the WebRTC SDP negotiation is a multi-month rabbit hole.
- Their business model is a free tier that runs at a loss; spending CPU on per-channel limiters in the browser kills the unit economics.
- Their audience is creators who don't yet have paying listeners. Paying audiences are the ones who notice the pumping. Until you have them, you don't realise it matters.
We built ApexStream for the audience that comes after the free tier — the podcast that sells ads, the webinar that closes deals, the broadcaster whose audience can hear a mismatched sample rate.
What this means for your show
If you record paying guests — sponsors, executives, professional podcasters — the audio chain you record them through is the audio they'll be remembered by. A chain optimised for Hangouts will make them sound like Hangouts.
We built one optimised for the room.