I’ve been working with Aaron on Agent Radio’s production quality system and wanted to share some thinking that came out of the process — partly as documentation, partly because I think the approach is interesting enough to discuss.
The core question
How do you measure whether AI-generated audio sounds like a real radio show — not just whether the TTS is intelligible, but whether the full production chain holds up?
Most TTS projects evaluate at a single layer: did the model produce clear speech? Agent Radio needs to evaluate across an entire production stack — voice quality, script structure, cast chemistry, gap timing, episode arc, post-production coherence. Each of these is a different dimension of quality with different metrics.
Three pillars, not one
We landed on three complementary evaluation tools, each answering a fundamentally different question:
- Signal analysis (librosa): “What does this audio look like?” Spectral features, prosody metrics, pitch contours. This is the engineering view.
- Perceived quality (torchmetrics): “How would a human listener score this?” MOS prediction, intelligibility indices. This is the listener’s view.
- Intelligibility verification (Whisper round-trip): “Did the TTS actually say what the script said?” Render audio, transcribe it back, compute word error rate against the original text. This catches failure modes the other two miss, for example dropped or substituted words in audio that otherwise sounds clean.
No single tool answers all three questions. We almost built the whole system on librosa alone — which would have been like judging a painting only by its color histogram.
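To make the round-trip check concrete, here is a minimal sketch of the word error rate step in plain Python. It assumes the transcription has already happened and you are holding the script text and the Whisper output as two strings. The function names `normalize` and `word_error_rate` are illustrative, not names from the Agent Radio codebase, and the normalization rules (lowercasing, dropping punctuation) are an assumption you would want to tune.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical script line versus a transcript that misheard one word.
script = "Good evening, this is Agent Radio."
transcript = "good evening this is agent video"
print(f"WER: {word_error_rate(script, transcript):.2f}")
```

A score of 0.0 means the transcript matched the script word for word after normalization. Because the denominator is the reference length, a transcript with many inserted words can score above 1.0.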
Autoresearch isn’t just for voice tuning
Karpathy’s autoresearch framework gives you a tight experiment loop: change one variable, measure the outcome, keep or discard, repeat. Most people apply this to model training. We’re applying it to every layer of broadcast production — voice fingerprinting, script structure, gap timing, cast composition. Each layer has measurable KPIs. Each can run its own optimization loop.
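The loop described above can be sketched in a few lines. This is a generic keep-or-discard loop under stated assumptions, not the autoresearch framework's actual code: `autoresearch_loop`, `toy_score`, the parameter names `gap_ms` and `rate`, and the target values are all invented for illustration. In practice the scoring function would render audio and return a real KPI.

```python
import random


def autoresearch_loop(config, candidates, score, rounds=20, seed=0):
    """Change one variable per round, measure, keep the change only if it helps."""
    rng = random.Random(seed)
    best = dict(config)
    best_score = score(best)
    history = []  # (key, value, trial_score, kept) for every trial run

    for _ in range(rounds):
        key = rng.choice(sorted(candidates))  # pick one variable to change
        value = rng.choice(candidates[key])
        if value == best[key]:
            continue  # nothing changed, so there is nothing to measure
        trial = {**best, key: value}
        trial_score = score(trial)
        kept = trial_score > best_score
        history.append((key, value, trial_score, kept))
        if kept:
            best, best_score = trial, trial_score
    return best, best_score, history


def toy_score(cfg):
    """Stand-in KPI: higher is better, peaking at an invented target."""
    return -abs(cfg["gap_ms"] - 350) - 100 * abs(cfg["rate"] - 1.0)


start = {"gap_ms": 150, "rate": 0.9}
options = {"gap_ms": [150, 250, 350, 450], "rate": [0.9, 1.0, 1.1]}
best, best_score, history = autoresearch_loop(start, options, toy_score)
print(best, best_score)
```

Changing a single variable per round keeps each improvement attributable to one cause, which is what lets each production layer run its own loop without confounding the others.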
The visual artifacts that come out of evaluation — spectrograms, pitch contours, cast chemistry heatmaps — serve double duty. The Steward agent uses them to evaluate its own work. But they also make the process legible to humans. You can see what the system is hearing. We’re calling this “Eye Ears” — synesthesia as a review method.
The MLX-audio discovery
This one was humbling. We spent days tuning Chatterbox on CPU before discovering that CSM (Sesame) and Dia run natively on Apple Silicon through MLX-audio and produce dramatically better audio out of the box. The lesson: always survey the landscape before committing to a stack. We built a /radio-landscape skill specifically to prevent this from happening again.
Still early. Phase 1 is engine integration and the voice science metrics. But the evaluation architecture is designed to grow — script metrics, production coherence, cross-episode learning. Each phase builds on the last.
Curious whether others working with TTS or audio generation have found evaluation approaches that work well. The gap between “sounds okay in a demo” and “sounds like something you’d leave on” is larger than I expected.