Voice Media Streams healthy, but unstable downstream turn-taking/transcription in realtime voice flow #927

@TimeHorizon2100

Title: Voice Media Streams active, caller audio present, but unstable downstream turn-taking/transcription

Summary

  • We are running outbound PSTN calls via Twilio Programmable Voice.
  • TwiML uses Connect + Stream to our websocket endpoint.
  • Audio transport appears healthy, but conversation turn-taking is unstable in our downstream realtime model flow.
  • Bot speech is audible and improved after prompt tuning, but caller speech is still sometimes not acted on.
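For reference, a minimal sketch of the TwiML shape described above (Connect + Stream), built with the Python standard library only. The function name `build_stream_twiml` and the URL `wss://example.com/media` are placeholders for illustration, not our actual handler or endpoint.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(ws_url: str) -> str:
    """Return TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # <Stream url="wss://..."> tells Twilio where to open the websocket.
    ET.SubElement(connect, "Stream", {"url": ws_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


if __name__ == "__main__":
    # Placeholder URL; the real endpoint is whatever serves /media.
    print(build_stream_twiml("wss://example.com/media"))
```

The /incoming handler would return this string with an XML content type.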

Observed behavior

  • Call lifecycle is normal: initiated -> ringing -> in-progress -> completed.
  • Twilio requests /incoming and /media correctly.
  • Inbound caller audio reaches our server continuously (high RMS values observed, for example up to 5760).
  • Despite valid inbound audio, the model does not always transition into a response turn after caller speech.
  • The model's output-transcription logs also show bot output being paced in chunks.

Expected behavior

  • After a caller utterance and a short silence, the downstream model should reliably close the caller's turn and generate a bot response.
  • Turn-taking should be stable across repeated calls with the same setup.
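The expected behavior can be turned into a local, log-only check. The sketch below flags the point where a caller turn should have closed (speech followed by a stretch of silence), so it can be compared with when the model actually responds. The class name, the RMS threshold of 500, the 600 ms silence window, and the 20 ms frame duration are all assumptions for illustration, not values from our setup or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class TurnCloseMonitor:
    """Diagnostic only: reports when speech was followed by enough silence."""
    speech_rms: float = 500.0   # assumed threshold separating speech from silence
    silence_ms: int = 600       # assumed silence needed to call the turn closed
    frame_ms: int = 20          # assumed duration of one media frame
    _heard_speech: bool = False
    _quiet_ms: int = 0

    def feed(self, rms: float) -> bool:
        """Feed one frame's RMS; return True on the frame that closes a turn."""
        if rms >= self.speech_rms:
            self._heard_speech = True
            self._quiet_ms = 0
            return False
        if not self._heard_speech:
            return False            # leading silence, no turn in progress
        self._quiet_ms += self.frame_ms
        if self._quiet_ms >= self.silence_ms:
            self._heard_speech = False
            self._quiet_ms = 0
            return True
        return False
```

This does not gate or alter the forwarded audio. It only produces a timestamp to place next to the model's response events in the logs.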

Current architecture

  • Twilio Programmable Voice outbound call.
  • TwiML: Connect + Stream (websocket).
  • Incoming media: mu-law 8 kHz -> PCM16 -> optional gain -> resample to 16 kHz -> forwarded to the realtime model.
  • Outgoing media: model PCM at 24 kHz -> resample to 8 kHz -> mu-law -> Twilio media event.
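The conversion steps in this pipeline can be sketched with the standard library alone. This is an illustration under stated assumptions, not our production code: the resamplers are deliberately naive (linear interpolation up, 3-sample averaging down), the model output is assumed to be 16-bit PCM, and all function names are hypothetical. The outbound message shape (`event`, `streamSid`, `media.payload`) follows the Twilio Media Streams message format.

```python
import base64
import json


def apply_gain(samples: list[int], gain: float) -> list[int]:
    """Scale PCM16 samples and clamp to the 16-bit range."""
    return [max(-32768, min(32767, int(s * gain))) for s in samples]


def upsample_8k_to_16k(samples: list[int]) -> list[int]:
    """Double the sample rate by linear interpolation (naive sketch)."""
    out: list[int] = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.extend((s, (s + nxt) // 2))
    return out


def downsample_24k_to_8k(samples: list[int]) -> list[int]:
    """Average each group of 3 samples; a trailing partial group is dropped."""
    return [sum(samples[i:i + 3]) // 3 for i in range(0, len(samples) - 2, 3)]


def pcm16_to_mulaw(sample: int) -> int:
    """Compress one signed 16-bit sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), 32635) + 0x84   # clip, then add bias
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def outbound_media_event(stream_sid: str, pcm_24k: list[int]) -> str:
    """Build the JSON 'media' message sent back over the websocket."""
    mulaw = bytes(pcm16_to_mulaw(s) for s in downsample_24k_to_8k(pcm_24k))
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw).decode("ascii")},
    })
```

A production path would normally use a properly filtered resampler. The naive versions here only show where each rate change sits in the chain.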

What we already changed

  • Removed the aggressive local speech gate that was filtering out silence.
  • Now forwarding the full audio stream, including silence, so the model can perform VAD/end-of-turn detection itself.
  • Reduced and simplified the prompt for faster speech output.
  • Fixed a language typo in the prompt content.

Still failing

  • Caller audio is present in logs, but the model's response turn is still inconsistent.

Sample call identifiers

  • Redacted in public report.
  • Full identifiers can be shared privately with Twilio Support upon request.

Environment

  • Python 3.13
  • aiohttp websocket server
  • Twilio Voice Media Streams
  • Runtime observed on Windows host

Requests to Twilio

  1. Please verify media stream quality for the provided sample call identifiers (packet continuity, jitter, transcoding anomalies).
  2. Please confirm any Twilio-side considerations for PSTN -> Stream -> external realtime model that affect end-of-turn reliability.
  3. Please suggest Twilio-recommended stream/session settings for stable, low-latency conversational turn-taking.
  4. Please confirm whether there are known issues in similar topologies where inbound audio is present but downstream turn closure is unstable.

Operational notes seen during testing

  • Intermittent local runtime issues were also observed and mitigated:
    • Port binding conflict on 5000 (WinError 10048).
    • One temporary 404 on /incoming when the wrong server process occupied the port.
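A pre-start check along the following lines can catch the port conflict before the server launches. It is a minimal sketch: the helper name `port_is_free` is hypothetical, and it only tests whether a TCP bind on the given host and port succeeds at that moment.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # On Windows a conflict surfaces as WinError 10048.
            return False
        return True


if __name__ == "__main__":
    if not port_is_free(5000):
        raise SystemExit("Port 5000 is already in use; stop the other process first.")
```

Failing fast here avoids the case where Twilio reaches a stale process on the same port and gets a 404.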

Primary blocker

  • This is currently a production blocker for reliable two-way voice conversation quality.
