Voice Media Streams healthy, but unstable downstream turn-taking/transcription in realtime voice flow #927

Title: Voice Media Streams active, caller audio present, but unstable downstream turn-taking/transcription

Summary
- We are running outbound PSTN calls via Twilio Programmable Voice.
- TwiML uses <Connect><Stream> to send call audio to our websocket endpoint.
- Audio transport appears healthy, but conversation turn-taking is unstable in our downstream realtime model flow.
- Bot speech is audible and improved after prompt tuning, but caller speech is still sometimes not acted on.
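
For reference, the TwiML shape described above can be sketched as follows. This is a minimal illustration, not our production handler: the helper name `build_stream_twiml` and the URL `wss://example.invalid/media` are hypothetical placeholders, and only the standard library is used.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(ws_url: str) -> str:
    """Return TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # <Stream> nested inside <Connect> carries audio in both directions.
    ET.SubElement(connect, "Stream", {"url": ws_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


# Hypothetical endpoint; replace with the real websocket URL.
twiml = build_stream_twiml("wss://example.invalid/media")
print(twiml)
```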

Observed behavior
- Call lifecycle is normal: initiated -> ringing -> in-progress -> completed.
- Twilio requests /incoming and /media correctly.
- Inbound caller audio reaches our server continuously (high RMS values observed, for example up to 5760).
- Despite valid inbound audio, the model does not always start a response turn after the caller finishes speaking.
- The model's output transcription logs also show bot output arriving in chunks rather than at a steady pace.
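
The RMS figures above are per-frame levels on decoded audio. A minimal sketch of that kind of measurement, assuming the frame is little-endian signed 16-bit PCM (the function name `rms_pcm16` is illustrative, not taken from our codebase):

```python
import math
import struct


def rms_pcm16(frame: bytes) -> float:
    """Root-mean-square level of a little-endian signed 16-bit PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


# A constant-amplitude frame has an RMS equal to that amplitude.
loud = struct.pack("<4h", 1000, -1000, 1000, -1000)
silent = struct.pack("<4h", 0, 0, 0, 0)
print(rms_pcm16(loud), rms_pcm16(silent))
```

Logging this value per frame, alongside the time each frame was forwarded, makes it easier to line up caller speech with the moments where the model did or did not respond.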

Expected behavior
- After a caller utterance followed by a short silence, the downstream model should reliably close the caller's turn and generate a bot response.
- Turn-taking should be stable across repeated calls with the same setup.
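
To make the expected behavior concrete, here is a rough local reference detector that marks where a turn "should" have closed. It is meant for log comparison only and does not gate or drop audio. The threshold of 500 and the 25-frame silence window (about 500 ms if frames are 20 ms) are assumptions for illustration, not tuned values.

```python
def turn_end_indices(frame_rms, speech_threshold=500.0, silence_frames=25):
    """Return frame indices where speech was followed by enough silence."""
    ends = []
    in_speech = False
    quiet = 0
    for index, level in enumerate(frame_rms):
        if level >= speech_threshold:
            # Any loud frame (re)starts the caller's turn.
            in_speech = True
            quiet = 0
        elif in_speech:
            quiet += 1
            if quiet >= silence_frames:
                ends.append(index)
                in_speech = False
                quiet = 0
    return ends


# 5 silent frames, 10 speech frames, then 30 silent frames.
levels = [0.0] * 5 + [3000.0] * 10 + [0.0] * 30
print(turn_end_indices(levels))
```

Comparing these indices against the timestamps of the model's response events shows whether the model closed the turn late, or not at all.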

Current architecture
- Twilio Programmable Voice outbound call.
- TwiML: Connect + Stream (websocket).
- Incoming media: mu-law 8 kHz -> PCM16 -> optional gain -> resample to 16 kHz -> forwarded to the realtime model.
- Outgoing media: model PCM at 24 kHz -> resample to 8 kHz -> mu-law -> Twilio media event.
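
The conversion steps in both directions can be sketched in pure Python. This matters on Python 3.13, where the standard-library `audioop` module is no longer available. The code below is an illustrative sketch, not our exact implementation: the function names are hypothetical, and the linear-interpolation upsampler and 3-sample averaging downsampler are deliberately naive stand-ins for a proper filtered resampler.

```python
BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # largest magnitude the encoder accepts before clipping


def ulaw_decode(data: bytes) -> list[int]:
    """Decode G.711 mu-law bytes to signed 16-bit sample values."""
    out = []
    for byte in data:
        u = ~byte & 0xFF
        magnitude = ((((u & 0x0F) << 3) + BIAS) << ((u >> 4) & 0x07)) - BIAS
        out.append(-magnitude if u & 0x80 else magnitude)
    return out


def ulaw_encode(samples) -> bytes:
    """Encode signed 16-bit sample values as G.711 mu-law bytes."""
    out = bytearray()
    for sample in samples:
        sign = 0x80 if sample < 0 else 0x00
        magnitude = min(abs(sample), CLIP) + BIAS
        exponent, mask = 7, 0x4000
        while exponent > 0 and not (magnitude & mask):
            exponent -= 1
            mask >>= 1
        mantissa = (magnitude >> (exponent + 3)) & 0x0F
        out.append(~(sign | (exponent << 4) | mantissa) & 0xFF)
    return bytes(out)


def apply_gain(samples, gain: float) -> list[int]:
    """Scale samples and clamp to the 16-bit range to avoid wraparound."""
    return [max(-32768, min(32767, int(s * gain))) for s in samples]


def upsample_2x(samples) -> list[int]:
    """8 kHz -> 16 kHz by linear interpolation (naive, no filtering)."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out += [s, (s + nxt) // 2]
    return out


def downsample_3x(samples) -> list[int]:
    """24 kHz -> 8 kHz by averaging each group of three samples."""
    return [sum(samples[i:i + 3]) // 3 for i in range(0, len(samples) - 2, 3)]


inbound = upsample_2x(apply_gain(ulaw_decode(b"\xff\x80\x00"), 1.0))
outbound = ulaw_encode(downsample_3x([0, 0, 0, 32124, 32124, 32124]))
print(inbound, outbound)
```

One thing worth checking in a pipeline like this is that the gain stage clamps rather than wraps, since overflow produces loud artifacts that can confuse model-side VAD.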

What we already changed
- Removed an aggressive local speech gate that was filtering out silence.
- We now forward the full audio stream, including silence, so that model-side VAD can detect end of turn.
- Reduced and simplified prompt for faster speech output.
- Fixed language typo in prompt content.

Still failing
- Caller audio is present in our logs, but the model still does not start a response turn consistently.

Sample call identifiers
- Redacted in public report.
- Full identifiers can be shared privately with Twilio Support upon request.

Environment
- Python 3.13
- aiohttp websocket server
- Twilio Voice Media Streams
- Runtime observed on Windows host

Requests to Twilio
1. Please verify media stream quality for the provided sample call identifiers (packet continuity, jitter, transcoding anomalies).
2. Please confirm whether there are any Twilio-side considerations for PSTN -> Stream -> external realtime model that affect end-of-turn reliability.
3. Please suggest Twilio-recommended stream/session settings for stable, low-latency conversational turn-taking.
4. Please confirm whether there are known issues in similar topologies where inbound audio is present but downstream turn closure is unstable.
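
On our side, a basic continuity check over logged stream messages could look like the sketch below. It assumes each logged message is the parsed JSON of a Twilio `media` event with string-valued `media.chunk` and `media.timestamp` (milliseconds from stream start) fields, and that inbound frames are nominally 20 ms apart. Both the frame spacing and the tolerance are assumptions to adjust. The function name `find_gaps` is hypothetical.

```python
def find_gaps(messages, expected_ms=20, tolerance_ms=5):
    """Report missing chunks and irregular timestamps in inbound media events."""
    issues = []
    prev_chunk = prev_ts = None
    for msg in messages:
        if msg.get("event") != "media":
            continue  # skip start/mark/stop and other events
        media = msg["media"]
        if media.get("track", "inbound") != "inbound":
            continue
        chunk = int(media["chunk"])
        ts = int(media["timestamp"])
        if prev_chunk is not None:
            if chunk != prev_chunk + 1:
                issues.append(("chunk_gap", prev_chunk, chunk))
            elif abs((ts - prev_ts) - expected_ms) > tolerance_ms:
                issues.append(("timestamp_jump", prev_ts, ts))
        prev_chunk, prev_ts = chunk, ts
    return issues


def media(chunk, ts):
    # Helper that builds a minimal media event for the example below.
    return {"event": "media",
            "media": {"track": "inbound", "chunk": str(chunk), "timestamp": str(ts)}}


log = [{"event": "start"}, media(1, 20), media(2, 40), media(4, 80), media(5, 160)]
print(find_gaps(log))
```

This only inspects the timestamps carried in the messages. Measuring arrival jitter would additionally require recording a local monotonic clock reading as each websocket message is received.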

Operational notes seen during testing
- Intermittent local runtime issues were also observed and mitigated:
  - Port binding conflict on 5000 (WinError 10048).
  - One temporary 404 on /incoming when the wrong server process occupied the port.

Primary blocker
- This is currently a production blocker for reliable two-way voice conversation quality.

