Hello Abdul Rehman,
Welcome to Microsoft Q&A, and thank you for the question.
Based on everything you’ve shared, this behavior strongly points to a capacity constraint specific to gpt-realtime-1.5 in Sweden Central, rather than an issue in your code or configuration.
What’s happening

You’re seeing:
- Sessions start normally and work for a few turns
- Mid-conversation failure with `inference_service_unavailable_error`
- Input tokens consumed (~14K), but 0 output tokens
- The same setup works fine with `gpt-realtime` (non-1.5)
- Issue reproduced across multiple resources and subscriptions

This combination clearly indicates transient backend capacity exhaustion for the 1.5 model variant in that region.
Why only gpt-realtime-1.5 is affected

- `gpt-realtime-1.5` is a newer and more resource-intensive model
- Capacity is allocated per model, not shared equally
- Sweden Central is a high-demand region, and newer model variants often have:
  - Limited initial capacity
  - Higher contention under load
So it’s expected that:
- `gpt-realtime` - stable
- `gpt-realtime-1.5` - intermittent failures
Why failures occur mid-conversation

This is an important nuance:
- Your session is accepted initially
- As the conversation progresses (~14K tokens), compute demand increases
- At response generation time, capacity is not available
- Result: `response.done: failed` with no output tokens

This is typical of dynamic capacity exhaustion during streaming workloads.
Please try the steps below.
1. Check deployment capacity and quotas
- Sweden Central can be oversubscribed for preview/advanced models
- Validate your deployment health, quotas, and provisioning details using the AOAI subscription info / quota dashboards
- Check whether you’re using:
  - Standard (shared) deployments, or
  - Provisioned Throughput (PTU)
2. Try alternative regions
Even if Sweden Central works for other models, you should test gpt-realtime-1.5 in regions with stronger capacity:
- West Europe
- UK South
- North Central US / East US 2
- Norway East
Spin up a test deployment and compare error rate, latency, and stability.
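As a rough way to compare regions, a probe like the one below can collect error rate and average latency. This is only a sketch: `call_fn` is a hypothetical stand-in for whatever issues one realtime turn against a given regional deployment and raises on failure.

```python
import time

def probe_region(call_fn, attempts=20):
    """Return (error_rate, avg_latency_seconds) over a number of test calls.

    call_fn is a placeholder: it should perform one request against the
    regional endpoint under test and raise an exception on failure.
    """
    errors, latencies = 0, []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            call_fn()
            latencies.append(time.monotonic() - start)
        except Exception:
            errors += 1
    avg_latency = sum(latencies) / len(latencies) if latencies else float("inf")
    return errors / attempts, avg_latency
```

Run the same probe against each candidate region's deployment and compare the numbers side by side.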
3. Move to Provisioned Throughput (PTU) for production
For realtime voice workloads, this is the recommended approach.
Why:
- Standard deployments = shared pool
- PTU deployments = dedicated, reserved capacity

Benefits:
- Eliminates most `service_unavailable` errors
- Provides predictable performance
- Avoids mid-session drops
4. Add retry and recovery logic
Since these are transient failures, implement:
- Retry on 408, 500, 502, 503, 504
- Exponential backoff + jitter
- Detection of `response.done: failed`, followed by recreating the session or retrying the turn

This is critical to avoid silent failures in voice scenarios.
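A minimal sketch of that retry loop in Python, under the assumption that a placeholder `turn_fn` sends one conversation turn and raises `TransientError` with the HTTP status when it hits a retriable status or a `response.done: failed` event:

```python
import random
import time

RETRIABLE_STATUS = {408, 500, 502, 503, 504}

class TransientError(Exception):
    """Raised by turn_fn on a retriable status or a response.done: failed event."""
    def __init__(self, status):
        super().__init__(f"transient failure (status {status})")
        self.status = status

def with_retries(turn_fn, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a single conversation turn on transient failures."""
    for attempt in range(max_attempts):
        try:
            return turn_fn()
        except TransientError as err:
            if err.status not in RETRIABLE_STATUS or attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter before retrying the turn
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

In a realtime session you would typically recreate the WebSocket session inside `turn_fn` rather than reuse the failed one.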
5. Reduce token pressure
You’re hitting ~14K tokens mid-session, which increases failure probability.
Consider:
- Trimming conversation history
- Using a rolling context window
- Avoiding unnecessary token accumulation
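A rolling context window can be sketched like this; `count_tokens` is a placeholder for your tokenizer (for example a tiktoken-based counter), and the message shape is illustrative:

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the system prompt plus the most recent messages within max_tokens.

    messages: list of dicts like {"role": ..., "content": ...}, where the
    first entry is the system prompt and is always kept. count_tokens is a
    placeholder for whatever token counter matches your model.
    """
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    # Walk backwards so the newest turns survive.
    for msg in reversed(rest):
        cost = count_tokens(msg)
        if cost > budget:
            break
        budget -= cost
        kept.append(msg)
    return [system] + list(reversed(kept))
```

Calling this before each turn keeps the prompt from drifting toward the ~14K-token range where you're seeing failures.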
6. Add multi-region failover
For production-grade systems:
- Deploy in at least two regions
- Use Azure Front Door or Traffic Manager
- Route traffic:
  - Primary - Sweden Central
  - Fallback - secondary region
This ensures continuity if one region is saturated.
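For illustration only, the fallback order can be sketched client-side as below; in production this routing would normally live in Azure Front Door or Traffic Manager, and `turn_fn` is a hypothetical function performing one request against a regional deployment.

```python
def call_with_failover(regions, turn_fn):
    """Try each regional endpoint in order until one succeeds.

    regions: ordered list of region identifiers, primary first.
    turn_fn: placeholder taking a region and performing one request;
    it should raise on failure. In real code, narrow the except clause
    to transient errors only.
    """
    last_err = None
    for region in regions:
        try:
            return turn_fn(region)
        except Exception as err:
            last_err = err
    raise last_err
```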
Please refer to these resources:
- Model availability by region: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability
- Provisioned throughput concepts: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/provisioned-throughput?tabs=global-ptum
- Regional resilience & BCDR best practices: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/how-to/business-continuity-disaster-recovery
I hope this helps. Do let me know if you have any further queries.
Thank you!