
Intermittent inference_service_unavailable_error on gpt-realtime-1.5 model across multiple resources in Sweden Central

Abdul Rehman 20 Reputation points
2026-04-01T04:01:23.8466667+00:00

We are experiencing intermittent inference_service_unavailable_error failures on the gpt-realtime-1.5 model via the Azure OpenAI Realtime WebSocket API across multiple Azure OpenAI resources in Sweden Central. The gpt-realtime model on the same resources, same API version, same code path, and same region works without issue.

**Impact:**

- Production voice agent platform serving multiple enterprise customers
- Failures cause mid-conversation silence: the bot stops responding to the user with no recovery
- Affecting approximately 26 deployments of gpt-realtime-1.5 across multiple Azure OpenAI resources and subscriptions in Sweden Central

**Symptoms**

- The Realtime WebSocket session connects and functions normally for initial turns
- Mid-conversation, a response.done event is returned with status: "failed"
- Error payload:

```json
{
    "code": "inference_service_unavailable_error",
    "type": "service_unavailable_error",
    "message": "The server is overloaded or not ready yet. Please try again later."
}
```

- The failed response shows input tokens consumed (~14,000) but 0 output tokens generated
- cached_tokens: 0, meaning no prompt caching is active

**What works:**

- gpt-realtime (non-1.5) deployments on the same Azure OpenAI resources, same region, same API version (2024-10-01-preview), same client code: no failures observed

**What fails:**

- gpt-realtime-1.5 deployments: intermittent inference_service_unavailable_error mid-conversation

**Affected resources (examples):**

- org_name-sweedencentral-01, deployment: gpt-realtime-1.5
- (26 total resources affected, all Sweden Central)

**Connection details:**

- WebSocket endpoint: wss://{instance}.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-realtime-
- SDK: OpenAI Node.js SDK v6.10.0
- Timestamp of recent failure: 2026-03-31T15:22:10.651Z
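For context, the endpoint above is typically assembled like this (a minimal sketch; the instance and deployment names here are placeholders, not our real resources):

```javascript
// Build the Azure OpenAI Realtime WebSocket URL from its parts.
// Instance and deployment names are placeholders.
function buildRealtimeUrl(instance, deployment, apiVersion) {
  const params = new URLSearchParams({
    "api-version": apiVersion,
    deployment, // e.g. "gpt-realtime-1.5"
  });
  return `wss://${instance}.openai.azure.com/openai/realtime?${params}`;
}
```

For example, `buildRealtimeUrl("my-instance", "gpt-realtime-1.5", "2024-10-01-preview")` yields the endpoint shape shown above.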

**Questions:**

1. Is there a known capacity constraint for gpt-realtime-1.5 in Sweden Central?
2. Are there alternative regions with more reliable capacity for this model?
3. Is there a recommended provisioned throughput option for production Realtime workloads to avoid transient capacity failures?
Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


2 answers

Sort by: Most helpful
  1. SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator
    2026-04-01T07:59:38.23+00:00

    Hello Abdul Rehman,

    Welcome to Microsoft Q&A, and thank you for your questions.

    Based on everything you’ve shared, this behavior strongly points to a capacity constraint specific to gpt-realtime-1.5 in Sweden Central, rather than an issue in your code or configuration.

    What’s happening

    You’re seeing:

    • Sessions start normally and work for a few turns
    • Mid-conversation failure with inference_service_unavailable_error
    • Input tokens consumed (~14K), but 0 output tokens
    • Same setup works fine with gpt-realtime (non-1.5)
    • Issue reproduced across multiple resources and subscriptions

    This combination clearly indicates transient backend capacity exhaustion for the 1.5 model variant in that region.

    Why only gpt-realtime-1.5 is affected

    • gpt-realtime-1.5 is a newer and more resource-intensive model
    • Capacity is allocated per model, not shared equally
    • Sweden Central is a high-demand region, and newer model variants often have:
      • Limited initial capacity
      • Higher contention under load

    So it’s expected that:

    • gpt-realtime - stable
    • gpt-realtime-1.5 - intermittent failures

    Why failures occur mid-conversation

    This is an important nuance:

    • Your session is accepted initially
    • As the conversation progresses (~14K tokens), compute demand increases
    • At response generation time, capacity is not available
    • Result: response.done with status "failed" and no output tokens

    This is typical of dynamic capacity exhaustion during streaming workloads.

    Please check the steps below.

    1. Check deployment capacity and quotas

    • Sweden Central can be oversubscribed for preview/advanced models
    • Validate your deployment health, quotas, and provisioning details

    Check the following:

    • AOAI subscription info / quota dashboards
    • Whether you’re using Standard (shared) or Provisioned Throughput (PTU) deployments

    2. Try alternative regions

    Even if Sweden Central works for other models, you should test gpt-realtime-1.5 in regions with stronger capacity:

    • West Europe
    • UK South
    • North Central US / East US 2
    • Norway East

    Spin up a test deployment and compare error rate, latency, and stability.

    3. Move to Provisioned Throughput (PTU) for production

    For realtime voice workloads, this is the recommended approach.

    Why: Standard deployments draw from a shared capacity pool, while PTU deployments reserve dedicated capacity.

    Benefits:

    • Eliminates most service_unavailable errors
    • Provides predictable performance
    • Avoids mid-session drops

    4. Add retry and recovery logic

    Since these are transient failures, implement:

    • Retry on 408, 500, 502, 503, 504
    • Exponential backoff + jitter
    • Detection of response.done with status "failed", then recreate the session or retry the turn

    This is critical to avoid silent failures in voice scenarios.
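    A minimal sketch of this retry policy, assuming exponential backoff with full jitter. The delay values, attempt count, and the `err.status` / `err.code` shapes are assumptions for illustration, not documented Azure limits:

```javascript
// Status codes treated as transient, per the list above.
const RETRYABLE = new Set([408, 500, 502, 503, 504]);

// Full jitter: random delay in [0, min(cap, base * 2^attempt)].
// Base and cap values are illustrative choices, not Azure guidance.
function backoffDelayMs(attempt, baseMs = 250, capMs = 8000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Run fn, retrying on transient errors with backoff between attempts.
async function withRetries(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // err.status / err.code fields are assumed error shapes here.
      const transient =
        RETRYABLE.has(err.status) ||
        err.code === "inference_service_unavailable_error";
      if (!transient || attempt === maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

    The same backoff helper can pace session re-creation after a failed response.done, so a saturated backend is not hammered with immediate reconnects.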

    5. Reduce token pressure

    You’re hitting ~14K tokens mid-session, which increases failure probability.

    Consider:

    • Trimming conversation history
    • Using a rolling context window
    • Avoiding unnecessary token accumulation
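    The rolling-context idea above can be sketched as follows. The 4-characters-per-token estimate is a rough heuristic, not the model's real tokenizer, and the message shape is assumed:

```javascript
// Very rough token estimate (~4 chars per token); a real tokenizer
// should be used in production.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt plus only the most recent turns that fit
// within the token budget. Assumes messages[0] is the system prompt
// and each message has a string `content` field.
function trimHistory(messages, maxTokens = 8000) {
  const [system, ...rest] = messages;
  const kept = [];
  let budget = maxTokens - estimateTokens(system.content);
  // Walk backwards so the newest turns are kept first.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```

    Trimming before each turn keeps the session well under the ~14K-token region where the failures were observed.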

    6. Add multi-region failover

    For production-grade systems:

    • Deploy in at least two regions
    • Use Azure Front Door or Traffic Manager
    • Route traffic with Sweden Central as primary and a secondary region as fallback

    This ensures continuity if one region is saturated.
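    A client-side sketch of that failover order, assuming the caller supplies a `connect` function that opens the Realtime session against a given host. The hostnames are placeholders; in production, Azure Front Door or Traffic Manager would typically handle this at the edge instead:

```javascript
// Try each regional endpoint in order until one connects.
// `connect(host)` is an assumed caller-provided function that opens
// the Realtime WebSocket session and resolves when it is usable.
async function connectWithFailover(connect, endpoints) {
  let lastErr;
  for (const host of endpoints) {
    try {
      return await connect(host);
    } catch (err) {
      lastErr = err; // transient failure: fall through to the next region
    }
  }
  throw lastErr; // every region failed
}
```

    Pairing this with the retry/backoff logic above covers both transient blips in one region and sustained saturation that requires moving to the fallback.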

    Please refer to these resources:

    1. Model availability by region: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability
    2. Provisioned throughput concepts: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/provisioned-throughput?tabs=global-ptum
    3. Regional resilience & BCDR best practices: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/how-to/business-continuity-disaster-recovery

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.

  2. Q&A Assist
    2026-04-01T04:01:45.75+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.
    1. The context does not document any region-specific capacity constraint or incident for gpt-realtime-1.5 in Sweden Central. The only relevant capacity-related information in the context is for provisioned throughput and general model/region availability, not transient overload behavior for this specific model or region.
    2. gpt-realtime-1.5 is listed as a Global Standard model, which means it is available in multiple regions globally rather than being tied to a single regional SKU. The Global Standard table shows that Sweden Central supports a broad set of the latest GPT-5.x, o3/o4, and realtime/audio models, but it does not break out reliability or capacity differences by region. From the provided data, other regions with similarly rich support for advanced models (for example East US 2, West US, West Europe, North Central US, South Central US, Japan East, Korea Central, etc.) are candidates to test as alternates, but the documentation does not state that any specific region has “more reliable capacity” for gpt-realtime-1.5.
    3. For production workloads that need to avoid transient capacity issues, the documented option is Provisioned throughput (Provisioned managed or Global Provisioned Throughput) for models sold directly by Azure. The provisioned tables show:
      • Global Provisioned Throughput is available for high-end GPT-5.x, o3, o4-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3-mini, o1, and gpt-4o variants across all listed regions.
      • Provisioned managed (regional) is available for the same family of models in Sweden Central and many other regions.

    gpt-realtime-1.5 itself is not explicitly listed in the provisioned throughput tables in the context, so there is no documented provisioned SKU for that exact model here. For production realtime workloads where transient capacity failures are unacceptable, the documented pattern is:

    • Use Global Provisioned Throughput or Provisioned managed for a supported model (for example gpt-4.1, gpt-4o, or o4-mini) in the target region.
    • Architect the application so that the realtime experience (WebSocket/streaming) is backed by a provisioned model where available, or by a region/model combination that has a provisioned option.

    If gpt-realtime-1.5 must be used specifically and no provisioned SKU is available, the documentation does not provide a way to eliminate transient service_unavailable errors beyond standard retry/failover patterns and potentially testing alternative regions that support the same Global Standard model.


