
Intermittent inference_service_unavailable_error on gpt-realtime-1.5 model across multiple resources in Sweden Central

Abdul Rehman 20 Reputation points
2026-04-01T04:01:23.8466667+00:00

We are experiencing intermittent inference_service_unavailable_error failures on the gpt-realtime-1.5 model via the Azure OpenAI Realtime WebSocket API across multiple Azure OpenAI resources in Sweden Central. The gpt-realtime model on the same resources, same API version, same code path, and same region works without issue.

**Impact:**

- Production voice agent platform serving multiple enterprise customers
- Failures cause mid-conversation silence: the bot stops responding to the user with no recovery
- Affecting approximately 26 deployments of gpt-realtime-1.5 across multiple Azure OpenAI resources and subscriptions in Sweden Central

**Symptoms**

- The Realtime WebSocket session connects and functions normally for initial turns
- Mid-conversation, a response.done event is returned with status: "failed"
- Error payload:

```json
{
    "code": "inference_service_unavailable_error",
    "type": "service_unavailable_error",
    "message": "The server is overloaded or not ready yet. Please try again later."
}
```

- The failed response shows input tokens consumed (~14,000) but 0 output tokens generated
- cached_tokens: 0, meaning no prompt caching is active

**What works:**

- gpt-realtime (non-1.5) deployments on the same Azure OpenAI resources, same region, same API version (2024-10-01-preview), same client code: no failures observed

**What fails:**

- gpt-realtime-1.5 deployments: intermittent inference_service_unavailable_error mid-conversation

**Affected resources (examples):**

- org_name-sweedencentral-01, deployment: gpt-realtime-1.5
- (26 total resources affected, all Sweden Central)

**Connection details:**

- WebSocket endpoint: wss://{instance}.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-realtime-
- SDK: OpenAI Node.js SDK v6.10.0
- Timestamp of recent failure: 2026-03-31T15:22:10.651Z
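For context, the endpoint above is typically assembled like this (a minimal sketch; the instance and deployment names here are placeholders, not our real resources):

```javascript
// Build the Azure OpenAI Realtime WebSocket URL from its parts.
// Instance and deployment names are placeholders.
function buildRealtimeUrl(instance, deployment, apiVersion) {
  const params = new URLSearchParams({
    "api-version": apiVersion,
    deployment, // e.g. "gpt-realtime-1.5"
  });
  return `wss://${instance}.openai.azure.com/openai/realtime?${params}`;
}
```

For example, `buildRealtimeUrl("my-instance", "gpt-realtime-1.5", "2024-10-01-preview")` yields the endpoint shape shown above.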

**Questions:**

1. Is there a known capacity constraint for gpt-realtime-1.5 in Sweden Central?
2. Are there alternative regions with more reliable capacity for this model?
3. Is there a recommended provisioned throughput option for production Realtime workloads to avoid transient capacity failures?
Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


2 answers

Sort by: Most helpful
  1. SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator
    2026-04-01T07:59:38.23+00:00

    Hello Abdul Rehman,

    Welcome to Microsoft Q&A, and thank you for your questions.

    Based on everything you’ve shared, this behavior strongly points to a capacity constraint specific to gpt-realtime-1.5 in Sweden Central, rather than an issue in your code or configuration.

    What’s happening

    You’re seeing:

    • Sessions start normally and work for a few turns
    • Mid-conversation failure with inference_service_unavailable_error
    • Input tokens consumed (~14K), but 0 output tokens
    • Same setup works fine with gpt-realtime (non-1.5)
    • Issue reproduced across multiple resources and subscriptions

    This combination clearly indicates transient backend capacity exhaustion for the 1.5 model variant in that region.

    Why only gpt-realtime-1.5 is affected

    • gpt-realtime-1.5 is a newer and more resource-intensive model
    • Capacity is allocated per model, not shared equally
    • Sweden Central is a high-demand region, and newer model variants often have:
      • Limited initial capacity
      • Higher contention under load

    So it’s expected that:

    • gpt-realtime - stable
    • gpt-realtime-1.5 - intermittent failures

    Why failures occur mid-conversation

    This is an important nuance:

    • Your session is accepted initially
    • As the conversation progresses (~14K tokens), compute demand increases
    • At response generation time, capacity is not available
    • Result: response.done with status "failed" and no output tokens

    This is typical of dynamic capacity exhaustion during streaming workloads.

    Please check the steps below.

    1. Check deployment capacity and quotas

    • Sweden Central can be oversubscribed for preview/advanced models
    • Validate your deployment health, quotas, and provisioning details

    Check the following:

    • AOAI subscription info / quota dashboards
    • Whether you’re using Standard (shared) or Provisioned Throughput (PTU) deployments

    2. Try alternative regions

    Even if Sweden Central works for other models, you should test gpt-realtime-1.5 in regions with stronger capacity:

    • West Europe
    • UK South
    • North Central US / East US 2
    • Norway East

    Spin up a test deployment and compare error rate, latency, and stability.

    3. Move to Provisioned Throughput (PTU) for production

    For realtime voice workloads, this is the recommended approach.

    Why: Standard deployments draw from a shared capacity pool, while PTU deployments reserve dedicated capacity.

    Benefits:

    • Eliminates most service_unavailable errors
    • Provides predictable performance
    • Avoids mid-session drops

    4. Add retry and recovery logic

    Since these are transient failures, implement:

    • Retry on 408, 500, 502, 503, 504
    • Exponential backoff + jitter
    • Detection of response.done with status "failed", then recreate the session or retry the turn

    This is critical to avoid silent failures in voice scenarios.
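    A minimal sketch of this retry policy, assuming exponential backoff with full jitter. The delay values, attempt count, and the `err.status` / `err.code` shapes are assumptions for illustration, not documented Azure limits:

```javascript
// Status codes treated as transient, per the list above.
const RETRYABLE = new Set([408, 500, 502, 503, 504]);

// Full jitter: random delay in [0, min(cap, base * 2^attempt)].
// Base and cap values are illustrative choices, not Azure guidance.
function backoffDelayMs(attempt, baseMs = 250, capMs = 8000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Run fn, retrying on transient errors with backoff between attempts.
async function withRetries(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // err.status / err.code fields are assumed error shapes here.
      const transient =
        RETRYABLE.has(err.status) ||
        err.code === "inference_service_unavailable_error";
      if (!transient || attempt === maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

    The same backoff helper can pace session re-creation after a failed response.done, so a saturated backend is not hammered with immediate reconnects.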

    5. Reduce token pressure

    You’re hitting ~14K tokens mid-session, which increases failure probability.

    Consider:

    • Trimming conversation history
    • Using a rolling context window
    • Avoiding unnecessary token accumulation
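    The rolling-context idea above can be sketched as follows. The 4-characters-per-token estimate is a rough heuristic, not the model's real tokenizer, and the message shape is assumed:

```javascript
// Very rough token estimate (~4 chars per token); a real tokenizer
// should be used in production.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt plus only the most recent turns that fit
// within the token budget. Assumes messages[0] is the system prompt
// and each message has a string `content` field.
function trimHistory(messages, maxTokens = 8000) {
  const [system, ...rest] = messages;
  const kept = [];
  let budget = maxTokens - estimateTokens(system.content);
  // Walk backwards so the newest turns are kept first.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```

    Trimming before each turn keeps the session well under the ~14K-token region where the failures were observed.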

    6. Add multi-region failover

    For production-grade systems:

    • Deploy in at least two regions
    • Use Azure Front Door or Traffic Manager
    • Route traffic with Sweden Central as primary and a secondary region as fallback

    This ensures continuity if one region is saturated.
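    A client-side sketch of that failover order, assuming the caller supplies a `connect` function that opens the Realtime session against a given host. The hostnames are placeholders; in production, Azure Front Door or Traffic Manager would typically handle this at the edge instead:

```javascript
// Try each regional endpoint in order until one connects.
// `connect(host)` is an assumed caller-provided function that opens
// the Realtime WebSocket session and resolves when it is usable.
async function connectWithFailover(connect, endpoints) {
  let lastErr;
  for (const host of endpoints) {
    try {
      return await connect(host);
    } catch (err) {
      lastErr = err; // transient failure: fall through to the next region
    }
  }
  throw lastErr; // every region failed
}
```

    Pairing this with the retry/backoff logic above covers both transient blips in one region and sustained saturation that requires moving to the fallback.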

    Please refer to these resources:

    1. Model availability by region: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/models#model-summary-table-and-region-availability
    2. Provisioned throughput concepts: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/concepts/provisioned-throughput?tabs=global-ptum
    3. Regional resilience & BCDR best practices: https://dori-uw-1.kuma-moon.com/azure/ai-foundry/openai/how-to/business-continuity-disaster-recovery

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.

  2. Q&A Assist
    2026-04-01T04:01:45.75+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.
    1. The context does not document any region-specific capacity constraint or incident for gpt-realtime-1.5 in Sweden Central. The only relevant capacity-related information in the context is for provisioned throughput and general model/region availability, not transient overload behavior for this specific model or region.
    2. gpt-realtime-1.5 is listed as a Global Standard model, which means it is available in multiple regions globally rather than being tied to a single regional SKU. The Global Standard table shows that Sweden Central supports a broad set of the latest GPT-5.x, o3/o4, and realtime/audio models, but it does not break out reliability or capacity differences by region. From the provided data, other regions with similarly rich support for advanced models (for example East US 2, West US, West Europe, North Central US, South Central US, Japan East, Korea Central, etc.) are candidates to test as alternates, but the documentation does not state that any specific region has “more reliable capacity” for gpt-realtime-1.5.
    3. For production workloads that need to avoid transient capacity issues, the documented option is Provisioned throughput (Provisioned managed or Global Provisioned Throughput) for models sold directly by Azure. The provisioned tables show:
      • Global Provisioned Throughput is available for high-end GPT-5.x, o3, o4-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3-mini, o1, and gpt-4o variants across all listed regions.
      • Provisioned managed (regional) is available for the same family of models in Sweden Central and many other regions.

    gpt-realtime-1.5 itself is not explicitly listed in the provisioned throughput tables in the context, so there is no documented provisioned SKU for that exact model here. For production realtime workloads where transient capacity failures are unacceptable, the documented pattern is:

    • Use Global Provisioned Throughput or Provisioned managed for a supported model (for example gpt-4.1, gpt-4o, or o4-mini) in the target region.
    • Architect the application so that the realtime experience (WebSocket/streaming) is backed by a provisioned model where available, or by a region/model combination that has a provisioned option.

    If gpt-realtime-1.5 must be used specifically and no provisioned SKU is available, the documentation does not provide a way to eliminate transient service_unavailable errors beyond standard retry/failover patterns and potentially testing alternative regions that support the same Global Standard model.


