Realtime API caching behavior available through OpenAI but not through Azure OpenAI

Ricardo 20 Reputation points
2026-03-31T19:06:55.9+00:00

Hi, we are testing a realtime implementation and are seeing different caching behavior between OpenAI directly and Azure OpenAI.

In our case, the workflow appears to benefit from caching when we use OpenAI’s realtime endpoint, but we are not seeing equivalent behavior when using Azure OpenAI for what is otherwise the same implementation.

I want to confirm whether this is:

  1. expected because caching is not currently supported for Azure OpenAI realtime in this scenario,
  2. a deployment / API version / region limitation, or
  3. something we are doing incorrectly in our Azure setup.

Environment

  • Service: Azure OpenAI
  • Region: canada-central
  • Model: gpt-realtime-1.5
  • Connection method: WebRTC
  • Endpoint: /openai/v1/realtime/calls

What we expect

We expect repeated or shared prompt/context content to benefit from caching in a similar way to what we observe when calling OpenAI directly. In practice, we would expect to see prompt tokens written to cache appear in our Azure metrics dashboard and in the usage logged in the response.done object returned by Realtime.

What we're noticing

  • Metrics show zero prompt tokens written to cache or read from cache
  • The response.done payload reports "cached_tokens": 0
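For reference, this is roughly how we extract the cached-token count from a logged response.done event. The field path (response.usage.input_token_details.cached_tokens) follows the OpenAI Realtime API usage shape as we understand it, and the sample payload below is made up for illustration:

```python
import json

def cached_tokens_from_event(raw_event: str) -> int:
    """Extract cached input tokens from a Realtime `response.done` event.

    Assumes the usage layout response.usage.input_token_details.cached_tokens;
    adjust the path if your payloads differ.
    """
    event = json.loads(raw_event)
    if event.get("type") != "response.done":
        return 0
    usage = event.get("response", {}).get("usage", {})
    return usage.get("input_token_details", {}).get("cached_tokens", 0)

# Example payload resembling what we log from the session:
sample = json.dumps({
    "type": "response.done",
    "response": {
        "usage": {
            "input_tokens": 2048,
            "output_tokens": 120,
            "input_token_details": {"cached_tokens": 0},
        }
    },
})
print(cached_tokens_from_event(sample))  # → 0
```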

What we have already checked

  • Same application logic on both providers
  • Following the documented caching rules: the prompt is at least 1,024 tokens long, and the first 1,024 tokens are identical across calls
  • Session is being maintained as expected
  • No intentional clearing/reset of conversation state between turns

We'd appreciate any help with this. Thanks.

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


Answer accepted by question author
  1. SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator
    2026-04-01T08:51:29.94+00:00

    Hello Ricardo,

    Welcome to Microsoft Q&A, and thank you for the question.

    Based on what you’ve shared, the behavior you’re seeing is expected with the current Azure OpenAI implementation.

    In Azure OpenAI, prompt caching is not currently supported for realtime endpoints, including:

    • /openai/v1/realtime/calls
    • WebRTC-based connections
    • Model: gpt-realtime-1.5

    Because of this, Azure does not:

    • Write prompt tokens to cache
    • Read from cache on repeated inputs
    • Populate cached_tokens in response.done
    • Surface cache metrics in the portal

    So consistently seeing "cached_tokens": 0 is expected in this scenario.

    Why it differs from OpenAI

    Even though the model naming and API structure are similar, feature parity is not always 1:1 between OpenAI and Azure OpenAI.

    • OpenAI’s realtime endpoint currently includes prompt caching optimizations.

    • Azure OpenAI’s realtime endpoint is focused on low-latency streaming, and caching has not yet been enabled for this path.

    Model support clarification

    Prompt caching in Azure OpenAI today is limited to specific models and APIs, primarily outside of realtime scenarios.

    Examples of models where caching is supported include:

    • gpt-4o variants
    • o-series models (for example, o3-mini and o1-2024-12-17)

    At this time, gpt-realtime-1.5 is not part of the supported set for prompt caching.

    What this means for your implementation

    • Each turn in the realtime session is processed fully (no reuse of prior prompt tokens)
    • Token usage and latency will not benefit from caching optimizations
    • Azure metrics and responses will not reflect any cache activity

    You’ve already covered this well, but just to confirm:

    • API version should be 2024-06-14 or later
    • Workspace/region setup: Canada Central is valid
    • Session handling and prompt structure look correct

    There’s nothing in your current setup that would prevent caching; the limitation is at the service capability level.

    Recommended steps

    Depending on your requirement:

    If realtime and caching are required

    Continue using the OpenAI direct endpoint for now, where this behavior is supported

    If staying on Azure is preferred

    Consider switching to a model that supports caching, or optimize prompt usage to reduce repeated token overhead:

    • Keep a rolling context window
    • Avoid resending large static content
    • Move reusable instructions into session initialization
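As one way to apply the last point, static instructions can be sent once when the session starts rather than resent on every turn. A minimal sketch of building the Realtime session.update event that carries them (the instructions text is a made-up placeholder; you would send this string over your already-open Realtime connection):

```python
import json

# Reusable system guidance sent once at session start instead of per turn.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for Contoso. "
    "Always answer in English and keep responses under three sentences."
)

def build_session_update(instructions: str) -> str:
    """Build a Realtime `session.update` event carrying static instructions."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    })

event = build_session_update(STATIC_INSTRUCTIONS)
print(json.loads(event)["type"])  # → session.update
```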

    The absence of caching in your Azure OpenAI realtime scenario is expected.

    cached_tokens = 0 is by design, not an issue.

    This is not related to region, API version, or implementation gaps on your side.

    Current limitation: prompt caching is not supported for gpt-realtime-1.5 in Azure realtime APIs.

    Please refer to these resources:

    Prompt Caching in Azure OpenAI: https://dori-uw-1.kuma-moon.com/azure/ai-services/openai/how-to/prompt-caching

    Azure OpenAI Service pricing (caching discounts): https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

    Azure Monitor overview (to track metrics): https://dori-uw-1.kuma-moon.com/azure/azure-monitor/overview

    I hope this helps. Do let me know if you have any further queries.

    Thank you!


1 additional answer

  1. Q&A Assist
    2026-03-31T19:07:39.01+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    For Azure OpenAI, the behavior described is expected today for the Realtime API in the scenario outlined.

    Key points from current behavior and documentation:

    1. Prompt caching support is model- and operation-specific
      • Prompt caching in Azure OpenAI is supported for “GPT‑4o or newer” models and applies to operations such as chat-completions, completions, responses, and real-time operations, but only where the model/operation combination actually implements caching and returns cached_tokens in usage details.
      • Official support for the cached_tokens field in responses is currently documented only for the o‑series model family and for API version 2024-10-01-preview in the chat completions API, not generically for all realtime endpoints.
    2. Realtime API and truncation can reduce cache effectiveness
      • The Realtime API supports a truncation configuration. When the conversation grows beyond the model’s input token limit, older messages are dropped from the beginning of the context. This “busts” the cache because the earliest tokens (which must be identical for caching to apply) are no longer present.
      • Even when truncation is enabled in a way that retains some history, the effective cache rate depends on how much of the initial shared context is preserved. If truncation removes or changes the earliest 1,024+ tokens, cached_tokens will be 0 on subsequent turns.
    3. Prompt caching requirements must be met exactly
      For a cache hit in Azure OpenAI:
      • The request must be at least 1,024 tokens in length.
      • The first 1,024 tokens must be identical across calls.
      • After the first 1,024 tokens, cache hits occur in 128‑token increments for additional identical tokens.
      • A single character difference in the first 1,024 tokens results in a cache miss (cached_tokens = 0).
      • Caches are temporary and cleared after a period of inactivity (up to 24 hours in Azure AI Foundry Model endpoints; for Azure OpenAI prompt caching, typically within 5–10 minutes and always within one hour of last use). If the interval between calls is long enough, the cache may be gone.
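      The sizing rules above can be expressed as simple arithmetic. A sketch of the upper bound on cache-eligible tokens for a given identical leading span, under the documented 1,024-token minimum and 128-token increments:

      ```python
      def max_cacheable_tokens(shared_prefix_tokens: int) -> int:
          """Upper bound on tokens servable from cache: nothing below 1,024
          identical leading tokens, then additional hits in 128-token steps."""
          if shared_prefix_tokens < 1024:
              return 0
          return 1024 + ((shared_prefix_tokens - 1024) // 128) * 128

      print(max_cacheable_tokens(1000))  # → 0
      print(max_cacheable_tokens(1024))  # → 1024
      print(max_cacheable_tokens(1300))  # → 1280
      ```

      So a 1,300-token identical prefix can yield at most 1,280 cached tokens; the trailing 20 tokens fall short of the next 128-token increment.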
    4. What this means for the described scenario
      Given the information in the question and the current documentation:
      • Seeing "cached_tokens": 0 and zero prompt tokens written/read from cache in Azure metrics is consistent with a scenario where either:
        • The Realtime model/operation in that region does not yet expose prompt caching usage in the same way as OpenAI’s own realtime endpoint, or
        • The effective first 1,024 tokens of each request (after any truncation or session handling) are not identical, so no cache hits occur.
      • There is no indication in the provided documentation that gpt-realtime-1.5 over WebRTC in canada-central will necessarily report cached_tokens the same way as OpenAI’s own realtime endpoint, even if the logical application behavior is similar.
    5. Checks and adjustments to consider
      Within the constraints of current Azure OpenAI behavior:
      • Ensure that the shared, repetitive context is at the very beginning of the messages/context that the Realtime session sends to the model, matching the documented requirement that repetitive content be at the start of the messages array.
      • Review any truncation settings for the Realtime session. If truncation is dropping early messages, the cache will not be hit on later turns. Configure truncation (where available) to retain enough of the initial context to keep the first 1,024+ tokens stable.
      • Keep the interval between turns well under the cache lifetime (5–10 minutes typical, always <1 hour for Azure OpenAI prompt caches) so that the cache is still present.
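      To check the second point in practice, it can help to diff the serialized context of two consecutive turns and see exactly where they diverge. A rough character-level sketch (real caching compares tokens, not characters, so treat this only as a way to locate the divergence point; the sample strings are invented):

      ```python
      def shared_prefix_chars(a: str, b: str) -> int:
          """Length of the identical leading span of two serialized prompts."""
          n = 0
          for ca, cb in zip(a, b):
              if ca != cb:
                  break
              n += 1
          return n

      # Two invented serialized turn contexts; they diverge once per-turn
      # history is appended after the static block.
      turn1 = '{"instructions":"...static block...","history":[]}'
      turn2 = '{"instructions":"...static block...","history":["hi"]}'
      print(shared_prefix_chars(turn1, turn2))  # → 48
      ```

      If the divergence point lands before the static content ends, the early context is not stable across turns and cache hits cannot occur.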

    Given the current documentation, the absence of cached_tokens > 0 and cache metrics for this specific Realtime setup in Azure OpenAI is consistent with expected limitations rather than a clear misconfiguration, especially when compared to OpenAI’s own realtime endpoint, which may implement and expose caching differently.


