Realtime API caching behavior available through OpenAI but not through Azure OpenAI

Ricardo 20 Reputation points
2026-03-31T19:06:55.9+00:00

Hi, we are testing a realtime implementation and are seeing different caching behavior between OpenAI directly and Azure OpenAI.

In our case, the workflow appears to benefit from caching when we use OpenAI’s realtime endpoint, but we are not seeing equivalent behavior when using Azure OpenAI for what is otherwise the same implementation.

I want to confirm whether this is:

  1. expected because caching is not currently supported for Azure OpenAI realtime in this scenario,
  2. a deployment / API version / region limitation, or
  3. something we are doing incorrectly in our Azure setup.

Environment

  • Service: Azure OpenAI
  • Region: canada-central
  • Model: gpt-realtime-1.5
  • Connection method: WebRTC
  • Endpoint: /openai/v1/realtime/calls

What we expect

We expect repeated or shared prompt/context content to benefit from caching in a similar way to what we observe when calling OpenAI directly. In practice, we would expect to see prompt tokens written to cache appear in our Azure metrics dashboard and in the usage logged in the response.done object returned by Realtime.

What we're noticing

  • Metrics show zero prompt tokens written to cache or read from cache
  • The response.done payload reports "cached_tokens": 0
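For reference, this is roughly how we extract the cached-token count from a logged response.done event. The field path (response.usage.input_token_details.cached_tokens) follows the OpenAI Realtime API usage shape as we understand it, and the sample payload below is made up for illustration:

```python
import json

def cached_tokens_from_event(raw_event: str) -> int:
    """Extract cached input tokens from a Realtime `response.done` event.

    Assumes the usage layout response.usage.input_token_details.cached_tokens;
    adjust the path if your payloads differ.
    """
    event = json.loads(raw_event)
    if event.get("type") != "response.done":
        return 0
    usage = event.get("response", {}).get("usage", {})
    return usage.get("input_token_details", {}).get("cached_tokens", 0)

# Example payload resembling what we log from the session:
sample = json.dumps({
    "type": "response.done",
    "response": {
        "usage": {
            "input_tokens": 2048,
            "output_tokens": 120,
            "input_token_details": {"cached_tokens": 0},
        }
    },
})
print(cached_tokens_from_event(sample))  # → 0
```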

What we have already checked

  • Same application logic on both providers
  • Following the documented caching rules: the prompt is at least 1,024 tokens long, and the first 1,024 tokens are identical across calls
  • Session is being maintained as expected
  • No intentional clearing/reset of conversation state between turns

We'd appreciate any help with this. Thanks.

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


Answer accepted by question author
  1. SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator
    2026-04-01T08:51:29.94+00:00

    Hello Ricardo,

    Welcome to Microsoft Q&A, and thank you for the question.

    Based on what you’ve shared, the behavior you’re seeing is expected with the current Azure OpenAI implementation.

    In Azure OpenAI, prompt caching is not currently supported for realtime endpoints, including:

    • /openai/v1/realtime/calls
    • WebRTC-based connections
    • Model: gpt-realtime-1.5

    Because of this, Azure does not:

    • Write prompt tokens to cache
    • Read from cache on repeated inputs
    • Populate cached_tokens in response.done
    • Surface cache metrics in the portal

    So consistently seeing "cached_tokens": 0 is expected in this scenario.

    Why it differs from OpenAI

    Even though the model naming and API structure are similar, feature parity is not always 1:1 between OpenAI and Azure OpenAI.

    • OpenAI’s realtime endpoint currently includes prompt caching optimizations.

    • Azure OpenAI’s realtime endpoint is focused on low-latency streaming, and caching has not yet been enabled for this path.

    Model support clarification

    Prompt caching in Azure OpenAI today is limited to specific models and APIs, primarily outside of realtime scenarios.

    Examples of models where caching is supported include:

    • gpt-4o variants
    • o-series models (for example, o3-mini and o1-2024-12-17)

    At this time, gpt-realtime-1.5 is not part of the supported set for prompt caching.

    What this means for your implementation

    • Each turn in the realtime session is processed fully (no reuse of prior prompt tokens)
    • Token usage and latency will not benefit from caching optimizations
    • Azure metrics and responses will not reflect any cache activity

    You’ve already covered this well, but just to confirm:

    • API version should be 2024-06-14 or later
    • Workspace/region setup: Canada Central is valid
    • Session handling and prompt structure look correct

    There’s nothing in your current setup that would prevent caching; the limitation is at the service capability level.

    Recommended steps

    Depending on your requirement:

    If realtime and caching are required

    Continue using the OpenAI direct endpoint for now, where this behavior is supported

    If staying on Azure is preferred

    Consider switching to a model that supports caching, or optimize prompt usage to reduce repeated token overhead:

    • Keep a rolling context window
    • Avoid resending large static content
    • Move reusable instructions into session initialization
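As one way to apply the last point, static instructions can be sent once when the session starts rather than resent on every turn. A minimal sketch of building the Realtime session.update event that carries them (the instructions text is a made-up placeholder; you would send this string over your already-open Realtime connection):

```python
import json

# Reusable system guidance sent once at session start instead of per turn.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for Contoso. "
    "Always answer in English and keep responses under three sentences."
)

def build_session_update(instructions: str) -> str:
    """Build a Realtime `session.update` event carrying static instructions."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    })

event = build_session_update(STATIC_INSTRUCTIONS)
print(json.loads(event)["type"])  # → session.update
```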

    The absence of caching in your Azure OpenAI realtime scenario is expected.

    cached_tokens = 0 is by design, not an issue.

    This is not related to region, API version, or implementation gaps on your side.

    Current limitation: prompt caching is not supported for gpt-realtime-1.5 in Azure realtime APIs.

    Please refer to these resources:

    Prompt Caching in Azure OpenAI: https://dori-uw-1.kuma-moon.com/azure/ai-services/openai/how-to/prompt-caching

    Azure OpenAI Service pricing (caching discounts): https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

    Azure Monitor overview (to track metrics): https://dori-uw-1.kuma-moon.com/azure/azure-monitor/overview

    I hope this helps. Do let me know if you have any further queries.

    Thank you!


1 additional answer

  1. Q&A Assist
    2026-03-31T19:07:39.01+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    For Azure OpenAI, the behavior described is expected today for the Realtime API in the scenario outlined.

    Key points from current behavior and documentation:

    1. Prompt caching support is model- and operation-specific
      • Prompt caching in Azure OpenAI is supported for “GPT‑4o or newer” models and applies to operations such as chat-completions, completions, responses, and real-time operations, but only where the model/operation combination actually implements caching and returns cached_tokens in usage details.
      • Official support for the cached_tokens field in responses is currently documented only for the o‑series model family and for API version 2024-10-01-preview in the chat completions API, not generically for all realtime endpoints.
    2. Realtime API and truncation can reduce cache effectiveness
      • The Realtime API supports a truncation configuration. When the conversation grows beyond the model’s input token limit, older messages are dropped from the beginning of the context. This “busts” the cache because the earliest tokens (which must be identical for caching to apply) are no longer present.
      • Even when truncation is enabled in a way that retains some history, the effective cache rate depends on how much of the initial shared context is preserved. If truncation removes or changes the earliest 1,024+ tokens, cached_tokens will be 0 on subsequent turns.
    3. Prompt caching requirements must be met exactly
      For a cache hit in Azure OpenAI:
      • The request must be at least 1,024 tokens in length.
      • The first 1,024 tokens must be identical across calls.
      • After the first 1,024 tokens, cache hits occur in 128‑token increments for additional identical tokens.
      • A single character difference in the first 1,024 tokens results in a cache miss (cached_tokens = 0).
      • Caches are temporary and cleared after a period of inactivity (up to 24 hours in Azure AI Foundry Model endpoints; for Azure OpenAI prompt caching, typically within 5–10 minutes and always within one hour of last use). If the interval between calls is long enough, the cache may be gone.
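      The sizing rules above can be expressed as simple arithmetic. A sketch of the upper bound on cache-eligible tokens for a given identical leading span, under the documented 1,024-token minimum and 128-token increments:

      ```python
      def max_cacheable_tokens(shared_prefix_tokens: int) -> int:
          """Upper bound on tokens servable from cache: nothing below 1,024
          identical leading tokens, then additional hits in 128-token steps."""
          if shared_prefix_tokens < 1024:
              return 0
          return 1024 + ((shared_prefix_tokens - 1024) // 128) * 128

      print(max_cacheable_tokens(1000))  # → 0
      print(max_cacheable_tokens(1024))  # → 1024
      print(max_cacheable_tokens(1300))  # → 1280
      ```

      So a 1,300-token identical prefix can yield at most 1,280 cached tokens; the trailing 20 tokens fall short of the next 128-token increment.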
    4. What this means for the described scenario
      Given the information in the question and the current documentation:
      • Seeing "cached_tokens": 0 and zero prompt tokens written/read from cache in Azure metrics is consistent with a scenario where either:
        • The Realtime model/operation in that region does not yet expose prompt caching usage in the same way as OpenAI’s own realtime endpoint, or
        • The effective first 1,024 tokens of each request (after any truncation or session handling) are not identical, so no cache hits occur.
      • There is no indication in the provided documentation that gpt-realtime-1.5 over WebRTC in canada-central will necessarily report cached_tokens the same way as OpenAI’s own realtime endpoint, even if the logical application behavior is similar.
    5. Checks and adjustments to consider
      Within the constraints of current Azure OpenAI behavior:
      • Ensure that the shared, repetitive context is at the very beginning of the messages/context that the Realtime session sends to the model, matching the documented requirement that repetitive content be at the start of the messages array.
      • Review any truncation settings for the Realtime session. If truncation is dropping early messages, the cache will not be hit on later turns. Configure truncation (where available) to retain enough of the initial context to keep the first 1,024+ tokens stable.
      • Keep the interval between turns well under the cache lifetime (5–10 minutes typical, always <1 hour for Azure OpenAI prompt caches) so that the cache is still present.
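      To check the second point in practice, it can help to diff the serialized context of two consecutive turns and see exactly where they diverge. A rough character-level sketch (real caching compares tokens, not characters, so treat this only as a way to locate the divergence point; the sample strings are invented):

      ```python
      def shared_prefix_chars(a: str, b: str) -> int:
          """Length of the identical leading span of two serialized prompts."""
          n = 0
          for ca, cb in zip(a, b):
              if ca != cb:
                  break
              n += 1
          return n

      # Two invented serialized turn contexts; they diverge once per-turn
      # history is appended after the static block.
      turn1 = '{"instructions":"...static block...","history":[]}'
      turn2 = '{"instructions":"...static block...","history":["hi"]}'
      print(shared_prefix_chars(turn1, turn2))  # → 48
      ```

      If the divergence point lands before the static content ends, the early context is not stable across turns and cache hits cannot occur.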

    Given the current documentation, the absence of cached_tokens > 0 and cache metrics for this specific Realtime setup in Azure OpenAI is consistent with expected limitations rather than a clear misconfiguration, especially when compared to OpenAI’s own realtime endpoint, which may implement and expose caching differently.


