Hello Ricardo,
Welcome to Microsoft Q&A, and thank you for your question.
Based on what you’ve shared, the behavior you’re seeing is expected with the current Azure OpenAI implementation.
In Azure OpenAI, prompt caching is not currently supported for realtime endpoints, including:
- `/openai/v1/realtime/calls` (WebRTC-based connections)
- Model: `gpt-realtime-1.5`
Because of this, Azure does not:
- Write prompt tokens to cache
- Read from cache on repeated inputs
- Populate `cached_tokens` in `response.done`
- Surface cache metrics in the portal
So seeing `"cached_tokens": 0` consistently is expected in this scenario.
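As a minimal sketch of how this surfaces in practice, the snippet below parses a `response.done` event from the realtime stream and reads the cached-token count. The field names (`response.usage.input_token_details.cached_tokens`) follow the OpenAI realtime event shape and are an assumption here; on the Azure realtime endpoint this value will currently always be 0.

```python
import json

def get_cached_tokens(event_json: str) -> int:
    """Read cached_tokens from a response.done event, defaulting to 0."""
    event = json.loads(event_json)
    usage = event.get("response", {}).get("usage", {})
    details = usage.get("input_token_details", {})
    return details.get("cached_tokens", 0)

# Illustrative event payload; on Azure realtime, cached_tokens stays 0.
sample = json.dumps({
    "type": "response.done",
    "response": {
        "usage": {
            "input_tokens": 1200,
            "output_tokens": 85,
            "input_token_details": {"cached_tokens": 0},
        }
    },
})
print(get_cached_tokens(sample))  # 0
```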
Why it differs from OpenAI
Even though the model naming and API structure are similar, feature parity is not always 1:1 between OpenAI and Azure OpenAI.
- OpenAI’s realtime endpoint currently includes prompt caching optimizations.
- Azure OpenAI’s realtime endpoint is focused on low-latency streaming, and caching has not yet been enabled for this path.
Model support clarification
Prompt caching in Azure OpenAI today is limited to specific models and APIs, primarily outside of realtime scenarios.
Examples of models where caching is supported include:
- `gpt-4o` variants
- o-series models such as `o3-mini` and `o1-2024-12-17`
At this time, `gpt-realtime-1.5` is not part of the supported set for prompt caching.
What this means for your implementation
- Each turn in the realtime session is processed fully (no reuse of prior prompt tokens)
- Token usage and latency will not benefit from caching optimizations
- Azure metrics and responses will not reflect any cache activity
You’ve already covered this well, but just to confirm:
- API version should be 2024-06-14 or later
- Workspace/region setup (Canada Central) is valid
- Session handling and prompt structure look correct
There’s nothing in your current setup that would prevent caching; the limitation is at the service capability level.
Recommended steps
Depending on your requirement:
If realtime and caching are required
Continue using the OpenAI direct endpoint for now, where this behavior is supported
If staying on Azure is preferred
Consider switching to a model that supports caching, or optimize prompt usage to reduce repeated token overhead:
- Keep a rolling context window
- Avoid resending large static content
- Move reusable instructions into session initialization
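The mitigations above can be sketched as follows. This is an illustrative pattern, not SDK code: static instructions are set once at session initialization rather than resent, and only a rolling window of recent turns is included in each request. The names `Turn`, `rolling_window`, and `build_payload` are hypothetical.

```python
from dataclasses import dataclass

# Reusable instructions live in session initialization, sent once.
SYSTEM_INSTRUCTIONS = "You are a helpful voice assistant."

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def rolling_window(history: list, max_turns: int = 6) -> list:
    """Keep only the most recent turns instead of the full transcript."""
    return history[-max_turns:]

def build_payload(history: list) -> dict:
    # Static instructions are NOT repeated here; only recent turns are sent.
    return {"messages": [{"role": t.role, "content": t.content}
                         for t in rolling_window(history)]}

history = [Turn("user", f"message {i}") for i in range(10)]
payload = build_payload(history)
print(len(payload["messages"]))  # 6
```

Since no tokens are reused across turns on the Azure realtime path, trimming what you resend is the main lever for both cost and latency.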
To summarize:
- The absence of caching in your Azure OpenAI realtime scenario is expected.
- `cached_tokens = 0` is by design, not an issue.
- This is not related to region, API version, or implementation gaps on your side.
- Current limitation: prompt caching is not supported for `gpt-realtime-1.5` in Azure realtime APIs.
Please refer to these resources:
- Prompt Caching in Azure OpenAI: https://dori-uw-1.kuma-moon.com/azure/ai-services/openai/how-to/prompt-caching
- Azure OpenAI Service pricing (caching discounts): https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/
- Azure Monitor overview (to track metrics): https://dori-uw-1.kuma-moon.com/azure/azure-monitor/overview
I hope this helps. Do let me know if you have any further queries.
Thank you!