How to use a custom speech model with speech-to-text captioning?

Question

Quynh Huynh (NON EA SC ALT) 40 Microsoft Employee

The steps for training a custom speech model, and most of the related documentation, assume audio in WAV format. For our use case, we'd like to use the custom speech model to generate caption files for MP4 videos. Are there any suggestions on how this could work? Converting MP4 to WAV seems redundant if Speech Studio also supports a captioning solution.

Thank you!

Quynh Huynh (NON EA SC ALT) 40 Reputation points Microsoft Employee

2025-12-02T19:03:24.66+00:00

Hi Aryan,

From this documentation, it looks like there are no additional parameters for generating captions for MP4 files with an existing trained custom speech model. Since our custom model may be trained on WAV files, what's the best approach to avoid redundant steps, such as converting video to audio, when generating caption text with an Azure Speech resource? Alternatively, what would be the ideal workflow to ensure that captions for a video can be generated directly from the trained speech model?

Regarding the Whisper model suggestion, I see that it does accept MP4 files, which is brilliant! Are there any options for improving or customizing the transcription quality of the Whisper model as well?

The path we originally tried was using custom models from Video Indexer (https://dori-uw-1.kuma-moon.com/en-us/azure/azure-video-indexer/customize-language-model-how-to?tabs=customizewebportal), but we ran into limitations with line lengths for certain languages.

Answer 1

Hi Quynh Huynh,

You are correct; training a custom model requires WAV files. Here is the documentation:
https://dori-uw-1.kuma-moon.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train#audio-data-for-training-or-testing
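If you do need WAV files for the training dataset, one common way to get them from existing videos is to extract the audio track with ffmpeg. The sketch below is only an illustration, not an official workflow: it assumes ffmpeg is installed and on the PATH, the helper names `build_ffmpeg_command` and `extract_wav` are made up for this example, and the 16 kHz mono 16-bit PCM settings are an assumption that should be checked against the audio requirements in the documentation linked above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that extracts mono 16-bit PCM WAV audio.

    The sample rate and channel count are assumptions; confirm them
    against the custom speech data requirements before training.
    """
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(video_path),    # input video (for example an MP4)
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # mono
        str(wav_path),
    ]


def extract_wav(video_path, wav_path=None):
    """Run ffmpeg and return the path of the WAV file it wrote."""
    video_path = Path(video_path)
    wav_path = Path(wav_path) if wav_path else video_path.with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return wav_path
```

Calling `extract_wav("talk.mp4")` would write `talk.wav` next to the video. This conversion is only needed when preparing training data, not for every video you want to caption.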

However, for the captioning workflow you mentioned, you can input MP4 files directly without converting them, which avoids unnecessary steps in daily use.
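Whichever recognizer produces the transcript, writing the caption file itself is a small formatting step. As a rough sketch, assuming you already have recognized phrases as (offset, duration, text) tuples with times in 100-nanosecond ticks (the unit the Speech SDK uses for result offsets and durations), the following turns them into SRT text. The function names `ticks_to_srt_time` and `phrases_to_srt` are hypothetical and not part of any Azure library.

```python
TICKS_PER_MS = 10_000  # one tick is 100 nanoseconds


def ticks_to_srt_time(ticks):
    """Convert 100-ns ticks to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks // TICKS_PER_MS
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def phrases_to_srt(phrases):
    """Format (offset_ticks, duration_ticks, text) tuples as SRT text."""
    blocks = []
    for index, (offset, duration, text) in enumerate(phrases, start=1):
        start = ticks_to_srt_time(offset)
        end = ticks_to_srt_time(offset + duration)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)
```

For example, `phrases_to_srt([(0, 15_000_000, "Hello")])` yields a single caption that runs from 00:00:00,000 to 00:00:01,500.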

You might also consider using the Whisper model available in Azure AI Foundry as an alternative.

Let me know if you have any further questions.
