Hi Quynh Huynh,
You are correct: audio used to train a custom speech model must be in WAV format. The requirements are documented here:
https://dori-uw-1.kuma-moon.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train#audio-data-for-training-or-testing
However, the captioning workflow you mentioned accepts MP4 files directly, so you do not need to convert them for day-to-day use.
You might also consider using the Whisper model available in Azure AI Foundry as an alternative.
Let me know if you have any further questions.