Microsoft has just open-sourced a cutting-edge voice AI that can process 60 minutes of audio in a single pass.
Upload a recording and it identifies each speaker, timestamps each word, and outputs a complete structured transcript annotated with who said what and when.
It also offers real-time TTS with a first-audio latency of only 300 milliseconds, and supports over 50 languages.
100% open source.
Link: github.com/microsoft/VibeVoice...…