[Figure: WhisperSeg model structure]

WhisperSeg

WhisperSeg is the core model in the human-in-the-loop infrastructure of VoCallBase.

WhisperSeg utilizes the Whisper Transformer, pre-trained for Automatic Speech Recognition (ASR), for both human and animal Voice Activity Detection (VAD). Unlike traditional methods that detect human voice or animal vocalizations from short audio frames and rely on careful threshold selection, WhisperSeg processes the spectrogram of an entire long audio recording and generates a plain-text representation of the onset, offset, and type of each voice activity. Processing a longer audio context with a larger network greatly improves detection accuracy from only a few labeled examples. We further demonstrate a positive transfer of detection performance to new animal species, making our approach viable in the data-scarce multi-species setting.
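To make the text-based output concrete, below is a minimal sketch of how such a plain-text sequence of (onset, offset, type) triples could be decoded back into structured annotations. The `"onset offset type"` triplet layout and the `Segment` / `parse_segments` names are illustrative assumptions for demonstration only; this page does not specify the exact serialization WhisperSeg uses.

```python
# Illustrative sketch only: the "<onset> <offset> <type>" layout below is an
# assumed format, not necessarily the one WhisperSeg emits.
from dataclasses import dataclass


@dataclass
class Segment:
    onset: float    # segment start, in seconds
    offset: float   # segment end, in seconds
    cluster: str    # type of voice activity (e.g. a call or song type)


def parse_segments(text: str) -> list[Segment]:
    """Parse a plain-text sequence of (onset, offset, type) triples."""
    segments = []
    for triplet in text.strip().split(";"):
        if not triplet.strip():
            continue
        onset, offset, cluster = triplet.split()
        segments.append(Segment(float(onset), float(offset), cluster))
    return segments


# Example: three detected vocalizations in a long recording
print(parse_segments("0.32 0.51 call; 1.04 1.27 call; 2.80 3.15 song"))
```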

For more details, please refer to the paper:

Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection

Check out the model in our GitHub repository.