ConVoice: Real-Time Zero-Shot Voice Style Transfer
Voice Conversion Challenge 2018
The following samples were generated by the ConVoice model. Some were produced in a zero-shot setting, where the model has seen neither the source nor the target speaker before. The others were synthesized with the model fine-tuned on the Voice Conversion Challenge 2018 training dataset, which contains about 5 minutes of audio per speaker.
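Source Speech | Target Speaker | Conversion
---|---|---
VCC2SF1 (Female) | VCC2TF1 (Female) | Zero-Shot
VCC2SF1 (Female) | VCC2TF1 (Female) | Fine-tuned
VCC2SF1 (Female) | VCC2TM1 (Male) | Zero-Shot
VCC2SF1 (Female) | VCC2TM1 (Male) | Fine-tuned
VCC2SF1 (Female) | VCC2TF2 (Female) | Zero-Shot
VCC2SF1 (Female) | VCC2TF2 (Female) | Fine-tuned
VCC2SF1 (Female) | VCC2TM2 (Male) | Zero-Shot
VCC2SF1 (Female) | VCC2TM2 (Male) | Fine-tuned
VCC2SM1 (Male) | VCC2TF1 (Female) | Zero-Shot
VCC2SM1 (Male) | VCC2TF1 (Female) | Fine-tuned
VCC2SM1 (Male) | VCC2TM1 (Male) | Zero-Shot
VCC2SM1 (Male) | VCC2TM1 (Male) | Fine-tuned
VCC2SM1 (Male) | VCC2TF2 (Female) | Zero-Shot
VCC2SM1 (Male) | VCC2TF2 (Female) | Fine-tuned
VCC2SM1 (Male) | VCC2TM2 (Male) | Zero-Shot
VCC2SM1 (Male) | VCC2TM2 (Male) | Fine-tuned
VCC2SF3 (Female) | VCC2TF1 (Female) | Zero-Shot
VCC2SF3 (Female) | VCC2TF1 (Female) | Fine-tuned
VCC2SF3 (Female) | VCC2TM1 (Male) | Zero-Shot
VCC2SF3 (Female) | VCC2TM1 (Male) | Fine-tuned
VCC2SF3 (Female) | VCC2TF2 (Female) | Zero-Shot
VCC2SF3 (Female) | VCC2TF2 (Female) | Fine-tuned
VCC2SF3 (Female) | VCC2TM2 (Male) | Zero-Shot
VCC2SF3 (Female) | VCC2TM2 (Male) | Fine-tuned
VCC2SM3 (Male) | VCC2TF1 (Female) | Zero-Shot
VCC2SM3 (Male) | VCC2TF1 (Female) | Fine-tuned
VCC2SM3 (Male) | VCC2TM1 (Male) | Zero-Shot
VCC2SM3 (Male) | VCC2TM1 (Male) | Fine-tuned
VCC2SM3 (Male) | VCC2TF2 (Female) | Zero-Shot
VCC2SM3 (Male) | VCC2TF2 (Female) | Fine-tuned
VCC2SM3 (Male) | VCC2TM2 (Male) | Zero-Shot
VCC2SM3 (Male) | VCC2TM2 (Male) | Fine-tuned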
LibriTTS
Our decoder was trained on the "train-clean-100" and "train-clean-360" subsets of the LibriTTS dataset. The samples below were generated from randomly chosen source and target audio in the "test" set, which the model has never seen.
Source Speech | Target Speaker | Conversion
---|---|---
4507 (Female) | 8224 (Male) | Zero-Shot
8230 (Male) | 1580 (Female) | Zero-Shot
7729 (Male) | 8455 (Male) | Zero-Shot
8463 (Female) | 3570 (Female) | Zero-Shot