ConVoice: Real-Time Zero-Shot Voice Style Transfer

Yurii Rebryk, Stanislav Beliaev

Voice Conversion Challenge 2018

The following samples were generated by the ConVoice model. Some were produced in a zero-shot setting, where the model has seen neither the source nor the target speaker before. The others were synthesized with the model fine-tuned on the Voice Conversion Challenge 2018 training dataset, which contains about 5 minutes of audio per speaker.

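Source Speech     Target Speaker    Conversion
VCC2SF1 (Female)  VCC2TF1 (Female)  Zero-Shot
VCC2SF1 (Female)  VCC2TF1 (Female)  Fine-tuned
VCC2SF1 (Female)  VCC2TM1 (Male)    Zero-Shot
VCC2SF1 (Female)  VCC2TM1 (Male)    Fine-tuned
VCC2SF1 (Female)  VCC2TF2 (Female)  Zero-Shot
VCC2SF1 (Female)  VCC2TF2 (Female)  Fine-tuned
VCC2SF1 (Female)  VCC2TM2 (Male)    Zero-Shot
VCC2SF1 (Female)  VCC2TM2 (Male)    Fine-tuned
VCC2SM1 (Male)    VCC2TF1 (Female)  Zero-Shot
VCC2SM1 (Male)    VCC2TF1 (Female)  Fine-tuned
VCC2SM1 (Male)    VCC2TM1 (Male)    Zero-Shot
VCC2SM1 (Male)    VCC2TM1 (Male)    Fine-tuned
VCC2SM1 (Male)    VCC2TF2 (Female)  Zero-Shot
VCC2SM1 (Male)    VCC2TF2 (Female)  Fine-tuned
VCC2SM1 (Male)    VCC2TM2 (Male)    Zero-Shot
VCC2SM1 (Male)    VCC2TM2 (Male)    Fine-tuned
VCC2SF3 (Female)  VCC2TF1 (Female)  Zero-Shot
VCC2SF3 (Female)  VCC2TF1 (Female)  Fine-tuned
VCC2SF3 (Female)  VCC2TM1 (Male)    Zero-Shot
VCC2SF3 (Female)  VCC2TM1 (Male)    Fine-tuned
VCC2SF3 (Female)  VCC2TF2 (Female)  Zero-Shot
VCC2SF3 (Female)  VCC2TF2 (Female)  Fine-tuned
VCC2SF3 (Female)  VCC2TM2 (Male)    Zero-Shot
VCC2SF3 (Female)  VCC2TM2 (Male)    Fine-tuned
VCC2SM3 (Male)    VCC2TF1 (Female)  Zero-Shot
VCC2SM3 (Male)    VCC2TF1 (Female)  Fine-tuned
VCC2SM3 (Male)    VCC2TM1 (Male)    Zero-Shot
VCC2SM3 (Male)    VCC2TM1 (Male)    Fine-tuned
VCC2SM3 (Male)    VCC2TF2 (Female)  Zero-Shot
VCC2SM3 (Male)    VCC2TF2 (Female)  Fine-tuned
VCC2SM3 (Male)    VCC2TM2 (Male)    Zero-Shot
VCC2SM3 (Male)    VCC2TM2 (Male)    Fine-tuned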
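
The table in this section covers every combination of four source speakers, four target speakers, and two conversion settings. As a minimal sketch (not part of the ConVoice code; the names SOURCES, TARGETS, SETTINGS, and sample_grid are hypothetical and chosen only for illustration), the same grid can be enumerated like this:

```python
from itertools import product

# Speaker IDs and settings copied from the table in this section.
# The variable and function names are illustrative assumptions.
SOURCES = ["VCC2SF1", "VCC2SM1", "VCC2SF3", "VCC2SM3"]
TARGETS = ["VCC2TF1", "VCC2TM1", "VCC2TF2", "VCC2TM2"]
SETTINGS = ["Zero-Shot", "Fine-tuned"]


def sample_grid():
    """Return every (source, target, setting) combination in table order."""
    return list(product(SOURCES, TARGETS, SETTINGS))


grid = sample_grid()
print(len(grid))  # 4 sources x 4 targets x 2 settings = 32
```

Each tuple corresponds to one row of the table, so the grid holds 32 samples in total.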

LibriTTS

Our decoder was trained on the "train-clean-100" and "train-clean-360" subsets of the LibriTTS dataset. The samples below were generated from randomly chosen source and target audio in the "test" set, which the model never saw during training.

Source Speech   Target Speaker  Conversion
4507 (Female)   8224 (Male)     Zero-Shot
8230 (Male)     1580 (Female)   Zero-Shot
7729 (Male)     8455 (Male)     Zero-Shot
8463 (Female)   3570 (Female)   Zero-Shot