ConVoice: Real-Time Zero-Shot Voice Style Transfer
Voice Conversion Challenge 2018
The following samples were generated by the ConVoice model. Some were produced in a zero-shot setting, where the model has not seen the source or target speaker before. The others were synthesized with the model fine-tuned on the Voice Conversion Challenge 2018 training dataset, which contains about 5 minutes of audio per speaker.
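| Source Speech | Target Speaker | Conversion |
|---|---|---|
| VCC2SF1 (Female) | VCC2TF1 (Female) | Zero-Shot |
| VCC2SF1 (Female) | VCC2TF1 (Female) | Fine-tuned |
| VCC2SF1 (Female) | VCC2TM1 (Male) | Zero-Shot |
| VCC2SF1 (Female) | VCC2TM1 (Male) | Fine-tuned |
| VCC2SF1 (Female) | VCC2TF2 (Female) | Zero-Shot |
| VCC2SF1 (Female) | VCC2TF2 (Female) | Fine-tuned |
| VCC2SF1 (Female) | VCC2TM2 (Male) | Zero-Shot |
| VCC2SF1 (Female) | VCC2TM2 (Male) | Fine-tuned |
| VCC2SM1 (Male) | VCC2TF1 (Female) | Zero-Shot |
| VCC2SM1 (Male) | VCC2TF1 (Female) | Fine-tuned |
| VCC2SM1 (Male) | VCC2TM1 (Male) | Zero-Shot |
| VCC2SM1 (Male) | VCC2TM1 (Male) | Fine-tuned |
| VCC2SM1 (Male) | VCC2TF2 (Female) | Zero-Shot |
| VCC2SM1 (Male) | VCC2TF2 (Female) | Fine-tuned |
| VCC2SM1 (Male) | VCC2TM2 (Male) | Zero-Shot |
| VCC2SM1 (Male) | VCC2TM2 (Male) | Fine-tuned |
| VCC2SF3 (Female) | VCC2TF1 (Female) | Zero-Shot |
| VCC2SF3 (Female) | VCC2TF1 (Female) | Fine-tuned |
| VCC2SF3 (Female) | VCC2TM1 (Male) | Zero-Shot |
| VCC2SF3 (Female) | VCC2TM1 (Male) | Fine-tuned |
| VCC2SF3 (Female) | VCC2TF2 (Female) | Zero-Shot |
| VCC2SF3 (Female) | VCC2TF2 (Female) | Fine-tuned |
| VCC2SF3 (Female) | VCC2TM2 (Male) | Zero-Shot |
| VCC2SF3 (Female) | VCC2TM2 (Male) | Fine-tuned |
| VCC2SM3 (Male) | VCC2TF1 (Female) | Zero-Shot |
| VCC2SM3 (Male) | VCC2TF1 (Female) | Fine-tuned |
| VCC2SM3 (Male) | VCC2TM1 (Male) | Zero-Shot |
| VCC2SM3 (Male) | VCC2TM1 (Male) | Fine-tuned |
| VCC2SM3 (Male) | VCC2TF2 (Female) | Zero-Shot |
| VCC2SM3 (Male) | VCC2TF2 (Female) | Fine-tuned |
| VCC2SM3 (Male) | VCC2TM2 (Male) | Zero-Shot |
| VCC2SM3 (Male) | VCC2TM2 (Male) | Fine-tuned |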
LibriTTS
Our decoder was trained on the "train-clean-100" and "train-clean-360" subsets of the LibriTTS dataset. The samples below were generated from randomly chosen source and target audio in the "test" set, which the model never saw during training.
| Source Speech | Target Speaker | Conversion |
|---|---|---|
| 4507 (Female) | 8224 (Male) | Zero-Shot |
| 8230 (Male) | 1580 (Female) | Zero-Shot |
| 7729 (Male) | 8455 (Male) | Zero-Shot |
| 8463 (Female) | 3570 (Female) | Zero-Shot |