ConVoice: Real-Time Zero-Shot Voice Style Transfer

Voice Conversion Challenge 2018

The following samples were generated by the ConVoice model. Some were produced in a zero-shot setting, in which the model has not previously seen the source or target speaker; others were synthesized with the model fine-tuned on the Voice Conversion Challenge 2018 training dataset, where each speaker has about 5 minutes of audio.


Source speaker	Target speaker	Setting
VCC2SF1 (Female)	VCC2TF1 (Female)	Zero-Shot
	VCC2TF1 (Female)	Fine-tuned
	VCC2TM1 (Male)	Zero-Shot
	VCC2TM1 (Male)	Fine-tuned
	VCC2TF2 (Female)	Zero-Shot
	VCC2TF2 (Female)	Fine-tuned
	VCC2TM2 (Male)	Zero-Shot
	VCC2TM2 (Male)	Fine-tuned
VCC2SM1 (Male)	VCC2TF1 (Female)	Zero-Shot
	VCC2TF1 (Female)	Fine-tuned
	VCC2TM1 (Male)	Zero-Shot
	VCC2TM1 (Male)	Fine-tuned
	VCC2TF2 (Female)	Zero-Shot
	VCC2TF2 (Female)	Fine-tuned
	VCC2TM2 (Male)	Zero-Shot
	VCC2TM2 (Male)	Fine-tuned
VCC2SF3 (Female)	VCC2TF1 (Female)	Zero-Shot
	VCC2TF1 (Female)	Fine-tuned
	VCC2TM1 (Male)	Zero-Shot
	VCC2TM1 (Male)	Fine-tuned
	VCC2TF2 (Female)	Zero-Shot
	VCC2TF2 (Female)	Fine-tuned
	VCC2TM2 (Male)	Zero-Shot
	VCC2TM2 (Male)	Fine-tuned
VCC2SM3 (Male)	VCC2TF1 (Female)	Zero-Shot
	VCC2TF1 (Female)	Fine-tuned
	VCC2TM1 (Male)	Zero-Shot
	VCC2TM1 (Male)	Fine-tuned
	VCC2TF2 (Female)	Zero-Shot
	VCC2TF2 (Female)	Fine-tuned
	VCC2TM2 (Male)	Zero-Shot
	VCC2TM2 (Male)	Fine-tuned

LibriTTS

Our decoder was trained on the "train-clean-100" and "train-clean-360" subsets of the LibriTTS dataset. The samples below, however, were generated using randomly chosen source and target audio from the "test" set, which the model has never seen.


Source speaker	Target speaker	Setting
4507 (Female)	8224 (Male)	Zero-Shot
8230 (Male)	1580 (Female)	Zero-Shot
7729 (Male)	8455 (Male)	Zero-Shot
8463 (Female)	3570 (Female)	Zero-Shot