Scientists at Amazon’s Alexa division are set to present the first systematic study of the advantages of training neural text-to-speech (NTTS) systems on data from multiple speakers. In tests involving 70 listeners, a model trained on 5,000 utterances from seven different speakers produced more natural-sounding speech than one trained on 15,000 utterances from a single speaker.
“An NTTS system trained on data from seven different speakers doesn’t sound like an average of seven different voices. When we train our neural network on multiple speakers, we use a one-hot vector — a string of zeros with one one among them — to indicate which speaker provided each sample. At run time, we can select an output voice by simply passing the network the corresponding one-hot vector,” wrote Jakub Lachowicz, an applied scientist in the Alexa Speech group, in a blog post.
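The one-hot conditioning Lachowicz describes can be illustrated with a minimal sketch. This is not Amazon's code; the function names, the 7-speaker count, and the idea of appending the vector to the network's input features are assumptions for illustration only.

```python
NUM_SPEAKERS = 7  # assumption: seven speakers, as in the study described above

def speaker_one_hot(speaker_id, num_speakers=NUM_SPEAKERS):
    """Build the 'string of zeros with one one among them': a vector that
    marks which speaker provided a given training sample."""
    return [1.0 if i == speaker_id else 0.0 for i in range(num_speakers)]

def condition_input(features, speaker_id):
    """Hypothetical conditioning step: append the speaker one-hot to the
    network's per-sample input features. The same trained weights then
    produce different voices depending on which one-hot vector is passed
    at run time."""
    return list(features) + speaker_one_hot(speaker_id)

# At run time, selecting an output voice is just a matter of swapping
# the one-hot vector passed alongside the (hypothetical) input features:
conditioned = condition_input([0.2, 0.5, 0.1], speaker_id=3)
```

Because the network shares one set of weights across all speakers, each voice benefits from the full pooled training data rather than only its own utterances.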
He continued, “In our user study, we also presented listeners with live recordings of a human speaker and synthetic speech modeled on the same speaker, and asked them whether the speaker was the same. On this test, the NTTS system trained on multiple speakers fared just as well as the one trained on a single speaker. Nor did we observe any statistical difference between the naturalness of models trained on data from speakers of different genders and models trained on data from speakers of the same gender as the target speaker.”
The researchers also discovered that the models trained on multiple speakers experienced issues — such as dropping words, mumbling or “heavy glitches” — less frequently than the single-speaker models. Lachowicz added that the team’s experiments suggested that “beyond 15,000 training examples, single-speaker NTTS models will start outperforming multi-speaker models. To be sure, the NTTS version of Alexa’s current voice was trained on more than 15,000 examples. But mixed models could make it significantly easier to get new voices up and running for targeted applications.”