it can still take plenty of training time and resources to produce natural-sounding output. Microsoft and Chinese researchers might have a more effective way.
They’ve crafted a text-to-speech AI that can generate realistic speech using just 200 voice samples (about 20 minutes’ worth) and matching transcriptions.
The system relies in part on Transformers, a type of deep neural network built around an attention mechanism and, like other neural networks, loosely modeled on neurons in the brain.
Transformers weigh each element of the input and output against the others on the fly, much like synaptic links, which helps them process even lengthy sequences, such as a complex sentence, efficiently. Combine that with a noise-removing encoder component and the AI can do a lot with relatively little training data.
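For readers curious what that on-the-fly weighing looks like in practice, here is a minimal sketch of the attention step at the heart of a Transformer. It is an illustration only, not the researchers' model: the function names and the toy three-item sequence are made up for this example, and a real Transformer would add learned projections, multiple attention heads and many stacked layers.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Toy self-attention: each position is rebuilt as a weighted mix of all positions.

    Assumption for illustration: queries, keys and values are the raw input
    vectors themselves (no learned projections).
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position (scaled dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all positions according to their weights.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy sequence: three positions, two features each.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sequence)
```

Each row of `weights` says how much one position draws on every other position. Because all positions are compared in a single pass, a distant word in a long sentence can influence the result as directly as a neighboring one.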