While we were all away enjoying some much-needed downtime from the surge of machine learning and artificial intelligence in the workplace, Google released a paper detailing a new text-to-speech system called Tacotron 2.
The headline claim is striking: the search engine turned global R&D house for all things innovation says the system achieves “near-human” accuracy when generating speech from text. Integrated into the right AI system, it could be the final step in turning text data into natural-sounding voice.
The system first converts text into a spectrogram, a visual representation of audio frequencies over time (commonly seen in audio editing tools). That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio.
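To make the middle step concrete, here is a minimal sketch of what a spectrogram actually is, using only NumPy. This is an illustration of the general technique, not Google's implementation: Tacotron 2 itself predicts mel-scaled spectrograms with a neural network, whereas this computes a plain magnitude spectrogram directly from a waveform via a short-time Fourier transform. The function name and parameters are ours, chosen for clarity.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Each column holds the frequency content of one windowed slice
    of the signal; stacked side by side, the columns form the
    time-frequency "chart" that a vocoder like WaveNet consumes.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft over each frame; rows = frequency bins, columns = time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 440 Hz tone sampled at 16 kHz: its energy should
# concentrate in the frequency bin nearest 440 Hz.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_hz = spec.mean(axis=1).argmax() * sr / 256  # bin index -> Hz
print(peak_hz)  # close to 440, within one bin (62.5 Hz)
```

Reading the chart in the other direction, from spectrogram back to waveform, is the hard part, and that is precisely the job WaveNet performs in the Tacotron 2 pipeline.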
The amazing thing about the technology is the accuracy with which it mimics the human voice. The examples above are among many released with the report. The producers have not disclosed which track is synthesized and which is human, and listeners will struggle to tell the difference.
Tacotron 2 represents a leap in one of the bigger short-term challenges for social robotics: making robots sound less “robotic” and more human. The uptake and application of the technology will depend not just on the price point but on how customers eventually engage with robots. Adding this layer of human touch to a critical interface will go a long way toward bridging the gap between the human and the robot.
WaveNet was first announced in 2016 and is already being used by Google to generate the voice of Google Assistant.