Microsoft Translates Your Speech, Voice Into Chinese
November 10, 2012

Microsoft Software Provides Instant Chinese Translation In User’s Own Voice

Rebecca Darrah for — Your Universe Online

Microsoft unveiled a new translation technology this week that converts spoken English into spoken Chinese nearly instantaneously.

The system then pumps the translation through speakers in the user's own voice, preserving both the intonation and cadence.

The technology is based on a technique known as Deep Neural Networks (DNN), which is patterned after human brain behavior and is used to build more accurate speech recognizers. This differs from conventional translation technology that uses the “hidden Markov modeling technique,” which bases translation on training data from many speakers.

Microsoft Chief Research Officer Rick Rashid described the breakthrough in a blog post on Thursday.

“In the realm of natural user interfaces, the single most important one, yet also one of the most difficult for computers, is that of human speech,” he wrote.

“We have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice.”

“While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979, and as we add more data to the training we believe that we will get even better results.”

Rashid's remarks were driven, in part, by the significant attention he received following a presentation he gave last month in China using the new technology.

During the final few minutes of the presentation, his words were nearly instantaneously translated into Chinese by piping the spoken English through Microsoft's translation system, which then pumped out a machine-generated version of his words in his spoken style.

Rashid said the new technology was made possible by work done in Microsoft's labs that built upon previous breakthroughs. That earlier work did away with the pattern-matching approach of early speech translation systems in favor of statistical models that better captured the full range of human vocal ability.

Gains in processing power raised accuracy further, although error rates remained at about 20-25%, Rashid said.

But two years ago, Microsoft researchers working with scientists at the University of Toronto improved the translation even further by using deep neural networks that recognized sounds in a manner similar to that of a human brain.

Applying this technology to speech translation reduced the error rates to about 15%, Rashid said. These rates will likely fall even further as the networks train for longer periods of time, he added.
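
To put those figures in perspective, the drop from 20-25% to about 15% can be expressed as a relative reduction in errors. The short Python sketch below does only that arithmetic on the rates quoted above; the function name is an illustrative assumption and is not part of anything Microsoft has described.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Return the fraction of errors eliminated when moving from old_rate to new_rate."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate


# Error rates quoted in the article: 20-25% before, about 15% after.
for old in (0.20, 0.25):
    cut = relative_reduction(old, 0.15)
    print(f"{old:.0%} -> 15%: about {cut:.0%} fewer errors")
```

On the quoted numbers, this works out to roughly a quarter to two-fifths fewer errors, depending on which end of the earlier range is used as the starting point.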

During Rashid's presentation in China, the audio of his speech was first transcribed into English text, which was then translated into Chinese and reordered to make sense. Finally, the Chinese characters were piped through a text-to-speech system to come out sounding like Rashid.
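
The sequence of stages described here (speech to English text, English to Chinese, reordering, then text-to-speech) can be sketched as a simple chain of functions. The Python below is a toy illustration only: every function, the two-word dictionary, and the use of plain bytes as stand-in "audio" are assumptions made for the example, and none of it reflects Microsoft's actual software.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineResult:
    english_text: str
    chinese_words: List[str]
    audio_out: bytes


def run_pipeline(
    audio_in: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], List[str]],
    reorder: Callable[[List[str]], List[str]],
    synthesize: Callable[[List[str]], bytes],
) -> PipelineResult:
    """Chain the four stages in the order the article describes."""
    english_text = recognize(audio_in)           # speech -> English text
    chinese_words = translate(english_text)      # English -> Chinese words
    chinese_words = reorder(chinese_words)       # fix word order for Chinese
    audio_out = synthesize(chinese_words)        # Chinese text -> speech
    return PipelineResult(english_text, chinese_words, audio_out)


# Hypothetical stand-ins so the sketch runs; real stages would be trained models.
TOY_DICTIONARY = {"hello": "你好", "world": "世界"}


def toy_recognize(audio: bytes) -> str:
    # Pretend the "audio" is already UTF-8 text.
    return audio.decode("utf-8")


def toy_translate(text: str) -> List[str]:
    # Word-by-word lookup; unknown words pass through unchanged.
    return [TOY_DICTIONARY.get(word, word) for word in text.lower().split()]


def toy_reorder(words: List[str]) -> List[str]:
    # Placeholder: a real system would rearrange words here.
    return list(words)


def toy_synthesize(words: List[str]) -> bytes:
    # Placeholder "speech": the joined text encoded as bytes.
    return "".join(words).encode("utf-8")


result = run_pipeline(
    b"hello world", toy_recognize, toy_translate, toy_reorder, toy_synthesize
)
print(result.english_text)
print(result.chinese_words)
```

Because each stage feeds the next, a mistake made early on, such as a misheard English word, carries through to the Chinese output, which is one reason errors can appear in both the English text and the translation.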

"Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous," Rashid wrote.

"Still, the technology has developed to be quite useful."

Microsoft is not the only company working on new voice translation technology. AT&T, Google and other tech firms have similar projects under way that seek to perform simultaneous translation. NTT Docomo has already demonstrated a smartphone app that lets Japanese people talk with foreigners with both speakers using their native tongue.

Rashid said he believes the DNN technology could ultimately reduce language barriers worldwide.

"The results are still not perfect, and there is still much work to be done, but the technology is very promising, and we hope that in a few years we will have systems that can completely break down language barriers."