IBM’s Speech Recognition
There aren’t too many good-news stories coming out of Iraq, but here’s one. The U.S. military is bridging the communications gap between its soldiers and Iraqis by tapping some innovative speech recognition technology from IBM Research (IBM). Using a laptop computer or PDA, soldiers speak into a microphone and the software translates what they say in English into Arabic. Iraqi soldiers or civilians see and hear the words in Arabic, and their answers are immediately translated into English. About 10,000 of these systems are in use in the battle zone.
But what’s a boon for the U.S. military highlights a conundrum for IBM Research, which provides the technology gratis. When the military selected speech recognition technology for a new medical records network, it chose an offering from the market leader, Burlington (Mass.)-based Nuance Communications (NUAN). For all of IBM’s expertise and resources, the 3,000 or so scientists in its basic research facilities worldwide face a major challenge in shepherding their innovations from the lab into the marketplace.
Partnering Up
David Nahamoo, the chief technology officer for IBM Research’s speech and translation division, is out to change that. On Aug. 18, Nahamoo announced a new strategy at SpeechTEK 2008, a gathering of the leaders in the speech recognition industry in New York City. Rather than trying to push its technology mainly through IBM’s product and services divisions, the speech research group is focusing on forming partnerships with other companies to take the technology to market. Partners include Vlingo, the company that provides speech services for Yahoo! oneSearch (YHOO); PhoneTag, which converts mobile voice mail to text; and Jajah, which offers real-time phone translation between English and Mandarin. “We can find partners, spread the risk, and improve our ability to address these markets,” says Nahamoo.
IBM has been performing research into speech recognition for four decades. Some of the technology has found its way into products sold by the company’s software and services business, notably in the auto industry. But the technology hasn’t had the kind of impact that Nahamoo and his bosses believe is possible, in applications including autos, mobile phones, call centers, medical systems, and transcription services. The issue for IBM? That each of these applications on its own represents a relatively small market. That’s why IBM needs partners who are experts in different niches. “This new strategy gives very talented people in IBM an outlet for their work,” says William Meisel, president of technology consulting firm TMA Associates.
A Combined Technology
Overall, demand for speech recognition technology is expected to rise dramatically over the next few years as people use their mobile phones as all-purpose lifestyle devices (so barking "find pizza" into your phone would load directions to the nearest pizza parlor). In-car entertainment and navigation systems are increasingly controlled by voice commands. This growth in adoption is being fueled by steady improvements in speech recognition accuracy.
Speech recognition isn’t one technology but several combined. You start building a voice recognition engine by recording words, phrases, and sentences, and putting them in a database. Then you create a library of the specific pronunciations of the different words to be recognized. Then you map the sounds in the recordings to the word pronunciations. Last, you build a large table of the most commonly occurring patterns of words people are likely to speak. Algorithms are created that combine all these sources of information to come up with the right answer in a specific situation. In the past few years, scientists at IBM and elsewhere have been learning how to adapt their voice recognition engines more quickly to a specific person or sound environment. Nuance’s newly released Dragon NaturallySpeaking 10 PC speech recognition software translates speech into text with up to 99% accuracy.
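To make the idea of combining these sources of information concrete, here is a minimal sketch in Python. It is not IBM's or Nuance's method: the tiny corpus, the candidate phrases, and the acoustic scores are all made up for illustration, and the pronunciation library is left out, with a single invented acoustic score standing in for how well the sounds match each candidate. The sketch builds a small table of word-pair patterns and uses it to choose between two phrases that sound nearly alike.

```python
import math
from collections import Counter

# Hypothetical toy data; none of this comes from a real speech system.
CORPUS = [
    "find pizza near me",
    "find pizza",
    "find parking near me",
    "call home",
]

def train_bigrams(sentences):
    """Count word pairs, with <s> marking the start of each sentence."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, len(vocab)

def language_log_prob(words, model):
    """Add-one smoothed bigram log-probability of a word sequence."""
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for prev, word in zip(["<s>"] + words, words):
        numerator = pair_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

def decode(candidates, model, lm_weight=1.0):
    """Pick the candidate with the best combined acoustic + language score."""
    def combined(candidate):
        words, acoustic_log_prob = candidate
        return acoustic_log_prob + lm_weight * language_log_prob(words, model)
    return max(candidates, key=combined)[0]

model = train_bigrams(CORPUS)

# Each candidate: (word sequence, made-up acoustic log-probability).
candidates = [
    (["fined", "pizza"], math.log(0.55)),  # slightly better acoustic match
    (["find", "pizza"], math.log(0.45)),   # far more likely word pattern
]

best = decode(candidates, model)
print(" ".join(best))
```

On sound alone the first candidate would win, but the word-pattern table makes "find pizza" the more plausible phrase overall. Setting `lm_weight` to 0 shows what happens when the recognizer ignores the table and trusts only the acoustic match.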
Nuance and Vlingo
Nuance is the giant of the speech recognition industry, with products for nearly every niche. Annual sales are expected to top $900 million this year. Steve Chambers, president of the company’s mobile speech and consumer-services division, says this breadth of experience has made it possible for the company to collect a huge treasure trove of speech samples from people with different languages and accents, which helps it improve its technology rapidly. “The technology is unlike others in research land. It has to be used to improve. The name of the game is scale and usage,” he says.
Even without Nuance’s scale in this field, IBM Research has managed to produce very effective speech recognition software. Vlingo evaluated IBM’s technology against Nuance’s and a couple of others. Dave Grannan, Vlingo’s chief executive, says IBM had the best combination of speed of processing and accuracy in his company’s tests. Another attraction: He didn’t fear that IBM might some day decide to get into his business. Nuance, on the other hand, competes with Vlingo. “Because IBM Research is not a go-to-market part of IBM, there wasn’t a competitive issue with them,” he says.
“Spoken Web”
Nahamoo’s group is focusing on commercial opportunities right now. But IBM researchers are also exploring areas where the social impact could be huge. One example, spearheaded by scientists in India, is what IBM calls the “spoken Web.” In a handful of villages in the state of Andhra Pradesh, the company is helping locals create Web pages and search the Web purely with voice. A plumber or farmer goes to a kiosk with mobile phones and builds a Web page promoting his or her products, produce, or services by speaking the answers to 10 or so questions. Then other villagers can use a mobile phone to speak commands to search for those Web sites; they hear the search results, rather than see them.
If successful, the technology could help open up the Internet to the world’s hundreds of millions of illiterate people. “It has the potential to transform these regions,” says Paul Bloom, IBM Research’s business executive for the communications sector.