November 18, 2005

New Software Converts Podcasts to Text

BOSTON -- Suddenly the universe of downloadable audio files known as podcasts seems as enormous as the Internet. Name a topic - from the weather in Asuncion to the ZigBee wireless technology - and there is a podcast about it.

But while the Internet's vastness is accessible because of deep-probing search engines, comparably authoritative services for podcasts and other multimedia haven't really emerged.

That's because search programs are primed to catalog text. When they encounter an audio or video file, generally they determine the contents by reading the titles and other descriptive tags, known as "metadata," that creators voluntarily add.

It's useful, but much like examining only the first few lines of a Web site. Reading the whole thing is a lot better.

With that in mind, a few companies are trying to make search engines actually listen to big audio and video files. From there, speech-to-text software can generate written transcripts, which are searched in addition to metadata.
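To make the difference concrete, here is a minimal sketch of a search that checks both creator-supplied metadata and a time-stamped transcript. It is not how any of the services reviewed here is actually built. The `Episode` class, the `search` function and the sample data are all hypothetical, and the transcript is assumed to arrive as (start time, text) pairs from some speech-to-text engine.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    title: str        # metadata supplied by the creator
    tags: list[str]   # more creator-supplied metadata
    # (start_seconds, text) pairs, as a speech-to-text engine might emit them
    transcript: list[tuple[float, str]] = field(default_factory=list)


def search(episodes, term):
    """Return (title, source, start_seconds) for every place the term appears."""
    term = term.lower()
    hits = []
    for ep in episodes:
        # Metadata-only search engines stop after this check.
        metadata = " ".join([ep.title, *ep.tags]).lower()
        if term in metadata:
            hits.append((ep.title, "metadata", None))
        # Transcript search also looks at what was actually said, and when.
        for start, text in ep.transcript:
            if term in text.lower():
                hits.append((ep.title, "transcript", start))
    return hits


# Hypothetical sample data for illustration only.
episodes = [
    Episode("Wireless Weekly", ["zigbee", "radio"],
            [(0.0, "welcome to the show"), (42.5, "parents ask about colic a lot")]),
    Episode("Baby Talk", ["colic", "sleep"], [(10.0, "tonight we cover bedtime")]),
]

print(search(episodes, "colic"))
```

A metadata-only search would find just the second show. The transcript lookup also surfaces the first, along with the moment (42.5 seconds in) at which the word was spoken.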

Perhaps best known has been Blinkx Inc., an information-management startup that gets its speech-to-text software from Autonomy Corp.

Now comes BBN Technologies Inc., a defense contractor that developed elements of the Internet. After tinkering with speech-to-text programs it created for U.S. intelligence services, BBN has produced Podzinger, a Web service that mines the content of podcasts.

A third service, Podscope, from a broadcast-monitoring company called TV Eyes Inc., performs a similar trick, but with a twist. CEO David Ives says Podscope uses some voice-recognition technology but mainly scans for phonemes - the individual sounds that make up syllables - rather than full words.

America Online Inc. is a big fan - it's due to begin using Podscope as its podcasting search engine this fall.

I tried all three, and found BBN's Podzinger best at podcast searches because it offered the most user-friendly options.

- Podzinger lets you expand the links in search results to read a podcast's metadata, so you can quickly tell what kind of show it is. Podscope does, too; Blinkx does not.

- Podzinger lets you stream a podcast if you don't feel like taking on a time-consuming download. Podscope handles that too; Blinkx seems to do it only for video clips.

- The results displayed by Podzinger helpfully include segments of the transcript that contain the terms you were looking for. Then, by clicking on the transcript, you can instantly play a sample of the file from that moment.

That turns out to be the big differentiator, in my view.

Podscope also lets you jump to moments in which it believes your term is mentioned. But you have to spend time listening to each snippet because the phoneme engine doesn't produce a transcript you can visually scan.

Blinkx shows a transcript, but you still have to cue up a clip from the beginning and find on your own the moment you think your subject might come up.

Blinkx appears to search the biggest pool of material - not only 45,000 podcasts but also millions of hours of TV broadcasts and homegrown video clips, which are displayed cleverly in thumbnail images alongside search results. This week Blinkx added lectures from Harvard, Princeton and other universities.

Podscope also indexes about 45,000 podcasts, while Podzinger catalogs only about 11,000. But Podzinger's catalog should expand greatly, and incorporate video, as the site leaves beta mode.

To be sure, none of these sites has mastered audio recognition, a notoriously tricky beast. Computers still cannot consistently understand all the innumerable accents, mispronunciations and other nonstandard diction that color human speech.

Even so, considering that the feds pay BBN a lot of money for real-time analysis of overseas broadcasts in Arabic and other languages, I found it funny that early in my test the phrase "Osama bin Laden" got no hits on Podzinger. Neither did "Usama bin Laden," the spelling often used by federal authorities.

When I shortened the search to "Osama," that brought up an episode of "MSNBC Countdown" in which host Keith Olbermann uttered the name of the terrorist mastermind - though Podzinger heard it as an Anglicized "Osama bin Lawton." I couldn't check whether the fault lay with a too-sharp pronunciation by Olbermann; the original material was gone from the Web.

To Podscope's credit, the same search returned a result in which a podcaster pronounced the name as "Oosama bin Layden." Nice catch.

Blinkx made a geopolitical gaffe in transcribing the following snippet from a Fox News broadcast about a political murder in Lebanon:

"... pro-Syrian President Emile Lahoud, citing a cell phone call Lahoud received minutes before the murder."

Blinkx rendered it as:

"... pro-Syrian President Emile of food citing a cell phone colic who received minutes before the murder."

Do errors like that matter? To some degree. That Lebanon-Syria clip doesn't appear if you search for "Lahoud." But it does come up if you hunt for "colic."

You're not likely to encounter that non sequitur on sites that don't convert speech to text.

For example, I got precise results from Yahoo Inc.'s podcast search engine, which launched in October and claims tens of thousands of podcasts. It mines metadata and reviews written by listeners to raise the chance a search will yield a relevant result.

All nine results about "colic" were indeed related to babies' dreaded crying fits. (I got no hits for the term on Podscope and one on Podzinger.)

The good news for speech-to-text services is that they might improve with use. That's partly because the engines can learn better ways to determine words from their context.

Blinkx co-founder Suranga Chandratillake illustrates the process this way: If a podcast were made about the topics in this story, a computer probably would be right if it detected the phrase "recognize speech."

But in a podcast about last year's tsunami, the computer would do better to hear almost the same sounds as "wreck a nice beach."
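As a toy illustration of that idea, the sketch below picks between two sound-alike transcriptions by counting how many of their words also appear in the surrounding context. It is an assumption-laden simplification, not Blinkx's method: real engines rely on statistical language models rather than simple word overlap, and the function name and sample sentences here are hypothetical.

```python
def pick_transcription(candidates, context_words):
    """Choose the candidate sharing the most words with the surrounding context."""
    context = {w.lower() for w in context_words}

    def overlap(phrase):
        # Count the words in the candidate phrase that the context also uses.
        return sum(1 for w in phrase.lower().split() if w in context)

    # max() keeps the first candidate on ties, so list the default reading first.
    return max(candidates, key=overlap)


candidates = ["recognize speech", "wreck a nice beach"]

# Hypothetical surrounding words from two different podcasts.
tech_context = "software converts speech to text for podcast search".split()
tsunami_context = "the wave hit the beach and left a wreck".split()

print(pick_transcription(candidates, tech_context))
print(pick_transcription(candidates, tsunami_context))
```

With the technology context the function settles on "recognize speech"; with the tsunami context it chooses "wreck a nice beach."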

