June 17, 2008

Pittsburgh Post-Gazette TechMan Column

By Pittsburgh Post-Gazette

Jun. 15--TechMan remembers when the Internet was emerging in the '90s and newspapers and magazines carried glowing articles about how all human knowledge would be available to everyone in searchable form.

("Yes, son, Dad was alive in a time called the '80s when there was no World Wide Web and we were forced to listen to music by bands named A Flock of Seagulls.")

Well, it didn't quite turn out that way, but some people carry on one of those utopian visions, digitizing the world's books.

Digitizing a book means scanning it with a device that uses cameras to take pictures of the pages, then using optical character recognition (OCR) technology to convert those images into text, so that the pages of the book can be displayed on the Internet in readable and searchable form.
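For readers who like to tinker, here is a rough sketch of the "searchable" half of that process. It assumes the OCR step has already turned each page image into plain text. The sample pages and the function names build_index and search are made up for illustration, and this is not a description of how either project described here actually works.

```python
import re
from collections import defaultdict

WORD_PATTERN = r"[a-z0-9']+"


def build_index(pages):
    """Map each lowercase word to the set of page numbers where it appears."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(WORD_PATTERN, text.lower()):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word in the query."""
    words = re.findall(WORD_PATTERN, query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical OCR output: page number -> recognized text.
sample_pages = {
    1: "Call me Ishmael.",
    2: "Some years ago, never mind how long precisely",
}

sample_index = build_index(sample_pages)
print(search(sample_index, "years ago"))
print(search(sample_index, "Ishmael"))
```

A real system would also have to cope with OCR mistakes, hyphenated words and multiple languages, none of which this sketch attempts.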

TechMan wants to concentrate on two nonprofit groups that are digitizing books.

The Universal Library and its Million Book Project, which originated at Carnegie Mellon University, reached a million scanned books in 2006-2007 and continues to scan at 50 centers around the world, particularly in the United States, China and India.

The Open Library project, run by the Internet Archive, wants to create a Web entry for every book and scan as many as possible.

Both are consortiums of university libraries, with some funding from technology companies and governments.

To read the books scanned by either of these projects, go to www.ulib.org or www.openlibrary.org.

One thing you will immediately notice is that most of the books available in full-text form are from the 1800s or early 1900s.

This is an unfortunate consequence of U.S. copyright law. In 1998, Congress passed the Sonny Bono Copyright Term Extension Act, named in memory of the '60s and '70s singer and partner of Cher, who later became a congressman from California and died earlier that year.

Called by some the Mickey Mouse Protection Act, since it was passed as Disney Corp.'s copyright on Mickey Mouse was running out, it extended copyright terms in the United States by 20 years.

Under the law, anything published in 1923 or later that was still under copyright in 1998 will not enter the public domain until 2019 or later. Unless the copyright owner releases the work into the public domain before then, such works are in effect precluded from being scanned and posted in full.
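The arithmetic behind that 2019 date can be sketched in a few lines. This is a simplified model, not legal advice. It assumes a 75-year term before the 1998 law for works of this era, that the copyright was properly secured and renewed, and that a term runs through the end of its final calendar year. The function name public_domain_year is made up for illustration.

```python
OLD_TERM_YEARS = 75   # assumed term before the 1998 extension
EXTENSION_YEARS = 20  # years added by the 1998 act


def public_domain_year(publication_year, term_years):
    """Return the first calendar year in which the work is public domain.

    The term is assumed to run through December 31 of its final year,
    so the work becomes public on January 1 of the following year.
    """
    return publication_year + term_years + 1


# 1923 is the cutoff year in the column; 1928 is the year Mickey Mouse debuted.
for year in (1923, 1928):
    before = public_domain_year(year, OLD_TERM_YEARS)
    after = public_domain_year(year, OLD_TERM_YEARS + EXTENSION_YEARS)
    print(f"Published {year}: public domain in {before} without the extension, {after} with it")
```

Under these assumptions, a 1923 work that would have become public in 1999 instead waits until 2019.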

So why would these groups want to put all this effort into scanning books for no earthly gain?

First, to advance human knowledge by making previous work available to anyone from any location.

Second, for preservation. Some of these books, most of which come from university libraries, may be among the few remaining copies. They are deteriorating rapidly, or libraries have to discard them to make room for new books. At some point, the last paper copy will be gone.

So work goes on to make all human knowledge accessible to everyone. That is, as long as that knowledge is older than Mickey Mouse.

Want to send a question to TechMan? Just fire an e-mail to [email protected]. Please include your name, hometown and a daytime phone number. Visit TechMan's blog at post-gazette.com/techman and listen to the Tech Talk podcast at post-gazette.com/podcast.


Copyright (c) 2008, Pittsburgh Post-Gazette

Distributed by McClatchy-Tribune Information Services.
