September 1, 2014
Researcher Uploads Over Two Million Historic Book Images To Flickr Commons
Chuck Bednar for redOrbit.com - Your Universe Online
Millions of photos and illustrations from the pages of public domain books originally digitized by the US Internet Archive have been uploaded to Flickr by Kalev Leetaru, a research fellow at Georgetown University in Washington DC.
Leetaru’s database of Internet Archive Book Images is fully searchable (thanks to tags that are automatically added) and downloadable, Leo Kelion of BBC News and Megan Geuss of Ars Technica explained. When the library books were originally scanned, the Optical Character Recognition (OCR) software in use automatically discarded the sections of each page that it recognized as images, they noted.
To correct that issue, Leetaru wrote a new program that took advantage of the OCR program. His software went back and recovered the discarded image regions, automatically converted them to JPEG format, and uploaded them to the photo sharing website. In addition, the software copied the caption for each image and the text from the paragraphs that immediately preceded and followed the image in the book, Kelion and Geuss explained.
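As a rough illustration of the process described above, the following Python sketch pairs each image region on a page with its caption and with the nearest paragraphs before and after it. It is not Leetaru’s actual software: the `Block` and `ImageRecord` types, the `extract_image_records` function and the idea that OCR output arrives as an ordered list of labeled blocks are all assumptions made for this example. Cropping the region from the page scan, saving it as a JPEG and uploading it to Flickr are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One region of a scanned page (hypothetical OCR output format)."""
    kind: str                                   # "text", "image" or "caption"
    text: str = ""                              # empty for image regions
    bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)  # left, top, right, bottom


@dataclass
class ImageRecord:
    """What would be stored alongside each extracted image."""
    bbox: Tuple[int, int, int, int]
    caption: str
    text_before: str
    text_after: str


def _nearest_text(blocks: List[Block], start: int, step: int) -> str:
    """Walk from `start` in direction `step` to the nearest text paragraph."""
    i = start + step
    while 0 <= i < len(blocks):
        if blocks[i].kind == "text":
            return blocks[i].text
        i += step
    return ""  # no paragraph found on that side of the image


def extract_image_records(blocks: List[Block]) -> List[ImageRecord]:
    """Collect every image region with its caption and surrounding text."""
    records = []
    for i, block in enumerate(blocks):
        if block.kind != "image":
            continue
        # Assumption: a caption, if present, directly follows its image.
        caption = ""
        if i + 1 < len(blocks) and blocks[i + 1].kind == "caption":
            caption = blocks[i + 1].text
        records.append(
            ImageRecord(
                bbox=block.bbox,
                caption=caption,
                text_before=_nearest_text(blocks, i, -1),
                text_after=_nearest_text(blocks, i, +1),
            )
        )
    return records
```

In a real pipeline, the bounding box in each record would be used to cut the image out of the original page scan, and the caption and surrounding paragraphs would become the searchable description attached to the uploaded picture.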
To date, 2.6 million of the 14 million total images have been uploaded to Flickr Commons, Robert Miller, Global Director of Books for the Internet Archive, said in a blog post. He added that the organization would soon be able to continuously add to the collection from the more than 1,000 new ebooks that are being scanned on a daily basis.
“This way of discovering and reading a book will help transform our medical heritage collection as it goes up online. This is a big step forward and will bring digitized book collections to new audiences,” Dr. Simon Chaplin, Head of the Wellcome Library, told Miller. The Internet Archive added that it planned to continue working with Flickr to introduce new sub-collections and new ways to use image recognition tools for educational purposes.
Furthermore, anyone interested in learning more about the book from which an image came can access the full text from a link in each picture’s caption, added Josh Ong of The Next Web. The images date from 1500 to 1922, the point after which US copyright restrictions apply, and most of them have been difficult to access until now.
“For all these years all the libraries have been digitizing their books, but they have been putting them up as PDFs or text searchable works. They have been focusing on the books as a collection of words. This inverts that,” Leetaru told Kelion. “It’s amazing to see the total range of images and how the portrayals of things have changed over time.”
“Most of the images that are in the books are not in any of the art galleries of the world – the original copies have long ago been lost,” he continued. “I think one of the greatest things people will do is time travel through the images.” Leetaru also said that he hoped that other libraries throughout the world would follow his lead, running this process through their collection of digitized books in order to “constantly expand this universe of images.”