January 6, 2013
Library Of Congress Twitter Archive Reaches 170 Billion Messages
redOrbit Staff & Wire Reports - Your Universe Online
Less than 24 months after first announcing its plans to compile an archive of Twitter posts, the Library of Congress has collected more than 170 billion messages of 140 characters or fewer, the institution announced on Friday.
The Library's first goals were "to acquire and preserve" all tweets from 2006 through 2010, to "establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date," Gayle Osterberg, the Library's Director of Communications, said on Friday. "This month, all those objectives will be completed."
"The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012," she added. "The Library's focus now is on addressing the significant technology challenges to making the archive accessible to researchers in a comprehensive, useful way. These efforts are ongoing and a priority for the Library."
Exactly how the massive collection of microblogging messages will be made available to researchers or to the general public is not yet known, reports Adrienne LaFrance of The Washington Post.
"Colorado-based data company Gnip is managing the transfer of tweets to the archive, which is populated by a fully automated system that processes tweets from across the globe. Each archived tweet comes with more than 50 fields of metadata -- where the tweet originated, how many times it was retweeted, who follows the account that posted the tweet and so on -- although content from links, photos and videos attached to tweets are not included," she said. "But the library hasn't started the daunting task of sorting or filtering its 133 terabytes of Twitter data, which it receives from Gnip in chronological bundles, in any meaningful way."
"People expect fully indexed -- if not online searchable -- databases, and that's very difficult to apply to massive digital databases in real time," Deputy Librarian of Congress Robert Dizard Jr. told the Post. "The technology for archival access has to catch up with the technology that has allowed for content creation and distribution on a massive scale. Twitter is focused on creating and distributing content; that's the model. Our focus is on collecting that data, archiving it, stabilizing it and providing access; a very different model."
One problem is that the Library, which like many government agencies has experienced funding cuts in recent years, would need to greatly overhaul its IT systems and servers in order to handle Twitter-related requests, LaFrance explains.
Dizard told her that internal testing has revealed that completing a search of the approximately 21 billion tweets from 2006 to 2010 could take up to 24 hours using the agency's current computers. He noted that the agency is considering hiring a third party to handle public searches of the Twitter archive, but that will likely depend on whether the Library can afford to do so.
The Library does, however, hope to make the tweet archive available to the general public eventually.
"Twitter is a new kind of collection for the Library of Congress but an important one to its mission," Osterberg said, according to PCMag.com. "As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries."