National Science Foundation Awards Millions to Fourteen Universities for Cloud Computing Research
CluE Awards Promote Academic Use of Cluster Computing Resources on IBM/Google cloud
In 2007, IBM and Google announced a joint university initiative to help computer science students gain the skills they need to build cloud applications. Now, the National Science Foundation is using the same infrastructure and open source methods to award CluE grants to universities around
The National Science Foundation awarded Cluster Exploratory (CluE) program grants to
“Academic researchers have expressed a need for access to massively scaled computing infrastructures that allow them to complete projects and research activities that have been difficult or impossible previously due to the amount of data involved,” said
“IBM is intensely focused on applying technology and science to make the world work better,” said
“We’re pleased and excited that the CluE program will support a wide range of original research,” said
The universities will run a wide range of advanced projects and explore innovative research ideas in data-intensive computing, including advancements in image processing, comparative studies of large-scale data analysis, studies and improvements to the Internet, and human genome sequencing, among others, using software and services on the IBM/Google cloud infrastructure.
The second research project is focused on developing the Integrated Cluster Computing Architecture (INCA) for machine translation (using computers to translate from one language to another). Open-source toolkits make it easier for new research groups to tackle the problem at lower cost, broadening participation. Unfortunately, existing toolkits have not kept up with the computing infrastructure required for modern “big data” approaches to machine translation; INCA will fill this void.
These three universities are using the National Science Foundation CluE grants for a comparative study of approaches to cluster-based, large-scale data analysis. Both MapReduce and parallel database systems provide scalable data processing over hundreds to thousands of nodes, but researchers need to understand how the two approaches differ in performance and scalability in order to choose the more suitable one when designing new data-intensive computing applications.
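To make the contrast concrete, here is a minimal single-machine sketch of the same aggregation written both ways: once as explicit map and reduce functions, and once as a declarative SQL query. It is an illustration only, not the universities' benchmark. The sample records, the function names, and the use of SQLite as a stand-in for a parallel database are all assumptions made for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (site, visit count) pairs.
records = [("a.example", 3), ("b.example", 5), ("a.example", 4)]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    site, visits = record
    yield site, visits


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(data):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    groups = defaultdict(list)
    for record in data:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())


def run_sql(data):
    """Compute the same totals declaratively; SQLite stands in for a parallel DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (site TEXT, n INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", data)
    rows = conn.execute("SELECT site, SUM(n) FROM visits GROUP BY site").fetchall()
    conn.close()
    return dict(rows)


mapreduce_totals = run_mapreduce(records)
sql_totals = run_sql(records)
```

Both functions return the same totals. In the MapReduce version the programmer spells out how records are keyed and combined, while in the database version the system plans the execution of the query. How each style behaves at the scale of hundreds to thousands of nodes is not something a toy example like this can show.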
This project is investigating linguistic extensions to MapReduce abstractions for programming modern, large-scale systems, with special focus on applications that manipulate large, unstructured graphs. This will impact a broad class of scientific applications. Graphs have important utility in the social sciences (social networks), recommender systems, and business and finance (networks of transactions), among others. The specific case study targeted by the research is a comparative analysis of graph-structured biochemical networks and pathways which underlie many important problems in biology.
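As a rough illustration of what expressing a graph computation in plain MapReduce looks like, the sketch below counts how many edges touch each node of a small undirected graph. It does not represent the project's linguistic extensions, which are not described here. The edge list and function names are hypothetical, and everything runs in memory on one machine.

```python
from collections import defaultdict

# Hypothetical undirected graph given as an edge list.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def map_edge(edge):
    """Emit a count of 1 for each endpoint of an edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degree(node, counts):
    """Sum the counts for one node to obtain its degree."""
    return node, sum(counts)


def node_degrees(edge_list):
    """Map every edge, group the emitted pairs by node, then reduce."""
    grouped = defaultdict(list)
    for edge in edge_list:
        for node, one in map_edge(edge):
            grouped[node].append(one)
    return dict(reduce_degree(n, c) for n, c in grouped.items())


degrees = node_degrees(edges)
```

Even this simple computation has to flatten the graph into key-value pairs first. Algorithms that follow paths across many hops typically need repeated map and reduce rounds, which is one reason graph workloads are an awkward fit for the basic abstraction.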
In many applications, data-quality issues resulting from a variety of errors create inconsistencies in structures, representations or semantics. Simple spelling variations such as “Schwarzenegger” vs. “Schwarseneger,” “Brittany Spears” vs. “
University of California-San Diego /
Researchers at the
Many of today’s data-intensive application domains, including searches on social networks like Facebook and protein matching in bioinformatics, require us to answer complex queries on highly connected data. The UCSB Massive Graphs in Clusters (MAGIC) project is focused on developing software infrastructure that can efficiently answer queries on extremely large graph datasets. The MAGIC software will provide an easy-to-use interface for searching and analyzing data, and will manage the processing of queries to take efficient advantage of computing resources such as large datacenters.
The CluE initiative is funding another machine translation project that promises to bridge the language divide in today’s multi-cultural and multi-faceted society. Systems capable of converting text from one language into another have the potential to transform how diverse individuals and organizations communicate. Coupling network analysis with cross-language information retrieval techniques yields a richer, multilingual contextual model that will guide a machine translation system in translating different types of text. The potential broader impact of this project is no less than knowledge dissemination across language boundaries, which will serve to enrich the lives of all the world’s citizens.
A second project focuses on developing parallel algorithms for analyzing the next generation of sequencing data. Scientists can now generate the rough equivalent of an entire human genome in just a few days with a single sequencing instrument. The analysis of these data is complicated by their size: a single run of a sequencing instrument yields terabytes of information, often requiring a significant scale-up of the existing computational infrastructure needed for analysis.
This project focuses on how researchers at the Center for Intelligent Information Retrieval (CIIR) are using the CluE infrastructure to learn more about word relationships. These relationships are not labeled explicitly in text and are quite varied; by exploiting these relationships, this project will help lead to a more effective ranking of web-retrieval results.
Imagine continuously zooming into an image from your personal photo collection. Unlike with modern image processing software, however, this zoom operation would reveal details missing from the original image. For example, zooming into someone’s shirt would eventually show a high-resolution image of the threads that compose it. The research team in the Department of Computer Science at the
Astrophysics is addressing many fundamental questions about the nature of the universe through a series of ambitious wide-field optical and infrared imaging surveys. New methodologies for analyzing and understanding petascale data sets are required to answer these questions. This research project is focused on developing new algorithms for indexing, accessing and analyzing astronomical images. This work is expected to have a broad range of applications in other data-intensive fields.
This project is building a new infrastructure for computational oceanography that uses the CluE platform to allow ad hoc, longitudinal query and visualization of massive ocean simulation results at interactive speeds. This infrastructure leverages and extends two existing systems: GridFields, a library for general and efficient manipulation of simulation results; and VisTrails, a comprehensive platform for scientific workflow, collaboration, visualization, and provenance.
IBM/Google Cloud Computing University Initiative
IBM and Google are making the following resources available to these universities for use in their respective projects:
- A cluster of processors running an open source implementation of Google’s published computing infrastructure (MapReduce and GFS from Apache’s Hadoop project)
- A Creative Commons licensed university curriculum developed by Google and the University of Washington focusing on massively parallel computing techniques
- Open source software designed by IBM to help students develop programs for clusters running Hadoop. The software works with Eclipse, an open source development platform.
- Management, monitoring and dynamic resource provisioning by IBM using IBM Tivoli systems management software
Dana W. Cruikshank, NSF (703) 292-7738
Kelly Sims, IBM Communications, Cloud Computing (917) 472-3456
Tim Willeford, IBM University Programs (914) 766-3389
Andrew Pederson, Google Corporate Communications (650) 214-6228