February 13, 2009

Scientists Create A Unique New Tool for Analyzing And Comparing Data

What does uncovering the true authorship of plays attributed to Shakespeare have to do with identifying our genetic ancestors or classifying new life forms? All involve the comparative analysis of long sets of data and all will benefit from a unique new analytical tool developed by researchers at Berkeley Lab.

Sung-Hou Kim, a chemist who holds a joint appointment with Berkeley Lab's Physical Biosciences Division and UC Berkeley's Chemistry Department, led the development of a technique called "feature frequency profiles" (FFP) that makes it possible to compare, classify, index and catalog just about any type of linear information that can be stored electronically. The kinds of information that can be analyzed with the FFP technique include nucleotide base and amino acid sequences, books, documents and possibly images. It could even prove to be the ultimate music organizer.

"I call our technique a tool for demographic phylogeny because it enables us to organize large sets of data into groups and find relationships among these groups," says Kim. "The idea is to organize data sets into groups based on the frequency at which key features occur and then look for relationships. This is the reverse of what is usually done, where you find relationships in the data set then organize the data set into groups based on those relationships."

Using the FFP technique, Kim and his colleagues can create "family trees" that put into easy-to-see perspective the relationships between groups within a data set, whether those groups are books or genomes. The key is to identify the "optimal features" for profiling. For books, the optimal feature consisted of sequences of text about eight letters in length. For mammalian genomes, the optimal feature consisted of sequences of nucleotide bases about 18 base pairs in length. However, to keep their genomic computations manageable, Kim and his colleagues reduced the four-letter DNA alphabet (adenine, guanine, thymine and cytosine) to a two-letter alphabet, using R for the purine bases (adenine and guanine) and Y for the pyrimidine bases (thymine and cytosine). In a series of tests run on books and genomes, the FFP technique provided a more comprehensive and in some cases more accurate analysis than the standard analytical tools.
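To make the idea concrete, here is a minimal sketch in Python of how a feature frequency profile might be built and compared. It is an illustration under stated assumptions, not the researchers' implementation: the function names are hypothetical, and the article does not say how two profiles are compared, so the Jensen-Shannon divergence used here is an assumed choice of distance.

```python
from collections import Counter
from math import log2

# Purine/pyrimidine reduction described in the article: A, G -> R; C, T -> Y.
RY_MAP = str.maketrans("AGCT", "RRYY")


def reduce_to_ry(dna):
    """Collapse a DNA string to the two-letter R/Y alphabet."""
    return dna.upper().translate(RY_MAP)


def feature_frequency_profile(sequence, k):
    """Relative frequency of every overlapping length-k feature in `sequence`."""
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the feature length")
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (assumed distance, not from the article).

    0 means the two profiles are identical; 1 means they share no features.
    """
    divergence = 0.0
    for feature in set(p) | set(q):
        pf, qf = p.get(feature, 0.0), q.get(feature, 0.0)
        mean = (pf + qf) / 2
        if pf:
            divergence += 0.5 * pf * log2(pf / mean)
        if qf:
            divergence += 0.5 * qf * log2(qf / mean)
    return divergence


# Toy sequences and a short feature length, chosen only to keep the demo small.
profile_a = feature_frequency_profile(reduce_to_ry("ACGTTGCAAGGCTA"), k=3)
profile_b = feature_frequency_profile(reduce_to_ry("ACGTTGCATGGCTA"), k=3)
print(jensen_shannon(profile_a, profile_b))
```

Computing such a distance for every pair of items in a collection gives a distance matrix, which a standard tree-building method could then turn into the kind of "family tree" described above. The feature length k is the parameter to tune; the article reports values of about 8 letters for books and about 18 bases for mammalian genomes.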

