September 1, 2012
MIT’s Pyxis System Streamlines Databases, Speeds Up Web Applications
April Flowers for redOrbit.com - Your Universe Online
Databases are the backbone of the Web. If you have every read stories on Yahoo!, shopped on Amazon, booked a flight on Kayak, posted on Facebook or Twitter, or downloaded a song from iTunes, you have used those databases. Most major websites maintain huge databases for everything from inventory and customer reviews, to seat availability on flights, to photos and comments. Almost any transaction on any major site, be it shopping, travel, or social media, requires multiple database queries, which can slow response time.
This week researchers from MIT's Computer Science and Artificial Intelligence Laboratory unveiled a new system that automatically streamlines websites' database access patterns, making the sites up to three times as fast. They presented their new system at the 38th International Conference on Very Large Databases in Istanbul, Turkey.
The MIT system, called Pyxis, works with languages that web developers already favor, unlike other systems that require the mastery of special-purpose programming languages.
Let's say you were booking a flight on one of those big travel sites like Expedia or Travelocity. What you, as an end user, see as a simple thing (putting in a date and asking for flight information) actually involves at least two process steps for the database. The first step is retrieval — pulling up which flights on those dates for that route have available seats, and the second step is computation — figuring out the differences in flight times that would allow a traveler to make a connection. Typically, data is stored on one server and the computation, or "application logic," is executed on another. These two servers might have to exchange information multiple times to conclude that no, a given itinerary won't work.
The MIT crew asked themselves, what if a few, frequently used chunks of application logic could be run on the database server instead? It would save time by limiting the number of cross-server transactions, and save on bandwidth, since the sole remaining transaction could be sending the single bit of information "no."
The first problem is that application logic and database queries are written in very different languages, which are optimized to handle completely different types of operations. So moving code to the database can require not only rewriting it, but rethinking the way it is implemented as well.
In addition, it is difficult to split a program in two without introducing bugs, or errors, which is the second problem.
Finally, even if the programmer invests an enormous amount of time to establish that the partitioning of the program will not introduce errors; the demands on the database server are constantly changing, which adds in one more layer of difficulty. During normal operation, the database server's CPU may have plenty of capacity to handle a little application logic. However, a spike in site traffic could overburden the CPU, making it so that the application logic computation puts it over the limit causing much worse delays than just shuttling data between application and database servers would.
Pyxis solves all three problems, its designers say. It automatically partitions a program between application server and database server, and does it in such a way that can be mathematically proven not to disrupt the operation of the program. It monitors the CPU load on the database server, giving it more or less application logic to execute depending on its available capacity.
Pyxis starts by transforming a program into a graph, a data construct that consists of "nodes" connected by "edges." Imagine a network diagram. The nodes (depicted as circles) are the computers, and the edges (depicted as lines connecting the circles) represent the bandwidth of the links between them. For Pyxis, the nodes represent individual instructions in a program, and the edges represent the amount of data that each instruction passes to the next.
“The code transitions from this statement to this next statement, and there´s a certain amount of data that has to be carried over from the previous statement to the next statement,” Sam Madden, professor at Department of Electrical Engineering and Computer Science (EECS), explains. “If the next statement uses some variable that was computed in the previous statement, then there´s some data dependency between the two statements, and the size of that dependency is the size of the variable.”
If the whole program runs on one computer, the variable is stored in main memory and each statement simply accesses it directly. But if consecutive statements run on separate computers, the data has to make the jump with them.
“There´s some cost to shipping data across the network, and there´s some cost to every network round-trip that you do,” Madden says. “So we want to find a placement of these nodes on the two different servers that minimizes the overall cost – or overall runtime – of the program.”
Pyxis finds several such placements of the nodes, some that push more computation to the database server and some that push less.
“Our tool is able to dynamically switch between them based on the current load on the server,” says Alvin Cheung, a graduate student at EECS.
Pyxis was three times as fast as typical implementation in experiments involving a standard set of simulated database transactions while cutting the bandwidth consumption almost in half. In addition, the improvements it afforded were within a few percent of those that resulted from careful optimizations hand-coded by software engineers.
Currently, Pyxis works with programs written in Java, which is the language favored by many commercial Web developers. Adapting it to other popular languages, such as Python, Ruby or Perl, would only require revising the code that translates programs into graphical models; however, the rest of the system would remain the same.
“Usually, partitioning systems are automated, not automatic – automated in the sense that some programmer input is taken into consideration, and then the system handles some of the more difficult aspects of the partitioning process,” says Eli Tilevich, an associate professor of computer science at Virginia Tech. With Pyxis, however, “they partition things completely automatically, without requiring any user input.”
“They have an program-analysis technique, called a partition graph,” Tilevich adds, “which is an interesting innovation that has not been applied to prior systems.” And while even successful academic research projects usually require a lot of work before they´re ready for commercial implementation, Tilevich says, “their ideas, their technologies, certainly have commercial applicability.”
The research and development team aren't done, though. Cheung and Madden, along with Owen Arden and Andrew Myers of Cornell University's Department of Computer Science are taking Pyxis a step farther. They are tweaking Pyxis, in collaboration with Armand Solar-Lezama, assistant professor of computer science and electrical engineering at MIT, to make it even easier to streamline database-dependent Web applications.
Most databases are written in so-called declarative languages such as SQL, which allow programmers to issue high-level commands, such as finding the largest value of some variable, without specifying a computational approach. The database system then automatically chooses the most efficient algorithm for executing the command, depending on the data characteristics.
Programmers who are more familiar with Java than SQL will sometimes move large chunks of data from the database to the application server, only to perform operations on it that SQL would have done more efficiently anyway. The MIT team is developing a system that builds on Pyxis, dubbed StatusQuo, which can identify such inefficient application logic. StatusQuo then automatically converts the application code into SQL queries, which the database then executes by whatever means it deems most efficient.