November 13, 2012
New Technique Speeds Data Compression Algorithms By Shrinking Data
Lee Rannals for redOrbit.com — Your Universe Online
A new way of representing data might open up new possibilities for Internet-connected sensors like GPS receivers and cameras in smartphones.
The new approach could be more applicable than other techniques for managing big data because it can work with existing algorithms.
The researchers applied their technique to two-dimensional location data generated by GPS receivers, which helps to demonstrate how the technique works.
Daniel Rus, a professor of computer science and engineering and director of MIT's Computer Science and Artificial Intelligence Laboratory, said GPS receivers take position readings every 10 seconds, which adds up to a gigabyte of data each day. A computer system trying to process this data from tens of thousands of cars to determine traffic patterns could be overwhelming.
With the new algorithm, it would make points when the car turns, and draw a line between these points to make an approximation of the vehicle's path.
Dan Feldman, a postdoc in Rus´ group and lead author on the new paper, said a key aspect to the algorithm is that it can compress data on the fly.
The algorithm could compress the first megabyte of data it receives from a car, then wait until another megabyte builds up, then compress again. The final representation of the data would preserve almost as much information as if the algorithm had waited for all data to arrive before compressing.
Feldman said the problem with approximating the pathways between points is similar to one solved by regression analysis. This is a procedure in statistics that finds the one line that best fits a scatter of data points.
He said choosing the number of line segments involves a trade-off between accuracy and complexity. Essentially, the first task of the algorithm is to find the optimal trade-off between the number of line segments and error.
“If you have N points, k” – the number of line segments – “is a number between 1 and N, and when you increase k, the error will be smaller,” Feldman said in a statement. “If you just connect every two points, the error will be zero, but then it won´t be a good approximation. If you just take k equal to 1, like linear regression, it will be too rough an approximation.”
The next step for the algorithm is to calculate the optimal set of k line segments, which are the ones that best fit the data.
In addition to this step, the algorithm must also store the precise coordinates of a random sampling of the points. Points that fall farther from the line have a higher chance of being sampled, but the sampled points are also given a weight that is inversely proportional to their chance of being sampled.
Points close to the line have a lower chance of being sampled, but if one of them is sampled, it is given more weight.
It is this combination of linear approximations and random samples that enables the algorithm to compress data on the fly.
During compression, the team was able to provide strict mathematical guarantees that the error introduced will stay beneath a low threshold. In many big-data contexts, a slightly erroneous approximation is better than a calculation that cannot be performed at all.
This algorithm could allow a GPS enabled car to record more data than just its location on a road. For example, a car could also record temperature and air pressure, and take a picture of the road. Each additional measurement would be another coordinate of a point in multiple dimensions. Once this data would be compressed, the randomly sampled points would include snapshots and atmospheric data.
The data in this example could serve as the basis for a computer system that could retrieve photos that characterized any stretch of road on a map, in addition to inferring traffic patterns.
With the GPS data, each line segment represents the approximate path taken between turns. One of the new applications the team is looking into is the analysis of video data, where each line segment represents a scene. The final representation of the data would automatically include sample frames from each scene.