Tracking The Spread Of The Flu Virus Through Wikipedia Traffic
Brett Smith for redOrbit.com – Your Universe Online
Some researchers have suggested that tracking specific Google searches can reveal the spread of an influenza epidemic, but a new study in the journal PLOS Computational Biology has found an even better internet-based measure – watching Wikipedia traffic.
By watching Wikipedia traffic for flu-related articles, the study team from Harvard Medical School was able to estimate the level of flu infection in the United States population approximately two weeks before the Centers for Disease Control and Prevention (CDC) releases its own data. The team’s approach also correctly identified the week of maximum influenza activity 17 percent more frequently than Google Flu Trends.
In the study, the Harvard researchers first generated a list of relevant Wikipedia articles on influenza, influenza-like activity, or health in general. These pages were chosen based on prior knowledge of the subject, previously published materials and professional opinion. The researchers also picked several articles and the Wikipedia Main Page to act as control markers for normal usage of the site.
The researchers noted that information on Wikipedia traffic is freely available through a project called Wikipedia Statistics. The Harvard team also used a third-party tool to more easily access the information that Wikipedia makes available and to process the raw statistics into useful measures such as daily article views.
The study team said data was collected from the earliest available date, December 10, 2007, through August 19, 2013. The data was then aggregated to the week level, with each week beginning on Sunday.
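For readers curious how daily view counts can be rolled up into weeks that begin on Sunday, the step can be sketched roughly as follows. This is a minimal illustration in Python, not the researchers' actual code; the function names `week_start_sunday` and `aggregate_weekly` and the sample view counts are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict


def week_start_sunday(day: date) -> date:
    """Return the Sunday that begins the week containing `day`."""
    # date.weekday() counts Monday as 0 and Sunday as 6, so adding one
    # (modulo 7) gives the number of days elapsed since the last Sunday.
    days_since_sunday = (day.weekday() + 1) % 7
    return day - timedelta(days=days_since_sunday)


def aggregate_weekly(daily_views: Dict[date, int]) -> Dict[date, int]:
    """Sum daily view counts into weekly totals keyed by the week's Sunday."""
    weekly = defaultdict(int)
    for day, views in daily_views.items():
        weekly[week_start_sunday(day)] += views
    # Return the weeks in chronological order.
    return dict(sorted(weekly.items()))


if __name__ == "__main__":
    # Hypothetical daily counts, for illustration only (not study data).
    sample = {
        date(2007, 12, 10): 100,  # Monday, first day of the study period
        date(2007, 12, 15): 50,   # Saturday of the same week
        date(2007, 12, 16): 7,    # Sunday, which starts a new week
    }
    for sunday, total in aggregate_weekly(sample).items():
        print(sunday.isoformat(), total)
```

Keying each week by its Sunday makes it straightforward to line the Wikipedia totals up against other weekly series, provided those series use the same week boundaries.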
The researchers noted that the CDC gathers its flu data from sentinel sites around the US. These sites compile physician reports and the data is made available through the CDC’s FluView tool. Google’s weekly flu statistics are also freely available – through the company’s Google Flu Trends website. The team used a mathematical model that included the Wikipedia data to determine influenza activity.
The researchers saw that the “Influenza” Wikipedia article averaged nearly 31,000 views per day, but its daily views ranged from about 3,000 to more than 330,000. While some pages under observation had relatively few views, others drew very high daily traffic, such as the Wikipedia Main Page, which averaged 44 million views per day.
Using CDC data as their “gold standard,” the study team said values based on Wikipedia article view counts were able to estimate US flu activity within a reasonable range of error. The researchers acknowledged some potentially confounding factors that they could not account for, such as the fact that they followed only the English version of Wikipedia.
“Each influenza season provides new challenges and uncertainties to both the public as well as the public health community,” the researchers said in a statement. “We’re hoping that with this new method of influenza monitoring, we can harness publicly available data to help people get accurate, near-realtime information about the level of disease burden in the population.”