Research Article: Regional level influenza study based on Twitter and machine learning method

Date Published: April 23, 2019

Publisher: Public Library of Science

Author(s): Hongxin Xue, Yanping Bai, Hongping Hu, Haijian Liang, Lars Kaderali.


The significance of flu prediction is that the appropriate preventive and control measures can be taken by relevant departments after assessing predicted data; thus, morbidity and mortality can be reduced. In this paper, three flu prediction models, based on twitter and US Centers for Disease Control’s (CDC’s) Influenza-Like Illness (ILI) data, are proposed (models 1-3) to verify the factors that affect the spread of the flu. In this work, an Improved Particle Swarm Optimization algorithm to optimize the parameters of Support Vector Regression (IPSO-SVR) was proposed. The IPSO-SVR was trained by the independent and dependent variables of the three models (models 1-3) as input and output. The trained IPSO-SVR method was used to predict the regional unweighted percentage ILI (%ILI) events in the US. The prediction results of each model are analyzed and compared. The results show that the IPSO-SVR method (model 3) demonstrates excellent performance in real-time prediction of ILIs, and further highlights the benefits of using real-time twitter data, thus providing an effective means for the prevention and control of flu.

Partial Text

Influenza (flu) is a stealthy killer that threatens human health with its widespread contagion [1, 2]. The flu refers to a viral acute respiratory infection caused by the common flu virus. If the flu is not effectively controlled, it can cause wide-ranging flu outbreaks that pose a threat to social stability and development. The World Health Organization (WHO) asserts that about 3 to 5 million serious illnesses are reported worldwide each year and about 250,000-650,000 of those result in death [3]. If we can predict a flu trend in some areas before the outbreak of flu, and take effective measures to mitigate the contagion ahead of time, we can control the spread of disease and reduce the loss of life to a certain extent.

Historical twitter data mapped onto ILI contains a lot of information about flu epidemic from previous years, which has important significance for future flu trend-based predictions. Therefore, we develop model 1 by historical twitter data on ILI. The flu is an acute infectious disease with the ability to spread in physical space. Population regions that are geographically near each other will likely experience highly correlated patterns of flu cases. Therefore, we construct an empirical network model (model 2) using twitter data to verify the regional impact of flu transmission. In traditional flu prediction model development, the data becomes more accurate after rigorous scientific experimentation. Various forecasting methods have their own advantages and disadvantages. Therefore, we construct a combination model (model 3) by introducing CDC ILI data to model 2. Model 3 verifies whether the twitter data is complementary to CDC ILI data. Model 3 also determines whether the twitter data contains new information that is not provided by the historical CDC ILI data. The specific formulas of models 1-3 are
Model 1:
ILIi,t=∑k=1pαkXi,t−k+εt.(1)Model 2:
ILIi,t=∑k=1pβkXi,t−k+∑j≠i,j=1Nδjωi,jXj,t+θt.(2)Model 3:
In all models, the Xi,t−k represents twitter data in the i-th region for week t − k, ωi,j is weighting factor that establishes the relationship between regions i and j the correlation coefficient of the CDC ILI data in region i and j represents the relationship weight. ILIi,t−l characterizes the CDC ILI data of the i-th region for the last l weeks, p, q are the lagged order coefficients (the experimental results show that the prediction effect of the model is best when p = q = 3) and the coefficients αk, βk, δj, γk, μl and σj are the parameters of the model. The variables εt, θt and τt are the residual terms for each model, while N is the number of regions (in this case N = 10).

In this paper, we proposed three flu prediction models that use US-based twitter and CDC data. Then, we proposed an improved PSO to optimize the parameters of SVR. The independent and dependent variables of models 1-3 are used as input and output of the IPSO-SVR for predicting the CDC’ unweighted %ILI of US. Comparing the prediction results of IPSO-SVR, PSO-SVR, GA-SVR, and CV-SVR for models 1-3. The experimental results show that 1) flu outbreaks in adjacent areas also have an impact on the current spread of flu in a region; 2) the twitter data complements with CDC ILI data; 3) the IPSO-SVR prediction results of model 3 was better than the prediction results of IAT-BPNN model; 4) the IPSO-SVR prediction results of model 3 for %ILI are not only suitable for ten regions defined by HHS, but also generates an optimization algorithm that can be applied to optimize the SVR parameters, which used to solve the other predict problem.




Leave a Reply

Your email address will not be published.