Research Article: Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets

Date Published: March 21, 2019

Publisher: Public Library of Science

Author(s): Olga Krakovska, Gregory Christie, Andrew Sixsmith, Martin Ester, Sylvain Moreno, Konstantinos C. Fragkos.


Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods that assume linear dependencies between variables. Such assumptions, however, do not hold in many cases in which the data are linked by non-linear dependencies. This in turn calls for analytic methods that are more accurate in identifying potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets in which the relationships between the features and dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.

Partial Text

Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)). Indeed, several survey initiatives have been set up to track the biological, social and lifestyle factors that affect health and quality of life throughout the lifespan, e.g., the Health and Retirement Study [1], the Wisconsin Longitudinal Study [2], the Canadian Longitudinal Study on Aging [3], and the National Population Health Survey [4]. These databanks are a valuable resource that can be used to identify and quantify the factors affecting health in aging. In turn, the results of these analyses can empower key stakeholders, including end users and policy makers, to make informed decisions for themselves and optimized decisions at higher levels, i.e., at the level of healthcare systems.

It is not uncommon for large survey databases to store dozens or hundreds of different measurements for each person (we refer to these measurements herein as features). Given their size and complexity, it is not usually practical for researchers to assess how all factors within a database interact to determine an outcome of interest (say, mortality rate). Instead, researchers will often select a handful of features and assess the predictive ability of these features using a variant of regression such as linear regression. Unfortunately, both of these operations—feature selection and prediction—are potentially problematic for the analysis of many large survey databases. Here, we outline two major issues inherent to this analytic technique and offer an alternative approach, which may be better suited for the analysis of data within these survey databases, when it is reasonable to assume non-linear relationships.

The performance of a given statistical method depends on the underlying data to be analyzed. Therefore, an important preliminary step is to understand the properties of the data before commencing any analysis [21]. Here, we are interested in the extraction of relevant features from large social science datasets, which consist primarily of questionnaires filled out by respondents, their proxies or reviewers [22]. To make a questionnaire simpler for respondents, questions are routinely presented in multiple-choice formats, which map continuous variables into discrete categories, with the number of categories typically ranging from three to seven. Respondents are only occasionally asked to provide an exact number in answer to a given question, and as a result the risk of erroneously splitting a response into categories is believed to be relatively high. For example, a respondent performing an activity five times per week may report it as either “daily” or “several times a week”.
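The categorization risk described above can be illustrated with a minimal sketch. The function name `frequency_category`, the three answer labels, and the `daily_cutoff` parameter are hypothetical and are not taken from any of the surveys cited; they only model the idea that two respondents with the same behaviour may apply different personal cut-offs when choosing an answer.

```python
def frequency_category(times_per_week, daily_cutoff):
    """Map a weekly count onto a three-level multiple-choice answer.

    daily_cutoff is a hypothetical parameter: the smallest weekly count
    that a given respondent is willing to describe as "daily".
    """
    if times_per_week >= daily_cutoff:
        return "daily"
    if times_per_week >= 2:
        return "several times a week"
    return "once a week or less"


# Same behaviour (five times per week), two different personal cut-offs.
strict = frequency_category(5, daily_cutoff=7)
lenient = frequency_category(5, daily_cutoff=5)
print(strict, "|", lenient)
```

Under these assumptions the same underlying value ends up in two different categories, which is the kind of noise a feature selection method has to tolerate in questionnaire data.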

Linear selection methods have been the main approach in gerontology for studying two of the field's central databases, the WLS [2] and the HRS [1]. In many cases, however, the relationships between variables within these datasets are non-linear. Although linear methods may still be effective in some cases at identifying important trends in the data, in other cases their selection performance can be unstable or incorrect. Because of this, there has been growing interest in the use of non-linear methods for identifying relevant features in aging-related datasets, as these approaches may be better suited for feature selection in many real-world usage scenarios. However, it remains unclear whether these approaches offer better feature selection performance than linear methods, whose operation and implementation are arguably better understood by many researchers. The goal of the present study was to test the effectiveness of linear and non-linear feature selection methods in identifying relevant features marked by non-linear dependencies.
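To make the distinction concrete, the following sketch compares a linear dependence measure (Pearson correlation) with a non-linear one (mutual information) on a small synthetic example. Mutual information is used here only as one illustrative non-linear measure; this excerpt does not name the three non-linear methods that were compared, and the toy data (a five-level response coded from -2 to 2 and an outcome equal to its square) are an assumption made for illustration.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical five-level survey item, coded -2..2, for 100 respondents.
feature = [-2, -1, 0, 1, 2] * 20
outcome_linear = [2 * x + 1 for x in feature]   # linear dependency
outcome_nonlinear = [x * x for x in feature]    # symmetric, non-linear dependency

r_linear = pearson(feature, outcome_linear)
r_nonlinear = pearson(feature, outcome_nonlinear)
mi_linear = mutual_information(feature, outcome_linear)
mi_nonlinear = mutual_information(feature, outcome_nonlinear)

print(f"linear outcome:     r = {r_linear:.3f}, MI = {mi_linear:.3f} bits")
print(f"non-linear outcome: r = {r_nonlinear:.3f}, MI = {mi_nonlinear:.3f} bits")
```

In this toy setting the correlation with the quadratic outcome is zero even though the outcome is fully determined by the feature, while the mutual information remains clearly positive. A selection rule that relies on correlation alone would therefore discard a relevant feature.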



