Research Article: Investigation of Super Learner Methodology on HIV-1 Small Sample: Application on Jaguar Trial Data

Date Published: April 3, 2012

Publisher: Hindawi Publishing Corporation

Author(s): Allal Houssaïni, Lambert Assoumou, Anne Geneviève Marcelin, Jean Michel Molina, Vincent Calvez, Philippe Flandre.


Background. Many statistical models have been tested to predict phenotypic or virological response from genotypic data. A statistical framework called Super Learner has been introduced either to compare different methods/learners (discrete Super Learner) or to combine them in a Super Learner prediction method. Methods. The Jaguar trial is used to apply the Super Learner framework. The Jaguar study is an “add-on” trial comparing the efficacy of adding didanosine to an ongoing failing regimen. Our aim was also to investigate the impact of using different cross-validation strategies and different loss functions. Four different partitions between training and validation sets were tested with two loss functions. Six statistical methods were compared. We assessed performance by evaluating R2 values, and accuracy by calculating the rate of patients correctly classified. Results. Our results indicated that the more recent Super Learner methodology of building a new predictor based on a weighted combination of different methods/learners provided good performance. A simple linear model provided results similar to those of this new predictor. Slight discrepancies arose between the two loss functions investigated, and slight differences also arose between results based on cross-validated risks and results from the full dataset. The Super Learner methodology and the linear model each classified around 80% of patients correctly. The difference between the lowest and highest rates is around 10 percent. The number of mutations retained by the different learners also varies from one to 41. Conclusions. The more recent Super Learner methodology combining the predictions of many learners provided good performance on our small dataset.

Partial Text

The effectiveness of antiretroviral therapy has been limited by the development of human immunodeficiency virus type 1 (HIV-1) drug resistance. HIV-1 frequently develops resistance to the antiretroviral drugs used to treat it, which may decrease both the magnitude and the duration of the response to treatment, resulting in loss of viral suppression and therapeutic failure [1]. Moreover, there is a high level of cross-resistance within drug classes; a virus that has developed resistance to one drug in a class may also be resistant to other drugs in the same class [2]. Current International AIDS Society-USA and French HIV-1 guidelines recommend resistance testing both before starting antiretroviral therapy (ART) and at treatment failure. Resistance testing has become an important part of choosing and optimizing combination therapy for treating HIV-infected individuals [3]. Selecting a “salvage” regimen for an HIV-infected patient who has developed resistance to his or her current regimen is not straightforward [4].

We investigated the following learners: Logic Regression, Deletion/Substitution/Addition, Least Squares Regression, Random Forest, and Classification and Regression Trees. All algorithms are available as free packages for the R software.
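To make the discrete Super Learner idea concrete, the following is a minimal Python sketch, not the authors' R implementation. It assumes squared error as the loss, uses two stand-in learners (an intercept-only model and a one-predictor least-squares line) in place of the learners listed above, and runs on made-up toy data rather than Jaguar trial data. All function and variable names are hypothetical.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Stand-in learner 1: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Stand-in learner 2: least-squares line on a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    return lambda x: my + slope * (x - mx)

def cv_risk(fit, xs, ys, k):
    """Cross-validated risk: mean squared error on held-out folds."""
    total = 0.0
    for fold in kfold_indices(len(xs), k):
        held = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held]
        tr_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(tr_x, tr_y)  # fit on the training part only
        total += sum((ys[i] - model(xs[i])) ** 2 for i in fold)
    return total / len(xs)

def discrete_super_learner(learners, xs, ys, k=10):
    """Return the learner name with the smallest cross-validated risk."""
    risks = {name: cv_risk(fit, xs, ys, k) for name, fit in learners.items()}
    return min(risks, key=risks.get), risks

# Toy data (hypothetical): x could stand for a mutation count,
# y for a change in viral load.
toy_x = [float(i % 10) for i in range(20)]
toy_y = [-0.15 * x + (0.05 if i % 2 else -0.05) for i, x in enumerate(toy_x)]
candidates = {"mean": fit_mean, "linear": fit_linear}
for k in (2, 5, 10):
    print(k, discrete_super_learner(candidates, toy_x, toy_y, k))
```

Changing `k` mirrors the comparison of several training/validation partitions described in the abstract. The ranking of learners can shift with `k` because each partition trains on a different fraction of a small sample.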

Results of the Discrete Super Learner and Super Learner-5 are given in Table 1. For example, based on the SqE as loss function and a 10-fold cross-validation, LM(1) was identified as the top learner, followed by Random Forest and CART. The performance of LM(1) decreases slightly, from 1st rank with 10-fold to 3rd rank with 2-fold cross-validation, while Random Forest becomes the second learner for the remaining k-folds. Surprisingly, the linear model with interaction terms, LM(2), performed poorly for every k-fold cross-validation. The Super Learner-5 performed at least as well as the top learner, whatever the k-fold cross-validation. The R loss function led to similar findings. Although the ranks of the different learners are relatively stable, the combination used by the Super Learner-5 provided the best performance. Including Logic Reg as an additional learner in the previous set of candidate learners led to different findings (Table 2). Overall, Logic Reg performed poorly, and only LM(2) performed worse than Logic Reg. Based on the SqE as loss function, including Logic Reg in the Super Learner-6 decreased its performance compared to Super Learner-5. Based on R as loss function, the performance of the Super Learner-6 was very good.
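The weighted combination behind the Super Learner can be illustrated with a small Python sketch. It assumes squared-error loss, exactly two candidate learners, and weights constrained to be non-negative and to sum to one; the text here does not state how the paper constrained its weights, so this is an assumption. With two learners the optimal weight has a closed form. The prediction vectors are invented and stand in for cross-validated predictions; all names are hypothetical.

```python
def convex_weight(z1, z2, y):
    """Weight a in [0, 1] minimizing sum((y - (a*z1 + (1 - a)*z2))**2).

    Rewriting the residual as (y - z2) - a*(z1 - z2) gives a one-dimensional
    least-squares problem; clipping its solution to [0, 1] is exact because
    the objective is convex in a.
    """
    num = sum((t - q) * (p - q) for p, q, t in zip(z1, z2, y))
    den = sum((p - q) ** 2 for p, q in zip(z1, z2))
    if den == 0:
        return 0.5  # identical predictions: every weight gives the same risk
    return min(1.0, max(0.0, num / den))

def combine(a, z1, z2):
    """Weighted combination of the two prediction vectors."""
    return [a * p + (1 - a) * q for p, q in zip(z1, z2)]

def sq_risk(pred, y):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

# Hypothetical outcomes and cross-validated predictions of two learners.
y = [1.0, 2.0, 3.0, 4.0]
z1 = [1.5, 2.5, 3.5, 4.5]   # learner 1 overshoots by 0.5
z2 = [0.5, 1.5, 2.5, 3.5]   # learner 2 undershoots by 0.5

a = convex_weight(z1, z2, y)
print("weight on learner 1:", a)
print("risks:", sq_risk(z1, y), sq_risk(z2, y), sq_risk(combine(a, z1, z2), y))
```

Because the weights a = 0 and a = 1 are both allowed, the combination can never have a larger squared-error risk than either learner on the data used to choose the weight. That guarantee does not carry over automatically to new data, and a poorly performing extra learner can still affect the fitted weights in a small sample.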

The choice of subsequent treatment in failing patients is of major importance in the management of HIV-infected patients. Genotypic and phenotypic resistance tests are important tools for choosing promising combination therapy for those patients. On a small sample, we investigated a framework both for choosing the optimal learner and for building an estimator from a set of candidates, using two different loss functions and k-fold cross-validation.

In this study, we showed that the Super Learner methodology, applied to a relatively small amount of data, provided good performance. Of note, in our dataset, simple linear regression without two-way interaction terms, LM(1), performs as well as the Super Learner.

A. Houssaïni and P. Flandre designed research; A. Houssaïni and P. Flandre performed analysis; A. Houssaïni, L. Assoumou, A. G. Marcelin, J. M. Molina, V. Calvez and P. Flandre discussed the results and improved the paper.