Research Article: Challenges and caveats of a multi-center retrospective radiomics study: an example of early treatment response assessment for NSCLC patients using FDG-PET/CT radiomics

Date Published: June 3, 2019

Publisher: Public Library of Science

Author(s): Janna E. van Timmeren, Sara Carvalho, Ralph T. H. Leijenaar, Esther G. C. Troost, Wouter van Elmpt, Dirk de Ruysscher, Jean-Pierre Muratet, Fabrice Denis, Tanja Schimek-Jasch, Ursula Nestle, Arthur Jochems, Henry C. Woodruff, Cary Oberije, Philippe Lambin, Chin-Tu Chen.


Prognostic models based on individual patient characteristics can improve treatment decisions and outcome in the future. In many (radiomic) studies, small size and heterogeneity of datasets is a challenge that often limits performance and potential clinical applicability of these models. The current study is example of a retrospective multi-centric study with challenges and caveats. To highlight common issues and emphasize potential pitfalls, we aimed for an extensive analysis of these multi-center pre-treatment datasets, with an additional 18F-fluorodeoxyglucose (FDG) positron emission tomography/computed tomography (PET/CT) scan acquired during treatment.

The dataset consisted of 138 stage II-IV non-small cell lung cancer (NSCLC) patients from four different cohorts acquired from three different institutes. The differences between the cohorts were compared in terms of clinical characteristics and using the so-called ‘cohort differences model’ approach. Moreover, the potential prognostic performances for overall survival of radiomic features extracted from CT or FDG-PET, or relative or absolute differences between the scans at the two time points, were assessed using the LASSO regression method. Furthermore, the performances of five different classifiers were evaluated for all image sets.

The individual cohorts substantially differed in terms of patient characteristics. Moreover, the cohort differences model indicated statistically significant differences between the cohorts. Neither LASSO nor any of the tested classifiers resulted in a clinical relevant prognostic model that could be validated on the available datasets.

The results imply that the study might have been influenced by a limited sample size, heterogeneous patient characteristics, and inconsistent imaging parameters. No prognostic performance of FDG-PET or CT based radiomics models can be reported. This study highlights the necessity of extensive evaluations of cohorts and of validation datasets, especially in retrospective multi-centric datasets.

Partial Text

Prognostic models based on individual patient derived factors are essential to better estimate patient’s outcome prior to or during treatment. These models help to improve individualized treatment decisions (personalized medicine) that may lead to better patient outcomes [1, 2]. Medical images are an example of an information source, which could be used to derive important patient-specific prognostic information. Large amounts of quantitative features can be calculated from medical images acquired during a patient’s course of treatment, e.g. positron emission tomography (PET) or computed tomography (CT). This principle of extracting imaging features is called ‘radiomics’, which is a rapidly evolving field of interest [3–5]. Multiple studies have shown radiomics’ potential to derive prognostic information for patient outcomes. Despite the promising results, radiomics faces multiple challenges [6]. An important challenge is the collection and acquisition of (large amounts of) suitable imaging data, which is difficult due to evolving technology, lack of standardization protocols and differences in cohorts and protocols between institutes.

Characteristics of the cohorts are summarized in Table 2, which shows that all clinical parameters were significantly different between one or more datasets, except age (indicated with the bold numbers). All patients from Datasets 1, 3 and 4 received concurrent chemoradiotherapy and no other treatment between the first PET scan and the start of radiotherapy (RT). In Dataset 2, 55% of patients received sequential chemoradiotherapy and one patient did not receive any chemotherapy. Because of the limited sample size available to us for this study, we have decided to not exclude outliers (e.g. PET scans separated by long time intervals) in an effort to make the dataset more homogeneous.

This study aimed to show potential issues using a typical problematic example of retrospective multi-centric study, by investigating different methodologies and approaches for comparing cohort characteristics and assessing prognostic performance in a radiomic study. The analysis was based on a combined evaluation of four sub-cohorts, which resulted in the inclusion of 138 patients. The acquisition of a larger dataset is challenging for this research question, as FDG-PET/CT scans during treatment are not standard in clinical routine. As far as we are aware, the current dataset is one of the largest available. Nevertheless, the amount of data is probably less than required in order to build a generalized feature model for repeated PET/CT studies. Although the size of this dataset may not be sufficient from a statistical point of view, its results may provide clinical relevant insights to be obtained in order to improve future studies with limited sample sizes. Therefore, we think it is important to publish these results in an attempt to guide reflections and considerations with respect to analyses and conclusions in future (radiomic) studies. In this study, both the characteristics of the study cohort and the prognostic performance of imaging features were extensively explored.

In an attempt to inform future radiomic studies, we illustrate possible problems that can be encountered in retrospective multi-centric study by evaluating cohort characteristics and clinical characteristics. The models presented do not support any correlation between radiomic features acquired from FDG-PET/CT scans and overall survival. Further investigations indicate that the radiomic analysis was influenced by the limited sample size and heterogeneous imaging and clinical characteristics.