Research Article: Distillation of the clinical algorithm improves prognosis by multi-task deep learning in high-risk Neuroblastoma

Date Published: December 7, 2018

Publisher: Public Library of Science

Author(s): Valerio Maggio, Marco Chierici, Giuseppe Jurman, Cesare Furlanello, Chuhsing Kate Hsiao.


We introduce the CDRP (Concatenated Diagnostic-Relapse Prognostic) architecture for multi-task deep learning, which incorporates a clinical algorithm, e.g., a risk stratification schema, to improve prognostic profiling. We present the first application to survival prediction in High-Risk (HR) Neuroblastoma from transcriptomics data, a task that studies from the MAQC consortium have shown to remain the hardest among multiple diagnostic and prognostic endpoints predictable from the same dataset. To obtain the more accurate risk stratification needed for appropriate treatment strategies, CDRP combines a first component (CDRP-A) synthesizing a diagnostic task and a second component (CDRP-N) dedicated to one or more prognostic tasks. The approach leverages the advent of semi-supervised deep learning structures that can flexibly integrate multimodal data or internally create multiple processing paths. CDRP-A is an autoencoder trained on gene expression for the HR/non-HR risk stratification by the Children’s Oncology Group, yielding a 64-node representation in the bottleneck layer. CDRP-N is a multi-task classifier for two prognostic endpoints, i.e., Event-Free Survival (EFS) and Overall Survival (OS). CDRP-A provides the HR embedding as input to the CDRP-N shared layer, from which two branches depart to model EFS and OS, respectively. To control for selection bias, CDRP is trained and evaluated using a Data Analysis Protocol (DAP) developed within the MAQC initiative. CDRP was applied to Illumina RNA-Seq data of 498 Neuroblastoma patients (HR: 176) from the SEQC study (12,464 Entrez genes) and to Affymetrix Human Exon Array expression profiles (17,450 genes) of 247 primary diagnostic Neuroblastoma samples of the TARGET NBL cohort. On the SEQC HR patients, CDRP achieves a Matthews Correlation Coefficient (MCC) of 0.38 for EFS and 0.19 for OS in external validation, improving over published SEQC models.
We show that a CDRP-N embedding is indeed parametrically associated with increasing severity and that the embedding can be used to better stratify patients’ survival.
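
For reference, the MCC used as the performance metric above is computed from the binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging from −1 (total disagreement) through 0 (chance level) to +1 (perfect prediction). The following is a minimal Python sketch of that formula. The function name `mcc`, the toy labels, and the convention of returning 0 when the denominator vanishes are illustrative assumptions, not taken from the study.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels coded as 0/1."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: a degenerate confusion matrix scores 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, not study data): 1 = event, 0 = no event.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(round(mcc(y_true, y_pred), 3))  # TP=2, TN=2, FP=1, FN=1 -> 3 / 9 = 0.333
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it a common choice when the two outcome classes are unbalanced.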

Partial Text

Dealing with multiple endpoints of clinical interest is a key challenge for predictive models built from high-throughput omics data, as found in the MAQC-II (Microarray Analysis and Quality Control) study [1]. Neuroblastoma is a paradigmatic example of a disease for which the medical community has adopted a clinical algorithm to assign risk status. Severity of cancer and therapeutic options are determined from a combination of clinical information and specific biomarkers. However, the precision medicine approach aims to identify more accurately the subtypes of patients in terms of expected response to therapy. In Neuroblastoma, high-throughput molecular profiling still fails to identify molecular profiles clearly associated with high-risk (HR) subtypes, for which successful therapy cannot yet be guaranteed. Arising predominantly in the first two years of life, Neuroblastoma is the most frequent extracranial solid tumor in infancy, accounting for about 500 new cases in Europe per year (130 in Germany), corresponding to roughly 8% of pediatric cancers and 15% of pediatric oncology deaths [2].

Results obtained with the CDRP solution on the SEQC-NB and TARGET-NB datasets are reported in detail in Table 3 and Table 4, respectively. Results obtained by other machine learning models are also reported for comparison, namely linear Support Vector Machine (LSVM), Random Forest (RF), and the CDRP-N network alone (no autoencoder contribution).

CDRP is a novel multi-task deep learning architecture that improves prediction of hard prognostic endpoints by injecting latent variables derived by autoencoding a standard clinical model. The approach leverages the advent of deep learning structures that can flexibly integrate multimodal data or internally create multiple processing paths. In this study, the autoencoder component clearly improves prediction of survival for high-risk patients. Further, the network can be used to generate embeddings associated with disease severity, improving on initial tumor grading.
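
The data flow of the architecture (an autoencoder embedding injected into a shared layer, from which one branch per endpoint departs) can be illustrated with a forward-pass sketch in plain Python. This is a hedged illustration, not the authors' implementation: the class name `CDRPSketch`, the helper functions, every layer size other than the 64-node bottleneck, the activations, and the choice to concatenate the embedding with a hidden representation of the expression profile are all assumptions. Weights are random and untrained, and training (including the autoencoder's decoder and the use of HR/non-HR labels) is not modeled.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Return (weights, biases) for a fully connected layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layer, x, activation):
    """Compute activation(W x + b) for a single input vector x."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class CDRPSketch:
    """Untrained forward pass; only the 64-node bottleneck size comes from the text."""

    def __init__(self, n_genes, n_latent=64, n_hidden=32, n_shared=16, seed=0):
        rng = random.Random(seed)
        # CDRP-A: encoder half of the autoencoder (decoder omitted, training only).
        self.encoder = dense(n_genes, n_latent, rng)
        # CDRP-N: own hidden layer, then a shared layer fed by [hidden ; embedding].
        self.hidden = dense(n_genes, n_hidden, rng)
        self.shared = dense(n_hidden + n_latent, n_shared, rng)
        # One sigmoid output branch per prognostic endpoint.
        self.efs_head = dense(n_shared, 1, rng)
        self.os_head = dense(n_shared, 1, rng)

    def embed(self, x):
        """Bottleneck representation produced by the CDRP-A encoder."""
        return forward(self.encoder, x, relu)

    def predict(self, x):
        """Return (P(EFS event), P(OS event)) for one expression profile."""
        joined = forward(self.hidden, x, relu) + self.embed(x)  # list concatenation
        s = forward(self.shared, joined, relu)
        return forward(self.efs_head, s, sigmoid)[0], forward(self.os_head, s, sigmoid)[0]


# Toy usage with a hypothetical 200-gene profile (the study used thousands of genes).
rng = random.Random(1)
profile = [rng.random() for _ in range(200)]
model = CDRPSketch(n_genes=200)
p_efs, p_os = model.predict(profile)
print(len(model.embed(profile)), p_efs, p_os)
```

Because both output branches read the same shared layer, anything learned for one endpoint can inform the other, which is the usual motivation for a multi-task design of this kind.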