Date Published: January 28, 2019
Publisher: Public Library of Science
Author(s): Handan Kulan, Tamer Dag, Wajid Mumtaz.
Understanding the expression levels of proteins and their interactions is key to diagnosing and explaining Down syndrome, which can be considered the most prevalent cause of intellectual disability in humans. In previous studies, the expression levels of 77 proteins obtained from normal-genotype control mice and from trisomic Ts65Dn mice were analyzed, using statistical methods and machine learning techniques, after training in contextual fear conditioning with and without injection of the drug memantine. Recent studies have also pointed out that there may be a link between Down syndrome and the immune system. Thus, the research presented in this paper aims at in silico identification of proteins that are significant to the learning process and the immune system, and at deriving the most accurate model for classifying the mice. In this paper, features are selected with the forward feature selection method after a preprocessing step on the dataset. Deep neural network, gradient boosting tree, support vector machine and random forest classifiers are then applied and their accuracy is measured. The selected feature subsets are observed not only to yield higher classification accuracy but also to be composed of protein responses that are important for learning and memory and for the immune system.
Down syndrome (DS) is a very common identifiable genetic cause of intellectual disability (ID) and affects approximately one in 700 live births. In addition to ID, people with DS are at risk for certain blood diseases such as leukemia, for autoimmune disorders and for Alzheimer’s disease (AD) [2, 3].
After feature selection, classification methods are applied to differentiate the mice into subgroups. We used four classification methods: a deep neural network (DNN), a gradient boosted tree, a random forest and a support vector machine (SVM). These methods are implemented in Python with the Scikit-learn package. To select the most appropriate parameters for each classifier, a grid search is applied. In addition, 5-fold cross-validation is applied to build a robust and reliable classification model. Cross-validation estimates how well a learner generalizes to unseen data. In k-fold cross-validation, the data is partitioned into k subsets. In each round, one subset is held out as the test set and the remaining k-1 subsets form the training set. This procedure is repeated k times so that every subset serves as the test set exactly once, and the error estimate is averaged over all k rounds to give the overall performance. This approach reduces bias, since most of the data is used for fitting in every round, and it reduces the variance of the estimate, since every observation is eventually used for validation. The rest of this section briefly describes the four classification methods used in our study.
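The grid search with 5-fold cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data is a synthetic stand-in generated with `make_classification`, the parameter grids are hypothetical, and only two of the four classifiers are shown to keep the example short. The other two would be added to the `candidates` dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the protein expression table (assumption:
# 77 numeric features per sample, binary class label).
X, y = make_classification(
    n_samples=200, n_features=77, n_informative=10, random_state=0
)

# 5-fold cross-validation; stratification keeps class proportions per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical parameter grids, chosen only for illustration.
candidates = {
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Every grid point is scored by mean accuracy over the 5 folds.
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means it is refit on the training folds only, so no information from the held-out fold leaks into preprocessing.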
Using the KNIME tool, the forward feature selection technique is applied to obtain the feature subsets that identify the critical proteins in the successful learning, rescued learning and failed learning cases. Afterwards, principal component analysis (PCA) is carried out to validate the importance of the selected proteins. After the crucial proteins are determined and validated, the DNN, gradient boosted tree, random forest and SVM classification methods are executed. PCA and the classification methods are carried out in Python with the Scikit-learn package. Grid search, a parameter optimization technique, and 5-fold cross-validation are also used to obtain robust and reliable classification results. The subsections below present, in turn, the results of the feature selection method, PCA and the classification methods for successful learning, rescued learning and failed learning.
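The select-then-validate workflow above can be approximated entirely in Python. The sketch below is an assumption-laden illustration: it swaps KNIME's forward feature selection node for scikit-learn's `SequentialFeatureSelector`, uses logistic regression as a hypothetical wrapped estimator, and runs on synthetic data rather than the protein dataset. The subset size of 5 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 20 features, 5 of them informative).
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, random_state=0
)
X = StandardScaler().fit_transform(X)

# Forward selection: start from an empty set and greedily add the feature
# that most improves the cross-validated score, until 5 are chosen.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
selected = selector.get_support(indices=True)

# Project the selected features onto two principal components to inspect
# how well the classes separate in the reduced space.
X_selected = X[:, selected]
pca = PCA(n_components=2)
components = pca.fit_transform(X_selected)

print("selected feature indices:", selected)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Plotting `components` colored by `y` would give a visual check of class separation, comparable in spirit to the PCA validation step described in the text.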
Pharmacotherapies for ID are largely unknown because the abnormalities at the complex molecular level that cause ID are difficult to understand. DS, which is the prevalent cause of ID and is caused by an extra copy of Hsa21 (human chromosome 21), has been investigated at the protein level. Owing to the increased copy number of trisomic genes, the protein expression levels of the corresponding genes are elevated. Furthermore, in addition to genes on chromosome 21, protein-coding genes on other chromosomes play important roles in DS. Thus, understanding abnormalities in protein expression is very important for developing drugs to rescue learning. For this reason, the critical roles of proteins have been analyzed by comparing the protein expression levels of normal mice and trisomic mice exposed to contextual fear conditioning (CFC) with or without memantine treatment. Statistical analysis and machine learning methods are used to find the critical proteins in DS.