logo资料库

论文研究 - 特征集和特征选择的新组合改进了蛋白质磷酸化位点的预测.pdf

第1页 / 共14页
第2页 / 共14页
第3页 / 共14页
第4页 / 共14页
第5页 / 共14页
第6页 / 共14页
第7页 / 共14页
第8页 / 共14页
资料共14页,剩余部分请下载后查看
Improved Protein Phosphorylation Site Prediction by a New Combination of Feature Set and Feature Selection
ABSTRACT
1. INTRODUCTION
2. MATERIALS AND METHODS
2.1. Data Sets
2.2. Methods
2.2.1. Feature Extraction
2.2.2. Feature Selection
2.2.3. Classification
2.2.4. Evaluation
3. RESULT AND DISCUSSION
3.1. P.ELM Data Set
3.1.1. Feature Selection
3.1.2. Classification Result
3.2. PPA Data Set
3.2.1. Feature Selection
3.2.2. Classification Result
4. CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
http://www.scirp.org/journal/jbise J. Biomedical Science and Engineering, 2018, Vol. 11, (No. 6), pp: 144-157 Improved Protein Phosphorylation Site Prediction by a New Combination of Feature Set and Feature Selection Favorisen Rosyking Lumbanraja1, Ngoc Giang Nguyen1, Dau Phan1, Mohammad Reza Faisal1, Bahriddin Abapihi1, Bedy Purnama1, Mera Kartika Delimayanti1, Mamoru Kubo2, Kenji Satou2 1Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa, Japan; 2Institute of Science and Engineering, Kanazawa University, Kanazawa, Japan Correspondence to: Favorisen Rosyking Lumbanraja, ; Ngoc Giang Nguyen, ; Dau Phan, ; Mohammad Reza Faisal, ; ; ; Bedy Purnama, ; Mamoru Kubo, Bahriddin Abapihi, Mera Kartika Delimayanti, Kenji Satou, Keywords: Protein Phosphorylation, Phosphorylation Site Prediction, Sequence Feature, Feature Selection with Grid Search Received: May 27, 2018 Accepted: June 26, 2018 Published: June 29, 2018 Copyright © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ ; Open Access ABSTRACT Phosphorylation of protein is an important post-translational modification that enables ac- tivation of various enzymes and receptors included in signaling pathways. To reduce the cost of identifying phosphorylation site by laborious experiments, computational prediction of it has been actively studied. In this study, by adopting a new set of features and applying feature selection by Random Forest with grid search before training by Support Vector Machine, our method achieved better or comparable performance of phosphorylation site prediction for two different data sets. 1. INTRODUCTION Phosphorylation is one of the most important post-translational modifications (PTMs) in euka- ryotes. This occurs when a phosphate group is added to a protein by a kinase. The addition of phosphate group usually happens to Serine (S), Threonine, and Tyrosine (Y) [1]. It is also the common PTM, which occur in eukaryotic cell [2]. Between 30% and 50% proteins of eukaryotic cell undergo phosphorylation [3]. In the past, experimental approach such as mass spectrometry (MS/MS) [4] has been commonly used to identify phosphorylation sites. However, implementing this approach has several disadvantages. Con- ducting experiment to predict phosphorylation sites is considered expensive and requires intensive labor. https://doi.org/10.4236/jbise.2018.116013 144 J. Biomedical Science and Engineering
In addition, it also requires adequate technique, skill, and specific equipment. Instead, computational approach (in silico) is becoming more common because of new computer technology development. These days, computers can process large number of data in a short time. This makes prediction of phosphorylation sites using computational approach becoming more popular. Trost and Kusalik provided summarization of various technique approach, data, and tools that can be applied to predict phosphorylation site using computational approach [5]. In general, phosphorylation site prediction can be classified into two approaches: kinase-specific phosphorylation site prediction and non-kinase-specific (general) phosphorylation site prediction. Ki- nase-specific prediction approach requires both protein sequence and the type of kinase for phosphoryla- tion to conduct prediction. The other approach is non-kinase-specific, which only requires protein se- quence. Xue and Trost provided comparisons of these two approaches [5, 6]. The main disadvantage of kinase-specific approach is that the publicly available information about the type of kinase is limited, espe- cially for human kinase [7]. Therefore, non-kinase-specific approach is more popular to predict phospho- rylation site [8]. There are various methods proposed for the prediction. For example, Blom used neural network (NN) approach to predict eukaryotic protein phosphorylation site based on sequence and structure of proteins [9]. Kim proposed a prediction method using support vector machine (SVM) [10]. In this paper, we propose a new prediction method for non-kinase-specific phosphorylation site. By adopting a new combination of a classifier, features, and feature selection algorithm, we improved the performance of prediction. We measured the result of feature selection and classification and compared it with existing methods. We also tested our method with an independent data set and analysis of the classi- fication result. 2. MATERIALS AND METHODS 2.1. Data Sets In this work, we followed the preparation step done by Ismail [11]. The dataset we use were down- loaded from the Phospho SVM website [12]. The phosphorylation site data set is P.ELM version 9 [13]. The data set contains phosphorylated sequences at the position of Serine (S), Threonine (T), and Tyrosine (Y). These sequences were also checked for redundancy and sequences that had similarity more than 30% were removed. Table 1 shows the number of sequences and number of sites for Serine, Threonine, and Tyrosine, respectively. We generated fixed-length protein sequences using window size 9, which have phospholyratable re- sidues (Serine, Threonine, or Tyrosine) at the center of them. If the center residue of the sequence is known as phosphorylated, the sequence is “positive”, otherwise “negative”. For positive and negative se- quences, redundant ones were removed using skipredundant [14]. The parameters for redundancy remov- al are as follows: acceptable threshold percentage of similarity was set to 0% - 20%, value for gap opening penalty to 10, and gap extension penalty to 0.5. Table 2 shows the number of positive and negative se- quences before and after removing redundant sequences for each residue. The number of negative sequences after redundancy removal for Serine, Threonine, and Tyrosine re- sidues are: 4771, 3343, and 898, respectively. We then randomly selected negative sequences for each resi- due with the same number of negative sequences in the work of Ismail. Using the same window size and method, we also generated sequences from PPA data set down- loaded from the Phospho SVM website for conducting performance evaluation by independent data set. PPA is a database for providing information of phosphorylation sites in Arabidopsis and a predictor for plant-specific phosphorylation site [15]. After removal of redundant sequences, we randomly selected pos- itive and negative sequences based on the work of Ismail. Table 3 shows number of positive and negative phosphorylation sites for each amino acid. In order to make the data set well-balanced, the numbers of positive and negative sequences were set to equal. https://doi.org/10.4236/jbise.2018.116013 145 J. Biomedical Science and Engineering
Table 1. P.ELM data set of phosphorylation site from PhosphoSVM website. Residue Serine Threonine Tyrosine Number of Sequence Number of Sites 6635 3227 1392 20,964 5685 2163 Table 2. Number of sequences before and after removing redundant sequence for window size = 9. Residue Serine Threonine Tyrosine Positive Before 20,557 5596 1392 After 1554 707 267 Negative 1543 453 226 Table 3. PPA data set as the independent data set. Residue Serine Threonine Tyrosine Number of positive/negative sequences after redundancy removal Number of positive/negative sequences after selection 484/1830 132/1227 187/640 307/307 68/68 51/51 2.2. Methods 2.2.1. Feature Extraction Using the fixed-length sequences, we conducted feature extraction to represent them as vectors of numerical values. We used three different programs to extract features: PROFEAT (2016), PSI-BLAST, and protr. PROFEAT (2016) is a web server for extracting features from protein sequences [16]. We used it to generate the following feaures: Amino Acid Composition, Dipeptide Composition, Normalized Mo- reau-Broto Autocorrelation Descriptor, Moran Autocorrelation Descriptor, Moran Autocorrelation De- scriptor, Geary Autocorrelation Descriptor, Composition, Transition, Distribution Descriptor, Amphi- philic Pseudo-Amino Acid Composition, and Total Amino Acid Properties. For Position Specific Scoring Matrix (PSSM) features, we used PSI-BLAST [17]. In addition, an R package called protrwas used to pro- duce the following features: BLOSUM and PAM Matrices for the 20 Amino Acid, Amino Acid Properties Based Scales Descriptor (Protein Fingerprint), Scales-based Descriptor derived by Principal Components Analysis, Scales-based Descriptor derived by Multidimensional Scaling, Conjoint Triad Descriptors, and Sequence-Order-Coupling Number [18]. Details of these features are described below. Except three fea- tures (CTD, SOCN, QSO), most of the features are not used in Ismail’s work.  Amino Acid Composition (AAC) Amino Acid Composition is defined as the fraction of each amino acid in a protein sequence [19]. For all 20 amino acids, the fraction is calculated using this equation. fraction of aa = i total of number of amino acid type i total number of amino acid in protein sequence (1) https://doi.org/10.4236/jbise.2018.116013 146 J. Biomedical Science and Engineering
where i is a spesific type of amino acid  Dipeptide Composition (DPC) Dipeptide Composition generates 400 fixed-length numeric information based on the input protein sequences. It encapsulates information about the fraction of amino acid as well as their local order. It is calculated using Equation (2): fraction of ( ) dep i = total of number of ( ) dep i total number of all posible dipeptide (2) where dep(i) is one dipeptide i of 400 dipeptides.  Normalized Moreau-Broto Autocorrelation Descriptors (NMB) Before we calculate Normalized Moreau-Broto Autocorrelation, we must define Moreau-Broto Au- tocorrelation. It can be define using Equation (3): ( ) AC d (3) where Pi and Pi+d are the amino acid property at position i and i + d, respectively. Normalized Mo- reau-Broto Autocorrelation is defined using Equation (4) [20]: ( ) AC d N d − (4) ( ATS d PP i i d + = ) = ∑ N d − i 1 = When we usePROFEAT, the value of nlag should be lower than the size of the sequence. Since the where d = 1, 2, 3, ∙∙∙, 30. window size is 9, we setnlag = 8.  Moran Autocorrelation Descriptors (MORAN) Moran Autocorrelation can be calculated using Equation (5): ) i d + P − ( 1 − N d ( I d ) = N d − i 1 = ∑ ∑ 1 N − )( P P P i ( ) P P i − 2 N i 1 = , d = 1,2,3, (5) ,30 where P is the avarege of Pi. In the use of PROFEAT, we setnlag = 8.  Geary Autocorrelation Descriptors (GEARY) Geary Autocorrelation can be defined using Equation (6): ) − ( 2 , ) ) 2 ) ( 2 ( 1 d = − = i d + N N i 1 = ∑ ∑ ,30 N d − i 1 = ( C d 1,2,3, P P i P P i 1 N d − 1 − In the use of PROFEAT, we setnlag = 8.  Composition, Transition, Distribution (CTD) Composition, Transition, Distribution represent the amino acid distribution patterns of a certain structural or physicochemical property from a protein sequences. These features are calculated as follows: the protein sequence is transformed into a sequence of a specific physicochemical or structural properties of residue. Twenty amino acids are divided into three groups [20, 21]. (6) Composition (C), Transition (T), and Distribution (D) are then calculated for a given attribute to de- scribe the global percent composition if the three groups of amino acids in a protein, the percent frequen- cies with which the attribute changes its index along the entire length of the protein, and the distribution pattern of attribute along the sequence, respectively.  Sequence-Order-Coupling Number (SOCN) Sequence-Order-Coupling Number can be used to represent amino acid distribution pattern of a spe- cific physicochemical property along a protein sequence. The dth rank of sequence-order-coupling num- https://doi.org/10.4236/jbise.2018.116013 147 J. Biomedical Science and Engineering
ber can be calculated using Equation (7): = ∑ N d − i 1 = ( 2 ) τ d (7) where di,i+d is the distance between two amino acid at position i and i + d. In the use of protr, we also set- nlag = 8.  Quasi-Sequence-Order Descriptors (QSO) Quasi-Sequence-Order Descriptors can be calculated using Sequence-Order-Coupling Number. For each amino acid type, the type-1 Quasi-Sequence-Order Descriptors is calculated using Equation (8): 1,2, ,30 i i d , + 3, = d d , X r = ∑ 20 r 1 = f r f + r w ∑ 30 τ d d 1 = , r = 1,2,3, (8) ,20 where fr is the normalized occurrence of amino acid type i and w is the weighting factor, w = 0.1. The type-2 Quasi-Sequence-Order Descriptors is calculated using Equation (9): w τ d + 20 − ∑ w X d = , r = 21,22,23, (9) ,50 r f ∑ 30 τ d d 1 = 20 r 1 = In the use of PROFEAT, we setnlag = 8.  Amphiphilic Pseudo-Amino Acid Composition(APAAC) Before we calculate Amphiphilic Pseudo-Amino Acid Composition, we must define Pseudo-Amino Acid Composition (PAAC) [22]. First, three variables are generated from the original hydrophobicity val- 0M i of 20 amino acids (i = 1, 2, 3, ∙∙∙, ues 20). , and side chain masses , hydrophilicity values ( ) 2H i ( ) 1H i ( ) 0 0 ( ) H i 1 = ( ) H i 2 = ( ) M i = ( ) H i 0 1 − ∑ 20 i 1 = ∑ 20 i     ( ) H i 0 1 − ∑ 20 ( ) H i 0 2 − ∑ 20 i 1 = ∑ 20 i     ( ) H i 0 2 − ∑ 20 ( ) 0 H i 1 20 ( ) 0 H i 1 20 20 i 1 = ( ) 0 H i 2 20 ( ) 0 H i 2 20 20 i 1 = 2     2     (10) (11) M 0 ( ) i 20 0 M ( ) i 20 0 M ( ) i − ∑ 20 i 1 = ∑ 20 i     0 M ( ) i ∑ 20 i 1 = 20 (12) 2     Then, a correlation function can generated as: ( θ R R i , j ) = 1 3 + {     1 1 ) − ( H R i ( H R ( M R M R − ( ) i j j 2   2 ) ) }   +   ( H R i 2 ) − ( H R 2 j 2 )   (13) https://doi.org/10.4236/jbise.2018.116013 148 J. Biomedical Science and Engineering
From which, sequence order-correlated factors are defined as: ) , λ λ < θ λ R R i i ( N λ − θ i 1 = = ( , + 1 − ∑ λ N N ) (14) where λ is parameter. Let fi be the normalized frequency of 20 amino acids in the protein sequence, a set of 20 + λ descriptors called the PAAC can be defined using Equation (15): , when 1 ≤ ≤ u 20 u f + X u = X u = ∑ 20 i 1 = f w λ ∑ λθ i j 1 = 20 ∑ i 1 = w θ u 20 f w λ ∑ + θ i λ j 1 = − , when 20 1 + ≤ ≤ u 20 + λ (15) where w = 0.05. From Equation (10) and Equation (11), the hydrophobicity and hydrophilicity correlation can be define as: H 1 i , j = H i H j H ; , ( ) ( ) 1 1 2 i , j = ( ) H i H j ( ) , 2 2 (16) Then, sequence order factor can be define using Equation (17): 2 i i , ∑ λ Finally, APAAC can be calculated using Equation (18): N λ − H i 1 = N λ − H i 1 = τ 1 2 λ − τ 2 λ 1 − 1 − ∑ λ ; λ 1 i i , N N = = + , where λ + λ < 2 (17) f u + ∑ 2 λτ j j 1 = , when 1 ≤ u ≤ 20 p u = p u = ∑ 20 i 1 = i f 20 ∑ i 1 = w τ u ∑ + f i , when 20 1 + ≤ ≤ u 20 + λ (18) 2 λ τ j j 1 = In the use of PROFEAT, we set weight factor = 0.05 and lamda = 8.  Total Amino Acid Properties (AAP) Total Amino Acid Properties for a specific physicochemical property i is defined using Equation (19): (19) represents the property i of amino acid Rj that is normalized between 0 and 1.N is the length = ∑ N P= j 1 no p ( ) tot i 1 N rm i j where of the protein sequence. normP i j normP i j is calculated using Equation (20): i p − min i p − min i jp is the original amino acid property i for residue j. ip where max maximum values of the original amino acid property i, respectively. p i p max P norm = i j i j (20) and ip min are the minimum and the  Position Specific Scoring Matrix (PSSM) PSSM features were generated using PSI-BLAST against a local database generated from the phos- phorylation data set.  BLOSUM and PAM Matrices for the 20 Amino Acid (BLOSUM) In the use of protr, we setk = 5, lag = 3, and Matrix type = AABLOSUM45.  Amino Acid Properties Based Scales Descriptor (Protein Fingerprint) (ProtFP) In the use of protr, we set pc = 5, lag = 5, index vector for Amino Acid Index = (160:165, 258:296).  Scales-based Descriptor derived by Principal Components Analysis (SCALES) In the use of protr, we set pc = 7, lag = 5, properties matrix = AA index (7:26).  Scales-based Descriptor derived by Multidimensional Scaling (MDDSCALES) In the use of protr, we set lag = 8. https://doi.org/10.4236/jbise.2018.116013 149 J. Biomedical Science and Engineering
 Conjoint Triad Descriptors (CTriad) [22] 2.2.2. Feature Selection Random Forest was introduced by Breiman [23]. Random Forest method works as a collection of large number of decision trees randomly generated and not correlated to each other. This method is wide- lyapplied to classification problems. In Random Forest, Gini impurity index (GII) is used to measure feature importance. GII represents how often randomly chosen element from the data set would be classified incorrectly if it was randomly classified based on the distribution of classes in the subset. We use Gini Index to rank important features that can be used for the classification algorithm. In [11], Ismail also attempted the same feature selection and top 100 features were selected. In con- trast, we conducted grid search to find the best set of selected features. 2.2.3. Classification Vapnik [24] proposed support vector machine (SVM) as a classification method. It is a popular clas- sifier widely applied to various problems including phosphorylation site prediction. SVM produces an op- timal hyperplane separation between the classes. Here, optimal means finding the maximum margin around the separating hyperplane. In this work, we adopted Gaussian (also known as radial basis function) kernel for SVM. 2.2.4. Evaluation 10-fold cross-validation was repeated 10 times to measure the average performance of the P.ELM da- ta. To measure the performance for the PPA data set, which is used for the independent data set, leave-one-out cross-validation (LOOCV) was conducted. The metricsused to measure the classification performance are: Accuracy, Sensitivity, Specificity, F1 score, and Matthew’s Correlation Coefficient (MCC). These metrics are defined in the following equations: Accuracy = TP TN FP FN + + (21) TP TN + + TP TP FN + TN + TP TN FP Sensitiviy = (22) Specificity = (23) TP FP FN F1score ( ( + × ) TP FP = × 2 + ) TP TN − ) TP FN × × + ( ( (24) + ) FP FN ) TN FP (25) TN FN × + × + ( ) MCC = ( where TP, TN, FP, and FN are the abbreviation for true positive, true negative, false positive, and false negative. In this work, Area under the ROC curve (AUC) is also measured. 3. RESULT AND DISCUSSION 3.1. P.ELM Data Set 3.1.1. Feature Selection Gini impurity index (GII) in Random Forest was used to measure the importance of the features. For P.ELM data set, we conducted 10-fold cross validation repeated 10 times. For each fold in each iteration, we generate a list of important features based on the training data. Then we average the value GII of each features from the 100 list features we generated before. To give insight of which features that effects the classification, we listed the top twenty features for each residue in Figure 1. Composition, Transition, and https://doi.org/10.4236/jbise.2018.116013 150 J. Biomedical Science and Engineering
Figure 1. Top twenty importanct features for Serine (top), Threonine (middle), Tyrosine (bottom). The akronims of the group features are: Amino Acid Composition (AAC);Dipeptide Composition (DPC); Normalized Moreau-Broto Descriptors (NMB); Moran Autocorrelation Descriptors (MORAN); Geary Autocorrelation Descriptors (GEARY); Composition, Transition, Distribution (CTD); Quasi-Sequence-Order Descriptors (QSO); Amphiphilic Pseudo-Amino Acid Composition (APAAC); Total Amino Acid Properties (AAP); Position Specific Scoring Matrix (PSSM); BLOSUM and PAM Matrices the 20 Amino Acid (BLOSUM); Amino Acid Properties Based Scales Descriptors (ProtFP); Scale-based Descriptor derived by Principal Components Analysis (SCALES); Scale-based Descriptor derived by Multidimensional Scaling (MDSSCALES); Conjoint Triad Descriptor (Ctriad); 16: Sequence-Order-Coupling Number (SOCN). The number prefixed to a group name is just an identifier to descriminate different features in the same group. https://doi.org/10.4236/jbise.2018.116013 151 J. Biomedical Science and Engineering
分享到:
收藏