# rfppi
These scripts and datafiles are provided as-is, and come without
any warranty whatsoever. If it works for you, great, let us know!
If it doesn't work for you, we'd be happy to try and help you fix it.
If it destroys your universe, too bad (you may still file a bug report).
The test.r script runs one of the Random Forest classifiers for
prediction of protein-protein interaction (PPI) interface residues on
one of the provided datafiles, and evaluates its performance against
the true interface sites, which are annotated in the same datafiles
based on PDB and PISA (further instructions below).
Copyright (c) 2016 Q. Hou,
P. De Geest,
W.F. Vranken,
J. Heringa,
K. Anton Feenstra
PLEASE CITE:
Qingzhen Hou, Paul De Geest, Wim F. Vranken, Jaap Heringa and
K. Anton Feenstra.
Seeing the Trees through the Forest: Sequence-based Homo- and
Heteromeric Protein-protein Interaction sites prediction using
Random Forest.
Bioinformatics, in press (2016).
:ETIC ESAELP
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
# INSTRUCTIONS:
1. Predictors: five RF_homo, five RF_combined and one RF_hetero predictors
2. Testing datasets: Dset48.rar (heteromeric interfaces) and HM_479_testing.rar (homomeric interfaces)
3. Description of the datasets
In total, 20 types of features were used: DynaMine score: DYNA_q, mean_H and sd_H; RSA: RSA_q, RSA_H and RSA_sd_H; ASA: ASA_q, ASA_H and ASA_sd_H; secondary structure: α helix (PA_H, PA_Q, PA_sd_H), β sheet (PB_H, PB_Q, PB_sd_H) and coil (PC_H, PC_Q, PC_sd_H); entropy (H_Entropies); and the length of the query sequence (length).
For the local features (DynaMine score, RSA, ASA, secondary structure and entropy), a 9-residue window is used: the residue itself plus its four neighbours on either side.
Labels in the dataset:
'Pos': the order of residues in the sequence
'name': PDB ID
'Interface': '0', non-interface positions; '1' interface residues
'Interface1': 'NI', non-interface positions; 'I' interface residues
'length': length of the protein sequence
'AliPos': the order of residues in the alignment
'AliSeq': amino acid of the protein
'mean_H': the average DynaMine score for each column in the alignment
'sd_H': the standard deviation of DynaMine scores for each column in the alignment
'DYNA_q': the DynaMine score for the query sequence
'RSA_H': the average predicted Relative Surface Accessibility (RSA) for each column in the alignment
'RSA_sd_H': the standard deviation of predicted RSA for each column in the alignment
'RSA_q': the predicted RSA for the query sequence
'ASA_H': the average predicted Absolute Surface Accessibility (ASA) for each column in the alignment
'ASA_sd_H': the standard deviation of predicted ASA for each column in the alignment
'ASA_q': the predicted ASA for the query sequence
'PA_H, PA_Q, PA_sd_H': the average predicted probability score of α helix, the predicted probability score of α helix for the query sequence, the standard deviation of predicted probability score of α helix
'PB_H, PB_Q, PB_sd_H': the average predicted probability score of β sheet, the predicted probability score of β sheet for the query sequence, the standard deviation of predicted probability score of β sheet
'PC_H, PC_Q, PC_sd_H': the average predicted probability score of coil, the predicted probability score of coil for the query sequence, the standard deviation of predicted probability score of coil.
'H_Entropies': Sequence entropy used to describe the degree of conservation
'1mean_H': for each residue position i, values of average DynaMine score at position i-4
'2mean_H': for each residue position i, values of average DynaMine score at position i-3
'3mean_H': for each residue position i, values of average DynaMine score at position i-2
'4mean_H': for each residue position i, values of average DynaMine score at position i-1
'5mean_H': for each residue position i, values of average DynaMine score at position i+1
'6mean_H': for each residue position i, values of average DynaMine score at position i+2
'7mean_H': for each residue position i, values of average DynaMine score at position i+3
'8mean_H': for each residue position i, values of average DynaMine score at position i+4
Similar labels are used for the other windowed features.
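The windowed labels above can be illustrated with a small R sketch. It is not taken from the scripts in this repository: the function name `add_window_columns`, the `half_width` parameter and the toy data are made up for illustration, and filling positions beyond the sequence termini with NA is an assumption (how the provided datafiles treat the termini is not described here).

```r
# Build the eight neighbour columns for one per-residue feature.
# Prefixes follow the label order above: 1..4 -> i-4..i-1, 5..8 -> i+1..i+4.
add_window_columns <- function(df, feature, half_width = 4) {
  values <- df[[feature]]
  n <- length(values)
  offsets <- c(-half_width:-1, 1:half_width)
  for (k in seq_along(offsets)) {
    idx <- seq_len(n) + offsets[k]
    # Assumption: neighbours that fall outside the sequence become NA.
    idx[idx < 1 | idx > n] <- NA
    df[[paste0(k, feature)]] <- values[idx]
  }
  df
}

# Toy six-residue example (values are invented).
toy <- data.frame(Pos = 1:6, mean_H = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
windowed <- add_window_columns(toy, "mean_H")
print(windowed[, c("Pos", "4mean_H", "mean_H", "5mean_H")])
```

Note that read.csv with its default check.names = TRUE renames a column such as '1mean_H' to 'X1mean_H'; which form the provided predictors expect is not stated here.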
4. To run the interface site prediction:
1) extract the testing dataset archive (.rar);
2) set the working directory in which prediction results are stored in the 'test.r' script (adjust the path to your own system): setwd("/home/qingzhou/github");
3) choose the predictor (*.RData) to load in the 'test.r' script: load("/home/qingzhou/github/RF_homo_1.RData");
4) select the dataset to predict in the 'test.r' script: testing_set = read.csv("/home/qingzhou/github/Dset48.csv");
5) run the 'test.r' script to predict the interface region.
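The name of the object stored inside each predictor file is not given in these instructions. One way to find out, sketched below, is to load the file into a separate environment and list what it contains. The helper name `inspect_rdata` is hypothetical, and the sketch uses a small stand-in file (a linear model on R's built-in `cars` data) so that it runs without the real predictors; point it at e.g. RF_homo_1.RData instead.

```r
# Load an .RData file into its own environment and report the class
# of every object it contains, without touching the global workspace.
inspect_rdata <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the object names
  classes <- vapply(
    object_names,
    function(nm) class(get(nm, envir = env))[1],
    character(1)
  )
  list(env = env, classes = classes)
}

# Stand-in file so the sketch runs anywhere (not one of the real predictors).
demo_path <- tempfile(fileext = ".RData")
toy_model <- lm(dist ~ speed, data = cars)
save(toy_model, file = demo_path)

found <- inspect_rdata(demo_path)
print(found$classes)
```

If the stored object turns out to be a randomForest classifier, then with the randomForest package attached, predict(model, newdata = testing_set, type = "prob") should return per-class probabilities; this is an assumption about the predictor files, not something confirmed here.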
5. Output 1: prediction performance, reported as Precision-Recall, TPR-FPR, AUC of ROC, MCC score and F1 score
Output 2: the probability score for each position being an interface site.
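For reference, the threshold-based scores and the AUC can be computed from the 'Interface' labels (0/1) and the per-position probability scores as in the base R sketch below. This is not the code used in 'test.r': the function name `evaluate_predictions`, the 0.5 cut-off and the toy vectors are assumptions made for illustration.

```r
# Precision, recall, F1, MCC and ROC AUC from 0/1 labels and scores.
evaluate_predictions <- function(truth, score, threshold = 0.5) {
  pred <- as.integer(score >= threshold)  # assumption: 0.5 cut-off
  # as.numeric() avoids integer overflow in the MCC denominator.
  tp <- as.numeric(sum(pred == 1 & truth == 1))
  fp <- as.numeric(sum(pred == 1 & truth == 0))
  tn <- as.numeric(sum(pred == 0 & truth == 0))
  fn <- as.numeric(sum(pred == 0 & truth == 1))

  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

  # AUC through the rank-sum (Mann-Whitney) identity; ties get mean ranks.
  r <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(precision = precision, recall = recall, F1 = f1, MCC = mcc, AUC = auc)
}

# Toy example: three interface residues, five non-interface residues.
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)
score <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1)
res <- evaluate_predictions(truth, score)
print(round(res, 3))
```

Precision, recall, F1 and MCC depend on the chosen cut-off, whereas the AUC is computed from the ranking of the scores alone.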