SoftBerry - CTL-epitope-Finder HELP

Prediction of cytotoxic T lymphocyte epitopes in protein sequences

The program predicts putative cytotoxic T lymphocyte (CTL) epitopes in protein sequences. Such peptides are potential candidates for vaccine design.
All predicted epitopes are 9 residues long.

Input data:

A protein sequence in the standard 20-letter amino acid alphabet, in FASTA format.
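
As a rough illustration of the expected input, the following Python sketch reads the first record of a FASTA string and checks that it uses only the 20 standard amino acid letters. The function name read_fasta_protein and the example record are made up for this illustration and are not part of the program.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta_protein(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header line found")
    sequence = "".join(chunks)
    # Reject anything outside the 20-letter alphabet (e.g. X, B, Z, gaps).
    bad = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if bad:
        raise ValueError("non-standard residues: " + ", ".join(bad))
    return header, sequence


# Made-up record, used only to exercise the function.
example = ">example_protein (made-up record)\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = read_fasta_protein(example)
```

Whether the program itself rejects or skips non-standard letters is not stated here; raising an error is simply the choice made in this sketch.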

Input Parameters:

List Output: if this check box is checked, the output includes a list of predicted peptides with their locations in the sequence and their scores.
Threshold: the score value that separates positive examples (predicted epitopes, score >= threshold) from negative examples (non-epitopes, score < threshold). By default, threshold=0 (recommended).

Output data:

For each position of the sequence (except the last eight C-terminal positions), the program reports whether the peptide of length 9 starting at that position is predicted to be a cytotoxic T lymphocyte epitope (*) or not ( ). If the List Output check box is checked, a list of the predicted epitopes is also printed.
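
The relation between the threshold, the 9-residue window, and the two output forms can be sketched in Python as follows. This is only an illustration: toy_score is a hypothetical placeholder for the program's real scoring function, and the exact layout of the program's printed output is not reproduced here.

```python
EPITOPE_LENGTH = 9


def mark_epitopes(sequence, score_fn, threshold=0.0):
    """Return one mark per start position: '*' if predicted, ' ' otherwise.

    The last eight positions have no full 9-residue window, so the result
    has len(sequence) - 8 characters.
    """
    n_windows = max(len(sequence) - EPITOPE_LENGTH + 1, 0)
    marks = []
    for start in range(n_windows):
        peptide = sequence[start:start + EPITOPE_LENGTH]
        marks.append("*" if score_fn(peptide) >= threshold else " ")
    return "".join(marks)


def list_epitopes(sequence, score_fn, threshold=0.0):
    """Return (1-based position, peptide, score) for each predicted epitope."""
    hits = []
    for start in range(max(len(sequence) - EPITOPE_LENGTH + 1, 0)):
        peptide = sequence[start:start + EPITOPE_LENGTH]
        score = score_fn(peptide)
        if score >= threshold:
            hits.append((start + 1, peptide, score))
    return hits


def toy_score(peptide):
    # Hypothetical stand-in for the real discriminant score.
    return peptide.count("L") - 2


toy_sequence = "ALLLAAAAAAAA"  # made-up 12-residue sequence
marks = mark_epitopes(toy_sequence, toy_score)
hits = list_epitopes(toy_sequence, toy_score)
```

With the default threshold of 0, a peptide whose score is exactly 0 counts as a predicted epitope, because the comparison is score >= threshold.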

Algorithm.

The algorithm uses sequence comparison and linear discriminant analysis to predict CTL epitopes. For each query sequence of length 9, we calculate position-specific similarity scores against position-specific score matrices derived from the positive and negative training sets (9 predicting parameters).

Additionally, we calculate the 5 top sequence similarity scores between the query sequence and sequences from the positive set, and the 5 top scores against the negative set (10 parameters).

Using these 19 parameters, we derive a linear discriminant function from the training dataset.

We use this function to discriminate between epitope and non-epitope sequences.
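
The structure of the 19 parameters can be sketched in Python as below. Several details are assumptions made only for this illustration, since the text does not specify them: the per-position score is taken to be the log-odds of the positive versus negative residue frequency, sequence similarity is taken to be the number of identical positions, and the weights are arbitrary placeholders rather than a fitted discriminant function (the fitting step is omitted). The training peptides are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LENGTH = 9


def position_frequencies(peptides, pseudocount=1.0):
    """Per-position residue frequencies (with pseudocounts) for 9-mers."""
    tables = []
    for pos in range(PEPTIDE_LENGTH):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for pep in peptides:
            counts[pep[pos]] += 1.0
        total = sum(counts.values())
        tables.append({aa: c / total for aa, c in counts.items()})
    return tables


def pssm_features(peptide, pos_freq, neg_freq):
    """9 values, one per position. ASSUMPTION: log-odds positive vs negative."""
    return [math.log(pos_freq[i][aa] / neg_freq[i][aa])
            for i, aa in enumerate(peptide)]


def identity(a, b):
    """ASSUMPTION: similarity = number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def top_similarities(peptide, reference, k=5):
    """The k highest similarity scores against a reference set, zero-padded."""
    scores = sorted((identity(peptide, ref) for ref in reference), reverse=True)
    scores += [0] * (k - len(scores))
    return scores[:k]


def feature_vector(peptide, positives, negatives, pos_freq, neg_freq):
    """9 matrix scores + 5 positive-set scores + 5 negative-set scores = 19."""
    return (pssm_features(peptide, pos_freq, neg_freq)
            + top_similarities(peptide, positives)
            + top_similarities(peptide, negatives))


def discriminant(features, weights, bias=0.0):
    """Linear discriminant: weighted sum of the parameters plus a constant."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Made-up training peptides and placeholder weights, for illustration only.
positives = ["ALLLLLLLV", "ALLLLLLLL", "GLLLLLLLV"]
negatives = ["DDDDDDDDD", "EEEEEEEEE", "KKKKKKKKK"]
pos_freq = position_frequencies(positives)
neg_freq = position_frequencies(negatives)
features = feature_vector("ALLLLLLLV", positives, negatives, pos_freq, neg_freq)
placeholder_weights = [1.0] * 14 + [-1.0] * 5
score = discriminant(features, placeholder_weights)
```

In this toy run the query is itself a member of the positive set, so its best similarity score is a perfect 9; when building parameters for training sequences, one would presumably exclude the sequence itself from the comparison.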

Datasets.

We used the MHCBN database (1) to obtain training and testing datasets.

The data extraction procedure is similar to that described in (2). For positive examples, we selected CTL epitopes from the database using the criteria [ACTIVITY=yes] & [SEQLEN=9] & [BINDING=yes]. After removing identical sequences and sequences with non-standard amino acids, 1368 sequences remained.

The negative dataset was constructed from non-epitope and non-binding sequences in the same way as described in (2).

The data were randomly split into a test set of 200 positive and 200 negative sequences, with the remaining sequences forming the training set.
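
A split of this kind could be sketched in Python as follows. The function name, the fixed random seed, and the placeholder sequence identifiers are assumptions for illustration. The size of the negative set is not stated in this document, so the value 1000 below is a made-up placeholder.

```python
import random


def split_dataset(positives, negatives, n_test=200, seed=0):
    """Randomly hold out n_test positives and n_test negatives as a test set.

    Returns (train, test) as lists of (sequence, label) pairs,
    with label 1 for epitopes and 0 for non-epitopes.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = list(positives)
    neg = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = [(s, 1) for s in pos[:n_test]] + [(s, 0) for s in neg[:n_test]]
    train = [(s, 1) for s in pos[n_test:]] + [(s, 0) for s in neg[n_test:]]
    return train, test


# Placeholder identifiers stand in for real peptide sequences.
positive_ids = [f"pos{i}" for i in range(1368)]   # 1368 positives, as stated
negative_ids = [f"neg{i}" for i in range(1000)]   # made-up negative set size
train, test = split_dataset(positive_ids, negative_ids)
```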

On the test set, the fraction of correct predictions by our program is 0.835 (334 correct predictions out of 400).
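
The reported fraction follows directly from the two counts, as this short Python check shows. The function name is chosen here for illustration only.

```python
def fraction_correct(n_correct, n_total):
    """Fraction of test sequences that were classified correctly."""
    return n_correct / n_total


test_fraction = fraction_correct(334, 400)
n_errors = 400 - 334  # test sequences that were misclassified
print(f"{test_fraction:.3f} correct, {n_errors} errors")
```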

(1) Bhasin M, Singh H, Raghava GPS. MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics (2003)19:666.

(2) Bhasin M, Raghava GPS. Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine (2004)22:3195-3204.