A version of the FGENESH program that recognizes the NONCANONICAL GC dinucleotide in donor splice sites is available for on-line use.
This program is useful for analyzing ALTERNATIVE gene structures, where non-standard splice sites are often found (see also the FGENES-M program for predicting alternative gene variants), and for creating A SET of GENES and PROTEINS that are absent from standard gene predictions.
GC donor splice sites account for the major part of non-standard splice sites in human genes. They represent about 0.6% of all splice sites and are observed in more than 5% of human genes. Gene prediction on large-scale genomic sequences will therefore encounter hundreds of GC-donor exons and requires programs that can predict most of them. We recently investigated noncanonical splice sites (Burset, Seledtsov and Solovyev, 2000, Nucleic Acids Res., 28(21), 4364-4375) and obtained about 20000 EST-verified splice sites. We obtained a very strong GC-donor site weight matrix, which is used in the gene prediction program. We developed this variant of the program to predict GC-donor exons in addition to standard exons while preserving the accuracy of the program on standard genes. Testing the program on 68 human genes with at least one GC donor site shows that FGENESH (GC) provides a 10% higher rate of exact exon prediction for this group and 5% higher accuracy at the nucleotide level.
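To make the GT versus GC distinction concrete, here is a minimal Python sketch that reads the first two intron bases after a plus-strand exon and labels the donor site. It is not part of FGENESH: the function names and the toy sequence are hypothetical and serve only as an illustration. The sketch only inspects the dinucleotide and does not score the site with a weight matrix.

```python
def donor_dinucleotide(sequence, exon_end):
    """Return the first two intron bases after a plus-strand exon.

    exon_end is the 1-based, inclusive end of the exon, so the first
    intron base sits at 0-based index exon_end.
    """
    return sequence[exon_end:exon_end + 2].upper()


def classify_donor(dinucleotide):
    """Label a donor dinucleotide as canonical (GT), GC-donor, or other."""
    if dinucleotide == "GT":
        return "canonical"
    if dinucleotide == "GC":
        return "GC-donor"
    return "other"


# Hypothetical toy gene (NOT real data): three exons and two introns.
exon1, intron1 = "ATGGCCAAG", "GTAAGTCCCAG"   # intron 1 starts with GT
exon2, intron2 = "GCTGAAG", "GCAAGTTTTCAG"    # intron 2 starts with GC
exon3 = "GATAA"
toy = exon1 + intron1 + exon2 + intron2 + exon3

end1 = len(exon1)                                # 1-based end of exon 1
end2 = len(exon1) + len(intron1) + len(exon2)    # 1-based end of exon 2

for end in (end1, end2):
    dinuc = donor_dinucleotide(toy, end)
    print(end, dinuc, classify_donor(dinuc))
```

For exons on the complementary strand, the sequence would first have to be reverse-complemented and the coordinates converted; that case is left out here.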
Select the Human parameters and click the FGENESH_GC button. Paste your sequence into the window or load a file containing the sequence in FASTA format.
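If your sequence is not yet in FASTA format, a short script can prepare it before pasting or uploading. The following Python sketch is only an illustration: the function name `to_fasta`, the 60-character line width, and the restriction to the letters A, C, G, T and N are assumptions, not requirements stated by the server.

```python
def to_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line plus wrapped sequence lines."""
    # Drop whitespace and digits that often come along when a sequence
    # is copied from a database flat file.
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()

    # Assumed alphabet for genomic DNA; adjust if ambiguity codes are needed.
    unexpected = sorted(set(cleaned) - set("ACGTN"))
    if unexpected:
        raise ValueError(f"unexpected characters in DNA sequence: {unexpected}")

    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical input with spaces, a line break and position numbers.
print(to_fasta("my_sequence", "1 acgt acgt\n11 gcgc ttaa"))
```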
Solovyev V.V. (2001) Statistical approaches in eukaryotic gene prediction. In: Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., pp. 83-127.
(IN THIS EXAMPLE the 2nd EXON, WHICH HAS A GC-DONOR SITE, IS FOUND; it is MISSED by STANDARD gene finders)
G - predicted gene number, starting from start of sequence;
Str - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with the start codon), CDSi - internal exon, CDSl - last coding segment (ending with the stop codon);
TSS - Position of transcription start (TATA-box position and score);
Start and End - Position of the Feature;
Weight (shown in the Score column of the output) - Log likelihood*10 score for the feature;
ORF - start/end positions where the first complete codon starts and the last codon ends.
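For downstream processing, the exon rows can be read into named fields that follow the columns described above. The Python sketch below is an assumption about how one might parse a row; the `Exon` type, the `ROW` pattern and `parse_row` are hypothetical names, and the pattern only covers the three feature types listed in the legend.

```python
import re
from typing import NamedTuple, Optional


class Exon(NamedTuple):
    gene: int        # G: predicted gene number
    strand: str      # Str: '+' or '-'
    number: int      # exon number within the gene
    feature: str     # CDSf, CDSi or CDSl
    start: int       # Start of the feature
    end: int         # End of the feature
    score: float     # score column (log likelihood * 10)
    orf_start: int   # start of the first complete codon
    orf_end: int     # end of the last complete codon
    orf_len: int     # Len column


ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(CDSf|CDSi|CDSl)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(-?\d+(?:\.\d+)?)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)\s*$"
)


def parse_row(line: str) -> Optional[Exon]:
    """Parse one exon row; return None for lines that are not exon rows."""
    match = ROW.match(line)
    if match is None:
        return None
    g, strand, num, feat, start, end, score, o1, o2, olen = match.groups()
    return Exon(int(g), strand, int(num), feat, int(start), int(end),
                float(score), int(o1), int(o2), int(olen))


exon = parse_row("1 + 2 CDSi 747 - 853 22.53 748 - 852 105")
print(exon.feature, exon.end - exon.start + 1)  # prints: CDSi 107
```

Returning `None` for non-matching lines lets the same function be applied to every line of the output, skipping headers and protein sequences.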
fgeneshgc Wed Jan 30 20:59:27 EST 2002
FGENESH (with GC possible donor site) Gene prediction in Human genomic DNA
Time: Wed Jan 30 20:59:27 2002
Seq name: Softberry SERVER PAST Sequence
Length of sequence: 2932 GC content: 65 Zone: 4
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 5 in +chain 5 in -chain 0
Positions of predicted genes and exons:

G Str  Feature    Start      End   Score      ORF             Len
1  +   1 CDSf       501 -    580   15.57      501 -    578     78
1  +   2 CDSi       747 -    853   22.53      748 -    852    105
1  +   3 CDSi      1847 -   1980   17.97     1849 -   1980    132
1  +   4 CDSi      2255 -   2333   10.88     2255 -   2332     78
1  +   5 CDSl      2563 -   2705   15.94     2565 -   2705    141

Predicted protein(s):
>FGENESH 1 5 exon (s) 501 - 2705 180 aa, chain +
MADSELQLVEQRIRSFPDFPTPGVVFRDISPVLKDPASFRAAIGLLARHLKATHGGRIDY
IAGLDSRGFLFGPSLAQELGLGCVLIRKRGKLPGPTLWASYSLEYGKAELEIQKDALEPG
QRVVVVDDLLATGGTMNAACELLGRLQAEVLECVSLVELTSLKGREKLAPVPFFSLLQYE
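As a quick consistency check on the example above, the summed exon lengths should equal three bases per amino acid plus one stop codon, since the last coding segment ends with the stop codon. The Python sketch below uses the Start and End coordinates and the protein length from the example; the variable names are illustrative only.

```python
# Start/End coordinates of the five predicted exons in the example output.
exons = [(501, 580), (747, 853), (1847, 1980), (2255, 2333), (2563, 2705)]
protein_length = 180  # "180 aa" in the predicted protein header

# Coordinates are inclusive, so each exon spans end - start + 1 bases.
coding_length = sum(end - start + 1 for start, end in exons)
codons, remainder = divmod(coding_length, 3)

print(coding_length, codons, remainder)  # prints: 543 181 0

# 181 codons = 180 amino acids + 1 stop codon, with no leftover bases.
print(codons == protein_length + 1 and remainder == 0)  # prints: True
```

With these coordinates, the intron after exon 2 begins at position 854, so the donor dinucleotide reported as GC in this example would occupy positions 854-855 of the submitted sequence.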