
Prot_map is a new, fast tool for aligning proteins to a genome and accurately reconstructing exon-intron gene structure

Prot_map maps a set of protein sequences onto a genomic sequence, producing gene structures and the corresponding alignments of coding exons with similar or identical protein queries. Its input is a genomic sequence and a set of protein sequences. Rather than returning a set of unordered alignment fragments, as Blast does, Prot_map reconstructs the gene structure from an identical or similar protein. The program is very fast and produces gene structures with accuracy similar to that of the slow GeneWise program, which in practice requires knowing the genomic location of the protein (Table 1). The accuracy of gene reconstruction can be improved significantly further by running Fgenesh+ on the results of Prot_map (i.e., a fragment of genomic sequence and the protein sequence mapped onto it) (Table 2).

Prot_map is used (1) in the Fgenesh++ pipeline for automatic annotation of new genomic sequences and (2) to generate a set of genes in new genomes (those without known genes) for training the parameters of gene-finding programs. (3) It is also very useful for finding pseudogenes, by selecting the corrupted gene structures that result from mapping a set of known proteins.

Figure 1. Example of mapping a protein sequence onto human chromosome 19.

L:3000000    Sequence Chr19 [cut:1 3000000]
[DD] Sequence:       1(      1), S:      105.56, L:1739
IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first  sequence: start   2146727, end   2167197, length 20471
On second sequence: start       263, end      1682, length 1420
Blocks of alignment: 21       
    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75
    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56
    3 E: 2148934      42 [AG GT] P: 2148934     322 L: 14, G: 103.043, S:11.68
    4 E: 2150399     111 [AG GT] P: 2150399     336 L: 37, G: 102.130, S:18.82
    5 E: 2150620     235 [AG GT] P: 2150620     373 L: 78, G: 101.500, S:27.15
    6 E: 2151098     114 [AG GT] P: 2151100     452 L: 37, G: 106.924, S:19.76
    7 E: 2151750      92 [AG GT] P: 2151752     490 L: 30, G: 101.424, S:16.82
    8 E: 2153538     102 [AG GT] P: 2153538     520 L: 34, G: 100.496, S:17.73
    9 E: 2153848     138 [AG GT] P: 2153848     554 L: 46, G:  99.003, S:20.30
   10 E: 2154470     126 [AG GT] P: 2154470     600 L: 42, G: 101.283, S:19.87
    
          1        11   2146713   2146723   2146739   2146769
          gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
          ---------------(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
        248       248       249       259       267       277

    2146797   2146806   2147558   2147568   2147581   2147611
          ]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
           ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
        286       286       286       286       289       299

    2147641   2147671   2147686   2148919   2148926   2148937
          PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
          PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
        309       319       322       322       322       323

    2148967   2148982   2150384   2150391   2150402   2150432
          KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
          KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
        333       336       336       336       337       347

    2150462   2150492   2150513   2150523   2150609   2150619
          TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
          TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
        357       367       373       373       373       373
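
The numbered block lines in Figure 1 can be read programmatically. The sketch below is a minimal parser written under assumptions, since this page does not document the fields: `E:` is taken to give the exon start on the genome and its length in nucleotides, the bracket to hold the flanking splice-site dinucleotides, `P:` to give the genomic start of the aligned region and the start position in the protein, and `L:` to give the aligned length in amino acids. `G:` and `S:` are kept as unlabeled scores. The names `Block`, `parse_blocks` and `intron_lengths` are hypothetical and are not part of Prot_map.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    index: int        # block number
    exon_start: int   # assumed: exon start on the genomic sequence
    exon_len: int     # assumed: exon length in nucleotides
    left_site: str    # assumed: dinucleotide before the exon (e.g. AG)
    right_site: str   # assumed: dinucleotide after the exon (e.g. GT)
    aln_start: int    # assumed: genomic start of the aligned region
    prot_start: int   # assumed: start position in the protein
    aa_len: int       # assumed: aligned length in amino acids
    g: float          # "G:" value, meaning not documented here
    s: float          # "S:" value, meaning not documented here


# The comma after the G value is optional because the first block line omits it.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\S+)\s+(\S+)\]"
    r"\s+P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*G:\s*([\d.]+),?\s+S:\s*([\d.]+)\s*$"
)


def parse_blocks(text: str) -> List[Block]:
    """Return one Block per line that matches the block-line layout."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m:
            i, es, el, ls, rs, a, p, n, g, s = m.groups()
            blocks.append(Block(int(i), int(es), int(el), ls, rs,
                                int(a), int(p), int(n), float(g), float(s)))
    return blocks


def intron_lengths(blocks: List[Block]) -> List[int]:
    """Gap in nucleotides between the end of each exon and the next exon start."""
    return [b.exon_start - (a.exon_start + a.exon_len)
            for a, b in zip(blocks, blocks[1:])]


sample = """\
    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75
    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56
"""
blocks = parse_blocks(sample)
print(intron_lengths(blocks))  # [776]
```

Under these assumptions the first intron spans 776 nucleotides, which agrees with the alignment above, where the intron begins right after the first exon at position 2146797.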

Table 1. Speed of processing sequences by Prot_map, Fgenesh+ and GeneWise.

                                    Fgenesh+   Prot_map   GeneWise
88 sequences of genes < 20 kb       ~1 min     ~1 min     ~90 min
8 sequences of genes > 400000 kb    ~1 min     ~1 min     ~1200 min

Table 2. Accuracy of gene identification programs on a set of human genes: ab initio Fgenesh compared with prediction with protein support (Fgenesh+, GeneWise and Prot_map), using mouse or Drosophila homologous proteins. %CG (correct genes) is the percentage of genes predicted exactly.

Mouse homologs: 60% < similarity level < 80% - 1425 sequences

            Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh     83.4    90.9     86.8    93.2     94.9     0.937    30
GeneWise    88.1    96.5     90.5    97.8     99.2     0.984    43
Fgenesh+    93.9    97.9     94.9    98.4     99.3     0.988    65
Prot_map    87.0    96.5     86.6    97.0     98.5     0.976    40

Drosophila homologs: similarity level > 80% - 66 sequences.

            Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh     90.5    93.8     95.1    97.9     96.9     0.950    55
GeneWise    79.3    83.9     86.8    97.3     99.5     0.985    23
Fgenesh+    95.1    97.8     97.0    98.9     99.5     0.9914   70
Prot_map    86.4    95.3     88.1    97.6     99.0     0.982    41
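
The column headings in Table 2 are not defined on this page. The sketch below shows how such measures are conventionally computed in gene-prediction benchmarks, assuming that "Sn ex" and "Sp ex" are exon-level sensitivity and specificity over exactly matching exons, that "Sno ex" counts true exons that are at least overlapped by a prediction, and that "Sn nuc", "Sp nuc" and "CC" are nucleotide-level sensitivity, specificity and correlation coefficient. These readings are assumptions, as are the function names. Exons are inclusive `(start, end)` pairs, and both exon lists are assumed to be non-empty.

```python
from math import sqrt


def exon_level(true_exons, pred_exons):
    """Return (Sn, Sn with overlap, Sp) at the exon level, as fractions."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)
    # A true exon counts as overlapped if any predicted exon shares a base with it.
    overlapped = sum(
        any(ps <= te and ts <= pe for ps, pe in pred_set)
        for ts, te in true_set
    )
    return (exact / len(true_set),
            overlapped / len(true_set),
            exact / len(pred_set))


def covered(exons):
    """Set of genomic positions covered by the given exons."""
    return {p for start, end in exons for p in range(start, end + 1)}


def nucleotide_level(true_exons, pred_exons, seq_len):
    """Return (Sn, Sp, CC) at the nucleotide level for a sequence of seq_len bases."""
    t, p = covered(true_exons), covered(pred_exons)
    tp, fp, fn = len(t & p), len(p - t), len(t - p)
    tn = seq_len - tp - fp - fn
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


# Toy data, not taken from the tables: the second exon is predicted 2 bases short.
true_exons = [(10, 19), (30, 39)]
pred_exons = [(10, 19), (32, 39)]
print(exon_level(true_exons, pred_exons))
print(nucleotide_level(true_exons, pred_exons, seq_len=100))
```

In the toy case one of the two exons is exact, so exon-level sensitivity and specificity are both 0.5, while the overlap-based sensitivity is 1.0. This illustrates why an overlap-based column can be noticeably higher than the exact-match one.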