The Prot_map program maps a set of protein sequences onto a genomic sequence, producing gene structures and the corresponding alignments of coding exons with similar or identical protein queries. Its input is a genomic sequence and a set of protein sequences. Prot_map reconstructs a complete gene structure from an identical or similar protein, rather than the set of unordered alignment fragments that the Blast program generates. The program is very fast (Table 1) and produces gene structures with accuracy similar to that of the much slower GeneWise program, which in practice requires the genomic location of the protein to be known in advance. The accuracy of gene reconstruction can be improved significantly further by passing the results of Prot_map (i.e. a fragment of genomic sequence and the protein sequence mapped onto it) to the Fgenesh+ program (Table 2).
Prot_map is used (1) in the Fgenesh++ pipeline for automatic annotation of new genomic sequences and (2) to generate a set of genes in new genomes (those without known genes) for training the parameters of gene-finding programs. (3) It is also very useful for finding pseudogenes, by selecting the corrupted gene structures that result from mapping a set of known proteins.
Figure 1. Example of mapping a protein sequence onto human chromosome 19.
L:3000000 Sequence Chr19 [cut:1 3000000]
[DD] Sequence: 1( 1), S: 105.56, L:1739 IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first sequence: start 2146727, end 2167197, length 20471
On second sequence: start 263, end 1682, length 1420
Blocks of alignment: 21
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 103.043, S:11.68
4 E: 2150399 111 [AG GT] P: 2150399 336 L: 37, G: 102.130, S:18.82
5 E: 2150620 235 [AG GT] P: 2150620 373 L: 78, G: 101.500, S:27.15
6 E: 2151098 114 [AG GT] P: 2151100 452 L: 37, G: 106.924, S:19.76
7 E: 2151750 92 [AG GT] P: 2151752 490 L: 30, G: 101.424, S:16.82
8 E: 2153538 102 [AG GT] P: 2153538 520 L: 34, G: 100.496, S:17.73
9 E: 2153848 138 [AG GT] P: 2153848 554 L: 46, G: 99.003, S:20.30
10 E: 2154470 126 [AG GT] P: 2154470 600 L: 42, G: 101.283, S:19.87
……………………………………………………………………………………………………………………………………………………………………………………
1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
---------------(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277
2146797 2146806 2147558 2147568 2147581 2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
 ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299
2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323
2148967 2148982 2150384 2150391 2150402 2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347
2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373
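The numbered block lines in Figure 1 are regular enough to be read by a script. The following Python sketch is illustrative only and is not part of Prot_map. The field meanings are assumptions inferred from the numbers in the example, not documented definitions: `E:` is taken as exon start and exon length in nucleotides, the bracketed pair as the flanking splice-site dinucleotides, `P:` as the alignment start on the genome and on the protein, `L:` as the aligned length in amino acids, and `G:` and `S:` as two numeric scores. The names `Block`, `BLOCK_RE` and `parse_blocks` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    """One alignment block; field meanings are assumed, see the text above."""
    index: int
    exon_start: int      # assumed: exon start on the genomic sequence
    exon_len_nt: int     # assumed: exon length in nucleotides
    left_site: str       # assumed: dinucleotide before the exon (e.g. AG)
    right_site: str      # assumed: dinucleotide after the exon (e.g. GT)
    aln_start: int       # assumed: alignment start on the genomic sequence
    protein_start: int   # assumed: alignment start on the protein
    aln_len_aa: int      # assumed: aligned length in amino acids
    g: float             # reported as "G:" in the output
    s: float             # reported as "S:" in the output


# The comma after the G value is optional because block 1 in the
# example output omits it.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\S+)\s+(\S+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*"
    r"G:\s*([\d.]+),?\s*S:\s*([\d.]+)"
)


def parse_blocks(text: str) -> List[Block]:
    """Return every block line found in text; other lines are skipped."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m is None:
            continue
        i, es, el, ls, rs, ps, pp, pl, g, s = m.groups()
        blocks.append(Block(int(i), int(es), int(el), ls, rs,
                            int(ps), int(pp), int(pl), float(g), float(s)))
    return blocks


# First three block lines copied from Figure 1.
SAMPLE = """\
Blocks of alignment: 21
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 103.043, S:11.68
"""

if __name__ == "__main__":
    for b in parse_blocks(SAMPLE):
        print(b.index, b.exon_start, b.exon_len_nt, b.aln_len_aa, b.s)
```

In the example, each exon length in nucleotides is about three times the `L:` value (70 vs. 23, 107 vs. 35, 42 vs. 14), which is what motivates reading `L:` as an amino-acid count.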
Table 1. Speed of processing sequences by Prot_map, Fgenesh+ and GeneWise.
Test set | Fgenesh+ | Prot_map | GeneWise |
88 sequences of genes < 20 kb | ~1 min | ~1 min | ~90 min |
8 sequences of genes > 400 kb | ~1 min | ~1 min | ~1200 min |
Table 2. Comparison of the accuracy of gene identification programs, ab initio Fgenesh and prediction with protein support (Fgenesh+, GeneWise and Prot_map), on a set of human genes using mouse or Drosophila homologous proteins. %CG (correct genes) is the percentage of exactly predicted genes.
Mouse homologs: 60% < similarity level < 80% - 1425 sequences
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 83.4 | 90.9 | 86.8 | 93.2 | 94.9 | 0.937 | 30 |
GeneWise | 88.1 | 96.5 | 90.5 | 97.8 | 99.2 | 0.984 | 43 |
Fgenesh+ | 93.9 | 97.9 | 94.9 | 98.4 | 99.3 | 0.988 | 65 |
Prot_map | 87.0 | 96.5 | 86.6 | 97.0 | 98.5 | 0.976 | 40 |
Drosophila homologs: similarity level > 80% - 66 sequences.
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 90.5 | 93.8 | 95.1 | 97.9 | 96.9 | 0.950 | 55 |
GeneWise | 79.3 | 83.9 | 86.8 | 97.3 | 99.5 | 0.985 | 23 |
Fgenesh+ | 95.1 | 97.8 | 97.0 | 98.9 | 99.5 | 0.9914 | 70 |
Prot_map | 86.4 | 95.3 | 88.1 | 97.6 | 99.0 | 0.982 | 41 |
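The page does not define the column headings of Table 2. Assuming they follow the conventions commonly used to evaluate gene-prediction programs, Sn nuc and Sp nuc are nucleotide-level sensitivity, TP / (TP + FN), and specificity, TP / (TP + FP), and CC is the correlation coefficient between predicted and actual coding nucleotides. The Python sketch below shows how such values could be computed from nucleotide counts. The function name and the counts in the usage example are hypothetical and are not taken from the benchmark.

```python
import math
from typing import Tuple


def nucleotide_accuracy(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (Sn, Sp, CC) from nucleotide-level counts.

    Assumed definitions (conventional in gene-prediction evaluation,
    not confirmed by the source page):
      Sn = TP / (TP + FN)   fraction of true coding nucleotides predicted
      Sp = TP / (TP + FP)   fraction of predicted coding nucleotides that are true
      CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
    """
    denom = (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    if tp + fn == 0 or tp + fp == 0 or denom == 0:
        raise ValueError("counts do not allow Sn, Sp and CC to be computed")
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / math.sqrt(denom)
    return sn, sp, cc


if __name__ == "__main__":
    # Hypothetical counts for a 1000-nucleotide region.
    sn, sp, cc = nucleotide_accuracy(tp=90, fp=10, tn=890, fn=10)
    print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

Note that under this convention Sp is what other fields call precision, so it does not depend on the number of true negatives.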