Introduction to SpliceDB

A set of 43337 splice junction pairs was extracted from mammalian GenBank annotated genes. 22489 of them are supported by EST sequences. 98.71% of those contain canonical dinucleotides GT and AG for donor and acceptor sites, respectively. 0.56% hold non-canonical GC-AG splice site pairs. The reminder 0.73% occurs in a lot of small groups (with maximum size of 0.05%). Studying these groups we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases we present a new classification consisting of only 8 observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea is given substantial support when we compare the sequences of human genes having non-canonical splice sites deposited in GenBank by high throughput genome sequencing projects (HTG). 156 out of 171 human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. They can be classified after corrections as: 79 GC-AG pairs (of which 1 was an error that corrected to GC-AG), 61 errors that were corrected to GT-AG canonical pairs, 6 AT-AC pairs (of which 2 were errors that corrected to AT-AC), 1 case was produced from non-existent intron, 7 cases were found in HTG that were deposited to GenBank and finally there were only 2 cases left of supported non-canonical splice sites. If we assume that approximately the same situation is true for the whole set of annotated mammalian non-canonical splice sites, then the 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and finally only 0.02% could consist of other types of non-canonical splice sites. We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22199 entries vs. about 600), and finally a set of 290 EST supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.