This database has a main file, the complete mammals database (AllSites); all other files are subsets of it. The first subset extracted from the mammals file is the human group, which is presented together with the mammals file above the main table. The main table is divided horizontally into mammals and human files, and vertically into canonical and non-canonical splice site pairs. Every group of files contains three types of files: the first holds pairs obtained directly from GenBank, without any filter (marked as pairs from GenBank); the second holds pairs supported by ESTs (marked as EST supported pairs); and the third holds pairs supported by ESTs and automatically corrected, meaning that all ambiguous junction cases have been deleted (marked as EST supported and corrected pairs).
In the human group, and only for non-canonical pairs, there is a special file: a subset of the non-canonical, EST supported and corrected pairs that is additionally supported by high throughput genome (HTG) sequences. The information and possible corrections derived from HTG support were produced by hand, by studying the sequence alignments case by case.
All registers in the database are presented in a tabular format, so every line in any file represents a completely specified splice site pair. We use two kinds of field separators: the different parts of every register are separated by the double symbol "@@", and inside every part of a register the field separator is an ordinary blank space or tab. This allows us to write long sentences inside a part of a register while keeping the parts clearly separated.
The typical structure for a register in our database can be represented as:
ID @@ ACCES @@ INTRON @@ DON @@ ACC @@ SEQ_DON @@ SEQ_ACC @@ EST @@ EST_ACCES @@ CORR
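As an illustration of the layout above, the following is a minimal Python sketch of how one register could be split into its named parts. The function name `parse_register` and the dictionary representation are assumptions made for this example, not part of the database. The example line is assembled from the example values quoted in this document; whether they all belong to one actual register is also an assumption.

```python
# Field names in the order given by the register layout.
FIELD_NAMES = [
    "ID", "ACCES", "INTRON", "DON", "ACC",
    "SEQ_DON", "SEQ_ACC", "EST", "EST_ACCES", "CORR",
]


def parse_register(line):
    """Split one register on "@@" and return a dict keyed by field name."""
    parts = [part.strip() for part in line.rstrip("\n").split("@@")]
    required = len(FIELD_NAMES) - 1  # CORR is optional
    if not required <= len(parts) <= len(FIELD_NAMES):
        raise ValueError(
            f"expected {required} or {len(FIELD_NAMES)} parts, got {len(parts)}"
        )
    if len(parts) == required:
        parts.append("")  # register without a CORR part
    return dict(zip(FIELD_NAMES, parts))


# Example line built from the sample values in this document (no CORR part).
example = " @@ ".join([
    "HG_0000731##114##122615##122965",
    "AB011399",
    "114",
    "122615",
    "122965",
    "aacatctgtctctactggaaacctctgcactgaggagcagattgattgataagcaaaaggcttctactgcatttccatcctt",
    "aaaaagctcactttttttgttcttcacattttacaggagcagacgcctccgcctagacctgaagcctaccccatccccactc",
    "B20",
    "gb|N35650|N35650",
])

record = parse_register(example)
print(record["ACCES"])  # AB011399
```

Splitting on "@@" first and only then on whitespace inside a part keeps free-text parts such as CORR intact.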
ID (Identifier):
This field always consists of a single word, a unique and specific identifier assigned to every pair. It is formed from the Infogene entry name, the assigned intron number, the donor position in the original sequence and the corresponding acceptor position, all joined using the "##" symbol (e.g. HG_0000731##114##122615##122965)
ACCES (Accession number):
This field always consists of a single word, the original GenBank accession number of the entry (e.g. AB011399)
INTRON (Intron number):
This field always consists of a single word, the intron number assigned to every intron pair in the Infogene database (e.g. 114)
DON (Donor position):
This field always consists of a single word, the donor position in the original Infogene entry (e.g. 122615)
ACC (Acceptor position):
This field always consists of a single word, the acceptor position in the original Infogene entry (e.g. 122965)
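Because the identifier is built from the entry name, intron number, donor position and acceptor position, it repeats the values of the intron number, donor and acceptor fields. The sketch below shows one way to decompose an identifier and check that agreement. The function names are hypothetical, and the sketch assumes that the three numeric components never contain "##".

```python
def split_pair_id(pair_id):
    """Return (entry_name, intron_number, donor_pos, acceptor_pos)."""
    # Split from the right so the three numeric components are always last.
    parts = pair_id.rsplit("##", 3)
    if len(parts) != 4:
        raise ValueError(f"malformed identifier: {pair_id!r}")
    entry, intron, donor, acceptor = parts
    return entry, int(intron), int(donor), int(acceptor)


def id_agrees_with_fields(pair_id, intron, don, acc):
    """Check that the identifier repeats the INTRON, DON and ACC values."""
    _, id_intron, id_don, id_acc = split_pair_id(pair_id)
    return (id_intron, id_don, id_acc) == (int(intron), int(don), int(acc))


parsed_id = split_pair_id("HG_0000731##114##122615##122965")
print(parsed_id)  # ('HG_0000731', 114, 122615, 122965)
print(id_agrees_with_fields("HG_0000731##114##122615##122965",
                            "114", "122615", "122965"))  # True
```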
SEQ_DON (Donor sequence):
This field always consists of a single word, the nucleotide sequence centered on the characteristic donor dinucleotide, with 40 bp on each side, for a total length of 82 bp (e.g. aacatctgtctctactggaaacctctgcactgaggagcagattgattgataagcaaaaggcttctactgcatttccatcctt)
SEQ_ACC (Acceptor sequence):
This field always consists of a single word, the nucleotide sequence centered on the characteristic acceptor dinucleotide, with 40 bp on each side, for a total length of 82 bp (e.g. aaaaagctcactttttttgttcttcacattttacaggagcagacgcctccgcctagacctgaagcctaccccatccccactc)
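Since both sequences have 40 bp on each side of the characteristic dinucleotide, the dinucleotide itself occupies positions 41 and 42 of the 82 bp string. The sketch below extracts it and classifies a pair. It assumes that "canonical" means a GT donor with an AG acceptor, the usual convention; this document does not state the definition, so the original paper should be checked. The function names and the synthetic test sequences are made up for illustration.

```python
FLANK = 40  # bases on each side of the central dinucleotide


def central_dinucleotide(seq):
    """Return the two bases at the centre of an 82 bp splice site sequence."""
    expected = 2 * FLANK + 2
    if len(seq) != expected:
        raise ValueError(f"expected {expected} bp, got {len(seq)}")
    return seq[FLANK:FLANK + 2].lower()


def is_canonical(seq_don, seq_acc):
    """Assumption: a pair is canonical when donor is GT and acceptor is AG."""
    return (central_dinucleotide(seq_don) == "gt"
            and central_dinucleotide(seq_acc) == "ag")


# Synthetic sequences, not taken from the database.
donor_gt = "a" * FLANK + "gt" + "c" * FLANK
donor_at = "a" * FLANK + "at" + "c" * FLANK
acceptor_ag = "t" * FLANK + "ag" + "g" * FLANK

print(central_dinucleotide(donor_gt))      # gt
print(is_canonical(donor_gt, acceptor_ag))  # True
print(is_canonical(donor_at, acceptor_ag))  # False
```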
EST (EST classification):
This field always consists of a single word, the EST classification obtained for the pair (see Material and Methods in the original paper for details) (e.g. B20)
EST_ACCES (EST accession number):
This field always consists of a single word, the accession number of the EST used to support our classification (e.g. gb|N35650|N35650)
CORR (Corrections):
This field is optional and is written as free text. All possible corrections made after EST support are annotated in this field, based on either ESTs or HTG sequences:
automatic EST correction in positions pos1 pos2 using ESTaccession: this annotation gives the positions that present ambiguities with respect to the annotated and supported junctions (pos1 and pos2), and the accession number of the EST that supports the alternative junction (ESTaccession).
For HTG support, the field records information about the HTG comparison for this entry (for more details see Results in the original paper).
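The automatic EST correction annotation can be read with a regular expression, as in the minimal sketch below. It assumes the annotation appears literally as "automatic EST correction in positions pos1 pos2 using ESTaccession", with whitespace between tokens; the exact wording in the real files may differ. The function name and the example annotation are made up for illustration.

```python
import re

# Assumed literal wording of the automatic EST correction annotation.
EST_CORRECTION = re.compile(
    r"automatic EST correction in positions\s+(\d+)\s+(\d+)\s+using\s+(\S+)"
)


def parse_est_correction(corr):
    """Return (pos1, pos2, est_accession), or None if no annotation is found."""
    match = EST_CORRECTION.search(corr)
    if match is None:
        return None
    pos1, pos2, est_accession = match.groups()
    return int(pos1), int(pos2), est_accession


# Hypothetical annotation, composed only to exercise the pattern.
note = "automatic EST correction in positions 122615 122965 using gb|N35650|N35650"
correction = parse_est_correction(note)
print(correction)  # (122615, 122965, 'gb|N35650|N35650')
```

Free text that does not contain the annotation, including an empty field, returns `None`, so the HTG information can be handled separately.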