SNP-effect - SNP annotation and damaging effect prediction
SNP-effect performs genic annotation for a set of human genetic variations. Variations must be supplied in one of the supported formats, and position numbering must correspond to the GRCh37/hg19 genome assembly. Only single nucleotide substitutions are evaluated; longer substitutions and indels are ignored.
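The filtering rule above can be illustrated with a minimal sketch. The function name is hypothetical and this is not SNP-effect's own code; it only shows one way to separate single nucleotide substitutions from indels and longer substitutions, assuming alleles are written with the letters A, C, G, T.

```python
def is_single_nucleotide_substitution(ref: str, alt: str) -> bool:
    """Return True only for a one-base to one-base change, e.g. C -> G."""
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    # Multi-letter alleles (indels, longer substitutions) are not in the set.
    return ref in bases and alt in bases and ref != alt


# (ref, alt) pairs; the last three are a deletion, an insertion,
# and a two-base substitution, all of which would be skipped.
candidates = [("C", "G"), ("T", "C"), ("CA", "C"), ("G", "GT"), ("AC", "GT")]
kept = [pair for pair in candidates if is_single_nucleotide_substitution(*pair)]
```

Here `kept` holds only the first two pairs.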
The following information is reported in the output file for each variation, if available:
1. Chromosome, position, reference and observed nucleotides, as in the input file.
2. dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) ID and OMIM (http://omim.org/) link.
3. Repetitive DNA element from the UCSC database (http://genome.ucsc.edu/) that contains the variation.
4. Gene that contains the variation, with its orientation, position, and CDS range. Genes are taken from the UCSC database, the fRNAdb database (http://www.ncrna.org/frnadb/), and FGENESH predictions (Solovyev et al., 2006); the ID of a predicted gene starts with FG_.
5. Positions of all exons in the gene.
6. Short description of the gene from the UCSC database.
7. Gene type: protein coding or RNA coding (not mRNA).
8. Clinical significance of the gene, according to the HUMSAVAR (http://www.uniprot.org/docs/humsavar) and Genetic Association Database (http://geneticassociationdb.nih.gov/) databases.
9. Position of the variation in the gene: CDS or UTR, exon or intron. It is also reported whether the variation lies in a splice site, defined as the two nucleotides at the 5' and 3' ends of an intron.
10. Position in the translated protein of the amino acid substitution caused by the variation.
11. Total length of the translated protein.
12. Codon change in the gene caused by the variation.
13. Amino acid substitution in the protein.
14. Tolerance score of the substitution, calculated by an algorithm similar to the SIFT approach (Ng and Henikoff, 2003). The score is based on the occurrence of amino acids at this position in an alignment of the encoded protein with homologs from the UniProt Knowledgebase (http://www.uniprot.org/) or the NCBI nr protein database (http://www.ncbi.nlm.nih.gov/). If the score is below the threshold value (0.05), the variation is considered damaging. The tolerance score cannot be calculated when there are too few homologs.
Items 4-14 are repeated for each gene under consideration.
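The position of a variation within a gene (exon or intron, CDS or UTR, splice site) can be derived from the exon list and CDS range alone. The sketch below is an illustration, not SNP-effect's implementation: the function name `locate` is hypothetical, coordinates are assumed to be 1-based and inclusive, and exons are assumed to be numbered in transcript order (so the last exon in genomic order is exon 1 on the minus strand). The gene coordinates are those of uc001agd.3 and uc002upa.2 from the output example in this document.

```python
def locate(pos, strand, exons, cds):
    """Describe where pos falls in a gene.

    exons: list of (start, end) in ascending genomic order.
    cds:   (start, end) of the coding region in genomic coordinates.
    """
    cds_start, cds_end = cds
    n = len(exons)
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            # Exon numbers follow the transcript, so they run backwards
            # along the genome for minus-strand genes.
            number = i + 1 if strand == "+" else n - i
            if cds_start <= pos <= cds_end:
                return f"exon {number}, CDS"
            # Before the CDS in transcript direction means 5' UTR.
            upstream = (pos < cds_start) == (strand == "+")
            side = "5'" if upstream else "3'"
            return f"exon {number}, {side} UTR"
    for i in range(n - 1):
        intron_start = exons[i][1] + 1
        intron_end = exons[i + 1][0] - 1
        if intron_start <= pos <= intron_end:
            # Splice site: the two nucleotides at either end of the intron.
            if pos <= intron_start + 1 or pos >= intron_end - 1:
                return "intron, splice site"
            return "intron"
    return "outside gene"


SSU72_EXONS = [(1477053, 1477547), (1479249, 1479367), (1480243, 1480382),
               (1500153, 1500296), (1509858, 1510262)]
SSU72_CDS = (1477446, 1509937)

FRZB_EXONS = [(183698005, 183699692), (183702676, 183702739),
              (183703137, 183703341), (183707206, 183707271),
              (183723514, 183723561), (183730803, 183731498)]
FRZB_CDS = (183699576, 183731280)

utr_hit = locate(1477244, "-", SSU72_EXONS, SSU72_CDS)
cds_hit = locate(183699584, "-", FRZB_EXONS, FRZB_CDS)
splice_hit = locate(1477548, "-", SSU72_EXONS, SSU72_CDS)
```

Under these assumptions, `utr_hit` falls in a 3' UTR and `cds_hit` in exon 6, which agrees with the "Variation location" lines of the output example; `splice_hit` is the first intronic base after the first genomic exon of uc001agd.3.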
Supported file formats:
VCF file format (http://www.1000genomes.org/node/101). A VCF file contains meta-information lines (starting with ##), a header line (starting with #CHROM), and data lines, each containing information about a position in the genome. The first column is the chromosome, the second is the position, the fourth is the reference nucleotide, and the fifth is an observed non-reference nucleotide or a comma-separated list of such nucleotides. In the current version, meta-information and header lines are ignored.
ANNOVAR file format (http://www.openbioinformatics.org/annovar/annovar_input.html). The first five columns are chromosome, start position, end position, reference nucleotide, and observed nucleotide. For a SNP, the start position and the end position are the same. Other columns are optional and are ignored.
Simple file format. This format is the most convenient for manual input of SNP data. It has four columns: chromosome, position, reference nucleotide, and observed nucleotide.
23andMe file format (http://snpedia.com/index.php/23andMe). Header lines start with #. It has four columns: ID, chromosome, position, and observed nucleotides (the genotype, e.g. CT).
All these formats are tab-delimited, but any sequence of tabs or spaces is treated as a column separator. The reference nucleotide for a variation is taken from the reference genome sequence. Please note that all positions must correspond to the GRCh37/hg19 genome assembly and that base numbering starts from 1.
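The column layouts described above can be read with a few lines of code. The following is a minimal sketch, not the parser used by SNP-effect; the function name `parse_variants` and the format keys ("vcf", "annovar", "simple", "23andme") are hypothetical. For the 23andMe layout, which has no reference column, the sketch returns `None` as the reference and passes the genotype string through unchanged.

```python
def parse_variants(text, fmt):
    """Yield (chrom, pos, ref, alt) tuples from text in the given format."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, VCF meta/header line, or 23andMe header
        cols = line.split()  # any run of tabs or spaces separates columns
        if fmt == "vcf":
            chrom, pos, ref = cols[0], int(cols[1]), cols[3]
            # The fifth column may list several observed nucleotides.
            for alt in cols[4].split(","):
                yield chrom, pos, ref, alt
        elif fmt == "annovar":
            # Column 3 (end position) is skipped; extra columns are ignored.
            yield cols[0], int(cols[1]), cols[3], cols[4]
        elif fmt == "simple":
            yield cols[0], int(cols[1]), cols[2], cols[3]
        elif fmt == "23andme":
            yield cols[1], int(cols[2]), None, cols[3]
        else:
            raise ValueError(f"unknown format: {fmt}")


# The two header lines are illustrative; the data lines come from this document.
vcf_text = """##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
1 10469 . C G 100 PASS . GT:AP 1|0:0.740,0.450
1 1477244 rs7290 T C 100 PASS . GT:AP 1|1:0.700,0.820
"""
simple_text = "1\t10469\tC\tG\n1 1477244   T C\n"

vcf_records = list(parse_variants(vcf_text, "vcf"))
simple_records = list(parse_variants(simple_text, "simple"))
```

Both calls yield the same two records, which reflects why the different input formats lead to identical annotation output.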
Input data examples in different formats:
VCF:
1 10469 . C G 100 PASS . GT:AP 1|0:0.740,0.450
1 1477244 rs7290 T C 100 PASS . GT:AP 1|1:0.700,0.820
2 183699584 rs7775 G C 100 PASS . GT:AP 0|1:0.440,0.695
ANNOVAR:
1 10469 10469 C G
1 1477244 1477244 T C comments: rs7290
2 183699584 183699584 G C
Simple:
1 10469 C G
1 1477244 T C
2 183699584 G C
23andMe:
. 1 10469 CG
rs7290 1 1477244 CT
rs7775 2 183699584 CC
All these examples produce the same output:
chr1 10469 C-->G
Repeat: telo 10468..11447
INTERGENIC. between FR137075 and uc010nxq.1
chr1 1477244 T-->C
dbSNP ID: rs7290
Name: uc001agd.3 Strand: - Region: 1477053..1510262 CDS:1477446..1509937
Exons: 1477053..1477547, 1479249..1479367, 1480243..1480382, 1500153..1500296, 1509858..1510262
Description: Homo sapiens SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) (SSU72), mRNA.
Type of gene: Protein coding
Variation location: Out of CDS, 3' UTR.
chr2 183699584 G-->C
dbSNP ID: rs7775, OMIM link: http://omim.org/entry/605083#0001
Name: uc002upa.2 Strand: - Region: 183698005..183731498 CDS:183699576..183731280
Exons: 183698005..183699692, 183702676..183702739, 183703137..183703341, 183707206..183707271, 183723514..183723561, 183730803..183731498
Description: Homo sapiens frizzled-related protein (FRZB), mRNA.
Type of gene: Protein coding
Clinical significance: osteoarthritis; colorectal cancer; Defects in FRZB are associated with susceptibility to osteoarthritis type 1(OS1);
Variation location: Exon 6
Position in protein: 324
Protein length: 325
Codon: CGC => GGC
Translation: R => G
Tolerance Score: 0.01 (damaged)
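The protein position and codon change in the last record can be checked by hand from the reported exon and CDS coordinates. The sketch below is an illustration under stated assumptions, not SNP-effect's own code: the helper `cds_index` is hypothetical, the CDS range is assumed to include the stop codon, and the reference codon CGC is copied from the output above rather than read from the genome. Because the gene lies on the minus strand, the genomic change G-->C is complemented to C-->G on the coding strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Only the two codons needed here; a full tool would use the whole genetic code.
CODON_TABLE = {"CGC": "R", "GGC": "G"}


def cds_index(pos, strand, exons, cds):
    """0-based index of pos in the spliced coding sequence, read in
    transcript direction, or None if pos is not a coding position."""
    cds_start, cds_end = cds
    # Coding part of every exon that overlaps the CDS, in genomic order.
    segments = [(max(s, cds_start), min(e, cds_end))
                for s, e in exons if s <= cds_end and e >= cds_start]
    if strand == "-":
        segments.reverse()  # the transcript runs from high to low coordinates
    index = 0
    for start, end in segments:
        if start <= pos <= end:
            return index + (pos - start if strand == "+" else end - pos)
        index += end - start + 1
    return None


FRZB_EXONS = [(183698005, 183699692), (183702676, 183702739),
              (183703137, 183703341), (183707206, 183707271),
              (183723514, 183723561), (183730803, 183731498)]
FRZB_CDS = (183699576, 183731280)

index = cds_index(183699584, "-", FRZB_EXONS, FRZB_CDS)
codon_number, offset = index // 3 + 1, index % 3

# Genomic G-->C on the minus strand is C-->G on the coding strand.
ref, alt = COMPLEMENT["G"], COMPLEMENT["C"]
reference_codon = "CGC"  # taken from the "Codon:" line of the output
assert reference_codon[offset] == ref
mutated_codon = reference_codon[:offset] + alt + reference_codon[offset + 1:]
change = f"{CODON_TABLE[reference_codon]} => {CODON_TABLE[mutated_codon]}"
```

With these inputs the variation falls on the first base of codon 324, and the mutated codon and amino acid change match the "Codon" and "Translation" lines of the record.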
References:
Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7 Suppl 1:10.1-10.12.
Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003 Jul 1;31(13):3812-4.