SNP-effect - SNP annotation and damaging effect prediction
SNP-effect performs genic annotation for a set of human genetic variations. Variations must be in one of the supported formats, and position numbering must correspond to the GRCh37/hg19 genome assembly. Only single nucleotide substitutions are evaluated; longer substitutions and indels are ignored.
The following information is reported in the output file for each variation, if available:
1. Chromosome, position, reference and observed nucleotides, as in the input file.
2. dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) ID and OMIM (http://omim.org/) link.
3. Repetitive DNA element from the UCSC database (http://genome.ucsc.edu/) that contains the variation.
4. Gene that contains the variation, with its orientation, position, and CDS range. Genes are taken from the UCSC database (human genes), the fRNAdb database (http://www.ncrna.org/frnadb/), and FGENESH [1] predictions; the ID of a predicted gene starts with FG_.
5. Positions of all exons in the gene.
6. Short description of the gene from the UCSC database.
7. Gene type: protein coding or RNA coding (RNA other than mRNA).
8. Clinical significance of the gene, according to the HUMSAVAR (http://www.uniprot.org/docs/humsavar) and Genetic Association Database (http://geneticassociationdb.nih.gov/) databases.
9. Position of the variation in the gene: CDS or UTR, exon or intron. It is also reported whether the variation lies in a splice site, defined as the two nucleotides at the 5' and 3' ends of an intron.
10. Position of the amino acid substitution caused by the variation in the translated protein.
11. Total length of the translated protein.
12. Codon change in the gene caused by the variation.
13. Amino acid substitution in the protein.
14. Tolerance score of the substitution, calculated by an algorithm similar to the SIFT [2] approach. The score is based on amino acid occurrence at this position in an alignment of the encoded protein to its homologs from the UniProt Knowledgebase (http://www.uniprot.org/) or the NCBI nr protein database (http://www.ncbi.nlm.nih.gov/). If the score is below the threshold value (0.05), the variation is considered damaging. The score cannot be calculated when too few homologs are available.
Items 4-14 are repeated for each gene under consideration.
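The location rules in the list above (CDS or UTR, exon or intron, splice site as the two nucleotides at each end of an intron) can be sketched in a few lines of Python. This is an illustration only, not the tool's actual implementation: the function name `locate_variation`, its arguments, and the returned labels are assumptions made for this example, and genes without a CDS are not handled.

```python
def locate_variation(pos, exons, cds, strand):
    """Classify a 1-based genomic position relative to one gene.

    exons  -- list of (start, end) pairs sorted by genomic coordinate
    cds    -- (start, end) of the coding region in genomic coordinates
    strand -- "+" or "-"
    All names and labels here are hypothetical, chosen for this sketch.
    """
    n = len(exons)
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            # Exons are numbered in transcript order, so reverse on "-".
            number = i + 1 if strand == "+" else n - i
            if cds[0] <= pos <= cds[1]:
                return f"Exon {number}"
            # Upstream of the CDS in transcript direction means 5' UTR.
            upstream = (pos < cds[0]) == (strand == "+")
            return "Out of CDS, 5' UTR" if upstream else "Out of CDS, 3' UTR"
    # Introns are the gaps between consecutive exons.
    for i in range(n - 1):
        intron_start = exons[i][1] + 1
        intron_end = exons[i + 1][0] - 1
        if intron_start <= pos <= intron_end:
            number = i + 1 if strand == "+" else n - 1 - i
            # Splice site: first two or last two nucleotides of the intron.
            at_edge = pos - intron_start < 2 or intron_end - pos < 2
            return f"Intron {number}" + (", splice site" if at_edge else "")
    return "Outside the gene"


# Exon and CDS coordinates of uc001agd.3 (minus strand), taken from the
# example output later in this document.
SSU72_EXONS = [(1477053, 1477547), (1479249, 1479367), (1480243, 1480382),
               (1500153, 1500296), (1509858, 1510262)]
SSU72_CDS = (1477446, 1509937)

print(locate_variation(1477244, SSU72_EXONS, SSU72_CDS, "-"))
```

With these coordinates, position 1477244 falls in an exon below the CDS of a minus-strand gene, so the sketch reports a 3' UTR location, in line with the example output for rs7290.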
Supported file formats:
- VCF file format (http://www.1000genomes.org/node/101). A VCF file contains meta-information lines (starting with ##), a header line (starting with #CHROM), and then data lines, each containing information about a position in the genome. The first column is the chromosome, the second is the position, the fourth is the reference nucleotide, and the fifth is an observed non-reference nucleotide or a comma-separated list of such nucleotides. In the current version, meta-information and header lines are ignored.
- ANNOVAR file format (http://www.openbioinformatics.org/annovar/annovar_input.html). The first five columns are chromosome, start position, end position, reference nucleotide, and observed nucleotide. For a SNP, the start position and the end position are the same. Other columns are optional and are ignored.
- Simple file format. This format is the most convenient for manual input of SNP data. Four columns: chromosome, position, reference nucleotide, observed nucleotide.
- 23andme file format (http://snpedia.com/index.php/23andMe). Header lines start with #. Four columns: ID, chromosome, position, observed nucleotides (the genotype, written as two letters as in the example below).
All these formats are tab-delimited, but any run of tabs or spaces is treated as a column separator. The reference nucleotide for a variation is taken from the reference genome sequence. Please note that all positions must correspond to the GRCh37/hg19 genome assembly and that base numbering starts from 1.
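The column layouts described above can be summarized with a small parser. The following Python sketch is not part of SNP-effect; the function name `parse_variation` and the format labels are assumptions made for illustration. It splits on any whitespace, skips lines starting with #, and keeps only single nucleotide substitutions. For the 23andme format it returns the genotype letters and leaves the reference as None, since that format has no reference column.

```python
NUCLEOTIDES = set("ACGT")


def parse_variation(line, fmt):
    """Return (chromosome, position, reference, observed) or None.

    fmt is one of "vcf", "annovar", "simple", "23andme" (labels chosen
    for this sketch). `observed` is a list of single nucleotides.
    """
    if not line.strip() or line.startswith("#"):
        return None  # blank, meta-information, or header line
    cols = line.split()  # any run of tabs or spaces separates columns
    if fmt == "vcf":
        chrom, pos, ref, obs = cols[0], cols[1], cols[3], cols[4].split(",")
    elif fmt == "annovar":
        if cols[1] != cols[2]:
            return None  # start and end differ, so this is not a SNP
        chrom, pos, ref, obs = cols[0], cols[1], cols[3], [cols[4]]
    elif fmt == "simple":
        chrom, pos, ref, obs = cols[0], cols[1], cols[2], [cols[3]]
    elif fmt == "23andme":
        # No reference column: the reference nucleotide would have to be
        # looked up in the genome sequence, which this sketch does not do.
        chrom, pos, ref, obs = cols[1], cols[2], None, sorted(set(cols[3]))
    else:
        raise ValueError(f"unknown format: {fmt}")
    if ref is not None and ref.upper() not in NUCLEOTIDES:
        return None  # longer reference allele: not a single substitution
    obs = [allele for allele in obs if allele.upper() in NUCLEOTIDES]
    if not obs:
        return None
    return chrom, int(pos), ref, obs


print(parse_variation("1 1477244 rs7290 T C 100 PASS . GT:AP 1|1:0.700,0.820", "vcf"))
print(parse_variation("rs7290 1 1477244 CT", "23andme"))
```

Turning a 23andme genotype into observed non-reference nucleotides additionally requires the reference base at that position, so the second call above returns both genotype letters unfiltered.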
Input data examples in different formats:
VCF:
1 10469 . C G 100 PASS . GT:AP 1|0:0.740,0.450
1 1477244 rs7290 T C 100 PASS . GT:AP 1|1:0.700,0.820
2 183699584 rs7775 G C 100 PASS . GT:AP 0|1:0.440,0.695
ANNOVAR:
1 10469 10469 C G
1 1477244 1477244 T C comments: rs7290
2 183699584 183699584 G C
Simple:
1 10469 C G
1 1477244 T C
2 183699584 G C
23andme:
. 1 10469 CG
rs7290 1 1477244 CT
rs7775 2 183699584 CC
All these examples produce the same output:
#VARIATION
chr1 10469 C-->G
#REPEATS
Repeat: telo 10468..11447
INTERGENIC. between FR137075 and uc010nxq.1
#VARIATION
chr1 1477244 T-->C
dbSNP ID: rs7290
#INTERSECTED GENE
Name: uc001agd.3 Strand: - Region: 1477053..1510262 CDS:1477446..1509937
Exons: 1477053..1477547, 1479249..1479367, 1480243..1480382, 1500153..1500296, 1509858..1510262
Description: Homo sapiens SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) (SSU72), mRNA.
Type of gene: Protein coding
Clinical significance:unknown
Variation location: Out of CDS, 3' UTR.
#VARIATION
chr2 183699584 G-->C
dbSNP ID: rs7775, OMIM link: http://omim.org/entry/605083#0001
#INTERSECTED GENE
Name: uc002upa.2 Strand: - Region: 183698005..183731498 CDS:183699576..183731280
Exons: 183698005..183699692, 183702676..183702739, 183703137..183703341, 183707206..183707271, 183723514..183723561, 183730803..183731498
Description: Homo sapiens frizzled-related protein (FRZB), mRNA.
Type of gene: Protein coding
Clinical significance: osteoarthritis; colorectal cancer; Defects in FRZB are associated with susceptibility to osteoarthritis type 1(OS1);
Variation location: Exon 6
Position in protein: 324
Protein length: 325
Codon: CGC => GGC
Translation: R => G
Tolerance Score: 0.01 (damaged)
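The last record above can be checked by hand. The gene is on the minus strand, so the genomic change G-->C reads as C-->G on the coding strand, which turns the codon CGC into GGC. The Python sketch below reproduces that step with the standard genetic code and applies the 0.05 threshold to the reported score. The helper names `codon_change` and `is_damaging` are assumptions made for this example; the sketch does not compute the tolerance score itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DAMAGING_THRESHOLD = 0.05  # threshold value stated in this document


def codon_change(codon, offset, observed, strand):
    """Apply an observed genomic nucleotide to a coding-strand codon.

    offset is the 0-based position of the variation within the codon.
    Returns (mutated codon, original amino acid, new amino acid).
    """
    if strand == "-":
        # Genomic alleles are given on the forward strand.
        observed = observed.translate(COMPLEMENT)
    mutated = codon[:offset] + observed + codon[offset + 1:]
    return mutated, CODON_TABLE[codon], CODON_TABLE[mutated]


def is_damaging(score):
    """A variation is considered damaging if its score is below 0.05."""
    return score < DAMAGING_THRESHOLD


print(codon_change("CGC", 0, "C", "-"))
print(is_damaging(0.01))
```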
[1] Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7 Suppl 1:10.1-10.12.
[2] Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31(13):3812-4.