SNP-effect - Gene annotation and prediction of SIFT-like damaging effect

    SNP-effect reads a set of human mutations from the input file and annotates each of them with respect to genes, i.e. it determines the mutation features listed below and writes the corresponding messages to the output:
  1. Whether the mutation lies within a gene (including genes predicted by the FGENESH program) or between genes.
  2. If the mutation is inside a gene, whether it hits the coding part (CDS) or lies in the 5'- or 3'-end of the gene.
  3. If the mutation is inside the CDS, whether it lies in an exon, at a splice site, or deep inside an intron.
  4. For exonic mutations, the type of mutation is analyzed. If the mutation shifts the reading frame, or is an insertion/deletion that affects more than one codon, a message about a potentially harmful effect is output. If the mutation is a single (or double/triple) nucleotide substitution within one codon and produces a nonsense mutation, the same message is output.
  5. For missense mutations, the position of the mutation in the encoded protein and the type of amino acid substitution are determined, and a score is output. The score is estimated with a SIFT-like approach (see http://sift.jcvi.org/): it is calculated from the occurrence of amino acids at this position in an alignment of the encoded protein to its homologs from the SWISS-PROT+TrEMBL or NR databases, using precomputed position-specific scoring matrices. If the score is lower than a threshold value, a message about a possible harmful effect of the mutation is output.
  6. If the gene with the mutated CDS is associated with a certain disease, a message about this is also placed in the output.
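
The decisions in steps 4 and 5 can be illustrated with a minimal Python sketch. It is not the SNP-effect implementation: the function names, the returned labels, the separate "stop_lost" case and the default threshold of 0.05 are assumptions made for illustration (this page does not state the threshold value). Only the standard genetic code is taken as given.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def is_frameshift(ref, alt):
    """True if the length change is not a multiple of 3 ('-' means empty)."""
    ref_len = 0 if ref == "-" else len(ref)
    alt_len = 0 if alt == "-" else len(alt)
    return (alt_len - ref_len) % 3 != 0


def classify_codon_change(ref_codon, alt_codon):
    """Classify a substitution confined to a single codon.

    Returns (label, reference amino acid, alternative amino acid).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous", ref_aa, alt_aa
    if alt_aa == "*":
        return "nonsense", ref_aa, alt_aa
    if ref_aa == "*":
        # Not covered by the description above; kept apart from missense here.
        return "stop_lost", ref_aa, alt_aa
    return "missense", ref_aa, alt_aa


def missense_call(score, threshold=0.05):
    """Hypothetical cutoff: a score below the threshold is flagged as damaging."""
    return "DAMAGED" if score < threshold else "TOLERATED"
```

For example, `classify_codon_change("GGA", "CGA")` reports a missense change from G to R, which is the substitution shown for Q8NH21 in the program output example on this page.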

Gene annotation data are mostly obtained from the UCSC genome annotation database. Some genes and their structures are predicted with the FGENESH software (Softberry); the identifiers of these genes begin with FG_.

Data on diseases associated with genes are obtained from SWISS-PROT+TrEMBL and HUMSAVAR.

Input data format (tab-delimited):
For nucleotide mutations: chromosome, first position (1-based, hg19), reference sequence ('-' for insertions), alternative sequence ('-' for deletions), optional comments.
For protein mutations: AC (SP+TrEMBL), position in protein, reference sequence ('-' for insertions), alternative sequence ('-' for deletions), optional comments.
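
A minimal Python sketch of reading this format follows. How SNP-effect itself tells the two record types apart is not documented on this page. The sketch assumes that a first field that looks like a chromosome name (1-22, X, Y, M or MT, with an optional "chr" prefix) marks a nucleotide record, and that anything else is a protein accession. The names `Mutation` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: these are the chromosome names accepted in the first column.
CHROM_RE = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$", re.IGNORECASE)


class Mutation(NamedTuple):
    kind: str       # "nucleotide" or "protein"
    name: str       # chromosome or protein accession (AC)
    position: int   # position as given in the file
    ref: str        # empty string for insertions ('-' in the file)
    alt: str        # empty string for deletions ('-' in the file)
    comment: str    # empty string when no comment is given


def parse_line(line):
    """Parse one tab-delimited input line into a Mutation record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated fields: {line!r}")
    name, pos, ref, alt = (f.strip() for f in fields[:4])
    comment = "\t".join(fields[4:]).strip()
    kind = "nucleotide" if CHROM_RE.match(name) else "protein"
    return Mutation(
        kind,
        name,
        int(pos),
        "" if ref == "-" else ref,
        "" if alt == "-" else alt,
        comment,
    )
```

Trailing tabs, as in some lines of the input example, are tolerated: an empty fifth field simply gives an empty comment.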

Input data example

1	69190	G	C	
1	322038	G	C
1	762702	A	C
1	911898	C	A
1	949868	G	A
1	948958	T	C
1	949361	C	T
1	950915	C	A
9	2639877	A	C	Alzheimer's disease
O00533	13	Y	G
Q96BZ9	17	D	E
B3KWS5	4	I	L	quantitative traits	

Program output example:

Input file with SNPs: sift_test.txt
SNP data: Reference_nucleotide  SNP_nucleotide

#SNP 1 chr1	14508	A	C		intergenic
SNP location: between uc010nxq.1 start=11873 end=14409 and FR404421 start=14629 end=14657
Effect: not determined

#SNP 2 chr1	69190	G	C	
SNP location: uc001aal.1 start=69090 end=70008, strand='+'
DE: Homo sapiens olfactory receptor, family 4, subfamily F, member 5 (OR4F5), mRNA.
Type of gene: Protein coding: Q8NH21
Source of gene: UCSC file knownGene.txt
CDS: start=69090 end=70008
SNP gene location: CDS, exon 0 start=69090 end=70008
SNP chr1 + G to C in 69189 is converted to Q8NH21 G 33 R GGA => CGA
Effect: DAMAGED protein: Score: 1.00 => 0.00

#SNP 3 chr1	130996	A	C	
SNP location: FR182042 start=130989 end=131015, strand='+'
DE: Piwi-interacting RNA (piRNA)
Type of gene: RNA coding
Source of gene: one of RNA DB genes
Effect: not determined

#SNP 4 chr1	322038	G	C
SNP location: FG_1_6 start=228291 end=450332, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=228291 end=450332
SNP gene location: CDS, intronic between exon 0 start=228291 end=228736 and exon 1 start=330033 end=330215
Effect: not determined
Alternative gene:
SNP location: uc009vjk.2 start=322036 end=326938, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined

#SNP 5 chr1	324300	G	C
ERROR!!!  reference 'G' is not equal 'A' in position 324300 of chromosome chr1
Nucleotide 'A' is accepted as the reference nucleotide. SNP A => C is estimated
SNP location: FG_1_6 start=228291 end=450332, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=228291 end=450332
SNP gene location: CDS, intronic between exon 0 start=228291 end=228736 and exon 1 start=330033 end=330215
Effect: not determined
Alternative gene:
SNP location: uc009vjk.2 start=322036 end=326938, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined
Alternative gene:
SNP location: uc001aau.3 start=323891 end=328581, strand='+'
DE: Homo sapiens uncharacterized LOC100132287 (LOC100132287), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined
Alternative gene:
SNP location: uc021oeh.1 start=324287 end=325896, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: F5H0Y1
Source of gene: UCSC file knownGene.txt
CDS: start=324514 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined

#SNP 6 chr1	911898	G	A
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 0 start=859811 end=860328 and exon 1 start=941679 end=941748
Effect: not determined
Alternative gene:
SNP location: uc001ach.2 start=910578 end=917473, strand='-'
DE: Homo sapiens chromosome 1 open reading frame 170 (C1orf170), non-coding RNA.
Type of gene: Protein coding: Q5SV97
Source of gene: UCSC file knownGene.txt
CDS: start=911551 end=916546
SNP gene location: CDS, exon 1 start=911878 end=912004
SNP chr1 - G to A in 911897 is converted to Q5SV97 T 637 T ACC => ACA
Effect: TOLERATED protein: Score: 1.00 => 1.00

#SNP 7 chr1	949868	C	A
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 1 start=941679 end=941748 and exon 2 start=994334 end=994557
Effect: not determined
Alternative gene:
SNP location: uc001acj.4 start=948846 end=949919, strand='+'
DE: Homo sapiens ISG15 ubiquitin-like modifier (ISG15), mRNA.
Type of gene: Protein coding: P05161
Source of gene: UCSC file knownGene.txt
CDS: start=948953 end=949858
SNP gene location: 3'-end of the gene out of CDS
Effect: not determined

#SNP 8 chr1	948958	T	C
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 1 start=941679 end=941748 and exon 2 start=994334 end=994557
Effect: not determined
Alternative gene:
SNP location: uc001acj.4 start=948846 end=949919, strand='+'
DE: Homo sapiens ISG15 ubiquitin-like modifier (ISG15), mRNA.
Type of gene: Protein coding: P05161
Source of gene: UCSC file knownGene.txt
CDS: start=948953 end=949858
SNP gene location: CDS, acceptor splice site of exon 0 start=948846 end=948956
Effect: not determined

#SNP 9 chr1	15929799	A	G
ERROR!!!  reference 'A' is not equal 'G' in position 15929799 of chromosome chr1
Nucleotide 'G' is accepted as the reference nucleotide. SNP G => G is estimated
SNP location: FG_1_59 start=15929689 end=15930136, strand='-'
DE:  hCG2038797
Type of gene: Protein coding: 119572124-like
Source of gene: predicted by FGENESH (Softberry)
CDS: start=15929689 end=15930136
SNP gene location: CDS, exon 0 start=15929689 end=15930136
SNP chr1 - G to G in 15929798 is converted to 119572124-like A 112 G GCT => GGT
Effect: not determined

#SNP 10 chr9	2639877	A	C	Alzheimer's disease
SNP location: FG_9_6 start=2419441 end=2662793, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=2419441 end=2662793
SNP gene location: CDS, intronic between exon 1 start=2536358 end=2536453 and exon 2 start=2662657 end=2662793
Effect: not determined
Alternative gene:
SNP location: uc003zhk.1 start=2621792 end=2654485, strand='+'
DE: Homo sapiens very low density lipoprotein receptor (VLDLR), transcript variant 1, mRNA.
Type of gene: Protein coding: P98155
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: Alzheimer's Disease; diabetes, type 2; dementia; Lp(a); 
   efects in VLDLR are the cause of cerebellar ataxiamental retardation and dysequilibrium syndrome type 1 (CMARQ1); 
SNP gene location: CDS, exon 2 start=2639858 end=2639981
SNP chr9 + A to C in 2639876 is converted to P98155 E 73 A GAA => GCA
Effect: TOLERATED protein: Score: 1.00 => 0.43
Alternative gene:
SNP location: uc003zhl.1 start=2621792 end=2654485, strand='+'
DE: Homo sapiens very low density lipoprotein receptor (VLDLR), transcript variant 2, mRNA.
Type of gene: Protein coding: Q5VVF5
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: Alzheimer's Disease; diabetes, type 2; dementia; Lp(a); 
SNP gene location: CDS, exon 2 start=2639858 end=2639981
SNP chr9 + A to C in 2639876 is converted to Q5VVF5 E 73 A GAA => GCA
Effect: TOLERATED protein: Score: 1.00 => 0.36

#SNP 11 protein O00533	13	T	G		Error
SNP location: uc003bou.3 start=238278 end=451097, strand='+'
DE: Homo sapiens cell adhesion molecule with homology to L1CAM (close homolog of L1) (CHL1), transcript variant 2, mRNA.
Type of gene: Protein coding: O00533
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: schizophrenia; 
ERROR!!!  reference 'T' is not equal 'Y' in position 13 of O00533
Amino acid residue 'Y' is accepted as the reference. SNP Y => G is estimated
Effect: TOLERATED protein: Score: 0.20 => 0.11

#SNP 12 protein QDDDDD	17	D	E		Error in the name of protein
ERROR!!! A sequence QDDDDD is not found in SP+TrEMBL

#SNP 13 protein B3KWS5	4	I	L	quantitative traits
SNP location: uc002rej.4 start=23971533 end=24055472, strand='-'
DE: Homo sapiens ATPase family, AAA domain containing 2B (ATAD2B), transcript variant 1, mRNA.
Type of gene: Protein coding: B3KWS5
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: quantitative traits; 
Effect: DAMAGED protein: Score: 0.05 => 0.04
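
A report in this layout can be summarized with a short script. The Python sketch below is an illustration only: it relies on the "#SNP" and "Effect:" line shapes seen in the example above, which is an assumption about the general output format, and the function name `scored_effects` is hypothetical. The two score values are returned in the order printed; reading them as the scores of the reference and the alternative residue is likewise an assumption.

```python
import re

SNP_HEADER = re.compile(r"^#SNP (\d+) ")
EFFECT = re.compile(
    r"^Effect: (DAMAGED|TOLERATED) protein: Score: ([\d.]+) => ([\d.]+)"
)


def scored_effects(report_text):
    """Yield (snp_number, call, first_score, second_score) per scored Effect line.

    A SNP with several alternative genes can yield several tuples;
    "Effect: not determined" lines are skipped.
    """
    snp = None
    for line in report_text.splitlines():
        header = SNP_HEADER.match(line)
        if header:
            snp = int(header.group(1))
            continue
        effect = EFFECT.match(line)
        if effect and snp is not None:
            yield (
                snp,
                effect.group(1),
                float(effect.group(2)),
                float(effect.group(3)),
            )
```

Applied to the example above, SNP 10 would yield two TOLERATED entries, one for each VLDLR transcript variant.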