Gene annotation data are obtained mostly from the UCSC genome annotation database. Some genes, together with their structures, are predicted with the FGENESH software (Softberry); the identifiers of these genes begin with FG_.
Data on diseases associated with genes are obtained from SWISS-PROT+TrEMBL and HUMSAVAR.
Input data format (tab-delimited):
For "nucleotidic mutations": chromosome, first position (1-based, hg19), reference sequence ('-' for insertions), alternative sequence ('-' for deletions), optional comments
For "protein": accession number (AC, SP+TrEMBL), position in protein, reference sequence ('-' for insertions), alternative sequence ('-' for deletions), optional comments
Input data example:
1 69190 G C
1 322038 G C
1 762702 A C
1 911898 C A
1 949868 G A
1 948958 T C
1 949361 C T
1 950915 C A
9 2639877 A C Alzheimer's disease
O00533 13 Y G
Q96BZ9 17 D E
B3KWS5 4 I L quantitative traits
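The input format described above can be read with a few lines of Python. The following is a minimal sketch, not the program's own parser: the names `Variant` and `parse_line` are hypothetical, and the record type ("nucleotide" or "protein") is passed in explicitly on the assumption that the user selects it, since the format itself does not say how the two kinds are told apart.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    kind: str       # "nucleotide" or "protein" (assumed to be chosen by the caller)
    target: str     # chromosome name, or protein accession number (AC)
    position: int   # 1-based position on the chromosome or in the protein
    ref: str        # reference sequence, '-' for insertions
    alt: str        # alternative sequence, '-' for deletions
    comment: str = ""


def parse_line(line: str, kind: str) -> Variant:
    """Parse one tab-delimited input record into a Variant."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-delimited fields, got {len(fields)}")
    target, pos, ref, alt = (f.strip() for f in fields[:4])
    position = int(pos)
    if position < 1:
        raise ValueError("positions are 1-based and must be >= 1")
    # Anything after the fourth field is treated as a free-text comment.
    comment = " ".join(fields[4:]).strip()
    return Variant(kind, target, position, ref, alt, comment)


if __name__ == "__main__":
    print(parse_line("9\t2639877\tA\tC\tAlzheimer's disease", "nucleotide"))
    print(parse_line("O00533\t13\tY\tG", "protein"))
```

Keeping the comment as a separate optional field mirrors the examples, where only some records carry one.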
Program output example:
Input file with SNPs: sift_test.txt
SNP data: Reference_nucleotide SNP_nucleotide

#SNP 1 chr1 14508 A C intergenic
SNP location: between uc010nxq.1 start=11873 end=14409 and FR404421 start=14629 end=14657
Effect: not determined

#SNP 2 chr1 69190 G C
SNP location: uc001aal.1 start=69090 end=70008, strand='+'
DE: Homo sapiens olfactory receptor, family 4, subfamily F, member 5 (OR4F5), mRNA.
Type of gene: Protein coding: Q8NH21
Source of gene: UCSC file knownGene.txt
CDS: start=69090 end=70008
SNP gene location: CDS, exon 0 start=69090 end=70008
SNP chr1 + G to C in 69189 is converted to Q8NH21 G 33 R GGA => CGA
Effect: DAMAGED protein: Score: 1.00 => 0.00

#SNP 3 chr1 130996 A C
SNP location: FR182042 start=130989 end=131015, strand='+'
DE: Piwi-interacting RNA (piRNA)
Type of gene: RNA coding
Source of gene: one of RNA DB genes
Effect: not determined

#SNP 4 chr1 322038 G C
SNP location: FG_1_6 start=228291 end=450332, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=228291 end=450332
SNP gene location: CDS, intronic between exon 0 start=228291 end=228736 and exon 1 start=330033 end=330215
Effect: not determined
Alternative gene:
SNP location: uc009vjk.2 start=322036 end=326938, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined

#SNP 5 chr1 324300 G C
ERROR!!! reference 'G' is not equal 'A' in position 324300 of chromosome chr1
Nucleotide 'A' is accepted as the reference nucleotide.
SNP A => C is estimated
SNP location: FG_1_6 start=228291 end=450332, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=228291 end=450332
SNP gene location: CDS, intronic between exon 0 start=228291 end=228736 and exon 1 start=330033 end=330215
Effect: not determined
Alternative gene:
SNP location: uc009vjk.2 start=322036 end=326938, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined
Alternative gene:
SNP location: uc001aau.3 start=323891 end=328581, strand='+'
DE: Homo sapiens uncharacterized LOC100132287 (LOC100132287), non-coding RNA.
Type of gene: Protein coding: C9J4L2
Source of gene: UCSC file knownGene.txt
CDS: start=324342 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined
Alternative gene:
SNP location: uc021oeh.1 start=324287 end=325896, strand='+'
DE: Homo sapiens uncharacterized LOC100133331 (LOC100133331), non-coding RNA.
Type of gene: Protein coding: F5H0Y1
Source of gene: UCSC file knownGene.txt
CDS: start=324514 end=325605
SNP gene location: 5'-end of the gene out of CDS
Effect: not determined

#SNP 6 chr1 911898 G A
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 0 start=859811 end=860328 and exon 1 start=941679 end=941748
Effect: not determined
Alternative gene:
SNP location: uc001ach.2 start=910578 end=917473, strand='-'
DE: Homo sapiens chromosome 1 open reading frame 170 (C1orf170), non-coding RNA.
Type of gene: Protein coding: Q5SV97
Source of gene: UCSC file knownGene.txt
CDS: start=911551 end=916546
SNP gene location: CDS, exon 1 start=911878 end=912004
SNP chr1 - G to A in 911897 is converted to Q5SV97 T 637 T ACC => ACA
Effect: TOLERATED protein: Score: 1.00 => 1.00

#SNP 7 chr1 949868 C A
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 1 start=941679 end=941748 and exon 2 start=994334 end=994557
Effect: not determined
Alternative gene:
SNP location: uc001acj.4 start=948846 end=949919, strand='+'
DE: Homo sapiens ISG15 ubiquitin-like modifier (ISG15), mRNA.
Type of gene: Protein coding: P05161
Source of gene: UCSC file knownGene.txt
CDS: start=948953 end=949858
SNP gene location: 3'-end of the gene out of CDS
Effect: not determined

#SNP 8 chr1 948958 T C
SNP location: FG_1_19 start=859811 end=1004719, strand='+'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=859811 end=1004719
SNP gene location: CDS, intronic between exon 1 start=941679 end=941748 and exon 2 start=994334 end=994557
Effect: not determined
Alternative gene:
SNP location: uc001acj.4 start=948846 end=949919, strand='+'
DE: Homo sapiens ISG15 ubiquitin-like modifier (ISG15), mRNA.
Type of gene: Protein coding: P05161
Source of gene: UCSC file knownGene.txt
CDS: start=948953 end=949858
SNP gene location: CDS, acceptor splice site of exon 0 start=948846 end=948956
Effect: not determined

#SNP 9 chr1 15929799 A G
ERROR!!! reference 'A' is not equal 'G' in position 15929799 of chromosome chr1
Nucleotide 'G' is accepted as the reference nucleotide.
SNP G => G is estimated
SNP location: FG_1_59 start=15929689 end=15930136, strand='-'
DE: hCG2038797
Type of gene: Protein coding: 119572124-like
Source of gene: predicted by FGENESH (Softberry)
CDS: start=15929689 end=15930136
SNP gene location: CDS, exon 0 start=15929689 end=15930136
SNP chr1 - G to G in 15929798 is converted to 119572124-like A 112 G GCT => GGT
Effect: not determined

#SNP 10 chr9 2639877 A C Alzheimer's disease
SNP location: FG_9_6 start=2419441 end=2662793, strand='-'
DE: ab initio, no homologs
Type of gene: Protein coding: unknown (predicted)
Source of gene: predicted by FGENESH (Softberry)
CDS: start=2419441 end=2662793
SNP gene location: CDS, intronic between exon 1 start=2536358 end=2536453 and exon 2 start=2662657 end=2662793
Effect: not determined
Alternative gene:
SNP location: uc003zhk.1 start=2621792 end=2654485, strand='+'
DE: Homo sapiens very low density lipoprotein receptor (VLDLR), transcript variant 1, mRNA.
Type of gene: Protein coding: P98155
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: Alzheimer's Disease; diabetes, type 2; dementia; Lp(a); efects in VLDLR are the cause of cerebellar ataxiamental retardation and dysequilibrium syndrome type 1 (CMARQ1);
SNP gene location: CDS, exon 2 start=2639858 end=2639981
SNP chr9 + A to C in 2639876 is converted to P98155 E 73 A GAA => GCA
Effect: TOLERATED protein: Score: 1.00 => 0.43
Alternative gene:
SNP location: uc003zhl.1 start=2621792 end=2654485, strand='+'
DE: Homo sapiens very low density lipoprotein receptor (VLDLR), transcript variant 2, mRNA.
Type of gene: Protein coding: Q5VVF5
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: Alzheimer's Disease; diabetes, type 2; dementia; Lp(a);
SNP gene location: CDS, exon 2 start=2639858 end=2639981
SNP chr9 + A to C in 2639876 is converted to Q5VVF5 E 73 A GAA => GCA
Effect: TOLERATED protein: Score: 1.00 => 0.36

#SNP 11 protein O00533 13 T G Error
SNP location: uc003bou.3 start=238278 end=451097, strand='+'
DE: Homo sapiens cell adhesion molecule with homology to L1CAM (close homolog of L1) (CHL1), transcript variant 2, mRNA.
Type of gene: Protein coding: O00533
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: schizophrenia;
ERROR!!! reference 'T' is not equal 'Y' in position 13 of O00533
Amino acid residue 'Y' is accepted as the reference.
SNP Y => G is estimated
Effect: TOLERATED protein: Score: 0.20 => 0.11

#SNP 12 protein QDDDDD 17 D E Error in the name of protein
ERROR!!! A sequence QDDDDD is not found in SP+TrEMBL

#SNP 13 protein B3KWS5 4 I L quantitative traits
SNP location: uc002rej.4 start=23971533 end=24055472, strand='-'
DE: Homo sapiens ATPase family, AAA domain containing 2B (ATAD2B), transcript variant 1, mRNA.
Type of gene: Protein coding: B3KWS5
Source of gene: UCSC file knownGene.txt
CDS: start=2622189 end=2653868
DISEASES associated with the gene: quantitative traits;
Effect: DAMAGED protein: Score: 0.05 => 0.04
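
For downstream processing, the report shown above can be summarised by collecting the "Effect:" entries of each "#SNP" record. The sketch below is an illustration based only on the patterns visible in this example output; the function name `parse_report` and the regular expressions are assumptions, and the real output may contain variants that they do not cover. A record can yield several entries, one for the primary gene and one for each "Alternative gene:" block.

```python
import re

# Record headers look like "#SNP 10 chr9 ..." in the example output.
RECORD_RE = re.compile(r"#SNP\s+(\d+)\s+")

# Two effect forms appear in the example:
#   "Effect: not determined"
#   "Effect: DAMAGED protein: Score: 1.00 => 0.00" (or TOLERATED)
EFFECT_RE = re.compile(
    r"Effect:\s+(?:"
    r"(DAMAGED|TOLERATED) protein:\s+Score:\s+(\d+(?:\.\d+)?)\s+=>\s+(\d+(?:\.\d+)?)"
    r"|(not determined))"
)


def parse_report(text: str) -> dict:
    """Return {snp_number: [(effect, score_before, score_after), ...]}."""
    # With one capturing group, split() yields
    # [preamble, number_1, body_1, number_2, body_2, ...].
    parts = RECORD_RE.split(text)
    report = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        effects = []
        for match in EFFECT_RE.finditer(body):
            if match.group(4):
                effects.append(("not determined", None, None))
            else:
                effects.append(
                    (match.group(1), float(match.group(2)), float(match.group(3)))
                )
        report[int(number)] = effects
    return report


if __name__ == "__main__":
    sample = (
        "#SNP 2 chr1 69190 G C Effect: DAMAGED protein: Score: 1.00 => 0.00 "
        "#SNP 3 chr1 130996 A C Effect: not determined"
    )
    for number, effects in parse_report(sample).items():
        print(number, effects)
```

Because the patterns match on whitespace rather than on line breaks, the same sketch works whether the report is stored one field per line or as a single run of text.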