SNP-select description

The main idea behind SNP-select, an ANNOVAR-like program, as with ANNOVAR itself (see http://www.openbioinformatics.org/annovar/), is stepwise filtering of an input set of mutations to discard those that are unlikely to determine a given hereditary disease or other "mutational" feature. The user can specify which filtering steps to apply.

SNP-select reads input files in which every line corresponds to a single mutation in a genome: a single-nucleotide or block substitution, an insertion, or a deletion. The first five tab-separated values in a line are the chromosome index (optionally prefixed with 'chr'), the start and end positions of the mutation, the reference nucleotide(s), and the alternative nucleotide(s). Additional columns may follow; they are copied unchanged to the output files. A reference nucleotide that is unknown is written as '0'. In insertions and deletions, the '-' symbol stands for the missing nucleotides. Table 1 shows several examples.

**Table 1. Example of an input file with five genetic variants.**
Chromosome	Start	End	Ref	Obs	Comments
16	49303427	49303427	C	T	R702W (NOD2)
16	49321279	49321279	-	C	c.3016_3017insC (NOD2)
13	19661685	19661685	G	-	35delG (GJB2)
1	105293754	105293754	0	ATAAA	Block substitution
1	13133880	13133881	TC	-	2-bp deletion (rs59770105)
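
The input format described above can be read with a few lines of Python. This is only a minimal sketch, not SNP-select's own code: the names `Variant`, `parse_variant_line` and `variant_kind` are hypothetical, and the sketch assumes that a '-' in the reference column marks an insertion and a '-' in the observed column marks a deletion, as in the Table 1 examples.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """One input line; all names here are illustrative, not SNP-select's."""
    chrom: str
    start: int
    end: int
    ref: str          # '0' means the reference nucleotide is unknown
    obs: str          # '-' means the nucleotides are missing (deletion)
    extra: List[str]  # additional columns, passed through unchanged


def parse_variant_line(line: str) -> Variant:
    """Split one tab-separated line into the five mandatory fields plus extras."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated values")
    chrom = fields[0]
    # The chromosome index may carry an optional 'chr' prefix.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return Variant(chrom, int(fields[1]), int(fields[2]),
                   fields[3], fields[4], fields[5:])


def variant_kind(v: Variant) -> str:
    """Classify a variant from the '-' placeholder (an assumed convention)."""
    if v.ref == "-":
        return "insertion"
    if v.obs == "-":
        return "deletion"
    return "substitution"
```

Applied to the rows of Table 1, `variant_kind` would label the second row as an insertion, the third and fifth as deletions, and the others as substitutions. A header row such as the one shown in the table is not handled and would raise a `ValueError`.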

To assess variants with respect to their functional effects, SNP-select uses a number of databases that are loaded and checked in advance (Table 2). Individual filters can be omitted at the user's discretion. Every variant (mutation) is compared with the contents of the databases selected as filters and is discarded if it hits particular genomic regions or exceeds a user-specified score threshold.

At every step, two files are written to the folder specified in the initialization file: 'step_i', with the mutations that passed the i-th filter, and 'step_i.dropped', with the mutations dropped at that step. The last value in each tab-delimited mutation line is a comment produced by the current filter. As a rule, it is a score value. For the first filter (genomic annotation), it describes the location of the mutation: exonic or splicing for accepted mutations; noncoding_exonic, intronic, 3-prime noncoding, 5-prime noncoding, or intergenic for filtered-out mutations; and '-' for discarded mutations that were not found in the annotation database (that is, located outside the annotated part of the genome).

Once all filters have been applied, the remaining mutations form the list of those most promising in terms of association with a given feature or disease.
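
The stepwise procedure can be sketched as a simple loop. This is an illustration under stated assumptions rather than the actual implementation: `run_filters` and the `Filter` signature are hypothetical, each filter is modelled as a function returning a keep/drop decision plus a comment, and the comment is assumed to be appended only to the lines written at that step, not carried forward to later steps.

```python
import os
from typing import Callable, Iterable, List, Tuple

# Hypothetical filter signature: takes the tab-split fields of one mutation
# and returns (keep, comment), where comment is e.g. a score or a location.
Filter = Callable[[List[str]], Tuple[bool, str]]


def run_filters(variants: Iterable[List[str]],
                filters: List[Filter],
                out_dir: str) -> List[List[str]]:
    """Apply filters in order, writing step_i and step_i.dropped to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    current = [list(v) for v in variants]
    for i, flt in enumerate(filters, start=1):
        passed_path = os.path.join(out_dir, f"step_{i}")
        dropped_path = passed_path + ".dropped"
        survivors = []
        with open(passed_path, "w") as passed, open(dropped_path, "w") as dropped:
            for fields in current:
                keep, comment = flt(fields)
                # The filter's comment becomes the last tab-delimited value.
                line = "\t".join(fields + [comment]) + "\n"
                if keep:
                    passed.write(line)
                    survivors.append(fields)
                else:
                    dropped.write(line)
        # Only mutations that passed this step reach the next filter.
        current = survivors
    return current
```

The list returned after the last filter corresponds to the final set of promising mutations; omitting a filter amounts to leaving it out of the `filters` list.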

**Table 2. List of filtering steps.**
Filter order	Filter	Database file (in the DB folder)	Criterion for discarding mutations
1	Genic annotation of a mutation	knownGene.fg2	All mutations outside exons or splicing sites are discarded
2	Conserved regions among 46 species	phastConsElements46way.txt	All not found removed
3	Segmental duplication regions	genomicSuperDups.txt	All found removed^**)
4	1000 Genomes Project Pilot data, 2011 May release	ALL.sites.2010_11.txt	All found removed^**)
5	dbSNP, NCBI short genetic variations	snp135.txt	All found removed^**)
6	Whole-exome SIFT scores for non-synonymous variants	avsift.txt	>0.05^*)
7	PolyPhen-2 scores	ljb_pp2.txt	>0.85^*)
8	PhyloP conservation scores	ljb_phylop.txt	>0.95^*)
9	MutationTaster scores	ljb_mt.txt	>0.5^*)
10	Whole-exome LRT (likelihood ratio test) scores	ljb_lrt.txt	>0.5^*)
11	Whole-exome GERP++ scores	ljb_gerp++.txt	>0^*)
12	Conserved genomic regions by GERP++	gerp++elem.txt	All not found removed
13	NHLBI Exome Sequencing Project	esp6500_all.txt	>0^*)
14	UCSC repeats	rmask.txt	All found removed^**)

^*) - can be changed by the user
^**) - a database hit is sufficient
