SNP-select description

The main concept of the ANNOVAR-like program SNP-select, as of ANNOVAR itself, is stepwise filtering of an input set of mutations to discard those that are unlikely to determine a certain hereditary disease or any other "mutational" feature. The user can specify the set of filtering steps.

The SNP-select program reads input files in which every line corresponds to a single mutation in a genome: a single or block substitution, an insertion, or a deletion. The first five tab-separated values in a line are the chromosome index (optionally prefixed with 'chr'), the start and end positions of the mutation, and the reference and alternative nucleotides. Additional columns may be appended to a line; they are written unchanged to the output files. If the reference nucleotide is unknown, '0' is specified in its place. For insertions and deletions, the '-' symbol indicates the missing nucleotides. Table 1 shows several examples.

Table 1. Example of an input file with five genetic variants.
Chromosome  Start      End        Ref  Obs    Comments
16          49303427   49303427   C    T      R702W (NOD2)
16          49321279   49321279   -    C      c.3016_3017insC (NOD2)
13          19661685   19661685   G    -      35delG (GJB2)
1           105293754  105293754  0    ATAAA  Block substitution
1           13133880   13133881   TC   -      2-bp deletion (rs59770105)
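
The input format described above can be read with a few lines of Python. The following is only a minimal sketch, not the SNP-select parser itself: the names `Variant`, `parse_variant_line` and `variant_kind`, as well as the labels returned by `variant_kind`, are hypothetical and introduced here for illustration.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """One input line: five mandatory fields plus any extra columns."""
    chrom: str
    start: int
    end: int
    ref: str
    obs: str
    extra: List[str]


def parse_variant_line(line: str) -> Variant:
    """Split a tab-separated input line into its fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields, got {len(fields)}")
    chrom = fields[0]
    # The chromosome index may carry an optional 'chr' prefix.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    # Extra columns are kept as-is so they can be written out unchanged.
    return Variant(chrom, int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5:])


def variant_kind(v: Variant) -> str:
    """Rough classification based on the Ref/Obs conventions (labels are illustrative)."""
    if v.ref == "-":
        return "insertion"
    if v.obs == "-":
        return "deletion"
    if v.ref == "0":
        return "unknown_reference"
    if len(v.ref) == 1 and len(v.obs) == 1:
        return "single_substitution"
    return "block_substitution"
```

For example, `parse_variant_line("chr16\t49303427\t49303427\tC\tT\tR702W (NOD2)")` yields a `Variant` whose `chrom` is `"16"` and whose `extra` list holds the comment column.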

To assess variants with regard to their functional effects, SNP-select uses a number of databases that are loaded and checked in advance (Table 2). Some filters can be omitted at the user's discretion. Every variant (mutation) is compared against the content of the databases selected as filters and is discarded if it hits particular genome regions or exceeds a user-specified score threshold.
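
The two kinds of decision described above, a database hit and a score threshold, can be sketched as follows. This is an illustration under stated assumptions rather than the actual SNP-select logic: the function names are hypothetical, region databases are modelled as plain lists of inclusive `(chromosome, start, end)` tuples, and a variant without a score is assumed to be kept.

```python
from typing import Iterable, Optional, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def hits_region(chrom: str, start: int, end: int, regions: Iterable[Region]) -> bool:
    """True if [start, end] on chrom overlaps any region (linear scan, for clarity only)."""
    return any(c == chrom and s <= end and start <= e for c, s, e in regions)


def keep_by_region(chrom: str, start: int, end: int,
                   regions: Iterable[Region], drop_if_found: bool) -> bool:
    """Region filter: 'all found removed' or 'all not found removed'."""
    found = hits_region(chrom, start, end, regions)
    return not found if drop_if_found else found


def keep_by_score(score: Optional[float], threshold: float) -> bool:
    """Score filter: a variant is dropped when its score exceeds the threshold."""
    if score is None:
        # Assumption: variants absent from the score database are kept.
        return True
    return score <= threshold
```

A real implementation would index the regions (for instance, by chromosome and sorted start position) instead of scanning them linearly for every variant.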

At every step, two files are written to the folder specified in the initialization file: 'step_i', with the mutations that passed the filter at the i-th step, and 'step_i.dropped', with the mutations dropped at that step. The last value in each tab-delimited mutation line is a comment on the feature checked by the current filter; as a rule, it is a score value. For the first filter, genomic annotation, it describes the location of the mutation: exonic or splicing for accepted mutations; noncoding_exonic, intronic, 3-prime noncoding, 5-prime noncoding, or intergenic for filtered-out mutations; and '-' for discarded mutations that were not found in the annotation DB (that is, located outside the annotated part of the genome).

Once all filters have been applied, the mutations that remain form the list of those most promising in terms of association with a certain feature or disease.
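
The stepwise procedure with its per-step output files can be sketched as below. This is a minimal illustration, not the SNP-select code: `run_steps` and the `Filter` type are hypothetical names, the file names are assumed to be 'step_1', 'step_1.dropped', 'step_2' and so on, and the comment of one step is assumed not to be carried over into the next.

```python
import os
from typing import Callable, Iterable, List, Tuple

# A filter takes the tab-split fields of one mutation and returns
# (keep, comment); the comment is typically a score or an annotation label.
Filter = Callable[[List[str]], Tuple[bool, str]]


def run_steps(rows: Iterable[List[str]], filters: List[Filter], out_dir: str) -> List[List[str]]:
    """Apply the filters in order, writing step_i and step_i.dropped for each step."""
    os.makedirs(out_dir, exist_ok=True)
    current = list(rows)
    for i, flt in enumerate(filters, start=1):
        passed = []
        kept_path = os.path.join(out_dir, f"step_{i}")
        dropped_path = os.path.join(out_dir, f"step_{i}.dropped")
        with open(kept_path, "w") as kept_f, open(dropped_path, "w") as dropped_f:
            for fields in current:
                keep, comment = flt(fields)
                # The filter's comment becomes the last tab-delimited value.
                target = kept_f if keep else dropped_f
                target.write("\t".join(fields + [comment]) + "\n")
                if keep:
                    passed.append(fields)
        # Only mutations that passed this step go on to the next one.
        current = passed
    return current
```

The list returned after the last step corresponds to the final set of promising mutations.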

Table 2. List of filtering steps.
Filter order  Name of a filter, file in folder with DB               Score threshold for discrimination of non-promising mutations
1             Genic annotation of a mutation                          All mutations outside exons or splicing sites are being discriminated
2             Conserved regions among 46 species                      All not found removed
3             Segmental duplication regions                           All found removed**)
4             1000 Genomes Project Pilot data 2011 May release        All found removed**)
5             dbSNP, NCBI short genetic variations                    All found removed**)
6             whole-exome SIFT scores for non-synonymous variants
7             PolyPhen-2 scores
8             PhyloP conservation scores
9             MutationTaster scores
10            whole-exome LRT (likelihood ratio test) scores
11            whole-exome GERP++ scores
12            Conserved genomic regions by GERP++                     All not found removed
13            NHLBI Exome Sequencing Project
14            UCSC repeats                                            All found removed**)

*) - can be changed by user
**) - Database hit is enough