Gene selection by query (logical expression).
The expression data for a set of genes is represented as a table consisting of rows (usually corresponding to genes) and columns (usually corresponding to samples, tissues, or experiments). Each row holds the expression measurements for one gene. However, the table may include not only expression data but also other gene-related information, for example gene names, classifiers, etc. Therefore, in the general case, we call the table columns 'fields'. A field can be of one of four basic types:
|IVALUE||signed integer value;|
|FVALUE||floating point value;|
|WORD||text without spaces inside (single word);|
|STRING||text with spaces inside allowed.|
Basic input file format should be as follows:
; May contain comment starting from the semicolon in any line of the file
NAME<tab>WORD
GENEID<tab>IVALUE
TISSUECANCER0<tab>FVALUE
TISSUECANCER1<tab>FVALUE
TISSUENORMAL0<tab>FVALUE
TISSUENORMAL1<tab>FVALUE
TISSUENORMAL2<tab>FVALUE
#GROUP<tab>Cancer tissues
TISSUECANCER0
TISSUECANCER1
#ENDGROUP
#GROUP<tab>Arbitrary group
TISSUECANCER1
TISSUECANCER2
TISSUENORMAL0
TISSUENORMAL1
#ENDGROUP
END DATA
GENE04675<tab>402<tab>6.00<tab>5.60<tab>5.97<tab>6.00<tab>6.00
GENE46890<tab>794<tab>2.77<tab>3.22<tab>5.65<tab>5.68<tab>5.68
GENE23794<tab>404<tab>5.97<tab>5.97<tab>6.00<tab>5.60<tab>5.97
In this example, <tab> denotes the tab character.
The first lines (up to the "DATA" line) contain the data format description. In this part of the file, each line describes one field: its name and its basic type.
After the "DATA" line come the data for each gene. Each line corresponds to a single record. Field values are separated by the tab character. Two consecutive tabs are interpreted as missing data.
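The layout described above can be read with a short routine. The following Python sketch is only an illustration, not SelTag's own reader: the function name `parse_seltag_text` is hypothetical, and it assumes that group members are listed one per line and that the description part ends at a line whose last word is DATA.

```python
def parse_seltag_text(text):
    """Parse SelTag-style text into (fields, groups, rows).

    Assumptions (not confirmed by the format description): group members
    are listed one per line, and the description part ends at a line
    whose last word is DATA.
    """
    casts = {"IVALUE": int, "FVALUE": float, "WORD": str, "STRING": str}
    fields, groups, rows = [], {}, []
    current_group = None
    in_data = False
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # blank line or comment
        if in_data:
            cells = line.split("\t")
            # An empty cell (two tabs in a row) means missing data -> None.
            rows.append([
                casts[ftype](cell) if cell != "" else None
                for (_, ftype), cell in zip(fields, cells)
            ])
            continue
        line = line.strip()
        if line.split()[-1] == "DATA":
            in_data = True
        elif line.startswith("#GROUP"):
            current_group = line.split("\t", 1)[1]
            groups[current_group] = []
        elif line == "#ENDGROUP":
            current_group = None
        elif current_group is not None:
            groups[current_group].append(line)
        else:
            name, ftype = line.split("\t")
            fields.append((name, ftype))
    return fields, groups, rows


sample = (
    "; shortened example with one missing value\n"
    "NAME\tWORD\n"
    "GENEID\tIVALUE\n"
    "TISSUECANCER0\tFVALUE\n"
    "TISSUECANCER1\tFVALUE\n"
    "#GROUP\tCancer tissues\n"
    "TISSUECANCER0\n"
    "TISSUECANCER1\n"
    "#ENDGROUP\n"
    "END DATA\n"
    "GENE04675\t402\t6.00\t5.60\n"
    "GENE46890\t794\t\t3.22\n"
)
fields, groups, rows = parse_seltag_text(sample)
```

In this sketch the second data row has two tabs in a row after 794, so its TISSUECANCER0 value is stored as `None`.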
The SelTag program assumes that the expression data in the file are normalized and that the expression levels of genes are comparable across experiments.
The MolQuest version of the SelTag program can also operate on another type of file, namely selection files. These files contain information about some genes or samples selected from a large data file in SelTag format. A selection file contains: the name of the data file from which the selection was obtained; the type of the selected data (genes or samples); and the list of selected objects (their indices in the large data file). Selection files are in XML format. Two examples are given below.
Selection for some genes.
<?xml version="1.0" encoding="ISO-8859-5"?>
<SELECTION>
  <HEADER name="cc_Selection5">
    <DATA source="c:/data/cc.txt"/>
    <COMMENT><![CDATA["$F1 == "GEN14263" | $F12 >= 300"]]></COMMENT>
  </HEADER>
  <ELEMENTS type="GENES" count="9">
    <![CDATA[0;1;2;10;14;15;17;26;30]]>
  </ELEMENTS>
</SELECTION>
Selection for some fields (samples).
<?xml version="1.0" encoding="ISO-8859-5"?>
<SELECTION>
  <HEADER name="notterman2001_set1">
    <DATA source="c:/data/notterman2001_set1.txt"/>
    <COMMENT><![CDATA["From cc.txt data file."]]></COMMENT>
  </HEADER>
  <ELEMENTS type="FIELDS" count="10">
    <![CDATA[0;1;2;3;5;6;7;18;19;30]]>
  </ELEMENTS>
</SELECTION>
Selection files may be selected during SelTag execution and used by SelTag for calculation and/or visualization. Note that each selection file is linked to its large data file by name; selection data cannot be applied to another data file.
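Because selection files are plain XML, they can be inspected outside SelTag as well. The sketch below uses Python's standard `xml.etree.ElementTree` module; the function name `read_selection` and the returned dictionary layout are illustrative choices, not part of SelTag.

```python
import xml.etree.ElementTree as ET


def read_selection(xml_text):
    """Return the main parts of a selection document as a dict."""
    root = ET.fromstring(xml_text)
    header = root.find("HEADER")
    elements = root.find("ELEMENTS")
    # The CDATA section is exposed as ordinary element text.
    indices = [int(tok) for tok in elements.text.strip().split(";") if tok]
    # The count attribute should agree with the number of listed indices.
    if int(elements.get("count")) != len(indices):
        raise ValueError("count attribute does not match the index list")
    return {
        "name": header.get("name"),
        "source": header.find("DATA").get("source"),
        "comment": header.findtext("COMMENT", default="").strip(),
        "type": elements.get("type"),  # "GENES" or "FIELDS"
        "indices": indices,
    }


example = """<?xml version="1.0" encoding="ISO-8859-5"?>
<SELECTION>
  <HEADER name="notterman2001_set1">
    <DATA source="c:/data/notterman2001_set1.txt"/>
    <COMMENT><![CDATA["From cc.txt data file."]]></COMMENT>
  </HEADER>
  <ELEMENTS type="FIELDS" count="10">
    <![CDATA[0;1;2;3;5;6;7;18;19;30]]>
  </ELEMENTS>
</SELECTION>"""

selection = read_selection(example)
```

The `source` value is what ties the selection to its data file, so a reader of this kind would normally compare it with the name of the data file currently loaded before applying the indices.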
The logical expression contains field (experiment) indices denoted as $FX
(where X is the field index) and relationships between values of the fields.
For example, the string
$F24 < 100
means that genes with an expression level lower than 100 in field 24 should be selected. Several operations can be used to compare field values:
|<||less than|
|>||greater than|
|<=||less than or equal to|
|>=||greater than or equal to|
|==||equal to|
Complex queries may be formed by combining simple queries with the logical operations
AND (&), OR (|), NOT (!) and parentheses. For example, the query
($F10 < 100) & ($F23 >= 0)
returns all genes whose expression level in experiment #10 is lower than 100 and whose expression level in experiment #23 is greater than or equal to zero.
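For readers who prefer code, the meaning of such a combined query can be sketched as a row filter in Python. This only illustrates the semantics and is not how SelTag evaluates queries; the gene names, the values, and the dictionary keyed by field index are made up for the example.

```python
def select(genes, predicate):
    """Return the names of the genes whose fields satisfy the predicate."""
    return [name for name, fields in genes if predicate(fields)]


# Hypothetical data: each gene maps a field index to its value.
genes = [
    ("GENE_A", {10: 50.0, 23: 0.0}),
    ("GENE_B", {10: 150.0, 23: 5.0}),
    ("GENE_C", {10: 99.9, 23: -1.0}),
    ("GENE_D", {10: 20.0, 23: 7.5}),
]


def query(f):
    # Python counterpart of: ($F10 < 100) & ($F23 >= 0)
    return (f[10] < 100) and (f[23] >= 0)


selected = select(genes, query)
```

Here GENE_B fails the first condition and GENE_C fails the second, so only GENE_A and GENE_D are selected.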
Some additional arithmetic operations may also be used:
|+,-||sum and difference|
|*,/||multiply and divide by|
|ABS(x)||absolute value of x|
|x^y||x raised to the power y|
|SQRT(x)||square root of x|
For example, the query
ABS($F10-$F11) < 100
selects genes for which the absolute difference between the expression levels in experiments 10 and 11 is lower than 100. Arithmetic operations are allowed on numeric fields only.
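The arithmetic query above has a direct counterpart in ordinary code. The following Python sketch uses invented gene names and values; it only illustrates what the expression computes for each gene.

```python
# Hypothetical data: each gene maps a field index to its expression level.
rows = {
    "GENE_A": {10: 250.0, 11: 200.0},   # difference 50
    "GENE_B": {10: 90.0, 11: 300.0},    # difference 210
    "GENE_C": {10: 10.0, 11: 109.5},    # difference 99.5
}

# Python counterpart of: ABS($F10-$F11) < 100
close = [gene for gene, f in rows.items() if abs(f[10] - f[11]) < 100]
```

In the same spirit, `x^y` corresponds to `x ** y` in Python and `SQRT(x)` to `math.sqrt(x)`.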
Text comparison is also possible if the compared field is of the STRING or WORD type. For example, to select the gene with the name "Gene2356" in the field $F1, one can use the query
$F1 == "Gene2356"
Note that it is better to put textual values in quotation marks; this allows processing even strings that contain spaces or special characters (the arithmetic or logical operators described above).
Genes can also be selected by their numbers in the data file. For example, the query
$N <= 400
returns all genes with indices from 1 to 400.
Genes can be selected by their expression level in a group of fields (experiments). For example, to select genes with an expression level greater than 100 in any of the experiments from group 1, the following query is applicable:
$G1 > 100
A condition level can be applied to the group selection: the user can specify the number of fields from the group that must satisfy the condition. To select genes for which the expression level is greater than 100 in at least 10 experiments, the previous query can be modified:
$G1:10 > 100
The condition level can also be specified as a percentage of the group size:
$G1:50% > 100
The latter query selects genes for which at least 50% of the experiments from group 1 have an expression level greater than 100.
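The three forms of the group condition (any field, at least N fields, at least a given percentage of fields) can be summarized in one small function. This is a sketch of the described semantics only: the name `group_condition`, the rounding up of fractional percentages, and the rule that missing values never satisfy a condition are assumptions, not documented SelTag behavior.

```python
import math


def group_condition(values, test, level=1):
    """Check whether enough fields of a group satisfy `test`.

    level: an integer count (default 1, i.e. "any field") or a string
    such as "50%" meaning a share of the group size.
    Assumptions: percentages are rounded up to whole fields, and missing
    values (None) never satisfy the condition.
    """
    passed = sum(1 for v in values if v is not None and test(v))
    if isinstance(level, str) and level.endswith("%"):
        needed = math.ceil(len(values) * float(level[:-1]) / 100)
    else:
        needed = int(level)
    return passed >= needed


# Hypothetical expression levels of one gene in the four fields of group 1.
group1 = [120.0, 80.0, 150.0, 101.0]


def above_100(v):
    return v > 100


any_field = group_condition(group1, above_100)           # $G1 > 100
four_fields = group_condition(group1, above_100, 4)      # $G1:4 > 100
half = group_condition(group1, above_100, "50%")         # $G1:50% > 100
```

Three of the four values exceed 100, so the "any" and "50%" forms hold for this gene while the "at least 4 fields" form does not.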
A score can be assigned to each gene upon query evaluation. For example, if the query is $F3 > 100 and two genes satisfy this condition with $F3 expression levels of 105 and 800, the gene with expression level 800 will have the greater score.
List of selected genes and their scores [12 total]:
No    Index   Name        Score
1     1       GEN30482    0.5167
2     2       GEN03437    0.7767
3     3       GEN03687    0.9467
4     4       GEN24649    0.9600
5     5       GEN09108    0.2333
6     6       GEN09514    0.9933
7     7       GEN24589    0.7067
8     8       GEN02291    1.0233
9     9       GEN24534    0.9300
10    10      GEN14489    0.8000
11    11      GEN33519    0.8000
12    13      GEN35755    0.8633
The first line is the header; it contains the number of selected genes in square brackets. The second line gives the column names, separated by tabs: No is the sequential number of the gene in the list; Index is the index of the gene in the large data file; Name is the gene name (by default, the program takes it from the field called 'Name' in the list of field names); Score is the query score (the better the gene fits the query expression, the higher the score). The following lines list the data for the selected genes, separated by tabs.
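A result list in this layout is easy to load for further processing. The Python sketch below follows the description above (header line, tab-separated column names, tab-separated records); the function name `parse_result_list` and the shortened three-gene listing are illustrative only.

```python
import re


def parse_result_list(text):
    """Parse a tab-separated result list into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: the number of selected genes is given as "[N total]".
    total = int(re.search(r"\[(\d+) total\]", lines[0]).group(1))
    records = []
    for ln in lines[2:]:  # lines[1] holds the column names
        no, index, name, score = ln.split("\t")
        records.append({
            "no": int(no),
            "index": int(index),
            "name": name,
            "score": float(score),
        })
    if len(records) != total:
        raise ValueError("header count does not match the number of records")
    return records


listing = (
    "List of selected genes and their scores [3 total]:\n"
    "No\tIndex\tName\tScore\n"
    "1\t1\tGEN30482\t0.5167\n"
    "2\t2\tGEN03437\t0.7767\n"
    "3\t13\tGEN35755\t0.8633\n"
)
records = parse_result_list(listing)
best = max(records, key=lambda r: r["score"])
```

Sorting or filtering by the `score` key then gives the genes that fit the query best.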