MAS5Baseline description

Comparison of Affymetrix gene expression raw data with baseline data using the MAS 5.0 algorithm.

Data specification

The input for MAS5Baseline is a set of raw expression data files in Affymetrix CEL format, the corresponding CDF file, and a user-provided file that lists the CEL files to be processed with a short description of each. The CEL file stores the results of the intensity calculations on the pixel values of the chip. The CDF file describes the layout of an Affymetrix GeneChip array. The output is a SelTag data file with gene expression data. The name of the baseline experiment must be provided by the user.

Algorithm description

The purpose of the algorithm is to perform noise correction and data normalization for each experiment and to estimate the change of the gene expression signal relative to the baseline experiment signal. The method is known as the MAS 5.0 statistical algorithm, implemented in the Affymetrix Microarray Suite version 5.0. The algorithm details are described in the Affymetrix documentation ("Statistical Algorithms Description Document", Affymetrix, 2002; "Statistical Algorithms Reference Guide", Affymetrix, 2001).

    The algorithm consists of several steps:
  1. Background noise correction for baseline and experiment
  2. Change of the expression value (signal change) calculation between experiment and baseline
  3. Estimation of the signal change value statistical significance (change detection p-values)
  4. Estimation of the signal change (change detection call)

Background noise correction. First, the chip area is divided into K square zones of equal size (the default number of zones is 16). In each zone, the 2% of probes with the lowest intensity define the background. The background level bZk of the k-th zone is calculated as the average intensity of those probes. The background level b(x,y) for a probe at chip location (x,y) is then calculated as the weighted sum of the zone background values

$$ b(x,y) = \frac{1}{\sum_{k=1}^{K} w_k(x,y)} \sum_{k=1}^{K} w_k(x,y) \, bZ_k $$

where weights wk(x,y) are calculated as follows:

$$ w_k(x,y) = \frac{1}{d_k^2(x,y) + smooth} $$

where dk(x,y) is the distance from the point (x,y) to the center of the k-th zone and smooth is the smoothing parameter (100 by default).
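The zone-based background estimate can be sketched in a few lines of NumPy. This is a minimal illustration, not the MAS5Baseline implementation: it assumes the chip is a 2-D intensity array split into a regular 4 x 4 grid, uses the weight form w_k = 1/(d_k^2 + smooth) given in the Affymetrix documentation, and the function names (zone_centers, zone_backgrounds, smooth_over_chip) are hypothetical.

```python
import numpy as np

def zone_centers(n_rows, n_cols, zones_per_side=4):
    """Centers (x, y) of a zones_per_side x zones_per_side grid of equal zones."""
    xs = (np.arange(zones_per_side) + 0.5) * n_cols / zones_per_side
    ys = (np.arange(zones_per_side) + 0.5) * n_rows / zones_per_side
    cx, cy = np.meshgrid(xs, ys)  # row-major order: y varies slowest
    return np.column_stack([cx.ravel(), cy.ravel()])

def zone_backgrounds(intensity, zones_per_side=4, fraction=0.02):
    """bZk: mean of the lowest `fraction` of intensities in each zone."""
    n_rows, n_cols = intensity.shape
    row_edges = np.linspace(0, n_rows, zones_per_side + 1).astype(int)
    col_edges = np.linspace(0, n_cols, zones_per_side + 1).astype(int)
    values = []
    for r in range(zones_per_side):        # same order as zone_centers
        for c in range(zones_per_side):
            zone = intensity[row_edges[r]:row_edges[r + 1],
                             col_edges[c]:col_edges[c + 1]]
            cells = np.sort(zone.ravel())
            n_low = max(1, int(round(fraction * cells.size)))
            values.append(cells[:n_low].mean())
    return np.array(values)

def smooth_over_chip(zone_values, centers, n_rows, n_cols, smooth=100.0):
    """b(x,y): weighted average of zone values with w_k = 1/(d_k^2 + smooth)."""
    yy, xx = np.mgrid[0:n_rows, 0:n_cols]
    # Squared distance from every cell to every zone center: shape (K, rows, cols)
    d2 = ((xx[None] - centers[:, 0, None, None]) ** 2
          + (yy[None] - centers[:, 1, None, None]) ** 2)
    w = 1.0 / (d2 + smooth)
    return (w * zone_values[:, None, None]).sum(axis=0) / w.sum(axis=0)
```

Because the weights are normalized by their sum, b(x,y) always lies between the smallest and the largest zone value; a larger smooth makes the surface flatter.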

The noise correction procedure is as follows. First, the standard deviation nZk of the 2% of probes with the lowest intensity is calculated for each zone. For each probe, the noise level n(x,y) is estimated with the formulas above (substituting n(x,y) for b(x,y) and nZk for bZk). Then the probe intensity corrected for background and noise, A(x,y), is calculated from the actual probe intensity I(x,y) as follows:

$$ A(x,y) = \max\left( I'(x,y) - b(x,y), \; NoiseFrac \cdot n(x,y) \right) $$

where I'(x,y)=max(I(x,y),0.5) and NoiseFrac is the noise fraction, set to 0.5 as in the MAS 5.0 algorithm description.
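As a small illustration of the correction step, the sketch below applies the rule from the MAS 5.0 documentation, A = max(I' - b, NoiseFrac * n), to arrays of intensities, background levels and noise levels. The function name adjust_intensity is hypothetical, and the background and noise arrays are assumed to have been computed beforehand.

```python
import numpy as np

def adjust_intensity(intensity, background, noise, noise_frac=0.5):
    """Background- and noise-corrected intensity for each probe."""
    floored = np.maximum(intensity, 0.5)  # I'(x,y) = max(I(x,y), 0.5)
    # The corrected value never drops below a fraction of the local noise.
    return np.maximum(floored - background, noise_frac * noise)

corrected = adjust_intensity(np.array([100.0, 20.0, 0.1]),
                             background=np.array([30.0, 30.0, 30.0]),
                             noise=np.array([4.0, 4.0, 4.0]))
```

The lower bound NoiseFrac * n(x,y) keeps every corrected intensity positive, so the logarithm taken in the next step is always defined.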

Expression value (signal) calculation. After background subtraction from each probe intensity value, the signal values for the probesets are calculated. The calculation uses the "ideal mismatch" (IM) technique, which makes it possible to process probe pairs for which the mismatch (MM) signal is greater than the perfect match (PM) signal (see details in the Affymetrix documentation). Once the ideal mismatch is calculated for each probe pair j of each probeset i, the probe value PVij is calculated: PVij = log2(max(PMij-IMij, 2^-20)). The signal log value SLVi for probeset i is calculated as the one-step biweight estimate of the PVij values of that probeset. Then the algorithm scales all the probesets to the target value Sc (default is 500) by estimating the scale factor sf

$$ sf = \frac{Sc}{TrimMean(2^{SLV_i}, 0.02, 0.98)} $$

and using the normalization factor nf:

$$ nf = \frac{TrimMean(SPVb_i, 0.02, 0.98)}{TrimMean(SPVe_i, 0.02, 0.98)} $$

where SPVbi is the baseline signal and SPVei is the experiment signal. The scaled probe values are calculated as SPVij=PVij+log2(nf·sf). The TrimMean function calculates the mean value of the data without the highest 2% and the lowest 2% of values. The probe log ratio PLR is calculated for probe pair j in probeset i from the baseline (b) and experiment (e) arrays: PLRij=eSPVij-bSPVij. From the probe log ratios, the SignalLogRatio of the probeset is calculated using the one-step biweight algorithm. SignalLogRatio is the value reported by this algorithm.
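The robust averaging used in this step can be sketched as follows. This is an illustrative sketch under stated assumptions: the biweight tuning constant c=5 and epsilon=0.0001 are the defaults given in the Affymetrix documentation, the trimmed mean drops whole values only (no fractional weighting at the cut points), the scaled probe values use log2(nf * sf), and all function names are hypothetical.

```python
import numpy as np

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight estimate of the center of `values`."""
    x = np.asarray(values, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m))          # median absolute deviation
    u = (x - m) / (c * s + epsilon)
    # Points further than c*s from the median get zero weight.
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

def trim_mean(values, low=0.02, high=0.98):
    """Mean without the lowest 2% and highest 2% of values (simplified)."""
    x = np.sort(np.asarray(values, dtype=float))
    lo = int(np.floor(low * x.size))
    hi = int(np.ceil(high * x.size))
    return float(x[lo:hi].mean())

def scale_factor(signal_log_values, target=500.0):
    """sf = Sc / TrimMean(2^SLV)."""
    return target / trim_mean(2.0 ** np.asarray(signal_log_values, dtype=float))

def scaled_probe_values(pv, sf, nf=1.0):
    """SPVij = PVij + log2(nf * sf)."""
    return np.asarray(pv, dtype=float) + np.log2(nf * sf)

def signal_log_ratio(spv_experiment, spv_baseline):
    """Biweight estimate of the probe log ratios of one probeset."""
    plr = np.asarray(spv_experiment, dtype=float) - np.asarray(spv_baseline, dtype=float)
    return tukey_biweight(plr)
```

The biweight estimate is what makes the reported SignalLogRatio insensitive to a few outlying probe pairs: a probe whose log ratio lies far from the median simply receives zero weight.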

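Estimation of the statistical significance of the signal change (change detection p-values). To estimate the significance of the change of the expression signal between experiment and baseline, two additional sets of values are calculated for each probeset. As defined in the Affymetrix documentation, for each probe pair j these are the difference between PM and MM and the difference between PM and the background level b:

$$ q_j = PM_j - MM_j, \qquad z_j = PM_j - b_j $$
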
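They are used to estimate two balancing factors. The first balancing factor

$$ nf = \frac{sfE}{sfB} $$

is calculated as the ratio of the scaling factors of the q values for experiment (sfE) and baseline (sfB) data. The second balancing factor

$$ nf2 = \frac{sf2E}{sf2B} $$

is calculated as the ratio of the scaling factors of the z values for experiment (sf2E) and baseline (sf2B) data. The balancing factor range is extended by using three balancing factors for the q values

$$ f[0] = nf \cdot d, \quad f[1] = nf, \quad f[2] = nf / d $$

and for the z values

$$ f2[0] = nf2 \cdot d, \quad f2[1] = nf2, \quad f2[2] = nf2 / d $$

where d is the perturbation parameter, set to 1.1 by default.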
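The three perturbed balancing factors can be generated with a one-line helper. The sketch assumes the form given in the Affymetrix documentation (the factor multiplied by d, left unchanged, and divided by d); the function name balancing_factors is hypothetical, and the same helper serves both the q values (sfE, sfB) and the z values (sf2E, sf2B).

```python
def balancing_factors(sf_experiment, sf_baseline, d=1.1):
    """Return [f[0], f[1], f[2]] = [nf*d, nf, nf/d] with nf = sfE / sfB."""
    nf = sf_experiment / sf_baseline
    return [nf * d, nf, nf / d]

f = balancing_factors(2.2, 2.0)    # factors for the q values
f2 = balancing_factors(3.0, 3.0)   # factors for the z values
```

Evaluating the test at three slightly different factors guards against a change call that depends on small errors in the normalization.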

If the algorithm settings specify a user-defined balancing factor that is not equal to 1, then nf = nf2 = (user-defined normalization factor)·sfE/sfB, where sfE is the experiment sf and sfB is the baseline sf as described in the Expression value (signal) calculation section.

A critical p-value is estimated for each of the three balancing factors f[k] (k=0,1,2); these values are denoted p[0], p[1], p[2] respectively. They are used to estimate the p-value of the signal change:

p=max(p[0],p[1],p[2]) if p[0] < 0.5, p[1] < 0.5 and p[2] < 0.5
p=min(p[0],p[1],p[2]) if p[0] > 0.5, p[1] > 0.5 and p[2] > 0.5
p=0.5 otherwise.
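The combination rule above translates directly into code. The function name combine_p_values is hypothetical; the logic follows the three cases as listed.

```python
def combine_p_values(p):
    """Combine the three critical p-values p[0], p[1], p[2] into one."""
    if all(v < 0.5 for v in p):
        return max(p)   # all point to an increase: take the weakest evidence
    if all(v > 0.5 for v in p):
        return min(p)   # all point to a decrease: take the weakest evidence
    return 0.5          # the three tests disagree: no evidence of change
```

In both consistent cases the rule picks the p-value closest to 0.5, so a change is reported only if it holds for all three balancing factors.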

Estimation of the signal change (change detection call). The algorithm reports several types of change calls in the output file: increase (I, the designation of the call in the SelTag file), marginal increase (i), decrease (D), marginal decrease (d), and no change (U). The call depends on four parameters, g1High, g1Low, g2High and g2Low, which yield two thresholds: g1, a linear interpolation of g1High and g1Low (if g1High = g1Low, then g1 = g1High = g1Low), and g2, a linear interpolation of g2High and g2Low (if g2High = g2Low, then g2 = g2High = g2Low).

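The rule for the change call is as follows:

call=I if p < g1
call=i if g1 <= p < g2
call=U if g2 <= p <= 1-g2
call=d if 1-g2 < p <= 1-g1
call=D if p > 1-g1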

The MAS 5.0 default values for the gamma parameters are: g1High=0.0025, g1Low=0.0025; g2High=0.003, g2Low=0.003 (for 16-20 probe pairs).
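A sketch of the call assignment, using the thresholds from the Affymetrix documentation (small p-values indicate an increase, p-values close to 1 indicate a decrease) and the default gamma values quoted above. The function name change_call is hypothetical, and g1 and g2 are assumed to have been interpolated already.

```python
def change_call(p, g1=0.0025, g2=0.003):
    """Map a change p-value to the SelTag call letter."""
    if p < g1:
        return "I"   # increase
    if p < g2:
        return "i"   # marginal increase
    if p > 1.0 - g1:
        return "D"   # decrease
    if p > 1.0 - g2:
        return "d"   # marginal decrease
    return "U"       # no change
```

For example, change_call(0.99998) gives "D" and change_call(0.04914) gives "U", which matches the calls shown for AFFX-BioB-5_at and AFFX-MurFAS_at in the output example.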

Example of experiment list file

GSM42890	DEHP_48hr_Veh1	DEHP 48hr Veh1
GSM42891	DEHP_48hr_Veh2	DEHP 48hr Veh2
GSM42892	DEHP_48hr_Veh3	DEHP 48hr Veh3
GSM42893	DEHP_48hr_Veh4	DEHP 48hr Veh4
GSM42894	DEHP_48hr_Veh5	DEHP 48hr Veh5

This file contains three columns separated by a tab symbol. The first column is the experiment data name (the name of the corresponding CEL file should start with this name and have the extension *.cel, for example GSM42890.cel). The second column is the name of the variable in the output SelTag file that corresponds to this experiment (see the example of SelTag output below). This column should not contain spaces. The third column is the extended description of the experiment, which will appear in the header section of the SelTag file.
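For checking a list file before running the analysis, a small reader along the following lines may help. It is only a sketch: it assumes the columns are tab-separated, and the names Experiment and read_experiment_list are hypothetical and not part of MAS5Baseline.

```python
import io
from typing import NamedTuple

class Experiment(NamedTuple):
    data_name: str     # CEL file name starts with this, e.g. GSM42890.cel
    variable: str      # variable name in the SelTag output, no spaces
    description: str   # free text for the SelTag header

def read_experiment_list(stream):
    """Parse a tab-separated experiment list and validate each row."""
    experiments = []
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue   # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        data_name, variable, description = (f.strip() for f in fields)
        if " " in variable:
            raise ValueError(f"line {line_no}: variable name contains spaces")
        experiments.append(Experiment(data_name, variable, description))
    return experiments

sample = io.StringIO("GSM42890\tDEHP_48hr_Veh1\tDEHP 48hr Veh1\n"
                     "GSM42891\tDEHP_48hr_Veh2\tDEHP 48hr Veh2\n")
experiments = read_experiment_list(sample)
```

With a real file, the same function can be called as read_experiment_list(open(path)), where path is the location of the list file.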

Example of output data

Multiple chip data analysis by Affymetrix MAS5.0 algoritm [comparison with baseline].
    BaselineDataHeader=Baseline experiment

  1 ExperimentDataFilename=GSM42907.cel
  1 DataHeader=VPA_48hr_Ve	VPA 48hr Veh POOLED
  1 DataScalingFactor=2.3930
  1 DataNormalizationFactor=1.0000
  1 DataSignalTrimmedMean=500.0000

  2 ExperimentDataFilename=GSM42913.cel
  2 DataHeader=DEHP_48hr_t	DEHP 48hr treated POOLED
  2 DataScalingFactor=2.6396
  2 DataNormalizationFactor=1.0000
  2 DataSignalTrimmedMean=500.0000

MAS5 algorithm parameters:
ProbesetName	STRING
VPA_48hr_Ve_SignalLogRatio	FVALUE
VPA_48hr_Ve_Change	WORD
VPA_48hr_Ve_Change_p	FVALUE
DEHP_48hr_t_SignalLogRatio	FVALUE
DEHP_48hr_t_Change	WORD
DEHP_48hr_t_Change_p	FVALUE
AFFX-MurIL2_at	-0.0952	U	0.32868	-0.3230	U	0.28164
AFFX-MurIL10_at	0.5692	U	0.12112	0.3852	U	0.66645
AFFX-MurIL4_at	-0.1952	U	0.16996	-0.3095	U	0.30476
AFFX-MurFAS_at	-1.3517	U	0.49464	-0.2080	U	0.04914
AFFX-BioB-5_at	-0.7911	D	0.99998	0.0126	U	0.79768
AFFX-BioB-M_at	-0.7021	D	1.00000	-0.2708	D	0.99997
AFFX-BioB-3_at	-0.5249	D	0.99998	-0.4171	D	0.99987