SoftBerry - proteomicspeakalign

Proteomics-MSPeakAlign - Softberry Mass Spectra (SMS) processing tools. Peaks detection and alignment.

This program finds peaks in several samples and aligns them. For each single spectrum this program performs:
(1) Data resampling;
(2) Data smoothing;
(3) Detection of the baseline and its subtraction from intensity;
(4) Normalization;
(5) Peaks identification.
Once the peaks in different spectra are identified, they can be aligned over each other that allows to reveal the presence of common peaks in these spectra.

Step 1. Data resampling.
The first step in mass spectra processing is data resampling. It allows to discriminate the excessive data and to bring the m_i values to common scale. As a result, different spectra will have the same m value counts, and, thus, will be comparable. Reduction in number of spectrum points allows to lower the noise and to eliminate excessive data, but, at the same time, to keep the spectrum shape. The common data scale after conversion is located between the minimal and maximal m values of spectrum. The number of data that will be resampled from original set is determined by the 'Binning percent' parameter, that represents the percentage of spectrum points remained after conversion (default value is 25). Example of data resampling is shown in figure 1.

Figure 1. Result of data resampling for small spectrum interval. Original data are shown as blue squares, resampled ones - as red circles. The 'Binning percent' for this case was set to 25.

Step 2. Smoothing.
Data smoothing procedure is intended for data noise elimination. During the smoothing, the values of intensity for each m_i point are being averaged by several neighboring points. The number of such points is determined by the 'SmoothWindowSize' parameter (default value is 3). The smoothing procedure can be repeated for several times; the number of iterations is determined by the 'SmoothReps' parameter (default value is 3). Example of data smoothing is shown in the figure 2.

Figure 2. Result of data smoothing. Original data are shown as blue squares, smoothed ones - as red circles. The SmoothWindowSize was set to 3 and SmoothReps was set to 3.

Step 3. Baseline detection and subtraction.
This step of data processing is applied for elimination of the systematic artifacts that occur due to matrix and chemicals used in the experiments or as a result of detector overload. It results in background noise that may occur to be significant for some m values. The initial step in background noise removal is identification of peaks (local signal maxima that are located far enough from each other). The distance between peaks is determined by the 'Baseline parameter' value (default= 0.005). This parameter defines the minimal m distance, over which the two neighboring peaks 1 and 2 are to be located in the way, when:
|m₁-m₂|/m₁ > 'Baseline parameter'.

After peaks identification, algorithm detects the points with signal minima located in intervals between peaks. These are the base points for calculation of background noise line. Over base points the baseline for all spectrum points is built by interpolation. In case when in some spectrum parts the value of base signal exceeds the original one, the new base points selection from neighboring ones occurs.

The values of base signal intensity are subtracted from the original one. At that, if value of original signal has occurred below zero, it is equated to zero. The result of background subtraction is shown in figure 3.

Figure 3. Result of background signal subtraction. Original data are shown as blue squares, modified ones - as red circles. Baseline is shown in green line.

Step 4. Normalization.
Normalization allows to bring peaks intensity values to a common scale, and thus it becomes possible to compare data from different spectra. The only parameter for current procedure is 'NormalizationConstant' (default value is 10000). Example is shown in fig. 4.

Figure 4. Result of normalization procedure. Original data are shown as blue squares, modified ones - as red circles.

Step 5. Peaks identification.
The current step of analysis lies in searching for peaks in spectrum with high signal-noise relation. Peaks, in themselves, are identified as points of local spectrum maximum. The 'SNRMin' parameter specifies the minimally allowed value for signal-noise relation (default value is 3). This relation is considered in spectrum window w, size of which can be specified by the 'SNRWindowSize' parameter (default value is 250). Thus, the value for signal-noise relation is calculated as:

Program identify peaks with SNR>'SNRMin' and intensity not less than 'MinIntensity' (default value is 2). The result is shown in fig. 5.

Figure 5. Identification of peaks in mass spectrum. Peaks locations over signal curve are marked with red dots.

Step 6. Peaks alignment.
On analyzing several spectra the question if there are common peaks for these spectra easily arises. To solve this question it is necessary to compare peaks locations and intensity for spectra of interest. It is mandatory that for all spectra to be compared the steps 1 to 5 are to be completed with the same parameters. For further analysis, the peaks with signal-noise ratio not lower than that specified by 'SNRMin' parameter (default value is 3) will be selected. Once it is done, the peaks from different spectra are being grouped. Peak can be placed in the specific group if value dm=|m_i-mHi_group|/mHi_group, where mHi_group is the maximal mass value for peaks in current group, does not exceed that specified by the 'MassSeparation' parameter (default value is 0.0015). If multiple peaks groups meet this criterion, the group with minimal dm value is to be selected. Example for two spectra is shown in fig. 6.

Figure 6. Alignment of peaks from two mass spectra. Spectra are represented by blue and green curves, common peaks are marked with red dots, sample 1 specific peak is marked with dot of blue color.

Input: m/z - Intensity data
Output: List of peak groups identified with the following information for each group:
GroupIndex (Index of peak group), PeakID (peak ID), HighMass (highest value if m/z ration in peak group), MeanMass (mean value of m/z ration in the peak group), MinMass (minimal value of m/z ration in the peak group), MaxMass (maximal value of m/z ration in the peak group), NumPeaks (number of peaks in the peak group), MaxIntensity (maximal peak intensity in the peak group).

Parameter(s):
Binning percent - This parameter specify the fraction of data in percent that will remain after resampling. The default value is 25.
SmoothWindowSize - This parameter determine window size for smoothing operation. The default value is 3.
SmoothReps - This parameter specify the number of smoothing operation repeats. The default value is 3.
Baseline parameter - This parameter specify the minimal mass difference, over which the two neighboring peaks 1 and 2 are to be distinguished for baseline determination. The default value is 0.005.
NormalizationConstant - This parameter specify the normalization constant. The default value is 10000.
SNRWindowSize - This parameter specify window size to determine signal-to-noise ratio. The default value is 250.
SNRMin - This parameter specify minimal signal-to-noise ratio for peak detection. The default value is 3.
MinIntensity - This parameter specify minimal intensity for peak detection. The default value is 2.
MassSeparation - This parameter specify minimal mass separation for peaks from the same group. The default value is 0.0015.
File format type - This parameter specify file format. SSV-space separated values, CSV - comma separated values, TSV - tab separated values.

Data format.
Mass spectra data represent the sets of following pairs of values: mass to charge relation (m/z, further, for more convenience, it will be referred to as m, mass) and corresponding signal intensity (I). On a spectrum plot, the mass corresponds to X coordinate, and signal intensity- to Y one. A typical spectrum consists of several thousand of such value pairs (points). Data are represented as text files, where for each pair (m_i,I_i) of mass-intensity values the string is assigned, and data in this string are separated by special separator symbol. The SMS package allows several separators types: space (SSV, space separated values, file format), comma (CSV, comma separated values, file format) and tabulation (TSV, tab-separated values, file format). In files with data, the string with comments are allowed; during the file reading these strings are to be skipped. The commentary strings should begin with "#" symbol at the first position. In the figure 2 the example of file with data in CSV format is shown.


#M/Z,Intensity
-7.8602611e-005,4.1126194
2.1773576e-007,4.0764203
9.6021472e-005,4.0040221
0.00036601382,4.1186526
0.00081019477,4.0040221
0.0014285643,3.9617898
....
19742.941,4.077895
19745.564,4.0772248
19748.187,4.0772248

Figure 7. Example file with mass spectra data in CSV format.

Services Test Online

Proteomics-MSPeakAlign - Softberry Mass Spectra (SMS) processing tools. Peaks detection and alignment.