LDA analysis of mass spectra data.
One promising application of MS data is disease prognosis. The problem can be formulated as follows: identify peaks in serum MS data whose intensity changes in the case of disease, such that these changes can be detected as early as possible. It was shown recently that the information contained in mass spectra, in combination with the serum level of the tumor marker CA125, is useful for early detection of ovarian cancer [1].
MS data processing can be used to solve this task.
The first step of the analysis is data preprocessing, which makes MS from different patients comparable and identifies the locations of peaks [1].
Once the peaks in different spectra are identified, they can be aligned with each other, which reveals the peaks common to these spectra.
The Softberry SMS program package performs all these steps of the analysis and outputs the set of spectral data as a single table. In this table, rows correspond to samples and each column corresponds to the MS intensity for one of the peak groups identified at the preprocessing steps.
The table can also contain additional information passed in from additional files. For example, the patient ID, time of sampling, and patient status (cancer or non-cancer) can be added for each sample. Additional patient parameters that can be used for prognosis can be added to the table as well. For example, it is known that the serum tumor marker CA125 is useful for early detection of ovarian cancer [2], also in combination with mass spectra data [1].
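The table layout described above can be pictured with a small sketch. The field names (`patient_id`, `months_before_diagnosis`, `status`, `ca125`), the sample IDs, and all values are hypothetical illustrations; this is not the actual SMS output format.

```python
# Hypothetical per-sample peak intensities, keyed by sample ID.
# Inner keys are peak-group indices; a missing key means "peak not detected".
intensities = {
    "S1": {5: 12.4, 17: 3.1},
    "S2": {5: 10.9, 17: 8.7, 20: 4.2},
}

# Hypothetical additional information, as if read from a separate file.
metadata = {
    "S1": {"patient_id": "P1", "months_before_diagnosis": None,
           "status": 0, "ca125": 14.0},
    "S2": {"patient_id": "P2", "months_before_diagnosis": 3,
           "status": 1, "ca125": 65.0},
}


def build_table(intensities, metadata):
    """Return one row (dict) per sample: metadata plus one column per peak group."""
    groups = sorted({g for peaks in intensities.values() for g in peaks})
    rows = []
    for sample_id in sorted(intensities):
        row = {"sample_id": sample_id}
        row.update(metadata.get(sample_id, {}))
        for g in groups:
            # 0.0 stands in for "peak not detected" in this sketch.
            row[f"peak_{g}"] = intensities[sample_id].get(g, 0.0)
        rows.append(row)
    return rows


table = build_table(intensities, metadata)
```

Each row then carries both the spectral columns and the patient parameters, so a classifier can pick any subset of columns as features.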
In this example we demonstrate that mass spectra data, together with the CA125 level, can be used to classify MS samples as cancer or non-cancer with high accuracy. We used data from the paper of Gammerman et al. [1]. These data are a set of mass spectra for serum samples taken from patients up to 7 years prior to cancer detection (18 patients, 75 samples). The data also contain 154 control samples taken from healthy women.
In this work we treated the control samples as a general pool of healthy people; we did not take into account the time at which a control sample was taken.
The idea was that before cancer develops, the tumor markers (CA125 and MS peak intensity) have the same values as in healthy controls. As the cancer progresses, the marker level increases and can deviate significantly from that of control samples. This deviation can serve as an early indicator of cancer.
We tested the hypothesis that a linear combination of the CA125 level and peak intensities from MS data can separate serum samples of cancer patients from non-cancer control samples.
To solve this task we applied Linear Discriminant Analysis (LDA). It is used in statistics and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects. The resulting combination may be used as a linear classifier. In our case we have two classes of samples: cancer (class 1) and control (class 0). To find the linear classifier we used patient samples taken no later than 1 month and no earlier than 6 months before diagnosis (17 samples from class 1, 154 from class 0).
We used two features to build the classifier: the logarithm of the CA125 tumor marker level and the logarithm of the MS signal intensity for one peak group detected by the MS analysis.
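A two-feature, two-class discriminant of this kind can be sketched as follows. This is a minimal illustration of Fisher's linear discriminant, not the SMS implementation: it assumes equal class priors, a threshold at the midpoint between the class means, and a sign convention in which a positive score favours class 1. The feature values are randomly generated stand-ins, not the data of [1].

```python
import random


def fit_lda(class0, class1):
    """Fit a two-class Fisher linear discriminant on 2-D feature vectors.

    Returns (w, b) such that ldf(x) = w[0]*x[0] + w[1]*x[1] + b,
    where ldf > 0 favours class 1 (assumed convention, equal priors).
    """
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    m0, m1 = mean(class0), mean(class1)

    # Pooled within-class covariance matrix (2x2, symmetric).
    sxx = sxy = syy = 0.0
    for rows, m in ((class0, m0), (class1, m1)):
        for x, y in rows:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(class0) + len(class1) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof

    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    det = sxx * syy - sxy * sxy
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    w = [(syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det]

    # Place the decision threshold at the midpoint between the class means.
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def ldf(w, b, x):
    """Linear discriminant function value for one sample x = (f1, f2)."""
    return w[0] * x[0] + w[1] * x[1] + b


# Made-up features: (log CA125, log peak intensity) for each sample.
random.seed(0)
control = [(random.gauss(2.7, 0.4), random.gauss(1.5, 0.5)) for _ in range(154)]
cancer = [(random.gauss(4.2, 0.6), random.gauss(2.4, 0.5)) for _ in range(17)]

w, b = fit_lda(control, cancer)
predicted = [1 if ldf(w, b, x) > 0 else 0 for x in control + cancer]
```

With strongly unbalanced classes such as 154 versus 17 samples, the choice of priors and threshold affects the trade-off between missed cancer samples and false alarms, so the midpoint threshold used here is only one possible choice.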
The MS data analysis found 374 peak groups across all the samples analyzed.
Some of them were poorly represented in the dataset, while others were highly populated. The top 20 most populated peak groups are listed below. NumPeaks is the number of samples in which the peak intensity is significantly higher than the neighbouring background signal.
GroupIndex | PeakID | HighMass | MeanMass | MinMass | MaxMass | NumPeaks | MaxIntensity |
5 | 5 | 3191.424 | 3191.554 | 3188.161 | 3193.358 | 211 | 45.57914 |
20 | 20 | 1770.354 | 1770.479 | 1769.719 | 1772.318 | 195 | 29.40414 |
18 | 18 | 2009.833 | 2009.877 | 2009.076 | 2012.017 | 193 | 30.74098 |
24 | 24 | 825.4889 | 825.7725 | 825.2985 | 826.2407 | 189 | 26.30554 |
42 | 42 | 3332.855 | 3333.192 | 3329.355 | 3334.906 | 184 | 19.21943 |
2 | 2 | 2026.94 | 2026.901 | 2025.914 | 2029.441 | 177 | 53.74245 |
37 | 37 | 2266.585 | 2267.009 | 2266.025 | 2268.258 | 177 | 20.36678 |
17 | 17 | 2985.269 | 2985.741 | 2983.11 | 2989.592 | 167 | 31.3781 |
90 | 90 | 2552.564 | 2552.984 | 2551.655 | 2554.576 | 157 | 11.41295 |
8 | 8 | 1894.82 | 1894.954 | 1894.057 | 1896.423 | 147 | 42.05795 |
78 | 78 | 2114.38 | 2114.491 | 2111.304 | 2116.45 | 147 | 13.52459 |
7 | 7 | 1863.57 | 1863.654 | 1862.77 | 1864.733 | 144 | 42.50182 |
10 | 10 | 1449.601 | 1449.102 | 1448.24 | 1451.12 | 136 | 35.4617 |
56 | 56 | 1584.514 | 1584.659 | 1582.731 | 1586.55 | 133 | 15.79827 |
55 | 55 | 2566.78 | 2567.124 | 2563.25 | 2568.585 | 132 | 16.01036 |
23 | 23 | 945.0417 | 944.728 | 944.0944 | 945.2638 | 130 | 27.91649 |
3 | 3 | 2647.767 | 2647.657 | 2646.315 | 2648.923 | 126 | 50.25482 |
6 | 6 | 6644.013 | 6647.589 | 6635.569 | 6651.674 | 121 | 44.14933 |
12 | 12 | 1395.262 | 1395.111 | 1394.238 | 1397.255 | 120 | 34.98699 |
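Ranking peak groups by how many samples contain them can be sketched as below. The sketch copies a few rows from the table above, reduced to three columns (GroupIndex, MeanMass, NumPeaks); the function names are hypothetical and not part of the SMS package.

```python
# A few rows from the table above: GroupIndex | MeanMass | NumPeaks.
rows_text = """\
17 | 2985.741 | 167
5 | 3191.554 | 211
20 | 1770.479 | 195
90 | 2552.984 | 157
"""


def parse_rows(text):
    """Parse 'GroupIndex | MeanMass | NumPeaks' lines into a list of dicts."""
    groups = []
    for line in text.strip().splitlines():
        group, mean_mass, num_peaks = (cell.strip() for cell in line.split("|"))
        groups.append({
            "group": int(group),
            "mean_mass": float(mean_mass),
            "num_peaks": int(num_peaks),
        })
    return groups


def most_populated(groups, n):
    """Return the n peak groups detected in the largest number of samples."""
    return sorted(groups, key=lambda g: g["num_peaks"], reverse=True)[:n]


top = most_populated(parse_rows(rows_text), 2)
```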
We examined each of the 20 peak groups from the table above in combination with the CA125 level to build a linear classifier, and selected the peak group that gave the best prediction performance.
The best overall performance was achieved for the combination of CA125 and peak group 17 (min mass 2983.11, max mass 2989.592).
Prediction results for this peak group are shown below:
Number of samples=171 (control(0)=154;disease(1)=17)
Class0 (control) (num/fract)=24/0.140351; mean_score=4.152928
Class1 (disease ) (num/fract)=147/0.859649; mean_score=-5.309040
Test results:
Fraction of true predictions: 0.959064[164]
Class 0:
  Fraction of true positives : 0.954545[147]
  Fraction of false negatives : 0.045455[7]
Class 1:
  Fraction of true positives : 1.000000[17]
  Fraction of false negatives : 0.000000[0]
The overall fraction of true predictions is 0.959064. Notably, this classifier did not misclassify any cancer sample (17 true positives out of 17). This is useful because no cancer sample in this dataset is missed by the analysis.
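The reported fractions follow directly from the counts in the output above, as this short check illustrates. The label lists are rebuilt from those counts (147 of 154 control samples and 17 of 17 cancer samples classified correctly); the function name is hypothetical.

```python
def fractions(true_labels, predicted_labels):
    """Overall accuracy and, per class, the fraction predicted correctly."""
    pairs = list(zip(true_labels, predicted_labels))
    result = {"overall": sum(t == p for t, p in pairs) / len(pairs)}
    for cls in sorted(set(true_labels)):
        preds = [p for t, p in pairs if t == cls]
        result[cls] = sum(p == cls for p in preds) / len(preds)
    return result


# Labels rebuilt from the counts reported in the program output.
true_labels = [0] * 154 + [1] * 17
predicted_labels = [0] * 147 + [1] * 7 + [1] * 17

report = fractions(true_labels, predicted_labels)
print(f"overall: {report['overall']:.6f}")  # 164 / 171
print(f"class 0: {report[0]:.6f}")          # 147 / 154
print(f"class 1: {report[1]:.6f}")          # 17 / 17
```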
The change of the linear discriminant function (LDF) for this classifier (CA125 + intensity of peak group 17) is shown in the figure below for the samples of each cancer patient at all times before diagnosis (including time=0, the time of diagnosis).
The X axis shows the time before diagnosis; the Y axis shows the LDF value. For most samples the LDF value exceeds zero within the 10 months before diagnosis. Five samples show no increase of LDF values in this period (for these, only two sampling time points are available: time=0 and time > 6 months). One patient (ID 3480) has an LDF value greater than zero over the whole period. Thus, positive LDF values based on CA125 and the MS peak intensity in the mass range [2983.0, 2989.6] can be used as ovarian cancer (OC) markers for prognosis within 6 months.
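Reading the lead time off such per-patient LDF curves can be sketched as follows. The patient labels and (time, LDF) values are invented for illustration, and the rule used here (the earliest sample with LDF > 0, regardless of later samples) is a simplifying assumption rather than the procedure used for the figure.

```python
def earliest_positive(series):
    """Largest 'months before diagnosis' at which LDF > 0, or None if never."""
    positive = [months for months, value in series if value > 0]
    return max(positive) if positive else None


# Hypothetical (months_before_diagnosis, LDF value) pairs per patient.
patients = {
    "A": [(24, -3.1), (9, 0.4), (4, 2.2), (0, 5.0)],
    "B": [(14, -2.0), (0, 3.3)],
}

lead_times = {pid: earliest_positive(s) for pid, s in patients.items()}
```

A patient like "B", sampled only long before diagnosis and at diagnosis, cannot show an early positive value, which mirrors the sparsely sampled cases noted above.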
[1] Gammerman et al., The Computer Journal (2008).
[2] Menon et al., J. Clin. Oncol. (2007), 23, 7919-7926.