Feature extraction in Raman spectral data analysis
A Raman-spectroscopy-based classification can fail to learn the pattern of the inter-group differences because of intra-group spectral variations. A natural solution is a well-designed feature extraction method that separates the intra-group variations from the inter-group differences; the classifier is then built on the latter information only. To this end, we modified two commonly applied feature extraction approaches, principal component analysis (PCA) and partial least squares (PLS), so that the extracted features represent only the inter-group differences.
By Shuxia Guo // Petra Rösch // Jürgen Popp // Thomas Bocklitz
Over the last two decades, Raman spectroscopy has seen dramatic growth in biological applications, including but not limited to toxicology and forensics, microbiology, drug discovery, metabolic investigations, and even in vivo detection. This growth has largely benefited from the development of chemometrics, in which Raman signals are translated into high-level chemical and/or biological information of interest, mostly by means of a classification model. However, Raman-based classification is often degraded by intra-group spectral variations, which can easily overwhelm the subtle inter-group spectral differences of interest. To deal with this issue and hence improve classification performance, we modified two commonly applied feature extraction methods, PCA and PLS, so that the extracted features represent only the inter-group differences. The details can be found in the corresponding publication. Herein the methods are briefly summarized, along with their results on two Raman spectral datasets measured from biological samples.
The idea of the modified PCA (mPCA) is as follows.
∑x, ∑sub, and ∑’x represent the overall covariance, the intra-group covariance, and the inter-group covariance, respectively. Two further covariance matrices enter the calculation: that of the mean spectra of each replicate belonging to the gth group, and that of the ith replicate belonging to the gth group. ∑bg is the covariance matrix of the mean spectra of each group. By subtracting ∑sub from ∑x, the principal components V resulting from the singular value decomposition (Eq. (1)) are expected to capture only the inter-group differences.
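The subtraction step can be illustrated with a minimal NumPy sketch. It is not the published implementation: the function name mpca and the synthetic data are hypothetical, and ∑sub is simplified here to the pooled covariance of the spectra around their own group means, without the replicate-level terms described above.

```python
import numpy as np

def mpca(X, groups, n_components=2):
    """Sketch of modified PCA: decompose (overall - intra-group) covariance."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n = X.shape[0]

    # Overall covariance of the mean-centered spectra.
    Xc = X - X.mean(axis=0)
    cov_x = Xc.T @ Xc / n

    # Assumption: intra-group covariance = pooled covariance of the spectra
    # around their own group mean (replicate structure is ignored here).
    cov_sub = np.zeros_like(cov_x)
    for g in np.unique(groups):
        Xg = X[groups == g]
        Xg = Xg - Xg.mean(axis=0)
        cov_sub += Xg.T @ Xg / n

    # Inter-group covariance; it is symmetric, so an eigendecomposition
    # is used in place of an SVD.
    cov_inter = cov_x - cov_sub
    eigval, eigvec = np.linalg.eigh(cov_inter)
    order = np.argsort(eigval)[::-1][:n_components]
    V = eigvec[:, order]
    return V, Xc @ V  # loadings and scores

# Synthetic example: large intra-group variation along variable 0,
# a smaller inter-group shift along variable 2.
rng = np.random.default_rng(0)
n_per_group = 500
groups = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, 3)) * np.array([3.0, 0.3, 0.3])
X[groups == 1, 2] += 2.0

V, scores = mpca(X, groups)
```

In this synthetic example the first loading vector points along variable 2, where the groups differ, whereas ordinary PCA would pick variable 0, where the intra-group variance is largest. With this simplified ∑sub the difference reduces to the covariance of the group means, so at most (number of groups minus 1) components carry information.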
The modified PLS (mPLS) was obtained by orthogonalizing the projection vectors (P) against Lsub during the iterations of the SIMPLS algorithm. The orthogonalization was done according to Eq. (4), where Lsub denotes the loadings resulting from Eq. (1), which represent the intra-group variations. The resulting projection vectors Pot were used to obtain the score vectors of mPLS.
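As a rough illustration of the orthogonalization step only, the sketch below removes from P every component lying in the column space of Lsub. This is the standard orthogonal projection and is assumed here to play the role of Eq. (4); the function name and the random matrices are hypothetical, and the surrounding SIMPLS iteration is not shown.

```python
import numpy as np

def orthogonalize_against(P, L_sub):
    """Project the columns of P onto the orthogonal complement of span(L_sub)."""
    P = np.asarray(P, dtype=float)
    L_sub = np.asarray(L_sub, dtype=float)
    # Orthonormal basis of the intra-group loading space (reduced QR).
    Q, _ = np.linalg.qr(L_sub)
    # Subtract the part of P that lies in that space.
    return P - Q @ (Q.T @ P)

# Hypothetical sizes: 10 wavenumber channels, 3 projection vectors,
# 2 intra-group loadings.
rng = np.random.default_rng(1)
P = rng.normal(size=(10, 3))
L_sub = rng.normal(size=(10, 2))
P_ot = orthogonalize_against(P, L_sub)
```

After this step every column of P_ot is orthogonal to every column of L_sub, so scores computed from P_ot no longer respond to variation along the intra-group loadings.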
The methods were verified with two Raman spectral datasets, measured from mouse colon tissues and from bacteria, respectively. Their performance was compared with that of their ordinary counterparts in terms of the mean sensitivity of the classification. The mPCA was shown to improve the generalization of the classifier, while the mPLS helps to avoid overfitting. Owing to limited space, only the results for the mice data are described in the following paragraphs.
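For reference, the mean sensitivity can be computed as in the short sketch below. It assumes that the measure is the unweighted average of the class-wise sensitivities (the fraction of spectra of each class that is predicted correctly); the function name and the toy labels are hypothetical.

```python
import numpy as np

def mean_sensitivity(y_true, y_pred):
    """Average of the per-class sensitivities (class-wise recall)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Sensitivity of class c: share of true class-c samples predicted as c.
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Toy labels: class 0 has 2 of 3 correct, class 1 has 2 of 2 correct.
example = mean_sensitivity([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Because each class contributes equally, this measure is not dominated by the larger group when the group sizes are unbalanced.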
The mice data were measured in two settings: fully prepared samples from 47 individuals and biopsy samples from 97 individuals (for details, see the referenced publication). The mean spectra of both abnormal and normal samples are plotted in Figure 1(a-b), along with the variances and the difference spectra. Peaks at which the two groups differ significantly are marked by vertical lines. The score plots in Figure 1(c-d) show that the two groups are hardly distinguishable using the first two components of the ordinary PCA, whereas a much better separation is observed with the modified PCA. That is to say, the inter-group variations are better extracted by the modified PCA.
In addition, we tested the performance of mPCA and mPLS by predicting the biopsy samples with a classifier built on the prepared samples. The mean sensitivity of the prediction is visualized in Figure 2. Although the ordinary and modified PLS gave comparable highest mean sensitivities, the classification was more stable with respect to the number of components (nLV) in the case of mPLS. In contrast, the highest mean sensitivity was clearly improved with mPCA compared with its ordinary counterpart. This demonstrates that the classifier built with mPCA is more tolerant of the differences between the prepared and biopsy samples, i.e., the classification generalizes better.
Figure 1: (a-b) Mean spectra, variances, and difference spectrum of the mice colon data. The normal and abnormal groups are shown in cyan and pink, respectively. (c) The two groups are hardly distinguishable using the first two principal components of the ordinary PCA; (d) the separation is much better with mPCA.
Figure 2: Mean sensitivity for the mice colon dataset with respect to the number of components (nPC/nLV). ‘Biopsy’ denotes the results of prediction on the biopsy samples using the model built on the prepared samples. (a) While the highest mean sensitivity was comparable for ordinary and modified PLS, mPLS improved the stability of the model with respect to the number of components (nLV). (b) The prediction was better with mPCA than with ordinary PCA, indicating improved generalizability of the classification by mPCA.