Definitions
Spectroscopy studies the interaction between light and matter, and when analyzing this type of data, it is all about change. Spectral data appear as an accumulation of peaks that vary in shape, symmetry, intensity, position, and convolution. The most subtle changes can indicate significant differences in the properties of the measured material, while large changes may mean little (Vickerman & Gilmore, 2009). Thus, studying what these changes mean and how they occur is paramount. This also holds when using machine learning (ML) for spectroscopic data: change in spectral features remains the most important aspect. Here we define four techniques included in pudu that evaluate ML models in terms of change, aiming to help scientists further analyze their spectroscopic data beyond the ML results.
Importance
Importance quantifies the relevance of each feature according to the change in the prediction produced by defined, sequential perturbations of the features. Thus, Importance is measured as a difference in probability (classification) or in target value (regression).
Formally, let \(x \in X\) be a 2-d array of dimensions \(h \times w\). Let \(P_M\) be the probability function of the model \(M\), so that \(P_M(x)\) is the probability of \(x\) belonging to a given class according to the problem solved by \(M\). Considering \(j \in J\), the feature in position \((h_j, w_j)\) of \(x\), the local importance (\(LI\)) of feature \(j\) is defined as:

\[ LI_j = P_M(x_j) - P_M(x) \]
Where \(x_j\) is the result of applying \(R\), a function of local perturbation, to feature \(j\) of \(x\). Then, the relative importance (\(RI\)) can be denoted as:

\[ RI_j = \frac{LI_j}{\max_{i \in J}\left(\left|LI_i\right|\right)} \]
Where \(LI\) contains all the \(LI_j\) of sample \(x\). In short, Importance is the difference in a model's classification probability (or predicted target value) according to change in the features.
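To make this concrete, the following is a minimal NumPy sketch of the computation (not pudu's actual API): it assumes a generic classifier wrapper `predict_proba` and uses a simple additive offset as the perturbation \(R\).

```python
import numpy as np

def local_importance(x, predict_proba, target, delta=0.1):
    """LI_j for every feature j of a 2-d input x: perturb one feature at a
    time (here a simple additive offset as R) and record the change in the
    predicted probability of the target class."""
    base = predict_proba(x)[target]                      # P_M(x)
    li = np.zeros_like(x, dtype=float)
    for j in np.ndindex(x.shape):                        # j = (h_j, w_j)
        x_j = x.copy()
        x_j[j] += delta                                  # R applied to feature j only
        li[j] = predict_proba(x_j)[target] - base        # LI_j = P_M(x_j) - P_M(x)
    return li

def relative_importance(li):
    """RI: local importances normalized by the largest absolute LI of the sample."""
    return li / np.max(np.abs(li))
```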
Speed
Speed quantifies how fast a prediction changes according to perturbations in the features. For this, the Importance is calculated at different perturbation levels, a line is fitted to the obtained values, and its slope is extracted as the Speed value.
This is better defined by considering states of \(R\) with different set parameters \(R_1, R_2, \ldots\). As with Importance for \(x\), the local importance of feature \(j\) under the different perturbations would be \(LI_{j,1}, LI_{j,2}, \ldots\). Then, Speed is the slope of a linear fit through the points \(\left( 1, LI_{j,1} \right), \left( 2, LI_{j,2} \right), \ldots\)
Then, the Speed is how fast the Importance changes according to change in the feature, or in other words, how sensitive the feature is. Speed can take positive or negative values depending on the slope: a positive value means that a larger perturbation produces a larger change in the prediction, while a negative value means that larger perturbations produce smaller changes in the prediction.
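As a rough sketch of Speed, the local-importance routine above can be reused at several perturbation magnitudes and a line fitted to the results; the perturbation levels below are arbitrary example values.

```python
import numpy as np

def speed(x, predict_proba, target, deltas=(0.1, 0.2, 0.3)):
    """Speed of every feature: slope of LI_j versus the perturbation state
    index 1, 2, ..., obtained from a first-degree polynomial fit."""
    # LI maps for each perturbation state R_1, R_2, ...
    lis = np.stack([local_importance(x, predict_proba, target, d) for d in deltas])
    states = np.arange(1, len(deltas) + 1)
    # fit all features at once by flattening each LI map
    slope = np.polyfit(states, lis.reshape(len(deltas), -1), deg=1)[0]
    return slope.reshape(x.shape)
```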
Synergy
Peaks in spectral data can change at the same time as other peaks, but their relationship can be difficult to pinpoint and understand, especially in more complex mixtures and materials. Synergy helps to explore these relationships of change by simultaneously perturbing pairs of areas of interest.
For this, consider a feature \(j* \in J\) and a distinct feature \(j \in J\) of \(x\). Both are perturbed under \(R\), obtaining \(x_{j*,j}\), and the corresponding local importance \(LI_{j*,j} = P_M(x_{j*,j}) - P_M(x)\) is computed for every such pair.
The synergy then indicates how features complement each other in terms of change and the effect on the prediction.
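A minimal sketch of this pairing, under the same assumptions as the Importance sketch above (a `predict_proba` wrapper and an additive perturbation):

```python
import numpy as np

def synergy(x, predict_proba, target, j_star, delta=0.1):
    """LI_{j*,j} for a fixed anchor feature j* paired with every other
    feature j: both are perturbed simultaneously before predicting."""
    base = predict_proba(x)[target]                      # P_M(x)
    syn = np.zeros_like(x, dtype=float)
    for j in np.ndindex(x.shape):
        if j == j_star:
            continue
        x_pair = x.copy()
        x_pair[j_star] += delta                          # perturb the anchor feature j*
        x_pair[j] += delta                               # ...and feature j at the same time
        syn[j] = predict_proba(x_pair)[target] - base    # LI_{j*,j}
    return syn
```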
Activations and re-activation
Convolutional Neural Networks (CNNs) can have highly complex structures. As such, understanding how the final form of a CNN relates to the input data can certainly be challenging, but if done correctly it can yield great benefits, as shown by Bau et al. (2019). Re-activation attempts to evaluate this structure in terms of change, and thus better understand how spectral characteristics affect the final shape of such networks. To do so, consider the following definitions:
Units
In a convolutional layer \(l \in L\), where \(L\) is the group of all convolutional layers in the model \(M\), the number of units \(K\) is defined by the size of the input \((h, w)\), the kernel size \((k_h, k_w)\), the strides \((s_h, s_w)\), and the number of filters \(f\). Specifically, the number of units can be calculated as:

\[ K = H_o \times W_o \times f, \quad H_o = \left\lfloor \frac{h - k_h}{s_h} \right\rfloor + 1, \quad W_o = \left\lfloor \frac{w - k_w}{s_w} \right\rfloor + 1 \]
Where \((H_o, W_o)\) are the dimensions of the output of layer \(l\).
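As a quick check of this count, the sketch below computes \(K\) for an unpadded convolution; the layer parameters are arbitrary example values, not taken from any particular model.

```python
def n_units(h, w, k_h, k_w, s_h, s_w, f):
    """Number of units K of a convolutional layer with input (h, w), kernel
    (k_h, k_w), strides (s_h, s_w) and f filters, assuming no padding."""
    h_o = (h - k_h) // s_h + 1                           # H_o
    w_o = (w - k_w) // s_w + 1                           # W_o
    return h_o * w_o * f                                 # K = H_o * W_o * f

# e.g. a 1 x 1000 spectrum, a 1 x 9 kernel, stride 1, and 16 filters
print(n_units(1, 1000, 1, 9, 1, 1, 16))                  # 1 * 992 * 16 = 15872
```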
Activation map
As defined by Bau et al. (2017), for \(x\), take the activation map \(A_k(x)\) of each unit \(k\). Then \(a_k\) is the activation distribution of each individual unit over \(X_s \subset X\), where \(X_s\) is a subset of all samples \(X\). From this distribution, the top quantile \(p\) is selected as \(P(a_k > T_k) = p\), where \(T_k\) is the threshold above which that quantile lies.
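For illustration, the threshold \(T_k\) can be obtained with a simple quantile computation over the collected activations; the value of \(p\) below is only an example.

```python
import numpy as np

def activation_threshold(a_k, p=0.005):
    """T_k such that P(a_k > T_k) = p, given the activation distribution a_k
    of unit k collected over the subset X_s."""
    return np.quantile(a_k, 1.0 - p)

def activated(a_k, t_k):
    """Activations above T_k are the ones considered 'activated' for unit k."""
    return a_k > t_k
```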
Re-activation map
The above can be evaluated in terms of feature perturbations by considering \(x\), the original input, and \(x_j\), the input perturbed in feature \(j\), and evaluating the difference \(B_k(x_j) - B_k(x) = \Delta B_{k,j}, \forall j \in J\), where \(B_k\) is the pre-activation map of unit \(k\). From here we can extract the distribution \(\Delta b_{k}\) and pass it through the activation function to obtain \(\Delta a_{k}\). Finally, the quantile \(p\) is selected as \(P(\Delta a_{k} > T_k) = p\). In this case, \(X_s\) is the set of perturbed samples derived from \(x\).
The latter accounts for differences in unit activations after perturbation, that is, re-activations. For example, if unit \(k\) has an activation value of \(u\), and after perturbation the same unit \(k\) obtains a value \(u^* = u\), then \(\Delta u = 0\) and the unit is not re-activated, considering an activation function such as \(ReLU\) or \(LeakyReLU\). In other words, this looks for significant changes in the activation map according to change in the features, meaning values that would be considered an activation in \(A_k(x)\).
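The evaluation can be sketched as follows; `pre_activation_maps` is a placeholder for whatever routine extracts the \(B_k\) maps from the trained CNN (e.g. an intermediate-layer model) and is an assumption, as is the use of \(ReLU\).

```python
import numpy as np

def reactivation_matrix(x, perturbed_samples, pre_activation_maps, p=0.005):
    """Boolean matrix where entry (j, k) is True if perturbing feature j
    re-activates unit k, i.e. pushes its change above the threshold T_k.

    pre_activation_maps(s) is assumed to return the pre-activation maps B_k
    of all units, with shape (n_units, ...)."""
    base = pre_activation_maps(x)
    # Delta B_k for every perturbed sample x_j, reduced to one value per unit
    delta_b = np.stack([pre_activation_maps(s) - base for s in perturbed_samples])
    delta_b = delta_b.reshape(len(perturbed_samples), delta_b.shape[1], -1).max(axis=2)
    delta_a = np.maximum(delta_b, 0.0)                   # ReLU -> Delta a_k
    t_k = np.quantile(delta_a, 1.0 - p, axis=0)          # P(Delta a_k > T_k) = p over X_s
    hits = delta_a > t_k                                 # (n_perturbations, n_units)
    # hits.sum(axis=1): units re-activated per feature; hits.sum(axis=0): counts per unit
    return hits
```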
With this, it is possible to obtain the following information:
How many units are re-activated, per unit of change
What feature produces the most unit re-activations, per unit of change
What unit is re-activated the most, per unit of change
Which feature re-activates which unit the most times \(^1\)
\(^1\) Though this can be obtained for a single sample \(x\), it is better to use a significant number of samples.
References
Vickerman, J. C. & Gilmore, I. S. Surface Analysis: The Principal Techniques. 2nd Edition, Wiley (2009).
Bau, D., Zhu, J.-Y., Strobelt, H., Zhou, B., Tenenbaum, J. B., Freeman, W. T. & Torralba, A. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. 7th Int. Conf. Learn. Represent. (ICLR) (2019).
Bau, D., Zhou, B., Khosla, A., Oliva, A. & Torralba, A. Network Dissection: Quantifying Interpretability of Deep Visual Representations. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) (2017).