tidyms.filter.BatchCorrector

class BatchCorrector(min_qc_dr: float = 0.9, first_n_qc: int | None = None, threshold: float = 0, frac: float | None = None, interpolator: str = 'splines', method: str = 'multiplicative', corrector_classes: List[str] | None = None, process_classes: List[str] | None = None, verbose: bool = False)

Correct time-dependent systematic bias along samples due to variation in instrumental response.

Parameters:

min_qc_dr (float): Minimum fraction of QC blocks in which the feature was detected. See the notes for an explanation of how this value is computed.
first_n_qc (int, optional): Number of QC samples, counted from the start of the batch, used to estimate the expected value for each feature in the QC. If None, uses all QC samples in a batch. See the notes for an explanation of its use.
threshold (float): Minimum value to consider a feature detected. Used to compute the detection rate of each feature.
frac (float, optional): frac parameter of the LOESS model. If None, the best value for each feature is estimated using LOOCV.
interpolator ({"splines", "linear"}): Interpolator used to estimate the correction for each sample.
method ({"additive", "multiplicative"}): Method used to model the variation in samples.
corrector_classes (list[str], optional): List of classes used to generate the correction. If None, uses the QC sample types from the mapping.
process_classes (list[str], optional): List of classes to correct. If None, uses the study sample types from the mapping.
verbose (bool): If True, a message is shown after processing the data matrix.

Notes

The correction is applied as described by Broadhurst et al. in [1]. Using QC samples, a correction is generated for each feature as follows. The signal of a feature is modeled as the sum of three components: an expected value \(\bar{m_{k}}\), a systematic bias \(f_{k}\) and an error term \(\epsilon\):

\[m_{jk} = \bar{m_{k}} + f_{k}(t_{j}) + \epsilon\]

where \(m_{jk}\) is the element in the j-th row and k-th column of the data matrix and \(t_{j}\) is the analysis time (run order) of the j-th sample.

First, \(\bar{m_{k}}\) is subtracted from the detected values and then \(f_{k}\) is estimated using locally weighted scatterplot smoothing (LOESS). When frac is None, the optimal fraction of samples for each feature is obtained using leave-one-out cross-validation (LOOCV).
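The estimation step can be illustrated with a small pure-Python sketch. It is not the tidyms implementation: the helper names (`tricube`, `loess_predict`, `loocv_error`, `best_frac`), the candidate grid for the fraction, and the QC values are all made up for illustration, and the local fit is a plain tricube-weighted straight line without robustness iterations.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_predict(x, y, x0, frac):
    """Predict y at x0 from a locally weighted straight-line fit."""
    n = len(x)
    if n < 3:
        raise ValueError("at least three points are needed")
    k = min(n, max(3, round(frac * n)))  # number of points in the local window
    dists = sorted(abs(xi - x0) for xi in x)
    h = dists[k - 1] or 1.0  # window half-width: distance to the k-th neighbour
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 1e-12 else 0.0  # fall back to the weighted mean
    return ym + slope * (x0 - xm)


def loocv_error(x, y, frac):
    """Mean squared leave-one-out prediction error for a given frac."""
    err = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        err += (loess_predict(x_train, y_train, x[i], frac) - y[i]) ** 2
    return err / len(x)


def best_frac(x, y, candidates=(0.5, 0.75, 1.0)):
    """Pick the candidate frac with the lowest LOOCV error (hypothetical grid)."""
    return min(candidates, key=lambda f: loocv_error(x, y, f))


# Made-up QC run orders and intensities for one feature with upward drift.
t_qc = [1, 2, 3, 8, 13, 18, 23, 24, 25]
m_qc = [100.0, 101.0, 99.5, 104.0, 108.5, 112.0, 117.0, 118.5, 117.5]

m_bar = sum(m_qc) / len(m_qc)            # expected value of the feature in the QC
residual = [m - m_bar for m in m_qc]     # m_jk minus the expected value
frac = best_frac(t_qc, residual)         # LOOCV choice of the LOESS fraction
f_hat = [loess_predict(t_qc, residual, t, frac) for t in t_qc]  # estimated bias
```

Here `f_hat` plays the role of \(f_{k}(t_{j})\) evaluated at the QC run orders. A production implementation would normally rely on an established LOESS routine rather than this hand-written one.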

In order to apply this correction, several checks need to be made. First, the QC template is checked and samples that cannot be corrected are removed. A study sample is valid if it is surrounded by QC samples. This is a necessary step because the correction for the study samples is built using interpolation. It is recommended to have three QC samples at the beginning and at the end of each batch. See [1] for recommendations on analytical batch templates.
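The template check and the interpolation step can be sketched as follows. This is only an illustration under stated assumptions: "surrounded by QC samples" is read here as "analyzed between the first and the last QC sample of the batch", the function names are hypothetical, and only the linear interpolator is shown.

```python
def surrounded_by_qc(sample_types, qc_label="Q"):
    """Flag each run that lies between the first and the last QC run."""
    qc_idx = [i for i, s in enumerate(sample_types) if s == qc_label]
    if not qc_idx:
        return [False] * len(sample_types)
    first, last = qc_idx[0], qc_idx[-1]
    return [first <= i <= last for i in range(len(sample_types))]


def interpolate_linear(t_qc, f_qc, t):
    """Linearly interpolate the QC correction at run order t.

    Assumes t_qc is strictly increasing.
    """
    points = list(zip(t_qc, f_qc))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    raise ValueError("run order outside the QC range; cannot interpolate")


# Made-up batch: the first and last study samples are not bracketed by QCs.
sample_types = list("SQSSQSSQS")
valid = surrounded_by_qc(sample_types)

# Made-up bias estimates at the QC run orders (1-based).
t_qc = [2, 5, 8]
f_qc = [0.0, 3.0, 4.5]

# Correction for every valid study sample, keyed by run order.
correction = {
    t: interpolate_linear(t_qc, f_qc, t)
    for t, (s, ok) in enumerate(zip(sample_types, valid), start=1)
    if s == "S" and ok
}
```

The study samples at run orders 1 and 9 receive no correction because no QC sample brackets them, which is why such samples are removed before the correction is applied.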

After checking the QC template, each feature is checked to see if the minimum number of QC samples necessary to perform LOESS is available. This step is done by grouping consecutive samples of the same type into blocks: a QC block is a set of consecutive QC samples. A feature is detected in a block if it was detected in at least one sample in the block. For example, in an analytical batch:

Run order	Sample type	Block number
1, 2, 3	Q, Q, Q	1 (start)
4, 5, 6, 7	S, S, S, S	2 (middle)
8	Q	3
9, 10, 11, 12	S, S, S, S	4
13	Q	5
14, 15, 16, 17	S, S, S, S	6
18	Q	7
19, 20, 21, 22	S, S, S, S	8
23, 24, 25	Q, Q, Q	9 (end)
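The block numbering in the table above can be reproduced with a short sketch. The function name `assign_blocks` is hypothetical and the code only illustrates the grouping rule (consecutive samples of the same type form one block); it is not taken from the library.

```python
from itertools import groupby


def assign_blocks(sample_types):
    """Group consecutive runs of the same sample type into numbered blocks.

    Returns a list of (block number, sample type, block size) tuples.
    """
    blocks = []
    for number, (label, run) in enumerate(groupby(sample_types), start=1):
        blocks.append((number, label, len(list(run))))
    return blocks


# Same sample type sequence as in the table: Q = QC sample, S = study sample.
batch = list("QQQ" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "QQQ")
blocks = assign_blocks(batch)
qc_blocks = [b for b in blocks if b[1] == "Q"]  # blocks 1, 3, 5, 7 and 9
```

In this layout five of the nine blocks are QC blocks, and the first and last of them are the start and end blocks.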

Detection is evaluated on each block by comparing its samples against the threshold value. The fraction of QC blocks where the feature was detected is the detection rate in the QC samples. A feature is removed if:

The detection rate is lower than the min_qc_dr parameter (this parameter is adjusted so that the minimum number of QC samples is always greater than or equal to 4).
The feature is not detected in the start or end block.
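The detection rate and the two removal criteria can be sketched as below. Several details are assumptions made for the example: a value counts as detected when it is strictly greater than the threshold, the function names are hypothetical, and the adjustment of min_qc_dr that guarantees at least 4 QC samples is left out.

```python
from itertools import groupby


def qc_block_detection(sample_types, values, threshold=0.0, qc_label="Q"):
    """Return one flag per QC block: True if any sample in it is detected."""
    detected = []
    pos = 0
    for label, run in groupby(sample_types):
        size = len(list(run))
        if label == qc_label:
            block = values[pos:pos + size]
            detected.append(any(v > threshold for v in block))
        pos += size
    return detected


def keep_feature(detected, min_qc_dr=0.9):
    """Keep a feature only if it passes both criteria described above."""
    if not detected:
        return False
    rate = sum(detected) / len(detected)  # detection rate over QC blocks
    return rate >= min_qc_dr and detected[0] and detected[-1]


batch = list("QQQ" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "QQQ")

# Made-up intensities: detected everywhere.
always = [10.0] * len(batch)

# Made-up intensities: missing in one middle QC block (rate 4/5 = 0.8).
one_gap = list(always)
one_gap[7] = 0.0

# Made-up intensities: missing in the whole end block.
no_end = list(always)
no_end[-3:] = [0.0, 0.0, 0.0]

kept = {
    "always": keep_feature(qc_block_detection(batch, always)),
    "one_gap": keep_feature(qc_block_detection(batch, one_gap)),
    "no_end": keep_feature(qc_block_detection(batch, no_end)),
}
```

With the default min_qc_dr of 0.9, only the first feature is kept: the second fails the detection rate criterion and the third fails the end block criterion.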

After these two checks, the remaining samples and features are suitable for LOESS batch correction. A final consideration is how to estimate \(\bar{m_{k}}\) for each feature. This value is usually computed as the mean or median of the QC values in a batch, but if the temporal bias becomes stronger as more samples are analyzed, a better estimate of \(\bar{m_{k}}\) can be obtained using the average of the first samples analyzed in a batch. To this end, the first_n_qc parameter controls how many QC samples are used to estimate the expected values in the QC samples.
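The effect of first_n_qc can be seen in a minimal sketch. The function name and the QC values are made up, and the mean is used as the estimator purely for illustration.

```python
def expected_qc_value(qc_values, first_n_qc=None):
    """Mean of the first `first_n_qc` QC values, or of all of them if None."""
    used = qc_values if first_n_qc is None else qc_values[:first_n_qc]
    return sum(used) / len(used)


# Made-up QC intensities in run order, with a bias that grows over the batch.
qc_values = [100.0, 102.0, 101.0, 110.0, 120.0, 130.0]

all_qc = expected_qc_value(qc_values)                   # uses every QC sample
first_three = expected_qc_value(qc_values, first_n_qc=3)  # uses the first three
```

With these values, the estimate from all QC samples (110.5) is pulled upward by the drift, while the estimate from the first three QC samples (101.0) stays close to the signal at the start of the batch.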

References

[1] D. Broadhurst et al., "Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies", Metabolomics, 2018;14(6):72. doi: 10.1007/s11306-018-1367-3