tidyms.filter¶
Tools to filter and correct DataContainers.
A Filter removes features or samples from a DataContainer according to some criteria. Correctors perform transformations on the data matrix. Each filter and corrector has a default behavior based on recommendations widely accepted by the metabolomics community. Operations on DataContainers are made in place. By default, filters and corrections are built using the information in the DataContainer mapping.
For example, the BlankCorrector generates an estimate of the blank contribution to the signal using the sample classes that are mapped to the “blank” sample type. See each Processor object for detailed information. Processor and Pipeline objects are used in a similar way:
1. Create a filter or Pipeline instance.
2. Use the process method on a DataContainer to process your data.
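The two steps can be illustrated with a toy model of the pattern. This is only a sketch: ToyContainer, DropAllZeroFeatures, ScaleBy and ToyPipeline are hypothetical stand-ins written for this example, not tidyms classes, and the toy pipeline applies its processors one after another in list order, which is an assumption about how a pipeline might behave.

```python
class ToyContainer:
    """Hypothetical stand-in for a DataContainer: feature name -> values."""

    def __init__(self, data):
        self.data = dict(data)


class DropAllZeroFeatures:
    """Toy filter: removes features whose values are all zero (in place)."""

    def process(self, container):
        for name in [n for n, v in container.data.items() if not any(v)]:
            del container.data[name]


class ScaleBy:
    """Toy corrector: multiplies every value by a constant (in place)."""

    def __init__(self, factor):
        self.factor = factor

    def process(self, container):
        for name, values in container.data.items():
            container.data[name] = [x * self.factor for x in values]


class ToyPipeline:
    """Toy pipeline: exposes the same process method as a single processor."""

    def __init__(self, processors):
        self.processors = list(processors)

    def process(self, container):
        # assumption: processors are applied in list order
        for processor in self.processors:
            processor.process(container)


# Step 1: create the instance. Step 2: call process on the container.
container = ToyContainer({"f1": [1.0, 2.0], "f2": [0.0, 0.0]})
pipeline = ToyPipeline([DropAllZeroFeatures(), ScaleBy(2.0)])
result = pipeline.process(container)  # modifies container, returns None
```

Because the operation is made in place, the result is read from the container itself and not from the return value of process.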
Objects¶
Processor : Abstract object used to create custom Filters and Correctors.
DuplicateMerger : Merge duplicate samples.
ClassRemover : Remove samples with a given class name.
BlankCorrector : Corrects the contribution to features originated from sample preparation.
BatchCorrector : Removes time-dependent bias in features.
PrevalenceFilter : Remove features with low detection rate.
VariationFilter : Remove features with high coefficient of variation.
DRatioFilter : Removes features using the D-Ratio.
Pipeline : Combines Processors to apply them simultaneously.
Exceptions¶
MissingMappingInformation : Error raised when there are no sample classes assigned to a sample type.
MissingValueError : Error raised when the data matrix has missing values.
- class BatchCorrector(min_qc_dr: float = 0.9, first_n_qc: int | None = None, threshold: float = 0, frac: float | None = None, interpolator: str = 'splines', method: str = 'multiplicative', corrector_classes: List[str] | None = None, process_classes: List[str] | None = None, verbose: bool = False)¶
Correct time-dependent systematic bias along samples due to variation in instrumental response.
- Parameters:
- min_qc_dr: float
Minimum fraction of QC blocks where the feature was detected. See the notes for an explanation of how this value is computed.
- first_n_qc: int, optional
The number of first QC samples used to estimate the expected value for each feature in the QC. If None, uses all QC samples in a batch. See the notes for an explanation of its use.
- threshold: float
Minimum value to consider a feature detected. Used to compute the detection rate of each feature.
- frac: float, optional
frac parameter of the LOESS model. If None, the best value for each feature is estimated using LOOCV.
- interpolator: {“splines”, “linear”}
Interpolator used to estimate the correction for each sample.
- method: {“additive”, “multiplicative”}
Method used to model the variation in samples.
- corrector_classes: list[str], optional
List of classes used to generate the correction. If None, uses the classes mapped to the qc sample type.
- process_classes: list[str], optional
List of classes to correct. If None, uses the classes mapped to the sample sample type.
- verbose: bool
If True, a message is shown after processing the data matrix.
Notes
The correction is applied as described by Broadhurst in [1]. Using QC samples, a correction is generated for each feature in the following way: the signal of a feature is modeled as three additive components: an expected value \(\bar{m_{k}}\), a systematic bias \(f_{k}\) and an error term \(\epsilon\):
\[m_{jk} = \bar{m_{k}} + f_{k}(t_{j}) + \epsilon\]
where \(m_{jk}\) is the element in the j-th row and k-th column of the data matrix.
First, \(\bar{m_{k}}\) is subtracted from the detected values and then \(f_{k}\) is estimated using locally weighted scatter plot smoothing (LOESS). The optimal fraction of samples for each feature is obtained using Leave One Out Cross Validation (LOOCV).
In order to apply this correction, several checks need to be made. First, the QC template is checked and samples that cannot be corrected are removed. A study sample is valid if it is surrounded by QC samples. This is a necessary step because the correction for the study samples is built using interpolation. It’s recommended to have three QC samples at the beginning and at the end of each batch. See [1] for recommendations on analytical batch templates.
After checking the QC template, each feature is checked to see if the minimum number of QC samples necessary to perform LOESS is available. This step is done by grouping samples of the same type into QC blocks: a QC block is a set of consecutive QC samples. A feature is detected in a block if it was detected in at least one sample in the block. For example, in an analytical batch:
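==============  ===========  ============
Run order       Sample type  Block number
==============  ===========  ============
1, 2, 3         Q, Q, Q      1 (start)
4, 5, 6, 7      S, S, S, S   2 (middle)
8               Q            3
9, 10, 11, 12   S, S, S, S   4
13              Q            5
14, 15, 16, 17  S, S, S, S   6
18              Q            7
19, 20, 21, 22  S, S, S, S   8
23, 24, 25      Q, Q, Q      9 (end)
==============  ===========  ============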
Detection is evaluated on each block, comparing samples against a threshold value. Then the fraction of the blocks where the feature was detected is the detection rate in the QC samples. A feature is removed if:
- The detection rate is lower than the min_qc_dr parameter (this parameter is adjusted so that the minimum number of QC samples is always greater than or equal to 4).
- The feature is not detected in the start or end block.
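The block-based check described above can be sketched in plain Python. This is a minimal illustration and not the library implementation: the function names are hypothetical, only QC blocks are counted when computing the detection rate, a value is taken as detected when it is strictly above the threshold, and the adjustment of min_qc_dr to guarantee at least four QC samples is left out.

```python
from itertools import groupby


def qc_blocks(sample_types, values):
    """Split the values of one feature into blocks of consecutive QC samples."""
    blocks, start = [], 0
    for sample_type, group in groupby(sample_types):
        size = sum(1 for _ in group)
        if sample_type == "Q":
            blocks.append(values[start:start + size])
        start += size
    return blocks


def qc_detection(sample_types, values, threshold=0.0, min_qc_dr=0.9):
    """Return (detection rate in QC blocks, whether the feature is kept)."""
    blocks = qc_blocks(sample_types, values)
    # a block counts as detected if at least one of its samples is detected
    detected = [any(x > threshold for x in block) for block in blocks]
    rate = sum(detected) / len(detected)
    keep = rate >= min_qc_dr and detected[0] and detected[-1]
    return rate, keep


# Q/S layout of the example batch: 3 QC, then (4 study + 1 QC) x 3,
# then 4 study and 3 closing QC samples.
sample_types = list("QQQ" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "Q" + "SSSS" + "QQQ")

# Feature missing in one middle QC block: 4 of 5 QC blocks detected.
values = [5.0] * len(sample_types)
values[12] = 0.0
rate, keep = qc_detection(sample_types, values)
rate_relaxed, keep_relaxed = qc_detection(sample_types, values, min_qc_dr=0.75)

# Feature missing in the end block: removed regardless of min_qc_dr.
values_end = [5.0] * len(sample_types)
values_end[22:] = [0.0, 0.0, 0.0]
rate_end, keep_end = qc_detection(sample_types, values_end, min_qc_dr=0.5)
```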
After these two checks, the remaining samples and features are suitable for LOESS batch correction. A final consideration is how to estimate \(\bar{m_{k}}\) for each feature. This value is usually computed as the mean or median of the QC values in a batch, but if the temporal bias becomes stronger as more samples are analyzed, a better estimate of \(\bar{m_{k}}\) can be obtained using the average of the first samples analyzed in a batch. To this end, the first_n_qc parameter controls how many QC samples are used to estimate the expected values in the QC samples.
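The effect of restricting the estimate to the first QC samples can be seen with a small numeric sketch. The helper expected_qc_value is hypothetical and uses the mean; it only illustrates the idea behind first_n_qc and is not the library code.

```python
from statistics import mean


def expected_qc_value(qc_values, first_n_qc=None):
    """Mean of the QC values in run order, optionally only the first n."""
    if first_n_qc is None:
        return mean(qc_values)
    return mean(qc_values[:first_n_qc])


# QC intensities of one feature in run order, with a downward drift.
qc = [100.0, 98.0, 102.0, 90.0, 80.0, 70.0]
all_qc = expected_qc_value(qc)        # pulled down by the drifted samples
first_three = expected_qc_value(qc, first_n_qc=3)
```

With a drift like this one, the estimate from all QC samples (90.0) sits below the level observed at the start of the batch (100.0).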
References
[1] D. Broadhurst et al., “Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies”, Metabolomics, 2018;14(6):72. doi: 10.1007/s11306-018-1367-3
- class BlankCorrector(corrector_classes: List[str] | None = None, process_classes: List[str] | None = None, mode: str | Callable = 'lod', factor: float = 1, robust: bool = True, process_blanks: bool = True, verbose=False)¶
Corrects systematic bias due to sample preparation.
- Parameters:
- corrector_classes: list[str], optional
Classes used to generate the blank correction. If None, uses the value from blank in the DataContainer mapping attribute.
- process_classes: list[str], optional
Classes to be corrected. If None, uses the value from sample in the DataContainer mapping attribute.
- factor: float
Factor used to convert values to zero (see notes).
- mode: {“mean”, “max”, “lod”, “loq”} or callable
Function used to generate the blank correction. If mode is mean, the correction is generated as the mean of all blank samples. If max, the correction is generated as the maximum value for each feature in all blank samples. If mode is lod, the correction is the mean plus three times the standard deviation of the blanks. If mode is loq, the correction is the mean plus ten times the standard deviation.
- process_blanks: bool
If True, applies the blank correction to the blanks as well.
- verbose: bool
Shows a message with information after the correction has been applied.
Notes
Blank correction is applied for each feature in the following way:
\[X_{corrected} = \begin{cases} 0 & \textrm{if } X < factor * mode(X_{blank}) \\ X - mode(X_{blank}) & \textrm{otherwise} \end{cases}\]
Constructor for the BlankCorrector.
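The rule in the notes can be sketched for a single feature as follows. This is an illustration under stated assumptions: blank_estimate and blank_correct are hypothetical helpers, the sample standard deviation is used for lod and loq (the text does not say which estimator is used), and the robust option is not modeled.

```python
from statistics import mean, stdev


def blank_estimate(blank_values, mode="lod"):
    """Blank contribution of one feature, following the documented modes."""
    if callable(mode):
        return mode(blank_values)
    if mode == "mean":
        return mean(blank_values)
    if mode == "max":
        return max(blank_values)
    spread = stdev(blank_values)  # assumption: sample standard deviation
    if mode == "lod":
        return mean(blank_values) + 3 * spread
    if mode == "loq":
        return mean(blank_values) + 10 * spread
    raise ValueError(f"unknown mode: {mode!r}")


def blank_correct(x, blank, factor=1.0):
    """Set x to zero below factor * blank, otherwise subtract the blank."""
    return 0.0 if x < factor * blank else x - blank


blanks = [1.0, 2.0, 3.0]            # mean 2.0, standard deviation 1.0
lod = blank_estimate(blanks, "lod")
loq = blank_estimate(blanks, "loq")
corrected = [blank_correct(x, lod) for x in [4.0, 20.0]]
```

Here the value 4.0 falls below the lod-based estimate and is set to zero, while 20.0 has the estimate subtracted.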
- class ClassRemover(classes: List[str])¶
Remove samples from the specified classes.
- Parameters:
- classes: list
List of classes to remove.
- class DRatioFilter(lb=0, ub=0.5, robust=False, verbose=False)¶
Remove Features with low biological information.
To use this filter, the qc sample type and the study sample type must be specified in the DataContainer mapping.
- Parameters:
- lb: number between 0 and 1
Lower bound of acceptance
- ub: number between 0 and 1
Upper bound of acceptance.
- robust: bool
If True uses the MAD to compute the d-ratio. Else uses the standard deviation.
- verbose: bool
Shows a message with information after the filter has been applied.
Notes
D-Ratio is a metric defined in [1] as the quotient between the technical variation and the total (technical plus biological) variation of a feature:
\[\textrm{D-Ratio} = \frac{\sigma_{technical}} {\sqrt{\sigma_{technical}^{2} + \sigma_{biological}^{2}}}\]
The technical variation is estimated as the dispersion of the QC samples, while the total variation (technical and biological) is estimated from the study samples. Lower D-Ratio values suggest features that are measured in a robust way. A maximum acceptance value of 0.5 is suggested.
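A minimal sketch of this estimate for one feature is shown below. The helper names are hypothetical, the sample standard deviation is assumed for the non-robust case, and the unscaled MAD is used in the robust case since any scaling constant cancels in the ratio.

```python
from statistics import median, stdev


def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median([abs(x - center) for x in values])


def d_ratio(qc_values, study_values, robust=False):
    """Dispersion of the QC samples divided by that of the study samples."""
    dispersion = mad if robust else stdev
    return dispersion(qc_values) / dispersion(study_values)


def is_accepted(value, lb=0.0, ub=0.5):
    """A feature is kept when its D-Ratio lies inside [lb, ub]."""
    return lb <= value <= ub


qc = [10.0, 10.5, 9.5, 10.0]           # tight QC values
study = [5.0, 15.0, 10.0, 20.0, 0.0]   # wide biological spread
d_good = d_ratio(qc, study)
d_good_robust = d_ratio(qc, study, robust=True)

# QC values that vary more than the study samples give a large D-Ratio.
d_bad = d_ratio([5.0, 15.0, 10.0], [9.0, 10.0, 11.0])
```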
References
[1] D. Broadhurst et al., “Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies”, Metabolomics, 2018;14(6):72. doi: 10.1007/s11306-018-1367-3
Constructor of the DRatioFilter.
- class DilutionFilter(min_corr: float = 0.8, plim: float = 0.1, mode: str = 'ols', verbose: bool = False)¶
Filter features based on the correlation with a dilution factor.
In order to use this filter, the dilution column must be specified in the sample_metadata of the DataContainer. Also, the QCs used for the analysis must be specified under the dqc key in the DataContainer mapping.
- Parameters:
- min_corr: number between 0 and 1
Lower bound for the correlation coefficient.
- plim: number between 0 and 1
p-value limit for the Jarque-Bera test. Used only when mode is ols.
- mode: {“ols”, “spearman”}
ols computes the ordinary least squares linear regression. The r squared from the fit and the p-value from the Jarque-Bera test are used to evaluate the linearity of the signal with the dilution. Features with correlation values lower than min_corr or p-values lower than plim are removed. spearman compares the correlation threshold with the Spearman rank correlation coefficient.
- verbose: bool
If True, prints a message.
Notes
Correlation with the dilution is a measure of the linearity of the response of the feature in the experimental conditions [2].
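The difference between the two modes can be sketched with hand-written correlation helpers. This is an illustration only: the functions are hypothetical, the Jarque-Bera test used by the ols mode is not included, and the r squared of a simple linear fit is computed as the squared Pearson correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


dilution = [1.0, 2.0, 4.0, 8.0]
linear = [10.0, 21.0, 39.0, 82.0]      # roughly proportional response
saturated = [10.0, 50.0, 60.0, 62.0]   # monotonic but not linear

r2_linear = pearson(dilution, linear) ** 2
r2_saturated = pearson(dilution, saturated) ** 2
rho_saturated = spearman(dilution, saturated)
```

With min_corr = 0.8, the saturating feature fails the r squared criterion but passes the rank-based one, since Spearman correlation only measures monotonicity.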
References
Constructor of the DilutionFilter.
- class DuplicateMerger(process_classes: List[str] | None = None)¶
Merge sample replicates.
- exception MissingMappingInformation¶
Error raised when an empty sample type is used from a mapping.
- exception MissingValueError¶
Error raised when a DataContainer’s data matrix has missing values.
- class Pipeline(processors: list, verbose: bool = False)¶
Combines Filters and Correctors and applies them simultaneously.
- Attributes:
- processors: list[Processor]
A list of processors to apply. Can also contain another Pipeline.
- verbose: bool
If True prints a message each time a Processor is applied.
- class PrevalenceFilter(process_classes: List[str] | None = None, lb: float | int = 0.5, ub: float | int = 1, intraclass: bool = True, verbose: bool = False, threshold: float | int = 0)¶
Remove Features detected in a low number of samples.
- Parameters:
- process_classes: List[str], optional
Classes used to compute prevalence. If None, classes are obtained from sample classes in the DataContainer mapping.
- lb: number between 0 and 1
Lower bound of acceptance.
- ub: number between 0 and 1
Upper bound of acceptance. Must be greater than or equal to lb.
- threshold: non-negative number
Minimum intensity to consider a feature as detected.
- intraclass: bool
Whether to evaluate a global prevalence or a per class prevalence. If intraclass is True, the detection rate is computed for each class, and the prevalence is defined as the minimum value for the classes analyzed. If intraclass is False, the prevalence is computed as the detection rate for all the samples that belong to the process_classes.
- verbose: bool
Shows a message with information after the filter has been applied.
Notes
The prevalence is computed using the detection rate, that is, the fraction of samples where a feature was detected. A feature is considered detected if its value is above a threshold. The intraclass parameter controls how the prevalence is computed.
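The two ways of computing the prevalence can be compared on a small example. The functions below are hypothetical helpers written for this sketch, and a value is treated as detected when it is strictly above the threshold.

```python
def detection_rate(values, threshold=0.0):
    """Fraction of samples where the feature is above the threshold."""
    return sum(x > threshold for x in values) / len(values)


def prevalence(values_by_class, threshold=0.0, intraclass=True):
    """Minimum per-class detection rate, or the rate over all samples."""
    if intraclass:
        return min(detection_rate(v, threshold) for v in values_by_class.values())
    pooled = [x for v in values_by_class.values() for x in v]
    return detection_rate(pooled, threshold)


# One feature measured in two classes (class names are made up).
feature = {
    "healthy": [1.0, 2.0, 3.0, 4.0],   # detected in 4 of 4 samples
    "disease": [0.0, 0.0, 0.0, 5.0],   # detected in 1 of 4 samples
}
per_class = prevalence(feature, intraclass=True)
overall = prevalence(feature, intraclass=False)

lb, ub = 0.5, 1.0
kept_per_class = lb <= per_class <= ub
kept_overall = lb <= overall <= ub
```

With the default bounds, this feature is removed when intraclass is True and kept when it is False.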
Constructor of the PrevalenceFilter.
- class Processor(mode: str, axis: str | None = None, verbose: bool = False, default_process: str | None = None, default_correct: str | None = None, requirements: dict | None = None)¶
Abstract class to process DataContainer objects. This class is intended to be subclassed to generate specific filters. A filter is implemented by overriding the func method.
- Attributes:
- mode: {“filter”, “flag”, “transform”}
filter removes features/samples from a DataContainer. flag selects features/samples to be inspected manually. transform applies a transformation on the DataContainer.
- axis: {“samples”, “features”}, optional
Axis to process. Only necessary when using mode “filter” or “flag”.
- verbose: bool
- params: dict
Parameters used by the filter function.
- _default_process: str
Default sample type used to apply the filter.
- _default_correct: str
Default sample type to be corrected.
- _requirements: dict
Dictionary with the same keys as the one obtained from the diagnose method of a DataContainer. If any value differs from the one returned by diagnose, an error is raised.
Methods
process(data)
(Applies a filter/correction to a DataContainer)
- class Reporter(name: str | None = None)¶
Abstract class with methods to report metrics.
- Attributes:
- metrics: dict
stores number of features, number of samples and mean coefficient of variation before and after processing.
- name: str
- class VariationFilter(lb=0, ub=0.25, process_classes=None, robust=False, intraclass=True, verbose=False)¶
Remove features with low reproducibility.
The reproducibility of the features is evaluated using the relative standard deviation of each feature in samples of a specific class or classes. By default, the QC samples are analyzed.
- Parameters:
- lb: number between 0 and 1
Lower bound of acceptance.
- ub: number between 0 and 1
Upper bound of acceptance. Must be greater than lb.
- process_classes: List[str], optional
Classes used to evaluate the coefficient of variation. If None, the list of classes is taken from the qc sample type in the DataContainer mapping attribute.
- robust: bool
If False, uses the mean and standard deviation to compute the cv. Else, the cv is estimated using the MAD and the median of the feature, assuming a normal distribution.
- intraclass: bool
If True, the cv is computed for each class in process_classes and then the maximum value is compared against lb and ub. Else, a global cv is computed for all classes in process_classes.
- verbose: bool
If True, prints a message.
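The robust and intraclass options can be sketched for one feature as follows. The helpers are hypothetical, the sample standard deviation is assumed for the non-robust case, and the factor 1.4826 is the usual constant that turns the MAD into a standard deviation estimate under a normal distribution; whether the library uses exactly this constant is an assumption.

```python
from statistics import mean, median, stdev


def cv(values, robust=False):
    """Coefficient of variation of one feature in a group of samples."""
    if robust:
        center = median(values)
        mad = median([abs(x - center) for x in values])
        return 1.4826 * mad / center  # assumption: normal-consistent MAD
    return stdev(values) / mean(values)


def feature_cv(values_by_class, robust=False, intraclass=True):
    """Maximum per-class cv, or a single cv over all samples pooled."""
    if intraclass:
        return max(cv(v, robust) for v in values_by_class.values())
    pooled = [x for v in values_by_class.values() for x in v]
    return cv(pooled, robust)


# QC values of one feature (the class name is made up).
qc = {"QC": [100.0, 110.0, 90.0, 100.0, 105.0]}
cv_classic = feature_cv(qc)
cv_robust = feature_cv(qc, robust=True)

lb, ub = 0.0, 0.25
kept = lb <= cv_classic <= ub
```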
Constructor of the VariationFilter.
- register(f)¶
Register f in the available filters.
- Parameters:
- f: Filter or Corrector
- Returns:
- f