tidyms.Assay
- class Assay(assay_path: str | Path | None = None, data_path: str | List[str] | Path | None = None, sample_metadata: DataFrame | str | Path | None = None, ms_mode: str = 'centroid', instrument: str = 'qtof', separation: str = 'uplc', data_import_mode: str | None = None, n_jobs: int = 1, cache_MSData_objects: bool = False)
Manages data preprocessing workflows from raw data to data matrix.
See the user guide for usage instructions.
- Parameters:
- data_path: str, List[str] or Path
Contains the path of mzML files to be analyzed.
data_path can be a string or a list of strings holding absolute paths to mzML files in centroid mode, or a Path object. A Path object can point either to a single mzML file or to a directory; in the latter case, all mzML files inside the directory are used.
- assay_path: str
Path to store the assay data. If the path does not exist, a new directory is created. If an existing assay directory is passed, it loads the data from the assay.
- sample_metadata: str, Path, DataFrame or None
Provides information associated with each sample. If a string is provided, it is assumed to be the path to a csv file with sample metadata. Columns may contain any kind of data, but the following columns have reserved uses:
- sample
This column is mandatory. Must contain each one of the file names in data_path, without the .mzML extension.
- class
The class of each sample.
- order
A unique positive integer number that indicates the run order of each sample.
- batch
The batch number where each sample was analyzed. The values must be positive integers.
If a DataFrame is provided, it must have the same structure as the csv file described above. If None is provided, the samples are assumed to be from the same class and no order and batch information is used.
- ms_mode: {"centroid", "profile"}
The mode in which the data is stored.
- instrument: {"qtof", "orbitrap"}
The instrument type. Used to set several defaults during data preprocessing.
- separation: {"uplc", "hplc"}
The separation method used. Used to set several defaults during data preprocessing.
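The constraints on the reserved sample_metadata columns can be checked before building an assay. The sketch below is only an illustration of the rules stated above: `check_sample_metadata`, the sample names, and the file paths are hypothetical and are not part of tidyms.

```python
from pathlib import Path


def check_sample_metadata(rows, data_files):
    """Return a list of problems found in sample metadata rows.

    rows: list of dicts keyed by column name (one dict per sample).
    data_files: mzML file paths, as they would be passed in data_path.
    Hypothetical helper, not part of tidyms.
    """
    problems = []

    # The sample column must list every file name without the .mzML extension.
    expected = {Path(f).stem for f in data_files}
    found = {row["sample"] for row in rows}
    missing = sorted(expected - found)
    if missing:
        problems.append(f"samples missing from metadata: {missing}")

    # order is optional, but must hold unique positive integers when present.
    orders = [row["order"] for row in rows if "order" in row]
    if any(not isinstance(o, int) or o <= 0 for o in orders):
        problems.append("order values must be positive integers")
    if len(set(orders)) != len(orders):
        problems.append("order values must be unique")

    # batch is optional, but must hold positive integers when present.
    batches = [row["batch"] for row in rows if "batch" in row]
    if any(not isinstance(b, int) or b <= 0 for b in batches):
        problems.append("batch values must be positive integers")

    return problems


# Made-up samples and paths, for illustration only.
rows = [
    {"sample": "qc_01", "class": "QC", "order": 1, "batch": 1},
    {"sample": "plasma_01", "class": "plasma", "order": 2, "batch": 1},
]
files = ["/data/qc_01.mzML", "/data/plasma_01.mzML"]
problems = check_sample_metadata(rows, files)
```

An empty `problems` list means the rows satisfy the rules for the sample, order and batch columns described above.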
- annotate_isotopologues(n_jobs: int | None = 1, verbose: bool = True, **kwargs)
Labels isotopic envelopes in each sample.
Labels are stored in the isotopologue_label column of the feature table. All features in the same envelope share the same label. Features labelled with -1 do not belong to any group. The isotopologue_index column indexes the nominal mass of the isotopologue, relative to the minimum mass isotopologue. The charge column contains the charge of the isotopic envelope.
Feature descriptors from each sample are organized in a Pandas DataFrame and stored to disk, and can be recovered using self.load_features. Besides the descriptors, these DataFrames contain two additional columns: roi_index and ft_index. roi_index identifies the ROI where the feature was detected, which can be recovered using the load_roi method. The ft_index value identifies the feature in the features attribute of the ROI.
- Parameters:
- n_jobs: int or None, default=1
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool, default=True
If True, displays a progress bar.
- **kwargs: dict
Parameters to pass to tidyms.lcms.Roi.describe_features().
- build_feature_table()
Merges the feature descriptors from all samples into one DataFrame.
The feature table is stored in self.feature_table. Two additional columns are created: sample_ contains the name of the sample where the feature was detected, and class_ contains the class name of that sample.
- Raises:
- ValueError if the feature table was not built for all samples. This occurs if self.describe_features was not called.
- describe_features(custom_descriptors: dict[str, Callable[[tidyms.lcms.Feature], float]] | None = None, filters: dict[str, Tuple] | None = None, n_jobs: int | None = 1, verbose: bool = True) -> Assay
Compute feature descriptors for the features extracted from the data.
Feature descriptors from each sample are organized in a Pandas DataFrame and stored to disk, and can be recovered using self.load_features. Besides the descriptors, these DataFrames contain two additional columns: roi_index and ft_index. roi_index identifies the ROI where the feature was detected, which can be recovered using the load_roi method. The ft_index value identifies the feature in the features attribute of the ROI.
- Parameters:
- custom_descriptors: dict or None, default=None
A dictionary of strings to callables, used to estimate custom descriptors of a feature. Each callable must have the following signature:
estimator_func(feature: Feature) -> float
- filters: dict or None, default=None
A dictionary of descriptor names to a tuple of minimum and maximum acceptable values. To set only one bound, use None for the other (e.g. (None, max_value) to set only a maximum). Features with descriptors outside those ranges are removed. Filters for custom descriptors can also be used.
- n_jobs: int or None, default=1
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool, default=True
If True, displays a progress bar.
- **kwargs: dict
Parameters to pass to tidyms.lcms.Roi.describe_features().
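The (minimum, maximum) convention of the filters parameter can be illustrated with a small stand-alone function. This is a sketch of the semantics described above, not the tidyms implementation: `passes_filters` is a hypothetical helper, the descriptor names and values are made up, and treating the bounds as inclusive is an assumption.

```python
def passes_filters(descriptors, filters):
    """Return True if every filtered descriptor lies inside its range.

    Hypothetical helper, not part of tidyms. A None bound means that side
    of the range is unbounded. Bounds are treated as inclusive (assumption).
    """
    for name, (lower, upper) in filters.items():
        value = descriptors[name]
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True


# Made-up descriptor names and values, for illustration only.
filters = {"width": (4.0, 60.0), "snr": (5.0, None)}  # snr has no maximum
narrow_peak = {"width": 2.5, "snr": 40.0}
good_peak = {"width": 12.0, "snr": 40.0}
kept = [ft for ft in (narrow_peak, good_peak) if passes_filters(ft, filters)]
```

Here `narrow_peak` is dropped because its width falls below the minimum, while `good_peak` is kept.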
- detect_features(strategy: str | Callable = 'default', n_jobs: int | None = None, verbose: bool = True, **kwargs) -> Assay
Builds Regions Of Interest (ROI) from raw data for each sample.
ROIs are computed and saved to disk. Computed ROIs can be recovered using self.load_roi or self.load_roi_list.
- Parameters:
- strategy: str or callable, default="default"
If "default" is used, then tidyms.raw_data_utils.make_roi() is used to build ROIs in each sample. A function can be passed to customize the detection process. The following template must be used:
def func(ms_data: MSData, **kwargs) -> List[Roi]: ...
- n_jobs: int or None, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool, default=True
If True, displays a progress bar.
- **kwargs
Parameters to pass to the underlying function used. See the strategy parameter.
See also
fileio.MSData: mzML reader
lcms.Roi: abstract ROI
lcms.LCRoi: ROI used in LC data
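One way to build a custom strategy that follows the template above is to wrap an existing ROI builder and filter its output. The sketch below is hypothetical and uses stand-in data so that it runs without tidyms: `make_filtered_strategy`, `toy_roi_builder` and the intensity threshold are illustrative names and values, not part of the library.

```python
from typing import Any, Callable, List


def make_filtered_strategy(
    base: Callable[..., List[Any]],
    keep: Callable[[Any], bool],
) -> Callable[..., List[Any]]:
    """Wrap a ROI builder so that only ROIs accepted by `keep` are returned."""

    def strategy(ms_data: Any, **kwargs) -> List[Any]:
        # Same shape as the template: one ms_data object in, a list of ROIs out.
        return [roi for roi in base(ms_data, **kwargs) if keep(roi)]

    return strategy


# Stand-ins so the sketch runs without tidyms: "ms_data" is a list of
# intensity traces and each "ROI" is one trace.
def toy_roi_builder(ms_data, min_length=1):
    return [trace for trace in ms_data if len(trace) >= min_length]


strategy = make_filtered_strategy(toy_roi_builder, keep=lambda roi: max(roi) >= 100)
rois = strategy([[5, 120, 30], [10, 20], [400]], min_length=2)
```

With real data, a function of this shape would be passed as the strategy argument, and any extra keyword arguments given to detect_features would reach it through **kwargs.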
- extract_features(strategy: str | Callable = 'default', n_jobs: int | None = None, verbose: bool = True, **kwargs) -> Assay
Extract features from the ROIs detected on each sample.
Features are stored in the features attribute of each ROI. ROIs can be recovered using self.load_roi or self.load_roi_list.
- Parameters:
- strategy: str or callable, default="default"
If "default" is used, then tidyms.lcms.LCRoi.extract_features() is used to extract features from each ROI. A function can be passed to customize the extraction process. The following template must be used:
def func(roi: Roi, **kwargs) -> List[Feature]: ...
- n_jobs: int or None, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool, default=True
If True, displays a progress bar.
- **kwargs
Parameters to pass to the underlying function used. See the strategy parameter.
- Raises:
- PreprocessingOrderError if called before self.detect_features.
See also
lcms.Roi: abstract ROI
lcms.LCRoi: ROI used in LC data
lcms.Feature: abstract feature
lcms.Peak: feature used in LC data
- fill_missing(mz_tolerance: float, n_deviations: float = 1.0, estimate_not_found: bool = True, n_jobs: int | None = None, verbose: bool = False)
Fills missing values in the data matrix by searching for missing features in the raw data, using average values from the detected features.
- Parameters:
- mz_tolerance: float
m/z tolerance used to create chromatograms.
- n_deviations: float
Number of deviations from the mean retention time to search a peak, in units of standard deviations.
- estimate_not_found: bool
If True, an estimate of the peak area is computed in cases where no chromatographic peak is found, as described in the Notes. If False, missing values after peak search are set to zero.
- n_jobs: int or None, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool
If True, shows a progress bar.
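Notes
A peak found in the chromatogram is accepted for a missing feature if its retention time satisfies
\[\frac{rt_{detected} - rt_{mean}}{rt_{std}} \leq n_{deviations}\]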
where \(rt_{detected}\) is the Rt of the peak in the chromatogram, \(rt_{mean}\) is the mean Rt of the feature and \(rt_{std}\) is the standard deviation of the Rt of the feature. If no peak is found, the feature value is set to zero if estimate_not_found is False. Otherwise, a fill value is computed as the area in the region where the chromatographic peak was expected to appear, defined by the rt_start and rt_end values in the feature table.
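As a worked example of the retention time criterion above, the sketch below evaluates the inequality literally as it is written. `rt_criterion` is a hypothetical helper and the retention times are made-up numbers; the library's internal check may differ in detail.

```python
def rt_criterion(rt_detected, rt_mean, rt_std, n_deviations=1.0):
    """Evaluate (rt_detected - rt_mean) / rt_std <= n_deviations.

    Hypothetical helper, not part of tidyms.
    """
    return (rt_detected - rt_mean) / rt_std <= n_deviations


# Made-up feature with mean Rt 100.0 s and Rt standard deviation 2.0 s.
accepted = rt_criterion(101.5, rt_mean=100.0, rt_std=2.0)  # 0.75 deviations
rejected = rt_criterion(103.0, rt_mean=100.0, rt_std=2.0)  # 1.5 deviations
```

Raising n_deviations widens the search region: with n_deviations=2.0 the peak at 103.0 s would also be accepted.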
- get_ms_data(sample: str) -> MSData
Loads a raw sample file into an MSData object.
- Parameters:
- sample: str
Sample name used in the sample metadata.
- Returns:
- MSData
- Raises:
- ValueError if the sample is not found.
- get_sample_metadata() -> DataFrame
Creates a DataFrame with the metadata of each sample used in the assay.
- Returns:
- DataFrame
- load_features(sample: str) -> DataFrame
Loads a table with feature descriptors for a sample.
- Parameters:
- sample: str
Sample name used in the sample metadata.
- Returns:
- pd.DataFrame
- Raises:
- ValueError if the feature data was not found. This error occurs if a wrong sample name is used or if self.describe_features was not called.
- load_roi(sample: str, roi_index: int) -> Roi
Loads a ROI from a sample.
Must be called after performing feature detection.
- Parameters:
- sample: str
Sample name used in the sample metadata.
- roi_index: int
Index of the requested ROI.
- Returns:
- Roi
- Raises:
- ValueError if an invalid sample name or roi_index was used.
- FileNotFoundError if a non-existent roi_index was used.
See also
detect_features: Detect ROI in the Assay samples.
load_roi_list: Loads all ROI from a sample.
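The roi_index and ft_index columns of a sample's feature table can be combined with load_roi to get back to the feature behind a row. The sketch below assumes that a ROI exposes its features as a sequence in a `features` attribute indexed by ft_index; `feature_from_row` and `ToyAssay` are hypothetical, and the stand-in objects exist only so that the example runs without tidyms.

```python
from types import SimpleNamespace


def feature_from_row(assay, sample, row):
    """Return the feature described by one row of a sample's feature table.

    Hypothetical helper, not part of tidyms. `row` is any mapping with the
    roi_index and ft_index columns.
    """
    roi = assay.load_roi(sample, row["roi_index"])
    # Assumption: ft_index is a position in the ROI's features sequence.
    return roi.features[row["ft_index"]]


class ToyAssay:
    """Stand-in exposing only load_roi, so the sketch runs without tidyms."""

    def __init__(self, rois):
        self._rois = rois

    def load_roi(self, sample, roi_index):
        return self._rois[sample][roi_index]


toy = ToyAssay({"qc_01": [SimpleNamespace(features=["peak_a", "peak_b"])]})
feature = feature_from_row(toy, "qc_01", {"roi_index": 0, "ft_index": 1})
```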
- load_roi_list(sample: str) -> List[Roi]
Loads all the ROIs detected in a sample.
Must be called after performing feature detection.
- Parameters:
- sample: str
Sample name used in the sample metadata.
- Returns:
- List[Roi]
- Raises:
- ValueError if the ROI data was not found. This error occurs if a wrong sample name is used or if self.detect_features was not called.
See also
detect_features: Detect ROI in the Assay samples.
- make_data_matrix(merge_close_features: bool = True, merge_threshold: float = 0.8, mz_merge: float | None = None, rt_merge: float | None = None)
Creates a data matrix.
The results are stored in self.data_matrix.
- Parameters:
- merge_close_features: bool
If True, finds close features and merges them into a single feature. The code of the merged features is in the merged column of the feature metadata. The area in the data matrix is the sum of the merged features.
- merge_threshold: float, default=0.8
Number between 0.0 and 1.0. This value is compared against the quotient between the number of samples where both features were detected and the number of samples where any of the features was detected. If this quotient is lower than the threshold, the pair of features is merged into a single one.
- mz_merge: float or None, default=None
Merge features only if their mean m/z values, as described by the feature metadata, are closer than this value.
- rt_merge: float or None, default=None
Merge features only if their mean Rt values, as described by the feature metadata, are closer than this value.
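The quotient compared against merge_threshold can be computed from the sets of samples in which each feature was detected. The sketch below is an illustration of that definition only: `co_detection_quotient` and the sample names are hypothetical, and the m/z and Rt closeness conditions set by mz_merge and rt_merge are not modelled.

```python
def co_detection_quotient(samples_a, samples_b):
    """Samples where both features were detected / samples where either was.

    Hypothetical helper, not part of tidyms.
    """
    a, b = set(samples_a), set(samples_b)
    return len(a & b) / len(a | b)


# Made-up detection patterns: the two features share only sample s3.
feature_1 = ["s1", "s2", "s3"]
feature_2 = ["s3", "s4", "s5"]
quotient = co_detection_quotient(feature_1, feature_2)  # 1 shared / 5 in total
should_merge = quotient < 0.8  # 0.8 is the default merge_threshold
```

Two close features that rarely appear in the same sample give a low quotient and are merged, while features detected together in most samples give a quotient near 1.0 and are kept separate.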
- match_features(strategy: str | Callable = 'default', **kwargs)
Match features across samples. Each feature is labelled using an integer value to assign a common id. Features that do not belong to any group are labelled with -1. The label is stored in the label_ column of the feature table.
- Parameters:
- strategy: str or callable, default="default"
If "default" is used, then tidyms.correspondence.match_features() is used to match features across samples. A function can be passed to customize the matching process. The following template must be used:
def func(assay: Assay, **kwargs) -> Dict: ...
The dictionary must have at least one key called "cluster_", containing a 1D numpy array with size matching the number of rows in the feature table. Each value is used to group features from different samples into a data matrix. The value -1 in the array signals features that were not matched to any group and will not be included in the data matrix. Other keys with arbitrary names may be used, but the associated value must be a 1D numpy array with size equal to the number of different groups in the cluster array. These arrays can be used to compute feature matching metrics that can be passed to the data matrix construction and used to assess the feature matching process.
- **kwargs
Parameters to pass to the underlying function used. See the strategy parameter.
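Taken together, the methods on this page suggest a processing order from raw data to a filled data matrix. The sketch below shows one plausible order. It is an assumption inferred from the Raises sections and method descriptions, not a confirmed recipe: `run_preprocessing` and `RecordingAssay` are hypothetical, and the mz_tolerance value is a placeholder.

```python
def run_preprocessing(assay, mz_tolerance):
    """Call the Assay methods in one plausible order and return the assay.

    Hypothetical helper, not part of tidyms.
    """
    # These three methods are documented as returning the Assay,
    # so the calls can be chained.
    assay.detect_features().extract_features().describe_features()
    assay.build_feature_table()
    assay.match_features()
    assay.make_data_matrix()
    assay.fill_missing(mz_tolerance=mz_tolerance)
    return assay


class RecordingAssay:
    """Stand-in that records method calls, so the sketch runs without tidyms."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the method names.
        def method(*args, **kwargs):
            self.calls.append(name)
            return self

        return method


# 0.01 is a placeholder tolerance, not a recommended value.
recorder = run_preprocessing(RecordingAssay(), mz_tolerance=0.01)
```

With a real Assay, annotate_isotopologues could be added to this sequence, but its position relative to the other steps is not stated on this page.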