tidyms.Assay

Manages data preprocessing workflows from raw data to data matrix.

See the user guide for usage instructions.
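As a rough orientation, the methods documented below can be chained in the order implied by their stated preconditions (for example, extract_features must follow detect_features, and build_feature_table requires describe_features). The following is a minimal sketch under that assumption. The helper name run_preprocessing and the mz_tolerance value are hypothetical and chosen only for illustration; assay is expected to be an Assay instance, although any object exposing the same methods works.

```python
def run_preprocessing(assay, mz_tolerance=0.01):
    """Call the Assay preprocessing steps in dependency order.

    `assay` is any object exposing the methods documented on this page.
    The default `mz_tolerance` is a placeholder, not a recommended value.
    """
    assay.detect_features()      # build ROIs from the raw data
    assay.extract_features()     # requires detect_features
    assay.describe_features()    # compute descriptors for each sample
    assay.build_feature_table()  # requires describe_features
    assay.match_features()       # label features across samples
    assay.make_data_matrix()     # results are stored in assay.data_matrix
    assay.fill_missing(mz_tolerance=mz_tolerance)
    return assay
```

Optional steps such as annotate_isotopologues are left out because their position in the sequence is not stated in this reference.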

Parameters:

data_path: str, List[str] or Path

Path to the mzML files to be analyzed. data_path can be a string or a list of strings with the absolute paths of mzML files in centroid mode, or a Path object. A Path object can point either to a single mzML file or to a directory. In the second case, all mzML files inside the directory are used.

assay_path: str

Path to store the assay data. If the path does not exist, a new directory is created. If an existing assay directory is passed, it loads the data from the assay.

sample_metadata: str, Path, DataFrame or None

Provides information associated with each sample. If a string is provided, it is assumed to be the path to a csv file with sample metadata. Columns may contain any kind of data, but the following columns have reserved uses:

sample: This column is mandatory. Must contain each one of the file names in data_path, without the .mzML extension.
class: The class of each sample.
order: A unique positive integer number that indicates the run order of each sample.
batch: The batch number where each sample was analyzed. The values must be positive integers.

If a DataFrame is provided, it must have the same structure as the csv file described above. If None is provided, the samples are assumed to be from the same class and no order and batch information is used.
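The reserved columns described above can be checked before building an assay. The sketch below parses a small csv with the standard library and verifies the constraints stated for sample, order and batch. The file names, class names and the helper check_sample_metadata are hypothetical and used only for illustration; this is not how tidyms validates metadata internally.

```python
import csv
import io
from pathlib import Path

# Hypothetical file names, used only for illustration.
data_path = ["/data/run/sample-1.mzML", "/data/run/sample-2.mzML"]

# Sample metadata as it could appear in a csv file.
csv_text = """sample,class,order,batch
sample-1,healthy,1,1
sample-2,disease,2,1
"""


def check_sample_metadata(rows, data_path):
    """Return a list of problems found in the reserved columns."""
    problems = []
    # Sample names are the file names without the .mzML extension.
    expected = {Path(p).stem for p in data_path}
    found = {row["sample"] for row in rows}
    if expected - found:
        problems.append(f"missing samples: {sorted(expected - found)}")
    orders = [int(row["order"]) for row in rows]
    if len(set(orders)) != len(orders) or any(o < 1 for o in orders):
        problems.append("order must contain unique positive integers")
    if any(int(row["batch"]) < 1 for row in rows):
        problems.append("batch must contain positive integers")
    return problems


rows = list(csv.DictReader(io.StringIO(csv_text)))
problems = check_sample_metadata(rows, data_path)
print(problems)
```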

ms_mode: {“centroid”, “profile”}

The mode in which the data is stored.

instrument: {“qtof”, “orbitrap”}

The instrument type. Used to set several defaults during data preprocessing.

separation: {“uplc”, “hplc”}

The separation method used. Used to set several defaults during data preprocessing.

annotate_isotopologues(n_jobs: int | None = 1, verbose: bool = True, **kwargs)

Labels isotopic envelopes in each sample.

Labels are stored in the isotopologue_label column of the feature table. All features in an envelope share the same label. Features labelled with -1 do not belong to any group. The isotopologue_index column indexes the nominal mass of each isotopologue, relative to the isotopologue with the minimum mass. The charge column contains the charge of the isotopic envelope.

Feature descriptors from each sample are organized in a Pandas DataFrame and stored to disk, and can be recovered using self.load_features. Besides the descriptors, these DataFrames contain two additional columns: roi_index and ft_index. roi_index is used to identify the ROI where the feature was detected, which can be recovered using the load_roi method. The ft_index value is used to identify the feature in the features attribute of the ROI.

Parameters:

n_jobs: int or None, default=1: Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
verbose: bool, default=True: If True, displays a progress bar.
**kwargs: dict: Parameters to pass to tidyms.lcms.Roi.describe_features().
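To illustrate how the isotopologue_label and isotopologue_index columns can be read, the sketch below groups the rows of a small feature table into envelopes. Plain dictionaries stand in for DataFrame rows so the example has no dependencies. All m/z values are made up, the values used for the unlabelled feature in the isotopologue_index and charge columns are placeholders, and group_envelopes is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical feature table rows. Only the column names come from the
# description above; every value is made up for illustration.
features = [
    {"mz": 200.0, "isotopologue_label": 0, "isotopologue_index": 0, "charge": 1},
    {"mz": 201.0034, "isotopologue_label": 0, "isotopologue_index": 1, "charge": 1},
    {"mz": 350.5, "isotopologue_label": -1, "isotopologue_index": -1, "charge": -1},
    {"mz": 410.7017, "isotopologue_label": 1, "isotopologue_index": 1, "charge": 2},
    {"mz": 410.2, "isotopologue_label": 1, "isotopologue_index": 0, "charge": 2},
]


def group_envelopes(rows):
    """Group rows by envelope label, skipping unlabelled features (-1)."""
    envelopes = defaultdict(list)
    for row in rows:
        label = row["isotopologue_label"]
        if label == -1:  # the feature does not belong to any envelope
            continue
        envelopes[label].append(row)
    # Order the members of each envelope by nominal mass index.
    for members in envelopes.values():
        members.sort(key=lambda r: r["isotopologue_index"])
    return dict(envelopes)


envelopes = group_envelopes(features)
for label, members in sorted(envelopes.items()):
    print(label, [m["mz"] for m in members])
```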

build_feature_table()

Merges the feature descriptors from all samples into one DataFrame.

The feature table is stored in self.feature_table. Two additional columns are created: sample_, which contains the name of the sample where the feature was detected, and class_, which contains the class name of that sample.

Raises:

ValueError: if the feature table was not built for all samples. This occurs if self.describe_features was not called.

describe_features(custom_descriptors: dict[str, Callable[[tidyms.lcms.Feature], float]] | None = None, filters: dict[str, Tuple] | None = None, n_jobs: int | None = 1, verbose: bool = True) → Assay

Compute feature descriptors for the features extracted from the data.

Feature descriptors from each sample are organized in a Pandas DataFrame and stored to disk, and can be recovered using self.load_features. Besides the descriptors, these DataFrames contain two additional columns: roi_index and ft_index. roi_index is used to identify the ROI where the feature was detected, which can be recovered using the load_roi method. The ft_index value is used to identify the feature in the features attribute of the ROI.

Parameters:

custom_descriptors: dict or None, default=None

A dictionary of strings to callables, used to estimate custom descriptors of a feature. The function must have the following signature:

"estimator_func(feature: Feature) -> float"

filters: dict or None, default=None

A dictionary of descriptor names to a tuple of minimum and maximum acceptable values. To use only minimum/maximum values, use None (e.g. (None, max_value) in the case of using only maximum). Features with descriptors outside those ranges are removed. Filters for custom descriptors can also be used.

n_jobs: int or None, default=1

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

verbose: bool, default=True

If True, displays a progress bar.

**kwargs: dict

Parameters to pass to tidyms.lcms.Roi.describe_features().
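The semantics of the filters dictionary can be sketched in a few lines. The example below applies (minimum, maximum) tuples to per-feature descriptor values, with None meaning that the bound is not used. The descriptor names width and snr, the numeric values, and the helper passes_filters are illustrative assumptions, and treating the bounds as inclusive is also an assumption. This is not the tidyms implementation.

```python
def passes_filters(descriptors, filters):
    """Return True if every filtered descriptor lies inside its range.

    `filters` maps a descriptor name to a (minimum, maximum) tuple.
    A bound set to None is ignored. Bounds are treated as inclusive,
    which is an assumption made for this sketch.
    """
    for name, (lower, upper) in filters.items():
        value = descriptors[name]
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True


# Hypothetical descriptor values for three features.
features = [
    {"width": 10.0, "snr": 25.0},
    {"width": 80.0, "snr": 30.0},  # too wide
    {"width": 12.0, "snr": 3.0},   # signal-to-noise ratio too low
]

# width must be between 4 and 60; snr only has a minimum.
filters = {"width": (4.0, 60.0), "snr": (5.0, None)}

kept = [f for f in features if passes_filters(f, filters)]
print(kept)
```

A filter on a custom descriptor would use the same key that names the callable in custom_descriptors.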

detect_features(strategy: str | Callable = 'default', n_jobs: int | None = None, verbose: bool = True, **kwargs) → Assay

Builds Regions Of Interest (ROI) from raw data for each sample.

ROIs are computed and saved to disk. Computed ROIs can be recovered using self.load_roi or self.load_roi_list.

Parameters:

strategy: str or callable, default=“default”

If default is used, then tidyms.raw_data_utils.make_roi() is used to build ROIs in each sample. A function can be passed to customize the detection process. The following template must be used:

def func(ms_data: MSData, **kwargs) -> List[Roi]:
    ...

n_jobs: int or None, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

verbose: bool, default=True

If True, displays a progress bar.

**kwargs

Parameters to pass to the underlying function used. See the strategy parameter.

See also

fileio.MSData: mzML reader
lcms.Roi: abstract ROI
lcms.LCRoi: ROI used in LC data
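The strategy parameter accepts either the string “default” or a callable that follows the template above. The sketch below shows one way such an argument could be resolved and how **kwargs reach the selected function. It illustrates the pattern only and is not the tidyms implementation: resolve_strategy, default_make_roi and my_roi_builder are hypothetical names, and a list of lists stands in for MSData and Roi objects.

```python
from typing import Any, Callable, List, Union


def default_make_roi(ms_data: Any, **kwargs) -> List[Any]:
    """Stand-in for the default ROI builder. Returns no ROIs."""
    return []


def resolve_strategy(strategy: Union[str, Callable]) -> Callable:
    """Map the strategy argument to the function that will be called."""
    if callable(strategy):
        return strategy
    if strategy == "default":
        return default_make_roi
    raise ValueError(f"unknown strategy: {strategy!r}")


def my_roi_builder(ms_data: Any, min_length: int = 3, **kwargs) -> List[Any]:
    """Custom strategy with the documented shape: func(ms_data, **kwargs).

    Here `ms_data` is a list of lists standing in for an MSData object,
    and each inner list stands in for a candidate ROI.
    """
    return [roi for roi in ms_data if len(roi) >= min_length]


func = resolve_strategy(my_roi_builder)
# Extra keyword arguments are forwarded to the custom function.
rois = func([[1, 2, 3], [4]], min_length=2)
print(rois)
```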

extract_features(strategy: str | Callable = 'default', n_jobs: int | None = None, verbose: bool = True, **kwargs) → Assay

Extract features from the ROIs detected on each sample.

Features are stored in the features attribute of each ROI. ROIs can be recovered using self.load_roi or self.load_roi_list.

Parameters:

strategy: str or callable, default=“default”

If default is used, then tidyms.lcms.LCRoi.extract_features() is used to extract features from each ROI. A function can be passed to customize the extraction process. The following template must be used:

def func(roi: Roi, **kwargs) -> List[Feature]:
    ...

n_jobs: int or None, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

verbose: bool, default=True

If True, displays a progress bar.

**kwargs

Parameters to pass to the underlying function used. See the strategy parameter.

Raises:

PreprocessingOrderError: if called before self.detect_features.

See also

lcms.Roi: abstract ROI
lcms.LCRoi: ROI used in LC data
lcms.Feature: abstract feature
lcms.Peak: feature used in LC data

fill_missing(mz_tolerance: float, n_deviations: float = 1.0, estimate_not_found: bool = True, n_jobs: int | None = None, verbose: bool = False)

Fills missing values in the data matrix by searching for the missing features in the raw data, using average values computed from the detected features.

Parameters:

mz_tolerance: float: m/z tolerance used to create chromatograms.
n_deviations: float, default=1.0: Number of deviations from the mean retention time to search for a peak, in units of standard deviations.
estimate_not_found: bool, default=True: If True, an estimate of the peak area is computed, as described in the Notes, in cases where no chromatographic peak is found. If False, missing values after the peak search are set to zero.
n_jobs: int or None, default=None: Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
verbose: bool, default=False: If True, shows a progress bar.

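Notes:

Missing features are searched for in chromatograms built from the raw data. A peak detected in a chromatogram is assigned to the feature if it satisfies:

\[ \frac{rt_{detected} - rt_{mean}}{rt_{std}} \leq n_{deviations} \]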

where \(rt_{detected}\) is the Rt of the peak in the chromatogram, \(rt_{mean}\) is the mean Rt of the feature and \(rt_{std}\) is the standard deviation of the Rt of the feature. If no peaks are found, the feature value is set to zero if estimate_not_found is False. Otherwise, a fill value is computed as the area in the region where the chromatographic peak was expected to appear, defined by the rt_start and rt_end values in the feature table.
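The acceptance criterion and the fallback estimate described above can be sketched as follows. Two assumptions are made and should be checked against the library: the deviation is taken in absolute value (the formula above is written without it), and the fallback area is computed with the trapezoidal rule over the points between rt_start and rt_end. The function names and all numeric values are hypothetical.

```python
def within_rt_window(rt_detected, rt_mean, rt_std, n_deviations=1.0):
    """Check the retention time criterion for a candidate peak.

    abs() is an assumption: the documented formula has no absolute value.
    """
    return abs(rt_detected - rt_mean) / rt_std <= n_deviations


def trapezoid_area(rt, intensity, rt_start, rt_end):
    """Trapezoidal area using only the points inside [rt_start, rt_end]."""
    points = [(t, y) for t, y in zip(rt, intensity) if rt_start <= t <= rt_end]
    return sum(
        (t1 - t0) * (y0 + y1) / 2
        for (t0, y0), (t1, y1) in zip(points, points[1:])
    )


# Hypothetical feature statistics, in seconds.
rt_mean, rt_std = 120.0, 2.0

accepted = within_rt_window(121.5, rt_mean, rt_std)  # 0.75 deviations
rejected = within_rt_window(125.0, rt_mean, rt_std)  # 2.5 deviations

# Hypothetical chromatogram used when no peak is found.
rt = [118.0, 119.0, 120.0, 121.0, 122.0]
intensity = [0.0, 10.0, 20.0, 10.0, 0.0]
fill_value = trapezoid_area(rt, intensity, rt_start=119.0, rt_end=121.0)

print(accepted, rejected, fill_value)
```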

get_ms_data(sample: str) → MSData

Loads a raw sample file into an MSData object.

Parameters:

sample: str: Sample name used in the sample metadata.

Returns:

MSData

Raises:

ValueError: if the sample is not found.

get_sample_metadata() → DataFrame

Creates a DataFrame with the metadata of each sample used in the assay.

Returns:

DataFrame

load_features(sample: str) → DataFrame

Loads a table with feature descriptors for a sample.

Parameters:

sample: str: sample name used in the sample metadata.

Returns:

pd.DataFrame

Raises:

ValueError: if the feature data was not found. This error occurs if a wrong sample name is used or if self.describe_features was not called.

load_roi(sample: str, roi_index: int) → Roi

Loads a ROI from a sample.

Must be called after performing feature detection.

Parameters:

sample: str: sample name used in the sample metadata.
roi_index: int: index of the requested ROI.

Returns:

Roi

Raises:

ValueError: If an invalid sample name or roi_index was used.
FileNotFoundError: If a non-existent roi_index was used.

See also

detect_features: Detect ROI in the Assay samples.
load_roi_list: Loads all ROI from a sample.

load_roi_list(sample: str) → List[Roi]

Loads all the ROIs detected in a sample.

Must be called after performing feature detection.

Parameters:

sample: str: sample name used in the sample metadata.

Returns:

List[Roi]

Raises:

ValueError: if the ROI data was not found. This error occurs if a wrong sample name is used or if self.detect_features was not called.

See also

detect_features: Detect ROI in the Assay samples.

make_data_matrix(merge_close_features: bool = True, merge_threshold: float = 0.8, mz_merge: float | None = None, rt_merge: float | None = None)

Creates a data matrix.

The results are stored in self.data_matrix.

Parameters:

merge_close_features: bool, default=True: If True, finds close features and merges them into a single feature. The code of the merged features is in the merged column of the feature metadata. The area in the data matrix is the sum of the merged features.
merge_threshold: float, default=0.8: Number between 0.0 and 1.0. This value is compared against the quotient between the number of samples where both features were detected and the number of samples where any of the features was detected. If this quotient is lower than the threshold, the pair of features is merged into a single one.
mz_merge: float or None, default=None: Merge features only if their mean m/z values, as described by the feature metadata, are closer than this value.
rt_merge: float or None, default=None: Merge features only if their mean Rt values, as described by the feature metadata, are closer than this value.
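The quotient compared against merge_threshold can be worked through with sets of sample names. The sketch below follows the rule as stated above: the pair is merged when the quotient is lower than the threshold. The m/z and Rt proximity checks controlled by mz_merge and rt_merge are left out, and the helper names and sample names are hypothetical.

```python
def co_detection_quotient(samples_a, samples_b):
    """Samples where both features were detected divided by samples
    where any of the two features was detected."""
    both = samples_a & samples_b
    either = samples_a | samples_b
    return len(both) / len(either)


def should_merge(samples_a, samples_b, merge_threshold=0.8):
    """Apply the documented rule: merge if the quotient is below the threshold."""
    return co_detection_quotient(samples_a, samples_b) < merge_threshold


# Two features that almost never appear together in the same sample.
feature_a = {"s1", "s2", "s3"}
feature_b = {"s3", "s4"}
quotient = co_detection_quotient(feature_a, feature_b)  # 1 / 4

# Two features detected in exactly the same samples.
feature_c = {"s1", "s2"}
feature_d = {"s1", "s2"}

print(quotient, should_merge(feature_a, feature_b), should_merge(feature_c, feature_d))
```

Under this rule, features that are rarely detected together in the same sample are the ones that get merged, which fits the case of a single signal that was assigned to two different features across samples.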

match_features(strategy: str | Callable = 'default', **kwargs)

Match features across samples. Each feature is labelled using an integer value to assign a common id. Features that do not belong to any group are labelled with -1. The label is stored in the label_ column of the feature table.

Parameters:

strategy: str or callable, default=“default”

If default is used, then tidyms.correspondence.match_features() is used to match features across samples. A function can be passed to customize the matching process. The following template must be used:

def func(assay: Assay, **kwargs) -> Dict:
    ...

The dictionary must have at least one key called “cluster_”, containing a 1D numpy array with size matching the number of rows in the feature table. Each value is used to group features from different samples into a data matrix. The value -1 in the array is used to signal features that were not matched to any group and are not going to be included in the data matrix. Other keys with arbitrary names may be used, but the associated value must be a 1D numpy array with size equal to the number of different groups in the cluster array. These arrays can be used to compute feature matching metrics that can be passed to the data matrix construction and used to assess the feature matching process.

**kwargs

Parameters to pass to the underlying function used. See the strategy parameter.
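The structure required from a custom matching function can be checked with a small helper. The sketch below validates a result dictionary against the rules stated for the strategy parameter. Plain lists are used so that the example has no dependencies; 1D numpy arrays should behave the same way, since only len and iteration are used. Excluding -1 when counting groups is an assumption, and check_match_result and the n_samples key are hypothetical names.

```python
def check_match_result(result, n_features):
    """Validate the dictionary returned by a custom matching function.

    Returns the number of groups found in the "cluster_" values.
    """
    if "cluster_" not in result:
        raise ValueError('missing required key "cluster_"')
    cluster = list(result["cluster_"])
    if len(cluster) != n_features:
        raise ValueError("cluster_ must have one value per feature table row")
    # Assumption: the label -1 (unmatched) is not counted as a group.
    n_groups = len(set(cluster) - {-1})
    for key, values in result.items():
        if key != "cluster_" and len(values) != n_groups:
            raise ValueError(f"{key} must have one value per group")
    return n_groups


# Hypothetical result for a feature table with five rows.
result = {
    "cluster_": [0, 0, 1, -1, 1],  # the fourth feature was not matched
    "n_samples": [2, 2],           # one metric value per group
}
n_groups = check_match_result(result, n_features=5)
print(n_groups)
```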