tidyms.container¶
Objects used to store and manage metabolomics data
Objects¶
DataContainer: Stores metabolomics data.
Exceptions¶
BatchInformationError
RunOrderError
ClassNameError
EmptyDataContainerError
Usage¶
DataContainers can be created in two ways other than using the constructor:
Using the functions in the fileio module to read data processed with third-party software (XCMS, MZMine2, etc.)
Running a feature correspondence algorithm on features detected from raw data (not implemented yet)
- exception BatchInformationError¶
Error class when there is no batch information
- class BokehPlotMethods(data: DataContainer)¶
Methods to plot data from a DataContainer. Generates Bokeh Figures.
Methods
pca_scores()
pca_loadings()
feature()
- feature(ft: str, hue: str = 'class', ignore_classes: List[str] | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) figure¶
plots a feature intensity as a function of the run order.
- Parameters:
- ft: str
Feature to plot. Index of feature in feature_metadata
- hue: {“class”, “type”}
- ignore_classes: list[str], optional
exclude samples from the listed classes in the plot
- draw: bool
If True calls bokeh.plotting.show on figure.
- fig_params: dict
key-value parameters to pass to bokeh figure
- scatter_params: dict
key-value parameters to pass to bokeh circle
- Returns:
- bokeh.plotting.figure
- pca_loadings(x_pc=1, y_pc=2, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) figure¶
plots PCA loadings.
- Parameters:
- x_pc: int
Principal component number to plot along X axis.
- y_pc: int
Principal component number to plot along Y axis.
- scaling: {`autoscaling`, `rescaling`, `pareto`}, optional
scaling method.
- normalization: {`sum`, `max`, `euclidean`}, optional
normalizing method
- draw: bool
If True, calls bokeh.plotting.show on figure
- fig_params: dict, optional
Optional parameters to pass into bokeh figure
- scatter_params: dict, optional
Optional parameters to pass into bokeh scatter plot.
- Returns:
- bokeh.plotting.figure.
- pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) figure¶
plots PCA scores
- Parameters:
- x_pc: int
Principal component number to plot along X axis.
- y_pc: int
Principal component number to plot along Y axis.
- hue: {“class”, “type”, “batch”}
How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.
- ignore_classes: list[str], optional
classes in the data to ignore to build the PCA model.
- show_order: bool
add a label with the run order.
- scaling: {`autoscaling`, `rescaling`, `pareto`}, optional
scaling method.
- normalization: {`sum`, `max`, `euclidean`}, optional
normalization method
- draw: bool
If True calls bokeh.plotting.show on fig.
- fig_params: dict, optional
Optional parameters to pass to bokeh figure
- scatter_params: dict, optional
Optional parameters to pass to bokeh scatter plot.
- Returns:
- bokeh.plotting.figure.
- exception ClassNameError¶
Error class raised when using invalid class names
- class DataContainer(data_matrix: DataFrame, feature_metadata: DataFrame, sample_metadata: DataFrame, mapping: dict | None = None, plot_mode: str = 'bokeh')¶
Container object that stores processed metabolomics data.
The data is separated into three attributes: data_matrix, sample_metadata and feature_metadata. Each one is a pandas DataFrame. Apart from using the constructor, DataContainers can be created with static methods that import data in common formats (such as XCMS, MZMine2, Progenesis, etc.).
See also
from_progenesis, from_pickle, MetricMethods, PlotMethods, PreprocessMethods
- Attributes:
- data_matrix: DataFrame
feature values for each sample. Data is organized in a “tidy” way: each row is an observation, each column is a feature. dtype must be a float and all values should be non-negative, but NANs are fine.
- sample_metadata: DataFrame
Metadata associated to each sample (eg: sample class). Has the same index as the data_matrix. class (standing for sample class) is a required column. Analytical batch and run order information can be included under the batch and order columns. Both must be integer numbers, and the run order must be unique for each sample. If the run order is specified in a per-batch fashion, the values will be converted to a unique value.
- feature_metadata: DataFrame
Metadata associated to each feature (eg: mass to charge ratio (mz), retention time (rt), etc…). The index is equal to the data_matrix column. “mz” and “rt” are required columns.
- mapping: dictionary of sample types to a list of sample classes.
Maps sample types to sample classes. Valid sample types are qc, blank, sample or suitability. Values are lists of sample classes. Mapping is used by Processor objects to define a default behavior. For example, when using a BlankCorrector, the blank contribution to each feature is estimated using the sample classes that are values of the blank sample type.
- metrics: methods to compute common feature metrics.
- plot: methods to plot features.
- preprocess: methods to perform common preprocessing tasks.
- id: pd.Series[str]. Name id of each sample.
- batch: pd.Series[int]. Analytical batch number.
- order: pd.Series[int]. Run order in which samples were analyzed. It must be a unique integer for each sample.
Methods
remove(remove, axis): Remove samples/features from the DataContainer.
reset(reset_mapping=True): Reset the DataContainer, i.e. recover removed samples/features and transformed values.
is_valid_class_name(test_class): Check if a class is present in the DataContainer.
diagnose(): Create a dictionary with information about the status of the DataContainer. Used by Processor objects as a validity check.
select_features(mzq, rtq, mz_tol=0.01, rt_tol=5): Search features within an m/z and rt tolerance.
set_default_order(): Assign a default run order to the samples, assuming that the data matrix is already sorted by run order.
sort(field, axis): Sort features/samples using metadata information.
save(filename): Save the DataContainer as a pickle.
See help(DataContainer) for more details
- Parameters:
- data_matrix: pandas.DataFrame
Feature values for each measured sample. Each row is a sample and each column is a feature.
- sample_metadata: pandas.DataFrame
Metadata for each sample. class is a required column.
- feature_metadata: pandas.DataFrame
DataFrame with feature names as indices. mz and rt are required columns.
- mapping: dict or None
If dict, maps each sample type to a list of sample classes.
- plot_mode: {“seaborn”, “bokeh”}
The package used to generate plots with the plot methods
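The mapping parameter relates sample types to sample classes. The following is a minimal sketch of such a dictionary in plain Python, together with an inverted class-to-type lookup. The class names ("QC", "SV", "healthy", "disease") and the helper invert_mapping are hypothetical and are not part of tidyms.

```python
# Sample types listed in the description of the mapping attribute.
VALID_SAMPLE_TYPES = {"qc", "blank", "sample", "suitability"}

# Hypothetical class names; in practice these are the values found in the
# "class" column of sample_metadata.
mapping = {
    "qc": ["QC"],
    "blank": ["SV"],
    "sample": ["healthy", "disease"],
}


def invert_mapping(mapping):
    """Return a {sample class: sample type} lookup (illustrative helper)."""
    class_to_type = {}
    for sample_type, classes in mapping.items():
        if sample_type not in VALID_SAMPLE_TYPES:
            raise ValueError(f"invalid sample type: {sample_type}")
        for sample_class in classes:
            class_to_type[sample_class] = sample_type
    return class_to_type


class_to_type = invert_mapping(mapping)
```

A dictionary shaped like mapping is what the mapping parameter expects according to the description above; the inverted lookup is only a convenience for inspecting which type a class belongs to.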
- add_order_from_csv(path: str | TextIO, interbatch_order: bool = True) None¶
adds sample order and sample batch using information from a csv file. A column with the name sample with the same values as the index of the DataContainer sample_metadata must be provided. order information is taken from a column with name order and the same is done with batch information. order data must be positive integers and each batch must have unique values. Each batch must be identified with a positive integer.
- Parameters:
- path: str
path to the file with order data. Data format is inferred from the file extension.
- interbatch_order: bool
If True converts the order value to a unique value for the whole DataContainer. This makes plotting the data as a function of order easier.
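To illustrate what converting a per-batch order into a unique order can look like, here is a minimal sketch in plain Python. It assumes that batches were analyzed in increasing batch number and that each batch is shifted by the largest order value reached so far; the function name to_interbatch_order is hypothetical and the conversion used by tidyms may differ.

```python
def to_interbatch_order(batch, order):
    """Convert per-batch run order into a unique order across batches.

    Assumption: batches run in increasing batch number, and each batch is
    offset by the sum of the maximum order values of the previous batches.
    """
    offsets = {}
    offset = 0
    for b in sorted(set(batch)):
        offsets[b] = offset
        offset += max(o for bb, o in zip(batch, order) if bb == b)
    return [o + offsets[b] for b, o in zip(batch, order)]


batch = [1, 1, 1, 2, 2, 2]
order = [1, 2, 3, 1, 2, 3]  # order restarts in every batch
unique_order = to_interbatch_order(batch, order)
# unique_order is [1, 2, 3, 4, 5, 6]
```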
- diagnose() dict¶
Check if DataContainer has information to perform several correction types
- Returns:
- diagnostic: dict
Each value is a bool indicating the status. empty is True if the size in at least one dimension of the data matrix is zero; “missing” is True if there are NANs in the data matrix; “order” is True if there is run order information for the samples; “batch” is True if there is batch number information associated to the samples.
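The four keys described above can be sketched in plain Python as follows. This is an illustration of the meaning of each key, not the tidyms implementation: the data matrix is represented as a list of rows and the sample metadata as a dictionary of columns, and the function name diagnose_sketch is hypothetical.

```python
import math


def diagnose_sketch(data_matrix, sample_metadata):
    """Build a status dictionary with the keys described for diagnose().

    data_matrix: list of rows, one per sample.
    sample_metadata: dict mapping column names to lists of values.
    """
    n_samples = len(data_matrix)
    n_features = len(data_matrix[0]) if n_samples else 0
    return {
        "empty": n_samples == 0 or n_features == 0,
        "missing": any(math.isnan(x) for row in data_matrix for x in row),
        "order": "order" in sample_metadata,
        "batch": "batch" in sample_metadata,
    }


status = diagnose_sketch(
    [[1.0, 2.0], [float("nan"), 4.0]],
    {"class": ["QC", "healthy"], "order": [1, 2]},
)
# status["missing"] and status["order"] are True, the other values are False
```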
- static from_pickle(path: str | BinaryIO)¶
read a DataContainer stored as a pickle
- Parameters:
- path: str or file
path to read DataContainer
- Returns:
- DataContainer
- static from_progenesis(path: str | TextIO)¶
Read a progenesis file into a DataContainer
- Parameters:
- path: str or file
path to a Progenesis csv output or file object
- Returns:
- dc: DataContainer
- is_valid_class_name(test_class: str | List[str]) bool¶
Check if at least one sample class is `test_class`.
- Parameters:
- test_class: str or list[str]
classes to search in the DataContainer.
- Returns:
- is_valid: bool
- property order: Series¶
pd.Series[int] : Run order in which samples were analyzed. It must be a unique integer for each sample.
- remove(remove: Iterable[str], axis: str)¶
Remove selected features / samples
- Parameters:
- remove: Iterable[str]
List of sample/feature names to remove.
- axis: {“features”, “samples”}
- reset(reset_mapping: bool = True)¶
Reloads the original data matrix.
- Parameters:
- reset_mapping: bool
If True, clears sample classes from the mapping.
- save(filename: str) None¶
Save DataContainer into a pickle
- Parameters:
- filename: str
name used to save the file.
- select_features(mzq: float, rtq: float, mz_tol: float = 0.01, rt_tol: float = 5) Index¶
Find feature names within the defined mass-to-charge and retention time tolerance.
- Parameters:
- mzq: positive number
Mass-to-charge value to search
- rtq: positive number
Retention time value to search
- mz_tol: positive number
Mass-to-charge tolerance used in the search.
- rt_tol: positive number
Retention time tolerance used in the search.
- Returns:
- Index
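The tolerance search can be pictured with a small plain-Python sketch. It assumes that both tolerance bounds are inclusive and represents feature metadata as a dictionary; the feature names, values and the function name select_features_sketch are made up for illustration.

```python
def select_features_sketch(feature_metadata, mzq, rtq, mz_tol=0.01, rt_tol=5):
    """Return feature names whose mz and rt lie within the tolerances.

    feature_metadata: {feature name: {"mz": float, "rt": float}}.
    Assumption: the tolerance bounds are inclusive.
    """
    return [
        name
        for name, ft in feature_metadata.items()
        if abs(ft["mz"] - mzq) <= mz_tol and abs(ft["rt"] - rtq) <= rt_tol
    ]


# Hypothetical feature table.
features = {
    "FT001": {"mz": 180.0634, "rt": 125.0},
    "FT002": {"mz": 180.0650, "rt": 128.0},
    "FT003": {"mz": 180.0634, "rt": 300.0},  # rt too far from the query
    "FT004": {"mz": 255.2330, "rt": 126.0},  # mz too far from the query
}
selected = select_features_sketch(features, mzq=180.063, rtq=126.0)
# selected is ["FT001", "FT002"]
```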
- set_default_order()¶
set the order of the samples, assuming that the data is already sorted.
- set_plot_mode(mode: str)¶
Set the library used to generate plots.
- Parameters:
- mode: {“bokeh”, “seaborn”}
- sort(field: str, axis: str)¶
Sort samples/features in place using metadata values.
- Parameters:
- field: str
field to sort by. Must be a column of sample_metadata or feature_metadata.
- axis: {“samples”, “features”}
- to_csv(filename: str) None¶
Save the DataContainer into a csv file.
- Parameters:
- filename: str
- exception DilutionInformationError¶
Error class raised when no dilution factor information has been provided.
- exception EmptyDataContainerError¶
Error class raised when remove leaves an empty DataContainer.
- class MetricMethods(data: DataContainer)¶
Methods to compute feature metrics from a DataContainer
Methods
cv: Computes the coefficient of variation for each feature.
dratio: Computes the D-Ratio of features, a metric used to compare technical to biological variation.
detection_rate: Computes the ratio of samples where a feature was detected.
pca: Computes the PCA scores, loadings and PC variance.
- correlation(field: str, mode: str = 'ols', classes: List[str] | None = None)¶
Correlates features with sample metadata properties.
- Parameters:
- field: str
A column of sample_metadata. Must have a numeric dtype.
- mode: {“ols”, “spearman”}
ols computes the ordinary least squares linear regression and, for each feature, the Pearson r squared, the p-value for the Jarque-Bera test and the Durbin-Watson statistic. spearman computes the Spearman rank correlation coefficient for each feature.
- classes: List[str], optional
Compute the correlation on the selected classes only. If None, computes the correlation on all samples.
- Returns:
- pandas.Series or pandas.DataFrame
- cv(groupby: str | List[str] | None = 'class', robust: bool = False, fill_value: float = inf)¶
Computes the Coefficient of variation for each feature.
The coefficient of variation is the quotient between the standard deviation and the mean of a feature.
- Parameters:
- groupby: str, List[str] or None
If groupby is a column or a list of columns of sample metadata, the values of CV are computed on a per group basis. If None, the CV is computed for all samples in the data.
- robust: bool
If True, computes the relative MAD. Else, computes the Coefficient of variation.
- fill_value: float
Value used to replace NaN. By default, NaNs are replaced by np.inf.
- Returns:
- pd.Series or pd.DataFrame
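As a worked illustration of the definition above, the following plain-Python sketch computes the CV of one feature per sample class. It assumes the sample standard deviation (n - 1 denominator) and, for the robust variant, an unscaled MAD divided by the median; tidyms may use different conventions. The function name cv_sketch and the data are hypothetical.

```python
import math
import statistics


def cv_sketch(values, robust=False, fill_value=math.inf):
    """CV (std / mean) or, if robust, relative MAD (MAD / median).

    Assumptions: sample standard deviation and an unscaled MAD. An undefined
    result (zero denominator) is replaced by fill_value.
    """
    if robust:
        center = statistics.median(values)
        spread = statistics.median([abs(x - center) for x in values])
    else:
        center = statistics.mean(values)
        spread = statistics.stdev(values)
    return spread / center if center != 0 else fill_value


# One feature measured in six samples, grouped by sample class.
intensity = [100.0, 110.0, 90.0, 50.0, 150.0, 100.0]
sample_class = ["QC", "QC", "QC", "healthy", "healthy", "healthy"]

groups = {}
for c, x in zip(sample_class, intensity):
    groups.setdefault(c, []).append(x)

cv_per_class = {c: cv_sketch(v) for c, v in groups.items()}
# QC: std 10 / mean 100 = 0.1; healthy: std 50 / mean 100 = 0.5
```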
- detection_rate(groupby: str | List[str] | None = 'class', threshold=0)¶
Computes the fraction of samples where a feature was detected.
- Parameters:
- groupby: str, List[str] or None
If groupby is a column or a list of columns of sample metadata, the detection rate is computed on a per group basis. If None, the detection rate is computed for all samples in the data.
- threshold: float
Minimum value to consider a feature detected
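A minimal plain-Python sketch of the detection rate of one feature follows. It assumes a strict comparison against the threshold, so that with the default threshold of 0 a zero value counts as not detected, and it treats NaN as not detected. The function name detection_rate_sketch is hypothetical.

```python
def detection_rate_sketch(values, threshold=0):
    """Fraction of values above threshold.

    Assumption: the comparison is strict. NaN compares as False and is
    therefore counted as not detected.
    """
    detected = [x > threshold for x in values]
    return sum(detected) / len(detected)


intensity = [0.0, 120.0, 95.0, float("nan")]
rate = detection_rate_sketch(intensity)
# two of four values are above 0, so rate is 0.5
```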
- dratio(robust=False) Series¶
Computes the ratio between the sample variation and the quality control variation.
The D-Ratio is useful to compare technical to biological variation and to identify non-informative features.
- Parameters:
- robust: bool
If True, uses MAD to compute the D-ratio. Else, uses standard deviation.
- Returns:
- dr: pd.Series
D-Ratio for each feature
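The sketch below illustrates a D-Ratio computation for one feature in plain Python. The text above does not state which dispersion is the numerator; this example assumes that the QC dispersion is divided by the study sample dispersion, so that low values indicate that technical variation is small relative to biological variation. The sample standard deviation, the unscaled MAD and the function name dratio_sketch are also assumptions.

```python
import statistics


def dratio_sketch(qc_values, sample_values, robust=False):
    """Dispersion of QC values divided by dispersion of study sample values.

    Assumptions: QC dispersion is the numerator; the standard deviation uses
    the n - 1 denominator; the MAD is not scaled.
    """

    def mad(values):
        med = statistics.median(values)
        return statistics.median([abs(x - med) for x in values])

    spread = mad if robust else statistics.stdev
    return spread(qc_values) / spread(sample_values)


qc = [98.0, 100.0, 102.0]        # std = 2
samples = [60.0, 100.0, 140.0]   # std = 40
d = dratio_sketch(qc, samples)
# d is 2 / 40 = 0.05
```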
- pca(n_components: int | None = 2, normalization: str | None = None, scaling: str | None = None, ignore_classes: List[str] | None = None)¶
Computes PCA scores, loadings and PC variance of each component.
- Parameters:
- n_components: int
Number of Principal components to compute.
- scaling: {`autoscaling`, `rescaling`, `pareto`}, optional
scaling method.
- normalization: {`sum`, `max`, `euclidean`}, optional
normalizing method
- ignore_classes: list[str], optional
classes in the data to ignore to build the PCA model.
- Returns:
- scores: np.array
- loadings: np.array
- variance: np.array
Explained variance for each component.
- total_variance: float
Total variance of the scaled data.
- class PreprocessMethods(dc: DataContainer)¶
Common Preprocessing operations.
Methods
normalize(method, inplace=True): Adjust sample values.
scale(method, inplace=True): Adjust feature distribution values.
transform(method, inplace=True): element-wise transformations of data.
- correct_batches(min_qc_dr: float = 0.9, first_n_qc: int | None = None, threshold: float = 0.0, frac: float | None = None, n_jobs: int | None = None, verbose: bool = False)¶
Correct time-dependent systematic bias along samples due to variation in instrumental response.
- Parameters:
- min_qc_dr: float
minimum fraction of QC where a feature was detected. See the notes for an explanation of how this value is computed.
- first_n_qc: int, optional
The number of first QC samples used to estimate the expected value for each feature in the QC. If None uses all QC samples in a batch. See notes for an explanation of its use.
- threshold: float
Minimum value to consider a feature detected. Used to compute the detection rate of each feature in the QC samples. Only features in QC samples above this value are used to compute the correction factor.
- frac: float, optional
frac parameter of the LOESS model. If None, the best value for each feature is estimated using cross validation.
- n_jobs: int or None, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose: bool
If True displays a progress bar.
Notes
The correction is applied as follows:
Split the data matrix using the batch number.
For each feature in a batch, compute an intra-batch correction that removes time-dependent variations.
Once the features have been corrected in all batches, apply an inter-batch correction that corrects the mean across different batches.
A detailed explanation of the correction algorithm can be found here.
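The three steps in the notes can be sketched for a single feature in plain Python. This is a deliberately simplified illustration and not the tidyms algorithm: a straight line fitted to the QC samples stands in for the LOESS model, the correction is additive, every batch is moved to the mean of all QC values, and the detection rate checks (min_qc_dr, threshold) and first_n_qc are ignored. All names and data are hypothetical.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares fit; returns (slope, intercept)."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def correct_feature(batch, order, is_qc, values):
    """Simplified two-stage correction of one feature (additive)."""
    global_qc_mean = statistics.mean([v for v, q in zip(values, is_qc) if q])
    corrected = list(values)
    for b in sorted(set(batch)):  # step 1: split by batch number
        idx = [i for i, bb in enumerate(batch) if bb == b]
        qc_idx = [i for i in idx if is_qc[i]]
        qc_order = [order[i] for i in qc_idx]
        qc_values = [values[i] for i in qc_idx]
        # Linear trend of the QC samples (stand-in for LOESS).
        slope, intercept = fit_line(qc_order, qc_values)
        batch_qc_mean = statistics.mean(qc_values)
        for i in idx:
            trend = slope * order[i] + intercept
            # step 2: intra-batch, remove the time-dependent trend
            intra = values[i] - trend + batch_qc_mean
            # step 3: inter-batch, align the batch mean with the global mean
            corrected[i] = intra - batch_qc_mean + global_qc_mean
    return corrected


batch = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
order = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
is_qc = [True, False, True, False, True, True, False, True, False, True]
values = [100.0, 120.0, 96.0, 80.0, 92.0, 110.0, 130.0, 108.0, 90.0, 106.0]
corrected = correct_feature(batch, order, is_qc, values)
```

In this toy data set the QC samples drift linearly within each batch, so after the correction all QC values coincide with the global QC mean of 102 and the study samples are shifted by the same trend.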
- normalize(method: str, inplace: bool = True, feature: str | None = None) DataFrame | None¶
Normalize samples.
- Parameters:
- method: {“sum”, “max”, “euclidean”}
Normalization method. sum normalizes using the sum along each row, max normalizes using the maximum of each row. euclidean normalizes using the euclidean norm of the row.
- inplace: bool
if True modifies in place the DataContainer. Else, returns a normalized data matrix.
- feature: str, optional
Feature used for normalization in feature mode.
- Returns:
- normalized: pandas.DataFrame
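The three normalization methods can be illustrated on a single sample (one row of the data matrix) with plain Python. This is a sketch of the arithmetic described above, not the tidyms implementation; the function name normalize_row is hypothetical.

```python
import math


def normalize_row(row, method):
    """Divide a sample (row) by its sum, maximum or Euclidean norm."""
    if method == "sum":
        factor = sum(row)
    elif method == "max":
        factor = max(row)
    elif method == "euclidean":
        factor = math.sqrt(sum(x * x for x in row))
    else:
        raise ValueError(f"unknown method: {method}")
    return [x / factor for x in row]


row = [3.0, 4.0]
by_sum = normalize_row(row, "sum")         # divides by 7
by_max = normalize_row(row, "max")         # divides by 4
by_norm = normalize_row(row, "euclidean")  # divides by 5
```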
- scale(method: str, inplace: bool = True) DataFrame | None¶
scales features using different methods.
- Parameters:
- method: {“autoscaling”, “rescaling”, “pareto”}
Scaling method. autoscaling performs mean centering and scaling of features to unit variance. rescaling scales data to a 0-1 range. pareto performs mean centering and scaling using the square root of the standard deviation.
- inplace: bool
if True modifies in place the DataContainer. Else, returns a scaled data matrix.
- Returns:
- scaled: pandas.DataFrame
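The scaling methods act on features (columns). The sketch below applies each one to a single column in plain Python, assuming the sample standard deviation (n - 1 denominator); tidyms may use a different convention. The function name scale_column is hypothetical.

```python
import math
import statistics


def scale_column(column, method):
    """Scale one feature (column) with the named method."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # assumption: n - 1 denominator
    if method == "autoscaling":
        return [(x - mean) / std for x in column]
    if method == "pareto":
        return [(x - mean) / math.sqrt(std) for x in column]
    if method == "rescaling":
        lo, hi = min(column), max(column)
        return [(x - lo) / (hi - lo) for x in column]
    raise ValueError(f"unknown method: {method}")


column = [2.0, 4.0, 6.0]  # mean 4, standard deviation 2
autoscaled = scale_column(column, "autoscaling")  # [-1.0, 0.0, 1.0]
rescaled = scale_column(column, "rescaling")      # [0.0, 0.5, 1.0]
pareto = scale_column(column, "pareto")           # centered, divided by sqrt(2)
```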
- transform(method: str, inplace: bool = True) DataFrame | None¶
Perform element-wise data transformations.
- Parameters:
- method: {“log”, “power”}
transform method. log applies the base 10 logarithm on the data. power
- inplace: bool
if True modifies in place the DataContainer. Else, returns a transformed data matrix.
- Returns:
- transformed: pandas.DataFrame
- exception RunOrderError¶
Error class raised when there is no run order information
- class SeabornPlotMethods(data: DataContainer)¶
Methods to plot feature data from a DataContainer using Matplotlib/Seaborn.
- correlation_histogram(class_: str | None = None, **hist_params)¶
Plots the distribution of correlation of feature pairs for a given class.
Used with groups of replicates to assess time-dependent variations.
- Parameters:
- class_: str or None, default=None
- Returns:
- matplotlib.axes.Axes
- Other Parameters:
- **hist_params: dict
Parameters to pass to seaborn histplot function.
- pca_loadings(x_pc: int = 1, y_pc: int = 2, ignore_classes: List[str] | None = None, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)¶
plots PCA loadings using seaborn relplot function.
- Parameters:
- x_pc: int
Principal component number to plot along X axis.
- y_pc: int
Principal component number to plot along Y axis.
- ignore_classes: list[str], optional
classes in the data to ignore to build the PCA model.
- scaling: {autoscaling, rescaling, pareto}, optional
scaling method.
- normalization: {sum, max, euclidean}, optional
normalization method
- relplot_params: dict, optional
key-values to pass to relplot function.
- Returns:
- seaborn.FacetGrid
- pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)¶
plots PCA scores using seaborn relplot function.
- Parameters:
- x_pc: int
Principal component number to plot along X axis.
- y_pc: int
Principal component number to plot along Y axis.
- hue: {“class”, “type”, “batch”}
How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.
- ignore_classes: list[str], optional
classes in the data to ignore to build the PCA model.
- show_order: bool
add a label with the run order.
- scaling: {autoscaling, rescaling, pareto}, optional
scaling method.
- normalization: {sum, max, euclidean}, optional
normalization method
- relplot_params: dict, optional
key-values to pass to relplot function.
- Returns:
- seaborn.FacetGrid