tidyms.container¶

Objects used to store and manage metabolomics data

Objects¶

DataContainer: Stores metabolomics data.

Exceptions¶

BatchInformationError
RunOrderError
ClassNameError
EmptyDataContainerError

Usage¶

DataContainers can be created in two different ways other than using the constructor:

Using the functions in the fileio module to read data processed with a third party software (XCMS, MZMine2, etc…)
Performing a feature correspondence algorithm on features detected from raw data (not implemented yet…)

exception BatchInformationError¶: Error class raised when there is no batch information

class BokehPlotMethods(data: DataContainer)¶

Methods to plot data from a DataContainer. Generates Bokeh Figures.

Methods

pca_scores()
pca_loadings()
feature()

feature(ft: str, hue: str = 'class', ignore_classes: List[str] | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) → figure¶

Plots the intensity of a feature as a function of the run order.

Parameters:

ft: str: Feature to plot. Index of feature in feature_metadata
hue: {“class”, “type”}
ignore_classes: list[str], optional: exclude samples from the listed classes in the plot
draw: bool: If True calls bokeh.plotting.show on figure.
fig_params: dict: key-value parameters to pass to bokeh figure
scatter_params: dict: key-value parameters to pass to bokeh circle

Returns:

bokeh.plotting.figure

pca_loadings(x_pc=1, y_pc=2, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) → figure¶

plots PCA loadings.

Parameters:

x_pc: int: Principal component number to plot along X axis.
y_pc: int: Principal component number to plot along Y axis.
scaling: {`autoscaling`, `rescaling`, `pareto`}, optional: scaling method.
normalization: {`sum`, `max`, `euclidean`}, optional: normalizing method
draw: bool: If True, calls bokeh.plotting.show on figure
fig_params: dict, optional: Optional parameters to pass into bokeh figure
scatter_params: dict, optional: Optional parameters to pass into bokeh scatter plot.

Returns:

bokeh.plotting.figure.

pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) → figure¶

plots PCA scores

Parameters:

x_pc: int: Principal component number to plot along X axis.
y_pc: int: Principal component number to plot along Y axis.
hue: {“class”, “type”, “batch”}: How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.
ignore_classes: list[str], optional: classes in the data to ignore when building the PCA model.
show_order: bool: add a label with the run order.
scaling: {`autoscaling`, `rescaling`, `pareto`}, optional: scaling method.
normalization: {`sum`, `max`, `euclidean`}, optional: normalization method
draw: bool: If True calls bokeh.plotting.show on fig.
fig_params: dict, optional: Optional parameters to pass to bokeh figure
scatter_params: dict, optional: Optional parameters to pass to bokeh scatter plot.

Returns:

bokeh.plotting.figure.

exception ClassNameError¶: Error class raised when using invalid class names

class DataContainer(data_matrix: DataFrame, feature_metadata: DataFrame, sample_metadata: DataFrame, mapping: dict | None = None, plot_mode: str = 'bokeh')¶

Container object that stores processed metabolomics data.

The data is separated into three attributes: data_matrix, sample_metadata and feature_metadata. Each one is a pandas DataFrame. Apart from using the constructor, DataContainers can be created with static methods that import data in common formats (such as XCMS, MZMine2, Progenesis, etc.).

See also

from_progenesis
from_pickle
MetricMethods
PlotMethods
PreprocessMethods

Attributes:

data_matrix: DataFrame: feature values for each sample. Data is organized in a “tidy” way: each row is an observation, each column is a feature. dtype must be float and all values should be non-negative, but NaNs are allowed.
sample_metadata: DataFrame: Metadata associated with each sample (e.g. sample class). Has the same index as the data_matrix. class (standing for sample class) is a required column. Analytical batch and run order information can be included under the batch and order columns. Both must be integers, and the run order must be unique for each sample. If the run order is specified in a per-batch fashion, the values will be converted to a unique value.
feature_metadata: DataFrame: Metadata associated with each feature (e.g. mass-to-charge ratio (mz), retention time (rt), etc.). The index is equal to the data_matrix columns. “mz” and “rt” are required columns.
mapping: dictionary of sample types to a list of sample classes: Maps sample types to sample classes. Valid sample types are qc, blank, sample or suitability. Values are lists of sample classes. Mapping is used by Processor objects to define a default behavior. For example, when using a BlankCorrector, the blank contribution to each feature is estimated using the sample classes that are values of the blank sample type.
metrics: methods to compute common feature metrics.
plot: methods to plot features.
preprocess: methods to perform common preprocessing tasks.
id: pd.Series[str]: name id of each sample.
batch: pd.Series[int]: Analytical batch number.
order: pd.Series[int]: Run order in which samples were analyzed. It must be a unique integer for each sample.
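
As a small illustration of the mapping attribute described above, the sketch below builds a mapping dictionary and inverts it to look up the sample type of each class. The class names ("QC", "SV", "healthy", "disease") and the helper class_to_type are hypothetical and are not part of tidyms; only the four sample types come from the description above.

```python
# Valid sample types, as listed in the mapping description.
VALID_TYPES = {"qc", "blank", "sample", "suitability"}

# Hypothetical mapping: sample type -> list of sample classes.
mapping = {
    "qc": ["QC"],
    "blank": ["SV"],
    "sample": ["healthy", "disease"],
}


def class_to_type(mapping):
    """Invert a type -> classes mapping into a class -> type lookup.

    Hypothetical helper for illustration; not a tidyms function.
    """
    unknown = set(mapping) - VALID_TYPES
    if unknown:
        raise ValueError(f"invalid sample types: {sorted(unknown)}")
    return {cls: stype for stype, classes in mapping.items() for cls in classes}


lookup = class_to_type(mapping)
print(lookup["SV"])  # blank
```

A dictionary of this shape is what the mapping parameter of the constructor expects, going by the description above.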

Methods

remove(remove, axis): Remove samples/features from the DataContainer.
reset(reset_mapping=True): Reset the DataContainer, i.e. recover removed samples/features and transformed values.
is_valid_class_name(test_class): Checks if a class is present in the DataContainer.
diagnose(): Creates a dictionary with information about the status of the DataContainer. Used by Processor objects as a validity check.
select_features(mzq, rtq, mz_tol=0.01, rt_tol=5): Search features within an m/z and rt tolerance.
set_default_order(): Assigns a default run order to the samples, assuming that the data matrix is already sorted by run order.
sort(field, axis): Sort features/samples using metadata information.
save(filename): Save the DataContainer as a pickle.

See help(DataContainer) for more details

Parameters:

data_matrix: pandas.DataFrame: Feature values for each measured sample. Each row is a sample and each column is a feature.
sample_metadata: pandas.DataFrame: Metadata for each sample. class is a required column.
feature_metadata: pandas.DataFrame: DataFrame with feature names as indices. mz and rt are required columns.
mapping: dict or None: if dict, set each sample class to sample type.
plot_mode: {“seaborn”, “bokeh”}: The package used to generate plots with the plot methods.

add_order_from_csv(path: str | TextIO, interbatch_order: bool = True) → None¶

Adds sample order and sample batch using information from a csv file. A column named sample, with the same values as the index of the DataContainer sample_metadata, must be provided. Order information is taken from a column named order, and batch information from a column named batch. Order values must be positive integers and must be unique within each batch. Each batch must be identified with a positive integer.

Parameters:

path: str: path to the file with order data. Data format is inferred from the file extension.
interbatch_order: bool: If True, converts the order value to a unique value for the whole DataContainer. This makes plotting the data as a function of order easier.
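
To make the expected file layout and the effect of interbatch_order concrete, here is a minimal sketch in plain Python. It does not call tidyms: the file content, the helpers read_order and to_interbatch_order, and the offset rule (each batch is shifted by the largest order value of the preceding batches) are assumptions made for illustration, and the actual conversion may differ.

```python
import csv
import io

# Hypothetical file content with the required "sample", "order" and "batch" columns.
CSV_TEXT = """sample,order,batch
s1,1,1
s2,2,1
s3,1,2
s4,2,2
"""


def read_order(text):
    """Parse the CSV text into a list of (sample, order, batch) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["sample"], int(row["order"]), int(row["batch"])) for row in reader]


def to_interbatch_order(rows):
    """Convert per-batch order into an order that is unique across batches.

    Assumption: each batch is shifted by the largest order value seen in
    the preceding batches.
    """
    result = {}
    offset = 0
    for batch in sorted({b for _, _, b in rows}):
        in_batch = [(s, o) for s, o, b in rows if b == batch]
        for sample, order in in_batch:
            result[sample] = order + offset
        offset += max(o for _, o in in_batch)
    return result


unique_order = to_interbatch_order(read_order(CSV_TEXT))
print(unique_order)  # {'s1': 1, 's2': 2, 's3': 3, 's4': 4}
```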

property batch: Series¶: pd.Series[int]. Analytical batch number

property classes: Series¶: pd.Series[str] : class of each sample.

diagnose() → dict¶

Check if DataContainer has information to perform several correction types

Returns:

diagnostic: dict: Each value is a bool indicating the status. “empty” is True if the size of the data matrix is zero in at least one dimension; “missing” is True if there are NaNs in the data matrix; “order” is True if there is run order information for the samples; “batch” is True if there is batch number information associated with the samples.
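
The four flags can be sketched with a plain list-of-rows matrix. This is only an illustration of the rules stated above, not the tidyms implementation; the function name and its order and batch arguments are hypothetical stand-ins for the metadata columns.

```python
import math


def diagnose_sketch(data_matrix, order=None, batch=None):
    """Return the four status flags for a list-of-rows matrix.

    Hypothetical stand-in: `order` and `batch` are None when the
    corresponding metadata is not available.
    """
    n_rows = len(data_matrix)
    n_cols = len(data_matrix[0]) if n_rows else 0
    return {
        "empty": n_rows == 0 or n_cols == 0,
        "missing": any(math.isnan(x) for row in data_matrix for x in row),
        "order": order is not None,
        "batch": batch is not None,
    }


status = diagnose_sketch([[1.0, float("nan")], [2.0, 3.0]], order=[1, 2])
print(status)
# {'empty': False, 'missing': True, 'order': True, 'batch': False}
```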

static from_pickle(path: str | BinaryIO)¶

read a DataContainer stored as a pickle

Parameters:

path: str or file: path to read DataContainer

Returns:

DataContainer

static from_progenesis(path: str | TextIO)¶

Read a progenesis file into a DataContainer

Parameters:

path: str or file: path to a Progenesis csv output or file object

Returns:

dc: DataContainer

property id: Series¶: pd.Series[str] : name id of each sample.

is_valid_class_name(test_class: str | List[str]) → bool¶

Check if at least one sample class is test_class.

Parameters:

test_class: str or list[str]: classes to search in the DataContainer.

Returns:

is_valid: bool

property order: Series¶: pd.Series[int] : Run order in which samples were analyzed. It must be a unique integer for each sample.

remove(remove: Iterable[str], axis: str)¶

Remove selected features / samples

Parameters:

remove: Iterable[str]: List of sample/feature names to remove.
axis: {“features”, “samples”}

reset(reset_mapping: bool = True)¶

Reloads the original data matrix.

Parameters:

reset_mapping: bool: If True, clears sample classes from the mapping.

save(filename: str) → None¶

Save DataContainer into a pickle

Parameters:

filename: str: name used to save the file.

select_features(mzq: float, rtq: float, mz_tol: float = 0.01, rt_tol: float = 5) → Index¶

Find feature names within the defined mass-to-charge and retention time tolerance.

Parameters:

mzq: positive number: Mass-to-charge value to search
rtq: positive number: Retention time value to search
mz_tol: positive number: Mass-to-charge tolerance used in the search.
rt_tol: positive number: Retention time tolerance used in the search.

Returns:

Index
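
The tolerance search can be pictured as a window around the query values. The sketch below is plain Python and does not call tidyms; the feature names and values are made up, and treating the window bounds as inclusive is an assumption.

```python
# Hypothetical feature metadata: feature name -> (mz, rt).
feature_metadata = {
    "FT001": (180.0634, 95.2),
    "FT002": (180.0701, 97.8),
    "FT003": (180.0640, 310.0),
    "FT004": (255.2330, 96.0),
}


def select_features_sketch(metadata, mzq, rtq, mz_tol=0.01, rt_tol=5):
    """Return the names of features inside both tolerance windows.

    Assumption: the window bounds are inclusive.
    """
    return [
        name
        for name, (mz, rt) in metadata.items()
        if abs(mz - mzq) <= mz_tol and abs(rt - rtq) <= rt_tol
    ]


matches = select_features_sketch(feature_metadata, mzq=180.0634, rtq=96.0)
print(matches)  # ['FT001', 'FT002']
```

FT003 is rejected because of its retention time and FT004 because of its m/z, so a feature must satisfy both tolerances to be selected.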

set_default_order()¶: set the order of the samples, assuming that the data is already sorted.

set_plot_mode(mode: str)¶

Set the library used to generate plots.

Parameters:

mode: {“bokeh”, “seaborn”}

sort(field: str, axis: str)¶

Sort samples/features in place using metadata values.

Parameters:

field: str: field to sort by. Must be a column of sample_metadata or feature_metadata.
axis: {“samples”, “features”}

to_csv(filename: str) → None¶

Save the DataContainer into a csv file.

Parameters:

filename: str

exception DilutionInformationError¶: Error class raised when no dilution factor information has been provided.

exception EmptyDataContainerError¶: Error class raised when remove leaves an empty DataContainer.

class MetricMethods(data: DataContainer)¶

Methods to compute feature metrics from a DataContainer

Methods

cv: Computes the coefficient of variation for each feature.
dratio: Computes the D-Ratio of features, a metric used to compare technical to biological variation.
detection_rate: Computes the ratio of samples where a feature was detected.
pca: Computes the PCA scores, loadings and PC variance.

correlation(field: str, mode: str = 'ols', classes: List[str] | None = None)¶

Correlates features with sample metadata properties.

Parameters:

field: str: A column of sample_metadata. Must have a numeric dtype.
mode: {“ols”, “spearman”}: ols computes the ordinary least squares linear regression and reports, for each feature, the Pearson r squared, the p-value for the Jarque-Bera test and the Durbin-Watson statistic. spearman computes the Spearman rank correlation coefficient for each feature.
classes: List[str], optional: Compute the correlation on the selected classes only. If None, computes the correlation on all samples.

Returns:

pandas.Series or pandas.DataFrame

cv(groupby: str | List[str] | None = 'class', robust: bool = False, fill_value: float = inf)¶

Computes the Coefficient of variation for each feature.

The coefficient of variation is the quotient between the standard deviation and the mean of a feature.

Parameters:

groupby: str, List[str] or None: If groupby is a column or a list of columns of sample metadata, the values of CV are computed on a per group basis. If None, the CV is computed for all samples in the data.
robust: bool: If True, computes the relative MAD. Else, computes the Coefficient of variation.
fill_value: float: Value used to replace NaN. By default, NaNs are replaced by np.inf.

Returns:

pd.Series or pd.DataFrame
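
The definition above can be checked by hand for a single feature. This sketch uses only the standard library rather than tidyms, and it makes two assumptions that the text does not state: the standard deviation uses the n - 1 denominator, and the relative MAD is the unscaled median absolute deviation divided by the median.

```python
import math
import statistics


def cv_sketch(values, robust=False, fill_value=math.inf):
    """Coefficient of variation (or relative MAD) of one feature.

    Assumptions: sample standard deviation (n - 1 denominator) and an
    unscaled MAD divided by the median.
    """
    if robust:
        center = statistics.median(values)
        spread = statistics.median(abs(x - center) for x in values)
    else:
        center = statistics.mean(values)
        spread = statistics.stdev(values)
    if center == 0:
        # The quotient is undefined here, so the fill value is returned.
        return fill_value
    return spread / center


qc = [100.0, 102.0, 98.0, 100.0]
print(round(cv_sketch(qc), 4))      # 0.0163
print(cv_sketch(qc, robust=True))   # 0.01
print(cv_sketch([0.0, 0.0]))        # inf
```

With groupby set to a metadata column, the same computation would be repeated once per group of samples.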

detection_rate(groupby: str | List[str] | None = 'class', threshold=0)¶

Computes the fraction of samples where a feature was detected.

Parameters:

groupby: str, List[str] or None: If groupby is a column or a list of columns of sample metadata, the detection rate is computed on a per group basis. If None, the detection rate is computed for all samples in the data.
threshold: float: Minimum value to consider a feature detected
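
A minimal sketch of the detection rate for one feature, grouped by class, is shown below. It is plain Python, not the tidyms code; the class names and intensities are made up, and counting a feature as detected only when its value is strictly greater than threshold is an assumption.

```python
def detection_rate_sketch(values, threshold=0):
    """Fraction of samples whose value exceeds the threshold.

    Assumption: a feature counts as detected when value > threshold.
    """
    detected = sum(1 for x in values if x > threshold)
    return detected / len(values)


# Hypothetical intensities of one feature, grouped by sample class.
intensities_by_class = {
    "QC": [120.0, 0.0, 95.0, 110.0],
    "blank": [0.0, 0.0, 3.0],
}
rates = {cls: detection_rate_sketch(v) for cls, v in intensities_by_class.items()}
print(rates["QC"])  # 0.75
```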

dratio(robust=False) → Series¶

Computes the ratio between the sample variation and the quality control variation.

The D-Ratio is useful to compare technical to biological variation and to identify non-informative features.

Parameters:

robust: bool: If True, uses MAD to compute the D-ratio. Else, uses standard deviation.

Returns:

dr: pd.Series: D-Ratio for each feature
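
For one feature, the metric can be sketched as below using only the standard library. The direction of the quotient is an assumption: the sketch follows the common definition of the D-Ratio (dispersion of the QC samples divided by dispersion of the study samples, so that small values indicate a feature dominated by biological variation), which may not match the tidyms implementation. The n - 1 standard deviation and the unscaled MAD are also assumptions.

```python
import statistics


def dratio_sketch(qc_values, sample_values, robust=False):
    """D-Ratio of one feature: QC dispersion over study sample dispersion.

    Assumptions: QC in the numerator, sample standard deviation
    (n - 1 denominator), unscaled MAD when robust is True.
    """
    def spread(values):
        if robust:
            med = statistics.median(values)
            return statistics.median(abs(x - med) for x in values)
        return statistics.stdev(values)

    return spread(qc_values) / spread(sample_values)


qc = [100.0, 102.0, 98.0, 100.0]
samples = [60.0, 140.0, 90.0, 110.0]
print(round(dratio_sketch(qc, samples), 4))
print(dratio_sketch(qc, samples, robust=True))
```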

pca(n_components: int | None = 2, normalization: str | None = None, scaling: str | None = None, ignore_classes: List[str] | None = None)¶

Computes PCA scores, loadings and PC variance of each component.

Parameters:

n_components: int: Number of Principal components to compute.
scaling: {`autoscaling`, `rescaling`, `pareto`}, optional: scaling method.
normalization: {`sum`, `max`, `euclidean`}, optional: normalizing method
ignore_classes: list[str], optional: classes in the data to ignore when building the PCA model.

Returns:

scores: np.array
loadings: np.array
variance: np.array: Explained variance for each component.
total_variance: float: Total variance of the scaled data.

class PreprocessMethods(dc: DataContainer)¶

Common Preprocessing operations.

Methods

normalize(method, inplace=True): Adjust sample values.
scale(method, inplace=True): Adjust feature distribution values.
transform(method, inplace=True): element-wise transformations of data.

correct_batches(min_qc_dr: float = 0.9, first_n_qc: int | None = None, threshold: float = 0.0, frac: float | None = None, n_jobs: int | None = None, verbose: bool = False)¶

Correct time-dependent systematic bias along samples due to variation in instrumental response.

Parameters:

min_qc_dr: float: minimum fraction of QC where a feature was detected. See the notes for an explanation of how this value is computed.
first_n_qc: int, optional: The number of first QC samples used to estimate the expected value for each feature in the QC. If None, uses all QC samples in a batch. See notes for an explanation of its use.
threshold: float: Minimum value to consider a feature detected. Used to compute the detection rate of each feature in the QC samples. Only features in QC samples above this value are used to compute the correction factor.
frac: float, optional: frac parameter of the LOESS model. If None, the best value for each feature is estimated using cross validation.
n_jobs: int or None, default=None: Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
verbose: bool: If True, displays a progress bar.

Notes

The correction is applied as follows:

Split the data matrix using the batch number.
For each feature in a batch compute an intra-batch correction that removes time-dependent variations.
Once the features were corrected in all batches, apply an inter-batch correction where the mean across different batches is corrected.

A detailed explanation of the correction algorithm can be found here.
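
The three steps can be sketched for a single feature as follows. This is a deliberately simplified illustration and not the tidyms algorithm: a straight line fitted to the QC samples stands in for the LOESS model, both corrections are additive, and the detection rate filtering controlled by min_qc_dr and threshold is omitted. All function names and data are hypothetical.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def correct_feature(order, batch, is_qc, values):
    """Correct one feature; a line fitted to the QCs replaces LOESS."""
    corrected = list(values)
    batches = sorted(set(batch))
    # Step 1: split the samples using the batch number.
    for b in batches:
        idx = [i for i, bi in enumerate(batch) if bi == b]
        qc = [i for i in idx if is_qc[i]]
        # Step 2: intra-batch correction. Subtract the drift fitted on
        # the QC samples and restore the QC mean of the batch.
        slope, intercept = fit_line([order[i] for i in qc], [values[i] for i in qc])
        qc_mean = statistics.mean(values[i] for i in qc)
        for i in idx:
            corrected[i] = values[i] - (slope * order[i] + intercept) + qc_mean
    # Step 3: inter-batch correction. Shift every batch so that its QC
    # mean matches the QC mean over all batches.
    global_mean = statistics.mean(c for c, q in zip(corrected, is_qc) if q)
    for b in batches:
        idx = [i for i, bi in enumerate(batch) if bi == b]
        qc_mean = statistics.mean(corrected[i] for i in idx if is_qc[i])
        for i in idx:
            corrected[i] += global_mean - qc_mean
    return corrected


order = [1, 2, 3, 4, 5, 6, 7, 8]
batch = [1, 1, 1, 1, 2, 2, 2, 2]
is_qc = [True, False, False, True, True, False, False, True]
# Synthetic feature: linear drift inside each batch plus a batch offset.
values = [98.0, 96.0, 94.0, 92.0, 125.0, 124.0, 123.0, 122.0]
corrected = correct_feature(order, batch, is_qc, values)
print(corrected)
```

Because the synthetic data contain nothing but drift and a batch offset, every corrected value ends up at the same level.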

normalize(method: str, inplace: bool = True, feature: str | None = None) → DataFrame | None¶

Normalize samples.

Parameters:

method: {“sum”, “max”, “euclidean”}: Normalization method. sum normalizes using the sum along each row, max normalizes using the maximum of each row. euclidean normalizes using the euclidean norm of the row.
inplace: bool: if True modifies in place the DataContainer. Else, returns a normalized data matrix.
feature: str, optional: Feature used for normalization in feature mode.

Returns:

normalized: pandas.DataFrame
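
The three methods differ only in the factor each row is divided by. The sketch below shows this for a single sample in plain Python; it is an illustration of the definitions above, not the tidyms code, and the function name is hypothetical.

```python
import math


def normalize_row(row, method):
    """Divide one sample (row) by its sum, maximum or Euclidean norm."""
    if method == "sum":
        factor = sum(row)
    elif method == "max":
        factor = max(row)
    elif method == "euclidean":
        factor = math.sqrt(sum(x * x for x in row))
    else:
        raise ValueError(f"unknown method: {method}")
    return [x / factor for x in row]


row = [3.0, 4.0]
print(normalize_row(row, "max"))        # [0.75, 1.0]
print(normalize_row(row, "euclidean"))  # [0.6, 0.8]
```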

scale(method: str, inplace: bool = True) → DataFrame | None¶

scales features using different methods.

Parameters:

method: {“autoscaling”, “rescaling”, “pareto”}: Scaling method. autoscaling performs mean centering and scaling of features to unit variance. rescaling scales data to a 0-1 range. pareto performs mean centering and scaling using the square root of the standard deviation.
inplace: bool: if True modifies in place the DataContainer. Else, returns a scaled data matrix.

Returns:

scaled: pandas.DataFrame
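
Scaling acts on features (columns) rather than on samples. The sketch below applies the three methods to a single column in plain Python; it illustrates the definitions above and is not the tidyms code. The use of the n - 1 standard deviation is an assumption.

```python
import math
import statistics


def scale_column(column, method):
    """Scale one feature (column).

    Assumption: sample standard deviation (n - 1 denominator).
    """
    if method == "rescaling":
        low, high = min(column), max(column)
        return [(x - low) / (high - low) for x in column]
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    if method == "autoscaling":
        divisor = std
    elif method == "pareto":
        divisor = math.sqrt(std)
    else:
        raise ValueError(f"unknown method: {method}")
    return [(x - mean) / divisor for x in column]


column = [2.0, 4.0, 6.0]
print(scale_column(column, "autoscaling"))  # [-1.0, 0.0, 1.0]
print(scale_column(column, "rescaling"))    # [0.0, 0.5, 1.0]
print(scale_column(column, "pareto"))
```

Pareto scaling divides by a smaller number than autoscaling whenever the standard deviation is greater than 1, so large features keep more of their original spread.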

transform(method: str, inplace: bool = True) → DataFrame | None¶

Perform element-wise data transformations.

Parameters:

method: {“log”, “power”}: transform method. log applies the base 10 logarithm on the data. power
inplace: bool: if True modifies in place the DataContainer. Else, returns a transformed data matrix.

Returns:

transformed: pandas.DataFrame

exception RunOrderError¶: Error class raised when there is no run order information

class SeabornPlotMethods(data: DataContainer)¶

Methods to plot feature data from a DataContainer using Matplotlib/Seaborn.

correlation_histogram(class_: str | None = None, **hist_params)¶

Plots the distribution of correlation of feature pairs for a given class.

Used with groups of replicates to assess time-dependent variations.

Parameters:

class_: str or None, default=None

Returns:

matplotlib.axes.Axes

Other Parameters:

**hist_params: dict: Parameters to pass to seaborn histplot function.

pca_loadings(x_pc: int = 1, y_pc: int = 2, ignore_classes: List[str] | None = None, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)¶

Plots PCA loadings using the seaborn relplot function.

Parameters:

x_pc: int: Principal component number to plot along X axis.
y_pc: int: Principal component number to plot along Y axis.
ignore_classes: list[str], optional: classes in the data to ignore when building the PCA model.
scaling: {autoscaling, rescaling, pareto}, optional: scaling method.
normalization: {sum, max, euclidean}, optional: normalization method
relplot_params: dict, optional: key-values to pass to relplot function.

Returns:

seaborn.FacetGrid

pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)¶

plots PCA scores using seaborn relplot function.

Parameters:

x_pc: int: Principal component number to plot along X axis.
y_pc: int: Principal component number to plot along Y axis.
hue: {“class”, “type”, “batch”}: How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.
ignore_classes: list[str], optional: classes in the data to ignore when building the PCA model.
show_order: bool: add a label with the run order.
scaling: {autoscaling, rescaling, pareto}, optional: scaling method.
normalization: {sum, max, euclidean}, optional: normalization method
relplot_params: dict, optional: key-values to pass to relplot function.

Returns:

seaborn.FacetGrid