tidyms.container

Objects used to store and manage metabolomics data

Objects

  • DataContainer: Stores metabolomics data.

Exceptions

  • BatchInformationError

  • RunOrderError

  • ClassNameError

  • EmptyDataContainerError

Usage

DataContainers can be created in two different ways other than using the constructor:

  • Using the functions in the fileio module to read data processed with third-party software (XCMS, MZMine2, etc.)

  • Running a feature correspondence algorithm on features detected from raw data (not implemented yet)

exception BatchInformationError

Error class raised when there is no batch information.

class BokehPlotMethods(data: DataContainer)

Methods to plot data from a DataContainer. Generates Bokeh Figures.

Methods

pca_scores()

pca_loadings()

feature()

feature(ft: str, hue: str = 'class', ignore_classes: List[str] | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) -> figure

Plots the intensity of a feature as a function of the run order.

Parameters:
ft: str

Feature to plot. Index of feature in feature_metadata

hue: {“class”, “type”}

ignore_classes: list[str], optional

Exclude samples from the listed classes in the plot.

draw: bool

If True calls bokeh.plotting.show on figure.

fig_params: dict

key-value parameters to pass to bokeh figure

scatter_params: dict

key-value parameters to pass to bokeh circle

Returns:
bokeh.plotting.figure
pca_loadings(x_pc=1, y_pc=2, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) -> figure

Plots PCA loadings.

Parameters:
x_pc: int

Principal component number to plot along X axis.

y_pc: int

Principal component number to plot along Y axis.

scaling: {`autoscaling`, `rescaling`, `pareto`}, optional

scaling method.

normalization: {`sum`, `max`, `euclidean`}, optional

normalizing method

draw: bool

If True, calls bokeh.plotting.show on figure

fig_params: dict, optional

Optional parameters to pass into bokeh figure

scatter_params: dict, optional

Optional parameters to pass into bokeh scatter plot.

Returns:
bokeh.plotting.figure.
pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, draw: bool = True, fig_params: dict | None = None, scatter_params: dict | None = None) -> figure

Plots PCA scores.

Parameters:
x_pc: int

Principal component number to plot along X axis.

y_pc: int

Principal component number to plot along Y axis.

hue: {“class”, “type”, “batch”}

How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping, and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.

ignore_classes: list[str], optional

Classes in the data to ignore when building the PCA model.

show_order: bool

add a label with the run order.

scaling: {`autoscaling`, `rescaling`, `pareto`}, optional

scaling method.

normalization: {`sum`, `max`, `euclidean`}, optional

normalization method

draw: bool

If True calls bokeh.plotting.show on fig.

fig_params: dict, optional

Optional parameters to pass to bokeh figure

scatter_params: dict, optional

Optional parameters to pass to bokeh scatter plot.

Returns:
bokeh.plotting.figure.
exception ClassNameError

Error class raised when using invalid class names

class DataContainer(data_matrix: DataFrame, feature_metadata: DataFrame, sample_metadata: DataFrame, mapping: dict | None = None, plot_mode: str = 'bokeh')

Container object that stores processed metabolomics data.

The data is separated into three attributes: data_matrix, sample_metadata and feature_metadata. Each one is a pandas DataFrame. Apart from using the constructor, DataContainers can be created by importing data in common formats (such as XCMS, MZMine2, Progenesis, etc.) using static methods.

Attributes:
data_matrix: DataFrame

Feature values for each sample. Data is organized in a “tidy” way: each row is an observation, each column is a feature. dtype must be float and all values should be non-negative, but NaNs are allowed.

sample_metadata: DataFrame

Metadata associated with each sample (e.g. sample class). Has the same index as the data_matrix. class (standing for sample class) is a required column. Analytical batch and run order information can be included under the batch and order columns. Both must be integers, and the run order must be unique for each sample. If the run order is specified in a per-batch fashion, the values will be converted to a unique value.

feature_metadata: DataFrame

Metadata associated with each feature (e.g. mass-to-charge ratio (mz), retention time (rt), etc.). The index is equal to the data_matrix columns. “mz” and “rt” are required columns.

mapping: dictionary of sample types to a list of sample classes

Maps sample types to sample classes. Valid sample types are qc, blank, sample or suitability. Values are lists of sample classes. The mapping is used by Processor objects to define a default behavior. For example, when using a BlankCorrector, the blank contribution to each feature is estimated using the sample classes that are values of the blank sample type.

metrics: methods to compute common feature metrics.
plot: methods to plot features.
preprocess: methods to perform common preprocessing tasks.
id

pd.Series[str] : name id of each sample.

batch

pd.Series[int]. Analytical batch number

order

pd.Series[int] : Run order in which samples were analyzed. It must be a unique integer for each sample.

Methods

remove(remove, axis): Remove samples/features from the DataContainer.

reset(reset_mapping=True): Reset the DataContainer, i.e. recover removed samples/features and transformed values.

is_valid_class_name(test_class): Checks if a class is present in the DataContainer.

diagnose(): Creates a dictionary with information about the status of the DataContainer. Used by Processor objects as a validity check.

select_features(mzq, rtq, mz_tol=0.01, rt_tol=5): Search features within an m/z and rt tolerance.

set_default_order(): Assigns a default run order to the samples, assuming that the data matrix is already sorted by run order.

sort(field, axis): Sort features/samples using metadata information.

save(filename): Save the DataContainer as a pickle.

See help(DataContainer) for more details

Parameters:
data_matrix: pandas.DataFrame

Feature values for each measured sample. Each row is a sample and each column is a feature.

sample_metadata: pandas.DataFrame

Metadata for each sample. class is a required column.

feature_metadata: pandas.DataFrame

DataFrame with feature names as indices. mz and rt are required columns.

mapping: dict or None

If dict, assigns a sample type to each sample class.

plot_mode: {“seaborn”, “bokeh”}

The package used to generate plots with the plot methods
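The mapping parameter can be pictured as a plain dictionary from sample type to a list of sample classes. The sketch below is illustrative only and does not use tidyms itself: it inverts such a dictionary to look up the type of each sample. The class names (QC, SV, healthy, disease) and the helper class_to_type are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical mapping: sample type -> list of sample classes.
mapping: Dict[str, List[str]] = {
    "qc": ["QC"],
    "blank": ["SV"],
    "sample": ["healthy", "disease"],
}


def class_to_type(mapping: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert a type -> classes mapping into a class -> type lookup."""
    lookup: Dict[str, str] = {}
    for sample_type, classes in mapping.items():
        for class_name in classes:
            lookup[class_name] = sample_type
    return lookup


# Classes as they might appear in the "class" column of sample_metadata.
sample_classes = ["QC", "healthy", "SV", "disease", "QC", "unknown"]
lookup = class_to_type(mapping)
# Classes that are not listed in the mapping get None.
sample_types: List[Optional[str]] = [lookup.get(c) for c in sample_classes]
print(sample_types)
```

Classes that do not appear in the mapping have no sample type, which is consistent with the note that sample classes without a mapping are not shown when plots are colored by type.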

add_order_from_csv(path: str | TextIO, interbatch_order: bool = True) -> None

Adds sample order and sample batch information from a csv file. The file must contain a column named sample with the same values as the index of the DataContainer sample_metadata. Order information is taken from a column named order, and batch information from a column named batch. Order values must be positive integers and must be unique within each batch. Each batch must be identified by a positive integer.

Parameters:
path: str

Path to the file with order data. Data format is inferred from the file extension.

interbatch_order: bool

If True converts the order value to a unique value for the whole DataContainer. This makes plotting the data as a function of order easier.
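One way to picture the conversion to a unique order is to offset each batch by the largest run order of the batches before it. The following plain-Python sketch works under that assumption; the function to_interbatch_order is hypothetical and tidyms may compute the value differently.

```python
from typing import Dict, List


def to_interbatch_order(batch: List[int], order: List[int]) -> List[int]:
    """Convert per-batch run order into a run order unique across batches."""
    # Largest run order observed in each batch.
    max_order: Dict[int, int] = {}
    for b, o in zip(batch, order):
        max_order[b] = max(max_order.get(b, 0), o)

    # Offset of each batch: sum of the largest orders of all earlier batches.
    offsets: Dict[int, int] = {}
    total = 0
    for b in sorted(max_order):
        offsets[b] = total
        total += max_order[b]

    return [o + offsets[b] for b, o in zip(batch, order)]


batch = [1, 1, 1, 2, 2, 2]
order = [1, 2, 3, 1, 2, 3]  # the run order restarts at 1 in each batch
global_order = to_interbatch_order(batch, order)
print(global_order)
```

With a single increasing order for the whole data set, a feature can be plotted against run order without overlapping points from different batches.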

property batch: Series

pd.Series[int]. Analytical batch number

property classes: Series

pd.Series[str] : class of each sample.

diagnose() -> dict

Check if DataContainer has information to perform several correction types

Returns:
diagnostic: dict

Each value is a bool indicating the status. “empty” is True if the size in at least one dimension of the data matrix is zero; “missing” is True if there are NaNs in the data matrix; “order” is True if there is run order information for the samples; “batch” is True if there is batch number information associated with the samples.
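The four flags can be illustrated with a small stand-alone function. This is a sketch of the checks as described in the text, using nested lists in place of a DataFrame; the name diagnose_sketch and its arguments are hypothetical and not part of tidyms.

```python
import math
from typing import Dict, List, Optional


def diagnose_sketch(
    data_matrix: List[List[float]],
    order: Optional[List[int]] = None,
    batch: Optional[List[int]] = None,
) -> Dict[str, bool]:
    """Build a status dictionary with the keys described for diagnose()."""
    n_rows = len(data_matrix)
    n_cols = len(data_matrix[0]) if n_rows else 0
    return {
        # True if at least one dimension of the data matrix has size zero.
        "empty": n_rows == 0 or n_cols == 0,
        # True if any value in the data matrix is NaN.
        "missing": any(math.isnan(x) for row in data_matrix for x in row),
        # True if run order / batch information was provided.
        "order": order is not None,
        "batch": batch is not None,
    }


status = diagnose_sketch([[1.0, float("nan")], [2.0, 3.0]], order=[1, 2])
print(status)
```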

static from_pickle(path: str | BinaryIO)

read a DataContainer stored as a pickle

Parameters:
path: str or file

path to read DataContainer

Returns:
DataContainer
static from_progenesis(path: str | TextIO)

Read a progenesis file into a DataContainer

Parameters:
path: str or file

Path to a Progenesis csv output or file object.

Returns:
dc: DataContainer
property id: Series

pd.Series[str] : name id of each sample.

is_valid_class_name(test_class: str | List[str]) -> bool

Check if at least one sample class is `test_class`.

Parameters:
test_class: str or list[str]

Classes to search in the DataContainer.

Returns:
is_valid: bool
property order: Series

pd.Series[int] : Run order in which samples were analyzed. It must be a unique integer for each sample.

remove(remove: Iterable[str], axis: str)

Remove selected features / samples

Parameters:
remove: Iterable[str]

List of sample/feature names to remove.

axis: {“features”, “samples”}
reset(reset_mapping: bool = True)

Reloads the original data matrix.

Parameters:
reset_mapping: bool

If True, clears sample classes from the mapping.

save(filename: str) -> None

Save DataContainer into a pickle

Parameters:
filename: str

name used to save the file.

select_features(mzq: float, rtq: float, mz_tol: float = 0.01, rt_tol: float = 5) -> Index

Find feature names within the defined mass-to-charge and retention time tolerance.

Parameters:
mzq: positive number

Mass-to-charge value to search

rtq: positive number

Retention time value to search

mz_tol: positive number

Mass-to-charge tolerance used in the search.

rt_tol: positive number

Retention time tolerance used in the search.

Returns:
Index
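The search can be pictured as a box filter around the query values. The sketch below uses a plain dictionary in place of feature_metadata; the feature names and values are made up, and the use of an inclusive comparison (<=) is an assumption, since the text does not say whether the tolerance bounds are included.

```python
from typing import Dict, List

# Hypothetical feature metadata: feature name -> {"mz": ..., "rt": ...}.
feature_metadata: Dict[str, Dict[str, float]] = {
    "FT001": {"mz": 200.001, "rt": 100.0},
    "FT002": {"mz": 200.500, "rt": 101.0},
    "FT003": {"mz": 200.004, "rt": 120.0},
}


def select_features_sketch(
    features: Dict[str, Dict[str, float]],
    mzq: float,
    rtq: float,
    mz_tol: float = 0.01,
    rt_tol: float = 5,
) -> List[str]:
    """Return names of features within both the m/z and rt tolerance."""
    return [
        name
        for name, ft in features.items()
        if abs(ft["mz"] - mzq) <= mz_tol and abs(ft["rt"] - rtq) <= rt_tol
    ]


matches = select_features_sketch(feature_metadata, mzq=200.0, rtq=102.0)
print(matches)
```

FT002 is rejected on m/z and FT003 on retention time, so only FT001 matches with the default tolerances.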
set_default_order()

Sets the order of the samples, assuming that the data is already sorted.

set_plot_mode(mode: str)

Set the library used to generate plots.

Parameters:
mode: {“bokeh”, “seaborn”}
sort(field: str, axis: str)

Sort samples/features in place using metadata values.

Parameters:
field: str

field to sort by. Must be a column of sample_metadata or feature_metadata.

axis: {“samples”, “features”}
to_csv(filename: str) -> None

Save the DataContainer into a csv file.

Parameters:
filename: str
exception DilutionInformationError

Error class raised when no dilution factor information has been provided.

exception EmptyDataContainerError

Error class raised when remove leaves an empty DataContainer.

class MetricMethods(data: DataContainer)

Methods to compute feature metrics from a DataContainer

Methods

cv: Computes the coefficient of variation for each feature.

dratio: Computes the D-Ratio of features, a metric used to compare technical to biological variation.

detection_rate: Computes the ratio of samples where a feature was detected.

pca: Computes the PCA scores, loadings and PC variance.

correlation(field: str, mode: str = 'ols', classes: List[str] | None = None)

Correlates features with sample metadata properties.

Parameters:
field: str

A column of sample_metadata. Must have a numeric dtype.

mode: {“ols”, “spearman”}

“ols” computes an ordinary least squares linear regression and reports the Pearson r squared, the p-value for the Jarque-Bera test and the Durbin-Watson statistic for each feature. “spearman” computes the Spearman rank correlation coefficient for each feature.

classes: List[str], optional

Compute the correlation on the selected classes only. If None, computes the correlation on all samples.

Returns:
pandas.Series or pandas.DataFrame
cv(groupby: str | List[str] | None = 'class', robust: bool = False, fill_value: float = inf)

Computes the Coefficient of variation for each feature.

The coefficient of variation is the quotient between the standard deviation and the mean of a feature.

Parameters:
groupby: str, List[str] or None

If groupby is a column or a list of columns of sample metadata, the values of CV are computed on a per group basis. If None, the CV is computed for all samples in the data.

robust: bool

If True, computes the relative MAD. Else, computes the Coefficient of variation.

fill_value: float

Value used to replace NaN. By default, NaNs are replaced by np.inf.

Returns:
pd.Series or pd.DataFrame
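For a single feature and a single group, the computation reduces to a few lines. The sketch below uses only the standard library and makes two assumptions not stated in the text: the standard deviation is the sample standard deviation, and the relative MAD is the unscaled median absolute deviation divided by the median. The name cv_sketch is hypothetical.

```python
import math
import statistics
from typing import List


def cv_sketch(values: List[float], robust: bool = False,
              fill_value: float = math.inf) -> float:
    """Coefficient of variation (or relative MAD) of one feature."""
    if robust:
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
    else:
        center = statistics.mean(values)
        spread = statistics.stdev(values)
    # For non-negative data a zero center means all values are zero, so the
    # quotient is undefined (0 / 0) and the fill value is returned instead.
    if center == 0:
        return fill_value
    return spread / center


qc_intensities = [90.0, 100.0, 110.0]
print(cv_sketch(qc_intensities))               # standard deviation / mean
print(cv_sketch(qc_intensities, robust=True))  # MAD / median
```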
detection_rate(groupby: str | List[str] | None = 'class', threshold=0)

Computes the fraction of samples where a feature was detected.

Parameters:
groupby: str, List[str] or None

If groupby is a column or a list of columns of sample metadata, the detection rate is computed on a per group basis. If None, the detection rate is computed for all samples in the data.

threshold: float

Minimum value to consider a feature detected
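As an illustration, the detection rate of one feature grouped by sample class can be computed as follows. This is a stand-alone sketch, not the tidyms code; it assumes that a feature counts as detected when its value is strictly greater than the threshold, so NaN values are never counted as detected.

```python
from collections import defaultdict
from typing import Dict, List


def detection_rate_sketch(values: List[float], classes: List[str],
                          threshold: float = 0.0) -> Dict[str, float]:
    """Fraction of samples per class where the feature exceeds threshold."""
    detected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for value, class_name in zip(values, classes):
        total[class_name] += 1
        # Comparisons with NaN are always False, so NaN is "not detected".
        if value > threshold:
            detected[class_name] += 1
    return {c: detected[c] / total[c] for c in total}


values = [10.0, 0.0, 5.0, 2.0, float("nan"), 3.0, 4.0]
classes = ["QC", "QC", "QC", "QC", "sample", "sample", "sample"]
rates = detection_rate_sketch(values, classes)
print(rates)
```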

dratio(robust=False) -> Series

Computes the ratio between the sample variation and the quality control variation.

The D-Ratio is useful to compare technical to biological variation and to identify non-informative features.

Parameters:
robust: bool

If True, uses MAD to compute the D-ratio. Else, uses standard deviation.

Returns:
dr: pd.Series

D-Ratio for each feature
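A minimal sketch of the metric for one feature is given below. The text does not state the direction of the quotient; the sketch assumes the common convention of dividing the spread of the quality control samples by the spread of the study samples, so small values indicate that technical variation is small compared with biological variation. The function names are hypothetical, and the MAD is used unscaled.

```python
import statistics
from typing import List


def mad(values: List[float]) -> float:
    """Median absolute deviation (unscaled)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def dratio_sketch(qc: List[float], samples: List[float],
                  robust: bool = False) -> float:
    """Assumed definition: spread of QC samples / spread of study samples."""
    spread = mad if robust else statistics.stdev
    return spread(qc) / spread(samples)


qc_values = [98.0, 100.0, 102.0]      # technical replicates, low spread
sample_values = [80.0, 100.0, 120.0]  # study samples, high spread
print(dratio_sketch(qc_values, sample_values))
print(dratio_sketch(qc_values, sample_values, robust=True))
```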

pca(n_components: int | None = 2, normalization: str | None = None, scaling: str | None = None, ignore_classes: List[str] | None = None)

Computes PCA scores, loadings and PC variance of each component.

Parameters:
n_components: int

Number of Principal components to compute.

scaling: {`autoscaling`, `rescaling`, `pareto`}, optional

scaling method.

normalization: {`sum`, `max`, `euclidean`}, optional

normalizing method

ignore_classes: list[str], optional

Classes in the data to ignore when building the PCA model.

Returns:
scores: np.array
loadings: np.array
variance: np.array

Explained variance for each component.

total_variance: float

Total variance of the scaled data.

class PreprocessMethods(dc: DataContainer)

Common Preprocessing operations.

Methods

normalize(method, inplace=True): Adjust sample values.

scale(method, inplace=True): Adjust feature distribution values.

transform(method, inplace=True): element-wise transformations of data.

correct_batches(min_qc_dr: float = 0.9, first_n_qc: int | None = None, threshold: float = 0.0, frac: float | None = None, n_jobs: int | None = None, verbose: bool = False)

Corrects time-dependent systematic bias along samples due to variation in instrumental response.

Parameters:
min_qc_dr: float

minimum fraction of QC where a feature was detected. See the notes for an explanation of how this value is computed.

first_n_qc: int, optional

The number of first QC samples used to estimate the expected value for each feature in the QC. If None uses all QC samples in a batch. See notes for an explanation of its use.

threshold: float

Minimum value to consider a feature detected. Used to compute the detection rate of each feature in the QC samples. Only features in QC samples above this value are used to compute the correction factor.

frac: float, optional

frac parameter of the LOESS model. If None, the best value for each feature is estimated using cross validation.

n_jobs: int or None, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

verbose: bool

If True displays a progress bar.

Notes

The correction is applied as follows:

  1. Split the data matrix using the batch number.

  2. For each feature in a batch compute an intra-batch correction that removes time-dependent variations.

  3. Once the features have been corrected in all batches, apply an inter-batch correction where the mean across different batches is corrected.

A detailed explanation of the correction algorithm can be found here.
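The three steps listed in the notes can be sketched for a single feature as follows. This is a simplified illustration and not the tidyms implementation: a straight-line fit on the QC samples stands in for the LOESS model, the correction is additive, and detection-rate filtering is omitted. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y as a function of x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def correct_feature(values: List[float], order: List[int],
                    batch: List[int], is_qc: List[bool]) -> List[float]:
    """Correct one feature; a straight line stands in for the LOESS model."""
    # 1. Split the sample positions using the batch number.
    by_batch: Dict[int, List[int]] = defaultdict(list)
    for i, b in enumerate(batch):
        by_batch[b].append(i)

    corrected = list(values)
    # 2. Intra-batch: subtract the trend fitted on the QC samples of the
    #    batch and re-center on the QC mean of that batch.
    for idx in by_batch.values():
        qc = [i for i in idx if is_qc[i]]
        slope, intercept = fit_line([order[i] for i in qc],
                                    [values[i] for i in qc])
        qc_mean = mean([values[i] for i in qc])
        for i in idx:
            corrected[i] = values[i] - (slope * order[i] + intercept) + qc_mean

    # 3. Inter-batch: shift each batch so that its QC mean matches the
    #    QC mean computed across all batches.
    global_qc_mean = mean([c for c, q in zip(corrected, is_qc) if q])
    for idx in by_batch.values():
        shift = global_qc_mean - mean([corrected[i] for i in idx if is_qc[i]])
        for i in idx:
            corrected[i] += shift
    return corrected


# Two batches with opposite drift and different intensity levels.
values = [100.0, 52.0, 54.0, 106.0, 120.0, 68.0, 66.0, 114.0]
order = [1, 2, 3, 4, 5, 6, 7, 8]
batch = [1, 1, 1, 1, 2, 2, 2, 2]
is_qc = [True, False, False, True, True, False, False, True]
corrected = correct_feature(values, order, batch, is_qc)
print(corrected)
```

In this toy data set the drift is linear by construction, so after the correction all QC samples share one value and all study samples share another, regardless of batch.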

normalize(method: str, inplace: bool = True, feature: str | None = None) -> DataFrame | None

Normalize samples.

Parameters:
method: {“sum”, “max”, “euclidean”}

Normalization method. sum normalizes using the sum along each row, max normalizes using the maximum of each row. euclidean normalizes using the euclidean norm of the row.

inplace: bool

if True modifies in place the DataContainer. Else, returns a normalized data matrix.

feature: str, optional

Feature used for normalization in feature mode.

Returns:
normalized: pandas.DataFrame
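The three row-wise normalization methods can be written out for a single sample (one row of the data matrix). This is a plain-Python sketch of the definitions given in the text; normalize_row is a hypothetical name and the feature-based mode is not covered.

```python
import math
from typing import List


def normalize_row(row: List[float], method: str) -> List[float]:
    """Divide each value of a sample by the chosen row-wise factor."""
    if method == "sum":
        factor = sum(row)
    elif method == "max":
        factor = max(row)
    elif method == "euclidean":
        factor = math.sqrt(sum(x * x for x in row))
    else:
        raise ValueError(f"Unknown normalization method: {method}")
    return [x / factor for x in row]


row = [3.0, 4.0]
print(normalize_row(row, "sum"))        # values add up to 1
print(normalize_row(row, "max"))        # largest value becomes 1
print(normalize_row(row, "euclidean"))  # row has unit euclidean norm
```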
scale(method: str, inplace: bool = True) -> DataFrame | None

Scales features using different methods.

Parameters:
method: {“autoscaling”, “rescaling”, “pareto”}

Scaling method. autoscaling performs mean centering and scaling of features to unit variance. rescaling scales data to a 0-1 range. pareto performs mean centering and scaling using the square root of the standard deviation.

inplace: bool

If True, modifies the DataContainer in place. Else, returns a scaled data matrix.

Returns:
scaled: pandas.DataFrame
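The scaling methods act on each feature (one column of the data matrix). The sketch below writes out the three definitions for a single column in plain Python; it assumes the sample standard deviation is used, which the text does not specify, and scale_column is a hypothetical name.

```python
import math
import statistics
from typing import List


def scale_column(column: List[float], method: str) -> List[float]:
    """Scale the values of one feature with the chosen method."""
    center = statistics.mean(column)
    std = statistics.stdev(column)  # sample standard deviation (assumption)
    if method == "autoscaling":
        # Mean centering and scaling to unit variance.
        return [(x - center) / std for x in column]
    if method == "rescaling":
        # Map the minimum to 0 and the maximum to 1.
        low, high = min(column), max(column)
        return [(x - low) / (high - low) for x in column]
    if method == "pareto":
        # Mean centering and scaling by the square root of the std.
        return [(x - center) / math.sqrt(std) for x in column]
    raise ValueError(f"Unknown scaling method: {method}")


column = [2.0, 4.0, 6.0]
for method in ("autoscaling", "rescaling", "pareto"):
    print(method, scale_column(column, method))
```

Pareto scaling shrinks large features less aggressively than autoscaling, which is why the scaled values keep part of their original magnitude.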
transform(method: str, inplace: bool = True) -> DataFrame | None

Perform element-wise data transformations.

Parameters:
method: {“log”, “power”}

transform method. log applies the base 10 logarithm on the data. power

inplace: bool

If True, modifies the DataContainer in place. Else, returns a transformed data matrix.

Returns:
transformed: pandas.DataFrame
exception RunOrderError

Error class raised when there is no run order information

class SeabornPlotMethods(data: DataContainer)

Methods to plot feature data from a DataContainer using Matplotlib/Seaborn.

correlation_histogram(class_: str | None = None, **hist_params)

Plots the distribution of correlation of feature pairs for a given class.

Used with groups of replicates to assess time-dependent variations.

Parameters:
class_: str or None, default=None
Returns:
matplotlib.axes.Axes
Other Parameters:
**hist_params: dict

Parameters to pass to seaborn histplot function.

pca_loadings(x_pc: int = 1, y_pc: int = 2, ignore_classes: List[str] | None = None, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)

Plots PCA loadings using seaborn relplot function.

Parameters:
x_pc: int

Principal component number to plot along X axis.

y_pc: int

Principal component number to plot along Y axis.

ignore_classes: list[str], optional

Classes in the data to ignore when building the PCA model.

scaling: {autoscaling, rescaling, pareto}, optional

Scaling method.

normalization: {sum, max, euclidean}, optional

Normalization method.

relplot_params: dict, optional

Key-values to pass to relplot function.

Returns:
seaborn.FacetGrid
pca_scores(x_pc: int = 1, y_pc: int = 2, hue: str = 'class', ignore_classes: List[str] | None = None, show_order: bool = False, scaling: str | None = None, normalization: str | None = None, relplot_params: dict | None = None)

plots PCA scores using seaborn relplot function.

Parameters:
x_pc: int

Principal component number to plot along X axis.

y_pc: int

Principal component number to plot along Y axis.

hue: {“class”, “type”, “batch”}

How to color samples. “class” colors points according to sample class, “type” colors points according to the sample type assigned in the mapping, and “batch” uses batch information. Sample classes without a mapping are not shown in the plot.

ignore_classes: list[str], optional

Classes in the data to ignore when building the PCA model.

show_order: bool

Add a label with the run order.

scaling: {autoscaling, rescaling, pareto}, optional

Scaling method.

normalization: {sum, max, euclidean}, optional

Normalization method.

relplot_params: dict, optional

Key-values to pass to relplot function.

Returns:
seaborn.FacetGrid