tidyms.DataContainer

class DataContainer(data_matrix: DataFrame, feature_metadata: DataFrame, sample_metadata: DataFrame, mapping: dict | None = None, plot_mode: str = 'bokeh')

Container object that stores processed metabolomics data.

The data is split across three attributes: data_matrix, sample_metadata and feature_metadata. Each one is a pandas DataFrame. Besides the constructor, a DataContainer can be created with static methods that import data in common formats (e.g. XCMS, MZMine2, Progenesis).

See also

from_progenesis
from_pickle
MetricMethods
PlotMethods
PreprocessMethods

Attributes:
data_matrix: DataFrame

Feature values for each sample. Data is organized in a “tidy” way: each row is an observation and each column is a feature. The dtype must be float and all values should be non-negative; NaNs are allowed.

sample_metadata: DataFrame

Metadata associated with each sample (e.g. sample class). Has the same index as data_matrix. class (standing for sample class) is a required column. Analytical batch and run order information can be included under the batch and order columns. Both must be integers, and the run order must be unique for each sample. If the run order is specified per batch, the values are converted to values that are unique across batches.

feature_metadata: DataFrame

Metadata associated with each feature (e.g. mass-to-charge ratio (mz), retention time (rt), etc.). The index matches the columns of data_matrix. “mz” and “rt” are required columns.

mapping: dict

Maps sample types to lists of sample classes. Valid sample types are qc, blank, sample and suitability; each value is a list of sample classes. The mapping is used by Processor objects to define a default behavior. For example, when using a BlankCorrector, the blank contribution to each feature is estimated using the sample classes listed under the blank sample type.

metrics

Methods to compute common feature metrics.

plot

Methods to plot features.

preprocess

Methods to perform common preprocessing tasks.

id

pd.Series[str]. Name (id) of each sample.

batch

pd.Series[int]. Analytical batch number.

order

pd.Series[int]. Run order in which samples were analyzed. It must be a unique integer for each sample.
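To illustrate how a mapping resolves a sample type to concrete samples, here is a minimal sketch in plain Python. The class names ("QC", "SB", "healthy", "disease"), the intensity values, and the helpers `samples_of_type` and `blank_mean` are hypothetical, and averaging is only one plausible way to estimate a blank contribution; it is not necessarily what BlankCorrector does.

```python
from statistics import mean

VALID_SAMPLE_TYPES = {"qc", "blank", "sample", "suitability"}

# Hypothetical mapping: sample type -> list of sample classes.
mapping = {
    "qc": ["QC"],
    "blank": ["SB"],
    "sample": ["healthy", "disease"],
}

# Plain-Python stand-ins for sample_metadata["class"] and for one
# column (one feature) of data_matrix.
sample_class = {"s1": "SB", "s2": "healthy", "s3": "QC", "s4": "SB"}
feature_values = {"s1": 10.0, "s2": 250.0, "s3": 240.0, "s4": 14.0}


def samples_of_type(sample_type):
    """Return the sample names whose class is mapped to sample_type."""
    if sample_type not in VALID_SAMPLE_TYPES:
        raise ValueError(f"invalid sample type: {sample_type!r}")
    classes = set(mapping.get(sample_type, []))
    return [name for name, cls in sample_class.items() if cls in classes]


def blank_mean():
    """Average the feature over all samples mapped to the blank type."""
    return mean(feature_values[name] for name in samples_of_type("blank"))
```

Here `samples_of_type("blank")` picks the two "SB" samples, and a sample type with no classes in the mapping resolves to an empty list.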

Methods

remove(remove, axis)

Remove samples/features from the DataContainer.

reset(reset_mapping=True)

Reset the DataContainer, i.e. restore removed samples/features and transformed values.

is_valid_class_name(test_class)

Check if a class is present in the DataContainer.

diagnose()

Create a dictionary with information about the status of the DataContainer. Used by Processor objects as a validity check.

select_features(mzq, rtq, mz_tol=0.01, rt_tol=5)

Search for features within an m/z and rt tolerance.

set_default_order()

Assign a default run order to the samples, assuming that the data matrix is already sorted by run order.

sort(field, axis)

Sort features/samples using metadata information.

save(filename)

Save the DataContainer as a pickle.

See help(DataContainer) for more details.

Parameters:
data_matrix: pandas.DataFrame

Feature values for each measured sample. Each row is a sample and each column is a feature.

sample_metadata: pandas.DataFrame

Metadata for each sample. class is a required column.

feature_metadata: pandas.DataFrame

DataFrame with feature names as indices. mz and rt are required columns.

mapping: dict or None

If a dict, assigns sample classes to sample types.

plot_mode: {“seaborn”, “bokeh”}

The package used to generate plots with the plot methods.
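The following sketch builds the three DataFrames so that they satisfy the constraints listed above, and then passes them to the constructor using the signature shown at the top of this page. The sample names, feature names, class names and values are made up for illustration, and the constructor call only runs if tidyms is installed.

```python
import numpy as np
import pandas as pd

samples = ["s1", "s2", "s3"]
features = ["FT01", "FT02"]

# One row per sample, one column per feature; float dtype, NaN allowed.
data_matrix = pd.DataFrame(
    [[100.0, 20.0], [110.0, np.nan], [5.0, 1.0]],
    index=samples,
    columns=features,
)

# Same index as data_matrix; "class" is a required column.
sample_metadata = pd.DataFrame(
    {"class": ["healthy", "disease", "SB"]}, index=samples
)

# Indexed by feature name; "mz" and "rt" are required columns.
feature_metadata = pd.DataFrame(
    {"mz": [200.05, 301.14], "rt": [50.2, 120.5]}, index=features
)

# Hypothetical mapping of sample types to the classes used above.
mapping = {"sample": ["healthy", "disease"], "blank": ["SB"]}

# Check the documented constraints before building the container.
assert data_matrix.index.equals(sample_metadata.index)
assert data_matrix.columns.equals(feature_metadata.index)
assert "class" in sample_metadata.columns
assert {"mz", "rt"} <= set(feature_metadata.columns)

try:
    from tidyms import DataContainer
except ImportError:
    DataContainer = None  # tidyms is not installed; the frames are still valid

if DataContainer is not None:
    dc = DataContainer(
        data_matrix, feature_metadata, sample_metadata, mapping=mapping
    )
```

Note that the constructor takes feature_metadata before sample_metadata, so passing the frames by keyword is a reasonable way to avoid swapping them.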

add_order_from_csv(path: str | TextIO, interbatch_order: bool = True) → None

Adds sample order and sample batch information from a csv file. The file must contain a column named sample whose values match the index of the DataContainer sample_metadata. Order information is taken from a column named order, and batch information from a column named batch. Order values must be positive integers and must be unique within each batch. Each batch must be identified by a positive integer.

Parameters:
path: str or file

Path to the file with order data. Data format is inferred from the file extension.

interbatch_order: bool

If True, converts the order values to values that are unique across the whole DataContainer. This makes it easier to plot the data as a function of run order.
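One way to turn per-batch order values into values that are unique across batches is to offset each batch by the largest order values of the batches before it. The sketch below shows that idea in plain Python; the function `to_interbatch_order` is hypothetical, and the conversion actually applied by the library may differ.

```python
def to_interbatch_order(batch, order):
    """Offset each batch's run order so values are unique across batches.

    batch and order are equal-length lists of positive integers, one entry
    per sample. Batches are processed in ascending batch number.
    """
    if len(set(order)) == len(order):
        return list(order)  # already unique, nothing to convert

    # Largest run order seen in each batch.
    max_order = {}
    for b, o in zip(batch, order):
        max_order[b] = max(max_order.get(b, 0), o)

    # Offset of a batch = sum of the maxima of all earlier batches.
    offset = {}
    running = 0
    for b in sorted(max_order):
        offset[b] = running
        running += max_order[b]

    return [o + offset[b] for b, o in zip(batch, order)]


batch = [1, 1, 1, 2, 2, 2]
order = [1, 2, 3, 1, 2, 3]
unique_order = to_interbatch_order(batch, order)
```

With this input, the second batch is shifted by 3, so the samples keep their relative order within each batch and batches follow one another.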

property batch: Series

pd.Series[int]. Analytical batch number

property classes: Series

pd.Series[str] : class of each sample.

diagnose() → dict

Check whether the DataContainer has the information needed to perform several correction types.

Returns:
diagnostic: dict

Each value is a bool indicating the status. “empty” is True if the size of the data matrix is zero in at least one dimension; “missing” is True if there are NaNs in the data matrix; “order” is True if there is run order information for the samples; “batch” is True if there is batch number information associated with the samples.
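The four flags described above can be approximated directly from the two DataFrames. This is a sketch under assumptions: `diagnose_sketch` is a hypothetical helper, and it treats the presence of an order or batch column in sample_metadata as "information is available", which may not match the checks the library performs.

```python
import numpy as np
import pandas as pd


def diagnose_sketch(data_matrix, sample_metadata):
    """Build the four status flags described above (hypothetical helper)."""
    return {
        # True if the matrix has zero rows or zero columns.
        "empty": 0 in data_matrix.shape,
        # True if any value in the matrix is NaN.
        "missing": bool(data_matrix.isna().any().any()),
        # Assumption: a column being present means the information exists.
        "order": "order" in sample_metadata.columns,
        "batch": "batch" in sample_metadata.columns,
    }


data_matrix = pd.DataFrame(
    {"FT01": [100.0, np.nan], "FT02": [20.0, 25.0]}, index=["s1", "s2"]
)
sample_metadata = pd.DataFrame(
    {"class": ["healthy", "disease"], "order": [1, 2]}, index=["s1", "s2"]
)
status = diagnose_sketch(data_matrix, sample_metadata)
```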

static from_pickle(path: str | BinaryIO)

Read a DataContainer stored as a pickle.

Parameters:
path: str or file

Path to the pickle file, or a file object, to read the DataContainer from.

Returns:
DataContainer

static from_progenesis(path: str | TextIO)

Read a Progenesis file into a DataContainer.

Parameters:
path: str or file

Path to a Progenesis csv output, or a file object.

Returns:
dc: DataContainer

property id: Series

pd.Series[str]. Name (id) of each sample.

is_valid_class_name(test_class: str | List[str]) → bool

Check whether test_class is present among the sample classes of the DataContainer.

Parameters:
test_class: str or list[str]

Classes to search for in the DataContainer.

Returns:
is_valid: bool

property order: Series

pd.Series[int]. Run order in which samples were analyzed. It must be a unique integer for each sample.

remove(remove: Iterable[str], axis: str)

Remove selected features / samples.

Parameters:
remove: Iterable[str]

List of sample/feature names to remove.

axis: {“features”, “samples”}

reset(reset_mapping: bool = True)

Reloads the original data matrix.

Parameters:
reset_mapping: bool

If True, clears sample classes from the mapping.

save(filename: str) → None

Save the DataContainer into a pickle.

Parameters:
filename: str

Name used to save the file.
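Saving to and loading from a pickle is a round trip through Python's standard pickle module. The sketch below shows that round trip with a plain dictionary standing in for a DataContainer; the stand-in object and the file name are hypothetical, and the library's save and from_pickle methods may do more than this.

```python
import os
import pickle
import tempfile

# Stand-in for a DataContainer: any picklable Python object behaves the same.
container = {"data_matrix": [[100.0, 20.0]], "classes": ["healthy"]}

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "container.pickle")

    # Roughly what saving into a pickle amounts to.
    with open(filename, "wb") as fout:
        pickle.dump(container, fout)

    # Roughly what reading the pickle back amounts to.
    with open(filename, "rb") as fin:
        restored = pickle.load(fin)
```

Because unpickling can execute arbitrary code, pickled files should only be loaded when they come from a trusted source.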

select_features(mzq: float, rtq: float, mz_tol: float = 0.01, rt_tol: float = 5) → Index

Find feature names within the defined mass-to-charge and retention time tolerance.

Parameters:
mzq: positive number

Mass-to-charge value to search

rtq: positive number

Retention time value to search

mz_tol: positive number

Mass-to-charge tolerance used in the search.

rt_tol: positive number

Retention time tolerance used in the search.

Returns:
Index
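The search described above amounts to keeping the features whose mz and rt both fall within the given tolerances of the query values. The following is a sketch on a made-up feature_metadata table; `select_features_sketch` is a hypothetical helper, and whether a value exactly at the tolerance boundary counts as a match is an assumption.

```python
import pandas as pd


def select_features_sketch(feature_metadata, mzq, rtq, mz_tol=0.01, rt_tol=5):
    """Return names of features within both tolerances (hypothetical helper).

    Assumption: the comparison is strict, so a feature exactly at the
    tolerance boundary is not selected.
    """
    mz_match = (feature_metadata["mz"] - mzq).abs() < mz_tol
    rt_match = (feature_metadata["rt"] - rtq).abs() < rt_tol
    # Keep only the features that satisfy both conditions.
    return feature_metadata.index[(mz_match & rt_match).to_numpy()]


feature_metadata = pd.DataFrame(
    {
        "mz": [200.050, 200.055, 200.052, 301.140],
        "rt": [50.0, 52.0, 90.0, 51.0],
    },
    index=["FT01", "FT02", "FT03", "FT04"],
)
matches = select_features_sketch(feature_metadata, mzq=200.05, rtq=50.0)
```

In this table FT03 is close in m/z but far in retention time, and FT04 is close in retention time but far in m/z, so neither is selected.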
set_default_order()

Set the run order of the samples, assuming that the data is already sorted by run order.

set_plot_mode(mode: str)

Set the library used to generate plots.

Parameters:
mode: {“bokeh”, “seaborn”}

sort(field: str, axis: str)

Sort samples/features in place using metadata values.

Parameters:
field: str

Field to sort by. Must be a column of sample_metadata or feature_metadata.

axis: {“samples”, “features”}

to_csv(filename: str) → None

Save the DataContainer into a csv file.

Parameters:
filename: str