tidyms.DataContainer

class DataContainer(data_matrix: DataFrame, feature_metadata: DataFrame, sample_metadata: DataFrame, mapping: dict | None = None, plot_mode: str = 'bokeh')

Container object that stores processed metabolomics data.

The data is separated into three attributes: data_matrix, sample_metadata and feature_metadata. Each one is a pandas DataFrame. Besides the constructor, DataContainers can be created by importing data in common formats (such as XCMS, MZMine2, Progenesis, etc.) using static methods.

See also

from_progenesis
from_pickle
MetricMethods
PlotMethods
PreprocessMethods

Attributes:

data_matrix: DataFrame: Feature values for each sample. Data is organized in a “tidy” way: each row is an observation and each column is a feature. The dtype must be float and all values should be non-negative; NaNs are allowed.
sample_metadata: DataFrame: Metadata associated with each sample (e.g. sample class). Has the same index as data_matrix. class (standing for sample class) is a required column. Analytical batch and run order information can be included under the batch and order columns. Both must be integers, and the run order must be unique for each sample. If the run order is specified per batch, the values are converted to unique values.
feature_metadata: DataFrame: Metadata associated with each feature (e.g. mass-to-charge ratio (mz), retention time (rt), etc.). The index is equal to the data_matrix columns. “mz” and “rt” are required columns.
mapping: dict: Maps sample types to lists of sample classes. Valid sample types are qc, blank, sample or suitability; values are lists of sample classes. Mapping is used by Processor objects to define a default behavior. For example, when using a BlankCorrector, the blank contribution to each feature is estimated using the sample classes listed under the blank sample type.
metrics: methods to compute common feature metrics.
plot: methods to plot features.
preprocess: methods to perform common preprocessing tasks.
id: pd.Series[str]: Name id of each sample.
batch: pd.Series[int]: Analytical batch number.
order: pd.Series[int]: Run order in which samples were analyzed. It must be a unique integer for each sample.
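
The mapping attribute is a plain dictionary, so it can be sketched without the library. The example below is a minimal illustration: the sample class names (`"QC"`, `"solvent"`, `"healthy"`, `"disease"`, `"SST"`) and the helper `check_mapping` are hypothetical and not part of tidyms.

```python
# Valid sample types, as listed in the mapping description above.
VALID_SAMPLE_TYPES = {"qc", "blank", "sample", "suitability"}

# Hypothetical class names; replace them with the values found in the
# `class` column of your sample_metadata.
mapping = {
    "qc": ["QC"],
    "blank": ["solvent"],
    "sample": ["healthy", "disease"],
    "suitability": ["SST"],
}


def check_mapping(mapping, available_classes):
    """Illustrative check (not a library function): every key must be a
    valid sample type and every listed class must exist in the data."""
    for sample_type, classes in mapping.items():
        if sample_type not in VALID_SAMPLE_TYPES:
            raise ValueError(f"invalid sample type: {sample_type}")
        missing = set(classes) - set(available_classes)
        if missing:
            raise ValueError(f"unknown sample classes: {sorted(missing)}")
    return True


classes_in_data = ["QC", "solvent", "healthy", "disease", "SST"]
check_mapping(mapping, classes_in_data)
```

With a mapping like this, a blank correction step would use the classes listed under `"blank"` (here only `"solvent"`) to estimate the blank contribution.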

Methods

remove(remove, axis)	Remove samples/features from the DataContainer.
reset(reset_mapping=True)	Reset the DataContainer, i.e. recover removed samples/features and restore transformed values.
is_valid_class_name(test_class)	Check if a class is present in the DataContainer.
diagnose()	Create a dictionary with information about the status of the DataContainer. Used by Processor objects as a validity check.
select_features(mzq, rtq, mz_tol=0.01, rt_tol=5)	Search features within an m/z and rt tolerance.
set_default_order()	Assign a default run order to the samples, assuming that the data matrix is already sorted by run order.
sort(field, axis)	Sort features/samples using metadata information.
save(filename)	Save the DataContainer as a pickle.

See help(DataContainer) for more details

Parameters:

data_matrix: pandas.DataFrame: Feature values for each measured sample. Each row is a sample and each column is a feature.
sample_metadata: pandas.DataFrame: Metadata for each sample. class is a required column.
feature_metadata: pandas.DataFrame: DataFrame with feature names as indices. mz and rt are required columns.
mapping: dict or None: If a dict, assigns each sample class to a sample type.
plot_mode: {“seaborn”, “bokeh”}: The package used to generate plots with the plot methods.
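
As a rough illustration of the constructor inputs, the sketch below builds three DataFrames that satisfy the requirements listed above (aligned indices, required columns, non-negative floats). The sample names, feature names and values are made up, and the checks at the end are only a restatement of the documented constraints, not the validation performed by the library.

```python
import numpy as np
import pandas as pd

samples = ["s1", "s2", "s3"]      # hypothetical sample names
features = ["FT001", "FT002"]     # hypothetical feature names

# Rows are samples, columns are features; float dtype, NaN allowed.
data_matrix = pd.DataFrame(
    [[100.0, 250.0], [110.0, np.nan], [95.0, 240.0]],
    index=samples,
    columns=features,
)

# Same index as data_matrix; `class` is required, `order` and `batch`
# are optional integer columns.
sample_metadata = pd.DataFrame(
    {"class": ["QC", "healthy", "disease"], "order": [1, 2, 3], "batch": [1, 1, 1]},
    index=samples,
)

# Index equals the data_matrix columns; `mz` and `rt` are required.
feature_metadata = pd.DataFrame(
    {"mz": [180.0634, 255.2330], "rt": [65.2, 312.8]},
    index=features,
)

# Checks mirroring the requirements stated in the documentation.
assert data_matrix.index.equals(sample_metadata.index)
assert data_matrix.columns.equals(feature_metadata.index)
assert "class" in sample_metadata.columns
assert {"mz", "rt"} <= set(feature_metadata.columns)
assert all(pd.api.types.is_float_dtype(t) for t in data_matrix.dtypes)
assert (data_matrix.fillna(0.0) >= 0).all().all()
```

Frames like these would then be passed as `DataContainer(data_matrix, feature_metadata, sample_metadata)`. Note that in the signature feature_metadata comes before sample_metadata, so keyword arguments may be the safer choice.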

add_order_from_csv(path: str | TextIO, interbatch_order: bool = True) → None

Adds sample order and sample batch using information from a csv file. The file must contain a column named sample with the same values as the index of the DataContainer sample_metadata. Order information is taken from a column named order, and batch information from a column named batch. Order values must be positive integers and must be unique within each batch. Each batch must be identified by a positive integer.

Parameters:

path: str: Path to the file with order data. The data format is inferred from the file extension.
interbatch_order: bool: If True, converts the order values to values that are unique across the whole DataContainer. This makes plotting the data as a function of order easier.
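
The expected file layout and the idea behind interbatch_order can be sketched with the standard library. The file contents below are hypothetical, and `to_interbatch_order` shows one possible way to turn per-batch order into a unique order (offsetting each batch by the highest order of the previous ones); the conversion actually used by the library may differ.

```python
import csv
import io

# Hypothetical file contents: one row per sample, order restarts in each batch.
csv_text = """sample,order,batch
s1,1,1
s2,2,1
s3,1,2
s4,2,2
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
records = [(r["sample"], int(r["order"]), int(r["batch"])) for r in rows]


def to_interbatch_order(records):
    """Offset each batch by the highest order seen in the previous batches so
    that the resulting order is unique across all samples. This is one
    possible conversion, not necessarily the one used by the library."""
    result = {}
    offset = 0
    for batch in sorted({b for _, _, b in records}):
        in_batch = [(s, o) for s, o, b in records if b == batch]
        for sample, order in in_batch:
            result[sample] = order + offset
        offset += max(o for _, o in in_batch)
    return result


unique_order = to_interbatch_order(records)
print(unique_order)  # → {'s1': 1, 's2': 2, 's3': 3, 's4': 4}
```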

property batch: Series

pd.Series[int]: Analytical batch number.

property classes: Series

pd.Series[str]: Class of each sample.

diagnose() → dict

Check whether the DataContainer has the information needed to perform several correction types.

Returns:

diagnostic: dict: Each value is a bool indicating the status. “empty” is True if the size of the data matrix is zero in at least one dimension; “missing” is True if there are NaNs in the data matrix; “order” is True if there is run order information for the samples; “batch” is True if there is batch number information associated with the samples.
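
The four flags can be illustrated with plain pandas. `diagnose_sketch` below is a hypothetical stand-in, not the library method; in particular, it assumes that the presence of an `order` or `batch` column in sample_metadata is what counts as having that information.

```python
import numpy as np
import pandas as pd


def diagnose_sketch(data_matrix, sample_metadata):
    """Illustrative reimplementation of the four flags described above."""
    return {
        # True if the matrix has zero rows or zero columns.
        "empty": 0 in data_matrix.shape,
        # True if any value in the matrix is NaN.
        "missing": bool(data_matrix.isna().any().any()),
        # Assumption: column presence stands for available information.
        "order": "order" in sample_metadata.columns,
        "batch": "batch" in sample_metadata.columns,
    }


data_matrix = pd.DataFrame(
    [[1.0, np.nan], [2.0, 3.0]], index=["s1", "s2"], columns=["FT001", "FT002"]
)
sample_metadata = pd.DataFrame(
    {"class": ["QC", "QC"], "order": [1, 2]}, index=["s1", "s2"]
)

status = diagnose_sketch(data_matrix, sample_metadata)
print(status)
# → {'empty': False, 'missing': True, 'order': True, 'batch': False}
```

A processing step that depends on run order could inspect `status["order"]` before running and refuse to proceed when it is False.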

static from_pickle(path: str | BinaryIO)

Read a DataContainer stored as a pickle.

Parameters:

path: str or file: Path to read the DataContainer from.

Returns:

DataContainer

static from_progenesis(path: str | TextIO)

Read a Progenesis file into a DataContainer.

Parameters:

path: str or file: Path to a Progenesis csv output or a file object.

Returns:

DataContainer

property id: Series

pd.Series[str]: Name id of each sample.

is_valid_class_name(test_class: str | List[str]) → bool

Check if at least one sample class is `class_name`.

Parameters:

test_class: str or list[str]: Classes to search in the DataContainer.

Returns:

is_valid: bool

property order: Series

pd.Series[int]: Run order in which samples were analyzed. It must be a unique integer for each sample.

remove(remove: Iterable[str], axis: str)

Remove selected features/samples.

Parameters:

remove: Iterable[str]: List of sample/feature names to remove.
axis: {“features”, “samples”}

reset(reset_mapping: bool = True)

Reloads the original data matrix.

Parameters:

reset_mapping: bool: If True, clears sample classes from the mapping.

save(filename: str) → None

Save the DataContainer as a pickle.

Parameters:

filename: str: name used to save the file.

select_features(mzq: float, rtq: float, mz_tol: float = 0.01, rt_tol: float = 5) → Index

Find feature names within the defined mass-to-charge and retention time tolerance.

Parameters:

mzq: positive number: Mass-to-charge value to search
rtq: positive number: Retention time value to search
mz_tol: positive number: Mass-to-charge tolerance used in the search.
rt_tol: positive number: Retention time tolerance used in the search.

Returns:

Index
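
The tolerance search can be illustrated on a small feature_metadata table. `select_features_sketch` is a hypothetical stand-in for the method, with made-up feature names and values; whether the library uses strict or inclusive comparisons at the tolerance boundary is an assumption here.

```python
import pandas as pd

# Hypothetical feature metadata with the required mz and rt columns.
feature_metadata = pd.DataFrame(
    {"mz": [180.0634, 180.0810, 255.2330], "rt": [65.2, 66.0, 312.8]},
    index=["FT001", "FT002", "FT003"],
)


def select_features_sketch(feature_metadata, mzq, rtq, mz_tol=0.01, rt_tol=5):
    """Return the names of features whose mz and rt both fall within the
    given tolerances of the query values (strict inequality assumed)."""
    mz_match = (feature_metadata["mz"] - mzq).abs() < mz_tol
    rt_match = (feature_metadata["rt"] - rtq).abs() < rt_tol
    return feature_metadata.loc[mz_match & rt_match].index


matches = select_features_sketch(feature_metadata, mzq=180.0650, rtq=64.0)
print(list(matches))  # → ['FT001']
```

FT002 is within the rt tolerance but about 0.016 away in m/z, so it is excluded with the default mz_tol of 0.01.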

set_default_order()

Set the order of the samples, assuming that the data is already sorted.

set_plot_mode(mode: str)

Set the library used to generate plots.

Parameters:

mode: {“bokeh”, “seaborn”}

sort(field: str, axis: str)

Sort samples/features in place using metadata values.

Parameters:

field: str: Field to sort by. Must be a column of sample_metadata (when sorting samples) or feature_metadata (when sorting features).
axis: {“samples”, “features”}

to_csv(filename: str) → None

Save the DataContainer into a csv file.

Parameters:

filename: str