plaid.utils.stats

Utility functions for computing statistics on datasets.

Attributes

Classes

OnlineStatistics

OnlineStatistics is a class for computing online statistics of numpy arrays.

Stats

Class for aggregating and computing statistics across datasets.

Functions

aggregate_stats(→ tuple[numpy.ndarray, numpy.ndarray, ...)

Aggregate precomputed per-batch statistics without access to the original samples.

Module Contents

Self
aggregate_stats(sizes: numpy.ndarray, means: numpy.ndarray, vars: numpy.ndarray) → tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Aggregate precomputed per-batch statistics without access to the original samples.

This function combines the sample count, mean, and variance computed for each batch into the total number of samples, the overall mean, and the overall variance.

See: [Variance from (cardinal,mean,variance) of several statistical series](https://fr.wikipedia.org/wiki/Variance_(math%C3%A9matiques)#Formules) (French Wikipedia)

Parameters:
  • sizes (np.ndarray) – An array containing the size (number of samples) of each batch. Expected shape: (n_batches, 1).

  • means (np.ndarray) – An array containing the mean of each batch. Expected shape: (n_batches, n_features).

  • vars (np.ndarray) – An array containing the variance of each batch. Expected shape: (n_batches, n_features).

Returns:

A tuple containing the aggregated statistics, in the following order:
  • Total number of samples in all batches.

  • Weighted mean calculated from the batch means.

  • Weighted variance calculated from the batch variances, considering the means.

Return type:

tuple[np.ndarray,np.ndarray,np.ndarray]
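The aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the plaid implementation: the helper name `pooled_stats` is hypothetical, and it assumes population variances (`ddof=0`), which this page does not confirm as the convention used by `aggregate_stats`.

```python
import numpy as np


def pooled_stats(sizes, means, variances):
    """Hypothetical helper: combine per-batch (size, mean, variance) triples.

    Assumes population variances (ddof=0) and the shapes documented above:
    sizes (n_batches, 1), means and variances (n_batches, n_features).
    """
    sizes = np.asarray(sizes, dtype=float).reshape(-1, 1)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    total = sizes.sum(axis=0, keepdims=True)  # shape (1, 1)
    # Overall mean: batch means weighted by batch sizes.
    mean = (sizes * means).sum(axis=0, keepdims=True) / total
    # Overall variance: within-batch variance plus the spread of the
    # batch means around the overall mean, both weighted by batch size.
    var = (sizes * (variances + (means - mean) ** 2)).sum(axis=0, keepdims=True) / total
    return total, mean, var


# Build three batches, summarize each one, then aggregate the summaries.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(n, 3)) for n in (4, 7, 2)]
sizes = np.array([[len(b)] for b in batches])
means = np.stack([b.mean(axis=0) for b in batches])
variances = np.stack([b.var(axis=0) for b in batches])

total, mean, var = pooled_stats(sizes, means, variances)
full = np.concatenate(batches)  # used only to check the result
```

Under these assumptions, `mean` and `var` agree with `full.mean(axis=0)` and `full.var(axis=0)` up to floating-point error, even though the aggregation never sees the original samples.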

class OnlineStatistics

Bases: object

OnlineStatistics is a class for computing online statistics of numpy arrays.

This class computes running statistics (min, max, mean, variance, std) for streaming data without storing all samples in memory.

Example

>>> import numpy as np
>>> from plaid.utils.stats import OnlineStatistics
>>> stats = OnlineStatistics()
>>> stats.add_samples(np.array([[1, 2], [3, 4]]))
>>> stats.add_samples(np.array([[5, 6]]))
>>> print(stats.get_stats()['mean'])
[[3. 4.]]

Initialize an empty OnlineStatistics object.

n_samples: int = 0
n_features: int = None
n_points: int = None
min: numpy.ndarray = None
max: numpy.ndarray = None
mean: numpy.ndarray = None
var: numpy.ndarray = None
std: numpy.ndarray = None
add_samples(x: numpy.ndarray, n_samples: int = None) → None

Add samples to compute statistics for.

Parameters:
  • x (np.ndarray) – The input numpy array containing the sample data. Expected to be a 2D array of shape (n_samples, n_features).

  • n_samples (int, optional) – The number of samples in the input array. If not provided, it will be inferred from the shape of x. Use this argument when the input array has already been flattened because of shape inconsistencies.

Raises:

ValueError – Raised when input contains NaN or Inf values.
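How running statistics can be updated one batch at a time, without keeping earlier samples, is sketched below. The class `RunningStats` and its attribute `m2` are hypothetical names chosen for illustration; the update rule is the standard pairwise mean/variance combination, and nothing here is taken from the plaid source. Population variance (`ddof=0`) is assumed.

```python
import numpy as np


class RunningStats:
    """Hypothetical accumulator for min, max, mean and variance."""

    def __init__(self):
        self.n = 0
        self.min = self.max = self.mean = self.m2 = None

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if x.ndim != 2:
            raise ValueError("expected a 2D array of shape (n_samples, n_features)")
        if not np.isfinite(x).all():
            raise ValueError("input contains NaN or Inf values")

        # Summarize the incoming batch on its own.
        n_b = x.shape[0]
        mean_b = x.mean(axis=0, keepdims=True)
        m2_b = ((x - mean_b) ** 2).sum(axis=0, keepdims=True)  # sum of squared deviations
        min_b = x.min(axis=0, keepdims=True)
        max_b = x.max(axis=0, keepdims=True)

        if self.n == 0:
            self.n, self.mean, self.m2 = n_b, mean_b, m2_b
            self.min, self.max = min_b, max_b
            return

        # Combine the stored summary with the batch summary.
        n = self.n + n_b
        delta = mean_b - self.mean
        self.mean = self.mean + delta * n_b / n
        self.m2 = self.m2 + m2_b + delta**2 * self.n * n_b / n
        self.min = np.minimum(self.min, min_b)
        self.max = np.maximum(self.max, max_b)
        self.n = n

    @property
    def var(self):
        # Population variance of everything seen so far.
        return self.m2 / self.n


acc = RunningStats()
acc.add([[1, 2], [3, 4]])
acc.add([[5, 6]])
```

After both calls, `acc.mean` holds the per-feature mean of all three rows, with shape (1, 2).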

merge_stats(other: Self) → None

Merge statistics from another instance.

Parameters:

other (Self) – The other instance to merge statistics from.

flatten_array() → None

Call this method when a shape inconsistency is detected between new samples and those already added.

get_stats() → dict[str, int | numpy.ndarray]

Get computed statistics.

Returns:

A dictionary containing computed statistics. The shapes of the arrays depend on the input data and may vary.

Return type:

dict[str, Union[int, np.ndarray]]

class Stats

Class for aggregating and computing statistics across datasets.

The Stats class processes both scalar and field data from samples or datasets, computing running statistics like min, max, mean, variance and standard deviation.

_stats

Dictionary mapping data identifiers to their statistics

Type:

dict[str, OnlineStatistics]

Initialize an empty Stats object.

add_dataset(dset: plaid.Dataset) → None

Add a dataset to compute statistics for.

Parameters:

dset (Dataset) – The dataset to add.

add_samples(samples: list[plaid.Sample] | plaid.Dataset) → None

Add samples or a dataset to compute statistics for.

Compute statistics for each feature present in the samples, among scalars and fields. For fields, as long as the added samples have the same shape as the existing ones, statistics are computed per coordinate (n_features=x.shape[-1]). As soon as the shapes differ, the existing statistics and the added fields are flattened (n_features=1), and statistics are then computed over all values of the field.

Parameters:

samples (Union[list[Sample], Dataset]) – List of samples or dataset to process

Raises:
  • TypeError – If samples is not a list[Sample] or Dataset

  • ValueError – If a sample contains invalid data
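The two regimes described for fields (per-coordinate statistics while shapes agree, a single flattened feature once they differ) can be illustrated as follows. The helper `field_mean` is hypothetical and computes only the mean in one pass over a list of arrays; it does not reproduce how plaid switches regimes on statistics that were already accumulated.

```python
import numpy as np


def field_mean(fields):
    """Hypothetical helper: mean of a list of 1D field arrays."""
    shapes = {np.shape(f) for f in fields}
    if len(shapes) == 1:
        # Same shape everywhere: one statistic per coordinate.
        stacked = np.stack([np.ravel(f) for f in fields])  # (n_samples, n_features)
        return stacked.mean(axis=0, keepdims=True)
    # Shapes differ: pool every value into a single feature column.
    flat = np.concatenate([np.ravel(f) for f in fields]).reshape(-1, 1)
    return flat.mean(axis=0, keepdims=True)


same = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]
mixed = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])]

per_coordinate = field_mean(same)  # shape (1, 3): one mean per coordinate
pooled = field_mean(mixed)         # shape (1, 1): one mean over all values
```

In the flattened case, every value of every field contributes equally to the result, so longer fields carry more weight than shorter ones.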

get_stats(identifiers: list[str] = None) → dict[str, dict[str, numpy.ndarray]]

Get computed statistics for specified data identifiers.

Parameters:

identifiers (list[str], optional) – List of data identifiers to retrieve. If None, returns statistics for all identifiers.

Returns:

Dictionary mapping identifiers to their statistics

Return type:

dict[str, dict[str, np.ndarray]]

get_available_statistics() → list[str]

Get list of data identifiers with computed statistics.

Returns:

List of data identifiers

Return type:

list[str]

clear_statistics() → None

Clear all computed statistics.

merge_stats(other: Self) → None

Merge statistics from another Stats object.

Parameters:

other (Stats) – Stats object to merge with