plaid.utils.stats¶
Utility functions for computing statistics on datasets.
Attributes¶
Classes¶
OnlineStatistics – Class for computing online statistics of numpy arrays.
Stats – Class for aggregating and computing statistics across datasets.
Functions¶
aggregate_stats(sizes, means, vars) – Compute aggregated statistics of a batch of already computed statistics (without original samples information).
Module Contents¶
- aggregate_stats(sizes: numpy.ndarray, means: numpy.ndarray, vars: numpy.ndarray) → tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]¶
Compute aggregated statistics of a batch of already computed statistics (without original samples information).
This function calculates aggregated statistics, such as the total number of samples, mean, and variance, by taking into account the statistics computed for each batch of data.
cf: [Variance from (cardinal,mean,variance) of several statistical series](https://fr.wikipedia.org/wiki/Variance_(math%C3%A9matiques)#Formules)
- Parameters:
sizes (np.ndarray) – An array containing the size (number of samples) of each batch. Expected shape: (n_batches, 1).
means (np.ndarray) – An array containing the mean of each batch. Expected shape: (n_batches, n_features).
vars (np.ndarray) – An array containing the variance of each batch. Expected shape: (n_batches, n_features).
- Returns:
A tuple containing the aggregated statistics, in the following order:
- Total number of samples in all batches.
- Weighted mean calculated from the batch means.
- Weighted variance calculated from the batch variances, considering the means.
- Return type:
tuple[np.ndarray,np.ndarray,np.ndarray]
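The aggregation described above can be sketched with plain NumPy. This is only an illustration of the standard pooled mean and variance formulas, assuming population variances (ddof=0); the helper name `pooled_stats` is hypothetical and the sketch is not taken from the library's implementation.

```python
import numpy as np


def pooled_stats(sizes, means, variances):
    """Combine per-batch (size, mean, variance) triples into global statistics.

    Hypothetical helper, assuming population variances (ddof=0).
    sizes: (n_batches, 1); means, variances: (n_batches, n_features).
    """
    sizes = np.asarray(sizes, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    total = sizes.sum(axis=0, keepdims=True)  # shape (1, 1)
    # Size-weighted average of the batch means, shape (1, n_features).
    mean = (sizes * means).sum(axis=0, keepdims=True) / total
    # Within-batch variance plus the spread of batch means around the global mean.
    var = (sizes * (variances + (means - mean) ** 2)).sum(axis=0, keepdims=True) / total
    return total, mean, var


# Compare against statistics computed directly on the concatenated samples.
batch_a = np.array([[1.0, 2.0], [3.0, 4.0]])
batch_b = np.array([[5.0, 6.0]])
sizes = np.array([[len(batch_a)], [len(batch_b)]])
means = np.stack([batch_a.mean(axis=0), batch_b.mean(axis=0)])
variances = np.stack([batch_a.var(axis=0), batch_b.var(axis=0)])

total, mean, var = pooled_stats(sizes, means, variances)
all_samples = np.concatenate([batch_a, batch_b])
assert np.allclose(mean, all_samples.mean(axis=0))
assert np.allclose(var, all_samples.var(axis=0))
```

The variance term shows why the batch means are needed: averaging the batch variances alone would ignore how far each batch sits from the global mean.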
- class OnlineStatistics[source]¶
Bases: object
OnlineStatistics is a class for computing online statistics of numpy arrays.
This class computes running statistics (min, max, mean, variance, std) for streaming data without storing all samples in memory.
Example
>>> import numpy as np
>>> from plaid.utils.stats import OnlineStatistics
>>> stats = OnlineStatistics()
>>> stats.add_samples(np.array([[1, 2], [3, 4]]))
>>> stats.add_samples(np.array([[5, 6]]))
>>> print(stats.get_stats()['mean'])
[[3. 4.]]
Initialize an empty OnlineStatistics object.
- min: numpy.ndarray = None[source]¶
- max: numpy.ndarray = None[source]¶
- mean: numpy.ndarray = None[source]¶
- var: numpy.ndarray = None[source]¶
- std: numpy.ndarray = None[source]¶
- add_samples(x: numpy.ndarray, n_samples: int = None) → None[source]¶
Add samples to compute statistics for.
- Parameters:
x (np.ndarray) – The input numpy array containing the sample data. Expected to be a 2D array with shape (n_samples, n_features).
n_samples (int, optional) – The number of samples in the input array. If not provided, it is inferred from the shape of x. Use this argument when the input array has already been flattened because of shape inconsistencies.
- Raises:
ValueError – Raised when input contains NaN or Inf values.
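To illustrate how statistics can be kept up to date without storing past samples, here is a minimal sketch of a running update. The class `RunningStats` is hypothetical and is not the library's `OnlineStatistics`; it assumes 2D inputs of shape (n_samples, n_features) and population variance (ddof=0), and it mirrors the documented rejection of NaN and Inf values.

```python
import numpy as np


class RunningStats:
    """Hypothetical running statistics over batches of shape (n_samples, n_features)."""

    def __init__(self):
        self.n = 0
        self.min = self.max = self.mean = self.var = None

    def add_samples(self, x):
        x = np.asarray(x, dtype=float)
        if not np.isfinite(x).all():
            raise ValueError("input contains NaN or Inf values")
        n_new = x.shape[0]
        new_min = x.min(axis=0, keepdims=True)
        new_max = x.max(axis=0, keepdims=True)
        new_mean = x.mean(axis=0, keepdims=True)
        new_var = x.var(axis=0, keepdims=True)  # population variance (ddof=0)
        if self.n == 0:
            # First batch: its statistics are the running statistics.
            self.min, self.max = new_min, new_max
            self.mean, self.var = new_mean, new_var
        else:
            total = self.n + n_new
            delta = new_mean - self.mean
            # Merge the stored statistics with those of the new batch.
            self.var = (
                (self.n * self.var + n_new * new_var) / total
                + self.n * n_new * delta**2 / total**2
            )
            self.mean = self.mean + delta * n_new / total
            self.min = np.minimum(self.min, new_min)
            self.max = np.maximum(self.max, new_max)
        self.n += n_new


running = RunningStats()
running.add_samples([[1, 2], [3, 4]])
running.add_samples([[5, 6]])
print(running.mean)           # running mean, one value per feature
print(np.sqrt(running.var))   # running standard deviation
```

Only the sample count and one row per statistic are stored, so memory use does not grow with the number of samples added.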
- class Stats[source]¶
Class for aggregating and computing statistics across datasets.
The Stats class processes both scalar and field data from samples or datasets, computing running statistics like min, max, mean, variance and standard deviation.
Initialize an empty Stats object.
- add_dataset(dset: plaid.Dataset) → None[source]¶
Add a dataset to compute statistics for.
- Parameters:
dset (Dataset) – The dataset to add.
- add_samples(samples: list[plaid.Sample] | plaid.Dataset) → None[source]¶
Add samples or a dataset to compute statistics for.
Compute statistics for each feature present in the samples, among scalars and fields. For fields, as long as the added samples have the same shape as the existing ones, statistics are computed per coordinate (n_features=x.shape[-1]). As soon as the shapes differ, the existing statistics and the added fields are flattened (n_features=1), and statistics are then computed over all values of the field.
- Parameters:
samples (Union[list[Sample], Dataset]) – List of samples or dataset to process
- Raises:
TypeError – If samples is not a list[Sample] or Dataset
ValueError – If a sample contains invalid data
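The difference between per-coordinate and flattened statistics can be illustrated with plain NumPy. The helper `stack_or_flatten` below is hypothetical and only mimics the behaviour described above for a single field; it does not use plaid and is not how the library arranges its data internally.

```python
import numpy as np


def stack_or_flatten(fields):
    """Hypothetical helper: arrange per-sample field arrays for column-wise stats."""
    shapes = {np.shape(f) for f in fields}
    if len(shapes) == 1:
        # Same shape everywhere: one row per sample, one column per coordinate.
        return np.stack([np.ravel(f) for f in fields])
    # Mixed shapes: pool every value into a single feature column.
    return np.concatenate([np.ravel(f) for f in fields]).reshape(-1, 1)


same = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]
mixed = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])]

per_coordinate = stack_or_flatten(same)  # shape (2, 3): n_features = 3
pooled = stack_or_flatten(mixed)         # shape (5, 1): n_features = 1

print(per_coordinate.mean(axis=0))  # one mean per coordinate
print(pooled.mean(axis=0))          # a single mean over all values
```

In the flattened case the number of rows no longer equals the number of samples, which is the situation the `n_samples` argument of `OnlineStatistics.add_samples` is documented to handle.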
- get_stats(identifiers: list[str] = None) → dict[str, dict[str, numpy.ndarray]][source]¶
Get computed statistics for specified data identifiers.