plaid.utils.stats
=================

.. py:module:: plaid.utils.stats

.. autoapi-nested-parse::

   Utility functions for computing statistics on datasets.


Attributes
----------

.. autoapisummary::

   plaid.utils.stats.Self


Classes
-------

.. autoapisummary::

   plaid.utils.stats.OnlineStatistics
   plaid.utils.stats.Stats


Functions
---------

.. autoapisummary::

   plaid.utils.stats.aggregate_stats


Module Contents
---------------

.. py:data:: Self

.. py:function:: aggregate_stats(sizes: numpy.ndarray, means: numpy.ndarray, vars: numpy.ndarray) -> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

   Compute aggregated statistics from per-batch statistics that have already been computed (without access to the original samples).

   This function calculates the total number of samples, the mean, and the variance over all batches, using only the size, mean, and variance computed for each batch.

   See `Variance from (cardinal, mean, variance) of several statistical series <https://fr.wikipedia.org/wiki/Variance_(math%C3%A9matiques)#Formules>`_.

   :param sizes: An array containing the size (number of samples) of each batch. Expected shape: (n_batches, 1).
   :type sizes: np.ndarray
   :param means: An array containing the means of each batch. Expected shape: (n_batches, n_features).
   :type means: np.ndarray
   :param vars: An array containing the variances of each batch. Expected shape: (n_batches, n_features).
   :type vars: np.ndarray

   :returns: A tuple containing the aggregated statistics in the following order:

             - Total number of samples in all batches.
             - Weighted mean calculated from the batch means.
             - Weighted variance calculated from the batch variances, taking the batch means into account.
   :rtype: tuple[np.ndarray,np.ndarray,np.ndarray]


.. py:class:: OnlineStatistics

   Bases: :py:obj:`object`

   OnlineStatistics is a class for computing online statistics of numpy arrays.

   This class computes running statistics (min, max, mean, variance, std) for streaming data without storing all samples in memory.

   .. rubric:: Example

   >>> import numpy as np
   >>> from plaid.utils.stats import OnlineStatistics
   >>> stats = OnlineStatistics()
   >>> stats.add_samples(np.array([[1, 2], [3, 4]]))
   >>> stats.add_samples(np.array([[5, 6]]))
   >>> print(stats.get_stats()['mean'])
   [[3. 4.]]

   Initialize an empty OnlineStatistics object.

   .. py:attribute:: n_samples
      :type:  int
      :value: 0

   .. py:attribute:: n_features
      :type:  int
      :value: None

   .. py:attribute:: n_points
      :type:  int
      :value: None

   .. py:attribute:: min
      :type:  numpy.ndarray
      :value: None

   .. py:attribute:: max
      :type:  numpy.ndarray
      :value: None

   .. py:attribute:: mean
      :type:  numpy.ndarray
      :value: None

   .. py:attribute:: var
      :type:  numpy.ndarray
      :value: None

   .. py:attribute:: std
      :type:  numpy.ndarray
      :value: None

   .. py:method:: add_samples(x: numpy.ndarray, n_samples: int = None) -> None

      Add samples to compute statistics for.

      :param x: The input numpy array containing the sample data. Expects a 2D array with shape (n_samples, n_features).
      :type x: np.ndarray
      :param n_samples: The number of samples in the input array. If not provided, it is inferred from the shape of `x`. Use this argument when the input array has already been flattened because of shape inconsistencies.
      :type n_samples: int, optional

      :raises ValueError: Raised when the input contains NaN or Inf values.

   .. py:method:: merge_stats(other: Self) -> None

      Merge statistics from another instance.

      :param other: The other instance to merge statistics from.
      :type other: Self

   .. py:method:: flatten_array() -> None

      Call this method when a shape inconsistency is detected.

   .. py:method:: get_stats() -> dict[str, Union[int, numpy.ndarray]]

      Get computed statistics.

      :returns: A dictionary containing computed statistics. The shapes of the arrays depend on the input data and may vary.
      :rtype: dict[str, Union[int, np.ndarray]]


.. py:class:: Stats

   Class for aggregating and computing statistics across datasets.

   The Stats class processes both scalar and field data from samples or datasets, computing running statistics such as min, max, mean, variance, and standard deviation.

   .. attribute:: _stats

      Dictionary mapping data identifiers to their statistics

      :type: dict[str, OnlineStatistics]

   Initialize an empty Stats object.

   .. py:method:: add_dataset(dset: plaid.Dataset) -> None

      Add a dataset to compute statistics for.

      :param dset: The dataset to add.
      :type dset: Dataset

   .. py:method:: add_samples(samples: Union[list[plaid.Sample], plaid.Dataset]) -> None

      Add samples or a dataset to compute statistics for.

      Compute stats for each feature present in the samples, among scalars and fields.

      For fields, as long as the added samples have the same shape as the existing ones, the stats are computed per coordinate (n_features=x.shape[-1]). As soon as the shapes differ, the existing stats and the added fields are flattened (n_features=1), and the stats are then computed over all values of the field.

      :param samples: List of samples or dataset to process
      :type samples: Union[list[Sample], Dataset]

      :raises TypeError: If samples is not a list[Sample] or Dataset
      :raises ValueError: If a sample contains invalid data

   .. py:method:: get_stats(identifiers: list[str] = None) -> dict[str, dict[str, numpy.ndarray]]

      Get computed statistics for specified data identifiers.

      :param identifiers: List of data identifiers to retrieve. If None, returns statistics for all identifiers.
      :type identifiers: list[str], optional

      :returns: Dictionary mapping identifiers to their statistics
      :rtype: dict[str, dict[str, np.ndarray]]

   .. py:method:: get_available_statistics() -> list[str]

      Get list of data identifiers with computed statistics.

      :returns: List of data identifiers
      :rtype: list[str]

   .. py:method:: clear_statistics() -> None

      Clear all computed statistics.

   .. py:method:: merge_stats(other: Self) -> None

      Merge statistics from another Stats object.

      :param other: Stats object to merge with
      :type other: Stats
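
The kind of aggregation described for ``aggregate_stats`` (combining per-batch sizes, means, and variances without the original samples) can be sketched with plain NumPy. This is a minimal illustration and not the plaid implementation: ``aggregate_stats_sketch`` is a hypothetical helper, it assumes the batch variances are population variances (``ddof=0``), and it returns arrays of shape (1, 1) and (1, n_features), which may differ from what the library returns.

```python
import numpy as np


def aggregate_stats_sketch(sizes, means, variances):
    """Pool per-batch (size, mean, variance) triples into global statistics.

    Hypothetical helper, not part of plaid. Assumes population variances
    (ddof=0), sizes of shape (n_batches, 1), and means and variances of
    shape (n_batches, n_features).
    """
    sizes = np.asarray(sizes, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Total number of samples, shape (1, 1).
    total = sizes.sum(axis=0, keepdims=True)
    # Size-weighted average of the batch means, shape (1, n_features).
    mean = (sizes * means).sum(axis=0, keepdims=True) / total
    # Within-batch variance plus the squared offset of each batch mean
    # from the pooled mean, weighted by batch size.
    var = (sizes * (variances + (means - mean) ** 2)).sum(axis=0, keepdims=True) / total
    return total, mean, var


# Three batches of different sizes with 3 features each.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(n, 3)) for n in (4, 7, 2)]

sizes = np.array([[len(b)] for b in batches])
means = np.stack([b.mean(axis=0) for b in batches])
variances = np.stack([b.var(axis=0) for b in batches])

total, mean, var = aggregate_stats_sketch(sizes, means, variances)

# The pooled statistics should match those of the concatenated samples.
full = np.concatenate(batches, axis=0)
assert total.item() == len(full)
assert np.allclose(mean, full.mean(axis=0))
assert np.allclose(var, full.var(axis=0))
```

The final assertions compare the pooled result against statistics computed directly on the concatenated data. With sample variances (``ddof=1``) the weights would need to change, so the convention used by the library should be checked before relying on this sketch.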