plaid.storage.hf_datasets.bridge
================================

.. py:module:: plaid.storage.hf_datasets.bridge

.. autoapi-nested-parse::

   HF Datasets bridge utilities.

   This module provides bridge functions for converting between PLAID datasets/samples
   and the Hugging Face Datasets format. It includes utilities for feature type conversion,
   dataset generation from PLAID objects, and sample reconstruction.


Functions
---------

.. autoapisummary::

   plaid.storage.hf_datasets.bridge.convert_dtype_to_hf_feature
   plaid.storage.hf_datasets.bridge.convert_to_hf_feature
   plaid.storage.hf_datasets.bridge.generator_to_datasetdict
   plaid.storage.hf_datasets.bridge.to_var_sample_dict
   plaid.storage.hf_datasets.bridge.sample_to_var_sample_dict


Module Contents
---------------

.. py:function:: convert_dtype_to_hf_feature(feature_type: dict[str, Any])

   Convert a PLAID feature type dict to a Hugging Face feature.

   :param feature_type: Dictionary with 'dtype' and 'ndim' keys.
   :type feature_type: dict[str, Any]

   :returns: The corresponding HF feature type.
   :rtype: Features or Sequence


.. py:function:: convert_to_hf_feature(variable_schema: dict[str, dict])

   Convert a PLAID variable schema to Hugging Face Features.

   :param variable_schema: Mapping of variable names to type dicts.
   :type variable_schema: dict[str, dict]

   :returns: The HF Features object.
   :rtype: Features
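
   The schema layout described above can be checked before conversion. The snippet
   below is a minimal sketch and not part of PLAID: ``validate_variable_schema`` is a
   hypothetical helper, and it assumes only what the docstrings on this page state,
   namely that each type dict carries ``'dtype'`` and ``'ndim'`` keys.

   .. code-block:: python

      REQUIRED_KEYS = ("dtype", "ndim")


      def validate_variable_schema(variable_schema):
          """Return a list of problems found in a variable schema (hypothetical helper)."""
          problems = []
          for name, type_dict in variable_schema.items():
              # Each entry is expected to be a type dict with 'dtype' and 'ndim'.
              if not isinstance(type_dict, dict):
                  problems.append(f"{name}: expected a dict, got {type(type_dict).__name__}")
                  continue
              for key in REQUIRED_KEYS:
                  if key not in type_dict:
                      problems.append(f"{name}: missing key '{key}'")
          return problems


      schema = {
          "velocity_x": {"dtype": "float32", "ndim": 2},
          "pressure": {"dtype": "float32"},  # 'ndim' deliberately left out
      }
      print(validate_variable_schema(schema))
      # ["pressure: missing key 'ndim'"]

   Running such a check first reports every malformed entry at once, instead of
   stopping at the first error raised during conversion.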


.. py:function:: generator_to_datasetdict(generators: dict[str, Callable[Ellipsis, Generator[plaid.Sample, None, None]]], variable_schema: dict, cache_dir: str, gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert PLAID dataset generators into a Hugging Face `datasets.DatasetDict`.

   This function takes generator functions that yield PLAID samples and converts them
   into a Hugging Face DatasetDict. Each generator corresponds to a split (e.g., "train",
   "test") and the function processes samples by flattening their structure and converting
   them to the Hugging Face format based on the provided variable schema.

   :param generators: Mapping from split names (e.g., "train", "test") to generator functions.
                      Each generator function must yield PLAID Sample objects that will be
                      converted to the Hugging Face format.
   :type generators: dict[str, Callable[..., Generator[Sample, None, None]]]
   :param variable_schema: Dictionary defining the schema of variables/features in the dataset.
                           Maps feature names to their type information (dtype and ndim).
   :type variable_schema: dict[str, dict]
   :param cache_dir: Path of the directory used as the cache for the Hugging Face
                     dataset generation process.
   :type cache_dir: str
   :param gen_kwargs: Optional mapping from split names to dictionaries of keyword arguments
                      to be passed to each generator function. Useful for passing split-specific
                      parameters like sample indices. Default is None, which creates empty
                      kwargs for each split.
   :type gen_kwargs: dict[str, dict[str, list[IndexType]]], optional
   :param processes_number: Number of parallel processes to use when materializing the dataset from
                            the generators. Default is 1 (no parallelization).
   :type processes_number: int, optional
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
                             Default is 1.
   :type writer_batch_size: int, optional

   :returns: A Hugging Face DatasetDict containing one Dataset per split, where each
             dataset contains the samples generated by the corresponding generator.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   >>> # train_samples and test_samples are assumed to be existing iterables of plaid.Sample
   >>> def train_generator():
   ...     for sample in train_samples:
   ...         yield sample
   >>> def test_generator():
   ...     for sample in test_samples:
   ...         yield sample
   >>> variable_schema = {
   ...     "velocity_x": {"dtype": "float32", "ndim": 2},
   ...     "velocity_y": {"dtype": "float32", "ndim": 2}
   ... }
   >>> ds_dict = generator_to_datasetdict(
   ...     generators={"train": train_generator, "test": test_generator},
   ...     variable_schema=variable_schema,
   ...     cache_dir="/tmp/hf_cache",
   ...     processes_number=4,
   ...     writer_batch_size=10
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({
           features: ['velocity_x', 'velocity_y'],
           num_rows: ...
       }),
       test: Dataset({
           features: ['velocity_x', 'velocity_y'],
           num_rows: ...
       })
   })
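
   The ``gen_kwargs`` mapping described above can be illustrated without PLAID or
   Hugging Face installed. The following is a minimal sketch under stated assumptions:
   the keyword name ``ids``, the string stand-ins for ``plaid.Sample`` objects, and the
   explicit loop that unpacks each split's kwargs are all hypothetical. How
   ``generator_to_datasetdict`` invokes the generators internally is not shown on this page.

   .. code-block:: python

      # Stand-ins for plaid.Sample objects, so that the sketch runs without PLAID.
      all_samples = [f"sample_{i}" for i in range(6)]


      def split_generator(ids):
          """Yield the samples selected by ``ids`` (hypothetical keyword name)."""
          for i in ids:
              yield all_samples[i]


      # One generator function can serve several splits when the indices differ.
      generators = {"train": split_generator, "test": split_generator}
      gen_kwargs = {"train": {"ids": [0, 1, 2, 3]}, "test": {"ids": [4, 5]}}

      # Assumed calling convention: each generator receives its own split's kwargs.
      for split, generator in generators.items():
          print(split, list(generator(**gen_kwargs[split])))

   Under these assumptions, the same ``generators`` and ``gen_kwargs`` dictionaries
   would be passed to ``generator_to_datasetdict`` together with ``variable_schema``
   and ``cache_dir``.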


.. py:function:: to_var_sample_dict(ds: datasets.Dataset, i: int, features: Optional[list[str]] = None, enforce_shapes: bool = True) -> dict[str, Optional[numpy.ndarray]]

   Convert a Hugging Face dataset row to a variable sample dict, i.e. a dict holding only the features that vary in the dataset.

   :param ds: The Hugging Face dataset.
   :type ds: datasets.Dataset
   :param i: The row index.
   :type i: int
   :param features: List of feature names (keys) to extract from the dataset.
                    Default is None.
   :type features: list[str], optional
   :param enforce_shapes: Whether to enforce consistent shapes. Default is True.
   :type enforce_shapes: bool, optional

   :returns: The variable sample dictionary.
   :rtype: dict[str, Optional[np.ndarray]]
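
   The documented return shape, a mapping from feature name to an array or ``None``,
   can be pictured with a small stand-in. This is a sketch only and does not reproduce
   the PLAID implementation: ``row_to_var_sample_dict`` is a hypothetical helper, a
   plain dict stands in for a ``datasets.Dataset`` row, treating ``features=None`` as
   "all columns" is an assumption, and ``enforce_shapes`` is not modelled.

   .. code-block:: python

      import numpy as np


      def row_to_var_sample_dict(row, features=None):
          """Map selected feature names to arrays, keeping ``None`` for missing values."""
          # Assumption: no explicit selection means every column of the row.
          names = list(row) if features is None else list(features)
          return {
              name: None if row[name] is None else np.asarray(row[name])
              for name in names
          }


      # A plain dict standing in for one row of a datasets.Dataset.
      row = {"velocity_x": [[0.0, 1.0], [2.0, 3.0]], "pressure": None}
      var_sample = row_to_var_sample_dict(row, features=["velocity_x", "pressure"])
      print(var_sample["velocity_x"].shape)  # (2, 2)
      print(var_sample["pressure"])          # None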


.. py:function:: sample_to_var_sample_dict(hf_sample: dict[str, Any]) -> dict[str, Any]

   Convert a Hugging Face sample dict to a variable sample dict.

   :param hf_sample: The HF sample dictionary.
   :type hf_sample: dict[str, Any]

   :returns: The processed variable sample dictionary.
   :rtype: dict[str, Any]