plaid.storage.hf_datasets.bridge
================================

.. py:module:: plaid.storage.hf_datasets.bridge

.. autoapi-nested-parse::

   HF Datasets bridge utilities.

   This module provides bridge functions for converting between PLAID datasets/samples
   and Hugging Face Datasets format. It includes utilities for feature type conversion,
   dataset generation from PLAID objects, and sample reconstruction.


Functions
---------

.. autoapisummary::

   plaid.storage.hf_datasets.bridge.convert_dtype_to_hf_feature
   plaid.storage.hf_datasets.bridge.convert_to_hf_feature
   plaid.storage.hf_datasets.bridge.plaid_dataset_to_datasetdict
   plaid.storage.hf_datasets.bridge.generator_to_datasetdict
   plaid.storage.hf_datasets.bridge.to_var_sample_dict
   plaid.storage.hf_datasets.bridge.sample_to_var_sample_dict


Module Contents
---------------

.. py:function:: convert_dtype_to_hf_feature(feature_type: dict[str, Any])

   Convert a PLAID feature type dict to a Hugging Face feature type.

   :param feature_type: Dictionary with 'dtype' and 'ndim' keys.
   :type feature_type: dict

   :returns: The corresponding HF feature type.
   :rtype: Features or Sequence

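   One plausible reading of the ``'dtype'``/``'ndim'`` contract is that a scalar
   type is wrapped in one sequence level per array dimension. The sketch below
   illustrates only that nesting idea. It is an assumption, not behaviour
   confirmed by this page: ``describe_nested_feature`` is a hypothetical helper,
   and plain Python lists stand in for Hugging Face feature objects.

   .. code-block:: python

      from typing import Any


      def describe_nested_feature(feature_type: dict[str, Any]) -> Any:
          """Return a plain-Python stand-in for the nested feature layout."""
          # Start from the scalar dtype name, e.g. "float32".
          spec: Any = feature_type["dtype"]
          # Wrap it in one list level per array dimension (assumed convention).
          for _ in range(feature_type["ndim"]):
              spec = [spec]
          return spec


      scalar_spec = describe_nested_feature({"dtype": "float32", "ndim": 0})
      matrix_spec = describe_nested_feature({"dtype": "int64", "ndim": 2})
      print(scalar_spec)  # float32
      print(matrix_spec)  # [['int64']]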

.. py:function:: convert_to_hf_feature(variable_schema: dict[str, dict])

   Convert a PLAID variable schema to Hugging Face Features.

   :param variable_schema: Mapping of variable names to type dicts.
   :type variable_schema: dict[str, dict]

   :returns: The HF Features object.
   :rtype: Features


.. py:function:: plaid_dataset_to_datasetdict(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], var_features_types: dict[str, dict], processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert a PLAID dataset into a Hugging Face `datasets.DatasetDict`.

   This is a thin wrapper that creates per-split generators from a PLAID dataset
   and delegates the actual dataset construction to
   `generator_to_datasetdict`.

   :param dataset: The PLAID dataset to be converted. Must support indexing with
                   a list of IDs (from `main_splits`).
   :type dataset: plaid.Dataset
   :param main_splits: Mapping from split names (e.g. "train", "test") to the subset of
                       sample indices belonging to that split.
   :type main_splits: dict[str, IndexType]
   :param var_features_types: Dictionary mapping feature names to their type information.
   :type var_features_types: dict[str, dict]
   :param processes_number: Number of parallel processes to use when writing the Hugging Face dataset.
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
   :type writer_batch_size: int, optional, default=1

   :returns: A Hugging Face `DatasetDict` containing one dataset per split.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   >>> ds_dict = plaid_dataset_to_datasetdict(
   ...     dataset=my_plaid_dataset,
   ...     main_splits={"train": [0, 1, 2], "test": [3]},
   ...     var_features_types=my_var_features_types,
   ...     processes_number=4,
   ...     writer_batch_size=3
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({
           features: ...
       }),
       test: Dataset({
           features: ...
       })
   })

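   The ``main_splits`` argument is a plain mapping from split names to sample
   ids, so it can be built with ordinary Python. The sketch below shows one way
   to produce a reproducible, disjoint train/test partition. ``make_main_splits``
   and its parameters are hypothetical and not part of this module.

   .. code-block:: python

      import random


      def make_main_splits(
          n_samples: int, test_fraction: float = 0.25, seed: int = 0
      ) -> dict[str, list[int]]:
          """Partition sample ids 0..n_samples-1 into disjoint train/test lists."""
          ids = list(range(n_samples))
          # Shuffle with a fixed seed so the split is reproducible.
          random.Random(seed).shuffle(ids)
          n_test = int(n_samples * test_fraction)
          return {"train": sorted(ids[n_test:]), "test": sorted(ids[:n_test])}


      main_splits = make_main_splits(8)
      print({name: len(ids) for name, ids in main_splits.items()})

   The resulting dictionary can then be passed as ``main_splits``, in the same
   form as the literal mapping used in the example above.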

.. py:function:: generator_to_datasetdict(generators: dict[str, Callable[Ellipsis, Generator[plaid.Sample, None, None]]], variable_schema: dict, gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert PLAID dataset generators into a Hugging Face `datasets.DatasetDict`.

   This function inspects samples produced by the given generators, flattens their
   CGNS tree structure, infers Hugging Face feature types, and builds one
   `datasets.Dataset` per split. Constant features (identical across all samples)
   are separated out from variable features.
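
   The separation of constant from variable features can be pictured with a
   small, self-contained sketch. It assumes samples have already been flattened
   to dictionaries keyed by path, that every sample has the same keys, and that
   values are plain lists so ``==`` compares them as a whole.
   ``split_constant_features`` is a hypothetical helper, not the code used by
   this module.

   .. code-block:: python

      from typing import Any


      def split_constant_features(
          flat_samples: list[dict[str, Any]],
      ) -> tuple[dict[str, Any], list[str]]:
          """Separate keys whose value is identical in every sample from the rest."""
          constant: dict[str, Any] = {}
          variable: list[str] = []
          for key, first in flat_samples[0].items():
              # A feature is constant if all later samples match the first one.
              if all(sample[key] == first for sample in flat_samples[1:]):
                  constant[key] = first
              else:
                  variable.append(key)
          return constant, variable


      samples = [
          {"Zone1/GridCoordinates": [0.0, 0.1, 0.2],
           "Zone1/FlowSolution/VelocityX": [1.0, 2.0, 3.0]},
          {"Zone1/GridCoordinates": [0.0, 0.1, 0.2],
           "Zone1/FlowSolution/VelocityX": [4.0, 5.0, 6.0]},
      ]
      flat_cst, variable_features = split_constant_features(samples)
      print(flat_cst)
      print(variable_features)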

   :param generators: Mapping from split names (e.g., "train", "test") to generator functions.
                      Each generator function must return an iterable of PLAID samples, where
                      each sample provides `sample.features.data[0.0]` for flattening.
   :type generators: dict[str, Callable]
   :param variable_schema: Dictionary defining the schema of variables/features in the dataset.
   :type variable_schema: dict
   :param processes_number: Number of processes used internally by Hugging Face when materializing
                            the dataset from the generators.
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
   :type writer_batch_size: int, optional, default=1
   :param gen_kwargs: Optional mapping from split names to dictionaries of keyword arguments
                      to be passed to each generator function, used for parallelization.
   :type gen_kwargs: dict, optional, default=None

   :returns:

                 - **DatasetDict** (`datasets.DatasetDict`):
                   A Hugging Face dataset dictionary with one dataset per split.
                 - **flat_cst** (`dict[str, Any]`):
                   Dictionary of constant features detected across all splits.
                 - **key_mappings** (`dict[str, Any]`):
                   Metadata dictionary containing:

                   - `"variable_features"`: list of paths for non-constant features.
                   - `"constant_features"`: list of paths for constant features.
                   - `"cgns_types"`: inferred CGNS types for all features.
   :rtype: tuple

   .. rubric:: Example

   >>> ds_dict, flat_cst, key_mappings = generator_to_datasetdict(
   ...     {"train": lambda: iter(train_samples),
   ...      "test": lambda: iter(test_samples)},
   ...     variable_schema=my_variable_schema,
   ...     processes_number=4,
   ...     writer_batch_size=2
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({
           features: ...
       }),
       test: Dataset({
           features: ...
       })
   })
   >>> print(flat_cst)
   {'Zone1/GridCoordinates': array([0., 0.1, 0.2])}
   >>> print(key_mappings["variable_features"][:3])
   ['Zone1/FlowSolution/VelocityX', 'Zone1/FlowSolution/VelocityY', ...]

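   The type of ``gen_kwargs`` suggests that each split maps keyword names to a
   list of index shards, which can then be spread across worker processes. The
   sketch below builds such a structure under that assumption. ``shard_ids`` is
   a hypothetical helper and ``"shards_ids"`` is a hypothetical keyword name;
   use whatever keyword your generator functions actually accept.

   .. code-block:: python

      def shard_ids(ids: list[int], n_shards: int) -> list[list[int]]:
          """Split ids into n_shards contiguous chunks whose sizes differ by at most one."""
          base, extra = divmod(len(ids), n_shards)
          shards: list[list[int]] = []
          start = 0
          for k in range(n_shards):
              # The first `extra` shards each take one additional id.
              stop = start + base + (1 if k < extra else 0)
              shards.append(ids[start:stop])
              start = stop
          return shards


      main_splits = {"train": [0, 1, 2, 3, 4], "test": [5, 6]}
      # One kwargs dict per split; the keyword name is an assumption.
      gen_kwargs = {
          name: {"shards_ids": shard_ids(ids, 2)}
          for name, ids in main_splits.items()
      }
      print(gen_kwargs)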

.. py:function:: to_var_sample_dict(ds: datasets.Dataset, i: int, enforce_shapes: bool = True) -> dict[str, Any]

   Convert one row of a Hugging Face dataset to a variable sample dict, i.e. a dict holding only the features that vary across the dataset.

   :param ds: The Hugging Face dataset.
   :type ds: datasets.Dataset
   :param i: The row index.
   :type i: int
   :param enforce_shapes: Whether to enforce consistent shapes.
   :type enforce_shapes: bool

   :returns: The variable sample dictionary.
   :rtype: dict


.. py:function:: sample_to_var_sample_dict(hf_sample: dict[str, Any]) -> dict[str, Any]

   Convert a Hugging Face sample dict to a variable sample dict.

   :param hf_sample: The HF sample dictionary.
   :type hf_sample: dict

   :returns: The processed variable sample dictionary.
   :rtype: dict