plaid.storage.hf_datasets.bridge
================================

.. py:module:: plaid.storage.hf_datasets.bridge

.. autoapi-nested-parse::

   HF Datasets bridge utilities.

   This module provides bridge functions for converting between PLAID datasets/samples
   and the Hugging Face Datasets format. It includes utilities for feature type conversion,
   dataset generation from PLAID objects, and sample reconstruction.


Functions
---------

.. autoapisummary::

   plaid.storage.hf_datasets.bridge.convert_dtype_to_hf_feature
   plaid.storage.hf_datasets.bridge.convert_to_hf_feature
   plaid.storage.hf_datasets.bridge.plaid_dataset_to_datasetdict
   plaid.storage.hf_datasets.bridge.generator_to_datasetdict
   plaid.storage.hf_datasets.bridge.to_var_sample_dict
   plaid.storage.hf_datasets.bridge.sample_to_var_sample_dict


Module Contents
---------------

.. py:function:: convert_dtype_to_hf_feature(feature_type: dict[str, Any])

   Convert a PLAID feature type dict to a Hugging Face feature.

   :param feature_type: Dictionary with 'dtype' and 'ndim' keys.
   :type feature_type: dict

   :returns: The corresponding HF feature type.
   :rtype: Features or Sequence


.. py:function:: convert_to_hf_feature(variable_schema: dict[str, dict])

   Convert a PLAID variable schema to Hugging Face Features.

   :param variable_schema: Mapping of variable names to type dicts.
   :type variable_schema: dict[str, dict]

   :returns: The HF Features object.
   :rtype: Features


.. py:function:: plaid_dataset_to_datasetdict(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], var_features_types: dict[str, dict], processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert a PLAID dataset into a Hugging Face `datasets.DatasetDict`.

   This is a thin wrapper that creates per-split generators from a PLAID dataset
   and delegates the actual dataset construction to `generator_to_datasetdict`.

   :param dataset: The PLAID dataset to be converted. Must support indexing with a list of IDs (from `main_splits`).
   :type dataset: plaid.Dataset
   :param main_splits: Mapping from split names (e.g. "train", "test") to the subset of sample indices belonging to that split.
   :type main_splits: dict[str, IndexType]
   :param var_features_types: Dictionary mapping feature names to their type information.
   :type var_features_types: dict[str, dict]
   :param processes_number: Number of parallel processes to use when writing the Hugging Face dataset.
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
   :type writer_batch_size: int, optional, default=1

   :returns: A Hugging Face `DatasetDict` containing one dataset per split.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   >>> ds_dict = plaid_dataset_to_datasetdict(
   ...     dataset=my_plaid_dataset,
   ...     main_splits={"train": [0, 1, 2], "test": [3]},
   ...     var_features_types=my_var_features_types,
   ...     processes_number=4,
   ...     writer_batch_size=3
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({ features: ... }),
       test: Dataset({ features: ... })
   })


.. py:function:: generator_to_datasetdict(generators: dict[str, Callable[Ellipsis, Generator[plaid.Sample, None, None]]], variable_schema: dict, gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert PLAID dataset generators into a Hugging Face `datasets.DatasetDict`.

   This function inspects samples produced by the given generators,
   flattens their CGNS tree structure, infers Hugging Face feature types,
   and builds one `datasets.Dataset` per split. Constant features (identical
   across all samples) are separated out from variable features.

   :param generators: Mapping from split names (e.g., "train", "test") to generator functions. Each generator function must return an iterable of PLAID samples, where each sample provides `sample.features.data[0.0]` for flattening.
   :type generators: dict[str, Callable]
   :param variable_schema: Dictionary defining the schema of variables/features in the dataset.
   :type variable_schema: dict
   :param processes_number: Number of processes used internally by Hugging Face when materializing the dataset from the generators.
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
   :type writer_batch_size: int, optional, default=1
   :param gen_kwargs: Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function, used for parallelization.
   :type gen_kwargs: dict, optional, default=None

   :returns: - **DatasetDict** (`datasets.DatasetDict`): A Hugging Face dataset dictionary with one dataset per split.
             - **flat_cst** (`dict[str, Any]`): Dictionary of constant features detected across all splits.
             - **key_mappings** (`dict[str, Any]`): Metadata dictionary containing:

               - `"variable_features"`: list of paths for non-constant features.
               - `"constant_features"`: list of paths for constant features.
               - `"cgns_types"`: inferred CGNS types for all features.
   :rtype: tuple

   .. rubric:: Example

   >>> ds_dict, flat_cst, key_mappings = generator_to_datasetdict(
   ...     {"train": lambda: iter(train_samples),
   ...      "test": lambda: iter(test_samples)},
   ...     variable_schema=my_variable_schema,
   ...     processes_number=4,
   ...     writer_batch_size=2
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({ features: ... }),
       test: Dataset({ features: ... })
   })
   >>> print(flat_cst)
   {'Zone1/GridCoordinates': array([0., 0.1, 0.2])}
   >>> print(key_mappings["variable_features"][:3])
   ['Zone1/FlowSolution/VelocityX', 'Zone1/FlowSolution/VelocityY', ...]


.. py:function:: to_var_sample_dict(ds: datasets.Dataset, i: int, enforce_shapes: bool = True) -> dict[str, Any]

   Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.

   :param ds: The Hugging Face dataset.
   :type ds: datasets.Dataset
   :param i: The row index.
   :type i: int
   :param enforce_shapes: Whether to enforce consistent shapes.
   :type enforce_shapes: bool

   :returns: The variable sample dictionary.
   :rtype: dict


.. py:function:: sample_to_var_sample_dict(hf_sample: dict[str, Any]) -> dict[str, Any]

   Convert a Hugging Face sample dict to variable sample dict.

   :param hf_sample: The HF sample dictionary.
   :type hf_sample: dict

   :returns: The processed variable sample dictionary.
   :rtype: dict
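

.. rubric:: Sketch: per-split generators and ``gen_kwargs``

The ``generators`` and ``gen_kwargs`` arguments described for `generator_to_datasetdict` may be easier to picture with a small stand-alone sketch. The code below does not call PLAID or Hugging Face. It only illustrates one possible shape for the inputs: a mapping from split names to generator functions, plus a per-split dictionary of keyword arguments that divides the indices into shards. The helper names ``make_split_generators`` and ``split_into_shards``, the keyword name ``shards_ids``, and the string "samples" are all assumptions made for illustration; they are not taken from the module.

```python
from typing import Any, Callable, Iterator, Optional, Sequence


def split_into_shards(ids: Sequence[int], n_shards: int) -> list[list[int]]:
    """Cut a list of indices into at most n_shards contiguous chunks (hypothetical helper)."""
    ids = list(ids)
    n_shards = max(1, min(n_shards, len(ids)))
    size, extra = divmod(len(ids), n_shards)
    shards, start = [], 0
    for k in range(n_shards):
        # The first `extra` shards receive one additional index each.
        stop = start + size + (1 if k < extra else 0)
        shards.append(ids[start:stop])
        start = stop
    return shards


def make_split_generators(
    samples: Sequence[Any], main_splits: dict[str, Sequence[int]]
) -> dict[str, Callable[..., Iterator[Any]]]:
    """Build one generator function per split (hypothetical helper)."""

    def make_gen(default_ids: list[int]) -> Callable[..., Iterator[Any]]:
        def gen(shards_ids: Optional[list[list[int]]] = None) -> Iterator[Any]:
            # Without shards, walk the whole split; otherwise walk each shard in order.
            shards = shards_ids if shards_ids is not None else [default_ids]
            for shard in shards:
                for idx in shard:
                    yield samples[idx]

        return gen

    return {name: make_gen(list(ids)) for name, ids in main_splits.items()}


# Stand-in data: plain strings play the role of PLAID samples here.
samples = ["s0", "s1", "s2", "s3"]
main_splits = {"train": [0, 1, 2], "test": [3]}

generators = make_split_generators(samples, main_splits)
# One kwargs dict per split; "shards_ids" is an assumed keyword name.
gen_kwargs = {
    name: {"shards_ids": split_into_shards(ids, 2)}
    for name, ids in main_splits.items()
}

train_all = list(generators["train"]())
train_sharded = list(generators["train"](**gen_kwargs["train"]))
```

In this sketch the sharded and unsharded calls yield the same samples in the same order, which is the property a parallel writer would rely on if each worker were handed a subset of the shards.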