plaid.storage.hf_datasets.bridge
================================

.. py:module:: plaid.storage.hf_datasets.bridge

.. autoapi-nested-parse::

   HF Datasets bridge utilities.

   This module provides bridge functions for converting between PLAID datasets/samples
   and Hugging Face Datasets format. It includes utilities for feature type conversion,
   dataset generation from PLAID objects, and sample reconstruction.


Functions
---------

.. autoapisummary::

   plaid.storage.hf_datasets.bridge.convert_dtype_to_hf_feature
   plaid.storage.hf_datasets.bridge.convert_to_hf_feature
   plaid.storage.hf_datasets.bridge.generator_to_datasetdict
   plaid.storage.hf_datasets.bridge.to_var_sample_dict
   plaid.storage.hf_datasets.bridge.sample_to_var_sample_dict


Module Contents
---------------

.. py:function:: convert_dtype_to_hf_feature(feature_type: dict[str, Any])

   Convert a PLAID feature type dict to Hugging Face Feature.

   :param feature_type: Dictionary with 'dtype' and 'ndim' keys.
   :type feature_type: dict

   :returns: The corresponding HF feature type.
   :rtype: Features or Sequence


.. py:function:: convert_to_hf_feature(variable_schema: dict[str, dict])

   Convert a PLAID variable schema to Hugging Face Features.

   :param variable_schema: Mapping of variable names to type dicts.
   :type variable_schema: dict[str, dict]

   :returns: The HF Features object.
   :rtype: Features


.. py:function:: generator_to_datasetdict(generators: dict[str, Callable[Ellipsis, Generator[plaid.Sample, None, None]]], variable_schema: dict, cache_dir: str, gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict

   Convert PLAID dataset generators into a Hugging Face `datasets.DatasetDict`.

   This function takes generator functions that yield PLAID samples and converts them into a Hugging Face DatasetDict.
   Each generator corresponds to a split (e.g., "train", "test") and the function processes samples by
   flattening their structure and converting them to the Hugging Face format based on the provided variable schema.

   :param generators: Mapping from split names (e.g., "train", "test") to generator functions.
                      Each generator function must yield PLAID Sample objects that will be
                      converted to the Hugging Face format.
   :type generators: dict[str, Callable[..., Generator[Sample, None, None]]]
   :param variable_schema: Dictionary defining the schema of variables/features in the dataset.
                           Maps feature names to their type information (dtype and ndim).
   :type variable_schema: dict[str, dict]
   :param cache_dir: Directory path used as cache directory for the Hugging Face dataset
                     generation process.
   :type cache_dir: str
   :param gen_kwargs: Optional mapping from split names to dictionaries of keyword arguments
                      to be passed to each generator function. Useful for passing split-specific
                      parameters like sample indices. Default is None, which creates empty
                      kwargs for each split.
   :type gen_kwargs: dict[str, dict[str, list[IndexType]]], optional
   :param processes_number: Number of parallel processes to use when materializing the dataset
                            from the generators. Default is 1 (no parallelization).
   :type processes_number: int, optional
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
                             Default is 1.
   :type writer_batch_size: int, optional

   :returns: A Hugging Face DatasetDict containing one Dataset per split, where each
             dataset contains the samples generated by the corresponding generator.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   >>> def train_generator():
   ...     for sample in train_samples:
   ...         yield sample
   >>> def test_generator():
   ...     for sample in test_samples:
   ...         yield sample
   >>> variable_schema = {
   ...     "velocity_x": {"dtype": "float32", "ndim": 2},
   ...     "velocity_y": {"dtype": "float32", "ndim": 2}
   ... }
   >>> ds_dict = generator_to_datasetdict(
   ...     generators={"train": train_generator, "test": test_generator},
   ...     variable_schema=variable_schema,
   ...     cache_dir="/tmp/hf_cache",
   ...     processes_number=4,
   ...     writer_batch_size=10
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({
           features: ['velocity_x', 'velocity_y'],
           num_rows: ...
       }),
       test: Dataset({
           features: ['velocity_x', 'velocity_y'],
           num_rows: ...
       })
   })


.. py:function:: to_var_sample_dict(ds: datasets.Dataset, i: int, features: Optional[list[str]] = None, enforce_shapes: bool = True) -> dict[str, Optional[numpy.ndarray]]

   Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.

   :param ds: The Hugging Face dataset.
   :type ds: datasets.Dataset
   :param i: The row index.
   :type i: int
   :param features: Feature names (keys) to extract from the dataset. Default is None.
   :type features: list[str], optional
   :param enforce_shapes: Whether to enforce consistent shapes. Default is True.
   :type enforce_shapes: bool, optional

   :returns: The variable sample dictionary.
   :rtype: dict[str, Optional[np.ndarray]]


.. py:function:: sample_to_var_sample_dict(hf_sample: dict[str, Any]) -> dict[str, Any]

   Convert a Hugging Face sample dict to a variable sample dict.

   :param hf_sample: The HF sample dictionary.
   :type hf_sample: dict

   :returns: The processed variable sample dictionary.
   :rtype: dict
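

The schema entries used throughout this module pair a ``dtype`` with an ``ndim``. As a rough mental model only, the sketch below shows how such an entry could be expanded into a nested description, one level of nesting per dimension. It is pure Python and does not call the ``datasets`` library; ``sketch_nested_spec`` and ``sketch_schema`` are hypothetical names introduced for illustration, and the actual objects returned by ``convert_dtype_to_hf_feature`` and ``convert_to_hf_feature`` are Hugging Face feature types, which this sketch does not reproduce.

```python
from typing import Any


def sketch_nested_spec(feature_type: dict[str, Any]) -> Any:
    """Hypothetical stand-in: wrap the dtype name in one list per dimension.

    This is an illustration of the dtype/ndim idea, not the PLAID implementation.
    """
    spec: Any = feature_type["dtype"]
    for _ in range(feature_type["ndim"]):
        spec = [spec]  # each dimension adds one level of nesting
    return spec


def sketch_schema(variable_schema: dict[str, dict]) -> dict[str, Any]:
    """Apply the per-variable sketch to every entry of a variable schema."""
    return {name: sketch_nested_spec(ft) for name, ft in variable_schema.items()}


# Example schema in the same shape as the one shown for generator_to_datasetdict.
# "pressure" is an assumed extra variable, added to show the scalar (ndim=0) case.
schema = {
    "velocity_x": {"dtype": "float32", "ndim": 2},
    "pressure": {"dtype": "float64", "ndim": 0},
}
specs = sketch_schema(schema)
print(specs)
```

A two-dimensional variable ends up as a doubly nested description, while a scalar keeps the bare dtype name. The sketch is meant only to make the role of ``ndim`` in the variable schema concrete.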