plaid.storage.hf_datasets.bridge
HF Datasets bridge utilities.
This module provides bridge functions for converting between PLAID datasets/samples and Hugging Face Datasets format. It includes utilities for feature type conversion, dataset generation from PLAID objects, and sample reconstruction.
Functions
| convert_dtype_to_hf_feature | Convert a PLAID feature type dict to Hugging Face Feature. |
| convert_to_hf_feature | Convert a PLAID variable schema to Hugging Face Features. |
| generator_to_datasetdict | Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict. |
| to_var_sample_dict | Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset. |
| | Convert a Hugging Face sample dict to variable sample dict. |
Module Contents
- convert_dtype_to_hf_feature(feature_type: dict[str, Any])
Convert a PLAID feature type dict to Hugging Face Feature.
- Parameters:
feature_type (dict) – Dictionary with ‘dtype’ and ‘ndim’ keys.
- Returns:
The corresponding HF feature type.
- Return type:
Features or Sequence
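As a rough illustration of how a 'dtype'/'ndim' dictionary could map to a nested Hugging Face feature, the sketch below spells out the nesting as a string. The helper `describe_hf_feature` is hypothetical, and the rule it uses (one `Sequence` wrapper per dimension around a `Value` of the given dtype) is an assumption, not necessarily what `convert_dtype_to_hf_feature` does internally.

```python
from typing import Any


def describe_hf_feature(feature_type: dict[str, Any]) -> str:
    """Hypothetical helper: describe the nested feature for a dtype/ndim dict.

    Assumption: each array dimension becomes one Sequence wrapper around
    a scalar Value of the given dtype. The real bridge may differ.
    """
    dtype = feature_type["dtype"]
    ndim = feature_type["ndim"]
    if ndim < 0:
        raise ValueError("ndim must be non-negative")
    description = f"Value('{dtype}')"
    # Wrap the scalar description once per dimension.
    for _ in range(ndim):
        description = f"Sequence({description})"
    return description


print(describe_hf_feature({"dtype": "float32", "ndim": 2}))
# prints: Sequence(Sequence(Value('float32')))
```

Under this assumption, a scalar feature (ndim of 0) would map to a bare `Value`, and each additional dimension adds one level of nesting.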
- convert_to_hf_feature(variable_schema: dict[str, dict])
Convert a PLAID variable schema to Hugging Face Features.
- generator_to_datasetdict(generators: dict[str, Callable[..., Generator[plaid.Sample, None, None]]], variable_schema: dict, cache_dir: str, gen_kwargs: dict[str, dict[str, list[plaid.types.IndexType]]] | None = None, processes_number: int = 1, writer_batch_size: int = 1) -> datasets.DatasetDict
Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.
This function takes generator functions that yield PLAID samples and converts them into a Hugging Face DatasetDict. Each generator corresponds to a split (e.g., “train”, “test”). Samples are flattened and converted to the Hugging Face format according to the provided variable schema.
- Parameters:
generators (dict[str, Callable[..., Generator[Sample, None, None]]]) – Mapping from split names (e.g., “train”, “test”) to generator functions. Each generator function must yield PLAID Sample objects that will be converted to the Hugging Face format.
variable_schema (dict[str, dict]) – Dictionary defining the schema of variables/features in the dataset. Maps feature names to their type information (dtype and ndim).
cache_dir (str) – Path to the directory used as the cache during the Hugging Face dataset generation process.
gen_kwargs (dict[str, dict[str, list[IndexType]]], optional) – Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function. Useful for passing split-specific parameters like sample indices. Default is None, which creates empty kwargs for each split.
processes_number (int, optional) – Number of parallel processes to use when materializing the dataset from the generators. Default is 1 (no parallelization).
writer_batch_size (int, optional) – Batch size used when writing samples to disk in Hugging Face format. Default is 1.
- Returns:
A Hugging Face DatasetDict containing one Dataset per split, where each dataset contains the samples generated by the corresponding generator.
- Return type:
datasets.DatasetDict
Example
>>> def train_generator():
...     for sample in train_samples:
...         yield sample
>>> def test_generator():
...     for sample in test_samples:
...         yield sample
>>> variable_schema = {
...     "velocity_x": {"dtype": "float32", "ndim": 2},
...     "velocity_y": {"dtype": "float32", "ndim": 2}
... }
>>> ds_dict = generator_to_datasetdict(
...     generators={"train": train_generator, "test": test_generator},
...     variable_schema=variable_schema,
...     cache_dir="/tmp/hf_cache",
...     processes_number=4,
...     writer_batch_size=10
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    }),
    test: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    })
})
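The gen_kwargs parameter of generator_to_datasetdict maps each split name to keyword arguments for that split's generator. The sketch below illustrates that pairing in plain Python, without calling the library. The `ids` argument name, the `make_split_generator` helper, and the use of strings in place of PLAID Sample objects are all assumptions made for illustration.

```python
from typing import Callable, Iterator


def make_split_generator(all_samples: list[str]) -> Callable[..., Iterator[str]]:
    """Hypothetical factory: build a generator that yields the samples at `ids`."""

    def generator(ids: list[int]) -> Iterator[str]:
        for i in ids:
            yield all_samples[i]

    return generator


# Strings stand in for PLAID Sample objects in this sketch.
all_samples = [f"sample_{i}" for i in range(5)]

generators = {
    "train": make_split_generator(all_samples),
    "test": make_split_generator(all_samples),
}

# One kwargs dict per split, keyed by the same split names as `generators`.
gen_kwargs = {"train": {"ids": [0, 1, 2]}, "test": {"ids": [3, 4]}}

# Call each split's generator with its own kwargs (empty if none are given).
materialized = {
    split: list(gen(**gen_kwargs.get(split, {})))
    for split, gen in generators.items()
}
print(materialized)
```

Keeping the index lists in gen_kwargs rather than hard-coding them in the generators lets one generator function serve several splits.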
- to_var_sample_dict(ds: datasets.Dataset, i: int, features: list[str] | None = None, enforce_shapes: bool = True) -> dict[str, numpy.ndarray | None]
Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.
- Parameters:
- Returns:
The variable sample dictionary.
- Return type: