plaid.storage.hf_datasets.bridge

HF Datasets bridge utilities.

This module provides bridge functions for converting between PLAID datasets/samples and Hugging Face Datasets format. It includes utilities for feature type conversion, dataset generation from PLAID objects, and sample reconstruction.

Functions

`convert_dtype_to_hf_feature`(feature_type) – Convert a PLAID feature type dict to Hugging Face Feature.
`convert_to_hf_feature`(variable_schema) – Convert a PLAID variable schema to Hugging Face Features.
`generator_to_datasetdict`(→ datasets.DatasetDict) – Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.
`to_var_sample_dict`(→ dict[str, Optional[numpy.ndarray]]) – Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.
`sample_to_var_sample_dict`(→ dict[str, Any]) – Convert a Hugging Face sample dict to a variable sample dict.

Module Contents

convert_dtype_to_hf_feature(feature_type: dict[str, Any])

Convert a PLAID feature type dict to Hugging Face Feature.

Parameters: feature_type (dict) – Dictionary with ‘dtype’ and ‘ndim’ keys.
Returns: The corresponding HF feature type.
Return type: Features or Sequence

convert_to_hf_feature(variable_schema: dict[str, dict])

Convert a PLAID variable schema to Hugging Face Features.

Parameters: variable_schema (dict[str, dict]) – Mapping of variable names to type dicts.
Returns: The HF Features object.
Return type: Features
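
The two conversion helpers above map a ‘dtype’/‘ndim’ description to a Hugging Face feature, one variable at a time. The following is a minimal, dependency-free sketch of one plausible mapping, assuming that `ndim` translates into the nesting depth of `Sequence` around a scalar `Value`. That rule is an assumption for illustration and is not confirmed by this page; the helper names `describe_feature` and `describe_schema` are hypothetical and only build descriptive strings rather than real `datasets` objects.

```python
from typing import Any


def describe_feature(feature_type: dict[str, Any]) -> str:
    """Describe a nested feature as text (illustrative, not the module's code)."""
    # Start from the scalar type, e.g. Value('float32').
    description = f"Value('{feature_type['dtype']}')"
    # Assumption: each dimension adds one level of Sequence nesting.
    for _ in range(feature_type["ndim"]):
        description = f"Sequence({description})"
    return description


def describe_schema(variable_schema: dict[str, dict]) -> dict[str, str]:
    """Apply describe_feature to every variable in a schema."""
    return {name: describe_feature(ft) for name, ft in variable_schema.items()}


schema = {
    "velocity_x": {"dtype": "float32", "ndim": 2},
    "pressure": {"dtype": "float64", "ndim": 0},
}
described = describe_schema(schema)
print(described["velocity_x"])  # Sequence(Sequence(Value('float32')))
print(described["pressure"])    # Value('float64')
```

The actual functions return `datasets` feature objects and may choose different feature types than this sketch suggests.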

generator_to_datasetdict(generators: dict[str, Callable[..., Generator[plaid.Sample, None, None]]], variable_schema: dict, cache_dir: str, gen_kwargs: dict[str, dict[str, list[plaid.types.IndexType]]] | None = None, processes_number: int = 1, writer_batch_size: int = 1) → datasets.DatasetDict

Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.

This function takes generator functions that yield PLAID samples and converts them into a Hugging Face DatasetDict. Each generator corresponds to a split (e.g., “train”, “test”) and the function processes samples by flattening their structure and converting them to the Hugging Face format based on the provided variable schema.

Parameters:

generators (dict[str, Callable[..., Generator[Sample, None, None]]]) – Mapping from split names (e.g., “train”, “test”) to generator functions. Each generator function must yield PLAID Sample objects that will be converted to the Hugging Face format.
variable_schema (dict[str, dict]) – Dictionary defining the schema of variables/features in the dataset. Maps feature names to their type information (dtype and ndim).
cache_dir (str) – Path of the cache directory used by the Hugging Face dataset generation process.
gen_kwargs (dict[str, dict[str, list[IndexType]]], optional) – Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function. Useful for passing split-specific parameters like sample indices. Default is None, which creates empty kwargs for each split.
processes_number (int, optional) – Number of parallel processes to use when materializing the dataset from the generators. Default is 1 (no parallelization).
writer_batch_size (int, optional) – Batch size used when writing samples to disk in Hugging Face format. Default is 1.

Returns:

A Hugging Face DatasetDict containing one Dataset per split, where each dataset contains the samples generated by the corresponding generator.

Return type:

datasets.DatasetDict

Example

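>>> from plaid.storage.hf_datasets.bridge import generator_to_datasetdict
>>> # train_samples and test_samples are assumed to be existing iterables of PLAID Sample objects.
>>> def train_generator():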
...     for sample in train_samples:
...         yield sample
>>> def test_generator():
...     for sample in test_samples:
...         yield sample
>>> variable_schema = {
...     "velocity_x": {"dtype": "float32", "ndim": 2},
...     "velocity_y": {"dtype": "float32", "ndim": 2}
... }
>>> ds_dict = generator_to_datasetdict(
...     generators={"train": train_generator, "test": test_generator},
...     variable_schema=variable_schema,
...     cache_dir="/tmp/hf_cache",
...     processes_number=4,
...     writer_batch_size=10
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    }),
    test: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    })
})
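
The `gen_kwargs` parameter is not used in the example above. The sketch below illustrates, in plain Python, how split-specific keyword arguments such as sample indices could reach each generator, and how the documented default (`None` becomes empty kwargs for every split) behaves. The keyword name `ids`, the helper `materialize`, and the dict-based stand-in samples are all hypothetical; the real function builds a `datasets.DatasetDict` rather than lists.

```python
from typing import Callable, Iterator, Optional

# Stand-in data: plain dicts play the role of PLAID Sample objects.
all_samples = [{"id": i} for i in range(6)]


def split_generator(ids: Optional[list[int]] = None) -> Iterator[dict]:
    """Yield the samples selected by `ids` (hypothetical keyword), or all of them."""
    selected = range(len(all_samples)) if ids is None else ids
    for i in selected:
        yield all_samples[i]


generators: dict[str, Callable[..., Iterator[dict]]] = {
    "train": split_generator,
    "test": split_generator,
}
# One kwargs dict per split, keyed by the same split names as `generators`.
gen_kwargs = {"train": {"ids": [0, 1, 2, 3]}, "test": {"ids": [4, 5]}}


def materialize(generators, gen_kwargs=None):
    """Call each split's generator with its kwargs and collect the results."""
    # Mirror the documented default: None means empty kwargs for every split.
    if gen_kwargs is None:
        gen_kwargs = {split: {} for split in generators}
    return {
        split: list(gen(**gen_kwargs[split]))
        for split, gen in generators.items()
    }


splits = materialize(generators, gen_kwargs)
print([s["id"] for s in splits["train"]])  # [0, 1, 2, 3]
print([s["id"] for s in splits["test"]])   # [4, 5]
```

With real PLAID generators, the same `generators` and `gen_kwargs` dictionaries would be passed to `generator_to_datasetdict` together with `variable_schema` and `cache_dir`.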

to_var_sample_dict(ds: datasets.Dataset, i: int, features: list[str] | None = None, enforce_shapes: bool = True) → dict[str, numpy.ndarray | None]

Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.

Parameters:

ds (datasets.Dataset) – The Hugging Face dataset.
i (int) – The row index.
features (list[str], optional) – Feature names (keys) to extract from the dataset. Default is None.
enforce_shapes (bool, optional) – Whether to enforce consistent shapes. Default is True.

Returns:

The variable sample dictionary.

Return type:

dict[str, Optional[np.ndarray]]
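
To make the shape of the returned mapping concrete, here is a rough stand-in sketch that selects named features from one row and keeps missing values as `None`. It works on a plain list of dicts rather than a `datasets.Dataset`, uses nested lists instead of NumPy arrays, and does not model `enforce_shapes`. The helper `select_row_features` is hypothetical, and treating `features=None` as "keep every column" is an assumption, not documented behavior.

```python
from typing import Any, Optional

# Stand-in for a dataset: a list of row dicts (no datasets.Dataset involved).
rows: list[dict[str, Any]] = [
    {"velocity_x": [[0.0, 1.0], [2.0, 3.0]], "velocity_y": None, "pressure": [1.0]},
    {"velocity_x": [[4.0, 5.0], [6.0, 7.0]], "velocity_y": None, "pressure": [2.0]},
]


def select_row_features(
    rows: list[dict[str, Any]],
    i: int,
    features: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Return the requested features of row i; absent values stay None."""
    row = rows[i]
    # Assumption: features=None means "keep every column of the row".
    names = list(row) if features is None else features
    return {name: row[name] for name in names}


subset = select_row_features(rows, 0, features=["velocity_x", "velocity_y"])
print(sorted(subset))        # ['velocity_x', 'velocity_y']
print(subset["velocity_y"])  # None
```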

sample_to_var_sample_dict(hf_sample: dict[str, Any]) → dict[str, Any]

Convert a Hugging Face sample dict to a variable sample dict.

Parameters: hf_sample (dict) – The HF sample dictionary.
Returns: The processed variable sample dictionary.
Return type: dict