plaid.storage.hf_datasets.bridge

HF Datasets bridge utilities.

This module provides bridge functions for converting between PLAID datasets/samples and the Hugging Face Datasets format. It includes utilities for feature type conversion, dataset generation from PLAID objects, and sample reconstruction.

Functions

convert_dtype_to_hf_feature(feature_type)

Convert a PLAID feature type dict to a Hugging Face feature.

convert_to_hf_feature(variable_schema)

Convert a PLAID variable schema to Hugging Face Features.

generator_to_datasetdict(...) → datasets.DatasetDict

Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.

to_var_sample_dict(...) → dict[str, Optional[numpy.ndarray]]

Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.

sample_to_var_sample_dict(...) → dict[str, Any]

Convert a Hugging Face sample dict to a variable sample dict.

Module Contents

convert_dtype_to_hf_feature(feature_type: dict[str, Any])

Convert a PLAID feature type dict to a Hugging Face feature.

Parameters:

feature_type (dict) – Dictionary with ‘dtype’ and ‘ndim’ keys.

Returns:

The corresponding HF feature type.

Return type:

Features or Sequence

convert_to_hf_feature(variable_schema: dict[str, dict])

Convert a PLAID variable schema to Hugging Face Features.

Parameters:

variable_schema (dict[str, dict]) – Mapping of variable names to type dicts.

Returns:

The HF Features object.

Return type:

Features
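
Since the schema is a plain mapping from variable names to type dicts with 'dtype' and 'ndim' keys, it can be useful to check it before conversion. The following is a minimal sketch under that assumption; `check_variable_schema` is a hypothetical helper and not part of PLAID, and the rule that 'ndim' must be a non-negative integer is likewise an assumption.

```python
from typing import Any


def check_variable_schema(variable_schema: dict[str, dict[str, Any]]) -> list[str]:
    """Return a list of problems found in a schema.

    Hypothetical helper for illustration only; it is not part of PLAID.
    """
    problems = []
    for name, type_info in variable_schema.items():
        # Every entry is expected to carry both keys named in the docs.
        for key in ("dtype", "ndim"):
            if key not in type_info:
                problems.append(f"{name}: missing '{key}'")
        # Assumption: 'ndim' counts array dimensions, so it cannot be negative.
        ndim = type_info.get("ndim")
        if ndim is not None and (not isinstance(ndim, int) or ndim < 0):
            problems.append(f"{name}: 'ndim' must be a non-negative integer")
    return problems


variable_schema = {
    "velocity_x": {"dtype": "float32", "ndim": 2},
    "pressure": {"dtype": "float32"},  # deliberately missing 'ndim'
}
problems = check_variable_schema(variable_schema)
print(problems)  # ["pressure: missing 'ndim'"]
```

An empty list means every entry has the two keys that the conversion functions are documented to expect.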

generator_to_datasetdict(generators: dict[str, Callable[..., Generator[plaid.Sample, None, None]]], variable_schema: dict, cache_dir: str, gen_kwargs: dict[str, dict[str, list[plaid.types.IndexType]]] | None = None, processes_number: int = 1, writer_batch_size: int = 1) → datasets.DatasetDict

Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.

This function takes generator functions that yield PLAID samples and converts them into a Hugging Face DatasetDict. Each generator corresponds to a split (e.g., “train”, “test”) and the function processes samples by flattening their structure and converting them to the Hugging Face format based on the provided variable schema.

Parameters:
  • generators (dict[str, Callable[..., Generator[Sample, None, None]]]) – Mapping from split names (e.g., “train”, “test”) to generator functions. Each generator function must yield PLAID Sample objects that will be converted to the Hugging Face format.

  • variable_schema (dict[str, dict]) – Dictionary defining the schema of variables/features in the dataset. Maps feature names to their type information (dtype and ndim).

  • cache_dir (str) – Directory path used as cache directory for the Hugging Face dataset generation process.

  • gen_kwargs (dict[str, dict[str, list[IndexType]]], optional) – Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function. Useful for passing split-specific parameters like sample indices. Default is None, which creates empty kwargs for each split.

  • processes_number (int, optional) – Number of parallel processes to use when materializing the dataset from the generators. Default is 1 (no parallelization).

  • writer_batch_size (int, optional) – Batch size used when writing samples to disk in Hugging Face format. Default is 1.

Returns:

A Hugging Face DatasetDict containing one Dataset per split, where each dataset contains the samples generated by the corresponding generator.

Return type:

datasets.DatasetDict

Example

>>> def train_generator():
...     for sample in train_samples:
...         yield sample
>>> def test_generator():
...     for sample in test_samples:
...         yield sample
>>> variable_schema = {
...     "velocity_x": {"dtype": "float32", "ndim": 2},
...     "velocity_y": {"dtype": "float32", "ndim": 2}
... }
>>> ds_dict = generator_to_datasetdict(
...     generators={"train": train_generator, "test": test_generator},
...     variable_schema=variable_schema,
...     cache_dir="/tmp/hf_cache",
...     processes_number=4,
...     writer_batch_size=10
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    }),
    test: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    })
})
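
The gen_kwargs parameter described above maps each split name to the keyword arguments for that split's generator function. The sketch below illustrates only that pairing, in plain Python and without calling PLAID or Hugging Face. The keyword name `ids`, the `split_generator` function, and the string stand-ins for samples are all hypothetical; a real generator would yield plaid.Sample objects.

```python
from typing import Iterator

# Stand-in for stored samples; a real generator would yield plaid.Sample objects.
ALL_SAMPLES = [f"sample_{i}" for i in range(6)]


def split_generator(ids: list[int]) -> Iterator[str]:
    """Yield only the samples whose indices were requested (hypothetical)."""
    for i in ids:
        yield ALL_SAMPLES[i]


# One generator function per split, and one kwargs dict per split.
generators = {"train": split_generator, "test": split_generator}
gen_kwargs = {
    "train": {"ids": [0, 1, 2, 3]},
    "test": {"ids": [4, 5]},
}

# Each split's kwargs are unpacked into that split's generator function.
for split, gen in generators.items():
    print(split, list(gen(**gen_kwargs[split])))
```

With this layout the same generator function can serve several splits, and only the index lists differ between them.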

to_var_sample_dict(ds: datasets.Dataset, i: int, features: list[str] | None = None, enforce_shapes: bool = True) → dict[str, numpy.ndarray | None]

Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.

Parameters:
  • ds (datasets.Dataset) – The Hugging Face dataset.

  • i (int) – The row index.

  • features (list[str], optional) – Iterable of feature names (keys) to extract from the dataset. Default is None.

  • enforce_shapes (bool, optional) – Whether to enforce consistent shapes. Default is True.

Returns:

The variable sample dictionary.

Return type:

dict[str, Optional[np.ndarray]]

sample_to_var_sample_dict(hf_sample: dict[str, Any]) → dict[str, Any]

Convert a Hugging Face sample dict to a variable sample dict.

Parameters:

hf_sample (dict) – The HF sample dictionary.

Returns:

The processed variable sample dictionary.

Return type:

dict