plaid.storage.hf_datasets.bridge
HF Datasets bridge utilities.
This module provides bridge functions for converting between PLAID datasets/samples and Hugging Face Datasets format. It includes utilities for feature type conversion, dataset generation from PLAID objects, and sample reconstruction.
plaid.storage.hf_datasets.bridge.convert_dtype_to_hf_feature
Convert a PLAID feature type dict to a Hugging Face feature.
Parameters:
- feature_type (dict) – Dictionary with 'dtype' and 'ndim' keys.
Returns:
- Any – The corresponding HF feature type (Features or Sequence).
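As a rough illustration of what such a conversion involves, the sketch below maps a type dict to a nested description in which each array dimension adds one sequence level. It uses plain dictionaries in place of the real Hugging Face feature classes, and `describe_feature` is a hypothetical helper, not part of PLAID. The nesting rule actually used by convert_dtype_to_hf_feature may differ.

```python
from __future__ import annotations

from typing import Any


def describe_feature(feature_type: dict[str, Any]) -> Any:
    """Hypothetical stand-in: wrap a scalar dtype in one 'sequence' level per dimension."""
    dtype = feature_type["dtype"]
    ndim = feature_type["ndim"]
    if ndim < 0:
        raise ValueError("ndim must be non-negative")
    feature: Any = {"value": dtype}  # scalar leaf (assumed representation)
    for _ in range(ndim):
        feature = {"sequence": feature}  # one nesting level per array axis
    return feature


scalar = describe_feature({"dtype": "float32", "ndim": 0})
matrix = describe_feature({"dtype": "float32", "ndim": 2})
```

Under this assumption a 2-D float32 variable becomes a sequence of sequences of float32 values, while a 0-D variable stays a bare scalar.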
Source code in plaid/storage/hf_datasets/bridge.py
plaid.storage.hf_datasets.bridge.convert_to_hf_feature
Convert a PLAID variable schema to Hugging Face Features.
Parameters:
- variable_schema (dict[str, dict]) – Mapping of variable names to type dicts.
Returns:
- Features – The HF Features object.
plaid.storage.hf_datasets.bridge.generator_to_datasetdict
generator_to_datasetdict(
    generators,
    variable_schema,
    cache_dir,
    gen_kwargs=None,
    processes_number=1,
    writer_batch_size=1,
)
Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.
This function takes generator functions that yield PLAID samples and converts them into a Hugging Face DatasetDict. Each generator corresponds to a split (e.g., "train", "test") and the function processes samples by flattening their structure and converting them to the Hugging Face format based on the provided variable schema.
Parameters:
- generators (dict[str, Callable[..., Generator[Sample, None, None]]]) – Mapping from split names (e.g., "train", "test") to generator functions. Each generator function must yield PLAID Sample objects, which are converted to the Hugging Face format.
- variable_schema (dict[str, dict]) – Dictionary defining the schema of variables/features in the dataset. Maps feature names to their type information (dtype and ndim).
- cache_dir (str) – Path of the cache directory used by the Hugging Face dataset generation process.
- gen_kwargs (dict[str, dict[str, IndexArrayType]], default: None) – Optional mapping from split names to dictionaries of keyword arguments passed to each generator function. Useful for split-specific parameters such as sample indices. Default is None, which creates empty kwargs for each split.
- processes_number (int, default: 1) – Number of parallel processes to use when materializing the dataset from the generators. Default is 1 (no parallelization).
- writer_batch_size (int, default: 1) – Batch size used when writing samples to disk in Hugging Face format. Default is 1.
Returns:
- DatasetDict – A Hugging Face DatasetDict containing one Dataset per split, where each Dataset contains the samples generated by the corresponding generator.
Example
>>> def train_generator():
...     for sample in train_samples:
...         yield sample
>>> def test_generator():
...     for sample in test_samples:
...         yield sample
>>> variable_schema = {
...     "velocity_x": {"dtype": "float32", "ndim": 2},
...     "velocity_y": {"dtype": "float32", "ndim": 2}
... }
>>> ds_dict = generator_to_datasetdict(
...     generators={"train": train_generator, "test": test_generator},
...     variable_schema=variable_schema,
...     cache_dir="/tmp/hf_cache",
...     processes_number=4,
...     writer_batch_size=10
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    }),
    test: Dataset({
        features: ['velocity_x', 'velocity_y'],
        num_rows: ...
    })
})
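The role of gen_kwargs can be sketched without Hugging Face at all. The snippet below assumes each split's generator is simply called with that split's keyword arguments, falling back to empty kwargs when gen_kwargs is None, as the parameter description states. `materialize_splits` and the `ids` argument are hypothetical names, and plain integers stand in for PLAID Sample objects.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator, Optional


def materialize_splits(
    generators: dict[str, Callable[..., Iterator[Any]]],
    gen_kwargs: Optional[dict[str, dict[str, Any]]] = None,
) -> dict[str, list[Any]]:
    """Hypothetical helper: call each split generator with its own kwargs."""
    if gen_kwargs is None:
        # Mirror the documented default: empty kwargs for every split.
        gen_kwargs = {split: {} for split in generators}
    return {
        split: list(gen(**gen_kwargs.get(split, {})))
        for split, gen in generators.items()
    }


SAMPLES = [10, 11, 12, 13, 14]  # stand-ins for PLAID samples


def split_generator(ids=None):
    """Yield the samples at the given positions, or all samples if ids is None."""
    selected = range(len(SAMPLES)) if ids is None else ids
    for i in selected:
        yield SAMPLES[i]


splits = materialize_splits(
    {"train": split_generator, "test": split_generator},
    gen_kwargs={"train": {"ids": [0, 1, 2]}, "test": {"ids": [3, 4]}},
)
all_samples = materialize_splits({"train": split_generator})
```

Sharing one generator function across splits and varying only its kwargs is one way to use split-specific sample indices; separate generator functions per split, as in the example above, work just as well.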
plaid.storage.hf_datasets.bridge.to_var_sample_dict
Convert a Hugging Face dataset row to a variable sample dict containing the features that vary in the dataset.
Parameters:
- ds (Dataset) – The Hugging Face dataset.
- i (int) – The row index.
- features (Optional[list[str]], default: None) – Iterable of feature names (keys) to extract from the dataset.
- indexers (Optional[dict[str, Any]], default: None) – Optional mapping feature_path -> indexer used to select feature values along the last axis.
Returns:
- dict[str, Optional[ndarray]] – The variable sample dictionary.
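To make "select along the last axis" concrete, here is a small pure-Python sketch on nested lists. `take_last_axis` is a hypothetical helper, and treating an indexer as a list of integer positions is an assumption; the indexer types actually accepted by to_var_sample_dict are not documented here.

```python
from __future__ import annotations

from typing import Any, Sequence


def take_last_axis(values: Any, indexer: Sequence[int]) -> Any:
    """Hypothetical helper: keep only the given positions along the innermost axis."""
    if values and isinstance(values[0], list):
        # Not yet at the innermost axis: recurse into each sub-list.
        return [take_last_axis(row, indexer) for row in values]
    return [values[i] for i in indexer]


# A made-up row: one 2-D feature and one feature with no value.
row = {"velocity_x": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], "pressure": None}
indexers = {"velocity_x": [0, 2]}

var_sample = {
    name: (
        take_last_axis(value, indexers[name])
        if value is not None and name in indexers
        else value
    )
    for name, value in row.items()
}
```

Features without an indexer, or with no value, pass through unchanged in this sketch, which is consistent with the Optional values in the documented return type.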
plaid.storage.hf_datasets.bridge.sample_to_var_sample_dict
Convert a Hugging Face sample dict to a variable sample dict.
Parameters:
- hf_sample (dict) – The HF sample dictionary.
Returns:
- dict[str, Any] – The processed variable sample dictionary.