plaid.storage.writer
PLAID storage writer module.
This module provides high-level functions for saving PLAID datasets to local disk and pushing them to Hugging Face Hub. It supports multiple storage backends including CGNS, HF Datasets, and Zarr, abstracting the backend-specific implementations.
Key features:
- Unified interface for saving datasets across different backends
- Automatic preprocessing and schema extraction
- Metadata and problem definition handling
- Hub integration with dataset cards and metadata
plaid.storage.writer.save_to_disk
save_to_disk(
output_folder,
sample_constructor,
ids,
infos=None,
backend="hf_datasets",
pb_defs=None,
num_proc=1,
verbose=False,
overwrite=False,
)
Save a PLAID dataset to local disk using the specified backend.
This function preprocesses the dataset, extracts schemas, and saves the dataset to disk using the chosen backend. It also saves metadata, infos, and problem definitions.
The user provides a simple function sample_constructor that takes a single identifier and returns a plaid.Sample, together with a dictionary ids mapping split names to sliceable sequences of identifiers. PLAID handles iteration, generator creation, and parallel sharding internally.
Example::

    from plaid import Infos, Sample
    from plaid.storage import save_to_disk

    # load_my_data, train_file_paths and test_file_paths are placeholders
    # for user-provided code and data.
    def sample_constructor(file_path):
        sample = Sample()
        sample.add_tree(load_my_data(file_path))
        return sample

    save_to_disk(
        "output/",
        sample_constructor=sample_constructor,
        ids={
            "train": train_file_paths,
            "test": test_file_paths,
        },
        infos=Infos(
            owner="owner",
            license="license",
        ),
        num_proc=6,
    )
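To picture what "parallel sharding" of the identifier sequences could look like, here is a minimal sketch in plain Python. It is not PLAID's implementation: shard_ids is a hypothetical helper, and the contiguous, near-equal slicing strategy is an assumption made for illustration. It does show why each sequence needs to support __len__ and slicing through __getitem__.

```python
from typing import Any, Dict, List, Mapping, Sequence


def shard_ids(ids: Mapping[str, Sequence[Any]], num_proc: int) -> Dict[str, List[Sequence[Any]]]:
    """Split each split's identifiers into at most num_proc contiguous slices.

    Hypothetical helper for illustration only; not part of PLAID.
    """
    if num_proc < 1:
        raise ValueError("num_proc must be >= 1")
    shards: Dict[str, List[Sequence[Any]]] = {}
    for split, seq in ids.items():
        n = len(seq)  # relies on __len__
        # Never create more shards than there are identifiers (but at least one).
        n_shards = max(1, min(num_proc, n))
        base, extra = divmod(n, n_shards)
        pieces = []
        start = 0
        for i in range(n_shards):
            # The first `extra` shards receive one additional identifier.
            stop = start + base + (1 if i < extra else 0)
            pieces.append(seq[start:stop])  # relies on slicing via __getitem__
            start = stop
        shards[split] = pieces
    return shards


example = shard_ids(
    {"train": list(range(10)), "test": ["a.cgns", "b.cgns"]},
    num_proc=3,
)
print(example)
```

In this sketch each worker would receive one slice per split and call sample_constructor on every identifier in it.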
Parameters:
- output_folder (Union[str, Path]) – Path to the output directory where the dataset will be saved.
- sample_constructor (Callable[[Any], Sample]) – A callable that takes a single identifier (of any type) and returns a plaid.Sample.
- ids (Mapping[str, Any]) – Dictionary mapping split names (e.g. "train", "test") to sliceable sequences of sample identifiers. Each sequence must support __getitem__ and __len__ (list, tuple, numpy array, …). The identifiers can be of any type: integers, file paths, strings, tuples, etc.
- backend (str, default: 'hf_datasets') – Storage backend to use ('cgns', 'hf_datasets', or 'zarr').
- infos (Optional[Infos], default: None) – Dataset information to save with the dataset. If None, a placeholder plaid.Infos is created with owner="unknown", license="unknown".
- pb_defs (Optional[dict[str, ProblemDefinition]], default: None) – Optional mapping from problem definition identifiers to definitions.
- num_proc (int, default: 1) – Number of processes to use for parallel writing. When num_proc > 1, PLAID automatically shards the identifier sequences and distributes work across workers.
- verbose (bool, default: False) – If True, enables verbose output during processing.
- overwrite (bool, default: False) – If True, overwrites existing output directory.
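The requirement on ids (every split maps to a sequence supporting __getitem__ and __len__) can be checked before starting a long export. The sketch below is a shallow pre-flight check written for illustration; check_ids is a hypothetical helper, not a PLAID function, and it only tests for the presence of the two methods rather than actual slicing behaviour.

```python
from typing import Any, Dict, Mapping


def check_ids(ids: Mapping[str, Any]) -> Dict[str, int]:
    """Return the number of identifiers per split.

    Raises TypeError when a split's value lacks __getitem__ or __len__
    (for example a generator). Hypothetical helper, not part of PLAID.
    """
    sizes: Dict[str, int] = {}
    for split, seq in ids.items():
        if not (hasattr(seq, "__getitem__") and hasattr(seq, "__len__")):
            raise TypeError(
                f"ids[{split!r}] must support __getitem__ and __len__, "
                f"got {type(seq).__name__}"
            )
        sizes[split] = len(seq)
    return sizes


# Identifiers may be of any type: tuples for "train", integers for "test".
sizes = check_ids({
    "train": [("case_a", 0), ("case_a", 1)],
    "test": range(3),
})
print(sizes)
```

A generator would be rejected by this check; converting it to a list first is the simplest fix.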
Source code in plaid/storage/writer.py
plaid.storage.writer.push_to_hub
push_to_hub(
repo_id,
local_dir,
num_workers=1,
viewer=False,
pretty_name=None,
dataset_long_description=None,
illustration_urls=None,
arxiv_paper_urls=None,
)
Push a local PLAID dataset to Hugging Face Hub.
This function uploads a previously saved dataset from local disk to Hugging Face Hub, including data, metadata, infos, and problem definitions. It automatically detects the backend used for saving and configures the dataset card.
Parameters:
- repo_id (str) – Hugging Face repository ID (e.g., 'username/dataset-name').
- local_dir (Union[str, Path]) – Local directory containing the saved dataset.
- num_workers (int, default: 1) – Number of workers for parallel upload.
- viewer (bool, default: False) – If True, enables dataset viewer on Hub.
- pretty_name (Optional[str], default: None) – Optional pretty name for the dataset.
- dataset_long_description (Optional[str], default: None) – Optional detailed description.
- illustration_urls (Optional[list[str]], default: None) – Optional list of illustration URLs.
- arxiv_paper_urls (Optional[list[str]], default: None) – Optional list of arXiv paper URLs.
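A possible way to call push_to_hub with the parameters listed above is sketched below. The helper names build_push_kwargs and publish are hypothetical, the repository ID and descriptive values are placeholders, and the sketch assumes the dataset was already written with save_to_disk and that the environment is authenticated with Hugging Face Hub. The PLAID import is deferred so that the argument-building part can run without PLAID installed.

```python
from pathlib import Path
from typing import Any, Dict


def build_push_kwargs(repo_id: str, local_dir: str, num_workers: int = 1) -> Dict[str, Any]:
    """Assemble keyword arguments matching the documented push_to_hub signature.

    Hypothetical helper; every descriptive value below is a placeholder.
    """
    return {
        "repo_id": repo_id,
        "local_dir": Path(local_dir),
        "num_workers": num_workers,
        "viewer": False,
        "pretty_name": "My PLAID dataset",
        "dataset_long_description": "Placeholder description of the dataset.",
        "illustration_urls": None,
        "arxiv_paper_urls": None,
    }


def publish(repo_id: str, local_dir: str) -> None:
    """Upload a dataset previously saved to local_dir (requires PLAID and Hub access)."""
    # Imported lazily so this module loads even when PLAID is not installed.
    from plaid.storage.writer import push_to_hub

    push_to_hub(**build_push_kwargs(repo_id, local_dir))


kwargs = build_push_kwargs("username/dataset-name", "output/")
print(sorted(kwargs))
```

Calling publish("username/dataset-name", "output/") would then perform the upload. Since the backend is detected automatically, no backend argument appears in the call.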