plaid.storage.writer
PLAID storage writer module.
This module provides high-level functions for saving PLAID datasets to local disk and pushing them to Hugging Face Hub. It supports multiple storage backends including CGNS, HF Datasets, and Zarr, abstracting the backend-specific implementations.
Key features:
- Unified interface for saving datasets across different backends
- Automatic preprocessing and schema extraction
- Metadata and problem definition handling
- Hub integration with dataset cards and metadata
Functions
| save_to_disk | Save a PLAID dataset to local disk using the specified backend. |
| push_to_hub | Push a local PLAID dataset to Hugging Face Hub. |
Module Contents
- save_to_disk(output_folder: str | pathlib.Path, sample_constructor: Callable[[Any], plaid.Sample], ids: Mapping[str, Sequence], backend: str = 'hf_datasets', infos: dict[str, Any] | None = None, pb_defs: dict[str, plaid.ProblemDefinition] | plaid.ProblemDefinition | None = None, num_proc: int = 1, verbose: bool = False, overwrite: bool = False) -> None
Save a PLAID dataset to local disk using the specified backend.
This function preprocesses the dataset, extracts schemas, and saves the dataset to disk using the chosen backend. It also saves metadata, infos, and problem definitions.
The user provides a simple function sample_constructor that takes a single identifier and returns a Sample, together with a dictionary ids mapping split names to sliceable sequences of identifiers. PLAID handles iteration, generator creation, and parallel sharding internally.

Example:

    from plaid import Sample
    from plaid.storage import save_to_disk

    def sample_constructor(file_path):
        # load_my_data is a placeholder for user code that loads the data of one file.
        sample = Sample()
        sample.add_tree(load_my_data(file_path))
        return sample

    save_to_disk(
        "output/",
        sample_constructor=sample_constructor,
        # train_file_paths and test_file_paths are user-supplied sequences of identifiers.
        ids={
            "train": train_file_paths,
            "test": test_file_paths,
        },
        num_proc=6,
    )
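The division of labor described above can be illustrated with a small sketch. This is not PLAID's internal code: iter_samples and fake_constructor are hypothetical names, the file names are made up, and a plain dict stands in for a Sample so that the snippet runs without PLAID installed. It only shows the idea that the caller supplies a per-identifier constructor while the library side owns the loop.

```python
from collections.abc import Callable, Iterator, Sequence
from typing import Any


def iter_samples(
    sample_constructor: Callable[[Any], Any],
    identifiers: Sequence[Any],
) -> Iterator[Any]:
    """Lazily build one sample per identifier (hypothetical helper)."""
    for index in range(len(identifiers)):
        # Only __len__ and __getitem__ are used on the identifier sequence.
        yield sample_constructor(identifiers[index])


def fake_constructor(identifier: Any) -> dict:
    """Stand-in for a user constructor; a real one would return a plaid Sample."""
    return {"id": identifier}


# Hypothetical identifiers: any sliceable sequence of any item type would do.
ids = {"train": ["a.cgns", "b.cgns"], "test": ["c.cgns"]}

samples_per_split = {
    split: list(iter_samples(fake_constructor, seq)) for split, seq in ids.items()
}
```

Because iter_samples is a generator, no sample is constructed until the consumer asks for it, which is one plausible reason for keeping the constructor a function of a single identifier.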
- Parameters:
output_folder – Path to the output directory where the dataset will be saved.
sample_constructor – A callable that takes a single identifier (of any type) and returns a Sample.
ids – Dictionary mapping split names (e.g. "train", "test") to sliceable sequences of sample identifiers. Each sequence must support __getitem__ and __len__ (list, tuple, numpy array, …). The identifiers can be of any type: integers, file paths, strings, tuples, etc.
backend – Storage backend to use ('cgns', 'hf_datasets', or 'zarr').
infos – Optional additional information to save with the dataset.
pb_defs – Optional problem definitions to save.
num_proc – Number of processes to use for parallel writing. When num_proc > 1, PLAID automatically shards the identifier sequences and distributes work across workers.
verbose – If True, enables verbose output during processing.
overwrite – If True, overwrites existing output directory.
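The requirement that each identifier sequence support __getitem__ and __len__ becomes clearer with a sketch of how sharding could work. The function below is an assumption for illustration only: shard_ids is a hypothetical name, and PLAID's actual sharding strategy may differ. It cuts each split into contiguous, nearly equal slices, one per worker.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def shard_ids(
    ids: Mapping[str, Sequence[Any]], num_shards: int
) -> dict[str, list[Sequence[Any]]]:
    """Split each identifier sequence into contiguous shards (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    shards = {}
    for split, seq in ids.items():
        n = len(seq)
        # Never create more shards than identifiers, but always at least one.
        k = max(1, min(num_shards, n))
        base, extra = divmod(n, k)
        pieces = []
        start = 0
        for i in range(k):
            # The first `extra` shards each take one additional identifier.
            stop = start + base + (1 if i < extra else 0)
            pieces.append(seq[start:stop])
            start = stop
        shards[split] = pieces
    return shards


sharded = shard_ids({"train": list(range(7)), "test": ["a", "b"]}, num_shards=3)
# sharded["train"] -> [[0, 1, 2], [3, 4], [5, 6]]
# sharded["test"]  -> [["a"], ["b"]]
```

Slicing works the same way for lists, tuples, and numpy arrays, whereas a generator offers neither len() nor slicing and so could not be divided up front.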
- push_to_hub(repo_id: str, local_dir: str | pathlib.Path, num_workers: int = 1, viewer: bool = False, pretty_name: str | None = None, dataset_long_description: str | None = None, illustration_urls: list[str] | None = None, arxiv_paper_urls: list[str] | None = None) -> None
Push a local PLAID dataset to Hugging Face Hub.
This function uploads a previously saved dataset from local disk to Hugging Face Hub, including data, metadata, infos, and problem definitions. It automatically detects the backend used for saving and configures the dataset card.
- Parameters:
repo_id – Hugging Face repository ID (e.g., ‘username/dataset-name’).
local_dir – Local directory containing the saved dataset.
num_workers – Number of workers for parallel upload.
viewer – If True, enables dataset viewer on Hub.
pretty_name – Optional pretty name for the dataset.
dataset_long_description – Optional detailed description.
illustration_urls – Optional list of illustration URLs.
arxiv_paper_urls – Optional list of arXiv paper URLs.
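A possible way to call push_to_hub from a script is sketched below. The helper names build_push_kwargs and publish are hypothetical and not part of PLAID; only the keyword names come from the signature documented above. The sketch checks that the local directory exists before any upload is attempted, and it imports push_to_hub lazily so the argument handling can be tried without PLAID installed.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

# Optional dataset-card fields, taken from the documented push_to_hub signature.
CARD_FIELDS = (
    "pretty_name",
    "dataset_long_description",
    "illustration_urls",
    "arxiv_paper_urls",
)


def build_push_kwargs(
    repo_id: str,
    local_dir: str | Path,
    num_workers: int = 1,
    viewer: bool = False,
    **card_fields: Any,
) -> dict[str, Any]:
    """Assemble keyword arguments for push_to_hub (hypothetical helper)."""
    local_dir = Path(local_dir)
    if not local_dir.is_dir():
        raise FileNotFoundError(f"no saved dataset directory at {local_dir}")
    unknown = set(card_fields) - set(CARD_FIELDS)
    if unknown:
        raise TypeError(f"unexpected card fields: {sorted(unknown)}")
    kwargs: dict[str, Any] = {
        "repo_id": repo_id,
        "local_dir": local_dir,
        "num_workers": num_workers,
        "viewer": viewer,
    }
    # Leave out fields set to None so push_to_hub falls back to its defaults.
    kwargs.update({k: v for k, v in card_fields.items() if v is not None})
    return kwargs


def publish(repo_id: str, local_dir: str | Path, **options: Any) -> None:
    """Validate the arguments, then upload with push_to_hub."""
    # Imported lazily so build_push_kwargs can be used without PLAID installed.
    from plaid.storage.writer import push_to_hub

    push_to_hub(**build_push_kwargs(repo_id, local_dir, **options))
```

A call such as publish("username/dataset-name", "output/", pretty_name="My dataset") would then upload a dataset previously written by save_to_disk, assuming the directory exists and the environment is set up for Hub access.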