`plaid.storage.writer`

PLAID storage writer module.

This module provides high-level functions for saving PLAID datasets to local disk and pushing them to Hugging Face Hub. It supports multiple storage backends including CGNS, HF Datasets, and Zarr, abstracting the backend-specific implementations.

Key features:

- Unified interface for saving datasets across different backends
- Automatic preprocessing and schema extraction
- Metadata and problem definition handling
- Hub integration with dataset cards and metadata

plaid.storage.writer.save_to_disk

save_to_disk(
    output_folder,
    sample_constructor,
    ids,
    infos=None,
    backend="hf_datasets",
    pb_defs=None,
    num_proc=1,
    verbose=False,
    overwrite=False,
)

Save a PLAID dataset to local disk using the specified backend.

This function preprocesses the dataset, extracts schemas, and saves the dataset to disk using the chosen backend. It also saves metadata, infos, and problem definitions.

The user provides a simple function `sample_constructor` that takes a single identifier and returns a `plaid.Sample`, together with a dictionary `ids` that maps split names to sliceable sequences of identifiers. PLAID handles iteration, generator creation, and parallel sharding internally.

Example:

from plaid import Infos, Sample
from plaid.storage import save_to_disk

def sample_constructor(file_path):
    sample = Sample()
    sample.add_tree(load_my_data(file_path))
    return sample

save_to_disk(
    "output/",
    sample_constructor=sample_constructor,
    ids={
        "train": train_file_paths,
        "test":  test_file_paths,
    },
    infos=Infos(
        owner="owner",
        license="license",
    ),
    num_proc=6,
)
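The requirement that each entry of `ids` be a sliceable sequence can be checked before calling `save_to_disk`. The sketch below mirrors the duck-typing test visible in the source listing (presence of `__getitem__` and `__len__`); the helper name `is_sliceable_sequence` and the sample values are illustrative assumptions, not part of PLAID.

```python
def is_sliceable_sequence(obj):
    """Return True if obj exposes both __getitem__ and __len__.

    Hypothetical helper: it reproduces the attribute check that
    save_to_disk applies to every value of the `ids` mapping.
    """
    return hasattr(obj, "__getitem__") and hasattr(obj, "__len__")


# Illustrative candidates for one split of `ids`.
candidates = {
    "list": [0, 1, 2],
    "tuple": ("a.cgns", "b.cgns"),
    "range": range(10),
    "generator": (i for i in range(10)),  # no __getitem__, no __len__
    "set": {1, 2, 3},                     # has __len__ but no __getitem__
}

results = {name: is_sliceable_sequence(obj) for name, obj in candidates.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected (TypeError in save_to_disk)'}")
```

A generator or a set can be converted with `list(...)` before being passed as a split.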

Parameters:

output_folder (Union[str, Path]) –

Path to the output directory where the dataset will be saved.
sample_constructor (Callable[[Any], Sample]) –

A callable that takes a single identifier (of any type) and returns a `plaid.Sample`.
ids (Mapping[str, Any]) –

Dictionary mapping split names (e.g. "train", "test") to sliceable sequences of sample identifiers. Each sequence must support __getitem__ and __len__ (list, tuple, numpy array, …). The identifiers can be of any type: integers, file paths, strings, tuples, etc.
backend (str, default: 'hf_datasets' ) –

Storage backend to use ('cgns', 'hf_datasets', or 'zarr').
infos (Optional[Infos], default: None ) –

Dataset information to save with the dataset. If None, a placeholder `plaid.Infos` is created with owner="unknown", license="unknown".
pb_defs (Optional[dict[str, ProblemDefinition]], default: None ) –

Optional mapping from problem definition identifiers to definitions.
num_proc (int, default: 1 ) –

Number of processes to use for parallel writing. When num_proc > 1, PLAID automatically shards the identifier sequences and distributes the work across workers.
verbose (bool, default: False ) –

If True, enables verbose output during processing.
overwrite (bool, default: False ) –

If True, overwrites an existing output directory.
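To make the `num_proc` behaviour concrete, here is a minimal sketch of how identifier sequences could be split into shards. The internal helper `_build_gen_kwargs` is not shown on this page, so the contiguous, near-equal chunking below is an assumption; only the `{"shards_ids": [...]}` layout is taken from the sequential branch of the source listing. The names `shard_ids` and `build_gen_kwargs_sketch` are hypothetical.

```python
def shard_ids(split_ids, num_shards):
    """Split a sliceable sequence into contiguous, near-equal shards.

    Assumed strategy; the real PLAID helper may shard differently.
    """
    n = len(split_ids)
    # Never create more shards than identifiers, and always at least one.
    num_shards = max(1, min(num_shards, n))
    base, extra = divmod(n, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards receive one additional identifier.
        stop = start + base + (1 if i < extra else 0)
        shards.append(list(split_ids[start:stop]))
        start = stop
    return shards


def build_gen_kwargs_sketch(ids, num_proc):
    """Build one {"shards_ids": [...]} entry per split."""
    return {
        split_name: {"shards_ids": shard_ids(split_ids, num_proc)}
        for split_name, split_ids in ids.items()
    }


ids = {"train": list(range(10)), "test": ["a", "b"]}
gen_kwargs = build_gen_kwargs_sketch(ids, num_proc=4)
print(gen_kwargs)
```

With `num_proc=1` the same function yields a single shard per split, which matches the shape used for sequential execution.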

Source code in plaid/storage/writer.py

def save_to_disk(
    output_folder: Union[str, Path],
    sample_constructor: Callable[[Any], Sample],
    ids: Mapping[str, Any],
    infos: Optional[Infos] = None,
    backend: str = "hf_datasets",
    pb_defs: Optional[dict[str, ProblemDefinition]] = None,
    num_proc: int = 1,
    verbose: bool = False,
    overwrite: bool = False,
) -> None:
    """Save a PLAID dataset to local disk using the specified backend.

    This function preprocesses the dataset, extracts schemas, and saves
    the dataset to disk using the chosen backend.  It also saves metadata, infos,
    and problem definitions.

    The user provides a simple function ``sample_constructor`` that takes a single
    identifier and returns a :class:`~plaid.Sample`, together with a dictionary
    ``ids`` mapping split names to sliceable sequences of identifiers.
    PLAID handles iteration, generator creation, and parallel sharding
    internally.

    Example::

        from plaid import Infos, Sample
        from plaid.storage import save_to_disk

        def sample_constructor(file_path):
            sample = Sample()
            sample.add_tree(load_my_data(file_path))
            return sample

        save_to_disk(
            "output/",
            sample_constructor=sample_constructor,
            ids={
                "train": train_file_paths,
                "test":  test_file_paths,
            },
            infos=Infos(
                owner="owner",
                license="license",
            ),
            num_proc=6,
        )

    Args:
        output_folder: Path to the output directory where the dataset will be saved.
        sample_constructor: A callable that takes a single identifier (of any type)
            and returns a :class:`~plaid.Sample`.
        ids: Dictionary mapping split names (e.g. ``"train"``, ``"test"``) to
            sliceable sequences of sample identifiers.  Each sequence must
            support ``__getitem__`` and ``__len__`` (list, tuple, numpy array,
            …).  The identifiers can be of any type: integers, file paths,
            strings, tuples, etc.
        backend: Storage backend to use (``'cgns'``, ``'hf_datasets'``, or
            ``'zarr'``).
        infos: Dataset information to save with the dataset. If ``None``, a
            placeholder :class:`~plaid.Infos` is created with
            ``owner="unknown", license="unknown"``.
        pb_defs: Optional mapping from problem definition identifiers to definitions.
        num_proc: Number of processes to use for parallel writing.  When
            ``num_proc > 1`` PLAID automatically shards the identifier
            sequences and distributes work across workers.
        verbose: If True, enables verbose output during processing.
        overwrite: If True, overwrites existing output directory.
    """
    assert backend in available_backends(), (
        f"backend {backend} not among available ones: {available_backends()}"
    )
    # ---- validate ids: must be sliceable sequences ---------------------------
    for split_name, split_ids in ids.items():
        if not (hasattr(split_ids, "__getitem__") and hasattr(split_ids, "__len__")):
            raise TypeError(
                f"ids for split '{split_name}' must be a sliceable sequence "
                f"(with __getitem__ and __len__), got {type(split_ids).__name__}. "
                f"Use a list, tuple, or numpy array of sample identifiers."
            )

    # ---- build generators from sample_constructor -----------------------------------
    generators: dict[str, Callable[..., Generator[Sample, None, None]]] = {}
    for split_name in ids:
        generators[split_name] = _SampleFuncGenerator(sample_constructor)

    # ---- auto-shard when running in parallel ---------------------------------
    gen_kwargs: Optional[dict[str, dict[str, Any]]] = None
    if num_proc > 1:
        gen_kwargs = _build_gen_kwargs(ids, num_proc)
    else:
        # For sequential execution, wrap ids into a single shard so the
        # generator receives them via shards_ids
        gen_kwargs = {
            split_name: {"shards_ids": [list(split_ids)]}
            for split_name, split_ids in ids.items()
        }

    output_folder = Path(output_folder)
    _check_folder(output_folder, overwrite)

    # CGNS stores each sample as a complete CGNS tree. It does not need the
    # constant/variable split used by columnar backends, so avoid the expensive
    # preprocessing pass that flattens every generated tree and detects constant
    # leaves.  Sample counts can be derived directly from the declared ids.
    if backend == "cgns":
        variable_schema = None
        num_samples = {
            split_name: len(split_ids) for split_name, split_ids in ids.items()
        }
    else:
        flat_cst, variable_schema, constant_schema, num_samples, cgns_types = (
            preprocess(
                generators, gen_kwargs=gen_kwargs, num_proc=num_proc, verbose=verbose
            )
        )
        save_metadata_to_disk(
            output_folder, flat_cst, variable_schema, constant_schema, cgns_types
        )

    # Inject the actual on-disk storage backend and sample counts so the
    # written ``infos.yaml`` always reflects how the dataset was saved,
    # overriding any inherited values from the input ``infos``.
    if infos is None:
        infos = Infos(
            owner="unknown",
            license="unknown",
        )
    infos_data = infos.model_dump(exclude_none=True)
    infos_data["num_samples"] = num_samples
    infos_data["storage_backend"] = backend
    infos = Infos.validate_persisted(infos_data)

    save_infos_to_disk(output_folder, infos)

    if pb_defs is not None:
        save_problem_definitions_to_disk(output_folder, pb_defs)

    backend_spec = get_backend(backend)
    backend_spec.generate_to_disk(
        output_folder,
        generators,
        variable_schema,
        gen_kwargs=gen_kwargs,
        num_proc=num_proc,
        verbose=verbose,
    )
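The listing above wraps `sample_constructor` in `_SampleFuncGenerator`, whose definition is not shown here. As a rough sketch of what such a wrapper might do, assuming it receives the `shards_ids` keyword built earlier and yields one sample per identifier, consider the hypothetical class below. It is an illustration only and may differ from the actual implementation.

```python
class SampleFuncGeneratorSketch:
    """Hypothetical stand-in for _SampleFuncGenerator.

    Turns a per-identifier constructor into a generator function that
    walks a list of shards, each shard being a list of identifiers.
    """

    def __init__(self, sample_constructor):
        self.sample_constructor = sample_constructor

    def __call__(self, shards_ids):
        for shard in shards_ids:
            for identifier in shard:
                # Build samples lazily, one identifier at a time.
                yield self.sample_constructor(identifier)


def fake_constructor(identifier):
    """Placeholder for a real constructor returning a plaid.Sample."""
    return f"sample-{identifier}"


generator = SampleFuncGeneratorSketch(fake_constructor)
samples = list(generator(shards_ids=[[1, 2], [3]]))
print(samples)
```

Using a module-level class rather than a closure is a plausible design choice because its instances can be pickled for multiprocessing, provided `sample_constructor` is itself picklable.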

plaid.storage.writer.push_to_hub

push_to_hub(
    repo_id,
    local_dir,
    num_workers=1,
    viewer=False,
    pretty_name=None,
    dataset_long_description=None,
    illustration_urls=None,
    arxiv_paper_urls=None,
)

Push a local PLAID dataset to Hugging Face Hub.

This function uploads a previously saved dataset from local disk to Hugging Face Hub, including data, metadata, infos, and problem definitions. It automatically detects the backend used for saving and configures the dataset card.