plaid.storage.writer
====================

.. py:module:: plaid.storage.writer

.. autoapi-nested-parse::

   PLAID storage writer module.

   This module provides high-level functions for saving PLAID datasets to local disk and pushing
   them to the Hugging Face Hub. It supports multiple storage backends (CGNS, HF Datasets, and
   Zarr) and hides the backend-specific implementations behind a single interface.

   Key features:

   - Unified interface for saving datasets across different backends
   - Automatic preprocessing and schema extraction
   - Metadata and problem definition handling
   - Hub integration with dataset cards and metadata


Functions
---------

.. autoapisummary::

   plaid.storage.writer.save_to_disk
   plaid.storage.writer.push_to_hub


Module Contents
---------------

.. py:function:: save_to_disk(output_folder: Union[str, pathlib.Path], sample_constructor: Callable[[Any], plaid.Sample], ids: Mapping[str, Sequence], backend: str = 'hf_datasets', infos: Optional[dict[str, Any]] = None, pb_defs: Optional[Union[dict[str, plaid.ProblemDefinition], plaid.ProblemDefinition]] = None, num_proc: int = 1, verbose: bool = False, overwrite: bool = False) -> None

   Save a PLAID dataset to local disk using the specified backend.

   This function preprocesses the dataset, extracts schemas, and writes
   the dataset to disk with the chosen backend.  Metadata, infos, and
   problem definitions are saved alongside the data.

   The user provides a simple function ``sample_constructor`` that takes a single
   identifier and returns a :class:`~plaid.Sample`, together with a dictionary
   ``ids`` mapping split names to sliceable sequences of identifiers.
   PLAID handles iteration, generator creation, and parallel sharding
   internally.

   Example::

       from plaid import Sample
       from plaid.storage import save_to_disk

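       # load_my_data stands in for your own loading code; it is not part of PLAID.
       def sample_constructor(file_path):
           sample = Sample()
           sample.add_tree(load_my_data(file_path))
           return sample

       # train_file_paths and test_file_paths are user-supplied sequences
       # (e.g. lists of paths), one identifier per sample.
       save_to_disk(
           "output/",
           sample_constructor=sample_constructor,
           ids={
               "train": train_file_paths,
               "test": test_file_paths,
           },
           num_proc=6,  # shard the identifiers across 6 worker processes
       )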

   :param output_folder: Path to the output directory where the dataset will be saved.
   :param sample_constructor: A callable that takes a single identifier (of any type)
                              and returns a :class:`~plaid.Sample`.
   :param ids: Dictionary mapping split names (e.g. ``"train"``, ``"test"``) to
               sliceable sequences of sample identifiers.  Each sequence must
               support ``__getitem__`` and ``__len__`` (list, tuple, numpy array,
               …).  The identifiers can be of any type: integers, file paths,
               strings, tuples, etc.
   :param backend: Storage backend to use (``'cgns'``, ``'hf_datasets'``, or
                   ``'zarr'``).
   :param infos: Optional additional information to save with the dataset.
   :param pb_defs: Optional problem definitions to save.
   :param num_proc: Number of processes to use for parallel writing.  When
                    ``num_proc > 1`` PLAID automatically shards the identifier
                    sequences and distributes work across workers.
   :param verbose: If True, enables verbose output during processing.
   :param overwrite: If True, overwrites an existing output directory.
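
The ``ids`` contract and the sharding mentioned for ``num_proc`` can be illustrated in plain Python. This is a minimal sketch and not PLAID's actual implementation: ``check_ids`` and ``shard`` are hypothetical helper names, and the contiguous, near-equal split is an assumption about how identifiers might be distributed across workers.

```python
from collections.abc import Mapping, Sequence


def check_ids(ids: Mapping[str, Sequence]) -> None:
    """Raise TypeError if a split's identifiers cannot be indexed and sized."""
    for split, seq in ids.items():
        # Generators and other one-shot iterables fail this check.
        if not (hasattr(seq, "__getitem__") and hasattr(seq, "__len__")):
            raise TypeError(
                f"split {split!r}: identifiers must support indexing and len()"
            )


def shard(seq: Sequence, num_proc: int) -> list:
    """Split ``seq`` into ``num_proc`` contiguous slices of near-equal size."""
    if num_proc < 1:
        raise ValueError("num_proc must be >= 1")
    base, extra = divmod(len(seq), num_proc)
    shards, start = [], 0
    for rank in range(num_proc):
        # The first ``extra`` workers receive one additional identifier.
        stop = start + base + (1 if rank < extra else 0)
        shards.append(seq[start:stop])
        start = stop
    return shards
```

Slicing is the reason the sequences must support ``__getitem__`` and ``__len__``: each worker can take its own slice without the full sequence being iterated first. A generator passed as a split value would be rejected by ``check_ids``.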


.. py:function:: push_to_hub(repo_id: str, local_dir: Union[str, pathlib.Path], num_workers: int = 1, viewer: bool = False, pretty_name: Optional[str] = None, dataset_long_description: Optional[str] = None, illustration_urls: Optional[list[str]] = None, arxiv_paper_urls: Optional[list[str]] = None) -> None

   Push a local PLAID dataset to the Hugging Face Hub.

   This function uploads a previously saved dataset from local disk to the Hugging Face Hub,
   including data, metadata, infos, and problem definitions. It automatically detects the
   backend used for saving and configures the dataset card.

   :param repo_id: Hugging Face repository ID (e.g. ``"username/dataset-name"``).
   :param local_dir: Local directory containing the saved dataset.
   :param num_workers: Number of workers for parallel upload.
   :param viewer: If True, enables the dataset viewer on the Hub.
   :param pretty_name: Optional pretty name for the dataset.
   :param dataset_long_description: Optional detailed description.
   :param illustration_urls: Optional list of illustration URLs.
   :param arxiv_paper_urls: Optional list of arXiv paper URLs.
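
A call to ``push_to_hub`` might be wrapped as sketched below. ``build_push_kwargs`` and ``publish`` are hypothetical helpers, not part of PLAID, and the ``repo_id`` check is only a loose sanity test of the ``username/dataset-name`` shape; the Hub applies its own naming rules. Only keyword names from the signature above are forwarded.

```python
from pathlib import Path
from typing import Any, Union


def build_push_kwargs(
    repo_id: str, local_dir: Union[str, Path], **card_fields: Any
) -> dict:
    """Assemble keyword arguments for ``push_to_hub`` (hypothetical helper)."""
    owner, sep, name = repo_id.partition("/")
    # Loose shape check: exactly one "/" with non-empty parts on both sides.
    if not (sep and owner and name) or "/" in name:
        raise ValueError(f"expected 'username/dataset-name', got {repo_id!r}")
    return {"repo_id": repo_id, "local_dir": Path(local_dir), **card_fields}


def publish(repo_id: str, local_dir: Union[str, Path], **card_fields: Any) -> None:
    """Upload a dataset previously written by ``save_to_disk``."""
    # Imported lazily so build_push_kwargs works without PLAID installed.
    from plaid.storage.writer import push_to_hub

    push_to_hub(**build_push_kwargs(repo_id, local_dir, **card_fields))
```

With this in place, ``publish("username/dataset-name", "output/", viewer=True)`` would forward ``viewer=True`` to ``push_to_hub`` unchanged, while a malformed ``repo_id`` fails before any upload starts.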