plaid.storage.zarr.writer
=========================

.. py:module:: plaid.storage.zarr.writer

.. autoapi-nested-parse::

   Zarr dataset writer module.

   This module provides functionality for writing and managing datasets in Zarr format
   for the PLAID library. It includes utilities for generating datasets from sample
   generators, saving them to disk with optimized chunking, uploading to Hugging Face
   Hub, and configuring dataset cards with metadata and usage examples.

   Key features:

   - Parallel and sequential dataset generation from generators
   - Automatic chunking for efficient storage
   - Integration with Hugging Face Hub for dataset sharing
   - Dataset card generation with splits, features, and documentation


Functions
---------

.. autoapisummary::

   plaid.storage.zarr.writer.generate_datasetdict_to_disk
   plaid.storage.zarr.writer.push_local_datasetdict_to_hub
   plaid.storage.zarr.writer.configure_dataset_card


Module Contents
---------------

.. py:function:: generate_datasetdict_to_disk(output_folder: Union[str, pathlib.Path], generators: dict[str, Callable[Ellipsis, Generator[plaid.Sample, None, None]]], variable_schema: dict[str, dict], gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, num_proc: int = 1, verbose: bool = False) -> None

   Generates and saves a dataset dictionary to disk in Zarr format.

   This function processes sample generators for different dataset splits,
   converts samples to dictionaries, and writes them to Zarr arrays on disk.
   It supports both sequential and parallel processing modes. In parallel mode,
   gen_kwargs must be provided with batch information for each split.

   :param output_folder: Base directory where the dataset will be saved.
                         A 'data' subdirectory will be created inside this folder.
   :type output_folder: Union[str, Path]
   :param generators: Dictionary mapping split names (e.g., "train", "test") to generator
                      functions that yield Sample objects.
   :type generators: dict[str, Callable[..., Generator[Sample, None, None]]]
   :param variable_schema: Schema describing the structure and types
                           of variables/features in the samples.
   :type variable_schema: dict[str, dict]
   :param gen_kwargs: Optional per-split generator arguments used for parallel
                      processing. When num_proc > 1, each split must include a
                      "shards_ids" entry.
   :type gen_kwargs: Optional[dict[str, dict[str, list[IndexType]]]]
   :param num_proc: Number of processes to use. Defaults to 1 (sequential).
                    Values greater than 1 require gen_kwargs to be provided.
   :type num_proc: int, optional
   :param verbose: Whether to display progress bars during processing.
                   Defaults to False.
   :type verbose: bool, optional

   :returns: This function does not return a value; it writes the dataset directly
             to disk.
   :rtype: None
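
   The parallel-mode requirement can be illustrated with a minimal sketch. It
   assumes that each "shards_ids" entry is a list of index batches, one batch
   per process; the helper ``split_into_shards`` and the sample ids are
   hypothetical and not part of PLAID. The final call is left commented out
   because ``generators`` and ``variable_schema`` depend on the dataset at hand.

   .. code-block:: python

      def split_into_shards(sample_ids, num_shards):
          """Split sample ids into num_shards contiguous, near-equal batches."""
          base, extra = divmod(len(sample_ids), num_shards)
          shards, start = [], 0
          for i in range(num_shards):
              # The first `extra` shards receive one additional id.
              size = base + (1 if i < extra else 0)
              shards.append(sample_ids[start:start + size])
              start += size
          return shards


      # Hypothetical sample ids for two splits.
      split_ids = {"train": list(range(10)), "test": list(range(10, 14))}
      num_proc = 2

      # Assumption: one batch of ids per process, stored under "shards_ids".
      gen_kwargs = {
          split: {"shards_ids": split_into_shards(ids, num_proc)}
          for split, ids in split_ids.items()
      }

      # With real generators and a schema, the call would then look like:
      # generate_datasetdict_to_disk(
      #     output_folder="my_dataset",
      #     generators=generators,
      #     variable_schema=variable_schema,
      #     gen_kwargs=gen_kwargs,
      #     num_proc=num_proc,
      # )

   Keeping the batches contiguous and near-equal in size is only one possible
   choice; any partition of the ids into batches fits the same structure.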


.. py:function:: push_local_datasetdict_to_hub(repo_id: str, local_dir: Union[str, pathlib.Path], num_workers: int = 1) -> None

   Pushes a local dataset directory to Hugging Face Hub.

   This function uploads the contents of a local directory to a specified
   Hugging Face repository as a dataset. It uses the HfApi to handle large
   folder uploads with configurable parallelism.

   :param repo_id: The Hugging Face repository ID where the dataset will be uploaded
                   (e.g., "username/dataset_name").
   :type repo_id: str
   :param local_dir: Path to the local directory containing the dataset files
                     to upload.
   :type local_dir: Union[str, Path]
   :param num_workers: Number of worker threads to use for uploading.
                       Defaults to 1.
   :type num_workers: int, optional

   :returns: This function does not return a value; it uploads the dataset directly
             to Hugging Face Hub.
   :rtype: None
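
   The exact call made inside this function is not shown on this page. As a
   rough sketch of what a large-folder upload through ``HfApi`` can look like,
   the following assumes the ``upload_large_folder`` method of
   ``huggingface_hub``; the wrapper name ``upload_folder_sketch`` and the
   early directory check are illustrative additions, not PLAID's implementation.

   .. code-block:: python

      from pathlib import Path


      def upload_folder_sketch(repo_id, local_dir, num_workers=1):
          """Upload local_dir to a Hugging Face dataset repository (illustrative only)."""
          folder = Path(local_dir)
          if not folder.is_dir():
              # Fail early, before any network access.
              raise FileNotFoundError(f"Local dataset directory not found: {folder}")

          # Imported here so the check above works without huggingface_hub installed.
          from huggingface_hub import HfApi

          HfApi().upload_large_folder(
              repo_id=repo_id,
              folder_path=str(folder),
              repo_type="dataset",
              num_workers=num_workers,
          )

   Running the upload requires network access and valid Hugging Face
   credentials for the target repository.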


.. py:function:: configure_dataset_card(repo_id: str, infos: dict[str, dict[str, str]], local_dir: Union[str, pathlib.Path], variable_schema: Optional[dict] = None, viewer: Optional[bool] = None, pretty_name: Optional[str] = None, dataset_long_description: Optional[str] = None, illustration_urls: Optional[list[str]] = None, arxiv_paper_urls: Optional[list[str]] = None) -> None

   Configures and pushes a dataset card to Hugging Face Hub for a Zarr-backed dataset.

   This function generates a dataset card in YAML format with metadata, features,
   splits information, and usage examples. It automatically detects splits and
   sample counts from the local directory structure, then pushes the card to
   the specified Hugging Face repository.

   :param repo_id: The Hugging Face repository ID where the dataset card will be pushed.
   :type repo_id: str
   :param infos: Dictionary containing dataset metadata,
                 including legal information like license.
   :type infos: dict[str, dict[str, str]]
   :param local_dir: Path to the local directory containing the
                     dataset files, expected to have a 'data' subdirectory with split folders.
   :type local_dir: Union[str, Path]
   :param variable_schema: Schema describing the variables/features
                           in the dataset, used to generate the features section in the card.
   :type variable_schema: Optional[dict]
   :param viewer: Unused parameter for viewer configuration.
   :type viewer: Optional[bool]
   :param pretty_name: A human-readable name for the dataset to
                       display in the card.
   :type pretty_name: Optional[str]
   :param dataset_long_description: A detailed description of the
                                    dataset to include in the card.
   :type dataset_long_description: Optional[str]
   :param illustration_urls: List of URLs to images that
                             illustrate the dataset, displayed in the card.
   :type illustration_urls: Optional[list[str]]
   :param arxiv_paper_urls: List of arXiv URLs for papers
                            related to the dataset, included as sources.
   :type arxiv_paper_urls: Optional[list[str]]

   :returns: This function does not return a value; it pushes the dataset card
             directly to Hugging Face Hub.
   :rtype: None
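
   How splits and sample counts are detected is not detailed on this page. The
   sketch below shows one way to read them from a ``data`` subdirectory with
   one folder per split. It assumes a hypothetical layout in which each split
   folder holds one sub-directory per sample; the helper ``detect_splits`` and
   the ``sample_<i>`` names are illustrative and may not match the layout that
   PLAID actually writes.

   .. code-block:: python

      import tempfile
      from pathlib import Path


      def detect_splits(local_dir):
          """Return {split_name: entry_count} for each folder under <local_dir>/data."""
          data_dir = Path(local_dir) / "data"
          return {
              split.name: sum(1 for entry in split.iterdir() if entry.is_dir())
              for split in sorted(data_dir.iterdir())
              if split.is_dir()
          }


      # Build a throwaway directory tree to demonstrate the helper.
      with tempfile.TemporaryDirectory() as tmp:
          for split, n in {"train": 3, "test": 1}.items():
              for i in range(n):
                  (Path(tmp) / "data" / split / f"sample_{i}").mkdir(parents=True)
          split_counts = detect_splits(tmp)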