plaid.storage.zarr.reader
=========================

.. py:module:: plaid.storage.zarr.reader

.. autoapi-nested-parse::

   Zarr dataset reader module.

   This module provides functionality for reading and streaming datasets stored in Zarr format
   for the PLAID library. It includes utilities for loading datasets from local disk or
   streaming directly from Hugging Face Hub, with support for selective loading of splits
   and features.

   Key features:

   - Local dataset loading from disk
   - Streaming datasets from Hugging Face Hub
   - Selective loading of splits and features
   - ZarrDataset class for convenient data access


Classes
-------

.. autoapisummary::

   plaid.storage.zarr.reader.ZarrDataset


Functions
---------

.. autoapisummary::

   plaid.storage.zarr.reader.sample_generator
   plaid.storage.zarr.reader.create_zarr_iterable_dataset
   plaid.storage.zarr.reader.init_datasetdict_from_disk
   plaid.storage.zarr.reader.download_datasetdict_from_hub
   plaid.storage.zarr.reader.init_datasetdict_streaming_from_hub


Module Contents
---------------

.. py:class:: ZarrDataset(zarr_group: zarr.Group, **kwargs)

   A dataset class for accessing Zarr-stored data.

   This class provides a convenient interface for accessing samples stored in Zarr format.
   It wraps a Zarr group and provides dictionary-like access to samples, along with
   additional metadata fields.

   Initialize a :class:`ZarrDataset`.

   :param zarr_group: The underlying Zarr group containing the data.
   :type zarr_group: zarr.Group
   :param \*\*kwargs: Optional keyword metadata to attach to the dataset instance.
                      All provided kwargs are stored in ``self._extra_fields`` and are
                      accessible as attributes via ``__getattr__`` / ``__setattr__``.


   .. py:attribute:: zarr_group


.. py:function:: sample_generator(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) -> Iterator[dict[str, Any]]

   Generates samples from a Zarr dataset hosted on the Hugging Face Hub.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param split: The dataset split name.
   :type split: str
   :param ids: Iterable of sample IDs to generate.
   :type ids: Iterable[int]
   :param selected_features: List of features to include.
   :type selected_features: list[str]

   :Yields: *dict* -- Dictionary mapping feature names to Zarr arrays.
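
   The yield contract (one dictionary per requested ID, restricted to the selected
   features) can be sketched without network access. The example below is an
   illustration under stated assumptions, not the library's code:
   ``toy_sample_generator``, the nested in-memory ``store``, and the feature names are
   all hypothetical, and plain lists stand in for Zarr arrays.

   .. code-block:: python

      from typing import Any, Iterable, Iterator, Mapping


      def toy_sample_generator(
          store: Mapping[str, Mapping[int, Mapping[str, Any]]],
          split: str,
          ids: Iterable[int],
          selected_features: list[str],
      ) -> Iterator[dict[str, Any]]:
          """Yield one dict per requested ID, keeping only the selected features."""
          for sample_id in ids:
              sample = store[split][sample_id]
              yield {name: sample[name] for name in selected_features}


      # In-memory stand-in for a remote Zarr store (hypothetical layout and names).
      store = {
          "train": {
              0: {"pressure": [1.0, 2.0], "velocity": [0.1, 0.2]},
              1: {"pressure": [3.0, 4.0], "velocity": [0.3, 0.4]},
          }
      }
      samples = list(toy_sample_generator(store, "train", [1], ["pressure"]))

   Because the function is a generator, samples are produced lazily, which is what makes
   it suitable as the source of a streaming dataset.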


.. py:function:: create_zarr_iterable_dataset(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) -> datasets.IterableDataset

   Creates an IterableDataset from Zarr data hosted on the Hugging Face Hub.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param split: The dataset split name.
   :type split: str
   :param ids: Iterable of sample IDs.
   :type ids: Iterable[int]
   :param selected_features: List of features to include.
   :type selected_features: list[str]

   :returns: An iterable dataset for streaming access.
   :rtype: IterableDataset


.. py:function:: init_datasetdict_from_disk(path: Union[str, pathlib.Path]) -> dict[str, ZarrDataset]

   Initializes a dictionary of datasets, one per split, from local Zarr files.

   :param path: Path to the local directory containing the dataset.
   :type path: Union[str, Path]

   :returns: Dictionary mapping split names to ZarrDataset objects.
   :rtype: dict[str, ZarrDataset]


.. py:function:: download_datasetdict_from_hub(repo_id: str, local_dir: Union[str, pathlib.Path], split_ids: Optional[dict[str, list[int]]] = None, features: Optional[list[str]] = None, overwrite: bool = False) -> None

   Downloads a dataset from the Hugging Face Hub to a local directory.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param local_dir: Local directory to download to.
   :type local_dir: Union[str, Path]
   :param split_ids: Optional dictionary mapping split names to lists of sample IDs
                     to download.
   :type split_ids: Optional[dict[str, list[int]]]
   :param features: Optional list of feature names to download.
   :type features: Optional[list[str]]
   :param overwrite: Whether to overwrite an existing local directory.
   :type overwrite: bool

   :returns: None
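
   A download followed by a local load might look like the sketch below. It assumes the
   signatures documented on this page and that the downloaded directory can be passed
   directly to :func:`init_datasetdict_from_disk`. The helper ``fetch_and_open``, the
   split name ``"train"``, and the sample IDs are hypothetical.

   .. code-block:: python

      from pathlib import Path


      def fetch_and_open(repo_id: str, local_dir: Path) -> dict:
          """Download a subset of a dataset, then open it from disk."""
          # Imported lazily so the helper can be defined without PLAID installed.
          from plaid.storage.zarr.reader import (
              download_datasetdict_from_hub,
              init_datasetdict_from_disk,
          )

          download_datasetdict_from_hub(
              repo_id,
              local_dir,
              split_ids={"train": [0, 1, 2]},  # assumed split name and sample IDs
              overwrite=False,
          )
          return init_datasetdict_from_disk(local_dir)

   A call such as ``fetch_and_open("username/dataset_name", Path("data"))`` needs network
   access and an existing repository, so it is not executed here.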


.. py:function:: init_datasetdict_streaming_from_hub(repo_id: str, split_ids: Optional[dict[str, list[int]]] = None, features: Optional[list[str]] = None) -> dict[str, datasets.IterableDataset]

   Initializes a dictionary of streaming datasets from the Hugging Face Hub.

   This function creates iterable datasets that stream Zarr data directly from
   the Hugging Face Hub without downloading files locally. It supports selective
   loading of specific splits and features for memory-efficient data access.
   Note that streaming mode is not compatible with private Hugging Face mirrors.

   :param repo_id: The Hugging Face repository ID (e.g., "username/dataset_name").
   :type repo_id: str
   :param split_ids: Optional dictionary mapping split names
                     to lists of sample IDs to include. If None, all samples from all splits
                     are included.
   :type split_ids: Optional[dict[str, list[int]]]
   :param features: Optional list of feature names to include.
                    If None, all features from the variable schema are included.
   :type features: Optional[list[str]]

   :returns: Dictionary mapping split names to IterableDataset objects for
             streaming data access.
   :rtype: dict[str, IterableDataset]
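
   The ``None`` defaults documented above can be read as "select everything". The
   following sketch shows one way such defaults could be resolved. It is an assumption
   made for illustration, not the library's internal logic: ``resolve_selection``,
   ``available_ids``, and ``available_features`` are hypothetical names.

   .. code-block:: python

      from typing import Optional


      def resolve_selection(
          available_ids: dict[str, list[int]],
          available_features: list[str],
          split_ids: Optional[dict[str, list[int]]] = None,
          features: Optional[list[str]] = None,
      ) -> tuple[dict[str, list[int]], list[str]]:
          """Replace None with the full set of splits, sample IDs, and features."""
          # None means "everything": all samples of all splits, all features.
          if split_ids is None:
              split_ids = {split: list(ids) for split, ids in available_ids.items()}
          if features is None:
              features = list(available_features)
          return split_ids, features


      available_ids = {"train": [0, 1, 2], "test": [0, 1]}
      available_features = ["pressure", "velocity"]

      everything = resolve_selection(available_ids, available_features)
      subset = resolve_selection(
          available_ids,
          available_features,
          split_ids={"test": [1]},
          features=["pressure"],
      )

   Passing an explicit ``split_ids`` or ``features`` narrows the selection, which is how
   a caller would limit the amount of data streamed.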