plaid.storage.zarr.reader
=========================

.. py:module:: plaid.storage.zarr.reader

.. autoapi-nested-parse::

   Zarr dataset reader module.

   This module provides functionality for reading and streaming datasets stored in Zarr format
   for the PLAID library. It includes utilities for loading datasets from local disk or
   streaming directly from Hugging Face Hub, with support for selective loading of splits
   and features.

   Key features:

   - Local dataset loading from disk
   - Streaming datasets from Hugging Face Hub
   - Selective loading of splits and features
   - ZarrDataset class for convenient data access


Classes
-------

.. autoapisummary::

   plaid.storage.zarr.reader.ZarrDataset


Functions
---------

.. autoapisummary::

   plaid.storage.zarr.reader.sample_generator
   plaid.storage.zarr.reader.create_zarr_iterable_dataset
   plaid.storage.zarr.reader.init_datasetdict_from_disk
   plaid.storage.zarr.reader.download_datasetdict_from_hub
   plaid.storage.zarr.reader.init_datasetdict_streaming_from_hub


Module Contents
---------------

.. py:class:: ZarrDataset(zarr_group: zarr.Group, **kwargs)

   A dataset class for accessing Zarr-stored data.

   This class provides a convenient interface for accessing samples stored in Zarr format.
   It wraps a Zarr group and provides dictionary-like access to samples, along with
   additional metadata fields.

   Initialize a :class:`ZarrDataset`.

   :param zarr_group: The underlying Zarr group containing the data.
   :type zarr_group: zarr.Group
   :param \*\*kwargs: Optional keyword metadata to attach to the dataset instance.
                      All provided kwargs are stored in ``self._extra_fields`` and are
                      accessible as attributes via ``__getattr__`` / ``__setattr__``.


   .. py:attribute:: zarr_group


.. py:function:: sample_generator(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) -> Iterator[dict[str, Any]]

   Generates samples from a Zarr dataset on Hugging Face Hub.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param split: The dataset split name.
   :type split: str
   :param ids: Iterable of sample IDs to generate.
   :type ids: Iterable[int]
   :param selected_features: List of features to include.
   :type selected_features: list[str]

   :Yields: *dict* -- Dictionary mapping feature names to Zarr arrays.


.. py:function:: create_zarr_iterable_dataset(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) -> datasets.IterableDataset

   Creates an IterableDataset from Zarr data on Hugging Face Hub.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param split: The dataset split name.
   :type split: str
   :param ids: Iterable of sample IDs.
   :type ids: Iterable[int]
   :param selected_features: List of features to include.
   :type selected_features: list[str]

   :returns: An iterable dataset for streaming access.
   :rtype: IterableDataset


.. py:function:: init_datasetdict_from_disk(path: Union[str, pathlib.Path]) -> dict[str, ZarrDataset]

   Initializes dataset dictionaries from local Zarr files.

   :param path: Path to the local directory containing the dataset.
   :type path: Union[str, Path]

   :returns: Dictionary mapping split names to ZarrDataset objects.
   :rtype: dict[str, ZarrDataset]


.. py:function:: download_datasetdict_from_hub(repo_id: str, local_dir: Union[str, pathlib.Path], split_ids: Optional[dict[str, list[int]]] = None, features: Optional[list[str]] = None, overwrite: bool = False) -> None

   Downloads dataset from Hugging Face Hub to local directory.

   :param repo_id: The Hugging Face repository ID.
   :type repo_id: str
   :param local_dir: Local directory to download to.
   :type local_dir: Union[str, Path]
   :param split_ids: Optional split IDs for selective download.
   :type split_ids: Optional[dict[str, list[int]]]
   :param features: Optional features for selective download.
   :type features: Optional[list[str]]
   :param overwrite: Whether to overwrite existing directory.
   :type overwrite: bool

   :returns: None


.. py:function:: init_datasetdict_streaming_from_hub(repo_id: str, split_ids: Optional[dict[str, list[int]]] = None, features: Optional[list[str]] = None) -> dict[str, datasets.IterableDataset]

   Initializes streaming dataset dictionaries from Hugging Face Hub.

   This function creates iterable datasets that stream Zarr data directly from the Hugging Face Hub
   without downloading files locally. It supports selective loading of specific splits and features
   for memory-efficient data access. Note that streaming mode is not compatible with private
   Hugging Face mirrors.

   :param repo_id: The Hugging Face repository ID (e.g., "username/dataset_name").
   :type repo_id: str
   :param split_ids: Optional dictionary mapping split names to lists of sample IDs to include.
                     If None, all samples from all splits are included.
   :type split_ids: Optional[dict[str, list[int]]]
   :param features: Optional list of feature names to include. If None, all features from the
                    variable schema are included.
   :type features: Optional[list[str]]

   :returns: Dictionary mapping split names to IterableDataset objects for streaming data access.
   :rtype: dict[str, IterableDataset]
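
The ``None`` defaults documented for ``split_ids`` and ``features`` can be
illustrated with a small, self-contained sketch. It is not the library's
implementation: ``resolve_selection``, the ``catalogue`` dictionary and the
feature names are hypothetical stand-ins, and the sketch assumes that an
explicit ``split_ids`` mapping is used as given (splits that are not listed are
simply absent from the result).

.. code-block:: python

   from typing import Optional


   def resolve_selection(
       all_split_ids: dict[str, list[int]],
       all_features: list[str],
       split_ids: Optional[dict[str, list[int]]] = None,
       features: Optional[list[str]] = None,
   ) -> tuple[dict[str, list[int]], list[str]]:
       """Hypothetical helper applying the documented ``None`` defaults."""
       # split_ids=None is documented as "all samples from all splits".
       chosen_ids = dict(all_split_ids) if split_ids is None else dict(split_ids)
       # features=None is documented as "all features from the variable schema".
       chosen_features = list(all_features) if features is None else list(features)
       return chosen_ids, chosen_features


   # Made-up catalogue standing in for what a repository might expose.
   catalogue = {"train": [0, 1, 2], "test": [0, 1]}
   schema = ["feature_a", "feature_b"]

   # No selection: every split, every sample, every feature.
   everything = resolve_selection(catalogue, schema)

   # Explicit selection: two training samples and a single feature.
   subset = resolve_selection(
       catalogue,
       schema,
       split_ids={"train": [0, 2]},
       features=["feature_a"],
   )

The two values returned for ``subset`` have the shapes that the ``split_ids``
and ``features`` parameters are documented to accept, so a selection built this
way could be passed to ``init_datasetdict_streaming_from_hub`` or
``download_datasetdict_from_hub`` together with a real ``repo_id``.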