plaid.storage.zarr.reader

Zarr dataset reader module.

This module provides functionality for reading and streaming datasets stored in Zarr format for the PLAID library. It includes utilities for loading datasets from local disk or streaming directly from Hugging Face Hub, with support for selective loading of splits and features.

Key features:

- Local dataset loading from disk
- Streaming datasets from Hugging Face Hub
- Selective loading of splits and features
- ZarrDataset class for convenient data access

Attributes

ZarrDatasetDict

Classes

ZarrDataset

A dataset class for accessing Zarr-stored data.

Functions

`sample_generator`(→ Iterator[dict[str, Any]])	Generates samples from a Zarr dataset on Hugging Face Hub.
`create_zarr_iterable_dataset`(→ datasets.IterableDataset)	Creates an IterableDataset from Zarr data on Hugging Face Hub.
`init_datasetdict_from_disk`(→ ZarrDatasetDict)	Initializes dataset dictionaries from local Zarr files.
`download_datasetdict_from_hub`(→ str)	Downloads dataset from Hugging Face Hub to local directory.
`init_datasetdict_streaming_from_hub`(→ dict[str, ...)	Initializes streaming dataset dictionaries from Hugging Face Hub.

Module Contents

class ZarrDataset(zarr_group: zarr.Group, path: str | pathlib.Path)

A dataset class for accessing Zarr-stored data.

This class provides a convenient interface for accessing samples stored in Zarr format. It wraps a Zarr group and provides dictionary-like access to samples, along with additional metadata fields.

Initialize a ZarrDataset.

Parameters:

zarr_group (zarr.Group) – The underlying Zarr group containing the data.
path (Union[str, Path]) – Path to the dataset root (local directory or remote identifier). Stored on the instance as self.path.

zarr_group

path

ids

sample_generator(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) → Iterator[dict[str, Any]]

Generates samples from a Zarr dataset on Hugging Face Hub.

Parameters:

repo_id (str) – The Hugging Face repository ID.
split (str) – The dataset split name.
ids (Iterable[int]) – Iterable of sample IDs to generate.
selected_features (list[str]) – List of features to include.

Yields:

dict – Dictionary mapping feature names to Zarr arrays.
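The yield contract described above can be illustrated with a small in-memory stand-in. This is only a sketch of the shape of the output: `toy_sample_generator` and its `store` argument are hypothetical names that replace the remote Zarr data with a plain dictionary, and they are not part of the module.

```python
from typing import Any, Iterable, Iterator, Mapping


def toy_sample_generator(
    store: Mapping[int, Mapping[str, Any]],
    ids: Iterable[int],
    selected_features: list[str],
) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in: yield one dict per requested sample ID.

    Each yielded dict maps a feature name to that sample's value, and only
    the features named in ``selected_features`` are included.
    """
    for sample_id in ids:
        sample = store[sample_id]
        yield {name: sample[name] for name in selected_features}


# Toy data standing in for one split of a dataset.
store = {
    0: {"pressure": [1.0, 2.0], "velocity": [0.1, 0.2]},
    1: {"pressure": [3.0, 4.0], "velocity": [0.3, 0.4]},
    2: {"pressure": [5.0, 6.0], "velocity": [0.5, 0.6]},
}

samples = list(toy_sample_generator(store, ids=[2, 0], selected_features=["pressure"]))
```

Samples are produced lazily and in the order of `ids`, so a caller that stops iterating early never touches the remaining samples.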

create_zarr_iterable_dataset(repo_id: str, split: str, ids: Iterable[int], selected_features: list[str]) → datasets.IterableDataset

Creates an IterableDataset from Zarr data on Hugging Face Hub.

Parameters:

repo_id (str) – The Hugging Face repository ID.
split (str) – The dataset split name.
ids (Iterable[int]) – Iterable of sample IDs.
selected_features (list[str]) – List of features to include.

Returns:

An iterable dataset for streaming access.

Return type:

IterableDataset
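One common way to obtain a `datasets.IterableDataset` from a generator function is `IterableDataset.from_generator`. The sketch below shows that general pattern with a toy generator. Whether this module builds its dataset the same way is an assumption, and `bind_generator`, `to_iterable_dataset`, and `count_up` are hypothetical helper names.

```python
from functools import partial
from typing import Any, Callable, Iterator


def bind_generator(generator: Callable[..., Iterator[dict]], **kwargs: Any) -> Callable[[], Iterator[dict]]:
    """Bind all arguments so the generator can be called with no arguments."""
    return partial(generator, **kwargs)


def to_iterable_dataset(bound_generator: Callable[[], Iterator[dict]]):
    """Wrap a zero-argument generator function in an IterableDataset.

    The import is deferred so the helpers above can be used without the
    Hugging Face `datasets` package installed.
    """
    from datasets import IterableDataset

    return IterableDataset.from_generator(bound_generator)


def count_up(start: int, n: int) -> Iterator[dict]:
    """Toy generator used only for illustration."""
    for i in range(start, start + n):
        yield {"value": i}


bound = bind_generator(count_up, start=3, n=2)
rows = list(bound())
```

Binding the arguments up front keeps the generator callable without arguments, which is the form `from_generator` accepts when no `gen_kwargs` are passed.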

ZarrDatasetDict

init_datasetdict_from_disk(path: str | pathlib.Path) → ZarrDatasetDict

Initializes dataset dictionaries from local Zarr files.

Parameters:

path (Union[str, Path]) – Path to the local directory containing the dataset.

Returns:

Dictionary mapping split names to ZarrDataset objects.

Return type:

ZarrDatasetDict
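A minimal usage sketch for local loading follows. The wrapper `load_local_splits` and the helper `split_names` are hypothetical names introduced here. The sketch assumes only what is documented above: the function takes a path and returns a dictionary-like object keyed by split name.

```python
from pathlib import Path
from typing import Any, Mapping, Union


def load_local_splits(path: Union[str, Path]):
    """Check that the directory exists, then load it with PLAID."""
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"No dataset directory at {root}")
    # Imported lazily so this helper can be defined without PLAID installed.
    from plaid.storage.zarr.reader import init_datasetdict_from_disk

    return init_datasetdict_from_disk(root)


def split_names(dataset_dict: Mapping[str, Any]) -> list[str]:
    """Return the split names of a split-keyed mapping in sorted order."""
    return sorted(dataset_dict)


# Works with any mapping keyed by split name; a plain dict is used here.
names = split_names({"train": object(), "test": object()})
```

With a dataset on disk, `split_names(load_local_splits("path/to/dataset"))` would list the available splits; the path shown is a placeholder.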

download_datasetdict_from_hub(repo_id: str, local_dir: str | pathlib.Path, split_ids: dict[str, list[int]] | None = None, features: list[str] | None = None, overwrite: bool = False) → str

Downloads dataset from Hugging Face Hub to local directory.

Parameters:

repo_id (str) – The Hugging Face repository ID.
local_dir (Union[str, Path]) – Local directory to download to.
split_ids (Optional[dict[str, list[int]]]) – Optional split IDs for selective download.
features (Optional[list[str]]) – Optional features for selective download.
overwrite (bool) – Whether to overwrite existing directory.

Returns:

Path to the local directory where the dataset has been downloaded.

Return type:

str
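The sketch below shows one way the `split_ids` argument might be assembled for a selective download. It assumes sample IDs are zero-based consecutive integers, which this reference does not state, and `first_n_ids` and `download_subset` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional, Union


def first_n_ids(counts: dict[str, int]) -> dict[str, list[int]]:
    """Build a split_ids mapping selecting the first n samples of each split.

    Assumption: sample IDs are the integers 0, 1, 2, ... within each split.
    """
    return {split: list(range(n)) for split, n in counts.items()}


def download_subset(
    repo_id: str,
    local_dir: Union[str, Path],
    counts: dict[str, int],
    features: Optional[list[str]] = None,
) -> str:
    """Download only the selected samples and features to local_dir."""
    # Imported lazily so this helper can be defined without PLAID installed.
    from plaid.storage.zarr.reader import download_datasetdict_from_hub

    return download_datasetdict_from_hub(
        repo_id,
        Path(local_dir),
        split_ids=first_n_ids(counts),
        features=features,
        overwrite=False,  # leave an existing directory untouched
    )


split_ids = first_n_ids({"train": 3, "test": 1})
```

The returned string is the local directory path, which can then be passed to `init_datasetdict_from_disk`.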

init_datasetdict_streaming_from_hub(repo_id: str, split_ids: dict[str, list[int]] | None = None, features: list[str] | None = None) → dict[str, datasets.IterableDataset]

Initializes streaming dataset dictionaries from Hugging Face Hub.

This function creates iterable datasets that stream Zarr data directly from the Hugging Face Hub without downloading files locally. It supports selective loading of specific splits and features for memory-efficient data access. Note that streaming mode is not compatible with private Hugging Face mirrors.

Parameters:

repo_id (str) – The Hugging Face repository ID (e.g., “username/dataset_name”).
split_ids (Optional[dict[str, list[int]]]) – Optional dictionary mapping split names to lists of sample IDs to include. If None, all samples from all splits are included.
features (Optional[list[str]]) – Optional list of feature names to include. If None, all features from the variable schema are included.

Returns:

Dictionary mapping split names to IterableDataset objects for streaming data access.

Return type:

dict[str, IterableDataset]
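Because the return value is a dictionary of iterables, a few samples per split can be inspected without consuming a whole stream. The following is a sketch under that assumption only. `peek` and `stream_and_peek` are hypothetical helper names, and `peek` is exercised here with plain Python iterables instead of a real Hub repository.

```python
from itertools import islice
from typing import Any, Iterable, Mapping, Optional


def peek(streams: Mapping[str, Iterable[Any]], n: int = 2) -> dict[str, list[Any]]:
    """Take at most n items from each split's stream."""
    return {split: list(islice(stream, n)) for split, stream in streams.items()}


def stream_and_peek(repo_id: str, n: int = 2, features: Optional[list[str]] = None):
    """Open streaming datasets for every split and return the first n samples of each."""
    # Imported lazily so this helper can be defined without PLAID installed.
    from plaid.storage.zarr.reader import init_datasetdict_streaming_from_hub

    streams = init_datasetdict_streaming_from_hub(repo_id, features=features)
    return peek(streams, n)


# Plain iterables stand in for the streaming datasets.
preview = peek({"train": iter(range(10)), "test": [7]}, n=2)
```

Using `islice` stops iteration after `n` items, so only the samples actually requested are read from each stream.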