plaid.storage.zarr.reader
Zarr dataset reader module.
This module provides functionality for reading and streaming datasets stored in Zarr format for the PLAID library. It includes utilities for loading datasets from local disk or streaming directly from Hugging Face Hub, with support for selective loading of splits and features.
Key features:
- Local dataset loading from disk
- Streaming datasets from Hugging Face Hub
- Selective loading of splits and features
- ZarrDataset class for convenient data access
plaid.storage.zarr.reader.ZarrDataset
A dataset class for accessing Zarr-stored data.
This class provides a convenient interface for accessing samples stored in Zarr format. It wraps a Zarr group and provides dictionary-like access to samples, along with additional metadata fields.
Initialize a ZarrDataset.
Parameters:
- zarr_group (Group) – The underlying Zarr group containing the data.
- path (Union[str, Path]) – Path to the dataset root (local directory or remote identifier). Stored on the instance as self.path.
Source code in plaid/storage/zarr/reader.py
plaid.storage.zarr.reader.ZarrDataset.__iter__
Iterate over all samples in the dataset.
Yields:
- dict[str, Any] – Dictionary containing sample data.
plaid.storage.zarr.reader.ZarrDataset.__getitem__
Get a sample by index.
Parameters:
- idx (int) – Sample index.
Returns:
- dict[str, Any] – Dictionary containing sample data.
plaid.storage.zarr.reader.ZarrDataset.__len__
plaid.storage.zarr.reader.ZarrDataset.__repr__
String representation of the dataset.
Returns:
- str – String representation.
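The access protocol documented for ZarrDataset (indexing, iteration, length, string representation) can be illustrated with a small stand-in. InMemoryDataset below is hypothetical and is not PLAID's implementation: it keeps samples in a plain list instead of a Zarr group, assumes that len() reports the number of samples, and mirrors only the documented method surface.

```python
from typing import Any, Iterator


class InMemoryDataset:
    """Hypothetical stand-in mirroring the documented ZarrDataset methods."""

    def __init__(self, samples: list[dict[str, Any]], path: str) -> None:
        self._samples = samples  # the real class wraps a Zarr group instead
        self.path = path  # stored on the instance, as documented for ZarrDataset

    def __len__(self) -> int:
        # Assumption: the length is the number of samples.
        return len(self._samples)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        # Each sample is returned as a dictionary of feature name -> data.
        return self._samples[idx]

    def __iter__(self) -> Iterator[dict[str, Any]]:
        # Iterate over all samples in index order.
        for idx in range(len(self)):
            yield self[idx]

    def __repr__(self) -> str:
        return f"InMemoryDataset(path={self.path!r}, n_samples={len(self)})"


# "pressure" is a made-up feature name used only for this demonstration.
demo = InMemoryDataset([{"pressure": [1.0, 2.0]}, {"pressure": [3.0]}], path="demo_dir")
first = demo[0]
all_samples = list(demo)
```

Code written against this surface (dataset[i], for sample in dataset, len(dataset)) should carry over to a real ZarrDataset, provided the assumptions above hold.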
plaid.storage.zarr.reader.sample_generator
Generates samples from a Zarr dataset on Hugging Face Hub.
Parameters:
- repo_id (str) – The Hugging Face repository ID.
- split (str) – The dataset split name.
- ids (Iterable[int]) – Iterable of sample IDs to generate.
- selected_features (list[str]) – List of features to include.
Yields:
- dict[str, Any] – Dictionary mapping feature names to Zarr arrays.
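The generator contract above (one dictionary per requested sample ID, restricted to the selected features) can be sketched without network access. Everything in this example is hypothetical: FAKE_STORE replaces the remote Zarr store, sketch_sample_generator is not the library function, and the feature names are invented. The real sample_generator takes a repo_id and yields Zarr arrays rather than lists.

```python
from typing import Any, Iterable, Iterator

# Hypothetical in-memory store: split -> sample ID -> feature name -> value.
FAKE_STORE: dict[str, dict[int, dict[str, Any]]] = {
    "train": {
        0: {"mesh": [0, 1], "pressure": [1.0], "velocity": [2.0]},
        1: {"mesh": [2, 3], "pressure": [3.0], "velocity": [4.0]},
        2: {"mesh": [4, 5], "pressure": [5.0], "velocity": [6.0]},
    }
}


def sketch_sample_generator(
    store: dict[str, dict[int, dict[str, Any]]],
    split: str,
    ids: Iterable[int],
    selected_features: list[str],
) -> Iterator[dict[str, Any]]:
    """Yield one feature dictionary per requested sample ID."""
    for sample_id in ids:
        sample = store[split][sample_id]
        # Keep only the requested features, in the requested order.
        yield {name: sample[name] for name in selected_features}


picked = list(sketch_sample_generator(FAKE_STORE, "train", [0, 2], ["pressure"]))
```

Because the function is a generator, samples are produced lazily: nothing is read for an ID until the consumer asks for the next item.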
plaid.storage.zarr.reader.create_zarr_iterable_dataset
Creates an IterableDataset from Zarr data on Hugging Face Hub.
Parameters:
- repo_id (str) – The Hugging Face repository ID.
- split (str) – The dataset split name.
- ids (Iterable[int]) – Iterable of sample IDs.
- selected_features (list[str]) – List of features to include.
Returns:
- IterableDataset – An iterable dataset for streaming access.
plaid.storage.zarr.reader.init_datasetdict_from_disk
Initializes dataset dictionaries from local Zarr files.
Parameters:
- path (Union[str, Path]) – Path to the local directory containing the dataset.
Returns:
- ZarrDatasetDict – Dictionary mapping split names to ZarrDataset objects.
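A typical first step after loading is to check how many samples each split contains. The helper below is a minimal sketch, not part of PLAID: split_sizes is a hypothetical name, and it assumes the returned ZarrDatasetDict behaves like a standard mapping (supports items()) whose values support len(). Plain lists stand in for datasets so the example runs without PLAID installed.

```python
from typing import Mapping, Sized


def split_sizes(datasetdict: Mapping[str, Sized]) -> dict[str, int]:
    """Return the number of samples in each split."""
    return {split: len(dataset) for split, dataset in datasetdict.items()}


# With PLAID installed, the intended call would look like this (not executed here;
# "path/to/dataset" is a placeholder):
#   from plaid.storage.zarr.reader import init_datasetdict_from_disk
#   sizes = split_sizes(init_datasetdict_from_disk("path/to/dataset"))

# Stand-in data: each list plays the role of one split's dataset.
sizes = split_sizes({"train": [{}, {}, {}], "test": [{}]})
```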
plaid.storage.zarr.reader.download_datasetdict_from_hub
download_datasetdict_from_hub(
    repo_id,
    local_dir,
    split_ids=None,
    features=None,
    overwrite=False,
)
Downloads a dataset from Hugging Face Hub to a local directory.
Parameters:
- repo_id (str) – The Hugging Face repository ID.
- local_dir (Union[str, Path]) – Local directory to download to.
- split_ids (Optional[dict[str, Iterable[int]]], default: None) – Optional split IDs for selective download.
- features (Optional[list[str]], default: None) – Optional features for selective download.
- overwrite (bool, default: False) – Whether to overwrite existing directory.
Returns:
- str – Path to the local directory where the dataset has been downloaded.
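The selective-download parameters can be combined as in the sketch below, which requests only the first few samples of one split and a subset of features. The wrapper download_subset and its download_fn parameter are hypothetical and exist only so the example can run against a stub; the split name "train" and the feature name "pressure" are also assumptions. When download_fn is omitted, the wrapper calls the documented function with the documented keyword arguments.

```python
from pathlib import Path
from typing import Callable, Optional


def download_subset(
    repo_id: str,
    local_dir: str,
    n_train: int,
    features: list[str],
    download_fn: Optional[Callable[..., str]] = None,
) -> str:
    """Download the first n_train samples of the 'train' split (assumed name)."""
    if download_fn is None:
        # Lazy import so this sketch can be loaded without PLAID installed.
        from plaid.storage.zarr.reader import download_datasetdict_from_hub

        download_fn = download_datasetdict_from_hub
    return download_fn(
        repo_id,
        Path(local_dir),
        split_ids={"train": range(n_train)},  # split name -> sample IDs
        features=features,
        overwrite=False,  # leave an existing directory untouched
    )


# Stub with the documented signature; it records the call instead of downloading.
calls = []


def fake_download(repo_id, local_dir, split_ids=None, features=None, overwrite=False):
    ids = {split: list(sample_ids) for split, sample_ids in split_ids.items()}
    calls.append((repo_id, str(local_dir), ids, features, overwrite))
    return str(local_dir)


result = download_subset(
    "username/dataset_name", "data_dir", 3, ["pressure"], download_fn=fake_download
)
```

The returned path can then be passed to init_datasetdict_from_disk to open the downloaded subset.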
plaid.storage.zarr.reader.init_datasetdict_streaming_from_hub
Initializes streaming dataset dictionaries from Hugging Face Hub.
This function creates iterable datasets that stream Zarr data directly from the Hugging Face Hub without downloading files locally. It supports selective loading of specific splits and features for memory-efficient data access. Note that streaming mode is not compatible with private Hugging Face mirrors.
Parameters:
- repo_id (str) – The Hugging Face repository ID (e.g., "username/dataset_name").
- split_ids (Optional[dict[str, Iterable[int]]], default: None) – Optional dictionary mapping split names to lists of sample IDs to include. If None, all samples from all splits are included.
- features (Optional[list[str]], default: None) – Optional list of feature names to include. If None, all features from the variable schema are included.
Returns:
- dict[str, IterableDataset] – Dictionary mapping split names to IterableDataset objects for streaming data access.
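Since the returned values are iterable, a few samples can be inspected per split without consuming a whole stream. The helper below is a hedged sketch: peek and fake_stream are hypothetical names, and plain generators stand in for the IterableDataset objects so that the example runs offline.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, Mapping


def peek(streams: Mapping[str, Iterable[Any]], n: int) -> dict[str, list[Any]]:
    """Take at most n samples from each split's stream."""
    return {split: list(islice(stream, n)) for split, stream in streams.items()}


def fake_stream(count: int) -> Iterator[dict[str, int]]:
    """Stand-in for a streaming dataset: lazily yields `count` tiny samples."""
    for i in range(count):
        yield {"id": i}


# With PLAID installed and network access, the intended call would look like this
# (not executed here; the feature name is a placeholder):
#   from plaid.storage.zarr.reader import init_datasetdict_streaming_from_hub
#   streams = init_datasetdict_streaming_from_hub(
#       "username/dataset_name",
#       split_ids={"train": range(10)},
#       features=["pressure"],
#   )
#   preview = peek(streams, n=2)

preview = peek({"train": fake_stream(1000), "test": fake_stream(1)}, n=2)
```

islice stops pulling from each stream after n items, so only the inspected samples are produced.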