plaid.storage.hf_datasets.reader

Reader for Hugging Face dataset storage.

  • If the environment variable HF_ENDPOINT is set, uses a private Hugging Face mirror.

    • Streaming is disabled.

    • The dataset is downloaded locally via snapshot_download and loaded from disk.

  • If HF_ENDPOINT is not set, attempts to load from the public Hugging Face hub.

    • If the dataset is already cached locally, loads from disk.

    • Otherwise, loads from the hub, optionally using streaming mode.
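The branching described above can be sketched as a small pure function. This is only an illustration of the documented rules, not the module's actual implementation: the function name `choose_loading_strategy`, its parameters, and the returned labels are all hypothetical, and treating "set" as "present in the environment" is an assumption.

```python
import os
from typing import Mapping, Optional


def choose_loading_strategy(
    is_cached: bool,
    streaming: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for the loading path described in the module docs.

    Hypothetical helper: the names and labels are illustrative only.
    """
    env = os.environ if env is None else env

    # Assumption: "set" means the variable is present, whatever its value.
    if "HF_ENDPOINT" in env:
        # Private mirror: streaming is disabled; download a snapshot, load from disk.
        return "mirror_download_then_disk"

    if is_cached:
        # Public hub, dataset already cached locally: load from disk.
        return "disk"

    # Public hub, not cached: load from the hub, optionally in streaming mode.
    return "hub_streaming" if streaming else "hub"
```

Note that the `streaming` flag only matters in the last branch; with a mirror configured or a local cache present, it has no effect in this sketch.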

Attributes

HFDatasetDict

Functions

init_datasetdict_from_disk(→ HFDatasetDict)

Initializes a DatasetDict from local disk files.

download_datasetdict_from_hub(→ str)

Downloads a dataset from Hugging Face Hub to a local directory.

init_datasetdict_streaming_from_hub(...)

Initializes a streaming DatasetDict from Hugging Face Hub.

Module Contents

HFDatasetDict

init_datasetdict_from_disk(path: str | pathlib.Path) → HFDatasetDict

Initializes a DatasetDict from local disk files.

Parameters:

path (Union[str, Path]) – Path to the directory containing the dataset files.

Returns:

The loaded dataset dictionary.

Return type:

HFDatasetDict

download_datasetdict_from_hub(repo_id: str, local_dir: str | pathlib.Path, split_ids: dict[str, int] | None = None, features: list[str] | None = None, overwrite: bool = False) → str

Downloads a dataset from Hugging Face Hub to a local directory.

Parameters:
  • repo_id (str) – The repository ID on Hugging Face Hub.

  • local_dir (Union[str, Path]) – Local directory to download to.

  • split_ids (Optional[dict[str, int]]) – Unused parameter for split selection.

  • features (Optional[list[str]]) – Unused parameter for feature selection.

  • overwrite (bool) – Whether to overwrite an existing directory.

Returns:

Path to the downloaded dataset.

Return type:

str
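A possible way to combine the two functions documented so far is to download a repository and then load it from disk. The sketch below is a usage illustration under stated assumptions: the wrapper name `fetch_and_load` is hypothetical, and it assumes the path returned by download_datasetdict_from_hub can be passed directly to init_datasetdict_from_disk.

```python
from pathlib import Path
from typing import Union


def fetch_and_load(repo_id: str, local_dir: Union[str, Path], overwrite: bool = False):
    """Download a dataset repository, then load it from disk (hypothetical helper)."""
    # Imported lazily so this helper can be defined without plaid installed.
    from plaid.storage.hf_datasets.reader import (
        download_datasetdict_from_hub,
        init_datasetdict_from_disk,
    )

    # Documented return value: the path to the downloaded dataset, as a str.
    dataset_path = download_datasetdict_from_hub(
        repo_id, local_dir, overwrite=overwrite
    )

    # Assumption: the returned path is directly usable by the disk loader.
    return init_datasetdict_from_disk(dataset_path)
```

A call might look like `fetch_and_load("user/dataset", "./data/my_dataset")`, where both the repository ID and the directory are placeholders. split_ids and features are left at their defaults because the documentation above describes them as unused for this function.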

init_datasetdict_streaming_from_hub(repo_id: str, split_ids: dict[str, int] | None = None, features: list[str] | None = None) → datasets.IterableDatasetDict

Initializes a streaming DatasetDict from Hugging Face Hub.

Parameters:
  • repo_id (str) – The repository ID on Hugging Face Hub.

  • split_ids (Optional[dict[str, int]]) – Unused parameter for split selection.

  • features (Optional[list[str]]) – Optional list of features to load.

Returns:

The streaming dataset dictionary.

Return type:

datasets.IterableDatasetDict
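Since the returned object is a datasets.IterableDatasetDict, a split can be iterated lazily without materializing the whole dataset. The following sketch shows one way to peek at the first few samples; the helper name `peek_streaming_samples` and the split name "train" are assumptions, not part of this module.

```python
from itertools import islice
from typing import Any, List


def peek_streaming_samples(repo_id: str, split: str = "train", n: int = 3) -> List[Any]:
    """Return the first n samples of one split (hypothetical helper)."""
    # Imported lazily so this helper can be defined without plaid installed.
    from plaid.storage.hf_datasets.reader import init_datasetdict_streaming_from_hub

    streaming_dict = init_datasetdict_streaming_from_hub(repo_id)

    # An IterableDatasetDict maps split names to iterable datasets, so only
    # the samples actually consumed here are fetched.
    # Assumption: the requested split exists in the repository.
    return list(islice(streaming_dict[split], n))
```

As noted in the module description, streaming is disabled when HF_ENDPOINT is set, so this pattern applies to the public hub path only.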