plaid.storage.hf_datasets.reader
Reader for Hugging Face dataset storage.
- If the environment variable HF_ENDPOINT is set, the reader uses a private Hugging Face mirror. Streaming is disabled: the dataset is downloaded locally via snapshot_download and then loaded from disk.
- If HF_ENDPOINT is not set, the reader attempts to load from the public Hugging Face Hub. If the dataset is already cached locally, it is loaded from disk; otherwise it is loaded from the Hub, optionally in streaming mode.
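The branching described above can be sketched as a small pure-Python function. This is only an illustration of the decision order, not the module's implementation: `choose_load_strategy`, its return labels, and the `env` parameter are hypothetical names, and the cache check (a non-empty local directory) is an assumption, since the source does not say how a cached dataset is detected.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def choose_load_strategy(
    local_dir: Union[str, Path],
    streaming: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for the loading path to take (illustrative only)."""
    env = os.environ if env is None else env

    # Private mirror: streaming is disabled, so a snapshot is downloaded
    # and then read from disk.
    if "HF_ENDPOINT" in env:
        return "mirror_snapshot_then_disk"

    # Assumption: "cached locally" means the directory exists and is non-empty.
    local_path = Path(local_dir)
    if local_path.is_dir() and any(local_path.iterdir()):
        return "disk"

    # Public hub, with streaming as an option.
    return "hub_streaming" if streaming else "hub"
```

Passing `env` explicitly makes the branch order easy to exercise without touching the real process environment; note that `streaming=True` has no effect once HF_ENDPOINT is present.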
Attributes
Functions
init_datasetdict_from_disk | Initializes a DatasetDict from local disk files. |
download_datasetdict_from_hub | Downloads a dataset from Hugging Face Hub to a local directory. |
Initializes a streaming DatasetDict from Hugging Face Hub. |
Module Contents
- init_datasetdict_from_disk(path: str | pathlib.Path) -> HFDatasetDict
Initializes a DatasetDict from local disk files.
- Parameters:
path (Union[str, Path]) – Path to the directory containing the dataset files.
- Returns:
The loaded dataset dictionary.
- Return type:
HFDatasetDict
- download_datasetdict_from_hub(repo_id: str, local_dir: str | pathlib.Path, split_ids: dict[str, int] | None = None, features: list[str] | None = None, overwrite: bool = False) -> str
Downloads a dataset from Hugging Face Hub to a local directory.
- Parameters:
repo_id (str) – The repository ID on Hugging Face Hub.
local_dir (Union[str, Path]) – Local directory to download to.
split_ids (Optional[dict[str, int]]) – Unused parameter for split selection.
features (Optional[list[str]]) – Unused parameter for feature selection.
overwrite (bool) – Whether to overwrite an existing directory.
- Returns:
Path to the downloaded dataset.
- Return type:
str
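A minimal usage sketch combining the two documented functions might look like the following. The wrapper `fetch_and_load` is a hypothetical helper, not part of the module, and feeding the returned path straight into `init_datasetdict_from_disk` is an assumption based on the module description (download locally, then load from disk). The behavior when `local_dir` already exists and `overwrite` is False is not specified in the source, so the sketch does not rely on it.

```python
from pathlib import Path
from typing import Union


def fetch_and_load(
    repo_id: str,
    local_dir: Union[str, Path],
    overwrite: bool = False,
):
    """Download a dataset repository, then open it from disk (illustrative)."""
    # Imported lazily so this sketch can be defined without plaid installed.
    from plaid.storage.hf_datasets.reader import (
        download_datasetdict_from_hub,
        init_datasetdict_from_disk,
    )

    # split_ids and features are documented as unused, so they are omitted.
    path = download_datasetdict_from_hub(repo_id, local_dir, overwrite=overwrite)

    # Assumption: the returned path is directly loadable from disk.
    return init_datasetdict_from_disk(path)
```

A call would then look like `fetch_and_load("some-org/some-dataset", "data/some-dataset")`, where the repository ID and the directory are placeholders.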