plaid.storage.hf_datasets.reader
Reader for Hugging Face dataset storage.
- If the environment variable HF_ENDPOINT is set, uses a private Hugging Face mirror.
  - Streaming is disabled.
  - The dataset is downloaded locally via snapshot_download and loaded from disk.
- If HF_ENDPOINT is not set, attempts to load from the public Hugging Face hub.
  - If the dataset is already cached locally, loads from disk.
  - Otherwise, loads from the hub, optionally using streaming mode.
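The branching described above can be sketched as a small pure function. This is only an illustration of the documented decision order, not the module's actual code: the function name `choose_loading_strategy`, its parameters, and the returned labels are hypothetical, and treating "set" as "present in the environment" is an assumption.

```python
import os
from typing import Mapping, Optional


def choose_loading_strategy(
    is_cached: bool,
    streaming: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for the loading path described in the module notes.

    Hypothetical helper: names and labels are illustrative only.
    """
    env = os.environ if env is None else env

    # Assumption: "set" means the variable is present in the environment.
    if "HF_ENDPOINT" in env:
        # Private mirror: streaming is disabled, snapshot is downloaded
        # locally and then loaded from disk.
        return "mirror_snapshot"

    if is_cached:
        # Public hub, dataset already cached locally: load from disk.
        return "local_cache"

    # Public hub, not cached: load from the hub, optionally streaming.
    return "hub_streaming" if streaming else "hub"
```

Passing `env` explicitly keeps the sketch easy to check without modifying the real process environment.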
plaid.storage.hf_datasets.reader.init_datasetdict_from_disk
Initializes a DatasetDict from local disk files.
Parameters:
- path (Union[str, Path]) – Path to the directory containing the dataset files.
Returns:
- HFDatasetDict (HFDatasetDict) – The loaded dataset dictionary.
Source code in plaid/storage/hf_datasets/reader.py
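A minimal usage sketch for the loaded dictionary follows. It assumes that HFDatasetDict behaves like a mapping from split names to sized datasets (as a Hugging Face DatasetDict does); that is an assumption, not something this page states. The helper `summarize_splits` and the path in the commented usage are hypothetical.

```python
from typing import Dict, Mapping, Sized


def summarize_splits(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of samples in each split.

    Hypothetical helper; assumes a mapping of split name -> sized dataset.
    """
    return {name: len(split) for name, split in dataset_dict.items()}


# Hypothetical usage (requires plaid and a dataset already on disk;
# "path/to/dataset" is a placeholder):
#
# from plaid.storage.hf_datasets.reader import init_datasetdict_from_disk
# dataset_dict = init_datasetdict_from_disk("path/to/dataset")
# print(summarize_splits(dataset_dict))
```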
plaid.storage.hf_datasets.reader.download_datasetdict_from_hub
download_datasetdict_from_hub(
    repo_id,
    local_dir,
    split_ids=None,
    features=None,
    overwrite=False,
)
Downloads a dataset from the Hugging Face Hub to a local directory.
Parameters:
- repo_id (str) – The repository ID on Hugging Face Hub.
- local_dir (Union[str, Path]) – Local directory to download to.
- split_ids (Optional[dict[str, Iterable[int]]], default: None) – Unused parameter for split selection.
- features (Optional[list[str]], default: None) – Unused parameter for feature selection.
- overwrite (bool, default: False) – Whether to overwrite existing directory.
Returns:
- str (str) – Path to the downloaded dataset.
Source code in plaid/storage/hf_datasets/reader.py
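The call shape documented above can be wrapped as in the sketch below. The wrapper `fetch_dataset` and its `downloader` parameter are hypothetical: the parameter exists only so the sketch can be exercised without network access, and by default it falls back to the documented function. Since `split_ids` and `features` are documented as unused, the wrapper does not pass them.

```python
from pathlib import Path
from typing import Callable, Optional, Union


def fetch_dataset(
    repo_id: str,
    local_dir: Union[str, Path],
    overwrite: bool = False,
    downloader: Optional[Callable[..., str]] = None,
) -> str:
    """Download a dataset and return the path reported by the downloader.

    Hypothetical wrapper; `downloader` is an illustrative seam for testing.
    """
    if downloader is None:
        # Imported lazily so the sketch can be defined without plaid installed.
        from plaid.storage.hf_datasets.reader import (
            download_datasetdict_from_hub as downloader,
        )
    # Only the parameters documented as used are forwarded.
    return downloader(repo_id, Path(local_dir), overwrite=overwrite)


# Hypothetical usage ("org/dataset" and "data/dataset" are placeholders):
#
# path = fetch_dataset("org/dataset", "data/dataset", overwrite=True)
```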
plaid.storage.hf_datasets.reader.init_datasetdict_streaming_from_hub
Initializes a streaming DatasetDict from the Hugging Face Hub.
Parameters:
- repo_id (str) – The repository ID on Hugging Face Hub.
- split_ids (Optional[dict[str, Iterable[int]]], default: None) – Unused parameter for split selection.
- features (Optional[list[str]], default: None) – Optional list of features to load.
Returns:
- Any – datasets.IterableDatasetDict: The streaming dataset dictionary.
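Because a streaming dataset is consumed lazily, a common pattern is to inspect only the first few samples of a split. The sketch below assumes the returned object maps split names to iterables, as datasets.IterableDatasetDict does; the helper `peek_samples`, the repository ID, and the split name "train" are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List, Mapping


def peek_samples(
    dataset_dict: Mapping[str, Iterable[Any]], split: str, n: int = 2
) -> List[Any]:
    """Return at most the first `n` samples of one split without reading the rest.

    Hypothetical helper; assumes a mapping of split name -> iterable.
    """
    return list(islice(iter(dataset_dict[split]), n))


# Hypothetical usage (requires plaid and network access;
# "org/dataset" and "train" are placeholders):
#
# from plaid.storage.hf_datasets.reader import (
#     init_datasetdict_streaming_from_hub,
# )
# streaming_dict = init_datasetdict_streaming_from_hub("org/dataset")
# first_samples = peek_samples(streaming_dict, "train", n=2)
```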