plaid.storage.hf_datasets.reader
================================

.. py:module:: plaid.storage.hf_datasets.reader

.. autoapi-nested-parse::

   Reader for hf dataset storage.

   - If the environment variable `HF_ENDPOINT` is set, uses a private Hugging Face mirror.

     - Streaming is disabled.
     - The dataset is downloaded locally via `snapshot_download` and loaded from disk.

   - If `HF_ENDPOINT` is not set, attempts to load from the public Hugging Face hub.

     - If the dataset is already cached locally, loads from disk.
     - Otherwise, loads from the hub, optionally using streaming mode.


Functions
---------

.. autoapisummary::

   plaid.storage.hf_datasets.reader.init_datasetdict_from_disk
   plaid.storage.hf_datasets.reader.download_datasetdict_from_hub
   plaid.storage.hf_datasets.reader.init_datasetdict_streaming_from_hub


Module Contents
---------------

.. py:function:: init_datasetdict_from_disk(path: Union[str, pathlib.Path]) -> datasets.DatasetDict

   Initializes a DatasetDict from local disk files.

   :param path: Path to the directory containing the dataset files.
   :type path: Union[str, Path]

   :returns: The loaded dataset dictionary.
   :rtype: datasets.DatasetDict


.. py:function:: download_datasetdict_from_hub(repo_id: str, local_dir: Union[str, pathlib.Path], split_ids: Optional[dict[str, int]] = None, features: Optional[list[str]] = None, overwrite: bool = False) -> str

   Downloads a dataset from Hugging Face Hub to local directory.

   :param repo_id: The repository ID on Hugging Face Hub.
   :type repo_id: str
   :param local_dir: Local directory to download to.
   :type local_dir: Union[str, Path]
   :param split_ids: Unused parameter for split selection.
   :type split_ids: Optional[dict[str, int]]
   :param features: Unused parameter for feature selection.
   :type features: Optional[list[str]]
   :param overwrite: Whether to overwrite existing directory.
   :type overwrite: bool

   :returns: Path to the downloaded dataset.
   :rtype: str


.. py:function:: init_datasetdict_streaming_from_hub(repo_id: str, split_ids: Optional[dict[str, int]] = None, features: Optional[list[str]] = None) -> datasets.IterableDatasetDict

   Initializes a streaming DatasetDict from Hugging Face Hub.

   :param repo_id: The repository ID on Hugging Face Hub.
   :type repo_id: str
   :param split_ids: Unused parameter for split selection.
   :type split_ids: Optional[dict[str, int]]
   :param features: Optional list of features to load.
   :type features: Optional[list[str]]

   :returns: The streaming dataset dictionary.
   :rtype: datasets.IterableDatasetDict
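

The loading rules in the module description can be sketched as a small dispatcher over the three documented functions. This is only an illustration, not the package's own implementation: ``choose_strategy``, ``load_datasetdict`` and the strategy labels are hypothetical names, "cached locally" is assumed to mean that ``local_dir`` already exists, an empty ``HF_ENDPOINT`` is treated as unset, and the path returned by ``download_datasetdict_from_hub`` is assumed to be directly loadable by ``init_datasetdict_from_disk``.

.. code-block:: python

   import os
   from pathlib import Path
   from typing import Mapping, Optional, Union


   def choose_strategy(
       local_dir: Union[str, Path],
       streaming: bool = False,
       env: Optional[Mapping[str, str]] = None,
   ) -> str:
       """Return an illustrative label for the loading path to take.

       Hypothetical helper; the labels are not part of the plaid API.
       """
       env = os.environ if env is None else env
       if env.get("HF_ENDPOINT"):
           # Private mirror: streaming is disabled, download then load from disk.
           return "mirror_download"
       if Path(local_dir).is_dir():
           # Assumption: an existing directory counts as "already cached".
           return "disk"
       # Public hub: stream if requested, otherwise download first.
       return "hub_streaming" if streaming else "hub_download"


   def load_datasetdict(repo_id: str, local_dir: Union[str, Path], streaming: bool = False):
       """Dispatch to the reader functions documented above (hypothetical wrapper)."""
       # Imported lazily so choose_strategy can be used without plaid installed.
       from plaid.storage.hf_datasets import reader

       strategy = choose_strategy(local_dir, streaming)
       if strategy == "disk":
           return reader.init_datasetdict_from_disk(local_dir)
       if strategy == "hub_streaming":
           return reader.init_datasetdict_streaming_from_hub(repo_id)
       # Both download strategies fetch to local_dir, then load from disk.
       path = reader.download_datasetdict_from_hub(repo_id, local_dir)
       return reader.init_datasetdict_from_disk(path)

Keeping the decision in a pure function that takes the environment as an argument makes the branching easy to check without network access. Note that the streaming branch returns a ``datasets.IterableDatasetDict`` while the other branches return a ``datasets.DatasetDict``, so callers need to handle both types.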