plaid.bridges.huggingface_bridge
================================

.. py:module:: plaid.bridges.huggingface_bridge

.. autoapi-nested-parse::

   Hugging Face bridge for PLAID datasets.


Attributes
----------

.. autoapisummary::

   plaid.bridges.huggingface_bridge.Self


Functions
---------

.. autoapisummary::

   plaid.bridges.huggingface_bridge.generate_huggingface_description
   plaid.bridges.huggingface_bridge.plaid_dataset_to_huggingface
   plaid.bridges.huggingface_bridge.plaid_dataset_to_huggingface_datasetdict
   plaid.bridges.huggingface_bridge.plaid_generator_to_huggingface
   plaid.bridges.huggingface_bridge.plaid_generator_to_huggingface_datasetdict
   plaid.bridges.huggingface_bridge.huggingface_description_to_problem_definition
   plaid.bridges.huggingface_bridge.to_plaid_sample
   plaid.bridges.huggingface_bridge.huggingface_dataset_to_plaid
   plaid.bridges.huggingface_bridge.streamed_huggingface_dataset_to_plaid
   plaid.bridges.huggingface_bridge.create_string_for_huggingface_dataset_card


Module Contents
---------------

.. py:data:: Self

.. py:function:: generate_huggingface_description(infos: dict, problem_definition: plaid.ProblemDefinition) -> dict[str, Any]

   Generates a Hugging Face dataset description from the infos and problem definition of a plaid dataset.

   The conventions chosen here ensure that conversion to and from Hugging Face datasets works.

   :param infos: infos entry of the plaid dataset from which the Hugging Face description is to be generated
   :type infos: dict
   :param problem_definition: problem definition from which the Hugging Face description is to be generated
   :type problem_definition: ProblemDefinition

   :returns: Hugging Face dataset description
   :rtype: dict[str, Any]


.. py:function:: plaid_dataset_to_huggingface(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset

   Converts a plaid dataset to a Hugging Face dataset.

   The resulting dataset can then be saved to disk or pushed to the Hugging Face hub.

   :param dataset: the plaid dataset to be converted to Hugging Face format
   :type dataset: Dataset
   :param problem_definition: the problem definition is used to generate the description of the Hugging Face dataset.
   :type problem_definition: ProblemDefinition
   :param split: The name of the split. Default: "all_samples".
   :type split: str
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int

   :returns: dataset in Hugging Face format
   :rtype: datasets.Dataset

   .. rubric:: Example

   .. code-block:: python

       dataset = plaid_dataset_to_huggingface(dataset, problem_definition, split)
       dataset.save_to_disk("path/to/dir")
       dataset.push_to_hub("chanel/dataset")


.. py:function:: plaid_dataset_to_huggingface_datasetdict(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, main_splits: list[str], processes_number: int = 1) -> datasets.DatasetDict

   Converts a plaid dataset to a Hugging Face dataset dict, with one entry per main split.

   The resulting dataset dict can then be saved to disk or pushed to the Hugging Face hub.

   :param dataset: the plaid dataset to be converted to Hugging Face format
   :type dataset: Dataset
   :param problem_definition: the problem definition is used to generate the description of the Hugging Face dataset.
   :type problem_definition: ProblemDefinition
   :param main_splits: The names of the main splits, which define a partition of the sample ids.
   :type main_splits: list[str]
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int

   :returns: dataset dict in Hugging Face format
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   .. code-block:: python

       dataset = plaid_dataset_to_huggingface_datasetdict(dataset, problem_definition, main_splits)
       dataset.save_to_disk("path/to/dir")
       dataset.push_to_hub("chanel/dataset")
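
   Since ``main_splits`` is described as defining a partition of the sample ids, it can help to check that property before converting. The following is a minimal sketch under stated assumptions: the splits are represented here as a plain mapping from split name to a list of ids, the ids are assumed to run from ``0`` to ``nb_samples - 1``, and ``is_partition`` is a hypothetical helper that is not part of the plaid API.

   .. code-block:: python

       def is_partition(splits: dict[str, list[int]], nb_samples: int) -> bool:
           """Return True if the splits cover ids 0..nb_samples-1 exactly once."""
           all_ids = [i for ids in splits.values() for i in ids]
           # Sorting exposes both missing ids and ids that appear in two splits.
           return sorted(all_ids) == list(range(nb_samples))


       # Hypothetical split layouts used only for illustration.
       disjoint = {"train": [0, 1, 2, 3], "test": [4, 5]}
       overlapping = {"train": [0, 1, 2, 3], "test": [3, 4, 5]}

       disjoint_ok = is_partition(disjoint, 6)
       overlapping_ok = is_partition(overlapping, 6)

   In this sketch the second layout is rejected because id ``3`` belongs to both splits.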


.. py:function:: plaid_generator_to_huggingface(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset

   Use this function for creating a Hugging Face dataset from a sample generator function.

   This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size.
   The generator enables loading samples one by one.
   The dataset can then be saved to disk, or pushed to the Hugging Face hub.

   :param generator: a function yielding a dict {"sample" : sample}, where sample is of type 'bytes'
   :type generator: Callable
   :param infos: the info is used to generate the description of the Hugging Face dataset.
   :type infos: dict
   :param problem_definition: the problem definition is used to generate the description of the Hugging Face dataset.
   :type problem_definition: ProblemDefinition
   :param split: The name of the split. Default: "all_samples".
   :type split: str
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int

   :returns: dataset in Hugging Face format
   :rtype: datasets.Dataset

   .. rubric:: Example

   .. code-block:: python

       dataset = plaid_generator_to_huggingface(generator, infos, problem_definition, split)
       dataset.push_to_hub("chanel/dataset")
       dataset.save_to_disk("path/to/dir")
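
   The shape of a suitable generator can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's own code: ``load_one_sample`` is a hypothetical loader, and a plain ``dict`` stands in for a plaid sample. Only the yielded structure, a dict with a ``"sample"`` key holding ``bytes``, comes from the parameter description above.

   .. code-block:: python

       import pickle
       from typing import Any, Iterator


       def load_one_sample(index: int) -> dict[str, Any]:
           """Hypothetical loader: in practice this would read one sample from disk."""
           return {"id": index, "values": [index * 0.5, index * 1.5]}


       def generator() -> Iterator[dict[str, bytes]]:
           """Yield one row at a time so only a single sample is held in memory."""
           for index in range(3):
               sample = load_one_sample(index)
               yield {"sample": pickle.dumps(sample)}


       # Consuming the generator here only to inspect the rows it produces.
       rows = list(generator())

   Each row is built and serialized on demand, which is what lets the conversion proceed without loading the whole dataset at once.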


.. py:function:: plaid_generator_to_huggingface_datasetdict(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, main_splits: list, processes_number: int = 1) -> datasets.DatasetDict

   Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.

   This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size.
   The generator enables loading samples one by one.
   The dataset dict can then be saved to disk, or pushed to the Hugging Face hub.

   .. rubric:: Notes

   Only the first split will contain the description.

   :param generator: a function yielding a dict {"sample" : sample}, where sample is of type 'bytes'
   :type generator: Callable
   :param infos: infos entry of the plaid dataset from which the Hugging Face dataset is to be generated
   :type infos: dict
   :param problem_definition: the problem definition is used to generate the description of the Hugging Face dataset.
   :type problem_definition: ProblemDefinition
   :param main_splits: The names of the main splits, which define a partition of the sample ids.
   :type main_splits: list[str]
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int

   :returns: dataset dict in Hugging Face format
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   .. code-block:: python

       dataset = plaid_generator_to_huggingface_datasetdict(generator, infos, problem_definition, main_splits)
       dataset.push_to_hub("chanel/dataset")
       dataset.save_to_disk("path/to/dir")


.. py:function:: huggingface_description_to_problem_definition(description: dict) -> plaid.ProblemDefinition

   Converts a Hugging Face dataset description to a plaid problem definition.

   :param description: the description field of a Hugging Face dataset, containing the problem definition
   :type description: dict

   :returns: the plaid problem definition initialized from the Hugging Face dataset description
   :rtype: ProblemDefinition


.. py:function:: to_plaid_sample(hf_sample: dict[str, Any]) -> plaid.Sample

   Convert a Hugging Face sample dictionary to a PLAID Sample instance.

   :param hf_sample: A dictionary with a "sample" key containing the pickled sample bytes.
   :type hf_sample: dict[str, Any]

   :returns: The deserialized PLAID Sample object.
   :rtype: Sample
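
   The round trip implied by the parameter description can be sketched with the standard library alone. This is a minimal sketch, not the library's implementation: a plain ``dict`` stands in for a ``plaid.Sample``, and ``decode_sample`` is a hypothetical helper that mimics what ``to_plaid_sample`` is described as doing.

   .. code-block:: python

       import pickle
       from typing import Any

       # Stand-in for a plaid.Sample (assumption: any picklable object behaves the same way).
       fake_sample = {"scalars": {"angle": 12.5}, "nodes": [[0.0, 0.0], [1.0, 0.0]]}

       # A Hugging Face row as described above: a dict whose "sample" key holds pickled bytes.
       hf_sample = {"sample": pickle.dumps(fake_sample)}


       def decode_sample(row: dict[str, Any]) -> Any:
           """Hypothetical helper: unpickle the bytes stored under the "sample" key."""
           return pickle.loads(row["sample"])


       restored = decode_sample(hf_sample)

   Because unpickling can execute arbitrary code, rows should only be decoded when they come from a trusted source.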


.. py:function:: huggingface_dataset_to_plaid(ds: datasets.Dataset, ids: Optional[list[int]] = None, processes_number: int = 1, large_dataset: bool = False, verbose: bool = True) -> tuple[plaid.Dataset, plaid.ProblemDefinition]

   Converts a Hugging Face dataset to a plaid dataset.

   A Hugging Face dataset can be read from disk or from the hub. When loading from the hub, the
   ``split="all_samples"`` option is important to obtain a dataset rather than a dataset dict.
   Many loading options are available (caching, streaming, etc.).

   :param ds: the dataset in Hugging Face format to be converted
   :type ds: datasets.Dataset
   :param ids: The specific sample IDs to load from the dataset. Defaults to None.
   :type ids: list, optional
   :param processes_number: The number of processes used to generate the plaid dataset
   :type processes_number: int, optional
   :param large_dataset: if True, uses a variant where the parallel workers do not each load the complete dataset. Default: False.
   :type large_dataset: bool
   :param verbose: if True, prints progress using tqdm
   :type verbose: bool, optional

   :returns: the converted plaid dataset and the problem definition generated from the Hugging Face dataset
   :rtype: tuple[Dataset, ProblemDefinition]

   .. rubric:: Example

   .. code-block:: python

       from datasets import load_dataset, load_from_disk

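       # From the hub: pass the split to get a Dataset rather than a DatasetDict
       dataset = load_dataset("chanel/dataset", split="all_samples")
       # Or from a directory previously written with save_to_disk
       dataset = load_from_disk("path/to/dir")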
       plaid_dataset, plaid_problem = huggingface_dataset_to_plaid(dataset)
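
   To give an intuition for how ``ids`` and ``processes_number`` can interact, the sketch below splits a list of sample ids into near-equal contiguous chunks, one per worker. It is an assumption made for illustration only: ``chunk_ids`` is a hypothetical helper, and the way this function actually distributes work across processes may differ.

   .. code-block:: python

       def chunk_ids(ids: list[int], processes_number: int) -> list[list[int]]:
           """Split ids into at most processes_number contiguous, near-equal chunks."""
           size, extra = divmod(len(ids), processes_number)
           chunks = []
           start = 0
           for worker in range(processes_number):
               # The first `extra` workers each take one additional id.
               stop = start + size + (1 if worker < extra else 0)
               if stop > start:
                   chunks.append(ids[start:stop])
               start = stop
           return chunks


       chunks = chunk_ids(list(range(10)), 3)
       few_ids = chunk_ids([0, 1], 4)

   With a scheme of this kind, each worker only needs the samples in its own chunk, which is the general idea behind avoiding a full copy of the dataset per worker.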


.. py:function:: streamed_huggingface_dataset_to_plaid(hf_repo: str, number_of_samples: int) -> tuple[plaid.Dataset, plaid.ProblemDefinition]

   Creates a plaid dataset by streaming samples from the Hugging Face hub.

   The indices of the retrieved samples are not controlled.

   :param hf_repo: the name of the repo on Hugging Face
   :type hf_repo: str
   :param number_of_samples: The number of samples to retrieve.
   :type number_of_samples: int

   :returns: the converted plaid dataset and the problem definition generated from the Hugging Face dataset
   :rtype: tuple[Dataset, ProblemDefinition]

   .. rubric:: Example

   .. code-block:: python

       from plaid.bridges.huggingface_bridge import streamed_huggingface_dataset_to_plaid

       dataset, pb_def = streamed_huggingface_dataset_to_plaid('PLAID-datasets/VKI-LS59', 2)


.. py:function:: create_string_for_huggingface_dataset_card(description: dict, download_size_bytes: int, dataset_size_bytes: int, nb_samples: int, owner: str, license: str, zenodo_url: Optional[str] = None, arxiv_paper_url: Optional[str] = None, pretty_name: Optional[str] = None, size_categories: Optional[list[str]] = None, task_categories: Optional[list[str]] = None, tags: Optional[list[str]] = None, dataset_long_description: Optional[str] = None, url_illustration: Optional[str] = None) -> str

   Creates the text of a dataset card, to be uploaded together with the dataset on the Hugging Face hub.

   Doing so ensures that ``load_dataset`` from the hub populates the description field of the Hugging Face dataset, so that it remains compatible with conversion to plaid.

   Without a dataset card, the description field is lost.

   The parameters ``download_size_bytes`` and ``dataset_size_bytes`` can be determined after a
   dataset has been uploaded on Hugging Face:

   - manually, by reading their values in the README.md of the dataset page,
   - automatically, as shown in the example below.

   See `the Hugging Face examples <https://github.com/PLAID-lib/plaid/blob/main/examples/bridges/huggingface_bridge_example.py>`__ for a concrete use.

   :param description: Hugging Face dataset description, obtained either from ``hf_dataset.description`` or from ``generate_huggingface_description(infos, problem_definition)``
   :type description: dict
   :param download_size_bytes: the size of the dataset when downloaded from the hub
   :type download_size_bytes: int
   :param dataset_size_bytes: the size of the dataset when loaded in RAM
   :type dataset_size_bytes: int
   :param nb_samples: the number of samples in the dataset
   :type nb_samples: int
   :param owner: the owner of the dataset, usually a username or organization name on Hugging Face
   :type owner: str
   :param license: the license of the dataset, e.g. "CC-BY-4.0", "CC0-1.0", etc.
   :type license: str
   :param zenodo_url: the Zenodo URL of the dataset, if available
   :type zenodo_url: str, optional
   :param arxiv_paper_url: the arxiv paper URL of the dataset, if available
   :type arxiv_paper_url: str, optional
   :param pretty_name: a human-readable name for the dataset, e.g. "PLAID Dataset"
   :type pretty_name: str, optional
   :param size_categories: size categories of the dataset, e.g. ["small", "medium", "large"]
   :type size_categories: list[str], optional
   :param task_categories: task categories of the dataset, e.g. ["image-classification", "text-generation"]
   :type task_categories: list[str], optional
   :param tags: tags for the dataset, e.g. ["3D", "simulation", "mesh"]
   :type tags: list[str], optional
   :param dataset_long_description: a long description of the dataset, providing more details about its content and purpose
   :type dataset_long_description: str, optional
   :param url_illustration: a URL to an illustration image for the dataset, e.g. a screenshot or a sample mesh
   :type url_illustration: str, optional

   :returns: the content of the dataset card
   :rtype: str

   .. rubric:: Example

   .. code-block:: python

       hf_dataset.push_to_hub("chanel/dataset")

       from datasets import load_dataset_builder

       datasetInfo = load_dataset_builder("chanel/dataset").__getstate__()['info']

       from huggingface_hub import DatasetCard

       card_text = create_string_for_huggingface_dataset_card(
           description = description,
           download_size_bytes = datasetInfo.download_size,
           dataset_size_bytes = datasetInfo.dataset_size,
           ...)
       dataset_card = DatasetCard(card_text)
       dataset_card.push_to_hub("chanel/dataset")
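
   For orientation, a Hugging Face dataset card is a README whose metadata sits in a YAML block delimited by ``---`` lines, followed by Markdown text. The sketch below builds such a string by hand. It is only an illustration: ``minimal_card`` is a hypothetical helper, and the exact keys and layout produced by ``create_string_for_huggingface_dataset_card`` may differ.

   .. code-block:: python

       def minimal_card(license_id: str, pretty_name: str, tags: list[str], body: str) -> str:
           """Build a README string: YAML metadata between '---' lines, then Markdown."""
           lines = ["---", f"license: {license_id}", f"pretty_name: {pretty_name}", "tags:"]
           lines += [f"- {tag}" for tag in tags]
           lines += ["---", "", body]
           return "\n".join(lines)


       # Hypothetical values used only for illustration.
       card_text = minimal_card(
           "cc-by-4.0", "Demo dataset", ["simulation", "mesh"], "# Demo dataset"
       )

   A string of this shape is what ``DatasetCard`` receives in the example above.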