plaid.bridges.huggingface_bridge
Hugging Face bridge for PLAID datasets.
Attributes
Functions
generate_huggingface_description | Generates a Hugging Face dataset description field from the infos and problem definition of a plaid dataset.
plaid_dataset_to_huggingface | Use this function for converting a plaid dataset to a Hugging Face dataset.
plaid_dataset_to_huggingface_datasetdict | Use this function for converting a plaid dataset to a Hugging Face dataset dict.
plaid_generator_to_huggingface | Use this function for creating a Hugging Face dataset from a sample generator function.
plaid_generator_to_huggingface_datasetdict | Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.
huggingface_description_to_problem_definition | Converts a Hugging Face dataset description to a plaid problem definition.
to_plaid_sample | Convert a Hugging Face sample dictionary to a PLAID Sample instance.
huggingface_dataset_to_plaid | Use this function for converting a Hugging Face dataset to a plaid dataset.
streamed_huggingface_dataset_to_plaid | Use this function for creating a plaid dataset by streaming from Hugging Face.
create_string_for_huggingface_dataset_card | Use this function for creating a dataset card, to upload together with the dataset on the Hugging Face hub.
Module Contents
- generate_huggingface_description(infos: dict, problem_definition: plaid.ProblemDefinition) -> dict[str, Any]
Generates a Hugging Face dataset description field from the infos and problem definition of a plaid dataset.
The conventions chosen here ensure working conversion to and from Hugging Face datasets.
- Parameters:
infos (dict) – infos entry of the plaid dataset from which the Hugging Face description is to be generated
problem_definition (ProblemDefinition) – the problem definition from which the Hugging Face description is to be generated
- Returns:
Hugging Face dataset description
- Return type:
dict[str, Any]
- plaid_dataset_to_huggingface(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset
Use this function for converting a plaid dataset to a Hugging Face dataset.
The dataset can then be saved to disk, or pushed to the Hugging Face hub.
- Parameters:
dataset (Dataset) – the plaid dataset to be converted to Hugging Face format
problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.
split (str) – The name of the split. Default: “all_samples”.
processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.
- Returns:
dataset in Hugging Face format
- Return type:
datasets.Dataset
Example
from plaid.bridges.huggingface_bridge import plaid_dataset_to_huggingface
dataset = plaid_dataset_to_huggingface(dataset, problem_definition, split)
dataset.save_to_disk("path/to/dir")
dataset.push_to_hub("chanel/dataset")
- plaid_dataset_to_huggingface_datasetdict(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, main_splits: list[str], processes_number: int = 1) -> datasets.DatasetDict
Use this function for converting a plaid dataset to a Hugging Face dataset dict.
The dataset can then be saved to disk, or pushed to the Hugging Face hub.
- Parameters:
dataset (Dataset) – the plaid dataset to be converted to Hugging Face format
problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.
main_splits (list[str]) – The names of the main splits, which together define a partition of the sample ids.
processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.
- Returns:
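dataset dict in Hugging Face format
- Return type:
datasets.DatasetDict
Example
from plaid.bridges.huggingface_bridge import plaid_dataset_to_huggingface_datasetdict
dataset = plaid_dataset_to_huggingface_datasetdict(dataset, problem_definition, main_splits)
dataset.save_to_disk("path/to/dir")
dataset.push_to_hub("chanel/dataset")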
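The main_splits parameter is described as defining a partitioning of the sample ids. The sketch below illustrates what that condition means, under the assumption that the splits are available as a plain dict mapping split names to lists of ids numbered from 0. The helper name is_partition and this dict layout are hypothetical and not part of the plaid API.

```python
def is_partition(splits: dict[str, list[int]], main_splits: list[str], nb_samples: int) -> bool:
    """Return True if the ids of main_splits cover range(nb_samples) exactly once.

    Hypothetical helper: it only illustrates the partition condition.
    """
    seen: list[int] = []
    for name in main_splits:
        seen.extend(splits[name])
    # No id may appear twice, and together the ids must be 0..nb_samples-1.
    return len(seen) == len(set(seen)) and set(seen) == set(range(nb_samples))


splits = {"train": [0, 1, 2], "test": [3, 4], "train_small": [0, 1]}
print(is_partition(splits, ["train", "test"], 5))        # True
print(is_partition(splits, ["train_small", "test"], 5))  # False: id 2 is missing
```

Splits such as train_small may still exist alongside the main ones; in this sketch they are simply not listed in main_splits.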
- plaid_generator_to_huggingface(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset
Use this function for creating a Hugging Face dataset from a sample generator function.
This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one. The dataset can then be saved to disk, or pushed to the Hugging Face hub.
- Parameters:
generator (Callable) – a function yielding a dict {“sample” : sample}, where sample is of type ‘bytes’
infos (dict) – the info is used to generate the description of the Hugging Face dataset.
problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.
split (str) – The name of the split. Default: “all_samples”.
processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.
- Returns:
dataset in Hugging Face format
- Return type:
datasets.Dataset
Example
from plaid.bridges.huggingface_bridge import plaid_generator_to_huggingface
dataset = plaid_generator_to_huggingface(generator, infos, problem_definition, split)
dataset.push_to_hub("chanel/dataset")
dataset.save_to_disk("path/to/dir")
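The generator is documented as yielding a dict {"sample": sample} where sample is of type bytes. The sketch below shows one way such a generator could look. It assumes the bytes are produced with pickle, which the text above does not state, and it uses plain dicts as stand-ins for plaid samples. The helper make_generator is hypothetical.

```python
import pickle
from typing import Any, Callable, Iterator


def make_generator(raw_samples: list[Any]) -> Callable[[], Iterator[dict[str, bytes]]]:
    """Build a zero-argument generator function over raw_samples (hypothetical helper)."""

    def generator() -> Iterator[dict[str, bytes]]:
        for sample in raw_samples:
            # Serialize one sample at a time (pickle is an assumption here),
            # so only a single serialized sample exists at any moment.
            yield {"sample": pickle.dumps(sample)}

    return generator


# Plain dicts stand in for plaid Sample objects in this sketch.
generator = make_generator([{"id": 0, "pressure": 1.5}, {"id": 1, "pressure": 2.0}])
records = list(generator())
restored = pickle.loads(records[0]["sample"])
print(type(records[0]["sample"]).__name__, restored)
```

In a real use case the loop body would load each sample from disk instead of iterating over a list already held in memory.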
- plaid_generator_to_huggingface_datasetdict(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, main_splits: list, processes_number: int = 1) -> datasets.DatasetDict
Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.
This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one. The dataset dict can then be saved to disk, or pushed to the Hugging Face hub.
Notes
Only the first split will contain the description.
- Parameters:
generator (Callable) – a function yielding a dict {“sample” : sample}, where sample is of type ‘bytes’
infos (dict) – infos entry of the plaid dataset from which the Hugging Face dataset is to be generated
problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.
main_splits (list) – The names of the main splits, which together define a partition of the sample ids.
processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.
- Returns:
dataset dict in Hugging Face format
- Return type:
datasets.DatasetDict
Example
dataset = plaid_generator_to_huggingface_datasetdict(generator, infos, problem_definition, main_splits)
dataset.push_to_hub("chanel/dataset")
dataset.save_to_disk("path/to/dir")
- huggingface_description_to_problem_definition(description: dict) -> plaid.ProblemDefinition
Converts a Hugging Face dataset description to a plaid problem definition.
- Parameters:
description (dict) – the description field of a Hugging Face dataset, containing the problem definition
- Returns:
the plaid problem definition initialized from the Hugging Face dataset description
- Return type:
ProblemDefinition
- to_plaid_sample(hf_sample: dict[str, Any]) -> plaid.Sample
Convert a Hugging Face sample dictionary to a PLAID Sample instance.
- huggingface_dataset_to_plaid(ds: datasets.Dataset, ids: list[int] | None = None, processes_number: int = 1, large_dataset: bool = False, verbose: bool = True) -> tuple[plaid.Dataset, plaid.ProblemDefinition]
Use this function for converting a Hugging Face dataset to a plaid dataset.
A Hugging Face dataset can be read from disk or from the hub. When loading from the hub, the split = “all_samples” option is important to get a Dataset and not a DatasetDict. Many loading options are available (caching, streaming, etc.).
- Parameters:
ds (datasets.Dataset) – the dataset in Hugging Face format to be converted
ids (list, optional) – The specific sample IDs to load from the dataset. Defaults to None.
processes_number (int, optional) – The number of processes used to generate the plaid dataset. Default: 1.
large_dataset (bool) – if True, uses a variant where parallel workers do not each load the complete dataset. Default: False.
verbose (bool, optional) – if True, prints progress using tqdm. Default: True.
- Returns:
dataset (Dataset): the converted dataset. problem_definition (ProblemDefinition): the problem definition generated from the Hugging Face dataset
- Return type:
tuple[Dataset, ProblemDefinition]
Example
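from datasets import load_dataset, load_from_disk
# Either load from the hub; split="all_samples" returns a Dataset, not a DatasetDict
dataset = load_dataset("chanel/dataset", split = "all_samples")
# or load a dataset previously saved to disk
dataset = load_from_disk("path/to/dir")
plaid_dataset, plaid_problem = huggingface_dataset_to_plaid(dataset)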
- streamed_huggingface_dataset_to_plaid(hf_repo: str, number_of_samples: int) -> tuple[plaid.Dataset, plaid.ProblemDefinition]
Use this function for creating a plaid dataset by streaming on Hugging Face.
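The indices of the retrieved samples are not controlled.
- Parameters:
hf_repo (str) – the Hugging Face repository to stream from, e.g. “PLAID-datasets/VKI-LS59”
number_of_samples (int) – the number of samples to retrieve
- Returns:
dataset (Dataset): the converted dataset. problem_definition (ProblemDefinition): the problem definition generated from the Hugging Face dataset
- Return type:
tuple[Dataset, ProblemDefinition]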
Example
from plaid.bridges.huggingface_bridge import streamed_huggingface_dataset_to_plaid
dataset, pb_def = streamed_huggingface_dataset_to_plaid('PLAID-datasets/VKI-LS59', 2)
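Streaming delivers samples in whatever order the source provides them, which is why the indices of the retrieved samples are not controlled. The sketch below mimics that behaviour with a plain Python iterator, assuming the first number_of_samples items are taken in arrival order. It illustrates the idea only and is not the library's implementation; fake_stream is a hypothetical stand-in for a streamed Hugging Face dataset.

```python
from itertools import islice
from typing import Iterator


def fake_stream() -> Iterator[dict[str, int]]:
    """Hypothetical stand-in for a streamed dataset: yields samples in storage order."""
    for index in (7, 3, 9, 1):
        yield {"index": index}


def take(stream: Iterator[dict[str, int]], number_of_samples: int) -> list[dict[str, int]]:
    # islice stops after number_of_samples items without consuming the rest.
    return list(islice(stream, number_of_samples))


first_two = take(fake_stream(), 2)
print([s["index"] for s in first_two])  # [7, 3]
```

The caller chooses how many samples arrive, not which ones. When specific sample ids are needed, the ids parameter of huggingface_dataset_to_plaid is the documented way to select them.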
- create_string_for_huggingface_dataset_card(description: dict, download_size_bytes: int, dataset_size_bytes: int, nb_samples: int, owner: str, license: str, zenodo_url: str | None = None, arxiv_paper_url: str | None = None, pretty_name: str | None = None, size_categories: list[str] | None = None, task_categories: list[str] | None = None, tags: list[str] | None = None, dataset_long_description: str | None = None, url_illustration: str | None = None) -> str
Use this function for creating a dataset card, to upload together with the dataset on the Hugging Face hub.
Doing so ensures that load_dataset from the hub will populate the hf_dataset.description field, and that the dataset remains compatible with conversion to plaid.
Without a dataset card, the description field is lost.
The parameters download_size_bytes and dataset_size_bytes can be determined after a dataset has been uploaded to Hugging Face:
- manually, by reading their values on the dataset page README.md,
- automatically, as shown in the example below.
See the Hugging Face examples for a concrete use.
- Parameters:
description (dict) – Hugging Face dataset description. Obtained from either:
- description = hf_dataset.description
- description = generate_huggingface_description(infos, problem_definition)
download_size_bytes (int) – the size of the dataset when downloaded from the hub
dataset_size_bytes (int) – the size of the dataset when loaded in RAM
nb_samples (int) – the number of samples in the dataset
owner (str) – the owner of the dataset, usually a username or organization name on Hugging Face
license (str) – the license of the dataset, e.g. “CC-BY-4.0”, “CC0-1.0”, etc.
zenodo_url (str, optional) – the Zenodo URL of the dataset, if available
arxiv_paper_url (str, optional) – the arxiv paper URL of the dataset, if available
pretty_name (str, optional) – a human-readable name for the dataset, e.g. “PLAID Dataset”
size_categories (list[str], optional) – size categories of the dataset, e.g. [“small”, “medium”, “large”]
task_categories (list[str], optional) – task categories of the dataset, e.g. [“image-classification”, “text-generation”]
tags (list[str], optional) – tags for the dataset, e.g. [“3D”, “simulation”, “mesh”]
dataset_long_description (str, optional) – a long description of the dataset, providing more details about its content and purpose
url_illustration (str, optional) – a URL to an illustration image for the dataset, e.g. a screenshot or a sample mesh
- Returns:
the text of the dataset card, which can be passed to huggingface_hub.DatasetCard
- Return type:
str
Example
from datasets import load_dataset_builder
from huggingface_hub import DatasetCard
hf_dataset.push_to_hub("chanel/dataset")
datasetInfo = load_dataset_builder("chanel/dataset").__getstate__()['info']
card_text = create_string_for_huggingface_dataset_card(
    description = description,
    download_size_bytes = datasetInfo.download_size,
    dataset_size_bytes = datasetInfo.dataset_size,
    ...)
dataset_card = DatasetCard(card_text)
dataset_card.push_to_hub("chanel/dataset")