plaid.bridges.huggingface_bridge

Hugging Face bridge for PLAID datasets.

Attributes

Functions

generate_huggingface_description(→ dict[str, Any])

Generates the description field of a Hugging Face dataset from the infos and problem definition of a plaid dataset.

plaid_dataset_to_huggingface(→ datasets.Dataset)

Converts a plaid dataset to a Hugging Face dataset.

plaid_dataset_to_huggingface_datasetdict(...)

Converts a plaid dataset to a Hugging Face dataset dict.

plaid_generator_to_huggingface(→ datasets.Dataset)

Use this function for creating a Hugging Face dataset from a sample generator function.

plaid_generator_to_huggingface_datasetdict(...)

Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.

huggingface_description_to_problem_definition(...)

Converts a Hugging Face dataset description to a plaid problem definition.

to_plaid_sample(→ plaid.Sample)

Convert a Hugging Face sample dictionary to a PLAID Sample instance.

huggingface_dataset_to_plaid(→ tuple[plaid.Dataset, ...)

Converts a Hugging Face dataset to a plaid dataset.

streamed_huggingface_dataset_to_plaid(...)

Use this function for creating a plaid dataset by streaming on Hugging Face.

create_string_for_huggingface_dataset_card(→ str)

Use this function for creating a dataset card, to upload together with the dataset on the Hugging Face hub.

Module Contents

Self
generate_huggingface_description(infos: dict, problem_definition: plaid.ProblemDefinition) → dict[str, Any]

Generates the description field of a Hugging Face dataset from the infos and problem definition of a plaid dataset.

The conventions chosen here ensure that conversion to and from Hugging Face datasets works.

Parameters:
  • infos (dict) – infos entry of the plaid dataset from which the Hugging Face description is to be generated

  • problem_definition (ProblemDefinition) – the problem definition from which the Hugging Face description is to be generated

Returns:

Hugging Face dataset description

Return type:

dict[str, Any]

plaid_dataset_to_huggingface(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) → datasets.Dataset

Converts a plaid dataset to a Hugging Face dataset.

The dataset can then be saved to disk, or pushed to the Hugging Face hub.

Parameters:
  • dataset (Dataset) – the plaid dataset to be converted to Hugging Face format

  • problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.

  • split (str) – The name of the split. Default: “all_samples”.

  • processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.

Returns:

dataset in Hugging Face format

Return type:

datasets.Dataset

Example

from plaid.bridges.huggingface_bridge import plaid_dataset_to_huggingface

dataset = plaid_dataset_to_huggingface(dataset, problem_definition, split)
dataset.save_to_disk("path/to/dir")
dataset.push_to_hub("channel/dataset")
plaid_dataset_to_huggingface_datasetdict(dataset: plaid.Dataset, problem_definition: plaid.ProblemDefinition, main_splits: list[str], processes_number: int = 1) → datasets.DatasetDict

Converts a plaid dataset to a Hugging Face dataset dict.

The dataset dict can then be saved to disk, or pushed to the Hugging Face hub.

Parameters:
  • dataset (Dataset) – the plaid dataset to be converted to Hugging Face format

  • problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.

  • main_splits (list[str]) – The names of the main splits, which define a partition of the sample ids.

  • processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.

Returns:

dataset dict in Hugging Face format

Return type:

datasets.DatasetDict

Example

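from plaid.bridges.huggingface_bridge import plaid_dataset_to_huggingface_datasetdict

dataset = plaid_dataset_to_huggingface_datasetdict(dataset, problem_definition, main_splits)
dataset.save_to_disk("path/to/dir")
dataset.push_to_hub("channel/dataset")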
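The main splits are described as a partition of the sample ids. A minimal sketch of what that requirement means, assuming sample ids run from 0 to n_samples - 1 and that each split name maps to a list of ids. The split_ids mapping and the is_partition helper are hypothetical and are not part of plaid:

```python
def is_partition(split_ids: dict[str, list[int]], n_samples: int) -> bool:
    """Return True if the splits are disjoint and cover ids 0..n_samples-1."""
    seen: set[int] = set()
    for ids in split_ids.values():
        unique = set(ids)
        # Reject duplicates inside a split and overlaps between splits.
        if len(unique) != len(ids) or seen & unique:
            return False
        seen |= unique
    return seen == set(range(n_samples))


# Hypothetical split definition: names and ids are made up for illustration.
split_ids = {"train": [0, 1, 2, 3], "test": [4, 5]}
main_splits = list(split_ids)  # the names that would be passed as main_splits

ok = is_partition(split_ids, n_samples=6)
overlapping = is_partition({"train": [0, 1], "test": [1, 2]}, n_samples=3)
```

Here ok is True, while overlapping is False because id 1 appears in both splits.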
plaid_generator_to_huggingface(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, split: str = 'all_samples', processes_number: int = 1) → datasets.Dataset

Use this function for creating a Hugging Face dataset from a sample generator function.

This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one. The dataset can then be saved to disk, or pushed to the Hugging Face hub.

Parameters:
  • generator (Callable) – a function yielding a dict {“sample” : sample}, where sample is of type ‘bytes’

  • infos (dict) – the info is used to generate the description of the Hugging Face dataset.

  • problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.

  • split (str) – The name of the split. Default: “all_samples”.

  • processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.

Returns:

dataset in Hugging Face format

Return type:

datasets.Dataset

Example

from plaid.bridges.huggingface_bridge import plaid_generator_to_huggingface

dataset = plaid_generator_to_huggingface(generator, infos, problem_definition, split)
dataset.push_to_hub("channel/dataset")
dataset.save_to_disk("path/to/dir")
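The generator argument is described as a function yielding {"sample": sample} dicts with sample of type bytes. A minimal sketch of such a generator, assuming the bytes are produced with pickle (as the to_plaid_sample description suggests) and that the generator takes no arguments. The load_sample and make_generator helpers are hypothetical, and a plain dict stands in for a plaid sample:

```python
import pickle
from typing import Any, Callable, Iterator


def load_sample(sample_id: int) -> dict[str, Any]:
    """Hypothetical loader: stands in for reading one plaid sample from disk."""
    return {"id": sample_id, "values": [sample_id * 0.5, sample_id * 2.0]}


def make_generator(sample_ids: list[int]) -> Callable[[], Iterator[dict[str, bytes]]]:
    """Build a zero-argument generator function over the given sample ids."""

    def generator() -> Iterator[dict[str, bytes]]:
        for sample_id in sample_ids:
            # Only one sample is held in memory at a time.
            sample = load_sample(sample_id)
            yield {"sample": pickle.dumps(sample)}

    return generator


generator = make_generator([0, 1, 2])
records = list(generator())  # materialized here only to inspect the output
```

Because each sample is loaded inside the loop, memory use stays bounded by the size of a single sample rather than the whole dataset.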
plaid_generator_to_huggingface_datasetdict(generator: Callable, infos: dict, problem_definition: plaid.ProblemDefinition, main_splits: list, processes_number: int = 1) → datasets.DatasetDict

Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.

This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one. The dataset dict can then be saved to disk, or pushed to the Hugging Face hub.

Notes

Only the first split will contain the description.

Parameters:
  • generator (Callable) – a function yielding a dict {“sample” : sample}, where sample is of type ‘bytes’

  • infos (dict) – infos entry of the plaid dataset from which the Hugging Face dataset is to be generated

  • problem_definition (ProblemDefinition) – the problem definition is used to generate the description of the Hugging Face dataset.

  • main_splits (list) – The names of the main splits, which define a partition of the sample ids.

  • processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.

Returns:

dataset dict in Hugging Face format

Return type:

datasets.DatasetDict

Example

from plaid.bridges.huggingface_bridge import plaid_generator_to_huggingface_datasetdict

dataset = plaid_generator_to_huggingface_datasetdict(generator, infos, problem_definition, main_splits)
dataset.push_to_hub("channel/dataset")
dataset.save_to_disk("path/to/dir")
huggingface_description_to_problem_definition(description: dict) → plaid.ProblemDefinition

Converts a Hugging Face dataset description to a plaid problem definition.

Parameters:

description (dict) – the description field of a Hugging Face dataset, containing the problem definition

Returns:

the plaid problem definition initialized from the Hugging Face dataset description

Return type:

ProblemDefinition

to_plaid_sample(hf_sample: dict[str, Any]) → plaid.Sample

Convert a Hugging Face sample dictionary to a PLAID Sample instance.

Parameters:

hf_sample (dict[str, Any]) – A dictionary with a “sample” key containing the pickled sample bytes.

Returns:

The deserialized PLAID Sample object.

Return type:

Sample
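The round trip implied by the "sample" key can be sketched with the standard pickle module. This is only an illustration of the data layout described above, not plaid's implementation: decode_sample is a hypothetical helper, and a plain dict stands in for a plaid Sample.

```python
import pickle
from typing import Any


def decode_sample(hf_sample: dict[str, Any]) -> Any:
    """Hypothetical helper: unpickle the bytes stored under the "sample" key."""
    payload = hf_sample["sample"]
    if not isinstance(payload, (bytes, bytearray)):
        raise TypeError("expected pickled bytes under the 'sample' key")
    return pickle.loads(payload)


# A plain dict stands in for a plaid Sample here.
stand_in = {"scalars": {"angle": 12.5}, "fields": {"pressure": [1.0, 2.0]}}
hf_sample = {"sample": pickle.dumps(stand_in)}
restored = decode_sample(hf_sample)
```

Since unpickling can execute arbitrary code, samples should only be deserialized from datasets whose origin is trusted.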

huggingface_dataset_to_plaid(ds: datasets.Dataset, ids: list[int] | None = None, processes_number: int = 1, large_dataset: bool = False, verbose: bool = True) → tuple[plaid.Dataset, plaid.ProblemDefinition]

Converts a Hugging Face dataset to a plaid dataset.

A Hugging Face dataset can be read from disk or from the hub. When loading from the hub, pass split = “all_samples” to get a Dataset rather than a DatasetDict. Many loading options are available (caching, streaming, etc.).

Parameters:
  • ds (datasets.Dataset) – the dataset in Hugging Face format to be converted

  • ids (list, optional) – The specific sample IDs to load from the dataset. Defaults to None.

  • processes_number (int, optional) – The number of processes used to generate the plaid dataset. Default: 1.

  • large_dataset (bool) – if True, uses a variant where parallel workers do not each load the complete dataset. Default: False.

  • verbose (bool, optional) – if True, prints progress using tqdm. Default: True.

Returns:

the converted plaid dataset and the problem definition generated from the Hugging Face dataset

Return type:

tuple[Dataset, ProblemDefinition]

Example

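from datasets import load_dataset, load_from_disk
from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid

# Either load from the hub...
dataset = load_dataset("channel/dataset", split = "all_samples")
# ...or load from disk.
dataset = load_from_disk("path/to/dir")
plaid_dataset, plaid_problem = huggingface_dataset_to_plaid(dataset)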
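The source does not say how the ids are shared among the processes_number workers. Purely as an illustration of the idea behind large_dataset (each worker handling only part of the data), the sketch below splits a list of ids into contiguous chunks. The chunk_ids helper is hypothetical and does not reflect how plaid actually distributes the work:

```python
def chunk_ids(ids: list[int], processes_number: int) -> list[list[int]]:
    """Split ids into at most processes_number contiguous, near-equal chunks."""
    if processes_number < 1:
        raise ValueError("processes_number must be >= 1")
    base, extra = divmod(len(ids), processes_number)
    chunks: list[list[int]] = []
    start = 0
    for worker in range(processes_number):
        # The first `extra` workers receive one additional id.
        size = base + (1 if worker < extra else 0)
        if size:
            chunks.append(ids[start:start + size])
        start += size
    return chunks


chunks = chunk_ids(list(range(10)), processes_number=3)
# chunks == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each worker would then only need to read the samples in its own chunk.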
streamed_huggingface_dataset_to_plaid(hf_repo: str, number_of_samples: int) → tuple[plaid.Dataset, plaid.ProblemDefinition]

Use this function for creating a plaid dataset by streaming on Hugging Face.

The indices of the retrieved samples are not controlled.

Parameters:
  • hf_repo (str) – the name of the repo on Hugging Face

  • number_of_samples (int) – The number of samples to retrieve.

Returns:

the converted plaid dataset and the problem definition generated from the Hugging Face dataset

Return type:

tuple[Dataset, ProblemDefinition]

Example

from plaid.bridges.huggingface_bridge import streamed_huggingface_dataset_to_plaid

dataset, pb_def = streamed_huggingface_dataset_to_plaid('PLAID-datasets/VKI-LS59', 2)
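Why the indices cannot be chosen can be pictured with a plain Python iterator: a stream only offers "the next item", so asking for number_of_samples items returns whichever ones arrive first. The sketch below uses a made-up fake_stream in place of a streamed Hugging Face split, and the take helper is hypothetical:

```python
from itertools import islice
from typing import Iterable, Iterator


def fake_stream() -> Iterator[dict[str, int]]:
    """Hypothetical stream standing in for a streamed Hugging Face split."""
    for index in (7, 3, 9, 1, 4):  # arrival order is decided by the source
        yield {"index": index}


def take(stream: Iterable[dict[str, int]], number_of_samples: int) -> list[dict[str, int]]:
    """Keep the first number_of_samples items, whatever their indices are."""
    return list(islice(stream, number_of_samples))


first_two = take(fake_stream(), 2)
# first_two == [{"index": 7}, {"index": 3}]
```

To load specific samples instead, the ids argument of huggingface_dataset_to_plaid is the documented option.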
create_string_for_huggingface_dataset_card(description: dict, download_size_bytes: int, dataset_size_bytes: int, nb_samples: int, owner: str, license: str, zenodo_url: str | None = None, arxiv_paper_url: str | None = None, pretty_name: str | None = None, size_categories: list[str] | None = None, task_categories: list[str] | None = None, tags: list[str] | None = None, dataset_long_description: str | None = None, url_illustration: str | None = None) → str

Use this function for creating a dataset card, to upload together with the dataset on the Hugging Face hub.

Doing so ensures that load_dataset from the hub populates the hf_dataset.description field, so that the dataset remains compatible with conversion to plaid.

Without a dataset card, the description field is lost.

The parameters download_size_bytes and dataset_size_bytes can be determined after a dataset has been uploaded on Hugging Face:
  • manually, by reading their values on the dataset page README.md,
  • automatically, as shown in the example below.

See the Hugging Face examples for a concrete use case.

Parameters:
  • description (dict) – Hugging Face dataset description. Obtained from description = hf_dataset.description, or from description = generate_huggingface_description(infos, problem_definition)

  • download_size_bytes (int) – the size of the dataset when downloaded from the hub

  • dataset_size_bytes (int) – the size of the dataset when loaded in RAM

  • nb_samples (int) – the number of samples in the dataset

  • owner (str) – the owner of the dataset, usually a username or organization name on Hugging Face

  • license (str) – the license of the dataset, e.g. “CC-BY-4.0”, “CC0-1.0”, etc.

  • zenodo_url (str, optional) – the Zenodo URL of the dataset, if available

  • arxiv_paper_url (str, optional) – the arxiv paper URL of the dataset, if available

  • pretty_name (str, optional) – a human-readable name for the dataset, e.g. “PLAID Dataset”

  • size_categories (list[str], optional) – size categories of the dataset, e.g. [“small”, “medium”, “large”]

  • task_categories (list[str], optional) – task categories of the dataset, e.g. [“image-classification”, “text-generation”]

  • tags (list[str], optional) – tags for the dataset, e.g. [“3D”, “simulation”, “mesh”]

  • dataset_long_description (str, optional) – a long description of the dataset, providing more details about its content and purpose

  • url_illustration (str, optional) – a URL to an illustration image for the dataset, e.g. a screenshot or a sample mesh

Returns:

the content of the dataset card, as a string

Return type:

str

Example

from datasets import load_dataset_builder
from huggingface_hub import DatasetCard

hf_dataset.push_to_hub("channel/dataset")

# Retrieve the sizes once the dataset has been uploaded.
datasetInfo = load_dataset_builder("channel/dataset").info

card_text = create_string_for_huggingface_dataset_card(
    description = description,
    download_size_bytes = datasetInfo.download_size,
    dataset_size_bytes = datasetInfo.dataset_size,
    ...)
dataset_card = DatasetCard(card_text)
dataset_card.push_to_hub("channel/dataset")