plaid.bridges.huggingface_bridge
Hugging Face bridge for PLAID datasets.
Attributes
Functions
| to_plaid_dataset | Convert a Hugging Face dataset into a PLAID dataset. |
| to_plaid_sample | Convert a Hugging Face dataset row to a PLAID Sample object. |
| instantiate_plaid_datasetdict_from_hub | Load a Hugging Face dataset from the Hub and instantiate it as a dictionary of PLAID datasets. |
| load_dataset_from_hub | Load a Hugging Face dataset from the public hub, a private mirror, or local cache, with automatic handling of streaming and download modes. |
| load_infos_from_hub | Load dataset infos from the Hugging Face Hub. |
| load_problem_definition_from_hub | Load a ProblemDefinition from the Hugging Face Hub. |
| load_tree_struct_from_hub | Load the tree structure metadata of a PLAID dataset from the Hugging Face Hub. |
| load_dataset_from_disk | Load a Hugging Face dataset or dataset dictionary from disk. |
| load_infos_from_disk | Load dataset information from a YAML file stored on disk. |
| load_problem_definition_from_disk | Load a ProblemDefinition and its split information from disk. |
| load_tree_struct_from_disk | Load a tree structure for a dataset from disk. |
| binary_to_plaid_sample | Convert a Hugging Face dataset sample in binary format to a PLAID Sample. |
| huggingface_dataset_to_plaid | Use this function to convert a Hugging Face dataset into a PLAID dataset. |
| huggingface_description_to_problem_definition | Convert a Hugging Face dataset description to a PLAID problem definition. |
| huggingface_description_to_infos | Convert a Hugging Face dataset description dictionary to a PLAID infos dictionary. |
| infer_hf_features_from_value | Infer Hugging Face dataset feature type from a given value. |
| build_hf_sample | Flatten a PLAID Sample's CGNS trees into Hugging Face–compatible arrays and metadata. |
| process_shard | Process a single shard of sample ids and collect per-shard metadata. |
| preprocess_splits | Pre-process dataset splits: inspect samples to infer features, constants and CGNS metadata. |
| plaid_dataset_to_huggingface_datasetdict | Convert a PLAID dataset into a Hugging Face datasets.DatasetDict. |
| plaid_generator_to_huggingface_datasetdict | Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict. |
| push_dataset_dict_to_hub | Push a Hugging Face DatasetDict to the Hugging Face Hub. |
| push_infos_to_hub | Upload dataset infos to the Hugging Face Hub. |
| push_problem_definition_to_hub | Upload a ProblemDefinition and its split information to the Hugging Face Hub. |
| push_tree_struct_to_hub | Upload a dataset's tree structure to a Hugging Face dataset repository. |
| save_dataset_dict_to_disk | Save a Hugging Face DatasetDict to disk. |
| save_infos_to_disk | Save dataset infos as a YAML file to disk. |
| save_problem_definition_to_disk | Save a ProblemDefinition and its split information to disk. |
| save_tree_struct_to_disk | Save the structure of a dataset tree to disk. |
| plaid_dataset_to_huggingface_binary | Use this function to convert a PLAID dataset into a Hugging Face dataset. |
| plaid_generator_to_huggingface_binary | Use this function to create a Hugging Face dataset from a sample generator function. |
| plaid_dataset_to_huggingface_datasetdict_binary | Use this function to convert a PLAID dataset into a Hugging Face dataset dict. |
| plaid_generator_to_huggingface_datasetdict_binary | Use this function to create a Hugging Face dataset dict (containing multiple splits) from sample generator functions. |
| update_dataset_card | Update a dataset card with PLAID-specific metadata and documentation. |
Module Contents
- to_plaid_dataset(hf_dataset: datasets.Dataset, flat_cst: dict[str, Any], cgns_types: dict[str, str], enforce_shapes: bool = True) -> plaid.Dataset
Convert a Hugging Face dataset into a PLAID dataset.
Iterates over all samples in a Hugging Face Dataset and converts each one into a PLAID-compatible sample using to_plaid_sample. The resulting samples are then collected into a single PLAID Dataset.
- Parameters:
hf_dataset (datasets.Dataset) – The Hugging Face dataset split to convert.
flat_cst (dict[str, Any]) – Flattened representation of the CGNS tree structure constants.
cgns_types (dict[str, str]) – Mapping of CGNS paths to their expected types.
enforce_shapes (bool, optional) – If True, ensures all arrays strictly follow the reference shapes. Defaults to True.
- Returns:
A PLAID Dataset object containing the converted samples.
- Return type:
plaid.Dataset
- to_plaid_sample(ds: datasets.Dataset, i: int, flat_cst: dict[str, Any], cgns_types: dict[str, str], enforce_shapes: bool = True) -> plaid.Sample
Convert a Hugging Face dataset row to a PLAID Sample object.
Extracts a single row from a Hugging Face dataset and converts it into a PLAID Sample by unflattening the CGNS tree structure. Constant features from flat_cst are merged with the variable features from the row.
- Parameters:
ds (datasets.Dataset) – The Hugging Face dataset containing the sample data.
i (int) – The index of the row to convert.
flat_cst (dict[str, Any]) – Dictionary of constant features to add to each sample.
cgns_types (dict[str, str]) – Dictionary mapping paths to CGNS types for reconstruction.
enforce_shapes (bool, optional) – If True, ensures consistent array shapes during conversion. Defaults to True.
- Returns:
A validated PLAID Sample object reconstructed from the Hugging Face dataset row.
- Return type:
plaid.Sample
Note
Uses the dataset’s pyarrow table data for efficient access.
Handles array shapes and types according to enforce_shapes.
Constant features from flat_cst are merged with the variable features from the row.
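The merge of constant and per-row features described above can be pictured with plain dictionaries. This is a minimal sketch and not the library's implementation: merge_constant_and_variable is a hypothetical helper, the paths and values are invented for illustration, and rejecting a path that appears on both sides is an assumption.

```python
from typing import Any


def merge_constant_and_variable(
    flat_cst: dict[str, Any], row: dict[str, Any]
) -> dict[str, Any]:
    """Return one flat path -> value mapping for a single sample.

    Hypothetical helper: constants shared by every sample are combined with
    the values stored in one dataset row. Treating an overlapping path as an
    error is an assumption of this sketch.
    """
    overlap = flat_cst.keys() & row.keys()
    if overlap:
        raise ValueError(f"paths are both constant and variable: {sorted(overlap)}")
    return {**flat_cst, **row}


# Illustrative data only; real keys are flattened CGNS paths.
flat_cst = {"Base/Zone/GridCoordinates/CoordinateX": [0.0, 0.5, 1.0]}
row = {"Base/Zone/FlowSolution/Pressure": [1.0, 2.0, 3.0]}
flat_sample = merge_constant_and_variable(flat_cst, row)
```

The merged mapping is what would then be unflattened into a CGNS tree using cgns_types.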
- instantiate_plaid_datasetdict_from_hub(repo_id: str, enforce_shapes: bool = True) -> dict[str, plaid.Dataset]
Load a Hugging Face dataset from the Hub and instantiate it as a dictionary of PLAID datasets.
This function retrieves a dataset dictionary from the Hugging Face Hub, along with its associated CGNS tree structure and type information. Each split of the Hugging Face dataset is then converted into a PLAID dataset.
- Parameters:
- Returns:
A dictionary mapping split names (e.g. “train”, “test”) to PLAID Dataset objects.
- Return type:
dict[str, plaid.Dataset]
- load_dataset_from_hub(repo_id: str, streaming: bool = False, *args, **kwargs) -> datasets.Dataset | datasets.DatasetDict | datasets.IterableDataset | datasets.IterableDatasetDict
Loads a Hugging Face dataset from the public hub, a private mirror, or local cache, with automatic handling of streaming and download modes.
Behavior:
- If the environment variable HF_ENDPOINT is set, a private Hugging Face mirror is used:
  - Streaming is disabled.
  - The dataset is downloaded locally via snapshot_download and loaded from disk.
- If HF_ENDPOINT is not set, the public Hugging Face hub is used:
  - If the dataset is already cached locally, it is loaded from disk.
  - Otherwise, it is loaded from the hub, optionally in streaming mode.
- Parameters:
repo_id (str) – The Hugging Face dataset repository ID (e.g., ‘username/dataset’).
streaming (bool, optional) – If True, attempts to stream the dataset (only supported on the public hub).
*args – Positional arguments forwarded to [datasets.load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset).
**kwargs – Keyword arguments forwarded to [datasets.load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset).
- Returns:
The loaded Hugging Face dataset object.
- Return type:
Union[datasets.Dataset, datasets.DatasetDict, datasets.IterableDataset, datasets.IterableDatasetDict]
- Raises:
Exception – Propagates any exceptions raised by datasets.load_dataset, datasets.load_from_disk, or huggingface_hub.snapshot_download if loading fails.
Note
Streaming mode is not supported when using a private mirror.
If the dataset is found in the local cache, loads from disk instead of streaming.
- To use behind a proxy or with a private mirror, you may need to set:
HF_ENDPOINT to your private mirror address
CURL_CA_BUNDLE to your trusted CA certificates
HF_HOME to a shared cache directory if needed
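The branching described for load_dataset_from_hub can be summarized as a small decision function. This is a sketch of the documented behavior only: choose_loading_strategy is a hypothetical name, it performs no network or disk access, the mirror URL is a placeholder, and treating an empty HF_ENDPOINT as unset is an assumption.

```python
from collections.abc import Mapping


def choose_loading_strategy(
    env: Mapping[str, str], cached_locally: bool, streaming: bool
) -> str:
    """Name the branch taken for a given environment (illustrative only)."""
    # Assumption: an empty HF_ENDPOINT value counts as "not set".
    if env.get("HF_ENDPOINT"):
        # Private mirror: streaming is ignored, a snapshot is downloaded first.
        return "mirror: snapshot_download, then load from disk"
    if cached_locally:
        # A local copy takes precedence over streaming.
        return "public hub: load from local cache"
    if streaming:
        return "public hub: stream"
    return "public hub: download"


# Placeholder mirror address, not a real endpoint.
on_mirror = choose_loading_strategy(
    {"HF_ENDPOINT": "https://hf-mirror.example.org"},
    cached_locally=False,
    streaming=True,
)
on_public_hub = choose_loading_strategy({}, cached_locally=False, streaming=True)
```

Note that a streaming request is silently overridden in the first two branches, which matches the notes above.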
- load_infos_from_hub(repo_id: str) -> dict[str, dict[str, str]]
Load dataset infos from the Hugging Face Hub.
Downloads the infos.yaml file from the specified repository and parses it as a dictionary.
- load_problem_definition_from_hub(repo_id: str, name: str) -> plaid.ProblemDefinition
Load a ProblemDefinition from the Hugging Face Hub.
Downloads the problem infos YAML and split JSON files from the specified repository and location, then initializes a ProblemDefinition object with this information.
- Parameters:
- Returns:
The loaded problem definition.
- Return type:
plaid.ProblemDefinition
- load_tree_struct_from_hub(repo_id: str) -> tuple[dict, dict]
Load the tree structure metadata of a PLAID dataset from the Hugging Face Hub.
- This function retrieves two artifacts previously uploaded alongside a dataset:
tree_constant_part.pkl: a pickled dictionary of constant feature values (features that are identical across all samples).
key_mappings.yaml: a YAML file containing metadata about the dataset feature structure, including variable features, constant features, and CGNS types.
- Parameters:
repo_id (str) – The repository ID on the Hugging Face Hub (e.g., “username/dataset_name”).
- Returns:
flat_cst (dict): constant features dictionary (path → value).
key_mappings (dict): metadata dictionary containing keys such as:
- “variable_features”: list of paths for non-constant features.
- “constant_features”: list of paths for constant features.
- “cgns_types”: mapping from paths to CGNS types.
- Return type:
tuple[dict, dict]
- load_dataset_from_disk(path: str | pathlib.Path, *args, **kwargs) -> datasets.Dataset | datasets.DatasetDict
Load a Hugging Face dataset or dataset dictionary from disk.
This function wraps datasets.load_from_disk to accept either a string path or a Path object and returns the loaded dataset object.
- Parameters:
path (Union[str, Path]) – Path to the directory containing the saved dataset.
*args – Positional arguments forwarded to [datasets.load_from_disk](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_from_disk).
**kwargs – Keyword arguments forwarded to [datasets.load_from_disk](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_from_disk).
- Returns:
The loaded Hugging Face dataset object, which may be a single Dataset or a DatasetDict depending on what was saved on disk.
- Return type:
Union[datasets.Dataset, datasets.DatasetDict]
- load_infos_from_disk(path: str | pathlib.Path) -> dict[str, dict[str, str]]
Load dataset information from a YAML file stored on disk.
- load_problem_definition_from_disk(path: str | pathlib.Path, name: str | pathlib.Path) -> plaid.ProblemDefinition
Load a ProblemDefinition and its split information from disk.
- Parameters:
- Returns:
The loaded problem definition.
- Return type:
plaid.ProblemDefinition
- load_tree_struct_from_disk(path: str | pathlib.Path) -> tuple[dict[str, Any], dict[str, Any]]
Load a tree structure for a dataset from disk.
This function loads two components from the specified directory:
1. tree_constant_part.pkl: a pickled dictionary containing the constant parts of the tree.
2. key_mappings.yaml: a YAML file containing key mappings and metadata.
- binary_to_plaid_sample(hf_sample: dict[str, bytes]) -> plaid.Sample
Convert a Hugging Face dataset sample in binary format to a Plaid Sample.
The input hf_sample is expected to contain a pickled representation of a sample under the key “sample”. This function attempts to validate the unpickled sample as a Plaid Sample. If validation fails, it reconstructs the sample from its components (meshes, path, and optional scalars) before validating it.
- Parameters:
hf_sample (dict[str, bytes]) – A dictionary representing a Hugging Face sample, with the pickled sample stored under the key “sample”.
- Returns:
A validated Plaid Sample object.
- Return type:
plaid.Sample
- Raises:
KeyError – If required keys (“sample”, “meshes”, “path”) are missing and the sample cannot be reconstructed.
ValidationError – If the reconstructed sample still fails Plaid validation.
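The binary format stores one pickled object per row under the "sample" key. The sketch below illustrates that layout with a plain dictionary standing in for a PLAID Sample; unpack_binary_row is a hypothetical helper, the payload contents are invented, and the real function validates against the PLAID Sample model rather than checking keys by hand.

```python
import pickle
from typing import Any


def unpack_binary_row(hf_sample: dict[str, bytes]) -> dict[str, Any]:
    """Unpickle the payload stored under "sample" and check required parts.

    A dict stands in for a PLAID Sample here (assumption of this sketch).
    Only unpickle data from sources you trust: pickle can execute code.
    """
    payload = pickle.loads(hf_sample["sample"])  # KeyError if "sample" is missing
    for key in ("meshes", "path"):
        if key not in payload:
            raise KeyError(key)
    payload.setdefault("scalars", {})  # scalars are optional
    return payload


# Invented payload for illustration.
row = {"sample": pickle.dumps({"meshes": {0.0: "cgns-tree"}, "path": "sample_0"})}
restored = unpack_binary_row(row)
```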
- huggingface_dataset_to_plaid(ds: datasets.Dataset, ids: list[int] | None = None, processes_number: int = 1, large_dataset: bool = False, verbose: bool = True) -> tuple[plaid.Dataset | plaid.ProblemDefinition, plaid.ProblemDefinition]
Use this function to convert a Hugging Face dataset into a PLAID dataset.
A Hugging Face dataset can be read from disk or from the Hub. When loading from the Hub, pass split="all_samples" to get a Dataset rather than a DatasetDict. The usual loading options (caching, streaming, etc.) remain available.
- Parameters:
ds (datasets.Dataset) – the dataset in Hugging Face format to be converted
ids (list, optional) – The specific sample IDs to load from the dataset. Defaults to None.
processes_number (int, optional) – The number of processes used to generate the plaid dataset
large_dataset (bool) – if True, uses a variant where parallel workers do not each load the complete dataset. Default: False.
verbose (bool, optional) – if True, prints progress using tqdm
- Returns:
dataset (Dataset): the converted dataset.
problem_definition (ProblemDefinition): the problem definition generated from the Hugging Face dataset.
- Return type:
tuple[Dataset, ProblemDefinition]
Example
    from datasets import load_dataset, load_from_disk

    # From the Hub:
    dataset = load_dataset("chanel/dataset", split="all_samples")
    # Or from disk:
    dataset = load_from_disk("path/to/dir")
    plaid_dataset, plaid_problem = huggingface_dataset_to_plaid(dataset)
- huggingface_description_to_problem_definition(description: dict) -> plaid.ProblemDefinition
Converts a Hugging Face dataset description to a plaid problem definition.
- Parameters:
description (dict) – the description field of a Hugging Face dataset, containing the problem definition
- Returns:
the plaid problem definition initialized from the Hugging Face dataset description
- Return type:
problem_definition (ProblemDefinition)
- huggingface_description_to_infos(description: dict) -> dict[str, dict[str, str]]
Convert a Hugging Face dataset description dictionary to a PLAID infos dictionary.
Extracts the “legal” and “data_production” sections from the Hugging Face description and returns them in a format compatible with PLAID dataset infos.
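The extraction described above amounts to keeping two sections of the description. A minimal sketch, assuming the description is a plain dictionary and that a missing section is simply skipped; extract_infos is a hypothetical name and the field values are invented.

```python
from typing import Any


def extract_infos(description: dict[str, Any]) -> dict[str, Any]:
    """Keep only the sections that belong to the infos dictionary."""
    wanted = ("legal", "data_production")
    # Assumption: sections absent from the description are left out.
    return {key: description[key] for key in wanted if key in description}


description = {
    "legal": {"owner": "Example Org", "license": "cc-by-4.0"},
    "data_production": {"type": "simulation"},
    "task": "regression",  # not part of infos, so it is dropped
}
infos = extract_infos(description)
```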
- infer_hf_features_from_value(value: Any) -> datasets.Value | datasets.Sequence
Infer Hugging Face dataset feature type from a given value.
This function analyzes the input value and determines the appropriate Hugging Face feature type representation. It handles None values, scalars, and arrays/lists of various dimensions, mapping them to corresponding Hugging Face Value or Sequence types.
- Parameters:
value (Any) – The value to infer the feature type from. Can be None, scalar, list, tuple, or numpy array.
- Returns:
A Hugging Face feature type (Value or Sequence) that corresponds to the input value’s structure and data type.
- Return type:
datasets.Feature
- Raises:
Note
For scalar values, maps numpy dtypes to appropriate Hugging Face Value types: float types to “float32”, int32 to “int32”, int64 to “int64”, others to “string”
For arrays/lists, creates nested Sequence structures based on dimensionality: 1D → Sequence(base_type), 2D → Sequence(Sequence(base_type)), 3D → Sequence(Sequence(Sequence(base_type)))
All float values are enforced to “float32” to limit data size
All int64 values are preserved as “int64” to satisfy CGNS standards
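The nesting rule in the note (one Sequence level per array dimension) can be illustrated without the datasets library. In this sketch a dtype name stands in for Value and a {"sequence": ...} dictionary stands in for Sequence; infer_feature_spec is a hypothetical helper, it works on Python lists rather than numpy arrays, it does not cover None, and mapping every Python int to "int64" is an assumption.

```python
from typing import Any, Union

FeatureSpec = Union[str, dict]


def infer_feature_spec(value: Any) -> FeatureSpec:
    """Describe a value as a dtype name or nested {"sequence": ...} dicts."""
    if isinstance(value, (list, tuple)):
        if not value:
            raise ValueError("cannot infer the element type of an empty sequence")
        # One nesting level per dimension; the first element sets the type.
        return {"sequence": infer_feature_spec(value[0])}
    if isinstance(value, float):
        return "float32"  # floats are stored as float32 to limit data size
    if isinstance(value, int):
        return "int64"  # assumption: Python ints map to int64
    return "string"


scalar_spec = infer_feature_spec(1.5)
matrix_spec = infer_feature_spec([[1, 2], [3, 4]])
```

Here matrix_spec has two levels of nesting because the input is two-dimensional.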
- build_hf_sample(sample: plaid.Sample) -> tuple[dict[str, Any], list[str], dict[str, str]]
Flatten a PLAID Sample’s CGNS trees into Hugging Face–compatible arrays and metadata.
The function traverses every CGNS tree stored in sample.features.data (keyed by time), produces a flattened mapping path -> primitive value for each time, and then builds compact numpy arrays suitable for storage in a Hugging Face Dataset. Repeated value blocks that are identical across times are deduplicated and referenced by start/end indices; companion “<path>_times” arrays describe, per time, the slice indices into the concatenated arrays.
- Parameters:
sample (Sample) – A PLAID Sample whose features contain one or more CGNS trees (sample.features.data maps time -> CGNSTree).
- Returns:
hf_sample (dict[str, Any]): Mapping of flattened CGNS paths to either a numpy array (concatenation of per-time blocks) or None. For each path there is also an entry “<path>_times” containing a flattened numpy array of triplets [time, start, end] (end == -1 indicates the block extends to the end of the array).
all_paths (list[str]): Sorted list of all considered variable feature paths (excluding Time-related nodes and CGNSLibraryVersion).
sample_cgns_types (dict[str, str]): Mapping from path to CGNS node type (metadata produced by flatten_cgns_tree).
- Return type:
tuple[dict[str, Any], list[str], dict[str, str]]
Note
Byte-array encoded strings (dtype "|S1") are handled by reassembling and storing the string as a single-element numpy array; a sha256 hash is used for deduplication.
Deduplication reduces storage when identical blocks recur across times.
Paths containing “/Time” or “CGNSLibraryVersion” are ignored for variable features.
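The deduplication of per-time blocks can be sketched for a single path as follows. This is an illustration under simplifying assumptions, not the function's actual code: pack_time_blocks is a hypothetical helper, blocks are plain lists of floats rather than numpy arrays, and explicit end indices are used instead of the end == -1 convention.

```python
import hashlib
import struct


def pack_time_blocks(
    blocks_by_time: dict[float, list[float]],
) -> tuple[list[float], list[tuple[float, int, int]]]:
    """Concatenate per-time blocks, storing each distinct block once.

    Returns the concatenated values and one (time, start, end) triplet per
    time, where values[start:end] is the block for that time.
    """
    values: list[float] = []
    triplets: list[tuple[float, int, int]] = []
    seen: dict[str, tuple[int, int]] = {}  # block digest -> (start, end)
    for time in sorted(blocks_by_time):
        block = blocks_by_time[time]
        # Hash the raw float64 bytes of the block to detect repeats.
        digest = hashlib.sha256(struct.pack(f"{len(block)}d", *block)).hexdigest()
        if digest not in seen:
            start = len(values)
            values.extend(block)
            seen[digest] = (start, len(values))
        triplets.append((time, *seen[digest]))
    return values, triplets


# Times 0.0 and 0.5 share the same block, so it is stored only once.
values, triplets = pack_time_blocks(
    {0.0: [1.0, 2.0], 0.5: [1.0, 2.0], 1.0: [3.0, 4.0, 5.0]}
)
```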
- process_shard(generator_fn: Callable[..., Any], progress: Any, n_proc: int, shard_ids: list[plaid.types.IndexType] | None = None) -> tuple[set[str], dict[str, str], dict[str, datasets.Value | datasets.Sequence], dict[str, dict[str, str | bool | int]], int]
Process a single shard of sample ids and collect per-shard metadata.
This function drives a shard-level pass over samples produced by generator_fn. For each sample it:
- flattens the sample into Hugging Face friendly arrays (build_hf_sample),
- collects observed flattened paths,
- aggregates CGNS type metadata,
- infers Hugging Face feature types for each path,
- detects per-path constants using a content hash,
- updates progress (either a multiprocessing.Queue or a tqdm progress bar).
- Parameters:
shard_ids (list[IndexType]) – Sequence of sample ids (a single shard) to process.
generator_fn (Callable) – Generator function accepting a list of shard id sequences and yielding Sample objects for those ids.
progress (Any) – Progress reporter; either a multiprocessing.Queue (for parallel execution) or a tqdm progress bar object (for sequential execution).
n_proc (int) – Number of worker processes used by the caller (used to decide how to report progress).
- Returns:
split_all_paths (set[str]): Set of all flattened feature paths observed in the shard.
shard_global_cgns_types (dict[str, str]): Mapping path -> CGNS node type observed in the shard.
shard_global_feature_types (dict[str, Union[Value, Sequence]]): Inferred HF feature types per path.
split_constant_leaves (dict[str, dict]): Per-path metadata for constant detection. Each entry is a dict with keys “hash” (str), “constant” (bool) and “count” (int).
n_samples_processed (int): Number of samples processed in this shard.
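- Return type:
tuple[set[str], dict[str, str], dict[str, datasets.Value | datasets.Sequence], dict[str, dict[str, str | bool | int]], int]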
- Raises:
ValueError – If inconsistent feature types are detected for the same path within the shard.
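The per-path constant detection can be sketched with the same "hash", "constant" and "count" keys that split_constant_leaves uses. This is an illustration, not the function's code: update_constant_tracker is a hypothetical helper, and hashing the repr of a value is a stand-in for hashing the underlying array content.

```python
import hashlib
from typing import Any


def update_constant_tracker(
    tracker: dict[str, dict[str, Any]], flat_sample: dict[str, Any]
) -> None:
    """Fold one flattened sample into the per-path constant tracker."""
    for path, value in flat_sample.items():
        # Stand-in content hash (assumption): digest of the value's repr.
        digest = hashlib.sha256(repr(value).encode("utf-8")).hexdigest()
        entry = tracker.setdefault(
            path, {"hash": digest, "constant": True, "count": 0}
        )
        if entry["hash"] != digest:
            entry["constant"] = False  # a different value was seen for this path
        entry["count"] += 1  # number of samples that contained the path


tracker: dict[str, dict[str, Any]] = {}
for flat_sample in (
    {"Zone/Coordinates": [0.0, 1.0], "Zone/Pressure": [0.1]},
    {"Zone/Coordinates": [0.0, 1.0], "Zone/Pressure": [0.2]},
):
    update_constant_tracker(tracker, flat_sample)
```

After both samples, the coordinates path is still marked constant while the pressure path is not.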
- preprocess_splits(generators: dict[str, Callable], gen_kwargs: dict[str, dict[str, list[plaid.types.IndexType]]] | None = None, processes_number: int = 1, verbose: bool = True) -> tuple[dict[str, set[str]], dict[str, dict[str, Any]], dict[str, set[str]], dict[str, str], dict[str, datasets.Value | datasets.Sequence]]
Pre-process dataset splits: inspect samples to infer features, constants and CGNS metadata.
This function iterates over the provided split generators (optionally in parallel), flattens each PLAID sample into Hugging Face friendly arrays, detects constant CGNS leaves (features identical across all samples in a split), infers global Hugging Face feature types, and aggregates CGNS type metadata.
The work is sharded per-split and each shard is processed by process_shard. In parallel mode, progress is updated via a multiprocessing.Queue; otherwise a tqdm progress bar is used.
- Parameters:
generators (dict[str, Callable]) – Mapping from split name to a generator function. Each generator must accept a single argument (a sequence of shard ids) and yield PLAID samples.
gen_kwargs (dict[str, dict[str, list[IndexType]]]) – Per-split kwargs used to drive generator invocation (e.g. {“train”: {“shards_ids”: […]}}).
processes_number (int, optional) – Number of worker processes to use for shard-level parallelism. Defaults to 1.
verbose (bool, optional) – If True, displays progress bars. Defaults to True.
- Returns:
- split_all_paths (dict[str, set[str]]):
For each split, the set of all observed flattened feature paths (including “_times” keys).
- split_flat_cst (dict[str, dict[str, Any]]):
For each split, a mapping of constant feature path -> value (constant parts of the tree).
- split_var_path (dict[str, set[str]]):
For each split, the set of variable feature paths (non-constant).
- global_cgns_types (dict[str, str]):
Aggregated mapping from flattened path -> CGNS node type.
- global_feature_types (dict[str, Union[Value, Sequence]]):
Aggregated inferred Hugging Face feature types for each variable path.
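- Return type:
tuple[dict[str, set[str]], dict[str, dict[str, Any]], dict[str, set[str]], dict[str, str], dict[str, datasets.Value | datasets.Sequence]]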
- Raises:
ValueError – If inconsistent feature types or CGNS types are detected across shards/splits.
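One way to build a gen_kwargs mapping of the shape shown above is to cut each split's ids into contiguous shards. This is a sketch under assumptions: make_shards is a hypothetical helper and the near-equal contiguous split is one possible sharding policy, not necessarily the one the library expects.

```python
def make_shards(ids: list[int], n_shards: int) -> list[list[int]]:
    """Split ids into at most n_shards contiguous, near-equal shards."""
    n_shards = max(1, min(n_shards, len(ids)))
    base, extra = divmod(len(ids), n_shards)
    shards, start = [], 0
    for k in range(n_shards):
        # The first `extra` shards receive one additional id.
        stop = start + base + (1 if k < extra else 0)
        shards.append(ids[start:stop])
        start = stop
    return shards


main_splits = {"train": [0, 1, 2, 3, 4], "test": [5, 6]}
gen_kwargs = {
    split: {"shards_ids": make_shards(ids, n_shards=2)}
    for split, ids in main_splits.items()
}
```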
- plaid_dataset_to_huggingface_datasetdict(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], processes_number: int = 1, writer_batch_size: int = 1, verbose: bool = False) -> tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]]
Convert a PLAID dataset into a Hugging Face datasets.DatasetDict.
This is a thin wrapper that creates per-split generators from a PLAID dataset and delegates the actual dataset construction to plaid_generator_to_huggingface_datasetdict.
- Parameters:
dataset (plaid.Dataset) – The PLAID dataset to be converted. Must support indexing with a list of IDs (from main_splits).
main_splits (dict[str, IndexType]) – Mapping from split names (e.g. “train”, “test”) to the subset of sample indices belonging to that split.
processes_number (int, optional, default=1) – Number of parallel processes to use when writing the Hugging Face dataset.
writer_batch_size (int, optional, default=1) – Batch size used when writing samples to disk in Hugging Face format.
verbose (bool, optional, default=False) – If True, print progress and debug information.
- Returns:
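A tuple (ds_dict, flat_cst, key_mappings), as returned by plaid_generator_to_huggingface_datasetdict: the Hugging Face DatasetDict containing one dataset per split, the dictionary of constant features, and the key-mappings metadata.
- Return type:
tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]]
Example
>>> ds_dict, flat_cst, key_mappings = plaid_dataset_to_huggingface_datasetdict(
...     dataset=my_plaid_dataset,
...     main_splits={"train": [0, 1, 2], "test": [3]},
...     processes_number=4,
...     writer_batch_size=3
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({ features: ... }),
    test: Dataset({ features: ... })
})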
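Creating per-split generators from an indexable dataset, as this wrapper is described to do, might look like the sketch below. It is an assumption-laden illustration: make_split_generators is a hypothetical helper, a list of strings stands in for a PLAID dataset, and the generators take no arguments, whereas generators used with gen_kwargs accept shard ids.

```python
from collections.abc import Callable, Iterator, Sequence
from typing import Any


def make_split_generators(
    dataset: Sequence[Any], main_splits: dict[str, list[int]]
) -> dict[str, Callable[[], Iterator[Any]]]:
    """Build one zero-argument generator function per split."""

    def make(ids: list[int]) -> Callable[[], Iterator[Any]]:
        # A factory binds `ids` per split and avoids late-binding closures.
        def generator() -> Iterator[Any]:
            for i in ids:
                yield dataset[i]

        return generator

    return {name: make(ids) for name, ids in main_splits.items()}


toy_dataset = ["s0", "s1", "s2", "s3"]  # stand-in for a PLAID dataset
generators = make_split_generators(toy_dataset, {"train": [0, 1, 2], "test": [3]})
train_samples = list(generators["train"]())
```

Because samples are yielded lazily, only one sample per generator needs to be held in memory at a time.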
- plaid_generator_to_huggingface_datasetdict(generators: dict[str, Callable], gen_kwargs: dict[str, dict[str, list[plaid.types.IndexType]]] | None = None, processes_number: int = 1, writer_batch_size: int = 1, verbose: bool = False) -> tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]]
Convert PLAID dataset generators into a Hugging Face datasets.DatasetDict.
This function inspects samples produced by the given generators, flattens their CGNS tree structure, infers Hugging Face feature types, and builds one datasets.Dataset per split. Constant features (identical across all samples) are separated out from variable features.
- Parameters:
generators (dict[str, Callable]) – Mapping from split names (e.g., “train”, “test”) to generator functions. Each generator function must return an iterable of PLAID samples, where each sample provides sample.features.data[0.0] for flattening.
processes_number (int, optional, default=1) – Number of processes used internally by Hugging Face when materializing the dataset from the generators.
writer_batch_size (int, optional, default=1) – Batch size used when writing samples to disk in Hugging Face format.
gen_kwargs (dict, optional, default=None) – Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function, used for parallelization.
verbose (bool, optional, default=False) – If True, displays progress bars and diagnostic messages.
- Returns:
DatasetDict (datasets.DatasetDict): A Hugging Face dataset dictionary with one dataset per split.
flat_cst (dict[str, Any]): Dictionary of constant features detected across all splits.
key_mappings (dict[str, Any]): Metadata dictionary containing:
- “variable_features”: list of paths for non-constant features.
- “constant_features”: list of paths for constant features.
- “cgns_types”: inferred CGNS types for all features.
- Return type:
tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]]
Example
>>> ds_dict, flat_cst, key_mappings = plaid_generator_to_huggingface_datasetdict(
...     {"train": lambda: iter(train_samples),
...      "test": lambda: iter(test_samples)},
...     processes_number=4,
...     writer_batch_size=2,
...     verbose=True
... )
>>> print(ds_dict)
DatasetDict({
    train: Dataset({ features: ... }),
    test: Dataset({ features: ... })
})
>>> print(flat_cst)
{'Zone1/GridCoordinates': array([0., 0.1, 0.2])}
>>> print(key_mappings["variable_features"][:3])
['Zone1/FlowSolution/VelocityX', 'Zone1/FlowSolution/VelocityY', ...]
- push_dataset_dict_to_hub(repo_id: str, hf_dataset_dict: datasets.DatasetDict, **kwargs) -> None
Push a Hugging Face DatasetDict to the Hugging Face Hub.
This is a thin wrapper around datasets.DatasetDict.push_to_hub, allowing you to upload a dataset dictionary (with one or more splits such as “train”, “validation”, “test”) to the Hugging Face Hub.
Note
The function sets num_shards for each split automatically. Shards are sized to approximately 500 MB each, and the number of shards is capped at the number of samples in the split. This keeps chunks efficient while preventing excessive fragmentation. Empty splits raise an assertion error.
- Parameters:
repo_id (str) – The repository ID on the Hugging Face Hub (e.g. “username/dataset_name”).
hf_dataset_dict (datasets.DatasetDict) – The Hugging Face dataset dictionary to push.
**kwargs – Keyword arguments forwarded to [DatasetDict.push_to_hub](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).
- Returns:
None
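The shard-count rule in the note can be written out as a small calculation. This is a sketch of one reading of that rule, not the function's code: choose_num_shards is a hypothetical helper, the exact byte value used for "approx. 500 MB" is an assumption, and a ValueError is raised here where the documented function asserts.

```python
import math

# Assumption: "approx. 500 MB" is taken as 500 MiB for this illustration.
TARGET_SHARD_BYTES = 500 * 1024**2


def choose_num_shards(n_samples: int, split_size_bytes: int) -> int:
    """Pick a shard count of roughly 500 MB per shard, capped by sample count."""
    if n_samples <= 0:
        # The documented function asserts on empty splits instead.
        raise ValueError("empty splits are not supported")
    by_size = max(1, math.ceil(split_size_bytes / TARGET_SHARD_BYTES))
    return min(n_samples, by_size)


small_split = choose_num_shards(n_samples=10, split_size_bytes=100)
few_large_samples = choose_num_shards(n_samples=3, split_size_bytes=5 * TARGET_SHARD_BYTES)
```

The cap matters for splits with a few very large samples, since a shard cannot hold less than one sample.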
- push_infos_to_hub(repo_id: str, infos: dict[str, dict[str, str]]) -> None
Upload dataset infos to the Hugging Face Hub.
Serializes the infos dictionary to YAML and uploads it to the specified repository as infos.yaml.
- push_problem_definition_to_hub(repo_id: str, name: str, pb_def: plaid.ProblemDefinition) -> None
Upload a ProblemDefinition and its split information to the Hugging Face Hub.
- Parameters:
repo_id (str) – The repository ID on the Hugging Face Hub.
name (str) – The name of the problem_definition to store in the repo.
pb_def (ProblemDefinition) – The problem definition to upload.
- push_tree_struct_to_hub(repo_id: str, flat_cst: dict[str, Any], key_mappings: dict[str, Any]) -> None
Upload a dataset’s tree structure to a Hugging Face dataset repository.
This function pushes two components of a dataset tree structure to the specified Hugging Face Hub repository:
flat_cst: the constant parts of the dataset tree, serialized as a pickle file (tree_constant_part.pkl).
key_mappings: the dictionary of key mappings and metadata for the dataset tree, serialized as a YAML file (key_mappings.yaml).
Both files are uploaded using the Hugging Face HfApi().upload_file method.
- Parameters:
- Returns:
None
Note
Each upload includes a commit message indicating the filename.
This function is not covered by unit tests (pragma: no cover).
- save_dataset_dict_to_disk(path: str | pathlib.Path, hf_dataset_dict: datasets.DatasetDict, **kwargs) -> None
Save a Hugging Face DatasetDict to disk.
This function serializes the provided DatasetDict and writes it to the specified directory, preserving its features, splits, and data for later loading.
- Parameters:
path (Union[str, Path]) – Directory path where the DatasetDict will be saved.
hf_dataset_dict (datasets.DatasetDict) – The Hugging Face DatasetDict to save.
**kwargs – Keyword arguments forwarded to [DatasetDict.save_to_disk](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.save_to_disk).
- Returns:
None
- save_infos_to_disk(path: str | pathlib.Path, infos: dict[str, dict[str, str]]) -> None
Save dataset infos as a YAML file to disk.
- save_problem_definition_to_disk(path: str | pathlib.Path, name: str | pathlib.Path, pb_def: plaid.ProblemDefinition) -> None
Save a ProblemDefinition and its split information to disk.
- Parameters:
path (Union[str, Path]) – The root directory path for saving.
name (str) – The name of the problem_definition to store in the disk directory.
pb_def (ProblemDefinition) – The problem definition to save.
- save_tree_struct_to_disk(path: str | pathlib.Path, flat_cst: dict[str, Any], key_mappings: dict[str, Any]) -> None
Save the structure of a dataset tree to disk.
This function writes the constant part of the tree and its key mappings to files in the specified directory. The constant part is serialized as a pickle file, while the key mappings are saved in YAML format.
- plaid_dataset_to_huggingface_binary(dataset: plaid.Dataset, ids: list[plaid.types.IndexType] | None = None, split_name: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset
Use this function to convert a PLAID dataset into a Hugging Face dataset.
The dataset can then be saved to disk, or pushed to the Hugging Face hub.
- Parameters:
dataset (Dataset) – the plaid dataset to be converted to Hugging Face format
ids (list, optional) – The specific sample IDs to convert. Defaults to None.
split_name (str) – The name of the split. Default: “all_samples”.
processes_number (int) – The number of processes used to generate the Hugging Face dataset. Default: 1.
- Returns:
dataset in Hugging Face format
- Return type:
datasets.Dataset
Example
    dataset = plaid_dataset_to_huggingface_binary(dataset, split_name="all_samples")
    dataset.save_to_disk("path/to/dir")
    dataset.push_to_hub("chanel/dataset")
- plaid_generator_to_huggingface_binary(generator: Callable, split_name: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset
Use this function for creating a Hugging Face dataset from a sample generator function.
This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one.
- Parameters:
- Returns:
dataset in Hugging Face format
- Return type:
datasets.Dataset
Example
    dataset = plaid_generator_to_huggingface_binary(generator, split_name="all_samples")
- plaid_dataset_to_huggingface_datasetdict_binary(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], processes_number: int = 1) -> datasets.DatasetDict
Use this function to convert a PLAID dataset into a Hugging Face dataset dict.
The dataset can then be saved to disk, or pushed to the Hugging Face hub.
- Parameters:
- Returns:
dataset dict in Hugging Face format
- Return type:
datasets.DatasetDict
Example
    hf_dataset_dict = plaid_dataset_to_huggingface_datasetdict_binary(dataset, main_splits)
    hf_dataset_dict.save_to_disk("path/to/dir")
    hf_dataset_dict.push_to_hub("chanel/dataset")
- plaid_generator_to_huggingface_datasetdict_binary(generators: dict[str, Callable], processes_number: int = 1) -> datasets.DatasetDict
Use this function for creating a Hugging Face dataset dict (containing multiple splits) from a sample generator function.
This function can be used when the plaid dataset cannot be loaded in RAM all at once due to its size. The generator enables loading samples one by one. The dataset dict can then be saved to disk, or pushed to the Hugging Face hub.
Note
Only the first split will contain the description.
- Parameters:
- Returns:
dataset dict in Hugging Face format
- Return type:
datasets.DatasetDict
Example
    hf_dataset_dict = plaid_generator_to_huggingface_datasetdict_binary(generators)
    push_dataset_dict_to_hub("chanel/dataset", hf_dataset_dict)
    hf_dataset_dict.save_to_disk("path/to/dir")
- update_dataset_card(dataset_card: str, infos: dict[str, dict[str, str]] | None = None, pretty_name: str | None = None, dataset_long_description: str | None = None, illustration_urls: list[str] | None = None, arxiv_paper_urls: list[str] | None = None) -> str
Update a dataset card with PLAID-specific metadata and documentation.
- Parameters:
dataset_card (str) – The original dataset card content to update.
infos (dict[str, dict[str, str]]) – Dictionary containing dataset information with “legal” and “data_production” sections. Defaults to None.
pretty_name (str, optional) – A human-readable name for the dataset. Defaults to None.
dataset_long_description (str, optional) – Detailed description of the dataset’s content, purpose, and characteristics. Defaults to None.
illustration_urls (list[str], optional) – List of URLs to images illustrating the dataset. Defaults to None.
arxiv_paper_urls (list[str], optional) – List of URLs to related arXiv papers. Defaults to None.
- Returns:
The updated dataset card content as a string.
- Return type:
str