plaid.bridges.huggingface_bridge ================================ .. py:module:: plaid.bridges.huggingface_bridge .. autoapi-nested-parse:: Hugging Face bridge for PLAID datasets. Attributes ---------- .. autoapisummary:: plaid.bridges.huggingface_bridge.Self Functions --------- .. autoapisummary:: plaid.bridges.huggingface_bridge.to_plaid_dataset plaid.bridges.huggingface_bridge.to_plaid_sample plaid.bridges.huggingface_bridge.instantiate_plaid_datasetdict_from_hub plaid.bridges.huggingface_bridge.load_dataset_from_hub plaid.bridges.huggingface_bridge.load_infos_from_hub plaid.bridges.huggingface_bridge.load_problem_definition_from_hub plaid.bridges.huggingface_bridge.load_tree_struct_from_hub plaid.bridges.huggingface_bridge.load_dataset_from_disk plaid.bridges.huggingface_bridge.load_infos_from_disk plaid.bridges.huggingface_bridge.load_problem_definition_from_disk plaid.bridges.huggingface_bridge.load_tree_struct_from_disk plaid.bridges.huggingface_bridge.binary_to_plaid_sample plaid.bridges.huggingface_bridge.huggingface_dataset_to_plaid plaid.bridges.huggingface_bridge.huggingface_description_to_problem_definition plaid.bridges.huggingface_bridge.huggingface_description_to_infos plaid.bridges.huggingface_bridge.infer_hf_features_from_value plaid.bridges.huggingface_bridge.build_hf_sample plaid.bridges.huggingface_bridge.process_shard plaid.bridges.huggingface_bridge.preprocess_splits plaid.bridges.huggingface_bridge.plaid_dataset_to_huggingface_datasetdict plaid.bridges.huggingface_bridge.plaid_generator_to_huggingface_datasetdict plaid.bridges.huggingface_bridge.push_dataset_dict_to_hub plaid.bridges.huggingface_bridge.push_infos_to_hub plaid.bridges.huggingface_bridge.push_problem_definition_to_hub plaid.bridges.huggingface_bridge.push_tree_struct_to_hub plaid.bridges.huggingface_bridge.save_dataset_dict_to_disk plaid.bridges.huggingface_bridge.save_infos_to_disk plaid.bridges.huggingface_bridge.save_problem_definition_to_disk 
plaid.bridges.huggingface_bridge.save_tree_struct_to_disk plaid.bridges.huggingface_bridge.plaid_dataset_to_huggingface_binary plaid.bridges.huggingface_bridge.plaid_generator_to_huggingface_binary plaid.bridges.huggingface_bridge.plaid_dataset_to_huggingface_datasetdict_binary plaid.bridges.huggingface_bridge.plaid_generator_to_huggingface_datasetdict_binary plaid.bridges.huggingface_bridge.update_dataset_card Module Contents --------------- .. py:data:: Self .. py:function:: to_plaid_dataset(hf_dataset: datasets.Dataset, flat_cst: dict[str, Any], cgns_types: dict[str, str], enforce_shapes: bool = True) -> plaid.Dataset Convert a Hugging Face dataset into a PLAID dataset. Iterates over all samples in a Hugging Face `Dataset` and converts each one into a PLAID-compatible sample using `to_plaid_sample`. The resulting samples are then collected into a single PLAID `Dataset`. :param hf_dataset: The Hugging Face dataset split to convert. :type hf_dataset: datasets.Dataset :param flat_cst: Flattened representation of the CGNS tree structure constants. :type flat_cst: dict[str, Any] :param cgns_types: Mapping of CGNS paths to their expected types. :type cgns_types: dict[str, str] :param enforce_shapes: If True, ensures all arrays strictly follow the reference shapes. Defaults to True. :type enforce_shapes: bool, optional :returns: A PLAID `Dataset` object containing the converted samples. :rtype: Dataset .. py:function:: to_plaid_sample(ds: datasets.Dataset, i: int, flat_cst: dict[str, Any], cgns_types: dict[str, str], enforce_shapes: bool = True) -> plaid.Sample Convert a Hugging Face dataset row to a PLAID Sample object. Extracts a single row from a Hugging Face dataset and converts it into a PLAID Sample by unflattening the CGNS tree structure. Constant features from flat_cst are merged with the variable features from the row. :param ds: The Hugging Face dataset containing the sample data. :type ds: datasets.Dataset :param i: The index of the row to convert. 
:type i: int :param flat_cst: Dictionary of constant features to add to each sample. :type flat_cst: dict[str, Any] :param cgns_types: Dictionary mapping paths to CGNS types for reconstruction. :type cgns_types: dict[str, str] :param enforce_shapes: If True, ensures consistent array shapes during conversion. Defaults to True. :type enforce_shapes: bool, optional :returns: A validated PLAID Sample object reconstructed from the Hugging Face dataset row. :rtype: Sample .. note:: - Uses the dataset's pyarrow table data for efficient access. - Handles array shapes and types according to enforce_shapes. - Constant features from flat_cst are merged with the variable features from the row. .. py:function:: instantiate_plaid_datasetdict_from_hub(repo_id: str, enforce_shapes: bool = True) -> dict[str, plaid.Dataset] Load a Hugging Face dataset from the Hub and instantiate it as a dictionary of PLAID datasets. This function retrieves a dataset dictionary from the Hugging Face Hub, along with its associated CGNS tree structure and type information. Each split of the Hugging Face dataset is then converted into a PLAID dataset. :param repo_id: The Hugging Face repository identifier (e.g. `"user/dataset"`). :type repo_id: str :param enforce_shapes: If True, enforce strict array shapes when converting to PLAID datasets. Defaults to True. :type enforce_shapes: bool, optional :returns: A dictionary mapping split names (e.g. `"train"`, `"test"`) to PLAID `Dataset` objects. :rtype: dict[str, Dataset] .. py:function:: load_dataset_from_hub(repo_id: str, streaming: bool = False, *args, **kwargs) -> Union[datasets.Dataset, datasets.DatasetDict, datasets.IterableDataset, datasets.IterableDatasetDict] Loads a Hugging Face dataset from the public hub, a private mirror, or local cache, with automatic handling of streaming and download modes. Behavior: - If the environment variable `HF_ENDPOINT` is set, uses a private Hugging Face mirror. - Streaming is disabled. 
     - The dataset is downloaded locally via `snapshot_download` and loaded from disk.

   - If `HF_ENDPOINT` is not set, attempts to load from the public Hugging Face hub.

     - If the dataset is already cached locally, loads from disk.
     - Otherwise, loads from the hub, optionally using streaming mode.

   :param repo_id: The Hugging Face dataset repository ID (e.g., 'username/dataset').
   :type repo_id: str
   :param streaming: If True, attempts to stream the dataset (only supported on the public hub).
   :type streaming: bool, optional
   :param \*args: Positional arguments forwarded to [`datasets.load_dataset`](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset).
   :param \*\*kwargs: Keyword arguments forwarded to [`datasets.load_dataset`](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset).
   :returns: The loaded Hugging Face dataset object.
   :rtype: Union[datasets.Dataset, datasets.DatasetDict, datasets.IterableDataset, datasets.IterableDatasetDict]
   :raises Exception: Propagates any exceptions raised by `datasets.load_dataset`, `datasets.load_from_disk`, or `huggingface_hub.snapshot_download` if loading fails.

   .. note::

      - Streaming mode is not supported when using a private mirror.
      - If the dataset is found in the local cache, loads from disk instead of streaming.
      - To use behind a proxy or with a private mirror, you may need to set:

        - HF_ENDPOINT to your private mirror address
        - CURL_CA_BUNDLE to your trusted CA certificates
        - HF_HOME to a shared cache directory if needed

.. py:function:: load_infos_from_hub(repo_id: str) -> dict[str, dict[str, str]]

   Load dataset infos from the Hugging Face Hub.

   Downloads the infos.yaml file from the specified repository and parses it as a dictionary.

   :param repo_id: The repository ID on the Hugging Face Hub.
   :type repo_id: str
   :returns: Dictionary containing dataset infos.
   :rtype: dict[str, dict[str, str]]

.. 
py:function:: load_problem_definition_from_hub(repo_id: str, name: str) -> plaid.ProblemDefinition Load a ProblemDefinition from the Hugging Face Hub. Downloads the problem infos YAML and split JSON files from the specified repository and location, then initializes a ProblemDefinition object with this information. :param repo_id: The repository ID on the Hugging Face Hub. :type repo_id: str :param name: The name of the problem_definition stored in the repo. :type name: str :returns: The loaded problem definition. :rtype: ProblemDefinition .. py:function:: load_tree_struct_from_hub(repo_id: str) -> tuple[dict, dict] Load the tree structure metadata of a PLAID dataset from the Hugging Face Hub. This function retrieves two artifacts previously uploaded alongside a dataset: - **tree_constant_part.pkl**: a pickled dictionary of constant feature values (features that are identical across all samples). - **key_mappings.yaml**: a YAML file containing metadata about the dataset feature structure, including variable features, constant features, and CGNS types. :param repo_id: The repository ID on the Hugging Face Hub (e.g., `"username/dataset_name"`). :type repo_id: str :returns: - **flat_cst (dict)**: constant features dictionary (path → value). - **key_mappings (dict)**: metadata dictionary containing keys such as: - `"variable_features"`: list of paths for non-constant features. - `"constant_features"`: list of paths for constant features. - `"cgns_types"`: mapping from paths to CGNS types. :rtype: tuple[dict, dict] .. py:function:: load_dataset_from_disk(path: Union[str, pathlib.Path], *args, **kwargs) -> Union[datasets.Dataset, datasets.DatasetDict] Load a Hugging Face dataset or dataset dictionary from disk. This function wraps `datasets.load_from_disk` to accept either a string path or a `Path` object and returns the loaded dataset object. :param path: Path to the directory containing the saved dataset. 
:type path: Union[str, Path] :param \*args: Positional arguments forwarded to [`datasets.load_from_disk`](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_from_disk). :param \*\*kwargs: Keyword arguments forwarded to [`datasets.load_from_disk`](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_from_disk). :returns: The loaded Hugging Face dataset object, which may be a single `Dataset` or a `DatasetDict` depending on what was saved on disk. :rtype: Union[datasets.Dataset, datasets.DatasetDict] .. py:function:: load_infos_from_disk(path: Union[str, pathlib.Path]) -> dict[str, dict[str, str]] Load dataset information from a YAML file stored on disk. :param path: Directory path containing the `infos.yaml` file. :type path: Union[str, Path] :returns: Dictionary containing dataset infos. :rtype: dict[str, dict[str, str]] .. py:function:: load_problem_definition_from_disk(path: Union[str, pathlib.Path], name: Union[str, pathlib.Path]) -> plaid.ProblemDefinition Load a ProblemDefinition and its split information from disk. :param path: The root directory path for loading. :type path: Union[str, Path] :param name: The name of the problem_definition stored in the disk directory. :type name: str :returns: The loaded problem definition. :rtype: ProblemDefinition .. py:function:: load_tree_struct_from_disk(path: Union[str, pathlib.Path]) -> tuple[dict[str, Any], dict[str, Any]] Load a tree structure for a dataset from disk. This function loads two components from the specified directory: 1. `tree_constant_part.pkl`: a pickled dictionary containing the constant parts of the tree. 2. `key_mappings.yaml`: a YAML file containing key mappings and metadata. :param path: Directory path containing the `tree_constant_part.pkl` and `key_mappings.yaml` files. :type path: Union[str, Path] :returns: A tuple containing: - `flat_cst` (dict): Dictionary of constant tree values. 
     - `key_mappings` (dict): Dictionary of key mappings and metadata.
   :rtype: tuple[dict, dict]

.. py:function:: binary_to_plaid_sample(hf_sample: dict[str, bytes]) -> plaid.Sample

   Convert a Hugging Face dataset sample in binary format to a Plaid `Sample`.

   The input `hf_sample` is expected to contain a pickled representation of a sample under the key `"sample"`. This function attempts to validate the unpickled sample as a Plaid `Sample`. If validation fails, it reconstructs the sample from its components (`meshes`, `path`, and optional `scalars`) before validating it.

   :param hf_sample: A dictionary representing a Hugging Face sample, with the pickled sample stored under the key `"sample"`.
   :type hf_sample: dict[str, bytes]
   :returns: A validated Plaid `Sample` object.
   :rtype: Sample
   :raises KeyError: If required keys (`"sample"`, `"meshes"`, `"path"`) are missing and the sample cannot be reconstructed.
   :raises ValidationError: If the reconstructed sample still fails Plaid validation.

.. py:function:: huggingface_dataset_to_plaid(ds: datasets.Dataset, ids: Optional[list[int]] = None, processes_number: int = 1, large_dataset: bool = False, verbose: bool = True) -> tuple[Union[plaid.Dataset, plaid.ProblemDefinition], plaid.ProblemDefinition]

   Convert a Hugging Face dataset into a PLAID dataset.

   The Hugging Face dataset can be read from disk or from the Hub. When loading from the Hub, pass `split="all_samples"` to obtain a `Dataset` rather than a `DatasetDict`. The usual loading options (caching, streaming, etc.) remain available.

   :param ds: The dataset in Hugging Face format to be converted.
   :type ds: datasets.Dataset
   :param ids: The specific sample IDs to load from the dataset. Defaults to None.
   :type ids: list, optional
   :param processes_number: The number of processes used to generate the PLAID dataset.
   :type processes_number: int, optional
   :param large_dataset: If True, uses a variant in which the parallel workers do not each load the complete dataset. Default: False.
   :type large_dataset: bool
   :param verbose: If True, prints progress using tqdm.
   :type verbose: bool, optional
   :returns: - **dataset** (`Dataset`): the converted dataset.
             - **problem_definition** (`ProblemDefinition`): the problem definition generated from the Hugging Face dataset.
   :rtype: tuple

   .. rubric:: Example

   .. code-block:: python

      from datasets import load_dataset, load_from_disk
      from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid

      # From the Hub: split="all_samples" returns a Dataset, not a DatasetDict
      dataset = load_dataset("chanel/dataset", split="all_samples")
      # Alternatively, from a directory previously written with save_to_disk
      dataset = load_from_disk("path/to/dir")
      plaid_dataset, plaid_problem = huggingface_dataset_to_plaid(dataset)

.. py:function:: huggingface_description_to_problem_definition(description: dict) -> plaid.ProblemDefinition

   Convert a Hugging Face dataset description to a PLAID problem definition.

   :param description: The description field of a Hugging Face dataset, containing the problem definition.
   :type description: dict
   :returns: The PLAID problem definition initialized from the Hugging Face dataset description.
   :rtype: ProblemDefinition

.. py:function:: huggingface_description_to_infos(description: dict) -> dict[str, dict[str, str]]

   Convert a Hugging Face dataset description dictionary to a PLAID infos dictionary.

   Extracts the "legal" and "data_production" sections from the Hugging Face description and returns them in a format compatible with PLAID dataset infos.

   :param description: The Hugging Face dataset description dictionary.
   :type description: dict
   :returns: Dictionary containing "legal" and "data_production" infos if present.
   :rtype: dict[str, dict[str, str]]

.. py:function:: infer_hf_features_from_value(value: Any) -> Union[datasets.Value, datasets.Sequence]

   Infer Hugging Face dataset feature type from a given value.

   This function analyzes the input value and determines the appropriate Hugging Face feature type representation. It handles None values, scalars, and arrays/lists of various dimensions, mapping them to corresponding Hugging Face Value or Sequence types.

   :param value: The value to infer the feature type from.
Can be None, scalar, list, tuple, or numpy array. :type value: Any :returns: A Hugging Face feature type (Value or Sequence) that corresponds to the input value's structure and data type. :rtype: datasets.Feature :raises TypeError: If the value type is not supported. :raises TypeError: If the array dimensionality exceeds 3D for arrays/lists. .. note:: - For scalar values, maps numpy dtypes to appropriate Hugging Face Value types: float types to "float32", int32 to "int32", int64 to "int64", others to "string" - For arrays/lists, creates nested Sequence structures based on dimensionality: 1D → Sequence(base_type), 2D → Sequence(Sequence(base_type)), 3D → Sequence(Sequence(Sequence(base_type))) - All float values are enforced to "float32" to limit data size - All int64 values are preserved as "int64" to satisfy CGNS standards .. py:function:: build_hf_sample(sample: plaid.Sample) -> tuple[dict[str, Any], list[str], dict[str, str]] Flatten a PLAID Sample's CGNS trees into Hugging Face–compatible arrays and metadata. The function traverses every CGNS tree stored in sample.features.data (keyed by time), produces a flattened mapping path -> primitive value for each time, and then builds compact numpy arrays suitable for storage in a Hugging Face Dataset. Repeated value blocks that are identical across times are deduplicated and referenced by start/end indices; companion "_times" arrays describe, per time, the slice indices into the concatenated arrays. :param sample: A PLAID Sample whose features contain one or more CGNS trees (sample.features.data maps time -> CGNSTree). :type sample: Sample :returns: - hf_sample (dict[str, Any]): Mapping of flattened CGNS paths to either a numpy array (concatenation of per-time blocks) or None. For each path there is also an entry "_times" containing a flattened numpy array of triplets [time, start, end] (end == -1 indicates the block extends to the end of the array). 
- all_paths (list[str]): Sorted list of all considered variable feature paths (excluding Time-related nodes and CGNSLibraryVersion). - sample_cgns_types (dict[str, str]): Mapping from path to CGNS node type (metadata produced by flatten_cgns_tree). :rtype: tuple .. note:: - Byte-array encoded strings (dtype ``"|S1"``) are handled by reassembling and storing the string as a single-element numpy array; a sha256 hash is used for deduplication. - Deduplication reduces storage when identical blocks recur across times. - Paths containing "/Time" or "CGNSLibraryVersion" are ignored for variable features. .. py:function:: process_shard(generator_fn: Callable[Ellipsis, Any], progress: Any, n_proc: int, shard_ids: Optional[list[plaid.types.IndexType]] = None) -> tuple[set[str], dict[str, str], dict[str, Union[datasets.Value, datasets.Sequence]], dict[str, dict[str, Union[str, bool, int]]], int] Process a single shard of sample ids and collect per-shard metadata. This function drives a shard-level pass over samples produced by `generator_fn`. For each sample it: - flattens the sample into Hugging Face friendly arrays (build_hf_sample), - collects observed flattened paths, - aggregates CGNS type metadata, - infers Hugging Face feature types for each path, - detects per-path constants using a content hash, - updates progress (either a multiprocessing.Queue or a tqdm progress bar). :param shard_ids: Sequence of sample ids (a single shard) to process. :type shard_ids: list[IndexType] :param generator_fn: Generator function accepting a list of shard id sequences and yielding Sample objects for those ids. :type generator_fn: Callable :param progress: Progress reporter; either a multiprocessing.Queue (for parallel execution) or a tqdm progress bar object (for sequential execution). :type progress: Any :param n_proc: Number of worker processes used by the caller (used to decide how to report progress). 
:type n_proc: int :returns: - split_all_paths (set[str]): Set of all flattened feature paths observed in the shard. - shard_global_cgns_types (dict[str, str]): Mapping path -> CGNS node type observed in the shard. - shard_global_feature_types (dict[str, Union[Value, Sequence]]): Inferred HF feature types per path. - split_constant_leaves (dict[str, dict]): Per-path metadata for constant detection. Each entry is a dict with keys "hash" (str), "constant" (bool) and "count" (int). - n_samples_processed (int): Number of samples processed in this shard. :rtype: tuple :raises ValueError: If inconsistent feature types are detected for the same path within the shard. .. py:function:: preprocess_splits(generators: dict[str, Callable], gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, verbose: bool = True) -> tuple[dict[str, set[str]], dict[str, dict[str, Any]], dict[str, set[str]], dict[str, str], dict[str, Union[datasets.Value, datasets.Sequence]]] Pre-process dataset splits: inspect samples to infer features, constants and CGNS metadata. This function iterates over the provided split generators (optionally in parallel), flattens each PLAID sample into Hugging Face friendly arrays, detects constant CGNS leaves (features identical across all samples in a split), infers global Hugging Face feature types, and aggregates CGNS type metadata. The work is sharded per-split and each shard is processed by `process_shard`. In parallel mode, progress is updated via a multiprocessing.Queue; otherwise a tqdm progress bar is used. :param generators: Mapping from split name to a generator function. Each generator must accept a single argument (a sequence of shard ids) and yield PLAID samples. :type generators: dict[str, Callable] :param gen_kwargs: Per-split kwargs used to drive generator invocation (e.g. {"train": {"shards_ids": [...]}}). 
:type gen_kwargs: dict[str, dict[str, list[IndexType]]] :param processes_number: Number of worker processes to use for shard-level parallelism. Defaults to 1. :type processes_number: int, optional :param verbose: If True, displays progress bars. Defaults to True. :type verbose: bool, optional :returns: - split_all_paths (dict[str, set[str]]): For each split, the set of all observed flattened feature paths (including "_times" keys). - split_flat_cst (dict[str, dict[str, Any]]): For each split, a mapping of constant feature path -> value (constant parts of the tree). - split_var_path (dict[str, set[str]]): For each split, the set of variable feature paths (non-constant). - global_cgns_types (dict[str, str]): Aggregated mapping from flattened path -> CGNS node type. - global_feature_types (dict[str, Union[Value, Sequence]]): Aggregated inferred Hugging Face feature types for each variable path. :rtype: tuple :raises ValueError: If inconsistent feature types or CGNS types are detected across shards/splits. .. py:function:: plaid_dataset_to_huggingface_datasetdict(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], processes_number: int = 1, writer_batch_size: int = 1, verbose: bool = False) -> tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]] Convert a PLAID dataset into a Hugging Face `datasets.DatasetDict`. This is a thin wrapper that creates per-split generators from a PLAID dataset and delegates the actual dataset construction to `plaid_generator_to_huggingface_datasetdict`. :param dataset: The PLAID dataset to be converted. Must support indexing with a list of IDs (from `main_splits`). :type dataset: plaid.Dataset :param main_splits: Mapping from split names (e.g. "train", "test") to the subset of sample indices belonging to that split. :type main_splits: dict[str, IndexType] :param processes_number: Number of parallel processes to use when writing the Hugging Face dataset. 
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
   :type writer_batch_size: int, optional, default=1
   :param verbose: If True, print progress and debug information.
   :type verbose: bool, optional, default=False
   :returns: - **DatasetDict** (`datasets.DatasetDict`): A Hugging Face dataset dictionary with one dataset per split.
             - **flat_cst** (`dict[str, Any]`): Dictionary of constant features.
             - **key_mappings** (`dict[str, Any]`): Metadata dictionary describing the variable features, constant features, and CGNS types.
   :rtype: tuple

   .. rubric:: Example

   >>> ds_dict, flat_cst, key_mappings = plaid_dataset_to_huggingface_datasetdict(
   ...     dataset=my_plaid_dataset,
   ...     main_splits={"train": [0, 1, 2], "test": [3]},
   ...     processes_number=4,
   ...     writer_batch_size=3
   ... )
   >>> print(ds_dict)
   DatasetDict({
       train: Dataset({ features: ... }),
       test: Dataset({ features: ... })
   })

.. py:function:: plaid_generator_to_huggingface_datasetdict(generators: dict[str, Callable], gen_kwargs: Optional[dict[str, dict[str, list[plaid.types.IndexType]]]] = None, processes_number: int = 1, writer_batch_size: int = 1, verbose: bool = False) -> tuple[datasets.DatasetDict, dict[str, Any], dict[str, Any]]

   Convert PLAID dataset generators into a Hugging Face `datasets.DatasetDict`.

   This function inspects samples produced by the given generators, flattens their CGNS tree structure, infers Hugging Face feature types, and builds one `datasets.Dataset` per split. Constant features (identical across all samples) are separated out from variable features.

   :param generators: Mapping from split names (e.g., "train", "test") to generator functions. Each generator function must return an iterable of PLAID samples, where each sample provides `sample.features.data[0.0]` for flattening.
   :type generators: dict[str, Callable]
   :param processes_number: Number of processes used internally by Hugging Face when materializing the dataset from the generators.
   :type processes_number: int, optional, default=1
   :param writer_batch_size: Batch size used when writing samples to disk in Hugging Face format.
:type writer_batch_size: int, optional, default=1 :param gen_kwargs: Optional mapping from split names to dictionaries of keyword arguments to be passed to each generator function, used for parallelization. :type gen_kwargs: dict, optional, default=None :param verbose: If True, displays progress bars and diagnostic messages. :type verbose: bool, optional, default=False :returns: - **DatasetDict** (`datasets.DatasetDict`): A Hugging Face dataset dictionary with one dataset per split. - **flat_cst** (`dict[str, Any]`): Dictionary of constant features detected across all splits. - **key_mappings** (`dict[str, Any]`): Metadata dictionary containing: - `"variable_features"`: list of paths for non-constant features. - `"constant_features"`: list of paths for constant features. - `"cgns_types"`: inferred CGNS types for all features. :rtype: tuple .. rubric:: Example >>> ds_dict, flat_cst, key_mappings = plaid_generator_to_huggingface_datasetdict( ... {"train": lambda: iter(train_samples), ... "test": lambda: iter(test_samples)}, ... processes_number=4, ... writer_batch_size=2, ... verbose=True ... ) >>> print(ds_dict) DatasetDict({ train: Dataset({ features: ... }), test: Dataset({ features: ... }) }) >>> print(flat_cst) {'Zone1/GridCoordinates': array([0., 0.1, 0.2])} >>> print(key_mappings["variable_features"][:3]) ['Zone1/FlowSolution/VelocityX', 'Zone1/FlowSolution/VelocityY', ...] .. py:function:: push_dataset_dict_to_hub(repo_id: str, hf_dataset_dict: datasets.DatasetDict, **kwargs) -> None Push a Hugging Face `DatasetDict` to the Hugging Face Hub. This is a thin wrapper around `datasets.DatasetDict.push_to_hub`, allowing you to upload a dataset dictionary (with one or more splits such as `"train"`, `"validation"`, `"test"`) to the Hugging Face Hub. .. note:: The function automatically handles sharding of the dataset by setting `num_shards` for each split. 
      For each split, the number of shards is the minimum of the number of samples in that split and the number of shards needed to target approximately 500 MB per shard. This ensures efficient chunking while preventing excessive fragmentation. Empty splits will raise an assertion error.

   :param repo_id: The repository ID on the Hugging Face Hub (e.g. `"username/dataset_name"`).
   :type repo_id: str
   :param hf_dataset_dict: The Hugging Face dataset dictionary to push.
   :type hf_dataset_dict: datasets.DatasetDict
   :param \*\*kwargs: Keyword arguments forwarded to [`DatasetDict.push_to_hub`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).
   :returns: None

.. py:function:: push_infos_to_hub(repo_id: str, infos: dict[str, dict[str, str]]) -> None

   Upload dataset infos to the Hugging Face Hub.

   Serializes the infos dictionary to YAML and uploads it to the specified repository as infos.yaml.

   :param repo_id: The repository ID on the Hugging Face Hub.
   :type repo_id: str
   :param infos: Dictionary containing dataset infos to upload.
   :type infos: dict[str, dict[str, str]]
   :raises ValueError: If the infos dictionary is empty.

.. py:function:: push_problem_definition_to_hub(repo_id: str, name: str, pb_def: plaid.ProblemDefinition) -> None

   Upload a ProblemDefinition and its split information to the Hugging Face Hub.

   :param repo_id: The repository ID on the Hugging Face Hub.
   :type repo_id: str
   :param name: The name of the problem_definition to store in the repo.
   :type name: str
   :param pb_def: The problem definition to upload.
   :type pb_def: ProblemDefinition

.. py:function:: push_tree_struct_to_hub(repo_id: str, flat_cst: dict[str, Any], key_mappings: dict[str, Any]) -> None

   Upload a dataset's tree structure to a Hugging Face dataset repository.

   This function pushes two components of a dataset tree structure to the specified Hugging Face Hub repository:

   1.
`flat_cst`: the constant parts of the dataset tree, serialized as a pickle file (`tree_constant_part.pkl`). 2. `key_mappings`: the dictionary of key mappings and metadata for the dataset tree, serialized as a YAML file (`key_mappings.yaml`). Both files are uploaded using the Hugging Face `HfApi().upload_file` method. :param repo_id: The Hugging Face dataset repository ID where files will be uploaded. :type repo_id: str :param flat_cst: Dictionary containing constant values in the dataset tree. :type flat_cst: dict[str, Any] :param key_mappings: Dictionary containing key mappings and additional metadata. :type key_mappings: dict[str, Any] :returns: None .. note:: - Each upload includes a commit message indicating the filename. - This function is not covered by unit tests (`pragma: no cover`). .. py:function:: save_dataset_dict_to_disk(path: Union[str, pathlib.Path], hf_dataset_dict: datasets.DatasetDict, **kwargs) -> None Save a Hugging Face DatasetDict to disk. This function serializes the provided DatasetDict and writes it to the specified directory, preserving its features, splits, and data for later loading. :param path: Directory path where the DatasetDict will be saved. :type path: Union[str, Path] :param hf_dataset_dict: The Hugging Face DatasetDict to save. :type hf_dataset_dict: datasets.DatasetDict :param \*\*kwargs: Keyword arguments forwarded to [`DatasetDict.save_to_disk`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.save_to_disk). :returns: None .. py:function:: save_infos_to_disk(path: Union[str, pathlib.Path], infos: dict[str, dict[str, str]]) -> None Save dataset infos as a YAML file to disk. :param path: The directory path where the infos file will be saved. :type path: Union[str, Path] :param infos: Dictionary containing dataset infos. :type infos: dict[str, dict[str, str]] .. 
py:function:: save_problem_definition_to_disk(path: Union[str, pathlib.Path], name: Union[str, pathlib.Path], pb_def: plaid.ProblemDefinition) -> None

   Save a ProblemDefinition and its split information to disk.

   :param path: The root directory path for saving.
   :type path: Union[str, Path]
   :param name: The name of the problem_definition to store in the disk directory.
   :type name: str
   :param pb_def: The problem definition to save.
   :type pb_def: ProblemDefinition

.. py:function:: save_tree_struct_to_disk(path: Union[str, pathlib.Path], flat_cst: dict[str, Any], key_mappings: dict[str, Any]) -> None

   Save the structure of a dataset tree to disk.

   This function writes the constant part of the tree and its key mappings to files in the specified directory. The constant part is serialized as a pickle file, while the key mappings are saved in YAML format.

   :param path: Directory path where the tree structure files will be saved.
   :type path: Union[str, Path]
   :param flat_cst: Dictionary containing the constant part of the tree.
   :type flat_cst: dict
   :param key_mappings: Dictionary containing key mappings for the tree structure.
   :type key_mappings: dict
   :returns: None

.. py:function:: plaid_dataset_to_huggingface_binary(dataset: plaid.Dataset, ids: Optional[list[plaid.types.IndexType]] = None, split_name: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset

   Convert a PLAID dataset into a Hugging Face dataset.

   The dataset can then be saved to disk, or pushed to the Hugging Face hub.

   :param dataset: The PLAID dataset to be converted to Hugging Face format.
   :type dataset: Dataset
   :param ids: The specific sample IDs to convert. Defaults to None.
   :type ids: list, optional
   :param split_name: The name of the split. Default: "all_samples".
   :type split_name: str
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int
   :returns: The dataset in Hugging Face format.
   :rtype: datasets.Dataset

   .. rubric:: Example

   .. code-block:: python

      from plaid.bridges.huggingface_bridge import plaid_dataset_to_huggingface_binary

      # `dataset` is an existing plaid.Dataset
      hf_dataset = plaid_dataset_to_huggingface_binary(dataset, split_name="train")
      hf_dataset.save_to_disk("path/to/dir")
      hf_dataset.push_to_hub("channel/dataset")

.. py:function:: plaid_generator_to_huggingface_binary(generator: Callable, split_name: str = 'all_samples', processes_number: int = 1) -> datasets.Dataset

   Create a Hugging Face dataset from a sample generator function.

   Use this function when the PLAID dataset is too large to be loaded in RAM all at once. The generator loads the samples one by one.

   :param generator: A function yielding dicts of the form `{"sample": sample}`, where `sample` is of type `bytes`.
   :type generator: Callable
   :param split_name: The name of the split. Default: "all_samples".
   :type split_name: str
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int
   :returns: The dataset in Hugging Face format.
   :rtype: datasets.Dataset

   .. rubric:: Example

   .. code-block:: python

      # `generator` is a function yielding {"sample": sample} dicts
      hf_dataset = plaid_generator_to_huggingface_binary(generator, split_name="train")

.. py:function:: plaid_dataset_to_huggingface_datasetdict_binary(dataset: plaid.Dataset, main_splits: dict[str, plaid.types.IndexType], processes_number: int = 1) -> datasets.DatasetDict

   Convert a PLAID dataset into a Hugging Face dataset dict.

   The resulting dataset dict can then be saved to disk or pushed to the Hugging Face hub.

   :param dataset: The PLAID dataset to convert to Hugging Face format.
   :type dataset: Dataset
   :param main_splits: The main splits, keyed by split name, defining a partitioning of the sample ids.
   :type main_splits: dict[str, IndexType]
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int
   :returns: The dataset dict in Hugging Face format.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   .. code-block:: python

      # `dataset` is an existing plaid.Dataset; `main_splits` maps split names to sample ids
      hf_dataset_dict = plaid_dataset_to_huggingface_datasetdict_binary(dataset, main_splits)
      hf_dataset_dict.save_to_disk("path/to/dir")
      hf_dataset_dict.push_to_hub("channel/dataset")

.. py:function:: plaid_generator_to_huggingface_datasetdict_binary(generators: dict[str, Callable], processes_number: int = 1) -> datasets.DatasetDict

   Create a Hugging Face dataset dict (containing multiple splits) from sample generator functions, one per split.

   Use this function when the PLAID dataset is too large to be loaded in RAM all at once. Each generator loads its samples one by one. The dataset dict can then be saved to disk or pushed to the Hugging Face hub.

   .. note::

      Only the first split will contain the description.

   :param generators: A dict of functions, each yielding dicts of the form `{"sample": sample}`, where `sample` is of type `bytes`.
   :type generators: dict[str, Callable]
   :param processes_number: The number of processes used to generate the Hugging Face dataset. Default: 1.
   :type processes_number: int
   :returns: The dataset dict in Hugging Face format.
   :rtype: datasets.DatasetDict

   .. rubric:: Example

   .. code-block:: python

      # `generators` maps each split name to its generator function
      hf_dataset_dict = plaid_generator_to_huggingface_datasetdict_binary(generators)
      push_dataset_dict_to_hub("channel/dataset", hf_dataset_dict)
      hf_dataset_dict.save_to_disk("path/to/dir")

.. py:function:: update_dataset_card(dataset_card: str, infos: Optional[dict[str, dict[str, str]]] = None, pretty_name: Optional[str] = None, dataset_long_description: Optional[str] = None, illustration_urls: Optional[list[str]] = None, arxiv_paper_urls: Optional[list[str]] = None) -> str

   Update a dataset card with PLAID-specific metadata and documentation.

   :param dataset_card: The original dataset card content to update.
   :type dataset_card: str
   :param infos: Dictionary containing dataset information with "legal" and "data_production" sections. Defaults to None.
   :type infos: dict[str, dict[str, str]], optional
   :param pretty_name: A human-readable name for the dataset. Defaults to None.
   :type pretty_name: str, optional
   :param dataset_long_description: Detailed description of the dataset's content, purpose, and characteristics. Defaults to None.
   :type dataset_long_description: str, optional
   :param illustration_urls: List of URLs to images illustrating the dataset. Defaults to None.
   :type illustration_urls: list[str], optional
   :param arxiv_paper_urls: List of URLs to related arXiv papers. Defaults to None.
   :type arxiv_paper_urls: list[str], optional
   :returns: The updated dataset card content as a string.
   :rtype: str
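
The generator-based converters documented above (`plaid_generator_to_huggingface_binary` and `plaid_generator_to_huggingface_datasetdict_binary`) take functions that yield dicts of the form `{"sample": sample}` with `sample` of type `bytes`. The following is a minimal sketch of what such generator functions might look like. It assumes the bytes are produced with `pickle`, and it uses plain dicts as hypothetical stand-ins for PLAID samples; `FAKE_SAMPLES`, `MAIN_SPLITS`, and `make_generator` are illustrative names, not part of the PLAID API.

```python
import pickle
from typing import Callable, Iterator

# Hypothetical stand-ins for PLAID samples: plain dicts, for illustration only.
FAKE_SAMPLES = [{"id": i, "values": [float(i), float(i) ** 2]} for i in range(5)]

# Hypothetical partition of the sample ids into named splits.
MAIN_SPLITS = {"train": [0, 1, 2], "test": [3, 4]}


def make_generator(ids: list[int]) -> Callable[[], Iterator[dict[str, bytes]]]:
    """Return a zero-argument generator function over the given sample ids."""

    def generator() -> Iterator[dict[str, bytes]]:
        for i in ids:
            # Assumption: each sample is serialized to bytes with pickle.
            # Only one sample is materialized at a time.
            yield {"sample": pickle.dumps(FAKE_SAMPLES[i])}

    return generator


# One generator function per split, shaped like the
# `generators: dict[str, Callable]` argument described above.
generators = {name: make_generator(ids) for name, ids in MAIN_SPLITS.items()}

# Read back the first row of the "train" split to check the round trip.
first_row = next(generators["train"]())
restored = pickle.loads(first_row["sample"])
```

In real use, the loop body would load and serialize an actual PLAID sample instead of indexing into an in-memory list, so that the full dataset never has to reside in RAM.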