Skip to content

plaid.storage.common.reader

plaid.storage.common.reader

Common storage reader utilities.

This module provides common utilities for reading dataset metadata, problem definitions, and other auxiliary files from disk or downloading them from Hugging Face Hub.

plaid.storage.common.reader.load_infos_from_disk

load_infos_from_disk(path)

Load dataset information from a YAML file stored on disk.

Parameters:

  • path (Union[str, Path]) –

    Directory path containing the infos.yaml file.

Returns:

  • Infos ( Infos ) –

    Validated dataset infos object.

Source code in plaid/storage/common/reader.py
def load_infos_from_disk(path: Union[str, Path]) -> Infos:
    """Load dataset information from a YAML file stored on disk.

    Args:
        path (Union[str, Path]): Directory path containing the `infos.yaml` file.

    Returns:
        Infos: Validated dataset infos object.
    """
    return Infos.from_path(Path(path) / "infos.yaml")

plaid.storage.common.reader.load_problem_definitions_from_disk

load_problem_definitions_from_disk(path)

Load ProblemDefinitions from a local dataset directory.

This function reads all serialized ProblemDefinition files located in the problem_definitions/ subdirectory under path and reconstructs them into ProblemDefinition objects.

Each file is loaded using ProblemDefinition.from_path and inserted into a dictionary keyed by the YAML filename stem.

Expected local layout

/ problem_definitions/ ...

Parameters:

  • path (Union[str, Path]) –

    Root dataset directory containing the problem_definitions/ folder.

Returns:

  • dict[str, ProblemDefinition] | None

    dict[str, ProblemDefinition] | None: Mapping from problem definition filename stems to loaded ProblemDefinition objects.

Source code in plaid/storage/common/reader.py
def load_problem_definitions_from_disk(
    path: Union[str, Path],
) -> dict[str, ProblemDefinition] | None:
    """Load ProblemDefinitions from a local dataset directory.

    This function reads all serialized ``ProblemDefinition`` files located in the
    ``problem_definitions/`` subdirectory under ``path`` and reconstructs them
    into ``ProblemDefinition`` objects.

    Each file is loaded using ``ProblemDefinition.from_path`` and inserted into
    a dictionary keyed by the YAML filename stem.

    Expected local layout:
        <path>/
            problem_definitions/
                <problem_name_1>
                <problem_name_2>
                ...

    Args:
        path (Union[str, Path]):
            Root dataset directory containing the ``problem_definitions/`` folder.

    Returns:
        dict[str, ProblemDefinition] | None:
            Mapping from problem definition filename stems to loaded ``ProblemDefinition``
            objects.
    """
    pb_def_dir = Path(path).absolute()
    if pb_def_dir.name != "problem_definitions":
        pb_def_dir /= Path("problem_definitions")

    if pb_def_dir.is_dir():
        pb_defs = {}
        for p in pb_def_dir.iterdir():
            if p.is_file():
                pb_def_path = pb_def_dir / Path(p.name)
                try:
                    pb_def = ProblemDefinition.from_path(pb_def_path)
                except Exception as exc:
                    raise ValueError(
                        f"Failed to load problem definition file "
                        f"'{pb_def_path.name}': {exc}"
                    ) from exc
                pb_defs[p.stem] = pb_def
        return pb_defs

plaid.storage.common.reader.load_constants_from_disk

load_constants_from_disk(path)

Load constant features stored under a dataset's "constants" directory.

The function expects the following layout under /constants/. One folder per split (e.g. "train", "test", ...) each containing: - layout.json : mapping constant_name -> {'offset': int, 'shape': [..]} or None - constant_schema.yaml : YAML describing dtype for each constant (dtype string or "string") - data.mmap : raw bytes memory-mapped file containing packed constant data

Parameters:

  • path (str | Path) –

    Root dataset directory that contains the "constants" folder.

Returns:

  • tuple ( tuple[dict[str, dict[str, Any]], dict[str, dict[str, Any]]] ) –

    A 2-tuple (flat_cst, constant_schema) where flat_cst and constant_schema are both dict[str, dict[str, Any]]. Numeric constants are returned as np.memmap arrays backed by data.mmap. String constants are returned as one-element numpy arrays of Python strings decoded using ASCII. If a layout entry is None, the returned value is None.

Raises:

  • FileNotFoundError

    If the expected "constants" directory or required files are missing.

Source code in plaid/storage/common/reader.py
def load_constants_from_disk(
    path: Union[str, Path],
) -> tuple[dict[str, dict[str, Any]], dict[str, dict[str, Any]]]:
    """Load constant features stored under a dataset's "constants" directory.

    The function expects the following layout under <path>/constants/. One folder per split (e.g. "train", "test", ...)
    each containing:
    - layout.json            : mapping constant_name -> {'offset': int, 'shape': [..]} or None
    - constant_schema.yaml   : YAML describing dtype for each constant (dtype string or "string")
    - data.mmap              : raw bytes memory-mapped file containing packed constant data

    Args:
        path (str | Path): Root dataset directory that contains the "constants" folder.

    Returns:
        tuple: A 2-tuple ``(flat_cst, constant_schema)`` where ``flat_cst`` and
            ``constant_schema`` are both ``dict[str, dict[str, Any]]``. Numeric
            constants are returned as ``np.memmap`` arrays backed by ``data.mmap``.
            String constants are returned as one-element numpy arrays of Python
            strings decoded using ASCII. If a layout entry is ``None``, the
            returned value is ``None``.

    Raises:
        FileNotFoundError: If the expected "constants" directory or required files are missing.
    """
    cst_path = Path(path) / "constants"

    folders = [p for p in cst_path.iterdir() if p.is_dir()]
    splits = [folder.name for folder in folders]

    flat_cst = {}
    constant_schema = {}

    for folder, split in zip(folders, splits):
        with open(folder / "layout.json", "r", encoding="utf-8") as f:
            layout = json.load(f)

        with open(folder / "constant_schema.yaml", "r", encoding="utf-8") as f:
            constant_schema[split] = yaml.safe_load(f)

        flat_cst[split] = {}

        for key, spec in constant_schema[split].items():
            entry = layout[key]

            if entry is None:
                flat_cst[split][key] = None
                continue

            offset = entry["offset"]
            shape = tuple(entry["shape"])
            dtype = np.dtype(entry["dtype"])

            # -----------------
            # STRING CASE
            # -----------------
            if spec["dtype"] == "string":
                nbytes = int(np.prod(shape))
                with open(folder / "data.mmap", "rb") as f:
                    f.seek(offset)
                    raw = f.read(nbytes)

                flat_cst[split][key] = np.array([raw.decode("ascii", "strict")])

            # -----------------
            # NUMERIC CASE
            # -----------------
            else:
                flat_cst[split][key] = np.memmap(
                    folder / "data.mmap",
                    mode="r",
                    dtype=dtype,
                    offset=offset,
                    shape=shape,
                    order="C",
                )

    return flat_cst, constant_schema

plaid.storage.common.reader.load_metadata_from_disk

load_metadata_from_disk(path)

Load dataset metadata from disk.

Parameters:

  • path (Union[str, Path]) –

    Directory path containing the metadata files.

Returns:

  • tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]

    tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]: - flat_cst: constant features dictionary (numeric constants kept as file-backed np.memmap) - variable_schema: variable schema dictionary - constant_schema: constant schema dictionary - cgns_types: CGNS types dictionary

Source code in plaid/storage/common/reader.py
def load_metadata_from_disk(
    path: Union[str, Path],
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Load dataset metadata from disk.

    Args:
        path (Union[str, Path]): Directory path containing the metadata files.

    Returns:
        tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]:
            - flat_cst: constant features dictionary (numeric constants kept as
              file-backed ``np.memmap``)
            - variable_schema: variable schema dictionary
            - constant_schema: constant schema dictionary
            - cgns_types: CGNS types dictionary
    """
    path = Path(path)

    flat_cst, constant_schema = load_constants_from_disk(path)

    with open(path / "variable_schema.yaml", "r", encoding="utf-8") as f:
        variable_schema = yaml.safe_load(f)

    with open(path / "cgns_types.yaml", "r", encoding="utf-8") as f:
        cgns_types = yaml.safe_load(f)

    return flat_cst, variable_schema, constant_schema, cgns_types

plaid.storage.common.reader.load_infos_from_hub

load_infos_from_hub(repo_id)

Load dataset infos from the Hugging Face Hub.

Downloads the infos.yaml file from the specified repository and parses it into a validated :class:Infos.

Parameters:

  • repo_id (str) –

    The repository ID on the Hugging Face Hub.

Returns:

  • Infos ( Infos ) –

    Validated dataset infos object.

Source code in plaid/storage/common/reader.py
def load_infos_from_hub(
    repo_id: str,
) -> Infos:  # pragma: no cover
    """Load dataset infos from the Hugging Face Hub.

    Downloads the infos.yaml file from the specified repository and parses it
    into a validated :class:`Infos`.

    Args:
        repo_id (str): The repository ID on the Hugging Face Hub.

    Returns:
        Infos: Validated dataset infos object.
    """
    yaml_path = hf_hub_download(
        repo_id=repo_id, filename="infos.yaml", repo_type="dataset"
    )
    return Infos.from_path(yaml_path)

plaid.storage.common.reader.load_problem_definitions_from_hub

load_problem_definitions_from_hub(repo_id)

Load ProblemDefinitions from Hugging Face Hub.

Parameters:

  • repo_id (str) –

    The repository ID on the Hugging Face Hub.

Returns:

  • Optional[dict[str, ProblemDefinition]]

    Optional[list[ProblemDefinition]]: List of loaded problem definitions, or None if not found.

Source code in plaid/storage/common/reader.py
def load_problem_definitions_from_hub(
    repo_id: str,
) -> Optional[dict[str, ProblemDefinition]]:  # pragma: no cover
    """Load ProblemDefinitions from Hugging Face Hub.

    Args:
        repo_id (str): The repository ID on the Hugging Face Hub.

    Returns:
        Optional[list[ProblemDefinition]]: List of loaded problem definitions, or None if not found.
    """
    with tempfile.TemporaryDirectory(prefix="pb_def_") as temp_folder:
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            allow_patterns=["problem_definitions/"],
            local_dir=temp_folder,
        )
        pb_defs = load_problem_definitions_from_disk(temp_folder)
    return pb_defs

plaid.storage.common.reader.load_metadata_from_hub

load_metadata_from_hub(repo_id)

Load dataset metadata from Hugging Face Hub.

Parameters:

  • repo_id (str) –

    The repository ID on the Hugging Face Hub.

Returns:

  • tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]

    tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]: - flat_cst: constant features dictionary (numeric constants are materialized to in-memory arrays) - variable_schema: variable schema dictionary - constant_schema: constant schema dictionary - cgns_types: CGNS types dictionary

Source code in plaid/storage/common/reader.py
def load_metadata_from_hub(
    repo_id: str,
) -> tuple[
    dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]
]:  # pragma: no cover
    """Load dataset metadata from Hugging Face Hub.

    Args:
        repo_id (str): The repository ID on the Hugging Face Hub.

    Returns:
        tuple[dict[str, Any], dict[str, Any], dict[str, Any], dict[str, Any]]:
            - flat_cst: constant features dictionary (numeric constants are
              materialized to in-memory arrays)
            - variable_schema: variable schema dictionary
            - constant_schema: constant schema dictionary
            - cgns_types: CGNS types dictionary
    """
    # constant part of the tree
    with tempfile.TemporaryDirectory(prefix="constants_") as temp_folder:
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            allow_patterns=["constants/"],
            local_dir=temp_folder,
        )
        flat_cst, constant_schema = load_constants_from_disk(temp_folder)
        # Hub metadata is downloaded under a temporary directory: materialize
        # memmaps so returned constants remain valid after temp cleanup.
        flat_cst = _materialize_memmaps(flat_cst)

    # variable_schema
    yaml_path = hf_hub_download(
        repo_id=repo_id,
        filename="variable_schema.yaml",
        repo_type="dataset",
    )
    with open(yaml_path, "r", encoding="utf-8") as f:
        variable_schema = yaml.safe_load(f)

    # cgns_types
    yaml_path = hf_hub_download(
        repo_id=repo_id,
        filename="cgns_types.yaml",
        repo_type="dataset",
    )
    with open(yaml_path, "r", encoding="utf-8") as f:
        cgns_types = yaml.safe_load(f)

    return flat_cst, variable_schema, constant_schema, cgns_types