plaid.storage.cgns.reader

CGNS dataset reader module for PLAID.

This module provides functionality for reading and streaming CGNS datasets for the PLAID library. It includes utilities for loading datasets from local disk or streaming directly from Hugging Face Hub, with support for selective loading of splits and samples.

Key features:

  • Local dataset loading from disk via the CGNSDataset class
  • Streaming datasets from Hugging Face Hub
  • Selective loading of splits and sample IDs
  • Integration with PLAID Sample objects

plaid.storage.cgns.reader.CGNSDataset

CGNSDataset(path)

CGNS dataset class for local disk access.

This class represents a CGNS dataset stored on local disk, providing access to individual samples and associated metadata. It supports iteration over samples and attribute access to extra fields.

Initialize a CGNSDataset.

Parameters:

  • path (Union[str, Path]) –

    Path to the dataset directory.

Source code in plaid/storage/cgns/reader.py
def __init__(self, path: Union[str, Path]) -> None:
    """Initialize a :class:`CGNSDataset`.

    Args:
        path: Path to the dataset directory.
    """
    self.path = Path(path)

    ids = sorted(
        int(p.name.removeprefix("sample_"))
        for p in self.path.iterdir()
        if p.is_dir() and p.name.startswith("sample_")
    )
    self.ids = np.asarray(ids, dtype=int)
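
The ID discovery done by the constructor can be tried in isolation. The following is a minimal sketch using only the standard library: it returns a plain list instead of a NumPy array, and the helper name discover_sample_ids and the directory contents are hypothetical, chosen for illustration.

```python
import tempfile
from pathlib import Path


def discover_sample_ids(split_dir: Path) -> list[int]:
    """Return the sorted integer IDs of all `sample_*` subdirectories."""
    return sorted(
        int(p.name.removeprefix("sample_"))
        for p in split_dir.iterdir()
        if p.is_dir() and p.name.startswith("sample_")
    )


with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp)
    # Hypothetical layout: three sample folders, created out of order.
    for idx in (7, 0, 42):
        (split_dir / f"sample_{idx:09d}").mkdir()
    # Entries that the scan ignores: a directory without the prefix,
    # and a regular file that happens to carry the prefix.
    (split_dir / "notes").mkdir()
    (split_dir / "sample_000000001").touch()
    found_ids = discover_sample_ids(split_dir)

print(found_ids)  # [0, 7, 42]
```

Note that a directory whose suffix is not an integer (for example sample_abc) would make the int() conversion raise a ValueError, both in this sketch and in the constructor shown above.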

plaid.storage.cgns.reader.CGNSDataset.__iter__

__iter__()

Iterate over all samples in the dataset.

Yields:

  • Sample ( Sample ) –

    A PLAID Sample object for each sample in the dataset.

Source code in plaid/storage/cgns/reader.py
def __iter__(self) -> Iterator[Sample]:
    """Iterate over all samples in the dataset.

    Yields:
        Sample: A PLAID Sample object for each sample in the dataset.
    """
    for idx in self.ids:
        yield self[int(idx)]

plaid.storage.cgns.reader.CGNSDataset.__getitem__

__getitem__(idx)

Get a sample by index.

Parameters:

  • idx (int) –

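    Sample ID, i.e. the integer parsed from the sample_<id> directory name. It must be one of the dataset's ids; it is not a positional index.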

Returns:

  • Sample ( Sample ) –

    A PLAID Sample object.

Source code in plaid/storage/cgns/reader.py
def __getitem__(self, idx: int) -> Sample:
    """Get a sample by index.

    Args:
        idx: Sample index.

    Returns:
        Sample: A PLAID Sample object.
    """
    assert idx in self.ids
    return Sample.load_from_dir(self.path / f"sample_{idx:09d}")
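
Because the lookup checks membership in ids and then formats the ID into a directory name, the argument behaves as a sample ID rather than a position. The sketch below illustrates this with hypothetical, non-contiguous IDs. The helper names are illustrative, and the sketch raises KeyError where the reader shown above uses assert.

```python
def sample_dir_name(sample_id: int) -> str:
    """Build the directory name for a sample ID, zero-padded to nine digits."""
    return f"sample_{sample_id:09d}"


def resolve_sample_dir(sample_id: int, known_ids: list[int]) -> str:
    """Return the directory name for `sample_id`, or raise if the ID is unknown.

    Assumption: this sketch raises KeyError; the reader itself uses `assert`.
    """
    if sample_id not in known_ids:
        raise KeyError(f"unknown sample ID: {sample_id}")
    return sample_dir_name(sample_id)


known_ids = [0, 7, 42]  # hypothetical, non-contiguous IDs found on disk

print(resolve_sample_dir(7, known_ids))  # sample_000000007

try:
    resolve_sample_dir(1, known_ids)  # position 1 exists, but ID 1 does not
except KeyError as err:
    print("lookup failed:", err)
```

With contiguous IDs starting at 0 the two notions coincide, which is why the distinction only shows up for datasets with gaps in their sample numbering.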

plaid.storage.cgns.reader.CGNSDataset.__len__

__len__()

Get the number of samples in the dataset.

Returns:

  • int ( int ) –

    Number of samples.

Source code in plaid/storage/cgns/reader.py
def __len__(self) -> int:
    """Get the number of samples in the dataset.

    Returns:
        int: Number of samples.
    """
    return len(self.ids)

plaid.storage.cgns.reader.CGNSDataset.__repr__

__repr__()

String representation of the dataset.

Returns:

  • str ( str ) –

    String representation.

Source code in plaid/storage/cgns/reader.py
def __repr__(self) -> str:
    """String representation of the dataset.

    Returns:
        str: String representation.
    """
    return f"<CGNSDataset {repr(self.path)} | ids={self.ids}>"

plaid.storage.cgns.reader.sample_generator

sample_generator(repo_id, split, ids)

Generate Sample objects from a Hugging Face Hub repository.

This function downloads individual samples from a CGNS dataset stored on Hugging Face Hub and yields PLAID Sample objects. Each sample is downloaded into its own temporary directory and loaded as a Sample; the directory is deleted before the sample is yielded.

Parameters:

  • repo_id (str) –

    The Hugging Face repository ID (e.g., 'username/dataset-name').

  • split (str) –

    The dataset split name (e.g., 'train', 'test').

  • ids (Iterable[int]) –

    Iterable of sample IDs to generate.

Yields:

  • Sample ( Sample ) –

    A PLAID Sample object for each requested ID.

Source code in plaid/storage/cgns/reader.py
def sample_generator(
    repo_id: str, split: str, ids: Iterable[int]
) -> Iterator[Sample]:  # pragma: no cover
    """Generate Sample objects from a Hugging Face Hub repository.

    This function downloads individual samples from a CGNS dataset stored on Hugging Face Hub
    and yields PLAID Sample objects. Each sample is downloaded to a temporary directory
    and loaded as a Sample.

    Args:
        repo_id: The Hugging Face repository ID (e.g., 'username/dataset-name').
        split: The dataset split name (e.g., 'train', 'test').
        ids: Iterable of sample IDs to generate.

    Yields:
        Sample: A PLAID Sample object for each requested ID.
    """
    for idx in ids:
        with tempfile.TemporaryDirectory(prefix="plaid_sample_") as temp_folder:
            snapshot_download(
                repo_id=repo_id,
                repo_type="dataset",
                allow_patterns=[f"data/{split}/sample_{idx:09d}/"],
                local_dir=temp_folder,
            )
            sample = Sample.load_from_dir(
                Path(temp_folder) / "data" / f"{split}" / f"sample_{idx:09d}"
            )
        yield sample
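
The yield sits outside the with block, so each temporary directory is already gone when the caller receives the sample. The following standard-library sketch reproduces that lifetime pattern. Writing and reading a text file stands in for the download and for Sample.load_from_dir, and the function name load_via_tempdir is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def load_via_tempdir(ids: Iterable[int]) -> Iterator[tuple[str, Path]]:
    """Yield (payload, temp_dir) pairs; temp_dir no longer exists when yielded."""
    for idx in ids:
        with tempfile.TemporaryDirectory(prefix="plaid_sample_") as temp_folder:
            # Stand-in for the download step: create one file per sample.
            sample_file = Path(temp_folder) / f"sample_{idx:09d}.txt"
            sample_file.write_text(f"payload {idx}")
            # Stand-in for loading: read the content fully into memory.
            payload = sample_file.read_text()
        # The temporary directory has been deleted at this point.
        yield payload, Path(temp_folder)


results = list(load_via_tempdir([3, 5]))
for payload, temp_dir in results:
    print(payload, "| directory still exists:", temp_dir.exists())
```

The practical consequence is that whatever is yielded must not depend on files in that directory, which holds here because the payload is read completely before the with block exits.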

plaid.storage.cgns.reader.create_CGNS_iterable_dataset

create_CGNS_iterable_dataset(repo_id, split, ids)

Create an iterable dataset from CGNS samples on Hugging Face Hub.

This function creates a Hugging Face IterableDataset that streams CGNS samples from a repository. The dataset can be used for efficient streaming access without loading all samples into memory.

Parameters:

  • repo_id (str) –

    The Hugging Face repository ID (e.g., 'username/dataset-name').

  • split (str) –

    The dataset split name (e.g., 'train', 'test').

  • ids (Iterable[int]) –

    Iterable of sample IDs to include in the dataset.

Returns:

  • IterableDataset ( IterableDataset ) –

    A Hugging Face IterableDataset for streaming access.

Source code in plaid/storage/cgns/reader.py
def create_CGNS_iterable_dataset(
    repo_id: str, split: str, ids: Iterable[int]
) -> IterableDataset:  # pragma: no cover
    """Create an iterable dataset from CGNS samples on Hugging Face Hub.

    This function creates a Hugging Face IterableDataset that streams CGNS samples
    from a repository. The dataset can be used for efficient streaming access without
    loading all samples into memory.

    Args:
        repo_id: The Hugging Face repository ID (e.g., 'username/dataset-name').
        split: The dataset split name (e.g., 'train', 'test').
        ids: Iterable of sample IDs to include in the dataset.

    Returns:
        IterableDataset: A Hugging Face IterableDataset for streaming access.
    """
    return IterableDataset.from_generator(
        sample_generator,
        gen_kwargs={"repo_id": repo_id, "split": split, "ids": ids},
        split=NamedSplit(split),
        features=None,
    )

plaid.storage.cgns.reader.init_datasetdict_from_disk

init_datasetdict_from_disk(path)

Initialize a dataset dictionary from local disk.

This function scans a local directory structure and creates CGNSDataset objects for each split found in the data directory.

Parameters:

  • path (Union[str, Path]) –

    Path to the root directory containing the dataset. Should contain a 'data' subdirectory with split subdirectories.

Returns:

  • CGNSDatasetDict ( CGNSDatasetDict ) –

    Dictionary mapping split names to CGNSDataset objects.

Source code in plaid/storage/cgns/reader.py
def init_datasetdict_from_disk(
    path: Union[str, Path],
) -> CGNSDatasetDict:
    """Initialize a dataset dictionary from local disk.

    This function scans a local directory structure and creates CGNSDataset objects
    for each split found in the data directory.

    Args:
        path: Path to the root directory containing the dataset. Should contain a 'data' subdirectory
            with split subdirectories.

    Returns:
        CGNSDatasetDict: Dictionary mapping split names to CGNSDataset objects.
    """
    local_path = Path(path) / "data"
    split_names = [p.name for p in local_path.iterdir() if p.is_dir()]
    return {sn: CGNSDataset(local_path / sn) for sn in split_names}
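
The expected on-disk layout is root/data/<split>/sample_<id>. As a rough illustration, the sketch below builds such a tree in a temporary directory and maps each split to its sample IDs. It uses plain lists where the function above builds CGNSDataset objects, and the name scan_splits and the example tree are hypothetical.

```python
import tempfile
from pathlib import Path


def scan_splits(root: Path) -> dict[str, list[int]]:
    """Map each split directory under `root/data` to its sorted sample IDs."""
    data_dir = root / "data"
    splits: dict[str, list[int]] = {}
    for split_dir in data_dir.iterdir():
        if not split_dir.is_dir():
            continue  # only directories count as splits
        splits[split_dir.name] = sorted(
            int(p.name.removeprefix("sample_"))
            for p in split_dir.iterdir()
            if p.is_dir() and p.name.startswith("sample_")
        )
    return splits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical dataset: two training samples and one test sample.
    for split, ids in {"train": [0, 1], "test": [0]}.items():
        for idx in ids:
            (root / "data" / split / f"sample_{idx:09d}").mkdir(parents=True)
    (root / "data" / "README.txt").touch()  # a file, so not treated as a split
    layout = scan_splits(root)

print(layout == {"train": [0, 1], "test": [0]})  # True
```

Every subdirectory of data is treated as a split, so stray directories placed there would appear as extra, possibly empty, splits.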

plaid.storage.cgns.reader.download_datasetdict_from_hub

download_datasetdict_from_hub(
    repo_id,
    local_dir,
    split_ids=None,
    features=None,
    overwrite=False,
)

Download a CGNS dataset from Hugging Face Hub to local disk.

This function downloads a CGNS dataset, in whole or in part, from a Hugging Face repository to a local directory. Passing split_ids restricts the download to specific splits and samples.

Parameters:

  • repo_id (str) –

    The Hugging Face repository ID (e.g., 'username/dataset-name').

  • local_dir (Union[str, Path]) –

    Local directory path where the dataset will be downloaded.

  • split_ids (Optional[dict[str, Iterable[int]]], default: None ) –

    Optional dictionary mapping split names to iterables of sample IDs to download. If None, downloads all splits and samples.

  • features (Optional[list[str]], default: None ) –

    Optional list of features to download (currently unused).

  • overwrite (bool, default: False ) –

    If True, removes the existing local directory before downloading. If False and the directory exists and is not empty, a ValueError is raised.

Returns:

  • str ( str ) –

    Path to the local directory where the dataset has been downloaded.

Source code in plaid/storage/cgns/reader.py
def download_datasetdict_from_hub(
    repo_id: str,
    local_dir: Union[str, Path],
    split_ids: Optional[dict[str, Iterable[int]]] = None,
    features: Optional[list[str]] = None,  # noqa: ARG001
    overwrite: bool = False,
) -> str:  # pragma: no cover
    """Download a CGNS dataset from Hugging Face Hub to local disk.

    This function downloads selected parts or the entire CGNS dataset from a Hugging Face
    repository to a local directory. Supports selective downloading of specific splits and samples.

    Args:
        repo_id: The Hugging Face repository ID (e.g., 'username/dataset-name').
        local_dir: Local directory path where the dataset will be downloaded.
        split_ids: Optional dictionary mapping split names to iterables of sample IDs to download.
            If None, downloads all splits and samples.
        features: Optional list of features to download (currently unused).
        overwrite: If True, removes existing local directory before downloading.

    Returns:
        str: Path to the local directory where the dataset has been downloaded.
    """
    output_folder = Path(local_dir)

    if output_folder.is_dir():
        if overwrite:
            shutil.rmtree(local_dir)
            logger.warning(f"Existing {local_dir} directory has been reset.")
        elif any(output_folder.iterdir()):
            raise ValueError(
                f"directory {local_dir} already exists and is not empty. Set `overwrite` to True if needed."
            )

    if split_ids is not None:
        allow_patterns = []
        for split, ids in split_ids.items():
            allow_patterns.extend([f"data/{split}/sample_{i:09d}/*" for i in ids])
    else:
        allow_patterns = ["data/*"]

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=allow_patterns,
        local_dir=local_dir,
    )
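
The selective download comes down to the allow_patterns list passed to snapshot_download. The sketch below rebuilds that list and previews which paths it would select, assuming glob-style matching comparable to Python's fnmatch. The helper name build_allow_patterns and the repository file paths are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, Optional


def build_allow_patterns(
    split_ids: Optional[dict[str, Iterable[int]]] = None,
) -> list[str]:
    """Build download patterns: one per requested sample, or everything under data/."""
    if split_ids is None:
        return ["data/*"]
    patterns: list[str] = []
    for split, ids in split_ids.items():
        patterns.extend(f"data/{split}/sample_{i:09d}/*" for i in ids)
    return patterns


patterns = build_allow_patterns({"train": [0, 3], "test": [1]})

# Hypothetical repository contents, used only to preview the selection.
repo_files = [
    "data/train/sample_000000000/meshes/mesh.cgns",
    "data/train/sample_000000001/meshes/mesh.cgns",
    "data/test/sample_000000001/scalars.csv",
    "README.md",
]
selected = [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]

print(patterns)
print(selected)
```

Since one pattern is generated per sample ID, the list grows linearly with the number of requested samples.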

plaid.storage.cgns.reader.init_datasetdict_streaming_from_hub

init_datasetdict_streaming_from_hub(
    repo_id, split_ids=None, features=None
)

Initialize streaming datasets from Hugging Face Hub.

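This function creates a dictionary of streaming IterableDataset objects for CGNS data stored on Hugging Face Hub, with optional selection of specific splits and samples. It raises a RuntimeError if the HF_ENDPOINT environment variable is set to a non-empty value, since streaming mode is not compatible with a private mirror.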

Parameters:

  • repo_id (str) –

    The Hugging Face repository ID (e.g., 'username/dataset-name').

  • split_ids (Optional[dict[str, Iterable[int]]], default: None ) –

    Optional dictionary mapping split names to iterables of sample IDs to stream. If None, the per-split sample counts are read from the repository's dataset infos and IDs 0 to n_samples - 1 are streamed for each split.

  • features (Optional[list[str]], default: None ) –

    Optional list of features to stream (currently unused).

Returns:

  • dict[str, IterableDataset] –

    Dictionary mapping split names to IterableDataset objects for streaming access.

Source code in plaid/storage/cgns/reader.py
def init_datasetdict_streaming_from_hub(
    repo_id: str,
    split_ids: Optional[dict[str, Iterable[int]]] = None,
    features: Optional[list[str]] = None,  # noqa: ARG001
) -> dict[str, IterableDataset]:  # pragma: no cover
    """Initialize streaming datasets from Hugging Face Hub.

    This function creates a dictionary of streaming IterableDataset objects for CGNS data
    stored on Hugging Face Hub. Supports selective streaming of specific splits and samples.

    Args:
        repo_id: The Hugging Face repository ID (e.g., 'username/dataset-name').
        split_ids: Optional dictionary mapping split names to iterables of sample IDs to stream.
            If None, streams all available samples for each split.
        features: Optional list of features to stream (currently unused).

    Returns:
        dict[str, IterableDataset]: Dictionary mapping split names to IterableDataset objects
            for streaming access.
    """
    hf_endpoint = os.getenv("HF_ENDPOINT", "").strip()
    if hf_endpoint:
        raise RuntimeError("Streaming mode not compatible with private mirror.")

    if split_ids is not None:
        selected_ids = split_ids
    else:
        infos = load_infos_from_hub(repo_id=repo_id)
        selected_ids = {
            split: range(n_samples) for split, n_samples in infos.num_samples.items()
        }

    return {
        split: create_CGNS_iterable_dataset(repo_id, split, ids)
        for split, ids in selected_ids.items()
    }
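
The ID selection and the mirror check can be separated from the Hub calls. The sketch below mirrors that logic under stated assumptions: the per-split sample counts are passed in as a plain dict instead of being fetched with load_infos_from_hub, the environment is injectable so the check can be exercised without touching os.environ, and the name select_streaming_ids is hypothetical.

```python
import os
from typing import Iterable, Mapping, Optional


def select_streaming_ids(
    split_ids: Optional[dict[str, Iterable[int]]],
    num_samples: Mapping[str, int],
    env: Optional[Mapping[str, str]] = None,
) -> dict[str, Iterable[int]]:
    """Return the sample IDs to stream for each split.

    `num_samples` stands in for the counts stored in the dataset infos.
    """
    env = os.environ if env is None else env
    if env.get("HF_ENDPOINT", "").strip():
        raise RuntimeError("Streaming mode not compatible with private mirror.")
    if split_ids is not None:
        return dict(split_ids)
    # Default: every ID from 0 to n_samples - 1, for every split.
    return {split: range(n) for split, n in num_samples.items()}


counts = {"train": 3, "test": 2}  # hypothetical sample counts

default_ids = select_streaming_ids(None, counts, env={})
print({split: list(ids) for split, ids in default_ids.items()})
# {'train': [0, 1, 2], 'test': [0, 1]}

explicit_ids = select_streaming_ids({"train": [2, 0]}, counts, env={})
print(explicit_ids)  # {'train': [2, 0]}
```

The default assumes that sample IDs are contiguous and start at 0. A repository with gaps in its numbering would need an explicit split_ids mapping.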