`plaid.cli.plaidcheck`¶

CLI tool to validate integrity of a PLAID dataset stored on disk.

plaid.cli.plaidcheck.CheckMessage `dataclass` ¶

CheckMessage(severity, code, location, message)

One integrity check message.

Parameters:

severity (str) –

Message severity (error, warning, or info).
code (str) –

Stable message code identifier.
location (str) –

Path-like location string related to the issue.
message (str) –

Human-readable message.

plaid.cli.plaidcheck.CheckReport `dataclass` ¶

CheckReport(messages)

Container for check results and summary helpers.

Parameters:

messages (list[CheckMessage]) –

Integrity check messages collected during validation.

plaid.cli.plaidcheck.CheckReport.add ¶

add(severity, code, location, message)

Append a new message to the report.

Parameters:

severity (str) –

Message severity (error, warning, or info).
code (str) –

Stable message code identifier.
location (str) –

Path-like location string related to the issue.
message (str) –

Human-readable message.

Source code in plaid/cli/plaidcheck.py

def add(self, severity: str, code: str, location: str, message: str) -> None:
    """Append a new message to the report.

    Args:
        severity: Message severity (`error`, `warning`, or `info`).
        code: Stable message code identifier.
        location: Path-like location string related to the issue.
        message: Human-readable message.
    """
    self.messages.append(
        CheckMessage(
            severity=severity,
            code=code,
            location=location,
            message=message,
        )
    )

plaid.cli.plaidcheck.CheckReport.counts ¶

counts()

Return counts by severity.

Returns:

dict[str, int] –

Mapping from severity names to message counts.
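As a minimal standalone sketch of this counting behaviour (the `Message` class and the free function `counts` below are hypothetical stand-ins that mirror the listed source, not imports from `plaid`): the result always has the three keys `error`, `warning`, and `info`, and a message with any other severity string is not counted.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for CheckMessage, used only in this sketch."""

    severity: str
    code: str
    location: str
    message: str


def counts(messages):
    # Same three fixed keys as CheckReport.counts(); booleans sum as 0/1.
    return {
        "error": sum(m.severity == "error" for m in messages),
        "warning": sum(m.severity == "warning" for m in messages),
        "info": sum(m.severity == "info" for m in messages),
    }


messages = [
    Message("error", "MISSING_PATH", "infos.yaml", "missing file"),
    Message("warning", "DUPLICATED_DATA", "train[0]", "duplicated sample"),
    Message("warning", "DUPLICATED_DATA", "train[3]", "duplicated sample"),
    # A non-standard severity is stored but falls outside all three buckets.
    Message("notice", "CUSTOM", "train", "non-standard severity"),
]

result = counts(messages)
print(result)
```

Because the keys are fixed, a caller can index `result["info"]` without checking that the key exists first.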

Source code in plaid/cli/plaidcheck.py

def counts(self) -> dict[str, int]:
    """Return counts by severity.

    Returns:
        Mapping from severity names to message counts.
    """
    return {
        "error": sum(msg.severity == "error" for msg in self.messages),
        "warning": sum(msg.severity == "warning" for msg in self.messages),
        "info": sum(msg.severity == "info" for msg in self.messages),
    }

plaid.cli.plaidcheck.CheckReport.has_errors ¶

has_errors()

Return whether at least one error was reported.

Returns:

bool –

True when the report contains one or more error messages.

Source code in plaid/cli/plaidcheck.py

def has_errors(self) -> bool:
    """Return whether at least one error was reported.

    Returns:
        True when the report contains one or more error messages.
    """
    return any(msg.severity == "error" for msg in self.messages)

plaid.cli.plaidcheck.CheckReport.has_warnings ¶

has_warnings()

Return whether at least one warning was reported.

Returns:

bool –

True when the report contains one or more warning messages.

Source code in plaid/cli/plaidcheck.py

def has_warnings(self) -> bool:
    """Return whether at least one warning was reported.

    Returns:
        True when the report contains one or more warning messages.
    """
    return any(msg.severity == "warning" for msg in self.messages)

plaid.cli.plaidcheck.CheckReport.to_json ¶

to_json()

Serialize report to JSON string.

Returns:

str –

JSON string containing severity counts and message details.

Source code in plaid/cli/plaidcheck.py

def to_json(self) -> str:
    """Serialize report to JSON string.

    Returns:
        JSON string containing severity counts and message details.
    """
    payload = {
        "counts": self.counts(),
        "messages": [asdict(msg) for msg in self.messages],
    }
    return json.dumps(payload, indent=2)
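The payload therefore has two top-level keys, `counts` and `messages`. The sketch below shows how a consumer might read such a string; the JSON text is made-up illustrative data in that shape, not output captured from a real dataset.

```python
import json

# Illustrative report text with the same structure that to_json() builds:
# a "counts" mapping plus one dict per message.
report_json = """
{
  "counts": {"error": 1, "warning": 1, "info": 0},
  "messages": [
    {
      "severity": "error",
      "code": "SPLIT_COUNT_MISMATCH",
      "location": "train",
      "message": "Expected 10 samples from infos.yaml, found 9"
    },
    {
      "severity": "warning",
      "code": "DUPLICATED_DATA",
      "location": "train",
      "message": "duplicated sample"
    }
  ]
}
"""

report = json.loads(report_json)

# Collect the distinct codes of all error-level messages.
error_codes = sorted(
    {m["code"] for m in report["messages"] if m["severity"] == "error"}
)
print(report["counts"]["error"], error_codes)
```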

plaid.cli.plaidcheck.load_infos_from_disk ¶

load_infos_from_disk(path)

Load infos for checker diagnostics without persisted-field enforcement.

Source code in plaid/cli/plaidcheck.py

def load_infos_from_disk(path: Path) -> Infos:
    """Load infos for checker diagnostics without persisted-field enforcement."""
    return Infos.from_path(path, require_persisted=False)

plaid.cli.plaidcheck.compute_checksum ¶

compute_checksum(sample)

Compute a SHA-256 checksum for a converted sample representation.

Parameters:

sample (Any) –

Sample object or dictionary representation to checksum.

Returns:

str –

Hexadecimal SHA-256 digest of the pickled sample.

Source code in plaid/cli/plaidcheck.py

def compute_checksum(sample: Any) -> str:
    """Compute a SHA-256 checksum for a converted sample representation.

    Args:
        sample: Sample object or dictionary representation to checksum.

    Returns:
        str: Hexadecimal SHA-256 digest of the pickled sample.
    """
    import hashlib
    import pickle

    sha256 = hashlib.sha256()
    sha256.update(pickle.dumps(sample))
    return sha256.hexdigest()
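Since the digest is taken over `pickle.dumps(sample)`, two objects share a checksum only when their pickled bytes are identical. The sketch below uses a local helper `checksum` with the same body as the listing and plain dictionaries as hypothetical samples; it illustrates that objects comparing equal in Python can still hash differently, for example when dictionary insertion order differs.

```python
import hashlib
import pickle


def checksum(obj):
    # Same approach as compute_checksum: SHA-256 over the pickled bytes.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


a = {"pressure": [1.0, 2.0], "mach": 0.8}
b = {"pressure": [1.0, 2.0], "mach": 0.8}
c = {"mach": 0.8, "pressure": [1.0, 2.0]}  # same content, other key order

same_layout = checksum(a) == checksum(b)
reordered = checksum(a) == checksum(c)

print(a == c)        # dict equality ignores insertion order
print(same_layout)   # identical construction gives identical bytes
print(reordered)     # pickle writes items in insertion order
```

A matching checksum is thus a strong hint of duplicated data, while differing checksums do not prove that two samples hold different values.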

plaid.cli.plaidcheck.check_dataset ¶

check_dataset(
    path,
    splits=None,
    show_progress=True,
    problem_definitions=None,
)

Run integrity checks on a local PLAID dataset.

Algorithm overview

1. Validate the required on-disk PLAID layout.
2. Load infos, metadata, and split-specific dataset/converter objects.
3. Validate top-level declarations from `infos.yaml` (backend, sample counts).
4. Resolve requested splits and report unknown ones.
5. For each checked split:
   - verify split-level schema/value consistency,
   - validate sample IDs,
   - convert each sample and validate values,
   - compute checksums for duplicate-data detection,
   - build scalar signatures to detect duplicated DOE-like inputs.
6. Validate optional problem definitions against available features/splits/indices.
7. Emit an `OK` info message when no issue is detected.
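The duplicate-data step can be pictured as grouping sample keys by checksum. The sketch below is one way to do that with a dictionary of lists rather than the `np.unique` approach in the source listing; `find_duplicates` and the short digests are hypothetical names and placeholder values for illustration.

```python
from collections import defaultdict


def find_duplicates(checksums):
    """Return groups of keys that share the same checksum.

    `checksums` maps (index, split) keys to digest strings, in the same
    spirit as the per-sample checksum table built during the split loop.
    """
    groups = defaultdict(list)
    for key, digest in checksums.items():
        groups[digest].append(key)
    # Only digests seen more than once indicate duplicated samples.
    return [keys for keys in groups.values() if len(keys) > 1]


checksums = {
    (0, "train"): "aa",
    (1, "train"): "bb",
    (0, "test"): "aa",  # same digest as train sample 0
}

duplicates = find_duplicates(checksums)
print(duplicates)
```

Because the table spans every checked split, this kind of grouping can flag a sample that appears in both a training and a test split.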

Parameters:

path (Path) –

Dataset directory.
splits (Optional[list[str]], default: None ) –

Optional selected split names.
show_progress (bool, default: True ) –

Whether to display tqdm progress bars for expensive checks.
problem_definitions (Optional[list[str]], default: None ) –

Optional selected problem-definition names. When omitted, all discovered problem definitions are checked.

Returns:

CheckReport –

A populated `CheckReport`.

Source code in plaid/cli/plaidcheck.py

def check_dataset(
    path: Path,
    splits: Optional[list[str]] = None,
    show_progress: bool = True,
    problem_definitions: Optional[list[str]] = None,
) -> CheckReport:
    """Run integrity checks on a local PLAID dataset.

    Algorithm overview:
        1. Validate the required on-disk PLAID layout.
        2. Load infos, metadata, and split-specific dataset/converter objects.
        3. Validate top-level declarations from ``infos.yaml`` (backend, sample counts).
        4. Resolve requested splits and report unknown ones.
        5. For each checked split:
           - verify split-level schema/value consistency,
           - validate sample IDs,
           - convert each sample and validate values,
           - compute checksums for duplicate-data detection,
           - build scalar signatures to detect duplicated DOE-like inputs.
        6. Validate optional problem definitions against available features/splits/indices.
        7. Emit an ``OK`` info message when no issue is detected.

    Args:
        path: Dataset directory.
        splits: Optional selected split names.
        show_progress: Whether to display tqdm progress bars for expensive checks.
        problem_definitions: Optional selected problem-definition names. When
            omitted, all discovered problem definitions are checked.

    Returns:
        A populated :class:`CheckReport`.
    """
    report = CheckReport(messages=[])

    # Load infos first so we can branch on the declared backend.
    if not (path / "infos.yaml").exists():
        report.add(
            "error",
            "MISSING_PATH",
            "infos.yaml",
            "Missing file/path: infos.yaml",
        )
        return report
    try:
        infos = load_infos_from_disk(path / "infos.yaml")
    except Exception as exc:
        report.add("error", "INFOS_READ_ERROR", "infos.yaml", str(exc))
        return report

    declared_backend_for_layout = infos.storage_backend
    if not isinstance(declared_backend_for_layout, str):
        declared_backend_for_layout = None

    # Verify the dataset has the required on-disk files and folders for the
    # detected backend. Later checks rely on these paths being present and
    # readable.
    _check_required_layout(path, report, backend=declared_backend_for_layout)
    if report.has_errors():
        return report  # pragma: no cover

    # Validate top-level dataset declarations from infos.yaml before calling
    # init_from_disk(), because storage initialization indexes num_samples by
    # split and otherwise reports missing entries as opaque KeyError messages.
    declared_backend = infos.storage_backend
    if not isinstance(declared_backend, str):
        report.add(
            "error",
            "BACKEND_MISSING",
            "infos.yaml",
            "Missing or invalid 'storage_backend' in infos.yaml",
        )

    num_samples = infos.num_samples
    if not isinstance(num_samples, dict):
        report.add(
            "error", "NUM_SAMPLES_INVALID", "infos.yaml", "'num_samples' must be a dict"
        )
        num_samples = {}

    # Load metadata when the backend defines it. The CGNS backend stores
    # self-contained samples and intentionally writes no derived metadata.
    if declared_backend_for_layout == "cgns":
        flat_cst: dict = {}
        variable_schema: dict = {}
        constant_schema: dict = {}
    else:
        try:
            flat_cst, variable_schema, constant_schema, _ = load_metadata_from_disk(
                path
            )
        except Exception as exc:
            report.add("error", "METADATA_READ_ERROR", str(path), str(exc))
            return report

    discovered_splits = _discover_split_names_from_disk(
        path,
        declared_backend_for_layout,
        flat_cst,
        constant_schema,
    )
    _check_num_samples_declares_splits(num_samples, discovered_splits, report)
    if report.has_errors():
        return report

    try:
        datasetdict, converterdict = init_from_disk(path)
    except KeyError as exc:
        report.add(
            "error",
            "NUM_SAMPLES_MISSING_SPLIT",
            "infos.yaml",
            _format_missing_split_message(exc.args[0] if exc.args else str(exc)),
        )
        return report
    except Exception as exc:
        report.add("error", "DATASET_INIT_ERROR", str(path), str(exc))
        return report

    # Resolve the user-requested splits against the splits actually available.
    dataset_splits = set(datasetdict.keys())
    target_splits = set(splits) if splits else dataset_splits
    unknown_splits = target_splits - dataset_splits
    for split in sorted(unknown_splits):
        # Sort so the message is deterministic (set iteration order is not).
        available = " and ".join(f'"{x}"' for x in sorted(dataset_splits))
        report.add(
            "error",
            "UNKNOWN_SPLIT",
            split,
            f"Split not found in dataset, available are {available}",
        )
    target_splits = target_splits & dataset_splits

    checksum_report = {}
    # Track shape of each Global feature across all checked splits/samples to
    # detect inconsistencies (e.g. a Global stored as a scalar in one sample
    # and as a vector in another).
    global_shape_observations: dict[str, dict[tuple, list[str]]] = {}
    for split in sorted(target_splits):
        dataset = datasetdict[split]
        converter = converterdict[split]

        # Check split-level consistency between metadata, schemas, and storage.
        expected_n = num_samples.get(split)
        actual_n = len(dataset)
        if isinstance(expected_n, int) and expected_n != actual_n:
            report.add(
                "error",
                "SPLIT_COUNT_MISMATCH",
                split,
                f"Expected {expected_n} samples from infos.yaml, found {actual_n}",
            )

        if declared_backend_for_layout != "cgns":
            if split not in constant_schema:
                report.add(
                    "error",
                    "MISSING_CONSTANT_SCHEMA",
                    split,
                    "No constant schema for split",
                )

            if split not in flat_cst:
                report.add(
                    "error",
                    "MISSING_CONSTANT_VALUES",
                    split,
                    "No constant values for split",
                )

        # Deep check: convert every sample and flag invalid numeric values
        # (NaN, Inf) in its globals and fields.
        for idx in _progress(
            range(actual_n),
            desc=f"Checking split {split}",
            show_progress=show_progress,
            total=actual_n,
        ):
            try:
                sample = converter.to_plaid(dataset, idx)
            except Exception as exc:
                report.add(
                    "error",
                    "SAMPLE_CONVERSION_ERROR",
                    f"{split}[{idx}]",
                    str(exc),
                )
                continue

            # Track whole-sample checksums to detect duplicated data across
            # all checked splits after the per-split loop completes.
            sample_checksum = compute_checksum(sample)
            checksum_report[(idx, split)] = sample_checksum

            for global_name in sample.get_global_names():
                global_path = "Global/" + global_name
                value = sample.get_feature_by_path(global_path)

                if _is_branch_without_data(sample, global_path):
                    continue

                issue = _check_numeric_content(value)
                if issue is not None:
                    report.add(
                        "warning",
                        "INVALID_DATA_VALUE A",
                        f"{split}[{idx}] global/{global_name}",
                        issue,
                    )

                # Record the observed shape of this Global so we can later
                # detect dimension mismatches across all checked samples
                # (across splits). At this point ``_check_numeric_content``
                # already coerced ``value`` through ``np.asarray`` without
                # error, so the same call here is safe.
                if value is not None:
                    shape = tuple(np.asarray(value).shape)
                    global_shape_observations.setdefault(global_name, {}).setdefault(
                        shape, []
                    ).append(f"{split}[{idx}]")

            for time in sample.get_all_time_values():
                local_bases = sample.get_base_names(time=time)
                for base in local_bases:
                    zone_names = sample.get_zone_names(base=base, time=time)
                    for zone in zone_names:
                        for location in CGNS_FIELD_LOCATIONS:
                            field_names = sample.get_field_names(
                                location=location,
                                zone=zone,
                                base=base,
                                time=time,
                            )

                            for f_name in field_names:
                                field_value = sample.get_field(
                                    f_name,
                                    location=location,
                                    zone=zone,
                                    base=base,
                                    time=time,
                                )
                                issue = _check_numeric_content(field_value)
                                if issue is not None:
                                    report.add(
                                        "warning",
                                        "INVALID_DATA_VALUE A",
                                        f"{split}[{idx}][{time}] {base}/{zone}/{location}/{f_name}",
                                        issue,
                                    )

    # Report Globals whose dimension/shape is not consistent across all
    # checked samples (across splits).
    for global_name, shape_to_locations in global_shape_observations.items():
        if len(shape_to_locations) <= 1:
            continue
        details = "; ".join(
            f"shape={shape} at {locations[:5]}"
            + (f" (+{len(locations) - 5} more)" if len(locations) > 5 else "")
            for shape, locations in sorted(
                shape_to_locations.items(), key=lambda kv: str(kv[0])
            )
        )
        report.add(
            "error",
            "GLOBAL_SHAPE_MISMATCH",
            f"global/{global_name}",
            f"Global '{global_name}' has inconsistent shapes across samples: {details}",
        )

    # Compare checksums from every checked sample to flag identical sample data.
    checksum_values = list(checksum_report.values())
    if len(checksum_report) != len(np.unique(checksum_values)):
        k = list(checksum_report.keys())
        v = list(checksum_report.values())
        uni, cou = np.unique(v, return_counts=True)
        for u, c in zip(uni, cou):
            if c == 1:
                continue
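            # k and v are plain lists, so select the matching keys explicitly
            # instead of using NumPy-style boolean indexing.
            duplicated = [key for key, value in zip(k, v) if value == u]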

            report.add(
                "warning",
                "DUPLICATED_DATA",
                str(duplicated),
                "duplicated sample",
            )
    # If problem definitions are present, verify that their feature references,
    # split names, and sample indices are compatible with the dataset.
    pb_def_dir = path / "problem_definitions"
    if pb_def_dir.exists():
        try:
            pb_defs = load_problem_definitions_from_disk(path)
        except Exception as exc:
            report.add(
                "error",
                "PB_DEF_READ_ERROR",
                "problem_definitions",
                str(exc),
            )
            return report

        all_features = set(variable_schema.keys())
        for split_cst in flat_cst.values():
            all_features.update(split_cst.keys())
        # The CGNS backend stores self-contained samples and writes no
        # derived feature schema, so we have no authoritative catalogue
        # to validate problem-definition feature paths against. Skip the
        # feature-name checks in that case (split / index checks below
        # still run).
        validate_pb_def_features = declared_backend_for_layout != "cgns"

        target_pb_names = (
            set(problem_definitions) if problem_definitions else set(pb_defs)
        )
        unknown_pb_names = target_pb_names - set(pb_defs)
        for pb_name in sorted(unknown_pb_names):
            available = " and ".join(f'"{x}"' for x in sorted(pb_defs))
            report.add(
                "error",
                "PB_DEF_UNKNOWN",
                f"problem_definitions/{pb_name}",
                f"Problem definition not found, available are {available}",
            )
        target_pb_names = target_pb_names & set(pb_defs)

        for pb_name, pb_def in pb_defs.items():
            if pb_name not in target_pb_names:
                continue
            if validate_pb_def_features:
                for feat in pb_def.input_features:
                    if feat not in all_features:
                        report.add(
                            "error",
                            "PB_DEF_UNKNOWN_INPUT",
                            f"problem_definitions/{pb_name}",
                            f"Unknown input feature: {feat}",
                        )

                for feat in pb_def.output_features:
                    if feat not in all_features:
                        report.add(
                            "error",
                            "PB_DEF_UNKNOWN_OUTPUT",
                            f"problem_definitions/{pb_name}",
                            f"Unknown output feature: {feat}",
                        )

            for split_dict_name in ["train_split", "test_split"]:
                split_dict = getattr(pb_def, split_dict_name)
                if split_dict is None:
                    continue
                # split_dict must contain exactly one entry
                if len(split_dict) > 1:
                    report.add(
                        "error",
                        "PB_DEF_SPLIT",
                        f"problem_definitions/{pb_name}",
                        f"{split_dict_name} has more than 1 split: {list(split_dict.keys())}",
                    )
                    continue
                split_name = next(iter(split_dict.keys()))
                split_ids = next(iter(split_dict.values()))
                if split_name not in dataset_splits:
                    report.add(
                        "error",
                        "PB_DEF_UNKNOWN_SPLIT",
                        f"problem_definitions/{pb_name}",
                        f"Unknown split in {split_dict_name}: {split_name}",
                    )
                    continue
                split_len = len(datasetdict[split_name])
                ids_list = _resolve_problem_split_indices(split_ids, split_len)
                if len(ids_list) != len(set(ids_list)):
                    report.add(
                        "error",
                        "PB_DEF_DUPLICATE_INDICES",
                        f"problem_definitions/{pb_name}",
                        f"Duplicated indices in {split_dict_name}",
                    )
                bad = [i for i in ids_list if i < 0 or i >= split_len]
                if bad:
                    report.add(
                        "error",
                        "PB_DEF_OUT_OF_RANGE_INDICES",
                        f"problem_definitions/{pb_name}",
                        f"Out-of-range indices in {split_dict_name} (first 10): {bad[:10]}",
                    )
                    continue

                if ids_list:
                    _check_problem_definition_first_sample_feature_keys(
                        pb_name=pb_name,
                        split_dict_name=split_dict_name,
                        split_name=split_name,
                        idx=ids_list[0],
                        dataset=datasetdict[split_name],
                        converter=converterdict[split_name],
                        input_features=list(pb_def.input_features),
                        output_features=list(pb_def.output_features),
                        report=report,
                    )

                if split_dict_name == "train_split":
                    features = list(pb_def.input_features) + list(
                        pb_def.output_features
                    )
                else:
                    features = list(pb_def.input_features)

                for idx in _progress(
                    ids_list,
                    desc=f"Checking problem {pb_name} {split_dict_name}",
                    show_progress=show_progress,
                    total=len(ids_list),
                ):
                    _check_problem_definition_sample_features(
                        pb_name=pb_name,
                        split_dict_name=split_dict_name,
                        split_name=split_name,
                        idx=idx,
                        dataset=datasetdict[split_name],
                        converter=converterdict[split_name],
                        features=features,
                        report=report,
                    )

    return report
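The index checks applied to each problem-definition split in the listing above (duplicates, then out-of-range values) can be isolated as a small standalone sketch. The helper name `check_indices` is hypothetical; the message codes and the "first 10" truncation follow the listing.

```python
def check_indices(ids, split_len):
    """Return (codes, bad_preview) for a list of sample indices.

    Mirrors the two index checks in the listing: duplicated indices and
    indices outside the range [0, split_len).
    """
    codes = []
    if len(ids) != len(set(ids)):
        codes.append("PB_DEF_DUPLICATE_INDICES")
    bad = [i for i in ids if i < 0 or i >= split_len]
    if bad:
        codes.append("PB_DEF_OUT_OF_RANGE_INDICES")
    # Only the first ten offending indices are reported.
    return codes, bad[:10]


clean = check_indices([0, 1, 2], split_len=3)
faulty = check_indices([0, 1, 1, 5, -1], split_len=3)

print(clean)
print(faulty)
```

Both problems can be reported for the same split, which matches the listing: the duplicate check does not short-circuit the range check.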

plaid.cli.plaidcheck.main ¶

main(argv=None)

CLI entry point for plaid-check.

Parameters:

argv (Optional[list[str]], default: None ) –

Optional command-line args.

Returns:

int –

Process exit code.

Source code in plaid/cli/plaidcheck.py

def main(argv: Optional[list[str]] = None) -> int:
    """CLI entry point for `plaid-check`.

    Args:
        argv: Optional command-line args.

    Returns:
        Process exit code.
    """
    parser = _build_parser()
    args = parser.parse_args(argv)

    report = check_dataset(
        path=args.path,
        splits=args.split,
        show_progress=not args.json,
        problem_definitions=args.problem_definition,
    )

    if args.json:
        print(report.to_json())
    else:
        if not report.messages:
            print(f"[OK] {args.path}: No issue detected")
        else:
            for msg in report.messages:
                print(
                    f"[{msg.severity.upper()}] {msg.code} {msg.location}: {msg.message}"
                )
        counts = report.counts()
        print(
            f"Summary: {counts['error']} error(s), "
            f"{counts['warning']} warning(s), {counts['info']} info message(s)"
        )

    if report.has_errors():
        return 1
    if args.strict and report.has_warnings():
        return 2
    return 0
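The exit-code rules at the end of `main` reduce to a small decision function. The sketch below restates them in isolation; `exit_code` is a hypothetical helper name, and its three boolean arguments stand for `report.has_errors()`, `report.has_warnings()`, and `args.strict`.

```python
def exit_code(has_errors, has_warnings, strict):
    # Errors always win and give exit code 1.
    if has_errors:
        return 1
    # Warnings only fail the run when strict mode is requested.
    if strict and has_warnings:
        return 2
    return 0


table = {
    (errors, warnings, strict): exit_code(errors, warnings, strict)
    for errors in (False, True)
    for warnings in (False, True)
    for strict in (False, True)
}

for key, code in sorted(table.items()):
    print(key, "->", code)
```

A calling script can therefore distinguish "errors found" (1) from "only warnings, strict mode" (2) by inspecting the process exit status.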
