Dataset check

plaid-check validates the integrity of a local PLAID dataset.

It checks:

  • required on-disk files and directories;
  • infos.yaml, metadata and split sample counts;
  • sample conversion through the declared storage backend;
  • invalid numeric values such as None, empty arrays, NaN and Inf;
  • duplicated samples;
  • optional problem_definitions/ feature names, splits and indices.
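To make the "invalid numeric values" item concrete, the sketch below shows the kind of test it describes, using only the Python standard library. It is an illustration, not plaid-check's actual implementation: the function name `find_invalid_reason` and the returned labels are hypothetical, and a plain sequence of floats stands in for whatever array type the dataset stores.

```python
import math
from typing import Optional, Sequence


def find_invalid_reason(values: Optional[Sequence[float]]) -> Optional[str]:
    """Return a label for the first problem found, or None if the values look valid.

    Hypothetical helper for illustration; not part of PLAID.
    """
    if values is None:
        return "None"
    if len(values) == 0:
        return "empty array"
    if any(math.isnan(v) for v in values):
        return "NaN"
    if any(math.isinf(v) for v in values):
        return "Inf"
    return None
```

For example, `find_invalid_reason([1.0, float("nan")])` returns `"NaN"`, while a list of finite floats returns `None`.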

Basic usage

plaid-check /path/to/plaid_dataset

For a valid dataset, the command prints an [OK] line and exits with code 0.

Options

Check only selected splits:

plaid-check /path/to/plaid_dataset --split train --split test

Check only selected problem definitions:

plaid-check /path/to/plaid_dataset --problem-definition regression_500

Emit a machine-readable report:

plaid-check /path/to/plaid_dataset --json

Make warnings fail the command:

plaid-check /path/to/plaid_dataset --strict
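When plaid-check is called from a script, the options above can be assembled programmatically. The following is a minimal sketch, not part of PLAID: the helper `build_plaid_check_command` is hypothetical, and it assumes that the options can be combined in a single invocation and that --problem-definition can be repeated the same way as --split.

```python
from typing import List, Sequence


def build_plaid_check_command(
    dataset_path: str,
    splits: Sequence[str] = (),
    problem_definitions: Sequence[str] = (),
    json_report: bool = False,
    strict: bool = False,
) -> List[str]:
    """Assemble a plaid-check argument list (hypothetical helper)."""
    command = ["plaid-check", dataset_path]
    # --split is passed once per selected split.
    for split in splits:
        command += ["--split", split]
    # Assumption: --problem-definition may be repeated like --split.
    for name in problem_definitions:
        command += ["--problem-definition", name]
    if json_report:
        command.append("--json")
    if strict:
        command.append("--strict")
    return command


print(build_plaid_check_command(
    "/path/to/plaid_dataset", splits=["train", "test"], strict=True
))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues for paths that contain spaces.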

Report format

Each message is reported with a severity, a stable code, a location and a short description. Errors make the command exit with code 1. Warnings make it exit with code 2, but only in strict mode (--strict).
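A calling script, such as a CI job, can branch on these exit codes. The sketch below maps the documented codes to short labels. The function name `describe_exit_code` and the label strings are hypothetical, and any code other than 0, 1 or 2 is treated as unexpected because this page does not describe it.

```python
def describe_exit_code(code: int) -> str:
    """Translate a plaid-check exit code into a short label (hypothetical helper)."""
    if code == 0:
        return "ok"
    if code == 1:
        return "errors"
    if code == 2:
        # Only produced in strict mode, according to the text above.
        return "warnings (strict mode)"
    return f"unexpected exit code {code}"
```

A caller could pass it the `returncode` attribute of the object returned by `subprocess.run(["plaid-check", "/path/to/plaid_dataset", "--strict"])`.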

Validation notes

Dataset validation behavior

  • For CGNS datasets, only infos.yaml and data/ are required at the root.
  • For other backends, metadata files and constants/ are checked as well.
  • Without --problem-definition, all discovered problem definitions are checked.
  • In JSON mode, progress bars are disabled to keep the output parseable.
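Because JSON mode keeps the output parseable, the report can be loaded directly from a script. The sketch below rests on two assumptions that this page does not confirm: that the report is written to standard output, and that standard output contains a single JSON document. The report's schema is not described here, so the code returns the parsed value without inspecting it. The function `run_json_check` and its `runner` parameter are hypothetical. The latter exists only so the function can be exercised without plaid-check installed.

```python
import json
import subprocess
from typing import Any, Callable, Tuple


def run_json_check(
    dataset_path: str,
    runner: Callable[..., Any] = subprocess.run,
) -> Tuple[int, Any]:
    """Run plaid-check with --json and return (exit code, parsed report)."""
    completed = runner(
        ["plaid-check", dataset_path, "--json"],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured output to str
    )
    # Assumption: all of stdout is one JSON document.
    report = json.loads(completed.stdout)
    return completed.returncode, report
```

If the output turns out not to be valid JSON, `json.loads` raises `json.JSONDecodeError`, which a caller may prefer to catch and report together with the exit code.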