Upgrade guide¶

This page explains how to upgrade an existing code base to PLAID v1.0.0.

PLAID follows Semantic Versioning. The v1.0.0 release is the first major release: it consolidates the data model, removes deprecated and out-of-scope modules, and simplifies several public APIs. As a major release, it contains breaking changes.

The guide is organized by version jump. Read the section that matches the version you are upgrading from. For the exhaustive, change-by-change history, see the CHANGELOG.md.

Upgrade incrementally

If you are several versions behind, pin pyplaid, upgrade one step at a time, and run your test suite between steps. The last release of the 0.x series is 0.1.15; the sections below describe the jump from 0.1.x to 1.0.0.

Upgrade to v1.0.0 from 0.1.x¶

v1.0.0 reorganizes the package. The changes most likely to affect your code are listed below, with before/after examples.

Top-level imports¶

The single biggest change is that the Dataset class has been removed: it is no longer exported from the top-level plaid package and no longer exists as a module either. A new Infos object is exported, and the version string moved module. See Removing the Dataset class below for the full migration.

# Before (0.1.x)
from plaid import Dataset, Sample, ProblemDefinition
from plaid import __version__            # backed by plaid._version

# After (1.0.0)
from plaid import Sample, ProblemDefinition, Infos
from plaid import __version__            # backed by plaid.version
# `Dataset` no longer exists — there is no `plaid.containers.dataset` module.
# Use the storage helpers instead:
from plaid.storage import save_to_disk, init_from_disk

The helpers get_number_of_samples and get_sample_ids are still exported from the top-level package.

Removing the `Dataset` class: use `plaid.storage`¶

In the 0.1.x series, plaid.Dataset was a monolithic, in-memory container: you built one Dataset object, appended every Sample to it, kept the whole collection in RAM, and called save_to_dir / load on that object.

In v1.0.0 this class is removed entirely — there is no public high-level dataset container class anymore, and there is no plaid.containers.dataset module to import from. The data model is now centered on three objects — Sample, ProblemDefinition and Infos — plus the storage helpers in plaid.storage. A dataset on disk is a shared metadata layout plus backend-specific sample payloads; loading it back gives you, per split, a backend dataset object and a Converter that materializes individual Sample objects lazily.

This is a deliberate shift away from "load the whole dataset into one in-memory object" toward backend-agnostic, lazy, per-sample access, so that large datasets that do not fit in memory can be streamed sample by sample into ML pipelines. The concepts are introduced in Quickstart and the Dataset concept page; the end-to-end workflow is in the Conversion tutorial.

Writing: build-then-append → `save_to_disk(sample_constructor, ids)`¶

Instead of building a Dataset and appending samples, you provide a sample_constructor(id) -> Sample callable plus an ids mapping of split names to sliceable id sequences. PLAID handles iteration, generator creation and parallel sharding internally, and writes directly to the chosen backend.

# Before (0.1.x) — everything in memory, then dumped
from plaid import Dataset, Sample

dataset = Dataset()
for raw in raw_items:
    sample = Sample()
    # fill the sample: add_tree, add_field, ...
    dataset.add_sample(sample)
dataset.save_to_dir("my_plaid_dataset")

# After (1.0.0) — lazy, per-sample, backend-aware
from plaid import Sample
from plaid.storage import save_to_disk

def sample_constructor(sample_id):
    sample = Sample()
    # fill the sample: add_tree, add_field, ...
    return sample

save_to_disk(
    "my_plaid_dataset",
    sample_constructor=sample_constructor,
    ids={"train": [0, 1, 2], "test": [3, 4]},
    backend="zarr",   # one of "hf_datasets", "cgns", "zarr"
)

See the Conversion tutorial for a complete example (including num_proc parallel writing and push_to_hub) and the writer API.

Reading: `Dataset.load(...)` → `init_from_disk(...)` + converter¶

Loading no longer returns a single object you index into. It returns a dictionary of backend datasets and a dictionary of converters, one per split. You materialize a Sample on demand with converter.to_plaid(dataset, idx).

# Before (0.1.x)
from plaid import Dataset

dataset = Dataset()
dataset.load("my_plaid_dataset")
sample = dataset[0]
n = len(dataset)

# After (1.0.0)
from plaid.storage import init_from_disk

datasetdict, converterdict = init_from_disk("my_plaid_dataset")
dataset = datasetdict["train"]
converter = converterdict["train"]

sample = converter.to_plaid(dataset, 0)   # materialize one Sample lazily
n = len(dataset)

The same shape is used for the Hub (download_from_hub, init_streaming_from_hub). See the Dataset concept page, the reader API, and the backend API.

Operation-by-operation map¶

`0.1.x` — `Dataset` method	`1.0.0` — replacement
`Dataset()` + `add_sample` / `add_samples` / `from_list_of_samples`	`save_to_disk(sample_constructor=..., ids=...)`
`Dataset.save_to_dir(path)` / `add_to_dir`	`save_to_disk(path, sample_constructor=..., ids=...)`
`Dataset.load(path)`	`init_from_disk(path)` → `(datasetdict, converterdict)`
`Sample(path=sample_dir)`	`Sample.load_from_dir(sample_dir)` or `sample = Sample(); sample.load(sample_dir)`
`dataset[i]` / `get_samples()`	`converter.to_plaid(dataset, i)`
`len(dataset)` / `get_number_of_samples()`	`len(dataset)` (per-split backend object)
`dataset.set_infos(...)` / `get_infos()`	pass `Infos` to `save_to_disk(infos=...)`; read back with `Infos.from_path(path)`
persisting a `ProblemDefinition` with the dataset	`save_to_disk(..., pb_defs=...)`; read back with `load_problem_definitions_from_disk(path)`
`Dataset.add_features_from_tabular` (ex-`from_tabular`)	build the corresponding `Sample` objects in `sample_constructor`
`Dataset.extract_dataset_from_identifier`	request features at read time: `converter.to_plaid(dataset, i, features=[...])`
`Dataset.get_tabular_from_stacked_identifiers`	gather features yourself from the materialized `Sample` objects
`plaid.examples`	`plaid.downloadable_examples`
change backend (e.g. CGNS → HF)	`init_from_disk` then `save_to_disk` with the new `backend` (see the Conversion tutorial)

If you only need a subset of features or spatial indices, the converter supports features=[...] and indexers={...} for partial reads on the hf_datasets and zarr backends — see the Conversion tutorial.

Removed modules¶

The following modules were removed from the plaid package in 1.0.0. They were either out of the scope of the data model or superseded:

Removed module	What to do instead
`plaid.pipelines` (`plaid_blocks`, `sklearn_block_wrappers`)	build ML pipelines outside PLAID, on top of the data model
`plaid.post` (`bisect`, `metrics`)	compute post-processing / metrics in your own code
`plaid.utils.split`	manage dataset splits via `ProblemDefinition` train/test splits
`plaid.utils.stats`	compute statistics in your own code
`plaid.utils.interpolation`	use an external interpolation routine
`plaid.utils.init_with_tabular`	construct samples explicitly
`plaid.utils.deprecation`, `plaid.utils.base`	internal helpers, no public replacement

If you imported any of these, remove the import and move the corresponding logic into your own project, or rely on the supported data-model APIs.

`ProblemDefinition`¶

ProblemDefinition was rewritten as a compact pydantic model with four required fields — input_features, output_features, train_split and test_split. The many *_features_identifiers accessors were collapsed into two methods, splits became plain model attributes, and YAML key order is now enforced on save.

# Before (0.1.x)
pb.add_in_features_identifiers([...])
pb.add_out_features_identifiers([...])
pb.set_in_features_identifiers([...])
pb.set_out_features_identifiers([...])
pb.get_in_features_identifiers()
pb.get_out_features_identifiers()
pb.get_split("train")            # split accessors

# After (1.0.0)
from plaid import ProblemDefinition

pb = ProblemDefinition(
    input_features=["Base/Zone/GridCoordinates/CoordinateX"],
    output_features=["Base/Zone/VertexFields/pressure"],
    train_split={"train": [0, 1, 2]},
    test_split={"test": [3, 4]},
)
pb.add_input_features([...])
pb.add_output_features([...])
pb.train_split                   # direct attribute access
pb.test_split

The public surface of ProblemDefinition in 1.0.0 is intentionally small: from_path, model_validate, add_input_features, add_output_features, save_to_file, and the four model fields (input_features, output_features, train_split, test_split). The previous constant_features_identifiers accessors and the get_*_split_* / set_*_split_* helpers were removed together with the in/out identifier accessors; splits are now read and assigned directly via the train_split / test_split attributes, and feature lists are normalized (stringified, sorted, deduplicated, non-empty) by pydantic validators. The problem name is no longer stored in the model — on disk it is the YAML filename stem, in memory it is the dictionary key returned by load_problem_definitions_from_disk. See the Problem definition concept page and the problem_definition API.

Storage / CGNS backend¶

The constant/variable mechanism used in the CGNS backend reading and writing paths was removed. If you relied on that distinction at the storage level, review your read/write code against the current backend API and the CGNS backend API. The on-disk layout written by save_to_disk (shared metadata + per-backend payloads) is described in the Disk format concept page, and the three backends (hf_datasets, cgns, zarr) are compared in the Conversion tutorial.

New in v1.0.0¶

v1.0.0 also introduces new building blocks you can adopt:

plaid.infos — a dedicated pydantic Infos class, now living at the same level as ProblemDefinition (see Infos).
plaid-viewer — an interactive trame application for visual dataset exploration (see Viewer).
plaid-check — a CLI tool that validates the integrity of a local PLAID dataset (on-disk layout, infos.yaml, splits, sample conversion, invalid numeric values, duplicated samples, and optional problem definitions); see Dataset check.

Upgrading from an older 0.1.x release¶

If you are upgrading from a release earlier than 0.1.15, first move up to 0.1.15 and account for the intermediate breaking changes documented in the CHANGELOG.md, in particular:

0.1.15 — save_to_disk API simplified: generators replaced by sample_constructor and ids.
0.1.13 — get_mesh renamed to get_tree; get_<x>_assignment renamed to resolve_<x> (e.g. get_time_assignment → resolve_time).
0.1.11 — get_all_mesh_times() renamed to get_all_time_values(); FeatureIdentifier moved from plaid.types to plaid.containers; Python 3.10 support dropped.
0.1.10 — Sample restructured to store globals at time steps (scalars and time series unified into CGNS trees).
0.1.8 — Dataset.from_tabular → Dataset.add_features_from_tabular; Dataset.from_features_identifier → Dataset.extract_dataset_from_identifier; Sample.from_features_identifier → Sample.extract_sample_from_identifier.

Once on 0.1.15, follow the Upgrade to v1.0.0 from 0.1.x section above.