Dataset

A PLAID Dataset is a collection of physics configurations, organized into Samples. Each Sample contains all the features needed to define a specific configuration, including mesh and scalar data.

A dataset must contain at least two different samples, i.e. two samples that differ in at least one feature.
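
To make the "at least two different samples" requirement concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for samples. It does not use the PLAID API, and `samples_differ` is a hypothetical helper introduced only for illustration.

```python
def samples_differ(sample_a: dict, sample_b: dict) -> bool:
    """Return True if the two stand-in samples differ in at least one feature."""
    missing = object()  # sentinel: a feature absent on one side counts as a difference
    all_names = set(sample_a) | set(sample_b)
    return any(
        sample_a.get(name, missing) != sample_b.get(name, missing)
        for name in all_names
    )


# Stand-in samples: feature name -> value (not PLAID Sample objects)
sample_0 = {"Re": 1000.0, "pressure": [0.1, 0.2, 0.3]}
sample_1 = {"Re": 1500.0, "pressure": [0.1, 0.2, 0.3]}
sample_2 = dict(sample_0)  # exact copy of sample_0

differs_01 = samples_differ(sample_0, sample_1)  # "Re" differs
differs_02 = samples_differ(sample_0, sample_2)  # identical features
```

Under this reading, `sample_0` and `sample_1` could form a valid two-sample dataset, while `sample_0` and `sample_2` alone could not.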

Create and load

Empty dataset:

from plaid.containers.dataset import Dataset

dataset = Dataset()
print(dataset)  # Dataset(0 samples, 0 scalars, 0 fields)

From a directory in PLAID format:

dataset = Dataset.load_from_dir("/path/to/dataset_dir")

From a .plaid archive (TAR file produced by Dataset.save):

dataset = Dataset.load_from_file("/path/to/dataset.plaid")

From a list of Samples (IDs optional):

from plaid.containers.sample import Sample

samples = [Sample(...), Sample(...)]
dataset = Dataset.from_list_of_samples(samples)

Basic usage

Length, iteration, indexing:

len(dataset)                # number of samples
for sample in dataset: ...  # iterate samples
sample_3 = dataset[3]       # get sample by id
subset = dataset[0:10]      # returns a Dataset with selected ids

Manage samples and IDs:

sid = dataset.add_sample(Sample(...))
dataset.del_sample(sid)
ids = dataset.get_sample_ids()

Discover features across the dataset

dataset.get_scalar_names(ids=None)
dataset.get_field_names(ids=None, zone_name=None, base_name=None)

# Structured, hashable descriptors of features (recommended)
feat_ids = dataset.get_all_features_identifiers(ids=None)
node_ids = dataset.get_all_features_identifiers_by_type("nodes")

Learn more about identifiers: Feature identifiers.

Retrieve features by identifier(s)

from plaid.types import FeatureIdentifier

fid_scalar = FeatureIdentifier({"type": "scalar", "name": "Re"})
fid_field  = FeatureIdentifier({
    "type": "field", "name": "pressure", "base_name": "Base",
    "zone_name": "Zone", "location": "Vertex", "time": 0.0,
})

# One feature for all samples (dict: sample_id -> feature)
scalar_by_sample = dataset.get_feature_from_identifier(fid_scalar)

# Several features per sample (dict: sample_id -> list[feature])
features_by_sample = dataset.get_features_from_identifiers([fid_scalar, fid_field])

Convert to/from tabular data

Extract homogeneous features (same sizes) to a 3D array (n_samples, n_features, dim_feature):

tab = dataset.get_tabular_from_homogeneous_identifiers([fid_scalar, fid_field])

Extract and stack features to a 2D array (n_samples, dim_stacked):

tab = dataset.get_tabular_from_stacked_identifiers([fid_scalar, fid_field])
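
The difference between the two layouts can be sketched with plain nested lists. This is only an illustration of the shapes described above, not PLAID code: the data is made up, and the assumption that stacked features are concatenated in the order of the requested identifiers is mine.

```python
# Stand-in data (plain lists, not PLAID objects) for 3 samples.

# Homogeneous case: every sample has 2 features, each of size 3,
# so the result has shape (n_samples, n_features, dim_feature).
homogeneous = [
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    [[4.0, 5.0, 6.0], [40.0, 50.0, 60.0]],
    [[7.0, 8.0, 9.0], [70.0, 80.0, 90.0]],
]
homogeneous_shape = (
    len(homogeneous),
    len(homogeneous[0]),
    len(homogeneous[0][0]),
)

# Stacked case: a scalar (size 1) and a field (size 3) per sample are
# concatenated into one row, giving shape (n_samples, dim_stacked).
scalars = [1000.0, 1500.0, 2000.0]
fields = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
stacked = [[scalar, *field] for scalar, field in zip(scalars, fields)]
stacked_shape = (len(stacked), len(stacked[0]))

print(homogeneous_shape)  # (3, 2, 3)
print(stacked_shape)      # (3, 4)
```

Stacking tolerates features of different sizes because each sample's features are flattened into a single row, whereas the homogeneous layout keeps a separate axis per feature and therefore needs equal sizes.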

Update/add features from tabular data (optionally restricting the output dataset to only those features):

updated = dataset.add_features_from_tabular(
    tabular=tab,
    feature_identifiers=[fid_scalar, fid_field],
    restrict_to_features=True,
)

Merge and extract

Extract a dataset containing only selected features:

slim = dataset.extract_dataset_from_identifier([fid_field])

Merge entire datasets (append samples) or merge only features:

ids_added = dataset.merge_dataset(other_dataset)          # append samples
merged    = dataset.merge_features(other_dataset)         # union of features
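
The two kinds of merge can be contrasted with a small sketch on plain dictionaries (sample id -> feature name -> value). This is one plausible reading of "append samples" versus "union of features", not the PLAID implementation: `append_samples` and `union_features` are hypothetical helpers, and both the id assignment and the handling of clashing feature names are assumptions.

```python
def append_samples(target: dict, other: dict) -> list:
    """Copy other's samples into target under fresh ids; return the new ids."""
    next_id = max(target, default=-1) + 1  # assumption: ids continue after the largest one
    new_ids = []
    for offset, sample in enumerate(other.values()):
        target[next_id + offset] = dict(sample)
        new_ids.append(next_id + offset)
    return new_ids


def union_features(left: dict, right: dict) -> dict:
    """Per-sample union of features for ids present in both inputs."""
    # Assumption: on a clashing feature name, the value from `right` wins.
    return {sid: {**left[sid], **right[sid]} for sid in left.keys() & right.keys()}


dataset_a = {0: {"Re": 1000.0}, 1: {"Re": 1500.0}}
dataset_b = {0: {"Re": 2000.0}, 1: {"Re": 2500.0}}
pressures = {0: {"pressure": [0.1, 0.2]}, 1: {"pressure": [0.3, 0.4]}}

# Appending grows the number of samples.
appended = {sid: dict(sample) for sid, sample in dataset_a.items()}
ids_added = append_samples(appended, dataset_b)

# Merging features keeps the number of samples and grows each sample.
merged = union_features(dataset_a, pressures)
```

Appending changes how many samples there are, while a feature merge changes what each sample contains.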

Save to disk

# Save to directory (PLAID format)
dataset._save_to_dir_("/path/to/output_dir")

# Save to .plaid archive (TAR)
dataset.save("/path/to/output.plaid")

Dataset metadata (infos)

Datasets can carry metadata grouped by categories (e.g., legal, data_production):

dataset.add_info("legal", "owner", "CompanyX")
dataset.add_infos("data_production", {"type": "simulation", "simulator": "Z-Set"})
infos = dataset.get_infos()
dataset.print_infos()
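
"Grouped by categories" can be pictured as a two-level mapping: category -> key -> value. The sketch below reproduces the calls above on a plain dictionary. It is an illustration of that structure only. `add_info_sketch` is a hypothetical helper, and the exact object returned by `get_infos()` is not specified here.

```python
def add_info_sketch(infos: dict, category: str, key: str, value: str) -> None:
    """Store one key/value pair under a category, creating the category if needed."""
    infos.setdefault(category, {})[key] = value


infos = {}
add_info_sketch(infos, "legal", "owner", "CompanyX")

# Adding several entries to one category at once
infos.setdefault("data_production", {}).update(
    {"type": "simulation", "simulator": "Z-Set"}
)

for category, entries in infos.items():
    print(category, entries)
```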

Quality checks and summaries

print(dataset.summarize_features())         # coverage of feature names
print(dataset.check_feature_completeness()) # detect missing features per sample
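
The idea behind these two checks can be sketched with sets of feature names per sample. This is a conceptual illustration on made-up data, not the PLAID implementation, and the real methods may report their results in a different form.

```python
# Stand-in: sample id -> names of the features that sample carries
feature_names_by_sample = {
    0: {"Re", "pressure"},
    1: {"Re"},
    2: {"Re", "pressure", "velocity"},
}

# Every feature name seen anywhere in the dataset
all_names = set().union(*feature_names_by_sample.values())

# Coverage: in how many samples does each feature name appear?
coverage = {
    name: sum(name in names for names in feature_names_by_sample.values())
    for name in sorted(all_names)
}

# Completeness: which feature names is each sample missing?
missing = {
    sid: sorted(all_names - names)
    for sid, names in feature_names_by_sample.items()
    if all_names - names
}

print(coverage)  # {'Re': 3, 'pressure': 2, 'velocity': 1}
print(missing)   # {0: ['velocity'], 1: ['pressure', 'velocity']}
```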

Best practices

Prefer FeatureIdentifiers for unambiguous selection and stable keys.
Keep sample IDs contiguous when possible (simplifies slicing and joins).
For large datasets, consider using processes_number when loading from disk to parallelize I/O.
When building learning tasks, pair Dataset with ProblemDefinition and rely on identifiers for inputs/outputs.