Dataset

A PLAID Dataset is a collection of physics configurations, organized into Samples. Each Sample contains all the features needed to define a specific configuration, including mesh and scalar data.

A dataset must contain at least two different samples, i.e. two samples that differ in at least one feature.
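
The rule above can be pictured with plain dictionaries. The sketch below does not use PLAID: the dictionary layout and the `samples_differ` helper are hypothetical stand-ins, chosen only to illustrate what "different in at least one feature" means.

```python
# Hypothetical stand-in: a sample is a plain dict of feature name -> value.
# This is NOT the PLAID Sample class, only an illustration of the rule.
_MISSING = object()  # sentinel for "feature absent from this sample"


def samples_differ(sample_a: dict, sample_b: dict) -> bool:
    """Return True if some feature is absent from one sample or has a different value."""
    names = set(sample_a) | set(sample_b)
    return any(
        sample_a.get(name, _MISSING) != sample_b.get(name, _MISSING)
        for name in names
    )


s1 = {"Re": 1000.0, "pressure": [0.1, 0.2, 0.3]}
s2 = {"Re": 1500.0, "pressure": [0.1, 0.2, 0.3]}
s3 = dict(s1)  # exact copy of s1

different = samples_differ(s1, s2)  # "Re" differs between s1 and s2
identical = samples_differ(s1, s3)  # no feature differs
```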

Create and load

  • Empty dataset:

from plaid.containers.dataset import Dataset

dataset = Dataset()
print(dataset)  # Dataset(0 samples, 0 scalars, 0 fields)

  • From directory in PLAID format:

dataset = Dataset.load_from_dir("/path/to/dataset_dir")

  • From .plaid archive (TAR produced by Dataset.save):

dataset = Dataset.load_from_file("/path/to/dataset.plaid")

  • From a list of Samples (IDs optional):

from plaid.containers.sample import Sample

samples = [Sample(...), Sample(...)]
dataset = Dataset.from_list_of_samples(samples)

See also: Dataset Examples.

Basic usage

  • Length, iteration, indexing:

len(dataset)                # number of samples
for sample in dataset: ...  # iterate samples
sample_3 = dataset[3]       # get sample by id
subset = dataset[0:10]      # returns a Dataset with selected ids

  • Manage samples and IDs:

sid = dataset.add_sample(Sample(...))
dataset.del_sample(sid)
ids = dataset.get_sample_ids()

Discover features across the dataset

dataset.get_scalar_names(ids=None)
dataset.get_field_names(ids=None, zone_name=None, base_name=None)

# Structured, hashable descriptors of features (recommended)
feat_ids = dataset.get_all_features_identifiers(ids=None)
node_ids = dataset.get_all_features_identifiers_by_type("nodes")

Learn more about identifiers: Feature identifiers.
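
Why hashability matters can be seen with a small stand-alone sketch. It assumes nothing about how PLAID implements its identifiers: `make_key` is a hypothetical helper that turns a flat dictionary into an order-independent key usable in sets and as a dictionary key.

```python
# Hypothetical helper, not part of PLAID: build a hashable key from a flat dict.
def make_key(identifier: dict) -> tuple:
    """Return a hashable key that does not depend on the order of the dict entries."""
    return tuple(sorted(identifier.items()))


a = {"type": "field", "name": "pressure", "location": "Vertex"}
b = {"location": "Vertex", "name": "pressure", "type": "field"}  # same content as a
c = {"type": "scalar", "name": "Re"}

# Duplicates collapse in a set because equal keys hash identically.
unique_keys = {make_key(a), make_key(b), make_key(c)}
n_unique = len(unique_keys)

# Keys can also index per-feature results.
results = {make_key(c): [1000.0, 1500.0]}
```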

Retrieve features by identifier(s)

from plaid.types import FeatureIdentifier

fid_scalar = FeatureIdentifier({"type": "scalar", "name": "Re"})
fid_field  = FeatureIdentifier({
    "type": "field", "name": "pressure", "base_name": "Base",
    "zone_name": "Zone", "location": "Vertex", "time": 0.0,
})

# One feature for all samples (dict: sample_id -> feature)
scalar_by_sample = dataset.get_feature_from_identifier(fid_scalar)

# Several features per sample (dict: sample_id -> list[feature])
features_by_sample = dataset.get_features_from_identifiers([fid_scalar, fid_field])

Convert to/from tabular data

Extract homogeneous features (same sizes) to a 3D array (n_samples, n_features, dim_feature):

tab = dataset.get_tabular_from_homogeneous_identifiers([fid_scalar, fid_field])

Extract and stack features to a 2D array (n_samples, dim_stacked):

tab = dataset.get_tabular_from_stacked_identifiers([fid_scalar, fid_field])
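
The two layouts above can be pictured with nested lists. This sketch does not call PLAID, and all values are made up: it shows only that same-size features fit a 3D layout, while stacking concatenates features of different sizes along one axis.

```python
# Made-up feature values for 3 samples; nothing here is read from a PLAID dataset.
n_samples = 3
pressure = [[0.1, 0.2, 0.3, 0.4] for _ in range(n_samples)]             # field, 4 values
temperature = [[300.0, 301.0, 302.0, 303.0] for _ in range(n_samples)]  # field, 4 values
reynolds = [[1000.0], [1500.0], [2000.0]]                               # scalar, 1 value

# Homogeneous case: both fields have the same size, so they fit a 3D layout
# (n_samples, n_features, dim_feature).
homogeneous = [[pressure[i], temperature[i]] for i in range(n_samples)]
homogeneous_shape = (
    len(homogeneous),
    len(homogeneous[0]),
    len(homogeneous[0][0]),
)

# Stacked case: features of different sizes are concatenated per sample,
# giving a 2D layout (n_samples, dim_stacked).
stacked = [reynolds[i] + pressure[i] for i in range(n_samples)]
stacked_shape = (len(stacked), len(stacked[0]))
```

Here `homogeneous_shape` is `(3, 2, 4)` and `stacked_shape` is `(3, 5)`, i.e. one scalar value followed by four field values per sample.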

Update/add features from tabular data (optionally restricting the output dataset to only those features):

updated = dataset.add_features_from_tabular(
    tabular=tab,
    feature_identifiers=[fid_scalar, fid_field],
    restrict_to_features=True,
)

Merge and extract

  • Extract a dataset containing only selected features:

slim = dataset.extract_dataset_from_identifier([fid_field])

  • Merge entire datasets (append samples) or merge only features:

ids_added = dataset.merge_dataset(other_dataset)          # append samples
merged    = dataset.merge_features(other_dataset)         # union of features
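
The difference between appending samples and taking a union of features can be sketched with plain dictionaries. This is a conceptual illustration only: `append_samples` and `union_features` are hypothetical helpers, and the way ids are assigned or conflicts are resolved here is an assumption, not a description of PLAID's behaviour.

```python
# Hypothetical stand-in: a dataset is a dict of sample_id -> {feature_name: value}.
def append_samples(target: dict, other: dict) -> list:
    """Copy the samples of `other` into `target` under fresh ids; return the new ids."""
    next_id = max(target, default=-1) + 1
    new_ids = []
    for sample in other.values():
        target[next_id] = dict(sample)
        new_ids.append(next_id)
        next_id += 1
    return new_ids


def union_features(left: dict, right: dict) -> dict:
    """Return a new dataset where samples sharing an id carry the features of both."""
    merged = {sid: dict(feats) for sid, feats in left.items()}
    for sid, feats in right.items():
        merged.setdefault(sid, {}).update(feats)
    return merged


ds_a = {0: {"Re": 1000.0}, 1: {"Re": 1500.0}}
ds_b = {0: {"Mach": 0.3}, 1: {"Mach": 0.5}}

merged = union_features(ds_a, ds_b)      # still 2 samples, each with Re and Mach
ids_added = append_samples(ds_a, ds_b)   # ds_a grows from 2 to 4 samples
```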

Save to disk

# Save to directory (PLAID format)
dataset._save_to_dir_("/path/to/output_dir")

# Save to .plaid archive (TAR)
dataset.save("/path/to/output.plaid")

Dataset metadata (infos)

Datasets can carry metadata grouped by categories (e.g., legal, data_production):

dataset.add_info("legal", "owner", "CompanyX")
dataset.add_infos("data_production", {"type": "simulation", "simulator": "Z-Set"})
infos = dataset.get_infos()
dataset.print_infos()
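
One way to picture metadata grouped by category is a two-level dictionary, category -> key -> value. The sketch below is an assumption about the shape of the data, not the PLAID implementation; `put_info` and `put_infos` are hypothetical helpers that mirror the two calls above.

```python
# Hypothetical two-level store: category -> {key: value}.
infos: dict = {}


def put_info(category: str, key: str, value) -> None:
    """Set a single key inside a category, creating the category if needed."""
    infos.setdefault(category, {})[key] = value


def put_infos(category: str, entries: dict) -> None:
    """Set several keys of a category at once."""
    infos.setdefault(category, {}).update(entries)


put_info("legal", "owner", "CompanyX")
put_infos("data_production", {"type": "simulation", "simulator": "Z-Set"})
```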

Quality checks and summaries

print(dataset.summarize_features())         # coverage of feature names
print(dataset.check_feature_completeness()) # detect missing features per sample
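
The idea behind a completeness check can be sketched without PLAID: compare each sample's feature names with the union over all samples. `missing_features` is a hypothetical helper, and the report format is an assumption made for illustration.

```python
# Hypothetical stand-in: samples as a dict of sample_id -> {feature_name: value}.
def missing_features(samples: dict) -> dict:
    """Map each incomplete sample id to the sorted feature names it lacks."""
    all_names = set()
    for feats in samples.values():
        all_names.update(feats)
    report = {}
    for sid, feats in samples.items():
        missing = all_names - set(feats)
        if missing:
            report[sid] = sorted(missing)
    return report


samples = {
    0: {"Re": 1000.0, "pressure": [0.1, 0.2]},
    1: {"Re": 1500.0},                          # lacks "pressure"
    2: {"Re": 2000.0, "pressure": [0.3, 0.4]},
}
report = missing_features(samples)
```

With these made-up samples, `report` contains a single entry, for sample 1, listing `pressure` as missing.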

Best practices

  • Prefer FeatureIdentifiers for unambiguous selection and stable keys.

  • Keep sample IDs contiguous when possible (simplifies slicing and joins).

  • For large datasets, consider using processes_number when loading from disk to parallelize I/O.

  • When building learning tasks, pair Dataset with ProblemDefinition and rely on identifiers for inputs/outputs.
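
As an illustration of the contiguity advice, the following stand-alone sketch checks whether a list of ids forms an unbroken integer range. `ids_are_contiguous` is a hypothetical helper, not a PLAID function; in practice it could be applied to the list returned by get_sample_ids().

```python
# Hypothetical helper: check that integer ids form an unbroken range.
def ids_are_contiguous(ids) -> bool:
    """Return True if the ids, once sorted, form a gap-free run of consecutive integers."""
    ordered = sorted(ids)
    if not ordered:
        return True  # an empty id list has no gaps
    return ordered == list(range(ordered[0], ordered[-1] + 1))


contiguous = ids_are_contiguous([0, 1, 2, 3])
with_gap = ids_are_contiguous([0, 1, 3])  # id 2 is missing
```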