Dataset¶
A PLAID Dataset is a collection of physics configurations, organized into Sample objects. Each Sample contains all the features needed to define a specific configuration, including mesh and scalar data.
A dataset must contain at least two different samples, i.e. two samples that differ in at least one feature.
Create and load¶
Empty dataset:
from plaid.containers.dataset import Dataset
dataset = Dataset()
print(dataset) # Dataset(0 samples, 0 scalars, 0 fields)
From directory in PLAID format:
dataset = Dataset.load_from_dir("/path/to/dataset_dir")
From .plaid archive (TAR produced by Dataset.save):
dataset = Dataset.load_from_file("/path/to/dataset.plaid")
From a list of Samples (IDs optional):
from plaid.containers.sample import Sample
samples = [Sample(...), Sample(...)]
dataset = Dataset.from_list_of_samples(samples)
See also: Dataset Examples.
Basic usage¶
Length, iteration, indexing:
len(dataset) # number of samples
for sample in dataset: ... # iterate samples
sample_3 = dataset[3] # get sample by id
subset = dataset[0:10] # returns a Dataset with selected ids
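The indexing rules above can be mimicked with a toy container. This is a minimal sketch in plain Python, not PLAID's implementation: ToyDataset and its _samples attribute are hypothetical stand-ins, and skipping ids that do not exist inside a slice is an assumption made for the illustration.

```python
class ToyDataset:
    """Hypothetical stand-in that maps integer ids to samples."""

    def __init__(self, samples=None):
        self._samples = dict(samples or {})

    def __len__(self):
        return len(self._samples)

    def __iter__(self):
        # Iterate over samples in increasing id order.
        return (self._samples[i] for i in sorted(self._samples))

    def __getitem__(self, key):
        if isinstance(key, slice):
            # A slice selects ids, and the result is again a dataset.
            # Assumption: ids inside the slice range that do not exist are skipped.
            upper = max(self._samples, default=-1) + 1
            wanted = range(*key.indices(upper))
            return ToyDataset(
                {i: self._samples[i] for i in wanted if i in self._samples}
            )
        # An integer key returns a single sample.
        return self._samples[key]


toy = ToyDataset({i: f"sample_{i}" for i in range(20)})
toy_subset = toy[0:10]
```

The point of the sketch is the type distinction: an integer key yields one sample, while a slice yields a container of the same kind, so it can be iterated, sliced again, or measured with len.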
Manage samples and IDs:
sid = dataset.add_sample(Sample(...))
dataset.del_sample(sid)
ids = dataset.get_sample_ids()
Discover features across the dataset¶
dataset.get_scalar_names(ids=None)
dataset.get_field_names(ids=None, zone_name=None, base_name=None)
# Structured, hashable descriptors of features (recommended)
feat_ids = dataset.get_all_features_identifiers(ids=None)
node_ids = dataset.get_all_features_identifiers_by_type("nodes")
Learn more about identifiers: Feature identifiers.
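Why hashability matters: a plain Python dict cannot be used as a dictionary key or stored in a set, so a descriptor that is hashable can serve directly as a stable lookup key. The sketch below illustrates the idea with plain dicts only; identifier_key is a hypothetical helper and says nothing about how PLAID's FeatureIdentifier is implemented.

```python
def identifier_key(identifier):
    """Return an order-independent, hashable key for a plain-dict identifier."""
    # Sorting the (key, value) pairs makes the result independent of
    # insertion order; a tuple of pairs is hashable if the values are.
    return tuple(sorted(identifier.items()))


a = {"type": "field", "name": "pressure", "location": "Vertex", "time": 0.0}
b = {"time": 0.0, "location": "Vertex", "name": "pressure", "type": "field"}

# Both dicts describe the same feature, so they map to the same key.
lookup = {identifier_key(a): "pressure at vertices, t=0"}
found = lookup[identifier_key(b)]
```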
Retrieve features by identifier(s)¶
from plaid.types import FeatureIdentifier
fid_scalar = FeatureIdentifier({"type": "scalar", "name": "Re"})
fid_field = FeatureIdentifier({
"type": "field", "name": "pressure", "base_name": "Base",
"zone_name": "Zone", "location": "Vertex", "time": 0.0,
})
# One feature for all samples (dict: sample_id -> feature)
scalar_by_sample = dataset.get_feature_from_identifier(fid_scalar)
# Several features per sample (dict: sample_id -> list[feature])
features_by_sample = dataset.get_features_from_identifiers([fid_scalar, fid_field])
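The two return shapes described in the comments above can be pictured with plain dictionaries. This is an illustrative sketch on hypothetical toy data (toy_samples and the feature values are made up), not output from PLAID; the ordering of the per-sample list following the order of the request is an assumption of the toy.

```python
# Hypothetical toy data: sample_id -> {feature name -> value}.
toy_samples = {
    0: {"Re": 100.0, "pressure": [1.0, 1.1, 1.2]},
    1: {"Re": 250.0, "pressure": [2.0, 2.1, 2.2]},
}

# One feature for all samples: sample_id -> feature
one_feature = {sid: feats["Re"] for sid, feats in toy_samples.items()}

# Several features per sample: sample_id -> list of features,
# listed here in the same order as the requested names.
requested = ["Re", "pressure"]
several_features = {
    sid: [feats[name] for name in requested]
    for sid, feats in toy_samples.items()
}
```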
Convert to/from tabular data¶
Extract homogeneous features (same sizes) to a 3D array (n_samples, n_features, dim_feature):
tab = dataset.get_tabular_from_homogeneous_identifiers([fid_scalar, fid_field])
Extract and stack features to a 2D array (n_samples, dim_stacked):
tab = dataset.get_tabular_from_stacked_identifiers([fid_scalar, fid_field])
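The difference between the two layouts is easiest to see from the resulting shapes. The sketch below uses nested Python lists and hypothetical toy values rather than PLAID calls; shape is a small helper written for this illustration.

```python
def shape(nested):
    """Return the shape of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical toy data: 3 samples, two fields of 4 values each, one scalar.
pressure = [[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3]]
temperature = [[10.0, 10.1, 10.2, 10.3], [20.0, 20.1, 20.2, 20.3], [30.0, 30.1, 30.2, 30.3]]
reynolds = [100.0, 250.0, 400.0]

# Homogeneous: features of equal size sit side by side on their own axis,
# giving (n_samples, n_features, dim_feature).
homogeneous = [[p, t] for p, t in zip(pressure, temperature)]

# Stacked: features of any size are concatenated per sample,
# giving (n_samples, dim_stacked) with dim_stacked = 1 + 4 here.
stacked = [[re] + p for re, p in zip(reynolds, pressure)]

homogeneous_shape = shape(homogeneous)
stacked_shape = shape(stacked)
```

In this toy, the homogeneous layout keeps one axis per feature, which is only possible when all features have the same size; the stacked layout gives that up in exchange for accepting features of different sizes.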
Update/add features from tabular data (optionally restricting the output dataset to only those features):
updated = dataset.add_features_from_tabular(
tabular=tab,
feature_identifiers=[fid_scalar, fid_field],
restrict_to_features=True,
)
Merge and extract¶
Extract a dataset containing only selected features:
slim = dataset.extract_dataset_from_identifier([fid_field])
Merge entire datasets (append samples) or merge only features:
ids_added = dataset.merge_dataset(other_dataset) # append samples
merged = dataset.merge_features(other_dataset) # union of features
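The two kinds of merge can be contrasted on plain dictionaries. This is a conceptual sketch only: the toy data is hypothetical, and both the assignment of fresh ids after the existing ones and the matching of samples by id are assumptions made for the illustration, not statements about PLAID's behavior.

```python
# Hypothetical toy datasets: sample_id -> {feature name -> value}.
left = {0: {"Re": 100.0}, 1: {"Re": 250.0}}
right = {0: {"Ma": 0.3}, 1: {"Ma": 0.8}}

# Append samples: the sample count grows, the features per sample do not.
# Assumption: appended samples receive fresh ids after the existing ones.
appended = {sid: dict(feats) for sid, feats in left.items()}
next_id = max(appended) + 1
ids_added = []
for feats in right.values():
    appended[next_id] = dict(feats)
    ids_added.append(next_id)
    next_id += 1

# Merge features: the sample count stays the same, the features per sample grow.
# Assumption: samples are matched by id.
merged = {sid: {**left[sid], **right[sid]} for sid in left}
```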
Save to disk¶
# Save to directory (PLAID format)
dataset._save_to_dir_("/path/to/output_dir")
# Save to .plaid archive (TAR)
dataset.save("/path/to/output.plaid")
Dataset metadata (infos)¶
Datasets can carry metadata grouped by categories (e.g., legal, data_production):
dataset.add_info("legal", "owner", "CompanyX")
dataset.add_infos("data_production", {"type": "simulation", "simulator": "Z-Set"})
infos = dataset.get_infos()
dataset.print_infos()
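Metadata grouped by category can be pictured as a two-level mapping, category -> key -> value. The sketch below reproduces the calls above on a plain dict; toy_add_info and toy_add_infos are hypothetical helpers, and the exact structure returned by get_infos, as well as whether add_infos updates or replaces an existing category, are assumptions.

```python
def toy_add_info(infos, category, key, value):
    """Set a single key inside a category, creating the category if needed."""
    infos.setdefault(category, {})[key] = value


def toy_add_infos(infos, category, entries):
    """Set several keys at once inside a category (assumed to update, not replace)."""
    infos.setdefault(category, {}).update(entries)


toy_infos = {}
toy_add_info(toy_infos, "legal", "owner", "CompanyX")
toy_add_infos(toy_infos, "data_production", {"type": "simulation", "simulator": "Z-Set"})
```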
Quality checks and summaries¶
print(dataset.summarize_features()) # coverage of feature names
print(dataset.check_feature_completeness()) # detect missing features per sample
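What "coverage" and "completeness" mean can be shown on a toy example. The sketch below works on hypothetical sets of feature names per sample and does not reproduce the output format of the two methods above.

```python
# Hypothetical toy data: sample_id -> set of feature names present.
toy_samples = {
    0: {"Re", "pressure"},
    1: {"Re"},
    2: {"Re", "pressure", "temperature"},
}

# Every feature name seen anywhere in the dataset.
all_features = set().union(*toy_samples.values())

# Coverage: how many samples carry each feature name.
coverage = {
    name: sum(name in feats for feats in toy_samples.values())
    for name in sorted(all_features)
}

# Completeness: which feature names each incomplete sample is missing.
missing = {
    sid: sorted(all_features - feats)
    for sid, feats in toy_samples.items()
    if all_features - feats
}
```

A sample that appears in missing would produce gaps in any tabular extraction that requests one of its absent features, which is why running such a check before building arrays is useful.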
Best practices¶
Prefer FeatureIdentifiers for unambiguous selection and stable keys.
Keep sample IDs contiguous when possible (simplifies slicing and joins).
For large datasets, consider using processes_number when loading from disk to parallelize I/O.
When building learning tasks, pair Dataset with ProblemDefinition and rely on identifiers for inputs/outputs.