---
title: Dataset
---

# Dataset

A PLAID {py:class}`~plaid.containers.dataset.Dataset` is a collection of physics configurations, organized into {py:class}`~plaid.containers.sample.Sample`. Each {py:class}`~plaid.containers.sample.Sample` contains all the necessary features to define a specific configuration, including mesh and scalar data.

## Create and load

- Empty dataset:

```python
from plaid.containers.dataset import Dataset

dataset = Dataset()
print(dataset)  # Dataset(0 samples, 0 scalars, 0 fields)
```

- From directory in PLAID format:

```python
dataset = Dataset.load_from_dir("/path/to/dataset_dir")
```

- From .plaid archive (TAR produced by `Dataset.save`):

```python
dataset = Dataset.load_from_file("/path/to/dataset.plaid")
```

- From a list of Samples (IDs optional):

```python
from plaid.containers.sample import Sample

samples = [Sample(...), Sample(...)]
dataset = Dataset.from_list_of_samples(samples)
```

See also: {doc}`../notebooks/containers/dataset_example`.

## Basic usage

- Length, iteration, indexing:

```python
len(dataset)            # number of samples
for sample in dataset:
    ...                 # iterate samples
sample_3 = dataset[3]   # get sample by id
subset = dataset[0:10]  # returns a Dataset with selected ids
```

- Manage samples and IDs:

```python
sid = dataset.add_sample(Sample(...))
dataset.del_sample(sid)
ids = dataset.get_sample_ids()
```

## Discover features across the dataset

```python
dataset.get_scalar_names(ids=None)
dataset.get_field_names(ids=None, zone_name=None, base_name=None)

# Structured, hashable descriptors of features (recommended)
feat_ids = dataset.get_all_features_identifiers(ids=None)
node_ids = dataset.get_all_features_identifiers_by_type("nodes")
```

Learn more about identifiers: {doc}`feature_identifiers`.
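
Why hashable descriptors are convenient can be illustrated without PLAID. The sketch below uses a hypothetical `HashableDescriptor` class (a stand-in written for this example, not PLAID's `FeatureIdentifier` implementation) to show how a dict-like descriptor can serve as a dictionary key or set member once it defines a hash over its items.

```python
class HashableDescriptor(dict):
    """Hypothetical stand-in: a dict that is hashed by its contents."""

    def __hash__(self):
        # Order-independent hash over the (key, value) pairs.
        return hash(frozenset(self.items()))


a = HashableDescriptor({"type": "scalar", "name": "Re"})
b = HashableDescriptor({"name": "Re", "type": "scalar"})  # same content, other key order
c = HashableDescriptor({"type": "field", "name": "pressure", "time": 0.0})

# Equal content gives equal hashes, so duplicates collapse in a set.
unique = {a, b, c}

# Descriptors can also index a lookup table directly.
values = {a: 1.0e5, c: [101325.0, 101300.0]}
```

In this sketch, `b` retrieves the entry stored under `a` because both describe the same feature. A descriptor used as a key should not be mutated afterwards, since its hash would change.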
## Retrieve features by identifier(s)

```python
from plaid.types import FeatureIdentifier

fid_scalar = FeatureIdentifier({"type": "scalar", "name": "Re"})
fid_field = FeatureIdentifier({
    "type": "field",
    "name": "pressure",
    "base_name": "Base",
    "zone_name": "Zone",
    "location": "Vertex",
    "time": 0.0,
})

# One feature for all samples (dict: sample_id -> feature)
scalar_by_sample = dataset.get_feature_from_identifier(fid_scalar)

# Several features per sample (dict: sample_id -> list[feature])
features_by_sample = dataset.get_features_from_identifiers([fid_scalar, fid_field])
```

## Convert to/from tabular data

Extract homogeneous features (same sizes) to a 3D array `(n_samples, n_features, dim_feature)`:

```python
tab = dataset.get_tabular_from_homogeneous_identifiers([fid_scalar, fid_field])
```

Extract and stack features to a 2D array `(n_samples, dim_stacked)`:

```python
tab = dataset.get_tabular_from_stacked_identifiers([fid_scalar, fid_field])
```

Update/add features from tabular data (optionally restricting the output dataset to only those features):

```python
updated = dataset.add_features_from_tabular(
    tabular=tab,
    feature_identifiers=[fid_scalar, fid_field],
    restrict_to_features=True,
)
```

## Merge and extract

- Extract a dataset containing only selected features:

```python
slim = dataset.extract_dataset_from_identifier([fid_field])
```

- Merge entire datasets (append samples) or merge only features:

```python
ids_added = dataset.merge_dataset(other_dataset)  # append samples
merged = dataset.merge_features(other_dataset)    # union of features
```

## Save to disk

```python
# Save to directory (PLAID format)
dataset._save_to_dir_("/path/to/output_dir")

# Save to .plaid archive (TAR)
dataset.save("/path/to/output.plaid")
```

## Dataset metadata (infos)

Datasets can carry metadata grouped by categories (e.g., legal, data_production):

```python
dataset.add_info("legal", "owner", "CompanyX")
dataset.add_infos("data_production", {"type": "simulation", "simulator": "Z-Set"})
infos = dataset.get_infos()
dataset.print_infos()
```

## Quality checks and summaries

```python
print(dataset.summarize_features())           # coverage of feature names
print(dataset.check_feature_completeness())   # detect missing features per sample
```

## Best practices

- Prefer FeatureIdentifiers for unambiguous selection and stable keys.
- Keep sample IDs contiguous when possible (simplifies slicing and joins).
- For large datasets, consider using `processes_number` when loading from disk to parallelize I/O.
- When building learning tasks, pair `Dataset` with {py:class}`~plaid.problem_definition.ProblemDefinition` and rely on identifiers for inputs/outputs.
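
The idea behind the quality checks (feature coverage and per-sample completeness) can be sketched on plain Python data. This is only an illustration under stated assumptions: the `features_by_sample` mapping and its feature names are hypothetical, and the sketch does not reproduce the output format or implementation of `summarize_features` or `check_feature_completeness`.

```python
# Hypothetical data: sample id -> set of feature names present in that sample.
features_by_sample = {
    0: {"Re", "pressure", "velocity"},
    1: {"Re", "pressure"},
    2: {"Re", "pressure", "velocity"},
}

# Every feature name seen anywhere in the collection.
all_features = set().union(*features_by_sample.values())

# For each incomplete sample, the features it lacks relative to that union.
missing_by_sample = {
    sid: sorted(all_features - present)
    for sid, present in features_by_sample.items()
    if all_features - present
}

# Coverage: how many samples carry each feature.
coverage = {
    name: sum(name in present for present in features_by_sample.values())
    for name in sorted(all_features)
}

print(missing_by_sample)
print(coverage)
```

Comparing each sample against the union of all feature names flags gaps early, before a tabular extraction that expects every sample to provide the same features.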