plaid.containers.dataset
Implementation of the Dataset container.
Attributes
Classes
Dataset | A set of samples, and optionally other information about the Dataset.
Module Contents
- class Dataset(path: str | pathlib.Path | None = None, verbose: bool = False, processes_number: int = 0, samples: list[plaid.containers.sample.Sample] | None = None, sample_ids: list[int] | None = None)
Bases: object
A set of samples, and optionally other information about the Dataset.
Initialize a Dataset.
If path is not specified, an empty Dataset is initialized, which should then be fed with Samples.
Use add_sample or add_samples to feed the Dataset.
- Parameters:
path (Union[str, Path], optional) – The path from which to load PLAID dataset files.
verbose (bool, optional) – Explicitly displays the operations performed. Defaults to False.
processes_number (int, optional) – Number of processes used to load files (-1 to use all available resources, 0 to disable multiprocessing). Defaults to 0.
samples (list[Sample], optional) – A list of Samples to initialize the Dataset. Defaults to None.
sample_ids (list[int], optional) – An optional list of IDs for the new samples. If not provided, the IDs will be automatically generated based on the current number of samples in the dataset.
Example
from plaid import Dataset
from plaid import Sample

# 1. Create empty instance of Dataset
dataset = Dataset()
print(dataset)
>>> Dataset(0 samples, 0 scalars, 0 fields)
print(len(dataset))
>>> 0

# 2. Load dataset and create Dataset instance
dataset = Dataset("path_to_plaid_dataset") # .plaid or directory
print(dataset)
>>> Dataset(3 samples, 2 scalars, 5 fields)
print(len(dataset))
>>> 3
for sample in dataset:
    print(sample)
>>> Sample(1 scalar, 1 timestamp, 2 fields)
    Sample(1 scalar, 0 timestamps, 0 fields)
    Sample(2 scalars, 1 timestamp, 2 fields)

# 3. Create Dataset instance from a list of Samples
dataset = Dataset(samples=[sample1, sample2, sample3])
print(dataset)
>>> Dataset(3 samples, 0 scalars, 2 fields)

# 4. Create Dataset instance from a list of Samples with specific ids
dataset = Dataset(samples=[sample1, sample2, sample3], sample_ids=[3, 5, 7])
print(dataset)
>>> Dataset(3 samples, 0 scalars, 2 fields)
Caution
It is assumed that you provided a compatible PLAID dataset.
- copy() -> Self
Create a deep copy of the dataset.
- Returns:
A new Dataset instance with all internal data (samples, infos) deeply copied to ensure full isolation from the original.
Note
This operation may be memory-intensive for large datasets.
- get_samples(ids: list[int] | None = None, as_list: bool = False) -> list[plaid.containers.sample.Sample] | dict[int, plaid.containers.sample.Sample]
Return a dictionary of samples with ids corresponding to ids if specified, else all samples.
- add_sample(sample: plaid.containers.sample.Sample, id: int | None = None) -> int
Add a new Sample to the Dataset.
- Parameters:
- Raises:
- Returns:
Id of the newly added Sample.
- Return type:
Example
from plaid import Dataset
dataset = Dataset()
dataset.add_sample(sample)
print(dataset)
>>> Dataset(3 samples, 0 scalars, 2 fields)
- del_sample(sample_id: int) -> None
Delete a Sample from the Dataset and reorganize the remaining sample IDs to eliminate gaps.
- Parameters:
sample_id (int) – The ID of the sample to delete.
- Raises:
ValueError – If the provided sample ID is not present in the dataset.
- Returns:
The new list of sample ids.
- Return type:
Example
from plaid import Dataset
dataset = Dataset()
dataset.add_samples(samples)
print(dataset)
>>> Dataset(1 samples, y scalars, x fields)
dataset.del_sample(0)
print(dataset)
>>> Dataset(0 samples, 0 scalars, 0 fields)
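The gap-free renumbering described for del_sample can be pictured with a plain dictionary. The sketch below is not PLAID's implementation: `delete_and_compact` is a hypothetical helper, the samples are stand-in strings, and it assumes that the remaining samples keep their relative order and are renumbered from 0.

```python
from __future__ import annotations


def delete_and_compact(samples: dict[int, str], sample_id: int) -> list[int]:
    """Remove one entry, then renumber the rest as 0..n-1 (assumed policy)."""
    if sample_id not in samples:
        raise ValueError(f"sample id {sample_id} is not in the dataset")
    # Keep the surviving samples in ascending order of their old ids.
    remaining = [samples[i] for i in sorted(samples) if i != sample_id]
    samples.clear()
    samples.update(enumerate(remaining))
    return list(samples)


samples = {0: "a", 1: "b", 2: "c", 3: "d"}
new_ids = delete_and_compact(samples, 1)
print(new_ids)  # [0, 1, 2]
print(samples)  # {0: 'a', 1: 'c', 2: 'd'}
```

Under this assumption, any id held by the caller before the deletion may point to a different sample afterwards, so ids should be re-read after each call.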
- add_samples(samples: list[plaid.containers.sample.Sample], ids: list[int] | None = None) -> list[int]
Add new Samples to the Dataset.
- Parameters:
- Raises:
TypeError – If samples is not a list or if one of the samples is not a Sample.
ValueError – If samples list is empty.
ValueError – If the length of ids list (if provided) is not equal to the length of samples list.
ValueError – If provided ids are not unique.
- Returns:
Ids of added Samples.
- Return type:
Example
from plaid import Dataset
dataset = Dataset()
dataset.add_samples(samples)
print(len(samples))
>>> n
print(dataset)
>>> Dataset(n samples, 0 scalars, x fields)
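The checks listed under Raises can be summarized in a few lines. This is a minimal sketch under stated assumptions, not the library's code: `resolve_new_ids` is a hypothetical helper, the per-element Sample type check is omitted, and default ids are assumed to continue from the current number of samples, as the constructor documentation suggests.

```python
from __future__ import annotations


def resolve_new_ids(samples: list, ids: list[int] | None, current_size: int) -> list[int]:
    """Validate the inputs and return the ids the new samples would receive."""
    if not isinstance(samples, list):
        raise TypeError("samples must be a list")
    if not samples:
        raise ValueError("samples list is empty")
    if ids is None:
        # Assumed default: continue numbering from the current dataset size.
        return list(range(current_size, current_size + len(samples)))
    if len(ids) != len(samples):
        raise ValueError("ids and samples must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return list(ids)


auto_ids = resolve_new_ids(["s1", "s2"], None, current_size=3)
explicit_ids = resolve_new_ids(["s1", "s2"], [10, 20], current_size=3)
print(auto_ids)      # [3, 4]
print(explicit_ids)  # [10, 20]
```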
- del_samples(sample_ids: list[int]) -> None
Delete Samples from the Dataset and reorganize the remaining sample IDs to eliminate gaps.
- Parameters:
sample_ids (list[int]) – The list of IDs of samples to delete.
- Raises:
TypeError – If sample_ids is not a list.
ValueError – If sample_ids list is empty.
ValueError – If any of the sample_ids does not exist in the dataset.
ValueError – If the provided IDs are not unique.
- Returns:
The new list of sample ids.
- Return type:
Example
from plaid import Dataset
dataset = Dataset()
# Assume samples are already added to the dataset
print(dataset)
>>> Dataset(6 samples, y scalars, x fields)
dataset.del_samples([1, 3, 5])
print(dataset)
>>> Dataset(3 samples, y scalars, x fields)
- get_scalar_names(ids: list[int] | None = None) -> list[str]
Return the union of scalar names in all samples with id in ids.
- get_field_names(ids: list[int] | None = None, location: str | None = None, zone_name: str | None = None, base_name: str | None = None, time: float | None = None) -> list[str]
Return the union of field names in all samples with id in ids.
- Parameters:
ids (list[int], optional) – Select fields depending on sample id. If None, take all samples. Defaults to None.
location (str, optional) – If provided, only field names from this location will be included. Defaults to None.
zone_name (str, optional) – If provided, only field names from this zone will be included. Defaults to None.
base_name (str, optional) – If provided, only field names containing this base name will be included. Defaults to None.
time (float, optional) – If provided, only field names from this time will be included. Defaults to None.
- Returns:
List of all field names.
- Return type:
- get_all_features_identifiers(ids: list[int] | None = None) -> list[plaid.containers.feature_identifier.FeatureIdentifier]
Get all feature identifiers from the dataset.
- get_all_features_identifiers_by_type(feature_type: Literal['scalar', 'nodes', 'field'], ids: list[int] = None) -> list[plaid.containers.feature_identifier.FeatureIdentifier]
Get all feature identifiers of a given type from the dataset.
- Parameters:
- Returns:
A list of dictionaries containing the identifiers of all features of a given type in the dataset.
- Return type:
- add_tabular_scalars(tabular: numpy.ndarray, names: list[str] | None = None) -> None
Add tabular scalar data to the summary.
- Parameters:
- Raises:
ShapeError – Raised if the input tabular array does not have the correct shape (2D).
ShapeError – Raised if the number of columns in the tabular data does not match the number of names provided.
Note
If no names are provided, it will automatically create names based on the pattern ‘X{number}’
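The naming rule in the note above can be illustrated as follows. This is only a sketch: `column_names` is a hypothetical helper, the numbering is assumed to start at 0 (the documentation only gives the pattern ‘X{number}’), and a plain ValueError stands in for the library's ShapeError.

```python
from __future__ import annotations


def column_names(n_columns: int, names: list[str] | None = None) -> list[str]:
    """Return one name per tabular column, generating defaults when needed."""
    if names is None:
        # Assumed numbering: X0, X1, ... (the starting index is not documented).
        return [f"X{number}" for number in range(n_columns)]
    if len(names) != n_columns:
        # The library raises ShapeError here; ValueError is a stand-in.
        raise ValueError("number of names does not match number of columns")
    return list(names)


generated = column_names(3)
given = column_names(2, ["pressure", "angle"])
print(generated)  # ['X0', 'X1', 'X2']
print(given)      # ['pressure', 'angle']
```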
- get_scalars_to_tabular(scalar_names: list[str] | None = None, sample_ids: list[int] | None = None, as_nparray=False) -> dict[str, numpy.ndarray] | numpy.ndarray
Return a dict containing scalar values as tabulars/arrays.
- Parameters:
scalar_names (list[str], optional) – Scalars to work on. If None, all scalars will be returned. Defaults to None.
sample_ids (list[int], optional) – Filter by sample id. If None, take all samples. Defaults to None.
as_nparray (bool, optional) – If True, return the data as a single numpy ndarray. If False, return a dictionary mapping scalar names to their respective tabular values. Defaults to False.
- Returns:
np.ndarray: if as_nparray is True.
dict[str, np.ndarray]: if as_nparray is False, scalar name -> tabular values.
- Return type:
dict[str, np.ndarray] | np.ndarray
- get_feature_from_string_identifier(feature_string_identifier: str) -> dict[int, plaid.types.Feature]
Get the features from the dataset that match the provided feature string identifier.
- get_feature_from_identifier(feature_identifier: plaid.containers.feature_identifier.FeatureIdentifier) -> dict[int, plaid.types.Feature]
Get the features from the dataset that match the provided feature identifier.
- Parameters:
feature_identifier (FeatureIdentifier) – A dictionary containing the feature identifier.
- Returns:
A list of features matching the provided identifier.
- Return type:
- get_features_from_identifiers(feature_identifiers: list[plaid.containers.feature_identifier.FeatureIdentifier]) -> dict[int, list[plaid.types.Feature]]
Get a list of features from the dataset based on the provided feature identifiers.
- Parameters:
feature_identifiers (list[FeatureIdentifier]) – A list of dictionaries, each containing a feature identifier.
- Returns:
A list of features matching the provided identifiers.
- Return type:
- update_features_from_identifier(feature_identifiers: plaid.containers.feature_identifier.FeatureIdentifier | list[plaid.containers.feature_identifier.FeatureIdentifier], features: dict[int, plaid.types.Feature | list[plaid.types.Feature]], in_place: bool = False) -> Self
Update one or several features of the dataset by their identifier(s).
This method applies updates to scalars, fields, or nodes using feature identifiers, and corresponding feature data. When in_place=False, a deep copy of the dataset is created before applying updates, ensuring full isolation from the original.
- Parameters:
- Returns:
The updated dataset (either the current instance or a new copy).
- Return type:
- Raises:
AssertionError – If types are inconsistent or identifiers contain unexpected keys.
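The in_place semantics can be illustrated with a small stand-in class. `TinyDataset` and its `update_features` method are hypothetical and only mimic the behavior documented here: a deep copy is made first unless in_place is True.

```python
import copy


class TinyDataset:
    """Stand-in container mapping sample id -> feature values."""

    def __init__(self, features):
        self.features = features

    def update_features(self, new_values, in_place=False):
        # Work on self, or on a fully isolated deep copy of self.
        target = self if in_place else copy.deepcopy(self)
        for sample_id, value in new_values.items():
            target.features[sample_id] = value
        return target


original = TinyDataset({0: [1.0, 2.0], 1: [3.0, 4.0]})
updated = original.update_features({0: [9.0, 9.0]})
same = original.update_features({1: [0.0, 0.0]}, in_place=True)

print(original.features)  # {0: [1.0, 2.0], 1: [0.0, 0.0]}
print(updated.features)   # {0: [9.0, 9.0], 1: [3.0, 4.0]}
print(same is original)   # True
```

The copy-first default trades memory for safety, which matches the note on copy() about large datasets.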
- extract_dataset_from_identifier(feature_identifiers: plaid.containers.feature_identifier.FeatureIdentifier | list[plaid.containers.feature_identifier.FeatureIdentifier]) -> Self
Extract features of the dataset by their identifier(s) and return a new dataset containing these features.
The features may be scalars, fields, or nodes, designated by feature identifiers.
- Parameters:
feature_identifiers (dict or list of dict) – One or more feature identifiers.
- Returns:
New dataset containing the provided feature identifiers
- Return type:
- Raises:
AssertionError – If types are inconsistent or identifiers contain unexpected keys.
- from_features_identifier(feature_identifiers: plaid.containers.feature_identifier.FeatureIdentifier | list[plaid.containers.feature_identifier.FeatureIdentifier]) -> Self
DEPRECATED: Use Dataset.extract_dataset_from_identifier() instead.
- get_tabular_from_homogeneous_identifiers(feature_identifiers: list[plaid.containers.feature_identifier.FeatureIdentifier]) -> plaid.types.Array
Extract features of the dataset by their identifier(s) and return an array containing these features.
Features must have identical sizes to be cast into an array. The first dimension of the array is the number of samples in the dataset. The features may be scalars, fields, or nodes, designated by feature identifiers.
- Parameters:
- Returns:
An array containing the provided features, of size (nb_sample, nb_features, dim_features).
- Return type:
- Raises:
AssertionError – If feature sizes are inconsistent.
- get_tabular_from_stacked_identifiers(feature_identifiers: list[plaid.containers.feature_identifier.FeatureIdentifier]) -> tuple[plaid.types.Array, plaid.types.Array]
Extract features of the dataset by their identifier(s), stack them and return an array containing these features.
After stacking, each sample has one feature of dimension dim_stacked_features.
- Parameters:
- Returns:
Array: An array containing the provided features, of size (nb_sample, dim_stacked_features).
Array: An array containing the cumulated feature dimensions, starting with 0, of size (len(feature_identifiers)+1, ).
- Return type:
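The relation between the stacked array and the cumulated dimensions can be sketched with plain lists. `stack_features` is a hypothetical helper, not the library's code, and it assumes every sample provides the same features in the same order.

```python
from __future__ import annotations

from itertools import accumulate


def stack_features(per_sample: list[list[list[float]]]) -> tuple[list[list[float]], list[int]]:
    """Concatenate each sample's features and return cumulative offsets."""
    dims = [len(feature) for feature in per_sample[0]]
    # Offsets start at 0 and have one more entry than there are features.
    offsets = [0, *accumulate(dims)]
    stacked = []
    for features in per_sample:
        if [len(feature) for feature in features] != dims:
            raise ValueError("feature sizes differ between samples")
        stacked.append([value for feature in features for value in feature])
    return stacked, offsets


# Two samples, each with a scalar (dimension 1) and a field (dimension 3).
data = [
    [[1.0], [2.0, 3.0, 4.0]],
    [[5.0], [6.0, 7.0, 8.0]],
]
stacked, offsets = stack_features(data)
print(stacked)  # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(offsets)  # [0, 1, 4]
```

With these offsets, feature k of a stacked row is recovered as `row[offsets[k]:offsets[k + 1]]`.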
- add_features_from_tabular(tabular: plaid.types.Array, feature_identifiers: list[plaid.containers.feature_identifier.FeatureIdentifier], restrict_to_features: bool = True) -> Self
Add or update features in the dataset from tabular data using feature identifiers.
This method takes tabular data and applies it to the dataset, either by updating existing features or adding new ones based on the provided feature identifiers. The method can either:
1. Extract only the specified features and return a new dataset with just those features (if restrict_to_features=True).
2. Update the specified features in the current dataset while keeping all other existing features (if restrict_to_features=False).
- Parameters:
tabular (Array) – of size (nb_sample, nb_features) or (nb_sample, nb_features, dim_feature) if dim_feature>1
feature_identifiers (list of dict) – One or more feature identifiers specifying which features to update/add.
restrict_to_features (bool, optional) – If True, only returns the features from feature identifiers, otherwise keep the other features as well. Defaults to True.
- Returns:
A new dataset with features updated/added from the tabular data. If restrict_to_features=True, it contains only the specified features. If restrict_to_features=False, it contains all original features plus the updated/added ones.
- Return type:
- Raises:
AssertionError – If the number of rows in tabular does not match the number of samples in the dataset, or if the number of feature identifiers does not match the number of columns in tabular.
- from_tabular(tabular: plaid.types.Array, feature_identifiers: plaid.containers.feature_identifier.FeatureIdentifier | list[plaid.containers.feature_identifier.FeatureIdentifier], restrict_to_features: bool = True) -> Self
DEPRECATED: Use Dataset.add_features_from_tabular() instead.
- add_info(cat_key: str, info_key: str, info: str) -> None
Add information to the Dataset, overwriting existing information if there’s a conflict.
- Parameters:
cat_key (str) – Category key, choose among “legal”, “data_production” and “data_description”.
info_key (str) – Information key, depending on the chosen category key, choose among “owner”, “license”, “type”, “physics”, “simulator”, “hardware”, “computation_duration”, “script”, “contact”, “location”, “number_of_samples”, “number_of_splits”, “DOE”, “inputs” and “outputs”.
info (str) – Information content.
- Raises:
Example
from plaid import Dataset
dataset = Dataset()
infos = {"legal":{"owner":"CompX", "license":"li_X"}}
dataset.set_infos(infos)
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}}
dataset.add_info("data_production", "type", "simulation")
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}, 'data_production': {'type': 'simulation'}}
- add_infos(cat_key: str, infos: dict[str, str]) -> None
Add information to the Dataset, overwriting existing information if there’s a conflict.
- Parameters:
- Raises:
Example
from plaid import Dataset
dataset = Dataset()
infos = {"legal":{"owner":"CompX", "license":"li_X"}}
dataset.set_infos(infos)
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}}
new_info = {"type":"simulation", "simulator":"Z-set"}
dataset.add_infos("data_production", new_info)
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}, 'data_production': {'type': 'simulation', 'simulator': 'Z-set'}}
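The overwrite-on-conflict behavior amounts to a two-level dictionary merge. The sketch below uses a hypothetical `merge_infos` function on plain dictionaries; it performs no validation of category or information keys, since the conditions under Raises are not stated here.

```python
from __future__ import annotations


def merge_infos(infos: dict[str, dict[str, str]], cat_key: str, new_infos: dict[str, str]) -> None:
    """Merge new_infos into one category, replacing values on key conflicts."""
    infos.setdefault(cat_key, {}).update(new_infos)


infos = {"legal": {"owner": "CompX", "license": "li_X"}}
merge_infos(infos, "data_production", {"type": "simulation", "simulator": "Z-set"})
merge_infos(infos, "legal", {"owner": "CompY"})  # conflicting key is overwritten
print(infos)
# {'legal': {'owner': 'CompY', 'license': 'li_X'},
#  'data_production': {'type': 'simulation', 'simulator': 'Z-set'}}
```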
- set_infos(infos: dict[str, dict[str, str]], warn: bool = True) -> None
Set the information of the Dataset, overwriting any existing information.
- Parameters:
- Raises:
Example
from plaid import Dataset
dataset = Dataset()
infos = {"legal":{"owner":"CompX", "license":"li_X"}}
dataset.set_infos(infos)
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}}
- get_infos() -> dict[str, dict[str, str]]
Get information from an instance of Dataset.
Example
from plaid import Dataset
dataset = Dataset()
infos = {"legal":{"owner":"CompX", "license":"li_X"}}
dataset.set_infos(infos)
print(dataset.get_infos())
>>> {'legal': {'owner': 'CompX', 'license': 'li_X'}}
- merge_dataset(dataset: Self) -> list[int]
Merge the samples of another dataset into this one.
- Parameters:
- Returns:
- Return type:
- Raises:
ValueError – If the provided dataset value is not an instance of Dataset
- merge_features(dataset: Self, in_place: bool = False) -> Self
Merge features of another dataset into this one.
- classmethod merge_dataset_by_features(datasets_list: list[Self]) -> Self
Merge the features of a list of datasets.
- save(path: str | pathlib.Path) -> None
DEPRECATED: Use Dataset.save_to_file() instead.
- save_to_file(path: str | pathlib.Path) -> None
Save the dataset to a TAR (Tape Archive) file.
It creates a temporary intermediate directory to store temporary files during the saving process.
- Parameters:
path (Union[str, Path]) – The path to which the data set will be saved.
- Raises:
ValueError – If the randomly generated temporary dir name is already used (extremely unlikely!).
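The staging-directory approach described for save_to_file can be sketched with the standard library alone. `save_tree_as_tar` is a hypothetical helper and the file names inside `samples` are invented for illustration; only the `samples` sub-directory and the `infos.yaml` file are names taken from this page.

```python
from __future__ import annotations

import tarfile
import tempfile
from pathlib import Path


def save_tree_as_tar(files: dict[str, str], archive_path: Path) -> None:
    """Write files (relative path -> text) to a staging dir, then archive it."""
    with tempfile.TemporaryDirectory() as staging:
        staging_dir = Path(staging)
        for relative_path, text in files.items():
            target = staging_dir / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text)
        # Archive the top-level entries; directories are added recursively.
        with tarfile.open(archive_path, "w") as archive:
            for entry in sorted(staging_dir.iterdir()):
                archive.add(entry, arcname=entry.name)


with tempfile.TemporaryDirectory() as out_dir:
    archive_path = Path(out_dir) / "dataset.plaid"
    save_tree_as_tar(
        {
            "infos.yaml": "legal:\n  owner: CompX\n",
            "samples/sample_0/data.txt": "42\n",  # hypothetical sample file
        },
        archive_path,
    )
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
print(names)
```

Using `tempfile.TemporaryDirectory` lets the operating system pick a unique name and removes the staging directory on exit, which sidesteps the name-collision case mentioned under Raises.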
- save_to_dir(path: str | pathlib.Path, verbose: bool = False) -> None
Saves the dataset into a sub-directory samples and creates an ‘infos.yaml’ file to store additional information about the dataset.
- summarize_features() -> str
Show the name of each feature and the number of samples containing it.
- Returns:
A summary of features across the dataset.
- Return type:
Example
Dataset Feature Summary:
==================================================
Scalars (8 unique):
  - Pr: 30/32 samples (93.8%)
  - Q: 30/32 samples (93.8%)
  - Tr: 30/32 samples (93.8%)
  - angle_in: 32/32 samples (100.0%)
  - angle_out: 30/32 samples (93.8%)
  - eth_is: 30/32 samples (93.8%)
  - mach_out: 32/32 samples (100.0%)
  - power: 30/32 samples (93.8%)
Fields (8 unique):
  - M_iso: 30/32 samples (93.8%)
  - mach: 30/32 samples (93.8%)
  - nut: 30/32 samples (93.8%)
  - ro: 30/32 samples (93.8%)
  - roe: 30/32 samples (93.8%)
  - rou: 30/32 samples (93.8%)
  - rov: 30/32 samples (93.8%)
  - sdf: 32/32 samples (100.0%)
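The counts in such a summary come down to tallying, for each feature name, how many samples contain it. A minimal sketch, assuming each sample is reduced to the set of its feature names (`summarize` is a hypothetical function, and its header line is simplified compared with the report above):

```python
from __future__ import annotations

from collections import Counter


def summarize(samples: dict[int, set[str]]) -> str:
    """Report, per feature name, the number and share of samples holding it."""
    counts = Counter(name for features in samples.values() for name in features)
    total = len(samples)
    lines = [f"Features ({len(counts)} unique):"]
    for name in sorted(counts):
        share = 100 * counts[name] / total
        lines.append(f"  - {name}: {counts[name]}/{total} samples ({share:.1f}%)")
    return "\n".join(lines)


samples = {0: {"Pr", "mach"}, 1: {"Pr", "mach"}, 2: {"mach"}}
report = summarize(samples)
print(report)
# Features (2 unique):
#   - Pr: 2/3 samples (66.7%)
#   - mach: 3/3 samples (100.0%)
```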
- check_feature_completeness() -> str
Detect and notify if some Samples don’t contain all features.
- Returns:
A report on feature completeness across the dataset.
- Return type:
Example
Dataset Feature Completeness Check:
========================================
Complete samples: 30/32 (93.8%)
Incomplete samples: 2/32 (6.2%)
Samples with missing features:
  Sample 671: missing 13 features
    - scalar:Tr
    - scalar:angle_out
    - scalar:power
    - scalar:Pr
    - scalar:Q
    ... and 8 more
  Sample 672: missing 13 features
    - scalar:Tr
    - scalar:angle_out
    - scalar:power
    - scalar:Pr
    - scalar:Q
    ... and 8 more
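One way to picture the completeness check is to compare each sample against the union of all feature names seen in the dataset. The sketch below assumes that definition of "all features"; `find_incomplete` is a hypothetical helper working on sets of names.

```python
from __future__ import annotations


def find_incomplete(samples: dict[int, set[str]]) -> dict[int, list[str]]:
    """Map each incomplete sample id to the sorted names it is missing."""
    # Assumed reference set: every feature name present in at least one sample.
    all_features = set().union(*samples.values())
    return {
        sample_id: sorted(all_features - features)
        for sample_id, features in samples.items()
        if features != all_features
    }


samples = {
    670: {"scalar:Pr", "scalar:Q", "scalar:angle_in"},
    671: {"scalar:angle_in"},
}
missing = find_incomplete(samples)
print(missing)  # {671: ['scalar:Pr', 'scalar:Q']}
```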
- classmethod from_list_of_samples(list_of_samples: list[plaid.containers.sample.Sample], ids: list[int] | None = None) -> Self
DEPRECATED: use Dataset(samples=…, sample_ids=…) instead.
- classmethod load_from_file(path: str | pathlib.Path, verbose: bool = False, processes_number: int = 0) -> Self
Load data from a specified TAR (Tape Archive) file.
- Parameters:
path (Union[str, Path]) – The path to the data file to be loaded.
verbose (bool, optional) – Explicitly displays the operations performed. Defaults to False.
processes_number (int, optional) – Number of processes used to load files (-1 to use all available resources, 0 to disable multiprocessing). Defaults to 0.
- Returns:
The loaded dataset (Dataset).
- Return type:
- classmethod load_from_dir(path: str | pathlib.Path, ids: list[int] | None = None, verbose: bool = False, processes_number: int = 0) -> Self
Load data from a specified directory.
- Parameters:
path (Union[str, Path]) – The path from which to load files.
ids (list, optional) – The specific sample IDs to load from the dataset. Defaults to None.
verbose (bool, optional) – Explicitly displays the operations performed. Defaults to False.
processes_number (int, optional) – Number of processes used to load files (-1 to use all available resources, 0 to disable multiprocessing). Defaults to 0.
- Returns:
The loaded dataset (Dataset).
- Return type:
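The processes_number convention shared by the loading methods (-1 for all available resources, 0 for no multiprocessing) can be expressed as a small resolver. This is a sketch of the convention only: `resolve_process_count` is hypothetical, and rejecting values below -1 is an assumption, not documented behavior.

```python
import os


def resolve_process_count(processes_number: int) -> int:
    """Translate the documented convention into a worker count (0 = serial)."""
    if processes_number == 0:
        return 0  # multiprocessing disabled, load files sequentially
    if processes_number == -1:
        # os.cpu_count() may return None, so fall back to a single process.
        return os.cpu_count() or 1
    if processes_number < -1:
        raise ValueError("processes_number must be -1, 0 or a positive integer")
    return processes_number


serial = resolve_process_count(0)
fixed = resolve_process_count(4)
everything = resolve_process_count(-1)
print(serial, fixed, everything)
```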
- load(path: str | pathlib.Path, verbose: bool = False, processes_number: int = 0) -> None
Load data from a specified file or directory.
Note
If path is a file, it creates a temporary intermediate directory to extract the files from the archive during the loading process.
Note
This method overwrites the content of the calling instance.
- Parameters:
path (Union[str, Path]) – The path to the data file to be loaded.
verbose (bool, optional) – Explicitly displays the operations performed. Defaults to False.
processes_number (int, optional) – Number of processes used to load files (-1 to use all available resources, 0 to disable multiprocessing). Defaults to 0.
- Raises:
ValueError – If a randomly generated temporary directory already exists, indicating a potential conflict during the loading process (extremely unlikely).
- add_to_dir(sample: plaid.containers.sample.Sample, path: str | pathlib.Path | None = None, verbose: bool = False) -> None
Add a sample to the dataset and save it to the specified directory.
Note
If path is None, self.path is used, as set by the most recent call to load or save. A path given as argument takes precedence over self.path and overwrites it.
- Parameters:
- Raises:
ValueError – If both self.path and path are None.
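The precedence rule from the note can be written as a short function. `resolve_save_path` is a hypothetical helper, with `stored_path` playing the role of self.path.

```python
from __future__ import annotations

from pathlib import Path


def resolve_save_path(stored_path: str | Path | None, path: str | Path | None) -> Path:
    """Prefer the explicit argument, fall back to the stored path, else fail."""
    if path is not None:
        return Path(path)
    if stored_path is None:
        raise ValueError("both self.path and path are None")
    return Path(stored_path)


explicit = resolve_save_path("old_dir", "new_dir")
fallback = resolve_save_path("old_dir", None)
print(explicit)  # new_dir
print(fallback)  # old_dir
```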
- set_samples(samples: dict[int, plaid.containers.sample.Sample]) -> None
Set the samples of the data set, overwriting the existing ones.
- Parameters:
samples (dict[int,Sample]) – A dictionary of samples to set inside the dataset.
- Raises:
TypeError – If the ‘samples’ parameter is not of type dict[int, Sample].
TypeError – If the ‘id’ inside a sample is not of type int.
ValueError – If the ‘id’ inside a sample is negative (id >= 0 is required).
TypeError – If the values inside the ‘samples’ dictionary are not of type Sample.
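The four conditions listed under Raises can be checked in a single pass over the dictionary. The following is an illustrative sketch: `validate_samples` is a hypothetical function, and the empty `Sample` class is a stand-in for plaid.containers.sample.Sample.

```python
class Sample:
    """Stand-in for plaid.containers.sample.Sample."""


def validate_samples(samples) -> None:
    """Raise TypeError or ValueError following the documented conditions."""
    if not isinstance(samples, dict):
        raise TypeError("samples must be a dict[int, Sample]")
    for sample_id, sample in samples.items():
        if not isinstance(sample_id, int):
            raise TypeError(f"sample id {sample_id!r} is not an int")
        if sample_id < 0:
            raise ValueError(f"sample id {sample_id} is negative (id >= 0 is required)")
        if not isinstance(sample, Sample):
            raise TypeError(f"value for id {sample_id} is not a Sample")


validate_samples({0: Sample(), 1: Sample()})  # valid input passes silently

try:
    validate_samples({-1: Sample()})
except ValueError as error:
    message = str(error)
print(message)  # sample id -1 is negative (id >= 0 is required)
```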
- set_sample(id: int, sample: plaid.containers.sample.Sample, warning_overwrite: bool = True) -> None
Set a sample with id in the Dataset, overwriting the existing sample if there’s a conflict.
- Parameters:
- Raises:
TypeError – If the ‘id’ inside the sample is not of type int.
ValueError – If the ‘id’ inside a sample is negative (id >= 0 is required).
TypeError – If ‘sample’ parameter is not of type Sample.
Caution
In case of conflict, the existing samples will be overwritten.