Pipeline Examples
This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks.
PCA-GP for mach field prediction on the VKI-LS59 dataset
Key steps covered:
Loading the PLAID dataset using Hugging Face integration and PLAID’s dataset classes
Scaling scalar features with PLAID-wrapped scikit-learn transformers
Dimensionality reduction of flow fields via Principal Component Analysis (PCA) to reduce output complexity
Regression modeling of PCA coefficients from scalar inputs using Gaussian Process regression
Pipeline assembly combining transformations and regressors into a single scikit-learn-compatible workflow
Hyperparameter tuning using Optuna and scikit-learn’s GridSearchCV
Best practices for working with PLAID datasets and pipelines in a reproducible and modular manner
📦 Imports
import warnings
warnings.filterwarnings('ignore', module='sklearn')
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
import os
from pathlib import Path
import yaml
import numpy as np
import optuna
from datasets.utils.logging import disable_progress_bar
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import KFold, GridSearchCV
from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, load_dataset_from_hub
from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor
from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer
disable_progress_bar()
# os.cpu_count() may return None, so fall back to a single process
n_processes = min(os.cpu_count() or 1, 6)
📥 Load Dataset
We load the VKI-LS59 dataset from Hugging Face and restrict ourselves to the first 24 samples of its all_samples split, which serve as our training set.
hf_dataset = load_dataset_from_hub("PLAID-datasets/VKI-LS59", split="all_samples[:24]")
dataset_train, pb_def = huggingface_dataset_to_plaid(hf_dataset, processes_number = n_processes, verbose = False)
We print the summary of dataset_train, which contains 24 samples with 8 scalars and 7 fields, consistent with the VKI-LS59 dataset:
print(dataset_train)
Dataset(24 samples, 8 scalars, 7 fields)
⚙️ Pipeline Configuration
For convenience, the in_features_identifiers and out_features_identifiers for each pipeline block are defined in a .yml file. Here’s an example of how the configuration might look:
pca_nodes:
  in_features_identifiers:
    - type: nodes
      base_name: Base_2_2
  out_features_identifiers:
    - type: scalar
      name: reduced_nodes_*
try:
    filename = Path(__file__).parent.parent.parent / "examples" / "pipelines" / "config_pipeline.yml"
except NameError:
    filename = "config_pipeline.yml"
with open(filename, 'r') as f:
    config = yaml.safe_load(f)
all_feature_id = config['input_scalar_scaler']['in_features_identifiers'] +\
    config['pca_nodes']['in_features_identifiers'] + config['pca_mach']['in_features_identifiers']
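As a rough sketch of how such a configuration is consumed, the snippet below mimics the parsed YAML with a plain dictionary. The `input_scalar_scaler` and `pca_mach` entries are reconstructed from the block representations printed later in this notebook, so the real file may contain more keys. `DummyBlock` is a hypothetical stand-in for a PLAID wrapper and is only there to show the keyword unpacking.

```python
# Parsed configuration, shaped like the result of yaml.safe_load (assumed content).
config = {
    "input_scalar_scaler": {
        "in_features_identifiers": [
            {"type": "scalar", "name": "angle_in"},
            {"type": "scalar", "name": "mach_out"},
        ],
    },
    "pca_nodes": {
        "in_features_identifiers": [{"type": "nodes", "base_name": "Base_2_2"}],
        "out_features_identifiers": [{"type": "scalar", "name": "reduced_nodes_*"}],
    },
    "pca_mach": {
        "in_features_identifiers": [
            {"type": "field", "name": "mach", "base_name": "Base_2_2"}
        ],
        "out_features_identifiers": [{"type": "scalar", "name": "reduced_mach_*"}],
    },
}


class DummyBlock:
    """Hypothetical stand-in for a wrapper that takes identifier lists as keywords."""

    def __init__(self, in_features_identifiers, out_features_identifiers=None):
        self.in_features_identifiers = in_features_identifiers
        # Assumption for this sketch: without explicit outputs, the inputs are overwritten.
        self.out_features_identifiers = out_features_identifiers or in_features_identifiers


# `**params` turns each YAML mapping into keyword arguments of the block.
blocks = {name: DummyBlock(**params) for name, params in config.items()}

# Concatenating the input identifier lists gives every feature the pipeline reads.
all_feature_id = [
    fid for params in config.values() for fid in params["in_features_identifiers"]
]
print(len(all_feature_id))  # 4
```

Keeping the identifiers in one file means that the same entry can configure a block and select the features to extract from the dataset.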
In this example, we aim to predict the mach field based on two input scalars angle_in and mach_out, and the mesh node coordinates. To contain memory consumption, we restrict the dataset to the features required for this example:
dataset_train = dataset_train.extract_dataset_from_identifier(all_feature_id)
print("dataset_train =", dataset_train)
print("scalar names =", dataset_train.get_scalar_names())
print("field names =", dataset_train.get_field_names())
dataset_train = Dataset(24 samples, 2 scalars, 1 field)
scalar names = ['angle_in', 'mach_out']
field names = ['mach']
We notice that only the 2 scalars and the field of interest are kept after restriction.
1. Preprocessor
We now define a preprocessor: a MinMaxScaler applied to the 2 input scalars and a PCA applied to the node coordinates of the meshes:
preprocessor = ColumnTransformer(
    [
        ('input_scalar_scaler', WrappedSklearnTransformer(MinMaxScaler(), **config['input_scalar_scaler'])),
        ('pca_nodes', WrappedSklearnTransformer(PCA(), **config['pca_nodes'])),
    ]
)
preprocessor
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'}],
out_features_identifiers=[{'name': 'reduced_nodes_*',
'type': 'scalar'}],
sklearn_block=PCA()))])
We use PLAID’s ColumnTransformer to apply independent transformations to different feature groups.
To verify this behavior, we apply the preprocessor to dataset_train:
preprocessed_dataset = preprocessor.fit_transform(dataset_train)
print("preprocessed_dataset:", preprocessed_dataset)
print("scalar names =", preprocessed_dataset.get_scalar_names())
print("field names =", preprocessed_dataset.get_field_names())
preprocessed_dataset: Dataset(24 samples, 3 scalars, 1 field)
scalar names = ['angle_in', 'mach_out', 'reduced_nodes_*']
field names = ['mach']
Using MinMaxScaler, we scaled the angle_in and mach_out features, replacing their original values. In contrast, PCA compressed the node coordinates and produced new scalar features named reduced_nodes_*, representing the PCA components. Alternatively, we could have specified out_features_identifiers in the .yml file configuring the MinMaxScaler block to generate new scalars without overwriting the original inputs.
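To make the two transformations concrete, here is a minimal sketch on plain NumPy arrays, outside of PLAID. The data is synthetic, and the array layout (one row per sample, node coordinates flattened into a single row) is an assumption made for illustration; it is not necessarily how the PLAID wrappers arrange features internally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_samples, n_nodes = 24, 50

# Two scalar inputs per sample (synthetic stand-ins for angle_in and mach_out).
scalars = rng.uniform(low=[30.0, 0.6], high=[50.0, 1.0], size=(n_samples, 2))
# Node coordinates flattened to one row per sample: (x, y) for each node.
nodes = rng.normal(size=(n_samples, 2 * n_nodes))

# Scaling keeps the same two columns and maps each one to [0, 1].
scaled = MinMaxScaler().fit_transform(scalars)
# PCA replaces the 100 coordinate columns by 3 coefficient columns.
reduced = PCA(n_components=3).fit_transform(nodes)

features = np.hstack([scaled, reduced])
print(features.shape)  # (24, 5)
```

The regressor downstream therefore sees the scaled scalars plus one column per retained PCA mode.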
2. Postprocessor
Next, we define the postprocessor, which applies PCA to the mach field:
postprocessor = WrappedSklearnTransformer(PCA(), **config['pca_mach'])
postprocessor
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA())
3. TransformedTargetRegressor
The Gaussian Process regressor takes the scaled angle_in and mach_out scalars, along with the PCA coefficients of the mesh node coordinates, as inputs and predicts the PCA coefficients of the mach field as outputs. This is handled by PLAID’s TransformedTargetRegressor.
kernel = Matern(length_scale_bounds=(1e-8, 1e8), nu = 2.5)
gpr = GaussianProcessRegressor(
    kernel=kernel,
    optimizer='fmin_l_bfgs_b',
    n_restarts_optimizer=1,
    random_state=42)
reg = MultiOutputRegressor(gpr)
regressor = WrappedSklearnRegressor(reg, **config['regressor_mach'])
target_regressor = TransformedTargetRegressor(
    regressor=regressor,
    transformer=postprocessor
)
target_regressor
TransformedTargetRegressor(regressor=WrappedSklearnRegressor(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'},
{'name': 'reduced_nodes_*',
'type': 'scalar'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=1, nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA()))
PLAID’s TransformedTargetRegressor functions like scikit-learn’s TransformedTargetRegressor but operates directly on PLAID datasets.
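The target-transformation idea can be sketched by hand with plain scikit-learn objects: reduce the target, regress on the reduced target, then map predictions back. The data below is synthetic and the three steps are written out explicitly, so this illustrates the pattern rather than PLAID's actual implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(24, 2))        # synthetic scalar inputs
s = np.linspace(0.0, 1.0, 40)        # synthetic 1D "mesh"
# Synthetic field: each row holds one sample's field values on the mesh.
Y = (np.outer(X[:, 0], np.sin(np.pi * s))
     + np.outer(X[:, 1] ** 2, np.cos(3 * np.pi * s)))

# 1. Transform the target: field values -> a few PCA coefficients.
target_pca = PCA(n_components=2)
coeffs = target_pca.fit_transform(Y)

# 2. Fit the regressor on the transformed target (one GP per coefficient).
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), random_state=42)
reg = MultiOutputRegressor(gp).fit(X, coeffs)

# 3. Predict coefficients, then map them back to the field space.
Y_pred = target_pca.inverse_transform(reg.predict(X))
print(coeffs.shape, Y_pred.shape)  # (24, 2) (24, 40)
```

Wrapping these steps in a single estimator lets the pipeline expose one fit and one predict call while the regressor only ever sees low-dimensional targets.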
4. Pipeline assembly
We then define the complete pipeline as follows:
pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", target_regressor),
    ]
)
pipeline
Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'...
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=1, nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA())))])
🎯 Optuna hyperparameter tuning
We now use Optuna to optimize hyperparameters, specifically tuning the number of components for the two PCA blocks using three-fold cross-validation.
def objective(trial):
    # Suggest hyperparameters
    nodes_n_components = trial.suggest_int("preprocessor__pca_nodes__sklearn_block__n_components", 3, 4)
    mach_n_components = trial.suggest_int("regressor__transformer__sklearn_block__n_components", 4, 5)
    # Clone and configure pipeline
    pipeline_run = clone(pipeline)
    pipeline_run.set_params(
        preprocessor__pca_nodes__sklearn_block__n_components=nodes_n_components,
        regressor__transformer__sklearn_block__n_components=mach_n_components,
        regressor__regressor__sklearn_block__estimator__kernel=Matern(
            length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(nodes_n_components + len(config['input_scalar_scaler']['in_features_identifiers']))
        )
    )
    cv = KFold(n_splits=3, shuffle=True, random_state=42)
    scores = []
    indices = np.arange(len(dataset_train))
    for train_idx, val_idx in cv.split(indices):
        dataset_cv_train_ = dataset_train[train_idx]
        dataset_cv_val_ = dataset_train[val_idx]
        pipeline_run.fit(dataset_cv_train_)
        score = pipeline_run.score(dataset_cv_val_)
        scores.append(score)
    return np.mean(scores)
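The objective above splits sample indices rather than arrays, because the PLAID dataset is subset with `dataset_train[train_idx]`. A small stand-alone sketch of that splitting step, using the same KFold settings on 24 indices:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 24
indices = np.arange(n_samples)
cv = KFold(n_splits=3, shuffle=True, random_state=42)

fold_sizes = []
seen_in_validation = []
for train_idx, val_idx in cv.split(indices):
    # Within a fold, training and validation indices never overlap.
    assert np.intersect1d(train_idx, val_idx).size == 0
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_in_validation.extend(val_idx.tolist())

print(fold_sizes)  # [(16, 8), (16, 8), (16, 8)]
```

Each sample lands in exactly one validation fold, so the mean of the three scores uses every sample once for validation.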
We maximize the defined objective function over 4 trials selected by Optuna.
preprocessed_dataset = preprocessor.fit_transform(dataset_train)
print("preprocessed_dataset:", preprocessed_dataset)
print("scalar names =", preprocessed_dataset.get_scalar_names())
print("field names =", preprocessed_dataset.get_field_names())
preprocessed_dataset: Dataset(24 samples, 3 scalars, 1 field)
scalar names = ['angle_in', 'mach_out', 'reduced_nodes_*']
field names = ['mach']
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=4)
print("best_params =", study.best_params)
[I 2026-03-25 17:15:32,281] A new study created in memory with name: no-name-a0f646aa-9b1a-4f89-8263-259c34631bc1
[I 2026-03-25 17:15:34,952] Trial 0 finished with value: 0.9230857020034403 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 0 with value: 0.9230857020034403.
[I 2026-03-25 17:15:37,491] Trial 1 finished with value: 0.9230857116163129 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 1 with value: 0.9230857116163129.
[I 2026-03-25 17:15:40,030] Trial 2 finished with value: 0.9238595028842024 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 2 with value: 0.9238595028842024.
[I 2026-03-25 17:15:42,453] Trial 3 finished with value: 0.9238499932542266 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 2 with value: 0.9238595028842024.
best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 4}
We retrieve the best hyperparameters found by Optuna and use them to define the optimized_pipeline.
optimized_pipeline = clone(pipeline).set_params(**study.best_params)
optimized_pipeline.set_params(
    regressor__regressor__sklearn_block__estimator__kernel=Matern(
        length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'] + len(config['input_scalar_scaler']['in_features_identifiers']))
    )
)
optimized_pipeline.fit(dataset_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'...
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA(n_components=4))))])
With the optimized_pipeline fitted on dataset_train, we evaluate its performance on the same data.
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
score = 0.9615073680464543 , error = 0.038492631953545686
We use an anisotropic kernel in the Gaussian Process. Its optimized length_scale is a vector whose length equals 2 (for the two input scalars) plus the number of PCA components set by preprocessor__pca_nodes__sklearn_block__n_components.
print(optimized_pipeline.named_steps["regressor"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale'])
[8.18059014e-01 3.46865845e-01 3.82671445e+01 6.56887169e+00
1.29524961e+03 1.00000000e+08]
print("Dimension GP kernel length_scale =", len(optimized_pipeline.named_steps["regressor"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale']))
print("Expected dimension =", 2 + study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'])
Dimension GP kernel length_scale = 6
Expected dimension = 6
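The link between the kernel and the input dimension can be checked on plain arrays. The sketch below uses synthetic data with 2 + 4 input columns. It also shows that scikit-learn rejects an anisotropic kernel whose length_scale vector does not match the number of input columns, which is presumably why the notebook rebuilds the kernel whenever the number of PCA components changes.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_scalars, n_pca = 2, 4
X = rng.uniform(size=(20, n_scalars + n_pca))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2   # synthetic target, driven by two inputs

# A vector-valued length_scale makes the kernel anisotropic: one scale per input.
kernel = Matern(length_scale=np.ones(n_scalars + n_pca), nu=2.5)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=42).fit(X, y)

fitted_scales = gpr.kernel_.get_params()["length_scale"]
print(kernel.anisotropic, len(fitted_scales))  # True 6

# Fitting on 5 input columns with a 6-entry length_scale raises a ValueError.
try:
    GaussianProcessRegressor(kernel=kernel).fit(X[:, :5], y)
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)  # True
```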
The error remains non-zero due to the approximation introduced by PCA. Since the Gaussian Process regressor interpolates, the error is expected to vanish on the training set if all PCA modes are retained.
exact_pipeline = clone(pipeline).set_params(
    preprocessor__pca_nodes__sklearn_block__n_components=24,
    regressor__transformer__sklearn_block__n_components=24
)
exact_pipeline.fit(dataset_train)
dataset_pred = exact_pipeline.predict(dataset_train)
score = exact_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
score = 0.9999999999912061 , error = 8.79385453345094e-12
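The PCA part of this argument can be isolated in a few lines. In the sketch below, on synthetic data with 24 samples as in the notebook, truncating to a few modes leaves a reconstruction error, while keeping as many modes as there are samples reproduces the training data up to round-off.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Y = rng.normal(size=(24, 200))   # 24 synthetic samples of a 200-value field


def reconstruction_error(n_components):
    """Relative error after projecting Y on n_components PCA modes and back."""
    pca = PCA(n_components=n_components).fit(Y)
    Y_rec = pca.inverse_transform(pca.transform(Y))
    return np.linalg.norm(Y - Y_rec) / np.linalg.norm(Y)


err_truncated = reconstruction_error(4)
err_full = reconstruction_error(24)   # as many modes as samples
print(err_truncated > 1e-3, err_full < 1e-10)  # True True
```

This only covers the compression step. The remaining training error in the pipeline also depends on how closely the Gaussian Process reproduces the PCA coefficients.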
🔍 GridSearchCV hyperparameter tuning
Since our pipeline nodes conform to the scikit-learn API, the constructed pipeline can be used directly with GridSearchCV.
pca_n_components = [3, 4]
regressor_n_components = [4, 5]
param_grid = []
for n, m in zip(pca_n_components, regressor_n_components):
    param_grid.append(
        {
            "preprocessor__pca_nodes__sklearn_block__n_components": [n],
            "regressor__transformer__sklearn_block__n_components": [m],
            "regressor__regressor__sklearn_block__estimator__kernel": [
                Matern(
                    length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(n + 2)
                )
            ],
        }
    )
cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, verbose=3, error_score='raise')
search.fit(dataset_train)
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.936 total time= 0.8s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.913 total time= 0.8s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.921 total time= 0.8s
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.935 total time= 0.9s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.914 total time= 0.9s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.923 total time= 0.8s
GridSearchCV(cv=KFold(n_splits=3, random_state=42, shuffle=True),
error_score='raise',
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nod...
'regressor__regressor__sklearn_block__estimator__kernel': [Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5)],
'regressor__transformer__sklearn_block__n_components': [4]},
{'preprocessor__pca_nodes__sklearn_block__n_components': [4],
'regressor__regressor__sklearn_block__estimator__kernel': [Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5)],
'regressor__transformer__sklearn_block__n_components': [5]}],
verbose=3)
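The grid above is a list of single-point dictionaries, so only the zipped pairs are evaluated, which matches the 2 candidates reported in the log. A short sketch with scikit-learn's ParameterGrid makes the difference with a full Cartesian grid explicit. The shortened parameter names `pca_nodes` and `pca_mach` are placeholders for the long pipeline parameter paths.

```python
from sklearn.model_selection import ParameterGrid

pca_n_components = [3, 4]
regressor_n_components = [4, 5]

# One single-point grid per zipped pair: only the paired combinations are tried.
paired_grid = [
    {"pca_nodes": [n], "pca_mach": [m]}
    for n, m in zip(pca_n_components, regressor_n_components)
]
# A single dictionary would instead expand to the full Cartesian product.
full_grid = {"pca_nodes": pca_n_components, "pca_mach": regressor_n_components}

print(len(ParameterGrid(paired_grid)), len(ParameterGrid(full_grid)))  # 2 4
```

Pairing the values also keeps the kernel's length_scale vector consistent with the number of node PCA components in each candidate.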
We evaluate the performance of the optimized pipeline by computing its score on the training set.
print("best_params =", search.best_params_)
optimized_pipeline = clone(pipeline).set_params(**search.best_params_)
optimized_pipeline.fit(dataset_train)
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__regressor__sklearn_block__estimator__kernel': Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), 'regressor__transformer__sklearn_block__n_components': 5}
score = 0.969269526975111 , error = 0.03073047302488896