Pipeline Examples
This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks.
PCA-GP for mach field prediction on the VKI-LS59 dataset
Key steps covered:
Loading the PLAID dataset using Hugging Face integration and PLAID’s dataset classes
Standardizing features with PLAID-wrapped scikit-learn transformers for scalars
Dimensionality reduction of flow fields via Principal Component Analysis (PCA) to reduce output complexity
Regression modeling of PCA coefficients from scalar inputs using Gaussian Process regression
Pipeline assembly combining transformations and regressors into a single scikit-learn-compatible workflow
Hyperparameter tuning using Optuna and scikit-learn’s GridSearchCV
Best practices for working with PLAID datasets and pipelines in a reproducible and modular manner
📦 Imports
import warnings
warnings.filterwarnings('ignore', module='sklearn')
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
import os
from pathlib import Path
import yaml
import numpy as np
import optuna
from datasets.utils.logging import disable_progress_bar
from datasets import load_dataset
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import KFold, GridSearchCV
from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, huggingface_description_to_problem_definition
from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor
from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer
disable_progress_bar()
# os.cpu_count() can return None, so fall back to a single process
n_processes = min(max(1, os.cpu_count() or 1), 6)
📥 Load Dataset
We load the VKI-LS59 dataset from Hugging Face and restrict ourselves to its first 24 samples (split all_samples[:24]), which serve as our training set.
hf_dataset = load_dataset("PLAID-datasets/VKI-LS59", split="all_samples[:24]")
dataset_train, _ = huggingface_dataset_to_plaid(hf_dataset, processes_number = n_processes, verbose = False)
Printing dataset_train shows 24 samples with 8 scalars and 8 fields, consistent with the VKI-LS59 dataset:
print(dataset_train)
Dataset(24 samples, 8 scalars, 0 time_series, 8 fields)
⚙️ Pipeline Configuration
For convenience, the in_features_identifiers and out_features_identifiers for each pipeline block are defined in a .yml file. Here’s an example of how the configuration might look:
pca_nodes:
  in_features_identifiers:
    - type: nodes
      base_name: Base_2_2
  out_features_identifiers:
    - type: scalar
      name: reduced_nodes_*
try:
    filename = Path(__file__).parent.parent.parent / "examples" / "pipelines" / "config_pipeline.yml"
except NameError:
    filename = "config_pipeline.yml"
with open(filename, 'r') as f:
    config = yaml.safe_load(f)
all_feature_id = (
    config['input_scalar_scaler']['in_features_identifiers']
    + config['pca_nodes']['in_features_identifiers']
    + config['pca_mach']['in_features_identifiers']
)
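As a rough illustration of what the loaded configuration looks like, the sketch below writes the pca_nodes entry shown above as a plain Python dictionary and unpacks it as keyword arguments, which is how **config['pca_nodes'] is used later in this notebook. The describe_block helper is hypothetical: it only stands in for a PLAID wrapper and is not part of the PLAID API.

```python
# Python-dictionary equivalent of the `pca_nodes` YAML entry shown above.
config = {
    "pca_nodes": {
        "in_features_identifiers": [{"type": "nodes", "base_name": "Base_2_2"}],
        "out_features_identifiers": [{"type": "scalar", "name": "reduced_nodes_*"}],
    }
}


def describe_block(in_features_identifiers, out_features_identifiers=None):
    """Hypothetical stand-in for a wrapped block: reports what it reads and writes."""
    # Assumption: when no output identifiers are given, the inputs are overwritten.
    outputs = out_features_identifiers or in_features_identifiers
    return {
        "reads": [feature["type"] for feature in in_features_identifiers],
        "writes": [feature["type"] for feature in outputs],
    }


# Unpacking the entry passes each YAML key as a keyword argument.
summary = describe_block(**config["pca_nodes"])
print(summary)
```

Keeping the identifiers in a configuration file means the same block definitions can be reused across notebooks without editing code.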
In this example, we aim to predict the mach field from two input scalars, angle_in and mach_out, and the mesh node coordinates. To limit memory consumption, we restrict the dataset to the features required for this example:
dataset_train = dataset_train.extract_dataset_from_identifier(all_feature_id)
print("dataset_train =", dataset_train)
print("scalar names =", dataset_train.get_scalar_names())
print("field names =", dataset_train.get_field_names())
dataset_train = Dataset(24 samples, 2 scalars, 0 time_series, 1 field)
scalar names = ['angle_in', 'mach_out']
field names = ['mach']
We notice that only the 2 scalars and the field of interest are kept after restriction.
1. Preprocessor
We now define a preprocessor: a MinMaxScaler applied to the 2 input scalars and a PCA applied to the node coordinates of the meshes:
preprocessor = ColumnTransformer(
    [
        ('input_scalar_scaler', WrappedSklearnTransformer(MinMaxScaler(), **config['input_scalar_scaler'])),
        ('pca_nodes', WrappedSklearnTransformer(PCA(), **config['pca_nodes'])),
    ]
)
preprocessor
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'}],
out_features_identifiers=[{'name': 'reduced_nodes_*',
'type': 'scalar'}],
sklearn_block=PCA()))])
We use PLAID’s ColumnTransformer to apply independent transformations to different feature groups.
To verify this behavior, we apply the preprocessor to dataset_train:
preprocessed_dataset = preprocessor.fit_transform(dataset_train)
print("preprocessed_dataset:", preprocessed_dataset)
print("scalar names =", preprocessed_dataset.get_scalar_names())
print("field names =", preprocessed_dataset.get_field_names())
preprocessed_dataset: Dataset(24 samples, 3 scalars, 0 time_series, 1 field)
scalar names = ['angle_in', 'mach_out', 'reduced_nodes_*']
field names = ['mach']
Using MinMaxScaler, we scaled the angle_in and mach_out features, replacing their original values. In contrast, PCA compressed the node coordinates and produced new scalar features named reduced_nodes_*, representing the PCA components. Alternatively, we could have specified out_features_identifiers in the .yml file configuring the MinMaxScaler block to generate new scalars without overwriting the original inputs.
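To make the two transformations concrete outside of PLAID, here is a minimal sketch on synthetic NumPy arrays using plain scikit-learn. The array shapes, value ranges, and the choice of 3 components are assumptions for illustration only; they are not taken from the VKI-LS59 data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_samples = 24

# Synthetic stand-ins (assumptions): 2 input scalars and 2D coordinates of 50 nodes.
scalars = rng.uniform(low=[-45.0, 0.6], high=[-40.0, 1.0], size=(n_samples, 2))
nodes = rng.normal(size=(n_samples, 50, 2))

# MinMaxScaler rescales each scalar column to [0, 1]; the column count is unchanged.
scaled_scalars = MinMaxScaler().fit_transform(scalars)

# PCA acts on one flattened coordinate vector per sample and returns a few coefficients.
reduced_nodes = PCA(n_components=3).fit_transform(nodes.reshape(n_samples, -1))

print(scaled_scalars.shape, scaled_scalars.min(), scaled_scalars.max())
print(reduced_nodes.shape)
```

The scaler keeps the same number of features, whereas PCA replaces a large coordinate vector with a handful of new features, which mirrors why the PLAID block writes its result to new reduced_nodes_* scalars.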
2. Postprocessor
Next, we define the postprocessor, which applies PCA to the mach field:
postprocessor = WrappedSklearnTransformer(PCA(), **config['pca_mach'])
postprocessor
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA())
3. TransformedTargetRegressor
The Gaussian Process regressor takes the scaled angle_in and mach_out scalars, along with the PCA coefficients of the mesh node coordinates, as inputs, and predicts the PCA coefficients of the mach field as outputs. This is handled by PLAID’s TransformedTargetRegressor.
kernel = Matern(length_scale_bounds=(1e-8, 1e8), nu=2.5)
gpr = GaussianProcessRegressor(
    kernel=kernel,
    optimizer='fmin_l_bfgs_b',
    n_restarts_optimizer=1,
    random_state=42,
)
reg = MultiOutputRegressor(gpr)
regressor = WrappedSklearnRegressor(reg, **config['regressor_mach'])
target_regressor = TransformedTargetRegressor(
    regressor=regressor,
    transformer=postprocessor,
)
target_regressor
TransformedTargetRegressor(regressor=WrappedSklearnRegressor(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'},
{'name': 'reduced_nodes_*',
'type': 'scalar'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=1, nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA()))
PLAID’s TransformedTargetRegressor works like scikit-learn’s TransformedTargetRegressor but operates directly on PLAID datasets.
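For comparison, the same idea can be sketched with scikit-learn’s own TransformedTargetRegressor on synthetic arrays. This is not the PLAID implementation: the data are made up, and for brevity a single multi-output GaussianProcessRegressor with a shared isotropic kernel replaces the MultiOutputRegressor wrapper used in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Synthetic data (assumption): 24 samples, 5 inputs, and a "field" of 200 values
# that depends smoothly on the first two inputs.
X = rng.uniform(size=(24, 5))
grid = np.linspace(0.0, 1.0, 200)
Y = np.sin(2.0 * np.pi * np.outer(X[:, 0], grid)) + np.outer(X[:, 1], grid)

model = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=Matern(nu=2.5), random_state=42),
    transformer=PCA(n_components=4),
    check_inverse=False,  # a truncated PCA is not exactly invertible
)

# fit: PCA compresses Y to 4 coefficients, then the GP learns inputs -> coefficients.
model.fit(X, Y)

# predict: the GP predicts coefficients, then PCA maps them back to full fields.
Y_pred = model.predict(X)
print(Y_pred.shape)
```

The regressor never sees the 200-value field directly; it only works with the 4 PCA coefficients, which is what keeps Gaussian Process regression tractable for field outputs.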
4. Pipeline assembly
We then define the complete pipeline as follows:
pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", target_regressor),
    ]
)
pipeline
Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'...
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=1, nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA())))])
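Because the assembled object is a standard scikit-learn Pipeline, nested parameters can be addressed with the double-underscore convention, which the tuning sections below rely on. The following sketch shows the convention on a small plain scikit-learn pipeline; the step names and the LinearRegression estimator are placeholders chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# A small plain scikit-learn pipeline (assumption: stand-in for the PLAID pipeline).
toy_pipeline = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("pca", PCA()),
        ("regressor", LinearRegression()),
    ]
)

# "<step name>__<parameter name>" addresses a parameter of a nested estimator.
toy_pipeline.set_params(pca__n_components=3)

print(toy_pipeline.get_params()["pca__n_components"])
print(toy_pipeline.named_steps["pca"].n_components)
```

In this notebook the names are longer because the nesting is deeper: preprocessor__pca_nodes__sklearn_block__n_components chains the pipeline step, the named transformer, the wrapped sklearn_block, and finally the PCA parameter.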
🎯 Optuna hyperparameter tuning
We now use Optuna to optimize hyperparameters, specifically tuning the number of components for the two PCA blocks using three-fold cross-validation.
def objective(trial):
    # Suggest hyperparameters
    nodes_n_components = trial.suggest_int("preprocessor__pca_nodes__sklearn_block__n_components", 3, 4)
    mach_n_components = trial.suggest_int("regressor__transformer__sklearn_block__n_components", 4, 5)
    # Clone and configure pipeline
    pipeline_run = clone(pipeline)
    pipeline_run.set_params(
        preprocessor__pca_nodes__sklearn_block__n_components=nodes_n_components,
        regressor__transformer__sklearn_block__n_components=mach_n_components,
        regressor__regressor__sklearn_block__estimator__kernel=Matern(
            length_scale_bounds=(1e-8, 1e8),
            nu=2.5,
            length_scale=np.ones(nodes_n_components + len(config['input_scalar_scaler']['in_features_identifiers'])),
        ),
    )
    cv = KFold(n_splits=3, shuffle=True, random_state=42)
    scores = []
    indices = np.arange(len(dataset_train))
    for train_idx, val_idx in cv.split(indices):
        dataset_cv_train_ = dataset_train[train_idx]
        dataset_cv_val_ = dataset_train[val_idx]
        pipeline_run.fit(dataset_cv_train_)
        score = pipeline_run.score(dataset_cv_val_)
        scores.append(score)
    return np.mean(scores)
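The cross-validation loop inside objective only needs index arrays, so its splitting behavior can be checked in isolation. The sketch below uses the same KFold settings on 24 sample indices and records the size of each fold; no PLAID dataset is involved.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 24  # same size as dataset_train in this notebook
indices = np.arange(n_samples)

cv = KFold(n_splits=3, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(cv.split(indices)):
    # Each sample is used for validation in exactly one fold.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: {len(train_idx)} training samples, {len(val_idx)} validation samples")
```

Each fold therefore fits the pipeline on 16 samples and scores it on the remaining 8, and the objective returns the mean of the three validation scores.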
We maximize the defined objective function over 4 trials selected by Optuna.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=4)
print("best_params =", study.best_params)
[I 2025-09-18 14:48:23,344] A new study created in memory with name: no-name-b84a2093-ca51-457d-907d-c21e4baa5474
[I 2025-09-18 14:48:24,458] Trial 0 finished with value: 0.9230857354761245 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 0 with value: 0.9230857354761245.
[I 2025-09-18 14:48:25,624] Trial 1 finished with value: 0.9231200379079878 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 5}. Best is trial 1 with value: 0.9231200379079878.
[I 2025-09-18 14:48:26,782] Trial 2 finished with value: 0.923859388310893 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 4}. Best is trial 2 with value: 0.923859388310893.
[I 2025-09-18 14:48:27,848] Trial 3 finished with value: 0.9231201052200503 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 5}. Best is trial 2 with value: 0.923859388310893.
best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 4}
We retrieve the best hyperparameters found by Optuna and use them to define the optimized_pipeline.
optimized_pipeline = clone(pipeline).set_params(**study.best_params)
optimized_pipeline.set_params(
    regressor__regressor__sklearn_block__estimator__kernel=Matern(
        length_scale_bounds=(1e-8, 1e8),
        nu=2.5,
        length_scale=np.ones(study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'] + len(config['input_scalar_scaler']['in_features_identifiers'])),
    )
)
optimized_pipeline.fit(dataset_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nodes',
WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'type': 'nodes'...
sklearn_block=MultiOutputRegressor(estimator=GaussianProcessRegressor(kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5),
n_restarts_optimizer=1,
random_state=42))),
transformer=WrappedSklearnTransformer(in_features_identifiers=[{'base_name': 'Base_2_2',
'name': 'mach',
'type': 'field'}],
out_features_identifiers=[{'name': 'reduced_mach_*',
'type': 'scalar'}],
sklearn_block=PCA(n_components=4))))])
The optimized_pipeline was fitted on dataset_train in the previous cell. We now evaluate it on the same data, so the score below is a training score rather than an estimate of generalization.
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
score = 0.961507368044701 , error = 0.03849263195529895
We use an anisotropic kernel in the Gaussian Process: its length_scale is a vector with one entry per regressor input, that is, the two input scalars plus the number of PCA components set by preprocessor__pca_nodes__sklearn_block__n_components. The optimized length scales of the Gaussian Process fitted for the first output are:
print(optimized_pipeline.named_steps["regressor"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale'])
[8.18059014e-01 3.46865845e-01 3.82671445e+01 6.56887169e+00
1.29524962e+03 1.00000000e+08]
print("Dimension GP kernel length_scale =", len(optimized_pipeline.named_steps["regressor"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale']))
print("Expected dimension =", 2 + study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'])
Dimension GP kernel length_scale = 6
Expected dimension = 6
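The same dimension check can be reproduced with plain scikit-learn on synthetic data. In this sketch the inputs, the target function, and the narrower length_scale_bounds are assumptions chosen to keep the toy problem well conditioned; only the kernel construction mirrors the notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Synthetic data (assumption): 6 inputs, but the target only depends on the first two.
X = rng.uniform(size=(24, 6))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

# One initial length scale per input dimension makes the kernel anisotropic.
# Bounds are narrower than in the notebook (assumption, for a well-behaved toy fit).
kernel = Matern(length_scale=np.ones(X.shape[1]), length_scale_bounds=(1e-3, 1e3), nu=2.5)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=1, random_state=42)
gpr.fit(X, y)

# kernel_ holds the kernel with optimized hyperparameters.
fitted_length_scale = gpr.kernel_.length_scale
print("anisotropic =", gpr.kernel_.anisotropic)
print("number of length scales =", len(fitted_length_scale))
```

A large optimized length scale means the Gaussian Process is nearly insensitive to the corresponding input, which is one way to read the value sitting at the upper bound in the output above.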
The error remains non-zero due to the approximation introduced by PCA. Since the Gaussian Process regressor interpolates, the error is expected to vanish on the training set if all PCA modes are retained.
exact_pipeline = clone(pipeline).set_params(
    preprocessor__pca_nodes__sklearn_block__n_components=24,
    regressor__transformer__sklearn_block__n_components=24,
)
exact_pipeline.fit(dataset_train)
dataset_pred = exact_pipeline.predict(dataset_train)
score = exact_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
score = 0.9999999999912063 , error = 8.793743511148477e-12
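The role of the PCA truncation can be isolated with a small experiment on synthetic data: with 24 samples, the centered data span at most 23 directions, so keeping 24 components loses nothing, while keeping only a few does. The random array below is an assumption standing in for the field data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
fields = rng.normal(size=(24, 50))  # synthetic stand-in: 24 samples of a 50-value field


def relative_reconstruction_error(n_components):
    """Project onto n_components PCA modes, map back, and measure the relative error."""
    pca = PCA(n_components=n_components).fit(fields)
    reconstructed = pca.inverse_transform(pca.transform(fields))
    return np.linalg.norm(fields - reconstructed) / np.linalg.norm(fields)


error_truncated = relative_reconstruction_error(4)
error_full = relative_reconstruction_error(24)
print("4 modes:", error_truncated)
print("24 modes:", error_full)
```

With all modes retained, the reconstruction error drops to floating-point round-off, which matches the near-zero training error of exact_pipeline.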
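🔍 GridSearchCV hyperparameter tuning
Since our pipeline blocks conform to the scikit-learn API, the constructed pipeline can be used directly with GridSearchCV. The parameter grid below pairs the candidate values with zip, so that each candidate comes with a kernel whose length_scale has the matching dimension (n + 2). Only two candidates are therefore evaluated, not the full 2 x 2 grid.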
pca_n_components = [3, 4]
regressor_n_components = [4, 5]
param_grid = []
for n, m in zip(pca_n_components, regressor_n_components):
    param_grid.append(
        {
            "preprocessor__pca_nodes__sklearn_block__n_components": [n],
            "regressor__transformer__sklearn_block__n_components": [m],
            "regressor__regressor__sklearn_block__estimator__kernel": [
                Matern(
                    length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(n + 2)
                )
            ],
        }
    )
cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, verbose=3, error_score='raise')
search.fit(dataset_train)
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.936 total time= 0.4s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.913 total time= 0.3s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=4;, score=0.921 total time= 0.3s
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.935 total time= 0.4s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.914 total time= 0.4s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=4, regressor__regressor__sklearn_block__estimator__kernel=Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), regressor__transformer__sklearn_block__n_components=5;, score=0.923 total time= 0.4s
GridSearchCV(cv=KFold(n_splits=3, random_state=42, shuffle=True),
error_score='raise',
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(plaid_transformers=[('input_scalar_scaler',
WrappedSklearnTransformer(in_features_identifiers=[{'name': 'angle_in',
'type': 'scalar'},
{'name': 'mach_out',
'type': 'scalar'}],
sklearn_block=MinMaxScaler())),
('pca_nod...
'regressor__regressor__sklearn_block__estimator__kernel': [Matern(length_scale=[1, 1, 1, 1, 1], nu=2.5)],
'regressor__transformer__sklearn_block__n_components': [4]},
{'preprocessor__pca_nodes__sklearn_block__n_components': [4],
'regressor__regressor__sklearn_block__estimator__kernel': [Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5)],
'regressor__transformer__sklearn_block__n_components': [5]}],
verbose=3)
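The number of candidates GridSearchCV evaluates depends on how param_grid is written. The sketch below uses scikit-learn’s ParameterGrid to compare a list of paired dictionaries, built with zip as in this notebook, with a single dictionary of lists. The parameter names are shortened, hypothetical placeholders.

```python
from sklearn.model_selection import ParameterGrid

# Shortened, hypothetical parameter names to keep the example readable.
pca_n_components = [3, 4]
regressor_n_components = [4, 5]

# One dictionary per zipped pair, as in the notebook: the values are tied together.
paired_grid = [
    {"pca_nodes_n_components": [n], "pca_mach_n_components": [m]}
    for n, m in zip(pca_n_components, regressor_n_components)
]

# A single dictionary would instead expand to the full Cartesian product.
full_grid = {
    "pca_nodes_n_components": pca_n_components,
    "pca_mach_n_components": regressor_n_components,
}

print(len(ParameterGrid(paired_grid)), "paired candidates")
print(len(ParameterGrid(full_grid)), "Cartesian candidates")
```

Tying the values together in one dictionary per candidate is what allows the kernel’s length_scale size to follow the number of node PCA components; a full Cartesian grid would need the kernel to be rebuilt for each combination.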
We evaluate the performance of the optimized pipeline by computing its score on the training set.
print("best_params =", search.best_params_)
optimized_pipeline = clone(pipeline).set_params(**search.best_params_)
optimized_pipeline.fit(dataset_train)
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)
best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__regressor__sklearn_block__estimator__kernel': Matern(length_scale=[1, 1, 1, 1, 1, 1], nu=2.5), 'regressor__transformer__sklearn_block__n_components': 5}
score = 0.9692695269779374 , error = 0.030730473022062554