Pipeline Examples¶

This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks.

PCA-GP for `mach` field prediction of `VKI-LS59` dataset¶

Key steps covered:

Loading the PLAID dataset using Hugging Face integration and PLAID’s dataset classes
Standardizing features with PLAID-wrapped scikit-learn transformers for scalars
Dimensionality reduction of flow fields via Principal Component Analysis (PCA) to reduce output complexity
Regression modeling of PCA coefficients from scalar inputs using Gaussian Process regression
Pipeline assembly combining transformations and regressors into a single scikit-learn-compatible workflow
Hyperparameter tuning using Optuna and scikit-learn’s GridSearchCV
Best practices for working with PLAID datasets and pipelines in a reproducible and modular manner
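
As a rough, library-agnostic illustration of the steps listed above, the following sketch reproduces the PCA-GP idea with plain scikit-learn on synthetic data. It does not use PLAID's wrappers or the VKI-LS59 dataset; the array shapes, the synthetic field, and the variable names are assumptions made only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 24 samples, 2 input scalars, 50-point field.
n_samples, n_scalars, n_points = 24, 2, 50
X = rng.uniform(size=(n_samples, n_scalars))
grid = np.linspace(0.0, 1.0, n_points)
Y = np.sin(2.0 * np.pi * np.outer(X[:, 0], grid)) + np.outer(X[:, 1], grid)

# 1. Scale the input scalars to [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# 2. Compress each field to a few PCA coefficients.
pca = PCA(n_components=4)
coeffs = pca.fit_transform(Y)

# 3. Regress the PCA coefficients on the scaled scalars.
gp = GaussianProcessRegressor(
    kernel=Matern(length_scale=1.0, nu=2.5),
    n_restarts_optimizer=1,
    random_state=42,
)
gp.fit(X_scaled, coeffs)

# 4. Predict coefficients for new inputs and map them back to full fields.
X_new = rng.uniform(size=(5, n_scalars))
Y_pred = pca.inverse_transform(gp.predict(scaler.transform(X_new)))
print(Y_pred.shape)  # (5, 50)
```

The notebook's PLAID blocks play the same roles while operating on PLAID datasets and named features instead of raw arrays.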

📦 Imports¶

import warnings
warnings.filterwarnings('ignore', module='sklearn')
warnings.filterwarnings("ignore", message=".*IProgress not found.*")

import os
from pathlib import Path

import yaml
import numpy as np
import optuna

from datasets.utils.logging import disable_progress_bar

from sklearn.base import clone
from sklearn.pipeline import Pipeline

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

from sklearn.model_selection import KFold, GridSearchCV

from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, load_dataset_from_hub
from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor
from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer


disable_progress_bar()
# os.cpu_count() may return None, so fall back to 1 in that case
n_processes = min(max(1, os.cpu_count() or 1), 6)

📥 Load Dataset¶

We load the VKI-LS59 dataset from Hugging Face and restrict ourselves to the first 24 samples of the `all_samples` split, which serve as the training set in this notebook.

hf_dataset = load_dataset_from_hub("PLAID-datasets/VKI-LS59", split="all_samples[:24]")
dataset_train, pb_def = huggingface_dataset_to_plaid(hf_dataset, processes_number=n_processes, verbose=False)

We print the summary of dataset_train, which contains 24 samples with 8 scalars and 7 fields, consistent with the VKI-LS59 dataset:

print(dataset_train)

Dataset(24 samples, 8 scalars, 7 fields)

MinMaxScaler parameters:
	feature_range	(0, ...)
	copy	True
	clip	False

PCA parameters:
	n_components	None
	copy	True
	whiten	False
	svd_solver	'auto'
	tol	0.0
	iterated_power	'auto'
	n_oversamples	10
	power_iteration_normalizer	'auto'
	random_state	None

WrappedSklearnTransformer parameters (PCA block):
	sklearn_block	PCA()
	in_features_identifiers	[{'base_name': 'Base_2_2', 'name': 'mach', 'type': 'field'}]
	out_features_identifiers	[{'name': 'reduced_mach_*', 'type': 'scalar'}]

GaussianProcessRegressor parameters:
	kernel	Matern(length_scale=1, nu=2.5)
	alpha	1e-10
	optimizer	'fmin_l_bfgs_b'
	n_restarts_optimizer	1
	normalize_y	False
	copy_X_train	True
	n_targets	None
	random_state	42
	kernel__length_scale	1.0
	kernel__length_scale_bounds	(1e-08, ...)
	kernel__nu	2.5

⚙️ Pipeline Configuration¶

1. Preprocessor¶

2. Postprocessor¶

3. TransformedTargetRegressor¶

4. Pipeline assembling¶

🎯 Optuna hyperparameter tuning¶

🔍 GridSearchCV hyperparameter tuning¶

TransformedTargetRegressor parameters:
	regressor	WrappedSklear...om_state=42)))
	transformer	WrappedSklear...n_block=PCA())

Pipeline parameters:
	steps	[('preprocessor', ...), ('regressor', ...)]
	transform_input	None
	memory	None
	verbose	False

GaussianProcessRegressor parameters:
	kernel	Matern(length...1, 1], nu=2.5)
	alpha	1e-10
	optimizer	'fmin_l_bfgs_b'
	n_restarts_optimizer	1
	normalize_y	False
	copy_X_train	True
	n_targets	None
	random_state	42
	kernel__length_scale	array([1., 1...., 1., 1., 1.])
	kernel__length_scale_bounds	(1e-08, ...)
	kernel__nu	2.5
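
The array-valued `kernel__length_scale` above indicates an anisotropic Matern kernel, that is, one length scale per input feature. The sketch below shows one way to set this up in plain scikit-learn, with one independent Gaussian process per output wrapped in `MultiOutputRegressor`. The data are synthetic and the sizes are assumptions chosen for illustration; this is not the notebook's PLAID-wrapped configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Hypothetical sizes: 24 samples, 3 input features, 2 regression targets.
n_samples, n_inputs, n_outputs = 24, 3, 2
X = rng.uniform(size=(n_samples, n_inputs))
y = np.column_stack([np.sin(3.0 * X[:, 0]) + X[:, 1], X[:, 2] ** 2])

# One length scale per input feature makes the kernel anisotropic.
kernel = Matern(length_scale=np.ones(n_inputs), nu=2.5)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=1,
    random_state=42,
)

# MultiOutputRegressor clones the GP and fits one copy per output column.
model = MultiOutputRegressor(gpr)
model.fit(X, y)

for i, est in enumerate(model.estimators_):
    # kernel_ holds the kernel with the hyperparameters optimized during fit.
    print(i, est.kernel_.length_scale)
```

Because each output gets its own fitted kernel, the optimized length scales can differ between outputs.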

GridSearchCV parameters:
	estimator	Pipeline(step...ock=PCA())))])
	param_grid	[{'preprocessor__pca_node...arn_block__n_components': [3], 'regressor__regressor__...lock__estimator__kernel': [Matern(length...1, 1], nu=2.5)], 'regressor__transformer...arn_block__n_components': [4]}, {'preprocessor__pca_node...arn_block__n_components': [4], 'regressor__regressor__...lock__estimator__kernel': [Matern(length...1, 1], nu=2.5)], 'regressor__transformer...arn_block__n_components': [5]}]
	scoring	None
	n_jobs	None
	refit	True
	cv	KFold(n_split... shuffle=True)
	verbose	3
	pre_dispatch	'2*n_jobs'
	error_score	'raise'
	return_train_score	False
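
The `param_grid` above is a list of dictionaries, so each dictionary is searched as its own grid instead of forming the Cartesian product of all listed values. The following sketch illustrates that behaviour on a plain scikit-learn pipeline with synthetic data. The step names `scaler`, `pca`, and `gp`, as well as the candidate values, are assumptions chosen for the example and do not come from the notebook's PLAID pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic regression problem: 24 samples, 6 input features, 1 target.
X = rng.uniform(size=(24, 6))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("pca", PCA()),
    ("gp", GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, random_state=42)),
])

# Each dict is a separate grid: 2 candidates in total, not 2 x 2 = 4.
param_grid = [
    {"pca__n_components": [3], "gp__alpha": [1e-6]},
    {"pca__n_components": [4], "gp__alpha": [1e-4]},
]

cv = KFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, error_score="raise")
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 2
print(search.best_params_)
```

Pairing values this way keeps the search small when only specific combinations of hyperparameters are meaningful.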
