Pipeline Examples¶

This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks.

PCA-GP for `mach` field prediction of `VKI-LS59` dataset¶

Key steps covered:

Loading the PLAID dataset using Hugging Face integration and PLAID’s dataset classes
Standardizing features with PLAID-wrapped scikit-learn transformers for scalars
Dimensionality reduction of flow fields via Principal Component Analysis (PCA) to reduce output complexity
Regression modeling of PCA coefficients from scalar inputs using Gaussian Process regression
Pipeline assembly combining transformations and regressors into a single scikit-learn-compatible workflow
Hyperparameter tuning using Optuna and scikit-learn’s GridSearchCV
Best practices for working with PLAID datasets and pipelines in a reproducible and modular manner
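The key steps above can be sketched with plain scikit-learn on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: it uses scikit-learn's own `TransformedTargetRegressor` instead of PLAID's wrapped blocks, and the random inputs, the 50-value "field", and `n_components=4` are hypothetical stand-ins rather than anything taken from VKI-LS59.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, NOT the VKI-LS59 dataset):
# 24 samples, 2 scalar inputs, one "field" of 50 values per sample.
rng = np.random.default_rng(0)
X = rng.uniform(size=(24, 2))
grid = np.linspace(0.0, 1.0, 50)
Y = np.sin(2 * np.pi * np.outer(X[:, 0], grid)) + X[:, [1]] * grid

# Gaussian Process with an anisotropic Matern kernel (one length scale per input).
gp = GaussianProcessRegressor(
    kernel=Matern(length_scale=np.ones(2), nu=2.5),
    n_restarts_optimizer=1,
    random_state=42,
)

model = Pipeline(
    [
        # Scale the scalar inputs to [0, 1].
        ("scaler", MinMaxScaler()),
        # Fit the GP on PCA coefficients of the field, then map predictions
        # back to the full field with PCA.inverse_transform.
        (
            "regressor",
            TransformedTargetRegressor(
                regressor=MultiOutputRegressor(gp),
                transformer=PCA(n_components=4),
                check_inverse=False,  # truncated PCA is not exactly invertible
            ),
        ),
    ]
)

model.fit(X, Y)
Y_pred = model.predict(X)
print(Y_pred.shape)
```

The regressor only ever sees four PCA coefficients per sample, while `predict` returns arrays with the full field length, which is the same division of labour the PCA-GP approach relies on.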

📦 Imports¶

import warnings
warnings.filterwarnings('ignore', module='sklearn')
warnings.filterwarnings("ignore", message=".*IProgress not found.*")

import os
from pathlib import Path

import yaml
import numpy as np
import optuna

from datasets.utils.logging import disable_progress_bar
from datasets import load_dataset

from sklearn.base import clone
from sklearn.pipeline import Pipeline

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

from sklearn.model_selection import KFold, GridSearchCV

from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, huggingface_description_to_problem_definition
from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor
from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer

disable_progress_bar()
# os.cpu_count() can return None, so fall back to 1 before capping at 6 processes
n_processes = min(os.cpu_count() or 1, 6)

📥 Load Dataset¶

We load the VKI-LS59 dataset from Hugging Face and keep only the first 24 samples (`all_samples[:24]`), which serve as the training set in this example.

hf_dataset = load_dataset("PLAID-datasets/VKI-LS59", split="all_samples[:24]")
dataset_train, _ = huggingface_dataset_to_plaid(hf_dataset, processes_number=n_processes, verbose=False)

Printing `dataset_train` shows 24 samples with 8 scalars and 8 fields, consistent with the VKI-LS59 dataset:

print(dataset_train)

Dataset(24 samples, 8 scalars, 0 time_series, 8 fields)

MinMaxScaler parameters:

	feature_range	(0, ...)
	copy	True
	clip	False

PCA parameters:

	n_components	None
	copy	True
	whiten	False
	svd_solver	'auto'
	tol	0.0
	iterated_power	'auto'
	n_oversamples	10
	power_iteration_normalizer	'auto'
	random_state	None

WrappedSklearnTransformer parameters:

	sklearn_block	PCA()
	in_features_identifiers	[{'base_name': 'Base_2_2', 'name': 'mach', 'type': 'field'}]
	out_features_identifiers	[{'name': 'reduced_mach_*', 'type': 'scalar'}]

PCA parameters (the wrapped `sklearn_block`):

	n_components	None
	copy	True
	whiten	False
	svd_solver	'auto'
	tol	0.0
	iterated_power	'auto'
	n_oversamples	10
	power_iteration_normalizer	'auto'
	random_state	None

GaussianProcessRegressor parameters:

	kernel	Matern(length_scale=1, nu=2.5)
	alpha	1e-10
	optimizer	'fmin_l_bfgs_b'
	n_restarts_optimizer	1
	normalize_y	False
	copy_X_train	True
	n_targets	None
	random_state	42
	kernel__length_scale	1.0
	kernel__length_scale_bounds	(1e-08, ...)
	kernel__nu	2.5

⚙️ Pipeline Configuration¶

1. Preprocessor¶

2. Postprocessor¶

3. TransformedTargetRegressor¶

4. Pipeline assembling¶

🎯 Optuna hyperparameter tuning¶

🔍 GridSearchCV hyperparameter tuning¶

TransformedTargetRegressor parameters:

	regressor	WrappedSklear...om_state=42)))
	transformer	WrappedSklear...n_block=PCA())

Pipeline parameters:

	steps	[('preprocessor', ...), ('regressor', ...)]
	transform_input	None
	memory	None
	verbose	False

GaussianProcessRegressor parameters:

	kernel	Matern(length...1, 1], nu=2.5)
	alpha	1e-10
	optimizer	'fmin_l_bfgs_b'
	n_restarts_optimizer	1
	normalize_y	False
	copy_X_train	True
	n_targets	None
	random_state	42
	kernel__length_scale	array([1., 1...., 1., 1., 1.])
	kernel__length_scale_bounds	(1e-08, ...)
	kernel__nu	2.5

GridSearchCV parameters:

	estimator	Pipeline(step...ock=PCA())))])
	param_grid	[{'preprocessor__pca_node...arn_block__n_components': [3], 'regressor__regressor__...lock__estimator__kernel': [Matern(length...1, 1], nu=2.5)], 'regressor__transformer...arn_block__n_components': [4]}, {'preprocessor__pca_node...arn_block__n_components': [4], 'regressor__regressor__...lock__estimator__kernel': [Matern(length...1, 1], nu=2.5)], 'regressor__transformer...arn_block__n_components': [5]}]
	scoring	None
	n_jobs	None
	refit	True
	cv	KFold(n_split... shuffle=True)
	verbose	3
	pre_dispatch	'2*n_jobs'
	error_score	'raise'
	return_train_score	False
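
The grid search settings listed above (a `param_grid` given as a list of dictionaries, a shuffled `KFold`, and `error_score='raise'`) can be illustrated with a small plain scikit-learn sketch. One plausible reason for pairing values inside each dictionary, assumed here rather than confirmed by the notebook, is that an anisotropic Matern kernel needs one length scale per input dimension, so the kernel has to change together with the number of PCA components. The synthetic data, the step names `pca` and `gp`, `n_splits=4`, and `alpha=1e-6` are all hypothetical choices for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical): 24 samples, 6 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=24)

pipeline = Pipeline(
    [
        ("pca", PCA()),
        # alpha=1e-6 is an assumption made for numerical stability on noisy data.
        ("gp", GaussianProcessRegressor(alpha=1e-6, random_state=42)),
    ]
)

# A list of dictionaries: each dictionary is searched as its own grid, so the
# kernel dimension always matches the number of PCA components it is paired with.
param_grid = [
    {
        "pca__n_components": [k],
        "gp__kernel": [Matern(length_scale=np.ones(k), nu=2.5)],
    }
    for k in (3, 4)
]

cv = KFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, error_score="raise")
search.fit(X, y)

best = search.best_params_
print(best["pca__n_components"])
```

With a single dictionary holding both value lists, the search would also try mismatched combinations (for example three components with a four-dimensional kernel), which would fail at fit time and, with `error_score="raise"`, abort the search.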
