{ "cells": [ { "cell_type": "markdown", "id": "f3bf3005", "metadata": {}, "source": [ "# Pipeline Examples\n", "\n", "This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks." ] }, { "cell_type": "markdown", "id": "ab3feab4", "metadata": {}, "source": [ "## PCA-GP for `mach` field prediction of `VKI-LS59` dataset\n", "\n", "Key steps covered:\n", "\n", "- **Loading the PLAID dataset** using Hugging Face integration and PLAID’s dataset classes\n", "- **Standardizing features** with PLAID-wrapped scikit-learn transformers for scalars\n", "- **Dimensionality reduction** of flow fields via Principal Component Analysis (PCA) to reduce output complexity\n", "- **Regression modeling** of PCA coefficients from scalar inputs using Gaussian Process regression\n", "- **Pipeline assembly** combining transformations and regressors into a single scikit-learn-compatible workflow\n", "- **Hyperparameter tuning** using Optuna and scikit-learn’s `GridSearchCV`\n", "- **Best practices** for working with PLAID datasets and pipelines in a reproducible and modular manner" ] }, { "cell_type": "markdown", "id": "f187a3bc", "metadata": {}, "source": [ "### 📦 Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "b9ef7c66", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore', module='sklearn')\n", "warnings.filterwarnings(\"ignore\", message=\".*IProgress not found.*\")\n", "\n", "import os\n", "from pathlib import Path\n", "\n", "import yaml\n", "import numpy as np\n", "import optuna\n", "\n", "from datasets.utils.logging import disable_progress_bar\n", "\n", "from sklearn.base import clone\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.gaussian_process import GaussianProcessRegressor\n", "from 
sklearn.gaussian_process.kernels import Matern\n", "from sklearn.multioutput import MultiOutputRegressor\n", "\n", "from sklearn.model_selection import KFold, GridSearchCV\n", "\n", "from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, load_dataset_from_hub\n", "from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor\n", "from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer\n", "\n", "\n", "disable_progress_bar()\n", "# os.cpu_count() may return None, so fall back to 1; use at most 6 processes\n", "n_processes = min(max(1, os.cpu_count() or 1), 6)" ] }, { "cell_type": "markdown", "id": "f303b754", "metadata": {}, "source": [ "### 📥 Load Dataset\n", "\n", "We load the `VKI-LS59` dataset from Hugging Face and restrict ourselves to the first 24 samples of the training set." ] }, { "cell_type": "code", "execution_count": null, "id": "dfa4b7a3", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "hf_dataset = load_dataset_from_hub(\"PLAID-datasets/VKI-LS59\", split=\"all_samples[:24]\")\n", "dataset_train, pb_def = huggingface_dataset_to_plaid(hf_dataset, processes_number = n_processes, verbose = False)" ] }, { "cell_type": "markdown", "id": "84b6142c", "metadata": {}, "source": [ "We print the summary of `dataset_train`, which contains 24 samples with 8 scalars and 8 fields, consistent with the `VKI-LS59` dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "f2e8d088", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "print(dataset_train)" ] }, { "cell_type": "markdown", "id": "125cea5a", "metadata": {}, "source": [ "### ⚙️ Pipeline Configuration\n", "\n", "For convenience, the `in_features_identifiers` and `out_features_identifiers` for each pipeline block are defined in a `.yml` file. 
Here's an example of how the configuration might look:" ] }, { "cell_type": "markdown", "id": "92c23928", "metadata": {}, "source": [ "```yaml\n", "pca_nodes:\n", "  in_features_identifiers:\n", "    - type: nodes\n", "      base_name: Base_2_2\n", "  out_features_identifiers:\n", "    - type: scalar\n", "      name: reduced_nodes_*\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "008774ed", "metadata": {}, "outputs": [], "source": [ "try:\n", "    filename = Path(__file__).parent.parent.parent / \"examples\" / \"pipelines\" / \"config_pipeline.yml\"\n", "except NameError:\n", "    filename = \"config_pipeline.yml\"\n", "\n", "with open(filename, 'r') as f:\n", "    config = yaml.safe_load(f)\n", "\n", "all_feature_id = config['input_scalar_scaler']['in_features_identifiers'] +\\\n", "    config['pca_nodes']['in_features_identifiers'] + config['pca_mach']['in_features_identifiers']" ] }, { "cell_type": "markdown", "id": "a7bf76c9", "metadata": {}, "source": [ "In this example, we aim to predict the `mach` field from two input scalars, `angle_in` and `mach_out`, and the mesh node coordinates. To limit memory consumption, we restrict the dataset to the features required for this example:" ] }, { "cell_type": "code", "execution_count": null, "id": "14889cb0", "metadata": {}, "outputs": [], "source": [ "dataset_train = dataset_train.extract_dataset_from_identifier(all_feature_id)\n", "print(\"dataset_train =\", dataset_train)\n", "print(\"scalar names =\", dataset_train.get_scalar_names())\n", "print(\"field names =\", dataset_train.get_field_names())" ] }, { "cell_type": "markdown", "id": "e3fa6a5a", "metadata": {}, "source": [ "We notice that only the two scalars and the field of interest are kept after restriction." ] }, { "cell_type": "markdown", "id": "51009857", "metadata": {}, "source": [ "#### 1. 
Preprocessor\n", "\n", "We now define a preprocessor: a `MinMaxScaler` applied to the two input scalars and a `PCA` applied to the mesh node coordinates:" ] }, { "cell_type": "code", "execution_count": null, "id": "719c8837", "metadata": {}, "outputs": [], "source": [ "preprocessor = ColumnTransformer(\n", "    [\n", "        ('input_scalar_scaler', WrappedSklearnTransformer(MinMaxScaler(), **config['input_scalar_scaler'])),\n", "        ('pca_nodes', WrappedSklearnTransformer(PCA(), **config['pca_nodes'])),\n", "    ]\n", ")\n", "preprocessor" ] }, { "cell_type": "markdown", "id": "acb1831b", "metadata": {}, "source": [ "We use PLAID’s `ColumnTransformer` to apply independent transformations to different feature groups.\n", "\n", "To verify this behavior, we apply the `preprocessor` to `dataset_train`:" ] }, { "cell_type": "code", "execution_count": null, "id": "2eaea76c", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "preprocessed_dataset = preprocessor.fit_transform(dataset_train)\n", "print(\"preprocessed_dataset:\", preprocessed_dataset)\n", "print(\"scalar names =\", preprocessed_dataset.get_scalar_names())\n", "print(\"field names =\", preprocessed_dataset.get_field_names())" ] }, { "cell_type": "markdown", "id": "c4572512", "metadata": {}, "source": [ "Using `MinMaxScaler`, we scaled the `angle_in` and `mach_out` features, replacing their original values. In contrast, `PCA` compressed the node coordinates and produced new scalar features named `reduced_nodes_*`, representing the PCA components. Alternatively, we could have specified `out_features_identifiers` in the `.yml` file configuring the `MinMaxScaler` block to generate new scalars without overwriting the original inputs." ] }, { "cell_type": "markdown", "id": "908abbb3", "metadata": {}, "source": [ "#### 2. 
Postprocessor\n", "\n", "Next, we define the postprocessor, which applies PCA to the `mach` field:" ] }, { "cell_type": "code", "execution_count": null, "id": "028ce806", "metadata": {}, "outputs": [], "source": [ "postprocessor = WrappedSklearnTransformer(PCA(), **config['pca_mach'])\n", "postprocessor" ] }, { "cell_type": "markdown", "id": "1f2849a2", "metadata": {}, "source": [ "#### 3. TransformedTargetRegressor\n", "\n", "The Gaussian Process regressor takes the transformed `angle_in` and `mach_out` scalars, along with the PCA coefficients of the mesh node coordinates, as inputs and predicts the PCA coefficients of the `mach` field as outputs. This is handled by PLAID’s `TransformedTargetRegressor`." ] }, { "cell_type": "code", "execution_count": null, "id": "46cc9cde", "metadata": {}, "outputs": [], "source": [ "kernel = Matern(length_scale_bounds=(1e-8, 1e8), nu = 2.5)\n", "\n", "gpr = GaussianProcessRegressor(\n", "    kernel=kernel,\n", "    optimizer='fmin_l_bfgs_b',\n", "    n_restarts_optimizer=1,\n", "    random_state=42)\n", "\n", "reg = MultiOutputRegressor(gpr)\n", "\n", "regressor = WrappedSklearnRegressor(reg, **config['regressor_mach'])\n", "\n", "target_regressor = TransformedTargetRegressor(\n", "    regressor=regressor,\n", "    transformer=postprocessor\n", ")\n", "target_regressor" ] }, { "cell_type": "markdown", "id": "6f91a108", "metadata": {}, "source": [ "PLAID’s `TransformedTargetRegressor` functions like the scikit-learn class of the same name but operates directly on PLAID datasets." ] }, { "cell_type": "markdown", "id": "527df515", "metadata": {}, "source": [ "#### 4. 
Pipeline assembly\n", "\n", "We then define the complete pipeline as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "e67c710c", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "pipeline = Pipeline(\n", "    steps=[\n", "        (\"preprocessor\", preprocessor),\n", "        (\"regressor\", target_regressor),\n", "    ]\n", ")\n", "pipeline" ] }, { "cell_type": "markdown", "id": "5764a9fb", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### 🎯 Optuna hyperparameter tuning\n", "\n", "We now use Optuna to optimize hyperparameters, specifically tuning the number of components for the two `PCA` blocks using three-fold cross-validation." ] }, { "cell_type": "code", "execution_count": null, "id": "9cce44c3", "metadata": {}, "outputs": [], "source": [ "def objective(trial):\n", "    # Suggest hyperparameters\n", "    nodes_n_components = trial.suggest_int(\"preprocessor__pca_nodes__sklearn_block__n_components\", 3, 4)\n", "    mach_n_components = trial.suggest_int(\"regressor__transformer__sklearn_block__n_components\", 4, 5)\n", "\n", "    # Clone and configure pipeline\n", "    # The anisotropic Matern kernel needs one length scale per regressor input\n", "    # (PCA coefficients of the nodes + input scalars)\n", "    pipeline_run = clone(pipeline)\n", "    pipeline_run.set_params(\n", "        preprocessor__pca_nodes__sklearn_block__n_components=nodes_n_components,\n", "        regressor__transformer__sklearn_block__n_components=mach_n_components,\n", "        regressor__regressor__sklearn_block__estimator__kernel=Matern(\n", "            length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(nodes_n_components + len(config['input_scalar_scaler']['in_features_identifiers']))\n", "        )\n", "    )\n", "\n", "    # Manual 3-fold cross-validation over the sample indices\n", "    cv = KFold(n_splits=3, shuffle=True, random_state=42)\n", "\n", "    scores = []\n", "\n", "    indices = np.arange(len(dataset_train))\n", "\n", "    for train_idx, val_idx in cv.split(indices):\n", "\n", "        dataset_cv_train_ = dataset_train[train_idx]\n", "        dataset_cv_val_ = dataset_train[val_idx]\n", "\n", "        pipeline_run.fit(dataset_cv_train_)\n", "\n", "        score = pipeline_run.score(dataset_cv_val_)\n", "\n", "        scores.append(score)\n", "\n", "    # Mean validation score across folds (maximized by Optuna)\n", "    return np.mean(scores)" ] }, { "cell_type": "markdown", "id": "ace22166", "metadata": { "lines_to_next_cell": 2 }, "source": [ "We maximize this objective function over 4 Optuna trials." ] }, { "cell_type": "code", "execution_count": null, "id": "3a411b13", "metadata": {}, "outputs": [], "source": [ "study = optuna.create_study(direction='maximize')\n", "study.optimize(objective, n_trials=4)\n", "print(\"best_params =\", study.best_params)" ] }, { "cell_type": "markdown", "id": "2d6e9a27", "metadata": {}, "source": [ "We retrieve the best hyperparameters found by Optuna, use them to define the `optimized_pipeline`, and fit it on `dataset_train`." ] }, { "cell_type": "code", "execution_count": null, "id": "b2ae19c6", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "optimized_pipeline = clone(pipeline).set_params(**study.best_params)\n", "optimized_pipeline.set_params(regressor__regressor__sklearn_block__estimator__kernel=Matern(\n", "        length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'] + len(config['input_scalar_scaler']['in_features_identifiers']))\n", "    )\n", ")\n", "\n", "optimized_pipeline.fit(dataset_train)" ] }, { "cell_type": "markdown", "id": "775933b9", "metadata": {}, "source": [ "Next, we evaluate the fitted `optimized_pipeline` on the same `dataset_train` data." 
] }, { "cell_type": "code", "execution_count": null, "id": "c2378014", "metadata": {}, "outputs": [], "source": [ "dataset_pred = optimized_pipeline.predict(dataset_train)\n", "score = optimized_pipeline.score(dataset_train)\n", "print(\"score =\", score, \", error =\", 1. - score)" ] }, { "cell_type": "markdown", "id": "a6592692", "metadata": {}, "source": [ "We use an anisotropic kernel in the Gaussian Process. Its optimized `length_scale` is a vector whose length equals 2 plus the number of PCA components set by `preprocessor__pca_nodes__sklearn_block__n_components`, where the 2 accounts for the two input scalars." ] }, { "cell_type": "code", "execution_count": null, "id": "693955c4", "metadata": {}, "outputs": [], "source": [ "print(optimized_pipeline.named_steps[\"regressor\"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale'])" ] }, { "cell_type": "code", "execution_count": null, "id": "5f293c69", "metadata": {}, "outputs": [], "source": [ "print(\"Dimension GP kernel length_scale =\", len(optimized_pipeline.named_steps[\"regressor\"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale']))\n", "print(\"Expected dimension =\", 2 + study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'])" ] }, { "cell_type": "markdown", "id": "6e334938", "metadata": {}, "source": [ "The error remains non-zero due to the approximation introduced by PCA. Since the Gaussian Process regressor interpolates, the error is expected to vanish on the training set if all PCA modes are retained." 
] }, { "cell_type": "code", "execution_count": null, "id": "7d29b639", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "exact_pipeline = clone(pipeline).set_params(\n", " preprocessor__pca_nodes__sklearn_block__n_components = 24,\n", " regressor__transformer__sklearn_block__n_components = 24\n", ")\n", "exact_pipeline.fit(dataset_train)\n", "dataset_pred = exact_pipeline.predict(dataset_train)\n", "score = exact_pipeline.score(dataset_train)\n", "print(\"score =\", score, \", error =\", 1. - score)" ] }, { "cell_type": "markdown", "id": "201e9167", "metadata": {}, "source": [ "### 🔍 GridSearchCV hyperparameter tuning\n", "\n", "Since our pipeline nodes conform to the scikit-learn API, the constructed pipeline can be used directly with `GridSearchCV`." ] }, { "cell_type": "code", "execution_count": null, "id": "e5339ada", "metadata": {}, "outputs": [], "source": [ "pca_n_components = [3, 4]\n", "regressor_n_components = [4, 5]\n", "\n", "param_grid = []\n", "for n, m in zip(pca_n_components, regressor_n_components):\n", " param_grid.append(\n", " {\n", " \"preprocessor__pca_nodes__sklearn_block__n_components\": [n],\n", " \"regressor__transformer__sklearn_block__n_components\": [m],\n", " \"regressor__regressor__sklearn_block__estimator__kernel\": [\n", " Matern(\n", " length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(n + 2)\n", " )\n", " ],\n", " }\n", " )\n", "\n", "cv = KFold(n_splits=3, shuffle=True, random_state=42)\n", "search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, verbose=3, error_score='raise')\n", "\n", "search.fit(dataset_train)" ] }, { "cell_type": "markdown", "id": "96ddf769", "metadata": {}, "source": [ "We evaluate the performance of the optimized pipeline by computing its score on the training set." 
] }, { "cell_type": "code", "execution_count": null, "id": "ed8780a3", "metadata": {}, "outputs": [], "source": [ "print(\"best_params =\", search.best_params_)\n", "optimized_pipeline = clone(pipeline).set_params(**search.best_params_)\n", "optimized_pipeline.fit(dataset_train)\n", "dataset_pred = optimized_pipeline.predict(dataset_train)\n", "score = optimized_pipeline.score(dataset_train)\n", "print(\"score =\", score, \", error =\", 1. - score)" ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "plaid-dev", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }