{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f3bf3005",
   "metadata": {},
   "source": [
    "# Pipeline Examples\n",
    "\n",
    "This notebook demonstrates the end-to-end process of building a machine learning pipeline using PLAID datasets and PLAID’s scikit-learn-compatible blocks."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab3feab4",
   "metadata": {},
   "source": [
    "## PCA-GP for `mach` field prediction of `VKI-LS59` dataset\n",
    "\n",
    "Key steps covered:\n",
    "\n",
    "- **Loading the PLAID dataset** using Hugging Face integration and PLAID’s dataset classes\n",
    "- **Standardizing features** with PLAID-wrapped scikit-learn transformers for scalars\n",
    "- **Dimensionality reduction** of flow fields via Principal Component Analysis (PCA) to reduce output complexity\n",
    "- **Regression modeling** of PCA coefficients from scalar inputs using Gaussian Process regression\n",
    "- **Pipeline assembly** combining transformations and regressors into a single scikit-learn-compatible workflow\n",
    "- **Hyperparameter tuning** using Optuna and scikit-learn’s `GridSearchCV`\n",
    "- **Best practices** for working with PLAID datasets and pipelines in a reproducible and modular manner"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f187a3bc",
   "metadata": {},
   "source": [
    "### 📦 Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9ef7c66",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore', module='sklearn')\n",
    "warnings.filterwarnings(\"ignore\", message=\".*IProgress not found.*\")\n",
    "\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import yaml\n",
    "import numpy as np\n",
    "import optuna\n",
    "\n",
    "from datasets.utils.logging import disable_progress_bar\n",
    "\n",
    "from sklearn.base import clone\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.gaussian_process import GaussianProcessRegressor\n",
    "from sklearn.gaussian_process.kernels import Matern\n",
    "from sklearn.multioutput import MultiOutputRegressor\n",
    "\n",
    "from sklearn.model_selection import KFold, GridSearchCV\n",
    "\n",
    "from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, load_dataset_from_hub\n",
    "from plaid.pipelines.sklearn_block_wrappers import WrappedSklearnTransformer, WrappedSklearnRegressor\n",
    "from plaid.pipelines.plaid_blocks import TransformedTargetRegressor, ColumnTransformer\n",
    "\n",
    "\n",
    "disable_progress_bar()\n",
    "n_processes = min(os.cpu_count() or 1, 6)  # os.cpu_count() may return None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f303b754",
   "metadata": {},
   "source": [
    "### 📥 Load Dataset\n",
    "\n",
    "We load the `VKI-LS59` dataset from Hugging Face and restrict ourselves to the first 24 samples (`all_samples[:24]`), which we use as the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfa4b7a3",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "hf_dataset = load_dataset_from_hub(\"PLAID-datasets/VKI-LS59\", split=\"all_samples[:24]\")\n",
    "dataset_train, pb_def = huggingface_dataset_to_plaid(hf_dataset, processes_number = n_processes, verbose = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84b6142c",
   "metadata": {},
   "source": [
    "We print the summary of `dataset_train`, which contains 24 samples with 8 scalars and 8 fields, consistent with the `VKI-LS59` dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2e8d088",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "print(dataset_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "125cea5a",
   "metadata": {},
   "source": [
    "### ⚙️ Pipeline Configuration\n",
    "\n",
    "For convenience, the `in_features_identifiers` and `out_features_identifiers` for each pipeline block are defined in a `.yml` file. Here's an example of how the configuration might look:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92c23928",
   "metadata": {},
   "source": [
    "```yaml\n",
    "pca_nodes:\n",
    "  in_features_identifiers:\n",
    "    - type: nodes\n",
    "      base_name: Base_2_2\n",
    "  out_features_identifiers:\n",
    "    - type: scalar\n",
    "      name: reduced_nodes_*\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "008774ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    filename = Path(__file__).parent.parent.parent / \"examples\" / \"pipelines\" / \"config_pipeline.yml\"\n",
    "except NameError:\n",
    "    filename = \"config_pipeline.yml\"\n",
    "\n",
    "with open(filename, 'r') as f:\n",
    "    config = yaml.safe_load(f)\n",
    "\n",
    "all_feature_id = config['input_scalar_scaler']['in_features_identifiers'] +\\\n",
    "    config['pca_nodes']['in_features_identifiers'] + config['pca_mach']['in_features_identifiers']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7bf76c9",
   "metadata": {},
   "source": [
    "In this example, we aim to predict the `mach` field from two input scalars, `angle_in` and `mach_out`, and the mesh node coordinates. To limit memory consumption, we restrict the dataset to the features required for this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14889cb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_train = dataset_train.extract_dataset_from_identifier(all_feature_id)\n",
    "print(\"dataset_train =\", dataset_train)\n",
    "print(\"scalar names =\", dataset_train.get_scalar_names())\n",
    "print(\"field names =\", dataset_train.get_field_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3fa6a5a",
   "metadata": {},
   "source": [
    "We notice that only the 2 scalars and the field of interest are kept after restriction."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51009857",
   "metadata": {},
   "source": [
    "#### 1. Preprocessor\n",
    "\n",
    "We now define a preprocessor: a `MinMaxScaler` applied to the 2 input scalars and a `PCA` applied to the node coordinates of the meshes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "719c8837",
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessor = ColumnTransformer(\n",
    "    [\n",
    "        ('input_scalar_scaler', WrappedSklearnTransformer(MinMaxScaler(), **config['input_scalar_scaler'])),\n",
    "        ('pca_nodes', WrappedSklearnTransformer(PCA(), **config['pca_nodes'])),\n",
    "    ]\n",
    ")\n",
    "preprocessor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acb1831b",
   "metadata": {},
   "source": [
    "We use PLAID's `ColumnTransformer` to apply independent transformations to different feature groups.\n",
    "\n",
    "To verify this behavior, we apply the `preprocessor` to `dataset_train`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2eaea76c",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "preprocessed_dataset = preprocessor.fit_transform(dataset_train)\n",
    "print(\"preprocessed_dataset:\", preprocessed_dataset)\n",
    "print(\"scalar names =\", preprocessed_dataset.get_scalar_names())\n",
    "print(\"field names =\", preprocessed_dataset.get_field_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4572512",
   "metadata": {},
   "source": [
    "Using `MinMaxScaler`, we scaled the `angle_in` and `mach_out` features, replacing their original values. In contrast, `PCA` compressed the node coordinates and produced new scalar features named `reduced_nodes_*`, representing the PCA components. Alternatively, we could have specified `out_features_identifiers` in the `.yml` file configuring the `MinMaxScaler` block to generate new scalars without overwriting the original inputs."
   ]
  },
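  {
   "cell_type": "markdown",
   "id": "9a1c4e2b",
   "metadata": {},
   "source": [
    "As a rough sketch of that alternative, the `input_scalar_scaler` entry could declare explicit output identifiers, as below. The identifier layout is assumed to mirror the `pca_nodes` example above, and the names `scaled_angle_in` and `scaled_mach_out` are hypothetical, not taken from the actual `config_pipeline.yml`. Any downstream block consuming the scaled scalars (for instance the regressor's `in_features_identifiers`) would presumably need to reference the new names as well.\n",
    "\n",
    "```yaml\n",
    "input_scalar_scaler:\n",
    "  in_features_identifiers:\n",
    "    - type: scalar\n",
    "      name: angle_in\n",
    "    - type: scalar\n",
    "      name: mach_out\n",
    "  # Hypothetical output names: the original scalars would be left untouched\n",
    "  out_features_identifiers:\n",
    "    - type: scalar\n",
    "      name: scaled_angle_in\n",
    "    - type: scalar\n",
    "      name: scaled_mach_out\n",
    "```"
   ]
  },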
  {
   "cell_type": "markdown",
   "id": "908abbb3",
   "metadata": {},
   "source": [
    "#### 2. Postprocessor\n",
    "\n",
    "Next, we define the postprocessor, which applies PCA to the `mach` field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "028ce806",
   "metadata": {},
   "outputs": [],
   "source": [
    "postprocessor = WrappedSklearnTransformer(PCA(), **config['pca_mach'])\n",
    "postprocessor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2849a2",
   "metadata": {},
   "source": [
    "#### 3. TransformedTargetRegressor\n",
    "\n",
    "The Gaussian Process regressor takes the scaled `angle_in` and `mach_out` scalars, along with the PCA coefficients of the mesh node coordinates, as inputs and predicts the PCA coefficients of the `mach` field as outputs. PLAID's `TransformedTargetRegressor` handles the target transformation and its inverse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46cc9cde",
   "metadata": {},
   "outputs": [],
   "source": [
    "kernel = Matern(length_scale_bounds=(1e-8, 1e8), nu = 2.5)\n",
    "\n",
    "gpr = GaussianProcessRegressor(\n",
    "    kernel=kernel,\n",
    "    optimizer='fmin_l_bfgs_b',\n",
    "    n_restarts_optimizer=1,\n",
    "    random_state=42)\n",
    "\n",
    "reg = MultiOutputRegressor(gpr)\n",
    "\n",
    "regressor = WrappedSklearnRegressor(reg, **config['regressor_mach'])\n",
    "\n",
    "target_regressor = TransformedTargetRegressor(\n",
    "    regressor=regressor,\n",
    "    transformer=postprocessor\n",
    ")\n",
    "target_regressor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f91a108",
   "metadata": {},
   "source": [
    "PLAID's `TransformedTargetRegressor` functions like scikit-learn’s `TransformedTargetRegressor` but operates directly on PLAID datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "527df515",
   "metadata": {},
   "source": [
    "#### 4. Pipeline assembly\n",
    "\n",
    "We then define the complete pipeline as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e67c710c",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "pipeline = Pipeline(\n",
    "    steps=[\n",
    "        (\"preprocessor\", preprocessor),\n",
    "        (\"regressor\", target_regressor),\n",
    "    ]\n",
    ")\n",
    "pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5764a9fb",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "### 🎯 Optuna hyperparameter tuning\n",
    "\n",
    "We now use Optuna to optimize hyperparameters, specifically tuning the number of components for the two `PCA` blocks using three-fold cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cce44c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def objective(trial):\n",
    "    # Suggest hyperparameters\n",
    "    nodes_n_components = trial.suggest_int(\"preprocessor__pca_nodes__sklearn_block__n_components\", 3, 4)\n",
    "    mach_n_components = trial.suggest_int(\"regressor__transformer__sklearn_block__n_components\", 4, 5)\n",
    "\n",
    "    # Clone and configure pipeline\n",
    "    pipeline_run = clone(pipeline)\n",
    "    pipeline_run.set_params(\n",
    "        preprocessor__pca_nodes__sklearn_block__n_components=nodes_n_components,\n",
    "        regressor__transformer__sklearn_block__n_components=mach_n_components,\n",
    "        # Anisotropic kernel: one length scale per regressor input (scalars + node PCA coefficients)\n",
    "        regressor__regressor__sklearn_block__estimator__kernel=Matern(\n",
    "            length_scale_bounds=(1e-8, 1e8),\n",
    "            nu=2.5,\n",
    "            length_scale=np.ones(nodes_n_components + len(config['input_scalar_scaler']['in_features_identifiers'])),\n",
    "        ),\n",
    "    )\n",
    "\n",
    "    cv = KFold(n_splits=3, shuffle=True, random_state=42)\n",
    "\n",
    "    scores = []\n",
    "\n",
    "    indices = np.arange(len(dataset_train))\n",
    "\n",
    "    for train_idx, val_idx in cv.split(indices):\n",
    "\n",
    "        dataset_cv_train_ = dataset_train[train_idx]\n",
    "        dataset_cv_val_   = dataset_train[val_idx]\n",
    "\n",
    "        pipeline_run.fit(dataset_cv_train_)\n",
    "\n",
    "        score = pipeline_run.score(dataset_cv_val_)\n",
    "\n",
    "        scores.append(score)\n",
    "\n",
    "    return np.mean(scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ace22166",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "We maximize the defined objective function over 4 trials selected by Optuna."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a411b13",
   "metadata": {},
   "outputs": [],
   "source": [
    "study = optuna.create_study(direction='maximize')\n",
    "study.optimize(objective, n_trials=4)\n",
    "print(\"best_params =\", study.best_params)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d6e9a27",
   "metadata": {},
   "source": [
    "We retrieve the best hyperparameters found by Optuna, use them to define the `optimized_pipeline`, and fit it on `dataset_train`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2ae19c6",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "best_n_nodes = study.best_params['preprocessor__pca_nodes__sklearn_block__n_components']\n",
    "n_scalars = len(config['input_scalar_scaler']['in_features_identifiers'])\n",
    "\n",
    "optimized_pipeline = clone(pipeline).set_params(**study.best_params)\n",
    "# Anisotropic kernel: one length scale per regressor input (scalars + node PCA coefficients)\n",
    "optimized_pipeline.set_params(\n",
    "    regressor__regressor__sklearn_block__estimator__kernel=Matern(\n",
    "        length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(best_n_nodes + n_scalars)\n",
    "    )\n",
    ")\n",
    "\n",
    "optimized_pipeline.fit(dataset_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "775933b9",
   "metadata": {},
   "source": [
    "Next, we evaluate the performance of the fitted `optimized_pipeline` on the same `dataset_train` data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2378014",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_pred = optimized_pipeline.predict(dataset_train)\n",
    "score = optimized_pipeline.score(dataset_train)\n",
    "print(\"score =\", score, \", error =\", 1. - score)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6592692",
   "metadata": {},
   "source": [
    "We use an anisotropic kernel in the Gaussian Process. Its optimized `length_scale` is a vector whose dimension equals 2 plus the number of PCA components set by `preprocessor__pca_nodes__sklearn_block__n_components`, where the 2 accounts for the two input scalars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "693955c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(optimized_pipeline.named_steps[\"regressor\"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f293c69",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Dimension GP kernel length_scale =\", len(optimized_pipeline.named_steps[\"regressor\"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale']))\n",
    "print(\"Expected dimension =\", 2 + study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e334938",
   "metadata": {},
   "source": [
    "The training error of the optimized pipeline remains non-zero because truncating the PCA is lossy. Since the Gaussian Process regressor interpolates its training data, the error is expected to vanish on the training set when all PCA modes are retained (here 24, the number of training samples)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d29b639",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "exact_pipeline = clone(pipeline).set_params(\n",
    "    preprocessor__pca_nodes__sklearn_block__n_components = 24,\n",
    "    regressor__transformer__sklearn_block__n_components = 24\n",
    ")\n",
    "exact_pipeline.fit(dataset_train)\n",
    "dataset_pred = exact_pipeline.predict(dataset_train)\n",
    "score = exact_pipeline.score(dataset_train)\n",
    "print(\"score =\", score, \", error =\", 1. - score)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "201e9167",
   "metadata": {},
   "source": [
    "### 🔍 GridSearchCV hyperparameter tuning\n",
    "\n",
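    "Since our pipeline blocks conform to the scikit-learn API, the constructed pipeline can be used directly with `GridSearchCV`. Because the dimension of the kernel's `length_scale` depends on the number of node PCA components, the search space is built as a list of paired configurations rather than a full Cartesian product."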
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5339ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "pca_n_components = [3, 4]\n",
    "regressor_n_components = [4, 5]\n",
    "\n",
    "param_grid = []\n",
    "for n, m in zip(pca_n_components, regressor_n_components):\n",
    "    param_grid.append(\n",
    "        {\n",
    "            \"preprocessor__pca_nodes__sklearn_block__n_components\": [n],\n",
    "            \"regressor__transformer__sklearn_block__n_components\": [m],\n",
    "            \"regressor__regressor__sklearn_block__estimator__kernel\": [\n",
    "                Matern(\n",
    "                    length_scale_bounds=(1e-8, 1e8), nu=2.5, length_scale=np.ones(n + 2)\n",
    "                )\n",
    "            ],\n",
    "        }\n",
    "    )\n",
    "\n",
    "cv = KFold(n_splits=3, shuffle=True, random_state=42)\n",
    "search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, verbose=3, error_score='raise')\n",
    "\n",
    "search.fit(dataset_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96ddf769",
   "metadata": {},
   "source": [
    "We refit the pipeline with the best parameters found by `GridSearchCV` and evaluate it by computing its score on the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed8780a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"best_params =\", search.best_params_)\n",
    "optimized_pipeline = clone(pipeline).set_params(**search.best_params_)\n",
    "optimized_pipeline.fit(dataset_train)\n",
    "dataset_pred = optimized_pipeline.predict(dataset_train)\n",
    "score = optimized_pipeline.score(dataset_train)\n",
    "print(\"score =\", score, \", error =\", 1. - score)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,py:percent"
  },
  "kernelspec": {
   "display_name": "plaid-dev",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}