{ "cells": [ { "cell_type": "markdown", "id": "422cd865", "metadata": {}, "source": [ "# Dataset Splitting Examples\n", "\n", "This Jupyter Notebook demonstrates the usage of the split module using the PLAID library. It includes examples of:\n", "\n", "1. Initializing a Dataset\n", "2. Splitting a Dataset with ratios\n", "3. Splitting a Dataset with fixed sizes\n", "4. Splitting a Dataset with ratio and fixed Sizes\n", "5. Splitting a Dataset with custom split IDs\n", "\n", "This example demonstrates the usage of dataset splitting functions to divide a dataset into training, validation, and test sets. It provides examples of splitting the dataset using different methods and configurations.\n", "\n", "**Each section is documented and explained.**" ] }, { "cell_type": "code", "execution_count": null, "id": "dc91ab3c", "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "id": "924528bd", "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries and functions\n", "from plaid.utils.init_with_tabular import initialize_dataset_with_tabular_data\n", "from plaid.utils.split import split_dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "2c6ceaf7", "metadata": {}, "outputs": [], "source": [ "# Print dict util\n", "def dprint(name: str, dictio: dict):\n", " print(name, \"{\")\n", " for key, value in dictio.items():\n", " print(\" \", key, \":\", value)\n", "\n", " print(\"}\")" ] }, { "cell_type": "markdown", "id": "953c9c92", "metadata": {}, "source": [ "## Section 1: Initialize Dataset\n", "\n", "In this section, we create a dataset with random tabular data for testing purposes. The dataset will be used for subsequent splitting." ] }, { "cell_type": "code", "execution_count": null, "id": "68e19523", "metadata": {}, "outputs": [], "source": [ "# Create a dataset with random tabular data for testing purposes\n", "nb_scalars = 7\n", "nb_samples = 70\n", "tabular_data = {f\"scalar_{j}\": np.random.randn(nb_samples) for j in range(nb_scalars)}\n", "dataset = initialize_dataset_with_tabular_data(tabular_data)\n", "\n", "print(f\"{dataset = }\")" ] }, { "cell_type": "markdown", "id": "d23da8e5", "metadata": {}, "source": [ "## Section 2: Splitting a Dataset with Ratios\n", "\n", "In this section, we split the dataset into training, validation, and test sets using specified ratios. We also have the option to shuffle the dataset during the split process." ] }, { "cell_type": "code", "execution_count": null, "id": "fc8edb1b", "metadata": {}, "outputs": [], "source": [ "print(\"# First split\")\n", "options = {\n", " \"shuffle\": True,\n", " \"split_ratios\": {\n", " \"train\": 0.8,\n", " \"val\": 0.1,\n", " },\n", "}\n", "\n", "split = split_dataset(dataset, options)\n", "dprint(\"split =\", split)" ] }, { "cell_type": "markdown", "id": "d0867b0d", "metadata": {}, "source": [ "## Section 3: Splitting a Dataset with Fixed Sizes\n", "\n", "In this section, we split the dataset into training, validation, and test sets with fixed sample counts for each set. We can also choose to shuffle the dataset during the split." ] }, { "cell_type": "code", "execution_count": null, "id": "c9a132c3", "metadata": {}, "outputs": [], "source": [ "print(\"# Second split\")\n", "options = {\n", " \"shuffle\": True,\n", " \"split_sizes\": {\n", " \"train\": 14,\n", " \"val\": 8,\n", " \"test\": 5,\n", " },\n", "}\n", "\n", "split = split_dataset(dataset, options)\n", "dprint(\"split =\", split)" ] }, { "cell_type": "markdown", "id": "a6a3e79f", "metadata": {}, "source": [ "## Section 4: Splitting a Dataset with Ratios and Fixed Sizes\n", "\n", "In this section, we split the dataset into training, validation, and test sets with fixed sample counts and sample ratios for each set. We can also choose to shuffle the dataset during the split." ] }, { "cell_type": "code", "execution_count": null, "id": "4679a64b", "metadata": {}, "outputs": [], "source": [ "print(\"# Third split\")\n", "options = {\n", " \"shuffle\": True,\n", " \"split_ratios\": {\n", " \"train\": 0.7,\n", " \"test\": 0.1,\n", " },\n", " \"split_sizes\": {\"val\": 7},\n", "}\n", "\n", "split = split_dataset(dataset, options)\n", "dprint(\"split =\", split)" ] }, { "cell_type": "markdown", "id": "adea24a3", "metadata": {}, "source": [ "## Section 5: Splitting a Dataset with Custom Split IDs\n", "\n", "In this section, we split the dataset based on custom sample IDs for each set. We can specify the sample IDs for training, validation, and prediction sets." ] }, { "cell_type": "code", "execution_count": null, "id": "9f40f756", "metadata": {}, "outputs": [], "source": [ "print(\"# Fourth split\")\n", "options = {\n", " \"split_ids\": {\n", " \"train\": np.arange(20),\n", " \"val\": np.arange(30, 60),\n", " \"predict\": np.arange(25, 35),\n", " },\n", "}\n", "\n", "split = split_dataset(dataset, options)\n", "dprint(\"split =\", split)" ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }