{ "cells": [ { "cell_type": "markdown", "id": "18760207", "metadata": {}, "source": [ "# Hugging Face support\n", "\n", "IMPORTANT NOTICE: THIS CODE IS STILL FUNCTIONAL, BUT IS DEPRECATED. NEW DATA HANDLING DETAILED IN STORAGE DESCRIPTIONS.\n", "\n", "This Jupyter Notebook demonstrates various operations involving the Hugging Face bridge:\n", "\n", "1. Converting a plaid dataset to Hugging Face\n", "2. Generating a Hugging Face dataset with a generator\n", "3. Converting a Hugging Face dataset to plaid\n", "4. Saving and Loading Hugging Face datasets\n", "5. Handling plaid samples from Hugging Face datasets without converting the complete dataset to plaid\n", "6. Advanced concepts (read speed, memory usage, streaming)\n", "\n", "\n", "**Each section is documented and explained.**" ] }, { "cell_type": "code", "execution_count": null, "id": "f1f9d428", "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries and functions\n", "import os, psutil\n", "import tempfile\n", "import shutil\n", "from time import time\n", "from functools import partial\n", "\n", "import numpy as np\n", "from Muscat.Bridges.CGNSBridge import MeshToCGNS\n", "from Muscat.MeshTools import MeshCreationTools as MCT\n", "\n", "from plaid.bridges import huggingface_bridge\n", "from plaid import Dataset, Sample, ProblemDefinition\n", "from plaid.types import FeatureIdentifier" ] }, { "cell_type": "code", "execution_count": null, "id": "21624b5e", "metadata": {}, "outputs": [], "source": [ "# Print Sample util\n", "def show_sample(sample: Sample):\n", " print(f\"sample = {sample}\")\n", " sample.show_tree()\n", " print(f\"{sample.get_scalar_names() = }\")\n", " print(f\"{sample.get_field_names() = }\")\n", "\n", "\n", "# Get_mem util\n", "def get_mem():\n", " \"\"\"Get the current memory usage of the process in MB.\"\"\"\n", " process = psutil.Process(os.getpid())\n", " return process.memory_info().rss / (1024**2) # in MB" ] }, { "cell_type": "markdown", "id": "7bbae576", "metadata": 
{}, "source": [ "## Initialize plaid dataset, infos and problem_definition" ] }, { "cell_type": "code", "execution_count": null, "id": "afc76795", "metadata": {}, "outputs": [], "source": [ "# Input data\n", "points = np.array(\n", " [\n", " [0.0, 0.0],\n", " [1.0, 0.0],\n", " [1.0, 1.0],\n", " [0.0, 1.0],\n", " [0.5, 1.5],\n", " ]\n", ")\n", "\n", "triangles = np.array(\n", " [\n", " [0, 1, 2],\n", " [0, 2, 3],\n", " [2, 4, 3],\n", " ]\n", ")\n", "\n", "dataset = Dataset()\n", "\n", "scalar_feat_id = FeatureIdentifier({\"type\": \"scalar\", \"name\": \"scalar\"})\n", "node_field_feat_id = FeatureIdentifier(\n", " {\"type\": \"field\", \"name\": \"node_field\", \"location\": \"Vertex\"}\n", ")\n", "cell_field_feat_id = FeatureIdentifier(\n", " {\"type\": \"field\", \"name\": \"cell_field\", \"location\": \"CellCenter\"}\n", ")\n", "\n", "print(\"Creating meshes dataset...\")\n", "for _ in range(3):\n", " mesh = MCT.CreateMeshOfTriangles(points, triangles)\n", "\n", " sample = Sample()\n", "\n", " sample.add_tree(MeshToCGNS(mesh, exportOriginalIDs=False))\n", "\n", " sample.update_features_from_identifier(\n", " scalar_feat_id, np.random.randn(), in_place=True\n", " )\n", " sample.update_features_from_identifier(\n", " node_field_feat_id, np.random.rand(len(points)), in_place=True\n", " )\n", " sample.update_features_from_identifier(\n", " cell_field_feat_id, np.random.rand(len(triangles)), in_place=True\n", " )\n", "\n", " dataset.add_sample(sample)\n", "\n", "infos = {\n", " \"legal\": {\"owner\": \"Bob\", \"license\": \"my_license\"},\n", " \"data_production\": {\"type\": \"simulation\", \"physics\": \"3D example\"},\n", "}\n", "\n", "dataset.set_infos(infos)\n", "\n", "print(f\" {dataset = }\")\n", "print(f\" {infos = }\")\n", "\n", "pb_def = ProblemDefinition()\n", "pb_def.add_in_features_identifiers([scalar_feat_id, node_field_feat_id])\n", "pb_def.add_out_features_identifiers([cell_field_feat_id])\n", "\n", "pb_def.set_task(\"regression\")\n", 
"pb_def.set_split({\"train\": [0, 1], \"test\": [2]})\n", "\n", "print(f\" {pb_def = }\")" ] }, { "cell_type": "markdown", "id": "a2173fcc", "metadata": {}, "source": [ "## Section 1: Convert plaid datasets to Hugging Face DatasetDict" ] }, { "cell_type": "code", "execution_count": null, "id": "6f608a2b", "metadata": {}, "outputs": [], "source": [ "main_splits = {\n", " split_name: pb_def.get_split(split_name) for split_name in [\"train\", \"test\"]\n", "}\n", "\n", "hf_datasetdict, flat_cst, key_mappings = (\n", " huggingface_bridge.plaid_dataset_to_huggingface_datasetdict(dataset, main_splits)\n", ")\n", "\n", "print(f\"{hf_datasetdict = }\")\n", "print(f\"{flat_cst = }\")\n", "print(f\"{key_mappings = }\")" ] }, { "cell_type": "markdown", "id": "ac11ad14", "metadata": {}, "source": [ "A partitioning of all the indices is provided in `main_splits`. The conversion outputs `flat_cst` and `key_mappings`, which are central to the Hugging Face support:\n", "- **`flat_cst`**: constant features dictionary (path → value): a flatten tree containing the CGNS trees leaves that a reconstant throughout the plaid dataset.\n", "- **`key_mappings`**: metadata dictionary containing keys such as:\n", " - `variable_features`: list of paths for non-constant features.\n", " - `constant_features`: list of paths for constant features.\n", " - `cgns_types`: mapping from paths to CGNS types.\n", "\n", "`flat_cst` and `cgns_types` are required for reconstructing plaid datasets and samples from the hugginface datasets." 
] }, { "cell_type": "markdown", "id": "97865bca", "metadata": {}, "source": [ "## Section 2: Generate a Hugging Face dataset with a generator" ] }, { "cell_type": "markdown", "id": "4df75b5a", "metadata": {}, "source": [ "Ganarators are used to handle large datasets that do not fit in memory:" ] }, { "cell_type": "code", "execution_count": null, "id": "5723fc67", "metadata": {}, "outputs": [], "source": [ "split_ids = {}\n", "split_ids[\"train\"] = [0, 1]\n", "split_ids[\"test\"] = [2]\n", "\n", "generators = {}\n", "for split_name in split_ids.keys():\n", "\n", " def generator_(ids):\n", " for id in ids:\n", " yield dataset[id]\n", "\n", " generators[split_name] = partial(generator_, ids = split_ids[split_name])\n", "\n", "hf_datasetdict, flat_cst, key_mappings = (\n", " huggingface_bridge.plaid_generator_to_huggingface_datasetdict(\n", " generators\n", " )\n", ")\n", "print(f\"{hf_datasetdict = }\")\n", "print(f\"{flat_cst = }\")\n", "print(f\"{key_mappings = }\")" ] }, { "cell_type": "markdown", "id": "51569d53", "metadata": {}, "source": [ "In this example, the generators are not very usefull since the plaid dataset is already loaded in memory. 
In real settings, one can create generators in the following way to avoid loading all the data beforehand:\n", "```python\n", "# load and convert_to_sample stand for user-provided functions\n", "generators = {}\n", "for split_name, ids in main_splits.items():\n", "    def generator_(ids=ids):\n", "        for id in ids:\n", "            loaded_simulation_data = load('path/to/split_name/simulation_id')\n", "            sample = convert_to_sample(loaded_simulation_data)\n", "            yield sample\n", "    generators[split_name] = generator_\n", "```" ] }, { "cell_type": "markdown", "id": "5ff57beb", "metadata": {}, "source": [ "## Section 3: Convert a Hugging Face dataset to plaid" ] }, { "cell_type": "code", "execution_count": null, "id": "67653f81", "metadata": {}, "outputs": [], "source": [ "cgns_types = key_mappings[\"cgns_types\"]\n", "\n", "dataset_2 = huggingface_bridge.to_plaid_dataset(\n", "    hf_datasetdict[\"train\"], flat_cst[\"train\"], cgns_types\n", ")\n", "print()\n", "print(f\"{dataset_2 = }\")" ] }, { "cell_type": "markdown", "id": "d5e2b000", "metadata": {}, "source": [ "## Section 4: Save and Load Hugging Face datasets\n", "\n", "### From and to disk\n", "\n", "Saving and loading datasetdict, infos, tree_struct and problem definition to disk:" ] }, { "cell_type": "code", "execution_count": null, "id": "a3840c80", "metadata": {}, "outputs": [], "source": [ "with tempfile.TemporaryDirectory() as out_dir:\n", " huggingface_bridge.save_dataset_dict_to_disk(out_dir, hf_datasetdict)\n", " huggingface_bridge.save_infos_to_disk(out_dir, infos)\n", " huggingface_bridge.save_tree_struct_to_disk(out_dir, flat_cst, key_mappings)\n", " huggingface_bridge.save_problem_definition_to_disk(out_dir, \"task_1\", pb_def)\n", "\n", " loaded_hf_datasetdict = huggingface_bridge.load_dataset_from_disk(out_dir)\n", " loaded_infos = huggingface_bridge.load_infos_from_disk(out_dir)\n", " flat_cst, key_mappings = huggingface_bridge.load_tree_struct_from_disk(out_dir)\n", " loaded_pb_def = huggingface_bridge.load_problem_definition_from_disk(\n", " out_dir, \"task_1\"\n", " )\n", 
"\n", " shutil.rmtree(out_dir)\n", "\n", "print(f\"{loaded_hf_datasetdict = }\")\n", "print(f\"{loaded_infos = }\")\n", "print(f\"{flat_cst = }\")\n", "print(f\"{key_mappings = }\")\n", "print(f\"{loaded_pb_def = }\")" ] }, { "cell_type": "markdown", "id": "837837ca", "metadata": {}, "source": [ "### From and to the Hugging Face hub\n", "\n", "Find below examples of instructions (not executed by this notebook)." ] }, { "cell_type": "markdown", "id": "b4d8a3bc", "metadata": {}, "source": [ "#### Load from hub\n", "\n", "To load datasetdict, infos and problem_definitions from the hub:\n", "```python\n", "huggingface_bridge.load_dataset_from_hub(\"chanel/dataset\", *args, **kwargs)\n", "huggingface_bridge.load_hf_infos_from_hub(\"chanel/dataset\")\n", "huggingface_bridge.load_hf_problem_definition_from_hub(\"chanel/dataset\", \"name\")\n", "```\n", "\n", "Partial retrieval are possible along samples\n", "```python\n", "huggingface_bridge.load_dataset_from_hub(\"chanel/dataset\", split=\"train[:10], *args, **kwargs)\n", "```\n", "\n", "Streaming allows handling very large datasets\n", "```python\n", "hf_dataset_streamed = huggingface_bridge.load_dataset_from_hub(\"chanel/dataset\", split=\"split\", streaming=True, *args, **kwargs)\n", "for hf_sample in hf_dataset_streamed:\n", " sample = huggingface_bridge.to_plaid_sample(hf_sample, flat_cst, cgns_types)\n", "```\n", "\n", "Native HF datasets commands are also possible:\n", "\n", "```python\n", "dataset_train = load_dataset(\"chanel/dataset\", split=\"train\")\n", "dataset_train = load_dataset(\"chanel/dataset\", split=\"train\", streaming=True)\n", "dataset_train_extract = load_dataset(\"chanel/dataset\", split=\"train[:10]\")\n", "```\n", "\n", "If you are behind a proxy and relying on a private mirror the function `load_dataset_from_hub` is working provided the following is set:\n", "- `HF_ENDPOINT` to your private mirror address\n", "- `CURL_CA_BUNDLE` to your trusted CA certificates\n", "- `HF_HOME` to a shared 
cache directory if needed" ] }, { "cell_type": "markdown", "id": "2ebc1111", "metadata": {}, "source": [ "#### Push to the hub\n", "\n", "To push a dataset to the Hub, you need a Hugging Face account with a configured access token.\n", "\n", "First, log in with the Hugging Face CLI:\n", "```bash\n", "huggingface-cli login\n", "```\n", "and enter your access token.\n", "\n", "Then, the following Python instructions push the datasetdict, infos, tree structure and problem definitions to the hub:\n", "```python\n", "huggingface_bridge.push_dataset_dict_to_hub(\"chanel/dataset\", hf_datasetdict)\n", "huggingface_bridge.push_infos_to_hub(\"chanel/dataset\", infos)\n", "huggingface_bridge.push_tree_struct_to_hub(\"chanel/dataset\", flat_cst, key_mappings)\n", "huggingface_bridge.push_problem_definition_to_hub(\"chanel/dataset\", \"location\", pb_def)\n", "```\n", "\n", "The dataset card can then be customized online, directly on the dataset repo page." ] }, { "cell_type": "markdown", "id": "3da81b17", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Section 5: Handle plaid samples from Hugging Face datasets without converting the complete dataset to plaid\n", "\n", "To fully exploit the optimized data handling of the Hugging Face datasets library, information can be extracted from a Hugging Face dataset without converting the complete dataset to plaid." ] }, { "cell_type": "markdown", "id": "23c21867", "metadata": {}, "source": [ "Get the first sample of the first split:" ] }, { "cell_type": "code", "execution_count": null, "id": "04a0dd19", "metadata": {}, "outputs": [], "source": [ "hf_sample = hf_datasetdict[\"train\"][0]\n", "\n", "print(f\"{hf_sample = }\")" ] }, { "cell_type": "markdown", "id": "8c27c644", "metadata": {}, "source": [ "Note that `hf_sample` is not a plaid sample, but a dict containing the variable features of the dataset, whose keys are the flattened paths of the CGNS tree. It contains a binary object efficiently handled by Hugging Face datasets. 
It can be converted into a plaid sample using a dedicated constructor that relies on a pydantic validator and requires `flat_cst` and `cgns_types`." ] }, { "cell_type": "code", "execution_count": null, "id": "a0cedc20", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "plaid_sample = huggingface_bridge.to_plaid_sample(\n", "    hf_datasetdict[\"train\"], 0, flat_cst[\"train\"], cgns_types\n", ")\n", "\n", "print(\"Variable features:\")\n", "for t in plaid_sample.get_all_time_values():\n", "    for path in key_mappings[\"variable_features\"]:\n", "        print(path, plaid_sample.get_feature_by_path(path=path, time=t))\n", "print(\"-------\")\n", "print(\"Sample and CGNS tree:\")\n", "show_sample(plaid_sample)" ] }, { "cell_type": "markdown", "id": "36aa8076", "metadata": {}, "source": [ "Very large datasets that do not fit on disk can be streamed directly from the Hugging Face hub:\n", "\n", "```python\n", "hf_dataset_stream = load_dataset(\"chanel/dataset\", split=\"train\", streaming=True)\n", "plaid_sample = huggingface_bridge.to_plaid_sample(next(iter(hf_dataset_stream)), flat_cst, cgns_types)\n", "```\n", "\n", "If you are behind a proxy:\n", "```python\n", "hf_dataset_stream = huggingface_bridge.load_dataset_from_hub(\"chanel/dataset\", split=\"train\", streaming=True)\n", "plaid_sample = huggingface_bridge.to_plaid_sample(next(iter(hf_dataset_stream)), flat_cst, cgns_types)\n", "```" ] }, { "cell_type": "markdown", "id": "d87c8216", "metadata": {}, "source": [ "## Section 6: Advanced concepts" ] }, { "cell_type": "markdown", "id": "ea2410d8", "metadata": {}, "source": [ "In this section, we look at read speed and memory usage to better exploit the datasets made available on Hugging Face. The commands are not executed by this notebook. 
You can copy/paste the following code to execute it, but be mindful that it will download a 235 MB dataset.\n", "\n", "```python\n", "repo_id = \"fabiencasenave/Tensile2d_DO_NOT_DELETE\"\n", "split_names = [\"train_500\", \"test\", \"OOD\"]\n", "\n", "hf_dataset_dict = huggingface_bridge.load_dataset_from_hub(repo_id)\n", "```" ] }, { "cell_type": "markdown", "id": "7320b9e3", "metadata": {}, "source": [ "We investigate the time and memory needed to instantiate the plaid dataset dict from the repo_id, now that the HF datasets have been loaded in cache:\n", "```python\n", "init_ram = get_mem()\n", "start = time()\n", "dataset_dict = huggingface_bridge.instantiate_plaid_datasetdict_from_hub(repo_id)\n", "elapsed = time() - start\n", "print(f\"Time to instantiate plaid dataset dict from cache: {elapsed:.6g} s, RAM usage increase: {get_mem()-init_ram} MB\")\n", "```\n", "```bash\n", ">> Time to instantiate plaid dataset dict from cache: 1.37948 s, RAM usage increase: 22.5 MB\n", "```\n", "Notice that the RAM usage increase is lower than the size of the dataset: all the variable-shape 1DArrays and constant-shape 2DArrays in the samples are instantiated in no-copy mode." ] }, { "cell_type": "markdown", "id": "666448c8", "metadata": {}, "source": [ "We now investigate the possible gains when handling the datasets directly. 
First, bypassing cache checks and constructing plaid dataset from an instantiated HF dataset is much faster:\n", "```python\n", "flat_cst, key_mappings = huggingface_bridge.load_tree_struct_from_hub(repo_id)\n", "pb_def = huggingface_bridge.load_problem_definition_from_hub(repo_id, \"task_1\")\n", "infos = huggingface_bridge.load_infos_from_hub(repo_id)\n", "cgns_types = key_mappings[\"cgns_types\"]\n", "\n", "hf_dataset = hf_dataset_dict[split_names[0]]\n", "\n", "init_ram = get_mem()\n", "start = time()\n", "dataset = huggingface_bridge.to_plaid_dataset(hf_dataset, flat_cst, cgns_types)\n", "elapsed = time() - start\n", "print(f\"Time to build dataset on split {split_names[0]}: {elapsed:.6g} s, RAM usage increase: {get_mem()-init_ram} MB\")\n", "```\n", "```bash\n", ">> Time to build dataset on split train_500: 0.173115 s, RAM usage increase: 16.3125 MB\n", "```" ] }, { "cell_type": "markdown", "id": "b20e794a", "metadata": {}, "source": [ "It is possible to further remove overheads by accessing directly 1DArrays in the arrow table of the HF datasets in no-copy mode:\n", "```python\n", "init_ram = get_mem()\n", "start = time()\n", "data = {}\n", "for i in range(len(hf_dataset)):\n", " data[i] = hf_dataset.data[\"Base_2_2/Zone/PointData/sig12\"][i].values.to_numpy(zero_copy_only=True)\n", "elapsed = time() - start\n", "print(f\"Time to read 1D fields of variable size on the complete split {split_names[0]}: {elapsed:.6g} s, RAM usage increase: {get_mem()-init_ram} MB\")\n", "```\n", "```bash\n", ">> Time to read 1D fields of variable size on the complete split train_500: 0.0021801 s, RAM usage increase: 0.375 MB\n", "```" ] }, { "cell_type": "markdown", "id": "12f679aa", "metadata": {}, "source": [ "An efficient way to retrieve the output feature directly from the pyarrow table is:\n", "```python\n", "init_ram = get_mem()\n", "start = time()\n", "for i in tqdm(\n", " range(len(hf_dataset_new[\"train\"])), desc=\"Retrieving features\"\n", "):\n", " for path in 
pb_def.get_out_features_identifiers():\n", " hf_dataset_new[\"train\"].data[path][i].values.to_numpy(\n", " zero_copy_only=False\n", " )\n", "elapsed = time() - start\n", "print(\n", " f\"Time to retrieve out features on train: {elapsed:.6g} s, RAM usage increase: {get_mem() - init_ram} MB\"\n", ")\n", "```\n", "```bash\n", ">> Time to retrieve out features on train: 0.0400107 s, RAM usage increase: 0.27734375 MB\n", "```\n", "Notice that doing this for time-dependent datasets would require manual handling of the time dimension." ] }, { "cell_type": "markdown", "id": "4e6e9654", "metadata": {}, "source": [ "A robust way to retrieve input and output features from a HF dataset relying on the `to_plaid_sample` constructor is:\n", "```python\n", "init_ram = get_mem()\n", "start = time()\n", "for i in tqdm(\n", " range(len(hf_dataset_new[split_names[0]])), desc=\"Retrieving all variable features\"\n", "):\n", " sample = huggingface_bridge.to_plaid_sample(\n", " hf_dataset_new[split_names[0]],\n", " i,\n", " flat_cst[split_names[0]],\n", " cgns_types,\n", " enforce_shapes=False,\n", " )\n", " for t in sample.get_all_mesh_times():\n", " for path in pb_def.get_in_features_identifiers():\n", " sample.get_feature_by_path(path=path, time=t)\n", " for path in pb_def.get_out_features_identifiers():\n", " sample.get_feature_by_path(path=path, time=t)\n", "elapsed = time() - start\n", "print(\n", " f\"Time to retrieve in and out features on train: {elapsed:.6g} s, RAM usage increase: {get_mem() - init_ram} MB\"\n", ")\n", "```\n", "```bash\n", ">> Time to retrieve in and out features on train: 0.401273 s, RAM usage increase: 17.72265625 MB\n", "```\n", "Notice that converting first to plaid samples incurs some overhead, but this method is robust and works for time-dependent datasets as well." 
] } ], "metadata": { "jupytext": { "custom_cell_magics": "kql", "encoding": "# -*- coding: utf-8 -*-", "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "plaid-dev", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }