{ "cells": [ { "cell_type": "markdown", "id": "1a58be59", "metadata": {}, "source": [ "# Hugging Face support\n", "\n", "This Jupyter Notebook demonstrates various operations involving the Hugging Face bridge:\n", "\n", "1. Converting a plaid dataset to Hugging Face\n", "2. Generating a Hugging Face dataset with a generator\n", "3. Converting a Hugging Face dataset to plaid\n", "4. Saving and Loading Hugging Face datasets\n", "5. Handling plaid samples from Hugging Face datasets without converting the complete dataset to plaid\n", "\n", "\n", "**Each section is documented and explained.**" ] }, { "cell_type": "code", "execution_count": null, "id": "87f97780", "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries and functions\n", "import pickle\n", "\n", "import numpy as np\n", "from Muscat.Bridges.CGNSBridge import MeshToCGNS\n", "from Muscat.MeshTools import MeshCreationTools as MCT\n", "\n", "from plaid.bridges import huggingface_bridge\n", "from plaid import Dataset\n", "from plaid import Sample\n", "from plaid import ProblemDefinition" ] }, { "cell_type": "code", "execution_count": null, "id": "f1da9293", "metadata": {}, "outputs": [], "source": [ "# Print Sample util\n", "def show_sample(sample: Sample):\n", "    print(f\"sample = {sample}\")\n", "    sample.show_tree()\n", "    print(f\"{sample.get_scalar_names() = }\")\n", "    print(f\"{sample.get_field_names() = }\")" ] }, { "cell_type": "markdown", "id": "7134c22e", "metadata": {}, "source": [ "## Initialize plaid dataset and problem_definition" ] }, { "cell_type": "code", "execution_count": null, "id": "ae79f14e", "metadata": {}, "outputs": [], "source": [ "# Input data\n", "points = np.array(\n", "    [\n", "        [0.0, 0.0],\n", "        [1.0, 0.0],\n", "        [1.0, 1.0],\n", "        [0.0, 1.0],\n", "        [0.5, 1.5],\n", "    ]\n", ")\n", "\n", "triangles = np.array(\n", "    [\n", "        [0, 1, 2],\n", "        [0, 2, 3],\n", "        [2, 4, 3],\n", "    ]\n", ")\n", "\n", "\n", "dataset = Dataset()\n", "\n", "print(\"Creating meshes 
dataset...\")\n", "for _ in range(3):\n", "    mesh = MCT.CreateMeshOfTriangles(points, triangles)\n", "\n", "    sample = Sample()\n", "\n", "    sample.meshes.add_tree(MeshToCGNS(mesh))\n", "    sample.add_scalar(\"scalar\", np.random.randn())\n", "    sample.add_field(\"node_field\", np.random.rand(len(points)), location=\"Vertex\")\n", "    sample.add_field(\n", "        \"cell_field\", np.random.rand(len(triangles)), location=\"CellCenter\"\n", "    )\n", "\n", "    dataset.add_sample(sample)\n", "\n", "infos = {\n", "    \"legal\": {\"owner\": \"Bob\", \"license\": \"my_license\"},\n", "    \"data_production\": {\"type\": \"simulation\", \"physics\": \"3D example\"},\n", "}\n", "\n", "dataset.set_infos(infos)\n", "\n", "print(f\" {dataset = }\")\n", "\n", "problem = ProblemDefinition()\n", "problem.add_output_scalars_names([\"scalar\"])\n", "problem.add_output_fields_names([\"node_field\", \"cell_field\"])\n", "problem.add_input_meshes_names([\"/Base/Zone\"])\n", "\n", "problem.set_task(\"regression\")\n", "problem.set_split({\"train\": [0, 1], \"test\": [2]})\n", "\n", "print(f\" {problem = }\")" ] }, { "cell_type": "markdown", "id": "3b93db97", "metadata": {}, "source": [ "## Section 1: Convert plaid dataset to Hugging Face\n", "\n", "The description field of the Hugging Face dataset is automatically populated with the plaid dataset infos and the problem_definition, so that no information is lost and the two formats remain equivalent." ] }, { "cell_type": "code", "execution_count": null, "id": "b2209d87", "metadata": {}, "outputs": [], "source": [ "hf_dataset = huggingface_bridge.plaid_dataset_to_huggingface(dataset, problem)\n", "print()\n", "print(f\"{hf_dataset = }\")\n", "print(f\"{hf_dataset.description = }\")" ] }, { "cell_type": "markdown", "id": "e6d3ad80", "metadata": {}, "source": [ "The previous code generates a Hugging Face dataset containing all the samples from the plaid dataset, with the splits defined in the hf_dataset description. 
For splits, Hugging Face provides `DatasetDict`, a dictionary of hf datasets whose keys are the names of the corresponding splits. It is possible to generate a hf datasetdict directly from plaid:" ] }, { "cell_type": "code", "execution_count": null, "id": "9226f984", "metadata": {}, "outputs": [], "source": [ "hf_datasetdict = huggingface_bridge.plaid_dataset_to_huggingface_datasetdict(dataset, problem, main_splits = ['train', 'test'])\n", "print()\n", "print(f\"{hf_datasetdict['train'] = }\")\n", "print(f\"{hf_datasetdict['test'] = }\")" ] }, { "cell_type": "markdown", "id": "17d5fba6", "metadata": {}, "source": [ "## Section 2: Generate a Hugging Face dataset with a generator" ] }, { "cell_type": "code", "execution_count": null, "id": "7f739201", "metadata": {}, "outputs": [], "source": [ "def generator():\n", "    for id in range(len(dataset)):\n", "        yield {\n", "            \"sample\": pickle.dumps(dataset[id]),\n", "        }\n", "\n", "\n", "hf_dataset_gen = huggingface_bridge.plaid_generator_to_huggingface(\n", "    generator, infos, problem\n", ")\n", "print()\n", "print(f\"{hf_dataset_gen = }\")\n", "print(f\"{hf_dataset_gen.description = }\")" ] }, { "cell_type": "markdown", "id": "cffe268e", "metadata": {}, "source": [ "The same is available with datasetdict:" ] }, { "cell_type": "code", "execution_count": null, "id": "01e31555", "metadata": {}, "outputs": [], "source": [ "hf_datasetdict_gen = huggingface_bridge.plaid_generator_to_huggingface_datasetdict(\n", "    generator, infos, problem, main_splits = ['train', 'test']\n", ")\n", "print()\n", "print(f\"{hf_datasetdict_gen['train'] = }\")\n", "print(f\"{hf_datasetdict_gen['test'] = }\")" ] }, { "cell_type": "markdown", "id": "54f11f59", "metadata": {}, "source": [ "## Section 3: Convert a Hugging Face dataset to plaid\n", "\n", "The plaid dataset infos and problem_definition are recovered from the Hugging Face dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "38800a47", "metadata": {}, "outputs": [], "source": [ 
"dataset_2, problem_2 = huggingface_bridge.huggingface_dataset_to_plaid(hf_dataset)\n", "print()\n", "print(f\"{dataset_2 = }\")\n", "print(f\"{dataset_2.get_infos() = }\")\n", "print(f\"{problem_2 = }\")" ] }, { "cell_type": "markdown", "id": "b668b6fe", "metadata": {}, "source": [ "## Section 4: Save and Load Hugging Face datasets\n", "\n", "### From and to disk" ] }, { "cell_type": "code", "execution_count": null, "id": "906244cf", "metadata": {}, "outputs": [], "source": [ "# Save to disk\n", "hf_dataset.save_to_disk(\"/tmp/path/to/dir\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b36aa2d4", "metadata": {}, "outputs": [], "source": [ "# Load from disk\n", "from datasets import load_from_disk, load_dataset\n", "\n", "loaded_hf_dataset = load_from_disk(\"/tmp/path/to/dir\")\n", "\n", "print()\n", "print(f\"{loaded_hf_dataset = }\")\n", "print(f\"{loaded_hf_dataset.description = }\")" ] }, { "cell_type": "markdown", "id": "c67139ec", "metadata": {}, "source": [ "### From and to the Hugging Face hub\n", "\n", "You need a Hugging Face account with a configured access token, and you need to install huggingface_hub[cli].\n", "Pushing and loading a Hugging Face dataset without loss of information requires configuring a DatasetCard.\n", "\n", "Example instructions are given below (they are not executed by this notebook).\n", "\n", "### Push to the hub\n", "\n", "First, log in with the Hugging Face CLI:\n", "```bash\n", "huggingface-cli login\n", "```\n", "and enter your access token.\n", "\n", "Then, the following Python instructions push a dataset to the hub:\n", "```python\n", "hf_dataset.push_to_hub(\"chanel/dataset\")\n", "\n", "from datasets import load_dataset_builder\n", "\n", "datasetInfo = load_dataset_builder(\"chanel/dataset\").__getstate__()['info']\n", "\n", "from huggingface_hub import DatasetCard\n", "\n", "card_text = create_string_for_huggingface_dataset_card(\n", "    description = description,\n", "    download_size_bytes = 
datasetInfo.download_size,\n", "    dataset_size_bytes = datasetInfo.dataset_size,\n", "    ...)\n", "dataset_card = DatasetCard(card_text)\n", "dataset_card.push_to_hub(\"chanel/dataset\")\n", "```\n", "\n", "This second upload, of the dataset_card, is required so that load_dataset from the hub populates\n", "the hf_dataset.description field and the dataset remains convertible to plaid. Without a dataset_card, the description field is lost.\n", "\n", "\n", "### Load from hub\n", "\n", "```python\n", "dataset = load_dataset(\"chanel/dataset\", split=\"all_samples\")\n", "```\n", "\n", "Partial loads and split loads (in the case of a datasetdict) allow more efficient retrieval:\n", "\n", "```python\n", "dataset_train = load_dataset(\"chanel/dataset\", split=\"train\")\n", "dataset_train_extract = load_dataset(\"chanel/dataset\", split=\"train[:10]\")\n", "```" ] }, { "cell_type": "markdown", "id": "7ecbb60c", "metadata": {}, "source": [ "## Section 5: Handle plaid samples from Hugging Face datasets without converting the complete dataset to plaid\n", "\n", "To fully exploit the optimized data handling of the Hugging Face datasets library, it is possible to extract information from the Hugging Face dataset without converting it to plaid. The ``description`` attribute includes the plaid dataset _infos attribute and the plaid problem_definition attributes." 
] }, { "cell_type": "code", "execution_count": null, "id": "7004fdc4", "metadata": {}, "outputs": [], "source": [ "print(f\"{loaded_hf_dataset.description = }\")" ] }, { "cell_type": "markdown", "id": "c7967cbb", "metadata": {}, "source": [ "Get the first sample of the first split" ] }, { "cell_type": "code", "execution_count": null, "id": "244305dc", "metadata": {}, "outputs": [], "source": [ "split_names = list(loaded_hf_dataset.description[\"split\"].keys())\n", "id = loaded_hf_dataset.description[\"split\"][split_names[0]]\n", "hf_sample = loaded_hf_dataset[id[0]]\n", "\n", "print(f\"{hf_sample = }\")" ] }, { "cell_type": "markdown", "id": "fa3ce8ca", "metadata": {}, "source": [ "We notice that ``hf_sample`` is a binary object efficiently handled by huggingface datasets. It can be converted into a plaid sample using a specific constructor relying on a pydantic validator." ] }, { "cell_type": "code", "execution_count": null, "id": "8d8f8230", "metadata": {}, "outputs": [], "source": [ "plaid_sample = huggingface_bridge.to_plaid_sample(hf_sample)\n", "\n", "show_sample(plaid_sample)" ] }, { "cell_type": "markdown", "id": "5beb4d12", "metadata": {}, "source": [ "Very large datasets can be streamed directly from the Hugging Face hub:\n", "\n", "```python\n", "hf_dataset_stream = load_dataset(\"chanel/dataset\", split=\"all_samples\", streaming=True)\n", "\n", "plaid_sample = huggingface_bridge.to_plaid_sample(next(iter(hf_dataset_stream)))\n", "\n", "show_sample(plaid_sample)\n", "```" ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "plaid-dev", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }