tablegpt · edwardzjl · Nov 19, 2024 · Nov 19, 2024
diff --git a/docs/index.md b/docs/index.md
@@ -11,9 +11,9 @@ tablegpt-agent is a pre-built agent for [TableGPT2 (huggingface)](https://huggin
 
 <!-- mkdocs requires 4 space to intent the list -->
 - Tutorials
-    - [Quickstart](tutorials/quickstart.md)
-    - [Chat on Tabular Data](tutorials/chat-on-tabular-data.md)
-    - [Continue Analysis on Generated Charts](tutorials/continue-analysis-on-generated-charts.md)
+    - [Quickstart](tutorials/quick-start.ipynb)
+    - [Chat on Tabular Data](tutorials/chat-on-tabular-data.ipynb)
+    - [Continue Analysis on Generated Charts](tutorials/continue-analysis-on-generated-charts.ipynb)
 - How-To Guides
     - [Enhance TableGPT Agent with RAG](howto/retrieval.md)
     - [Persist Messages](howto/persist-messages.md)

diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
@@ -0,0 +1,8 @@
+/* hide jupyter notebooks input/output numbers */
+.jp-InputPrompt {
+  display: none !important;
+}
+
+.jp-OutputPrompt {
+  display: none !important;
+}
diff --git a/docs/tutorials/chat-on-tabular-data.ipynb b/docs/tutorials/chat-on-tabular-data.ipynb
@@ -0,0 +1,190 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1944f5bf",
+   "metadata": {},
+   "source": [
+    "# Chat on Tabular Data\n",
+    "\n",
+    "TableGPT Agent excels at analyzing and processing tabular data. To perform data analysis, you need to first let the agent \"see\" the dataset. This is done by a specific \"file-reading\" workflow. In short, you begin by \"uploading\" the dataset and let the agent read it. Once the data is read, you can ask the agent questions about it.\n",
+    "\n",
+    "> To learn more about the file-reading workflow, see [File Reading](./explanation/file-reading.md).\n",
+    "\n",
+    "For data analysis tasks, we introduce two important parameters when creating the agent: `checkpointer` and `session_id`.\n",
+    "\n",
+    "- The `checkpointer` should be an instance of `langgraph.checkpoint.base.BaseCheckpointSaver`, which acts as a versioned \"memory\" for the agent. (See [langgraph's persistence concept](https://langchain-ai.github.io/langgraph/concepts/persistence) for more details.)\n",
+    "- The `session_id` is a unique identifier for the current session. It ties the agent's execution to a specific kernel, ensuring that the agent's results are retained across multiple invocations.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "ec321eaa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langgraph.checkpoint.memory import MemorySaver\n",
+    "\n",
+    "checkpointer = MemorySaver()\n",
+    "agent = create_tablegpt_graph(\n",
+    "    llm=llm,\n",
+    "    pybox_manager=pybox_manager,\n",
+    "    checkpointer=checkpointer,\n",
+    "    session_id=\"some-session-id\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2554859",
+   "metadata": {},
+   "source": [
+    "Add the file for processing in the additional_kwargs of HumanMessage. Here's an example using the [Titanic dataset](https://github.com/tablegpt/tablegpt-agent/blob/main/examples/datasets/titanic.csv).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "55a52fb7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import TypedDict\n",
+    "from langchain_core.messages import HumanMessage\n",
+    "\n",
+    "class Attachment(TypedDict):\n",
+    "    \"\"\"Contains at least one dictionary with the key filename.\"\"\"\n",
+    "    filename: str\n",
+    "\n",
+    "attachment_msg = HumanMessage(\n",
+    "    content=\"\",\n",
+    "    # Please make sure your iPython kernel can access your filename.\n",
+    "    additional_kwargs={\"attachments\": [Attachment(filename=\"titanic.csv\")]},\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0630560d",
+   "metadata": {},
+   "source": [
+    "Invoke the agent as shown in the quick start:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "ce20b1b0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': 'titanic.csv'}]}, response_metadata={}, id='ab0a7157-ad7d-4de8-9b24-1bee78ad7c55'),\n",
+       " AIMessage(content=\"我已经收到您的数据文件，我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中，并通过 `df.info` 查看 NaN 情况和数据类型。\\n```python\\n# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\\n```\", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件，我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中，并通过 `df.info` 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': \"# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\"}, 'model_type': None}, response_metadata={}, id='add6691d-d7ea-411d-9699-e99ae0b7de97', tool_calls=[{'name': 'python', 'args': {'query': \"# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\"}, 'id': 'b846aa01-04ef-4669-9a5c-53ddcb9a2dfb', 'type': 'tool_call'}]),\n",
+       " ToolMessage(content=[{'type': 'text', 'text': \"```pycon\\n<class 'pandas.core.frame.DataFrame'>\\nRangeIndex: 4 entries, 0 to 3\\nData columns (total 8 columns):\\n #   Column    Non-Null Count  Dtype  \\n---  ------    --------------  -----  \\n 0   Pclass    4 non-null      int64  \\n 1   Sex       4 non-null      object \\n 2   Age       4 non-null      float64\\n 3   SibSp     4 non-null      int64  \\n 4   Parch     4 non-null      int64  \\n 5   Fare      4 non-null      float64\\n 6   Embarked  4 non-null      object \\n 7   Survived  4 non-null      int64  \\ndtypes: float64(2), int64(4), object(2)\\n```\"}], name='python', id='0d441b21-bff3-463c-a07f-c0b12bd17bc5', tool_call_id='b846aa01-04ef-4669-9a5c-53ddcb9a2dfb', artifact=[]),\n",
+       " AIMessage(content='接下来我将用 `df.head(5)` 来查看数据集的前 5 行。\\n```python\\n# Show the first 5 rows to understand the structure\\ndf.head(5)\\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='5e26ef1d-7042-471e-b39f-194a51a185c7', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'id': 'f6be0d96-05b3-4b5b-8313-90197a8c3d87', 'type': 'tool_call'}]),\n",
+       " ToolMessage(content=[{'type': 'text', 'text': '```pycon\\n   Pclass     Sex   Age  SibSp  Parch    Fare Embarked  Survived\\n0       2  female  29.0      0      2  23.000        S         1\\n1       3  female  39.0      1      5  31.275        S         0\\n2       3    male  26.5      0      0   7.225        C         0\\n3       3    male  32.0      0      0  56.496        S         1\\n```'}], name='python', id='6fc6d8aa-546c-467e-91d3-d57b0b62dd68', tool_call_id='f6be0d96-05b3-4b5b-8313-90197a8c3d87', artifact=[]),\n",
+       " AIMessage(content='我已经了解了数据集 titanic.csv 的基本信息。请问我可以帮您做些什么？', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='b6dc3885-94cb-4b0f-b691-f37c4c8c9ba3')]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from datetime import date\n",
+    "from tablegpt.agent.file_reading import Stage\n",
+    "\n",
+    "# Reading and processing files.\n",
+    "response = await agent.ainvoke(\n",
+    "    input={\n",
+    "        \"entry_message\": attachment_msg,\n",
+    "        \"processing_stage\": Stage.UPLOADED,\n",
+    "        \"messages\": [attachment_msg],\n",
+    "        \"parent_id\": \"some-parent-id1\",\n",
+    "        \"date\": date.today(),\n",
+    "    },\n",
+    "    config={\n",
+    "        # Using checkpointer requires binding thread_id at runtime.\n",
+    "        \"configurable\": {\"thread_id\": \"some-thread-id\"},\n",
+    "    },\n",
+    ")\n",
+    "response[\"messages\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bed80e60",
+   "metadata": {},
+   "source": [
+    "Continue to ask questions for data analysis:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "dcceeebf",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "content=\"为了回答您的问题，我将筛选出所有男性乘客并计算其中的幸存者数量。\\n```python\\n# Filter male passengers who survived and count them\\nmale_survivors = df[(df['Sex'] == 'male') & (df['Survived'] == 1)]\\nmale_survivors_count = male_survivors.shape[0]\\nmale_survivors_count\\n```\" additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-661d7496-341d-4a6b-84d8-b4094db66ef0'\n",
+      "content=[{'type': 'text', 'text': '```pycon\\n1\\n```'}] name='python' id='1c7531db-9150-451d-a8dd-f07176454e6f' tool_call_id='2860e8bb-0fa7-421b-bb2d-bfeca873354b' artifact=[]\n",
+      "content='根据数据集，有 1 名男性乘客幸存。' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-db640705-0085-4f47-adb4-3e0adce694cd'\n"
+     ]
+    }
+   ],
+   "source": [
+    "human_message = HumanMessage(content=\"How many men survived?\")\n",
+    "\n",
+    "async for event in agent.astream_events(\n",
+    "    input={\n",
+    "        # After using checkpoint, you only need to add new messages here.\n",
+    "        \"messages\": [human_message],\n",
+    "        \"parent_id\": \"some-parent-id2\",\n",
+    "        \"date\": date.today(),\n",
+    "    },\n",
+    "    version=\"v2\",\n",
+    "    # We configure the same thread_id to use checkpoints to retrieve the memory of the last run.\n",
+    "    config={\"configurable\": {\"thread_id\": \"some-thread-id\"}},\n",
+    "):\n",
+    "    event_name: str = event[\"name\"]\n",
+    "    evt: str = event[\"event\"]\n",
+    "    if evt == \"on_chat_model_end\":\n",
+    "        print(event[\"data\"][\"output\"])\n",
+    "    elif event_name == \"tools\" and evt == \"on_chain_stream\":\n",
+    "        for lc_msg in event[\"data\"][\"chunk\"][\"messages\"]:\n",
+    "            print(lc_msg)\n",
+    "    else:\n",
+    "        # Other events can be handled here.\n",
+    "        pass\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}