diff --git a/docs/explanation/file-reading.ipynb b/docs/explanation/file-reading.ipynb new file mode 100644 index 0000000..71f2603 --- /dev/null +++ b/docs/explanation/file-reading.ipynb @@ -0,0 +1,596 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "229c30c0-9715-48a2-b5fe-ee8c733d847a", + "metadata": {}, + "source": [ + "# File Reading\n", + "\n", + "When working with dataset files, maintaining a clear separation between file reading and data analysis workflows can significantly improve control and clarity. In TableGPT Agent, we've designed a robust, structured approach to file reading that lets the LLM (Large Language Model) analyze dataset files effectively without being overwhelmed by unnecessary details. This method not only enhances the LLM's ability to inspect the data but also makes the overall analysis smoother and more reliable.\n", + "\n", + "Traditionally, allowing an LLM to inspect a dataset directly might involve simply calling the `df.head()` function to preview its content. While this approach suffices for straightforward use cases, it often lacks depth when dealing with more complex or messy datasets. To address this, we've developed a multi-step file reading workflow designed to deliver richer insights into the dataset structure while preparing it for advanced analysis." + ] + }, + { + "cell_type": "markdown", + "id": "a6ffbe96-f066-4b10-a743-0e9da6d41cbd", + "metadata": {}, + "source": [ + "**Here's how the workflow unfolds:**" + ] + }, + { + "cell_type": "markdown", + "id": "f9ba4763-5784-4c39-8e99-6156061e35bf", + "metadata": {}, + "source": [ + "## Normalization (Optional)\n", + "\n", + "Not all files are immediately suitable for direct analysis. Excel files, in particular, can pose challenges: irregular formatting, merged cells, and inconsistent headers are just a few examples. To tackle these issues, we introduce an optional normalization step that preprocesses the data, transforming it into a format that is “pandas-friendly.”\n", + "\n", + "This step addresses the most common quirks in Excel files, such as non-standard column headers, inconsistent row structures, or missing metadata. Resolving these issues upfront ensures smooth integration with downstream processes.\n", + "\n", + "**Example Scenario:**\n", + "\n", + "Imagine you have an Excel file that looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c83bfe50-176b-4781-a4f6-ba809aa54750", + "metadata": {}, + "outputs": [ + { + "data": {
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
产品生产统计表
生产日期制造编号产品名称预定产量本日产量累计产量耗费工时
Unnamed: 0_level_2Unnamed: 1_level_2Unnamed: 2_level_2Unnamed: 3_level_2预计实际Unnamed: 6_level_2本日累计
02007-08-10 00:00:00FK-001猕猴桃果肉饮料100000.040000450008300010.020.0
12007-08-11 00:00:00FK-002西瓜果肉饮料100000.04000044000820009.018.0
22007-08-12 00:00:00FK-003草莓果肉饮料100000.04000045000830009.018.0
32007-08-13 00:00:00FK-004蓝莓果肉饮料100000.04000045000830009.018.0
42007-08-14 00:00:00FK-005水密桃果肉饮料100000.040000450008300010.020.0
\n", + "
" + ], + "text/plain": [ + " 产品生产统计表 \n", + " 生产日期 制造编号 产品名称 预定产量 本日产量 累计产量 耗费工时 \n", + " Unnamed: 0_level_2 Unnamed: 1_level_2 Unnamed: 2_level_2 Unnamed: 3_level_2 预计 实际 Unnamed: 6_level_2 本日 累计\n", + "0 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000.0 40000 45000 83000 10.0 20.0\n", + "1 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000.0 40000 44000 82000 9.0 18.0\n", + "2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000.0 40000 45000 83000 9.0 18.0\n", + "3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000.0 40000 45000 83000 9.0 18.0\n", + "4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000.0 40000 45000 83000 10.0 20.0" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load the data into a DataFrame\n", + "df1 = read_df('产品生产统计表.xlsx', header=[0, 1, 2])\n", + "df1.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "0062d50e-63c9-4be2-bc15-7ebc15b23e4e", + "metadata": {}, + "source": [ + "The file is riddled with merged cells, empty rows, and redundant formatting that make it incompatible with pandas. If you try to load this file directly, pandas might misinterpret the structure or fail to parse it entirely.\n", + "\n", + "With our normalization feature, irregular datasets can be seamlessly transformed into clean, structured formats. When using the `create_tablegpt_agent` method, simply pass the `normalize_llm` parameter. The system will automatically analyze the irregular data and generate the appropriate transformation code, ensuring the dataset is prepared in the optimal format for further analysis.\n", + "\n", + "Below is an example of the code generated for the provided irregular dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9933fabd-d951-4da6-9bcd-a6511b12bc1b", + "metadata": {}, + "outputs": [], + "source": [ + "# Normalize the data\n", + "try:\n", + " df = df1.copy()\n", + "\n", + " import pandas as pd\n", + "\n", + " # Assuming the original data is loaded into a DataFrame named df\n", + " # Here is the transformation process:\n", + "\n", + " # Step 1: Isolate the Table Header\n", + " # Remove the unnecessary top rows and columns\n", + " final_df = df.iloc[2:, :9].copy()\n", + "\n", + " # Step 2: Rename Columns of final_df\n", + " # Adjust the column names to match the desired format\n", + " final_df.columns = ['生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '本日耗费工时', '累计耗费工时']\n", + "\n", + " # Step 3: Data Processing\n", + " # Ensure there are no NaN values and drop any duplicate rows if necessary\n", + " final_df.dropna(inplace=True)\n", + " final_df.drop_duplicates(inplace=True)\n", + "\n", + " # Convert the appropriate columns to numeric types\n", + " final_df['预定产量'] = final_df['预定产量'].astype(int)\n", + " final_df['本日产量预计'] = final_df['本日产量预计'].astype(int)\n", + " final_df['本日产量实际'] = final_df['本日产量实际'].astype(int)\n", + " final_df['累计产量'] = final_df['累计产量'].astype(int)\n", + " final_df['本日耗费工时'] = final_df['本日耗费工时'].astype(int)\n", + " final_df['累计耗费工时'] = final_df['累计耗费工时'].astype(int)\n", + "\n", + " # Display the transformed DataFrame\n", + " if final_df.columns.tolist() == final_df.iloc[0].tolist():\n", + " final_df = final_df.iloc[1:]\n", + "\n", + " # reassign df1 with the formatted DataFrame\n", + " df1 = final_df\n", + "except Exception as e:\n", + " # Unable to apply formatting to the original DataFrame. 
+ { + "cell_type": "markdown", + "id": "7f3b2c10-9d8e-4a1b-b2c3-d4e5f6a7b8c9", + "metadata": {}, + "source": [ + "Below is an example of the code generated for the provided irregular dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9933fabd-d951-4da6-9bcd-a6511b12bc1b", + "metadata": {}, + "outputs": [], + "source": [ + "# Normalize the data\n", + "try:\n", + "    df = df1.copy()\n", + "\n", + "    import pandas as pd\n", + "\n", + "    # Assuming the original data is loaded into a DataFrame named df\n", + "    # Here is the transformation process:\n", + "\n", + "    # Step 1: Isolate the Table Header\n", + "    # Remove the unnecessary top rows and columns\n", + "    final_df = df.iloc[2:, :9].copy()\n", + "\n", + "    # Step 2: Rename Columns of final_df\n", + "    # Adjust the column names to match the desired format\n", + "    final_df.columns = ['生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '本日耗费工时', '累计耗费工时']\n", + "\n", + "    # Step 3: Data Processing\n", + "    # Ensure there are no NaN values and drop any duplicate rows if necessary\n", + "    final_df.dropna(inplace=True)\n", + "    final_df.drop_duplicates(inplace=True)\n", + "\n", + "    # Convert the appropriate columns to numeric types\n", + "    final_df['预定产量'] = final_df['预定产量'].astype(int)\n", + "    final_df['本日产量预计'] = final_df['本日产量预计'].astype(int)\n", + "    final_df['本日产量实际'] = final_df['本日产量实际'].astype(int)\n", + "    final_df['累计产量'] = final_df['累计产量'].astype(int)\n", + "    final_df['本日耗费工时'] = final_df['本日耗费工时'].astype(int)\n", + "    final_df['累计耗费工时'] = final_df['累计耗费工时'].astype(int)\n", + "\n", + "    # Drop a stray header row if the first data row merely repeats the column names\n", + "    if final_df.columns.tolist() == final_df.iloc[0].tolist():\n", + "        final_df = final_df.iloc[1:]\n", + "\n", + "    # Reassign df1 with the formatted DataFrame\n", + "    df1 = final_df\n", + "except Exception as e:\n", + "    # Unable to apply formatting to the original DataFrame; proceeding with the unformatted one.\n", + "    print(f\"Reformat failed with error {e}; using the original DataFrame.\")" + ] + }, + { + "cell_type": "markdown", + "id": "2b589ac3-405c-4350-84f7-bf675ddaaa06", + "metadata": {}, + "source": [ + "Using the generated transformation code, the irregular dataset is converted into a clean, structured format, ready for analysis:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "76efd557-333b-46c3-a697-644a84b8e6ec", + "metadata": {}, + "outputs": [ + { + "data": {
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
生产日期制造编号产品名称预定产量本日产量预计本日产量实际累计产量本日耗费工时累计耗费工时
22007-08-12 00:00:00FK-003草莓果肉饮料100000400004500083000918
32007-08-13 00:00:00FK-004蓝莓果肉饮料100000400004500083000918
42007-08-14 00:00:00FK-005水密桃果肉饮料1000004000045000830001020
52007-08-15 00:00:00FK-006荔枝果肉饮料1000004000044000820001020
62007-08-16 00:00:00FK-007樱桃果肉饮料100000400004600084000918
\n", + "
" + ], + "text/plain": [ + " 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\n", + "2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\n", + "3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\n", + "4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\n", + "5 2007-08-15 00:00:00 FK-006 荔枝果肉饮料 100000 40000 44000 82000 10 20\n", + "6 2007-08-16 00:00:00 FK-007 樱桃果肉饮料 100000 40000 46000 84000 9 18" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "ac19e7a0-5487-4e02-80c0-f7dc48903df4", + "metadata": {}, + "source": [ + "## Dataset Structure Overview \n", + "\n", + "After normalization, the next step dives into the structural aspects of the dataset using the `df.info()` function. Unlike `df.head()`, which only shows a snippet of the data, `df.info()` provides a holistic view of the dataset’s structure. Key insights include:\n", + "\n", + "- **Column Data Types**: Helps identify numerical, categorical, or textual data at a glance.\n", + "- **Non-Null Counts**: Reveals the completeness of each column, making it easy to spot potential gaps or inconsistencies.\n", + "- **Memory Usage**: Offers a sense of the dataset's size, crucial for performance optimization in larger workflows.\n", + "\n", + "By focusing on the foundational structure of the dataset, this step enables the LLM to better understand the quality and layout of the data, paving the way for more informed analyses." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "2acf71a1-0e81-4f14-973e-05dfe1c9d963", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Index: 18 entries, 2 to 19\n", + "Data columns (total 9 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 生产日期 18 non-null object\n", + " 1 制造编号 18 non-null object\n", + " 2 产品名称 18 non-null object\n", + " 3 预定产量 18 non-null int64 \n", + " 4 本日产量预计 18 non-null int64 \n", + " 5 本日产量实际 18 non-null int64 \n", + " 6 累计产量 18 non-null int64 \n", + " 7 本日耗费工时 18 non-null int64 \n", + " 8 累计耗费工时 18 non-null int64 \n", + "dtypes: int64(6), object(3)" + ] + } + ], + "source": [ + "# Remove leading and trailing whitespaces in column names\n", + "df1.columns = df1.columns.str.strip()\n", + "\n", + "# Remove rows and columns that contain only empty values\n", + "df1 = df1.dropna(how='all').dropna(axis=1, how='all')\n", + "\n", + "# Get the basic information of the dataset\n", + "df1.info(memory_usage=False)" + ] + }, + { + "cell_type": "markdown", + "id": "226cebc1-e38c-4da8-8455-662db9c152f6", + "metadata": {}, + "source": [ + "## Dataset Content Preview\n", + "\n", + "Finally, we utilize the `df.head()` function to provide a **visual preview of the dataset’s content**. This step is crucial for understanding the actual values within the dataset—patterns, anomalies, or trends often become apparent here.\n", + "\n", + "The number of rows displayed (`n`) is configurable to balance between granularity and simplicity. For smaller datasets or detailed exploration, a larger `n` might be beneficial. However, for larger datasets, displaying too many rows could overwhelm the LLM with excessive details, detracting from the primary analytical objectives." 
+ { + "cell_type": "code", + "execution_count": 5, + "id": "37ffeb0f-a80f-4ca8-9fda-fa2054870acf", + "metadata": {}, + "outputs": [ + { + "data": {
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
生产日期制造编号产品名称预定产量本日产量预计本日产量实际累计产量本日耗费工时累计耗费工时
22007-08-12 00:00:00FK-003草莓果肉饮料100000400004500083000918
32007-08-13 00:00:00FK-004蓝莓果肉饮料100000400004500083000918
42007-08-14 00:00:00FK-005水密桃果肉饮料1000004000045000830001020
52007-08-15 00:00:00FK-006荔枝果肉饮料1000004000044000820001020
62007-08-16 00:00:00FK-007樱桃果肉饮料100000400004600084000918
\n", + "
" + ], + "text/plain": [ + " 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\n", + "2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\n", + "3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\n", + "4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\n", + "5 2007-08-15 00:00:00 FK-006 荔枝果肉饮料 100000 40000 44000 82000 10 20\n", + "6 2007-08-16 00:00:00 FK-007 樱桃果肉饮料 100000 40000 46000 84000 9 18" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show the first 5 rows to understand the structure\n", + "df1.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "a7b90c69-cf82-4711-9229-242113d30804", + "metadata": {}, + "source": [ + "## Why This Matters\n", + "\n", + "This structured, multi-step approach is not just about processing data; it's about making the LLM smarter in how it interacts with datasets. By systematically addressing issues like messy formatting, structural ambiguity, and information overload, we ensure the LLM operates with clarity and purpose.\n", + "\n", + "The separation of file reading from analysis offers several advantages:\n", + "\n", + "- Enhanced Accuracy: Preprocessing and structure-checking reduce the risk of errors in downstream analyses.\n", + "- Scalability: Handles datasets of varying complexity and size with equal efficiency.\n", + "- Transparency: Provides clear visibility into the dataset’s structure, enabling better decision-making.\n", + "\n", + "By adopting this method, TableGPT Agent transforms the way dataset files are read and analyzed, offering a smarter, more controlled, and ultimately more **user-friendly experience**." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/explanation/file-reading.md b/docs/explanation/file-reading.md deleted file mode 100644 index 3fbcf7f..0000000 --- a/docs/explanation/file-reading.md +++ /dev/null @@ -1,9 +0,0 @@ -# File Reading - -TableGPT Agent separates the file reading workflow from the data analysis workflow to maintain greater control over how the LLM inspects the dataset files. Typically, if you let the LLM inspect the dataset itself, it uses the `df.head()` function to preview the data. While this is sufficient for basic cases, we have implemented a more structured approach by hard-coding the file reading workflow into several steps: - -- `normalization` (optional): For some Excel files, the content may not be 'pandas-friendly'. We include an optional normalization step to transform the Excel content into a more suitable format for pandas. -- `df.info()`: Unlike `df.head()`, `df.info()` provides insights into the dataset's structure, such as the data types of each column and the number of non-null values, which also indicates whether a column contains NaN. This insight helps the LLM understand the structure and quality of the data. -- `df.head()`: The final step displays the first n rows of the dataset, where n is configurable. A larger value for n allows the LLM to glean more information from the dataset; however, too much detail may divert its attention from the primary task. 
- - diff --git a/docs/index.md b/docs/index.md index 540c2d3..effbb29 100644 --- a/docs/index.md +++ b/docs/index.md @@ -21,7 +21,7 @@ tablegpt-agent is a pre-built agent for [TableGPT2 (huggingface)](https://huggin - [Normalize Datasets](howto/normalize-datasets.md) - Explanation - [Agent Workflow](explanation/agent-workflow.md) - - [File Reading](explanation/file-reading.md) + - [File Reading](explanation/file-reading.ipynb) - [Reference](reference.md) ## Contributing diff --git a/mkdocs.yml b/mkdocs.yml index 78fe592..30750a1 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -56,7 +56,7 @@ nav: - Reference: reference.md - Explanation: - 'Agent Workflow': explanation/agent-workflow.md - - 'File Reading': explanation/file-reading.md + - 'File Reading': explanation/file-reading.ipynb repo_name: tablegpt/tablegpt-agent repo_url: https://github.com/tablegpt/tablegpt-agent