Commit af92771 (1 parent: e30a4cb), showing 7 changed files with 293 additions and 43 deletions.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parser construction example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates the process of constructing a parser file, using `animal_data.csv` as the source dataset.\n",
"\n",
"Before you start: `autoparser` requires an OpenAI API key to function. You should add yours to your environment, as described [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). \n",
"Edit the `API_KEY` line below to match the name you gave your environment variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import autoparser\n",
"import pandas as pd\n",
"import os\n",
"API_KEY = os.environ.get(\"OPENAI_API_KEY\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"tests/sources/animal_data.csv\")\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a basic data dictionary from this dataset, using the configuration file set up for it in the `tests` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"writer = autoparser.DictWriter(\"tests/test_config.toml\")\n",
"data_dict = writer.create_dict(data)\n",
"data_dict.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'Common Values' column flags fields with a limited number of unique values, suggesting that mapping to a controlled terminology may already have been done, or might be required in the parser. The list of common values contains every unique value in the field.\n",
"\n",
"Notice that the Description column is empty. To proceed to the next step of parser generation, creating the mapping file that links source fields to schema fields, this column must be filled. You can either fill it in by hand (the descriptions MUST be in English), as sketched in the next cell, or use autoparser's LLM functionality to do it for you, demonstrated further below."
]
},
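{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather write the descriptions yourself, one minimal approach (a sketch, not part of the autoparser API) is to write the dictionary out as a CSV, fill in the 'Description' column in a spreadsheet or text editor, and read it back in. The file name `data_dictionary.csv` below is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the by-hand route (optional): write out, edit, read back.\n",
"# The file name here is illustrative; use whatever is convenient.\n",
"data_dict.to_csv(\"data_dictionary.csv\", index=False)\n",
"# ...fill in the 'Description' column (in English) by hand, then:\n",
"# dd_described = pd.read_csv(\"data_dictionary.csv\")"
]
},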
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dd_described = writer.generate_descriptions(\"fr\", data_dict, key=API_KEY)\n",
"dd_described.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mapper = autoparser.Mapper(\"tests/schemas/animals.schema.json\", dd_described, \"fr\", api_key=API_KEY, config=\"tests/test_config.toml\")\n",
"mapping_dict = mapper.create_mapping(file_name='example_mapping.csv')\n",
"\n",
"mapping_dict.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you should inspect the mapping file, look for fields or values that have been mapped incorrectly, and edit them where necessary.\n",
"The mapping file has been written out to [example_mapping.csv](example_mapping.csv). A good example is the 'loc_admin_1' field: because the schema denotes this as a free-text field, the LLM often maps the common values provided to 'None'. Instead, delete these mapped values so the parsed data retains the original free text; a sketch of one such edit follows below.\n",
"Also note the warning above: the LLM should not have found fields to map to the 'country_iso3' or 'owner' fields. If the original data did contain an appropriate field for either of these, you should edit the mapping file accordingly."
]
},
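{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can make these edits in a spreadsheet, or programmatically with pandas. The sketch below shows the general idea for the 'loc_admin_1' fix; the column names used are assumptions, so check the header of `example_mapping.csv` for the real ones before uncommenting the edit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the column names ('target_field', 'source_values') are assumptions;\n",
"# check example_mapping.csv for the actual headers before applying the edit.\n",
"mapping = pd.read_csv(\"example_mapping.csv\")\n",
"# mapping.loc[mapping[\"target_field\"] == \"loc_admin_1\", \"source_values\"] = None  # drop the spurious value mapping\n",
"# mapping.to_csv(\"example_mapping.csv\", index=False)\n",
"mapping.head()"
]
},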
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have edited the mapping file to your satisfaction, you can go ahead and create the TOML parser file, `example_parser.toml`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"writer = autoparser.ParserGenerator(\"example_mapping.csv\", \"tests/schemas\", \"example\", config=\"tests/test_config.toml\")\n",
"writer.create_parser(\"example_parser.toml\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can view/edit the created parser at [example_parser.toml](example_parser.toml), and try it out using [ADTL](https://github.com/globaldothealth/adtl); one possible invocation is sketched below."
]
},
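{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (not verified against the ADTL documentation), the cell below shows the general shape of running the generated parser over the source data. The exact arguments may differ between ADTL versions, so check `adtl --help` before running it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the CLI usage for your installed version of adtl:\n",
"!adtl --help\n",
"# Then run the generated parser over the source data; the argument order below is an assumption:\n",
"# !adtl example_parser.toml tests/sources/animal_data.csv"
]
},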
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}