This repository has been archived by the owner on Dec 10, 2024. It is now read-only.

Edit README, add example.ipynb
pipliggins committed Nov 7, 2024
1 parent e30a4cb commit af92771
Showing 7 changed files with 293 additions and 43 deletions.
79 changes: 78 additions & 1 deletion README.md
@@ -1,5 +1,8 @@
# autoparser
Temporary repo for mpox-specific autoparser.
autoparser helps generate ADTL parsers as
TOML files, which can then be processed by
[adtl](https://github.com/globaldothealth/adtl) to transform files from the
source schema to a specified schema.

Contains functionality to:
1. Create a basic data dictionary from a raw data file (`create-dict`)
@@ -11,3 +14,77 @@ Contains functionality to:
(rules-based from the mapping file; `create-parser`).

All four functions have both a command-line interface and an associated Python function.

## Parser construction process (CLI)

1. **Data**: Get the data as CSV or Excel and the data dictionary if available.

2. **Creating autoparser config**: Optional step if the data is not in REDCap
(English) format. The autoparser config ([example](redcap-en.toml),
[schema](#autoparser-config-schema)) specifies most of the variable
configuration settings for autoparser.

3. **Preparing the data dictionary**: If the data dictionary is not a CSV, or is
   split across multiple Excel sheets, it needs to be combined into a single
   CSV. If a data dictionary does not already exist, one can be created using

```shell
autoparser create-dict <path to data> -o <parser-name>
```

Here, `-o` sets the output name and will create
`<parser-name>.csv`. For optional arguments (such as using a custom configuration
created in step 2), see `autoparser create-dict --help`.
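As an illustration, the generated data dictionary has one row per source field. A minimal sketch of its expected shape follows; the 'Description' and 'Common Values' column names come from the example notebook, while the 'Field Name' column and the sample rows are hypothetical, and the real output may include additional columns:

```python
import pandas as pd

# Hypothetical sketch of a generated data dictionary: one row per source
# column, with common values listed where a field looks like controlled
# terminology. Descriptions start empty and are filled in later, either
# by hand or via the LLM (see the mapping step below... in the README).
data_dict = pd.DataFrame(
    {
        "Field Name": ["ident", "classification", "notes"],
        "Description": [None, None, None],
        "Common Values": [None, "oui, non", None],
    }
)
print(data_dict.head())
```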

4. **Generate intermediate mappings (CSV)**: Run with config and data dictionary
to generate mappings:

```shell
autoparser create-mapping <path to data dictionary> <path to schema> <language> <api key> -o <parser-name>
```

Here `language` refers to the language of the original data, e.g. "fr" for
French-language data. `autoparser` defaults to using OpenAI as the LLM API, so the
API key provided should be for the OpenAI platform. In the future, alternative APIs
and/or a self-hosted LLM are planned as options.
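Rather than pasting the key on the command line each time, the example notebook reads it from an environment variable. A minimal sketch of that pattern (the placeholder value below is hypothetical; set the real key in your shell, e.g. `export OPENAI_API_KEY=...`, rather than in code):

```python
import os

# Read the OpenAI API key from the environment, as in example.ipynb.
# setdefault only fills in the hypothetical placeholder when no real
# key is present, so an exported key always takes precedence.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
API_KEY = os.environ["OPENAI_API_KEY"]
```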

5. **Curate mappings**: The intermediate mappings must be manually curated, as
the LLM may have generated false matches, or missed certain fields or value mappings.

6. **Generate TOML**: This step is automated and should produce a TOML file that
conforms to the parser schema.

For example:

```shell
autoparser create-toml parser.csv <path to schema> -n parser
```

will create `parser.toml` (specified using the `-n` flag) from the
intermediate mappings `parser.csv` file.

7. **Review TOML**: The TOML file may contain errors, so it is recommended to
check it and alter as necessary.

8. **Run adtl**: Run adtl on the TOML file and the data source. This process
will report validation errors, which can be fixed by reviewing the TOML file
and looking at the source data that is invalid.

## Parser construction process (Python)

An [example notebook](example.ipynb) has been provided using the test data to demonstrate
the process of constructing a parser using the Python functions of `autoparser`.

## Troubleshooting autogenerated parsers

1. "I get validation errors like `'x' must be date`":
ADTL expects dates to be provided in ISO format (i.e. YYYY-MM-DD). If your dates are
formatted differently, e.g. "dd/mm/yyyy", you can add a line in the header
of the TOML file (e.g. underneath the line `returnUnmatched = true`) like this:

```TOML
defaultDateFormat = "%d/%m/%Y"
```
which should automatically convert the dates for you.
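The conversion this asks adtl to perform can be illustrated with plain Python: `%d/%m/%Y` is the `strptime` pattern for "dd/mm/yyyy" dates, and the ISO form falls out of `date.isoformat()`. This is a sketch of the behaviour, not adtl's actual implementation:

```python
from datetime import datetime

def to_iso(value: str, fmt: str = "%d/%m/%Y") -> str:
    # Parse a source date with the configured format and emit the
    # ISO (YYYY-MM-DD) form that ADTL's validation expects.
    return datetime.strptime(value, fmt).date().isoformat()

print(to_iso("07/11/2024"))  # → 2024-11-07
```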

2. "ADTL can't find my schema" (error: `No such file or directory ..../x.schema.json`)
158 changes: 158 additions & 0 deletions example.ipynb
@@ -0,0 +1,158 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parser construction example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This file demonstrates the process of constructing a parser file using `animals.csv` as a source dataset.\n",
"\n",
"Before you start: `autoparser` requires an OpenAI API key to function. You should add yours to your environment, as described [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). \n",
"Edit the `API_KEY` line below to match the name you gave yours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import autoparser\n",
"import pandas as pd\n",
"import os\n",
"API_KEY = os.environ.get(\"OPENAI_API_KEY\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"tests/sources/animal_data.csv\")\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a basic data dictionary from this data set. We want to use the configuration file set up for this dataset, located in the `tests` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"writer = autoparser.DictWriter(\"tests/test_config.toml\")\n",
"data_dict = writer.create_dict(data)\n",
"data_dict.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'Common Values' column indicates fields where there are a limited number of unique values, suggesting mapping to a controlled terminology may have been done, or might be required in the parser. The list of common values is every unique value in the field.\n",
"\n",
"Notice that the Description column is empty. To proceed to the next step of the parser generation process, creating the mapping file linking source -> schema fields, this column must be filled. You can either do this by hand (the descriptions MUST be in English), or use autoparser's LLM functionality to do it for you, demonstrated below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dd_described = writer.generate_descriptions(\"fr\", data_dict, key=API_KEY)\n",
"dd_described.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mapper = autoparser.Mapper(\"tests/schemas/animals.schema.json\", dd_described, \"fr\", api_key=API_KEY, config=\"tests/test_config.toml\")\n",
"mapping_dict = mapper.create_mapping(file_name='example_mapping.csv')\n",
"\n",
"mapping_dict.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you should inspect the mapping file and look for fields/values that have been incorrectly mapped, and edit them where necessary.\n",
"The mapping file has been written out to [example_mapping.csv](example_mapping.csv). A good example is the 'loc_admin_1' field; the LLM often maps the common values provided to 'None' as the schema denotes this as a free-text field. Instead, delete these mapped values and the parsed data will contain the original free text.\n",
"Also note the warning above; the LLM should not have found fields to map to the 'country_iso3' or 'owner' fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, `example_parser.toml`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"writer = autoparser.ParserGenerator(\"example_mapping.csv\", \"tests/schemas\", \"example\", config=\"tests/test_config.toml\")\n",
"writer.create_parser(\"example_parser.toml\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can view/edit the created parser at [example_parser.toml](example_parser.toml), and try it out using [ADTL](https://github.com/globaldothealth/adtl)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
6 changes: 4 additions & 2 deletions src/autoparser/dict_writer.py
@@ -32,8 +32,10 @@ class DictWriter:

def __init__(
self,
config: Path | None = None,
config: Path | str | None = None,
):
if isinstance(config, str):
config = Path(config)
self.config = read_data(config or Path(Path(__file__).parent, DEFAULT_CONFIG))

def _setup_llm(self, key: str, name: str):
@@ -170,7 +172,7 @@ def generate_descriptions(
pd.DataFrame
Data dictionary with descriptions added
"""
if not data_dict:
if data_dict is None:
try:
data_dict = self.data_dictionary
except AttributeError:
19 changes: 13 additions & 6 deletions src/autoparser/make_toml.py
@@ -18,20 +18,23 @@
def adtl_header(
name: str,
description: str,
tables_schemas: dict,
definitions: dict = {},
):
"The ADTL-specific header for the TOML file"
schemas = {}
for table in tables_schemas:
schemas[table] = {
"kind": "oneToOne",
"schema": f"{tables_schemas[table]}",
}

return {
"adtl": {
"name": name,
"description": description,
"returnUnmatched": True,
"tables": {
"linelist": {
"kind": "oneToOne",
"schema": "../../schemas/linelist.schema.json",
},
},
"tables": schemas,
**{"defs": definitions},
}
}
@@ -100,6 +103,10 @@ def __init__(
self.header = adtl_header(
self.parser_name,
self.parser_description,
{
t: self.schema_path / Path(self.config["schemas"][t])
for t in self.tables
},
self.references_definitions[1],
)

10 changes: 8 additions & 2 deletions src/autoparser/util.py
@@ -14,7 +14,10 @@
DEFAULT_CONFIG = "config/mpox-cdc.toml"


def read_data(path: Path) -> Dict:
def read_data(path: str | Path) -> Dict:
if isinstance(path, str):
path = Path(path)

if path.suffix == ".json":
return read_json(path)
elif path.suffix == ".toml":
@@ -24,7 +27,10 @@ def read_data(path: Path) -> Dict:
raise ValueError(f"read_data(): Unsupported file format: {path.suffix}")


def read_json(file: str) -> dict:
def read_json(file: str | Path) -> dict:
if isinstance(file, str):
file = Path(file)

with file.open() as fp:
return json.load(fp)

4 changes: 2 additions & 2 deletions tests/__snapshots__/test_parser_generator.ambr
@@ -15,9 +15,9 @@
'name': 'animals',
'returnUnmatched': True,
'tables': dict({
'linelist': dict({
'animals': dict({
'kind': 'oneToOne',
'schema': '../../schemas/linelist.schema.json',
'schema': 'tests/schemas/animals.schema.json',
}),
}),
}),
