diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index fd4df67..d9ce585 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -1,4 +1,4 @@ -name: Unit and integration tests +name: tests on: workflow_dispatch: push: diff --git a/.readthedocs.yaml b/.readthedocs.yaml index b20cb0c..53afcfb 100644 --- a/.readthedocs.yaml +++ b/.readthedocs.yaml @@ -23,4 +23,6 @@ sphinx: # Optionally declare the Python requirements required to build your docs python: install: + - method: pip + path: . - requirements: docs/requirements.txt \ No newline at end of file diff --git a/README.md b/README.md index 4ec570b..6a86151 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,8 @@ # autoparser [![](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/) -[![tests](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml/badge.svg)](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml) +[![Test Status](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml/badge.svg)](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml) +[![Documentation Status](https://readthedocs.org/projects/autoparser/badge/?version=latest)](https://autoparser.readthedocs.io/en/latest/?badge=latest) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) @@ -10,6 +11,8 @@ TOML files, which can then be processed by [adtl](https://github.com/globaldothealth/adtl) to transform files from the source schema to a specified schema. +Documentation: [ReadTheDocs](https://autoparser.readthedocs.io/en/latest) + Contains functionality to: 1. Create a basic data dictionary from a raw data file (`create-dict`) 2. Use an LLM (currently only ChatGPT via the OpenAI API) to add descriptions to the @@ -93,7 +96,7 @@ defaultDateFormat = "%d/%m/%Y" which should automatically convert the dates for you. 2. 
ADTL can't find my schema (error: No such file or directory ..../x.schema.json) -autoparser puts the path to the schema at the top of the TOML file, relative to the +AutoParser puts the path to the schema at the top of the TOML file, relative to the *current location of the parser* (i.e, where you ran the autoparser command from). If you have since moved the parser file, you will need to update the schema path at the top of the TOML parser.
diff --git a/docs/conf.py b/docs/conf.py index fdcfb4e..e3ae7ed 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -4,7 +4,7 @@ # https://www.sphinx-doc.org/en/master/usage/configuration.html # -- Project information ----------------------------------------------------- # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information -project = "InsightBoard" +project = "AutoParser" copyright = "2024, globaldothealth" author = "globaldothealth" # -- General configuration --------------------------------------------------- @@ -14,9 +14,9 @@ "sphinx.ext.napoleon", "sphinx.ext.coverage", "sphinx.ext.graphviz", - "myst_parser", "sphinx_book_theme", "sphinxcontrib.mermaid", + "myst_nb", ] templates_path = [ "_templates",
diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md new file mode 100644 index 0000000..54ad793 --- /dev/null +++ b/docs/getting_started/index.md @@ -0,0 +1,19 @@ +# Getting started + +## Installation + +AutoParser is a Python package that can either be used within your code or run as a +command-line interface (CLI). You can install AutoParser using pip: + +```bash + python3 -m pip install git+https://github.com/globaldothealth/autoparser +``` + +Note that it is usually recommended to install into a virtual environment. We recommend using [uv](https://github.com/astral-sh/uv) to manage the virtual environment. To create and activate a virtual environment for AutoParser using `uv`, run the following commands: + +```bash +uv sync +. .venv/bin/activate +``` + +To use the CLI, type `autoparser` at the command line to see the available options.
diff --git a/docs/images/flowchart.png b/docs/images/flowchart.png new file mode 100644 index 0000000..ec4dfd5 Binary files /dev/null and b/docs/images/flowchart.png differ
diff --git a/docs/index.md b/docs/index.md index 800218a..3a3f748 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,14 +1,27 @@ -# autoparser -Autparser is a tool for semi-automated data parser creation. +# AutoParser +AutoParser is a tool for semi-automated data parser creation. The package allows you +to generate a new data parser for converting your source data into a new format specified +using a schema file, ready to use with the data transformation tool [adtl](https://adtl.readthedocs.io/en/latest/index.html). ## Key Features - Data Dictionary Creation: Automatically create a basic data dictionary framework - Parser Generation: Generate data parsers to match a given schema +## Framework + +```{figure} images/flowchart.png +Flowchart showing the inputs (bright blue), outputs (green blocks) and functions +(dashed diamonds) of AutoParser. +``` + +## Documentation ```{toctree} --- maxdepth: 2 +caption: Contents: --- self +getting_started/index +usage/data_dict +usage/parser_generation ``` \ No newline at end of file
diff --git a/docs/requirements.txt b/docs/requirements.txt index 476e903..3e80995 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,4 +1,4 @@ sphinx==8.0.2 -myst_parser==4.0.0 sphinx-book-theme==1.1.3 -sphinxcontrib-mermaid==0.9.2 \ No newline at end of file +sphinxcontrib-mermaid==0.9.2 +myst_nb \ No newline at end of file
diff --git a/docs/usage/data_dict.md b/docs/usage/data_dict.md new file mode 100644 index 0000000..56b94fb --- /dev/null +++ b/docs/usage/data_dict.md @@ -0,0 +1,63 @@ +# Creating a Data Dictionary + +## Motivation + +A data dictionary is a structured guide which contains the details of a data file. 
+It should contain, at minimum, a list of field/column names, and some kind of description +of what data each field holds. This often takes the form of a textual description, plus +a note of the data type (text, decimals, date, boolean...) and/or a set of expected values. + +A data dictionary is required by AutoParser for [parser generation](parser_generation). +This is to avoid having to send potentially sensitive or confidential data to an external +body (in this case, an externally hosted LLM); instead, a *description* of what the +data looks like, taken from the dictionary, can be sent to the LLM, which allows mapping to +occur without risking the unintentional release of data. + +Many data capture services such as [REDCap](https://projectredcap.org/) will generate +a data dictionary automatically when surveys are set up. However, where data is captured +rapidly, or by individuals/small teams, a formal data dictionary may not +have been created for a corresponding dataset. For this scenario, AutoParser provides +functionality to generate a simple dictionary based on your data. This dictionary can +then be used in other AutoParser modules. + +## Create a basic data dictionary +AutoParser will take your raw data file and create a basic data dictionary. 
For an example +dataset of animals, a generated data dictionary looks like this: + +| source_field | source_description | source_type | common_values | |-------------------|--------------------|-------------|----------------------------------------------------------| | Identité | | string | | | Province | | choice | Equateur, Orientale, Katanga, Kinshasa | | DateNotification | | string | | | Classicfication | | choice | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | | Nom complet | | string | | | Date de naissance | | string | | | AgeAns | | number | | | AgeMois | | number | | | Sexe | | choice | F, M, f, m, f, m , inconnu | + +`source_field` contains each column header from the source data, and `source_type` shows the +data type in each column. 'choice' denotes where a small set of strings has been detected, +so AutoParser assumes that specific terms are being used, and lists them in `common_values`. + +Notice that the `source_description` column is empty. It is left blank by default so the +user can add in a short text description *in English* (this column is read by the LLM +in later steps, which assumes the text is written in English). For example, the description +for the `AgeMois` column might be 'Age in Months'. + +If instead you would like to auto-generate these descriptions, AutoParser can use an LLM +to automate this step. Note that we strongly encourage all users to check the +auto-generated descriptions for accuracy before using the resulting data dictionary +to generate a data parser. + +## API + +```{eval-rst} +.. autofunction:: autoparser.create_dict + :noindex: + +.. 
autofunction:: autoparser.generate_descriptions + :noindex: +``` + + diff --git a/docs/usage/parser_generation.md b/docs/usage/parser_generation.md new file mode 100644 index 0000000..e5b17a8 --- /dev/null +++ b/docs/usage/parser_generation.md @@ -0,0 +1,51 @@ +# Write a Data Parser + +AutoParser assumes the use of Global.Health's [adtl](https://github.com/globaldothealth/adtl) +package to transform your source data into a standardised format. To do this, adtl requires a +[TOML](https://toml.io/en/) specification file which describes how raw data should be +converted into the new format, on a field-by-field basis. Every unique data file format +(i.e. unique sets of fields and data types) should have a corresponding parser file. + +AutoParser exists to semi-automate the process of writing new parser files. This requires +a data dictionary (which can be created if it does not already exist; see [data_dict]), +and the JSON schema of the target format. + +Parser generation is a two-step process. 
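To make the end product concrete before walking through the steps: a hypothetical fragment of the kind of field-by-field TOML mapping adtl consumes might look like the sketch below. The table name, key names, and layout here are illustrative assumptions only, not the exact schema that AutoParser emits.

```toml
# Hypothetical sketch only -- the real adtl parser schema may name
# these keys differently.
[case_status]
field = "StatusCas"          # source column in the raw data
description = "Case Status"

[case_status.values]         # value-level mapping, source -> target
"décédé" = "dead"
"vivant" = "alive"
```

Each target field gets an entry along these lines, so a parser file is essentially a schema-shaped collection of such mappings.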
+ +## Generate intermediate mappings (CSV) +First, an intermediate mapping file is created which can look like this: + +| target_field | source_description | source_field | common_values | target_values | value_mapping | |-------------------|--------------------|------------------|----------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------| | identity | Identity | Identité | | | | | name | Full Name | Nom complet | | | | | loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | | | country_iso3 | | | | | | | notification_date | Notification Date | DateNotification | | | | | classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish | | case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive | + +`target_x` refers to the desired output format, while `source_x` refers to the raw data. +In this example, the final row shows that the `case_status` field in the desired output +format should be filled using data from the `StatusCas` field in the raw data. The `value_mapping` +column indicates that all instances of `décédé` in the raw data should be mapped to `dead` +in the converted file, and `vivant` should map to `alive`. + +These intermediate mappings should be manually curated, as they are generated using an +LLM, which may be prone to errors and hallucinations and can produce incorrect matches for +either the field or the values within it. + +## Generate TOML + +This step is automated and should produce a TOML file that conforms to the adtl parser +schema, ready for use in transforming data. + +## API + +```{eval-rst} +.. 
autofunction:: autoparser.create_mapping + :noindex: + +.. autofunction:: autoparser.create_parser + :noindex: +``` \ No newline at end of file
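A note on what the curated `value_mapping` column in the intermediate CSV amounts to at transform time: it is a simple source-to-target dictionary lookup per value. The pure-Python sketch below is illustrative only; it is not adtl's actual implementation, and the lower-casing step is an assumption based on the mixed-case example data above.

```python
# Toy values from the example "StatusCas" column; None models a missing entry.
rows = ["Vivant", "Décédé", "vivant", None]

# value_mapping as curated in the intermediate CSV: décédé=dead, vivant=alive
value_mapping = {"décédé": "dead", "vivant": "alive"}

def map_value(raw):
    """Lower-case the raw value and look it up, returning None when unmapped."""
    if raw is None:
        return None
    return value_mapping.get(raw.lower())

mapped = [map_value(r) for r in rows]
print(mapped)  # ['alive', 'dead', 'alive', None]
```

This is also why the intermediate file needs curation: a wrong pair in `value_mapping` silently rewrites every occurrence of that value in the converted data.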