This repository has been archived by the owner on Dec 10, 2024. It is now read-only.
Commit: 10 changed files with 160 additions and 9 deletions.
@@ -1,4 +1,4 @@
-name: Unit and integration tests
+name: tests
 on:
   workflow_dispatch:
   push:
# Getting started

## Installation

AutoParser is a Python package that can either be built into your code or run as a
command-line interface (CLI). You can install AutoParser using pip:

```bash
python3 -m pip install git+https://github.com/globaldothealth/autoparser
```

We recommend installing into a virtual environment, for example using
[uv](https://github.com/astral-sh/uv) to manage it. To create and activate a virtual
environment for AutoParser using `uv`, run the following commands:

```bash
uv sync
. .venv/bin/activate
```

To view and use the CLI, type `autoparser` into the command line to see the available options.
@@ -1,14 +1,27 @@
-# autoparser
-Autparser is a tool for semi-automated data parser creation.
+# AutoParser
+AutoParser is a tool for semi-automated data parser creation. The package allows you
+to generate a new data parser for converting your source data into a new format specified
+using a schema file, ready to use with the data transformation tool [adtl](https://adtl.readthedocs.io/en/latest/index.html).
+
+## Key Features
+- Data Dictionary Creation: Automatically create a basic data dictionary framework
+- Parser Generation: Generate data parsers to match a given schema
+
+## Framework
+
+```{figure} images/flowchart.png
+Flowchart showing the inputs (bright blue), outputs (green blocks) and functions
+(dashed diamonds) of AutoParser.
+```
+
+## Documentation
+```{toctree}
+---
+maxdepth: 2
+caption: Contents:
+---
+self
+getting_started/index
+usage/data_dict
+usage/parser_generation
+```
@@ -1,4 +1,4 @@
 sphinx==8.0.2
 myst_parser==4.0.0
 sphinx-book-theme==1.1.3
-sphinxcontrib-mermaid==0.9.2
+sphinxcontrib-mermaid==0.9.2
+myst_nb
# Creating a Data Dictionary

## Motivation

A data dictionary is a structured guide which contains the details of a data file.
It should contain, at minimum, a list of field/column names and some kind of description
of what data each field holds. This often takes the form of a textual description, plus
a note of the data type (text, decimal, date, boolean...) and/or a set of expected values.

A data dictionary is required by AutoParser for [parser generation](parser_generation).
This avoids having to send potentially sensitive or confidential data to an external
body (in this case an externally hosted LLM); instead, a *description* of what the
data looks like can be sent from the dictionary to the LLM, which allows mapping to
occur without risking the unintentional release of data.
Many data capture services such as [REDCap](https://projectredcap.org/) will generate
a data dictionary automatically when surveys are set up. However, where data is
captured either rapidly or by individuals/small teams, a formal data dictionary may not
have been created for a corresponding dataset. For this scenario, AutoParser provides
functionality to generate a simple dictionary based on your data. This dictionary can
then be used in other AutoParser modules.

## Create a basic data dictionary

AutoParser will take your raw data file and create a basic data dictionary. For an example
dataset of animals, a generated data dictionary looks like this:

| source_field | source_description | source_type | common_values |
|-------------------|--------------------|-------------|----------------------------------------------------------|
| Identité | | string | |
| Province | | choice | Equateur, Orientale, Katanga, Kinshasa |
| DateNotification | | string | |
| Classicfication | | choice | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU |
| Nom complet | | string | |
| Date de naissance | | string | |
| AgeAns | | number | |
| AgeMois | | number | |
| Sexe | | choice | F, M, f, m, f, m , inconnu |

`source_field` contains each column header from the source data, and `source_type` shows the
data type in each column. 'choice' denotes where a small set of strings has been detected,
so AutoParser assumes that specified terms are being used, and lists them in `common_values`.
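The type-inference rule described above can be sketched in plain Python. This is a simplified illustration only, not AutoParser's actual implementation; the `basic_dictionary` name and the `max_choices` threshold are assumptions for the sake of the example.

```python
def basic_dictionary(rows: list[dict], max_choices: int = 12) -> list[dict]:
    """Build a minimal data dictionary from tabular records: one entry
    per column, with an inferred type and, for small sets of repeated
    strings, the list of observed values ('choice' columns)."""

    def is_number(value: str) -> bool:
        try:
            float(value)
            return True
        except ValueError:
            return False

    dictionary = []
    for field in (rows[0].keys() if rows else []):
        values = [r[field] for r in rows if r.get(field) not in (None, "")]
        uniques = sorted(set(values))
        if values and all(is_number(v) for v in values):
            source_type, common = "number", ""
        elif uniques and len(uniques) <= max_choices:
            # Few distinct strings: assume a fixed set of terms is in use
            source_type, common = "choice", ", ".join(uniques)
        else:
            source_type, common = "string", ""
        dictionary.append(
            {"source_field": field, "source_description": "",
             "source_type": source_type, "common_values": common}
        )
    return dictionary
```

The `source_description` column is deliberately left blank, mirroring the default behaviour described below.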
Notice that the `source_description` column is left empty. This is the default, so the
user can add a short text description *in English* (this column is read by the LLM
in later steps, which assumes the text is written in English). For example, the
description for the `AgeMois` column might be 'Age in Months'.

If you would instead like to auto-generate these descriptions, AutoParser can use an LLM
to automate this step. We strongly encourage all users to check the auto-generated
descriptions for accuracy before using the resulting data dictionary to generate a
data parser.
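Note that what is sent to the LLM in this step is only column metadata, never raw data rows. A hypothetical sketch of how such a request might be assembled; the `build_description_prompt` helper and the prompt wording are illustrative assumptions, not AutoParser's real interface:

```python
def build_description_prompt(fields: list[str]) -> str:
    """Assemble a prompt asking an LLM for a short English description
    of each column header, one description per line."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "For each of the following column headers from a dataset, "
        "suggest a short English description of what the column "
        "likely contains. Answer with one description per line.\n"
        + field_list
    )

# Only the column names cross the network boundary, not the data itself
prompt = build_description_prompt(["AgeMois", "DateNotification"])
```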
## API

```{eval-rst}
.. autofunction:: autoparser.create_dict
    :noindex:
.. autofunction:: autoparser.generate_descriptions
    :noindex:
```
# Write a Data Parser

AutoParser assumes the use of Global.Health's [adtl](https://github.com/globaldothealth/adtl)
package to transform your source data into a standardised format. To do this, adtl requires
a [TOML](https://toml.io/en/) specification file which describes how raw data should be
converted into the new format, on a field-by-field basis. Every unique data file format
(i.e. unique set of fields and data types) should have a corresponding parser file.

AutoParser exists to semi-automate the process of writing new parser files. This requires
a data dictionary (which can be created if it does not already exist, see
[Creating a Data Dictionary](data_dict)) and the JSON schema of the target format.

Parser generation is a two-step process.
## Generate intermediate mappings (CSV)

First, an intermediate mapping file is created, which can look like this:

| target_field | source_description | source_field | common_values | target_values | value_mapping |
|-------------------|--------------------|------------------|----------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------|
| identity | Identity | Identité | | | |
| name | Full Name | Nom complet | | | |
| loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | |
| country_iso3 | | | | | |
| notification_date | Notification Date | DateNotification | | | |
| classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish |
| case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive |

`target_x` refers to the desired output format, while `source_x` refers to the raw data.
In this example, the final row shows that the `case_status` field in the desired output
format should be filled using data from the `StatusCas` field in the raw data. The
`value_mapping` column indicates that all instances of `décédé` in the raw data should be
mapped to `dead` in the converted file, and `vivant` should map to `alive`.
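A `value_mapping` entry is essentially a serialised lookup table. A minimal sketch of how such an entry could be parsed and applied during transformation; `parse_value_mapping` and `map_value` are hypothetical helpers for illustration, not part of AutoParser or adtl:

```python
def parse_value_mapping(spec: str) -> dict[str, str]:
    """Turn a spec like 'décédé=dead, vivant=alive' into a lookup
    table, keyed case-insensitively on the raw value."""
    mapping = {}
    for pair in spec.split(","):
        if "=" in pair:
            raw, target = pair.split("=", 1)
            mapping[raw.strip().lower()] = target.strip()
    return mapping

def map_value(value: str, mapping: dict[str, str]) -> str:
    """Look up a raw value, falling back to the original if unmapped."""
    return mapping.get(value.strip().lower(), value)
```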
These intermediate mappings should be manually curated, as they are generated using an
LLM, which may be prone to errors and hallucinations, generating incorrect matches for
either the field or the values within that field.

## Generate TOML

This step is automated and should produce a TOML file that conforms to the adtl parser
schema, ready to use for transforming data.
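As an illustration, the `case_status` row from the mapping table above might end up as a field rule along these lines. This is a hypothetical fragment sketching the field/value lookup idea only; consult the adtl documentation for the exact parser schema.

```toml
# Hypothetical fragment -- see the adtl docs for the real parser schema
[subject]
  case_status = { field = "StatusCas", values = { "décédé" = "dead", "vivant" = "alive" } }
```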
## API

```{eval-rst}
.. autofunction:: autoparser.create_mapping
    :noindex:
.. autofunction:: autoparser.create_parser
    :noindex:
```