Add documentation for setup and WDI examples (#4)

* Add wdi scraper and initial documentation
* Add downloading the data section
* Add loading data to the database instructions
* Add setting up the environment
* Update toc with content

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
worldbank · May 30, 2023 · 69fcefb
1 parent 28d372c

Showing 8 changed files with 314 additions and 1 deletion.
diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -1,2 +1,13 @@
 format: jb-book
 root: README
+
+parts:
+  - caption: Setting up your environment
+    chapters:
+      - file: "notebooks/examples/Setting up the environment.ipynb"
+
+  - caption: Indicators
+    chapters:
+      - file: notebooks/examples/indicators/README
+        sections:
+        - file: "notebooks/examples/indicators/Getting started with the WDI.ipynb"
diff --git a/notebooks/examples/Setting up the environment.ipynb b/notebooks/examples/Setting up the environment.ipynb
@@ -0,0 +1,74 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Setting up the environment"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use environment variables to configure the library. Set them in a `.env` file in the root directory of the project.\n",
+    "\n",
+    "The following environment variables, also specified in the `example.env` file, are used:\n",
+    "\n",
+    "```bash\n",
+    "# OpenAI\n",
+    "ORGANIZATION=\"\"\n",
+    "OPENAI_API_KEY=\"\"\n",
+    "\n",
+    "\n",
+    "# DATABASES\n",
+    "WDI_DB_TABLE_NAME=\"wdi\"\n",
+    "\n",
+    "## POSTGRESQL\n",
+    "WDI_DB_ENGINE=\"postgresql\"\n",
+    "WDI_DB_HOST=\"localhost\"\n",
+    "WDI_DB_USERNAME=\"postgres\"\n",
+    "WDI_DB_PASSWORD=\"<your password>\"\n",
+    "WDI_DB_PORT=5432\n",
+    "\n",
+    "## SQLITE\n",
+    "# WDI_DB_ENGINE=\"sqlite\"\n",
+    "# WDI_DB_HOST=\n",
+    "# WDI_DB_USERNAME=\"/data/sqldb/wdi.db\"\n",
+    "# WDI_DB_PASSWORD=\n",
+    "# WDI_DB_PORT=\n",
+    "\n",
+    "\n",
+    "# DIRS\n",
+    "OPENAI_PAYLOAD_DIR=\"data/openai/payloads\"\n",
+    "\n",
+    "\n",
+    "# Task Labels\n",
+    "## Specify the label in the payload directory\n",
+    "## for this prompt set.\n",
+    "TASK_LABEL_WDI_SQL=\"prompt2wdiSQL\"\n",
+    "```"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `example.env` file lists the variables used in the project. Copy it, rename the copy to `.env`, and fill in your own values. The `.env` file is ignored by Git so that secrets such as API keys stay out of the repository.\n",
+    "\n",
+    "NEVER commit your `.env` file to Git. It is ignored by default in the `.gitignore` file, but you should double-check that it is not being tracked by Git."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
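
For readers wiring the variables from the notebook above into code, here is a minimal sketch of how the `WDI_DB_*` values could be combined into a SQLAlchemy-style connection URL. The helper name `build_db_url` and the URL layout are assumptions for illustration, not the library's actual logic. Treating `WDI_DB_USERNAME` as the SQLite file path only mirrors the commented-out example in the notebook.

```python
from urllib.parse import quote_plus


def build_db_url(env):
    """Build a connection URL from WDI_DB_* settings (hypothetical helper)."""
    engine = env.get("WDI_DB_ENGINE", "postgresql")

    if engine == "sqlite":
        # Assumption: as in the commented-out example, the database file
        # path is carried in WDI_DB_USERNAME.
        return f"sqlite:///{env['WDI_DB_USERNAME']}"

    # Percent-encode the password so characters such as "@" or "/"
    # cannot break the URL.
    password = quote_plus(env.get("WDI_DB_PASSWORD", ""))
    return (
        f"{engine}://{env['WDI_DB_USERNAME']}:{password}"
        f"@{env['WDI_DB_HOST']}:{env['WDI_DB_PORT']}"
    )
```

Since `python-dotenv` is already a project dependency, calling `load_dotenv()` and then `build_db_url(os.environ)` would be one way to read the values from `.env`.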
diff --git a/notebooks/examples/indicators/Getting started with the WDI.ipynb b/notebooks/examples/indicators/Getting started with the WDI.ipynb
@@ -0,0 +1,83 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Getting Started with the World Development Indicators (WDI) data\n",
+    "\n",
+    "In this notebook, we will show the steps to get started with the WDI data. We will use the [WDI API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) to get the data.\n",
+    "\n",
+    "We store the data under the `data/` folder. To use a different location, change the `--data_dir` argument in the commands below, and use that same location throughout the rest of the notebook.\n",
+    "\n",
+    "After the data is collected, we will store it in a [SQLite](https://www.sqlite.org/index.html) database. With the data in a database, we can then use LLM4Data to query the data using natural language."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Downloading the data\n",
+    "\n",
+    "\n",
+    "If the data is not yet available in the `data` folder, download it from the WDI API:\n",
+    "\n",
+    "```\n",
+    "poetry run python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force\n",
+    "```\n",
+    "\n",
+    "This scrapes the data from the WDI API and stores it in the `data/indicators/wdi` folder, with one `<indicator_id>.json` file per indicator. The `--force` flag re-downloads indicators whose files already exist; omit it to skip them."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Storing the data in a database\n",
+    "\n",
+    "After the data is downloaded, we will store it in a database. We will use SQLite for this example. You can use other databases as well, as long as you have the appropriate drivers installed.\n",
+    "\n",
+    "See the *Setting up the environment* notebook for instructions on how to set the relevant environment variables.\n",
+    "\n",
+    "You can then run the following command to store the data in a database:\n",
+    "\n",
+    "```\n",
+    "poetry run python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi\n",
+    "```\n",
+    "\n",
+    "This will create a database file in the `data/indicators/wdi` folder. The database file will be named based on the information you specified in the environment variables.\n",
+    "\n",
+    "Alternatively, you can run the cells below to store the data in a database."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from llm4data.llm.indicators.wdi_sql import WDISQL\n",
+    "\n",
+    "\n",
+    "wdi_jsons_dir = \"data/indicators/wdi\"\n",
+    "\n",
+    "WDISQL.load_wdi_jsons(wdi_jsons_dir)"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
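
The internals of `WDISQL.load_wdi_jsons` are not part of this commit. As a rough stand-in, the sketch below loads the per-indicator JSON files into a SQLite table using only the standard library. The function name `load_jsons_to_sqlite`, the untyped table layout, and the database path are assumptions; the column names follow the records written by the scraper in this commit.

```python
import json
import sqlite3
from pathlib import Path

# Column names match the normalized records saved by the scraper.
COLUMNS = [
    "indicator_id", "indicator_name", "country_id", "country_name",
    "country_iso3", "date", "value", "unit", "obs_status", "decimal",
]


def load_jsons_to_sqlite(jsons_dir, db_path, table="wdi"):
    """Insert every record from <jsons_dir>/*.json into a SQLite table.

    Hypothetical helper; returns the number of rows inserted.
    """
    # Quote identifiers, since "date" and "decimal" double as SQL type names.
    column_sql = ", ".join(f'"{c}"' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    inserted = 0

    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
        for path in sorted(Path(jsons_dir).glob("*.json")):
            records = json.loads(path.read_text())
            rows = [tuple(r.get(c) for c in COLUMNS) for r in records]
            conn.executemany(
                f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
                rows,
            )
            inserted += len(rows)
        conn.commit()
    finally:
        conn.close()

    return inserted
```

Calling `load_jsons_to_sqlite("data/indicators/wdi", "wdi.db")` would then fill a `wdi` table that can be inspected with any SQLite client.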
diff --git a/notebooks/examples/indicators/README b/notebooks/examples/indicators/README
@@ -0,0 +1 @@
+# Indicators
diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -16,6 +16,7 @@ langchain = "^0.0.178"
 psycopg2-binary = "^2.9.6"
 python-dotenv = "^1.0.0"
 tiktoken = "^0.4.0"
+fire = "^0.5.0"
 
 [tool.poetry.group.test.dependencies]
 pytest = "^7.3.1"

diff --git a/scripts/scrapers/indicators/wdi.py b/scripts/scrapers/indicators/wdi.py
@@ -0,0 +1,100 @@
+import json
+import requests
+import pandas as pd
+from pathlib import Path
+from tqdm.auto import tqdm
+import fire
+import backoff
+
+from llm4data import indicator2name
+
+class WDIException(Exception):
+    pass
+
+
+@backoff.on_exception(backoff.expo, WDIException, max_tries=20)
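+def get_json(url):
+    try:
+        # A timeout keeps a stalled connection from hanging the scraper.
+        response = requests.get(url, timeout=60)
+        return response.json()
+    except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as e:
+        # Wrap transport and decoding errors so that backoff retries them.
+        raise WDIException(e) from e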
+
+
+class IndicatorScraper:
+
+    def __init__(self, indicator_id, data_dir):
+        self.indicator_id = indicator_id
+        self.data_dir = Path(data_dir)
+
+    def get_api_url(self, page: int = 1, per_page: int = 1000):
+        return f"https://api.worldbank.org/v2/country/all/indicator/{self.indicator_id}?format=json&page={page}&per_page={per_page}"
+
+    def scrape(self):
+        page = 1
+        _json = get_json(self.get_api_url(page))
+
+        try:
+            data = _json[1]
+        except IndexError:
+            print(f"Skipping (IndexError): {self.indicator_id}")
+            return None
+
+        total = _json[0]["pages"]
+        # print(f"Total pages: {total}")
+
+        for page in tqdm(range(2, total + 1), desc=f"Scraping: ({self.indicator_id})", position=1):
+            _data = get_json(self.get_api_url(page))[1]
+            data += _data
+
+        data = [self.normalize_record(d) for d in data]
+
+        return data
+
+    @property
+    def filename(self):
+        return self.data_dir / (self.indicator_id + ".json")
+
+    def save(self, data):
+        self.filename.parent.mkdir(exist_ok=True, parents=True)
+        self.filename.write_text(json.dumps(data, indent=2))
+
+    def run(self, force: bool = False):
+
+        if not force and self.filename.exists():
+            print(f"Skipping: {self.filename}")
+            return
+
+        data = self.scrape()
+
+        if data is not None:
+            self.save(data)
+
+    def normalize_record(self, data):
+        return {
+            "indicator_id": data["indicator"]["id"],
+            "indicator_name": data["indicator"]["value"],
+            "country_id": data["country"]["id"],
+            "country_name": data["country"]["value"],
+            "country_iso3": data["countryiso3code"],
+            "date": data["date"],
+            "value": data["value"],
+            "unit": data["unit"],
+            "obs_status": data["obs_status"],
+            "decimal": data["decimal"],
+        }
+
+
+def scrape_indicators(data_dir, force: bool = False):
+    indicators = sorted(indicator2name)
+
+    for indicator in tqdm(indicators, desc="Scraping data...", position=0):
+        scraper = IndicatorScraper(indicator, data_dir)
+        scraper.run(force=force)
+
+
+if __name__ == "__main__":
+    # python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force
+    fire.Fire(scrape_indicators)
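
The Indicators API response handled in `scrape` is a two-element list: paging metadata first, records second. A minimal offline sketch of the same pagination loop follows, with the network call replaced by an injected `fetch_page` callable so that it runs without the API. `collect_pages`, `fetch_page`, `fake_fetch`, and the fake payload are hypothetical names and data for illustration only. The guard against a missing record list is an added assumption, not behavior of the script.

```python
def collect_pages(fetch_page):
    """Gather records from every page of a paginated [metadata, records] response.

    `fetch_page(page)` is a hypothetical callable that returns the decoded
    JSON for one page.
    """
    first = fetch_page(1)

    # Assumption: a response without a usable second element carries no data.
    if len(first) < 2 or first[1] is None:
        return None

    records = list(first[1])
    total_pages = first[0]["pages"]

    # Page 1 is already loaded, so continue from page 2.
    for page in range(2, total_pages + 1):
        records.extend(fetch_page(page)[1])

    return records


def fake_fetch(page):
    # Fake three-page payload standing in for the real API.
    return [{"page": page, "pages": 3}, [{"date": str(2000 + page), "value": page}]]


records = collect_pages(fake_fetch)
print(len(records))  # 3
```

Injecting the fetcher keeps the paging logic testable on its own; in the script the same role is played by `get_json(self.get_api_url(page))`.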
diff --git a/scripts/scrapers/indicators/wdi_db.py b/scripts/scrapers/indicators/wdi_db.py
@@ -0,0 +1,13 @@
+"""Create a SQLite database from the World Bank WDI data.
+"""
+from llm4data.llm.indicators.wdi_sql import WDISQL
+import fire
+
+
+def main(wdi_jsons_dir: str):
+    WDISQL.load_wdi_jsons(wdi_jsons_dir)
+
+
+if __name__ == "__main__":
+    # python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi
+    fire.Fire(main)