Add some documentations for setting up and WDI examples (#4)
* Add wdi scraper and initial documentation

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>

* Add downloading the data section

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>

* Add loading data to the database instructions

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>

* Add setting up the environment

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>

* Update toc with content

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
avsolatorio authored May 30, 2023
1 parent 28d372c commit 69fcefb
Showing 8 changed files with 314 additions and 1 deletion.
11 changes: 11 additions & 0 deletions docs/_toc.yml
@@ -1,2 +1,13 @@
format: jb-book
root: README

parts:
  - caption: Setting up your environment
    chapters:
      - file: "notebooks/examples/Setting up the environment.ipynb"

  - caption: Indicators
    chapters:
      - file: notebooks/examples/indicators/README
        sections:
          - file: "notebooks/examples/indicators/Getting started with the WDI.ipynb"
74 changes: 74 additions & 0 deletions notebooks/examples/Setting up the environment.ipynb
@@ -0,0 +1,74 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up the environment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The library is configured through environment variables. Set them in a `.env` file in the root directory of the project.\n",
"\n",
"The following environment variables, also specified in the `example.env` file, are used:\n",
"\n",
"```bash\n",
"# OpenAI\n",
"ORGANIZATION=\"\"\n",
"OPENAI_API_KEY=\"\"\n",
"\n",
"\n",
"# DATABASES\n",
"WDI_DB_TABLE_NAME=\"wdi\"\n",
"\n",
"## POSTGRESQL\n",
"WDI_DB_ENGINE=\"postgresql\"\n",
"WDI_DB_HOST=\"localhost\"\n",
"WDI_DB_USERNAME=\"postgres\"\n",
"WDI_DB_PASSWORD=\"<your password>\"\n",
"WDI_DB_PORT=5432\n",
"\n",
"## SQLITE\n",
"# WDI_DB_ENGINE=\"sqlite\"\n",
"# WDI_DB_HOST=\n",
"# WDI_DB_USERNAME=\"/data/sqldb/wdi.db\"\n",
"# WDI_DB_PASSWORD=\n",
"# WDI_DB_PORT=\n",
"\n",
"\n",
"# DIRS\n",
"OPENAI_PAYLOAD_DIR=\"data/openai/payloads\"\n",
"\n",
"\n",
"# Task Labels\n",
"## Specify the label in the payload directory\n",
"## for this prompt set.\n",
"TASK_LABEL_WDI_SQL=\"prompt2wdiSQL\"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `example.env` file lists the variables used in the project. Copy it, rename the copy to `.env`, and fill in your own values. The `.env` file is ignored by Git so that secrets such as API keys stay out of the repository.\n",
"\n",
"NEVER commit your `.env` file to Git. It is listed in `.gitignore` by default, but you should double-check that Git is not tracking it."
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
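The notebook above lists the `WDI_DB_*` variables but does not show how they combine into a connection. The following is a minimal sketch of one way to do that, not the library's own code: `build_db_url` is a hypothetical helper, the SQLAlchemy-style URL format is an assumption, and reading the SQLite file path from `WDI_DB_USERNAME` simply mirrors the commented-out SQLITE block in `example.env`.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def build_db_url(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: build an SQLAlchemy-style URL from WDI_DB_* variables."""
    engine = env.get("WDI_DB_ENGINE", "sqlite")

    if engine == "sqlite":
        # Assumption: as in the SQLITE block of example.env, the database
        # file path is carried in WDI_DB_USERNAME.
        return f"sqlite:///{env.get('WDI_DB_USERNAME', '')}"

    user = env["WDI_DB_USERNAME"]
    # Percent-encode the password so characters such as "@" do not break the URL.
    password = quote_plus(env.get("WDI_DB_PASSWORD", ""))
    host = env.get("WDI_DB_HOST", "localhost")
    port = env.get("WDI_DB_PORT", "5432")
    return f"{engine}://{user}:{password}@{host}:{port}"
```

In the project, the mapping would typically be `os.environ` after the `.env` file has been loaded, for example with `load_dotenv()` from `python-dotenv`, which is already listed in `pyproject.toml`.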
83 changes: 83 additions & 0 deletions notebooks/examples/indicators/Getting started with the WDI.ipynb
@@ -0,0 +1,83 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started with the World Development Indicators (WDI) data\n",
"\n",
"This notebook walks through the steps to get started with the WDI data. We will use the [WDI API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) to get the data.\n",
"\n",
"We will store the data in the `data/` folder. To use a different location, change the `data_dir` argument in the download command below and use the same path wherever the rest of the notebook refers to the data folder.\n",
"\n",
"After the data is collected, we will store it in a [SQLite](https://www.sqlite.org/index.html) database. With the data in a database, we can then use LLM4Data to query the data using natural language."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading the data\n",
"\n",
"\n",
"If the data is not yet available in the `data` folder, we will download it from the WDI API.\n",
"\n",
"```\n",
"poetry run python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force\n",
"```\n",
"\n",
"This scrapes the data from the WDI API and stores it in the `data/indicators/wdi` folder, one `<indicator_id>.json` file per indicator. With `--force`, indicators are downloaded again even if their files already exist; without it, existing files are skipped."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Storing the data to a database\n",
"\n",
"After the data is downloaded, we will store it in a database. We will use SQLite for this example. You can use other databases as well, as long as you have the appropriate drivers installed.\n",
"\n",
"See the *Setting up the environment* notebook for instructions on how to set the relevant environment variables.\n",
"\n",
"You can then run the following command to store the data in a database:\n",
"\n",
"```\n",
"poetry run python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi\n",
"```\n",
"\n",
"This will create a database file in the `data/indicators/wdi` folder. The database file will be named based on the information you specified in the environment variables.\n",
"\n",
"Alternatively, you can run the cells below to store the data in a database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"from llm4data.llm.indicators.wdi_sql import WDISQL\n",
"\n",
"\n",
"wdi_jsons_dir = \"data/indicators/wdi\"\n",
"\n",
"WDISQL.load_wdi_jsons(wdi_jsons_dir)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
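The notebook above delegates the loading step to `WDISQL.load_wdi_jsons`, whose implementation is not part of this commit. As a rough illustration of what such a step involves, here is a sketch that uses only the standard library. `load_jsons_to_sqlite` and `COLUMNS` are hypothetical names; the column list follows the keys produced by `normalize_record` in `scripts/scrapers/indicators/wdi.py`, and the default table name follows `WDI_DB_TABLE_NAME` in `example.env`.

```python
import json
import sqlite3
from pathlib import Path

# Keys written by normalize_record() in scripts/scrapers/indicators/wdi.py.
COLUMNS = [
    "indicator_id", "indicator_name", "country_id", "country_name",
    "country_iso3", "date", "value", "unit", "obs_status", "decimal",
]


def load_jsons_to_sqlite(jsons_dir, db_path, table="wdi"):
    """Hypothetical loader: insert every record in <jsons_dir>/*.json into one table."""
    conn = sqlite3.connect(db_path)
    col_sql = ", ".join(f'"{c}"' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)

    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
        for path in sorted(Path(jsons_dir).glob("*.json")):
            records = json.loads(path.read_text())
            conn.executemany(
                f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
                [tuple(record.get(c) for c in COLUMNS) for record in records],
            )
    return conn
```

The function returns the open connection so the caller can query the table right away, for example with `conn.execute('SELECT COUNT(*) FROM "wdi"')`. The real loader may differ in schema, column types, and target engine.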
1 change: 1 addition & 0 deletions notebooks/examples/indicators/README
@@ -0,0 +1 @@
# Indicators
32 changes: 31 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -16,6 +16,7 @@ langchain = "^0.0.178"
psycopg2-binary = "^2.9.6"
python-dotenv = "^1.0.0"
tiktoken = "^0.4.0"
fire = "^0.5.0"

[tool.poetry.group.test.dependencies]
pytest = "^7.3.1"
100 changes: 100 additions & 0 deletions scripts/scrapers/indicators/wdi.py
@@ -0,0 +1,100 @@
import json
import requests
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm
import fire
import backoff

from llm4data import indicator2name

class WDIException(Exception):
    pass


@backoff.on_exception(backoff.expo, WDIException, max_tries=20)
def get_json(url):
    try:
        response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
        return response.json()
    except requests.exceptions.RequestException as e:
        raise WDIException(e) from e
    except json.decoder.JSONDecodeError as e:
        raise WDIException(e) from e


class IndicatorScraper:

    def __init__(self, indicator_id, data_dir):
        self.indicator_id = indicator_id
        self.data_dir = Path(data_dir)

    def get_api_url(self, page: int = 1, per_page: int = 1000):
        return f"https://api.worldbank.org/v2/country/all/indicator/{self.indicator_id}?format=json&page={page}&per_page={per_page}"

    def scrape(self):
        page = 1
        _json = get_json(self.get_api_url(page))

        try:
            data = _data = _json[1]
        except IndexError:
            print(f"Skipping (IndexError): {self.indicator_id}")
            return None

        total = _json[0]["pages"]
        # print(f"Total pages: {total}")

        for page in tqdm(range(2, total + 1), desc=f"Scraping: ({self.indicator_id})", position=1):
            _data = get_json(self.get_api_url(page))[1]
            data += _data

        data = [self.normalize_record(d) for d in data]

        return data

    @property
    def filename(self):
        return self.data_dir / (self.indicator_id + ".json")

    def save(self, data):
        self.filename.parent.mkdir(exist_ok=True, parents=True)
        self.filename.write_text(json.dumps(data, indent=2))

    def run(self, force: bool = False):

        if not force and self.filename.exists():
            print(f"Skipping: {self.filename}")
            return

        data = self.scrape()

        if data is not None:
            self.save(data)

    def normalize_record(self, data):
        return {
            "indicator_id": data["indicator"]["id"],
            "indicator_name": data["indicator"]["value"],
            "country_id": data["country"]["id"],
            "country_name": data["country"]["value"],
            "country_iso3": data["countryiso3code"],
            "date": data["date"],
            "value": data["value"],
            "unit": data["unit"],
            "obs_status": data["obs_status"],
            "decimal": data["decimal"],
        }


def scrape_indicators(data_dir, force: bool = False):
    indicators = sorted(indicator2name)

    for indicator in tqdm(indicators, desc="Scraping data...", position=0):
        indicator = IndicatorScraper(indicator, data_dir)
        indicator.run(force=force)


if __name__ == "__main__":
    # python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force
    fire.Fire(scrape_indicators)
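The pagination logic in `IndicatorScraper.scrape` can be exercised without network access. The sketch below isolates that loop under stated assumptions: `collect_pages` and `fake_pages` are hypothetical names, and `fetch_page` stands in for `get_json(self.get_api_url(page))`, returning a `[metadata, records]` pair in the shape the scraper expects.

```python
def collect_pages(fetch_page):
    """Gather the records of every page of a [metadata, records] style response.

    `fetch_page(page)` is a hypothetical callable standing in for the HTTP request.
    """
    first = fetch_page(1)
    if len(first) < 2 or first[1] is None:
        # No second element: the case scrape() handles as an IndexError.
        return None

    records = list(first[1])
    # The metadata of the first page reports how many pages exist in total.
    for page in range(2, first[0]["pages"] + 1):
        records.extend(fetch_page(page)[1])
    return records


# Stand-in for the API: three fake pages with two records each.
fake_pages = {
    page: [{"page": page, "pages": 3}, [f"rec{page}a", f"rec{page}b"]]
    for page in (1, 2, 3)
}
print(len(collect_pages(fake_pages.__getitem__)))
```

Passing the fetcher in as an argument keeps the loop testable and leaves retry behaviour, such as the `backoff` decorator on `get_json`, to the caller.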
13 changes: 13 additions & 0 deletions scripts/scrapers/indicators/wdi_db.py
@@ -0,0 +1,13 @@
"""Create a SQLite database from the World Bank WDI data.
"""
from llm4data.llm.indicators.wdi_sql import WDISQL
import fire


def main(wdi_jsons_dir: str):
    WDISQL.load_wdi_jsons(wdi_jsons_dir)


if __name__ == "__main__":
    # python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi
    fire.Fire(main)
