Add some documentations for setting up and WDI examples #4

Merged 5 commits on May 30, 2023
11 changes: 11 additions & 0 deletions docs/_toc.yml
@@ -1,2 +1,13 @@
format: jb-book
root: README

parts:
  - caption: Setting up your environment
    chapters:
      - file: "notebooks/examples/Setting up the environment.ipynb"

  - caption: Indicators
    chapters:
      - file: notebooks/examples/indicators/README
        sections:
          - file: "notebooks/examples/indicators/Getting started with the WDI.ipynb"
74 changes: 74 additions & 0 deletions notebooks/examples/Setting up the environment.ipynb
@@ -0,0 +1,74 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up the environment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The library is configured through environment variables. You can set them in a `.env` file in the root directory of the project.\n",
"\n",
"The following environment variables, also specified in the `example.env` file, are used:\n",
"\n",
"```bash\n",
"# OpenAI\n",
"ORGANIZATION=\"\"\n",
"OPENAI_API_KEY=\"\"\n",
"\n",
"\n",
"# DATABASES\n",
"WDI_DB_TABLE_NAME=\"wdi\"\n",
"\n",
"## POSTGRESQL\n",
"WDI_DB_ENGINE=\"postgresql\"\n",
"WDI_DB_HOST=\"localhost\"\n",
"WDI_DB_USERNAME=\"postgres\"\n",
"WDI_DB_PASSWORD=\"<your password>\"\n",
"WDI_DB_PORT=5432\n",
"\n",
"## SQLITE\n",
"# WDI_DB_ENGINE=\"sqlite\"\n",
"# WDI_DB_HOST=\n",
"# WDI_DB_USERNAME=\"/data/sqldb/wdi.db\"\n",
"# WDI_DB_PASSWORD=\n",
"# WDI_DB_PORT=\n",
"\n",
"\n",
"# DIRS\n",
"OPENAI_PAYLOAD_DIR=\"data/openai/payloads\"\n",
"\n",
"\n",
"# Task Labels\n",
"## Specify the label in the payload directory\n",
"## for this prompt set.\n",
"TASK_LABEL_WDI_SQL=\"prompt2wdiSQL\"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `example.env` file lists the variables used in the project. Copy it, rename the copy to `.env`, and fill in your own values. The `.env` file is ignored by Git so that secrets such as API keys stay out of the repository.\n",
"\n",
"NEVER commit your `.env` file to Git. It is ignored by default in the `.gitignore` file, but you should double-check that it is not being tracked by Git."
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
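
A quick check of the configuration can catch an incomplete `.env` file before anything talks to the database. The sketch below is only an illustration: the variable names come from the `example.env` listing in the notebook, but the split into "always required" and "PostgreSQL only" is an assumption, and `missing_settings` is a hypothetical helper, not part of the library.

```python
from typing import Mapping

# Variable names taken from the example.env listing in the notebook.
# Which ones are mandatory is an assumption made for this sketch.
REQUIRED = ("OPENAI_API_KEY", "WDI_DB_ENGINE", "WDI_DB_TABLE_NAME")
POSTGRES_ONLY = ("WDI_DB_HOST", "WDI_DB_USERNAME", "WDI_DB_PASSWORD", "WDI_DB_PORT")


def missing_settings(env: Mapping[str, str]) -> list:
    """Return the names of expected variables that are unset or empty."""
    names = list(REQUIRED)
    if env.get("WDI_DB_ENGINE") == "postgresql":
        names += POSTGRES_ONLY
    return [name for name in names if not env.get(name)]


# Hand-written mapping with placeholder values; pass os.environ to check
# the variables of the running process instead.
example = {
    "OPENAI_API_KEY": "placeholder",
    "WDI_DB_ENGINE": "postgresql",
    "WDI_DB_TABLE_NAME": "wdi",
    "WDI_DB_HOST": "localhost",
    "WDI_DB_USERNAME": "postgres",
    "WDI_DB_PORT": "5432",
}
print(missing_settings(example))
```

In a real session, python-dotenv (listed in `pyproject.toml`) can load the `.env` file into `os.environ` with `load_dotenv()` before a check like this runs.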
83 changes: 83 additions & 0 deletions notebooks/examples/indicators/Getting started with the WDI.ipynb
@@ -0,0 +1,83 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started with the World Development Indicators (WDI) data\n",
"\n",
"In this notebook, we will show the steps to get started with the WDI data. We will use the [WDI API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) to get the data.\n",
"\n",
"We will be using the `data/` folder to store the data. You can change the location of the data folder by changing the `data_dir` parameter in the code below. Make sure to refer to the correct location of the data folder in the rest of the notebook.\n",
"\n",
"After the data is collected, we will store it in a [SQLite](https://www.sqlite.org/index.html) database. With the data in a database, we can then use LLM4Data to query the data using natural language."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading the data\n",
"\n",
"\n",
"If the data is not yet available in the `data` folder, we will download the data from the WDI API.\n",
"\n",
"```\n",
"poetry run python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force\n",
"```\n",
"\n",
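"This will scrape the data from the WDI API and store it in the `data/indicators/wdi` folder. Each indicator is stored in a separate JSON file named after its indicator ID. The `--force` flag re-downloads indicators whose files already exist; omit it to skip them."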
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Storing the data in a database\n",
"\n",
"After the data is downloaded, we will store it in a database. We will use SQLite for this example. You can use other databases as well, as long as you have the appropriate drivers installed.\n",
"\n",
"Please review the *Setting up the environment* notebook for instructions on how to update the relevant environment variables.\n",
"\n",
"You can then run the following command to store the data in a database:\n",
"\n",
"```\n",
"poetry run python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi\n",
"```\n",
"\n",
"This will create a database file in the `data/indicators/wdi` folder. The database file will be named based on the information you specified in the environment variables.\n",
"\n",
"Alternatively, you can run the cells below to store the data in a database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"from llm4data.llm.indicators.wdi_sql import WDISQL\n",
"\n",
"\n",
"wdi_jsons_dir = \"data/indicators/wdi\"\n",
"\n",
"WDISQL.load_wdi_jsons(wdi_jsons_dir)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
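
Once the records are in a database, a natural-language question ultimately has to become SQL. The sketch below shows what such a query could look like against an in-memory SQLite table. The schema is an assumption: it uses a subset of the fields produced by `normalize_record` in `scripts/scrapers/indicators/wdi.py`, the table name follows `WDI_DB_TABLE_NAME="wdi"`, and the country codes and values are made up. The actual schema created by `WDISQL` is not shown in this PR.

```python
import sqlite3

# Hypothetical schema: a subset of the fields the scraper stores per record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wdi (indicator_id TEXT, country_iso3 TEXT, date TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO wdi VALUES (?, ?, ?, ?)",
    [
        ("SP.POP.TOTL", "AAA", "2020", 100.0),  # made-up values
        ("SP.POP.TOTL", "AAA", "2021", 110.0),
        ("SP.POP.TOTL", "BBB", "2021", 50.0),
    ],
)

# The kind of SQL that "latest value of an indicator per country" maps to.
rows = conn.execute(
    """
    SELECT country_iso3, date, value
    FROM wdi AS w
    WHERE indicator_id = ?
      AND date = (
          SELECT MAX(date)
          FROM wdi
          WHERE country_iso3 = w.country_iso3
            AND indicator_id = w.indicator_id
      )
    ORDER BY country_iso3
    """,
    ("SP.POP.TOTL",),
).fetchall()
print(rows)
```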
1 change: 1 addition & 0 deletions notebooks/examples/indicators/README
@@ -0,0 +1 @@
# Indicators
32 changes: 31 additions & 1 deletion poetry.lock

1 change: 1 addition & 0 deletions pyproject.toml
@@ -16,6 +16,7 @@ langchain = "^0.0.178"
psycopg2-binary = "^2.9.6"
python-dotenv = "^1.0.0"
tiktoken = "^0.4.0"
fire = "^0.5.0"

[tool.poetry.group.test.dependencies]
pytest = "^7.3.1"
100 changes: 100 additions & 0 deletions scripts/scrapers/indicators/wdi.py
@@ -0,0 +1,100 @@
import json
import requests
from pathlib import Path
from tqdm.auto import tqdm
import fire
import backoff

from llm4data import indicator2name


class WDIException(Exception):
    """Raised when a request to the WDI API fails or returns invalid JSON."""


# Retry with exponential backoff, giving up after 20 attempts.
@backoff.on_exception(backoff.expo, WDIException, max_tries=20)
def get_json(url):
    try:
        # The timeout keeps a stalled connection from blocking the scraper;
        # a timeout is a RequestException, so it is retried as well.
        response = requests.get(url, timeout=60)
        return response.json()
    except requests.exceptions.RequestException as e:
        raise WDIException(e) from e
    except json.decoder.JSONDecodeError as e:
        raise WDIException(e) from e


class IndicatorScraper:

    def __init__(self, indicator_id, data_dir):
        self.indicator_id = indicator_id
        self.data_dir = Path(data_dir)

    def get_api_url(self, page: int = 1, per_page: int = 1000):
        return f"https://api.worldbank.org/v2/country/all/indicator/{self.indicator_id}?format=json&page={page}&per_page={per_page}"

    def scrape(self):
        page = 1
        _json = get_json(self.get_api_url(page))

        try:
            # The response is expected to be a list: [metadata, records].
            data = _json[1]
        except IndexError:
            # No records element in the response; skip this indicator.
            print(f"Skipping (IndexError): {self.indicator_id}")
            return None

        total = _json[0]["pages"]
        # print(f"Total pages: {total}")

        # The first page is already in `data`; fetch the remaining pages.
        for page in tqdm(range(2, total + 1), desc=f"Scraping: ({self.indicator_id})", position=1):
            _data = get_json(self.get_api_url(page))[1]
            data.extend(_data)

        data = [self.normalize_record(d) for d in data]

        return data

    @property
    def filename(self):
        return self.data_dir / (self.indicator_id + ".json")

    def save(self, data):
        self.filename.parent.mkdir(exist_ok=True, parents=True)
        self.filename.write_text(json.dumps(data, indent=2))

    def run(self, force: bool = False):
        # Skip indicators that were already downloaded unless `force` is set.
        if not force and self.filename.exists():
            print(f"Skipping: {self.filename}")
            return

        data = self.scrape()

        if data is not None:
            self.save(data)

    def normalize_record(self, data):
        # Flatten the nested API record into a single-level dictionary.
        return {
            "indicator_id": data["indicator"]["id"],
            "indicator_name": data["indicator"]["value"],
            "country_id": data["country"]["id"],
            "country_name": data["country"]["value"],
            "country_iso3": data["countryiso3code"],
            "date": data["date"],
            "value": data["value"],
            "unit": data["unit"],
            "obs_status": data["obs_status"],
            "decimal": data["decimal"],
        }


def scrape_indicators(data_dir, force: bool = False):
    indicators = sorted(indicator2name)

    for indicator_id in tqdm(indicators, desc="Scraping data...", position=0):
        scraper = IndicatorScraper(indicator_id, data_dir)
        scraper.run(force=force)


if __name__ == "__main__":
    # python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force
    fire.Fire(scrape_indicators)
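
The paging logic in `scrape` can be tried without network access. The sketch below is a simplified stand-in, not the scraper itself: `collect_pages` and `FAKE_PAGES` are hypothetical names, the pages are made-up data, and the extra `None` check is an added safeguard that the code in this PR does not have.

```python
from typing import Callable, Optional


def collect_pages(fetch: Callable[[int], list]) -> Optional[list]:
    """Merge the records of every page returned by `fetch(page)`.

    `fetch` stands in for the HTTP call and must return the list
    `[metadata, records]` that the scraper expects.
    """
    first = fetch(1)
    if len(first) < 2 or first[1] is None:
        return None  # no records element in the response

    records = list(first[1])
    # The metadata of the first page says how many pages exist in total.
    for page in range(2, first[0]["pages"] + 1):
        records.extend(fetch(page)[1])
    return records


# Made-up, in-memory pages used instead of the real API.
FAKE_PAGES = {
    1: [{"page": 1, "pages": 2}, [{"date": "2021", "value": 1.0}]],
    2: [{"page": 2, "pages": 2}, [{"date": "2020", "value": 0.9}]],
}

records = collect_pages(lambda page: FAKE_PAGES[page])
print(len(records))
```

Passing the fetch function as an argument keeps the paging logic separate from the HTTP and retry handling, which makes it easy to test with canned responses.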
13 changes: 13 additions & 0 deletions scripts/scrapers/indicators/wdi_db.py
@@ -0,0 +1,13 @@
"""Create a SQLite database from the World Bank WDI data.
"""
from llm4data.llm.indicators.wdi_sql import WDISQL
import fire


def main(wdi_jsons_dir: str):
    # Store the downloaded WDI JSON files in the database configured
    # through the environment variables.
    WDISQL.load_wdi_jsons(wdi_jsons_dir)


if __name__ == "__main__":
    # python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi
    fire.Fire(main)
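
The implementation of `WDISQL.load_wdi_jsons` is not part of this diff. As a rough idea of what loading a directory of per-indicator JSON files into SQLite involves, here is a standard-library sketch. Everything in it is an assumption for illustration: `load_jsons` is a hypothetical function, the table holds only a subset of the scraped fields, and the sample record is made up.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Subset of the fields written by the scraper, kept short for brevity.
COLUMNS = ("indicator_id", "country_iso3", "date", "value")


def load_jsons(jsons_dir: Path, conn: sqlite3.Connection, table: str = "wdi") -> int:
    """Insert every record from `<jsons_dir>/*.json` into `table`."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(COLUMNS)})")
    # Named placeholders let each record dictionary be passed as-is;
    # sqlite3 ignores dictionary keys that have no placeholder.
    placeholders = ", ".join(f":{column}" for column in COLUMNS)
    count = 0
    for path in sorted(jsons_dir.glob("*.json")):
        records = json.loads(path.read_text())
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", records)
        count += len(records)
    conn.commit()
    return count


with tempfile.TemporaryDirectory() as tmp:
    sample = [
        {
            "indicator_id": "SP.POP.TOTL",
            "country_iso3": "AAA",  # made-up country code
            "date": "2021",
            "value": 110.0,
            "unit": "",  # extra key, not stored in this sketch
        }
    ]
    (Path(tmp) / "SP.POP.TOTL.json").write_text(json.dumps(sample))
    conn = sqlite3.connect(":memory:")
    inserted = load_jsons(Path(tmp), conn)
    print(inserted)
```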