Add some documentations for setting up and WDI examples (#4)
* Add wdi scraper and initial documentation
  Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
* Add downloading the data section
  Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
* Add loading data to the database instructions
  Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
* Add setting up the environment
  Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
* Update toc with content
  Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
---------
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
1 parent 28d372c, commit 69fcefb
Showing 8 changed files with 314 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,13 @@ | ||
format: jb-book | ||
root: README | ||
|
||
parts: | ||
  - caption: Setting up your environment | ||
    chapters: | ||
    - file: notebooks/examples/"Setting up your environment.ipynb" | ||
|
||
  - caption: Indicators | ||
    chapters: | ||
    - file: notebooks/examples/indicators/README | ||
      sections: | ||
      - file: notebooks/examples/indicators/"Getting started with the WDI.ipynb" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Setting up the environment" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We use the environment variables to allow the user to configure the library. The user can set the environment variables in a `.env` file in the root directory of the project.\n", | ||
"\n", | ||
"The following environment variables, also specified in the `example.env` file, are used:\n", | ||
"\n", | ||
"```bash\n", | ||
"# OpenAI\n", | ||
"ORGANIZATION=\"\"\n", | ||
"OPENAI_API_KEY=\"\"\n", | ||
"\n", | ||
"\n", | ||
"# DATABASES\n", | ||
"WDI_DB_TABLE_NAME=\"wdi\"\n", | ||
"\n", | ||
"## POSTRESQL\n", | ||
"WDI_DB_ENGINE=\"postgresql\"\n", | ||
"WDI_DB_HOST=\"localhost\"\n", | ||
"WDI_DB_USERNAME=\"postgres\"\n", | ||
"WDI_DB_PASSWORD=\"<your password>\"\n", | ||
"WDI_DB_PORT=5432\n", | ||
"\n", | ||
"## SQLITE\n", | ||
"# WDI_DB_ENGINE=\"sqlite\"\n", | ||
"# WDI_DB_HOST=\n", | ||
"# WDI_DB_USERNAME=\"/data/sqldb/wdi.db\"\n", | ||
"# WDI_DB_PASSWORD=\n", | ||
"# WDI_DB_PORT=\n", | ||
"\n", | ||
"\n", | ||
"# DIRS\n", | ||
"OPENAI_PAYLOAD_DIR=\"data/openai/payloads\"\n", | ||
"\n", | ||
"\n", | ||
"# Task Labels\n", | ||
"## Specify the label in the payload directory\n", | ||
"## for this prompt set.\n", | ||
"TASK_LABEL_WDI_SQL = \"prompt2wdiSQL\"\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The `example.env` file is updated with the relevant variables used in the project. The user can copy the file and rename it to `.env` to set the environment variables. The `.env` file is ignored by Git for security reasons so that you can store API keys.\n", | ||
"\n", | ||
"NEVER commit your `.env` file to Git. It is ignored by default in the `.gitignore` file, but you should double-check that it is not being tracked by Git." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
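Review note: the notebook above lists the variables but this commit does not show how they are consumed. The following is a minimal sketch, assuming the values are combined into a SQLAlchemy-style connection URL. The helper `build_db_url` is hypothetical and is not part of LLM4Data; it also assumes, as the commented SQLite block in `example.env` suggests, that the SQLite file path is carried in `WDI_DB_USERNAME`.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Assemble a connection URL from the WDI_DB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    engine = env.get("WDI_DB_ENGINE", "sqlite")

    if engine == "sqlite":
        # Assumption: example.env keeps the SQLite file path in WDI_DB_USERNAME.
        return f"sqlite:///{env.get('WDI_DB_USERNAME', '')}"

    user = env.get("WDI_DB_USERNAME", "")
    # Escape characters such as "@" or ":" so the password cannot break the URL.
    password = quote_plus(env.get("WDI_DB_PASSWORD", ""))
    host = env.get("WDI_DB_HOST", "localhost")
    port = env.get("WDI_DB_PORT", "5432")
    return f"{engine}://{user}:{password}@{host}:{port}"
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test without touching a real `.env` file.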
83 changes: 83 additions & 0 deletions
notebooks/examples/indicators/Getting started with the WDI.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Getting Started with the World Development Indicators (WDI) data\n", | ||
"\n", | ||
"In this notebook, we will show the steps to get started with the WDI data. We will use the [WDI API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) to get the data.\n", | ||
"\n", | ||
"We will be using the `data/` folder to store the data. You can change the location of the data folder by changing the `data_dir` parameter in the code below. Make sure to refer to the correct location of the data folder in the rest of the notebook.\n", | ||
"\n", | ||
"After the data is collected, we will store it in a [SQLite](https://www.sqlite.org/index.html) database. With the data in a database, we can then use LLM4Data to query the data using natural language." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Downloading the data\n", | ||
"\n", | ||
"\n", | ||
"If the data is not yet available in the `data` folder, we will download the data from the WDI API.\n", | ||
"\n", | ||
"```\n", | ||
"poetry run python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force\n", | ||
"```\n", | ||
"\n", | ||
"This will scrape the data from the WDI API and store it in the `data/indicators/wdi` folder. Each indicator will be stored in a separate file." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Storing the data to a database\n", | ||
"\n", | ||
"After the data is downloaded, we will store it in a database. We will use SQLite for this example. You can use other databases as well, as long as you have the appropriate drivers installed.\n", | ||
"\n", | ||
"Please review the `setting up the environment` section for instructions on how to update the relevant environment variables.\n", | ||
"\n", | ||
"You can then run the following command to store the data in a database:\n", | ||
"\n", | ||
"```\n", | ||
"poetry run python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi\n", | ||
"```\n", | ||
"\n", | ||
"This will create a database file in the `data/indicators/wdi` folder. The database file will be named based on the information you specified in the environment variables.\n", | ||
"\n", | ||
"Alternatively, you can run the cells below to store the data in a database." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from llm4data.llm.indicators.wdi_sql import WDISQL\n", | ||
"\n", | ||
"\n", | ||
"wdi_jsons_dir = \"data/indicators/wdi\"\n", | ||
"\n", | ||
"WDISQL.load_wdi_jsons(wdi_jsons_dir)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Indicators |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
import json | ||
import requests | ||
import pandas as pd | ||
from pathlib import Path | ||
from tqdm.auto import tqdm | ||
import fire | ||
import backoff | ||
|
||
from llm4data import indicator2name | ||
|
||
class WDIException(Exception): | ||
    """Raised when a WDI API request fails or returns invalid JSON.""" | ||
|
||
|
||
@backoff.on_exception(backoff.expo, WDIException, max_tries=20) | ||
def get_json(url): | ||
    # Retried with exponential backoff whenever WDIException is raised. | ||
    try: | ||
        response = requests.get(url) | ||
        return response.json() | ||
    except requests.exceptions.RequestException as e: | ||
        raise WDIException(e) | ||
    except json.decoder.JSONDecodeError as e: | ||
        raise WDIException(e) | ||
|
||
|
||
class IndicatorScraper: | ||
|
||
    def __init__(self, indicator_id, data_dir): | ||
        self.indicator_id = indicator_id | ||
        self.data_dir = Path(data_dir) | ||
|
||
    def get_api_url(self, page: int = 1, per_page: int = 1000): | ||
        return f"https://api.worldbank.org/v2/country/all/indicator/{self.indicator_id}?format=json&page={page}&per_page={per_page}" | ||
|
||
    def scrape(self): | ||
        page = 1 | ||
        _json = get_json(self.get_api_url(page)) | ||
|
||
        # The response is a two-item list: paging metadata, then the records. | ||
        try: | ||
            data = _data = _json[1] | ||
        except IndexError: | ||
            print(f"Skipping (IndexError): {self.indicator_id}") | ||
            return None | ||
|
||
        total = _json[0]["pages"] | ||
        # print(f"Total pages: {total}") | ||
|
||
        for page in tqdm(range(2, total + 1), desc=f"Scraping: ({self.indicator_id})", position=1): | ||
            _data = get_json(self.get_api_url(page))[1] | ||
            data += _data | ||
|
||
        data = [self.normalize_record(d) for d in data] | ||
|
||
        return data | ||
|
||
    @property | ||
    def filename(self): | ||
        return self.data_dir / (self.indicator_id + ".json") | ||
|
||
    def save(self, data): | ||
        self.filename.parent.mkdir(exist_ok=True, parents=True) | ||
        self.filename.write_text(json.dumps(data, indent=2)) | ||
|
||
    def run(self, force: bool = False): | ||
|
||
        # Skip indicators that already have a file unless force is set. | ||
        if not force and self.filename.exists(): | ||
            print(f"Skipping: {self.filename}") | ||
            return | ||
|
||
        data = self.scrape() | ||
|
||
        if data is not None: | ||
            self.save(data) | ||
|
||
    def normalize_record(self, data): | ||
        # Flatten the nested API record into one flat dictionary. | ||
        return { | ||
            "indicator_id": data["indicator"]["id"], | ||
            "indicator_name": data["indicator"]["value"], | ||
            "country_id": data["country"]["id"], | ||
            "country_name": data["country"]["value"], | ||
            "country_iso3": data["countryiso3code"], | ||
            "date": data["date"], | ||
            "value": data["value"], | ||
            "unit": data["unit"], | ||
            "obs_status": data["obs_status"], | ||
            "decimal": data["decimal"], | ||
        } | ||
|
||
|
||
def scrape_indicators(data_dir, force: bool = False): | ||
    indicators = sorted(indicator2name) | ||
|
||
    for indicator in tqdm(indicators, desc="Scraping data...", position=0): | ||
        indicator = IndicatorScraper(indicator, data_dir) | ||
        indicator.run(force=force) | ||
|
||
|
||
if __name__ == "__main__": | ||
    # python -m scripts.scrapers.indicators.wdi --data_dir=data/indicators/wdi --force | ||
    fire.Fire(scrape_indicators) |
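Review note: the scraper above reads the page count from the first response and then requests the remaining pages. The sketch below isolates that pagination pattern so it can be exercised without network access. The names `iter_pages` and `collect`, and the injected `fetch` callable, are hypothetical and not part of this commit. The guard for a missing or null record list is an assumption about how the API reports errors or empty indicators, not something the commit confirms.

```python
from typing import Callable, Iterator


def iter_pages(fetch: Callable[[int], list]) -> Iterator[list]:
    """Yield each page's record list, taking the page count from page 1.

    `fetch(page)` is assumed to return the two-item structure used by the
    scraper: [paging_metadata, records].
    """
    first = fetch(1)
    # Assumption: an error payload has no second item, and an indicator
    # without data may carry None in place of the record list.
    if len(first) < 2 or first[1] is None:
        return
    yield first[1]

    for page in range(2, first[0]["pages"] + 1):
        yield fetch(page)[1]


def collect(fetch: Callable[[int], list]) -> list:
    """Concatenate the records of all pages into one list."""
    records = []
    for page_records in iter_pages(fetch):
        records.extend(page_records)
    return records
```

In the real script, `fetch` would wrap `get_json(self.get_api_url(page))`; injecting it as a parameter lets a test substitute canned responses.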
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
"""Create an sqlite database from the World Bank WDI data. | ||
""" | ||
from llm4data.llm.indicators.wdi_sql import WDISQL | ||
import fire | ||
|
||
|
||
def main(wdi_jsons_dir: str): | ||
    WDISQL.load_wdi_jsons(wdi_jsons_dir) | ||
|
||
|
||
if __name__ == "__main__": | ||
    # python -m scripts.scrapers.indicators.wdi_db --wdi_jsons_dir=data/indicators/wdi | ||
    fire.Fire(main) |
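Review note: `WDISQL.load_wdi_jsons` is not part of this diff, so its behavior is not visible here. As a rough illustration of what loading the scraped files into SQLite could look like, here is a sketch using only the standard library. The function `load_jsons_to_sqlite`, the untyped table schema, and the default table name are assumptions; only the column names come from `normalize_record` in the scraper.

```python
import json
import sqlite3
from pathlib import Path

# Column names as produced by normalize_record in the scraper.
COLUMNS = [
    "indicator_id", "indicator_name", "country_id", "country_name",
    "country_iso3", "date", "value", "unit", "obs_status", "decimal",
]


def load_jsons_to_sqlite(jsons_dir, db_path, table="wdi"):
    """Insert every record from the *.json files in jsons_dir; return the row count."""
    cols = ", ".join(f'"{c}"' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            for path in sorted(Path(jsons_dir).glob("*.json")):
                records = json.loads(path.read_text())
                conn.executemany(
                    f'INSERT INTO "{table}" ({cols}) VALUES ({placeholders})',
                    [tuple(r.get(c) for c in COLUMNS) for r in records],
                )
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
```

Calling `load_jsons_to_sqlite("data/indicators/wdi", "wdi.db")` would then mirror the command in the comment above, though the real implementation may name the database and table from the environment variables instead.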