Crawling Open Source Databases from dbdb.io.
Crawling the Open Source DBMS list from dbdb.io/browse?type=open-source with the beautifulsoup package.
Save as OSDB_crawling_202301_raw.csv
About the dbdb.io URLs:
- Use "https://dbdb.io/browse?type=open-source" to get all the open source databases in dbdb.io
- Use "https://dbdb.io/browse?q=*" to get all the databases in dbdb.io
This repository focuses on projects that have source code (or a mirror of it) and a community on GitHub. Commercial databases are outside the scope of the crawl. However, you can use "dbdb.io/browse?q=*" to crawl the entire data set if necessary. Note that the column "open_source_license" in "data/dbdbio_OSDB_list/OSDB_info_crawling_{month_yyyyMM}_raw.csv" may need to be re-labeled manually.
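A minimal sketch of the list-parsing step with BeautifulSoup is given here. The HTML snippet, the CSS class names ("card", "card-title", "card-img-top", "card-text"), the image path and the helper name parse_cards are assumptions for illustration only; the real dbdb.io markup may differ and should be inspected before reuse.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://dbdb.io"

# Hypothetical markup for illustration; the real page structure may differ.
SAMPLE_HTML = """
<div class="card">
  <img class="card-img-top" src="/static/example.png">
  <a class="card-title" href="/db/consus">Consus</a>
  <p class="card-text">placeholder description</p>
</div>
<div class="card">
  <a class="card-title" href="/db/consus-java">Consus</a>
  <p class="card-text">another placeholder</p>
</div>
"""


def _absolute(href):
    """Turn a relative link into an absolute one; keep missing links empty."""
    return urljoin(BASE_URL, href) if href else ""


def parse_cards(html):
    """Extract one dict per card with the four columns of the OSDB list csv."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.card"):
        title = card.select_one("a.card-title")
        img = card.select_one("img")
        text = card.select_one("p.card-text")
        rows.append({
            "card_title": title.get_text(strip=True) if title else "",
            "card_title_href": _absolute(title.get("href", "")) if title else "",
            "card_img_href": _absolute(img.get("src", "")) if img else "",
            "card_text": text.get_text(strip=True) if text else "",
        })
    return rows


rows = parse_cards(SAMPLE_HTML)
```

In a real crawl the HTML would come from an HTTP request to one of the URLs above instead of the inline sample.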
The OSDB list csv has the columns ["card_title", "card_title_href", "card_img_href", "card_text"]. "card_title" is usually the DBMS name we want; however, some DBMSs share the same "card_title" value, e.g.
- DBMS "Consus" from "https://dbdb.io/db/consus" and DBMS "Consus" from "https://dbdb.io/db/consus-java" have the same "card_title" value.
So the key column must be recomputed so that its values are unique. We add a new column "Name" that stores the DBMS name recomputed from the "card_title_href" column, which has unique values.
Finally, we overwrite the old OSDB list csv in place.
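One way to derive a unique "Name" from "card_title_href" is to take the last path segment of the URL. This is only a sketch: the helper name name_from_href is hypothetical, and the repository may normalize the name differently.

```python
from urllib.parse import urlparse


def name_from_href(card_title_href):
    """Return the last non-empty path segment of a dbdb.io entry URL."""
    path = urlparse(card_title_href).path
    segments = [s for s in path.split("/") if s]
    return segments[-1] if segments else ""


# The two entries from the example above share a card_title
# but have different card_title_href values.
rows = [
    {"card_title": "Consus", "card_title_href": "https://dbdb.io/db/consus"},
    {"card_title": "Consus", "card_title_href": "https://dbdb.io/db/consus-java"},
]
for row in rows:
    row["Name"] = name_from_href(row["card_title_href"])

names = [row["Name"] for row in rows]
```

Because the name comes from the URL rather than the card title, the two "Consus" entries end up with different keys.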
Crawling the open source DBMS information from the "card_title_href" column of the OSDB list csv; this information is crawled by the function "crawling_OSDB_infos_soup".
Some fields should be recomputed into other data formats, e.g.
- Check the "Name" column; if it is missing, recompute the DBMS name from the "card_title_href" column;
- Map the "Data Model" column to DB-Engines using the DB-Engines DBMS category label mapping table, and produce a new dbdbio DBMS category label mapping table;
- Check whether the source code is recorded on GitHub (Source_Code_record_from_github) based on the "Source Code" column;
- Convert the "Start Year" and "End Year" columns from float to str(int).
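Two of the conversions in the list above can be sketched with the standard library alone. The helper names year_to_str and is_source_on_github are hypothetical, and mapping missing years to an empty string is an assumption, not something the repository is known to do.

```python
import math
from urllib.parse import urlparse


def year_to_str(value):
    """Convert a year such as 2009.0 to "2009"; missing values become ""."""
    if value is None or value == "":
        return ""
    number = float(value)
    if math.isnan(number):
        # Assumption: a missing year is stored as an empty string.
        return ""
    return str(int(number))


def is_source_on_github(source_code_url):
    """Return True if the "Source Code" URL points to github.com."""
    if not isinstance(source_code_url, str):
        # Missing values (e.g. NaN) are treated as "not on GitHub".
        return False
    host = urlparse(source_code_url.strip()).netloc.lower()
    return host == "github.com" or host.endswith(".github.com")
```

Note that a URL without a scheme (for example "github.com/example/example") has an empty host under urlparse, so this sketch reports it as not on GitHub.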
Join the OSDB list and the OSDB information on the column "Name", and rename the key column to 'DBMS_urnform' after joining.
Set use_cols_OSDB_list = None to use all fields of the OSDB list. By default, use_cols_OSDB_info = ["Name", "card_title", "Description", "Data Model", "Query Interface", "System Architecture", "Website", "Source Code", "Tech Docs", "Developer", "Country of Origin", "Start Year", "End Year", "Project Type", "Written in", "Supported languages", "Embeds / Uses", "Licenses", "Operating Systems"].
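The join could look roughly like the following pandas sketch. The function name join_osdb, the left join, and the "_info" suffix for columns present in both tables are assumptions; the sample values are placeholders, not crawled data.

```python
import pandas as pd


def join_osdb(df_list, df_info, use_cols_OSDB_list=None,
              use_cols_OSDB_info=None, key="Name", key_alias="DBMS_urnform"):
    """Join the OSDB list and OSDB information tables on the key column."""
    # None means "use all fields" of the corresponding table.
    left = df_list if use_cols_OSDB_list is None else df_list[use_cols_OSDB_list]
    right = df_info if use_cols_OSDB_info is None else df_info[use_cols_OSDB_info]
    # Assumption: keep every row of the list (left join); columns that exist
    # in both tables keep their name on the left and get "_info" on the right.
    joined = left.merge(right, on=key, how="left", suffixes=("", "_info"))
    return joined.rename(columns={key: key_alias})


# Placeholder tables for illustration.
df_list = pd.DataFrame({
    "Name": ["consus", "consus-java"],
    "card_title": ["Consus", "Consus"],
    "card_title_href": ["https://dbdb.io/db/consus",
                        "https://dbdb.io/db/consus-java"],
})
df_info = pd.DataFrame({
    "Name": ["consus"],
    "card_title": ["Consus"],
    "Start Year": ["2000"],  # placeholder value
})

joined = join_osdb(df_list, df_info, use_cols_OSDB_info=["Name", "Start Year"])
```

Rows of the list without a matching information row keep missing values in the information columns, which makes gaps in the crawl easy to spot.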
Find the differences between the current month's manually labeled data (e.g. OSDB_info_202303_joined_manulabeled.csv) and last month's manually labeled data (e.g. OSDB_info_202302_joined_manulabeled.csv). Resolve conflicts in the merged table manually (by default, the current month's manually labeled data is overwritten).
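A minimal sketch of the month-to-month comparison is shown here. It works on rows loaded as dictionaries (for example with csv.DictReader) and only reports differences, leaving conflict resolution to manual review. The function name diff_tables and the sample values are hypothetical.

```python
def diff_tables(current_rows, last_rows, key="DBMS_urnform"):
    """Report added keys, removed keys and changed cells between two months."""
    cur = {row[key]: row for row in current_rows}
    last = {row[key]: row for row in last_rows}
    added = sorted(cur.keys() - last.keys())
    removed = sorted(last.keys() - cur.keys())
    changed = []
    for k in sorted(cur.keys() & last.keys()):
        for col, new_value in cur[k].items():
            # Only compare columns that exist in both months.
            if col in last[k] and last[k][col] != new_value:
                changed.append((k, col, last[k][col], new_value))
    return added, removed, changed


# Placeholder rows for illustration.
last_month = [
    {"DBMS_urnform": "consus", "open_source_license": "A"},
    {"DBMS_urnform": "old-db", "open_source_license": "B"},
]
current_month = [
    {"DBMS_urnform": "consus", "open_source_license": "C"},
    {"DBMS_urnform": "consus-java", "open_source_license": "A"},
]

added, removed, changed = diff_tables(current_month, last_month)
```

Each entry of changed is a tuple (key, column, last month's value, current value), which can be listed next to the merged table when resolving conflicts by hand.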