I became interested in mechanical keyboards after seeing a lot of discussion about them on social media. So many types of keyboards are available now, with different materials, layouts, and sizes... I wondered how popular one type is over another, and what options and technical specifications the keyboards on sale today have.
Keeb-finder is a good website for that: it lists many keyboards and is easy to scrape. For the sake of practice, keyboard data was only scraped from Mar 28, 2024 to Mar 30, 2024 using a Python script here. Hopefully this scraping was not a burden for the site.
Disclaimer: Some keyboards will be missing because of bad URL formats or missing info.
- How many keyboards are listed over time
- Wired/wireless connection percentage
- Most common keyboard case materials
- Most popular keyboard brands
- Python script to scrape keyboard tiles and detail listings from keeb-finder.com/keyboards
- GitHub repo for storing the scraped data and the project
- Google Cloud, with BigQuery as the data warehouse. I haven't set up a VM since my GCP free trial already ended.
- Mage to orchestrate and monitor the pipeline
- dbt Core to transform data in BigQuery and prepare it for visualization using SQL
- Looker Studio to visualize the transformed dataset
- Pandas to import and transform the dataset
- Terraform for version control of the infrastructure
- Docker for the Mage image. I also updated dbt to the latest version via the Mage terminal, as the original dbt version in Mage was not working
- A Python script scrapes data from keeb-finder.com and exports it as CSV files. These files are then uploaded to GitHub to make them easier to work with in Mage.
- Terraform is used to set up the BigQuery database.
- A project, keeb-finder, is created in Mage and used to load and clean data from GitHub into BigQuery.
- dbt Core also runs inside Mage as dbt blocks to build and load models to BigQuery.
- Looker Studio is used to visualize the transformed dataset.
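The scraping step above can be sketched as follows. This is a minimal illustration only: the spec-table regex and the sample markup are assumptions, not keeb-finder.com's actual HTML (the real script uses `bs4` for parsing), and the User-Agent header is just a common workaround for sites that reject bare `urllib` requests.

```python
import re
import urllib.request

# Hypothetical spec-table markup; the real site's structure will differ.
SPEC_ROW = re.compile(r"<th>(?P<name>[^<]+)</th>\s*<td>(?P<value>[^<]+)</td>")

def parse_spec_table(html: str) -> dict:
    """Extract name/value pairs from a listing page's Technical Specification table."""
    return {m["name"].strip(): m["value"].strip() for m in SPEC_ROW.finditer(html)}

def fetch(url: str) -> str:
    """Download a listing page; a browser-like User-Agent helps avoid 403s."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Offline example: parse an inline snippet instead of hitting the site.
sample = ("<tr><th>Case Material</th><td>Aluminum</td></tr>"
          "<tr><th>Connection</th><td>Wireless</td></tr>")
print(parse_spec_table(sample))
```

The parsed dict maps one-to-one onto a CSV row, which is why pandas is a natural fit for the export step.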
Model `core_keyboards`' structure: check the csv here
- Table `base.keyboard_details` contains data scraped from the category page and the listing page's Technical Specification table
- Table `base.purchase_options` contains data scraped from the listing page's Purchase Options table
- `base_keyboard_details` and `base_purchase_options` are models for the above tables, transformed by renaming columns, adding new columns with CASE statements, applying correct data types, and dropping unnecessary columns
- `core_keyboards` is the model that joins `base_keyboard_details` and `base_purchase_options` by listing_link and scraping date
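The join behind `core_keyboards` is done in dbt with SQL, but its logic can be sketched in pandas with toy rows. The column names besides the two join keys are illustrative, not the exact schema of the base models:

```python
import pandas as pd

# Stand-ins for the two base models, with made-up rows.
details = pd.DataFrame({
    "listing_link": ["/keyboards/alpha", "/keyboards/beta"],
    "scraped_date": ["2024-03-28", "2024-03-28"],
    "case_material": ["Aluminum", "Plastic"],
})
options = pd.DataFrame({
    "listing_link": ["/keyboards/alpha", "/keyboards/beta"],
    "scraped_date": ["2024-03-28", "2024-03-28"],
    "price_usd": [199.0, 89.0],
})

# The core model joins the base models on listing link + scraping date,
# so one row per keyboard per scrape survives with specs and price together.
core_keyboards = details.merge(options, on=["listing_link", "scraped_date"], how="inner")
print(core_keyboards.shape)  # (2, 4)
```

Joining on both keys matters here: the same listing appears once per scraping day, so a join on `listing_link` alone would cross-match rows from different days.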
View the dashboard here
- Check whether you already have these Python libraries: `urllib`, `bs4`, `pandas`, `datetime`, `re`, by trying `import library-name-here` in a terminal. If not, please install them.
- Replace the value of the `path` variable with your desired path
- Run the script
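Instead of importing each library by hand, the check can be scripted. This is a convenience sketch; the `required` list just mirrors the libraries named above, where `bs4` and `pandas` are the only ones not in the standard library:

```python
import importlib.util

# Libraries the scraper script expects to import.
required = ["urllib", "bs4", "pandas", "datetime", "re"]

# find_spec returns None for any top-level module that is not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All dependencies found.")
```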
Check here for instructions
Check this video here for details
- Create a Google Cloud account
- Set up a new Google Cloud project
- Create a service account and give it the roles `Storage Admin`, `BigQuery Admin`
- Download the key as a JSON file for this service account
- Create a BigQuery dataset; you might use this Terraform file as a reference, replacing `project`, `dataset_id`, `location` according to your environment
- Make sure these APIs are enabled:
NAME | TITLE |
---|---|
analyticshub.googleapis.com | Analytics Hub API |
bigquery.googleapis.com | BigQuery API |
bigqueryconnection.googleapis.com | BigQuery Connection API |
bigquerydatapolicy.googleapis.com | BigQuery Data Policy API |
bigquerymigration.googleapis.com | BigQuery Migration API |
bigqueryreservation.googleapis.com | BigQuery Reservation API |
bigquerystorage.googleapis.com | BigQuery Storage API |
cloudapis.googleapis.com | Google Cloud APIs |
cloudresourcemanager.googleapis.com | Cloud Resource Manager API |
cloudtrace.googleapis.com | Cloud Trace API |
dataform.googleapis.com | Dataform API |
dataplex.googleapis.com | Cloud Dataplex API |
datastore.googleapis.com | Cloud Datastore API |
logging.googleapis.com | Cloud Logging API |
monitoring.googleapis.com | Cloud Monitoring API |
servicemanagement.googleapis.com | Service Management API |
serviceusage.googleapis.com | Service Usage API |
sql-component.googleapis.com | Cloud SQL |
storage-api.googleapis.com | Google Cloud Storage JSON API |
storage-component.googleapis.com | Cloud Storage |
storage.googleapis.com | Cloud Storage API |
- Install Docker
- Clone this repo, go to the `mage-quickstart` folder, and run `docker-compose up`
- Upload your service account key to `mage-quickstart\keeb-finder`
- Update `mage-quickstart\keeb-finder\io_config.yaml`: `GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/keeb-finder/your-key-file-name.json"`
- In these 2 pipelines, `ingest_products_github_to_bigquery` and `ingest_purchase_options_github_to_bigquery`, update the Export to BigQuery blocks' `table_id` according to your BigQuery setup
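For the `table_id` update, the value the export block needs is a fully qualified BigQuery id. A small sketch, with placeholder project/dataset/table names rather than your real ones:

```python
# BigQuery table ids take the form project.dataset.table.
def build_table_id(project: str, dataset: str, table: str) -> str:
    return f"{project}.{dataset}.{table}"

# Placeholder values; substitute the names from your own GCP setup.
table_id = build_table_id("your-gcp-project", "keeb_finder", "keyboard_details")
print(table_id)  # your-gcp-project.keeb_finder.keyboard_details

# Inside an exporter, an id like this is what the BigQuery client writes to,
# e.g. google.cloud.bigquery.Client().load_table_from_dataframe(df, table_id)
# (call shown as a comment since it needs credentials to run).
```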
- Install dbt Core and connect dbt with BigQuery
- Go to `mage-quickstart\keeb-finder\keeb_finder_dbt\profiles.yml` and update the `keyfile` value to your service account key location, along with `project` and `dataset`
- Run `dbt debug` to check the dbt connection
- Go to `mage-quickstart\keeb-finder\keeb_finder_dbt\models\base\schema.yml` and update the sources configuration values according to your BigQuery setting
Add a data source from the BigQuery `core_keyboards` table
- More automation with Mage: scrape data with Mage and load it to GCS instead of GitHub
- CI/CD to run the pipeline daily
- Dockerize the Python environment to support reproducibility
- Use variables with Terraform