Data Ingestion Module
Mangat Singh Toor | ਮੰਗਤ ਸਿੰਘ ਤੂਰ edited this page May 18, 2024 · 8 revisions
Status: Draft
Author: Mangat Toor
Reviewer(s): Sam Huo
To check that Airbnb listings comply with local rules and regulations, this module scrapes listing data so that each listing can be verified against the required standards.
- Scraping only Vancouver listings
- Retrieving limited information: location (approximate latitude and longitude), registration number, and Airbnb ID
The primary goal of this project is to build a complete record of all Airbnb listings in Vancouver, including the registration number associated with each listing.
- Data scraping beyond the Vancouver region.
- Collecting extensive information beyond the specified data points.
- DataCollector: uses Scrapy (Python) to scrape data from Airbnb
- Data Validator: cleans data before it is handled by the repository, e.g. dealing with undefined values and wrong types
- Repository: manages writing data to persistent storage (a CSV file for now)
- Stores the following data for each Airbnb listing:
  - Airbnb ID (airbnb_listing_id : string)
  - Registration number (registration_number : string)
  - City (location : string)
  - Latitude (latitude : string)
  - Longitude (longitude : string)
  - Listing name (name : string)
  - Listing title (title : string)
  - Max guest occupancy (person_capacity : string)
  - Type of listing (room_type : string)
  - Baths (baths : string)
  - Beds (beds : string)
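The cleaning and CSV-writing steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `AirbnbListing`, `clean_listing`, and `CsvRepository` are hypothetical and not taken from the project, the rule of rejecting records without an Airbnb ID is an assumption, and the scraping step itself is left out so the sketch needs only the standard library.

```python
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class AirbnbListing:
    """Hypothetical record type; field names and string types follow the list above."""
    airbnb_listing_id: str
    registration_number: str = ""
    location: str = ""
    latitude: str = ""
    longitude: str = ""
    name: str = ""
    title: str = ""
    person_capacity: str = ""
    room_type: str = ""
    baths: str = ""
    beds: str = ""


# Column order for the CSV file, derived from the record type.
FIELD_NAMES = [f.name for f in fields(AirbnbListing)]


def clean_listing(raw):
    """Hypothetical validator step: coerce scraped values to stripped strings.

    Missing or None values become empty strings. Returns None when the
    record has no Airbnb ID (an assumed rejection rule).
    """
    values = {}
    for name in FIELD_NAMES:
        value = raw.get(name)
        values[name] = "" if value is None else str(value).strip()
    if not values["airbnb_listing_id"]:
        return None
    return AirbnbListing(**values)


class CsvRepository:
    """Hypothetical repository that appends listings to a CSV file."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, listings):
        # Write the header only when the file is new or empty.
        needs_header = not self.path.exists() or self.path.stat().st_size == 0
        with self.path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELD_NAMES)
            if needs_header:
                writer.writeheader()
            writer.writerows(asdict(item) for item in listings)
```

A collector would pass each scraped dictionary through `clean_listing`, drop any `None` results, and hand the rest to `CsvRepository.save`. Keeping the column order in one place (`FIELD_NAMES`) means the validator and the repository cannot drift apart when a field is added.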
[Provide alternative design considerations here]
- DataCollector:
  - Use Selenium to test data scraping from Airbnb
  - Use Scrapy's own testing utilities
- Data Validator:
  - Use Jest for data validation testing
  - Use Scrapy's own testing utilities
- Repository:
  - Use Jest to test the persistence of data
  - Use Scrapy's own testing utilities
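As one possible shape for the Repository persistence test, here is a minimal sketch that writes rows to a temporary CSV file and reads them back. It uses Python's built-in `unittest` in place of Jest or Scrapy's utilities, purely so the example runs without extra dependencies. The helper `save_rows` is a hypothetical stand-in for the repository's write method, and the sample row is made up.

```python
import csv
import os
import tempfile
import unittest


def save_rows(path, rows, fieldnames):
    """Hypothetical stand-in for the Repository's write method."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


class RepositoryPersistenceTest(unittest.TestCase):
    def test_rows_survive_round_trip(self):
        fieldnames = ["airbnb_listing_id", "registration_number"]
        # Made-up sample data, not a real listing.
        rows = [{"airbnb_listing_id": "1", "registration_number": "REG-0001"}]
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "listings.csv")
            save_rows(path, rows, fieldnames)
            # Read the file back and compare with what was written.
            with open(path, newline="", encoding="utf-8") as f:
                loaded = list(csv.DictReader(f))
        self.assertEqual(loaded, rows)
```

Saved as a module, the test can be run with `python -m unittest`. Because every stored field is a string, the round trip can be compared with plain equality, with no type conversion on read.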