Data Ingestion Module

Mangat Singh Toor | ਮੰਗਤ ਸਿੰਘ ਤੂਰ edited this page May 18, 2024 · 8 revisions

Software Design Doc

Status: Draft

Author: Mangat Toor

Reviewer(s): Sam Huo

Context

To check that Airbnb listings comply with local rules and regulations, we need to scrape listing data and verify that each listing meets the required standards.

Scope

  • Scraping only Vancouver listings
  • Retrieving limited information, including location (approximate latitude and longitude), registration number, and Airbnb ID

Goals

The primary goal of this project is to collect the in-scope data for every Airbnb listing in Vancouver, including each listing's registration number.

Non-goals

  • Data scraping beyond the Vancouver region.
  • Collecting extensive information beyond the specified data points.

Overview

image

Detailed Design

DataCollector

  • Uses Scrapy (Python) to scrape listing data from Airbnb

Data Cleaner

  • Cleans data before it is handed to the repository, e.g. handling undefined values and wrong types
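
A minimal sketch of what this cleaning step could look like, assuming scraped records arrive as plain dictionaries keyed by the field names from the Listing contract. The names `clean_record` and `FIELDS`, the choice to turn missing values into empty strings, and the choice to reject records without an Airbnb ID are all illustrative assumptions, not decisions made in this doc.

```python
from typing import Any, Optional

# Field names taken from the Listing contract in this doc.
FIELDS = (
    "airbnb_listing_id", "registration_number", "location",
    "latitude", "longitude", "name", "title",
    "person_capacity", "room_type", "baths", "beds",
)


def clean_record(raw: dict[str, Any]) -> Optional[dict[str, str]]:
    """Return a record with every contract field as a stripped string.

    Assumptions: missing or None values become "", and a record
    without an Airbnb ID is rejected by returning None.
    """
    cleaned = {}
    for field in FIELDS:
        value = raw.get(field)
        # Coerce wrong types (int, float, ...) to the contract's string type.
        cleaned[field] = "" if value is None else str(value).strip()
    if not cleaned["airbnb_listing_id"]:
        return None
    return cleaned
```

For example, `clean_record({"airbnb_listing_id": 123, "name": "  Loft "})` would yield a dictionary with `"123"` and `"Loft"` as strings and empty strings for the remaining fields.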

Repository

  • Manages writing data to persistent storage (a CSV file for now)
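
One way the CSV-backed repository could be sketched, using only Python's standard `csv` module. The class name `CsvListingRepository`, the `save` method, and the append-with-single-header behaviour are assumptions made for illustration; the column names come from the Listing contract in this doc.

```python
import csv
from pathlib import Path

# Column order follows the Listing contract in this doc.
COLUMNS = [
    "airbnb_listing_id", "registration_number", "location",
    "latitude", "longitude", "name", "title",
    "person_capacity", "room_type", "baths", "beds",
]


class CsvListingRepository:
    """Appends listing records to a CSV file (hypothetical class name)."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, records):
        """Append records, writing the header only if the file is new or empty."""
        write_header = not self.path.exists() or self.path.stat().st_size == 0
        with self.path.open("a", newline="", encoding="utf-8") as f:
            # Unknown keys are ignored and missing keys are written as "".
            writer = csv.DictWriter(
                f, fieldnames=COLUMNS, restval="", extrasaction="ignore"
            )
            if write_header:
                writer.writeheader()
            writer.writerows(records)
```

Keeping the storage details behind a small class like this would let the CSV file be swapped for a database later without touching the collector or cleaner.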

Contracts

Listing

  • Stores data for an Airbnb listing
    • Airbnb ID (airbnb_listing_id : string)
    • Registration number (registration_number : string)
    • City (location : string)
    • Latitude (latitude : string)
    • Longitude (longitude : string)
    • Listing name (name : string)
    • Listing title (title : string)
    • Max guest occupancy (person_capacity : string)
    • Type of listing (room_type : string)
    • Baths (baths : string)
    • Beds (beds : string)
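
The contract above could be expressed in Python as a dataclass. This is a sketch only: the field names and the all-string types are taken from the list above, while the use of a frozen `dataclass` is an assumption about how the contract might be implemented.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Listing:
    """One Airbnb listing; every field is a string, as in the contract."""

    airbnb_listing_id: str
    registration_number: str
    location: str  # city
    latitude: str
    longitude: str
    name: str
    title: str
    person_capacity: str  # max guest occupancy
    room_type: str
    baths: str
    beds: str
```

Calling `asdict(listing)` on an instance gives a plain dictionary with the same keys, which is a convenient shape for writing rows to a CSV file.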

Alternative Design

[Provide alternative design considerations here]

Testing Strategies

  • DataCollector:

    • Use Selenium to test data scraping from Airbnb
    • Use Scrapy's own testing utilities
  • Data Validator:

    • Use Jest for data validation testing
    • Use Scrapy's own testing utilities
  • Repository:

    • Use Jest to test data persistence
    • Use Scrapy's own testing utilities