Our CX4242 project focuses on quantifying differences between Airbnb listings and hotels in New York City. The project consists of several components:
- Data collection and scraping: We collected data from several sources, notably Airbnb, Amadeus (a travel IT company with an API for booking/pricing information), TripAdvisor, and OpenStreetMap.
- NLP analysis on reviews: We used the Stanford CoreNLP library to segment reviews and perform sentiment analysis.
- Search engine: We compiled all Airbnb and hotel data into an ElasticSearch instance hosted on AWS, to be able to search across both datasets at once.
- Visualization UI: We summarized all of the data and analyses through an interactive webpage.
Our finalized datasets are stored in an AWS ElasticSearch instance, and our site is hosted with AWS ElasticBeanstalk.
Our project uses Python 3.5.
To install all Python dependencies used in this project, run
pip install -r requirements.txt
First, change into the scraper directory:
cd tripadvisor_scraper
- base_spider.py: This spider gets the necessary URLs (through TripAdvisor's autocomplete) for each city that we are searching for. It only needs to be run once, and it outputs to intermediate/urls.csv.
- listings_spider.py: Uses the URLs from the previous step to crawl for listings. Run with scrapy crawl listings -o listings.json
- hotels_spider.py: Scrapes hotel amenities for each listing. Run with scrapy crawl hotels -o amenities.json
listings.json and amenities.json contain price, amenities, and some other basic information for the TripAdvisor search results.
- reviews_spider.py: Scrapes review text for each listing obtained from the listings spider. Run with scrapy crawl reviews -a filename=<filename>, where the file is a CSV with TripAdvisor URLs for each hotel.
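The reviews spider expects a CSV of TripAdvisor URLs, while the listings spider writes JSON. A minimal sketch of bridging the two is shown here. It is not part of the repository: the helper name write_url_csv, the "url" key, and the one-column, no-header CSV layout are all assumptions, so adjust them to match the spiders' actual items and input format.

```python
import csv
import json


def write_url_csv(listings_path, csv_path, url_field="url"):
    """Copy the URL of each scraped listing into a one-column CSV.

    `url_field` is an assumed key name; change it to match the spider's items.
    Returns the number of URLs written.
    """
    with open(listings_path, encoding="utf-8") as f:
        # Scrapy's JSON feed export writes a single JSON array of items.
        listings = json.load(f)

    # Keep items that carry a URL, dropping duplicates but preserving order.
    urls = []
    seen = set()
    for item in listings:
        url = item.get(url_field)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])
    return len(urls)
```

For example, write_url_csv("listings.json", "hotel_urls.csv") would produce a file that could then be passed as filename=hotel_urls.csv, provided the spider accepts that layout.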
Scripts for collecting data from the Amadeus API are in the amadeus-api folder. To access the Amadeus API, sign up for an API key, then set an environment variable for this key.
On Unix-based systems:
export AMADEUS_KEY='your api key here'
On Windows:
setx AMADEUS_KEY "your api key here"
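On the Python side, a script can read the key back from the environment and fail early if it is missing. The helper below is a hypothetical sketch (require_env is not a function from this repository), written to stay compatible with Python 3.5.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            "Environment variable {} is not set; set it with export or setx first.".format(name)
        )
    return value


# Example use inside a script (hypothetical):
# api_key = require_env("AMADEUS_KEY")
```

Note that setx only affects command windows opened after it runs, so on Windows open a new terminal before starting the scripts.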
We wanted to merge data from both TripAdvisor and Amadeus.
- search.py: This script searches for hotels in Amadeus based on the coordinates of hotels we've already scraped from TripAdvisor.
- recordPrices.py: This script searches each hotel for prices across a range of dates.
cd amadeus-api
python search.py
python recordPrices.py
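To illustrate what "prices across a range of dates" can mean in practice, here is a small sketch that enumerates check-in/check-out pairs. It is not the logic of recordPrices.py: the function name stay_windows and the one-night default are assumptions, and no Amadeus call is made.

```python
from datetime import date, timedelta


def stay_windows(start, end, nights=1):
    """Yield (check_in, check_out) date pairs for every check-in day in [start, end]."""
    check_in = start
    while check_in <= end:
        yield check_in, check_in + timedelta(days=nights)
        check_in += timedelta(days=1)


# Example: one-night stays for the first three days of April 2017.
windows = list(stay_windows(date(2017, 4, 1), date(2017, 4, 3)))
```

Each pair could then be formatted with isoformat() and sent as the date parameters of a price query for a given hotel.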
The Airbnb listings are from Inside Airbnb.
data/scrape_airbnb_prices.py scrapes Airbnb prices for given listings on given dates.
To see some example data that we scraped/collected/merged, see the data folder.
In AWS, create a new ElasticSearch instance with indices for hotels (all hotel data), airbnbs (Airbnb listing data), and airbnb_prices (Airbnb temporal data). See the data folder for more details about uploading.
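As a rough illustration of how records could be prepared for one of these indices, the sketch below builds a request body in the newline-delimited JSON format used by the Elasticsearch Bulk API. It is an assumption about one possible upload path, not the procedure in the data folder; bulk_payload and the "id" field are hypothetical names.

```python
import json


def bulk_payload(index, docs, id_field="id"):
    """Build a Bulk API body (newline-delimited JSON) that indexes `docs`.

    Each document becomes two lines: an action line naming the target index,
    followed by the document itself.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        # Reuse the record's own identifier when present (assumed field name).
        if id_field in doc:
            action["index"]["_id"] = str(doc[id_field])
        # Older Elasticsearch versions also expect a "_type" here or in the URL.
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(doc, sort_keys=True))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_payload("hotels", [{"id": 7, "name": "Example Hotel"}])
```

The resulting string would be sent to the cluster's _bulk endpoint with the appropriate AWS credentials; that step is left out here because it depends on how the instance is secured.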
See the reviews analysis folder for more details.
To run the web application, first set environment variables for ElasticSearch access keys:
export ES_KEY='your key ID here'  # or setx ES_KEY "key ID" on Windows
export ES_SECRET='your secret here'  # or setx ES_SECRET "secret" on Windows
Then, start the web application.
cd flask-app
python application.py
Navigate to localhost:8000 in your browser to see the site.