This project analyzes Amazon reviews written by members of the paid Amazon Vine program. The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.
In this project, you’ll have access to approximately 50 datasets. Each one contains reviews of a specific product, from clothing apparel to wireless products. You’ll need to pick one of these datasets and use PySpark to perform the ETL process to extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin. Next, you’ll use PySpark, Pandas, or SQL to determine if there is any bias toward favourable reviews from Vine members in your dataset. Then, you’ll write a summary of the analysis for Jennifer to submit to the SellBy stakeholders.
Determine if there is any bias toward favourable reviews from Vine members (paid reviews) in Amazon product reviews.
From this list: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Resources: Google Colab, PySpark, AWS RDS, AWS S3, Postgres 12
- Retrieve the Amazon Reviews dataset
- Upload it to my AWS S3 bucket
- Read the dataset from S3 in my Google Colab notebook
- Assemble the data as described in 5. Assemble & Clean the Data
- Create a database in the Amazon RDS instance
- Create a connection and corresponding server in Postgres
- Create the database schema in the Postgres database
- From Google Colab, connect to the AWS RDS instance and populate the tables, which then appear in the Postgres database
- S3 --> Google Colab --> AWS RDS instance --> Postgres RDS
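The extract and load steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function names, the endpoint, and the database name are hypothetical. It assumes a `SparkSession` already exists, that the file path is one Spark can read from Colab, and that the PostgreSQL JDBC driver is on Spark's classpath.

```python
def jdbc_url(host, database, port=5432):
    """Build a PostgreSQL JDBC URL for an RDS endpoint (names are placeholders)."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def extract_reviews(spark, path):
    """Read a tab-separated Amazon reviews file into a Spark DataFrame.

    `spark` is an existing SparkSession; `path` must be readable by Spark.
    """
    return spark.read.csv(path, sep="\t", header=True, inferSchema=True)


def load_table(df, table, url, user, password):
    """Append a DataFrame to a Postgres table over JDBC."""
    props = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # assumes the driver jar is available
    }
    df.write.jdbc(url=url, table=table, mode="append", properties=props)
```

A call might look like `load_table(vine_df, "vine_table", jdbc_url("<rds-endpoint>", "<database>"), user, password)`, with the endpoint and credentials taken from the RDS console.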
Create 4 dataframes from the dataset to fit in with our database tables:
- review_id_table
- products_table
- customers_table
- vine_table
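One way to derive these four DataFrames from the raw reviews DataFrame is sketched below. The column names come from the public Amazon reviews TSV schema. Which columns go into which table is an assumption here, since the project's SQL schema is not reproduced in this README, and `split_into_tables` is a hypothetical helper name.

```python
# Assumed column layout per table; adjust to match the real database schema.
TABLE_COLUMNS = {
    "review_id_table": ["review_id", "customer_id", "product_id",
                        "product_parent", "review_date"],
    "products_table": ["product_id", "product_title"],
    "vine_table": ["review_id", "star_rating", "helpful_votes",
                   "total_votes", "vine", "verified_purchase"],
}


def split_into_tables(df):
    """Return a dict of table name -> PySpark DataFrame built from `df`."""
    tables = {name: df.select(*cols) for name, cols in TABLE_COLUMNS.items()}

    # Keep one row per product so product_id can act as a key.
    tables["products_table"] = tables["products_table"].dropDuplicates(["product_id"])

    # Count how many reviews each customer wrote.
    tables["customers_table"] = (
        df.groupBy("customer_id")
          .count()
          .withColumnRenamed("count", "customer_count")
    )
    return tables
```

Each resulting DataFrame can then be written to the Postgres table of the same name.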
The analysis is presented below under Analysis.
The dataset is limited to 2015, so the trend may have changed since then.
The "proper" conclusion is given below under Summary.
- How many Vine reviews and non-Vine reviews were there?
Paid Total Reviews
A total of 1207 paid (Vine) reviews received 20 or more helpful votes, with helpful votes making up 50% or more of total votes.
Unpaid Total Reviews
A total of 97839 unpaid (non-Vine) reviews received 20 or more helpful votes, with helpful votes making up 50% or more of total votes.
- How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
- What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
Percentage 5 Stars Paid
509 paid reviews were 5 stars, which is 42.170671% of all paid reviews.
Percentage 5 Stars Unpaid
45858 unpaid reviews were 5 stars, which is 46.870880% of all unpaid reviews.
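The counts and percentages above can be produced with a few PySpark filters. The sketch below is one possible version, not necessarily the code used in the notebook. It assumes the vote and rating columns are numeric, that `vine` holds `Y` or `N`, and that the helpfulness filter is applied as total votes of at least 20 followed by a helpful-to-total ratio of at least 0.5. `vine_five_star_summary` is a hypothetical name.

```python
def vine_five_star_summary(df):
    """Summarize 5-star share for paid (Vine) and unpaid reviews.

    `df` is a PySpark DataFrame with total_votes, helpful_votes,
    star_rating and vine columns.
    """
    # Keep only reviews that enough people voted on and found helpful.
    helpful = (
        df.filter("total_votes >= 20")
          .filter("helpful_votes / total_votes >= 0.5")
    )

    summary = {}
    for label, flag in (("paid", "Y"), ("unpaid", "N")):
        group = helpful.filter(f"vine = '{flag}'")
        total = group.count()
        five_star = group.filter("star_rating = 5").count()
        summary[label] = {
            "total": total,
            "five_star": five_star,
            # Guard against an empty group to avoid dividing by zero.
            "pct_five_star": 100 * five_star / total if total else 0.0,
        }
    return summary
```

Calling `vine_five_star_summary(vine_df)` returns one entry for `paid` and one for `unpaid`, each with the total, the 5-star count, and the 5-star percentage.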
Looking at the analysis of the Amazon Kitchen reviews above, there is no positive bias in the Vine program: 5-star reviews make up 42% of paid reviews and 47% of unpaid reviews, so unpaid reviewers give 5 stars at a higher rate than paid reviewers. In absolute terms, paid 5-star reviews (509) amount to only about 1% of unpaid 5-star reviews (45858).
Additional Information
Currently there are 107421 reviews that have received 20 or more helpful votes, see above. That means paid reviews (1207) make up only about 1% of the total helpful reviews in this category.
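The shares quoted in this README can be re-derived from the reported counts with simple arithmetic. The snippet below is only a sanity check using the figures stated above; `pct` is a helper name introduced for illustration.

```python
def pct(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


# Counts as reported in this README.
paid_total, paid_five_star = 1207, 509
unpaid_total, unpaid_five_star = 97839, 45858
reviews_with_20_plus_votes = 107421

print(f"Paid 5-star share:    {pct(paid_five_star, paid_total):.2f}%")
print(f"Unpaid 5-star share:  {pct(unpaid_five_star, unpaid_total):.2f}%")
print(f"Paid share of total:  {pct(paid_total, reviews_with_20_plus_votes):.2f}%")
```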
Given the dataset above, I propose additional analysis using NLP on the columns below:
- review_headline: Title of the review
- review_body: Text of the review
This analysis would surface customer sentiment toward the products, along with potential improvements and suggestions. It could also point to new products that address customers' pain points.
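As a very small first step toward that proposal, the most frequent terms in `review_headline` could be counted before moving on to proper sentiment models. The sketch below is a simple term-frequency count, not a sentiment analysis. The stop-word list and the `top_terms` helper are hypothetical, and the headlines in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "for", "to", "of", "i", "my"}


def top_terms(texts, n=5):
    """Return the `n` most common non-stop-word terms across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens (apostrophes allowed).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)
```

For example, `top_terms(["Great blender", "great value"], n=1)` counts "great" twice and returns it as the top term.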
Reference: NLP Tutorial series, https://eugenia-anello.medium.com/nlp-tutorial-series-d0baaf7616e0