Author

Ayanwoye Gideon Ayandele – ayanwoyegideon@gmail.com

March, 2022 - Scraping, Cleaning and Analyzing Companies Information as listed on Ycombinator

The motivation for this project is to complete a very basic end-to-end data engineering project by collecting/scraping, wrangling, cleaning and analysing/visualizing companies' information listed on https://ycombinator.com/companies.

The project's main objectives were:

Perform web scraping
Do data wrangling (gathering, assessing and cleaning) on the crawled data.
Store, analyze, and visualize the wrangled data.
Reporting on:
- data wrangling efforts.
- data analysis and visualizations

The project was divided into two parts:

Web Scraping (ycombinator_scraper.ipynb/ycombinator_scraper.py)
Data Wrangling and Exploration (EDA_ycombinator.ipynb)

Web Scraping (`ycombinator_scraper.ipynb`/`ycombinator_scraper.py`)

The dependencies and third-party libraries for the scraper include:

Selenium
BeautifulSoup
requests
numpy
pandas

Because the listing page is dynamic, I used the Selenium library to scrape the following fields for all 1000 companies listed on https://ycombinator.com/companies:

The listed company name
The company's ycombinator page url
The company location
The company short description (Description head)

I then visited each scraped company's ycombinator page url with the requests library, since those pages are static, and grabbed further information (company description, year founded, team size, company website url, social media urls, management details) as it appears for each company.
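
As an illustration of this second pass, the sketch below sorts links found on a company page into the social profile columns of the final CSV. It is a minimal sketch rather than the scraper's actual code: the function name `classify_profile_links` and the domain-to-column mapping are assumptions, and the hrefs are taken as already extracted from the page (for example with BeautifulSoup).

```python
from urllib.parse import urlparse

# Assumed mapping from a link's domain to the CSV column it fills.
PROFILE_COLUMNS = {
    "linkedin.com": "Linkedin_Profile",
    "twitter.com": "Twitter_Profile",
    "facebook.com": "Facebook_Profile",
    "crunchbase.com": "Crunchbase_Profile",
}


def classify_profile_links(hrefs):
    """Return a dict of profile column -> first matching URL (None if absent)."""
    profiles = {column: None for column in PROFILE_COLUMNS.values()}
    for href in hrefs:
        host = urlparse(href).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        column = PROFILE_COLUMNS.get(host)
        # Keep only the first link seen for each profile type.
        if column is not None and profiles[column] is None:
            profiles[column] = href
    return profiles


links = [
    "http://airbnb.com",
    "https://www.linkedin.com/company/airbnb/",
    "https://twitter.com/Airbnb",
    "https://www.facebook.com/airbnb/",
    "https://www.crunchbase.com/organization/airbnb",
]
profiles = classify_profile_links(links)
print(profiles["Twitter_Profile"])
```

Columns with no matching link stay `None`, which lines up with the missing values that show up later as NaN in the CSV.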

At the end, I created a CSV file in the following format:

Company_Name	Company_Page_URL	Company_Location	Description_Head	Website	Description	Founded	Team_Size	Linkedin_Profile	Twitter_Profile	Facebook_Profile	Crunchbase_Profile	Active_Founder1	Active_Founder2	Active_Founder3
Airbnb	https://www.ycombinator.com/companies/airbnb	San Francisco, CA, US,	Book accommodations around the world.	http://airbnb.com	Founded in August of 2008 and based in San Fra...	2008	5000	https://www.linkedin.com/company/airbnb/	https://twitter.com/Airbnb	https://www.facebook.com/airbnb/	https://www.crunchbase.com/organization/airbnb	Nathan Blecharczyk\nNone\nhttps://twitter.com/...	Brian Chesky\nNone\nhttps://twitter.com/bchesky\n	Joe Gebbia\nNone\nhttps://twitter.com/jgebbia\n,

The scraper runs for approximately 1.5 minutes with multithreading and approximately 7 minutes without it.
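
The speed-up comes from fetching many company pages at once instead of one after another. A minimal sketch of that idea with the standard library's `ThreadPoolExecutor` is shown here; the function names, the worker count and the `example-co` URL are hypothetical, and `fetch_company_details` is a stand-in that does no network access so the snippet runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_company_details(url):
    """Stand-in for the per-company request; a real version would download and parse `url`."""
    return {"Company_Page_URL": url, "slug": url.rstrip("/").rsplit("/", 1)[-1]}


def scrape_all(urls, max_workers=8):
    """Process every company page concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        return list(pool.map(fetch_company_details, urls))


urls = [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/example-co",  # hypothetical page
]
rows = scrape_all(urls)
print([row["slug"] for row in rows])
```

Threads suit this workload because each task spends most of its time waiting on the network rather than using the CPU.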

Data Wrangling and Exploration (`EDA_ycombinator.ipynb`)

The dependencies and third-party libraries for the EDA include:

numpy
pandas
matplotlib
seaborn

The summary of the data assessment and cleaning is:

Three company names (Nash, Atlas and Streak) each appeared twice, but the two records differed in their other characteristics, so the duplicates were kept.
Missing data were represented with NaN and were neither imputed nor removed, since they represent characteristics that do not apply to the particular company.
A new variable, Country_Of_Origin, was extracted from the Company_Location column, and another variable, Number_Of_Founders, was derived from Active_Founder1 through Active_Founder6.
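
The two derived variables could be computed along the following lines. This is a sketch under assumptions, not the notebook's code: it assumes the country is the last non-empty comma-separated part of Company_Location, that a founder cell counts when it is neither missing nor empty, and the helper names are hypothetical.

```python
import math


def country_of_origin(location):
    """Return the last non-empty comma-separated part of a location string."""
    if not isinstance(location, str):
        return None  # missing value (e.g. NaN)
    parts = [part.strip() for part in location.split(",") if part.strip()]
    return parts[-1] if parts else None


def number_of_founders(row, max_founders=6):
    """Count the Active_Founder1..Active_FounderN cells that hold a value."""
    count = 0
    for i in range(1, max_founders + 1):
        value = row.get(f"Active_Founder{i}")
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is not None and not is_nan and value != "":
            count += 1
    return count


print(country_of_origin("San Francisco, CA, US,"))  # prints US

row = {
    "Active_Founder1": "Nathan Blecharczyk",
    "Active_Founder2": "Brian Chesky",
    "Active_Founder3": "Joe Gebbia",
    "Active_Founder4": float("nan"),
}
print(number_of_founders(row))  # prints 3
```

With pandas, the first helper could be applied to the Company_Location column through `Series.apply`, and the second to each row of the DataFrame.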

Analysis Summary

Using both Univariate and Bivariate analysis:

The most represented country is the USA, which accounts for 654 of the 1000 companies. It is followed by India, Canada, the UK, Nigeria and Indonesia.
The more recently a company was founded, the more likely it is to be funded/listed by ycombinator.
The team size distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I had to resort to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4.
The most common number of founders is 2, followed by 1 and 3.
There is no interesting relationship between country of origin and team size, number of founders or year founded.
There is a weak, negative linear correlation between Number_Of_Founder and team size.
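
The binning used for the team size plot can be sketched in plain Python as follows. The function name and the sample values are hypothetical; only the bin width of 100 and the x-axis limit of 3000 come from the description above.

```python
def bin_team_sizes(team_sizes, bin_width=100, x_limit=3000):
    """Count team sizes per fixed-width bin, ignoring values at or above x_limit."""
    edges = list(range(0, x_limit + bin_width, bin_width))
    counts = [0] * (len(edges) - 1)
    for size in team_sizes:
        if 0 <= size < x_limit:
            counts[int(size // bin_width)] += 1
    return edges, counts


# Hypothetical team sizes, for illustration only.
sizes = [2, 3, 4, 15, 120, 250, 5000]
edges, counts = bin_team_sizes(sizes)
print(edges[:4], counts[:3])
```

The same `edges` list can be passed as the `bins` argument of matplotlib's `hist`, with the x-axis limit set separately, so that the long tail does not squeeze the bulk of the distribution into a single bar.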

Details of Charts

Most represented country (Country_Of_Origin) on ycombinator: The most represented country is the USA, which accounts for 654 of the 1000 companies. It is followed by India, Canada, the UK, Nigeria and Indonesia.
The distribution of the year founded of the companies: The more recently a company was founded, the more likely it is to be funded/listed by ycombinator.
The distribution of the team size of the companies: The team size distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I had to resort to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4.
The distribution of the Number_Of_Founder of the companies: The most common number of founders is 2, followed by 1 and 3.

There is no interesting relationship between country of origin and team size, number of founders or year founded. There is also a weak, negative linear correlation between Number_Of_Founder and team size.

The relationship between Country_Of_Origin and year founded
The relationship between Country_Of_Origin and team size
The relationship between Country_Of_Origin and Number_Of_Founder
The relationship between Number_Of_Founder and team size
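
The linear correlation between Number_Of_Founder and team size mentioned above is the kind of quantity a Pearson coefficient measures. The sketch below computes it from scratch; the function name and the sample values are hypothetical and not taken from the dataset, so the number it prints says nothing about the real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not taken from the dataset.
founders = [1, 2, 2, 3, 4]
team_sizes = [40, 12, 30, 8, 10]
r = pearson_r(founders, team_sizes)
print(round(r, 3))
```

A value near 0 indicates a weak linear relationship and a negative sign means one variable tends to fall as the other rises. In pandas, `Series.corr` computes the same coefficient by default.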

