In this project I extract information from a web page, clean the data, and then save it in a SQL Server database.
1.- First, I scrape a web page to extract the data needed for a small analysis at the end of this work. For this I write two Python scripts that use the Spider library to do the web scraping.
2.- I clean the data with the PySpark SQL functions so that it can be stored properly in the database. The cleaned outputs are shaped to generate the database shown in the following picture.
3.- As a final step, I store the collected data in a SQL Server database. The database has to be created in SQL Server first, with its tables defined there; otherwise PySpark will create the tables in its own way.
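
The final step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper names `build_jdbc_options` and `write_to_sql_server`, as well as the host, database, table, and credentials, are hypothetical placeholders. It assumes the Microsoft JDBC driver for SQL Server is available to Spark and that the target table already exists in SQL Server.

```python
def build_jdbc_options(host, database, table, user, password, port=1433):
    """Build the options for Spark's JDBC data source (hypothetical helper).

    All argument values are placeholders chosen for illustration.
    """
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def write_to_sql_server(df, options):
    """Append the rows of a cleaned PySpark DataFrame to an existing table.

    mode("append") inserts into the table already defined in SQL Server,
    so the column types chosen there are kept. If the table does not
    exist, Spark creates it with types derived from the DataFrame schema.
    """
    df.write.format("jdbc").options(**options).mode("append").save()


# Example values only; replace them with the real connection details.
options = build_jdbc_options(
    host="localhost",
    database="scraping_db",
    table="dbo.items",
    user="spark_user",
    password="secret",
)
```

With a cleaned DataFrame `df`, calling `write_to_sql_server(df, options)` would perform the write. Using append mode rather than overwrite is a deliberate choice in this sketch: by default, overwrite drops and recreates the table, which would replace the definitions made in SQL Server with the ones Spark generates.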