In this project, I will analyze a dataset and then communicate my findings about it. I will use the Python libraries NumPy, pandas, and Matplotlib to make my analysis easier.
I needed an installation of Python, plus the following libraries:
- pandas
- NumPy
- Matplotlib
- seaborn
- csv
I recommend installing Anaconda, which comes with all of the necessary packages, as well as IPython notebook.
In this project, I will go through the data analysis process and see how everything fits together.
I'll use the Python libraries NumPy, pandas, and Matplotlib, which make writing data analysis code in Python a lot easier! Not only that, these are sought-after skills by employers!
- Have known all the steps involved in a typical data analysis process
- Comfortable posing questions that can be answered with a given dataset and then answering those questions
- Know how to investigate problems in a dataset and wrangle the data into a format I can use
- Have experience communicating the results of my analysis
- Able to use vectorized operations in NumPy and pandas to speed up my data analysis code
- Familiar with pandas' Series and DataFrame objects, which let me access my data more conveniently
- I know how to use Matplotlib to produce plots showing my findings
For the final project, I conducted my own data analysis and created a jupyter file and html file which can be viewed over here to share my findings. I started by taking a look at the dataset and brainstorming what questions I could answer using it. Then I used pandas and NumPy to answer the questions I was most interested in, and created a report sharing the answers. I did not required to use inferential statistics or machine learning to complete this project, but I made it clear in my communications that my findings are tentative. This project is open-ended.
TMDb movie data (cleaned from original data on Kaggle)
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
- Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
- There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them. You can leave them as is.
- The final two columns ending with “adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?
Use the Project Rubric to review your project. If you are happy with your submission, then you're ready to submit your project. If you see room for improvement, keep working to improve your project!
*Investigate a Dataset - Template Notebook
- https://morphocode.com/pandas-cheat-sheet/
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
- https://realpython.com/python-histograms/
- https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/
- https://github.com/antra0497/Udacity--Project-Investigate-TMDB-Movies-Dataset/blob/master/investigate-a-dataset-template.ipynb
- https://matplotlib.org/gallery/shapes_and_collections/scatter.html#sphx-glr-gallery-shapes-and-collections-scatter-py