The #66Days of Data is an initiative started by Ken Jee to help people develop better data science habits!
- Google Data Analytics Course
- Book: Python for Data Analysis
- Daily Practice on DataCamp
- Dataset for Analysis
I completed a few chapters of the SQL course from @DataCamp and learned several methods for data cleaning. SQL keywords like SELECT, AND, OR, IS NOT NULL, and LIKE help to filter out missing values and outliers and make the data suitable for further analysis.
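To make the filtering idea concrete, here is a minimal sketch that runs such a query from Python using the built-in sqlite3 module. The `customers` table, its columns, and its rows are all made up for illustration; they are not from the DataCamp course.

```python
import sqlite3

# Hypothetical in-memory table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Asha", "Pune", 34),
        ("Ben", None, 29),        # missing city
        ("Chen", "Paris", None),  # missing age
        ("Dev", "Patna", 41),
        ("Eli", "Oslo", 230),     # implausible age (an outlier)
    ],
)

# Keep rows whose age is present and plausible, and whose city
# starts with "P" or "O". A NULL city never matches LIKE.
query = """
    SELECT name, city, age
    FROM customers
    WHERE age IS NOT NULL
      AND age < 120
      AND (city LIKE 'P%' OR city LIKE 'O%')
    ORDER BY name
"""
clean_rows = conn.execute(query).fetchall()
conn.close()

print(clean_rows)
```

The age cutoff of 120 is an arbitrary choice for this sketch; a real dataset would need its own rule for what counts as an outlier.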
Python for Data Analysis by Wes McKinney is an amazing book for getting started with data science. On just the first day of studying, I learned a lot about libraries like pandas and NumPy, which are powerful data science libraries. This book is going to help a lot.
Visualization is a crucial part of data analysis because it gives a clear idea of what the information means by placing it in visual context through maps or graphs. A well-made plot makes patterns easier to pick out from the noise. Python has Matplotlib for this. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
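As a small illustration, the sketch below draws a basic line chart with Matplotlib. The monthly numbers and the output file name are made up, and the "Agg" backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI window is needed
import matplotlib.pyplot as plt

# Made-up monthly values, used only to have something to plot.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [12, 15, 11, 18, 21, 19]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")  # one line with a marker per month
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly sales")

# Hypothetical output file name.
fig.savefig("monthly_sales.png")
```

In a notebook or an interactive session, `plt.show()` would display the figure instead of saving it.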
Today I learned about NumPy 2D arrays and basic statistics in NumPy. NumPy stands for Numerical Python. It adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions. It also comes with many statistical functions that work on arrays, which are key when analysing a large chunk of data.
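Here is a short sketch of a 2D array and a few of those statistical functions. The height and weight values are invented for the example.

```python
import numpy as np

# Made-up data: each row is one person, columns are height (m) and weight (kg).
people = np.array([
    [1.70, 65.0],
    [1.80, 80.0],
    [1.60, 55.0],
    [1.75, 72.0],
])

print(people.shape)     # (number of rows, number of columns)

heights = people[:, 0]  # every row, first column
weights = people[:, 1]  # every row, second column

mean_height = np.mean(heights)
median_weight = np.median(weights)
std_weight = np.std(weights)
corr = np.corrcoef(heights, weights)[0, 1]  # correlation between the two columns

print(mean_height, median_weight, std_weight, corr)
```

Slicing with `people[:, 0]` pulls out a whole column at once, which is what makes these functions convenient on large arrays.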
Hacker Statistics in Python is about gathering repeated measurements to learn more about data. The basic idea is that instead of literally repeating the data acquisition over and over again, we can simulate those repeated measurements using Python loops. In cases like coin flipping and dice rolling, repeating the loop a large number of times gives an estimated probability that comes very close to the theoretical one.
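A minimal sketch of that idea for a fair coin, using only the standard library. The seed and the number of flips are arbitrary choices made for this example.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

n_flips = 100_000
heads = 0
for _ in range(n_flips):
    # random.random() is uniform on [0, 1), so this is a fair coin
    if random.random() < 0.5:
        heads += 1

estimate = heads / n_flips
print(f"Estimated P(heads) = {estimate:.4f} (theory: 0.5)")
```

With fewer flips the estimate wobbles more; increasing `n_flips` pulls it closer to 0.5.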
[DC Course Assignment] Here is the case: we are in a lift on the 50th floor of a building and we have a die to roll. If the result is 1 or 2, we go 1 step down, and if the result is 3, 4, or 5, we go 1 step up. Otherwise (in case of a 6), we roll again and go up exactly as many steps as that second roll shows. This process was repeated 2000 times, and from the resulting histogram we find that around 600 times we ended up around the 80th floor, and around 400 times we ended up around the 70th and 90th floors. The more repetitions we perform, the more precise the results become.
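The rules above can be sketched as a simulation like the one below. This is not the course's code. Several things are assumptions on my part: the number of rolls per walk (`ROLLS_PER_WALK = 40`, picked so that walks end near the 80th floor on average), the rule that the lift cannot go below floor 0, the lack of a top floor, and the seed.

```python
import random
from collections import Counter

random.seed(123)  # arbitrary seed so the run is repeatable

START_FLOOR = 50
N_WALKS = 2000       # number of simulated walks, as in the post
ROLLS_PER_WALK = 40  # assumption: the post does not give this number


def simulate_walk(start, n_rolls):
    """Return the floor reached after n_rolls die rolls."""
    floor = start
    for _ in range(n_rolls):
        die = random.randint(1, 6)
        if die <= 2:
            floor = max(0, floor - 1)  # assumption: cannot go below floor 0
        elif die <= 5:
            floor += 1
        else:
            floor += random.randint(1, 6)  # rolled a 6: roll again, go up that many
    return floor


final_floors = [simulate_walk(START_FLOOR, ROLLS_PER_WALK) for _ in range(N_WALKS)]
mean_final = sum(final_floors) / N_WALKS

# Text histogram: count how many walks ended in each block of 10 floors.
bins = Counter((floor // 10) * 10 for floor in final_floors)
for lower in sorted(bins):
    print(f"{lower:3d}-{lower + 9:3d}: {bins[lower]}")
print("mean final floor:", mean_final)
```

Each roll moves the lift up by 0.75 floors on average, so 40 rolls from floor 50 land near floor 80. The exact counts per bin depend on the seed and on the assumed number of rolls.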
Hypothesis testing is a part of statistical analysis in which we test assumptions made about a population. The main objective of hypothesis testing is to decide whether to reject the hypothesis being tested or fail to reject it.
A few terms used in the code, explained:
Null Hypothesis: The hypothesis that is tested for possible rejection under the assumption that it is true. It is denoted by H0 and read as H-naught.
Alternative Hypothesis: Any hypothesis that is mutually exclusive and complementary to the null hypothesis is an alternative hypothesis. It is denoted by H1 or Ha.
Level of significance: Denoted by alpha. It is the fixed probability of wrongly rejecting a true null hypothesis. For example, here alpha is 20%, which means we accept a 20% risk of concluding that a difference exists when there is no actual difference.
Shapiro-Wilk Test: The Shapiro–Wilk test is a test of normality in frequentist statistics. It was published in 1965 by Samuel Sanford Shapiro and Martin Wilk. It is available in Python's SciPy package as scipy.stats.shapiro.
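To tie these terms together, here is a minimal sketch of a Shapiro-Wilk test with `scipy.stats.shapiro`. The sample is synthetic data generated just for this example, and alpha is set to 0.20 to match the 20% mentioned above. Here H0 is "the sample comes from a normal distribution".

```python
import numpy as np
from scipy import stats

# Synthetic sample, generated only for this illustration.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)

alpha = 0.20  # level of significance, matching the 20% in the text

# H0: the sample was drawn from a normal distribution.
stat, p_value = stats.shapiro(sample)

if p_value < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(f"W = {stat:.4f}, p-value = {p_value:.4f} -> {decision}")
```

A p-value below alpha means the data look unlikely under H0, so we reject it. Otherwise we fail to reject it, which is not the same as proving the data are normal.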