Analysing whether the color of a waitress's or waiter's T-shirt has an impact on whether she or he gets tipped.
It is thought that customers tip more often when the waitress is wearing a red T-shirt. So, let's dive into the data and see whether that's true. In our dataset we have these variables:
- Color - indicates the color of the T-shirt the waitress is wearing.
- Tipnum - indicates whether the customer tipped. When tipnum = 1, the customer tipped.
- Male - indicates the client's sex. When male = 1, the client is male.
- Black - when black = 1, the waitress is wearing a black T-shirt; when black = 0, she is not.
- White - same coding as Black, for a white T-shirt.
- Yellow - same coding as Black, for a yellow T-shirt.
- Blue - same coding as Black, for a blue T-shirt.
- Green - same coding as Black, for a green T-shirt.
Note: when the variables black, white, yellow, blue and green are all equal to zero, it means that the waitress is wearing a red T-shirt. Dataset report:
Here the sampsz variable indicates how many observations we have in each group.
I'm going to work only with the observations where the customer is a man. I'm analysing this logistic regression model:
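The dummy coding from the variable list, with red as the reference group, can be sketched as follows. This is only an illustration: the helper names `encode_color` and `decode_color` are hypothetical and not part of the original analysis.

```python
# Sketch of the 0/1 dummy coding described in the variable list.
# Red has no variable of its own: it is the case where every dummy is 0.
DUMMY_COLORS = ["black", "white", "yellow", "blue", "green"]


def encode_color(color):
    """Return the 0/1 indicator variables for one T-shirt color."""
    color = color.lower()
    if color != "red" and color not in DUMMY_COLORS:
        raise ValueError("unknown color: " + color)
    return {name: int(name == color) for name in DUMMY_COLORS}


def decode_color(dummies):
    """Recover the color from the indicators; all zeros means red."""
    active = [name for name in DUMMY_COLORS if dummies[name] == 1]
    if len(active) > 1:
        raise ValueError("at most one color indicator may be 1")
    return active[0] if active else "red"


print(encode_color("black"))
print(encode_color("red"))
```

Because red is encoded as "all zeros", every coefficient of a color variable in the model is interpreted relative to the red T-shirt.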
First of all, let's see if we have enough data in each color group to work with:
It seems that we have enough data, since the frequency of each group is over 5.
Next, let's see if every independent variable is statistically significant:
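The "more than 5 observations per group" check can be written as a small sketch. The sample below is synthetic and only demonstrates the rule; it does not reproduce the counts from the dataset, and `small_groups` is a hypothetical helper name.

```python
from collections import Counter

MIN_COUNT = 5  # rule used in the text: each group needs more than 5 observations


def small_groups(colors, min_count=MIN_COUNT):
    """Return the groups whose frequency is not above min_count."""
    counts = Counter(colors)
    return {color: n for color, n in counts.items() if n <= min_count}


# Synthetic data for illustration only, NOT the real dataset.
sample = ["red"] * 12 + ["black"] * 9 + ["white"] * 3
print(small_groups(sample))  # {'white': 3}
```

An empty result means every group passes the check.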
Since our confidence level is 0.95 (a significance level of 0.05), we can see that the yellow variable is statistically insignificant, so we should remove it from the model.
Once it's removed, let's check the Analysis of Maximum Likelihood Estimates table again:
Now we see that every variable is statistically significant, so we can continue working with this model.
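To make the significance decision concrete, here is a minimal sketch, assuming the p-values in the table come from a Wald test (estimate divided by its standard error), which is the usual choice for this kind of table. The estimates and standard errors below are made up for illustration and are not the values from the analysis.

```python
import math

ALPHA = 0.05  # significance level that corresponds to the 0.95 used in the text


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, based on the Wald z statistic."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))


def is_significant(estimate, std_error, alpha=ALPHA):
    """A variable stays in the model only if its p-value is below alpha."""
    return wald_p_value(estimate, std_error) < alpha


# Hypothetical coefficients, NOT taken from the fitted model.
print(is_significant(0.20, 0.25))  # weak effect relative to its standard error
print(is_significant(0.60, 0.25))  # stronger effect relative to its standard error
```

A variable whose p-value is at or above 0.05 is the kind that gets dropped, as happened with yellow.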
Our convergence criterion is satisfied:
The AIC shows that our model is suitable as well:
Let's see how well our model fits the data:
From this table, we can see that the c statistic is only 0.593. This means the model is better than predicting the outcome at random (c = 0.5), but its ability to separate tippers from non-tippers is still low.
The classification table tells practically the same story as the c statistic:
Also, from this table we can see that the threshold works best when it is between 0.4 and 0.56.
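For readers unfamiliar with the c statistic, the sketch below shows one common way to compute it: take every pair made of one customer who tipped and one who did not, and count how often the model gives the tipper the higher predicted probability (ties count as one half). The function name and the tiny example data are illustrative assumptions, not part of the original analysis.

```python
def c_statistic(probs, outcomes):
    """Share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probs, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                score += 1.0   # concordant pair
            elif p_event == p_nonevent:
                score += 0.5   # tied pair
    return score / (len(events) * len(nonevents))


# Made-up predicted probabilities and outcomes, for illustration only.
probs = [0.28, 0.28, 0.18, 0.25, 0.18]
outcomes = [1, 0, 0, 1, 0]
print(c_statistic(probs, outcomes))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why 0.593 is read as only a slight improvement over chance.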
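How a classification table depends on the threshold can be sketched as follows: a customer is predicted to tip when the predicted probability is at or above the threshold, and the predictions are then compared with what actually happened. The function names and data are hypothetical, and the sketch ignores any bias adjustment that statistical software may apply when producing such a table.

```python
def classification_table(probs, outcomes, threshold):
    """Count correct and incorrect predictions at a given probability threshold."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, y in zip(probs, outcomes):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            table["tp"] += 1   # predicted a tip, customer tipped
        elif predicted == 1 and y == 0:
            table["fp"] += 1   # predicted a tip, customer did not tip
        elif predicted == 0 and y == 0:
            table["tn"] += 1   # predicted no tip, customer did not tip
        else:
            table["fn"] += 1   # predicted no tip, customer tipped
    return table


def accuracy(table):
    """Share of observations classified correctly."""
    total = sum(table.values())
    return (table["tp"] + table["tn"]) / total


# Made-up predicted probabilities and outcomes, for illustration only.
probs = [0.28, 0.28, 0.18, 0.25, 0.18]
outcomes = [1, 0, 0, 1, 0]
for threshold in (0.20, 0.26, 0.40):
    table = classification_table(probs, outcomes, threshold)
    print(threshold, table, accuracy(table))
```

Scanning several thresholds like this is one way to see where the share of correct classifications peaks.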
Now, let's see if we have any outliers:
We can see that we don't have any outliers, since the Pearson residual and DFBetas values do not exceed their limits.
Thus, our model is suitable. We have this model:
In order to calculate the probability of getting a tip, P(tipnum = 1), we need to use this formula:
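For a logistic regression with the four color variables that remain after yellow was removed, the standard form of this formula is shown below. The symbols \beta_0, ..., \beta_4 are placeholders for the fitted coefficients, whose numeric values are not reproduced here.

```latex
\[
P(\text{tipnum} = 1) = \frac{1}{1 + e^{-z}}, \qquad
z = \beta_0 + \beta_1\,\text{black} + \beta_2\,\text{white}
      + \beta_3\,\text{blue} + \beta_4\,\text{green}
\]
```

For a red T-shirt all color variables are zero, so the probability depends only on the intercept: P(tipnum = 1) = 1 / (1 + e^{-\beta_0}).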
Plugging in the specific values for each T-shirt color, we get these probabilities of getting tipped:
- Red - 0.1846
- Black - 0.2841
- White - 0.2586
- Green - 0.2513
- Blue - 0.2563
We can see that the waitress has the lowest chance of getting tipped when she is wearing the red T-shirt. She has the highest chance when she is wearing a black T-shirt, but that value is not very different from the white, green and blue groups (roughly 0.25 to 0.26), which means that it doesn't make a big difference which T-shirt the waitress is wearing.
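To compare the groups on a common scale, the reported probabilities can be converted to odds and set against the red group. The sketch below uses only the five probabilities reported above; the helper names are hypothetical.

```python
# Probabilities of getting tipped, as reported in the results above.
reported = {
    "red": 0.1846,
    "black": 0.2841,
    "white": 0.2586,
    "green": 0.2513,
    "blue": 0.2563,
}


def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio_vs_red(color):
    """Odds of a tip for this color divided by the odds for red."""
    return odds(reported[color]) / odds(reported["red"])


for color, p in reported.items():
    difference = p - reported["red"]  # difference in probability versus red
    print(f"{color:6s} p={p:.4f} diff_vs_red={difference:+.4f} "
          f"odds_ratio_vs_red={odds_ratio_vs_red(color):.3f}")
```

Whether these differences are statistically distinguishable from each other depends on the standard errors of the estimates, which this sketch does not use.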