# Twitter User Analysis
According to [Alexa.com](http://www.alexa.com/siteinfo/twitter.com),
Twitter.com is the 10th most popular site in the world. Twitter
is a social network that allows users to share information as a string
of 140 or fewer characters, called a status update or tweet.
Twitter also allows a user _A_ to follow another user _B_, so that user
_A_ can easily view all of user _B_'s status updates. This
interaction makes user _A_ a follower of user _B_. The number of
followers a user has can be seen as a status symbol, or it can indicate
the user's social media influence. This study attempts to predict the number
of followers based upon various characteristics of a Twitter user.
More precisely, this study aims to predict the number of Twitter followers for the _top_ 1000 Twitter accounts associated with the search term **data**.
## About the data
Twitter has an [API (Application Programming Interface)](https://dev.twitter.com/docs/api/1.1) which provides access
to information about the _top_ 1000 users for any search term. Unfortunately,
Twitter does not specify how these _top_ users are determined, but they
can likely be regarded as the most influential Twitter users for a given search
term. On October 10, 2013, the Twitter API was used to pull information about the
_top_ 1000 users associated with the term "data". The final data is
formatted as a CSV (Comma-Separated Values) file with each row representing
a separate user and the columns as follows:
1. **handle** - Twitter username | string
1. **name** - full name of the Twitter user | string
1. **age** - number of days the user has existed on Twitter | number
1. **num_of_tweets** - number of tweets this user has created (includes retweets) | number
1. **has_profile** - 1 if the user has created a profile description, 0 otherwise | boolean
1. **has_pic** - 1 if the user has set up a profile picture, 0 otherwise | boolean
1. **num_following** - number of other Twitter users this user is following | number
1. **num_of_favorites** - number of tweets the user has favorited | number
1. **num_of_lists** - number of public lists this user has been added to | number
1. **num_of_followers** - number of other users following this user | number
### Training and Validation Data
The data file was then split into two datasets: one for training the
models and another for validating them. The split was
60% for training and 40% for validation, as sketched below.
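The chunk below is a minimal sketch of how such a 60/40 split could be produced. It is illustrative only: the actual split files were generated beforehand, and the seed shown is the one used later in this report, not necessarily the one used for the original split.
```{r, eval=FALSE}
# hedged sketch of a 60/40 train/validation split (assumed procedure,
# not necessarily the one used to create the provided files)
full = read.csv('twitter_user_data_data.csv')
set.seed(34567)
train_idx = sample(nrow(full), size = floor(0.6 * nrow(full)))
write.csv(full[train_idx, ], 'twitter_user_data_data_training.csv', row.names = FALSE)
write.csv(full[-train_idx, ], 'twitter_user_data_data_test.csv', row.names = FALSE)
```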
# Exploratory Analysis
More exploration could be added here. The main thing to note is that
the **has_pic** column contains only a single value that differs
from the rest, so **has_pic** will not be included in the analysis.
```{r}
library(stats)
# read in the training, validation, and full datasets
set.seed(34567)
training_data = read.csv('twitter_user_data_data_training.csv')
validation_data = read.csv('twitter_user_data_data_test.csv')
full_data = read.csv('twitter_user_data_data.csv')
summary(training_data)
pairs(training_data[3:10])  # scatterplot matrix of the numeric columns
```
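As a quick check of the claim above, tabulating **has_pic** shows its near-constant distribution (a one-line sketch):
```{r}
table(full_data$has_pic)  # all but one user share the same value
```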
## Outliers
Looking at the num_following versus num_of_lists plot, there
appear to be a few outliers. Thus, the user with the very high num_following
and the users with very high num_of_lists were removed, so the
training set now contains 597 users instead of 600. The analysis is
not included here, but the models performed the same or better with the outliers
removed.
```{r}
plot(training_data$num_of_lists, training_data$num_following)
# drop the extreme users identified in the plot above
training_data = training_data[training_data$num_following < 20000 & training_data$num_of_lists < 5000, ]
dim(training_data)  # new dimensions
plot(training_data$num_of_lists, training_data$num_following)
```
# Analysis
First, a linear model with all the predictors was created.
The full linear model identified the following predictors as
significant: age, num_following, and num_of_lists. Then backward
step-wise regression was performed, and the best-fitting model
was identified as the one containing the same predictors
as the full linear model just mentioned.
The Box-Cox method was used to determine whether any transformations should be applied
to the response variable. As can be seen in the Box-Cox plot, the maximum occurred at $\lambda = 0.06$. Based on that value, two separate
transformations were tried. The first took the natural log of the
dependent variable. The second raised the dependent variable
to the power 0.06. Neither of these transformations yielded promising results, so the
detailed analysis is not included in this report.
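For reference, the Box-Cox procedure estimates the power $\lambda$ in the transformation family
$$
y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log y & \lambda = 0 \end{cases}
$$
which is why an estimate near zero ($\lambda = 0.06$) suggests a log transformation.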
Next, all-subsets regression was performed using the _leaps_ package
in the R programming language. The _leaps_ package performs
an exhaustive search over all possible subsets of the variables in
order to find the best-fitting models based upon the Mallows' $C_p$ criterion.
As can be seen in the output plot, four models have low $C_p$ values
compared to the rest. Not surprisingly, the four models contain
different combinations of the 3 predictors already identified, including the
model identified by the step-wise regression. For these reasons, the four
models with the lowest $C_p$ values will be compared to determine the best model.
Here are the 4 candidate models being considered.
### Model 1
$$
num\_of\_followers = \beta_0 + \beta_1*age + \beta_2*num\_following + \beta_3*num\_of\_lists
$$
### Model 2
$$
num\_of\_followers = \beta_0 + \beta_2*num\_following + \beta_3*num\_of\_lists
$$
### Model 3
$$
num\_of\_followers = \beta_0 + \beta_3*num\_of\_lists
$$
### Model 4
$$
num\_of\_followers = \beta_0 + \beta_1*age + \beta_3*num\_of\_lists
$$
## Best Model
First, look at the PRESS statistic. A PRESS statistic reasonably close to the
SSE supports the validity of a linear regression model. As can be seen in
the model output below, all four candidate models have a PRESS statistic reasonably close
to the SSE.
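For reference, PRESS is the sum of squared leave-one-out prediction errors:
$$
PRESS = \sum_{i=1}^{n} \left(y_i - \hat{y}_{i(i)}\right)^2
$$
where $\hat{y}_{i(i)}$ is the prediction for observation $i$ from the model fit without observation $i$.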
Next, look at the Mallows' $C_p$ value. A lower $C_p$ value is better; in particular,
the $C_p$ value should be less than $p$ (the number of predictors plus 1 for the intercept), and
a $C_p$ value equal to $p$ indicates a model with no bias. Therefore, it is advantageous
to find a $C_p$ near $p$. Model 4 has the lowest overall $C_p$ value, but for Model 4, $p = 3$,
making its $C_p$ value greater than $p$. Only Model 1 has a $C_p$ less than or equal to $p$:
$C_p = 4$ and $p = 4$. Thus, Mallows' $C_p$ favors Model 1.
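For a candidate model with $p$ coefficients, $C_p$ is computed from the candidate's error sum of squares and the full model's mean squared error:
$$
C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p)
$$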
Finally, look at the MSPR (Mean Squared Prediction Error) on the validation set for the four models. A lower value
indicates more predictive power. Model 1 has the lowest value of the four models,
so the MSPR favors Model 1 as well.
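With $n^*$ validation observations, the MSPR is:
$$
MSPR = \frac{1}{n^*} \sum_{i=1}^{n^*} \left(y_i - \hat{y}_i\right)^2
$$
where $\hat{y}_i$ is the training model's prediction for validation observation $i$.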
Overall, Model 1 appears to be the best predictive model for the Twitter data.
Model 1 was then refit using all the available data, not just the training
data. The final model for predicting the number of followers of a Twitter
user in the top 1000 for the search term 'data' is:
$$
num\_of\_followers = 898 - 1.5*age + 0.8*num\_following + 28.1*num\_of\_lists
$$
Here is how the final model can be interpreted. Given a brand-new account following 0 users
and not on any lists, a Twitter user in the top 1000 would be expected to have 898 followers.
At first this seems ridiculous: why would a user have any followers with no activity
and a brand-new account? Remember that this data is for Twitter users in the top 1000, so
for a new Twitter account to appear in the top 1000, the person or organization that
created the account is likely already influential outside of Twitter. Think of a celebrity
creating a Twitter account: the account will quickly start attracting followers
in anticipation of future activity.
All other factors remaining the same,
an increase in the age of the account by one day results in a decrease in the number
of followers by 1.5, so simply holding a Twitter account longer does not appear to increase
followers. Likewise, when the other predictors remain the same, every additional user an
account follows results in 0.8 new followers; put another way, following 10 more people yields 8 more followers.
With the rest of the predictors held constant, being included in one more list results
in 28.1 more followers, making list membership the most influential predictor of followers.
This makes sense: the data covers the top 1000 users, and appearing on more lists indicates more influence.
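As a quick worked example of the fitted equation, consider a hypothetical account that is 365 days old, follows 100 users, and appears on 50 lists:
$$
898 - 1.5 \cdot 365 + 0.8 \cdot 100 + 28.1 \cdot 50 \approx 1836 \text{ followers}
$$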
One interesting area of future research would be generalizing this model to other search terms.
Does the same model still work well, or do different terms need different models?
What effect do search terms based upon trending topics have?
# Conclusions
It is possible to predict the number of followers for a Twitter user in the top 1000
for the search term "data". The age of the account, the number of Twitter users
the account is following, and the number of lists including the account all appear
correlated with the number of followers for an account in the top 1000 Twitter users
for the search term "data". For those familiar with Twitter,
it is not surprising that the number of tweets
does not appear to be correlated with the number of followers; tweeting
more is not helpful for gaining followers. Having a profile description does not
appear to be connected with the number of followers either.
Surprisingly,
the quality of tweets does not appear to be correlated with the number of followers
either. The number of tweets that have been favorited would be an indicator
of tweet quality, with more favorites suggesting higher-quality
tweets. However, tweet quality measured this way does not
appear to be correlated with the total number of followers.
# R code
```{r}
basic_model = lm(num_of_followers ~ age + num_of_tweets + num_following +
                 num_of_favorites + num_of_lists + as.factor(has_profile),
                 data=training_data)
summary(basic_model)
# Stepwise Regression
library(MASS)
step <- stepAIC(basic_model, direction="backward") # forward, backward, or both
step$anova # display results
```
### All-subsets Regression
```{r}
# use leaps to find the Cp values for all subsets
library(leaps)
# restrict to the three predictors under consideration (selected by name
# so the code does not depend on column order)
x = training_data[, c('age', 'num_following', 'num_of_lists')]
y = training_data$num_of_followers
models = leaps(x, y)
models
plot(models$size, models$Cp, log = "y", xlab = "# of predictors",
     ylab = expression(C[p]), main = 'Cp values by Number of Predictors',
     col = "red", cex = 1.5)
minimum <- models$Cp == min(models$Cp)
best.model <- models$which[minimum, ]
x_val = validation_data[, c('age', 'num_following', 'num_of_lists')]
y_val = validation_data$num_of_followers
models_val = leaps(x_val, y_val)
models_val
```
```{r}
library(qpcR) # for the PRESS statistic
# calculate the MSPR (mean squared prediction error) on a validation set
msrp = function(actuals, predicted) {
  sum((actuals - predicted)^2) / length(actuals)
}
# print out linear model info
model_info = function(model) {
  print(summary(model))
  # SSE
  SSE = deviance(model)
  print(paste('SSE:', SSE))
  # PRESS
  pr = PRESS(model, verbose=FALSE)
  print(paste('PRESS:', pr$stat))
  # MSE
  MSE = tail(anova(model)$Mean, 1)
  print(paste('MSE:', MSE))
  # adjusted R^2
  aR2 = summary(model)$adj.r.squared
  print(paste('Adjusted R^2:', aR2))
}
```
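If the qpcR package is unavailable, PRESS can be computed directly from the leverage values, since for a linear model the leave-one-out residual is $e_i / (1 - h_{ii})$. The helper below is a small sketch of that identity (the `press_alt` name is hypothetical):
```{r}
# alternative PRESS computation without qpcR, using hat (leverage) values
press_alt = function(model) {
  sum((residuals(model) / (1 - lm.influence(model)$hat))^2)
}
```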
## Model 1: Linear Model with age, num_following, and num_of_lists
```{r}
model_1_training = lm(num_of_followers ~ age + num_following + num_of_lists, training_data)
model_info(model_1_training)
print("Cp: 4.0")
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_1_training, newdata=validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste('MSPR:', MSRP))
# this is for the validation data
model_1_validation = lm(num_of_followers ~ age + num_following + num_of_lists, validation_data)
model_info(model_1_validation)
print("Cp: 4.0")
```
## Model 2: Linear Model with num_following and num_of_lists
```{r}
model_2_training = lm(num_of_followers ~ num_following + num_of_lists, training_data)
model_info(model_2_training)
print("Cp: 7.13")
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_2_training, newdata=validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste('MSPR:', MSRP))
# this is for the validation data
model_2_validation = lm(num_of_followers ~ num_following + num_of_lists, validation_data)
model_info(model_2_validation)
print("Cp: 3.45")
```
## Model 3: Linear Model with just num_of_lists
```{r}
model_3_training = lm(num_of_followers ~ num_of_lists, training_data)
model_info(model_3_training)
print("Cp: 5.74")
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_3_training, newdata=validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste('MSPR:', MSRP))
# this is for the validation data
model_3_validation = lm(num_of_followers ~ num_of_lists, validation_data)
model_info(model_3_validation)
print("Cp: 1.85")
```
## Model 4: Linear Model with age and num_of_lists
```{r}
model_4_training = lm(num_of_followers ~ age + num_of_lists, training_data)
model_info(model_4_training)
print("Cp: 3.883")
# check how closely the model will predict the values in the validation set
predicted_vals = predict(model_4_training, newdata=validation_data)
MSRP = msrp(validation_data$num_of_followers, predicted_vals)
print(paste('MSPR:', MSRP))
# this is for the validation data
model_4_validation = lm(num_of_followers ~ age + num_of_lists, validation_data)
model_info(model_4_validation)
print("Cp: 2.66")
```
### Model Transformations
## Model 5: Linear Model with Log(num_of_followers) and age, num_following, and num_of_lists
Before fitting the next model, the Box-Cox method was used to determine whether any
transformation should be applied to the response variable (num_of_followers).
Box-Cox returns $\lambda = 0.06$, which is quite close to 0, so a log
of the response was applied.
```{r}
# run the Box-Cox procedure (boxcox comes from MASS, loaded above)
model = lm(num_of_followers ~ age + num_following + num_of_lists, data=training_data)
bc = boxcox(model, xlab = expression(lambda), ylab = "log-Likelihood")
max = with(bc, x[which.max(y)])
max  # lambda maximizing the log-likelihood (about 0.06)
# create columns for the log-transformed response
training_data$log_num_of_followers = log(training_data$num_of_followers)
validation_data$log_num_of_followers = log(validation_data$num_of_followers)
# fit the model on the log scale; detailed validation is omitted, as noted
# above (the original called a model_test_function helper not defined here)
model_5_training = lm(log_num_of_followers ~ age + num_following + num_of_lists, training_data)
summary(model_5_training)
```
## Model 6: Linear Model with num_of_followers^0.06 and age, num_following, and num_of_lists
Also based on the Box-Cox estimate, num_of_followers was raised to the 0.06 power.
```{r}
# transform Y^0.06
# create columns for the transformed response
training_data$raise_num_of_followers = training_data$num_of_followers^0.06
validation_data$raise_num_of_followers = validation_data$num_of_followers^0.06
# fit the model on the transformed scale; detailed validation is omitted, as
# noted above (the original called a model_test_function helper not defined here)
model_6_training = lm(raise_num_of_followers ~ age + num_following + num_of_lists, training_data)
summary(model_6_training)
```
# Initial Conclusions
Of the initial models above, the best predictive power on the validation set belongs to
Model 3, the linear model using just num_of_lists. However, a few other modeling
approaches can also be applied.
## Model 7: Robust Linear Regression with age, num_following, and num_of_lists
Robust regression is less sensitive to outliers than ordinary least squares regression.
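Conceptually, an M-estimator replaces the squared-error loss with a bounded loss $\rho$; with the bisquare function used below, large residuals receive little or no weight:
$$
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i^T\beta}{\hat{\sigma}}\right)
$$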
```{r}
robust_model_7 = rlm(num_of_followers ~ age + num_following + num_of_lists,
                     data=training_data, psi = psi.bisquare, init='lts', maxit=50)
summary(robust_model_7)
predicted_vals = predict(robust_model_7, newdata=validation_data)
m7_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste('The average prediction error is:', m7_error))
```
## Model 8: Robust Linear Regression with num_following and num_of_lists
```{r}
robust_model_8 = rlm(num_of_followers ~ num_following + num_of_lists,
                     data=training_data, psi = psi.bisquare, init='lts', maxit=50)
summary(robust_model_8)
predicted_vals = predict(robust_model_8, newdata=validation_data)
m8_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste('The average prediction error is:', m8_error))
```
## Model 9: Decision Tree
```{r}
library(tree)
# regression tree using all candidate predictors
regtree = tree(num_of_followers ~ age + num_of_tweets + num_following +
               num_of_favorites + num_of_lists + as.factor(has_profile),
               data = training_data)
summary(regtree)
predicted_vals = predict(regtree, newdata=validation_data)
m9_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste('The average prediction error is:', m9_error))
```
## Model 10: Quantile Regression
A minimal sketch of median regression ($\tau = 0.5$) using the _quantreg_ package, assuming the same predictors as Model 1:
```{r}
library(quantreg)  # for quantile regression
# sketch: median regression with the Model 1 predictors (assumed specification)
quant_model_10 = rq(num_of_followers ~ age + num_following + num_of_lists,
                    tau = 0.5, data = training_data)
summary(quant_model_10)
predicted_vals = predict(quant_model_10, newdata = validation_data)
m10_error = sum(abs(predicted_vals - validation_data$num_of_followers))/length(predicted_vals)
print(paste('The average prediction error is:', m10_error))
```
# Further Conclusion
Model 1 is chosen as the best model, so it is rebuilt with all the data.
```{r}
final_model = lm(num_of_followers ~ age + num_following + num_of_lists , data=full_data)
summary(final_model)
confint(final_model)
```
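The final model can then be used for prediction. As an illustration with hypothetical input values (the same hypothetical account used in the worked example above):
```{r}
# hypothetical new account: 365 days old, follows 100 users, on 50 lists
new_user = data.frame(age = 365, num_following = 100, num_of_lists = 50)
predict(final_model, newdata = new_user, interval = 'prediction')
```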