Large-Scale Channel and Video Metadata from English-Speaking YouTube
- Data is made available through Zenodo.
- For context and an explanation of how data collection was done, see the paper.
- Please cite this dataset if you use it:
@inproceedings{ribeiro2021youniverse,
title={YouNiverse: Large-Scale Channel and Video Metadata from English-Speaking YouTube},
author={Ribeiro, Manoel Horta and West, Robert},
year={2021},
booktitle={{Proceedings of the 15th International AAAI Conference
on Weblogs and Social Media (ICWSM'21)}}
}
We provide two Jupyter notebooks along with the dataset:
- preprocessing.ipynb: which replicates the preprocessing.
- analyses.ipynb: which reproduces the plots in the associated paper.
The file df_channels_en.tsv.gz contains data related to channels.
It combines basic stats from channels obtained from channelcrawler.com with rankings obtained from socialblade.com.
- category_cc: category of the channel.
- join_date: join date of the channel.
- channel: channel id.
- name_cc: name of the channel.
- subscribers_cc: number of subscribers.
- videos_cc: number of videos.
- subscriber_rank_sb: rank in terms of number of subscribers.
- weights: weights cal
category_cc | join_date | channel | name_cc | subscribers_cc | videos_cc | subscriber_rank_sb | weights |
---|---|---|---|---|---|---|---|
Gaming | 2015-08-26 | UCi_AR7WqvXa6LEnRn_7ES7A | Thunder Play | 11500 | 849 | 877395 | 11.175 |
Sports | 2016-02-04 | UCgVlxaBsBkmMh2SUgrzG1ZQ | Thunder Prod | 76000 | 61 | 198340 | 5.5295 |
Music | 2013-09-14 | UCNBYpqbD64tkuuFS-NNhkfQ | Thunder Rain | 33700 | 58 | 382980 | 6.5855 |
Some facts about it:
- This dataframe has 136,470 rows, where each one corresponds to a different channel.
- We obtained all channels with >10k subscribers and >10 videos from channelcrawler.com on 27 October 2019.
- Additionally, we filtered out all channels that were not in English given their video metadata (see "Raw Channels").
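As a minimal sketch of how this file could be read with the Python standard library, the snippet below parses the columns listed above into typed values. The helper names (`parse_channel_row`, `read_channels`, `open_channels`) are illustrative and not part of the dataset, the default path assumes the file sits in the working directory, and treating empty cells as missing values is an assumption about the file. The sample reuses the three example rows shown above.

```python
import csv
import gzip
import io
from datetime import date


def _optional(text, cast):
    """Cast a cell, mapping empty cells to None (assumed encoding of missing values)."""
    return cast(text) if text != "" else None


def parse_channel_row(row):
    """Convert one row of strings into typed Python values."""
    return {
        "category_cc": row["category_cc"],
        "join_date": _optional(row["join_date"], date.fromisoformat),
        "channel": row["channel"],
        "name_cc": row["name_cc"],
        "subscribers_cc": _optional(row["subscribers_cc"], int),
        "videos_cc": _optional(row["videos_cc"], int),
        "subscriber_rank_sb": _optional(row["subscriber_rank_sb"], float),
        "weights": _optional(row["weights"], float),
    }


def read_channels(fileobj):
    """Read tab-separated channel rows from an open text file."""
    return [parse_channel_row(r) for r in csv.DictReader(fileobj, delimiter="\t")]


def open_channels(path="df_channels_en.tsv.gz"):
    """Open the gzip-compressed file as text (hypothetical default location)."""
    return gzip.open(path, mode="rt", encoding="utf-8", newline="")


# The three example rows from the table above, as tab-separated text.
SAMPLE = (
    "category_cc\tjoin_date\tchannel\tname_cc\tsubscribers_cc\tvideos_cc\tsubscriber_rank_sb\tweights\n"
    "Gaming\t2015-08-26\tUCi_AR7WqvXa6LEnRn_7ES7A\tThunder Play\t11500\t849\t877395\t11.175\n"
    "Sports\t2016-02-04\tUCgVlxaBsBkmMh2SUgrzG1ZQ\tThunder Prod\t76000\t61\t198340\t5.5295\n"
    "Music\t2013-09-14\tUCNBYpqbD64tkuuFS-NNhkfQ\tThunder Rain\t33700\t58\t382980\t6.5855\n"
)

channels = read_channels(io.StringIO(SAMPLE))
largest = max(channels, key=lambda c: c["subscribers_cc"])
print(largest["name_cc"], largest["subscribers_cc"])
```

On the real file, `with open_channels() as fh: channels = read_channels(fh)` would apply the same parsing.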
The file df_timeseries_en.csv.gz contains data related to time series.
We have a data point for each channel and each week:
- channel: channel id.
- category: category of the channel as assigned by socialblade.com according to the last 10 videos at time of crawl.
- datetime: week related to the data point.
- views: total number of views the channel had this week.
- delta_views: delta views obtained this week.
- subs: total number of subscribers the channel had this week.
- delta_subs: delta subscribers obtained this week.
- videos: total number of videos the channel had this week.
- delta_videos: delta videos obtained this week.
- activity: number of videos published in the last 15 days.
channel | category | datetime | views | delta_views | subs | delta_subs | videos | delta_videos | activity |
---|---|---|---|---|---|---|---|---|---|
UCBJuEqXfXTdcPSbGO9qqn1g | Film and Animation | 2017-07-03 | 202495 | 0 | 650 | 0 | 5 | 0 | 3 |
UCBJuEqXfXTdcPSbGO9qqn1g | Film and Animation | 2017-07-10 | 394086 | 191591 | 1046 | 396 | 6 | 1 | 1 |
UCBJuEqXfXTdcPSbGO9qqn1g | Film and Animation | 2017-07-17 | 835394 | 441308 | 1501 | 456 | 6 | 0 | 1 |
Some facts about it:
- This file contains 18,872,499 data points belonging to 153,550 channels.
- On average, it contains 2.8 years of data for each channel.
- Data goes from early January 2015 to the end of September 2019. Not all channels have the complete time frame.
- Additionally, we filtered out all channels that were not in English given their video metadata (see "Raw Channels").
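The relationship between the cumulative and delta columns can be illustrated with a short sketch. It assumes that delta_views is the week-over-week difference in views, which the example rows are consistent with but which the field descriptions do not state explicitly. The function names are illustrative, and the comma delimiter is a guess based on the .csv extension; pass `delimiter="\t"` if the file turns out to be tab-separated.

```python
import csv
import io
from collections import defaultdict


def read_weekly_views(fileobj, delimiter=","):
    """Map each channel id to a week-sorted list of (week, views, delta_views)."""
    series = defaultdict(list)
    for row in csv.DictReader(fileobj, delimiter=delimiter):
        series[row["channel"]].append(
            (row["datetime"], float(row["views"]), float(row["delta_views"]))
        )
    for points in series.values():
        points.sort()  # ISO-formatted dates sort correctly as strings
    return dict(series)


def recomputed_deltas(points):
    """Difference in total views between consecutive weeks."""
    return [later[1] - earlier[1] for earlier, later in zip(points, points[1:])]


# Example rows from the table above, written as comma-separated text.
SAMPLE = (
    "channel,category,datetime,views,delta_views,subs,delta_subs,videos,delta_videos,activity\n"
    "UCBJuEqXfXTdcPSbGO9qqn1g,Film and Animation,2017-07-03,202495,0,650,0,5,0,3\n"
    "UCBJuEqXfXTdcPSbGO9qqn1g,Film and Animation,2017-07-10,394086,191591,1046,396,6,1,1\n"
    "UCBJuEqXfXTdcPSbGO9qqn1g,Film and Animation,2017-07-17,835394,441308,1501,456,6,0,1\n"
)

series = read_weekly_views(io.StringIO(SAMPLE))
points = series["UCBJuEqXfXTdcPSbGO9qqn1g"]
reported = [p[2] for p in points[1:]]   # delta_views as stored in the file
recomputed = recomputed_deltas(points)  # differences of the views column
print(reported == recomputed)
```

The same check on delta_subs does not hold exactly for these rows (1501 - 1046 is 455, while the table reports 456), so the delta columns should not be assumed to be exact integer differences of the displayed totals.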
The file yt_metadata_en.jsonl.gz contains metadata related to ~73M videos from ~137k channels.
Below we show the data recorded for each video:
{
'categories': 'People & Blogs',
'channel_id': 'UCzzYnZ8GIzfB1Vr3hk2Nj9Q',
'crawl_date': '2019-11-02 09:01:05.328421',
'description': 'See more at http://www.standstrongcompany.com Fitness Keep it healthy at (...)',
'dislike_count': 8,
'display_id': 'x72dBgcVPFI',
'duration': 187,
'like_count': 91,
'tags': 'Tiger Fitness,TigerFitness,fitness,workout,diet,health,pre workout,ab workout,(...)',
'title': 'Slingshot for Squats? | Tiger Fitness',
'upload_date': '2019-04-21 00:00:00',
'view_count': 2559
}
Some facts about it:
- This data was crawled from YouTube between 2019-10-29 and 2019-11-23.
- It contains 72,924,794 videos created between 2005-05-24 and 2019-11-20.
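Because the file holds one record per line, it can be processed as a stream instead of being loaded at once. The sketch below is one way to do that with the standard library. It assumes each line is a standalone JSON object with the fields shown above; `iter_videos` and `views_per_channel` are illustrative names, and the in-memory sample is an abbreviated copy of the example record rather than real file content.

```python
import gzip
import io
import json


def iter_videos(fileobj):
    """Yield one dict per non-empty line of an open JSON Lines text file."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


def views_per_channel(videos):
    """Sum view_count per channel_id, skipping records without a view count."""
    totals = {}
    for video in videos:
        views = video.get("view_count")
        if views is not None:
            totals[video["channel_id"]] = totals.get(video["channel_id"], 0) + views
    return totals


# Abbreviated copy of the example record above, compressed in memory
# so that the snippet runs without the real file.
sample = {
    "categories": "People & Blogs",
    "channel_id": "UCzzYnZ8GIzfB1Vr3hk2Nj9Q",
    "display_id": "x72dBgcVPFI",
    "duration": 187,
    "like_count": 91,
    "dislike_count": 8,
    "upload_date": "2019-04-21 00:00:00",
    "view_count": 2559,
}
payload = gzip.compress((json.dumps(sample) + "\n").encode("utf-8"))

with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as fh:
    totals = views_per_channel(iter_videos(fh))
print(totals)
```

For the real file, replacing `io.BytesIO(payload)` with the path to yt_metadata_en.jsonl.gz keeps memory use bounded by the size of the per-channel dictionary.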
The file youtube_comments.tsv.gz contains metadata related to ~8.6B comments made by ~449M users in 20.5M videos.
Each row corresponds to a comment.
It contains an anonymized user id, a video id, the number of likes the comment received, and the number of replies the comment received.
author | video_id | likes | replies |
---|---|---|---|
1 | Gkb1QMHrGvA | 2 | 0 |
1 | CNtp0xqoods | 0 | 0 |
1 | 249EEzQmVmQ | 1 | 0 |
1 | rJbjhm0weYc | 0 | 0 |
2 | EpHERODJ7gM | 0 | 0 |
2 | 9pQILRT42Cg | 0 | 0 |
2 | PWWRzCyuiFU | 0 | 0 |
3 | o7W2fdWXnPc | 0 | 0 |
Some facts about it:
- We only make available comments for videos with more than 30 comments.
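With billions of rows, this file is best aggregated in a single streaming pass. The following sketch counts comments and sums likes per video. It assumes a tab-separated file with the header shown above and integer-valued likes; `comment_stats_per_video` is an illustrative name, and the sample reuses the example rows from the table.

```python
import csv
import io
from collections import Counter


def comment_stats_per_video(fileobj):
    """Return (comments per video, likes per video) from an open TSV text file."""
    n_comments = Counter()
    n_likes = Counter()
    for row in csv.DictReader(fileobj, delimiter="\t"):
        n_comments[row["video_id"]] += 1
        n_likes[row["video_id"]] += int(row["likes"])
    return n_comments, n_likes


# Example rows from the table above, as tab-separated text.
SAMPLE = (
    "author\tvideo_id\tlikes\treplies\n"
    "1\tGkb1QMHrGvA\t2\t0\n"
    "1\tCNtp0xqoods\t0\t0\n"
    "1\t249EEzQmVmQ\t1\t0\n"
    "1\trJbjhm0weYc\t0\t0\n"
    "2\tEpHERODJ7gM\t0\t0\n"
    "2\t9pQILRT42Cg\t0\t0\n"
    "2\tPWWRzCyuiFU\t0\t0\n"
    "3\to7W2fdWXnPc\t0\t0\n"
)

n_comments, n_likes = comment_stats_per_video(io.StringIO(SAMPLE))
print(sum(n_comments.values()), sum(n_likes.values()))
```

Keying the counters by video keeps them at roughly 20.5M entries, whereas keying by author would require on the order of 449M entries.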
Additionally, we provide raw files.
These have the same names as the files described above, but:
a) have the prefix _raw_ attached to them; and
b) do not have the suffix _en before the file extension.
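Read literally, the naming rule can be written as a small function. This is only an illustration of rules a) and b); `raw_name` is a hypothetical helper, and whether every raw file in the archive follows the pattern exactly (including the same extension) is an assumption that should be checked against the actual file listing.

```python
def raw_name(filename):
    """Apply rules a) and b): add the _raw_ prefix and drop the _en suffix."""
    stem, dot, extension = filename.partition(".")  # split at the first dot
    if stem.endswith("_en"):
        stem = stem[: -len("_en")]
    return "_raw_" + stem + dot + extension


for name in ("df_channels_en.tsv.gz", "df_timeseries_en.csv.gz", "yt_metadata_en.jsonl.gz"):
    print(name, "->", raw_name(name))
```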
The large .json file associated with video metadata can be quite painful to deal with.
With that in mind, we also provide a helper (yt_metadata_helper.feather).
This DataFrame contains the same fields as df_videos_raw.jsonl.gz, except description, tags, and title (the largest fields).
Feather is a language-agnostic, portable file format that can be easily loaded in Python or R.