In this file, I use Python to do basic text mining on Chinese text. The corpus is a sample of Apple Daily news from Taiwan, 2013 to 2018, which I scraped with BeautifulSoup and saved as pandas DataFrames. (Apple Daily sample: 50 days of news per year, covering the sections below.)
- Important News
- Entertainment News
- International News
- Financial News
- Political News
For this analysis, I used a web crawler to collect Apple Daily news from 2013 to 2018, randomly sampling 50 days per year (covering the Important, Entertainment, International, Financial, and Political sections), and performed text analysis to get an overview of Apple Daily's vocabulary and content.
- Web Scraping
- Chinese word preprocessing with Jieba (segmentation and stopword removal)
- Chinese character and word frequency (computed with tf-idf)
- Shannon Entropy and Simpson Index of Chinese text
- Hierarchical clustering (tf-idf + scipy.cluster.hierarchy)
- Topic modeling with LDA (visualized with LDAvis)
In this part, I will show you how I scraped Apple Daily news with Python. Link to Apple Daily: https://tw.appledaily.com/daily
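The scraping step can be sketched as below. This is a minimal, hypothetical version: the CSS selector and page structure are assumptions for illustration, since the real Apple Daily markup differs and changes over time.

```python
import requests  # assumed dependency for fetching pages
from bs4 import BeautifulSoup

def extract_headlines(html):
    """Pull headline text and links out of one daily index page.
    The 'a.headline' selector is illustrative only; the actual
    Apple Daily markup uses different tags and classes."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.select("a.headline") if a.get("href")]

def fetch_day(url):
    """Download one day's index page and extract its headlines."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return extract_headlines(resp.text)

# Offline demonstration with a tiny HTML fixture:
sample = '<div><a class="headline" href="/n/1">頭條新聞</a></div>'
print(extract_headlines(sample))  # [('頭條新聞', '/n/1')]
```

In the real crawler, fetch_day would be called once per sampled date, and the results appended to a pandas DataFrame.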
In this part, I simply use Jieba to segment the text, adding some words that are missing from the Jieba dictionary. In addition, I remove URLs, English, numbers, and stopwords here. (I downloaded a stopwords.txt file online.)
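The cleaning step can be sketched as follows. The stopword entries here are illustrative placeholders for the downloaded stopwords.txt, and the input token list stands in for the output of jieba.lcut:

```python
import re

# A few illustrative stopwords; the real list comes from stopwords.txt.
stopwords = {"的", "了", "是", "在"}

URL_RE = re.compile(r"https?://\S+")
NOISE_RE = re.compile(r"[A-Za-z0-9]+")  # English letters and digits

def clean_tokens(tokens):
    """Remove URLs, English, numbers, and stopwords from a token list
    (tokens as produced by jieba.lcut)."""
    kept = []
    for tok in tokens:
        tok = URL_RE.sub("", tok)
        tok = NOISE_RE.sub("", tok)
        tok = tok.strip()
        if tok and tok not in stopwords:
            kept.append(tok)
    return kept

# In the real pipeline the tokens come from Jieba, e.g.:
#   jieba.add_word("蘋果日報")   # add a word missing from the dictionary
#   tokens = jieba.lcut(article_text)
tokens = ["蘋果日報", "的", "新聞", "2018", "http://example.com", "很", "精彩"]
print(clean_tokens(tokens))  # ['蘋果日報', '新聞', '很', '精彩']
```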
In this part, I count single characters and use CountVectorizer to count word frequencies. Besides, I save all results as CSV files.
In order to compare the lexical variety of the news across years, I compute the Shannon entropy of the texts. However, entropy is affected by the number of words in each text, so I also compute the Simpson index, which removes the impact of text length.
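The two measures can be computed directly from token counts. This sketch uses the plain Simpson index D = Σp² (note that some variants report 1 − D or 1/D instead):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """H = -sum(p * log2(p)) over token relative frequencies."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def simpson_index(tokens):
    """D = sum(p^2); less sensitive to text length than entropy."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum((c / n) ** 2 for c in counts.values())

tokens = ["新聞", "新聞", "政治", "財經"]
print(round(shannon_entropy(tokens), 3))  # 1.5
print(round(simpson_index(tokens), 3))    # 0.375
```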
With the word frequencies, we can do hierarchical clustering. In this part, you can set max_features in CountVectorizer to choose the words used for clustering.
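A minimal sketch of the clustering step with scipy.cluster.hierarchy; the 4×2 matrix below is a made-up stand-in for the real document-term matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in for a document-term matrix (rows = documents); in the real
# pipeline this comes from CountVectorizer / tf-idf, with max_features
# limiting the vocabulary.
X = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]])

Z = linkage(X, method="ward")                    # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters

print(labels)  # documents 0/1 share a cluster, as do 2/3
```

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to visualize the merge order.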
After the basic EDA, let's move on to topic modeling. I count the proportion of each cluster, and I use LDAvis to visualize the LDA result.
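The LDA step can be sketched with scikit-learn; the four toy documents are made up, and the pyLDAvis call is shown only as a comment since its module path varies by version:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy pre-segmented documents (two finance-like, two entertainment-like).
docs = ["股市 上漲 財經", "股市 下跌 財經", "電影 明星 娛樂", "明星 緋聞 娛樂"]
vec = CountVectorizer(token_pattern=r"(?u)\S+")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # each row is a per-document topic mixture

# Proportion of documents whose dominant topic is each cluster:
assignments = doc_topic.argmax(axis=1)
proportions = np.bincount(assignments, minlength=2) / len(docs)

# Visualization (assumes pyLDAvis is installed; the sklearn adapter
# module name differs across pyLDAvis versions):
#   import pyLDAvis.sklearn
#   pyLDAvis.sklearn.prepare(lda, X, vec)
```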
I use the CorEx package to improve my topic modeling with some domain knowledge. (You can learn more about CorEx from this paper: https://www.aclweb.org/anthology/Q17-1037.pdf)