---
title: "Text Mining - moRe than woRds"
author: "Sanjiv Ranjan Das and Karthik Mokashi"
date: "UseR @Stanford -- June 2016"
output: slidy_presentation
---
## Reference monograph
Text expands the universe of data by many-fold. See my monograph on text mining in finance at:
http://srdas.github.io/Das_TextAnalyticsInFinance.pdf
It covers some of the content of this presentation. The files at the link below support the talk itself, and you may run the program code as we proceed.
http://srdas.github.io/Temp/user2016/
## Text as Data
1. Big Text: there is more textual data than numerical data.
2. Text is versatile. It conveys nuances and behavioral expressions that numbers cannot.
3. Text contains emotive content. Sentiment analysis. Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005; Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008; Leinweber-Sisk 2010.
4. Text contains opinions and connections. Das et al 2005; Das and Sisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.
5. Numbers aggregate; text disaggregates.
## Anecdotal ...
1. In a talk at the 17th ACM Conference on Information Knowledge and Management (CIKM '08), Google's director of research Peter Norvig stated his unequivocal preference for data over algorithms---"data is more agile than code." Yet, it is well-understood that too much data can lead to overfitting so that an algorithm becomes mostly useless out-of-sample.
2. Chris Anderson: "Data is the New Theory."
3. These issues are relevant to text mining, but let's put them on hold till the end of the session.
## Definition: Text-Mining
1. Text mining is the large-scale, automated processing of plain text language in digital form to extract data that is converted into useful quantitative or qualitative information.
2. Text mining is automated on big data that is not amenable to human processing within reasonable time frames. It entails extracting data that is converted into information of many types.
3. Simple: text mining may be as simple as key word searches and counts.
4. Complicated: It may require language parsing and complex rules for information extraction.
5. Structured text, such as the information in forms and some kinds of web pages, is relatively straightforward to process.
6. Unstructured text is a much harder endeavor.
7. Text mining is also aimed at unearthing unseen relationships in unstructured text as in meta analyses of research papers, see Van Noorden 2012.
## Definition: News Analytics
Wikipedia defines it as - "... the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way. News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, `bag of words', among other techniques."
https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X/ref=sr_1_1?ie=UTF8&qid=1466897817&sr=8-1&keywords=handbook+of+news+analytics
## Data and Algorithms
<img src = "data_algo.jpg" width=700 height=450>
## Text Extraction
The R programming language is increasingly being used to download text from the web and then analyze it. The ease with which R may be used to scrape text from a web site may be seen from the following simple commands in R:
```{r}
text = readLines("http://srdas.github.io/bio-candid.html")
text[15:20]
```
Here, we downloaded my bio page from my university's web site. It's a simple HTML file.
```{r}
length(text)
```
## String Parsing
Suppose we just want the 17th line; we do:
```{r}
text[17]
```
And, to find out the character length of this line, we use the function:
```{r}
library(stringr)
str_length(text[17])
```
We have first invoked the library **stringr**, which contains many string handling functions. In fact, we may also get the length of each line by applying the function **str_length()** to the entire text vector.
```{r}
text_len = str_length(text)
print(text_len)
print(text_len[55])
text_len[17]
```
## Sort by Length
Some lines are very long and are the ones we are mainly interested in, as they contain the bulk of the story, whereas many of the remaining, shorter lines contain html formatting instructions. Thus, we may sort the lines by decreasing length, bringing the longest ones to the top, with the following set of commands.
```{r}
res = sort(text_len,decreasing=TRUE,index.return=TRUE)
idx = res$ix
text2 = text[idx]
text2
```
## Text cleanup
In short, text extraction can be exceedingly simple, though getting clean text is not as easy an operation. Removing html tags and other unnecessary elements in the file is also a fairly simple operation. We undertake the following steps, which use regular expressions to eliminate html formatting characters.
This will generate one single paragraph of text, relatively clean of formatting characters. Such a text collection is also known as a "bag of words".
```{r}
text = paste(text,collapse="\n")
print(text)
text = str_replace_all(text,"[<>{}()&;,.\n]"," ")
print(text)
```
## XML Package
The **XML** package in R also comes with many functions that aid in cleaning up text and dropping it (mostly unformatted) into a flat file or data frame. This may then be further processed. Here is some example code for this.
## Processing XML files in R into a data frame
The following example has been adapted from r-bloggers.com. It uses the following URL:
http://www.w3schools.com/xml/plant_catalog.xml
```{r}
library(XML)
#Part1: Reading an xml and creating a data frame with it.
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
xmlfile <- xmlTreeParse(xml.url)
xmltop <- xmlRoot(xmlfile)
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]
```
## Creating an XML file from a data frame
```{r}
#Example adapted from https://stat.ethz.ch/pipermail/r-help/2008-September/175364.html
#Load the iris data set and create a data frame
data("iris")
data <- as.data.frame(iris)
xml <- xmlTree()
xml$addTag("document", close=FALSE)
for (i in 1:nrow(data)) {
xml$addTag("row", close=FALSE)
for (j in names(data)) {
xml$addTag(j, data[i, j])
}
xml$closeTag()
}
xml$closeTag()
#view the xml
cat(saveXML(xml))
```
## The Response to News
### Das, Martinez-Jerez, and Tufano (FM 2005)
<img src = "news_posters1.png" width=600 height=350>
### Breakdown of News Flow
<img src = "news_posters2.png" width=600 height=350>
### Frequency of Postings
<img src = "posters_histogram.png" width=600 height=350>
### Weekly Posting
<img src = "weekly_postings.png" width=600 height=350>
### Intraday Posting
<img src = "intraday_postings.png" width=600 height=350>
### Number of Characters per Posting
<img src = "characters_postings.png" width=600 height=350>
## Text Handling
First, let's read in a simple web page (my landing page)
```{r}
text = readLines("http://srdas.github.io/")
print(text[1:4])
print(length(text))
```
## String Detection
String handling is a basic need, so we use the **stringr** package.
```{r}
#EXTRACTING SUBSTRINGS (take some time to look at
#the "stringr" package also)
library(stringr)
substr(text[4],24,29)
#IF YOU WANT TO LOCATE A STRING
res = regexpr("Sanjiv",text[4])
print(res)
print(substr(text[4],res[1],res[1]+nchar("Sanjiv")-1))
#ANOTHER WAY
res = str_locate(text[4],"Sanjiv")
print(res)
print(substr(text[4],res[1],res[2]))
```
## Cleaning Text
Now we look at using regular expressions with the **grep** command to clean out text. I will read in my research page to process this. Here we are undertaking a "ruthless" cleanup.
```{r}
#SIMPLE TEXT HANDLING
text = readLines("http://srdas.github.io/research.htm")
print(length(text))
print(text)
text = text[setdiff(seq(1,length(text)),grep("<",text))]
text = text[setdiff(seq(1,length(text)),grep(">",text))]
text = text[setdiff(seq(1,length(text)),grep("]",text))]
text = text[setdiff(seq(1,length(text)),grep("}",text))]
text = text[setdiff(seq(1,length(text)),grep("_",text))]
text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
print(length(text))
print(text)
text = str_replace_all(text,"[\"]","")
idx = which(nchar(text)==0)
research = text[setdiff(seq(1,length(text)),idx)]
print(research)
```
Take a look at the text now to see how cleaned up it is. But there is a better way, i.e., use the text-mining package **tm**.
## Text Mining with the "tm" Package
1. The R programming language supports a text-mining package, succinctly named **tm**. Using functions such as **readDOC()**, **readPDF()**, etc., for reading DOC and PDF files, the package makes accessing various file formats easy.
2. Text mining involves applying functions to many text documents. A library of text documents (irrespective of format) is called a **corpus**. The essential and highly useful feature of text mining packages is the ability to operate on the entire set of documents at one go.
```{r}
library(tm)
text = c("INTL is expected to announce good earnings report", "AAPL first quarter disappoints","GOOG announces new wallet", "YHOO ascends from old ways")
text_corpus = Corpus(VectorSource(text))
print(text_corpus)
writeCorpus(text_corpus)
```
The **writeCorpus()** function in **tm** creates separate text files on the hard drive, which by default are named **1.txt**, **2.txt**, etc. The simple program code above shows how text scraped off a web page and collapsed into a single character string for each document may then be converted into a corpus of documents using the **Corpus()** function.
It is easy to inspect the corpus as follows:
```{r}
inspect(text_corpus)
```
## A second example
Here we use **lapply** to inspect the contents of the corpus.
```{r}
#USING THE tm PACKAGE
library(tm)
text = c("Doc1;","This is doc2 --", "And, then Doc3.")
ctext = Corpus(VectorSource(text))
ctext
#writeCorpus(ctext)
#THE CORPUS IS A LIST OBJECT in R of type VCorpus or Corpus
inspect(ctext)
print(as.character(ctext[[1]]))
print(lapply(ctext[1:2],as.character))
ctext = tm_map(ctext,tolower) #Lower case all text in all docs
inspect(ctext)
ctext2 = tm_map(ctext,toupper)
inspect(ctext2)
```
## Function *tm_map*
- The **tm_map** function is very useful for cleaning up the documents. We may want to remove some words.
- We may also remove *stopwords*, punctuation, numbers, etc.
```{r}
#FIRST CURATE TO UPPER CASE
dropWords = c("IS","AND","THEN")
ctext2 = tm_map(ctext2,removeWords,dropWords)
inspect(ctext2)
```
```{r}
ctext = Corpus(VectorSource(text))
temp = ctext
print(lapply(temp,as.character))
temp = tm_map(temp,removeWords,stopwords("english"))
print(lapply(temp,as.character))
temp = tm_map(temp,removePunctuation)
print(lapply(temp,as.character))
temp = tm_map(temp,removeNumbers)
print(lapply(temp,as.character))
```
## Bag of Words
We can create a *bag of words* by collapsing all the text into one bundle.
```{r}
#CONVERT CORPUS INTO ARRAY OF STRINGS AND FLATTEN
txt = NULL
for (j in 1:length(temp)) {
txt = c(txt,temp[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
```
## Example (on my bio page)
Now we will do a full pass through of this on my bio.
```{r}
text = readLines("http://srdas.github.io/bio-candid.html")
ctext = Corpus(VectorSource(text))
ctext
print(lapply(ctext, as.character))
ctext = tm_map(ctext,removePunctuation)
print(lapply(ctext, as.character))
txt = NULL
for (j in 1:length(ctext)) {
txt = c(txt,ctext[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
```
## Term Document Matrix (TDM)
An extremely important object in text analysis is the **Term-Document Matrix** (TDM). It allows us to store an entire library of text inside a single matrix, which may then be used for analysis as well as for searching documents. It forms the basis of search engines, topic analysis, and classification (spam filtering).
It is a table that provides the frequency count of every word (term) in each document. The number of rows in the TDM is equal to the number of unique terms, and the number of columns is equal to the number of documents.
```{r}
#TERM-DOCUMENT MATRIX
tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1))
print(tdm)
inspect(tdm[10:20,11:18])
out = findFreqTerms(tdm,lowfreq=5)
print(out)
```
## Term Frequency - Inverse Document Frequency (TF-IDF)
This is a weighting scheme that sharpens the importance of rare words in a document, relative to the frequency of these words in the corpus. It is based on simple calculations, and even though it does not have strong theoretical foundations, it is still very useful in practice. The TF-IDF is the importance of a word $w$ in a document $d$ in a corpus $C$. It is therefore a function of all three, i.e., we write it as TF-IDF$(w,d,C)$, and it is the product of term frequency (TF) and inverse document frequency (IDF).
The frequency of a word in a document is defined as
$$
f(w,d) = \frac{\#w \in d}{|d|}
$$
where $|d|$ is the number of words in the document. We usually normalize word frequency so that
$$
TF(w,d) = \ln[f(w,d)]
$$
This is log normalization. Another form of normalization is known as double normalization and is as follows:
$$
TF(w,d) = \frac{1}{2} + \frac{1}{2} \frac{f(w,d)}{\max_{w \in d} f(w,d)}
$$
Note that normalization is not necessary, but it tends to help shrink the difference between counts of words.
Inverse document frequency is as follows:
$$
IDF(w,C) = \ln\left[ \frac{|C|}{|\{d \in C: w \in d\}|} \right]
$$
That is, we take the (log of the) ratio of the number of documents in the corpus $C$ to the number of documents in the corpus that contain word $w$.
Finally, we have the weighting score for a given word $w$ in document $d$ in corpus $C$:
$$
\mbox{TF-IDF}(w,d,C) = TF(w,d) \times IDF(w,C)
$$
## Example of TF-IDF
We illustrate this with an application to the previously computed term-document matrix.
```{r}
tdm_mat = as.matrix(tdm) #Convert tdm into a matrix
print(dim(tdm_mat))
nw = dim(tdm_mat)[1]
nd = dim(tdm_mat)[2]
doc = 13 #Choose document
word = "derivatives" #Choose word
#COMPUTE TF
f = NULL
for (w in row.names(tdm_mat)) {
f = c(f,tdm_mat[w,doc]/sum(tdm_mat[,doc]))
}
fw = tdm_mat[word,doc]/sum(tdm_mat[,doc])
TF = 0.5 + 0.5*fw/max(f)
print(TF)
#COMPUTE IDF
ndocs_w = length(which(tdm_mat[word,]>0))   #Number of documents containing the word
print(ndocs_w)
IDF = log(nd/ndocs_w)   #Log of (number of documents / documents containing the word), per the formula above
print(IDF)
#COMPUTE TF-IDF
TF_IDF = TF*IDF
print(TF_IDF) #With normalization
print(fw*IDF) #Without normalization
```
We can write this code into a function and work out the TF-IDF for all words. Then these word weights may be used in further text analysis.
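For instance, here is a minimal sketch of such a function (the name **tfidf_doc** is our own; it uses the double-normalized TF and the log form of IDF defined above, and assumes the **tdm** object computed earlier is still in memory):
```{r}
#Sketch: TF-IDF for every word in a chosen document of a TDM
tfidf_doc = function(tdm, doc) {
  tdm_mat = as.matrix(tdm)
  nd = dim(tdm_mat)[2]                    #Number of documents
  f = tdm_mat[,doc]/sum(tdm_mat[,doc])    #Word frequencies in the chosen document
  TF = 0.5 + 0.5*f/max(f)                 #Double-normalized term frequency
  ndocs_w = rowSums(tdm_mat > 0)          #Number of documents containing each word
  IDF = log(nd/ndocs_w)                   #Inverse document frequency
  sort(TF*IDF, decreasing=TRUE)           #TF-IDF scores, largest first
}
print(head(tfidf_doc(tdm, 13), 10))       #Top 10 terms for document 13
```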
## TF-IDF in the **tm** package
We may also directly use the **weightTfIdf** function in the **tm** package. This undertakes the following computation:
- Term frequency ${\it tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. In the case of normalization, the term frequency $\mathit{tf}_{i,j}$ is divided by $\sum_k n_{k,j}$.
- Inverse document frequency for a term $t_i$ is defined as $\mathit{idf}_i = \log_2 \frac{|D|}{|\{d: t_i \in d\}|}$, where $|D|$ denotes the total number of documents and $|\{d: t_i \in d\}|$ is the number of documents in which the term $t_i$ appears.
- Term frequency - inverse document frequency is now defined as $\mathit{tf}_{i,j} \cdot \mathit{idf}_i$.
*Example*:
```{r}
library(tm)
textarray = c("Free software comes with ABSOLUTELY NO certain WARRANTY","You are welcome to redistribute free software under certain conditions","Natural language support for software in an English locale","A collaborative project with many contributors")
textcorpus = Corpus(VectorSource(textarray))
m = TermDocumentMatrix(textcorpus)
print(as.matrix(m))
print(as.matrix(weightTfIdf(m)))
```
## Using the ANLP package for bigrams and trigrams
This package has a few additional functions that make the preceding ideas more streamlined to implement. First let's read in the usual text.
```{r}
library(ANLP)
download.file("http://srdas.github.io/bio-candid.html",destfile = "text")
text = readTextFile("text","UTF-8")
ctext = cleanTextData(text) #Creates a text corpus
```
The last function removes non-English characters, numbers, white space, brackets, and punctuation. It also handles cases like abbreviations and contractions, and converts the entire text to lower case.
We now make TDMs for unigrams, bigrams, and trigrams, and then combine them all into one list for word prediction.
```{r}
g1 = generateTDM(ctext,1)
g2 = generateTDM(ctext,2)
g3 = generateTDM(ctext,3)
gmodel = list(g1,g2,g3)
```
Next, use the **back-off** algorithm to predict the next sequence of words.
```{r}
print(predict_Backoff("you never",gmodel))
print(predict_Backoff("life is",gmodel))
print(predict_Backoff("been known",gmodel))
print(predict_Backoff("needs to",gmodel))
print(predict_Backoff("worked at",gmodel))
print(predict_Backoff("being an",gmodel))
print(predict_Backoff("publish",gmodel))
```
## Wordclouds
Wordclouds are an interesting way to represent text. They give an instant visual summary. The **wordcloud** package in R may be used to create your own wordclouds.
```{r}
#MAKE A WORDCLOUD
library(wordcloud)
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)
#REMOVE STOPWORDS, NUMBERS, STEMMING
ctext1 = tm_map(ctext,removeWords,stopwords("english"))
ctext1 = tm_map(ctext1, removeNumbers)
tdm = TermDocumentMatrix(ctext1,control=list(minWordLength=1))
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)
```
## Stemming
**Stemming** is the procedure by which a word is reduced to its root or stem. This is done so as to treat words from the same stem as the same word, rather than as separate words. We do not want "eaten" and "eating" to be treated as different words, for example.
```{r}
#STEMMING
ctext2 = tm_map(ctext,removeWords,stopwords("english"))
ctext2 = tm_map(ctext2, stemDocument)
print(lapply(ctext2, as.character))
```
## Regular Expressions
Regular expressions are a syntax used to modify strings in an efficient manner. They are complicated but extremely effective. Here we will illustrate with a few examples, but you are encouraged to explore more on your own, as the variations are endless. What you need to do will depend on the application at hand, and with some experience you will become better at using regular expressions. Initially, however, their use can be somewhat confusing.
We start with a simple example of a text array where we wish to replace the string "data" with a blank, i.e., we eliminate this string from the text we have.
```{r}
library(tm)
#Create a text array
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")
print(text)
#Remove all strings with the chosen text for all docs
print(gsub("data","",text))
#Remove all words that contain "data" at the start even if they are longer than data
print(gsub("*data.*","",text))
#Remove all words that contain "data" at the end even if they are longer than data
print(gsub("*.data*","",text))
#Remove all words that contain "data" anywhere, even if they are longer than data
print(gsub("*.data.*","",text))
```
## Complex Regular Expressions using *grep*
We now explore some more complex regular expressions. One common case is searching for special types of strings, like telephone numbers. Suppose we have a text array that may contain telephone numbers in different formats; we can use a single **grep** command to extract these numbers. Here is some code to illustrate this.
```{r}
#Create an array with some strings which may also contain telephone numbers as strings.
x = c("234-5678","234 5678","2345678","1234567890","0123456789","abc 234-5678","234 5678 def","xx 2345678","abc1234567890def")
#Now use grep to find which elements of the array contain telephone numbers
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]",x)
print(idx)
print(x[idx])
#We can shorten this as follows
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}",x)
print(idx)
print(x[idx])
#What if we want to extract only the phone number and drop the rest of the text?
pattern = "[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}"
print(regmatches(x, gregexpr(pattern,x)))
#Or use the stringr package, which is a lot better
library(stringr)
str_extract(x,pattern)
```
## Using *grep* for emails
Now we use grep to extract emails by looking for the "@" sign in the text string. We would proceed as in the following example.
```{r}
x = c("sanjiv das","srdas@scu.edu","SCU","data@science.edu")
print(grep("\\@",x))
print(x[grep("\\@",x)])
```
You get the idea. Using the functions **gsub**, **grep**, **regmatches**, and **gregexpr**, you can manage most of the fancy string handling that is needed.
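As a small follow-on sketch, the same **regmatches** and **gregexpr** idea used above for phone numbers can pull out the email addresses themselves (the pattern below is a deliberately simple one, not a full email validator):
```{r}
x = c("sanjiv das","srdas@scu.edu","SCU","data@science.edu")
pattern = "[[:alnum:]._-]+@[[:alnum:].-]+"     #A simple (not exhaustive) email pattern
print(regmatches(x, gregexpr(pattern, x)))     #Matches for each element of x
print(str_extract(x, pattern))                 #Same idea with the stringr package
```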
## Extracting Text from the Web using APIs
We now look at getting text from the web using various APIs from services like Twitter, Facebook, etc. You will need to open free developer accounts to do this on each site. You will also need the specific R package for each source.
## Twitter
The Twitter API needs a lot of handshaking...
```{r, eval=FALSE}
##TWITTER EXTRACTOR
library(twitteR)
library(ROAuth)
library(RCurl)
download.file(url="https://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
#certificate file based on Privacy Enhanced Mail (PEM) protocol: https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail
cKey = "h4J3x0i5kgD58E1t5JCEnw" #These are my keys and won't work for you
cSecret = "fi4SOHENNySeQKWe95SuBIRx74Xjv0Cx4EZx59QKwg" #use your own secret
reqURL = "https://api.twitter.com/oauth/request_token"
accURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"
#NOW SUBMIT YOUR CODES AND ASK FOR CREDENTIALS
cred = OAuthFactory$new(consumerKey=cKey, consumerSecret=cSecret,requestURL=reqURL, accessURL=accURL,authURL=authURL)
cred$handshake(cainfo="cacert.pem") #Asks for token
#Test and save credentials
#registerTwitterOAuth(cred)
#save(list="cred",file="twitteR_credentials")
#FIRST PHASE DONE
```
## Accessing Twitter
```{r, eval=FALSE}
##USE httr, SECOND PHASE
library(httr)
#options(httr_oauth_cache=T)
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"
setup_twitter_oauth(cKey,cSecret,accToken,accTokenSecret) #At prompt type 1
```
This completes the handshaking with Twitter. Now we can access tweets using the functions in the **twitteR** package.
## Using the *twitteR* package
```{r, eval=FALSE}
#EXAMPLE 1
s = searchTwitter("#GOOG") #This is a list
s
#CONVERT TWITTER LIST TO TEXT ARRAY (see documentation in twitteR package)
twts = twListToDF(s) #This gives a dataframe with the tweets
names(twts)
twts_array = twts$text
print(twts$retweetCount)
twts_array
#EXAMPLE 2
s = getUser("srdas")
fr = s$getFriends()
print(length(fr))
print(fr[1:10])
s_tweets = userTimeline("srdas",n=20)
print(s_tweets)
getCurRateLimitInfo(c("srdas"))
```
## Getting Streaming Data from Twitter
This assumes you have a working twitter account and have already connected R to it using the **twitteR** package.
- Retrieving tweets for a particular search query
- Example 1 adapted from http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-streaming-api/
- Additional reference: https://cran.r-project.org/web/packages/streamR/streamR.pdf
```{r,eval=FALSE}
library(streamR)
filterStream(file.name = "tweets.json", # Save tweets in a json file
track = "useR_Stanford" , # Collect tweets mentioning useR_Stanford. Can use twitter handles or keywords.
language = "en",
timeout = 30, # Keep connection alive for 30 seconds
oauth = cred) # Use OAuth credentials
tweets.df <- parseTweets("tweets.json", simplify = FALSE) # Parse the json file and save to a data frame called tweets.df. simplify = FALSE ensures that we include lat/lon information in the data frame.
```
## Retrieving tweets of a particular user over a 30 second time period
```{r,eval=FALSE}
filterStream(file.name = "tweets.json", # Save tweets in a json file
track = "3497513953" , # Collect tweets from the useR2016 feed. Must use the twitter ID of the user.
language = "en",
timeout = 30, # Keep connection alive for 30 seconds
oauth = cred) # Use the OAuth credentials
tweets.df <- parseTweets("tweets.json", simplify = FALSE)
```
## Streaming messages from the accounts your user follows
```{r,eval=FALSE}
userStream( file.name="my_timeline.json", with="followings",tweets=10, oauth=cred )
```
## Facebook
Now we move on to using Facebook, which is a little less trouble than Twitter. Also the results may be used for creating interesting networks.
```{r, eval=FALSE}
##FACEBOOK EXTRACTOR
library(Rfacebook)
library(SnowballC)
library(Rook)
library(ROAuth)
app_id = "847737771920076" # USE YOUR OWN IDs
app_secret = "a120a2ec908d9e00fcd3c619cad7d043"
fb_oauth = fbOAuth(app_id,app_secret,extended_permissions=TRUE)
#save(fb_oauth,file="fb_oauth")
#DIRECT LOAD
load("fb_oauth")
```
## Examples
```{r, eval=FALSE}
##EXAMPLES
bbn = getUsers("bloombergnews",token=fb_oauth)
print(bbn)
page = getPage(page="bloombergnews",token=fb_oauth,n=20)
print(dim(page))
print(head(page))
print(names(page))
print(page$message)
print(page$message[11])
```
## Yelp - Setting up an authorization
First, we examine the protocol for connecting to the Yelp API. This assumes you have opened a free developer account with Yelp.
```{r, eval=FALSE}
###CODE to connect to YELP.
consumerKey = "z6w-Or6HSyKbdUTmV9lbOA"
consumerSecret = "ImUufP3yU9FmNWWx54NUbNEBcj8"
token = "mBzEBjhYIGgJZnmtTHLVdQ-0cyfFVRGu"
token_secret = "v0FGCL0TS_dFDWFwH3HptDZhiLE"
```
## Yelp - handshaking with the API
```{r, eval=FALSE}
require(httr)
require(httpuv)
require(jsonlite)
# authorization
myapp = oauth_app("YELP", key=consumerKey, secret=consumerSecret)
sig=sign_oauth1.0(myapp, token=token,token_secret=token_secret)
```
```{r, eval=FALSE}
## Searching the top ten bars in Chicago and SF.
limit <- 10
# 10 bars in Chicago
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&location=Chicago%20IL&term=bar")
# or 10 bars by geo-coordinates
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&ll=37.788022,-122.399797&term=bar")
locationdata=GET(yelpurl, sig)
locationdataContent = content(locationdata)
locationdataList=jsonlite::fromJSON(toJSON(locationdataContent))
head(data.frame(locationdataList))
for (j in 1:limit) {
print(locationdataContent$businesses[[j]]$snippet_text)
}
```
## Cosine Similarity in the Text Domain
In this segment we will learn some popular functions on text that are used in practice. One of the first things we often want to do is find similar texts or sentences that are alike (think of web search as one application). Since documents are vectors in the TDM, we may want to find the closest vectors or compute the distance between vectors.
$$ cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||} $$
where $||A|| = \sqrt{A \cdot A}$ is the norm of $A$, i.e., the square root of the dot product of $A$ with itself. This gives the cosine of the angle between the two vectors; it is zero for orthogonal vectors and 1 for identical vectors.
```{r}
#COSINE DISTANCE OR SIMILARITY
A = as.matrix(c(0,3,4,1,7,0,1))
B = as.matrix(c(0,4,3,0,6,1,1))
cos = t(A) %*% B / (sqrt(t(A)%*%A) * sqrt(t(B)%*%B))
print(cos)
library(lsa)
#THE COSINE FUNCTION IN LSA ONLY TAKES ARRAYS
A = c(0,3,4,1,7,0,1)
B = c(0,4,3,0,6,1,1)
print(cosine(A,B))
```
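Since documents are columns of the TDM, the same function applies to whole documents. Here is a minimal sketch that rebuilds a small TDM from the bio page (so it is self-contained) and compares the two longest documents; the choice of documents is arbitrary:
```{r}
#COSINE SIMILARITY BETWEEN TWO DOCUMENTS OF A TDM
text_bio = readLines("http://srdas.github.io/bio-candid.html")
ctext_bio = Corpus(VectorSource(text_bio))
tdm_bio = as.matrix(TermDocumentMatrix(ctext_bio, control=list(minWordLength=1)))
doclen = colSums(tdm_bio)                      #Total word count per document (column)
idx = order(doclen, decreasing=TRUE)[1:2]      #Pick the two longest documents
print(cosine(tdm_bio[,idx[1]], tdm_bio[,idx[2]]))
```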
## Dictionaries - I
1. Webster's defines a "dictionary" as "...a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses."
2. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
3. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.
4. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as "byte" or "hyperlink".
5. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.
6. Medical dictionary, see http://www.hyperdictionary.com/medical.
## Dictionaries - II
1. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as "2BZ4UQT" which stands for "too busy for you cutey" (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.
2. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as
http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.
3. Value dictionaries deal with values and may be useful when affect (positive or negative) alone is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well-being.
## Lexicons
1. A **lexicon** is defined by Webster's as "a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language." This suggests it is not that different from a dictionary.
2. A "morpheme" is defined as "a word or a part of a word that has a meaning and that contains no smaller part that has a meaning."
3. In the text analytics realm, we will take a lexicon to be a smaller, special purpose dictionary, containing words that are relevant to the domain of interest.
4. The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.
5. Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.
## Constructing a lexicon
1. By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.
2. Examine the term-document matrix for the most frequent words, and pick the ones that have high connotation for the classification task at hand (a minimal sketch of this approach follows this list).
3. Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better at discriminating between groups.
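Here is a minimal sketch of the second approach. It assumes the term-document matrix **tdm** computed earlier for the bio page is still in memory; the frequency cutoff of 5 is an arbitrary choice:
```{r}
#Candidate lexicon terms: high-frequency words from the TDM
tdm_mat = as.matrix(tdm)
wordcount = sort(rowSums(tdm_mat), decreasing=TRUE)   #Total count of each term across documents
candidates = names(wordcount[wordcount >= 5])         #Keep terms appearing at least 5 times
print(candidates)
#A human reader would now retain only the terms with clear connotation for the task at hand
```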
## Lexicons as Word Lists
1. Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards. This lexicon also introduced the notion of "negation tagging" into the literature.
2. Loughran and McDonald (2011):
- Taking a sample of 50,115 firm-year 10-Ks from 1994 to 2008, they found that almost three-fourths of the words identified as negative by the Harvard Inquirer dictionary are not typically negative words in a financial context.
- Therefore, they specifically created separate lists of words by the following attributes of words: negative, positive, uncertainty, litigious, strong modal, and weak modal. Modal words are based on Jordan's categories of strong and weak modal words. These word lists may be downloaded from http://www3.nd.edu/~mcdonald/Word_Lists.html.
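As a sketch of how such a word list might be used once downloaded (the file name **LM_negative.csv** below is hypothetical and we assume one word per line, so this chunk is not evaluated here):
```{r, eval=FALSE}
#Sketch: count negative-word matches in a piece of text using a downloaded word list
lm_neg = tolower(readLines("LM_negative.csv"))         #Hypothetical file, one word per line
sample_text = "The firm disclosed a material weakness and restated earnings"
tokens = tolower(unlist(strsplit(sample_text," ")))
print(sum(!is.na(match(tokens, lm_neg))))              #Number of tokens found in the negative list
```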
## Scoring Text
- Text can be scored using dictionaries and word lists. Below is an example of mood scoring, using a psychological dictionary from Harvard (the General Inquirer). Another useful resource is WordNet.
- WordNet is a large database of words in English, i.e., a lexicon. The repository is at http://wordnet.princeton.edu. WordNet groups words together based on their meanings (synonyms) and hence may be used as a thesaurus. WordNet is also useful for natural language processing as it provides word lists by language category, such as noun, verb, adjective, etc.
## Mood Scoring using Harvard Inquirer
<img src = "hgi.png" width=700 height=550>
## Creating Positive and Negative Word Lists
```{r}
#MOOD SCORING USING HARVARD INQUIRER
#Read in the Harvard Inquirer Dictionary
#And create a list of positive and negative words
HIDict = readLines("inqdict.txt")
dict_pos = HIDict[grep("Pos",HIDict)]
poswords = NULL
for (s in dict_pos) {
s = strsplit(s,"#")[[1]][1]
poswords = c(poswords,strsplit(s," ")[[1]][1])
}
dict_neg = HIDict[grep("Neg",HIDict)]
negwords = NULL
for (s in dict_neg) {
s = strsplit(s,"#")[[1]][1]
negwords = c(negwords,strsplit(s," ")[[1]][1])
}
poswords = tolower(poswords)
negwords = tolower(negwords)
print(sample(poswords,25))
print(sample(negwords,25))
poswords = unique(poswords)
negwords = unique(negwords)
print(length(poswords))
print(length(negwords))
```
The preceding code created two arrays, one of positive words and another of negative words.
## One Function to Rule All Text
In order to score text, we need to clean it first and put it into an array to compare with the word list of positive and negative words. I wrote a general purpose function that grabs text and cleans it up for further use.
```{r}
library(tm)
library(stringr)
#READ IN TEXT FOR ANALYSIS, PUT IT IN A CORPUS, OR ARRAY, OR FLAT STRING
#cstem=1, if stemming needed
#cstop=1, if stopwords to be removed
#ccase=1 for lower case, ccase=2 for upper case
#cpunc=1, if punctuation to be removed
#cflat=1 for flat text wanted, cflat=2 if text array, else returns corpus
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
text = readLines(url)
text = text[setdiff(seq(1,length(text)),grep("<",text))]
text = text[setdiff(seq(1,length(text)),grep(">",text))]
text = text[setdiff(seq(1,length(text)),grep("]",text))]
text = text[setdiff(seq(1,length(text)),grep("}",text))]
text = text[setdiff(seq(1,length(text)),grep("_",text))]
text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
ctext = Corpus(VectorSource(text))
if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english"))}
if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
if (ccase==1) { ctext = tm_map(ctext, tolower) }
if (ccase==2) { ctext = tm_map(ctext, toupper) }
text = ctext
#CONVERT FROM CORPUS IF NEEDED
if (cflat>0) {
text = NULL
for (j in 1:length(ctext)) {
temp = ctext[[j]]$content
if (temp!="") { text = c(text,temp) }
}
text = as.array(text)
}
if (cflat==1) {
text = paste(text,collapse="\n")
text = str_replace_all(text, "[\r\n]" , " ")
}
result = text
}
```
## Example
Now apply this function and see how we can get some clean text.
```{r}
url = "http://srdas.github.io/research.htm"
res = read_web_page(url,0,0,0,1,1)
print(res)
```
## Mood Scoring Text
Now we will take a different page of text and mood score it.
```{r}
#EXAMPLE OF MOOD SCORING
library(stringr)
url = "http://srdas.github.io/bio-candid.html"
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=1,cflat=1)
print(text)
text = str_replace_all(text,"nbsp"," ")
text
text = unlist(strsplit(text," "))
print(text)
posmatch = match(text,poswords)
numposmatch = length(posmatch[which(posmatch>0)])
negmatch = match(text,negwords)
numnegmatch = length(negmatch[which(negmatch>0)])
print(c(numposmatch,numnegmatch))
#FURTHER EXPLORATION OF THESE OBJECTS
print(length(text))
print(posmatch)
print(text[77])
print(poswords[204])
is.na(posmatch)
```
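A natural next step is to collapse these counts into a single mood score. Here is a minimal sketch using a simple normalized difference; this particular formula is our choice, not the only possibility:
```{r}
#Net mood score: (positives - negatives) scaled by total matches
mood_score = (numposmatch - numnegmatch)/(numposmatch + numnegmatch)
print(mood_score)
```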
## Language Detection
We may be scraping web sites from many countries and need to detect the language and then translate it into English for mood scoring. The useful package **textcat** enables us to categorize the language.
```{r}
library(textcat)
text = c("Je suis un programmeur novice.",
"I am a programmer who is a novice.",
"Sono un programmatore alle prime armi.",
"Ich bin ein Anfänger Programmierer",
"Soy un programador con errores.")
lang = textcat(text)
print(lang)