# Guideline on Using Packages RCurl and XML for Web Scraping
Sitong Qian
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(RCurl)
library(XML)
```
## Introduction
In the world of EDAV, a rich dataset is the cornerstone of success. Obtaining such a dataset is not always easy: sometimes the data are dirty and need a thorough cleaning before use, and sometimes the dataset does not exist yet at all; what we have are pieces of information scattered across a website that still need to be assembled into a data frame. This raises the question of how to construct a dataset from scratch.
Here, I want to show one way to extract the basic information we need from a website, namely web scraping, using the packages **RCurl** and **XML**. To avoid potential confusion, I will walk through a concrete example step by step, introducing the functions as they come up, explaining the ideas behind them in simple terms, and showing exactly how the extraction works.
## Extracting Information to Construct a Dataset -- RCurl and XML
### Step 1: Web Searching
The first step is choosing a topic you are interested in. Since it is job hunting season, let's say I want to look at **Data** related jobs posted on the website **CyberCoders**. I simply copy the search page URL from the browser:
```{r}
link = 'https://www.cybercoders.com/search'
```
Clicking this link, we expect to be directed to a page like this.
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./resources/web_scraping_rcurl_xml/CCP1.png")
```
Now, let's put in the keyword **Data**
```{r}
search = getForm(link,searchterms = 'Data' )
```
Here, we use the **getForm()** function from the **RCurl** package.
According to the **RCurl** package documentation on CRAN, **getForm()** provides facilities for submitting an HTML form using the simple GET mechanism (appending the name-value pairs of parameters to the URL).
Put more plainly, **getForm()** does the work of typing the search word into the box and clicking the search button. The keyword **Data** is my input, but where does **searchterms** come from? Next, I will go through the steps that show why it has to be **searchterms**.
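Under the hood, the GET mechanism just appends the parameters to the URL as a query string. Here is a minimal sketch (my own illustration, not part of the original workflow) of the roughly equivalent request built by hand:
```{r, eval=FALSE}
# getForm(link, searchterms = 'Data') sends a GET request roughly equivalent to
# fetching the URL with the query string appended by hand.
manual_url = paste0(link, "?searchterms=", URLencode("Data", reserved = TRUE))
manual_search = getURL(manual_url)   # getURL() also comes from RCurl
substr(manual_search, 1, 200)        # same page content, give or take dynamic bits
```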
<br>
The following instructions are for **macOS**.
<br>
Using the Chrome browser, on the CyberCoders site, click **View -> Developer -> View Source**.
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./resources/web_scraping_rcurl_xml/CCP2.png")
```
We are then taken to a page showing the HTML source code. From it, we can see that an HTML page is built as a hierarchy. Scanning through the page, we find the element describing the search box on **line 130**: **name = "searchterms"**.
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./resources/web_scraping_rcurl_xml/CCP3.png")
```
This finishes the first step, web searching. Clearly, different websites have their own hierarchy trees and may use their own names in place of **searchterms**, so when using this method we need to look at the specific case and make slight modifications.
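If digging through the page source feels tedious, one hedged alternative (a sketch, not part of the original workflow) is to list the `name` attributes of the page's `<input>` elements programmatically and look for the search field among them:
```{r, eval=FALSE}
# Fetch the search page, parse it, and list the name attribute of every <input>
# element; the search box's field name should appear in this list.
raw_page = getURL(link)
parsed_page = htmlParse(raw_page)
input_nodes = getNodeSet(parsed_page, "//input")
sapply(input_nodes, xmlGetAttr, name = "name", default = NA)
```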
### Step 2: Cleanup(Parsing) Search Results
In the previous step, we ended up submitting the keyword **Data** through **searchterms**. What **search** returns is essentially what we would see if we entered **Data** in the search box directly on the website and then clicked **View Source**.
If we type **search** in the console, we will see the HTML page we are looking for, but in a really messy form. That is because **getForm()** returns the page as one long character string.
**To keep this short, I just took a screenshot of my console; otherwise the output would occupy several pages with no real meaning.**
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./resources/web_scraping_rcurl_xml/CCP4.png")
```
This is not what we want: since HTML is built as a hierarchy, a flat string like this makes it hard to pull out any information. We need to convert it into a more usable form. Here we introduce another powerful function, **htmlParse()**, from the **XML** package.
```{r}
document = htmlParse(search)
```
What **htmlParse()** does is very simple and can be inferred from its name: it parses the character output above into an HTML document tree.
**To keep this short, I just took a screenshot of my console; otherwise the output would occupy several pages with no real meaning.**
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./resources/web_scraping_rcurl_xml/CCP5.png")
```
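Since **search** is already a character string of HTML rather than a file name, we can also say so explicitly when parsing (a small optional sketch; **htmlParse()** usually guesses this correctly on its own):
```{r, eval=FALSE}
# asText = TRUE tells htmlParse to treat its input as markup text
# rather than as a file path or URL.
document = htmlParse(search, asText = TRUE)
```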
With the page in a nice and clean form, we can move on to the next job: **detect the pattern**!
Web scraping is really fun; most of the time pages are built with elegant neatness, though that does not mean scraping will always be nice and easy. For the sake of time, we will assume it is this time.
### Step 3: Detect the Pattern
In the previous section, we got the results into a nice and clean form; now we need to **detect the pattern**. It is fairly easy to see that the variable **document** gives us the first page of the search results. Looking at the website directly, the page shows roughly 21 job positions, and we want to split the page into these 21 jobs so that we can look into each of them.
Here we introduce the function **getNodeSet()**
According to the **XML** documentation on CRAN, **getNodeSet()** finds matching nodes in an internal XML tree. If we check **class(document)**, we see that one of its classes is **"XMLInternalDocument"**, exactly what we want!
```{r}
class(document)
```
As discussed before, HTML is built as a hierarchy. It is not hard to see that each job starts with **div class="job-listing-item"**. Now we know both the right function and the node pattern; combining the two gives the call below. There are exactly 21 items in the resulting list, corresponding to the 21 jobs displayed on the website, and I show the first job for later illustration.
```{r}
list = getNodeSet(document,"//div[@class = 'job-listing-item']")
list[1][[1]]
```
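A quick way to confirm the split worked (a small sketch, not part of the original code) is to count the nodes and peek at the first one's tag name and attributes:
```{r, eval=FALSE}
length(list)         # should match the number of jobs shown on the page
xmlName(list[[1]])   # the tag name, here "div"
xmlAttrs(list[[1]])  # its attributes, including class = "job-listing-item"
```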
### Step 4: Extract Information on Each Job Post
From the previous step, we have a list with 21 jobs; now we want to extract the actual information we need to construct the dataset. Based on the result, we could look at variables like **Job Title**, **Wage**, **Location**, **Skills**, **Posted Date**, **Job Description**, and so on.
So how do we extract the values under these categories? We look back at the first job as a reference. Let's say we want the **Job Title**.
We introduce the function **xpathSApply()**.
According to the **XML** documentation on CRAN, we can think of **xpathSApply()** as two parts, xpath and SApply: the xpath part scans the document and matches the nodes described by our query, and the SApply part applies a function to each match and simplifies the result.
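In other words (a rough sketch of the equivalence, using the objects built above), **xpathSApply(doc, path, fun)** behaves much like running **sapply()** over **getNodeSet()**:
```{r, eval=FALSE}
# Two roughly equivalent ways to pull the same values from the parsed page:
titles_a = xpathSApply(document, "//div[@class = 'job-title']", xmlValue)
titles_b = sapply(getNodeSet(document, "//div[@class = 'job-title']"), xmlValue)
```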
As before, we find that the job title starts with the node **div class='job-title'**, so we combine the function and the node pattern:
```{r}
test = list[1][[1]]
title1 = xpathSApply(test, ".//div[@class = 'job-title']")
title1
```
**title1** has located the information we want, but it is slightly messy since it still carries the XML node structure.
But we have a magic tool, **xmlValue**: by passing **xmlValue** as the function applied to each node, we strip away all the XML markup and keep just the value.
```{r}
title2 = xpathSApply(test, ".//div[@class = 'job-title']",xmlValue)
title2
```
Still, there is some irrelevant whitespace, but we can use **trimws()** to fix that and get a nice, clean result.
```{r}
title3 = trimws(xpathSApply(test, ".//div[@class = 'job-title']",xmlValue))
title3
```
**xpathSApply()** is a really powerful function for extracting the values we need, and because it returns character vectors, we can apply regular expressions to them for further clean-up (a short example follows the next chunk). I personally think it is the most important function in the **XML** package.
Here is what we have for the **1st** job:
```{r}
job_title = trimws(unique(xpathSApply(test, ".//div[@class = 'job-title']",xmlValue)))
salary_jobStatus = trimws(unique(xpathSApply(test, ".//div[@class = 'wage']",xmlValue)))
location = trimws(unique(xpathSApply(test, ".//div[@class = 'location']",xmlValue)))
date = trimws(xpathSApply(test,".//div[@class = 'posted']",xmlValue))
description = trimws(xpathSApply(test,".//div[@class = 'description']",xmlValue))
preferred_skills = trimws(unique(xpathSApply(test, ".//li[@class = 'skill-item']",xmlValue)))
job_title
salary_jobStatus
location
date
description
preferred_skills
```
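As mentioned above, since these are plain character vectors, regular expressions can take the clean-up further. Here is a minimal sketch, assuming the wage string follows a pattern like a dollar range followed by an employment type (the exact format is an assumption and varies by posting):
```{r, eval=FALSE}
# Pull out a salary range of the form "$... - $..." if present,
# and treat whatever remains as the employment type.
range_pattern = "\\$[0-9][^ ]*\\s*-\\s*\\$[0-9][^ ]*"
salary_range = regmatches(salary_jobStatus, regexpr(range_pattern, salary_jobStatus))
employment_type = trimws(gsub(range_pattern, "", salary_jobStatus))
salary_range
employment_type
```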
### Step 5: Combine Job Posts
Now we come to the next step, combining the results for each job. This step does not have much to do with the **XML** and **RCurl** packages themselves, but I include it for the sake of completeness and to build an overall way of thinking about web scraping.
What we did before applies to only one post; now we want to do it for every post on the first page. The code is nothing fancy: we just wrap the code from the previous step in a function and apply it to every node.
```{r}
# Extract the fields of interest from a single job-listing node
job = function(ind) {
  job_title = trimws(unique(xpathSApply(ind, ".//div[@class = 'job-title']", xmlValue)))
  salary_jobStatus = trimws(unique(xpathSApply(ind, ".//div[@class = 'wage']", xmlValue)))
  location = trimws(unique(xpathSApply(ind, ".//div[@class = 'location']", xmlValue)))
  date = trimws(xpathSApply(ind, ".//div[@class = 'posted']", xmlValue))
  description = trimws(xpathSApply(ind, ".//div[@class = 'description']", xmlValue))
  preferred_skills = trimws(unique(xpathSApply(ind, ".//li[@class = 'skill-item']", xmlValue)))
  list(title = job_title, job_status = salary_jobStatus, date = date, location = location,
       job_description = description, preferred_skills = preferred_skills)
}

# Fetch the search results page and split it into individual job-listing nodes
pagecontentjoblist = function(link) {
  searchlinks = getForm(link, searchterms = 'Data')
  documentlinks = htmlParse(searchlinks)
  all_job_lists = getNodeSet(documentlinks, "//div[@class = 'job-listing-item']")
  return(all_job_lists)
}

all_job_lists = pagecontentjoblist(link)
post_all = lapply(all_job_lists, job)
```
**To keep this short, I only show the first three posts; printing all of them would occupy several pages with no real meaning.**
```{r}
post_all[1:3]
```
As shown, the results are mostly clean and ready to use, but they probably need some help from regular expressions before they become an actual dataset. Since this guideline focuses mainly on web scraping, I will stop here, though a minimal sketch of that final assembly step follows.
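For completeness, here is a minimal sketch (my own addition, not part of the original workflow) of how the list of posts might be collapsed into a data frame. It assumes fields with several values can simply be pasted into one string per job:
```{r, eval=FALSE}
# Collapse one post (a list of character vectors) into a one-row data frame.
to_row = function(post) {
  data.frame(
    title            = paste(post$title, collapse = "; "),
    job_status       = paste(post$job_status, collapse = "; "),
    date             = paste(post$date, collapse = "; "),
    location         = paste(post$location, collapse = "; "),
    job_description  = paste(post$job_description, collapse = " "),
    preferred_skills = paste(post$preferred_skills, collapse = ", "),
    stringsAsFactors = FALSE
  )
}
jobs_df = do.call(rbind, lapply(post_all, to_row))
head(jobs_df)
```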
## Conclusion
In this guide, I walked through how I like to do web scraping with the **XML** and **RCurl** packages. What I covered is just a basic approach using some of the most frequently used functions in these two packages; there are many more useful functions, and they are worth reading about if you have time. I hope readers of this tutorial come away with a clearer understanding of the idea of web scraping and how to do it.
## Works Cited
https://cran.r-project.org/web/packages/XML/XML.pdf
<br>
https://cran.r-project.org/web/packages/RCurl/RCurl.pdf