-
Notifications
You must be signed in to change notification settings - Fork 1
/
sentenceSearch.Rmd
110 lines (87 loc) · 3.46 KB
/
sentenceSearch.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
---
title: "R Notebook"
output: html_notebook
---
The Implementation studies (IS) for Learning paths (LP) and Towards Data
Stewardship (DM) are proposing a a web scrapping solution to search for job ads
and look for descriptions of knowledge skills and abilities (KSA's).
The current solution searches and saves job ads from indeed.com as text
(csv, html) and then, it collects sentences for each job ad, after collecting
the sentences, the report presented here looks for keywords identified around
KSA's.
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. The code is
displayed within the notebook, the results appear beneath the code. I recommend
to click on code "Hide all Code", if you prefer to see only results.
The R packages used for the search are `tidyverse` and `here`:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
```
Using the script `webScrapping.R` from the scripts folder we have collated some
job ads as text files in `results > data > pages` and then
for each job ad, a search for sentences was performed and
the results were saved in text form in `results > data > sentences`. Once having
a dataser of sentences, this report shows the search for KSA's.
This report is meant to be an example, it was last rendered on `r Sys.Date()`.
From the keywords below the search is non case sensitive, any keywords with `^`
means `starts with`, otherwise the keyword will be searched in any part of the
sentence.
```{r include=FALSE}
# Loads custome function
ReadMyDataFiles <- function(pattern = pattern, path = "./") {
map(list.files(path = path, pattern = pattern, full.names = TRUE),
read_csv)
}
```
```{r include=FALSE}
safely_ReadMyDataFiles <- safely(ReadMyDataFiles)
check1 <- safely_ReadMyDataFiles(pattern = "2018", path = here("results", "data", "sentences"))
```
```{r include=FALSE}
allsentences <-
check1$result %>%
map_df(~(.))
dim(allsentences)
```
```{r}
mysentences <-
allsentences %>%
distinct(sentence, .keep_all = FALSE) %>%
filter(stringi::stri_count(sentence, regex = "\\w+") > 2) %>% # remove sentences if less than 2 words
mutate_at(vars(sentence), ~str_replace_all(., "\\. \\.", "\\.")) %>%
mutate_at(vars(sentence), ~str_squish(.)) %>% # reduced repeated white spaces in a string
arrange(sentence)
cat("Number of unique sentences analised", nrow(mysentences))
```
```{r include=FALSE}
# clean up
rm(check1)
# save clean sentences
write_csv(mysentences, here("results", "data", "sentences", "allsentences",
paste0(Sys.Date(), "_allsentences.csv")))
save(mysentences, file = here("results", "RData",
paste0(Sys.Date(), "_allsentences.RData")))
```
# Analysis of text - KSA
## Knowledge
```{r}
mysentences %>%
filter(grepl(sentence, pattern = "^knowledge", ignore.case = TRUE) |
grepl(sentence, pattern = "experience|degree|expertise", ignore.case = TRUE) ) %>%
arrange(sentence)
```
## Skills
```{r}
mysentences %>%
filter(grepl(sentence, pattern = "skills|proficiency|familiarity", ignore.case = TRUE) |
grepl(sentence, pattern = "^highly", ignore.case = TRUE)) %>%
arrange(sentence)
```
## Abilities
```{r}
mysentences %>%
filter(grepl(sentence, pattern = "^ability", ignore.case = TRUE) |
grepl(sentence, pattern = "^able to", ignore.case = TRUE) ) %>%
arrange(sentence)
```
This report is meant to be an example, it was last rendered on `r Sys.Date()`.