report/sentenceSearch.Rmd

---
title: "R Notebook"
output: html_notebook
---

The Implementation studies (IS) for Learning paths (LP) and Towards Data 
Stewardship (DM) are proposing a a web scrapping solution to search for job ads 
and look for descriptions of knowledge skills and abilities (KSA's). 
The current solution searches and saves job ads from indeed.com as text 
(csv, html) and then, it collects sentences for each job ad, after collecting 
the sentences, the report presented here looks for keywords identified around 
KSA's.


This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. The code is 
displayed within the notebook, the results appear beneath the code. I recommend 
to click on code "Hide all Code", if you prefer to see only results.

The R packages used for the search are `tidyverse` and `here`:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
```

Using the script `webScrapping.R` from the scripts folder we have collated some 
job ads as text files in `results > data > pages` and then
for each job ad, a search for sentences was performed and 
the results were saved in text form in `results > data > sentences`. Once having
a dataser of sentences, this report shows the search for KSA's.

This report is meant to be an example, it was last rendered on `r Sys.Date()`.

From the keywords below the search is non case sensitive, any keywords with `^`
means `starts with`, otherwise the keyword will be searched in any part of the 
sentence. 

```{r include=FALSE}
# Loads custome function
ReadMyDataFiles <- function(pattern = pattern, path = "./") {
    map(list.files(path = path, pattern = pattern, full.names = TRUE), 
        read_csv)
}
```

```{r include=FALSE}
safely_ReadMyDataFiles <- safely(ReadMyDataFiles)

check1 <- safely_ReadMyDataFiles(pattern = "2018", path = here("results", "data", "sentences"))
```

```{r include=FALSE}
allsentences <- 
    check1$result %>%
    map_df(~(.))

dim(allsentences)
```

```{r}
mysentences <-
allsentences %>% 
    distinct(sentence, .keep_all = FALSE) %>% 
    filter(stringi::stri_count(sentence, regex = "\\w+") > 2) %>% # remove sentences if less than 2 words
    mutate_at(vars(sentence), ~str_replace_all(., "\\. \\.", "\\.")) %>% 
    mutate_at(vars(sentence), ~str_squish(.)) %>%  # reduced repeated white spaces in a string 
    arrange(sentence) 

cat("Number of unique sentences analised", nrow(mysentences))
```

```{r include=FALSE}
# clean up
rm(check1)

# save clean sentences
write_csv(mysentences,  here("results", "data", "sentences", "allsentences",
                                paste0(Sys.Date(), "_allsentences.csv")))
save(mysentences, file = here("results", "RData",
                                paste0(Sys.Date(), "_allsentences.RData")))
```

# Analysis of text - KSA

## Knowledge

```{r}
mysentences %>% 
    filter(grepl(sentence, pattern = "^knowledge", ignore.case = TRUE) |
           grepl(sentence, pattern = "experience|degree|expertise", ignore.case = TRUE) ) %>% 
    arrange(sentence)
```
## Skills

```{r}
mysentences %>% 
    filter(grepl(sentence, pattern = "skills|proficiency|familiarity", ignore.case = TRUE) |
           grepl(sentence, pattern = "^highly", ignore.case = TRUE)) %>% 
    arrange(sentence)
```

## Abilities

```{r}
mysentences %>% 
    filter(grepl(sentence, pattern = "^ability", ignore.case = TRUE) |
           grepl(sentence, pattern = "^able to", ignore.case = TRUE) ) %>% 
    arrange(sentence)
```

This report is meant to be an example, it was last rendered on `r Sys.Date()`.