Commit

add code to save html files before any filtering
orchid00 committed Nov 30, 2018
1 parent a6c10cc commit eed7b61
Showing 4 changed files with 15 additions and 4 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -58,6 +58,7 @@ The project structure looks like this:
└── results
├── data
│ ├── pages
+ │ ├── rawhtml
│ └── sentences
│ └── allsentences
└── RData
7 changes: 4 additions & 3 deletions docs/index.html

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions scripts/00_installPkgs.R
@@ -42,6 +42,7 @@ if (!dir.exists("results")) {
dir.create(here("results", "data"))
dir.create(here("results", "data", "pages"))
dir.create(here("results", "data", "sentences"))
+ dir.create(here("results", "data", "rawhtml"))
}
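
Because the new `dir.create()` call appears to sit inside the `if (!dir.exists("results"))` guard, a checkout that already has a `results` directory would not get `rawhtml` created on a rerun. A minimal sketch of an idempotent alternative follows. The helper name `ensure_results_dirs` and the `root` parameter are hypothetical, and base `file.path()` stands in for `here()` so the example can run in a temporary directory.

```r
# Hypothetical helper (not part of the repository): create the results
# sub-directories under `root`, whether or not `results` already exists.
ensure_results_dirs <- function(root = ".") {
  subdirs <- c("pages", "sentences", "rawhtml")
  paths <- file.path(root, "results", "data", subdirs)
  for (p in paths) {
    # recursive = TRUE creates missing parent directories;
    # showWarnings = FALSE keeps repeated runs quiet when a directory exists.
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Usage: run twice against a throwaway root to show that reruns are harmless.
root <- tempfile("project_")
created <- ensure_results_dirs(root)
created_again <- ensure_results_dirs(root)
```

Checking each directory separately trades a few extra filesystem calls for a setup script that can be rerun after the layout changes.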


10 changes: 9 additions & 1 deletion scripts/01_custome_functions.R
@@ -20,8 +20,11 @@ getPageContent <- function(url, css_class, num){
#url <- "https://www.indeed.com/rc/clk?jk=788e7e311656fc54&fccid=09fad757f3449fa5&vjs=3"
#css_class <- ".jobsearch-JobComponent-description"

+ htmlpage <-
+ read_html(url)
+
page_content <-
- read_html(url) %>%
+ htmlpage %>%
html_nodes(css = css_class) %>%
html_text() %>% # clean text
str_replace_all(regex("\n"), "\\. ") %>%
@@ -36,6 +39,11 @@ getPageContent <- function(url, css_class, num){
page_content %>%
tibble(doc = as.numeric(num), text = .) # convert character to a tibble

+ write_html(htmlpage,
+ path = here("results", "data","rawhtml",
+ paste0(Sys.Date(), "_", num, ".html"))
+ )
+
save(page_content,
file = here("results", "RData",
paste0(Sys.Date(), "_", num, ".RData"))
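
The pattern this commit introduces (parse the page once, archive the unfiltered HTML, then extract text from the same parsed object) can be sketched in isolation. This is an illustration under stated assumptions, not the repository's `getPageContent()`: the helper `save_and_extract`, the `out_dir` parameter, and the inline `demo_html` string are hypothetical, and an HTML literal replaces the live URL so the example runs offline. The destination is passed as `file`, which is the argument name in xml2's documentation for `write_html()`.

```r
library(xml2)
library(rvest)

# Hypothetical helper: archive the raw page, then pull text for one CSS class.
save_and_extract <- function(html_source, css_class, num, out_dir) {
  # Parse once so the saved file and the extracted text share one source.
  page <- read_html(html_source)

  # Save the unfiltered document before any text cleaning happens.
  raw_path <- file.path(out_dir, paste0(Sys.Date(), "_", num, ".html"))
  write_html(page, file = raw_path)

  # Extract the text of the nodes that match the selector.
  text <- html_text(html_nodes(page, css = css_class))

  list(raw_path = raw_path, text = text)
}

# Usage with an inline document instead of a network request.
demo_html <- '<html><body><div class="job">Data analyst</div><p>footer</p></body></html>'
out_dir <- tempfile("rawhtml_")
dir.create(out_dir)
res <- save_and_extract(demo_html, ".job", 1, out_dir)
```

Keeping the parsed document in a variable avoids a second request for the same URL and means the archived file is the exact input the later filtering steps saw.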
