Improve output of config proposal #120

Merged · 1 commit · Sep 6, 2022
73 changes: 1 addition & 72 deletions README.md
@@ -7,7 +7,7 @@ This project's goal is to make it easier to scrape structured data from web pages
different venue websites. However, the code has been rewritten to handle a more general use case of extracting a list of items from any website.
This could be a list of books from an online book store, a list of plays in a public theater, a list of newspaper articles, etc. Currently, information can only be extracted from static websites.

Note that there are already similar projects that might do a better job in certain cases or are more generic tools. However, on the one hand this is a personal project to familiarize myself with web scraping and Go, and on the other hand goskyr supports certain features that I haven't found in any other project. For instance, the way dates can be extracted from websites and the notion of scraping information from subpages defined by URLs that were previously extracted at runtime.
Note that there are already similar projects that might do a better job in certain cases or are more generic tools. However, on the one hand this is a personal project to familiarize myself with web scraping and Go, and on the other hand goskyr supports certain features that I haven't found in any other project. For instance, the way dates can be extracted from websites and the notion of scraping information from subpages defined by URLs that were previously extracted at runtime. Be sure to check out the section on [auto configuration](#auto-configuration-experimental).

Similar projects:

@@ -50,77 +50,6 @@ goskyr -generate "https://www.goodreads.com/quotes/tag/life"
which will automatically find repeating fields, ask you to choose a subset of those fields, and then return the resulting config
snippet, which might look something like this:

```yaml
writer:
  type: ""
  uri: ""
  user: ""
  password: ""
  filepath: ""
scrapers:
  - name: ""
    url: https://www.goodreads.com/quotes/tag/life
    item: body > div.content > div.mainContentContainer > div.mainContent > div.mainContentFloat > div.leftContainer > div.quote.mediumText > div.quoteDetails
    exclude_with_selector: []
    fields:
      static: []
      dynamic:
        - name: field-0
          type: text
          location:
            selector: div.quoteText > span.authorOrTitle
            node_index: 0
            child_index: 0
            regex_extract:
              exp: ""
              index: 0
            attr: ""
            max_length: 0
            entire_subtree: false
          on_subpage: ""
          can_be_empty: false
          components: []
          date_location: ""
          date_language: ""
          hide: false
        - name: field-1
          type: text
          location:
            selector: div.quoteText
            node_index: 0
            child_index: 2
            regex_extract:
              exp: ""
              index: 0
            attr: ""
            max_length: 0
            entire_subtree: false
          on_subpage: ""
          can_be_empty: false
          components: []
          date_location: ""
          date_language: ""
          hide: false
    filters: []
    paginator:
      location:
        selector: ""
        node_index: 0
        child_index: 0
        regex_extract:
          exp: ""
          index: 0
        attr: ""
        max_length: 0
        entire_subtree: false
      max_pages: 0
global:
  user-agent: ""
```

Note that currently all fields are displayed, even though most of them contain default values that you normally wouldn't have to configure. As a
consequence, they can be removed manually for better readability, which in the above example results in:

```yaml
scrapers:
  - name: ""
    # …
```
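The diff view collapses the rest of that shortened snippet. Reconstructing it from the full proposal above by dropping every default or zero value, the cleaned-up config would plausibly read in full (a sketch, not the literal collapsed content):

```yaml
scrapers:
  - name: ""
    url: https://www.goodreads.com/quotes/tag/life
    item: body > div.content > div.mainContentContainer > div.mainContent > div.mainContentFloat > div.leftContainer > div.quote.mediumText > div.quoteDetails
    fields:
      dynamic:
        - name: field-0
          type: text
          location:
            selector: div.quoteText > span.authorOrTitle
        - name: field-1
          type: text
          location:
            selector: div.quoteText
            child_index: 2
```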
2 changes: 0 additions & 2 deletions config-gen.yml
@@ -2,9 +2,7 @@ scrapers:
- name: ""
url: https://www.goodreads.com/quotes/tag/life
item: body > div.content > div.mainContentContainer > div.mainContent > div.mainContentFloat > div.leftContainer > div.quote.mediumText > div.quoteDetails
exclude_with_selector: []
fields:
static: []
dynamic:
- name: field-0
type: text
6 changes: 0 additions & 6 deletions generate/config.go
@@ -180,12 +180,6 @@ parse:
				Selector:   p,
				ChildIndex: nrChildren[p],
			}
			// this check has to be updated
			// we cannot check the exact path because a repeating node might have both
			// repeating classes and non-repeating classes. So instead of checking whether
			// we have seen the exact location, we need to check whether there is a location
			// where for each node in the path there is at least one overlapping class
			// (if at least one of the two nodes has a class)
			locMan = update(locMan, l, strings.TrimSpace(text))
		}
		nrChildren[p] += 1
50 changes: 25 additions & 25 deletions scraper/scraper.go
@@ -31,9 +31,9 @@ type GlobalConfig struct {
// Values will be taken from a config yml file or environment variables
// or both.
type Config struct {
	Writer   output.WriterConfig `yaml:"writer"`
	Scrapers []Scraper           `yaml:"scrapers"`
	Global   GlobalConfig        `yaml:"global"`
	Writer   output.WriterConfig `yaml:"writer,omitempty"`
	Scrapers []Scraper           `yaml:"scrapers,omitempty"`
	Global   GlobalConfig        `yaml:"global,omitempty"`
}

func NewConfig(configPath string) (*Config, error) {
@@ -65,13 +65,13 @@ type RegexConfig struct {

// ElementLocation is used to find a specific string in an HTML document
type ElementLocation struct {
	Selector      string      `yaml:"selector"`
	NodeIndex     int         `yaml:"node_index"`
	ChildIndex    int         `yaml:"child_index"`
	RegexExtract  RegexConfig `yaml:"regex_extract"`
	Attr          string      `yaml:"attr"`
	MaxLength     int         `yaml:"max_length"`
	EntireSubtree bool        `yaml:"entire_subtree"`
	Selector      string      `yaml:"selector,omitempty"`
	NodeIndex     int         `yaml:"node_index,omitempty"`
	ChildIndex    int         `yaml:"child_index,omitempty"`
	RegexExtract  RegexConfig `yaml:"regex_extract,omitempty"`
	Attr          string      `yaml:"attr,omitempty"`
	MaxLength     int         `yaml:"max_length,omitempty"`
	EntireSubtree bool        `yaml:"entire_subtree,omitempty"`
}
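These `,omitempty` flags are what shrink the generated proposal: when the config is marshalled, any field still at its zero value is simply skipped. A minimal, self-contained sketch of that behavior (assuming gopkg.in/yaml.v3; goskyr's actual YAML dependency may differ, but yaml.v2 honors the flag the same way):

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3" // assumed library; omitempty semantics match yaml.v2
)

// ElementLocation mirrors the struct above, with the PR's omitempty tags.
type ElementLocation struct {
	Selector   string `yaml:"selector,omitempty"`
	NodeIndex  int    `yaml:"node_index,omitempty"`
	ChildIndex int    `yaml:"child_index,omitempty"`
}

func main() {
	// Only Selector and ChildIndex are set; NodeIndex stays at its zero value.
	loc := ElementLocation{Selector: "div.quoteText", ChildIndex: 2}
	out, err := yaml.Marshal(&loc)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
	// Output:
	// selector: div.quoteText
	// child_index: 2
	// (node_index is zero and is therefore omitted)
}
```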

// CoveredDateParts is used to determine what parts of a date a
@@ -103,16 +103,16 @@ type StaticField struct {
// for each item
type DynamicField struct {
	Name string `yaml:"name"`
	Type string `yaml:"type"` // can currently be text, url or date
	Type string `yaml:"type,omitempty"` // can currently be text, url or date
	// If a field can be found on a subpage the following variable has to contain a field name of
	// a field of type 'url' that is located on the main page.
	ElementLocation ElementLocation `yaml:"location"`
	OnSubpage       string          `yaml:"on_subpage"`    // applies to text, url, date
	CanBeEmpty      bool            `yaml:"can_be_empty"`  // applies to text, url
	Components      []DateComponent `yaml:"components"`    // applies to date
	DateLocation    string          `yaml:"date_location"` // applies to date
	DateLanguage    string          `yaml:"date_language"` // applies to date
	Hide            bool            `yaml:"hide"`          // applies to text, url, date
	ElementLocation ElementLocation `yaml:"location,omitempty"`
	OnSubpage       string          `yaml:"on_subpage,omitempty"`    // applies to text, url, date
	CanBeEmpty      bool            `yaml:"can_be_empty,omitempty"`  // applies to text, url
	Components      []DateComponent `yaml:"components,omitempty"`    // applies to date
	DateLocation    string          `yaml:"date_location,omitempty"` // applies to date
	DateLanguage    string          `yaml:"date_language,omitempty"` // applies to date
	Hide            bool            `yaml:"hide,omitempty"`          // applies to text, url, date
}

// A Filter is used to filter certain items from the result list
@@ -128,16 +128,16 @@ type Scraper struct {
	Name                string   `yaml:"name"`
	URL                 string   `yaml:"url"`
	Item                string   `yaml:"item"`
	ExcludeWithSelector []string `yaml:"exclude_with_selector"`
	ExcludeWithSelector []string `yaml:"exclude_with_selector,omitempty"`
	Fields              struct {
		Static  []StaticField  `yaml:"static"`
		Dynamic []DynamicField `yaml:"dynamic"`
		Static  []StaticField  `yaml:"static,omitempty"`
		Dynamic []DynamicField `yaml:"dynamic,omitempty"`
	} `yaml:"fields"`
	Filters   []Filter `yaml:"filters"`
	Filters   []Filter `yaml:"filters,omitempty"`
	Paginator struct {
		Location ElementLocation `yaml:"location"`
		MaxPages int             `yaml:"max_pages"`
	}
		Location ElementLocation `yaml:"location,omitempty"`
		MaxPages int             `yaml:"max_pages,omitempty"`
	} `yaml:"paginator,omitempty"`
}
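The tag added to the closing brace of the anonymous Paginator struct matters as well: under `omitempty`, both yaml.v2 and yaml.v3 drop a struct whose exported fields are all zero-valued, so a scraper that doesn't paginate no longer emits an empty paginator block. A sketch of the effect on the generated output (assumed, based on that documented omitempty behavior):

```yaml
# Before this change, every generated scraper carried a block like:
paginator:
  location:
    selector: ""
    node_index: 0
    child_index: 0
  max_pages: 0
# After it, a zero-valued Paginator produces no output at all.
```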

// GetItems fetches and returns all items from a website according to the