Fix/click #261

Merged
merged 2 commits into from
Dec 25, 2023
62 changes: 30 additions & 32 deletions README.md
Expand Up @@ -42,6 +42,7 @@ Start the configuration generation. The configuration file is written to the def
```bash
goskyr -g https://www.imdb.com/chart/top/ -f
```

Note that different colors are used to show how 'close' certain fields are to each other in the HTML tree. This should help when there are multiple list-like structures on a web page and you need to figure out which fields belong together.

Next, start the scraping process. The configuration file is read from the default location `config.yml`.
Expand Down Expand Up @@ -99,12 +100,12 @@ scrapers:
url: "https://www.goodreads.com/quotes/tag/life"
item: ".quote"
fields:
- name: "quote"
location:
selector: ".quoteText"
- name: "author"
location:
selector: ".authorOrTitle"
- name: "quote"
location:
selector: ".quoteText"
- name: "author"
location:
selector: ".authorOrTitle"
```

Save this to a file, e.g. `quotes-config.yml`, and run `goskyr -c quotes-config.yml` (or `go run main.go -c quotes-config.yml`) to retrieve the scraped quotes as a JSON string. The result should look something like this:
Expand Down Expand Up @@ -244,7 +245,7 @@ The result may look like this.
...]
```

*Subkey: `regex_extract`*
_Subkey: `regex_extract`_

It might be a bit more complex to extract the desired information. Take for instance the concert scraper configuration for "Kaufleuten", shown above, more specifically the config snippet for the `title` field.

Expand All @@ -260,7 +261,7 @@ fields:

This field is implicitly of type `text`. Other types, such as `url` or `date`, have to be configured with the keyword `type`. The `location` tells the scraper where to look for the field value and how to extract it. In this case the selector on its own is not enough to extract the desired value, as we would get something like `Bastian Baker • Konzert`. That's why there is an extra option to define a regular expression that extracts a substring. Note that in this example the extracted string still contains a trailing space, which the scraper removes automatically. Let's look at a few more examples to better understand the location configuration.
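The extraction step described above can be sketched in Go as follows. This is only an illustration, not goskyr's actual implementation: the helper name `extractWithRegex` and the pattern `[^•]*` are assumptions made for this example (the real pattern lives in the elided config snippet).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractWithRegex is a hypothetical helper: it returns the first match of
// pattern in raw and trims surrounding whitespace, mirroring the behaviour
// the README describes (regex extraction followed by trimming).
func extractWithRegex(raw, pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(re.FindString(raw)), nil
}

func main() {
	// Assumed pattern: everything before the bullet character.
	title, err := extractWithRegex("Bastian Baker • Konzert", `[^•]*`)
	if err != nil {
		fmt.Println("invalid pattern:", err)
		return
	}
	fmt.Printf("%q\n", title) // prints "Bastian Baker"
}
```

The raw match is `Bastian Baker ` with a trailing space; the trim step removes it.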

*Subkey: `node_index`*
_Subkey: `node_index`_

Let's say we want to extract "Tonhalle-Orchester Zürich" from the following html snippet.

Expand All @@ -286,7 +287,7 @@ location:

Note that the same result can be achieved with the `:nth-child()` selector, so `node_index` might be removed in the future, see issue [#119](https://github.com/jakopako/goskyr/issues/119).

*Subkey: `child_index`*
_Subkey: `child_index`_

Next, let's say we want to extract the time "20h00" from the following html snippet.

Expand Down Expand Up @@ -323,11 +324,11 @@ location:

Here, the selector is not enough to extract the desired string and we can't go further down the tree by using different selectors. With the `child_index` we can point to the exact string we want. A `child_index` of 0 would point to the first `<strong>` node, a `child_index` of 1 would point to the string containing "19h00", a `child_index` of 2 would point to the second `<strong>` node and finally a `child_index` of 3 points to the correct string. If `child_index` is set to -1 the first child that results in a regex match will be used. This can be useful if the `child_index` varies across different items. In the current example however, the `child_index` is always the same but the string still contains more stuff than we need which is why we use a regular expression to extract the desired substring.

*Subkey: `entire_subtree`*
_Subkey: `entire_subtree`_

This subkey, if set to `true`, causes goskyr to grab all text elements under the element defined in the location's selector. It is useful when the target location contains inline tags, e.g. `This is some text with a <strong>strong</strong> part.`

*Subkey: `all_nodes`*
_Subkey: `all_nodes`_

This subkey, if set to `true` joins together all strings having the given selector. The subkey `separator` will be used as separator string. If not defined the separator is an empty string. Example:

Expand Down Expand Up @@ -358,8 +359,6 @@ Resulting json:
...]
```



To get an even better feeling for the location configuration check out the numerous examples in the `concerts-config.yml` file.

**Key: `can_be_empty`**
Expand Down Expand Up @@ -415,12 +414,12 @@ A dynamic field has a field type that can either be `text`, `url` or `date`. The
```

As can be seen, a component has to define which part of the date it covers (at least one part has to be covered). Next, the location of this component has to be defined. This is done the same way as for the location of a text field. Finally, we need to define a list of possible layouts, where each layout is defined the 'go-way' because this scraper is written in Go. For more details check out [this](https://yourbasic.org/golang/format-parse-string-time-date-example/) link or have a look at the numerous examples in the `concerts-config.yml` file. Note that a layout string is always in English, although the date string on the scraped website might be in a different language. Also note that in most cases the layout list contains only one element. Only in rare cases, where different events on the same site use different layouts, is it necessary to define more than one layout.
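For readers unfamiliar with Go's reference-time layouts, here is a minimal sketch of trying a list of layouts in order. The helper `parseWithLayouts` and the two sample layouts are assumptions for this example; it only uses the standard `time` package and does not reflect how goskyr combines date components internally.

```go
package main

import (
	"fmt"
	"time"
)

// parseWithLayouts is a hypothetical helper that returns the result of the
// first layout that parses value successfully.
func parseWithLayouts(value string, layouts []string) (time.Time, error) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, value); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("no layout matched %q", value)
}

func main() {
	// Go layouts describe the reference time Mon Jan 2 15:04:05 MST 2006.
	layouts := []string{"2 January 2006", "02.01.2006"}
	for _, v := range []string{"25 December 2023", "25.12.2023"} {
		t, err := parseWithLayouts(v, layouts)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(t.Format("2006-01-02"))
	}
}
```

Both inputs resolve to the same date, each through a different layout, which is the situation where a second layout entry is needed.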

The `date_language` key needs to correspond to the language on the website. Currently, the default is `de_DE`. Note that this doesn't matter for dates that only contain numbers. `date_location` sets the time zone of the respective date.

### JS Rendering

Since version 0.3.0 js rendering is supported. For this to work the `google-chrome` binary needs to be installed. In the configuration snippet of a scraper just add `renderJs: true` and everything will be taken care of. With `page_load_wait_sec: <seconds>` the default waiting time of 2 seconds can be adapted accordingly.
Since version 0.3.0 js rendering is supported. For this to work the `google-chrome` binary needs to be installed. In the configuration snippet of a scraper just add `render_js: true` and everything will be taken care of. With `page_load_wait_sec: <seconds>` the default waiting time of 2 seconds can be adapted accordingly.

User interactions with the page (e.g. scrolling) might be implemented in the future. Clicking has been implemented. TODO: document.

Expand Down Expand Up @@ -457,7 +456,7 @@ interaction:
delay: 2000 # milliseconds that the scraper waits after each click. Default is 1000
```

Note that these clicks are executed before the data is scraped. Also the interaction configuration will be ignored if `renderJs` is not set to `true` because only in that case is the website actually run within a headless browser.
Note that these clicks are executed before the data is scraped. Also the interaction configuration will be ignored if `render_js` is not set to `true` because only in that case is the website actually run within a headless browser.

### Pagination

Expand All @@ -469,7 +468,7 @@ paginator:
selector: ".pagination .selector"
```

In case `renderJs` is set to `false` by default the value of the `href` key is taken as url for the next page. However, you can change this and other parameters in the paginator configuration.
In case `render_js` is set to `false` by default the value of the `href` key is taken as url for the next page. However, you can change this and other parameters in the paginator configuration.

```yml
paginator:
Expand All @@ -480,7 +479,7 @@ paginator:
max_pages: <number>
```

If `renderJs` is set to `true` the scraper will simulate a mouse click on the given selector to loop over the pages.
If `render_js` is set to `true` the scraper will simulate a mouse click on the given selector to loop over the pages.
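The link-following case (no JS rendering) can be pictured with the following Go sketch. Everything here is an assumption for illustration: the `crawl` helper, the in-memory `next` map standing in for the `href` found on each page, and the reading of `max_pages` as an upper bound on visited pages (with values of 0 or less meaning "no limit" in this sketch only).

```go
package main

import "fmt"

// crawl is a hypothetical helper. next maps a page URL to the URL found at the
// paginator selector on that page; a missing entry means there is no next page.
func crawl(start string, next map[string]string, maxPages int) []string {
	var visited []string
	url := start
	for url != "" && (maxPages <= 0 || len(visited) < maxPages) {
		visited = append(visited, url)
		url = next[url]
	}
	return visited
}

func main() {
	// Made-up site with three pages linked in sequence.
	next := map[string]string{
		"https://example.com/page/1": "https://example.com/page/2",
		"https://example.com/page/2": "https://example.com/page/3",
	}
	fmt.Println(len(crawl("https://example.com/page/1", next, 2))) // 2
	fmt.Println(len(crawl("https://example.com/page/1", next, 0))) // 3
}
```

A real crawler would also need to guard against link cycles; the sample data above has none.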

### Output

Expand All @@ -492,7 +491,6 @@ writer:
filepath: test-file.json
```


## Build ML Model for Improved Auto-Config

In order for the auto configuration feature to find suitable names for the extracted fields, machine learning can be used since `v0.4.0`. Goskyr allows you to extract a fixed set of features based on an existing goskyr configuration. Basically, goskyr scrapes all the websites you configured, extracts the raw text values based on the configured fields per site and then calculates the features for each extracted value, labeling the resulting vector with the field name you defined in the configuration. Currently, all features are based on the extracted text only, i.e. not on the location within the website. Check out the `Features` struct in the `ml/ml.go` file if you want to know what exactly those features are. Extraction command:
Expand Down Expand Up @@ -530,28 +528,28 @@ To build and release a new version of goskyr [Goreleaser](https://goreleaser.com

1. Make a "dry-run" release to see if it works using the release command:

```bash
make release-dry-run
```
```bash
make release-dry-run
```

1. Make sure you have a file called `.release-env` containing the github token.

```bash
GITHUB_TOKEN=YOUR_GH_TOKEN
```
```bash
GITHUB_TOKEN=YOUR_GH_TOKEN
```

1. Create a tag and push it to GitHub

```bash
git tag -a v0.1.5 -m "new features"
git push origin v0.1.5
```
```bash
git tag -a v0.1.5 -m "new features"
git push origin v0.1.5
```

1. Run GoReleaser at the root of this repository:

```bash
make release
```
```bash
make release
```

## Contributing

2 changes: 1 addition & 1 deletion concerts-config.yml
Expand Up @@ -513,7 +513,7 @@ scrapers:
- name: RoteFabrik
url: "https://rotefabrik.ch/de/programm.html#/calendar/konzert/"
item: ".rf-calendar-container > article"
renderJs: true
render_js: true
fields:
- name: title
location:
17 changes: 13 additions & 4 deletions fetch/fetcher.go
Expand Up @@ -4,6 +4,7 @@ import (
"context"
"fmt"
"io"
"log"
"net/http"
"time"

Expand Down Expand Up @@ -67,6 +68,10 @@ func NewDynamicFetcher(ua string, ms int) *DynamicFetcher {
chromedp.DefaultExecAllocatorOptions[:],
chromedp.WindowSize(1920, 1080), // init with a desktop view (sometimes pages look different on mobile, eg buttons are missing)
)
if ua != "" {
opts = append(opts,
chromedp.UserAgent(ua))
}
allocContext, cancelAlloc := chromedp.NewExecAllocator(context.Background(), opts...)
d := &DynamicFetcher{
UserAgent: ua,
Expand All @@ -85,10 +90,14 @@ func (d *DynamicFetcher) Cancel() {
}

func (d *DynamicFetcher) Fetch(url string, opts FetchOpts) (string, error) {
// start := time.Now()
start := time.Now()
ctx, cancel := chromedp.NewContext(d.allocContext)
// ctx, cancel := chromedp.NewContext(d.allocContext,
// chromedp.WithLogf(log.Printf),
// chromedp.WithDebugf(log.Printf),
// chromedp.WithErrorf(log.Printf),
// )
defer cancel()
// TODO: add user agent
var body string
sleepTime := time.Duration(d.WaitMilliseconds) * time.Millisecond
actions := []chromedp.Action{
Expand Down Expand Up @@ -134,7 +143,7 @@ func (d *DynamicFetcher) Fetch(url string, opts FetchOpts) (string, error) {
err := chromedp.Run(ctx,
actions...,
)
// elapsed := time.Since(start)
// log.Printf("scraping %s took %s", url, elapsed)
elapsed := time.Since(start)
log.Printf("fetching %s took %s", url, elapsed)
return body, err
}
30 changes: 16 additions & 14 deletions go.mod
Expand Up @@ -4,51 +4,53 @@ go 1.19

require (
github.com/PuerkitoBio/goquery v1.8.1
github.com/chromedp/cdproto v0.0.0-20231011050154-1d073bb38998
github.com/chromedp/cdproto v0.0.0-20231205062650-00455a960d61
github.com/chromedp/chromedp v0.9.3
github.com/gdamore/tcell/v2 v2.7.0
github.com/goodsign/monday v1.0.1
github.com/ilyakaznacheev/cleanenv v1.5.0
github.com/rivo/tview v0.0.0-20220916081518-2e69b7385a37
github.com/rivo/tview v0.0.0-20231206124440-5f078138442e
golang.org/x/net v0.19.0
gopkg.in/yaml.v3 v3.0.1
)

require (
github.com/antchfx/jsonquery v1.3.3
github.com/sjwhitworth/golearn v0.0.0-20221228163002-74ae077eafb2
golang.org/x/exp v0.0.0-20230304125523-9ff063c70017
golang.org/x/exp v0.0.0-20231219180239-dc181d75b848
)

require (
github.com/antchfx/xpath v1.2.3 // indirect
github.com/antchfx/xpath v1.2.5 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/protobuf v1.4.0 // indirect
github.com/gonum/blas v0.0.0-20181208220705-f22b278b28ac // indirect
github.com/gonum/lapack v0.0.0-20181123203213-e4cdc5a0bff9 // indirect
github.com/gonum/matrix v0.0.0-20181209220409-c518dec07be9 // indirect
github.com/google/go-cmp v0.5.8 // indirect
github.com/guptarohit/asciigraph v0.5.1 // indirect
github.com/olekukonko/tablewriter v0.0.4 // indirect
github.com/rocketlaunchr/dataframe-go v0.0.0-20201007021539-67b046771f0b // indirect
golang.org/x/sync v0.1.0 // indirect
gonum.org/v1/gonum v0.12.0 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/guptarohit/asciigraph v0.5.6 // indirect
github.com/olekukonko/tablewriter v0.0.5 // indirect
github.com/rocketlaunchr/dataframe-go v0.0.0-20211025052708-a1030444159b // indirect
golang.org/x/sync v0.5.0 // indirect
gonum.org/v1/gonum v0.14.0 // indirect
google.golang.org/protobuf v1.21.0 // indirect
)

require (
github.com/BurntSushi/toml v1.2.1 // indirect
github.com/BurntSushi/toml v1.3.2 // indirect
github.com/agnivade/levenshtein v1.1.1
github.com/andybalholm/cascadia v1.3.1 // indirect
github.com/andybalholm/cascadia v1.3.2 // indirect
github.com/chromedp/sysutil v1.0.0 // indirect
github.com/gdamore/encoding v1.0.0 // indirect
github.com/gobwas/httphead v0.1.0 // indirect
github.com/gobwas/pool v0.2.1 // indirect
github.com/gobwas/ws v1.3.0 // indirect
github.com/gobwas/ws v1.3.1 // indirect
github.com/joho/godotenv v1.5.1 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mattn/go-runewidth v0.0.15 // indirect
github.com/rivo/uniseg v0.4.3 // indirect
github.com/rivo/uniseg v0.4.4 // indirect
golang.org/x/sys v0.15.0 // indirect
golang.org/x/term v0.15.0 // indirect
golang.org/x/text v0.14.0 // indirect