chore(chrome): fix semaphore limiting scrape
j-mendez committed Mar 18, 2024
1 parent 63d6f37 commit 5322150
Showing 6 changed files with 30 additions and 27 deletions.
6 changes: 3 additions & 3 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
-version = "1.85.3"
+version = "1.85.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
22 changes: 11 additions & 11 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
-spider = "1.85.3"
+spider = "1.85.4"
```

And then the code:
@@ -93,7 +93,7 @@ We have the following optional feature flags.

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["regex", "ua_generator"] }
+spider = { version = "1.85.4", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -135,7 +135,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["decentralized"] }
+spider = { version = "1.85.4", features = ["decentralized"] }
```

```sh
@@ -166,7 +166,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["sync"] }
+spider = { version = "1.85.4", features = ["sync"] }
```

```rust,no_run
@@ -196,7 +196,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["regex"] }
+spider = { version = "1.85.4", features = ["regex"] }
```

```rust,no_run
@@ -223,7 +223,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["control"] }
+spider = { version = "1.85.4", features = ["control"] }
```

```rust
@@ -293,7 +293,7 @@ Use cron jobs to run crawls continuously at any time.

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["sync", "cron"] }
+spider = { version = "1.85.4", features = ["sync", "cron"] }
```

```rust,no_run
@@ -332,7 +332,7 @@ the feature flag [`chrome_intercept`] to possibly speed up requests using Network Interception

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["chrome", "chrome_intercept"] }
+spider = { version = "1.85.4", features = ["chrome", "chrome_intercept"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed to debug.
@@ -362,7 +362,7 @@ Enabling HTTP cache can be done with the feature flag [`cache`] or [`cache_mem`]

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["cache"] }
+spider = { version = "1.85.4", features = ["cache"] }
```

You need to set `website.cache` to true to enable as well.
@@ -393,7 +393,7 @@ Intelligently run crawls using HTTP and JavaScript Rendering when needed. The be

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["smart"] }
+spider = { version = "1.85.4", features = ["smart"] }
```

```rust,no_run
@@ -419,7 +419,7 @@ Set a depth limit to prevent forwarding.

```toml
[dependencies]
-spider = { version = "1.85.3", features = ["budget"] }
+spider = { version = "1.85.4", features = ["budget"] }
```

```rust,no_run
19 changes: 11 additions & 8 deletions spider/src/website.rs
@@ -2054,7 +2054,6 @@ impl Website {
let shared = shared.clone();

set.spawn(async move {
-drop(permit);
let page_resource = crate::utils::fetch_page_html_raw(
link.as_ref(),
&shared.0,
@@ -2076,6 +2075,7 @@
} else {
page.links(&shared.1).await
};
+drop(permit);

(link, page, page_links)
});
@@ -2915,7 +2915,6 @@ impl Website {
let shared = shared.clone();

set.spawn(async move {
-drop(permit);
let page_resource =
crate::utils::fetch_page_html(link.as_ref(), &shared.0)
.await;
@@ -2935,6 +2934,7 @@
} else {
page.links(&shared.1).await
};
+drop(permit);

(link, page, page_links)
});
@@ -3083,10 +3083,9 @@ impl Website {
let shared = shared.clone();

set.spawn(async move {
-drop(permit);
let target_url = link.as_ref();

-match shared.4.new_page(target_url).await {
+let r = match shared.4.new_page(target_url).await {
Ok(new_page) => {
match shared.5.evaluate_on_new_document {
Some(ref script) => {
@@ -3174,7 +3173,9 @@

(link, page, Default::default())
}
-}
+};
+drop(permit);
+r
});

match q.as_mut() {
@@ -3303,10 +3304,9 @@ impl Website {
self.setup_chrome_interception(&new_page).await;

set.spawn(async move {
-drop(permit);
let target_url = link.as_ref();

-match shared.5.new_page(target_url).await {
+let r = match shared.5.new_page(target_url).await {
Ok(new_page) => {
let new_page =
configure_browser(new_page, &shared.6)
@@ -3385,7 +3385,10 @@

(link, page, Default::default())
}
-}
+};
+drop(permit);
+
+r
});
}
_ => (),
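The `website.rs` hunks above all make the same correction: `drop(permit)` moves from the top of each spawned task to after the fetch and link extraction finish. Dropping the semaphore permit immediately on spawn released the slot before any work began, so the semaphore never actually bounded concurrent fetches. Below is a minimal sketch of the pattern using blocking std threads and a hand-rolled counting semaphore (an illustrative stand-in, not the tokio `Semaphore` spider itself uses; `run` and its parameters are hypothetical helpers):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore (std has none built in).
struct Semaphore {
    count: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { count: Mutex::new(permits), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut c = self.count.lock().unwrap();
        // Loop to tolerate spurious wakeups.
        while *c == 0 {
            c = self.cv.wait(c).unwrap();
        }
        *c -= 1;
    }
    fn release(&self) {
        *self.count.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

// Spawn `tasks` workers but allow at most `limit` in flight at once;
// returns the peak number of simultaneously active workers observed.
fn run(tasks: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let active = Arc::new(Mutex::new(0usize));
    let peak = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, active, peak) = (sem.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire(); // hold the permit for the WHOLE unit of work
                {
                    let mut a = active.lock().unwrap();
                    *a += 1;
                    let mut p = peak.lock().unwrap();
                    *p = (*p).max(*a);
                }
                thread::sleep(Duration::from_millis(20)); // simulated page fetch
                *active.lock().unwrap() -= 1;
                sem.release(); // release only after the work finishes
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let p = *peak.lock().unwrap();
    p
}

fn main() {
    // Dropping the permit at the top of the task (the old bug) would let
    // all 8 workers run at once; holding it caps concurrency at the limit.
    let peak = run(8, 2);
    assert!(peak <= 2);
    println!("peak concurrency: {}", peak);
}
```

The commit keeps the same shape in async form: the chromium arms bind the fetch `match` to `r`, drop the permit after it completes, then return `r`, so the permit's lifetime covers the page load.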
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
-version = "1.85.3"
+version = "1.85.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
@@ -29,7 +29,7 @@ quote = "1.0.18"
failure_derive = "0.1.8"

[dependencies.spider]
-version = "1.85.3"
+version = "1.85.4"
path = "../spider"

[[bin]]
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
-version = "1.85.3"
+version = "1.85.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
@@ -25,7 +25,7 @@ lazy_static = "1.4.0"
env_logger = "0.11.3"

[dependencies.spider]
-version = "1.85.3"
+version = "1.85.4"
path = "../spider"
features = ["serde", "flexbuffers"]

