chore(chrome): fix fingerprint and initial eval scripting
j-mendez committed Aug 1, 2024
1 parent cc03e65 commit ec5634b
Showing 8 changed files with 98 additions and 74 deletions.
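The same fix is applied once in `spider/src/page.rs` and three times in `spider/src/website.rs`: instead of registering the user's `evaluate_on_new_document` script and the fingerprint-spoofing script (`FP_JS`) in two separate calls, the commit merges them into a single injected script, and injects `FP_JS` alone only when no user script is configured. A minimal sketch of that selection logic (`initial_script` is a hypothetical helper, not part of the crate's API, and the `FP_JS` string here is a stand-in for the real bundled JavaScript):

```rust
// Stand-in for spider's bundled fingerprint-spoofing JavaScript.
const FP_JS: &str = "/* fp spoof */";

/// Pick the script to register for new-document evaluation, mirroring the
/// match arms this commit adds: the fingerprint script is prepended to the
/// user script when fingerprinting is on, and used alone when no user
/// script is configured.
fn initial_script(user_script: Option<&str>, fingerprint: bool) -> Option<String> {
    match user_script {
        Some(script) if fingerprint => Some(format!("{FP_JS}{script}")),
        Some(script) => Some(script.to_string()),
        None if fingerprint => Some(FP_JS.to_string()),
        None => None,
    }
}

fn main() {
    // Fingerprinting on: FP_JS runs before the user script, in one injection.
    assert_eq!(
        initial_script(Some("x()"), true).unwrap(),
        "/* fp spoof */x()"
    );
    // No user script: only FP_JS (or nothing at all) is injected.
    assert_eq!(initial_script(None, true).unwrap(), FP_JS);
    assert_eq!(initial_script(None, false), None);
    println!("ok");
}
```

Registering the combined script once avoids a window where the page's own scripts could run between the two separate injections.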
6 changes: 3 additions & 3 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
-version = "1.99.14"
+version = "1.99.16"
authors = [
"j-mendez <jeff@a11ywatch.com>"
]
24 changes: 12 additions & 12 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.toml`

```toml
[dependencies]
-spider = "1.99.14"
+spider = "1.99.16"
```

And then the code:
@@ -93,7 +93,7 @@ We have the following optional feature flags.

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["regex", "ua_generator"] }
+spider = { version = "1.99.16", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -138,7 +138,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["decentralized"] }
+spider = { version = "1.99.16", features = ["decentralized"] }
```

```sh
@@ -169,7 +169,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["sync"] }
+spider = { version = "1.99.16", features = ["sync"] }
```

```rust,no_run
@@ -200,7 +200,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["regex"] }
+spider = { version = "1.99.16", features = ["regex"] }
```

```rust,no_run
@@ -227,7 +227,7 @@ If you are performing large workloads you may need to control the crawler by enabling

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["control"] }
+spider = { version = "1.99.16", features = ["control"] }
```

```rust
@@ -297,7 +297,7 @@ Use cron jobs to run crawls continuously at anytime.

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["sync", "cron"] }
+spider = { version = "1.99.16", features = ["sync", "cron"] }
```

```rust,no_run
@@ -336,7 +336,7 @@ the feature flag [`chrome_intercept`] to possibly speed up request using Network

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["chrome", "chrome_intercept"] }
+spider = { version = "1.99.16", features = ["chrome", "chrome_intercept"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed to debug.
@@ -366,7 +366,7 @@ Enabling HTTP cache can be done with the feature flag [`cache`] or [`cache_mem`]

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["cache"] }
+spider = { version = "1.99.16", features = ["cache"] }
```

You need to set `website.cache` to true to enable as well.
@@ -397,7 +397,7 @@ Intelligently run crawls using HTTP and JavaScript Rendering when needed. The be

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["smart"] }
+spider = { version = "1.99.16", features = ["smart"] }
```

```rust,no_run
@@ -423,7 +423,7 @@ Use OpenAI to generate dynamic scripts to drive the browser done with the feature

```toml
[dependencies]
-spider = { version = "1.99.14", features = ["openai"] }
+spider = { version = "1.99.16", features = ["openai"] }
```

```rust
@@ -449,7 +449,7 @@ Set a depth limit to prevent forwarding.

```toml
[dependencies]
-spider = { version = "1.99.14" }
+spider = { version = "1.99.16" }
```

```rust,no_run
31 changes: 18 additions & 13 deletions spider/src/page.rs
@@ -1019,20 +1019,25 @@ impl Page {
                     .evaluate_on_new_document
                 {
                     Some(ref script) => {
-                        let _ = new_page
-                            .evaluate_on_new_document(
-                                script.as_str(),
-                            )
-                            .await;
+                        if configuration.fingerprint {
+                            let _ = new_page
+                                .evaluate_on_new_document(string_concat!(
+                                    crate::features::chrome::FP_JS,
+                                    script.as_str()
+                                ))
+                                .await;
+                        } else {
+                            let _ =
+                                new_page.evaluate_on_new_document(script.as_str()).await;
+                        }
                     }
-                    _ => (),
+                    _ => {
+                        if configuration.fingerprint {
+                            let _ = new_page
+                                .evaluate_on_new_document(crate::features::chrome::FP_JS)
+                                .await;
+                        }
+                    }
                 }
-                if configuration.fingerprint {
-                    let _ = new_page
-                        .evaluate_on_new_document(
-                            crate::features::chrome::FP_JS,
-                        )
-                        .await;
-                }

let new_page =
99 changes: 59 additions & 40 deletions spider/src/website.rs
@@ -1452,10 +1452,26 @@ impl Website {
             }
         }

-        if self.configuration.fingerprint {
-            let _ = chrome_page
-                .evaluate_on_new_document(crate::features::chrome::FP_JS)
-                .await;
+        match self.configuration.evaluate_on_new_document {
+            Some(ref script) => {
+                if self.configuration.fingerprint {
+                    let _ = chrome_page
+                        .evaluate_on_new_document(string_concat!(
+                            crate::features::chrome::FP_JS,
+                            script.as_str()
+                        ))
+                        .await;
+                } else {
+                    let _ = chrome_page.evaluate_on_new_document(script.as_str()).await;
+                }
+            }
+            _ => {
+                if self.configuration.fingerprint {
+                    let _ = chrome_page
+                        .evaluate_on_new_document(crate::features::chrome::FP_JS)
+                        .await;
+                }
+            }
+        }

let _ = self.setup_chrome_interception(&chrome_page).await;
@@ -2323,19 +2339,6 @@ impl Website {
_ => None,
};

-        match self.configuration.evaluate_on_new_document {
-            Some(ref script) => {
-                let _ = new_page.evaluate_on_new_document(script.as_str()).await;
-            }
-            _ => (),
-        }
-
-        if self.configuration.fingerprint {
-            let _ = new_page
-                .evaluate_on_new_document(crate::features::chrome::FP_JS)
-                .await;
-        }
-
if match self.configuration.budget {
Some(ref b) => match b.get(&*WILD_CARD_PATH) {
Some(b) => b.eq(&1),
@@ -3315,16 +3318,28 @@ impl Website {
                             Ok(new_page) => {
                                 match self.configuration.evaluate_on_new_document {
                                     Some(ref script) => {
-                                        let _ = new_page
-                                            .evaluate_on_new_document(script.as_str())
-                                            .await;
+                                        if self.configuration.fingerprint {
+                                            let _ = new_page
+                                                .evaluate_on_new_document(string_concat!(
+                                                    crate::features::chrome::FP_JS,
+                                                    script.as_str()
+                                                ))
+                                                .await;
+                                        } else {
+                                            let _ = new_page
+                                                .evaluate_on_new_document(script.as_str())
+                                                .await;
+                                        }
                                     }
-                                    _ => (),
+                                    _ => {
+                                        if self.configuration.fingerprint {
+                                            let _ = new_page
+                                                .evaluate_on_new_document(
+                                                    crate::features::chrome::FP_JS,
+                                                )
+                                                .await;
+                                        }
+                                    }
                                 }
-                                if self.configuration.fingerprint {
-                                    let _ = new_page
-                                        .evaluate_on_new_document(crate::features::chrome::FP_JS)
-                                        .await;
-                                }

let mut q = match &self.channel_queue {
@@ -3414,21 +3429,25 @@ impl Website {
                                     Ok(new_page) => {
                                         match shared.5.evaluate_on_new_document {
                                             Some(ref script) => {
-                                                let _ = new_page
-                                                    .evaluate_on_new_document(
-                                                        script.as_str(),
-                                                    )
-                                                    .await;
+                                                if shared.5.fingerprint {
+                                                    let _ = new_page
+                                                        .evaluate_on_new_document(string_concat!(
+                                                            crate::features::chrome::FP_JS,
+                                                            script.as_str()
+                                                        ))
+                                                        .await;
+                                                } else {
+                                                    let _ =
+                                                        new_page.evaluate_on_new_document(script.as_str()).await;
+                                                }
                                             }
-                                            _ => (),
+                                            _ => {
+                                                if shared.5.fingerprint {
+                                                    let _ = new_page
+                                                        .evaluate_on_new_document(crate::features::chrome::FP_JS)
+                                                        .await;
+                                                }
+                                            }
                                         }
-
-                                        if shared.5.fingerprint {
-                                            let _ = new_page
-                                                .evaluate_on_new_document(
-                                                    crate::features::chrome::FP_JS,
-                                                )
-                                                .await;
-                                        }

let new_page =
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
-version = "1.99.14"
+version = "1.99.16"
authors = [
"j-mendez <jeff@a11ywatch.com>"
]
@@ -28,7 +28,7 @@ quote = "1"
failure_derive = "0.1.8"

[dependencies.spider]
-version = "1.99.14"
+version = "1.99.16"
path = "../spider"

[[bin]]
2 changes: 1 addition & 1 deletion spider_utils/Cargo.toml
@@ -17,7 +17,7 @@ edition = "2018"
indexmap = { version = "1", optional = true }

[dependencies.spider]
-version = "1.99.14"
+version = "1.99.16"
path = "../spider"

[features]
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
-version = "1.99.14"
+version = "1.99.16"
authors = [
"j-mendez <jeff@a11ywatch.com>"
]
@@ -24,7 +24,7 @@ lazy_static = "1.4.0"
env_logger = "0.11.3"

[dependencies.spider]
-version = "1.99.14"
+version = "1.99.16"
path = "../spider"
features = ["serde", "flexbuffers"]

