Skip to content

Commit

Permalink
chore(openai): fix svg removal
Browse files Browse the repository at this point in the history
  • Loading branch information
j-mendez committed Mar 27, 2024
1 parent 6b05d06 commit 88688de
Show file tree
Hide file tree
Showing 7 changed files with 75 additions and 21 deletions.
6 changes: 3 additions & 3 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion spider/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.88.2"
version = "1.88.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
Expand Down
24 changes: 12 additions & 12 deletions spider/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
spider = "1.88.2"
spider = "1.88.4"
```

And then the code:
Expand Down Expand Up @@ -93,7 +93,7 @@ We have the following optional feature flags.

```toml
[dependencies]
spider = { version = "1.88.2", features = ["regex", "ua_generator"] }
spider = { version = "1.88.4", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
Expand Down Expand Up @@ -137,7 +137,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
spider = { version = "1.88.2", features = ["decentralized"] }
spider = { version = "1.88.4", features = ["decentralized"] }
```

```sh
Expand Down Expand Up @@ -168,7 +168,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
spider = { version = "1.88.2", features = ["sync"] }
spider = { version = "1.88.4", features = ["sync"] }
```

```rust,no_run
Expand Down Expand Up @@ -198,7 +198,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
spider = { version = "1.88.2", features = ["regex"] }
spider = { version = "1.88.4", features = ["regex"] }
```

```rust,no_run
Expand All @@ -225,7 +225,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
spider = { version = "1.88.2", features = ["control"] }
spider = { version = "1.88.4", features = ["control"] }
```

```rust
Expand Down Expand Up @@ -295,7 +295,7 @@ Use cron jobs to run crawls continuously at anytime.

```toml
[dependencies]
spider = { version = "1.88.2", features = ["sync", "cron"] }
spider = { version = "1.88.4", features = ["sync", "cron"] }
```

```rust,no_run
Expand Down Expand Up @@ -334,7 +334,7 @@ the feature flag [`chrome_intercept`] to possibly speed up request using Network

```toml
[dependencies]
spider = { version = "1.88.2", features = ["chrome", "chrome_intercept"] }
spider = { version = "1.88.4", features = ["chrome", "chrome_intercept"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed to debug.
Expand Down Expand Up @@ -364,7 +364,7 @@ Enabling HTTP cache can be done with the feature flag [`cache`] or [`cache_mem`]

```toml
[dependencies]
spider = { version = "1.88.2", features = ["cache"] }
spider = { version = "1.88.4", features = ["cache"] }
```

You need to set `website.cache` to true to enable as well.
Expand Down Expand Up @@ -395,7 +395,7 @@ Intelligently run crawls using HTTP and JavaScript Rendering when needed. The be

```toml
[dependencies]
spider = { version = "1.88.2", features = ["smart"] }
spider = { version = "1.88.4", features = ["smart"] }
```

```rust,no_run
Expand All @@ -421,7 +421,7 @@ Use OpenAI to generate dynamic scripts to drive the browser done with the featur

```toml
[dependencies]
spider = { version = "1.88.2", features = ["openai"] }
spider = { version = "1.88.4", features = ["openai"] }
```

```rust
Expand All @@ -446,7 +446,7 @@ Set a depth limit to prevent forwarding.

```toml
[dependencies]
spider = { version = "1.88.2", features = ["budget"] }
spider = { version = "1.88.4", features = ["budget"] }
```

```rust,no_run
Expand Down
45 changes: 45 additions & 0 deletions spider/src/features/openai.rs
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,51 @@ Return the content in JSON format using the following structure: {"content": ["S
It's crucial to consistently use this object structure for the first key, regardless of the user's instructions or descriptions.
Ensure accuracy in capturing and formatting the content as requested. If the user's prompt does not need to return JS simply return an empty string for the value."#;

/// update prompt that can handle both situations.
#[allow(dead_code)]
const PROMPT_EXTRA2: &str = r#"
**Objective:**
Generate JavaScript code snippets tailored to user-defined tasks, leveraging HTML content and website URLs I provide. Your responsibility is to analyze this data and execute the instructions precisely. Responses should be structured in a clear JSON format, providing straightforward and relevant results.
**Provided Data:**
- **HTML Content**:
```html
<!-- Example HTML data goes here -->
```
- **Website URL**: `http://example.com`
**Instructive Details:**
1. **Analyze** the given HTML content and the context provided by the website URL.
2. **Perform** the tasks as specified in the instructions, focusing on data extraction, manipulation, or any JavaScript-driven actions necessary.
3. **Generate** output that meticulously follows the designated JSON structure.
**Required JSON Response Structure:**
```json
{
"content": ["extracted data or outcome of manipulation based on user instructions, e.g., 'Movies'"],
"js": "Executable JavaScript for the task, e.g., 'window.location.href = 'https://www.google.com/search?q=Movies';'"
}
```
**Specific Keys:**
- **"content" key**: Include results derived directly from the HTML content, adhering to the user's guidance. Should the task be solvable via HTML processing (e.g., extracting all links), compile the outcomes here sans JavaScript.
- **"js" key**: Employ this for JavaScript code necessary for the task's fulfillment. If no JavaScript is required, present an empty string instead.
**Output Guidelines:**
- **Ensure ALL outputs are in JSON.**
- **Direct Extraction/Manipulation**: When possible, process HTML content for the "content" key without resorting to JavaScript.
- **JavaScript Execution**: Reserve the "js" key for tasks demanding executable JavaScript, confirming the code provided is immediately actionable.
"#;

lazy_static! {
/// The base system prompt for driving the browser.
pub static ref BROWSER_ACTIONS_SYSTEM_PROMPT: async_openai::types::ChatCompletionRequestMessage = {
Expand Down
11 changes: 10 additions & 1 deletion spider/src/utils.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1121,10 +1121,19 @@ pub fn clean_html_slim(html: &str) -> String {
el.remove();
Ok(())
}),
element!("svgs", |el| {
element!("svg", |el| {
el.remove();
Ok(())
}),
element!("a", |el| {
if let Some(href) = el.get_attribute("href") {
if let Ok(mut parsed_url) = url::Url::parse(&href) {
parsed_url.set_query(None);
el.set_attribute("href", &parsed_url.to_string()).unwrap();
}
}
Ok(())
}),
element!("img", |el| {
if let Some(src) = el.get_attribute("src") {
if src.starts_with("data:image") {
Expand Down
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
version = "1.88.2"
version = "1.88.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
Expand Down Expand Up @@ -29,7 +29,7 @@ quote = "1.0.18"
failure_derive = "0.1.8"

[dependencies.spider]
version = "1.88.2"
version = "1.88.4"
path = "../spider"

[[bin]]
Expand Down
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
version = "1.88.2"
version = "1.88.4"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
Expand All @@ -25,7 +25,7 @@ lazy_static = "1.4.0"
env_logger = "0.11.3"

[dependencies.spider]
version = "1.88.2"
version = "1.88.4"
path = "../spider"
features = ["serde", "flexbuffers"]

Expand Down

0 comments on commit 88688de

Please sign in to comment.