
Commit

chore(openai): add detailed gpt results output
j-mendez committed Apr 10, 2024
1 parent 28db8f9 commit 19ed6a7
Showing 9 changed files with 52 additions and 36 deletions.
8 changes: 4 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.83.12"
version = "1.90.0"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
2 changes: 1 addition & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.90.0"
version = "1.91.1"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
24 changes: 12 additions & 12 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.toml`

```toml
[dependencies]
spider = "1.90.0"
spider = "1.91.1"
```

And then the code:
@@ -93,7 +93,7 @@ We have the following optional feature flags.

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["regex", "ua_generator"] }
+spider = { version = "1.91.1", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -137,7 +137,7 @@ Move processing to a worker, drastically increases performance even if worker is…

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["decentralized"] }
+spider = { version = "1.91.1", features = ["decentralized"] }
```

```sh
@@ -168,7 +168,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["sync"] }
+spider = { version = "1.91.1", features = ["sync"] }
```

```rust,no_run
@@ -198,7 +198,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["regex"] }
+spider = { version = "1.91.1", features = ["regex"] }
```

```rust,no_run
@@ -225,7 +225,7 @@ If you are performing large workloads you may need to control the crawler by enabling…

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["control"] }
+spider = { version = "1.91.1", features = ["control"] }
```

```rust
@@ -295,7 +295,7 @@ Use cron jobs to run crawls continuously at any time.

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["sync", "cron"] }
+spider = { version = "1.91.1", features = ["sync", "cron"] }
```

```rust,no_run
@@ -334,7 +334,7 @@ the feature flag [`chrome_intercept`] to possibly speed up requests using Network…

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["chrome", "chrome_intercept"] }
+spider = { version = "1.91.1", features = ["chrome", "chrome_intercept"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without Chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage for debugging.
@@ -364,7 +364,7 @@ Enabling HTTP cache can be done with the feature flag [`cache`] or [`cache_mem`]

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["cache"] }
+spider = { version = "1.91.1", features = ["cache"] }
```

You need to set `website.cache` to true to enable as well.
@@ -395,7 +395,7 @@ Intelligently run crawls using HTTP and JavaScript Rendering when needed. The be…

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["smart"] }
+spider = { version = "1.91.1", features = ["smart"] }
```

```rust,no_run
@@ -421,7 +421,7 @@ Use OpenAI to generate dynamic scripts to drive the browser done with the feature…

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["openai"] }
+spider = { version = "1.91.1", features = ["openai"] }
```

```rust
@@ -447,7 +447,7 @@ Set a depth limit to prevent forwarding.

```toml
[dependencies]
-spider = { version = "1.90.0", features = ["budget"] }
+spider = { version = "1.91.1", features = ["budget"] }
```

```rust,no_run
16 changes: 14 additions & 2 deletions spider/src/page.rs
@@ -63,6 +63,18 @@ lazy_static! {
};
}

+/// The AI data returned from a GPT.
+#[derive(Debug, Clone, Default)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+pub struct AIResults {
+    /// The prompt used for the GPT.
+    pub input: String,
+    /// The js output of the GPT response.
+    pub js_output: String,
+    /// The content output returned from the GPT response that is not a browser script, example: extracted data from the markup.
+    pub content_output: Vec<String>,
+}

/// Represent a page visited. This page contains HTML scraped with [scraper](https://crates.io/crates/scraper).
#[derive(Debug, Clone)]
#[cfg(not(feature = "decentralized"))]
@@ -98,7 +110,7 @@ pub struct Page {
    pub openai_credits_used: Option<Vec<crate::utils::OpenAIUsage>>,
    #[cfg(feature = "openai")]
    /// The extra data from the AI, example extracting data etc...
-    pub extra_ai_data: Option<Vec<String>>,
+    pub extra_ai_data: Option<Vec<AIResults>>,
}

/// Represent a page visited. This page contains HTML scraped with [scraper](https://crates.io/crates/scraper).
@@ -128,7 +140,7 @@ pub struct Page {
    pub openai_credits_used: Option<Vec<crate::utils::OpenAIUsage>>,
    #[cfg(feature = "openai")]
    /// The extra data from the AI, example extracting data etc...
-    pub extra_ai_data: Option<Vec<String>>,
+    pub extra_ai_data: Option<Vec<AIResults>>,
}

lazy_static! {
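Note: with this commit, `Page::extra_ai_data` carries structured `AIResults` values instead of bare strings. Below is a minimal sketch of consuming the new shape, assuming a `Page` obtained from a crawl with the `openai` feature enabled; the helper function itself is hypothetical, not part of the crate.

```rust
// Hypothetical helper: walks the structured AI results attached to a page.
// The `AIResults` field names (`input`, `js_output`, `content_output`) are
// taken from the struct added in this diff.
#[cfg(feature = "openai")]
fn print_ai_results(page: &spider::page::Page) {
    if let Some(results) = &page.extra_ai_data {
        for r in results {
            println!("prompt used: {}", r.input);
            println!("generated js: {}", r.js_output);
            for item in &r.content_output {
                // Non-script output, e.g. data extracted from the markup.
                println!("extracted: {}", item);
            }
        }
    }
}
```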
24 changes: 14 additions & 10 deletions spider/src/utils.rs
@@ -71,7 +71,7 @@ pub struct PageResponse {
    pub openai_credits_used: Option<Vec<OpenAIUsage>>,
    #[cfg(feature = "openai")]
    /// The extra data from the AI, example extracting data etc...
-    pub extra_ai_data: Option<Vec<String>>,
+    pub extra_ai_data: Option<Vec<crate::page::AIResults>>,
}

/// wait for event with timeout
@@ -228,24 +228,27 @@ pub fn handle_openai_credits(_page_response: &mut PageResponse, _tokens_used: Op…

/// Handle extra OpenAI data used. This does nothing without 'openai' feature flag.
#[cfg(feature = "openai")]
-pub fn handle_extra_ai_data(page_response: &mut PageResponse, js: &str) -> String {
+pub fn handle_extra_ai_data(page_response: &mut PageResponse, prompt: &str, js: &str) {
    match serde_json::from_str::<JsonResponse>(&js) {
        Ok(x) => {
+            let ai_response = crate::page::AIResults {
+                input: prompt.into(),
+                js_output: x.js,
+                content_output: x.content,
+            };
+
            match page_response.extra_ai_data.as_mut() {
-                Some(v) => v.extend(x.content),
-                None => page_response.extra_ai_data = Some(x.content),
+                Some(v) => v.push(ai_response),
+                None => page_response.extra_ai_data = Some(Vec::from([ai_response])),
            };
-            x.js
        }
-        _ => Default::default(),
+        _ => (),
    }
}

#[cfg(not(feature = "openai"))]
/// Handle extra OpenAI data used. This does nothing without 'openai' feature flag.
-pub fn handle_extra_ai_data(_page_response: &mut PageResponse, _js: &str) -> String {
-    Default::default()
-}
+pub fn handle_extra_ai_data(_page_response: &mut PageResponse, _prompt: &str, _js: &str) {}

#[cfg(feature = "chrome")]
/// Perform a network request to a resource extracting all content as text streaming via chrome.
@@ -349,7 +352,8 @@ pub async fn fetch_page_html_chrome_base(
        };

        let js_script = if gpt_configs.extra_ai_data {
-            handle_extra_ai_data(&mut page_response, &js_script)
+            handle_extra_ai_data(&mut page_response, &prompt, &js_script);
+            js_script
        } else {
            js_script
        };
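Note: `handle_extra_ai_data` still deserializes the raw GPT reply through the crate's `JsonResponse` before recording it. Judging from the `x.js` and `x.content` accesses above, the expected payload looks roughly like the sketch below; the struct here is a stand-in, since the crate's actual `JsonResponse` definition is not shown in this diff.

```rust
// Stand-in mirroring the shape `handle_extra_ai_data` deserializes; field
// names are inferred from the `x.js` / `x.content` accesses in the diff above.
use serde::Deserialize;

#[derive(Deserialize)]
struct JsonResponse {
    /// Browser script generated by the GPT.
    js: String,
    /// Non-script output, e.g. content extracted from the page.
    content: Vec<String>,
}

fn main() {
    let raw = r#"{"js":"window.scrollTo(0, document.body.scrollHeight);","content":["Example Domain"]}"#;
    let parsed: JsonResponse =
        serde_json::from_str(raw).expect("well-formed GPT payload");
    println!("js to run: {}", parsed.js);
    println!("extracted: {:?}", parsed.content);
}
```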
4 changes: 2 additions & 2 deletions spider/src/website.rs
@@ -4410,7 +4410,7 @@ impl Website {
/// use spider::tokio;
/// use spider::website::Website;
/// #[tokio::main]
///
///
/// async fn main() {
/// let mut website: Website = Website::new("http://example.com");
/// let mut rx2 = website.subscribe(18).unwrap();
@@ -4450,7 +4450,7 @@ impl Website {
/// ```
/// use spider::tokio;
/// use spider::website::Website;
///
///
/// #[tokio::main]
/// async fn main() {
/// let mut website: Website = Website::new("http://example.com");
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
version = "1.90.0"
version = "1.91.1"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
@@ -29,7 +29,7 @@ quote = "1.0.18"
failure_derive = "0.1.8"

[dependencies.spider]
version = "1.90.0"
version = "1.91.1"
path = "../spider"

[[bin]]
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
version = "1.90.0"
version = "1.91.1"
authors = [
"madeindjs <contact@rousseau-alexandre.fr>",
"j-mendez <jeff@a11ywatch.com>",
@@ -25,7 +25,7 @@ lazy_static = "1.4.0"
env_logger = "0.11.3"

[dependencies.spider]
version = "1.90.0"
version = "1.91.1"
path = "../spider"
features = ["serde", "flexbuffers"]


